100 Language Processing Knock: UNIX Commands Learned in Chapter 2

Introduction

I'm working through the 100 Language Processing Knock problems at a study group made up mainly of in-house members. This article summarizes my answer code and the tricks I found useful along the way. Most of the content has been researched and verified by me, but some of it is information shared by other study group members.

This time I summarize the basics of UNIX commands. [@moriwo](https://qiita.com/moriwo/items/9d2a73a75f543e2ea6af) and [@segavvy](https://qiita.com/segavvy/items/fb50ba8097d59475f760) have already written fairly detailed explanations, so I will keep the explanations in this article brief. If you have any questions after reading, I recommend looking at their articles via the links above.

Series

- UNIX commands learned in Chapter 2 of 100 Language Processing Knock (this article)
- Regular expressions learned in Chapter 3 of 100 Language Processing Knock
- Morphological analysis learned in Chapter 4 of 100 Language Processing Knock

Environment

Code

10. Counting the number of lines

Python


def count_lines():
    with open('hightemp.txt') as file:
        return len(file.readlines())

count_lines()

Result (Python)


24

UNIX


!wc -l hightemp.txt

Result (UNIX)


      24 hightemp.txt

The UNIX command is overwhelmingly more concise. Incidentally, the ! in front of wc is needed when running UNIX commands in JupyterLab or Jupyter Notebook (in some cases the command also works without the !).
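As a side note on the Python version, readlines() builds a list of every line just to count them. The following is a minimal sketch of a streaming alternative; the helper name count_lines_streaming and the sample text are assumptions made for illustration, not part of the original answer. It also shows one small difference from wc -l, which counts newline characters and therefore reports one fewer when the last line has no trailing newline.

```python
import io


def count_lines_streaming(lines):
    """Count lines by iterating, without building a list of all lines.

    `lines` can be an open file object or any iterable of lines.
    """
    return sum(1 for _ in lines)


# Illustrative in-memory sample; with the real data you would pass
# open('hightemp.txt') instead.
sample = io.StringIO('a\tb\nc\td\ne\tf\n')
print(count_lines_streaming(sample))  # 3

# A final line without a trailing newline is still counted here,
# whereas `wc -l` would report 1 for this input.
print(count_lines_streaming(io.StringIO('first\nlast')))  # 2
```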

11. Replace tabs with spaces

Python


def replace_tabs():
    with open('hightemp.txt') as file:
        return file.read().replace('\t', ' ')
    
print(replace_tabs())

UNIX


!cat hightemp.txt | sed $'s/\t/ /g'

Result (common to Python and UNIX)


Kochi Prefecture Ekawasaki 41 2013-08-12
Saitama Prefecture Kumagaya 40.9 2007-08-16
Gifu Prefecture Tajimi 40.9 2007-08-16
Yamagata Prefecture Yamagata 40.8 1933-07-25
Yamanashi Prefecture Kofu 40.7 2013-08-10
...

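One thing worth noting about sed: in my environment, \t is not recognized as a tab character unless the $ symbol is added in front of the quoted string. The $'...' form is the shell's ANSI-C quoting, so the shell itself converts \t into a real tab before sed sees it. As far as I know, GNU sed also understands \t on its own, while the BSD sed that ships with macOS does not.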
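The same question of who interprets the backslash escape also exists in Python, which may make the sed behavior easier to picture. The sketch below is only an analogy, with variable names chosen for illustration: a normal string literal plays the role of $'...' (the escape is converted before the text reaches the function), while a raw string passes the two characters through and relies on the regex engine to interpret them.

```python
import re

# In a normal string literal, Python itself turns \t into one tab character.
tab = '\t'
# In a raw string, the backslash and the letter t remain two characters.
raw = r'\t'

print(len(tab), len(raw))  # 1 2

line = 'Kochi Prefecture\tEkawasaki\t41\t2013-08-12'

# str.replace does no escape processing, so it needs the real tab character.
print(line.replace(tab, ' '))

# re.sub accepts the two-character form because the regex engine
# interprets \t as a tab on its own.
print(re.sub(raw, ' ', line))
```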

12. Save the first column in col1.txt and the second column in col2.txt

Python


import pandas as pd

def separate_columns():
    df = pd.read_csv('hightemp.txt', sep='\t', header=None)
    df.iloc[:,0].to_csv('col1.txt', header=False, index=False)
    df.iloc[:,1].to_csv('col2.txt', header=False, index=False)

separate_columns()

UNIX


!cut -f 1 hightemp.txt > col1_unix.txt
!cut -f 2 hightemp.txt > col2_unix.txt

If you check the result with !head col1.txt col2.txt, the output is as follows. The same applies to !head col1_unix.txt col2_unix.txt.

Result (Python)


==> col1.txt <==
Kochi Prefecture
Saitama Prefecture
...
==> col2.txt <==
Ekawasaki
Kumagaya
...
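If pandas is not available, the same column extraction can be done with the standard library. This is a minimal sketch under that assumption; the function name split_columns and the two sample rows are illustrative, and writing the lists out to col1.txt and col2.txt is left out to keep it short.

```python
import csv
import io


def split_columns(tsv_file):
    """Return (first_column, second_column) from a tab-separated file object."""
    col1, col2 = [], []
    for row in csv.reader(tsv_file, delimiter='\t'):
        col1.append(row[0])
        col2.append(row[1])
    return col1, col2


# Illustrative sample in the same layout as hightemp.txt.
sample = io.StringIO(
    'Kochi Prefecture\tEkawasaki\t41\t2013-08-12\n'
    'Saitama Prefecture\tKumagaya\t40.9\t2007-08-16\n'
)
col1, col2 = split_columns(sample)
print(col1)
print(col2)
```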

13. Merge col1.txt and col2.txt

Python


def merge_columns():
    with open('col1.txt') as col1_file, open('col2.txt') as col2_file, \
         open('merge.txt', mode='w') as new_file:
        
        for col1_line, col2_line in zip(col1_file, col2_file):
            new_file.write(f'{col1_line.rstrip()}\t{col2_line.rstrip()}\n')

merge_columns()

UNIX


!paste col[1-2].txt > merge_unix.txt

Result (common to Python and UNIX)


Kochi Prefecture Ekawasaki
Saitama Prefecture Kumagaya
Gifu Prefecture Tajimi
Yamagata Prefecture Yamagata
...

To check the result, run !head merge.txt or !head merge_unix.txt and you should get the output above.
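One difference between the two versions is worth keeping in mind: zip stops at the shorter input, while paste keeps going and leaves the missing field empty. The two column files here have the same length, so the results match, but a sketch that mimics the paste behavior could look like this. The function name merge_lines is an assumption for illustration.

```python
from itertools import zip_longest


def merge_lines(left_lines, right_lines):
    """Join two sequences of lines with a tab, padding the shorter one with ''."""
    merged = []
    for left, right in zip_longest(left_lines, right_lines, fillvalue=''):
        left = left.rstrip('\n')
        right = right.rstrip('\n')
        merged.append(f'{left}\t{right}')
    return merged


# Illustrative input: the second column is one line shorter.
print(merge_lines(['Kochi Prefecture\n', 'Saitama Prefecture\n'], ['Ekawasaki\n']))
```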

14. Output N lines from the beginning

Python


def show_head():
    n = int(input())

    with open('hightemp.txt') as file:
        for line in file.readlines()[:n]:
            print(line.rstrip())
    
show_head()

UNIX


!head -3 hightemp.txt

Result (common to Python and UNIX)


Kochi Prefecture Ekawasaki 41 2013-08-12
Saitama Prefecture Kumagaya 40.9	2007-08-16
Gifu Prefecture Tajimi 40.9	2007-08-16

For Python, if you want to return a list from a function, you might write something like this:

Python


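def show_head():
    n = int(input())  # number of lines to return

    with open('hightemp.txt') as file:
        # Slicing already yields a list, so no comprehension is needed.
        return file.readlines()[:n]

# Each line keeps its trailing newline, so print without a separator.
print(*show_head(), sep='')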

In UNIX, on the other hand, it is harder to read an integer interactively the way the Python version does, but you can specify how many lines to display by writing the number after - (head -n 3 is the equivalent POSIX form). As an applied example, you can append head to the end of the answer to problem 11:

UNIX


!cat hightemp.txt | sed $'s/\t/ /g' | head -5

Written this way, only the first 5 lines of the converted output are displayed.
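On the Python side, readlines() reads the whole file before the slice is taken. A sketch that stops reading after N lines, closer to how head behaves, might use itertools.islice. The function name head and the sample text are assumptions for illustration.

```python
import io
from itertools import islice


def head(lines, n):
    """Return the first n lines (without trailing newlines), reading no further."""
    return [line.rstrip('\n') for line in islice(lines, n)]


# Illustrative in-memory sample; pass open('hightemp.txt') for the real data.
sample = io.StringIO('line1\nline2\nline3\nline4\n')
print(head(sample, 2))  # ['line1', 'line2']
```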

15. Output the last N lines

Python


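def show_tail():
    n = int(input())  # number of lines to return; must be positive

    with open('hightemp.txt') as file:
        # Note: with n = 0, the slice [-0:] would return every line.
        return file.readlines()[-n:]

# Each line keeps its trailing newline, so print without a separator.
print(*show_tail(), sep='')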

UNIX


!tail -3 hightemp.txt

Result (common to Python and UNIX)


Yamanashi Prefecture Otsuki 39.9	1990-07-19
Yamagata Prefecture Tsuruoka 39.9	1978-08-03
Aichi Prefecture Nagoya 39.9	1942-08-02

Almost the same as 14.
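One place where the tail case does differ from 14 is memory use and the n = 0 edge case. The following sketch, with an assumed function name tail, keeps only the last N lines in a bounded deque while reading, and returns an empty list for n = 0 instead of the whole file.

```python
import io
from collections import deque


def tail(lines, n):
    """Return the last n lines (without trailing newlines).

    A deque with maxlen=n discards older lines as new ones arrive,
    so at most n lines are held in memory at a time.
    """
    return [line.rstrip('\n') for line in deque(lines, maxlen=n)]


# Illustrative in-memory sample; pass open('hightemp.txt') for the real data.
sample = io.StringIO('line1\nline2\nline3\nline4\n')
print(tail(sample, 2))  # ['line3', 'line4']
```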

16. Divide the file into N

Python


import math

def split_file():
    n = int(input())

    with open('hightemp.txt') as file:
        lines = file.readlines()
        num = math.ceil(len(lines) / n)
        for i in range(n):
            with open('split{}.txt'.format(i + 1), mode='w') as new_file:
                text = ''.join(lines[i * num:(i + 1) * num])
                new_file.write(text)

split_file()

UNIX


!split -n 5 -d hightemp.txt split_unix

Result (Python)


==> split1.txt <==
Kochi Prefecture Ekawasaki 41 2013-08-12
Saitama Prefecture Kumagaya 40.9	2007-08-16
Gifu Prefecture Tajimi 40.9	2007-08-16
Yamagata Prefecture Yamagata 40.8	1933-07-25
Yamanashi Prefecture Kofu 40.7	2013-08-10

==> split5.txt <==
Osaka Prefecture Toyonaka 39.9	1994-08-08
Yamanashi Prefecture Otsuki 39.9	1990-07-19
Yamagata Prefecture Tsuruoka 39.9	1978-08-03
Aichi Prefecture Nagoya 39.9	1942-08-02

The above is the Python output, checked with !head split1.txt split5.txt.

The UNIX command above, on the other hand, did not work in my environment, so I tried it on Colab (Google Colaboratory). The -n option seems to be generally available on Linux (@IT), but not in the split that ships with macOS by default. Note also that, in GNU split, -n 5 divides the file by size and can cut a line in the middle; the -n l/5 form divides it into five parts without breaking lines.

When I ran the above on Colab, five files from split_unix00 to split_unix04 were created, but adding the .txt extension to them looked a little troublesome (GNU split does have an --additional-suffix option that may help here). [@moriwo's article](https://qiita.com/moriwo/items/9d2a73a75f543e2ea6af#16-%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB%E3%82%92n%E5%88%86%E5%89%B2%E3%81%99%E3%82%8B) introduces an implementation using `awk` and other tools, but I felt the Python code was easier to read.
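A caveat about the Python answer: sizing every chunk with math.ceil works for 24 lines and N = 5, but for some values of N the last file comes out empty (for example, 24 lines and N = 7 gives six chunks of 4 lines and a seventh with none). The sketch below is one possible alternative, not the original method: it spreads the remainder over the first chunks so sizes differ by at most one. The function name split_evenly is an assumption, and writing each chunk to a file is omitted.

```python
def split_evenly(lines, n):
    """Split a list of lines into n chunks whose sizes differ by at most one."""
    base, extra = divmod(len(lines), n)
    chunks = []
    start = 0
    for i in range(n):
        # The first `extra` chunks each take one additional line.
        size = base + (1 if i < extra else 0)
        chunks.append(lines[start:start + size])
        start += size
    return chunks


# Illustrative data standing in for the 24 lines of hightemp.txt.
lines = [f'line{i}\n' for i in range(24)]
print([len(chunk) for chunk in split_evenly(lines, 5)])  # [5, 5, 5, 5, 4]
print([len(chunk) for chunk in split_evenly(lines, 7)])  # [4, 4, 4, 3, 3, 3, 3]
```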

17. Distinct strings in the first column

Python


import pandas as pd

def get_chars_set():
    df = pd.read_csv('hightemp.txt', sep='\t', header=None)
    return set(df.iloc[:, 0])

print(get_chars_set())

Result (Python)


{'Chiba', 'Saitama', 'Yamagata Prefecture', 'Wakayama Prefecture', 'Shizuoka Prefecture', 'Kochi Prefecture', 'Osaka', 'Gifu Prefecture', 'Gunma Prefecture', 'Ehime Prefecture', 'Yamanashi Prefecture', 'Aichi prefecture'}

UNIX


!sort -u col1_unix.txt

Result (UNIX)


Chiba
Saitama
Osaka
Yamagata Prefecture
...

With UNIX commands you can also pipe the sorting and deduplication steps separately and write !sort col1_unix.txt | uniq.
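A Python set has no defined order, so the printed result can differ from run to run, while sort -u always prints sorted output. If a stable, comparable result is wanted on the Python side, a small sketch like the following may help. The function name unique_sorted is illustrative, and Python's default string ordering is by code point, which may not match the locale-dependent order that sort uses.

```python
def unique_sorted(values):
    """Return the distinct values in sorted order, similar in spirit to `sort -u`."""
    return sorted(set(values))


# Illustrative first-column values with duplicates.
col1 = ['Kochi Prefecture', 'Saitama Prefecture', 'Gifu Prefecture', 'Saitama Prefecture']
print(unique_sorted(col1))
# ['Gifu Prefecture', 'Kochi Prefecture', 'Saitama Prefecture']
```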

18. Sort each row in descending order of the numbers in the third column

Python


def sort_rows():
    df = pd.read_csv('hightemp.txt', sep='\t', header=None)
    df.rename(columns={0: 'Prefect', 1: 'City', 2: 'Temp', 3: 'Date'}, inplace=True)
    df.sort_values(by='Temp', inplace=True)
    return df

sort_rows()

Result (Python)


   Prefect  City  Temp        Date
23 Aichi Prefecture Nagoya 39.9  1942-08-02
21 Yamanashi Prefecture Otsuki 39.9  1990-07-19
20 Osaka Prefecture Toyonaka 39.9  1994-08-08
...

UNIX


!sort hightemp.txt -k 3

Result (UNIX)


Aichi Prefecture Nagoya 39.9	1942-08-02
Yamagata Prefecture Tsuruoka 39.9	1978-08-03
Yamanashi Prefecture Otsuki 39.9	1990-07-19
Osaka Prefecture Toyonaka 39.9	1994-08-08
...

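I wasn't sure whether "reverse order" in the problem statement means the reverse of the original order or descending order, so I went with the latter interpretation. Note, however, that both answers above sort in ascending order by default, as the results show. For descending order, pass ascending=False to sort_values in pandas, and give sort a numeric, reversed key restricted to the third field, for example sort -k 3,3 -n -r hightemp.txt. This is the problem where I felt most strongly how short UNIX commands can be.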
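Under the descending interpretation, the third column has to be compared as a number and the order reversed. A minimal standard-library sketch of that idea follows; the function name sort_by_temperature_desc and the three sample rows are assumptions for illustration. Python's sort is stable, so rows with equal temperatures keep their original relative order even with reverse=True.

```python
def sort_by_temperature_desc(lines):
    """Sort tab-separated rows by the numeric third column, highest first."""
    rows = [line.rstrip('\n').split('\t') for line in lines]
    return sorted(rows, key=lambda row: float(row[2]), reverse=True)


# Illustrative rows in the same layout as hightemp.txt, deliberately unsorted.
sample = [
    'Yamanashi Prefecture\tKofu\t40.7\t2013-08-10\n',
    'Kochi Prefecture\tEkawasaki\t41\t2013-08-12\n',
    'Saitama Prefecture\tKumagaya\t40.9\t2007-08-16\n',
]
for row in sort_by_temperature_desc(sample):
    print('\t'.join(row))
```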

19. Find the frequency of appearance of the character string in the first column of each line, and arrange them in descending order of frequency of appearance.

Python


def count_freq():
    df = pd.read_csv('hightemp.txt', sep='\t', header=None)
    return df[0].value_counts()
    
count_freq()

Result (Python)


Gunma Prefecture 3
Yamanashi Prefecture 3
Yamagata Prefecture 3
Saitama Prefecture 3

UNIX


!cut -f 1 hightemp.txt | sort | uniq -c | sort -r

Result (UNIX)


3 Gunma Prefecture
3 Yamanashi Prefecture
3 Yamagata Prefecture
3 Saitama Prefecture

The UNIX command is a little longer, but I wrote it so that the flow is easy to follow: first cut out the first column (cut -f 1), then count the occurrences (sort | uniq -c), and finally arrange the result in descending order of frequency (sort -r). Adding -n (sort -rn) would make the final comparison explicitly numeric.

That said, given how convenient value_counts() in pandas is, I think the Python version is easier to understand here.
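For environments without pandas, collections.Counter from the standard library offers something close to value_counts(). This is a sketch under that assumption; the function name count_first_column and the sample rows are illustrative. most_common() returns pairs sorted by count, highest first, and entries with equal counts stay in the order they were first seen.

```python
from collections import Counter


def count_first_column(lines):
    """Return (value, count) pairs for the first column, most frequent first."""
    counts = Counter(line.split('\t')[0] for line in lines)
    return counts.most_common()


# Illustrative rows in the same layout as hightemp.txt.
sample = [
    'Saitama Prefecture\tKumagaya\t40.9\t2007-08-16\n',
    'Kochi Prefecture\tEkawasaki\t41\t2013-08-12\n',
    'Saitama Prefecture\tKoshigaya\t40.4\t2007-08-16\n',
]
print(count_first_column(sample))
# [('Saitama Prefecture', 2), ('Kochi Prefecture', 1)]
```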

Summary

That's all for this chapter. If you find any mistakes, please leave a comment.
