"Chapter 2: Basics of UNIX Commands" of Language Processing 100 Knock 2015 It is a record of ecei.tohoku.ac.jp/nlp100/#ch2).
Chapter 2 covers CSV/TSV file operations. This is a review of what I did over a year ago. At the time, I thought "Python is fine without using UNIX commands", but when dealing with large files, UNIX commands are generally faster. **UNIX commands are worth remembering**.
This time, I'm using the Pandas package a lot for the Python part. It is really convenient for handling tabular data such as CSV.
Type | Version | Contents |
---|---|---|
OS | Ubuntu 18.04.01 LTS | Running virtually |
pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
Python | 3.6.9 | Python 3.6.9 on pyenv; there is no deep reason for not using the 3.7 or 3.8 series. Packages are managed using venv |
Experience UNIX tools that are useful for research and data analysis. Through reimplementing them, you will experience the ecosystem of existing tools while improving your programming skills.
head, tail, cut, paste, split, sort, uniq, sed, tr, expand
hightemp.txt is a file that stores records of the highest temperatures in Japan in tab-delimited format with the columns "prefecture", "point", "℃", and "day". Create a program that performs the following processing with hightemp.txt as the input file. Furthermore, execute the same processing with UNIX commands and check the program's execution result.
Count the number of lines. Use the wc command for confirmation.
In Python, reading everything at once with readlines should be the fastest (I haven't benchmarked it thoroughly).
Python part
print(len(open('./hightemp.txt').readlines()))
Terminal output result
24
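Incidentally, readlines materializes every line as a list in memory. If the file were too large for that, a streaming count avoids the cost. A minimal sketch of that alternative (my addition, not from the original post):

```python
# Count lines by streaming the file instead of building a list
with open('./hightemp.txt') as f:
    print(sum(1 for _ in f))
```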
wc is an abbreviation for Word Count. The -l option counts newline characters. **This comes in handy for large files, which take a long time just to open in a text editor**.
Bash part
wc hightemp.txt -l
Terminal output result
24 hightemp.txt
Replace each tab character with one space character. Use the sed command, tr command, or expand command for confirmation.
Replace using the replace function. I use pprint because the result is difficult to read without line breaks.
Python part
from pprint import pprint
with open('./hightemp.txt') as f:
    pprint([line.replace('\t', ' ') for line in f])
Terminal output result
['Kochi Prefecture Ekawasaki 41 2013-08-12\n',
'Saitama Prefecture Kumagaya 40.9 2007-08-16\n',
'Gifu Prefecture Tajimi 40.9 2007-08-16\n',
Omission
'Yamanashi Prefecture Otsuki 39.9 1990-07-19\n',
'Yamagata Prefecture Tsuruoka 39.9 1978-08-03\n',
'Aichi Prefecture Nagoya 39.9 1942-08-02\n']
sed can replace strings and delete lines. Executing this command only prints the result to the terminal; it does not update the contents of the file. I referred to "[sed] Replace character strings and delete lines".
Bash part
sed 's/\t/ /g' ./hightemp.txt
Terminal output result
Kochi Prefecture Ekawasaki 41 2013-08-12
Saitama Prefecture Kumagaya 40.9 2007-08-16
Gifu Prefecture Tajimi 40.9 2007-08-16
Omission
Yamanashi Prefecture Otsuki 39.9 1990-07-19
Yamagata Prefecture Tsuruoka 39.9 1978-08-03
Aichi Prefecture Nagoya 39.9 1942-08-02
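The problem statement also allows tr or expand. In Python, str.expandtabs is the rough analogue of the expand command; with a tab size of 1, each tab pads out to exactly one space. A sketch I added for comparison (not in the original post):

```python
# expandtabs(1) pads each tab to the next 1-column tab stop,
# which amounts to replacing every tab with a single space
with open('./hightemp.txt') as f:
    for line in f:
        print(line.expandtabs(1), end='')
```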
Save only the first column of each row as col1.txt, and only the second column as col2.txt. Use the cut command for confirmation.
I used Pandas. The `usecols` parameter limits the columns read to just the 1st and 2nd. It's convenient.
Python part
import pandas as pd
df = pd.read_table('./hightemp.txt', header=None, usecols=[0, 1])
df[0].to_csv('012.col1.txt',index=False, header=False)
df[1].to_csv('012.col2.txt',index=False, header=False)
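For reference, the same extraction is easy without pandas too. A plain-Python sketch (my addition), assuming the same tab-delimited layout:

```python
# Split each line on tabs and write columns 1 and 2 to separate files
with open('./hightemp.txt') as f, \
        open('012.col1.txt', 'w') as col1, \
        open('012.col2.txt', 'w') as col2:
    for line in f:
        fields = line.rstrip('\n').split('\t')
        col1.write(fields[0] + '\n')
        col2.write(fields[1] + '\n')
```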
Check the contents with cut. I referred to "[cut] command - cut out from a line in fixed length or field units".
Bash part
cut -f 1 ./hightemp.txt
cut -f 2 ./hightemp.txt
Terminal output result (1st column)
Kochi Prefecture
Saitama
Gifu Prefecture
Omission
Yamanashi Prefecture
Yamagata Prefecture
Aichi Prefecture
Terminal output result (2nd column)
Ekawasaki
Kumagaya
Tajimi
Yamagata
Omission
Otsuki
Tsuruoka
Nagoya
Combine the col1.txt and col2.txt created in 12 to create a text file in which the first and second columns of the original file are arranged tab-delimited. Use the paste command for confirmation.
I read the two files with pandas and concatenate them.
Python part
import pandas as pd
result = pd.read_csv('012.col1.txt', header=None)
result[1] = pd.read_csv('012.col2.txt', header=None)
result.to_csv('013.col1_2.txt', index=False, header=None, sep='\t')
I referred to "A detailed summary of paste commands [Linux command collection]". The output result is omitted.
Bash part
paste 012.col1.txt 012.col2.txt
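A plain-Python equivalent of paste is also short, pairing the files line by line with zip. A sketch I added (not from the original post):

```python
# Pair lines from the two column files and join each pair with a tab
with open('012.col1.txt') as col1, open('012.col2.txt') as col2, \
        open('013.col1_2.txt', 'w') as out:
    for c1, c2 in zip(col1, col2):
        out.write(c1.rstrip('\n') + '\t' + c2)
```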
Receive the natural number N by means such as a command line argument and display only the first N lines of the input. Use the head command for confirmation.
The argument is received with the `input` function.
Python part
from pprint import pprint
n = int(input('N Lines--> '))
with open('hightemp.txt') as f:
    for i, line in enumerate(f):
        if i < n:
            pprint(line)
        else:
            break
Terminal output result
'Kochi Prefecture\tEkawasaki\t41\t2013-08-12\n'
'Saitama\tKumagaya\t40.9\t2007-08-16\n'
'Gifu Prefecture\tTajimi\t40.9\t2007-08-16\n'
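Since the problem statement suggests a command line argument, here is an alternative sketch using sys.argv instead of input (my addition; the script filename is hypothetical):

```python
import sys

# Usage (hypothetical filename): python 014.head.py 3
n = int(sys.argv[1])
with open('hightemp.txt') as f:
    for i, line in enumerate(f):
        if i >= n:
            break
        print(line, end='')
```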
I referred to "Detailed summary of head command displayed from the beginning of the file [Linux command collection]".
Bash part
head hightemp.txt -n 3
Terminal output result
Kochi Prefecture Ekawasaki 41 2013-08-12
Saitama Prefecture Kumagaya 40.9 2007-08-16
Gifu Prefecture Tajimi 40.9 2007-08-16
Receive the natural number N by means such as command line arguments and display only the last N lines of the input. Use the tail command for confirmation.
This one was quite puzzling. I didn't want to read the whole file when it is large, so I considered the linecache package, but for this kind of job the Linux tail command is fine anyway. I ended up using readlines.
Python part
from pprint import pprint
n = int(input('N Lines--> '))
with open('hightemp.txt') as f:
    pprint(f.readlines()[-n:])
Terminal output result
['Yamanashi Prefecture\tOtsuki\t39.9\t1990-07-19\n',
'Yamagata Prefecture\tTsuruoka\t39.9\t1978-08-03\n',
'Aichi Prefecture\tNagoya\t39.9\t1942-08-02\n']
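To actually avoid holding the whole file in memory, collections.deque with maxlen keeps only the last N lines while streaming the file. A sketch I added (not the approach used in the original post):

```python
from collections import deque

n = 3
# deque with maxlen=n discards older lines as the file streams through,
# so only the last n lines are ever held in memory
with open('hightemp.txt') as f:
    for line in deque(f, maxlen=n):
        print(line, end='')
```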
Bash part
tail hightemp.txt -n 3
Terminal output result
Yamanashi Prefecture Otsuki 39.9 1990-07-19
Yamagata Prefecture Tsuruoka 39.9 1978-08-03
Aichi Prefecture Nagoya 39.9 1942-08-02
Receive the natural number N by means such as a command line argument, and split the input file line-wise into N files. Achieve the same processing with the split command.
The quotient is rounded up using the ceil function of the math package. The lines are written to each file all at once with the writelines function.
Python part
import math
n = int(input('N splits--> '))
with open('./hightemp.txt') as f:
    lines = f.readlines()
unit = math.ceil(len(lines) / n)
for i in range(0, n):
    with open('016.hightemp{}.txt'.format(i), 'w') as out_file:
        out_file.writelines(lines[i*unit:(i+1)*unit])
I referred to "[split] command-split files".
Bash part
split -n 3 -d hightemp.txt 016.hightemp-u
Find the kinds of strings in the first column (that is, the set of distinct strings). Use the sort and uniq commands for confirmation.
I used the `unique` function of pandas. pandas makes this kind of processing very easy.
Python part
import pandas as pd
df = pd.read_table('hightemp.txt', header=None, usecols=[0])
print(df[0].unique())
Terminal output result
['Kochi Prefecture' 'Saitama' 'Gifu Prefecture' 'Yamagata Prefecture' 'Yamanashi Prefecture' 'Wakayama Prefecture' 'Shizuoka Prefecture' 'Gunma Prefecture' 'Aichi Prefecture' 'Chiba' 'Ehime Prefecture' 'Osaka']
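Without pandas, a set comprehension over the first column gives the same distinct values (a sketch I added; note that a set is unordered):

```python
# Collect the distinct first-column values
with open('hightemp.txt') as f:
    prefectures = {line.split('\t')[0] for line in f}
print(prefectures)
```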
I referred to "Sort command summary [Linux command collection]".
Bash part
cut --fields=1 hightemp.txt | sort | uniq > result.txt
Terminal output result
Chiba
Wakayama Prefecture
Saitama
Osaka
Yamagata Prefecture
Yamanashi Prefecture
Gifu Prefecture
Ehime Prefecture
Aichi Prefecture
Gunma Prefecture
Shizuoka Prefecture
Kochi Prefecture
Sort the rows in descending order of the numbers in the third column (Note: leave the contents of each row unchanged and rearrange only the rows). Use the sort command for confirmation (for this problem, the result does not have to match the command's output exactly).
I used the `sort_values` function of pandas.
Python part
import pandas as pd
df = pd.read_table('hightemp.txt', header=None)
print(df.sort_values(2, ascending=False))
Terminal output result
0 1 2 3
0 Kochi Prefecture Ekawasaki 41.0 2013-08-12
2 Gifu Prefecture Tajimi 40.9 2007-08-16
1 Saitama Prefecture Kumagaya 40.9 2007-08-16
Omission
21 Yamanashi Prefecture Otsuki 39.9 1990-07-19
22 Yamagata Prefecture Tsuruoka 39.9 1978-08-03
23 Aichi Prefecture Nagoya 39.9 1942-08-02
Bash part
sort hightemp.txt -k 3 -n -r
Terminal output result
Kochi Prefecture Ekawasaki 41 2013-08-12
Gifu Prefecture Tajimi 40.9 2007-08-16
Saitama Prefecture Kumagaya 40.9 2007-08-16
Omission
Osaka Toyonaka 39.9 1994-08-08
Saitama Prefecture Hatoyama 39.9 1997-07-05
Chiba Mobara 39.9 2013-08-11
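A plain-Python version with sorted and a key function would look roughly like this (my addition, assuming the third column always parses as a number):

```python
# Sort rows in descending order by the numeric value in the third column
with open('hightemp.txt') as f:
    lines = f.readlines()
for line in sorted(lines, key=lambda l: float(l.split('\t')[2]), reverse=True):
    print(line, end='')
```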
Find the frequency of occurrence of each string in the first column, and display the strings in descending order of frequency. Use the cut, uniq, and sort commands for confirmation.
I used the `value_counts` function of pandas.
Python part
import pandas as pd
df = pd.read_table('hightemp.txt', header=None, usecols=[0])
print(df[0].value_counts(ascending=False))
Terminal output result
Saitama Prefecture 3
Yamanashi Prefecture 3
Yamagata Prefecture 3
Omission
Ehime Prefecture 1
Kochi Prefecture 1
Osaka 1
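Without pandas, collections.Counter does the same job (a sketch I added):

```python
from collections import Counter

# Count first-column occurrences and print them most-common first
with open('hightemp.txt') as f:
    counts = Counter(line.split('\t')[0] for line in f)
for name, count in counts.most_common():
    print(name, count)
```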
Bash part
cut -f 1 hightemp.txt | sort | uniq -c | sort -r
Terminal output result
3 Gunma Prefecture
3 Yamanashi Prefecture
3 Yamagata Prefecture
Omission
1 Ehime Prefecture
1 Osaka
1 Wakayama Prefecture