"Chapter 2: Basics of UNIX Commands" of Language Processing 100 Knock 2015 It is a record of ecei.tohoku.ac.jp/nlp100/#ch2).
Chapter 2 covers CSV/TSV file operations. This is a review of what I did over a year ago. At the time, I thought "Python is fine without using UNIX commands", but when dealing with large files, UNIX commands are generally faster. **UNIX commands are worth remembering**.
This time, I'm using the Pandas package a lot for the Python part. It is really convenient for handling tabular data such as CSV.
Type | Version | Contents |
---|---|---|
OS | Ubuntu 18.04.01 LTS | Running virtually |
pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
Python | 3.6.9 | Python 3.6.9 on pyenv; there is no deep reason for not using the 3.7 or 3.8 series. Packages are managed using venv |
Experience UNIX tools that are useful for research and data analysis. Through reimplementing them, you will experience the ecosystem of existing tools while improving your programming skills.
head, tail, cut, paste, split, sort, uniq, sed, tr, expand
hightemp.txt is a file that stores records of the highest temperatures in Japan in tab-delimited format with the columns "prefecture", "point", "℃", and "day". Create a program that performs the following processing with hightemp.txt as the input file. Furthermore, execute the same processing with UNIX commands and check the program's execution result.
Count the number of lines. Use the wc command for confirmation.
In Python, reading everything at once with readlines should be the fastest (I haven't benchmarked it thoroughly).
Python part
print(len(open('./hightemp.txt').readlines()))
Terminal output result
24
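Incidentally, readlines materializes every line as a list in memory. If the file were too large for that, a streaming count avoids the cost. A minimal sketch of that alternative (my addition, not from the original post):

```python
# Count lines by streaming the file instead of building a list
with open('./hightemp.txt') as f:
    print(sum(1 for _ in f))
```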
wc is an abbreviation for Word Count. The -l option counts newline characters. **This comes in handy for large files, which take a long time just to open in a text editor**.
Bash part
wc hightemp.txt -l
Terminal output result
24 hightemp.txt
Replace each tab character with one space character. Use the sed command, tr command, or expand command for confirmation.
Replace using the replace function. I use pprint because the result is difficult to read without line breaks.
Python part
from pprint import pprint
with open('./hightemp.txt') as f:
    pprint([line.replace('\t', ' ') for line in f])
Terminal output result
['Kochi Prefecture Ekawasaki 41 2013-08-12\n',
'Saitama Prefecture Kumagaya 40.9 2007-08-16\n',
'Gifu Prefecture Tajimi 40.9 2007-08-16\n',
Omission
'Yamanashi Prefecture Otsuki 39.9 1990-07-19\n',
'Yamagata Prefecture Tsuruoka 39.9 1978-08-03\n',
'Aichi Prefecture Nagoya 39.9 1942-08-02\n']
sed can replace strings and delete lines. Executing this command only prints the result to the terminal; it does not update the contents of the file. I referred to "[sed] Replace character strings and delete lines".
Bash part
sed 's/\t/ /g' ./hightemp.txt
Terminal output result
Kochi Prefecture Ekawasaki 41 2013-08-12
Saitama Prefecture Kumagaya 40.9 2007-08-16
Gifu Prefecture Tajimi 40.9 2007-08-16
Omission
Yamanashi Prefecture Otsuki 39.9 1990-07-19
Yamagata Prefecture Tsuruoka 39.9 1978-08-03
Aichi Prefecture Nagoya 39.9 1942-08-02
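The problem statement also allows tr or expand. In Python, str.expandtabs is the rough analogue of the expand command; with a tab size of 1, each tab pads out to exactly one space. A sketch I added for comparison (not in the original post):

```python
# expandtabs(1) pads each tab to the next 1-column tab stop,
# which amounts to replacing every tab with a single space
with open('./hightemp.txt') as f:
    for line in f:
        print(line.expandtabs(1), end='')
```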
Save only the first column of each row as col1.txt, and only the second column as col2.txt. Use the cut command for confirmation.
I used Pandas. The `usecols` parameter limits the columns read to just the 1st and 2nd. It's convenient.
Python part
import pandas as pd
df = pd.read_table('./hightemp.txt', header=None, usecols=[0, 1])
df[0].to_csv('012.col1.txt',index=False, header=False)
df[1].to_csv('012.col2.txt',index=False, header=False)
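For reference, the same extraction is easy without pandas too. A plain-Python sketch (my addition), assuming the same tab-delimited layout:

```python
# Split each line on tabs and write columns 1 and 2 to separate files
with open('./hightemp.txt') as f, \
        open('012.col1.txt', 'w') as col1, \
        open('012.col2.txt', 'w') as col2:
    for line in f:
        fields = line.rstrip('\n').split('\t')
        col1.write(fields[0] + '\n')
        col2.write(fields[1] + '\n')
```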
Check the contents with cut. I referred to "[cut] command - cut out from a line in fixed length or field units".
Bash part
cut -f 1 ./hightemp.txt
cut -f 2 ./hightemp.txt
Terminal output result (1st column)
Kochi Prefecture
Saitama
Gifu Prefecture
Omission
Yamanashi Prefecture
Yamagata Prefecture
Aichi Prefecture
Terminal output result (2nd column)
Ekawasaki
Kumagaya
Tajimi
Yamagata
Omission
Otsuki
Tsuruoka
Nagoya
Combine the col1.txt and col2.txt created in 12 to create a text file in which the first and second columns of the original file are arranged tab-delimited. Use the paste command for confirmation.
I read the two files with pandas and concatenate them.
Python part
import pandas as pd
result = pd.read_csv('012.col1.txt', header=None)
result[1] = pd.read_csv('012.col2.txt', header=None)
result.to_csv('013.col1_2.txt', index=False, header=None, sep='\t')
I referred to "A detailed summary of paste commands [Linux command collection]". The output result is omitted.
Bash part
paste 012.col1.txt 012.col2.txt
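A plain-Python equivalent of paste is also short, pairing the files line by line with zip. A sketch I added (not from the original post):

```python
# Pair lines from the two column files and join each pair with a tab
with open('012.col1.txt') as col1, open('012.col2.txt') as col2, \
        open('013.col1_2.txt', 'w') as out:
    for c1, c2 in zip(col1, col2):
        out.write(c1.rstrip('\n') + '\t' + c2)
```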
Receive the natural number N by means such as a command line argument and display only the first N lines of the input. Use the head command for confirmation.
The argument is received with the `input` function.
Python part
from pprint import pprint
n = int(input('N Lines--> '))
with open('hightemp.txt') as f:
    for i, line in enumerate(f):
        if i < n:
            pprint(line)
        else:
            break
Terminal output result
'Kochi Prefecture\tEkawasaki\t41\t2013-08-12\n'
'Saitama\tKumagaya\t40.9\t2007-08-16\n'
'Gifu Prefecture\tTajimi\t40.9\t2007-08-16\n'
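Since the problem statement suggests a command line argument, here is an alternative sketch using sys.argv instead of input (my addition; the script filename is hypothetical):

```python
import sys

# Usage (hypothetical filename): python 014.head.py 3
n = int(sys.argv[1])
with open('hightemp.txt') as f:
    for i, line in enumerate(f):
        if i >= n:
            break
        print(line, end='')
```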
I referred to "Detailed summary of head command displayed from the beginning of the file [Linux command collection]".
Bash part
head hightemp.txt -n 3
Terminal output result
Kochi Prefecture Ekawasaki 41 2013-08-12
Saitama Prefecture Kumagaya 40.9 2007-08-16
Gifu Prefecture Tajimi 40.9 2007-08-16
Receive the natural number N by means such as command line arguments and display only the last N lines of the input. Use the tail command for confirmation.
This one was quite puzzling. I didn't want to read the whole file when it is large, so I considered the linecache package, but for this kind of job the Linux tail command is fine anyway. I ended up using readlines.
Python part
from pprint import pprint
n = int(input('N Lines--> '))
with open('hightemp.txt') as f:
    pprint(f.readlines()[-n:])
Terminal output result
['Yamanashi Prefecture\tOtsuki\t39.9\t1990-07-19\n',
'Yamagata Prefecture\tTsuruoka\t39.9\t1978-08-03\n',
'Aichi Prefecture\tNagoya\t39.9\t1942-08-02\n']
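To actually avoid holding the whole file in memory, collections.deque with maxlen keeps only the last N lines while streaming the file. A sketch I added (not the approach used in the original post):

```python
from collections import deque

n = 3
# deque with maxlen=n discards older lines as the file streams through,
# so only the last n lines are ever held in memory
with open('hightemp.txt') as f:
    for line in deque(f, maxlen=n):
        print(line, end='')
```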
Bash part
tail hightemp.txt -n 3
Terminal output result
Yamanashi Prefecture Otsuki 39.9 1990-07-19
Yamagata Prefecture Tsuruoka 39.9 1978-08-03
Aichi Prefecture Nagoya 39.9 1942-08-02
Receive the natural number N by means such as a command line argument, and split the input file line-wise into N files. Achieve the same processing with the split command.
The quotient is rounded up using the ceil function of the math package. The lines are written to each file all at once with the writelines function.
Python part
import math
n = int(input('N splits--> '))
with open('./hightemp.txt') as f:
    lines = f.readlines()
unit = math.ceil(len(lines) / n)
for i in range(0, n):
    with open('016.hightemp{}.txt'.format(i), 'w') as out_file:
        out_file.writelines(lines[i*unit:(i+1)*unit])
I referred to "[split] command-split files".
Bash part
split -n 3 -d hightemp.txt 016.hightemp-u
Find the kinds of strings in the first column (that is, the set of distinct strings). Use the sort and uniq commands for confirmation.
I used the `unique` function of pandas. pandas makes this kind of processing very easy.
Python part
import pandas as pd
df = pd.read_table('hightemp.txt', header=None, usecols=[0])
print(df[0].unique())
Terminal output result
['Kochi Prefecture' 'Saitama' 'Gifu Prefecture' 'Yamagata Prefecture' 'Yamanashi Prefecture' 'Wakayama Prefecture' 'Shizuoka Prefecture' 'Gunma Prefecture' 'Aichi Prefecture' 'Chiba' 'Ehime Prefecture' 'Osaka']
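Without pandas, a set comprehension over the first column gives the same distinct values (a sketch I added; note that a set is unordered):

```python
# Collect the distinct first-column values
with open('hightemp.txt') as f:
    prefectures = {line.split('\t')[0] for line in f}
print(prefectures)
```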
I referred to "Sort command summary [Linux command collection]".
Bash part
cut --fields=1 hightemp.txt | sort | uniq > result.txt
Terminal output result
Chiba
Wakayama Prefecture
Saitama
Osaka
Yamagata Prefecture
Yamanashi Prefecture
Gifu Prefecture
Ehime Prefecture
Aichi Prefecture
Gunma Prefecture
Shizuoka Prefecture
Kochi Prefecture
Sort the rows in descending order of the numbers in the third column (Note: leave the contents of each row unchanged and rearrange only the rows). Use the sort command for confirmation (for this problem, the result does not have to match the command's output exactly).
I used the `sort_values` function of pandas.
Python part
import pandas as pd
df = pd.read_table('hightemp.txt', header=None)
print(df.sort_values(2, ascending=False))
Terminal output result
0 1 2 3
0 Kochi Prefecture Ekawasaki 41.0 2013-08-12
2 Gifu Prefecture Tajimi 40.9 2007-08-16
1 Saitama Prefecture Kumagaya 40.9 2007-08-16
Omission
21 Yamanashi Prefecture Otsuki 39.9 1990-07-19
22 Yamagata Prefecture Tsuruoka 39.9 1978-08-03
23 Aichi Prefecture Nagoya 39.9 1942-08-02
Bash part
sort hightemp.txt -k 3 -n -r
Terminal output result
Kochi Prefecture Ekawasaki 41 2013-08-12
Gifu Prefecture Tajimi 40.9 2007-08-16
Saitama Prefecture Kumagaya 40.9 2007-08-16
Omission
Osaka Toyonaka 39.9 1994-08-08
Saitama Prefecture Hatoyama 39.9 1997-07-05
Chiba Mobara 39.9 2013-08-11
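A plain-Python version with sorted and a key function would look roughly like this (my addition, assuming the third column always parses as a number):

```python
# Sort rows in descending order by the numeric value in the third column
with open('hightemp.txt') as f:
    lines = f.readlines()
for line in sorted(lines, key=lambda l: float(l.split('\t')[2]), reverse=True):
    print(line, end='')
```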
Find the frequency of occurrence of each string in the first column, and display the strings in descending order of frequency. Use the cut, uniq, and sort commands for confirmation.
I used the `value_counts` function of pandas.
Python part
import pandas as pd
df = pd.read_table('hightemp.txt', header=None, usecols=[0])
print(df[0].value_counts(ascending=False))
Terminal output result
Saitama Prefecture 3
Yamanashi Prefecture 3
Yamagata Prefecture 3
Omission
Ehime Prefecture 1
Kochi Prefecture 1
Osaka 1
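Without pandas, collections.Counter does the same job (a sketch I added):

```python
from collections import Counter

# Count first-column occurrences and print them most-common first
with open('hightemp.txt') as f:
    counts = Counter(line.split('\t')[0] for line in f)
for name, count in counts.most_common():
    print(name, count)
```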
Bash part
cut -f 1 hightemp.txt | sort | uniq -c | sort -r
Terminal output result
3 Gunma Prefecture
3 Yamanashi Prefecture
3 Yamagata Prefecture
Omission
1 Ehime Prefecture
1 Osaka
1 Wakayama Prefecture