100 Natural Language Processing Knocks, Chapter 2: UNIX Command Basics (First Half)

This is a record of my solutions to the problems in the first half of Chapter 2. The results of running the equivalent UNIX commands are shown as well.

The target file is hightemp.txt as shown on the web page.

hightemp.txt is a tab-delimited file that records the highest temperatures observed in Japan, with the columns "prefecture", "point", "℃", and "day". Write a program that performs each of the following tasks and run it with hightemp.txt as the input file. Then run the same task with UNIX commands and check the result against the program's output.

10. Counting the number of lines

Count the number of lines. Use the wc command for confirmation.

# -*- coding: utf-8 -*-
__author__ = 'todoroki'

f = open('hightemp.txt')
lines = f.readlines()
print len(lines)
f.close()

#=> 24

Read all the lines of the target file into a list and take the length of that list.

cat hightemp.txt | grep -c ""

#=> 24

Display the file with cat and pipe it to grep -c "" to count the lines. You can count lines the same way by putting wc -l after the pipe instead of grep -c "", but some wc implementations (BSD wc on macOS, for example) pad the number with leading spaces. Those spaces can get in the way when the line count is piped on to another process, so grep -c "" is often the more convenient choice.
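As an aside, the same count can be taken without holding the whole file in memory. The following is only a minimal sketch written for Python 3 (the script above is Python 2), and count_lines is a name made up here for illustration. It accepts any iterable of lines, such as an open file object.

```python
import io


def count_lines(lines):
    """Count the items in an iterable of lines without storing them (illustrative helper)."""
    return sum(1 for _ in lines)


# In-memory stand-in for a file. With the real file one would write:
#   with open('hightemp.txt') as f:
#       print(count_lines(f))
sample = io.StringIO("a\t1\nb\t2\nc\t3\n")
print(count_lines(sample))  # prints 3
```

Note that iterating this way counts a final line that lacks a trailing newline, which matches grep -c "" rather than wc -l, since wc -l counts newline characters.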

11. Replace tabs with spaces

Replace each tab character with one space character. Use the sed command, tr command, or expand command for confirmation.

# -*- coding: utf-8 -*-
__author__ = 'todoroki'

import re

inputfile = 'hightemp.txt'
outputfile = 'hightemp_tab2space.txt'

f = open(inputfile)
lines = f.readlines()
g = open(outputfile, 'w')
for line in lines:
    line = re.sub('\t', ' ', line)
    g.write(line)
    print line.rstrip('\n')  # line already ends with a newline
f.close()
g.close()

#=>Kochi Prefecture Ekawasaki 41 2013-08-12
#=>Saitama Prefecture Kumagaya 40.9 2007-08-16
#=>Gifu Prefecture Tajimi 40.9 2007-08-16
#=>Yamagata Prefecture Yamagata 40.8 1933-07-25
#=>Yamanashi Prefecture Kofu 40.7 2013-08-10
#=>Wakayama Prefecture Katsuragi 40.6 1994-08-08
#=>Shizuoka Prefecture Tenryu 40.6 1994-08-04
#=>Yamanashi Prefecture Katsunuma 40.5 2013-08-10
#=>Saitama Prefecture Koshigaya 40.4 2007-08-16
#=>Gunma Prefecture Tatebayashi 40.3 2007-08-16
#=>Gunma Prefecture Kamisatomi 40.3 1998-07-04
#=>Aichi Prefecture Aisai 40.3 1994-08-05
#=>Chiba Prefecture Ushiku 40.2 2004-07-20
#=>Shizuoka Prefecture Sakuma 40.2 2001-07-24
#=>Ehime Prefecture Uwajima 40.2 1927-07-22
#=>Yamagata Prefecture Sakata 40.1 1978-08-03
#=>Gifu Prefecture Mino 40 2007-08-16
#=>Gunma Prefecture Maebashi 40 2001-07-24
#=>Chiba Prefecture Mobara 39.9 2013-08-11
#=>Saitama Prefecture Hatoyama 39.9 1997-07-05
#=>Osaka Prefecture Toyonaka 39.9 1994-08-08
#=>Yamanashi Prefecture Otsuki 39.9 1990-07-19
#=>Yamagata Prefecture Tsuruoka 39.9 1978-08-03
#=>Aichi Prefecture Nagoya 39.9 1942-08-02

Replace each tab character (\t) with a single space.

cat hightemp.txt | tr "\t" " " > hightemp_tr.txt

#=> (Output is the same as above)
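For a fixed one-character substitution like this, a regular expression is not strictly needed. Below is a minimal Python 3 sketch using str.replace. The name tabs_to_spaces is made up for illustration, and the sample row is the first record of the file. For comparison, str.expandtabs pads to the next tab stop, which is closer to what the expand command does than a one-for-one replacement.

```python
def tabs_to_spaces(line):
    """Replace every tab with exactly one space (illustrative helper)."""
    return line.replace('\t', ' ')


row = "Kochi Prefecture\tEkawasaki\t41\t2013-08-12\n"
print(tabs_to_spaces(row), end='')

# str.expandtabs pads to the next tab stop instead of inserting one space,
# so with 8-column stops "a<TAB>b" becomes "a", seven spaces, then "b".
print("a\tb".expandtabs(8) == "a" + " " * 7 + "b")  # prints True
```

This difference is why the output of tr "\t" " " and the output of expand are not necessarily identical, even though both remove the tab characters.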

12. Save the first column in col1.txt and the second column in col2.txt

Save the extracted version of only the first column of each row as col1.txt and the extracted version of only the second column as col2.txt. Use the cut command for confirmation.

# -*- coding: utf-8 -*-
__author__ = 'todoroki'

inputfile = 'hightemp.txt'
outputfile1 = 'col1.txt'
outputfile2 = 'col2.txt'
f = open(inputfile)
lines = f.readlines()
g = open(outputfile1, "w")
h = open(outputfile2, "w")
for line in lines:
    line = line.split('\t')
    g.write(line[0].strip('\n') + '\n')
    h.write(line[1].strip('\n') + '\n')
f.close()
g.close()
h.close()

# (col1.txt)
#=>Kochi Prefecture
#=>Saitama Prefecture
#=>Gifu Prefecture
#=>Yamagata Prefecture
#=>Yamanashi Prefecture
#=>Wakayama Prefecture
#=>Shizuoka Prefecture
#=>Yamanashi Prefecture
#=>Saitama Prefecture
#=>Gunma Prefecture
#=>Gunma Prefecture
#=>Aichi Prefecture
#=>Chiba Prefecture
#=>Shizuoka Prefecture
#=>Ehime Prefecture
#=>Yamagata Prefecture
#=>Gifu Prefecture
#=>Gunma Prefecture
#=>Chiba Prefecture
#=>Saitama Prefecture
#=>Osaka Prefecture
#=>Yamanashi Prefecture
#=>Yamagata Prefecture
#=>Aichi Prefecture

# (col2.txt)
#=>Ekawasaki
#=>Kumagaya
#=>Tajimi
#=>Yamagata
#=>Kofu
#=>Katsuragi
#=>Tenryu
#=>Katsunuma
#=>Koshigaya
#=>Tatebayashi
#=>Kamisatomi
#=>Aisai
#=>Ushiku
#=>Sakuma
#=>Uwajima
#=>Sakata
#=>Mino
#=>Maebashi
#=>Mobara
#=>Hatoyama
#=>Toyonaka
#=>Otsuki
#=>Tsuruoka
#=>Nagoya

Split each line on the tab delimiter and write the first and second fields to their respective files.

cut -f 1 hightemp.txt > hightemp_cut1.txt
cut -f 2 hightemp.txt > hightemp_cut2.txt

#=> (Same as above, so output is omitted)
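The column extraction can also be written as a small reusable function. This is just a sketch in Python 3, and column is a hypothetical helper name, not something from the exercise. Note that the index here is 0-based, whereas cut -f counts fields from 1. Stripping the newline before splitting means the last column can be extracted the same way as the others.

```python
def column(lines, index, sep='\t'):
    """Yield field number `index` (0-based) from each delimited line."""
    for line in lines:
        fields = line.rstrip('\n').split(sep)
        yield fields[index]


# Two sample rows in the same layout as hightemp.txt.
rows = ["Kochi Prefecture\tEkawasaki\t41\t2013-08-12\n",
        "Saitama Prefecture\tKumagaya\t40.9\t2007-08-16\n"]
print(list(column(rows, 0)))  # first column, like cut -f 1
print(list(column(rows, 1)))  # second column, like cut -f 2
```

Passing an open file object instead of the sample list would process the real file line by line.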

13. Merge col1.txt and col2.txt

Combine the col1.txt and col2.txt created in problem 12 into a text file in which the first and second columns of the original file are arranged side by side, separated by a tab. Use the paste command for confirmation.

# -*- coding: utf-8 -*-
__author__ = 'todoroki'

inputfile1 = 'col1.txt'
inputfile2 = 'col2.txt'
outputfile = 'col_merge.txt'

f = open(inputfile1)
g = open(inputfile2)
h = open(outputfile, "w")

lines1 = f.readlines()
lines2 = g.readlines()
for a, b in zip(lines1, lines2):
    h.write(a.strip() + '\t' + b.strip() + '\n')
f.close()
g.close()
h.close()

#=>Kochi Prefecture Ekawasaki
#=>Saitama Prefecture Kumagaya
#=>Gifu Prefecture Tajimi
#=>Yamagata Prefecture Yamagata
#=>Yamanashi Prefecture Kofu
#=>Wakayama Prefecture Katsuragi
#=>Shizuoka Prefecture Tenryu
#=>Yamanashi Prefecture Katsunuma
#=>Saitama Prefecture Koshigaya
#=>Gunma Prefecture Tatebayashi
#=>Gunma Prefecture Kamisatomi
#=>Aichi Prefecture Aisai
#=>Chiba Prefecture Ushiku
#=>Shizuoka Prefecture Sakuma
#=>Ehime Prefecture Uwajima
#=>Yamagata Prefecture Sakata
#=>Gifu Prefecture Mino
#=>Gunma Prefecture Maebashi
#=>Chiba Prefecture Mobara
#=>Saitama Prefecture Hatoyama
#=>Osaka Prefecture Toyonaka
#=>Yamanashi Prefecture Otsuki
#=>Yamagata Prefecture Tsuruoka
#=>Aichi Prefecture Nagoya

Read the two files and iterate over their lines in parallel with the zip function.

paste col1.txt col2.txt > hightemp_paste.txt

#=> (Output is the same as above)
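One difference worth keeping in mind: zip stops at the shorter input, while the paste command keeps going and treats the exhausted file as empty lines. Here both files have 24 lines, so the results agree. If the inputs could differ in length, a sketch along the following lines (Python 3, with a hypothetical function name paste chosen for illustration) would follow the command's behavior more closely by using itertools.zip_longest.

```python
from itertools import zip_longest


def paste(lines1, lines2, sep='\t'):
    """Join corresponding lines with `sep`, padding the shorter input with ''."""
    for a, b in zip_longest(lines1, lines2, fillvalue=''):
        yield a.rstrip('\n') + sep + b.rstrip('\n')


# Sample inputs of unequal length to show the padding.
col1 = ["Kochi Prefecture\n", "Saitama Prefecture\n", "Gifu Prefecture\n"]
col2 = ["Ekawasaki\n", "Kumagaya\n"]
for merged in paste(col1, col2):
    print(merged)
```

The third output line ends with a tab and an empty second field, whereas plain zip would have dropped that row entirely.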

14. Output N lines from the beginning

Receive a natural number N, for example as a command-line argument, and display only the first N lines of the input. Use the head command for confirmation.

# -*- coding: utf-8 -*-
__author__ = 'todoroki'

import sys

if len(sys.argv) == 3:
    N = int(sys.argv[1])
    f = open(sys.argv[2])
    lines = f.readlines()
    # Slicing avoids an IndexError when N exceeds the number of lines
    for line in lines[:N]:
        print line.strip()
    f.close()
else:
    print "please input \'N\' and \'FILENAME\'"

# (python problem14.py 5 hightemp.txt)
#=>Kochi Prefecture Ekawasaki 41 2013-08-12
#=>Saitama Prefecture Kumagaya 40.9    2007-08-16
#=>Gifu Prefecture Tajimi 40.9    2007-08-16
#=>Yamagata Prefecture Yamagata 40.8    1933-07-25
#=>Yamanashi Prefecture Kofu 40.7    2013-08-10

Of the lines read from the file, print only as many as the number N received.

head -n 5 hightemp.txt

#=> (Output is the same as above)
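Since head only needs the beginning of the input, the Python side does not have to read the whole file either. The following is a minimal Python 3 sketch using itertools.islice. The function name head is chosen here for illustration, and the in-memory sample stands in for the real file.

```python
import io
from itertools import islice


def head(lines, n):
    """Return at most the first n lines of an iterable as a list."""
    return list(islice(lines, n))


sample = io.StringIO("line1\nline2\nline3\n")
for line in head(sample, 2):
    print(line, end='')

# Asking for more lines than exist simply returns what is there.
print(len(head(io.StringIO("only\n"), 5)))  # prints 1
```

With a real file, head(open_file, N) stops reading after N lines, which matters more for large inputs than for a 24-line file like this one.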
