I am taking on the 100 Language Processing Knocks (nlp100), published on the web page of the Inui-Okazaki Laboratory at Tohoku University, as training in natural language processing and Python. I plan to take notes on the code I implement and the techniques worth keeping in mind along the way. The code will also be published on GitHub.
This is a continuation of Chapter 1.
For this chapter only, I think it is worth including brief explanations of UNIX commands as well as Python.
For the detailed options of each UNIX command, check the man command or ITpro's website and you will be able to study them properly!
hightemp.txt is a file that stores records of the highest temperatures in Japan in a tab-delimited format with the columns "prefecture", "point", "℃", and "day". Create programs that perform the following processing, and run them with hightemp.txt as the input file. In addition, perform the same processing with UNIX commands and check the programs' results against them.
Count the number of lines. Use the wc command for confirmation.
10.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 10.py
import sys
f = open(sys.argv[1])
lines = f.readlines()
print(len(lines))
f.close()
Since the problem statement says "hightemp.txt as an input file", I designed the program to take a command-line argument via sys.argv.
It is run as $ python 10.py hightemp.txt, so in this case sys.argv[0] == "10.py" and sys.argv[1] == "hightemp.txt"; that is, these strings are what get stored.
Regarding reading files
f = open(filename)
hoge = f.read() / f.readline() / f.readlines()
f.close()
this is the basic flow. The three functions that appear on the second line behave as follows. Choose between them as needed.
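read(): Reads the entire file at once and returns it as a single string.
readline(): Reads one line and returns it as a string; each call, for example on every pass of a loop, returns the next line. Unlike read() / readlines(), which read everything in one batch, this is recommended for large files or when you want to stop as soon as some condition is met.
readlines(): Reads the entire file at once and returns a list of lines (the number of lines can be obtained with len()).
with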
For reading (and writing) a file, besides the style that calls close() as shown above, there is also a style that uses with. It seems to be recommended because it prevents forgetting to call close() and forgetting to handle exceptions, both of which are common mistakes when with is not used.
The following program is a trial rewrite of 10.py using with.
When using the with syntax
#!/usr/bin/env python
# -*- coding:utf-8 -*-
# 10.py
import sys
with open(sys.argv[1]) as f:
    lines = f.readlines()
print(len(lines))
From here on, I will in principle use with to read and write files. Only when something cannot be written using with (does such a case even exist?) will I fall back to the legacy style.
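As a complement to the with-based version, here is a minimal sketch that counts lines by iterating over the file object instead of calling readlines(). It assumes Python 3 (the scripts in this article target Python 2), and the function name count_lines and the throwaway file contents are hypothetical.

```python
import os
import tempfile


def count_lines(path):
    """Count lines by iterating lazily instead of loading the whole file."""
    with open(path, encoding="utf-8") as f:
        # A file object yields one line per iteration, so memory use stays small.
        return sum(1 for _ in f)


# Small self-check with a throwaway file (contents are made up).
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("a\tb\nc\td\ne\tf\n")

line_count = count_lines(tmp.name)
os.remove(tmp.name)
print(line_count)
```

Because the file is closed automatically when the with block ends, the function can return from inside the block without leaking the file handle.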
$ wc -l hightemp.txt
24 hightemp.txt
The wc command will display the number of lines, words, and bytes in the file.
If no option is specified, they are output in that order, as follows.
$ wc hightemp.txt
24 98 813 hightemp.txt
The options are -l for the number of lines, -w for the number of words, and -c for the number of bytes.
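To make the three counts concrete, the following is a rough sketch of how they could be computed in Python 3. It is only an approximation of wc under stated assumptions: the function name wc and the sample text are made up for illustration, lines are counted as newline characters, and words as runs separated by ASCII whitespace.

```python
def wc(data):
    """Return (lines, words, bytes) for a bytes object, roughly like wc."""
    line_count = data.count(b"\n")  # -l: number of newline characters
    word_count = len(data.split())  # -w: chunks separated by whitespace
    byte_count = len(data)          # -c: bytes, not characters
    return line_count, word_count, byte_count


# Made-up sample; "℃" takes 3 bytes in UTF-8, so bytes and characters differ.
sample = "a b\tc\n℃ d\n".encode("utf-8")
counts = wc(sample)
print(counts)
```

Working on bytes rather than str is a deliberate choice here, since -c reports bytes and multibyte characters would otherwise be undercounted.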
Replace each tab character with one space character. Use the sed command, tr command, or expand command for confirmation.
11.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 11.py
import sys
with open(sys.argv[1]) as f:
    # Named text rather than str so the built-in str is not shadowed
    text = f.read()
print(text.replace("\t", " "))
Unlike the previous problem, which focused on lines, this time I just want to replace characters all at once, so I simply use read().
The replace() function that appeared in the previous chapter replaces each tab character (\t) with a space.
~~I don't like the extra line break left at the end of the output, but maybe that's part of the charm...?~~
By default, print() appends a newline at the end. To avoid this, in Python 2 you can add a trailing comma, as in print "hogehoge",. In Python 3, you can specify the string appended at the end with end, as in print("hogehoge", end=""), so you can pass "".
//sed version (Note that it depends on the environment)
$ sed -e s/$'\t'/" "/g hightemp.txt
// tr version
$ cat hightemp.txt | tr "\t" " "
// expand version
$ expand -t 1 hightemp.txt
//The result is the same
Kochi Prefecture Ekawasaki 41 2013-08-12
Saitama Prefecture Kumagaya 40.9 2007-08-16
Gifu Prefecture Tajimi 40.9 2007-08-16
(Omitted...)
Yamanashi Prefecture Otsuki 39.9 1990-07-19
Yamagata Prefecture Tsuruoka 39.9 1978-08-03
Aichi Prefecture Nagoya 39.9 1942-08-02
sed is a convenient command that can handle all kinds of text editing, but for a limited purpose like this one (character replacement), it is wiser to use the command made for it (tr).
expand, on the other hand, has such limited uses that you may never get a chance to touch it.
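For comparison, here is a small sketch of rough Python counterparts to tr and expand, assuming Python 3. str.translate with a one-character mapping plays the role of tr, and str.expandtabs(1) behaves like expand -t 1 because a tab stop in every column turns each tab into exactly one space. The sample line is taken from the output above, assuming its fields are tab-delimited.

```python
line = "Kochi Prefecture\tEkawasaki\t41\t2013-08-12\n"

# Like tr "\t" " ": map one character to another.
via_translate = line.translate(str.maketrans("\t", " "))

# Like expand -t 1: with a tab stop in every column, each tab becomes one space.
via_expandtabs = line.expandtabs(1)

# The approach used in 11.py.
via_replace = line.replace("\t", " ")

print(via_translate == via_expandtabs == via_replace)
```

All three give the same result here; replace() remains the simplest choice when only one character needs to be swapped.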
sed: If you specify -e followed by the processing you want to perform, the result is written to standard output. The notation is unusual, but it may be familiar to Vim users.
tr: Reads standard input and replaces each character given in the first argument with the corresponding character in the second argument.
expand: Converts tabs to spaces. The tab stop width is specified with -t.
Save only the first column of each line as col1.txt and only the second column as col2.txt. Use the cut command for confirmation.
12.py
#!/usr/bin/env python
# -*- coding:utf-8 -*-
# 12.py
import sys
def write_col(source_lines, column_number, filename):
    # Collect the requested column (zero-based) from every line
    col = []
    for line in source_lines:
        col.append(line.split()[column_number] + "\n")
    with open(filename, "w") as writer:
        writer.writelines(col)
with open(sys.argv[1]) as f:
    lines = f.readlines()
write_col(lines, 0, "col1.txt")
write_col(lines, 1, "col2.txt")
I made it a function because the same kind of processing is performed twice. From the list of lines received as the 1st argument, it writes the column specified by the 2nd argument to a file whose name is given by the 3rd argument. A newline character is added at append() time to keep the output tidy.
Nothing new is used here, so that is about all there is to say, but I'm embarrassed that my approach differs from program to program... I will go into details later, but I am posting it as is, as a reminder to myself.
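One detail worth illustrating: split() with no argument splits on any whitespace, which works only as long as no field contains a space. The sketch below, assuming Python 3, splits on the tab character explicitly instead. The function name extract_column and the sample row are hypothetical and not part of hightemp.txt.

```python
def extract_column(lines, column_index):
    """Return the given zero-based column from tab-delimited lines."""
    return [line.rstrip("\n").split("\t")[column_index] for line in lines]


# Hypothetical row whose fields contain spaces.
rows = ["New York\tCentral Park\t41\t2013-08-12\n"]

by_whitespace = rows[0].split()[1]   # whitespace split breaks the field apart
by_tab = extract_column(rows, 1)     # tab split keeps the field whole

print(by_whitespace)
print(by_tab)
```

Splitting on "\t" matches the stated file format more directly, at the cost of being stricter about the delimiter.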
$ cut -f 1 hightemp.txt
Kochi Prefecture
Saitama Prefecture
Gifu Prefecture
(Omitted...)
Yamanashi Prefecture
Yamagata Prefecture
Aichi Prefecture
$ cut -f 2 hightemp.txt
Ekawasaki
Kumagaya
Tajimi
(Omitted...)
Otsuki
Tsuruoka
Nagoya
As in the Python version, all we are doing is specifying the field (column) with -f. Note that Python was zero-based (column 0, column 1, ...), whereas the UNIX command is one-based (column 1, column 2, ...).
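The difference in numbering can be sketched as a tiny helper that accepts a one-based field number, as cut -f does, and shifts it to Python's zero-based index. This assumes Python 3; the name cut_field is hypothetical, and the row is taken from the output above on the assumption that its fields are tab-delimited.

```python
def cut_field(line, field):
    """Mimic cut -f for a single field: field numbers start at 1."""
    if field < 1:
        raise ValueError("fields are numbered from 1")
    columns = line.rstrip("\n").split("\t")
    return columns[field - 1]  # shift to Python's zero-based index


row = "Kochi Prefecture\tEkawasaki\t41\t2013-08-12\n"
first = cut_field(row, 1)
second = cut_field(row, 2)
print(first, second)
```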
Combine the col1.txt and col2.txt created in 12 to create a text file in which the first and second columns of the original file are arranged tab-delimited. Use the paste command for confirmation.
13.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 13.py
with open("col1.txt") as f1, open("col2.txt") as f2:
    lines1, lines2 = f1.readlines(), f2.readlines()
with open("merge.txt", "w") as writer:
    for col1, col2 in zip(lines1, lines2):
        writer.write("\t".join([col1.rstrip(), col2]))
I'm getting used to Python, so I wrote the reading part in the first half a little more fluently. Being able to write it like this is a real strength.
For the writing part in the second half, I used zip() as a review of Chapter 1. Unlike in 12., this time newline characters remain at the end of both col1 and col2, so the one at the end of col1 is removed with rstrip().
Here, also as a review, is a version rewritten with a list comprehension.
When rewritten with a list comprehension
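The newline handling described above can be checked in memory with a short sketch, assuming Python 3. The function name merge_columns is hypothetical, and the input lists stand in for the contents of col1.txt and col2.txt.

```python
def merge_columns(lines1, lines2):
    """Join corresponding lines with a tab, following the approach of 13.py."""
    merged = []
    for col1, col2 in zip(lines1, lines2):
        # col1 loses its newline; col2 keeps its own, which ends the merged line.
        merged.append("\t".join([col1.rstrip(), col2]))
    return merged


lines1 = ["Kochi Prefecture\n", "Saitama Prefecture\n"]
lines2 = ["Ekawasaki\n", "Kumagaya\n"]
merged = merge_columns(lines1, lines2)
print(merged)
```

Note that zip() stops at the shorter input, so extra lines in either list would be silently dropped.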
# Inside parentheses, the code is interpreted correctly even if it spans several lines
with open("merge.txt", "w") as writer:
    writer.write(
        "\n".join(
            ["\t".join([col1.rstrip(), col2.rstrip()])
             for col1, col2 in zip(lines1, lines2)]
        )
    )
Now that several ways of writing this have come up, I measured and compared the execution time of each method using timeit.
Execution time measurement program using timeit
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 13_timeit.py
import timeit
# Preprocessing: read col1.txt and col2.txt
s0 = """
with open("col1.txt") as f1, open("col2.txt") as f2:
    lines1, lines2 = f1.readlines(), f2.readlines()
"""
# Naive implementation: concatenate strings
s1 = """
merged_txt = ""
for i in xrange(len(lines1)):
    merged_txt = merged_txt + lines1[i].rstrip() + "\t" + lines2[i]
with open("merge.txt", "w") as writer:
    writer.write(merged_txt)
"""
# Implementation using zip
s2 = """
with open("merge.txt", "w") as writer:
    for col1, col2 in zip(lines1, lines2):
        writer.write("\t".join([col1.rstrip(), col2]))
"""
# Implementation using a list comprehension
# If you don't write "\\n", you get a SyntaxError ... why?
s3 = """
with open("merge.txt", "w") as writer:
    writer.write(
        "\\n".join(
            ["\t".join([col1.rstrip(), col2.rstrip()])
             for col1, col2 in zip(lines1, lines2)]
        )
    )
"""
print("naive:", timeit.repeat(stmt=s1, setup=s0, number=100000))
print("zip:", timeit.repeat(stmt=s2, setup=s0, number=100000))
print("connotation:", timeit.repeat(stmt=s3, setup=s0, number=100000))
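On the SyntaxError question raised in the comment for s3, a likely explanation is that a normal triple-quoted string converts \n into a real line break before timeit ever compiles the text, which splits the inner string literal across two lines. The sketch below, assuming Python 3, checks this with the built-in compile(); the helper name compiles and the one-line snippets are hypothetical.

```python
def compiles(source):
    """Return True if source is valid Python, False on SyntaxError."""
    try:
        compile(source, "<snippet>", "exec")
        return True
    except SyntaxError:
        return False


# Here \n becomes a real line break inside the inner "..." literal.
plain = """joined = "\n".join(["a", "b"])"""

# Doubling the backslash, or using a raw string, keeps the characters \ and n.
escaped = """joined = "\\n".join(["a", "b"])"""
raw = r"""joined = "\n".join(["a", "b"])"""

# A real tab inside a quoted string is legal, so "\t" causes no trouble.
with_tab = """joined = "\t".join(["a", "b"])"""

print(compiles(plain), compiles(escaped), compiles(raw), compiles(with_tab))
```

Writing the statement strings as raw strings (r"""...""") would avoid the need to double any backslashes.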
These are the run times (in seconds) when each of the 3 methods is run for 100000 loops, repeated 3 times (the default). According to the official documentation, the run time should be judged by the minimum value, not the average or the maximum.
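Taking the minimum can be done directly on the list that timeit.repeat() returns. The following is a minimal sketch assuming Python 3; the timed statement is a made-up stand-in, and repeat is passed explicitly because its default differs between Python versions (3 in older versions, 5 from Python 3.7).

```python
import timeit

# Hypothetical setup and statement, chosen only to keep the example quick.
setup = "words = [str(i) for i in range(100)]"
stmt = "'\\t'.join(words)"

times = timeit.repeat(stmt=stmt, setup=setup, repeat=3, number=1000)

# The smallest value is the one least disturbed by other processes.
best = min(times)
print(best)
```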
Execution result
$ python 13_timeit.py
('naive:', [32.61601686477661, 47.96871089935303, 33.15881299972534])
('zip:', [49.846755027770996, 45.05450105667114, 58.70397615432739])
('connotation:', [46.472286224365234, 52.708040952682495, 46.71139121055603])
As a result, in terms of execution time alone, simply concatenating strings was the fastest, and this held even when I changed the order of the measurements. In general, a comprehension is said to bring a speedup, so where is the boundary between cases that speed up and cases that do not?
$ paste col1.txt col2.txt
Kochi Prefecture Ekawasaki
Saitama Prefecture Kumagaya
Gifu Prefecture Tajimi
(Omitted...)
Yamanashi Prefecture Otsuki
Yamagata Prefecture Tsuruoka
Aichi Prefecture Nagoya
The paste command concatenates files horizontally.
The default delimiter is tab, but it can be specified with the -d option.
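A rough Python counterpart of paste with a configurable delimiter might look like the sketch below, assuming Python 3. The function name paste and the sample inputs are hypothetical. It uses itertools.zip_longest so that, as with the paste command, a file that runs out of lines is treated as supplying empty ones; it accepts a single delimiter only, whereas the real -d option takes a list of delimiters.

```python
from itertools import zip_longest


def paste(lines1, lines2, delimiter="\t"):
    """Join lines side by side; a missing line is treated as empty."""
    return [
        delimiter.join([left.rstrip("\n"), right.rstrip("\n")])
        for left, right in zip_longest(lines1, lines2, fillvalue="")
    ]


pasted = paste(["a\n", "b\n", "c\n"], ["1\n", "2\n"], delimiter=",")
print(pasted)
```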
Receive the natural number N by means such as a command line argument and display only the first N lines of the input. Use the head command for confirmation.
14.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 14.py
# Usage: python 14.py [filename] [number of lines]
import sys
with open(sys.argv[1]) as f:
    lines = f.readlines()
for line in lines[:int(sys.argv[2])]:
    print line,
At first, I used the following implementation with xrange(), but
Implementation using xrange()
# Omitted
for i in xrange(int(sys.argv[2])):
    print lines[i],
with this you get an IndexError when you specify a number larger than the number of lines in the file, so I think the slice-based implementation is the wiser choice. The real problem, I think, is that the input value is not validated and no error handling is written in the first place... As for the output, as explained in 11., I added a , to the end of the print statement to suppress the unneeded line break.
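As one way to get the same safety as a slice without reading the whole file, the sketch below uses itertools.islice, assuming Python 3. The function name head and the sample list are hypothetical; islice simply stops early when the input has fewer than N lines, and a basic input check is included as an example of the validation mentioned above.

```python
from itertools import islice


def head(lines, n):
    """Return at most the first n lines without raising IndexError."""
    if n < 0:
        raise ValueError("n must be non-negative")
    # islice also accepts a file object, so only n lines need to be read.
    return list(islice(lines, n))


sample = ["line1\n", "line2\n", "line3\n"]
first_two = head(sample, 2)
too_many = head(sample, 10)
print(first_two, too_many)
```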
$ head -3 hightemp.txt
Kochi Prefecture Ekawasaki 41 2013-08-12
Saitama Prefecture Kumagaya 40.9 2007-08-16
Gifu Prefecture Tajimi 40.9 2007-08-16
This one is simple too: you can specify the number of lines as an option.
Receive the natural number N by means such as command line arguments and display only the last N lines of the input. Use the tail command for confirmation.
15.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 15.py
import sys
with open(sys.argv[1]) as f:
    lines = f.readlines()
# max() keeps the start index from going negative when N exceeds the line count
for line in lines[max(len(lines) - int(sys.argv[2]), 0):]:
    print line,
It is almost the same as the previous 14., although the slice specification is slightly more complicated.
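One pitfall with computing the start index as len(lines) - N is that it becomes negative when N is larger than the number of lines, and a negative start counts from the end of the list. The sketch below, assuming Python 3, shows that case and an alternative based on collections.deque with maxlen, which keeps only the last N items. The function name tail and the sample list are hypothetical.

```python
from collections import deque


def tail(lines, n):
    """Return at most the last n lines, even when n exceeds the line count."""
    if n <= 0:
        return []
    # A deque with maxlen discards older items as newer ones arrive.
    return list(deque(lines, maxlen=n))


sample = ["line1\n", "line2\n", "line3\n"]

# With N = 4 the unclamped start index is -1, so only one line comes back.
unclamped = sample[len(sample) - 4:]

last_two = tail(sample, 2)
all_lines = tail(sample, 4)
print(unclamped, last_two, all_lines)
```

Since deque consumes any iterable, tail() can also be given a file object directly, so the whole file never has to be held in a list.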
$ tail -3 hightemp.txt
Yamanashi Prefecture Otsuki 39.9 1990-07-19
Yamagata Prefecture Tsuruoka 39.9 1978-08-03
Aichi Prefecture Nagoya 39.9 1942-08-02
Almost the same as head.
This article has become long, so I have split the write-up of Chapter 2. Continued in Chapter 2, Part 2.