Introduction

I am taking on the 100 Language Processing Knock (nlp100), which is published on the Tohoku University Inui-Okazaki Laboratory web page as training in natural language processing and Python. I plan to keep notes on the code I implement for it and the techniques worth remembering. The code will also be published on GitHub.

This article continues from Chapter 1. For this chapter, I think it is worth adding a little explanation of the UNIX commands as well as of Python. For the detailed options of each UNIX command, check the man command or ITpro's website and you will be able to study them properly!

Chapter 2: UNIX Command Basics

hightemp.txt is a file that stores records of the highest temperatures in Japan in a tab-delimited format with the columns "prefecture", "point", "℃", and "day". Create programs that perform the following processing, and run them with hightemp.txt as the input file. In addition, perform the same processing with UNIX commands and check the programs' results against them.

10. Counting the number of lines

Count the number of lines. Use the wc command for confirmation.

Answer in Python

`10.py`


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 10.py

import sys

f = open(sys.argv[1])
lines = f.readlines()
print(len(lines))

f.close()

Comments on Python Answers

Since the problem statement says "hightemp.txt as an input file", I designed the program to take a command line argument via sys.argv. It is run as $ python 10.py hightemp.txt, so in this case sys.argv[0] == "10.py" and sys.argv[1] == "hightemp.txt"; that is, these strings are stored in the list.

Reading a file follows this flow:

f = open(filename)
hoge = f.read()  # or f.readline() / f.readlines()
f.close()

The three methods that can appear on the second line behave as follows. Choose among them as needed.

read()
Reads the entire file at once as a single string.
readline()
Reads the file one line at a time, so you need to write the code so that readline() is called on each iteration of a loop. Because read() and readlines() both read the whole file in one go, readline() is the recommended choice for large files or when you want to stop once some condition is met.
readlines()
Reads the entire file as a list of strings, one per line. This time I wanted the number of lines, so I obtained it as the length of this list (len()).
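
To make the difference between the three methods concrete, here is a minimal sketch in Python 3 syntax. The sample contents are made up for illustration, and a temporary file stands in for hightemp.txt.

```python
import os
import tempfile

# Hypothetical three-line sample standing in for hightemp.txt
sample = "a\t1\nb\t2\nc\t3\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

# read(): the whole file as one string
with open(path) as f:
    whole = f.read()

# readline(): one line per call; an empty string signals end of file
first_lines = []
with open(path) as f:
    while True:
        line = f.readline()
        if line == "" or line.startswith("b"):  # stop early on a condition
            break
        first_lines.append(line)

# readlines(): a list with one string per line
with open(path) as f:
    lines = f.readlines()

os.remove(path)
print(len(whole), first_lines, len(lines))
```

The readline() loop stops as soon as it sees the line starting with "b", so the rest of the file is never read. That is the situation where it has an advantage over the batch reads.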

Pattern using `with`

For reading (and writing) a file, there is also a style that uses with, in addition to the style that calls close() as described above. with is said to be recommended because it prevents forgetting to call close() and forgetting to handle exceptions, both of which are common mistakes when with is not used. The following program is a trial rewrite of 10.py using with.

`When using the with syntax`


#!/usr/bin/env python
# -*- coding:utf-8 -*-
# 10.py

import sys

with open(sys.argv[1]) as f:
    lines = f.readlines()

print(len(lines))

From here on, I will use with for reading and writing files as a rule. Only where it cannot be written using with (is there such a case?) will I program in the legacy way.
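
As a small check of the claim that with takes care of closing the file, the following sketch (Python 3 syntax) raises an exception inside a with block and then inspects the file object. The scratch file and the simulated error are assumptions made only for this demonstration.

```python
import os
import tempfile

# Hypothetical scratch file, used only for this demonstration
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

handle = None
try:
    with open(path, "w") as f:
        handle = f
        f.write("partial\n")
        raise ValueError("simulated failure while writing")
except ValueError:
    pass

# The with block closed the file even though an exception was raised
closed_after_error = handle.closed
os.remove(path)
print(closed_after_error)
```

With the open()/close() style, the same guarantee would need an explicit try/finally around the body.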

UNIX answer

$ wc -l hightemp.txt
      24 hightemp.txt

Comments on UNIX Answers

The wc command displays the number of lines, words, and bytes in a file. If no option is specified, they are printed in that order, as follows.

$ wc hightemp.txt 
      24      98     813 hightemp.txt

The options are -l for the number of lines, -w for the number of words, and -c for the number of bytes.
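
For comparison, the three counts can be approximated in Python 3 as below. This is a rough sketch, not a reimplementation of wc: the helper name wc_counts and the sample data are made up, and words are simply split on ASCII whitespace.

```python
def wc_counts(data):
    """Return (lines, words, bytes) for a bytes object, roughly like wc."""
    line_count = data.count(b"\n")   # wc -l counts newline characters
    word_count = len(data.split())   # runs of non-whitespace bytes
    byte_count = len(data)           # wc -c counts bytes, not characters
    return line_count, word_count, byte_count

# Made-up sample, not the real hightemp.txt
sample = "高知県\t江川崎\t41\t2013-08-12\n".encode("utf-8")
counts = wc_counts(sample)

# A file whose last line has no trailing newline
no_trailing_newline = wc_counts(b"a b\nc")
print(counts, no_trailing_newline)
```

Note that wc -l counts newline characters, so for a file whose last line lacks a trailing newline it reports one less than len(f.readlines()) would.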

11. Replace tabs with spaces

Replace each tab character with one space character. Use the sed command, tr command, or expand command for confirmation.

Answer in Python

`11.py`


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 11.py

import sys

with open(sys.argv[1]) as f:
    text = f.read()  # avoid the name "str", which would shadow the built-in type

print(text.replace("\t", " "))

Comments on Python Answers

Unlike the previous problem, which focused on lines, this time I just want to replace characters all at once, so I simply use read(). The replace() method that appeared in the previous chapter replaces each tab character (\t) with a space. ~~I don't like the extra line break left at the end of the output, but that's part of its charm ...?~~ By default, print() adds a newline at the end. To avoid this in Python 2, you can add a trailing comma, as in print "hogehoge",. In Python 3, you can specify the string added at the end with `end`, as in print("hogehoge", end=""), so specify "" there.
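
The effect of end can be seen in the following sketch, written in Python 3 syntax. The helper name replace_tabs and the sample line are made up, and the output is captured in a StringIO buffer only so that it can be inspected.

```python
import io

def replace_tabs(text):
    """Replace every tab character with a single space."""
    return text.replace("\t", " ")

# Made-up sample ending in a newline, like text read from a file
sample = "Kochi\tEkawasaki\t41\n"

buffer = io.StringIO()
print(replace_tabs(sample), end="", file=buffer)  # no extra newline appended
without_extra = buffer.getvalue()

buffer = io.StringIO()
print(replace_tabs(sample), file=buffer)           # default end="\n"
with_extra = buffer.getvalue()

print(repr(without_extra), repr(with_extra))
```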

UNIX answer

# sed version (note that this depends on the environment)
$ sed -e s/$'\t'/" "/g hightemp.txt
# tr version
$ cat hightemp.txt | tr "\t" " "
# expand version
$ expand -t 1 hightemp.txt

# The result is the same in every case
Kochi Prefecture Ekawasaki 41 2013-08-12
Saitama Prefecture Kumagaya 40.9 2007-08-16
Gifu Prefecture Tajimi 40.9 2007-08-16
(Omitted...)
Yamanashi Prefecture Otsuki 39.9 1990-07-19
Yamagata Prefecture Tsuruoka 39.9 1978-08-03
Aichi Prefecture Nagoya 39.9 1942-08-02

Comments on UNIX Answers

sed is a convenient command that can handle all sorts of text editing, but for a limited purpose like this one (character replacement), it is wiser to use the dedicated command, tr. Conversely, `expand` is so narrow in purpose that you may rarely get a chance to use it.

sed
If you specify the -e option followed by the processing you want to perform, the result is written to standard output. The notation is distinctive, but it may feel familiar to Vim users.
It seems to depend on the environment, so finding a form that actually worked was quite difficult ... be careful!
tr
A character replacement command. Besides replacing characters, it can also do things like converting between uppercase and lowercase, so it is unassuming but versatile.
expand
Converts tabs to spaces. The tab width can be specified with the -t option.
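
Python's string methods offer rough counterparts to tr and expand, sketched below in Python 3 syntax. The pairing of each method with a command is my own analogy and the sample line is made up.

```python
# Made-up sample line
line = "a\tbc\td"

# Like tr "\t" " ": a one-to-one character mapping
tr_like = line.translate(str.maketrans("\t", " "))

# tr can also map sets of characters, e.g. lowercase to uppercase
upper_like = line.translate(str.maketrans("abcd", "ABCD"))

# Like expand -t 1: tab stops at every column, so each tab becomes one space
expand_1 = line.expandtabs(1)

# Like expand -t 8: each tab is padded out to the next multiple of 8 columns
expand_8 = line.expandtabs(8)

print(repr(tr_like), repr(upper_like), repr(expand_1), repr(expand_8))
```

This also shows why expand gives the same result as tr only with a tab width of 1: with any larger width, the number of spaces depends on the column where the tab occurs.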

12. Save the first column in col1.txt and the second column in col2.txt

Save the extracted version of only the first column of each row as col1.txt and the extracted version of only the second column as col2.txt. Use the cut command for confirmation.

Answer in Python

`12.py`


#! /usr/bin/env python
# -*- coding:utf-8 -*-
# 12.py

import sys


def write_col(source_lines, column_number, filename):
    col = []
    for line in source_lines:
        col.append(line.split()[column_number] + "\n")
    with open(filename, "w") as writer:
        writer.writelines(col)


with open(sys.argv[1]) as f:
    lines = f.readlines()

write_col(lines, 0, "col1.txt")
write_col(lines, 1, "col2.txt")

Comments on Python Answers

I made it a function because the same kind of processing is performed twice. It takes the column specified by the 2nd argument from the list of lines received as the 1st argument, and writes it to the file named by the 3rd argument. A newline character is added in `append()` so that each entry ends up on its own line.

No new techniques are used here, so that is about all I can comment on, though I'm embarrassed that the algorithm differs from program to program ... Details will be described later, but I am posting it as it is, as a reminder to myself.
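
One thing to keep in mind with column extraction (an observation added here, not necessarily the difference the author alludes to): split() with no argument splits on any whitespace, while split("\t") splits only on tabs. A minimal Python 3 sketch, with a hypothetical helper extract_column and a made-up input line:

```python
def extract_column(lines, column_number, sep="\t"):
    """Return the given zero-based column of each line (hypothetical helper)."""
    return [line.rstrip("\n").split(sep)[column_number] for line in lines]

# Made-up line whose first field contains a space
lines = ["Kochi Prefecture\tEkawasaki\t41\n"]

by_tab = extract_column(lines, 1)
by_whitespace = [line.split()[1] for line in lines]

print(by_tab, by_whitespace)
```

For data whose fields never contain spaces, the two approaches give the same columns.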

UNIX answer

$ cut -f 1 hightemp.txt
Kochi Prefecture
Saitama Prefecture
Gifu Prefecture
(Omitted...)
Yamanashi Prefecture
Yamagata Prefecture
Aichi Prefecture
$ cut -f 2 hightemp.txt
Ekawasaki
Kumagaya
Tajimi
(Omitted...)
Otsuki
Tsuruoka
Nagoya

Comments on UNIX Answers

As with Python, all we are doing is specifying the field (column) with -f. Note that the Python version is zero-based (column 0, column 1, ...), whereas the UNIX command is one-based (column 1, column 2, ...).

13. Merge col1.txt and col2.txt

Combine the col1.txt and col2.txt created in 12 to create a text file in which the first and second columns of the original file are arranged tab-delimited. Use the paste command for confirmation.

Answer in Python

`13.py`


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 13.py

with open("col1.txt") as f1, open("col2.txt") as f2:
    lines1, lines2 = f1.readlines(), f2.readlines()

with open("merge.txt", "w") as writer:
    for col1, col2 in zip(lines1, lines2):
        writer.write("\t".join([col1.rstrip(), col2]))

Comments on Python Answers

I'm getting used to Python, so I wrote the reading part in the first half a little more compactly. Being able to write it like this is powerful. For the writing part in the second half, I used zip() as a review of Chapter 1. In contrast to 12., this time the newline character remains at the end of both col1 and col2, so the one at the end of col1 is removed with rstrip().

Also as a review, here it is rewritten with a list comprehension.

`When rewritten with a list comprehension`


# Inside parentheses, the code is parsed correctly even when it is split across lines
with open("merge.txt", "w") as writer:
    writer.write(
        "\n".join(
            ["\t".join([col1.rstrip(), col2.rstrip()])
                for col1, col2 in zip(lines1, lines2)]
        )
    )
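
The two ways of writing the merge do not produce byte-identical output: the comprehension strips every newline and joins the rows with "\n", so the last row has no trailing newline. The following Python 3 sketch mirrors both as pure functions; the function names and the sample data are made up.

```python
def merge_with_loop(lines1, lines2):
    """Mirror the zip() loop: strip col1 only, keep col2's newline."""
    return "".join("\t".join([c1.rstrip(), c2]) for c1, c2 in zip(lines1, lines2))

def merge_with_comprehension(lines1, lines2):
    """Mirror the comprehension: strip both, join rows with newlines."""
    return "\n".join(
        ["\t".join([c1.rstrip(), c2.rstrip()]) for c1, c2 in zip(lines1, lines2)]
    )

# Made-up stand-ins for the contents of col1.txt and col2.txt
col1 = ["Kochi\n", "Saitama\n"]
col2 = ["Ekawasaki\n", "Kumagaya\n"]

loop_result = merge_with_loop(col1, col2)
comprehension_result = merge_with_comprehension(col1, col2)

print(repr(loop_result))
print(repr(comprehension_result))
```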

Comparison of execution time

Since several styles have come up, I measured and compared the execution time of each one using timeit.

`Execution time measurement program using timeit`


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 13_timeit.py

import timeit

# Preprocessing: read col1.txt and col2.txt
s0 = """
with open("col1.txt") as f1, open("col2.txt") as f2:
    lines1, lines2 = f1.readlines(), f2.readlines()
"""

# Naive implementation: string concatenation
s1 = """
merged_txt = ""
for i in xrange(len(lines1)):
    merged_txt = merged_txt + lines1[i].rstrip() + "\t" + lines2[i]

with open("merge.txt", "w") as writer:
    writer.write(merged_txt)
"""

# Implementation using zip
s2 = """
with open("merge.txt", "w") as writer:
    for col1, col2 in zip(lines1, lines2):
        writer.write("\t".join([col1.rstrip(), col2]))
"""

# Implementation using a list comprehension
# "\\n" is required here: a plain "\n" inside this triple-quoted string becomes
# a real line break in the timed code, splitting its string literal (SyntaxError)
s3 = """
with open("merge.txt", "w") as writer:
    writer.write(
        "\\n".join(
            ["\t".join([col1.rstrip(), col2.rstrip()])
                for col1, col2 in zip(lines1, lines2)]
        )
    )
"""

print("naive:", timeit.repeat(stmt=s1, setup=s0, number=100000))
print("zip:", timeit.repeat(stmt=s2, setup=s0, number=100000))
print("connotation:", timeit.repeat(stmt=s3, setup=s0, number=100000))

These are the times (in seconds) for each of the 3 methods when a measurement of 100000 executions is repeated 3 times (the default in this Python version). According to the official documentation, the execution time should be judged by the minimum value, not by the average or maximum.
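
Two details of the measurement can be sketched in Python 3 syntax, with a toy statement in place of the real file merge. First, the SyntaxError with "\n" arises because the escape is processed when the triple-quoted string is created, so the code handed to timeit contains a string literal broken across two lines; a raw string (or "\\n") avoids that. Second, taking min() of the repeats follows the advice about the minimum value. The helper compiles and the variable names are made up.

```python
import timeit

# Plain triple-quoted string: "\n" becomes a real line break before compilation
broken = """
text = "\n".join(["a", "b"])
"""

# Raw triple-quoted string: the backslash survives, so the timed code sees "\n"
fixed = r"""
text = "\n".join(["a", "b"])
"""

def compiles(source):
    """Return True if the source string is valid Python code."""
    try:
        compile(source, "<stmt>", "exec")
        return True
    except SyntaxError:
        return False

print(compiles(broken), compiles(fixed))

# Judge by the minimum of the repeated measurements
best = min(timeit.repeat(stmt=fixed, number=1000, repeat=3))
print(best)
```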

`Execution result`


$ python 13_timeit.py
('naive:', [32.61601686477661, 47.96871089935303, 33.15881299972534])
('zip:', [49.846755027770996, 45.05450105667114, 58.70397615432739])
('connotation:', [46.472286224365234, 52.708040952682495, 46.71139121055603])

As a result, in terms of execution time alone, the method of simply concatenating strings was the fastest, and this held even when I changed the order of the measurements. In general, a comprehension is expected to be faster, so where is the boundary between the cases that speed up and those that do not?

UNIX answer

$ paste col1.txt col2.txt 
Kochi Prefecture	Ekawasaki
Saitama Prefecture	Kumagaya
Gifu Prefecture	Tajimi
(Omitted...)
Yamanashi Prefecture	Otsuki
Yamagata Prefecture	Tsuruoka
Aichi Prefecture	Nagoya

Comments on UNIX Answers

The paste command concatenates files horizontally, line by line. The default delimiter is a tab, but a different one can be specified with the -d option.

14. Output N lines from the beginning

Receive the natural number N by means such as a command line argument and display only the first N lines of the input. Use the head command for confirmation.

Answer in Python

`14.py`


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 14.py

# Usage: python 14.py [filename] [number of lines]

import sys

with open(sys.argv[1]) as f:
    lines = f.readlines()

for line in lines[:int(sys.argv[2])]:
    print line,

Comments on Python Answers

At first, I used the following implementation with xrange():

`Implementation using xrange()`


# (omitted)

for i in xrange(int(sys.argv[2])):
    print lines[i],

With this, you get an `IndexError` when you specify a number that exceeds the number of lines in the file, so I think it is wiser to implement it with slices. The real problem, I suppose, is that the input value is not checked and no error handling is written in the first place ... As for the output, as explained in 11., I added `,` to the end of the `print` statement to suppress the extra line break.
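
The difference between the two approaches can be reproduced with a small Python 3 sketch. The function names and the three-line input are made up, and range() takes the place of Python 2's xrange().

```python
def head(lines, n):
    """Return the first n lines; a slice never raises IndexError."""
    return lines[:n]

def head_by_index(lines, n):
    """Index-based version: fails when n exceeds len(lines)."""
    return [lines[i] for i in range(n)]

lines = ["one\n", "two\n", "three\n"]  # made-up input

safe = head(lines, 10)
try:
    head_by_index(lines, 10)
    raised = False
except IndexError:
    raised = True

print(safe, raised)
```

A slice silently stops at the end of the list, which matches how head behaves when N is larger than the file.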

UNIX answer

$ head -3 hightemp.txt
Kochi Prefecture	Ekawasaki	41	2013-08-12
Saitama Prefecture	Kumagaya	40.9	2007-08-16
Gifu Prefecture	Tajimi	40.9	2007-08-16

Comments on UNIX Answers

This one is simple too: the number of lines is specified as an option (head -n 3 is the equivalent standard form).

15. Output the last N lines

Receive the natural number N by means such as command line arguments and display only the last N lines of the input. Use the tail command for confirmation.

Answer in Python

`15.py`


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 15.py

import sys

with open(sys.argv[1]) as f:
    lines = f.readlines()

# max() keeps the start index from going negative when N exceeds the line count
for line in lines[max(len(lines) - int(sys.argv[2]), 0):]:
    print line,

Comments on Python Answers

It is almost the same as 14. above, although the slice specification is slightly more complicated.
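
When slicing from the end, two edge cases are easy to get wrong: a start index that goes negative when N exceeds the number of lines, and lines[-N:] when N is 0. A minimal Python 3 sketch with a hypothetical tail helper and made-up input:

```python
def tail(lines, n):
    """Return the last n lines, or all of them when n exceeds len(lines)."""
    start = max(len(lines) - n, 0)  # clamp so the start never goes negative
    return lines[start:]

lines = ["one\n", "two\n", "three\n", "four\n"]  # made-up input

last_two = tail(lines, 2)
everything = tail(lines, 6)
unclamped = lines[len(lines) - 6:]  # start is -2, so only 2 lines come back
nothing = tail(lines, 0)
also_everything = lines[-0:]         # -0 is 0, so lines[-n:] fails for n == 0

print(last_two, len(everything), len(unclamped), nothing, len(also_everything))
```

Clamping the start index with max() handles both cases with a single slice.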

UNIX answer

$ tail -3 hightemp.txt
Yamanashi Prefecture	Otsuki	39.9	1990-07-19
Yamagata Prefecture	Tsuruoka	39.9	1978-08-03
Aichi Prefecture	Nagoya	39.9	1942-08-02

Comments on UNIX Answers

Almost the same as head.

In conclusion

Since this has become long, I have split the article on Chapter 2 in two. It continues in Chapter 2, Part 2.

100 Language Processing Knock with Python (Chapter 2, Part 1)
