While wandering around the net, I came across a site called "Language Processing 100 Knock 2020". I had been wanting to try natural language processing, and I'm a novice programmer who has done a little competitive programming. It caught my interest, so I'm giving it a try. At the time of writing I have finished only about half of the problems, but I'm writing this up as a record. I will stop if my spirit breaks. If an article is missing, please take the hint.
I'll add as much commentary as I can, but if something catches your interest, I recommend looking it up yourself.
For the story so far, see Last time.
The following quote is from here
popular-names.txt is a file that stores the "name", "gender", "number of people", and "year" of a baby born in the United States in a tab-delimited format. Create a program that performs the following processing, and execute popular-names.txt as an input file. Furthermore, execute the same process with UNIX commands and check the execution result of the program.
Doing the same thing with UNIX commands is a hassle, so I'm skipping that part. (Is that OK?)
Count the number of lines. Use the wc command for confirmation.
10.py
with open("popular-names.txt") as f:
    print(len(f.readlines()))
Terminal
2780
`with open() as ~` does not require you to call `close()`, unlike using `open()` alone. When the indented block ends, the file is closed automatically.
`readlines()` is a method that returns the entire file as a list with one element per line.
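As a rough alternative sketch, the lines can also be counted by iterating over the file object, which avoids building the whole list in memory. The helper name `count_lines` and the three-line sample file are assumptions made for illustration; they stand in for popular-names.txt.

```python
import os
import tempfile


def count_lines(path):
    """Count lines by iterating over the file instead of loading it whole."""
    with open(path) as f:
        return sum(1 for _ in f)


# Hypothetical sample data standing in for popular-names.txt.
sample = "Mary\tF\t7065\t1880\nAnna\tF\t2604\t1880\nEmma\tF\t2003\t1880\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)

line_count = count_lines(tmp.name)
print(line_count)
os.remove(tmp.name)
```

For a file of this size either approach is fine; the generator version only matters when the input is too large to hold comfortably in memory.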
Replace each tab character with one space character. Use the sed command, tr command, or expand command for confirmation.
11.py
from functools import reduce
with open("popular-names.txt") as f:
    print(reduce(lambda a, b: (a+b).replace("\t", " "), f.readlines()))
Terminal
Mary F 7065 1880
Anna F 2604 1880
Emma F 2003 1880
Elizabeth F 1939 1880
・
・
I'm still playing code golf... (a waste of effort). The output is very long, so only the beginning is shown.
`reduce()` is a higher-order function, just like `map()`.
It applies a two-argument function cumulatively to the items of an iterable, folding them into a single value. That makes it convenient for things like computing a sum.
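To make the folding behaviour concrete, here is a small sketch. The two sample lines are made up for illustration; the second half checks that the `reduce()` one-liner gives the same text as a plain `join()` followed by `replace()`, which is probably the simpler way to write it.

```python
from functools import reduce

# reduce() folds from the left: ((1 + 2) + 3) + 4
total = reduce(lambda a, b: a + b, [1, 2, 3, 4])
print(total)

# Hypothetical sample lines standing in for f.readlines().
lines = ["Mary\tF\t7065\t1880\n", "Anna\tF\t2604\t1880\n"]

# Same shape as the code-golf version: concatenate, then replace tabs.
joined = reduce(lambda a, b: (a + b).replace("\t", " "), lines)

# Simpler equivalent: join once, replace once.
simple = "".join(lines).replace("\t", " ")
print(joined == simple)
```

Note that `reduce()` without an initial value raises `TypeError` on an empty list, so the `join()` form is also safer for an empty file.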
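Save only the first column of each line as col1.txt and only the second column as col2.txt. Use the cut command for confirmation.
12.py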
with open("popular-names.txt") as a,\
        open("col1.txt", mode="w") as b,\
        open("col2.txt", mode="w") as c:
    for l in a.readlines():
        x, y, *z = l.split("\t")
        b.write(x+"\n")
        c.write(y+"\n")
col1.txt
Mary
Anna
Emma
Elizabeth
・
・
col2.txt
F
F
F
F
・
・
You can chain several `open()` calls in a single `with` statement. Since the line was getting long horizontally, I used `\` to break it.
With `x, y, *z =`, the first value goes into `x`, the second into `y`, and the rest into `z` as a list.
After that, all you have to do is write what you need to each file.
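The starred unpacking can be tried on a single line without touching any files. The sample line below is taken from the output shown earlier; the variable names are only illustrative.

```python
line = "Mary\tF\t7065\t1880\n"

# Strip the newline first so it does not end up inside the last field.
name, gender, *rest = line.rstrip("\n").split("\t")
print(name, gender, rest)

# With exactly two fields, the starred name becomes an empty list.
first, second, *others = "Anna\tF".split("\t")
print(others)
```

The starred name always receives a list, so the same statement works whether a line has two columns or ten.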
Combine col1.txt and col2.txt created in 12 to create a text file in which the first and second columns of the original file are arranged tab-delimited. Use the paste command for confirmation.
13.py
with open("marge.txt", mode="w") as a,\
        open("col1.txt") as b,\
        open("col2.txt") as c:
    for x, y in zip(b.readlines(), c.readlines()):
        # Join with a tab, since the problem asks for tab-delimited output.
        a.write(x[:-1]+"\t"+y)
marge.txt
Mary F
Anna F
Emma F
Elizabeth F
・
・
`zip()` is a function that lets you take elements from several lists at the same time.
Both elements end with a newline, so `x[:-1]` drops the last character of `x`.
(I have nothing more to write...)
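Here is a minimal sketch of the pairing step using in-memory lists instead of col1.txt and col2.txt (the sample values are assumptions for illustration). It uses `rstrip("\n")` rather than slicing, which also behaves correctly if the last line of a file has no trailing newline, and it joins with a tab as the problem statement asks.

```python
# Hypothetical contents of col1.txt and col2.txt as returned by readlines().
names = ["Mary\n", "Anna\n", "Emma\n"]
genders = ["F\n", "F\n", "F\n"]

# Pair up the i-th elements and join them with a tab.
merged = [n.rstrip("\n") + "\t" + g for n, g in zip(names, genders)]
print("".join(merged), end="")

# zip() stops at the shortest input, so extra items are silently dropped.
pairs = list(zip([1, 2, 3], ["a", "b"]))
print(pairs)
```

Because `zip()` truncates silently, it is worth confirming that both column files have the same number of lines before merging.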
Receive the natural number N by means such as a command line argument, and display only the first N lines of the input. Use the head command for confirmation.
14.py
import sys
from functools import reduce
with open(sys.argv[2]) as f:
    S = f.readlines()
    print(reduce(lambda a, b: a+b, S[:min(len(S), int(sys.argv[1]))]),
          end="")
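Well... `reduce()` really is convenient...
`sys.argv` stores the strings entered on the command line, including the script name (`filename.py`) as its first element.
It lets you use command-line arguments. Here the first argument is N and the second is the input file.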
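As an alternative sketch, the first N lines can be taken with `itertools.islice`, which stops reading after N lines instead of loading the whole file. The function name `head` and the sample file are assumptions for illustration, not part of the original answer.

```python
import os
import tempfile
from itertools import islice


def head(path, n):
    """Return the first n lines of the file as a list of strings."""
    with open(path) as f:
        return list(islice(f, n))


# Hypothetical sample data standing in for popular-names.txt.
sample = "Mary\tF\t7065\t1880\nAnna\tF\t2604\t1880\nEmma\tF\t2003\t1880\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)

first_two = head(tmp.name, 2)
more_than_file = head(tmp.name, 10)  # asking for too many lines is harmless
print("".join(first_two), end="")
os.remove(tmp.name)
```

In a script, the two arguments could come from `sys.argv[1]` and `sys.argv[2]` in the same order as 14.py.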
Receive the natural number N by means such as command line arguments and display only the last N lines of the input. Use the tail command for confirmation.
15.py
import sys
from functools import reduce
with open(sys.argv[2]) as f:
    S = f.readlines()
    print(reduce(lambda a, b: a+b, S[max(0, len(S)-int(sys.argv[1])):]),
          end="")
This is a rerun of question 14.
I use `max()` to keep the slice start from going negative. Without it, asking for more lines than the file contains would produce a negative start index and return the wrong lines.
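Another way to sketch this is with `collections.deque` and its `maxlen` argument, which keeps only the most recent N lines while reading, so no index arithmetic is needed. The function name `tail` and the sample file are assumptions for illustration.

```python
import os
import tempfile
from collections import deque


def tail(path, n):
    """Return the last n lines of the file as a list of strings."""
    with open(path) as f:
        # A bounded deque discards older lines as newer ones arrive.
        return list(deque(f, maxlen=n))


# Hypothetical sample data standing in for popular-names.txt.
sample = "Mary\tF\t7065\t1880\nAnna\tF\t2604\t1880\nEmma\tF\t2003\t1880\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)

last_two = tail(tmp.name, 2)
more_than_file = tail(tmp.name, 10)  # returns every line, no special case
print("".join(last_two), end="")
os.remove(tmp.name)
```

When N exceeds the number of lines, the deque simply never fills up, so the whole file is returned.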
There wasn't much to joke about this time (so it may not be very entertaining), but how was it? There is probably more commentary than before. It has turned into a textbook technical article, but I hope it serves as one possible set of answers to the 100 language processing knocks. There are quite a lot of articles on this topic, so please take a look at them if you are interested.
See you in the next article, Chapter 2, Part 2. If you have any ideas for shortening the code, please comment.
Well then.