Introduction

While browsing the web, I came across a site called "Language Processing 100 Knock 2020". I had wanted to try natural language processing, but I am a new programmer who has only done a little competitive programming. It caught my interest, so I'll give it a try. At the time of writing, I have finished only about half of the problems, but I'm writing this up as a record. I will stop if my heart breaks. If a previous article is missing, please read between the lines.

Environment and stance

environment

OS : macOS Catalina 10.15.3
Python : 3.7.6

stance

I don't put too much effort into the implementation.
I don't know the conventions.
I don't think about safety that much.
I do my best so that others can read the code.
I want to explain things as gently as possible for Python beginners (that is the hope).

I will try to explain as much as I can, but if something catches your interest, I recommend looking it up yourself.

Everything up to here is the same as last time.

Solve "Chapter 2: UNIX Commands"

The following quote is from here

popular-names.txt is a tab-delimited file that stores the "name", "gender", "number of people", and "year" of babies born in the United States. Create programs that perform the following processing, and run them with popular-names.txt as the input file. In addition, perform the same processing with UNIX commands and check the results against the program's output.

Doing the same thing with UNIX commands is a hassle, so I'll skip that part. (Is that OK?)

10. Counting the number of lines

Count the number of lines. Use the wc command for confirmation.

`10.py`


with open("popular-names.txt") as f:
    print(len(f.readlines()))

`Terminal`

Unlike using `open()` alone, `with open() as ~` does not require you to call `close()`. When the indented block ends, the file is closed automatically. `readlines()` is a method that returns the whole file as a list of lines, split at the line breaks.
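As a minimal sketch of the difference described above, the following compares a manual `open()`/`close()` pair with a `with` block. The file name `sample.txt` and its contents are made up for illustration, and the file is created in a temporary directory so that popular-names.txt is not needed.

```python
import os
import tempfile

# Hypothetical sample file, created only for this illustration.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample.txt")
with open(path, mode="w") as f:
    f.write("Mary\tF\nAnna\tF\nEmma\tF\n")

# Using open() alone: we have to call close() ourselves.
g = open(path)
lines_manual = g.readlines()
g.close()

# Using with: the file is closed when the indented block ends.
with open(path) as h:
    lines_with = h.readlines()

print(len(lines_with))  # number of lines in the sample file
print(h.closed)         # True: already closed after the with block
```

Both ways read the same list of lines; the `with` form simply cannot forget the `close()`.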

11. Replace tabs with spaces

Replace each tab character with one space character. Use the sed command, tr command, or expand command for confirmation.

`11.py`


from functools import reduce

with open("popular-names.txt") as f:
    print(reduce(lambda a, b: (a+b).replace("\t", " "), f.readlines()))

`Terminal`


Mary F 7065 1880
Anna F 2604 1880
Emma F 2003 1880
Elizabeth F 1939 1880
・
・

The code golf continues ... (a waste of effort). The output is long, so only the beginning is shown. `reduce()` is a higher-order function, just like `map()`. It applies a two-argument function cumulatively to the items of an iterable and folds them into a single value. It is handy for things like computing a sum.
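To make the behavior of `reduce()` concrete, here is a small sketch on made-up data (the list `numbers` and the sample `lines` are assumptions, not taken from popular-names.txt). It also shows `str.join()` as one alternative way to concatenate the lines, which runs `replace()` once instead of at every step.

```python
from functools import reduce

# reduce() folds a list into one value by applying the function pairwise:
# ((1 + 2) + 3) + 4
numbers = [1, 2, 3, 4]
total = reduce(lambda a, b: a + b, numbers)
print(total)  # 10

# Made-up sample lines in the same tab-delimited shape as the input file.
lines = ["Mary\tF\t7065\t1880\n", "Anna\tF\t2604\t1880\n"]

# Same idea as 11.py: concatenate, replacing tabs along the way.
with_reduce = reduce(lambda a, b: (a + b).replace("\t", " "), lines)

# Alternative: join once, then replace once.
with_join = "".join(lines).replace("\t", " ")

print(with_reduce == with_join)  # True
```

Which form to prefer is a matter of taste here; the `join()` version is just easier to read outside of code golf.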

12. Save the first column in col1.txt and the second column in col2.txt

`12.py`


with open("popular-names.txt") as a,\
        open("col1.txt", mode="w") as b,\
        open("col2.txt", mode="w") as c:
    for l in a.readlines():
        x, y, *z = l.split("\t")
        b.write(x+"\n")
        c.write(y+"\n")

`col1.txt`


Mary
Anna
Emma
Elizabeth
・
・

`col2.txt`


F
F
F
F
・
・

You can chain multiple `open()` calls in a single `with` statement. The line was getting long, so I used `\` to break it. With `x, y, *z =`, the first element goes into `x`, the second into `y`, and the rest into `z` as a list. After that, all that is left is to write what you need to each file.
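The unpacking used above can be tried on its own. This is a small sketch with a made-up line in the same format as popular-names.txt; the variable `line` is only for illustration.

```python
# A made-up line in the same format as popular-names.txt.
line = "Mary\tF\t7065\t1880\n"

# split("\t") returns a list of four strings; the starred name collects
# everything after the first two elements into a list.
x, y, *z = line.split("\t")

print(x)  # Mary
print(y)  # F
print(z)  # ['7065', '1880\n']
```

Note that the trailing newline stays attached to the last field, which is why 12.py adds its own `"\n"` only to `x` and `y`.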

13. Merge col1.txt and col2.txt

Combine col1.txt and col2.txt created in 12 to create a text file in which the first and second columns of the original file are arranged tab-delimited. Use the paste command for confirmation.

`13.py`


with open("marge.txt", mode="w") as a,\
        open("col1.txt") as b,\
        open("col2.txt") as c:
    for x, y in zip(b.readlines(), c.readlines()):
        a.write(x[:-1]+"\t"+y)  # join with a tab, as the task asks

`marge.txt`


Mary F
Anna F
Emma F
Elizabeth F
・
・

`zip()` is a function that lets you loop over the elements of multiple lists in parallel. Both elements end with a newline, so I strip the last character from `x`.
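Here is a small sketch of the same pattern on made-up lists that stand in for the results of `readlines()` on col1.txt and col2.txt. The columns are joined with a tab here, following the task statement; the names `names`, `genders`, and `merged` are only for illustration.

```python
# Made-up contents standing in for col1.txt and col2.txt after readlines().
names = ["Mary\n", "Anna\n", "Emma\n"]
genders = ["F\n", "F\n", "F\n"]

merged = []
for x, y in zip(names, genders):
    # x[:-1] drops the trailing newline of the name; y keeps its own newline.
    merged.append(x[:-1] + "\t" + y)

print(merged)

# zip() stops at the shorter input, so extra items are silently dropped.
pairs = list(zip([1, 2, 3], ["a", "b"]))
print(pairs)  # [(1, 'a'), (2, 'b')]
```

If the two files could have different numbers of lines, that second behavior is worth keeping in mind.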

(I have nothing more to write ...)

14. Output N lines from the beginning

Receive the natural number N by means such as a command line argument, and display only the first N lines of the input. Use the head command for confirmation.

`14.py`


import sys
from functools import reduce

with open(sys.argv[2]) as f:
    S = f.readlines()
    print(reduce(lambda a, b: a+b, S[:min(len(S), int(sys.argv[1]))]),
          end="")

Well ... `reduce()` is convenient ... `sys.argv` stores the strings entered on the command line, including the script name such as "filename.py". It lets you use command-line arguments. Here, `sys.argv[1]` is N and `sys.argv[2]` is the input file name.
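The layout of `sys.argv` can be sketched without actually running a script from the command line. The helper `parse_args` below is hypothetical (it is not part of 14.py) and takes an argv-style list, so the example passes a hand-written list in place of the real `sys.argv`.

```python
def parse_args(argv):
    """Return (n, filename) from an argv-style list.

    Hypothetical helper: argv[0] is the script name, argv[1] is N,
    argv[2] is the input file, matching the order used in 14.py.
    """
    n = int(argv[1])  # command-line arguments always arrive as strings
    filename = argv[2]
    return n, filename


# Simulates running: python 14.py 5 popular-names.txt
example_argv = ["14.py", "5", "popular-names.txt"]
print(parse_args(example_argv))  # (5, 'popular-names.txt')

# In a real script you would import sys and call parse_args(sys.argv).
```

The `int()` call matters: without it, N would stay a string and could not be used as a slice index.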

15. Output the last N lines

Receive the natural number N by means such as command line arguments and display only the last N lines of the input. Use the tail command for confirmation.

`15.py`


import sys
from functools import reduce

with open(sys.argv[2]) as f:
    S = f.readlines()
    print(reduce(lambda a, b: a+b, S[max(0, len(S)-int(sys.argv[1])):]),
          end="")

This is a rerun of problem 14. It would be a problem if more lines were requested than the file contains, so I use `max()` to keep the start index from going below 0.
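The effect of `max()` is easiest to see on a tiny list. This sketch uses a made-up three-item list in place of `readlines()` output and requests more lines than exist; `S` and `n` mirror the names in the code above but the values are assumptions.

```python
# Made-up stand-in for the list returned by readlines().
S = ["a\n", "b\n", "c\n"]
n = 5  # more lines requested than the list contains

# Head: a slice end beyond the list length is clamped automatically.
print(S[:n])                   # ['a\n', 'b\n', 'c\n']

# Tail without max(): len(S) - n is -2, which Python reads as
# "start two items from the end", so one line is silently lost.
print(S[len(S) - n:])          # ['b\n', 'c\n']

# Tail with max(): the start index is held at 0, so all lines are kept.
print(S[max(0, len(S) - n):])  # ['a\n', 'b\n', 'c\n']
```

So the `max()` in 15.py is doing real work, while the `min()` in 14.py is a safeguard that slicing would also provide on its own.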

In conclusion

This time there wasn't much material (nothing amusing), but how was it? There is probably more commentary than before. It has turned into a very typical tech article, but I hope it serves as one possible set of answers to the 100 language processing knocks. There are quite a few articles on this topic, so please have a look at them if you are interested.

See you in the next article, Chapter 2, Part 2. If you have any ideas for shortening the code, please comment.

Well then.

[Programmer newcomer "100 language processing knock 2020"] Solve Chapter 2 [First half: 10 ~ 15]
