[Novice programmer: "100 Language Processing Knock 2020"] Solving Chapter 2 [First half: 10 ~ 15]

Introduction

While wandering around the net, I came across a site called "Language Processing 100 Knock 2020". I had wanted to try natural language processing, but I am a novice programmer who has only dabbled in competitive programming. It caught my interest, so I decided to give it a try. At the time of writing I have finished only half of the problems, but I am writing this up as a keepsake. I will stop if I lose heart. If you cannot find a previous article, please draw your own conclusions.

Environment and stance

environment

stance

I will try to include as much explanation as I can, but if something interests you, I recommend looking it up yourself.

Up to this point, everything is the same as last time.

Solve "Chapter 2: UNIX Commands"

The following quote is from here

popular-names.txt is a tab-delimited file that stores the "name", "gender", "number of people", and "year" of babies born in the United States. Write a program that performs each of the following tasks and run it with popular-names.txt as the input file. Then perform the same task with UNIX commands and check the result against the program's output.

Doing the same thing with UNIX commands is a hassle, so I will skip that part. (Is that OK?)

10. Counting the number of lines

Count the number of lines. Use the wc command for confirmation.

10.py


with open("popular-names.txt") as f:
    print(len(f.readlines()))

Terminal


2780

When you write with open() as ~, you do not need to call close() yourself, unlike when you use open() on its own. The file is closed automatically when the indented block ends. readlines() is a method that returns the whole file as a list of lines, each still ending with its newline character.
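As a side note, the count can also be taken without building the whole list in memory, because a file object yields its lines one at a time. The following is only a minimal sketch: count_lines is a name I made up for illustration, and an in-memory io.StringIO stands in for the real file.

```python
import io


def count_lines(f):
    """Count lines by iterating over an open file, one line at a time."""
    return sum(1 for _ in f)


# In-memory stand-in for the start of popular-names.txt (two sample records).
sample = io.StringIO("Mary\tF\t7065\t1880\nAnna\tF\t2604\t1880\n")
print(count_lines(sample))  # prints 2
```

For the real file, the same function can be called on the object returned by open("popular-names.txt") inside a with block. Note that wc -l counts newline characters, so the two results differ by one if the last line has no trailing newline.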

11. Replace tabs with spaces

Replace each tab character with one space character. Use the sed command, tr command, or expand command for confirmation.

11.py


from functools import reduce

with open("popular-names.txt") as f:
    # Concatenate all lines with reduce(), then replace the tabs once
    print(reduce(lambda a, b: a+b, f.readlines()).replace("\t", " "), end="")

Terminal


Mary F 7065 1880
Anna F 2604 1880
Emma F 2003 1880
Elizabeth F 1939 1880
・
・

The code golf continues ... (a waste of effort). The output is very long, so only the beginning is shown. reduce() is a higher-order function, just like map(). It applies a two-argument function cumulatively to the items of an iterable, folding them into a single value. It is handy for things like computing a sum.
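To make the folding behaviour of reduce() concrete, here is a small sketch. It uses the first two records from the output above as in-memory sample data instead of reading the file, and it also shows that joining the lines with "".join() gives the same text, which may be the simpler choice outside of code golf.

```python
from functools import reduce

# reduce(f, [a, b, c, d]) evaluates f(f(f(a, b), c), d).
total = reduce(lambda a, b: a + b, [1, 2, 3, 4])
print(total)  # prints 10

# First two records of the file, used here as in-memory sample data.
lines = ["Mary\tF\t7065\t1880\n", "Anna\tF\t2604\t1880\n"]

# Folding strings with + gives the same text as "".join(lines).
folded = reduce(lambda a, b: a + b, lines)
joined = "".join(lines)
print(folded == joined)  # prints True

# Replace every tab with one space on the joined text.
print(joined.replace("\t", " "), end="")
```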

12. Save the first column in col1.txt and the second column in col2.txt

12.py


with open("popular-names.txt") as a,\
        open("col1.txt", mode="w") as b,\
        open("col2.txt", mode="w") as c:
    for l in a.readlines():
        x, y, *z = l.split("\t")
        b.write(x+"\n")
        c.write(y+"\n")

col1.txt


Mary
Anna
Emma
Elizabeth
・
・

col2.txt


F
F
F
F
・
・

You can open several files in a single with statement. The line was getting long horizontally, so I used \ to break it. With x, y, *z = ..., the first element goes into x, the second into y, and everything else into z as a list. After that, all that is left is to write what you need to each file.
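The starred unpacking described above can be tried on a single record. This is just a sketch on one sample line; the variable names name, gender, and rest are mine, chosen for readability.

```python
# One record in the tab-separated layout of popular-names.txt.
line = "Mary\tF\t7065\t1880\n"

# The first two fields get their own names; *rest collects the remainder as a list.
name, gender, *rest = line.split("\t")

print(name)    # prints Mary
print(gender)  # prints F
print(rest)    # prints ['7065', '1880\n']
```

The trailing newline stays attached to the last field, which is why the original code adds "\n" itself only to the first two columns.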

13. Merge col1.txt and col2.txt

Combine col1.txt and col2.txt created in 12 to create a text file in which the first and second columns of the original file are arranged tab-delimited. Use the paste command for confirmation.

13.py


with open("marge.txt", mode="w") as a,\
        open("col1.txt") as b,\
        open("col2.txt") as c:
    for x, y in zip(b.readlines(), c.readlines()):
        # Join the two columns with a tab, as the task (and paste) expects
        a.write(x[:-1]+"\t"+y)

marge.txt


Mary F
Anna F
Emma F
Elizabeth F
・
・

zip() is a function that lets you take the elements of several lists in parallel, one from each list at a time. Both elements end with a newline, so I strip the last character from x.

(I have nothing more to write ...)
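For reference, the pairing done by zip() can be seen on a few in-memory lines. This is a sketch with sample data standing in for col1.txt and col2.txt; it uses rstrip("\n") instead of slicing, which is an alternative of mine that also copes with a last line that has no newline.

```python
# Sample contents of col1.txt and col2.txt, as readlines() would return them.
col1 = ["Mary\n", "Anna\n", "Emma\n"]
col2 = ["F\n", "F\n", "F\n"]

# zip() yields one pair per position: ("Mary\n", "F\n"), ("Anna\n", "F\n"), ...
merged = [x.rstrip("\n") + "\t" + y for x, y in zip(col1, col2)]
print("".join(merged), end="")

# zip() stops at the shorter input, so extra items are silently dropped.
pairs = list(zip(["a", "b", "c"], [1, 2]))
print(pairs)  # prints [('a', 1), ('b', 2)]
```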

14. Output N lines from the beginning

Receive the natural number N by means such as a command line argument, and display only the first N lines of the input. Use the head command for confirmation.

14.py


import sys
from functools import reduce

with open(sys.argv[2]) as f:
    S = f.readlines()
    print(reduce(lambda a, b: a+b, S[:min(len(S), int(sys.argv[1]))]),
          end="")

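Well ... reduce() sure is convenient ... sys.argv is a list that stores the strings entered on the command line, including "filename.py" at index 0. This is what lets you use command-line arguments. In this script, sys.argv[1] is N and sys.argv[2] is the name of the input file.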
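The layout of sys.argv and the "first N lines" step can be sketched separately from the real command line. In the block below, parse_args and head are hypothetical helper names of mine, the argument list is written out by hand, and io.StringIO stands in for the input file. itertools.islice is used as an alternative to slicing so that only the first N lines are read.

```python
import io
from itertools import islice


def parse_args(argv):
    """Split an argv-style list into (N, path); argv[0] is the script name."""
    _, n, path = argv
    return int(n), path


def head(f, n):
    """Return the first n lines of an open file as a single string."""
    return "".join(islice(f, n))


# What sys.argv would look like for: python 14.py 2 popular-names.txt
n, path = parse_args(["14.py", "2", "popular-names.txt"])
print(n, path)  # prints 2 popular-names.txt

sample = io.StringIO("Mary\tF\t7065\t1880\nAnna\tF\t2604\t1880\nEmma\tF\t2003\t1880\n")
print(head(sample, n), end="")
```

If N is larger than the number of lines, islice simply stops at the end of the input, so no min() is needed.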

15. Output the last N lines

Receive the natural number N by means such as command line arguments and display only the last N lines of the input. Use the tail command for confirmation.

15.py


import sys
from functools import reduce

with open(sys.argv[2]) as f:
    S = f.readlines()
    print(reduce(lambda a, b: a+b, S[max(0, len(S)-int(sys.argv[1])):]),
          end="")

This is a rerun of question 14. If N is larger than the number of lines in the file, len(S) - N becomes negative and the slice would start from the wrong position, so I use max() to keep the start index from going below 0.
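The effect of max() is easiest to see on a tiny list. The sketch below uses three made-up lines and N = 5. It also shows collections.deque with maxlen as one possible alternative that keeps only the last N lines while reading; tail is a hypothetical helper name of mine.

```python
from collections import deque

S = ["a\n", "b\n", "c\n"]  # three sample lines
n = 5                      # more than the number of lines

# Without max(): the start index is -2, which counts from the end of the list.
print(S[len(S) - n:])          # prints ['b\n', 'c\n']

# With max(): the start index is clamped to 0, so every line is kept.
print(S[max(0, len(S) - n):])  # prints ['a\n', 'b\n', 'c\n']


def tail(lines, n):
    """Keep only the last n items of an iterable of lines."""
    return list(deque(lines, maxlen=n))


print(tail(S, 2))  # prints ['b\n', 'c\n']
```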

In conclusion

This time there was not much material (so it may not have been very interesting), but how was it? There is probably more commentary than before. It has turned into a very by-the-book technical article, but I hope it can serve as one set of answers to the 100 Language Processing Knock. There are quite a lot of articles about this, so please take a look if you are interested.

See you in the next article, Chapter 2, Part 2. If you have any ideas for shortening the code, please comment.

Well then.
