100 Language Processing Knock 2020 Chapter 2


The 2020 edition of 100 Language Processing Knock has been released, so I am trying it right away. Chapter 1 is exactly the same as in the 2015 edition (which I happened to be working through), so I'll start with Chapter 2.

Several articles have already been published on Qiita. Besides giving an overview of natural language processing, I think the early chapters are useful not only for language processing but also for people who are new to Linux or new to working as engineers.

Chapter 2: UNIX Commands

popular-names.txt is a tab-delimited file that stores the "name", "gender", "number of people", and "year" of babies born in the United States. Create a program that performs each of the following tasks, and run it with popular-names.txt as the input file. Then perform the same task with UNIX commands and check the program's result against them.

10. Counting the number of lines

Count the number of lines. Use the wc command for confirmation.


import pandas as pd

df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
print(len(df))  # number of rows read from the file


wc -l popular-names.txt
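For comparison, here is a minimal sketch of the same count without pandas. The helper name `count_lines` and the sample rows are made up for illustration. It counts newline characters, which is what `wc -l` reports, so it can differ from `len(df)` when the file contains blank lines (which `read_csv` skips by default) or when the last line has no trailing newline.

```python
import io


def count_lines(stream):
    """Count newline bytes in a binary stream, reading it in chunks."""
    return sum(chunk.count(b"\n") for chunk in iter(lambda: stream.read(65536), b""))


# Made-up sample rows in the same tab-delimited layout (not the real data).
sample = b"Alice\tF\t10\t2000\nBob\tM\t8\t2000\nCarol\tF\t7\t2001\n"
print(count_lines(io.BytesIO(sample)))

# With the real file, something like:
# with open("popular-names.txt", "rb") as f:
#     print(count_lines(f))
```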

11. Replace tabs with spaces

Replace each tab character with one space character. Use the sed command, tr command, or expand command for confirmation.


import pandas as pd

df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
df.to_csv('popular-names-space.txt', sep=' ', index=False, header=None)


sed -e $'s/\t/ /g' popular-names.txt > popular-names-space.txt
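As a rough sketch, the replacement can also be done on plain text without parsing the file as a table. The helper name `tabs_to_spaces` and the sample rows are assumptions for illustration. One difference worth knowing: a CSV writer with a space as the separator quotes any field that itself contains a space, while a plain string replacement like this one (and `sed`) does not.

```python
def tabs_to_spaces(lines):
    """Yield each line with every tab replaced by a single space."""
    for line in lines:
        yield line.replace("\t", " ")


# Made-up sample rows (hypothetical, same layout as popular-names.txt).
sample = ["Alice\tF\t10\t2000\n", "Bob\tM\t8\t2000\n"]
print("".join(tabs_to_spaces(sample)), end="")

# With the real file, something like:
# with open("popular-names.txt") as src, open("popular-names-space.txt", "w") as dst:
#     dst.writelines(tabs_to_spaces(src))
```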

12. Save the first column in col1.txt and the second column in col2.txt

Save only the first column of each row as col1.txt and the second column as col2.txt. Use the cut command for confirmation.


import pandas as pd

df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
df[0].to_csv('col1.txt', index=False, header=None)
df[1].to_csv('col2.txt', index=False, header=None)


cut -f 1 popular-names.txt > col1.txt
cut -f 2 popular-names.txt > col2.txt
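The column extraction can be sketched in plain Python as follows. The helper name `cut_column` and the sample rows are made up for illustration. Note that `cut -f` numbers fields from 1, while the index here starts at 0.

```python
def cut_column(lines, index, delimiter="\t"):
    """Yield the field at zero-based `index` from each delimited line."""
    for line in lines:
        yield line.rstrip("\n").split(delimiter)[index]


# Made-up sample rows (hypothetical, same layout as popular-names.txt).
sample = ["Alice\tF\t10\t2000\n", "Bob\tM\t8\t2000\n"]
print(list(cut_column(sample, 0)))  # first column, like `cut -f 1`
print(list(cut_column(sample, 1)))  # second column, like `cut -f 2`
```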

13. Merge col1.txt and col2.txt

Combine the col1.txt and col2.txt created in 12 into a text file in which the first and second columns of the original file are separated by tabs. Use the paste command for confirmation.


import pandas as pd

df1 = pd.read_csv('col1.txt', header=None)
df2 = pd.read_csv('col2.txt', header=None)
df_concat = pd.concat([df1, df2], axis=1)
df_concat.to_csv('col3.txt', sep='\t', index=False, header=None)


paste col1.txt col2.txt > col3.txt
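A minimal sketch of the same merge with the standard library is shown below. The helper name `paste` and the sample values are assumptions for illustration. It uses `itertools.zip_longest` so that, like the `paste` command, a shorter input is padded with empty fields instead of cutting the output short.

```python
from itertools import zip_longest


def paste(*columns, delimiter="\t"):
    """Join the i-th line of every input with the delimiter, like `paste`."""
    for fields in zip_longest(*columns, fillvalue=""):
        yield delimiter.join(field.rstrip("\n") for field in fields)


# Made-up sample columns (hypothetical contents of col1.txt and col2.txt).
names = ["Alice\n", "Bob\n"]
genders = ["F\n", "M\n"]
print("\n".join(paste(names, genders)))
```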

14. Output N lines from the beginning

Receive the natural number N by means such as command line arguments, and display only the first N lines of the input. Use the head command for confirmation.


import pandas as pd
n = int(input())  # read N from standard input

df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
print(df.head(n))  # first N rows


head -5 popular-names.txt
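Since only the first N lines are needed, the whole file does not have to be loaded. Here is a small sketch using `itertools.islice`; the helper name `head` and the sample rows are made up for illustration, and reading N from `sys.argv` is just one possible way to receive the argument.

```python
from itertools import islice


def head(lines, n):
    """Return the first n lines without consuming the rest of the input."""
    return list(islice(lines, n))


# Made-up sample rows (hypothetical, same layout as popular-names.txt).
sample = ["Alice\tF\t10\t2000\n", "Bob\tM\t8\t2000\n", "Carol\tF\t7\t2001\n"]
print("".join(head(iter(sample), 2)), end="")

# With the real file and a command line argument, something like:
# import sys
# with open("popular-names.txt") as f:
#     sys.stdout.writelines(head(f, int(sys.argv[1])))
```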

15. Output the last N lines

Receive the natural number N by means such as a command line argument, and display only the last N lines of the input. Use the tail command for confirmation.


import pandas as pd
n = int(input())  # read N from standard input

df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
print(df.tail(n))  # last N rows


tail -5 popular-names.txt
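For the last N lines, one possible sketch is a `collections.deque` with `maxlen`, which keeps only the most recent N lines while reading through the input once. The helper name `tail` and the sample rows are assumptions for illustration.

```python
from collections import deque


def tail(lines, n):
    """Return the last n lines, holding at most n lines in memory."""
    return list(deque(lines, maxlen=n))


# Made-up sample rows (hypothetical, same layout as popular-names.txt).
sample = ["Alice\tF\t10\t2000\n", "Bob\tM\t8\t2000\n", "Carol\tF\t7\t2001\n"]
print("".join(tail(sample, 2)), end="")
```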

16. Split the file into N parts

Receive the natural number N by means such as command line arguments, and split the input file into N parts on line boundaries. Achieve the same processing with the split command.


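import pandas as pd
n = int(input())  # number of lines per output file

df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)

# Write consecutive n-line chunks, starting at row 0 so the first chunk is not skipped
for i, start in enumerate(range(0, len(df), n)):
    df.iloc[start:start + n].to_csv('popular-names' + str(i) + '.txt', sep='\t', index=False, header=False)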

This is not very elegant, but I could not find a concise way to split a DataFrame every n rows. Note that it writes files of n lines each, like split -l below, rather than exactly N files.


split -l 200 popular-names.txt popular-names-
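The task statement asks for N parts rather than N lines per part, so here is a hedged sketch of that reading. The helper name `split_into_parts` and the sample data are made up for illustration. It spreads the remainder over the first parts so that part sizes differ by at most one line. As far as I know, GNU `split -n l/N` is the closest command-line counterpart, though it balances parts by byte size rather than by line count, so the results need not match exactly.

```python
def split_into_parts(lines, n):
    """Split lines into n consecutive parts whose sizes differ by at most one."""
    lines = list(lines)
    size, extra = divmod(len(lines), n)
    parts, start = [], 0
    for i in range(n):
        # The first `extra` parts each take one additional line.
        end = start + size + (1 if i < extra else 0)
        parts.append(lines[start:end])
        start = end
    return parts


# Made-up sample lines (hypothetical stand-in for the rows of the file).
sample = ["a\n", "b\n", "c\n", "d\n", "e\n", "f\n", "g\n"]
for i, part in enumerate(split_into_parts(sample, 3)):
    print(i, len(part))
```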

17. Distinct strings in the first column

Find the distinct strings in the first column (the set of different strings). Use the cut, sort, and uniq commands for confirmation.


import pandas as pd

df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
print(df[0].unique())  # distinct values in the first column


cut -f 1 popular-names.txt | sort | uniq

`uniq` only collapses adjacent duplicate lines, so the input needs to be sorted in advance.
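In Python a set does the deduplication regardless of order, so sorting is only needed for presentation. A minimal sketch follows, with the helper name `distinct_first_column` and the sample rows made up for illustration. The sorted order here is by code point, which may differ from what `sort` prints under some locales.

```python
def distinct_first_column(lines, delimiter="\t"):
    """Return the sorted set of values found in the first column."""
    return sorted({line.rstrip("\n").split(delimiter, 1)[0] for line in lines})


# Made-up sample rows (hypothetical, same layout as popular-names.txt).
sample = ["Bob\tM\t8\t2000\n", "Alice\tF\t10\t2000\n", "Bob\tM\t9\t2001\n"]
print(distinct_first_column(sample))
```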

18. Sort each row in descending order of the numbers in the third column

Sort the rows in descending order of the number in the third column (note: reorder the rows without changing the contents of each row). Use the sort command for confirmation (for this problem, the result does not have to match the output of the command).


import pandas as pd

df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
print(df.sort_values(2, ascending=False))


sort -n -r -k 3 popular-names.txt | head -10
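The reason the results need not match is mostly how ties are ordered. The sketch below, with a made-up helper name `sort_by_count` and made-up sample rows, uses Python's `sorted`, which is stable, so rows with the same count keep their original relative order. To my understanding, GNU `sort` breaks ties by comparing the whole line unless `-s` is given, and the default `sort_values` algorithm in pandas is not guaranteed to be stable, so each of the three can order equal counts differently.

```python
def sort_by_count(lines, delimiter="\t"):
    """Sort rows by the integer in the third column, largest first (stable)."""
    return sorted(lines, key=lambda line: int(line.split(delimiter)[2]), reverse=True)


# Made-up sample rows (hypothetical); Alice and Carol tie on the third column.
sample = ["Alice\tF\t10\t2000\n", "Bob\tM\t8\t2000\n", "Carol\tF\t10\t2001\n"]
print("".join(sort_by_count(sample)), end="")
```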

19. Find the frequency of each string in the first column and sort in descending order of frequency

Count how often each string appears in the first column and display the strings in descending order of frequency. Use the cut, uniq, and sort commands for confirmation.


import pandas as pd

df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
print(df[0].value_counts())  # counts per name, most frequent first


cut -f 1 popular-names.txt | sort | uniq -c | sort -n -r -k 1 | head -10
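The `sort | uniq -c | sort -n -r` pipeline corresponds fairly directly to `collections.Counter`. Here is a small sketch, with the helper name `name_frequencies` and the sample rows made up for illustration. `most_common()` returns (value, count) pairs from most to least frequent.

```python
from collections import Counter


def name_frequencies(lines, delimiter="\t"):
    """Count first-column values and return them most frequent first."""
    counts = Counter(line.rstrip("\n").split(delimiter, 1)[0] for line in lines)
    return counts.most_common()


# Made-up sample rows (hypothetical, same layout as popular-names.txt).
sample = ["Alice\tF\t10\t2000\n", "Bob\tM\t8\t2000\n", "Alice\tF\t9\t2001\n"]
for name, count in name_frequencies(sample):
    print(count, name)
```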

In conclusion

What you can learn in Chapter 2
