The 2020 edition of 100 Language Processing Knocks, a well-known collection of natural language processing exercises, has been released. Of Chapters 1 to 10 listed below, this article summarizes the results of solving Chapter 2: UNIX Commands.

- Chapter 1: Preparatory Movement
- Chapter 2: UNIX Commands
- Chapter 3: Regular Expressions
- Chapter 4: Morphological Analysis
- Chapter 5: Dependency Analysis
- Chapter 6: Machine Learning
- Chapter 7: Word Vectors
- Chapter 8: Neural Networks
- Chapter 9: RNN, CNN
- Chapter 10: Machine Translation
We use Google Colaboratory for the answers. For details on how to set up and use Google Colaboratory, see this article. A notebook containing the execution results of the answers below is available on GitHub.

popular-names.txt is a tab-delimited file that stores the name, gender, number of babies, and year for babies born in the United States. Create programs that perform the following processing, and run them with popular-names.txt as the input file. Furthermore, perform the same processing with UNIX commands and check each program's result against the command's output.
First, download the specified data. If you execute the following command on the cell of Google Colaboratory, the target text file will be downloaded to the current directory.
!wget https://nlp100.github.io/data/popular-names.txt
Reference: [wget] command: download a file by specifying a URL
Count the number of lines. Use the wc command for confirmation.
In this chapter, we first read the file as a pandas DataFrame and then process each question against it. As the problem statements instruct, we also check each result with the corresponding UNIX command.
import pandas as pd
df = pd.read_table('./popular-names.txt', header=None, sep='\t', names=['name', 'sex', 'number', 'year'])
print(len(df))
output
2780
#Verification
!wc -l ./popular-names.txt
output
2780 ./popular-names.txt
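The same count can also be obtained in plain Python without pandas. A minimal sketch on inline sample text (the real input would be the downloaded file):

```python
# Count lines by iterating over them, as `wc -l` effectively does.
sample = "Mary\tF\t7065\t1880\nAnna\tF\t2604\t1880\nEmma\tF\t2003\t1880\n"
line_count = sum(1 for line in sample.splitlines())
print(line_count)  # → 3
```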
References:
- Read csv / tsv files with pandas
- Get the number of rows, columns, and total elements (size) with pandas
- [cat] command: easily check the contents of a file (https://www.atmarkit.co.jp/ait/articles/1602/25/news034.html)
- [wc] command: count the number of characters and lines in a text file
Replace each tab character with one space character. Use the sed command, tr command, or expand command for confirmation.
Since this question assumes replacing the tabs that delimit the original data, we do not apply it to the already-loaded DataFrame; we only perform the check with the command.
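That said, the sed replacement is easy to mirror in plain Python with str.replace. A sketch on inline sample text:

```python
# Replace every tab with a single space, as `sed -e 's/\t/ /g'` does.
sample = "Mary\tF\t7065\t1880\nAnna\tF\t2604\t1880\n"
replaced = sample.replace('\t', ' ')
print(replaced)
```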
#Verification
!sed -e 's/\t/ /g' ./popular-names.txt | head -n 5
output
Mary F 7065 1880
Anna F 2604 1880
Emma F 2003 1880
Elizabeth F 1939 1880
Minnie F 1746 1880
Reference: [sed] command (basics, part 4): replace strings and output the replaced lines
Save only the first column of each row as col1.txt and only the second column as col2.txt. Use the cut command for confirmation.
col1 = df['name']
col1.to_csv('./col1.txt', index=False)
print(col1.head())
output
0 Mary
1 Anna
2 Emma
3 Elizabeth
4 Minnie
Name: name, dtype: object
#Verification
!cut -f 1 ./popular-names.txt > ./col1_chk.txt
!cat ./col1_chk.txt | head -n 5
output
Mary
Anna
Emma
Elizabeth
Minnie
col2 = df['sex']
col2.to_csv('./col2.txt', index=False)
print(col2.head())
output
0 F
1 F
2 F
3 F
4 F
Name: sex, dtype: object
#Verification
!cut -f 2 ./popular-names.txt > ./col2_chk.txt
!cat ./col2_chk.txt | head -n 5
output
F
F
F
F
F
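For reference, the same column extraction can be written without pandas by splitting each line on the tab delimiter. A sketch on inline sample rows:

```python
# Extract the first and second tab-separated fields,
# as `cut -f 1` and `cut -f 2` do.
rows = ["Mary\tF\t7065\t1880", "Anna\tF\t2604\t1880"]
col1 = [r.split('\t')[0] for r in rows]
col2 = [r.split('\t')[1] for r in rows]
print(col1)  # → ['Mary', 'Anna']
print(col2)  # → ['F', 'F']
```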
References:
- Select rows / columns of pandas.DataFrame by index reference
- Export a csv file with pandas
- [cut] command: cut out fields from each line by fixed length or delimiter
- Save a command's execution result / standard output to a file
Combine col1.txt and col2.txt created in problem 12 to create a text file in which the first and second columns of the original file are arranged tab-delimited. Use the paste command for confirmation.
col1 = pd.read_table('./col1.txt')
col2 = pd.read_table('./col2.txt')
merged_1_2 = pd.concat([col1, col2], axis=1)
merged_1_2.to_csv('./merged_1_2.txt', sep='\t', index=False)
print(merged_1_2.head())
output
name sex
0 Mary F
1 Anna F
2 Emma F
3 Elizabeth F
4 Minnie F
#Verification
!paste ./col1_chk.txt ./col2_chk.txt | head -n 5
output
Mary F
Anna F
Emma F
Elizabeth F
Minnie F
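The paste behavior can also be reproduced with zip, joining the two column files line by line. A sketch on inline lists:

```python
# Join corresponding lines with a tab, as `paste col1 col2` does.
col1 = ["Mary", "Anna", "Emma"]
col2 = ["F", "F", "F"]
merged = ["\t".join(pair) for pair in zip(col1, col2)]
print(merged[0])  # → Mary	F
```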
References:
- Concatenate pandas.DataFrame / Series
- [paste] command: concatenate multiple files line by line
Receive the natural number N by means such as a command line argument, and display only the first N lines of the input. Use the head command for confirmation.
def output_head(N):
    print(df.head(N))

output_head(5)
output
name sex number year
0 Mary F 7065 1880
1 Anna F 2604 1880
2 Emma F 2003 1880
3 Elizabeth F 1939 1880
4 Minnie F 1746 1880
#Verification
!head -n 5 ./popular-names.txt
output
Mary F 7065 1880
Anna F 2604 1880
Emma F 2003 1880
Elizabeth F 1939 1880
Minnie F 1746 1880
References:
- Define and call a function in Python
- Get the first and last rows of pandas.DataFrame / Series
- [head] / [tail] commands: display only the beginning / end of a long text file (https://www.atmarkit.co.jp/ait/articles/1603/07/news023.html)
Receive the natural number N by means such as command line arguments and display only the last N lines of the input. Use the tail command for confirmation.
def output_tail(N):
    print(df.tail(N))

output_tail(5)
output
name sex number year
2775 Benjamin M 13381 2018
2776 Elijah M 12886 2018
2777 Lucas M 12585 2018
2778 Mason M 12435 2018
2779 Logan M 12352 2018
#Verification
!tail -n 5 ./popular-names.txt
output
Benjamin M 13381 2018
Elijah M 12886 2018
Lucas M 12585 2018
Mason M 12435 2018
Logan M 12352 2018
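Similarly, tail can be emulated in constant memory with collections.deque, which keeps only the last N items seen. A sketch:

```python
from collections import deque

# Keep only the last N items, as `tail -n N` does.
lines = ["Mary", "Anna", "Emma", "Elizabeth", "Minnie"]
last2 = list(deque(lines, maxlen=2))
print(last2)  # → ['Elizabeth', 'Minnie']
```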
Receive a natural number N by means such as a command line argument, and split the input file line by line into N parts. Achieve the same processing with the split command.

There are various ways to do this; here we apply qcut, which computes N quantiles, to the serial number of each record, adding a flag column that divides the file into N parts.
def split_file(N):
    tmp = df.reset_index(drop=False)
    df_cut = pd.qcut(tmp.index, N, labels=[i for i in range(N)])
    df_cut = pd.concat([df, pd.Series(df_cut, name='sp')], axis=1)
    return df_cut
df_cut = split_file(10)
print(df_cut['sp'].value_counts())
output
9 278
8 278
7 278
6 278
5 278
4 278
3 278
2 278
1 278
0 278
Name: sp, dtype: int64
print(df_cut.head())
output
name sex number year sp
0 Mary F 7065 1880 0
1 Anna F 2604 1880 0
2 Emma F 2003 1880 0
3 Elizabeth F 1939 1880 0
4 Minnie F 1746 1880 0
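An alternative to the qcut approach (not in the original answer) is numpy.array_split, which divides a sequence into N nearly equal parts, with chunk sizes differing by at most one. A minimal sketch:

```python
import numpy as np

# Split 10 row indices into 3 nearly equal chunks.
chunks = np.array_split(range(10), 3)
sizes = [len(c) for c in chunks]
print(sizes)  # → [4, 3, 3]
```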
#Split by command (split -l divides by line count; GNU split also supports -n l/N to divide into N parts)
!split -l 200 -d ./popular-names.txt sp
References:
- Reindex pandas.DataFrame / Series
- Binning with the pandas cut and qcut functions
- Count the number and frequency of unique elements in pandas
- [split] command: split files
Find the distinct strings in the first column (the set of unique strings). Use the cut, sort, and uniq commands for confirmation.
print(len(df.drop_duplicates(subset='name')))
output
136
#Verification
!cut -f 1 ./popular-names.txt | sort | uniq | wc -l
output
136
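The same count can be obtained in plain Python with a set comprehension over the first field. A sketch on inline rows:

```python
# Collect the distinct first-column strings,
# as `cut -f 1 | sort | uniq` does.
rows = ["Mary\tF", "Anna\tF", "Mary\tM", "Emma\tF"]
unique_names = {r.split('\t')[0] for r in rows}
print(len(unique_names))  # → 3
```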
References:
- Extract / delete duplicate rows of pandas.DataFrame / Series
- [sort] command: sort a text file line by line
- [uniq] command: delete duplicate lines
Sort the rows in descending order of the numeric value in the third column (Note: leave the contents of each line unchanged). Use the sort command for confirmation (for this problem, the result does not have to match the command's output exactly).
df.sort_values(by='number', ascending=False, inplace=True)
print(df.head())
output
name sex number year
1340 Linda F 99689 1947
1360 Linda F 96211 1948
1350 James M 94757 1947
1550 Michael M 92704 1957
1351 Robert M 91640 1947
#Verification
!cat ./popular-names.txt | sort -rnk 3 | head -n 5
output
Linda F 99689 1947
Linda F 96211 1948
James M 94757 1947
Michael M 92704 1957
Robert M 91640 1947
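Without pandas, the numeric descending sort can be done with sorted and a key that parses the third field as an integer. A sketch on inline rows:

```python
# Sort rows by the third tab-separated field, numerically descending,
# as `sort -rnk 3` does.
rows = ["Anna\tF\t2604\t1880", "Mary\tF\t7065\t1880", "Emma\tF\t2003\t1880"]
rows_sorted = sorted(rows, key=lambda r: int(r.split('\t')[2]), reverse=True)
print(rows_sorted[0])  # → Mary	F	7065	1880
```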
Find the frequency of occurrence of each string in the first column, and display the strings in descending order of frequency. Use the cut, uniq, and sort commands for confirmation.
print(df['name'].value_counts())
output
James 118
William 111
Robert 108
John 108
Mary 92
...
Crystal 1
Rachel 1
Scott 1
Lucas 1
Carolyn 1
Name: name, Length: 136, dtype: int64
#Verification
!cut -f 1 ./popular-names.txt | sort | uniq -c | sort -rn
output
118 James
111 William
108 Robert
108 John
92 Mary
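value_counts can be mirrored in plain Python with collections.Counter, whose most_common method matches the `sort | uniq -c | sort -rn` pipeline. A sketch:

```python
from collections import Counter

# Count occurrences and list them in descending order of frequency.
names = ["James", "William", "James", "John", "James"]
top2 = Counter(names).most_common(2)
print(top2)  # → [('James', 3), ('William', 1)]
```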
100 Language Processing Knocks is designed so that you can learn not only natural language processing itself but also basic data handling and general-purpose machine learning. Even those studying machine learning through online courses will find it an excellent source of practice problems, so please give it a try.