I tried Language Processing 100 Knock 2020. You can see the links to the other chapters from here, and the source code from here. (I have not verified the confirmation steps that use UNIX commands.)
Count the number of lines. Use the wc command for confirmation.
010.py
path = "popular-names.txt"
with open(path) as file:
    print(len(file.readlines()))
# -> 2780
I used the `with` block because it was tedious to write `close()` at the end of the file operations.
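For reference, what the `with` block saves us from writing is roughly this (a sketch using a throwaway temporary file so it runs standalone):

```python
import os
import tempfile

# Create a throwaway two-line file so the sketch is self-contained.
tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False)
tmp.write("Mary\tF\t7065\t1880\nAnna\tF\t2604\t1880\n")
tmp.close()

file = open(tmp.name)
try:
    n_lines = len(file.readlines())
finally:
    file.close()  # `with open(...) as file:` calls this for us
print(n_lines)  # -> 2
os.remove(tmp.name)
```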
Replace each tab character with one space character. Use the sed command, tr command, or expand command for confirmation.
011.py
path = "popular-names.txt"
with open(path) as file:
    print(file.read().replace("\t", " "), end="")
# -> Mary F 7065 1880
# Anna F 2604 1880
# Emma F 2003 1880 ...
Save only the first column of each row as col1.txt, and only the second column as col2.txt. Use the cut command for confirmation.
012.py
path = "popular-names.txt"
path_col1 = "col1_012.txt"
path_col2 = "col2_012.txt"
with open(path) as file:
    with open(path_col1, mode="w") as col1:
        with open(path_col2, mode="w") as col2:
            item_split = [item.split("\t") for item in file.readlines()]
            for item in item_split:
                col1.write(item[0] + "\n")
                col2.write(item[1] + "\n")
# col1.txt
# -> Mary
# Anna...
# col2.txt
# -> F
# F...
How a file is operated on is specified by the `mode` argument of `open()`. The default is `mode='r'`, but perhaps it is better to write it explicitly rather than omit it...?
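A minimal sketch of the `mode` argument in action (writing and then reading a throwaway temp file; the file name is invented):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with open(path, mode="w") as f:   # "w": create/truncate for writing
    f.write("hello\n")
with open(path) as f:             # mode omitted -> defaults to "r"
    text = f.read()

print(text, end="")  # -> hello
```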
Combine col1.txt and col2.txt created in problem 12 to create a text file in which the first and second columns of the original file are arranged tab-delimited. Use the paste command for confirmation.
013.py
path_col1 = "col1_012.txt"
path_col2 = "col2_012.txt"
path_merge = "merge.txt"
with open(path_col1) as col1:
    col1_list = col1.readlines()
with open(path_col2) as col2:
    col2_list = col2.readlines()
with open(path_merge, mode="w") as mrg:
    for i in range(len(col1_list)):
        mrg.write(col1_list[i].replace("\n", "") + "\t" + col2_list[i])
# merge.txt
# -> Mary F
# Anna F
# Emma F
Another person's answer used `zip()` to generate the merged lines. My answer runs into trouble when `len(col1_list) > len(col2_list)`, so that approach is smarter.
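The `zip()`-based version could look something like this (my reconstruction of that answer, with the column lists inlined for the sketch):

```python
# Inlined stand-ins for the col1_list / col2_list read in 013.py.
col1_list = ["Mary\n", "Anna\n", "Emma\n"]
col2_list = ["F\n", "F\n"]  # deliberately shorter

# zip() stops at the shorter input, so there is no IndexError even when
# the lengths differ (unlike indexing with range(len(col1_list))).
merged = [a.rstrip("\n") + "\t" + b for a, b in zip(col1_list, col2_list)]
print("".join(merged), end="")
```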
Receive the natural number N by means such as a command line argument, and display only the first N lines of the input. Use the head command for confirmation.
014.py
import sys
N = int(sys.argv[2])
with open(sys.argv[1]) as file:
    for i in range(N):
        print(file.readline().replace("\n", ""))
# python 014.py popular-names.txt 3
# -> Mary F 7065 1880
# Anna F 2604 1880
# Emma F 2003 1880
It seems that you can get the command-line arguments as a list by using `sys.argv` from the `sys` module.
Receive the natural number N by means such as command line arguments and display only the last N lines of the input. Use the tail command for confirmation.
015.py
import sys
import pandas as pd
N = int(sys.argv[1])
path = "popular-names.txt"
df = pd.read_table(path, header=None)
print(df.tail(N))
# -> 0 1 2 3
# 2775 Benjamin M 13381 2018
# 2776 Elijah M 12886 2018
# 2777 Lucas M 12585 2018
# 2778 Mason M 12435 2018
# 2779 Logan M 12352 2018
Those in the know may find this old news, but there is a library called pandas that is convenient for data processing, so I tried using it. `read_csv(path, sep="\t")` would also have worked, but `read_table` is simpler, isn't it?
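As a quick check of that claim, the two readers produce identical DataFrames (a sketch with a few sample rows inlined via `io.StringIO` instead of the actual file):

```python
import io
import pandas as pd

# Two sample rows inlined so the sketch is self-contained.
data = "Mary\tF\t7065\t1880\nAnna\tF\t2604\t1880\n"

df_a = pd.read_table(io.StringIO(data), header=None)
df_b = pd.read_csv(io.StringIO(data), sep="\t", header=None)

print(df_a.equals(df_b))  # -> True
```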
Receive the natural number N by means such as command-line arguments, and split the input file line by line into N parts. Achieve the same processing with the split command.
016.py
import pandas as pd
import sys
N = int(sys.argv[1])
path = "popular-names.txt"
df = pd.read_table(path, header=None)
col_n = -(-len(df) // N)  # ceiling division: number of rows per chunk
for i in range(N):
    print(df.iloc[col_n * i : col_n * (i + 1), :])
# python 016.py 2
# -> 0 1 2 3
# 0 Mary F 7065 1880
# 1 Anna F 2604 1880
# ... ... .. ... ...
# 1389 Sharon F 25711 1949
#
# [1390 rows x 4 columns]
# 0 1 2 3
# 1390 James M 86857 1949
# 1391 Robert M 83872 1949
# ... ... .. ... ...
# 2779 Logan M 12352 2018
#
# [1390 rows x 4 columns]
For the output, I used `iloc` because I wanted to select multiple rows of `df` by index.
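As an alternative sketch for 016, `numpy.array_split` computes the uneven chunk boundaries automatically; splitting an index array and feeding it to `iloc` keeps it simple (sample rows are inlined instead of popular-names.txt):

```python
import io
import numpy as np
import pandas as pd

# Three sample rows inlined so the sketch is self-contained.
data = "Mary\tF\t7065\t1880\nAnna\tF\t2604\t1880\nEmma\tF\t2003\t1880\n"
df = pd.read_table(io.StringIO(data), header=None)

N = 2
# array_split allows uneven chunks: here rows [0, 1] and then [2].
idx_chunks = np.array_split(np.arange(len(df)), N)
for idx in idx_chunks:
    print(df.iloc[idx])
```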
Find the kinds of strings in the first column, that is, the set of distinct strings. Use the cut, sort, and uniq commands for confirmation.
017.py
import pandas as pd
path = "popular-names.txt"
df = pd.read_table(path, header=None)
print(df[0].unique())
# -> ['Mary' 'Anna' 'Emma' 'Elizabeth' 'Minnie' 'Margaret' 'Ida' 'Alice'...
`unique()` returns the unique elements as a NumPy `ndarray`. The number of unique elements can be obtained with `df[0].nunique()` as well as with `len(df[0].unique())`.
Sort the lines in descending order of the numbers in the third column (Note: rearrange the lines without changing their contents). Use the sort command for confirmation (for this problem, the result does not have to match the command's output exactly).
018.py
import pandas as pd
path = "popular-names.txt"
df = pd.read_table(path, header=None)
print(df.sort_values(2, ascending=False))
# -> 0 1 2 3
# 1340 Linda F 99689 1947
# 1360 Linda F 96211 1948
# 1350 James M 94757 1947...
Find the frequency of occurrence of each string in the first column, and display the strings in descending order of frequency. Use the cut, uniq, and sort commands for confirmation.
019.py
import pandas as pd
path = "popular-names.txt"
df = pd.read_table(path, header=None)
print(df[0].value_counts())
# -> James 118
# William 111
# John 108
`value_counts()` outputs the unique elements and their counts as a `pandas.Series`. It's confusing that `unique()` gives a list of the unique elements, `nunique()` gives the total number of unique elements, and `value_counts()` gives the frequency of each element.
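To keep the three apart, here is a minimal sketch on a small hand-made Series (the data is invented for illustration):

```python
import pandas as pd

# A tiny hand-made Series to contrast the three methods discussed above.
s = pd.Series(["Mary", "Anna", "Mary", "Emma", "Mary"])

print(s.unique())        # ndarray of distinct values: ['Mary' 'Anna' 'Emma']
print(s.nunique())       # -> 3 (how many distinct values there are)
print(s.value_counts())  # counts per value, sorted in descending order
```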