I tried Language Processing 100 Knock 2020. You can see the links to the other chapters from here, and the source code from here. (I have not verified the confirmation steps that use UNIX commands.)
Count the number of lines. Use the wc command for confirmation.
010.py
path = "popular-names.txt"
with open(path) as file:
    print(len(file.readlines()))
# -> 2780
I used the `with` block because it was tedious to write `close()` at the end of the file operations.
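For reference, what the `with` block saves us from writing is roughly this (a sketch using a throwaway temporary file so it runs standalone):

```python
import os
import tempfile

# Create a throwaway two-line file so the sketch is self-contained.
tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False)
tmp.write("Mary\tF\t7065\t1880\nAnna\tF\t2604\t1880\n")
tmp.close()

file = open(tmp.name)
try:
    n_lines = len(file.readlines())
finally:
    file.close()  # `with open(...) as file:` calls this for us
print(n_lines)  # -> 2
os.remove(tmp.name)
```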
Replace each tab character with one space character. Use the sed command, tr command, or expand command for confirmation.
011.py
path = "popular-names.txt"
with open(path) as file:
    print(file.read().replace("\t", " "), end="")
# -> Mary F 7065 1880
# Anna F 2604 1880
# Emma F 2003 1880 ...
Save only the first column of each row as col1.txt, and only the second column as col2.txt. Use the cut command for confirmation.
012.py
path = "popular-names.txt"
path_col1 = "col1_012.txt"
path_col2 = "col2_012.txt"
with open(path) as file:
    with open(path_col1, mode="w") as col1:
        with open(path_col2, mode="w") as col2:
            item_split = [item.split("\t") for item in file.readlines()]
            for item in item_split:
                col1.write(item[0] + "\n")
                col2.write(item[1] + "\n")
# col1.txt
# -> Mary
# Anna...
# col2.txt
# -> F
# F...
How a file is operated on is specified by the `mode` argument of `open()`. The default is `mode='r'`, but perhaps it is better to write it explicitly rather than omit it...?
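A minimal sketch of the `mode` argument in action (writing and then reading a throwaway temp file; the file name is invented):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with open(path, mode="w") as f:   # "w": create/truncate for writing
    f.write("hello\n")
with open(path) as f:             # mode omitted -> defaults to "r"
    text = f.read()

print(text, end="")  # -> hello
```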
Combine col1.txt and col2.txt created in problem 12 to create a text file in which the first and second columns of the original file are arranged tab-delimited. Use the paste command for confirmation.
013.py
path_col1 = "col1_012.txt"
path_col2 = "col2_012.txt"
path_merge = "merge.txt"
with open(path_col1) as col1:
    col1_list = col1.readlines()
with open(path_col2) as col2:
    col2_list = col2.readlines()
with open(path_merge, mode="w") as mrg:
    for i in range(len(col1_list)):
        mrg.write(col1_list[i].replace("\n", "") + "\t" + col2_list[i])
# merge.txt
# -> Mary F
# Anna F
# Emma F
Another person's answer used `zip()` to generate the merged lines. My answer runs into trouble when `len(col1_list) > len(col2_list)`, so that approach is smarter.
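The `zip()`-based version could look something like this (my reconstruction of that answer, with the column lists inlined for the sketch):

```python
# Inlined stand-ins for the col1_list / col2_list read in 013.py.
col1_list = ["Mary\n", "Anna\n", "Emma\n"]
col2_list = ["F\n", "F\n"]  # deliberately shorter

# zip() stops at the shorter input, so there is no IndexError even when
# the lengths differ (unlike indexing with range(len(col1_list))).
merged = [a.rstrip("\n") + "\t" + b for a, b in zip(col1_list, col2_list)]
print("".join(merged), end="")
```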
Receive the natural number N by means such as a command line argument, and display only the first N lines of the input. Use the head command for confirmation.
014.py
import sys
N = int(sys.argv[2])
with open(sys.argv[1]) as file:
    for i in range(N):
        print(file.readline().replace("\n", ""))
# python 014.py popular-names.txt 3
# -> Mary F 7065 1880
# Anna F 2604 1880
# Emma F 2003 1880
It seems that you can get the command-line arguments as a list by using `sys.argv` from the `sys` module.
Receive the natural number N by means such as command line arguments and display only the last N lines of the input. Use the tail command for confirmation.
015.py
import sys
import pandas as pd
N = int(sys.argv[1])
path = "popular-names.txt"
df = pd.read_table(path, header=None)
print(df.tail(N))
# -> 0 1 2 3
# 2775 Benjamin M 13381 2018
# 2776 Elijah M 12886 2018
# 2777 Lucas M 12585 2018
# 2778 Mason M 12435 2018
# 2779 Logan M 12352 2018
Those in the know may find this old news, but there is a library called pandas that is convenient for data processing, so I tried using it. `read_csv(path, sep="\t")` would also have worked, but `read_table` is simpler, isn't it?
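As a quick check of that claim, the two readers produce identical DataFrames (a sketch with a few sample rows inlined via `io.StringIO` instead of the actual file):

```python
import io
import pandas as pd

# Two sample rows inlined so the sketch is self-contained.
data = "Mary\tF\t7065\t1880\nAnna\tF\t2604\t1880\n"

df_a = pd.read_table(io.StringIO(data), header=None)
df_b = pd.read_csv(io.StringIO(data), sep="\t", header=None)

print(df_a.equals(df_b))  # -> True
```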
Receive the natural number N by means such as command-line arguments, and split the input file line by line into N parts. Achieve the same processing with the split command.
016.py
import pandas as pd
import sys
N = int(sys.argv[1])
path = "popular-names.txt"
df = pd.read_table(path, header=None)
col_n = -(-len(df) // N)  # ceiling division: number of rows per chunk
for i in range(N):
    print(df.iloc[col_n * i : col_n * (i + 1), :])
# python 016.py 2
# -> 0 1 2 3
# 0 Mary F 7065 1880
# 1 Anna F 2604 1880
# ... ... .. ... ...
# 1389 Sharon F 25711 1949
#
# [1390 rows x 4 columns]
# 0 1 2 3
# 1390 James M 86857 1949
# 1391 Robert M 83872 1949
# ... ... .. ... ...
# 2779 Logan M 12352 2018
#
# [1390 rows x 4 columns]
For the output, I used `iloc` because I wanted to select multiple rows of `df` by index.
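As an alternative sketch for 016, `numpy.array_split` computes the uneven chunk boundaries automatically; splitting an index array and feeding it to `iloc` keeps it simple (sample rows are inlined instead of popular-names.txt):

```python
import io
import numpy as np
import pandas as pd

# Three sample rows inlined so the sketch is self-contained.
data = "Mary\tF\t7065\t1880\nAnna\tF\t2604\t1880\nEmma\tF\t2003\t1880\n"
df = pd.read_table(io.StringIO(data), header=None)

N = 2
# array_split allows uneven chunks: here rows [0, 1] and then [2].
idx_chunks = np.array_split(np.arange(len(df)), N)
for idx in idx_chunks:
    print(df.iloc[idx])
```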
Find the kinds of strings in the first column, that is, the set of distinct strings. Use the cut, sort, and uniq commands for confirmation.
017.py
import pandas as pd
path = "popular-names.txt"
df = pd.read_table(path, header=None)
print(df[0].unique())
# -> ['Mary' 'Anna' 'Emma' 'Elizabeth' 'Minnie' 'Margaret' 'Ida' 'Alice'...
`unique()` returns the unique elements as a NumPy `ndarray`. The number of unique elements can be obtained with `df[0].nunique()` as well as with `len(df[0].unique())`.
Sort the lines in descending order of the numbers in the third column (Note: rearrange the lines without changing their contents). Use the sort command for confirmation (for this problem, the result does not have to match the command's output exactly).
018.py
import pandas as pd
path = "popular-names.txt"
df = pd.read_table(path, header=None)
print(df.sort_values(2, ascending=False))
# -> 0 1 2 3
# 1340 Linda F 99689 1947
# 1360 Linda F 96211 1948
# 1350 James M 94757 1947...
Find the frequency of occurrence of each string in the first column, and display the strings in descending order of frequency. Use the cut, uniq, and sort commands for confirmation.
019.py
import pandas as pd
path = "popular-names.txt"
df = pd.read_table(path, header=None)
print(df[0].value_counts())
# -> James 118
# William 111
# John 108
`value_counts()` outputs the unique elements and their counts as a `pandas.Series`. It's confusing that `unique()` gives a list of the unique elements, `nunique()` gives the total number of unique elements, and `value_counts()` gives the frequency of each element.
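To keep the three apart, here is a minimal sketch on a small hand-made Series (the data is invented for illustration):

```python
import pandas as pd

# A tiny hand-made Series to contrast the three methods discussed above.
s = pd.Series(["Mary", "Anna", "Mary", "Emma", "Mary"])

print(s.unique())        # ndarray of distinct values: ['Mary' 'Anna' 'Emma']
print(s.nunique())       # -> 3 (how many distinct values there are)
print(s.value_counts())  # counts per value, sorted in descending order
```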