I tried the Language Processing 100 Knocks 2020. You can find links to the other chapters here, and the source code here.
Implement a program that reads the morphological analysis result (neko.txt.mecab). Store each morpheme in a mapping type with the surface form (surface), base form (base), part of speech (pos), and part-of-speech subclassification 1 (pos1) as keys, and represent one sentence as a list of morphemes (mapping types). Use the program created here for the rest of the problems in Chapter 4.
030.py
import pandas as pd
with open(file="neko.txt.mecab", mode="rt", encoding="utf-8") as neko:
nekotext = neko.readlines()
nekolist = []
for str in nekotext:
list = str.replace("\n", "").replace(" ", "").replace("\t", ",").split(",")
if list[0] != "EOS": nekolist.append([list[0], list[7], list[1], list[2]])
else: nekolist.append([list[0], "*", "*", "*"])
pd.set_option('display.unicode.east_asian_width', True)
df_neko = pd.DataFrame(nekolist, columns=["surface", "base", "pos", "pos1"])
print(df_neko)
# ->   surface  base   pos   pos1
# 0    一       一     名詞  数
# 1    　       　     記号  空白
# 2    吾輩     吾輩   名詞  代名詞
# 3    は       は     助詞  係助詞
# 4    猫       猫     名詞  一般   ...
I put it all together in pandas, because ... it's convenient, and it almost looks like a mapping type, doesn't it? (It doesn't.)
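Since the problem literally asks for a list of sentences whose morphemes are mapping types, here is a minimal sketch (assuming the `df_neko` built above, and relying on the EOS rows kept as delimiters) that derives that representation:

sentences, sentence = [], []
for morpheme in df_neko.to_dict("records"):  # one {"surface", "base", "pos", "pos1"} dict per row
    if morpheme["surface"] == "EOS":
        if sentence:
            sentences.append(sentence)
        sentence = []
    else:
        sentence.append(morpheme)
print(sentences[2][:3])  # first few morphemes of the third sentence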
Extract all surface forms of verbs.
031.py
import input_neko as nk
df = nk.input()
print(df.query("pos == 'verb'")["surface"].values.tolist())
# -> ['生れ', 'つか', 'し', '泣い',...
I `import` the result of No. 30 and reuse it. It's convenient that the `Series` type and the `list` type can be converted to each other.
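For illustration, the round trip between the two types (with sample values) is just:

import pandas as pd

s = pd.Series(["生れ", "つか", "し"])  # sample values for illustration
as_list = s.tolist()                    # Series -> list
s_again = pd.Series(as_list)            # list -> Series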
Extract all base forms of verbs.
032.py
import input_neko as nk
df = nk.input()
print(df.query("pos == 'verb'")["base"].values.tolist())
# -> ['生れる', 'つく', 'する', '泣く',...
This is just No. 31 with `surface` changed to `base`.
Extract noun phrases in which two nouns are connected by 「の」.
033.py
import input_neko as nk
df = nk.input()
df = df.reset_index(drop=True)
list_index = df.query("surface == 'の' & pos == '助詞'").index  # 助詞 = particle
print([f"{df.iloc[item-1, 1]}の{df.iloc[item+1, 1]}" for item in list_index if df.iloc[item-1, 2] == df.iloc[item+1, 2] == "名詞"])
# -> ['彼の掌', '掌の上', '書生の顔', 'はずの書生',...
I thought I could produce the output with `df.iloc[item-1:item+1, 1]`, but it didn't work, so I ended up with longer code.
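For what it's worth, the slice can be made to work: `iloc` slices are end-exclusive, so reaching the morpheme after the 「の」 needs `item + 2`, and the resulting Series still has to be joined into one string. A sketch using the surface column:

noun_phrases = [
    "".join(df.iloc[item - 1:item + 2, 0])  # surface forms of noun + の + noun
    for item in list_index
    if df.iloc[item - 1, 2] == df.iloc[item + 1, 2] == "名詞"
]
print(noun_phrases[:5])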
Extract concatenations of nouns (nouns that appear consecutively) with the longest match.
034.py
import input_neko as nk
df = nk.input()
df = df.reset_index(drop=True)
num = 0
phrase = ""
ans = []
for i in range(len(df)):
    if df.iloc[i, 2] == "名詞":  # 名詞 = noun
        num = num + 1
        phrase = phrase + df.iloc[i, 0]
    else:
        if num >= 2:
            ans.append(phrase)
        num = 0
        phrase = ""
print(ans)
# -> ['人間中', '一番獰悪', '時妙', '一毛',...
Consecutive nouns are concatenated into `phrase`, and the result is appended to `ans` whenever two or more nouns appear in a row.
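An alternative sketch of the same longest-match idea using `itertools.groupby`, which splits the rows into runs of nouns and non-nouns:

from itertools import groupby

import input_neko as nk

df = nk.input()
ans = []
for is_noun, run in groupby(zip(df["surface"], df["pos"]), key=lambda p: p[1] == "名詞"):
    if is_noun:
        words = [surface for surface, _ in run]
        if len(words) >= 2:  # keep only concatenations of two or more nouns
            ans.append("".join(words))
print(ans[:5])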
Find the words that appear in the text and their frequencies of appearance, and arrange them in descending order of frequency.
035.py
import input_neko as nk
df = nk.input()
print(df["surface"].value_counts().to_dict())
# -> {'の': 9194, '。': 7486, 'て': 6868, '、': 6772,...
`Series` is convenient here, too, because it can also be converted to a `dict`.
Display the 10 most frequent words and their frequencies of appearance in a graph (for example, a bar graph).
036.py
import input_neko as nk
import japanize_matplotlib
import matplotlib.pyplot as plt
df = nk.input()
df_dict = df["surface"].value_counts()[:10].to_dict()
left = list(df_dict.keys())
height = list(df_dict.values())
fig = plt.figure()
plt.bar(left, height)
plt.show()
fig.savefig("036_graph.png ")
# -> (bar graph of the 10 most frequent words; saved as 036_graph.png)
It took a while to get matplotlib to display Japanese characters, but adding `japanize_matplotlib` solved it.
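If `japanize_matplotlib` is not an option, pointing matplotlib at an installed Japanese font also works; a sketch (the font name is an assumption, substitute whatever is installed on your system):

import matplotlib.pyplot as plt

plt.rcParams["font.family"] = "IPAexGothic"  # assumed font; pick any installed Japanese font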
Display 10 words that frequently co-occur with 「猫」 ("cat") and their frequencies of appearance in a graph (for example, a bar graph).
037.py
from collections import defaultdict
import input_neko as nk
import japanize_matplotlib
import matplotlib.pyplot as plt
df = nk.input()
start = 0
neko_phrase = []
freq = defaultdict(int)
for i in range(len(df)):
    if df.iloc[i, 0] == "EOS":
        phrase = df.iloc[start:i, 0].to_list()  # surface forms of one sentence
        if "猫" in phrase:
            neko_phrase.append(phrase)
            for word in phrase:
                if word != "猫":
                    freq[word] += 1
        start = i + 1
neko_relation = sorted(freq.items(), key=lambda x: x[1], reverse=True)[:10]
left = [item[0] for item in neko_relation]
height = [item[1] for item in neko_relation]
fig = plt.figure()
plt.bar(left, height)
fig.savefig("037_graph.png ")
# -> (bar graph of the 10 words co-occurring most often with 「猫」; saved as 037_graph.png)
Words that appear frequently in sentences containing 「猫」 are counted as having a high co-occurrence frequency. A `lambda` expression supplies the key for sorting the `dict` items by value.
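`collections.Counter` gives the same sort for free via `most_common()`; a small sketch with hypothetical counts:

from collections import Counter

freq = Counter({"て": 300, "の": 250, "に": 200})  # hypothetical counts for illustration
top10 = freq.most_common(10)  # (word, count) pairs, sorted by count descending
left = [word for word, _ in top10]
height = [count for _, count in top10]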
Draw a histogram of the frequency of occurrence of words (horizontal axis: frequency of occurrence; vertical axis: the number of word types that take that frequency, as a bar graph).
038.py
import input_neko as nk
import pandas as pd
import japanize_matplotlib
import matplotlib.pyplot as plt
df = nk.input()
df_dict = df["surface"].value_counts().to_dict()
word = list(df_dict.keys())
count = list(df_dict.values())
d = pd.DataFrame(count, index=word).groupby(0).size()  # number of word types for each frequency (column label 0 holds the counts)
height = list(d[:10])
fig = plt.figure()
plt.bar(range(1, 11), height)
fig.savefig("038_graph.png ")
# -> (histogram: frequencies 1-10 on the x-axis, number of word types on the y-axis; saved as 038_graph.png)
I figured `groupby` was the way to go, but all the converting between `dict`, `DataFrame`, `Series`, and `list` made it harder to follow ...
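A shorter route to the same frequency-of-frequencies table is to stay inside pandas and apply `value_counts()` twice; a sketch, assuming the same `df`:

import input_neko as nk

df = nk.input()
# first value_counts: frequency per word; second: number of word types per frequency
freq_of_freq = df["surface"].value_counts().value_counts().sort_index()
print(freq_of_freq.head(10))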
Plot a log-log graph with the rank of word frequency on the horizontal axis and the frequency of occurrence on the vertical axis.
039.py
import input_neko as nk
import pandas as pd
import japanize_matplotlib
import matplotlib.pyplot as plt
df = nk.input()
df_dict = df["surface"].value_counts().to_dict()
word = list(df_dict.keys())
count = list(df_dict.values())
d = pd.DataFrame(count, index=word).groupby(0).size()  # number of word types per frequency, as in No.38
height = list(d[:])
fig = plt.figure()
plt.xscale("log")
plt.yscale("log")
plt.plot(range(1, len(height) + 1), height)  # start x at 1: a log axis cannot display 0
fig.savefig("039_graph.png")
# -> (log-log plot; saved as 039_graph.png)
I checked Zipf's law on [Wikipedia](https://ja.wikipedia.org/wiki/%E3%82%B8%E3%83%83%E3%83%97%E3%81%AE%E6%B3%95%E5%89%87).
> Zipf's law is an empirical rule that the proportion of the k-th most frequent element in the whole is proportional to 1/k.
Ideally, a plot with both axes logarithmic should come out as a straight line descending to the right, and the output is roughly that shape. It's fascinating that Zipf's law shows up in so many different situations.
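For comparison, a sketch of the literal rank-versus-frequency plot the problem describes (frequency rank on the x-axis, frequency on the y-axis, both logarithmic; the output filename is hypothetical):

import matplotlib.pyplot as plt

import input_neko as nk

df = nk.input()
counts = df["surface"].value_counts().values  # frequencies, already sorted descending
fig = plt.figure()
plt.xscale("log")
plt.yscale("log")
plt.plot(range(1, len(counts) + 1), counts)  # x = rank (starting at 1), y = frequency
fig.savefig("039_zipf_rank.png")  # hypothetical filename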