I tried the Language Processing 100 Knocks 2020. You can find links to the other chapters here, and the source code here.
Implement a program that reads the morphological analysis result (neko.txt.mecab). Store each morpheme in a mapping type with the surface form (surface), base form (base), part of speech (pos), and part-of-speech subclassification 1 (pos1) as keys, and represent one sentence as a list of morphemes (mapping types). Use the program created here for the rest of the problems in Chapter 4.
030.py
import pandas as pd
with open(file="neko.txt.mecab", mode="rt", encoding="utf-8") as neko:
nekotext = neko.readlines()
nekolist = []
for str in nekotext:
list = str.replace("\n", "").replace(" ", "").replace("\t", ",").split(",")
if list[0] != "EOS": nekolist.append([list[0], list[7], list[1], list[2]])
else: nekolist.append([list[0], "*", "*", "*"])
pd.set_option('display.unicode.east_asian_width', True)
df_neko = pd.DataFrame(nekolist, columns=["surface", "base", "pos", "pos1"])
print(df_neko)
# ->   surface  base   pos   pos1
# 0    一       一     名詞  数
# 1    　       　     記号  空白
# 2    吾輩     吾輩   名詞  代名詞
# 3    は       は     助詞  係助詞
# 4    猫       猫     名詞  一般   ...
I put it all together in pandas, because ... it's convenient, and it almost looks like a mapping type, doesn't it? (It doesn't.)
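Since the problem literally asks for a list of sentences whose morphemes are mapping types, here is a minimal sketch (assuming the `df_neko` built above, and relying on the EOS rows kept as delimiters) that derives that representation:

sentences, sentence = [], []
for morpheme in df_neko.to_dict("records"):  # one {"surface", "base", "pos", "pos1"} dict per row
    if morpheme["surface"] == "EOS":
        if sentence:
            sentences.append(sentence)
        sentence = []
    else:
        sentence.append(morpheme)
print(sentences[2][:3])  # first few morphemes of the third sentence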
Extract all surface forms of verbs.
031.py
import input_neko as nk
df = nk.input()
print(df.query("pos == 'verb'")["surface"].values.tolist())
# -> ['生れ', 'つか', 'し', '泣い',...
I `import` the result of No. 30 and reuse it. It's convenient that the `Series` type and the `list` type can be converted to each other.
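For illustration, the round trip between the two types (with sample values) is just:

import pandas as pd

s = pd.Series(["生れ", "つか", "し"])  # sample values for illustration
as_list = s.tolist()                    # Series -> list
s_again = pd.Series(as_list)            # list -> Series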
Extract all base forms of verbs.
032.py
import input_neko as nk
df = nk.input()
print(df.query("pos == 'verb'")["base"].values.tolist())
# -> ['生れる', 'つく', 'する', '泣く',...
This is just No. 31 with `surface` changed to `base`.
Extract noun phrases in which two nouns are connected by 「の」.
033.py
import input_neko as nk
df = nk.input()
df = df.reset_index(drop=True)
list_index = df.query("surface == 'の' & pos == '助詞'").index  # 助詞 = particle
print([f"{df.iloc[item-1, 1]}の{df.iloc[item+1, 1]}" for item in list_index if df.iloc[item-1, 2] == df.iloc[item+1, 2] == "名詞"])
# -> ['彼の掌', '掌の上', '書生の顔', 'はずの書生',...
I thought I could produce the output with `df.iloc[item-1:item+1, 1]`, but it didn't work, so I ended up with longer code.
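For what it's worth, the slice can be made to work: `iloc` slices are end-exclusive, so reaching the morpheme after the 「の」 needs `item + 2`, and the resulting Series still has to be joined into one string. A sketch using the surface column:

noun_phrases = [
    "".join(df.iloc[item - 1:item + 2, 0])  # surface forms of noun + の + noun
    for item in list_index
    if df.iloc[item - 1, 2] == df.iloc[item + 1, 2] == "名詞"
]
print(noun_phrases[:5])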
Extract concatenations of nouns (nouns that appear consecutively) with the longest match.
034.py
import input_neko as nk
df = nk.input()
df = df.reset_index(drop=True)
num = 0
phrase = ""
ans = []
for i in range(len(df)):
    if df.iloc[i, 2] == "名詞":  # 名詞 = noun
        num = num + 1
        phrase = phrase + df.iloc[i, 0]
    else:
        if num >= 2:
            ans.append(phrase)
        num = 0
        phrase = ""
print(ans)
# -> ['人間中', '一番獰悪', '時妙', '一毛',...
Consecutive nouns are concatenated into `phrase`, and the result is appended to `ans` whenever two or more nouns appear in a row.
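An alternative sketch of the same longest-match idea using `itertools.groupby`, which splits the rows into runs of nouns and non-nouns:

from itertools import groupby

import input_neko as nk

df = nk.input()
ans = []
for is_noun, run in groupby(zip(df["surface"], df["pos"]), key=lambda p: p[1] == "名詞"):
    if is_noun:
        words = [surface for surface, _ in run]
        if len(words) >= 2:  # keep only concatenations of two or more nouns
            ans.append("".join(words))
print(ans[:5])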
Find the words that appear in the text and their frequencies of appearance, and arrange them in descending order of frequency.
035.py
import input_neko as nk
df = nk.input()
print(df["surface"].value_counts().to_dict())
# -> {'の': 9194, '。': 7486, 'て': 6868, '、': 6772,...
`Series` is convenient here, too, because it can also be converted to a `dict`.
Display the 10 most frequent words and their frequencies of appearance in a graph (for example, a bar graph).
036.py
import input_neko as nk
import japanize_matplotlib
import matplotlib.pyplot as plt
df = nk.input()
df_dict = df["surface"].value_counts()[:10].to_dict()
left = list(df_dict.keys())
height = list(df_dict.values())
fig = plt.figure()
plt.bar(left, height)
plt.show()
fig.savefig("036_graph.png ")
# -> (bar graph of the 10 most frequent words; saved as 036_graph.png)
It took a while to get matplotlib to display Japanese characters, but adding `japanize_matplotlib` solved it.
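If `japanize_matplotlib` is not an option, pointing matplotlib at an installed Japanese font also works; a sketch (the font name is an assumption, substitute whatever is installed on your system):

import matplotlib.pyplot as plt

plt.rcParams["font.family"] = "IPAexGothic"  # assumed font; pick any installed Japanese font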
Display 10 words that frequently co-occur with 「猫」 ("cat") and their frequencies of appearance in a graph (for example, a bar graph).
037.py
from collections import defaultdict
import input_neko as nk
import japanize_matplotlib
import matplotlib.pyplot as plt
df = nk.input()
start = 0
neko_phrase = []
freq = defaultdict(int)
for i in range(len(df)):
    if df.iloc[i, 0] == "EOS":
        phrase = df.iloc[start:i, 0].to_list()  # surface forms of one sentence
        if "猫" in phrase:
            neko_phrase.append(phrase)
            for word in phrase:
                if word != "猫":
                    freq[word] += 1
        start = i + 1
neko_relation = sorted(freq.items(), key=lambda x: x[1], reverse=True)[:10]
left = [item[0] for item in neko_relation]
height = [item[1] for item in neko_relation]
fig = plt.figure()
plt.bar(left, height)
fig.savefig("037_graph.png ")
# -> (bar graph of the 10 words co-occurring most often with 「猫」; saved as 037_graph.png)
Words that appear frequently in sentences containing 「猫」 are counted as having a high co-occurrence frequency. A `lambda` expression supplies the key for sorting the `dict` items by value.
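`collections.Counter` gives the same sort for free via `most_common()`; a small sketch with hypothetical counts:

from collections import Counter

freq = Counter({"て": 300, "の": 250, "に": 200})  # hypothetical counts for illustration
top10 = freq.most_common(10)  # (word, count) pairs, sorted by count descending
left = [word for word, _ in top10]
height = [count for _, count in top10]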
Draw a histogram of the frequency of occurrence of words (horizontal axis: frequency of occurrence; vertical axis: the number of word types that take that frequency, as a bar graph).
038.py
import input_neko as nk
import pandas as pd
import japanize_matplotlib
import matplotlib.pyplot as plt
df = nk.input()
df_dict = df["surface"].value_counts().to_dict()
word = list(df_dict.keys())
count = list(df_dict.values())
d = pd.DataFrame(count, index=word).groupby(0).size()  # number of word types for each frequency (column label 0 holds the counts)
height = list(d[:10])
fig = plt.figure()
plt.bar(range(1, 11), height)
fig.savefig("038_graph.png ")
# -> (histogram: frequencies 1-10 on the x-axis, number of word types on the y-axis; saved as 038_graph.png)
I figured `groupby` was the way to go, but all the converting between `dict`, `DataFrame`, `Series`, and `list` made it harder to follow ...
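A shorter route to the same frequency-of-frequencies table is to stay inside pandas and apply `value_counts()` twice; a sketch, assuming the same `df`:

import input_neko as nk

df = nk.input()
# first value_counts: frequency per word; second: number of word types per frequency
freq_of_freq = df["surface"].value_counts().value_counts().sort_index()
print(freq_of_freq.head(10))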
Plot a log-log graph with the rank of word frequency on the horizontal axis and the frequency of occurrence on the vertical axis.
039.py
import input_neko as nk
import pandas as pd
import japanize_matplotlib
import matplotlib.pyplot as plt
df = nk.input()
df_dict = df["surface"].value_counts().to_dict()
word = list(df_dict.keys())
count = list(df_dict.values())
d = pd.DataFrame(count, index=word).groupby(0).size()  # number of word types per frequency, as in No.38
height = list(d[:])
fig = plt.figure()
plt.xscale("log")
plt.yscale("log")
plt.plot(range(1, len(height) + 1), height)  # start x at 1: a log axis cannot display 0
fig.savefig("039_graph.png")
# -> (log-log plot; saved as 039_graph.png)
I checked Zipf's law on [Wikipedia](https://ja.wikipedia.org/wiki/%E3%82%B8%E3%83%83%E3%83%97%E3%81%AE%E6%B3%95%E5%89%87).
> Zipf's law is an empirical rule that the proportion of the k-th most frequent element in the whole is proportional to 1/k.
Ideally, a plot with both axes logarithmic should come out as a straight line descending to the right, and the output is roughly that shape. It's fascinating that Zipf's law shows up in so many different situations.
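For comparison, a sketch of the literal rank-versus-frequency plot the problem describes (frequency rank on the x-axis, frequency on the y-axis, both logarithmic; the output filename is hypothetical):

import matplotlib.pyplot as plt

import input_neko as nk

df = nk.input()
counts = df["surface"].value_counts().values  # frequencies, already sorted descending
fig = plt.figure()
plt.xscale("log")
plt.yscale("log")
plt.plot(range(1, len(counts) + 1), counts)  # x = rank (starting at 1), y = frequency
fig.savefig("039_zipf_rank.png")  # hypothetical filename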