This is a record of problem 39, "Zipf's Law", from "Chapter 4: Morphological Analysis" of Language Processing 100 Knocks 2015 (.ac.jp/nlp100/#ch4). According to [Wikipedia](https://ja.wikipedia.org/wiki/%E3%82%B8%E3%83%83%E3%83%97%E3%81%AE%E6%B3%95%E5%89%87), Zipf's law is explained as quoted below; put plainly, it is the law that **the more frequently an element appears, the larger its share of the whole**.
Zipf's law is an empirical rule that the proportion of the $k$-th most frequent element in the whole is proportional to $\frac{1}{k}$.
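As a quick numeric check of this definition, here is a small sketch (the rank cutoff of 5 is an arbitrary choice for illustration) that normalizes the ideal $\frac{1}{k}$ weights into proportions:

```python
# Ideal Zipf proportions: the k-th most frequent element's share is proportional to 1/k.
ranks = range(1, 6)
weights = [1 / k for k in ranks]
total = sum(weights)
proportions = [w / total for w in weights]
for k, p in zip(ranks, proportions):
    print(f"rank {k}: {p:.3f}")
```

Note that the rank-1 proportion is exactly twice the rank-2 proportion, as the $\frac{1}{k}$ rule requires.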
Link | Remarks |
---|---|
039.Zipf's law.ipynb | Link to the answer program on GitHub |
100 amateur language processing knocks: 39 | Source of many copied-and-pasted parts |
MeCab Official | The first MeCab page to look at |
type | version | Contents |
---|---|---|
OS | Ubuntu18.04.01 LTS | Running virtually |
pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
Python | 3.8.1 | Python 3.8.1 on pyenv; packages are managed with venv |
Mecab | 0.996-5 | Installed with apt-get |
In the above environment, I use the following additional Python packages, installed with plain pip.
type | version |
---|---|
matplotlib | 3.1.3 |
pandas | 1.0.1 |
Apply the morphological analyzer MeCab to Natsume Soseki's novel "I Am a Cat" to obtain the statistics of the words in the novel.
Morphological analysis, MeCab, part of speech, frequency of occurrence, Zipf's law, matplotlib, Gnuplot
Using MeCab, morphologically analyze the text (neko.txt) of Natsume Soseki's novel "I Am a Cat" and save the result in a file called neko.txt.mecab. Use this file to implement programs that address the following questions.
For problems 37, 38, and 39, use matplotlib or Gnuplot.
Plot a log-log graph with the rank of word frequency on the horizontal axis and the frequency of occurrence on the vertical axis.
```python
import matplotlib.pyplot as plt
import pandas as pd


def read_text():
    # Column 0: surface form (surface)
    # Column 1: part of speech (pos)
    # Column 2: part-of-speech subdivision 1 (pos1)
    # Column 7: base form (base)
    df = pd.read_table('./neko.txt.mecab', sep='\t|,', header=None,
                       usecols=[0, 1, 2, 7],
                       names=['surface', 'pos', 'pos1', 'base'],
                       skiprows=4, skipfooter=1, engine='python')
    # Exclude blanks (空白), EOS markers, and symbols (記号)
    return df[(df['pos'] != '空白') & (df['surface'] != 'EOS') & (df['pos'] != '記号')]


df = read_text()

frequency = df['surface'].value_counts().values.tolist()

plt.xscale('log')
plt.yscale('log')
plt.xlim(1, len(frequency) + 1)
plt.ylim(1, frequency[0])
plt.xlabel('Rank')
plt.ylabel('Frequency of appearance')
plt.scatter(x=range(1, len(frequency) + 1), y=frequency)
```
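For context on why `usecols=[0, 1, 2, 7]` picks those columns: MeCab's default output puts the surface form before a tab, followed by comma-separated features (part of speech, its subdivisions, conjugation info, base form, reading, pronunciation). A sketch parsing one such line (the sample line is illustrative, not taken from the actual neko.txt.mecab):

```python
# One line of MeCab default output:
# surface \t pos,pos1,pos2,pos3,conj_type,conj_form,base,reading,pronunciation
line = '猫\t名詞,一般,*,*,*,*,猫,ネコ,ネコ'
surface, features = line.split('\t')
fields = [surface] + features.split(',')
# Columns 0, 1, 2, and 7 are exactly what read_text() keeps.
print(fields[0], fields[1], fields[2], fields[7])  # → 猫 名詞 一般 猫
```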
The `value_counts` function counts the frequency of each unique value, and `tolist` converts the result to a Python list.

```python
frequency = df['surface'].value_counts().values.tolist()
```
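For instance, on a tiny toy Series (hypothetical words, not from the actual novel), `value_counts` returns the counts already sorted in descending order:

```python
import pandas as pd

# Toy example: value_counts counts unique values and sorts descending by frequency.
words = pd.Series(['cat', 'cat', 'cat', 'dog', 'dog', 'bird'])
frequency = words.value_counts().values.tolist()
print(frequency)  # [3, 2, 1]
```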
Since the problem statement asks for a "log-log graph", both axes use a log scale.

Plot a log-log graph with the rank of word frequency on the horizontal axis and the frequency of occurrence on the vertical axis.

```python
plt.xscale('log')
plt.yscale('log')
```
The x-axis maximum is the list length + 1 (since Python indexing starts at 0), and the y-axis maximum is the first element of the list, which is the largest value because `value_counts` sorts in descending order.

```python
plt.xlim(1, len(frequency) + 1)
plt.ylim(1, frequency[0])
```
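Putting the pieces together, here is a self-contained sketch that plots synthetic frequencies on the same log-log axes, so the shape can be checked without neko.txt.mecab (the `1000 / k` frequencies and the output filename `zipf_demo.png` are assumptions for the demo):

```python
import matplotlib
matplotlib.use('Agg')  # render without a display
import matplotlib.pyplot as plt

# Synthetic frequencies roughly following Zipf's law: f(k) = 1000 / k.
frequency = [1000 // k for k in range(1, 101)]

plt.xscale('log')
plt.yscale('log')
plt.xlim(1, len(frequency) + 1)
plt.ylim(1, frequency[0])
plt.xlabel('Rank')
plt.ylabel('Frequency of appearance')
plt.scatter(x=range(1, len(frequency) + 1), y=frequency)
plt.savefig('zipf_demo.png')
```

On log-log axes these points fall on a straight line sloping down to the right, which is the signature of Zipf's law.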
When the program is executed, the following result is output. The points slope down to the right almost in a straight line, just as Zipf's law predicts.