This is a record of problem 39, "Zipf's Law", from "Chapter 4: Morphological Analysis" of Language Processing 100 Knocks 2015 (.ac.jp/nlp100/#ch4). According to [Wikipedia](https://ja.wikipedia.org/wiki/%E3%82%B8%E3%83%83%E3%83%97%E3%81%AE%E6%B3%95%E5%89%87), Zipf's law is explained as quoted below; put plainly, it is the law that **the more frequently an element appears, the larger its share of the whole**.
Zipf's law is an empirical rule that the proportion of the $k$-th most frequent element in the whole is proportional to $\frac{1}{k}$.
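As a quick numeric check of this definition, here is a small sketch (the rank cutoff of 5 is an arbitrary choice for illustration) that normalizes the ideal $\frac{1}{k}$ weights into proportions:

```python
# Ideal Zipf proportions: the k-th most frequent element's share is proportional to 1/k.
ranks = range(1, 6)
weights = [1 / k for k in ranks]
total = sum(weights)
proportions = [w / total for w in weights]
for k, p in zip(ranks, proportions):
    print(f"rank {k}: {p:.3f}")
```

Note that the rank-1 proportion is exactly twice the rank-2 proportion, as the $\frac{1}{k}$ rule requires.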
Link | Remarks |
---|---|
039.Zipf's law.ipynb | Link to the answer program on GitHub |
100 amateur language processing knocks: 39 | Source of many copied-and-pasted parts |
MeCab Official | The first MeCab page to look at |
type | version | Contents |
---|---|---|
OS | Ubuntu18.04.01 LTS | Running virtually |
pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
Python | 3.8.1 | Python 3.8.1 on pyenv; packages are managed with venv |
Mecab | 0.996-5 | Installed with apt-get |
In the above environment, I use the following additional Python packages, installed with plain pip.
type | version |
---|---|
matplotlib | 3.1.3 |
pandas | 1.0.1 |
Apply the morphological analyzer MeCab to Natsume Soseki's novel "I Am a Cat" to obtain the statistics of the words in the novel.
Morphological analysis, MeCab, part of speech, frequency of occurrence, Zipf's law, matplotlib, Gnuplot
Using MeCab, morphologically analyze the text (neko.txt) of Natsume Soseki's novel "I Am a Cat" and save the result in a file called neko.txt.mecab. Use this file to implement programs that address the following questions.
For problems 37, 38, and 39, use matplotlib or Gnuplot.
Plot a log-log graph with the rank of word frequency on the horizontal axis and the frequency of occurrence on the vertical axis.
```python
import matplotlib.pyplot as plt
import pandas as pd


def read_text():
    # Column 0: surface form (surface)
    # Column 1: part of speech (pos)
    # Column 2: part-of-speech subdivision 1 (pos1)
    # Column 7: base form (base)
    df = pd.read_table('./neko.txt.mecab', sep='\t|,', header=None,
                       usecols=[0, 1, 2, 7],
                       names=['surface', 'pos', 'pos1', 'base'],
                       skiprows=4, skipfooter=1, engine='python')
    # Exclude blanks (空白), EOS markers, and symbols (記号)
    return df[(df['pos'] != '空白') & (df['surface'] != 'EOS') & (df['pos'] != '記号')]


df = read_text()

frequency = df['surface'].value_counts().values.tolist()

plt.xscale('log')
plt.yscale('log')
plt.xlim(1, len(frequency) + 1)
plt.ylim(1, frequency[0])
plt.xlabel('Rank')
plt.ylabel('Frequency of appearance')
plt.scatter(x=range(1, len(frequency) + 1), y=frequency)
```
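For context on why `usecols=[0, 1, 2, 7]` picks those columns: MeCab's default output puts the surface form before a tab, followed by comma-separated features (part of speech, its subdivisions, conjugation info, base form, reading, pronunciation). A sketch parsing one such line (the sample line is illustrative, not taken from the actual neko.txt.mecab):

```python
# One line of MeCab default output:
# surface \t pos,pos1,pos2,pos3,conj_type,conj_form,base,reading,pronunciation
line = '猫\t名詞,一般,*,*,*,*,猫,ネコ,ネコ'
surface, features = line.split('\t')
fields = [surface] + features.split(',')
# Columns 0, 1, 2, and 7 are exactly what read_text() keeps.
print(fields[0], fields[1], fields[2], fields[7])  # → 猫 名詞 一般 猫
```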
The `value_counts` function counts the frequency of each unique value, and `tolist` converts the result to a Python list.

```python
frequency = df['surface'].value_counts().values.tolist()
```
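For instance, on a tiny toy Series (hypothetical words, not from the actual novel), `value_counts` returns the counts already sorted in descending order:

```python
import pandas as pd

# Toy example: value_counts counts unique values and sorts descending by frequency.
words = pd.Series(['cat', 'cat', 'cat', 'dog', 'dog', 'bird'])
frequency = words.value_counts().values.tolist()
print(frequency)  # [3, 2, 1]
```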
Since the problem statement asks for a "log-log graph", both axes use a log scale.

Plot a log-log graph with the rank of word frequency on the horizontal axis and the frequency of occurrence on the vertical axis.

```python
plt.xscale('log')
plt.yscale('log')
```
The x-axis maximum is the list length + 1 (since Python indexing starts at 0), and the y-axis maximum is the first element of the list, which is the largest value because `value_counts` sorts in descending order.

```python
plt.xlim(1, len(frequency) + 1)
plt.ylim(1, frequency[0])
```
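Putting the pieces together, here is a self-contained sketch that plots synthetic frequencies on the same log-log axes, so the shape can be checked without neko.txt.mecab (the `1000 / k` frequencies and the output filename `zipf_demo.png` are assumptions for the demo):

```python
import matplotlib
matplotlib.use('Agg')  # render without a display
import matplotlib.pyplot as plt

# Synthetic frequencies roughly following Zipf's law: f(k) = 1000 / k.
frequency = [1000 // k for k in range(1, 101)]

plt.xscale('log')
plt.yscale('log')
plt.xlim(1, len(frequency) + 1)
plt.ylim(1, frequency[0])
plt.xlabel('Rank')
plt.ylabel('Frequency of appearance')
plt.scatter(x=range(1, len(frequency) + 1), y=frequency)
plt.savefig('zipf_demo.png')
```

On log-log axes these points fall on a straight line sloping down to the right, which is the signature of Zipf's law.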
When the program is executed, the following result is output. The points slope down to the right almost in a straight line, just as Zipf's law predicts.