This is my record of problem 38, "Histogram", from "Chapter 4: Morphological Analysis" of Language Processing 100 Knocks 2015 (.ac.jp/nlp100/#ch4). It is easy once you get past the "tofu" (characters rendered as empty boxes) that came up in the previous knock. If you do not output any labels, you do not have to deal with "tofu" at all.
Link | Remarks |
---|---|
038.histogram.ipynb | Answer program GitHub link |
100 amateur language processing knocks:38 | Source from which I copied and pasted many parts of the code |
MeCab Official | The first MeCab page to look at |
type | version | Contents |
---|---|---|
OS | Ubuntu 18.04.01 LTS | Running on a virtual machine |
pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
Python | 3.8.1 | I'm using Python 3.8.1 on pyenv. Packages are managed with venv |
MeCab | 0.996-5 | Installed with apt-get |
In the above environment, I am using the following additional Python packages. They can be installed with pip as usual.
type | version |
---|---|
matplotlib | 3.1.3 |
pandas | 1.0.1 |
Apply the morphological analyzer MeCab to Natsume Soseki's novel "I Am a Cat" to obtain the statistics of the words in the novel.
Morphological analysis, MeCab, part of speech, frequency of occurrence, Zipf's law, matplotlib, Gnuplot
Use MeCab to morphologically analyze the text (neko.txt) of Natsume Soseki's novel "I Am a Cat" and save the result in a file called neko.txt.mecab. Use this file to implement programs that address the following questions.
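As a rough illustration of what neko.txt.mecab contains, the sketch below splits one line in MeCab's default IPADIC output format (surface form, a tab, then comma-separated features). The sample line and the helper name `parse_mecab_line` are assumptions made for illustration; they are not taken from the answer program.

```python
import re

# Illustrative line in MeCab's default (IPADIC) output format:
# surface<TAB>pos,pos1,pos2,pos3,conjugation type,conjugation form,base,reading,pronunciation
sample_line = '吾輩\t名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ'


def parse_mecab_line(line):
    """Split one MeCab output line and keep the four fields used in this article.

    Hypothetical helper: it splits on a tab or a comma, so a token whose
    surface form is itself an ASCII comma would not be handled correctly.
    """
    fields = re.split(r'\t|,', line.rstrip('\n'))
    return {
        'surface': fields[0],  # surface form
        'pos': fields[1],      # part of speech
        'pos1': fields[2],     # part-of-speech subcategory 1
        'base': fields[7],     # base form
    }


parsed = parse_mecab_line(sample_line)
print(parsed)
```

Splitting on "tab or comma" turns each line into ten fields, which is why the surface form is at index 0 and the base form at index 7.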
For problems 37, 38, and 39, use matplotlib or Gnuplot.
Draw a histogram of word frequencies as a bar graph: the horizontal axis represents the frequency of occurrence, and the vertical axis represents the number of distinct words (types) that occur with that frequency.
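To make the two axes concrete, here is a minimal sketch using only the standard library. The word list is a toy example, not data from the novel: the first count gives how often each word occurs, and the second count gives how many distinct words share each frequency, which is what the histogram's bars show.

```python
from collections import Counter

# Toy word list standing in for the words of the novel (not real data)
words = ['cat', 'the', 'cat', 'name', 'the', 'cat', 'yet']

# word -> number of occurrences
word_freq = Counter(words)

# frequency -> number of distinct words with that frequency
freq_of_freq = Counter(word_freq.values())

for freq in sorted(freq_of_freq):
    print(freq, freq_of_freq[freq])
```

Here two words occur once, one word occurs twice, and one word occurs three times, so the bars at frequencies 1, 2, and 3 would have heights 2, 1, and 1.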
import matplotlib.pyplot as plt
import pandas as pd

# Japanese-capable font, so that Japanese text is not rendered as "tofu"
plt.rcParams['font.family'] = 'IPAexGothic'


def read_text():
    # 0: surface form (surface)
    # 1: part of speech (pos)
    # 2: part-of-speech subcategory 1 (pos1)
    # 7: base form (base)
    df = pd.read_table('./neko.txt.mecab', sep='\t|,', header=None,
                       usecols=[0, 1, 2, 7], names=['surface', 'pos', 'pos1', 'base'],
                       skiprows=4, skipfooter=1, engine='python')
    # MeCab's part-of-speech labels are Japanese: '空白' = blank, '記号' = symbol
    return df[(df['pos'] != '空白') & (df['surface'] != 'EOS') & (df['pos'] != '記号')]


df = read_text()

hist = df['surface'].value_counts().plot.hist(bins=20, range=(1, 20))
hist.set_xlabel('Frequency of appearance')
hist.set_ylabel('Number of word types')
Just use the plot accessor of pandas. I also added axis labels.
hist = df['surface'].value_counts().plot.hist(bins=20, range=(1, 20))
hist.set_xlabel('Frequency of appearance')
hist.set_ylabel('Number of word types')
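The effect of bins=20 with range=(1, 20) can be sketched in plain Python. This assumes the usual equal-width binning convention of NumPy and matplotlib (half-open bins, with the last bin also including the upper edge, and values outside the range ignored); `bin_index` is a hypothetical helper written for illustration, not a library function.

```python
def bin_index(value, lo=1, hi=20, bins=20):
    """Return the equal-width bin a value falls into, or None if out of range.

    Assumed convention: bins are half-open [left, right), except the last
    one, which also includes the upper edge.
    """
    if value < lo or value > hi:
        return None      # values outside the range are ignored
    if value == hi:
        return bins - 1  # the upper edge belongs to the last bin
    return int((value - lo) * bins / (hi - lo))


# Each integer frequency from 1 to 20 lands in its own bin
indices = [bin_index(k) for k in range(1, 21)]
print(indices)

# A word that occurs 500 times falls outside the range and is not drawn
print(bin_index(500))
```

Under these assumptions each bar corresponds to exactly one frequency from 1 to 20, and words that occur more than 20 times do not appear in the figure.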
When the program is executed, the following result is output. Well, it looks about like this.