This is a record of problem 37, "Top 10 most frequent words," from "Chapter 4: Morphological analysis" of Language Processing 100 Knocks 2015 (.ac.jp/nlp100/#ch4).
This time we use matplotlib to display graphs. Along the way I ran into the matplotlib "tofu problem" that seems to catch everyone: Japanese characters on the graph are rendered as empty boxes that look like blocks of tofu.
Link | Remarks |
---|---|
037.Top 10 most frequent words.ipynb | Answer program GitHub link |
100 amateur language processing knocks:37 | Source from which I copied and pasted many parts |
MeCab Official | The first MeCab page to look at |
type | version | Contents |
---|---|---|
OS | Ubuntu18.04.01 LTS | Running on a virtual machine |
pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
Python | 3.8.1 | I'm using Python 3.8.1 on pyenv. Packages are managed using venv |
MeCab | 0.996-5 | Installed with apt-get |
In addition to the above environment, I use the following Python packages. They can be installed with plain pip.
type | version |
---|---|
matplotlib | 3.1.3 |
pandas | 1.0.1 |
Apply the morphological analyzer MeCab to Natsume Soseki's novel "I Am a Cat" to obtain the statistics of the words in the novel.
Morphological analysis, MeCab, part of speech, frequency of occurrence, Zipf's law, matplotlib, Gnuplot
Use MeCab to morphologically analyze the text (neko.txt) of Natsume Soseki's novel "I Am a Cat" and save the result in a file called neko.txt.mecab. Use this file to implement programs that address the following questions.
For problems 37, 38, and 39, use matplotlib or Gnuplot.
Display the 10 words that appear frequently and their frequency of appearance in a graph (for example, a bar graph).
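To make the layout of neko.txt.mecab concrete, here is a minimal sketch that splits one line of MeCab output into the fields used later in this article. It assumes MeCab's default output format with the IPA dictionary (surface form, a tab, then comma-separated features). The sample line is hand-written for illustration, and `parse_mecab_line` is a hypothetical helper, not part of MeCab or of the answer program.

```python
# Illustrative sample line, hand-written to match MeCab's default output layout
SAMPLE_LINE = "吾輩\t名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ"


def parse_mecab_line(line):
    """Split one MeCab output line into surface, pos, pos1 and base."""
    # The surface form and the feature string are separated by a single tab
    surface, _, feature = line.rstrip("\n").partition("\t")
    features = feature.split(",")
    # features[0] is pos, features[1] is pos1 and features[6] is the base form
    # (column 7 when the surface form is counted as column 0)
    return {
        "surface": surface,
        "pos": features[0],
        "pos1": features[1] if len(features) > 1 else "",
        "base": features[6] if len(features) > 6 else "",
    }


token = parse_mecab_line(SAMPLE_LINE)
print(token)
```

Lines such as `EOS` have no tab, so they come back with only `surface` filled in, which is why they need to be filtered out before counting.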
```python
import pandas as pd
import matplotlib.pyplot as plt

# Use a Japanese-capable font so labels are not rendered as "tofu"
plt.rcParams['font.family'] = 'IPAexGothic'


def read_text():
    # 0: Surface form (surface)
    # 1: Part of speech (pos)
    # 2: Part-of-speech subclassification 1 (pos1)
    # 7: Base form (base)
    df = pd.read_table('./neko.txt.mecab', sep='\t|,', header=None,
                       usecols=[0, 1, 2, 7], names=['surface', 'pos', 'pos1', 'base'],
                       skiprows=4, skipfooter=1, engine='python')
    # MeCab's part-of-speech tags are Japanese: 空白 = blank, 記号 = symbol
    return df[(df['pos'] != '空白') & (df['surface'] != 'EOS') & (df['pos'] != '記号')]


df = read_text()

df['surface'].value_counts()[:10].plot.bar()

# Exclude particles (助詞) and auxiliary verbs (助動詞)
df[~df['pos'].str.startswith('助')]['surface'].value_counts()[:10].plot.bar()
```
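The filtering and counting steps can be tried without neko.txt.mecab. The following is a small sketch on a hand-made table of tokens; the rows are illustrative data written for this example, not real MeCab output, and the Japanese tags assume the IPA dictionary's part-of-speech names (名詞 = noun, 助詞 = particle, 助動詞 = auxiliary verb, 記号 = symbol).

```python
import pandas as pd

# Hand-made (surface, pos) pairs standing in for the parsed MeCab output
rows = [
    ("吾輩", "名詞"), ("は", "助詞"), ("猫", "名詞"),
    ("で", "助動詞"), ("ある", "助動詞"), ("。", "記号"),
    ("名前", "名詞"), ("は", "助詞"), ("まだ", "副詞"),
    ("無い", "形容詞"), ("。", "記号"),
]
sample = pd.DataFrame(rows, columns=["surface", "pos"])

# Drop symbols, as read_text() does
sample = sample[sample["pos"] != "記号"]

# Frequency of every remaining surface form, most frequent first
top_all = sample["surface"].value_counts()

# Same count after excluding tags that start with 助 (particles, auxiliary verbs)
top_content = sample[~sample["pos"].str.startswith("助")]["surface"].value_counts()

print(top_all.head(10))
print(top_content.head(10))
```

Because both 助詞 and 助動詞 start with the same character, a single `startswith` check removes both groups at once.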
I resolved the tofu problem (garbled characters in graphs) by referring to the following articles. Note that the fix depends heavily on the OS and Python environment (for example, whether you use pyenv).

- [Resolve garbled Japanese characters in matplotlib](https://qiita.com/katuemon/items/5c4db01997ad9dc343e0#%E3%83%95%E3%82%A9%E3%83%B3%E3%83%88%E3%82%AD%E3%83%A3%E3%83%83%E3%82%B7%E3%83%A5%E3%81%AE%E5%89%8A%E9%99%A4)
- About garbled Japanese characters in matplotlib
Install the font with `apt-get`.

```shell
apt-get install fonts-ipaexfont
```
Physically delete the following files, which are matplotlib's font cache. I don't know the difference between the two, but I deleted both on the assumption that doing so clears the cache.

- /Users/username/.cache/matplotlib/fontlist-v300.json
- /Users/username/.cache/matplotlib/fontlist-v310.json
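The cache cleanup can also be scripted. This is a minimal sketch, assuming the cache files follow the `fontlist-*.json` naming seen above; `find_font_caches` and `clear_font_caches` are hypothetical helper names introduced for this example, and the directory is passed in rather than hard-coded because its location differs between environments.

```python
from pathlib import Path


def find_font_caches(cache_dir):
    """Return the fontlist-*.json files found directly under cache_dir."""
    return sorted(Path(cache_dir).glob("fontlist-*.json"))


def clear_font_caches(cache_dir, dry_run=True):
    """List the font-cache files, deleting them only when dry_run is False."""
    targets = find_font_caches(cache_dir)
    if not dry_run:
        for path in targets:
            path.unlink()
    return targets
```

With matplotlib installed, `matplotlib.get_cachedir()` returns the directory to pass in. Running with `dry_run=True` first shows what would be removed; matplotlib should rebuild its font list the next time it is imported.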
Finally, specify the font used for graph output with the following setting in Python. This completes the "tofu" fix.

```python
plt.rcParams['font.family'] = 'IPAexGothic'
```
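If the setting seems to have no effect, it can help to check whether matplotlib actually sees the font. The sketch below is one possible check, not part of the original answer: `available_font_names` and `use_font_if_available` are hypothetical helpers, and the fallback to matplotlib's bundled DejaVu Sans is an assumption made so the example behaves predictably when the font is missing.

```python
from matplotlib import font_manager
import matplotlib.pyplot as plt


def available_font_names():
    """Return the names of all fonts registered with matplotlib's font manager."""
    return {font.name for font in font_manager.fontManager.ttflist}


def use_font_if_available(name, fallback="DejaVu Sans"):
    """Set font.family to name if matplotlib knows it, otherwise to fallback."""
    chosen = name if name in available_font_names() else fallback
    plt.rcParams["font.family"] = chosen
    return chosen


chosen = use_font_if_available("IPAexGothic")
print("font in use:", chosen)
```

If this prints the fallback even after installing the font, the font cache has probably not been rebuilt yet.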
pandas is very convenient because a graph can be drawn directly from the result with `plot`.

```python
df['surface'].value_counts()[:10].plot.bar()

# Exclude particles (助詞) and auxiliary verbs (助動詞)
df[~df['pos'].str.startswith('助')]['surface'].value_counts()[:10].plot.bar()
```
When the program is executed, the following results will be output. After all, it is easier to understand if you graph it rather than just looking at the numbers.
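Outside a notebook, the chart has to be written to a file to be viewed. The following is a minimal sketch, assuming a headless environment (hence the Agg backend) and a `value_counts()`-style Series as input; `plot_top_words`, the axis labels and the sample counts are all hypothetical and use ASCII labels so that no Japanese font is needed to run it.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import pandas as pd


def plot_top_words(counts, path, top_n=10):
    """Save a bar chart of the top_n entries of a frequency Series to path."""
    top = counts.head(top_n)
    ax = top.plot.bar()
    ax.set_xlabel("word")
    ax.set_ylabel("frequency")
    ax.figure.tight_layout()
    ax.figure.savefig(path)
    plt.close(ax.figure)  # release the figure so repeated calls do not pile up
    return top


# Made-up frequencies, already sorted in descending order like value_counts()
sample_counts = pd.Series({"a": 5, "b": 3, "c": 1})
```

Calling `plot_top_words(df['surface'].value_counts(), 'top10.png')` would then write the same kind of chart as the one-liner above, with axis labels added.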