This is my record of problem 37, "Top 10 most frequent words", from "Chapter 4: Morphological analysis" of Language Processing 100 Knocks 2015 (.ac.jp/nlp100/#ch4). This time, matplotlib is used to display the graphs. It seems that everyone runs into the matplotlib "tofu problem", the phenomenon in which Japanese characters on a graph are rendered as tofu-like boxes.

Reference link

| Link | Remarks |
|---|---|
| 037.Top 10 most frequent words.ipynb | GitHub link to the answer program |
| 100 amateur language processing knocks:37 | Source from which I copied and pasted many parts of the code |
| MeCab Official | The first MeCab page to look at |

Environment

| Type | Version | Notes |
|---|---|---|
| OS | Ubuntu 18.04.01 LTS | Running as a virtual machine |
| pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.8.1 | Python 3.8.1 on pyenv; packages are managed with venv |
| MeCab | 0.996-5 | Installed with apt-get |

In the above environment, I am using the following additional Python packages. Just install with regular pip.

| Type | Version |
|---|---|
| matplotlib | 3.1.3 |
| pandas | 1.0.1 |

Chapter 4: Morphological analysis

Content of study

Apply the morphological analyzer MeCab to Natsume Soseki's novel "I Am a Cat" to obtain the statistics of the words in the novel.

Morphological analysis, MeCab, part of speech, frequency of occurrence, Zipf's law, matplotlib, Gnuplot

Knock content

Use MeCab to morphologically analyze the text (neko.txt) of Natsume Soseki's novel "I Am a Cat" and save the result in a file called neko.txt.mecab. Use this file to implement programs that address the following questions.
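The analysis step described above can also be driven from Python. The following is only a minimal sketch, assuming the `mecab` command installed by apt-get is on `PATH`; the helper names `build_mecab_command` and `run_mecab` are hypothetical and are not part of the answer program.

```python
import shutil
import subprocess


def build_mecab_command(src="neko.txt", dst="neko.txt.mecab"):
    """Return the argument list that analyzes `src` and writes the result to `dst`."""
    # `-o` is MeCab's output-file option; the default file names follow the task text.
    return ["mecab", src, "-o", dst]


def run_mecab(src="neko.txt", dst="neko.txt.mecab"):
    """Run MeCab if the `mecab` executable is on PATH; return True when it ran."""
    if shutil.which("mecab") is None:
        return False  # MeCab is not installed, so nothing was written
    subprocess.run(build_mecab_command(src, dst), check=True)
    return True
```

Calling `run_mecab()` in the directory that contains neko.txt should leave neko.txt.mecab next to it; the same result can be obtained by typing the command directly in a shell.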

For problems 37, 38, and 39, use matplotlib or Gnuplot.

37. Top 10 most frequent words

Display the 10 most frequent words and their frequencies in a graph (for example, a bar graph).

Answer

Answer program: [037. Top 10 most frequently used words.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/04.%E5%BD%A2%E6%85%8B%E7%B4%A0%E8%A7%A3%E6%9E%90/037.%20%E9%A0%BB%E5%BA%A6%E4%B8%8A%E4%BD%8D10%E8%AA%9E.ipynb)

import pandas as pd
import matplotlib.pyplot as plt

# Use a Japanese-capable font so that labels are not rendered as tofu
plt.rcParams['font.family'] = 'IPAexGothic'

def read_text():
    # Columns kept after splitting each MeCab line on a tab or a comma:
    # 0: surface form (surface)
    # 1: part of speech (pos)
    # 2: part-of-speech subcategory 1 (pos1)
    # 7: base form (base)
    df = pd.read_table('./neko.txt.mecab', sep='\t|,', header=None,
                       usecols=[0, 1, 2, 7], names=['surface', 'pos', 'pos1', 'base'],
                       skiprows=4, skipfooter=1, engine='python')
    # MeCab emits Japanese labels: drop whitespace (空白), EOS markers, and symbols (記号)
    return df[(df['pos'] != '空白') & (df['surface'] != 'EOS') & (df['pos'] != '記号')]

df = read_text()

# All words
df['surface'].value_counts()[:10].plot.bar()

# Exclude particles (助詞) and auxiliary verbs (助動詞); both labels start with 助
df[~df['pos'].str.startswith('助')]['surface'].value_counts()[:10].plot.bar()
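To make the column numbers in `read_text` easier to follow, here is a small sketch of how the `sep='\t|,'` pattern cuts up one MeCab line. It uses only the standard `re` module rather than pandas, and the sample line is a hypothetical example written in MeCab's default output layout, not text read from neko.txt.mecab.

```python
import re

# One line in MeCab's default output layout (hypothetical sample)
sample_line = "吾輩\t名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ"

# Same pattern as the `sep` argument of read_table: split on a tab or a comma
fields = re.split(r"\t|,", sample_line)

# Pick the four columns selected by usecols=[0, 1, 2, 7]
record = {
    "surface": fields[0],  # surface form
    "pos": fields[1],      # part of speech
    "pos1": fields[2],     # part-of-speech subcategory 1
    "base": fields[7],     # base form
}
print(record)
```

Because the separator is a regular expression rather than a single character, pandas needs `engine='python'` here, which is also the engine that supports `skipfooter`.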

Answer commentary

Dealing with tofu (garbled characters in graphs)

I fixed the tofu problem (garbled characters in graphs) by referring to the following articles. Please note that the fix depends heavily on the OS and the Python environment (for example, whether pyenv is used).

- [Resolve garbled Japanese characters in matplotlib](https://qiita.com/katuemon/items/5c4db01997ad9dc343e0#%E3%83%95%E3%82%A9%E3%83%B3%E3%83%88%E3%82%AD%E3%83%A3%E3%83%83%E3%82%B7%E3%83%A5%E3%81%AE%E5%89%8A%E9%99%A4)
- About garbled Japanese characters in matplotlib

1. Font installation

Install the font with `apt-get`.

apt-get install fonts-ipaexfont

2. Delete cache

Delete the following files, which are matplotlib's font cache. I don't know the difference between the two, but I deleted both on the assumption that doing so clears the cache.

- /Users/username/.cache/matplotlib/fontlist-v300.json
- /Users/username/.cache/matplotlib/fontlist-v310.json
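The deletion above can also be scripted. This is a minimal sketch, assuming the cache lives in `~/.cache/matplotlib` and that the cache files match `fontlist-*.json`; the function name `clear_font_cache` is hypothetical, and the directory should be passed explicitly if yours differs.

```python
from pathlib import Path


def clear_font_cache(cache_dir=None):
    """Delete matplotlib font-list cache files and return the deleted file names."""
    # Assumed default location; pass the real directory if yours differs.
    cache_dir = Path(cache_dir) if cache_dir else Path.home() / ".cache" / "matplotlib"
    deleted = []
    for path in sorted(cache_dir.glob("fontlist-*.json")):
        path.unlink()  # remove the cache file; matplotlib rebuilds it when needed
        deleted.append(path.name)
    return deleted
```

If you are unsure where the cache is, `matplotlib.get_cachedir()` reports the directory matplotlib is actually using, and its return value can be passed as `cache_dir`.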

3. Specify font in Python

Specify the font used for graph output with the following setting in Python. This completes the fix for the "tofu" problem.

plt.rcParams['font.family'] = 'IPAexGothic'

Graph output

pandas is very convenient because the result of `value_counts` can be plotted directly with `plot`.

df['surface'].value_counts()[:10].plot.bar()

# Exclude particles (助詞) and auxiliary verbs (助動詞); both labels start with 助
df[~df['pos'].str.startswith('助')]['surface'].value_counts()[:10].plot.bar()
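As a cross-check on what these two one-liners compute, the counting can be sketched without pandas using `collections.Counter`. This is a swapped-in technique for illustration only, not the method of the answer program; the `rows` data and the `top_words` helper are hypothetical. The exclusion relies on the fact that MeCab's labels for particles (助詞) and auxiliary verbs (助動詞) share the first character 助.

```python
from collections import Counter

# Hypothetical (surface, pos) pairs standing in for rows of neko.txt.mecab
rows = [
    ("吾輩", "名詞"), ("は", "助詞"), ("猫", "名詞"), ("で", "助動詞"),
    ("ある", "助動詞"), ("猫", "名詞"), ("は", "助詞"), ("は", "助詞"),
]


def top_words(rows, n=10, exclude_prefix=None):
    """Return the n most frequent surface forms as (word, count) pairs."""
    if exclude_prefix is not None:
        # Drop rows whose part-of-speech label starts with the given prefix
        rows = [(s, p) for s, p in rows if not p.startswith(exclude_prefix)]
    return Counter(surface for surface, _ in rows).most_common(n)


print(top_words(rows, n=2))                       # all words
print(top_words(rows, n=2, exclude_prefix="助"))  # particles and auxiliary verbs removed
```

In the pandas version, `value_counts()` already sorts by descending count, so slicing with `[:10]` plays the role of `most_common(10)`.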

Output result (execution result)

When the program is executed, the following results will be output. After all, it is easier to understand if you graph it rather than just looking at the numbers.

[Figure: bar chart of the top 10 most frequent words (all words)]

[Figure: bar chart of the top 10 most frequent words (words excluding particles and auxiliary verbs)]