This is a record of knock 36, "Frequency of word occurrence," from "Chapter 4: Morphological analysis" of 100 Language Processing Knocks 2015 (.ac.jp/nlp100/#ch4). This one is very easy, because pandas is good at counting occurrences and sorting.

Reference link

Link	Remarks
036.Frequency of word occurrence.ipynb	GitHub link to the answer program
100 amateur language processing knocks:36	The source from which I copied many parts of the code
MeCab Official	The first MeCab page to look at

Environment

type	version	Contents
OS	Ubuntu18.04.01 LTS	Running in a virtual machine
pyenv	1.2.16	I use pyenv because I sometimes use multiple Python environments
Python	3.8.1	I'm using Python 3.8.1 on pyenv. Packages are managed with venv
MeCab	0.996-5	Installed with apt-get

In the above environment, I am using the following additional Python packages. Just install with regular pip.

type	version
pandas	1.0.1

Chapter 4: Morphological analysis

Content of study

Apply the morphological analyzer MeCab to Natsume Soseki's novel "I Am a Cat" to obtain the statistics of the words in the novel.

Morphological analysis, MeCab, part of speech, frequency of occurrence, Zipf's law, matplotlib, Gnuplot

Knock content

Use MeCab to morphologically analyze the text (neko.txt) of Natsume Soseki's novel "I Am a Cat" and save the result in a file called neko.txt.mecab. Use this file to implement programs that address the following questions.

For problems 37, 38, and 39, use matplotlib or Gnuplot.

36. Frequency of word occurrence

Find the words that appear in the text and how often each one appears, and list them in descending order of frequency.

Answer

Answer program: [036. Frequency of word occurrence.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/04.%E5%BD%A2%E6%85%8B%E7%B4%A0%E8%A7%A3%E6%9E%90/036.%E5%8D%98%E8%AA%9E%E3%81%AE%E5%87%BA%E7%8F%BE%E9%A0%BB%E5%BA%A6.ipynb)

```python
import pandas as pd

def read_text():
    # Columns kept from each MeCab output line (split on tab or comma):
    # 0: surface form (surface)
    # 1: part of speech (pos)
    # 2: part-of-speech subcategory 1 (pos1)
    # 7: base form (base)
    df = pd.read_table('./neko.txt.mecab', sep='\t|,', header=None,
                       usecols=[0, 1, 2, 7], names=['surface', 'pos', 'pos1', 'base'],
                       skiprows=4, skipfooter=1, engine='python')
    # MeCab emits Japanese tags, so filter on them:
    # drop whitespace (空白), EOS markers, and symbols (記号)
    return df[(df['pos'] != '空白') & (df['surface'] != 'EOS') & (df['pos'] != '記号')]

df = read_text()

# Top 30 words over all parts of speech
df['surface'].value_counts()[:30]

# Exclude particles (助詞) and auxiliary verbs (助動詞)
df[~df['pos'].str.startswith('助')]['surface'].value_counts()[:30]
```
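
The column numbers passed to `usecols` are easier to follow with one line of MeCab output in front of you. The following is a minimal sketch, not part of the answer program: the sample line is something I made up for illustration, assuming MeCab's default IPADIC output format (surface, a tab, then nine comma-separated features). It splits the line with the same regular expression that is used as `sep`.

```python
import re

# Hypothetical sample line, assuming the default IPADIC output format:
# surface<TAB>pos,pos1,pos2,pos3,conjugation type,conjugation form,base,reading,pronunciation
line = "吾輩\t名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ"

# Same separator as in read_table: split on a tab or a comma
fields = re.split(r"\t|,", line)

# Print each field with its column index
for index, value in enumerate(fields):
    print(index, value)

# The four columns selected by usecols=[0, 1, 2, 7]
surface, pos, pos1, base = (fields[i] for i in (0, 1, 2, 7))
```

Because the surface form and the features are separated by different characters, a regular-expression separator is needed, which is probably why the answer specifies `engine='python'`.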

Answer commentary

Occurrence frequency count and sort

As in [knock 19, "Calculate the frequency of appearance of the character string in the first column of each line and arrange it in descending order of frequency of appearance"](https://qiita.com/FukuharaYohei/items/87f0413b87c6109e8ca4#019%E5%90%84%E8%A1%8C%E3%81%AE1%E3%82%B3%E3%83%A9%E3%83%A0%E7%9B%AE%E3%81%AE%E6%96%87%E5%AD%97%E5%88%97%E3%81%AE%E5%87%BA%E7%8F%BE%E9%A0%BB%E5%BA%A6%E3%82%92%E6%B1%82%E3%82%81%E5%87%BA%E7%8F%BE%E9%A0%BB%E5%BA%A6%E3%81%AE%E9%AB%98%E3%81%84%E9%A0%86%E3%81%AB%E4%B8%A6%E3%81%B9%E3%82%8Bipynb), I use value_counts to count each distinct value and sort the result. Conveniently, it sorts in descending order by default.

```python
df['surface'].value_counts()[:30]
```
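
To show the counting and sorting behaviour on something smaller than the whole novel, here is a minimal sketch with a handful of words I made up. The variable names `words`, `counts`, and `top2` are only for this illustration.

```python
import pandas as pd

# Hypothetical miniature "surface" column
words = pd.Series(["猫", "主人", "猫", "吾輩", "猫", "主人"], name="surface")

# value_counts collapses duplicates into one row per distinct word
# and sorts by count in descending order by default
counts = words.value_counts()
print(counts)

# Keep only the most frequent entries, like [:30] in the answer
top2 = counts.head(2)
print(top2)
```

Since the result is already sorted, taking the first rows is enough to get the most frequent words, and no separate sort call is needed.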

Excluding particles and auxiliary verbs only requires one extra condition. In MeCab's output, the only part-of-speech names that begin with 助 are 助詞 (particles) and 助動詞 (auxiliary verbs), so I negate a str.startswith condition on that prefix.

```python
# Exclude particles (助詞) and auxiliary verbs (助動詞)
df[~df['pos'].str.startswith('助')]['surface'].value_counts()[:30]
```
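
The negated condition can be tried on a tiny table. This is a minimal sketch assuming the `pos` column holds MeCab's Japanese tags such as 名詞, 助詞, and 助動詞. The rows and the names `is_function_word` and `content_words` are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature version of the DataFrame returned by read_text()
df = pd.DataFrame({
    "surface": ["吾輩", "は", "猫", "で", "ある"],
    "pos": ["名詞", "助詞", "名詞", "助動詞", "助動詞"],
})

# True for rows whose part of speech starts with 助 (助詞 or 助動詞)
is_function_word = df["pos"].str.startswith("助")

# ~ flips the boolean mask, so only the other parts of speech remain
content_words = df[~is_function_word]
print(content_words["surface"].value_counts())
```

If the `pos` column could contain missing values, passing `na=False` to `str.startswith` keeps the mask purely boolean. In the answer program the EOS rows are filtered out beforehand, so this should not come up there.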

Output result (execution result)

Running the program produces the following results. First, the top 30 over all words. They are mostly particles and auxiliary verbs, so naturally you cannot infer anything about the content from them.

Output result (all targets)

```
9109
6697
Is 6384
To 6147
6068
And 5476
Is 5259
3916
At 3774
Also 2433
2272
2264
Not 2254
From 2001
There is 1705
1579
Or 1446
1416
1249
Thing 1177
To 1033
986
974
Things 971
You 955
Say 937
Master 928
U 922
Yo 687
673
Name: surface, dtype: int64
```

Next, the output with particles and auxiliary verbs excluded. Compared with counting everything, it is much easier to guess that the text is "I Am a Cat".

Output result (particles and auxiliary verbs excluded)

```
2201
1597
1249
Thing 1177
986
Things 971
You 955
Say 937
Master 928
There is 723
Not 708
Yo 687
Hmm 667
This 635
Go 598
That 560
What 518
I 477
Person 449
Yes 448
443
Become 410
403
This 397
It 370
Coming 367
See 349
Labyrinth 343
Re 327
Time 316
Name: surface, dtype: int64
```

100 Language Processing Knock-36 (using pandas): Frequency of word occurrence
