This is a record of the 36th knock, "Frequency of word occurrence", from "Chapter 4: Morphological analysis" of Language Processing 100 Knocks 2015 (.ac.jp/nlp100/#ch4). This one is very easy, because pandas is good at counting occurrences and sorting.
Link | Remarks |
---|---|
036.Frequency of word occurrence.ipynb | GitHub link to the answer program |
100 amateur language processing knocks:36 | Source from which many parts of the code were copied |
MeCab Official | The first MeCab page to look at |
type | version | Contents |
---|---|---|
OS | Ubuntu18.04.01 LTS | Running as a virtual machine |
pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
Python | 3.8.1 | Python 3.8.1 on pyenv. Packages are managed with venv |
MeCab | 0.996-5 | Installed with apt-get |
On top of the environment above, I use the following additional Python package. It installs with plain pip.
type | version |
---|---|
pandas | 1.0.1 |
Apply the morphological analyzer MeCab to Natsume Soseki's novel "I Am a Cat" to obtain the statistics of the words in the novel.
Morphological analysis, MeCab, part of speech, frequency of occurrence, Zipf's law, matplotlib, Gnuplot
Use MeCab to morphologically analyze the text (neko.txt) of Natsume Soseki's novel "I Am a Cat" and save the result in a file called neko.txt.mecab. Use this file to implement programs that address the following questions.
For problems 37, 38, and 39, use matplotlib or Gnuplot.
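The file neko.txt.mecab has to exist before the answer program runs. As a minimal sketch, assuming the `mecab` command is on the PATH, it could be generated from Python as follows. The helper names `build_mecab_command` and `run_mecab` are hypothetical and only for illustration. `-o` is MeCab's option for writing to an output file.

```python
import subprocess


def build_mecab_command(src, dst):
    # -o tells mecab to write the analysis to a file instead of stdout
    return ["mecab", "-o", dst, src]


def run_mecab(src="neko.txt", dst="neko.txt.mecab"):
    # Requires the mecab command to be installed;
    # check=True raises CalledProcessError if mecab exits with an error
    subprocess.run(build_mecab_command(src, dst), check=True)


# Show the command line that run_mecab() would execute
print(" ".join(build_mecab_command("neko.txt", "neko.txt.mecab")))
```

Calling `run_mecab()` in the directory that holds neko.txt should leave neko.txt.mecab next to it.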
Find the words that appear in the text and their frequencies, and list them in descending order of frequency.
```python
import pandas as pd


def read_text():
    # 0: surface form (surface)
    # 1: part of speech (pos)
    # 2: part-of-speech subclassification 1 (pos1)
    # 7: base form (base)
    df = pd.read_table('./neko.txt.mecab', sep='\t|,', header=None,
                       usecols=[0, 1, 2, 7], names=['surface', 'pos', 'pos1', 'base'],
                       skiprows=4, skipfooter=1, engine='python')
    # MeCab emits Japanese tags: 空白 = blank, 記号 = symbol
    return df[(df['pos'] != '空白') & (df['surface'] != 'EOS') & (df['pos'] != '記号')]


df = read_text()

df['surface'].value_counts()[:30]

# Exclude particles (助詞) and auxiliary verbs (助動詞)
df[~df['pos'].str.startswith('助')]['surface'].value_counts()[:30]
```
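To see why `usecols=[0, 1, 2, 7]` picks out the surface form, part of speech, subclassification and base form, here is a small sketch that splits a single line with the same tab-or-comma pattern passed as `sep`. The sample line is my own illustration in MeCab's default (IPAdic) output layout, not a line copied from neko.txt.mecab.

```python
import re

# One line in MeCab's default (IPAdic) layout:
# surface<TAB>pos,pos1,pos2,pos3,conjugation type,conjugation form,base,reading,pronunciation
sample_line = "吾輩\t名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ"

# Same delimiter pattern as the sep argument of read_table: a tab or a comma
fields = re.split(r"\t|,", sample_line)

# The column positions selected by usecols=[0, 1, 2, 7]
surface, pos, pos1, base = (fields[i] for i in (0, 1, 2, 7))
print(surface, pos, pos1, base)
```

One side effect of this pattern is that a surface form containing an ASCII comma would also be split, shifting the columns.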
As in [knock 19, "Calculate the frequency of appearance of the character string in the first column of each line and arrange it in descending order of frequency of appearance"](https://qiita.com/FukuharaYohei/items/87f0413b87c6109e8ca4#019%E5%90%84%E8%A1%8C%E3%81%AE1%E3%82%B3%E3%83%A9%E3%83%A0%E7%9B%AE%E3%81%AE%E6%96%87%E5%AD%97%E5%88%97%E3%81%AE%E5%87%BA%E7%8F%BE%E9%A0%BB%E5%BA%A6%E3%82%92%E6%B1%82%E3%82%81%E5%87%BA%E7%8F%BE%E9%A0%BB%E5%BA%A6%E3%81%AE%E9%AB%98%E3%81%84%E9%A0%86%E3%81%AB%E4%B8%A6%E3%81%B9%E3%82%8Bipynb), I use `value_counts` to count each distinct value and sort the result. Conveniently, it sorts in descending order by default.
```python
df['surface'].value_counts()[:30]
```
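A small sketch of what `value_counts` does, on a hypothetical list of surface forms standing in for `df['surface']`. It also cross-checks the result against `collections.Counter` from the standard library, which would be one way to do the same thing without pandas.

```python
from collections import Counter

import pandas as pd

# Hypothetical surface forms standing in for df['surface']
surface = pd.Series(["の", "は", "の", "猫", "の", "は"])

# Counts each distinct value; sorted by count in descending order by default
counts = surface.value_counts()
print(counts)

# Cross-check with the standard library
expected = Counter(surface).most_common()
print(expected)
```

Slicing the result with `[:30]`, as in the answer program, then keeps only the 30 most frequent words.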
Excluding particles and auxiliary verbs only requires one extra condition. Particles (助詞) and auxiliary verbs (助動詞) are the only parts of speech whose names start with 助 ("assist"), so I negate the result of `str.startswith`.
```python
# Exclude particles (助詞) and auxiliary verbs (助動詞)
df[~df['pos'].str.startswith('助')]['surface'].value_counts()[:30]
```
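The negated condition can be tried on a tiny hypothetical DataFrame (`df_demo` is made up for illustration and is not the output of `read_text`). As a defensive variation on the answer program, this sketch passes `na=False` so that a row with a missing `pos` counts as "not a match" and the mask stays purely boolean.

```python
import pandas as pd

# Hypothetical rows standing in for the output of read_text()
df_demo = pd.DataFrame({
    "surface": ["吾輩", "は", "猫", "で", "ある"],
    "pos": ["名詞", "助詞", "名詞", "助動詞", "助動詞"],
})

# True for 助詞 and 助動詞; na=False treats a missing pos as "not a match"
is_particle_or_aux = df_demo["pos"].str.startswith("助", na=False)

# ~ inverts the boolean mask, keeping everything else
content_words = df_demo[~is_particle_or_aux]["surface"]
print(content_words.tolist())
```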
Running the program produces the following results. First, the top 30 across all words. As expected, they are mostly particles and auxiliary verbs, so the content of the novel cannot be inferred from them.
Output result (all targets)
9109
6697
Is 6384
To 6147
6068
And 5476
Is 5259
3916
At 3774
Also 2433
2272
2264
Not 2254
From 2001
There is 1705
1579
Or 1446
1416
1249
Thing 1177
To 1033
986
974
Things 971
You 955
Say 937
Master 928
U 922
Yo 687
673
Name: surface, dtype: int64
Next, the result with particles and auxiliary verbs excluded. It is much easier to guess at "I Am a Cat" from this list than from the list of all words.
Output result (particles and auxiliary verbs excluded)
2201
1597
1249
Thing 1177
986
Things 971
You 955
Say 937
Master 928
There is 723
Not 708
Yo 687
Hmm 667
This 635
Go 598
That 560
What 518
I 477
Person 449
Yes 448
443
Become 410
403
This 397
It 370
Coming 367
See 349
Labyrinth 343
Re 327
Time 316
Name: surface, dtype: int64