This is a record of problem 32, "Base form of verbs", from "Chapter 4: Morphological analysis" of Language Processing 100 Knocks 2015 (.ac.jp/nlp100/#ch4). As in the previous article, I'm using pandas, so the whole task fits in a single statement; it is almost too easy. It hardly needs to be an independent article ...
Link | Remarks |
---|---|
032.The original form of the verb.ipynb | GitHub link to the answer program |
100 amateur language processing knocks:32 | Source from which I copied and pasted many parts of the code |
MeCab Official | The first MeCab page to look at |
type | version | Contents |
---|---|---|
OS | Ubuntu 18.04.01 LTS | Running in a virtual machine |
pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
Python | 3.8.1 | I'm using Python 3.8.1 on pyenv. Packages are managed using venv |
MeCab | 0.996-5 | Installed with apt-get |
In the above environment, I am using the following additional Python packages. Just install with regular pip.
type | version |
---|---|
pandas | 1.0.1 |
Apply the morphological analyzer MeCab to Natsume Soseki's novel "I Am a Cat" to obtain the statistics of the words in the novel.
Morphological analysis, MeCab, part of speech, frequency of occurrence, Zipf's law, matplotlib, Gnuplot
Use MeCab to morphologically analyze the text (neko.txt) of Natsume Soseki's novel "I Am a Cat" and save the result in a file called neko.txt.mecab. Use this file to implement programs that address the following questions.
For problems 37, 38, and 39, use matplotlib or Gnuplot.
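The analysis step described above (neko.txt in, neko.txt.mecab out) can be sketched from Python as follows. This is only a minimal sketch, assuming the `mecab` command is on the PATH and that its `-o` option sets the output file; the helper names `build_mecab_command` and `analyze` are hypothetical and not part of the original answer. Running `mecab neko.txt -o neko.txt.mecab` directly in a shell should do the same thing.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_mecab_command(src: str, dst: str) -> List[str]:
    """Build the command line `mecab <src> -o <dst>` (hypothetical helper)."""
    return ["mecab", src, "-o", dst]


def analyze(src: str = "neko.txt", dst: str = "neko.txt.mecab") -> bool:
    """Run MeCab only if the binary and the input file exist.

    Returns True when MeCab was actually executed, False when skipped.
    """
    if shutil.which("mecab") is None or not Path(src).exists():
        return False
    # check=True raises CalledProcessError if MeCab exits with an error.
    subprocess.run(build_mecab_command(src, dst), check=True)
    return True


if __name__ == "__main__":
    print("analyzed" if analyze() else "skipped: mecab or neko.txt not found")
```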
Extract all the base forms of the verbs.
import pandas as pd


def read_text():
    # Columns kept from the MeCab output:
    # 0: surface form (surface)
    # 1: part of speech (pos)
    # 2: part-of-speech subcategory 1 (pos1)
    # 7: base form (base)
    df = pd.read_table('./neko.txt.mecab', sep='\t|,', header=None,
                       usecols=[0, 1, 2, 7], names=['surface', 'pos', 'pos1', 'base'],
                       skiprows=4, skipfooter=1, engine='python')
    # MeCab's output uses Japanese tags: 空白 = blank, 記号 = symbol
    return df[(df['pos'] != '空白') & (df['surface'] != 'EOS') & (df['pos'] != '記号')]


df = read_text()
# 動詞 = verb
df[df['pos'] == '動詞']['base']
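To make the column numbers in the comments concrete, here is a small standard-library sketch of how the separator pattern `\t|,` splits one line of MeCab output. The sample line is an illustrative assumption about the IPA dictionary format (surface form, a tab, then comma-separated features), not a line copied from the actual neko.txt.mecab.

```python
import re

# Assumed sample line: surface form, a tab, then comma-separated features.
LINE = "生れ\t動詞,自立,*,*,一段,連用形,生れる,ウマレ,ウマレ"

# Splitting on "tab or comma" turns the line into one flat list of fields.
fields = re.split(r"\t|,", LINE)

# Pick the same column positions that the pandas call keeps via usecols.
record = {
    "surface": fields[0],  # form as it appears in the text
    "pos": fields[1],      # part of speech (動詞 = verb)
    "pos1": fields[2],     # part-of-speech subcategory 1
    "base": fields[7],     # base (dictionary) form
}
print(record)
```

Column 0 holds the inflected form as written in the novel, while column 7 holds the dictionary form, which is the one this problem asks for.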
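The only difference from the previous problem, "surface form of verbs", is that we now extract the base form. With pandas, you just rewrite the selection a little: the filter on verbs stays the same (動詞 is the tag MeCab emits for verbs), and the `base` column is selected.
df[df['pos'] == '動詞']['base']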
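The difference between the surface form and the base form can be checked without the full novel. The following is a minimal sketch that feeds a few hand-written lines to the same `read_table` settings through `io.StringIO`; the sample lines are assumptions modelled on the IPA dictionary output format, and `skiprows` / `skipfooter` are left out because the sample has no lines to skip.

```python
import io

import pandas as pd

# Assumed sample in MeCab format (not taken from the real neko.txt.mecab).
SAMPLE = (
    "生れ\t動詞,自立,*,*,一段,連用形,生れる,ウマレ,ウマレ\n"
    "た\t助動詞,*,*,*,特殊・タ,基本形,た,タ,タ\n"
    "泣い\t動詞,自立,*,*,五段・カ行イ音便,連用形,泣く,ナイ,ナイ\n"
    "て\t助詞,接続助詞,*,*,*,*,て,テ,テ\n"
    "。\t記号,句点,*,*,*,*,。,。,。\n"
    "EOS\n"
)

df = pd.read_table(io.StringIO(SAMPLE), sep="\t|,", header=None,
                   usecols=[0, 1, 2, 7],
                   names=["surface", "pos", "pos1", "base"],
                   engine="python")

# Drop the EOS marker and symbols (記号), then keep only verbs (動詞).
tokens = df[(df["surface"] != "EOS") & (df["pos"] != "記号")]
verbs = tokens[tokens["pos"] == "動詞"]

surface_forms = verbs["surface"].tolist()  # what the previous problem extracts
base_forms = verbs["base"].tolist()        # what this problem extracts
print(surface_forms)
print(base_forms)
```

The row filter is identical in both cases; only the selected column differs.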
When the program is executed, the following results will be output.
Output result
13 born
19
31 cry
37
39
...
212527 die
212532 get
212537 die
212540 get
212541
Name: base, Length: 28119, dtype: object