This is a record of problem 32, "Original form of verb", from "Chapter 4: Morphological analysis" of Language Processing 100 Knocks 2015 (.ac.jp/nlp100/#ch4). As in the previous post, I'm using pandas, so the whole thing fits in a single statement; it is almost too easy. It hardly needs to be an independent article ...

Reference link

Link	Remarks
032.The original form of the verb.ipynb	Answer program GitHub link
100 amateur language processing knocks:32	Source from which many parts of the code were copied
MeCab Official	The first MeCab page to look at

Environment

type	version	Contents
OS	Ubuntu 18.04.01 LTS	Running in a virtual machine
pyenv	1.2.16	I use pyenv because I sometimes use multiple Python environments
Python	3.8.1	I'm using Python 3.8.1 on pyenv. Packages are managed using venv
MeCab	0.996-5	Installed with apt-get

In addition to the environment above, I am using the following Python package. It can be installed with plain pip.

type	version
pandas	1.0.1

Chapter 4: Morphological analysis

Content of study

Apply the morphological analyzer MeCab to Natsume Soseki's novel "I Am a Cat" to obtain the statistics of the words in the novel.

Morphological analysis, MeCab, part of speech, frequency of occurrence, Zipf's law, matplotlib, Gnuplot

Knock content

Use MeCab to morphologically analyze the text (neko.txt) of Natsume Soseki's novel "I Am a Cat" and save the result in a file called neko.txt.mecab. Use this file to implement programs that address the following questions.
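
As a rough sketch of that preparation step, the following assumes the `mecab` command is on the PATH and that neko.txt sits in the current directory. The helper names `build_mecab_command` and `run_mecab` are hypothetical and do not come from the original notebook; the only MeCab feature relied on is the `-o` option, which writes the analysis result to a file.

```python
import shutil
import subprocess


def build_mecab_command(src="neko.txt", dst="neko.txt.mecab"):
    """Return the command line that analyzes src and writes the result to dst."""
    # `mecab INPUT -o OUTPUT` writes the analysis result to OUTPUT.
    return ["mecab", src, "-o", dst]


def run_mecab(src="neko.txt", dst="neko.txt.mecab"):
    """Run MeCab if it is installed; return True when the command was run."""
    if shutil.which("mecab") is None:
        # MeCab is not installed (or not on PATH), so do nothing.
        return False
    subprocess.run(build_mecab_command(src, dst), check=True)
    return True


if __name__ == "__main__":
    print(build_mecab_command())
```

Running the same command directly in a shell works just as well; wrapping it in Python only keeps the whole workflow in one notebook.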

For problems 37, 38, and 39, use matplotlib or Gnuplot.

32. Original form of verb

Extract all the original forms of the verb.

Answer

Answer program [032.The original form of the verb.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/04.%E5%BD%A2%E6%85%8B%E7%B4%A0%E8%A7%A3%E6%9E%90/032.%E5%8B%95%E8%A9%9E%E3%81%AE%E5%8E%9F%E5%BD%A2.ipynb)
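
Before reading the program, it may help to see how a single line of MeCab output maps onto the columns it uses. The sketch below assumes the default IPA dictionary layout (surface form, a tab, then comma-separated features), in which part-of-speech labels are written in Japanese, for example 動詞 for "verb". The sample line is hand-written for illustration and `split_mecab_line` is a hypothetical helper, not part of the answer program.

```python
import re

# Hand-written sample in the layout of MeCab's default (IPA dictionary) output:
# surface<TAB>pos,pos1,pos2,pos3,conjugation type,conjugation form,base,reading,pronunciation
SAMPLE_LINE = "生れ\t動詞,自立,*,*,一段,連用形,生れる,ウマレ,ウマレ"


def split_mecab_line(line):
    """Split one MeCab output line on tab or comma and keep columns 0, 1, 2, 7."""
    fields = re.split(r"\t|,", line.rstrip("\n"))
    if len(fields) < 8:
        # Lines such as "EOS" carry no feature columns.
        return None
    return {
        "surface": fields[0],
        "pos": fields[1],
        "pos1": fields[2],
        "base": fields[7],
    }


print(split_mecab_line(SAMPLE_LINE))
print(split_mecab_line("EOS"))
```

The regular expression `\t|,` is the same separator the program passes to pandas, which is why columns 0, 1, 2 and 7 are the ones it keeps.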

import pandas as pd

def read_text():
    # Columns kept from the MeCab output:
    # 0: Surface form (surface)
    # 1: Part of speech (pos)
    # 2: Part-of-speech subclassification 1 (pos1)
    # 7: Base (uninflected) form (base)
    df = pd.read_table('./neko.txt.mecab', sep='\t|,', header=None,
                       usecols=[0, 1, 2, 7], names=['surface', 'pos', 'pos1', 'base'],
                       skiprows=4, skipfooter=1, engine='python')
    # MeCab's IPA dictionary writes part-of-speech labels in Japanese.
    # Drop blanks (空白), EOS markers and symbols (記号).
    return df[(df['pos'] != '空白') & (df['surface'] != 'EOS') & (df['pos'] != '記号')]

df = read_text()
# 動詞 = verb
df[df['pos'] == '動詞']['base']

Answer commentary

The only change from the previous problem, "surface form of the verb", is that we now want the "original form of the verb". With pandas, this is a small rewrite of the selection: keep the rows whose part of speech is verb (動詞) and take the `base` column.

df[df['pos'] == '動詞']['base']
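
To make the selection concrete without the full novel, here is a minimal sketch on a made-up four-row DataFrame with the same column names that `read_text()` produces. The rows are invented for illustration and use the Japanese labels that MeCab's IPA dictionary emits; `.loc` with a boolean mask is used as one way of writing the same filter in a single indexing step.

```python
import pandas as pd

# Toy rows (made up for illustration) with the same columns as read_text() returns.
toy = pd.DataFrame({
    "surface": ["生れ", "た", "泣い", "て"],
    "pos": ["動詞", "助動詞", "動詞", "助詞"],
    "pos1": ["自立", "*", "自立", "接続助詞"],
    "base": ["生れる", "た", "泣く", "て"],
})

# Boolean mask: True for rows whose part of speech is "verb" (動詞).
is_verb = toy["pos"] == "動詞"

# Select the base-form column for those rows; the original index is preserved.
verb_bases = toy.loc[is_verb, "base"]
print(verb_bases.tolist())
```

Because the original row index is preserved, the numbers in the left-hand column of the real output are positions in the full token table, not a running count of verbs.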

Output result (execution result)

When the program is executed, the following results will be output.

13 born
19
31 cry
37
39
         ... 
212527 die
212532 get
212537 die
212540 get
212541
Name: base, Length: 28119, dtype: object

100 Language Processing Knock-32 (using pandas): Prototype of verb