This is a record of problem 31, "Verb", from "Chapter 4: Morphological analysis" of Language Processing 100 Knocks 2015 (.ac.jp/nlp100/#ch4). Because I use pandas, the whole task fits in a single statement, which makes it almost too easy.

Reference link

Link	Remarks
031.verb.ipynb	GitHub link to the answer program
100 amateur language processing knocks:31	Source from which much of the code was copied
MeCab Official	The first MeCab page to look at

Environment

Type	Version	Contents
OS	Ubuntu18.04.01 LTS	Running as a virtual machine
pyenv	1.2.16	Used because I sometimes work with multiple Python environments
Python	3.8.1	Python 3.8.1 on pyenv; packages are managed with venv
MeCab	0.996-5	Installed with apt-get

In addition to the environment above, I use the following Python package. It installs with plain pip.

Type	Version
pandas	1.0.1

Chapter 4: Morphological analysis

Content of study

Apply the morphological analyzer MeCab to Natsume Soseki's novel "I Am a Cat" to obtain the statistics of the words in the novel.

Morphological analysis, MeCab, part of speech, frequency of occurrence, Zipf's law, matplotlib, Gnuplot

Knock content

Use MeCab to morphologically analyze the text of Natsume Soseki's novel "I Am a Cat" (neko.txt) and save the result in a file called neko.txt.mecab. Use this file to implement programs that address the following problems.
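The analysis step itself is not shown in this article. As a minimal sketch, assuming the mecab command is on the PATH and neko.txt is in the current directory, it could be driven from Python as below. The helper names build_mecab_command and run_mecab are hypothetical; the -o option is MeCab's output-file option.

```python
import shutil
import subprocess


def build_mecab_command(src, dst):
    """Return the argument list equivalent to `mecab <src> -o <dst>`."""
    return ['mecab', src, '-o', dst]


def run_mecab(src='neko.txt', dst='neko.txt.mecab'):
    """Run MeCab if it is installed; return True when the command was run."""
    if shutil.which('mecab') is None:
        return False  # MeCab was not found on PATH, so nothing is written
    subprocess.run(build_mecab_command(src, dst), check=True)
    return True


# Building the command has no side effects; call run_mecab() to execute it.
command = build_mecab_command('neko.txt', 'neko.txt.mecab')
print(' '.join(command))
```

Calling run_mecab() with no arguments would then produce the neko.txt.mecab file that the rest of this article reads.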

For problems 37, 38, and 39, use matplotlib or Gnuplot.

31. Verb

Extract all surface forms of verbs.

Answer

Answer program: [031. Verb.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/04.%E5%BD%A2%E6%85%8B%E7%B4%A0%E8%A7%A3%E6%9E%90/031.%E5%8B%95%E8%A9%9E.ipynb)

import pandas as pd

def read_text():
    # 0: Surface form (surface)
    # 1: Part of speech (pos)
    # 2: Part-of-speech subcategory 1 (pos1)
    # 7: Base form (base)
    df = pd.read_table('./neko.txt.mecab', sep='\t|,', header=None,
                       usecols=[0, 1, 2, 7], names=['surface', 'pos', 'pos1', 'base'],
                       skiprows=4, skipfooter=1, engine='python')
    # MeCab writes its tags in Japanese: 空白 = blank, 記号 = symbol
    return df[(df['pos'] != '空白') & (df['surface'] != 'EOS') & (df['pos'] != '記号')]

df = read_text()
# 動詞 = verb
df[df['pos'] == '動詞']['surface']
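To make the column numbers in the comments concrete, here is a small sketch that splits one line on the same tab-or-comma pattern using only the standard library. The sample line is an illustrative line in MeCab's IPA dictionary format, not one copied from neko.txt.mecab.

```python
import re

# Same separator as the sep argument above: a tab or a comma.
SEPARATOR = re.compile(r'\t|,')

# Illustrative line (an assumption about the dictionary format in use).
line = '生れ\t動詞,自立,*,*,一段,連用形,生れる,ウマレ,ウマレ'

fields = SEPARATOR.split(line)

# Pick the same positions as usecols=[0, 1, 2, 7].
names = ['surface', 'pos', 'pos1', 'base']
record = dict(zip(names, (fields[i] for i in (0, 1, 2, 7))))
print(record)
```

The surface form comes before the tab and the features follow as a comma-separated list, which is why a single regular expression covering both delimiters is enough.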

Answer commentary

Delete extra rows in DataFrame

Here I remove the extra rows that were read from the file. Strictly speaking, this selects only the rows that are needed rather than deleting anything. The condition df['pos'] != '空白' (blank) should really be applied to pos1 (part-of-speech subcategory 1), but as I explained last time, the blank rows are shifted by one column, so the condition has to be applied to pos (part of speech) instead.

df[(df['pos'] != '空白') & (df['surface'] != 'EOS') & (df['pos'] != '記号')]
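One plausible explanation for the one-column shift, offered here as an assumption rather than something this article states: if leading whitespace is stripped from a line before it is split on the regular expression, a line whose surface form is a full-width space loses its first field. The sketch below reproduces that effect with the standard library on an illustrative line.

```python
import re

SEPARATOR = re.compile(r'\t|,')

# Illustrative MeCab line for a full-width space (U+3000), IPA dictionary format.
blank_line = '\u3000\t記号,空白,*,*,*,*,\u3000,\u3000,\u3000'

# Split as written: the surface is the space and pos is 記号 (symbol).
as_written = SEPARATOR.split(blank_line)

# Split after stripping: the leading space and tab vanish, so every field
# moves one column to the left and 空白 (blank) lands in the pos position.
after_strip = SEPARATOR.split(blank_line.strip())

print(as_written[:3])
print(after_strip[:3])
```

Under this assumption, the shifted rows have 記号 in surface and 空白 in pos, which matches the filter being written against pos.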

After removing the extra rows, print(df.info()) shows the following DataFrame information.

Int64Index: 180417 entries, 0 to 212550
Data columns (total 4 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   surface  180417 non-null  object
 1   pos      180417 non-null  object
 2   pos1     180417 non-null  object
 3   base     180417 non-null  object
dtypes: object(4)
memory usage: 6.9+ MB

And the first and last 5 lines of the DataFrame.

Surface form extraction of verbs

This is the part that extracts the surface forms of the verbs.

df[df['pos'] == '動詞']['surface']
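The same selection can also be written as a single .loc lookup. The sketch below uses a hypothetical six-row DataFrame (not rows from the real file) and also shows that filtering keeps the original row labels, which is why the index of the result has gaps.

```python
import pandas as pd

# Hypothetical miniature of the parsed MeCab output.
sample = pd.DataFrame({
    'surface': ['吾輩', 'は', '生れ', 'た', '泣い', 'て'],
    'pos': ['名詞', '助詞', '動詞', '助動詞', '動詞', '助詞'],
})

# Boolean mask: True on rows whose part of speech is 動詞 (verb).
is_verb = sample['pos'] == '動詞'

verbs_chained = sample[is_verb]['surface']   # form used in this article
verbs_loc = sample.loc[is_verb, 'surface']   # equivalent single lookup

verb_list = verbs_loc.tolist()
print(verbs_loc)
```

Both forms return the same Series. The .loc form does the row and column selection in one step, which is the usual choice if the result will later be assigned to.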

Output result (execution result)

Running the program produces the following output.

13 Born
19
31 crying
37
39
          ..
212527 dead
212532 get
212537 dead
212540 gain
212541
Name: surface, Length: 28119, dtype: object

100 Language Processing Knock-31 (using pandas): Verb