This is my record of problem 31, "Verbs", from "Chapter 4: Morphological Analysis" (.ac.jp/nlp100/#ch4) of Language Processing 100 Knocks 2015.
Because I use pandas, the whole task fits in a single statement, which is so easy that it feels almost anticlimactic.
Link | Remarks |
---|---|
031.verb.ipynb | GitHub link to the answer program |
100 amateur language processing knocks:31 | Source from which I copied many parts of the code |
MeCab Official | The first MeCab page to look at |
Type | Version | Notes |
---|---|---|
OS | Ubuntu18.04.01 LTS | Running in a virtual machine |
pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
Python | 3.8.1 | I'm using Python 3.8.1 on pyenv. Packages are managed using venv |
MeCab | 0.996-5 | Installed with apt-get |
On top of the environment above, I use the following additional Python package. A regular pip install is enough.
type | version |
---|---|
pandas | 1.0.1 |
Apply the morphological analyzer MeCab to Natsume Soseki's novel "I Am a Cat" to obtain the statistics of the words in the novel.
Morphological analysis, MeCab, part of speech, frequency of occurrence, Zipf's law, matplotlib, Gnuplot
Use MeCab to morphologically analyze the text (neko.txt) of Natsume Soseki's novel "I Am a Cat" and save the result in a file called neko.txt.mecab. Use this file to implement programs that address the following questions.
For problems 37, 38, and 39, use matplotlib or Gnuplot.
Extract all the surface forms of verbs.
import pandas as pd


def read_text():
    # Columns kept from the MeCab output:
    # 0: surface form (surface)
    # 1: part of speech (pos)
    # 2: part-of-speech subcategory 1 (pos1)
    # 7: base form (base)
    df = pd.read_table('./neko.txt.mecab', sep='\t|,', header=None,
                       usecols=[0, 1, 2, 7], names=['surface', 'pos', 'pos1', 'base'],
                       skiprows=4, skipfooter=1, engine='python')
    # Drop blanks ('空白'), EOS markers, and symbols ('記号')
    return df[(df['pos'] != '空白') & (df['surface'] != 'EOS') & (df['pos'] != '記号')]


df = read_text()
# '動詞' is the Japanese label for verbs in the MeCab output
df[df['pos'] == '動詞']['surface']
Here I remove the extra rows read from the file. Strictly speaking, the code selects only the rows that are needed rather than deleting anything. The condition df['pos'] != '空白' ('空白' means blank) should really be applied to pos1 (part-of-speech subcategory 1), but as explained in the previous post, the columns of a blank row are shifted by one, so the condition has to be applied to pos (part of speech) instead.
# '空白' = blank, '記号' = symbol (Japanese labels in the MeCab output)
df[(df['pos'] != '空白') & (df['surface'] != 'EOS') & (df['pos'] != '記号')]
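The parsing and filtering steps can be tried on a small in-memory sample instead of the real neko.txt.mecab. This is only a minimal sketch. The four sample lines are hypothetical and hand-written, assuming IPADIC-style MeCab output (surface, a tab, then comma-separated features) with the Japanese labels 動詞 (verb), 記号 (symbol), and 空白 (blank). The names `sample` and `filtered` are my own, and `skiprows` / `skipfooter` are omitted because the sample has no extra leading or trailing lines.

```python
import io

import pandas as pd

# Hypothetical sample, assuming MeCab's default output layout:
# surface<TAB>pos,pos1,pos2,pos3,conj. type,conj. form,base,reading,pronunciation
sample = (
    "生れ\t動詞,自立,*,*,一段,連用形,生れる,ウマレ,ウマレ\n"
    "　\t記号,空白,*,*,*,*,　,　,　\n"   # full-width space (a "blank" row)
    "。\t記号,句点,*,*,*,*,。,。,。\n"
    "EOS\n"                              # sentence terminator, no features
)

# The regex separator splits each line on either a tab or a comma.
df = pd.read_table(io.StringIO(sample), sep='\t|,', header=None,
                   usecols=[0, 1, 2, 7],
                   names=['surface', 'pos', 'pos1', 'base'],
                   engine='python')

# Same row filter as in the article: keep everything that is not
# a blank, an EOS marker, or a symbol.
filtered = df[(df['pos'] != '空白') & (df['surface'] != 'EOS') & (df['pos'] != '記号')]
print(filtered)
```

Printing `df` before the filter is a quick way to check which column each sample row landed in.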
After the extra rows are removed, print(df.info()) shows the following DataFrame information.
Int64Index: 180417 entries, 0 to 212550
Data columns (total 4 columns):
 #   Column   Non-Null Count   Dtype
---  ------   --------------   -----
 0   surface  180417 non-null  object
 1   pos      180417 non-null  object
 2   pos1     180417 non-null  object
 3   base     180417 non-null  object
dtypes: object(4)
memory usage: 6.9+ MB
And the first and last 5 lines of the DataFrame.
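One way to produce such a view is to stack `head()` and `tail()`. This is a minimal sketch on a hypothetical stand-in DataFrame. The names `demo` and `preview` are illustrative, and the values are placeholders, not data from the novel.

```python
import pandas as pd

# Hypothetical stand-in for the filtered DataFrame; values are placeholders.
demo = pd.DataFrame({
    'surface': [f'w{i}' for i in range(20)],
    'pos': ['x'] * 20,
})

# head(5) and tail(5) each return a DataFrame, so concat can stack them.
preview = pd.concat([demo.head(5), demo.tail(5)])
print(preview)
```

The original index labels are preserved, so the gap between the two halves stays visible in the output.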
This part extracts the surface forms of the verbs.
# '動詞' = verb (Japanese label in the MeCab output)
df[df['pos'] == '動詞']['surface']
When the program is executed, the following results will be output.
Output result
13 Born
19
31 crying
37
39
..
212527 dead
212532 get
212537 dead
212540 gain
212541
Name: surface, Length: 28119, dtype: object
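As an aside, the same selection can also be written as a single `.loc` call, which picks the rows and the column in one step instead of chaining two indexing operations. This is a minimal sketch on hypothetical rows, not taken from the real file. `sample_df`, `chained`, and `verbs` are illustrative names, and 動詞 is assumed to be the label for verbs, as in IPADIC-style MeCab output.

```python
import pandas as pd

# Hypothetical rows; 名詞 = noun, 動詞 = verb, 助動詞 = auxiliary verb.
sample_df = pd.DataFrame({
    'surface': ['吾輩', '生れ', 'た'],
    'pos': ['名詞', '動詞', '助動詞'],
})

# Chained form, as used in the article.
chained = sample_df[sample_df['pos'] == '動詞']['surface']

# Equivalent single-step form: boolean row mask plus column label.
verbs = sample_df.loc[sample_df['pos'] == '動詞', 'surface']

print(verbs.tolist())
```

For read-only extraction like this, both forms return the same Series. The `.loc` form matters mainly when the selection is later used for assignment.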