This is a record of problem 33, "Sahen nouns", from "Chapter 4: Morphological Analysis" of Language Processing 100 Knocks 2015 (.ac.jp/nlp100/#ch4). As with the previous problem, it is very easy: only the extraction condition changes.
Link | Remarks |
---|---|
033.Sa hen noun.ipynb | GitHub link to the answer program |
100 amateur language processing knocks:33 | Source from which many parts of the code were copied |
MeCab Official | The first MeCab page to look at |
type | version | Contents |
---|---|---|
OS | Ubuntu18.04.01 LTS | Running in a virtual machine |
pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
Python | 3.8.1 | Python 3.8.1 on pyenv; packages are managed with venv |
MeCab | 0.996-5 | Installed with apt-get |
In the above environment, I am using the following additional Python packages. Just install with regular pip.
type | version |
---|---|
pandas | 1.0.1 |
Apply the morphological analyzer MeCab to Natsume Soseki's novel "I Am a Cat" to obtain the statistics of the words in the novel.
Morphological analysis, MeCab, part of speech, frequency of occurrence, Zipf's law, matplotlib, Gnuplot
Run morphological analysis with MeCab on the text of Natsume Soseki's novel "I Am a Cat" (neko.txt) and save the result to a file named neko.txt.mecab. Use this file to implement programs that address the following questions.
For problems 37, 38, and 39, use matplotlib or Gnuplot.
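The neko.txt.mecab file described above can be produced from Python as well as from the shell. The following is a minimal sketch, assuming the `mecab` command is on PATH and that neko.txt sits in the current directory; the helper names `build_mecab_command` and `run_mecab` are hypothetical and are not part of the original answer program.

```python
import shutil
import subprocess


def build_mecab_command(src="neko.txt", dst="neko.txt.mecab"):
    """Return the argument list for `mecab SRC -o DST` (file names are assumptions)."""
    return ["mecab", src, "-o", dst]


def run_mecab(src="neko.txt", dst="neko.txt.mecab"):
    """Run MeCab if the `mecab` executable is available; return True when it ran."""
    if shutil.which("mecab") is None:
        # MeCab is not installed or not on PATH, so there is nothing to run.
        return False
    # check=True raises CalledProcessError if MeCab exits with an error.
    subprocess.run(build_mecab_command(src, dst), check=True)
    return True
```

Calling `run_mecab()` once before the analysis is enough; the later steps only read the saved file.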
Extract all sahen-connection (サ変接続) nouns.
```python
import pandas as pd


def read_text():
    # 0: surface form (surface)
    # 1: part of speech (pos)
    # 2: part-of-speech subcategory 1 (pos1)
    # 7: base form (base)
    df = pd.read_table('./neko.txt.mecab', sep='\t|,', header=None,
                       usecols=[0, 1, 2, 7], names=['surface', 'pos', 'pos1', 'base'],
                       skiprows=4, skipfooter=1, engine='python')
    # MeCab writes Japanese labels: 空白 = blank, 記号 = symbol
    return df[(df['pos'] != '空白') & (df['surface'] != 'EOS') & (df['pos'] != '記号')]


df = read_text()
# 名詞 = noun, サ変接続 = sahen connection
df[(df['pos'] == '名詞') & (df['pos1'] == 'サ変接続')]
```
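To make the column numbers 0, 1, 2 and 7 concrete, here is a small standard-library sketch of the same split-and-filter idea. It assumes MeCab's default Japanese part-of-speech labels (名詞 for noun, サ変接続 for sahen connection). The sample lines are hand-written in the usual `surface<TAB>feature,feature,...` layout and are not real output from neko.txt; `parse_line` and `sahen_nouns` are hypothetical helper names.

```python
import re

# Hand-written sample lines (not taken from neko.txt.mecab):
# surface<TAB>pos,pos1,pos2,pos3,conjugation type,conjugation form,base,...
SAMPLE = """\
勉強\t名詞,サ変接続,*,*,*,*,勉強,ベンキョウ,ベンキョー
する\t動詞,自立,*,*,サ変・スル,基本形,する,スル,スル
猫\t名詞,一般,*,*,*,*,猫,ネコ,ネコ
EOS
"""


def parse_line(line):
    """Split one line on tab or comma and keep columns 0, 1, 2 and 7."""
    cols = re.split(r"\t|,", line)
    if len(cols) < 8:
        return None  # e.g. the EOS marker, which has no feature columns
    return {"surface": cols[0], "pos": cols[1], "pos1": cols[2], "base": cols[7]}


def sahen_nouns(text):
    """Return surface forms whose pos is 名詞 and whose pos1 is サ変接続."""
    rows = (parse_line(line) for line in text.splitlines())
    return [r["surface"] for r in rows
            if r and r["pos"] == "名詞" and r["pos1"] == "サ変接続"]


print(sahen_nouns(SAMPLE))  # ['勉強']
```

Splitting on both tab and comma is simple, but a surface form that is itself a comma would shift the columns, so this sketch is only suitable for quick inspection.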
Only the following statement differs from the previous problem, and the change is minor.

```python
df[(df['pos'] == '名詞') & (df['pos1'] == 'サ変接続')]
```
Running the program outputs the following result. Isn't entry 75, "yes", a mistake by MeCab?