This is a record of problem 30, "Reading morphological analysis results", from "Chapter 4: Morphological analysis" of Language Processing 100 Knocks 2015 (.ac.jp/nlp100/#ch4). With the "Morphological analysis" chapter, things start to feel like full-scale language processing. Morphological analysis is a technique that splits a sentence such as "omachi shite orimasu" ("I am waiting") into "omachi", "shi", "te", "ori", and "masu", and attaches information such as the part of speech to each piece. For more details, see Wikipedia and similar sources.

Reference link

Link	Remarks
030.Reading morphological analysis results.ipynb	Answer program GitHub link
100 amateur language processing knocks:30	The source from which I copied much of the code
MeCab Official	The first MeCab page to look at

Environment

From this article on I am using Python 3.8.1 (it was 3.6.9 until last time). In Chapter 3, "Regular Expressions", I used collections.OrderedDict to get an ordered dictionary type, but since Python 3.7 the standard dict type is also guaranteed to preserve insertion order (.org/ja/3/whatsnew/3.7.html). There was no particular reason to stay on 3.6.9, so I refreshed the environment. I have forgotten how I installed MeCab. I installed it about a year ago, and I don't remember running into any trouble.

type	version	Contents
OS	Ubuntu18.04.01 LTS	Running in a virtual machine
pyenv	1.2.16	I use pyenv because I sometimes use multiple Python environments
Python	3.8.1	Python 3.8.1 running on pyenv. Packages are managed using venv
Mecab	0.996-5	Installed with apt-get

In the above environment, I am using the following additional Python packages. Just install with regular pip.

type	version
pandas	1.0.1

Chapter 4: Morphological analysis

Content of study

Apply the morphological analyzer MeCab to Natsume Soseki's novel "I Am a Cat" to obtain the statistics of the words in the novel.

Morphological analysis, MeCab, part of speech, frequency of occurrence, Zipf's law, matplotlib, Gnuplot

Knock content

Use MeCab to morphologically analyze the text of Natsume Soseki's novel "I Am a Cat" (neko.txt) and save the result in a file called neko.txt.mecab. Use this file to implement programs that address the following questions.

For problems 37, 38, and 39, use matplotlib or Gnuplot.

30. Reading morphological analysis results

Implement a program that reads the morphological analysis result (neko.txt.mecab). Store each morpheme in a mapping type whose keys are the surface form (surface), base form (base), part of speech (pos), and part of speech subclassification 1 (pos1), and represent one sentence as a list of morphemes (mapping types). For the rest of the problems in Chapter 4, use the program created here.

Problem supplement (About "MeCab")

MeCab is the standard tool for morphological analysis. For a comparison with other morphological analyzers, I referred to the article "Comparison of morphological analyzers at the end of 2019" (after reading the comparison, I decided MeCab would do). MeCab outputs information in the following format for each word it splits off. Note that the delimiters are a tab (\t) and commas (why two kinds?).

Surface form\tPart of speech,Part of speech subclassification 1,Part of speech subclassification 2,Part of speech subclassification 3,Conjugation type,Conjugation form,Base form,Reading,Pronunciation

For example, for the sentence "すもももももももものうち" ("Sumomo mo momo mo momo no uchi"), the output is as follows (values shown in translation).

No	Surface form	Part of speech	Part of speech subclassification 1	Part of speech subclassification 2	Part of speech subclassification 3	Conjugation type	Conjugation form	Base form	Reading	Pronunciation
1	Plum	noun	General	*	*	*	*	Plum	Plum	Plum
2	Also	Particle	Binding particle	*	*	*	*	Also	Mo	Mo
3	Peaches	noun	General	*	*	*	*	Peaches	peach	peach
4	Also	Particle	Binding particle	*	*	*	*	Also	Mo	Mo
5	Peaches	noun	General	*	*	*	*	Peaches	peach	peach
6	of	Particle	Attributive	*	*	*	*	of	No	No
7	home	noun	Non-independent	Adverbs possible	*	*	*	home	Uchi	Uchi
8	EOS
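The format above can be turned into the mapping type that problem 30 asks for with plain string handling. The following is only a minimal sketch: the function name `parse_mecab_line` is hypothetical, and the sample line assumes MeCab's default output format with untranslated Japanese values.

```python
def parse_mecab_line(line):
    """Turn one MeCab output line into a dict, or None for the EOS marker."""
    line = line.rstrip('\n')
    if line == 'EOS':
        return None
    # The surface form is separated from the features by the first tab.
    surface, _, feature = line.partition('\t')
    # The features themselves are comma-separated.
    fields = feature.split(',')
    return {
        'surface': surface,
        'base': fields[6],   # base form
        'pos': fields[0],    # part of speech
        'pos1': fields[1],   # part of speech subclassification 1
    }


# Assumed sample line for the first morpheme of the example sentence.
sample = 'すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ'
print(parse_mecab_line(sample))
```

Splitting on the first tab before splitting on commas keeps the surface form separate from the features, even when the surface form itself is unusual.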

Answer

Answer program (Run MeCab) [Chapter 4_ Morphological analysis.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/04.%E5%BD%A2%E6%85%8B%E7%B4%A0%E8%A7%A3%E6%9E%90/%E7%AC%AC4%E7%AB%A0_%20%E5%BD%A2%E6%85%8B%E7%B4%A0%E8%A7%A3%E6%9E%90.ipynb)

The MeCab run below is the prerequisite for all of Chapter 4.

Use MeCab to morphologically analyze the text of Natsume Soseki's novel "I Am a Cat" (neko.txt) and save the result in a file called neko.txt.mecab.

It is a simple step that finishes with a single command.

mecab neko.txt -o neko.txt.mecab

For the time being, I also wrote the program below in Python using mecab-python3 (ver. 0.996.3), but its result differs slightly from what the command produces. **The sentences were not separated by EOS (End Of Sentence)**, which was fatal for the subsequent knocks. I may be specifying the options incorrectly, but I did not want to dig deeper, so I do not use the output of the Python program in the later knocks.

import MeCab
mecab = MeCab.Tagger()

with open('./neko.txt') as in_file, \
    open('./neko.txt.mecab', mode='w') as out_file:   
    out_file.write(mecab.parse(in_file.read()))
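One plausible cause, which is an assumption and not verified in this article, is that the whole file is passed to `parse` in a single call, so only one EOS is emitted at the end. The sketch below analyzes the input one line at a time instead. The helper name `analyze_line_by_line` is hypothetical, and the MeCab usage is shown only as a comment so that the block runs without MeCab installed.

```python
def analyze_line_by_line(lines, parse):
    """Analyze each input line separately and concatenate the results.

    `parse` is any callable that maps one line of text to MeCab-style
    output ending in an EOS line, for example the parse method of a
    MeCab.Tagger instance.
    """
    return ''.join(parse(line.rstrip('\n')) for line in lines)


# Hypothetical usage with the files from this article (not run here):
#   import MeCab
#   mecab = MeCab.Tagger()
#   with open('./neko.txt') as in_file, \
#        open('./neko.txt.mecab', mode='w') as out_file:
#       out_file.write(analyze_line_by_line(in_file, mecab.parse))
```

Whether this matches the command-line output exactly has not been checked, so the file produced by the `mecab` command remains the reference for the later knocks.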

Answer program (list creation) [030. Reading morphological analysis results.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/04.%E5%BD%A2%E6%85%8B%E7%B4%A0%E8%A7%A3%E6%9E%90/030.%E5%BD%A2%E6%85%8B%E7%B4%A0%E8%A7%A3%E6%9E%90%E7%B5%90%E6%9E%9C%E3%81%AE%E8%AA%AD%E3%81%BF%E8%BE%BC%E3%81%BF.ipynb)

from pprint import pprint

import pandas as pd

def read_text():
    # 0:Surface type(surface)
    # 1:Part of speech(pos)
    # 2:Part of speech subclassification 1(pos1)
    # 7:Uninflected word(base)
    df = pd.read_table('./neko.txt.mecab', sep='\t|,', header=None, 
                       usecols=[0, 1, 2, 7], names=['surface', 'pos', 'pos1', 'base'], 
                       skiprows=4, skipfooter=1 ,engine='python')
    # Whitespace rows are misaligned: '空白' ("blank") is really pos1 but lands in pos, so drop them.
    return df[df['pos'] != '空白']

df = read_text()
print(df.info())

target = []
morphemes = []

for i, row in df.iterrows():
    if row['surface'] == 'EOS' \
     and len(target) != 0:
        morphemes.append(df.loc[target].to_dict(orient='records'))
        target = []
    else:
        target.append(i)

print(len(morphemes))
pprint(morphemes[:5])

Answer commentary

File reading section

The file created by MeCab is read with read_table. It is a little annoying that there are two delimiters, tab (\t) and comma (,). I handle this by passing a regular expression (OR with |) to the sep parameter and setting engine to 'python'. skiprows and skipfooter are set because those lines were a nuisance when I looked at the contents of the file.

def read_text():
    # 0:Surface type(surface)
    # 1:Part of speech(pos)
    # 2:Part of speech subclassification 1(pos1)
    # 7:Uninflected word(base)
    df = pd.read_table('./neko.txt.mecab', sep='\t|,', header=None, 
                       usecols=[0, 1, 2, 7], names=['surface', 'pos', 'pos1', 'base'], 
                       skiprows=4, skipfooter=1 ,engine='python')    
    return df
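To see why usecols is [0, 1, 2, 7], it may help to apply the same regular expression to a single line with the standard re module. This is only an illustration: the sample line is an assumption based on MeCab's default output format, and pandas is not involved.

```python
import re

# Assumed sample line in MeCab's default output format.
line = 'すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ'

# Same pattern as the sep argument: split on a tab OR a comma.
fields = re.split('\t|,', line)

# Column positions passed to usecols, paired with the names from names=.
columns = {'surface': 0, 'pos': 1, 'pos1': 2, 'base': 7}
record = {name: fields[index] for name, index in columns.items()}

print(len(fields))
print(record)
```

The split yields ten fields per line, and the base form sits at index 7 because the surface form occupies index 0 ahead of the nine comma-separated features.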

The resulting DataFrame holds the four columns surface, pos, pos1, and base.

`read_table` problem with whitespace

It is hard to see, but when a line like the one below begins with a space character, the columns shift when it is read with the `read_table` function. The space and the \t (tab) are ignored, and "symbol" is recognized as the first column. I tried a few things, such as setting the skipinitialspace parameter, but could not solve it. I suspect it is a pandas bug. I did not need to be particular about it this time, so I simply exclude the "blank" rows.

symbol,Blank,*,*,*,*,　,　,
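The shift can be reproduced without pandas. The sketch below assumes that the cause is leading whitespace being stripped before the line is split. Whether pandas does exactly this internally is not confirmed here, and the sample line is an assumed MeCab line whose surface form is a full-width space (U+3000).

```python
import re

# Assumed MeCab line for a full-width space: the surface itself is U+3000.
line = '\u3000\t記号,空白,*,*,*,*,\u3000,\u3000,\u3000'

# Splitting the untouched line keeps the space as the surface form.
intact = re.split('\t|,', line)

# Stripping whitespace first removes both the surface form and the tab,
# so every remaining field moves one column to the left.
shifted = re.split('\t|,', line.strip())

print(intact[:3])
print(shifted[:3])
```

After stripping, 記号 ("symbol") sits in the surface column and 空白 ("blank") in the pos column, which is the misalignment that the filter in read_text works around.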

DataFrame information

df.info() prints the following information about the DataFrame read from the file.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 212143 entries, 0 to 212552
Data columns (total 4 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   surface  212143 non-null  object
 1   pos      202182 non-null  object
 2   pos1     202182 non-null  object
 3   base     202182 non-null  object
dtypes: object(4)
memory usage: 8.1+ MB
None

Dictionary type list output

Store each morpheme in a mapping type whose keys are the surface form (surface), base form (base), part of speech (pos), and part of speech subclassification 1 (pos1), and represent one sentence as a list of morphemes (mapping types).

Here I build the list of mapping types (dictionaries). **I do not use it in the subsequent knocks**; it is purely Python practice (with pandas, the subsequent knocks do not need this tedious processing). An EOS (End Of Sentence) row marks the end of one sentence, so the morphemes collected up to that point are output with the to_dict function.

target = []
morphemes = []

for i, row in df.iterrows():
    if row['surface'] == 'EOS' \
     and len(target) != 0:
        morphemes.append(df.loc[target].to_dict(orient='records'))
        target = []
    else:
        target.append(i)
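In the loop above, an EOS row that arrives while target is empty falls into the else branch and is added to target, which is why an EOS entry can appear inside a sentence. The following sketch shows one way to skip EOS rows entirely. It is a hypothetical variant and not the code used in this article: it works on a plain list of dictionaries, such as the one `df.to_dict(orient='records')` would return, and the name `group_sentences` is made up for illustration.

```python
def group_sentences(records):
    """Group morpheme dicts into sentences, using EOS rows as separators."""
    sentences = []
    current = []
    for record in records:
        if record['surface'] == 'EOS':
            # Close the sentence only if it has morphemes, and never keep EOS.
            if current:
                sentences.append(current)
                current = []
        else:
            current.append(record)
    # Keep trailing morphemes if the input does not end with EOS.
    if current:
        sentences.append(current)
    return sentences


sample = [
    {'surface': '名前', 'pos': '名詞'},
    {'surface': 'EOS', 'pos': None},
    {'surface': 'EOS', 'pos': None},  # consecutive EOS rows
    {'surface': 'どこ', 'pos': '名詞'},
    {'surface': 'EOS', 'pos': None},
]
print(group_sentences(sample))
```

Collecting the records directly also avoids the repeated df.loc lookups, at the cost of converting the whole DataFrame to dictionaries first.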

Output result (execution result)

Running the program outputs the following (only the first 5 sentences are shown). Incidentally, "I am a cat" in the first line is a single noun because it is treated as a proper noun, the title of the book. Since it is a sentence in the book it really ought to be split into morphemes, but I did not go that far.

[[{'base': 'I am a cat', 'pos': 'noun', 'pos1': 'Proper noun', 'surface': 'I am a cat'},
  {'base': '。', 'pos': 'symbol', 'pos1': 'Kuten', 'surface': '。'}],
 [{'base': 'name', 'pos': 'noun', 'pos1': 'General', 'surface': 'name'},
  {'base': 'Is', 'pos': 'Particle', 'pos1': 'Binding particle', 'surface': 'Is'},
  {'base': 'yet', 'pos': 'adverb', 'pos1': 'Particle connection', 'surface': 'yet'},
  {'base': 'No', 'pos': 'adjective', 'pos1': 'Independence', 'surface': 'No'},
  {'base': '。', 'pos': 'symbol', 'pos1': 'Kuten', 'surface': '。'}],
 [{'base': None, 'pos': None, 'pos1': None, 'surface': 'EOS'},
  {'base': 'Where', 'pos': 'noun', 'pos1': 'Pronoun', 'surface': 'Where'},
  {'base': 'so', 'pos': 'Particle', 'pos1': 'Case particle', 'surface': 'so'},
  {'base': 'Born', 'pos': 'verb', 'pos1': 'Independence', 'surface': 'Born'},
  {'base': 'Ta', 'pos': 'Auxiliary verb', 'pos1': '*', 'surface': 'Ta'},
  {'base': 'Fire', 'pos': 'noun', 'pos1': 'General', 'surface': 'Katon'},
  {'base': 'When', 'pos': 'Particle', 'pos1': 'Case particle', 'surface': 'When'},
  {'base': 'Register', 'pos': 'noun', 'pos1': 'Change connection', 'surface': 'Register'},
  {'base': 'But', 'pos': 'Particle', 'pos1': 'Case particle', 'surface': 'But'},
  {'base': 'Tsukuri', 'pos': 'verb', 'pos1': 'Independence', 'surface': 'Tsuka'},
  {'base': 'Nu', 'pos': 'Auxiliary verb', 'pos1': '*', 'surface': 'Nu'},
  {'base': '。', 'pos': 'symbol', 'pos1': 'Kuten', 'surface': '。'}],
 [{'base': 'what', 'pos': 'noun', 'pos1': 'Pronoun', 'surface': 'what'},
  {'base': 'But', 'pos': 'Particle', 'pos1': 'Adverbial particle', 'surface': 'But'},
  {'base': 'dim', 'pos': 'adjective', 'pos1': 'Independence', 'surface': 'dim'},
  {'base': 'Damp', 'pos': 'adverb', 'pos1': 'General', 'surface': 'Damp'},
  {'base': 'did', 'pos': 'noun', 'pos1': 'General', 'surface': 'did'},
  {'base': 'Place', 'pos': 'noun', 'pos1': 'suffix', 'surface': 'Place'},
  {'base': 'so', 'pos': 'Particle', 'pos1': 'Case particle', 'surface': 'so'},
  {'base': 'Meow meow', 'pos': 'adverb', 'pos1': 'General', 'surface': 'Meow meow'},
  {'base': 'cry', 'pos': 'verb', 'pos1': 'Independence', 'surface': 'Crying'},
  {'base': 'hand', 'pos': 'Particle', 'pos1': 'Conjunctive particle', 'surface': 'hand'},
  {'base': 'What was there', 'pos': 'noun', 'pos1': 'General', 'surface': 'What was there'},
  {'base': 'Only', 'pos': 'Particle', 'pos1': 'Adverbial particle', 'surface': 'Only'},
  {'base': 'Is', 'pos': 'Particle', 'pos1': 'Binding particle', 'surface': 'Is'},
  {'base': 'Memory', 'pos': 'noun', 'pos1': 'Change connection', 'surface': 'Memory'},
  {'base': 'To do', 'pos': 'verb', 'pos1': 'Independence', 'surface': 'Shi'},
  {'base': 'hand', 'pos': 'Particle', 'pos1': 'Conjunctive particle', 'surface': 'hand'},
  {'base': 'Is', 'pos': 'verb', 'pos1': 'Non-independent', 'surface': 'Is'},
  {'base': '。', 'pos': 'symbol', 'pos1': 'Kuten', 'surface': '。'}],
 [{'base': 'I', 'pos': 'noun', 'pos1': 'Pronoun', 'surface': 'I'},
  {'base': 'Is', 'pos': 'Particle', 'pos1': 'Binding particle', 'surface': 'Is'},
  {'base': 'here', 'pos': 'noun', 'pos1': 'Pronoun', 'surface': 'here'},
  {'base': 'so', 'pos': 'Particle', 'pos1': 'Case particle', 'surface': 'so'},
  {'base': 'start', 'pos': 'verb', 'pos1': 'Independence', 'surface': 'start'},
  {'base': 'hand', 'pos': 'Particle', 'pos1': 'Conjunctive particle', 'surface': 'hand'},
  {'base': 'Human', 'pos': 'noun', 'pos1': 'General', 'surface': 'Human'},
  {'base': 'That', 'pos': 'Particle', 'pos1': 'Case particle', 'surface': 'That'},
  {'base': 'thing', 'pos': 'noun', 'pos1': 'Non-independent', 'surface': 'thing'},
  {'base': 'To', 'pos': 'Particle', 'pos1': 'Case particle', 'surface': 'To'},
  {'base': 'to see', 'pos': 'verb', 'pos1': 'Independence', 'surface': 'You see'},
  {'base': 'Ta', 'pos': 'Auxiliary verb', 'pos1': '*', 'surface': 'Ta'},
  {'base': '。', 'pos': 'symbol', 'pos1': 'Kuten', 'surface': '。'}]]

100 language processing knock-30 (using pandas): reading morphological analysis results
