This is my record of problem 30, "Reading morphological analysis results," from Chapter 4 "Morphological Analysis" of the 2015 Language Processing 100 Knocks. With this chapter, the knocks finally feel like full-scale language processing. Morphological analysis is a technique that divides a sentence such as 「お待ちしております」 into the morphemes 「お待ち」「し」「て」「おり」「ます」 ("waiting", "shi", "te", "ori", "masu") and attaches information such as part of speech to each. See Wikipedia and similar sources for details.
Link | Remarks |
---|---|
030. Reading morphological analysis results.ipynb | GitHub link to the answer program |
100 amateur language processing knocks: 30 | Source from which many code sections were copied |
MeCab Official | The first MeCab page to look at |
I'm using Python 3.8.1 from this entry on (3.6.9 until last time). In Chapter 3, "Regular Expressions," I used collections.OrderedDict to get an ordered dictionary type, but since Python 3.7 even the standard dict type is guaranteed to preserve insertion order. There was no particular reason to stick with 3.6.9, so I renewed the environment.
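As a minimal check of that ordering guarantee:

```python
# Since Python 3.7, the built-in dict preserves insertion order,
# so collections.OrderedDict is no longer needed just for ordering.
d = {}
d['surface'] = 1
d['base'] = 2
d['pos'] = 3
print(list(d))  # ['surface', 'base', 'pos']
```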
I forgot how I installed MeCab. I did it about a year ago, and I don't remember stumbling over anything.
type | version | Contents |
---|---|---|
OS | Ubuntu 18.04.01 LTS | Running virtually |
pyenv | 1.2.16 | I use pyenv because I sometimes use multiple Python environments |
Python | 3.8.1 | Python 3.8.1 on pyenv; packages are managed with venv |
MeCab | 0.996-5 | Installed with apt-get |
In the above environment, I use the following additional Python package, installed with plain pip.
type | version |
---|---|
pandas | 1.0.1 |
Apply the morphological analyzer MeCab to Natsume Soseki's novel "I Am a Cat" to obtain the statistics of the words in the novel.
Morphological analysis, MeCab, part of speech, frequency of occurrence, Zipf's law, matplotlib, Gnuplot
Apply MeCab to the text of Natsume Soseki's novel "I Am a Cat" (neko.txt), perform morphological analysis, and save the result in a file called neko.txt.mecab. Use this file to implement programs that address the following questions.
For problems 37, 38, and 39, use matplotlib or Gnuplot.
Implement a program that reads the morphological analysis result (neko.txt.mecab). Store each morpheme in a mapping type with the surface form (surface), base form (base), part of speech (pos), and part-of-speech subclassification 1 (pos1) as keys, and express one sentence as a list of morphemes (mapping types). Use the program created here for the rest of the problems in Chapter 4.
MeCab is the de facto standard for morphological analysis. For a comparison with other morphological analyzers, I referred to the article "Comparison of morphological analyzers at the end of 2019" (based on the comparison results, I decided to go with MeCab).
When you run MeCab, it outputs the following information for each segmented word. Note that the delimiters are a tab (`\t`) and commas (why both?).

Surface form\tPart of speech,POS subclassification 1,POS subclassification 2,POS subclassification 3,Conjugation type,Conjugation form,Base form,Reading,Pronunciation
For example, for the sentence すもももももももものうち ("plums and peaches are both in the peach family"), the output is as follows.

No | Surface form | Part of speech | POS subclass 1 | POS subclass 2 | POS subclass 3 | Conjugation type | Conjugation form | Base form | Reading | Pronunciation |
---|---|---|---|---|---|---|---|---|---|---|
1 | すもも (plum) | Noun | General | * | * | * | * | すもも | スモモ | スモモ |
2 | も (also) | Particle | Binding particle | * | * | * | * | も | モ | モ |
3 | もも (peach) | Noun | General | * | * | * | * | もも | モモ | モモ |
4 | も (also) | Particle | Binding particle | * | * | * | * | も | モ | モ |
5 | もも (peach) | Noun | General | * | * | * | * | もも | モモ | モモ |
6 | の (of) | Particle | Attributive | * | * | * | * | の | ノ | ノ |
7 | うち (among) | Noun | Non-independent | Adverbs possible | * | * | * | うち | ウチ | ウチ |
8 | EOS |
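A line in this format can be split with plain string operations. The helper below is a minimal sketch (the name `parse_mecab_line` is my own, not part of the knock), keeping only the fields used later:

```python
def parse_mecab_line(line):
    """Split one line of MeCab output into a dict of the fields used later."""
    # Surface form is separated by a tab; the features are comma-separated
    surface, feature = line.rstrip('\n').split('\t')
    fields = feature.split(',')
    return {'surface': surface,
            'pos': fields[0],    # part of speech
            'pos1': fields[1],   # POS subclassification 1
            'base': fields[6]}   # base form

print(parse_mecab_line('すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ'))
# → {'surface': 'すもも', 'pos': '名詞', 'pos1': '一般', 'base': 'すもも'}
```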
The premise of Chapter 4 is the following MeCab execution step.

Apply MeCab to the text of Natsume Soseki's novel "I Am a Cat" (neko.txt), perform morphological analysis, and save the result in a file called neko.txt.mecab.

It's a simple process that finishes with a single command.
```shell
mecab neko.txt -o neko.txt.mecab
```
For what it's worth, I also wrote a program using mecab-python3 (ver 0.996.3) as shown below, but its result differs slightly from the command-line run: **the sentences were not separated by EOS (End Of Sentence)**, which was fatal for the subsequent knocks. My way of specifying options may be bad, but I didn't want to dig deeper, so I did not use this Python program's output in the later knocks.
```python
import MeCab

mecab = MeCab.Tagger()

with open('./neko.txt') as in_file, \
        open('./neko.txt.mecab', mode='w') as out_file:
    out_file.write(mecab.parse(in_file.read()))
```
```python
from pprint import pprint

import pandas as pd


def read_text():
    # 0: surface form (surface)
    # 1: part of speech (pos)
    # 2: POS subclassification 1 (pos1)
    # 7: base form (base)
    df = pd.read_table('./neko.txt.mecab', sep='\t|,', header=None,
                       usecols=[0, 1, 2, 7],
                       names=['surface', 'pos', 'pos1', 'base'],
                       skiprows=4, skipfooter=1, engine='python')
    # 'Blank' is actually pos1, but its columns are misaligned, so exclude it
    return df[df['pos'] != 'Blank']


df = read_text()
print(df.info())

target = []
morphemes = []

for i, row in df.iterrows():
    if row['surface'] == 'EOS' \
            and len(target) != 0:
        morphemes.append(df.loc[target].to_dict(orient='records'))
        target = []
    else:
        target.append(i)

print(len(morphemes))
pprint(morphemes[:5])
```
The file created by MeCab is read with `read_table`. It's a little annoying that the delimiters are both tab (`\t`) and comma (`,`). This is handled by passing a regular expression (an OR with `|`) as the `sep` parameter and setting `engine` to `'python'`. I set `skiprows` and `skipfooter` because the leading and trailing lines were noise when I inspected the file contents.
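The same `sep`/`engine` trick can be tried on a tiny in-memory sample, with `io.StringIO` standing in for neko.txt.mecab (a sketch; `skiprows`/`skipfooter` are omitted because the sample has no header or footer lines):

```python
import io

import pandas as pd

# Two lines mimicking MeCab output: one morpheme line and one EOS line
sample = 'すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ\nEOS\n'

# The regex sep splits on both tab and comma; the short EOS row is padded with NaN
df = pd.read_table(io.StringIO(sample), sep='\t|,', header=None,
                   usecols=[0, 1, 2, 7],
                   names=['surface', 'pos', 'pos1', 'base'],
                   engine='python')
print(df)
```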
```python
def read_text():
    # 0: surface form (surface)
    # 1: part of speech (pos)
    # 2: POS subclassification 1 (pos1)
    # 7: base form (base)
    df = pd.read_table('./neko.txt.mecab', sep='\t|,', header=None,
                       usecols=[0, 1, 2, 7],
                       names=['surface', 'pos', 'pos1', 'base'],
                       skiprows=4, skipfooter=1, engine='python')
    return df
```
The resulting DataFrame holds the following information.
**The `read_table` whitespace problem**

It's hard to notice, but when a line begins with a space character, the columns shift when the file is read with the `read_table` function: the space and the tab (`\t`) are ignored, and the first column is recognized as "symbol". I did some trial and error, such as setting the parameter `skipinitialspace`, but couldn't solve it; it's probably a pandas bug. This time I didn't need to be particular about these rows, so I simply exclude the "Blank" lines.

```
symbol,Blank,*,*,*,*, , ,
```
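The exclusion itself is ordinary boolean indexing; a toy frame with made-up values mimicking one misaligned row:

```python
import pandas as pd

# Hypothetical rows: the first one has its shifted feature 'Blank' in the pos column
df = pd.DataFrame({'surface': ['symbol', '吾輩', 'EOS'],
                   'pos': ['Blank', '名詞', None]})

# Keep only the rows whose pos column is not the misaligned 'Blank'
clean = df[df['pos'] != 'Blank']
print(clean['surface'].tolist())  # ['吾輩', 'EOS']
```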
The DataFrame built from the file reports the following via `df.info()`.

```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 212143 entries, 0 to 212552
Data columns (total 4 columns):
 #   Column   Non-Null Count   Dtype
---  ------   --------------   -----
 0   surface  212143 non-null  object
 1   pos      202182 non-null  object
 2   pos1     202182 non-null  object
 3   base     202182 non-null  object
dtypes: object(4)
memory usage: 8.1+ MB
None
```
Store each morpheme in a mapping type with the surface, uninflected word, part of speech (pos), and part of speech subclassification 1 (pos1) as keys, and express one sentence as a list of morphemes (mapping type).
Now make the list of mapping types (dictionaries). However, **I don't actually use it in the subsequent knocks**; it's purely Python practice (with pandas, this tedious step isn't needed in the later knocks).
When EOS (End Of Sentence) appears, one sentence has ended, so the morphemes accumulated up to that point are converted with the `to_dict` function.
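A minimal illustration of `to_dict(orient='records')` on toy values:

```python
import pandas as pd

# Toy morpheme table: one dict per row is produced with orient='records'
df = pd.DataFrame({'surface': ['吾輩', 'は'],
                   'pos': ['名詞', '助詞']})
print(df.to_dict(orient='records'))
# → [{'surface': '吾輩', 'pos': '名詞'}, {'surface': 'は', 'pos': '助詞'}]
```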
```python
target = []
morphemes = []

for i, row in df.iterrows():
    if row['surface'] == 'EOS' \
            and len(target) != 0:
        morphemes.append(df.loc[target].to_dict(orient='records'))
        target = []
    else:
        target.append(i)
```
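As a design note, `iterrows` is easy to read but slow on large frames; the same sentence grouping can be sketched without an explicit loop using `cumsum` and `groupby` (toy data, not the author's code):

```python
import pandas as pd

# Toy frame standing in for the parsed MeCab output (values made up)
df = pd.DataFrame({'surface': ['吾輩', 'は', '猫', 'EOS', '名前', 'EOS'],
                   'pos': ['名詞', '助詞', '名詞', None, '名詞', None]})

# Rows before each EOS share the same cumulative EOS count, i.e. a sentence id
sent_id = (df['surface'] == 'EOS').cumsum()
body = df[df['surface'] != 'EOS']
morphemes = [g.to_dict(orient='records')
             for _, g in body.groupby(sent_id)]
print(len(morphemes))  # 2
```

Because the grouper Series is aligned on the index, all rows preceding the same EOS fall into one group.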
Running the program outputs the following result (only the first five sentences). Incidentally, the reason "I am a cat" on the first line is analyzed as a single noun is that it is registered as a proper noun (the book title). Properly, the sentence inside the book should be decomposed, but the analysis doesn't go that far.
Output result

```
[[{'base': 'I am a cat', 'pos': 'noun', 'pos1': 'proper noun', 'surface': 'I am a cat'},
  {'base': '。', 'pos': 'symbol', 'pos1': 'period', 'surface': '。'}],
 [{'base': 'name', 'pos': 'noun', 'pos1': 'General', 'surface': 'name'},
  {'base': 'Is', 'pos': 'Particle', 'pos1': 'binding particle', 'surface': 'Is'},
  {'base': 'yet', 'pos': 'adverb', 'pos1': 'Particle connection', 'surface': 'yet'},
  {'base': 'No', 'pos': 'adjective', 'pos1': 'Independence', 'surface': 'No'},
  {'base': '。', 'pos': 'symbol', 'pos1': 'period', 'surface': '。'}],
 [{'base': None, 'pos': None, 'pos1': None, 'surface': 'EOS'},
  {'base': 'Where', 'pos': 'noun', 'pos1': 'pronoun', 'surface': 'Where'},
  {'base': 'so', 'pos': 'Particle', 'pos1': 'case particle', 'surface': 'so'},
  {'base': 'Born', 'pos': 'verb', 'pos1': 'Independence', 'surface': 'Born'},
  {'base': 'Ta', 'pos': 'Auxiliary verb', 'pos1': '*', 'surface': 'Ta'},
  {'base': 'Fire', 'pos': 'noun', 'pos1': 'General', 'surface': 'Katon'},
  {'base': 'When', 'pos': 'Particle', 'pos1': 'case particle', 'surface': 'When'},
  {'base': 'Register', 'pos': 'noun', 'pos1': 'Change connection', 'surface': 'Register'},
  {'base': 'But', 'pos': 'Particle', 'pos1': 'case particle', 'surface': 'But'},
  {'base': 'Tsukuri', 'pos': 'verb', 'pos1': 'Independence', 'surface': 'Tsuka'},
  {'base': 'Nu', 'pos': 'Auxiliary verb', 'pos1': '*', 'surface': 'Nu'},
  {'base': '。', 'pos': 'symbol', 'pos1': 'period', 'surface': '。'}],
 [{'base': 'what', 'pos': 'noun', 'pos1': 'pronoun', 'surface': 'what'},
  {'base': 'But', 'pos': 'Particle', 'pos1': 'adverbial particle', 'surface': 'But'},
  {'base': 'dim', 'pos': 'adjective', 'pos1': 'Independence', 'surface': 'dim'},
  {'base': 'Damp', 'pos': 'adverb', 'pos1': 'General', 'surface': 'Damp'},
  {'base': 'did', 'pos': 'noun', 'pos1': 'General', 'surface': 'did'},
  {'base': 'Place', 'pos': 'noun', 'pos1': 'suffix', 'surface': 'Place'},
  {'base': 'so', 'pos': 'Particle', 'pos1': 'case particle', 'surface': 'so'},
  {'base': 'Meow meow', 'pos': 'adverb', 'pos1': 'General', 'surface': 'Meow meow'},
  {'base': 'cry', 'pos': 'verb', 'pos1': 'Independence', 'surface': 'Crying'},
  {'base': 'hand', 'pos': 'Particle', 'pos1': 'conjunctive particle', 'surface': 'hand'},
  {'base': 'What was there', 'pos': 'noun', 'pos1': 'General', 'surface': 'What was there'},
  {'base': 'Only', 'pos': 'Particle', 'pos1': 'adverbial particle', 'surface': 'Only'},
  {'base': 'Is', 'pos': 'Particle', 'pos1': 'binding particle', 'surface': 'Is'},
  {'base': 'Memory', 'pos': 'noun', 'pos1': 'Change connection', 'surface': 'Memory'},
  {'base': 'To do', 'pos': 'verb', 'pos1': 'Independence', 'surface': 'Shi'},
  {'base': 'hand', 'pos': 'Particle', 'pos1': 'conjunctive particle', 'surface': 'hand'},
  {'base': 'Is', 'pos': 'verb', 'pos1': 'Non-independent', 'surface': 'Is'},
  {'base': '。', 'pos': 'symbol', 'pos1': 'period', 'surface': '。'}],
 [{'base': 'I', 'pos': 'noun', 'pos1': 'pronoun', 'surface': 'I'},
  {'base': 'Is', 'pos': 'Particle', 'pos1': 'binding particle', 'surface': 'Is'},
  {'base': 'here', 'pos': 'noun', 'pos1': 'pronoun', 'surface': 'here'},
  {'base': 'so', 'pos': 'Particle', 'pos1': 'case particle', 'surface': 'so'},
  {'base': 'start', 'pos': 'verb', 'pos1': 'Independence', 'surface': 'start'},
  {'base': 'hand', 'pos': 'Particle', 'pos1': 'conjunctive particle', 'surface': 'hand'},
  {'base': 'Human', 'pos': 'noun', 'pos1': 'General', 'surface': 'Human'},
  {'base': 'That', 'pos': 'Particle', 'pos1': 'case particle', 'surface': 'That'},
  {'base': 'thing', 'pos': 'noun', 'pos1': 'Non-independent', 'surface': 'thing'},
  {'base': 'To', 'pos': 'Particle', 'pos1': 'case particle', 'surface': 'To'},
  {'base': 'to see', 'pos': 'verb', 'pos1': 'Independence', 'surface': 'You see'},
  {'base': 'Ta', 'pos': 'Auxiliary verb', 'pos1': '*', 'surface': 'Ta'},
  {'base': '。', 'pos': 'symbol', 'pos1': 'period', 'surface': '。'}]]
```