The 2020 edition of "100 Language Processing Knocks," a well-known collection of natural language processing problems, has been released. This article summarizes my solutions to Chapter 4: Morphological Analysis. The series covers the following Chapters 1 to 10.
- Chapter 1: Warm-up
- Chapter 2: UNIX Commands
- Chapter 3: Regular Expressions
- Chapter 4: Morphological Analysis
- Chapter 5: Dependency Parsing
- Chapter 6: Machine Learning
- Chapter 7: Word Vectors
- Chapter 8: Neural Networks
- Chapter 9: RNN and CNN
- Chapter 10: Machine Translation
We use Google Colaboratory to run the answers. For details on how to set up and use Google Colaboratory, see this article. A notebook containing the execution results of the answers below is available on GitHub.
Use MeCab to morphologically analyze the text of Natsume Soseki's novel "I Am a Cat" (neko.txt) and save the result in a file called neko.txt.mecab. Use this file to implement programs that address the following questions.
First, download the specified data. Running the following command in a Google Colaboratory cell downloads the target file to the current directory.
!wget https://nlp100.github.io/data/neko.txt
Reference: wget command: download a file by specifying its URL
Next, install MeCab.
!apt install mecab libmecab-dev mecab-ipadic-utf8
Once the installation is complete, we run the morphological analysis right away.
Executing the following command writes the morphological analysis result of `neko.txt` to `neko.txt.mecab`.
!mecab -o ./neko.txt.mecab ./neko.txt
References: Morphological analysis (Wikipedia); MeCab command-line argument list and usage examples
Check the output result.
#Check the number of lines
!wc -l ./neko.txt.mecab
output
216302 ./neko.txt.mecab
#Check the first 10 lines
!head -10 ./neko.txt.mecab
output
一	名詞,数,*,*,*,*,一,イチ,イチ
EOS
EOS
　	記号,空白,*,*,*,*,　,　,　
吾輩	名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ
は	助詞,係助詞,*,*,*,*,は,ハ,ワ
猫	名詞,一般,*,*,*,*,猫,ネコ,ネコ
で	助動詞,*,*,特殊・ダ,連用形,で,デ,デ
ある	助動詞,*,*,五段・ラ行アル,基本形,ある,アル,アル
。	記号,句点,*,*,*,*,。,。,。
Implement a program that reads the morphological analysis result (neko.txt.mecab). Store each morpheme as a mapping type with keys for the surface form (surface), base form (base), part of speech (pos), and part-of-speech subdivision 1 (pos1), and represent each sentence as a list of such morphemes (mapping types). Use the program created here for the remaining problems in Chapter 4.
filename = './neko.txt.mecab'
sentences = []
morphs = []
with open(filename, mode='r') as f:
    for line in f:  # read line by line
        if line != 'EOS\n':  # not end of sentence: store the morpheme info in a dict and append it to the morpheme list
            surface, attr = line.split('\t')
            attr = attr.split(',')
            morph = {'surface': surface, 'base': attr[6], 'pos': attr[0], 'pos1': attr[1]}
            morphs.append(morph)
        else:  # end of sentence: append the morpheme list to the sentence list
            sentences.append(morphs)
            morphs = []
# Verification
for morph in sentences[2]:
    print(morph)
output
{'surface': '\u3000', 'base': '\u3000', 'pos': '記号', 'pos1': '空白'}
{'surface': '吾輩', 'base': '吾輩', 'pos': '名詞', 'pos1': '代名詞'}
{'surface': 'は', 'base': 'は', 'pos': '助詞', 'pos1': '係助詞'}
{'surface': '猫', 'base': '猫', 'pos': '名詞', 'pos1': '一般'}
{'surface': 'で', 'base': 'だ', 'pos': '助動詞', 'pos1': '*'}
{'surface': 'ある', 'base': 'ある', 'pos': '助動詞', 'pos1': '*'}
{'surface': '。', 'base': '。', 'pos': '記号', 'pos1': '句点'}
References: reading and writing files in Python; splitting strings in Python; loops with Python's for statement; conditional branching with Python's if statement; creating dictionaries with dict(), curly braces, and dictionary comprehensions; appending elements to a list (array) in Python
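To see how one MeCab line maps to the dictionary above, here is a minimal sketch with a single hard-coded line (a hypothetical example, assuming IPAdic's 9-field output format):

```python
# A hypothetical MeCab output line: surface form + TAB + 9 comma-separated fields
line = '猫\t名詞,一般,*,*,*,*,猫,ネコ,ネコ\n'

surface, attr = line.rstrip('\n').split('\t')  # split off the surface form
fields = attr.split(',')                       # pos, pos1, ..., base form at index 6
morph = {'surface': surface, 'base': fields[6], 'pos': fields[0], 'pos1': fields[1]}
print(morph)
# → {'surface': '猫', 'base': '猫', 'pos': '名詞', 'pos1': '一般'}
```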
Extract all the surface forms of the verb.
From here on, we process the `sentences` list created in problem 30. The `set` type used to store the results represents a mathematical set and does not allow duplicates. Therefore, you can add elements without checking for duplicates and still naturally end up with a unique collection, which is convenient for a question like this one.
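For example, on toy data (hypothetical words, not from the corpus), adding a duplicate to a set is simply a no-op:

```python
ans = set()
for w in ['見る', '見る', '鳴く']:  # '見る' appears twice
    ans.add(w)                     # duplicates are kept only once
print(len(ans))  # → 2
```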
ans = set()
for sentence in sentences:
    for morph in sentence:
        if morph['pos'] == '動詞':
            ans.add(morph['surface'])  # a set keeps only unique elements
# Verification
print(f'Types of surface forms of verbs: {len(ans)}\n')
print('---sample---')
for i in range(10):
    print(list(ans)[i])
output
Types of surface forms of verbs: 3893
---sample---
Profit
Shinobi
Line up
Notice
Show
Fast
Separated
Keep
Matsuwa
From
Reference: set operations with Python's set type
Extract all the base forms of the verbs.
ans = set()
for sentence in sentences:
    for morph in sentence:
        if morph['pos'] == '動詞':
            ans.add(morph['base'])
# Verification
print(f'Types of verb base forms: {len(ans)}\n')
print('---sample---')
for i in range(10):
    print(list(ans)[i])
output
Types of verb base forms: 2300
---sample---
Line up
Await
Relentless
Can hit
Respond
undertake
Fold in
pierce
Grow
tell
Extract all noun phrases in which two nouns are connected by "の" (no).
ans = set()
for sentence in sentences:
    for i in range(1, len(sentence) - 1):
        if sentence[i - 1]['pos'] == '名詞' and sentence[i]['surface'] == 'の' and sentence[i + 1]['pos'] == '名詞':
            ans.add(sentence[i - 1]['surface'] + sentence[i]['surface'] + sentence[i + 1]['surface'])
# Verification
print(f'Types of "noun + の + noun" phrases: {len(ans)}\n')
print('---sample---')
for i in range(10):
    print(list(ans)[i])
output
"noun+of+名詞」of種類: 4924
---sample---
The body of sickness
One side of the
I hate myself
Police trouble
Of the law
Suitable for things
Detective of the world
Fear of protection
Two elements
Standing flock
References: getting the size of objects of various types with Python's len function; concatenating and joining strings in Python
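To illustrate the three-morpheme window check on toy data (a hypothetical mini-sentence, not taken from the corpus):

```python
# hypothetical morpheme list for 「彼の掌」 ("his palm")
sentence = [
    {'surface': '彼', 'pos': '名詞'},
    {'surface': 'の', 'pos': '助詞'},
    {'surface': '掌', 'pos': '名詞'},
]
ans = set()
for i in range(1, len(sentence) - 1):
    # noun on the left, 「の」 in the middle, noun on the right
    if sentence[i - 1]['pos'] == '名詞' and sentence[i]['surface'] == 'の' and sentence[i + 1]['pos'] == '名詞':
        ans.add(sentence[i - 1]['surface'] + sentence[i]['surface'] + sentence[i + 1]['surface'])
print(ans)  # → {'彼の掌'}
```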
Extract the longest runs of concatenated nouns (nouns that appear consecutively).
For each sentence, we scan the morphemes in order from the first one: while the part of speech is a noun, we concatenate its surface form into `nouns` and count the run length in `num`. When a non-noun appears, if the run so far contains two or more nouns we add `nouns` to the result set; in either case we then reset `nouns` and `num`.

ans = set()
for sentence in sentences:
    nouns = ''
    num = 0
    for i in range(len(sentence)):
        if sentence[i]['pos'] == '名詞':  # starting from the first morpheme, concatenate nouns into nouns and count the run length (num)
            nouns = ''.join([nouns, sentence[i]['surface']])
            num += 1
        elif num >= 2:  # not a noun, and the run so far has 2+ nouns: record it, then reset nouns and num
            ans.add(nouns)
            nouns = ''
            num = 0
        else:  # otherwise just reset nouns and num
            nouns = ''
            num = 0
# Verification
print(f'Types of concatenated nouns: {len(ans)}\n')
print('---sample---')
for i in range(10):
    print(list(ans)[i])
output
Types of concatenated nouns: 4454
---sample---
Kan Inoguchi
Street today
Must-have world
Two sheets
You champagne
Approaching
Idiot
Hibiscus
10 years now
Other than stimulation
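Note that the loop above resets only when a non-noun appears, so a noun run that reaches the end of a sentence is never recorded. A variant that flushes such a run after the loop might look like this (a sketch on toy data with hypothetical words, not the original program):

```python
# hypothetical sentence ending in a two-noun run
sentence = [{'surface': '吾輩', 'pos': '名詞'}, {'surface': '達', 'pos': '名詞'}]

ans = set()
nouns, num = '', 0
for morph in sentence:
    if morph['pos'] == '名詞':
        nouns += morph['surface']
        num += 1
    else:
        if num >= 2:
            ans.add(nouns)
        nouns, num = '', 0
if num >= 2:  # flush a run that ends with the sentence
    ans.add(nouns)
print(ans)  # → {'吾輩達'}
```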
Find the words that appear in the text and their frequencies, and arrange them in descending order of frequency.
from collections import defaultdict

ans = defaultdict(int)
for sentence in sentences:
    for i in range(len(sentence)):
        if sentence[i]['pos'] != '記号':
            ans[sentence[i]['base']] += 1  # increment the word count (a new word starts from 0, so its first occurrence becomes 1)
ans = sorted(ans.items(), key=lambda x: x[1], reverse=True)
# Verification
for i in range(5):
    print(ans[i])
output
('の', 9194)
('て', 6848)
('は', 6420)
('に', 6243)
('を', 6071)
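The counting idiom relies on `defaultdict(int)` returning 0 for missing keys, so the first occurrence of a word becomes 1 without any existence check. A toy illustration (hypothetical token stream):

```python
from collections import defaultdict

counts = defaultdict(int)
for w in ['の', 'て', 'の']:  # hypothetical tokens
    counts[w] += 1  # a missing key starts at 0, so this sets 1 on first sight
print(sorted(counts.items(), key=lambda x: x[1], reverse=True))
# → [('の', 2), ('て', 1)]
```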
References: how to use Python's defaultdict; sorting a list of dictionaries in Python by the value of a specific key
Display the 10 most frequent words and their frequencies in a graph (for example, a bar chart).
Install `japanize_matplotlib` to display Japanese text with matplotlib.
!pip install japanize_matplotlib
Reference: [Super easy] how to make matplotlib display Japanese in just two steps
Then, as in problem 35, aggregate the word frequencies and visualize them with a bar chart.
import matplotlib.pyplot as plt
import japanize_matplotlib

ans = defaultdict(int)
for sentence in sentences:
    for i in range(len(sentence)):
        if sentence[i]['pos'] != '記号':
            ans[sentence[i]['base']] += 1  # increment the word count (a new word starts from 0)
ans = sorted(ans.items(), key=lambda x: x[1], reverse=True)
keys = [a[0] for a in ans[0:10]]
values = [a[1] for a in ans[0:10]]
plt.figure(figsize=(8, 4))
plt.bar(keys, values)
plt.show()
Basics of Python graph drawing library Matplotlib
Display the 10 words that most often co-occur with "猫" (cat) and their frequencies in a graph (for example, a bar chart).
No part of speech is filtered out here because the problem gives no particular instruction, but depending on the purpose, removing particles and the like may yield more meaningful results.
ans = defaultdict(int)
for sentence in sentences:
    if '猫' in [morph['surface'] for morph in sentence]:  # count only sentences whose morphemes include 「猫」
        for i in range(len(sentence)):
            if sentence[i]['pos'] != '記号':
                ans[sentence[i]['base']] += 1  # increment the word count (a new word starts from 0)
del ans['猫']
ans = sorted(ans.items(), key=lambda x: x[1], reverse=True)
keys = [a[0] for a in ans[0:10]]
values = [a[1] for a in ans[0:10]]
plt.figure(figsize=(8, 4))
plt.bar(keys, values)
plt.show()
Delete dictionary elements in Python
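If you do want to drop particles and other function words, as the note above suggests, one sketch is to keep only content parts of speech. This is toy data and the pos filter set is an assumption for illustration, not from the original program:

```python
from collections import defaultdict

# hypothetical sentences already parsed into morpheme dicts
sentences_toy = [
    [{'surface': '猫', 'base': '猫', 'pos': '名詞'},
     {'surface': 'が', 'base': 'が', 'pos': '助詞'},
     {'surface': '鳴く', 'base': '鳴く', 'pos': '動詞'}],
]
content_pos = {'名詞', '動詞', '形容詞'}  # assumed set of content parts of speech

ans = defaultdict(int)
for sentence in sentences_toy:
    if '猫' in [m['surface'] for m in sentence]:
        for m in sentence:
            if m['pos'] in content_pos:  # the particle 「が」 is skipped here
                ans[m['base']] += 1
del ans['猫']  # drop the query word itself
print(dict(ans))  # → {'鳴く': 1}
```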
Draw a histogram of word frequencies (a bar graph with frequency on the horizontal axis and the number of word types having that frequency on the vertical axis).
ans = defaultdict(int)
for sentence in sentences:
    for i in range(len(sentence)):
        if sentence[i]['pos'] != '記号':
            ans[sentence[i]['base']] += 1  # increment the word count (a new word starts from 0)
ans = ans.values()
plt.figure(figsize=(8, 4))
plt.hist(ans, bins=100)
plt.show()
Get only dictionary keys and values as a list in Python
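The histogram above shows the "frequency of frequencies": for each count value, how many word types occur that many times. On toy data (hypothetical tokens), the same quantity can be computed directly with `collections.Counter`:

```python
from collections import Counter

tokens = ['の', 'の', 'の', '猫', '猫', 'て']   # hypothetical token stream
word_counts = Counter(tokens)                  # word → count: の:3, 猫:2, て:1
freq_of_freq = Counter(word_counts.values())   # count value → number of word types
print(dict(freq_of_freq))  # → {3: 1, 2: 1, 1: 1}
```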
Plot a log-log graph with the frequency rank of words on the horizontal axis and their frequency on the vertical axis.
import math

ans = defaultdict(int)
for sentence in sentences:
    for i in range(len(sentence)):
        if sentence[i]['pos'] != '記号':
            ans[sentence[i]['base']] += 1  # increment the word count (a new word starts from 0)
ans = sorted(ans.items(), key=lambda x: x[1], reverse=True)
ranks = [math.log(r + 1) for r in range(len(ans))]  # log of 1-based rank
values = [math.log(a[1]) for a in ans]  # log of frequency
plt.figure(figsize=(8, 4))
plt.scatter(ranks, values)
plt.show()
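If the data follow Zipf's law, the points on this log-log plot fall roughly on a line with slope about -1. A quick sanity check on idealized, hypothetical frequencies proportional to 1/rank:

```python
import math

freqs = [1000 / r for r in range(1, 6)]  # idealized Zipf counts: 1000, 500, 333.3, ...
# slope of the log-log line between the first (rank 1) and last (rank 5) point
slope = (math.log(freqs[-1]) - math.log(freqs[0])) / (math.log(5) - math.log(1))
print(round(slope, 6))  # → -1.0
```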
References: [Zipf's law (Wikipedia)](https://ja.wikipedia.org/wiki/%E3%82%B8%E3%83%83%E3%83%97%E3%81%AE%E6%B3%95%E5%89%87); calculating exponential and logarithmic functions in Python
The 100 Language Processing Knocks are designed so that you can learn not only natural language processing itself, but also basic data processing and general-purpose machine learning. Even those studying machine learning through online courses will find it excellent practice, so please give it a try.