The 2020 edition of "100 Language Processing Knocks," a well-known collection of natural language processing problems, has been released. This article summarizes my solutions to Chapter 4: Morphological Analysis. The series covers the following Chapters 1 to 10.
- Chapter 1: Warm-up
- Chapter 2: UNIX Commands
- Chapter 3: Regular Expressions
- Chapter 4: Morphological Analysis
- Chapter 5: Dependency Parsing
- Chapter 6: Machine Learning
- Chapter 7: Word Vectors
- Chapter 8: Neural Networks
- Chapter 9: RNN and CNN
- Chapter 10: Machine Translation
We use Google Colaboratory to run the answers. For details on how to set up and use Google Colaboratory, see this article. A notebook containing the execution results of the answers below is available on GitHub.
Use MeCab to morphologically analyze the text of Natsume Soseki's novel "I Am a Cat" (neko.txt) and save the result in a file called neko.txt.mecab. Use this file to implement programs that address the following questions.
First, download the specified data. Running the following command in a Google Colaboratory cell downloads the target file to the current directory.
!wget https://nlp100.github.io/data/neko.txt
Reference: wget command: download a file by specifying its URL
Next, install MeCab.
!apt install mecab libmecab-dev mecab-ipadic-utf8
Once the installation is complete, we run the morphological analysis right away.
Executing the following command writes the morphological analysis result of `neko.txt` to `neko.txt.mecab`.
!mecab -o ./neko.txt.mecab ./neko.txt
References: Morphological analysis (Wikipedia); MeCab command-line argument list and usage examples
Check the output result.
#Check the number of lines
!wc -l ./neko.txt.mecab
output
216302 ./neko.txt.mecab
#Check the first 10 lines
!head -10 ./neko.txt.mecab
output
一	名詞,数,*,*,*,*,一,イチ,イチ
EOS
EOS
　	記号,空白,*,*,*,*,　,　,　
吾輩	名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ
は	助詞,係助詞,*,*,*,*,は,ハ,ワ
猫	名詞,一般,*,*,*,*,猫,ネコ,ネコ
で	助動詞,*,*,特殊・ダ,連用形,で,デ,デ
ある	助動詞,*,*,五段・ラ行アル,基本形,ある,アル,アル
。	記号,句点,*,*,*,*,。,。,。
Implement a program that reads the morphological analysis result (neko.txt.mecab). Store each morpheme as a mapping type with keys for the surface form (surface), base form (base), part of speech (pos), and part-of-speech subdivision 1 (pos1), and represent each sentence as a list of such morphemes (mapping types). Use the program created here for the remaining problems in Chapter 4.
filename = './neko.txt.mecab'
sentences = []
morphs = []
with open(filename, mode='r') as f:
    for line in f:  # read line by line
        if line != 'EOS\n':  # not end of sentence: store the morpheme info in a dict and append it to the morpheme list
            surface, attr = line.split('\t')
            attr = attr.split(',')
            morph = {'surface': surface, 'base': attr[6], 'pos': attr[0], 'pos1': attr[1]}
            morphs.append(morph)
        else:  # end of sentence: append the morpheme list to the sentence list
            sentences.append(morphs)
            morphs = []
# Verification
for morph in sentences[2]:
    print(morph)
output
{'surface': '\u3000', 'base': '\u3000', 'pos': '記号', 'pos1': '空白'}
{'surface': '吾輩', 'base': '吾輩', 'pos': '名詞', 'pos1': '代名詞'}
{'surface': 'は', 'base': 'は', 'pos': '助詞', 'pos1': '係助詞'}
{'surface': '猫', 'base': '猫', 'pos': '名詞', 'pos1': '一般'}
{'surface': 'で', 'base': 'だ', 'pos': '助動詞', 'pos1': '*'}
{'surface': 'ある', 'base': 'ある', 'pos': '助動詞', 'pos1': '*'}
{'surface': '。', 'base': '。', 'pos': '記号', 'pos1': '句点'}
References: reading and writing files in Python; splitting strings in Python; loops with Python's for statement; conditional branching with Python's if statement; creating dictionaries with dict(), curly braces, and dictionary comprehensions; appending elements to a list (array) in Python
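To see how one MeCab line maps to the dictionary above, here is a minimal sketch with a single hard-coded line (a hypothetical example, assuming IPAdic's 9-field output format):

```python
# A hypothetical MeCab output line: surface form + TAB + 9 comma-separated fields
line = '猫\t名詞,一般,*,*,*,*,猫,ネコ,ネコ\n'

surface, attr = line.rstrip('\n').split('\t')  # split off the surface form
fields = attr.split(',')                       # pos, pos1, ..., base form at index 6
morph = {'surface': surface, 'base': fields[6], 'pos': fields[0], 'pos1': fields[1]}
print(morph)
# → {'surface': '猫', 'base': '猫', 'pos': '名詞', 'pos1': '一般'}
```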
Extract all the surface forms of the verb.
From here on, we process the `sentences` list created in problem 30. The `set` type used to store the results represents a mathematical set and does not allow duplicates. Therefore, you can add elements without checking for duplicates and still naturally end up with a unique collection, which is convenient for a question like this one.
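For example, on toy data (hypothetical words, not from the corpus), adding a duplicate to a set is simply a no-op:

```python
ans = set()
for w in ['見る', '見る', '鳴く']:  # '見る' appears twice
    ans.add(w)                     # duplicates are kept only once
print(len(ans))  # → 2
```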
ans = set()
for sentence in sentences:
    for morph in sentence:
        if morph['pos'] == '動詞':
            ans.add(morph['surface'])  # a set keeps only unique elements
# Verification
print(f'Types of surface forms of verbs: {len(ans)}\n')
print('---sample---')
for i in range(10):
    print(list(ans)[i])
output
Types of surface forms of verbs: 3893
---sample---
Profit
Shinobi
Line up
Notice
Show
Fast
Separated
Keep
Matsuwa
From
Reference: set operations with Python's set type
Extract all the base forms of the verbs.
ans = set()
for sentence in sentences:
    for morph in sentence:
        if morph['pos'] == '動詞':
            ans.add(morph['base'])
# Verification
print(f'Types of verb base forms: {len(ans)}\n')
print('---sample---')
for i in range(10):
    print(list(ans)[i])
output
Types of verb base forms: 2300
---sample---
Line up
Await
Relentless
Can hit
Respond
undertake
Fold in
pierce
Grow
tell
Extract all noun phrases in which two nouns are connected by "の" (no).
ans = set()
for sentence in sentences:
    for i in range(1, len(sentence) - 1):
        if sentence[i - 1]['pos'] == '名詞' and sentence[i]['surface'] == 'の' and sentence[i + 1]['pos'] == '名詞':
            ans.add(sentence[i - 1]['surface'] + sentence[i]['surface'] + sentence[i + 1]['surface'])
# Verification
print(f'Types of "noun + の + noun" phrases: {len(ans)}\n')
print('---sample---')
for i in range(10):
    print(list(ans)[i])
output
"noun+of+名詞」of種類: 4924
---sample---
The body of sickness
One side of the
I hate myself
Police trouble
Of the law
Suitable for things
Detective of the world
Fear of protection
Two elements
Standing flock
References: getting the size of objects of various types with Python's len function; concatenating and joining strings in Python
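To illustrate the three-morpheme window check on toy data (a hypothetical mini-sentence, not taken from the corpus):

```python
# hypothetical morpheme list for 「彼の掌」 ("his palm")
sentence = [
    {'surface': '彼', 'pos': '名詞'},
    {'surface': 'の', 'pos': '助詞'},
    {'surface': '掌', 'pos': '名詞'},
]
ans = set()
for i in range(1, len(sentence) - 1):
    # noun on the left, 「の」 in the middle, noun on the right
    if sentence[i - 1]['pos'] == '名詞' and sentence[i]['surface'] == 'の' and sentence[i + 1]['pos'] == '名詞':
        ans.add(sentence[i - 1]['surface'] + sentence[i]['surface'] + sentence[i + 1]['surface'])
print(ans)  # → {'彼の掌'}
```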
Extract the longest runs of concatenated nouns (nouns that appear consecutively).
For each sentence, we scan the morphemes in order from the first one: while the part of speech is a noun, we concatenate its surface form into `nouns` and count the run length in `num`. When a non-noun appears, if the run so far contains two or more nouns we add `nouns` to the result set; in either case we then reset `nouns` and `num`.

ans = set()
for sentence in sentences:
    nouns = ''
    num = 0
    for i in range(len(sentence)):
        if sentence[i]['pos'] == '名詞':  # starting from the first morpheme, concatenate nouns into nouns and count the run length (num)
            nouns = ''.join([nouns, sentence[i]['surface']])
            num += 1
        elif num >= 2:  # not a noun, and the run so far has 2+ nouns: record it, then reset nouns and num
            ans.add(nouns)
            nouns = ''
            num = 0
        else:  # otherwise just reset nouns and num
            nouns = ''
            num = 0
# Verification
print(f'Types of concatenated nouns: {len(ans)}\n')
print('---sample---')
for i in range(10):
    print(list(ans)[i])
output
Types of concatenated nouns: 4454
---sample---
Kan Inoguchi
Street today
Must-have world
Two sheets
You champagne
Approaching
Idiot
Hibiscus
10 years now
Other than stimulation
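Note that the loop above resets only when a non-noun appears, so a noun run that reaches the end of a sentence is never recorded. A variant that flushes such a run after the loop might look like this (a sketch on toy data with hypothetical words, not the original program):

```python
# hypothetical sentence ending in a two-noun run
sentence = [{'surface': '吾輩', 'pos': '名詞'}, {'surface': '達', 'pos': '名詞'}]

ans = set()
nouns, num = '', 0
for morph in sentence:
    if morph['pos'] == '名詞':
        nouns += morph['surface']
        num += 1
    else:
        if num >= 2:
            ans.add(nouns)
        nouns, num = '', 0
if num >= 2:  # flush a run that ends with the sentence
    ans.add(nouns)
print(ans)  # → {'吾輩達'}
```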
Find the words that appear in the text and their frequencies, and arrange them in descending order of frequency.
from collections import defaultdict

ans = defaultdict(int)
for sentence in sentences:
    for i in range(len(sentence)):
        if sentence[i]['pos'] != '記号':
            ans[sentence[i]['base']] += 1  # increment the word count (a new word starts from 0, so its first occurrence becomes 1)
ans = sorted(ans.items(), key=lambda x: x[1], reverse=True)
# Verification
for i in range(5):
    print(ans[i])
output
('の', 9194)
('て', 6848)
('は', 6420)
('に', 6243)
('を', 6071)
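The counting idiom relies on `defaultdict(int)` returning 0 for missing keys, so the first occurrence of a word becomes 1 without any existence check. A toy illustration (hypothetical token stream):

```python
from collections import defaultdict

counts = defaultdict(int)
for w in ['の', 'て', 'の']:  # hypothetical tokens
    counts[w] += 1  # a missing key starts at 0, so this sets 1 on first sight
print(sorted(counts.items(), key=lambda x: x[1], reverse=True))
# → [('の', 2), ('て', 1)]
```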
References: how to use Python's defaultdict; sorting a list of dictionaries in Python by the value of a specific key
Display the 10 most frequent words and their frequencies in a graph (for example, a bar chart).
Install `japanize_matplotlib` to display Japanese text with matplotlib.
!pip install japanize_matplotlib
Reference: [Super easy] how to make matplotlib display Japanese in just two steps
Then, as in problem 35, aggregate the word frequencies and visualize them with a bar chart.
import matplotlib.pyplot as plt
import japanize_matplotlib

ans = defaultdict(int)
for sentence in sentences:
    for i in range(len(sentence)):
        if sentence[i]['pos'] != '記号':
            ans[sentence[i]['base']] += 1  # increment the word count (a new word starts from 0)
ans = sorted(ans.items(), key=lambda x: x[1], reverse=True)
keys = [a[0] for a in ans[0:10]]
values = [a[1] for a in ans[0:10]]
plt.figure(figsize=(8, 4))
plt.bar(keys, values)
plt.show()
Basics of Python graph drawing library Matplotlib
Display the 10 words that most often co-occur with "猫" (cat) and their frequencies in a graph (for example, a bar chart).
No part of speech is filtered out here because the problem gives no particular instruction, but depending on the purpose, removing particles and the like may yield more meaningful results.
ans = defaultdict(int)
for sentence in sentences:
    if '猫' in [morph['surface'] for morph in sentence]:  # count only sentences whose morphemes include 「猫」
        for i in range(len(sentence)):
            if sentence[i]['pos'] != '記号':
                ans[sentence[i]['base']] += 1  # increment the word count (a new word starts from 0)
del ans['猫']
ans = sorted(ans.items(), key=lambda x: x[1], reverse=True)
keys = [a[0] for a in ans[0:10]]
values = [a[1] for a in ans[0:10]]
plt.figure(figsize=(8, 4))
plt.bar(keys, values)
plt.show()
Delete dictionary elements in Python
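If you do want to drop particles and other function words, as the note above suggests, one sketch is to keep only content parts of speech. This is toy data and the pos filter set is an assumption for illustration, not from the original program:

```python
from collections import defaultdict

# hypothetical sentences already parsed into morpheme dicts
sentences_toy = [
    [{'surface': '猫', 'base': '猫', 'pos': '名詞'},
     {'surface': 'が', 'base': 'が', 'pos': '助詞'},
     {'surface': '鳴く', 'base': '鳴く', 'pos': '動詞'}],
]
content_pos = {'名詞', '動詞', '形容詞'}  # assumed set of content parts of speech

ans = defaultdict(int)
for sentence in sentences_toy:
    if '猫' in [m['surface'] for m in sentence]:
        for m in sentence:
            if m['pos'] in content_pos:  # the particle 「が」 is skipped here
                ans[m['base']] += 1
del ans['猫']  # drop the query word itself
print(dict(ans))  # → {'鳴く': 1}
```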
Draw a histogram of word frequencies (a bar graph with frequency on the horizontal axis and the number of word types having that frequency on the vertical axis).
ans = defaultdict(int)
for sentence in sentences:
    for i in range(len(sentence)):
        if sentence[i]['pos'] != '記号':
            ans[sentence[i]['base']] += 1  # increment the word count (a new word starts from 0)
ans = ans.values()
plt.figure(figsize=(8, 4))
plt.hist(ans, bins=100)
plt.show()
Get only dictionary keys and values as a list in Python
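The histogram above shows the "frequency of frequencies": for each count value, how many word types occur that many times. On toy data (hypothetical tokens), the same quantity can be computed directly with `collections.Counter`:

```python
from collections import Counter

tokens = ['の', 'の', 'の', '猫', '猫', 'て']   # hypothetical token stream
word_counts = Counter(tokens)                  # word → count: の:3, 猫:2, て:1
freq_of_freq = Counter(word_counts.values())   # count value → number of word types
print(dict(freq_of_freq))  # → {3: 1, 2: 1, 1: 1}
```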
Plot a log-log graph with the frequency rank of words on the horizontal axis and their frequency on the vertical axis.
import math

ans = defaultdict(int)
for sentence in sentences:
    for i in range(len(sentence)):
        if sentence[i]['pos'] != '記号':
            ans[sentence[i]['base']] += 1  # increment the word count (a new word starts from 0)
ans = sorted(ans.items(), key=lambda x: x[1], reverse=True)
ranks = [math.log(r + 1) for r in range(len(ans))]  # log of 1-based rank
values = [math.log(a[1]) for a in ans]  # log of frequency
plt.figure(figsize=(8, 4))
plt.scatter(ranks, values)
plt.show()
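If the data follow Zipf's law, the points on this log-log plot fall roughly on a line with slope about -1. A quick sanity check on idealized, hypothetical frequencies proportional to 1/rank:

```python
import math

freqs = [1000 / r for r in range(1, 6)]  # idealized Zipf counts: 1000, 500, 333.3, ...
# slope of the log-log line between the first (rank 1) and last (rank 5) point
slope = (math.log(freqs[-1]) - math.log(freqs[0])) / (math.log(5) - math.log(1))
print(round(slope, 6))  # → -1.0
```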
References: [Zipf's law (Wikipedia)](https://ja.wikipedia.org/wiki/%E3%82%B8%E3%83%83%E3%83%97%E3%81%AE%E6%B3%95%E5%89%87); calculating exponential and logarithmic functions in Python
The 100 Language Processing Knocks are designed so that you can learn not only natural language processing itself, but also basic data processing and general-purpose machine learning. Even those studying machine learning through online courses will find it excellent practice, so please give it a try.