I have been solving the 100 Language Processing Knocks at a study session with coworkers. This article is a summary of my answer code and the tricks I found useful along the way. Most of the content was investigated and verified by myself, but it also includes information shared by other study group members.
Up to Chapter 3, the content was applicable to Python programming in general, but since this chapter covers morphological analysis, it finally starts to feel like real language processing.
- Unix commands learned in Chapter 2 of the 100 Language Processing Knocks
- Regular expressions learned in Chapter 3 of the 100 Language Processing Knocks
- Morphological analysis learned in Chapter 4 of the 100 Language Processing Knocks (this article)
Put `neko.txt` in the same directory as the ipynb (or Python) file and run

```
!mecab neko.txt -o neko.txt.mecab
```

The result of the morphological analysis is then written to `neko.txt.mecab`.
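For reference, each line of `neko.txt.mecab` (with MeCab's default IPAdic dictionary) should look roughly like this: the surface form, a tab, then nine comma-separated features (POS, three POS subdivisions, conjugation type, conjugation form, base form, reading, pronunciation), with `EOS` marking each sentence end. This is why the parser below splits on the tab, then on commas, and why `arr[6]` is the base form:

```
吾輩	名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ
は	助詞,係助詞,*,*,*,*,は,ハ,ワ
EOS
```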
```python
import re
import itertools

def parse_text(flat=True):
    with open('neko.txt.mecab') as file:
        morphs = []
        sentence = []
        for line in file:
            if re.search(r'EOS', line):
                continue
            surface, rest = line.split('\t')
            arr = rest.split(',')
            sentence.append({
                'surface': surface,
                'base': arr[6],
                'pos': arr[0],
                'pos1': arr[1],
            })
            if surface == ' ':  # whitespace is not treated as a morpheme
                sentence.pop(-1)
            if surface in [' ', '。']:  # treat whitespace and 。 as sentence ends
                morphs.append(sentence)
                sentence = []
    if flat:
        return list(itertools.chain.from_iterable(morphs))
    else:
        return morphs

parse_text(flat=False)[:10]
```
result
```
[[{'surface': 'I', 'base': 'I', 'pos': 'noun', 'pos1': 'pronoun'},
  {'surface': 'Is', 'base': 'Is', 'pos': 'particle', 'pos1': 'binding particle'},
  {'surface': 'Cat', 'base': 'Cat', 'pos': 'noun', 'pos1': 'general'},
  {'surface': 'so', 'base': 'Is', 'pos': 'auxiliary verb', 'pos1': '*'},
  {'surface': 'is there', 'base': 'is there', 'pos': 'auxiliary verb', 'pos1': '*'},
  {'surface': '。', 'base': '。', 'pos': 'symbol', 'pos1': 'period'}],
 [{'surface': 'name', 'base': 'name', 'pos': 'noun', 'pos1': 'general'},
  ...
```
(Note: the actual MeCab output contains Japanese part-of-speech names such as 名詞 (noun) and 動詞 (verb); the dictionary values are shown translated here for readability, which is also why the code in this article compares against the Japanese strings.)

This problem asks you to "express each sentence as a list of morphemes (mapping type)". However, when this function is reused in problems 31 and later, a flattened (non-nested) return value is easier to handle, so I added a keyword argument named `flat`. Its default value is `True`, but here I pass `flat=False` because this problem wants the unflattened, sentence-by-sentence list.
The `[:10]` on the last line is only because about ten elements felt right for checking the output; there is no particular necessity to it.
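As a minimal sketch of the difference between the two call styles:

```python
# flat=True (default): one flat list of morpheme dicts for the whole text
flat_morphs = parse_text()

# flat=False: a list of sentences, each of which is a list of morpheme dicts
nested_morphs = parse_text(flat=False)
first_sentence = nested_morphs[0]
```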
```python
def extract_surface():
    verbs = list(filter(lambda morph: morph['pos'] == '動詞', parse_text()))  # 動詞 = verb
    return [verb['surface'] for verb in verbs]

print(extract_surface()[:100])
```
result
```
['Born', 'Tsuka', 'Shi', 'Crying', 'Shi', 'Is', 'start', 'You see', 'listen', 'Catch', 'Boiled', 'Eat', 'Thoughts', 'Loading', 'Be', 'Lift', 'Be', 'Shi', 'Ah', 'Calm down', 'You see', 'You see', 'Thoughts', 'Remaining', 'Is', 'Sa', 'Re', 'Shi', 'Meet', 'Meet', 'Shi', 'only', 'If', 'Shi', 'Is', 'Blow', 'Se', 'Weak', 'to drink', 'Know', 'Sit down', 'Oh', 'To do', 'Shi', 'start', 'Move', 'Move', 'Understand', 'Go around', 'Become', 'From assistant', 'Thoughts', 'Is', 'Leave', 'Shi', 'Out', 'Shi', 'Is', 'Recall', 'Understand', 'With', 'See', 'I', 'Oh', 'Seem', 'Hide', 'End up', 'Different', 'Clear', 'I', 'Be', 'Crawl out', 'See', 'Throw away', 'Be', 'Crawl out', 'is there', 'Sit down', 'Shi', 'Thoughts', 'You see', 'Out', 'Shi', 'Crying', 'Coming', 'Give', 'Think of', 'Finally', 'You see', 'Coming', 'Crossed', 'Take', 'Decreased', 'Coming', 'Crying', 'Out', 'is there', 'is there', 'Shi', "It's about time"]
```
The return value could also be built as `list(map(lambda morph: morph['surface'], verbs))`, but the list comprehension above makes for more concise code.
Also, depending on how you interpret the problem statement, if you do not want duplicate elements in the returned list, you can change the last line of the function to:

```python
return set([verb['surface'] for verb in verbs])
```
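Note that `set` does not preserve order; if you want to deduplicate while keeping the order of first appearance, one sketch (relying on dicts being insertion-ordered since Python 3.7) is:

```python
# dict.fromkeys keeps the first occurrence of each key, in order
return list(dict.fromkeys(verb['surface'] for verb in verbs))
```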
```python
def extract_base():
    verbs = list(filter(lambda morph: morph['pos'] == '動詞', parse_text()))  # 動詞 = verb
    return [verb['base'] for verb in verbs]

print(extract_base()[:100])
```
result
```
['Born', 'Tsukuri', 'To do', 'cry', 'To do', 'Is', 'start', 'to see', 'listen', 'capture', 'Boil', 'Eat', 'think', 'Put', 'Be', 'lift', 'Be', 'To do', 'is there', 'Calm down', 'to see', 'to see', 'think', 'Remain', 'Is', 'To do', 'To be', 'To do', 'Meet', 'meet', 'To do', 'Drinking', 'Become', 'To do', 'Is', 'Blow', 'To do', 'Weak', 'to drink', 'know', 'Sit down', 'Oru', 'To do', 'To do', 'start', 'Move', 'Move', 'Understand', 'Go around', 'Become', 'Be saved', 'think', 'Is', 'Monkey', 'To do', 'Get out', 'To do', 'Is', 'Come up', 'Understand', 'Attach', 'to see', 'Is', 'Oru', 'appear', 'hide', 'End up', 'Wrong', 'To be clear', 'Is', 'Be', 'Crawl out', 'to see', 'Discard', 'Be', 'Crawl out', 'is there', 'Sit down', 'To do', 'Think', 'to see', 'Get out', 'To do', 'cry', 'come', 'Give', 'Think of', 'do', 'to see', 'come', 'Cross', 'Take', 'decrease', 'come', 'cry', 'Get out', 'is there', 'is there', 'To do', 'Shave']
```
Almost the same as 31.
```python
def extract_sahens():
    # サ変接続 = sahen (suru-verb) connection
    return list(filter(lambda morph: morph['pos1'] == 'サ変接続', parse_text()))

extract_sahens()[:20]
```
result
```
[{'surface': 'Register', 'base': 'Register', 'pos': 'noun', 'pos1': 'sahen connection'},
 {'surface': 'Memory', 'base': 'Memory', 'pos': 'noun', 'pos1': 'sahen connection'},
 {'surface': 'Talk', 'base': 'Talk', 'pos': 'noun', 'pos1': 'sahen connection'},
 {'surface': 'Decoration', 'base': 'Decoration', 'pos': 'noun', 'pos1': 'sahen connection'},
 {'surface': 'Protrusion', 'base': 'Protrusion', 'pos': 'noun', 'pos1': 'sahen connection'},
 ...
```
Almost the same as 31.
```python
def extract_noun_phrases():
    morphs = parse_text()
    phrases = []
    for i, morph in enumerate(morphs):
        # stay inside the list bounds when looking at both neighbors
        if 0 < i < len(morphs) - 1 and morph['surface'] == 'の' \
                and morphs[i - 1]['pos'] == '名詞' \
                and morphs[i + 1]['pos'] == '名詞':  # の = "of", 名詞 = noun
            phrases.append(
                morphs[i - 1]['surface'] + 'の' + morphs[i + 1]['surface'])
    return phrases

print(extract_noun_phrases()[:100])
```
result
```
['His palm', 'On the palm', 'Student's face', 'Should face', 'In the middle of the face', 'In the hole', 'Calligraphy palm', 'The back of the palm', 'What', 'Essential mother', 'On the straw', 'In Sasahara', 'In front of the pond', 'On the pond', 'Hedge hole', 'Three neighbors', 'Passage of time', 'Inside the house', 'His student', 'Humans other than', 'Previous student', 'Your chance', 'Three of you', 'Chest itching', 'Housekeeper', 'Master', 'Under the nose', 'My face', 'My home', 'My husband', 'Home stuff', 'Ours', 'His study', 'On the book', 'Skin color', 'On the book', 'His every night', 'Other than', 'Beside my husband', 'His knees', 'On the lap', 'On experience', 'On the rice bowl', 'On the kotatsu', 'Of here', 'Bed to accompany', 'Between them', 'Companion', 'Example nerve', 'Sexual master', 'Next room', 'Selfish', 'For me', 'Between the kitchen boards', 'Respect for me', 'White', 'Like a ball', 'House there', 'Home student', 'Back pond', 'Parent-child love', 'More discussion', 'Stab head', 'Mullet navel', 'For them', 'Military house', 'Master of the substitute', 'Teacher's house', 'Cat season', 'My house', 'Housekeeper', 'Full of English', 'Weak stomach habit', 'In the back rack', 'Hira no Sect', 'Monthly salary', 'For the time being', 'Like below', 'Like now', 'Master's memoir', 'His friend', 'Gold-rimmed glasses', 'Master's face', 'Imagination inside', 'Translated', 'Landlord of interest', 'Behind the gold rim', 'Behind me', 'His friend', 'My circle', 'Around the face', 'The result of the addition', 'Facial features', 'Other cats', 'Clumsy servant', 'My husband', 'Cat of the same production', 'Variegated skin', 'Coloring of the master', 'Relative muscles']
```
Using `enumerate()`, you can access the index of each `morph` within `morphs` as a variable, which keeps the code tidy.
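As an aside, here is a sketch of an index-free variant of the same idea, pairing each morpheme with its neighbors by zipping three offset views of the list:

```python
def extract_noun_phrases_v2():
    morphs = parse_text()
    # zip yields (previous, current, next) triples over the whole list
    return [prev['surface'] + 'の' + nxt['surface']
            for prev, cur, nxt in zip(morphs, morphs[1:], morphs[2:])
            if cur['surface'] == 'の'       # の = "of"
            and prev['pos'] == '名詞'       # 名詞 = noun
            and nxt['pos'] == '名詞']
```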
```python
def extract_continuous_nouns():
    morphs = parse_text()
    continuous_nouns = []
    for i, morph in enumerate(morphs):
        if i + 1 < len(morphs) and morph['pos'] == '名詞' \
                and morphs[i + 1]['pos'] == '名詞':  # 名詞 = noun
            continuous_noun = morph['surface'] + morphs[i + 1]['surface']
            j = 1
            # keep absorbing nouns until the run ends (guarding the list end)
            while i + 1 + j < len(morphs) and morphs[i + 1 + j]['pos'] == '名詞':
                continuous_noun += morphs[i + 1 + j]['surface']
                j += 1
            continuous_nouns.append(continuous_noun)
    return continuous_nouns

print(extract_continuous_nouns()[:100])
```
result
```
['In humans', 'Timely', 'Then the cat', 'Puupuu and smoke', 'Inside the mansion', 'Calico', 'Other than student', 'Forty-five hen', 'Gohen', 'The other day', 'Mima', 'Your kitchen', 'Mama back', 'Dwelling house', 'All-day study', 'Hard worker', 'Hard worker', 'Diligent', 'A few pages', 'Three pages', 'Other than my husband', 'As far as I am', 'Morning master', 'Two persons', 'Last hard', '--Cat', 'Neurogastric weakness', 'Stomach weakness', 'Finger', 'Terrible ass', 'Language break', 'My wife', 'Overall', 'Muscle direction', 'White', 'Every time', 'White', 'The other day ball', 'Four swords', 'third day', 'Day', 'Four swords', 'White', 'Our cats', 'Felis', 'Cats', 'Family life', 'Life', 'Calico-kun', 'Mao', 'Ownership', 'Between us', 'Between the same family', 'Eye stab', 'They human', 'We', 'I', 'White', 'Calico-kun', 'Mao', 'Full of mistakes', 'Mr.', 'Munemori', 'Munemori', 'Monthly salary date', 'Watercolor paint', 'Every day every day study', 'Daily study', 'person's', 'Brush yourself', 'Over glasses', 'Tairi', 'Landlord Andrea del Sarto', 'Dew', 'Jackdaw', 'This width', 'Live painting', 'The next day', 'Spicy stick', 'Now I', 'Now I', 'Wave product', 'Wonder', 'Blind cat', 'Secret in my heart', 'How much Andrea del Sarto', 'Another big', 'Break', 'Stupid guy', 'Stupid guy', 'Spicy stick', 'Stupid guy call', 'Call a bastard', 'Call', 'Hirao', 'Stupid guy', 'Everyone grows', 'Where to go', 'Several times', '10 tsubo']
```
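Note that the indexed loop above emits overlapping sub-runs: a run of three consecutive nouns is appended twice, once from each starting position (which is visible in duplicates like 'Hard worker' in the output). If you want each maximal run exactly once, a sketch using `itertools.groupby` would be:

```python
from itertools import groupby

def extract_continuous_nouns_v2():
    morphs = parse_text()
    runs = []
    # groupby clusters *consecutive* morphemes with the same key,
    # so each maximal run of nouns forms exactly one group
    for is_noun, group in groupby(morphs, key=lambda m: m['pos'] == '名詞'):
        group = list(group)
        if is_noun and len(group) >= 2:
            runs.append(''.join(m['surface'] for m in group))
    return runs
```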
```python
import pandas as pd

def sort_by_freq():
    words = [morph['base'] for morph in parse_text()]
    return pd.Series(words).value_counts()

sort_by_freq()[:20]
```
result
```
of      9194
。      7486
hand    6868
、      6772
Is      6420
...
```
As an alternative solution, you can use `Counter` from the standard library's `collections` module:

```python
from collections import Counter
```

and then replace pandas' `value_counts()` with the `most_common()` method of a `Counter` object, changing the last line of the function to:

```python
return Counter(words).most_common()
```
result
```
[('of', 9194),
 ('。', 7486),
 ('hand', 6868),
 ('、', 6772),
 ('Is', 6420),
 ...
```
The difference between the two solutions is that `value_counts()` returns a `Series` while `most_common()` returns a list of tuples, so choose whichever is easier for your subsequent processing.
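If you start with `Counter` but later want a `Series` anyway, the two convert easily. A minimal sketch, assuming the same `words` list as in `sort_by_freq`:

```python
from collections import Counter
import pandas as pd

# Counter is a dict subclass, so pd.Series accepts it directly
freqs = pd.Series(Counter(words)).sort_values(ascending=False)
```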
```python
import japanize_matplotlib

def draw_bar_graph():
    sort_by_freq()[:10].plot.bar()

draw_bar_graph()
```
Getting Japanese text to display in matplotlib takes a little work, but it was easy with the `japanize_matplotlib` library introduced in the article here.
As for the body of the function, I wanted to keep it short, so I called `.plot.bar()` on the `Series` object. However, after importing `matplotlib` with

```python
import matplotlib.pyplot as plt
```

the following also works:

```python
morph_freqs = sort_by_freq()[:10]
plt.bar(morph_freqs.index, morph_freqs)
```
```python
import matplotlib.pyplot as plt

def draw_hist():
    plt.xlabel('Frequency of appearance')
    plt.ylabel('Number of word types')
    plt.hist(sort_by_freq(), bins=200)

draw_hist()
```
This is the code for displaying the whole distribution, but the result is a little hard to read, so it may be more practical to limit the display range as follows.
```python
def draw_hist_2():
    plt.xlabel('Frequency of appearance')
    plt.ylabel('Number of word types')
    plt.title('Appearance frequency 20 or less')
    plt.hist(sort_by_freq(), bins=20, range=(1, 20))

draw_hist_2()
```
```python
def draw_log_graph():
    plt.xscale('log')
    plt.yscale('log')
    plt.xlabel('Appearance frequency ranking')
    plt.ylabel('Frequency of appearance')
    plt.scatter(range(1, len(sort_by_freq()) + 1), sort_by_freq(), s=10)

draw_log_graph()
```
When calling the `scatter` method, you can specify the size of the points with the `s` option. The default size (`rcParams['lines.markersize'] ** 2`, i.e. 36 with default settings) was a little large for my taste, so I set it to 10.
MeCab is open-source software, but it was impressive to see just how much it can do.
On the other hand, I also felt that the analysis is not perfectly accurate out of the box (for example, in the output of problem 35, 'Puupuu' is an onomatopoeic adverb, not a noun). To fix issues like this, I think you need to try other dictionaries or customize one yourself.
That's all for this chapter. If you spot any mistakes, please let me know in the comments.