**mecab-ipadic-NEologd** is a customized dictionary that supplements the standard MeCab dictionary.
Words collected from many language resources on the Web have been added, so it **supports new words, compound words, idiomatic expressions, and so on**.
As mentioned above, the standard MeCab dictionary splits "individualism" into "individual" and "-ism", whereas mecab-ipadic-NEologd treats it as the single word "individualism".
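As a quick way to see the difference, one can parse the same word with both dictionaries once the installation steps below have been run. This is only a minimal sketch; it uses 感染症 ("infectious disease"), one of the examples from the comparison table at the end, and the same dictionary path that is used later in this article.
import MeCab
tagger_std = MeCab.Tagger()  #standard ipadic dictionary
tagger_neo = MeCab.Tagger("-d /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd")  #NEologd
print(tagger_std.parse("感染症"))  #split into 感染 / 症
print(tagger_neo.parse("感染症"))  #kept as the single word 感染症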
from google.colab import files
uploaded = files.upload()
with open('20200926_suga_un.txt', mode='rt', encoding='utf-8') as f:
    read_text = f.read()
sugatxt = read_text
#Delete unnecessary characters / symbols
def clean(text):
    text = text.replace("\n", "")
    text = text.replace("\u3000", "")
    text = text.replace("「", "")
    text = text.replace("」", "")
    text = text.replace("(", "")
    text = text.replace(")", "")
    text = text.replace("、", "")
    return text
text = clean(sugatxt)
#Split into sentence units at "。"
lines = text.split("。")
# MeCab
!apt-get -q -y install sudo file mecab libmecab-dev mecab-ipadic-utf8 git curl python-mecab > /dev/null
!pip install mecab-python3 > /dev/null
# mecab-ipadic-NEologd
!git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git > /dev/null
!echo yes | mecab-ipadic-neologd/bin/install-mecab-ipadic-neologd -n > /dev/null 2>&1
#Error avoidance by symbolic links
!ln -s /etc/mecabrc /usr/local/etc/mecabrc
This symbolic link is needed so that the `path` specified in the next step works properly.
#Check the dictionary path
!echo `mecab-config --dicdir`"/mecab-ipadic-neologd"
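Instead of copying the printed path by hand, the same value could also be captured programmatically. This is only a sketch, assuming mecab-config is on PATH (the install step above provides it):
import subprocess
dicdir = subprocess.run(["mecab-config", "--dicdir"],
                        capture_output=True, text=True).stdout.strip()
neologd_path = "-d " + dicdir + "/mecab-ipadic-neologd"  #same option string as the hard-coded path below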
Create the MeCab.Tagger instance `m_neo`, passing `path` (= the mecab-ipadic-NEologd dictionary) as its option.
import MeCab
path = "-d /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd"
m_neo = MeCab.Tagger(path)
stopwords = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "0",
             "１", "２", "３", "４", "５", "６", "７", "８", "９", "０",
             "一", "二", "三", "四", "五", "六", "七", "八", "九", "〇",
             "年", "月", "日", "次", "割", "回", "的", "病", "以上", "以下", "周年", "件", "毎",
             "の", "もの", "こと", "よう", "様", "ため", "がち", "これ", "それ", "あれ", "誰",
             "*", ",", "、"]
Create `noun_list` as follows. For `lines`, the text divided into sentence units, repeat the process below sentence by sentence and append the result to `noun_list`. Parse each sentence with `m_neo.parse()` and call the output `v1`, then split `v1` into per-word lines with `splitlines()` to get `v2`. Split each element of `v2` with `split("\t")`, i.e. on the tab character, into two parts, the original word and the analysis part, giving `v3`. Except for "EOS" and empty lines, split the analysis part `v3[1]` with `split(',')`, i.e. on commas, and call the result `v4`. If `v4[0]` is a noun (名詞) and the base form `v4[6]` is not a stopword, append it to `result`. Finally, append `result` to `noun_list` and move on to the next sentence.
noun_list = []
for line in lines:
    result = []
    v1 = m_neo.parse(line)
    v2 = v1.splitlines()
    for v in v2:
        v3 = v.split("\t")
        if len(v3) == 2:
            v4 = v3[1].split(',')
            if (v4[0] == "名詞") and (v4[6] not in stopwords):
                result.append(v4[6])
    noun_list.append(result)
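To see the structure this loop relies on, it can help to print the raw MeCab output for a single sentence. The sample sentence below is only an illustration, not taken from the speech text:
#Inspect the tab/comma structure of one parsed sentence
for row in m_neo.parse("国際社会と連携して感染症対策を進めます。").splitlines():
    cols = row.split("\t")              #surface form and analysis part
    if len(cols) == 2:                  #skip "EOS" and empty lines
        features = cols[1].split(',')
        print(cols[0], features[0], features[6])  #surface form, part of speech, base form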
From `noun_list`, for each per-sentence word list that contains two or more words (`if len(n) >= 2`), generate the pairs (combinations of two words) with `itertools.combinations()`, turn them into a list with `list()`, and store them in `pair_list`. Since `pair_list` is organized in sentence units, flatten it into `all_pairs` and then count the number of occurrences of each pair.
import itertools #A module that collects iterator functions
from collections import Counter #A dict subclass for counting occurrences
#Generate a sentence-based noun pair list
pair_list = []
for n in noun_list:
    if len(n) >= 2:
        lt = list(itertools.combinations(n, 2))
        pair_list.append(lt)
#Flatten the noun pair list
all_pairs = []
for p in pair_list:
    all_pairs.extend(p)
#Count the frequency of noun pairs
cnt_pairs = Counter(all_pairs)
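A toy example of what these two steps do: `itertools.combinations()` generates every two-word pair from one sentence's noun list, and `Counter` then tallies identical (word, word) tuples as a single key. Note that `combinations()` keeps the input order, so a reversed pair such as ("B", "A") would be counted separately.
#Toy illustration with dummy words
print(list(itertools.combinations(["A", "B", "C"], 2)))  # [('A', 'B'), ('A', 'C'), ('B', 'C')]
print(Counter([("A", "B"), ("A", "B"), ("B", "C")]))     # Counter({('A', 'B'): 2, ('B', 'C'): 1})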
Sort with `sorted()` and take the top 30 pairs with `[:30]`. Each element of `cnt_pairs` is a (pair, count) tuple, so `key=lambda x: x[1]` sorts on the number of occurrences, and `reverse=True` puts them in descending order. The resulting `dict` is then converted into a two-dimensional array `data` for drawing.
import pandas as pd
import numpy as np
#Extract the top 30 pairs from the dictionary
dict = sorted(cnt_pairs.items(), key=lambda x:x[1], reverse=True)[:30]
#Convert dict type to 2D array
result = []
for key, value in dict:
    temp = []
    for k in key:
        temp.append(k)
    temp.append(value)
    result.append(temp)
data = np.array(result)
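One thing to note: because `result` mixes strings and integers, `np.array()` converts every element to a string, so the edge weights stored by `add_weighted_edges_from()` below are strings as well. This does not affect the drawing, which ignores weights by default, but if numeric weights are wanted later, a plain list of tuples could be used instead. A minimal alternative sketch:
#Optional: keep the pair counts as integers instead of strings
data_numeric = [(pair[0], pair[1], count) for pair, count in dict]
#G.add_weighted_edges_from(data_numeric) would then store integer weights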
import matplotlib.pyplot as plt
import networkx as nx
%matplotlib inline
#Module to make matplotlib support Japanese display
!pip install japanize-matplotlib
import japanize_matplotlib
Specify `font_family = "IPAexGothic"` in `nx.draw_networkx()` so that the Japanese node labels are displayed correctly.
#Generating a graph object
G = nx.Graph()
#Data reading
G.add_weighted_edges_from(data)
#Drawing a graph
plt.figure(figsize=(10,10))
nx.draw_networkx(G,
                 node_shape = "s",
                 node_color = "chartreuse",
                 node_size = 800,
                 edge_color = "gray",
                 font_family = "IPAexGothic") #Japanese font specification
plt.show()
mecab-ipadic-NEologd | MeCab standard
---|---
「感染症」(infectious disease) | 「感染」,「症」
「途上国」(developing country) | 「途上」,「国」
「東南アジア諸国連合」(ASEAN) | 「東南アジア」,「諸国」,「連合」
「人間の安全保障」(human security) | 「人間」,「の」,「安全」,「保障」