**Word N-gram** uses a set of adjacent words as the unit of data; a 2-gram takes two adjacent words at a time.

**Co-occurrence (collocation)** instead counts the number of times **words appear together in the target unit (here, a sentence)**. This article uses pairs of two nouns: regardless of their positional relationship, **the combination of words that appear in the same sentence is the unit of data** (see the sketch below).
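A minimal sketch of the difference, using a toy tokenized sentence (the actual tokenization comes later with MeCab):

```python
words = ["I", "am", "a", "cat"]  # one tokenized sentence (toy data)

# Word 2-gram: adjacent pairs, position matters
bigrams = list(zip(words, words[1:]))
print(bigrams)
# [('I', 'am'), ('am', 'a'), ('a', 'cat')]

# Co-occurrence: every unordered pair of words in the same sentence
pairs = [(words[i], words[j]) for i in range(len(words)) for j in range(i + 1, len(words))]
print(pairs)
# [('I', 'am'), ('I', 'a'), ('I', 'cat'), ('am', 'a'), ('am', 'cat'), ('a', 'cat')]
```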
```python
import re
import zipfile
import urllib.request
import os.path
import glob
```
- `re`: short for Regular Expression; a module for working with regular expressions
- `zipfile`: a module for working with zip files
- `glob`: a module for getting file path names

```python
URL = 'https://www.aozora.gr.jp/cards/000148/files/772_ruby_33099.zip'

def download(URL):
    # Download the zip file
    zip_file = re.split(r'/', URL)[-1]
    urllib.request.urlretrieve(URL, zip_file)
    dir_name = os.path.splitext(zip_file)[0]

    # Unzip the file and save its contents
    with zipfile.ZipFile(zip_file) as zip_object:
        zip_object.extractall(dir_name)
    os.remove(zip_file)

    # Get the path of the text file
    path = os.path.join(dir_name, '*.txt')
    file_list = glob.glob(path)
    return file_list[0]
```
```python
def convert(download_text):
    # Read the file
    with open(download_text, 'rb') as f:
        data = f.read()
    text = data.decode('shift_jis')

    # Extract the body text
    text = re.split(r'\-{5,}', text)[2]
    text = re.split(r'底本：', text)[0]
    text = re.split(r'［＃改ページ］', text)[0]

    # Delete unnecessary parts
    text = re.sub(r'《.+?》', '', text)    # ruby (furigana) annotations
    text = re.sub(r'［＃.+?］', '', text)  # editorial notes
    text = re.sub(r'｜', '', text)         # ruby start markers
    text = re.sub(r'\r\n', '', text)
    text = re.sub(r'\u3000', '', text)     # full-width spaces
    text = re.sub(r'「', '', text)
    text = re.sub(r'」', '', text)
    return text
```
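A minimal sketch of what the cleanup does, on a made-up fragment in Aozora Bunko notation (not taken from the actual file):

```python
sample = '私は｜頑固《がんこ》だ［＃「頑固」に傍点］\r\n'
sample = re.sub(r'《.+?》', '', sample)   # remove the furigana reading
sample = re.sub(r'［＃.+?］', '', sample)  # remove the editorial note
sample = re.sub(r'｜', '', sample)         # remove the ruby start marker
sample = re.sub(r'\r\n', '', sample)       # remove the line break
print(sample)  # 私は頑固だ
```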
```python
# Get the file path
download_file = download(URL)

# Extract only the body text
text = convert(download_file)

# Split into a sentence-based list
sentences = text.split("。")
```
```
!apt install aptitude
!aptitude install mecab libmecab-dev mecab-ipadic-utf8 git make curl xz-utils file -y
!pip install mecab-python3==0.7
```
The argument to `MeCab.Tagger()` specifies the output mode; with `-Ochasen`, the result of morphological analysis is output in ChaSen format.

```python
import MeCab

mecab = MeCab.Tagger("-Ochasen")
```
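For reference, each line of `-Ochasen` output is tab-separated into roughly surface form, reading, base form, part of speech, and so on (a sketch; the exact fields depend on the dictionary):

```python
print(mecab.parse("吾輩は猫である"))
# Roughly:
# 吾輩   ワガハイ  吾輩   名詞-代名詞-一般
# は     ハ        は     助詞-係助詞
# 猫     ネコ      猫     名詞-一般
# ...
# so v.split()[2] is the base form and v.split()[3] is the part of speech.
```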
```python
# Generate a sentence-by-sentence noun list
noun_list = [
    [v.split()[2] for v in mecab.parse(sentence).splitlines()
     if (len(v.split()) >= 4 and v.split()[3][:2] == '名詞')]
    for sentence in sentences
]
```
For each `sentence` in `sentences`, morphological analysis is performed by `mecab.parse(sentence)`. The result is broken into lines with `.splitlines()`, each line becoming `v`; `v` is divided by `split()`, and its third element `[2]`, the base form, is added to the list. The condition `v.split()[3][:2] == '名詞'` extracts only the lines whose fourth element `[3]`, the part of speech, is a noun (名詞).

```python
import itertools
from collections import Counter
```

- `Counter`: a class for counting the number of occurrences of each element
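Before applying it to the real data, a quick illustration of how `itertools.combinations()` forms the unordered pairs:

```python
print(list(itertools.combinations(['A', 'B', 'C'], 2)))
# [('A', 'B'), ('A', 'C'), ('B', 'C')]
```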
```python
# Generate a sentence-based noun pair list
pair_list = [
    list(itertools.combinations(n, 2))
    for n in noun_list if len(n) >= 2
]

# Flatten the noun pair list
all_pairs = []
for u in pair_list:
    all_pairs.extend(u)

# Count the frequency of noun pairs
cnt_pairs = Counter(all_pairs)
```
Two-noun combinations are generated for each sentence with `itertools.combinations()`, turned into lists with `list()`, and stored in `pair_list`. Since `pair_list` holds pairs per sentence, it cannot be counted as is; it is flattened by sequentially appending each sentence's pairs to the newly prepared variable `all_pairs` with `extend()`. Passing this to `Counter()` generates the **dictionary-type co-occurrence data** `cnt_pairs`.
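You can peek at the result with `Counter.most_common()` (the actual pairs and counts depend on the text):

```python
# Inspect the five most frequent noun pairs
for pair, freq in cnt_pairs.most_common(5):
    print(pair, freq)
```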
```python
import pandas as pd
import numpy as np

tops = sorted(
    cnt_pairs.items(),
    key=lambda x: x[1], reverse=True
)[:50]
```
This combines `sorted()` with a `lambda` expression to sort the dictionary-type object by the element specified in `key=lambda`. Here `x[1]` is the second element of each item, i.e. the frequency, `reverse=True` sorts in descending order, and `[:50]` extracts the top 50 pairs.
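A minimal example of this sorting idiom:

```python
d = {'a': 3, 'b': 5, 'c': 1}
print(sorted(d.items(), key=lambda x: x[1], reverse=True))
# [('b', 5), ('a', 3), ('c', 1)]
```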
```python
noun_1 = []
noun_2 = []
frequency = []

# Create the data frame
for n, f in tops:
    noun_1.append(n[0])
    noun_2.append(n[1])
    frequency.append(f)

df = pd.DataFrame({'First noun': noun_1, 'Second noun': noun_2, 'Frequency': frequency})

# Set the weighted data
weighted_edges = np.array(df)
```
The data frame is converted into a NumPy array with `np.array()` to obtain `weighted_edges` (the weighted edge data).
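Each row of `weighted_edges` has the form `[noun 1, noun 2, frequency]`, i.e. the `(u, v, w)` triples that `add_weighted_edges_from()` below expects. A quick structural peek (illustrative values only; the actual contents depend on the text):

```python
print(weighted_edges[:3])
# Something like:
# [['noun-a' 'noun-b' 61]
#  ['noun-a' 'noun-c' 38]
#  ['noun-b' 'noun-d' 21]]
```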
```python
import matplotlib.pyplot as plt
import networkx as nx
%matplotlib inline
```
Install `japanize_matplotlib` and then specify a Japanese font.

```python
# Module that makes matplotlib support Japanese display
!pip install japanize-matplotlib
import japanize_matplotlib
```

The setting `font_family = "IPAexGothic"` is the key point: by **specifying a Japanese font for font_family**, the node labels are displayed correctly in Japanese.
```python
# Generate a graph object
G = nx.Graph()

# Read the weighted data
G.add_weighted_edges_from(weighted_edges)

# Draw the network diagram
plt.figure(figsize=(10, 10))
nx.draw_networkx(G,
                 node_shape="s",
                 node_color="c",
                 node_size=500,
                 edge_color="gray",
                 font_family="IPAexGothic")  # font specification
plt.show()
```
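One caveat: by default `draw_networkx()` computes a spring layout from a random initial state, so the diagram changes from run to run. If you want a reproducible picture, you can compute the layout yourself with a fixed seed (assuming a NetworkX version whose `nx.spring_layout()` accepts the `seed` parameter):

```python
# Fix the layout so repeated runs draw the same diagram
pos = nx.spring_layout(G, seed=42)
nx.draw_networkx(G, pos,
                 node_shape="s",
                 node_color="c",
                 node_size=500,
                 edge_color="gray",
                 font_family="IPAexGothic")
plt.show()
```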