**mecab-ipadic-NEologd** is a customized dictionary that supplements the standard MeCab dictionary.
Words collected from many language resources on the Web have been added, so it **supports new words, compound words, idiomatic expressions, and so on**.
As mentioned above, the standard MeCab dictionary splits "individualism" into "individual" and "-ism", whereas mecab-ipadic-NEologd treats it as the single word "individualism".
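As a quick way to see the difference, one can parse the same word with both dictionaries once the installation steps below have been run. This is only a minimal sketch; it uses 感染症 ("infectious disease"), one of the examples from the comparison table at the end, and the same dictionary path that is used later in this article.
import MeCab
tagger_std = MeCab.Tagger()  #standard ipadic dictionary
tagger_neo = MeCab.Tagger("-d /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd")  #NEologd
print(tagger_std.parse("感染症"))  #split into 感染 / 症
print(tagger_neo.parse("感染症"))  #kept as the single word 感染症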
from google.colab import files
uploaded = files.upload()
with open('20200926_suga_un.txt', mode='rt', encoding='utf-8') as f:
    read_text = f.read()
sugatxt = read_text
#Delete unnecessary characters / symbols
def clean(text):
    text = text.replace("\n", "")
    text = text.replace("\u3000", "")
    text = text.replace("「", "")
    text = text.replace("」", "")
    text = text.replace("(", "")
    text = text.replace(")", "")
    text = text.replace("、", "")
    return text
text = clean(sugatxt)
#Split into sentence units at "。"
lines = text.split("。")
# MeCab
!apt-get -q -y install sudo file mecab libmecab-dev mecab-ipadic-utf8 git curl python-mecab > /dev/null
!pip install mecab-python3 > /dev/null
# mecab-ipadic-NEologd
!git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git > /dev/null
!echo yes | mecab-ipadic-neologd/bin/install-mecab-ipadic-neologd -n > /dev/null 2>&1
#Error avoidance by symbolic links
!ln -s /etc/mecabrc /usr/local/etc/mecabrc
This symbolic link is needed so that the `path` specified in the next step works properly.
#Check the dictionary path
!echo `mecab-config --dicdir`"/mecab-ipadic-neologd"
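Instead of copying the printed path by hand, the same value could also be captured programmatically. This is only a sketch, assuming mecab-config is on PATH (the install step above provides it):
import subprocess
dicdir = subprocess.run(["mecab-config", "--dicdir"],
                        capture_output=True, text=True).stdout.strip()
neologd_path = "-d " + dicdir + "/mecab-ipadic-neologd"  #same option string as the hard-coded path below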
Create the MeCab.Tagger instance `m_neo`, passing `path` (= the mecab-ipadic-NEologd dictionary) as its option.
import MeCab
path = "-d /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd"
m_neo = MeCab.Tagger(path)
stopwords = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "0",
             "１", "２", "３", "４", "５", "６", "７", "８", "９", "０",
             "一", "二", "三", "四", "五", "六", "七", "八", "九", "〇",
             "年", "月", "日", "次", "割", "回", "的", "病", "以上", "以下", "周年", "件", "毎",
             "の", "もの", "こと", "よう", "様", "ため", "がち", "これ", "それ", "あれ", "誰",
             "*", ",", "、"]
Create `noun_list` as follows. For `lines`, the text divided into sentence units, repeat the process below sentence by sentence and append the result to `noun_list`. Parse each sentence with `m_neo.parse()` and call the output `v1`, then split `v1` into per-word lines with `splitlines()` to get `v2`. Split each element of `v2` with `split("\t")`, i.e. on the tab character, into two parts, the original word and the analysis part, giving `v3`. Except for "EOS" and empty lines, split the analysis part `v3[1]` with `split(',')`, i.e. on commas, and call the result `v4`. If `v4[0]` is a noun (名詞) and the base form `v4[6]` is not a stopword, append it to `result`. Finally, append `result` to `noun_list` and move on to the next sentence.
noun_list = []
for line in lines:
    result = []
    v1 = m_neo.parse(line)
    v2 = v1.splitlines()
    for v in v2:
        v3 = v.split("\t")
        if len(v3) == 2:
            v4 = v3[1].split(',')
            if (v4[0] == "名詞") and (v4[6] not in stopwords):
                result.append(v4[6])
    noun_list.append(result)
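To see the structure this loop relies on, it can help to print the raw MeCab output for a single sentence. The sample sentence below is only an illustration, not taken from the speech text:
#Inspect the tab/comma structure of one parsed sentence
for row in m_neo.parse("国際社会と連携して感染症対策を進めます。").splitlines():
    cols = row.split("\t")              #surface form and analysis part
    if len(cols) == 2:                  #skip "EOS" and empty lines
        features = cols[1].split(',')
        print(cols[0], features[0], features[6])  #surface form, part of speech, base form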
From `noun_list`, for each per-sentence word list that contains two or more words (`if len(n) >= 2`), generate the pairs (combinations of two words) with `itertools.combinations()`, turn them into a list with `list()`, and store them in `pair_list`. Since `pair_list` is organized in sentence units, flatten it into `all_pairs` and then count the number of occurrences of each pair.
import itertools #A module that collects iterator functions
from collections import Counter #A dict subclass for counting occurrences
#Generate a sentence-based noun pair list
pair_list = []
for n in noun_list:
    if len(n) >= 2:
        lt = list(itertools.combinations(n, 2))
        pair_list.append(lt)
#Flatten the noun pair list
all_pairs = []
for p in pair_list:
    all_pairs.extend(p)
#Count the frequency of noun pairs
cnt_pairs = Counter(all_pairs)
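A toy example of what these two steps do: `itertools.combinations()` generates every two-word pair from one sentence's noun list, and `Counter` then tallies identical (word, word) tuples as a single key. Note that `combinations()` keeps the input order, so a reversed pair such as ("B", "A") would be counted separately.
#Toy illustration with dummy words
print(list(itertools.combinations(["A", "B", "C"], 2)))  # [('A', 'B'), ('A', 'C'), ('B', 'C')]
print(Counter([("A", "B"), ("A", "B"), ("B", "C")]))     # Counter({('A', 'B'): 2, ('B', 'C'): 1})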
Sort with `sorted()` and take the top 30 pairs with `[:30]`. Each element of `cnt_pairs` is a (pair, count) tuple, so `key=lambda x: x[1]` sorts on the number of occurrences, and `reverse=True` puts them in descending order. The resulting `dict` is then converted into a two-dimensional array `data` for drawing.
import pandas as pd
import numpy as np
#Extract the top 30 pairs from the dictionary
dict = sorted(cnt_pairs.items(), key=lambda x:x[1], reverse=True)[:30]
#Convert dict type to 2D array
result = []
for key, value in dict:
    temp = []
    for k in key:
        temp.append(k)
    temp.append(value)
    result.append(temp)
data = np.array(result)
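One thing to note: because `result` mixes strings and integers, `np.array()` converts every element to a string, so the edge weights stored by `add_weighted_edges_from()` below are strings as well. This does not affect the drawing, which ignores weights by default, but if numeric weights are wanted later, a plain list of tuples could be used instead. A minimal alternative sketch:
#Optional: keep the pair counts as integers instead of strings
data_numeric = [(pair[0], pair[1], count) for pair, count in dict]
#G.add_weighted_edges_from(data_numeric) would then store integer weights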
import matplotlib.pyplot as plt
import networkx as nx
%matplotlib inline
#Module to make matplotlib support Japanese display
!pip install japanize-matplotlib
import japanize_matplotlib
Specify `font_family = "IPAexGothic"` in `nx.draw_networkx()` so that the Japanese node labels are displayed correctly.
#Generating a graph object
G = nx.Graph()
#Data reading
G.add_weighted_edges_from(data)
#Drawing a graph
plt.figure(figsize=(10,10))
nx.draw_networkx(G,
                 node_shape = "s",
                 node_color = "chartreuse",
                 node_size = 800,
                 edge_color = "gray",
                 font_family = "IPAexGothic") #Japanese font specification
plt.show()
mecab-ipadic-NEologd | MeCab standard
---|---
「感染症」(infectious disease) | 「感染」,「症」
「途上国」(developing country) | 「途上」,「国」
「東南アジア諸国連合」(ASEAN) | 「東南アジア」,「諸国」,「連合」
「人間の安全保障」(human security) | 「人間」,「の」,「安全」,「保障」