Reading 2ch threads post by post takes time, so I tried visualizing thread information with WordCloud to grasp the whole picture at a glance. In the previous Scraping Edition, I extracted the post contents of the target group of threads. This time, as the second part, I run morphological analysis on the posts collected last time and render them as a WordCloud.
As before, I use Google Colaboratory, a browser-based Python execution environment that anyone with a Google account can use. MeCab needs to be installed separately (described later), but WordCloud is included in Google Colaboratory by default, so no installation is required.
# Library import
import requests, bs4
import re
import time
import pandas as pd
from urllib.parse import urljoin

# Install fonts locally in Colab
from google.colab import drive
drive.mount("/content/gdrive")
# Beforehand, create a folder called "font" at the top of My Drive in your Google Drive and put the desired font file in it.
# Copy the folder to Colab's local font directory
!cp -a "gdrive/My Drive/font/" "/usr/share/fonts/"

# ------------------------------------------------------------------------
# Preparation
log_database = []  # List that stores the thread information
base_url = "https://www.logsoku.com/search?q=FFRK&p="

# Web scraping
for i in range(1, 4):  # How many result pages to crawl (here, tentatively pages 1-3)
    logs_url = base_url + str(i)

    # Scraping body
    res = requests.get(logs_url)
    soup = bs4.BeautifulSoup(res.text, "html.parser")

    # Stop when no search results are found
    if soup.find(class_="search_not_found"): break

    # Get the table / rows where the thread information is stored
    thread_table = soup.find(id="search_result_threads")
    thread_rows = thread_table.find_all("tr")

    # Process each row
    for thread_row in thread_rows:
        tmp_dict = {}
        tags = thread_row.find_all(class_=["thread", "date", "length"])

        # Organize the contents
        for tag in tags:
            if "thread" in str(tag):
                tmp_dict["title"] = tag.get("title")
                tmp_dict["link"] = tag.get("href")
            elif "date" in str(tag):
                tmp_dict["date"] = tag.text
            elif "length" in str(tag):
                tmp_dict["length"] = tag.text

        # Add to the database only threads with more than 50 posts
        if tmp_dict["length"].isdecimal() and int(tmp_dict["length"]) > 50:
            log_database.append(tmp_dict)

    time.sleep(1)

# Convert to DataFrame
thread_df = pd.DataFrame(log_database)
# ------------------------------------------------------------------------
# Get the posts from the thread archives
log_url_base = "http://nozomi.2ch.sc/test/read.cgi/"
res_database = []

for thread in log_database:
    # Extract the board name and thread number from the archive list and build the archive URL
    board_and_code_match = re.search("[a-zA-Z0-9_]*?/[0-9]*?/$", thread["link"])
    board_and_code = board_and_code_match.group()
    thread_url = urljoin(log_url_base, board_and_code)

    # Extract the HTML of the archive page
    res = requests.get(thread_url)
    soup = bs4.BeautifulSoup(res.text, "html5lib")

    tmp_dict = {}
    # Information such as the date is in the dt tags;
    # the post body is stored in the dd tags
    dddt = soup.find_all(["dd", "dt"])

    for tag in dddt[::-1]:  # Iterate from the end
        # Extract only the date from the dt tag
        if "<dt>" in str(tag):
            date_result = re.search(r"\d*/\d*/\d*", tag.text)
            if date_result:
                date_str = date_result.group()
                tmp_dict["date"] = date_str

        # Extract the post body from the dd tag
        if "<dd>" in str(tag):
            tmp_dict["comment"] = re.sub("\n", "", tag.text)

        # Once tmp_dict holds both pieces, push it to res_database
        if "date" in tmp_dict and "comment" in tmp_dict:
            tmp_dict["thread_title"] = thread["title"]
            res_database.append(tmp_dict)
            tmp_dict = {}

    time.sleep(1)  # Scraping etiquette

# Convert to DataFrame
res_df = pd.DataFrame(res_database)
# ------------------------------------------------------------------------
# Install the morphological analysis library MeCab and the dictionary (mecab-ipadic-NEologd)
!apt-get -q -y install sudo file mecab libmecab-dev mecab-ipadic-utf8 git curl python-mecab > /dev/null
!git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git > /dev/null
!echo yes | mecab-ipadic-neologd/bin/install-mecab-ipadic-neologd -n > /dev/null 2>&1
!pip install mecab-python3 > /dev/null
# Avoid an error with a symbolic link
!ln -s /etc/mecabrc /usr/local/etc/mecabrc

# Split the posts into groups of n (=10000) and join each group with commas
# The split is needed because MeCab cannot handle too many characters at once
sentences_sep = []
n = 10000
for i in range(0, len(res_df["comment"]), n):
    sentences_sep.append(",".join(res_df["comment"][i: i + n]))
# ------------------------------------------------------------------------
import MeCab

# Specify the path where the mecab-ipadic-NEologd dictionary is stored
path = "-d /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd"
# The above path (/usr/...) can be obtained with the following command:
# !echo `mecab-config --dicdir`"/mecab-ipadic-neologd"

# Create a Tagger object
mecab = MeCab.Tagger(path)

# Perform morphological analysis on each chunk
chasen_list = [mecab.parse(sentence) for sentence in sentences_sep]

word_list = []

# Decompose chasen_list line by line
# ex. 鉄巨人	名詞,固有名詞,一般,*,*,*,鉄巨人,テツキョジン,テツキョジン ("Iron Giant", a proper noun)
# MeCab labels parts of speech in Japanese: 名詞 = noun, 非自立 = non-independent, 代名詞 = pronoun, 数 = number
for chasen in chasen_list:
    for line in chasen.splitlines():
        if len(line) <= 1: break
        speech = line.split()[-1]
        if "名詞" in speech:
            if (not "非自立" in speech) and (not "代名詞" in speech) and (not "数" in speech):
                word_list.append(line.split()[0])

word_line = ",".join(word_list)
# ------------------------------------------------------------------------
from wordcloud import WordCloud
import matplotlib.pyplot as plt

f_path = "BIZ-UDGothicB.ttc"  # Must have been copied to Colab's local font folder
stop_words = ["https","imgur","net","jpg","com","so"]

# Create an instance (set the parameters)
wordcloud = WordCloud(
    font_path=f_path,                    # Font
    width=1024, height=640,              # Size of the generated image
    background_color="white",            # Background color
    stopwords=set(stop_words),           # Words deliberately excluded from the display
    max_words=350,                       # Maximum number of words
    max_font_size=200, min_font_size=5,  # Font size range
    collocations=False                   # Whether to display collocations (two-word pairs)
)

# Generate the image
output_img = wordcloud.generate(word_line)

# Display
plt.figure(figsize=(18,15))  # figsize sets the displayed size
plt.imshow(output_img)
plt.axis("off")  # Hide the axes
plt.show()
Morphological analysis is the process of breaking a natural-language sentence down into words (more precisely, into units called morphemes, which can be finer than words). Unlike English, Japanese does not put spaces between words, so **morphological analysis is needed to separate them**. Several tools exist for this; this time we use MeCab, which offers high processing speed and high accuracy.
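As a minimal illustration of what MeCab does (a sketch assuming MeCab and a default dictionary are already installed, which the next section covers), parsing a sentence returns one morpheme per line:

import MeCab

# Minimal sketch: parse the classic demo sentence "すもももももももものうち"
# (roughly "plums and peaches are both kinds of peach"); the output has one morpheme per line
tagger = MeCab.Tagger()
print(tagger.parse("すもももももももものうち"))
# -> すもも / も / もも / も / もも / の / うち, each with its part-of-speech information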
MeCab is not included in Google Colaboratory by default, so install it by running the following in each session:
# Install MeCab
!apt-get -q -y install sudo file mecab libmecab-dev mecab-ipadic-utf8 git curl python-mecab > /dev/null
!pip install mecab-python3 > /dev/null
# Avoid an error with a symbolic link
!ln -s /etc/mecabrc /usr/local/etc/mecabrc
MeCab's default dictionary, mecab-ipadic, is not very accurate for new words, so I recommend specifying the dictionary **mecab-ipadic-NEologd** instead. mecab-ipadic-NEologd is a system dictionary usable with MeCab that is **updated frequently, which makes it strong on new words**. For example, when the keyword "Aeris" (エアリス) is analyzed with the default dictionary, it is split into the morphemes "air" (エア) and "squirrel" (リス), whereas mecab-ipadic-NEologd correctly treats "Aeris" as a single word. In a place like 2ch, where new words are thrown around freely, mecab-ipadic-NEologd should improve the accuracy of the analysis. Install it as follows:
# Install the dictionary (mecab-ipadic-NEologd)
!git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git > /dev/null
!echo yes | mecab-ipadic-neologd/bin/install-mecab-ipadic-neologd -n > /dev/null 2>&1
Since the path where the mecab-ipadic-NEologd dictionary is stored must be specified when invoking MeCab later, define it now. The path itself (/usr/...) can be obtained with the `!echo` command below:
path = "-d /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd"
!echo `mecab-config --dicdir`"/mecab-ipadic-neologd"
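As an optional sanity check (my own addition, not in the original article), you can confirm the directory exists before creating the Tagger:

import os

# Hypothetical check: make sure the NEologd dictionary directory actually exists
dic_dir = "/usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd"
print(os.path.isdir(dic_dir))  # Expect True after a successful install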
There are two points to note when passing text data to MeCab:
・MeCab takes a single piece of str data (here, the posts joined with commas).
・If that str holds **too much data, the analysis fails, so the data must be split into chunks before being passed to MeCab**.
Therefore, the post contents scraped last time (the DataFrame res_df) are joined into one str per 10,000 posts, and the chunks are appended to a list one after another.
# Split the posts into groups of n (=10000) and join each group with commas
# The split is needed because MeCab cannot handle too many characters at once
sentences_sep = []
n = 10000
for i in range(0, len(res_df["comment"]), n):
    sentences_sep.append(",".join(res_df["comment"][i: i + n]))
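For reference, a quick check of how the split came out (a hypothetical line, not in the original):

# Hypothetical check: how many chunks did the split produce?
print(f"{len(res_df)} posts -> {len(sentences_sep)} chunk(s) of up to {n} posts each")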
MeCab is used in a two-step flow: (1) create a MeCab.Tagger instance, then (2) pass the target text to the instance with parse() to analyze it. Analysis options are specified when creating the instance in step (1); since I want to use the mecab-ipadic-NEologd system dictionary mentioned above, I pass in the path obtained earlier. In step (2), the analysis result is obtained with tagger.parse(str). Since the posts were split into a list, I process them with a Python list comprehension.
import MeCab
# Create the Tagger instance (with the NEologd dictionary path)
mecab = MeCab.Tagger(path)
# Perform morphological analysis on each chunk
chasen_list = [mecab.parse(sentence) for sentence in sentences_sep]
The output is a single str in which each morpheme occupies one line, with the surface form separated from its comma-delimited part-of-speech information by a tab, as below.
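A representative line, reconstructed from the example in the original article (the exact fields depend on the dictionary; 鉄巨人 is "Iron Giant", an FFRK-related term, tagged 名詞/固有名詞 = noun/proper noun), looks like this:

鉄巨人	名詞,固有名詞,一般,*,*,*,鉄巨人,テツキョジン,テツキョジン
EOS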
Among the extracted words, those that carry no meaning on their own, such as particles, auxiliary verbs, and adverbs, become noise, so they are excluded. This time I simply kept only nouns; even among the nouns, however, non-independent nouns, pronouns, and numbers are excluded because they tend to produce noise.
The processing: the str output by MeCab is split line by line with .splitlines(), each line is further split into the word and its part-of-speech information with .split(), and when the part-of-speech information matches the conditions, the word part is appended to word_list.
# Remove noise (unneeded parts of speech)
word_list = []
for chasen in chasen_list:
    for line in chasen.splitlines():
        if len(line) <= 1: break
        speech = line.split()[-1]  # Extract the part-of-speech information
        # MeCab labels parts of speech in Japanese:
        # 名詞 = noun, 非自立 = non-independent, 代名詞 = pronoun, 数 = number
        if "名詞" in speech:
            if (not "非自立" in speech) and (not "代名詞" in speech) and (not "数" in speech):
                word_list.append(line.split()[0])
The `if len(line) <= 1: break` in the middle is a workaround for an error (probably caused by the EOS line at the end of each parse result). Finally, the list is concatenated into a single str.
#Word concatenation
word_line = ",".join(word_list)
When using WordCloud with Japanese text, a Japanese-capable font must be specified. Locally, you would just specify the path of the desired font; on Google Colaboratory it is a little more work. First, **place the desired font file on your Google Drive in advance** (only TrueType fonts are supported). The location is arbitrary; following the article I referenced, I created a "font" folder at the top of My Drive and put the file there. Then mount Google Drive on Colaboratory:
#Install fonts locally in Colab
from google.colab import drive
drive.mount("/content/gdrive")
When you run the above, a link for mounting Google Drive is displayed. Click it, select your account, click Allow, and enter the displayed code in Google Colaboratory to complete the mount.
Then copy the font file to Colaboratory's local font folder with a shell command:
!cp -a "gdrive/My Drive/font/" "/usr/share/fonts/"
Import the WordCloud library and create an instance with WordCloud(). Various output parameters can be set by passing arguments to this call.
from wordcloud import WordCloud

f_path = "BIZ-UDGothicB.ttc"  # Must have been copied to Colab's local font folder
stop_words = ["https","imgur","net","jpg","com","so"]

# Create an instance (set the parameters)
wordcloud = WordCloud(
    font_path=f_path,                    # Font
    width=1024, height=640,              # Size of the generated image
    background_color="white",            # Background color
    stopwords=set(stop_words),           # Words deliberately excluded from the display
    max_words=350,                       # Maximum number of words
    max_font_size=200, min_font_size=5,  # Font size range
    collocations=False                   # Whether to display collocations (two-word pairs)
)
The contents of each parameter are as follows.
Parameters | Description | Set value |
---|---|---|
font_path | Font | The font path defined above (f_path) |
colormap | Color set for the words (a matplotlib colormap name) | Not set (default: viridis) |
width | Width of the generated image | 1024 |
height | Height of the generated image | 640 |
background_color | Background color | white |
stopwords | Words deliberately excluded from the display (set) | ["https","imgur","net","jpg","com","so"] |
max_words | Maximum number of words to display | 350 |
max_font_size | Font size of the most frequent word | 200 |
min_font_size | Font size of the smallest word | 5 |
collocations | Whether to display collocations (two-word pairs) | False |
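Incidentally, colormap accepts any matplotlib colormap name if you want colored words; a minimal, hypothetical variation (not in the original article):

# Hypothetical variation: color the words with the "tab10" matplotlib colormap
wordcloud_tab10 = WordCloud(font_path=f_path, colormap="tab10").generate(word_line)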
For parameters not covered above, see the last article in the reference list at the end.
Generate the figure from the target string with the .generate(concatenated words: str type) method of the wordcloud instance created above.
#Generate a WordCloud image by giving a string
output_img = wordcloud.generate(word_line)
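Finally, display the generated image with matplotlib; the WordCloud object can be passed directly to plt.imshow: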
import matplotlib.pyplot as plt
plt.figure(figsize=(18,15))  # figsize sets the displayed size
plt.imshow(output_img)
plt.axis("off")  # Hide the axes
plt.show()
The image displayed without a hitch.
For now I managed to visualize the data, but the result feels blurry. One reason, I suspect, is that the "time axis" and the "correlations between words" have been lost. So, when I have time, I would like to try ・a correlation display along a time axis (a graph) ・playing with a co-occurrence network. ~~I'm tired of writing this long article, so~~ whether I will write that article is undecided.
References:
- Summary of how to use Google Colab
- Install MeCab and ipadic-NEologd on Google Colab
- How to put your favorite font in Google Colaboratory and use it with matplotlib
- I made a Word Cloud with Python ← explains the wordcloud parameters not covered this time