Visualize 2ch threads with WordCloud (Morphological Analysis / WordCloud Edition)

Introduction

Reading every res (post) in a 2ch thread takes time, so I tried visualizing thread contents with WordCloud to grasp the overall picture at a glance. In the previous Scraping Edition, we extracted the res of the target group of threads. This time, as the second part, we run morphological analysis on the res collected last time and render the result with WordCloud.

Overall flow

  1. Extract the URLs of the target threads by scraping "logsoku" (ログ速, a 2ch archive site)
  2. Scrape each 2ch thread to extract its res
  3. Morphological analysis of the extracted res with MeCab ← explained this time
  4. Output with WordCloud ← explained this time

Environment

As before, we use Google Colaboratory. Google Colaboratory is a Python execution environment that runs in the browser; anyone with a Google account can use it. MeCab needs an additional installation step (described later), but WordCloud is included in Google Colaboratory by default, so no installation is required.

Full code

Click to view the full code (including the scraping part)
#Library import
import requests, bs4
import re
import time
import pandas as pd
from urllib.parse import urljoin

#Install fonts locally in Colab
from google.colab import drive
drive.mount("/content/gdrive")
#Create a folder called font at the top of My Drive in your Google Drive in advance, and put the desired font file in it.
#Copy the folder's contents into Colab's local font directory
!cp -a "gdrive/My Drive/font/" "/usr/share/fonts/"

# ------------------------------------------------------------------------
#Preparation
log_database = []  #A list that stores thread information
base_url = "https://www.logsoku.com/search?q=FFRK&p="

#Implementation of web scraping
for i in range(1,4):  #How many search pages to crawl (range(1,4) covers pages 1-3)
  logs_url = base_url+str(i)

  #Main scraping logic
  res = requests.get(logs_url)
  soup = bs4.BeautifulSoup(res.text, "html.parser")

  #Stop when a page reports no search results
  if soup.find(class_="search_not_found"):break

  #Get table / row where thread information is stored
  thread_table = soup.find(id="search_result_threads")
  thread_rows = thread_table.find_all("tr")

  #Processing for each row
  for thread_row in thread_rows:
    tmp_dict = {}
    tags = thread_row.find_all(class_=["thread","date","length"])

    #Organize the contents
    for tag in tags:
      if "thread" in str(tag):
        tmp_dict["title"] = tag.get("title")
        tmp_dict["link"] = tag.get("href")
      elif "date" in str(tag):
        tmp_dict["date"] = tag.text
      elif "length" in str(tag):
        tmp_dict["length"] = tag.text

    #Add to the database only threads with more than 50 res
    if tmp_dict["length"].isdecimal() and int(tmp_dict["length"]) > 50:
      log_database.append(tmp_dict)

  time.sleep(1)

#Convert to DataFrame
thread_df = pd.DataFrame(log_database)

# ------------------------------------------------------------------------
#Get the res from the archived threads
log_url_base = "http://nozomi.2ch.sc/test/read.cgi/"
res_database = []

for thread in log_database:
  #Extract the board name and thread number from the archive link and build the thread's archive URL
  board_and_code_match = re.search("[a-zA-Z0-9_]*?/[0-9]*?/$",thread["link"])
  board_and_code = board_and_code_match.group()
  thread_url = urljoin(log_url_base, board_and_code)

  #Extract the HTML of the archive page
  res = requests.get(thread_url)
  soup = bs4.BeautifulSoup(res.text, "html5lib")

  tmp_dict = {}
  #Information such as date in the dt tag
  #The comment is stored in the dd tag
  dddt = soup.find_all(["dd","dt"])

  for tag in dddt[::-1]:  #Iterate from the end

    #Extract only the date from the dt tag
    if "<dt>" in str(tag):
      date_result = re.search(r"\d*/\d*/\d*",tag.text)  #Extract a date like yyyy/mm/dd
      if date_result:
        date_str = date_result.group()
        tmp_dict["date"] = date_str

    #Extract the res content from the dd tag
    if "<dd>" in str(tag):
      tmp_dict["comment"] = re.sub("\n","",tag.text)

    #Append the contents stored in tmp_dict to res_database
    if "date" in tmp_dict and "comment" in tmp_dict:
      tmp_dict["thread_title"] = thread["title"]
      res_database.append(tmp_dict)
      tmp_dict = {}

  time.sleep(1)  #etiquette: pause between requests

#Convert to DataFrame
res_df = pd.DataFrame(res_database)

# ------------------------------------------------------------------------

#Install the morphological analysis library MeCab and the dictionary (mecab-ipadic-NEologd)
!apt-get -q -y install sudo file mecab libmecab-dev mecab-ipadic-utf8 git curl python-mecab > /dev/null
!git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git > /dev/null 
!echo yes | mecab-ipadic-neologd/bin/install-mecab-ipadic-neologd -n > /dev/null 2>&1
!pip install mecab-python3 > /dev/null

#Create a symbolic link to avoid a mecabrc-not-found error
!ln -s /etc/mecabrc /usr/local/etc/mecabrc


#Join the res into groups of n (=10000), separated by commas
#We split because MeCab cannot handle an input string that is too long
sentences_sep = []
n = 10000
for i in range(0, len(res_df["comment"]), n):
  sentences_sep.append(",".join(res_df["comment"][i: i + n]))

# ------------------------------------------------------------------------
import MeCab

#Specify the path where the mecab-ipadic-neologd dictionary is stored
path = "-d /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd"
#The path above (/usr/...) can be obtained with the following command
# !echo `mecab-config --dicdir`"/mecab-ipadic-neologd"

#Creating a Tagger object
mecab = MeCab.Tagger(path)

#Perform morphological analysis for each separated group
chasen_list = [mecab.parse(sentence) for sentence in sentences_sep]

word_list = []

#Break each entry of chasen_list into single lines
#ex. 鉄巨人 名詞,固有名詞,一般,*,*,*,鉄巨人,テツキョジン,テツキョジン (iron giant: surface form, then comma-separated features)
for chasen in chasen_list:
  for line in chasen.splitlines():
    
    if len(line) <= 1: break

    speech = line.split()[-1]
    if "noun" in speech:
      if  (not "Non-independent" in speech) and (not "Pronoun" in speech) and (not "number" in speech):
        word_list.append(line.split()[0])

word_line = ",".join(word_list)

# ------------------------------------------------------------------------
from wordcloud import WordCloud
import matplotlib.pyplot as plt

f_path = "BIZ-UDGothicB.ttc"  #Must be copied to Colab's local fonts folder
stop_words = ["https","imgur","net","jpg","com","so"]

#Instance generation (parameter setting)
wordcloud = WordCloud(
    font_path=f_path, #Font specification
    width=1024, height=640,   #Specifying the size of the generated image
    background_color="white",   #Specifying the background color
    stopwords=set(stop_words),   #Words deliberately excluded from the display
    max_words=350,   #Maximum number of words
    max_font_size=200, min_font_size=5,   #Font size range
    collocations = False    #Do not group two-word collocations
    )

#Image generation
output_img = wordcloud.generate(word_line)

#Display
plt.figure(figsize=(18,15))  #Specify the size to be displayed with figsize
plt.imshow(output_img)
plt.axis("off")  #Hide the scale
plt.show()

Explanation

Morphological analysis with MeCab

Morphological analysis is the process of breaking a natural-language sentence down into words (more precisely, into morphemes, units finer than words). Unlike English, Japanese does not put spaces between words, so it is necessary to **perform morphological analysis to separate the words**; for example, 「今日はいい天気だ」 splits into 今日 / は / いい / 天気 / だ. There are several tools for morphological analysis, but this time we use "MeCab", which offers high processing speed and high accuracy.

Install MeCab

MeCab is not included in Google Colaboratory by default, so install it by running the following at the start of each session.

#Install MeCab
!apt-get -q -y install sudo file mecab libmecab-dev mecab-ipadic-utf8 git curl python-mecab > /dev/null
!pip install mecab-python3 > /dev/null

#Create a symbolic link to avoid a mecabrc-not-found error
!ln -s /etc/mecabrc /usr/local/etc/mecabrc
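To confirm the installation worked, a quick sanity check like the following can be run (my addition, not in the original article; すもももももももものうち is a classic MeCab test sentence):

#Sanity check (my addition): parse a test sentence with the default dictionary
import MeCab

tagger = MeCab.Tagger()
print(tagger.parse("すもももももももものうち"))  #prints one morpheme per line, ending with EOS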

Specifying the dictionary (mecab-ipadic-NEologd)

MeCab's default dictionary, "mecab-ipadic", is not very accurate on new words, so specifying the dictionary **"mecab-ipadic-NEologd"** instead is recommended. "mecab-ipadic-NEologd" is a system dictionary usable with MeCab that is **updated frequently, which makes it strong on new words**. For example, take the keyword "エアリス" (Aeris): with the default dictionary it is split into the two morphemes "エア" (air) and "リス" (squirrel), whereas "mecab-ipadic-NEologd" correctly recognizes "エアリス" as a single word. On a site like 2ch, where new words and slang are everywhere, using "mecab-ipadic-NEologd" should improve the accuracy of the analysis. The installation method is as follows.

#Install the dictionary (mecab-ipadic-NEologd)
!git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git > /dev/null 
!echo yes | mecab-ipadic-neologd/bin/install-mecab-ipadic-neologd -n > /dev/null 2>&1

Since we will need the path where the mecab-ipadic-neologd dictionary is stored when calling MeCab later, define it here.

path = "-d /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd"
If it doesn't work (click to expand)
The path above (/usr/...) should normally be fine, but if it isn't, get the dictionary path with the following command.
!echo `mecab-config --dicdir`"/mecab-ipadic-neologd"
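To see the difference concretely, here is a minimal comparison sketch (my addition; it assumes both the default dictionary and NEologd are installed, with NEologd at the path above):

#Compare the default dictionary with mecab-ipadic-NEologd (my addition)
import MeCab

default_tagger = MeCab.Tagger()
neologd_tagger = MeCab.Tagger("-d /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd")

print(default_tagger.parse("エアリス"))  #expected: split into エア ("air") and リス ("squirrel")
print(neologd_tagger.parse("エアリス"))  #expected: recognized as the single proper noun エアリス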

Data preprocessing for morphological analysis

There are two points to keep in mind when passing text data to MeCab:

  1. MeCab takes a single piece of str-type data (here, the res joined together with commas).
  2. If that str is too large, **the analysis fails, so the data must be split into chunks and passed to MeCab piece by piece**.

Therefore, the res scraped earlier (the DataFrame res_df) are joined into one str for every 10,000 res, and each chunk is appended to a list.

#Join the res into groups of n (=10000), separated by commas
#We split because MeCab cannot handle an input string that is too long
sentences_sep = []
n = 10000
for i in range(0, len(res_df["comment"]), n):
  sentences_sep.append(",".join(res_df["comment"][i: i + n]))

Performing morphological analysis

MeCab is used in two steps: (1) create a MeCab.Tagger instance, and (2) pass the target text to that instance's parse method. Analysis options are specified when creating the instance in step (1); since we want to use the mecab-ipadic-NEologd system dictionary mentioned above, we pass the dictionary path prepared earlier. In step (2), the analysis result is obtained with tagger.parse(<str>). Because the res were split into a list, each chunk is processed with a Python list comprehension.

import MeCab

#Instance generation
mecab = MeCab.Tagger(path)

#Perform morphological analysis for each separated group
chasen_list = [mecab.parse(sentence) for sentence in sentences_sep]

The output of each parse is a single str with one morpheme per line: the surface form, a tab, and then comma-separated part-of-speech features, terminated by an EOS line.
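An illustrative fragment (my reconstruction of the typical ipadic output format, not actual output from this dataset):

今日	名詞,副詞可能,*,*,*,*,今日,キョウ,キョウ
は	助詞,係助詞,*,*,*,*,は,ハ,ワ
いい	形容詞,自立,*,*,形容詞・イイ,基本形,いい,イイ,イイ
天気	名詞,一般,*,*,*,*,天気,テンキ,テンキ
だ	助動詞,*,*,*,特殊・ダ,基本形,だ,ダ,ダ
EOS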

Noise removal

Among the words that were cut out, those that carry no meaning on their own, such as particles, auxiliary verbs, and adverbs, become noise, so we exclude them. This time I simply kept only nouns; within the nouns, non-independent nouns, pronouns, and numerals are also excluded, since they tend to produce noise. The processing goes: decompose the MeCab output str line by line with .splitlines(), split each line into the word and its part-of-speech features with .split(), and append the word part to word_list whenever the features match the conditions.

#Removal of noise (unnecessary parts of speech)
word_list = []
for chasen in chasen_list:
  for line in chasen.splitlines():
    
    if len(line) <= 1: break

    speech = line.split()[-1]  #Extract the part-of-speech feature string
    if "名詞" in speech:  #keep nouns only ("名詞" = noun)
      if (not "非自立" in speech) and (not "代名詞" in speech) and (not "数" in speech):  #exclude non-independent nouns, pronouns, and numerals
        word_list.append(line.split()[0])

The `if len(line) <= 1: break` in the middle is a workaround for an error (probably caused by the EOS lines). Finally, the list is concatenated into a single str.

#Word concatenation
word_line = ",".join(word_list)

Output with WordCloud

(Preparation) Installing a Japanese font

To render Japanese with WordCloud, you must specify a font that supports Japanese. On a local machine you would simply specify the path of the desired font, but on Google Colaboratory it is a little more work. First, **copy the desired font file to your Google Drive in advance** (only TrueType fonts are supported). The location is arbitrary; following the article I referenced, I created a "font" folder at the top of My Drive and put the file there. Then mount Google Drive on Colaboratory.

#Install fonts locally in Colab
from google.colab import drive
drive.mount("/content/gdrive")

When you run the above, a link for mounting Google Drive is displayed. Click it, select an account, press Allow, and enter the displayed authorization code into Google Colaboratory to complete the mount.

Then copy the font folder into Colaboratory's local font directory with the following command.

!cp -a "gdrive/My Drive/font/" "/usr/share/fonts/"
If it doesn't work (click to expand)
At one point, for some reason, I got an error when mounting the drive and could not install the font that way. In that case, upload the font file directly into Google Colaboratory's local font folder. ![image.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/642673/74d79528-0f78-b5ce-8275-079f4e74f2ce.png)
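To check whether the copied font is actually visible, a small matplotlib query like the following can help (my addition; it assumes the file ended up somewhere under /usr/share/fonts):

#Optional check (my addition): list the font files matplotlib can find
from matplotlib import font_manager
for font_file in font_manager.findSystemFonts(fontpaths=["/usr/share/fonts"]):
    print(font_file)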

Running WordCloud

Import the WordCloud library and create an instance with WordCloud(). The various output parameters are set through the arguments of this call.

from wordcloud import WordCloud

f_path = "BIZ-UDGothicB.ttc"  #Must be copied to Colab's local fonts folder
stop_words = ["https","imgur","net","jpg","com","so"]

#Instance generation (parameter setting)
wordcloud = WordCloud(
    font_path=f_path, #Font specification
    width=1024, height=640,   #Specifying the size of the generated image
    background_color="white",   #Specifying the background color
    stopwords=set(stop_words),   #Words deliberately excluded from the display
    max_words=350,   #Maximum number of words
    max_font_size=200, min_font_size=5,   #Font size range
    collocations = False    #Do not group two-word collocations
    )

The contents of each parameter are as follows.

| Parameter | Description | Value set |
| --- | --- | --- |
| font_path | Font to use | the font path above (f_path) |
| colormap | Color map for the words (a matplotlib colormap name) | not set (default: viridis) |
| width | Width of the generated image | 1024 |
| height | Height of the generated image | 640 |
| background_color | Background color | white |
| stopwords | Words deliberately excluded from display (as a set) | ["https","imgur","net","jpg","com","so"] |
| max_words | Maximum number of words displayed | 350 |
| max_font_size | Font size of the most frequent word | 200 |
| min_font_size | Font size of the smallest word | 5 |
| collocations | Whether to show two-word collocations | False |

For parameters beyond these, see the reference articles at the end (the last one explains them). A figure is generated from the target string with the .generate(<concatenated words: str>) method of the wordcloud instance created above.

#Generate a WordCloud image by giving a string
output_img = wordcloud.generate(word_line)

Display with matplotlib

import matplotlib.pyplot as plt

plt.figure(figsize=(18,15))  #Specify the size to be displayed with figsize
plt.imshow(output_img)
plt.axis("off")  #Hide the scale
plt.show()
It displayed without problems.
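If you also want to keep the result as an image file (not covered in the original article), the to_file method of the wordcloud object writes a PNG directly:

#Save the generated image to a file (my addition)
output_img.to_file("wordcloud.png")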

Impressions / future work

For the time being I managed to visualize the threads, but the result feels blurry. I think one reason is that the "time axis" and the "correlations between words" are lost. So, when I have time, I would like to try a time-series view (graphs) and a co-occurrence network. ~~I'm tired of writing long articles, so~~ whether I'll write that up is still undecided.

Reference articles

- Summary of how to use Google Colab
- Install MeCab and ipadic-NEologd on Google Colab
- How to put your favorite font in Google Colaboratory and use it with matplotlib
- I made a Word Cloud with Python ← explains the WordCloud parameters not covered this time
