Let's do some easy text mining with Python 3.
**This time I'll do as much of the work as possible on the Linux terminal so that even people who have never used Python can follow along, and every command you need to type is written out, so don't worry!** (Even if you know nothing about Python...)
Text mining is data mining applied to text strings. It is a method of text data analysis that extracts useful information by splitting data consisting of ordinary sentences into words and phrases and analyzing their frequency of appearance, co-occurrence correlations, appearance tendencies, and time series. Source: [Wikipedia](https://ja.m.wikipedia.org/wiki/%E3%83%86%E3%82%AD%E3%82%B9%E3%83%88%E3%83%9E%E3%82%A4%E3%83%8B%E3%83%B3%E3%82%B0)
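To illustrate the basic idea (split text into words, then count how often each appears), here is a minimal pure-Python sketch. The sentence and counts are invented for the example; for Japanese text, the words would come from a morphological analyzer such as MeCab, which we set up below.

```python
from collections import Counter

# A toy "corpus". In real text mining the tokens would come from a
# tokenizer (for Japanese, a morphological analyzer such as MeCab).
text = "the cat sat on the mat and the cat slept"
words = text.split()

# Count how often each word appears
freq = Counter(words)
print(freq.most_common(2))  # → [('the', 3), ('cat', 2)]
```

A word cloud is essentially a picture of these frequencies: the more often a word appears, the bigger it is drawn.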
This time, let's use text mining techniques to create a *word cloud*! This is what a word cloud looks like. ↓
First, prepare the data to be analyzed. Since it is hard to prepare data on short notice, this time I will use tweet data from the **online event** "Idolmaster Shiny Colors MUSIC DAWN DAY 1" held on October 31st.
Click here to download the [text data #Shanimas MUSICDAWNday2](https://www.github.com/ysok2135/py/tree/main/%E5%BD%A2%E6%85%8B%E7%B4%A0%E8%A7%A3%E6%9E%90%E5%85%83%E3%83%86%E3%82%99%E3%83%BC%E3%82%BF_SC_DOWN_20201031_utf8.csv)
sudo apt install python3.7
Unlike English, Japanese does not separate words with spaces, so you cannot start text mining right away. Therefore, this time we will use **MeCab**, a well-known open-source morphological analysis engine.
Type in the following commands in order.
sudo apt install mecab
sudo apt install libmecab-dev
sudo apt install mecab-ipadic
sudo apt install mecab-ipadic-utf8
pip install mecab-python3
If you want to improve analysis accuracy, you should also install an additional dictionary such as NEologd, but we will skip that this time to keep things simple.
Many sites do this step in Python, but I think this way is much easier. First, save the text to be analyzed as "test.txt". Then enter the following in the terminal:
mecab -Owakati test.txt -o sample.txt
**That's all!** Checking the output file, the text has been segmented properly.
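If you are curious what the `-Owakati` output looks like from Python's side, here is a small sketch. The file name `sample_demo.txt` and its contents are invented stand-ins for the real `sample.txt`; `-Owakati` simply writes the tokens separated by spaces, so counting words becomes trivial:

```python
from collections import Counter

# Mock of what `mecab -Owakati` produces: tokens separated by spaces.
# (The actual contents of sample.txt will of course differ.)
wakati_output = "今日 の ライブ は 最高 でし た 最高\n"
with open("sample_demo.txt", "w", encoding="utf-8") as f:
    f.write(wakati_output)

# Read it back and count word frequencies, just like the word cloud will
with open("sample_demo.txt", encoding="utf-8") as f:
    tokens = f.read().split()

print(Counter(tokens).most_common(1))  # → [('最高', 2)]
```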
pip install wordcloud
That's all.
Copy and paste the sample code below.
sample.py
from wordcloud import WordCloud
with open('sample.txt') as f:
    text = f.read()
stop_words = [ u'https', u'co', u'Thank you', u'RT', u'Ah', u'']
wc = WordCloud(background_color="white",width=1600, height=1200, font_path='GenEiLateGoP_v2.ttf', stopwords=set(stop_words))
wc.generate(text)
wc.to_file('wc1.png')
**① Import WordCloud and read the file**
from wordcloud import WordCloud
with open('sample.txt') as f:
    text = f.read()
**② Various settings**
- stop_words … keywords to exclude. **It is recommended to run the script a few times and adjust these keywords.**
- background_color … background color
- width, height … size of the output image (in pixels)
- font_path … path to the font file (this time I am using GenEiLateGoP_v2.ttf)

↑ **[Super important! If you don't load a Japanese-capable font, you will get "tofu" (empty boxes)!!!]**
stop_words = [ u'https', u'co', u'Thank you', u'RT', u'Ah', u'']
wc = WordCloud(background_color="white",width=1600, height=1200, font_path='GenEiLateGoP_v2.ttf', stopwords=set(stop_words))
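The effect of `stopwords` can be checked in plain Python before rendering, which is handy when you are iterating on the keyword list. The token list below is an invented example of what might come out of the tweet data:

```python
# Hypothetical tokens as they might appear in sample.txt
tokens = ["https", "co", "RT", "ライブ", "最高", "ライブ"]

# Same exclusion list idea as the stopwords= argument to WordCloud
stop_words = {"https", "co", "RT"}

# WordCloud drops stopwords internally; this shows the same filtering idea
kept = [t for t in tokens if t not in stop_words]
print(kept)  # → ['ライブ', '最高', 'ライブ']
```

Tweet exports are full of URL fragments ("https", "co") and "RT", which is why they are excluded here.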
**③ Generate the word cloud and save it**
wc.generate(text)
wc.to_file('wc1.png')
python3 sample.py
Execution result
Great!!! Tsuda-san's presence is overwhelming! (lol)
You might also try this with works from Aozora Bunko as the source text. I hope this has sparked your interest in sentiment analysis and other techniques as well. Thank you for reading to the end.
**Verification environment**
Ubuntu 18.04 LTS
Python 3.7