Let's do some easy text mining with Python 3.
**This time I'll do as much of the work as possible on the Linux terminal so that even people who have never used Python can follow along, and every command you need to type is written out, so don't worry!** (Even if you know nothing about Python...)
Text mining is data mining applied to text strings. It is a method of text data analysis that extracts useful information by splitting data consisting of ordinary sentences into words and phrases and analyzing their frequency of appearance, co-occurrence correlations, appearance tendencies, and time series. Source: [Wikipedia](https://ja.m.wikipedia.org/wiki/%E3%83%86%E3%82%AD%E3%82%B9%E3%83%88%E3%83%9E%E3%82%A4%E3%83%8B%E3%83%B3%E3%82%B0)
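To illustrate the basic idea (split text into words, then count how often each appears), here is a minimal pure-Python sketch. The sentence and counts are invented for the example; for Japanese text, the words would come from a morphological analyzer such as MeCab, which we set up below.

```python
from collections import Counter

# A toy "corpus". In real text mining the tokens would come from a
# tokenizer (for Japanese, a morphological analyzer such as MeCab).
text = "the cat sat on the mat and the cat slept"
words = text.split()

# Count how often each word appears
freq = Counter(words)
print(freq.most_common(2))  # → [('the', 3), ('cat', 2)]
```

A word cloud is essentially a picture of these frequencies: the more often a word appears, the bigger it is drawn.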
This time, let's use text mining techniques to create a *word cloud*! This is what a word cloud looks like. ↓
First, prepare the data to be analyzed. Since it is hard to prepare data on short notice, this time I will use tweet data from the **online event** "Idolmaster Shiny Colors MUSIC DAWN DAY 1" held on October 31st.
Click here to download the [text data #Shanimas MUSICDAWNday2](https://www.github.com/ysok2135/py/tree/main/%E5%BD%A2%E6%85%8B%E7%B4%A0%E8%A7%A3%E6%9E%90%E5%85%83%E3%83%86%E3%82%99%E3%83%BC%E3%82%BF_SC_DOWN_20201031_utf8.csv)
sudo apt install python3.7
Unlike English, Japanese does not separate words with spaces, so you cannot start text mining right away. Therefore, this time we will use **MeCab**, a well-known open-source morphological analysis engine.
Type in the following commands in order.
sudo apt install mecab
sudo apt install libmecab-dev
sudo apt install mecab-ipadic
sudo apt install mecab-ipadic-utf8
pip install mecab-python3
If you want to improve analysis accuracy, you should also install an additional dictionary such as NEologd, but we will skip that this time to keep things simple.
Many sites do this step in Python, but I think this way is much easier. First, save the text to be analyzed as "test.txt". Then enter the following in the terminal:
mecab -Owakati test.txt -o sample.txt
**That's all!** Checking the output file, the text has been segmented properly.
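If you are curious what the `-Owakati` output looks like from Python's side, here is a small sketch. The file name `sample_demo.txt` and its contents are invented stand-ins for the real `sample.txt`; `-Owakati` simply writes the tokens separated by spaces, so counting words becomes trivial:

```python
from collections import Counter

# Mock of what `mecab -Owakati` produces: tokens separated by spaces.
# (The actual contents of sample.txt will of course differ.)
wakati_output = "今日 の ライブ は 最高 でし た 最高\n"
with open("sample_demo.txt", "w", encoding="utf-8") as f:
    f.write(wakati_output)

# Read it back and count word frequencies, just like the word cloud will
with open("sample_demo.txt", encoding="utf-8") as f:
    tokens = f.read().split()

print(Counter(tokens).most_common(1))  # → [('最高', 2)]
```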
pip install wordcloud
That's all.
Copy and paste the sample code below.
sample.py
from wordcloud import WordCloud
with open('sample.txt') as f:
    text = f.read()
stop_words = [ u'https', u'co', u'Thank you', u'RT', u'Ah', u'']
wc = WordCloud(background_color="white",width=1600, height=1200, font_path='GenEiLateGoP_v2.ttf', stopwords=set(stop_words))
wc.generate(text)
wc.to_file('wc1.png')
**① Import WordCloud and read the file**
from wordcloud import WordCloud
with open('sample.txt') as f:
    text = f.read()
**② Various settings**
- stop_words … keywords to exclude. **It is recommended to run the script a few times and adjust these keywords.**
- background_color … background color
- width, height … size of the output image (in pixels)
- font_path … path to the font file (this time I am using GenEiLateGoP_v2.ttf)

↑ **[Super important! If you don't load a Japanese-capable font, you will get "tofu" (empty boxes)!!!]**
stop_words = [ u'https', u'co', u'Thank you', u'RT', u'Ah', u'']
wc = WordCloud(background_color="white",width=1600, height=1200, font_path='GenEiLateGoP_v2.ttf', stopwords=set(stop_words))
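The effect of `stopwords` can be checked in plain Python before rendering, which is handy when you are iterating on the keyword list. The token list below is an invented example of what might come out of the tweet data:

```python
# Hypothetical tokens as they might appear in sample.txt
tokens = ["https", "co", "RT", "ライブ", "最高", "ライブ"]

# Same exclusion list idea as the stopwords= argument to WordCloud
stop_words = {"https", "co", "RT"}

# WordCloud drops stopwords internally; this shows the same filtering idea
kept = [t for t in tokens if t not in stop_words]
print(kept)  # → ['ライブ', '最高', 'ライブ']
```

Tweet exports are full of URL fragments ("https", "co") and "RT", which is why they are excluded here.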
**③ Generate the word cloud and save it**
wc.generate(text)
wc.to_file('wc1.png')
python3 sample.py
Execution result
Great!!! Tsuda-san's presence is overwhelming! (lol)
You might also try this with works from Aozora Bunko as the source text. I hope this has sparked your interest in sentiment analysis and other techniques as well. Thank you for reading to the end.
**Verification environment**
Ubuntu 18.04 LTS
Python 3.7