[Python] I tried to visualize tweets about Corona with WordCloud

Introduction

I collected tweets for use in machine learning, and since I had the data anyway, I tried visualizing it. I'll leave the method here. I happened to run it on both Windows and macOS, so it works on both.

Target audience

・ Those who can write Python programs to some extent
・ Those who are interested in WordCloud

Environment

Supported OS (confirmed on both Windows and macOS)
┗ macOS Catalina 10.15.7
┗ Windows 10
Python 3.8.3
mecab-python3

What is Word Cloud

A word cloud is a technique that picks out words appearing frequently in a text and displays each one at a size that reflects its frequency. The term refers to automatically arranging words that occur often on web pages, blogs, and similar sources. By varying not only the size of the characters but also their color, font, and orientation, the content of a text can be conveyed at a glance. (From the Digital Daijisen entry)

Simply put, it visualizes how often words occur in an easy-to-understand way. Python has a library that makes this easy to implement. After reading this article, you will be able to create something like the image below.
[Image: Figure_2 1.jpg]
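To make the "size according to frequency" idea concrete, here is a minimal sketch using only the standard library. The function name `scale_font_sizes`, the size range, and the sample words are all made up for illustration; the real wordcloud library uses its own, more elaborate scaling and layout, so treat this only as the underlying idea.

```python
from collections import Counter


def scale_font_sizes(words, min_size=10, max_size=60):
    """Map each distinct word to a font size that grows with its frequency."""
    counts = Counter(words)
    if not counts:
        return {}
    top = max(counts.values())
    # The most frequent word gets max_size; others are scaled linearly.
    return {
        word: round(min_size + (max_size - min_size) * count / top)
        for word, count in counts.items()
    }


# Hypothetical sample words, already split into tokens.
sample_words = ["mask", "vaccine", "mask", "remote", "mask", "vaccine"]
sizes = scale_font_sizes(sample_words)
for word, size in sorted(sizes.items(), key=lambda item: -item[1]):
    print(word, size)
```

The most frequent word ends up largest, which is the same impression a word cloud image gives.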

What is MeCab

MeCab is a library for morphological analysis, which breaks sentences and phrases down into the smallest units that carry meaning. For example, morphological analysis of the Japanese sentence meaning "I program at the company" splits it into minimal units along the lines of "I / (topic particle) / company / at / programming / do", with particles becoming separate units.

With this library, you can extract just the words needed for a word cloud like the one above.
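As a rough illustration of how nouns can be picked out of morphological analysis results, the sketch below filters MeCab-style text output without calling MeCab itself. It assumes the common "surface, tab, comma-separated features" line format with the part of speech as the first feature, and the sample lines are hand-written in the IPA-dictionary style rather than captured from a real run. The helper name `extract_nouns` is hypothetical.

```python
def extract_nouns(mecab_output):
    """Return the surface forms of tokens whose part of speech is 名詞 (noun)."""
    nouns = []
    for line in mecab_output.splitlines():
        if line == "EOS" or "\t" not in line:
            continue  # skip the end-of-sentence marker and blank lines
        surface, feature = line.split("\t", 1)
        # The first comma-separated feature is assumed to be the part of speech.
        if feature.split(",")[0] == "名詞":
            nouns.append(surface)
    return nouns


# Hand-written sample in the IPA-dictionary output style (an assumption,
# not output captured from a real MeCab run).
sample_output = "\n".join([
    "私\t名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシ",
    "は\t助詞,係助詞,*,*,*,*,は,ハ,ワ",
    "会社\t名詞,一般,*,*,*,*,会社,カイシャ,カイシャ",
    "で\t助詞,格助詞,一般,*,*,*,で,デ,デ",
    "EOS",
])
print(extract_nouns(sample_output))
```

Particles such as は and で are dropped, leaving only the nouns that are worth showing in a word cloud.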

Data collection

I actually collected about 70,000 tweets using the Twitter API, but I will omit the collection method this time. I may write another article once I have a program I am satisfied with. Since collecting tweets yourself would be a hassle, I have prepared the files below. Because the data consists of tweets, I limited it to 8,000 items.

input_file (tweet data) ↓
https://17.gigafile.nu/1108-d3e975ac3446f65274267ced0915bc8ff
word_except_file (list of excluded words) ↓
https://17.gigafile.nu/1108-c355b7876fecb940dd6efd712b84adda8

Word Cloud generation

Now we finally generate the word cloud. WordCloud's default font cannot render Japanese, so download the IPA font 4-typeface pack (Ver.003.03) from https://moji.or.jp/ipafont/ipa00303/, extract it to a suitable location, and place the file named ipag.ttf in the same directory as the program below (if you know what you are doing, specifying a full path is also fine).

The word_except_file lists words that are closely tied to the topic, such as "corona" and "infection", so that these obvious terms are excluded. It also contains unneeded words that inevitably come out of morphological analysis.
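The effect of the exclusion list can be sketched in plain Python as follows. This assumes the list holds one word per line; the helper names `load_excluded_words` and `remove_excluded` and the sample words are hypothetical, and the script in this article hands the list to WordCloud instead of filtering by hand.

```python
def load_excluded_words(lines):
    """Build a set of excluded words from an iterable of text lines."""
    return {line.strip() for line in lines if line.strip()}


def remove_excluded(words, excluded):
    """Keep only the words that are not in the exclusion set."""
    return [word for word in words if word not in excluded]


# Hypothetical contents of an exclusion file, one word per line.
excluded = load_excluded_words(["コロナ\n", "感染\n", "\n", "こと\n"])
words = ["コロナ", "マスク", "感染", "在宅", "こと", "マスク"]
print(remove_excluded(words, excluded))
```

Using a set keeps each membership check cheap even when the exclusion list grows.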

How to run:

python3 makeWordCloud.py colona_data.txt except_word_list.txt

If you know what you are doing, adjust the file names and paths to suit your own environment.
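The script reads its two arguments directly from sys.argv, so a missing argument shows up as an IndexError. As an optional alternative, the sketch below uses the standard argparse module, which prints a usage message instead. The function name `parse_args` and the help texts are my own choices and are not part of the original script.

```python
import argparse


def parse_args(argv=None):
    """Parse the two positional arguments the script expects."""
    parser = argparse.ArgumentParser(
        description="Generate a word cloud from a tweet text file."
    )
    parser.add_argument("input_file", help="text file containing the tweets")
    parser.add_argument("word_except_file", help="text file listing words to exclude")
    return parser.parse_args(argv)


# Passing a list simulates the command line for demonstration purposes.
args = parse_args(["colona_data.txt", "except_word_list.txt"])
print(args.input_file, args.word_except_file)
```

Calling `parse_args()` with no argument reads from the real command line, so it could stand in for the sys.argv lines if you prefer.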

makeWordCloud.py


import MeCab
import sys
from matplotlib import pyplot as plt
from wordcloud import WordCloud

args = sys.argv
input_file = args[1]
word_except_file = args[2]

# Read the tweet text file given as the first argument
with open(input_file, mode='rt', encoding='utf-8') as fi:
    source_text = fi.read()

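# Prepare MeCab; the empty parse() call is a commonly used workaround
# for node.surface coming back empty or garbled in mecab-python3
tagger = MeCab.Tagger()
tagger.parse('')
node = tagger.parseToNode(source_text)

# Extract nouns (Japanese MeCab dictionaries label the part of speech '名詞')
word_list = []
while node:
    word_type = node.feature.split(',')[0]
    if word_type == '名詞':
        word_list.append(node.surface)
    node = node.next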

# Read the excluded words, one per line
# (the file is assumed to be UTF-8, like the input file)
with open(except_word_file, mode='rt', encoding='utf-8') as f:
    except_word_list = [line.rstrip() for line in f]

# Join the words into one space-separated string
word_chain = ' '.join(word_list)

# Create the word cloud
W = WordCloud(
    width=640,
    height=480,
    background_color='white',
    font_path='./ipag.ttf',
    stopwords=except_word_list,
).generate(word_chain)

plt.imshow(W)
plt.axis('off')
plt.show()

In closing

This was my first post on Qiita, so there was a lot I didn't understand, but it was fun. If you have any questions or suggestions for improvement, please leave a comment. If I come up with something else, I'll post again. See you then.
