I collected tweets for use in machine learning, and since I had them anyway, I decided to visualize them. I'll leave the method here. I happened to run it on both Windows and macOS, so it works on both.
・Those who can write Python programs to some extent
・Those who are interested in WordCloud
Operating environment (works on both Windows and Mac)
┗ macOS Catalina 10.15.7
┗ Windows 10
Python 3.8.3
mecab-python3
WordCloud is a method of selecting words that appear frequently in a text and displaying them at a size proportional to their frequency. It refers to automatically arranging words that frequently appear on web pages and blogs. By varying not only the size of the characters but also their color, font, and orientation, the content of the text can be conveyed at a glance. (From the commentary in Digital Daijisen)
Simply put, it visualizes how often each word occurs in an easy-to-understand way. There is a Python library that makes this easy to implement. After reading this article, you will be able to create something like the image shown below.
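At its core this is just word-frequency counting. As a rough sketch of the idea (the sample word list is made up), Python's standard `collections.Counter` does the same tallying that WordCloud performs internally before sizing the words:

```python
from collections import Counter

# Made-up token list: WordCloud draws each word at a size
# proportional to how often it appears in the input text.
words = ["corona", "vaccine", "corona", "mask", "corona", "vaccine"]

freq = Counter(words)
print(freq.most_common(2))  # → [('corona', 3), ('vaccine', 2)]
```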
MeCab is a library for morphological analysis: breaking sentences and phrases down into their smallest meaningful units. For example, performing morphological analysis on the Japanese sentence for "I program at the company" divides it into minimal units such as "I / (topic marker) / company / at / programming / (object marker) / do".
By using this library, you can extract only the words needed for a WordCloud like the one above.
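For reference, each node returned by mecab-python3's `parseToNode` carries a comma-separated `feature` string whose first field is the part of speech (名詞 means "noun" in the standard IPA dictionary). The program below relies on this; here is a pure-Python sketch using a feature string I wrote by hand as an example:

```python
# Hand-written example of a feature string as MeCab (IPA dictionary)
# typically returns it for a common noun.
feature = "名詞,一般,*,*,*,*,会社,カイシャ,カイシャ"

# The first comma-separated field is the part of speech.
word_type = feature.split(',')[0]
print(word_type)  # → 名詞
```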
I actually collected about 70,000 tweets using the Twitter API, but I'll omit that method this time; I may write a separate article about it in the future once I've polished the program. Since collecting tweets yourself is a hurdle, I've prepared files here. Because the data consists of tweets, it is limited to 8,000 entries.
input_file (tweet data) ↓
https://17.gigafile.nu/1108-d3e975ac3446f65274267ced0915bc8ff
word_except_file (list of excluded words) ↓
https://17.gigafile.nu/1108-c355b7876fecb940dd6efd712b84adda8
Finally, we generate the WordCloud. Since WordCloud does not bundle a Japanese font, download the 4-typeface pack (Ver.003.03) from https://moji.or.jp/ipafont/ipa00303/, extract it to a suitable location, and place the file named **ipag.ttf** in the same directory as the program below (a full path is also fine if you know what you are doing). The word_except_file contains words strongly associated with "corona", such as "corona" and "infection", so that they can be excluded. It also lists unnecessary tokens that inevitably show up in morphological analysis.
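WordCloud's `stopwords` parameter handles this exclusion internally; conceptually it amounts to filtering the token list against the exclusion set, as in this sketch (both word lists here are made up):

```python
# Made-up exclusion set, like the one read from the word_except_file.
except_words = {"corona", "infection"}

# Made-up token list extracted from the tweets.
words = ["corona", "vaccine", "infection", "mask"]

filtered = [w for w in words if w not in except_words]
print(filtered)  # → ['vaccine', 'mask']
```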
Execution method:
python3 makeWordCloud.py colona_data.txt except_word_list.txt
Adjust the file names and paths to suit your own environment as appropriate.
makeWordCloud.py
```python
import sys

import MeCab
from matplotlib import pyplot as plt
from wordcloud import WordCloud

args = sys.argv
input_file = args[1]
word_except_file = args[2]

# Read the text file
with open(input_file, mode='rt', encoding='utf-8') as fi:
    source_text = fi.read()

# Prepare MeCab
tagger = MeCab.Tagger()
tagger.parse('')  # workaround for a known mecab-python3 surface issue
node = tagger.parseToNode(source_text)

# Extract only nouns (MeCab tags them as 名詞)
word_list = []
while node:
    word_type = node.feature.split(',')[0]
    if word_type == '名詞':
        word_list.append(node.surface)
    node = node.next

# Read the excluded words
except_word_list = []
with open(word_except_file, mode='rt', encoding='utf-8') as f:
    for line in f:
        except_word_list.append(line.rstrip())

# Join the noun list into one space-separated string
word_chain = ' '.join(word_list)

# Create and display the word cloud
W = WordCloud(width=640, height=480, background_color='white',
              font_path='./ipag.ttf',
              stopwords=set(except_word_list)).generate(word_chain)
plt.imshow(W)
plt.axis('off')
plt.show()
```
This was my first post to Qiita, so there was a lot I didn't understand, but it was fun. If you have any questions or suggestions for improvement, please leave a comment. If I come up with something else, I'll post again. Well then.