I'm a beginner, but after some practice with Python I wanted to make an impressive picture with WordCloud, and I enjoyed doing it. This is a memo of what I did.
Working environment: Ubuntu 18.04.4 LTS / Python 3.6.9 / mecab-python3 0.996.5
Please adjust the file names, arguments, and so on in the source code of this article to suit your own environment.
A word cloud is a visualization that selects words appearing frequently in a text and displays them at a size corresponding to their frequency. It refers to automatically arranging words that frequently appear on web pages and blogs. By varying not only the size of the words but also their color, font, and orientation, the content of the text can be grasped at a glance. (From the commentary in Digital Daijisen)
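The core idea, counting word frequencies and letting the counts drive the display size, can be sketched with the standard library alone. The text here is a made-up stand-in, not the Jobs speech used later:

```python
from collections import Counter

# a tiny stand-in text to show the frequency idea
text = "apple banana apple cherry apple banana"
freq = Counter(text.split())

# most_common() orders words by frequency; in a word cloud,
# higher counts are rendered in larger type
print(freq.most_common())  # [('apple', 3), ('banana', 2), ('cherry', 1)]
```

The WordCloud library does this counting internally when you call `generate(text)`.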
This is the final completed word cloud. The input text was prepared separately from a speech by Apple founder Steve Jobs and passed in as a file.
A mask image is also used, so the words are drawn inside the outlines of Jobs and the Apple logo.
Words such as "myself", "life", "like", and "university" stand out. I personally think it turned out pretty cool, so I'm happy :clap:
Here is the final source code that creates the image above.
sample4wordcloud.py
#coding: utf-8
from PIL import Image
import numpy as np
from matplotlib import pyplot as plt
from wordcloud import WordCloud
import MeCab

# Word cloud creation function (English text version)
def create_wordcloud_en(text, image):
    fontpath = 'NotoSansCJK-Regular.ttc'
    stop_words_en = ['am', 'is', 'of', 'and', 'the', 'to', 'it',
                     'for', 'in', 'as', 'or', 'are', 'be', 'this',
                     'that', 'will', 'there', 'was']
    wordcloud = WordCloud(background_color="white",
                          font_path=fontpath,
                          width=900,
                          height=500,
                          mask=image,
                          contour_width=1,
                          contour_color="black",
                          stopwords=set(stop_words_en)).generate(text)
    # drawing
    plt.figure(figsize=(15, 20))
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()
    # png output
    wordcloud.to_file("wc_image_en.png")

# Word cloud creation function (Japanese text version)
def create_wordcloud_ja(text, image):
    fontpath = 'NotoSansCJK-Regular.ttc'
    stop_words_ja = ['こと', 'もの', 'とき', 'そう', 'など',
                     'これ', 'よう', 'これら', 'それ', 'すべて']
    # Morphological analysis
    tagger = MeCab.Tagger()
    tagger.parse('')  # workaround to avoid UnicodeDecodeError in mecab-python3
    node = tagger.parseToNode(text)
    word_list = []
    while node:
        word_type = node.feature.split(',')[0]
        word_surf = node.surface
        # MeCab reports parts of speech in Japanese: '名詞' means "noun"
        if word_type == '名詞' and word_surf not in stop_words_ja:
            word_list.append(node.surface)
        node = node.next
    word_chain = ' '.join(word_list)
    wordcloud = WordCloud(background_color="white",
                          font_path=fontpath,
                          width=900,
                          height=500,
                          mask=image,
                          contour_width=1,
                          contour_color="black",
                          stopwords=set(stop_words_ja)).generate(word_chain)
    # drawing
    plt.figure(figsize=(15, 20))
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()
    # png output
    wordcloud.to_file("wc_image_ja.png")

# Load the required files
# Read the text
with open('jobs.txt', 'r', encoding='utf-8') as fi:
    text = fi.read()
# Load the mask image to use
msk = np.array(Image.open("apple.png"))

create_wordcloud_ja(text, msk)
Two functions for creating a word cloud are defined, one for Japanese text and one for English text. The two are very similar, so there is probably a smarter way to write this ... should I have used a class?
The processing required before drawing the word cloud differs between English and Japanese text.
In English, as in "I like Apple.", each word is separated by a space, so when you split the text into words the boundaries are never ambiguous. In Japanese, however, words are written without spaces, so the boundaries are not obvious from the text alone.
Therefore, Japanese text must first be split into words by morphological analysis. This time I used MeCab for the morphological analysis.
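The difference shows up immediately with plain `str.split()`:

```python
# English: whitespace already marks the word boundaries
en = "I like Apple."
print(en.split())  # ['I', 'like', 'Apple.']

# Japanese: no spaces, so split() returns the whole sentence as one chunk
ja = "私はAppleが好きです。"
print(ja.split())  # ['私はAppleが好きです。']
```

This is exactly the gap that a morphological analyzer like MeCab fills for Japanese.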
sample.py
tagger = MeCab.Tagger()
tagger.parse('')
node = tagger.parseToNode(text)
word_list = []
while node:
    word_type = node.feature.split(',')[0]
    word_surf = node.surface
    if word_type == '名詞' and word_surf not in stop_words_ja:  # '名詞' = noun
        word_list.append(node.surface)
    node = node.next
The above is the part where morphological analysis is performed.
tagger = MeCab.Tagger()
This sets the output mode; the mode changes depending on the argument passed to Tagger():
- "-Ochasen": ChaSen-compatible format
- "-Owakati": output the word segmentation only
- "-Oyomi": output the readings only
All the option names start with O, which is kind of cute (laughs)
tagger.parse('')
I don't fully understand this part, but writing it before passing data to the parser apparently avoids a UnicodeDecodeError (a known workaround for a bug in older versions of mecab-python3) ...
node = tagger.parseToNode(text)
parseToNode returns the analysis result as a linked list of nodes, each holding surface (the word itself) and feature (its part-of-speech information). You can access them by writing node.surface and node.feature.
word_list = []
while node:
    word_type = node.feature.split(',')[0]
    word_surf = node.surface
    if word_type == '名詞' and word_surf not in stop_words_ja:  # '名詞' = noun
        word_list.append(node.surface)
    node = node.next
word_chain = ' '.join(word_list)
Read the nodes in order, and append every word whose part of speech is noun ('名詞') and which is not in stop_words_ja to word_list.
Then join the list into a single string with spaces as the delimiter to get word_chain.
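The filter-and-join step can be checked in isolation. The (surface, part-of-speech) pairs below are hypothetical stand-ins for what MeCab's nodes would yield, not real analyzer output:

```python
# hypothetical (surface, part-of-speech) pairs, as if read from MeCab nodes
tokens = [('人生', '名詞'), ('こと', '名詞'), ('大学', '名詞'), ('の', '助詞')]
stop_words_ja = ['こと']

# keep only nouns ('名詞') that are not stop words, then join with spaces
word_list = [surf for surf, pos in tokens if pos == '名詞' and surf not in stop_words_ja]
word_chain = ' '.join(word_list)
print(word_chain)  # 人生 大学
```

The resulting space-separated string is what WordCloud's `generate()` expects, which is why this step is needed for Japanese at all.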
Actually, the first picture I drew was generated from all the nouns, without setting any stop words. Then it looked like this ...
In that figure, strings like "こと" (thing), "それ" (it), and "よう" (like) that are not interesting to display stand out.
This is something ...: frowning2:
So I made a list of the words I did not want to display (stop words) and excluded them from the output.
It had been a while since I last touched Python, and it really is fun ~ :relaxed: Next I'm thinking of scraping SNS data and playing with that. If you find any mistakes in this article or have any advice, please let me know.
References:
https://sleepless-se.net/2018/08/24/python-mecab-wakatigaki/
https://qiita.com/furipon308/items/be97abf25cf4caa0574e
https://qiita.com/yonedaco/items/27e1ad19132c9f1c9180
https://analysis-navi.com/?p=2295