This article is from the nem #2 Advent Calendar 2019.
① Extract only the text from nem-related documents, ② break it down into parts of speech with MeCab and visualize it with WordCloud, and ③ since WordCloud alone is not that interesting, add a little extra touch.
macOS 10.15.1 / Python 3.7.4
By the way, what nem-related document sums up 2019? That's right: the Advent Calendar.
~~This time I extracted the strings from all the articles of this year's nem Advent Calendar~~ Doing that would not mean much unless I waited until the last day, so for now I used only the first article, @44uk_i3's "Summary of specifications that change between NEM1 and NEM2". (I got permission, of course.)
scrapy.py
import urllib.request
from bs4 import BeautifulSoup

text = []
# URL of the target page
url = 'https://qiita.com/44uk_i3/items/53ad306d2c82df41803f'
html = urllib.request.urlopen(url)
soup = BeautifulSoup(html, 'html.parser')
# On Qiita, the article body is inside <div class="p-items_article">
article = soup.findAll('div', class_='p-items_article')
# Extract only the text from the body
for i in article:
    text.append(i.text)
# Save the extracted text to nem.txt
with open('nem.txt', 'w', encoding='utf-8') as file:
    file.writelines(text)
I don't think there is anything special to mention here; please leave a comment if anything is unclear.
Now, let's break the text down. Writing code like this reminds me of the day I was first impressed by MeCab.
MeCab works like this:
$ mecab
すもももももももものうち
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS
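Each output line pairs a surface form with a comma-separated feature string whose first field is the part of speech; the scripts below rely on that field. A minimal sketch of pulling it out of one such line (the sample line is hardcoded here as an assumption):

```python
# One line of MeCab output: surface form, a tab, then
# comma-separated features (part of speech comes first).
line = "すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ"

surface, feature = line.split("\t")
pos = feature.split(",")[0]  # first feature field = part of speech

print(surface)  # すもも
print(pos)      # 名詞 (noun)
```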
nem_wordcloud.py
import MeCab
from wordcloud import WordCloud

# Open the saved text
data = open("./nem.txt", "rb").read()
text = data.decode('utf-8')
mecab = MeCab.Tagger('-Ochasen')
mecab.parse('')
# Morphological analysis
node = mecab.parseToNode(text)
# Word list to feed WordCloud
output = []
# Pick out words by part of speech
while node:
    word = node.surface
    hinnsi = node.feature.split(",")[0]
    # Parts of speech to keep (MeCab emits Japanese labels:
    # 動詞 = verb, 副詞 = adverb, 形容詞 = adjective, 名詞 = noun)
    if hinnsi in ["動詞", "副詞", "形容詞", "名詞"]:
        output.append(word)
    node = node.next
text = ' '.join(output)
# Path to a Japanese font (mac)
fpath = "/System/Library/Fonts/Hiragino Mincho ProN.ttc"
# Generate the WordCloud, specifying the background color
wc = WordCloud(
    background_color="white",
    font_path=fpath,
    width=800,
    height=600).generate(text)
# Save as png
wc.to_file("./wc.png")
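One optional refinement (my own addition, not in the original script): the word list usually contains single characters and symbols that clutter the cloud, so dropping short tokens before joining can clean up the result. A sketch with a hardcoded sample list:

```python
# Hypothetical cleanup step: drop tokens shorter than 2 characters
# before handing the joined string to WordCloud.
output = ["nem", "ブロック", "の", "チェーン", "1", "アドレス"]
filtered = [w for w in output if len(w) >= 2]
text = ' '.join(filtered)
print(text)  # nem ブロック チェーン アドレス
```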
Feel free to change the font to your liking. The default macOS fonts live in /System/Library/Fonts/. On Windows, please search for the equivalent fonts directory.
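Since the font path differs from machine to machine, a small helper that falls back through a list of candidates can make the script more portable. This is a sketch; the candidate paths are my assumptions, not from the original:

```python
import os

def pick_font(candidates):
    """Return the first font path that actually exists, else None."""
    for path in candidates:
        if os.path.exists(path):
            return path
    return None

fpath = pick_font([
    "/System/Library/Fonts/Hiragino Mincho ProN.ttc",  # macOS
    "C:/Windows/Fonts/msgothic.ttc",                   # Windows (assumed)
])
```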
Here is what came out.
Plenty of Qiita articles already cover everything up to this point, so I will add one more twist.
nem_wordcloud_2.py
import MeCab
import numpy as np
from wordcloud import WordCloud, ImageColorGenerator
from PIL import Image

# Open the saved text
data = open("./nem.txt", "rb").read()
text = data.decode('utf-8')
mecab = MeCab.Tagger('-Ochasen')
mecab.parse('')
# Morphological analysis
node = mecab.parseToNode(text)
# List of words to feed WordCloud
output = []
# Pick out words by part of speech
while node:
    word = node.surface
    hinnsi = node.feature.split(",")[0]
    # Parts of speech to keep (MeCab emits Japanese labels:
    # 動詞 = verb, 副詞 = adverb, 形容詞 = adjective, 名詞 = noun)
    if hinnsi in ["動詞", "副詞", "形容詞", "名詞"]:
        output.append(word)
    node = node.next
# Path to a Japanese font (mac)
fpath = "/System/Library/Fonts/Hiragino Mincho ProN.ttc"
text = ' '.join(output)
imagepaht = "./nem_icon_black.png"
img_color = np.array(Image.open(imagepaht))
wc = WordCloud(
    width=800,
    height=800,
    font_path=fpath,
    mask=img_color,
    background_color="white",
    collocations=False).generate(text)
wc.to_file("./wc_nem.png")
First, I prepared an image: the icon you all know, filled in black. This is the ./nem_icon_black.png specified in `imagepaht`.
So, here is the image produced when you run this code.
It turned out better than I expected.
With a little more data, it looks like you could analyze what mattered to nem in 2019.
You can also produce variations like this one by swapping the base image or tweaking the settings.
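For reference on why the icon had to be filled black: WordCloud treats mask pixels with value 255 (white) as off-limits and draws words only on the darker areas. A sketch that builds a simple circular mask without any image file (the size and radius are my own choices):

```python
import numpy as np

# Build a 400x400 mask: 0 (black) inside a circle, 255 (white) outside.
# WordCloud draws words only where the mask is not white (255).
size, radius = 400, 180
y, x = np.ogrid[:size, :size]
center = size // 2
circle = (x - center) ** 2 + (y - center) ** 2 <= radius ** 2
mask = np.where(circle, 0, 255).astype(np.uint8)

# The array can then be passed directly: WordCloud(mask=mask, ...)
```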