This article is from the nem #2 Advent Calendar 2019.
① Extract only the text from nem-related documents, ② break it down into parts of speech with MeCab and visualize it with WordCloud, and ③ since WordCloud alone is not that interesting, add a little extra touch.
macOS 10.15.1 / Python 3.7.4
By the way, what nem-related document sums up 2019? That's right: the Advent Calendar.
~~This time I extracted the strings from all the articles of this year's nem Advent Calendar~~ Doing that would not mean much unless I waited until the last day, so for now I used only the first article, @44uk_i3's "Summary of specifications that change between NEM1 and NEM2". (I got permission, of course.)
scrapy.py
import urllib.request
from bs4 import BeautifulSoup

text = []
# URL of the target page
url = 'https://qiita.com/44uk_i3/items/53ad306d2c82df41803f'
html = urllib.request.urlopen(url)
soup = BeautifulSoup(html, 'html.parser')
# On Qiita, the article body is inside <div class="p-items_article">
article = soup.findAll('div', class_='p-items_article')
# Extract only the text from the body
for i in article:
    text.append(i.text)
# Save the extracted text to nem.txt
with open('nem.txt', 'w', encoding='utf-8') as file:
    file.writelines(text)
I don't think there is anything special to mention here; please leave a comment if anything is unclear.
Now, let's break the text down. Writing code like this reminds me of the day I was first impressed by MeCab.
MeCab works like this:
$ mecab
すもももももももものうち
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS
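Each output line pairs a surface form with a comma-separated feature string whose first field is the part of speech; the scripts below rely on that field. A minimal sketch of pulling it out of one such line (the sample line is hardcoded here as an assumption):

```python
# One line of MeCab output: surface form, a tab, then
# comma-separated features (part of speech comes first).
line = "すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ"

surface, feature = line.split("\t")
pos = feature.split(",")[0]  # first feature field = part of speech

print(surface)  # すもも
print(pos)      # 名詞 (noun)
```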
nem_wordcloud.py
import MeCab
from wordcloud import WordCloud

# Open the saved text
data = open("./nem.txt", "rb").read()
text = data.decode('utf-8')
mecab = MeCab.Tagger('-Ochasen')
mecab.parse('')
# Morphological analysis
node = mecab.parseToNode(text)
# Word list to feed WordCloud
output = []
# Pick out words by part of speech
while node:
    word = node.surface
    hinnsi = node.feature.split(",")[0]
    # Parts of speech to keep (MeCab emits Japanese labels:
    # 動詞 = verb, 副詞 = adverb, 形容詞 = adjective, 名詞 = noun)
    if hinnsi in ["動詞", "副詞", "形容詞", "名詞"]:
        output.append(word)
    node = node.next
text = ' '.join(output)
# Path to a Japanese font (mac)
fpath = "/System/Library/Fonts/Hiragino Mincho ProN.ttc"
# Generate the WordCloud, specifying the background color
wc = WordCloud(
    background_color="white",
    font_path=fpath,
    width=800,
    height=600).generate(text)
# Save as png
wc.to_file("./wc.png")
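One optional refinement (my own addition, not in the original script): the word list usually contains single characters and symbols that clutter the cloud, so dropping short tokens before joining can clean up the result. A sketch with a hardcoded sample list:

```python
# Hypothetical cleanup step: drop tokens shorter than 2 characters
# before handing the joined string to WordCloud.
output = ["nem", "ブロック", "の", "チェーン", "1", "アドレス"]
filtered = [w for w in output if len(w) >= 2]
text = ' '.join(filtered)
print(text)  # nem ブロック チェーン アドレス
```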
Feel free to change the font to your liking. The default macOS fonts live in /System/Library/Fonts/. On Windows, please search for the equivalent fonts directory.
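Since the font path differs from machine to machine, a small helper that falls back through a list of candidates can make the script more portable. This is a sketch; the candidate paths are my assumptions, not from the original:

```python
import os

def pick_font(candidates):
    """Return the first font path that actually exists, else None."""
    for path in candidates:
        if os.path.exists(path):
            return path
    return None

fpath = pick_font([
    "/System/Library/Fonts/Hiragino Mincho ProN.ttc",  # macOS
    "C:/Windows/Fonts/msgothic.ttc",                   # Windows (assumed)
])
```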
Here is what came out.
Plenty of Qiita articles already cover everything up to this point, so I will add one more twist.
nem_wordcloud_2.py
import MeCab
import numpy as np
from wordcloud import WordCloud, ImageColorGenerator
from PIL import Image

# Open the saved text
data = open("./nem.txt", "rb").read()
text = data.decode('utf-8')
mecab = MeCab.Tagger('-Ochasen')
mecab.parse('')
# Morphological analysis
node = mecab.parseToNode(text)
# List of words to feed WordCloud
output = []
# Pick out words by part of speech
while node:
    word = node.surface
    hinnsi = node.feature.split(",")[0]
    # Parts of speech to keep (MeCab emits Japanese labels:
    # 動詞 = verb, 副詞 = adverb, 形容詞 = adjective, 名詞 = noun)
    if hinnsi in ["動詞", "副詞", "形容詞", "名詞"]:
        output.append(word)
    node = node.next
# Path to a Japanese font (mac)
fpath = "/System/Library/Fonts/Hiragino Mincho ProN.ttc"
text = ' '.join(output)
imagepaht = "./nem_icon_black.png"
img_color = np.array(Image.open(imagepaht))
wc = WordCloud(
    width=800,
    height=800,
    font_path=fpath,
    mask=img_color,
    background_color="white",
    collocations=False).generate(text)
wc.to_file("./wc_nem.png")
First, I prepared an image: the icon you all know, filled in black. This is the ./nem_icon_black.png specified in `imagepaht`.
So, here is the image produced when you run this code.
It turned out better than I expected.
With a little more data, it looks like you could analyze what mattered to nem in 2019.
You can also produce variations like this one by swapping the base image or tweaking the settings.
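For reference on why the icon had to be filled black: WordCloud treats mask pixels with value 255 (white) as off-limits and draws words only on the darker areas. A sketch that builds a simple circular mask without any image file (the size and radius are my own choices):

```python
import numpy as np

# Build a 400x400 mask: 0 (black) inside a circle, 255 (white) outside.
# WordCloud draws words only where the mask is not white (255).
size, radius = 400, 180
y, x = np.ogrid[:size, :size]
center = size // 2
circle = (x - center) ** 2 + (y - center) ** 2 <= radius ** 2
mask = np.where(circle, 0, 255).astype(np.uint8)

# The array can then be passed directly: WordCloud(mask=mask, ...)
```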