This is the December 8 entry in the jsys19 Advent Calendar (https://adventar.org/calendars/4301).
This is the first time I have published code along with an article. Both the writing and the code are rough, so I would appreciate it if you read along patiently and told me whenever you think "this is the better way to do it!".
To start abruptly: do you know what a word cloud is?
A word cloud is a method of picking out the words that appear frequently in a text and displaying each one at a size that reflects its frequency. The term refers to automatically laying out the words that appear often on web pages, blogs, and the like. By varying not only the size of the characters but also their color, font, and orientation, it conveys the content of a text at a glance. Source: https://kotobank.jp/word/%E3%83%AF%E3%83%BC%E3%83%89%E3%82%AF%E3%83%A9%E3%82%A6%E3%83%89-674221
In practice it looks like the image below, which is a word cloud generated from the typescript-eslint GitHub page.
I had seen this way of presenting words online before and found it interesting, so I thought "wouldn't it be fun to do this with our Slack log?" and wrote this article.
wordcloud only accepts text in which words are separated by spaces. Our remarks are not written that way, so I use MeCab to segment them into words. Before that, all the remarks have to be gathered into one place.
First, I obtained an export archive of everyone's remarks from the Slack workspace owner and extracted the sentences from it. When you open the archive, there is a folder for each channel, and inside each folder the remarks are stored in JSON format together with information such as the sender and reactions. (At this point, it is easier to delete the folders of channels where bots post a lot.)
ex-2020-6-31.json
[
    {
        "client_msg_id": "hoge",
        "type": "message",
        "text": "I became a hatachi",
        "user": "hogee",
        "ts": "hooge",
        "team": "foo",
        "user_team": "foo",
        "source_team": "foo",
        "user_profile": {
            "avatar_hash": "bar",
            "image_72": "https:\/\/avatars.slack-edge.com\/ore.png",
            "first_name": "Murakami",
            "real_name": "Murakami ore",
            "display_name": "Murakami",
            "team": "piyo",
            "name": "s31051315",
            "is_restricted": false,
            "is_ultra_restricted": false
        }
    }
]
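As a small sketch of working with this structure: Slack messages that are not ordinary remarks (joins, bot posts, and so on) usually carry a "subtype" field, so skipping them is one possible way to reduce noise without deleting folders by hand. The function name human_texts and the sample data are assumptions for illustration, and this filter also drops the few human messages that carry a subtype.

```python
import json

# Trimmed-down sample in the same shape as the export file above (made-up data)
raw = """
[
  {"type": "message", "text": "I became a hatachi", "user": "hogee"},
  {"type": "message", "subtype": "channel_join",
   "text": "<@hogee> has joined the channel", "user": "hogee"},
  {"type": "message", "subtype": "bot_message", "text": "build passed"}
]
"""

def human_texts(messages):
    """Return the text of messages that have no "subtype" field (hypothetical helper)."""
    return [m["text"] for m in messages if "subtype" not in m and "text" in m]

texts = human_texts(json.loads(raw))
print(texts)  # ['I became a hatachi']
```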
Below is the code that scans every JSON file in the archive folder and collects the contents of the text property, which holds the remark itself, into a single variable.
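from pathlib import Path
import json
import re

main_text = ""
json_path = Path("src/jsys_archive")
# Recursively collect every .json file under the archive folder
dirs = list(json_path.glob("**/*.json"))
for i in dirs:
    # Read as UTF-8 and close the file as soon as it has been parsed
    with open(i, encoding="utf-8") as json_open:
        json_text = json.load(json_open)
    for j in range(len(json_text)):
        # Strip <...> (mentions, links) and :...: (emoji) from the remark
        json_text_fixed = re.sub(r"<.*?>|:.*?:", "", json_text[j]["text"])
        main_text += json_text_fixed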
I pass the path of the folder I want to scan to Path() to get a Path object, then pass "**/*.json" to glob() to find every JSON file at any depth below it.
json_path = Path("src/jsys_archive")
dirs = list(json_path.glob("**/*.json"))
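To see the recursive pattern in action without a real archive, here is a minimal sketch that builds a throwaway folder tree and collects the text fields from it. The helper name collect_texts, the channel names, and the file names are all made up for this example.

```python
import json
import tempfile
from pathlib import Path

def collect_texts(root):
    """Return the "text" of every message in every .json file under root."""
    texts = []
    # "**/*.json" matches .json files in root and in all nested folders
    for path in sorted(Path(root).glob("**/*.json")):
        with open(path, encoding="utf-8") as f:
            messages = json.load(f)
        texts.extend(m["text"] for m in messages if "text" in m)
    return texts

with tempfile.TemporaryDirectory() as tmp:
    # One folder per channel, one JSON file per day, as in the Slack export
    for channel, text in [("general", "hello"), ("random", "world")]:
        folder = Path(tmp) / channel
        folder.mkdir()
        (folder / "2020-06-30.json").write_text(
            json.dumps([{"type": "message", "text": text}]), encoding="utf-8")
    texts = collect_texts(tmp)

print(texts)  # ['hello', 'world']
```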
Everyone's remarks are mixed with noise that is not plain text: data that Slack handles internally, such as mentions, is enclosed in <>, and reactions (emoji) are enclosed in ::. If these are left in, the resulting word cloud is dominated by system strings, so I remove them with a regular expression.
# Remove <...> and :...: together with the text inside them
json_text_fixed = re.sub(r"<.*?>|:.*?:", "", json_text[j]["text"])
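The sketch below shows what this pattern does to a sample remark, and one side effect worth knowing about: because :.*?: matches anything between two colons, it also eats text such as the middle of a time range. The stricter pattern is only an assumption about what emoji names look like (lowercase ASCII letters, digits, _, + and -); custom emoji with other characters would slip through it.

```python
import re

LOOSE = re.compile(r"<.*?>|:.*?:")             # the pattern used in this article
STRICT = re.compile(r"<.*?>|:[a-z0-9_+\-]+:")  # assumed emoji-name alphabet

def clean(text, pattern=LOOSE):
    """Remove every match of pattern from text (hypothetical helper)."""
    return pattern.sub("", text)

a = clean("<@U123> hello :smile: world")
b = clean("meet 10:30 to 11:00 :+1:")
c = clean("meet 10:30 to 11:00 :+1:", STRICT)

print(repr(a))  # ' hello  world'
print(repr(b))  # 'meet 1000 '   (the loose pattern removed ":30 to 11:")
print(repr(c))  # 'meet 10:30 to 11:00 '
```

For a word cloud of nouns the loose pattern is usually good enough, since the lost fragments are mostly numbers and particles.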
Now everyone's remarks are gathered in the (huge) variable main_text. All that is left is to hand it to MeCab.
As noted above, wordcloud only accepts space-separated text, and our remarks are not written that way, so I segment them into words with MeCab.
The code is as follows.
import MeCab

words = MeCab.Tagger("-Owakati")
nodes = words.parseToNode(main_text)
s = []
while nodes:
    # The feature string starts with the part of speech; "名詞" means "noun"
    if nodes.feature[:2] == "名詞":
        s.append(nodes.surface)
    nodes = nodes.next
To segment the text, pass "-Owakati" to MeCab.Tagger(). The Tagger object mainly accepts the following four arguments.
1. "mecabrc" (no arguments)
2. "-Ochasen" (ChaSen-compatible format)
3. "-Owakati" (word-segmented output) ← this one
4. "-Oyomi" (reading output)
This time I use number 3, wakati (word segmentation). (It is amusing that MeCab's arguments look like Japanese words, though nobody actually shortens wakachigaki to "wakati".)
Next, the Node object returned by (Tagger instance).parseToNode("string") has two properties, .surface and .feature.
surface holds the text of the node, and feature holds [part of speech, part-of-speech subcategory 1, part-of-speech subcategory 2, part-of-speech subcategory 3, conjugation type, conjugation form, base form, reading, pronunciation].
Below is an example program.
feature_example
import MeCab

mecab = MeCab.Tagger()
# "情報メディアシステム局" = "Information Media System Bureau"
nodes = mecab.parseToNode("情報メディアシステム局")
while nodes:
    print(nodes.feature)
    nodes = nodes.next
↓ Execution result (BOS/EOS lines omitted)
名詞,一般,*,*,*,*,情報,ジョウホウ,ジョーホー
名詞,一般,*,*,*,*,メディア,メディア,メディア
名詞,一般,*,*,*,*,システム,システム,システム
名詞,接尾,一般,*,*,*,局,キョク,キョク
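Since feature is just a comma-separated string, it can be split into named fields, which makes checks like "is this a noun?" easier to read. This is a minimal sketch that does not need MeCab installed: the field names are made up for illustration, the sample string is written in the IPADIC-style layout described above ("名詞" means "noun"), and the padding assumes that some entries may come with fewer than nine columns.

```python
from collections import namedtuple

# Hypothetical names for the nine feature columns described above
Feature = namedtuple("Feature", [
    "pos", "pos1", "pos2", "pos3",
    "conj_type", "conj_form", "base", "reading", "pronunciation",
])

def parse_feature(feature):
    """Split a feature string into a Feature, padding missing columns with '*'."""
    parts = feature.split(",")
    parts += ["*"] * (9 - len(parts))
    return Feature(*parts[:9])

def is_noun(feature):
    """True when the part-of-speech column is 名詞 (noun)."""
    return parse_feature(feature).pos == "名詞"

f = parse_feature("名詞,一般,*,*,*,*,情報,ジョウホウ,ジョーホー")
print(f.pos, f.base, f.reading)            # 名詞 情報 ジョウホウ
print(is_noun("動詞,自立,*,*,一段,基本形,見る,ミル,ミル"))  # False
```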
Only nouns need to appear in the figure, so the if statement lets only nouns through and appends their text to an empty list prepared in advance. The finished list is then joined into a single string separated by half-width (single-byte) spaces, and the preparation is finally complete.
s = []
while nodes:
    # Keep nouns only ("名詞" means "noun")
    if nodes.feature[:2] == "名詞":
        s.append(nodes.surface)
    nodes = nodes.next
# Join the nouns into one space-separated string for wordcloud
parsed_main_text = " ".join(s)
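Before drawing anything, it can help to look at which words dominate the space-separated string, since those are the ones that will be drawn largest. The sketch below uses collections.Counter on a short stand-in string; the sample words are made up and stand in for the real parsed_main_text.

```python
from collections import Counter

# Stand-in for the real space-separated noun string
parsed_main_text = "request okay jsys request okay request"

counts = Counter(parsed_main_text.split())
top = counts.most_common(2)
print(top)  # [('request', 3), ('okay', 2)]
```

Frequent words that carry no meaning on their own are natural candidates for a stopword list.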
Now the image can finally be generated.
wc = WordCloud()
Create a WordCloud object, passing the various settings as arguments.
Settings such as width, height, and background_color are self-explanatory. There are many other options, such as collocations, which controls whether two-word phrases are counted (setting it to False keeps the same word from appearing repeatedly), and stopwords, which lists words you do not want to appear, but this time I use only the ones shown here.
The mask that determines the shape of the output image will be described later.
import numpy
from PIL import Image
from wordcloud import WordCloud

# Load the image whose shape the word cloud should follow
mask_jsys = numpy.array(Image.open("jsys.jpeg"))
wc = WordCloud(width=1200, height=800,  # the mask's shape takes precedence when mask is set
               background_color="black",
               collocations=False,
               mask=mask_jsys,
               stopwords={"thing","this","For","It","By the way",
                          "Yo","From","Mr.","but","thing","so"},
               # A font that contains Japanese glyphs (macOS path)
               font_path="/System/Library/Fonts/ヒラギノ角ゴシック W6.ttc")
The mask_jsys line determines the shape of the image. This time I used the image below. The font is a matter of taste; I used Impact.
With this mask, the words of the word cloud are placed only inside the "jsys" lettering of the image.
Pass the parsed_main_text created earlier to wc.generate() to generate the image, then save it with wc.to_file("filename").
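According to the wordcloud documentation, pure white entries in the mask are treated as masked out and every other value is free to draw on. A JPEG may contain near-white pixels that are not exactly 255, which would then count as drawable. The sketch below shows one possible way to force such pixels to pure white with a threshold; the function name binarize_mask and the threshold of 200 are assumptions, and a tiny hand-written array stands in for the real image.

```python
import numpy as np

def binarize_mask(gray, threshold=200):
    """Return a uint8 mask: 255 (masked out) where gray > threshold, else 0 (drawable)."""
    gray = np.asarray(gray)
    return np.where(gray > threshold, 255, 0).astype(np.uint8)

# Stand-in for a grayscale image: bright background, dark lettering
sample = np.array([[255, 250, 30],
                   [240, 10, 0]], dtype=np.uint8)
mask = binarize_mask(sample)
print(mask.tolist())  # [[255, 255, 0], [255, 0, 0]]
```

With a real file, the input could be numpy.array(Image.open("jsys.jpeg").convert("L")), and the result could be passed as mask= when creating the WordCloud object.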
wc.generate(parsed_main_text)
wc.to_file('jsys_wordcloud.png')
It is finally complete. That took a while.
Not bad, is it? (Self-praise.) Some of the words make me wonder "did we really say that?", while others are exactly what I would expect. Personally, I find it interesting that "request" and "okay" came out so large. I am also glad that the group name jsys appeared.