I had come across WordCloud before, and when I actually tried it, it turned out to be easy, so I am writing it up as an article. You can make something like this.
・ Environment ・ The simplest example ・ Trying a little more ・ Thinking about usage
This time I used a Jetson Nano, so the base environment is as linked, that is, Ubuntu. Normally, you can install WordCloud with:
$ pip3 install wordcloud
However, that gave me errors and the install failed, so I ran it with sudo instead:
$ sudo pip3 install wordcloud
That installed it successfully. For Japanese text, you also need to install MeCab or a similar tool to handle word segmentation and part-of-speech analysis. In addition, I installed a Japanese font as described in Reference ② below. First, download Noto Sans CJK JP from the link, then run:
$ unzip NotoSansCJKjp-hinted.zip
$ mkdir -p ~/.fonts
$ cp *otf ~/.fonts
$ fc-cache -f -v # optional
【reference】 ①amueller/word_cloud ② [Note] Create a Japanese word cloud
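Before pointing WordCloud at the font, it can help to confirm that the .otf files really ended up in ~/.fonts. The following is a minimal sketch using only the standard library; the helper name find_fonts is hypothetical, and the directory is the one created by the commands above.

```python
from pathlib import Path


def find_fonts(font_dir, pattern="*.otf"):
    """Return the sorted paths of font files matching pattern in font_dir."""
    directory = Path(font_dir).expanduser()
    if not directory.is_dir():
        return []
    return sorted(str(p) for p in directory.glob(pattern))


if __name__ == "__main__":
    # ~/.fonts is the directory created by the mkdir command above.
    for path in find_fonts("~/.fonts"):
        print(path)
```

Any of the printed paths can then be passed to WordCloud as font_path.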
Judging from the reference code below, WordCloud seems to place words within a given area, sizing each word according to its frequency and choosing its position and orientation randomly. 【reference】 ・ word_cloud/wordcloud/wordcloud.py The simplest usage code is as follows.
from MeCab import Tagger
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Tagger with the default dictionary
t = Tagger()
text = "On the 25th, Meijo University (Nagoya City) awarded the title of \"Special Honorary Professor\" to Akira Yoshino (72), a professor at the same university who won the Nobel Prize in Chemistry for the development of lithium-ion batteries and an honorary fellow of Asahi Kasei. Mr. Yoshino has been a professor at the Graduate School of Science and Engineering since 2017, and is in charge of lectures once a week. According to Meijo University, the special honorary professor is a title to honor faculty members who have won the Nobel Prize. It was founded in 2014 when Isamu Akasaki, a tenured professor, and Hiroshi Amano, a former professor, won the Nobel Prize in Physics for the development of blue light emitting diodes (LEDs)."
# Keep only the surface form of each token; [:-1] drops MeCab's final EOS line
splitted = " ".join([x.split("\t")[0] for x in t.parse(text).splitlines()[:-1]])
print("1", splitted)
# Pass a Japanese font explicitly, because the default font has no Japanese glyphs
wc = WordCloud(font_path="/home/muauan/.fonts/NotoSansCJKjp-Regular.otf")
wc.generate(splitted)
plt.axis("off")
plt.imshow(wc)
plt.pause(1)
# The ./output_images directory must already exist
plt.savefig('./output_images/yosino0_{}.png'.format(text[0]))
plt.close()
If you change the t = Tagger() line of this code and run it, you get the images in the Word Cloud column of the second table below. The item numbers in the two tables correspond, from the top.
Item number | dictionary | Word-separation & part of speech deletion |
---|---|---|
0 | t = Tagger() | splitted = " ".join([x.split("\t")[0] for x in t.parse(text).splitlines()[:-1]]) |
1 | t = Tagger(" -d " + args.dictionary) | splitted = " ".join([x.split("\t")[0] for x in t.parse(text).splitlines()[:-1]]) |
2 | t = Tagger(" -d " + args.dictionary) | splitted = " ".join([x.split("\t")[0] for x in t.parse(text).splitlines()[:-1] if x.split("\t")[1].split(",")[0] not in ["助詞", "助動詞", "副詞", "連体詞", "動詞"]]) |
Item number | dictionary | Word-separation & part of speech deletion | Word Cloud |
---|---|---|---|
0 | default dictionary | On the 25th, Meijo University (Nagoya City) awarded the title of "Special Honorary Professor" to Asahi Kasei Honorary Fellow Akira Yoshino (72), a professor at the same university who received the Nobel Prize in Chemistry for the development of lithium-ion batteries. Mr. Yoshino has been a professor at the Graduate School of Science and Engineering since 2017, and is in charge of lectures once a week. According to Meijo University, the special honorary professor is a title to honor the faculty members who received the Nobel Prize. It was founded in 2014 when Isamu Akasaki, a tenured professor, and Hiroshi Amano, a former professor, received the Nobel Prize in Physics for the development of blue light emitting diodes (LEDs). | |
1 | neologd | On the 25th, Meijo University (Nagoya City) awarded the title of "Special Honorary Professor" to Asahi Kasei Honorary Fellow Akira Yoshino (72), a professor at the same university who received the Nobel Prize in Chemistry for the development of lithium-ion batteries. Mr. Yoshino has been a professor at the Graduate School of Science and Engineering since 2017, and is in charge of lectures once a week. According to Meijo University, the special honorary professor is a title to honor the faculty members who received the Nobel Prize. It was founded in 2014 when Isamu Akasaki, a tenured professor, and Hiroshi Amano, a former professor, received the Nobel Prize in Physics for the development of blue light emitting diodes (LEDs). | |
2 | neologd + delete particles, auxiliary verbs, etc. | Meijo University (Nagoya City) Received the Nobel Prize in Chemistry for Lithium Ion Battery Development on the 25th. Professor of the same university Asahi Kasei Honorary Fellow Akira Yoshino (72) Awarded the title of "Special Honorary Professor". Mr. Yoshino, Professor, Graduate School of Science and Engineering, 2017, lecture once a week. Meijo University, Special Honorary Professor Nobel Prize Winner Title for faculty member. Tenured professor Isamu Akasaki Former professor Hiroshi Amano, Blue light emitting diode (LED) development Nobel Prize in Physics was awarded. | |
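The one-liners in the first table rely on MeCab's default output, where each line is a surface form, a tab, and a comma-separated feature list whose first field is the part of speech, followed by a final EOS line. The sketch below reproduces that logic on a hand-written sample string so that it runs without MeCab installed. The sample lines assume an IPA-dictionary-style feature layout, in which part-of-speech names are Japanese strings such as 助詞 (particle); SAMPLE_PARSE, EXCLUDED_POS and surfaces are hypothetical names.

```python
# Sample of MeCab's default output: "surface<TAB>comma-separated features",
# terminated by an "EOS" line. The feature values assume an
# IPA-dictionary-style layout and were written by hand for illustration.
SAMPLE_PARSE = (
    "名城大学\t名詞,固有名詞,組織,*,*,*,名城大学,メイジョウダイガク,メイジョーダイガク\n"
    "は\t助詞,係助詞,*,*,*,*,は,ハ,ワ\n"
    "称号\t名詞,一般,*,*,*,*,称号,ショウゴウ,ショーゴー\n"
    "を\t助詞,格助詞,一般,*,*,*,を,ヲ,ヲ\n"
    "授与\t名詞,サ変接続,*,*,*,*,授与,ジュヨ,ジュヨ\n"
    "し\t動詞,自立,*,*,サ変・スル,連用形,する,シ,シ\n"
    "た\t助動詞,*,*,*,特殊・タ,基本形,た,タ,タ\n"
    "EOS\n"
)

# Top-level part-of-speech tags to drop: particle, auxiliary verb, verb.
EXCLUDED_POS = {"助詞", "助動詞", "動詞"}


def surfaces(parsed, excluded=frozenset()):
    """Return surface forms, skipping tokens whose first feature is excluded."""
    kept = []
    for row in parsed.splitlines()[:-1]:  # [:-1] drops the trailing EOS line
        surface, features = row.split("\t")
        if features.split(",")[0] not in excluded:
            kept.append(surface)
    return " ".join(kept)


print(surfaces(SAMPLE_PARSE))                # 名城大学 は 称号 を 授与 し た
print(surfaces(SAMPLE_PARSE, EXCLUDED_POS))  # 名城大学 称号 授与
```

The first call corresponds to items 0 and 1 in the table, the second to item 2.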
For now, the code that generates a word cloud from text typed at the keyboard is here: WordCloud/wc_input_original.py
Above, tokens with unneeded parts of speech were removed using the part-of-speech tags that MeCab assigns. Even so, that is not enough to leave only the words that characterize the text, and I wanted to cut the words down further for a more striking display. So I introduce stop words to delete specific words and strings. 【reference】 ③ Review tendency of highly rated ramen shops in TF-IDF (natural language processing, TF-IDF, Mecab, wordcloud, morphological analysis, word-separation) ④ [slothlib --Revision 77: /CSharp/Version1/SlothLib/NLP/Filter/StopWord/word Japanese.txt](http://svn.sourceforge.jp/svnroot/slothlib/CSharp/Version1/SlothLib/NLP/Filter/StopWord/word/) Put the Japanese.txt above in the working directory and add any further strings you want removed to that file; the exclude_stopword() function below then deletes them.
# Read the stop-word file (one word per line)
stop_words = []
if args.stop_words:
    with open(args.stop_words, "r", encoding="utf-8") as f:
        for line in f:
            stop_words.append(line.strip())
print(stop_words)

# Convert a list of tokens to a space-separated string
def join_list_str(tokens):
    return ' '.join(tokens)

# Remove stop words from a space-separated string
def exclude_stopword(text):
    changed_text = [token for token in text.lower().split(" ") if token != "" and token not in stop_words]
    # The result above is a list, so convert it back to a space-separated string
    changed_text = join_list_str(changed_text)
    return changed_text
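To try the stop-word step on its own, without argparse or a file on disk, something like the following sketch may be useful. It keeps the same idea as the code above but stores the words in a set; load_stop_words and remove_stop_words are hypothetical names, and the sample words are made up for illustration.

```python
def load_stop_words(lines):
    """Build a set of stop words from an iterable of text lines."""
    return {line.strip() for line in lines if line.strip()}


def remove_stop_words(text, stop_words):
    """Drop stop words from a space-separated string, keeping word order."""
    return " ".join(
        token for token in text.lower().split() if token not in stop_words
    )


# Stand-in for the lines of a stop-word file; one word per line.
stop_words = load_stop_words(["こと\n", "ため\n", "\n", "the\n"])
print(remove_stop_words("The 教授 の こと", stop_words))  # 教授 の
```

Because the text is lowercased before the comparison, Latin-alphabet entries in the stop-word file only match if they are written in lower case. A set is used here so that each membership test does not scan the whole list.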
The function that generates the word cloud is shown below. It is based on Reference ⑤ below, together with References ③ and ①. 【reference】 ⑤ Create Wordcloud with masked image
In brief:
・ The Japanese font is defined first.
・ The argument sk is used to tell the output file names apart.
・ The argument imgpath is the file path of the mask image when the mask function is used.
・ When a mask is used, the first if branch runs.
・ When no mask is used, the else branch runs. Here the WordCloud arguments are left almost at their defaults and can be redefined (each item is explained in the bonus section).
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud, ImageColorGenerator

fpath = "/home/muauan/.fonts/NotoSansCJKjp-Regular.otf"

def get_wordcrowd_color_mask(sk, text, imgpath):
    plt.figure(figsize=(6, 6), dpi=200)
    if imgpath != "":
        # With a mask, the mask's shape is used instead of width/height
        img_color = np.array(Image.open(imgpath))
        image_colors = ImageColorGenerator(img_color)
        wc = WordCloud(width=400,
                       height=300,
                       font_path=fpath,
                       mask=img_color,
                       collocations=False,  # Do not add bigrams, which would duplicate words
                       ).generate(text)
        plt.imshow(wc.recolor(color_func=image_colors),  # Use the colors of the original image
                   interpolation="bilinear")
    else:
        #wc = WordCloud(font_path=fpath, regexp="[\w']+").generate( text )
        wc = WordCloud(font_path=fpath, width=400, height=200, margin=2,
                       ranks_only=None, prefer_horizontal=.9, mask=None, scale=1,
                       color_func=None, max_words=200, min_font_size=4,
                       stopwords=None, random_state=None, background_color='black',
                       max_font_size=None, font_step=1, mode="RGB",
                       relative_scaling='auto', regexp=r"\w[\w']+", collocations=True,
                       colormap=None, normalize_plurals=True, contour_width=0,
                       contour_color='black', repeat=False,
                       include_numbers=False, min_word_length=0).generate(text)
        plt.imshow(wc)
    # show
    plt.axis("off")
    plt.tight_layout()
    plt.pause(1)
    # The ./output_images directory must already exist
    plt.savefig('./output_images/{}-yosino_{}.png'.format(sk, text[0]))
    plt.close()
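One practical point about the function above: plt.savefig does not create a missing directory, and text[0] raises an IndexError when filtering leaves an empty string. A small helper along the following lines could guard against both. This is only a sketch; output_path and its out_dir parameter are hypothetical and not part of the original script.

```python
import os


def output_path(sk, text, out_dir="./output_images"):
    """Create out_dir if needed and return the PNG path for this run."""
    os.makedirs(out_dir, exist_ok=True)
    # Same naming scheme as the function above: "<sk>-yosino_<first char>.png".
    # Fall back to a fixed label when the filtered text is empty.
    first_char = text[0] if text else "empty"
    return os.path.join(out_dir, "{}-yosino_{}.png".format(sk, first_char))
```

The returned path could then be handed to plt.savefig in place of the formatted string. Note that the first character still ends up in the file name, so a text that starts with a path separator would need extra handling.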
Generate the cloud with the following code, which uses the functions above. The text to turn into a word cloud is pasted at the prompt and read into line. I ran the processing in three stages to see the effect of stop words and so on. Only the last stage is shown here; the others are in bonus 2, so the effect is clear when you compare them.
while True:
    line = input("> ")
    if not line:
        break
    # Drop particles, auxiliary verbs, adverbs, adnominals, conjunctions, verbs and symbols.
    # The IPA-based dictionaries report part-of-speech names in Japanese, so compare against Japanese strings.
    splitted = " ".join([x.split("\t")[0] for x in t.parse(line).splitlines()[:-1] if x.split("\t")[1].split(",")[0] not in ["助詞", "助動詞", "副詞", "連体詞", "接続詞", "動詞", "記号"]])
    splitted = exclude_stopword(splitted)
    print("2", splitted)
    get_wordcrowd_color_mask(4, splitted, '')
    get_wordcrowd_color_mask(5, splitted, './mask_images/alice_color.png')
・ WordCloud was able to show the gist of the text.
・ The quality of that gist changes with the filtering by stop words and part of speech.
・ With a mask, the cloud can be confined to a given region.
・ Even the Jetson Nano generates the image in a short time.
・ I want to think about effective use cases and services, for example ones that use real-time output.
$ python3 wc_input_original.py -d /usr/lib/aarch64-linux-gnu/mecab/dic/mecab-ipadic-neologd -s japanese.txt
Japanese.txt
['over there', 'Per', 'there', 'Over there', 'after', 'hole', 'holeた', 'that', 'How many', 'When', 'Now', 'Disagreeable', 'various', 'home', 'Roughly', 'You', 'I', 'O', 'Gai', 'Draw', 'Shape', 'Wonder', 'Kayano', 'From', 'Gara', 'Came', 'Habit', 'here', 'here', 'thing', 'Every', 'Here', 'Messed up', 'this', 'thisら', 'Around', 'Various', 'Relief', 'Mr.', 'How', 'Try', 'Suka', 'One by one', 'Shin', 'all', 'All', 'so', 'There', 'there', 'Over there', 'Sleeve', 'It', 'Itぞれ', 'Itなり', 'たくMr.', 'Etc.', 'Every time', 'For', 'No good', 'Cha', 'Chaん', 'Ten', 'とOり', 'When', 'Where', 'Whereか', 'By the way', 'Which', 'Somewhere', 'Which', 'which one', 'Inside', 'Insideば', 'Without', 'What', 'Such', 'What', 'Whatか', 'To', 'of', 'Begin', 'Should be', 'Haruka', 'People', 'Peopleつ', 'Clothes', 'Yellowtail', 'Betsu', 'Strange', 'Pen', 'How', 'Other', 'Masa', 'Better', 'Decent', 'As it is', 'want to see', 'Three', 'みなMr.', 'Everyone', 'Originally', 'もof', 'gate', 'Guy', 'Yo', 'Outside', 'reason', 'I', 'Yes', 'Up', 'During ~', 'under', 'Character', 'Year', 'Month', 'Day', 'Time', 'Minutes', 'Seconds', 'week', 'fire', 'water', 'wood', 'Money', 'soil', 'Country', 'Tokyo', 'road', 'Fu', 'Prefecture', 'city', 'Ward', 'town', 'village', 'each', 'No.', 'One', 'what', 'Target', 'Every time', 'Sentence', 'Person', 'sex', 'body', 'Man', 'other', 'now', 'Department', 'Division', 'Person in charge', 'Outside', 'Kind', 'Tatsu', 'Qi', 'Room', 'mouth', 'Who', 'for', 'Kingdom', 'Meeting', 'neck', 'Man', 'woman', 'Another', 'Talk', 'I', 'Shop', 'shop', 'House', 'Place', 'etc', 'You see', 'When', 'View', 'Step', 'Abbreviation', 'Example', 'system', 'Theory', 'form', 'while', 'Ground', 'Member', 'line', 'point', 'book', 'Goods', 'Power', 'Law', 'Feeling', 'Written', 'Former', 'hand', 'number', 'he', 'hewoman', 'Child', 'Inside', 'easy', 'Joy', 'Angry', 'Sorrow', 'ring', 'Around', 'To', 'Border', 'me', 'guy', 'High', 'school', 'Woman', 'Shin', 'Ki', 'magazine', 'Re', 'line', 'Column', 'Thing', 'Shi', 
'Stand', 'Collection', 'Mr', 'Place', 'History', 'vessel', 'Name', 'Emotion', 'Communicating', 'every', 'formula', 'Book', 'Times', 'Animal', 'Pieces', 'seat', 'bundle', 'age', 'Eye', 'Connoisseur', 'surface', 'Circle', 'ball', 'Sheet', 'Before', 'rear', 'left', 'right', 'Next', 'Ahead', 'spring', 'summer', 'autumn', 'winter', 'one', 'two', 'three', 'four', 'Five', 'Six', 'Seven', 'Eight', 'Nine', 'Ten', 'hundred', 'thousand', 'Ten thousand', 'Billion', 'Trillion', 'under記', 'Up記', 'Timewhile', 'nowTimes', 'BeforeTimes', 'Place合', 'oneつ', 'Year生', '自Minutes', 'ヶPlace', 'ヵPlace', 'カPlace', '箇Place', 'ヶMonth', 'ヵMonth', 'カMonth', '箇Month', 'NameBefore', 'For real', 'Certainly', 'Timepoint', '全Department', '関Person in charge', 'near', 'OneLaw', 'we', 'the difference', 'Many', 'Treatment', 'new', 'そofrear', 'middle', 'After all', 'Mr々', '以Before', '以rear', 'Or later', 'Less than', '以Up', '以under', 'how many', 'everyDay', '自body', 'Over there', 'whatMan', 'handStep', 'the same', 'Feelingじ']
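The run shown above passes a dictionary with -d and a stop-word file with -s, and the code refers to them as args.dictionary and args.stop_words. A minimal argparse sketch that would produce those attributes might look like this; the long option names, the help texts, and the helper names build_parser and tagger_options are assumptions inferred from the attribute names, not taken from the original script.

```python
import argparse


def build_parser():
    """Parser for the -d and -s options used in the command above."""
    parser = argparse.ArgumentParser(description="WordCloud from keyboard input")
    # Long option names are inferred from args.dictionary / args.stop_words.
    parser.add_argument("-d", "--dictionary", default=None,
                        help="path to a MeCab dictionary directory")
    parser.add_argument("-s", "--stop_words", default=None,
                        help="path to a stop-word file, one word per line")
    return parser


def tagger_options(dictionary):
    """Return the option string to pass to MeCab's Tagger."""
    return "-d " + dictionary if dictionary else ""


args = build_parser().parse_args(
    ["-d", "/usr/lib/aarch64-linux-gnu/mecab/dic/mecab-ipadic-neologd",
     "-s", "japanese.txt"]
)
print(tagger_options(args.dictionary))
print(args.stop_words)
```

With no -d option, tagger_options returns an empty string, which corresponds to Tagger() with the default dictionary.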
input.
>On the 25th, Meijo University (Nagoya City) awarded the title of "Special Honorary Professor" to Akira Yoshino (72), a professor at the same university who won the Nobel Prize in Chemistry for the development of lithium-ion batteries and an honorary fellow of Asahi Kasei. Mr. Yoshino has been a professor at the Graduate School of Science and Engineering since 2017, and is in charge of lectures once a week. According to Meijo University, the special honorary professor is a title to honor faculty members who have won the Nobel Prize. It was founded in 2014 when Isamu Akasaki, a tenured professor, and Hiroshi Amano, a former professor, won the Nobel Prize in Physics for the development of blue light emitting diodes (LEDs).
splitted = " ".join([x.split("\t")[0] for x in t.parse(line).splitlines()[:-1] if x.split("\t")[1].split(",")[0] not in [""]])
print("0",splitted)
get_wordcrowd_color_mask(0,splitted, '')
get_wordcrowd_color_mask(1,splitted, './mask_images/alice_color.png')
Output 0.
0 On the 25th, Meijo University (Nagoya City) awarded the title of "Special Honorary Professor" to Asahi Kasei Honorary Fellow Akira Yoshino (72), a professor at the same university who received the Nobel Prize in Chemistry for the development of lithium-ion batteries. Mr. Yoshino has been a professor at the Graduate School of Science and Engineering since 2017, and is in charge of lectures once a week. According to Meijo University, the special honorary professor is a title to honor the faculty members who received the Nobel Prize. It was founded in 2014 when Isamu Akasaki, a tenured professor, and Hiroshi Amano, a former professor, received the Nobel Prize in Physics for the development of blue light emitting diodes (LEDs).
# Part-of-speech names are compared as Japanese strings (particle, auxiliary verb, adverb, adnominal, conjunction, verb, symbol)
splitted = " ".join([x.split("\t")[0] for x in t.parse(line).splitlines()[:-1] if x.split("\t")[1].split(",")[0] not in ["助詞", "助動詞", "副詞", "連体詞", "接続詞", "動詞", "記号"]])
print("1", splitted)
get_wordcrowd_color_mask(2, splitted, '')
get_wordcrowd_color_mask(3, splitted, './mask_images/alice_color.png')
Output 1.
1 Meijo University Nagoya City 25th Lithium Ion Battery Development Nobel Chemistry Award Professor of the same university Professor Asahi Kasei Honorary Fellow Akira Yoshino 72 Special Honorary Professor Akira Yoshino 2017 Graduate School of Science and Engineering Professor Weekly Lecture Meijo University Special Honorary Professor Nobel Award for faculty title 14 years Lifelong professor Isamu Akasaki Former professor Hiroshi Amano Blue light emitting diode LED development Nobel Physics Award Established
Output 2.
2 Meijo University Nagoya City 25th Lithium Ion Battery Development Nobel Chemistry Award Professor of the same university Professor Asahi Kasei Honorary Fellow Akira Yoshino 72 Special Honorary Professor Akira Yoshino 2017 Graduate School of Science and Engineering Professor 1st Lecture Meijo University Special Honorary Professor Nobel Award Title 14 years Lifelong Professor Isamu Akasaki Professor Hiroshi Amano Blue light emitting diode led development Nobel Physics Award Awarded founding
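To compare the outputs above in numbers rather than by eye, the space-separated string can be counted directly. The sketch below uses only collections.Counter; top_words is a hypothetical helper and the sample string is made up for illustration.

```python
from collections import Counter


def top_words(splitted, n=5):
    """Return the n most frequent tokens of a space-separated string."""
    return Counter(splitted.split()).most_common(n)


sample = "教授 ノーベル 賞 名城大学 教授 特別栄誉教授 ノーベル 賞 教授"
for word, count in top_words(sample, 3):
    print(word, count)
```

Words that rank high here are the ones that should appear largest in the cloud. As far as I understand, a dictionary built from such counts can also be passed to WordCloud's generate_from_frequencies, which is mentioned in the argument list below.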
WordCloud argument list (reference; from the explanation in the code below) ・ Word_cloud / wordcloud / wordcloud.py
Parameters | |
---|---|
font_path : string | Font path to the font that will be used (OTF or TTF). Defaults to DroidSansMono path on a Linux machine. If you are on another OS or don't have this font, you need to adjust this path. |
width : int (default=400) | Width of the canvas. |
height : int (default=200) | Height of the canvas. |
prefer_horizontal : float (default=0.90) | The ratio of times to try horizontal fitting as opposed to vertical. If prefer_horizontal < 1, the algorithm will try rotating the word if it doesn't fit. (There is currently no built-in way to get only vertical words.) |
mask : nd-array or None (default=None) | If not None, gives a binary mask on where to draw words. If mask is not None, width and height will be ignored and the shape of mask will be used instead. All white (#FF or #FFFFFF) entries will be considered "masked out" while other entries will be free to draw on. [This changed in the most recent version!] |
contour_width: float (default=0) | If mask is not None and contour_width > 0, draw the mask contour. |
contour_color: color value (default="black") | Mask contour color. |
scale : float (default=1) | Scaling between computation and drawing. For large word-cloud images, using scale instead of larger canvas size is significantly faster, but might lead to a coarser fit for the words. |
min_font_size : int (default=4) | Smallest font size to use. Will stop when there is no more room in this size. |
font_step : int (default=1) | Step size for the font. font_step > 1 might speed up computation but give a worse fit. |
max_words : number (default=200) | The maximum number of words. |
stopwords : set of strings or None | The words that will be eliminated. If None, the built-in STOPWORDS list will be used. Ignored if using generate_from_frequencies. |
background_color : color value (default="black") | Background color for the word cloud image. |
max_font_size : int or None (default=None) | Maximum font size for the largest word. If None, height of the image is used. |
mode : string (default="RGB") | Transparent background will be generated when mode is "RGBA" and background_color is None. |
relative_scaling : float (default='auto') | Importance of relative word frequencies for font-size. With relative_scaling=0, only word-ranks are considered. With relative_scaling=1, a word that is twice as frequent will have twice the size. If you want to consider the word frequencies and not only their rank, relative_scaling around .5 often looks good. If 'auto' it will be set to 0.5 unless repeat is true, in which case it will be set to 0. ..versionchanged: 2.0 Default is now 'auto'. |
color_func : callable, default=None | Callable with parameters word, font_size, position, orientation, font_path, random_state that returns a PIL color for each word. Overwrites "colormap". See colormap for specifying a matplotlib colormap instead. To create a word cloud with a single color, use color_func=lambda *args, **kwargs: "white" . The single color can also be specified using RGB code. For example color_func=lambda *args, **kwargs: (255,0,0) sets color to red. |
regexp : string or None (optional) | Regular expression to split the input text into tokens in process_text. If None is specified, r"\w[\w']+" is used. Ignored if using generate_from_frequencies. |
collocations : bool, default=True | Whether to include collocations (bigrams) of two words. Ignored if using generate_from_frequencies. .. versionadded: 2.0 |
colormap : string or matplotlib colormap, default="viridis" | Matplotlib colormap to randomly draw colors from for each word. Ignored if "color_func" is specified. .. versionadded: 2.0 |
normalize_plurals : bool, default=True | Whether to remove trailing 's' from words. If True and a word appears with and without a trailing 's', the one with trailing 's' is removed and its counts are added to the version without trailing 's' -- unless the word ends with 'ss'. Ignored if using generate_from_frequencies. |
repeat : bool, default=False | Whether to repeat words and phrases until max_words or min_font_size is reached. |
include_numbers : bool, default=False | Whether to include numbers as phrases or not. |
min_word_length : int, default=0 | Minimum number of letters a word must have to be included. |
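The relative_scaling row can be read as a rule for how much the font shrinks from one word to the next, less frequent one. The following is a rough sketch of that rule as I understand it from the description in the table; it is an illustration under that assumption, not the library's own code, and next_font_size is a hypothetical name.

```python
def next_font_size(font_size, freq, last_freq, relative_scaling=0.5):
    """Shrink font_size for a word of frequency freq that follows one of last_freq.

    relative_scaling=1 scales the size by the frequency ratio;
    relative_scaling=0 ignores the frequencies and keeps the size unchanged,
    so only the rank order (and the remaining space) matters.
    """
    ratio = freq / float(last_freq)
    return int(round((relative_scaling * ratio + (1 - relative_scaling)) * font_size))


print(next_font_size(100, 5, 10, relative_scaling=1))    # 50
print(next_font_size(100, 5, 10, relative_scaling=0.5))  # 75
print(next_font_size(100, 5, 10, relative_scaling=0))    # 100
```

With relative_scaling=1, a word that is half as frequent gets half the size, which matches the "twice as frequent, twice the size" wording in the table.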