Introduction

I recently started using wordcloud, so I'm posting this as a memo.

This article uses MeCab, so if you're wondering "What is MeCab?", please click [here] 1!

It covers everything from installing wordcloud to outputting an image.

The sections are as follows:

What story is this?
What is wordcloud
Trying it out
Various settings
Common mistakes in Japanese
Conclusion

What story is this?

While we're at it, here is a quiz made with wordcloud: which story produced this word cloud? (lol)

The answer is in the **Conclusion**!

What is wordcloud

A method that picks out the words appearing most frequently in a text and displays each one at a size that reflects its frequency.
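To make the idea concrete, here is a minimal sketch of "size according to frequency" using only the standard library. The function name `word_sizes`, the tokenizer, and the linear scaling are assumptions for illustration; this is not how the wordcloud library computes its layout.

```python
import re
from collections import Counter


def word_sizes(text, min_size=10, max_size=60):
    """Map each word to a font size that grows linearly with its frequency."""
    # Hypothetical tokenizer: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    if not counts:
        return {}
    top = max(counts.values())
    # The most frequent word gets max_size; rarer words get proportionally less.
    return {
        word: min_size + (max_size - min_size) * count / top
        for word, count in counts.items()
    }


sizes = word_sizes("red hood red wolf red hood")
print(sizes)
```

Here "red" appears most often, so it receives the largest size, followed by "hood" and then "wolf".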

The official documentation is [here] 2

You can start using it right away after installing it with pip or a similar tool.

pip install wordcloud

Trying it out

I think it is quicker to explain with images, so let's run it right away. The story used here is "Little Red Riding Hood".

program

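import MeCab
from wordcloud import WordCloud

FILE_NAME = "sample.txt"

# Read the whole story as a single string
with open(FILE_NAME, "r", encoding="utf-8") as f:
    CONTENT = f.read()

# Split the text into space-separated words (wakati-gaki)
tagger = MeCab.Tagger("-Owakati")
parse = tagger.parse(CONTENT)

# Build the word cloud from the split text and save it as an image
wordcloud = WordCloud()
wordcloud.generate(parse)
wordcloud.to_file("wordcloud.png")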

wordcloud = WordCloud()

Creates the WordCloud object used for generating and drawing.

wordcloud.generate("string")

Creates the word cloud from text (a string).

wordcloud.to_file("image file name")

Exports it to an image file.

These steps are all it takes to create a wordcloud image.

image

wordcloud displays frequently used words in a larger size.

However, note that **one-letter words** such as "a" and "I" are not displayed!

You can see that grandmother, Little Red, and Red Riding appear often in "Little Red Riding Hood".

Various settings

You can pass settings to WordCloud(), such as the background color and the maximum number of words.

Here are some of the settings you are likely to use most often.

Parameter	Default	Description
width	400	Image width (pixels)
height	200	Image height (pixels)
background_color	"black"	Background color
colormap	None	Color map used for the text
collocations	True	Whether to include collocations (two-word phrases)
stopwords	None	Words to exclude (list); if None, the built-in English stop-word list is used
max_words	200	Maximum number of words to display
regexp	r"\w[\w']+"	Regular expression used to pick out the words to display

I want to change the size of the image

The previous image is a little small (because it was made for Qiita).

Setting it to 1920 wide by 1080 high, a common desktop resolution, gives the following.

wordcloud = WordCloud(width=1920, height=1080)

I want to change the color

The background and text colors are hard to see...

Specify the background color you want with background_color. There are several color maps for the text, so choose one with colormap.

This time, the background color is white and the color map for the text is summer.

wordcloud = WordCloud(background_color="white", colormap="summer")

I want to split collocations like Red Riding

"Red" shows up on screen several times, as in Red Riding and Little Red.

So try the following setting. It is very convenient because collocations are then counted as separate words.

wordcloud = WordCloud(background_color="white", colormap="summer", collocations=False)
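As a rough illustration of what a collocation is, the sketch below counts adjacent word pairs with the standard library. The function name `count_bigrams` is hypothetical, and raw pair counting is a simplification: the wordcloud library applies its own scoring to decide which pairs to show, and this sketch also ignores sentence boundaries.

```python
import re
from collections import Counter


def count_bigrams(text):
    """Count adjacent word pairs, a rough stand-in for collocations."""
    # Same pattern as the regexp default in the settings table.
    words = re.findall(r"\w[\w']+", text)
    # Pair each word with the one that follows it.
    return Counter(zip(words, words[1:]))


text = "Little Red Riding Hood met the wolf. Little Red Riding Hood ran."
bigrams = count_bigrams(text)
print(bigrams.most_common(3))
```

Pairs such as ("Little", "Red") and ("Red", "Riding") each occur twice here, which is why "Red" can end up on screen more than once when collocations are enabled.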

I don't want to display certain words

It doesn't make much sense to show words like "the, and, to" in a word cloud.

If you do not want to display such words, pass them as a list as follows. (This time, to make the effect easy to see, ["Little", "grandmother"] are hidden.)

wordcloud = WordCloud(background_color="white", colormap="summer", collocations=False, stopwords=["Little", "grandmother"])
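The effect of a stop-word list can be sketched without the library. The helper name `count_without_stopwords` is hypothetical, and the case-insensitive comparison is an assumption made for this sketch.

```python
import re
from collections import Counter


def count_without_stopwords(text, stopwords):
    """Count words, skipping any that appear in stopwords (case-insensitive)."""
    blocked = {word.lower() for word in stopwords}
    words = re.findall(r"\w[\w']+", text)
    return Counter(word for word in words if word.lower() not in blocked)


counts = count_without_stopwords(
    "Little Red Riding Hood visited her grandmother",
    stopwords=["Little", "grandmother"],
)
print(counts)
```

The excluded words never reach the counter, so they can never be drawn, no matter how often they appear.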

I want to limit the number of words displayed

By default, wordcloud outputs up to 200 words. You can set how many words to output as follows.

wordcloud = WordCloud(background_color="white", colormap="summer", collocations=False, stopwords=["Little", "grandmother"], max_words=10)

Looking at this, it seems you would get more meaningful data by removing words like [the, and, to]?
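Limiting the output to the N most frequent words can be sketched with `Counter.most_common`. The function name `top_words` and the sample sentence are made up for illustration, and no stop words are removed here.

```python
import re
from collections import Counter


def top_words(text, max_words):
    """Return the max_words most frequent words with their counts."""
    words = re.findall(r"\w[\w']+", text.lower())
    return Counter(words).most_common(max_words)


sample = "the pig and the wolf and the house of the pig"
print(top_words(sample, 2))
```

With so few slots available, a function word like "the" takes the top spot, which matches the impression above that removing such words would give a more useful picture.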

I want to display even one-letter words

As mentioned above, wordcloud does not output one-letter words by default. By overriding regexp, you can include words of one or more letters.

wordcloud = WordCloud(background_color="white", colormap="summer", collocations=False, stopwords=["the", "and", "to"], max_words=20, regexp=r"[\w']+")

It makes sense that **a** is the most common...
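The difference between the two patterns is easy to check with the standard `re` module. The constant names and the sample sentence below are made up for illustration; the patterns themselves are the ones from the settings table and from this section.

```python
import re

# Pattern from the settings table: a word character followed by at least one more.
DEFAULT_PATTERN = r"\w[\w']+"
# Pattern used in this section: one or more characters.
ONE_LETTER_PATTERN = r"[\w']+"

text = "I saw a wolf"
print(re.findall(DEFAULT_PATTERN, text))     # ['saw', 'wolf']
print(re.findall(ONE_LETTER_PATTERN, text))  # ['I', 'saw', 'a', 'wolf']
```

The first pattern needs at least two characters to match, so "I" and "a" are dropped; the second one keeps them.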

Want to know more? See the [Official] 2 documentation.

Common mistakes in Japanese

If you run the above program on Japanese text, you get the following image...

This is because the default font used by wordcloud does not support Japanese.

The fix is to specify a font yourself.

The font setting looks like this.

FONT_FILE = r"C:\Windows\Fonts\MSGOTHIC.TTC"
wordcloud = WordCloud(font_path=FONT_FILE, background_color="white", colormap="summer", collocations=False, regexp=r"[\w']+")

Huh? Why MS Gothic? Because I'm a **former COBOL programmer**! (Those who get it will get it... I think)

Any font can be used, so please choose the font you like best (^-^)
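Because font locations differ from machine to machine, one option is to try several candidate paths and use the first that exists. This is only a sketch: the helper `first_existing_font` is hypothetical, and the second path in the list is an assumed example of where a Japanese-capable font might live on Linux.

```python
from pathlib import Path


def first_existing_font(candidates):
    """Return the first path in candidates that exists as a file, or None."""
    for candidate in candidates:
        if Path(candidate).is_file():
            return str(candidate)
    return None


# Hypothetical candidate list; adjust it to the fonts installed on your machine.
FONT_CANDIDATES = [
    r"C:\Windows\Fonts\MSGOTHIC.TTC",
    "/usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc",
]
font_file = first_existing_font(FONT_CANDIDATES)
print(font_file)
```

If `font_file` is not None, it can be passed as `font_path` to WordCloud in the same way as FONT_FILE above.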

With the font set, the output looks like this.

Conclusion

That was a rough summary of wordcloud.

By the way, the answer to the quiz at the beginning is...

**The Three Little Pigs**!

wordcloud shows frequently used words in large letters. Looking at the image,

little pig house

are the three words that appear most often!

Turning text into a word cloud like this also gives you a quick indicator of what the text is about (˘ω˘)

I played with wordcloud!