I decided to use wordcloud, so I'm posting this as a memo.
MeCab is used as well; if you're wondering "What is MeCab?", please click [here][1]!
This post covers everything from installing wordcloud to outputting an image.
While I'm at it, here is a quiz made with wordcloud (lol): which story produced this image?
I'll give the answer in the **Conclusion**!
wordcloud is a technique that picks out the words appearing most frequently in a text and displays each one at a size that reflects its frequency.
The official documentation is [here][2].
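The idea can be sketched with the standard library alone. The snippet below is only an illustration, not wordcloud's actual implementation: it counts words with `collections.Counter` and scales a font size linearly by relative frequency. The function name `font_sizes` and the size range are assumptions made up for this example.

```python
import re
from collections import Counter


def font_sizes(text, min_size=10, max_size=100):
    """Map each word to a font size that grows with its frequency (illustrative only)."""
    words = re.findall(r"\w+", text.lower())
    counts = Counter(words)
    if not counts:
        return {}
    top = max(counts.values())
    # The most frequent word gets max_size; less frequent words get smaller sizes.
    return {
        word: min_size + (max_size - min_size) * count / top
        for word, count in counts.items()
    }


sizes = font_sizes("wolf wolf wolf grandmother grandmother basket")
print(sizes)
```

Here "wolf" ends up largest because it appears most often, which is the same effect you see in a word cloud image.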
wordcloud can be used right away after installing it with pip or a similar tool.
pip install wordcloud
An image explains it faster than words, so let's run it right away. The story used here is "Little Red Riding Hood".
import MeCab
from wordcloud import WordCloud
FILE_NAME = "sample.txt"
with open(FILE_NAME, "r", encoding="utf-8") as f:
    CONTENT = f.read()
tagger = MeCab.Tagger("-Owakati")  # output words separated by spaces
parse = tagger.parse(CONTENT)
wordcloud = WordCloud()
wordcloud.generate(parse)  # build the word cloud from the MeCab-segmented text
wordcloud.to_file("wordcloud.png")
wordcloud = WordCloud()
Creates the WordCloud object used for generation and drawing.
wordcloud.generate("string")
Creates the word cloud from a text string.
wordcloud.to_file("file name")
Exports the word cloud to an image file.
The steps above create a wordcloud image.
wordcloud displays frequently used words in a larger size.
However, note that **one-letter words** such as "a" and "I" are not displayed!
The image shows that "grandmother", "Little Red", and "Red Riding" appear often in "Little Red Riding Hood".
You can pass settings to WordCloud, such as the background color and the maximum number of words.
Here are some of the settings you are likely to use most often.
Parameter | Default | Description |
---|---|---|
width | 400 | Image width (pixels) |
height | 200 | Image height (pixels) |
background_color | "black" | Background color |
colormap | None | Color map used for the text |
collocations | True | Whether to include collocations (two-word phrases) |
stopwords | None | Words to exclude (list) |
max_words | 200 | Maximum number of words to display |
regexp | r"\w[\w']+" | Regular expression used to split the text into words |
The previous image is a little small (it was sized for Qiita).
Setting it to 1920 wide by 1080 high, which is also a common desktop resolution, gives the following.
wordcloud = WordCloud(width=1920, height=1080)
The background and text colors make it hard to read...
Specify the background color you want with background_color. There are several color maps available for the text, so pick one with colormap.
This time, the background color is white and the color map for the text is "summer".
wordcloud = WordCloud(background_color="white", colormap="summer")
"Red" appears several times in the image, as in "Red Riding" and "Little Red".
So, try the following setting. It is very convenient because the words in a collocation are then counted as separate words.
wordcloud = WordCloud(background_color="white", colormap="summer", collocations=False)
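To see why "Red" shows up in more than one entry, it can help to count adjacent word pairs by hand. This is a simplified sketch using only the standard library; it counts raw pairs and is not how wordcloud itself decides which pairs to keep. The helper name `count_bigrams` is invented for this example.

```python
import re
from collections import Counter


def count_bigrams(text):
    """Count pairs of adjacent words (simplified; not wordcloud's algorithm)."""
    words = re.findall(r"\w[\w']+", text)
    # zip pairs each word with the word that follows it.
    return Counter(zip(words, words[1:]))


sample = "Little Red Riding Hood met the wolf. Little Red Riding Hood ran."
bigrams = count_bigrams(sample)
print(bigrams.most_common(3))
```

Both ("Little", "Red") and ("Red", "Riding") occur twice in the sample, so a pair-based view naturally produces overlapping entries that share "Red".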
It doesn't make much sense to show words like "the", "and", and "to" in a word cloud.
If you don't want to display such words, pass them as a list as follows. (This time, to make the effect easy to see, ["Little", "grandmother"] are hidden.)
wordcloud = WordCloud(background_color="white", colormap="summer", collocations=False, stopwords=["Little", "grandmother"])
By default, wordcloud outputs up to 200 words. You can change how many words are output as follows.
wordcloud = WordCloud(background_color="white", colormap="summer", collocations=False, stopwords=["Little", "grandmother"], max_words=10)
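What `stopwords` and `max_words` do together can be approximated in plain Python: drop the excluded words, then keep only the N most frequent. This is a rough sketch of the idea rather than the library's code; `top_words` is a hypothetical helper, and the case-insensitive comparison is an assumption of this example.

```python
import re
from collections import Counter


def top_words(text, stopwords=(), max_words=200):
    """Return the max_words most frequent words that are not in stopwords."""
    excluded = {word.lower() for word in stopwords}
    words = [
        word
        for word in re.findall(r"\w[\w']+", text)
        if word.lower() not in excluded
    ]
    return Counter(words).most_common(max_words)


sample = "the wolf and the grandmother and the wolf"
print(top_words(sample, stopwords=["the", "and"], max_words=2))
# → [('wolf', 2), ('grandmother', 1)]
```

Without the stopwords, "the" would take the top spot, which is exactly the kind of noise the setting is meant to remove.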
Looking at this, it seems you could get fairly good data by removing words like "the", "and", and "to".
As mentioned above, wordcloud does not output one-letter words by default. Changing regexp as follows makes it match words of one or more characters.
wordcloud = WordCloud(background_color="white", colormap="summer", collocations=False, stopwords=["the", "and", "to"], max_words=20, regexp=r"[\w']+")
It makes sense that **a** is the most common...
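The difference between the two patterns is easy to check with Python's `re` module on its own. The sketch below compares the default pattern from the settings table with the looser one used in this post; the constant names and the sample sentence are made up for illustration.

```python
import re

DEFAULT_PATTERN = r"\w[\w']+"  # one word character followed by at least one more
LOOSE_PATTERN = r"[\w']+"      # a single character is enough

sample = "I saw a wolf"
default_tokens = re.findall(DEFAULT_PATTERN, sample)
loose_tokens = re.findall(LOOSE_PATTERN, sample)
print(default_tokens)  # → ['saw', 'wolf']
print(loose_tokens)    # → ['I', 'saw', 'a', 'wolf']
```

The default pattern needs at least two characters per match, which is why "I" and "a" drop out until the pattern is relaxed.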
Want to know more? See the [official documentation][2].
If you feed Japanese text into the program above, you get the following image...
This is because the font wordcloud uses by default does not support Japanese.
So you need to set a font that does.
The font setting is as follows.
FONT_FILE = r"C:\Windows\Fonts\MSGOTHIC.TTC"  # raw string so the backslashes are not treated as escapes
wordcloud = WordCloud(font_path=FONT_FILE, background_color="white", colormap="summer", collocations=False, regexp=r"[\w']+")
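Because the font path differs from machine to machine, one option is to check a few candidate paths and use the first one that exists. The sketch below is only a suggestion: `pick_font` is a hypothetical helper, and only the MS Gothic path comes from this post, so add candidates for your own environment.

```python
import os


def pick_font(candidates):
    """Return the first candidate font path that exists, or None if none do."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


# Only this path is from the post; append fonts installed on your machine.
CANDIDATES = [r"C:\Windows\Fonts\MSGOTHIC.TTC"]
font_file = pick_font(CANDIDATES)
print(font_file or "No candidate font found; set font_path manually.")
```

If a path is found, it can be passed as `font_path=font_file` in the same way as FONT_FILE.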
Huh? Why MS Gothic? Because I'm a **former COBOL programmer**! (Those who get it will get it... I think)
With that setting, the output looked like this.
That was a rough summary of wordcloud.
By the way, the answer to the quiz at the beginning is...
**The Three Little Pigs**!
wordcloud draws frequently appearing words in large letters. Looking at the image,
little pig house
these three are the words that appear most often!
Turning text into a word cloud like this can also serve as a quick indicator of what the text is about (˘ω˘)