Visualize the frequency of word occurrences in sentences with Word Cloud. [Python]

I would like to visualize how often words occur in a text using word_cloud, a Python library created by amueller.

The output looks like this. constitution-compressor.png

A description of this library can be found here. http://amueller.github.io/word_cloud/index.html

1. Installation of various libraries

1-1. Installing the word_cloud library

You can install it by cloning the source code from GitHub and running setup.py. The package is also published on PyPI, so pip install wordcloud is an alternative.

git clone https://github.com/amueller/word_cloud
cd word_cloud
python setup.py install

1-2. Installing the other Python libraries

Unlike English, Japanese does not separate words with spaces, so we use MeCab, a morphological analyzer, to split the text into words. Installing MeCab is explained in this article, so refer to it for the setup: http://qiita.com/kenmatsu4/items/02034e5688cc186f224b

The following libraries are also required, so install them as well.

pip install beautifulsoup4
pip install requests

2. Creating Word Cloud

Now that everything is ready, let's write the code. First, import the required libraries.

#Library import
%matplotlib inline
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from bs4 import BeautifulSoup
import requests
import MeCab as mc

The following function uses MeCab to split the text into words and collect them in a list. To keep only words that are likely to be meaningful, it retains nouns, verbs, adjectives, and adverbs and drops every other part of speech.

def mecab_analysis(text):
    t = mc.Tagger('-Ochasen -d /usr/local/Cellar/mecab/0.996/lib/mecab/dic/mecab-ipadic-neologd/')
    enc_text = text.encode('utf-8') 
    node = t.parseToNode(enc_text) 
    output = []
    while(node):
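        if node.surface != "":  #Skip the empty BOS/EOS nodes at the start and end
            word_type = node.feature.split(",")[0]
            #IPA-format MeCab dictionaries report the part of speech in Japanese:
            #adjective, verb, noun, adverb
            if word_type in ["形容詞", "動詞", "名詞", "副詞"]: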
                output.append(node.surface)
        node = node.next
        if node is None:
            break
    return output
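
The part-of-speech filter can be tried out without a MeCab installation. The sketch below is only an illustration, assuming the IPA dictionary feature format in which the first comma-separated field is the part of speech. The (surface, feature) pairs are hand-written samples rather than captured MeCab output, and the names SAMPLE_NODES, KEEP_POS, and filter_by_pos are hypothetical. It is written for Python 3.

```python
# Hand-written (surface, feature) pairs imitating the IPA dictionary format.
# They are illustrative assumptions, not real MeCab output.
SAMPLE_NODES = [
    ("", "BOS/EOS,*,*,*,*,*,*,*,*"),
    ("猫", "名詞,一般,*,*,*,*,猫,ネコ,ネコ"),
    ("が", "助詞,格助詞,一般,*,*,*,が,ガ,ガ"),
    ("速く", "形容詞,自立,*,*,形容詞・アウオ段,連用テ接続,速い,ハヤク,ハヤク"),
    ("走る", "動詞,自立,*,*,五段・ラ行,基本形,走る,ハシル,ハシル"),
    ("", "BOS/EOS,*,*,*,*,*,*,*,*"),
]

# Adjective, verb, noun, adverb (hypothetical constant name).
KEEP_POS = {"形容詞", "動詞", "名詞", "副詞"}


def filter_by_pos(nodes, keep=KEEP_POS):
    """Return the surfaces whose first feature field is in `keep`."""
    words = []
    for surface, feature in nodes:
        if surface == "":
            # Empty surfaces mark the beginning and end of the sentence.
            continue
        if feature.split(",")[0] in keep:
            words.append(surface)
    return words


words = filter_by_pos(SAMPLE_NODES)
print(words)
```

The particle が is dropped because its part of speech (助詞) is not in the set, while the noun, adjective, and verb are kept.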

Use BeautifulSoup to fetch the text at the given URL. The extraction relies on the HTML structure of Qiita pages so that only the article text is pulled out.

def get_wordlist_from_QiitaURL(url):
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "html.parser")  #Name the parser explicitly to avoid the bs4 warning
    text = soup.body.section.get_text().replace('\n','').replace('\t','')
    return mecab_analysis(text)
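
To see what this extraction step does without fetching a page, here is a rough stand-in that uses only the standard library's html.parser instead of BeautifulSoup. It is a sketch under assumptions: it collects the text of the first section element it meets, then strips newlines and tabs as the function above does. The class SectionTextExtractor, the helper section_text, and the sample HTML are all hypothetical, and the code targets Python 3.

```python
from html.parser import HTMLParser


class SectionTextExtractor(HTMLParser):
    """Collects the text found inside the first <section> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside the first <section>
        self.done = False   # True once the first <section> has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Count nested sections, but ignore any that start after the first one closed.
        if tag == "section" and not self.done:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "section" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def section_text(html):
    """Return the first section's text with newlines and tabs removed."""
    parser = SectionTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).replace("\n", "").replace("\t", "")


# Made-up page used only for illustration.
SAMPLE_HTML = """<html><body>
<header>Site menu</header>
<section><h1>Title</h1>
<p>First\tparagraph.</p></section>
<footer>Footer</footer>
</body></html>"""

print(section_text(SAMPLE_HTML))
```

Removing newlines and tabs outright fuses adjacent words in space-delimited text. That is harmless for Japanese, because MeCab does the word segmentation afterwards.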

Now for the main part: generating the Word Cloud. Words that carry little meaning can be excluded by passing them as stop words, so we do that here. On a Mac you also need to specify a font, so set font_path.

def create_wordcloud(text):
    
    #Specify the font path according to the environment.
    #fpath = "/System/Library/Fonts/HelveticaNeue-UltraLight.otf"
    fpath = "/Library/Fonts/ヒラギノ角ゴ Pro W3.otf"

    #Stop word setting (Japanese surface forms as produced by MeCab)
    stop_words = [ u'てる', u'いる', u'なる', u'れる', u'する', u'ある', u'こと', u'これ', u'さん', u'して', \
             u'くれる', u'やる', u'くださる', u'そう', u'せる', u'した',  u'思う',  \
             u'それ', u'ここ', u'ちゃん', u'くん', u'', u'て',u'に',u'を',u'は',u'の', u'が', u'と', u'た', u'し', u'で', \
             u'ない', u'も', u'な', u'い', u'か', u'ので', u'よう', u'']
     
    wordcloud = WordCloud(background_color="white",font_path=fpath, width=900, height=500, \
                          stopwords=set(stop_words)).generate(text)

    plt.figure(figsize=(15,12))
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()

With the necessary functions defined, we can now create the Word Cloud. Join the words into a single space-separated string and pass it to the Word Cloud creation function.

As the input I will use @t_saeko's article "What I did when I was suddenly put into a burning project as a director" (I read it recently and found it interesting).

url = "http://qiita.com/t_saeko/items/2b475b8657c826abc114"
wordlist = get_wordlist_from_QiitaURL(url)
create_wordcloud(" ".join(wordlist).decode('utf-8'))

The result looks pretty good! wordcloud-compressor.png
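
To check the cloud against the underlying numbers, the word counts that drive the font sizes can be inspected directly. The following is a minimal sketch using only the standard library, written for Python 3. The word list is a made-up stand-in for the MeCab output, and the helper name top_words is hypothetical.

```python
from collections import Counter


def top_words(wordlist, stop_words=(), n=3):
    """Return the n most frequent words, ignoring the stop words."""
    stop = set(stop_words)
    counts = Counter(w for w in wordlist if w not in stop)
    return counts.most_common(n)


# Made-up word list standing in for the output of mecab_analysis().
sample_wordlist = [
    "プロジェクト", "する", "炎上", "プロジェクト",
    "する", "タスク", "プロジェクト", "炎上",
]

print(top_words(sample_wordlist, stop_words=["する"]))
```

Words at the top of this list should be the largest ones in the image. If a word that carries little meaning ranks high, it is a candidate for the stop word list.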

The full code has been uploaded to gist.
