I would like to visualize word frequencies using the Python word_cloud library created by amueller.
The library is documented here: http://amueller.github.io/word_cloud/index.html
You can install it easily by just grabbing the source from GitHub:
git clone https://github.com/amueller/word_cloud
cd word_cloud
python setup.py install
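To confirm the install succeeded, a quick sanity check is enough (a minimal sketch; it just locates the installed package):

import wordcloud
print wordcloud.__file__  # shows where the package was installed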
Unlike English, Japanese has no explicit word boundaries, so to split the text into words we use a morphological analyzer called MeCab. Installing MeCab is explained at [MeCab installation](http://qiita.com/kenmatsu4/items/02034e5688cc186f224b), so you can set it up by following that link.
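To get a feel for what MeCab does, here is a minimal sketch (Python 2 style to match the rest of this post; the sample sentence is arbitrary, and it assumes the default dictionary is installed):

# -*- coding: utf-8 -*-
import MeCab as mc

t = mc.Tagger('-Ochasen')
node = t.parseToNode(u'今日はいい天気です'.encode('utf-8'))
while node:
    if node.surface != "":  # skip the empty BOS/EOS nodes
        print node.surface, node.feature.split(",")[0]
    node = node.next

Each line prints one word together with its part of speech (reported in Japanese).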
In addition, the following libraries are required, so install them as well:
pip install beautifulsoup4
pip install requests
Now that everything is ready, let's write the code. First, import the required libraries.
#Library import
%matplotlib inline
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from bs4 import BeautifulSoup
import requests
import MeCab as mc
Next, a function that uses MeCab to split the text into words and collect them in a list. The parts of speech are limited to nouns, verbs, adjectives, and adverbs so that only words likely to carry meaning are extracted and visualized.
def mecab_analysis(text):
    # The dictionary path depends on your environment; this one is for
    # mecab-ipadic-neologd installed via Homebrew on a Mac.
    t = mc.Tagger('-Ochasen -d /usr/local/Cellar/mecab/0.996/lib/mecab/dic/mecab-ipadic-neologd/')
    enc_text = text.encode('utf-8')
    node = t.parseToNode(enc_text)
    output = []
    while node:
        if node.surface != "":  # skip the empty BOS/EOS nodes
            word_type = node.feature.split(",")[0]
            # MeCab reports parts of speech in Japanese:
            # 形容詞 = adjective, 動詞 = verb, 名詞 = noun, 副詞 = adverb
            if word_type in ["形容詞", "動詞", "名詞", "副詞"]:
                output.append(node.surface)
        node = node.next
    return output
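For example, running it on a short sentence (assuming MeCab and the dictionary path above are set up; the exact split depends on the dictionary, so the output shown is only roughly what to expect):

words = mecab_analysis(u'昨日は公園をゆっくり散歩した')
print " ".join(words)  # => roughly: 昨日 公園 ゆっくり 散歩 し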
Next, a function that uses BeautifulSoup to fetch the text from the specified URL. It extracts just the article body, following the HTML structure of Qiita.
def get_wordlist_from_QiitaURL(url):
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "html.parser")  # explicit parser avoids a bs4 warning
    # Qiita wraps the article body in <body><section>...</section>
    text = soup.body.section.get_text().replace('\n', '').replace('\t', '')
    return mecab_analysis(text)
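Note that soup.body.section relies on Qiita's page layout; other sites structure their HTML differently. A variant that falls back to the whole <body> when there is no <section> might look like this (get_wordlist_from_URL is a hypothetical helper, not part of the code above):

def get_wordlist_from_URL(url):  # hypothetical generalized helper
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "html.parser")
    container = soup.body.section or soup.body  # fall back if there is no <section>
    text = container.get_text().replace('\n', '').replace('\t', '')
    return mecab_analysis(text)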
Now for the main part: generating the word cloud. Words that don't carry much meaning can be excluded by listing them as stop words, so we make use of that. Also, on a Mac you need to specify a font that supports Japanese, so set font_path accordingly.
def create_wordcloud(text):
    # Specify the font path according to your environment.
    # fpath = "/System/Library/Fonts/HelveticaNeue-UltraLight.otf"
    fpath = "/Library/Fonts/ヒラギノ角ゴ Pro W3.otf"  # Hiragino Kaku Gothic Pro W3

    # Stop words: particles, auxiliaries, and other low-content Japanese tokens
    stop_words = [u'てる', u'いる', u'なる', u'れる', u'する', u'ある', u'こと', u'これ', u'さん', u'して',
                  u'くれる', u'やる', u'くださる', u'そう', u'せる', u'した', u'思う',
                  u'それ', u'ここ', u'ちゃん', u'くん', u'', u'て', u'に', u'を', u'は', u'の', u'が', u'と', u'た', u'し', u'で',
                  u'ない', u'も', u'な', u'い', u'か', u'ので', u'よう', u'']

    wordcloud = WordCloud(background_color="white", font_path=fpath, width=900, height=500,
                          stopwords=set(stop_words)).generate(text)

    plt.figure(figsize=(15, 12))
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()
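If you want to keep the image as a file rather than only displaying it inline, the WordCloud object also provides a to_file method, so a line like the following after generate() would do it (the filename is arbitrary):

wordcloud.to_file("wordcloud.png")  # save the rendered cloud as a PNG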
The functions above define all the processing we need, so now let's use them to create a word cloud. Join the extracted words into a single space-separated string and pass it to the word cloud creation function.
I will use @t_saeko's article "What I did when I was suddenly put into a burning project as a director" (because it was an interesting read recently).
url = "http://qiita.com/t_saeko/items/2b475b8657c826abc114"
wordlist = get_wordlist_from_QiitaURL(url)
create_wordcloud(" ".join(wordlist).decode('utf-8'))
The result looks pretty good!
The full code has been uploaded to gist.