Visualize keywords in documents with TF-IDF and Word Cloud

word cloud memo

Prepare word dictionary (vocab) and TF-IDF

# All words (the following is an example)
$ vocab
array(['a', 'able', 'at', ..., 'zebra', 'zone', 'zoo'], dtype='<U79')

# TF-IDF vector for each document
array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [61.9792226 ,  0.        ,  3.38385083, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  6.76770166, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 2.75463212,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 1.37731606,  2.84060202,  0.        , ...,  0.        ,
         0.        ,  0.        ]])
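The memo does not show how vocab and the TF-IDF matrix were produced. The following is only a minimal sketch of one way to obtain them, assuming raw term counts multiplied by log(N / df) + 1; the actual weighting used for the values above is not stated, so treat this formula as an assumption. The function name build_tf_idf and the sample documents are hypothetical, and plain lists are used instead of NumPy arrays.

```python
import math
from collections import Counter


def build_tf_idf(docs):
    """Return (vocab, tf_idf) for a list of tokenized documents.

    Assumed weighting: term count * (log(N / df) + 1).
    """
    # Sorted list of every distinct word across all documents
    vocab = sorted({word for doc in docs for word in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word
    df = Counter(word for doc in docs for word in set(doc))
    idf = {word: math.log(n_docs / df[word]) + 1.0 for word in vocab}

    tf_idf = []
    for doc in docs:
        counts = Counter(doc)  # missing words count as 0
        tf_idf.append([counts[word] * idf[word] for word in vocab])
    return vocab, tf_idf


# Hypothetical, already-tokenized documents
docs = [["zoo", "zebra", "zoo"], ["zone", "a"], ["a", "zebra"]]
vocab, tf_idf = build_tf_idf(docs)
print(vocab)
print(len(tf_idf), len(tf_idf[0]))  # documents, vocabulary size
```

Wrapping the two results in numpy.array would give objects with the same shape as the vocab and TF-IDF arrays shown above.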

Create dic[word] = vec for each document

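words = vocab.tolist()
vecs = TF_IDF.tolist()
temp_dic = {}
vecs_dic = []
for vec in vecs:
    for i in range(len(vec)):
        temp_dic[words[i]] = vec[i]
    vecs_dic.append(temp_dic)  # Store this document's {word: weight} dictionary
    temp_dic = {}  # Start a fresh dictionary for the next document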
$ len(vecs_dic)
(Number of documents)

$ len(vecs_dic[0])
(Number of dimensions of vector)
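The same list of per-document dictionaries can also be built more compactly. This is a sketch of an equivalent construction with dict(zip(...)), not the memo's original code; the helper name to_word_weight_dicts and the sample values are hypothetical.

```python
def to_word_weight_dicts(words, vecs):
    """Pair each document vector with the vocabulary: one {word: weight} dict per document."""
    return [dict(zip(words, vec)) for vec in vecs]


# Hypothetical small inputs standing in for vocab.tolist() and TF_IDF.tolist()
words = ["a", "able", "zoo"]
vecs = [[0.0, 0.0, 1.5], [2.0, 0.0, 0.0]]

vecs_dic = to_word_weight_dicts(words, vecs)
print(len(vecs_dic))     # number of documents
print(len(vecs_dic[0]))  # number of dimensions of the vector
```

Because a new dictionary is created for every document, there is no shared temporary dictionary to reset between iterations.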


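# Visualize the 89th document from the document list
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# generate_from_frequencies takes a {word: weight} dictionary
wordcloud = WordCloud(background_color='white', width=1024, height=674).generate_from_frequencies(vecs_dic[88])  # index 88 = 89th document (0-based)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()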
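Since the word cloud sizes each word by its weight, it can help to list the highest-weighted words of a document before drawing it. The following is a small sketch for that check; the helper name top_keywords and the sample dictionary are hypothetical.

```python
def top_keywords(word_weights, n=10):
    """Return up to n (word, weight) pairs with positive weight, largest first."""
    ranked = sorted(word_weights.items(), key=lambda kv: kv[1], reverse=True)
    return [(word, weight) for word, weight in ranked if weight > 0][:n]


# Hypothetical {word: weight} dictionary for one document
doc = {"a": 61.98, "able": 0.0, "at": 3.38, "zoo": 0.0}
print(top_keywords(doc, n=3))  # [('a', 61.98), ('at', 3.38)]
```

If the list comes back empty, the document has no positive weights and there is nothing meaningful to draw.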


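If you get a ZeroDivisionError in WordCloud

This typically happens when every weight in a document's dictionary is 0, because WordCloud normalizes the weights by the largest one. It was solved by adding a small value to each element, following reference [2].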

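words = vocab.tolist()
vecs = TF_IDF.tolist()
temp_dic = {}
vecs_dic = []
for vec in vecs:
    for i in range(len(vec)):
        temp_dic[words[i]] = vec[i] + 1e-5  # Prevent the element from becoming 0
    vecs_dic.append(temp_dic)  # Store this document's {word: weight} dictionary
    temp_dic = {}  # Start a fresh dictionary for the next document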
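With the small-value approach, a document whose weights are all 0 still produces a cloud, but one made of equally weighted words. As an alternative (not the method from the reference), one could drop zero-weight words and skip documents that end up empty. A minimal sketch, with the helper name nonzero_weights and the sample data being hypothetical:

```python
def nonzero_weights(word_weights):
    """Keep only the words whose weight is strictly positive."""
    return {word: weight for word, weight in word_weights.items() if weight > 0}


# Hypothetical per-document dictionaries; the first one is all zeros
vecs_dic = [{"a": 0.0, "zoo": 0.0}, {"a": 2.75, "zoo": 0.0}]

# Keep (document index, filtered dictionary) pairs that still contain words
drawable = [(i, nonzero_weights(d)) for i, d in enumerate(vecs_dic)]
drawable = [(i, d) for i, d in drawable if d]
print(drawable)  # [(1, {'a': 2.75})]
```

Each remaining dictionary has a positive maximum weight, so the normalization inside WordCloud no longer divides by zero.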

Create and save images for each document

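To save the images, add wordcloud.to_file and change the code as follows.

for i, v in enumerate(vecs_dic):
    wordcloud = WordCloud(background_color='white', width=1024, height=674).generate_from_frequencies(v)
    wordcloud.to_file([PATH] + str(i) + ".png")  # [PATH] is a placeholder for the output directory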
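When many documents are saved, zero-padded file names keep the images in document order in a file browser. The following is a sketch of one way to build such names; the helper name image_path and the directory name "wordclouds" are hypothetical, not part of the memo.

```python
from pathlib import Path


def image_path(out_dir, index, total):
    """Build out_dir/<zero-padded index>.png, padded to fit the largest index."""
    width = len(str(max(total - 1, 0)))
    return Path(out_dir) / f"{index:0{width}d}.png"


# Hypothetical output directory and document count
paths = [image_path("wordclouds", i, 120) for i in (0, 7, 119)]
print([p.name for p in paths])  # ['000.png', '007.png', '119.png']
```

The resulting path can be converted with str() and passed to wordcloud.to_file in place of the concatenated string.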


