This article is a continuation of "[Job Change Conference] Classifying companies with word2vec by natural language processing of their reviews".
Last time, I wrote that natural language processing of the Job Change Conference reviews with word2vec let me look up similar companies and words; this time I will visualize those results.
As I wrote in the previous article, the method has the drawback that "I can tell a review is talking about overtime, but not whether there is a lot of it or a little."
This visualization does not fix that drawback, so I hope you will treat it as a visualization sample.
Also, the previous article was written for my company's Advent Calendar, but this one I wrote as an individual, so its content has nothing to do with the views of the organization I belong to.
Read the model saved in [the previous article](http://qiita.com/naotaka1128/items/2c4551abfd40e43b0146#2-gensim-%E3%81%A7-doc2vec-%E3%81%AE%E3%83%A2%E3%83%87%E3%83%AB%E6%A7%8B%E7%AF%89).
from gensim import models

model = models.Doc2Vec.load('./data/doc2vec.model')
I defined the methods for drawing scatter plots as follows.
Word vectors are usually trained in a model with 100 or 300 dimensions, so they are compressed down to two dimensions before being visualized.
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

def draw_word_scatter(word, topn=30):
    """Draw a scatter plot of the words most similar to the input word."""
    # Find similar words with the following gensim word2vec feature:
    # model.most_similar(word, topn=topn)
    words = [x[0] for x in sorted(model.most_similar(word, topn=topn))]
    words.append(word)

    # Look up each word's vector representation. A method that returns
    # the vector of a word (model.calc_vec) is defined based on gensim's
    # most_similar; the implementation is a bit long, so it is shown at
    # the end of this article.
    vecs = [model.calc_vec(word) for word in words]

    # Scatter plot
    draw_scatter_plot(vecs, words)
def draw_scatter_plot(vecs, tags, clusters=None):
    """Draw a labeled scatter plot from the input vectors."""
    # Dimensionality reduction to 2D with scikit-learn's PCA
    pca = PCA(n_components=2)
    coords = pca.fit_transform(vecs)

    # Visualization with matplotlib
    fig, ax = plt.subplots()
    x = [v[0] for v in coords]
    y = [v[1] for v in coords]

    # Color the points by cluster when a cluster for each point is given
    # (the error handling is rough)
    if clusters is not None:
        ax.scatter(x, y, c=clusters)
    else:
        ax.scatter(x, y)

    for i, txt in enumerate(tags):
        ax.annotate(txt, (coords[i][0], coords[i][1]))
    plt.show()
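As an aside, the actual calc_vec implementation appears at the end of the article. As a simplified, hypothetical stand-in (my sketch, not the real implementation), for in-vocabulary words it could boil down to reading the trained word vector:

# Simplified, hypothetical stand-in for model.calc_vec (the real version
# is at the end of this article). For a word in the vocabulary, older
# gensim versions expose the trained vector directly:
def calc_vec_simple(model, word):
    return model[word]  # equivalent to model.wv[word] in newer gensim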
With the preparations done, let's draw a scatter plot.
# "overtime"Visualize words that resemble
draw_word_scatter('overtime', topn=40)
The result was something I couldn't look at without tears.
The area in the middle where "coming home in the morning", "morning", "last train", and "overtime work" cluster together is especially miserable. Even scarier, "sleep" is plotted farthest from that area. I can't help feeling the melancholy of office workers and the danger of death from overwork...
That's a bit bleak, so let's try a positive word as well.
# "Rewarding"Visualize words that resemble
draw_word_scatter('Rewarding')
A completely different scatter plot from the previous one...! Pride, rewarding work, and giving people dreams make for good company. By the way, we are looking for comrades to do rewarding work with us.
If you run a website, you may also want to group words.
Personally, I think one of the reasons WELQ and MERY were overwhelmingly strong at SEO was their proper layering and grouping of tags. This technique can be used for that kind of thing too, and it would also be nice to automatically classify the inflow keywords of listing ads to create landing pages.
Here, let's draw a dendrogram, which is useful for that kind of layering and grouping.
import pandas as pd
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram

def draw_similar_word_dendrogram(word, topn=30):
    """Draw a dendrogram of the words most similar to the input word."""
    # Same as in draw_word_scatter (duplicated here for the Qiita article)
    words = [x[0] for x in sorted(model.most_similar(word, topn=topn))]
    words.append(word)
    vecs = [model.calc_vec(word) for word in words]

    # Visualization using SciPy functions
    # (based on code from the book "Python Machine Learning")
    df = pd.DataFrame(vecs, index=words)
    row_clusters = linkage(pdist(df, metric='euclidean'), method='complete')
    dendrogram(row_clusters, labels=words)
    plt.show()
Let's draw it.
# "overtime"Write a tree diagram of words similar to
draw_similar_word_dendrogram('overtime')
Sorry the word labels are small, but the dendrogram came out as is. Naturally, "coming home in the morning", "morning", and "last train" line up here as well. Go home early...
Grouping can be done by cutting this dendrogram at an appropriate height, as in the sketch below.
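As a minimal sketch, assuming the linkage matrix row_clusters and the word list words from draw_similar_word_dendrogram are made available (for example, by returning them instead of only plotting), SciPy's fcluster can cut the tree at a distance threshold:

from scipy.cluster.hierarchy import fcluster

# Cut the dendrogram at height t to obtain flat groups.
# t=1.0 is an arbitrary example threshold; pick one by eyeballing the plot.
group_ids = fcluster(row_clusters, t=1.0, criterion='distance')
for word, group_id in zip(words, group_ids):
    print(group_id, word)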
Next, let's draw a scatter plot of companies.
Here, let's draw a scatter plot of companies based on the reviews about each company's **corporate culture only**. The aim is to find companies with similar corporate cultures.
When calculating vector representations with the already-trained model, use gensim Doc2Vec's infer_vector function. Incidentally, this function came up in the comments on the previous article, and honestly it is not very accurate.
However, compared to the more fundamental question of whether word2vec is suited to processing company reviews in the first place, I don't think it is a big problem, so I use it as is.
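As an aside, infer_vector is stochastic, so its output can be stabilized somewhat by increasing the number of inference passes. A minimal sketch, assuming the pre-4.0 gensim API this article appears to use (where the parameters are named alpha and steps):

# Hypothetical example: infer a vector for one tokenized review with more
# inference passes than the default (steps=5 in gensim of this era)
tokens = ['overtime', 'last_train']  # example token list (assumed)
vec = model.infer_vector(tokens, alpha=0.025, steps=50)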
First, calculate each company's vector representation. I targeted web-industry companies with at least a certain number of reviews.
# Load the model
model = models.Doc2Vec.load('./data/doc2vec.model')

# Read the company and review data from the DB
companies = connect_mysql(QUERY_COMPANIES, DB_NAME)
reviews = connect_mysql(QUERY_COMPANY_CULTURE, DB_NAME)

# Morphological analysis of the review data
# (utils.stems wraps morphological analysis with MeCab)
words = [utils.stems(review) for review in reviews]

# Calculate each company's vector representation from its reviews
vecs = [model.infer_vector(word) for word in words]
Now that the vector representations are calculated, let's visualize them.
# Visualize with the method defined above
draw_scatter_plot(vecs, companies)
It's cluttered and hard to read, but the Recruit group companies sit in the upper part, and game companies gather toward the bottom.
Still, there are questionable spots, like whether GREE and mixi really belong in similar positions, so there may be accuracy problems with word2vec and infer_vector, plus distortion from forcing 100-dimensional vectors into two dimensions.
The scatter plot above was cluttered and hard to read.
Coloring the points by cluster makes it a little easier to see, so let's find each company's cluster and then redraw the scatter plot.
import pandas as pd
from sklearn.cluster import KMeans

def kmeans_clustering(tags, vecs, n_clusters):
    """Perform K-means clustering."""
    km = KMeans(n_clusters=n_clusters,
                init='k-means++',
                n_init=20,
                max_iter=1000,
                tol=1e-04,
                random_state=0)
    clusters = km.fit_predict(vecs)
    # Return a Series so each cluster label keeps its company tag
    return pd.Series(clusters, index=tags)
Now run the clustering and visualize with the clusters taken into account.
# The number of clusters is a rough choice
# (I looked for a reasonable number with the elbow method)
clusters = kmeans_clustering(companies, vecs, 10)

# Draw the scatter plot with the cluster information
draw_scatter_plot(vecs, companies, clusters)
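For reference, here is a minimal sketch of the elbow method mentioned in the comment above: run K-means over a range of cluster counts and look for the bend in the inertia curve (the range 2-20 is an arbitrary choice for illustration):

# Elbow method: plot K-means inertia (within-cluster SSE) against k
distortions = []
cluster_range = range(2, 21)
for k in cluster_range:
    km = KMeans(n_clusters=k, init='k-means++', n_init=20, random_state=0)
    km.fit(vecs)
    distortions.append(km.inertia_)

plt.plot(list(cluster_range), distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion (inertia)')
plt.show()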
Hmm, it doesn't change all that much... Still, being able to see the distortion from forcibly dropping the vectors into two dimensions may be a plus in itself.
Also, Cookpad and DMM ending up in similar positions seems likely to cause a stir, but since they both have offices in Yebisu Garden Place, the restaurants people go out to eat at are probably similar, so the corporate cultures end up similar... (a painful excuse)
The clustering visualization wasn't great this time, but it might improve to some extent with a different dimensionality-reduction method. It seems worth trying various tweaks, such as swapping the PCA step for scikit-learn's manifold.TSNE, as sketched below.
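A minimal sketch of that swap, replacing the PCA step inside draw_scatter_plot with t-SNE (perplexity=30 is just an assumed starting value; t-SNE is stochastic, so random_state is fixed for reproducibility):

from sklearn.manifold import TSNE

# t-SNE instead of PCA for the 2D projection
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
coords = tsne.fit_transform(vecs)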