This is a record of problem 98, "Clustering by Ward's method", from the 2015 edition of the Language Processing 100 Knocks. Unlike the non-hierarchical clustering of the previous problem, this one performs hierarchical clustering. The result takes the form shown below, a dendrogram in which the clusters are nested hierarchically (it is large because there are 238 countries).
Link | Remarks |
---|---|
098.Clustering by Ward method.ipynb | GitHub link to the answer program |
100 amateur language processing knocks:98 | A write-up of the 100 Knocks that I always rely on |
Creating a Dendrogram with Python and the elements of linkage | How to implement Ward's method and dendrogram in Python |
type | version | Contents |
---|---|---|
OS | Ubuntu 18.04.01 LTS | Running as a virtual machine |
pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
Python | 3.6.9 | I use Python 3.6.9 on pyenv. There is no deep reason for not using the 3.7 or 3.8 series. Packages are managed with venv |
On top of the environment above, I use the following additional Python packages. They can be installed with plain pip.
type | version |
---|---|
matplotlib | 3.1.1 |
pandas | 0.25.3 |
scipy | 1.4.1 |
In Chapter 10, we will continue to study word vectors from the previous chapter.
> Perform hierarchical clustering by Ward's method on the word vectors from problem 96. Furthermore, visualize the clustering result as a dendrogram.
This time the task is hierarchical clustering. The linked article explained the details very clearly. Because the result is a tree, you can obtain any number of clusters by cutting it at an upper level of the hierarchy.
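The idea of cutting the tree at an upper level can be sketched with SciPy's `fcluster`. This is only an illustration under assumptions: `toy_vectors` is made-up 2-D data standing in for the country vectors, and the `maxclust` criterion is one of several ways to choose the cut.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical toy data (not the real country vectors):
# three tight groups of three 2-D points each.
toy_vectors = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],     # group A
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],     # group B
    [10.0, 5.0], [10.1, 5.0], [10.0, 5.1],  # group C
])

# Build the hierarchy once with Ward's method.
tree = linkage(toy_vectors, method='ward')

# Cut the same tree at two different levels; no re-clustering is needed.
labels_3 = fcluster(tree, t=3, criterion='maxclust')  # at most 3 clusters
labels_2 = fcluster(tree, t=2, criterion='maxclust')  # at most 2 clusters

print(labels_3)
print(labels_2)
```

With three clusters each group gets its own label. With two clusters, the two nearest groups (B and C here) are merged, while group A stays separate.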
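import matplotlib.pyplot as plt
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage
# Load the country vectors saved in problem 96 (index: country names)
country_vec = pd.read_pickle('./096.country_vector.zip')
# info() prints its summary itself and returns None, so print() is not needed
country_vec.info()
# Hierarchical clustering by Ward's method
clustered = linkage(country_vec, method='ward')
# Tall figure so that all country labels fit; cyan background
plt.figure(figsize=(20, 50), dpi=100, facecolor='c')
# Draw the dendrogram sideways, with country names as leaf labels
_ = dendrogram(clustered, labels=list(country_vec.index), leaf_font_size=8, orientation='right')
plt.show()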
For the most part, I followed the article "Creating a Dendrogram with Python and the elements of linkage".
Clustering is done with the function `linkage`. Setting the parameter `method` to `ward` makes it cluster by Ward's method.
clustered = linkage(country_vec, method='ward')
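To make the return value of `linkage` more concrete, here is a minimal sketch on four made-up one-dimensional points (an assumption for illustration, not the country vectors). Each row of the result describes one merge.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Four hypothetical 1-D observations (not the real country vectors).
points = np.array([[0.0], [1.0], [5.0], [7.0]])
Z = linkage(points, method='ward')

# Z has one row per merge, so n - 1 rows for n observations, and 4 columns:
#   index of cluster A, index of cluster B, merge distance, size of the new cluster.
# Indices 0..n-1 are the original observations; n and above are merged clusters.
print(Z.shape)
for a, b, dist, size in Z:
    print(int(a), int(b), round(float(dist), 3), int(size))
```

The third column is what the dendrogram plots as the height (or, with `orientation='right'`, the horizontal position) of each merge.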
After that, the result is displayed. The resulting dendrogram (tree diagram) is the one shown at the top of this post. I deliberately set the background color to `c` (cyan) because, with the default white, the label area disappeared when I saved the figure as an image file.
plt.figure(figsize=(20, 50), dpi=100, facecolor='c')
_ = dendrogram(clustered, labels=list(country_vec.index), leaf_font_size=8, orientation='right')
plt.show()
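If you want to write the figure to an image file rather than display it, something like the following sketch should work. The data, the labels, and the output file name are hypothetical. Depending on the matplotlib version, `savefig` may not reuse the figure's face color by default, so it is passed explicitly here.

```python
import os
import tempfile

import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical stand-in data and labels (not the real country vectors).
points = np.array([[0.0, 0.0], [0.2, 0.1], [4.0, 4.0], [4.1, 3.9], [9.0, 0.5]])
names = ['A', 'B', 'C', 'D', 'E']

fig = plt.figure(figsize=(6, 4), dpi=100, facecolor='c')
result = dendrogram(linkage(points, method='ward'), labels=names, orientation='right')

# Pass the figure's face color explicitly so the saved image keeps the cyan background.
out_path = os.path.join(tempfile.gettempdir(), 'dendrogram_sketch.png')  # hypothetical file name
fig.savefig(out_path, facecolor=fig.get_facecolor(), bbox_inches='tight')
plt.close(fig)
```

`bbox_inches='tight'` is used so that long leaf labels are less likely to be cut off at the edge of the image.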
Let's look only at the top portion. Unlike K-Means, it is nice that you can see how similar individual countries are to one another. Just when it looks like maritime nations are lined up together, Japan and China turn out to be close to each other. I wonder why. Iraq, Afghanistan, and Israel are close in physical distance, but are they also similar in meaning?