This is a record of knock 97, "k-means clustering," from Language Processing 100 Knocks 2015. Using the country-name word vectors obtained in the previous knock, we classify the countries into 5 clusters. K-Means, the clustering method used here, is one I learned in the "Coursera Machine Learning Introductory Course (Week 8: Unsupervised Learning (K-Means and PCA))".

Reference link

Link	Remarks
097.k-means clustering.ipynb	GitHub link to the answer program
100 amateur language processing knocks:97	The series I always rely on when working through the 100 language processing knocks

Environment

Type	Version	Remarks
OS	Ubuntu 18.04.01 LTS	Runs in a virtual machine
pyenv	1.2.15	I use pyenv because I sometimes use multiple Python environments
Python	3.6.9	I use Python 3.6.9 on pyenv. There is no deep reason for not using the 3.7 or 3.8 series. Packages are managed with venv

In the above environment, I use the following additional Python packages. A regular pip install is enough.

Type	Version
pandas	0.25.3
scikit-learn	0.21.3

Task

Chapter 10: Vector Space Method (II)

In Chapter 10, we will continue to study word vectors from the previous chapter.

97. k-means clustering

> Run k-means clustering on the word vectors from knock 96 with the number of clusters $k = 5$.

Task Supplement (K-Means)

For K-Means, the article "I tried to visualize the K-means method with D3.js" is easy to follow. You can skip the statistical and mathematical parts and grasp the method intuitively. For those who want more, I recommend the free Coursera Machine Learning introductory course; its content is covered in the article "Coursera Machine Learning Introductory Online Course Tora no Maki (recommended for working adults with a humanities background)".
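As a rough illustration of what K-Means does internally, here is a minimal NumPy sketch that alternates an assignment step and a centroid update step. It is not scikit-learn's implementation; the function name `kmeans_sketch`, its parameters, and the toy data are all made up for this example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Minimal k-means loop for illustration (hypothetical helper, not scikit-learn)."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids with k distinct rows of X chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: label each point with the index of its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of the points assigned to it
        # (a cluster that received no points keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy data (assumption): two well-separated blobs in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
               rng.normal(5.0, 0.1, size=(20, 2))])
labels, centroids = kmeans_sketch(X, k=2)
```

The result depends on the random initial centroids, which is one reason the cluster numbering can change from run to run.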

Answer

Answer program [097.k-means clustering.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/10.%E3%83%99%E3%82%AF%E3%83%88%E3%83%AB%E7%A9%BA%E9%96%93%E6%B3%95%20(II)/097.k-means%E3%82%AF%E3%83%A9%E3%82%B9%E3%82%BF%E3%83%AA%E3%83%B3%E3%82%B0.ipynb)

import pandas as pd
from sklearn.cluster import KMeans

country_vec = pd.read_pickle('./096.country_vector.zip')
print(country_vec.info())

# K-Means clustering
country_vec['class'] = KMeans(n_clusters=5).fit_predict(country_vec)

for i in range(5):
    print('{} Cluster:{}'.format(i, country_vec[country_vec['class'] == i].index))

Answer commentary

Load the file saved in the previous knock.

country_vec = pd.read_pickle('./096.country_vector.zip')
print(country_vec.info())

The summary of the loaded DataFrame is printed as follows.

<class 'pandas.core.frame.DataFrame'>
Index: 238 entries, American_Samoa to Zimbabwe
Columns: 300 entries, 0 to 299
dtypes: float64(300)
memory usage: 559.7+ KB
None
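For readers who have not used pickled DataFrames before, the following is a small sketch of the save and load round trip. The file name `toy_vector.zip` and the two-row DataFrame are hypothetical; the sketch relies on pandas inferring zip compression from the `.zip` extension, which I believe is the default behaviour of `to_pickle` and `read_pickle`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the country vectors: 2 rows, 2 dimensions.
toy = pd.DataFrame([[0.1, 0.2], [0.3, 0.4]], index=['Japan', 'France'])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_vector.zip')
    # The compression is inferred from the ".zip" extension.
    toy.to_pickle(path)
    # read_pickle restores the index, columns, and dtypes as they were saved.
    restored = pd.read_pickle(path)
```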

With scikit-learn, this one line is all it takes to run K-Means. It is convenient that a DataFrame can be passed directly.

# K-Means clustering (store the cluster labels in a new 'class' column)
country_vec['class'] = KMeans(n_clusters=5).fit_predict(country_vec)
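One thing to keep in mind is that K-Means starts from random centroids, so the cluster numbers can differ between runs. The sketch below shows one way to make a run repeatable with the `random_state` parameter, using a toy DataFrame (`toy_vec` and `clusters` are hypothetical names) in place of the real country vectors. It also keeps the labels in a separate Series, because a label column stored in the same DataFrame would be fed back in as a feature if the fit were run again.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical toy stand-in for country_vec: 6 "words" with 3-dimensional vectors.
toy_vec = pd.DataFrame(
    [[0.0, 0.1, 0.0], [0.1, 0.0, 0.1], [0.0, 0.0, 0.1],
     [5.0, 5.1, 5.0], [5.1, 5.0, 5.0], [5.0, 5.0, 5.1]],
    index=['a', 'b', 'c', 'x', 'y', 'z'],
)

# random_state fixes the centroid initialisation so reruns give the same labels.
# n_init is set explicitly because its default differs between scikit-learn versions.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(toy_vec)

# Keep the labels outside the feature DataFrame so the features stay untouched.
clusters = pd.Series(labels, index=toy_vec.index, name='class')
```

Members of a cluster can then be listed with, for example, `clusters[clusters == 0].index`.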

Let's take a quick look at the clustering results.

for i in range(5):
    print('{} Cluster:{}'.format(i, country_vec[country_vec['class'] == i].index))

Cluster 0 is unusually large at 153 countries; perhaps it acts as an "other" category. Cluster 1 looks like so-called maritime nations such as New Zealand, the United Kingdom, and Japan, but it also includes India and China. Cluster 2 contains many European countries, but Argentina and Brazil are mixed in. It is hard to judge whether the result is a success.

0 Cluster:Index(['American_Samoa', 'Antigua_and_Barbuda', 'Bosnia_and_Herzegovina',
       'Burkina_Faso', 'Cabo_Verde', 'Cayman_Islands',
       'Central_African_Republic', 'Christmas_Island', 'Keeling_Islands',
       'Cocos_Islands',
       ...
       'Tonga', 'Tunisia', 'Turkmenistan', 'Tuvalu', 'Uruguay', 'Uzbekistan',
       'Vanuatu', 'Yemen', 'Zambia', 'Zimbabwe'],
      dtype='object', length=153)
1 Cluster:Index(['New_Zealand', 'United_Kingdom', 'United_States', 'Australia', 'Canada',
       'China', 'India', 'Ireland', 'Israel', 'Japan', 'Pakistan'],
      dtype='object')
2 Cluster:Index(['Argentina', 'Austria', 'Belgium', 'Brazil', 'Bulgaria', 'Denmark',
       'Egypt', 'Finland', 'France', 'Germany', 'Greece', 'Hungary', 'Italy',
       'Netherlands', 'Norway', 'Poland', 'Portugal', 'Romania', 'Spain',
       'Sweden', 'Switzerland'],
      dtype='object')
3 Cluster:Index(['Guinea', 'Jersey', 'Mexico'], dtype='object')
4 Cluster:Index(['Czech_Republic', 'Hong_Kong', "People's_Republic_of_China",
       'Puerto_Rico', 'South_Africa', 'Sri_Lanka', 'Great_Britain',
       'Northern_Ireland', 'Afghanistan', 'Albania', 'Algeria', 'Angola',
       'Armenia', 'Azerbaijan', 'Bangladesh', 'Cambodia', 'Chile', 'Colombia',
       'Croatia', 'Cuba', 'Cyprus', 'Ethiopia', 'Fiji', 'Georgia', 'Ghana',
       'Iceland', 'Indonesia', 'Iraq', 'Kenya', 'Latvia', 'Lebanon', 'Libya',
       'Lithuania', 'Malaysia', 'Malta', 'Mongolia', 'Morocco', 'Nepal',
       'Nigeria', 'Panama', 'Peru', 'Philippines', 'Serbia', 'Singapore',
       'Slovakia', 'Sudan', 'Thailand', 'Turkey', 'Uganda', 'Ukraine'],
      dtype='object')
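To compare cluster sizes without reading the lists by eye, the labels can be counted. The sketch below rebuilds a label Series from the sizes read off the output above (153, 11, 21, 3, and 50 members); the variable names are hypothetical, and on the real data the same `value_counts` call could be applied to `country_vec['class']`.

```python
import numpy as np
import pandas as pd

# Cluster sizes counted from the output above, indexed by cluster number.
observed_sizes = [153, 11, 21, 3, 50]
labels = pd.Series(np.repeat(np.arange(5), observed_sizes), name='class')

# Number of members per cluster, ordered by cluster number.
sizes = labels.value_counts().sort_index()

# Share of all countries that falls into each cluster.
shares = sizes / sizes.sum()
print(pd.DataFrame({'size': sizes, 'share': shares.round(3)}))
```

Cluster 0 alone holds about 64% of the 238 countries, which supports reading it as a catch-all group.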

100 language processing knock-97 (using scikit-learn): k-means clustering