This is my record of knock 97, "k-means clustering", from Language Processing 100 Knock 2015. Using the word vectors of the country names obtained in the previous knock, the countries are classified into 5 clusters. K-Means, the clustering method used here, is one I learned in the "Coursera Machine Learning Introductory Course (8th week-Unsupervised Learning (K-Means and PCA))".
Link | Remarks |
---|---|
097.k-means clustering.ipynb | Answer program GitHub link |
100 amateur language processing knocks:97 | I always rely on this series when working through the 100 language processing knocks |
type | version | Contents |
---|---|---|
OS | Ubuntu 18.04.01 LTS | Running in a virtual machine |
pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
Python | 3.6.9 | Python 3.6.9 on pyenv. There is no deep reason for not using the 3.7 or 3.8 series. Packages are managed using venv |
In the above environment, I am using the following additional Python packages. They can be installed with regular pip.
type | version |
---|---|
pandas | 0.25.3 |
scikit-learn | 0.21.3 |
In Chapter 10, we continue studying word vectors from the previous chapter.
> Run k-means clustering on the word vectors from knock 96, with the number of clusters $k = 5$.
For K-Means, the article "I tried to visualize the K-means method with D3.js" is easy to understand. You can skip the statistical and mathematical parts and get an intuitive feel for the method. For those who want more, the free Coursera Machine Learning Introductory Course is recommended; its content is covered in the article "Coursera Machine Learning Introductory Online Course Tora no Maki (Recommended for Liberal Arts Working Adults)".
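The idea behind K-Means can also be sketched in a few lines of NumPy: assign every point to its nearest centroid, move each centroid to the mean of its points, and repeat until nothing changes. This is only a minimal illustration, not how scikit-learn implements it. The function name `kmeans_sketch`, its parameters, and the toy data are all made up for this example, and the starting centroids are passed in explicitly to keep the run deterministic, whereas real implementations usually choose them randomly.

```python
import numpy as np


def kmeans_sketch(X, init_centroids, n_iter=100):
    """Toy version of the K-Means loop: returns (labels, centroids)."""
    centroids = np.asarray(init_centroids, dtype=float).copy()
    k = len(centroids)
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Two well-separated toy groups in 2-D (made-up data, not the country vectors).
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])

# Hypothetical starting centroids: one point taken from each group.
labels, centroids = kmeans_sketch(X, init_centroids=X[[0, 3]])
print(labels)
print(centroids)
```

With the 300-dimensional country vectors the loop is the same; only the shape of `X` changes.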
import pandas as pd
from sklearn.cluster import KMeans
country_vec = pd.read_pickle('./096.country_vector.zip')
print(country_vec.info())
# K-Means clustering
country_vec['class'] = KMeans(n_clusters=5).fit_predict(country_vec)
for i in range(5):
    print('{} Cluster:{}'.format(i, country_vec[country_vec['class'] == i].index))
Read the file saved in the previous knock.
country_vec = pd.read_pickle('./096.country_vector.zip')
print(country_vec.info())
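The following summary of the DataFrame is output. The final `None` appears because `info()` prints the summary itself and returns `None`, which `print` then displays.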
<class 'pandas.core.frame.DataFrame'>
Index: 238 entries, American_Samoa to Zimbabwe
Columns: 300 entries, 0 to 299
dtypes: float64(300)
memory usage: 559.7+ KB
None
With scikit-learn, K-Means takes just this one line. It is convenient that a DataFrame can be passed directly.
# K-Means clustering: store each country's cluster number in a new 'class' column
country_vec['class'] = KMeans(n_clusters=5).fit_predict(country_vec)
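One thing to keep in mind is that `KMeans` starts from randomly chosen centroids, so cluster numbers and memberships can differ from run to run. As a hedged sketch, fixing `random_state` should make a run reproducible. The value `0` is arbitrary, `n_init=10` is spelled out only to keep the behaviour the same across scikit-learn versions, and `toy_vec` is made-up random data standing in for `country_vec`.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Made-up stand-in for country_vec: 30 rows x 4 columns of random numbers.
rng = np.random.RandomState(0)
toy_vec = pd.DataFrame(rng.rand(30, 4))

# Two separate runs with the same random_state start from the same centroids.
first = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(toy_vec)
second = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(toy_vec)

# Check that both runs assigned every row to the same cluster.
print((first == second).all())
```

Without `random_state`, the listing of countries per cluster may not match the one shown in this article exactly.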
Let's take a quick look at the clustering results.
for i in range(5):
    print('{} Cluster:{}'.format(i, country_vec[country_vec['class'] == i].index))
Cluster 0 is unusually large at 153 countries; is it a catch-all for everything else? Cluster 1 looks like so-called maritime nations such as New Zealand, the United Kingdom, and Japan, but it also includes India and China. Cluster 2 consists mostly of European countries, but Brazil and Argentina are mixed in. It is a subtle result, and it is hard to judge whether the clustering succeeded or not.
0 Cluster:Index(['American_Samoa', 'Antigua_and_Barbuda', 'Bosnia_and_Herzegovina',
'Burkina_Faso', 'Cabo_Verde', 'Cayman_Islands',
'Central_African_Republic', 'Christmas_Island', 'Keeling_Islands',
'Cocos_Islands',
...
'Tonga', 'Tunisia', 'Turkmenistan', 'Tuvalu', 'Uruguay', 'Uzbekistan',
'Vanuatu', 'Yemen', 'Zambia', 'Zimbabwe'],
dtype='object', length=153)
1 Cluster:Index(['New_Zealand', 'United_Kingdom', 'United_States', 'Australia', 'Canada',
'China', 'India', 'Ireland', 'Israel', 'Japan', 'Pakistan'],
dtype='object')
2 Cluster:Index(['Argentina', 'Austria', 'Belgium', 'Brazil', 'Bulgaria', 'Denmark',
'Egypt', 'Finland', 'France', 'Germany', 'Greece', 'Hungary', 'Italy',
'Netherlands', 'Norway', 'Poland', 'Portugal', 'Romania', 'Spain',
'Sweden', 'Switzerland'],
dtype='object')
3 Cluster:Index(['Guinea', 'Jersey', 'Mexico'], dtype='object')
4 Cluster:Index(['Czech_Republic', 'Hong_Kong', "People's_Republic_of_China",
'Puerto_Rico', 'South_Africa', 'Sri_Lanka', 'Great_Britain',
'Northern_Ireland', 'Afghanistan', 'Albania', 'Algeria', 'Angola',
'Armenia', 'Azerbaijan', 'Bangladesh', 'Cambodia', 'Chile', 'Colombia',
'Croatia', 'Cuba', 'Cyprus', 'Ethiopia', 'Fiji', 'Georgia', 'Ghana',
'Iceland', 'Indonesia', 'Iraq', 'Kenya', 'Latvia', 'Lebanon', 'Libya',
'Lithuania', 'Malaysia', 'Malta', 'Mongolia', 'Morocco', 'Nepal',
'Nigeria', 'Panama', 'Peru', 'Philippines', 'Serbia', 'Singapore',
'Slovakia', 'Sudan', 'Thailand', 'Turkey', 'Uganda', 'Ukraine'],
dtype='object')
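Because the printed `Index` is truncated with `...` for large clusters, counting the members of each cluster is a quicker way to see the imbalance. The following is a minimal sketch on a made-up DataFrame; `toy` and its index values are hypothetical stand-ins for `country_vec` after the `class` column has been added.

```python
import pandas as pd

# Made-up stand-in for country_vec after the 'class' column has been added.
toy = pd.DataFrame(
    {'class': [0, 0, 0, 1, 2, 2]},
    index=['A', 'B', 'C', 'D', 'E', 'F'],  # hypothetical country names
)

# Number of rows assigned to each cluster, ordered by cluster number.
sizes = toy['class'].value_counts().sort_index()
print(sizes)

# Full member list per cluster, without the "..." truncation of the Index repr.
members = {label: group.index.tolist() for label, group in toy.groupby('class')}
for label, names in members.items():
    print(label, len(names), names)
```

Applied to `country_vec`, the same two steps would show the size of each of the 5 clusters and every country name in them.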