100 Language Processing Knocks by an Amateur: 97

This is a record of my attempt at the Language Processing 100 Knocks 2015. The environment is Ubuntu 16.04 LTS + Python 3.5.2 :: Anaconda 4.1.1 (64-bit). A list of past knocks is [here](http://qiita.com/segavvy/items/fb50ba8097d59475f760).

Chapter 10: Vector Space Method (II)

In Chapter 10, we will continue to study word vectors from the previous chapter.

97. k-means clustering

> Run k-means clustering on the word vectors from problem 96, with the number of clusters $k = 5$.

The finished code:

`main.py`


# coding: utf-8
import pickle
from collections import OrderedDict
from scipy import io
import numpy as np
from sklearn.cluster import KMeans

fname_dict_index_t = 'dict_index_country'
fname_matrix_x300 = 'matrix_x300_country'

# Load the dictionary
with open(fname_dict_index_t, 'rb') as data_file:
    dict_index_t = pickle.load(data_file)

# Load the matrix
matrix_x300 = io.loadmat(fname_matrix_x300)['matrix_x300']

# Run k-means clustering
predicts = KMeans(n_clusters=5).fit_predict(matrix_x300)

# Build a list of (country, classification number) pairs
result = zip(dict_index_t.keys(), predicts)

# Sort by classification number and print
for country, category in sorted(result, key=lambda x: x[1]):
    print('{}\t{}'.format(category, country))

Execution result:

The classification numbers from 0 to 4 and the country names are output tab-delimited.

`Execution result`


0	Andorra
0	Antarctica
0	Antigua_and_Barbuda
0	Bahamas
0	Bahrain
0	Barbados
0	Belarus
0	Belize
0	Benin
0	Bermuda
0	Bhutan
0	Bosnia_and_Herzegovina
0	Botswana
0	Burkina_Faso
0	Burundi
0	Cameroon
0	Central_African_Republic
0	Chad
0	Commonwealth_of_Australia
0	Comoros
0	Congo
0	Cook_Islands
0	Democratic_Republic_of_the_Congo
0	Djibouti
0	Dominica
0	Dominican_Republic
0	Ecuador
0	Eritrea
0	Estonia
0	Federal_Republic_of_Germany
0	Federated_States_of_Micronesia
0	French_Republic
0	Gabon
0	Gambia
0	Gibraltar
0	Greenland
0	Grenada
0	Guadeloupe
0	Guam
0	Guatemala
0	Guinea-Bissau
0	Guyana
0	Haiti
0	Honduras
0	Jamaica
0	Jordan
0	Kazakhstan
0	Kingdom_of_the_Netherlands
0	Kiribati
0	Kuwait
0	Kyrgyzstan
0	Lao
0	Latvia
0	Lesotho
0	Liberia
0	Liechtenstein
0	Luxembourg
0	Macau
0	Madagascar
0	Malawi
0	Maldives
0	Mali
0	Martinique
0	Mauritania
0	Mauritius
0	Mayotte
0	Micronesia
0	Moldova
0	Monaco
0	Mongolia
0	Montenegro
0	Mozambique
0	Myanmar
0	Namibia
0	Nauru
0	Nicaragua
0	Niger
0	Niue
0	Oman
0	Palau
0	Paraguay
0	Qatar
0	Republic_of_Albania
0	Republic_of_Armenia
0	Republic_of_Austria
0	Republic_of_Congo
0	Republic_of_Croatia
0	Republic_of_Estonia
0	Republic_of_Korea
0	Republic_of_Poland
0	Republic_of_Singapore
0	Republic_of_South_Africa
0	Republic_of_Turkey
0	Republic_of_the_Philippines
0	Russian_Federation
0	Rwanda
0	Saint_Lucia
0	Senegal
0	Seychelles
0	Slovenia
0	Solomon_Islands
0	Somalia
0	State_of_Israel
0	Suriname
0	Swaziland
0	Tajikistan
0	Tibet
0	Timor-Leste
0	Togo
0	Tokelau
0	Tonga
0	Tunisia
0	Turkmenistan
0	Tuvalu
0	United_Arab_Emirates
0	United_States_of_America
0	Uzbekistan
0	Vanuatu
0	Vatican
0	Yemen
0	Zambia
0	Zimbabwe
1	Austria
1	Belgium
1	Bulgaria
1	Denmark
1	Egypt
1	France
1	Germany
1	Greece
1	Hungary
1	Ireland
1	Italy
1	Macedonia
1	Netherlands
1	Norway
1	Poland
1	Portugal
1	Romania
1	Spain
1	Sweden
2	Afghanistan
2	China
2	India
2	Iraq
2	Israel
2	Korea
2	Pakistan
2	Taiwan
2	United_States
2	Vietnam
3	Argentina
3	Australia
3	Brazil
3	Canada
3	Japan
3	Mexico
3	New_Zealand
3	Switzerland
4	Albania
4	Algeria
4	Angola
4	Armenia
4	Azerbaijan
4	Bangladesh
4	Bolivia
4	Cambodia
4	Chile
4	Colombia
4	Croatia
4	Cuba
4	Cyprus
4	Czech_Republic
4	Ethiopia
4	Fiji
4	Finland
4	Georgia
4	Ghana
4	Guinea
4	Iceland
4	Indonesia
4	Iran
4	Kenya
4	Kosovo
4	Lebanon
4	Libya
4	Lithuania
4	Malaysia
4	Malta
4	Morocco
4	Nepal
4	Nigeria
4	Panama
4	People's_Republic_of_China
4	Peru
4	Philippines
4	Samoa
4	Serbia
4	Singapore
4	Slovakia
4	Sudan
4	Syria
4	Tanzania
4	Thailand
4	Turkey
4	Uganda
4	Ukraine
4	Uruguay
4	Venezuela

Techniques for classifying data

There are two main approaches to dividing data into groups: classification and clustering.

Classification

Classification assigns data to groups (classes) that are defined in advance, for example sorting news articles into predefined groups such as "politics", "economy", and "sports". It is also known as categorization.

Clustering

Clustering is a method of extracting groups of similar items without defining the groups in advance. When analyzing a large amount of data you cannot look at every item, so you roughly group the items into chunks of similar ones and look at the properties and trends of each chunk. This problem does not define what the five groups should be, so we use clustering.

Clustering methods are divided into non-hierarchical clustering, such as the K-Means used this time, and hierarchical clustering, such as Ward's method used in the next problem, 98.

K-Means clustering

The basic algorithm of K-Means is simple.

First, choose as many center points in the vector space as the specified number of clusters. Then, for each data point, measure the distance to every cluster center and assign the point to the cluster whose center is closest. This completes the first assignment. Next, compute the mean of the data points belonging to each cluster and use it as that cluster's new center. Then discard the current assignment, measure each point's distance to the new centers, and reassign it to the cluster with the closest center. This completes the second assignment. Repeat this until the centers hardly move, and you are done.
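The loop described above can be sketched in a few lines of NumPy. This is only a minimal illustration, not what scikit-learn does internally. The function name `kmeans_sketch`, its parameters, and the toy data are made up for this example, and the initial centers are assumed to be picked from the data points themselves.

```python
import numpy as np


def kmeans_sketch(data, k, max_iter=100, tol=1e-6, seed=0):
    """Minimal k-means loop; name, parameters, and defaults are illustrative."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centers (an assumption).
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every point to the cluster with the closest center.
        distances = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each center to the mean of the points assigned to it
        # (a center with no points keeps its old position).
        new_centers = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: stop once the centers hardly move.
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:
            break
    return labels, centers


# Two well-separated toy groups in 2-D (made-up data).
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centers = kmeans_sketch(points, k=2)
print(labels)
```

Which group gets number 0 and which gets number 1 depends on the initial centers, so only the grouping itself is meaningful.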

Searching the web turns up many explanations of K-Means, so I will omit the details. You can also study it in Week 8 of Coursera's Machine Learning course.

Implementation of K-Means

K-Means clustering is done with scikit-learn, which I started using in problem 85. It is easy to use: specify the number of clusters in the sklearn.cluster.KMeans class and pass the matrix to [fit_predict()](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans.fit_predict). Clustering is then performed and you get an array with the classification number of each row (values range from 0 to the number of clusters minus 1).
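As a small illustration of that call, here is a sketch on made-up 2-D data instead of the real word vectors. The variable names are hypothetical, and `n_init` is passed explicitly only because its default differs between scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D data standing in for the 300-dimensional word vectors.
toy_matrix = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                       [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])

# One classification number per row of the matrix is returned.
toy_predicts = KMeans(n_clusters=2, n_init=10).fit_predict(toy_matrix)

print(toy_predicts.shape)                   # number of rows in toy_matrix
print(sorted(set(toy_predicts.tolist())))   # the classification numbers in use
```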

The initial cluster centers are chosen at random, so the result may change slightly on each run. If you do not want it to change, you can fix the seed of the random number generator with the random_state parameter of sklearn.cluster.KMeans.
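A quick way to check the effect of random_state is to run the clustering twice with the same seed and compare the labels. The data below is made up, and the seed value 0 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data: 30 random points in 3 dimensions.
toy_matrix = np.random.RandomState(0).rand(30, 3)

# Same data and same random_state on both runs.
first = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(toy_matrix)
second = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(toy_matrix)

# With a fixed seed the two runs are expected to produce the same labels.
print(np.array_equal(first, second))
```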

In this program, the results are sorted by classification number before being printed. However, even when you look at the results, you cannot tell from what point of view the countries were grouped. Classification number 1 somehow seems to be a collection of European countries, but official names such as "Republic_of_Austria" ended up in classification number 0. "United_States_of_America" and "United_States" are also in different clusters. The country-name extraction in problem 96 may not have been good enough...
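When inspecting results like these, it can help to collect the names of each cluster on one line. The following is a small sketch using a handful of (country, classification number) pairs copied from the execution result; `sample_result` and `clusters` are names invented for this example.

```python
from collections import defaultdict

# A few (country, classification number) pairs copied from the execution result.
sample_result = [
    ('Republic_of_Austria', 0),
    ('United_States_of_America', 0),
    ('Austria', 1),
    ('Germany', 1),
    ('United_States', 2),
]

# Collect the country names of each classification number.
clusters = defaultdict(list)
for country, category in sample_result:
    clusters[category].append(country)

# Print one line per classification number.
for category in sorted(clusters):
    print('{}\t{}'.format(category, ', '.join(clusters[category])))
```

In the real program, `zip(dict_index_t.keys(), predicts)` could be passed through the same loop in place of `sample_result`.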

That's all for the 98th knock. If you find any mistakes, I would appreciate it if you could point them out.

The execution result includes part of the data distributed as the corpus data used for the 100 knocks. The license for the corpus data used in this Chapter 10 is Creative Commons Attribution-ShareAlike 3.0 Unported (Japanese translation). The list of country names was created by processing ["KIDS Ministry of Foreign Affairs - Countries of the World" (Ministry of Foreign Affairs)](http://www.mofa.go.jp/mofaj/kids/ichiran/index.html) and [Countries and Regions of the World from A to Z](http://www.nationsonline.org/oneworld/countries_of_the_world.htm) on nationsonline.org.