Introduction

One non-hierarchical clustering method is the k-means method. The description in the teaching materials, "Chapter 3 Information and Data Science Second Half Learning 16. Classification by Clustering", is easy to understand, so it is quoted here.

In the k-means method, clustering is performed according to the following procedure.

1) Decide the number of clusters in advance and randomly choose the representative points (centroids).

2) Compute the distance between each data point and each representative point, and assign the data point to the cluster of the closest representative point.

3) Compute the mean of each cluster and use it as the new representative point.

4) If the position of a representative point has changed, return to 2). If nothing has changed, the classification ends.

Because the representative points in 1) are chosen randomly, the results can differ greatly between runs, and an appropriate clustering may not be obtained. This can be improved by repeating the analysis several times or by using the k-means++ method, which replaces 1) with the following.

1') Randomly select one representative point from the data, and select the remaining representative points with a probability proportional to the square of the distance from that point.
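
The procedure above can be sketched in plain NumPy. This is only an illustrative sketch, not the implementation used by the teaching materials or by scikit-learn. The function names `kmeans_pp_init` and `kmeans` and the toy data are made up for this example, and the seeding step uses the common formulation in which the distance is measured to the nearest representative point chosen so far.

```python
import numpy as np


def kmeans_pp_init(X, k, rng):
    """Step 1': choose k initial representative points (k-means++ style seeding)."""
    n = X.shape[0]
    centers = [X[rng.integers(n)]]  # first point: uniformly at random
    for _ in range(1, k):
        # squared distance from each sample to its nearest chosen center
        diff = X[:, None, :] - np.array(centers)[None, :, :]
        d2 = (diff ** 2).sum(axis=2).min(axis=1)
        # sample the next center with probability proportional to d2
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centers)


def kmeans(X, k, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    centers = kmeans_pp_init(X, k, rng)
    for _ in range(max_iter):
        # Step 2: assign each sample to the nearest representative point
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 3: the mean of each cluster becomes the new representative point
        # (an empty cluster keeps its previous point)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: stop when the representative points no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Toy data (made up): two well-separated blobs around (0, 0) and (5, 5)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
labels, centers = kmeans(X, k=2)
print(centers)
```

In the rest of this article the same steps are delegated to scikit-learn's `KMeans`.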

The section "Chapter 3 Information and Data Science Second Half Learning 16. Classification by Clustering" of the teaching materials, which explains clustering, already includes an implementation example in Python. This time, I take the implementation example written in R in "Chapter 5 Exploration of Problem Discovery / Solution Utilizing Information and Information Technology, Activity Example at the End of the Book 3. Utilization of Information Technology for Utilizing Data" and rewrite it in Python to confirm data analysis by clustering with the k-means method.

Teaching materials

[High School Information Department "Information II" Teacher Training Materials (Main Volume): Ministry of Education, Culture, Sports, Science and Technology](https://www.mext.go.jp/a_menu/shotou/zyouhou/detail/mext_00742.html) Chapter 5 Exploration of Problem Discovery / Solution Utilizing Information and Information Technology, End of Book (PDF: 4.1MB)

Environment

ipython Colaboratory - Google Colab

Parts to be taken up in the teaching materials

Activity example 3 Utilization of information technology to utilize data

Implementation example and results in Python

Before doing the analysis

The teaching materials use Japanese text in their graph plots, so matplotlib must be configured in advance to render Japanese.

!apt-get -y install fonts-ipafont-gothic
!ls -ll /root/.cache/matplotlib/

：
-rw-r--r-- 1 root root 46443 Sep 18 20:45 fontList.json
-rw-r--r-- 1 root root 29337 Sep 18 20:25 fontlist-v310.json
drwxr-xr-x 2 root root  4096 Sep 18 20:25 tex.cache

Based on the output of the ls command, delete the old font cache, fontlist-v310.json.

#Delete the cache.
!rm /root/.cache/matplotlib/fontlist-v310.json #Cache to be erased
!ls -ll /root/.cache/matplotlib/

Now restart the Google Colab runtime. Then set up matplotlib to use the Japanese font.

import matplotlib

#Japanese display
matplotlib.rcParams['font.family'] = "IPAGothic"
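
If Japanese labels still render as empty boxes, it can help to check whether matplotlib actually knows the font. The following is a minimal sketch. The helper name `font_available` is made up for this example, and it only checks the family name in matplotlib's font list; it does not verify that a given plot will render correctly.

```python
from matplotlib import font_manager


def font_available(name):
    """Return True if matplotlib's font manager lists a font with this family name."""
    return any(f.name == name for f in font_manager.fontManager.ttflist)


# "IPAGothic" is the family name assigned to rcParams above
print(font_available("IPAGothic"))
```

If this prints False after installing the font, the font cache is probably stale, which is why the cache is deleted and the runtime restarted first.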

Preprocessing

Download the following Excel data from the "Survey on the actual situation of computerization of education in schools".

[Actual conditions of "computer installation status" and "Internet connection status" by prefecture (high school)](https://www.e-stat.go.jp/stat-search/files?page=1&query=%E5%AD%A6%E6%A0%A1%E3%81%AB%E3%81%8A%E3%81%91%E3%82%8B%E6%95%99%E8%82%B2%E3%81%AE%E6%83%85%E5%A0%B1%E5%8C%96%E3%81%AE%E5%AE%9F%E6%85%8B%E7%AD%89%E3%81%AB%E9%96%A2%E3%81%99%E3%82%8B%E8%AA%BF%E6%9F%BB&layout=dataset&stat_infid=000031898768&metadata=1&data=1)

As in the teaching materials, the data is first cleaned in Excel before it is analyzed with Python. The organized and shaped data is as follows.

pc_sjis.csv

The processing performed is as follows.

- Delete unnecessary headers and footers
- Delete unnecessary items
- Remove the thousands-separator commas from numbers so the data can be converted to CSV format
- Change the item names to alphabetic characters to make them easier to work with
- The items are: pref (prefecture), school (number of schools), student (number of students), room (number of ordinary classrooms), pc (total number of learner PCs), spp (number of students per learner PC), prj (large presentation device maintenance rate in ordinary classrooms), lan (school LAN maintenance rate in ordinary classrooms), wlan (wireless LAN maintenance rate in ordinary classrooms)
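
The article does this cleaning by hand in Excel. As a rough alternative, the comma removal and renaming steps could also be done in pandas. The sketch below uses made-up rows and placeholder header names, not the real e-Stat file, so the column names on the left side of the rename are assumptions.

```python
import io

import pandas as pd

# Made-up rows that mimic the raw layout: long header names and
# numbers written with thousands separators
raw = io.StringIO(
    'Prefecture,Schools,Students,Learner PCs\n'
    'A,"1,234","56,789","12,345"\n'
    'B,"987","43,210","9,876"\n'
)
df = pd.read_csv(raw, dtype=str)

# Rename the items to short alphabetic names
df = df.rename(columns={
    "Prefecture": "pref",
    "Schools": "school",
    "Students": "student",
    "Learner PCs": "pc",
})

# Strip the thousands separators and convert the columns to numbers
for col in ["school", "student", "pc"]:
    df[col] = pd.to_numeric(df[col].str.replace(",", "", regex=False))

print(df.dtypes)
print(df)
```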

With this done, read the data.

import pandas as pd
from IPython.display import display

pc = pd.read_csv('/content/pc_sjis.csv', encoding='shift_jis')
display(pc.head())

Compared with the result shown in the teaching materials, the teaching materials appear to contain an error: the total number of educational PCs is read where the total number of learner PCs should be read.

Data analysis and visualization

To see what trends can be read from the data, first display a scatterplot matrix. This time, I will use the seaborn module.

import seaborn as sns

pg = sns.pairplot(pc)
print(type(pg))

(Figure: seaborn_pairplot.png, the scatterplot matrix of pc)
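
As a numeric complement to the scatterplot matrix, the correlation coefficients between the numeric columns can be tabulated. This is a small sketch on made-up values standing in for two columns of pc. With the real data, the same `select_dtypes("number").corr()` call could be applied to pc directly.

```python
import pandas as pd

# Made-up values standing in for two columns of pc
toy = pd.DataFrame({
    "pref": ["A", "B", "C", "D"],
    "student": [1000, 2000, 3000, 4000],
    "room": [30, 62, 88, 121],
})

# Pearson correlation between the numeric columns only (pref is skipped)
corr = toy.select_dtypes("number").corr()
print(corr.round(3))
```

Pairs with a coefficient close to 1 or -1 correspond to the panels that look like straight lines in the scatterplot matrix.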

From the teaching materials:

Pairs with a clear linear trend, such as the number of students and the number of classrooms, are suited to the correlation coefficient and simple regression analysis learned in "Information I". This time, we will not look at linear trends, so let's consider wlan (wireless LAN) and spp (number of students per PC).

Following the teaching materials, extract the values of wlan (wireless LAN) and spp (number of students per PC) and scale them.

Specifically, we standardize them.

from sklearn.preprocessing import StandardScaler

#Value extraction(wlan spp)
pc_ws = pc[['wlan', 'spp']]

#Standardization(How to use Standard Scaler)
std_sc = StandardScaler()
std_sc.fit(pc_ws)
pcs = std_sc.transform(pc_ws)
pcs_df = pd.DataFrame(pcs, columns = pc_ws.columns)
display(pcs_df.head())

Because the two variables are measured in different units and on different scales, we standardize them in the same way as the teaching materials. For details on standardization, a past article may be helpful. https://qiita.com/ereyester/items/b78b22a76a8f50006880
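
To make clear what the standardization step computes, the following sketch compares `StandardScaler` with the calculation done by hand on made-up values: each column has its mean subtracted and is divided by its population standard deviation (ddof=0).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for two columns on very different scales
X = np.array([[10.0, 4.2], [55.0, 5.1], [90.0, 3.3], [35.0, 6.0]])

scaled = StandardScaler().fit_transform(X)

# Same calculation by hand: subtract the column mean and divide by the
# population standard deviation (np.std uses ddof=0 by default)
by_hand = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, by_hand))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After this transformation both columns have mean 0 and standard deviation 1, so neither dominates the Euclidean distance used by k-means.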

Next, create and classify the model.

from sklearn.cluster import KMeans

#Creating a model
km = KMeans(init='random', n_clusters=2 , random_state=0)
#Fit the model and assign a cluster label to each row
pc_cluster = km.fit_predict(pcs_df)
cluster_df = pd.DataFrame(pc_cluster, columns=['cluster'])

#Value extraction(pref wlan spp cluster)
pcs_cluster_df = pd.concat([pc[['pref', 'wlan', 'spp']], cluster_df], axis=1)
display(pcs_cluster_df.head())
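
The introduction mentions two remedies for unlucky random initialization: repeating the analysis and using k-means++. In scikit-learn these correspond to the `n_init` and `init` parameters of `KMeans`. The sketch below applies them to made-up (wlan, spp) values, not the real data, and also maps the cluster centers back to the original units with `inverse_transform`, which makes them easier to read than the standardized values.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up (wlan, spp) values forming two loose groups
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([20.0, 6.0], [5.0, 0.5], size=(15, 2)),
    rng.normal([70.0, 4.0], [5.0, 0.5], size=(15, 2)),
])

scaler = StandardScaler()
Xs = scaler.fit_transform(X)

# k-means++ seeding, repeated n_init times; the run with the lowest
# inertia (within-cluster sum of squares) is kept
km_pp = KMeans(init="k-means++", n_init=10, n_clusters=2, random_state=0)
labels = km_pp.fit_predict(Xs)

# Cluster centers mapped back to the original units
centers = scaler.inverse_transform(km_pp.cluster_centers_)
print(centers)
print(km_pp.inertia_)
```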

I would like to confirm the result with a scatter plot.

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

_, ax = plt.subplots(figsize=(5, 5), dpi=200)

sns.scatterplot(data=pcs_cluster_df, x="wlan", y="spp", hue="cluster", ax=ax)

for k, v in pcs_cluster_df.iterrows():
    ax.annotate(v['pref'],xy=(v['wlan'],v['spp']),size=5)

plt.show()

The clusters appear to be separated mainly by the value of wlan (wireless LAN). Also, Chiba and Saga prefectures appear to lie far from the center of their groups.

Further analysis

Next, let's plot the number of students against the number of learner PCs, a pair for which a clear positive correlation can be read, color-coded by the clusters obtained above.

#Value extraction(pref student pc cluster)
pcs_cluster2_df = pd.concat([pc[['pref', 'student', 'pc']], cluster_df], axis=1)

_, ax2 = plt.subplots(figsize=(5, 5), dpi=200)

sns.scatterplot(data=pcs_cluster2_df, x="student", y="pc", hue="cluster", ax=ax2)

for k, v in pcs_cluster2_df.iterrows():
    ax2.annotate(v['pref'],xy=(v['student'],v['pc']),size=5)

plt.show()

Where the ratio of pc (total number of learner PCs) to student (number of students) is large, the prefecture tends to belong to the group with a high wlan (wireless LAN maintenance rate in ordinary classrooms); otherwise, it tends to belong to the group with a low wlan. In Saga prefecture the ratio of pc to student is very large, while in Chiba prefecture it is very small. The plot makes these characteristics visible.
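
The reading of the plot above could be checked numerically by computing the ratio directly. This is a sketch on made-up rows in the shape of pcs_cluster2_df. The column name `pc_per_student` is introduced only for this example.

```python
import pandas as pd

# Made-up rows in the shape of pcs_cluster2_df (pref, student, pc, cluster)
toy = pd.DataFrame({
    "pref": ["A", "B", "C", "D"],
    "student": [40000, 60000, 30000, 50000],
    "pc": [8000, 9000, 9000, 12500],
    "cluster": [0, 0, 1, 1],
})

# Learner PCs per student for each prefecture
toy["pc_per_student"] = toy["pc"] / toy["student"]

# Average ratio within each cluster
ratio_by_cluster = toy.groupby("cluster")["pc_per_student"].mean()
print(ratio_by_cluster)

# Prefectures ordered from the smallest to the largest ratio
ranked = toy.sort_values("pc_per_student")[["pref", "pc_per_student"]]
print(ranked)
```

With the real data, the prefectures at the two ends of `ranked` would be the ones that stand apart in the scatter plot.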

Source code

https://gist.github.com/ereyester/ce9370e3022f05f4d7548a8ccaed33cc

Data analysis by clustering using k-means method (python) ([High school information department information II] teaching materials for teacher training)