In supervised learning (regression and classification), the answers are known: the AI is trained on data consisting of input values and their corresponding output values.
In contrast, unsupervised learning works on datasets that have no such answers, and the AI finds the answers in the data by itself.
Here you will learn about "clustering" and "principal component analysis", both of which are unsupervised learning techniques.
A representative unsupervised learning technique is "clustering": an operation that divides data into groups called "clusters". In Japanese, this data-division operation is sometimes referred to simply as "clustering".
The following shows how the data is handled, using the "k-means method" as an example of clustering.
The black dots show the state before clustering, and the purple dots are parameters called the "centroids" (centers of gravity) of the data. The k-means method learns the optimal positions of these centroids from the data, and then clusters the data using the learned centroids.
There are two types of clustering methods: those that automatically estimate the number of clusters, and those where a human decides the number in advance.
The k-means method is one of the latter: a human specifies the number of clusters beforehand, as in the sketch below.
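The following is a minimal sketch of the k-means procedure using only numpy. The 2D data and the choice of k = 2 are hypothetical, chosen just for illustration: the loop alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points.

import numpy as np

# Hypothetical 2D data: two loose groups of points.
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[3, 3], scale=0.5, size=(50, 2)),
])

k = 2  # number of clusters, decided by a human in advance
centroids = data[rng.choice(len(data), size=k, replace=False)]  # initial guesses

for _ in range(10):
    # Assign each point to the nearest centroid (Euclidean distance).
    distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Move each centroid to the center of gravity of its assigned points.
    centroids = np.array([data[labels == j].mean(axis=0) for j in range(k)])

print(centroids)  # learned centroid positions

After a few iterations the centroids settle near the centers of the two groups, which is exactly the "learning the optimal centroid positions" described above.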
The purpose of unsupervised learning is to mechanically capture and analyze the characteristics of the data. For this reason, there is also the view that it is better for a human not to decide the number of clusters.
A technique called "hierarchical clustering" automatically estimates the number of clusters. However, hierarchical methods require a relatively large amount of computation, so if you have a lot of data, a non-hierarchical approach such as k-means may be more appropriate.
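As a sketch of the hierarchical approach (note: scipy is an assumption here, not used elsewhere in this article, and the distance threshold is purely illustrative), you build a tree of merges and cut it at some threshold, so the number of clusters emerges from the data instead of being fixed in advance.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical 2D data: two loose groups of points.
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(20, 2)),
    rng.normal(loc=[3, 3], scale=0.5, size=(20, 2)),
])

# Build the merge tree by repeatedly joining the closest clusters (Ward linkage).
Z = linkage(data, method="ward")

# Cut the tree at an illustrative distance threshold instead of fixing k;
# the number of clusters is then determined by the data.
labels = fcluster(Z, t=3.0, criterion="distance")
print(len(np.unique(labels)))  # number of clusters found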
"Principal component analysis" is a technique often used to "reduce" data into graphs.
Dimensionality reduction means lowering the number of dimensions used to represent the data. For example, you can create a 2D graph by removing one coordinate axis from 3D data.
Consider a concrete example. Suppose you have a lot of data about students, such as test scores, the number of questions asked in class, the number of late arrivals, and hours of sleep. How can you graph student characteristics from these data?
You could create a separate graph for each kind of data, but it is difficult to analyze the tendencies of hundreds or thousands of students across multiple graphs. Principal component analysis lets you combine different types of data into a single 2D or 3D graph while preserving the information in each as much as possible.
Principal component analysis converts the data as in the example above. First, the machine learns the axes (principal components) that best express the characteristics of the data. If you redraw the graph along the learned axes, you can see all the data at a glance in one graph, as in the figure above, while keeping as much of the information as possible. How this axis is determined is the essence of principal component analysis.
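The following is a minimal numpy-only sketch of that idea. The 200-by-4 student data is hypothetical; the code centers the data, finds the directions of largest variance via an eigendecomposition of the covariance matrix, and projects the 4D data onto the top two learned axes.

import numpy as np

# Hypothetical student data: 4 features per student
# (test score, questions asked, late arrivals, hours of sleep).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))

# Center the data, then look for the axes of greatest variance.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; take the top two
# eigenvectors as the principal components (the learned axes).
components = eigenvectors[:, np.argsort(eigenvalues)[::-1][:2]]

# Project the 4D data onto the learned 2D axes for plotting.
X_2d = X_centered @ components
print(X_2d.shape)  # (200, 2)

Each row of X_2d is one student's position in the new 2D graph.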
Given the coordinates of two points $x = (x_1, x_2)$ and $y = (y_1, y_2)$, the distance between them can be obtained from the Pythagorean theorem:

$$d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}$$

More generally, the extension of this to two points in n-dimensional space,

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2},$$

is called the Euclidean distance.
"Distance" in a space with n = 4 or more dimensions can no longer be pictured by human spatial intuition, but mathematically the expression extended as above is simply defined to be the distance. The Euclidean distance between $x$ and $y$ is also the Euclidean norm of the difference vector $x - y$, which is why it is sometimes simply called the norm.
You can also use numpy to find the Euclidean distance:

import numpy as np

vec_a = np.array([1, 2, 3])
vec_b = np.array([2, 3, 4])

# The Euclidean distance is the norm of the difference vector.
print(np.linalg.norm(vec_a - vec_b))  # 1.7320508075688772 (= sqrt(3))
Suppose two 2D vectors $\vec{a} = (a_1, a_2)$ and $\vec{b} = (b_1, b_2)$ are given, and we would like to evaluate how similar these two vectors (in practice, some pieces of 2D data) are.
The properties that characterize a vector are its "length" and "direction". Here we focus on "direction": the similarity of the directions two vectors are facing can be thought of simply as corresponding to the angle between them.
Assuming the angle formed by the two vectors is $\theta$, the smaller $\theta$ is, the more similar the two data are. Here, rearranging the formula for the inner product of vectors,

$$\vec{a} \cdot \vec{b} = |\vec{a}|\,|\vec{b}| \cos\theta,$$

a little gives

$$\cos\theta = \frac{\vec{a} \cdot \vec{b}}{|\vec{a}|\,|\vec{b}|} = \frac{a_1 b_1 + a_2 b_2}{\sqrt{a_1^2 + a_2^2}\,\sqrt{b_1^2 + b_2^2}}.$$

The smaller $\theta$ is, the larger $\cos\theta$ becomes, so $\cos\theta$ represents the similarity between the two data. The cosine of the angle formed in this way, used as an index of data similarity, is called the "cosine similarity".
Extend "Cosine Similarity" so that it can be used for n-dimensional data as well as the Euclidean distance. When two n-dimensional vectors a → = (a1, a2, ⋯, an), b → = (b1, b2, ⋯, bn) are given "Cosine similarity" is expressed by the following formula.
The cosine similarity can also be calculated with the following code:

import numpy as np

vec_a = np.array([1, 2, 3])
vec_b = np.array([2, 3, 4])

# Inner product divided by the product of the two norms.
print(np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))  # ≈ 0.9926