Last time, in Analysis of time series data of stocks, I touched briefly on machine learning in the second half. Today we will look at the machine learning library scikit-learn.
Speaking of scikit-learn, I have already mentioned it in Simple clustering example, Support vector machine, Solve problems with clustering, Calculate TF-IDF, Visualization of regression model, and Clustering by DBSCAN. This time, however, I will reorganize the functionality of the library itself.
Machine learning has an image of requiring difficult mathematics, but if you use a mature, well-built library, you can apply a method without implementing it yourself. Of course you still need to understand how each method works, but since scikit-learn can fairly be called the de facto standard, I think the best starting point is simply to use it.
As I wrote previously, the biggest merit of using Python is its extremely rich ecosystem for scientific computing and data analysis, including statistics and machine learning. There are excellent libraries: NumPy and SciPy for numerical computation in general, pandas for handling data frames, and matplotlib for visualization. Trying to do the same in another language (such as Ruby or Scala) ranges from daunting to nearly impossible, so it is rarely a realistic option. R is another language for statistics; it has the larger body of existing assets, but Python, being a general-purpose language as well, is in my view the more complete language.
First, let's organize the procedure. The basic flow is: prepare the data, choose a model, fit it to the training data, and evaluate the result.
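As a minimal sketch of this basic flow (the iris dataset and the logistic-regression classifier here are my own illustrative choices, not from the original article):

```python
# Minimal scikit-learn workflow: prepare data, choose a model,
# fit, predict, evaluate.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)          # prepare the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000)  # choose a model
model.fit(X_train, y_train)                # fit
y_pred = model.predict(X_test)             # predict
print(accuracy_score(y_test, y_pred))      # evaluate
```

Every estimator in the library follows this same fit/predict interface, which is what makes swapping methods in and out so easy.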
A good way to grasp what you can do with scikit-learn is to look at the algorithm cheat sheet. The original is on the official site:
Choosing the right estimator http://scikit-learn.org/stable/tutorial/machine_learning_map/
It is useful because you can jump to the explanation of each method from here.
The following is a summary of what you can do.
Let's follow the features one by one.
Support vector machines (SVM) have high generalization performance, and because the kernel function can be selected they can handle a wide variety of data.
It boasts high accuracy in spite of its simplicity.
Random forests have the advantages that overfitting is of little concern and parallel computation is easy.
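To illustrate the classifiers above, here is a hedged sketch comparing an SVM and a random forest (the iris dataset and all parameter values are my own choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# SVM: the kernel function is selectable ('linear', 'rbf', 'poly', ...)
svm = SVC(kernel="rbf").fit(X_tr, y_tr)

# Random forest: resistant to overfitting and easy to parallelize (n_jobs)
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1,
                            random_state=0).fit(X_tr, y_tr)

print("SVM accuracy:", svm.score(X_te, y_te))
print("RF  accuracy:", rf.score(X_te, y_te))
```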
It is a normal linear regression.
Lasso regression builds a model with a small number of variables, on the assumption that some variables are not needed: its L1 penalty drives some coefficients to exactly zero.
Ridge regression is not easily affected by multicollinearity, and its variable selection ability is weaker than that of lasso regression (coefficients shrink but do not become exactly zero).
SVR
It can capture non-linearity through the kernel function.
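A sketch comparing the regression methods above on toy data (the sine-curve dataset and the alpha values are illustrative assumptions, not from the original article):

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, (200, 1))
y = np.sin(X).ravel() + 0.1 * rng.randn(200)   # noisy non-linear target

lin = LinearRegression().fit(X, y)    # ordinary linear regression
lasso = Lasso(alpha=0.1).fit(X, y)    # L1 penalty: some coefficients -> 0
ridge = Ridge(alpha=1.0).fit(X, y)    # L2 penalty: robust to multicollinearity
svr = SVR(kernel="rbf").fit(X, y)     # kernel captures the non-linearity

print("linear R^2:", round(lin.score(X, y), 3))
print("SVR    R^2:", round(svr.score(X, y), 3))
```

On this deliberately non-linear target, the RBF-kernel SVR should fit noticeably better than the straight-line model.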
This is a typical clustering method in which the number of clusters is specified in advance as k. It's simple and fast.
With a Gaussian mixture model you can obtain the probability that each point belongs to each cluster. It assumes normal distributions.
Mean shift is a robust, non-parametric method based on kernel density estimation. The kernel width (radius h) you set determines the number of clusters automatically. Because each center is found by hill-climbing on the estimated density, considering a circle of radius h around every input point, the computational cost tends to be high.
The method is also applied to tasks such as image segmentation and edge-preserving image smoothing.
Mean-shift cluster analysis with infinite kernel width h can also be interpreted as k-means.
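The clustering methods above can be tried side by side; here is a sketch on synthetic blobs (the generated data, k=3, and the bandwidth value are assumptions for illustration):

```python
from sklearn.cluster import KMeans, MeanShift
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# k-means: the number of clusters k is given in advance
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Gaussian mixture: soft assignment, probability of belonging to each cluster
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
proba = gmm.predict_proba(X)          # each row sums to 1

# Mean shift: the kernel width (bandwidth h) determines the cluster count
ms = MeanShift(bandwidth=2.0).fit(X)

print("k-means clusters:   ", len(set(km.labels_)))
print("mean-shift clusters:", len(set(ms.labels_)))
```

Note that only k-means and the Gaussian mixture take the number of clusters as input; mean shift discovers it from the bandwidth.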
It has the advantage of handling sparse matrices quickly. It assumes a normal distribution.
Non-negative matrix factorization (NMF) can only be applied to non-negative matrices, but the extracted features are often easier to interpret.
Linear discriminant analysis (LDA) and deep learning methods are also available.
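As a sketch of the decomposition methods above (the digits dataset and n_components=10 are my own illustrative choices), note that NMF is usable here only because the input is non-negative:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import NMF, PCA

X, _ = load_digits(return_X_y=True)   # pixel intensities are non-negative

pca = PCA(n_components=10).fit(X)     # fast, general-purpose reduction
nmf = NMF(n_components=10, init="nndsvda",
          max_iter=500).fit(X)        # valid only because X >= 0

print("PCA features:", pca.transform(X).shape)
print("NMF features:", nmf.transform(X).shape)
```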
These are the libraries on which scikit-learn ultimately depends. Not only scikit-learn but also various numerical software such as SciPy, as well as programs in other languages such as C and Fortran, depend on BLAS and LAPACK. Most GNU/Linux distributions provide them as packages. Available from many languages and with a long history, they are the de facto standard for performing linear algebra on a computer.
Specifically, LAPACK provides routines for simultaneous linear equations, least squares problems, eigenvalue problems, singular value problems, matrix (LU) decomposition, Cholesky decomposition, QR decomposition, singular value decomposition, Schur decomposition and generalized Schur decomposition, as well as condition number estimation, inverse matrix computation, and various other subroutines.
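You rarely call LAPACK directly from Python, but NumPy's linear-algebra routines are backed by it; a small sketch of the operations listed above:

```python
import numpy as np  # numpy.linalg delegates these operations to LAPACK

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])            # symmetric positive definite
b = np.array([1.0, 2.0])

x = np.linalg.solve(A, b)             # simultaneous linear equations
L = np.linalg.cholesky(A)             # Cholesky decomposition
Q, R = np.linalg.qr(A)                # QR decomposition
U, s, Vt = np.linalg.svd(A)           # singular value decomposition
w = np.linalg.eigvalsh(A)             # symmetric eigenvalue problem
c = np.linalg.cond(A)                 # condition number estimation

print(np.allclose(A @ x, b))          # True
print(np.allclose(L @ L.T, A))        # True
```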
There are various implementations of BLAS, but the typical ones are as follows.
If speed is not critical, I think OpenBLAS is a reasonable default choice for the time being.
Cross-validation and grid search are techniques for finding good hyperparameters efficiently: grid search enumerates candidate parameter combinations, and cross-validation evaluates each one. We make full use of these to repeatedly train and evaluate the model, but I will come back to this later.
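As a preview, here is a hedged sketch of grid search combined with cross-validation (the SVC estimator, the parameter grid, and cv=5 are my own illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Every parameter combination in the grid is scored by 5-fold
# cross-validation; the best one is then refit on the whole dataset.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best CV score:  ", round(search.best_score_, 3))
```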
This time, we surveyed typical machine learning methods from the viewpoint of what is implemented in the machine learning library scikit-learn. scikit-learn is a very rich, high-quality library, so I think getting a firm grasp of it first will make the field easier to understand.