Overview of machine learning techniques learned from scikit-learn

Last time, in the second half of "Analysis of time series data of stocks", I talked a little about machine learning. Today I will cover the machine learning library scikit-learn.

I have already touched on scikit-learn in several earlier posts: a simple clustering example, support vector machines, solving problems with clustering, calculating TF-IDF, visualizing a regression model, and clustering with DBSCAN. This time I will step back and reorganize what the library can do.

Machine learning has an image of relying heavily on difficult mathematics, but with a mature library you can use it without implementing the methods yourself. Of course, you still need to understand what each method does, but since scikit-learn can fairly be called the de facto standard, I think it is best to start by using it.

As I wrote in a previous post, the biggest merit of using Python is that its libraries for scientific computing and data analysis, such as statistics and machine learning, are extremely rich. There are very good libraries: NumPy and SciPy for numerical computation in general, pandas for handling data frames, and matplotlib for visualization. Trying to do the same thing in another language (such as Ruby or Scala) can be a daunting task or almost impossible, so it is probably not an option. There is another language for statistics, R. I think Python is more complete as a language because it is also a general-purpose language, while R is characterized by having more existing assets.

Basic steps of machine learning

First, let's organize the procedure.

  1. Obtaining data
  2. Data preprocessing (creating features by processing, shaping, scale conversion, etc.)
  3. Method selection
  4. Parameter selection
  5. Model learning
  6. Model evaluation
  7. Tuning (repeat steps 3 to 6)

The basic procedure is as above.
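The steps above can be sketched roughly as follows. This is only a minimal sketch: the bundled iris dataset, the 70/30 split, and the combination of standard scaling with an SVM are my assumptions for illustration, not the only reasonable choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# 1. Obtain data (iris is an assumption made for illustration)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# 2-4. Preprocessing (scaling), method (SVM), and parameters (kernel, C)
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))

# 5. Train the model
model.fit(X_train, y_train)

# 6. Evaluate on data the model has not seen
accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.3f}")
```

Step 7 (tuning) then amounts to changing the method or its parameters and running steps 5 and 6 again.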

Method selection

The best way to choose a method in scikit-learn is to look at the cheat sheet.

ml_map.png

The original is on the official site.

Choosing the right estimator http://scikit-learn.org/stable/tutorial/machine_learning_map/

It is useful because you can jump to the explanation of each method from here.

The following is a summary of what you can do in each category. Let's go through the features of the methods one by one.

Classification

SVM (Support Vector Machine, Linear Support Vector Machine)

It has high generalization performance, and because the kernel function can be chosen, it can handle many kinds of data.
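To see what choosing the kernel buys you, here is a small sketch that compares a linear kernel with an RBF kernel. The two-ring toy dataset from make_circles and the 5-fold cross-validation are assumptions I picked because a straight line cannot separate the rings.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Two concentric rings: the classes cannot be separated by a straight line
X, y = make_circles(n_samples=200, noise=0.05, factor=0.5, random_state=0)

scores = {}
for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel)
    # Mean accuracy over 5 cross-validation folds
    scores[kernel] = cross_val_score(clf, X, y, cv=5).mean()
    print(kernel, round(scores[kernel], 3))
```

On this kind of data the RBF kernel should score far higher than the linear one.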

K-nearest neighbor method

It boasts high accuracy in spite of its simplicity.
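A minimal sketch of how little code the k-nearest neighbor method needs. The iris dataset and k = 5 are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Each sample is classified by a majority vote of its 5 nearest neighbors
knn = KNeighborsClassifier(n_neighbors=5)
knn_score = cross_val_score(knn, X, y, cv=5).mean()
print(f"mean CV accuracy: {knn_score:.3f}")
```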

Random forest

It is relatively resistant to overfitting, and because each tree is built independently, it is easy to train in parallel.
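A minimal sketch of a random forest, again assuming the iris dataset. Setting n_jobs=-1 asks scikit-learn to build the trees in parallel on all available cores, and oob_score=True gives an out-of-bag estimate of accuracy without a separate test set.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=100,  # number of trees
    n_jobs=-1,         # build trees in parallel on all cores
    oob_score=True,    # score each sample using trees that did not see it
    random_state=0,
)
forest.fit(X, y)

print(f"out-of-bag accuracy: {forest.oob_score_:.3f}")
print("feature importances:", forest.feature_importances_)
```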

Regression

Linear regression

This is ordinary linear regression.

Lasso regression

Builds a model with a small number of variables, on the assumption that some variables are not used at all.

Ridge regression

It is not easily affected by multicollinearity. Its variable selection ability is weaker than that of lasso regression, because it shrinks coefficients without setting them exactly to zero.
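The difference between these three linear models shows up in their coefficients. The sketch below fits each one and counts the coefficients that are exactly zero. The synthetic data from make_regression (20 variables, of which only 5 are informative) and alpha=1.0 are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# 20 variables, but only 5 of them actually influence y
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=5.0, random_state=0
)

models = [
    ("ols", LinearRegression()),
    ("lasso", Lasso(alpha=1.0)),
    ("ridge", Ridge(alpha=1.0)),
]

zero_counts = {}
for name, model in models:
    model.fit(X, y)
    # Count coefficients that are exactly zero (variables dropped by the model)
    zero_counts[name] = int(np.sum(model.coef_ == 0))
    print(name, zero_counts[name])
```

Lasso should drop some of the variables entirely, while ordinary and ridge regression keep every variable with a non-zero coefficient.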

SVR

Support vector regression. The kernel lets it capture non-linear relationships.
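As a small sketch of this, the following fits one period of a sine curve with a linear kernel and with an RBF kernel. The noise-free sine data and C=10.0 are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVR

# One period of a sine wave: clearly non-linear in x
X = np.linspace(0, 2 * np.pi, 100).reshape(-1, 1)
y = np.sin(X).ravel()

r2 = {}
for kernel in ("linear", "rbf"):
    reg = SVR(kernel=kernel, C=10.0).fit(X, y)
    # Coefficient of determination (R^2) on the training data
    r2[kernel] = reg.score(X, y)
    print(kernel, round(r2[kernel], 3))
```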

Clustering

K-means clustering (K Means)

This is a typical clustering method in which the number of clusters is specified in advance as k. It's simple and fast.
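A minimal sketch of k-means, assuming synthetic data from make_blobs with three groups. Note that k (n_clusters) has to be given up front.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with three groups (an assumption for illustration)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# The number of clusters k must be specified in advance
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print("cluster centers:")
print(kmeans.cluster_centers_)
```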

Gaussian mixture model (GMM)

Gives the probability that each point belongs to each cluster. It assumes that each cluster follows a normal distribution.
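A minimal sketch of obtaining those membership probabilities, assuming the same kind of synthetic make_blobs data with three components.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

# One row per sample, one column per component; each row sums to 1
proba = gmm.predict_proba(X)
print(np.round(proba[:5], 3))
```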

Mean Shift

A robust, non-parametric method that uses kernel density estimation. The number of clusters is determined automatically by the kernel width (radius h) you set. For every input point, the center of the points inside a circle of radius h is computed repeatedly, following the principle of the steepest descent method, so the cost tends to be high.

This method is also applied to tasks such as image segmentation and edge-preserving image smoothing.

Mean shift clustering with an infinite kernel width h can also be interpreted as k-means.
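The effect of the kernel width can be checked with a small sketch. scikit-learn's MeanShift takes the width as the bandwidth parameter and uses a flat kernel. The synthetic data and the two bandwidth values (1.0 and 50.0) are assumptions chosen so that the second one is wider than the whole dataset.

```python
import numpy as np
from sklearn.cluster import MeanShift
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

n_clusters = {}
for h in (1.0, 50.0):
    # The number of clusters is not given; it follows from the bandwidth h
    ms = MeanShift(bandwidth=h).fit(X)
    n_clusters[h] = len(np.unique(ms.labels_))
    print(f"bandwidth={h}: {n_clusters[h]} cluster(s)")
```

With a bandwidth wider than the data, every point is pulled to the same center and only one cluster remains.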

Dimensionality reduction

Principal component analysis (PCA)

It has the advantage of handling sparse matrices quickly. It assumes a normal distribution.
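A minimal sketch of PCA, assuming the iris dataset and a reduction from four dimensions to two.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Project the 4-dimensional data onto its first 2 principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# Fraction of the total variance kept by the 2 components
explained = pca.explained_variance_ratio_.sum()
print(X_2d.shape, round(explained, 3))
```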

Non-negative matrix factorization (NMF)

Only non-negative matrices can be used, but it may be easier to extract features.
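A minimal sketch of NMF on a small random non-negative matrix. The matrix size, n_components=2, and the nndsvda initialization are assumptions for illustration. The last part checks that input containing negative values is rejected.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.random((20, 6))  # entries in [0, 1), so the matrix is non-negative

# Factorize X into W (20 x 2) and H (2 x 6), both non-negative
nmf = NMF(n_components=2, init="nndsvda", max_iter=1000, random_state=0)
W = nmf.fit_transform(X)
H = nmf.components_
print(W.shape, H.shape)

# Shifting the data below zero makes it invalid input for NMF
try:
    NMF(n_components=2).fit(X - 1.0)
    rejected = False
except ValueError:
    rejected = True
print("negative input rejected:", rejected)
```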

You can also use linear discriminant analysis (LDA) and deep learning.

BLAS and LAPACK

These are libraries on which scikit-learn depends. Not only scikit-learn but also various numerical software such as SciPy, as well as programs in other languages such as C and Fortran, depend on BLAS and LAPACK. Most GNU/Linux distributions provide them as packages. They are available from a variety of languages and have a long history, so they are the de facto standard for performing linear algebra operations on a computer.

Specifically, LAPACK provides routines for simultaneous linear equations, least squares, eigenvalue problems, singular value problems, LU decomposition, Cholesky decomposition, QR decomposition, singular value decomposition, Schur decomposition, generalized Schur decomposition, condition number estimation, and inverse matrix calculation, along with various other subroutines.
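From Python you rarely call LAPACK directly. One common route is scipy.linalg, whose functions are backed by LAPACK routines. The sketch below tries a few of the operations listed above on a made-up 2 x 2 matrix, which is an assumption for illustration.

```python
import numpy as np
from scipy import linalg

A = np.array([[4.0, 2.0], [2.0, 3.0]])  # symmetric positive definite
b = np.array([2.0, 1.0])

x = linalg.solve(A, b)               # simultaneous linear equations A x = b
L = linalg.cholesky(A, lower=True)   # Cholesky decomposition, A = L @ L.T
Q, R = linalg.qr(A)                  # QR decomposition, A = Q @ R
U, s, Vt = linalg.svd(A)             # singular value decomposition

print("solution x:", x)
print("singular values:", s)
```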

There are various implementations of BLAS. If you are not especially concerned about speed, I think you will often just use OpenBLAS for the time being.

Parameter selection

Cross-validation and grid search are methods for finding good parameters efficiently. We will make full use of them to repeat the training and evaluation of the model, but I will come back to this later.
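As a preview, a grid search with cross-validation can be sketched with GridSearchCV. The iris dataset, the SVM, and the candidate values in param_grid are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Candidate parameter values; every combination is tried
param_grid = {"kernel": ["linear", "rbf"], "C": [0.1, 1.0, 10.0]}

# Each combination is scored by 5-fold cross-validation
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print(f"best mean CV accuracy: {search.best_score_:.3f}")
```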

Summary

This time, I summarized typical machine learning methods from the viewpoint of what is implemented in the machine learning library scikit-learn. scikit-learn is a very rich, high-quality library, so I think everything else will be easier to understand if you get a firm grasp of it first.
