Last time, in Analysis of time series data of stocks, I touched briefly on machine learning in the second half. Today we will look at the machine learning library scikit-learn.
Speaking of scikit-learn, I have already mentioned it in Simple clustering example, Support vector machine, Solve problems with clustering, Calculate TF-IDF, Visualization of regression model, and Clustering by DBSCAN. This time, however, I will reorganize the functionality of the library itself.
Machine learning has an image of requiring difficult mathematics, but if you use a mature, well-built library, you can apply a method without implementing it yourself. Of course you still need to understand how each method works, but since scikit-learn can fairly be called the de facto standard, I think the best starting point is simply to use it.
As I wrote previously, the biggest merit of using Python is its extremely rich ecosystem for scientific computing and data analysis, including statistics and machine learning. There are excellent libraries: NumPy and SciPy for numerical computation in general, pandas for handling data frames, and matplotlib for visualization. Trying to do the same in another language (such as Ruby or Scala) ranges from daunting to nearly impossible, so it is rarely a realistic option. R is another language for statistics; it has the larger body of existing assets, but Python, being a general-purpose language as well, is in my view the more complete language.
First, let's organize the procedure. The basic flow is: prepare the data, choose a model, fit it to the training data, and evaluate the result.
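As a minimal sketch of this basic flow (the iris dataset and the logistic-regression classifier here are my own illustrative choices, not from the original article):

```python
# Minimal scikit-learn workflow: prepare data, choose a model,
# fit, predict, evaluate.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)          # prepare the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000)  # choose a model
model.fit(X_train, y_train)                # fit
y_pred = model.predict(X_test)             # predict
print(accuracy_score(y_test, y_pred))      # evaluate
```

Every estimator in the library follows this same fit/predict interface, which is what makes swapping methods in and out so easy.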
A good way to grasp what you can do with scikit-learn is to look at the algorithm cheat sheet. The original is on the official site:
Choosing the right estimator http://scikit-learn.org/stable/tutorial/machine_learning_map/
It is useful because you can jump to the explanation of each method from here.
The following is a summary of what you can do.
Let's follow the features one by one.
Support vector machines (SVM) have high generalization performance, and because the kernel function can be selected they can handle a wide variety of data.
It boasts high accuracy in spite of its simplicity.
Random forests have the advantages that overfitting is of little concern and parallel computation is easy.
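To illustrate the classifiers above, here is a hedged sketch comparing an SVM and a random forest (the iris dataset and all parameter values are my own choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# SVM: the kernel function is selectable ('linear', 'rbf', 'poly', ...)
svm = SVC(kernel="rbf").fit(X_tr, y_tr)

# Random forest: resistant to overfitting and easy to parallelize (n_jobs)
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1,
                            random_state=0).fit(X_tr, y_tr)

print("SVM accuracy:", svm.score(X_te, y_te))
print("RF  accuracy:", rf.score(X_te, y_te))
```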
It is a normal linear regression.
Lasso regression builds a model with a small number of variables, on the assumption that some variables are not needed: its L1 penalty drives some coefficients to exactly zero.
Ridge regression is not easily affected by multicollinearity, and its variable selection ability is weaker than that of lasso regression (coefficients shrink but do not become exactly zero).
SVR
It can capture non-linearity through the kernel function.
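A sketch comparing the regression methods above on toy data (the sine-curve dataset and the alpha values are illustrative assumptions, not from the original article):

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, (200, 1))
y = np.sin(X).ravel() + 0.1 * rng.randn(200)   # noisy non-linear target

lin = LinearRegression().fit(X, y)    # ordinary linear regression
lasso = Lasso(alpha=0.1).fit(X, y)    # L1 penalty: some coefficients -> 0
ridge = Ridge(alpha=1.0).fit(X, y)    # L2 penalty: robust to multicollinearity
svr = SVR(kernel="rbf").fit(X, y)     # kernel captures the non-linearity

print("linear R^2:", round(lin.score(X, y), 3))
print("SVR    R^2:", round(svr.score(X, y), 3))
```

On this deliberately non-linear target, the RBF-kernel SVR should fit noticeably better than the straight-line model.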
This is a typical clustering method in which the number of clusters is specified in advance as k. It's simple and fast.
With a Gaussian mixture model you can obtain the probability that each point belongs to each cluster. It assumes normal distributions.
Mean shift is a robust, non-parametric method based on kernel density estimation. The kernel width (radius h) you set determines the number of clusters automatically. Because each center is found by hill-climbing on the estimated density, considering a circle of radius h around every input point, the computational cost tends to be high.
The method is also applied to tasks such as image segmentation and edge-preserving image smoothing.
Mean-shift cluster analysis with infinite kernel width h can also be interpreted as k-means.
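The clustering methods above can be tried side by side; here is a sketch on synthetic blobs (the generated data, k=3, and the bandwidth value are assumptions for illustration):

```python
from sklearn.cluster import KMeans, MeanShift
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# k-means: the number of clusters k is given in advance
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Gaussian mixture: soft assignment, probability of belonging to each cluster
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
proba = gmm.predict_proba(X)          # each row sums to 1

# Mean shift: the kernel width (bandwidth h) determines the cluster count
ms = MeanShift(bandwidth=2.0).fit(X)

print("k-means clusters:   ", len(set(km.labels_)))
print("mean-shift clusters:", len(set(ms.labels_)))
```

Note that only k-means and the Gaussian mixture take the number of clusters as input; mean shift discovers it from the bandwidth.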
It has the advantage of handling sparse matrices quickly. It assumes a normal distribution.
Non-negative matrix factorization (NMF) can only be applied to non-negative matrices, but the extracted features are often easier to interpret.
Linear discriminant analysis (LDA) and deep learning methods are also available.
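As a sketch of the decomposition methods above (the digits dataset and n_components=10 are my own illustrative choices), note that NMF is usable here only because the input is non-negative:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import NMF, PCA

X, _ = load_digits(return_X_y=True)   # pixel intensities are non-negative

pca = PCA(n_components=10).fit(X)     # fast, general-purpose reduction
nmf = NMF(n_components=10, init="nndsvda",
          max_iter=500).fit(X)        # valid only because X >= 0

print("PCA features:", pca.transform(X).shape)
print("NMF features:", nmf.transform(X).shape)
```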
These are the libraries on which scikit-learn ultimately depends. Not only scikit-learn but also various numerical software such as SciPy, as well as programs in other languages such as C and Fortran, depend on BLAS and LAPACK. Most GNU/Linux distributions provide them as packages. Available from many languages and with a long history, they are the de facto standard for performing linear algebra on a computer.
Specifically, LAPACK provides routines for simultaneous linear equations, least squares problems, eigenvalue problems, singular value problems, matrix (LU) decomposition, Cholesky decomposition, QR decomposition, singular value decomposition, Schur decomposition and generalized Schur decomposition, as well as condition number estimation, inverse matrix computation, and various other subroutines.
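You rarely call LAPACK directly from Python, but NumPy's linear-algebra routines are backed by it; a small sketch of the operations listed above:

```python
import numpy as np  # numpy.linalg delegates these operations to LAPACK

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])            # symmetric positive definite
b = np.array([1.0, 2.0])

x = np.linalg.solve(A, b)             # simultaneous linear equations
L = np.linalg.cholesky(A)             # Cholesky decomposition
Q, R = np.linalg.qr(A)                # QR decomposition
U, s, Vt = np.linalg.svd(A)           # singular value decomposition
w = np.linalg.eigvalsh(A)             # symmetric eigenvalue problem
c = np.linalg.cond(A)                 # condition number estimation

print(np.allclose(A @ x, b))          # True
print(np.allclose(L @ L.T, A))        # True
```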
There are various implementations of BLAS, but the typical ones are as follows.
If speed is not critical, I think OpenBLAS is a reasonable default choice for the time being.
Cross-validation and grid search are techniques for finding good hyperparameters efficiently: grid search enumerates candidate parameter combinations, and cross-validation evaluates each one. We make full use of these to repeatedly train and evaluate the model, but I will come back to this later.
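As a preview, here is a hedged sketch of grid search combined with cross-validation (the SVC estimator, the parameter grid, and cv=5 are my own illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Every parameter combination in the grid is scored by 5-fold
# cross-validation; the best one is then refit on the whole dataset.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best CV score:  ", round(search.best_score_, 3))
```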
This time, we surveyed typical machine learning methods from the viewpoint of what is implemented in the machine learning library scikit-learn. scikit-learn is a very rich, high-quality library, so I think getting a firm grasp of it first will make the field easier to understand.