There is a limit to what you can do from scratch. As the saying "standing on the shoulders of giants" goes, I want to treat articles I can refer to as the wisdom of those who came before us and use them to raise my own level.
- Personal notes and links about machine learning ① (Machine learning)
- Personal notes and links about machine learning ② (Deep Learning)
- [Personal notes and links about machine learning ③ (BI / Visualization)](https://qiita.com/CraveOwl/items/7846abccbbaebed6ce63)
There are many machine learning methods, and the following articles are helpful for getting them organized.
- Overview of machine learning techniques learned from scikit-learn
- Thaw! There are many data analysis and machine learning methods, but when should I use them?
Classification
Decision trees do not give particularly high accuracy, but the tree visualization makes the results highly explainable.
- [Decision tree analysis with scikit-learn (CART method)](https://pythondatascience.plavox.info/scikit-learn/scikit-learn%E3%81%A7%E6%B1%BA%E5%AE%9A%E6%9C%A8%E5%88%86%E6%9E%90)
- Decision Tree and Random Forest
- Generate Python code from scikit-learn decision tree / random forest rules
- Machine Learning for Package Users (5): Random Forest
- What I was asked when using Random Forest in practice
- Importance of features that can be calculated by Random Forest
- [Compare Random Forest vs SVM with Python scikit-learn](http://yut.hatenablog.com/entry/20121012/1349997641)
- Verification of tuneRF function behavior
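As a rough illustration of the decision tree and random forest articles above, here is a minimal scikit-learn sketch. The iris data, tree depth, and number of trees are placeholder choices of mine, not taken from the articles.

```python
# A minimal sketch: train a decision tree and a random forest on the iris data,
# draw the tree, and look at the feature importances.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Decision tree: easy to explain thanks to the tree visualization
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print("decision tree accuracy:", tree.score(X_test, y_test))
plot_tree(tree, filled=True)
plt.show()

# Random forest: usually more accurate, and exposes feature importances
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("random forest accuracy:", forest.score(X_test, y_test))
print("feature importances:", forest.feature_importances_)
```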
Regression
- [Machine learning] Regression analysis using scikit-learn
- Linear regression in Python (statmodels, scikit-learn, PyMC3)
- Linear? Non-linear?
A regression model with L1 regularization (Lasso)
SVR
- Multivariable regression model with scikit-learn
- I tried to compare and verify SVR
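To give a feel for the regression methods above (plain linear regression, L1-regularized Lasso, and SVR), here is a minimal scikit-learn sketch on synthetic data; the data and parameters are arbitrary placeholders, not taken from the linked articles.

```python
# A minimal sketch comparing plain linear regression, an L1-regularized
# regression (Lasso), and SVR on the same synthetic data.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [("linear", LinearRegression()),
                    ("lasso (L1)", Lasso(alpha=1.0)),
                    ("svr", SVR(kernel="rbf", C=10.0))]:
    model.fit(X_train, y_train)
    print(name, "R^2:", model.score(X_test, y_test))

# Lasso drives some coefficients exactly to zero, which also works as feature selection.
```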
Clustering
Hierarchical clustering draws a dendrogram (tree diagram) that shows how close the objects are to each other, which makes it easy to see visually into how many clusters the data should be divided. However, the number of objects is practically limited to a few hundred, the range that a dendrogram can still represent; beyond that the diagram becomes hard to read.
In the data mining and big data world, where data volumes have grown enormously, the method has therefore become less popular.
- Heatmap with Dendrogram in Python + matplotlib
- Python: Hierarchical clustering dendrogram drawing and threshold division
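Here is a minimal sketch of hierarchical clustering with SciPy, drawing the dendrogram and then cutting it into a fixed number of clusters; the random data and the choice of Ward linkage are my own placeholder assumptions.

```python
# A minimal sketch: hierarchical clustering with SciPy, drawn as a dendrogram.
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))  # a few dozen objects; dendrograms get unreadable beyond a few hundred

Z = linkage(X, method="ward")  # Ward linkage on Euclidean distances
dendrogram(Z)
plt.show()

labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
print(labels)
```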
K-means is the most famous non-hierarchical clustering technique. If you specify the number of clusters K, how to divide the data is determined automatically as an optimization over the input information.
The biggest characteristic, and weakness, of this method is that the number of clusters K has to be decided in advance. To get around this, improved methods such as K-means++ and X-means (which derives the optimal number of clusters automatically) have also been developed.
It is also used to cluster customers by purchasing tendency, but the splits are often extreme, for example a cluster with tens of thousands of people next to a cluster with only a few, and tuning the parameters to avoid that is difficult, so personally I do not use it much.
- [Cluster analysis with scikit-learn (K-means method)](https://pythondatascience.plavox.info/scikit-learn/%E3%82%AF%E3%83%A9%E3%82%B9%E3%82%BF%E5%88%86%E6%9E%90-k-means)
- I checked the X-means method that automatically estimates the number of clusters
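A minimal K-means sketch with scikit-learn, which also scores several candidate values of K with the silhouette coefficient, one common way to choose K when you do not want to fix it blindly; the blob data and the range of K are placeholders.

```python
# A minimal sketch: K-means with scikit-learn, trying several K and scoring with silhouette.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, "silhouette:", silhouette_score(X, km.labels_))
# K must be fixed in advance; silhouette (or X-means) is one way to pick it.
```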
- Spectral Clustering Story
- I tried spectral clustering
A self-organizing map is a kind of neural network that expresses the similarity of the input data as distance on a map.
Because the result is laid out on a two-dimensional map, the number of clusters has to be chosen as a product of the vertical and horizontal sizes, such as a 3x3 map. (Prime cluster counts such as 5 or 7 can therefore only be arranged as 1x5 or 1x7, which is somewhat awkward.)
Personally, I like this method so much for customer clustering that it is my default choice. Compared with methods such as K-means it is less likely to produce extremely unbalanced splits, and the clusters line up along the vertical and horizontal axes of the map, so anyone can interpret the results easily.
Because the model was devised by Dr. T. Kohonen, it is often called a Kohonen map rather than a self-organizing map (SOM).
- Self-organizing map in Python, NumPy version
- Generative Topographic Mapping (GTM): an upward-compatible method of the self-organizing map (SOM)
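Since one of the links above is a NumPy implementation of SOM, here is a toy NumPy sketch of the idea on a 3x3 map (so at most nine clusters); the data, map size, learning rate, and neighborhood schedule are all placeholder assumptions, not a tuned implementation.

```python
# A minimal NumPy sketch of a self-organizing map (SOM) on a small grid.
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((200, 3))            # e.g. 200 customers x 3 normalized features

rows, cols, dim = 3, 3, data.shape[1]  # a 3x3 map -> at most 9 clusters
weights = rng.random((rows, cols, dim))
grid = np.dstack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"))

n_iter, sigma0, lr0 = 2000, 1.5, 0.5
for t in range(n_iter):
    x = data[rng.integers(len(data))]
    # best matching unit (BMU): the map node whose weights are closest to the sample
    dists = np.linalg.norm(weights - x, axis=2)
    bmu = np.unravel_index(np.argmin(dists), dists.shape)
    # neighborhood radius and learning rate shrink over time
    sigma = sigma0 * np.exp(-t / n_iter)
    lr = lr0 * np.exp(-t / n_iter)
    grid_dist = np.linalg.norm(grid - np.array(bmu), axis=2)
    h = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))[..., None]
    weights += lr * h * (x - weights)

# assign each record to its BMU, i.e. its cell on the map
labels = [np.unravel_index(np.argmin(np.linalg.norm(weights - x, axis=2)), (rows, cols))
          for x in data]
print(labels[:10])
```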
The topic model was originally used in natural language processing as a method of statistical latent semantic analysis for estimating the probability with which words appear in a document, and it is a kind of probabilistic model. When it is applied to numerical data, it is also used for clustering that is not 1:1: for example, a customer does not belong to a single cluster but to several, with the membership probability split, say 60% for cluster A, 30% for cluster B, and so on.
Although there are various methods for topic models, LDA (Latent Dirichlet Allocation) is often used.
Because the model assigns different membership probabilities, it goes well with the idea of "product DNA" (or so I personally think).
-"Statistical Latent Semantics Analysis by Topic Model" Reading Group "Chapter 1 What is Statistical Latent Semantics" -Consider the probability of generating topics and documents with LDA -Machine learning_Latent semantic analysis_Implemented with python -PLSA (Stochastic Latent Semantics)
- [Data science by R] Multidimensional scaling (continued): non-metric MDS
- Notice of release of the Python library for the high-dimensional vector data search technology "NGT"
- Parameter optimization by grid search from scikit-learn
- Easy tuning with grid search function or option for machine learning with R
- Automatically optimize machine learning hyperparameters: Preferred Networks publishes a library
- Hyperparameter automatic optimization tool "Optuna" released
- Optimize CNN hyperparameters with Optuna + Keras
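A minimal grid search sketch with scikit-learn's GridSearchCV; the SVM model and parameter grid are placeholder choices (Optuna, mentioned above, is an alternative for larger search spaces).

```python
# A minimal sketch: hyperparameter tuning with scikit-learn's GridSearchCV.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("best params:", search.best_params_)
print("best CV score:", search.best_score_)
```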
- [Machine learning] Selection of features using RFE
- Feature engineering for machine learning starting with Google Colaboratory, part 1
- Feature engineering for machine learning starting with Google Colaboratory, part 2
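A minimal sketch of feature selection with RFE in scikit-learn; the synthetic data, the logistic regression estimator, and the number of features to keep are placeholder assumptions.

```python
# A minimal sketch: recursive feature elimination (RFE) with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
selector.fit(X, y)

print("selected feature mask:", selector.support_)
print("feature ranking:", selector.ranking_)
```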
- Useful tools when using sklearn from pandas
- Pivot tables
- Parallel processing
- Saving classifiers together
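As a small example of the last point, here is a sketch of saving a fitted classifier with joblib and loading it back; the model and file name are placeholders.

```python
# A minimal sketch: saving and reloading a fitted scikit-learn model with joblib.
import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=0).fit(X, y)

joblib.dump(model, "model.joblib")   # persist the fitted model (file name is arbitrary)
restored = joblib.load("model.joblib")
print(restored.score(X, y))
```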