Started studying: Saturday, December 7th
Teaching materials, etc.:
- Miyuki Oshige, "Details! Python 3 Introductory Note" (Sotec, 2017): completed Thursday, December 19th
- Progate Python course (5 courses in total): completed Saturday, December 21st
- **Andreas C. Müller, Sarah Guido, "Introduction to Machine Learning with Python" (O'Reilly Japan, 2017)**: started Saturday, December 21st
**Generalization and looking at the data first**
- To apply a machine learning model to new data, the model must generalize well.
- Typically, about 25% of the data is set aside as the test set.
- Inspect the data first, to judge whether machine learning is necessary at all and whether the data contains the required information.
- One way to inspect it is to draw a pair plot, e.g. with pandas' scatter_matrix.
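As a concrete illustration of the default 75/25 split and the pair plot, here is a minimal sketch; the iris dataset and figure settings are my own illustrative choices, not from the notes.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
# train_test_split holds out 25% of the data as the test set by default
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

# Pair plot: each panel is a scatter plot of one pair of features,
# colored by class, which helps judge whether the classes are separable
df = pd.DataFrame(X_train, columns=iris.feature_names)
pd.plotting.scatter_matrix(df, c=y_train, figsize=(8, 8), marker='o')
```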
**Classification and regression**
- Supervised learning problems can be roughly divided into two types: classification and regression. If the output is continuous, the problem is regression.
- Look for the sweet spot that gives the best generalization performance in the trade-off between underfitting and overfitting.
**k-Nearest Neighbors (k-NN)**
- Predicts from the closest point(s) in the training dataset.
- A good baseline for small datasets: it often performs well enough without much tuning, so try it before moving on to more advanced techniques.
- However, it does not work well on datasets with many features (hundreds or more), and performance suffers on sparse datasets where most feature values are 0.
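A minimal k-NN baseline might look like the following; the dataset and the n_neighbors value are my own illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# k-NN memorizes the training set and predicts by majority vote
# among the k closest training points
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```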
**Linear models**
- Predict using a linear function of the input features (the image is drawing a line that comes as close as possible to each data point).
- Very effective when there are many features; one of the first algorithms to try.
- A large gap in performance between the training set and the test set is a sign of overfitting; conversely, if the two are very close, it is a sign of underfitting.
- Ridge: a linear regression model with a regularization constraint; less risk of overfitting and high generalization performance.
- Lasso: effectively performs automatic feature selection. Useful, for example, when there are many features but only a few are expected to be important.
- scikit-learn also has an ElasticNet class that combines the two above.
- Logistic regression: a linear model for classification.
- Linear support vector machine (linear SVM): likewise a linear model for classification.
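A sketch of Ridge and Lasso on a regression task, assuming scikit-learn's bundled diabetes dataset and alpha values of my own choosing:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Ridge: L2 regularization shrinks all coefficients toward zero
ridge = Ridge(alpha=1.0).fit(X_train, y_train)
# Lasso: L1 regularization drives some coefficients exactly to zero,
# which acts as automatic feature selection
lasso = Lasso(alpha=0.1, max_iter=10000).fit(X_train, y_train)

print("ridge test R^2:", ridge.score(X_test, y_test))
print("lasso test R^2:", lasso.score(X_test, y_test))
print("features kept by lasso:", np.sum(lasso.coef_ != 0))
```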
**Naive Bayes**
- A family of classifiers that closely resemble linear models. Training is fast, but they can only be used for classification.
- Useful as a baseline model for large datasets where even a linear model takes too long.
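A minimal sketch using scikit-learn's GaussianNB (the dataset choice is mine; scikit-learn also provides BernoulliNB and MultinomialNB):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# GaussianNB fits one mean and variance per feature per class;
# training is a single pass over the data, hence the speed
nb = GaussianNB().fit(X_train, y_train)
print("test accuracy:", nb.score(X_test, y_test))
```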
**Decision trees**
- Widely used for classification and regression tasks.
- Learn a hierarchy of questions that can be answered yes/no ("Is feature a larger than value b?"; somewhat like the game Akinator).
- Can be visualized and are easy to explain; prediction is very fast.
- If the depth of the tree is not constrained, it grows as deep and complex as possible, which tends to cause overfitting and hurt generalization performance.
- The tree can be visualized with export_graphviz from the tree module.
- Feature importances indicate how much each feature contributes to the tree's decisions; note, however, that a feature with zero importance may simply not have been picked by this particular tree rather than being uninformative.
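A sketch of a depth-limited tree with the visualization and importances mentioned above; the dataset and max_depth are my own illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

# Constraining max_depth pre-prunes the tree and limits overfitting
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

# Write the tree in Graphviz .dot format for visualization
export_graphviz(tree, out_file="tree.dot",
                feature_names=iris.feature_names,
                class_names=iris.target_names, filled=True)

# Importances of 0 may just mean the feature was not picked
print(dict(zip(iris.feature_names, tree.feature_importances_)))
```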
**Ensembles of decision trees**
- A technique for building a more powerful model by combining multiple machine learning models.
**Random forests**
- One way to address the decision tree's tendency to overfit the training data.
- One of the most widely used machine learning methods, for both regression and classification.
- Not suitable for high-dimensional, sparse data.
- Overfitting can be reduced by building many decision trees, each overfitted in a different direction, and averaging their predictions.
- Bootstrap sample: data points are drawn at random with replacement, and a decision tree is built on the resulting new dataset; in addition, each split considers only a random subset of the features, controlled by max_features.
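A minimal random forest sketch; the dataset, n_estimators, and max_features settings are illustrative assumptions of mine.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each tree is grown on a bootstrap sample, and each split looks at a
# random subset of the features (controlled by max_features)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```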
**Gradient boosted decision trees**
- Trees are built one after another, each new tree correcting the mistakes of the previous ones.
- Combines a large number of weak learners.
- With well-chosen parameters, performance is better than random forests, but training takes longer.
- Like random forests, it does not work well on high-dimensional, sparse data such as text.
- Parameters such as learning_rate, n_estimators, and max_depth are important.
- Try random forests first (they are more robust); if prediction time is critical, or if you want to squeeze out the last 1% of performance, try gradient boosting.
- For large-scale problems, see the xgboost package and its Python interface.
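A sketch showing the three key parameters in use; the dataset and parameter values are my own illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Shallow trees (max_depth) are added one at a time; learning_rate
# controls how strongly each tree corrects its predecessors' mistakes
gbrt = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                  max_depth=3, random_state=0)
gbrt.fit(X_train, y_train)
print("train accuracy:", gbrt.score(X_train, y_train))
print("test accuracy:", gbrt.score(X_test, y_test))
```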
**Kernelized support vector machines (SVMs)**
- An extension of linear SVMs that allows for more complex models.
- Powerful on medium-sized datasets whose features have similar meanings, but sensitive to parameters.
- Linear models in low dimensions are very restrictive, since a straight line or hyperplane limits flexibility. To make them more flexible, interactions (products) of input features and polynomial terms are used.
- The polynomial kernel computes all polynomials up to a specific degree of the original features; the radial basis function (RBF) kernel, also known as the Gaussian kernel, is another common choice.
- Only the training points that lie on the boundary between the classes determine the decision boundary; these points are called support vectors (the origin of the name).
- Differences in the scale of the features have a devastating effect on SVMs. One way to deal with this is MinMaxScaler, which rescales all features to roughly the same scale (between 0 and 1).
- A strength is that complex decision boundaries can be generated even when the data has only a few features.
- The drawbacks are that data preprocessing and parameter tuning must be done carefully (which is why many applications use decision-tree-based models such as gradient boosting instead), and that it is hard to inspect why a particular prediction was made, making it difficult to explain to non-experts.
- Still, SVMs are worth trying on measurements with homogeneous features, such as camera pixels.
- The parameters gamma (the inverse of the width of the Gaussian kernel) and C (the regularization parameter) are important.
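A sketch of an RBF-kernel SVM with MinMaxScaler preprocessing; the dataset and the C and gamma values are illustrative assumptions of mine.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the scaler on the training set only, then apply the same
# transformation to the test set (all features end up in [0, 1])
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# gamma: inverse width of the Gaussian (RBF) kernel; C: regularization
svm = SVC(kernel="rbf", C=10, gamma=0.1).fit(X_train_scaled, y_train)
print("test accuracy:", svm.score(X_test_scaled, y_test))
```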
**Neural networks (multilayer perceptron, MLP)**
- Effective for particularly large datasets, but sensitive to parameters, and training takes time.
- In an MLP, the weighting of the inputs toward the output is key: between input and output there are intermediate processing steps (hidden units) that compute weighted sums, and further weighted sums of those values are computed to produce the output.
- Since this chain of computations is mathematically still equivalent to a single weighted sum, a nonlinear function is applied to the intermediate results to make the model more powerful than a linear one. Usually relu (rectified linear unit) or tanh (hyperbolic tangent) is used.
- By default, the MLP uses a single hidden layer with 100 hidden units, but this should be adjusted to the size of the dataset.
- As with SVMs, the data must be rescaled; the book uses StandardScaler. Training an MLP with the default settings can produce the following warning:
```
ConvergenceWarning:
Stochastic Optimizer: Maximum iterations reached and the optimization
hasn't converged yet.
```
- The warning above comes from the adam algorithm used to train the model; it means the number of training iterations should be increased. Generalization performance may also improve by strengthening the regularization on the weights via the alpha parameter.
- For more flexible or larger models, use keras, Lasagne, or tensorflow.
- Given enough computation time, enough data, and careful parameter tuning, neural networks often outperform other machine learning algorithms. But this is also a drawback: large, powerful networks take a very long time to train, and parameter tuning is an art in itself.
- Decision-tree-based models perform better on heterogeneous data with features of various types.
- The number of hidden layers and the number of hidden units per layer are the most important parameters.
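A sketch that addresses the warning by raising max_iter and strengthens regularization via alpha, with StandardScaler preprocessing as noted above; the dataset and parameter values are my own illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Rescale to zero mean and unit variance, using training statistics only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Raising max_iter resolves the ConvergenceWarning; a larger alpha
# strengthens the regularization on the weights
mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000,
                    alpha=1.0, random_state=0)
mlp.fit(X_train_scaled, y_train)
print("test accuracy:", mlp.score(X_test_scaled, y_test))
```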
Finished up to [Chapter 2: Supervised Learning (p. 126)]