Mathematics for ML

Linear Models Let \hat{y} be the predicted value, the vector w = (w_1, w_2, ..., w_p) be the coefficients (coef_), and w_0 be the intercept (intercept_).

\hat{y}(w, x) = w_0 + w_1 x_1 + ... + w_p x_p

Ordinary Least Squares Find the coefficients w that minimize the residual sum of squares. Here the L2 norm is the ordinary Euclidean distance.

\min_{w} || X w - y||_2^2

sklearn

class sklearn.linear_model.LinearRegression(*, fit_intercept=True, normalize=False, copy_X=True, n_jobs=None)

Implementation
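
A minimal fitting sketch with scikit-learn; the toy data below is a hypothetical example, not from the original.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data roughly following y = 1 + 2x
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.1, 4.9, 7.2])

reg = LinearRegression().fit(X, y)
print(reg.intercept_)        # w_0
print(reg.coef_)             # w_1, ..., w_p
print(reg.predict([[4.0]]))  # \hat{y} for a new sample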

Ridge Regression A regularization term, the square of the L2 norm of the coefficients, is added to the loss function. This suppresses the magnitude of the coefficients, which helps prevent overfitting.

\min_{w} || X w - y||_2^2 + \alpha ||w||_2^2

sklearn

class sklearn.linear_model.Ridge(alpha=1.0, *, fit_intercept=True, normalize=False, copy_X=True, max_iter=None, tol=0.001, solver='auto', random_state=None)
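
As a sketch, fitting Ridge while varying alpha (the regularization strength; the data and alpha values here are assumptions) shows the coefficients shrinking as alpha grows.

import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical toy data with correlated features
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.1], [3.0, 2.9]])
y = np.array([0.0, 2.0, 4.1, 6.0])

for alpha in (0.1, 1.0, 10.0):
    ridge = Ridge(alpha=alpha).fit(X, y)
    print(alpha, ridge.coef_)  # larger alpha -> smaller coefficient magnitudes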

Lasso Regression A regularization term based on the L1 norm (Manhattan distance) is added to the loss function. Because some coefficients are driven exactly to 0, it can also reduce the number of features used.

\min_{w} { \frac{1}{2n_{\text{samples}}} ||X w - y||_2 ^ 2 + \alpha ||w||_1}
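
A sketch of the sparsity effect; with a sufficiently large alpha (the data and alpha value are assumptions), some coefficients become exactly 0.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.randn(50, 5)   # 5 hypothetical features; only the first two are informative
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.randn(50)

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)     # coefficients of the uninformative features are driven to (or very near) 0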

Multi-task Lasso

Elastic-Net A regularization term combining the L1 norm and the L2 norm is added. It reduces to Ridge regression when ρ = 0 and to Lasso regression when ρ = 1.

\min_{w} { \frac{1}{2n_{\text{samples}}} ||X w - y||_2 ^ 2 + \alpha \rho ||w||_1 +
\frac{\alpha(1-\rho)}{2} ||w||_2 ^ 2}
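
In scikit-learn, ρ corresponds to the l1_ratio parameter of ElasticNet; a minimal sketch (the data, alpha, and l1_ratio values are assumptions).

import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.RandomState(0)
X = rng.randn(50, 5)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.randn(50)

# l1_ratio=1.0 behaves like Lasso; l1_ratio close to 0 behaves like Ridge
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(enet.coef_)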

Multi-task Elastic-Net

Least Angle Regression (LARS)

Orthogonal Matching Pursuit (OMP) The fit is performed under a stopping condition that bounds the number of non-zero coefficients.

\underset{w}{\operatorname{arg\,min\,}}  ||y - Xw||_2^2 \text{ subject to } ||w||_0 \leq n_{\text{nonzero\_coefs}}
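
A sketch with scikit-learn's OrthogonalMatchingPursuit, where n_nonzero_coefs is the stopping condition from the formula above (the data and the value 2 are assumptions).

import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.RandomState(0)
X = rng.randn(100, 10)
y = 2.0 * X[:, 3] - 1.5 * X[:, 7] + 0.01 * rng.randn(100)

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=2).fit(X, y)
print(np.nonzero(omp.coef_)[0])  # indices of the selected (non-zero) coefficients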

Bayesian Regression

p(y|X,w,\alpha) = \mathcal{N}(y|X w,\alpha)
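
A sketch with scikit-learn's BayesianRidge on hypothetical toy data; besides the mean prediction, it can also return a per-sample standard deviation.

import numpy as np
from sklearn.linear_model import BayesianRidge

# Hypothetical toy data
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.1, 4.9, 7.2])

bayes = BayesianRidge().fit(X, y)
mean, std = bayes.predict([[4.0]], return_std=True)
print(mean, std)  # predictive mean and standard deviation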

Logistic Regression Despite the name, it is a classification method: a statistical regression model for variables that follow the Bernoulli distribution, with the logit as the link function.
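
A minimal classification sketch with scikit-learn's LogisticRegression on hypothetical binary-labeled data.

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])  # hypothetical Bernoulli-distributed labels

clf = LogisticRegression().fit(X, y)
print(clf.predict([[1.5], [3.5]]))        # predicted class labels
print(clf.predict_proba([[1.5], [3.5]]))  # class probabilities via the logistic (inverse logit) function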


K-nearest Neighbors (k-NN)

Application

Predicting user preferences for items such as movies, music, search results, and shopping. Two common approaches are Collaborative Filtering, which makes predictions from the preferences of similar users, and Content-based Filtering, which makes predictions from the items the user has liked in the past.
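
A minimal sketch with scikit-learn's KNeighborsClassifier (the data and n_neighbors value are assumptions).

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]])
y = np.array([0, 0, 0, 1, 1, 1])  # hypothetical two-cluster labels

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[1.5], [8.5]]))  # majority vote among the 3 nearest neighbors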

Q-Learning

A learning algorithm for the state-action value Q(s, a), where s is the state, a is the action, and r is the reward. In the equations below, α is the learning rate and γ is the discount rate. Q(s_t, a_t) is updated step by step at the rate α, and the maximum Q value of the next state s_{t+1} is used, discounted by γ.

Q(s_t, a_t) \leftarrow (1-\alpha)Q(s_t, a_t) + \alpha(r_{t+1} + \gamma \max_{a_{t+1}}Q(s_{t+1}, a_{t+1}))\\

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha(r_{t+1} + \gamma \max_{a_{t+1}}Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t))
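
The two forms above are equivalent. A minimal tabular sketch on a hypothetical 4-state chain (action 1 moves right, action 0 moves left, reward 1 on reaching the last state); the environment and the α and γ values are assumptions for illustration. Since Q-learning is off-policy, the behavior policy here is simply uniform random.

import numpy as np

n_states, n_actions = 4, 2
alpha, gamma = 0.1, 0.9              # learning rate and discount rate
Q = np.zeros((n_states, n_actions))
rng = np.random.RandomState(0)

for episode in range(300):
    s = 0
    while s != n_states - 1:
        a = rng.randint(n_actions)   # uniform random behavior policy (off-policy)
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # update toward r + gamma * max_a' Q(s', a')
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(Q)  # the "move right" column should dominate in every state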

Sarsa

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha(r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t))
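
The same hypothetical chain as in the Q-learning sketch, with the Sarsa update: the difference is that the Q value of the action actually chosen in the next state is used (on-policy), not the maximum. The ε-greedy policy and parameter values are assumptions.

import numpy as np

n_states, n_actions = 4, 2
alpha, gamma, epsilon = 0.1, 0.9, 0.3
Q = np.zeros((n_states, n_actions))
rng = np.random.RandomState(0)

def policy(s):
    # epsilon-greedy action selection on the current Q table
    return rng.randint(n_actions) if rng.rand() < epsilon else int(np.argmax(Q[s]))

for episode in range(300):
    s = 0
    a = policy(s)
    while s != n_states - 1:
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0
        a_next = policy(s_next)  # the action actually taken in the next state
        # update toward r + gamma * Q(s', a') with the chosen a', not the maximum
        Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
        s, a = s_next, a_next

print(Q)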

Monte Carlo method

Returns(s, a) \leftarrow append(Returns(s, a), r)\\
Q(s, a) \leftarrow average(Returns(s, a))
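
A sketch of the every-visit Monte Carlo version on the same hypothetical chain: the discounted return following each (s, a) in an episode is appended to Returns(s, a), and Q(s, a) is re-estimated as its average. The random policy and environment details are assumptions.

import numpy as np
from collections import defaultdict

n_states, n_actions = 4, 2
gamma = 0.9
Q = np.zeros((n_states, n_actions))
returns = defaultdict(list)                 # Returns(s, a)
rng = np.random.RandomState(0)

for episode in range(300):
    s, trajectory = 0, []
    while s != n_states - 1:                # generate one episode with a random policy
        a = rng.randint(n_actions)
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0
        trajectory.append((s, a, r))
        s = s_next
    g = 0.0
    for s, a, r in reversed(trajectory):    # discounted return following each step
        g = r + gamma * g
        returns[(s, a)].append(g)           # Returns(s, a) <- append(Returns(s, a), g)
        Q[s, a] = np.mean(returns[(s, a)])  # Q(s, a) <- average(Returns(s, a))

print(Q)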
