E qualification report bullet points

Linear algebra

〇April 25 - Matrix 04
- Vector … magnitude + direction
- Scalar … magnitude only
- Matrix … a way to express simultaneous equations concisely, by extracting only their coefficients

〇April 26 - Matrix 09
- Matrix calculation … analogous to manipulating simultaneous equations. To solve simultaneous equations, you can multiply by specific matrices (elementary matrices).
- Inverse matrix: a matrix that acts like a reciprocal, similar to dividing by a matrix … written A⁻¹ ("A inverse") … multiplying by it converts a matrix into the identity matrix.
- Identity matrix: a matrix (e.g. [[1, 0], [0, 1]]) that leaves any matrix unchanged under multiplication.
- How to find the inverse matrix … Gaussian elimination (the sweep-out method); a sketch follows.
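A minimal numpy sketch of these ideas, with a made-up 2×2 matrix. `np.linalg.inv` internally uses an LU factorization, which is a systematic form of Gaussian elimination:

```python
import numpy as np

# Find the inverse of a 2x2 matrix and confirm that A @ A_inv
# gives the identity matrix (the "1,0 / 0,1" matrix in the notes).
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

A_inv = np.linalg.inv(A)                  # LU-based, i.e. Gaussian elimination
print(A_inv)
print(np.allclose(A @ A_inv, np.eye(2)))  # True: A times its inverse is I
```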

〇April 27 - Eigenvalue ②
- Inverse matrix … written with a superscript −1 and read "inverse".
- Condition for the inverse matrix not to exist … when the solution of the simultaneous equations is not uniquely determined; i.e. for a 2×2 matrix with rows (a, b) and (c, d), when a : b = c : d, i.e. ad = bc (determinant ad − bc = 0), i.e. when the parallelogram spanned by the matrix's two row vectors has area 0.

Properties of the determinant as a function of its row vectors (linearity):
- If two rows are identical, the determinant is 0.
- If one row is multiplied by a constant, the whole determinant is multiplied by that constant.
- If one row is a sum of vectors, the determinant splits into a sum over that row.
- If two rows are swapped, the sign of the whole determinant flips.
- One row vector can be expanded into multiple vectors.
Calculation for a square matrix … a 3×3 (cubic) determinant can be reduced to 2×2 (quadratic) determinants by cofactor expansion. A short demo of these properties follows.
Eigenvectors and eigenvalues: when a certain vector is multiplied by a matrix, the result can be expressed as that same vector scaled by a specific scalar (Av = λv); the eigenvector is determined only up to a constant multiple.
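A small numpy sketch checking these determinant properties on an arbitrary 2×2 matrix:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# Swapping two rows flips the sign of the determinant.
swapped = A[[1, 0]]
print(np.linalg.det(A), np.linalg.det(swapped))   # -2.0 and 2.0

# Scaling one row scales the whole determinant by the same factor.
scaled = A.copy(); scaled[0] *= 3
print(np.linalg.det(scaled))                      # -6.0

# Two identical rows -> determinant 0 -> parallelogram area 0 -> no inverse.
singular = np.array([[1.0, 2.0],
                     [1.0, 2.0]])
print(np.linalg.det(singular))                    # 0.0
```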

〇April 29 - Eigenvalue 7
- Eigenvalue: comes out as a specific numerical value.
- Eigenvectors … can only be determined up to a constant multiple.
- Eigenvalue decomposition … AV = VΛ, so A = VΛV⁻¹. When multiplying many copies of A together, the V and V⁻¹ in the middle cancel, so only powers of the diagonal matrix Λ need to be computed, which makes the calculation much easier (see the sketch below).
- Variance: how much a single data series is scattered.
- Covariance: the difference in trends between two data series. Positive → similar tendencies, negative → opposite tendencies, zero → unrelated.
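A sketch of eigenvalue decomposition with numpy, using a made-up matrix, confirming A = VΛV⁻¹ and the shortcut A³ = VΛ³V⁻¹:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

eigvals, V = np.linalg.eig(A)     # columns of V are the eigenvectors
Lam = np.diag(eigvals)            # Λ: eigenvalues on the diagonal

# Eigenvalue decomposition: A = V Λ V⁻¹
print(np.allclose(A, V @ Lam @ np.linalg.inv(V)))   # True

# Repeated multiplication becomes cheap: A³ = V Λ³ V⁻¹,
# and Λ³ only needs the diagonal entries cubed.
A_cubed = V @ np.diag(eigvals ** 3) @ np.linalg.inv(V)
print(np.allclose(A_cubed, A @ A @ A))               # True
```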

〇April 30 - Eigenvalue 12, Probability / Statistics 1-23
- Singular value decomposition … T means transpose: the matrix with its rows and columns flipped over. If MMᵀ is decomposed into eigenvalues, its left singular vectors and the squares of the singular values are obtained.
- Eigenvalue decomposition and singular value decomposition are used as data-compression techniques in the field of image analysis; they are one way to approximate data well. Using singular value decomposition, a PC can judge that two images are similar to each other. From the compressed data alone it is not possible to tell whether two images are the same, because the data differ; however, if the large singular values are compared and found to be similar, it may be possible to judge that the images are the same. (This may be used for unsupervised learning of images.) A sketch follows.
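A hedged numpy sketch, with a random matrix standing in for a grayscale image: keep only the k largest singular values as a rank-k compression, and confirm that the singular values are the square roots of the eigenvalues of MMᵀ:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((50, 40))          # stand-in for a grayscale image

U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Keep only the k largest singular values: a rank-k approximation.
k = 5
M_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]
print(np.linalg.norm(M - M_approx))              # approximation error

# Singular values are the square roots of the eigenvalues of M Mᵀ,
# matching the note about decomposing M Mᵀ.
eigvals = np.linalg.eigvalsh(M @ M.T)[::-1]      # descending order
print(np.allclose(np.sqrt(eigvals[:k]), s[:k]))  # True

# Two images with similar large singular values may be judged similar.
print(s[:k])
```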

- Set … S = {a, b, c, d, e, f, g, …}; a ∈ S means a is one of the "elements" of S (the smallest unit).
- Union … "A or B", written A ∪ B ("A cup B").
- Intersection, the common part (not to be confused with the product set!) … "A and B", written A ∩ B ("A cap B").
- Absolute complement … U \ A, written Ā (everything other than A); the negation of A itself.
- Relative complement … B \ A (the part of B outside A).
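These operations map directly onto Python's built-in set type; a minimal sketch with the example elements above:

```python
# The set operations above, written with Python sets.
U = {"a", "b", "c", "d", "e", "f", "g"}   # the universal set
A = {"a", "b", "c"}
B = {"c", "d", "e"}

print("a" in A)      # membership: a ∈ A
print(A | B)         # union A ∪ B
print(A & B)         # intersection A ∩ B (the "common part")
print(U - A)         # absolute complement: everything in U outside A
print(B - A)         # relative complement B \ A
```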

Probability
- Frequency probability (objective probability) … a probability you can check by repeating the trial many times.
- Bayesian probability (subjective probability): probability expressed as a degree of belief, used when a 100% survey cannot be performed … e.g. "the probability of influenza is 80%".
- P(A) = n(A) / n(U), where A = event, U = universe (sample space), P = probability, n = number of elements.
- Conditional probability: P(A | B) = P(A ∩ B) / P(B), e.g. the probability of having a traffic accident given that it is raining; the tricky part is deciding what plays the role of U. Joint probabilities of independent events are easy to calculate by multiplication (see the sketch below).
- Descriptive statistics … finding the properties of the whole population from its data.
- Inferential statistics: statistics that infer the properties of the whole population from the data of some extracted samples.
- Random variable: a numerical value associated with an event (sometimes identified with the event itself).
- Probability distribution: the distribution of the probability that each event occurs; if the variable is discrete, it can be shown as a table.
- Expected value … can be calculated with a sum (Σ) for discrete variables or an integral for continuous ones.
- Variance … its dimension is higher than the data's because the deviations have been squared.
- Standard deviation (lowercase sigma, σ) … take the square root so that the dimension matches the original data.
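A simulation sketch of the rain-and-accident example: the probabilities here (30% rain, 2% accident risk in rain, 0.5% otherwise) are made-up numbers purely for illustration, checking that P(A | B) = P(A ∩ B) / P(B):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Hypothetical probabilities, chosen only for illustration.
rain = rng.random(n) < 0.3                               # event B: it rains
accident = rng.random(n) < np.where(rain, 0.02, 0.005)   # accident risk rises in rain

# P(A|B) estimated via P(A ∩ B) / P(B) vs. estimated directly.
p_joint = np.mean(rain & accident)
p_rain = np.mean(rain)
print(p_joint / p_rain)            # ≈ 0.02
print(np.mean(accident[rain]))     # same estimate, computed directly
```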

Probability distributions
- Bernoulli distribution … the image of a coin toss.
- Multinoulli distribution (category distribution, categorical distribution) … the image of rolling a die.
- Binomial distribution: repeated Bernoulli trials.
- Gaussian distribution: a bell-shaped continuous distribution. Data usually approaches this shape as the number of samples increases, so if you know nothing about the data you often assume a Gaussian. Its density function is constructed so that, when normalized, the total area under the curve is 1.
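A sketch sampling from each of these with numpy, plus a numerical check that the standard Gaussian density has total area 1:

```python
import numpy as np

rng = np.random.default_rng(0)

coin = rng.binomial(n=1, p=0.5, size=10)     # Bernoulli: one coin toss each
dice = rng.choice(6, size=10)                # Multinoulli: a die roll (faces 0-5)
heads = rng.binomial(n=100, p=0.5, size=10)  # Binomial: 100 Bernoulli trials each

# Standard Gaussian density, normalized so the area under the curve is 1.
x = np.linspace(-10, 10, 100_001)
pdf = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
dx = x[1] - x[0]
print((pdf * dx).sum())                      # ≈ 1.0
```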

Estimation: there are two types, point estimation and interval estimation. Point estimation estimates a single value such as the population mean; interval estimation estimates the range within which a value such as the mean lies.

- Sample mean: the mean of samples taken from the population.
- Consistency … the larger the number of samples, the closer the estimate gets to the population value.
- Unbiasedness … no matter the sample size, the expected value of the estimator equals the population value. With E = expected value, θ = population parameter, θ̂ (theta hat) = estimator: E(θ̂) = θ.

Sample variance … consistency is satisfied, but unbiasedness is not! … That is, the population variance and the sample variance computed from a subset of samples do not match in expectation.

Unbiased variance: multiplying the sample variance by n/(n−1) brings its expected value to the population variance. … Because the deviations are taken from the sample mean, the sample values cannot be chosen completely freely: once n−1 of them are fixed, the last value is already determined. Therefore we divide by n−1 instead of n. When only a small amount of data is available this correction makes a large difference, but as the number of samples grows, 1/n and 1/(n−1) become almost the same, so the effect fades (see the sketch below).
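A numpy sketch of this bias, with a made-up population (true variance 4.0): numpy's `ddof` argument switches between dividing by n and by n−1:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 10
trials = 100_000
samples = rng.normal(0, 2, size=(trials, n))   # population variance is 4.0

sample_var = samples.var(axis=1, ddof=0)       # divide by n
unbiased_var = samples.var(axis=1, ddof=1)     # divide by n-1

print(sample_var.mean())    # ≈ 3.6: underestimates the true 4.0
print(unbiased_var.mean())  # ≈ 4.0: the n/(n-1) correction removes the bias
```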

The amount of increase is the same, but how it is perceived differs … because the amount of change relative to the baseline is different. The rate of increase is what matters. I realized that human senses judge how easy information is to grasp by "ratio".

Self-information … when the base of the logarithm is 2, the unit is the bit … when the base is Napier's number e (the natural logarithm), the unit is the nat. I(x) = −log(P(x)) = log(W(x)), where W(x) = 1/P(x). The logarithm matches the intuition that information is felt as a ratio, which makes it sensuously satisfying.

Shannon entropy … the expected value of the self-information (for continuous variables, the analogue is the differential entropy). H(x) = E(I(x)) = −E(log(P(x))) = −Σ P(x) log(P(x)). A sketch follows.
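A sketch computing self-information in bits and nats for a made-up discrete distribution, and Shannon entropy as its expected value:

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])   # a discrete distribution

# Self-information I(x) = -log P(x); base 2 gives bits, base e gives nats.
info_bits = -np.log2(p)
info_nats = -np.log(p)
print(info_bits)        # [1. 2. 3. 3.]

# Shannon entropy: the expected value of the self-information.
H = np.sum(p * info_bits)
print(H)                # 1.75 bits
```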

Kullback-Leibler divergence … represents the difference between two probability distributions P and Q over the same event / random variable: D(P‖Q) = Σ P(x) log(P(x)/Q(x)).

Cross entropy … can be expressed using the KL divergence: H(P, Q) = H(P) + D(P‖Q) = −Σ P(x) log(Q(x)). A sketch of both follows.
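A sketch with two made-up distributions, checking that cross entropy = entropy + KL divergence:

```python
import numpy as np

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.4, 0.4, 0.2])

kld = np.sum(P * np.log(P / Q))            # D(P‖Q)
entropy = -np.sum(P * np.log(P))           # H(P)
cross_entropy = -np.sum(P * np.log(Q))     # H(P, Q)

# Cross entropy decomposes into entropy plus KL divergence.
print(np.isclose(cross_entropy, entropy + kld))   # True
```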

Machine learning report

〇May 2 ~ ML_05_04_ Hands-on (house price forecast)
- Carried out separately for various models.
- Training data … suffixed train; validation data … suffixed test.
- The hat (ŷ) is attached only to estimated values, so that they are not mixed up with the actual data.

How to fit the parameters
- Mean squared error (MSE) … a numerical value determined only by the squared error between the data and the model output, and by the parameters.
- Least squares … find the parameters that minimize the mean squared error, i.e. the point where the gradient becomes 0.

If you use a library, you can fit by mean squared error and so on just by calling fit, but it is important to know what is actually happening behind the scenes (see the sketch below).
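A sketch contrasting the one-line library route with the normal equation it conceptually corresponds to; the data here is made up (a single "house size" feature with a known slope and intercept):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.random((100, 1)) * 10                    # e.g. house size
y = 3.0 * X[:, 0] + 5.0 + rng.normal(0, 1, 100)  # price with noise

# Library route: one call to fit.
model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)

# Behind the scenes: minimizing MSE means setting the gradient to 0,
# which gives the normal equation w = (XᵀX)⁻¹ Xᵀ y.
X1 = np.hstack([X, np.ones((100, 1))])           # add a bias column
w = np.linalg.solve(X1.T @ X1, X1.T @ y)
print(w)                                         # same slope and intercept
```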

〇May 3 Regression, non-linear regression
- Basis function = the transformed variable. A nonlinear regression model uses e.g. polynomial bases (power functions) and Gaussian bases (exponential functions); in the Gaussian basis, the bandwidth changes depending on h_j (see the sketch below).
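A sketch of a Gaussian basis function; the center μ_j and bandwidth h_j here are hypothetical values to show how h_j controls the width:

```python
import numpy as np

def gaussian_basis(x, mu_j, h_j):
    """Gaussian basis function; h_j is the bandwidth from the notes."""
    return np.exp(-((x - mu_j) ** 2) / (2 * h_j ** 2))

x = np.linspace(-3, 3, 7)
print(gaussian_basis(x, mu_j=0.0, h_j=0.5))  # narrow bump around the center
print(gaussian_basis(x, mu_j=0.0, h_j=2.0))  # larger h_j -> wider bandwidth
```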

Regularization … an L1-norm (first-order) penalty gives Lasso regression; an L2-norm (second-order) penalty gives Ridge regression. Geometrically, each finds the point where the error function's contours touch the constraint region. Ridge regression, whose constraint region is a circle, is called "shrinkage estimation". Lasso regression, whose constraint region is a square (diamond), is called "sparse estimation": because the touching point often lies on an axis, some coefficients become exactly 0, so the model may be simplified by dropping those variables. A sketch follows.
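A scikit-learn sketch on made-up data with three useless features, showing that Ridge only shrinks coefficients while Lasso drives weak ones to exactly 0:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.1, 100)  # 3 useless features

print(Ridge(alpha=1.0).fit(X, y).coef_)   # all shrunk, none exactly 0
print(Lasso(alpha=0.1).fit(X, y).coef_)   # useless coefficients become exactly 0
```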

Logistic regression … although it is named "regression", it is an algorithm for classification (binary classification).
- Uses the sigmoid function: a monotonically increasing function that takes values between 0 and 1. If the objective variable is 0 the prediction is death, and if 1 it is survival (in the case of the Titanic model).
- The sigmoid is σ(x) = 1 / (1 + exp(−ax)). When the parameter a becomes large the curve approaches a staircase; when a is made small the slope becomes a gentle function. The derivative of the sigmoid function can be expressed using the sigmoid function itself (see the sketch below).
- Logistic regression uses the Bernoulli distribution. Bernoulli distribution: a discrete probability distribution with probability p on one side and 1 − p on the other. The generated data depends on the value of the parameter p: P(y) = pʸ(1 − p)¹⁻ʸ.
- Maximum likelihood estimation: a method of point estimation, from given data, of the parameters of the probability distribution the data follows.
- Joint probability: since the random variables can be assumed independent, it can be calculated by multiplication.
- Likelihood function: fix the data and vary the parameters to find the optimal ones … the estimation method that maximizes this function is called maximum likelihood estimation. We want the gradient with respect to the parameters by differentiating the likelihood function, but since it is a product over the data, we take the logarithm first so the product becomes a sum. It has been proved that the maximizer of the likelihood function is unchanged when the logarithm is taken (proof omitted).
- Gradient descent: a method of updating the parameters sequentially … if all the data is loaded for a single update, memory resources may run out, so stochastic gradient descent aims to solve this.
- Stochastic gradient descent (SGD) … looks at only one or a few data points per update, not all the data … in the case of logistic regression the objective function is well-behaved because the model is the monotonically increasing sigmoid, so this method is often used. If the objective were a function with many peaks and valleys, like a cubic, SGD would not be very useful. (A numpy implementation sketch appears after the Titanic notes below.)
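A sketch of the sigmoid with the parameter a, showing the staircase effect and the derivative written in terms of the sigmoid itself:

```python
import numpy as np

def sigmoid(x, a=1.0):
    """σ(x) = 1 / (1 + exp(-a x)); larger a makes it more staircase-like."""
    return 1.0 / (1.0 + np.exp(-a * x))

x = np.linspace(-5, 5, 11)
print(sigmoid(x, a=1.0))    # gentle slope
print(sigmoid(x, a=10.0))   # close to a step function

# The derivative is expressible through the sigmoid itself:
# dσ/dx = a · σ(x) · (1 - σ(x)).
s = sigmoid(x, a=2.0)
derivative = 2.0 * s * (1 - s)
print(derivative)
```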

How to validate the model (confusion matrix)
- True positive
- False negative (the model mistakenly judges a positive case as negative) … truly abnormal items slip through.
- False positive (the model mistakenly judges a negative case as positive) … it is necessary to check whether the item is really abnormal.
- True negative

- Accuracy … the proportion of all judgments that were correct.
- Recall: what percentage of the actually positive cases could be judged positive? Mark borderline cases positive and check them later; this value matters when you want to prevent misses even at the cost of many false alarms (e.g. a cancer case must not accidentally slip through; additional data can be re-examined afterwards).
- Precision: the percentage of the cases the algorithm judged positive that were really positive. Mark only confident cases as positive (e.g. I don't want non-spam emails flagged as spam, so the algorithm flags only the mails it is confident about).
- F value: both recall and precision should be high, but since there is a trade-off between the two, the F value is the harmonic mean of both. The higher the F value, the higher both recall and precision are. (See the sketch below.)
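A sketch computing the four metrics from confusion-matrix counts; the counts are made-up numbers:

```python
# The four metrics from confusion-matrix counts (hypothetical numbers).
tp, fn, fp, tn = 40, 10, 20, 930

accuracy = (tp + tn) / (tp + fn + fp + tn)  # share of all judgments correct
recall = tp / (tp + fn)         # of the actually-positive cases, how many caught
precision = tp / (tp + fp)      # of the predicted-positive cases, how many correct
f_value = 2 * recall * precision / (recall + precision)  # harmonic mean

print(accuracy, recall, precision, f_value)
```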

Titanic hands-on … the implementation of logistic regression in numpy is not shown in the video, but it is asked in the actual exam, so the code needs to be checked (a sketch follows). It's OK as long as you understand the algorithm; there are few questions about how to visualize. … Values are easy to compute with the scikit-learn model, but the result cannot be explained unless you can also calculate the probability for each data point yourself. … When I created a new variable by combining the passenger-class and gender data, I was able to lower the dimension of the result and give an easy-to-understand explanation.
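A minimal numpy sketch of logistic regression trained by gradient descent on the negative log-likelihood; the data here is a made-up toy problem, not the Titanic set:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)   # toy binary labels

X1 = np.hstack([X, np.ones((200, 1))])            # bias column
w = np.zeros(3)
lr = 0.1

for _ in range(1000):
    p = 1.0 / (1.0 + np.exp(-X1 @ w))   # sigmoid gives P(y=1)
    grad = X1.T @ (p - y) / len(y)      # gradient of negative log-likelihood
    w -= lr * grad                      # gradient descent update

p = 1.0 / (1.0 + np.exp(-X1 @ w))
print(w)                                 # recovers the separating direction
print(np.mean((p > 0.5) == y))           # training accuracy
```

Swapping the full-batch gradient for the gradient of one (or a few) randomly chosen data points per update turns this into the SGD described above.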

Principal component analysis: one of the dimensionality-reduction methods. We want to reduce the dimensions without lowering the explanatory power of the factors. … If we regard the amount of information as the magnitude of the variance, we can find the projection axis that maximizes the variance of the variables after the linear transformation.
- Lagrange function: the Lagrange-multiplier method for optimizing under a constraint; as usual, it comes down to finding the point where the gradient becomes zero.
- Differentiating the Lagrange function … gives the same form as an eigenvalue problem: the vector that maximizes the variance is an eigenvector of the variance-covariance matrix, Var(X)aⱼ = λⱼaⱼ.
- Contribution rate: a value that indicates how much of the information is retained by each component after compression: the component's variance divided by the sum of all the variances, i.e. how much of the whole information it carries.
- Calculating the contribution rate … since the first principal component alone is rarely enough, we examine how much information is held when the second, third, and fourth principal components are added. The magnitude of this value shows how much each component contributes. The cumulative contribution rate is also an important topic. (See the sketch below.)
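A sketch of PCA via the eigendecomposition of the variance-covariance matrix, on made-up correlated data, computing contribution rates and the cumulative contribution rate:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.0, 0.0, 0.2]])

cov = np.cov(X, rowvar=False)             # variance-covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)    # Var(X) a_j = λ_j a_j; columns of
eigvals = eigvals[::-1]                   # eigvecs are the projection axes

contribution = eigvals / eigvals.sum()    # contribution rate per component
print(contribution)                       # first axis carries most variance
print(np.cumsum(contribution))            # cumulative contribution rate
```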

When explaining the results, saying "the variance-covariance matrix is …" will not get through to the listener, so with principal component analysis you can explain how much of the data these two values capture. Being able to think this way is necessary, and especially important when explaining to your boss or doing business.
- KNN (k-nearest neighbors) … supervised learning. Among the data with correct labels, take the K points closest to the data point being classified, hold a majority vote, and adopt the majority label. The number K must be set as a parameter in advance.
- K-means … unsupervised learning. A method of clustering (classifying) data into K groups. Take K arbitrary points and group the data around whichever center is closest; then, with the centers of the formed groups as the new centers, recompute the distance to each data point, and by repeating this the assignment eventually stabilizes. Since the choice of the K initial points is important, a method called k-means++, which places the initial centers far apart rather than randomly, has also been attracting attention recently. (A sketch of the K-means loop follows.)
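A sketch of the K-means loop described above (assign each point to the nearest center, move centers to group means, repeat), on made-up data with two obvious groups; random initialization here, whereas k-means++ would spread the initial centers apart:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(3, 0.5, (50, 2))])       # two obvious groups

K = 2
centers = X[rng.choice(len(X), K, replace=False)]  # random initial centers

for _ in range(10):
    # Assign each point to its nearest center.
    d = np.linalg.norm(X[:, None] - centers[None], axis=2)
    labels = d.argmin(axis=1)
    # Move each center to the mean of its group; repeat until stable.
    centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])

print(centers)   # ≈ [0, 0] and [3, 3]
```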
