This time, I will summarize a method called principal component analysis (PCA). PCA is a data-analysis method widely used these days as a practical means of handling big data. It is also the basis of the **autoencoder concept** in neural networks such as GANs, which is my motivation for studying it.
The materials I referred to this time are listed below. **The explanation in the PRML book in particular was careful, so it was easy to follow the derivations and my understanding progressed.** I felt I understood why it is considered a masterpiece.
Pattern Recognition and Machine Learning (commonly known as PRML)
Jolliffe, I., Principal Component Analysis (2nd ed., Springer, 2002): http://cda.psych.uiuc.edu/statistical_learning_course/Jolliffe%20I.%20Principal%20Component%20Analysis%20(2ed.,%20Springer,%202002)(518s)MVsa.pdf
Feature Engineering for Machine Learning (O'Reilly)
Principal component analysis (commonly known as PCA) refers to **replacing a large number of variables with a small number of new variables**. It is also known as the Karhunen-Loève transform.
To put it more like a textbook, PCA is the orthogonal projection of data points onto a lower-dimensional linear space called the principal subspace, chosen so that the variance of the projected data is maximized. It can also be defined as minimizing the projection cost function, defined as the mean squared distance between the data points and their projections. These two formulations are equivalent.
The idea is briefly shown below.
Idea 1, variance maximization, comes from the desire to express the data in a new, lower-dimensional space without losing the variation present in the original observations. Idea 2, minimizing the expected cost, seeks the subspace that makes the projection error as small as possible. Although 1 and 2 are phrased differently, they turn out to be equivalent.
Consider a set of observed data {$\mathbf{x_n}$} $(n = 1, 2, \dots, N)$ in a $D$-dimensional space. Our goal this time is to project the data onto an $M$-dimensional space ($M < D$). To begin, consider projection onto a one-dimensional space ($M = 1$).
The direction of this one-dimensional space is expressed by a $D$-dimensional vector $\mathbf{u_1}$ (see the figure below).
Now assume that $\mathbf{u_1}$ is a unit vector ($\mathbf{u_1}^T \mathbf{u_1} = 1$). **This constraint is imposed because what matters is the direction being defined, not the magnitude of $\mathbf{u_1}$ itself, and it also simplifies the calculation.** Each data point $\mathbf{x_n}$ is then projected onto the scalar value $\mathbf{u_1}^T \mathbf{x_n}$. To maximize the variance, let $\bar{\mathbf{x}}$ denote the mean of the data and proceed as follows. The variance of the projected data is
\begin{align}
&\frac{1}{N} \sum_{n=1}^{N} \bigl( \mathbf{u_1}^T \mathbf{x_n} - \mathbf{u_1}^T \bar{\mathbf{x}} \bigr)^2 \\
= &\frac{1}{N} \sum_{n=1}^{N} \bigl( \mathbf{u_1}^T (\mathbf{x_n} - \bar{\mathbf{x}}) \bigr)^2 \quad (\text{Equation 1}) \\
= &\frac{1}{N} \sum_{n=1}^{N} \bigl( \mathbf{u_1}^T (\mathbf{x_n} - \bar{\mathbf{x}}) \bigr) \bigl( (\mathbf{x_n} - \bar{\mathbf{x}})^T \mathbf{u_1} \bigr) \\
= &\mathbf{u_1}^T \Bigl( \frac{1}{N} \sum_{n=1}^{N} (\mathbf{x_n} - \bar{\mathbf{x}}) (\mathbf{x_n} - \bar{\mathbf{x}})^T \Bigr) \mathbf{u_1} \\
= &\mathbf{u_1}^T \mathbf{S} \mathbf{u_1}
\end{align}
where $\mathbf{S}$ is the data covariance matrix, defined as:
\mathbf{S} = \frac{1}{N} \sum_{n=1}^{N} (\mathbf{x_n} - \bar{\mathbf{x}}) (\mathbf{x_n} - \bar{\mathbf{x}})^T
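As a minimal sketch (the synthetic data and variable names here are my own, chosen only for illustration), $\mathbf{S}$ can be computed directly in NumPy and checked against `np.cov`:

```python
import numpy as np

# Synthetic data: N = 200 observations in D = 3 dimensions (hypothetical values)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

# S = (1/N) * sum_n (x_n - x_bar)(x_n - x_bar)^T
x_bar = X.mean(axis=0)
S = (X - x_bar).T @ (X - x_bar) / len(X)

# np.cov with bias=True uses the same 1/N normalization
assert np.allclose(S, np.cov(X, rowvar=False, bias=True))
```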
We now maximize the projected variance $\mathbf{u_1}^T \mathbf{S} \mathbf{u_1}$ subject to the constraint $\mathbf{u_1}^T \mathbf{u_1} = 1$ (without the constraint, the variance could be made arbitrarily large just by scaling $\mathbf{u_1}$). Applying the method of Lagrange multipliers with $\lambda_1$ as the multiplier, the Lagrangian
L(\mathbf{u_1}, \lambda_1) = \frac{1}{2} \mathbf{u_1}^T \mathbf{S} \mathbf{u_1} - \frac{\lambda_1}{2} (\mathbf{u_1}^T \mathbf{u_1} - 1)
can be written (the factors of $\frac{1}{2}$ are included only so that the derivative comes out clean). The stationary point is found by partially differentiating with respect to $\mathbf{u_1}$ and setting the result to $0$:
\frac{\partial L}{\partial \mathbf{u_1}} = \mathbf{S} \mathbf{u_1} - \lambda_1 \mathbf{u_1} = 0 \\
\mathbf{S} \mathbf{u_1} = \lambda_1 \mathbf{u_1} \quad (\text{Equation 2})
This is an eigenvalue equation, so $\mathbf{u_1}$ must be an eigenvector of $\mathbf{S}$. The eigenvalues $\lambda_1$ can be obtained by solving the characteristic equation

\det(\mathbf{S} - \lambda_1 \mathbf{I}) = 0
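As a concrete illustration, take a hypothetical $2 \times 2$ covariance matrix (values chosen only for this example):

\mathbf{S} = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}, \quad \det(\mathbf{S} - \lambda \mathbf{I}) = (2 - \lambda)^2 - 1 = 0 \ \Rightarrow \ \lambda = 1,\ 3

The largest eigenvalue is $\lambda_1 = 3$, and solving $\mathbf{S} \mathbf{u_1} = 3 \mathbf{u_1}$ gives the unit eigenvector $\mathbf{u_1} = \frac{1}{\sqrt{2}} (1, 1)^T$.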
Furthermore, multiplying Equation 2 from the left by $\mathbf{u_1}^T$ and applying $\mathbf{u_1}^T \mathbf{u_1} = 1$, the variance becomes
\mathbf{u_1}^T \mathbf{S} \mathbf{u_1} = \lambda_1
Therefore, the variance is maximized when $\mathbf{u_1}$ is chosen as the eigenvector belonging to the largest eigenvalue $\lambda_1$. This eigenvector is called the **first principal component**.
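To make this concrete, here is a minimal NumPy sketch (synthetic data; the names are my own) that finds the first principal component as the top eigenvector of $\mathbf{S}$ and confirms that the projected variance equals $\lambda_1$:

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D data so that one direction clearly dominates the variance
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.5, 0.5]])

x_bar = X.mean(axis=0)
S = (X - x_bar).T @ (X - x_bar) / len(X)

# For a symmetric matrix, eigh returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(S)
lambda_1 = eigvals[-1]
u1 = eigvecs[:, -1]  # first principal component (already a unit vector)

# The variance of the projected data u1^T x_n equals lambda_1
projected = (X - x_bar) @ u1
assert np.isclose(projected.var(), lambda_1)
```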
Above we considered principal component analysis when projecting onto one dimension. In general, we project onto an $M$-dimensional space, taking as its basis the eigenvectors belonging to the $M$ largest eigenvalues of $\mathbf{S}$. In practice, these can be obtained conveniently by applying **singular value decomposition** to the matrix formed from the original data set {$\mathbf{x_n}$} $(n = 1, 2, \dots, N)$. Singular value decomposition is a matrix operation; its outline and implementation are described in the article below.
I tried my best to understand Spectral Normalization and singular value decomposition, which contribute to the stability of GANs: https://qiita.com/Fumio-eisan/items/54d138df12737c0984b2
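As a sketch of the connection (again with synthetic data of my own choosing): the right singular vectors of the centered data matrix are the eigenvectors of $\mathbf{S}$, and the singular values satisfy $\lambda_i = s_i^2 / N$, so SVD yields the principal components without forming $\mathbf{S}$ explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
Xc = X - X.mean(axis=0)  # centered data matrix

# SVD route: the rows of Vt are the principal directions
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Eigendecomposition route, for comparison
S = Xc.T @ Xc / len(X)
eigvals, _ = np.linalg.eigh(S)  # ascending order

# The eigenvalues of S are the squared singular values divided by N
assert np.allclose(s**2 / len(X), eigvals[::-1])

M = 2                # project onto an M-dimensional principal subspace
Z = Xc @ Vt[:M].T    # (N, M) low-dimensional representation
```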
Next, let us consider principal component analysis from the other perspective: minimizing the sum of squared residuals.
The residual is $\mathbf{x_n} - \tilde{\mathbf{x}}_n$ in the figure above, where $\tilde{\mathbf{x}}_n$ is the projection of $\mathbf{x_n}$, expressed using $\mathbf{x_n}$ and $\mathbf{u_1}$. Writing $\tilde{\mathbf{x}}_n = t \mathbf{u_1}$ and requiring the residual to be orthogonal to $\mathbf{u_1}$,
\mathbf{u_1}^T (\mathbf{x_n} - t \mathbf{u_1}) = 0
Solving this for $t$:

\mathbf{u_1}^T \mathbf{x_n} - t \mathbf{u_1}^T \mathbf{u_1} = 0 \\
t = \mathbf{u_1}^T \mathbf{x_n}

(since $\mathbf{u_1}^T \mathbf{u_1} = 1$).
Therefore, $(\mathbf{u_1}^T \mathbf{x_n}) \mathbf{u_1}$ is the orthogonal projection. In other words, the residual we originally wanted, $\mathbf{x_n} - \tilde{\mathbf{x}}_n$, becomes $\mathbf{x_n} - (\mathbf{u_1}^T \mathbf{x_n}) \mathbf{u_1}$. Finding the $\mathbf{u_1}$ that minimizes the sum of the squared lengths of these residuals can be formulated as
\sum_{n=1}^{N} ||\mathbf{x_n} - (\mathbf{u_1}^T \mathbf{x_n}) \mathbf{u_1}||^2
Expanding the squared norm for a single term gives

||\mathbf{x_n} - (\mathbf{u_1}^T \mathbf{x_n}) \mathbf{u_1}||^2 = \mathbf{x_n}^T \mathbf{x_n} - (\mathbf{u_1}^T \mathbf{x_n})^2
Since $\mathbf{x_n}^T \mathbf{x_n}$ is constant no matter how $\mathbf{u_1}$ changes, this minimization problem is equivalent to minimizing

\sum_{n=1}^{N} -(\mathbf{u_1}^T \mathbf{x_n})^2

which, apart from the minus sign, is the same as Equation 1 in the maximization problem. Earlier we measured deviations from the mean; here this corresponds to assuming the mean has been subtracted from the data in advance. In other words, because of the minus sign, the earlier maximization problem simply becomes a minimization problem, and the rest of the derivation proceeds exactly as before.
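A minimal numerical check of this equivalence (synthetic, pre-centered data; the names are my own): for a unit vector $\mathbf{u_1}$, the squared residuals and the squared projections always add up to the total sum of squares, so minimizing one is the same as maximizing the other.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
X -= X.mean(axis=0)  # subtract the mean in advance, as in the text

u1 = np.array([1.0, 2.0, 2.0]) / 3.0  # an arbitrary unit vector

proj = X @ u1                       # scalar projections u1^T x_n
residuals = X - np.outer(proj, u1)  # x_n - (u1^T x_n) u1

# ||x_n||^2 = (u1^T x_n)^2 + ||residual_n||^2 holds for every n,
# so minimizing the residuals is the same as maximizing the projections
assert np.isclose((residuals**2).sum() + (proj**2).sum(), (X**2).sum())
```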
I work as a production engineer in the manufacturing industry. In this field, **sudden machinery and equipment failures** are one of the causes of production loss (= opportunity loss). This happens because a device **stops operating normally** at an unexpected time.
For critical equipment whose failure would halt production, we basically perform periodic repairs based on the idea of time-based maintenance (at least in my industry, materials).
This repair interval is often set **earlier (= ahead of schedule) than the manufacturer's recommended replacement time** for the parts being serviced (= safety-critical parts). In other words, even though we are already on the safe side, trouble can still occur before the scheduled time. The causes vary from case to case, so I cannot generalize, but examples include changes in the type of product being produced or in the worker operating the equipment.
Consider applying principal component analysis to this problem. Below is a cartoon-style sketch of the path from problem to solution.
With the recent spread of IoT, it has become easier to **attach sensors to equipment and acquire data**. Operational data related to production has also become easy to use (= easy to obtain).
Against this background, it becomes possible to predict equipment failures from large amounts of data. In making such predictions, the challenge is determining **which factors to monitor and what thresholds to use**. I think **principal component analysis** can be applied to the process of narrowing down many factors to a few. For example, one could construct a new index that combines multiple factors, such as vibration frequency × total usage time. **(This may already be available as packaged software, but I don't know; I would appreciate it if anyone could tell me.)**
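To sketch what I have in mind (the sensor channels and data here are entirely hypothetical, not any real system), one could feed many sensor channels into PCA and keep a few leading components as a condensed index:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical equipment log: rows = hourly snapshots, columns = sensor
# channels (vibration frequency, total usage time, temperature, ...)
rng = np.random.default_rng(0)
sensor_log = rng.normal(size=(1000, 8))  # placeholder for real measurements

# Condense the many correlated channels into a couple of combined indices
pca = PCA(n_components=2)
health_index = pca.fit_transform(sensor_log)

# Fraction of the original variation each combined index retains
print(pca.explained_variance_ratio_)
```

Whether such an index is actually meaningful would, as discussed next, depend entirely on domain knowledge.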
**However**, the point here is that one must consider whether the chosen factors are appropriate in light of so-called domain knowledge about the equipment and production.
For example, for the motor installed in a rolling mill, vibration in the range of ○ to ○ Hz may be normal, while once it exceeds △ Hz the part is nearly due for replacement. It is knowledge and experience that give validity to the factors chosen here. After all, I think the knowledge and experience of workers who have visited the site dozens or hundreds of times remains important.
One more point (**this may be specific to the materials industry**): the challenge in applying this idea still lies largely on the hardware side, such as the need to attach sensors to equipment and whether sensors that can obtain valid data even exist. Factories are often not sufficiently instrumented, and even where instruments exist, the system configuration is often not designed to collect their data. So the effort stalls at the point where there is simply no way to get data from the object you want to analyze.
Nowadays, I feel the reality is that only the machine learning ideas are running ahead, while the hardware side has not caught up. Therefore, **this is muddy but important work for manufacturers' equipment engineers, IT staff, and IT consultants.**
I have summarized the method of principal component analysis. Although the mathematical operations are simple, I was struck by its wide range of applications. Personally, I would like to learn more and understand how it is used in image processing.