This article works through Fisher's linear discriminant analysis with mathematical formulas and then tries it out with scikit-learn.
It is assumed that you have already learned calculus and linear algebra.
Fisher's linear discriminant analysis is a supervised method that finds a projection vector $w$ such that the class distributions overlap as little as possible after the data are projected. Although it is called a discriminant, in practice it is often used for dimensionality reduction.
When the data $x$ is projected onto $w$, the projected value $y$ is given by
y = w^T x
We then look for the $w$ that separates the class distributions of $y$ as much as possible. In the example shown below, the optimal $w$ projects the blue point cloud and the orange point cloud onto the hollow points on the black line.
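As a concrete illustration, here is a minimal NumPy sketch of this projection. The two Gaussian point clouds and the direction $w$ below are made up for this example, and $w$ is not yet the optimal direction.

```python
import numpy as np

rng = np.random.RandomState(0)

# Two made-up 2-D point clouds (stand-ins for the blue and orange clusters)
X1 = rng.randn(50, 2)
X2 = rng.randn(50, 2) + np.array([3.0, 3.0])

# An arbitrary (not yet optimal) projection direction w
w = np.array([1.0, 1.0])
w = w / np.linalg.norm(w)

# y = w^T x for every sample: each 2-D point becomes a single scalar
y1 = X1 @ w
y2 = X2 @ w
print(y1.shape, y2.shape)  # (50,) (50,)
```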
Now consider two classes of data, as shown above. The mean vectors of classes 1 and 2 can be expressed as follows.
\mu_1 = \frac{1}{N_1} \sum_{i \in C_1} x_i \\
\mu_2 = \frac{1}{N_2} \sum_{i \in C_2} x_i
Writing the mean vectors projected by $w$ as $m_1 = w^T \mu_1$ and $m_2 = w^T \mu_2$, the difference between the means after projection is
m_1 - m_2 = w^T (\mu_1 - \mu_2)
The larger this difference, the greater the separation between the classes. On the other hand, the variances of the two classes after projection are
s^2_1 = \sum_{i \in C_1} (w^T x_i - w^T \mu_1)^2 \\
s^2_2 = \sum_{i \in C_2} (w^T x_i - w^T \mu_2)^2
The smaller these variances, the better, so we also want to minimize the total within-class variance $s^2 = s^2_1 + s^2_2$.
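Sticking with the same made-up two-cluster data, the projected means $m_1, m_2$ and the within-class variances $s^2_1, s^2_2$ can be computed directly from these definitions:

```python
import numpy as np

rng = np.random.RandomState(0)
X1 = rng.randn(50, 2)                          # class 1 samples (synthetic)
X2 = rng.randn(50, 2) + np.array([3.0, 3.0])   # class 2 samples (synthetic)
w = np.array([1.0, 1.0]) / np.sqrt(2.0)        # an arbitrary direction

# Class mean vectors mu_1, mu_2
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)

# Projected means m_k = w^T mu_k
m1, m2 = w @ mu1, w @ mu2

# Within-class variations after projection: s_k^2 = sum_i (w^T x_i - w^T mu_k)^2
s1_sq = np.sum((X1 @ w - m1) ** 2)
s2_sq = np.sum((X2 @ w - m2) ** 2)
print(m1 - m2, s1_sq + s2_sq)
```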
Here, we define the following Fisher criterion $J(w)$ as an objective function that takes into account both maximizing the difference of the projected means and minimizing the variance after projection.
J(w) = \frac{(m_1 - m_2)^2}{s^2_1 + s^2_2}
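For a candidate direction, this criterion can be evaluated straight from the definition. The sketch below uses the same synthetic clusters as above; `fisher_criterion` is just a name chosen for this example.

```python
import numpy as np

def fisher_criterion(w, X1, X2):
    """J(w) = (m_1 - m_2)^2 / (s_1^2 + s_2^2), computed from the definitions."""
    y1, y2 = X1 @ w, X2 @ w                    # projected samples per class
    m1, m2 = y1.mean(), y2.mean()              # projected means
    num = (m1 - m2) ** 2                       # between-class term
    den = np.sum((y1 - m1) ** 2) + np.sum((y2 - m2) ** 2)  # s_1^2 + s_2^2
    return num / den

rng = np.random.RandomState(0)
X1 = rng.randn(50, 2)
X2 = rng.randn(50, 2) + np.array([3.0, 3.0])
print(fisher_criterion(np.array([1.0, 0.0]), X1, X2))  # J along the x-axis
print(fisher_criterion(np.array([1.0, 1.0]), X1, X2))  # J along the diagonal
```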
Furthermore, defining the between-class covariance matrix as $S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$, the between-class variation $(m_1 - m_2)^2$ can be expanded as
\begin{align}
(m_1 - m_2)^2 &= \left( w^T(\mu_1 - \mu_2) \right)^2 \\
&= \left( w^T(\mu_1 - \mu_2) \right) \left( w^T(\mu_1 - \mu_2) \right)^T \\
&= w^T (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T w \\
&= w^T S_B w
\end{align}
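This identity is easy to check numerically. The sketch below reuses the synthetic clusters from above and compares $(m_1 - m_2)^2$ with $w^T S_B w$ for an arbitrary $w$:

```python
import numpy as np

rng = np.random.RandomState(0)
X1 = rng.randn(50, 2)
X2 = rng.randn(50, 2) + np.array([3.0, 3.0])
w = np.array([1.0, 1.0])

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)

# Between-class scatter matrix S_B = (mu_1 - mu_2)(mu_1 - mu_2)^T
d = (mu1 - mu2).reshape(-1, 1)
S_B = d @ d.T

lhs = (w @ mu1 - w @ mu2) ** 2   # (m_1 - m_2)^2
rhs = w @ S_B @ w                # w^T S_B w
print(np.isclose(lhs, rhs))      # True
```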
In addition, the within-class variance $s^2_k$ of class $k$ can be rewritten as
\begin{align}
s^2_k &= \sum_{i \in C_k} (y_i - m_k)^2 \\
&= \sum_{i \in C_k} \left( w^T (x_i - \mu_k) \right)^2 \\
&= \sum_{i \in C_k} \left( w^T(x_i - \mu_k) \right) \left( w^T(x_i - \mu_k) \right)^T \\
&= w^T \sum_{i \in C_k} (x_i - \mu_k)(x_i - \mu_k)^T w \\
&= w^T S_k w
\end{align}
Therefore, defining the total within-class covariance matrix as $S_W = S_1 + S_2$, the total within-class variance $s^2_1 + s^2_2$ becomes
s^2_1 + s^2_2 = w^T (S_1 + S_2) w = w^T S_W w
Fisher's criterion $J(w)$ can therefore be written as
J(w) = \frac{w^T S_B w}{w^T S_W w}
and we maximize this with respect to $w$.
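As a sanity check, the matrix form of $J(w)$ built from $S_B$ and $S_W$ agrees with the scalar form computed earlier. Again, the data below are the same made-up clusters:

```python
import numpy as np

rng = np.random.RandomState(0)
X1 = rng.randn(50, 2)
X2 = rng.randn(50, 2) + np.array([3.0, 3.0])
w = np.array([1.0, 1.0])

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)

# Within-class scatter matrices S_k = sum_i (x_i - mu_k)(x_i - mu_k)^T
S1 = (X1 - mu1).T @ (X1 - mu1)
S2 = (X2 - mu2).T @ (X2 - mu2)
S_W = S1 + S2

# Between-class scatter matrix S_B
d = (mu1 - mu2).reshape(-1, 1)
S_B = d @ d.T

# Matrix form J(w) = (w^T S_B w) / (w^T S_W w)
J_matrix = (w @ S_B @ w) / (w @ S_W @ w)

# Scalar form for comparison
s_sq = np.sum((X1 @ w - w @ mu1) ** 2) + np.sum((X2 @ w - w @ mu2) ** 2)
J_scalar = (w @ mu1 - w @ mu2) ** 2 / s_sq
print(np.isclose(J_matrix, J_scalar))  # True
```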
To find the maximum, we differentiate Fisher's criterion $J(w)$ with respect to $w$ and set the derivative equal to zero.
\begin{align}
\frac{\partial J(w)}{\partial w} &= \frac{2S_B w \cdot w^TS_Ww - w^TS_Bw \cdot 2S_Ww}{(w^TS_Ww)^2} \\
&= \frac{2}{w^TS_Ww} \left( S_Bw - \frac{w^TS_Bw}{w^TS_Ww} S_Ww \right) = 0
\end{align}
Here, setting $\lambda = \frac{w^T S_B w}{w^T S_W w}$, we obtain
\frac{\partial J(w)}{\partial w} = \frac{2}{w^TS_Ww} (S_Bw - \lambda S_Ww) = 0 \\
(S_Bw - \lambda S_Ww) = 0
Therefore, we solve the generalized eigenvalue problem given by the following equation.
S_Bw = \lambda S_Ww
Here, if $S_W$ is an invertible matrix, we obtain
\lambda w = S^{-1}_WS_Bw
and the problem reduces to an ordinary eigenvalue problem. Furthermore, since
S_Bw = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw \propto (\mu_1 - \mu_2)
the optimal $w$ can be obtained, up to scale, as
w \propto S^{-1}_W S_B w \propto S^{-1}_W (\mu_1 - \mu_2)
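Putting the derivation together, the optimal direction can be computed in closed form. The sketch below uses the same synthetic two-class data and, purely as a cross-check, compares the closed-form $S^{-1}_W(\mu_1 - \mu_2)$ with the leading eigenvector of $S^{-1}_W S_B$:

```python
import numpy as np

rng = np.random.RandomState(0)
X1 = rng.randn(50, 2)
X2 = rng.randn(50, 2) + np.array([3.0, 3.0])

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)

# Closed form: w is proportional to S_W^{-1} (mu_1 - mu_2)
w = np.linalg.solve(S_W, mu1 - mu2)
w = w / np.linalg.norm(w)

# Cross-check: the leading eigenvector of S_W^{-1} S_B points the same way
d = (mu1 - mu2).reshape(-1, 1)
S_B = d @ d.T
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
w_eig = eigvecs[:, np.argmax(eigvals.real)].real
w_eig = w_eig / np.linalg.norm(w_eig)

print(w, w_eig)  # the two directions should agree up to sign
```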
The execution environment is as follows.
- CPU: Intel(R) Core(TM) i7-6700K 4.00 GHz
- OS: Windows 10 Pro 1909
- Python 3.6.6
- Matplotlib 3.3.1
- NumPy 1.19.2
- scikit-learn 0.23.2
The implemented program is published on GitHub.
fisher_lda.py
This time, I decided to use the iris dataset provided by scikit-learn.
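The published fisher_lda.py may differ in detail, but a minimal sketch of the scikit-learn side of the experiment looks roughly like the following, which fits LinearDiscriminantAnalysis on iris and plots the data projected onto two discriminant axes:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Load the iris dataset (4 features, 3 classes)
iris = load_iris()
X, y = iris.data, iris.target

# Fit linear discriminant analysis and project onto 2 discriminant axes
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

# Scatter plot of the projected data, one color per class
for label, name in enumerate(iris.target_names):
    plt.scatter(X_lda[y == label, 0], X_lda[y == label, 1], label=name)
plt.xlabel('1st discriminant axis')
plt.ylabel('2nd discriminant axis')
plt.legend()
plt.show()
```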
The execution result is as follows. Setosa is well separated, and although versicolor and virginica partially overlap, they appear reasonably well separated.