Principal component analysis (PCA) is a method that combines the multiple variables in multivariate data describing the features of objects into new composite features called principal components. It can reduce the number of dimensions, which makes the structure of the data easier to see and reduces the amount of computation in machine learning, improving calculation speed. Simply put, it expresses correlated multivariate data with a smaller number of less correlated variables, focusing on the direction and magnitude of the variation. This time, I would like to check how the method works by implementing in Python the material covered in "Principal component analysis and dimension reduction" in the teacher training materials for Information II published on the website of the Ministry of Education, Culture, Sports, Science and Technology.
[High School Information Department "Information II" Teacher Training Materials (Main Volume): Ministry of Education, Culture, Sports, Science and Technology](https://www.mext.go.jp/a_menu/shotou/zyouhou/detail/mext_00742.html), Chapter 3 "Information and Data Science", First Half (PDF: 8.9MB)
After giving an overview of principal component analysis with as few mathematical formulas as possible, I would like to see how it works by implementing in Python the R source code presented in
"Learning 14: Principal component analysis and dimensionality reduction — 4. Let's analyze physical fitness data with R".
To briefly explain the mechanism of principal component analysis (PCA): it combines variables that have strong covariance or correlation with one another into new composite variables (feature quantities) called principal components, chosen so that the differences between individual objects are maximized.
In this way, from the p variables $(X_1, X_2, \cdots, X_p)$, p mutually independent composite variables are created as the principal components $(Z_1, Z_2, \cdots, Z_p)$ by linear combinations (weighted sums), without losing information.
The principal components are created in order of decreasing variance, so the variation (variance) of the principal component values (principal component scores) across objects becomes smaller for the lower-order components.
In other words, by discarding the lower-order components with small contributions and adopting only the higher-order components, the features of the objects can be profiled with fewer variables than the original p variables. This is called dimensionality reduction.
Consider the following simple example (score data from tests in five subjects).
Generalizing the coefficients for the subjects as a, b, c, d, and e, consider the composite variable
Composite variable = a × Japanese + b × English + c × mathematics + d × physics + e × chemistry
By giving various numerical values to a, b, c, d, and e, various total scores other than the simple sum can be created. The principal component created by principal component analysis is the single composite variable whose coefficients a, b, c, d, and e are determined so that the values of the composite variable for the 50 students (the **principal component scores**) vary the most, that is, so that their variance is maximized.
A large variation corresponds to a large amount of information in that variable. The principal component obtained first is called the first principal component; the second principal component is computed so that its variance is maximized under the condition that it is uncorrelated with (independent of) the first component. The resulting features are mutually uncorrelated, so their information does not overlap.
From teaching materials
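Before moving on, here is a minimal sketch (my own illustration with dummy data, not from the teaching materials) of how the variance of such a composite variable depends on the choice of coefficients; principal component analysis searches for the coefficient vector of unit length that maximizes this variance.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(60, 10, size=(50, 5))  # dummy scores: 50 students x 5 subjects

# Two coefficient vectors of equal (unit) length, so their variances are comparable
coef_equal = np.ones(5) / np.sqrt(5)                  # "simple total" direction
coef_other = np.array([0.8, 0.1, 0.05, 0.03, 0.02])
coef_other /= np.linalg.norm(coef_other)

print(np.var(scores @ coef_equal))   # variance of the composite score (equal weights)
print(np.var(scores @ coef_other))   # variance with another weighting
# PCA chooses the unit-length coefficient vector that maximizes this variance.
```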
There are two possible approaches when creating the composite variables.
--(1) Align only the level (position) of the scores of the five subjects (make the means the same, i.e. use deviations from the mean).
--(2) Align both the level (position) and the magnitude of the variation of the scores of the five subjects (make the means and standard deviations the same, i.e. standardize; this corresponds to using deviation values).
When dealing with multiple variables with different units, it is necessary to standardize as in (2) to align the units. If the units are the same, the analysis can be performed from either standpoint (1) or (2). With (1), coefficients that take the magnitude of the variance of each variable into account are obtained; with (2), coefficients that ignore the differences in the variances of the variables are obtained.
The following article goes into detail on the question of which of (1) and (2) should be adopted.
https://qiita.com/koshian2/items/2e69cb4981ae8fbd3bda
--It is not always a good idea to standardize (divide by the standard deviation) before principal component analysis.
--On the contrary, dividing by the standard deviation can have adverse effects.
--Whether to standardize depends on how the principal component analysis will be used (for visualization or as part of a pipeline), so it should be considered carefully case by case.
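As a minimal sketch of this point (my own illustration with synthetic data, not from the linked article), PCA on raw data reflects the covariance matrix and is dominated by large-variance variables, while PCA on standardized data reflects the correlation matrix:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 100, 200),  # one variable with a much larger scale
                     rng.normal(0, 1, 200),
                     rng.normal(0, 1, 200)])

print(PCA().fit(X).explained_variance_ratio_)  # PC1 is dominated by the large-scale column
print(PCA().fit(StandardScaler().fit_transform(X)).explained_variance_ratio_)  # more even
```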
Principal component analysis is performed on items extracted from the physical fitness data (grip strength, raising the upper body, and so on). I would like to use the following preprocessed data created for this purpose.
https://gist.github.com/ereyester/5bf6d48fe966238632eca537756a06b0/raw/805c2eea83c7608d4d85ec15e56761133dc5ff4d/high_male2.csv
import numpy as np
import pandas as pd
import urllib.request
import matplotlib.pyplot as plt
import sklearn #Machine learning library
from sklearn.decomposition import PCA #Principal component analyzer
from sklearn.preprocessing import StandardScaler
from IPython.display import display
high_male2 = pd.read_csv('/content/high_male2.csv')
high_male3 = high_male2[['Grip strength', 'Raise your upper body', 'Long-seat forward bending', 'Repeated side jump', 'Shuttle run', 'X50m run', 'Standing long jump', 'Handball throw']]
display(high_male3)
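The path /content/high_male2.csv assumes an environment such as Google Colab. As a hedged sketch (not in the original code), the file can first be fetched from the gist above with the urllib.request module imported earlier, before running the pd.read_csv line:

```python
# A sketch of one way to download the preprocessed CSV before reading it
# (assumption: a writable /content/ directory, e.g. Google Colab).
url = ('https://gist.github.com/ereyester/5bf6d48fe966238632eca537756a06b0/'
       'raw/805c2eea83c7608d4d85ec15e56761133dc5ff4d/high_male2.csv')
urllib.request.urlretrieve(url, '/content/high_male2.csv')
```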
This time, I would like to implement it in two ways: with and without the scikit-learn module. The scikit-learn version uses sklearn.decomposition.PCA and sklearn.preprocessing.StandardScaler, while the version without scikit-learn implements the processing directly with numpy and pandas.
In both cases, I will make active use of pandas.DataFrame in the implementation.
# Standardize the data matrix
# Standardization (using StandardScaler)
std_sc = StandardScaler()
std_sc.fit(high_male3)
std_data = std_sc.transform(high_male3)
std_data_df = pd.DataFrame(std_data, columns = high_male3.columns)
display(std_data_df)
#Perform principal component analysis
pca = PCA()
pca.fit(std_data_df)
#Mapping data to principal component space
pca_cor = pca.transform(std_data_df)
#print(pca.get_covariance()) #Covariance matrix
#Matrix display of eigenvectors
eig_vec = pd.DataFrame(pca.components_.T, index = high_male3.columns, \
columns = ["PC{}".format(x + 1) for x in range(len(std_data_df.columns))])
display(eig_vec)
#eigenvalue
eig = pd.DataFrame(pca.explained_variance_, index=["PC{}".format(x + 1) for x in range(len(std_data_df.columns))], columns=['eigenvalue']).T
display(eig)
# In the R source code, the standard deviation is output instead of the eigenvalues (variances).
#Standard deviation of principal components
dv = np.sqrt(eig)
dv = dv.rename(index = {'eigenvalue':'Standard deviation of principal components'})
display(dv)
#Contribution rate
ev = pd.DataFrame(pca.explained_variance_ratio_, index=["PC{}".format(x + 1) for x in range(len(std_data_df.columns))], columns=['Contribution rate']).T
display(ev)
#Cumulative contribution rate
t_ev = pd.DataFrame(pca.explained_variance_ratio_.cumsum(), index=["PC{}".format(x + 1) for x in range(len(std_data_df.columns))], columns=['Cumulative contribution rate']).T
display(t_ev)
# Principal component scores
print('Principal component scores')
cor = pd.DataFrame(pca_cor, columns=["PC{}".format(x + 1) for x in range(len(std_data_df.columns))])
display(cor)
First, sklearn.preprocessing.StandardScaler is used to standardize the data; standardization transforms each variable so that it has a mean of 0 and a variance of 1.
Principal component analysis is then performed on the standardized data using sklearn.decomposition.PCA. From the principal component analysis, the eigenvectors, eigenvalues, contribution ratios, and principal component scores can be confirmed.
These results were found to be almost the same as the output obtained in R.
Here, the relationship between the principal component scores, the eigenvectors, and the standardized data is as follows.
Principal component score = eigenvector (grip strength) × standardized grip strength + eigenvector (raise your upper body) × standardized upper-body value + ... + eigenvector (handball throw) × standardized handball-throw value
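As a quick check of this relationship (my own addition, reusing the objects defined in the code above), the scores returned by pca.transform() should match the matrix product of the standardized data and the eigenvectors:

```python
# Since the standardized data already has zero mean, pca.transform() reduces to
# multiplying the data by the eigenvector matrix (pca.components_.T).
print(np.allclose(pca_cor, std_data_df.values @ pca.components_.T))  # expected: True
```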
The teaching materials describe the following criteria for deciding how far to reduce the dimensions.
--Cumulative contribution rate: as a guide, adopt principal components up to a cumulative contribution rate of about 70% to 80%.
--Eigenvalue (variance) of 1 or more (Kaiser criterion): adopt the principal components whose eigenvalue is greater than 1 (when analyzing the correlation matrix, as in this case); when analyzing the covariance matrix, adopt the principal components whose contribution rate is at least (1 / number of variables) × 100 (%).
--Scree plot: plot the eigenvalues (variances) in descending order from the left as a line graph (scree plot), and adopt the principal components up to the point just before the decrease in variance levels off (a minimal plotting sketch follows this list).
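As a minimal sketch of such a scree plot (my own addition, reusing the pca object fitted above; not part of the teaching materials' code):

```python
# Scree plot sketch: eigenvalues (variances) of the principal components in descending order
pc_labels = ["PC{}".format(x + 1) for x in range(len(std_data_df.columns))]
plt.plot(pc_labels, pca.explained_variance_, marker='o')
plt.axhline(1.0, linestyle='--', color='gray')  # Kaiser criterion: eigenvalue = 1
plt.xlabel('Principal component')
plt.ylabel('Eigenvalue (variance)')
plt.title('Scree plot')
plt.show()
```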
# Standardization (without using StandardScaler)
# By default, DataFrame operations apply column-wise (axis = 0)
std_data_df_noskl = (high_male3 - high_male3.mean()) / high_male3.std(ddof = 0)
display(std_data_df_noskl)
#Covariance matrix
cov_vec = np.cov(std_data_df_noskl.T, bias = 0)
#Find eigenvalues and eigenvectors
# np.linalg.svd: Singular value decomposition
# np.linalg.eig:Eigenvalue decomposition
#Here, select singular value decomposition
eig_vec_noskl_l, eig_noskl_l, _ = np.linalg.svd(cov_vec)
# Principal component scores
pca_cor_noskl = np.dot(std_data_df_noskl, eig_vec_noskl_l)
# display(pca_cor_noskl)
# Matrix display of eigenvectors
eig_vec_noskl = pd.DataFrame(eig_vec_noskl_l, index = high_male3.columns, \
columns = ["PC{}".format(x + 1) for x in range(len(std_data_df_noskl.columns))])
display(eig_vec_noskl)
#eigenvalue
eig_noskl = pd.DataFrame(eig_noskl_l, index=["PC{}".format(x + 1) for x in range(len(std_data_df_noskl.columns))], columns=['eigenvalue']).T
display(eig_noskl)
# In the R source code, the standard deviation is output instead of the eigenvalues (variances).
#Standard deviation of principal components
dv_noskl = np.sqrt(eig_noskl)
dv_noskl = dv_noskl.rename(index = {'eigenvalue':'Standard deviation of principal components'})
display(dv_noskl)
#Contribution rate
ev_noskl_l = eig_noskl_l / eig_noskl_l.sum()
ev_noskl = pd.DataFrame(ev_noskl_l, index=["PC{}".format(x + 1) for x in range(len(std_data_df_noskl.columns))], columns=['Contribution rate']).T
display(ev_noskl)
#Cumulative contribution rate
t_ev_noskl = pd.DataFrame(ev_noskl_l.cumsum(), index=["PC{}".format(x + 1) for x in range(len(std_data_df_noskl.columns))], columns=['Cumulative contribution rate']).T
display(t_ev_noskl)
# Principal component scores
print('Principal component scores')
cor = pd.DataFrame(pca_cor_noskl, columns=["PC{}".format(x + 1) for x in range(len(std_data_df_noskl.columns))])
display(cor)
First, the data is standardized.
The standardization formula is as follows:
X_{i,j}^{(std)} \leftarrow \frac{X_{i,j} - \mu_j}{\sigma_j}
\mu_j = \frac{1}{N}\sum_{i=1}^{N} X_{i,j}
\sigma_j = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (X_{i,j} - \mu_j)^2}
{\boldsymbol{X}}=
\begin{pmatrix}
X_{1,1}^{(std)} & \cdots & X_{1,8}^{(std)} \\
\vdots & \ddots & \vdots \\
X_{135,1}^{(std)} & \cdots & X_{135,8}^{(std)}
\end{pmatrix}
where $i = 1, \ldots, N = 135$ and $j = 1, \ldots, 8$.
In this Python code, the standard deviation is calculated with ddof = 0 (zero degrees of freedom).
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.std.html
Note that pandas.DataFrame.std has ddof: int, default 1, so if no argument is specified, the standard deviation based on the unbiased variance (one degree of freedom) is used.
By the way, https://numpy.org/doc/stable/reference/generated/numpy.std.html
numpy.std has ddof: int, optional, and its default is ddof = 0 (zero degrees of freedom).
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
Note also that the standard deviation used in the calculation of sklearn.preprocessing.StandardScaler corresponds to ddof = 0.
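A small check of these ddof conventions (my own addition, reusing the objects defined above):

```python
# pandas defaults to ddof=1 (unbiased), while numpy and StandardScaler use ddof=0.
print(high_male3.std().iloc[0])                     # pandas, ddof=1
print(high_male3.std(ddof=0).iloc[0])               # pandas, ddof=0
print(np.std(high_male3.values, axis=0)[0])         # numpy, default ddof=0
print(StandardScaler().fit(high_male3).scale_[0])   # matches the ddof=0 value
```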
Next, for the standardized data, since
E[\boldsymbol{X}]=0
the (sample) variance-covariance matrix becomes
cov({\boldsymbol{X,X}})=E[\boldsymbol{X^TX}]-E[{\boldsymbol{X}}]^TE[{\boldsymbol{X}}]=E[\boldsymbol{X^TX}]=\frac{1}{N-1}\boldsymbol{X^TX}
cov({\boldsymbol{X,X}})=\Sigma
Next, let the singular value decomposition of $\boldsymbol{X}$ be
\boldsymbol{X}=U S {V^T_{svd}}
where $U$ is a $135 \times 135$ orthogonal matrix, $S$ is a $135 \times 8$ rectangular diagonal matrix with non-negative entries, and $V_{svd}$ is an $8 \times 8$ orthogonal matrix. Then
\boldsymbol{X^TX}=[U S {V^T_{svd}}]^T[U S {V^T_{svd}}]=V_{svd} S^T SV^T_{svd}
\Sigma=\frac{V_{svd} S^T SV^T_{svd}}{N-1}
Here $S^T S$ has the eigenvalues of $\boldsymbol{X^TX}$ (the squared singular values) on its diagonal, and the columns of $V_{svd}$ are the eigenvectors of $\boldsymbol{X^TX}$.
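This relationship can be checked numerically (my own sketch, reusing the variables defined in the code above): the squared singular values of the standardized data matrix, divided by N-1, should agree with the eigenvalues obtained from the covariance matrix.

```python
# Check: (singular values of X)^2 / (N-1) == eigenvalues of the covariance matrix
N = len(std_data_df_noskl)
_, s, _ = np.linalg.svd(std_data_df_noskl.values, full_matrices=False)
print(np.allclose(s ** 2 / (N - 1), eig_noskl_l))  # expected: True
```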
The following article explains singular value decomposition in detail.
https://qiita.com/kidaufo/items/0f3da4ca4e19dc0e987e
The following articles explain the relationship between principal component analysis and singular value decomposition in detail.
https://qiita.com/horiem/items/71380db4b659fb9307b4 https://qiita.com/sakami/items/50b8485159312573e3c7
Whether to use eigenvalue decomposition or singular value decomposition here was decided from the following point of view.
https://numpy.org/doc/stable/reference/generated/numpy.linalg.eig.html
Returns: w: ... The eigenvalues are not necessarily ordered. Since numpy.linalg.eig does not necessarily return the eigenvalues in order, an extra sorting operation would be needed to label the components PC1, PC2, ... in descending order of eigenvalue.
https://numpy.org/doc/stable/reference/generated/numpy.linalg.svd.html
Returns: ... Vector(s) with the singular values, within each vector sorted in descending order. With numpy.linalg.svd the values are returned already sorted in descending order, which makes this easier (see the sketch below).
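For comparison, a sketch (my own addition) of the extra sorting step that np.linalg.eig would require:

```python
# np.linalg.eig does not guarantee any ordering, so sort by eigenvalue in descending order.
eig_w, eig_v = np.linalg.eig(cov_vec)
order = np.argsort(eig_w)[::-1]
eig_w_sorted, eig_v_sorted = eig_w[order], eig_v[:, order]
print(np.allclose(eig_w_sorted, eig_noskl_l))  # expected: True (up to numerical precision)
```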
For reference, the R source code from the teaching materials is as follows.
# Principal component analysis using the 8 variables from grip strength to handball throw,
# excluding height, weight, and sitting height
# Based on the correlation matrix
# Import the data set (columns 4-11)
high_male2 <- read.csv("high_male2.csv")
high_male3 <- high_male2[c(4:11)]
# Execute principal component analysis (prcomp) based on the correlation matrix
(res2 <- prcomp(high_male3, scale=T))
summary(res2)
pc <- res2$x
# Write the principal component scores to a file
write.csv(pc, file = "pca_cor.csv")
The argument scale = T of prcomp() means scale = TRUE, a parameter specifying that the principal component analysis should be performed on the correlation matrix of the data. For the data handled this time, this is expected to be equivalent to performing principal component analysis on data that has essentially been standardized.
Standard deviations (1, .., p=8):
[1] 1.9677462 0.9524611 0.9096410 0.8553785 0.7327083 0.6857621 0.6399714
[8] 0.4949534
Rotation (n x k) = (8 x 8):
                                 PC1        PC2         PC3         PC4        PC5
Grip strength              0.3252018  0.2682176 -0.53297421  0.39993408 -0.3653663
Raise your upper body      0.3141190  0.4351668  0.42225757  0.40834395  0.4032249
Long-seat forward bending  0.3077864  0.3745785  0.01503113 -0.75987597 -0.2411453
Repeated side jump         0.3933948  0.1203619  0.05183489 -0.20404673  0.4967487
Shuttle run                0.3132617 -0.4444223  0.59760197  0.01703693 -0.3900527
X50m run                  -0.4057185  0.4620511  0.11729178 -0.10636452 -0.0709927
Standing long jump         0.3681042 -0.3669386 -0.40018514 -0.13933339  0.3055848
Handball throw             0.3844997  0.1955680  0.06075045  0.15245958 -0.3852838
                                  PC6         PC7         PC8
Grip strength             -0.31441687  0.34209544 -0.17004275
Raise your upper body     -0.33321281 -0.29431157  0.08168542
Long-seat forward bending -0.28776668 -0.10238851  0.18941208
Repeated side jump         0.35638527  0.61198108 -0.19529718
Shuttle run               -0.21759749  0.17541898 -0.34157859
X50m run                   0.04215936 -0.08597965 -0.76329592
Standing long jump        -0.10049579 -0.50594605 -0.43684157
Handball throw             0.72184877 -0.34234695  0.01636705

Importance of components:
                         PC1    PC2    PC3     PC4     PC5     PC6    PC7     PC8
Standard deviation     1.968 0.9525 0.9096 0.85538 0.73271 0.68576 0.6400 0.49495
Proportion of Variance 0.484 0.1134 0.1034 0.09146 0.06711 0.05878 0.0512 0.03062
Cumulative Proportion  0.484 0.5974 0.7008 0.79229 0.85940 0.91818 0.9694 1.00000
For pca_cor.csv, the output is as follows.
python version https://gist.github.com/ereyester/3c2173eb61cbcd64b61f23b3d4d6480c
R version https://gist.github.com/ereyester/cd309a9ab46c37a3f594963d1ad55495