Introduction to Python: Machine Learning Basics (Unsupervised Learning / Principal Component Analysis)
Principal component analysis summarizes a whole dataset into one to three dimensions that are easy to understand and give a good overview. Big data is multivariate and high-dimensional, so it is hard to grasp as it is; by performing principal component analysis, the overall picture of the data can be visualized in a form anyone can understand, while losing as little as possible of the information the data contains.
The following is an excerpt from Wikipedia
Principal component analysis (PCA) is a method of multivariate analysis that synthesizes, from a large number of correlated variables, a small number of uncorrelated variables called principal components that best represent the overall variation [1]. It is used to reduce the dimensionality of data.
The transformation that gives the principal components is chosen so that the variance of the first principal component is maximized, and the variance of each subsequent principal component is maximized under the constraint that it is orthogonal to the principal components determined before it. Maximizing the variance of the principal components gives them as much power as possible to explain the variation in the observed values. The selected principal components are mutually orthogonal, and a given set of observations can be represented as a linear combination of them; in other words, the principal components form an orthogonal basis for the set of observations. The orthogonality of the principal component vectors follows from the fact that they are eigenvectors of the covariance matrix (or correlation matrix), which is a real symmetric matrix.
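Before moving to scikit-learn, the definition above can be illustrated directly with NumPy: the principal axes are the eigenvectors of the covariance matrix, sorted by their eigenvalues. The sketch below is only illustrative; the data and names such as rng, X_demo and scores are assumptions for this example, not part of the original article.

import numpy as np

# Minimal sketch of PCA via eigendecomposition of the covariance matrix.
# The data matrix has shape (n_samples, n_features) and is centered first.
rng = np.random.RandomState(0)
X_demo = rng.randn(200, 3)
X_demo = X_demo - X_demo.mean(axis=0)        # center each variable

cov = np.cov(X_demo, rowvar=False)           # covariance matrix (features x features)
eigvals, eigvecs = np.linalg.eigh(cov)       # real symmetric matrix -> orthogonal eigenvectors

# Sort the eigenpairs by decreasing eigenvalue: the first axis explains the most variance
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = X_demo @ eigvecs                    # project the observations onto the principal axes
print(eigvals)                               # variance along each principal axis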
The following program uses a RandomState object to generate a two-variable dataset, standardizes each variable, and plots the result.
import numpy as np
import scipy as sp
import scipy.stats  # make sp.stats available
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# Create a RandomState object with the seed (initial value of the random numbers) set to 1
sample = np.random.RandomState(1)

# Generate a correlated two-variable dataset (200 samples) using the rand and randn functions
X = np.dot(sample.rand(2, 2), sample.randn(2, 200)).T

# Standardization
sc = StandardScaler()
X_std = sc.fit_transform(X)

# Calculate the correlation coefficient and draw a scatter plot
print('Correlation coefficient: {:.3f}'.format(sp.stats.pearsonr(X_std[:, 0], X_std[:, 1])[0]))
plt.scatter(X_std[:, 0], X_std[:, 1])
The following is the output result
Correlation coefficient: 0.889
Reference URLs for the standardization part:
- scikit-learn fit() / transform() / fit_transform()
- What is standardization
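As a side check (not in the original article), standardization with StandardScaler is equivalent to subtracting each column's mean and dividing by its population standard deviation:

# Illustrative check: fit_transform is the same as (X - mean) / std with ddof=0
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.allclose(X_std, X_manual))  # True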
Principal component analysis can be performed using the PCA class of the sklearn.decomposition module.
When initializing a PCA object, specify how many dimensions you want to compress the variables into, that is, the number of principal components you want to extract, as n_components.
Normally, set a value smaller than the number of original variables (e.g., reduce 30 variables to 5).
Executing the fit method learns the information necessary for extracting the principal components (specifically, the eigenvalues and eigenvectors are computed).
#import
from sklearn.decomposition import PCA
#Principal component analysis
pca = PCA(n_components=2)
pca.fit(X_std)
Output result
PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)
The components_ attribute holds the eigenvectors and represents the orientation of the new feature-space axes discovered by principal component analysis.
print(pca.components_)
Output result
[[-0.707 -0.707] #Orientation of the first principal component
[-0.707 0.707]] #Orientation of the second principal component
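Because the principal axes are unit-length and mutually orthogonal, the rows of components_ form an orthonormal set. A quick illustrative check (not in the original article):

# components_ @ components_.T should be (numerically) the 2x2 identity matrix
print(np.allclose(pca.components_ @ pca.components_.T, np.eye(2)))  # True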
The explained_variance_ attribute represents the variance of each principal component.
print('Variance of each principal component: {}'.format(pca.explained_variance_))
Output result
Variance of each principal component: [1.899 0.111]
The variances of the two extracted principal components are 1.899 and 0.111, respectively. It is no coincidence that their sum is approximately 2.0: the sum of the variances of the (standardized) original variables equals the sum of the variances of the principal components, so the variance (information) is preserved. (The slight excess over 2.0 comes from scikit-learn using the unbiased, n-1 variance estimator, while StandardScaler standardizes with the population variance.)
The explained_variance_ratio_ attribute gives the proportion of the variance explained by each principal component.
print('Variance ratio of each principal component: {}'.format(pca.explained_variance_ratio_))
Output result
Variance ratio of each principal component: [0.945 0.055]
The first value, 0.945, is obtained as 1.899 / (1.899 + 0.111), so the first principal component holds 94.5% of the information in the original data.
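This can be checked by hand (an illustrative aside). Because both principal components are kept here (2 components extracted from 2 variables), the total variance equals the sum of the component variances:

# Reproduce explained_variance_ratio_ from explained_variance_
print(pca.explained_variance_ / pca.explained_variance_.sum())  # ~[0.945 0.055]
print(pca.explained_variance_ratio_.sum())                      # ~1.0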
Let's visualize the above results.
# Arrow style settings
arrowprops = dict(arrowstyle='->',
                  linewidth=2,
                  shrinkA=0, shrinkB=0)

# Function for drawing an arrow from v0 to v1
def draw_vector(v0, v1):
    plt.gca().annotate('', v1, v0, arrowprops=arrowprops)

# Plot the original (standardized) data
plt.scatter(X_std[:, 0], X_std[:, 1], alpha=0.2)

# Display the two principal component axes as arrows
for length, vector in zip(pca.explained_variance_, pca.components_):
    v = vector * 3 * np.sqrt(length)
    draw_vector(pca.mean_, pca.mean_ + v)

# Use the same scale on the x and y axes
plt.axis('equal');
The following is the output result. The arrows show the directions of the axes of the new feature space found by principal component analysis. As the figure shows, the vector pointing in the direction of maximum variance is the first principal component, and the vector in the direction of the next largest variance is the second principal component; the two are orthogonal to each other (an orthonormal basis).
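For completeness, the standardized data can also be projected onto these new axes with the transform method (an illustrative aside, reusing pca, X_std, np, and plt from the cells above); the projected features are uncorrelated and their variances match explained_variance_:

# Project the standardized data onto the principal axes
X_proj = pca.transform(X_std)
print(X_proj.shape)                               # (200, 2)

# Covariance of the projected data: ~diag(1.899, 0.111), off-diagonals ~0
print(np.round(np.cov(X_proj, rowvar=False), 3))

# Scatter plot in the new feature space
plt.scatter(X_proj[:, 0], X_proj[:, 1], alpha=0.2)
plt.axis('equal');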
From here, let's look concretely at the situations in which compressing dimensions with principal component analysis is useful.
Breast cancer data can be loaded using the load_breast_cancer function in sklearn.datasets. The code below loads the data and visualizes the distribution of each explanatory variable, split by whether the objective variable (cancer.target) is malignant or benign.
# Import for reading the breast cancer data
from sklearn.datasets import load_breast_cancer

# Load the breast cancer data
cancer = load_breast_cancer()

# Filter the data into malignant and benign
# malignant: cancer.target == 0
malignant = cancer.data[cancer.target==0]
# benign: cancer.target == 1
benign = cancer.data[cancer.target==1]

# Histograms, blue for malignant and orange for benign
# Each panel shows the distribution of one explanatory variable (mean radius, etc.) by class
fig, axes = plt.subplots(6, 5, figsize=(20, 20))
ax = axes.ravel()

for i in range(30):
    _, bins = np.histogram(cancer.data[:, i], bins=50)
    ax[i].hist(malignant[:, i], bins, alpha=.5)
    ax[i].hist(benign[:, i], bins, alpha=.5)
    ax[i].set_title(cancer.feature_names[i])
    ax[i].set_yticks(())

# Label settings
ax[0].set_ylabel('Count')
ax[0].legend(['malignant', 'benign'], loc='best')
fig.tight_layout()
The following is the output result
In most of the histograms, the malignant and benign data overlap, and it is difficult to tell where to draw a boundary to distinguish between the two classes.
Now let's use principal component analysis to reduce the dimensionality of these 30 variables. Specifically, the data used as explanatory variables are standardized and principal component analysis is performed, with the number of principal components to extract (n_components) set to 2.
#Standardization
sc = StandardScaler()
X_std = sc.fit_transform(cancer.data)
#Principal component analysis
pca = PCA(n_components=2)
pca.fit(X_std)
X_pca = pca.transform(X_std)
#display
print('X_pca shape:{}'.format(X_pca.shape))
print('Explained variance ratio:{}'.format(pca.explained_variance_ratio_))
Output result
X_pca shape:(569, 2)
Explained variance ratio:[0.443 0.19 ]
Checking the explained_variance_ratio_ attribute shows that, although the number of variables has been reduced to two, about 63% (= 0.443 + 0.19) of the original information is condensed into the first and second principal components. The output "X_pca shape: (569, 2)" confirms that the transformed data has 569 rows and 2 columns (2 variables), matching the number of principal components we specified.
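As an aside (not part of the original article), a common way to decide how many principal components to keep is to look at the cumulative explained variance ratio, or to pass a target ratio directly as n_components:

# Fit PCA with all components and inspect the cumulative explained variance ratio
pca_full = PCA().fit(X_std)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
print(np.argmax(cumulative >= 0.90) + 1)      # number of components needed for ~90%

# Equivalently, scikit-learn accepts a ratio (0 < n_components < 1) directly
pca_90 = PCA(n_components=0.90).fit(X_std)
print(pca_90.n_components_)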
Next, let's visualize the dimensionally reduced data. As preparation, the objective variable corresponding to each observation is attached to the first and second principal component scores, and the data are then split into malignant and benign.
import pandas as pd

# Turn the scores into a DataFrame: the first column is the first principal component, the second is the second
X_pca = pd.DataFrame(X_pca, columns=['pc1', 'pc2'])

# Attach the objective variable (cancer.target) by joining horizontally
X_pca = pd.concat([X_pca, pd.DataFrame(cancer.target, columns=['target'])], axis=1)

# Separate malignant and benign
pca_malignant = X_pca[X_pca['target']==0]
pca_benign = X_pca[X_pca['target']==1]

# Plot malignant (red)
ax = pca_malignant.plot.scatter(x='pc1', y='pc2', color='red', label='malignant');
# Plot benign (blue)
pca_benign.plot.scatter(x='pc1', y='pc2', color='blue', label='benign', ax=ax);
The following is the output result
The graph above shows that, in this case, the classes of the objective variable can be almost completely separated using only two principal components. When there are many variables and it is unclear which ones to use for analysis, performing principal component analysis like this and then (1) clarifying the relationship between each principal component and the objective variable, and (2) interpreting the relationship between the original variables and the objective variable through the relationship between each principal component and the original variables, helps deepen your understanding of the data (one way to approach step (2) is sketched below).
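One way to carry out step (2) is to attach the original feature names to the components_ matrix and see which variables weigh most heavily on each principal component. The sketch below is illustrative; the name loadings is not from the original article:

import pandas as pd

# Each row of components_ shows how strongly each original (standardized) variable
# contributes to that principal component
loadings = pd.DataFrame(pca.components_,
                        columns=cancer.feature_names,
                        index=['pc1', 'pc2'])

# Variables that contribute most to the first principal component
print(loadings.loc['pc1'].abs().sort_values(ascending=False).head())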
It is also worth remembering that principal component analysis can be used to reduce the number of variables (dimensionality reduction) when building a prediction model, as sketched below.
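For example, PCA can be placed in front of a classifier in a scikit-learn Pipeline; the classifier used below (logistic regression) is just an illustration, not something from the original article:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize -> reduce to 2 dimensions with PCA -> classify
model = make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression())
scores = cross_val_score(model, cancer.data, cancer.target, cv=5)
print('Mean CV accuracy: {:.3f}'.format(scores.mean()))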