[Python] How to do PCA in Python

Use ** scikit-learn ** to do ** PCA ** in Python. There are many explanations of PCA in the world, so I will not explain it here, but just explain how to use it.

Usage is simple. n_components is the number of components. If nothing is specified, it will be the number of dimensions of the data. All you have to do now is pass the data to fit.

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(X)

Details can be found here [http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA.fit_transform).

Test data creation

I created the test data like this. It doesn't matter whether shuffle is separate or not.

In [10]: x = np.linspace(0.2,1,100)

In [11]: y = 0.8*x + np.random.randn(100)*0.1

In [12]: X = np.vstack([x, y]).T

In [13]: np.random.shuffle(X)

pca1.png

PCA PCA should be as follows.

In [14]: from sklearn.decomposition import PCA

In [15]: pca = PCA(n_components=2)

In [16]: pca.fit(X)
Out[16]: PCA(copy=True, n_components=2, whiten=False)

Confirmation of main components

** Main components ** are in components_. Incidentally, you can see the ** mean ** and ** covariance matrix ** with mean_, get_covariance ().

In [17]: print 'components'
    ...: print pca.components_
    ...: print 'mean'
    ...: print pca.mean_
    ...: print 'covariance'
    ...: print pca.get_covariance()
    ...: 
components
[[ 0.71487492  0.69925235]
 [-0.69925235  0.71487492]]
mean
[ 0.6         0.47190318]
covariance
[[ 0.05441077  0.04603365]
 [ 0.04603365  0.0523763 ]]

Now let's calculate the ** mean ** and ** covariance matrix ** by ourselves. The average is the average for each component. The covariance matrix is a calculation of the covariance. (That's right) When calculating the covariance, bias = 1 is for specifying what the denominator of the variance and covariance should be. Here, 1 is specified to divide by the number of data. If it is 0, it is divided by the number of data-1. You can see that the results are the same.

In [22]: mn = np.mean(X,axis=0)

In [23]: z = X - mn

In [24]: cv = np.cov(z[:,0],z[:,1],bias=1)

In [25]: print 'mean'
    ...: print mn
    ...: print 'covariance'
    ...: print cv
    ...: 
mean
[ 0.6         0.47190318]
covariance
[[ 0.05441077  0.04603365]
 [ 0.04603365  0.0523763 ]]

Then calculate the ** eigenvalues ** and ** eigenvectors ** of the ** covariance matrix **. The eigenvectors of the covariance matrix match the principal components. Eigenvalues and eigenvectors are calculated in numpy.linalg.eig.

W, v = np.linalg.eig(cv)

print 'eigenvector'
print v
print 'eigenvalue'
print W

eigenvector
[[ 0.71487492 -0.69925235]
 [ 0.69925235  0.71487492]]
eigenvalue
[ 0.09943842  0.00734865]

W is the eigenvalue and v is the eigenvector. Here, the eigenvectors are ** column vectors *. (Lined vertically) That is, v [:, 0] is the first eigenvector. Note that components_ is a ** row vector ** (side by side). You can see that they match. ( Sometimes the direction is reversed 180 degrees.)

Let's multiply the covariance matrix by the eigenvectors. Since the principal components of the eigenvector and PCA are the same, it is the same as multiplying the covariance matrix by the principal components. Since it is an eigenvector, the direction does not change.

In [28]: print cv.dot(v[:,0].reshape(2,1))
    ...: print v[:,0]*W[0]
    ...: print cv.dot(v[:,1].reshape(2,1))
    ...: print v[:,1]*W[1]
[[ 0.07108603]
 [ 0.06953255]]
[ 0.07108603  0.06953255]
[[-0.00513856]
 [ 0.00525337]]
[-0.00513856  0.00525337]

Principal component vector display

Let's display the first principal component vector and the second principal component vector on the data. You can see that the first principal component points in the direction of large dispersion.

pca2.png

Projection Let's ** Projection ** with the data as the main component. Projection is specifically to take the inner product of the data and the principal component vector.

In [30]: Xd = pca.transform(X)

If you actually take the inner product of the data and the principal component vector and check it, you can see that it is the same.

In [31]: print pca.components_[0]
    ...: print pca.components_[1]
    ...: print X[0,:]
    ...: print z[0,:]
    ...: print pca.components_[0].dot(z[0,:]), pca.components_[1].dot(z[0,:])
    ...: print Xd[0,:]
[ 0.71487492  0.69925235]
[-0.69925235  0.71487492]
[ 0.57979798  0.47996242]
[-0.02020202  0.00805924]
-0.00880647453855 0.0198876592146
[-0.00880647  0.01988766]

If you plot the Projected data, it will be as follows. If you look at the plotted shape, you can see that the principal component vector is rotated to the new axis.

pca3.png

MNIST I tried it with MNIST data. MNIST is handwritten character data. The data can be downloaded from here. The format of the data is also written here. (* This time, I borrowed the place to read the data from the sample code of Tensorflow.)

I read the data and used 256 data of character '3'. (Because there are many if you use all) pca5.png

PCA has set the number of components to 50 as shown below. (50 has no particular meaning)

In [36]: from sklearn.decomposition import PCA

In [37]: N = 50

In [38]: pca = PCA(n_components=N)

In [39]: pca.fit(X)
Out[39]: PCA(copy=True, n_components=50, whiten=False)

The following is an image of the main components. pca4.png

Naturally, Projection will result in 50-dimensional data. Now you can compress the dimensions.

In [44]: Xd = pca.transform(X)

In [45]: print X.shape
(256, 784)

In [46]: print Xd.shape
(256, 50)

Let's go back to the original dimension. Use inverse_transform to revert.

In [51]: Xe = pca.inverse_transform(Xd)

In [52]: print Xe.shape
(256, 784)

The following is a comparison of the original and the result of dimension compression. The upper row is the original, and the lower row is the dimension compressed. Since the number of dimensions was 50, it is difficult to see the difference, but you can see that it is slightly different.

pca6.png

The result of setting the number of dimensions to 3 is as follows. The result of dimensional compression has an extremely small number of dimensions, so small changes cannot be expressed and everyone feels the same.

pca7.png

I pasted the code I used below.

code

import numpy as np
import matplotlib.pyplot as plt

# generate data
x = np.linspace(0.2,1,100)
y = 0.8*x + np.random.randn(100)*0.1
X = np.vstack([x, y]).T
np.random.shuffle(X)

# plot data
fig = plt.figure()
axes = fig.add_subplot(111,aspect='equal')
axes.scatter(X[:,0],X[:,1])
axes.set_xlim([-0.2, 1.4])
axes.set_ylim([-0.2, 1.4])
axes.set_xlabel('x0')
axes.set_ylabel('x1')
axes.vlines(0,-0.2,1.4,linestyles='dashed')
axes.hlines(0,-0.2,1.4,linestyles='dashed')

# PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(X)

# print components and mean
print 'components'
print pca.components_
print 'mean'
print pca.mean_
print 'covariance'
print pca.get_covariance()

mn = np.mean(X,axis=0)
z = X - mn
cv = np.cov(z[:,0],z[:,1],bias=1)
print 'mean'
print mn
print 'covariance'
print cv

W, v = np.linalg.eig(cv)
print 'eigenvector'
print v
print 'eigenvalue'
print W

# covariance matrix x eigenvector
print cv.dot(v[:,0].reshape(2,1))
print v[:,0]*W[0]
print cv.dot(v[:,1].reshape(2,1))
print v[:,1]*W[1]
# display 
fig = plt.figure()
axes = fig.add_subplot(111,aspect='equal')
axes.scatter(X[:,0],X[:,1])
axes.set_xlim([-0.2, 1.4])
axes.set_ylim([-0.2, 1.4])
axes.set_xlabel('x0')
axes.set_ylabel('x1')
axes.vlines(0,-0.2,1.4,linestyles='dashed')
axes.hlines(0,-0.2,1.4,linestyles='dashed')
axes.quiver(pca.mean_[0], pca.mean_[1], pca.components_[0,0],pca.components_[0,1], color='red', width=0.01, scale=3)
axes.quiver(pca.mean_[0], pca.mean_[1], pca.components_[1,0],pca.components_[1,1], color='blue', width=0.01, scale=3)

# projection
Xd = pca.transform(X)
print pca.components_[0]
print pca.components_[1]
print X[0,:]
print z[0,:]
print pca.components_[0].dot(z[0,:]), pca.components_[1].dot(z[0,:])
print Xd[0,:]

fig = plt.figure()
axes = fig.add_subplot(111,aspect='equal')
axes.scatter(Xd[:,0],Xd[:,1])
axes.set_xlabel('xd0')
axes.set_ylabel('xd1')
axes.set_xlim([-1.0, 1.0])
axes.set_ylim([-1.,1.0])
axes.vlines(0,-1.0,1.0,linestyles='dashed')
axes.hlines(0,-1.0,1.0,linestyles='dashed')
# MNIST
# generate data  
import numpy as np
import matplotlib.pyplot as plt
import gzip
image_filename = './data/mnist/train-images-idx3-ubyte.gz'
label_filename = './data/mnist/train-labels-idx1-ubyte.gz'
def _read32(bytestream):
  dt = np.dtype(np.uint32).newbyteorder('>')
  return np.frombuffer(bytestream.read(4), dtype=dt)[0]

with gzip.open(image_filename) as bytestream:
    magic = _read32(bytestream)
    num_images = _read32(bytestream)
    rows = _read32(bytestream)
    cols = _read32(bytestream)
    buf = bytestream.read(rows * cols * num_images)
    data = np.frombuffer(buf, dtype=np.uint8)
    data = data.reshape(num_images, rows, cols)

with gzip.open(label_filename) as bytestream:
    magic = _read32(bytestream)
    num_items = _read32(bytestream)
    buf = bytestream.read(num_items)
    labels = np.frombuffer(buf, dtype=np.uint8)

Xall = data[labels == 3, :, :]
X = Xall[0:256,:,:].reshape(256,28*28)
X = X /255.0
# PCA  
from sklearn.decomposition import PCA
N = 3
pca = PCA(n_components=N)
pca.fit(X)

# plot
import matplotlib.pyplot as plt
import matplotlib.cm as cm

cols = 10
rows = int(np.ceil(N/float(cols)))

fig, axes = plt.subplots(ncols=cols, nrows=rows, figsize=(20,10))

for i in range(N):
    r = i // cols
    c = i % cols
    axes[r, c].imshow(pca.components_[i].reshape(28,28),vmin=-0.5,vmax=0.5, cmap = cm.Greys_r)
    axes[r, c].set_title('component %d' % i)
    axes[r, c].get_xaxis().set_visible(False)
    axes[r, c].get_yaxis().set_visible(False)
    
# projection
Xd = pca.transform(X)
print X.shape
print Xd.shape

# inverse
Xe = pca.inverse_transform(Xd)
print Xe.shape


fig, axes = plt.subplots(ncols=10, nrows=2, figsize=(30,4))


for i in range(10):
    axes[0, i].imshow(X[i,:].reshape(28,28),vmin=0.0,vmax=1.0, cmap = cm.Greys_r)
    axes[0, i].set_title('original %d' % i)
    axes[0, i].get_xaxis().set_visible(False)
    axes[0, i].get_yaxis().set_visible(False)
    
    axes[1, i].imshow(Xe[i,:].reshape(28,28),vmin=0.0,vmax=1.0, cmap = cm.Greys_r)
    axes[1, i].set_title('dimension reduction %d' % i)
    axes[1, i].get_xaxis().set_visible(False)
    axes[1, i].get_yaxis().set_visible(False)

Recommended Posts

[Python] How to do PCA in Python
How to do R chartr () in Python
How to develop in Python
How to do hash calculation with salt in Python
How to collect images in Python
How to use SQLite in Python
How to use Mysql in python
How to wrap C in Python
How to use ChemSpider in Python
How to use PubChem in Python
How to handle Japanese in Python
[Introduction to Python] How to use class in Python?
How to access environment variables in Python
How to dynamically define variables in Python
[Itertools.permutations] How to put permutations in Python
How to work with BigQuery in Python
How to get a stacktrace in python
How to display multiplication table in python
How to extract polygon area in Python
How to do portmanteau test with python
How to check opencv version in python
How to switch python versions in cloud9
How to adjust image contrast in Python
How to use __slots__ in Python class
How to dynamically zero pad in Python
How to do Server-Sent Events in Django
How to use regular expressions in Python
How to display Hello world in python
How to use is and == in Python
How to write Ruby to_s in Python
How to install Python
How to install python
How to use the C library in Python
I want to do Dunnett's test in Python
How to receive command line arguments in Python
[REAPER] How to play with Reascript in Python
How to clear tuples in a list (Python)
How to do multi-core parallel processing with python
How to generate permutations in Python and C ++
How to embed a variable in a python string
How to implement Discord Slash Command in Python
Summary of how to import files in Python 3
How to simplify restricted polynomial fit in python
How to use Python Image Library in python3 series
How to implement shared memory in Python (mmap.mmap)
Minimal implementation to do Union Find in Python
How to create a JSON file in Python
Summary of how to use MNIST in Python
How to specify TLS version in python requests
How to notify a Discord channel in Python
How to get the files in the [Python] folder
How to use tkinter with python in pyenv
How to run Leap Motion in non-Apple Python
[Python] How to draw a histogram in Matplotlib
How to output "Ketsumaimo" as standard output in Python
What to do to get google spreadsheet in python
How to handle datetime type in python sqlite3
How to make Python Interpreter changes in Pycharm
How to plot autocorrelation and partial autocorrelation in python
How to remove duplicate elements in Python3 list
[2020.8 latest] How to install Python