Hello. I'd like to record what I learned in a university class here on Qiita as a memo. I've posted the sample code on GitHub; I'm not sure how helpful it will be, but if you're interested, please take a look. https://github.com/tkshim/MNIST/blob/master/BayesMNIST.py
Contents: ・PCA (Principal Component Analysis) is used to reduce the dimensionality of the feature vectors, and a naive Bayes classifier is used to recognize handwritten digit images.
Purpose: ・Share a sample of machine learning code in Python.
Target audience: ・People who understand the basic theory of machine learning and want to see how others implement it in code.
Environment: ・MacBook Air ・OSX 10.11.6 ・Python 3.x ・Numpy ・Pandas ・Sklearn
Summary: ・Using sklearn's built-in GaussianNB, I was able to reach accuracy in the high 80% range.
■step1 Download the dataset to be analyzed from Professor LeCun's page (New York University): http://yann.lecun.com/exdb/mnist/index.html Then set the data-storage variables below according to your environment.
DATA_PATH = '/Users/takeshi/MNIST_data'
TRAIN_IMG_NAME = 'train-images.idx3-ubyte'
TRAIN_LBL_NAME = 'train-labels.idx1-ubyte'
TEST_IMG_NAME = 't10k-images.idx3-ubyte'
TEST_LBL_NAME = 't10k-labels.idx1-ubyte'
■step2, step3 Read the training and test datasets into NumPy arrays (a sketch of one way to do this is shown below). You can then use imshow to check which digit each sample represents.
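The post itself does not show the loading code, so here is a minimal sketch of reading the idx-format files using the DATA_PATH and filename variables from step 1 (the struct-based parser is my own assumption, not the author's code):

import os
import struct
import numpy as np

def load_idx_images(path):
    # idx3 format: 16-byte header (magic, count, rows, cols), then one unsigned byte per pixel
    with open(path, 'rb') as f:
        _, num, rows, cols = struct.unpack('>IIII', f.read(16))
        return np.frombuffer(f.read(), dtype=np.uint8).reshape(num, rows * cols)

def load_idx_labels(path):
    # idx1 format: 8-byte header (magic, count), then one unsigned byte per label
    with open(path, 'rb') as f:
        _, num = struct.unpack('>II', f.read(8))
        # column vector so that np.vstack on the labels in step 4 works as written
        return np.frombuffer(f.read(), dtype=np.uint8).reshape(num, 1)

Xtr = load_idx_images(os.path.join(DATA_PATH, TRAIN_IMG_NAME))
Ttr = load_idx_labels(os.path.join(DATA_PATH, TRAIN_LBL_NAME))
Xte = load_idx_images(os.path.join(DATA_PATH, TEST_IMG_NAME))
Tte = load_idx_labels(os.path.join(DATA_PATH, TEST_LBL_NAME))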
print("The shape of matrix is : ", Xtr.shape)
print("Label is : ", Ttr.shape)
plt.imshow(Xte[0].reshape(28, 28),interpolation='None', cmap=cm.gray)
show()
■step4 This is the heart of PCA. Each image is represented by 28x28 = 784 values; the covariance matrix of these 784-dimensional feature vectors is computed, and its eigenvalues and eigenvectors are obtained with the eigh function.
X = np.vstack((Xtr, Xte))        # combine the training and test images: (70000, 784)
T = np.vstack((Ttr, Tte))        # combine the corresponding labels
print(X.shape)
print(T.shape)
import numpy as np
import numpy.linalg as LA
μ = np.mean(X, axis=0)           # mean image (784 values)
Z = X - μ                        # center the data
C = np.cov(Z, rowvar=False)      # 784x784 covariance matrix
λ, V = LA.eigh(C)                # eigenvalues (ascending) and eigenvectors (columns of V)
# Sanity check: the columns of V, not the rows, are the eigenvectors,
# so only the second ratio comes out as (approximately) a vector of ones.
row = V[0, :]; col = V[:, 0]
np.dot(C, row) / (λ[0] * row)
np.dot(C, col) / (λ[0] * col)
# Reorder to descending eigenvalues; after flipping V.T, the rows of V are the eigenvectors.
λ = np.flipud(λ)
V = np.flipud(V.T)
row = V[0, :]
np.dot(C, row) / (λ[0] * row)    # check again: should be (approximately) all ones
P = np.dot(Z, V.T)               # project the centered data onto the principal components
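In standard PCA notation, the code above computes the following (my own summary of the steps):

\mu = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad Z = X - \mu, \qquad C = \frac{1}{n-1} Z^{\top} Z

C\, v_j = \lambda_j v_j \quad (\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_{784}), \qquad P = Z V^{\top}

where the rows of V are the eigenvectors sorted by decreasing eigenvalue.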
■step5 There are 784 eigenvectors (= principal components), but instead of using all of them we use, for example, only the first two (= dimensionality reduction). Fitting GaussianNB on those two components completes the recognition model.
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
# Apply the training dataset to this model
# A: the number of training set
# B: the number of dimension
A = 60000
B = 2
model.fit(P[0:A,0:B],T[0:A])
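For reference, GaussianNB fits one one-dimensional Gaussian per class and per principal component, and classifies with the standard naive Bayes rule (the general formula, not something specific to this post):

\hat{y} = \arg\max_{k} \; P(C_k) \prod_{j=1}^{B} \mathcal{N}\!\left(p_j \mid \mu_{k,j}, \sigma_{k,j}^{2}\right)

where p_j is the j-th principal component of a sample and \mu_{k,j}, \sigma_{k,j}^{2} are estimated from the training data of class k.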
■step6 With only two eigenvectors, the accuracy on the test data is 44.7%, which is a very poor result.
from sklearn import metrics
predicted = model.predict(P[A:70001,0:B])
expected = T[A:70001,]
print ('The accuracy is : ', metrics.accuracy_score(expected, predicted)*100, '%')
■step7 Display the classification report and the confusion matrix so you can check how well each digit is recognized.
import matplotlib.pyplot as plt
import seaborn as sns
print (' === Classification Report ===')
print (metrics.classification_report(expected, predicted))
cm = metrics.confusion_matrix(expected, predicted)
plt.figure(figsize=(9, 6))
sns.heatmap(cm, linewidths=.9,annot=True,fmt='g')
plt.suptitle('MNIST Confusion Matrix (GaussianNativeBayesian)')
plt.show()
With two eigenvectors, the digit "1" does reasonably well at 83%, while "2" and "5" are hardly ever recognized correctly.
Why? Because with only two eigenvectors, the projections of some digits overlap, making it hard to tell which digit is which.
Let's look at an easy-to-understand example. In the matrix above, the digit 4 is misrecognized as the digit 1 zero times, but it is misrecognized as the digit 9 374 times. Below are three-dimensional plots of the principal-component projections for the digits 1 and 4, and for 4 and 9. For 1 and 4 the two sets are cleanly separated, but for 4 and 9 they almost completely overlap.
■ Digits 1 and 4 ■ Digits 4 and 9
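The post shows these plots only as images; here is a minimal sketch of how such a scatter plot of the first three principal components could be produced (my own code, assuming the P and T arrays from step 4):

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3D projection)

def plot_digit_pair(d1, d2):
    # Scatter the first three principal components for two digit classes.
    labels = T.ravel()                      # T may be stored as a column vector
    fig = plt.figure(figsize=(7, 6))
    ax = fig.add_subplot(111, projection='3d')
    for digit, color in [(d1, 'tab:blue'), (d2, 'tab:red')]:
        mask = (labels == digit)
        ax.scatter(P[mask, 0], P[mask, 1], P[mask, 2], s=2, c=color, label=str(digit))
    ax.set_xlabel('PC1'); ax.set_ylabel('PC2'); ax.set_zlabel('PC3')
    ax.legend()
    plt.show()

plot_digit_pair(1, 4)   # the two clusters separate cleanly
plot_digit_pair(4, 9)   # the clusters overlap heavily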
Therefore, increasing the number of eigenvectors used improves accuracy. In this environment, using around 70 eigenvectors seems to maximize accuracy (high 80% range).
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
# Apply the training dataset to this model
# A: the number of training set
# B: the number of dimension
A = 60000
B = 70 # <-Gradually increase.
model.fit(P[0:A,0:B],T[0:A])
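To see where the accuracy peaks, one could sweep B over several values (a quick sketch, not from the post):

from sklearn.naive_bayes import GaussianNB
from sklearn import metrics

A = 60000
for B in (2, 5, 10, 20, 40, 70, 100, 150):
    model = GaussianNB()
    model.fit(P[0:A, 0:B], np.ravel(T[0:A]))
    predicted = model.predict(P[A:, 0:B])
    accuracy = metrics.accuracy_score(np.ravel(T[A:]), predicted)
    print('B = %3d : accuracy = %.1f %%' % (B, accuracy * 100))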
■ Comparison of 2 vs. 70 eigenvectors The following shows the difference when the digit 0 is reconstructed with 2 and with 70 eigenvectors. With 70, you can see that the outline becomes much clearer.
Xrec2 = (np.dot(P[:, 0:2], V[0:2, :])) + μ     # reconstruction using 2 components
Xrec70 = (np.dot(P[:, 0:70], V[0:70, :])) + μ  # reconstruction using 70 components
plt.imshow(Xrec2[1].reshape(28, 28), interpolation='None', cmap=cm.gray)
plt.show()
plt.imshow(Xrec70[1].reshape(28, 28), interpolation='None', cmap=cm.gray)
plt.show()
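In other words, keeping only the first k principal components gives the standard rank-k PCA reconstruction

\hat{X}_k = P_{:,\,1:k}\, V_{1:k,\,:} + \mu

so a small k keeps only the coarsest structure of the digit, and a larger k restores finer detail.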
■ Summary ・I was able to obtain accuracy in the high 80% range with this machine learning approach. ・This time I used sklearn's naive Bayes classifier; next time I would like to implement the classifier from scratch in Python and aim for the 90% range.