First, let's prepare some handwritten digit data. We will use the training data (train.csv) from Kaggle's Digit Recognizer competition.
The full file is about 73MB, which is a lot of data, so to keep things easy to follow we pick 20 samples of each digit from 0 to 9, 200 samples in total. Please download the picked-up subset from here.
This handwritten digit data is a CSV file whose rows look like this:

```
8, 0, 0, 0, 128, ... , 54, 23, 0, 0
```

The first value is a label indicating which digit was written, followed by the numerical data for 28x28 = 784 pixels.
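As a minimal sketch of that layout (using a hypothetical all-zero row string, not an actual line from the file), one row splits into a label and a 28x28 pixel array like this:

```py
import numpy as np

# a hypothetical CSV row: label 8 followed by 784 grayscale values
row = "8," + ",".join(["0"] * 784)

fields = row.split(",")
label = int(fields[0])                                      # the written digit
pixels = np.array(fields[1:], dtype=float).reshape(28, 28)  # 28x28 image

print(label, pixels.shape)
```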
First, import the required libraries.
```py
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
```
Next, we read the data, store it in an array, and sort it by label.
```py
size = 28
raw_data = np.loadtxt('train_small.csv', delimiter=',', skiprows=1)
digit_data = []
for i in range(len(raw_data)):
    digit_data.append((raw_data[i, 0], raw_data[i, 1:785]))
digit_data.sort(key=lambda x: x[0])  # sort array by label
```
First of all, let's display what the loaded data looks like as images (with matplotlib's pcolor).
```py
# draw digit images
plt.figure(figsize=(15, 15))
for i in range(len(digit_data)):
    X, Y = np.meshgrid(range(size), range(size))
    Z = digit_data[i][1].reshape(size, size)  # convert from vector to 28x28 matrix
    Z = Z[::-1, :]                            # flip vertically
    plt.subplot(10, 20, i + 1)                # lay out 200 cells
    plt.xlim(0, 27)
    plt.ylim(0, 27)
    plt.pcolor(X, Y, Z)
    plt.gray()
    plt.tick_params(labelbottom=False)        # hide tick labels
    plt.tick_params(labelleft=False)
plt.show()
```
The 8th sample of "2" is remarkable: it gives no impression of "2" at all (laugh). If you weren't told it was a "2", even a human couldn't recognize it... This is the dataset we will be using this time.
Now let's build a correlation matrix from this 28x28 = 784-pixel image data, treating each image as a 784-dimensional vector whose elements are grayscale intensities. One may question how meaningful a simple correlation is here, but I feel even this simple method captures the closeness of the images to some extent. Since the result is a 200x200 matrix, the raw numbers are impossible to take in, so let's visualize it as a graph to get a feel for it.
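A quick sanity check of the approach, using small synthetic vectors rather than the digit data: `np.corrcoef` treats each row of its input as one variable, so feeding it a matrix of stacked image vectors yields exactly the pairwise correlation matrix we want, and a noisy copy of a vector correlates far more strongly with it than an unrelated one.

```py
import numpy as np

# three synthetic 784-dimensional "images"
rng = np.random.default_rng(0)
a = rng.random(784)
b = a + 0.1 * rng.random(784)   # a noisy copy of a
c = rng.random(784)             # unrelated

A = np.vstack([a, b, c])
R = np.corrcoef(A)   # 3x3; R[i, j] is the correlation between rows i and j

print(R.shape)
print(R[0, 1] > R[0, 2])  # the noisy copy correlates more strongly
```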
It's a pretty spectacular graph (laughs)
The diagonal elements compare identical data, so their correlation is exactly 1. Glancing over the rest, the diagonal blocks (correlations among samples of the same digit) look a little darker. "1" in particular is clearly highly correlated with itself.
In Python, the calculation goes as follows.
```py
data_mat = []
# convert list to ndarray
for i in range(len(digit_data)):
    label = digit_data[i][0]
    data_mat.append(digit_data[i][1])
A = np.array(data_mat)
Z = np.corrcoef(A)  # generate correlation matrix
area_size = len(digit_data)
X, Y = np.meshgrid(range(area_size), range(area_size))
```
To make it a little easier to see, let's set a threshold and plot entries with a correlation coefficient above it as 1 and those at or below it as 0. I chose 0.5 and 0.6 as thresholds; they are arbitrary, picked after trying several values until the diagonal blocks began to emerge. Looking at the 0.6 plot, there seems to be a real difference between the diagonal blocks and the rest. It also suggests that "9" and "7" are similar. You can see that "2" has particularly low correlation even among the other "2"s.
```py
plt.clf()
plt.figure(figsize=(10, 10))
plt.xlim(0, area_size - 1)
plt.ylim(0, area_size - 1)
plt.title("Correlation matrix of digit character vector. (corr>0.5)")
thresh = .5
Z1 = Z.copy()
Z1[Z1 > thresh] = 1
Z1[Z1 <= thresh] = 0
plt.pcolor(X, Y, Z1, cmap=cm.get_cmap('Blues'), alpha=0.6)
plt.xticks([(i * 20) for i in range(10)], range(10))
plt.yticks([(i * 20) for i in range(10)], range(10))
plt.grid(color='deeppink', linestyle='--')
plt.show()
```
Finally, let's show the average value for each block in a 10x10 graph.
```py
summary_Z = np.zeros(100).reshape(10, 10)
for i in range(10):
    for j in range(10):
        i1 = i * 20
        j1 = j * 20
        if i == j:
            # the diagonal elements are fixed at 1, so exclude them
            # from the average to avoid inflating the value
            summary_Z[i, j] = (np.sum(Z[i1:i1+20, j1:j1+20]) - 20) / 380
        else:
            summary_Z[i, j] = np.sum(Z[i1:i1+20, j1:j1+20]) / 400

# average of each digit's grid
plt.clf()
plt.figure(figsize=(10, 10))
plt.xlim(0, 10)
plt.ylim(0, 10)
sX, sY = np.meshgrid(range(11), range(11))
plt.title("Correlation matrix of summation of each digit's cell")
plt.xticks(range(10), range(10))
plt.yticks(range(10), range(10))
plt.pcolor(sX, sY, summary_Z, cmap=cm.get_cmap('Blues'), alpha=0.6)
plt.show()
```
This time, I tried a rough analysis in the sense that each image is treated as a 784-dimensional vector and the vectors are correlated as they are. But since image data is inherently two-dimensional, I think a more plausible measure of closeness between images could be obtained by also taking the values of neighboring pixels (above, below, left, and right) into account. At this stage we are still before any machine learning, yet the diagonal blocks showed up properly. I will try something a little more serious as the next step, in the next article.
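One hedged illustration of that neighboring-pixel idea, using synthetic stroke images rather than the Kaggle data: blurring each image before correlating lets nearby pixels contribute to the similarity, so two vertical strokes shifted by a single pixel, which share no lit pixels at all, still come out strongly correlated after smoothing.

```py
import numpy as np

def box_blur(img):
    # simple 3x3 box blur so each pixel also "sees" its neighbors
    p = np.pad(img, 1)
    return sum(p[di:di + 28, dj:dj + 28]
               for di in range(3) for dj in range(3)) / 9.0

def corr(x, y):
    return np.corrcoef(x.ravel(), y.ravel())[0, 1]

# two synthetic vertical strokes, shifted by one pixel (no overlap at all)
img1 = np.zeros((28, 28)); img1[4:24, 13] = 255.0
img2 = np.zeros((28, 28)); img2[4:24, 14] = 255.0

raw = corr(img1, img2)                           # slightly negative: disjoint pixels
smoothed = corr(box_blur(img1), box_blur(img2))  # much higher after blurring

print(raw, smoothed)
```

This is only a sketch of the direction, of course; a proper treatment would use something like a Gaussian kernel and tune its width.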