I recently heard about the notMNIST dataset and wrote some Python code to process it, so I'm publishing it here :blush:
MNIST is the test dataset of handwritten digits that anyone studying machine learning knows, but notMNIST is an image dataset of alphabet letters rendered in a wide variety of fonts, rather than handwritten digits.
If you visualize the contents, you get a dataset like the one shown below. The letter at the start of each image is the alphabet letter the image is supposed to represent, displayed together with the corresponding image. Look at the second image from the right in the first row: it doesn't look like an "I" at all, it's a house, isn't it? :sweat_smile: In that way, almost every row contains some questionable data, which I think makes it a very interesting subject.
The official page for notMNIST is http://yaroslavvb.blogspot.jp/2011/09/notmnist-dataset.html; the dataset was created by Yaroslav Bulatov.
First, go to http://yaroslavvb.com/upload/notMNIST/ and download notMNIST_large.tar.gz from there.
Since the file is compressed as tar.gz, extract it with a suitable tool; this creates a notMNIST_large folder, with one subfolder per letter: A, B, C, and so on. Below I write the Python code to read this. I'm using a Jupyter Notebook, so the .ipynb file should be in the same directory as the notMNIST_large folder. If you use a plain .py file, make sure it is created in the same directory as well.
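By the way, if you'd rather script the download and extraction than do them by hand, something like the following works (a minimal sketch using only the standard library; the URL is the same as above):

import os
import tarfile

try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve  # Python 2

url = 'http://yaroslavvb.com/upload/notMNIST/notMNIST_large.tar.gz'
archive = 'notMNIST_large.tar.gz'

# Download the archive if it isn't already here (it is large, so this takes a while)
if not os.path.exists(archive):
    urlretrieve(url, archive)

# Extract into the current directory, creating the notMNIST_large folder
if not os.path.exists('notMNIST_large'):
    with tarfile.open(archive, 'r:gz') as tar:
        tar.extractall('.')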
The full set of code is also uploaded to GitHub, but I'll write it out here as well.
Import the required libraries,
from __future__ import division, print_function
import sys, os, pickle
import numpy as np
import numpy.random as rd
from scipy.misc import imread
import matplotlib.pyplot as plt
%matplotlib inline
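Note that scipy.misc.imread was removed in newer SciPy releases (1.2 and later). If your SciPy no longer has it, the imageio package should work as a near drop-in replacement here:

# Fallback if scipy.misc.imread is unavailable (removed in SciPy >= 1.2)
try:
    from scipy.misc import imread
except ImportError:
    from imageio import imread  # also returns the image as a numpy array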
Define helper functions: a pair for pickling and unpickling, and one for counting zero-byte files.
image_size = 28
depth = 255

def unpickle(filename):
    with open(filename, 'rb') as fo:
        _dict = pickle.load(fo)
    return _dict

def to_pickle(filename, obj):
    with open(filename, 'wb') as f:
        # pickle.dump(obj, f, -1)
        pickle.Pickler(f, protocol=2).dump(obj)

def count_empty_file(folder):
    cnt = 0
    for file in os.listdir(folder):
        if os.stat(os.path.join(folder, file)).st_size == 0:
            cnt += 1
    return cnt
Since I want to save the labels as ints, prepare dictionaries for the conversion in both directions.
label_conv = {a: i for a, i in zip('ABCDEFGHIJ', range(10))}
num2alpha = {i: a for i, a in zip(range(10), 'ABCDEFGHIJ')}
Read each image file in the folders and store it in a numpy ndarray, building the label data from the folder name at the same time. After reading, put the image data under the 'data' key and the labels under the 'target' key of a dictionary, and save the object to a file with pickle. Occasionally a file is corrupted, has size 0, and cannot be read, so the code skips zero-byte files as well as any file that raises a read error.
# Check that the folder to be read exists
assert os.path.exists('notMNIST_large')
# assert os.path.exists('notMNIST_small')  # restore this check when also reading small

for root_dir in ['notMNIST_large']:  # use ['notMNIST_small', 'notMNIST_large'] to process both
    folders = [os.path.join(root_dir, d) for d in sorted(os.listdir(root_dir))
               if os.path.isdir(os.path.join(root_dir, d))]
    # Allocate the arrays up front
    file_cnt = 0
    for folder in folders:
        label_name = os.path.basename(folder)
        file_list = os.listdir(folder)
        file_cnt += len(file_list) - count_empty_file(folder)
    dataset = np.ndarray(shape=(file_cnt, image_size*image_size), dtype=np.float32)
    labels = np.ndarray(shape=(file_cnt,), dtype=np.int)

    last_num = 0  # index one past the last image of the previous letter
    for folder in folders:
        file_list = os.listdir(folder)
        file_cnt = len(file_list) - count_empty_file(folder)
        label_name = os.path.basename(folder)
        labels[last_num:(last_num+file_cnt)] = label_conv[label_name]
        # label = np.array([label_name] * file_cnt)
        skip = 0
        for i, file in enumerate(file_list):
            # Skip zero-byte files
            if os.stat(os.path.join(folder, file)).st_size == 0:
                skip += 1
                continue
            try:
                data = imread(os.path.join(folder, file))
                data = data.astype(np.float32)
                data /= depth  # scale to the 0-1 range
                dataset[last_num+i-skip, :] = data.flatten()
            except Exception:
                skip += 1
                print('error {}'.format(file))
                continue
        last_num += len(file_list) - skip  # advance by the number of images actually stored

    notmnist = {}
    notmnist['data'] = dataset
    notmnist['target'] = labels
    to_pickle('{}.pkl'.format(root_dir), notmnist)
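Once the pickle has been written, a quick sanity check is worthwhile, for example:

# Quick sanity check on the saved pickle
notmnist = unpickle('notMNIST_large.pkl')
print(notmnist['data'].shape)           # (number of images, 784)
print(notmnist['target'].shape)         # (number of images,)
print(np.bincount(notmnist['target']))  # image count per letter A-J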
To use the data, unpickle the file and pull the objects back out. If necessary, rescale the values to the 0-1 range or split the data into training and validation sets.
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions

notmnist = unpickle('notMNIST_large.pkl')  # assumes notMNIST_large.pkl is in the same folder
notmnist_data = notmnist['data']
notmnist_target = notmnist['target']
notmnist_data = notmnist_data.astype(np.float32)
notmnist_target = notmnist_target.astype(np.int32)
# notmnist_data /= 255  # not needed here: the values were already scaled to 0-1 when loading
# train_test_split defaults to 75% training data, with the remainder used for validation
x_train, x_test, y_train, y_test = train_test_split(notmnist_data, notmnist_target)
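A quick shape check confirms the 75%/25% split:

# x_train/x_test row counts should be roughly 3:1
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)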
If you want to see what the loaded images look like, try the following display function.
def draw_digit(digits):
    # each item is [image_vector, label] or [image_vector, label, prediction]
    size = 28
    plt.figure(figsize=(len(digits)*1.5, 2))
    for i, data in enumerate(digits):
        plt.subplot(1, len(digits), i+1)
        X, Y = np.meshgrid(range(size), range(size))
        Z = data[0].reshape(size, size)  # convert from vector to 28x28 matrix
        Z = Z[::-1, :]                   # flip vertically
        plt.xlim(0, 27)
        plt.ylim(0, 27)
        plt.pcolor(X, Y, Z)
        plt.gray()
        if len(data) == 3:
            plt.title('{}/{}'.format(num2alpha[data[1]], num2alpha[data[2]]))  # true/predicted
        else:
            plt.title(num2alpha[data[1]])
        plt.tick_params(labelbottom="off")
        plt.tick_params(labelleft="off")
    plt.show()
This displays 10 rows of 10 randomly chosen images.

[draw_digit([[notmnist_data[idx], notmnist_target[idx]] for idx in rd.randint(len(notmnist_data), size=10)]) for i in range(10)]
Now that the data is finally loaded, let's try classification with a Random Forest. (Since the number of weak learners is set to 100, training will take some time.)
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100)
clf = clf.fit(x_train, y_train)
For now, let's look at the resubstitution accuracy, i.e. the accuracy on the training data itself.
# Resubstitution accuracy (accuracy on the training data)
pred = clf.predict(x_train)
result = [y == p for y, p in zip(y_train, pred)]
np.sum(result)/len(pred)
out
0.99722555413319358
# Check generalization performance on the test data
pred = clf.predict(x_test)
result = [y == p for y, p in zip(y_test, pred)]
np.sum(result)/len(pred)
Generalization performance is also good, at about 91%.
out
0.91262407487205077
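If you want per-letter detail beyond the overall accuracy, scikit-learn's metrics module can break the result down, for example:

from sklearn.metrics import classification_report, confusion_matrix

# Per-letter precision/recall, labeled with the letters A-J
print(classification_report(y_test, pred, target_names=list('ABCDEFGHIJ')))
# Rows are the true letters, columns the predicted ones
print(confusion_matrix(y_test, pred))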
Let's visualize the predictions.

# Visualize results: the titles show "true/predicted"
rd.seed(123)
[draw_digit([[x_test[idx], y_test[idx], pred[idx]] for idx in rd.randint(len(x_test), size=10)]) for i in range(10)]
Some of the mistakes are ones a human could easily make too, but the images that are clearly recognizable as letters are almost all classified correctly :smile: