The content is a little out of date, so I rewrote it on my personal blog. https://kakedashi-engineer.appspot.com/2020/02/10/mnist/
MNIST is a dataset of 28 x 28 pixel images of handwritten digits. People working in deep learning often reach for it as a first benchmark.
Each pixel takes an integer value from 0 to 255. There are 70,000 images in total, of which 60,000 are training data and 10,000 are test data. The key point is that the training data and test data are separated from the start. Sometimes I use this split as it is, and sometimes the last 10,000 training images are held out as validation data. (What is validation data? Try searching for "early stopping".)
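As a minimal sketch of the hold-out described above, the last 10,000 training examples can be sliced off as validation data. The function name `split_validation` and its parameters are hypothetical, chosen only for illustration; the slicing works on plain lists as well as numpy arrays.

```python
def split_validation(images, labels, n_val=10000):
    """Hold out the last n_val training examples as validation data."""
    if len(images) != len(labels):
        raise ValueError("images and labels must have the same length")
    if not 0 < n_val < len(images):
        raise ValueError("n_val must be between 1 and len(images) - 1")
    # Everything except the last n_val examples stays in the training set.
    train = (images[:-n_val], labels[:-n_val])
    # The last n_val examples become the validation set.
    val = (images[-n_val:], labels[-n_val:])
    return train, val
```

With the standard 60,000 training images this leaves 50,000 for training and 10,000 for validation, while the 10,000 test images stay untouched.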
The data is not sorted by class. For example, the training labels look like this:
[5 0 4 ..., 8 4 8]
The order is irregular.
One thing to note is the proportion of each digit. I had naively assumed that each digit had 6,000 + 1,000 images, but when I checked, that turned out not to be the case. There are about 24% more images of "1" than of "5".
      0      1      2      3      4      5      6      7      8      9
[  5923.  6742.  5958.  6131.  5842.  5421.  5918.  6265.  5851.  5949.]  # training data
[   980.  1135.  1032.  1010.   982.   892.   958.  1028.   974.  1009.]  # test data
[  6903.  7877.  6990.  7141.  6824.  6313.  6876.  7293.  6825.  6958.]  # training data + test data
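The counts above can be checked with a few lines of standard-library Python. This is only a sketch: the numbers are copied from the table, and `class_counts` is a hypothetical helper showing how such a table could be produced from a sequence of labels.

```python
from collections import Counter

# Per-digit counts copied from the table (index = digit 0..9).
train_counts = [5923, 6742, 5958, 6131, 5842, 5421, 5918, 6265, 5851, 5949]
test_counts = [980, 1135, 1032, 1010, 982, 892, 958, 1028, 974, 1009]

# Training + test counts per digit.
total_counts = [a + b for a, b in zip(train_counts, test_counts)]

# How many more "1" images than "5" images there are in the training data.
ratio_1_to_5 = train_counts[1] / train_counts[5]  # about 1.24


def class_counts(labels, n_classes=10):
    """Count how often each class 0..n_classes-1 occurs in labels."""
    counts = Counter(int(label) for label in labels)
    return [counts.get(c, 0) for c in range(n_classes)]
```

Because the classes are not balanced, it is worth knowing these proportions before reading per-class results.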
There is one more thing to note. Some pixels, such as those in the corners of the image, are 0 in every one of the 70,000 images. If you try to normalize pixel by pixel (for example, dividing each pixel by its own standard deviation or range), you will therefore hit a division by zero. In sample code it seems common to simply divide the whole array by 255.0 in one go with numpy.
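The following sketch illustrates the problem on a small random stand-in for MNIST (not the real data): one pixel is forced to 0 in every image, so its per-pixel standard deviation is exactly 0, whereas global scaling by 255.0 needs no per-pixel statistics. The array shapes and variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for MNIST: 100 "images" of 4 x 4 uint8 pixels.
images = rng.integers(0, 256, size=(100, 4, 4), dtype=np.uint8)
images[:, 0, 0] = 0  # the corner pixel is 0 in every image, as in MNIST

# Flatten each image to a vector and convert to float.
x = images.reshape(len(images), -1).astype(np.float32)

# Per-pixel standard deviation: 0 for any pixel that never changes,
# so dividing by it would be a division by zero.
std = x.std(axis=0)
n_constant_pixels = int((std == 0).sum())

# Global scaling to [0, 1] avoids the problem entirely.
x_scaled = x / 255.0
```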
The introduction ran long. Next, I will briefly introduce how to use MNIST in Python.
You can download the dataset from the MNIST page on Yann LeCun's website.
train-images-idx3-ubyte.gz: training set images (9912422 bytes)
train-labels-idx1-ubyte.gz: training set labels (28881 bytes)
t10k-images-idx3-ubyte.gz: test set images (1648877 bytes)
t10k-labels-idx1-ubyte.gz: test set labels (4542 bytes)
Because these files contain binary data rather than text, they need some processing before use. As you can see, the data is split into training data and test data, and the labels are stored in separate files from the images. The labels are integer values from 0 to 9, so I think many people will want to convert them to a 1-of-K (one-hot) representation. For that, sklearn.preprocessing.LabelBinarizer is recommended.
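As a rough sketch of that processing, the files use the IDX format: a big-endian header (magic number 2051 for images, 2049 for labels, followed by the dimensions) and then raw unsigned bytes. The helper names below are hypothetical. The `one_hot` function is a numpy-only stand-in for the 1-of-K encoding, swapped in here so the example has no sklearn dependency; it is not LabelBinarizer itself.

```python
import gzip
import struct

import numpy as np


def read_gz(path):
    """Return the decompressed bytes of a .gz file."""
    with gzip.open(path, "rb") as f:
        return f.read()


def parse_idx_images(raw):
    """Parse IDX image bytes into a uint8 array of shape (n, rows, cols)."""
    magic, n, rows, cols = struct.unpack(">IIII", raw[:16])
    if magic != 2051:
        raise ValueError(f"unexpected magic number for images: {magic}")
    pixels = np.frombuffer(raw, dtype=np.uint8, count=n * rows * cols, offset=16)
    return pixels.reshape(n, rows, cols)


def parse_idx_labels(raw):
    """Parse IDX label bytes into a uint8 array of shape (n,)."""
    magic, n = struct.unpack(">II", raw[:8])
    if magic != 2049:
        raise ValueError(f"unexpected magic number for labels: {magic}")
    return np.frombuffer(raw, dtype=np.uint8, count=n, offset=8)


def one_hot(labels, n_classes=10):
    """Convert integer labels to 1-of-K rows (numpy-only illustration)."""
    return np.eye(n_classes, dtype=np.uint8)[labels]


# Typical use, assuming the files have been downloaded to the current directory:
# images = parse_idx_images(read_gz("train-images-idx3-ubyte.gz"))
# labels = parse_idx_labels(read_gz("train-labels-idx1-ubyte.gz"))
```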
With sklearn, you don't need to download anything yourself for the following benchmark datasets; each can be loaded with a single call.
python
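from sklearn.datasets import load_iris, load_diabetes, load_digits, load_linnerud

# Note: load_boston() was also available originally, but it was removed in scikit-learn 1.2.
iris = load_iris()
diabetes = load_diabetes()
digits = load_digits()  # optionally load_digits(n_class=...) to load fewer classes
linnerud = load_linnerud()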
Beyond these, sklearn could also download data from mldata.org, a repository of machine learning datasets, and MNIST was available there too. The data is large, so the first download takes some time, but once downloaded it is cached in data_home, so you don't have to wait as long the second time. Note that mldata.org is no longer available and fetch_mldata was removed in scikit-learn 0.22; fetch_openml is its replacement.
python
from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original', data_home=custom_data_home)
Very easy to use.
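Since fetch_mldata has been removed from current scikit-learn releases, a rough modern equivalent is fetch_openml with the "mnist_784" dataset. The sketch below assumes scikit-learn 0.22 or later and network access on first use; `load_mnist` and `split_train_test` are hypothetical helper names, and the 60,000/10,000 split assumes the OpenML copy keeps the conventional ordering (training images first).

```python
def split_train_test(X, y, n_train=60000):
    """Split into the conventional first-n_train / remaining parts."""
    return (X[:n_train], y[:n_train]), (X[n_train:], y[n_train:])


def load_mnist(data_home=None):
    """Download (or load from the data_home cache) MNIST via OpenML."""
    # Imported lazily so the module can be imported without scikit-learn.
    from sklearn.datasets import fetch_openml

    mnist = fetch_openml("mnist_784", version=1, as_frame=False, data_home=data_home)
    X = mnist.data.astype("uint8")    # shape (70000, 784), values 0..255
    y = mnist.target.astype("int64")  # targets are returned as strings
    return split_train_test(X, y)


# Typical use (downloads the data on the first call):
# (X_train, y_train), (X_test, y_test) = load_mnist()
```

As with fetch_mldata, the downloaded data is cached under data_home, so later calls are much faster.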
Libraries such as TensorFlow and Theano seem to have their own functions for downloading MNIST. If you look at their sample code, you can see how to use them. Chainer seems to use sklearn.