What I want to do

I am collecting sample training and test data for machine learning.

I want to download MNIST and find out what is inside.
To do that, I will implement the conversion from the ubyte files to PNG in Python.

Prerequisites

You have basic knowledge of numpy and PIL in Python.
You have already set up an environment in which they run.

What is MNIST?

Image data of handwritten digits from 0 to 9. It is used for machine learning tasks such as "identify and classify handwritten digits with AI". You can download it for free from http://yann.lecun.com/exdb/mnist/.

File organization

train-images-idx3-ubyte.gz: Image data for training
train-labels-idx1-ubyte.gz: Label data for training
t10k-images-idx3-ubyte.gz: Image data for testing
t10k-labels-idx1-ubyte.gz: Label data for testing

The contents of the file

When you decompress a gz file, you get a binary file like the one below.

t10k-images.idx3-ubyte

Even though it is image data, it is not in a format like .jpg, so it cannot be previewed as is.

How to convert to an image and preview it

For example, if you write Python code that outputs the data as PNG with numpy and PIL, you can view it as an ordinary image file.
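As a minimal sketch of that idea, assuming numpy and Pillow are installed, the following builds a small grayscale array and encodes it as PNG in memory. The gradient array and the helper name array_to_png_bytes are made up for illustration; they are not MNIST data.

```python
import io

import numpy as np
from PIL import Image


def array_to_png_bytes(pixels):
    """Encode a 2-D uint8 array as a grayscale PNG and return the PNG bytes."""
    img = Image.fromarray(pixels)  # a 2-D uint8 array becomes an 8-bit grayscale image
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return buf.getvalue()


# Illustrative 28x28 image: every row is a gradient from dark (0) to bright (255).
gradient = np.tile(np.linspace(0, 255, 28).astype("uint8"), (28, 1))
png_bytes = array_to_png_bytes(gradient)
```

Writing png_bytes to a file opened in 'wb' mode gives a PNG that any image viewer can open. The rest of this article does the same thing with the pixel values read from the MNIST file.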

Advance preparation

If wget and gzip (which provides the gunzip command used below) are not installed yet, install them. For Ubuntu:

apt-get install -y wget gzip

File download

First, download the gz files with wget.

wget http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz

wget http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz

Then decompress them.

gunzip train-images-idx3-ubyte.gz

gunzip train-labels-idx1-ubyte.gz
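If you would rather stay in Python, the same decompression can probably be done with the standard gzip module. This is only a sketch: the helper name gunzip_file is hypothetical, and unlike the gunzip command it leaves the .gz file in place.

```python
import gzip
import shutil


def gunzip_file(src_path, dst_path):
    """Decompress the gzip file at src_path into dst_path, keeping the original."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
```

For example, gunzip_file('train-images-idx3-ubyte.gz', 'train-images-idx3-ubyte') should produce the same file as the first gunzip command above.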

Listing the directory (for example with ls -l) now shows the decompressed files:

-rw-r--r--. 1 root root 47040016 Jul 21  2000 train-images-idx3-ubyte
-rw-r--r--. 1 root root    60008 Jul 21  2000 train-labels-idx1-ubyte

Next, write the Python code.
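The two file sizes in the listing can be sanity-checked with a little arithmetic, using the header layout described in the Commentary and label sections of this article (a 16-byte header for images, an 8-byte header for labels, then one byte per pixel or per label). The constant names below are chosen only for this sketch.

```python
HEADER_IMAGES = 4 * 4  # magic number, image count, rows, columns (4 bytes each)
HEADER_LABELS = 2 * 4  # magic number, item count (4 bytes each)
NUM_ITEMS = 60000
ROWS, COLS = 28, 28

expected_images_size = HEADER_IMAGES + NUM_ITEMS * ROWS * COLS
expected_labels_size = HEADER_LABELS + NUM_ITEMS

print(expected_images_size)  # 47040016
print(expected_labels_size)  # 60008
```

Both numbers match the listing, which suggests the download and decompression completed without truncation.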

Implementation (converting Images to PNG)

vi test.py

import struct

import numpy as np
from PIL import Image

# Open the decompressed training image file in binary mode.
trainImagesFile = open('./train-images-idx3-ubyte', 'rb')

f = trainImagesFile

# Header: four big-endian 32-bit integers, read 4 bytes at a time.
magic_number = f.read(4)
magic_number = struct.unpack('>i', magic_number)[0]

number_of_images = f.read(4)
number_of_images = struct.unpack('>i', number_of_images)[0]

number_of_rows = f.read(4)
number_of_rows = struct.unpack('>i', number_of_rows)[0]

number_of_columns = f.read(4)
number_of_columns = struct.unpack('>i', number_of_columns)[0]

bytes_per_image = number_of_rows * number_of_columns

# Read the first image and unpack it into a tuple of unsigned bytes.
raw_img = f.read(bytes_per_image)
fmt = '%dB' % bytes_per_image
lin_img = struct.unpack(fmt, raw_img)
np_ary = np.asarray(lin_img).astype('uint8')
np_ary = np.reshape(np_ary, (number_of_rows, number_of_columns), order='C')

# Save the array as a grayscale PNG.
pil_img = Image.fromarray(np_ary)
pil_img.save("output.png")

Run

python test.py

Output result

output.png

Commentary

The structure of the training image file, train-images-idx3-ubyte, is as follows (quoted from http://yann.lecun.com/exdb/mnist/).

TRAINING SET IMAGE FILE (train-images-idx3-ubyte):
[offset] [type]          [value]          [description]
0000     32 bit integer  0x00000803(2051) magic number
0004     32 bit integer  60000            number of images
0008     32 bit integer  28               number of rows
0012     32 bit integer  28               number of columns
0016     unsigned byte   ??               pixel
0017     unsigned byte   ??               pixel
........
xxxx     unsigned byte   ??               pixel

Following this layout, the header is read sequentially, 4 bytes at a time. Each result below is the value obtained after unpacking those 4 bytes with struct.unpack('>i', ...), where '>' means big-endian (MSB first).

magic_number = f.read(4)

The result is 2051.

number_of_images = f.read(4)

The result is 60000.

number_of_rows = f.read(4)

The result is 28.

number_of_columns = f.read(4)

The result is 28.
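The four reads can also be collapsed into a single struct.unpack call over the first 16 bytes. The following is a sketch rather than the code used in test.py; the function name parse_image_header is hypothetical, and the sample header is built by hand instead of being read from the real file.

```python
import struct


def parse_image_header(header):
    """Unpack the 16-byte header of an IDX3 image file.

    Returns (magic_number, number_of_images, number_of_rows, number_of_columns).
    """
    # '>' = big-endian (MSB first), '4i' = four 32-bit signed integers.
    return struct.unpack(">4i", header[:16])


# Synthetic header carrying the values documented for train-images-idx3-ubyte.
sample_header = struct.pack(">4i", 2051, 60000, 28, 28)
print(parse_image_header(sample_header))  # (2051, 60000, 28, 28)
```

With the real file, passing f.read(16) right after opening it should return the same tuple.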

If you want to check the values yourself, print them:

print('--------------------')
print('magic_number')
print(magic_number)
print('--------------------')
print('number_of_images')
print(number_of_images)
print('--------------------')
print('number_of_rows')
print(number_of_rows)
print('--------------------')
print('number_of_columns')
print(number_of_columns)

The output is as follows.

--------------------
magic_number
2051
--------------------
number_of_images
60000
--------------------
number_of_rows
28
--------------------
number_of_columns
28
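The magic number itself carries information. According to the IDX format description on the MNIST page, its third byte encodes the element type and its fourth byte the number of dimensions. The sketch below decodes both; the names IDX_TYPES and describe_magic are made up for this example.

```python
import struct

# Element-type codes as listed in the IDX format description on the MNIST page.
IDX_TYPES = {
    0x08: "unsigned byte",
    0x09: "signed byte",
    0x0B: "short",
    0x0C: "int",
    0x0D: "float",
    0x0E: "double",
}


def describe_magic(magic_number):
    """Split an IDX magic number into (element type name, number of dimensions)."""
    raw = struct.pack(">i", magic_number)  # back to the 4 bytes stored in the file
    type_code, ndim = raw[2], raw[3]       # the first two bytes are always zero
    return IDX_TYPES.get(type_code, "unknown"), ndim


print(describe_magic(2051))  # ('unsigned byte', 3)
print(describe_magic(2049))  # ('unsigned byte', 1)
```

So 2051 (0x00000803) describes a 3-dimensional array of unsigned bytes (images x rows x columns), and 2049 (0x00000801), the magic number of the label file, describes a 1-dimensional one.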

The table also shows that the pixel data starts at offset 16:

[offset] [type]          [value]          [description]
0016     unsigned byte   ??               pixel

After the four header reads the file position is already at offset 16, so the first image can be read as

bytes_per_image = number_of_rows * number_of_columns
raw_img = f.read(bytes_per_image)

After that, I load the bytes into a numpy array and save it in PNG format.
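As an alternative to unpacking with struct, numpy can interpret the raw bytes directly with np.frombuffer. This is a sketch of that swapped-in technique, not the method used in test.py; decode_image is a hypothetical helper, and the 2x3 sample is synthetic.

```python
import struct

import numpy as np


def decode_image(raw_img, rows, cols):
    """Interpret rows*cols raw bytes as a (rows, cols) uint8 array."""
    return np.frombuffer(raw_img, dtype=np.uint8).reshape(rows, cols)


# Synthetic 2x3 "image" used to compare both approaches.
raw = bytes([0, 50, 100, 150, 200, 255])
via_struct = np.asarray(struct.unpack("6B", raw)).astype("uint8").reshape(2, 3)
via_frombuffer = decode_image(raw, 2, 3)
print(np.array_equal(via_struct, via_frombuffer))  # True
```

Note that an array created by np.frombuffer from a bytes object is read-only; call .copy() on it if the pixels need to be modified.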

Output png continuously by loop processing

If you put the PNG output step in a loop as shown below, you can convert images one after another. This example outputs 10 images; change the argument of range(10) to output as many as you like.

for num in range(10):
    # Each pass reads the next image, because the file position keeps advancing.
    raw_img = f.read(bytes_per_image)
    fmt = '%dB' % bytes_per_image
    lin_img = struct.unpack(fmt, raw_img)
    np_ary = np.asarray(lin_img).astype('uint8')
    np_ary = np.reshape(np_ary, (28, 28), order='C')
    pil_img = Image.fromarray(np_ary)
    pil_img.save("output" + str(num) + ".png")

This produces the files output0.png through output9.png.
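The loop reads images in file order. Because every image has the same size, a specific image can also be reached directly by seeking to 16 + index * rows * columns. The sketch below illustrates this with a tiny synthetic file; read_image_bytes and HEADER_SIZE are names introduced only for this example.

```python
import io
import struct

HEADER_SIZE = 16  # magic number, image count, rows, columns


def read_image_bytes(f, index, rows=28, cols=28):
    """Return the raw bytes of image number `index` (0-based) from an open IDX3 file."""
    bytes_per_image = rows * cols
    f.seek(HEADER_SIZE + index * bytes_per_image)
    raw_img = f.read(bytes_per_image)
    if len(raw_img) != bytes_per_image:
        raise IndexError("image index %d is past the end of the file" % index)
    return raw_img


# Synthetic file: three 2x2 images whose pixels all equal the image index.
fake = io.BytesIO(struct.pack(">4i", 2051, 3, 2, 2) + bytes([0] * 4 + [1] * 4 + [2] * 4))
print(read_image_bytes(fake, 2, rows=2, cols=2))  # b'\x02\x02\x02\x02'
```

With the real file object f from test.py, read_image_bytes(f, 100) should return the 784 bytes of the 101st training image, which can then be reshaped and saved exactly as in the loop.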

Comparison of numpy array and png

When np_ary, the numpy array, is displayed with print(), the array data looks like this.

 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   3  18  18  18 126 136 175  26 166 255 247 127   0   0   0   0]
 [  0   0   0   0   0   0   0   0  30  36  94 154 170 253 253 253 253 253 225 172 253 242 195  64   0   0   0   0]
 [  0   0   0   0   0   0   0  49 238 253 253 253 253 253 253 253 253 251  93  82  82  56  39   0   0   0   0   0]
 [  0   0   0   0   0   0   0  18 219 253 253 253 253 253 198 182 247 241   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0  80 156 107 253 253 205  11   0  43 154   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0  14   1 154 253  90   0   0   0   0   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0 139 253 190   2   0   0   0   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0  11 190 253  70   0   0   0   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0  35 241 225 160 108   1   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0  81 240 253 253 119  25   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0  45 186 253 253 150  27   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0  16  93 252 253 187   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 249 253 249  64   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0  46 130 183 253 253 207   2   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0  39 148 229 253 253 253 250 182   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0  24 114 221 253 253 253 253 201  78   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0  23  66 213 253 253 253 253 198  81   2   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0  18 171 219 253 253 253 253 195  80   9   0   0   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0  55 172 226 253 253 253 253 244 133  11   0   0   0   0   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0 136 253 253 253 212 135 132  16   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0]

Each element of the array corresponds to one pixel of the image: its position in the array is the pixel's position, and its value (0 to 255) is the pixel's grayscale intensity.
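To make the correspondence between the numbers and the picture easier to see, the array can be rendered as text. This is a rough sketch: to_ascii is a hypothetical helper, and the threshold of 128 is an arbitrary choice, not something defined by MNIST.

```python
def to_ascii(pixels, threshold=128):
    """Render a 2-D sequence of 0-255 values as text: '#' for bright pixels, '.' otherwise."""
    return "\n".join(
        "".join("#" if value >= threshold else "." for value in row)
        for row in pixels
    )


# Tiny hand-made example; pass np_ary (or np_ary.tolist()) to render a real digit.
sample = [
    [0, 255, 0],
    [0, 253, 0],
    [0, 200, 30],
]
print(to_ascii(sample))
# .#.
# .#.
# .#.
```

Applied to the array printed above, the '#' characters should trace the same stroke that appears in the PNG.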

About label data

The implementation above converts the image data (Images) to PNG. In addition, the label data (Labels) also needs to be checked. The implementation for that is as follows.

import struct

# Open the decompressed training label file in binary mode.
trainLabelsFile = open('./train-labels-idx1-ubyte', 'rb')

f = trainLabelsFile

# Header: two big-endian 32-bit integers.
magic_number = f.read(4)
magic_number = struct.unpack('>i', magic_number)[0]

number_of_images = f.read(4)
number_of_images = struct.unpack('>i', number_of_images)[0]

print("--------------------")
print("magic_number")
print(magic_number)
print("--------------------")
print("number_of_image")
print(number_of_images)
print("--------------------")

# Each label is a single unsigned byte; read the first one.
label_byte = f.read(1)
label_int = int.from_bytes(label_byte, byteorder='big')
print(label_int)

Output result

--------------------
magic_number
2049
--------------------
number_of_image
60000
--------------------
5

The structure of the label data is as follows.

train-labels-idx1-ubyte

TRAINING SET LABEL FILE (train-labels-idx1-ubyte):
[offset] [type]          [value]          [description]
0000     32 bit integer  0x00000801(2049) magic number (MSB first)
0004     32 bit integer  60000            number of items
0008     unsigned byte   ??               label
0009     unsigned byte   ??               label
........
xxxx     unsigned byte   ??               label
The labels values are 0 to 9.

In other words, reading one byte at a time from offset 8 onward yields the labels. Implemented as a loop, it looks like this.

for num in range(10):
    label_byte = f.read(1)
    label_int = int.from_bytes(label_byte, byteorder='big')
    print(label_int)
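Since every label is a single byte, the whole label file can also be parsed in one step. The following sketch assumes the file contents have already been read into memory; parse_labels is a hypothetical helper, and the sample data is built by hand.

```python
import struct


def parse_labels(data):
    """Parse the full contents of an IDX1 label file and return the labels as a list of ints."""
    magic_number, number_of_items = struct.unpack(">2i", data[:8])
    if magic_number != 2049:
        raise ValueError("not an IDX1 unsigned-byte file: magic number %d" % magic_number)
    # Iterating over a bytes object yields ints, so no per-label conversion is needed.
    return list(data[8:8 + number_of_items])


# Synthetic label file holding three labels.
sample = struct.pack(">2i", 2049, 3) + bytes([5, 0, 4])
print(parse_labels(sample))  # [5, 0, 4]
```

For the real file, parse_labels(open('./train-labels-idx1-ubyte', 'rb').read()) should return a list of 60000 integers between 0 and 9.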

The output result is below.

Compare these labels with the PNG images output from Images.

Each label correctly states which digit the corresponding PNG image shows.
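That comparison can be automated by walking both files in step, since the i-th label belongs to the i-th image. The sketch below pairs them up using synthetic data; iter_labeled_images is a name invented for this example.

```python
import struct


def iter_labeled_images(image_data, label_data):
    """Yield (label, raw_image_bytes) pairs from the full contents of both files."""
    _, n_images, rows, cols = struct.unpack(">4i", image_data[:16])
    _, n_labels = struct.unpack(">2i", label_data[:8])
    if n_images != n_labels:
        raise ValueError("image count %d != label count %d" % (n_images, n_labels))
    size = rows * cols
    for i in range(n_images):
        start = 16 + i * size
        yield label_data[8 + i], image_data[start:start + size]


# Synthetic pair of files: two 1x2 images labelled 7 and 3.
images = struct.pack(">4i", 2051, 2, 1, 2) + bytes([10, 20, 30, 40])
labels = struct.pack(">2i", 2049, 2) + bytes([7, 3])
print([(label, list(img)) for label, img in iter_labeled_images(images, labels)])
# [(7, [10, 20]), (3, [30, 40])]
```

One possible use is to put the label into the output file name (for example a hypothetical pattern such as output0_label5.png), so that each PNG can be checked against its label at a glance.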
