Overview

It became necessary to train using a part of MNIST instead of the entire MNIST dataset. Therefore, I created a program that randomly extracts n images from MNIST Training data of 60,000, divides them into folders for each class, and saves the images.

Execution environment

Google Colaboratory PyTorch 1.6.0

Implementation

Save MNIST in image format

Download the MNIST dataset and save it in image format for random extraction from the Train dataset. I referred to this site. Try using Image Folder with PyTorch

First, import the required modules

import os
from PIL import Image
from torchvision.datasets import MNIST
import shutil
import glob
from pprint import pprint
import random
from pathlib import Path
from tqdm import tqdm

If you do not have the required module, install it with pip or conda as appropriate.

Then download MNIST.

mnist_data = MNIST(root='./', train=True, transform=None, download=True)

You may get a User Warning when you download mnist, but don't worry because we are not learning with the downloaded mnist this time.

Save the MNIST image in PNG format from the downloaded MNIST binary file.

def makeMnistPng(image_dsets):
    for idx in tqdm(range(10)):
        print("Making image file for index {}".format(idx))
        num_img = 0
        dir_path = './mnist_all/'
        if not os.path.exists(dir_path):
            os.makedirs(dir_path)
        for image, label in image_dsets:
           if label == idx:
                filename = dir_path +'/mnist_'+ str(idx) + '-' + str(num_img) + '.png'
                if not os.path.exists(filename):
                    image.save(filename)
                num_img += 1
    print('Success to make MNIST PNG image files. index={}'.format(idx))

Execute the function.

makeMnistPng(mnist_data)

This saves all 600 million mnist images under mnist_all. If you want to save images for each class, please do as follows.

def makeMnistPng(image_dsets):
    for idx in tqdm(range(10)):
        print("Making image file for index {}".format(idx))
        num_img = 0
        dir_path = './MNIST_PNG/' + str(idx)
        if not os.path.exists(dir_path):
            os.makedirs(dir_path)
        for image, label in image_dsets:
            if label == idx:
                filename = dir_path +'/' + 'mnist_'+ str(idx) + '_' + str(num_img) + '.png'
                if not os.path.exists(filename):
                    image.save(filename)
                num_img += 1
    print('Success to make MNIST PNG image files. index={}'.format(idx))

Randomly sample from files in the directory

Since I was able to drop all the data of mnist into one directory, I will randomly sample n images from there and copy them to another directory. The article that I used as a reference (used almost as it is) is here

Class definition


class FileControler(object):
    def get_file_path(self, input_dir, pattern):
        #Get file path
        #Create a path object by specifying a directory
        path_obj = Path(input_dir)
        #Match files in glob format
        files_path = path_obj.glob(pattern)
        #Posix conversion to treat as a character string
        files_path_posix = [file_path.as_posix() for file_path in files_path]
        return files_path_posix
    
    def random_sampling(self, files_path, sample_num, output_dir, fix_seed=True) -> None:
        #Random sampling
        #Pin Seed to sample the same file every time
        if fix_seed is True:
            random.seed(0)
        #Specify the file group path and the number of samples
        files_path_sampled = random.sample(files_path, sample_num)
        #Create if there is no output directory
        os.makedirs(output_dir, exist_ok=True)
        #copy
        for file_path in files_path_sampled:
            shutil.copy(file_path, output_dir)

Instance creation

file_controler =FileControler()

Directory settings

Set the sampling source directory and the directory to copy the sampled files.

all_file_dir = './mnist_all/'
sampled_dir = './mnist_sampled/'

Get the path of all files

pattern = '*.png'
files_path = file_controler.get_file_path(all_file_dir, pattern)

print(len(files_path))
# 60000

n sampling

sample_num = 100
file_controler.random_sampling(files_path, sample_num, sampled_dir)

sampled_files_path = file_controler.get_file_path(sampled_dir, pattern)
print(len(sampled_files_path))
# 100

With this, n (100 this time) were randomly sampled from mnist 60000.

Classification

We will divide the sampled images into classes so that they can be used as a machine learning dataset.

First, get all the file names in the sampled directory in list format.

files = glob.glob("./mnist_sampled/*")

Use the in operator to determine the substring of the file name list and divide it into folders for each class.

for i in range(10):
    os.makedirs(sampled_dir+str(i), exist_ok=True)
    for x in files:
        if '_' + str(i) in x:
            shutil.move(x, sampled_dir + str(i))

The sampled directory has such a directory structure.

./mnist_sampled
├── 0
├── 1
├── 2
├── 3
├── 4
├── 5
├── 6
├── 7
├── 8
└── 9

Now you can randomly sample the mnist images and classify them to create a dataset.

Randomly sample MNIST data to create a dataset