Data supply tricks using deques in machine learning

When a machine learning task involves a large number of image files, a common approach is to list the file names and read the images sequentially during training. However, unless the number of samples is an exact multiple of the mini-batch size, a remainder is left over toward the end of each pass through the data, and handling it tends to be complicated.

For example, if the number of data samples is num = 100 and the mini-batch size is batch_size = 30, the following options are conceivable:

1. Run the mini-batch 3 times and leave the remaining 10 samples unused.
2. Shrink the fourth mini-batch to fit the remaining samples (batch_size = 10).
3. Keep the fourth mini-batch at full size and fill the 20 missing samples by reusing samples that have already been used.

If the number of samples is large, option 1 is fine, but if you want to make full use of the training data, you will want option 2 or 3.
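The arithmetic behind the three options can be sketched as follows. This is only an illustration: the helper name batch_plan and its return values are assumptions made for this example, not part of the original code.

```python
def batch_plan(num, batch_size):
    """Return the mini-batch sizes produced by each of the three options."""
    full, remainder = divmod(num, batch_size)

    # Option 1: drop the leftover samples.
    drop_last = [batch_size] * full

    # Option 2: add one smaller mini-batch for the leftover samples.
    shrink_last = drop_last + ([remainder] if remainder else [])

    # Option 3: keep the size fixed and reuse earlier samples to fill the gap.
    wrap_around = drop_last + ([batch_size] if remainder else [])
    reused = (batch_size - remainder) if remainder else 0

    return drop_last, shrink_last, wrap_around, reused


drop_last, shrink_last, wrap_around, reused = batch_plan(100, 30)
print(drop_last)      # [30, 30, 30]
print(shrink_last)    # [30, 30, 30, 10]
print(wrap_around)    # [30, 30, 30, 30]
print(reused)         # 20
```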

Here, we implement option 3 with a deque.

What is a deque

The following explanation is quoted from Introducing Python 3 (see References).

A deque (pronounced "deck") is a double-ended queue, which has the functions of both a stack and a queue. It is useful when you want to be able to add or remove elements at either end of a sequence.

This explanation is illustrated below.

Here, deque.rotate() is used to reuse data samples: each time samples are taken out, the deque is rotated so that the used samples move to the opposite end, and this is repeated for every request.
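As a small illustration of the operations mentioned above, the following sketch adds and removes items at both ends and then rotates a deque. The variable names and the integer contents are chosen only for this example.

```python
from collections import deque

ends = deque([1, 2, 3])
ends.appendleft(0)       # add at the left end
ends.append(4)           # add at the right end
print(ends)              # deque([0, 1, 2, 3, 4])
print(ends.popleft())    # 0 (queue-like: remove from the left)
print(ends.pop())        # 4 (stack-like: remove from the right)

dq = deque([0, 1, 2, 3, 4])
dq.rotate(1)             # positive n: the last item wraps around to the front
print(dq)                # deque([4, 0, 1, 2, 3])
dq.rotate(-2)            # negative n: the first two items wrap to the back
print(dq)                # deque([1, 2, 3, 4, 0])
```

No item is lost by rotate(); only the starting position changes, which is what makes it suitable for cycling through samples.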

Implementation

Suppose the data files are laid out as follows.

$ ls deque_ex/*.jpg
deque_ex/img_0.jpg  deque_ex/img_3.jpg  deque_ex/img_6.jpg  deque_ex/img_9.jpg
deque_ex/img_1.jpg  deque_ex/img_4.jpg  deque_ex/img_7.jpg
deque_ex/img_2.jpg  deque_ex/img_5.jpg  deque_ex/img_8.jpg

First, build a deque of the file names to be handled.

import glob
from collections import deque
import numpy as np

def mk_list():
    fname_list = glob.glob('*.jpg')
    sorted_fn = sorted(fname_list)
    deq_fname = deque()
    deq_fname.extend(sorted_fn)   # extend() adds each name as an element;
                                  # append() would add the list as one item.
    
    return deq_fname

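The key point is to use deque.extend() rather than deque.append() when adding the list to the deque: extend() adds each file name as a separate element, whereas append() would add the whole list as a single element. When run inside the deque_ex directory, mk_list() returns the following.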

>>>
deque(['img_0.jpg',
       'img_1.jpg',
       'img_2.jpg',
       'img_3.jpg',
       'img_4.jpg',
       'img_5.jpg',
       'img_6.jpg',
       'img_7.jpg',
       'img_8.jpg',
       'img_9.jpg'])
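The difference between extend() and append() can be checked with a short sketch. The three file names below are sample values used only for this illustration.

```python
from collections import deque

names = ['img_0.jpg', 'img_1.jpg', 'img_2.jpg']

extended = deque()
extended.extend(names)    # adds each file name as its own element
print(len(extended))      # 3

appended = deque()
appended.append(names)    # adds the whole list as a single element
print(len(appended))      # 1
print(appended[0])        # ['img_0.jpg', 'img_1.jpg', 'img_2.jpg']

# Passing the list to the constructor gives the same result as extend().
print(deque(names) == extended)   # True
```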

The function that returns data, given the data deque and the requested number of samples, is as follows. It uses a list slice and deque.rotate().

def feed_fn_ver0(dq, num):
    feed = list(dq)[-num:]
    dq.rotate(num)
    
    return feed

Taking out 3 samples 5 times with this function gives the following.

0: ['img_7.jpg', 'img_8.jpg', 'img_9.jpg']
1: ['img_4.jpg', 'img_5.jpg', 'img_6.jpg']
2: ['img_1.jpg', 'img_2.jpg', 'img_3.jpg']
3: ['img_8.jpg', 'img_9.jpg', 'img_0.jpg']
4: ['img_5.jpg', 'img_6.jpg', 'img_7.jpg']

We were able to retrieve 3 samples at a time from the end of the deque. This is fine for a machine learning process that does not care about order, but taking data "from the end" feels a little unnatural, so the next version takes data "from the beginning" and also checks the requested length.

def feed_fn_ver1(dq, num):
    '''
      dq  : data source (deque)
      num : request size (int)
    '''
    # check length
    assert num <= len(dq)
   
    feed = list(dq)[:num]
    dq.rotate(-num)

    return feed

my_list = mk_list()
for i in range(5):
    print(' Feed [', i, ']: ', feed_fn_ver1(my_list, 3))
    
>>>
 Feed [ 0 ]:  ['img_0.jpg', 'img_1.jpg', 'img_2.jpg']
 Feed [ 1 ]:  ['img_3.jpg', 'img_4.jpg', 'img_5.jpg']
 Feed [ 2 ]:  ['img_6.jpg', 'img_7.jpg', 'img_8.jpg']
 Feed [ 3 ]:  ['img_9.jpg', 'img_0.jpg', 'img_1.jpg']
 Feed [ 4 ]:  ['img_2.jpg', 'img_3.jpg', 'img_4.jpg']
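feed_fn_ver1() copies the whole deque into a list on every call, which may matter when the file list is long. One possible variant, sketched here under the assumption that only the first num items are needed, uses itertools.islice to copy just those items. The name feed_fn_islice is hypothetical, and the sketch raises ValueError instead of using assert because assert statements are skipped when Python runs with the -O option.

```python
from collections import deque
from itertools import islice


def feed_fn_islice(dq, num):
    """Hypothetical variant of feed_fn_ver1 that copies only num items."""
    if num > len(dq):
        raise ValueError('request size exceeds the number of samples')

    feed = list(islice(dq, num))   # first num items, without copying the rest
    dq.rotate(-num)                # move the used items to the back

    return feed


dq = deque('img_%d.jpg' % i for i in range(10))
for i in range(4):
    print(feed_fn_islice(dq, 3))
```

The batches should come out in the same order as with feed_fn_ver1(), including the wrap-around in the fourth request.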

It worked fine. The version with randomly shuffled data is as follows.

def mk_list_shuffle():
    fname_list = glob.glob('*.jpg')
    np_list_fn = np.array(fname_list)
    np.random.shuffle(np_list_fn)
    deq_fname = deque()
    deq_fname.extend(np_list_fn.tolist())   # tolist() yields plain Python str
    
    return deq_fname

my_list = mk_list_shuffle()
for i in range(5):
    print(' Feed [', i, ']: ', feed_fn_ver1(my_list, 3))

>>>
 Feed [ 0 ]:  ['img_9.jpg', 'img_7.jpg', 'img_6.jpg']
 Feed [ 1 ]:  ['img_1.jpg', 'img_8.jpg', 'img_3.jpg']
 Feed [ 2 ]:  ['img_4.jpg', 'img_0.jpg', 'img_2.jpg']
 Feed [ 3 ]:  ['img_5.jpg', 'img_9.jpg', 'img_7.jpg']
 Feed [ 4 ]:  ['img_6.jpg', 'img_1.jpg', 'img_8.jpg']

It is a little hard to see, but on close inspection the data is supplied in a proper cycle: Feed 3 wraps around to img_9 and img_7, the first samples of Feed 0. Using this function (feed_fn_ver1()), the machine learning training process should be simple to write.
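As a rough sketch of how such a training loop might look, the example below runs one pass over 100 samples with a mini-batch size of 30, matching the numbers in the introduction. The function train_step is a hypothetical placeholder for loading the images and updating a model, and the file names are generated rather than read from disk.

```python
from collections import deque


def feed_fn_ver1(dq, num):
    # Same function as above, repeated so this sketch runs on its own.
    assert num <= len(dq)
    feed = list(dq)[:num]
    dq.rotate(-num)
    return feed


def train_step(batch):
    # Hypothetical placeholder: load the images and update the model here.
    return len(batch)


num, batch_size = 100, 30
files = deque('img_%d.jpg' % i for i in range(num))

steps_per_epoch = -(-num // batch_size)   # ceiling division: 4 steps

seen = []
for step in range(steps_per_epoch):
    batch = feed_fn_ver1(files, batch_size)   # always a full mini-batch
    train_step(batch)
    seen.extend(batch)

print(len(seen))        # 120: 100 samples plus 20 reused ones
print(seen[100:102])    # ['img_0.jpg', 'img_1.jpg']
```

Because the deque keeps its rotated position, a second pass in this sketch would start from img_20 rather than img_0, so the reused samples differ from pass to pass.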

(The above code was tested with Python 2.7.11 and Python 3.5.1.)

References (web site)

- Introducing Python 3 (O'Reilly Japan) http://www.oreilly.co.jp/books/9784873117386/

- Python Documentation - collections https://docs.python.org/2/library/collections.html