For machine learning, it is desirable for every class to have the same number of samples. In practice, however, data is rarely that clean, and datasets with different numbers of samples per class are common.

I implemented a process in Python that aligns the number of samples per class based on the label data, so I am leaving a note here.

What I want to do

Given a data array and its label array like the ones below, I want to:

Align the number of samples per class in the label array
Remove the corresponding elements from the data array so that it stays in step with the labels

# Data array
data = np.array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
# Label array
label = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])

###############
# Data processing...
###############

>>> print(data)
[10 11 12 14 15 16]
>>> print(label)
[0 0 1 1 2 2]
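
Before aligning anything, it helps to see how unbalanced the labels are. The following is a minimal sketch that counts the samples per class with np.unique, using the same label array as above. The variable names classes, counts, and target_count are my own choices for illustration.

```python
import numpy as np

label = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])

# np.unique returns the sorted class values and, with return_counts=True,
# how many times each of them occurs
classes, counts = np.unique(label, return_counts=True)

for c, n in zip(classes, counts):
    print("class %d: %d samples" % (c, n))

# The smallest class determines how many samples every class keeps
target_count = counts.min()
print("samples to keep per class:", target_count)
```

Here class 0 has 2 samples, class 1 has 3, and class 2 has 5, so each class should end up with 2 samples.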

Code

Details are in the comments. In short, for each class that has more samples than the smallest class, the code does the following:

Get the indexes of the elements that belong to that class
Use random.sample() to randomly pick, from those indexes, as many as need to be deleted
Delete the elements at the picked indexes from both data and label

import numpy as np
import random

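# Data array
data = np.array(range(10,20))
print("data:", data)
# Label array (class labels are assumed to be the integers 0, 1, 2, ...)
label = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])
print("label:", label)
# Number of samples in each class (np.array([]) is a float array, so the counts are stored as floats)
sample_nums = np.array([])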


print("\nCalculate the number of samples for each class")
for i in range(max(label)+1):
    # Number of samples in class i
    sample_num = np.sum(label == i)
    # Append to the array of per-class sample counts
    sample_nums = np.append(sample_nums, sample_num)
print("sample_nums:", sample_nums)

# Smallest number of samples among all classes
min_num = np.min(sample_nums)
print("min_num:", min_num)


print("\nAlign the number of samples for each class")
for i in range(len(sample_nums)):

    # Difference between the number of samples in this class and the minimum number of samples
    diff_num = int(sample_nums[i] - min_num)
    print("Class %d number of deleted samples: %d (%0.2f％)" % (i, diff_num, (diff_num/sample_nums[i])*100))

    # Skip classes that need no deletion
    if diff_num == 0:
        continue

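    # Indexes of the elements that belong to class i
    # np.where returns a tuple, so take the index array at position 0 and convert it to a list
    # (recomputed from the current label on every iteration, so earlier deletions are already reflected)
    indexes = list(np.where(label == i)[0])
    print("\tindexes:", indexes)

    # Indexes to delete, chosen at random
    del_indexes = random.sample(indexes, diff_num)
    print("\tdel_indexes:", del_indexes)

    # Delete the chosen elements from both data and label
    data = np.delete(data, del_indexes)
    label = np.delete(label, del_indexes)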


print("\ndata:", data)
print("label:", label)
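
For reuse, the same idea can be wrapped in a function. The following is only a sketch of one possible variant, not the code above: it picks the samples to keep instead of the samples to delete, and it uses np.unique and NumPy's random generator instead of random.sample(). The function name balance_classes and the seed parameter are hypothetical names I introduced for illustration.

```python
import numpy as np


def balance_classes(data, label, seed=None):
    """Randomly undersample so that every class keeps as many samples as the smallest class.

    balance_classes and seed are hypothetical names used only in this sketch.
    """
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(label, return_counts=True)
    min_num = counts.min()

    keep = []
    for c in classes:
        # Indexes of the elements that belong to class c
        idx = np.flatnonzero(label == c)
        # Choose min_num of them without replacement
        keep.append(rng.choice(idx, size=min_num, replace=False))

    # Sort so that the surviving samples stay in their original order
    keep = np.sort(np.concatenate(keep))
    return data[keep], label[keep]


data = np.array(range(10, 20))
label = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])
balanced_data, balanced_label = balance_classes(data, label, seed=0)
print("data:", balanced_data)
print("label:", balanced_label)
```

Because this variant relies on np.unique, it should also work when the class labels are not the consecutive integers 0, 1, 2, ..., which the loop over range(max(label)+1) assumes. The input arrays are left unchanged and new arrays are returned.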

Execution result

data: [10 11 12 13 14 15 16 17 18 19]
label: [0 0 1 1 1 2 2 2 2 2]

Calculate the number of samples for each class
sample_nums: [ 2.  3.  5.]
min_num: 2.0

Align the number of samples for each class
Class 0 number of deleted samples: 0 (0.00％)
Class 1 number of deleted samples: 1 (33.33％)
	indexes: [2, 3, 4]
	del_indexes: [3]
Class 2 number of deleted samples: 3 (60.00％)
	indexes: [4, 5, 6, 7, 8]
	del_indexes: [7, 8, 6]

data: [10 11 12 14 15 16]
label: [0 0 1 1 2 2]
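
The del_indexes values in this output come from random.sample(), so they differ from run to run. If a repeatable result is needed, one option is to seed the random module before sampling. The following minimal sketch shows the effect; the seed value 0 is an arbitrary choice for illustration.

```python
import random

indexes = [4, 5, 6, 7, 8]

# Seeding the global generator before sampling fixes the outcome
random.seed(0)
first = random.sample(indexes, 3)

# Seeding again with the same value reproduces the same selection
random.seed(0)
second = random.sample(indexes, 3)

print(first == second)  # True: same seed, same selection
```

In the script above, this would mean calling random.seed() once before the loop that aligns the classes.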