For machine learning, it is desirable for every class to have the same number of samples. In practice, however, data is rarely that clean, and the number of samples often differs from class to class.
This time I implemented, in Python, a process that equalizes the number of samples per class based on the label data, so I am leaving a note of it here.
Given the following data array and its label array, the goal is to cut every class down to the size of the smallest class, for example:
# Data array
data = np.array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
# Label array
label = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])
###############
# Data processing...
###############
>>> data
[10 11 12 14 15 16]
>>> label
[0 0 1 1 2 2]
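Before balancing, it can help to confirm how uneven the classes actually are. The following is a minimal sketch, assuming the labels are non-negative integers as in the example above; it uses `np.bincount`, which is not part of the original script, to count the samples per class (2, 3 and 5 for this label array).

```python
import numpy as np

label = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])

# counts[i] is the number of samples whose label equals i
counts = np.bincount(label)
print("samples per class:", counts)
print("smallest class size:", counts.min())
```

The smallest class size is the number of samples every class will be reduced to.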
Details are in the comments of the code below. Simply put, for every class that has more samples than the smallest class, the code randomly picks the surplus samples and deletes them from both the data array and the label array.
import numpy as np
import random
#Data array
data = np.array(range(10,20))
print("data:", data)
#Label array
label = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])
print("label:", label)
#Number of samples for all classes
sample_nums = np.array([])
print("\n Calculate the number of samples for each class")
for i in range(max(label)+1):
    # Number of samples for each class
    sample_num = np.sum(label == i)
    # Append it to the array that tracks the sample counts
    sample_nums = np.append(sample_nums, sample_num)
print("sample_nums:", sample_nums)
#Minimum number of samples in all classes
min_num = np.min(sample_nums)
print("min_num:", min_num)
print("\n Align the number of samples for each class")
for i in range(len(sample_nums)):
    # Difference between the number of samples in the target class and the minimum number of samples
    diff_num = int(sample_nums[i] - min_num)
    # "%%" prints a literal percent sign
    print("Class %d number of deleted samples: %d (%0.2f%%)" % (i, diff_num, (diff_num/sample_nums[i])*100))
    # Skip if nothing needs to be deleted
    if diff_num == 0:
        continue
    # Indexes of the elements that belong to this class
    # np.where returns a tuple, so take its 0th element and convert it to a list
    indexes = np.where(label == i)[0].tolist()
    print("\tindexes:", indexes)
    # Indexes of the data to delete, chosen at random
    del_indexes = random.sample(indexes, diff_num)
    print("\tdel_indexes:", del_indexes)
    # Delete from data and label
    data = np.delete(data, del_indexes)
    label = np.delete(label, del_indexes)
print("\ndata:", data)
print("label:", label)
Example output:
data: [10 11 12 13 14 15 16 17 18 19]
label: [0 0 1 1 1 2 2 2 2 2]

 Calculate the number of samples for each class
sample_nums: [ 2. 3. 5.]
min_num: 2.0

 Align the number of samples for each class
Class 0 number of deleted samples: 0 (0.00%)
Class 1 number of deleted samples: 1 (33.33%)
    indexes: [2, 3, 4]
    del_indexes: [3]
Class 2 number of deleted samples: 3 (60.00%)
    indexes: [4, 5, 6, 7, 8]
    del_indexes: [7, 8, 6]

data: [10 11 12 14 15 16]
label: [0 0 1 1 2 2]
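Because the indexes to delete come from `random.sample`, the samples that survive differ from run to run. If a reproducible result is needed, one option is to seed Python's `random` module first. This is a small sketch of that idea; the seed value 0 is an arbitrary choice and is not in the original script.

```python
import random

indexes = [4, 5, 6, 7, 8]

random.seed(0)  # any fixed integer works; 0 is an arbitrary choice
first = random.sample(indexes, 3)

random.seed(0)  # re-seeding restarts the same pseudo-random sequence
second = random.sample(indexes, 3)

# Both draws pick the same three indexes
print(first == second)
```

Calling `random.seed` once at the top of the script should be enough to make the whole balancing step repeatable.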
Someone more familiar with Python could probably make this more efficient.
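As one possible direction, here is a more compact sketch. It is not the method used above: instead of choosing which samples to delete, it chooses which samples to keep, and it uses NumPy's `Generator.choice` rather than the `random` module. The function name `undersample` and its `seed` parameter are hypothetical names introduced only for this illustration.

```python
import numpy as np

def undersample(data, label, seed=None):
    """Randomly keep as many samples per class as the smallest class has."""
    rng = np.random.default_rng(seed)
    # Distinct class labels and how often each one occurs
    classes, counts = np.unique(label, return_counts=True)
    min_num = counts.min()
    # For each class, pick min_num positions at random without replacement
    keep = np.concatenate([
        rng.choice(np.flatnonzero(label == c), size=min_num, replace=False)
        for c in classes
    ])
    keep.sort()  # keep the samples in their original order
    return data[keep], label[keep]

data = np.arange(10, 20)
label = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])
new_data, new_label = undersample(data, label, seed=0)
print("data:", new_data)
print("label:", new_label)
```

This still loops over the classes, but it avoids the repeated `np.append` and `np.delete` calls, and because it relies on `np.unique` it should also work when the labels are not consecutive integers starting at 0.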