My motivation is to learn classifiers effectively from imbalanced data with ratios of 1:10,000 or more. I suspect anyone doing web-based CV analysis runs into this problem, and I am one of them.
Please see the paper for details, as it is summarized there very well.
- Algorithm-level approaches: introduce a coefficient into the model to adjust for the imbalance (i.e., adjust the cost function).
- Data-level approaches: reduce the majority data and/or increase the minority data, i.e., under-sampling and over-sampling respectively.

This time, we will mainly describe data-level approaches.
Data-level approaches are roughly divided into under-sampling, over-sampling, and hybrid methods.

- under-sampling
・Reduce the majority data
・Random under-sampling and other methods
・Random under-sampling may delete useful data ⇒ a cluster-based method keeps a representative group of data for each cluster, so useful data is not erased wholesale
- over-sampling
・Increase the minority data
・Random over-sampling and other methods
・Random over-sampling tends to cause overfitting ⇒ resolved by adding data around the existing samples (existing data with noise added) instead of duplicating them
- hybrid methods
・Do both under-sampling and over-sampling
Looking at the concrete methods, they are as follows.
- under-sampling
・Keep or drop the data/clusters that are farthest from or nearest to the minority data/clusters (for clusters, judged by the distance between centroids, etc.)
・Cluster all the data with k-means and decide how many negative samples to remove from each cluster based on its ratio of positive to negative samples ← this is the method used in this post
- over-sampling
・The method called SMOTE seems to be the de facto standard: a synthetic sample is generated by adding noise toward one of the five nearest neighbors (found with k-NN) of a minority sample
under-sampling

First of all, the under-sampling code.
```python
import numpy as np
from scipy.cluster.vq import whiten, kmeans2


def undersampling(imp_info, cv, m):
    # Clustering with kmeans2
    whitened = whiten(imp_info)               # normalization (match the variance of each axis)
    centroid, label = kmeans2(whitened, k=3)  # kmeans2 with 3 clusters

    # minority data (taken from the whitened space so it matches the majority data kept below)
    minodata = whitened[np.where(cv == 1)[0]]
    # majority data (only needed for the commented-out draw() call)
    majodata = whitened[np.where(cv == 0)[0]]

    C1 = []; C2 = []; C3 = []           # cluster members
    C1_cv = []; C2_cv = []; C3_cv = []  # corresponding labels
    for i in range(len(imp_info)):
        if label[i] == 0:
            C1.append(whitened[i]); C1_cv.append(cv[i])
        elif label[i] == 1:
            C2.append(whitened[i]); C2_cv.append(cv[i])
        elif label[i] == 2:
            C3.append(whitened[i]); C3_cv.append(cv[i])

    # Converted because the numpy format is easier to handle
    C1 = np.array(C1); C2 = np.array(C2); C3 = np.array(C3)
    C1_cv = np.array(C1_cv); C2_cv = np.array(C2_cv); C3_cv = np.array(C3_cv)

    # Number of majority samples in each cluster
    C1_Nmajo = np.sum(C1_cv == 0); C2_Nmajo = np.sum(C2_cv == 0); C3_Nmajo = np.sum(C3_cv == 0)
    # Number of minority samples in each cluster
    C1_Nmino = np.sum(C1_cv == 1); C2_Nmino = np.sum(C2_cv == 1); C3_Nmino = np.sum(C3_cv == 1)
    t_Nmino = C1_Nmino + C2_Nmino + C3_Nmino

    # The denominator could be 0, so add 1
    C1_MAperMI = float(C1_Nmajo) / (C1_Nmino + 1)
    C2_MAperMI = float(C2_Nmajo) / (C2_Nmino + 1)
    C3_MAperMI = float(C3_Nmajo) / (C3_Nmino + 1)
    t_MAperMI = C1_MAperMI + C2_MAperMI + C3_MAperMI

    # Number of majority samples to keep in each cluster
    under_C1_Nmajo = int(m * t_Nmino * C1_MAperMI / t_MAperMI)
    under_C2_Nmajo = int(m * t_Nmino * C2_MAperMI / t_MAperMI)
    under_C3_Nmajo = int(m * t_Nmino * C3_MAperMI / t_MAperMI)

    # draw(majodata, label)

    # Delete majority data cluster by cluster so majority and minority are balanced
    C1 = C1[np.where(C1_cv == 0)[0], :]
    np.random.shuffle(C1)
    C1 = C1[:under_C1_Nmajo, :]

    C2 = C2[np.where(C2_cv == 0)[0], :]
    np.random.shuffle(C2)
    C2 = C2[:under_C2_Nmajo, :]

    C3 = C3[np.where(C3_cv == 0)[0], :]
    np.random.shuffle(C3)
    C3 = C3[:under_C3_Nmajo, :]

    # Use the actually kept counts (a cluster may hold fewer majority samples than requested)
    cv_d = np.hstack((np.zeros(len(C1) + len(C2) + len(C3)), np.ones(len(minodata))))
    info = np.vstack((C1, C2, C3, minodata))
    return cv_d, info
```
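For reference, here is a minimal way one might call this function. This is only a sketch: `X` and `y` are hypothetical arrays, and `m` roughly controls how many majority samples are kept relative to the number of minority samples.

```python
import numpy as np

# Hypothetical toy data: 1,000 negative samples and 10 positive samples, 5 features each
X = np.vstack((np.random.rand(1000, 5), np.random.rand(10, 5) + 2.0))
y = np.hstack((np.zeros(1000), np.ones(10)))

# Keep roughly m x (number of minority samples) majority samples in total
y_under, X_under = undersampling(X, y, m=3)
print(X_under.shape, y_under.shape)
```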
over-sampling

Next, the over-sampling code.
```python
import numpy as np

nnk = 5  # number of nearest neighbours used by k-NN (five, as described above)


class SMOTE(object):
    def __init__(self, N):
        self.N = N  # amount of over-sampling in percent (e.g. 200 means 200%)
        self.T = 0  # number of minority samples used as seeds

    def oversampling(self, smp, cv):
        mino_idx = np.where(cv == 1)[0]
        mino_smp = smp[mino_idx, :]

        # k-NN: for every minority sample, collect the indices of its nnk nearest neighbours
        mino_nn = []
        for idx in mino_idx:
            near_dist = np.array([])
            near_idx = np.zeros(nnk, dtype=int)
            for i in range(len(smp)):
                if idx == i:
                    continue
                dist = self.dist(smp[idx, :], smp[i, :])
                if len(near_dist) < nnk:
                    # Not enough neighbours collected yet: add this one unconditionally
                    near_idx[len(near_dist)] = i
                    near_dist = np.append(near_dist, dist)
                else:
                    # Replace the current farthest neighbour if this sample is closer
                    far = np.argmax(near_dist)
                    if dist < near_dist[far]:
                        near_dist[far] = dist
                        near_idx[far] = i
            mino_nn.append(near_idx)
        return self.create_synth(smp, mino_smp, np.array(mino_nn, dtype=int))

    def dist(self, smp_1, smp_2):
        # Euclidean distance
        return np.sqrt(np.sum((smp_1 - smp_2) ** 2))

    def create_synth(self, smp, mino_smp, mino_nn):
        self.T = len(mino_smp)
        if self.N < 100:
            # For N < 100%, over-sample only a random N% subset of the minority samples
            self.T = int(self.N * 0.01 * len(mino_smp))
            self.N = 100
        self.N = int(self.N * 0.01)

        # Indices of the minority samples used as seeds for synthetic data
        rs = np.floor(np.random.uniform(size=self.T) * len(mino_smp)).astype(int)

        synth = []
        for n in range(self.N):
            for i in rs:
                nn = int(np.random.uniform() * nnk)            # pick one of the nnk neighbours
                dif = smp[mino_nn[i, nn], :] - mino_smp[i, :]  # vector toward that neighbour
                gap = np.random.uniform(size=len(mino_smp[0]))
                # np.floor keeps integer-valued features; drop it if the features are continuous
                tmp = mino_smp[i, :] + np.floor(gap * dif)
                tmp[tmp < 0] = 0
                synth.append(tmp)
        return synth
```
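And a minimal usage sketch for the class above, again with hypothetical arrays `X` and `y`; `N=200` requests 200% over-sampling, i.e. two synthetic samples per minority seed.

```python
import numpy as np

# Hypothetical toy data: 200 negative samples and 20 positive samples, 4 features each
X = np.vstack((np.random.rand(200, 4), np.random.rand(20, 4) + 1.0))
y = np.hstack((np.zeros(200), np.ones(20)))

smote = SMOTE(N=200)                        # 200% over-sampling
synth = np.array(smote.oversampling(X, y))  # synthetic minority samples

X_over = np.vstack((X, synth))
y_over = np.hstack((y, np.ones(len(synth))))
print(X_over.shape, y_over.shape)
```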
I wonder whether algorithm-level approaches are more robust to dirty data than data-level approaches. Maybe my code is just wrong... Please let me know if you spot any mistakes m(_ _)m
Since data-level approaches target batch processing and are likely to increase the amount of computation, I felt it would be more practical to make the adjustment with algorithm-level approaches. With algorithm-level approaches, you only have to increase the cost and the weight-update gradient for the minority samples, so there is almost no effect on training time (a minimal sketch of this idea is shown below). If you know of any other good ways to deal with imbalanced data, please leave a comment.
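To make the algorithm-level idea concrete, here is a minimal sketch of my own (not from the paper): a hand-rolled logistic regression trained by SGD in which the minority class gets a larger cost coefficient, so only the per-sample gradient is scaled and training time stays essentially the same. The weight `w_pos` is an assumed knob, often set near the inverse class ratio.

```python
import numpy as np

def weighted_logreg_sgd(X, y, w_pos=100.0, lr=0.1, epochs=10):
    """Logistic regression via SGD; positive (minority) samples get a larger
    cost coefficient, which simply scales their gradient contribution."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + np.exp(-(np.dot(w, xi) + b)))
            c = w_pos if yi == 1 else 1.0  # cost coefficient for the imbalance
            grad = c * (p - yi)            # weighted gradient of the log loss
            w -= lr * grad * xi
            b -= lr * grad
    return w, b

# Hypothetical imbalanced toy data: 1,000 negatives, 10 positives
X = np.vstack((np.random.randn(1000, 3), np.random.randn(10, 3) + 2.0))
y = np.hstack((np.zeros(1000), np.ones(10)))
w, b = weighted_logreg_sgd(X, y, w_pos=100.0)
```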