My motivation is to learn classifiers effectively from imbalanced data with ratios of 1:10,000 or more. I suspect anyone doing web-based CV analysis runs into this problem, and I am one of them.
Please see the paper for details, as it is summarized there very well.
- Algorithm-level approaches: introduce a coefficient into the model to adjust for the imbalance (i.e., adjust the cost function).
- Data-level approaches: reduce the majority data and/or increase the minority data, i.e., under-sampling and over-sampling respectively.

This time, we will mainly describe data-level approaches.
Data-level approaches are roughly divided into under-sampling, over-sampling, and hybrid methods.

- under-sampling
・Reduce the majority data
・Random under-sampling and other methods
・Random under-sampling may delete useful data ⇒ a cluster-based method keeps a representative group of data for each cluster, so useful data is not erased wholesale
- over-sampling
・Increase the minority data
・Random over-sampling and other methods
・Random over-sampling tends to cause overfitting ⇒ resolved by adding data around the existing samples (existing data with noise added) instead of duplicating them
- hybrid methods
・Do both under-sampling and over-sampling
Looking at the concrete methods, they are as follows.
- under-sampling
・Keep or drop the data/clusters that are farthest from or nearest to the minority data/clusters (for clusters, judged by the distance between centroids, etc.)
・Cluster all the data with k-means and decide how many negative samples to remove from each cluster based on its ratio of positive to negative samples ← this is the method used in this post
- over-sampling
・The method called SMOTE seems to be the de facto standard: a synthetic sample is generated by adding noise toward one of the five nearest neighbors (found with k-NN) of a minority sample
under-sampling

First of all, the under-sampling code.
```python
import numpy as np
from scipy.cluster.vq import whiten, kmeans2


def undersampling(imp_info, cv, m):
    # Clustering with kmeans2
    whitened = whiten(imp_info)               # normalization (match the variance of each axis)
    centroid, label = kmeans2(whitened, k=3)  # kmeans2 with 3 clusters

    # minority data (taken from the whitened space so it matches the majority data kept below)
    minodata = whitened[np.where(cv == 1)[0]]
    # majority data (only needed for the commented-out draw() call)
    majodata = whitened[np.where(cv == 0)[0]]

    C1 = []; C2 = []; C3 = []           # cluster members
    C1_cv = []; C2_cv = []; C3_cv = []  # corresponding labels
    for i in range(len(imp_info)):
        if label[i] == 0:
            C1.append(whitened[i]); C1_cv.append(cv[i])
        elif label[i] == 1:
            C2.append(whitened[i]); C2_cv.append(cv[i])
        elif label[i] == 2:
            C3.append(whitened[i]); C3_cv.append(cv[i])

    # Converted because the numpy format is easier to handle
    C1 = np.array(C1); C2 = np.array(C2); C3 = np.array(C3)
    C1_cv = np.array(C1_cv); C2_cv = np.array(C2_cv); C3_cv = np.array(C3_cv)

    # Number of majority samples in each cluster
    C1_Nmajo = np.sum(C1_cv == 0); C2_Nmajo = np.sum(C2_cv == 0); C3_Nmajo = np.sum(C3_cv == 0)
    # Number of minority samples in each cluster
    C1_Nmino = np.sum(C1_cv == 1); C2_Nmino = np.sum(C2_cv == 1); C3_Nmino = np.sum(C3_cv == 1)
    t_Nmino = C1_Nmino + C2_Nmino + C3_Nmino

    # The denominator could be 0, so add 1
    C1_MAperMI = float(C1_Nmajo) / (C1_Nmino + 1)
    C2_MAperMI = float(C2_Nmajo) / (C2_Nmino + 1)
    C3_MAperMI = float(C3_Nmajo) / (C3_Nmino + 1)
    t_MAperMI = C1_MAperMI + C2_MAperMI + C3_MAperMI

    # Number of majority samples to keep in each cluster
    under_C1_Nmajo = int(m * t_Nmino * C1_MAperMI / t_MAperMI)
    under_C2_Nmajo = int(m * t_Nmino * C2_MAperMI / t_MAperMI)
    under_C3_Nmajo = int(m * t_Nmino * C3_MAperMI / t_MAperMI)

    # draw(majodata, label)

    # Delete majority data cluster by cluster so majority and minority are balanced
    C1 = C1[np.where(C1_cv == 0)[0], :]
    np.random.shuffle(C1)
    C1 = C1[:under_C1_Nmajo, :]

    C2 = C2[np.where(C2_cv == 0)[0], :]
    np.random.shuffle(C2)
    C2 = C2[:under_C2_Nmajo, :]

    C3 = C3[np.where(C3_cv == 0)[0], :]
    np.random.shuffle(C3)
    C3 = C3[:under_C3_Nmajo, :]

    # Use the actually kept counts (a cluster may hold fewer majority samples than requested)
    cv_d = np.hstack((np.zeros(len(C1) + len(C2) + len(C3)), np.ones(len(minodata))))
    info = np.vstack((C1, C2, C3, minodata))
    return cv_d, info
```
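For reference, here is a minimal way one might call this function. This is only a sketch: `X` and `y` are hypothetical arrays, and `m` roughly controls how many majority samples are kept relative to the number of minority samples.

```python
import numpy as np

# Hypothetical toy data: 1,000 negative samples and 10 positive samples, 5 features each
X = np.vstack((np.random.rand(1000, 5), np.random.rand(10, 5) + 2.0))
y = np.hstack((np.zeros(1000), np.ones(10)))

# Keep roughly m x (number of minority samples) majority samples in total
y_under, X_under = undersampling(X, y, m=3)
print(X_under.shape, y_under.shape)
```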
over-sampling

Next, the over-sampling code.
```python
import numpy as np

nnk = 5  # number of nearest neighbours used by k-NN (five, as described above)


class SMOTE(object):
    def __init__(self, N):
        self.N = N  # amount of over-sampling in percent (e.g. 200 means 200%)
        self.T = 0  # number of minority samples used as seeds

    def oversampling(self, smp, cv):
        mino_idx = np.where(cv == 1)[0]
        mino_smp = smp[mino_idx, :]

        # k-NN: for every minority sample, collect the indices of its nnk nearest neighbours
        mino_nn = []
        for idx in mino_idx:
            near_dist = np.array([])
            near_idx = np.zeros(nnk, dtype=int)
            for i in range(len(smp)):
                if idx == i:
                    continue
                dist = self.dist(smp[idx, :], smp[i, :])
                if len(near_dist) < nnk:
                    # Not enough neighbours collected yet: add this one unconditionally
                    near_idx[len(near_dist)] = i
                    near_dist = np.append(near_dist, dist)
                else:
                    # Replace the current farthest neighbour if this sample is closer
                    far = np.argmax(near_dist)
                    if dist < near_dist[far]:
                        near_dist[far] = dist
                        near_idx[far] = i
            mino_nn.append(near_idx)
        return self.create_synth(smp, mino_smp, np.array(mino_nn, dtype=int))

    def dist(self, smp_1, smp_2):
        # Euclidean distance
        return np.sqrt(np.sum((smp_1 - smp_2) ** 2))

    def create_synth(self, smp, mino_smp, mino_nn):
        self.T = len(mino_smp)
        if self.N < 100:
            # For N < 100%, over-sample only a random N% subset of the minority samples
            self.T = int(self.N * 0.01 * len(mino_smp))
            self.N = 100
        self.N = int(self.N * 0.01)

        # Indices of the minority samples used as seeds for synthetic data
        rs = np.floor(np.random.uniform(size=self.T) * len(mino_smp)).astype(int)

        synth = []
        for n in range(self.N):
            for i in rs:
                nn = int(np.random.uniform() * nnk)            # pick one of the nnk neighbours
                dif = smp[mino_nn[i, nn], :] - mino_smp[i, :]  # vector toward that neighbour
                gap = np.random.uniform(size=len(mino_smp[0]))
                # np.floor keeps integer-valued features; drop it if the features are continuous
                tmp = mino_smp[i, :] + np.floor(gap * dif)
                tmp[tmp < 0] = 0
                synth.append(tmp)
        return synth
```
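And a minimal usage sketch for the class above, again with hypothetical arrays `X` and `y`; `N=200` requests 200% over-sampling, i.e. two synthetic samples per minority seed.

```python
import numpy as np

# Hypothetical toy data: 200 negative samples and 20 positive samples, 4 features each
X = np.vstack((np.random.rand(200, 4), np.random.rand(20, 4) + 1.0))
y = np.hstack((np.zeros(200), np.ones(20)))

smote = SMOTE(N=200)                        # 200% over-sampling
synth = np.array(smote.oversampling(X, y))  # synthetic minority samples

X_over = np.vstack((X, synth))
y_over = np.hstack((y, np.ones(len(synth))))
print(X_over.shape, y_over.shape)
```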
I wonder whether algorithm-level approaches are more robust to dirty data than data-level approaches. Maybe my code is just wrong... Please let me know if you spot any mistakes m(_ _)m
Since data-level approaches target batch processing and are likely to increase the amount of computation, I felt it would be more practical to make the adjustment with algorithm-level approaches. With algorithm-level approaches, you only have to increase the cost and the weight-update gradient for the minority samples, so there is almost no effect on training time (a minimal sketch of this idea is shown below). If you know of any other good ways to deal with imbalanced data, please leave a comment.
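To make the algorithm-level idea concrete, here is a minimal sketch of my own (not from the paper): a hand-rolled logistic regression trained by SGD in which the minority class gets a larger cost coefficient, so only the per-sample gradient is scaled and training time stays essentially the same. The weight `w_pos` is an assumed knob, often set near the inverse class ratio.

```python
import numpy as np

def weighted_logreg_sgd(X, y, w_pos=100.0, lr=0.1, epochs=10):
    """Logistic regression via SGD; positive (minority) samples get a larger
    cost coefficient, which simply scales their gradient contribution."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + np.exp(-(np.dot(w, xi) + b)))
            c = w_pos if yi == 1 else 1.0  # cost coefficient for the imbalance
            grad = c * (p - yi)            # weighted gradient of the log loss
            w -= lr * grad * xi
            b -= lr * grad
    return w, b

# Hypothetical imbalanced toy data: 1,000 negatives, 10 positives
X = np.vstack((np.random.randn(1000, 3), np.random.randn(10, 3) + 2.0))
y = np.hstack((np.zeros(1000), np.ones(10)))
w, b = weighted_logreg_sgd(X, y, w_pos=100.0)
```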