Data is often imbalanced in supervised learning; in fact, I think there are few cases where a large amount of data can be secured in a balanced manner.
This time, I will introduce a library called **imbalanced-learn** that can be useful for resampling imbalanced data.
I mainly referred to the following articles.
- Under-sampling / over-sampling imbalanced data with imbalanced-learn
- Data analysis with Python: Sampling imbalanced data with imbalanced-learn
The official documentation is here.
Follow Install and contribution and install with:
pip install -U imbalanced-learn
By the way, as of March 2020, the dependent libraries seem to have minimum version requirements (see the installation docs for details).
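If you want to confirm which version you ended up with, a quick check:

import imblearn
print(imblearn.__version__)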
Prepare the synthetic data used in this post. If you already have data, skip this step. I use a function called make_classification.
In[1]
import pandas as pd
from sklearn.datasets import make_classification

# Generate a 3-class dataset; weights makes the class ratio roughly 1% / 5% / 94%
df = make_classification(n_samples=5000, n_features=10, n_informative=2,
                         n_redundant=0, n_repeated=0, n_classes=3,
                         n_clusters_per_class=1, weights=[0.01, 0.05, 0.94],
                         class_sep=0.8, random_state=0)
This df is a tuple of two return values: df[0] contains the so-called X, and df[1] the so-called y. Store them in a DataFrame with the following operation.
In[2]
df_raw = pd.DataFrame(df[0], columns = ['var1', 'var2', 'var3', 'var4', 'var5', 'var6', 'var7', 'var8', 'var9', 'var10'])
df_raw['Class'] = df[1]
df_raw.head()
Divide this into X and y.
In[3]
X = df_raw.iloc[:, 0:10]
y = df_raw['Class']
y.value_counts()
Out[3]
2 4674
1 261
0 65
Name: Class, dtype: int64
As you can see, label 2 has an extremely large amount of data. This completes the preparation of the synthetic data.
Split the data frame prepared earlier using train_test_split.
In[4]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 71, stratify=y)
y_train.value_counts()
Out[4]
2 3272
1 183
0 45
Name: Class, dtype: int64
These are the per-label counts of the training data after splitting into train and test. By specifying y in the stratify argument of train_test_split, you can perform **stratified sampling**, which preserves the class ratio in both splits.
- Statistics Web: Sample extraction method
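As a quick sanity check, the class proportions of the full data and of each split should be almost identical when stratify=y is given. A small sketch using value_counts(normalize=True):

# Compare class proportions before and after the split
print(y.value_counts(normalize=True))
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))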
Now for the main subject. Undersample with RandomUnderSampler. The API reference is here.
Let me describe the **sampling_strategy** argument. This argument determines the ratio of each class after sampling. In earlier versions the argument was called ratio, but from version 0.6 it has been changed to sampling_strategy.
This argument mainly accepts a float or a dictionary. For a float, specify minority class ÷ majority class; however, this is only applicable to two-label problems (see the sketch below).
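Here is a minimal sketch of the float form, using a throwaway binary dataset generated just for illustration (separate from the data above):

from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# Throwaway binary data: roughly 90% negative, 10% positive
X_bin, y_bin = make_classification(n_samples=1000, n_classes=2,
                                   weights=[0.9, 0.1], random_state=0)

# sampling_strategy=0.5 requests minority / majority = 0.5 after resampling,
# i.e. the majority class is cut down to twice the minority count
rus_bin = RandomUnderSampler(sampling_strategy=0.5, random_state=0)
X_res, y_res = rus_bin.fit_resample(X_bin, y_bin)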
For a dictionary, pass the desired sample size of each class, as follows.
In[5]
from imblearn.under_sampling import RandomUnderSampler

# Use the count of the smallest class (label 0) as the base
positive_count_train = y_train.value_counts()[0]
# Request 1x / 2x / 5x that count for classes 0 / 1 / 2
strategy = {0: positive_count_train, 1: positive_count_train*2, 2: positive_count_train*5}
rus = RandomUnderSampler(random_state=0, sampling_strategy=strategy)
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)
y_resampled.value_counts()
Out[5]
2 225
1 90
0 45
Name: Class, dtype: int64
You have now undersampled.
While we're at it, let's classify with a simple model.
In[6]
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Train on the resampled data, evaluate on the untouched test set
model = LogisticRegression()
model.fit(X_resampled, y_resampled)
y_pred = model.predict(X_test)
print('Accuracy(test) : %.5f' % accuracy_score(y_test, y_pred))
Out[6]
Accuracy(test) : 0.97533
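Since the test data is imbalanced, a high accuracy can hide poor performance on the minority classes. A small sketch of a per-class breakdown with scikit-learn's classification_report:

from sklearn.metrics import classification_report

# Precision / recall / F1 per class; more informative than overall
# accuracy when the test set is imbalanced
print(classification_report(y_test, y_pred))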
Let's output the confusion matrix as a heat map.
In[7]
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm)
plt.show()
Since the test data is also imbalanced, the heatmap was hard to read ...
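One workaround (a sketch, reusing cm from above) is to normalize each row so that every true class sums to 1; minority-class cells then stay visible:

# Row-normalize the confusion matrix and annotate the cells
cm_norm = cm / cm.sum(axis=1, keepdims=True)
sns.heatmap(cm_norm, annot=True, fmt='.2f')
plt.show()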
This time I tried undersampling.
While researching for this article, I found that there are various methods for undersampling and oversampling.
- Introduction of imbalanced-learn functions
- [Handling of imbalanced data | PortoSeguro Competition](https://data-bunseki.com/2019/11/30/%E4%B8%8D%E5%9D%87%E8%A1%A1%E3%83%87%E3%83%BC%E3%82%BF%E3%81%AE%E5%8F%96%E3%82%8A%E6%89%B1%E3%81%84-portoseguro-%E3%82%B3%E3%83%B3%E3%83%9A/)
**SMOTE** seems interesting, so I will study it next.
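As a first taste, here is a minimal sketch of oversampling the same training data with SMOTE (default settings, which grow every class to the majority count by synthesizing points between neighbors):

from imblearn.over_sampling import SMOTE

# Oversample minority classes by interpolating between nearest neighbors
smote = SMOTE(random_state=0)
X_smote, y_smote = smote.fit_resample(X_train, y_train)
print(y_smote.value_counts())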
Comments and corrections are always welcome.