Supervised learning usually requires a large amount of labeled data to achieve high accuracy, but manual annotation takes a great deal of time and effort. One way to address this is data augmentation, which artificially enlarges the training set.
However, data augmentation is usually discussed in the context of images, and few methods apply to tabular data. This article therefore introduces data augmentation techniques that can be applied to tabular data, runs experiments, and verifies their performance.
Mixup
mixup: Beyond Empirical Risk Minimization
Mixup is a method proposed in 2017 and accepted at ICLR 2018. It generates a new input by linearly mixing two inputs (and their labels) with a random weight.
```python
from sklearn.utils import check_random_state

def mixup(x, y=None, alpha=0.2, p=1.0, random_state=None):
    """Mix each sample (and its label) with another randomly chosen sample."""
    random_state = check_random_state(random_state)
    n, _ = x.shape
    if random_state.uniform() < p:
        # Mixing weight drawn from Beta(alpha, alpha).
        l = random_state.beta(alpha, alpha)
        # A random permutation pairs each sample with another one in the batch.
        shuffle = random_state.permutation(n)
        x = l * x + (1.0 - l) * x[shuffle]
        if y is not None:
            y = l * y + (1.0 - l) * y[shuffle]
    return x, y
```
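As a quick sanity check, here is how `mixup` behaves on a tiny toy batch (the values below are invented purely for illustration):

```python
import numpy as np

x = np.array([[0.0, 1.0],
              [1.0, 0.0]])
y = np.array([[1.0, 0.0],
              [0.0, 1.0]])  # one-hot labels

x_mix, y_mix = mixup(x, y, alpha=0.2, random_state=42)
print(x_mix)  # each row is a convex combination of two original rows
print(y_mix)  # labels are mixed with the same weight
```

Because the same weight `l` is applied to the inputs and the labels, the targets stay consistent with the mixed inputs.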
The paper reports that Mixup improved performance not only on images but also on speech and tabular data.
Cutmix
CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features
Cutmix is a method proposed in 2019 and accepted at ICCV 2019. It generates a new input by replacing a rectangular region of one input with the corresponding region of another.
```python
import numpy as np
from sklearn.utils import check_random_state

def cutmix(x, y=None, alpha=1.0, p=1.0, random_state=None):
    """Replace a random rectangle of each image with the same rectangle
    taken from another image in the batch."""
    random_state = check_random_state(random_state)
    n, h, w, _ = x.shape
    if random_state.uniform() < p:
        l = random_state.beta(alpha, alpha)
        # Patch side lengths chosen so that its area is about (1 - l) * h * w.
        r_h = int(h * np.sqrt(1.0 - l))
        r_w = int(w * np.sqrt(1.0 - l))
        # Top-left corner of the patch, sampled uniformly.
        x1 = random_state.randint(h - r_h + 1)
        y1 = random_state.randint(w - r_w + 1)
        x2 = x1 + r_h
        y2 = y1 + r_w
        shuffle = random_state.permutation(n)
        x = x.copy()  # avoid mutating the caller's array
        x[:, x1:x2, y1:y2] = x[shuffle, x1:x2, y1:y2]
        if y is not None:
            # Adjust the label weight to the area that was actually replaced,
            # as in the original CutMix implementation.
            l = 1.0 - (r_h * r_w) / (h * w)
            y = l * y + (1.0 - l) * y[shuffle]
    return x, y
```
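Here is a minimal usage example on a dummy image batch (the shapes are arbitrary, chosen only for illustration):

```python
import numpy as np

x = np.random.rand(4, 8, 8, 1)  # (batch, height, width, channels)
y = np.eye(4)                   # one-hot labels for 4 classes

x_aug, y_aug = cutmix(x, y, alpha=1.0, random_state=0)
# A rectangular patch of each image now comes from another image in the
# batch, and the labels are mixed in proportion to the surviving area.
```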
The paper only reports results of applying Cutmix to images. What happens if it is applied to tabular data?
In tabular data, the order of features (age, nationality, and so on) has no spatial meaning, so instead of cutting out a contiguous region, I randomly select which features to replace with those of the other input.
```python
import numpy as np
from sklearn.utils import check_random_state

def cutmix_for_tabular(x, y=None, alpha=1.0, p=1.0, random_state=None):
    """Replace a random subset of features with those of another sample."""
    random_state = check_random_state(random_state)
    n, d = x.shape
    if random_state.uniform() < p:
        l = random_state.beta(alpha, alpha)
        # Each feature is replaced with probability 1 - l, so l is the
        # expected fraction of features kept from the original sample.
        mask = random_state.choice([False, True], size=d, p=[l, 1.0 - l])
        mask = np.where(mask)[0]
        shuffle = random_state.permutation(n)
        x = x.copy()  # avoid mutating the caller's array
        x[:, mask] = x[shuffle][:, mask]
        if y is not None:
            y = l * y + (1.0 - l) * y[shuffle]
    return x, y
```
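Usage is the same as for the image version; here is a small example on a made-up feature matrix:

```python
import numpy as np

x = np.arange(12, dtype=float).reshape(3, 4)  # 3 samples, 4 features
y = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 0.0]])                    # multi-label targets

x_aug, y_aug = cutmix_for_tabular(x, y, random_state=0)
# Some columns of each row are now taken from another (shuffled) row,
# and the labels are mixed with the corresponding weight.
```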
We will now run an experiment using the following data: a multi-label classification problem that predicts the mechanisms of action of compounds from gene expression patterns.
Mechanisms of Action (MoA) Prediction | Kaggle
Check the code below for the details of the experiment.
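For orientation only, here is a rough, hypothetical sketch of how batch-level augmentation like this can be wired into a training loop; the model, batch size, and training API are placeholders, not the actual competition setup:

```python
import numpy as np

def train_epoch(model, x_train, y_train, batch_size=128, seed=None):
    """One epoch of training with Cutmix applied to every mini-batch."""
    rs = np.random.RandomState(seed)
    order = rs.permutation(len(x_train))
    for start in range(0, len(x_train), batch_size):
        idx = order[start:start + batch_size]
        # Augment the batch; the mixed labels are soft targets, so the
        # loss must accept them (e.g. binary cross-entropy).
        x_batch, y_batch = cutmix_for_tabular(x_train[idx], y_train[idx],
                                              random_state=rs)
        model.train_on_batch(x_batch, y_batch)  # e.g. a Keras model
```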
The log loss results are as follows:
|          | Local   | Public  | Private |
|----------|---------|---------|---------|
| Baseline | 0.01604 | 0.01906 | 0.01666 |
| Mixup    | 0.01605 | 0.01905 | 0.01668 |
| Cutmix   | 0.01604 | 0.01901 | 0.01663 |
Both the Public and Private scores improved with Cutmix, confirming that Cutmix is an effective method for tabular data as well.
Finally, we have published the solution that used Cutmix and ranked 35th in the above competition, so if you are interested, please have a look.