Noisy Student
is a method that achieved SOTA on ImageNet.
Normally, when you add data and retrain, humans have to create the teacher labels. Noisy Student
instead simply collects unlabeled data, runs inference on it with the current model, and retrains using those predictions as provisional labels, which raises accuracy without any time spent on manual labeling. Strictly speaking, you still have to collect data that corresponds to one of the original labels, but it is a relief that no human annotator is needed.
Please refer to the following sites for a detailed explanation.
Commentary: Thorough commentary on the latest SoTA model "Noisy Student" for image recognition!
Paper: Self-training with Noisy Student improves ImageNet classification
None of the individual steps is particularly difficult, so I originally intended to reproduce it on ImageNet in this article. However, training on ImageNet was beyond what my PC could handle in a reasonable time, so I experimented with ResNet-50 and CIFAR-10 instead. I hope it serves as a reference for the procedure and implementation.
tensorflow 1.15.0
keras 2.3.1
Python 3.7.6
numpy 1.18.1
Core i7
GTX 1080 Ti
The procedure for Noisy Student is as follows.
(Figure omitted.) Quote: Self-training with Noisy Student improves ImageNet classification
To summarize:
1. Train a teacher model on the labeled data.
2. Infer pseudo labels for unlabeled data with the teacher model.
3. Train a student model of equal or larger size on the labeled plus pseudo-labeled data, adding noise to the student.
4. Treat the student as a new teacher and repeat from step 2.
(See the loop sketch below.)
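As a minimal sketch of this loop (train, pseudo_label, and combine are hypothetical placeholders for the concrete steps implemented later in this article):

teacher = train(build_model(), x_labeled, y_labeled, noise=False)
for _ in range(n_iterations):
    # Soft pseudo labels from the current teacher
    y_pseudo = pseudo_label(teacher, x_unlabeled)
    # Labeled + pseudo-labeled data
    x, y = combine(x_labeled, y_labeled, x_unlabeled, y_pseudo)
    # Train an equal-or-larger student with noise (RandAugment, Dropout, Stochastic Depth)
    student = train(build_model(), x, y, noise=True)
    # The student becomes the next teacher
    teacher = student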
The **noise** here refers to:
- RandAugment (input noise)
- Dropout (model noise)
- Stochastic Depth (model noise)
I will briefly explain each of these as I implement them.
Rand Augmentation
"Improve the accuracy of the image recognition model with just two lines!? Explanation of the new Data Augmentation automatic optimization method "Rand Augment"!"
The article above is easy to understand. To summarize: prepare K kinds of data augmentations; for each image, randomly select N of them and apply each at magnitude M.
That's all; it's simple. The Noisy Student paper uses N = 2 and M = 27. In my implementation I used N = 2 and M = 10, because CIFAR-10 images are small and noise that is too strong seemed likely to backfire.
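The selection logic itself fits in a few lines. Below is a simplified sketch (augmentations is a hypothetical dict from operation names to functions; the full implementation I actually use appears later in this article):

import random

def rand_augment(image, augmentations, N=2, M=10):
    # Pick N operations at random (with replacement) and apply each at magnitude M
    for op_name in random.choices(list(augmentations), k=N):
        image = augmentations[op_name](image, M)
    return image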
Dropout
This one is well known, so I will omit the explanation. The Noisy Student paper uses a rate of 0.5.
Stochastic Depth
[Survey] Deep Networks with Stochastic Depth
If you want to know more, please refer to the explanation above.
(Figure omitted.) Quote: Deep Networks with Stochastic Depth
I will explain briefly based on the image above.
The basic idea is that during training, each residual branch is stochastically skipped, so that a block sometimes passes through only its shortcut part. The probability of being skipped increases linearly as the layer gets deeper; in the Noisy Student paper the last layer's survival probability is 0.8.
At inference time nothing is skipped; instead, the output of each residual branch is multiplied by its survival probability.
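In formula form: with L blocks in total and a final survival probability p_L, block l survives with probability p_l = 1 - (l / L) * (1 - p_L). During training, block l outputs x + b_l * f(x) with b_l ~ Bernoulli(p_l); at inference it outputs x + p_l * f(x). This is what get_p_survival and stochastic_survival implement below.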
The preamble got long, but now let's implement it. Here is the procedure again:
1. Train the teacher model (ResNet-50) on CIFAR-10.
2. Create pseudo labels for images collected from ImageNet with the teacher model.
3. Build the student model with model noise (Dropout and Stochastic Depth).
4. Train the student model on CIFAR-10 plus the pseudo-labeled data, applying RandAugment.
I will explain the implementation in this order.
This is just an ordinary classification problem. I wanted to use EfficientNet for the model, but to save implementation effort I went with ResNet-50. Note that the basic structure is the same as ResNet-50, but since the input images are small I reduced the number of stride-2 convolutions to two, so that the feature maps do not become too small.
cifar10_resnet50.py
from keras.datasets import cifar10
from keras.utils.np_utils import to_categorical

# Prepare the CIFAR-10 dataset
(x_train_10, y_train_10), (x_test_10, y_test_10) = cifar10.load_data()

# Convert the labels to one-hot representation
y_train_10 = to_categorical(y_train_10)
y_test_10 = to_categorical(y_test_10)
cifar10_resnet50.py
import tensorflow as tf
from keras import backend as K
from keras.models import Model
from keras.layers import Input, Activation, Dense, GlobalAveragePooling2D, Conv2D, Add, Lambda, Dropout
from keras import optimizers
from keras.regularizers import l2
from keras.layers.normalization import BatchNormalization as BN
from keras.callbacks import Callback, LearningRateScheduler, ModelCheckpoint, EarlyStopping, ReduceLROnPlateau
from keras.preprocessing.image import ImageDataGenerator, load_img, img_to_array, array_to_img

# Reference URL: https://www.pynote.info/entry/keras-resnet-implementation
def shortcut_en(x, residual):
    '''Create a shortcut connection.'''
    x_shape = K.int_shape(x)
    residual_shape = K.int_shape(residual)
    if x_shape == residual_shape:
        # If x and residual have the same shape, do nothing.
        shortcut = x
    else:
        # If the shapes of x and residual differ, perform a linear transformation to match them.
        stride_w = int(round(x_shape[1] / residual_shape[1]))
        stride_h = int(round(x_shape[2] / residual_shape[2]))
        shortcut = Conv2D(filters=residual_shape[3],
                          kernel_size=(1, 1),
                          strides=(stride_w, stride_h),
                          kernel_initializer='he_normal',
                          kernel_regularizer=l2(1.e-4))(x)
        shortcut = BN()(shortcut)
    return Add()([shortcut, residual])
def normal_resblock50(data, filters, strides=1):
    # Bottleneck block: 1x1 -> 3x3 -> 1x1 convolutions plus a shortcut connection
    x = Conv2D(filters=filters, kernel_size=(1, 1), strides=(1, 1), padding="same")(data)
    x = BN()(x)
    x = Activation("relu")(x)
    x = Conv2D(filters=filters, kernel_size=(3, 3), strides=(1, 1), padding="same")(x)
    x = BN()(x)
    x = Activation("relu")(x)
    x = Conv2D(filters=filters * 4, kernel_size=(1, 1), strides=strides, padding="same")(x)
    x = BN()(x)
    x = shortcut_en(data, x)
    x = Activation("relu")(x)
    return x
cifar10_resnet50.py
inputs = Input(shape = (32,32,3))
x = Conv2D(32,(5,5),padding = "SAME")(inputs)
x = BN()(x)
x = Activation('relu')(x)
x = normal_resblock50(x, 64, 1)
x = normal_resblock50(x, 64, 1)
x = normal_resblock50(x, 64, 1)
x = normal_resblock50(x, 128, 2)
x = normal_resblock50(x, 128, 1)
x = normal_resblock50(x, 128, 1)
x = normal_resblock50(x, 128, 1)
x = normal_resblock50(x, 256, 1)
x = normal_resblock50(x, 256, 1)
x = normal_resblock50(x, 256, 1)
x = normal_resblock50(x, 256, 1)
x = normal_resblock50(x, 256, 1)
x = normal_resblock50(x, 256, 1)
x = normal_resblock50(x, 512, 2)
x = normal_resblock50(x, 512, 1)
x = normal_resblock50(x, 512, 1)
x = GlobalAveragePooling2D()(x)
x = Dense(10)(x)
outputs = Activation("softmax")(x)
teacher_model = Model(inputs, outputs)
teacher_model.summary()
cifar10_resnet50.py
batch_size = 64
steps_per_epoch = y_train_10.shape[0] // batch_size
validation_steps = x_test_10.shape[0] // batch_size

log_dir = 'logs/softlabel/teacher/'
checkpoint = ModelCheckpoint(log_dir + 'ep{epoch:03d}-loss{loss:.3f}-val_loss{val_loss:.3f}.h5',
                             monitor='val_loss', save_weights_only=True, save_best_only=True, period=1)
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=3, verbose=1)
early_stopping = EarlyStopping(monitor='val_loss', min_delta=0, patience=10, verbose=1)

teacher_model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

train_gen = ImageDataGenerator(rescale=1./255.).flow(x_train_10, y_train_10, batch_size)
val_gen = ImageDataGenerator(rescale=1./255.).flow(x_test_10, y_test_10, batch_size)
cifar10_resnet50.py
# Epochs 0-250: train with a fixed learning rate
history = teacher_model.fit_generator(train_gen,
                                      initial_epoch=0,
                                      epochs=250,
                                      steps_per_epoch=steps_per_epoch,
                                      validation_data=val_gen, validation_steps=validation_steps,
                                      callbacks=[checkpoint])
# Epochs 250-300: decay the learning rate and allow early stopping
history = teacher_model.fit_generator(train_gen,
                                      initial_epoch=250,
                                      epochs=300,
                                      steps_per_epoch=steps_per_epoch,
                                      validation_data=val_gen, validation_steps=validation_steps,
                                      callbacks=[checkpoint, reduce_lr, early_stopping])
cifar10_resnet50.py
# Reference URL: https://qiita.com/yy1003/items/c590d1a26918e4abe512
def my_eval(model, x, t):
    # model: the model to evaluate, x: images with shape (batch, 32, 32, 3), t: one-hot labels
    ev = model.evaluate(x, t)
    print("loss:", end=" ")
    print(ev[0])
    print("acc: ", end="")
    print(ev[1])

my_eval(teacher_model, x_test_10 / 255, y_test_10)
teacher_eval
10000/10000 [==============================] - 16s 2ms/step
loss: 0.817680492834933
acc: 0.883899986743927
The accuracy on the test data was 88.39%.
First, prepare the images to which pseudo labels will be attached. Although the dataset is small, I collected about 800 images from ImageNet for each of the 10 classes, resized them to 32x32, and made them into a dataset.
The detailed procedure is:
1. Infer the collected images with the teacher model and use the soft predictions as pseudo labels.
2. Keep only images whose maximum predicted probability exceeds a threshold.
3. Balance the number of images per class by duplicating data.
4. Combine the result with the original CIFAR-10 arrays.
I will post my implementation below, but for steps 3 and 4 in particular there is no fixed method, so implement them in whatever way is easiest for you.
imagenet_dummy_label.py
import os
import numpy as np

# Load the collected ImageNet images (resized to 32x32 beforehand)
img_path = r"D:\imagenet\cifar10\resize"
img_list = os.listdir(img_path)

x_train_imgnet = []
for i in img_list:
    abs_path = os.path.join(img_path, i)
    temp = load_img(abs_path)
    temp = img_to_array(temp)
    x_train_imgnet.append(temp)
x_train_imgnet = np.array(x_train_imgnet)
imagenet_dummy_label.py
# Batch size
batch_size = 1
# Number of steps for the loop
step = int(x_train_imgnet.shape[0] / batch_size)
print(step)

# Empty list for the pseudo labels
y_train_imgnet_dummy = []
for i in range(step):
    # Take out one batch of images
    x_temp = x_train_imgnet[batch_size * i:batch_size * (i + 1)]
    # Normalize
    x_temp = x_temp / 255.
    # Infer
    temp = teacher_model.predict(x_temp)
    # Append to the list
    y_train_imgnet_dummy.extend(temp)
# Convert the list to a numpy array
y_train_imgnet_dummy = np.array(y_train_imgnet_dummy)
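Incidentally, predict accepts a batch_size argument, so the same pseudo labels can be obtained in a single call; the loop above is just more explicit:

# Equivalent one-liner: predict batches internally
y_train_imgnet_dummy = teacher_model.predict(x_train_imgnet / 255., batch_size=64)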
imagenet_dummy_label.py
# Confidence threshold
threshold = 0.75
y_train_imgnet_dummy_th = y_train_imgnet_dummy[np.max(y_train_imgnet_dummy, axis=1) > threshold]
x_train_imgnet_th = x_train_imgnet[np.max(y_train_imgnet_dummy, axis=1) > threshold]
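Note that the pseudo labels kept here are the soft probability vectors themselves, i.e. soft labels (hence the softlabel log directory). If you wanted hard labels instead, one option (not used in this article) would be:

# Hypothetical alternative: hard one-hot pseudo labels instead of soft ones
y_hard = to_categorical(np.argmax(y_train_imgnet_dummy_th, axis=1), num_classes=10)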
imagenet_dummy_label.py
# From one-hot vectors to class indices
y_student_all_dummy_label = np.argmax(y_train_imgnet_dummy_th, axis=1)

# Count the number of pseudo labels in each class
u, counts = np.unique(y_student_all_dummy_label, return_counts=True)
print(u, counts)

# The maximum count over all classes
student_label_max = max(counts)

# Split the numpy arrays by label
y_student_per_label = []
y_student_per_img_path = []
for i in range(10):
    temp_l = y_train_imgnet_dummy_th[y_student_all_dummy_label == i]
    print(i, ":", temp_l.shape)
    y_student_per_label.append(temp_l)
    temp_i = x_train_imgnet_th[y_student_all_dummy_label == i]
    print(i, ":", temp_i.shape)
    y_student_per_img_path.append(temp_i)

# Duplicate the data in each class up to the maximum count
y_student_per_label_add = []
y_student_per_img_add = []
for i in range(10):
    num = y_student_per_label[i].shape[0]
    temp_l = y_student_per_label[i]
    temp_i = y_student_per_img_path[i]
    add_num = student_label_max - num
    q, mod = divmod(add_num, num)
    print(q, mod)
    temp_l_tile = np.tile(temp_l, (q + 1, 1))
    temp_i_tile = np.tile(temp_i, (q + 1, 1, 1, 1))
    temp_l_add = temp_l[:mod]
    temp_i_add = temp_i[:mod]
    y_student_per_label_add.append(np.concatenate([temp_l_tile, temp_l_add], axis=0))
    y_student_per_img_add.append(np.concatenate([temp_i_tile, temp_i_add], axis=0))

# Check the count for each label
print([len(i) for i in y_student_per_label_add])

# Concatenate the per-label data
student_train_img = np.concatenate(y_student_per_img_add, axis=0)
student_train_label = np.concatenate(y_student_per_label_add, axis=0)

# Combine with the original CIFAR-10 arrays
x_train_student = np.concatenate([x_train_10, student_train_img], axis=0)
y_train_student = np.concatenate([y_train_10, student_train_label], axis=0)
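As a quick sanity check of the combined training set (the exact counts depend on how many pseudo-labeled images survived the threshold):

# CIFAR-10 (50000) plus 10 * student_label_max pseudo-labeled images
print(x_train_student.shape, y_train_student.shape)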
For the student model I again use ResNet-50, the same size as the teacher model. There are two kinds of model noise:
- Dropout
- Stochastic Depth
For the Stochastic Depth implementation I referred to the following notebook posted on GitHub.
Implementation URL: https://github.com/transcranial/stochastic-depth/blob/master/stochastic-depth.ipynb
In my implementation, the list of survival probabilities for all resblocks is created first, and each block takes its own probability from that list when the model is defined. I did it this way because defining the probabilities up front and then consuming them one by one seemed less error-prone.
stochastic_resblock.py
# Function defining the survival probability of each resblock
def get_p_survival(l, L, pl):
    # Linear decay: p_l = 1 - (l / L) * (1 - pl)
    pt = 1 - (l / L) * (1 - pl)
    return pt

# Multiply the output by 1 or 0 during training, and by the probability at inference
# Training: output x (1 or 0)
# Inference: output x probability
def stochastic_survival(y, p_survival=1.0):
    # Bernoulli (binomial) random variable
    survival = K.random_binomial((1,), p=p_survival)
    # During the test phase:
    # - scale y (see eq. (6) of the paper)
    # - p_survival effectively becomes 1 for all layers (no layer dropout)
    return K.in_test_phase(tf.constant(p_survival, dtype='float32') * y,
                           survival * y)

def stochastic_resblock(data, filters, strides, depth_num, p_list):
    print(p_list[depth_num])
    x = Conv2D(filters=filters, kernel_size=(1, 1), strides=(1, 1), padding="same")(data)
    x = BN()(x)
    x = Activation("relu")(x)
    x = Conv2D(filters=filters, kernel_size=(3, 3), strides=(1, 1), padding="same")(x)
    x = BN()(x)
    x = Activation("relu")(x)
    x = Conv2D(filters=filters * 4, kernel_size=(1, 1), strides=strides, padding="same")(x)
    x = BN()(x)
    # Randomly drop the residual branch with this block's survival probability
    x = Lambda(stochastic_survival, arguments={'p_survival': p_list[depth_num]})(x)
    x = shortcut_en(data, x)
    x = Activation("relu")(x)
    # Increment the layer counter
    depth_num += 1
    return x, depth_num
# 16 resblocks in total; the last block's survival probability is 0.8
L = 16
pl = 0.8
p_list = []
for l in range(L + 1):
    x = get_p_survival(l, L, pl)
    p_list.append(x)

# depth_num counts blocks from 1 so that the input layer is never dropped
depth_num = 1
inputs = Input(shape = (32,32,3))
x = Conv2D(32,(5,5),padding = "SAME")(inputs)
x = BN()(x)
x = Activation('relu')(x)
# depth_num is incremented inside the function and handed to the next block
x, depth_num = stochastic_resblock(x, 64, 1, depth_num, p_list)
x, depth_num = stochastic_resblock(x, 64, 1, depth_num, p_list)
x, depth_num = stochastic_resblock(x, 64, 1, depth_num, p_list)
x, depth_num = stochastic_resblock(x, 128, 2, depth_num, p_list)
x, depth_num = stochastic_resblock(x, 128, 1, depth_num, p_list)
x, depth_num = stochastic_resblock(x, 128, 1, depth_num, p_list)
x, depth_num = stochastic_resblock(x, 128, 1, depth_num, p_list)
x, depth_num = stochastic_resblock(x, 256, 1, depth_num, p_list)
x, depth_num = stochastic_resblock(x, 256, 1, depth_num, p_list)
x, depth_num = stochastic_resblock(x, 256, 1, depth_num, p_list)
x, depth_num = stochastic_resblock(x, 256, 1, depth_num, p_list)
x, depth_num = stochastic_resblock(x, 256, 1, depth_num, p_list)
x, depth_num = stochastic_resblock(x, 256, 1, depth_num, p_list)
x, depth_num = stochastic_resblock(x, 512, 2, depth_num, p_list)
x, depth_num = stochastic_resblock(x, 512, 1, depth_num, p_list)
x, depth_num = stochastic_resblock(x, 512, 1, depth_num, p_list)
x = GlobalAveragePooling2D()(x)
x = Dropout(0.5)(x)
x = Dense(10)(x)
outputs = Activation("softmax")(x)
student_model = Model(inputs, outputs)
student_model.summary()
student_model.compile(loss = "categorical_crossentropy",optimizer = "adam", metrics = ["accuracy"])
Since the dataset was already created in step 2, the only remaining noise to add is RandAugment. I used the following implementation from GitHub.
Implementation URL: https://github.com/heartInsert/randaugment/blob/master/Rand_Augment.py
Since this implementation works on PIL images, I wrote my own data generator that converts them to numpy arrays while yielding batches of training data.
Rand_Augment.py
from PIL import Image, ImageEnhance, ImageOps
import numpy as np
import random

class Rand_Augment():
    def __init__(self, Numbers=None, max_Magnitude=None):
        self.transforms = ['autocontrast', 'equalize', 'rotate', 'solarize', 'color', 'posterize',
                           'contrast', 'brightness', 'sharpness', 'shearX', 'shearY', 'translateX', 'translateY']
        if Numbers is None:
            self.Numbers = len(self.transforms) // 2
        else:
            self.Numbers = Numbers
        if max_Magnitude is None:
            self.max_Magnitude = 10
        else:
            self.max_Magnitude = max_Magnitude
        fillcolor = 128
        self.ranges = {
            # These magnitude ranges should be tested yourself; see what happens after each operation.
            # There is no need to follow the values in autoaugment.py.
            "shearX": np.linspace(0, 0.3, 10),
            "shearY": np.linspace(0, 0.3, 10),
            "translateX": np.linspace(0, 0.2, 10),
            "translateY": np.linspace(0, 0.2, 10),
            "rotate": np.linspace(0, 360, 10),
            "color": np.linspace(0.0, 0.9, 10),
            "posterize": np.round(np.linspace(8, 4, 10), 0).astype(np.int),
            "solarize": np.linspace(256, 231, 10),
            "contrast": np.linspace(0.0, 0.5, 10),
            "sharpness": np.linspace(0.0, 0.9, 10),
            "brightness": np.linspace(0.0, 0.3, 10),
            "autocontrast": [0] * 10,
            "equalize": [0] * 10,
            "invert": [0] * 10
        }
        self.func = {
            "shearX": lambda img, magnitude: img.transform(
                img.size, Image.AFFINE, (1, magnitude * random.choice([-1, 1]), 0, 0, 1, 0),
                Image.BICUBIC, fillcolor=fillcolor),
            "shearY": lambda img, magnitude: img.transform(
                img.size, Image.AFFINE, (1, 0, 0, magnitude * random.choice([-1, 1]), 1, 0),
                Image.BICUBIC, fillcolor=fillcolor),
            "translateX": lambda img, magnitude: img.transform(
                img.size, Image.AFFINE, (1, 0, magnitude * img.size[0] * random.choice([-1, 1]), 0, 1, 0),
                fillcolor=fillcolor),
            "translateY": lambda img, magnitude: img.transform(
                img.size, Image.AFFINE, (1, 0, 0, 0, 1, magnitude * img.size[1] * random.choice([-1, 1])),
                fillcolor=fillcolor),
            "rotate": lambda img, magnitude: self.rotate_with_fill(img, magnitude),
            "color": lambda img, magnitude: ImageEnhance.Color(img).enhance(1 + magnitude * random.choice([-1, 1])),
            "posterize": lambda img, magnitude: ImageOps.posterize(img, magnitude),
            "solarize": lambda img, magnitude: ImageOps.solarize(img, magnitude),
            "contrast": lambda img, magnitude: ImageEnhance.Contrast(img).enhance(
                1 + magnitude * random.choice([-1, 1])),
            "sharpness": lambda img, magnitude: ImageEnhance.Sharpness(img).enhance(
                1 + magnitude * random.choice([-1, 1])),
            "brightness": lambda img, magnitude: ImageEnhance.Brightness(img).enhance(
                1 + magnitude * random.choice([-1, 1])),
            "autocontrast": lambda img, magnitude: ImageOps.autocontrast(img),
            "equalize": lambda img, magnitude: ImageOps.equalize(img),
            "invert": lambda img, magnitude: ImageOps.invert(img)
        }

    def rand_augment(self):
        """Generate a set of distortions.
        Args:
            N: number of augmentations applied sequentially; N = len(transforms) / 2 works best
            M: max magnitude for all the transformations; should be <= self.max_Magnitude"""
        M = np.random.randint(0, self.max_Magnitude, self.Numbers)
        sampled_ops = np.random.choice(self.transforms, self.Numbers)
        return [(op, Magnitude) for (op, Magnitude) in zip(sampled_ops, M)]

    def __call__(self, image):
        operations = self.rand_augment()
        for (op_name, M) in operations:
            operation = self.func[op_name]
            mag = self.ranges[op_name][M]
            image = operation(image, mag)
        return image

    def rotate_with_fill(self, img, magnitude):
        # Rotation is done in RGBA so the empty corners can be filled (copied from AutoAugment-pytorch)
        rot = img.convert("RGBA").rotate(magnitude)
        return Image.composite(rot, Image.new("RGBA", rot.size, (128,) * 4), rot).convert(img.mode)

    def test_single_operation(self, image, op_name, M=-1):
        '''
        :param image: image
        :param op_name: operation name in self.transforms
        :param M: -1 stands for the max magnitude of that operation
        :return: the augmented image
        '''
        operation = self.func[op_name]
        mag = self.ranges[op_name][M]
        image = operation(image, mag)
        return image
data_generator.py
img_augment = Rand_Augment(Numbers=2, max_Magnitude=10)

def get_random_data(x_train_i, y_train_i, data_aug):
    # Convert the numpy image to PIL, apply RandAugment, then convert back
    x = array_to_img(x_train_i)
    if data_aug:
        seed_image = img_augment(x)
        seed_image = img_to_array(seed_image)
    else:
        seed_image = x_train_i
    seed_image = seed_image / 255
    return seed_image, y_train_i
def data_generator(x_train, y_train, batch_size, data_aug):
    '''Data generator for fit_generator'''
    n = len(x_train)
    i = 0
    while True:
        image_data = []
        label_data = []
        for b in range(batch_size):
            if i == 0:
                # Reshuffle the data at the start of every pass
                p = np.random.permutation(len(x_train))
                x_train = x_train[p]
                y_train = y_train[p]
            image, label = get_random_data(x_train[i], y_train[i], data_aug)
            image_data.append(image)
            label_data.append(label)
            i = (i + 1) % n
        image_data = np.array(image_data)
        label_data = np.array(label_data)
        yield image_data, label_data
Now that the data generator is ready, all that is left is training.
data_generator.py
log_dir = 'logs/softlabel/student1_2/'
checkpoint = ModelCheckpoint(log_dir + 'ep{epoch:03d}-loss{loss:.3f}-val_loss{val_loss:.3f}.h5',
                             monitor='val_loss', save_weights_only=True, save_best_only=True, period=1)
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=3, verbose=1)
early_stopping = EarlyStopping(monitor='val_loss', min_delta=0, patience=10, verbose=1)

batch_size = 64
steps_per_epoch = x_train_student.shape[0] // batch_size
validation_steps = x_test_10.shape[0] // batch_size

# Epochs 0-250: train with a fixed learning rate
history = student_model.fit_generator(data_generator(x_train_student, y_train_student, batch_size, data_aug=True),
                                      initial_epoch=0,
                                      epochs=250,
                                      steps_per_epoch=steps_per_epoch,
                                      validation_data=data_generator(x_test_10, y_test_10, batch_size, data_aug=False),
                                      validation_steps=validation_steps,
                                      callbacks=[checkpoint])
# Epochs 250-300: decay the learning rate and allow early stopping
history = student_model.fit_generator(data_generator(x_train_student, y_train_student, batch_size, data_aug=True),
                                      initial_epoch=250,
                                      epochs=300,
                                      steps_per_epoch=steps_per_epoch,
                                      validation_data=data_generator(x_test_10, y_test_10, batch_size, data_aug=False),
                                      validation_steps=validation_steps,
                                      callbacks=[checkpoint, reduce_lr, early_stopping])
eval.py
my_eval(student_model,x_test_10/255,y_test_10)
student_eval
10000/10000 [==============================] - 19s 2ms/step
loss: 0.24697399706840514
acc: 0.9394000172615051
The accuracy on the test data was 93.94%, a clear improvement over the teacher.
While working on this, the question came up: how would the result compare if noise had also been enabled when training the teacher model? So I checked. The results are summarized briefly in the table below.
Experiment | Teacher model | Test loss / accuracy | Student model | Test loss / accuracy
---|---|---|---|---
1 | without noise | 0.8176 / 88.39% | with noise | 0.2470 / 93.94%
2 | with noise | 0.2492 / 94.14% | with noise | 0.2289 / 94.28%
In this case, the accuracy was slightly higher when the teacher was also trained with noise. I also wanted to check robustness, but I ran out of steam.
That's all. If you have any questions or concerns, please leave a comment.