I think MobileNetV3 is the best known of the lightweight deep learning models that appeared in the past year (V2 is also implemented in Keras).
H-swish (hard-swish) plays a part in making it lightweight. Plain swish is expensive because it uses a sigmoid, whereas hard-swish is said to be cheap to compute and to introduce only a small error under approximation and quantization. For a detailed explanation in Japanese, see: [[Paper reading] Searching for MobileNetV3](https://woodyzootopia.github.io/2019/09/%E8%AB%96%E6%96%87%E8%AA%AD%E3%81%BFsearching-for-mobilenetv3)
This article is about implementing hard-swish in Keras. The function itself is not difficult, so it can be implemented quickly with the backend. While I'm at it, I also implement plain swish and compare the two in a graph. The implementation uses the Keras backend, with the following environment:
tensorflow 1.15.0
keras 2.3.1
First, let's check the definitions.
h\text{-}swish[x] = x\frac{ReLU6(x+3)}{6}
swish[x] = x \times Sigmoid(x)
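As a quick sanity check of how close the two are (these numbers are mine, not from the paper): at x = 1, h-swish gives 1 × ReLU6(4)/6 = 4/6 ≈ 0.667 while swish gives 1 × Sigmoid(1) ≈ 0.731; for large positive x both approach x, and for large negative x both approach 0.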
I implemented them by referring to the official Keras documentation on how to use activation functions.
h-swish.py
from keras import backend as K

# definition of hard-swish
def hard_swish(x):
    return x * (K.relu(x + 3., max_value=6.) / 6.)

# definition of swish
def swish(x):
    return x * K.sigmoid(x)
The backend relu has a max_value argument that sets an upper limit on the output, so ReLU6 can be defined with it and the rest is just a direct transcription of the formula.
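As a quick check of that behavior (this snippet and its file name are mine, not from the article), the backend relu with max_value=6. clips its output to the range [0, 6]:
relu6_check.py
from keras import backend as K
import numpy as np

# ReLU6 expressed with the backend relu's max_value argument
def relu6(x):
    return K.relu(x, max_value=6.)

# evaluate on a few sample values; expected output: [0. 2. 6.]
print(K.get_value(relu6(K.variable(np.array([-1., 2., 10.])))))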
Let's check that the defined functions behave as defined. They can be evaluated on a NumPy array through the backend as well.
backend_result.py
from keras import backend as K
import numpy as np
import matplotlib.pyplot as plt
# define an array from -10 to 10 in 0.2 increments
inputs = np.arange(-10, 10.2, 0.2)
# convert the numpy array to a backend tensor
inputs_v = K.variable(inputs)
# build the computation graph with the functions defined above
outputs_hs = hard_swish(inputs_v)
outputs_s = swish(inputs_v)
# evaluate the graph and fetch the outputs as numpy arrays
outputs_hs = K.get_value(outputs_hs)
outputs_s = K.get_value(outputs_s)
# plot the results
plt.figure(figsize=(14,7))
plt.yticks(range(0, 9, 1))
plt.xticks(range(-8, 9, 1))
plt.grid(True)
plt.plot(inputs, outputs_hs, label="hard_swish")
plt.plot(inputs, outputs_s, label="swish")
plt.legend(bbox_to_anchor=(1, 1), loc='lower right', borderaxespad=0, fontsize=18)
plt.show()
**Result of this article's implementation** / **Result from the paper** (Paper: Searching for MobileNetV3)
Looks good.
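We can also quantify how close the approximation is over this range (this check is mine, not part of the original article). Since outputs_hs and outputs_s from backend_result.py are already NumPy arrays, the largest gap between the two curves can be printed directly:
approx_check.py
# continues from backend_result.py: outputs_hs and outputs_s are numpy arrays
print(np.max(np.abs(outputs_hs - outputs_s)))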
To use it, just pass the function defined earlier as the activation.
conv.py
from keras.layers import Conv2D

Conv2D(16, (3, 3), padding="SAME", activation=hard_swish)
Or
conv.py
from keras.layers import Activation
Activation(hard_swish)
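One thing to keep in mind (not covered in the original article): when you save a model that uses a custom activation like this and load it back later, Keras needs the function to be registered via custom_objects. A minimal sketch, with a hypothetical file name:
load_custom.py
from keras.models import load_model

# "my_model.h5" is a hypothetical path; register the custom activation at load time
model = load_model("my_model.h5", custom_objects={"hard_swish": hard_swish})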
Hard-swish can also be implemented as a custom layer. I referred to the Keras `Advanced Activations` implementation on GitHub. Reference: advanced_activations.py
h-swish_layer.py
from keras import backend as K
from keras.engine.topology import Layer

# hard-swish defined as a custom layer
class Hard_swish(Layer):
    def __init__(self):
        super(Hard_swish, self).__init__()

    def call(self, inputs):
        return inputs * (K.relu(inputs + 3., max_value=6.) / 6.)

    def compute_output_shape(self, input_shape):
        return input_shape
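One caveat from me (not in the referenced implementation): as written, __init__ does not forward keyword arguments, so you cannot pass name= and Keras cannot re-instantiate the layer when loading a saved model, because deserialization calls the constructor with the layer's config (which includes name and trainable). If you plan to save and reload models containing this layer, a sketch that forwards **kwargs looks like this:
h-swish_layer_kwargs.py
from keras import backend as K
from keras.layers import Layer

class Hard_swish(Layer):
    def __init__(self, **kwargs):
        # pass name=, trainable=, etc. through to the base Layer so (de)serialization works
        super(Hard_swish, self).__init__(**kwargs)

    def call(self, inputs):
        return inputs * (K.relu(inputs + 3., max_value=6.) / 6.)

    def compute_output_shape(self, input_shape):
        return input_shape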
Here is an example of how to use it. I am assuming CIFAR-10-sized input.
h-swish_use.py
from keras.layers import Input, Conv2D, MaxPooling2D, GlobalAveragePooling2D, Dense
from keras.models import Model

inputs = Input(shape=(32, 32, 3))
x = Conv2D(64, (3, 3), padding="SAME")(inputs)
x = Hard_swish()(x)
x = Conv2D(64, (3, 3), padding="SAME")(x)
x = Hard_swish()(x)
x = MaxPooling2D()(x)
x = Conv2D(128, (3, 3), padding="SAME")(x)
x = Hard_swish()(x)
x = Conv2D(128, (3, 3), padding="SAME")(x)
x = Hard_swish()(x)
x = MaxPooling2D()(x)
x = Conv2D(256, (3, 3), padding="SAME")(x)
x = Hard_swish()(x)
x = Conv2D(256, (3, 3), padding="SAME")(x)
x = Hard_swish()(x)
x = GlobalAveragePooling2D()(x)
x = Dense(1024)(x)
x = Hard_swish()(x)
prediction = Dense(10, activation="softmax")(x)
model = Model(inputs, prediction)
model.summary()
model_output
Model: "model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, 32, 32, 3)] 0
_________________________________________________________________
conv2d (Conv2D) (None, 32, 32, 64) 1792
_________________________________________________________________
hard_swish (Hard_swish) (None, 32, 32, 64) 0
_________________________________________________________________
conv2d_1 (Conv2D) (None, 32, 32, 64) 36928
_________________________________________________________________
hard_swish_1 (Hard_swish) (None, 32, 32, 64) 0
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 16, 16, 64) 0
_________________________________________________________________
conv2d_2 (Conv2D) (None, 16, 16, 128) 73856
_________________________________________________________________
hard_swish_2 (Hard_swish) (None, 16, 16, 128) 0
_________________________________________________________________
conv2d_3 (Conv2D) (None, 16, 16, 128) 147584
_________________________________________________________________
hard_swish_3 (Hard_swish) (None, 16, 16, 128) 0
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 8, 8, 128) 0
_________________________________________________________________
conv2d_4 (Conv2D) (None, 8, 8, 256) 295168
_________________________________________________________________
hard_swish_4 (Hard_swish) (None, 8, 8, 256) 0
_________________________________________________________________
conv2d_5 (Conv2D) (None, 8, 8, 256) 590080
_________________________________________________________________
hard_swish_5 (Hard_swish) (None, 8, 8, 256) 0
_________________________________________________________________
global_average_pooling2d (Gl (None, 256) 0
_________________________________________________________________
dense (Dense) (None, 1024) 263168
_________________________________________________________________
hard_swish_6 (Hard_swish) (None, 1024) 0
_________________________________________________________________
dense_1 (Dense) (None, 10) 10250
=================================================================
Total params: 1,418,826
Trainable params: 1,418,826
Non-trainable params: 0
_________________________________________________________________
One (minor) merit compared with defining it as a plain activation function is that you can see where hard-swish is used when the model is visualized with summary().
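For completeness, here is one way the model above could be compiled and trained on CIFAR-10. This sketch is mine, not part of the original article, and the hyperparameters (Adam, batch size 128, 10 epochs) are arbitrary choices:
train_example.py
from keras.datasets import cifar10
from keras.utils import to_categorical

# load CIFAR-10 and do minimal preprocessing: scale pixels to [0, 1], one-hot encode labels
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
x_train, x_test = x_train / 255., x_test / 255.
y_train, y_test = to_categorical(y_train, 10), to_categorical(y_test, 10)

# compile and train the model defined in h-swish_use.py
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=128, epochs=10, validation_data=(x_test, y_test))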
I couldn't find a Keras implementation of hard-swish when I searched this time, so I implemented it myself. It was also a good opportunity to learn that the ReLU function I often use has a max_value argument. If you have any questions or concerns, please leave a comment.