In step 6, the gradient descent method (a method of updating the weights using the derivative of the loss function) is generally adopted, so the loss function needs to be a differentiable function. In the following, the activation function, the loss function, and the gradient descent method are briefly described.
The values computed in the previous layer are first combined linearly according to the weights and biases. They are then passed as arguments to the activation function, and the output of the activation function is passed to the next layer, and so on; this repeated chain is what the network computes during learning. Therefore, the important point about an activation function is not the exact form of the formula ("why an exponential, why this particular fraction ..." is not a meaningful argument), but what range of values it outputs. The two types of activation functions used this time are summarized below.
For x greater than or equal to 0, the output is directly proportional to x. With the sigmoid function, the gradient vanishes as we move away from the origin (the derivative approaches 0), so learning stagnates once a unit takes a large value. It is empirically known that the ramp function (ReLU) avoids this vanishing gradient problem.
f(x) = \begin{cases} x & (x > 0) \\ 0 & (x \le 0) \end{cases}
The output value of the sigmoid function lies between 0 and 1.
f(x) = \frac{1}{1+e^{-x}}
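As a minimal sketch of these two activation functions (using NumPy; the function names here are chosen only for illustration):

import numpy as np

def relu(x):
    # ramp function: proportional to x for x > 0, zero otherwise
    return np.maximum(0.0, x)

def sigmoid(x):
    # squashes any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

print(relu(np.array([-2.0, 0.0, 3.0])))     # [0. 0. 3.]
print(sigmoid(np.array([-2.0, 0.0, 3.0])))  # [0.119..., 0.5, 0.952...]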
The $n$-dimensional output of the neural network is evaluated with a loss function. Conceptually, the smaller the difference from the correct $n$-dimensional values, the smaller the value of the loss function. Therefore, a good neural network gives a small value of the loss function.
E(w) = -\sum_{n=1}^{N} \left( d_n\log y_n + (1-d_n)\log(1-y_n) \right)
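As a small illustration of this cross-entropy (a sketch only; the variable names are arbitrary), with $d_n$ the correct labels and $y_n$ the network outputs:

import numpy as np

def cross_entropy(d, y, eps=1e-12):
    # E(w) = -sum( d*log(y) + (1-d)*log(1-y) ); eps avoids log(0)
    y = np.clip(y, eps, 1.0 - eps)
    return -np.sum(d * np.log(y) + (1.0 - d) * np.log(1.0 - y))

d = np.array([1.0, 0.0, 1.0])
y = np.array([0.9, 0.2, 0.6])
print(cross_entropy(d, y))  # small when y is close to d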
The key part of a neural network is how the weights are updated using the loss function. Here, the general SGD (stochastic gradient descent) is described. SGD uses the derivative of the loss function to calculate the weights used in the next training step. The parameters used at this time are:
--$\eta$: the learning rate (learning coefficient)
--$h$: the term that attenuates the learning rate as training proceeds
--$\alpha$: the momentum coefficient
--$\Delta w$: the previous weight update
w^{t+1} = w^{t} - \eta\frac{1}{\sqrt{h}}\frac{\partial E(w^{t})}{\partial w^{t}} + \alpha\Delta w
The weights are updated according to the above formula. As overview-level knowledge about the gradient method, the following points can be mentioned:

--If the learning rate is too large, the weight values differ significantly between $t$ and $t+1$, making it difficult for the training to converge.
--If the learning rate is too small, each weight update is small and training takes a long time.
--By introducing an attenuation (decay) factor, the learning rate itself is also updated as training proceeds.
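A minimal sketch of a single weight update following the formula above (plain NumPy with a toy gradient; all names here are illustrative, and the attenuation term is omitted for brevity):

import numpy as np

def sgd_step(w, grad, delta_w_prev, eta=0.01, alpha=0.9):
    # w^{t+1} = w^{t} - eta * dE/dw + alpha * delta_w (previous update)
    delta_w = -eta * grad + alpha * delta_w_prev
    return w + delta_w, delta_w

w = np.array([0.5, -0.3])
grad = np.array([0.2, -0.1])            # stand-in for dE/dw at w^t
delta_w_prev = np.zeros_like(w)
w, delta_w_prev = sgd_step(w, grad, delta_w_prev)
print(w)                                # weights after one update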
Generally, the training of a neural network is explained with "mini-batch learning" in mind. Here we explain when the parameters are updated (= the weights are updated = the model is updated) using the loss function.
--Online learning
    --A method that updates the model every time, based on the loss function calculated from each individual input.
    --For example, if you have 1000 images, you experience 1000 parameter updates.
--Batch learning
    --A method that updates the model in batch units (= all data at once).
    --For example, if you have 1000 images, you experience one parameter update. The loss function used at this time is the average of the loss functions of each of the 1000 images.
L=\frac{1}{N}\sum_{i=1}^{N} l_i
--Mini-batch learning
    --A method that divides all the data into mini-batches and updates the model after processing each mini-batch.
    --For each mini-batch (the number of data it contains is the batch size), the loss function is averaged and the model is updated. Training on the next mini-batch then starts from the updated model.
    --For example, suppose you have 1000 images and divide them into batches of size 100. There are then 10 subsets, so you experience 10 parameter updates.
As mentioned above, mini-batch learning is the most widely used method. In the previous example, processing all 10 subsets once counts as one epoch.
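As a rough sketch of this bookkeeping only (the numbers follow the example above; this is not the actual training code):

n_data, batch_size = 1000, 100
n_batches = n_data // batch_size          # 10 subsets -> 10 parameter updates per epoch
n_updates = 0
for epoch in range(5):                    # 5 epochs in total
    for b in range(n_batches):
        batch_indices = range(b * batch_size, (b + 1) * batch_size)
        # the loss is averaged over these 100 events and the model is updated once
        n_updates += 1
print(n_updates)                          # 50 updates after 5 epochs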
In high-energy physics, BDTs are often used. They have many merits, such as being robust with small statistics and, being decision trees, basically avoiding becoming a black box. Since DNNs have already been used for particle identification, we decided to use a neural network to improve the S/N ratio, in the same spirit in which ProfileLL is used. The model adopted here is the Parametrised Neural Network (PNN), proposed in 2016, which is built using common Python libraries. The library used this time is the following.
uproot
A Python module for reading data in ROOT format (CERN's library). It simply turns a ROOT Ntuple into a Python DataFrame without changing the structure: rows correspond to events, and columns correspond to the individual variables.
import uproot
f = uproot.open("data.root")
print(f.keys())
# ['data;1', 'background;1', ....]
f['data'].pandas.df()
# Btag EventFlavour EventNumber FourBodyMass Jet0CorrTLV ... mass mass_scaled sT sTNoMet signal weight weight_scaled
#entry ...
#9   2.0 8.0 560044.0 1666.098145 542.301636 ... 900 0.352941 1566.298340 1404.298218 1 0.003898 0.028524
#10  1.0 5.0 560480.0 1606.993896 241.007111 ... 900 0.352941 1841.925049 1434.105713 1 0.004255 0.031135
#11  2.0 0.0 561592.0 1857.901245 721.780457 ... 900 0.352941 2444.058105 1910.263306 1 0.002577 0.018855
#15  2.0 5.0 561088.0 1348.327515 174.501556 ... 900 0.352941 1328.051147 1029.908447 1 0.003360 0.024585
f['data'].pandas.df('EventNumber')
# EventNumber
#entry
#0 2.148751e+08
#1 2.143515e+08
#2 6.018242e+07
#3 2.868989e+07
...
The above is the DataFrame immediately after reading; from here we create a DataFrame that picks up only the required values (the input variables to be used). The DataFrame slicing used in the next step is briefly described below.
The original DataFrame read by uproot has mass_scaled as its last column, so we slice it with X[:, :-1]. This slicing means "all rows, and all columns from the beginning up to, but not including, the last one". With the above in mind, we move on to the core part from the next section.
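A minimal illustration of this slicing on a small NumPy array (the array here is a stand-in, not the actual DataFrame):

import numpy as np

X = np.array([[1, 2, 3],
              [4, 5, 6]])
print(X[:, :-1])   # all rows, every column except the last -> [[1 2] [4 5]]
print(X[:, -1])    # the last column only -> [3 6]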
from sklearn.utils import shuffle
--A method to shuffle the data randomly before splitting it into training/test sets.
--If you do nothing, the data is split in order from the beginning.

It is also necessary to align the scale (= number of digits) of the data being handled. The method used here is scikit-learn, and this time we use RobustScaler, which is resistant to outliers. If there are outliers in the first place, the mean/variance of the features is strongly affected by them and standardization does not work well. Think of it as rewriting the data into information that is easy for the machine to handle.
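A rough sketch of these two steps with scikit-learn (the toy arrays below are stand-ins for the feature matrix and labels, not the actual data):

import numpy as np
from sklearn.utils import shuffle
from sklearn.preprocessing import RobustScaler

X = np.array([[1.0, 1000.0], [2.0, 2000.0], [3.0, 1e9]])  # one obvious outlier
y = np.array([0, 1, 0])

X, y = shuffle(X, y, random_state=0)   # randomize the event order before splitting

scaler = RobustScaler()                # based on median and IQR, so robust to the outlier
X_scaled = scaler.fit_transform(X)
print(X_scaled)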
--keras.layers: defines the properties of the layers
    --Input
    --Dense
        --A fully connected neural network layer: every perceptron (node) is connected to every perceptron in the next layer.
        --The NN layer commonly drawn in schematic pictures.
The activation function of the hidden layers is relu, and the activation function of the last layer is sigmoid.

# imports needed for the snippet below
from keras.layers import Input, Dense
from keras.models import Model

x = Input(shape=(n_input_vars,))
d = x
for n in self.layer_size:
    d = Dense(n, activation=self.activation)(d)
y = Dense(1, activation="sigmoid")(d)
Model(x, y)
The gradient method used this time is the very orthodox SGD (Stochastic Gradient Descent). Each weight is updated using the loss function $E(w)$ according to the following equation.
w^{t+1} ← w^{t} - \eta \frac{\partial E(w^{t})}{\partial w^{t}} + \alpha \Delta w^{t}
Here, $\eta$ represents the learning rate (learning coefficient), and $\alpha$ represents the momentum.
sgd = SGD(lr=self.learning_rate, momentum=self.momentum, nesterov=self.nesterov, decay=self.learning_rate_decay)
compile
Using the knowledge described so far (plus some further background knowledge), the following steps train a neural network in Keras. First, you need to "compile" the model:
model.compile(...)
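As a sketch of what this compile call might look like with the SGD optimizer defined above and the cross-entropy loss discussed earlier (the exact arguments of the original code are not shown, so this is an assumption; `model` refers to the Model built above and `sgd` to the optimizer defined earlier):

model.compile(optimizer=sgd, loss="binary_crossentropy", metrics=["accuracy"])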
fit
Then, after compiling, fit = do the actual training.
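A minimal sketch of the fit call (the variable names X_train and y_train, as well as the batch size and epoch numbers, are placeholders, not the values used in the original training):

model.fit(X_train, y_train, batch_size=100, epochs=10, validation_split=0.2)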
A characteristic of the PNN is that it takes theoretical parameters as input, separately from the usual kinematic input variables. The theoretical parameters are well defined in the simulation of signal events, but what about the theoretical parameters of background events? For example, if the mass parameter is used as an input, a random value is assigned to each background event during training.
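A rough sketch of this idea (the list of signal mass points and the array names are assumptions for illustration, not taken from the original code):

import numpy as np

signal_masses = [500, 700, 900]   # hypothetical set of simulated signal mass points

# for background events, draw the "mass" input at random from the signal mass points
n_background = 4
background_mass = np.random.choice(signal_masses, size=n_background)
print(background_mass)            # e.g. [700 500 900 900]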
I referred extensively to the following sites. Thank you very much.