Parametric Neural Network

About neural networks

Rough overall flow


  1. Pass the input information (or the activation-function output of the previous node) to the next node
  2. Linearly combine all the received information with the weights $ w $ and the bias $ b $
  3. Feed the linearly combined value into the activation function and pass its output to the next node
  4. Repeat steps 2 and 3 until the final node outputs a value
  5. Calculate the loss function from the final output (in a PNN the last layer has only one node): steps 1-5 are the forward propagation
  6. Then use the loss function to update the weight on each branch: the back propagation
  7. Steps 1-6 complete one training iteration. Repeating this a specified number of times produces a set of weights (= a neural network) that describes the correct answer better and better.

In step 6, the gradient descent method (a method of updating the weights using the derivative of the loss function) is generally adopted, so the loss function must be differentiable. The activation function, the loss function, and the gradient descent method are briefly described below.
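As a rough illustration of this loop, here is a minimal NumPy sketch of one hidden layer trained with plain gradient descent. The toy data, layer sizes, and learning rate are assumptions for illustration only, not the actual analysis code.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                               # 100 events, 4 input variables
d = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)    # toy 0/1 labels

W1, b1 = rng.normal(scale=0.1, size=(4, 8)), np.zeros(8)    # hidden layer weights/bias
W2, b2 = rng.normal(scale=0.1, size=(8, 1)), np.zeros(1)    # output layer weights/bias
eta = 0.1                                                   # learning rate

for step in range(100):
    # forward propagation: linear combination -> activation, layer by layer
    z1 = X @ W1 + b1
    a1 = np.maximum(z1, 0.0)                       # ReLU in the hidden layer
    y = 1.0 / (1.0 + np.exp(-(a1 @ W2 + b2)))      # sigmoid output in (0, 1)

    # loss function: binary cross entropy averaged over events
    E = -np.mean(d * np.log(y) + (1 - d) * np.log(1 - y))

    # back propagation: derivatives of E with respect to each weight
    dz2 = (y - d) / len(X)
    dW2, db2 = a1.T @ dz2, dz2.sum(axis=0)
    dz1 = (dz2 @ W2.T) * (z1 > 0)
    dW1, db1 = X.T @ dz1, dz1.sum(axis=0)

    # gradient descent: move each weight against its gradient
    W2 -= eta * dW2; b2 -= eta * db2
    W1 -= eta * dW1; b1 -= eta * db1

print(E)   # final loss, should have decreased during the loop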

About activation functions

The values computed in the previous layer are first linearly combined according to the weights and the bias. The result is then passed as the argument of the activation function, whose output value goes to the next layer, and so on; this is what the network does during learning. The important point about an activation function is therefore not the exact form of its formula (asking "why an exponential, why this particular fraction..." misses the point), but over what range of values it outputs. The two activation functions used this time are summarized below.

ReLU (ramp function)

For x greater than or equal to 0, it is simply proportional to x. With the sigmoid function the gradient vanishes as you move away from the origin (the derivative approaches 0), so once a unit takes a large value the learning stagnates. It is known empirically that the ramp function avoids this vanishing-gradient problem.

f(x) = \begin{cases} x & (x > 0) \\ 0 & (x \le 0) \end{cases}

sigmoid (sigmoid function)

The output value of the function is between 0 and 1.

f(x) = \frac{1}{1+e^{-x}}
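As a quick check of the output ranges, both functions can be written as NumPy one-liners (an illustration only, not part of the Keras model):

import numpy as np

def relu(x):
    # 0 for x <= 0, x itself for x > 0; the gradient is 1 on the positive side
    return np.maximum(x, 0.0)

def sigmoid(x):
    # squashes any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))     # [0. 0. 3.]
print(sigmoid(x))  # [0.119... 0.5 0.952...]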

Loss function (= error function)

The $ n $-dimensional output of the neural network is evaluated with a loss function. Conceptually, the closer the output is to the correct $ n $-dimensional value, the smaller the value of the loss function. A good neural network is therefore one for which the loss function takes a small value.

E(w) = -\sum_{n=1}^{N} \left( d_n\log y_n + (1-d_n)\log(1-y_n) \right)
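This is the binary cross entropy. A small sketch of its behaviour on toy numbers (the arrays here are made up for illustration):

import numpy as np

def binary_cross_entropy(y, d):
    # y: network outputs in (0, 1), d: correct labels (0 or 1)
    return -np.sum(d * np.log(y) + (1 - d) * np.log(1 - y))

d = np.array([1.0, 0.0, 1.0])
y_good = np.array([0.9, 0.2, 0.8])      # outputs close to the labels
y_bad = np.array([0.1, 0.8, 0.3])       # outputs far from the labels
print(binary_cross_entropy(y_good, d))  # ~0.55 (small)
print(binary_cross_entropy(y_bad, d))   # ~5.12 (large)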

Gradient method

The key part of a neural network is how the weights are updated using the loss function. Here the standard SGD (stochastic gradient descent) is described. SGD uses the derivative of the loss function to compute the weights used in the next training step. It uses the following parameters:

- $ \eta $: learning rate (learning coefficient)
- $ \alpha $: momentum
- $ h $: decay rate (learning rate decay)

w^{t+1} = w^{t} - \eta\frac{1}{\sqrt{h}}\frac{\partial E(w^{t})}{\partial w^{t}} + \alpha\Delta w

The weights are updated according to the above formula. As overview-level points about the gradient method:

- If the learning rate is too large, the weight values differ greatly between $ t $ and $ t+1 $, making it difficult for the training to converge.
- If the learning rate is too small, each weight update is tiny and the training takes a long time.
- By introducing the decay rate, the learning rate itself is also updated as the training proceeds.
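A sketch of this update rule in plain NumPy. How the decay term $ h $ enters differs between implementations; here it simply scales the step by $ 1/\sqrt{h} $, exactly as in the formula, which is an assumption about the notation rather than a statement about any particular library.

import numpy as np

def sgd_update(w, grad, prev_delta, eta=0.01, alpha=0.9, h=1.0):
    # one step: w(t+1) = w(t) - eta/sqrt(h) * dE/dw + alpha * (previous step)
    delta = -eta / np.sqrt(h) * grad + alpha * prev_delta
    return w + delta, delta

w = np.array([0.5, -0.3])                 # current weights w(t)
prev_delta = np.zeros_like(w)             # no previous step yet
grad = np.array([0.2, -0.1])              # dE/dw evaluated at w(t)
w, prev_delta = sgd_update(w, grad, prev_delta)
print(w)                                  # [ 0.498 -0.299]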

Learning method

Generally, the learning method of a neural network is explained with "mini-batch learning" in mind. Here we explain when the parameters are updated (= when the weights, i.e. the model, are updated) using the loss function.

- Online learning
  - A method that updates the model every time, based on the loss function calculated from a single input.
  - For example, with 1000 images you go through 1000 parameter updates.
- Batch learning
  - A method that updates the model in one batch (= all of the data at once).
  - For example, with 1000 images you go through a single parameter update. The loss function used in that update is the average of the loss functions of the 1000 individual images:

L=\frac{1}{N}\sum_{i=1}^{N} l_i

- Mini-batch learning
  - A method that splits all of the data into mini-batches and updates the model once for each mini-batch.
  - For each mini-batch (whose number of events is the batch size), the loss function is averaged over the batch and the model is updated. Training on the next mini-batch then starts from the updated model.
  - For example, suppose you have 1000 images and split them into batches of 100. There are then 10 subsets, so you go through 10 parameter updates.

As mentioned above, mini-batch learning is the one most widely used in practice. In the previous example, processing all 10 subsets once counts as one epoch.
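The bookkeeping of the 1000-image / batch-size-100 example, written out explicitly (no actual training, just counting updates):

import numpy as np

n_samples, batch_size, n_epochs = 1000, 100, 5
indices = np.arange(n_samples)

updates = 0
for epoch in range(n_epochs):
    np.random.shuffle(indices)                        # reshuffle the events every epoch
    for start in range(0, n_samples, batch_size):
        batch = indices[start:start + batch_size]     # one mini-batch (one subset)
        # here: average the per-event losses over `batch`, then update the weights once
        updates += 1

print(updates)   # 10 updates per epoch x 5 epochs = 50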

PNN

In high energy physics, BDTs are often used: they have many merits, such as working well with small statistics and, being decision trees, basically avoiding becoming a black box. Since DNNs have already been used for particle identification, we decided to also use a neural network to improve the S/N ratio, on the grounds that there is nothing to lose in trying it, as with ProfileLL. The model adopted here, proposed in 2016, is the Parametrised Neural Network (PNN), which can be built with common Python libraries. The libraries used this time are described below.

Reading ROOT files (high energy physics pre-processing)

uproot is a Python module for reading data in CERN's ROOT format. Simply converting a ROOT Ntuple into a Python DataFrame does not change the structure: rows correspond to events, and columns correspond to the individual variables.

import uproot
f = uproot.open("data.root")
print(f.keys())
# ['data;1', 'background;1', ...]

f['data'].pandas.df()
#        Btag  EventFlavour  EventNumber  FourBodyMass  Jet0CorrTLV  ...  mass  mass_scaled           sT      sTNoMet  signal    weight  weight_scaled
#entry                                                               ...                                                                              
#9        2.0           8.0     560044.0   1666.098145   542.301636  ...  900     0.352941  1566.298340  1404.298218       1  0.003898       0.028524
#10       1.0           5.0     560480.0   1606.993896   241.007111  ...  900     0.352941  1841.925049  1434.105713       1  0.004255       0.031135
#11       2.0           0.0     561592.0   1857.901245   721.780457  ...  900     0.352941  2444.058105  1910.263306       1  0.002577       0.018855
#15       2.0           5.0     561088.0   1348.327515   174.501556  ...  900     0.352941  1328.051147  1029.908447       1  0.003360       0.024585

f['data'].pandas.df('EventNumber')
#        EventNumber
#entry              
#0      2.148751e+08
#1      2.143515e+08
#2      6.018242e+07
#3      2.868989e+07
...

The above is the DataFrame immediately after reading; from it we create a DataFrame that picks up only the required values (the input variables to be used). The slicing used in the next step is briefly described here. The DataFrame read by uproot has mass_scaled at the end, so it is sliced with X[:, :-1], which means "all rows, and every column except the last one". With that in mind, we move on to the core part next.
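For concreteness, the same slicing on a small dummy DataFrame (the column names and values here are taken loosely from the printout above and are not the real Ntuple):

import pandas as pd

df = pd.DataFrame({
    "FourBodyMass": [1666.1, 1607.0],
    "sT":           [1566.3, 1841.9],
    "mass_scaled":  [0.352941, 0.352941],
})

X = df.values           # DataFrame -> NumPy array: rows = events, columns = variables
X_inputs = X[:, :-1]    # all rows, every column except the last (drops mass_scaled here)
print(X_inputs.shape)   # (2, 2)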

Actual training process

Pre-processing (scale conversion)

The scales (= the number of digits) of the input variables need to be aligned. For this we use scikit-learn, and in this case RobustScaler, which is robust against outliers. If there are outliers to begin with, the mean and variance of a feature are strongly affected by them and plain standardization does not work well. Think of this as re-expressing the data as information that is easy for the machine to handle.
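Minimal RobustScaler usage (the numbers are made up): it centres each column on its median and scales by the interquartile range instead of using the mean and variance.

import numpy as np
from sklearn.preprocessing import RobustScaler

X_train = np.array([[1500.0, 2.0],
                    [1800.0, 1.0],
                    [90000.0, 2.0]])     # one extreme outlier in the first column

scaler = RobustScaler()                  # median / interquartile range, not mean / variance
X_scaled = scaler.fit_transform(X_train) # fit on the training inputs, then transform
print(X_scaled)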

Creating the NN model

- keras.layers: defines the properties of each layer
  - Input
  - Dense: a fully connected neural network layer; every perceptron (node) is connected to every perceptron in the next layer. This is the kind of layer usually drawn in schematic diagrams of an NN.

Actual coding

  1. First of all, define the number of dimensions of the input information with Input
  2. Define a Dense (fully connected) layer for each layer; the activation function is specified at this point
  3. Each hidden layer passes its output straight to the next layer, so its output is 32-dimensional (this time the network is defined with three 32-node hidden layers, [32, 32, 32]). The last layer has a single node and outputs a value in [0, 1].
  4. The activation function of the hidden layers is ReLU, and that of the last layer is sigmoid.
from keras.layers import Input, Dense
from keras.models import Model

x = Input(shape=(n_input_vars,))                     # n_input_vars = number of input variables
d = x
for n in self.layer_size:                            # e.g. [32, 32, 32]
    d = Dense(n, activation=self.activation)(d)      # hidden layers (ReLU)

y = Dense(1, activation="sigmoid")(d)                # single output node in [0, 1]
model = Model(x, y)

Gradient method used

The gradient method used this time is the completely standard SGD (Stochastic Gradient Descent). Each weight is updated using the loss function $ E(w) $ according to the following equation.

w^{t+1} ← w^{t} - \eta \frac{\partial E(w^{t})}{\partial w^{t}} + \alpha \Delta w^{t}

Here, $ \eta $ represents the learning rate (learning coefficient) and $ \alpha $ represents the momentum.

from keras.optimizers import SGD

sgd = SGD(lr=self.learning_rate, momentum=self.momentum, nesterov=self.nesterov, decay=self.learning_rate_decay)

Training with Keras

compile

Using the knowledge described so far (plus a little more background), the steps to train a neural network in Keras are as follows. First, the model needs to be "compiled":

model.compile(...)

fit

Then, after compiling, call fit to do the actual training.
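One plausible way to fill in those two calls is sketched below. The metric, epoch count, batch size, and the X_train_scaled / y_train names are assumptions for illustration; only compile and fit themselves follow the standard Keras API.

model.compile(optimizer=sgd,                  # the SGD instance defined above
              loss="binary_crossentropy",     # matches the loss function E(w) above
              metrics=["accuracy"])

history = model.fit(X_train_scaled, y_train,  # scaled inputs and 0/1 labels (assumed names)
                    epochs=50,                # number of passes over the full data set
                    batch_size=100,           # mini-batch size
                    validation_split=0.2)     # hold out 20% to monitor overfitting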

By the way

A PNN is characterized by taking theoretical parameters as input in addition to the usual kinematic input variables. The theoretical parameters are well defined in the simulation of signal events, but what about the theoretical parameters of background events? If, for example, the mass parameter is used as an input, a randomly chosen value is assigned to each background event during training.
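A sketch of that random assignment. The column name, the list of signal mass points, and the helper name are hypothetical; the real code depends on how the Ntuple is laid out.

import numpy as np
import pandas as pd

signal_masses = [300, 500, 700, 900]           # mass hypotheses available in the signal MC (assumed)

def assign_background_mass(df_bkg, rng=None):
    # give each background event a randomly chosen signal mass hypothesis
    rng = rng or np.random.default_rng(0)
    out = df_bkg.copy()
    out["mass"] = rng.choice(signal_masses, size=len(out))
    return out

bkg = pd.DataFrame({"sT": [1300.0, 2400.0, 1800.0]})
print(assign_background_mass(bkg))             # same events, now with a 'mass' column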

Reference URL

I have greatly referred to the following sites. Thank you very much.
