In step 6, the gradient descent method (a method of updating the weights using the derivative of the loss function) is generally adopted, so the loss function needs to be a differentiable function. In the following, the activation function, the loss function, and the gradient descent method are briefly described.
The values computed in the previous layer are first combined linearly according to the weights and biases. They are then passed as arguments to the activation function, and the output of the activation function is passed to the next layer, and so on; this repeated chain is what the network computes during learning. Therefore, the important point about an activation function is not the exact form of the formula ("why an exponential, why this particular fraction ..." is not a meaningful argument), but what range of values it outputs. The two types of activation functions used this time are summarized below.
For x greater than or equal to 0, the output is directly proportional to x. With the sigmoid function, the gradient vanishes as we move away from the origin (the derivative approaches 0), so learning stagnates once a unit takes a large value. It is empirically known that the ramp function (ReLU) avoids this vanishing gradient problem.
f(x) = \begin{cases} x & (x > 0) \\ 0 & (x \le 0) \end{cases}
The output value of the sigmoid function lies between 0 and 1.
f(x) = \frac{1}{1+e^{-x}}
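As a minimal sketch of these two activation functions (using NumPy; the function names here are chosen only for illustration):

import numpy as np

def relu(x):
    # ramp function: proportional to x for x > 0, zero otherwise
    return np.maximum(0.0, x)

def sigmoid(x):
    # squashes any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

print(relu(np.array([-2.0, 0.0, 3.0])))     # [0. 0. 3.]
print(sigmoid(np.array([-2.0, 0.0, 3.0])))  # [0.119..., 0.5, 0.952...]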
The $n$-dimensional output of the neural network is evaluated with a loss function. Conceptually, the smaller the difference from the correct $n$-dimensional values, the smaller the value of the loss function. Therefore, a good neural network gives a small value of the loss function.
E(w) = -\sum_{n=1}^{N} \left( d_n\log y_n + (1-d_n)\log(1-y_n) \right)
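As a small illustration of this cross-entropy (a sketch only; the variable names are arbitrary), with $d_n$ the correct labels and $y_n$ the network outputs:

import numpy as np

def cross_entropy(d, y, eps=1e-12):
    # E(w) = -sum( d*log(y) + (1-d)*log(1-y) ); eps avoids log(0)
    y = np.clip(y, eps, 1.0 - eps)
    return -np.sum(d * np.log(y) + (1.0 - d) * np.log(1.0 - y))

d = np.array([1.0, 0.0, 1.0])
y = np.array([0.9, 0.2, 0.6])
print(cross_entropy(d, y))  # small when y is close to d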
The key part of a neural network is how the weights are updated using the loss function. Here, the general SGD (stochastic gradient descent) is described. SGD uses the derivative of the loss function to calculate the weights used in the next training step. The parameters used at this time are:
--$\eta$: the learning rate (learning coefficient)
--$h$: the term that attenuates the learning rate as training proceeds
--$\alpha$: the momentum coefficient
--$\Delta w$: the previous weight update
w^{t+1} = w^{t} - \eta\frac{1}{\sqrt{h}}\frac{\partial E(w^{t})}{\partial w^{t}} + \alpha\Delta w
The weights are updated according to the above formula. As overview-level knowledge about the gradient method, the following points can be mentioned:

--If the learning rate is too large, the weight values differ significantly between $t$ and $t+1$, making it difficult for the training to converge.
--If the learning rate is too small, each weight update is small and training takes a long time.
--By introducing an attenuation (decay) factor, the learning rate itself is also updated as training proceeds.
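A minimal sketch of a single weight update following the formula above (plain NumPy with a toy gradient; all names here are illustrative, and the attenuation term is omitted for brevity):

import numpy as np

def sgd_step(w, grad, delta_w_prev, eta=0.01, alpha=0.9):
    # w^{t+1} = w^{t} - eta * dE/dw + alpha * delta_w (previous update)
    delta_w = -eta * grad + alpha * delta_w_prev
    return w + delta_w, delta_w

w = np.array([0.5, -0.3])
grad = np.array([0.2, -0.1])            # stand-in for dE/dw at w^t
delta_w_prev = np.zeros_like(w)
w, delta_w_prev = sgd_step(w, grad, delta_w_prev)
print(w)                                # weights after one update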
Generally, the training of a neural network is explained with "mini-batch learning" in mind. Here we explain when the parameters are updated (= the weights are updated = the model is updated) using the loss function.
--Online learning
    --A method that updates the model every time, based on the loss function calculated from each individual input.
    --For example, if you have 1000 images, you experience 1000 parameter updates.
--Batch learning
    --A method that updates the model in batch units (= all data at once).
    --For example, if you have 1000 images, you experience one parameter update. The loss function used at this time is the average of the loss functions of each of the 1000 images.
L=\frac{1}{N}\sum_{i=1}^{N} l_i
--Mini-batch learning
    --A method that divides all the data into mini-batches and updates the model after processing each mini-batch.
    --For each mini-batch (the number of data it contains is the batch size), the loss function is averaged and the model is updated. Training on the next mini-batch then starts from the updated model.
    --For example, suppose you have 1000 images and divide them into batches of size 100. There are then 10 subsets, so you experience 10 parameter updates.
As mentioned above, mini-batch learning is the most widely used method. In the previous example, processing all 10 subsets once counts as one epoch.
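As a rough sketch of this bookkeeping only (the numbers follow the example above; this is not the actual training code):

n_data, batch_size = 1000, 100
n_batches = n_data // batch_size          # 10 subsets -> 10 parameter updates per epoch
n_updates = 0
for epoch in range(5):                    # 5 epochs in total
    for b in range(n_batches):
        batch_indices = range(b * batch_size, (b + 1) * batch_size)
        # the loss is averaged over these 100 events and the model is updated once
        n_updates += 1
print(n_updates)                          # 50 updates after 5 epochs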
In high-energy physics, BDTs are often used. They have many merits, such as being robust with small statistics and, being decision trees, basically avoiding becoming a black box. Since DNNs have already been used for particle identification, we decided to use a neural network to improve the S/N ratio, in the same spirit in which ProfileLL is used. The model adopted here is the Parametrised Neural Network (PNN), proposed in 2016, which is built using common Python libraries. The library used this time is the following.
uproot
A Python module for reading data in ROOT format (CERN's library). It simply turns a ROOT Ntuple into a Python DataFrame without changing the structure: rows correspond to events, and columns correspond to the individual variables.
import uproot
f = uproot.open("data.root")
print(f.keys())
# ['data;1', 'background;1', ....]
f['data'].pandas.df()
# Btag EventFlavour EventNumber FourBodyMass Jet0CorrTLV ... mass mass_scaled sT sTNoMet signal weight weight_scaled
#entry ...
#9   2.0 8.0 560044.0 1666.098145 542.301636 ... 900 0.352941 1566.298340 1404.298218 1 0.003898 0.028524
#10  1.0 5.0 560480.0 1606.993896 241.007111 ... 900 0.352941 1841.925049 1434.105713 1 0.004255 0.031135
#11  2.0 0.0 561592.0 1857.901245 721.780457 ... 900 0.352941 2444.058105 1910.263306 1 0.002577 0.018855
#15  2.0 5.0 561088.0 1348.327515 174.501556 ... 900 0.352941 1328.051147 1029.908447 1 0.003360 0.024585
f['data'].pandas.df('EventNumber')
# EventNumber
#entry
#0 2.148751e+08
#1 2.143515e+08
#2 6.018242e+07
#3 2.868989e+07
...
The above is the DataFrame immediately after reading; from here we create a DataFrame that picks up only the required values (the input variables to be used). The DataFrame slicing used in the next step is briefly described below.
The original DataFrame read by uproot has mass_scaled as its last column, so we slice it with X[:, :-1]. This slicing means "all rows, and all columns from the beginning up to, but not including, the last one". With the above in mind, we move on to the core part from the next section.
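A minimal illustration of this slicing on a small NumPy array (the array here is a stand-in, not the actual DataFrame):

import numpy as np

X = np.array([[1, 2, 3],
              [4, 5, 6]])
print(X[:, :-1])   # all rows, every column except the last -> [[1 2] [4 5]]
print(X[:, -1])    # the last column only -> [3 6]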
from sklearn.utils import shuffle
--A method to shuffle the data randomly before splitting it into training/test sets.
--If you do nothing, the data is split in order from the beginning.

It is also necessary to align the scale (= number of digits) of the data being handled. The method used here is scikit-learn, and this time we use RobustScaler, which is resistant to outliers. If there are outliers in the first place, the mean/variance of the features is strongly affected by them and standardization does not work well. Think of it as rewriting the data into information that is easy for the machine to handle.
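A rough sketch of these two steps with scikit-learn (the toy arrays below are stand-ins for the feature matrix and labels, not the actual data):

import numpy as np
from sklearn.utils import shuffle
from sklearn.preprocessing import RobustScaler

X = np.array([[1.0, 1000.0], [2.0, 2000.0], [3.0, 1e9]])  # one obvious outlier
y = np.array([0, 1, 0])

X, y = shuffle(X, y, random_state=0)   # randomize the event order before splitting

scaler = RobustScaler()                # based on median and IQR, so robust to the outlier
X_scaled = scaler.fit_transform(X)
print(X_scaled)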
--keras.layers: defines the properties of the layers
    --Input
    --Dense
        --A fully connected neural network layer: every perceptron (node) is connected to every perceptron in the next layer.
        --The NN layer commonly drawn in schematic pictures.
The activation function of the hidden layers is relu, and the activation function of the last layer is sigmoid.

# imports needed for the snippet below
from keras.layers import Input, Dense
from keras.models import Model

x = Input(shape=(n_input_vars,))
d = x
for n in self.layer_size:
    d = Dense(n, activation=self.activation)(d)
y = Dense(1, activation="sigmoid")(d)
Model(x, y)
The gradient method used this time is the very orthodox SGD (Stochastic Gradient Descent). Each weight is updated using the loss function $E(w)$ according to the following equation.
w^{t+1} ← w^{t} - \eta \frac{\partial E(w^{t})}{\partial w^{t}} + \alpha \Delta w^{t}
Here, $\eta$ represents the learning rate (learning coefficient), and $\alpha$ represents the momentum.
sgd = SGD(lr=self.learning_rate, momentum=self.momentum, nesterov=self.nesterov, decay=self.learning_rate_decay)
compile
Using the knowledge described so far (plus some further background knowledge), the following steps train a neural network in Keras. First, you need to "compile" the model:
model.compile(...)
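As a sketch of what this compile call might look like with the SGD optimizer defined above and the cross-entropy loss discussed earlier (the exact arguments of the original code are not shown, so this is an assumption; `model` refers to the Model built above and `sgd` to the optimizer defined earlier):

model.compile(optimizer=sgd, loss="binary_crossentropy", metrics=["accuracy"])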
fit
Then, after compiling, fit = do the actual training.
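A minimal sketch of the fit call (the variable names X_train and y_train, as well as the batch size and epoch numbers, are placeholders, not the values used in the original training):

model.fit(X_train, y_train, batch_size=100, epochs=10, validation_split=0.2)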
A characteristic of the PNN is that it takes theoretical parameters as input, separately from the usual kinematic input variables. The theoretical parameters are well defined in the simulation of signal events, but what about the theoretical parameters of background events? For example, if the mass parameter is used as an input, a random value is assigned to each background event during training.
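A rough sketch of this idea (the list of signal mass points and the array names are assumptions for illustration, not taken from the original code):

import numpy as np

signal_masses = [500, 700, 900]   # hypothetical set of simulated signal mass points

# for background events, draw the "mass" input at random from the signal mass points
n_background = 4
background_mass = np.random.choice(signal_masses, size=n_background)
print(background_mass)            # e.g. [700 500 900 900]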
I referred extensively to the following sites. Thank you very much.