This article is the entry for the last day of the Wacul Advent Calendar 2016.
As preparation, I previously implemented a restricted Boltzmann machine. This time, we will "stack" restricted Boltzmann machines to build a deep Boltzmann machine.
[With simple explanation] Scratch implementation of deep Boltzmann machine with Python ① http://qiita.com/yutaitatsu/items/a9478841357b10789514
A deep Boltzmann machine is a kind of Boltzmann machine with hidden layers, of the sort explained last time; as a whole it represents a single generative model. The visible layer sits at the bottom, with hidden layers stacked on top of it. Since a deep Boltzmann machine is itself just a general Boltzmann machine with hidden layers, it can in principle be trained directly by the gradient method. However, the deeper the network, the more parameters there are to estimate and the greater the risk of getting stuck in a local optimum.
This is where the biggest trick of the deep Boltzmann machine training introduced here comes in: **starting from the bottom, each pair of adjacent layers is treated as a restricted Boltzmann machine**. This works because the nodes within each layer are conditionally independent (there are no links between nodes in the same layer), so every pair of adjacent layers has exactly the bipartite structure of an RBM. Each pair, taken from the bottom up, is pre-trained as an RBM in order to obtain good initial values for the parameters. Specifically, the steps are as follows.
[Pre-training steps]
① First, train the restricted Boltzmann machine at the bottom, the one that includes the visible layer.
② Using the RBM trained in ①, sample values for its hidden layer from the conditional distribution of the hidden layer given the visible layer.
③ Treat the samples obtained in ② as the visible-layer input of the next restricted Boltzmann machine (the second from the bottom) and train it.
(Repeat this for the remaining pairs of layers. For reference, the conditional distribution used in step ② is recalled just below.)
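In a restricted Boltzmann machine this conditional distribution factorizes over the hidden nodes and takes the familiar sigmoid form; writing $w_{ij}$ for the link between visible node $i$ and hidden node $j$, and $c_j$ for the bias of hidden node $j$, each hidden node is turned on with probability

```math
p(h_j = 1 \mid v) = \sigma\Bigl(c_j + \sum_i w_{ij} v_i\Bigr), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}
```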
Once good initial values of the parameters have been obtained in this way, the model is re-trained with the orthodox gradient method. However, when the network is deep, the computational cost of the required expected values explodes, so an approximation method is needed to obtain a solution in a realistic amount of time.
The approximation is needed when computing the gradient, to evaluate expected values under the current parameters; this part is the same as for the restricted Boltzmann machine last time. There are two main families of methods for approximating the expected value of a distribution: one is **sampling methods**, such as the Gibbs sampling used for the restricted Boltzmann machine last time, and the other is **variational methods**, such as the mean-field approximation used this time. Here we will use both.
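Concretely, what has to be approximated is the gradient of the log-likelihood with respect to each link parameter, which (writing $h^{(0)}$ for the visible layer and ignoring the biases, as the code below does) is the difference of two expectations:

```math
\frac{\partial}{\partial w^{(r)}_{ij}} \log p(v)
  = \mathbb{E}_{\text{data}}\bigl[h^{(r)}_i h^{(r+1)}_j\bigr]
  - \mathbb{E}_{\text{model}}\bigl[h^{(r)}_i h^{(r+1)}_j\bigr]
```

The first (data-dependent) expectation is approximated below with the mean-field method, and the second (model-dependent) one with sampling.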
The difference between the two is that sampling methods actually generate a large number of samples from the target distribution itself and approximate the expected value by averaging them, whereas variational methods do not compute the expected value directly: instead, among a family of distributions that can be handled analytically (easily), they substitute the one closest to the target distribution. Note that the former is guaranteed to converge to the true expected value as the number of samples goes to infinity, while the latter comes with no such guarantee.
Concretely, the sampling approach uses Markov chain Monte Carlo (MCMC), while the variational approach takes a family of tractable distributions (a mixture of Gaussians, for example), estimates the parameters that minimize the KL divergence to the target distribution, and uses the resulting distribution in its place.
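In the implementation below, the tractable family is the usual fully factorized (mean-field) one, in which every hidden node $h^{(r)}_j$ gets an independent mean parameter $\mu^{(r)}_j$. Minimizing the KL divergence between this factorized distribution and the posterior over the hidden layers given the data leads to the fixed-point update that the train method implements for each hidden node (with $\sigma(\cdot)$ the sigmoid function and biases again ignored):

```math
\mu^{(r)}_j = \sigma\Bigl(\sum_i w^{(r-1)}_{ij}\,\mu^{(r-1)}_i + \sum_k w^{(r)}_{jk}\,\mu^{(r+1)}_k\Bigr)
```

Here $\mu^{(0)}$ is clamped to the data vector, and the second sum is absent for the top layer.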
Besides being used directly as a generative model, a deep Boltzmann machine can also be used as an "autoencoder". What this means is that each hidden layer can be regarded as a **dimensionality reduction of the visible layer**: the values of the hidden layer can be seen as lower-dimensional features that preserve the essence of the information carried by the nodes of the visible layer. Deep Boltzmann machines can therefore be used as a preprocessing step before applying deterministic neural networks and other orthodox machine-learning techniques (a small sketch of this usage follows the execution example at the end of this article).
Implementation of deep Boltzmann machine
```python
import numpy as np
from typing import List
from itertools import chain
from pprint import pprint

# Type aliases
Layer = List[int]
Input = List[int]


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def get_array_from_structure(base_array, structure):
    # Split a flat array into consecutive chunks whose sizes are given by `structure`
    result = []
    current_index = 0
    for index in range(len(structure)):
        accumlator = []
        if index == 0:
            accumlator = base_array[0:structure[current_index]]
        else:
            accumlator = base_array[current_index: current_index + structure[index]]
        current_index += structure[index]
        result.append(accumlator)
    return result
```
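The class below relies on the RBM class from part ① (linked above), and only uses its Vn, Hn and link attributes and its train / getHiddenValue methods. The real implementation is in the previous article; the following is just a minimal stand-in sketch of that interface (with a rough CD-1 style update on the links only) so that this file runs on its own.

```python
# Minimal stand-in for the RBM class from part ① (see the link above).
# Only the attributes and methods used by DeepBolzmannMachine are sketched here.
class RBM:
    learningRate = 0.05

    def __init__(self, visible_num: int, hidden_num: int):
        self.Vn = visible_num    # number of visible nodes
        self.Hn = hidden_num     # number of hidden nodes
        self.link = np.random.normal(0, 0.01, (visible_num, hidden_num))  # link weights

    def getHiddenValue(self, visible_value):
        # Sample the hidden layer from p(h = 1 | v) = sigmoid(v W)
        prob = sigmoid(np.dot(np.array(visible_value, dtype=float), self.link))
        return (np.random.rand(self.Hn) < prob).astype(float)

    def train(self, data_for_learning, iteration_times: int):
        # Rough CD-1 update: data term minus one-step reconstruction term
        data = np.array(data_for_learning, dtype=float)
        for _ in range(iteration_times):
            h0 = sigmoid(np.dot(data, self.link))
            v1 = sigmoid(np.dot(h0, self.link.T))
            h1 = sigmoid(np.dot(v1, self.link))
            positive = np.dot(data.T, h0) / len(data)
            negative = np.dot(v1.T, h1) / len(data)
            self.link += RBM.learningRate * (positive - negative)
```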
```python
class DeepBolzmannMachine:
    learningRate = 0.05
    sample_num = 100

    def __init__(self, layer_structure: Layer):
        # Build a list of RBMs by pairing adjacent layers, starting from the visible layer
        ## Layer structure
        self.layer_structure = layer_structure
        self.layer_num = len(layer_structure)
        ## Make the pairs of adjacent layers
        self.pairs = []
        for i in range(len(layer_structure) - 1):
            self.pairs.append((layer_structure[i], layer_structure[i + 1]))
        ## Generate an RBM for each pair
        self.rbms = []
        for pair in self.pairs:
            rbm = RBM(pair[0], pair[1])
            self.rbms.append(rbm)
        ## Number of link parameters per pair (used to split the flattened gradient)
        self.link_structure = list(map(lambda x: x[0] * x[1], self.pairs))
    def preTrain(self, initaial_data_for_learning: Input, sample_num: int):
        current_learning_data = []
        ## Compute initial parameter values for each RBM, from the bottom pair upward
        for rbm_index in range(len(self.rbms)):
            # Bottom pair (visible layer vs. first hidden layer)
            if rbm_index == 0:
                self.rbms[rbm_index].train(initaial_data_for_learning, 100)
                for i in range(sample_num):
                    hidden_value = self.rbms[rbm_index].getHiddenValue(initaial_data_for_learning[i])
                    current_learning_data.append(hidden_value)
            # Upper pairs (hidden layer vs. hidden layer)
            else:
                # Train with the samples produced by the pair below
                self.rbms[rbm_index].train(current_learning_data, 100)
                # Create the data for the next RBM
                data_for_learning = []
                for i in range(sample_num):
                    hidden_value = self.rbms[rbm_index].getHiddenValue(current_learning_data[i])
                    data_for_learning.append(hidden_value)
                current_learning_data = data_for_learning
    # Starting from the initial values obtained by pre-training, update the parameters by gradient ascent.
    # The first term of the derivative of the log-likelihood w.r.t. the link parameters is estimated by
    # mean-field approximation, the second term by sampling (a Contrastive Divergence style update
    # with persistent samples).
    def train(self, data_for_learning: Input, iteration_times: int):
        samples = []
        for iteration in range(iteration_times):
            # Estimate the expected value of every hidden node for each input sample by mean-field
            # approximation [note: only the links are updated, not the biases]
            ## The intermediate expected values (mean_field) are kept as a 3-d array:
            ## (s-th sample, r-th layer, expected value of the j-th node)
            sample_length = len(data_for_learning)
            max_node_num = max(self.layer_structure)
            mean_field = np.zeros((sample_length, self.layer_num, max_node_num))
            for s in range(sample_length):
                for r in range(self.layer_num):
                    if r == 0:
                        # The visible layer is clamped to the data
                        pass
                    elif r == 1:
                        # First hidden layer: messages from the data below and from layer 2 above
                        for j in range(self.rbms[0].Hn):
                            first_paragraph = sum(map(lambda i: self.rbms[0].link[i][j] * data_for_learning[s][i], range(self.rbms[0].Vn)))
                            second_paragraph = sum(map(lambda k: self.rbms[1].link[j][k] * mean_field[s][2][k], range(self.rbms[1].Hn)))
                            mean_field[s][1][j] = sigmoid(first_paragraph + second_paragraph)
                    elif r == self.layer_num - 1:
                        # Top layer: message from the layer below only
                        for j in range(self.rbms[r - 1].Hn):
                            mean_field[s][r][j] = sigmoid(sum(map(lambda k: self.rbms[r - 1].link[k][j] * mean_field[s][r - 1][k], range(self.rbms[r - 2].Hn))))
                    else:
                        # Intermediate layers: messages from the layers below and above
                        for j in range(self.rbms[r - 1].Hn):
                            mean_field[s][r][j] = sigmoid(sum(map(lambda k: self.rbms[r - 1].link[k][j] * mean_field[s][r - 1][k], range(self.rbms[r - 2].Hn)))
                                                          + sum(map(lambda l: self.rbms[r].link[j][l] * mean_field[s][r + 1][l], range(self.rbms[r].Hn))))
            # Compute the first (data-dependent) term of the gradient of each link
            gradient_1 = []
            for r in range(len(self.rbms)):
                gradient_1_element = []
                if r == 0:
                    for i in range(self.rbms[0].Vn):
                        for j in range(self.rbms[0].Hn):
                            gradient_1_element.append(sum(map(lambda s: data_for_learning[s][i] * mean_field[s][1][j], range(sample_length))) / sample_length)
                else:
                    for i in range(self.rbms[r].Vn):
                        for j in range(self.rbms[r].Hn):
                            gradient_1_element.append(sum(map(lambda s: mean_field[s][r][i] * mean_field[s][r + 1][j], range(sample_length))) / sample_length)
                gradient_1.append(gradient_1_element)
            # Compute the second (model-dependent) term of the gradient of each link
            ## Draw the samples
            new_samples = []
            for s in range(DeepBolzmannMachine.sample_num):
                sample = []
                ## Initialize the sample container (placeholders are overwritten below)
                for layer in self.layer_structure:
                    accumlator = []
                    for i in range(layer):
                        accumlator.append(0)
                    sample.append(accumlator)
                ## Set the initial values for sampling (from the second gradient-ascent iteration
                ## onward, the samples from the previous iteration are reused)
                initial_randoms = []
                if iteration == 0:
                    for layer in self.layer_structure:
                        accumlator = []
                        for i in range(layer):
                            accumlator.append(np.random.rand(1)[0])
                        initial_randoms.append(accumlator)
                else:
                    initial_randoms = samples[s]
                ## Estimate one sample (the value of each node)
                for layer_index in range(len(self.layer_structure)):
                    for node_index in range(self.layer_structure[layer_index]):
                        #### Compute the sigmoid belief from the messages of all nodes linked to this node
                        if layer_index == 0:
                            sigmoid_belief = sigmoid(sum(map(lambda j: initial_randoms[1][j] * self.rbms[0].link[node_index][j], range(self.layer_structure[1]))))
                        elif layer_index == self.layer_num - 1:
                            sigmoid_belief = sigmoid(sum(map(lambda j: initial_randoms[self.layer_num - 2][j] * self.rbms[self.layer_num - 2].link[j][node_index], range(self.layer_structure[self.layer_num - 2]))))
                        else:
                            belief_downstair = sum(map(lambda j: initial_randoms[layer_index - 1][j] * self.rbms[layer_index - 1].link[j][node_index], range(self.layer_structure[layer_index - 1])))
                            belief_upstair = sum(map(lambda j: initial_randoms[layer_index + 1][j] * self.rbms[layer_index].link[node_index][j], range(self.layer_structure[layer_index + 1])))
                            sigmoid_belief = sigmoid(belief_upstair + belief_downstair)
                        #### Generate a uniform random number
                        r = np.random.rand(1)[0]
                        #### Set the node to 1 if the sigmoid belief exceeds the random number, otherwise to 0
                        if sigmoid_belief > r:
                            sample[layer_index][node_index] = 1
                        else:
                            sample[layer_index][node_index] = 0
                # Add the sample
                new_samples.append(sample)
            # Discard the samples from the previous gradient update and replace them with the new ones
            samples = new_samples
            # Approximate the second gradient term from the samples
            gradient_2 = []
            for r in range(len(self.rbms)):
                gradient_2_element = []
                if r == 0:
                    for i in range(self.rbms[0].Vn):
                        for j in range(self.rbms[0].Hn):
                            gradient_2_element.append(sum(map(lambda m: samples[m][0][i] * samples[m][1][j], range(DeepBolzmannMachine.sample_num))) / DeepBolzmannMachine.sample_num)
                else:
                    for i in range(self.rbms[r].Vn):
                        for j in range(self.rbms[r].Hn):
                            gradient_2_element.append(sum(map(lambda m: samples[m][r][i] * samples[m][r + 1][j], range(DeepBolzmannMachine.sample_num))) / DeepBolzmannMachine.sample_num)
                gradient_2.append(gradient_2_element)
            # Assemble the gradient of the log-likelihood: data term minus model term
            gradient_1_flatten = np.array(list(chain.from_iterable(gradient_1)))
            gradient_2_flatten = np.array(list(chain.from_iterable(gradient_2)))
            gradient_flatten = gradient_1_flatten - gradient_2_flatten
            gradient = get_array_from_structure(gradient_flatten, self.link_structure)
            # Update the parameters with the gradient (gradient ascent)
            for rbm_index in range(len(self.rbms)):
                self.rbms[rbm_index].link += DeepBolzmannMachine.learningRate * gradient[rbm_index].reshape((self.rbms[rbm_index].Vn, self.rbms[rbm_index].Hn))
```
Execution example
```python
# A 4-layer network with 5, 4, 3, and 2 nodes, in order from the visible layer
layer_structure = [5, 4, 3, 2]

# Number of samples drawn when approximating the expected values
sample_num = 100

# Instantiate the deep Boltzmann machine
dbm = DeepBolzmannMachine(layer_structure)

# Create some dummy training data
data_for_learning = np.random.rand(100, layer_structure[0])

# Pre-training
dbm.preTrain(data_for_learning, sample_num)

# Main training by gradient ascent
dbm.train(data_for_learning, sample_num)
```
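Finally, as mentioned earlier, the hidden layers of the trained machine can be reused as reduced-dimension features. The helper below is only an illustrative sketch (the name extract_features is mine, not part of the code above): it simply propagates one visible-layer vector upward through the stack of RBMs with getHiddenValue and collects each hidden layer's value.

```python
def extract_features(dbm: DeepBolzmannMachine, visible_value):
    # Propagate one visible-layer input upward through the stacked RBMs and return
    # the value of every hidden layer as progressively lower-dimensional features.
    features = []
    current_value = visible_value
    for rbm in dbm.rbms:
        current_value = rbm.getHiddenValue(current_value)
        features.append(current_value)
    return features


# Example: features of the first training vector (5 -> 4 -> 3 -> 2 dimensions)
pprint(extract_features(dbm, data_for_learning[0]))
```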
And that is how I implemented it.
Stochastic deep learning does not seem to be as widespread as deep learning with deterministic neural networks, but personally I think the Bayesian way of thinking, **understanding events together with their uncertainty**, has great merit, so I hope this approach becomes the de facto standard.
The end of the year is almost here. Have a good year-end, everyone...