A summary of Chapter 4 of Deep Learning (supervised by The Japanese Society for Artificial Intelligence), Book – 2015/11/5. http://www.amazon.co.jp/dp/476490487X/ref=pd_lpo_sbs_dp_ss_1?pf_rd_p=187205609&pf_rd_s=lpo-top-stripe&pf_rd_t=201&pf_rd_i=4061529021&pf_rd_m=AN1VRQENFRJN5&pf_rd_r=1ZJ9T7KCKHC8QV0QKN9H
DistBelief The system developed at Google that TensorFlow grew out of. A good implementation of distributed parallel processing.
Conventional MapReduce-style approaches required heavy communication between machines. DistBelief is a solution to that.
There are two types of parallelization:
- Model (task) parallelism: the model's processing is split across machines, like an assembly line.
- Data parallelism: the data fed into the processing flow is split across machines.
DistBelief uses the two together.
With DistBelief, the user only has to specify "how to compute a node" and "what information to send to the next node"; the framework decides where to split the model and where to split the data.
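A rough illustration of that programming model (the class and method names here are hypothetical, not the actual DistBelief API): the user describes only the local computation and the message passed on, and the framework is what decides the partitioning.

```python
# Hypothetical sketch of a DistBelief-style node abstraction, not the real API:
# the user writes only the per-node computation and the outgoing message.
import numpy as np

class Node:
    def __init__(self, weight):
        self.weight = weight          # parameters owned by this node

    def compute(self, x):
        # "How to compute a node": the local forward computation.
        return np.maximum(0.0, self.weight @ x)

    def message_to_next(self, activation):
        # "Information to send to the next node": here, just the activation.
        return activation

# The framework, not the user, decides how these nodes are split across machines.
rng = np.random.default_rng(0)
pipeline = [Node(rng.standard_normal((4, 8))), Node(rng.standard_normal((2, 4)))]
x = rng.standard_normal(8)
for node in pipeline:
    x = node.message_to_next(node.compute(x))
print(x)
```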
Gradient computation and parameter updates are parallelized over the model, while the actual processing of the data is parallelized over the data. Downpour SGD (an asynchronous form of stochastic gradient descent) is used for the gradient computation; even if the model replica handling one group of data fails, the rest keep working.
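A minimal single-process sketch of the Downpour idea, assuming a toy linear-regression model and using threads to stand in for asynchronous model replicas (the real system uses a sharded parameter server across many machines, without a global lock):

```python
# Sketch of Downpour-style asynchronous SGD on a toy linear model.
# Threads stand in for model replicas; a plain dict stands in for the
# parameter server. Each replica works on its own data shard, and if one
# died the others would simply keep updating the shared parameters.
import threading
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0])
X = rng.standard_normal((1000, 2))
y = X @ true_w + 0.1 * rng.standard_normal(1000)

params = {"w": np.zeros(2)}          # "parameter server"
lock = threading.Lock()
lr = 0.05

def replica(shard_X, shard_y, steps=200, batch=32):
    local_rng = np.random.default_rng()              # each replica samples its own mini-batches
    for _ in range(steps):
        idx = local_rng.integers(0, len(shard_X), size=batch)
        w = params["w"].copy()                       # fetch current parameters
        grad = 2 * shard_X[idx].T @ (shard_X[idx] @ w - shard_y[idx]) / batch
        with lock:                                   # push the update (the real system
            params["w"] -= lr * grad                 # does this fully asynchronously)

threads = [threading.Thread(target=replica, args=(X[i::4], y[i::4]))  # 4 data shards
           for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(params["w"])   # should end up close to [2, -3]
```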
Sandblaster L-BFGS is used for batch training (L-BFGS works through the whole dataset, a little at a time). Since this is data parallelism, the results have to be synchronized at the end. Waiting for the slowest machine is wasteful, so the batch is cut into small chunks that are handed to nodes one by one, with each node getting a new chunk as soon as it finishes.
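A rough sketch of that scheduling idea, assuming a shared queue of small chunks that workers pull from as soon as they finish, so a slow machine only ever delays one small chunk rather than the whole batch:

```python
# Sketch of Sandblaster-style work scheduling: the batch is split into many
# small chunks handed out one at a time, so fast workers simply take more
# chunks and a straggler never blocks the whole pass.
import queue
import random
import threading
import time

chunks = queue.Queue()
for chunk_id in range(40):            # one batch split into 40 small chunks
    chunks.put(chunk_id)

results = []
results_lock = threading.Lock()

def worker(name, speed):
    while True:
        try:
            chunk = chunks.get_nowait()
        except queue.Empty:
            return                                    # nothing left to do
        time.sleep(speed * random.random())           # simulate gradient computation
        with results_lock:
            results.append((name, chunk))             # partial result for this chunk

workers = [threading.Thread(target=worker, args=(f"w{i}", 0.01 * (i + 1)))
           for i in range(4)]
for t in workers: t.start()
for t in workers: t.join()

# Only now comes the single synchronization point:
# combining the partial results for the L-BFGS step.
print(len(results), "chunks processed")
```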
In language processing, only a small part of a long vector is non-zero and the rest is zero (a sparse vector). Image processing, by contrast, deals with dense vectors. Image processing also has few branches and repeats the same operation over and over, which is exactly what GPUs are good at. That is why so many people think about offloading it to the GPU.
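As a toy illustration of that contrast (the vocabulary size and image size below are made up):

```python
import numpy as np

# Language data: a bag-of-words vector over a 100,000-word vocabulary
# in which only three words actually occur.
vocab_size = 100_000
bow = np.zeros(vocab_size)
bow[[7, 42, 9981]] = 1.0
print(np.count_nonzero(bow), "non-zeros out of", vocab_size)    # sparse: 3 / 100000

# Image data: every pixel of a small RGB image carries a value.
image = np.random.rand(32, 32, 3)
print(np.count_nonzero(image), "non-zeros out of", image.size)  # dense: all of them
```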
Also, the transfer speed between the GPU, the CPU, and memory can become something of a bottleneck.
InfiniBand An extremely fast interconnect. With bandwidth of 56 Gbps and up, it can ease the transfer-speed problem peculiar to GPUs.
Batch Normalization A technique for dealing with internal covariate shift. Internal covariate shift means that the distribution of a layer's input x changes significantly during training. Because the weights are busy adapting to that shift, the layer's own learning can only proceed afterwards, and this slows training down.
Batch normalization normalizes this shift away, performing something close to whitening (normalization plus decorrelation) at the same time.
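A minimal NumPy sketch of the batch-normalization forward pass for a single mini-batch (training-time statistics only; gamma and beta are the learned per-feature scale and shift from the standard formulation):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then rescale and shift.

    x: (batch_size, num_features) activations of one layer.
    gamma, beta: learned per-feature scale and shift.
    """
    mu = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                    # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta            # restore representational power

x = np.random.randn(64, 10) * 5 + 3        # a badly scaled batch
out = batch_norm_forward(x, gamma=np.ones(10), beta=np.zeros(10))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # roughly 0 and 1
```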
Distillation Ensemble learning, which uses the average of several models' outputs for inference, is accurate. But it takes too long.
With neural networks, shallow means fast and deep means slow. Distillation speeds things up by letting a shallow network learn from what a deep network has already learned.
When the optimal θ is chosen, the gradient of E is 0. At that point the gradient of the error term L and the gradient of the regularization term R of E balance each other out. In other words, if you know how the model makes mistakes, you also know the regularization term. The technique makes clever use of this.
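In symbols (the decomposition below just restates the argument above; the notation E = L + R is mine):

```latex
% Objective: error term plus regularization term
E(\theta) = L(\theta) + R(\theta)

% At the optimum \theta^{*} the gradient of E vanishes ...
\nabla E(\theta^{*}) = \nabla L(\theta^{*}) + \nabla R(\theta^{*}) = 0

% ... so the two gradients agree up to sign:
\nabla L(\theta^{*}) = -\nabla R(\theta^{*})
```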
Dropout A method for controlling overfitting. During training, nodes are ignored (dropped) at a fixed rate, and the set of ignored nodes changes every time the training data changes. No nodes are ignored at test time. The drop rate is typically 0.2 for the input layer and 0.5 for the hidden layers.
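A minimal NumPy sketch of dropout in its "inverted" form, which rescales the surviving activations at training time so that test time really is a no-op (the rates follow the ones above):

```python
import numpy as np

def dropout(x, drop_rate, training):
    """Inverted dropout: drop units during training, do nothing at test time."""
    if not training or drop_rate == 0.0:
        return x                                        # test time: keep everything
    keep = 1.0 - drop_rate
    mask = (np.random.rand(*x.shape) < keep) / keep     # new mask each call, rescaled
    return x * mask

x = np.random.randn(4, 8)
h_input  = dropout(x, drop_rate=0.2, training=True)            # input layer: 0.2
h_hidden = dropout(h_input, drop_rate=0.5, training=True)      # hidden layer: 0.5
y_test   = dropout(h_hidden, drop_rate=0.5, training=False)    # test time: identity
```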
ReLU
f(x) = max(0, x)
The gradient is cheap to compute and needs no special tricks, and the error propagates to deep nodes without the gradient vanishing.
MaxOut
f(x) = max_k(w_k · x)
A method that picks the largest of several linear functions. It is piecewise linear and extremely simple, yet it can express complex functions. It is said to work very well.
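A small NumPy sketch of both activations; for Maxout, the number of linear pieces k, the weight shapes, and the bias term are illustrative assumptions:

```python
import numpy as np

def relu(x):
    # ReLU: max(0, x); the gradient is simply 0 or 1, with no saturation for x > 0.
    return np.maximum(0.0, x)

def maxout(x, W, b):
    """Maxout: the largest of k linear functions of x.

    x: (d,) input, W: (k, m, d) weights, b: (k, m) biases
    returns: (m,) output, the element-wise max over the k linear pieces.
    """
    z = np.einsum("kmd,d->km", W, x) + b   # k candidate linear outputs per unit
    return z.max(axis=0)                   # pick the largest for each unit

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
print(relu(x))
print(maxout(x, rng.standard_normal((3, 4, 5)), rng.standard_normal((3, 4))))
```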