Guidelines for Output Layer Design of Neural Networks

Overview

In recent years, progress in neural-network technology such as deep learning has been remarkable. Networks that previously could not reach sufficient accuracy because of overfitting can now attain high learning ability without losing generalization, thanks to techniques such as Hinton et al.'s dropout and autoencoders. In addition, advances in GPGPU and libraries such as TensorFlow have made it comparatively easy to train large-scale neural networks. On the other hand, while much attention is paid to how neural networks are trained, there is little discussion of the layered structure of the networks themselves or of the number of perceptrons. Applied architectures such as LSTM and CNN have appeared in recent years, but the guidelines for designing them to actually improve accuracy still demand craftsmanship, as they always have. In this article, we discuss the design of the output layer of a simplified neural network model using the Z3 SMT solver, and we show a method and a program that lead to an output layer with a specified accuracy.

Introducing the simplified neural network model

Error-free regression model

Let us introduce a very basic neural network. Here we discuss **regression problems solved by neural networks**.

[Figure: an output layer of three step perceptrons u_1, u_2, u_3, whose outputs are weighted by w_1, w_2, w_3 and summed together with a constant c]

As shown in the figure, the output layer has three perceptrons; each perceptron's output is multiplied by a weight, and the sum of those products plus a constant becomes the output. Written as a formula:

\begin{split}
u_i \in \{ 0 , 1 \}  \\
output = \sum_{i=1}^{3} u_i w_i + c
\end{split}
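As a minimal sketch of this computation (the 0/1 outputs, weights, and constant below are illustrative values only, not part of the model derivation):

u = [1, 0, 1]            # step-perceptron outputs u_1, u_2, u_3, each 0 or 1
w = [1.4, -0.6, -0.1]    # weights w_1, w_2, w_3 (illustrative values)
c = 89.5                 # constant term (illustrative value)

output = sum(ui * wi for ui, wi in zip(u, w)) + c
print(output)            # 1.4 - 0.1 + 89.5 = 90.8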

Discussion on the number of perceptrons

Here we assume that each $u_i$ is a so-called step perceptron, taking only the values 0 and 1. Using such a neural network, let us consider regressing the four data points

( \vec{x_1} , y_1 ), ( \vec{x_2} , y_2 ), ( \vec{x_3} , y_3 ), ( \vec{x_4} , y_4 )

We want the network to take $\vec{x_i}$ as input and output $y_i$. For this discussion, **we pay attention only to $y_1, y_2, y_3, y_4$ and ignore $\vec{x_1}, \vec{x_2}, \vec{x_3}, \vec{x_4}$.** Consider performing the regression with the perceptrons under this condition. The output layer of the target network consists of three units, so $u_1, u_2, u_3$ can take at most eight distinct states. Hence, when $y_i \neq y_j$ for $i \neq j$, at most eight data points can be regressed without error. More generally, the following is clear:

When the output layer has $n$ perceptrons, at most $2^n$ data points can be regressed without error.

From this we see that the parameter to watch is **the number of data points**. From it, the required number of perceptrons in the output layer can be derived. If the number of data points is $m$ and the number of output-layer perceptrons that can regress them without error is $n$, then the relation

n \geq \lceil \log_{2} m \rceil

holds. This gives a lower bound for $n$. On the other hand, $n = m$ is always sufficient, since each data point can then be assigned its own distinct 0/1 pattern; this gives an upper bound. Combining the two,

m \geq n \geq \lceil \log_{2} m \rceil \tag{1}

is obtained.
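For instance, for the $m = 16$ data points used later in this article, equation (1) bounds the number of output units as follows:

from math import ceil, log2

m = 16
print(ceil(log2(m)), m)   # lower bound 4, upper bound 16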

Discussion of the optimization problems in regression

Regression with perceptrons consists of two subproblems:

  1. Code (0/1 pattern) assignment
  2. Optimization of the regression coefficients

We explain using the example given above, introducing variables $s_{i,k}$:

\begin{split}
s_{i,k} \in \{ 0 , 1 \} \\
y_k = \sum_{i=1}^{n} s_{i,k} w_i + c
\end{split}

Here $s_{i,k}$ denotes the output of $u_i$ when regressing the data point $y_k$. For convenience we call this system of equations the "start constraint formula $S$". Writing it out in full for the earlier example:

\begin{split}
y_1 = s_{1,1}w_1 + s_{2,1}w_2 + s_{3,1}w_3 + c \\
y_2 = s_{1,2}w_1 + s_{2,2}w_2 + s_{3,2}w_3 + c \\
y_3 = s_{1,3}w_1 + s_{2,3}w_2 + s_{3,3}w_3 + c \\
y_4 = s_{1,4}w_1 + s_{2,4}w_2 + s_{3,4}w_3 + c 
\end{split}

Now the first question arises: how should the values $s_{i,k}$ be determined when regressing $y_k$? This is the "code assignment" problem raised above. Suppose the assignment $s_{1,1} = 0$, $s_{2,1} = 1$, $s_{3,1} = 1$ is made for $y_1$; the second question is then what values $w_1$, $w_2$, $w_3$, $c$ should take. This corresponds to the "regression coefficient optimization" mentioned earlier. It is very difficult to devise an algorithm that solves both at the same time, so we use an SMT solver instead, as follows:

  1. Define the start constraint formula $S$, with $n = m$.
  2. Set $E_{min} = \lceil \log_{2} m \rceil$.
  3. Set $E_{max} = m$.
  4. Set $E_{try} = \lfloor (E_{min} + E_{max}) / 2 \rfloor$.
  5. Add the constraints $w_i = 0$ for $i > E_{try}$ and check satisfiability.
  6. If satisfiable, set $E_{max} = E_{try}$; if not, remove the constraints added in step 5 and set $E_{min} = E_{try} + 1$.
  7. If $E_{max} = E_{min}$, go to step 8; otherwise go to step 4.
  8. The minimum number of units that can regress the $y_k$ is $E_{max}$.

In other words, a binary search is performed for the minimum number of units $n$ required to represent the data $y_k$, with the lower and upper limits given by equation (1) above.
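Before the full program, here is a minimal Z3Py sketch of steps 4 to 6, the heart of the search: the constraints $w_i = 0$ are pushed onto the solver, checked, and popped again if they make the system unsatisfiable. The numbers and names are illustrative only.

from z3 import Solver, Real, sat

solver = Solver()
n = 4                                      # pretend n = m = 4, per step 1
weights = [Real("w_%d" % i) for i in range(n)]
# ... the start constraint formula S would be added here ...

e_try = 2                                  # step 4: midpoint of the search range
solver.push()                              # open a backtracking point
solver.add([weights[i] == 0 for i in range(e_try, n)])   # step 5
if solver.check() == sat:                  # step 6
    e_max = e_try                          # e_try units suffice
else:
    solver.pop()                           # discard the w_i == 0 constraints
    e_min = e_try + 1                      # e_try units are not enough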

Regression model with error

In this section, we discuss regression models with error. It is very rare that data can be captured by the error-free regression model described above; in general, a model that allows some error is used. Let the design-allowable error be $\epsilon$. The constraint formula then becomes

\begin{split}
y_k - \epsilon \leq \sum_{i=1}^{n} s_{i,k}w_i + c \leq y_k + \epsilon
\end{split}

This formula is defined as the "start constraint formula $S'$". Running the same algorithm as before on it yields the minimum number $n$ of perceptrons that can be designed with accuracy within $\epsilon$. Note that the "start constraint formula $S$" discussed earlier is just the special case of $S'$ with $\epsilon = 0$.
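As a minimal Z3Py sketch of $S'$ for a single data point (the value of $y_k$ is illustrative; the full program in the next section builds this constraint for all data points):

from z3 import Solver, Real

solver = Solver()
epsilon = 0.1                              # design-allowable error
y_k = 88.37                                # one data point (illustrative value)
w = [Real("w_%d" % i) for i in range(3)]   # weights
s = [Real("s_%d" % i) for i in range(3)]   # would also be constrained to {0, 1}
c = Real("c")

# y_k - epsilon <= sum_i s_i w_i + c <= y_k + epsilon
out = sum(si * wi for si, wi in zip(s, w)) + c
solver.add(y_k - epsilon <= out, out <= y_k + epsilon)
print(solver.check())                      # sat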

Actual code

The data used this time comes from

http://hojin.ctot.jp/markets/data_download.html

We use 16 days of daily USDJPY data, and allow an error of up to 0.1 yen on each day's closing price.

# coding: utf-8
from z3 import *
from math import ceil, log

def cluster(max_epsilon, data, solver):
    n = len(data)
    epsilon = Real("epsilon")
    constant = Real("constant")

    max_n = n                          # upper bound: one unit per data point
    min_n = int(ceil(log(n, 2)))       # lower bound: ceil(log2(n))
    weights = [Real("weight_%04d" % i) for i in range(n)]

    solver.add(epsilon >= 0)
    solver.add(max_epsilon >= epsilon)

    all_vals = []

    for idx, d in enumerate(data):
        print(idx, len(data))
        # vals[i] is s_{i,k}: the 0/1 output of unit i for data point k
        vals = [Real("val_%04d%04d" % (idx, i)) for i in range(n)]
        all_vals.append(vals)

        for val in vals:
            solver.add(Or(val == 1.0, val == 0.0))

        # start constraint formula S':
        #   y_k - epsilon <= sum_i s_{i,k} w_i + c <= y_k + epsilon
        out = sum(v * w for v, w in zip(vals, weights)) + constant
        solver.add(d - epsilon <= out, out <= d + epsilon)

    # binary search for the smallest satisfiable number of units
    while max_n != min_n:
        try_n = (max_n + min_n) // 2
        solver.push()                  # open a backtracking point
        expressions = [weights[i] == 0 for i in range(try_n, n)]

        print(expressions)

        solver.add(expressions)

        print(min_n, max_n, try_n)

        if solver.check() == sat:
            print("ok:", min_n, max_n, try_n)
            max_n = try_n              # try_n units suffice; keep the constraints
        else:
            print("ng:", min_n, max_n, try_n)
            min_n = try_n + 1          # try_n units are not enough
            solver.pop()               # discard the w_i == 0 constraints

    solver.check()                     # re-establish a model for the final constraints
    model = solver.model()

    def dec(v):
        # as_decimal() appends "?" when the decimal expansion is truncated
        d = model[v].as_decimal(15)
        return d[:-1] if d.endswith("?") else d

    print("max_n:", max_n)
    print("constant:", float(dec(constant)))
    print("weights:")
    for w in weights:
        print(w, dec(w))
    print()

    print("patterns:")
    for line in all_vals:
        print("".join(str(int(float(dec(v)))) for v in line))


if __name__ == "__main__":
    s = Solver()

    data = []
    with open("tmp.csv") as f:
        next(f)                        # skip the header line
        for l in f:
            data.append(float(l.split(",")[5]))

    data = data[:16]
    data.sort()

    cluster(0.1, data, s)

The actual output is:

kotauchisunsun@kotauchisunsun-VirtualBox:~/z3nn$ python nn_cluster2.py 
0 16
1 16
2 16
3 16
4 16
5 16
6 16
7 16
8 16
9 16
10 16
11 16
12 16
13 16
14 16
15 16
[weight_0010 == 0, weight_0011 == 0, weight_0012 == 0, weight_0013 == 0, weight_0014 == 0, weight_0015 == 0]
4 16 10
ok: 4 16 10
[weight_0007 == 0, weight_0008 == 0, weight_0009 == 0, weight_0010 == 0, weight_0011 == 0, weight_0012 == 0, weight_0013 == 0, weight_0014 == 0, weight_0015 == 0]
4 10 7
ok: 4 10 7
[weight_0005 == 0, weight_0006 == 0, weight_0007 == 0, weight_0008 == 0, weight_0009 == 0, weight_0010 == 0, weight_0011 == 0, weight_0012 == 0, weight_0013 == 0, weight_0014 == 0, weight_0015 == 0]
4 7 5
ok: 4 7 5
[weight_0004 == 0, weight_0005 == 0, weight_0006 == 0, weight_0007 == 0, weight_0008 == 0, weight_0009 == 0, weight_0010 == 0, weight_0011 == 0, weight_0012 == 0, weight_0013 == 0, weight_0014 == 0, weight_0015 == 0]
4 5 4
ok: 4 5 4
max_n: 4
constant: 89.5
weights:
weight_0000 1.4
weight_0001 -0.6
weight_0002 -0.1
weight_0003 -0.8
weight_0004 
weight_0005 
weight_0006 
weight_0007 
weight_0008 
weight_0009 
weight_0010 
weight_0011 
weight_0012 
weight_0013 
weight_0014 
weight_0015 

patterns:
0111101001011111
0011010111111111
0110000111011001
0001110111011110
0100001011101010
0110111001110011
1111001010111011
0000010010100000
0000010110111111
1011100000110011
1001101000010111
1100110001100100
1010011010110111
1010110101000010
1000010110001011
1000110010010001

Extracting and summarizing only the results:

output = 1.4 u_1 - 0.6 u_2 - 0.1 u_3 - 0.8 u_4 + 89.5

| closing price | u_1 | u_2 | u_3 | u_4 |
| --- | --- | --- | --- | --- |
| 87.87 | 0 | 1 | 1 | 1 |
| 88.37 | 0 | 0 | 1 | 1 |
| 88.61 | 0 | 1 | 1 | 0 |
| 88.7 | 0 | 0 | 0 | 1 |
| 88.77 | 0 | 1 | 0 | 0 |
| 88.79 | 0 | 1 | 1 | 0 |
| 89.17 | 1 | 1 | 1 | 1 |
| 89.47 | 0 | 0 | 0 | 0 |
| 89.64 | 0 | 0 | 0 | 0 |
| 89.86 | 1 | 0 | 1 | 1 |
| 90.09 | 1 | 0 | 0 | 1 |
| 90.32 | 1 | 1 | 0 | 0 |
| 90.71 | 1 | 0 | 1 | 0 |
| 90.84 | 1 | 0 | 1 | 0 |
| 90.9 | 1 | 0 | 0 | 0 |
| 91.08 | 1 | 0 | 0 | 0 |

Here we can see that the 16 data points can be represented by 4 perceptrons with an error of at most 0.1.
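Note that the decimals printed by the program are truncated, so recomputing from the weights in the table may appear to miss the 0.1 tolerance. As a hedged sketch, the result can instead be verified with Z3's exact rational arithmetic; the code below assumes it is placed at the end of cluster() above, where data, all_vals, weights, constant, epsilon, and model are all in scope.

# Verify each reconstruction against its data point with exact arithmetic.
for d, vals in zip(data, all_vals):
    out = sum(v * w for v, w in zip(vals, weights)) + constant
    within = model.eval(And(d - epsilon <= out, out <= d + epsilon))
    print(d, is_true(within))          # expect True for every data point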

Limits and limitations of the method

This method yields the minimized number of perceptrons, the weights at that point, and the firing pattern of each perceptron. However, it has two limitations:

  1. It applies only to small problems.
  2. It discusses only the expressive power of the perceptrons.

Regarding 1., this is a limit of Z3 itself. In the sample program above, narrowing the allowed error to 0.01 made the run take more than 30 minutes (whether a solution would eventually have been found is unconfirmed, because it simply took too long). It might run faster with more care in the implementation and in how the logical formulas are constructed. Regarding 2., **the discussion so far says nothing about regression ability. However, a prerequisite for performing regression at all is that the output layer has at least enough expressive power for the data to be regressed.** This article discusses only the question, "Does the output layer have the power to express the data we want to regress?"

Use of the model as a relaxation problem for neural networks

The perceptron discussed this time is a basic model; the most commonly used perceptron discriminant functions today are the sigmoid and tanh. Now look at the figure below.

[Figure: relationships between n_{b0}, n_{b1}, n_{a0}, and n_{a1}; the green and black arrows indicate the inequalities derived below]

The number of perceptrons found by this method is called $n_{b0}$, since it is "the number of perceptrons the output layer needs in order to express the data". In general, however, the number we actually want is the one corresponding to $n_{b1}$, the number needed to perform the regression itself. Then "probably" the following relation holds (left green arrow):

n_{b0} \leq n_{b1}

The proviso "probably" is there because this has not been proved. (Generalization performance is a separate axis, and we do not discuss it here.) The prediction is that actually regressing the data, including its fine corner cases, should require at least as many units as merely expressing it. A further argument is possible here: define $n_{a0}$ as the number of perceptrons required to express the data to be regressed when the discriminant function is relaxed to the range $0 \leq u \leq 1$. Then

n_{b0} \geq n_{a0}

holds (upper black arrow). This is because the discriminant function behind $n_{b0}$ can take only the two values 0 and 1, while the one behind $n_{a0}$ can take any value from 0 to 1. The former's output range is only a part of the latter's, so the relaxed units are at least as expressive and can represent the data with no more perceptrons; hence the formula above. Moreover, $n_{a0}$ can be discussed in the same way as $n_{b0}$:

n_{a0} \leq n_{a1}

(right green arrow). And from the same argument used for $n_{b0}$ and $n_{a0}$, the relation

n_{b1} \geq n_{a1}

holds (lower black arrow). Summarizing these formulas:

\begin{split}
n_{a0} \leq n_{b0} \leq n_{b1} \\
n_{a0} \leq n_{a1} \leq n_{b1}
\end{split}

So far, from the condition of equation (1), all we knew about the number of perceptrons $n$ for $m$ data points was

m \geq n \geq \lceil \log_{2} m \rceil

However, by computing $n_{b0}$ with the method above, the range of $n_{a0}$ can be narrowed further:

n_{b0} \geq n_{a0} \geq \lceil \log_{2} m \rceil \tag{2}

The discriminant function used this time is a very restricted model, and its merit is the low cost compared to reasoning about the number of perceptrons in a general model. Another advantage is that the upper bound can be reduced, as in equation (2): when $m$ is very large, $m \gg \log_{2} m$, so reasoning directly about a general model can take a long time because the range of $n$ is too wide. Designing by way of the basic model used here, to obtain an upper bound first, is therefore considered an effective approach.

Summary

This article discussed the following:

  1. We discussed the expressive power of perceptrons with a basic discriminant function.
  2. For the model in 1., we showed how to find the minimum number of perceptrons that can represent given data.
  3. We showed that, using the result of 2., the number of perceptrons with a general discriminant function can be narrowed down efficiently.

If $n_{b0}$ can be obtained, then taking $n_{a1} = n_{b0}$ as the initial design is probably unproblematic. The number of perceptrons has traditionally been chosen empirically, without much justification, so using $n_{b0}$ as a guideline is a policy worth adopting.

Next steps

This article discussed the "expressive ability" of perceptrons, but could not address their "regression ability". If a simple method can be found for that as well, the minimum necessary neural network could be constructed and training costs reduced. And although much here depends on Z3, if such designs can be described more simply and their logic guaranteed by constraint satisfaction, this should contribute to more advanced discrimination ability in machine learning.
