In recent years, progress in neural-network technologies such as deep learning has been remarkable. Networks that previously could not reach sufficient accuracy because of overfitting can now achieve high learning ability without sacrificing generalization, thanks to techniques such as Hinton et al.'s Dropout and autoencoder-based pretraining. In addition, advances in GPGPU computing and the appearance of libraries such as TensorFlow have made it relatively easy to train large-scale neural networks. On the other hand, while much attention is paid to how neural networks are trained, there is comparatively little discussion of the layered structure of the networks themselves and of the number of perceptrons they contain. Applied architectures such as LSTMs and CNNs have appeared in recent years, but the design guidelines for actually improving their accuracy still require the same craftsmanship as before. In this chapter, we discuss the design of the output layer of a simple neural network model using the Z3 solver. Using Z3 Prover, we then show a method and a program that lead to a neural network with a specified accuracy.
Let us introduce a very basic neural network. Here we discuss **regression problems with neural networks**.
As shown in the figure, the output layer consists of three perceptrons; each output is multiplied by a weighting coefficient, and the sum of these plus a constant gives the final output. Written as a formula, it is as follows.
\begin{split}
u_i \in \{ 0 , 1 \} \\
output = \sum_{i=1}^{3} u_i w_i + c
\end{split}
Here, $u_i$ is assumed to be a so-called step-type perceptron that takes only the value 0 or 1. With such a neural network, consider the four data points
( \vec{x_1} , y_1 ), ( \vec{x_2} , y_2 ), ( \vec{x_3} , y_3 ), ( \vec{x_4} , y_4 )
and let's think about regressing them. Suppose we want the perceptrons to take $\vec{x_i}$ as input and output $y_i$. At this point, **pay attention to $y_1, y_2, y_3, y_4$ and ignore $\vec{x_1}, \vec{x_2}, \vec{x_3}, \vec{x_4}$.** Consider performing the regression under this condition. The output layer of the target neural network consists of three units, so there are at most eight possible states for $(u_1, u_2, u_3)$. Therefore, when $y_i \neq y_j$ for $i \neq j$, the maximum number of data points that can be regressed without error is only 8. From this, the following is obvious.
When the output layer has $n$ perceptrons, the maximum number of data points that can be regressed without error is $2^n$.
From this, we can see that the parameter to pay attention to is **the number of data points**, and from it the number of perceptrons in the output layer can be derived. Assuming the number of data points is $m$ and the minimum number of output-layer perceptrons that can regress them without error is $n$, we obtain the relation
n \geq \lceil \log_{2} m \rceil
This gives the lower bound on $n$. On the other hand, the upper bound is $m$, which corresponds to assigning a dedicated 0/1 unit to each of the $m$ data points. Combining the two bounds,
m \geq n \geq \lceil \log_{2} m \rceil \tag{1}
is obtained.
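As a concrete example using the 16 data points that appear later in this article, relation (1) gives

\begin{split}
m = 16 \\
16 \geq n \geq \lceil \log_{2} 16 \rceil = 4
\end{split}

so the search for $n$ can be restricted to the range 4 to 16.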
Regression with perceptrons consists of two problems:

1. "Sign assignment": deciding which 0/1 firing pattern to assign to each data point.
2. "Regression coefficient optimization": deciding the weighting coefficients $w_i$ and the constant $c$.
Let us explain using the example given above. Here we introduce the variables $s_{i,k}$.
\begin{split}
s_{i,k} \in \{ 0 , 1 \} \\
y_k = \sum_{i=1}^{n} s_{i,k}w_i + c
\end{split}
When regressing the data point $y_k$, we denote the output of $u_i$ by $s_{i,k}$. For convenience, we call this formula the "start constraint formula $S$". Writing it out for all of the earlier example data gives
\begin{split}
y_1 = s_{1,1}w_1 + s_{2,1}w_2 + s_{3,1}w_3 + c \\
y_2 = s_{1,2}w_1 + s_{2,2}w_2 + s_{3,2}w_3 + c \\
y_3 = s_{1,3}w_1 + s_{2,3}w_2 + s_{3,3}w_3 + c \\
y_4 = s_{1,4}w_1 + s_{2,4}w_2 + s_{3,4}w_3 + c
\end{split}
This is the full set of constraints for the example. Now, how should the contents of $s_{i,k}$ be determined when regressing $y_k$? This is the "sign assignment" problem raised earlier. Next, suppose that $s_{1,1} = 0$, $s_{2,1} = 1$, $s_{3,1} = 1$ are assigned for $y_1$; what should the values of $w_1$, $w_2$, $w_3$, $c$ then be? This corresponds to the "regression coefficient optimization" mentioned earlier. It is very difficult to devise an algorithm that solves both problems at the same time, so we use an SMT solver.
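Before moving to the full program, here is a minimal sketch of this idea. The concrete target values $y_1, \dots, y_4$ and the variable names are placeholder assumptions; Z3 is asked to find the 0/1 assignments $s_{i,k}$ and the coefficients $w_i$, $c$ simultaneously.

```python
# Minimal sketch: encode the start constraint formula S for the
# four-point example and let Z3 search for the 0/1 assignments and
# the regression coefficients at the same time.
# The concrete values of y_1..y_4 are placeholders for illustration.
from z3 import Solver, Real, Or, sat

ys = [1.0, 2.0, 3.0, 4.0]   # hypothetical targets y_1..y_4
n = 3                       # three output units, as in the example

solver = Solver()
w = [Real("w_%d" % i) for i in range(n)]
c = Real("c")

for k, y in enumerate(ys):
    s_k = [Real("s_%d_%d" % (i, k)) for i in range(n)]
    for s_ik in s_k:
        # step-type unit: the output is either 0 or 1
        solver.add(Or(s_ik == 0, s_ik == 1))
    solver.add(sum(s_ik * w_i for s_ik, w_i in zip(s_k, w)) + c == y)

if solver.check() == sat:
    print(solver.model())
```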
A so-called binary search is then performed to find the minimum number of units $n$ required to represent the data $y_i$; the lower and upper bounds are taken from relation (1), as sketched below.
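A compact sketch of this search, under the assumption that `solver` already holds the start constraint formula for every data point and `weights` is the list of weight variables, could look like the following; the full program later in this article uses the same push/pop pattern.

```python
from math import ceil, log
from z3 import sat

def minimize_units(solver, weights, m):
    # Binary search for the smallest number of units that still satisfies
    # the constraints already asserted on `solver` (Eq. (1) gives the bounds).
    min_n = int(ceil(log(m, 2)))
    max_n = m
    while min_n < max_n:
        try_n = (min_n + max_n) // 2
        solver.push()
        # trial: allow only the first try_n units by zeroing the other weights
        solver.add([weights[i] == 0 for i in range(try_n, len(weights))])
        if solver.check() == sat:
            max_n = try_n        # try_n units are enough
        else:
            min_n = try_n + 1    # more units are needed
        solver.pop()             # undo the trial constraints
    return max_n
```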
Next, we discuss regression models that allow errors. It is very rare that data can be represented by the error-free regression model described above; in general, a model that tolerates some error is used. Let the design-allowable error be $\epsilon$. The constraint formula then becomes
\begin{split}
y_k - \epsilon \leq \sum_{i=1}^{n} s_{i,k}w_i + c \leq y_k + \epsilon
\end{split}
We call this formula the "start constraint formula $S'$". Running the same algorithm as before with this constraint yields the minimum number $n$ of perceptrons that can be designed with an accuracy within $\epsilon$. Note that the "start constraint formula $S$" discussed earlier is the special case of $S'$ with $\epsilon = 0$.
This time, as data, we use 16 days of daily USDJPY data downloaded from
http://hojin.ctot.jp/markets/data_download.html
The allowed error for the closing price of each day is set to within 0.1 yen.
# coding: utf-8
from z3 import *
from math import log, ceil

def num_str(value):
    # Render a z3 numeric model value as a plain decimal string
    # (as_decimal() may append "?" when the expansion is truncated).
    return value.as_decimal(15).rstrip("?") if value is not None else ""

def cluster(max_epsilon, data, solver):
    n = len(data)                          # upper bound: one unit per data point
    epsilon = Real("epsilon")
    constant = Real("constant")
    max_n = n
    min_n = int(ceil(log(n, 2)))           # lower bound from Eq. (1)
    weights = [Real("weight_%04d" % i) for i in range(n)]

    solver.add(epsilon >= 0)
    solver.add(max_epsilon >= epsilon)

    all_vals = []
    for idx, d in enumerate(data):
        print(idx, len(data))
        # firing pattern s_{i,k} of the units for this data point
        vals = [Real("val_%04d%04d" % (idx, i)) for i in range(n)]
        all_vals.append(vals)
        for val in vals:
            solver.add(Or(val == 1.0, val == 0.0))
        # start constraint formula S': the output must lie within +/- epsilon of d
        out = sum(v * w for v, w in zip(vals, weights)) + constant
        solver.add(d - epsilon <= out, out <= d + epsilon)

    # binary search for the minimum number of units
    while max_n != min_n:
        try_n = (max_n + min_n) // 2
        solver.push()
        # force all but the first try_n weights to zero
        expressions = [weights[i] == 0 for i in range(try_n, n)]
        print(expressions)
        solver.add(expressions)
        print(min_n, max_n, try_n)
        if solver.check() == sat:
            print("ok:", min_n, max_n, try_n)
            max_n = try_n                  # try_n units suffice; keep the constraints
        else:
            print("ng:", min_n, max_n, try_n)
            min_n = try_n + 1              # more than try_n units are needed
            solver.pop()                   # discard the failed trial constraints

    solver.check()                         # make sure a model is available
    model = solver.model()
    print("max_n:", max_n)
    print("constant:", num_str(model[constant]))
    print("weights:")
    for w in weights:
        print(w, num_str(model[w]))
    print()
    print("patterns:")
    for line in all_vals:
        print("".join(num_str(model[v]) for v in line))

if __name__ == "__main__":
    s = Solver()
    data = []
    with open("tmp.csv") as f:
        next(f)                            # skip the header line
        for l in f:
            data.append(float(l.split(",")[5]))   # closing price column
    data = data[:16]
    data.sort()
    cluster(0.1, data, s)
The actual output is
kotauchisunsun@kotauchisunsun-VirtualBox:~/z3nn$ python nn_cluster2.py
0 16
1 16
2 16
3 16
4 16
5 16
6 16
7 16
8 16
9 16
10 16
11 16
12 16
13 16
14 16
15 16
[weight_0010 == 0, weight_0011 == 0, weight_0012 == 0, weight_0013 == 0, weight_0014 == 0, weight_0015 == 0]
4 16 10
ok: 4 16 10
[weight_0007 == 0, weight_0008 == 0, weight_0009 == 0, weight_0010 == 0, weight_0011 == 0, weight_0012 == 0, weight_0013 == 0, weight_0014 == 0, weight_0015 == 0]
4 10 7
ok: 4 10 7
[weight_0005 == 0, weight_0006 == 0, weight_0007 == 0, weight_0008 == 0, weight_0009 == 0, weight_0010 == 0, weight_0011 == 0, weight_0012 == 0, weight_0013 == 0, weight_0014 == 0, weight_0015 == 0]
4 7 5
ok: 4 7 5
[weight_0004 == 0, weight_0005 == 0, weight_0006 == 0, weight_0007 == 0, weight_0008 == 0, weight_0009 == 0, weight_0010 == 0, weight_0011 == 0, weight_0012 == 0, weight_0013 == 0, weight_0014 == 0, weight_0015 == 0]
4 5 4
ok: 4 5 4
max_n: 4
constant: 89.5
weights:
weight_0000 1.4
weight_0001 -0.6
weight_0002 -0.1
weight_0003 -0.8
weight_0004
weight_0005
weight_0006
weight_0007
weight_0008
weight_0009
weight_0010
weight_0011
weight_0012
weight_0013
weight_0014
weight_0015
patterns:
0111101001011111
0011010111111111
0110000111011001
0001110111011110
0100001011101010
0110111001110011
1111001010111011
0000010010100000
0000010110111111
1011100000110011
1001101000010111
1100110001100100
1010011010110111
1010110101000010
1000010110001011
1000110010010001
Extracting and summarizing only the results:
output = 1.4 u_1 - 0.6 u_2 - 0.1 u_3 - 0.8 u_4 + 89.5
closing price | $u_1$ | $u_2$ | $u_3$ | $u_4$
---|---|---|---|---
87.87 | 0 | 1 | 1 | 1 |
88.37 | 0 | 0 | 1 | 1 |
88.61 | 0 | 1 | 1 | 0 |
88.7 | 0 | 0 | 0 | 1 |
88.77 | 0 | 1 | 0 | 0 |
88.79 | 0 | 1 | 1 | 0 |
89.17 | 1 | 1 | 1 | 1 |
89.47 | 0 | 0 | 0 | 0 |
89.64 | 0 | 0 | 0 | 0 |
89.86 | 1 | 0 | 1 | 1 |
90.09 | 1 | 0 | 0 | 1 |
90.32 | 1 | 1 | 0 | 0 |
90.71 | 1 | 0 | 1 | 0 |
90.84 | 1 | 0 | 1 | 0 |
90.9 | 1 | 0 | 0 | 0 |
91.08 | 1 | 0 | 0 | 0 |
The result takes the form above. Here we can see that the 16 data points can be represented by 4 perceptrons with an error of at most 0.1.
With this method, we can obtain the minimized number of perceptrons, the weighting coefficients at that point, and the firing patterns of the perceptrons. However, there are two restrictions:

1. The computation time of Z3 Prover grows quickly as the error tolerance is tightened.
2. Only the expressive power of the output layer is discussed, not its regression ability.
Regarding 1., this is a limitation of the Z3 Prover used here. In the sample program above, if the error tolerance is narrowed to 0.01, it takes more than 30 minutes to produce an answer (whether a solution exists at all is unconfirmed, because it took too long). It may become faster with a more careful implementation and construction of the logical formulas. Regarding 2., **the discussion so far has not addressed regression ability. However, in order to perform regression, it is a prerequisite that the output layer has at least the expressive power for the data to be regressed.** This article only discusses the question "does the output layer have the expressive power for the data we want to regress?"
The perceptron discussed this time is a very basic model. The most commonly used activation functions today are the sigmoid and tanh. Now look at the figure below.
The number of perceptrons found by this method is $n_{b0}$, because it is "the number of perceptrons required for the output layer to have the necessary expressive power". In general, however, the quantity we actually want is the number corresponding to $n_{b1}$. Then "probably" the following relation holds (left green arrow):
n_{b0} \leq n_{b1}
The proviso "probably" is because it has not been proved. As a discussion about regression, we do not discuss generalization performance on another axis. When regressing the data, the above equation probably holds to represent the fine corner cases. It is a prediction. Here, another argument is possible: when the perceptron identification function is extended to $ 0 \ leq u \ leq 1 $, the number of perceptrons required to represent the data to be regressed is $ n_ {a0} $. Define it. this is,
n_{b0} \geq n_{a0}
holds (upper black arrow). This is because for $n_{b0}$ the activation can only take the two values 0 and 1, while for $n_{a0}$ the activation can take any value in the range 0 to 1. Since the range of the former is only a subset of the range of the latter, the latter is more expressive and can represent the data with a smaller number of perceptrons, so the above relation is considered to hold. Also, $n_{a0}$ can be discussed in the same way as $n_{b0}$:
n_{a0} \leq n_{a1}
(right green arrow). Also, from the same discussion as for $n_{b0}$ and $n_{a0}$,
n_{b1} \geq n_{a1}
holds (lower black arrow). Summarizing these formulas,
\begin{split}
n_{a0} \leq n_{b0} \leq n_{b1} \\
n_{a0} \leq n_{a1} \leq n_{b1}
\end{split}
These relations follow. Until now, from condition (1), when the number of data points is $m$, the only constraint we had on the number of perceptrons $n$ was
m \geq n \geq \lceil \log_{2} m \rceil
However, by computing the value of $n_{b0}$ with the method above, the range of $n_{a0}$ can be narrowed further:
n_{b0} \geq n_{a0} \geq \lceil \log_{2} m \rceil \tag{2}
The activation function used this time is a very restricted model, and its merit is that the cost is low compared to discussing the number of perceptrons with a general model. Another advantage is that the upper bound can be reduced as in Eq. (2): when $m$ is very large, $m \gg \log_{2} m$ (for example, with $m = 10^6$ data points, $\log_{2} m$ is only about 20), so the range of $n$ is too wide to explore directly with a general model in reasonable time. Therefore, it is thought that an effective design can be achieved by first going through the basic model used here to obtain an upper bound.
To summarize what was discussed in this article: if $n_{b0}$ can be obtained, then setting $n_{a1} = n_{b0}$ as the initial design is probably good enough. The reason is that, until now, the number of perceptrons has been chosen empirically without much justification, so using $n_{b0}$ as a guideline is a policy worth adopting.
This time, we discussed the "expressive power" of perceptrons, but we could not discuss their "regression ability". If a simple method can be found for that as well, it should become possible to construct the minimum necessary neural network and reduce the training cost. Furthermore, although much still depends on Z3 Prover, if such designs can be described more simply and their logic guaranteed by constraint satisfaction, this should contribute to more advanced discrimination ability in machine learning.