Previously: (1, 2, [3](http://qiita.com/yai/items/1b7f8ef69f8f2343e3e9)). This post is a continuation of those. Classification alone isn't much fun, so this time I'd like to try a regression problem: using the machine learning frameworks TensorFlow and scikit-learn to predict the stock price directly. Last time the next day's stock price was classified as "up or down"; this time we predict "how many yen" directly. The input data is reused from before. Apologies, as always.
--Try to predict the stock price using TensorFlow and scikit-learn.
--Check the accuracy and usability.
"Use several days' worth of global stock indexes (Dow, Nikkei 225, DAX, etc.) to predict the next day's Nikkei 225" (regression)
--scikit-learn: scikit-learn 0.17.1 / Python 2.7 / Windows 7
--TensorFlow: TensorFlow 0.7 / Python 2.7 / Ubuntu 14.04 (AWS EC2 micro instance)
Download the Nikkei, Dow, Hang Seng, and German (DAX) stock indices from the Quandl site, and combine them into a single text file. (Manual work.)
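For reference, a minimal sketch of loading such a merged file with numpy. The file name merged_indices.csv, the comma delimiter, and the layout (16 numeric columns, e.g. 4 values per index, one row per trading day, newest row first) are my assumptions; the post does this step by hand and doesn't show it.

import numpy as np

# Load the manually merged index data into array_base.
# Assumed layout: comma-separated, 16 numeric columns, newest row first.
array_base = np.loadtxt('merged_indices.csv', delimiter=',')
print array_base.shape  # (number_of_days, 16)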
The correct-answer data is the next day's closing price (i.e., we predict the next day's close). However, when I used the stock price directly as the label, training simply diverged, so I had to get creative: the prediction target is made indirect, "by what percentage will the next day's close be higher or lower than the previous day's," and the closing price is recovered from that afterwards. The same rate-of-change representation is used for the inputs described below, so it's consistent... (・・;)
# JUDGE_DAY = 1; the second subscript [3] holds the Nikkei 225 closing price.
y_array.append([(array_base[i][3] - array_base[i+JUDGE_DAY][3]) / array_base[i][3] * 100])
For the inputs, instead of the stock prices themselves, we give a list of "how much (%) each value went up or down compared to the previous day." (Feeding in the raw prices didn't work at all.)
tmp_array = []
for j in xrange(i+1, i + data_num + 1):
    for k in range(16):
        tmp_array.append((array_base[j][k] - array_base[j+1][k]) / array_base[j][k] * 100)
x_array.append(tmp_array)
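Both the label snippet and this input snippet refer to a loop variable i that isn't shown, so here is how I assume the two fragments fit together into a single dataset-building loop (data_num and JUDGE_DAY are defined elsewhere in the original code; rows are assumed to be newest first):

x_array, y_array = [], []
for i in range(len(array_base) - data_num - 1):
    # Label: next day's close (row i, column 3) as a % change vs. the previous day.
    y_array.append([(array_base[i][3] - array_base[i+JUDGE_DAY][3]) / array_base[i][3] * 100])
    # Inputs: day-over-day % change of all 16 columns over the past data_num days.
    tmp_array = []
    for j in xrange(i+1, i + data_num + 1):
        for k in range(16):
            tmp_array.append((array_base[j][k] - array_base[j+1][k]) / array_base[j][k] * 100)
    x_array.append(tmp_array)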
The TensorFlow network has two hidden layers, with 50 and 25 units respectively.
NUM_HIDDEN1 = 50
NUM_HIDDEN2 = 25

def inference(x_ph, keep_prob):
    with tf.name_scope('hidden1'):
        weights = tf.Variable(tf.truncated_normal([data_num * price_num, NUM_HIDDEN1], stddev=stddev), name='weights')
        biases = tf.Variable(tf.zeros([NUM_HIDDEN1]), name='biases')
        hidden1 = tf.nn.relu(tf.matmul(x_ph, weights) + biases)
    with tf.name_scope('hidden2'):
        weights = tf.Variable(tf.truncated_normal([NUM_HIDDEN1, NUM_HIDDEN2], stddev=stddev), name='weights')
        biases = tf.Variable(tf.zeros([NUM_HIDDEN2]), name='biases')
        hidden2 = tf.nn.relu(tf.matmul(hidden1, weights) + biases)
    # Dropout
    dropout = tf.nn.dropout(hidden2, keep_prob)
    with tf.name_scope('regression'):
        weights = tf.Variable(tf.truncated_normal([NUM_HIDDEN2, 1], stddev=stddev), name='weights')
        biases = tf.Variable(tf.zeros([1]), name='biases')
        y = tf.matmul(dropout, weights) + biases
    return y
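The placeholders x_ph, y_ph, and keep_prob used here and in training() are not shown in the post; presumably they are created along these lines (TF 0.7-era API; the dtype strings and shapes are my guess from how they are used):

x_ph = tf.placeholder("float", [None, data_num * price_num])  # input: rate-of-change vectors
y_ph = tf.placeholder("float", [None, 1])                     # target: next-day % change
keep_prob = tf.placeholder("float")                           # dropout keep probability
y = inference(x_ph, keep_prob)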
The loss is computed with l2_loss(). Since we are regressing a number, taking the difference between the predicted and actual values as the loss seems reasonable, but I'm not sure it's correct. If you think "no, that's wrong," comments are welcome.
def loss(y, target):
    return tf.reduce_mean(tf.nn.l2_loss((y - target)))
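One caveat: tf.nn.l2_loss(t) already returns the scalar sum(t ** 2) / 2, so the reduce_mean() around it is effectively a no-op and the loss grows with the batch size. If you wanted a loss independent of batch size, a per-example mean squared error would be an alternative (not what this post uses):

def loss_mse(y, target):
    # Mean squared error, averaged over the batch.
    return tf.reduce_mean(tf.square(y - target))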
Nothing special to mention here.
def optimize(loss):
    optimizer = tf.train.AdamOptimizer(learning_rate)
    train_step = optimizer.minimize(loss)
    return train_step
Nothing special to mention here either.
def training(sess, train_step, loss, x_train_array, y_flg_train_array):
    summary_op = tf.merge_all_summaries()
    init = tf.initialize_all_variables()
    sess.run(init)
    summary_writer = tf.train.SummaryWriter(LOG_DIR, graph_def=sess.graph_def)
    for i in range(int(len(x_train_array) / bach_size)):
        batch_xs = getBachArray(x_train_array, i * bach_size, bach_size)
        batch_ys = getBachArray(y_flg_train_array, i * bach_size, bach_size)
        sess.run(train_step, feed_dict={x_ph: batch_xs, y_ph: batch_ys, keep_prob: 0.8})
        ce = sess.run(loss, feed_dict={x_ph: batch_xs, y_ph: batch_ys, keep_prob: 1.0})
        summary_str = sess.run(summary_op, feed_dict={x_ph: batch_xs, y_ph: batch_ys, keep_prob: 1.0})
        summary_writer.add_summary(summary_str, i)
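The helper getBachArray() is not included in the post; presumably it just slices out one mini-batch, something like:

def getBachArray(array, start, size):
    # Presumed implementation: return `size` consecutive examples from `start`
    # (the original helper is not shown in the post).
    return array[start:start + size]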
For accuracy, I take the difference between the predicted rate of change and the actual rate of change and output the mean of their absolute values. In other words, it's simply the mean absolute error.
accuracy = tf.reduce_mean(tf.abs(y - y_ph))
print "accuracy"
print(sess.run(accuracy, feed_dict={x_ph: test_batch_xs, y_ph: test_batch_ys, keep_prob: 1.0}))
scikit-learn offers various algorithms, but... I'm not sure which is best, so I picked three and used them with no arguments (default parameters).
from sklearn import linear_model, svm, tree

# SGDRegressor
clf = linear_model.SGDRegressor()
testClf(clf, x_train_array, y_train_array, x_test_array, y_test_array)

# DecisionTreeRegressor
clf = tree.DecisionTreeRegressor()
testClf(clf, x_train_array, y_train_array, x_test_array, y_test_array)

# SVM
clf = svm.SVR()
testClf(clf, x_train_array, y_train_array, x_test_array, y_test_array)
Training just calls fit(). When I ran score(), some of the values came out negative and were hard to interpret (for regressors, score() returns the coefficient of determination R^2, which can go negative when the model does worse than always predicting the mean), so, as with TensorFlow, I take the predicted rate of change and average the absolute difference from the actual rate of change (in short, the mean absolute error).
import numpy as np

def testClf(clf, x_train_array, y_flg_train_array, x_test_array, y_flg_test_array):
    print clf
    clf.fit(x_train_array, y_flg_train_array)
    result = clf.predict(x_test_array)
    print clf.score(x_test_array, y_flg_test_array)
    print np.mean(np.abs(np.array(result) - np.array(y_flg_test_array)))
TensorFlow: 1.00044
scikit-learn:
--SGDRegressor: 0.943171296872
--DecisionTreeRegressor: 1.3551351662
--SVM (SVR): 0.945361479916
So the error was about 1% in each case. A 1% error in a stock price forecast... can't be used at all... ahem.
While we're at it, let's actually produce a predicted value. The data at hand runs up to 2016/03/24, so we predict the Nikkei 225 closing price for 2016/03/25. For scikit-learn, the SVM (SVR) is used.
TensorFlow
p = sess.run(y, feed_dict={x_ph: data, keep_prob: 1.0})
# 16892.33 is the Nikkei 225 closing price on 2016/03/24;
# convert the predicted % change back into a price.
price = ((p[0][0] / 100.) + 1.) * 16892.33
print price
scikit-learn
p = clf.predict(data)
price = ((p[0] / 100.) + 1.) * 16892.33  # convert % change back to a price, as above
print price
TensorFlow: 16804.3398821
scikit-learn: 16822.6013292
And the actual closing price on 3/25 was... 17,002.75. **...Well, that's right.** Both models predicted a drop of roughly 0.4-0.5%, while the market actually rose about 0.65%; with an average error of around 1%, a move of that size is simply lost in the noise.
--Machine learning doesn't seem to mean "just throw data at it and the machine will do its best to come up with the best answer." Humans, too, need to think about how to frame the problem so the machine can learn and answer more easily.
--Regression problems are fun.
--Errors are easy to spot, because unlike classification the model can't give an unmotivated answer like "everything is the first category."
--The code differs a little from the classification version, but development was easy because about 80% of the code could be reused.