I studied scikit-learn, so I used it to predict stock prices. Last time I did the same thing with TensorFlow, and this time I want to compare scikit-learn against it. I am reusing the same input data, because obtaining and processing that data is troublesome. Please forgive me. By the way, there are already people doing the same thing. Since I only studied scikit-learn (and the theory around it) for about a week, there are probably many mistakes. Suggestions are welcome.
Apparently it is pronounced "sy-kit learn". It is a machine learning library, equipped with various algorithms and relatively easy to use. You may be able to do the same things with TensorFlow, but scikit-learn is easier to write.
- Various algorithms can be used.
- It works on Windows. (This is important.)
- Deep learning is not possible.
- Try scikit-learn.
- Compare usability, accuracy, speed, etc. against the TensorFlow version.
"Use several days' worth of global stock indices (Dow, Nikkei 225, DAX, etc.) to predict whether the Nikkei 225 will rise or fall the next day (a binary choice)" (same as last time)
- scikit-learn 0.17.1
- Python 2.7
- Windows 7
The previous data is used as-is. (The Nikkei, Dow, Hang Seng, and German stock indices downloaded from Quandl, combined into a single text file.)
In scikit-learn, labels are specified as int values rather than in one-hot flag format (like [0, 0, 1]), so I used 0 for rising and 1 for falling.
up = down = 0
y_flg_array = []
# array_base is ordered newest-first, so row i is the day after row i+1.
for i in xrange(len(array_base) - 1):
    if array_base[i][3] > (array_base[i+1][3]):
        y_flg_array.append(0)  # rose compared to the previous day
        up += 1
    else:
        y_flg_array.append(1)  # fell (or stayed flat)
        down += 1
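The difference from the one-hot flag format used with TensorFlow last time can be illustrated with a tiny sketch (hypothetical labels, following the post's convention of 0 = up, 1 = down):

```python
# One-hot labels (TensorFlow style) vs. the integer labels
# scikit-learn expects. Column 0 = up, column 1 = down.
one_hot = [[1, 0], [0, 1], [1, 0]]              # up, down, up
int_labels = [row.index(1) for row in one_hot]
print(int_labels)  # -> [0, 1, 0]
```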
Across the whole sample, this came out to Up: 50.5%, Down: 49.5%.
Based on the previous round's lessons, instead of feeding the stock prices as-is, we give a list of how much (%) each index went up or down compared to the previous day.
# Runs inside a loop over samples i; array_base is newest-first with
# 16 columns per day. data_num days of % changes form one input sample.
tmp_array = []
for j in xrange(i+1, i + data_num + 1):
    for k in range(16):
        # day-over-day change relative to the newer value, in percent
        tmp_array.append((array_base[j][k] - array_base[j+1][k]) / array_base[j][k] * 100)
x_array.append(tmp_array)
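As a toy illustration of this transformation with made-up prices (ordered newest-first, like the post's data), computing the percentage change exactly as in the snippet above:

```python
# Made-up closing prices: day t, day t-1, day t-2 (newest first).
prices = [102.0, 100.0, 98.0]

# (newer - older) / newer * 100, same formula as the loop above.
changes = [(prices[j] - prices[j + 1]) / prices[j] * 100
           for j in range(len(prices) - 1)]
print(changes)  # roughly [1.96, 2.0]
```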
scikit-learn offers various algorithms, but honestly I have no idea which one is better, so I decided to try about three of them. This time: stochastic gradient descent, a decision tree, and a support vector machine. By the way, I have no idea how these three differ. (^_^;)
from sklearn import linear_model, svm, tree

# SGDClassifier
clf = linear_model.SGDClassifier()
testClf(clf, x_train_array, y_flg_train_array, x_test_array, y_flg_test_array)

# Decision Tree
clf = tree.DecisionTreeClassifier()
testClf(clf, x_train_array, y_flg_train_array, x_test_array, y_flg_test_array)

# SVM
clf = svm.SVC()
testClf(clf, x_train_array, y_flg_train_array, x_test_array, y_flg_test_array)
Training and evaluation happen inside a function. Training is just calling fit() and evaluation is just calling score(), so it's very easy.
def testClf(clf, x_train_array, y_flg_train_array, x_test_array, y_flg_test_array):
    print clf
    clf.fit(x_train_array, y_flg_train_array)
    print clf.score(x_test_array, y_flg_test_array)
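The fit()/score() pattern is what makes swapping classifiers so easy: anything exposing those two methods can be passed to a function like testClf. A minimal pure-Python sketch of the interface (a toy majority-class baseline, not a scikit-learn class):

```python
from collections import Counter

class MajorityClassifier(object):
    """Toy classifier mimicking scikit-learn's fit()/score() interface."""

    def fit(self, X, y):
        # Remember the most common label in the training set.
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self

    def score(self, X, y):
        # Accuracy of always predicting the majority label.
        hits = sum(1 for label in y if label == self.majority_)
        return hits / float(len(y))

clf = MajorityClassifier()
clf.fit([[0], [1], [2], [3]], [0, 0, 0, 1])
print(clf.score([[4], [5]], [0, 1]))  # -> 0.5
```

Such a baseline is also a useful sanity check: a real model should beat the roughly 50.5% majority rate of this dataset.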
SGDClassifier : 0.56591099916
DecisionTreeClassifier : 0.544080604534
SVM : 0.612090680101
With TensorFlow, the accuracy was about 63%, so scikit-learn gets reasonably close, though not quite there. Only the SVM is heavy to process.
Above, no arguments were passed when creating each classifier instance, but accuracy can apparently be improved by tuning the parameters. There is even a feature that brute-forces over parameter combinations. Convenient. Let's try it with the SVM, which gave the best results.
from sklearn import grid_search, svm

clf = svm.SVC()
grid = grid_search.GridSearchCV(estimator=clf, param_grid={'kernel': ['rbf', 'linear', 'poly', 'sigmoid']})
grid.fit(x_train_array, y_flg_train_array)
testClf(grid.best_estimator_, x_train_array, y_flg_train_array, x_test_array, y_flg_test_array)
Above, we tried four SVM kernels, 'rbf', 'linear', 'poly', and 'sigmoid', then trained and tested again with the best parameters. (Since grid search already fits each candidate, the extra training is probably unnecessary.) As an aside, of course I don't really understand what a kernel is either. (^_^;)
0.638958858102
The best result came with the linear kernel, a slight improvement in accuracy. About 64%... I've exceeded deep learning... (probably within the margin of error)
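For reference, the post uses scikit-learn 0.17.1, where GridSearchCV lives in sklearn.grid_search; from 0.18 onward that module was deprecated in favor of sklearn.model_selection. A minimal sketch of the same grid search on a modern version, with placeholder data standing in for the post's training arrays:

```python
from sklearn import svm
from sklearn.model_selection import GridSearchCV  # 0.18+ location

# Placeholder data: two easily separable classes, 0 = up, 1 = down.
x_train = [[0.0], [0.1], [0.05], [1.0], [1.1], [1.05]]
y_train = [0, 0, 0, 1, 1, 1]

grid = GridSearchCV(estimator=svm.SVC(),
                    param_grid={'kernel': ['rbf', 'linear']},
                    cv=2)
grid.fit(x_train, y_train)
print(grid.best_params_)
```

After fit(), grid.best_estimator_ is already refit on the full training data, so it can be evaluated directly with score().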
- After all, inputting the rate of change works better than inputting the raw stock price. (I tried raw prices, but it didn't work.)
- Deep learning is very popular, but other approaches can hold their own too.
- It's fun that you can get things running relatively easily even without understanding the algorithms at all.
- Grid search (the feature that brute-forces parameters) takes some time. If you want to try many parameters, be prepared spec-wise. (Is this what "the curse of dimensionality" refers to?)
- Unrelated, but I used Eclipse for this development (until now, a text editor). It's super easy.
- There is too little Japanese information on scikit-learn. Could someone translate the official tutorials into Japanese...
- Official Tutorial
- Official API Reference
- Predict the future with machine learning: predicting future stock prices with scikit-learn decision trees