This is the Nikkei Stock Average forecast that many people have tried; this time I attempted it with Random Forest, MLP, and CNN. As a disclaimer up front: I take no responsibility for any losses that may occur if you actually buy or sell using this method.
Basically, stocks, and financial assets in general, follow a random walk: even if you know all the information up to one point in time, you fundamentally cannot predict the value at the next point. If you could, everyone would be rich, and there is no such free lunch.
On the other hand, anomalies do exist in the stock market: for example, the small-cap effect and the value stock effect, fluctuations that deviate from theory and cannot be explained by it. (The small-cap and value effects were confirmed as anomalies that could not be captured by CAPM, the prevailing asset pricing theory at the time, and this led to the creation of the Fama-French model.)
It is difficult to keep profiting from an anomaly, because once it becomes widely known it gets priced into the market.
In deep learning and machine learning, the aim (in my personal understanding) is to find such anomalies. To do so, we grind through the calculations using data or methods that have not been used before.
Put another way, the expectation is that if you throw a large amount of data at the problem, something useful might come out of it.
A typical stock forecast targets the return, that is, the rate of change from the previous day. In other words: "use the data available up to a given day to predict whether the next day's price goes up or down."
However, the real problem is not that easy. There are practical issues: for example, the day's closing price is only known once the session has ended, so you cannot actually trade at the very price you used as input. Be aware that these points often matter more than expected in actual buying and selling.
Also, when forecasting using overseas stock price indexes, you need to fully account for the effect of the "time difference". If you don't, you will end up using future data to make predictions.
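As a minimal sketch of that alignment, assuming hypothetical daily closing-price Series nikkei_close and spx_close indexed by date (the exact lag depends on the markets involved):

import pandas as pd

# The US close dated day t is only published during the night before
# the Tokyo session of day t+1 (Japan time), so a naive join by
# calendar date would hand the model a value that is not yet known
# at the Tokyo close of the same day.
df = pd.concat({"nikkei": nikkei_close, "spx": spx_close}, axis=1)
df["spx_ret"] = df["spx"].pct_change().shift(1)  # use the previous day's US return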
That's enough of a preamble; let's build an actual model.
The data is the daily closing prices of the 225 constituents of the Nikkei Stock Average (the most recent constituent list). The prediction target is whether the Nikkei's closing price on the next day rises or falls relative to the previous day; in other words, the label is whether the close-to-close return is positive or negative. The training data runs from 2000/01/11 to 2007/12/30, and the test data from there to the most recent date.
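As a minimal sketch of the label construction, assuming a hypothetical Series nikkei_close of daily Nikkei closing prices indexed by date:

# Close-to-close return of day t relative to day t-1
ret = nikkei_close.pct_change()

# Label for day t: 1 if the close of day t+1 rises versus day t
y = (ret.shift(-1) > 0).astype(int)

# Split matching the article's periods
y_train = y["2000-01-11":"2007-12-30"]
y_test = y["2007-12-31":]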
As stated in the introduction, keep in mind that using the closing price to predict the closing-price return could not be used in a real trade. This is really just to see whether the results differ depending on the method.
For Random Forest, the features are laid out horizontally: we build a matrix with each stock's previous-day closing return in the column direction and the time points in the row direction.
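A minimal sketch of that layout, assuming a hypothetical DataFrame close_prices with one column per constituent stock, indexed by date:

# Rows: time points. Columns: each stock's previous-day closing return.
stock_ret = close_prices.pct_change()
x = stock_ret.shift(1).dropna(how="any")

x_train = x["2000-01-11":"2007-12-30"]
x_test = x["2007-12-31":]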
The MLP uses the same features as the Random Forest.
For the convolutional neural network, the input must be in image format, i.e., a 2D feature map, so we set the number of channels to 1 and use a 4D tensor. clm_dim and row_dim are the number of columns and rows of the 2D image at a given time point: clm_dim is the maximum number of stocks in any single industry, and row_dim is the number of industries. We embed the returns of each industry's stocks into that industry's row.
import numpy as np

clm_dim = max(industry_count["count"])   # max stocks in any industry -> image width
row_dim = len(industry_count)            # number of industries -> image height
l_sample = len(x_train)
t_sample = len(x_test)
print(row_dim, clm_dim)

x_train_mat = np.zeros((l_sample, 1, row_dim, clm_dim), dtype=np.float32)
x_test_mat = np.zeros((t_sample, 1, row_dim, clm_dim), dtype=np.float32)

for ind in industry_count["ind"]:
    # Process one industry at a time
    ind_code_list = ind_data[ind_data["ind"] == ind]["code"]
    len_3 = [i for i, ii in enumerate(industry_count["ind"]) if ii == ind]  # row index of this industry
    len_1 = 0  # sample (date) index
    for idx, row in x_train.iterrows():
        len_4 = 0  # column index within the industry's row
        for cc in ind_code_list:
            # x_train_mat[len_1, 0, len_3, len_4] = 1. if row[str(cc)] > 0 else -1.
            x_train_mat[len_1, 0, len_3, len_4] = row[str(cc)]
            len_4 += 1
        len_1 += 1
    len_1 = 0  # reset for the test set
    for idx, row in x_test.iterrows():
        len_4 = 0
        for cc in ind_code_list:
            x_test_mat[len_1, 0, len_3, len_4] = row[str(cc)]
            len_4 += 1
        len_1 += 1
The number of decision trees is 200.
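A minimal scikit-learn sketch of that setting; only n_estimators=200 comes from the text, the random seed and everything else are my assumptions, reusing the hypothetical x_train/y_train from above:

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(x_train, y_train.loc[x_train.index])  # align labels to the feature rows
pred_rf = rf.predict(x_test)
proba_rf = rf.predict_proba(x_test)[:, 1]    # class-1 probability, for AUC later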
The MLP is a 3-layer multi-layer perceptron with 1000 hidden-layer nodes, trained for 100 epochs.
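The article does not state which framework was used. As one possible sketch in Keras, with the stated layer width and epoch count and everything else assumed:

from tensorflow import keras

mlp = keras.Sequential([
    keras.layers.Input(shape=(x_train.shape[1],)),
    keras.layers.Dense(1000, activation="relu"),   # single hidden layer, 1000 nodes
    keras.layers.Dense(2, activation="softmax"),   # up / down
])
mlp.compile(optimizer="adam",
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])
mlp.fit(x_train.values, y_train.loc[x_train.index].values,
        epochs=100, batch_size=32)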
The CNN applies one round of convolution → average pooling, followed by one hidden layer with 1000 nodes. The filter is an asymmetric 2x3 filter, the pooling size is 1x2, and the number of output channels is 30.
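Again as an assumed Keras sketch: filter size, pooling size, channel count, and hidden width come from the text, while the rest, including moving the channel axis to Keras's channels-last default, is my own choice:

from tensorflow import keras

# The 4D tensors built above are (sample, channel, row, column);
# Keras defaults to channels-last, so move the channel axis to the end.
x_tr = x_train_mat.transpose(0, 2, 3, 1)

cnn = keras.Sequential([
    keras.layers.Input(shape=(row_dim, clm_dim, 1)),
    keras.layers.Conv2D(30, kernel_size=(2, 3), activation="relu"),  # 30 channels, 2x3 filter
    keras.layers.AveragePooling2D(pool_size=(1, 2)),
    keras.layers.Flatten(),
    keras.layers.Dense(1000, activation="relu"),   # hidden layer, 1000 nodes
    keras.layers.Dense(2, activation="softmax"),
])
cnn.compile(optimizer="adam",
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])
cnn.fit(x_tr, y_train.loc[x_train.index].values, epochs=100, batch_size=32)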
Each model was evaluated with sklearn's classification report and AUC.
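For reference, the measurement itself might look like this, reusing the hypothetical predictions from the Random Forest sketch above (the same applies to the other models):

from sklearn.metrics import classification_report, roc_auc_score

y_true = y_test.loc[x_test.index]            # align labels to the feature rows
print(classification_report(y_true, pred_rf))
print("AUC:", roc_auc_score(y_true, proba_rf))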
The result is that the CNN performs slightly better than the others.
Looking at the CNN's accuracy by period, 2013, during Abenomics, was the highest. That makes sense, since it was a period with a clear trend.