This is the Nikkei Stock Average forecast that many people have tried; this time I attempted it with Random Forest, MLP, and CNN. As a disclaimer up front: I take no responsibility for any losses that may occur if you actually buy or sell using this method.
Basically, stocks, and financial assets in general, follow a random walk: even if you know all the information up to one point in time, you fundamentally cannot predict the value at the next point. If you could, everyone would be rich, and there is no such free lunch.
On the other hand, anomalies do exist in the stock market: for example, the small-cap effect and the value stock effect, fluctuations that deviate from theory and cannot be explained by it. (The small-cap and value effects were confirmed as anomalies that could not be captured by CAPM, the prevailing asset pricing theory at the time, and this led to the creation of the Fama-French model.)
It is difficult to keep profiting from an anomaly, because once it becomes widely known it gets priced into the market.
In deep learning and machine learning, the aim (in my personal understanding) is to find such anomalies. To do so, we grind through the calculations using data or methods that have not been used before.
Put another way, the expectation is that if you throw a large amount of data at the problem, something useful might come out of it.
A typical stock forecast targets the return, that is, the rate of change from the previous day. In other words: "use the data available up to a given day to predict whether the next day's price goes up or down."
However, the real problem is not that easy. There are practical issues: for example, the day's closing price is only known once the session has ended, so you cannot actually trade at the very price you used as input. Be aware that these points often matter more than expected in actual buying and selling.
Also, when forecasting using overseas stock price indexes, you need to fully account for the effect of the "time difference". If you don't, you will end up using future data to make predictions.
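As a minimal sketch of that alignment, assuming hypothetical daily closing-price Series nikkei_close and spx_close indexed by date (the exact lag depends on the markets involved):

import pandas as pd

# The US close dated day t is only published during the night before
# the Tokyo session of day t+1 (Japan time), so a naive join by
# calendar date would hand the model a value that is not yet known
# at the Tokyo close of the same day.
df = pd.concat({"nikkei": nikkei_close, "spx": spx_close}, axis=1)
df["spx_ret"] = df["spx"].pct_change().shift(1)  # use the previous day's US return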
That's enough of a preamble; let's build an actual model.
The data is the daily closing prices of the 225 constituents of the Nikkei Stock Average (the most recent constituent list). The prediction target is whether the Nikkei's closing price on the next day rises or falls relative to the previous day; in other words, the label is whether the close-to-close return is positive or negative. The training data runs from 2000/01/11 to 2007/12/30, and the test data from there to the most recent date.
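As a minimal sketch of the label construction, assuming a hypothetical Series nikkei_close of daily Nikkei closing prices indexed by date:

# Close-to-close return of day t relative to day t-1
ret = nikkei_close.pct_change()

# Label for day t: 1 if the close of day t+1 rises versus day t
y = (ret.shift(-1) > 0).astype(int)

# Split matching the article's periods
y_train = y["2000-01-11":"2007-12-30"]
y_test = y["2007-12-31":]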
As stated in the introduction, keep in mind that using the closing price to predict the closing-price return could not be used in a real trade. This is really just to see whether the results differ depending on the method.
For Random Forest, the features are laid out horizontally: we build a matrix with each stock's previous-day closing return in the column direction and the time points in the row direction.
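A minimal sketch of that layout, assuming a hypothetical DataFrame close_prices with one column per constituent stock, indexed by date:

# Rows: time points. Columns: each stock's previous-day closing return.
stock_ret = close_prices.pct_change()
x = stock_ret.shift(1).dropna(how="any")

x_train = x["2000-01-11":"2007-12-30"]
x_test = x["2007-12-31":]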
The MLP uses the same features as the Random Forest.
For the convolutional neural network, the input must be in image format, i.e., a 2D feature map, so we set the number of channels to 1 and use a 4D tensor. clm_dim and row_dim are the number of columns and rows of the 2D image at a given time point: clm_dim is the maximum number of stocks in any single industry, and row_dim is the number of industries. We embed the returns of each industry's stocks into that industry's row.
import numpy as np

clm_dim = max(industry_count["count"])   # max stocks in any industry -> image width
row_dim = len(industry_count)            # number of industries -> image height
l_sample = len(x_train)
t_sample = len(x_test)
print(row_dim, clm_dim)

x_train_mat = np.zeros((l_sample, 1, row_dim, clm_dim), dtype=np.float32)
x_test_mat = np.zeros((t_sample, 1, row_dim, clm_dim), dtype=np.float32)

for ind in industry_count["ind"]:
    # Process one industry at a time
    ind_code_list = ind_data[ind_data["ind"] == ind]["code"]
    len_3 = [i for i, ii in enumerate(industry_count["ind"]) if ii == ind]  # row index of this industry
    len_1 = 0  # sample (date) index
    for idx, row in x_train.iterrows():
        len_4 = 0  # column index within the industry's row
        for cc in ind_code_list:
            # x_train_mat[len_1, 0, len_3, len_4] = 1. if row[str(cc)] > 0 else -1.
            x_train_mat[len_1, 0, len_3, len_4] = row[str(cc)]
            len_4 += 1
        len_1 += 1
    len_1 = 0  # reset for the test set
    for idx, row in x_test.iterrows():
        len_4 = 0
        for cc in ind_code_list:
            x_test_mat[len_1, 0, len_3, len_4] = row[str(cc)]
            len_4 += 1
        len_1 += 1
The number of decision trees is 200.
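A minimal scikit-learn sketch of that setting; only n_estimators=200 comes from the text, the random seed and everything else are my assumptions, reusing the hypothetical x_train/y_train from above:

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(x_train, y_train.loc[x_train.index])  # align labels to the feature rows
pred_rf = rf.predict(x_test)
proba_rf = rf.predict_proba(x_test)[:, 1]    # class-1 probability, for AUC later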
The MLP is a 3-layer multi-layer perceptron with 1000 hidden-layer nodes, trained for 100 epochs.
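The article does not state which framework was used. As one possible sketch in Keras, with the stated layer width and epoch count and everything else assumed:

from tensorflow import keras

mlp = keras.Sequential([
    keras.layers.Input(shape=(x_train.shape[1],)),
    keras.layers.Dense(1000, activation="relu"),   # single hidden layer, 1000 nodes
    keras.layers.Dense(2, activation="softmax"),   # up / down
])
mlp.compile(optimizer="adam",
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])
mlp.fit(x_train.values, y_train.loc[x_train.index].values,
        epochs=100, batch_size=32)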
The CNN applies one round of convolution → average pooling, followed by one hidden layer with 1000 nodes. The filter is an asymmetric 2x3 filter, the pooling size is 1x2, and the number of output channels is 30.
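Again as an assumed Keras sketch: filter size, pooling size, channel count, and hidden width come from the text, while the rest, including moving the channel axis to Keras's channels-last default, is my own choice:

from tensorflow import keras

# The 4D tensors built above are (sample, channel, row, column);
# Keras defaults to channels-last, so move the channel axis to the end.
x_tr = x_train_mat.transpose(0, 2, 3, 1)

cnn = keras.Sequential([
    keras.layers.Input(shape=(row_dim, clm_dim, 1)),
    keras.layers.Conv2D(30, kernel_size=(2, 3), activation="relu"),  # 30 channels, 2x3 filter
    keras.layers.AveragePooling2D(pool_size=(1, 2)),
    keras.layers.Flatten(),
    keras.layers.Dense(1000, activation="relu"),   # hidden layer, 1000 nodes
    keras.layers.Dense(2, activation="softmax"),
])
cnn.compile(optimizer="adam",
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])
cnn.fit(x_tr, y_train.loc[x_train.index].values, epochs=100, batch_size=32)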
Each model was evaluated with sklearn's classification report and AUC.
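For reference, the measurement itself might look like this, reusing the hypothetical predictions from the Random Forest sketch above (the same applies to the other models):

from sklearn.metrics import classification_report, roc_auc_score

y_true = y_test.loc[x_test.index]            # align labels to the feature rows
print(classification_report(y_true, pred_rf))
print("AUC:", roc_auc_score(y_true, proba_rf))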
The result is that the CNN performs slightly better than the others.
Looking at the CNN's accuracy by period, 2013, during Abenomics, was the highest. That makes sense, since it was a period with a clear trend.