[Python] My stock price forecast [HFT]

1. Motivation

2. About the order book (board information)

Sell quantity (ASK)   Stock price   Buy quantity (BID)
500                   670
400                   669
600                   668
                      667           300
                      666           1,200
                      665           400

ASK                   Stock price   BID
500                   670
400                   669
500                   668
                      667           300
                      666           1,200
                      665           400

ASK                   Stock price   BID
500                   670
400                   669
600                   668
                      667           400
                      666           1,200
                      665           400
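
As a rough illustration of how such a snapshot is read, the following sketch takes the first table above and derives the best ask, the best bid, the spread, and the mid-price. The dictionaries `asks` and `bids` and the helper names are hypothetical and exist only for this example.

```python
# First snapshot from the tables above: price -> resting quantity.
asks = {670: 500, 669: 400, 668: 600}
bids = {667: 300, 666: 1200, 665: 400}

def best_quotes(asks, bids):
    """Return (best ask, best bid): the lowest sell price and the highest buy price."""
    return min(asks), max(bids)

def mid_price(asks, bids):
    """Average of the best ask and the best bid."""
    best_ask, best_bid = best_quotes(asks, bids)
    return (best_ask + best_bid) / 2

best_ask, best_bid = best_quotes(asks, bids)
spread = best_ask - best_bid
mid = mid_price(asks, bids)
print(best_ask, best_bid, spread, mid)  # 668 667 1 667.5
```

The mid-price computed here is the same quantity that the implementation in section 5 uses as the basis for its labels.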

3. FI-2010 dataset

What is the FI-2010 dataset?

Data overview

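import pandas as pd
import matplotlib.pyplot as plt

# Load the whitespace-separated matrix and draw it as a heat map.
data = pd.read_csv('Train_Dst_Auction_DecPre_CF_1.txt',
                   header=None, sep=r'\s+')
print(data.shape)
#=> (149, 47342)
plt.imshow(data, interpolation='nearest', vmin=0, vmax=0.75,
           cmap='jet', aspect=data.shape[1]/data.shape[0])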


# The first 40 rows of a column hold 10 levels x (ask, ask_vol, bid, bid_vol).
lob = data.iloc[:40,0].values
lob_df = pd.DataFrame(lob.reshape(10,4),
                      columns=['ask', 'ask_vol', 'bid', 'bid_vol'])

      ask  ask_vol     bid  bid_vol
0  0.2631  0.00392  0.2616  0.00663
1  0.2643  0.00028  0.2615  0.00500
2  0.2663  0.00165  0.2614  0.00500
3  0.2664  0.00500  0.2613  0.00043
4  0.2667  0.00039  0.2612  0.00646
5  0.2710  0.00700  0.2611  0.00200
6  0.2745  0.00200  0.2609  0.00199
7  0.2749  0.00487  0.2602  0.00081
8  0.2750  0.00300  0.2600  0.00197
9  0.2769  0.01000  0.2581  0.01321
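
The reshape to (10, 4) implies that the 40 order book rows are interleaved level by level. A small sketch of that mapping follows, assuming exactly the layout suggested by the column names above; `FIELDS` and `row_index` are hypothetical helpers and not part of the dataset or of any library.

```python
# Field order within one level, as implied by reshape(10, 4) and the column names.
FIELDS = ("ask", "ask_vol", "bid", "bid_vol")
N_LEVELS = 10

def row_index(level, field):
    """Row of the raw matrix that holds `field` at depth `level` (0 = best quote)."""
    if not 0 <= level < N_LEVELS:
        raise ValueError("level must be between 0 and 9")
    return 4 * level + FIELDS.index(field)

print(row_index(0, "ask"), row_index(0, "bid"))  # 0 2
print(row_index(9, "bid_vol"))                   # 39
```

Under this assumption, rows 0 and 2 are the best ask and best bid prices, which is why the mid-price in section 5 is taken from those two rows.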

4. Model

Training data and labels
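
The labelling rule used in the implementation in section 5 compares the mean of the next k mid-prices with the current mid-price and thresholds the relative change at alpha. The sketch below replays that rule on a made-up price series so the rolling and shift steps can be checked by hand. The function name `label_mid_prices` and the toy values are assumptions made for illustration.

```python
import pandas as pd

def label_mid_prices(mid, k, alpha):
    """Label each step 1 (up), -1 (down) or 0 (flat) from the next k mid-prices."""
    # rolling(k).mean() at t covers t-k+1..t; shifting by -k moves it to cover t+1..t+k.
    future_mean = mid.rolling(window=k).mean().shift(-k)
    out = pd.DataFrame({"mid": mid, "lt": (future_mean - mid) / mid}).dropna()
    out["label"] = 0
    out.loc[out["lt"] > alpha, "label"] = 1
    out.loc[out["lt"] < -alpha, "label"] = -1
    return out

toy_mid = pd.Series([100.0, 100.0, 101.0, 102.0, 100.0, 98.0])
labels = label_mid_prices(toy_mid, k=2, alpha=0.01)
print(labels["label"].tolist())  # [0, 1, 0, -1]
```

The last k rows have no complete future window, so they drop out, just as `dropna()` removes them in the full implementation.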

Model architecture

5. Implementation

Data preprocessing

# The order book occupies the first 40 rows. Columns 29738 to 47294 are used as the data for the fifth stock.
lob = data.iloc[:40, 29738:47294].T.values
# Standardize prices and quantities separately: after reshape(-1, 2),
# column 0 holds every price and column 1 every quantity.
lob = lob.reshape(-1,2)
lob = (lob - lob.mean(axis=0)) / lob.std(axis=0)
lob = lob.reshape(-1,40)
lob_df = pd.DataFrame(lob)
# Compute the mid-price from the non-standardized best ask (row 0) and best bid (row 2).
lob_df['mid'] = (data.iloc[0,29738:47294].T.values + data.iloc[2,29738:47294].T.values) / 2
# Set the parameters: p = input window length, k = prediction horizon, alpha = label threshold.
p = 50
k = 50
alpha = 0.0003
# Label each row by comparing the mean of the next k mid-prices with the current mid-price.
lob_df['lt'] = (lob_df['mid'].rolling(window=k).mean().shift(-k)-lob_df['mid'])/lob_df['mid']
lob_df = lob_df.dropna()
lob_df['label'] = 0
lob_df.loc[lob_df['lt']>alpha, 'label'] = 1
lob_df.loc[lob_df['lt']<-alpha, 'label'] = -1
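import numpy as np
from sklearn.model_selection import train_test_split
from keras.utils import to_categorical
from keras.layers import Conv2D, Dense, Reshape, Input, LSTM
from keras import Model, backend
import tensorflow as tf
# Build the training data: each sample is a window of p consecutive order book rows.
X = np.zeros((len(lob_df)-p+1, p, 40, 1))
lob = lob_df.iloc[:,:40].values
for i in range(len(lob_df)-p+1):
    X[i] = lob[i:i+p,:].reshape(p,-1,1)
# One-hot encode the label of the last row of each window.
# Note: to_categorical indexes by label value, so -1 should land in the last column (class 2).
y = to_categorical(lob_df['label'].iloc[p-1:], 3)
print(X.shape, y.shape)
#=> (17457, 50, 40, 1) (17457, 3)
# Random 80/20 split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)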
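
The window-building loop above can also be written without an explicit Python loop. The following is a sketch of one alternative, assuming NumPy 1.20 or later for `sliding_window_view`; `make_windows` and the toy array are hypothetical and only serve to check that both approaches produce the same tensor.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_windows(lob, p):
    """Stack every run of p consecutive rows into shape (n-p+1, p, features, 1)."""
    # sliding_window_view appends the window axis last: (n-p+1, features, p).
    windows = sliding_window_view(lob, window_shape=p, axis=0)
    # Reorder to (n-p+1, p, features) and add a trailing channel axis.
    return windows.transpose(0, 2, 1)[..., np.newaxis]

rng = np.random.default_rng(0)
toy = rng.normal(size=(8, 4))  # 8 time steps, 4 features
X_fast = make_windows(toy, p=3)
# Same construction as the loop in the article, applied to the toy array.
X_loop = np.stack([toy[i:i+3].reshape(3, -1, 1) for i in range(8 - 3 + 1)])
print(X_fast.shape, np.array_equal(X_fast, X_loop))  # (6, 3, 4, 1) True
```

The result is a read-only view of the input, so calling `.copy()` on it may be needed before anything that writes to the array.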

Model building


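inputs = Input(shape=(p,40,1))
# Combine each price with its quantity: width 40 -> 20.
x = Conv2D(8, kernel_size=(1,2), strides=(1,2), activation='relu')(inputs)
# Combine the ask and bid sides of each level: width 20 -> 10.
x = Conv2D(8, kernel_size=(1,2), strides=(1,2), activation='relu')(x)
# Combine all 10 levels: width 10 -> 1.
x = Conv2D(8, kernel_size=(1,10), strides=1, activation='relu')(x)
# Drop the width axis so the LSTM receives a sequence of p steps with 8 features.
x = Reshape((p, 8))(x)
x = LSTM(8, activation='relu')(x)
x = Dense(16, activation='relu')(x)
outputs = Dense(3, activation='softmax')(x)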

model = Model(inputs=inputs, outputs=outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
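
To see why `Reshape((p, 8))` fits after the three convolutions, the feature-axis width can be traced by hand. The sketch below does that arithmetic, assuming the default `padding='valid'` of `Conv2D`; `conv_out` is a hypothetical helper and not a Keras function.

```python
def conv_out(size, kernel, stride):
    """Output length of an unpadded ('valid') convolution along one axis."""
    return (size - kernel) // stride + 1

p = 50
width = 40
widths = []
# (kernel, stride) along the feature axis for the three Conv2D layers.
for kernel, stride in [(2, 2), (2, 2), (10, 1)]:
    width = conv_out(width, kernel, stride)
    widths.append(width)
print(widths)  # [20, 10, 1]

# Along the time axis every kernel has height 1 and stride 1, so p is unchanged.
time_steps = conv_out(p, 1, 1)
print(time_steps)  # 50
```

With 8 filters in the last convolution, the tensor has shape (p, 1, 8) per sample, which reshapes to (p, 8) without losing any values.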

Let's train!

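epochs = 50
batch_size = 256
# Pass the settings to fit(); otherwise Keras falls back to its defaults.
history = model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size,
                    validation_data=(X_test, y_test))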

Epoch 100/100 13965/13965 [==============================] - 5s 326us/step - loss: 0.6526 - acc: 0.6808 - val_loss: 0.6984 - val_acc: 0.6595

[Figure: save.png]

6. Discussion

