A story about clustering time series data of foreign exchange

Summary of this article

I tried clustering foreign exchange data.
I used k-means with Euclidean distance.
Combining data from a higher timeframe (bars covering a longer period) appears to be useful.
With the higher-timeframe data, the label ratios (profit taking: 1, loss cut: -1, settlement by holding time: 0) differed more clearly from cluster to cluster.

Development environment

- Colaboratory
- scikit-learn

Data preparation

Using USD/JPY data from 2018.01 to 2019.04, I took the entry points given by moving-average golden crosses on the 5-minute bars as sample data (2,482 samples).
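
As a minimal sketch of how such entry points could be detected, the following marks bars where a short simple moving average crosses above a long one. The article does not state the moving-average periods, so `short=5` and `long=25` are hypothetical values, and the price series is synthetic.

```python
import numpy as np
import pandas as pd


def golden_cross_entries(close: pd.Series, short: int = 5, long: int = 25) -> pd.Series:
    """True on bars where the short SMA moves from at-or-below to above the long SMA.

    The periods are hypothetical; the article does not specify them.
    """
    sma_short = close.rolling(short).mean()
    sma_long = close.rolling(long).mean()
    above = sma_short > sma_long
    # Only count a cross when the long SMA already existed on the previous bar,
    # so the warm-up period cannot produce a spurious signal.
    had_history = sma_long.shift(1).notna()
    return above & ~above.shift(1, fill_value=False) & had_history


# Synthetic 5-minute closes: a decline followed by a rise (V shape).
index = pd.date_range("2018-01-01", periods=100, freq="5min")
close = pd.Series(
    np.concatenate([np.linspace(110, 100, 50), np.linspace(100, 110, 50)]),
    index=index,
)
entries = golden_cross_entries(close)
print(entries[entries].index)  # timestamps of the detected golden crosses
```

With real data, `close` would be the 5-minute USD/JPY closing prices, and each `True` bar would become one sample.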

Feature values:
- Approximately 3 hours of OHLC data before the entry point
- RSI
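
The feature values listed above might be assembled roughly as follows. This is a sketch under assumptions: 3 hours is taken as 36 five-minute bars, the RSI uses a simple-average variant with a hypothetical period of 14, the column names are made up for illustration, and the bars are synthetic.

```python
import numpy as np
import pandas as pd


def rsi(close: pd.Series, period: int = 14) -> pd.Series:
    """Simple-average RSI; Wilder's smoothing would give slightly different values."""
    diff = close.diff()
    gain = diff.clip(lower=0).rolling(period).mean()
    loss = (-diff.clip(upper=0)).rolling(period).mean()
    # gain / (gain + loss) equals 1 - 1 / (1 + gain / loss) without dividing by loss alone.
    return 100 * gain / (gain + loss)


def window_features(bars: pd.DataFrame, entry_pos: int, n_bars: int = 36) -> np.ndarray:
    """Return an (n_bars, 5) array of open, high, low, close, RSI for the bars before entry_pos."""
    if entry_pos < n_bars:
        raise ValueError("not enough history before the entry point")
    cols = ["open", "high", "low", "close", "rsi"]
    # The entry bar itself is excluded: only data before the entry is used.
    return bars.iloc[entry_pos - n_bars:entry_pos][cols].to_numpy()


# Synthetic 5-minute OHLC bars built from a random walk.
rng = np.random.default_rng(0)
index = pd.date_range("2018-01-01", periods=200, freq="5min")
close = pd.Series(110 + np.cumsum(rng.normal(0, 0.02, 200)), index=index)
open_ = close.shift(1).fillna(close.iloc[0])
bars = pd.DataFrame({
    "open": open_,
    "high": np.maximum(open_, close) + 0.01,
    "low": np.minimum(open_, close) - 0.01,
    "close": close,
})
bars["rsi"] = rsi(bars["close"])

X = window_features(bars, entry_pos=100)
print(X.shape)
```

Stacking one such array per entry point gives a 3-D array of shape (samples, bars, features), which is the usual input layout for time series clustering.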

Labeling

Labeling was done according to the following rules.

Result	Label
Profit taking	1
Loss cut	-1
Settlement by holding time	0

This time, I set the loss-cut and profit-taking lines so that the three labels occur in roughly equal proportions.
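
The labeling rules above can be sketched as a small function. The article does not give the actual line widths or holding time, so `take_profit`, `stop_loss`, and `max_hold` are hypothetical parameters, and a long position is assumed because the entry is a golden cross.

```python
import pandas as pd


def label_trade(bars: pd.DataFrame, entry_pos: int,
                take_profit: float, stop_loss: float, max_hold: int) -> int:
    """Label a long trade entered at the close of bar entry_pos.

    Returns 1 for profit taking, -1 for loss cut, 0 for settlement by holding time.
    """
    entry_price = bars["close"].iloc[entry_pos]
    future = bars.iloc[entry_pos + 1:entry_pos + 1 + max_hold]
    for high, low in zip(future["high"], future["low"]):
        # If both lines are touched in the same bar, count it as a loss (pessimistic assumption).
        if low <= entry_price - stop_loss:
            return -1
        if high >= entry_price + take_profit:
            return 1
    return 0


def make_bars(closes):
    """Build toy bars whose high/low sit 0.02 around the close."""
    close = pd.Series(closes, dtype=float)
    return pd.DataFrame({"high": close + 0.02, "low": close - 0.02, "close": close})


rising = label_trade(make_bars([100, 100.1, 100.2, 100.3]), 0, 0.25, 0.25, 3)
falling = label_trade(make_bars([100, 99.9, 99.8, 99.7]), 0, 0.25, 0.25, 3)
flat = label_trade(make_bars([100, 100, 100, 100]), 0, 0.25, 0.25, 3)
print(rising, falling, flat)
```

To get roughly equal thirds as described, one would adjust `take_profit`, `stop_loss`, and `max_hold` and check the resulting label counts over all samples.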

Clustering

Expected result

As shown in the graph below, I expected "profit taking", "loss cut", and "settlement by holding time" to separate by cluster, with each cluster dominated by one outcome.

If that held, a sample that falls into an unfavorable cluster (cluster 2 in the graph) could be judged a bad setup and the trade skipped.

Result

I clustered the samples with TimeSeriesKMeans (the class is provided by tslearn, which follows the scikit-learn API), plotted the proportion of labels in each cluster, and sorted the clusters by win rate.

Not good enough. The highest win rate was 45% and the lowest was 22%. Since the labels are split almost evenly in the full sample (about 33% each), the clusters do separate the outcomes somewhat, but I would like a cleaner separation.
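
The clustering step and the per-cluster label ratios can be sketched as follows. This is not the code used for the article: it swaps in scikit-learn's `KMeans` on flattened windows, which is plain Euclidean k-means, instead of `TimeSeriesKMeans`. The number of clusters and the data are assumptions; the synthetic samples are deliberately well separated, unlike real exchange data.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans


def cluster_label_ratios(X: np.ndarray, labels: np.ndarray,
                         n_clusters: int = 3, seed: int = 0) -> pd.DataFrame:
    """Cluster samples of shape (n_samples, n_bars, n_features) and return,
    per cluster, the share of each label, sorted by win rate (label 1)."""
    flat = X.reshape(len(X), -1)  # Euclidean distance on the flattened window
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    clusters = model.fit_predict(flat)
    ratios = pd.crosstab(
        pd.Series(clusters, name="cluster"),
        pd.Series(labels, name="label"),
        normalize="index",  # each row sums to 1
    )
    # Assumes label 1 occurs at least once; otherwise this column would be missing.
    return ratios.sort_values(1, ascending=False)


# Synthetic stand-in: three groups of windows with different levels and labels.
rng = np.random.default_rng(0)
X = np.concatenate([c + 0.1 * rng.standard_normal((20, 36, 5)) for c in (-5, 0, 5)])
labels = np.repeat([-1, 0, 1], 20)

ratios = cluster_label_ratios(X, labels)
print(ratios)
```

Each row of `ratios` is one cluster, and the column for label 1 is that cluster's win rate, which is the quantity compared in the text (45% versus 22%).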

Adding a higher timeframe

Aiming for improvement, I added the following higher-timeframe information to the features.

- An oscillator indicator on the 30-minute bars
- A trend-following indicator on the 2-hour bars
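
One way to derive such higher-timeframe features from the 5-minute data is to resample, compute the indicator, and align the values back to the 5-minute bars. The article does not name the indicators, so RSI (as the oscillator) and a simple moving average (as the trend-following indicator) are stand-ins, and the periods are hypothetical.

```python
import numpy as np
import pandas as pd


def rsi(close: pd.Series, period: int = 14) -> pd.Series:
    """Simple-average RSI, used here as a stand-in oscillator."""
    diff = close.diff()
    gain = diff.clip(lower=0).rolling(period).mean()
    loss = (-diff.clip(upper=0)).rolling(period).mean()
    return 100 * gain / (gain + loss)


def higher_timeframe_feature(close_5m: pd.Series, rule: str, indicator) -> pd.Series:
    """Compute an indicator on resampled closes and align it to the 5-minute index."""
    htf_close = close_5m.resample(rule).last().dropna()
    # shift(1) so each 5-minute bar only sees the last *completed* higher-timeframe bar
    # (resampled bars are labelled by their start time), which avoids look-ahead.
    values = indicator(htf_close).shift(1)
    return values.reindex(close_5m.index, method="ffill")


# Synthetic 5-minute closes from a random walk.
rng = np.random.default_rng(0)
index = pd.date_range("2018-01-01", periods=2000, freq="5min")
close = pd.Series(110 + np.cumsum(rng.normal(0, 0.02, 2000)), index=index)

rsi_30m = higher_timeframe_feature(close, "30min", rsi)
sma_2h = higher_timeframe_feature(close, "120min", lambda c: c.rolling(5).mean())
print(rsi_30m.tail(3))
print(sma_2h.tail(3))
```

The resulting series share the 5-minute index, so they can be appended as extra feature columns to each sample window before clustering.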

The result of clustering with these added features is below.

The highest win rate was 63% and the lowest was 14%. Adding the higher-timeframe information improved the separation considerably, and it was good to confirm once again that higher-timeframe information is useful. Even with this result it seems difficult to avoid losses entirely, but I think it could be used to adjust position size.

Thank you for reading the article.
