Reference: [Finance Machine Learning](https://www.amazon.co.jp/%E3%83%95%E3%82%A1%E3%82%A4%E3%83%8A%E3%83%B3%E3%82%B9%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92%E2%80%95%E9%87%91%E8%9E%8D%E5%B8%82%E5%A0%B4%E5%88%86%E6%9E%90%E3%82%92%E5%A4%89%E3%81%88%E3%82%8B%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92%E3%82%A2%E3%83%AB%E3%82%B4%E3%83%AA%E3%82%BA%E3%83%A0%E3%81%AE%E7%90%86%E8%AB%96%E3%81%A8%E5%AE%9F%E8%B7%B5-%E3%83%9E%E3%83%AB%E3%82%B3%E3%82%B9%E3%83%BB%E3%83%AD%E3%83%9A%E3%82%B9%E3%83%BB%E3%83%87%E3%83%BB%E3%83%97%E3%83%A9%E3%83%89-ebook/dp/B0834XJQTY)

Motivation for this article

When forecasting financial data, you first need to define what you want to forecast, and the approach changes completely depending on that choice. The most familiar definition is probably the price change rate (return) of a stock from time $T$ to $T + 1$, or its sign, that is, whether the price goes up or down. However, such a label can be difficult to predict, and even when the accuracy is high, the average return and the Sharpe ratio can still be terrible. Labeling alone cannot solve this problem, but labeling is often neglected even though it carries a deep meaning.

Examples of labeling in time series data and their interpretation

For example, suppose you have daily OHLC data of the Nikkei Stock Average. If you want to predict the closing price on the next business day, then with the closing prices written as $X_1, X_2, \ldots, X_T$, the forecast label is $Y_n = \frac{X_{n+1} - X_n}{X_n}$ $(1 \leq n \leq T-1)$. Predicting this label corresponds to a strategy of opening a new long (or short) position in a product whose underlying asset is the Nikkei Stock Average at today's closing price and closing it at the market at the close of the next business day.

Even if you manage to predict the direction of this label with an accuracy of 55%, the strategy is not necessarily profitable. Ignore costs for now. If the probability of success is $p$, the profit on success is $\mu_+$, and the loss on failure is $\mu_-$, the expected value is $\mu = p\mu_+ - (1-p)\mu_-$, and $\mu > 0$ requires $p > \frac{\mu_-}{\mu_+ + \mu_-}$. Setting $\mu_- = 1$ without loss of generality gives $p > \frac{1}{\mu_+ + 1}$. With $p = 0.55$, this means $\mu_+ > 0.8181\ldots$; that is, the average win must be more than about 82% of the average loss.

In this way, you need to decide what to predict according to your purpose. In other words, my interpretation is that a forecast label is an investment strategy.
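As a minimal sketch of the arithmetic above, the label and the break-even condition could be computed as follows. The function names `next_day_return_labels`, `expected_value`, and `breakeven_accuracy`, as well as the toy prices, are hypothetical and only for illustration.

```python
import pandas as pd


def next_day_return_labels(close: pd.Series) -> pd.Series:
    """Y_n = (X_{n+1} - X_n) / X_n; the last row has no next close and is dropped."""
    return ((close.shift(-1) - close) / close).dropna()


def expected_value(p: float, mu_plus: float, mu_minus: float) -> float:
    """mu = p * mu_plus - (1 - p) * mu_minus, with costs ignored."""
    return p * mu_plus - (1.0 - p) * mu_minus


def breakeven_accuracy(mu_plus: float, mu_minus: float) -> float:
    """Accuracy p at which the expected value is exactly zero."""
    return mu_minus / (mu_plus + mu_minus)


# Toy closing prices (assumption: not real Nikkei data).
close = pd.Series([100.0, 102.0, 101.0, 103.0])
labels = next_day_return_labels(close)

# With mu_minus normalized to 1, p = 0.55 needs mu_plus > 1/0.55 - 1 (about 0.818).
required_mu_plus = 1.0 / 0.55 - 1.0
ev_small_win = expected_value(0.55, 0.5, 1.0)  # wins too small: negative
ev_large_win = expected_value(0.55, 1.0, 1.0)  # wins large enough: positive
```

The same 55% accuracy gives a negative expected value when the average win is half the average loss and a positive one when they are equal, which is why accuracy alone does not decide whether the strategy works.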

Labeling application example

Building on the example above, is there a label for which an accuracy above 50% is sufficient? Consider the following strategy. We make the strong assumption that we can trade at the mid price of the asset and that liquidity is excellent (the price does not jump). Open a new position at $T = 0$, and close it as soon as the mid price has moved by +1bps or -1bps. This is the simplest binomial model introduced in finance. Because the profit and the loss have the same size, any accuracy above 50% on the predicted label gives a positive expected value.
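A rough way to check this claim is to compare the closed-form expected value with a small simulation. This is only a sketch under the assumptions stated above (symmetric ±1bps payoff, no costs, independent trades); the function names `expected_pnl_bps` and `simulate_mean_pnl_bps` are hypothetical.

```python
import numpy as np


def expected_pnl_bps(p: float, step_bps: float = 1.0) -> float:
    """Expected P&L per trade when wins and losses are both step_bps."""
    return p * step_bps - (1.0 - p) * step_bps  # equals (2p - 1) * step_bps


def simulate_mean_pnl_bps(p: float, n_trades: int, seed: int = 0) -> float:
    """Monte Carlo check: each trade gains 1bps with probability p, else loses 1bps."""
    rng = np.random.default_rng(seed)
    correct = rng.random(n_trades) < p
    pnl = np.where(correct, 1.0, -1.0)
    return float(pnl.mean())


ev_51 = expected_pnl_bps(0.51)  # slightly positive, about 0.02 bps per trade
ev_50 = expected_pnl_bps(0.50)  # exactly break-even
sim_55 = simulate_mean_pnl_bps(0.55, n_trades=200_000)  # close to 0.10 bps
```

Since the expected value is $(2p - 1)$ bps per trade, it turns positive as soon as $p$ exceeds 0.5, unlike the next-day return label where the required accuracy depends on the ratio of wins to losses.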

So how do you label it?

The data is tick data of the order book mid price (`mid`).

I will explain using Python code.

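`label.py`

```python
import numpy as np

# df["mid"]: tick-by-tick mid price. Zero moves are filled with the next non-zero move.
labels = df["mid"].diff().shift(-1).replace(0, np.nan).bfill()
labels = np.sign(labels)  # keep only the direction: +1 or -1
```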

- The price changes by 1bps at a time, so look at the change with `diff`.
- Next, we want the difference between $X_T$ and $X_{T+1}$, so shift the index back by one step with `shift(-1)`.
- If the difference is 0, no trade is closed, so replace 0 with NaN and back-fill with the next non-zero difference.
- The position is closed only when the difference is not 0, so if the stopping time is $t$, the profit of the strategy entered at time $T = 0$ is $X_t - X_0$.
- Finally, we want two labels (1 or -1), so keep only the sign.
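The steps above can be traced on a small example. The following sketch uses a toy `mid` series (an assumption for illustration; real data would move in 1bps steps) and keeps each intermediate result in its own column. The `steps` DataFrame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy mid prices (hypothetical values, not real tick data).
df = pd.DataFrame({"mid": [100.0, 100.0, 101.0, 101.0, 101.0, 100.0, 100.0]})

steps = pd.DataFrame({"mid": df["mid"]})
steps["diff"] = df["mid"].diff()               # X_T - X_{T-1}
steps["next_diff"] = steps["diff"].shift(-1)   # X_{T+1} - X_T
# Ticks with no move take the next non-zero move instead.
steps["next_move"] = steps["next_diff"].replace(0, np.nan).bfill()
steps["label"] = np.sign(steps["next_move"])   # +1 or -1

# Rows after the last price move have no label and are dropped.
labels = steps["label"].dropna().astype(int)
```

On this toy series the labels come out as 1, 1, -1, -1, -1, and the last two ticks stay unlabeled because the price never moves again after them.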

Other

The book also introduces other labeling methods, such as the Triple-Barrier method and the Trend-Scanning method, so why not try them as well?
