Model Complexity and Robustness

Checking how model accuracy deteriorates, depending on model complexity, when light noise is added to the data

I tend to think of the data used to build a model as a single sample drawn from a population of possible past events, so the sample that actually appeared could just as well have been a slightly blurred version of what we see. That is something worth caring about. If "think about a blurred sample" is read as "add noise to the observed sample", it can serve as a way to check the robustness of a model (how much it depends on the data). That was the idea, so I built it.

Roughly speaking, the motivation was this: suppose I have built a model of the "this was never split into test data and training data!!" kind, and in reality the data can turn out slightly different. I wanted to see how much the accuracy of the model deteriorates when that happens.

Forecast target: next week's USDJPY return
Data used for forecasting: returns for the last 4 weeks
Prediction method: PolynomialFeatures & LinearRegression
Complexity: the degree passed to PolynomialFeatures, from 1 to 6
Added noise: normally distributed random numbers with a standard deviation of 0.3 times the volatility of the original data

If both the model and the noise are simple, I think the deterioration of prediction accuracy could be worked out with a formula. Building it as a simulation instead may make it possible to handle a variety of models, as well as noise that is neither i.i.d. nor normally distributed. I am not sure.
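To make the "formula versus simulation" remark a little more concrete, here is a minimal sketch for the simplest case only: a degree-1 linear model with i.i.d. normal noise added to the inputs. The synthetic data, the coefficient values and names such as `analytic_increase` are assumptions made for this illustration and are not part of the experiment in this article. For a linear model the expected increase in mean squared error is c^2 * sum_j(beta_j^2 * s_j^2), where c is the noise scale and s_j the volatility of column j, and the simulated value should land close to it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the 4 lagged returns (assumption: not real USDJPY data)
n, p = 500, 4
X = rng.normal(size=(n, p))
true_beta = np.array([0.5, -0.3, 0.2, 0.1])  # hypothetical coefficients
y = X @ true_beta + 0.5 * rng.normal(size=n)

model = LinearRegression().fit(X, y)
base_mse = np.mean((y - model.predict(X)) ** 2)

c = 0.3                    # noise scale, playing the role of m_/10
s = X.std(axis=0, ddof=1)  # per-column volatility

# Analytic expectation of the MSE increase for a linear model
analytic_increase = c ** 2 * np.sum(model.coef_ ** 2 * s ** 2)

# Simulated MSE increase, averaged over many noise draws
n_draws = 300
noisy_mse = np.empty(n_draws)
for d in range(n_draws):
    X_r = X + c * s * rng.normal(size=X.shape)
    noisy_mse[d] = np.mean((y - model.predict(X_r)) ** 2)
simulated_increase = noisy_mse.mean() - base_mse

print(f"analytic:  {analytic_increase:.4f}")
print(f"simulated: {simulated_increase:.4f}")
```

Once polynomial terms or non-normal noise enter, the closed form stops being this simple, which is where the simulation approach earns its keep.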

Import etc.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# import statsmodels.api as sm
# from statsmodels.tsa.arima_model import ARIMA
import pandas_datareader.data as web
import yfinance as yf
from numpy.random import randn

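from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, LassoCV  # LassoCV is used in Addendum 2
from sklearn.ensemble import RandomForestRegressor  # used in Addendum 1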

import seaborn as sns
import warnings
import sys

warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-darkgrid')  # this style was named 'seaborn-darkgrid' before matplotlib 3.6
plt.rcParams['axes.xmargin'] = 0.01
plt.rcParams['axes.ymargin'] = 0.01

Get USDJPY from yfinance and create weekly returns

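# auto_adjust=False keeps the 'Adj Close' column on recent yfinance versions
ReadDF = yf.download('JPY=X', start="1995-01-01", end="2019-10-30", auto_adjust=False)
ReadDF.index = pd.to_datetime(ReadDF.index)
# Last quote of each week, then the simple week-over-week return
IndexValueReadDF_rsmpl = ReadDF.resample('W').last()['Adj Close']
ReadDF = IndexValueReadDF_rsmpl / IndexValueReadDF_rsmpl.shift(1) - 1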
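For readers who want to see what the weekly resampling does without downloading anything, here is a small sketch on a made-up daily price series. The prices and dates are hypothetical; only the `resample('W').last()` and `shift(1)` steps mirror the code above.

```python
import numpy as np
import pandas as pd

# Synthetic daily prices on business days (assumption: stand-in for downloaded quotes)
dates = pd.date_range("2019-01-01", periods=15, freq="B")
daily = pd.Series(np.arange(100.0, 115.0), index=dates)

# Last observed price of each calendar week (with 'W', weeks end on Sunday)
weekly = daily.resample("W").last()

# Simple week-over-week return; the first week has no predecessor and is NaN
weekly_ret = weekly / weekly.shift(1) - 1

print(weekly)
print(weekly_ret)
```

The 15 business days collapse into 4 weekly observations, and the first return is NaN because there is no earlier week to compare against.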

Creating the explanatory variables (X) and the explained variable (y)

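mkt = 'USDJPY'

ret_df = pd.DataFrame()
ret_df[mkt] = ReadDF

# Lag features: the returns of the previous 1 to 4 weeks
test = pd.DataFrame(ret_df[mkt])
for i in range(1, 5):
    test['i_' + str(i)] = test[mkt].shift(i)

test = test.dropna(axis=0)
X = test.iloc[:, 1:]  # lagged returns (.ix was removed from pandas)
y = test[mkt]         # return to be predicted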
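To check which way the lags point, here is the same construction applied to a tiny hand-made series. The numbers and the column name `ret` are made up for illustration.

```python
import pandas as pd

# Tiny hypothetical return series, chosen so the lags are easy to read
ret = pd.Series([0.01, -0.02, 0.03, 0.00, 0.02, -0.01, 0.01], name="ret")

frame = pd.DataFrame(ret)
for i in range(1, 5):
    # Column i_k holds the return observed k rows (weeks) earlier
    frame["i_" + str(i)] = frame["ret"].shift(i)

frame = frame.dropna(axis=0)  # the first 4 rows lack a full 4-week history
X_demo = frame.iloc[:, 1:]    # lagged returns (features)
y_demo = frame["ret"]         # return to be predicted (target)

print(frame)
```

Each remaining row pairs a return with the four returns that preceded it, so the features never contain information from the week being predicted.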

Model creation (original sample & noise-added samples)

m_ = 3  # noise scale: m_/10 = 0.3 times the volatility of each column
output_degree = {}

for k in range(1, 7):
    polynomial_features = PolynomialFeatures(degree=k, include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", linear_regression)])

    pipeline.fit(X, y)

    k_sample = pd.DataFrame()

    for l in range(0, 300):

        # Standard normal random numbers, scaled by the volatility of each column
        eps_0 = pd.DataFrame(randn(X.shape[0], X.shape[1]))
        eps_1 = eps_0.apply(lambda x: x * list(X.std()), axis=1)

        eps_1.columns = X.columns
        eps_1.index = X.index

        # Add the noise to the original data
        X_r = X + m_/10 * eps_1

        signal = pd.DataFrame()
        signal[mkt] = np.sign(pd.DataFrame(pipeline.predict(X_r)))[0]

        signal.index = y.index

        k_sample['s_' + str(l)] = (pd.DataFrame(signal[mkt]) * pd.DataFrame(y)).iloc[:, 0]

    signal_IS = pd.DataFrame()
    # signal_IS[mkt] = np.sign(pd.DataFrame(pipeline.predict(X.ix[:, 0][:, np.newaxis])))[0]

    signal_IS[mkt] = np.sign(pd.DataFrame(pipeline.predict(X)))[0]

    signal_IS.index = y.index

    k_sample['IS'] = (pd.DataFrame(signal_IS[mkt])*pd.DataFrame(y)).iloc[:, 0]
    k_sample[mkt] = pd.DataFrame(y).iloc[:, 0]

    output_degree['degree_' + str(k)] = k_sample
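The loop above needs the downloaded USDJPY data, so here is a compact sketch of the same idea that runs on synthetic data. It assumes a pure-noise target, where nothing can really be learned, so any in-sample edge has to be over-fitting. The helper names `return_risk` and `degree_sweep` and all the numbers are hypothetical; this is one possible way to package the procedure, not the code that produced the results in this article.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression


def return_risk(strategy_returns, periods=50):
    """Annualized mean over annualized volatility, assuming 50 weeks per year."""
    return strategy_returns.mean() * periods / (strategy_returns.std(ddof=1) * np.sqrt(periods))


def degree_sweep(X, y, degrees, scale=0.3, n_draws=100, seed=0):
    """Return {degree: (in-sample ratio, mean ratio over noise-added samples)}."""
    rng = np.random.default_rng(seed)
    vol = X.std(axis=0, ddof=1)
    out = {}
    for k in degrees:
        pipe = Pipeline([
            ("polynomial_features", PolynomialFeatures(degree=k, include_bias=False)),
            ("linear_regression", LinearRegression()),
        ])
        pipe.fit(X, y)  # fitted once, on the original sample only
        in_sample = return_risk(np.sign(pipe.predict(X)) * y)
        noisy = [
            return_risk(np.sign(pipe.predict(X + scale * vol * rng.normal(size=X.shape))) * y)
            for _ in range(n_draws)
        ]
        out[k] = (float(in_sample), float(np.mean(noisy)))
    return out


# Pure-noise data (assumption): 4 "lagged returns" and an unrelated target
rng = np.random.default_rng(1)
X_syn = rng.normal(size=(200, 4))
y_syn = rng.normal(size=200)

results = degree_sweep(X_syn, y_syn, degrees=[1, 2, 3, 4])
for k, (is_ratio, noisy_ratio) in results.items():
    print(f"degree {k}: in-sample {is_ratio:.2f}, noisy mean {noisy_ratio:.2f}")
```

The model is fitted once per degree and only the inputs at prediction time are perturbed, which is the same design choice as in the loop above: the question is how fragile an already-fitted model is, not how it would refit.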

Histogram output (one panel per degree N)

Gray line: Return / risk of the original data (USDJPY)
Red line: Return / risk of the strategy applied to the original sample
Blue line: Average return / risk of the strategy applied to the noise-added samples
Histogram: Return / risk of each noise-added sample (the values averaged by the blue line)


fig = plt.figure(figsize=(15, 7), dpi=80)
for k in range(1, 7):
    ax = fig.add_subplot(2, 3, k)

    for_stats = output_degree['degree_' + str(k)]
    # Annualized return / risk per column, assuming 50 weeks per year
    Performance = pd.DataFrame(for_stats.mean()*50 / (for_stats.std()*np.sqrt(50))).T

    # Return / risk of each noise-added sample, as a 1-D Series
    Performance_sim = Performance.drop(['IS', mkt], axis=1).iloc[0]

    ax.hist(Performance_sim, bins=30, color="dodgerblue", alpha=0.8)
    ax.axvline(x=Performance_sim.mean(), color="b")
    ax.axvline(x=Performance['IS'].iloc[0], color="tomato")
    ax.axvline(x=Performance[mkt].iloc[0], color="gray")
    ax.set_ylim([0, 40])
    ax.set_xlim([-0.3, 2.5])
    # The title shows the polynomial degree k (not the noise multiplier m_)
    ax.set_title('degree-N polynomial: ' + str(k))
fig.show()
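If numbers are easier to compare than histograms, the quantities drawn in each panel can also be collected into a small table. The sketch below assumes a frame laid out like `k_sample` (columns `s_0`, `s_1`, ..., `IS` and the market column). The function name `summarize`, the quantile choices and the random demo data are all hypothetical.

```python
import numpy as np
import pandas as pd


def summarize(k_sample, mkt="USDJPY", periods=50):
    """Reduce one k_sample-style frame to the numbers shown in a histogram panel."""
    ratio = k_sample.mean() * periods / (k_sample.std() * np.sqrt(periods))
    sims = ratio.drop(["IS", mkt])
    return pd.Series({
        "market": ratio[mkt],              # gray line
        "in_sample": ratio["IS"],          # red line
        "noisy_mean": sims.mean(),         # blue line
        "noisy_p05": sims.quantile(0.05),  # lower end of the histogram
        "noisy_p95": sims.quantile(0.95),  # upper end of the histogram
    })


# Hypothetical stand-in for output_degree['degree_k'] (random numbers, not real results)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(0.0, 0.01, size=(250, 5)),
                    columns=["s_0", "s_1", "s_2", "IS", "USDJPY"])

summary = summarize(demo)
print(summary.round(3))
```

Calling `summarize` on each entry of `output_degree` and stacking the results would give one row per degree, with the gap between `in_sample` and `noisy_mean` as a single-number measure of how much the model depends on the original sample.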

Result

Even in-sample, a simple model cannot predict the weekly return of USDJPY → performance does not change when the data is perturbed (because the model was never accurate in the first place)
Complex models that perform well in-sample can of course be found, but they depend heavily on the original data

This is an obvious result for data this simple, but the approach looks useful as a check whenever I build a model

Addendum 1

Above, the complexity of the model was the degree = k of PolynomialFeatures. Here the degree is fixed at 3, the model is RandomForestRegressor, and the complexity of the model is max_depth = k.

    pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("rf_regression", RandomForestRegressor(max_depth=k))])
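For completeness, here is a self-contained sketch of how this pipeline can be swept over `max_depth`. It runs on pure-noise synthetic data and reports in-sample R^2 instead of the return / risk figure used in the article. The data, `n_estimators=100` and `random_state=0` are assumptions made so that the example is reproducible.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.ensemble import RandomForestRegressor

# Pure-noise data (assumption): any in-sample fit is over-fitting
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 4))
y_syn = rng.normal(size=200)

in_sample_r2 = {}
for k in range(1, 7):
    pipeline = Pipeline([
        ("polynomial_features", PolynomialFeatures(degree=3, include_bias=False)),
        ("rf_regression", RandomForestRegressor(max_depth=k, n_estimators=100, random_state=0)),
    ])
    pipeline.fit(X_syn, y_syn)
    # Score on the training data itself, i.e. in-sample
    in_sample_r2[k] = pipeline.score(X_syn, y_syn)
    print(f"max_depth={k}: in-sample R^2 = {in_sample_r2[k]:.2f}")
```

Deeper trees fit the training sample more closely even though the target here is noise, which is the same kind of complexity knob that `degree` provides for the polynomial model.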

Result

Addendum 2

Going back to the example where the complexity of the model is the degree = k of PolynomialFeatures: adding explanatory variables is tempting, so as a way to systematically restrain that urge to keep building up the model, the regression part uses Lasso to select the explanatory variables.

    pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", LassoCV(cv=5))])
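To see the variable selection at work, here is a small sketch that fits this pipeline on synthetic data and counts how many polynomial terms keep a non-zero coefficient. The data-generating rule (the target depends linearly on the first column only) is an assumption made for the example and does not come from USDJPY.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LassoCV

# Synthetic data (assumption): only the first column carries signal
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(300, 4))
y_syn = 0.5 * X_syn[:, 0] + 0.5 * rng.normal(size=300)

pipeline = Pipeline([
    ("polynomial_features", PolynomialFeatures(degree=3, include_bias=False)),
    ("linear_regression", LassoCV(cv=5)),
])
pipeline.fit(X_syn, y_syn)

# Coefficients that Lasso shrank exactly to zero correspond to dropped terms
coef = pipeline.named_steps["linear_regression"].coef_
n_features = coef.size
n_selected = int(np.count_nonzero(coef))
print(f"{n_selected} of {n_features} polynomial terms kept")
```

The first coefficient belongs to the linear term of the first column, which is the one term that truly matters in this toy setup.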

Predicting next week's return is hard in the first place, and this suggests that what looked like predictive power was (probably) just the result of over-fitting.