I tend to think of the data used to build a model as one sample drawn from a population of possible past events. From that viewpoint, the sample that actually occurred could just as well have been a slightly blurred version of itself, and that is something worth caring about. If "thinking about the blurring of the sample" is taken to mean "adding noise to the observed sample", it might serve as a way to check a model's robustness (its dependence on the data). That was the idea, so I built it.
Roughly speaking: suppose I built a model without splitting the data into test data and training data! In reality the data could have been slightly different, and I wanted to see how much the model's accuracy deteriorates in that case. That was the motivation.
If the model and the noise are both simple, I think the deterioration in prediction accuracy could be calculated with mathematical formulas. Doing it by simulation instead may make it possible to handle a variety of models, as well as noise that is neither i.i.d. nor normally distributed, though I am not sure.
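The idea described above can be sketched on synthetic data before touching market data. This is a minimal sketch, not the article's code: the helper `noisy_scores`, the synthetic `X` and `y`, and the noise scale of 0.3 are all assumptions made for illustration, and R^2 is used as the accuracy measure only because it is what `LinearRegression.score` returns.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def noisy_scores(model, X, y, noise_scale=0.3, n_sims=100, seed=0):
    """Score an already fitted model on noise-perturbed copies of X.

    Hypothetical helper: noise is i.i.d. normal, scaled per column
    by noise_scale * std(column).
    """
    rng = np.random.default_rng(seed)
    col_std = X.std(axis=0)
    scores = np.empty(n_sims)
    for s in range(n_sims):
        X_noisy = X + noise_scale * col_std * rng.standard_normal(X.shape)
        scores[s] = model.score(X_noisy, y)  # R^2 on the blurred sample
    return scores

# Synthetic data (illustrative only): y depends linearly on two features.
rng = np.random.default_rng(42)
X = rng.standard_normal((500, 2))
y = X @ np.array([1.0, -0.5]) + 0.1 * rng.standard_normal(500)

model = LinearRegression().fit(X, y)  # fitted on the full sample, no split
in_sample = model.score(X, y)
blurred = noisy_scores(model, X, y)
print(f"in-sample R^2: {in_sample:.3f}, mean blurred R^2: {blurred.mean():.3f}")
```

The gap between the in-sample score and the distribution of blurred scores is the quantity the rest of the article looks at, there with a trading-style performance ratio instead of R^2.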
Import etc.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# import statsmodels.api as sm
# from statsmodels.tsa.arima_model import ARIMA
import pandas_datareader.data as web
import yfinance as yf
from numpy.random import randn
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.ensemble import RandomForestRegressor
import seaborn as sns
import warnings
import sys
warnings.filterwarnings('ignore')
plt.style.use('seaborn-darkgrid')  # named 'seaborn-v0_8-darkgrid' on matplotlib >= 3.6
plt.rcParams['axes.xmargin'] = 0.01
plt.rcParams['axes.ymargin'] = 0.01
Get 'USDJPY' from yfinance and create weekly returns
ReadDF = yf.download('JPY=X', start="1995-01-01", end="2019-10-30")
ReadDF.index = pd.to_datetime(ReadDF.index)
IndexValueReadDF_rsmpl = ReadDF.resample('W').last()['Adj Close']
ReadDF = IndexValueReadDF_rsmpl / IndexValueReadDF_rsmpl.shift(1) - 1
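The weekly-return step above can be checked offline on a synthetic price series. The sketch below assumes made-up daily prices (not real USDJPY data) and shows that the ratio form used above matches pandas' `pct_change`; the names `prices`, `ret_ratio`, and `ret_pct` are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic daily prices on business days (illustrative; not real USDJPY data).
idx = pd.bdate_range("2019-01-01", periods=60)
rng = np.random.default_rng(0)
prices = pd.Series(100 * np.exp(np.cumsum(0.005 * rng.standard_normal(len(idx)))),
                   index=idx)

weekly_close = prices.resample("W").last()             # last price of each week
ret_ratio = weekly_close / weekly_close.shift(1) - 1   # form used in the text
ret_pct = weekly_close.pct_change()                    # equivalent shorthand
```

The first weekly return is NaN in both forms because there is no previous week to compare against.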
Create the explanatory variables (X) and the explained variable (y)
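mkt = 'USDJPY'  # column name for the market
ret_df = pd.DataFrame()
ret_df[mkt] = ReadDF
test = pd.DataFrame(ret_df[mkt])
# lagged weekly returns (1 to 4 weeks back) are the explanatory variables
for i in range(1, 5):
    test['i_' + str(i)] = test[mkt].shift(i)
test = test.dropna(axis=0)
X = test.iloc[:, 1:]
y = test[mkt]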
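To make the lag construction above concrete, here is a small sketch on a seven-element toy series. The helper `make_lagged` and the toy values are assumptions for illustration; the column naming `i_1` to `i_4` follows the code above.

```python
import pandas as pd

def make_lagged(series, n_lags=4, name="ret"):
    """Return (X, y): y is the series, X holds its lags 1..n_lags."""
    df = pd.DataFrame({name: series})
    for i in range(1, n_lags + 1):
        df[f"i_{i}"] = df[name].shift(i)
    df = df.dropna(axis=0)  # the first n_lags rows have incomplete lags
    return df.iloc[:, 1:], df[name]

s = pd.Series([0.01, -0.02, 0.03, 0.00, 0.015, -0.01, 0.02])
X_demo, y_demo = make_lagged(s)
print(X_demo)
print(y_demo)
```

With four lags, the first usable row is the fifth observation, so seven observations leave three rows for fitting.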
Model creation (actual sample & noise added)
m_ = 3  # noise scale: m_/10 = 0.3 times each column's volatility
output_degree = {}
for k in range(1, 7):
    # model complexity = polynomial degree k
    polynomial_features = PolynomialFeatures(degree=k, include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", linear_regression)])
    pipeline.fit(X, y)
    k_sample = pd.DataFrame()
    for l in range(0, 300):
        # Normal random number generation, scaled by each column's std
        eps_0 = pd.DataFrame(randn(X.shape[0], X.shape[1]))
        eps_1 = eps_0.apply(lambda x: x * list(X.std()), axis=1)
        eps_1.columns = X.columns
        eps_1.index = X.index
        # Add to the original data
        X_r = X + m_/10 * eps_1
        # trade in the direction of the predicted sign
        signal = pd.DataFrame()
        signal[mkt] = np.sign(pd.DataFrame(pipeline.predict(X_r)))[0]
        signal.index = y.index
        k_sample['s_' + str(l)] = (pd.DataFrame(signal[mkt]) * pd.DataFrame(y)).iloc[:, 0]
    # in-sample signal: the same model on the data without noise
    signal_IS = pd.DataFrame()
    signal_IS[mkt] = np.sign(pd.DataFrame(pipeline.predict(X)))[0]
    signal_IS.index = y.index
    k_sample['IS'] = (pd.DataFrame(signal_IS[mkt]) * pd.DataFrame(y)).iloc[:, 0]
    k_sample[mkt] = pd.DataFrame(y).iloc[:, 0]
    output_degree['degree_' + str(k)] = k_sample
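The noise step in the loop above (standard normal draws scaled by each column's standard deviation, then multiplied by `m_/10`) can also be written with pandas broadcasting. The sketch below uses a synthetic `X_demo` as a stand-in for `X`; it is one possible way to write the step, not a replacement for the article's code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic stand-in for X: four columns with different volatilities.
X_demo = pd.DataFrame(rng.standard_normal((200, 4)) * [0.01, 0.02, 0.03, 0.04],
                      columns=["i_1", "i_2", "i_3", "i_4"])
m_ = 3

# Draw standard normal noise with the same index and columns as X_demo.
eps = pd.DataFrame(rng.standard_normal(X_demo.shape),
                   index=X_demo.index, columns=X_demo.columns)
eps_scaled = eps * X_demo.std()       # Series aligns on column labels
X_r = X_demo + m_ / 10 * eps_scaled   # blurred explanatory variables

# Ratio of added-noise volatility to original volatility, per column.
ratio = (X_r - X_demo).std() / X_demo.std()
print(ratio)
```

Each entry of `ratio` should come out near `m_/10`, which is a quick way to confirm that the noise is proportional to each column's own volatility.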
Output histogram (every degree N)
fig = plt.figure(figsize=(15, 7), dpi=80)
for k in range(1, 7):
    ax = fig.add_subplot(2, 3, k)
    for_stats = output_degree['degree_' + str(k)]
    # annualized return/risk ratio per column, assuming about 50 weeks per year
    Performance = for_stats.mean() * 50 / (for_stats.std() * np.sqrt(50))
    sim = Performance.drop(['IS', mkt])  # the 300 noise-added simulations
    ax.hist(sim, bins=30, color="dodgerblue", alpha=0.8)
    ax.axvline(x=sim.mean(), color="b")              # mean of the simulations
    ax.axvline(x=Performance['IS'], color="tomato")  # in-sample (no noise)
    ax.axvline(x=Performance[mkt], color="gray")     # the market itself
    ax.set_ylim([0, 40])
    ax.set_xlim([-0.3, 2.5])
    ax.set_title('degree-N polynomial: ' + str(k))
plt.show()
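The statistic plotted above, `mean * 50 / (std * sqrt(50))`, is an annualized return-to-risk ratio under the assumption of roughly 50 weeks per year; algebraically it equals `mean / std * sqrt(50)`. A minimal sketch, with a hypothetical helper `annualized_ratio` and random numbers standing in for `pipeline.predict(...)` and for the weekly returns:

```python
import numpy as np
import pandas as pd

def annualized_ratio(weekly_returns, periods_per_year=50):
    """mean * N / (std * sqrt(N)), the statistic used in the histogram code."""
    return weekly_returns.mean() * periods_per_year / (
        weekly_returns.std() * np.sqrt(periods_per_year))

rng = np.random.default_rng(2)
y_demo = pd.Series(0.001 + 0.01 * rng.standard_normal(500))  # synthetic weekly returns
pred = pd.Series(rng.standard_normal(500))  # stand-in for model predictions
strategy = np.sign(pred) * y_demo           # long if prediction > 0, short if < 0
ratio = annualized_ratio(strategy)
print(ratio)
```

Flipping every position flips the sign of the ratio, which is why a model with no predictive power gives a histogram centered near zero.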
With data this simple the result is unsurprising, but the approach seems useful as a check when building a model.
Above, model complexity was controlled by degree = k of PolynomialFeatures. Here degree is fixed at 3, the model is RandomForestRegressor, and model complexity is controlled by max_depth = k.
polynomial_features = PolynomialFeatures(degree=3, include_bias=False)
pipeline = Pipeline([("polynomial_features", polynomial_features),
                     ("rf_regression", RandomForestRegressor(max_depth=k))])
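To see how `max_depth` interacts with the noise check, the sketch below compares a shallow and a deep forest on synthetic data. Everything here is an assumption for illustration: the data, `n_estimators=50`, the 0.3 noise scale, and the use of R^2 rather than the article's performance ratio.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X_demo = rng.standard_normal((300, 4))
y_demo = 0.5 * X_demo[:, 0] + rng.standard_normal(300)  # weak signal, heavy noise

in_sample, gap = {}, {}
for k in (1, 6):
    rf = RandomForestRegressor(max_depth=k, n_estimators=50, random_state=0)
    rf.fit(X_demo, y_demo)
    # blur the inputs by 0.3 times each column's std
    X_noisy = X_demo + 0.3 * X_demo.std(axis=0) * rng.standard_normal(X_demo.shape)
    in_sample[k] = rf.score(X_demo, y_demo)
    gap[k] = in_sample[k] - rf.score(X_noisy, y_demo)
    print(f"max_depth={k}: in-sample R^2={in_sample[k]:.3f}, drop under noise={gap[k]:.3f}")
```

A deeper forest fits the training sample more closely, so the drop under noise gives a rough sense of how much of that fit depends on the exact sample.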
Returning to the example where model complexity is degree = k of PolynomialFeatures: explanatory variables are added, and, to systematically rein in the urge to keep building up the model, Lasso is used in the regression step to select explanatory variables.
pipeline = Pipeline([("polynomial_features", polynomial_features),
                     ("linear_regression", LassoCV(cv=5))])
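How the Lasso step selects variables can be inspected through the fitted coefficients. This is a sketch on synthetic data where only the first column matters; `X_demo`, `y_demo`, and degree 2 are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(4)
X_demo = rng.standard_normal((400, 4))
y_demo = 1.0 * X_demo[:, 0] + 0.1 * rng.standard_normal(400)  # only column 0 matters

pipe = Pipeline([("polynomial_features", PolynomialFeatures(degree=2, include_bias=False)),
                 ("linear_regression", LassoCV(cv=5))])
pipe.fit(X_demo, y_demo)

coef = pipe.named_steps["linear_regression"].coef_
n_terms = coef.size               # polynomial terms offered to the regression
n_kept = int(np.sum(coef != 0))   # terms Lasso left in the model
print(f"{n_kept} of {n_terms} polynomial terms kept")
```

Terms whose coefficient is exactly zero have been dropped by the Lasso penalty, so `n_kept` shows how much of the polynomial expansion survives the selection.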