When examining the relationship between two stock price series, the analysis usually starts from the assumption that the log returns of both follow a normal distribution. With real stock prices, however, a clean normal distribution is rarely observed, so the output statistics need to be interpreted with care when a regression analysis based on a linear model is performed.
The GLM (Generalized Linear Model) can handle relationships that are not normally distributed, but applying it requires learning the concepts of statistical modeling, and I felt there was a bit of a technical hurdle. Since it is supported by Python's **statsmodels**, however, I decided to give it a "trial" run this time without worrying too much about rigor.
First, three automobile-related stocks listed on the First Section of the Tokyo Stock Exchange were picked as the targets of the analysis. Scatter plots of the log returns for the three possible pairs of the three companies are shown in the figure below.
All three pairs show a weak positive correlation. Of the three, the middle pair (stock2 vs. stock3) was chosen for the regression analysis. Incidentally, stock2 is the stock with securities code 7203 and stock3 is the one with securities code 7267.
First, a regression analysis was performed with an ordinary linear model. The following code reads the data and runs the linear regression.
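As a quick quantitative check of this claim, the pairwise correlations of the log returns can be printed. This is a minimal sketch; it assumes the mydf DataFrame with the *_lgret columns that is built in the code below.
# correlation matrix of the daily log returns (sketch, uses mydf built below)
log_ret_cols = ['stock1_lgret', 'stock2_lgret', 'stock3_lgret']
print(mydf[log_ret_cols].dropna().corr())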
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
def my_tof(s):
    # convert a price string such as "1,234" into a float
    f1 = float(s.replace(',', ''))
    return f1
# pandas read_csv()
my_colmn = ['Date', 'Open', 'High', 'Low', 'Close', 'Diff', 'Volume', 'cH', 'cI', 'cJ', 'cK', 'cL', 'cM', 'cN', 'cO']
index = pd.date_range(start='2014/1/1', end='2014/12/31', freq='B')
stock_raw = pd.DataFrame(index=index)
mydf = pd.DataFrame(index=index)
stock1 = pd.read_csv('./x7201-2014.csv', index_col=0, parse_dates=True, skiprows=1, names=my_colmn, header=None)
stock_raw['stock1'] = stock1[::-1].loc[:, 'Close']
stock2 = pd.read_csv('./x7203-2014.csv', index_col=0, parse_dates=True, skiprows=1, names=my_colmn, header=None)
stock_raw['stock2'] = stock2[::-1].loc[:, 'Close']
stock3 = pd.read_csv('./x7267-2014.csv', index_col=0, parse_dates=True, skiprows=1, names=my_colmn, header=None)
stock_raw['stock3'] = stock3[::-1].loc[:, 'Close']
stock_raw.dropna(inplace=True)
stock_base_label = ['stock1', 'stock2', 'stock3']
for st in stock_base_label:
    st_price = st + '_p'
    st_return = st + '_ret'
    st_log_return = st + '_lgret'
    mydf[st_price] = stock_raw[st].apply(my_tof)                # price as float
    mydf[st_price].fillna(method='ffill', inplace=True)         # forward-fill missing days
    mydf[st_return] = mydf[st_price] / mydf[st_price].shift(1)  # daily return ratio
    mydf[st_log_return] = np.log(mydf[st_return])               # daily log return
# scatter plotting
(Omitted)
# apply OLS model
mydf.dropna(inplace=True)
x1 = mydf['stock2_lgret'].values # stock2 log-return
x1a = sm.add_constant(x1)
y1 = mydf['stock3_lgret'].values # stock3 log-return
# OLS (linear model)
md0 = sm.OLS(y1, x1a)
res0 = md0.fit()
print(res0.summary())
plt.figure(figsize=(5,4))
plt.scatter(mydf['stock2_lgret'], mydf['stock3_lgret'], c='b', alpha=0.6)
plt.plot(x1, res0.fittedvalues, 'r-', label='Linear Model')
plt.grid(True)
As shown above, statsmodels.api.OLS() is used. The resulting fit is shown in the following graph.
Fig. stock2 vs. stock3 (Log Return) Linear Model
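For reference, the fitted line drawn above can also be reconstructed directly from the estimated coefficients; the following is a minimal sketch (res0 and x1 are the objects from the code above).
# res0.params holds [intercept, slope]; rebuild the fitted values by hand
b0, b1 = res0.params
y1_hat = b0 + b1 * x1
print(np.allclose(y1_hat, res0.fittedvalues))   # expected: True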
GLM (Gaussian distribution)
Next, a regression analysis with GLM is performed. The GLM (Generalized Linear Models) implementation in statsmodels supports the following probability distributions, called families (excerpt from the documentation):
Families for GLM (Generalized Linear Model)

Family | Description | Remark |
---|---|---|
Binomial | Binomial exponential family distribution. | Binomial distribution |
Gamma | Gamma exponential family distribution. | Gamma distribution |
Gaussian | Gaussian exponential family distribution. | Gaussian distribution |
InverseGaussian | InverseGaussian exponential family. | Inverse Gaussian distribution |
NegativeBinomial | Negative Binomial exponential family. | Negative binomial distribution |
Poisson | Poisson exponential family. | Poisson distribution |
In addition, the link functions that can be combined with each family are fixed, as shown in the table below (excerpt from the documentation). The link function can be specified as an option; if it is not specified, the default for the family appears to be used (see the code sketch after the table).
ident | log | logit | probit | cloglog | pow | opow | nbinom | loglog | logc | |
---|---|---|---|---|---|---|---|---|---|---|
Gaussian | x | x | x | |||||||
inv Gaussian | x | x | x | |||||||
binomial | x | x | x | x | x | x | x | x | x | |
Poisson | x | x | x |||||||
neg binomial | x | x | x | x | ||||||
gamma | x | x | x |
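For illustration, a link can be attached to a family explicitly when it is constructed. The following is a minimal sketch, not taken from the analysis above; note that recent statsmodels versions expect a link instance such as sm.families.links.Log(), while older versions expected the lowercase link class (e.g. link=sm.families.links.log).
# sketch: Gamma family with an explicit log link instead of its default (inverse_power)
fam = sm.families.Gamma(link=sm.families.links.Log())
print(fam.link)   # the link object actually attached to the family
# the family is then passed to the model as usual: sm.GLM(y, X, family=fam)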
First, the calculation was performed with the Gaussian family. The code is as follows.
# apply GLM(Gaussian) model
md1 = sm.GLM(y1, x1a, family=sm.families.Gaussian()) # Gaussian()
res1 = md1.fit()
print(res1.summary())
plt.figure(figsize=(5,4))
plt.scatter(mydf['stock2_lgret'], mydf['stock3_lgret'], c='g', alpha=0.6)
plt.plot(x1, res1.fittedvalues, 'r-', label='GLM(Gaussian)')
plt.grid(True)
Fig. stock2 vs. stock3 (GLM(gaussian dist.))
The line fitted by GLM looks identical to the one in the previous figure. Let's compare the summary() outputs.
**OLS summary**
In [71]: print res0.summary()
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.486
Model: OLS Adj. R-squared: 0.484
Method: Least Squares F-statistic: 241.1
Date: Sun, 26 Jul 2015 Prob (F-statistic): 1.02e-38
Time: 16:18:16 Log-Likelihood: 803.92
No. Observations: 257 AIC: -1604.
Df Residuals: 255 BIC: -1597.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const -0.0013 0.001 -1.930 0.055 -0.003 2.64e-05
x1 0.7523 0.048 15.526 0.000 0.657 0.848
==============================================================================
Omnibus: 10.243 Durbin-Watson: 1.997
Prob(Omnibus): 0.006 Jarque-Bera (JB): 16.017
Skew: -0.235 Prob(JB): 0.000333
Kurtosis: 4.129 Cond. No. 73.0
==============================================================================
**GLM (Gaussian dist.) summary**
In [72]: print res1.summary()
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 257
Model: GLM Df Residuals: 255
Model Family: Gaussian Df Model: 1
Link Function: identity Scale: 0.00011321157031
Method: IRLS Log-Likelihood: 803.92
Date: Sun, 26 Jul 2015 Deviance: 0.028869
Time: 16:12:11 Pearson chi2: 0.0289
No. Iterations: 4
==============================================================================
coef std err z P>|z| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const -0.0013 0.001 -1.930 0.054 -0.003 2.02e-05
x1 0.7523 0.048 15.526 0.000 0.657 0.847
==============================================================================
The contents of the two outputs differ considerably.
OLS reports R-squared, AIC, and BIC, whereas GLM does not; instead it reports the Deviance, the Pearson chi2 statistic, and so on. The Log-Likelihood is reported by both.
The GLM output shows that the Link Function is set to "identity" (the identity link). Moreover, the partial regression coefficients are identical for OLS and GLM (-0.0013 and 0.7523), which confirms that the two regressions give the same result.
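Incidentally, the quantities discussed above do not have to be read from the summary() text; they are available as attributes of the result objects. A minimal sketch using res0 and res1 from above:
print(res0.params, res1.params)            # identical coefficients (const, slope)
print(res0.rsquared, res0.aic, res0.bic)   # reported by OLS: R-squared, AIC, BIC
print(res1.deviance, res1.pearson_chi2)    # reported by GLM: deviance, Pearson chi2
print(res0.llf, res1.llf)                  # log-likelihood, available from both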
GLM (Gamma distribution)
Next, I tried a GLM with the Gamma distribution. Whether the gamma distribution can represent the rate of return well is debatable, but I tried it mainly to get a feel for GLM-style calculations.
One problem in running the calculation is that the log return becomes negative when the stock price falls, which is outside the support of the gamma distribution. Therefore, the return ratio before taking the logarithm was used as the y value. (I cannot deny that this feels a little forced...)
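Before fitting, it is worth verifying that the raw return ratio really stays strictly positive, as the Gamma family requires. A minimal sanity-check sketch (not part of the original code):
# Gamma support is (0, inf): confirm the response never reaches zero or below
print((mydf['stock3_ret'] > 0).all())   # expected: True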
# apply GLM(gamma) model
x2 = x1 ; x2a = x1a                # same explanatory variable (stock2 log return)
y2 = mydf['stock3_ret'].values     # response replaced by the raw (positive) return ratio
md2 = sm.GLM(y2, x2a, family=sm.families.Gamma())
res2 = md2.fit()
# print summary and plot fitting curve
print(res2.summary())
plt.figure(figsize=(5,4))
plt.scatter(mydf['stock2_lgret'], mydf['stock3_ret'], c='c', alpha=0.6)
plt.plot(x2, res2.fittedvalues, 'r-', label='GLM(Gamma)')
plt.grid(True)
y2_fit_log = np.log(res2.fittedvalues)
plt.figure(figsize=(5,4))
plt.scatter(mydf['stock2_lgret'], mydf['stock3_lgret'], c='c', alpha=0.6)
plt.plot(x2, y2_fit_log, 'r-', label='GLM(Gamma)')
Fig. stock2 vs. stock3 (GLM(gamma dist.)): (log vs. ident), and (log vs. log) with the y values converted back to logs
Graphically, much the same result was obtained. Let's look at summary().
In [73]: print res2.summary()
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 257
Model: GLM Df Residuals: 255
Model Family: Gamma Df Model: 1
Link Function: inverse_power Scale: 0.000113369003649
Method: IRLS Log-Likelihood: 803.72
Date: Sun, 26 Jul 2015 Deviance: 0.028956
Time: 16:12:16 Pearson chi2: 0.0289
No. Iterations: 5
==============================================================================
coef std err z P>|z| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const 1.0013 0.001 1502.765 0.000 1.000 1.003
x1 -0.7491 0.048 -15.470 0.000 -0.844 -0.654
==============================================================================
Switching from GLM (Gaussian dist.) to GLM (gamma dist.) changed the Log-Likelihood and the Deviance only slightly; it is safe to say that the model did not change enough to count as an improvement. Because the y values were transformed before fitting, the partial regression coefficients are different.
In parallel, histograms of the log returns of stock2 and stock3 were drawn to check the normality of the data; their shapes are shown in the figure below.
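A minimal sketch of how such histograms can be drawn (using the mydf columns defined above):
# histograms of the daily log returns used for the normality check
plt.figure(figsize=(8, 3))
plt.subplot(1, 2, 1)
plt.hist(mydf['stock2_lgret'], bins=30, color='b', alpha=0.6)
plt.title('stock2 log return')
plt.subplot(1, 2, 2)
plt.hist(mydf['stock3_lgret'], bins=30, color='g', alpha=0.6)
plt.title('stock3 log return')
plt.tight_layout()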
In this analysis, no improvement in model accuracy from applying GLM could be confirmed. This is probably because the relationship between stock prices within the same industry (over a period of roughly one year) was not particularly complicated (non-linear). Still, having more tools available for analyzing various kinds of data is never a bad thing, so I would like to deepen my understanding of GLM and other advanced regression methods.
GLM did not show its strength this time (stock price of automobile manufacturer A vs. stock price of manufacturer B), but it could well prove useful for a pairing of a somewhat different character, for example (daily maximum temperature vs. the stock price of a beer company).
- statsmodels documentation: http://statsmodels.sourceforge.net/stable/glm.html
- Introduction to Statistics (Department of Statistics, Faculty of Liberal Arts, University of Tokyo): http://www.utp.or.jp/bd/978-4-13-042065-5.html
- Introduction to Statistical Modeling for Data Analysis (Kubo, Iwanami Shoten): https://www.iwanami.co.jp/cgi-bin/isearch?isbn=ISBN978-4-00-006973-1