**Learn Appelpy in 10 minutes**
This post is a Japanese translation of this notebook.
import pandas as pd
import numpy as np
import statsmodels.api as sm # for datasets
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
**Introducing the features of Appelpy**

- Exploratory data analysis
- Model estimation
- Model diagnostics
Note: this notebook was run with Appelpy v0.4.2 (the latest version at the time of writing) and its dependencies.
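If Appelpy is not already installed, it can be obtained from PyPI (a minimal sketch, assuming a pip-based notebook environment; the version pin simply matches the note above):

```python
# Install Appelpy from PyPI (run once in the notebook environment)
!pip install appelpy==0.4.2
```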
Now let's explore the features of Appelpy using one of the richest datasets in econometrics: the California Test Score Data Set (**Caschool**).
Load the dataset into a Pandas dataframe, fetching it via the Statsmodels `get_rdataset` method.
# Load data
df = sm.datasets.get_rdataset('Caschool', 'Ecdat').data
# Add together the two test scores (total score)
df['scrtot'] = df['readscr'] + df['mathscr']
df.head()
| | distcod | county | district | grspan | enrltot | teachers | calwpct | mealpct | computer | testscr | compstu | expnstu | str | avginc | elpct | readscr | mathscr | scrtot |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 75119 | Alameda | Sunol Glen Unified | KK-08 | 195 | 10.900000 | 0.510200 | 2.040800 | 67 | 690.799988 | 0.343590 | 6384.911133 | 17.889910 | 22.690001 | 0.000000 | 691.599976 | 690.000000 | 1381.599976 |
1 | 61499 | Butte | Manzanita Elementary | KK-08 | 240 | 11.150000 | 15.416700 | 47.916698 | 101 | 661.200012 | 0.420833 | 5099.380859 | 21.524664 | 9.824000 | 4.583333 | 660.500000 | 661.900024 | 1322.400024 |
2 | 61549 | Butte | Thermalito Union Elementary | KK-08 | 1550 | 82.900002 | 55.032299 | 76.322601 | 169 | 643.599976 | 0.109032 | 5501.954590 | 18.697226 | 8.978000 | 30.000002 | 636.299988 | 650.900024 | 1287.200012 |
3 | 61457 | Butte | Golden Feather Union Elementary | KK-08 | 243 | 14.000000 | 36.475399 | 77.049202 | 85 | 647.700012 | 0.349794 | 7101.831055 | 17.357143 | 8.978000 | 0.000000 | 651.900024 | 643.500000 | 1295.400024 |
4 | 61523 | Butte | Palermo Union Elementary | KK-08 | 1335 | 71.500000 | 33.108601 | 78.427002 | 171 | 640.849976 | 0.128090 | 5235.987793 | 18.671329 | 9.080333 | 13.857677 | 641.799988 | 639.900024 | 1281.700012 |
Now that we have the data, let's take a closer look at the distribution of some key variables before modeling.
Appelpy has an **`eda`** (exploratory data analysis) module.
from appelpy.eda import statistical_moments
Get all the statistical moments at once.
statistical_moments(df)
| | mean | var | skew | kurtosis |
---|---|---|---|---|
distcod | 67472.8 | 1.20201e+07 | -0.0307844 | -1.09626 |
enrltot | 2628.79 | 1.53124e+07 | 2.87017 | 10.2004 |
teachers | 129.067 | 35311.2 | 2.93254 | 11.219 |
calwpct | 13.246 | 131.213 | 1.68306 | 4.58959 |
mealpct | 44.7052 | 735.678 | 0.183954 | -0.999802 |
computer | 303.383 | 194782 | 2.87169 | 10.7729 |
testscr | 654.157 | 363.03 | 0.0916151 | -0.254288 |
compstu | 0.135927 | 0.00421926 | 0.922369 | 1.43113 |
expnstu | 5312.41 | 401876 | 1.0679 | 1.87571 |
str | 19.6404 | 3.57895 | -0.0253655 | 0.609597 |
avginc | 15.3166 | 52.2135 | 2.21516 | 6.53212 |
elpct | 15.7682 | 334.375 | 1.4268 | 1.4354 |
readscr | 654.97 | 404.331 | -0.0583962 | -0.358049 |
mathscr | 653.343 | 351.72 | 0.255082 | -0.159811 |
scrtot | 1308.31 | 1452.12 | 0.0916149 | -0.254288 |
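For reference, similar per-column moments can be computed directly with Pandas (a rough sketch; the exact bias corrections may differ slightly from Appelpy's implementation):

```python
# Mean, variance, skewness and excess kurtosis for the numeric columns
numeric_df = df.select_dtypes('number')
pd.DataFrame({'mean': numeric_df.mean(),
              'var': numeric_df.var(),
              'skew': numeric_df.skew(),
              'kurtosis': numeric_df.kurt()})
```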
At the core of Appelpy is modeling the phenomena in a dataset.
The **`linear_model`** module contains classes for modeling data. Example: OLS (Ordinary Least Squares).
from appelpy.linear_model import OLS
**Here is a recipe for estimating a linear model.**

Set up `y_list` and `X_list` for the dependent variable (y) and the independent variables (X).

y_list = ['scrtot']
X_list = ['avginc', 'str', 'enrltot', 'expnstu']
model_nonrobust = OLS(df, y_list, X_list).fit()
Model selection statistics are stored in the `model_selection_stats` attribute, including the root mean square error (Root MSE).

model_nonrobust.model_selection_stats
{'root_mse': 25.85923055850994,
'r_squared': 0.5438972040191434,
'r_squared_adj': 0.5395010324916171,
'aic': 3929.1191674301117,
'bic': 3949.320440986499}
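As a sanity check, the adjusted R-squared follows directly from the R-squared, the number of observations (420) and the number of regressors (4):

```python
# Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - k - 1)
n, k = 420, 4
r2 = model_nonrobust.model_selection_stats['r_squared']
print(1 - (1 - r2) * (n - 1) / (n - k - 1))  # ≈ 0.5395, matching r_squared_adj
```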
Get the Statsmodels summary with the `results_output` attribute ...
model_nonrobust.results_output
Dep. Variable: | scrtot | R-squared: | 0.544 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.540 |
Method: | Least Squares | F-statistic: | 123.7 |
Date: | Mon, 04 May 2020 | Prob (F-statistic): | 2.05e-69 |
Time: | 18:32:26 | Log-Likelihood: | -1959.6 |
No. Observations: | 420 | AIC: | 3929. |
Df Residuals: | 415 | BIC: | 3949. |
Df Model: | 4 | ||
Covariance Type: | nonrobust |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
const | 1312.9203 | 27.833 | 47.171 | 0.000 | 1258.208 | 1367.632 |
avginc | 3.8640 | 0.185 | 20.876 | 0.000 | 3.500 | 4.228 |
str | -1.3954 | 0.893 | -1.563 | 0.119 | -3.150 | 0.359 |
enrltot | -0.0016 | 0.000 | -4.722 | 0.000 | -0.002 | -0.001 |
expnstu | -0.0061 | 0.003 | -2.316 | 0.021 | -0.011 | -0.001 |
Omnibus: | 7.321 | Durbin-Watson: | 0.763 |
---|---|---|---|
Prob(Omnibus): | 0.026 | Jarque-Bera (JB): | 7.438 |
Skew: | -0.326 | Prob(JB): | 0.0243 |
Kurtosis: | 2.969 | Cond. No. | 1.39e+05 |
... Now it's easy to get the estimates of a standardized model using the `results_output_standardized` attribute.
model_nonrobust.results_output_standardized
# This is a Styler object; add .data to access the underlying Pandas dataframe
| | coef | t | P>\|t\| | coef_stdX | coef_stdXy | stdev_X |
---|---|---|---|---|---|---|
scrtot | ||||||
avginc | +3.8640 | +20.876 | 0.000 | +27.9209 | +0.7327 | 7.2259 |
str | -1.3954 | -1.563 | 0.119 | -2.6399 | -0.0693 | 1.8918 |
enrltot | -0.0016 | -4.722 | 0.000 | -6.3035 | -0.1654 | 3913.1050 |
expnstu | -0.0061 | -2.316 | 0.021 | -3.8364 | -0.1007 | 633.9371 |
The output is similar to Stata's `listcoef` command.
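As with `listcoef`, `coef_stdX` is the raw coefficient scaled by the regressor's standard deviation, and `coef_stdXy` additionally divides by the standard deviation of the dependent variable. A quick check for `avginc`, using the `scrtot` variance from the moments table above:

```python
coef, std_x = 3.8640, 7.2259   # avginc: raw coefficient and stdev_X
std_y = np.sqrt(1452.12)       # stdev of scrtot, from its variance in the moments table
print(coef * std_x)            # ≈ 27.92 -> coef_stdX
print(coef * std_x / std_y)    # ≈ 0.733 -> coef_stdXy
```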
Call the `diagnostic_plot` method of the model object to display various visual **regression diagnostics**:

- `pp_plot`: P-P plot
- `qq_plot`: Q-Q plot
- `rvp_plot`: plot of residuals vs. predicted values
- `rvf_plot`: plot of residuals vs. fitted values
Appelpy's **`diagnostics`** module has a suite of classes and functions for regression diagnostics.

- If visual diagnostics are not enough, call functions such as `heteroskedasticity_test` to run a formal test.
- Are there observations that spoil the model? Pass the model object to the `BadApples` class to examine influential points, outliers and high-leverage points.
**Here is a recipe for a 2x2 grid of diagnostic plots.**
fig, axarr = plt.subplots(2, 2, figsize=(10, 10))
model_nonrobust.diagnostic_plot('pp_plot', ax=axarr[0][0])
model_nonrobust.diagnostic_plot('qq_plot', ax=axarr[1][0])
model_nonrobust.diagnostic_plot('rvf_plot', ax=axarr[1][1])
axarr[0, 1].axis('off')
plt.tight_layout()
Good grief. The specified model clearly has trouble with test scores above about 1350.
The RVF plot suggests non-constant variance in the residuals.
The errors are probably not homoscedastic, but let's run a more formal test.
from appelpy.diagnostics import heteroskedasticity_test
Call one of several **heteroskedasticity tests**. Examples:

- *Breusch-Pagan* (the default for Stata's `hettest` command)
- *Breusch-Pagan studentized* (the default for R's `bptest` command)
- *White* (Stata's `hettest, white` command)
bp_stats = heteroskedasticity_test('breusch_pagan', model_nonrobust)
print('Breusch-Pagan test :: {}'.format(bp_stats['distribution'] + '({})'.format(bp_stats['nu'])))
print('Test statistic: {:.4f}'.format(bp_stats['test_stat']))
print('Test p-value: {:.4f}'.format(bp_stats['p_value']))
Breusch-Pagan test :: chi2(1)
Test statistic: 2.5579
Test p-value: 0.1097
bps_stats = heteroskedasticity_test('breusch_pagan_studentized', model_nonrobust)
print('Breusch-Pagan test (studentized) :: {}'.format(bps_stats['distribution'] + '({})'.format(bps_stats['nu'])))
print('Test statistic: {:.4f}'.format(bps_stats['test_stat']))
print('Test p-value: {:.4f}'.format(bps_stats['p_value']))
Breusch-Pagan test (studentized) :: chi2(1)
Test statistic: 11.3877
Test p-value: 0.0225
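As an aside, a comparable Breusch-Pagan test can be run directly on the underlying Statsmodels results object (a sketch; the reported statistic depends on which variant of the test Statsmodels implements, so it may not match the Appelpy numbers exactly):

```python
import statsmodels.stats.api as sms

# Breusch-Pagan LM test on the wrapped statsmodels results
lm_stat, lm_pval, f_stat, f_pval = sms.het_breuschpagan(
    model_nonrobust.results.resid, model_nonrobust.results.model.exog)
print(lm_stat, lm_pval)
```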
The studentized Breusch-Pagan test is considered more robust, and it rejects the null hypothesis of homoscedasticity at the 5% level, so let's fit the model with heteroskedasticity-robust standard errors (`HC1`).
model_hc1 = OLS(df, y_list, X_list, cov_type='HC1').fit()
model_hc1.results_output
Dep. Variable: | scrtot | R-squared: | 0.544 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.540 |
Method: | Least Squares | F-statistic: | 78.60 |
Date: | Mon, 04 May 2020 | Prob (F-statistic): | 1.36e-49 |
Time: | 18:32:27 | Log-Likelihood: | -1959.6 |
No. Observations: | 420 | AIC: | 3929. |
Df Residuals: | 415 | BIC: | 3949. |
Df Model: | 4 | ||
Covariance Type: | HC1 |
| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
const | 1312.9203 | 29.364 | 44.711 | 0.000 | 1255.367 | 1370.474 |
avginc | 3.8640 | 0.240 | 16.119 | 0.000 | 3.394 | 4.334 |
str | -1.3954 | 0.932 | -1.498 | 0.134 | -3.221 | 0.430 |
enrltot | -0.0016 | 0.000 | -5.794 | 0.000 | -0.002 | -0.001 |
expnstu | -0.0061 | 0.003 | -2.166 | 0.030 | -0.012 | -0.001 |
Omnibus: | 7.321 | Durbin-Watson: | 0.763 |
---|---|---|---|
Prob(Omnibus): | 0.026 | Jarque-Bera (JB): | 7.438 |
Skew: | -0.326 | Prob(JB): | 0.0243 |
Kurtosis: | 2.969 | Cond. No. | 1.39e+05 |
What are the significant regressors (independent variables) now?
model_hc1.significant_regressors(0.05)
['avginc', 'enrltot', 'expnstu']
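This is roughly equivalent to filtering the raw Statsmodels p-values at the chosen significance level (note that the raw filter also picks up the constant, which `significant_regressors` leaves out):

```python
# Equivalent check on the underlying statsmodels results
pvalues = model_hc1.results.pvalues
print(pvalues[pvalues < 0.05].index.tolist())  # also includes 'const'
```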
The `str` (student-teacher ratio) and `expnstu` (expenditure per student) regressors are especially relevant to investment in students. You can run a **Wald test** to check whether the two regressors are jointly significantly different from 0.
from appelpy.diagnostics import wald_test
wald_stats = wald_test(model_hc1, ['str', 'expnstu'])
print('Wald test :: {}'.format(wald_stats['distribution'] + '({})'.format(wald_stats['nu'])))
print('Test statistic: {:.4f}'.format(wald_stats['test_stat']))
print('Test p-value: {:.4f}'.format(wald_stats['p_value']))
Wald test :: chi2(2)
Test statistic: 4.7352
Test p-value: 0.0937
Not surprisingly, the standard errors of some variables are larger with HC1 errors than with non-robust errors.
pd.concat([pd.Series(model_nonrobust.results.bse, name='std_error_nonrobust'),
pd.Series(model_hc1.results.bse, name='std_error_hc1')], axis='columns')
| | std_error_nonrobust | std_error_hc1 |
---|---|---|
const | 27.833430 | 29.364467 |
avginc | 0.185093 | 0.239715 |
str | 0.892703 | 0.931541 |
enrltot | 0.000341 | 0.000278 |
expnstu | 0.002613 | 0.002794 |
Let's inspect the observations for:

- High influence
- High leverage
- Outliers
from appelpy.diagnostics import BadApples
Set up an instance of `BadApples` using the model object.
bad_apples = BadApples(model_hc1).fit()
Let's plot the leverage value against the square of the normalized residuals.
The observations in the upper right quadrant have higher than average leverage and residual values (that is, the most influential observations in the model).
fig, ax = plt.subplots(figsize=(11,7))
bad_apples.plot_leverage_vs_residuals_squared(annotate=True)
plt.show()
Leverage values are stored in the `measures_leverage` attribute.
bad_apples.measures_leverage.sort_values(ascending=False).head()
48 0.113247
34 0.085633
404 0.076762
413 0.065741
159 0.061353
Name: leverage, dtype: float64
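A common rule of thumb (not necessarily the exact heuristic Appelpy applies) flags observations whose leverage exceeds 2(k+1)/n, where k is the number of regressors:

```python
# Rule-of-thumb cut-off for high leverage: 2 * (k + 1) / n
k = len(X_list)  # 4 regressors
n = len(df)      # 420 observations
cutoff = 2 * (k + 1) / n
print(cutoff)    # ≈ 0.024
print((bad_apples.measures_leverage > cutoff).sum())  # number of high-leverage points
```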
Let's take a look at the five most leveraged observations.
bad_apples.show_extreme_observations().loc[[48,34,404,413,159]]
| | scrtot | avginc | str | enrltot | expnstu |
---|---|---|---|---|---|
48 | 1261.099976 | 12.109128 | 19.017494 | 27176 | 5864.366211 |
34 | 1252.200012 | 11.722225 | 21.194069 | 25151 | 5117.039551 |
404 | 1387.900024 | 55.327999 | 16.262285 | 1059 | 6460.657227 |
413 | 1398.200012 | 50.676998 | 15.407042 | 687 | 7217.263184 |
159 | 1294.500000 | 14.298300 | 20.291372 | 21338 | 5123.474121 |
model_hc1.X_standardized.loc[[48,34,404,413,159]]
| | avginc | str | enrltot | expnstu |
---|---|---|---|---|
48 | -0.443884 | -0.329278 | 6.273077 | 0.870684 |
34 | -0.497428 | 0.821246 | 5.755585 | -0.308182 |
404 | 5.537230 | -1.785664 | -0.401163 | 1.811299 |
413 | 4.893572 | -2.237740 | -0.496228 | 3.004803 |
159 | -0.140922 | 0.344087 | 4.781167 | -0.298032 |
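`X_standardized` simply holds the z-scores of each regressor. A quick check for observation 48 and `enrltot`, using the mean from the moments table and the standard deviation from the standardized-coefficients table:

```python
# z-score of enrltot for observation 48: (value - mean) / standard deviation
print((27176 - 2628.79) / 3913.105)  # ≈ 6.27, matching X_standardized
```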
Each of these high-leverage points has at least one independent variable that is at least 4 standard deviations away from its mean.
The `BadApples` object uses heuristics to determine whether each measure's value is "extreme".
The indices of extreme observations are stored in attributes; for example, outliers are stored in `indices_outliers`.
bad_apples.indices_outliers
[5, 6, 7, 8, 9, 10, 13, 14, 37, 38, 47, 134, 370, 392, 403, 404, 415, 417]
Outlier heuristic: an observation is classed as an outlier if the absolute value of its standardized residual or its studentized residual is greater than 2.
print(f"Proportion of observations as outliers: {len(bad_apples.indices_outliers) / len(df):.4%}")
Proportion of observations as outliers: 4.2857%
Given this heuristic, no more than about 5% of observations would be expected to be flagged as outliers.
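A rough equivalent can be computed from the internally studentized residuals in Statsmodels (a sketch; Appelpy's heuristic also uses standardized residuals, so the index lists may not match exactly):

```python
# Observations with |studentized residual| > 2, via statsmodels' influence measures
influence = model_hc1.results.get_influence()
print(np.flatnonzero(np.abs(influence.resid_studentized_internal) > 2))
```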
Various influence measures are also available, such as Cook's distance, dffits and dfbeta.
bad_apples.measures_influence.sort_values('cooks_d', ascending=False).head()
| | dfbeta_const | dfbeta_avginc | dfbeta_str | dfbeta_enrltot | dfbeta_expnstu | cooks_d | dffits_internal | dffits |
---|---|---|---|---|---|---|---|---|
404 | 0.010742 | -0.811324 | 0.068295 | 0.061925 | 0.039761 | 0.152768 | -0.873981 | -0.882753 |
403 | 0.024082 | -0.529623 | 0.016131 | 0.031978 | 0.026944 | 0.063718 | -0.564440 | -0.567338 |
413 | 0.074297 | -0.373064 | 0.013021 | 0.033955 | -0.099557 | 0.044116 | -0.469661 | -0.470876 |
38 | -0.277356 | -0.193382 | 0.217223 | -0.136237 | 0.315817 | 0.030955 | -0.393412 | -0.397701 |
415 | -0.157629 | 0.099707 | 0.045684 | -0.026836 | 0.249169 | 0.025414 | 0.356468 | 0.357936 |
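A common follow-up (again a rule of thumb rather than Appelpy's own cut-off) is to flag observations whose Cook's distance exceeds 4/n:

```python
# Observations with Cook's distance above the 4/n rule-of-thumb cut-off
cutoff = 4 / len(df)  # ≈ 0.0095
bad_apples.measures_influence[bad_apples.measures_influence['cooks_d'] > cutoff]
```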
In summary:

- Load the dataset and pre-process it in Pandas.
- Appelpy **`eda`**: inspect the modeling dataset with useful functions such as statistical moments.
- Appelpy **modeling**: `linear_model` contains classes such as OLS and WLS for model estimation. Use a short recipe for each model to access the model estimates.
- Appelpy **`diagnostics`**: check whether model estimates are valid with diagnostic plots, statistical tests, and diagnostics such as `BadApples` for influence analysis.

Other Appelpy features include:

- Logistic regression (in the `discrete_model` module)
- The `DummyEncoder` and `InteractionEncoder` classes