I am reproducing the source code of "Introduction to Effect Verification: Causal Reasoning for Correct Comparison / Basics of Econometrics" in Python.
There are already implementation examples by others, but I am leaving this as a memo for my own study.
This article covers Chapter 2. The code is also posted on GitHub. Variable names and processing basically follow the book.
Linear regression can be implemented with either scikit-learn or statsmodels. However, statsmodels reports richer statistics for each variable, so it is the more convenient choice here.
scikit-learn
Regression analysis of sklearn
from sklearn.linear_model import LinearRegression
#Model learning
X = biased_data[['treatment', 'history']]
y = biased_data['spend']
model = LinearRegression(fit_intercept=True).fit(X, y)  # the normalize argument was removed in scikit-learn 1.2
#Result output
print(f'R^2: {model.score(X, y)}')
print(f'intercept: {model.intercept_}')
print(f'coefficients: {model.coef_}')
The scikit-learn output contains less information than that of statsmodels, described later. Each statistic can be computed by hand from the training data, but there seems to be little merit in going that far with scikit-learn.
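As an aside, here is a minimal sketch of how those statistics could be computed by hand after a scikit-learn fit. It uses synthetic data in place of biased_data, since the book's dataset is not loaded at this point; the covariate names and true coefficients are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for biased_data: two covariates, known true coefficients
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)

model = LinearRegression().fit(X, y)

# Classical OLS standard errors from the design matrix with an intercept column
Xd = np.column_stack([np.ones(n), X])
resid = y - model.predict(X)
dof = n - Xd.shape[1]
sigma2 = resid @ resid / dof                  # unbiased residual variance
cov = sigma2 * np.linalg.inv(Xd.T @ Xd)       # covariance of the coefficient estimates
se = np.sqrt(np.diag(cov))                    # standard errors: [intercept, coef1, coef2]
t_stats = np.append(model.intercept_, model.coef_) / se
```

This is exactly the bookkeeping that statsmodels does for you, which is the reason to prefer it below.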
statsmodels
There seem to be multiple ways to fit models with statsmodels; the R-like formula interface is as follows.
Regression analysis with statsmodels
from statsmodels.formula.api import ols
#Model learning
model = ols('spend ~ treatment + history', data=biased_data).fit()
#Result output
model.summary()
Incidentally, if you only want part of the output, model.summary().tables[1] lets you pick an individual table out of the summary. To get the estimated coefficients, model.params can be referenced like a dictionary keyed by term name.
To read R-format data, use the rdata module. For details on how to use it, refer to the module's documentation.
Read RData
import rdata
parsed = rdata.parser.parse_file('./vouchers.rda')
converted = rdata.conversion.convert(parsed)
vouchers = converted['vouchers']
The book fits models for multiple outcome variables in one call, but doing exactly the same thing in Python seems difficult, so the formulas are looped over instead.
Regression analysis collectively
import pandas as pd
from statsmodels.formula.api import ols
#Definition of regression equation
formula_x_base = ['VOUCH0']
formula_x_covariate = [
'SVY', 'HSVISIT', 'AGE', 'STRATA1', 'STRATA2', 'STRATA3', 'STRATA4', 'STRATA5', 'STRATA6', 'STRATAMS',
'D1993', 'D1995', 'D1997', 'DMONTH1', 'DMONTH2', 'DMONTH3', 'DMONTH4', 'DMONTH5', 'DMONTH6',
'DMONTH7', 'DMONTH8', 'DMONTH9', 'DMONTH10', 'DMONTH11', 'DMONTH12', 'SEX2',
]
formula_ys = [
"TOTSCYRS","INSCHL","PRSCH_C","USNGSCH","PRSCHA_1","FINISH6","FINISH7","FINISH8","REPT6",
"REPT","NREPT","MARRIED","HASCHILD","HOURSUM","WORKING3",
]
#Function that extracts the regression result as a DataFrame
def get_regression_result(formula, data):
    model = ols(formula, data=data).fit()
    result = pd.read_html(model.summary().tables[1].as_html(), header=0)[0]
    result.columns = ['term', 'estimate', 'std.err', 'statistic', 'p.value', '0.025', '0.975']
    return result
#Run the regression analyses in one batch
results = list()
for formula_y in formula_ys:
    base_reg_formula = f'{formula_y} ~ {" + ".join(formula_x_base)}'
    base_reg_model_index = f'{formula_y}_base'
    covariate_reg_formula = f'{formula_y} ~ {" + ".join(formula_x_base + formula_x_covariate)}'
    covariate_reg_model_index = f'{formula_y}_covariate'

    base_reg_result = get_regression_result(base_reg_formula, regression_data)
    base_reg_result['model_index'] = base_reg_model_index
    results.append(base_reg_result)

    covariate_reg_result = get_regression_result(covariate_reg_formula, regression_data)
    covariate_reg_result['model_index'] = covariate_reg_model_index
    results.append(covariate_reg_result)

df_results = pd.concat(results).reset_index(drop=True)
df_results = df_results[['model_index', 'term', 'estimate', 'std.err', 'statistic', 'p.value', '0.025', '0.975']]
Plotting with error bars is straightforward in matplotlib. The point is that the length of each error bar, which expresses the confidence interval, is obtained as the difference between the estimate and the confidence bound.
In the data used for drawing (not shown here), the values are taken from the model output, so there may be slight deviations because of the limited number of significant figures at that point.
plot
import matplotlib.pyplot as plt
estimate = going_private_results['estimate']
estimate_error = going_private_results['estimate'] - going_private_results['0.025'] #Let the difference between the confidence interval and the estimated value be the length of the error bar
xmin = 0
xmax = going_private_results.shape[0] - 1
plt.errorbar(range(xmax+1), estimate, estimate_error, fmt='o')
plt.hlines(y=0, xmin=xmin, xmax=xmax, colors='k', linestyles='dashed')
plt.xlabel('model_index')
plt.ylabel('estimate')
plt.xticks(range(going_private_results.shape[0]), going_private_results['model_index'], rotation=45)
plt.show()
Related posts:
- I wrote "Introduction to Effect Verification" in Python
- Read RData format dataset in Python