I am reproducing the source code of "Introduction to Effect Verification: Causal Reasoning for Correct Comparison / Basics of Econometrics" in Python.
There are already implementation examples by others, but I am leaving this as a memo for my own study.
This article covers Chapter 2. The code is also posted on GitHub. Variable names and processing basically follow the book.
Linear regression can be implemented with either scikit-learn or statsmodels. However, statsmodels reports richer statistics for each variable, so it is the more convenient choice here.
scikit-learn
Regression analysis of sklearn
from sklearn.linear_model import LinearRegression
#Model learning
X = biased_data[['treatment', 'history']]
y = biased_data['spend']
model = LinearRegression(fit_intercept=True).fit(X, y)  # the normalize argument was removed in scikit-learn 1.2
#Result output
print(f'R^2: {model.score(X, y)}')
print(f'intercept: {model.intercept_}')
print(f'coefficients: {model.coef_}')
The scikit-learn output contains less information than that of statsmodels, described later. Each statistic can be computed by hand from the training data, but there seems to be little merit in going that far with scikit-learn.
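As an aside, here is a minimal sketch of how those statistics could be computed by hand after a scikit-learn fit. It uses synthetic data in place of biased_data, since the book's dataset is not loaded at this point; the covariate names and true coefficients are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for biased_data: two covariates, known true coefficients
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)

model = LinearRegression().fit(X, y)

# Classical OLS standard errors from the design matrix with an intercept column
Xd = np.column_stack([np.ones(n), X])
resid = y - model.predict(X)
dof = n - Xd.shape[1]
sigma2 = resid @ resid / dof                  # unbiased residual variance
cov = sigma2 * np.linalg.inv(Xd.T @ Xd)       # covariance of the coefficient estimates
se = np.sqrt(np.diag(cov))                    # standard errors: [intercept, coef1, coef2]
t_stats = np.append(model.intercept_, model.coef_) / se
```

This is exactly the bookkeeping that statsmodels does for you, which is the reason to prefer it below.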
statsmodels
There seem to be multiple ways to fit models with statsmodels; the R-like formula interface is as follows.
Regression analysis with statsmodels
from statsmodels.formula.api import ols
#Model learning
model = ols('spend ~ treatment + history', data=biased_data).fit()
#Result output
model.summary()
Incidentally, if you only want part of the output, model.summary().tables[1] lets you pick an individual table out of the summary. To get the estimated coefficients, model.params can be referenced like a dictionary keyed by term name.
To read R-format data, use the rdata module. For details on how to use it, refer to the module's documentation.
Read RData
import rdata
parsed = rdata.parser.parse_file('./vouchers.rda')
converted = rdata.conversion.convert(parsed)
vouchers = converted['vouchers']
The book fits models for multiple outcome variables in one call, but doing exactly the same thing in Python seems difficult, so the formulas are looped over instead.
Regression analysis collectively
import pandas as pd
from statsmodels.formula.api import ols
#Definition of regression equation
formula_x_base = ['VOUCH0']
formula_x_covariate = [
'SVY', 'HSVISIT', 'AGE', 'STRATA1', 'STRATA2', 'STRATA3', 'STRATA4', 'STRATA5', 'STRATA6', 'STRATAMS',
'D1993', 'D1995', 'D1997', 'DMONTH1', 'DMONTH2', 'DMONTH3', 'DMONTH4', 'DMONTH5', 'DMONTH6',
'DMONTH7', 'DMONTH8', 'DMONTH9', 'DMONTH10', 'DMONTH11', 'DMONTH12', 'SEX2',
]
formula_ys = [
"TOTSCYRS","INSCHL","PRSCH_C","USNGSCH","PRSCHA_1","FINISH6","FINISH7","FINISH8","REPT6",
"REPT","NREPT","MARRIED","HASCHILD","HOURSUM","WORKING3",
]
#Function that extracts the regression result as a DataFrame
def get_regression_result(formula, data):
    model = ols(formula, data=data).fit()
    result = pd.read_html(model.summary().tables[1].as_html(), header=0)[0]
    result.columns = ['term', 'estimate', 'std.err', 'statistic', 'p.value', '0.025', '0.975']
    return result
#Run the regression analyses in one batch
results = list()
for formula_y in formula_ys:
    base_reg_formula = f'{formula_y} ~ {" + ".join(formula_x_base)}'
    base_reg_model_index = f'{formula_y}_base'
    covariate_reg_formula = f'{formula_y} ~ {" + ".join(formula_x_base + formula_x_covariate)}'
    covariate_reg_model_index = f'{formula_y}_covariate'

    base_reg_result = get_regression_result(base_reg_formula, regression_data)
    base_reg_result['model_index'] = base_reg_model_index
    results.append(base_reg_result)

    covariate_reg_result = get_regression_result(covariate_reg_formula, regression_data)
    covariate_reg_result['model_index'] = covariate_reg_model_index
    results.append(covariate_reg_result)

df_results = pd.concat(results).reset_index(drop=True)
df_results = df_results[['model_index', 'term', 'estimate', 'std.err', 'statistic', 'p.value', '0.025', '0.975']]
Plotting with error bars is straightforward in matplotlib. The point is that the length of each error bar, which expresses the confidence interval, is obtained as the difference between the estimate and the confidence bound.
In the data used for drawing (not shown here), the values are taken from the model output, so there may be slight deviations because of the limited number of significant figures at that point.
plot
import matplotlib.pyplot as plt
estimate = going_private_results['estimate']
estimate_error = going_private_results['estimate'] - going_private_results['0.025'] #Let the difference between the confidence interval and the estimated value be the length of the error bar
xmin = 0
xmax = going_private_results.shape[0] - 1
plt.errorbar(range(xmax+1), estimate, estimate_error, fmt='o')
plt.hlines(y=0, xmin=xmin, xmax=xmax, colors='k', linestyles='dashed')
plt.xlabel('model_index')
plt.ylabel('estimate')
plt.xticks(range(going_private_results.shape[0]), going_private_results['model_index'], rotation=45)
plt.show()
Related posts:
- I wrote "Introduction to Effect Verification" in Python
- Read RData format dataset in Python