TL;DR

The R source code of the book "Introduction to Effectiveness Verification / Causal Reasoning for Correct Comparison / Basics of Econometrics" has been (almost) reproduced in Python.
https://github.com/nekoumei/cibook-python
In this article, mainly R is a library! I will explain the part that is not so in Python

Book introduction

https://www.amazon.co.jp/dp/B0834JN23Y

The table of contents is listed on Amazon above, so I think it's quick to see it. .. It's a very good book. How to remove bias to make accurate decisions? Various causal reasoning methods (propensity score / DiD / RDD, etc.) are introduced with an implementation by R source code. Throughout, how can we use causal reasoning to verify the effectiveness of real problems? It was written from the perspective of, and I felt that it was very practical.

Written in python

https://github.com/nekoumei/cibook-python I created a Jupyter Notebook that corresponds to the original public R source code (https://github.com/ghmagazine/cibook). In addition, in this Python implementation, the graph visualization library uses plotly.express except for some parts. Plotly graphs cannot be displayed in Github Notebook rendering, so if you want to check online, please check Github Pages in the README. (Displaying html under images) Regarding plotly.express, the following article will be very helpful in Japanese. Summary of basic drawing of the de facto standard Plotly Express of the Python drawing library in the Reiwa era

Key points when writing in Python

regression analysis

I am using OLS and WLS from stats models.

(Excerpt from ch2_regression.ipynb)

##Multiple regression on biased data
y = biased_df.spend
#In R lm, categorical variables are automatically converted to dummy variables, so reproduce them.
X = pd.get_dummies(biased_df[['treatment', 'recency', 'channel', 'history']], columns=['channel'], drop_first=True)
X = sm.add_constant(X)
results = sm.OLS(y, X).fit()
nonrct_mreg_coef = results.summary().tables[1]
nonrct_mreg_coef

Here, there are two differences from the R lm function. (1) It seems that lm automatically converts categorical variables into dummy variables. Since sm.OLS cannot do it automatically, pd.get_dummies () is used. (2) In lm, the bias term is added automatically. This is also added manually. (Reference: Statistics: Try multiple regression analysis with Python and R)

Read RData format dataset

Some of the datasets used in the implementation are published as R packages such as experimentdatar, or datasets published in RData format. there is. ~~ There is no way to read these in Python only. ~~ (Please let me know if you have one) In the comment section, upura-san taught me! https://qiita.com/nekoumei/items/648726e89d05cba6f432#comment-0ea9751e3f01b27b0adb The .rda file is a package called rdata that can be read without R. (Excerpt from ch2_voucher.ipynb)

parsed = rdata.parser.parse_file('../data/vouchers.rda')
converted = rdata.conversion.convert(parsed)
vouchers = converted['vouchers']

When reading data using R via rpy2 and converting to pandas DataFrame, it is as follows. (Excerpt from ch2_voucher.ipynb)

from rpy2.robjects import r, pandas2ri
from rpy2.robjects.packages import importr
pandas2ri.activate()
experimentdatar = importr('experimentdatar')
vouchers = r['vouchers']

As described in ch2_voucher.ipynb, the R package is installed in advance in the R interactive environment. Also, the dataset used by ch3_lalonde.ipynb is a .dta file. This can be read by read_stata () in pandas. (Excerpt from ch3_lalonde.ipynb)

cps1_data = pd.read_stata('https://users.nber.org/~rdehejia/data/cps_controls.dta')
cps3_data = pd.read_stata('https://users.nber.org/~rdehejia/data/cps_controls3.dta')
nswdw_data = pd.read_stata('https://users.nber.org/~rdehejia/data/nsw_dw.dta')

Propensity score matching (nearest neighbor matching)

There are several matching methods for propensity score matching, but the book seems to use MatchIt's nearest neighbor matching. There didn't seem to be a good library for Python, so I'm honestly implementing it. (Excerpt from ch3_pscore.ipynb)

def get_matched_dfs_using_propensity_score(X, y, random_state=0):
    #Calculate propensity score
    ps_model = LogisticRegression(solver='lbfgs', random_state=random_state).fit(X, y)
    ps_score = ps_model.predict_proba(X)[:, 1]
    all_df = pd.DataFrame({'treatment': y, 'ps_score': ps_score})
    treatments = all_df.treatment.unique()
    if len(treatments) != 2:
        print('Only 2 groups can be matched. 2 groups must be[0, 1]Please express with.')
        raise ValueError
    # treatment ==1 to group1, treatment ==Let 0 be group2. Since group2 that matches group1 is extracted, it should be an ATT estimate.
    group1_df = all_df[all_df.treatment==1].copy()
    group1_indices = group1_df.index
    group1_df = group1_df.reset_index(drop=True)
    group2_df = all_df[all_df.treatment==0].copy()
    group2_indices = group2_df.index
    group2_df = group2_df.reset_index(drop=True)

    #Standard deviation of overall propensity score* 0.2 as the threshold
    threshold = all_df.ps_score.std() * 0.2

    matched_group1_dfs = []
    matched_group2_dfs = []
    _group1_df = group1_df.copy()
    _group2_df = group2_df.copy()

    while True:
        #Find and match one nearest neighbor in Nearest Neighbors
        neigh = NearestNeighbors(n_neighbors=1)
        neigh.fit(_group1_df.ps_score.values.reshape(-1, 1))
        distances, indices = neigh.kneighbors(_group2_df.ps_score.values.reshape(-1, 1))
        #Remove duplicates
        distance_df = pd.DataFrame({'distance': distances.reshape(-1), 'indices': indices.reshape(-1)})
        distance_df.index = _group2_df.index
        distance_df = distance_df.drop_duplicates(subset='indices')
        #Delete records that exceed the threshold
        distance_df = distance_df[distance_df.distance < threshold]
        if len(distance_df) == 0:
            break
        #Extract and delete matched records
        group1_matched_indices = _group1_df.iloc[distance_df['indices']].index.tolist()
        group2_matched_indices = distance_df.index
        matched_group1_dfs.append(_group1_df.loc[group1_matched_indices])
        matched_group2_dfs.append(_group2_df.loc[group2_matched_indices])
        _group1_df = _group1_df.drop(group1_matched_indices)
        _group2_df = _group2_df.drop(group2_matched_indices)

    #Returns a matched record
    group1_df.index = group1_indices
    group2_df.index = group2_indices
    matched_df = pd.concat([
        group1_df.iloc[pd.concat(matched_group1_dfs).index],
        group2_df.iloc[pd.concat(matched_group2_dfs).index]
    ]).sort_index()
    matched_indices = matched_df.index

    return X.loc[matched_indices], y.loc[matched_indices]

Matching 1 point in the treatment group with 1 point in the control group with the closest propensity score is repeated. At that time, a threshold is set so that only pairs whose distance is within 0.2 times std are extracted. For details about this, see Concept of Propensity Score and Its Practice. I am comparing with the matching result by MatchIt in ch3_pscore.ipynb, but there is no exact match. The conclusion remains the same and the covariate balance feels good (see below), so it is generally good. Again, there is no convenient library like R's cobalt love.plot (), so I visualize it by myself. (From images / ch3_plot1.html) スクリーンショット 2020-01-03 16.37.23.png

Inverse probability weighted estimation (IPW)

The good thing about IPW is that it is simple to implement. This is also compared with WeightIt, but it seems to be generally correct. (Excerpt from ch3_pscore.ipynb)

def get_ipw(X, y, random_state=0):
    #Calculate propensity score
    ps_model = LogisticRegression(solver='lbfgs', random_state=random_state).fit(X, y)
    ps_score = ps_model.predict_proba(X)[:, 1]
    all_df = pd.DataFrame({'treatment': y, 'ps_score': ps_score})
    treatments = all_df.treatment.unique()
    if len(treatments) != 2:
        print('Only 2 groups can be matched. 2 groups must be[0, 1]Please express with.')
        raise ValueError
    # treatment ==1 to group1, treatment ==Let 0 be group2.
    group1_df = all_df[all_df.treatment==1].copy()
    group2_df = all_df[all_df.treatment==0].copy()
    group1_df['weight'] = 1 / group1_df.ps_score
    group2_df['weight'] = 1 / (1 - group2_df.ps_score)
    weights = pd.concat([group1_df, group2_df]).sort_index()['weight'].values
    return weights

CausalImpact As described in ch4_did.ipynb, the following two libraries are used for comparison.

dafiti/causalinpact: https://github.com/dafiti/causalimpact
tcassou/causal_impact: https://github.com/tcassou/causal_impact

Since the output is beautiful, I usually use dafiti's causal impact, but it was tcassou's causal_impact that the estimation result was close to the book (R's original causal impact). Both seem to be estimated by the state space model of stats models, but I didn't understand the difference in implementation. Please let me know. .. In both cases, the estimation error is extremely small compared to the R implementation. It may not be so good that there are many covariates.

dafiti / causal impact plot

plot of tcassou / causal_impact

Regression Discontinuity Design (RDD)

With R, you can run it quickly with rddtools, but with Python it's not. First, let's check the regression equation used in rdd_reg_lm of rddtools. (Reference: https://cran.r-project.org/web/packages/rddtools/rddtools.pdf P23)

Y = α + τD + β_1(X-c)+ β_2D(X-c) + ε

As an aside, I thought that RDD would make regression models on the left and right of the cut-off value, and take the difference between the estimated values at the cut-off value, but one regression. It is expressed by an expression. Is it the same in meaning? Where D is a binary variable with a value of 1 if X is greater than or equal to the cut-off value and 0 if it is earlier. The coefficient of D is the value you want to check as the effect size. Also, c is the cut-off value. Based on the above, we will implement it. I also refer to the source code of rddtools for implementation. (https://github.com/MatthieuStigler/RDDtools/blob/master/RDDtools/R/model.matrix.RDD.R) (Excerpt from ch5_rdd.ipynb)

class RDDRegression:
#Rdd of R package rddtools_reg_Reproduce lm
#Reference: https://cran.r-project.org/web/packages/rddtools/rddtools.pdf P23
    def __init__(self, cut_point, degree=4):
        self.cut_point = cut_point
        self.degree = degree
        
    def _preprocess(self, X):
        X = X - threshold_value
        X_poly = PolynomialFeatures(degree=self.degree, include_bias=False).fit_transform(X)
        D_df = X.applymap(lambda x: 1 if x >= 0 else 0)
        X = pd.DataFrame(X_poly, columns=[f'X^{i+1}' for i in range(X_poly.shape[1])])
        X['D'] = D_df
        for i in range(X_poly.shape[1]):
            X[f'D_X^{i+1}'] = X_poly[:, i] * X['D']
        return X
    
    def fit(self, X, y):
        X = X.copy()
        X = self._preprocess(X)
        self.X = X
        self.y = y
        X = sm.add_constant(X)
        self.model = sm.OLS(y, X)
        self.results = self.model.fit()
        coef = self.results.summary().tables[1]
        self.coef = pd.read_html(coef.as_html(), header=0, index_col=0)[0]
        
    def predict(self, X):
        X = self._preprocess(X)
        X = sm.add_constant(X)
        return self.model.predict(self.results.params, X)

Polynomial Features of scikit-learn are used as preprocessing for performing polynomial regression. (From images / ch5_plot2_3.html) newplot (1).png I was able to estimate it well. As you can see by looking at the notebook, the effect size estimation is almost the same as the book. Was good.

nonparametric RDD (RDestimate) I couldn't ... It's almost the part of "(almost) reproduced in Python" at the beginning. In RDestimate, I used the method of Immunos and Kalyanaraman (2012) Optimal Bandwidth Choice for the Regression Discontinuity Estimator to select the optimum bandwidth, but I was not sure how to estimate the bandwidth. In ch5_rdd.ipynb, I tried to implement MSE to find the optimum bandwidth when changing the bandwidth roughly, but it doesn't seem to work very well. (From ch5_plot4.html) newplot (3).png Hmmm, isn't it? It seems that it is impossible to predict outside the bandwidth if it is estimated by purely narrowing the bandwidth, but how do you actually do it? Or maybe I don't have to worry too much because I'm only interested in estimating the cut-off neighborhood.

in conclusion

Please let me know if there is something wrong with your understanding or implementation, or something is wrong. Twitter

I wrote "Introduction to Effect Verification" in Python