This article is a summary of what I learned and thought while reading "Chapter 3: Analysis Using Propensity Scores" of "Introduction to Effectiveness Verification: Causal Reasoning for Correct Comparison / Basics of Econometrics".
- The probability of being assigned to the intervention group, conditional on the **observed** covariates $X$: $e(X) = P(Z = 1 \mid X)$ [^px]
  - Not all covariates are necessarily observed
[^px]: In this book the propensity score is written as $P(X)$, but for convenience I write it as $e(X)$ here.
[^ra-vs-ps]: It may be an odd comparison, but this is how it reads in the book.
- Assumption of regression analysis [^ra-vs-ps]
- CIA (Conditional Independence Assumption): the potential outcomes are independent of the assignment $Z$ conditional on the covariates $X$
  - The book uses the CIA, but strictly speaking **strongly ignorable treatment assignment** is the correct condition [^rosenbaum_1983]
  - Strongly ignorable treatment assignment adds the following two assumptions to the CIA:
    - The covariates $X$ consist only of observed variables
    - $0 < P(Z = 1 \mid X) < 1$: every sample has a positive probability of being assigned to both the intervention group and the non-intervention group
      - In short, I think this is **quite important**: the condition is violated if some samples have no chance of ending up in both the intervention group and the non-intervention group
      - For example, if all men were assigned to the intervention group, then $P(Z = 1 \mid \text{male}) = 1$, which violates the condition

Propensity score estimation

- Logistic regression is often used to estimate propensity scores, but machine-learning models are also fine
- The model should include not only the covariates but also other variables that affect the outcome [^tsugawa-ps2]
- Probability calibration is required when using machine-learning models [^rmizutaa] (a minimal sketch follows this list)
  - For example, random forest and SVM?
  - On the other hand, if the loss function is log loss, the prediction converges to the probability, so calibration is unnecessary [^calibration-unnecessary]
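As a side note, here is a minimal sketch (my own, not from the book) of what calibration could look like with scikit-learn: a random forest wrapped in `CalibratedClassifierCV`. `df_cps1_nsw`, `X_columns`, and `z_column` refer to the data prepared in the practical part below, and the hyperparameters are arbitrary assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV

# Random-forest scores are generally not calibrated probabilities,
# so wrap the model in CalibratedClassifierCV (Platt scaling with cross-validation).
rf = RandomForestClassifier(n_estimators=100, min_samples_leaf=10, random_state=0)
calibrated_rf = CalibratedClassifierCV(rf, method='sigmoid', cv=5)
calibrated_rf.fit(df_cps1_nsw[X_columns], df_cps1_nsw[z_column])

# Calibrated propensity scores.
ps_rf = calibrated_rf.predict_proba(df_cps1_nsw[X_columns])[:, 1]
```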
Estimating the intervention effect using propensity scores
Matching vs. IPW [^matching-vs-ipw]

[^matching-vs-ipw]: There are methods other than matching and IPW for estimating the intervention effect with propensity scores, but this book deals only with these two.

- **Matching**: pairs of samples with similar propensity score values are taken from the intervention group and the non-intervention group, and the missing measurement (the counterfactual) of one is complemented by the other

ATT vs. ATE

- ATT in matching: match samples with similar propensity score values from the non-intervention group to each sample in the intervention group, and let the average of the pair differences be the ATT
- ATE in matching: do the same for the samples in the non-intervention group as well, and let the average of the pair differences be the ATE
- ATT in IPW: calculated by the following formula

$$
\frac{\sum_{i=1}^{N}Z_{i}Y_{i}}{\sum_{i=1}^{N}Z_{i}}
-\frac{\sum_{i=1}^{N}\frac{(1-Z_{i})\hat{e}(X_{i})}{1-\hat{e}(X_{i})}Y_{i}}{\sum_{i=1}^{N}\frac{(1-Z_{i})\hat{e}(X_{i})}{1-\hat{e}(X_{i})}}
$$

- ATE in IPW: calculated by the following formula

$$
\frac{\sum_{i=1}^{N}\frac{Z_{i}}{\hat{e}(X_{i})}Y_{i}}{\sum_{i=1}^{N}\frac{Z_{i}}{\hat{e}(X_{i})}}
-\frac{\sum_{i=1}^{N}\frac{1-Z_{i}}{1-\hat{e}(X_{i})}Y_{i}}{\sum_{i=1}^{N}\frac{1-Z_{i}}{1-\hat{e}(X_{i})}}
$$
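As a sanity check on these formulas, here is a minimal sketch (mine, not the book's) that evaluates them directly on a DataFrame with hypothetical columns `z` (assignment), `y` (outcome), and `ps` (the estimated propensity score $\hat{e}(X)$).

```python
import numpy as np
import pandas as pd

def ipw_att(df: pd.DataFrame) -> float:
    """ATT by IPW: treated mean minus the e(X)/(1-e(X))-weighted mean of the untreated."""
    z, y, e = df['z'], df['y'], df['ps']
    control_w = (1 - z) * e / (1 - e)
    return np.average(y, weights=z) - np.average(y, weights=control_w)

def ipw_ate(df: pd.DataFrame) -> float:
    """ATE by IPW: 1/e(X)-weighted treated mean minus 1/(1-e(X))-weighted untreated mean."""
    z, y, e = df['z'], df['y'], df['ps']
    return np.average(y, weights=z / e) - np.average(y, weights=(1 - z) / (1 - e))
```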
About common support

- Common support is the range over which the propensity score distributions of the intervention group and the non-intervention group overlap
- If samples outside the common support are used in the IPW calculation, the reciprocal of the propensity score (i.e. the weight) becomes too large and the estimate breaks down [^tsugawa-ps1]
  - Are "outside the common support" and "the weight becomes too large" really equivalent?
- Is it common practice to exclude samples that lie outside the common support?
  - The book introduces propensity score trimming / clipping (see the sketch after this list)
  - Couldn't excluding samples outside the common support introduce a new bias?
  - So it is important to **design the experiment so that two properly comparable groups can be observed and such exclusion is unnecessary in the first place**
    - Any bias that remains on top of that should be adjusted with propensity scores and the like
  - Outside the common support there are no samples that can complement the counterfactual, so isn't "strongly ignorable treatment assignment" violated there to begin with?
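For illustration, a minimal sketch (mine, not the book's code) of trimming and clipping. The cut-offs 0.01 and 0.99 are arbitrary assumptions, and `df` is assumed to hold the estimated propensity score in a `ps` column and the assignment in `treat`, like the DataFrames in the practical part below.

```python
import numpy as np

# Trimming: drop samples whose estimated propensity score falls outside the chosen range.
df_trimmed = df[df['ps'].between(0.01, 0.99)]

# Clipping: keep every sample but cap the propensity score so the IPW weights stay bounded.
ps_clipped = df['ps'].clip(0.01, 0.99)
w_clipped = np.where(df['treat'] == 1, 1 / ps_clipped, 1 / (1 - ps_clipped))
```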
Evaluation of propensity score

- What matters is whether the propensity score balances the distributions of the covariates between the intervention group and the non-intervention group (a helper for checking this is sketched below)
- In other words, whether the assumptions are satisfied
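The check used below is the standardized mean difference. Here is a minimal helper sketch (my own, with hypothetical column names: `z` for the assignment and an optional weight column); values below roughly 0.1 are commonly read as acceptable balance, which is what the red dashed line in the balance plots later marks.

```python
import numpy as np
import pandas as pd

def standardized_mean_difference(df: pd.DataFrame, covariate: str, weight: str = None) -> float:
    """Absolute difference of the (weighted) group means of a covariate, divided by its overall std."""
    w = df[weight] if weight is not None else pd.Series(1.0, index=df.index)
    means = df.groupby('z').apply(lambda g: np.average(g[covariate], weights=w.loc[g.index]))
    return abs(means.diff().iloc[-1]) / df[covariate].std()
```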
Datasets

- MineThatData E-Mail Analytics And Data Mining Challenge dataset [^1]
  - To be added at a later date
- LaLonde (1986)
  - Refer to this manual for the explanation of the data
  - Refer to here for the explanation of the paper
  - NSW: RCT dataset
  - CPS1+NSW: dataset in which the non-intervention group of NSW is replaced by CPS1
  - CPS3+NSW: dataset in which the non-intervention group of NSW is replaced by CPS3

[^1]: https://blog.minethatdata.com/2008/03/minethatdata-e-mail-analytics-and-data.html

- Confirming the distributions of the intervention group and the non-intervention group
  - NSW: the data is well balanced, as you would expect from an RCT
  - CPS1+NSW: quite unbalanced data
  - CPS3+NSW: not as balanced as NSW, but more balanced than CPS1+NSW
- Regression analysis
  - The intervention effect in NSW is about \$1,600
  - The challenge is how close we can get to \$1,600 using propensity scores, starting from the biased datasets
  - Where do the `I(re74^2)` and `I(re75^2)` terms in the book's model come from?
- Propensity score estimation
  - CPS1+NSW: estimate the propensity score by logistic regression, check its distribution, and calculate the AUC
  - CPS3+NSW: estimate the propensity score by logistic regression, check its distribution, and calculate the AUC
- ATT estimation with matching, CPS1+NSW
  - Define a function (a recursive generator) that builds the pairs; I used this as a reference
  - Make the pairs and confirm the pairs of the intervention group
    - The orange points are discarded samples
    - Ideally the pairs should be formed so that the crossings seen in the figure do not occur (I think)
  - Calculate the ATT
    - The standard deviation is quite large
- ATT estimation with IPW, CPS1+NSW
  - Calculate the weights and estimate the ATT by weighted linear regression
  - Extract the samples in the range where the propensity score distributions overlap and estimate the ATT by weighted linear regression again
- Checking covariate balance, CPS1+NSW
  - Confirm the standardized mean differences of the covariates between the two groups
    - After adjustment they fall almost within the red dotted line
    - Does this match the graph in the book? (especially the black points before adjustment)
  - Confirm the distribution of the covariates after matching
    - The mean alone does not tell the whole story, so look at the distributions
    - Balance is much better than in the original distributions
  - Confirm the distribution of the covariates after IPW: balance is much better than in the original distributions
  - Confirm the distribution of the covariates after IPW (common support): balance is much better than in the original distributions
- ATT estimation, CPS3+NSW
  - Matching: make the pairs, confirm the pairs of the intervention group, and calculate the ATT
  - IPW: calculate the weights and estimate the ATT by weighted linear regression
  - Extract the samples in the range where the propensity score distributions overlap and estimate the ATT by weighted linear regression again
  - Confirm the standardized mean differences of the covariates between the two groups
  - Confirm the distributions of the covariates after matching, after IPW, and after IPW (common support)
  - The covariates are balanced, but the estimated intervention effect seems quite far from \$1,600
- ATE estimation, CPS1+NSW
  - Check the pairs and calculate the ATE
  - IPW: estimate the ATE by weighted linear regression, and again on the common support only
  - Confirm the standardized mean differences of the covariates between the two groups
  - Confirm the distributions of the covariates after matching, after IPW, and after IPW (common support)
- ATE estimation, CPS3+NSW
  - Check the pairs and calculate the ATE
  - IPW: estimate the ATE by weighted linear regression, and again on the common support only
  - Confirm the standardized mean differences of the covariates between the two groups
  - Confirm the distributions of the covariates after matching, after IPW, and after IPW (common support)
Practical edition
Data read
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
df_cps1 = pd.read_stata('https://users.nber.org/~rdehejia/data/cps_controls.dta')
df_cps3 = pd.read_stata('https://users.nber.org/~rdehejia/data/cps_controls3.dta')
df_nsw = pd.read_stata('https://users.nber.org/~rdehejia/data/nsw_dw.dta')
Data set creation
treat_label = {0: 'untreat', 1: 'treat'}
X_columns = ['age', 'education', 'black', 'hispanic', 'married', 'nodegree', 're74', 're75']
y_column = 're78'
z_column = 'treat'
df_cps1_nsw = pd.concat([df_nsw[df_nsw[z_column] == 1], df_cps1], ignore_index=True)
df_cps3_nsw = pd.concat([df_nsw[df_nsw[z_column] == 1], df_cps3], ignore_index=True)
Confirm distribution of intervention group and non-intervention group
df_nsw[z_column].map(treat_label).value_counts().plot.bar(title='# of treatment in NSW')
_, ax = plt.subplots(3, 3, figsize=(16, 12))
for i, c in enumerate(X_columns + [y_column]):
(df_nsw[c].groupby(df_nsw[z_column].map(treat_label))
.plot.hist(density=True, alpha=0.5, title=f'{c} densities', legend=True, ax=ax[i // 3, i % 3]))
df_cps1_nsw[z_column].map(treat_label).value_counts().plot.bar(title='# of treatment in CPS1+NSW')
_, ax = plt.subplots(3, 3, figsize=(16, 12))
for i, c in enumerate(X_columns + [y_column]):
(df_cps1_nsw[c].groupby(df_cps1_nsw[z_column].map(treat_label))
.plot.hist(density=True, alpha=0.5, title=f'{c} densities', legend=True, ax=ax[i // 3, i % 3]))
df_cps3_nsw[z_column].map(treat_label).value_counts().plot.bar(title='# of treatment in CPS3+NSW')
_, ax = plt.subplots(3, 3, figsize=(16, 12))
for i, c in enumerate(X_columns + [y_column]):
(df_cps3_nsw[c].groupby(df_cps3_nsw[z_column].map(treat_label))
.plot.hist(density=True, alpha=0.5, title=f'{c} densities', legend=True, ax=ax[i // 3, i % 3]))
regression analysis
import statsmodels.api as sm
results = sm.OLS(df_nsw[y_column], sm.add_constant(df_nsw[[z_column] + X_columns])).fit()
results.summary().tables[1]
results = sm.OLS(df_cps1_nsw[y_column], sm.add_constant(df_cps1_nsw[[z_column] + X_columns])).fit()
results.summary().tables[1]
results = sm.OLS(df_cps3_nsw[y_column], sm.add_constant(df_cps3_nsw[[z_column] + X_columns])).fit()
results.summary().tables[1]
result
| data set | Intervention effect [$] |
|---|---|
| NSW | 1676.3 |
| CPS1+NSW | 699.1 |
| CPS3+NSW | 1548.2 |
Game Start
Propensity score estimation
ps_model = sm.Logit(df_cps1_nsw[z_column], sm.add_constant(df_cps1_nsw[X_columns])).fit()
ps_model.summary().tables[1]
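# Note: fittedvalues of a statsmodels Logit result is the linear predictor,
# so apply the sigmoid to get probabilities (ps_model.predict() would return them directly).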
df_cps1_nsw['ps'] = 1 / (1 + np.exp(-ps_model.fittedvalues))
(df_cps1_nsw.groupby(df_cps1_nsw[z_column].map(treat_label))['ps']
.plot.hist(density=True, alpha=0.5, title=f'propensity score densities', legend=True))
from sklearn.metrics import roc_auc_score
roc_auc_score(df_cps1_nsw[z_column], df_cps1_nsw['ps'])
0.9708918648513447
ps_model = sm.Logit(df_cps3_nsw[z_column], sm.add_constant(df_cps3_nsw[X_columns])).fit()
ps_model.summary().tables[1]
df_cps3_nsw['ps'] = 1 / (1 + np.exp(-ps_model.fittedvalues))
(df_cps3_nsw.groupby(df_cps3_nsw[z_column].map(treat_label))['ps']
.plot.hist(density=True, alpha=0.5, title=f'propensity score densities', legend=True))
from sklearn.metrics import roc_auc_score
roc_auc_score(df_cps3_nsw[z_column], df_cps3_nsw['ps'])
0.8742266742266742
ATT estimation
matching
from sklearn.neighbors import NearestNeighbors
def pairs_generator(s0: pd.Series, s1: pd.Series, threshold: float) -> pd.DataFrame:
"""
A recursive generator that creates pairs using the K-nearest neighbor method.
Parameters
----------
s0, s1 : pd.Series
threshold : float
The threshold for the distance to pair. If even the nearest neighbor points exceed the threshold value, they will not be paired.
Returns
-------
pairs : pd.DataFrame
Index of a pair made up of the s0 sample and the closest s1 sample.
Column l is the index of s0 and column r is the index of s1.
Notes
-----
The same sample is never used for multiple pairs.
Examples
--------
s0 = df[df[z_column] == 0]['ps']
s1 = df[df[z_column] == 1]['ps']
threshold = df['ps'].std() * 0.1
pairs = pd.concat(pairs_generator(s1, s0, threshold), ignore_index=True)
"""
neigh_dist, neigh_ind = NearestNeighbors(n_neighbors=1).fit(s0.values.reshape(-1, 1)).kneighbors(s1.values.reshape(-1, 1))
pairs = pd.DataFrame(
{'d': neigh_dist[:, 0], 'l': s0.iloc[neigh_ind[:, 0]].index},
index=s1.index,
).query(f'd < {threshold}').groupby('l')['d'].idxmin().rename('r').reset_index()
if len(pairs) > 0:
yield pairs
yield from pairs_generator(s0.drop(pairs['l']), s1.drop(pairs['r']), threshold)
s1 = df_cps1_nsw[df_cps1_nsw[z_column] == 1]['ps']
s0 = df_cps1_nsw[df_cps1_nsw[z_column] == 0]['ps']
threshold = df_cps1_nsw['ps'].std() * 0.1
pairs1_cps1_nsw = pd.concat(pairs_generator(s1, s0, threshold), ignore_index=True)
pairs0_cps1_nsw = pd.concat(pairs_generator(s0, s1, threshold), ignore_index=True) #Used in ATE calculation
ax = df_cps1_nsw.loc[pairs1_cps1_nsw.unstack()].plot.scatter(x='ps', y=z_column, label='matched', alpha=0.3, figsize=(16, 4), title='pairs on treated')
df_cps1_nsw.loc[~df_cps1_nsw.index.isin(pairs1_cps1_nsw.unstack())].plot.scatter(x='ps', y=z_column, label='unmatched', alpha=0.3, c='tab:orange', ax=ax)
for x in zip(df_cps1_nsw['ps'].iloc[pairs1_cps1_nsw['l']].values, df_cps1_nsw['ps'].iloc[pairs1_cps1_nsw['r']].values):
plt.plot(x, [1, 0], c='tab:blue', alpha=0.3)
(pairs1_cps1_nsw['l'].map(df_cps1_nsw[y_column]) - pairs1_cps1_nsw['r'].map(df_cps1_nsw[y_column])).agg(['mean', 'std'])
mean 1929.083008
std 9618.346680
dtype: float64
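# IPW weight: 1/e(X) for the intervention group and 1/(1-e(X)) for the non-intervention group.
# Below it is multiplied by e(X) (w * ps), which gives the ATT weights:
# 1 for treated samples and e(X)/(1-e(X)) for untreated samples.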
df_cps1_nsw['w'] = df_cps1_nsw[z_column] / df_cps1_nsw['ps'] + (1 - df_cps1_nsw[z_column]) / (1 - df_cps1_nsw['ps'])
att_model = sm.WLS(df_cps1_nsw[y_column], df_cps1_nsw[[z_column]].assign(untreat=1-df_cps1_nsw[z_column]), weights=df_cps1_nsw['w'] * df_cps1_nsw['ps']).fit()
att_model.summary().tables[1]
att_model.params['treat'] - att_model.params['untreat']
1180.4078220960955
IPW (common support)
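# Keep only samples whose propensity score lies in the common support:
# between the larger of the two groups' minimums and the smaller of the two groups' maximums.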
df_cps1_nsw_cs = df_cps1_nsw[df_cps1_nsw['ps'].between(*df_cps1_nsw.groupby(z_column)['ps'].agg(['min', 'max']).agg({'min': 'max', 'max': 'min'}))].copy()
att_model = sm.WLS(df_cps1_nsw_cs[y_column], df_cps1_nsw_cs[[z_column]].assign(untreat=1-df_cps1_nsw_cs[z_column]), weights=df_cps1_nsw_cs['w'] * df_cps1_nsw_cs['ps']).fit()
att_model.summary().tables[1]
att_model.params['treat'] - att_model.params['untreat']
1243.8640545001808
pd.DataFrame([{
'unadjusted': np.subtract(*df_cps1_nsw.groupby(z_column)[c].mean()) / df_cps1_nsw[c].std(),
'adjusted(matching)': np.subtract(*df_cps1_nsw.loc[pairs1_cps1_nsw.unstack()].groupby(z_column)[c].mean()) / df_cps1_nsw[c].std(),
'adjusted(weighting)': np.subtract(*df_cps1_nsw.groupby(z_column).apply(lambda x: np.average(x[c], weights=x['w'] * x['ps']))) / df_cps1_nsw[c].std(),
'adjusted(weighting in common support)': np.subtract(*df_cps1_nsw_cs.groupby(z_column).apply(lambda x: np.average(x[c], weights=x['w'] * x['ps']))) / df_cps1_nsw_cs[c].std(),
} for c in X_columns], index=X_columns).abs().plot(marker='o', alpha=0.5, grid=True, ylim=(0, 100), logy=True, title='covariates balance')
plt.axhline(y=0.1, linestyle='--', c='r')
_, ax = plt.subplots(3, 3, figsize=(16, 12))
for i, c in enumerate(X_columns + [y_column]):
(df_cps1_nsw.loc[pairs1_cps1_nsw.unstack()][c].groupby(df_cps1_nsw[z_column].map(treat_label))
.plot.hist(density=True, alpha=0.5, title=f'{c} densities', legend=True, ax=ax[i // 3, i % 3]))
_, ax = plt.subplots(3, 3, figsize=(16, 12))
df = df_cps1_nsw.sample(1000000, replace=True, weights=df_cps1_nsw['w'] * df_cps1_nsw['ps'], random_state=2020)
for i, c in enumerate(X_columns + [y_column]):
(df[c].groupby(df_cps1_nsw[z_column].map(treat_label))
.plot.hist(density=True, alpha=0.5, title=f'{c} densities', legend=True, ax=ax[i // 3, i % 3]))
_, ax = plt.subplots(3, 3, figsize=(16, 12))
df = df_cps1_nsw_cs.sample(1000000, replace=True, weights=df_cps1_nsw_cs['w'] * df_cps1_nsw_cs['ps'], random_state=2020)
for i, c in enumerate(X_columns + [y_column]):
(df[c].groupby(df_cps1_nsw_cs[z_column].map(treat_label))
.plot.hist(density=True, alpha=0.5, title=f'{c} densities', legend=True, ax=ax[i // 3, i % 3]))
matching
s1 = df_cps3_nsw[df_cps3_nsw[z_column] == 1]['ps']
s0 = df_cps3_nsw[df_cps3_nsw[z_column] == 0]['ps']
threshold = df_cps3_nsw['ps'].std() * 0.1
pairs1_cps3_nsw = pd.concat(pairs_generator(s1, s0, threshold), ignore_index=True)
pairs0_cps3_nsw = pd.concat(pairs_generator(s0, s1, threshold), ignore_index=True) #Used in ATE calculation
ax = df_cps3_nsw.loc[pairs1_cps3_nsw.unstack()].plot.scatter(x='ps', y=z_column, label='matched', alpha=0.3, figsize=(16, 4), title='pairs on treated')
df_cps3_nsw.loc[~df_cps3_nsw.index.isin(pairs1_cps3_nsw.unstack())].plot.scatter(x='ps', y=z_column, label='unmatched', alpha=0.3, c='tab:orange', ax=ax)
for x in zip(df_cps3_nsw['ps'].iloc[pairs1_cps3_nsw['l']].values, df_cps3_nsw['ps'].iloc[pairs1_cps3_nsw['r']].values):
plt.plot(x, [1, 0], c='tab:blue', alpha=0.3)
(pairs1_cps3_nsw['l'].map(df_cps3_nsw[y_column]) - pairs1_cps3_nsw['r'].map(df_cps3_nsw[y_column])).agg(['mean', 'std'])
mean 1463.115845
std 8779.598633
dtype: float64
df_cps3_nsw['w'] = df_cps3_nsw[z_column] / df_cps3_nsw['ps'] + (1 - df_cps3_nsw[z_column]) / (1 - df_cps3_nsw['ps'])
att_model = sm.WLS(df_cps3_nsw[y_column], df_cps3_nsw[[z_column]].assign(untreat=1-df_cps3_nsw[z_column]), weights=df_cps3_nsw['w'] * df_cps3_nsw['ps']).fit()
att_model.summary().tables[1]
att_model.params['treat'] - att_model.params['untreat']
1214.0711733143507
IPW (common support)
df_cps3_nsw_cs = df_cps3_nsw[df_cps3_nsw['ps'].between(*df_cps3_nsw.groupby(z_column)['ps'].agg(['min', 'max']).agg({'min': 'max', 'max': 'min'}))].copy()
att_model = sm.WLS(df_cps3_nsw_cs[y_column], df_cps3_nsw_cs[[z_column]].assign(untreat=1-df_cps3_nsw_cs[z_column]), weights=df_cps3_nsw_cs['w'] * df_cps3_nsw_cs['ps']).fit()
att_model.summary().tables[1]
att_model.params['treat'] - att_model.params['untreat']
988.8657151185471
pd.DataFrame([{
'unadjusted': np.subtract(*df_cps3_nsw.groupby(z_column)[c].mean()) / df_cps3_nsw[c].std(),
'adjusted(matching)': np.subtract(*df_cps3_nsw.loc[pairs1_cps3_nsw.unstack()].groupby(z_column)[c].mean()) / df_cps3_nsw[c].std(),
'adjusted(weighting)': np.subtract(*df_cps3_nsw.groupby(z_column).apply(lambda x: np.average(x[c], weights=x['w'] * x['ps']))) / df_cps3_nsw[c].std(),
'adjusted(weighting in common support)': np.subtract(*df_cps3_nsw_cs.groupby(z_column).apply(lambda x: np.average(x[c], weights=x['w'] * x['ps']))) / df_cps3_nsw_cs[c].std(),
} for c in X_columns], index=X_columns).abs().plot(marker='o', alpha=0.5, grid=True, ylim=(0, 100), logy=True, title='covariates balance')
plt.axhline(y=0.1, linestyle='--', c='r')
_, ax = plt.subplots(3, 3, figsize=(16, 12))
for i, c in enumerate(X_columns + [y_column]):
(df_cps3_nsw.loc[pairs1_cps3_nsw.unstack()][c].groupby(df_cps3_nsw[z_column].map(treat_label))
.plot.hist(density=True, alpha=0.5, title=f'{c} densities', legend=True, ax=ax[i // 3, i % 3]))
_, ax = plt.subplots(3, 3, figsize=(16, 12))
df = df_cps3_nsw.sample(1000000, replace=True, weights=df_cps3_nsw['w'] * df_cps3_nsw['ps'], random_state=2020)
for i, c in enumerate(X_columns + [y_column]):
(df[c].groupby(df_cps3_nsw[z_column].map(treat_label))
.plot.hist(density=True, alpha=0.5, title=f'{c} densities', legend=True, ax=ax[i // 3, i % 3]))
_, ax = plt.subplots(3, 3, figsize=(16, 12))
df = df_cps3_nsw_cs.sample(1000000, replace=True, weights=df_cps3_nsw_cs['w'] * df_cps3_nsw_cs['ps'], random_state=2020)
for i, c in enumerate(X_columns + [y_column]):
(df[c].groupby(df_cps3_nsw_cs[z_column].map(treat_label))
.plot.hist(density=True, alpha=0.5, title=f'{c} densities', legend=True, ax=ax[i // 3, i % 3]))
result
| data set | matching | IPW | IPW (common support) |
|---|---|---|---|
| CPS1+NSW | 1929.1 | 1180.4 | 1243.9 |
| CPS3+NSW | 1463.1 | 1214.1 | 988.9 |
ATE estimation
matching
pairs_cps1_nsw = pd.concat([pairs1_cps1_nsw, pairs0_cps1_nsw.rename({'l': 'r', 'r': 'l'}, axis=1)], ignore_index=True)
ax = df_cps1_nsw.loc[pairs_cps1_nsw.unstack()].plot.scatter(x='ps', y=z_column, label='matched', alpha=0.3, figsize=(16, 4), title='pairs')
df_cps1_nsw.loc[~df_cps1_nsw.index.isin(pairs_cps1_nsw.unstack())].plot.scatter(x='ps', y=z_column, label='unmatched', alpha=0.3, c='tab:orange', ax=ax)
for x in zip(df_cps1_nsw['ps'].iloc[pairs_cps1_nsw['l']].values, df_cps1_nsw['ps'].iloc[pairs_cps1_nsw['r']].values):
plt.plot(x, [1, 0], c='tab:blue', alpha=0.3)
(pairs_cps1_nsw['l'].map(df_cps1_nsw[y_column]) - pairs_cps1_nsw['r'].map(df_cps1_nsw[y_column])).agg(['mean', 'std'])
mean 1836.516968
std 9634.636719
dtype: float64
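# For the ATE the raw IPW weights w are used as-is: 1/e(X) for the intervention group
# and 1/(1-e(X)) for the non-intervention group.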
ate_model = sm.WLS(df_cps1_nsw[y_column], df_cps1_nsw[[z_column]].assign(untreat=1-df_cps1_nsw[z_column]), weights=df_cps1_nsw['w']).fit()
ate_model.summary().tables[1]
ate_model.params['treat'] - ate_model.params['untreat']
-6456.300804768925
IPW (common support)
ate_model = sm.WLS(df_cps1_nsw_cs[y_column], df_cps1_nsw_cs[[z_column]].assign(untreat=1-df_cps1_nsw_cs[z_column]), weights=df_cps1_nsw_cs['w']).fit()
ate_model.summary().tables[1]
ate_model.params['treat'] - ate_model.params['untreat']
545.3149727568589
pd.DataFrame([{
'unadjusted': np.subtract(*df_cps1_nsw.groupby(z_column)[c].mean()) / df_cps1_nsw[c].std(),
'adjusted(matching)': np.subtract(*df_cps1_nsw.loc[pairs_cps1_nsw.unstack()].groupby(z_column)[c].mean()) / df_cps1_nsw[c].std(),
'adjusted(weighting)': np.subtract(*df_cps1_nsw.groupby(z_column).apply(lambda x: np.average(x[c], weights=x['w']))) / df_cps1_nsw[c].std(),
'adjusted(weighting in common support)': np.subtract(*df_cps1_nsw_cs.groupby(z_column).apply(lambda x: np.average(x[c], weights=x['w']))) / df_cps1_nsw_cs[c].std(),
} for c in X_columns], index=X_columns).abs().plot(marker='o', alpha=0.5, grid=True, ylim=(0, 100), logy=True, title='covariates balance')
plt.axhline(y=0.1, linestyle='--', c='r')
_, ax = plt.subplots(3, 3, figsize=(16, 12))
for i, c in enumerate(X_columns + [y_column]):
(df_cps1_nsw.loc[pairs_cps1_nsw.unstack()][c].groupby(df_cps1_nsw[z_column].map(treat_label))
.plot.hist(density=True, alpha=0.5, title=f'{c} densities', legend=True, ax=ax[i // 3, i % 3]))
_, ax = plt.subplots(3, 3, figsize=(16, 12))
df = df_cps1_nsw.sample(1000000, replace=True, weights=df_cps1_nsw['w'], random_state=2020)
for i, c in enumerate(X_columns + [y_column]):
(df[c].groupby(df_cps1_nsw[z_column].map(treat_label))
.plot.hist(density=True, alpha=0.5, title=f'{c} densities', legend=True, ax=ax[i // 3, i % 3]))
_, ax = plt.subplots(3, 3, figsize=(16, 12))
df = df_cps1_nsw_cs.sample(1000000, replace=True, weights=df_cps1_nsw_cs['w'], random_state=2020)
for i, c in enumerate(X_columns + [y_column]):
(df[c].groupby(df_cps1_nsw_cs[z_column].map(treat_label))
.plot.hist(density=True, alpha=0.5, title=f'{c} densities', legend=True, ax=ax[i // 3, i % 3]))
matching
pairs_cps3_nsw = pd.concat([pairs1_cps3_nsw, pairs0_cps3_nsw.rename({'l': 'r', 'r': 'l'}, axis=1)], ignore_index=True)
ax = df_cps3_nsw.loc[pairs_cps3_nsw.unstack()].plot.scatter(x='ps', y=z_column, label='matched', alpha=0.3, figsize=(16, 4), title='pairs')
df_cps3_nsw.loc[~df_cps3_nsw.index.isin(pairs_cps3_nsw.unstack())].plot.scatter(x='ps', y=z_column, label='unmatched', alpha=0.3, c='tab:orange', ax=ax)
for x in zip(df_cps3_nsw['ps'].iloc[pairs_cps3_nsw['l']].values, df_cps3_nsw['ps'].iloc[pairs_cps3_nsw['r']].values):
plt.plot(x, [1, 0], c='tab:blue', alpha=0.3)
(pairs_cps3_nsw['l'].map(df_cps3_nsw[y_column]) - pairs_cps3_nsw['r'].map(df_cps3_nsw[y_column])).agg(['mean', 'std'])
mean 1486.608276
std 8629.639648
dtype: float64
ate_model = sm.WLS(df_cps3_nsw[y_column], df_cps3_nsw[[z_column]].assign(untreat=1-df_cps3_nsw[z_column]), weights=df_cps3_nsw['w']).fit()
ate_model.summary().tables[1]
ate_model.params['treat'] - ate_model.params['untreat']
224.6762634665365
IPW (common support)
ate_model = sm.WLS(df_cps3_nsw_cs[y_column], df_cps3_nsw_cs[[z_column]].assign(untreat=1-df_cps3_nsw_cs[z_column]), weights=df_cps3_nsw_cs['w']).fit()
ate_model.summary().tables[1]
ate_model.params['treat'] - ate_model.params['untreat']
801.0736196669177
pd.DataFrame([{
'unadjusted': np.subtract(*df_cps3_nsw.groupby(z_column)[c].mean()) / df_cps3_nsw[c].std(),
'adjusted(matching)': np.subtract(*df_cps3_nsw.loc[pairs_cps3_nsw.unstack()].groupby(z_column)[c].mean()) / df_cps3_nsw[c].std(),
'adjusted(weighting)': np.subtract(*df_cps3_nsw.groupby(z_column).apply(lambda x: np.average(x[c], weights=x['w']))) / df_cps3_nsw[c].std(),
'adjusted(weighting in common support)': np.subtract(*df_cps3_nsw_cs.groupby(z_column).apply(lambda x: np.average(x[c], weights=x['w']))) / df_cps3_nsw_cs[c].std(),
} for c in X_columns], index=X_columns).abs().plot(marker='o', alpha=0.5, grid=True, ylim=(0, 100), logy=True, title='covariates balance')
plt.axhline(y=0.1, linestyle='--', c='r')
_, ax = plt.subplots(3, 3, figsize=(16, 12))
for i, c in enumerate(X_columns + [y_column]):
(df_cps3_nsw.loc[pairs_cps3_nsw.unstack()][c].groupby(df_cps3_nsw[z_column].map(treat_label))
.plot.hist(density=True, alpha=0.5, title=f'{c} densities', legend=True, ax=ax[i // 3, i % 3]))
_, ax = plt.subplots(3, 3, figsize=(16, 12))
df = df_cps3_nsw.sample(1000000, replace=True, weights=df_cps3_nsw['w'], random_state=2020)
for i, c in enumerate(X_columns + [y_column]):
(df[c].groupby(df_cps3_nsw[z_column].map(treat_label))
.plot.hist(density=True, alpha=0.5, title=f'{c} densities', legend=True, ax=ax[i // 3, i % 3]))
_, ax = plt.subplots(3, 3, figsize=(16, 12))
df = df_cps3_nsw_cs.sample(1000000, replace=True, weights=df_cps3_nsw_cs['w'], random_state=2020)
for i, c in enumerate(X_columns + [y_column]):
(df[c].groupby(df_cps3_nsw_cs[z_column].map(treat_label))
.plot.hist(density=True, alpha=0.5, title=f'{c} densities', legend=True, ax=ax[i // 3, i % 3]))
result
| data set | matching | IPW | IPW (common support) |
|---|---|---|---|
| CPS1+NSW | 1836.5 | -6456.3 | 545.3 |
| CPS3+NSW | 1486.6 | 224.7 | 801.0 |
Impressions

- I can't get rid of the "bias" at all ...