Consideration of propensity score and effect estimation accuracy

Article summary

While studying propensity scores, which come up in the field of causal inference, a question kept nagging at me, so I ran an experiment. What I was interested in was how much, and in what way, one should worry about "calculating the propensity score". This time I intentionally created a biased dataset and compared the accuracy of effect estimation in three settings:

  1. Using the true propensity score
  2. Estimating the propensity score with the same model that generated the dataset (logistic regression)
  3. Emphasizing the prediction accuracy for the presence or absence of the intervention (LightGBM)

In this article, "effect" always refers to the ATE (Average Treatment Effect).

**I'm excited that this is my first post; I would appreciate it if you could point out any problems.**

Preparing a biased dataset

Based on the features $x_0, x_1, x_2$, the probability of receiving the intervention $z \in \{0, 1\}$ (= the true propensity score) is determined by the same kind of model as logistic regression. Following the reference text, the true effect of $z$ is fixed at 1.3, and the target is computed as a linear combination of $x_0, x_1, x_2, z$ plus an error term; the coefficients of the linear combination are also taken from the text.

dataset.py


import numpy as np
import pandas as pd

# sample size
n = 10000

# features
np.random.seed(0)
x_0 = np.random.randn(n)

np.random.seed(1)
x_1 = np.random.randn(n)

np.random.seed(2)
x_2 = np.random.randn(n)

# treatment
z = np.zeros_like(x_0)
true_P_score = np.zeros_like(x_0)

# true effect of z
effect = 1.3

for i in range(n):
    # those who have high value of x_0, x_1 and x_2 tend to be z = 1.
    true_P_score[i] = 1 / (1 + np.exp(- x_0[i] - x_1[i] - x_2[i]))

    if np.random.rand(1) < true_P_score[i]:
        z[i] = 1
    else:
        z[i] = 0
        
# error
np.random.seed(3)
error = 0.01 * np.random.randn(n)

# generate target y
target = 2*x_0 + 3*x_1 + 1.5*x_2 + effect*z + error

# make dataframe
data = pd.DataFrame({'x_0': x_0,
                     'x_1': x_1,
                     'x_2': x_2,
                     'z': z,
                     'target': target,
                     'true_P_score': true_P_score})

First, let's check how large the bias is if we naively assume random assignment and simply compare group means: the effect is overestimated quite heavily.

True effect: 1.3
Apparent effect: 5.5

confirm_bias.py


# confirm the bias
print("the true effect of z = ", effect)
print('the pseudo effect of z = ',
      np.mean(data[data.z == 1].target) - np.mean(data[data.z == 0].target))

IPW using propensity score

Let's estimate the effect by IPW (inverse probability weighting), using the propensity score to form the weights. In principle, writing the propensity score of sample $i$ as $P_i$, we give samples with $z = 1$ the weight $Z_i / P_i$ and samples with $z = 0$ the weight $(1 - Z_i) / (1 - P_i)$, compute the weighted expected value of the target in each of the $z = 1, 0$ groups, and take their difference as the estimator of the effect.
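Before moving on to the regression trick below, here is a minimal sketch of that direct computation; this snippet is my own illustration (not from the original article) and reuses the `data` DataFrame with its `true_P_score` column from dataset.py.


# direct IPW estimate (illustration), using the true propensity score
import numpy as np

P = data['true_P_score']
Z = data['z']
y_obs = data['target']

# self-normalized weighted means of the target in each group
mean_treated = np.sum(Z * y_obs / P) / np.sum(Z / P)
mean_control = np.sum((1 - Z) * y_obs / (1 - P)) / np.sum((1 - Z) / (1 - P))

print('IPW estimate of the effect =', mean_treated - mean_control)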

However, I want a library to run the t-test for me, so a small trick helps: give every sample the weight $Z_i / P_i + (1 - Z_i) / (1 - P_i)$, and the problem becomes a **weighted simple linear regression** of $target$ on $z$, whose coefficient on $z$ is exactly the IPW estimate.

ipw.py


from statsmodels.regression.linear_model import WLS # weighted least square
from statsmodels.tools import add_constant

# IPW weights from the propensity score (here the true score;
# swap in an estimated score for the other two comparisons)
P = data['true_P_score']
weights = data['z'] / P + (1 - data['z']) / (1 - P)

X = add_constant(data['z'])
y = data['target']
wls = WLS(y, X, weights=weights).fit()
wls.summary()

It may be superfluous, but the following code prints the true effect next to the estimate.

confirm_effect.py


print("the true effect of z = ", effect)
print("effect of z is estimated as", wls.params[1])

Comparison of effect estimates by propensity score

Now, the true effect is 1.3, as fixed when the dataset was generated, and we estimate it with each of the propensity scores. When the true propensity score is used, the IPW estimator of the effect is **unbiased**, so plugging in the intervention probability used to generate the dataset should give the most accurate estimate.

Also, since the logistic regression model is **identical to the data-generating mechanism**, the estimated propensity scores should be close to the true values, so the effect estimate is expected to be quite good as well. Incidentally, its "prediction accuracy" for the intervention $z$ was about 76%. (Whether this counts as prediction is debatable, since there is only training data, so to speak.) The estimated propensity scores deviated from the true values by about $\pm 0.4$ percentage points on average (the mean of the absolute differences, not a relative error).
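The article does not show this estimation step, so here is a sketch of how it could be done; the use of scikit-learn's LogisticRegression is my assumption, and it reuses `data` from dataset.py.


# estimating the propensity score with logistic regression (sketch)
import numpy as np
from sklearn.linear_model import LogisticRegression

X_cov = data[['x_0', 'x_1', 'x_2']]
z = data['z']

lr = LogisticRegression().fit(X_cov, z)
P_lr = lr.predict_proba(X_cov)[:, 1]  # estimated propensity score

print('accuracy for z =', lr.score(X_cov, z))  # ~0.76 per the article
print('mean abs. deviation from true score =',
      np.mean(np.abs(P_lr - data['true_P_score'])))  # ~0.004 per the article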

Finally, I tried raising the "prediction accuracy" for the intervention $z$ with LightGBM, which reached about 80% with hyperparameters left at their defaults. The estimated propensity scores deviated from the true values by about $\pm 6$ percentage points on average (again, the mean of the absolute differences, not a relative error).
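Again as a sketch under the same assumptions (LGBMClassifier with default hyperparameters; not the article's original code):


# estimating the propensity score with LightGBM at default settings (sketch)
import numpy as np
import lightgbm as lgb

X_cov = data[['x_0', 'x_1', 'x_2']]
z = data['z']

gbm = lgb.LGBMClassifier().fit(X_cov, z)
P_gbm = gbm.predict_proba(X_cov)[:, 1]  # estimated propensity score

print('accuracy for z =', gbm.score(X_cov, z))  # ~0.80 per the article
print('mean abs. deviation from true score =',
      np.mean(np.abs(P_gbm - data['true_P_score'])))  # ~0.06 per the article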

The effect estimation results using these propensity scores are as follows.

True effect: 1.3
Estimate using the true propensity score: 1.3345 ...
Estimate using logistic regression: 1.4002 ...
Estimate using LightGBM: 2.7601 ...

Even though LightGBM predicts the intervention $z$ most accurately, it clearly overestimates the effect by a wide margin.

Conclusion

Unsurprisingly, the estimate based on the true propensity score was the most accurate. In other words, I think what matters is not the **prediction accuracy for the intervention $z$** but the **accuracy of the propensity score itself, which cannot be verified**. In practice, though, the true propensity score is unknown, so I came to think the following points deserve attention when computing propensity scores:

  1. **Ask the decision makers in detail about their criteria for deciding whether or not to intervene** (= build a model as close as possible to the data-generating mechanism).

  2. **Rather than pouring effort into improving the prediction accuracy of the intervention $z$, monitor the balance of the covariates with standardized mean differences and the like** (see the sketch after this list).
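As a minimal sketch of point 2, here is how the standardized mean difference (SMD) of each covariate could be checked before and after IPW weighting; the `smd` helper is my own illustration, not from the article, and it reuses the `data` DataFrame from dataset.py.


# checking covariate balance with standardized mean differences (sketch)
import numpy as np

def smd(x, z, w):
    """Weighted standardized mean difference of covariate x between z=1 and z=0."""
    m1 = np.average(x[z == 1], weights=w[z == 1])
    m0 = np.average(x[z == 0], weights=w[z == 0])
    v1 = np.average((x[z == 1] - m1) ** 2, weights=w[z == 1])
    v0 = np.average((x[z == 0] - m0) ** 2, weights=w[z == 0])
    return (m1 - m0) / np.sqrt((v1 + v0) / 2)

z = data['z'].values
P = data['true_P_score'].values  # swap in an estimated score in practice
w = z / P + (1 - z) / (1 - P)    # IPW weights

for col in ['x_0', 'x_1', 'x_2']:
    x = data[col].values
    raw = smd(x, z, np.ones_like(x))  # unweighted SMD
    wtd = smd(x, z, w)                # SMD after IPW weighting
    print(f'{col}: raw SMD = {raw:.3f}, weighted SMD = {wtd:.3f}')

Thank you for reading to the end.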

References

- Introduction to Effect Verification: Causal Inference for Correct Comparison / Basics of Econometrics (Shota Yasui). ISBN-10: 4297111179, ISBN-13: 978-4297111175
