While researching uplift modeling for work, I came across CATE (Conditional Average Treatment Effect). CATE conditions the ATE (Average Treatment Effect) on features: whereas ATE gives only the "average" treatment effect, the effect should in fact vary with each unit's attributes (features). CATE captures this heterogeneity:
τ(x)=E[Y(1)-Y(0)|X=x]
where $Y(1)$ and $Y(0)$ are the potential outcomes and $X=x$ are the features.
If CATE, i.e. the treatment effect at the individual or segment level, can be estimated, it enables targeting optimizations such as running a campaign only for people with a positive treatment effect, as well as personalization of products and channels.
CausalML is a Python package developed by members of Uber Technologies that provides machine-learning-based causal inference methods, letting you estimate CATE from experimental or observational data. It offers a number of techniques; in this article we look at S-Learner, T-Learner, and X-Learner.
> pip install causalml
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from causalml.inference.meta import BaseXRegressor, BaseSRegressor, BaseTRegressor
We use the data from "Iwanami Data Science Vol.3" to estimate the causal effect of TV commercial (CM) exposure on app usage. See the Kato/Hoshino support page (Iwanami Data Science Vol.3, special feature on causal inference).
# Download data
path = 'https://raw.githubusercontent.com/iwanami-datascience/vol3/master/kato%26hoshino/q_data_x.csv'
df = pd.read_csv(path)
X = df[['area_kanto', 'area_tokai', 'area_keihanshin', 'age', 'sex', 'marry_dummy', 'child_dummy',
'job_dummy1', 'job_dummy2', 'job_dummy3', 'job_dummy4', 'job_dummy5', 'job_dummy6', 'job_dummy7',
'inc', 'pmoney', 'fam_str_dummy1', 'fam_str_dummy2', 'fam_str_dummy3', 'fam_str_dummy4', 'TVwatch_day']]
Y = df['gamesecond']  # App usage time in seconds
W = df['cm_dummy']  # CM exposure flag (treatment assignment)
# Split into training and test data (stratified by W)
X_train, X_test, Y_train, Y_test, W_train, W_test = train_test_split(X, Y, W, test_size=0.2, shuffle=True, random_state=42, stratify=W)
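As a quick sanity check (my addition, not in the original), you can confirm that the stratified split preserved the treatment ratio:
# Sanity check: stratify=W should keep the CM exposure rate similar in both splits
print(f'W rate - train: {W_train.mean():.3f}, test: {W_test.mean():.3f}')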
S-Learner
Stage 1: Fit a single predictive model that includes the treatment assignment among the features, and estimate the mean outcome $μ(x, w)$:
μ(x, w)=E[Y|X=x, W=w]
Stage 2: CATE estimate:
\hat{τ}(x)=\hat{μ}(x, W=1)-\hat{μ}(x, W=0)
# Create an S-Learner instance (using XGBoost as the base learner)
learner_s = BaseSRegressor(learner=XGBRegressor(random_state=42))
# Fit the model
learner_s.fit(X=X_train, treatment=W_train, y=Y_train)
# Predict CATE for the test data
cate_s = learner_s.predict(X=X_test)
cate_s contains a CATE estimate (in seconds of app usage) for each row of the test data.
array([[ 1139.97802734],
[ -22.8037262 ],
[-1353.3692627 ],
...,
[ -751.50939941],
[ 1418.30859375],
[ -743.94995117]])
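To make the two stages concrete, here is a minimal manual sketch of the same logic written directly against XGBRegressor; the variable names are mine, and causalml's internals may differ in detail:
# Manual S-Learner sketch: one model, with the treatment W appended as a feature
X_aug = X_train.assign(W=W_train.values)
model = XGBRegressor(random_state=42).fit(X_aug, Y_train)
# Stage 2: predict under W=1 and W=0 and take the difference
cate_manual = model.predict(X_test.assign(W=1)) - model.predict(X_test.assign(W=0))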
T-Learner
Stage 1: Fit separate predictive models on the treated and control data, and estimate the mean outcomes $μ_1(x)$ and $μ_0(x)$:
μ_1(x)=E[Y(1)|X=x] \\
μ_0(x)=E[Y(0)|X=x]
Stage 2: CATE estimate:
\hat{τ}(x)=\hat{μ_1}(x)-\hat{μ_0}(x)
# Create a T-Learner instance (using XGBoost as the base learner)
learner_t = BaseTRegressor(learner=XGBRegressor(random_state=42))
# Fit the models
learner_t.fit(X=X_train, treatment=W_train, y=Y_train)
# Predict CATE for the test data
cate_t = learner_t.predict(X=X_test)
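Again, a minimal manual sketch of the same two-model logic (my own code, for illustration):
# Manual T-Learner sketch: separate models for the treated and control groups
model_1 = XGBRegressor(random_state=42).fit(X_train[W_train == 1], Y_train[W_train == 1])
model_0 = XGBRegressor(random_state=42).fit(X_train[W_train == 0], Y_train[W_train == 0])
# Stage 2: CATE is the difference between the two models' predictions
cate_manual = model_1.predict(X_test) - model_0.predict(X_test)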
X-Learner
Stage 1: As in T-Learner, fit separate predictive models on the treated and control data, and estimate the mean outcomes $μ_1(x)$ and $μ_0(x)$:
μ_1(x)=E[Y(1)|X=x] \\
μ_0(x)=E[Y(0)|X=x]
Stage 2: Compute the imputed treatment effects $D_i^{1}$ and $D_i^{0}$ for the treatment and control groups:
D_i^{1}=Y_i^{1}-\hat{μ_0}(X_i^{1}) \\
D_i^{0}=\hat{μ_1}(X_i^{0})-Y_i^{0}
Then fit predictive models with $D^{1}$ and $D^{0}$ as targets to estimate CATT (Conditional Average Treatment effect on the Treated) and CATU (Conditional Average Treatment effect on the Untreated):
τ_1(x)=E[D^{1}|X=x] \\
τ_0(x)=E[D^{0}|X=x]
Stage 3: CATE estimate:
\hat{τ}(x)=g(x)\hat{τ_0}(x)+(1-g(x))\hat{τ_1}(x)
$g(x)$ is a weighting function taking values in $[0, 1]$; a common choice is an estimate of the propensity score $P[W=1|X=x]$.
# Create an X-Learner instance (using XGBoost as the base learner)
learner_x = BaseXRegressor(learner=XGBRegressor(random_state=42))
# Fit the models
# If no propensity score is passed (as here), it is estimated internally with an elastic-net model.
learner_x.fit(X=X_train, treatment=W_train, y=Y_train)
# Predict CATE for the test data
cate_x = learner_x.predict(X=X_test)
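For intuition, here is a rough manual sketch of all three stages; note that I use scikit-learn's LogisticRegression for the propensity score $g(x)$ instead of causalml's elastic-net model, so the numbers will not match learner_x exactly:
from sklearn.linear_model import LogisticRegression

treated, control = (W_train == 1), (W_train == 0)
# Stage 1: outcome models for each group
mu_1 = XGBRegressor(random_state=42).fit(X_train[treated], Y_train[treated])
mu_0 = XGBRegressor(random_state=42).fit(X_train[control], Y_train[control])
# Stage 2: imputed treatment effects, then models for tau_1 and tau_0
D1 = Y_train[treated] - mu_0.predict(X_train[treated])
D0 = mu_1.predict(X_train[control]) - Y_train[control]
tau_1 = XGBRegressor(random_state=42).fit(X_train[treated], D1)
tau_0 = XGBRegressor(random_state=42).fit(X_train[control], D0)
# Stage 3: blend with the estimated propensity score g(x) = P[W=1|X=x]
g = LogisticRegression(max_iter=1000).fit(X_train, W_train).predict_proba(X_test)[:, 1]
cate_manual = g * tau_0.predict(X_test) + (1 - g) * tau_1.predict(X_test)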
Let's compare the CATE estimates from each method with a violin plot.
plt.figure(figsize=(12,8))
plt.violinplot([cate_s.flatten(), cate_t.flatten(), cate_x.flatten()], showmeans=True)
plt.xticks([1,2,3], ['S-learner', 'T-learner', 'X-learner'])
plt.ylabel('Distribution of CATE')
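It can also help to look at summary statistics of the estimates directly (a small addition of mine):
# Compare the mean and spread of each method's CATE estimates
for name, cate in [('S', cate_s), ('T', cate_t), ('X', cate_x)]:
    print(f'{name}-Learner: mean={cate.mean():9.1f}, std={cate.std():9.1f}')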
The spread of the estimates increases in the order S-Learner < X-Learner < T-Learner, but the plot alone doesn't tell us much more; we need to understand each method's characteristics and which situations it suits. Incidentally, with tree-based base learners, S-Learner sometimes never splits on the treatment assignment $W$ at all, so its estimates tend to shrink toward zero.
Points to keep in mind when interpreting these estimates:
- Treatment assignment depends only on the observed features; there are no hidden confounders.
- There is no significant imbalance in the feature distributions between the treatment and control groups (a quick balance check is sketched below).
- Where such bias exists, external validity is limited.
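One quick way to probe the feature-balance point above is the standardized mean difference (SMD) between the two groups; this is a rough sketch of mine, and the 0.1 threshold is only a common rule of thumb:
# Standardized mean difference of each feature between CM-exposed and unexposed groups
mean_1, mean_0 = X[W == 1].mean(), X[W == 0].mean()
pooled_std = np.sqrt((X[W == 1].var() + X[W == 0].var()) / 2)
smd = ((mean_1 - mean_0) / pooled_std).abs().sort_values(ascending=False)
print(smd.head(10))  # |SMD| > 0.1 is often read as a sign of imbalance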