Example: Generating data
Motivation
Given the fundamental problem of causal inference (we never observe both potential outcomes for the same unit), simulating or generating data is particularly relevant when working with ATE or CATE estimation: it provides a ground truth that we don’t get from the real world.
For instance, when generating data, we have access to the Individual Treatment Effect and can use that ground truth to evaluate a given treatment effect estimation method.
In the following example we describe how the modules
metalearners.data_generation and
metalearners.outcome_functions can be used to generate data for
treatment effect estimation.
How-to
In the context of treatment effect estimation, our data usually consists of three ingredients:
Covariates
Treatment assignments
Observed outcomes
In this particular scenario of simulating data, we can add some quantities of interest which are not available in the real world:
Potential outcomes
True CATE or true ITE
Let’s generate those quantities one after another.
Covariates
Let’s start by generating covariates. We will use
metalearners.data_generation.generate_covariates() for that
purpose.
from metalearners.data_generation import generate_covariates
features, categorical_features_idx, n_categories = generate_covariates(
    n_obs=1000,
    n_features=8,
    n_categoricals=3,
    format="pandas",
)
features.head()  # type: ignore
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | -2.451565 | 2.189041 | -0.887414 | 2.486362 | -0.268975 | 4 | 2 | 1 |
| 1 | -4.347556 | 3.189760 | -2.135512 | 0.084371 | 3.955077 | 2 | 8 | 2 |
| 2 | -1.050218 | 2.201926 | -3.722261 | 2.383120 | -0.563782 | 0 | 5 | 0 |
| 3 | -4.050834 | 3.250730 | -1.141644 | 1.626316 | 1.833348 | 1 | 4 | 2 |
| 4 | -0.845611 | 1.882806 | 3.619741 | 2.216667 | 1.882438 | 2 | 6 | 0 |
We see that we generated a DataFrame with 8 columns of which the last three are categoricals.
Treatment assignments
In this example we will replicate the setup of an RCT, i.e. where the
treatment assignments are independent of the covariates. We rely on
metalearners.data_generation.generate_treatment().
import numpy as np
from metalearners.data_generation import generate_treatment
# We use a fair coin flip as a reference.
propensity_scores = .5 * np.ones(1000)
treatment = generate_treatment(propensity_scores)
type(treatment), np.unique(treatment), treatment.mean()
(numpy.ndarray, array([0, 1]), 0.512)
As we would expect, an array of binary assignments is generated. The average approximately corresponds to the universal propensity score of .5.
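To make the idea of propensity-based assignment concrete, here is a minimal NumPy-only sketch of drawing binary treatments from propensity scores. It is not the implementation of generate_treatment(), only an illustration of the concept; the names `rng` and `sketch_treatment` and the seed are assumptions introduced for this example.

```python
import numpy as np

# Seeded generator so the sketch is reproducible (the seed is an arbitrary choice).
rng = np.random.default_rng(seed=0)

n_obs = 1000
# Constant propensity score of 0.5, mimicking an RCT with a fair coin flip.
sketch_propensity_scores = np.full(n_obs, 0.5)

# One Bernoulli draw per unit: treated (1) with probability equal to
# the unit's propensity score, control (0) otherwise.
sketch_treatment = rng.binomial(n=1, p=sketch_propensity_scores)

print(np.unique(sketch_treatment), sketch_treatment.mean())
```

Because the propensity scores do not depend on the covariates, the assignment is independent of them by construction, which is what characterizes the RCT setup described above.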
Potential outcomes
In this example we will rely on
metalearners.outcome_functions.linear_treatment_effect(), which
generates additive treatment effects that are linear in the features.
Note that other potential outcome functions are available as well.
from metalearners._utils import get_linear_dimension
from metalearners.outcome_functions import linear_treatment_effect
dim = get_linear_dimension(features)
outcome_function = linear_treatment_effect(dim)
potential_outcomes = outcome_function(features)
potential_outcomes
array([[ 0.64206009, 1.24024087],
[ 0.56319116, 0.79794291],
[ 0.45516047, 0.14003734],
...,
[ 2.36455309, 5.05259348],
[ 1.75308904, 3.76108235],
[-2.79972096, -9.70375103]])
We see that it generates one column with the potential outcome \(Y(0)\) and one column with the potential outcome \(Y(1)\). The individual treatment effect is the difference of the two, \(Y(1) - Y(0)\).
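As a small illustration of how the individual treatment effect follows from the two columns, the sketch below uses a hand-written toy array rather than the output of the outcome function. The values and the names `toy_potential_outcomes`, `toy_ite` and `toy_ate` are hypothetical.

```python
import numpy as np

# Hypothetical potential outcomes for three units:
# column 0 holds Y(0), column 1 holds Y(1).
toy_potential_outcomes = np.array(
    [
        [0.5, 1.5],
        [2.0, 1.0],
        [-1.0, -1.0],
    ]
)

# Individual treatment effect per unit: Y(1) - Y(0).
toy_ite = toy_potential_outcomes[:, 1] - toy_potential_outcomes[:, 0]

# Since every ITE is known in a simulation, the true ATE is simply their mean.
toy_ate = toy_ite.mean()

print(toy_ite, toy_ate)
```

The same column-wise subtraction applies to an array shaped like potential_outcomes above, assuming the column order \(Y(0)\), \(Y(1)\) described in the text.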
Observed outcomes
Lastly, we can combine the treatment assignments and potential
outcomes to generate the observed outcomes. Note that the observed
outcome may differ from the corresponding potential outcome by
noise. For that purpose we can use
metalearners.data_generation.compute_experiment_outputs() and run:
from metalearners.data_generation import compute_experiment_outputs
observed_outcomes, true_cate = compute_experiment_outputs(
    potential_outcomes,
    treatment,
)
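Conceptually, each unit's observed outcome is the potential outcome that matches its treatment assignment, possibly perturbed by noise. The following NumPy-only sketch illustrates that idea on toy data. It is not the implementation of compute_experiment_outputs(); the toy values, the Gaussian noise model and the name `noise_std` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical potential outcomes: column 0 is Y(0), column 1 is Y(1).
toy_potential_outcomes = np.array(
    [
        [0.5, 1.5],
        [2.0, 1.0],
        [-1.0, -1.0],
    ]
)
# Hypothetical binary treatment assignments for the same three units.
toy_treatment = np.array([1, 0, 1])

# For each row, select the column indicated by the treatment assignment:
# Y(1) for treated units, Y(0) for control units.
row_idx = np.arange(len(toy_treatment))
noiseless_outcomes = toy_potential_outcomes[row_idx, toy_treatment]

# Assumed noise model: independent Gaussian noise added to each outcome.
noise_std = 0.1
toy_observed_outcomes = noiseless_outcomes + rng.normal(
    scale=noise_std, size=len(toy_treatment)
)

print(noiseless_outcomes)
print(toy_observed_outcomes)
```

Only the selected column ever reaches the observed outcome, which is exactly why the unselected potential outcome, and hence the ITE, is unavailable in real-world data.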