Example: Generating data¶
Motivation¶
Given the fundamental problem of causal inference, simulated data are particularly useful when working with ATE or CATE estimation: they provide a ground truth that real-world data cannot.
For instance, when generating data, we can have access to the Individual Treatment Effect and use that ground truth to evaluate a treatment effect method at hand.
In the following example we will describe how the modules
data_generation and
outcome_functions can be used to generate data in
light of treatment effect estimation.
How-to¶
In the context of treatment effect estimation, our data usually consists of 3 ingredients:
- Covariates
- Treatment assignments
- Observed outcomes
In this particular scenario of simulating data, we can add some quantities of interest which are not available in the real world:
- Potential outcomes
- True CATE or true ITE
Let's generate those quantities one after another.
Covariates¶
Let's start by generating covariates. We will use
generate_covariates for that
purpose.
from metalearners.data_generation import generate_covariates
features, categorical_features_idx, n_categories = generate_covariates(
n_obs=1000,
n_features=8,
n_categoricals=3,
format="pandas",
)
features.head() # type: ignore
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | -0.432580 | -1.956691 | -2.724410 | -4.051359 | -4.275785 | 0 | 2 | 3 |
| 1 | 0.213119 | -1.794246 | 3.335272 | 0.596448 | -8.053070 | 0 | 2 | 3 |
| 2 | -0.333022 | -1.855324 | 2.567406 | -0.507977 | -7.255018 | 2 | 4 | 3 |
| 3 | -1.036547 | -1.379920 | 1.721547 | -2.817249 | -4.626411 | 2 | 3 | 1 |
| 4 | -1.514100 | -3.060547 | -4.077247 | -5.819707 | -4.468868 | 5 | 0 | 3 |
We see that we generated a DataFrame with 8 columns of which the last three are categoricals.
Treatment assignments¶
In this example we will replicate the setup of an RCT, i.e. where the
treatment assignments are independent of the covariates. We rely on
generate_treatment.
import numpy as np
from metalearners.data_generation import generate_treatment
# We use a fair coin flip as a reference.
propensity_scores = .5 * np.ones(1000)
treatment = generate_treatment(propensity_scores)
type(treatment), np.unique(treatment), treatment.mean()
(numpy.ndarray, array([0, 1]), 0.514)
As we would expect, an array of binary assignments is generated. The average approximately corresponds to the universal propensity score of $.5$.
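To make the RCT setup concrete, here is a minimal sketch of how such assignments could be drawn with plain NumPy: one independent Bernoulli draw per observation, using the propensity score as the success probability. The helper `sample_treatment` is hypothetical and only illustrates the idea; it is not claimed to be how `generate_treatment` is implemented.

```python
import numpy as np


def sample_treatment(propensity_scores, rng):
    """Hypothetical helper: draw one Bernoulli assignment per observation."""
    # A binomial with n=1 is a Bernoulli draw; p is broadcast elementwise.
    return rng.binomial(n=1, p=propensity_scores)


rng = np.random.default_rng(seed=0)

# Constant propensity of 0.5: assignment does not depend on covariates.
propensity_scores = 0.5 * np.ones(1000)
treatment_sketch = sample_treatment(propensity_scores, rng)

# Share of treated units; should be close to the propensity score.
treated_share = treatment_sketch.mean()
```

Because the propensity scores are passed as an array, the same sketch would also cover an observational setup in which each unit has its own assignment probability.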
Potential outcomes¶
In this example we will rely on linear_treatment_effect, which
generates additive treatment effects which are linear in the features.
Note that there are other potential outcome functions available.
from metalearners._utils import get_linear_dimension
from metalearners.outcome_functions import linear_treatment_effect
dim = get_linear_dimension(features)
outcome_function = linear_treatment_effect(dim)
potential_outcomes = outcome_function(features)
potential_outcomes
array([[-4.6390948 , -6.99101697],
[-4.5927874 , -1.43775422],
[-5.6179741 , -3.62754599],
...,
[-5.81369594, -2.16523526],
[ 0.89106589, 0.44998321],
[-6.62191898, -7.66198481]])
We see that it generates one column with the potential outcome $Y(0)$ and one column with the potential outcome $Y(1)$. The individual treatment effect is their difference, $Y(1) - Y(0)$.
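The subtraction can be sketched on a small hand-written array. The values below are toy numbers chosen for illustration, not output of the library; the only assumption carried over from the text is the column layout, with $Y(0)$ in column 0 and $Y(1)$ in column 1.

```python
import numpy as np

# Toy potential outcomes: column 0 is Y(0), column 1 is Y(1).
toy_potential_outcomes = np.array(
    [
        [1.0, 3.0],
        [2.0, 2.5],
        [0.0, -1.0],
    ]
)

# Individual treatment effect: Y(1) - Y(0), one value per row.
ite = toy_potential_outcomes[:, 1] - toy_potential_outcomes[:, 0]

# Averaging the individual effects gives the true ATE of this toy sample.
ate = ite.mean()
```

The same two lines apply unchanged to the `potential_outcomes` array generated above, provided it has the two-column layout shown.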
Observed outcomes¶
Lastly, we can combine the treatment assignments and potential
outcomes to generate the observed outcomes. Note that there might be
noise which distinguishes the potential outcome from the observed
outcome. For that purpose we can use compute_experiment_outputs and run
from metalearners.data_generation import compute_experiment_outputs
observed_outcomes, true_cate = compute_experiment_outputs(
potential_outcomes,
treatment,
)
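For intuition, the relationship between potential and observed outcomes can be sketched by hand as $Y = W \cdot Y(1) + (1 - W) \cdot Y(0) + \varepsilon$. The snippet below is a self-contained illustration on toy data and makes no claim about the internals of `compute_experiment_outputs`; the noise scale, the constant effect of 2.0 and all variable names are assumptions made for the example. It also shows how the simulated ground truth can be used to check an estimator, here a simple difference in means.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_obs = 10_000

# Toy potential outcomes with a constant treatment effect of 2.0 (assumed).
y0 = rng.normal(size=n_obs)
y1 = y0 + 2.0

# RCT-style assignment with propensity 0.5.
w = rng.binomial(n=1, p=0.5, size=n_obs)

# Observed outcome: the potential outcome of the assigned arm plus noise.
noise = rng.normal(scale=0.1, size=n_obs)
observed = np.where(w == 1, y1, y0) + noise

# Ground truth, available only because the data are simulated.
true_ate = (y1 - y0).mean()

# Difference-in-means estimate, computed from observed data only.
estimated_ate = observed[w == 1].mean() - observed[w == 0].mean()
abs_error = abs(estimated_ate - true_ate)
```

In a randomized setup like this one, the difference in means should land close to the true ATE; with real data, `true_ate` would be unavailable and no such comparison could be made.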