Example: Generating data

Motivation

Given the fundamental problem of causal inference (we never observe both potential outcomes for the same unit), simulating data is particularly useful when working with ATE or CATE estimation: it provides a ground truth that real-world data cannot.

For instance, when generating data, we can have access to the Individual Treatment Effect and use that ground truth to evaluate a treatment effect method at hand.
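As a minimal sketch of such an evaluation, assume we score an estimator by the root mean squared error between the true ITEs and its CATE estimates. The metric, the helper name `rmse` and the toy arrays are illustrative assumptions; none of them are part of metalearners.

```python
import numpy as np


def rmse(true_ite, estimated_cate):
    """Root mean squared error between true ITEs and CATE estimates."""
    true_ite = np.asarray(true_ite, dtype=float)
    estimated_cate = np.asarray(estimated_cate, dtype=float)
    return float(np.sqrt(np.mean((true_ite - estimated_cate) ** 2)))


# Toy values standing in for a simulated ground truth and a model's estimates.
true_ite = np.array([1.0, 2.0, 3.0])
estimated_cate = np.array([1.0, 2.0, 5.0])
error = rmse(true_ite, estimated_cate)
```

With real-world data, `true_ite` would be unavailable, which is why this kind of check is only possible on simulated data.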

In the following example we describe how the modules metalearners.data_generation and metalearners.outcome_functions can be used to generate data for treatment effect estimation.

How-to

In the context of treatment effect estimation, our data usually consists of 3 ingredients:

  • Covariates

  • Treatment assignments

  • Observed outcomes

In this particular scenario of simulating data, we can add some quantities of interest which are not available in the real world:

  • Potential outcomes

  • True CATE or true ITE

Let’s generate those quantities one after another.

Covariates

Let’s start by generating covariates. We will use metalearners.data_generation.generate_covariates() for that purpose.

from metalearners.data_generation import generate_covariates

features, categorical_features_idx, n_categories = generate_covariates(
    n_obs=1000,
    n_features=8,
    n_categoricals=3,
    format="pandas",
)
features.head() # type: ignore
          0         1         2         3         4  5  6  7
0 -2.451565  2.189041 -0.887414  2.486362 -0.268975  4  2  1
1 -4.347556  3.189760 -2.135512  0.084371  3.955077  2  8  2
2 -1.050218  2.201926 -3.722261  2.383120 -0.563782  0  5  0
3 -4.050834  3.250730 -1.141644  1.626316  1.833348  1  4  2
4 -0.845611  1.882806  3.619741  2.216667  1.882438  2  6  0

We see that we generated a DataFrame with 8 columns, the last three of which are categorical.

Treatment assignments

In this example we will replicate the setup of a randomized controlled trial (RCT), i.e. a setting in which the treatment assignments are independent of the covariates. We rely on metalearners.data_generation.generate_treatment().

import numpy as np
from metalearners.data_generation import generate_treatment

# We use a fair coin flip as a reference.
propensity_scores = .5 * np.ones(1000)
treatment = generate_treatment(propensity_scores)
type(treatment), np.unique(treatment), treatment.mean()
(numpy.ndarray, array([0, 1]), 0.512)

As we would expect, an array of binary assignments is generated. Its mean is close to the constant propensity score of 0.5 shared by all units.
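To make the link between propensity scores and assignments concrete, here is a NumPy-only sketch that draws one Bernoulli sample per unit. It illustrates the idea and is not necessarily how generate_treatment() is implemented; the seed and the names `treatment_sketch` and `share_treated` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs = 1000

# Constant propensity score: every unit is treated with probability 0.5.
propensity_scores = 0.5 * np.ones(n_obs)

# One Bernoulli draw per unit, using that unit's propensity score.
treatment_sketch = rng.binomial(1, propensity_scores)

# The share of treated units should be close to 0.5.
share_treated = treatment_sketch.mean()
```

Because the propensity scores do not depend on any covariates, the resulting assignments mimic an RCT.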

Potential outcomes

In this example we will rely on metalearners.outcome_functions.linear_treatment_effect(), which generates additive treatment effects that are linear in the features. Note that other potential outcome functions are available as well.

from metalearners._utils import get_linear_dimension
from metalearners.outcome_functions import linear_treatment_effect

dim = get_linear_dimension(features)
outcome_function = linear_treatment_effect(dim)
potential_outcomes = outcome_function(features)
potential_outcomes
array([[ 0.64206009,  1.24024087],
       [ 0.56319116,  0.79794291],
       [ 0.45516047,  0.14003734],
       ...,
       [ 2.36455309,  5.05259348],
       [ 1.75308904,  3.76108235],
       [-2.79972096, -9.70375103]])

We see that it generates one column with the potential outcome \(Y(0)\) and one column with the potential outcome \(Y(1)\). The individual treatment effect is the difference between the two, \(Y(1) - Y(0)\).
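The subtraction can be sketched with plain NumPy. The array below copies the first three rows of the output shown above; the variable names `first_rows`, `true_ite` and `ate_first_rows` are illustrative assumptions.

```python
import numpy as np

# First three rows of the potential outcomes printed above:
# column 0 holds Y(0), column 1 holds Y(1).
first_rows = np.array(
    [
        [0.64206009, 1.24024087],
        [0.56319116, 0.79794291],
        [0.45516047, 0.14003734],
    ]
)

# Individual treatment effect: Y(1) - Y(0), one value per unit.
true_ite = first_rows[:, 1] - first_rows[:, 0]

# Averaging the ITEs gives the average treatment effect of these three units.
ate_first_rows = true_ite.mean()
```

Note that the third unit has a negative effect: treatment lowers its outcome.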

Observed outcomes

Lastly, we can combine the treatment assignments and potential outcomes to generate the observed outcomes. Note that the observed outcome may differ from the corresponding potential outcome by a noise term. For that purpose we can use metalearners.data_generation.compute_experiment_outputs():

from metalearners.data_generation import compute_experiment_outputs

observed_outcomes, true_cate = compute_experiment_outputs(
    potential_outcomes,
    treatment,
)
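Conceptually, each unit's observed outcome is the potential outcome of the treatment it actually received, plus noise. The NumPy-only sketch below illustrates that selection step. The Gaussian noise with scale 0.1, the seed and all variable names are assumptions for this example and may differ from what compute_experiment_outputs() does internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy potential outcomes: column 0 is Y(0), column 1 is Y(1).
toy_potential_outcomes = np.array([[0.0, 1.0], [2.0, 5.0], [-1.0, -3.0]])
toy_treatment = np.array([1, 0, 1])

# Pick, for every unit, the potential outcome of its assigned treatment.
row_idx = np.arange(len(toy_treatment))
noiseless_outcomes = toy_potential_outcomes[row_idx, toy_treatment]

# Add noise to obtain the observed outcomes (assumed Gaussian here).
toy_observed = noiseless_outcomes + rng.normal(scale=0.1, size=len(toy_treatment))
```

The other potential outcome of each unit is never observed, which is exactly the information that simulated data lets us keep.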