Motivation
Why CATE estimation?
Please see the section "How can CATEs be useful?".
Why MetaLearners?
There are various ways for estimating CATEs, such as Targeted Maximum Likelihood Estimation, Causal Forests or MetaLearners.
We’ve found MetaLearners to be a particularly compelling approach for CATE estimation because:
- They are conceptually simple
- Some of them come with strong statistical guarantees, see e.g. Nie et al. (2019) for the R-Learner or Kennedy (2023) for the DR-Learner
- They rely on existing, arbitrary prediction approaches
The latter point is particularly important since it implies that battle-tested and production-grade code from existing prediction libraries such as scikit-learn, lightgbm or xgboost can be reused. Given that the field of CATE estimation is still young and engineering efforts limited, this is a highly relevant factor.
Why not causalml or econml?
causalml and econml are open-source Python libraries providing, among other things, implementations of many MetaLearners.
What we’ve come to like about the design of both is that:
- Their MetaLearner implementations mostly follow the interface one might expect from an sklearn Estimator
- They are, in the intended use cases, fairly straightforward and intuitive to use
Yet, we’ve also found that in some regards, the MetaLearner implementations from causalml and econml don’t perfectly lend themselves to use cases we care about.
Accessing base models
While MetaLearners are, in principle, designed in a very modular fashion, we’ve struggled to access individual base models in a meaningful way.
One reason to access the base models is to evaluate their individual performance. Due to the fundamental problem of causal inference, we cannot evaluate a MetaLearner with a simple metric measuring the mismatch between estimate and ground truth. We might, however, want to do exactly that for our base learners, which often do have ground truth labels to compare the estimates to. This is not supported by econml and causalml.
In the illustration above, we indicate that we’d like to access, predict with, and evaluate a propensity model – one base model of the MetaLearner at hand – in isolation.
See, for instance, econml issue 619.
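To make the idea concrete, here is a minimal sketch of what evaluating a propensity model in isolation could look like. It uses plain scikit-learn on synthetic data rather than any MetaLearner library; the data-generating process and all variable names are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): treatment assignment
# depends on the first covariate only.
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))
w = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))

X_train, X_test, w_train, w_test = train_test_split(X, w, random_state=0)

# The propensity model is an ordinary classifier with ground truth labels
# (the observed treatment assignments), so standard metrics apply.
propensity_model = LogisticRegression().fit(X_train, w_train)
propensity_estimates = propensity_model.predict_proba(X_test)[:, 1]

model_loss = log_loss(w_test, propensity_estimates)
# Reference point: always predicting the treatment share of the training data.
baseline_loss = log_loss(w_test, np.full(len(w_test), w_train.mean()))

print(f"log loss: model={model_loss:.3f}, constant baseline={baseline_loss:.3f}")
```

The point of the sketch is only that such a comparison is trivial once the base model is reachable as a regular estimator object.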
Reusing trained base models
Given MetaLearners’ modular design, it should, in principle, be simple to not only train all base estimators of a MetaLearner ‘together’ but to reuse already trained base models.
We envision two concrete use cases where this might be relevant in that it would save considerable resources:
- When tuning hyperparameters of a given MetaLearner architecture (e.g., an R-Learner) on a given dataset, one might, for instance, want to tune the hyperparameters of an outcome model in light of the behavior of the overall MetaLearner. In such a scenario, it is redundant to retrain a propensity model for every single outcome model hyperparameter constellation. Instead, one might want to reuse and plug in an already trained propensity model.
- When training several MetaLearner architectures on the same dataset, some base models might be part of the design of several of these MetaLearner architectures. An example of this could be an outcome model, used in both the R-Learner and DR-Learner. In such a scenario, it seems desirable to reuse the conceptually equivalent outcome model instead of training it several times.
The illustration above indicates the intention to reuse an already trained base estimator as part of a MetaLearner.
See econml issue 646 for reference. The causalml documentation provides no officially supported way of passing in pre-trained models. Note that the specified models are first copied and then fit from scratch.
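The first use case can be sketched by hand with scikit-learn. The helper `fit_dr_learner` below is hypothetical and not part of econml or causalml; it builds a simplified DR-Learner around a propensity model that was trained once up front. For brevity, the sketch fits and predicts on the same data, omitting the cross-fitting one would want in practice, and the synthetic data is an assumption.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))
w = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
# The true CATE is 1 + X[:, 2], so the true average effect is 1.
y = X[:, 1] + w * (1.0 + X[:, 2]) + rng.normal(scale=0.1, size=n)

# The propensity model is trained exactly once ...
propensity_model = LogisticRegression().fit(X, w)


def fit_dr_learner(outcome_params, prefit_propensity_model):
    """Hypothetical helper: fit a DR-Learner around a pre-trained propensity model."""
    # Reuse the trained model for prediction only; clip to avoid extreme weights.
    e = np.clip(prefit_propensity_model.predict_proba(X)[:, 1], 0.01, 0.99)
    # Only the outcome models are retrained for each hyperparameter setting.
    mu0 = GradientBoostingRegressor(**outcome_params).fit(X[w == 0], y[w == 0]).predict(X)
    mu1 = GradientBoostingRegressor(**outcome_params).fit(X[w == 1], y[w == 1]).predict(X)
    # Doubly robust pseudo-outcomes, regressed on X by the treatment model.
    pseudo_outcomes = mu1 - mu0 + w * (y - mu1) / e - (1 - w) * (y - mu0) / (1 - e)
    return LinearRegression().fit(X, pseudo_outcomes)


# ... and plugged into every outcome model hyperparameter constellation.
cate_models = {
    max_depth: fit_dr_learner(
        {"max_depth": max_depth, "n_estimators": 50, "random_state": 0},
        propensity_model,
    )
    for max_depth in (1, 2, 3)
}
average_effects = {depth: model.predict(X).mean() for depth, model in cate_models.items()}
```

In this sketch, three outcome model configurations are compared while the propensity model is fit a single time instead of three times.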
Working with pandas DataFrames
Many standard estimation libraries, such as sklearn or lightgbm, accept pandas DataFrames as well as numpy ndarrays as input, and sometimes even generic interfaces such as the Array API standard. Importantly, a user would not only expect those to be accepted but also to be treated in a way that corresponds to their semantics.
Since the operational essence of MetaLearners is merely distributing the right data (e.g., covariates and outcomes indexed on treated observations) from the right source (e.g., a base estimator or a raw input) to the right sink (e.g., a base estimator or final output), we would expect that anything the base model of choice can support should also be supported by a MetaLearner library.
Since we are concerned with tabular data, support for pandas DataFrames is of particular importance. Now, in most cases, econml and causalml accept DataFrames; in many, they also work as intended with them. Yet, under the hood, econml and causalml transform every data structure to numpy (see this causalml snippet and this econml snippet). Concretely, when using pandas's category dtype together with lightgbm base models, this leads to errors with non-integer categoricals and silent errors with integer categoricals, even though lightgbm can handle the category dtype just fine. See this notebook for an illustration.
An important illustration of the usefulness of categorical data types is working with discrete, yet more than binary variants. Here, econml, for instance, internally encodes these variants with one-hot encoding. This encoding is not easily undone by the user, and therefore, results can be cumbersome to interpret.
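The effect of converting a DataFrame to numpy can be seen without any MetaLearner library at all. The following sketch uses only pandas and numpy; the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "age": [23, 35, 41],
        "city": pd.Categorical(["Berlin", "Paris", "Berlin"]),  # non-integer categorical
        "variant": pd.Categorical([0, 2, 1]),  # integer categorical
    }
)

# In the DataFrame, the categorical semantics are explicit.
assert isinstance(df["city"].dtype, pd.CategoricalDtype)
assert isinstance(df["variant"].dtype, pd.CategoricalDtype)

# After conversion to numpy, the category dtype is gone.
whole_frame_as_numpy = df.to_numpy()
integer_column_as_numpy = np.asarray(df["variant"])

# Mixed columns collapse into a generic object array ...
print(whole_frame_as_numpy.dtype)
# ... and an integer categorical becomes indistinguishable from a numeric
# column, so a downstream model can silently treat it as ordinal.
print(integer_column_as_numpy.dtype)
```

A base model that receives only the converted array has no way of knowing that `variant` was meant to be categorical, which is the silent failure mode described above.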
Using different covariate sets for different base learners
Most base learners in a MetaLearner expect some covariate matrix X. Conceptually, we need to make sure that this X satisfies our fundamental assumptions of positivity, unconfoundedness, and stable unit treatment value. Yet, if we know of certain (conditional) independences, we might not always require this entire covariate matrix for each base learner. Conversely, offering a base learner more features than we know are relevant might make the learning process more fragile to noise and prone to overfitting.
In the following illustration, we indicate that we have a column-wise partitioning of X into X1 and X2. One base estimator relies on X1 only, one on X2 only, and one on X, i.e., X1 and X2.
For this reason, we would want to be able to define which covariate set is used by which base learner. This is currently not supported by econml or causalml.
Multiprocessing training of base learners
Many MetaLearners come with two ‘stages’ of base models. The models of the first stage, nuisance models, are trained independently of each other. The models of the second stage, the treatment models, are trained independently of each other, too.
Clearly, this is a perfect setup for concurrent training of various models which are independent of each other – trading off space for time. Yet, neither causalml nor econml supports multiprocessing within a stage.
See, for instance, causalml issue 616.
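A minimal sketch of training the first stage concurrently is shown below. It uses joblib, which scikit-learn itself relies on, to fit three independent nuisance models in separate worker processes. The helper `fit_model`, the choice of models, and the synthetic data are assumptions for illustration, not how any particular library does it.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.base import clone
from sklearn.linear_model import LinearRegression, LogisticRegression


def fit_model(model, X, target):
    """Hypothetical helper: fit a fresh copy of a base model."""
    return clone(model).fit(X, target)


# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))
w = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
y = X[:, 1] + w * (1.0 + X[:, 2]) + rng.normal(scale=0.1, size=n)

# The first-stage models do not depend on each other ...
first_stage_jobs = {
    "propensity_model": (LogisticRegression(), X, w),
    "control_outcome_model": (LinearRegression(), X[w == 0], y[w == 0]),
    "treatment_outcome_model": (LinearRegression(), X[w == 1], y[w == 1]),
}

# ... so they can be fit in parallel; joblib returns results in input order.
fitted_models = Parallel(n_jobs=2)(
    delayed(fit_model)(*job) for job in first_stage_jobs.values()
)
nuisance_models = dict(zip(first_stage_jobs, fitted_models))
```

The same pattern would apply to the second stage once the first-stage predictions are available. With models this small, the process overhead outweighs the gain; the benefit shows with expensive base learners.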