Motivation
==========

Why CATE estimation?
--------------------

Please see the section :ref:`how-can-cates-be-useful`.

.. _motivation-why-metalearners:

Why MetaLearners?
-----------------

There are various ways of estimating CATEs, such as
`Targeted Maximum Likelihood Estimation `_,
`Causal Forests `_ or MetaLearners.
We've found MetaLearners to be a particularly compelling approach for CATE estimation because

- they are conceptually simple
- some of them come with strong statistical guarantees, see e.g.
  `Nie et al. (2019) `_ for the R-Learner or `Kennedy (2023) `_ for the DR-Learner
- they rely on existing, arbitrary prediction approaches

The last point is particularly important since it implies that battle-tested and
production-grade code from existing prediction libraries such as ``scikit-learn``,
``lightgbm`` or ``xgboost`` can be reused. Given that the field of CATE estimation is
still young and engineering efforts are limited, this is a highly relevant factor.

.. _motivation-why-not-causalml-or-econml:

Why not ``causalml`` or ``econml``?
-----------------------------------

`causalml `_ and `econml `_ are open-source Python libraries providing, among other
things, implementations of many MetaLearners. What we've come to like about the design
of both is that

* their MetaLearner implementations mostly follow the interface one might expect
  from an ``sklearn`` Estimator
* they are, in the intended use cases, fairly straightforward and intuitive to use

Yet, we've also found that in some regards, the MetaLearner implementations from
``causalml`` and ``econml`` don't perfectly lend themselves to use cases we care about.

Accessing base models
"""""""""""""""""""""

While MetaLearners are, in principle, designed in a very modular fashion,
we've struggled to access individual base models in a meaningful way.
One reason to access the base models is to evaluate their individual performance.
Due to the fundamental problem of Causal Inference, we are not able to evaluate a
MetaLearner based on a simple metric measuring the mismatch between estimate and
ground truth. We might, however, want to do this for our base learners, which often do
have ground truth labels to compare the estimates to. This is not supported by
``econml`` and ``causalml``.

.. image:: imgs/component_eval.drawio.svg
   :width: 400

In the illustration above, we indicate that we'd like to access, predict with and
evaluate a propensity model, one base model of the MetaLearner at hand, in
isolation.

See, for instance, `econml issue 619 `_.

Reusing trained base models
"""""""""""""""""""""""""""

Given MetaLearners' modular design, it should in principle be simple to not only train
all base estimators of a MetaLearner 'together' but also to reuse already trained base
models. We envision two concrete use cases where this would save considerable
resources:

* When tuning hyperparameters of a given MetaLearner architecture (e.g. an R-Learner)
  on a given dataset, one might, for instance, want to tune the hyperparameters of an
  outcome model in light of the behaviour of the overall MetaLearner. In such a
  scenario, it is redundant to retrain a propensity model for every single outcome
  model hyperparameter constellation. Instead, one might want to reuse and plug in an
  already trained propensity model.

* When training several MetaLearner architectures on the same dataset, some base
  models might be part of the design of several of these MetaLearner architectures. An
  example of this could be an outcome model, used in both the R-Learner and
  DR-Learner. In such a scenario, it seems desirable to us to reuse the conceptually
  equivalent outcome model instead of training it several times.

.. image:: imgs/component_reuse.drawio.svg
   :width: 400

The illustration above indicates the intention to reuse an already trained base
estimator as part of a MetaLearner.

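
The first use case above can be sketched in a few lines of plain Python. This is a
minimal illustration under stated assumptions, not the API of any of the libraries
discussed here: ``TreatmentShareModel``, ``ShrunkenMeanModel`` and ``residuals`` are
hypothetical toy stand-ins, the data is made up, and cross-fitting is ignored for
brevity. The propensity model is fit once and then plugged into the computation of
treatment and outcome residuals, of the kind an R-Learner works with, for every
outcome model hyperparameter candidate.

```python
class TreatmentShareModel:
    """Toy propensity model: predicts the observed share of treated units."""

    n_fits = 0  # counts how often any instance has been fit

    def fit(self, X, w):
        TreatmentShareModel.n_fits += 1
        self.share_ = sum(w) / len(w)
        return self

    def predict(self, X):
        return [self.share_ for _ in X]


class ShrunkenMeanModel:
    """Toy outcome model with a single hyperparameter, ``shrinkage``."""

    def __init__(self, shrinkage):
        self.shrinkage = shrinkage

    def fit(self, X, y):
        self.mean_ = (1 - self.shrinkage) * sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def residuals(propensity_model, outcome_model, X, w, y):
    """Return treatment and outcome residuals from already fitted models."""
    w_res = [w_i - e_i for w_i, e_i in zip(w, propensity_model.predict(X))]
    y_res = [y_i - m_i for y_i, m_i in zip(y, outcome_model.predict(X))]
    return w_res, y_res


X = [[0.0], [1.0], [2.0], [3.0]]  # made-up covariates
w = [0, 1, 0, 1]  # made-up treatment assignments
y = [1.0, 3.0, 2.0, 5.0]  # made-up outcomes

# Fit the propensity model once ...
propensity_model = TreatmentShareModel().fit(X, w)

# ... and reuse it for every outcome model hyperparameter candidate.
residuals_by_shrinkage = {}
for shrinkage in (0.0, 0.5, 1.0):
    outcome_model = ShrunkenMeanModel(shrinkage).fit(X, y)
    residuals_by_shrinkage[shrinkage] = residuals(
        propensity_model, outcome_model, X, w, y
    )
```
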
See `econml issue 646 `_ for reference. The `causalml documentation `_ provides no
officially supported way of passing in pre-trained models. Note that the specified
models are first `copied `_ and then `fit `_ from scratch.

Working with ``pandas`` DataFrames
""""""""""""""""""""""""""""""""""

Many standard estimation libraries, such as ``sklearn`` or ``lightgbm``, accept
``pandas`` ``DataFrame``\s as well as ``numpy`` ``ndarray``\s as input, and sometimes
even generic interfaces such as the `Array API standard `_. Importantly, a user would
not only expect those to be accepted, but also to be treated in a way that corresponds
to their semantics.

Since the operational essence of MetaLearners is merely distributing the right data
(e.g. covariates and outcomes indexed on treated observations) from the right source
(e.g. a base estimator or a raw input) to the right sink (e.g. a base estimator or
final output), we would expect that anything the base model of choice can support
should also be supported by a MetaLearner library.

Since we are concerned with tabular data, support for ``pandas`` ``DataFrame``\s is of
particular importance. Now, in most cases, ``econml`` and ``causalml`` accept
DataFrames; in many, they work as intended with them. Yet, under the hood, ``econml``
and ``causalml`` transform every data structure to ``numpy`` (see
`this causalml snippet `_ and `this econml snippet `_ ).

Concretely, this leads to errors with non-integer categoricals and silent errors with
integer categoricals when using ``pandas``\'s `category dtype `_ and ``lightgbm`` base
models, even though ``lightgbm`` itself can handle this dtype just fine. See
`this notebook `_ for an illustration.

An important illustration of the usefulness of categorical data types is working with
more than two discrete variants. Here, ``econml``, for instance, internally encodes
these variants with one-hot encoding.
This encoding is not easily undone by the user, and the results can therefore be
cumbersome to interpret.

Using different covariate sets for different base learners
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

Most base learners in a MetaLearner expect some covariate matrix ``X``. Conceptually,
we need to make sure that this ``X`` satisfies our fundamental assumptions of
positivity, unconfoundedness and stable unit treatment value. Yet, if we know of
certain (conditional) independences, we might not always require this entire covariate
matrix for each base learner. Moreover, offering a base learner more features than we
know are relevant might make the learning process more sensitive to noise and prone to
overfitting.

In the following illustration we indicate that we have a column-wise partitioning of
``X`` into ``X1`` and ``X2``. One base estimator relies on ``X1`` only, one on ``X2``
only and one on ``X``, i.e. ``X1`` and ``X2``.

.. image:: imgs/covariate_sets.drawio.svg
   :width: 400

For this reason, we would want to be able to define which covariate set is used by
which base learner. This is currently not supported by ``econml`` or ``causalml``.

.. _Motivation_multiprocessing:

Multiprocessing training of base learners
"""""""""""""""""""""""""""""""""""""""""

Many MetaLearners come with two 'stages' of base models. The models of the first
stage, the nuisance models, are trained independently of each other. The models of the
second stage, the treatment models, are trained independently of each other, too.
Clearly, this is a perfect setup for concurrent training of models which are
independent of each other, trading off space for time. Yet, neither ``causalml`` nor
``econml`` supports multiprocessing within a stage.

See, for instance, `causalml issue 616 `_.
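
To make the idea of training the independent models of one stage concurrently more
tangible, here is a minimal sketch using only the Python standard library. It is an
illustration under stated assumptions rather than the implementation of any library
mentioned above: ``MeanModel`` and ``fit_stage`` are hypothetical names, the data is
made up, and a thread pool is used by default so that the snippet runs anywhere.
Passing ``concurrent.futures.ProcessPoolExecutor`` as ``executor_cls`` would give
process-based parallelism instead, assuming the models are picklable and the call
happens under an ``if __name__ == "__main__":`` guard.

```python
from concurrent.futures import ThreadPoolExecutor


class MeanModel:
    """Toy stand-in for a base model: predicts the mean of the training targets."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def fit_stage(tasks, executor_cls=ThreadPoolExecutor, max_workers=None):
    """Fit all models of one stage concurrently.

    ``tasks`` maps a model name to a ``(model, X, y)`` tuple. The models must be
    independent of each other, as is the case within a stage.
    """
    with executor_cls(max_workers=max_workers) as executor:
        # Submit every fit before waiting on any result.
        futures = {
            name: executor.submit(model.fit, X, y)
            for name, (model, X, y) in tasks.items()
        }
        return {name: future.result() for name, future in futures.items()}


X = [[0.0], [1.0], [2.0], [3.0]]  # made-up covariates
w = [0, 1, 0, 1]  # made-up treatment assignments
y = [1.0, 3.0, 2.0, 5.0]  # made-up outcomes

# Split the data by treatment variant for the two outcome models.
X_control = [x_i for x_i, w_i in zip(X, w) if w_i == 0]
y_control = [y_i for y_i, w_i in zip(y, w) if w_i == 0]
X_treated = [x_i for x_i, w_i in zip(X, w) if w_i == 1]
y_treated = [y_i for y_i, w_i in zip(y, w) if w_i == 1]

# Hypothetical nuisance stage: two outcome models and one propensity model.
nuisance_tasks = {
    "outcome_control": (MeanModel(), X_control, y_control),
    "outcome_treated": (MeanModel(), X_treated, y_treated),
    "propensity": (MeanModel(), X, w),
}
fitted = fit_stage(nuisance_tasks)
```

The second stage could be handled by a second call to ``fit_stage`` once the first
stage has finished, since only models within a stage are independent of each other.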