{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(example-basic)=\n",
    "\n",
    "Example: Estimating CATEs with a MetaLearner\n",
    "==============================================\n",
    "\n",
    "Loading the data\n",
    "----------------\n",
    "\n",
    "First, we will load and prepare some data for this example. In this\n",
    "particular case we rely on the so-called mindset data set, available\n",
    "[here](https://github.com/matheusfacure/python-causality-handbook/blob/master/causal-inference-for-the-brave-and-true/data/learning_mindset.csv)\n",
    "under the MIT License. It stems from an experimental setup where\n",
    "\n",
    "* The outcome is the achievement of a student in scalar form, found\n",
    "  in the column ``\"achievement_score\"``.\n",
    "* The mindset intervention is a binary variable found in the column\n",
    "  ``\"intervention\"``.\n",
    "* Both numerical and categorical covariates/features are present."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from git_root import git_root\n",
    "\n",
    "df = pd.read_csv(git_root(\"data/learning_mindset.zip\"))\n",
    "outcome_column = \"achievement_score\"\n",
    "treatment_column = \"intervention\"\n",
    "feature_columns = [\n",
    "    column\n",
    "    for column in df.columns\n",
    "    if column not in [outcome_column, treatment_column]\n",
    "]\n",
    "categorical_feature_columns = [\n",
    "    \"ethnicity\",\n",
    "    \"gender\",\n",
    "    \"frst_in_family\",\n",
    "    \"school_urbanicity\",\n",
    "    \"schoolid\",\n",
    "]\n",
    "# Explicitly casting these features to the category dtype allows both\n",
    "# lightgbm and shap plots to\n",
    "# 1. Operate on features which are not of type int, bool or float\n",
    "# 2. Treat categoricals encoded as ints as categoricals rather than\n",
    "#    as ordinals/numericals.\n",
    "for categorical_feature_column in categorical_feature_columns:\n",
    "    df[categorical_feature_column] = df[categorical_feature_column].astype(\n",
    "        \"category\"\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using a first, simple MetaLearner\n",
    "---------------------------------\n",
    "\n",
    "Now that the data has been loaded, we can turn to using\n",
    "MetaLearners. Let's start with the\n",
    "{class}`~metalearners.TLearner`.\n",
    "Its documentation shows that only three initialization parameters\n",
    "are required if we do not want to reuse nuisance models: ``nuisance_model_factory``, ``is_classification`` and\n",
    "``n_variants``. Given that our outcome is a scalar, we set\n",
    "``is_classification=False`` and use a regressor as the\n",
    "``nuisance_model_factory``. In this case we arbitrarily choose a\n",
    "regressor from ``lightgbm``. Since we know that the intervention is\n",
    "binary, we set ``n_variants=2``. We additionally pass\n",
    "``nuisance_model_params={\"verbose\": -1}`` to silence ``lightgbm``'s logging."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "from metalearners import TLearner\n",
    "from lightgbm import LGBMRegressor\n",
    "\n",
    "tlearner = TLearner(\n",
    "    nuisance_model_factory=LGBMRegressor,\n",
    "    is_classification=False,\n",
    "    n_variants=2,\n",
    "    nuisance_model_params={\"verbose\": -1}\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once our T-Learner has been instantiated, we can use it\n",
    "in a fashion akin to scikit-learn's Estimator protocol. The subtle differences\n",
    "from the scikit-learn protocol are that\n",
    "\n",
    "* We need to specify the observed treatment assignment ``w`` in the call to the\n",
    "  ``fit`` method.\n",
    "* We need to specify whether we want in-sample or out-of-sample\n",
    "  CATE estimates in the {meth}`~metalearners.TLearner.predict` call via ``is_oos``. In the\n",
    "  case of in-sample predictions, the data passed to {meth}`~metalearners.TLearner.predict`\n",
    "  must be exactly the same as the data that was passed to {meth}`~metalearners.TLearner.fit`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "tlearner.fit(\n",
    "    X=df[feature_columns],\n",
    "    y=df[outcome_column],\n",
    "    w=df[treatment_column],\n",
    ")\n",
    "\n",
    "cate_estimates_tlearner = tlearner.predict(\n",
    "    X=df[feature_columns],\n",
    "    is_oos=False,\n",
    ")"
   ]
  },
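  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To build intuition for what these calls estimate, the following is a\n",
    "minimal sketch of the idea behind a T-Learner on synthetic data: fit one\n",
    "outcome model per treatment variant and take the difference of their\n",
    "predictions. It is not the implementation used by ``metalearners``; the\n",
    "synthetic data, the use of scikit-learn's ``LinearRegression`` and all\n",
    "variable names are assumptions made purely for illustration.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "from sklearn.linear_model import LinearRegression\n",
    "\n",
    "# Synthetic data (an assumption for illustration, not the mindset data).\n",
    "rng = np.random.default_rng(0)\n",
    "n_obs = 1_000\n",
    "X = rng.normal(size=(n_obs, 2))\n",
    "w = rng.integers(0, 2, size=n_obs)  # binary treatment assignment\n",
    "true_cate = 1.0 + 0.5 * X[:, 0]  # effect varies with the first feature\n",
    "y = X[:, 1] + w * true_cate + rng.normal(scale=0.1, size=n_obs)\n",
    "\n",
    "# Fit one outcome model per variant, each on its own subset of the data.\n",
    "model_control = LinearRegression().fit(X[w == 0], y[w == 0])\n",
    "model_treated = LinearRegression().fit(X[w == 1], y[w == 1])\n",
    "\n",
    "# The CATE estimate is the difference between both models' predictions.\n",
    "cate_sketch = model_treated.predict(X) - model_control.predict(X)\n",
    "\n",
    "print(cate_sketch.shape)\n",
    "print(np.abs(cate_sketch - true_cate).mean())\n",
    "```\n",
    "\n",
    "Because the synthetic outcome is linear in the features within each\n",
    "variant, linear models suffice here; for real data a more flexible\n",
    "regressor, such as the ``lightgbm`` one chosen above, may be more appropriate."
   ]
  },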
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that ``cate_estimates_tlearner`` is of shape\n",
    "{math}`(n_{obs}, n_{variants} - 1, n_{outputs})`. This is meant to\n",
    "cater to the general case, where there may be more than two variants and/or\n",
    "a classification outcome with several class probabilities. Given that we\n",
    "care about the simple case of regression with a binary variant, we can make use of\n",
    "{func}`~metalearners.utils.simplify_output` to simplify this shape as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "from metalearners.utils import simplify_output\n",
    "one_d_estimates = simplify_output(cate_estimates_tlearner)\n",
    "\n",
    "print(cate_estimates_tlearner.shape)\n",
    "print(one_d_estimates.shape)"
   ]
  },
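  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the case of regression with a binary variant, both trailing axes have\n",
    "length one, so a one-dimensional result amounts to dropping them. The\n",
    "sketch below illustrates this shape manipulation with plain ``numpy``; the\n",
    "array ``estimates`` and its values are placeholders made up for\n",
    "illustration, and the snippet does not call ``simplify_output`` itself.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "n_obs = 5\n",
    "# Placeholder of shape (n_obs, n_variants - 1, n_outputs)\n",
    "# with n_variants=2 and n_outputs=1.\n",
    "estimates = np.arange(n_obs, dtype=float).reshape(n_obs, 1, 1)\n",
    "\n",
    "# Dropping both axes of length one yields a one-dimensional array.\n",
    "flat_estimates = estimates[:, 0, 0]\n",
    "\n",
    "print(estimates.shape)\n",
    "print(flat_estimates.shape)\n",
    "```"
   ]
  },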
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using a MetaLearner with two stages\n",
    "-----------------------------------\n",
    "\n",
    "Instead of using a T-Learner, we can of course also use some other\n",
    "MetaLearner, such as the {class}`~metalearners.RLearner`.\n",
    "The R-Learner's documentation tells us that two more instantiation\n",
    "parameters are necessary: ``propensity_model_factory`` and\n",
    "``treatment_model_factory``. Hence we can instantiate an R-Learner as follows"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "from metalearners import RLearner\n",
    "from lightgbm import LGBMClassifier\n",
    "rlearner = RLearner(\n",
    "    nuisance_model_factory=LGBMRegressor,\n",
    "    propensity_model_factory=LGBMClassifier,\n",
    "    treatment_model_factory=LGBMRegressor,\n",
    "    is_classification=False,\n",
    "    n_variants=2,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "where we choose a classifier class to serve as a blueprint for our\n",
    "eventual propensity model. Note that although we consider the propensity\n",
    "model a nuisance model, its initialization parameters are kept separate from those of the other\n",
    "nuisance models to make the user interface easier to understand; see the next code cell.\n",
    "\n",
    "In general, when initializing a MetaLearner, the ``nuisance_model_factory`` parameter will\n",
    "be used to create all nuisance models other than the propensity model, the\n",
    "``propensity_model_factory`` will be used for the propensity model if the MetaLearner\n",
    "contains one, and the ``treatment_model_factory`` will be used for the models predicting\n",
    "the CATE. To see which models are present in each MetaLearner type, see\n",
    "{meth}`~metalearners.metalearner.MetaLearner.nuisance_model_specifications` and\n",
    "{meth}`~metalearners.metalearner.MetaLearner.treatment_model_specifications`.\n",
    "\n",
    "In the {class}`~metalearners.RLearner` case, the ``nuisance_model_factory`` parameter will\n",
    "be used to create the outcome model, the ``propensity_model_factory`` will be used for the\n",
    "propensity model and the ``treatment_model_factory`` will be used for the model predicting\n",
    "the CATE.\n",
    "\n",
    "If we want to make sure these models are initialized in a specific\n",
    "way, e.g. with a specific value for the hyperparameter ``n_estimators``, we can do that\n",
    "as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "rlearner = RLearner(\n",
    "    nuisance_model_factory=LGBMRegressor,\n",
    "    propensity_model_factory=LGBMClassifier,\n",
    "    treatment_model_factory=LGBMRegressor,\n",
    "    is_classification=False,\n",
    "    n_variants=2,\n",
    "    nuisance_model_params={\"n_estimators\": 10, \"verbose\": -1},\n",
    "    propensity_model_params={\"n_estimators\": 8, \"verbose\": -1},\n",
    "    treatment_model_params={\"n_estimators\": 3, \"verbose\": -1},\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The estimation steps look identical to those of the T-Learner:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "rlearner.fit(\n",
    "    X=df[feature_columns],\n",
    "    y=df[outcome_column],\n",
    "    w=df[treatment_column],\n",
    ")\n",
    "\n",
    "cate_estimates_rlearner = rlearner.predict(\n",
    "    X=df[feature_columns],\n",
    "    is_oos=False,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Comparing estimates\n",
    "-------------------\n",
    "\n",
    "We can now compare the CATE estimates produced by both MetaLearners in\n",
    "a histogram:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "vscode": {
     "languageId": "plaintext"
    }
   },
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "fig, ax = plt.subplots()\n",
    "\n",
    "ax.hist(simplify_output(cate_estimates_tlearner), density=True, alpha=0.5, label=\"T-Learner\")\n",
    "ax.hist(simplify_output(cate_estimates_rlearner), density=True, alpha=0.5, label=\"R-Learner\")\n",
    "ax.legend()\n",
    "ax.set_xlabel(\"CATE estimate\")\n",
    "ax.set_ylabel(\"density\")\n",
    "plt.show()"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}