metalearners.data_generation module

metalearners.data_generation.generate_categoricals(n_obs, n_features, n_categories=None, n_uniform=None, p_binomial=0.5, use_strings=False, rng=None)

Generate a dataset of categorical features.

Generates a dataset of n_obs observations and n_features categorical features. The first n_uniform features are sampled uniformly across their categories and the rest are sampled from a binomial distribution with parameters \(n = c_i\) and \(p=\) p_binomial where \(c_i\) is the number of categories of feature \(i\).

n_categories is the number of categories of the features. It can be either an int, which is used for all the features, or an array of length n_features. If None, the number of categories for each feature is sampled from \(c_i \sim \mathcal{U}\{2,3,\dots,10\}\).

If n_uniform is None, all features are sampled uniformly.

use_strings can be set to True if the desired representation of the variables is strings. If set to False, the function returns an array with dtype np.int64.

The function returns a np.ndarray with the sampled dataset and a np.ndarray with the number of categories for each feature.
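The sampling scheme described above can be sketched with numpy alone. This is an illustrative re-implementation, not the library's code: the name sketch_categoricals is hypothetical, and the binomial draw uses \(c_i - 1\) trials so that the category codes stay in \(\{0, \dots, c_i - 1\}\). That choice is an assumption of the sketch; the text above states \(n = c_i\).

```python
import numpy as np


def sketch_categoricals(n_obs, n_categories, n_uniform, p_binomial=0.5, rng=None):
    """Hypothetical helper mimicking the described sampling, not the library's code."""
    rng = np.random.default_rng() if rng is None else rng
    n_categories = np.asarray(n_categories)
    n_features = len(n_categories)
    X = np.empty((n_obs, n_features), dtype=np.int64)
    for i, c in enumerate(n_categories):
        if i < n_uniform:
            # The first n_uniform features are uniform over the codes {0, ..., c - 1}.
            X[:, i] = rng.integers(0, c, size=n_obs)
        else:
            # Assumption of this sketch: c - 1 trials keep the codes in {0, ..., c - 1}.
            X[:, i] = rng.binomial(c - 1, p_binomial, size=n_obs)
    return X, n_categories


rng = np.random.default_rng(0)
X, n_categories = sketch_categoricals(1000, [3, 5, 4], n_uniform=2, rng=rng)
```

Here the first two columns are uniform over 3 and 5 categories, and the third column concentrates around its middle categories because of the binomial draw.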

Parameters:

n_obs (int)
n_features (int)
n_categories (int | ndarray | None)
n_uniform (int | None)
p_binomial (float)
use_strings (bool)
rng (Generator | None)

Return type:

tuple[ndarray, ndarray]

metalearners.data_generation.generate_numericals(n_obs, n_features, mu=None, wishart_scale=1, rng=None)

Generate a dataset of numerical features.

Generates a dataset of n_obs observations and n_features numerical features. These are sampled from \(\mathcal{N}(\mu, \Sigma)\), where \(\mu \sim \mathcal{U}[-5,5]\) unless specified in mu, and \(\Sigma \sim \mathcal{W}(d, \sigma_w I_d)\), where \(\mathcal{W}\) is the Wishart distribution, \(\sigma_w\) is wishart_scale and \(d\) is the number of features.

mu can be either a float or an array of length n_features.

wishart_scale should be \(\geq 0\). If it is 0, then \(\Sigma = I_d\).
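A minimal numpy sketch of this sampling scheme follows. It is not the library's implementation: the name sketch_numericals is hypothetical, and the Wishart draw relies on the standard construction that \(G G^\top \sim \mathcal{W}(d, \sigma_w I_d)\) when \(G\) has \(d\) independent columns drawn from \(\mathcal{N}(0, \sigma_w I_d)\).

```python
import numpy as np


def sketch_numericals(n_obs, n_features, wishart_scale=1.0, rng=None):
    """Hypothetical helper mimicking the described sampling, not the library's code."""
    rng = np.random.default_rng() if rng is None else rng
    d = n_features
    # Mean vector drawn uniformly from [-5, 5].
    mu = rng.uniform(-5, 5, size=d)
    if wishart_scale == 0:
        # Special case stated in the text: Sigma is the identity.
        sigma = np.eye(d)
    else:
        # d independent columns from N(0, wishart_scale * I_d);
        # their Gram matrix is a Wishart draw with d degrees of freedom.
        G = rng.normal(scale=np.sqrt(wishart_scale), size=(d, d))
        sigma = G @ G.T
    X = rng.multivariate_normal(mu, sigma, size=n_obs)
    return X, mu, sigma


rng = np.random.default_rng(0)
X, mu, sigma = sketch_numericals(500, 3, wishart_scale=1.0, rng=rng)
```

Unlike the documented function, the sketch also returns mu and sigma so that the sampled parameters can be inspected.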

Parameters:

n_obs (int)
n_features (int)
mu (float | ndarray | None)
wishart_scale (float)
rng (Generator | None)

Return type:

ndarray

metalearners.data_generation.generate_covariates(n_obs, n_features, n_categoricals=0, format='pandas', mu=None, wishart_scale=1, n_categories=None, n_uniform=None, p_binomial=0.5, use_strings=False, rng=None)

Generates a dataset of covariates with both numerical and categorical features.

Dataset is composed of n_obs observations and n_features features, with the first n_features - n_categoricals being numerical and the rest being categorical. Numerical features are generated using the function metalearners.data_generation.generate_numericals() and categorical features are generated using the function metalearners.data_generation.generate_categoricals().

By default, the generated dataset is returned as a pandas DataFrame where categorical features are converted to pandas' Categorical type. Setting format = "numpy" returns a numpy array with dtype float64 instead. When generating categorical variables, working with pandas DataFrames is preferred, as they support the category dtype.

For mu and wishart_scale see the docstring for metalearners.data_generation.generate_numericals().

For n_categories, n_uniform, p_binomial and use_strings see the docstring for metalearners.data_generation.generate_categoricals().

use_strings can only be set to True when using format = "pandas".

The function returns a tuple of three elements. The first element is the dataset generated (either a numpy array or a pandas DataFrame depending on format). The second element is a list of indices indicating the columns of categorical features in the dataset. The third element is a np.ndarray with the number of categories for each feature.
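The column layout described above can be illustrated for the numpy format as follows. This is a sketch under simplifying assumptions, not the library's code: the numerical and categorical blocks are drawn directly with numpy rather than through generate_numericals() and generate_categoricals(), and every categorical feature is given 3 categories purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_features, n_categoricals = 100, 5, 2

# Stand-ins for the numerical and categorical blocks (illustrative draws only).
X_num = rng.normal(size=(n_obs, n_features - n_categoricals))
X_cat = rng.integers(0, 3, size=(n_obs, n_categoricals))

# Numerical columns come first and categorical columns last,
# all stored in a single float64 array.
X = np.hstack([X_num, X_cat]).astype(np.float64)

# Indices of the categorical columns: the last n_categoricals positions.
categorical_indices = list(range(n_features - n_categoricals, n_features))
```

Because the numpy array has a single dtype, the categorical codes become floats, which is one reason the pandas format is preferable when categorical features are present.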

Parameters:

n_obs (int)
n_features (int)
n_categoricals (int)
format (Literal['pandas', 'numpy'])
mu (float | ndarray | None)
wishart_scale (float)
n_categories (int | ndarray | None)
n_uniform (int | None)
p_binomial (float)
use_strings (bool)
rng (Generator | None)

Return type:

tuple[DataFrame | ndarray, list[int], ndarray]

metalearners.data_generation.insert_missing(X, missing_probability=0.1, rng=None)

Inserts missing values into the dataset.

Each element of the dataset has a missing_probability chance of being replaced with a NaN, thus simulating a dataset with missing values.

The function returns a copy of the original dataset, but with some elements replaced by NaNs.
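For a numpy input, the behaviour described above amounts to an independent Bernoulli mask over the elements. The following is a minimal sketch, not the library's implementation; the name sketch_insert_missing is hypothetical and the DataFrame case is not covered.

```python
import numpy as np


def sketch_insert_missing(X, missing_probability=0.1, rng=None):
    """Hypothetical helper mimicking the described behaviour for numpy arrays."""
    rng = np.random.default_rng() if rng is None else rng
    # Each element is selected independently with probability missing_probability.
    mask = rng.random(X.shape) < missing_probability
    # astype returns a new float array, so the input is left untouched.
    X_missing = X.astype(float)
    X_missing[mask] = np.nan
    return X_missing


rng = np.random.default_rng(0)
X = np.arange(20, dtype=float).reshape(5, 4)
X_missing = sketch_insert_missing(X, missing_probability=0.3, rng=rng)
```

The original array X is unchanged after the call, and X_missing has the same shape with roughly 30% of its entries set to NaN.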

Parameters:

X (DataFrame | ndarray)
missing_probability (float)
rng (Generator | None)

Return type:

DataFrame | ndarray

metalearners.data_generation.generate_treatment(propensity_scores, rng=None)

Generates a treatment assignment based on the provided propensity scores.

The function first determines the number of treatment variants based on the shape of the input propensity scores. If the propensity score array has a single dimension or only one column in the second dimension, there are two treatment variants (treated vs not-treated), and the value is interpreted as the treatment probability. Otherwise, the second dimension of the propensity scores array indicates the number of treatment variants.

Each observation is assigned to a treatment group by drawing from a categorical distribution where the probability of each treatment group is given by the propensity scores.

propensity_scores should be of size (n_obs,) or (n_obs, n_variants), where n_obs is the number of observations and n_variants is the number of treatment variants.

The function returns an array of shape (n_obs,) where each element indicates the treatment variant received.
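The assignment logic described above can be sketched with inverse-CDF sampling of a categorical distribution per observation. This is an illustrative numpy version, not the library's code; the name sketch_treatment is hypothetical, and the sampling method is a choice made for this sketch.

```python
import numpy as np


def sketch_treatment(propensity_scores, rng=None):
    """Hypothetical helper mimicking the described assignment, not the library's code."""
    rng = np.random.default_rng() if rng is None else rng
    p = np.asarray(propensity_scores, dtype=float)
    if p.ndim == 1 or p.shape[1] == 1:
        # Binary case: the value is the probability of being treated.
        p = p.reshape(-1)
        p = np.column_stack([1 - p, p])
    # Inverse-CDF sampling: count how many cumulative probabilities u has reached.
    cdf = np.cumsum(p, axis=1)
    u = rng.random((p.shape[0], 1))
    w = (u >= cdf).sum(axis=1)
    # Guard against rounding in the cumulative sum pushing an index out of range.
    return np.minimum(w, p.shape[1] - 1)


rng = np.random.default_rng(0)
binary = sketch_treatment(np.full(1000, 0.3), rng=rng)
multi = sketch_treatment(np.tile([0.2, 0.5, 0.3], (1000, 1)), rng=rng)
```

In this example binary contains only the values 0 and 1, with about 30% ones, while multi contains the values 0, 1 and 2.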

Parameters:

propensity_scores (ndarray)
rng (Generator | None)

Return type:

ndarray

metalearners.data_generation.compute_experiment_outputs(mu, treatment, sigma_y=1, sigma_tau=0.5, n_variants=None, is_classification=False, positive_proportion=0.5, return_probability_cate=False, rng=None)

Compute the experiment’s observed outcomes y and the true CATE.

This function generates the experiment outputs and the true CATE values from the given potential outcomes and treatment assignments. The treatment effect for each observation is computed as the difference in potential outcomes. To simulate real-world variance, normally distributed noise with standard deviation sigma_y is added to the response variable \(Y_i(0)\), and normally distributed noise with standard deviation sigma_tau is added to each corresponding treatment effect.

treatment must be a vector representing the treatment group assignment for each observation. Each element of the vector is an integer representing a treatment variant starting at 0.

mu must be a matrix of size (n_obs, n_variants) containing the potential outcomes for each observation and treatment variant without added noise.

n_variants can be passed to specify the number of treatment variants. If None, it is inferred as the maximum value in the treatment vector plus one.

is_classification determines whether a classification problem is simulated. If True, the response variable is binary and the proportion of positive outcomes is controlled by the positive_proportion parameter. The potential outcomes are passed through a sigmoid function, so their values may range over all of \(\mathbb{R}\). Classification problems are only implemented for binary treatments.

For a classification problem, return_probability_cate specifies whether the returned CATE is the difference in probabilities between treating and not treating, or the difference between samples drawn from the corresponding Bernoulli distributions.

The function returns a tuple containing the following elements:

y: numpy array of the experiment’s observed outcomes (response variable) after noise addition.
true_cate: numpy array of the true CATE without any added noise.
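For the regression case (is_classification=False), one reading of the description above can be sketched as follows. This is not the library's implementation: the name sketch_outputs is hypothetical, the number of variants is taken from the shape of mu, and the layout of true_cate as one column per non-control variant is an assumption of the sketch.

```python
import numpy as np


def sketch_outputs(mu, treatment, sigma_y=1.0, sigma_tau=0.5, rng=None):
    """Hypothetical helper for the regression case, not the library's code."""
    rng = np.random.default_rng() if rng is None else rng
    mu = np.asarray(mu, dtype=float)
    treatment = np.asarray(treatment)
    n_obs = mu.shape[0]
    # True CATE of each variant k >= 1 relative to variant 0 (assumed layout).
    true_cate = mu[:, 1:] - mu[:, [0]]
    # Noise with standard deviation sigma_y on the control outcome Y_i(0).
    y0 = mu[:, 0] + rng.normal(scale=sigma_y, size=n_obs)
    # Noise with standard deviation sigma_tau on every treatment effect.
    noisy_effects = true_cate + rng.normal(scale=sigma_tau, size=true_cate.shape)
    # Prepend a zero column so that variant 0 adds no effect.
    effects = np.hstack([np.zeros((n_obs, 1)), noisy_effects])
    y = y0 + effects[np.arange(n_obs), treatment]
    return y, true_cate


rng = np.random.default_rng(0)
mu = np.array([[1.0, 2.0, 4.0], [0.0, 0.5, -1.0], [3.0, 3.0, 3.0]])
treatment = np.array([0, 2, 1])
y, true_cate = sketch_outputs(mu, treatment, rng=rng)
```

With both noise levels set to 0, y reduces to the potential outcome of the assigned variant, which gives a quick sanity check of the indexing.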

Parameters:

mu (ndarray)
treatment (Series | ndarray)
sigma_y (float)
sigma_tau (float)
n_variants (int | None)
is_classification (bool)
positive_proportion (float)
return_probability_cate (bool)
rng (Generator | None)

Return type:

tuple[ndarray, ndarray]