metalearners.data_generation module
- metalearners.data_generation.generate_categoricals(n_obs, n_features, n_categories=None, n_uniform=None, p_binomial=0.5, use_strings=False, rng=None)
Generate a dataset of categorical features.
Generates a dataset of n_obs observations and n_features categorical features. The first n_uniform features are sampled uniformly across their categories and the rest are sampled from a binomial distribution with parameters \(n = c_i\) and \(p =\) p_binomial, where \(c_i\) is the number of categories of feature \(i\).
n_categories is the number of categories per feature. It can be either an int, which is used for all features, or an array of length n_features. If None, the number of categories for each feature is sampled from \(c_i \sim \mathcal{U}\{2,3,\dots,10\}\).
If n_uniform is None, all features are sampled uniformly.
Set use_strings to True to represent the categories as strings. If set to False, the returned array has dtype np.int64.
The function returns a np.ndarray with the sampled dataset and a np.ndarray with the number of categories for each feature.
- Parameters:
n_obs (int)
n_features (int)
n_categories (int | ndarray | None)
n_uniform (int | None)
p_binomial (float)
use_strings (bool)
rng (Generator | None)
- Return type:
tuple[ndarray, ndarray]
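The sampling scheme described for generate_categoricals can be sketched with NumPy alone. This is an illustrative re-implementation, not the library's code: the name sketch_categoricals is hypothetical, use_strings is left out, and the binomial draw uses \(c_i - 1\) trials so that values stay within \(\{0, \dots, c_i - 1\}\), which is an assumption that differs from the \(n = c_i\) stated in the description.

```python
import numpy as np


def sketch_categoricals(n_obs, n_features, n_categories=None, n_uniform=None,
                        p_binomial=0.5, rng=None):
    """Hypothetical sketch of the described sampling; not the library's code."""
    rng = np.random.default_rng() if rng is None else rng

    if n_categories is None:
        # c_i ~ U{2, ..., 10}; the upper bound of integers() is exclusive.
        n_categories = rng.integers(2, 11, size=n_features)
    elif isinstance(n_categories, int):
        # A single int is used for every feature.
        n_categories = np.full(n_features, n_categories)
    else:
        n_categories = np.asarray(n_categories)

    if n_uniform is None:
        # No split requested: every feature is sampled uniformly.
        n_uniform = n_features

    X = np.empty((n_obs, n_features), dtype=np.int64)
    for i in range(n_features):
        c = int(n_categories[i])
        if i < n_uniform:
            X[:, i] = rng.integers(0, c, size=n_obs)
        else:
            # Assumption: c - 1 trials keep the values in {0, ..., c - 1}.
            X[:, i] = rng.binomial(c - 1, p_binomial, size=n_obs)
    return X, n_categories


X, n_categories = sketch_categoricals(
    1000, 4, n_categories=[2, 3, 5, 7], n_uniform=2,
    rng=np.random.default_rng(0),
)
```

Here the first two columns are uniform over 2 and 3 categories, while the last two follow the binomial scheme over 5 and 7 categories.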
- metalearners.data_generation.generate_numericals(n_obs, n_features, mu=None, wishart_scale=1, rng=None)
Generate a dataset of numerical features.
Generates a dataset of n_obs observations and n_features numerical features. These are sampled from \(\mathcal{N}(\mu, \Sigma)\), where \(\mu \sim \mathcal{U}[-5,5]\) unless specified in mu, and \(\Sigma \sim \mathcal{W}(d, \sigma_w I_d)\), where \(\mathcal{W}\) is the Wishart distribution, \(d\) is the number of features and \(\sigma_w\) is wishart_scale.
mu can be either a float or an array of length n_features.
wishart_scale must be \(\geq 0\); if it is 0, then \(\Sigma = I_d\).
- Parameters:
n_obs (int)
n_features (int)
mu (float | ndarray | None)
wishart_scale (float)
rng (Generator | None)
- Return type:
ndarray
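A minimal NumPy sketch of the sampling described for generate_numericals follows. It is not the library's implementation: sketch_numericals is a hypothetical name, and the code assumes that \(\mathcal{W}(d, \sigma_w I_d)\) denotes a Wishart distribution with \(d\) degrees of freedom and scale matrix \(\sigma_w I_d\). Under that assumption, a draw can be built as \(G^\top G\), where the \(d\) rows of \(G\) are independent \(\mathcal{N}(0, \sigma_w I_d)\) vectors.

```python
import numpy as np


def sketch_numericals(n_obs, n_features, mu=None, wishart_scale=1.0, rng=None):
    """Hypothetical sketch of the described sampling; not the library's code."""
    rng = np.random.default_rng() if rng is None else rng
    d = n_features

    if mu is None:
        # mu ~ U[-5, 5], drawn independently per feature.
        mu = rng.uniform(-5, 5, size=d)
    else:
        # Accept a float or an array of length d.
        mu = np.broadcast_to(np.asarray(mu, dtype=float), (d,))

    if wishart_scale == 0:
        sigma = np.eye(d)
    else:
        # Assumed parametrization: d degrees of freedom, scale wishart_scale * I_d.
        G = np.sqrt(wishart_scale) * rng.standard_normal((d, d))
        sigma = G.T @ G

    return rng.multivariate_normal(mu, sigma, size=n_obs)


# With wishart_scale=0 the covariance is the identity matrix.
X_num = sketch_numericals(500, 3, mu=2.0, wishart_scale=0,
                          rng=np.random.default_rng(0))
```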
- metalearners.data_generation.generate_covariates(n_obs, n_features, n_categoricals=0, format='pandas', mu=None, wishart_scale=1, n_categories=None, n_uniform=None, p_binomial=0.5, use_strings=False, rng=None)
Generates a dataset of covariates with both numerical and categorical features.
The dataset is composed of n_obs observations and n_features features, with the first n_features - n_categoricals being numerical and the rest being categorical. Numerical features are generated using the function metalearners.data_generation.generate_numericals() and categorical features are generated using the function metalearners.data_generation.generate_categoricals().
By default, the generated dataset is returned as a pandas DataFrame in which categorical features are converted to pandas' Categorical type. Optionally, the dataset can be returned as a numpy array with dtype float64 by setting format = "numpy". When generating categorical variables, pandas DataFrames are preferred because they support the category dtype.
For mu and wishart_scale, see the docstring of metalearners.data_generation.generate_numericals().
For n_categories, n_uniform, p_binomial and use_strings, see the docstring of metalearners.data_generation.generate_categoricals(). use_strings can only be set to True when using format = "pandas".
The function returns a tuple of three elements. The first element is the generated dataset (either a numpy array or a pandas DataFrame, depending on format). The second element is a list of indices indicating the columns of categorical features in the dataset. The third element is a np.ndarray with the number of categories for each feature.
- Parameters:
n_obs (int)
n_features (int)
n_categoricals (int)
format (Literal['pandas', 'numpy'])
mu (float | ndarray | None)
wishart_scale (float)
n_categories (int | ndarray | None)
n_uniform (int | None)
p_binomial (float)
use_strings (bool)
rng (Generator | None)
- Return type:
tuple[DataFrame | ndarray, list[int], ndarray]
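The column layout described for generate_covariates with format = "numpy" can be illustrated as follows. This is a sketch under simplifying assumptions, not the library's code: the numerical block uses independent standard normals instead of generate_numericals(), and the category counts are fixed by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_features, n_categoricals = 200, 5, 2
n_numericals = n_features - n_categoricals

# Stand-in for the numerical block (assumption: independent standard normals).
numericals = rng.normal(size=(n_obs, n_numericals))

# Stand-in for the categorical block, with 3 and 4 categories respectively.
n_categories = np.array([3, 4])
categoricals = np.column_stack(
    [rng.integers(0, c, size=n_obs) for c in n_categories]
)

# Numerical columns come first, categorical columns last; the numpy format
# stores everything as float64.
X_cov = np.hstack([numericals, categoricals]).astype(np.float64)

# Indices of the categorical columns in the combined dataset.
categorical_indices = list(range(n_numericals, n_features))
```

The triple (X_cov, categorical_indices, n_categories) mirrors the three returned elements described above.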
- metalearners.data_generation.insert_missing(X, missing_probability=0.1, rng=None)
Inserts missing values into the dataset.
Each element of the dataset has a missing_probability chance of being replaced with a NaN, thus simulating a dataset with missing values.
The function returns a copy of the original dataset, but with some elements replaced by NaNs.
- Parameters:
X (DataFrame | ndarray)
missing_probability (float)
rng (Generator | None)
- Return type:
DataFrame | ndarray
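The masking behavior described for insert_missing can be sketched for the ndarray case as below. The name sketch_insert_missing is hypothetical and the DataFrame case is not covered; the sketch assumes each element is masked independently.

```python
import numpy as np


def sketch_insert_missing(X, missing_probability=0.1, rng=None):
    """Hypothetical sketch for numpy input; not the library's code."""
    rng = np.random.default_rng() if rng is None else rng
    # np.array copies, so the caller's data is left untouched.
    X = np.array(X, dtype=float)
    # Each element is masked independently with the given probability.
    mask = rng.random(X.shape) < missing_probability
    X[mask] = np.nan
    return X


original = np.arange(20000, dtype=float).reshape(2000, 10)
with_missing = sketch_insert_missing(original, 0.1, rng=np.random.default_rng(0))
missing_fraction = np.isnan(with_missing).mean()
```

The observed missing_fraction should be close to missing_probability for a dataset of this size, while original keeps all its values.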
- metalearners.data_generation.generate_treatment(propensity_scores, rng=None)
Generates a treatment assignment based on the provided propensity scores.
The function first determines the number of treatment variants based on the shape of the input propensity scores. If the propensity score array has a single dimension or only one column in the second dimension, there are two treatment variants (treated vs not-treated), and the value is interpreted as the treatment probability. Otherwise, the second dimension of the propensity scores array indicates the number of treatment variants.
Each observation is assigned to a treatment group by drawing from a categorical distribution where the probability of each treatment group is given by the propensity scores.
propensity_scores should be of shape (n_obs,) or (n_obs, n_variants), where n_obs is the number of observations and n_variants is the number of treatment variants.
The function returns an array of shape (n_obs,) where each element indicates the treatment variant received.
- Parameters:
propensity_scores (ndarray)
rng (Generator | None)
- Return type:
ndarray
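The assignment rule described for generate_treatment can be sketched as follows. This is one possible way to draw from a per-row categorical distribution (inverse-CDF sampling), chosen for illustration; the name sketch_treatment is hypothetical and the library may sample differently.

```python
import numpy as np


def sketch_treatment(propensity_scores, rng=None):
    """Hypothetical sketch of the described assignment; not the library's code."""
    rng = np.random.default_rng() if rng is None else rng
    p = np.asarray(propensity_scores, dtype=float)

    if p.ndim == 1 or p.shape[1] == 1:
        # Binary case: the value is the probability of being treated.
        p = p.reshape(-1)
        p = np.column_stack([1 - p, p])

    # Inverse-CDF sampling: count how many cumulative thresholds u has passed.
    cdf = np.cumsum(p, axis=1)
    u = rng.random(len(p))
    return (u[:, None] >= cdf[:, :-1]).sum(axis=1)


# Degenerate binary scores give a deterministic assignment: [0, 1, 0, 1].
binary = sketch_treatment(np.array([0.0, 1.0, 0.0, 1.0]),
                          rng=np.random.default_rng(0))

# Three variants with the same probabilities for every observation.
scores = np.tile([0.2, 0.3, 0.5], (5000, 1))
w = sketch_treatment(scores, rng=np.random.default_rng(0))
```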
- metalearners.data_generation.compute_experiment_outputs(mu, treatment, sigma_y=1, sigma_tau=0.5, n_variants=None, is_classification=False, positive_proportion=0.5, return_probability_cate=False, rng=None)
Compute the experiment’s observed outcomes y and the true CATE.
This function generates experiment outputs and the true CATE values based on the given potential outcomes function and treatments. The treatment effect for each observation is computed as the difference in potential outcomes. To simulate real-world variance, normally distributed noise with standard deviation sigma_y is added to the response variable \(Y_i(0)\), and noise with standard deviation sigma_tau is added to each corresponding treatment effect.
treatment must be a vector representing the treatment group assignment for each observation. Each element of the vector is an integer representing a treatment variant, starting at 0.
mu must be a matrix of shape (n_obs, n_variants) containing the potential outcomes for each observation and treatment variant without added noise.
n_variants can be passed to specify the number of treatment variants. If None, it is inferred as the maximum value in the treatment vector plus one.
is_classification determines whether the simulated problem is a classification problem. If True, the response variable is binary and the proportion of positive outputs is controlled by the positive_proportion parameter. Note that the potential outcomes are passed through a sigmoid function, so they may take any value in \(\mathbb{R}\). Classification problems are only implemented for binary treatments.
For a classification problem, return_probability_cate specifies whether the returned CATE is the difference in probabilities between treating and not treating, or whether outcomes are sampled from a Bernoulli distribution and the difference in samples is returned.
The function returns a tuple containing the following elements:
y: numpy array of the experiment’s observed outcomes (response variable) after noise addition.
true_cate: numpy array of the true CATE without any added noise.
- Parameters:
mu (ndarray)
treatment (Series | ndarray)
sigma_y (float)
sigma_tau (float)
n_variants (int | None)
is_classification (bool)
positive_proportion (float)
return_probability_cate (bool)
rng (Generator | None)
- Return type:
tuple[ndarray, ndarray]
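The regression case described for compute_experiment_outputs can be sketched as follows. This is an illustration under stated assumptions, not the library's code: sketch_outputs is a hypothetical name, the classification branch is omitted, n_variants is taken from the shape of mu, and the true CATE is assumed to have shape (n_obs, n_variants - 1), holding the effect of each variant relative to variant 0.

```python
import numpy as np


def sketch_outputs(mu, treatment, sigma_y=1.0, sigma_tau=0.5, rng=None):
    """Hypothetical sketch of the regression case; not the library's code."""
    rng = np.random.default_rng() if rng is None else rng
    mu = np.asarray(mu, dtype=float)
    treatment = np.asarray(treatment)
    n_obs = mu.shape[0]

    # Noise-free effect of each variant relative to variant 0 (assumed shape).
    true_cate = mu[:, 1:] - mu[:, [0]]

    # Noise on the control outcome Y_i(0) and on every treatment effect.
    y0 = mu[:, 0] + rng.normal(0, sigma_y, size=n_obs)
    noisy_effects = true_cate + rng.normal(0, sigma_tau, size=true_cate.shape)

    # Variant 0 has no effect; other variants add their noisy effect to Y_i(0).
    effects = np.column_stack([np.zeros(n_obs), noisy_effects])
    y = y0 + effects[np.arange(n_obs), treatment]
    return y, true_cate


mu_example = np.array([[0.0, 1.0, 3.0],
                       [2.0, 2.0, 5.0]])
treatment_example = np.array([2, 0])

# With both noise levels at 0, y equals the potential outcome of the
# assigned variant.
y, true_cate = sketch_outputs(mu_example, treatment_example,
                              sigma_y=0.0, sigma_tau=0.0,
                              rng=np.random.default_rng(0))
```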