metalearners.cross_fit_estimator module

class metalearners.cross_fit_estimator.CrossFitEstimator(n_folds, estimator_factory, estimator_params=<factory>, enable_overall=True, random_state=None)

Bases: object

Helper class for cross-fitting estimators on data.

Conceptually, it fits one model per fold, n_folds models in total, plus one additional model trained on all of the data if enable_overall is set (n_folds + 1 models in total).

estimator_factory is a class (not an instance) implementing an estimator with a scikit-learn interface. Its instantiation parameters can be passed via estimator_params. An example argument for estimator_factory would be lightgbm.LGBMRegressor.

Importantly, the CrossFitEstimator can handle in-sample and out-of-sample (‘oos’) data for prediction. For in-sample prediction, each data point is predicted by the single model whose training set did not contain that data point. For oos prediction, different options exist. These options either combine the predictions of the n_folds models or use a model trained on all of the data (enable_overall).
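The in-sample behaviour described above can be illustrated with a minimal, standard-library-only sketch. It is not the library's implementation: MeanRegressor, make_folds, cross_fit and predict_in_sample are hypothetical names introduced for illustration, and the fold assignment shown here is only one plausible way to split the data.

```python
import random
from statistics import fmean


class MeanRegressor:
    """Toy estimator with a scikit-learn-like fit/predict interface (hypothetical)."""

    def fit(self, X, y):
        self.mean_ = fmean(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def make_folds(n_samples, n_folds, random_state=None):
    """Shuffle the indices and deal them into n_folds disjoint test sets."""
    indices = list(range(n_samples))
    random.Random(random_state).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def cross_fit(X, y, n_folds, random_state=None):
    """Fit one model per fold, each on the data outside that fold."""
    folds = make_folds(len(X), n_folds, random_state)
    models = []
    for test_indices in folds:
        held_out = set(test_indices)
        train = [i for i in range(len(X)) if i not in held_out]
        model = MeanRegressor().fit([X[i] for i in train], [y[i] for i in train])
        models.append(model)
    return folds, models


def predict_in_sample(X, folds, models):
    """Predict each point with the one model that did not train on it."""
    predictions = [None] * len(X)
    for test_indices, model in zip(folds, models):
        fold_predictions = model.predict([X[i] for i in test_indices])
        for i, value in zip(test_indices, fold_predictions):
            predictions[i] = value
    return predictions


X = [[float(i)] for i in range(6)]
y = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
folds, models = cross_fit(X, y, n_folds=3, random_state=0)
in_sample = predict_in_sample(X, folds, models)
```

Because every prediction comes from a model that never saw the corresponding target value, in-sample predictions obtained this way do not leak the target into the prediction.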

n_folds can be set to 1 to deactivate cross-fitting. In that case, the CrossFitEstimator fits only one overall model, which is used for both in-sample and out-of-sample predictions. Note that this is not recommended since it can lead to data leakage when doing in-sample predictions: the model then predicts data points it has been trained on.

Parameters:
  • n_folds (int)

  • estimator_factory (type[_ScikitModel])

  • estimator_params (dict)

  • enable_overall (bool)

  • random_state (int | None)

n_folds: int
estimator_factory: type[_ScikitModel]
estimator_params: dict
enable_overall: bool = True
random_state: int | None = None
classes_: ndarray | None
clone()

Construct a new unfitted CrossFitEstimator with the same init parameters.

Return type:

CrossFitEstimator

fit(X, y, fit_params=None, n_jobs_cross_fitting=None, cv=None)

Fit the underlying estimators.

One estimator is trained per fold, i.e. n_folds estimators in total.

If enable_overall is set, an additional estimator is trained on all data.

n_jobs_cross_fitting can be used to specify the number of jobs for cross-fitting. For more information see the sklearn glossary.

cv can optionally be passed. If passed, it is expected to be a list of (train_indices, test_indices) tuples indicating how to split the data at hand into train and test/estimation sets for different folds.

Parameters:
  • X (DataFrame | ndarray)

  • y (Series | ndarray | DataFrame)

  • fit_params (dict | None)

  • n_jobs_cross_fitting (int | None)

  • cv (list[tuple[ndarray, ndarray]] | None)

Return type:

Self
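As an illustration of the cv format described above, the following sketch builds a list of (train_indices, test_indices) tuples with NumPy. make_cv is a hypothetical helper, not part of the library, and the sketch assumes that every observation should appear in exactly one test set.

```python
import numpy as np


def make_cv(n_samples, n_folds, random_state=None):
    """Build a list of (train_indices, test_indices) tuples, one per fold."""
    rng = np.random.default_rng(random_state)
    permuted = rng.permutation(n_samples)
    # Split the shuffled indices into n_folds disjoint test sets.
    test_sets = np.array_split(permuted, n_folds)
    cv = []
    for test_indices in test_sets:
        # Everything outside the test set is used for training.
        train_indices = np.setdiff1d(permuted, test_indices)
        cv.append((train_indices, np.sort(test_indices)))
    return cv


cv = make_cv(n_samples=10, n_folds=5, random_state=0)
```

A list of this shape matches the documented type of the cv parameter, list[tuple[ndarray, ndarray]], and could be passed to fit via cv=cv.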

predict(X, is_oos, oos_method=None, **kwargs)

Predict from X.

If is_oos is True, oos_method determines how predictions are generated for ‘out of sample’ data. ‘Out of sample’ refers to data that has not been used in the fit method. The oos_method 'overall' can only be used if the CrossFitEstimator has been initialized with enable_overall=True.

Parameters:
  • X (DataFrame | ndarray)

  • is_oos (bool)

  • oos_method (Literal['overall', 'median', 'mean'] | None)

Return type:

ndarray
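The three oos_method options can be sketched as follows. This reflects an assumed reading of the options, namely that 'mean' and 'median' aggregate the predictions of the n_folds models per data point while 'overall' uses the model trained on all of the data. combine_oos_predictions is a hypothetical helper, not a library function.

```python
from statistics import fmean, median


def combine_oos_predictions(fold_predictions, oos_method, overall_predictions=None):
    """Combine out-of-sample predictions under an assumed reading of oos_method.

    fold_predictions: one list of predictions per fold model, all of equal length.
    overall_predictions: predictions of the model trained on all data, if any.
    """
    if oos_method == "overall":
        if overall_predictions is None:
            raise ValueError("'overall' needs a model trained on all of the data.")
        return list(overall_predictions)
    if oos_method == "mean":
        aggregate = fmean
    elif oos_method == "median":
        aggregate = median
    else:
        raise ValueError(f"Unknown oos_method: {oos_method!r}")
    # zip(*...) regroups the predictions so that each tuple belongs to one data point.
    return [aggregate(per_point) for per_point in zip(*fold_predictions)]


# Three fold models, two out-of-sample data points.
fold_predictions = [[1.0, 10.0], [2.0, 20.0], [6.0, 60.0]]
mean_predictions = combine_oos_predictions(fold_predictions, "mean")
median_predictions = combine_oos_predictions(fold_predictions, "median")
```

In this toy input the median is less affected than the mean by the third model's larger predictions, which is one reason a user might prefer 'median' when individual fold models are unstable.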

predict_proba(X, is_oos, oos_method=None)

Predict probabilities from X.

If is_oos is True, oos_method determines how predictions are generated for ‘out of sample’ data. ‘Out of sample’ refers to data that has not been used in the fit method. The oos_method 'overall' can only be used if the CrossFitEstimator has been initialized with enable_overall=True.

Parameters:
  • X (DataFrame | ndarray)

  • is_oos (bool)

  • oos_method (Literal['overall', 'median', 'mean'] | None)

Return type:

ndarray

score(X, y, sample_weight=None, **kwargs)
set_params(**params)