Model_Pipeline

class BPt.Model_Pipeline(loaders=None, imputers='default', scalers=None, transformers=None, feat_selectors=None, model='default', param_search=None, n_jobs='default', cache='depreciated', feat_importances='depreciated')

Model_Pipeline is essentially a wrapper around all of the explicit modelling pipeline parameters. This object is used as input to Evaluate and Test.

The ordering of the parameters listed below defines the pre-set order in which these pipeline pieces are composed (all parameters up through model; param_search is not an ordered pipeline piece). For more flexibility, one can always use custom defined objects, or even pass custom defined pipelines directly to model (e.g., in the case where you already have a specific pipeline defined, but just want to use the loaders from BPt).
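
For example, a minimal pipeline might look like the following (a rough sketch only; it re-uses the default preset strings documented below, i.e., 'mean', 'median' and 'ridge'):

pipeline = Model_Pipeline(imputers=[Imputer('mean', scope='float'),
                                    Imputer('median', scope='cat')],
                          model=Model('ridge'))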

Parameters
  • loaders (Loader, list of or None, optional) –

    Each Loader refers to transformations which operate on loaded Data_Files (See Load_Data_Files). See Loader explicitly for more information on how to create a valid object, with relevant params and scope.

    In the case that a list of Loaders is passed to loaders as a native python list, the passed loaders will be applied sequentially (likely with each passed loader given a separate scope, as the output from one loader cannot be input to another - note that to create actual sequential loader steps, look into using the Pipe wrapper argument when creating a single Loader obj).

    Passed loaders can also be wrapped in a Select wrapper, e.g., as either

    # Just passing select
    loaders = Select([Loader(...), Loader(...)])
    
    # Or nested
    loaders = [Loader(...), Select([Loader(...), Loader(...)])]
    

    In this way, most of the pipeline pieces, not just loaders, can accept lists, or nested lists with input wrappers.

    default = None
    

  • imputers (Imputer, list of or None, optional) –

    If there is any missing data (NaNs) that has been kept within data or covars, then an imputation strategy must be defined! This param controls what kind of imputation strategy to use.

    Each Imputer contains information around which imputation strategy to use, what scope it is applied to (in this case only ‘float’ vs. ‘cat’), and other relevant base parameters (i.e., a base model if an iterative imputer is selected).

    In the case that a list of Imputers is passed, they will be applied sequentially, though note that unless custom scopes are employed, at most passing one imputer for float data and one imputer for categorical data makes sense. You may also use input wrappers, like Select.

    In the case that no NaN data is passed, but imputers is not None, it will simply be set to None.

    default = [Imputer('mean', scope='float'),
               Imputer('median', scope='cat')]
    

  • scalers (Scaler, list of or None, optional) –

    Each Scaler refers to any potential data scaling where a transformation on the data (without access to the target variable) is computed, and the number of features or data points does not change. Each Scaler object contains information about the base object, what scope it should be applied to, and saved param distributions if relevant.

    As with other pipeline params, scalers can accept a list of Scaler objects in order to apply sequential transformations (or again, in the case where each object has a separate scope, these are essentially two different streams of transformations; vs. when two Scalers with the same scope are passed, the output from one is passed as input to the next). Likewise, you may also use valid input wrappers, e.g., Select.

    By default no scaler is used, though using one is recommended.

    default = None
    

  • transformers (Transformer, list of or None, optional) –

    Each Transformer defines a transformation of the data that may change the number of features in a way that is potentially non-deterministic or not simply removal (i.e., different from feat_selectors), for example applying a PCA, where the number of features changes and the new features do not correspond 1:1 to the original features. See Transformer for more information.

    Transformers can be composed sequentially with lists or special input type wrappers, the same as the other objects.

    default = None
    

  • feat_selectors (Feat_Selector, list of or None, optional) –

    Each Feat_Selector refers to an optional feature selection stage of the Pipeline. See Feat_Selector for specific options.

    Input can be composed in a list, to apply feature selection sequentially, or with special Input Type wrapper, e.g., Select.

    default = None
    

  • model (Model, Ensemble, optional) –

    model accepts one input of type Model or Ensemble. While it cannot accept a list (i.e., no sequential behavior), you may still pass an Input Type wrapper like Select to perform model selection via a param search.

    See Model for more information on how to specify a single model to BPt, and Ensemble for information on how to build an ensemble of models.

    Note: You must provide a model; there is no option for None. Instead, the default behavior is to use a ridge regression.

    default = Model('ridge')
    

  • param_search (Param_Search or None, optional) –

    Param_Search can be provided in order to specify a corresponding hyperparameter search for the provided pipeline pieces. When defining each piece, you may set hyperparameter distributions for that piece. If param_search is None, these distributions will essentially be ignored, but if a Param_Search is passed here, then they will be used along with the strategy defined in the passed Param_Search to conduct a nested hyper-param search.

    Note: If using input wrapper types like Select, then a param search must be passed!

    default = None
    

  • n_jobs (int or 'default', optional) –

    The number of cores to be used with this pipeline. In general, this parameter should be left as 'default', which will set it based on the n_jobs set in the problem spec, and will attempt to automatically change this value in contexts such as nesting.

    default = 'default'
    

  • cache (deprecated) –

    The cache parameter has been deprecated; use the cache_loc params within individual pieces instead.

    default = 'depreciated'
    

  • feat_importances (deprecated) –

    Feature importances were specified via this Model_Pipeline object in a past version of BPt. Now they should instead be provided directly to Evaluate or Test.

    default = 'depreciated'

Problem_Spec

class BPt.Problem_Spec(problem_type='default', target=0, scorer='default', weight_scorer=False, scope='all', subjects='all', n_jobs='default', random_state='default')

Problem_Spec is defined as an object of params encapsulating the set of parameters shared by the modelling class functions Evaluate and Test.
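
For example, a basic Problem_Spec might look like the following sketch (the target name 'my_target' is hypothetical, and 'roc_auc' is one of the scorer strings referenced below):

spec = Problem_Spec(problem_type='binary',
                    target='my_target',
                    scorer='roc_auc',
                    scope='all',
                    subjects='all')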

Parameters
  • problem_type (str or 'default', optional) –

    This parameter controls what type of machine learning should be conducted: either regression or classification, where 'categorical' represents multiclass classification, typically handled by training a binary classifier on each class.

    • ’default’

      Determine the problem type based on how the requested target variable is loaded.

    • ’regression’, ‘f’ or ‘float’

      For ML on float/continuous target data.

    • ’binary’ or ‘b’

      For ML on binary target data.

    • ’categorical’ or ‘c’

      For ML on categorical target data, as multiclass.

    default = 'default'
    

  • target (int or str, optional) –

    The loaded target to use during modelling. This can be the int index (as assigned by the order loaded in, e.g., the first target loaded is 0, the next is 1), or the name of the target column. If only one target is loaded, just leave as the default of 0.

    default = 0
    

  • scorer (str or list, optional) –

    Indicator str for which scorer(s) to use when calculating average validation score in Evaluate, or Test set score in Test.

    A list of str’s can be passed as well, in this case, scores for all of the requested scorers will be calculated and returned.

    Note: If using a Param_Search, the Param_Search object has a scorer parameter as well. This scorer describes the scorer optimized in a parameter search.

    For a full list of supported scorers please view the scikit-learn docs at: https://scikit-learn.org/stable/modules/model_evaluation.html#the-scoring-parameter-defining-model-evaluation-rules

    If left as 'default', a reasonable scorer is assigned based on the passed problem type:

    • ’regression’ : [‘explained_variance’, ‘neg_mean_squared_error’]

    • ’binary’ : [‘matthews’, ‘roc_auc’, ‘balanced_accuracy’]

    • ’categorical’ : [‘matthews’, ‘roc_auc_ovr’, ‘balanced_accuracy’]

    default = 'default'
    

  • weight_scorer (bool, list of, optional) –

    If True, then the scorer of interest will be weighted within each repeated fold by the number of subjects in that validation set. This parameter typically only makes sense for custom split behavior where validation folds may end up with differing sizes. When default CV schemes are employed, there is likely no point in applying this weighting, as the validation folds will have similar sizes.

    If you are passing multiple scorers, then you can also pass a list of values for weight_scorer, with each value set as boolean True or False, specifying if the corresponding scorer by index should be weighted or not.

    default = False
    

  • scope (key str or Scope obj, optional) –

    This parameter allows the user to optionally run an experiment with just a subset of the loaded features / columns.

    See Scopes for a more detailed explanation / guide on how scopes are defined and used within BPt.

    default = 'all'
    

  • subjects (str, array-like or Value_Subset, optional) –

    This parameter allows the user to optionally run Evaluate or Test with just a subset of the loaded subjects. It is notably distinct from the train_subjects and test_subjects parameters directly available to Evaluate and Test, as those parameters typically refer to train/test splits. Specifically, any value specified for this subjects parameter will be applied AFTER selecting the relevant train or test subset.

    One use case for this parameter might be specifying subjects of just one sex, where you would still want the same training set for example, but just want to test sex specific models.

    If set to 'all' (as is by default), all available subjects will be used.

    subjects can accept either a specific array of subjects, or the location of a text file (formatted as one subject per line) to read from.

    A special wrapper, Value_Subset, can also be used to specify value-specific subsets of subjects to use. See Value_Subset for how this input wrapper can be used.

    default = 'all'
    

  • n_jobs (int, or 'default') –

    n_jobs are employed within the context of a call to Evaluate or Test. If left as default, the class-wide BPt value will be used.

    In general, the way n_jobs are propagated to the different pipeline pieces on the backend is that, if there is a parameter search, the base ML pipeline will be set to use 1 job, and the n_jobs budget will be used to train pipelines in parallel to explore different params. Otherwise, if there is no param search, n_jobs will be used for each piece individually, though some might not support it.

    default = 'default'
    

  • random_state (int, RandomState instance, None or 'default', optional) –

    Random state, either as int for a specific seed, or if None then the random seed is set by np.random.

    This parameter is used to ensure replicability of experiments (wherever possible!). In some cases, even with a random seed, depending on the pipeline pieces being used, if any have a component that occasionally yields different results even with the same random seed, e.g., some model optimizations, then you might still not get exact replicability.

    If ‘default’, use the saved class value. (Defined in ML)

    default = 'default'
    

Pieces

Loader

class BPt.Loader(obj, params=0, scope='data files', cache_loc=None, extra_params=None, fix_n_wrapper_jobs=False)

Loader refers to transformations which operate on loaded Data_Files (See Load_Data_Files()). They in essence take in saved file locations and, after some series of transformations, pass on compatible features. Notably, loaders define operations which are computed on single files independently.
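
As a minimal sketch, using the preset 'identity' loader described below with the default 'data files' scope:

loader = Loader(obj='identity', scope='data files')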

Parameters
  • obj (str, custom obj or Pipe) –

    obj selects the base loader object to use; this can be a str corresponding to one of the preset loaders found at Loaders. Beyond pre-defined loaders, users can pass in custom objects (they just need to have a defined fit_transform function which, when passed the already loaded file, will return a 1D representation of that subject's features).

    obj can also be passed as a Pipe. See Pipe’s documentation to learn more on how this works, and why you might want to use it.

    See Pipeline Objects to read more about pipeline objects in general.

    For example, the 'identity' loader will load in saved data at the stored file location, let's say 2D numpy arrays, and will return a flattened version of the saved arrays, with each data point as a feature. A more practical example might constitute loading in, say, 3D neuroimaging data, and passing on features as extracted by ROI.

  • params (int, str or dict of params, optional) –

    params optionally sets the distribution of hyper-parameters to potentially search over for this loader. Preset param distributions are listed for each choice of obj at Loaders, and you can read more on how params work more generally at Params.

    If obj is passed as Pipe, see Pipe for an example on how different corresponding params can be passed to each piece individually.

    default = 0
    

  • scope (valid scope, optional) –

    scope determines on which subset of features the specified loader should transform. See Scopes for more information on how scopes can be specified.

    You will likely want to use either custom key based scopes, or the 'data files' preset scope, as something like 'covars' won't make much sense since, at least for now, you cannot even load Covars data files.

    default = 'data files'
    

  • cache_loc (str, Path or None, optional) – Optional location in which to cache loader transformations.

  • extra_params (extra params dict, optional) –

    See Extra Params

    default = None
    

  • fix_n_wrapper_jobs (int or False, optional) –

    Typically this parameter is left as default, but in special cases you may want to set it; it controls the number of jobs fixed for the Loading Wrapper.

    default = False
    

Imputer

class BPt.Imputer(obj, params=0, scope='all', base_model=None, base_model_type='default', extra_params=None)

If there is any missing data (NaNs) that has been kept within data or covars, then an imputation strategy must be defined! This object allows you to define an imputation strategy. In general, you should need at most two Imputers, one for all float type data and one for all categorical data, assuming both types are present and both have missing values.
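
For example, the typical two-imputer setup, plus an iterative variant, might look like the following sketch (the 'mean', 'median' and 'iterative' strategy strings and the 'ridge' model preset are referenced elsewhere in these docs; check Imputers for the exact available options):

# One imputer for float features, one for categorical
imputers = [Imputer('mean', scope='float'),
            Imputer('median', scope='cat')]

# An iterative strategy requires a base model
imputer = Imputer('iterative', scope='float', base_model=Model('ridge'))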

Parameters
  • obj (str) –

    obj selects the base imputation strategy to use. See Imputers for all available options. Notably, if 'iterative' is passed, then a base model must also be passed! Also note that the sample_posterior argument within the iterative imputer is not currently supported.

    See Pipeline Objects to read more about pipeline objects in general.

  • params (int, str or dict of params, optional) –

    params sets an associated distribution of hyper-parameters to potentially search over with the Imputer. Preset param distributions are listed for each choice of params with the corresponding obj at Imputers, and you can read more on how params work more generally at Params.

    default = 0
    

  • scope ({'float', 'cat', custom}, optional) –

    scope determines on which subset of features the specified imputer will have access to.

    The main options that make sense for imputer are one for float data and one for categorical / ‘cat’ datatypes. Though you can also pass a custom set of keys.

    Note: If using iterative imputation you may want to carefully consider the scope passed. For example, while it may be beneficial to impute categorical and float features separately, i.e., with different base_model_type's (categorical for categorical and regression for float), you must also consider that in predicting the missing values under this setup, the categorical imputer would not have access to the float features and vice versa.

    In this way, you may want to either just treat all features as float, or, instead of imputing categorical features, load missing values as a separate category - and then set the scope here to be 'all', such that the iterative imputer has access to all features. Essentially, this is necessary because the iterative imputer will try to replace any NaN value present in its input features.

    See Scopes for more information on how scopes can be specified.

    default = 'all'
    

  • base_model (Model, Ensemble or None, optional) –

    If ‘iterative’ is passed to obj, then a base_model is required in order to perform iterative imputation! The base model can be any valid Model_Pipeline Model.

    default = None
    

  • base_model_type ('default' or Problem Type, optional) –

    In setting a base imputer model, it may be desirable to have this model have a different 'problem type' than your over-arching problem. For example, if performing iterative imputation on categorical features only, you will likely want to use a categorical predictor - but for imputing on float-type features, you will want to use a 'regression' type base model.

    Choices are {‘binary’, ‘regression’, ‘categorical’} or ‘default’. If ‘default’, then the following behavior will be applied: If the scope of the imputer is set to ‘cat’ or ‘categorical’, then the ‘categorical’ problem type will be used for the base model. If anything else, then the ‘regression’ type will be used.

    default = 'default'
    

  • extra_params (extra params dict, optional) –

    See Extra Params

    default = None

Scaler

class BPt.Scaler(obj, params=0, scope='float', extra_params=None)

Scaler refers to a piece in the Model_Pipeline, which is responsible for performing any sort of scaling or transformation on the data which doesn’t require the target variable, and doesn’t change the number of data points or features.
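
A minimal sketch, assuming 'standard' (standard scaling) is among the preset strings listed at Scalers (check there for the exact names):

scaler = Scaler(obj='standard', scope='float')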

Parameters
  • obj (str or custom obj) –

    obj, if passed a str, selects a scaler from the preset defined scalers; see Scalers for all available options. If passing a custom object, it must be an sklearn compatible transformer, and further must not require the target variable, nor change the number of data points or features.

    See Pipeline Objects to read more about pipeline objects in general.

  • params (int, str or dict of params, optional) –

    params sets an associated distribution of hyper-parameters to potentially search over with this Scaler. Preset param distributions are listed for each choice of params with the corresponding obj at Scalers, and you can read more on how params work more generally at Params.

    default = 0
    

  • scope (valid scope, optional) –

    scope determines on which subset of features the specified scaler should transform. See Scopes for more information on how scopes can be specified.

    default = 'float'
    

  • extra_params (extra params dict, optional) –

    See Extra Params

    default = None
    

Transformer

class BPt.Transformer(obj, params=0, scope='float', cache_loc=None, extra_params=None, fix_n_wrapper_jobs='default')

The Transformer is a base optional component of the Model_Pipeline class. Transformers define any type of transformation to the loaded data which may change the number of features in a non-simple way (i.e., conceptually distinct from Feat_Selector, where you know in advance the transformation is just selecting a subset of existing features). These are transformations like applying Principal Component Analysis, or on-the-fly One Hot Encoding.
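
For example, applying a PCA to all float features might look like the following sketch (it assumes 'pca' is among the preset strings listed at Transformers):

transformer = Transformer(obj='pca', scope='float')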

Parameters
  • obj (str or custom_obj) –

    obj, if passed a str, selects from the available class defined options for transformer as found at Transformers.

    If a custom object is passed as obj, it must be a sklearn api compatible transformer (i.e., have fit, transform, get_params and set_params methods, and further be cloneable via sklearn’s clone function). See Custom Input Objects for more info.

    See Pipeline Objects to read more about pipeline objects in general.

  • params (int, str or dict of params, optional) –

    params optionally sets the distribution of hyper-parameters to potentially search over for this transformer. Preset param distributions are listed for each choice of obj at Transformers, and you can read more on how params work more generally at Params.

    default = 0
    

  • scope (valid scope, optional) –

    scope determines on which subset of features the specified transformer should transform. See Scopes for more information on how scopes can be specified.

    Specifically, it may be useful to consider the use of Duplicate here.

    default = 'float'
    

  • extra_params (extra params dict, optional) –

    See Extra Params

    default = None
    

  • fix_n_wrapper_jobs (int or 'default', optional) –

    This parameter is currently ignored for Transformers.

    default = 'default'
    

Feat_Selector

class BPt.Feat_Selector(obj, params=0, scope='all', base_model=None, extra_params=None)

Feat_Selector is a base piece of Model_Pipeline, which is designed to perform feature selection.
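
For example, recursive feature elimination, which as noted below requires a base model, might be specified as in this sketch (it assumes 'rfe' is the corresponding preset string at Feat Selectors, and re-uses the 'ridge' Model preset):

feat_selector = Feat_Selector(obj='rfe', base_model=Model('ridge'), scope='all')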

Parameters
  • obj (str) –

    obj selects the feature selection strategy to use. See Feat Selectors for all available options. Notably, if 'rfe' is passed, then a base model must also be passed!

    See Pipeline Objects to read more about pipeline objects in general.

  • params (int, str or dict of params, optional) –

    params sets an associated distribution of hyper-parameters to potentially search over with this Feat_Selector. Preset param distributions are listed for each choice of params with the corresponding obj at Feat Selectors, and you can read more on how params work more generally at Params.

    default = 0
    

  • scope (valid scope, optional) –

    scope determines on which subset of features the specified feature selector will have access to. See Scopes for more information on how scopes can be specified.

    default = 'all'
    

  • base_model (Model, Ensemble or None, optional) –

    If 'rfe' is passed to obj, then a base_model is required in order to perform recursive feature elimination. The base model can be any valid argument accepted by the model param in Model_Pipeline.

    default = None
    

  • extra_params (extra params dict, optional) –

    See Extra Params

    default = None
    

Model

class BPt.Model(obj, params=0, scope='all', param_search=None, target_scaler=None, extra_params=None)

Model represents a base component of the Model_Pipeline, specifically a single model / estimator. Model can also be used as a component in building other pieces of the model pipeline, e.g., Ensemble.
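
A minimal sketch, using the 'ridge' preset referenced in the Model_Pipeline defaults (params=1 here assumes that preset index 1 selects a pre-defined hyper-parameter distribution, as described at Params):

model = Model(obj='ridge', params=1, scope='all')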

Parameters
  • obj (str, or custom obj) –

    obj selects the base model object to use, either from a preset str indicator found at Models, or from a custom passed user model (compatible w/ the sklearn api).

    See Pipeline Objects to read more about pipeline objects in general.

    obj should be either a single str indicator or a single custom model object, and not a list-like of either. If an ensemble of models is requested, then see Ensemble.

  • params (int, str or dict of params, optional) –

    params optionally set an associated distribution of hyper-parameters to this model object. Preset param distributions are listed for each choice of obj at Models, and you can read more on how params work more generally at Params.

    default = 0
    

  • scope (valid scope, optional) –

    scope determines on which subset of features the specified model should work on. See Scopes for more information on how scopes can be specified.

    default = 'all'
    

  • param_search (Param_Search, None, optional) –

    If None, by default, this will be a base model. Alternatively, by passing a Param_Search instance here, it specifies that this model should be wrapped in a Nevergrad hyper-parameter search object.

    This can be useful to create Models which have nested hyper-parameter tuning independent from the other pipeline steps.

    default = None
    

  • target_scaler (Scaler, None, optional) –

    Still somewhat experimental; you can pass a Scaler object here and have this model perform target scaling + reverse scaling.

    Note: Has not been fully tested in complicated nesting cases, e.g., if Model is wrapping a nested Model_Pipeline, this param will likely break.

    default = None
    

  • extra_params (extra params dict, optional) –

    See Extra Params

    default = None
    

Ensemble

class BPt.Ensemble(obj, models, params=0, scope='all', param_search=None, target_scaler=None, base_model=None, cv_splits=None, is_des=False, single_estimator=False, des_split=0.2, n_jobs_type='ensemble', extra_params=None)

The Ensemble object is a valid base Model_Pipeline piece, designed to be passed as input to the model parameter of Model_Pipeline, or to its own models parameter.

This class is used to create a variety of ensembled models, typically based on Model pieces.
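
As a rough sketch, a stacking-style ensemble over two base models might look like the following (it assumes 'stacking' is among the strings listed at Ensemble Types, re-uses the 'ridge' and 'random forest' Model presets mentioned elsewhere in these docs, and uses base_model as the final_estimator as described below):

ensemble = Ensemble(obj='stacking',
                    models=[Model('ridge'), Model('random forest')],
                    base_model=Model('ridge'))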

Parameters
  • obj (str) –

    Each str passed to ensemble refers to a type of ensemble to train, based also on the passed input to the models parameter and the additional parameters passed when init'ing Ensemble.

    See Ensemble Types to see all available options for ensembles.

    Passing custom objects here, while technically possible, is not currently fully supported. That said, there are just certain assumptions that the custom object must meet in order to work. Specifically, it should have similar input params to other similar existing ensembles, e.g., in the case that single_estimator is False and needs_split is also False, the passed object needs to be able to accept an input parameter estimators, which accepts a list of (str, estimator) tuples. Whereas if needs_split is still False, but single_estimator is True, then the passed object needs to support an init param of base_estimator, which accepts a single estimator.

  • models (Model, Ensemble or list of) –

    The models parameter is designed to accept any single model-like pipeline parameter object, i.e., Model or even another Ensemble. The passed pieces here will be used along with the requested ensemble object to create the requested ensemble.

    See Model for how to create a valid base model(s) to pass as input here.

  • params (int, str or dict of params, optional) –

    params sets an associated distribution of hyper-parameters for this ensemble object. These parameters will be used only in the context of a hyper-parameter search. Notably, these params refer to the ensemble obj itself; params for base models should be passed accordingly when creating the base models. Preset param distributions are listed at Ensemble Types, under each of the options for ensemble obj's.

    You can read more generally about hyper-parameter distributions as associated with objects at Params.

    default = 0
    

  • scope (valid scope, optional) –

    scope determines on which subset of features the specified ensemble model should work on. See Scopes for more information on how scopes can be specified.

    default = 'all'
    

  • param_search (Param_Search, None, optional) –

    If None, by default, this will be a base ensemble model. Alternatively, by passing a Param_Search instance here, it specifies that this model should be wrapped in a Nevergrad hyper-parameter search object.

    This can be useful to create Models which have nested hyper-parameter tuning independent from the other pipeline steps.

    default = None
    

  • target_scaler (Scaler, None, optional) –

    Still somewhat experimental; you can pass a Scaler object here and have this model perform target scaling + reverse scaling.

    scope in the passed scaler is ignored.

    Note: Has not been fully tested in complicated nesting cases, e.g., if Model is wrapping a nested Model_Pipeline, this param will likely break.

    default = None
    

  • base_model (Model, None, optional) –

    In the case that an ensemble is passed which has the parameter final_estimator (not base model!), for example in the case of stacking, then you may pass a Model type object here to be used as that final estimator.

    Otherwise, by default this will be left as None, and if the requested ensemble has the final_estimator parameter, then it will pass None to the object (which is typically for setting the default).

    default = None
    

  • cv_splits (CV_Splits or None, optional) –

    Used for passing custom CV split behavior to ensembles which employ splits, e.g., stacking.

    default = None
    

  • is_des (bool, optional) –

    is_des refers to whether the requested ensemble obj requires a further train/test split in order to train the base ensemble. As of right now, if this parameter is True, it means that the base ensemble is from the DESlib library, which means the base ensemble obj must have a pool_classifiers init parameter.

    The following des_split parameter determines the size of the split if is_des is True.

    default = False
    

  • single_estimator (bool, optional) –

    The parameter single_estimator is used to let the Ensemble object know if the models must be a single estimator. This is used for ensemble types that require an init param base_estimator. In the case that multiple models are passed to models, but single_estimator is True, then the models will automatically be wrapped in a voting ensemble, thus creating one single estimator.

    default = False
    

  • des_split (float, optional) –

    If is_des is True, then the passed ensemble must be fit on a separate validation set. This parameter determines the size of the further train/val split on the initial training set passed to the ensemble, where the size is computed as a percentage of the total size.

    default = .2
    

  • n_jobs_type ('ensemble' or 'models', optional) –

    Valid options are either ‘ensemble’ or ‘models’.

    This parameter controls how the total n_jobs are distributed. If 'ensemble', then the n_jobs will all be used in the ensemble object and every instance within the sub-models will be set to n_jobs = 1. Alternatively, if passed 'models', then the ensemble object will not be multi-processed, i.e., will be set to n_jobs = 1, and the n_jobs will be distributed to each base model.

    If you are training a stacking regressor for example with n_jobs = 16, and you have 16+ models, then ‘ensemble’ is likely a good choice here. If instead you have only 3 base models, and one or more of those 3 could benefit from a higher n_jobs, then setting n_jobs_type to ‘models’ might give a speed-up.

    default = 'ensemble'
    

  • extra_params (extra params dict, optional) –

    See Extra Params

    default = None
    

Feat_Importance

class BPt.Feat_Importance(obj, scorer='default', shap_params='default', n_perm=10, inverse_global=False, inverse_local=False)

There are a number of options for creating Feature Importances in BPt. See Feat Importances to learn more about feature importances generally. The way this object works is that you can specify a type of feature importance, and then its relevant parameters. This object is designed to be passed directly to Model_Pipeline.
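
For example, a shap based feature importance might be requested as in this sketch (it assumes 'shap' is among the strings listed at Feat Importances):

feat_importance = Feat_Importance(obj='shap', shap_params='default')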

Parameters
  • obj (str) – obj is the str indicator for which feature importance to use. See Feat Importances for what options are available.

  • scorer (str or 'default', optional) –

    If a permutation based feature importance is being used, then a scorer is required.

    For a full list of supported scorers please view the scikit-learn docs at: https://scikit-learn.org/stable/modules/model_evaluation.html#the-scoring-parameter-defining-model-evaluation-rules

    If left as 'default', a reasonable scorer is assigned based on the passed problem type:

    • ’regression’ : ‘explained_variance’

    • ’binary’ : ‘matthews’

    • ’categorical’ : ‘matthews’

    default = 'default'
    

  • shap_params (Shap_Params or ‘default’, optional) –

    If a shap based feature importance is used, it is necessary to define a number of relevant parameters for how the importances should be calculated. See Shap_Params for what these parameters are.

    If ‘default’ is passed, then shap_params will be set to either the default values of Shap_Params if shap feature importances are being used, or None if not.

    default = 'default'
    

  • n_perm (int, optional) –

    If a permutation based feature importance method is selected, then it is necessary to indicate how many times each feature should be randomly permuted.

    default = 10
    

  • inverse_global (bool) –

    Warning: This feature, along with inverse_local, is still experimental.

    If there are any loaders or transformers specified in the Model_Pipeline, then feature importance becomes slightly trickier. For example, if you have a PCA transformer and want to calculate averaged feature importance across 3 folds, there is no guarantee that 'pca feature 1' is the same from one fold to the next. In this case, if set to True, global feature importances will be inverse_transformed back into their original feature space - per fold. Note: this will only work if all transformers / loaders have an implemented inverse_transform function; if a transformer does not, then it will just return 0 for that feature, and for a loader without one, it will return 'No inverse_transform'.

    There are also other cases where this might be a bad idea, for example if you are using one hot encoders in your transformers then trying to reverse_transform feature importances will yield nonsense (NaN’s).

    default = False
    

  • inverse_local (bool) –

    Same as inverse_global, but for local feature importances. By default this is set to False, as it is more memory- and computationally expensive to inverse_transform in this case.

    default = False
    

CV

class BPt.CV(groups=None, stratify=None, train_only_loc=None, train_only_subjects=None)

This object is used to encapsulate a set of parameters for a CV strategy.
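
For example, to preserve family groupings and stratify by sex during any splits (a sketch; 'family_id' and 'sex' are hypothetical column keys assumed to be loaded in strat):

cv = CV(groups='family_id', stratify='sex')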

Parameters
  • groups (str, list or None, optional) – In the case of str input, will assume the str refers to a column key within the loaded strat data, and will assign it as a value to preserve groups by during any train/test or K-fold splits. If a list is passed, then each element should be a str, and they will be combined into all unique combinations of the elements of the list.

    default = None

  • stratify (str, list or None, optional) –

    In the case of str input, will assume the str to refer to a column key within the loaded strat data, or a loaded target col., and will assign it as a value to preserve distribution of groups by during any train/test or K-fold splits. If a list is passed, then each element should be a str, and they will be combined into all unique combinations of the elements of the list.

    Any target_cols passed must be categorical or binary, and cannot be float. Though you can consider loading in a float target as a strat, which will apply a specific k_bins, and then be valid here.

    In the case that you have a loaded strat val with the same name as your target, you can distinguish between the two by passing either the raw name, e.g., if they are both loaded as 'Sex', passing just 'Sex' will try to use the loaded target. If instead you want to use your loaded strat val with the same name, you have to pass 'Sex' + self.strat_u_name (by default this is '_Strat').

    default = None
    

  • train_only_loc (str, Path or None, optional) –

    Location of a file to load in train_only subjects, where any subject loaded as train_only will be assigned to every training fold, and never to a testing fold. This file should be formatted as one subject per line.

    You can load from a loc and pass subjects, the subjects from each source will be merged.

    This parameter is compatible with groups / stratify.

    default = None
    

  • train_only_subjects (set, array-like, 'nan', or None, optional) –

    An explicit list or array-like of train_only subjects, where any subject loaded as train_only will be assigned to every training fold, and never to a testing fold.

    You can also optionally specify ‘nan’ as input, which will add all subjects with any NaN data to train only.

    If you want to add both all the NaN subjects and custom subjects, call Get_Nan_Subjects() to get all NaN subjects, and then merge them yourself with any you want to pass.

    You can load from a loc and pass subjects, the subjects from each source will be merged.

    This parameter is compatible with groups / stratify.

    default = None
    

CV_Splits

class BPt.CV_Splits(cv='default', splits=3, n_repeats=1, _cv=None, _random_state=None, _splits_vals=None)

This object is used to wrap around a CV strategy at a higher level.
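
For example (a sketch; 'site' is a hypothetical loaded Strat variable used for a leave-out-group scheme as described below):

# Twice repeated 3-fold CV
cv_splits = CV_Splits(splits=3, n_repeats=2)

# Leave-out-site CV
cv_splits = CV_Splits(splits='site')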

Parameters
  • cv ('default' or CV, optional) – If left as 'default', use the class defined CV behavior for the splits; otherwise, custom behavior can be passed.

  • splits (int, float, str or list of str, optional) –

    splits allows you to specify the base of what CV strategy should be used.

    Specifically, options for split are:

    • int

      The number of k-fold splits to conduct. (E.g., 3 for 3-fold CV split to be conducted at every hyper-param evaluation).

    • float

      Must be 0 < splits < 1, and defines a single train-test like split, with splits % of the current training data size used as a validation set.

    • str

      If a str is passed, then it must correspond to a loaded Strat variable. In this case, a leave-out-group CV will be used according to the value of the indicated Strat variable (E.g., a leave-out-site CV scheme).

    • list of str

      If multiple strs are passed, first determine the overlapping unique values from their corresponding loaded Strat variables, and then use this overlapped value to define the leave-out-group CV as described above.

    n_repeats is designed to work with any of these choices.

    default = 3
    

  • n_repeats (int, optional) –

    The number of times to repeat the defined strategy as defined in splits.

    For example, if n_repeats is set to 2, and splits is 3, then a twice repeated 3-fold CV will be performed.

    default = 1
    

Input Types

Select

class BPt.Select

The Select object is a BPt specific Input Wrapper designed to allow hyper-parameter searches to include not just the choice of hyper-parameter, but also choosing between objects (as well as their relevant distributions).

Select is used to cast lists of base Model_Pipeline pieces as different options. Consider a simple example, for specifying a selection between two different Models

model = Select([Model('linear'), Model('random forest')])

In this example, the model passed to Model_Pipeline becomes a meta object for selecting between the two base models. Note: this does require a Param_Search object be passed to Model_Pipeline. Notably as well, if further param distributions are defined within, say, the Model('random forest'), those will still be optimized, allowing potentially even for a hyper-parameter search to select between hyper-parameter distributions… (i.e., if both select options had the same base model obj, but only differed in the passed hyper-param distributions) if one were so inclined…

Other notable features of Select are, you are not limited to passing only two options, you can pass an arbitrary number… you can even, and I’m not even sure I want to tell you this… pass nested calls to Select… i.e., one of the Select options could be another Select, with say another Select…

Lastly, explicitly note that Select is not restricted for use with Models; it can be used on any of the base Model_Pipeline piece params (i.e., every param but param_search, feat_importances and cache…).

Duplicate

class BPt.Duplicate

The Duplicate object is a BPt specific Input wrapper. It is designed to be cast on a list of valid scope parameters, e.g.,

scope = Duplicate(['float', 'cat'])

Such that the corresponding pipeline piece will be duplicated for every entry within Duplicate. In this case, two copies of the base object will be made, where both have the same remaining non-scope params (i.e., obj, params, extra_params), but one will have a scope of ‘float’ and the other ‘cat’.

Consider the following extended example, where loaders is being specified when creating an instance of Model_Pipeline:

loaders = Loader(obj='identity', scope=Duplicate(['float', 'cat']))

This is transformed in post processing to be equivalent to:

loaders = [Loader(obj='identity', scope='float'),
           Loader(obj='identity', scope='cat')]

Pipe

class BPt.Pipe

The Pipe object is a BPt specific Input wrapper, designed for now to work specifically within Loader. Because loader objects within BPt are designed to work on single files at a time, and further are restricted in that they must go directly from some arbitrary file, shape and characteristics to output as a valid 2D (# Subjects X # Features) array, this restricts potential sequential compositions. Pipe offers some utility towards building sequential compositions.

For example, say one had saved 4D neuroimaging fMRI timeseries, and they wanted to first employ a loader to extract timeseries by ROI (with, say, hyper-parameters defined to select which ROI to use), but then wanted to use another loader to convert the timeseries ROIs to a correlation matrix, and only then pass along the output as 1D features per subject. In this case, the Pipe wrapper is a great candidate!

Specifically, the Pipe wrapper works at the level of defining a specific Loader, where basically you are requesting that the loader you want to use be a Pipeline of a few different loader options, where the loader options are ones compatible in passing input to each other, e.g., the output from fit_transform as called on the ROI extractor is valid input to fit_transform of the Timeseries creator, and lastly the output from fit_transform of the Timeseries creator is a valid 1D feature array per subject.

Consider the example in code below, where we assume that 'rois' is the ROI extractor, and 'timeseries' is the correlation matrix creator object (where these could be any valid loader strs, or custom user-passed objects):

loader = Loader(obj = Pipe(['rois', 'timeseries']))

We only passed arguments for obj above, but in our toy example as initially described we wanted to further define parameters for a parameter search across both objects. See below for the different options for passing corresponding parameter distributions:

# Options loader1 and loader2 tell it explicitly no params

# Special case, if just default params = 0, will convert to 2nd case
loader1 = Loader(obj = Pipe(['rois', 'timeseries']),
                 params = 0)

# You can specify just a matching list
loader2 = Loader(obj = Pipe(['rois', 'timeseries']),
                 params = [0, 0])

# Option 3 assumes that there are pre-defined valid class param dists
# for each of the base objects
loader3 = Loader(obj = Pipe(['rois', 'timeseries']),
                 params = [1, 1])

# Option 4 lets set params for the 'rois' object, w/ custom param dists
loader4 = Loader(obj = Pipe(['rois', 'timeseries']),
                 params = [{'some custom param dist'}, 0])

Note that still only one scope may be passed, and that scope will define the scope of the new combined loader. Also note that if extra_params is passed, the same extra_params will be passed when creating both individual objects. The extra_params behavior is to add its contents only when the name of that param appears in the base class's init, such that there could exist a case where, if both the 'rois' and 'timeseries' base objects had a parameter with the same name, passing a value for that name in extra_params would update them both with the passed value.

Value_Subset

class BPt.Value_Subset(name, value)

Value_Subset is a special wrapper class for BPt designed to work with Subjects style input, as seen in Param_Search, or the train_subjects or test_subjects params in Evaluate and Test.

This wrapper can be used as follows, just specify an object as

Value_Subset(name, value)

Where name is the name of a loaded Strat column / feature, and value is the subset of values from that column to select subjects by. E.g., if you wanted to select just subjects of a specific sex, and assuming a variable was loaded in Strat (See Load_Strat) you could pass:

subjects = Value_Subset('sex', 0)

Which would specify only subjects with ‘sex’ equal to 0. You may also pass a list-like set of multiple columns to the name param. In this case, the overlap across all passed names will be computed, for example:

subjects = Value_Subset(['sex', 'race'], 0)

Where ‘race’ is another valid loaded Strat, would select only subjects with a value of 0 in the computed unique overlap across ‘sex’ and ‘race’.

Note it might be hard to tell what a value of 0 actually means, especially when you compose across multiple variables. With that in mind, as long as verbose is set to True, upon computation of the subset of subjects a message will be printed indicating what the passed value corresponds to in all of the combined variables, e.g., in the example above you would get the print out 'sex' = 0, 'race' = 0.

Values_Subset

class BPt.Values_Subset(name, values)

Values_Subset is a special wrapper class for BPt designed to work with Subjects style input.

This wrapper is very similar to Value_Subset, and will actually function the same in the case that one value for name and one value for values is selected, e.g., the below are equivalent:

subjects = Value_Subset(name='sex', value=0)
subjects = Values_Subset(name='sex', values=0)

That said, where Value_Subset allows passing multiple values for name but only one value for value, Values_Subset only allows one value for name, and multiple values for values.

Values_Subset therefore lets you select the subset of subjects via one or more values in a loaded Strat variable. E.g.,

subjects = Values_Subset(name='site', values=[0,1,5])

Would select the subset of subjects from sites 0, 1 and 5.