Init Phase

Import

To start, import the module (assuming it has already been installed) and create an object. In this example the object is called “ML”:

from BPt import BPt_ML
ML = BPt_ML(**init_params)

Alternatively, to load from a previously saved object:

from BPt import Load
ML = Load(saved_location)

Load

BPt_ML.Load(loc, exp_name='default', log_dr='default', existing_log='default', verbose='default', notebook='default', random_state='default')

This function loads a previously saved BPt_ML object.

See Save for saving an object. See Init for descriptions of the remaining changeable params, e.g., log_dr, existing_log, etc…

Parameters

loc (str or Path) –

A path/str to a saved BPt_ML object (one saved with Save); that object will be loaded. Notably, if any additional params are passed along with it, e.g., exp_name, notebook, etc…, they will override the saved values with the newly passed values. If left as ‘default’, all params will be set to the loaded values, though see the warning below.

Warning

The exp_name or log_dr may need to be changed, especially when the object is being loaded in a different location or environment from where the original was created, as by default it will try to create logs with the same path information as the original.

You can only change exp_name, log_dr, existing_log, verbose, notebook and random_state when loading a new object; for the remaining params, even if a value is passed, it will not be applied. If you really wish to change one of these params, you can set it manually via self.name_of_param = value.

The init params referenced above are those listed below under Init.
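
For example, a minimal sketch of loading a saved object in a new environment, overriding the log directory (the paths here are hypothetical):

from BPt import Load

# Hypothetical path to an object previously saved with Save
ML = Load('/new_machine/saves/My_Exp.ML', log_dr='/new_machine/logs')

# Params outside the allowed set are not applied on load;
# they can instead be changed manually, e.g.:
ML.n_jobs = 8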

Init

class BPt.BPt_ML(exp_name='My_Exp', log_dr='', existing_log='append', verbose=True, notebook=True, use_abcd_subject_ids=False, low_memory_mode=False, strat_u_name='_Strat', random_state=534, n_jobs=1, dpi=100, mp_context='loky')

Main class used within BPt for interfacing with Data Loading and Modeling / Other functionality.

Parameters
  • exp_name (str, optional) –

    The name of this experimental run, used explicitly in saving logs and figures, where the passed exp_name is used as the name of the log folder. If log_dr is not set to None (logs and figures are saved only when log_dr is not None), then a folder named exp_name is created within the log_dr.

    default = 'My_Exp'
    

  • log_dr (str, Path or None, optional) –

    The directory in which to store logs. If set to None, then no logs will be saved! If set to the empty str, logs will be saved in the current directory.

    default = ''
    

  • existing_log ({'new', 'append', 'overwrite'}, optional) –

    This parameter dictates different choices for when a folder with exp_name already exists in the specified log_dr.

    These choices are:

    • ’new’

      If the log folder already exists, then just increment exp_name until a free name is found, and use that as the log folder / exp_name.

    • ’append’

      If existing_log is ‘append’ then log entries and new figures will be added to the existing folder.

    • ’overwrite’

      If existing_log is ‘overwrite’, then the existing log folder with the same exp_name will be cleared upon __init__.

    default = 'append'
    

  • verbose (bool, optional) –

    If verbose is set to True, the BPt_ML object will print output, both diagnostic and more general, directly to std out. If set to False, no output will be printed, though output will still be recorded within the logs assuming log_dr is not None.

    default = True
    

  • notebook (bool, optional) –

    If True, then assumes the user is running the code in an interactive jupyter notebook. In this case, certain features will either be enabled or disabled, e.g., type of progress bar.

    default = True
    

  • use_abcd_subject_ids (bool, optional) –

    Flag to determine the usage of ABCD specific ‘default’ subject id behavior. If set to True, this will convert input NDAR subject ids into upper case, with a prepended NDAR-style format. If set to False, then all input subject names must be entered exactly the same; no preprocessing will be done on them.

    default = False
    

  • low_memory_mode (bool, optional) –

    This parameter dictates behavior around loading in data; specifically, if set to True, individual dataframes self.data, self.covars, etc… will be deleted from memory as soon as modeling begins. This parameter also controls the pandas read_csv behavior, which also has a low_memory flag.

    default = False
    

  • strat_u_name (str, optional) –

    A unique str identifier to be appended to every loaded strat value (to keep them separate from covars and data).

    You should only need to change or ever worry about this in the case that one of your input variables happens to have the default value of ‘_Strat’ in it…

    default = '_Strat'
    

  • random_state (int, RandomState instance or None, optional) –

    The default random state, either as an int for a specific seed, or if None then the random seed is set by np.random. This parameter, if set, will be the default random_state class-wide; so any place random_state is left as default, unless a different default is set (e.g. a default load value or default ML value), this random state will be used.

    default = 534
    

  • n_jobs (int, optional) –

    The default number of jobs / processors to use (if available) wherever available, class-wide across BPt.

    default = 1
    

  • dpi (int, optional) –

    The default dpi with which to save any automatically saved figures. This parameter can also be set to specific values for specific plots.

    default = 100
    

  • mp_context (str, optional) –

    When a hyper-parameter search is launched, there are different ways through python that the multi-processing can be launched (assuming n_jobs > 1). Occasionally some choices can lead to unexpected errors.

    Choices are:

    • ’loky’: Create and use the python library loky backend.

    • ’fork’: Python default fork mp_context

    • ’forkserver’: Python default forkserver mp_context

    • ’spawn’: Python default spawn mp_context

    default = 'loky'
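
Putting the above together, a sketch of a typical init (the exp_name and other values here are illustrative, not required):

from BPt import BPt_ML

ML = BPt_ML(exp_name='Structural_MRI_Exp',  # used as the log folder name
            log_dr='',                      # save logs in the current directory
            existing_log='new',             # increment the name if the folder exists
            verbose=True,
            random_state=42,
            n_jobs=4)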
    

Loading Phase

The next ‘phase’ is where all of the loading is done, and the structure of the desired experiments is set up.

Set_Default_Load_Params

BPt_ML.Set_Default_Load_Params(dataset_type='default', subject_id='default', eventname='default', eventname_col='default', overlap_subjects='default', merge='default', na_values='default', drop_na='default', drop_or_na='default')

This function is used to define default values for a series of params accessible to all or most of the different loading functions. By setting common values here, it reduces the need to repeat params within each loader (e.g. Load_Data, Load_Targets, etc…).

Parameters
  • dataset_type ({'basic', 'explorer', 'custom'}, optional) –

    The dataset_type / file-type to load from. Dataset types are,

    • ’basic’

      ABCD2p0NDA style (.txt and tab separated). Typically the default columns, and therefore not neuroimaging data, will be dropped, not including the eventname column.

    • ’explorer’

      2.0_ABCD_Data_Explorer style (.csv and comma separated). The first 2 columns before self.subject_id (typically the default columns, and therefore not neuroimaging data - also not including the eventname column) will be dropped.

    • ’custom’

      A user-defined custom dataset. Right now this is only supported as a comma separated file, with the subject names in a column called self.subject_id, and optionally an ‘eventname’ column. No columns will be dropped (except eventname), unless specific drop keys are passed.

    If loading multiple locs as a list, dataset_type can be a list with entries corresponding to the datatype of each loc.

    if ‘default’, and not already defined, set to ‘basic’

    default = 'default'
    

  • subject_id (str, optional) –

    The name of the column with unique subject ids in the dataset. For default ABCD datasets this is ‘src_subject_id’, but a user loading a different dataset need only change this accordingly (in addition to, most likely, setting eventname to None and use_abcd_subject_ids to False).

    if ‘default’, and not already defined, set to ‘src_subject_id’.

    default = 'default'
    

  • eventname (value, list of values or None, optional) –

    Optional value to provide, specifying to optionally keep certain rows when reading data based on the eventname flag, where eventname is the value and eventname_col is the name of the column containing it.

    If a list of values is passed, then a row will be kept if that row’s value within the eventname_col is equal to ANY of the passed eventname values.

    As ABCD is a longitudinal study, this flag lets you select only one specific time point, or if set to None, will load everything.

    For selecting only baseline imaging data, one might consider setting this param to ‘baseline_year_1_arm_1’.

    if ‘default’, and not already defined, set to None. (default = ‘default’)

  • eventname_col (str or None, optional) –

    If an eventname is provided, this param refers to the column name containing the eventname. This could also be used along with eventname to be set to any arbitrary value, in order to perform selection by specific column value.

    Note: The eventname col is dropped after proc’ed!

    if ‘default’, and not already defined, set to ‘eventname’ (default = ‘default’)

  • overlap_subjects (bool, optional) –

    This parameter dictates, when loading data, covars, targets or strat (after initial basic proc and/or merge w/ other passed locs), whether the loaded data should be restricted to only the overlapping subjects from previously loaded data, targets, covars or strat - important when performing intermediate proc. If False, then all subjects will be kept throughout the rest of the optional processing - and only merged at the end AFTER processing has been done.

    Note: Inclusions and Exclusions are always applied regardless of this parameter.

    if ‘default’, and not already defined, set to False (default = ‘default’)

  • merge ({'inner' or 'outer'}) –

    Similar to overlap_subjects, this parameter controls the merge behavior between different df’s, i.e., when calling Load_Data twice, a local dataframe is merged with the class self.data on the second call. There are two behaviors that make sense here: ‘inner’, which takes only the overlapping subjects from each dataframe, and ‘outer’, which keeps all subjects from both and sets any missing subjects’ values to NaN.

    if ‘default’, and not already defined, set to ‘inner’ (default = ‘default’)

  • na_values (list, optional) –

    Additional values to treat as NaN; by default the ABCD specific values ‘777’ and ‘999’ are treated as NaN, along with those set as default by pandas’ ‘read_csv’ function. Note: if new values are passed here, they will override the default ‘777’ and ‘999’ NaN values, so if it is desired to keep these, they should be passed explicitly, along with any new values.

    if ‘default’, and not already defined, set to [‘777’, ‘999’] (default = ‘default’)

  • drop_na (bool, int, float or 'default', optional) –

    This setting sets the value for drop_na, which is used when loading data and covars only!

    If set to True, then will drop any row within the loaded data if there are any NaN! If False, then will not drop any rows for missing values.

    If an int or float, then this means some NaN entries will potentially be preserved! Missing data imputation will therefore be required later on!

    If an int > 1, then will drop any row with more than drop_na NaN values. If a float, will determine the drop threshold as a percentage of the possible values, where 1 would not drop any rows as it would require the number of columns + 1 NaN, and .5 would require that more than half the column entries are NaN in order to drop that row.

    if ‘default’, and not already defined, set to True (default = ‘default’)

  • drop_or_na ({'drop', 'na'}, optional) –

    This setting sets the value for drop_or_na, which is used when loading data and covars only!

    filter_outlier_percent, or when loading a binary variable in load covars and more than two classes are present, are both instances where rows/subjects are by default dropped. If drop_or_na is set to ‘na’, then these values will instead be set to NaN rather than the whole row being dropped!

    Otherwise, if left as the default value of ‘drop’, rows will be dropped!

    if ‘default’, and not already defined, set to ‘drop’ (default = ‘default’)
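
For example, a sketch of setting shared defaults once, so the individual loading calls need not repeat them (the values here are illustrative):

# Defaults applied by Load_Data, Load_Targets, etc. unless overridden
ML.Set_Default_Load_Params(dataset_type='basic',
                           subject_id='src_subject_id',
                           eventname='baseline_year_1_arm_1',
                           eventname_col='eventname',
                           merge='inner',
                           drop_na=True)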

Load_Name_Map

BPt_ML.Load_Name_Map(name_map=None, loc=None, dataset_type='default', source_name_col='NDAR name', target_name_col='REDCap name/NDA alias', na_values='default', clear_existing=False)

Loads a mapping dictionary for loading column names. Either a loc or name_map must be passed! Note: If both a name_map and loc are passed, the name_map will be loaded first, then updated with values from the loc.

Parameters
  • name_map (dict or None, optional) –

    A dictionary containing the mapping to be passed directly. Set to None if using loc instead!

    (default = None)

  • loc (str, Path or None, optional) –

    The location of the csv file which contains the mapping.

    (default = None)

  • dataset_type ({'basic', 'explorer', 'custom'}, optional) –

    The dataset_type / file-type to load from. Dataset types are,

    • ’basic’

      ABCD2p0NDA style (.txt and tab separated). Typically the default columns, and therefore not neuroimaging data, will be dropped, not including the eventname column.

    • ’explorer’

      2.0_ABCD_Data_Explorer style (.csv and comma separated). The first 2 columns before self.subject_id (typically the default columns, and therefore not neuroimaging data - also not including the eventname column) will be dropped.

    • ’custom’

      A user-defined custom dataset. Right now this is only supported as a comma separated file, with the subject names in a column called self.subject_id, and optionally an ‘eventname’ column. No columns will be dropped (except eventname), unless specific drop keys are passed.

    If loading multiple locs as a list, dataset_type can be a list with entries corresponding to the datatype of each loc.

    if ‘default’, and not already defined, set to ‘basic’

    default = 'default'
    

  • source_name_col (str, optional) –

    The column name with the file which lists names to be changed.

    (default = “NDAR name”)

  • target_name_col (str, optional) –

    The column name within the file which lists the new name.

    (default = “REDCap name/NDA alias”)

  • na_values (list, optional) –

    Additional values to treat as NaN; by default the ABCD specific values ‘777’ and ‘999’ are treated as NaN, along with those set as default by pandas’ ‘read_csv’ function. Note: if new values are passed here, they will override the default ‘777’ and ‘999’ NaN values, so if it is desired to keep these, they should be passed explicitly, along with any new values.

    if ‘default’, and not already defined, set to [‘777’, ‘999’] (default = ‘default’)

  • clear_existing (bool, optional) – If set to True, will clear the existing loaded name_map, otherwise the name_map dictionary will be updated if already loaded!
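
A minimal sketch, assuming a hypothetical mapping file with the default source / target column names:

# From a hypothetical csv with 'NDAR name' and 'REDCap name/NDA alias' columns
ML.Load_Name_Map(loc='name_map.csv', dataset_type='explorer')

# Or, pass a mapping dict directly
ML.Load_Name_Map(name_map={'old_col_name': 'new_col_name'})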

Load_Exclusions

BPt_ML.Load_Exclusions(loc=None, subjects=None, clear_existing=False)

Loads in a set of excluded subjects, either from a file or as directly passed in.

Parameters
  • loc (str, Path or None, optional) – Location of a file to load in excluded subjects from. The file should be formatted as one subject per line. (default = None)

  • subjects (list, set, array-like or None, optional) – An explicit list of subjects to add to exclusions. (default = None)

  • clear_existing (bool, optional) –

    If this parameter is set to True, then any existing loaded exclusions will first be cleared before loading new exclusions!

    Warning

    If any subjects have been dropped from a different place, e.g. targets or data, then simply reloading / clearing the existing exclusions might result in computing a misleading overlap of final valid subjects. Reloading is therefore best done right after loading the original exclusions, or if that is not possible, after restarting the notebook or re-running the script.

    (default = False)

Notes

For best/most reliable performance across all Loading cases, exclusions should be loaded before data, covars and targets.

If default subject id behavior is set to False, reading subjects from an exclusion loc might not function as expected.
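
For example (the file path and subject names are hypothetical):

# From a hypothetical file with one subject id per line
ML.Load_Exclusions(loc='exclusions.txt')

# Or, pass subjects directly
ML.Load_Exclusions(subjects=['subj_10', 'subj_42'])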

Load_Inclusions

BPt_ML.Load_Inclusions(loc=None, subjects=None, clear_existing=False)

Loads in a set of subjects such that only these subjects can be loaded, and any subject not in the inclusions is dropped, either from a file or as directly passed in.

If multiple inclusions are loaded, the final set of inclusions is computed as the union of all passed inclusions, not the intersection! In this way, inclusions acts more as an iterative whitelist.

Parameters
  • loc (str, Path or None, optional) – Location of a file to load in inclusion subjects from. The file should be formatted as one subject per line. (default = None)

  • subjects (list, set, array-like or None, optional) – An explicit list of subjects to add to inclusions. (default = None)

  • clear_existing (bool, optional) –

    If this parameter is set to True, then any existing loaded inclusions will first be cleared before loading new inclusions!

    Warning

    If any subjects have been dropped from a different place, e.g. targets or data, then simply reloading / clearing the existing inclusions might result in computing a misleading overlap of final valid subjects. Reloading is therefore best done right after loading the original inclusions, or if that is not possible, after restarting the notebook or re-running the script.

    (default = False)

Notes

For best/most reliable performance across all Loading cases, inclusions should be loaded before data, covars and targets.

If default subject id behavior is set to False, reading subjects from an inclusion loc might not function as expected.
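
A small sketch of the union behavior described above (subject names are hypothetical):

# Two calls: the final inclusion set is the UNION, not the intersection
ML.Load_Inclusions(subjects=['subj_1', 'subj_2'])
ML.Load_Inclusions(subjects=['subj_2', 'subj_3'])
# Inclusions now cover subj_1, subj_2 and subj_3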

Load_Data

BPt_ML.Load_Data(loc=None, df=None, dataset_type='default', drop_keys=None, inclusion_keys=None, subject_id='default', eventname='default', eventname_col='default', overlap_subjects='default', merge='default', na_values='default', drop_na='default', drop_or_na='default', filter_outlier_percent=None, filter_outlier_std=None, unique_val_drop=None, unique_val_warn=0.05, drop_col_duplicates=None, clear_existing=False, ext=None)

Class method for loading ROI-style data, assuming all loaded columns are continuous / float datatype.

Parameters
  • loc (str Path, list of or None, optional) –

    The location of the file to load data from. If passed a list, then will load each loc in the list, and will assume them all to be of the same dataset_type if one dataset_type is passed, or if they differ in type, a list must be passed to dataset_type with the different types in order.

    Note: some proc will be done on each loaded dataset before merging with the rest (duplicate subjects, proc for eventname, etc…), but other dataset loading behavior won’t occur until after the merge, e.g., dropping cols by key, filtering for outliers, etc…

    (default = None)

  • df (pandas DataFrame or None, optional) –

    This parameter represents the option for the user to pass in a raw custom dataframe. A loc and/or a df must be passed.

    When passing a raw DataFrame, the loc and dataset_type params will be ignored, as those are for loading data from a file. Otherwise, it will be treated the same as if loading from a file, which means there should be a column within the passed dataframe with subject_id, and e.g. if eventname params are passed, they will be applied along with any other proc specified.

    (default = None)

  • dataset_type ({'basic', 'explorer', 'custom'}, optional) –

    The dataset_type / file-type to load from. Dataset types are,

    • ’basic’

      ABCD2p0NDA style (.txt and tab separated). Typically the default columns, and therefore not neuroimaging data, will be dropped, not including the eventname column.

    • ’explorer’

      2.0_ABCD_Data_Explorer style (.csv and comma separated). The first 2 columns before self.subject_id (typically the default columns, and therefore not neuroimaging data - also not including the eventname column) will be dropped.

    • ’custom’

      A user-defined custom dataset. Right now this is only supported as a comma separated file, with the subject names in a column called self.subject_id, and optionally an ‘eventname’ column. No columns will be dropped (except eventname), unless specific drop keys are passed.

    If loading multiple locs as a list, dataset_type can be a list with entries corresponding to the datatype of each loc.

    if ‘default’, and not already defined, set to ‘basic’

    default = 'default'
    

  • drop_keys (str, list or None, optional) –

    A list of keys to drop columns by, where if any key is present in a column’s name, then that column will be dropped. If a str, then the same behavior, just with one key. (Note: if a name mapping exists, this drop step will be conducted after renaming)

    (default = None)

  • inclusion_keys (str, list or None, optional) –

    A list of keys such that a loaded data column is only kept if ANY of the passed inclusion_keys are present within that column name.

    If passed along with drop_keys, it will be processed second.

    (Note: if a name mapping exists, this drop step will be conducted after renaming)

    (default = None)

  • subject_id (str, optional) –

    The name of the column with unique subject ids in the dataset. For default ABCD datasets this is ‘src_subject_id’, but a user loading a different dataset need only change this accordingly (in addition to, most likely, setting eventname to None and use_abcd_subject_ids to False).

    if ‘default’, and not already defined, set to ‘src_subject_id’.

    default = 'default'
    

  • eventname (value, list of values or None, optional) –

    Optional value to provide, specifying to optionally keep certain rows when reading data based on the eventname flag, where eventname is the value and eventname_col is the name of the column containing it.

    If a list of values is passed, then a row will be kept if that row’s value within the eventname_col is equal to ANY of the passed eventname values.

    As ABCD is a longitudinal study, this flag lets you select only one specific time point, or if set to None, will load everything.

    For selecting only baseline imaging data, one might consider setting this param to ‘baseline_year_1_arm_1’.

    if ‘default’, and not already defined, set to None. (default = ‘default’)

  • eventname_col (str or None, optional) –

    If an eventname is provided, this param refers to the column name containing the eventname. This could also be used along with eventname to be set to any arbitrary value, in order to perform selection by specific column value.

    Note: The eventname col is dropped after proc’ed!

    if ‘default’, and not already defined, set to ‘eventname’ (default = ‘default’)

  • overlap_subjects (bool, optional) –

    This parameter dictates, when loading data, covars, targets or strat (after initial basic proc and/or merge w/ other passed locs), whether the loaded data should be restricted to only the overlapping subjects from previously loaded data, targets, covars or strat - important when performing intermediate proc. If False, then all subjects will be kept throughout the rest of the optional processing - and only merged at the end AFTER processing has been done.

    Note: Inclusions and Exclusions are always applied regardless of this parameter.

    if ‘default’, and not already defined, set to False (default = ‘default’)

  • merge ({'inner' or 'outer'}) –

    Similar to overlap_subjects, this parameter controls the merge behavior between different df’s, i.e., when calling Load_Data twice, a local dataframe is merged with the class self.data on the second call. There are two behaviors that make sense here: ‘inner’, which takes only the overlapping subjects from each dataframe, and ‘outer’, which keeps all subjects from both and sets any missing subjects’ values to NaN.

    if ‘default’, and not already defined, set to ‘inner’ (default = ‘default’)

  • na_values (list, optional) –

    Additional values to treat as NaN; by default the ABCD specific values ‘777’ and ‘999’ are treated as NaN, along with those set as default by pandas’ ‘read_csv’ function. Note: if new values are passed here, they will override the default ‘777’ and ‘999’ NaN values, so if it is desired to keep these, they should be passed explicitly, along with any new values.

    if ‘default’, and not already defined, set to [‘777’, ‘999’] (default = ‘default’)

  • drop_na (bool, int, float or 'default', optional) –

    This setting sets the value for drop_na, which is used when loading data and covars only!

    If set to True, then will drop any row within the loaded data if there are any NaN! If False, then will not drop any rows for missing values.

    If an int or float, then this means some NaN entries will potentially be preserved! Missing data imputation will therefore be required later on!

    If an int > 1, then will drop any row with more than drop_na NaN values. If a float, will determine the drop threshold as a percentage of the possible values, where 1 would not drop any rows as it would require the number of columns + 1 NaN, and .5 would require that more than half the column entries are NaN in order to drop that row.

    if ‘default’, and not already defined, set to True (default = ‘default’)

  • drop_or_na ({'drop', 'na'}, optional) –

    This setting sets the value for drop_or_na, which is used when loading data and covars only!

    filter_outlier_percent, or when loading a binary variable in load covars and more than two classes are present, are both instances where rows/subjects are by default dropped. If drop_or_na is set to ‘na’, then these values will instead be set to NaN rather than the whole row being dropped!

    Otherwise, if left as the default value of ‘drop’, rows will be dropped!

    if ‘default’, and not already defined, set to ‘drop’ (default = ‘default’)

  • filter_outlier_percent (int, float, tuple or None, optional) –

    For float data only. A percent of values to exclude from either end of the distribution, provided as either 1 number, or a tuple (% from lower, % from higher). Set filter_outlier_percent to None for no filtering. If over 1, then treated as a percent; if under 1, then used directly as a proportion.

    If drop_or_na == ‘drop’, then all rows/subjects with >= 1 value(s) found outside of the percent will be dropped. Otherwise, if drop_or_na = ‘na’, then any outside values will be set to NaN.

    (default = None)

  • filter_outlier_std (int, float, tuple or None, optional) –

    For float data only. Determines outliers as data points within each column whose value is less than the mean of the column - filter_outlier_std[0] * the standard deviation of the column, or greater than the mean of the column + filter_outlier_std[1] * the standard deviation of the column.

    If a single number is passed, that number is applied to both the lower and upper range. If a tuple with None on one side is passed, e.g. (None, 3), then nothing will be taken off of that lower or upper bound.

    If drop_or_na == ‘drop’, then all rows/subjects with >= 1 value(s) found will be dropped. Otherwise, if drop_or_na = ‘na’, then any outside values will be set to NaN.

    (default = None)

  • unique_val_drop (int, float or None, optional) –

    This parameter allows you to drop columns within loaded data where there are under a certain threshold of unique values.

    The threshold is determined by the passed value: either a float for a percentage of the data, e.g., computed as unique_val_drop * len(data), or if passed a number greater than 1, then that number directly, where any column with fewer unique values than this threshold will be dropped.

    (default = None)

  • unique_val_warn (int or float, optional) –

    This parameter is similar to unique_val_drop, but only warns about columns with under the threshold (see unique_val_drop for how the threshold is computed) unique vals.

    (default = .05)

  • drop_col_duplicates (float or None/False, optional) –

    If set to None, will not drop any. If a float, then pass a value between 0 and 1, where if two columns within data are correlated >= to this threshold, the second column is removed.

    A value of 1 will instead make a quicker direct ==’s comparison.

    Note: This param just drops duplicates within the just-loaded data. You can call self.Drop_Data_Duplicates() to drop duplicates across all loaded data.

    Be advised, this functionality runs rather slowly when there are ~500+ columns to compare!

    (default = None)

  • clear_existing (bool, optional) –

    If this parameter is set to True, then any existing loaded data will first be cleared before loading new data!

    Warning

    If any subjects have been dropped from a different place, e.g. targets, then simply reloading / clearing the existing data might result in computing a misleading overlap of final valid subjects. Reloading is therefore best done right after loading the original data, or if that is not possible, after restarting the notebook or re-running the script.

    (default = False)

  • ext (None or str, optional) –

    Optional fixed extension to append to all loaded col names, leave as None to ignore this param. Note: applied after name mapping.

    (default = None)
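
A sketch of a typical call, assuming a hypothetical ABCD-style ROI file named 'brain_rois.txt':

ML.Load_Data(loc='brain_rois.txt',
             dataset_type='basic',
             drop_keys=['_qc'],          # drop any column containing '_qc'
             filter_outlier_std=(3, 3),  # filter values > 3 std from the mean
             ext='_roi')                 # appended to every loaded column name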

Load_Data_Files

BPt_ML.Load_Data_Files(loc=None, df=None, files=None, file_to_subject=None, load_func=<function load>, dataset_type='default', drop_keys=None, inclusion_keys=None, subject_id='default', eventname='default', eventname_col='default', overlap_subjects='default', merge='default', reduce_func=<function mean>, filter_outlier_percent=None, filter_outlier_std=None, clear_existing=False, ext=None)

Class method for loading in data as file paths, where the file paths correspond to some sort of raw data which should only actually be loaded / proc’ed within modelling. The further assumption is that these files represent ‘Data’ in the same sense that Load_Data() represents data, where once loaded / proc’ed (see Loaders), the outputted features should be continuous / float datatype.

Parameters
  • loc (str Path, list of or None, optional) –

    The location of the file to load data from. If passed a list, then will load each loc in the list, and will assume them all to be of the same dataset_type if one dataset_type is passed, or if they differ in type, a list must be passed to dataset_type with the different types in order.

    Note: some proc will be done on each loaded dataset before merging with the rest (duplicate subjects, proc for eventname, etc…), but other dataset loading behavior won’t occur until after the merge, e.g., dropping cols by key, filtering for outliers, etc…

    (default = None)

  • df (pandas DataFrame or None, optional) –

    This parameter represents the option for the user to pass in a raw custom dataframe. A loc and/or a df must be passed.

    When passing a raw DataFrame, the loc and dataset_type params will be ignored, as those are for loading data from a file. Otherwise, it will be treated the same as if loading from a file, which means there should be a column within the passed dataframe with subject_id, and e.g. if eventname params are passed, they will be applied along with any other proc specified.

    (default = None)

  • files (dict, optional) –

    Another alternative for specifying files to load can be done by passing a dict to this param.

    Warning: This option right now only works if all files to load are the same across each subject, e.g., no missing data for one modality. This will hopefully be fixed in the future, or at least provide a better warning!

    Specifically, a python dictionary should be passed where each key refers to the name of that feature / column of data files to load, and the value is a python list, or array-like of str file paths.

    You must also pass a python function to the file_to_subject param, which specifies how to convert from passed file path, to a subject name.

    E.g., consider the example below, where 2 subjects’ files are loaded for ‘feat1’ and ‘feat2’:

    files = {'feat1': ['f1/subj_0.npy', 'f1/subj_1.npy'],
             'feat2': ['f2/subj_0.npy', 'f2/subj_1.npy']}
    
    def file_to_subject_func(file):
        # e.g., 'f1/subj_0.npy' -> 'subj_0'
        subject = file.split('/')[1].replace('.npy', '')
        return subject
    
    file_to_subject = file_to_subject_func
    # or
    file_to_subject = {'feat1': file_to_subject_func,
                       'feat2': file_to_subject_func}
    

    In this example, subjects are loaded as ‘subj_0’ and ‘subj_1’, and they have associated loaded data files ‘feat1’ and ‘feat2’.

    default = None
    

  • file_to_subject (python function or dict of, optional) –

    If files is passed, then you also need to specify a function which takes in a file path and returns the relevant subject for that file path. If just one function is passed, it will be used to load all dictionary entries; alternatively, you can pass a matching dictionary of funcs, allowing for different funcs for each feature to load.

    See the example in param files.

    default = None
    

  • load_func (python function, optional) –

    Data_Files represent a path to a saved file, which means you must also provide some information on how to load the saved file. This parameter is where that loading function should be passed. The passed load_func will be used on each Data_File individually and whatever the output of the function is will be passed to loaders directly in modelling.

    You might need to pass a user defined custom function in some cases, e.g., you want to use np.load, but then also np.stack. Just wrap those two functions in one, and pass the new function.

    (default = np.load)

  • dataset_type ({'basic', 'explorer', 'custom'}, optional) –

    The dataset_type / file-type to load from. Dataset types are,

    • ’basic’

      ABCD2p0NDA style (.txt and tab separated). Typically the default columns, and therefore not neuroimaging data, will be dropped, not including the eventname column.

    • ’explorer’

      2.0_ABCD_Data_Explorer style (.csv and comma separated). The first 2 columns before self.subject_id (typically the default columns, and therefore not neuroimaging data - also not including the eventname column) will be dropped.

    • ’custom’

      A user-defined custom dataset. Right now this is only supported as a comma separated file, with the subject names in a column called self.subject_id, and optionally an ‘eventname’ column. No columns will be dropped (except eventname), unless specific drop keys are passed.

    If loading multiple locs as a list, dataset_type can be a list with entries corresponding to the datatype of each loc.

    if ‘default’, and not already defined, set to ‘basic’

    default = 'default'
    

  • drop_keys (str, list or None, optional) –

    A list of keys to drop columns by, where if any key is present in a column’s name, then that column will be dropped. If a str, then the same behavior, just with one key. (Note: if a name mapping exists, this drop step will be conducted after renaming)

    (default = None)

  • inclusion_keys (str, list or None, optional) –

    A list of keys such that a loaded data column is only kept if ANY of the passed inclusion_keys are present within that column name.

    If passed along with drop_keys, it will be processed second.

    (Note: if a name mapping exists, this drop step will be conducted after renaming)

    (default = None)

  • subject_id (str, optional) –

    The name of the column with unique subject ids in the dataset. For default ABCD datasets this is ‘src_subject_id’, but a user loading a different dataset need only change this accordingly (in addition to, most likely, setting eventname to None and use_abcd_subject_ids to False).

    if ‘default’, and not already defined, set to ‘src_subject_id’.

    default = 'default'
    

  • eventname (value, list of values or None, optional) –

    Optional value to provide, specifying to optionally keep certain rows when reading data based on the eventname flag, where eventname is the value and eventname_col is the name of the column containing it.

    If a list of values is passed, then a row will be kept if that row’s value within the eventname_col is equal to ANY of the passed eventname values.

    As ABCD is a longitudinal study, this flag lets you select only one specific time point, or if set to None, will load everything.

    For selecting only baseline imaging data, one might consider setting this param to ‘baseline_year_1_arm_1’.

    if ‘default’, and not already defined, set to None. (default = ‘default’)

  • eventname_col (str or None, optional) –

    If an eventname is provided, this param refers to the column name containing the eventname. This could also be used along with eventname to be set to any arbitrary value, in order to perform selection by specific column value.

    Note: The eventname col is dropped after proc’ed!

    if ‘default’, and not already defined, set to ‘eventname’ (default = ‘default’)

  • overlap_subjects (bool, optional) –

    This parameter dictates, when loading data, covars, targets or strat (after initial basic proc and/or merge w/ other passed locs), whether the loaded data should be restricted to only the overlapping subjects from previously loaded data, targets, covars or strat - important when performing intermediate proc. If False, then all subjects will be kept throughout the rest of the optional processing - and only merged at the end AFTER processing has been done.

    Note: Inclusions and Exclusions are always applied regardless of this parameter.

    if ‘default’, and not already defined, set to False (default = ‘default’)

  • merge ({'inner' or 'outer'}) –

    Similar to overlap_subjects, this parameter controls the merge behavior between different df’s, i.e., when calling Load_Data twice, a local dataframe is merged with the class self.data on the second call. There are two behaviors that make sense here: ‘inner’, which takes only the overlapping subjects from each dataframe, and ‘outer’, which keeps all subjects from both and sets any missing subjects’ values to NaN.

    if ‘default’, and not already defined, set to ‘inner’ (default = ‘default’)

  • reduce_func (python function or list of, optional) –

    This function is used if either filter_outlier_percent or filter_outlier_std is requested.

    The passed python function should reduce the file, once loaded, to one number, making it compatible with the different filtering strategies. For example, the default function is just to take the mean of each loaded file, and to compute outlier detection on the mean.

    You may also pass a list to reduce_func, where each entry of the list is a single reduce func. In this case outlier filtering will be computed on each reduce_func separately, and the union of all subjects marked as outliers will be dropped at the end.

    default = np.mean
    

  • filter_outlier_percent (int, float, tuple or None, optional) –

    For float data only. A percent of values to exclude from either end of the distribution, provided as either 1 number, or a tuple (% from lower, % from higher). Set filter_outlier_percent to None for no filtering. If over 1, then treated as a percent; if under 1, then used directly as a proportion.

    If drop_or_na == ‘drop’, then all rows/subjects with >= 1 value(s) found outside of the percent will be dropped. Otherwise, if drop_or_na = ‘na’, then any outside values will be set to NaN.

    (default = None)

  • filter_outlier_std (int, float, tuple or None, optional) –

    For float data only. Determines outliers as data points within each column whose value is less than the mean of the column - filter_outlier_std[0] * the standard deviation of the column, or greater than the mean of the column + filter_outlier_std[1] * the standard deviation of the column.

    If a single number is passed, that number is applied to both the lower and upper range. If a tuple with None on one side is passed, e.g. (None, 3), then nothing will be taken off of that lower or upper bound.

    If drop_or_na == ‘drop’, then all rows/subjects with >= 1 value(s) found will be dropped. Otherwise, if drop_or_na = ‘na’, then any outside values will be set to NaN.

    (default = None)

  • clear_existing (bool, optional) –

    If this parameter is set to True, then any existing loaded data will first be cleared before loading new data!

    Warning

    If any subjects have been dropped from a different place, e.g. targets, then simply reloading / clearing the existing data might result in computing a misleading overlap of final valid subjects. Reloading is therefore best done right after loading the original data, or if that is not possible, after restarting the notebook or re-running the script.

    (default = False)

  • ext (None or str, optional) –

    Optional fixed extension to append to all loaded col names, leave as None to ignore this param. Note: applied after name mapping.

    (default = None)
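
A sketch following the files example above, with hypothetical per-subject .npy paths:

import numpy as np

files = {'timeseries': ['ts/subj_0.npy', 'ts/subj_1.npy']}

def file_to_subject(file):
    # e.g., 'ts/subj_0.npy' -> 'subj_0'
    return file.split('/')[-1].replace('.npy', '')

ML.Load_Data_Files(files=files,
                   file_to_subject=file_to_subject,
                   load_func=np.load,    # how each saved file is read
                   reduce_func=np.mean)  # reduce each file to 1 number for filtering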

Drop_Data_Cols

BPt_ML.Drop_Data_Cols(drop_keys=None, inclusion_keys=None)

Function to drop columns within loaded data by drop_keys or inclusion_keys.

Parameters
  • drop_keys (str, list or None, optional) –

    A list of keys to drop columns within loaded data by, where if ANY key is present in a column’s name, then that column will be dropped. If a str, then the same behavior, just with one key.

    If passed along with inclusion_keys will be processed first.

    (Note: if a name mapping exists, this drop step will be conducted after renaming)

    (default = None)

  • inclusion_keys (str, list or None, optional) –

    A list of keys such that a loaded data column is only kept if ANY of the passed inclusion_keys are present within that column name.

    If passed along with drop_keys, it will be processed second.

    (Note: if a name mapping exists, this drop step will be conducted after renaming)

    (default = None)
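
For example (the keys here are hypothetical):

# First drop any column containing '_qc', then keep only columns
# containing '_thick'
ML.Drop_Data_Cols(drop_keys=['_qc'], inclusion_keys=['_thick'])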

Filter_Data_Cols

BPt_ML.Filter_Data_Cols(filter_outlier_percent=None, filter_outlier_std=None, overlap_subjects='default', drop_or_na='default')

Perform filtering on all loaded data based on an outlier percent, either dropping outlier rows or setting specific outliers to NaN.

Note, if overlap_subjects is set to True here, only the overlap will be saved after proc within self.data.

Parameters
  • filter_outlier_percent (int, float, tuple or None) –

    For float data only. A percent of values to exclude from either end of the distribution, provided as either 1 number, or a tuple (% from lower, % from higher). Set filter_outlier_percent to None for no filtering. If over 1, then treated as a percent; if under 1, then used directly as a proportion.

    If drop_or_na == ‘drop’, then all rows/subjects with >= 1 value(s) found outside of the percent will be dropped. Otherwise, if drop_or_na = ‘na’, then any outside values will be set to NaN.

    (default = None)

  • filter_outlier_std (int, float, tuple or None, optional) –

    For float data only. Determines outliers as data points within each column whose value is less than the mean of the column - filter_outlier_std[0] * the standard deviation of the column, or greater than the mean of the column + filter_outlier_std[1] * the standard deviation of the column.

    If a single number is passed, that number is applied to both the lower and upper range. If a tuple with None on one side is passed, e.g. (None, 3), then nothing will be taken off that lower or upper bound.

    (default = None)

  • overlap_subjects (bool, optional) –

    This parameter dictates, when loading data, covars, targets or strat (after initial basic proc and/or merge w/ other passed locs), whether the loaded data should be restricted to only the overlapping subjects from previously loaded data, targets, covars or strat - important when performing intermediate proc. If False, then all subjects will be kept throughout the rest of the optional processing - and only merged at the end AFTER processing has been done.

    Note: Inclusions and Exclusions are always applied regardless of this parameter.

    if ‘default’, and not already defined, set to False (default = ‘default’)

  • drop_or_na ({'drop', 'na'}, optional) –

    This setting sets the value for drop_or_na, which is used when loading data and covars only!

    filter_outlier_percent, or when loading a binary variable in load covars and more than two classes are present, are both instances where rows/subjects are by default dropped. If drop_or_na is set to ‘na’, then these values will instead be set to NaN rather than the whole row being dropped!

    Otherwise, if left as the default value of ‘drop’, rows will be dropped!

    if ‘default’, and not already defined, set to ‘drop’ (default = ‘default’)
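
For example, a sketch that treats the most extreme 1% of values in each column as outliers (values under 1 are used directly as a proportion), setting them to NaN rather than dropping rows:

ML.Filter_Data_Cols(filter_outlier_percent=.01, drop_or_na='na')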

Proc_Data_Unique_Cols

BPt_ML.Proc_Data_Unique_Cols(unique_val_drop=None, unique_val_warn=0.05, overlap_subjects='default')

This function performs processing on all loaded data based on the number of unique values loaded within each column (allowing users to drop or warn!).

Note, if overlap_subjects is set to True here, only the overlap will be saved after proc within self.data.

Parameters
  • unique_val_drop (int, float or None, optional) –

    This parameter allows you to drop columns within loaded data where there are under a certain threshold of unique values.

    The threshold is determined by the passed value: either a float for a percentage of the data, e.g., computed as unique_val_drop * len(data), or if passed a number greater than 1, then that number directly, where any column with fewer unique values than this threshold will be dropped.

    (default = None)

  • unique_val_warn (int or float, optional) –

    This parameter is similar to unique_val_drop, but only warns about columns with under the threshold (see unique_val_drop for how the threshold is computed) unique vals.

    (default = .05)

  • overlap_subjects (bool, optional) –

    This parameter dictates, when loading data, covars, targets or strat (after initial basic proc and/or merge w/ other passed locs), whether the loaded data should be restricted to only the overlapping subjects from previously loaded data, targets, covars or strat - important when performing intermediate proc. If False, then all subjects will be kept throughout the rest of the optional processing - and only merged at the end AFTER processing has been done.

    Note: Inclusions and Exclusions are always applied regardless of this parameter.

    if ‘default’, and not already defined, set to False (default = ‘default’)
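
A sketch of the threshold logic: with, say, 1000 loaded rows, a float unique_val_drop of .05 corresponds to 1000 * .05 = 50, so any column with fewer than 50 unique values would be dropped (the numbers here are illustrative):

# Drop columns under a 5% unique-value threshold, warn under 10%
ML.Proc_Data_Unique_Cols(unique_val_drop=.05, unique_val_warn=.1)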

Drop_Data_Duplicates

BPt_ML.Drop_Data_Duplicates(corr_thresh, overlap_subjects='default')

Drop duplicate columns within self.data based on whether two data columns are correlated >= a certain correlation threshold.

Note, if overlap_subjects is set to True here, only the overlap will be saved after proc within self.data.

Parameters
  • corr_thresh (float) –

    A value between 0 and 1, where if two columns within self.data are correlated >= to corr_thresh, the second column is removed.

    A value of 1 will instead make a quicker direct ==’s comparison.

  • overlap_subjects (bool, optional) –

    This parameter dictates, when loading data, covars, targets or strat (after initial basic proc and/or merge w/ other passed locs), whether the loaded data should be restricted to only the overlapping subjects from previously loaded data, targets, covars or strat - important when performing intermediate proc. If False, then all subjects will be kept throughout the rest of the optional processing - and only merged at the end AFTER processing has been done.

    Note: Inclusions and Exclusions are always applied regardless of this parameter.

    if ‘default’, and not already defined, set to False (default = ‘default’)
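
For example:

# Remove the second of any pair of columns correlated at >= .95;
# corr_thresh=1 would instead use the faster direct =='s comparison
ML.Drop_Data_Duplicates(corr_thresh=.95)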

Show_Data_Dist

BPt_ML.Show_Data_Dist(data_subset='SHOW_ALL', num_feats=20, feats='random', reduce_func=None, frame_interval=500, plot_type='hist', show_only_overlap=True, subjects=None, save=True, dpi='default', save_name='data distribution', random_state='default', return_anim=False)

This method displays some summary statistics about the loaded data, as well as plots the distribution if possible.

Note: to display loaded data files, pass a function to reduce_func, otherwise they will not be displayed.

Parameters
  • data_subset ('SHOW_ALL' or array-like, optional) –

    ‘SHOW_ALL’ is reserved for showing the distributions of loaded data. You may also pass a list/array-like to specify a custom subset of features to show.

    If self.all_data is already prepared, this data subset can also include any float type features loaded as covar or target.

    default = 'SHOW_ALL'
    

  • num_feats (int, optional) –

    The number of features whose distributions to view. Note: If too many are selected it may take a long time to render and/or consume a lot of memory!

    default = 20
    

  • feats ({'random', 'skew'}, optional) –

    The features to display; if ‘random’ then will select num_feats random features to display. If ‘skew’, will show the top num_feats features by absolute skew.

    If ‘skew’ and subjects == ‘both’, will compute the top skewed features based on the training set.

    default = 'random'
    

  • reduce_func (python function or list of, optional) –

    If a function is passed here, then data files will be loaded and reduced to 1 number according to the passed function. For example, the default function is just to take the mean of each loaded file, and to compute outlier detection on the mean.

    To not display data files, if any, just keep reduce_func as None.

    default = None
    

  • frame_interval (int, optional) –

    The number of milliseconds between each frame.

    default = 500
    

  • plot_type ({'bar', 'hist', 'kde'}) –

    The type of base seaborn plot to generate for each datapoint. Either ‘bar’ for barplot, ‘hist’ for seaborn’s dist plot, or ‘kde’ for just a kernel density estimate plot.

    default = 'hist'
    

  • show_only_overlap (bool, optional) –

    If True, then displays only the distributions for valid overlapping subjects across data, covars, etc… otherwise, if False, shows the current loaded distribution as is.

    If subjects is set (anything but None), this param will be ignored.

    default = True
    

  • subjects (None, 'train', 'test', 'both' or array-like, optional) –

    If not None, then plot only the subjects loaded as train subjects, or as test subjects, or you can pass a custom list or array-like of subjects.

    If ‘both’, then will plot the train and test distributions separately. Note: This only works for plot_type == ‘hist’ or ‘kde’. Also note that specifying ‘both’ will show somewhat different information than the default settings.

    default = None
    

  • save (bool, optional) –

    If the animation should be saved as a gif, True or False.

    default = True
    

  • dpi (int, 'default', optional) –

    The dpi in which to save the distribution gif. If ‘default’ use the class default value.

    default = 'default'
    

  • save_name (str, optional) –

    The name under which the gif should be saved.

    default = 'data distribution'
    

  • random_state ('default', int or None) –

    The random state used to choose random features. If ‘default’, use the class defined value; otherwise, set to the value passed. None for random.

    default = 'default'
    

  • return_anim (bool, optional) –

    If True, return just the animation

    default = False
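
A sketch of a typical call, assuming any loaded data files should be reduced by their mean for display:

import numpy as np

ML.Show_Data_Dist(num_feats=10,
                  feats='skew',          # show the 10 most skewed features
                  reduce_func=np.mean,   # required to display loaded data files
                  plot_type='kde',
                  save_name='my data dist')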
    

Load_Targets

BPt_ML.Load_Targets(loc=None, df=None, col_name=None, data_type=None, dataset_type='default', subject_id='default', eventname='default', eventname_col='default', overlap_subjects='default', merge='default', na_values='default', drop_na='default', drop_or_na='default', filter_outlier_percent=None, filter_outlier_std=None, categorical_drop_percent=None, float_bins=10, float_bin_strategy='uniform', clear_existing=False, ext=None)

Loads in targets, the outcome / variable(s) to predict.

Parameters
  • loc (str, Path or None, optional) –

    The location of the file to load targets from.

    Either loc or df must be set, but they cannot both be set!

    (default = None)

  • df (pandas DataFrame or None, optional) –

    This parameter represents the option for the user to pass in a raw custom dataframe. A loc and/or a df must be passed.

    When passing a raw DataFrame, the loc and dataset_type params will be ignored, as those are for loading from a file. Otherwise, it will be treated the same as if loading from a file, which means there should be a column within the passed dataframe with subject_id, and e.g. if eventname params are passed, they will be applied along with any other proc specified.

    Either loc or df must be set, but they cannot both be set!

  • col_name (str, list, optional) –

    The name(s) of the column(s) to load.

    Note: Must be in the same order as data types passed in. (default = None)

  • data_type ({'b', 'c', 'f', 'f2c', 'a'}, optional) –

    The data types of the different columns to load, in the same order as the column names passed in. Shorthands for datatypes can be used as well.

    If a list is passed to col_name, then you can either supply one data_type to be applied to all passed cols, or a list with corresponding data types by index for each col_name passed.

    • ’binary’ or ‘b’

      Binary input

    • ’categorical’ or ‘c’

      Categorical input

    • ’float’ or ‘f’

      Float numerical input

    • ’float_to_cat’, ‘f2c’, ‘float_to_bin’ or ‘f2b’

      This specifies that the data should be loaded initially as float, then discretized into a binned categorical feature.

    • ’auto’ or ‘a’

      This specifies that the type should be automatically inferred. Current inference rules are: if 2 unique non-nan categories then binary, if pandas datatype category, then categorical, otherwise float.

    Datatypes are explained further in Notes.

    (default = None)

  • dataset_type ({'basic', 'explorer', 'custom'}, optional) –

    The dataset_type / file-type to load from. Dataset types are,

    • ’basic’

      ABCD2p0NDA style (.txt and tab separated). Typically the default columns, and therefore not neuroimaging data, will be dropped, not including the eventname column.

    • ’explorer’

      2.0_ABCD_Data_Explorer style (.csv and comma separated). The first 2 columns before self.subject_id (typically the default columns, and therefore not neuroimaging data - also not including the eventname column) will be dropped.

    • ’custom’

      A user-defined custom dataset. Right now this is only supported as a comma separated file, with the subject names in a column called self.subject_id, and optionally an ‘eventname’ column. No columns will be dropped (except eventname), unless specific drop keys are passed.

    If loading multiple locs as a list, dataset_type can be a list with entries corresponding to the datatype of each loc.

    if ‘default’, and not already defined, set to ‘basic’

    default = 'default'
    

  • subject_id (str, optional) –

    The name of the column with unique subject ids in the dataset. For default ABCD datasets this is ‘src_subject_id’, but a user loading a different dataset need only change this accordingly (in addition to, most likely, setting eventname to None and use_abcd_subject_ids to False).

    if ‘default’, and not already defined, set to ‘src_subject_id’.

    default = 'default'
    

  • eventname (value, list of values or None, optional) –

    Optional value to provide, specifying to optional keep certain rows when reading data based on the eventname flag, where eventname is the value and eventname_col is the name of the value.

    If a list of values are passed, then it will be treated as keeping a row if that row’s value within the eventname_col is equal to ANY of the passed eventname values.

    As ABCD is a longitudinal study, this flag lets you select only one specific time point, or if set to None, will load everything.

    For selecting only baseline imaging data one might consider setting this param to ‘baseline_year_1_arm_1’.

    if ‘default’, and not already defined, set to None. (default = ‘default’)

  • eventname_col (str or None, optional) –

    If an eventname is provided, this param refers to the column name containing the eventname. This can also be used, with eventname set to any arbitrary value, in order to perform selection by a specific column value.

    Note: The eventname col is dropped after it is processed!

    if ‘default’, and not already defined, set to ‘eventname’ (default = ‘default’)

  • overlap_subjects (bool, optional) –

    This parameter dictates, when loading data, covars, targets or strat (after initial basic proc and/or merge w/ other passed locs), if the loaded data should be restricted to only the overlapping subjects from previously loaded data, targets, covars or strat - important when performing intermediate proc. If False, then all subjects will be kept throughout the rest of the optional processing - and only merged at the end, AFTER processing has been done.

    Note: Inclusions and Exclusions are always applied regardless of this parameter.

    if ‘default’, and not already defined, set to False (default = ‘default’)

  • merge ({'inner' or 'outer'}) –

    Similar to overlap_subjects, this parameter controls the merge behavior between different dfs, i.e., when calling Load_Data twice, a local dataframe is merged with the class self.data on the second call. There are two behaviors that make sense here: ‘inner’, which takes only the overlapping subjects from each dataframe, and ‘outer’, which keeps all subjects from both and sets any missing subjects’ values to NaN.

    if ‘default’, and not already defined, set to ‘inner’ (default = ‘default’)

  • na_values (list, optional) –

    Additional values to treat as NaN. By default, the ABCD specific values ‘777’ and ‘999’ are treated as NaN, along with those treated as NaN by the pandas ‘read_csv’ function. Note: if new values are passed here, they will override the default ‘777’ and ‘999’ NaN values, so if it is desired to keep these, they should be passed explicitly along with any new values.

    if ‘default’, and not already defined, set to [‘777’, ‘999’] (default = ‘default’)

  • drop_na (bool, int, float or 'default', optional) –

    This setting sets the value for drop_na, which is used when loading data and covars only!

    If set to True, then will drop any row within the loaded data if there are any NaN! If False, then will not drop any rows for missing values.

    If an int or float, then this means some NaN entries will potentially be preserved! Missing data imputation will therefore be required later on!

    If an int > 1, then will drop any row with more than drop_na NaN values. If a float, will determine the drop threshold as a percentage of the possible values, where 1 would not drop any rows (as it would require the number of columns + 1 NaN), and .5 would require that more than half the column entries are NaN in order to drop that row.

    if ‘default’, and not already defined, set to True (default = ‘default’)

  • drop_or_na ({'drop', 'na'}, optional) –

    This setting sets the value for drop_or_na, which is used when loading data and covars only!

    filter_outlier_percent, or when loading a binary variable in load covars and more than two classes are present, are both instances where rows/subjects are by default dropped. If drop_or_na is set to ‘na’, then these values will instead be set to NaN, rather than the whole row being dropped!

    Otherwise, if left as default value of ‘drop’, then rows will be dropped!

    if ‘default’, and not already defined, set to ‘drop’ (default = ‘default’)

  • filter_outlier_percent (float, tuple, list of or None, optional) –

    For float datatypes only. A percent of values to exclude from either end of the target distribution, provided as either 1 number or a tuple (% from lower, % from higher). Set filter_outlier_percent to None for no filtering.

    For example, if passed (1, 1), then the bottom 1% and top 1% of the distribution will be dropped, the same as passing 1. Further, if passed (.1, 1), the bottom .1% and top 1% will be removed.

    A list of values can also be passed in the case that multiple col_names / targets are being loaded. In this case, the index should correspond. If a list is not passed then the same value is used for all targets.

    (default = None).

  • filter_outlier_std (int, float, tuple, None or list of, optional) –

    For float datatypes only. Determines outliers as data points within each column (target distribution) whose value is either less than the column mean - filter_outlier_std[0] * the column standard deviation, or greater than the column mean + filter_outlier_std[1] * the column standard deviation.

    If a single number is passed, that number is applied to both the lower and upper range. If a tuple with None on one side is passed, e.g. (None, 3), then nothing will be taken off that lower or upper bound.

    A list of values can also be passed in the case that multiple col_names / covars are being loaded. In this case, the index should correspond. If a list is not passed here, then the same value is used when loading all targets.

    (default = None)

  • categorical_drop_percent (float, list of or None, optional) –

    Optional percentage threshold for dropping categories when loading categorical data. If a float is given, then a category will be dropped if it makes up less than that % of the data points. E.g. if .01 is passed, then any datapoints in a category comprising less than 1% of total valid datapoints are dropped.

    A list of values can also be passed in the case that multiple col_names / targets are being loaded. In this case, the index should correspond. If a list is not passed then the same value is used for all targets.

    (default = None)

  • float_bins (int or list of, optional) –

    If any columns are loaded as ‘float_to_bin’ or ‘f2b’ then input must be discretized into bins. This param controls the number of bins to create. As with other params, if one value is passed, it is applied to all columns, but if different values per loaded column are desired, a list of ints (with corresponding indices) should be passed. For columns that are not specified as ‘f2b’ type, anything can be passed in that list index spot, as it will be ignored.

    (default = 10)

  • float_bin_strategy ({'uniform', 'quantile', 'kmeans'}, optional) –

    If any columns are loaded as ‘float_to_bin’ or ‘f2b’ then input must be discretized into bins. This param controls the strategy used to define the bins. Options are,

    • ’uniform’

      All bins in each feature have identical widths.

    • ’quantile’

      All bins in each feature have the same number of points.

    • ’kmeans’

      Values in each bin have the same nearest center of a 1D k-means cluster.

    As with float_bins, if one value is passed, it is applied to all columns, but if different values per loaded column are desired, a list of choices (with corresponding indices) should be passed.

    (default = ‘uniform’)

  • clear_existing (bool, optional) –

    If this parameter is set to True, then any existing loaded targets will first be cleared before loading new targets!

    Warning

    If any subjects have been dropped from a different place, e.g. covars or data, then simply reloading / clearing existing targets might result in computing a misleading overlap of final valid subjects. Reloading is therefore best used right after loading the original data, or if not possible, after reloading the notebook or re-running the script.

    (default = False)

  • ext (None or str, optional) –

    Optional fixed extension to append to all loaded col names, leave as None to ignore this param. Note: applied after name mapping.

    (default = None)

Notes

Targets can be either ‘binary’, ‘categorical’, or ‘float’,

  • binary

    Targets are read in and label encoded to be 0 or 1. Will also work if passed a column of unique strings, e.g., ‘M’ and ‘F’.

  • categorical

    Targets are treated as taking on one fixed value from a limited set of possible values.

  • float

    Targets are read in as a floating point number, and optionally then filtered.
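
For example, a minimal sketch of a call to this loading function - the file path and column name here are hypothetical placeholders, assuming a comma separated custom file:

# Load one binary target column from a hypothetical custom csv,
# keeping only baseline rows via the eventname param
ML.Load_Targets(loc='my_targets.csv',
                col_name='diagnosis',
                data_type='b',
                dataset_type='custom',
                eventname='baseline_year_1_arm_1')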

Binarize_Target

BPt_ML.Binarize_Target(threshold=None, lower=None, upper=None, target=0, replace=True, merge='outer')

This function binarizes a loaded target variable, assuming that a float type target is loaded; otherwise this function will break!

Parameters
  • threshold (float or None, optional) –

    Single binary threshold, where any value less than the threshold will be set to 0 and any value greater than or equal to the threshold will be set to 1. Leave threshold as None, and use lower and upper instead to ‘cut’ out a chunk of values in the middle.

    (default = None)

  • lower (float or None, optional) –

    Used together with upper to ‘cut’ out a middle chunk of values: any value less than lower will be set to 0, any value greater than upper will be set to 1, and any value >= lower and <= upper will be dropped.

    If a value is set for lower, one cannot be set for threshold, and one must be set for upper.

    (default = None)

  • upper (float or None, optional) –

    Used together with lower, see above: any value greater than upper will be set to 1, any value less than lower will be set to 0, and any value >= lower and <= upper will be dropped.

    If a value is set for upper, one cannot be set for threshold, and one must be set for lower.

    (default = None)

  • target (int or str, optional) –

    The loaded target to binarize. This can be the int index, or the name of the target column. If only one target is loaded, just leave as default.

    (default = 0)

  • replace (bool, optional) –

    If True, then replace the target to be binarized in place, otherwise if False, add the binarized version as a new target.

    (default = True)

  • merge ({'inner' or 'outer'}) –

    This argument is used only when replace is False, and is further relevant only when the upper and lower arguments are passed. If ‘inner’, then drop from the loaded target dataframe any subjects which do not overlap; if ‘outer’, then set any non-overlapping subjects’ data to NaNs.

    (default = ‘outer’)
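
As a rough sketch of both modes - the numeric cutoffs below are arbitrary illustrations, not from the source:

# Single threshold: values < 20 become 0, values >= 20 become 1
ML.Binarize_Target(threshold=20)

# Or cut out a middle chunk instead: values < 10 -> 0, values > 30 -> 1,
# and values from 10 to 30 (inclusive) are dropped
# ML.Binarize_Target(lower=10, upper=30)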

Show_Targets_Dist

BPt_ML.Show_Targets_Dist(targets='SHOW_ALL', cat_show_original_name=True, show_only_overlap=True, subjects=None, show=True, cat_type='Counts', return_display_dfs=False)

This method displays some summary statistics about the loaded targets, as well as plots the distribution if possible.

Parameters
  • targets (str, int or list, optional) –

    The single target (str) or multiple targets (list) for which to display distributions. The str input ‘SHOW_ALL’ is reserved, and set as the default, for showing the distributions of all loaded targets.

    You can also pass the int index of the loaded target to show!

    (default = ‘SHOW_ALL’)

  • cat_show_original_name (bool, optional) –

    If True, then when showing a categorical distribution (or binary) make the distr plot using the original names. Otherwise, use the internally used names.

    (default = True)

  • show_only_overlap (bool, optional) –

    If True, then displays only the distributions for valid overlapping subjects across data, covars, etc… Otherwise, if False, shows the currently loaded distribution as is.

    (default = True)

  • subjects (None, 'train', 'test' or array-like, optional) –

    If None, plot all subjects. If not None, then plot only the subjects loaded as train_subjects, or as test subjects, or you can pass a custom list or array-like of subjects.

    (default = None)

  • show (bool, optional) –

    If True, then plt.show(), the matplotlib command, will be called and the figure displayed. On the other hand, if set to False, then the user can customize the plot as they desire. You can think of plt.show() as clearing all of the loaded settings, so in order to make changes, you can’t call it until you are done.

    (default = True)

  • cat_type ({'Counts', 'Frequency'}, optional) –

    If plotting a categorical variable (binary or categorical), plot the X axis as either by raw count or frequency.

    (default = ‘Counts’)

  • return_display_dfs (bool, optional) –

    Optionally return the display df as a pandas df

    (default = False)
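
For instance, a sketch of plotting just the first loaded target over only train subjects, keeping the summary df - this assumes a train/test split has already been defined:

# Show the first loaded target's distribution for train subjects only,
# and keep the returned summary display DataFrame
display_df = ML.Show_Targets_Dist(targets=0,
                                  subjects='train',
                                  return_display_dfs=True)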

Load_Covars

BPt_ML.Load_Covars(loc=None, df=None, col_name=None, data_type=None, dataset_type='default', subject_id='default', eventname='default', eventname_col='default', overlap_subjects='default', merge='default', na_values='default', drop_na='default', drop_or_na='default', nan_as_class=False, code_categorical_as='depreciated', filter_outlier_percent=None, filter_outlier_std=None, categorical_drop_percent=None, float_bins=10, float_bin_strategy='uniform', clear_existing=False, ext=None)

Load a covariate or covariates (‘covars’ type data).

Parameters
  • loc (str, Path or None, optional) –

    The location of the file to load covariates from.

    Either loc or df must be set, but they both cannot be set!

    (default = None)

  • df (pandas DataFrame or None, optional) –

    This parameter represents the option for the user to pass in a raw custom dataframe. A loc and/or a df must be passed.

    When passing a raw DataFrame, the loc and dataset_type params will be ignored, as those apply only when loading from a file. Otherwise, it will be treated the same as if loading from a file, which means there should be a column within the passed dataframe with subject_id, and e.g. if eventname params are passed, they will be applied along with any other specified processing.

    Either loc or df must be set, but they both cannot be set!

  • col_name (str or list, optional) –

    The name(s) of the column(s) to load.

    Note: Must be in the same order as data types passed in.

    (default = None)

  • data_type ({'b', 'c', 'f', 'm', 'f2c'} or None, optional) –

    The data types of the different columns to load, in the same order as the column names passed in. Shorthands for datatypes can be used as well.

    If a list is passed to col_name, then you can either supply one data_type to be applied to all passed cols, or a list with corresponding data types by index for each col_name passed.

    • ’binary’ or ‘b’

      Binary input

    • ’categorical’ or ‘c’

      Categorical input

    • ’float’ or ‘f’

      Float numerical input

    • ’float_to_cat’, ‘f2c’, ‘float_to_bin’ or ‘f2b’

      This specifies that the data should be loaded initially as float, then discretized into a binned categorical feature.

    • ’multilabel’ or ‘m’

      Multilabel categorical input

    Warning

    If the ‘multilabel’ datatype is specified, then the associated col name should be (and will be assumed to be) a list of columns. For example, if loading multiple covars and one is multilabel, a nested list should be passed to col_name.

    (default = None)

  • dataset_type ({'basic', 'explorer', 'custom'}, optional) –

    The dataset_type / file-type to load from. Dataset types are,

    • ’basic’

      ABCD2p0NDA style (.txt, tab separated). Typically the default columns (and therefore not neuroimaging data) will be dropped, not including the eventname column.

    • ’explorer’

      2.0_ABCD_Data_Explorer style (.csv, comma separated). The first 2 columns before self.subject_id (typically the default columns, and therefore not neuroimaging data - also not including the eventname column) will be dropped.

    • ’custom’

      A user-defined custom dataset. Right now this is only supported as a comma separated file, with the subject names in a column called self.subject_id, and optionally an ‘eventname’ column. No columns will be dropped (except eventname), unless specific drop keys are passed.

    If loading multiple locs as a list, dataset_type can be a list with indices corresponding to the dataset type of each loc.

    if ‘default’, and not already defined, set to ‘basic’

    default = 'default'
    

  • subject_id (str, optional) –

    The name of the column with unique subject ids in the dataset. For default ABCD datasets this is ‘src_subject_id’, but if a user wanted to load and work with a different dataset, they just need to change this accordingly (in addition to, most likely, setting eventname to None and use_abcd_subject_ids to False).

    if ‘default’, and not already defined, set to ‘src_subject_id’.

    default = 'default'
    

  • eventname (value, list of values or None, optional) –

    Optional value specifying to keep only certain rows when reading data, based on the eventname flag, where eventname is the value to match and eventname_col is the name of the column containing it.

    If a list of values are passed, then it will be treated as keeping a row if that row’s value within the eventname_col is equal to ANY of the passed eventname values.

    As ABCD is a longitudinal study, this flag lets you select only one specific time point, or if set to None, will load everything.

    For selecting only baseline imaging data one might consider setting this param to ‘baseline_year_1_arm_1’.

    if ‘default’, and not already defined, set to None. (default = ‘default’)

  • eventname_col (str or None, optional) –

    If an eventname is provided, this param refers to the column name containing the eventname. This can also be used, with eventname set to any arbitrary value, in order to perform selection by a specific column value.

    Note: The eventname col is dropped after it is processed!

    if ‘default’, and not already defined, set to ‘eventname’ (default = ‘default’)

  • overlap_subjects (bool, optional) –

    This parameter dictates, when loading data, covars, targets or strat (after initial basic proc and/or merge w/ other passed locs), if the loaded data should be restricted to only the overlapping subjects from previously loaded data, targets, covars or strat - important when performing intermediate proc. If False, then all subjects will be kept throughout the rest of the optional processing - and only merged at the end, AFTER processing has been done.

    Note: Inclusions and Exclusions are always applied regardless of this parameter.

    if ‘default’, and not already defined, set to False (default = ‘default’)

  • merge ({'inner' or 'outer'}) –

    Similar to overlap_subjects, this parameter controls the merge behavior between different dfs, i.e., when calling Load_Data twice, a local dataframe is merged with the class self.data on the second call. There are two behaviors that make sense here: ‘inner’, which takes only the overlapping subjects from each dataframe, and ‘outer’, which keeps all subjects from both and sets any missing subjects’ values to NaN.

    if ‘default’, and not already defined, set to ‘inner’ (default = ‘default’)

  • na_values (list, optional) –

    Additional values to treat as NaN. By default, the ABCD specific values ‘777’ and ‘999’ are treated as NaN, along with those treated as NaN by the pandas ‘read_csv’ function. Note: if new values are passed here, they will override the default ‘777’ and ‘999’ NaN values, so if it is desired to keep these, they should be passed explicitly along with any new values.

    if ‘default’, and not already defined, set to [‘777’, ‘999’] (default = ‘default’)

  • drop_na (bool, int, float or 'default', optional) –

    This setting sets the value for drop_na, which is used when loading data and covars only!

    If set to True, then will drop any row within the loaded data if there are any NaN! If False, then will not drop any rows for missing values.

    If an int or float, then this means some NaN entries will potentially be preserved! Missing data imputation will therefore be required later on!

    If an int > 1, then will drop any row with more than drop_na NaN values. If a float, will determine the drop threshold as a percentage of the possible values, where 1 would not drop any rows (as it would require the number of columns + 1 NaN), and .5 would require that more than half the column entries are NaN in order to drop that row.

    if ‘default’, and not already defined, set to True (default = ‘default’)

  • drop_or_na ({'drop', 'na'}, optional) –

    This setting sets the value for drop_or_na, which is used when loading data and covars only!

    filter_outlier_percent, or when loading a binary variable in load covars and more than two classes are present, are both instances where rows/subjects are by default dropped. If drop_or_na is set to ‘na’, then these values will instead be set to NaN, rather than the whole row being dropped!

    Otherwise, if left as default value of ‘drop’, then rows will be dropped!

    if ‘default’, and not already defined, set to ‘drop’ (default = ‘default’)

  • nan_as_class (bool, or list of, optional) –

    If True, then when data_type is categorical, NaN values will be treated as their own unique category, instead of keeping rows with NaN. (Explicitly, this parameter does not override drop_na, so to use this, drop_na must be set to something other than True.)

    A list of values can also be passed in the case that multiple col_names / covars are being loaded. In this case, the index should correspond. If a list is not passed here, then the same value is used when loading all covars.

    default = False
    

  • code_categorical_as ('depreciated', optional) –

    This parameter has been removed; please use transformers within the actual modeling to accomplish something similar.

    default = 'depreciated'
    

  • filter_outlier_percent (int, float, tuple, None or list of, optional) –

    For float datatypes only. A percent of values to exclude from either end of the covars distribution, provided as either 1 number or a tuple (% from lower, % from higher). Set filter_outlier_percent to None for no filtering.

    For example, if passed (1, 1), then the bottom 1% and top 1% of the distribution will be dropped, the same as passing 1. Further, if passed (.1, 1), the bottom .1% and top 1% will be removed.

    A list of values can also be passed in the case that multiple col_names / covars are being loaded. In this case, the index should correspond. If a list is not passed here, then the same value is used when loading all covars.

    Note: If loading a variable with type ‘float_to_cat’ / ‘float_to_bin’, the outlier filtering will be performed before kbin encoding.

    (default = None)

  • filter_outlier_std (int, float, tuple, None or list of, optional) –

    For float datatypes only. Determines outliers as data points within each column whose value is either less than the column mean - filter_outlier_std[0] * the column standard deviation, or greater than the column mean + filter_outlier_std[1] * the column standard deviation.

    If a single number is passed, that number is applied to both the lower and upper range. If a tuple with None on one side is passed, e.g. (None, 3), then nothing will be taken off that lower or upper bound.

    A list of values can also be passed in the case that multiple col_names / covars are being loaded. In this case, the index should correspond. If a list is not passed here, then the same value is used when loading all covars.

    Note: If loading a variable with type ‘float_to_cat’ / ‘float_to_bin’, the outlier filtering will be performed before kbin encoding.

    (default = None)

  • categorical_drop_percent (float, None or list of, optional) –

    Optional percentage threshold for dropping categories when loading categorical data. If a float is given, then a category will be dropped if it makes up less than that % of the data points. E.g. if .01 is passed, then any datapoints in a category comprising less than 1% of total valid datapoints are dropped.

    A list of values can also be passed in the case that multiple col_names / covars are being loaded. In this case, the index should correspond. If a list is not passed here, then the same value is used when loading all covars.

    Note: percent in the name might be a bit misleading. For 1%, you should pass .01, for 10%, you should pass .1.

    If loading a categorical variable, this filtering will be applied before ordinally encoding that variable. If instead loading a variable with type ‘float_to_cat’ / ‘float_to_bin’, the filtering will be performed after kbin encoding (as before then it is not categorical). This can yield gaps in the outputted ordinal values.

    (default = None)

  • float_bins (int or list of, optional) –

    If any columns are loaded as ‘float_to_bin’ or ‘f2b’ then input must be discretized into bins. This param controls the number of bins to create. As with other params, if one value is passed, it is applied to all columns, but if different values per loaded column are desired, a list of ints (with corresponding indices) should be passed. For columns that are not specified as ‘f2b’ type, anything can be passed in that list index spot, as it will be ignored.

    (default = 10)

  • float_bin_strategy ({'uniform', 'quantile', 'kmeans'}, optional) –

    If any columns are loaded as ‘float_to_bin’ or ‘f2b’ then input must be discretized into bins. This param controls the strategy used to define the bins. Options are,

    • ’uniform’

      All bins in each feature have identical widths.

    • ’quantile’

      All bins in each feature have the same number of points.

    • ’kmeans’

      Values in each bin have the same nearest center of a 1D k-means cluster.

    As with float_bins, if one value is passed, it is applied to all columns, but if different values per loaded column are desired, a list of choices (with corresponding indices) should be passed.

    (default = ‘uniform’)

  • clear_existing (bool, optional) –

    If this parameter is set to True, then any existing loaded covars will first be cleared before loading new covars!

    Warning

    If any subjects have been dropped from a different place, e.g. targets or data, then simply reloading / clearing existing covars might result in computing a misleading overlap of final valid subjects. Reloading is therefore best used right after loading the original covars, or if not possible, after reloading the notebook or re-running the script.

    (default = False)

  • ext (None or str, optional) –

    Optional fixed extension to append to all loaded col names, leave as None to ignore this param. Note: applied after name mapping.

    (default = None)
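
A minimal sketch of loading two covariates with mixed types - the file and column names below are hypothetical placeholders:

# 'sex' loaded as binary; 'age' loaded as float, then binned
# into 4 quantile-based ordinal categories
ML.Load_Covars(loc='my_covars.csv',
               col_name=['sex', 'age'],
               data_type=['b', 'f2b'],
               dataset_type='custom',
               float_bins=4,
               float_bin_strategy='quantile')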

Show_Covars_Dist

BPt_ML.Show_Covars_Dist(covars='SHOW_ALL', cat_show_original_name=True, show_only_overlap=True, subjects=None, show=True, cat_type='Counts', return_display_dfs=False)

Plot a single or multiple covar distributions, along with outputting useful summary statistics.

Parameters
  • covars (str or list, optional) –

    The single covar (str) or multiple covars (list) for which to display distributions. The str input ‘SHOW_ALL’ is reserved, and set as the default, for showing the distributions of all loaded covars.

    (default = ‘SHOW_ALL’)

  • cat_show_original_name (bool, optional) –

    If True, then when showing a categorical distribution (or binary) make the distr plot using the original names. Otherwise, use the internally used names.

    (default = True)

  • show_only_overlap (bool, optional) –

    If True, then displays only the distributions for valid overlapping subjects across data, covars, etc… Otherwise, if False, shows the currently loaded distribution as is.

    (default = True)

  • subjects (None, 'train', 'test' or array-like, optional) –

    If not None, then plot only the subjects loaded as train_subjects or as test subjects, or you can pass a custom list or array-like of subjects.

    (default = None)

  • show (bool, optional) –

    If True, then plt.show(), the matplotlib command, will be called and the figure displayed. On the other hand, if set to False, then the user can customize the plot as they desire. You can think of plt.show() as clearing all of the loaded settings, so in order to make changes, you can’t call it until you are done.

    (default = True)

  • cat_type ({'Counts', 'Frequency'}, optional) –

    If plotting a categorical variable (binary or categorical), plot the X axis as either by raw count or frequency.

    (default = ‘Counts’)

  • return_display_dfs (bool, optional) –

    Optionally return the display df as a pandas df

    (default = False)

Load_Strat

BPt_ML.Load_Strat(loc=None, df=None, col_name=None, dataset_type='default', subject_id='default', eventname='default', eventname_col='default', overlap_subjects='default', binary_col=False, float_to_binary=False, float_col=False, float_bins=10, float_bin_strategy='uniform', filter_outlier_percent=None, filter_outlier_std=None, categorical_drop_percent=None, na_values='default', clear_existing=False, ext=None)

Load stratification values from a file. See Notes for more details on what stratification values are.

Parameters
  • loc (str, Path or None, optional) –

    The location of the file to load stratification values from.

    Either loc or df must be set, but they both cannot be set!

    (default = None)

  • df (pandas DataFrame or None, optional) –

    This parameter represents the option for the user to pass in a raw custom dataframe. A loc and/or a df must be passed.

    When passing a raw DataFrame, the loc and dataset_type params will be ignored, as those apply only when loading from a file. Otherwise, it will be treated the same as if loading from a file, which means there should be a column within the passed dataframe with subject_id, and e.g. if eventname params are passed, they will be applied along with any other specified processing.

    Either loc or df must be set, but they both cannot be set!

  • col_name (str or list, optional) –

    The name(s) of the column(s) to load. Any datatype can be loaded, with the exception of multilabel; float variables in particular should be specified with the float_col param and the corresponding float_bins and float_bin_strategy params. Noisy binary cols can also be specified with the binary_col param.

    (default = None)

  • dataset_type ({'basic', 'explorer', 'custom'}, optional) –

    The dataset_type / file-type to load from. Dataset types are,

    • ’basic’

      ABCD2p0NDA style (.txt, tab separated). Typically the default columns (and therefore not neuroimaging data) will be dropped, not including the eventname column.

    • ’explorer’

      2.0_ABCD_Data_Explorer style (.csv, comma separated). The first 2 columns before self.subject_id (typically the default columns, and therefore not neuroimaging data - also not including the eventname column) will be dropped.

    • ’custom’

      A user-defined custom dataset. Right now this is only supported as a comma separated file, with the subject names in a column called self.subject_id, and optionally an ‘eventname’ column. No columns will be dropped (except eventname), unless specific drop keys are passed.

    If loading multiple locs as a list, dataset_type can be a list with indices corresponding to the dataset type of each loc.

    if ‘default’, and not already defined, set to ‘basic’

    default = 'default'
    

  • subject_id (str, optional) –

    The name of the column with unique subject ids in the dataset. For default ABCD datasets this is ‘src_subject_id’, but if a user wanted to load and work with a different dataset, they just need to change this accordingly (in addition to, most likely, setting eventname to None and use_abcd_subject_ids to False).

    if ‘default’, and not already defined, set to ‘src_subject_id’.

    default = 'default'
    

  • eventname (value, list of values or None, optional) –

    Optional value specifying to keep only certain rows when reading data, based on the eventname flag, where eventname is the value to match and eventname_col is the name of the column containing it.

    If a list of values are passed, then it will be treated as keeping a row if that row’s value within the eventname_col is equal to ANY of the passed eventname values.

    As ABCD is a longitudinal study, this flag lets you select only one specific time point, or if set to None, will load everything.

    For selecting only baseline imaging data one might consider setting this param to ‘baseline_year_1_arm_1’.

    if ‘default’, and not already defined, set to None. (default = ‘default’)

  • eventname_col (str or None, optional) –

    If an eventname is provided, this param refers to the column name containing the eventname. This can also be used, with eventname set to any arbitrary value, in order to perform selection by a specific column value.

    Note: The eventname col is dropped after it is processed!

    if ‘default’, and not already defined, set to ‘eventname’ (default = ‘default’)

  • overlap_subjects (bool, optional) –

    This parameter dictates, when loading data, covars, targets or strat (after initial basic proc and/or merge w/ other passed locs), if the loaded data should be restricted to only the overlapping subjects from previously loaded data, targets, covars or strat - important when performing intermediate proc. If False, then all subjects will be kept throughout the rest of the optional processing - and only merged at the end, AFTER processing has been done.

    Note: Inclusions and Exclusions are always applied regardless of this parameter.

    if ‘default’, and not already defined, set to False (default = ‘default’)

  • binary_col (bool or list of, optional) –

    Strat values are loaded as ordinal categorical, but there still exists the case where the user would like to load a binary set of values and ensure they are binary (filtering out all values but the top 2 most frequent).

    This input should either be one boolean True/False value, or a list of values corresponding to the length of col_name if col_name is a list.

    If col_name is a list and only one value for binary_col is passed, then that value is applied to all loaded cols.

    (default = False)

  • float_to_binary (False, int, (int, int), or list of) –

    Strat values are loaded as ordinal categorical, but one could also want to load a float value, and force it to be binary via thresholding.

    If False is passed, or False within a list of values, this will be ignored. Otherwise, a single int can be passed in the case of one threshold, where values lower than or equal to the threshold are converted to 0, and values greater than it to 1. If a tuple of ints is passed, that corresponds to the case of passing a lower and upper binary threshold.

    (default = False)

  • float_col (bool, or list or None, optional) –

    Strat values are loaded as ordinal categorical, but one could also want to load a float value and bin it, according to some strategy, into ordinal categorical values.

    This input should either be one boolean True/False value, or a list of values corresponding to the length of col_name if col_name is a list.

    If col_name is a list and only one value for float_col is passed, then that value is applied to all loaded cols.

    (default = None)

  • float_bins (int or list of, optional) –

    If any float_col are set to True, then the float input must be discretized into bins. This param controls the number of bins to create. As with float_col, if one value is passed, it is applied to all columns, but if different values per loaded column are desired, a list of ints (with corresponding indices) should be passed.

    (default = 10)

  • float_bin_strategy ({'uniform', 'quantile', 'kmeans'}, optional) –

    If any float_col are set to True, then the float input must be discretized into bins. This param controls the strategy used to define the bins. Options are,

    • ’uniform’

      All bins in each feature have identical widths.

    • ’quantile’

      All bins in each feature have the same number of points.

    • ’kmeans’

      Values in each bin have the same nearest center of a 1D k-means cluster.

    As with float_col and float_bins, if one value is passed, it is applied to all columns, but if different values per loaded column are desired, a list of choices (with corresponding indices) should be passed.

    (default = ‘uniform’)

  • filter_outlier_percent (int, float, tuple, None or list of, optional) –

    If any float_col are set to True, then you may perform float based outlier removal.

    A percent of values to exclude from either end of the distribution, provided as either 1 number or a tuple (% from lower, % from higher). Set filter_outlier_percent to None for no filtering.

    For example, if passed (1, 1), then the bottom 1% and top 1% of the distribution will be dropped, the same as passing 1. Further, if passed (.1, 1), the bottom .1% and top 1% will be removed.

    As with float_col and float_bins, if one value is passed, it is applied to all columns, but if different values per loaded column are desired, a list of choices (with corresponding indices) should be passed.

    Note: this filtering will be applied before binning.

    (default = None)

  • filter_outlier_std (int, float, tuple, None or list of, optional) –

    If any float_col are set to True, then you may perform float based outlier removal.

    Determines outliers as data points within each column whose value is either less than the column mean - filter_outlier_std[0] * the column standard deviation, or greater than the column mean + filter_outlier_std[1] * the column standard deviation.

    If a single number is passed, that number is applied to both the lower and upper range. If a tuple with None on one side is passed, e.g. (None, 3), then nothing will be taken off that lower or upper bound.

    As with float_col and float_bins, if one value is passed, it is applied to all columns, but if different values per loaded column are desired, a list of choices (with corresponding indices) should be passed.

    Note: this filtering will be applied before binning.

    (default = None)

  • categorical_drop_percent (float, None or list of, optional) –

    Optional percentage threshold for dropping categories when loading categorical data (so for strat, any columns that are not specified as float or binary). If a float is given, then a category will be dropped if it makes up less than that % of the data points. E.g. if .01 is passed, then any datapoints in a category comprising less than 1% of total valid datapoints are dropped.

    A list of values can also be passed in the case that multiple col_names / strat vals are being loaded. In this case, the indices should correspond. If a list is not passed here, then the same value is used when loading all non-float, non-binary strat cols.

    Note: if this is used with a float col, then this category-based dropping will be performed after the k-binning. If filter_outlier_percent or filter_outlier_std is also provided, that will be applied before binning.

    (default = None)

  • na_values (list, optional) –

    Additional values to treat as NaN. By default, the ABCD specific values ‘777’ and ‘999’ are treated as NaN, along with those treated as NaN by the pandas ‘read_csv’ function. Note: if new values are passed here, they will override the default ‘777’ and ‘999’ NaN values, so if it is desired to keep these, they should be passed explicitly along with any new values.

    if ‘default’, and not already defined, set to [‘777’, ‘999’] (default = ‘default’)

  • clear_existing (bool, optional) –

    If this parameter is set to True, then any existing loaded strat will first be cleared before loading new strat!

    Warning

    If any subjects have been dropped from a different place, e.g. targets or data, then simply reloading / clearing existing strat might result in computing a misleading overlap of final valid subjects. Reloading is therefore best used right after loading the original strat, or if not possible, after reloading the notebook or re-running the script.

    (default = False)

  • ext (None or str, optional) –

    Optional fixed extension to append to all loaded col names, leave as None to ignore this param. Note: applied after name mapping.

    (default = None)

Notes

Stratification values are categorical variables which are loaded for the purpose of defining custom validation behavior.

For example: Sex might be loaded here, and used later to ensure that any validation splits retain the same distribution of each sex. See Define_Validation_Strategy(), and some arguments within Evaluate() (sample_on and subjects_to_use).

For the most reliable split behavior based off strat values, make sure to load strat values after data, targets and covars.
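
For example, a sketch of loading sex as a stratification value - the file path and column name are hypothetical placeholders:

# Load a possibly noisy binary column as a strat value,
# keeping only the top 2 most frequent categories
ML.Load_Strat(loc='my_strat.csv',
              col_name='sex',
              dataset_type='custom',
              binary_col=True)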

Show_Strat_Dist

BPt_ML.Show_Strat_Dist(strat='SHOW_ALL', cat_show_original_name=True, show_only_overlap=True, subjects=None, show=True, cat_type='Counts', return_display_dfs=False)

Plot a single or multiple strat distributions, along with outputting useful summary statistics.

Parameters
  • strat (str or list, optional) –

    The single strat (str) or multiple strats (list) for which to display distributions. The str input ‘SHOW_ALL’ is reserved, and set as the default, for showing the distributions of all loaded strat cols.

    (default = ‘SHOW_ALL’)

  • cat_show_original_name (bool, optional) –

    If True, then make the distr. plot using the original names. Otherwise, use the internally used names.

    (default = True)

  • show_only_overlap (bool, optional) –

    If True, then displays only the distributions for valid overlapping subjects across data, covars, etc… Otherwise, if False, shows the currently loaded distribution as is.

    (default = True)

  • subjects (None, 'train', 'test' or array-like, optional) –

    If not None, then plot only the subjects loaded as train_subjects or as test subjects, or you can pass a custom list or array-like of subjects.

    (default = None)

  • show (bool, optional) –

    If True, then plt.show(), the matplotlib command, will be called and the figure displayed. On the other hand, if set to False, then the user can customize the plot as they desire. You can think of plt.show() as clearing all of the loaded settings, so in order to make changes, you can’t call it until you are done.

    (default = True)

  • cat_type ({'Counts', 'Frequency'}, optional) –

    If plotting a categorical variable (binary or categorical), plot the X axis as either by raw count or frequency.

    (default = ‘Counts’)

  • return_display_dfs (bool, optional) –

    Optionally return the display df as a pandas df

    (default = False)

Get_Overlapping_Subjects

BPt_ML.Get_Overlapping_Subjects()

This function will return the set of valid overlapping subjects currently loaded across data, targets, covars, strat, etc…, respecting any inclusions and exclusions.

Returns

The set of valid overlapping subjects.

Return type

set

Clear_Name_Map

BPt_ML.Clear_Name_Map()

Reset name mapping

Clear_Exclusions

BPt_ML.Clear_Exclusions()

Resets exclusions to be an empty set.

Warning

If any subjects have been dropped from a different place, e.g. targets or data, then simply reloading / clearing existing exclusions might result in computing a misleading overlap of final valid subjects. Reloading is therefore best used right after loading the original exclusions, or if not possible, after reloading the notebook or re-running the script.

Clear_Data

BPt_ML.Clear_Data()

Resets any loaded data.

Warning

If any subjects have been dropped from a different place, e.g. targets, then simply clearing data might result in computing a misleading overlap of final valid subjects. Reloading is therefore best used right after loading the original data, or if not possible, after reloading the notebook or re-running the script.

Clear_Targets

BPt_ML.Clear_Targets()

Resets targets

Clear_Covars

BPt_ML.Clear_Covars()

Reset any loaded covars.

Warning

If any subjects have been dropped from a different place, e.g. targets or data, then simply reloading / clearing existing covars might result in computing a misleading overlap of final valid subjects. Reloading is therefore best used right after loading the original covars, or if not possible, after reloading the notebook or re-running the script.

Clear_Strat

BPt_ML.Clear_Strat()

Reset any loaded strat

Warning

If any subjects have been dropped from a different place, e.g. targets or data, then simply reloading / clearing existing strat might result in computing a misleading overlap of final valid subjects. Reloading is therefore best used right after loading the original strat, or if not possible, after reloading the notebook or re-running the script.

Get_Nan_Subjects

BPt_ML.Get_Nan_Subjects()

Retrieves all subjects with any loaded NaN data, returns their pandas index.
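
One common use, referenced under Define_Validation_Strategy below, is merging these NaN subjects with a custom set of train-only subjects - a sketch, where the extra subject ids are hypothetical:

# Combine all subjects with any missing data with a custom set,
# e.g., to later pass as train_only_subjects
nan_subjects = ML.Get_Nan_Subjects()
train_only = set(nan_subjects) | {'subj_1', 'subj_2'}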

Validation Phase

Define_Validation_Strategy

BPt_ML.Define_Validation_Strategy(cv=None, groups=None, stratify=None, train_only_loc=None, train_only_subjects=None, show=True, show_original=True, return_df=False)

Define a validation strategy to be used during different train/test splits, in addition to model selection and model hyperparameter cross validation. See Notes for more info.

Note: a cv params object can also be passed here.

Parameters
  • cv (CV or None, optional) –

    If None, then skip, otherwise can pass a CV object here, and the rest of the parameters will be skipped.

    default = None
    

  • groups (str, list or None, optional) –

    In the case of str input, the str is assumed to refer to a column key within the loaded strat data, and will be assigned as a value to preserve groups by during any train/test or K-fold splits. If a list is passed, then each element should be a str, and they will be combined into all unique combinations of the elements of the list.

    default = None
    

  • stratify (str, list or None, optional) –

    In the case of str input, the str is assumed to refer to a column key within the loaded strat data, or a loaded target col, and will be assigned as a value to preserve the distribution of groups by during any train/test or K-fold splits. If a list is passed, then each element should be a str, and they will be combined into all unique combinations of the elements of the list.

    Any target_cols passed must be categorical or binary, and cannot be float. Though you can consider loading in a float target as a strat, which will apply a specific k_bins, and then be valid here.

    In the case that you have a loaded strat val with the same name as your target, you can distinguish between the two by how you pass the name: e.g., if they are both loaded as ‘Sex’, passing just ‘Sex’ will try to use the loaded target. If instead you want to use your loaded strat val with the same name, you have to pass ‘Sex’ + self.strat_u_name (by default this is ‘_Strat’).

    default = None
    

  • train_only_loc (str, Path or None, optional) –

    Location of a file to load in train_only subjects, where any subject loaded as train_only will be assigned to every training fold, and never to a testing fold. This file should be formatted as one subject per line.

    You can both load from a loc and pass subjects; the subjects from each source will be merged.

    This parameter is compatible with groups / stratify.

    default = None
    

  • train_only_subjects (set, array-like, 'nan', or None, optional) –

    An explicit list or array-like of train_only subjects, where any subject loaded as train_only will be assigned to every training fold, and never to a testing fold.

    You can also optionally specify ‘nan’ as input, which will add all subjects with any NaN data to train only.

    If you want to add both all the NaN subjects and custom subjects, call Get_Nan_Subjects() to get all NaN subjects, and then merge them yourself with any you want to pass.

    You can both load from a loc and pass subjects; the subjects from each source will be merged.

    This parameter is compatible with groups / stratify.

    default = None
    

  • show (bool, optional) –

    By default, if True, information about the defined validation strategy will be shown, including a dataframe if stratify is defined.

    default = True
    

  • show_original (bool, optional) –

    By default when you define stratifying behavior, a dataframe will be displayed. This param controls if that dataframe shows original names, or if False, then it shows the internally used names.

    default = True
    

  • return_df (bool, optional) –

    If set to true, then will return a dataframe version of the defined validation strategy. Note: this will return None in all cases except for when stratifying by a variable is requested!

    default = False
    

Notes

Validation strategy choices are explained in more detail:

  • Random

    Just make validation splits randomly.

  • Group Preserving

    Make splits that ensure subjects that are part of a specific group are all within the same fold, e.g., split by family, so that people with the same family id are always part of the same fold.

  • Stratifying

    Make splits such that the distribution of a given group is as equally split between folds as possible, similar to matched halves; e.g., in a binary or categorical predictive context, splits could be done to ensure a roughly equal distribution of the dependent class.

For now, it is possible to define only one overarching strategy (one could imagine combining group preserving splits while also trying to stratify for class, but the logistics become more complicated). Though, within one strategy it is certainly possible to provide multiple values, e.g., for stratification you can stratify by target (the dependent variable to be predicted) as well as, say, sex, though with the addition of each unique value, the size of the smallest unique group decreases.
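
A sketch of both strategies - the column names assume values previously loaded via Load_Strat and are hypothetical:

# Group-preserving CV: keep members of the same family in the same fold
ML.Define_Validation_Strategy(groups='rel_family_id')

# Or instead, stratify splits by a loaded strat value
# ML.Define_Validation_Strategy(stratify='sex')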

Train_Test_Split

BPt_ML.Train_Test_Split(test_size=None, test_subjects=None, cv='default', random_state='default', test_loc='depreciated', CV='depreciated')

Define the overarching train / test split; highly recommended.

Parameters
  • test_size (float, int or None, optional) –

    If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to be included in the test split. If int, represents the absolute number (or target number) to include in the testing group. Keep as None if using test_subjects.

    default = None
    

  • test_subjects (Subjects, optional) –

    Pass in a Subjects formatted input (see Subjects for more info). This will define an explicit set of subjects to use as a test set. If anything but None is passed here, nothing should be passed to the test_size parameter.

    default = None
    

  • cv (‘default’ or CV, optional) –

    If left as default ‘default’, use the class defined CV for the train test split, otherwise can pass custom behavior

    default = 'default'
    

  • random_state (int, None or 'default', optional) –

    If using test_size, then a random state can optionally be provided, in order to be able to recreate an exact test set.

    If set to default, will use the value saved in self.random_state (as set in BPt.BPt_ML upon class init).

    default = 'default'
    

  • test_loc ('depreciated', optional) –

    Deprecated; pass a single str with the test loc to test_subjects instead.

    default = 'depreciated'
    

  • CV ('depreciated', optional) –

    Deprecated; switching to passing the cv parameter as cv instead of CV. For now, if CV is passed, it will still work as if it were passed as cv.

    default = 'depreciated'
    
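For instance, a sketch of a typical global split reserving 20% of subjects for testing - the random_state value here is arbitrary:

# Reserve a random 20% of subjects as the global test set, reproducibly
ML.Train_Test_Split(test_size=0.2, random_state=5)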

Modeling Phase

Set_Default_ML_Verbosity

BPt_ML.Set_Default_ML_Verbosity(save_results='default', progress_bar='default', progress_loc='default', pipeline_verbose='default', best_params_score='default', compute_train_score='default', show_init_params='default', fold_name='default', time_per_fold='default', score_per_fold='default', fold_sizes='default', best_params='default', save_to_logs='default', flush='default')

This function allows setting various verbosity options that affect output during Evaluate() and Test().

Parameters
  • save_results (bool, optional) –

    If True, all results returned by Evaluate will be saved within the log dr (if one exists!) under run_name + .eval, and similarly for results returned by Test, but as run_name + .test.

    if ‘default’, and not already defined, set to False.

    default = 'default'
    

  • progress_bar (bool, optional) –

    If True, a progress bar, implemented in the python library tqdm, is used to show progress during use of Evaluate(). If False, then no progress bar is shown. This bar should work both in a notebook env and outside one, assuming self.notebook has been set correctly.

    if ‘default’, and not already defined, set to True.

    default = 'default'
    

  • progress_loc (str, Path or None, optional) –

    If not None, then this will record the progress of each Evaluate / Test call in this location.

    if ‘default’, and not already defined, set to None.

    default = 'default'
    

  • pipeline_verbose (bool, optional) –

    This controls the verbose parameter for the pipeline object itself. If set to True, then time elapsed while fitting each step will be printed.

    if ‘default’, and not already defined, set to False.

    default = 'default'
    

  • compute_train_score (bool, optional) –

    If True, then metrics/scorers and raw preds will also be computed on the training set in addition to just the eval or testing set.

    if ‘default’, and not already defined, set to False.

    default = 'default'
    

  • show_init_params (bool, optional) –

    If True, then print/show the parameters used before running Evaluate / Test. If False, then don’t print the params used.

    if ‘default’, and not already defined, set to True.

    default = 'default'
    

  • fold_name (bool, optional) –

    If True, prints a rough measure of progress by printing out the current fold (somewhat redundant with the progress bar if used, except when used with other params, e.g. time_per_fold, where it is helpful to have the time printed with each fold). If False, nothing is shown.

    if ‘default’, and not already defined, set to False.

    default = 'default'
    

  • time_per_fold (bool, optional) –

    If True, prints the full time that a fold took to complete.

    if ‘default’, and not already defined, set to False.

    default = 'default'
    

  • score_per_fold (bool, optional) –

    If True, displays the score for each fold, though slightly less formatted than in the final display.

    if ‘default’, and not already defined, set to False.

    default = 'default'
    

  • fold_sizes (bool, optional) –

    If True, will show the number of subjects within each train and val/test fold.

    if ‘default’, and not already defined, set to False.

    default = 'default'
    

  • best_params (bool, optional) –

    If True, print the best search params found after every param search.

    if ‘default’, and not already defined, set to False.

    default = 'default'
    

  • save_to_logs (bool, optional) –

    If True, then when possible, and with the selected model verbosity options, verbosity output will be saved to the log file.

    if ‘default’, and not already defined, set to False.

    default = 'default'
    

  • flush (bool, optional) –

    If True, then add flush=True to all ML prints, which adds a call to flush the std output.

    if ‘default’, and not already defined, set to False.

    default = 'default'
    
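As a sketch, a fairly verbose configuration one might set once before calling Evaluate - all parameter names are from the signature above:

# Save Evaluate / Test results to the log dr, and print
# per-fold names, times and scores as they complete
ML.Set_Default_ML_Verbosity(save_results=True,
                            fold_name=True,
                            time_per_fold=True,
                            score_per_fold=True)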

Evaluate

BPt_ML.Evaluate(model_pipeline, problem_spec='default', splits=3, n_repeats=2, cv='default', train_subjects='train', feat_importances=None, return_raw_preds=False, return_models=False, run_name='default', only_fold=None, base_dtype='float32', CV='depreciated')

The Evaluate function is one of the main interfaces for building and evaluating a Model_Pipeline on the loaded data. Specifically, Evaluate is designed to try to estimate the out of sample performance of a passed Model_Pipeline on a specific ML task (as specified by Problem_Spec). This estimate is done through a defined CV strategy (splits and n_repeats). While Evaluate’s ideal usage is in an experimental context, for exploring different choices of Model_Pipeline before ultimately confirming one with Test - if used carefully (i.e., don’t try 50 pipelines and only report the one that does best), it can be used on a full dataset.

Parameters
  • model_pipeline (Model_Pipeline) –

    The passed model_pipeline should be an instance of the BPt params class Model_Pipeline. This object defines the underlying model pipeline to be evaluated.

    See Model_Pipeline for more information / how to create the model pipeline.

  • problem_spec (Problem_Spec or ‘default’, optional) –

    problem_spec accepts an instance of the BPt.BPt_ML params class Problem_Spec. This object is essentially a wrapper around commonly used parameters needed to define the context the model pipeline should be evaluated in. It includes parameters like problem_type, scorer, n_jobs, random_state, etc… See Problem_Spec explicitly for more information and for how to create an instance of this object.

    If left as ‘default’, then will just initialize a Problem_Spec with default params.

    default = 'default'
    

  • splits (int, float, str or list of str, optional) –

    In every fold of the defined CV strategy, the passed model_pipeline will be fitted on a train fold, and evaluated on a validation fold. This parameter controls the type of CV, i.e., specifies what the train and validation folds should be. These splits are further determined by the subjects passed to train_subjects. Notably, the splits defined will respect any special split behavior as defined in Define_Validation_Strategy.

    Specifically, options for split are:

    • int

      The number of k-fold splits to conduct. (E.g., 3 for a 3-fold CV).

    • float

      Must be 0 < splits < 1, and defines a single train-test like split, with splits as the % of the current training data size used as a validation/test set.

    • str

      If a str is passed, then it must correspond to a loaded Strat variable. In this case, a leave-out-group CV will be used according to the value of the indicated Strat variable (E.g., a leave-out-site CV scheme).

    • list of str

      If multiple str are passed, first the overlapping unique values from their corresponding loaded Strat variables are determined, and then these overlapping values are used to define the leave-out-group CV as described above.

    Note that this defines only the base CV strategy; the following param n_repeats is optionally used to replicate this base strategy, e.g., for a twice repeated train-test split evaluation. Note further that n_repeats will work with any of these options, though, say, in the case of a leave-out-group CV repeats would be redundant, whereas with a passed float value they are very reasonable. (See the sketch following this parameter list for concrete examples of each option.)

    default = 3
    

  • n_repeats (int, optional) –

    Given the base CV defined / described in the splits param, this parameter further controls if the defined train/val splits should be repeated (w/ different random splits in all cases but the leave-out-group passed str option).

    For example, if n_repeats is set to 2, and splits is 3, then a twice repeated 3-fold CV will be performed, and results returned with respect to this strategy.

    It can be a good idea to set multiple n_repeats (assuming enough computational power), as it can help you spot cases where you may not have enough training subjects to get stable behavior. E.g., say you run a three times repeated 3-fold CV: if the mean validation scores from each 3-fold are all very close to each other, then you know that one repeat is likely enough. If instead the macro std in score (in this case, the std across those three mean scores) is high, then it indicates you may not have enough subjects to get stable results from just one 3-fold CV, and you might want to consider changing some settings.

    default = 2
    

  • cv ('default' or CV params object, optional) –

    If left as ‘default’, the class-defined CV behavior is used for the splits; otherwise, a custom CV params object can be passed.

    default = 'default'
    

  • train_subjects (Subjects, optional) –

    This parameter determines the set of training subjects which are used in this call to Evaluate. Note, this parameter is distinct from the subjects parameter within Problem_Spec, which is applied after selecting the subset of train_subjects specified here. These subjects are used as the input to Evaluate, i.e., any subjects whose data you want to remain untouched (say, your global test subjects) are not considered within Evaluate; only those explicitly passed here are.

    By default, this value will be set to the special str indicator ‘train’, which specifies that the full set of globally defined training subjects (See: Define_Train_Test_Split()) should be used. Other special str indicators include ‘all’ to select all subjects, and ‘test’ to select the test set subjects.

    If subjects is passed a str, and that str is not one of the str indicators listed above, then it will be interpreted as the location of a file from which to read subjects (assuming one subject per line).

    subjects may also be a custom array-like of subjects to use.

    See Subjects for how to correctly format input and for other special options.

    default = 'train'
    

  • feat_importances (Feat_Importance, list of Feat_Importance, str or None, optional) –

    If passed None, by default, no feature importances will be saved.

    Alternatively, one may pass the keyword ‘base’, to indicate that the base feature importances, those automatically calculated by base objects (e.g., beta weights from linear models), should be saved. In this case, the object Feat_Importance(‘base’) will be made.

    Otherwise, for more detailed control provide here either a single, or list of Feat_Importance param objects in which to specify what importance values, and with what settings should be computed. See the base Feat_Importance object for more information on how to specify these objects.

    See Feat Importances to learn more about feature importances generally.

    In the case of a passed list, an attempt will be made to compute all passed Feat_Importances.

    default = None
    

  • return_raw_preds (bool, optional) –

    If True, return the raw predictions from each fold.

    default = False
    

  • return_models (bool, optional) –

    If True, return the trained models from each evaluation.

    default = False
    

  • run_name (str or 'default', optional) –

    Each run of Evaluate can be optionally associated with a specific run_name. If save_results in Set_Default_ML_Verbosity is set to True, then this name will be used for the output from Evaluate as saved in the specified log_dr (if any, as set when initializing the BPt_ML class object), with ‘.eval’ appended to the name.

    If left as ‘default’, a somewhat unhelpful name will be generated based on the underlying model used in the passed model_pipeline.

    default = 'default'
    

  • only_fold (int or None, optional) –

    This is a special parameter used to Evaluate only one specific fold of the defined CV strategy. Keep as None to ignore.

    default = None
    

  • base_dtype (numpy dtype) –

    The dataset is cast to a numpy array of this float dtype. This parameter can be used to change the default behavior, e.g., if more or less resolution is needed.

    default = 'float32'
    

  • CV ('depreciated') –

    This parameter is deprecated: pass the cv parameter as cv instead of CV. For now, if CV is passed, it will still work as if it were passed as cv.

    default = 'depreciated'
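
To make the splits options above concrete, a short sketch, re-using the hypothetical pipeline from the sketch above and assuming ‘site’ is a loaded Strat variable:

    ML.Evaluate(pipeline, splits=5)        # 5-fold CV
    ML.Evaluate(pipeline, splits=.2)       # single split, 20% of training data held out
    ML.Evaluate(pipeline, splits='site')   # leave-out-group CV over Strat variable 'site'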
    

Returns

results – Dictionary containing:

  • ‘summary_scores’: a list representation of the printed summary scores, where index 0 is the mean, index 1 the macro std, and index 2 the micro std.

  • ‘train_summary_scores’: the same as summary_scores, but only present if train scores are computed.

  • ‘raw_scores’: a numpy array of numpy arrays, where each internal array contains the raw scores as computed for all passed scorers, for each fold within each repeat. I.e., the outer array has length n_repeats * number of folds, and each internal array has the same length as the number of scorers. If train scores are computed, this is instead a list whose first element is the raw training scores in this same format, followed by the raw testing scores.

  • ‘raw_preds’: a pandas DataFrame containing the raw predictions for each subject in each test fold.

  • ‘FIs’: a list where each element corresponds to a passed feature importance.

(A short sketch of reading this dictionary is given at the end of this section.)

Return type

dict

Notes

Prints by default the following for each scorer:

float – The mean macro score (as set by input scorer) across each repeated K-fold.

float – The standard deviation of the macro score (as set by input scorer) across each repeated K-fold.

float – The standard deviation of the micro score (as set by input scorer) across each fold within the repeated K-fold.
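
A short sketch of reading the returned dictionary (key names as documented in Returns above, pipeline as in the earlier sketch):

    results = ML.Evaluate(pipeline, return_raw_preds=True)
    summary = results['summary_scores']   # [mean, macro std, micro std]
    raw = results['raw_scores']           # raw scores per scorer, per fold, per repeat
    preds = results['raw_preds']          # DataFrame of raw per-subject predictions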

Plot_Global_Feat_Importances

BPt_ML.Plot_Global_Feat_Importances(feat_importances='most recent', top_n=10, show_abs=False, multiclass=False, ci=95, palette='default', figsize=(10, 10), title='default', titles='default', xlabel='default', n_cols=1, ax=None, show=True)

Plots any global feature importance, e.g., base or shap, i.e., values per feature rather than per prediction.

Parameters
  • feat_importances ('most recent' or Feat_Importances object) –

    Input should be either a Feat_Importances object as output from a call to Evaluate or Test, or, if left as the default ‘most recent’, the passed params will be used to plot any valid computed feature importances from the last call to Evaluate or Test.

    Note, if there exist multiple valid feature importances in the last call, passing custom ax will most likely break things.

    (default = ‘most recent’)

  • top_n (int, optional) –

    The number of top features to display. In the case where show_abs is set to True, or the feature importance being plotted is only positive, then top_n features will be shown. On the other hand, when show_abs is set to False and the feature importances being plotted contain negative numbers, then the top_n highest and top_n lowest features will be shown.

    (default = 10)

  • show_abs (bool, optional) –

    In the case where the underlying feature importances contain negative numbers, you can either plot the top_n by absolute value, with show_abs set to True, or plot the top_n highest and lowest with show_abs set to False.

    (default = False)

  • multiclass (bool, optional) –

    If multiclass is set to True, and the underlying feature importances were derived from a categorical problem type, then a separate feature importance plot will be made for each class. Alternatively, if multiclass is set to False, then feature importances will be averaged over all classes.

    (default = False)

  • ci (float, 'sd' or None, optional) –

    Size of confidence intervals to draw around estimated values. If ‘sd’, skip bootstrapping and draw the standard deviation of the feat importances. If None, no bootstrapping will be performed, and error bars will not be drawn.

    (default = 95)

  • palette (Seaborn palette name, optional) –

    Color scheme to use. Search seaborn palettes for more information. Default for absolute is ‘Reds’, and default for both pos and neg is ‘coolwarm’.

    (default = ‘default’)

  • title (str, optional) –

    The title used during plotting, and also used to save a version of the figure (with spaces in title replaced by _, and as a png).

    When multiclass is True, this is the full figure title.

    (default = ‘default’)

  • titles (list, optional) –

    This parameter is only used when multiclass is True. titles should be a list with the name for each class’s plot. If left as default, each plot will just be named with the originally loaded name for that class.

    (default = ‘default’)

  • xlabel (str, optional) –

    The xlabel, describing the measure of feature importance. If left as ‘default’, it will change depending on which feature importance is being plotted.

    (default = ‘default’)

  • n_cols (int, optional) –

    If multiclass, then the number of class plots to plot on each row.

    (default = 1)

  • ax (matplotlib axis, or axes, optional) –

    A custom ax to plot to for an individual plot, or if using multiclass, then a list of axes can be passed here.

    (default = None)

  • show (bool, optional) –

    If True, then the matplotlib command plt.show() will be called, and the figure displayed. On the other hand, if set to False, then the user can further customize the plot before displaying it. You can think of plt.show() as clearing all of the loaded settings, so in order to make changes, you can’t call it until you are done.

    (default = True)
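
For instance, a minimal sketch of plotting from the most recent call to Evaluate or Test:

    ML.Plot_Global_Feat_Importances(top_n=15, show_abs=True,
                                    title='Top 15 Features')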

Plot_Local_Feat_Importances

BPt_ML.Plot_Local_Feat_Importances(feat_importances='most recent', top_n=10, title='default', titles='default', xlabel='default', one_class=None, show=True)

Plots any local feature importance, e.g., shap, i.e., values per prediction.

Parameters
  • feat_importances ('most recent' or Feat_Importances object) –

    Input should be either a Feat_Importances object as output from a call to Evaluate, or Test, or if left as default ‘most recent’, the passed params will be used to plot any valid calculated feature importances from the last call to Evaluate or Test.

    (default = ‘most recent’)

  • top_n (int, optional) –

    The number of top features to display. In the case where show_abs is set to True, or the feature importance being plotted is only positive, then top_n features will be shown. On the other hand, when show_abs is set to False and the feature importances being plotted contain negative numbers, then the top_n highest and top_n lowest features will be shown.

    (default = 10)

  • title (str, optional) –

    The title used during plotting, and also used to save a version of the figure (with spaces in title replaced by _, and as a png).

    With a multiclass / categorical problem type, this is only used if one_class is set. Otherwise, titles are used.

    (default = ‘default’)

  • titles (list, optional) –

    This parameter is only used with a multiclass problem type. titles should be a list with the name for each class to plot. If left as default, the originally loaded class names will be used.

    (default = ‘default’)

  • xlabel (str, optional) –

    The xlabel, describing the measure of feature importance. If left as ‘default’, it will change depending on which feature importance is being plotted.

    (default = ‘default’)

  • one_class (int or None, optional) –

    If an underlying multiclass or categorical type, optionally provide an int here, corresponding to the single class to plot. If left as None, plots will be made for all classes.

    (default = None)

  • show (bool, optional) –

    If True, then the matplotlib command plt.show() will be called, and the figure displayed. On the other hand, if set to False, then the user can further customize the plot before displaying it. You can think of plt.show() as clearing all of the loaded settings, so in order to make changes, you can’t call it until you are done.

    (default = True)
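
Likewise, a minimal sketch of the local plot, restricting a hypothetical categorical problem to its first class:

    ML.Plot_Local_Feat_Importances(top_n=10, one_class=0)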

Testing Phase

Test

BPt_ML.Test(model_pipeline, problem_spec='default', train_subjects='train', test_subjects='test', feat_importances=None, return_raw_preds=False, return_models=False, run_name='default', base_dtype='float32')

The Test function is one of the main interfaces for testing a specific Model_Pipeline. Test is conceptually different from Evaluate in that it is designed to construct / train a Model_Pipeline on one discrete set of train_subjects and evaluate it on a further discrete set of test_subjects. Otherwise, these functions are very similar, as they both evaluate a Model_Pipeline as defined in the context of a Problem_Spec, and return similar output.
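
A short sketch of a typical final call, re-using the hypothetical pipeline and problem spec from the Evaluate sketches above:

    # Train once on the global train split, evaluate once on the held-out test split
    test_results = ML.Test(model_pipeline=pipeline,
                           problem_spec=spec,
                           train_subjects='train',
                           test_subjects='test')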

Parameters
  • model_pipeline (Model_Pipeline) –

    The passed model_pipeline should be an instance of the BPt params class Model_Pipeline. This object defines the underlying model pipeline to be evaluated.

    See Model_Pipeline for more information on how to create the model pipeline.

  • problem_spec (Problem_Spec or ‘default’, optional) –

    problem_spec accepts an instance of the BPt.BPt_ML params class Problem_Spec. This object is essentially a wrapper around commonly used parameters needed to define the context the model pipeline should be evaluated in. It includes parameters like problem_type, scorer, n_jobs, random_state, etc… See Problem_Spec explicitly for more information and for how to create an instance of this object.

    If left as ‘default’, then will just initialize a Problem_Spec with default params.

    default = 'default'
    

  • train_subjects (str, array-like or Value_Subset, optional) –

    This parameter determines the set of training subjects which are used to train the passed instance of Model_Pipeline.

    Note, this parameter and test_subjects are distinct, but complementary to the subjects parameter within Problem_Spec, which is applied after selecting the subset of train_subjects specified here.

    By default, this value will be set to the special str indicator ‘train’, which specifies that the full set of globally defined training subjects (See: Define_Train_Test_Split()) should be used. Other special str indicators include ‘all’ to select all subjects, and ‘test’ to select the test set subjects.

    If subjects is passed a str, and that str is not one of the str indicators listed above, then it will be interpreted as the location of a file from which to read subjects (assuming one subject per line).

    subjects may also be a custom array-like of subjects to use.

    Lastly, a special wrapper, Value_Subset, can also be used to specify more specific, value-based subsets of subjects to use. See Value_Subset for how this input wrapper can be used.

    If passing custom input here, be warned that you NEVER want to pass an overlap of subjects between train_subjects and test_subjects.

    default = 'train'
    

  • test_subjects (str, array-like or Value_Subset, optional) –

    This parameter determines the set of testing subjects which are used to evaluate the passed instance of Model_Pipeline, after it has been trained on the passed train_subjects.

    Note, this parameter and train_subjects are distinct, but complementary to the subjects parameter within Problem_Spec, which is applied after selecting the subset of test_subjects specified here.

    By default, this value will be set to the special str indicator ‘test’, which specifies that the full set of globally defined test subjects (See: Define_Train_Test_Split()) should be used. Other special str indicators include ‘all’ to select all subjects, and ‘train’ to select the train set subjects.

    If subjects is passed a str, and that str is not one of the str indicators listed above, then it will be interpreted as the location of a file from which to read subjects (assuming one subject per line).

    subjects may also be a custom array-like of subjects to use.

    Lastly, a special wrapper, Value_Subset, can also be used to specify more specific, value-based subsets of subjects to use. See Value_Subset for how this input wrapper can be used.

    If passing custom input here, be warned that you NEVER want to pass an overlap of subjects between train_subjects and test_subjects.

    default = 'test'
    

  • feat_importances (Feat_Importance, list of Feat_Importance, str or None, optional) –

    If passed None, by default, no feature importances will be saved.

    Alternatively, one may pass the keyword ‘base’, to indicate that the base feature importances, those automatically calculated by base objects (e.g., beta weights from linear models), should be saved. In this case, the object Feat_Importance(‘base’) will be made.

    Otherwise, for more detailed control provide here either a single, or list of Feat_Importance param objects in which to specify what importance values, and with what settings should be computed. See the base Feat_Importance object for more information on how to specify these objects.

    See Feat Importances to learn more about feature importances generally.

    In the case of a passed list, an attempt will be made to compute all passed Feat_Importances.

    default = None
    

  • return_raw_preds (bool, optional) –

    If True, return the raw predictions from each fold.

    default = False
    

  • return_models (bool, optional) –

    If True, return the trained models from each evaluation.

    default = False
    

  • run_name (str or 'default', optional) –

    Each run of Test can be optionally associated with a specific run_name. If save_results in Set_Default_ML_Verbosity is set to True, then this name will be used for the output from Test as saved in the specified log_dr (if any, as set when initializing the BPt_ML class object), with ‘.test’ appended to the name.

    If left as ‘default’, a somewhat unhelpful name will be generated based on the underlying model used in the passed model_pipeline.

    default = 'default'
    

  • base_dtype (numpy dtype) –

    The dataset is cast to a numpy array of this float dtype. This parameter can be used to change the default behavior, e.g., if more or less resolution is needed.

    default = 'float32'
    

Returns

results – Dictionary containing:

  • ‘scores’: the score on the test set from each scorer.

  • ‘raw_preds’: a pandas DataFrame containing the raw predictions for each subject in the test set.

  • ‘FIs’: a list where each element corresponds to a passed feature importance.

(A short sketch of reading this dictionary follows below.)

Return type

dict
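
A short sketch of reading the returned dictionary (key names as documented in Returns above, test_results as in the Test sketch above):

    scores = test_results['scores']     # one score on the test set per scorer
    preds = test_results['raw_preds']   # raw test-set predictions (with return_raw_preds=True)
    fis = test_results['FIs']           # feature importances (if any were requested)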

Plot_Global_Feat_Importances and Plot_Local_Feat_Importances

Both plotting functions may also be used following a call to Test; see their full documentation above, as the same signatures and parameters apply.

Extras

Save

BPt_ML.Save(loc, low_memory=False)

This class method is used to save an existing BPt_ML object for further use.

Parameters
  • loc (str or Path) – The location in which the pickle of the BPt_ML object should be saved. This is the same loc which should be passed to Load in order to re-load the object.

  • low_memory (bool, optional) –

    If this parameter is set to True, then self.data, self.targets, self.covars and self.strat will be deleted before saving. This param assumes that self.all_data has already been created, and therefore that the individual dataframes with data, covars, etc… can safely be deleted, as the user will not need to work with them directly any more.

    default = False
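
A minimal sketch, with an arbitrary file name:

    # Save a pickle of the current object; pass this same loc to Load to restore it
    ML.Save('my_analysis.pkl', low_memory=True)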
    

Save_Table

BPt_ML.Save_Table(save_loc, targets='SHOW_ALL', covars='SHOW_ALL', strat='SHOW_ALL', group_by=None, split=True, include_all=True, subjects=None, cat_show_original_name=True, shape='long', heading=None, center=True, rnd_to=2, style=None)

This method is used to save a table with summary statistics in docx format.

Warning: if there is any NaN data kept in any of the selected covars, then those subjects’ data will not be output to the table! Likewise, only overlapping subjects present across all loaded data, covars, strat, targets, etc… will be output to the table.

Note: you must have the optional library python-docx installed to use this function.

Parameters
  • save_loc (str) – The location where the .docx file with the table should be saved. You should include .docx in this save_loc.

  • targets (str, int or list, optional) –

    The single (str) or multiple (list) targets to add to the output table. The str input ‘SHOW_ALL’ is reserved, and set as the default, for displaying all loaded targets.

    You can also pass the int index (or indices).

    (default = ‘SHOW_ALL’)

  • covars (str or list, optional) –

    The single (str) or multiple (list) covars to add to the output table. The str input ‘SHOW_ALL’ is reserved, and set as the default, for displaying all loaded covars.

    Warning: If there are any NaN’s in selected covars, these subjects will be disregarded from all summary measures.

    (default = ‘SHOW_ALL’)

  • strat (str or list, optional) –

    The single (str) or multiple (list) strat to add to the output table. The str input ‘SHOW_ALL’ is reserved, and set as the default, for displaying all loaded strat.

    Note, if strat is passed, then self.strat_u_name will be removed from every title before plotting, so if the same variable is loaded as both a covar and a strat, it will be put in the table twice; and in the rarer case that strat_u_name has been changed to a common value present in a covar name, that common key will be removed.

    (default = ‘SHOW_ALL’)

  • group_by (str or list, optional) –

    This parameter, by default None, controls if the table statistics should be further broken down by different binary / categorical (or multilabel) groups. For example, passing ‘split’ (assuming the split param has been left as True) will output a table with statistics for each column separated by the global train-test split.

    Notably, ‘split’ is a special keyword, to split on any other group, the name of that feature/column should be passed. E.g., to split on a loaded covar ‘sex’, then ‘sex’ would be passed.

    If a list of values is passed, then each element will be used for its own separate split. Further, if the include_all parameter is left as its default value of True, then in addition to a single or multiple group_by splits, the values over all subjects will also be displayed. (E.g., in the train-test split case, statistics would be shown for the full sample, only the train subjects, and only the test subjects.)

    (default = None)

  • split (bool, optional) –

    If True, then information about the global train test split will be added as a binary feature of the table. If False, or if a global split has not yet been defined, this split will not be added to the table. Note that ‘split’ can also be passed to group_by.

    Note: If it is desired to create a table with just the train subjects (or test) for example, then the subjects parameter should be used. If subjects is set to ‘train’ or ‘test’, this split param will switch to False regardless of user passed input.

    (default = True)

  • include_all (bool, optional) –

    If True, as default, then in addition to passed group_by param(s), statistics will be displayed over all subjects as well. If no group_by param(s) are passed, then include_all will be True regardless of user passed value.

    (default = True)

  • subjects (None, 'train', 'test' or array-like, optional) –

    If left as None, display a table with all overlapping subjects. Alternatively, this parameter can be used to pass just the train subjects with ‘train’, just the test subjects with ‘test’, or even a custom list or array-like set of subjects.

    (default = None)

  • cat_show_original_name (bool, optional) –

    If True, then when showing a categorical distribution (or binary) use the original loaded name in the table. Otherwise, if False, the internally used variable names will be used to determine table headers.

    (default = True)

  • shape ({'long', 'wide'}, optional) –

    There are two options for shape. First, ‘long’, which is selected by default, will create a table where each row is a summary statistic for a passed variable, and each column represents each group_by group, if any. This is a good choice when the number of variables to plot is greater than the number of groups to display values by.

    Alternatively, you may pass ‘wide’, for the opposite behavior of ‘long’, where in this case the variables to summarize will each be a column in the table, and the group_by groups will represent rows.

    If your table ends up being too squished in either direction, you can try the opposite shape. If both are squished, you’ll need to reduce the number of group_by variables or targets, covars/ strat.

    (default = ‘long’)

  • heading (str, optional) –

    You may optionally pass a heading for the table as a str. By default no heading will be added.

    (default = None)

  • center (bool, optional) –

    This parameter optionally determines if the values in the table, along with the headers, should be centered within each cell. If False, the values will be left-aligned at the bottom.

    (default = True)

  • rnd_to (int, optional) –

    This parameter determines how many decimal places each value added to the table should be rounded to. E.g., the default value of 2 will round each table entry to 2 decimal places, but if rnd_to=0 were passed, values would be rounded to whole numbers, and with -1, to the nearest 10.

    (default = 2)

  • style (str or None, optional) –

    The .docx table style to use, which controls things like table color and fonts, etc… Keep style as None, the default, to just use the default style, or feel free to try passing any of a huge number of preset styles. These styles can be found at the bottom of https://python-docx.readthedocs.io/en/latest/user/styles-understanding.html under “Table styles in default template”.

    Some examples are: ‘Colorful Grid’, ‘Dark List’, ‘Light List’, ‘Medium List 2’, ‘Medium Shading 2 Accent 6’, etc…

    (default = None)
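
As a sketch, assuming a covar named ‘sex’ has been loaded:

    # Save a docx summary table, with statistics broken down by the global
    # train/test split and by sex, in addition to over all subjects
    ML.Save_Table('summary_table.docx', group_by=['split', 'sex'],
                  shape='long', heading='Sample Summary', rnd_to=2)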