CohortHandler
- class raimitigations.cohort.CohortHandler(cohort_def: Optional[Union[dict, list, str]] = None, cohort_col: Optional[list] = None, cohort_json_files: Optional[list] = None, df: Optional[DataFrame] = None, label_col: Optional[str] = None, X: Optional[DataFrame] = None, y: Optional[DataFrame] = None, verbose: bool = True)
Bases: DataProcessing
Abstract class that manages multiple cohorts.
- Parameters
cohort_def – a list of cohort definitions or a dictionary of cohort definitions. Each cohort definition has the same format as the value received by the cohort_definition parameter of the CohortDefinition class. When a list of cohort definitions is used, the cohorts are named automatically. When a dictionary is used, each key is a cohort’s name and the value assigned to that key holds the cohort’s conditions. This parameter can’t be used together with the cohort_col parameter: only one of these two parameters may be used at a time;
cohort_col – a list of column names or indices, from which one cohort is created for each unique combination of values in these columns. This parameter can’t be used together with the cohort_def parameter: only one of these two parameters may be used at a time;
cohort_json_files – a list with the names of the JSON files that contain the definition of each cohort. Each cohort is saved in a single JSON file, so the length of cohort_json_files should be equal to the number of cohorts to be used;
df – the data frame to be used during the fit method. This data frame must contain all the features, including the label column (specified in the label_col parameter). This parameter is mandatory if label_col is also provided. The user can also provide this dataset (along with label_col) when calling the fit() method. If df is provided during the class instantiation, it is not necessary to provide it again when calling fit(). It is also possible to use X and y instead of df and label_col, although it is mandatory to pass one of the pairs (X, y) or (df, label_col) either during the class instantiation or when calling the fit() method;
label_col – the name or index of the label column. This parameter is mandatory if df is provided;
X – contains only the features of the original dataset, that is, it does not contain the label column. This is useful if the user has already separated the features from the label column prior to calling this class. This parameter is mandatory if y is provided;
y – contains only the label column of the original dataset. This parameter is mandatory if X is provided;
verbose – indicates whether internal messages should be printed or not.
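To illustrate what "one cohort for each unique combination of values" means for cohort_col, the following is a minimal pandas-only sketch. It does not call this library; the column names and data are hypothetical, and the grouping is only an approximation of the behavior described above.

```python
import pandas as pd

# Toy dataset; the column names and values are hypothetical.
df = pd.DataFrame(
    {
        "sex": ["F", "M", "F", "M", "F"],
        "smoker": ["yes", "no", "yes", "yes", "no"],
        "age": [34, 51, 29, 42, 60],
    }
)

# Columns playing the role of the cohort_col parameter.
cohort_col = ["sex", "smoker"]

# One subset per unique combination of values found in those columns.
cohorts = {combo: subset for combo, subset in df.groupby(cohort_col)}

for combo, subset in cohorts.items():
    print(combo, len(subset))
```

Every row lands in exactly one subset, so cohorts built this way partition the dataset.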
- get_queries()
Returns a dictionary with one key for each cohort’s name, where each key maps to the pandas query used to filter the instances that belong to that cohort.
- Returns
a dictionary containing the pandas queries used by each of the existing cohorts.
- Return type
dict
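As a rough sketch of how such a dictionary could be consumed, the snippet below applies one pandas query per cohort name with DataFrame.query. The dictionary here is a hand-written stand-in, not actual output of get_queries(); the cohort names, column names, and query strings are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 40, 67, 33], "sex": ["F", "M", "F", "M"]})

# Hypothetical stand-in for the dictionary returned by get_queries():
# cohort name -> pandas query string.
queries = {
    "cohort_0": "age < 35",
    "cohort_1": "age >= 35 and sex == 'F'",
}

# Apply each query to obtain the rows that belong to each cohort.
subsets = {name: df.query(query) for name, query in queries.items()}

for name, subset in subsets.items():
    print(name, list(subset.index))
```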
- get_subsets(X: Union[DataFrame, ndarray], y: Optional[Union[Series, ndarray]] = None, apply_transform: bool = False)
Fetches a dictionary with the subset associated with each of the existing cohorts, together with the matching subset of the label column (only if y is provided). If apply_transform is set to True, the subsets are transformed using each cohort’s pipeline before being returned (similar to calling the transform() method).
- Parameters
X – a dataset that has at least the columns used by the cohorts’ filters (this means that the dataset may also have other columns not used by the filters);
y – a dataset containing only the label column (y dataset). This parameter is optional, and it is useful when it is necessary to filter both a feature dataset (X) and a label dataset (y) and get the matching subsets of X and y;
apply_transform – boolean value indicating whether the transformation pipeline of each cohort should be applied. If True, this method behaves similarly to the transform() method, with the main difference being that this method always returns one subset per cohort, even if the cohorts are compatible with each other.
- Returns
a dictionary where the primary keys are the names of the cohorts, and the secondary keys are:
X: the subset of the features dataset;
y: the subset of the label dataset. This key is only returned if the y dataset is passed in the method’s call.
- Return type
dict
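The shape of the returned dictionary can be sketched with plain pandas. The helper sketch_get_subsets below is hypothetical and is not the library's implementation: it assumes each cohort is described by a pandas query string and leaves out the apply_transform step entirely.

```python
import pandas as pd


def sketch_get_subsets(X, queries, y=None):
    """Hypothetical helper that mimics the documented return structure."""
    subsets = {}
    for name, query in queries.items():
        # Rows of X that satisfy this cohort's filter.
        index = X.query(query).index
        entry = {"X": X.loc[index]}
        # The "y" key is only present when a label dataset is passed.
        if y is not None:
            entry["y"] = y.loc[index]
        subsets[name] = entry
    return subsets


X = pd.DataFrame({"age": [25, 40, 67, 33], "income": [30, 55, 20, 48]})
y = pd.Series([0, 1, 1, 0], name="label")
queries = {"young": "age < 35", "older": "age >= 35"}  # assumed filters

with_labels = sketch_get_subsets(X, queries, y)
without_labels = sketch_get_subsets(X, queries)
print(with_labels["young"]["X"])
print(with_labels["young"]["y"])
```

X may carry columns the filters never mention (income here); they are simply passed through to each subset.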
- save_cohorts(json_file_names: Optional[list] = None)
Saves the definition of each cohort in its own JSON file, which means that one JSON file is created for each cohort. The names of the JSON files are provided through the json_file_names parameter. If no list of file names is given (json_file_names = None), a default list of JSON file names is created.
- Parameters
json_file_names – a list of JSON file names. The first file name is used to save the first cohort, and so on. If not provided, a default list of file names is created.
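The one-file-per-cohort behavior can be sketched with the standard library alone. Everything specific in this snippet is an assumption: the helper sketch_save_cohorts, the default naming pattern, and the JSON payload are hypothetical and do not reflect the file format or default names this library actually uses.

```python
import json
import tempfile
from pathlib import Path


def sketch_save_cohorts(cohorts, out_dir, json_file_names=None):
    """Hypothetical helper: writes one JSON file per cohort."""
    if json_file_names is None:
        # Assumed default naming; the library's real defaults may differ.
        json_file_names = [f"cohort_{i}.json" for i in range(len(cohorts))]
    if len(json_file_names) != len(cohorts):
        raise ValueError("Expected one file name per cohort.")

    paths = []
    # The first file name saves the first cohort, and so on.
    for (name, definition), file_name in zip(cohorts.items(), json_file_names):
        path = Path(out_dir) / file_name
        with open(path, "w") as file:
            json.dump({"name": name, "definition": definition}, file)
        paths.append(path)
    return paths


# Placeholder cohort definitions (plain strings, for illustration only).
cohorts = {"young": "age < 35", "older": "age >= 35"}

with tempfile.TemporaryDirectory() as tmp_dir:
    paths = sketch_save_cohorts(cohorts, tmp_dir)
    file_names = [path.name for path in paths]
    with open(paths[0]) as file:
        first_saved = json.load(file)

print(file_names)
print(first_saved)
```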
Class Diagram