CohortHandler

class raimitigations.cohort.CohortHandler(cohort_def: Optional[Union[dict, list, str]] = None, cohort_col: Optional[list] = None, cohort_json_files: Optional[list] = None, df: Optional[DataFrame] = None, label_col: Optional[str] = None, X: Optional[DataFrame] = None, y: Optional[DataFrame] = None, verbose: bool = True)

Bases: DataProcessing

Abstract class that manages multiple cohorts.

Parameters

cohort_def – a list of cohort definitions or a dictionary of cohort definitions. A cohort condition is the same variable received by the cohort_definition parameter of the CohortDefinition class. When using a list of cohort definitions, the cohorts are named automatically. When using a dictionary, each key is the cohort’s name and the value assigned to that key holds the cohort’s conditions. This parameter can’t be used together with the cohort_col parameter: only one of these two parameters may be used at a time;
cohort_col – a list of column names or indices, from which one cohort is created for each unique combination of values in these columns. This parameter can’t be used together with the cohort_def parameter: only one of these two parameters may be used at a time;
cohort_json_files – a list with the names of the JSON files that contain the definition of each cohort. Each cohort is saved in a single JSON file, so the length of cohort_json_files should be equal to the number of cohorts to be used;
df – the data frame to be used during the fit method. This data frame must contain all the features, including the label column (specified in the label_col parameter). This parameter is mandatory if label_col is also provided. The user can instead provide this dataset (along with label_col) when calling the fit() method. If df is provided during the class instantiation, it is not necessary to provide it again when calling fit(). It is also possible to use X and y instead of df and label_col, but one of the pairs (X, y) or (df, label_col) must be passed either during the class instantiation or when calling fit();
label_col – the name or index of the label column. This parameter is mandatory if df is provided;
X – contains only the features of the original dataset, that is, does not contain the label column. This is useful if the user has already separated the features from the label column prior to calling this class. This parameter is mandatory if y is provided;
y – contains only the label column of the original dataset. This parameter is mandatory if X is provided;
verbose – indicates whether internal messages should be printed or not.
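
The behavior described for cohort_col (one cohort per unique combination of values) can be pictured with plain pandas. The sketch below does not call raimitigations and is not the library’s implementation. The column names and data are made up for illustration, and the tuple keys are simply what pandas produces, not the names the library assigns to cohorts.

```python
import pandas as pd

# Toy dataset; the column names are hypothetical
df = pd.DataFrame(
    {
        "sector": ["s1", "s1", "s2", "s2", "s1"],
        "country": ["A", "B", "A", "A", "A"],
        "label": [0, 1, 0, 1, 1],
    }
)

# Columns that would be passed as cohort_col
cohort_col = ["sector", "country"]

# One group per unique combination of values in the chosen columns
cohorts = {key: subset for key, subset in df.groupby(cohort_col)}
cohort_sizes = {key: len(subset) for key, subset in cohorts.items()}
print(cohort_sizes)
```

Here three combinations occur in the data, so three groups are formed. Combinations that never occur (such as sector "s2" with country "B") produce no group.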

get_queries()

Returns a dictionary with one key for each cohort’s name, where each key maps to the pandas query used for filtering the instances that belong to the cohort.

Returns: a dictionary containing the pandas queries used by each of the existing cohorts.
Return type: dict
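
As a rough illustration of how a dictionary of pandas queries can be used, the sketch below applies each query with DataFrame.query. The cohort names and query strings are hypothetical and written by hand. They are not output produced by get_queries(), whose exact query syntax is not shown in this section.

```python
import pandas as pd

# Toy dataset; the column names are hypothetical
df = pd.DataFrame(
    {"age": [25, 40, 61, 33], "sector": ["s1", "s2", "s1", "s2"]}
)

# Hand-written stand-in for a {cohort name: pandas query} dictionary
queries = {
    "cohort_0": "age < 35",
    "cohort_1": "age >= 35 and sector == 's1'",
}

# Filter the instances that belong to each cohort
filtered = {name: df.query(query) for name, query in queries.items()}

for name, subset in filtered.items():
    print(name, list(subset.index))
```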

get_subsets(X: Union[DataFrame, ndarray], y: Optional[Union[Series, ndarray]] = None, apply_transform: bool = False)

Fetches a dictionary with the subset associated with each of the existing cohorts, along with its label column (only if y is provided). If apply_transform is set to True, then the subsets are transformed using each cohort’s pipeline before being returned (similar to calling the transform() method).

Parameters

X – a dataset that has at least the columns used by the cohorts’ filters (this means that the dataset may also have other columns not used by the filters);
y – a dataset containing only the label column (y dataset). This parameter is optional, and it is useful when it is necessary to filter a feature dataset (X) and a label dataset (y), and get a list of subsets from X and y;
apply_transform – boolean value indicating whether to apply the transformation pipeline used for each cohort. If True, this method will behave similarly to the transform() method, with the main difference being that this method will always return a list of subsets, even if the cohorts are compatible with each other.

Returns

a dictionary where the primary keys are the names of the cohorts, and the secondary keys are:

X: the subset of the features dataset;

y: the subset of the label dataset. This key will only be returned if the y dataset is passed in the method’s call.

Return type

dict
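
The nested structure described above can be sketched with plain pandas. This is only an illustration of the return shape (cohort name, then "X" and "y"). It does not call get_subsets(), and the cohort names, queries, and data are hypothetical.

```python
import pandas as pd

# Toy feature and label datasets; names are hypothetical
X = pd.DataFrame({"age": [25, 40, 61, 33]})
y = pd.Series([0, 1, 1, 0], name="label")

# Hand-written filters standing in for the cohorts' definitions
queries = {"young": "age < 35", "older": "age >= 35"}

subsets = {}
for name, query in queries.items():
    # Rows of X that satisfy the cohort's filter
    index = X.query(query).index
    # Primary key: cohort name; secondary keys: "X" and "y"
    subsets[name] = {"X": X.loc[index], "y": y.loc[index]}

print(subsets["young"]["X"])
print(subsets["young"]["y"])
```

If no label dataset were passed, the analogous structure would hold only the "X" key for each cohort.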

save_cohorts(json_file_names: Optional[list] = None)

Saves the definition of each cohort in its respective JSON file, which means that one JSON file is created for each cohort. The names of the JSON files are provided through the json_file_names parameter. If no list of JSON file names is provided (json_file_names = None), then a default list of JSON file names is created.

Parameters: json_file_names – a list of JSON file names. The first file name is used to save the first cohort, and so on. If not provided, a default list of file names is created.

Class Diagram