Cohort Utils
- raimitigations.cohort.utils.fetch_cohort_results(X: Union[DataFrame, ndarray], y_true: Union[DataFrame, ndarray], y_pred: Union[DataFrame, ndarray], cohort_def: Optional[Union[dict, list, str]] = None, cohort_col: Optional[list] = None, regression: bool = False, fixed_th: Optional[Union[float, dict, bool]] = None, shared_th: bool = False, return_th_dict: bool = False)
Computes several classification or regression metrics for a given array of predictions, both for the entire dataset and for a set of cohorts. The cohorts used to compute these metrics are defined by either the cohort_def or the cohort_col parameter (but not both). These are the same parameters used in the constructor of the CohortManager class. Each metric is computed separately for the predictions belonging to each of the existing cohorts.
- Parameters
X – the dataset containing the feature columns. This dataset is used to filter the instances that belong to each cohort;
y_true – the true labels of the instances in the X dataset;
y_pred – the predicted labels;
cohort_def – a list of cohort definitions, a dictionary of cohort definitions, a CohortHandler object, or the path to a JSON file containing the definition of all cohorts. For more details on this parameter, please check the CohortManager class;
cohort_col – a list of column names or indices, from which one cohort is created for each unique combination of values for these columns. This parameter is ignored if cohort_def is provided;
regression – if True, regression metrics are computed. If False, only classification metrics are computed;
fixed_th – if None, the thresholds will be computed using the ROC curve. If a single float is provided, then this threshold is used for all cohorts (in this case, shared_th will be ignored). If fixed_th is a dictionary, it must have one key for each cohort name (including one key named “all” for the entire dataset), and each key must be associated with the threshold that should be used for that cohort. If fixed_th is True, then the thresholds used will be the ones found in the DecoupledClass object passed in cohort_def (if cohort_def is anything but a DecoupledClass object, an error will be raised). If fixed_th is a dictionary and shared_th is True, then the only threshold used will be fixed_th["all"]. This parameter is ignored if regression is True;
shared_th – if True, the binarization of the predictions is made using the same threshold for all cohorts. The threshold used is the one computed for the entire dataset. If False, a different threshold is computed for each cohort;
return_th_dict – if True, return a dictionary that maps the best threshold found for each cohort. This parameter is ignored if regression is True;
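The interaction between fixed_th and shared_th described above can be easier to follow as code. The following is a minimal sketch in plain Python, not the library's implementation: the helper name resolve_fixed_thresholds and the cohort names are hypothetical, and the fixed_th=True case (thresholds taken from a DecoupledClass object) is deliberately left out.

```python
from typing import Dict, List, Optional, Union

ALL = "all"  # key the documentation uses for the entire dataset


def resolve_fixed_thresholds(
    cohort_names: List[str],
    fixed_th: Optional[Union[float, Dict[str, float]]] = None,
    shared_th: bool = False,
) -> Optional[Dict[str, float]]:
    """Hypothetical helper: map "all" and each cohort to its threshold.

    Returns None when fixed_th is None, i.e. when the thresholds would be
    estimated from the ROC curve instead of being fixed by the caller.
    """
    names = [ALL] + list(cohort_names)
    if fixed_th is None:
        return None
    if isinstance(fixed_th, bool):
        # fixed_th=True depends on a DecoupledClass object; not covered here
        raise NotImplementedError("fixed_th=True is outside this sketch")
    if isinstance(fixed_th, (int, float)):
        # a single number applies to every cohort, so shared_th is irrelevant
        return {name: float(fixed_th) for name in names}
    # dictionary case: one key per cohort plus the "all" key
    missing = [name for name in names if name not in fixed_th]
    if missing:
        raise ValueError(f"fixed_th is missing thresholds for: {missing}")
    if shared_th:
        # only the threshold of the entire dataset is used
        return {name: fixed_th[ALL] for name in names}
    return {name: fixed_th[name] for name in names}


cohorts = ["cohort_0", "cohort_1"]  # hypothetical cohort names
per_cohort = {"all": 0.5, "cohort_0": 0.4, "cohort_1": 0.6}

print(resolve_fixed_thresholds(cohorts, 0.3))
print(resolve_fixed_thresholds(cohorts, per_cohort))
print(resolve_fixed_thresholds(cohorts, per_cohort, shared_th=True))
```

In this sketch, a dictionary combined with shared_th=True collapses every cohort to the "all" threshold, which mirrors the rule stated for fixed_th["all"].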
- Returns
a dataframe containing the metrics for the entire dataset and for all the defined cohorts. If return_th_dict is True (and regression is False), return a tuple with two values: (df_metrics, th_dict), where df_metrics is the metrics dataframe and th_dict is a dictionary that maps the best threshold found for each cohort;
- Return type
pd.DataFrame or a tuple (pd.DataFrame, dict)
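
To illustrate what computing a metric "for the entire dataset and for a set of cohorts" means, here is a small conceptual sketch in plain Python. It is not how fetch_cohort_results is implemented: the function names, the sample data, and the choice of accuracy as the only metric are assumptions made for illustration, and the result is a plain dictionary rather than a dataframe. It builds one cohort per unique combination of values in the chosen columns, as described for cohort_col.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def accuracy(y_true: Sequence, y_pred: Sequence) -> float:
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy_by_cohort(
    rows: List[dict],
    y_true: Sequence,
    y_pred: Sequence,
    cohort_col: List[str],
) -> Dict[Hashable, float]:
    """Hypothetical helper: accuracy for "all" and for each cohort."""
    # group row indices by the combination of values in cohort_col
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[tuple(row[c] for c in cohort_col)].append(i)

    results: Dict[Hashable, float] = {"all": accuracy(y_true, y_pred)}
    for key, idx in groups.items():
        # the metric is computed only on the instances of this cohort
        results[key] = accuracy([y_true[i] for i in idx], [y_pred[i] for i in idx])
    return results


# made-up data: four instances and one cohort column
rows = [{"sex": "F"}, {"sex": "F"}, {"sex": "M"}, {"sex": "M"}]
y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]

print(accuracy_by_cohort(rows, y_true, y_pred, cohort_col=["sex"]))
# {'all': 0.75, ('F',): 1.0, ('M',): 0.5}
```

The example shows why per-cohort results matter: an overall accuracy of 0.75 hides the fact that one cohort is predicted perfectly while the other is at 0.5.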