CohortDefinition
- class raimitigations.cohort.CohortDefinition(cohort_definition: Optional[Union[list, str]] = None, name: str = 'cohort')
Implements an interface for building and filtering cohorts from any dataset. This class is not associated with any specific dataset. It simply converts a set of conditions into a query. Based on this query, it can fetch the cohort that satisfies these conditions.
- Parameters
cohort_definition – a list of conditions, or a string containing the path to a JSON file that holds the list of conditions. A basic condition is a list with three values:
Column: name or index of the column being analyzed
Inner Operator: one of the following operators: '==', '!=', '>', '>=', '<', '<=', or 'range'
Value: the value used in the condition. It can be a numerical or categorical value.
An 'and' or 'or' operator may be placed between two basic conditions. Complex conditions may be created by concatenating multiple conditions;
name – a string indicating the name of the cohort. This parameter may be accessed later using the name attribute.
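To illustrate the condition format described above, the following sketch writes one condition list and turns it into a query string. The helper conditions_to_query and the column names age and sector are hypothetical and not part of raimitigations. The 'range' operator and nested conditions are left out, and the query format the library builds internally may differ.

```python
# Hypothetical helper, not part of raimitigations: it only illustrates
# how a flat list of basic conditions could be read as a query string.
COMPARISON_OPERATORS = {"==", "!=", ">", ">=", "<", "<="}


def conditions_to_query(conditions: list) -> str:
    """Join basic [column, operator, value] conditions with 'and'/'or'."""
    parts = []
    for item in conditions:
        if isinstance(item, str):
            # A bare string between two conditions is a connector.
            if item not in ("and", "or"):
                raise ValueError(f"unknown connector: {item!r}")
            parts.append(item)
        else:
            column, operator, value = item
            if operator not in COMPARISON_OPERATORS:
                raise ValueError(f"unsupported operator: {operator!r}")
            # repr() keeps the quotes around categorical (string) values.
            parts.append(f"({column} {operator} {value!r})")
    return " ".join(parts)


# Assumed example columns: 'age' (numerical) and 'sector' (categorical).
cohort_definition = [["age", ">", 26], "and", ["sector", "==", "public"]]
query = conditions_to_query(cohort_definition)
```

The same cohort_definition list is the kind of value the constructor's first parameter is documented to accept.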
- check_valid_df(df: DataFrame)
Checks if a dataset contains all columns used by the cohort’s query. If at least one of the columns used in the query is not present, an error is raised.
- Parameters
df – the pandas DataFrame to check.
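The column check can be pictured with the sketch below. It is not the library's code: check_columns_present is a hypothetical function, the column names are made up, and raising ValueError is an assumption, since this section does not say which error type check_valid_df() raises.

```python
import pandas as pd


def check_columns_present(df: pd.DataFrame, query_columns: list) -> None:
    """Raise ValueError if any column used by a query is missing from df."""
    missing = [col for col in query_columns if col not in df.columns]
    if missing:
        # Assumed error type; the real method may raise something else.
        raise ValueError(f"columns missing from the dataset: {missing}")


df = pd.DataFrame({"age": [22, 35], "sector": ["public", "private"]})
check_columns_present(df, ["age", "sector"])  # all columns exist, no error

try:
    check_columns_present(df, ["age", "income"])  # 'income' is absent
    missing_detected = False
except ValueError:
    missing_detected = True
```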
- create_query_remaining_instances_cohort(prev_conditions: list)
Creates the query for the cohort that handles all instances that don’t belong to any other cohort. This query is built by negating the condition list of each of the other cohorts and concatenating the negations using the “and” operator.
- Parameters
prev_conditions – a list of the condition lists used by other cohorts. Each sub-list here follows the same pattern as the cohort_definition parameter of the CohortDefinition class.
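The negate-and-join step can be sketched as follows. For brevity, this assumes each of the other cohorts has already been rendered as a query string, whereas the real method receives condition lists. The function negate_and_join and the example queries are hypothetical.

```python
def negate_and_join(prev_queries: list) -> str:
    """Negate each cohort query and join the negations with 'and'."""
    return " and ".join(f"not ({query})" for query in prev_queries)


# Assumed queries of two previously defined cohorts.
prev_queries = [
    "age > 26",
    "age <= 26 and sector == 'public'",
]
remaining_query = negate_and_join(prev_queries)
```

Wrapping each query in parentheses before negating it keeps a multi-condition cohort from being only partially negated.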
- get_cohort_subset(X: DataFrame, y: Optional[DataFrame] = None, index_used: Optional[list] = None, return_index_list: bool = False)
Filters a dataset to fetch only the instances that satisfy the conditions of the current cohort. If the current cohort doesn’t have any conditions (cohort_definition = None), the cohort comprises all instances that don’t belong to any other cohort. In this case, the cohort subset will be all instances whose index does not belong to any other cohort. The list of indices used in other cohorts is given by the index_used parameter. Finally, the filtered dataset is returned.
- Parameters
X – a data frame containing the features of a dataset that should be filtered;
y – the label dataset (y) that should also be filtered. This y dataset should have the same number of rows as the X parameter. The filtering is performed over the X dataset, and the same indices selected in X are also selected in the y dataset;
index_used – a list of all indices of the dataset X that already belong to some other cohort. This is used when the cohort doesn’t have a valid list of conditions. In that case, this cohort is the group of all instances that don’t belong to any other cohort;
return_index_list – if True, return the list of indices that belong to the cohort. If False, this list isn’t returned;
- Returns
this method can return 4 different values based on its parameters:
when y is provided and return_index_list == True, the following tuple is returned: (subset, subset_y, index_list)
when y is provided and return_index_list == False, the following tuple is returned: (subset, subset_y)
when y is not provided and return_index_list == True, the following tuple is returned: (subset, index_list)
when y is not provided and return_index_list == False, the following dataframe is returned: subset
Here, subset is the subset of X that belongs to the cohort, subset_y is the label dataset associated with subset, and index_list is the list of indices of instances that belong to the cohort;
- Return type
tuple or pd.DataFrame
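The four return shapes, and the fallback to index_used when there are no conditions, can be mimicked with plain pandas. This is a sketch under assumptions, not the library's implementation: cohort_subset is a hypothetical stand-in that takes a ready-made query string, and the data is invented.

```python
from typing import Optional

import pandas as pd


def cohort_subset(
    X: pd.DataFrame,
    query: Optional[str],
    y: Optional[pd.DataFrame] = None,
    index_used: Optional[list] = None,
    return_index_list: bool = False,
):
    """Hypothetical stand-in mirroring the documented return shapes."""
    if query is None:
        # No conditions: keep every row not already claimed by another cohort.
        subset = X[~X.index.isin(index_used or [])]
    else:
        subset = X.query(query)
    index_list = list(subset.index)
    result = [subset]
    if y is not None:
        # Select the same rows from the label dataset.
        result.append(y.loc[index_list])
    if return_index_list:
        result.append(index_list)
    return result[0] if len(result) == 1 else tuple(result)


X = pd.DataFrame({"age": [22, 35, 41], "sector": ["public", "private", "public"]})
y = pd.DataFrame({"label": [0, 1, 1]})

# y provided and return_index_list=True: a 3-tuple comes back.
subset, subset_y, index_list = cohort_subset(X, "age > 26", y, return_index_list=True)

# No conditions: the cohort is whatever the first cohort did not claim.
remaining = cohort_subset(X, None, index_used=index_list)
```

Here the index list from the first call plays the role of index_used in the second call, so together the two cohorts cover every row exactly once.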
- require_remaining_index()
Returns True if the cohort requires the index_used parameter for the get_cohort_subset() method. When this happens, this means that this cohort was built with a cohort_definition parameter set to None.
- Returns
True if the cohort requires the index_used parameter for the get_cohort_subset() method. False otherwise.
- Return type
bool
- save(json_file: str)
Saves the conditions used by the cohort into a JSON file.
- Parameters
json_file – the path of the JSON file to be saved.
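A condition list made of strings, numbers, and nested lists maps directly onto JSON, as the round trip below shows. This section does not specify the exact file layout that save() writes, so treating the file as a plain serialized list is an assumption. The file name cohort.json and the column names are illustrative.

```python
import json
import os
import tempfile

# Assumed example condition list, in the cohort_definition format.
cohort_definition = [["age", ">", 26], "and", ["sector", "==", "public"]]

with tempfile.TemporaryDirectory() as tmp_dir:
    json_file = os.path.join(tmp_dir, "cohort.json")
    # Write the conditions to disk as a JSON array.
    with open(json_file, "w") as file:
        json.dump(cohort_definition, file)
    # Read them back to confirm nothing is lost in the round trip.
    with open(json_file) as file:
        loaded_definition = json.load(file)
```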