CohortDefinition

class raimitigations.cohort.CohortDefinition(cohort_definition: Optional[Union[list, str]] = None, name: str = 'cohort')

Implements an interface for building and filtering cohorts from any dataset. This class is not associated with any specific dataset. It converts a set of conditions into a query and, based on this query, fetches the cohort that satisfies these conditions.

Parameters

cohort_definition –
a list of conditions, or a string containing the path to a JSON file that holds such a list. A basic condition is a list with three values:
1. Column: name or index of the column being analyzed
2. Inner Operator: one of the following operators: '==', '!=', '>', '>=', '<', '<=', or 'range'
3. Value: the value used in the condition. It can be a numerical or categorical value.
An 'and' or 'or' operator may be placed between two basic conditions. Complex conditions may be created by concatenating multiple conditions;
name – a string indicating the name of the cohort. This parameter may be accessed later using the name attribute.
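
The condition format described above can be illustrated with a small pure-Python sketch. This is not the library's implementation: the helper names `is_basic` and `matches` are hypothetical, the 'range' operator is left out, nested condition lists are assumed to be allowed, and 'and' / 'or' are assumed to be applied left to right.

```python
import operator

# Comparison operators listed in the documentation ('range' is omitted in this sketch).
OPERATORS = {
    "==": operator.eq, "!=": operator.ne,
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
}

def is_basic(cond):
    """A basic condition is [column, operator, value] with a known operator."""
    return (isinstance(cond, list) and len(cond) == 3
            and not isinstance(cond[0], list)
            and isinstance(cond[1], str) and cond[1] in OPERATORS)

def matches(cond, row):
    """Evaluate a condition list against one row (a dict of column -> value)."""
    if is_basic(cond):
        column, op, value = cond
        return OPERATORS[op](row[column], value)
    # Composite: sub-conditions interleaved with 'and' / 'or' (assumed left to right).
    result = matches(cond[0], row)
    for i in range(1, len(cond), 2):
        nxt = matches(cond[i + 1], row)
        result = (result and nxt) if cond[i] == "and" else (result or nxt)
    return result

# Two basic conditions joined by 'and'.
cohort_definition = [["age", ">", 40], "and", ["sex", "==", "M"]]
in_cohort = matches(cohort_definition, {"age": 52, "sex": "M"})
```

In the actual class, a list such as `cohort_definition` would be passed to the constructor and turned into a query over a data frame rather than evaluated row by row.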

check_valid_df(df: DataFrame)

Checks if a dataset contains all columns used by the cohort’s query. If at least one of the columns used in the query is not present, an error is raised.

Parameters: df – a pandas DataFrame.
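
The kind of check described here can be sketched in plain Python as follows. The helpers `columns_used` and `check_columns` are hypothetical, and raising `ValueError` is an assumption; the documentation only states that an error is raised.

```python
def columns_used(cond):
    """Collect the columns referenced by a (possibly nested) condition list."""
    if len(cond) == 3 and not isinstance(cond[0], list):
        return {cond[0]}             # basic condition: [column, operator, value]
    cols = set()
    for item in cond:
        if isinstance(item, list):   # skip the 'and' / 'or' strings
            cols |= columns_used(item)
    return cols

def check_columns(cond, available):
    """Raise if any column used by the conditions is absent (error type assumed)."""
    missing = columns_used(cond) - set(available)
    if missing:
        raise ValueError(f"columns missing from the dataset: {sorted(missing)}")

conditions = [["age", ">", 40], "and", ["sex", "==", "M"]]
check_columns(conditions, ["age", "sex", "income"])   # passes silently
```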

create_query_remaining_instances_cohort(prev_conditions: list)

Creates the query for the cohort that handles all instances that do not belong to any other cohort. This query is built by negating the condition list of every other cohort and concatenating the negated conditions using the “and” operator.

Parameters: prev_conditions – a list of the condition lists used by the other cohorts. Each sub-list follows the same pattern as the ‘cohort_definition’ parameter of the CohortDefinition class.
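
The negate-and-join idea can be sketched as below. The helpers `to_query` and `remaining_query` are hypothetical, and the rendered string syntax is only an assumption for illustration; the query format the library actually produces may differ, and the 'range' operator is not handled here.

```python
def to_query(cond):
    """Render a condition list as a parenthesized boolean expression string."""
    if len(cond) == 3 and not isinstance(cond[0], list):
        column, op, value = cond
        return f"({column} {op} {value!r})"
    # Composite: sub-conditions interleaved with 'and' / 'or'.
    parts = [to_query(c) if isinstance(c, list) else c for c in cond]
    return "(" + " ".join(parts) + ")"

def remaining_query(prev_conditions):
    """Negate every other cohort's conditions and join the negations with 'and'."""
    return " and ".join(f"not {to_query(c)}" for c in prev_conditions)

# Condition lists of two existing cohorts.
prev_conditions = [
    [["age", "<", 30]],
    [["age", ">=", 30], "and", ["sex", "==", "M"]],
]
query = remaining_query(prev_conditions)
```

An instance satisfies `query` only if it fails the conditions of every previously defined cohort, which is what makes the resulting cohort the "remaining instances" group.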

get_cohort_subset(X: DataFrame, y: Optional[DataFrame] = None, index_used: Optional[list] = None, return_index_list: bool = False)

Filters a dataset to fetch only the instances that satisfy the conditions of the current cohort. If the current cohort has no conditions (cohort_definition = None), the cohort comprises all instances that do not belong to any other cohort. In this case, the cohort subset consists of all instances whose index is not used by any other cohort; the list of indices used by the other cohorts is given by the index_used parameter. The filtered dataset is then returned.

Parameters

X – a data frame containing the features of a dataset that should be filtered;
y – the label dataset (y) that should also be filtered. This y dataset should have the same number of rows as the X parameter. The filtering is performed over the X dataset, and the same indices selected in X are also selected in the y dataset;
index_used – a list of all indices of the dataset X that already belong to some other cohort. This is used when the cohort doesn’t have a valid list of conditions. In that case, this cohort is the group of all instances that do not belong to any other cohort;
return_index_list – if True, return the list of indices that belong to the cohort. If False, this list isn’t returned;

Returns

this method can return four different values, depending on its parameters:

when y is provided and return_index_list == True, the following tuple is returned: (subset, subset_y, index_list)

when y is provided and return_index_list == False, the following tuple is returned: (subset, subset_y)

when y is not provided and return_index_list == True, the following tuple is returned: (subset, index_list)

when y is not provided and return_index_list == False, the following dataframe is returned: subset

Here, subset is the subset of X that belongs to the cohort, subset_y is the label dataset associated with subset, and index_list is the list of indices of the instances that belong to the cohort;

Return type

tuple or pd.DataFrame
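
The four return shapes, and the index complement used when a cohort has no conditions, can be mimicked with two small helpers. This is a sketch only: `pack_result` and `remaining_indices` are hypothetical names, and plain Python values stand in for the data frames.

```python
def pack_result(subset, subset_y=None, index_list=None, return_index_list=False):
    """Assemble the return value in the four shapes described above."""
    out = [subset]
    if subset_y is not None:       # labels were provided
        out.append(subset_y)
    if return_index_list:          # caller asked for the indices
        out.append(index_list)
    return out[0] if len(out) == 1 else tuple(out)

def remaining_indices(all_indices, index_used):
    """Indices that do not belong to any other cohort, in their original order."""
    used = set(index_used)
    return [i for i in all_indices if i not in used]

# A cohort without conditions takes whatever the other cohorts left over.
leftover = remaining_indices([0, 1, 2, 3], index_used=[1, 3])
```

Because the shape of the result changes with the arguments, callers need to unpack it according to the values of y and return_index_list they passed in.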

require_remaining_index()

Returns True if the cohort requires the index_used parameter for the get_cohort_subset() method. This is the case when the cohort was built with the cohort_definition parameter set to None.

Returns: True if the cohort requires the index_used parameter for the get_cohort_subset() method. False otherwise.
Return type: bool

save(json_file: str)

Saves the conditions used by the cohort into a JSON file.

Parameters: json_file – the path of the JSON file to be saved.
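
A condition list consists only of nested lists, strings, and numbers, so it survives a JSON round trip unchanged. The sketch below uses only the standard library and assumes that the file contains the condition list itself; the exact layout written by save() is not specified here.

```python
import json
import os
import tempfile

conditions = [["age", ">", 40], "and", ["sex", "==", "M"]]

# Write the condition list to a JSON file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "cohort.json")
with open(path, "w") as f:
    json.dump(conditions, f)

# Read it back; the loaded list has the same structure as the original.
with open(path) as f:
    loaded = json.load(f)
```

A path to such a file is the kind of string the cohort_definition parameter accepts in place of an in-memory list.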

Examples

Defining a Cohort