Defining a Cohort

This library allows users to apply certain mitigations to a specific cohort instead of applying them to the entire dataset. This is useful when, for example, two cohorts have very different label distributions and we want to rebalance each cohort individually rather than rebalancing the dataset as a whole.
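To make the motivation concrete, here is a minimal standard-library sketch. The cohort names and labels are made-up toy values, not data from this notebook. It shows how a label distribution that looks balanced overall can hide strongly skewed cohorts:

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (cohort, label) pairs invented for illustration.
rows = [
    ("A", 1), ("A", 1), ("A", 1), ("A", 0),
    ("B", 0), ("B", 0), ("B", 0), ("B", 1),
]

# Label counts over the whole dataset.
overall = Counter(label for _, label in rows)

# Label counts computed separately for each cohort.
per_cohort = defaultdict(Counter)
for cohort, label in rows:
    per_cohort[cohort][label] += 1

print("overall:", dict(overall))
for cohort, counts in sorted(per_cohort.items()):
    print(cohort, dict(counts))
```

Here the full dataset has four instances of each label, while cohort A is mostly label 1 and cohort B is mostly label 0, so rebalancing only at the dataset level would leave both cohorts imbalanced.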

In this notebook, we’ll start covering this topic by showing how to define a single cohort. The CohortDefinition class is responsible for handling a single cohort, and it is used internally by the CohortManager class to apply mitigations to each cohort separately, as described above.

For starters, let’s create a very simple dataset that we can use throughout this notebook.

[1]:
import numpy as np
import pandas as pd
from raimitigations.cohort import CohortDefinition

df = pd.DataFrame({
    "race":     ['elf', 'orc', 'halfling', 'human', 'halfling', 'orc', 'elf', 'orc', 'human', 'orc'],
    "height(m)":[1.6,   1.95,  1.40,       1.75,     1.53,      2.10,   1.85,  1.79,  1.65,   np.nan],
    "past_score":[85,   59,    19,          89,      91,        79,      45,   82,    47,     87  ],
    "score":    [90,    43,    29,          99,      85,        73,      58,   94,    37,     51]
})
df
[1]:
race height(m) past_score score
0 elf 1.60 85 90
1 orc 1.95 59 43
2 halfling 1.40 19 29
3 human 1.75 89 99
4 halfling 1.53 91 85
5 orc 2.10 79 73
6 elf 1.85 45 58
7 orc 1.79 82 94
8 human 1.65 47 37
9 orc NaN 87 51

Creating a CohortDefinition object

The CohortDefinition class solves the following task: given a set of conditions passed to its constructor, it provides methods that filter a dataset and return the subset containing only the desired cohort. This is a very lightweight class, and its main goal is to make more complex cohort management tasks easier (check out the documentation for the CohortManager class for more details).

Given our simple fantasy dataset, let’s try creating a cohort where we have only:

  • Elves with a height greater than or equal to 1.8 meters

  • Orcs with a height greater than or equal to 1.8 meters

The CohortDefinition class has its own language for expressing a set of conditions. The conditions must be formatted as a list of lists, where:

  • The innermost lists must always contain 3 elements:

    1. Column name: the name of the column the condition applies to. If the dataset has no column names, use the column index instead, but remember to provide the column index as a string;

    2. Inner Operator: the operator that relates the column (given by the first element) to the value provided (given by the third and last element in the list). The allowed inner operators are:

      • Equal (‘==’): if the value provided is a single value, then this condition is set as “all instances where column_name is equal to value”. However, if value is a list, then the condition is set as “all instances where column_name is one of the values in the value list”;

      • Different (‘!=’): similar to the equal operator: accepts value as a single value or a list of values;

      • Greater (‘>’): value must be a single value;

      • Greater or Equal (‘>=’): value must be a single value;

      • Lesser (‘<’): value must be a single value;

      • Lesser or Equal (‘<=’): value must be a single value;

      • Range (‘range’): value must be a list with exactly 2 values: the minimum and maximum values defining the boundaries of the range, in that order. The range is inclusive at both ends (value[0] <= column_name <= value[1]).

    3. Value: the value used in the condition. It can be a numerical or categorical value. If the value provided is another column name, then a comparison between two columns is performed (we’ll check an example ahead).

  • Between two inner lists, there must be an Outer Operator, which can be either ‘and’ or ‘or’.

  • With this basic set of rules, we can create complex conditions for a cohort by concatenating multiple simple conditions (inner lists) using the ‘and’ and ‘or’ operators.
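The rules above can be sketched as a small recursive evaluator. This is not the library’s implementation: it is a standard-library illustration that works on rows stored as dictionaries, and the helper names (`check_simple`, `evaluate`, `select`) are hypothetical. It also assumes that outer operators at the same nesting level are applied left to right, and that a string value matching a column name means a column-to-column comparison.

```python
import math
import operator

# Inner comparison operators that take a single value.
COMPARISONS = {">": operator.gt, ">=": operator.ge,
               "<": operator.lt, "<=": operator.le}

def is_nan(x):
    return isinstance(x, float) and math.isnan(x)

def same(a, b):
    # Treat NaN as matching NaN so that missing values can be selected with '=='.
    return (is_nan(a) and is_nan(b)) or a == b

def check_simple(row, column, op, value):
    """Evaluate one [column, operator, value] triple against a row (a dict)."""
    left = row[column]
    # Assumption: a string value that names another column means
    # "compare against that column's value in the same row".
    if isinstance(value, str) and value in row:
        value = row[value]
    if op in ("==", "!="):
        candidates = value if isinstance(value, list) else [value]
        matched = any(same(left, v) for v in candidates)
        return matched if op == "==" else not matched
    if op == "range":
        low, high = value
        return low <= left <= high  # inclusive at both ends
    if op in COMPARISONS:
        return COMPARISONS[op](left, value)
    raise ValueError(f"unknown inner operator: {op!r}")

def evaluate(row, conditions):
    """Evaluate a (possibly nested) condition list against a row."""
    if isinstance(conditions[0], str):
        # Innermost list: [column, operator, value].
        column, op, value = conditions
        return check_simple(row, column, op, value)
    # Otherwise: sub-condition, outer operator, sub-condition, ...
    result = evaluate(row, conditions[0])
    for outer_op, nxt in zip(conditions[1::2], conditions[2::2]):
        rhs = evaluate(row, nxt)
        if outer_op == "and":
            result = result and rhs
        elif outer_op == "or":
            result = result or rhs
        else:
            raise ValueError(f"unknown outer operator: {outer_op!r}")
    return result

# The fantasy dataset from this notebook, one dict per row.
columns = ["race", "height(m)", "past_score", "score"]
data = [
    ("elf", 1.60, 85, 90), ("orc", 1.95, 59, 43), ("halfling", 1.40, 19, 29),
    ("human", 1.75, 89, 99), ("halfling", 1.53, 91, 85), ("orc", 2.10, 79, 73),
    ("elf", 1.85, 45, 58), ("orc", 1.79, 82, 94), ("human", 1.65, 47, 37),
    ("orc", float("nan"), 87, 51),
]
rows = [dict(zip(columns, values)) for values in data]

def select(conditions):
    """Return the indices of the rows that satisfy the conditions."""
    return [i for i, row in enumerate(rows) if evaluate(row, conditions)]

tall_elves_and_orcs = [
    [["race", "==", "elf"], "or", ["race", "==", "orc"]],
    "and",
    ["height(m)", ">=", 1.8],
]
print(select(tall_elves_and_orcs))                # [1, 5, 6]
print(select([["score", "<=", "past_score"]]))    # [1, 4, 5, 8, 9]
```

The sketch makes the grammar explicit: a list whose first element is a string is a simple condition, and any other list is a sequence of sub-conditions joined by outer operators.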

Now, let’s create a CohortDefinition object using the conditions previously mentioned:

[2]:
conditions = [
                [ ['race', '==', 'elf'], 'or', ['race', '==', 'orc'] ],
                'and',
                ['height(m)', '>=', 1.8]
            ]

cht_def = CohortDefinition(conditions)
subset = cht_def.get_cohort_subset(df)
subset
[2]:
race height(m) past_score score
1 orc 1.95 59 43
5 orc 2.10 79 73
6 elf 1.85 45 58

The get_cohort_subset() method simply applies the cohort’s filter to any dataset provided. If the dataset provided to this method doesn’t have the required columns, an error will be raised. Note that the CohortDefinition class is not associated with any specific dataset. It simply converts a set of conditions into a query, which can then be applied to any dataset that has the required columns (the columns used in the filters).
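Because a definition is independent of any dataset, it can be handy to check beforehand which columns a condition list depends on. The helper below is a hypothetical sketch and not part of the library. It walks the nested list and collects the first element of every simple condition; it does not detect column names that appear in the value position.

```python
def referenced_columns(conditions):
    """Collect the column names used as the first element of each simple condition."""
    if isinstance(conditions[0], str):
        # Innermost list: [column, operator, value].
        return {conditions[0]}
    columns = set()
    for item in conditions:
        if isinstance(item, list):  # skip the 'and' / 'or' strings
            columns |= referenced_columns(item)
    return columns

conditions = [
    [["race", "==", "elf"], "or", ["race", "==", "orc"]],
    "and",
    ["height(m)", ">=", 1.8],
]

needed = referenced_columns(conditions)
print(sorted(needed))  # ['height(m)', 'race']

# Column names of some other, hypothetical dataset.
other_dataset_columns = ["race", "score"]
missing = needed - set(other_dataset_columns)
print(sorted(missing))  # ['height(m)']
```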

In the example above, notice that we used two equal conditions to specify that the race should be either elf or orc. Let’s recreate this condition, but this time we’ll pass a list as the value of the equal operator:

[3]:
conditions = [
                [ ['race', '==', ['elf', 'orc'] ] ],
                'and',
                ['height(m)', '>=', 1.8]
            ]

cht_def = CohortDefinition(conditions)
subset = cht_def.get_cohort_subset(df)
subset
[3]:
race height(m) past_score score
1 orc 1.95 59 43
5 orc 2.10 79 73
6 elf 1.85 45 58
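Both cells return the same rows because an equality test against a list is a membership test. The same equivalence can be seen in plain Python on the race column alone, with the values copied from the dataset above:

```python
races = ["elf", "orc", "halfling", "human", "halfling",
         "orc", "elf", "orc", "human", "orc"]

# Two equality tests joined by 'or' ...
by_or = [i for i, r in enumerate(races) if r == "elf" or r == "orc"]

# ... select the same rows as a single membership test.
by_list = [i for i, r in enumerate(races) if r in ["elf", "orc"]]

print(by_or == by_list)  # True
```

The list form scales better when a cohort is defined by many categories, since it avoids a long chain of ‘or’ operators.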

In the following cells, we’ll show a few other examples where we use other operators:

[4]:
conditions = [ ['height(m)', '==', np.nan] ]
cht_def = CohortDefinition(conditions)
subset = cht_def.get_cohort_subset(df)
subset
[4]:
race height(m) past_score score
9 orc NaN 87 51
[5]:
conditions = [ ['height(m)', '==', [1.95, np.nan]] ]
cht_def = CohortDefinition(conditions)
subset = cht_def.get_cohort_subset(df)
subset
[5]:
race height(m) past_score score
1 orc 1.95 59 43
9 orc NaN 87 51
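The two outputs above indicate that an equality condition with np.nan selects the rows where the value is missing. This is worth noting because NaN never compares equal to itself under ordinary floating-point rules, so a literal equality test would match nothing. A standard-library sketch of the difference:

```python
import math

nan = float("nan")
heights = [1.6, 1.95, 1.40, 1.75, 1.53, 2.10, 1.85, 1.79, 1.65, float("nan")]

# A literal equality test never matches NaN.
naive = [i for i, h in enumerate(heights) if h == nan]

# An explicit missing-value test finds the row.
missing = [i for i, h in enumerate(heights) if math.isnan(h)]

print(naive)    # []
print(missing)  # [9]
```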
[6]:
conditions = [ [ ['height(m)', 'range', [1.1, 1.7]], 'and', ['race', '!=', 'halfling'] ] ]

cht_def = CohortDefinition(conditions)
subset = cht_def.get_cohort_subset(df)
subset
[6]:
race height(m) past_score score
0 elf 1.60 85 90
8 human 1.65 47 37
[7]:
conditions = [ ['height(m)', '>', 1.5],
              'and',
              ['height(m)', '<', 1.99],
              'and',
              ['score', '<=', 70]
            ]

cht_def = CohortDefinition(conditions)
subset = cht_def.get_cohort_subset(df)
subset
[7]:
race height(m) past_score score
1 orc 1.95 59 43
6 elf 1.85 45 58
8 human 1.65 47 37
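The cell above uses strict comparisons, whereas the range operator is inclusive at both ends. The two forms differ only for rows that sit exactly on a boundary. The sketch below uses the heights from this notebook; the bounds 1.40 and 1.75 are chosen here purely for illustration because two rows have exactly those heights.

```python
heights = [1.6, 1.95, 1.40, 1.75, 1.53, 2.10, 1.85, 1.79, 1.65, float("nan")]
low, high = 1.40, 1.75

# Inclusive bounds, as in ['height(m)', 'range', [low, high]].
inclusive = [i for i, h in enumerate(heights) if low <= h <= high]

# Strict bounds, as in ['height(m)', '>', low] 'and' ['height(m)', '<', high].
strict = [i for i, h in enumerate(heights) if low < h < high]

print(inclusive)  # [0, 2, 3, 4, 8]
print(strict)     # [0, 4, 8]
```

In both versions the row with a missing height is excluded, since every ordering comparison with NaN is false.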

We can also create a condition that compares the values between two different columns:

[8]:
conditions = [ ['score', '<=', 'past_score'] ]

cht_def = CohortDefinition(conditions)
subset = cht_def.get_cohort_subset(df)
subset
[8]:
race height(m) past_score score
1 orc 1.95 59 43
4 halfling 1.53 91 85
5 orc 2.10 79 73
8 human 1.65 47 37
9 orc NaN 87 51

Saving and Loading a Cohort Definition JSON

Sometimes it is useful to save the definitions used for a cohort. This way, we can later create a new CohortDefinition object using the same conditions as the ones used for the saved cohort.

To save a cohort definition to a JSON file, simply use the save() method. The file follows the same JSON structure used by the raiwidgets library.

[9]:
cht_def.save("json_files/CohortDefinition_tutorial/single_cohort.json")

To load a JSON file, simply pass the JSON file’s path to the constructor of the class.

[10]:
new_cht = CohortDefinition("json_files/CohortDefinition_tutorial/single_cohort.json")
subset = new_cht.get_cohort_subset(df)
subset
[10]:
race height(m) past_score score
1 orc 1.95 59 43
4 halfling 1.53 91 85
5 orc 2.10 79 73
8 human 1.65 47 37
9 orc NaN 87 51
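The idea of a round trip (save, load, and get back an equivalent definition) can be illustrated with the standard json module. This sketch is not what save() does and does not reproduce the raiwidgets structure: it simply dumps the nested condition list as-is to a temporary file, and the file name is arbitrary.

```python
import json
import os
import tempfile

conditions = [
    [["race", "==", "elf"], "or", ["race", "==", "orc"]],
    "and",
    ["height(m)", ">=", 1.8],
]

# Write the nested list to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "conditions.json")
    with open(path, "w") as f:
        json.dump(conditions, f, indent=2)
    with open(path) as f:
        restored = json.load(f)

print(restored == conditions)  # True

# NaN is not part of strict JSON: with allow_nan=False the encoder refuses it.
try:
    json.dumps([["height(m)", "==", float("nan")]], allow_nan=False)
    nan_rejected = False
except ValueError:
    nan_rejected = True
print(nan_rejected)  # True
```

The second part shows why missing-value conditions need some care in any JSON-based format: a serializer has to choose an explicit representation for NaN.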

In the following cell, we’ll call the save() method for several CohortDefinition objects and then load the resulting JSON files. Readers are encouraged to inspect the saved JSON files to better understand the structure used.

[11]:
conditions_list = []
conditions = [
                [ ['race', '==', 'elf'], 'or', ['race', '==', 'orc'] ],
                'and',
                ['height(m)', '>=', 1.8]
            ]
conditions_list.append(conditions)

conditions = [
                [ ['race', '==', ['elf', 'orc'] ] ],
                'and',
                ['height(m)', '>=', 1.8]
            ]
conditions_list.append(conditions)

conditions = [ [ ['height(m)', 'range', [1.1, 1.7]], 'and', ['race', '!=', 'halfling'] ] ]
conditions_list.append(conditions)

conditions = [ ['height(m)', '>', 1.5],
              'and',
              ['height(m)', '<', 1.99],
              'and',
              ['score', '<=', 70]
            ]
conditions_list.append(conditions)

conditions = [ ['height(m)', '==', np.nan] ]
conditions_list.append(conditions)

conditions = [ ['score', '<=', 'past_score'] ]
conditions_list.append(conditions)

for i, conditions in enumerate(conditions_list):
    cht_def = CohortDefinition(conditions)
    cht_def.save(f"json_files/CohortDefinition_tutorial/cht_{i}.json")
    new_cht = CohortDefinition(f"json_files/CohortDefinition_tutorial/cht_{i}.json")