Decoupled Classifiers

This notebook is a tutorial on how to use the DecoupledClass estimator, provided in the raimitigations.cohort package. This class is based on the work presented in the paper Decoupled classifiers for group-fair and efficient machine learning. The DecoupledClass estimator builds a different estimator for each cohort, where the cohort separation rules are defined by a set of class parameters, similar to how the CohortManager class creates its cohorts. Both DecoupledClass and CohortManager inherit from the same abstract class, CohortHandler, which implements the core functionality for handling cohorts. The difference between them is that CohortManager provides an interface for creating a variety of pipelines that are executed over each cohort separately (a pipeline with an estimator, a different pipeline for each cohort, and so on), while DecoupledClass functions strictly as an estimator: it always fits a model over each cohort separately. It can also apply transformations to each cohort, but the transformations must be the same for all cohorts (unlike the CohortManager, it doesn’t allow a different pre-processing pipeline for each cohort).

In this notebook, we’ll show the different ways we can instantiate and use the DecoupledClass. Let’s start off by opening a dataset. Here, we’ll use the UCI Breast Cancer dataset.

[1]:
import pandas as pd
import numpy as np
import uci_dataset as database

from raimitigations.utils import split_data
import raimitigations.dataprocessing as dp
from raimitigations.cohort import DecoupledClass, fetch_cohort_results

df = database.load_breast_cancer()
label_col = "Class"
df[label_col] = df[label_col].replace({"recurrence-events": 1,
                                       "no-recurrence-events": 0})
df
[1]:
Class age menopause tumor-size inv-nodes node-caps deg-malig breast breast-quad irradiat
0 0 30-39 premeno 30-34 0-2 no 3 left left_low no
1 0 40-49 premeno 20-24 0-2 no 2 right right_up no
2 0 40-49 premeno 20-24 0-2 no 2 left left_low no
3 0 60-69 ge40 15-19 0-2 no 2 right left_up no
4 0 40-49 premeno 0-4 0-2 no 2 right right_low no
... ... ... ... ... ... ... ... ... ... ...
281 1 30-39 premeno 30-34 0-2 no 2 left left_up no
282 1 30-39 premeno 20-24 0-2 no 3 left left_up yes
283 1 60-69 ge40 20-24 0-2 no 1 right left_up no
284 1 40-49 ge40 30-34 3-5 no 3 left left_low no
285 1 50-59 ge40 30-34 3-5 no 3 left left_low no

286 rows × 10 columns

[2]:
X_train, X_test, y_train, y_test = split_data(df, label="Class", test_size=0.2)

Basic Scenario

Let’s consider the following scenario: suppose we want to train a different model for each cohort, where the cohorts are defined by the distinct values of the irradiat column. To do this, we can instantiate the DecoupledClass with the following parameters:

[3]:
preprocessing = [dp.BasicImputer(verbose=False), dp.EncoderOrdinal(verbose=False)]

dec_class = DecoupledClass(
    cohort_col=["irradiat"],
    transform_pipe=preprocessing
)
dec_class.fit(X_train, y_train)

dec_class.print_cohorts()
FINAL COHORTS
cohort_0:
        Size: 173
        Query:
                (`irradiat` == "no")
        Value Counts:
                0: 130 (75.14%)
                1: 43 (24.86%)
        Invalid: False


cohort_1:
        Size: 55
        Query:
                (`irradiat` == "yes")
        Value Counts:
                0: 30 (54.55%)
                1: 25 (45.45%)
        Invalid: False


The cohort_col parameter works like the parameter of the same name in the CohortManager class: it creates a different cohort for each combination of values found in the columns specified in the cohort_col list (check this notebook for more details). Therefore, since cohort_col=["irradiat"], we create one cohort for all instances where the irradiat column is "no", and another cohort for all instances where it is "yes". We then train two models: one for each cohort. Since no estimator was provided, each cohort receives a copy of the baseline estimator used by the DecoupledClass: a sklearn.tree.DecisionTreeClassifier for classification problems, or a sklearn.tree.DecisionTreeRegressor for regression.

Note that we also provided a pre-processing pipeline through the transform_pipe parameter. Each cohort gets its own copy of this pipeline, and before the model is fitted, each cohort’s dataset (a subset of the original dataset) goes through this pipeline. In this case, each cohort imputes its missing values and then encodes its categorical features before fitting its model. Unlike the CohortManager class, the DecoupledClass doesn’t allow a different pipeline for each cohort: all cohorts use separate copies of the same pipeline.
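Conceptually, the per-cohort pipeline copies behave like the short sketch below (an illustration of the copy semantics, not the library’s actual internals):

from copy import deepcopy

import raimitigations.dataprocessing as dp

shared_pipe = [dp.BasicImputer(verbose=False), dp.EncoderOrdinal(verbose=False)]

# Each cohort works with an independent copy of the shared pipeline, so the
# imputation values and encodings are learned from that cohort's subset only.
pipe_cohort_no = deepcopy(shared_pipe)
pipe_cohort_yes = deepcopy(shared_pipe)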

After fitting the DecoupledClass object, we can print some information about each cohort that was created. To do this, we use the print_cohorts() method.

Merging Invalid Cohorts

When creating multiple cohorts, we might end up with a few cohorts that have a skewed label distribution, or cohorts that are very small. In these cases, we might want to fix these cohorts before proceeding. One approach is to use data rebalancing to create new instances for only those cohorts, which can be done with the dataprocessing.Synthesizer class or with the Rebalance class together with the CohortManager(). Apart from these, the DecoupledClass offers two approaches of its own. The first, which we’ll explore in this section, is to greedily merge invalid cohorts until they become valid.

When merging cohorts, we first need to define what a valid cohort is. Here, we consider invalid any cohort that meets at least one of the following conditions:

  1. Small Cohorts: cohorts with a size below a certain threshold

  2. Skewed Cohorts: cohorts whose label column has a skewed distribution.

After a cohort is deemed invalid, we need to decide which cohort it will be merged into. We simply choose the smallest cohort other than the invalid one and merge the two (this is why we call this a greedy approach to merging cohorts).
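A minimal sketch of this greedy loop, assuming the cohorts are held in a dict mapping names to DataFrame subsets and is_valid is some validity check (an illustration, not the library’s actual implementation):

import pandas as pd

def merge_invalid_cohorts(cohorts: dict, is_valid) -> dict:
    # Repeatedly fold an invalid cohort into the smallest other cohort,
    # re-checking validity after each merge (merged cohorts also merge their filters).
    while len(cohorts) > 1:
        invalid = [name for name, sub in cohorts.items() if not is_valid(sub)]
        if not invalid:
            break
        name = invalid[0]
        target = min((n for n in cohorts if n != name), key=lambda n: len(cohorts[n]))
        cohorts[target] = pd.concat([cohorts[target], cohorts.pop(name)])
    return cohorts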

There are a few parameters used to control these validity checks (the sketch after this list shows how they combine):

  • min_cohort_size: the minimum size a cohort is allowed to have to be considered valid

  • min_cohort_pct: a value in [0, 1] that determines the minimum relative size allowed for a cohort, as a fraction of the full dataset (df.shape[0]). The larger of min_cohort_size and df.shape[0] * min_cohort_pct is used as the minimum size allowed for a cohort

  • minority_min_rate: the minimum occurrence rate for the minority class (from the label column) that a cohort is allowed to have. If the minority class of a cohort has an occurrence rate lower than minority_min_rate, the cohort is considered invalid.
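The sketch below combines the three parameters into a single validity check that mirrors the conditions above (illustrative only; the default values shown are arbitrary, not the library’s defaults):

import pandas as pd

def is_cohort_valid(cohort: pd.DataFrame, full_size: int, label_col: str,
                    min_cohort_size: int = 20, min_cohort_pct: float = 0.1,
                    minority_min_rate: float = 0.1) -> bool:
    # The minimum allowed size is the larger of the absolute and relative thresholds
    min_size = max(min_cohort_size, int(full_size * min_cohort_pct))
    if cohort.shape[0] < min_size:
        return False  # Small Cohort
    # Occurrence rate of the rarest label value inside this cohort
    minority_rate = cohort[label_col].value_counts(normalize=True).min()
    return minority_rate >= minority_min_rate  # False means Skewed Cohort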

In the next cell, we’ll create a set of cohorts based on the joint values of the columns [“age”, “menopause”]. We’ll also specify different values for the min_cohort_pct and minority_min_rate parameters. Note that the resulting cohorts are not what we might initially expect, that is, one cohort for each combination of unique values found across the columns [“age”, “menopause”]. Instead, we end up with only a few cohorts. While cohort_4 is defined by a simple pair of filters over these two columns, the other two cohorts use complex combinations of filters. This means they are the result of merged cohorts: when two cohorts are merged, so are their filters.

[4]:
preprocessing = [dp.EncoderOrdinal(verbose=False), dp.BasicImputer(verbose=False)]

dec_class = DecoupledClass(
    cohort_col=["age", "menopause"],
    min_cohort_pct=0.2,
    minority_min_rate=0.15,
    transform_pipe=preprocessing
)
dec_class.fit(df=df, label_col="Class")

dec_class.print_cohorts()
FINAL COHORTS
cohort_0:
        Size: 91
        Query:
                ((((((((`age` == "20-29") and (`menopause` == "premeno")) or ((`age` == "30-39") and (`menopause` == "lt40"))) or ((`age` == "60-69") and (`menopause` == "lt40"))) or ((`age` == "50-59") and (`menopause` == "lt40"))) or ((`age` == "70-79") and (`menopause` == "ge40"))) or ((`age` == "40-49") and (`menopause` == "ge40"))) or ((`age` == "50-59") and (`menopause` == "premeno"))) or ((`age` == "30-39") and (`menopause` == "premeno"))
        Value Counts:
                0: 59 (64.84%)
                1: 32 (35.16%)
        Invalid: False


cohort_4:
        Size: 81
        Query:
                (`age` == "40-49") and (`menopause` == "premeno")
        Value Counts:
                0: 58 (71.60%)
                1: 23 (28.40%)
        Invalid: False


cohort_8:
        Size: 114
        Query:
                ((`age` == "60-69") and (`menopause` == "ge40")) or ((`age` == "50-59") and (`menopause` == "ge40"))
        Value Counts:
                0: 84 (73.68%)
                1: 30 (26.32%)
        Invalid: False


Specifying the Cohorts

Just like the CohortManager class, the DecoupledClass also allows users to specify the exact filters used to create the cohorts. In the following example, we’ll create three cohorts: two of them with specific filters, and a last one that includes all instances that don’t belong to any other cohort. NOTE: when specifying the exact cohorts using the cohort_def parameter, invalid cohorts won’t be merged. Instead, an error will be raised indicating that one of the cohorts is invalid. However, invalid cohorts can still be used if Transfer Learning is enabled. More details on that in the following sections.

[6]:
cohorts = {
    "cohort_1": [['age', '==', '40-49'], 'and', ['menopause', '==', 'premeno']],
    "cohort_2": [
            [['age', '==', '60-69'], 'and', ['menopause', '==', 'ge40']], 'or',
            [['age', '==', '30-39'], 'and', ['menopause', '==', 'premeno']],
        ],
    "cohort_3": None
}

preprocessing = [dp.EncoderOrdinal(verbose=False), dp.BasicImputer(verbose=False)]

dec_class = DecoupledClass(
    cohort_def=cohorts,
    min_cohort_pct=0.2,
    minority_min_rate=0.15,
    transform_pipe=preprocessing
)
dec_class.fit(df=df, label_col="Class")

dec_class.print_cohorts()
[6]:
<raimitigations.cohort.decoupled_class.decoupled_classifier.DecoupledClass at 0x7f4c286b7160>
FINAL COHORTS
cohort_1:
        Size: 81
        Query:
                (`age` == "40-49") and (`menopause` == "premeno")
        Value Counts:
                0: 58 (71.60%)
                1: 23 (28.40%)
        Invalid: False


cohort_2:
        Size: 90
        Query:
                ((`age` == "60-69") and (`menopause` == "ge40")) or ((`age` == "30-39") and (`menopause` == "premeno"))
        Value Counts:
                0: 58 (64.44%)
                1: 32 (35.56%)
        Invalid: False


cohort_3:
        Size: 115
        Query:
                ((`age` != "40-49") or (`menopause` != "premeno")) and (((`age` != "60-69") or (`menopause` != "ge40")) and ((`age` != "30-39") or (`menopause` != "premeno")))
        Value Counts:
                0: 85 (73.91%)
                1: 30 (26.09%)
        Invalid: False


Specifying the estimator

The DecoupledClass has multiple parameters that allow for a wide range of customization. For example, we can choose the estimator used by the decoupled classifier. By default, a simple DecisionTreeClassifier (for classification problems) or DecisionTreeRegressor (for regression problems), both from sklearn, is used. However, if the user wants a more powerful estimator, or the same estimator with different parameters, they can specify it when creating the DecoupledClass object. To do this, they just need to instantiate the estimator (without calling its fit() method) and pass it through the estimator parameter. The Decoupled Classifier then creates a copy of this estimator for each cohort, so each copy is fitted using a different dataset (the cohort’s subset).

[6]:
import xgboost as xgb

model = xgb.XGBClassifier(
            objective="binary:logistic",
            learning_rate=0.1,
            n_estimators=30,
            max_depth=10,
            colsample_bytree=0.7,
            alpha=0.0,
            reg_lambda=10.0,
            nthreads=4,
            verbosity=0,
            use_label_encoder=False,
        )

preprocessing = [dp.EncoderOrdinal(verbose=False), dp.BasicImputer(verbose=False)]

dec_class = DecoupledClass(
                    cohort_col=["age", "menopause"],
                    min_cohort_pct=0.2,
                    minority_min_rate=0.15,
                    estimator=model,
                    transform_pipe=preprocessing
                )
dec_class.fit(df=df, label_col="Class")
/home/mmendonca/anaconda3/envs/raipub/lib/python3.9/site-packages/xgboost/sklearn.py:1421: UserWarning: `use_label_encoder` is deprecated in 1.7.0.
  warnings.warn("`use_label_encoder` is deprecated in 1.7.0.")
[6]:
<raimitigations.cohort.decoupled_class.decoupled_classifier.DecoupledClass at 0x7f307a540a00>

Calling the predict() and predict_proba() methods

The Decoupled Classifier also implements the same interface as other sklearn estimators: the predict() and predict_proba() methods. It follows the same conventions: predict() returns the predicted classes, while predict_proba() returns the probability of each instance belonging to each class. Note that predict_proba() only works if the underlying estimator also implements a predict_proba() method.

[7]:
X = df.drop(columns=[label_col])

y_pred = dec_class.predict(X)

print(f"y_pred size = {y_pred.shape}")
print(f"{y_pred[:6]} ... {y_pred[-6:]}")
y_pred size = (286,)
[0 0 0 0 0 0] ... [1 0 1 0 1 1]
[8]:
y_pred = dec_class.predict_proba(X)

print(f"y_pred size = {y_pred.shape}")
print(f"{y_pred[:6]} ... {y_pred[-6:]}")
y_pred size = (286, 2)
[[0.53159475 0.46840525]
 [0.78541195 0.21458806]
 [0.7623236  0.23767635]
 [0.8384228  0.1615772 ]
 [0.78541195 0.21458806]
 [0.8384228  0.1615772 ]] ... [[0.3676687  0.6323313 ]
 [0.72908753 0.27091247]
 [0.44569016 0.55430984]
 [0.81318545 0.18681458]
 [0.42785716 0.57214284]
 [0.45488244 0.54511756]]

Using Transfer Learning for invalid cohorts

What sets the DecoupledClass apart from the CohortManager is its capability to deal with invalid cohorts. We already showed how to use the Decoupled Classifier to merge invalid cohorts using a greedy approach. In this section, we’ll explore a second approach for dealing with invalid cohorts, called here the transfer learning approach. In this approach, instead of merging an invalid cohort into another cohort, we keep all the cohorts, but when calling the fit() method of an invalid cohort, we also use data from other cohorts (called out-data). The out-data instances are weighed down by a factor \(\theta\) compared to the instances belonging to the cohort itself (called here the in-data). Note that when an invalid cohort uses data from other cohorts in its fit() method, the cohorts that lent the data (those from which the out-data was fetched) will still use only their own data when fitting their models (unless they are also invalid).

We are now left with two questions: (i) which cohorts should be used as the out-data for an invalid cohort, and (ii) how to define the value of \(\theta\).

Selecting the out-data for an invalid cohort

When selecting the out-data for an invalid cohort, that is, when selecting which cohorts will lend their data to it, we must be careful not to use cohorts with a very different label distribution (compared to the invalid cohort that needs extra data); otherwise, the external data can be more harmful than useful. To check whether two cohorts have a similar label distribution (be it for a classification problem, where the label column is a set of encoded classes, or for a regression problem, where the label is an array of real values), we compute the Jensen-Shannon distance between the label distributions of the two cohorts. If the distance is below a predefined threshold (controlled by the cohort_dist_th parameter), the distributions are considered similar.
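As an illustration of this similarity test, the snippet below computes the Jensen-Shannon distance between two hypothetical cohort label distributions using scipy (the threshold shown is arbitrary; in the DecoupledClass it is controlled by cohort_dist_th):

import numpy as np
from scipy.spatial.distance import jensenshannon

dist_a = np.array([0.75, 0.25])  # cohort A: 75% class 0, 25% class 1
dist_b = np.array([0.55, 0.45])  # cohort B: 55% class 0, 45% class 1

js_dist = jensenshannon(dist_a, dist_b, base=2)
print(f"Jensen-Shannon distance = {js_dist:.4f}")

cohort_dist_th = 0.8  # hypothetical threshold, for illustration only
print(f"similar = {js_dist <= cohort_dist_th}")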

Here is a summary of the transfer learning process covered so far:

  • When using Transfer Learning, we first check if there are any invalid cohorts. Unlike the greedy merging approach, transfer learning does not allow cohorts deemed invalid due to a skewed label distribution: if such a cohort is found, an error will be raised. For each invalid cohort i, we’ll do the following steps:

    1. Search all other cohorts j \(\neq\) i (including other invalid cohorts) and find those that have a similar label distribution

    2. Create a new dataset (visible only to cohort i), called the out-data, that includes the subsets of all other cohorts with a similar label distribution

    3. Train the estimator of cohort i using its own subset (in-data) plus the out-data, where the instances from the out-data are given a smaller weight \(\theta\) (we’ll discuss how to set this value in the remainder of this section; the sketch after this list illustrates the weighting). Note that the estimator used with transfer learning must allow setting a different weight for each instance.

  • All valid cohorts will be trained using only their in-data, even if the subset of these cohorts is used as out-data for invalid cohorts.
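The weighting described in step 3 can be illustrated with sklearn’s sample_weight mechanism (a minimal sketch with toy data, not the library’s internals):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
# Toy stand-ins for the invalid cohort's subset (in-data) and the borrowed out-data
X_in, y_in = rng.normal(size=(20, 4)), rng.integers(0, 2, size=20)
X_out, y_out = rng.normal(size=(100, 4)), rng.integers(0, 2, size=100)

theta = 0.3  # each out-data instance gets weight theta; in-data instances get weight 1
X_fit = np.vstack([X_in, X_out])
y_fit = np.concatenate([y_in, y_out])
weights = np.concatenate([np.ones(len(y_in)), np.full(len(y_out), theta)])

# The estimator must support per-instance weights through sample_weight
model = DecisionTreeClassifier(random_state=0).fit(X_fit, y_fit, sample_weight=weights)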

Setting the value of \(\theta\)

We are now going to focus on how to set the value of the \(\theta\) parameter. We’ll cover the different approaches for setting this value.

Using a fixed \(\theta\) value

The most straightforward approach for setting the value of \(\theta\) is to provide a specific value directly through the theta parameter when creating the DecoupledClass object. When a float value in [0, 1] is passed, it is used as \(\theta\) in all transfer learning operations.

In the following cell, we’ll set \(\theta\) = 0.3. Note that this time we first impute the missing values in the dataset. We do this because, when using the cohort_col parameter to define the cohorts, any missing values in the listed columns are used for creating the cohorts as well. In this specific case, the breast-quad column has very few missing values, so creating cohorts over those missing values would result in a cohort (the one that holds all instances where this column is NaN) with a skewed label distribution, which, as previously mentioned, is not allowed when using transfer learning. Therefore, we simply handle the missing values prior to creating the cohorts.

[9]:
preprocessing = [dp.EncoderOrdinal(verbose=False)]

imputer = dp.BasicImputer(categorical={'missing_values': np.nan,
                                       'strategy': 'most_frequent',
                                       'fill_value': None},
                          verbose=False)
imputer.fit(df)
df_nomiss = imputer.transform(df)

dec_class = DecoupledClass(
                    cohort_col=["breast-quad"],
                    theta=0.3,
                    min_cohort_pct=0.2,
                    minority_min_rate=0.15,
                    transform_pipe=preprocessing
                )
dec_class.fit(df=df_nomiss, label_col="Class")

dec_class.print_cohorts()
FINAL COHORTS
cohort_0:
        Size: 21
        Query:
                (`breast-quad` == "central")
        Value Counts:
                0: 17 (80.95%)
                1: 4 (19.05%)
        Invalid: True
                Cohorts used as outside data: ['cohort_1', 'cohort_2', 'cohort_3', 'cohort_4']
                Theta = 0.3


cohort_1:
        Size: 111
        Query:
                (`breast-quad` == "left_low")
        Value Counts:
                0: 75 (67.57%)
                1: 36 (32.43%)
        Invalid: False


cohort_2:
        Size: 97
        Query:
                (`breast-quad` == "left_up")
        Value Counts:
                0: 71 (73.20%)
                1: 26 (26.80%)
        Invalid: False


cohort_3:
        Size: 24
        Query:
                (`breast-quad` == "right_low")
        Value Counts:
                0: 18 (75.00%)
                1: 6 (25.00%)
        Invalid: True
                Cohorts used as outside data: ['cohort_0', 'cohort_1', 'cohort_2', 'cohort_4']
                Theta = 0.3


cohort_4:
        Size: 33
        Query:
                (`breast-quad` == "right_up")
        Value Counts:
                0: 20 (60.61%)
                1: 13 (39.39%)
        Invalid: True
                Cohorts used as outside data: ['cohort_0', 'cohort_1', 'cohort_2', 'cohort_3']
                Theta = 0.3


Note that when transfer learning is used, the print_cohorts() method sets the “Invalid” key to True for the invalid cohorts, and in that case it also shows which cohorts were used as out-data and the \(\theta\) value used.

Finding the best \(\theta\) parameter using Cross-Validation

Instead of using a fixed \(\theta\) value, we can also find the best value using Cross-Validation (CV). When a cohort uses transfer learning, CV is run over the cohort’s data (in-data) plus the out-data for different values of \(\theta\) (taken from a list of candidate values, called here the \(\theta\) list), and the final \(\theta\) is the one associated with the highest performance in the CV process. The CV splits the in-data into K folds (the best K value is identified among the possible values specified in the valid_k_folds_theta parameter), then uses one fold as the test set and the remaining folds plus the out-data as the training set. A model is fitted on the training set and evaluated on the test set using the ROC AUC metric. This is repeated until every fold has been used as a test set, and the average ROC AUC over the K runs gives the CV score for a given \(\theta\) value. The process is repeated for every value in the \(\theta\) list, and the \(\theta\) with the best score is selected for that cohort. This is done separately for each cohort that requires transfer learning, which means that different invalid cohorts might end up using different values of \(\theta\).

A set of parameters controls the CV process: default_theta, min_fold_size_theta, and valid_k_folds_theta. We recommend looking through the API documentation of these parameters to better understand this process.
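Below is a simplified sketch of the selection loop just described, assuming the in-data and out-data are already available as numpy arrays (the actual implementation also picks K among the values in valid_k_folds_theta and enforces min_fold_size_theta):

import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

def select_theta(X_in, y_in, X_out, y_out, theta_list, k=5, seed=42):
    # For each candidate theta, run K-fold CV over the in-data: train on the
    # remaining folds plus the down-weighted out-data, test on the held-out fold.
    best_theta, best_score = None, -np.inf
    for theta in theta_list:
        skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
        scores = []
        for train_idx, test_idx in skf.split(X_in, y_in):
            X_fit = np.vstack([X_in[train_idx], X_out])
            y_fit = np.concatenate([y_in[train_idx], y_out])
            w = np.concatenate([np.ones(len(train_idx)), np.full(len(y_out), theta)])
            model = DecisionTreeClassifier(random_state=seed)
            model.fit(X_fit, y_fit, sample_weight=w)
            proba = model.predict_proba(X_in[test_idx])[:, 1]
            scores.append(roc_auc_score(y_in[test_idx], proba))
        if np.mean(scores) > best_score:
            best_theta, best_score = theta, np.mean(scores)
    return best_theta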

In the following cells, we’ll look at two ways to specify the \(\theta\) list, that is, the list of candidate \(\theta\) values to be tested during the CV phase.

Using a specific list of possible \(\theta\) values

We can specify a list of candidate \(\theta\) values explicitly. This way, the CV process described above is run for every \(\theta\) value contained in the list. The list is passed through the same theta parameter mentioned in the previous cell.

[10]:
dec_class = DecoupledClass(
    cohort_col=["breast-quad"],
    theta=[0.2, 0.4, 0.6, 0.8],
    min_fold_size_theta=5,
    min_cohort_pct=0.2,
    minority_min_rate=0.15,
    transform_pipe=preprocessing
)
dec_class.fit(df=df_nomiss, label_col="Class")

dec_class.print_cohorts()
FINAL COHORTS
cohort_0:
        Size: 21
        Query:
                (`breast-quad` == "central")
        Value Counts:
                0: 17 (80.95%)
                1: 4 (19.05%)
        Invalid: True
                Cohorts used as outside data: ['cohort_1', 'cohort_2', 'cohort_3', 'cohort_4']
                Theta = 0.6


cohort_1:
        Size: 111
        Query:
                (`breast-quad` == "left_low")
        Value Counts:
                0: 75 (67.57%)
                1: 36 (32.43%)
        Invalid: False


cohort_2:
        Size: 97
        Query:
                (`breast-quad` == "left_up")
        Value Counts:
                0: 71 (73.20%)
                1: 26 (26.80%)
        Invalid: False


cohort_3:
        Size: 24
        Query:
                (`breast-quad` == "right_low")
        Value Counts:
                0: 18 (75.00%)
                1: 6 (25.00%)
        Invalid: True
                Cohorts used as outside data: ['cohort_0', 'cohort_1', 'cohort_2', 'cohort_4']
                Theta = 0.8


cohort_4:
        Size: 33
        Query:
                (`breast-quad` == "right_up")
        Value Counts:
                0: 20 (60.61%)
                1: 13 (39.39%)
        Invalid: True
                Cohorts used as outside data: ['cohort_0', 'cohort_1', 'cohort_2', 'cohort_3']
                Theta = 0.8


Using a default list of possible \(\theta\) values

Instead of providing a list of \(\theta\) values, we could also use a default \(\theta\) list. To do this, we only need to set the theta parameter to True. This way, the DecoupledClass understands that transfer learning must be used, and that the best \(\theta\) value must be identified using a default \(\theta\) list.

[11]:
dec_class = DecoupledClass(
    cohort_col=["breast-quad"],
    theta=True,
    min_fold_size_theta=5,
    min_cohort_pct=0.2,
    minority_min_rate=0.15,
    transform_pipe=preprocessing
)
dec_class.fit(df=df_nomiss, label_col="Class")

dec_class.print_cohorts()
FINAL COHORTS
cohort_0:
        Size: 21
        Query:
                (`breast-quad` == "central")
        Value Counts:
                0: 17 (80.95%)
                1: 4 (19.05%)
        Invalid: True
                Cohorts used as outside data: ['cohort_1', 'cohort_2', 'cohort_3', 'cohort_4']
                Theta = 0.2


cohort_1:
        Size: 111
        Query:
                (`breast-quad` == "left_low")
        Value Counts:
                0: 75 (67.57%)
                1: 36 (32.43%)
        Invalid: False


cohort_2:
        Size: 97
        Query:
                (`breast-quad` == "left_up")
        Value Counts:
                0: 71 (73.20%)
                1: 26 (26.80%)
        Invalid: False


cohort_3:
        Size: 24
        Query:
                (`breast-quad` == "right_low")
        Value Counts:
                0: 18 (75.00%)
                1: 6 (25.00%)
        Invalid: True
                Cohorts used as outside data: ['cohort_0', 'cohort_1', 'cohort_2', 'cohort_4']
                Theta = 0.4


cohort_4:
        Size: 33
        Query:
                (`breast-quad` == "right_up")
        Value Counts:
                0: 20 (60.61%)
                1: 13 (39.39%)
        Invalid: True
                Cohorts used as outside data: ['cohort_0', 'cohort_1', 'cohort_2', 'cohort_3']
                Theta = 0.2


Optimizing Fairness Metrics

Aside from training one estimator for each cohort and adding transfer learning for invalid cohorts, the Decoupled Classifier, as presented in its original paper, also offers the option to optimize all models according to some fairness metric. In this section, we’ll show how to optimize a set of models based on one of the fairness metrics available.

First of all, let’s start by reading a new dataset, and then splitting it.

[12]:
import random
import sys
sys.path.append('../../notebooks')
from lightgbm import LGBMClassifier

from download import download_datasets

SEED = 42

def get_model():
    model = LGBMClassifier(random_state=SEED)
    return model

np.random.seed(SEED)
random.seed(SEED)

data_dir = '../../../datasets/'
download_datasets(data_dir)
df = pd.read_csv(data_dir + 'hr_promotion/train.csv')
df.drop(columns=['employee_id'], inplace=True)
label_col = 'is_promoted'

X_train, X_test, y_train, y_test = split_data(df, label_col, test_size=0.3)
X_train
[12]:
department region education gender recruitment_channel no_of_trainings age previous_year_rating length_of_service KPIs_met >80% awards_won? avg_training_score
19004 Sales & Marketing region_13 Bachelor's m other 1 27 3.0 4 0 0 52
54186 Technology region_2 Bachelor's f other 2 38 4.0 10 1 0 79
37539 Operations region_2 Bachelor's m sourcing 1 26 1.0 3 0 0 57
51713 Sales & Marketing region_22 Bachelor's f sourcing 1 27 2.0 4 0 0 49
2051 Procurement region_2 Bachelor's f sourcing 1 28 3.0 2 0 0 66
... ... ... ... ... ... ... ... ... ... ... ... ...
51522 Operations region_27 Bachelor's f other 1 33 3.0 10 1 0 62
4220 Operations region_32 Bachelor's m other 1 33 2.0 4 0 0 62
24351 Operations region_2 Master's & above m sourcing 1 31 5.0 3 1 0 64
9214 Procurement region_17 Bachelor's f sourcing 1 43 3.0 4 1 0 66
54230 Sales & Marketing region_11 Bachelor's m sourcing 1 34 1.0 6 0 0 51

38365 rows × 12 columns

Decoupled Classifier WITHOUT Fairness Optimization

We’ll look into the cohorts formed by the “department” column. We’ll first train a decoupled classifier without optimizing any fairness metric, and then output the results. Note that we’re using the get_threasholds_dict() method from the DecoupledClass, which returns a dictionary with the optimal threshold found over the training set for each cohort. These thresholds are used only for binary classification problems, and they are applied whenever the user calls the predict() method: in this case, we first compute the probabilities (predict_proba()) and then binarize the results using the threshold as the cutoff point (values below the threshold are assigned class 0, and values above it are assigned class 1). The dictionary returned by get_threasholds_dict() can be passed to the raimitigations.cohort.fetch_cohort_results() function.
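For illustration, the binarization step using a cohort’s threshold looks like the sketch below (the threshold and probabilities are hypothetical):

import numpy as np

threshold = 0.35  # hypothetical cutoff, as returned by get_threasholds_dict()
proba = np.array([0.10, 0.34, 0.36, 0.80])  # P(class 1) for four instances

y_pred = (proba >= threshold).astype(int)  # below the threshold -> 0, otherwise -> 1
print(y_pred)  # [0 0 1 1]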

[13]:
preprocessing = [dp.BasicImputer(verbose=False), dp.EncoderOrdinal(verbose=False)]

dec_class = DecoupledClass(
                    cohort_col=["department"],
                    transform_pipe=preprocessing,
                    estimator=get_model(),
                    minority_min_rate=0.01,
                    min_cohort_pct=0.01,
                    theta=False,
                )
dec_class.fit(X_train, y_train)

th_dict = dec_class.get_threasholds_dict()
pred = dec_class.predict_proba(X_test)
fetch_cohort_results(X_test, y_test, pred, cohort_def=dec_class, fixed_th=th_dict)
[13]:
cohort cht_query roc precision recall f1 accuracy threshold num_pos %_pos cht_size
0 all all 0.895163 0.876146 0.685779 0.742239 0.939488 0.500000 659 0.040078 16443
1 cohort_0 (`department` == "Analytics") 0.722190 0.619585 0.596869 0.606264 0.875456 0.291690 126 0.076549 1646
2 cohort_1 (`department` == "Finance") 0.919360 0.838021 0.786997 0.809857 0.943570 0.399871 56 0.073491 762
3 cohort_2 (`department` == "HR") 0.895265 0.836806 0.697395 0.746324 0.956989 0.547408 24 0.032258 744
4 cohort_3 (`department` == "Legal") 0.915327 0.841469 0.663451 0.717749 0.963190 0.760291 7 0.021472 326
5 cohort_4 (`department` == "Operations") 0.896737 0.757878 0.729081 0.742328 0.918598 0.271413 272 0.080808 3366
6 cohort_5 (`department` == "Procurement") 0.901670 0.758723 0.718070 0.736035 0.915596 0.329608 173 0.079358 2180
7 cohort_6 (`department` == "R&D") 0.718231 0.578056 0.526861 0.535669 0.944444 0.664189 5 0.015432 324
8 cohort_7 (`department` == "Sales & Marketing") 0.935592 0.765992 0.727787 0.745056 0.938333 0.272557 298 0.059280 5027
9 cohort_8 (`department` == "Technology") 0.869781 0.710383 0.682738 0.695082 0.886847 0.329839 197 0.095261 2068

Decoupled Classifier WITH Fairness Optimization

Now, we’ll add the Demographic Parity fairness loss (refer to the paper for other loss functions). The demographic parity loss forces the decoupled classifier to output a similar rate of positive labels for all cohorts. The joint loss is given by \(\hat{L}\), expressed in the following equations (a code sketch of this computation follows the definitions below):

\[L_{1} = \frac{1}{n} \sum_{i = 0...n}{|y_i - z_i|}\]
\[p_k = \frac{1}{n} \sum_{i:g_i = k}{z_i}\]
\[\hat{L} = \lambda L_{1} + (1-\lambda) \sum_{k}{\Bigg| p_k \frac{n}{n_k} - \frac{1}{K}\sum_{k'}{p_{k'} \frac{n}{n_{k'}}} \Bigg|}\]

where:

  • \(y_i\) and \(z_i\) are the true and predicted (binarized) labels of instance \(i\), and \(g_i\) identifies the cohort that instance \(i\) belongs to

  • \(n\) is the size of the full dataset, \(n_k\) is the size of cohort \(k\), and \(K\) is the number of cohorts

  • \(L_1\) is the L1 loss function

  • \(p_k\) is the rate of positive labels (in relation to the entire dataset) inside cohort \(k\)

  • \(\sum_{k'}\) represents the sum over all cohorts different from cohort \(k\)
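The sketch below expresses this joint loss as code, reading the inner sum as running over all K cohorts (a simplified illustration, not the library’s implementation):

import numpy as np

def demographic_parity_loss(y_true, z_pred, cohort_ids, lambda_coef=0.5):
    y_true, z_pred = np.asarray(y_true), np.asarray(z_pred)
    cohort_ids = np.asarray(cohort_ids)
    l1 = np.abs(y_true - z_pred).mean()  # L1 loss over the binarized predictions
    # p_k * n / n_k reduces to the positive-prediction rate inside cohort k
    rates = np.array([z_pred[cohort_ids == k].mean() for k in np.unique(cohort_ids)])
    disparity = np.abs(rates - rates.mean()).sum()  # deviation from the mean rate
    return lambda_coef * l1 + (1 - lambda_coef) * disparity

# Toy usage: two cohorts whose positive-prediction rates differ
y = np.array([0, 1, 0, 1, 0, 1])
z = np.array([0, 1, 1, 1, 0, 0])
g = np.array([0, 0, 0, 1, 1, 1])
print(demographic_parity_loss(y, z, g))  # approximately 0.333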

To use the Demographic Parity loss, we’ll use three new parameters:

  • fairness_loss: the fairness loss that should be optimized alongside the L1 loss. This is only possible for binary classification problems; for regression or multi-class problems, this parameter must be set to None (the default value), otherwise an error is raised. The L1 and fairness losses are computed over the binarized predictions, not over the probabilities. Therefore, the decoupled classifier tries to identify the set of thresholds (one for each cohort’s estimator) that produces the lowest joint loss (L1 + fairness loss). The available options for this parameter are:

    • None: don’t use any fairness loss. The threshold used for each cohort is identified through the ROC curve, that is, without considering any fairness metric. This is the default behavior;

    • “balanced”: the Balanced loss is computed as the mean of the per-cohort L1 losses. It is useful when we want all cohorts to have a similar L1 loss, with every cohort given equal weight, which makes it ideal for unbalanced datasets;

    • “num_parity”: the Numerical Parity loss forces all cohorts to have a similar number of positive labels. It is useful when a model should output an equal number of positive labels for each cohort;

    • “dem_parity”: the Demographic Parity loss forces all cohorts to have a similar rate of positive labels. It is somewhat similar to the Numerical Parity loss, but it accounts for the difference in cohort sizes: the number of positive labels may differ between cohorts of different sizes, but the ratio of positive labels to cohort size should be consistent across cohorts. This is useful when we want an unbiased model, that is, a model that outputs an equal proportion of positive labels regardless of the cohort an instance belongs to;

  • lambda_coef: the \(\lambda\) variable presented in the equations above, which represents the weight assigned to the L1 loss when computing the joint loss (L1 + fairness loss). This parameter is ignored when fairness_loss = None;

  • max_joint_loss_time: the maximum time (in seconds) allowed for the decoupled classifier to run its fairness optimization step. This parameter is ignored when fairness_loss = None. When fairness_loss != None, the decoupled classifier tries to find the set of per-cohort thresholds such that the final predictions result in the minimum joint loss. However, this optimization step is computationally expensive and can take a long time depending on the number of cohorts and the size of the dataset. To avoid long execution times, we can specify the maximum time allowed for this step. If the optimization reaches the maximum time, the best set of thresholds found so far is returned;

Finally, another trick we’ll use here is to call the fetch_cohort_results() function with fixed_th = True. This makes the function use the thresholds from the DecoupledClass object passed through the cohort_def parameter (the same ones returned by the get_threasholds_dict() method used in the previous cell). Using fixed_th = True only works if a DecoupledClass object is provided through the cohort_def parameter; otherwise, an error will be raised.

[15]:
preprocessing = [dp.BasicImputer(verbose=False), dp.EncoderOrdinal(verbose=False)]

dec_class = DecoupledClass(
                    cohort_col=["department"],
                    transform_pipe=preprocessing,
                    estimator=get_model(),
                    minority_min_rate=0.01,
                    min_cohort_pct=0.01,
                    theta=False,
                    fairness_loss="dem_parity",
                    lambda_coef=0.5,
                    max_joint_loss_time=200
                )
dec_class.fit(X_train, y_train)

pred = dec_class.predict_proba(X_test)
fetch_cohort_results(X_test, y_test, pred, cohort_def=dec_class, fixed_th=True)
[15]:
cohort cht_query roc precision recall f1 accuracy threshold num_pos %_pos cht_size
0 all all 0.895163 0.876146 0.685779 0.742239 0.939488 0.500000 659 0.040078 16443
1 cohort_0 (`department` == "Analytics") 0.722190 0.636853 0.583599 0.600360 0.886999 0.365011 93 0.056501 1646
2 cohort_1 (`department` == "Finance") 0.919360 0.878382 0.762182 0.807264 0.947507 0.579517 45 0.059055 762
3 cohort_2 (`department` == "HR") 0.895265 0.655064 0.741758 0.685869 0.913978 0.051923 68 0.091398 744
4 cohort_3 (`department` == "Legal") 0.915327 0.630507 0.820043 0.672362 0.898773 0.033636 40 0.122699 326
5 cohort_4 (`department` == "Operations") 0.896737 0.759269 0.724897 0.740439 0.918895 0.278185 265 0.078728 3366
6 cohort_5 (`department` == "Procurement") 0.901670 0.812134 0.709188 0.747522 0.926606 0.418526 135 0.061927 2180
7 cohort_6 (`department` == "R&D") 0.718231 0.578056 0.526861 0.535669 0.944444 0.664189 5 0.015432 324
8 cohort_7 (`department` == "Sales & Marketing") 0.935592 0.731392 0.742891 0.736963 0.929779 0.231286 371 0.073801 5027
9 cohort_8 (`department` == "Technology") 0.869781 0.775429 0.668209 0.704017 0.904739 0.418906 134 0.064797 2068

Note that, when using Demographic Parity with \(\lambda = 0.5\), the percentage of positive labels inside each cohort is more even across cohorts when compared to the previous result. This is important if the generated model must not be biased against a given group, that is, it must provide equal rates of positive labels for each cohort, even if this hurts the overall performance of the model. The \(\lambda\) parameter controls how much weight is given to the fairness metric compared to the L1 loss: low \(\lambda\) values focus more on the fairness metric, while high \(\lambda\) values give more attention to the L1 loss.