Cohort Case Study 3

This notebook follows an approach similar to the one used in the Cohort Case Study 2 notebook.

We’ll start by downloading the dataset, reading it into a DataFrame, and splitting it into train and test sets.

[23]:
import sys
sys.path.append('../../../notebooks')
import random
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.pipeline import Pipeline
from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
import uci_dataset as database

from raimitigations.utils import split_data
import raimitigations.dataprocessing as dp
from raimitigations.cohort import (
    CohortDefinition,
    CohortManager,
    fetch_cohort_results
)
from download import download_datasets

SEED = 42
#SEED = None

np.random.seed(SEED)
random.seed(SEED)

data_dir = '../../../datasets/'
download_datasets(data_dir)
df = pd.read_csv(data_dir + 'hr_promotion/train.csv')
df.drop(columns=['employee_id'], inplace=True)
label_col = 'is_promoted'
df
[23]:
department region education gender recruitment_channel no_of_trainings age previous_year_rating length_of_service KPIs_met >80% awards_won? avg_training_score is_promoted
0 Sales & Marketing region_7 Master's & above f sourcing 1 35 5.0 8 1 0 49 0
1 Operations region_22 Bachelor's m other 1 30 5.0 4 0 0 60 0
2 Sales & Marketing region_19 Bachelor's m sourcing 1 34 3.0 7 0 0 50 0
3 Sales & Marketing region_23 Bachelor's m other 2 39 1.0 10 0 0 50 0
4 Technology region_26 Bachelor's m other 1 45 3.0 2 0 0 73 0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
54803 Technology region_14 Bachelor's m sourcing 1 48 3.0 17 0 0 78 0
54804 Operations region_27 Master's & above f other 1 37 2.0 6 0 0 56 0
54805 Analytics region_1 Bachelor's m other 1 27 5.0 3 1 0 79 0
54806 Sales & Marketing region_9 NaN m sourcing 1 29 1.0 2 0 0 45 0
54807 HR region_22 Bachelor's m other 1 27 1.0 5 0 0 49 0

54808 rows × 13 columns

[24]:
df.isna().any()
[24]:
department              False
region                  False
education                True
gender                  False
recruitment_channel     False
no_of_trainings         False
age                     False
previous_year_rating     True
length_of_service       False
KPIs_met >80%           False
awards_won?             False
avg_training_score      False
is_promoted             False
dtype: bool
[25]:
X_train, X_test, y_train, y_test = split_data(df, label_col, test_size=0.3)
print(X_train.shape)
print(X_test.shape)
(38365, 12)
(16443, 12)

Once again, we’ll be using the same estimator for all of our experiments, since our goal here is to compare different approaches to cohort processing.

[26]:
def get_model():
    model = LGBMClassifier(random_state=SEED)
    #model = LogisticRegression()
    return model

Initial analysis of multiple cohort definitions

Now that we have our dataset, we can create a simple pipeline and fit it using the training data. We’ll then use the fetch_cohort_results function to show the results obtained for the entire dataset, as well as for each cohort. Notice that this function computes all metrics for each cohort separately, so a different optimal threshold may be selected for each cohort (the optimal threshold for a given set of predictions is found using the ROC curve).
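To give an intuition of how a threshold can be derived from the ROC curve, the snippet below maximizes Youden’s J statistic (TPR - FPR). This is only an illustrative sketch of one common approach; the exact rule used internally by fetch_cohort_results may differ, and the helper shown here is not part of the original notebook.

from sklearn.metrics import roc_curve

def optimal_threshold_from_roc(y_true, y_score):
    # y_score holds the predicted probability of the positive class
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    best_idx = np.argmax(tpr - fpr)  # index that maximizes Youden's J
    return thresholds[best_idx]

# Hypothetical usage on one cohort's training predictions:
# th = optimal_threshold_from_roc(y_train_cohort, pred_train_cohort[:, 1])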

[27]:
#EXPERIMENT: Baseline 1

pipe = Pipeline([
    ("imputer", dp.BasicImputer(verbose=False)),
    ("scaler", dp.DataStandardScaler(verbose=False)),
    ("encoder", dp.EncoderOHE(verbose=False)),
    ("estimator", get_model())
])
pipe.fit(X_train, y_train)
pred = pipe.predict_proba(X_test)

Our first analysis will focus on the cohorts defined by the department feature.

[28]:
pred_train = pipe.predict_proba(X_train)
_, th_dict = fetch_cohort_results(X_train, y_train, pred_train, cohort_col=["department"], return_th_dict=True)
fetch_cohort_results(X_test, y_test, pred, cohort_col=["department"], fixed_th=th_dict)
[28]:
cohort cht_query roc precision recall f1 accuracy threshold cht_size
0 all all 0.906538 0.621886 0.806852 0.635748 0.788238 0.111382 16443
1 cohort_0 (`department` == "Analytics") 0.776998 0.578288 0.680165 0.576152 0.741798 0.141988 1646
2 cohort_1 (`department` == "Finance") 0.928949 0.657223 0.842639 0.689856 0.835958 0.118915 762
3 cohort_2 (`department` == "HR") 0.915106 0.590844 0.836284 0.589310 0.775538 0.087160 744
4 cohort_3 (`department` == "Legal") 0.900643 0.571520 0.836442 0.551627 0.748466 0.083536 326
5 cohort_4 (`department` == "Operations") 0.904996 0.616505 0.791970 0.621249 0.761735 0.099676 3366
6 cohort_5 (`department` == "Procurement") 0.906561 0.639946 0.788949 0.664866 0.815596 0.132125 2180
7 cohort_6 (`department` == "R&D") 0.813484 0.556883 0.665372 0.566567 0.845679 0.158987 324
8 cohort_7 (`department` == "Sales & Marketing") 0.941234 0.643102 0.840060 0.678709 0.853988 0.132790 5027
9 cohort_8 (`department` == "Technology") 0.888814 0.658265 0.771988 0.685713 0.826886 0.176120 2068

Let’s now check the results for the cohorts defined by the education feature.

[29]:
_, th_dict = fetch_cohort_results(X_train, y_train, pred_train, cohort_col=["education"], return_th_dict=True)
fetch_cohort_results(X_test, y_test, pred, cohort_col=["education"], fixed_th=th_dict)
[29]:
cohort cht_query roc precision recall f1 accuracy threshold cht_size
0 all all 0.906538 0.621886 0.806852 0.635748 0.788238 0.111382 16443
1 cohort_0 (`education` == "Bachelor's") 0.901324 0.615260 0.794942 0.629060 0.791273 0.111382 10909
2 cohort_1 (`education` == "Below Secondary") 0.880221 0.629514 0.805487 0.647113 0.795082 0.110401 244
3 cohort_2 (`education` == "Master's & above") 0.911406 0.644196 0.802767 0.667880 0.808548 0.131799 4539
4 cohort_3 (`education`.isnull()) 0.947114 0.634414 0.823119 0.674122 0.882823 0.133010 751

Let’s now check the results for the cohorts defined by the gender feature.

[30]:
_, th_dict = fetch_cohort_results(X_train, y_train, pred_train, cohort_col=["gender"], return_th_dict=True)
fetch_cohort_results(X_test, y_test, pred, cohort_col=["gender"], fixed_th=th_dict)
[30]:
cohort cht_query roc precision recall f1 accuracy threshold cht_size
0 all all 0.906538 0.621886 0.806852 0.635748 0.788238 0.111382 16443
1 cohort_0 (`gender` == "f") 0.901428 0.621032 0.803937 0.622139 0.751782 0.102872 4770
2 cohort_1 (`gender` == "m") 0.908411 0.619794 0.808885 0.634991 0.794312 0.111398 11673

Finally, let’s check the results for the cohorts defined by the recruitment_channel feature.

[31]:
_, th_dict = fetch_cohort_results(X_train, y_train, pred_train, cohort_col=["recruitment_channel"], return_th_dict=True)
fetch_cohort_results(X_test, y_test, pred, cohort_col=["recruitment_channel"], fixed_th=th_dict)
[31]:
cohort cht_query roc precision recall f1 accuracy threshold cht_size
0 all all 0.906538 0.621886 0.806852 0.635748 0.788238 0.111382 16443
1 cohort_0 (`recruitment_channel` == "other") 0.904970 0.627684 0.797138 0.649783 0.811548 0.120166 9127
2 cohort_1 (`recruitment_channel` == "referred") 0.877682 0.604750 0.765023 0.547651 0.626582 0.120019 316
3 cohort_2 (`recruitment_channel` == "sourcing") 0.910529 0.624562 0.812469 0.640121 0.792571 0.111398 7000

Analyzing the results, we can see that most of the cohort definitions produce cohorts with relatively similar metrics, the exception being the department column. Therefore, for the rest of this notebook, we’ll make an in-depth analysis of how we can improve the performance metrics for the cohorts defined by the department column.
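Before doing so, it helps to see how different the department cohorts actually are in the training data. The sketch below (plain pandas, not part of the original notebook; it assumes y_train is a pandas Series aligned with X_train) prints each department’s size and promotion rate, which often hints at why some cohorts are harder to model than others.

train_df = X_train.copy()
train_df[label_col] = y_train  # attach the label back to the features
cohort_stats = train_df.groupby("department")[label_col].agg(["count", "mean"])
cohort_stats.columns = ["size", "promotion_rate"]
print(cohort_stats.sort_values("size"))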

A Closer Look at the “department” cohorts

In this section, we’ll take a closer look at how we can try to improve the results for the cohorts defined by the department column. As we’ve shown, this cohort definition produces considerably different metrics from one cohort to another. Here we’ll try different cohort definitions and data rebalancing to improve the performance over this set of cohorts.
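To make the rebalancing idea concrete before we get to the experiments, the sketch below oversamples the minority class of a single department cohort with SMOTE from imbalanced-learn. This is only an illustration of the concept and is not part of the original notebook; the actual experiments rely on raimitigations’ own transformers, and "R&D" is picked arbitrarily as an example cohort.

from imblearn.over_sampling import SMOTE

# Select one cohort and one-hot encode its categorical features so SMOTE can run
mask = X_train["department"] == "R&D"
X_cht = pd.get_dummies(X_train[mask].drop(columns=["department"]))
X_cht = X_cht.fillna(X_cht.median(numeric_only=True))  # SMOTE does not accept NaNs
y_cht = y_train[mask]

# Oversample the minority (promoted) class to match the majority class
X_res, y_res = SMOTE(random_state=SEED).fit_resample(X_cht, y_cht)
print(y_cht.value_counts(), y_res.value_counts(), sep="\n")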

Cohort-based pipeline for the department column

As we’ve done in the other Cohort Case Study notebooks, we’ll create a cohort-based pipeline using the department column as the cohort definition.

[33]:
#EXPERIMENT: Cohort 1

cht_manager = CohortManager(
    transform_pipe=[
        dp.BasicImputer(verbose=False),
        dp.DataStandardScaler(verbose=False),
        dp.EncoderOHE(drop=False, unknown_err=False, verbose=False),
        get_model()
    ],
    cohort_col=["department"]
)
cht_manager.fit(X_train, y_train)
pred_cht = cht_manager.predict_proba(X_test)

pred_train = cht_manager.predict_proba(X_train)
metrics_train, th_dict = fetch_cohort_results(X_train, y_train, pred_train, cohort_col=["department"], return_th_dict=True)
fetch_cohort_results(X_test, y_test, pred_cht, cohort_col=["department"], fixed_th=th_dict)
/home/mmendonca/ResponsibleAI/code/git/responsible-ai-mitigations/raimitigations/utils/metric_utils.py:189: RuntimeWarning: invalid value encountered in true_divide
  fscore = (2 * precision * recall) / (precision + recall)
[33]:
cohort cht_query roc precision recall f1 accuracy threshold cht_size
0 all all 0.896297 0.654764 0.748016 0.684303 0.870826 0.160967 16443
1 cohort_0 (`department` == "Analytics") 0.739505 0.611656 0.626435 0.618234 0.857837 0.220229 1646
2 cohort_1 (`department` == "Finance") 0.914678 0.848933 0.766767 0.800959 0.943570 0.480133 762
3 cohort_2 (`department` == "HR") 0.896995 0.829796 0.685490 0.734559 0.955645 0.588032 744
4 cohort_3 (`department` == "Legal") 0.918971 0.819138 0.695177 0.740446 0.963190 0.691379 326
5 cohort_4 (`department` == "Operations") 0.895444 0.662804 0.755522 0.692435 0.866607 0.158296 3366
6 cohort_5 (`department` == "Procurement") 0.902884 0.685450 0.732900 0.705026 0.884404 0.203897 2180
7 cohort_6 (`department` == "R&D") 0.688026 0.578056 0.526861 0.535669 0.944444 0.661615 324
8 cohort_7 (`department` == "Sales & Marketing") 0.937158 0.659160 0.787313 0.696528 0.887806 0.156050 5027
9 cohort_8 (`department` == "Technology") 0.874313 0.655917 0.732302 0.680105 0.840426 0.197238 2068

From the previous results, we can see that, by using the cohort-based pipeline for the department column, we can considerably improve the precision, F1 score, and accuracy for most of the cohorts when compared to the single pipeline.
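If we want this comparison at a glance, one option is to merge the two metric tables returned by fetch_cohort_results on the cohort name. The sketch below assumes the DataFrames from the baseline run and the cohort-based run were stored as baseline_metrics and cohort_metrics; these variable names are hypothetical and were not created in the cells above.

# Hypothetical side-by-side comparison of the two experiments
comparison = baseline_metrics.merge(
    cohort_metrics,
    on=["cohort", "cht_query"],
    suffixes=("_baseline", "_cohort"),
)
cols = ["cohort", "roc_baseline", "roc_cohort", "f1_baseline", "f1_cohort"]
print(comparison[cols])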