Cohort Case Study 3
This notebook follows a similar approach to the one used in the Cohort Case Study 2 notebook. We’ll start by downloading the dataset, reading it into a dataframe, and splitting it into train and test sets.
[23]:
import sys
sys.path.append('../../../notebooks')
import random
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.pipeline import Pipeline
from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
import uci_dataset as database
from raimitigations.utils import split_data
import raimitigations.dataprocessing as dp
from raimitigations.cohort import (
    CohortDefinition,
    CohortManager,
    fetch_cohort_results
)
from download import download_datasets
SEED = 42
#SEED = None
np.random.seed(SEED)
random.seed(SEED)
data_dir = '../../../datasets/'
download_datasets(data_dir)
df = pd.read_csv(data_dir + 'hr_promotion/train.csv')
df.drop(columns=['employee_id'], inplace=True)
label_col = 'is_promoted'
df
[23]:
 | department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | KPIs_met >80% | awards_won? | avg_training_score | is_promoted
---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | Sales & Marketing | region_7 | Master's & above | f | sourcing | 1 | 35 | 5.0 | 8 | 1 | 0 | 49 | 0 |
1 | Operations | region_22 | Bachelor's | m | other | 1 | 30 | 5.0 | 4 | 0 | 0 | 60 | 0 |
2 | Sales & Marketing | region_19 | Bachelor's | m | sourcing | 1 | 34 | 3.0 | 7 | 0 | 0 | 50 | 0 |
3 | Sales & Marketing | region_23 | Bachelor's | m | other | 2 | 39 | 1.0 | 10 | 0 | 0 | 50 | 0 |
4 | Technology | region_26 | Bachelor's | m | other | 1 | 45 | 3.0 | 2 | 0 | 0 | 73 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
54803 | Technology | region_14 | Bachelor's | m | sourcing | 1 | 48 | 3.0 | 17 | 0 | 0 | 78 | 0 |
54804 | Operations | region_27 | Master's & above | f | other | 1 | 37 | 2.0 | 6 | 0 | 0 | 56 | 0 |
54805 | Analytics | region_1 | Bachelor's | m | other | 1 | 27 | 5.0 | 3 | 1 | 0 | 79 | 0 |
54806 | Sales & Marketing | region_9 | NaN | m | sourcing | 1 | 29 | 1.0 | 2 | 0 | 0 | 45 | 0 |
54807 | HR | region_22 | Bachelor's | m | other | 1 | 27 | 1.0 | 5 | 0 | 0 | 49 | 0 |
54808 rows × 13 columns
[24]:
df.isna().any()
[24]:
department False
region False
education True
gender False
recruitment_channel False
no_of_trainings False
age False
previous_year_rating True
length_of_service False
KPIs_met >80% False
awards_won? False
avg_training_score False
is_promoted False
dtype: bool
[25]:
X_train, X_test, y_train, y_test = split_data(df, label_col, test_size=0.3)
print(X_train.shape)
print(X_test.shape)
(38365, 12)
(16443, 12)
Once again, we’ll be using the same estimator for all of our experiments, since our goal here is to compare different approaches to cohort processing.
[26]:
def get_model():
    model = LGBMClassifier(random_state=SEED)
    # model = LogisticRegression()
    return model
Initial analysis of multiple cohort definitions
Now that we have our dataset, we can create a simple pipeline and fit it using the training data. We’ll then use the `fetch_cohort_results` function to show the results obtained over the entire dataset, as well as over each cohort. Notice that this function computes all metrics for each cohort separately, so a different threshold may be found for each cohort (the optimal threshold for a given set of predictions is derived from its ROC curve). To avoid tuning thresholds on the test set, we first compute them over the training predictions (`return_th_dict=True`) and then pass them as fixed thresholds (`fixed_th`) when evaluating the test set.
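For reference, the sketch below shows one common way to derive such an optimal threshold from the ROC curve: Youden’s J statistic, i.e., the point maximizing TPR − FPR. This is an illustration of the idea, not necessarily the exact criterion implemented by `fetch_cohort_results`.
[ ]:
import numpy as np
from sklearn.metrics import roc_curve

def optimal_threshold(y_true, y_score):
    """Pick the threshold at the ROC point maximizing TPR - FPR (Youden's J)."""
    # y_score is the positive-class probability, e.g. pred[:, 1]
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    return thresholds[np.argmax(tpr - fpr)]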
[27]:
# EXPERIMENT: Baseline 1
pipe = Pipeline([
    ("imputer", dp.BasicImputer(verbose=False)),
    ("scaler", dp.DataStandardScaler(verbose=False)),
    ("encoder", dp.EncoderOHE(verbose=False)),
    ("estimator", get_model())
])
pipe.fit(X_train, y_train)
pred = pipe.predict_proba(X_test)
Our first analysis will focus on the cohorts defined by the `department` feature.
[28]:
pred_train = pipe.predict_proba(X_train)
_, th_dict = fetch_cohort_results(X_train, y_train, pred_train, cohort_col=["department"], return_th_dict=True)
fetch_cohort_results(X_test, y_test, pred, cohort_col=["department"], fixed_th=th_dict)
[28]:
 | cohort | cht_query | roc | precision | recall | f1 | accuracy | threshold | cht_size
---|---|---|---|---|---|---|---|---|---
0 | all | all | 0.906538 | 0.621886 | 0.806852 | 0.635748 | 0.788238 | 0.111382 | 16443 |
1 | cohort_0 | (`department` == "Analytics") | 0.776998 | 0.578288 | 0.680165 | 0.576152 | 0.741798 | 0.141988 | 1646 |
2 | cohort_1 | (`department` == "Finance") | 0.928949 | 0.657223 | 0.842639 | 0.689856 | 0.835958 | 0.118915 | 762 |
3 | cohort_2 | (`department` == "HR") | 0.915106 | 0.590844 | 0.836284 | 0.589310 | 0.775538 | 0.087160 | 744 |
4 | cohort_3 | (`department` == "Legal") | 0.900643 | 0.571520 | 0.836442 | 0.551627 | 0.748466 | 0.083536 | 326 |
5 | cohort_4 | (`department` == "Operations") | 0.904996 | 0.616505 | 0.791970 | 0.621249 | 0.761735 | 0.099676 | 3366 |
6 | cohort_5 | (`department` == "Procurement") | 0.906561 | 0.639946 | 0.788949 | 0.664866 | 0.815596 | 0.132125 | 2180 |
7 | cohort_6 | (`department` == "R&D") | 0.813484 | 0.556883 | 0.665372 | 0.566567 | 0.845679 | 0.158987 | 324 |
8 | cohort_7 | (`department` == "Sales & Marketing") | 0.941234 | 0.643102 | 0.840060 | 0.678709 | 0.853988 | 0.132790 | 5027 |
9 | cohort_8 | (`department` == "Technology") | 0.888814 | 0.658265 | 0.771988 | 0.685713 | 0.826886 | 0.176120 | 2068 |
Let’s now check the results for the cohorts defined by the `education` feature.
[29]:
_, th_dict = fetch_cohort_results(X_train, y_train, pred_train, cohort_col=["education"], return_th_dict=True)
fetch_cohort_results(X_test, y_test, pred, cohort_col=["education"], fixed_th=th_dict)
[29]:
 | cohort | cht_query | roc | precision | recall | f1 | accuracy | threshold | cht_size
---|---|---|---|---|---|---|---|---|---
0 | all | all | 0.906538 | 0.621886 | 0.806852 | 0.635748 | 0.788238 | 0.111382 | 16443 |
1 | cohort_0 | (`education` == "Bachelor's") | 0.901324 | 0.615260 | 0.794942 | 0.629060 | 0.791273 | 0.111382 | 10909 |
2 | cohort_1 | (`education` == "Below Secondary") | 0.880221 | 0.629514 | 0.805487 | 0.647113 | 0.795082 | 0.110401 | 244 |
3 | cohort_2 | (`education` == "Master's & above") | 0.911406 | 0.644196 | 0.802767 | 0.667880 | 0.808548 | 0.131799 | 4539 |
4 | cohort_3 | (`education`.isnull()) | 0.947114 | 0.634414 | 0.823119 | 0.674122 | 0.882823 | 0.133010 | 751 |
Let’s now check the results for the cohorts defined by the `gender` feature.
[30]:
_, th_dict = fetch_cohort_results(X_train, y_train, pred_train, cohort_col=["gender"], return_th_dict=True)
fetch_cohort_results(X_test, y_test, pred, cohort_col=["gender"], fixed_th=th_dict)
[30]:
 | cohort | cht_query | roc | precision | recall | f1 | accuracy | threshold | cht_size
---|---|---|---|---|---|---|---|---|---
0 | all | all | 0.906538 | 0.621886 | 0.806852 | 0.635748 | 0.788238 | 0.111382 | 16443 |
1 | cohort_0 | (`gender` == "f") | 0.901428 | 0.621032 | 0.803937 | 0.622139 | 0.751782 | 0.102872 | 4770 |
2 | cohort_1 | (`gender` == "m") | 0.908411 | 0.619794 | 0.808885 | 0.634991 | 0.794312 | 0.111398 | 11673 |
Finally, let’s check the results for the cohorts defined by the `recruitment_channel` feature.
[31]:
_, th_dict = fetch_cohort_results(X_train, y_train, pred_train, cohort_col=["recruitment_channel"], return_th_dict=True)
fetch_cohort_results(X_test, y_test, pred, cohort_col=["recruitment_channel"], fixed_th=th_dict)
[31]:
 | cohort | cht_query | roc | precision | recall | f1 | accuracy | threshold | cht_size
---|---|---|---|---|---|---|---|---|---
0 | all | all | 0.906538 | 0.621886 | 0.806852 | 0.635748 | 0.788238 | 0.111382 | 16443 |
1 | cohort_0 | (`recruitment_channel` == "other") | 0.904970 | 0.627684 | 0.797138 | 0.649783 | 0.811548 | 0.120166 | 9127 |
2 | cohort_1 | (`recruitment_channel` == "referred") | 0.877682 | 0.604750 | 0.765023 | 0.547651 | 0.626582 | 0.120019 | 316 |
3 | cohort_2 | (`recruitment_channel` == "sourcing") | 0.910529 | 0.624562 | 0.812469 | 0.640121 | 0.792571 | 0.111398 | 7000 |
Analyzing the results, we can see that most of the cohort definitions produce cohorts with relatively similar metrics, the exception being the `department` column. Therefore, for the rest of this notebook, we’ll take an in-depth look at how to improve the performance metrics of the cohorts defined by the `department` column.
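One way to make this comparison concrete is to look at the spread of the per-cohort F1 scores under each definition. The sketch below reuses `fetch_cohort_results` and the column names from the tables above; since no `fixed_th` is passed, thresholds are re-optimized on the test set, so the exact numbers may differ slightly from the tables.
[ ]:
# Spread of per-cohort F1 for each cohort definition (illustrative sketch)
for col in ["department", "education", "gender", "recruitment_channel"]:
    res = fetch_cohort_results(X_test, y_test, pred, cohort_col=[col])
    f1 = res.loc[res["cohort"] != "all", "f1"]
    print(f"{col}: F1 range = {f1.max() - f1.min():.3f}")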
A Closer Look at the “department” cohorts
In this section, we’ll take a closer look at how we can try to improve the results for the cohorts defined by the `department` column. As we’ve shown, the metrics vary considerably across these cohorts. We’ll try different cohort definitions, as well as data rebalancing, to improve the performance over this set of cohorts; a preview sketch of the rebalancing idea follows below.
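As a preview of the rebalancing idea, the sketch below shows how a rebalancing step could be slotted into a cohort-based pipeline. It assumes the library’s `dp.Rebalance` transformer with default settings (oversampling the minority class within each cohort); the actual rebalancing experiments later in this notebook may be configured differently, so treat this as a sketch, not the notebook’s method.
[ ]:
# Hedged sketch: cohort-based pipeline with a per-cohort rebalancing step.
# Assumes dp.Rebalance with default settings; check the library docs.
rebalance_cht = CohortManager(
    transform_pipe=[
        dp.BasicImputer(verbose=False),
        dp.DataStandardScaler(verbose=False),
        dp.EncoderOHE(drop=False, unknown_err=False, verbose=False),
        dp.Rebalance(verbose=False),  # oversamples the minority class during fit
        get_model()
    ],
    cohort_col=["department"]
)
rebalance_cht.fit(X_train, y_train)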
Cohort-based pipeline for the “department” column
As we’ve done in the other Cohort Case Study notebooks, we’ll create a cohort-based pipeline using the `department` column as the cohort definition. The `CohortManager` below fits a separate preprocessing pipeline and estimator for each `department` cohort, instead of a single pipeline over the whole dataset.
[33]:
# EXPERIMENT: Cohort 1
cht_manager = CohortManager(
    transform_pipe=[
        dp.BasicImputer(verbose=False),
        dp.DataStandardScaler(verbose=False),
        dp.EncoderOHE(drop=False, unknown_err=False, verbose=False),
        get_model()
    ],
    cohort_col=["department"]
)
cht_manager.fit(X_train, y_train)
pred_cht = cht_manager.predict_proba(X_test)
pred_train = cht_manager.predict_proba(X_train)
metrics_train, th_dict = fetch_cohort_results(X_train, y_train, pred_train, cohort_col=["department"], return_th_dict=True)
fetch_cohort_results(X_test, y_test, pred_cht, cohort_col=["department"], fixed_th=th_dict)
/home/mmendonca/ResponsibleAI/code/git/responsible-ai-mitigations/raimitigations/utils/metric_utils.py:189: RuntimeWarning: invalid value encountered in true_divide
fscore = (2 * precision * recall) / (precision + recall)
[33]:
 | cohort | cht_query | roc | precision | recall | f1 | accuracy | threshold | cht_size
---|---|---|---|---|---|---|---|---|---
0 | all | all | 0.896297 | 0.654764 | 0.748016 | 0.684303 | 0.870826 | 0.160967 | 16443 |
1 | cohort_0 | (`department` == "Analytics") | 0.739505 | 0.611656 | 0.626435 | 0.618234 | 0.857837 | 0.220229 | 1646 |
2 | cohort_1 | (`department` == "Finance") | 0.914678 | 0.848933 | 0.766767 | 0.800959 | 0.943570 | 0.480133 | 762 |
3 | cohort_2 | (`department` == "HR") | 0.896995 | 0.829796 | 0.685490 | 0.734559 | 0.955645 | 0.588032 | 744 |
4 | cohort_3 | (`department` == "Legal") | 0.918971 | 0.819138 | 0.695177 | 0.740446 | 0.963190 | 0.691379 | 326 |
5 | cohort_4 | (`department` == "Operations") | 0.895444 | 0.662804 | 0.755522 | 0.692435 | 0.866607 | 0.158296 | 3366 |
6 | cohort_5 | (`department` == "Procurement") | 0.902884 | 0.685450 | 0.732900 | 0.705026 | 0.884404 | 0.203897 | 2180 |
7 | cohort_6 | (`department` == "R&D") | 0.688026 | 0.578056 | 0.526861 | 0.535669 | 0.944444 | 0.661615 | 324 |
8 | cohort_7 | (`department` == "Sales & Marketing") | 0.937158 | 0.659160 | 0.787313 | 0.696528 | 0.887806 | 0.156050 | 5027 |
9 | cohort_8 | (`department` == "Technology") | 0.874313 | 0.655917 | 0.732302 | 0.680105 | 0.840426 | 0.197238 | 2068 |
From the previous results, we can see that the cohort-based pipeline for the `department` column considerably improves the metrics for most cohorts when compared to the single pipeline (R&D and Technology are exceptions, where the F1 score drops slightly).
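To put the two experiments side by side, we can merge the per-cohort F1 scores of both evaluations. This is an illustrative sketch reusing the predictions computed above; without `fixed_th`, thresholds are re-optimized per cohort on the test set, so the numbers will differ slightly from the tables.
[ ]:
# Side-by-side F1 comparison: single pipeline vs. cohort-based pipeline
baseline_res = fetch_cohort_results(X_test, y_test, pred, cohort_col=["department"])
cohort_res = fetch_cohort_results(X_test, y_test, pred_cht, cohort_col=["department"])
comparison = baseline_res[["cohort", "cht_query", "f1"]].merge(
    cohort_res[["cohort", "f1"]],
    on="cohort",
    suffixes=("_single", "_cohort"),
)
print(comparison)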