Cohort Case Study 3
This notebook follows a similar approach to the one used in the Cohort Case Study 2 notebook. We’ll start by downloading the dataset, reading it into a dataframe, and splitting it into train and test sets.
[23]:
import sys
sys.path.append('../../../notebooks')
import random
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.pipeline import Pipeline
from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
import uci_dataset as database
from raimitigations.utils import split_data
import raimitigations.dataprocessing as dp
from raimitigations.cohort import (
    CohortDefinition,
    CohortManager,
    fetch_cohort_results
)
from download import download_datasets
SEED = 42
#SEED = None
np.random.seed(SEED)
random.seed(SEED)
data_dir = '../../../datasets/'
download_datasets(data_dir)
df = pd.read_csv(data_dir + 'hr_promotion/train.csv')
df.drop(columns=['employee_id'], inplace=True)
label_col = 'is_promoted'
df
[23]:
 | department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | KPIs_met >80% | awards_won? | avg_training_score | is_promoted
---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | Sales & Marketing | region_7 | Master's & above | f | sourcing | 1 | 35 | 5.0 | 8 | 1 | 0 | 49 | 0 |
1 | Operations | region_22 | Bachelor's | m | other | 1 | 30 | 5.0 | 4 | 0 | 0 | 60 | 0 |
2 | Sales & Marketing | region_19 | Bachelor's | m | sourcing | 1 | 34 | 3.0 | 7 | 0 | 0 | 50 | 0 |
3 | Sales & Marketing | region_23 | Bachelor's | m | other | 2 | 39 | 1.0 | 10 | 0 | 0 | 50 | 0 |
4 | Technology | region_26 | Bachelor's | m | other | 1 | 45 | 3.0 | 2 | 0 | 0 | 73 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
54803 | Technology | region_14 | Bachelor's | m | sourcing | 1 | 48 | 3.0 | 17 | 0 | 0 | 78 | 0 |
54804 | Operations | region_27 | Master's & above | f | other | 1 | 37 | 2.0 | 6 | 0 | 0 | 56 | 0 |
54805 | Analytics | region_1 | Bachelor's | m | other | 1 | 27 | 5.0 | 3 | 1 | 0 | 79 | 0 |
54806 | Sales & Marketing | region_9 | NaN | m | sourcing | 1 | 29 | 1.0 | 2 | 0 | 0 | 45 | 0 |
54807 | HR | region_22 | Bachelor's | m | other | 1 | 27 | 1.0 | 5 | 0 | 0 | 49 | 0 |
54808 rows × 13 columns
[24]:
df.isna().any()
[24]:
department False
region False
education True
gender False
recruitment_channel False
no_of_trainings False
age False
previous_year_rating True
length_of_service False
KPIs_met >80% False
awards_won? False
avg_training_score False
is_promoted False
dtype: bool
[25]:
X_train, X_test, y_train, y_test = split_data(df, label_col, test_size=0.3)
print(X_train.shape)
print(X_test.shape)
(38365, 12)
(16443, 12)
Once again, we’ll be using the same estimator for all of our experiments, since our goal here is to compare different approaches to cohort processing.
[26]:
def get_model():
    model = LGBMClassifier(random_state=SEED)
    # model = LogisticRegression()
    return model
Initial analysis of multiple cohort definitions
Now that we have our dataset, we can create a simple pipeline and fit it using the training data. We’ll then use the `fetch_cohort_results` function to show the results obtained over the entire dataset, as well as over each cohort. Notice that this function computes all metrics for each cohort separately, so a different threshold may be found for each cohort (the optimal threshold for a given set of predictions is derived from its ROC curve). To avoid tuning thresholds on the test set, we first compute them over the training predictions (`return_th_dict=True`) and then pass them as fixed thresholds (`fixed_th`) when evaluating the test set.
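For reference, the sketch below shows one common way to derive such an optimal threshold from the ROC curve: Youden’s J statistic, i.e., the point maximizing TPR − FPR. This is an illustration of the idea, not necessarily the exact criterion implemented by `fetch_cohort_results`.
[ ]:
import numpy as np
from sklearn.metrics import roc_curve

def optimal_threshold(y_true, y_score):
    """Pick the threshold at the ROC point maximizing TPR - FPR (Youden's J)."""
    # y_score is the positive-class probability, e.g. pred[:, 1]
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    return thresholds[np.argmax(tpr - fpr)]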
[27]:
# EXPERIMENT: Baseline 1
pipe = Pipeline([
    ("imputer", dp.BasicImputer(verbose=False)),
    ("scaler", dp.DataStandardScaler(verbose=False)),
    ("encoder", dp.EncoderOHE(verbose=False)),
    ("estimator", get_model())
])
pipe.fit(X_train, y_train)
pred = pipe.predict_proba(X_test)
Our first analysis will focus on the cohorts defined by the `department` feature.
[28]:
pred_train = pipe.predict_proba(X_train)
_, th_dict = fetch_cohort_results(X_train, y_train, pred_train, cohort_col=["department"], return_th_dict=True)
fetch_cohort_results(X_test, y_test, pred, cohort_col=["department"], fixed_th=th_dict)
[28]:
 | cohort | cht_query | roc | precision | recall | f1 | accuracy | threshold | cht_size
---|---|---|---|---|---|---|---|---|---
0 | all | all | 0.906538 | 0.621886 | 0.806852 | 0.635748 | 0.788238 | 0.111382 | 16443 |
1 | cohort_0 | (`department` == "Analytics") | 0.776998 | 0.578288 | 0.680165 | 0.576152 | 0.741798 | 0.141988 | 1646 |
2 | cohort_1 | (`department` == "Finance") | 0.928949 | 0.657223 | 0.842639 | 0.689856 | 0.835958 | 0.118915 | 762 |
3 | cohort_2 | (`department` == "HR") | 0.915106 | 0.590844 | 0.836284 | 0.589310 | 0.775538 | 0.087160 | 744 |
4 | cohort_3 | (`department` == "Legal") | 0.900643 | 0.571520 | 0.836442 | 0.551627 | 0.748466 | 0.083536 | 326 |
5 | cohort_4 | (`department` == "Operations") | 0.904996 | 0.616505 | 0.791970 | 0.621249 | 0.761735 | 0.099676 | 3366 |
6 | cohort_5 | (`department` == "Procurement") | 0.906561 | 0.639946 | 0.788949 | 0.664866 | 0.815596 | 0.132125 | 2180 |
7 | cohort_6 | (`department` == "R&D") | 0.813484 | 0.556883 | 0.665372 | 0.566567 | 0.845679 | 0.158987 | 324 |
8 | cohort_7 | (`department` == "Sales & Marketing") | 0.941234 | 0.643102 | 0.840060 | 0.678709 | 0.853988 | 0.132790 | 5027 |
9 | cohort_8 | (`department` == "Technology") | 0.888814 | 0.658265 | 0.771988 | 0.685713 | 0.826886 | 0.176120 | 2068 |
Let’s now check the results for the cohorts defined by the `education` feature.
[29]:
_, th_dict = fetch_cohort_results(X_train, y_train, pred_train, cohort_col=["education"], return_th_dict=True)
fetch_cohort_results(X_test, y_test, pred, cohort_col=["education"], fixed_th=th_dict)
[29]:
 | cohort | cht_query | roc | precision | recall | f1 | accuracy | threshold | cht_size
---|---|---|---|---|---|---|---|---|---
0 | all | all | 0.906538 | 0.621886 | 0.806852 | 0.635748 | 0.788238 | 0.111382 | 16443 |
1 | cohort_0 | (`education` == "Bachelor's") | 0.901324 | 0.615260 | 0.794942 | 0.629060 | 0.791273 | 0.111382 | 10909 |
2 | cohort_1 | (`education` == "Below Secondary") | 0.880221 | 0.629514 | 0.805487 | 0.647113 | 0.795082 | 0.110401 | 244 |
3 | cohort_2 | (`education` == "Master's & above") | 0.911406 | 0.644196 | 0.802767 | 0.667880 | 0.808548 | 0.131799 | 4539 |
4 | cohort_3 | (`education`.isnull()) | 0.947114 | 0.634414 | 0.823119 | 0.674122 | 0.882823 | 0.133010 | 751 |
Let’s now check the results for the cohorts defined by the `gender` feature.
[30]:
_, th_dict = fetch_cohort_results(X_train, y_train, pred_train, cohort_col=["gender"], return_th_dict=True)
fetch_cohort_results(X_test, y_test, pred, cohort_col=["gender"], fixed_th=th_dict)
[30]:
 | cohort | cht_query | roc | precision | recall | f1 | accuracy | threshold | cht_size
---|---|---|---|---|---|---|---|---|---
0 | all | all | 0.906538 | 0.621886 | 0.806852 | 0.635748 | 0.788238 | 0.111382 | 16443 |
1 | cohort_0 | (`gender` == "f") | 0.901428 | 0.621032 | 0.803937 | 0.622139 | 0.751782 | 0.102872 | 4770 |
2 | cohort_1 | (`gender` == "m") | 0.908411 | 0.619794 | 0.808885 | 0.634991 | 0.794312 | 0.111398 | 11673 |
Finally, let’s check the results for the cohorts defined by the `recruitment_channel` feature.
[31]:
_, th_dict = fetch_cohort_results(X_train, y_train, pred_train, cohort_col=["recruitment_channel"], return_th_dict=True)
fetch_cohort_results(X_test, y_test, pred, cohort_col=["recruitment_channel"], fixed_th=th_dict)
[31]:
 | cohort | cht_query | roc | precision | recall | f1 | accuracy | threshold | cht_size
---|---|---|---|---|---|---|---|---|---
0 | all | all | 0.906538 | 0.621886 | 0.806852 | 0.635748 | 0.788238 | 0.111382 | 16443 |
1 | cohort_0 | (`recruitment_channel` == "other") | 0.904970 | 0.627684 | 0.797138 | 0.649783 | 0.811548 | 0.120166 | 9127 |
2 | cohort_1 | (`recruitment_channel` == "referred") | 0.877682 | 0.604750 | 0.765023 | 0.547651 | 0.626582 | 0.120019 | 316 |
3 | cohort_2 | (`recruitment_channel` == "sourcing") | 0.910529 | 0.624562 | 0.812469 | 0.640121 | 0.792571 | 0.111398 | 7000 |
Analyzing the results, we can see that most of the cohort definitions produce cohorts with relatively similar metrics, the exception being the `department` column. Therefore, for the rest of this notebook, we’ll take an in-depth look at how to improve the performance metrics of the cohorts defined by the `department` column.
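One way to make this comparison concrete is to look at the spread of the per-cohort F1 scores under each definition. The sketch below reuses `fetch_cohort_results` and the column names from the tables above; since no `fixed_th` is passed, thresholds are re-optimized on the test set, so the exact numbers may differ slightly from the tables.
[ ]:
# Spread of per-cohort F1 for each cohort definition (illustrative sketch)
for col in ["department", "education", "gender", "recruitment_channel"]:
    res = fetch_cohort_results(X_test, y_test, pred, cohort_col=[col])
    f1 = res.loc[res["cohort"] != "all", "f1"]
    print(f"{col}: F1 range = {f1.max() - f1.min():.3f}")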
A Closer Look at the “department” cohorts
In this section, we’ll take a closer look at how we can try to improve the results for the cohorts defined by the `department` column. As we’ve shown, the metrics vary considerably across these cohorts. We’ll try different cohort definitions, as well as data rebalancing, to improve the performance over this set of cohorts; a preview sketch of the rebalancing idea follows below.
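As a preview of the rebalancing idea, the sketch below shows how a rebalancing step could be slotted into a cohort-based pipeline. It assumes the library’s `dp.Rebalance` transformer with default settings (oversampling the minority class within each cohort); the actual rebalancing experiments later in this notebook may be configured differently, so treat this as a sketch, not the notebook’s method.
[ ]:
# Hedged sketch: cohort-based pipeline with a per-cohort rebalancing step.
# Assumes dp.Rebalance with default settings; check the library docs.
rebalance_cht = CohortManager(
    transform_pipe=[
        dp.BasicImputer(verbose=False),
        dp.DataStandardScaler(verbose=False),
        dp.EncoderOHE(drop=False, unknown_err=False, verbose=False),
        dp.Rebalance(verbose=False),  # oversamples the minority class during fit
        get_model()
    ],
    cohort_col=["department"]
)
rebalance_cht.fit(X_train, y_train)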
Cohort-based pipeline for the “department” column
As we’ve done in the other Cohort Case Study notebooks, we’ll create a cohort-based pipeline using the `department` column as the cohort definition. The `CohortManager` below fits a separate preprocessing pipeline and estimator for each `department` cohort, instead of a single pipeline over the whole dataset.
[33]:
# EXPERIMENT: Cohort 1
cht_manager = CohortManager(
    transform_pipe=[
        dp.BasicImputer(verbose=False),
        dp.DataStandardScaler(verbose=False),
        dp.EncoderOHE(drop=False, unknown_err=False, verbose=False),
        get_model()
    ],
    cohort_col=["department"]
)
cht_manager.fit(X_train, y_train)
pred_cht = cht_manager.predict_proba(X_test)
pred_train = cht_manager.predict_proba(X_train)
metrics_train, th_dict = fetch_cohort_results(X_train, y_train, pred_train, cohort_col=["department"], return_th_dict=True)
fetch_cohort_results(X_test, y_test, pred_cht, cohort_col=["department"], fixed_th=th_dict)
/home/mmendonca/ResponsibleAI/code/git/responsible-ai-mitigations/raimitigations/utils/metric_utils.py:189: RuntimeWarning: invalid value encountered in true_divide
fscore = (2 * precision * recall) / (precision + recall)
[33]:
 | cohort | cht_query | roc | precision | recall | f1 | accuracy | threshold | cht_size
---|---|---|---|---|---|---|---|---|---
0 | all | all | 0.896297 | 0.654764 | 0.748016 | 0.684303 | 0.870826 | 0.160967 | 16443 |
1 | cohort_0 | (`department` == "Analytics") | 0.739505 | 0.611656 | 0.626435 | 0.618234 | 0.857837 | 0.220229 | 1646 |
2 | cohort_1 | (`department` == "Finance") | 0.914678 | 0.848933 | 0.766767 | 0.800959 | 0.943570 | 0.480133 | 762 |
3 | cohort_2 | (`department` == "HR") | 0.896995 | 0.829796 | 0.685490 | 0.734559 | 0.955645 | 0.588032 | 744 |
4 | cohort_3 | (`department` == "Legal") | 0.918971 | 0.819138 | 0.695177 | 0.740446 | 0.963190 | 0.691379 | 326 |
5 | cohort_4 | (`department` == "Operations") | 0.895444 | 0.662804 | 0.755522 | 0.692435 | 0.866607 | 0.158296 | 3366 |
6 | cohort_5 | (`department` == "Procurement") | 0.902884 | 0.685450 | 0.732900 | 0.705026 | 0.884404 | 0.203897 | 2180 |
7 | cohort_6 | (`department` == "R&D") | 0.688026 | 0.578056 | 0.526861 | 0.535669 | 0.944444 | 0.661615 | 324 |
8 | cohort_7 | (`department` == "Sales & Marketing") | 0.937158 | 0.659160 | 0.787313 | 0.696528 | 0.887806 | 0.156050 | 5027 |
9 | cohort_8 | (`department` == "Technology") | 0.874313 | 0.655917 | 0.732302 | 0.680105 | 0.840426 | 0.197238 | 2068 |
From the previous results, we can see that the cohort-based pipeline for the `department` column considerably improves the metrics for most cohorts when compared to the single pipeline (R&D and Technology are exceptions, where the F1 score drops slightly).
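To put the two experiments side by side, we can merge the per-cohort F1 scores of both evaluations. This is an illustrative sketch reusing the predictions computed above; without `fixed_th`, thresholds are re-optimized per cohort on the test set, so the numbers will differ slightly from the tables.
[ ]:
# Side-by-side F1 comparison: single pipeline vs. cohort-based pipeline
baseline_res = fetch_cohort_results(X_test, y_test, pred, cohort_col=["department"])
cohort_res = fetch_cohort_results(X_test, y_test, pred_cht, cohort_col=["department"])
comparison = baseline_res[["cohort", "cht_query", "f1"]].merge(
    cohort_res[["cohort", "f1"]],
    on="cohort",
    suffixes=("_single", "_cohort"),
)
print(comparison)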