Decoupled Classifiers Case Study 2

This notebook follows the same approach as the notebook Decoupled Classifiers Case Study 1, but this time we’ll use a real dataset.

[93]:
import sys
sys.path.append('../../../../notebooks')

import pandas as pd
import numpy as np
import random

from sklearn.tree import DecisionTreeClassifier
from lightgbm import LGBMClassifier

from raimitigations.utils import split_data
import raimitigations.dataprocessing as dp
from raimitigations.cohort import DecoupledClass, CohortDefinition, CohortManager, fetch_cohort_results, plot_value_counts_cohort
from sklearn.pipeline import Pipeline
from download import download_datasets

SEED = 100

Load and split the data into train and test sets:

[94]:
data_dir = "../../../../datasets/census/"
download_datasets(data_dir)

label_col = "income"
df_train = pd.read_csv(data_dir + "train.csv")
df_test = pd.read_csv(data_dir + "test.csv")
# convert to 0 and 1 encoding
df_train[label_col] = df_train[label_col].apply(lambda x: 0 if x == "<=50K" else 1)
df_test[label_col] = df_test[label_col].apply(lambda x: 0 if x == "<=50K" else 1)

X_train = df_train.drop(columns=[label_col])
y_train = df_train[label_col]
X_test = df_test.drop(columns=[label_col])
y_test = df_test[label_col]

df_train
[94]:
income age workclass fnlwgt education education-num marital-status occupation relationship race gender capital-gain capital-loss hours-per-week native-country
0 0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States
1 0 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States
2 0 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States
3 0 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States
4 0 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
32556 0 27 Private 257302 Assoc-acdm 12 Married-civ-spouse Tech-support Wife White Female 0 0 38 United-States
32557 1 40 Private 154374 HS-grad 9 Married-civ-spouse Machine-op-inspct Husband White Male 0 0 40 United-States
32558 0 58 Private 151910 HS-grad 9 Widowed Adm-clerical Unmarried White Female 0 0 40 United-States
32559 0 22 Private 201490 HS-grad 9 Never-married Adm-clerical Own-child White Male 0 0 20 United-States
32560 1 52 Self-emp-inc 287927 HS-grad 9 Married-civ-spouse Exec-managerial Wife White Female 15024 0 40 United-States

32561 rows × 15 columns

[95]:
def get_model():
    """Return a fresh, unfitted estimator seeded with SEED for reproducibility."""
    # Alternative estimator: DecisionTreeClassifier(max_features="sqrt")
    model = LGBMClassifier(random_state=SEED)
    return model

Let’s take a closer look at a few sensitive attributes of this dataset and at possible imbalances in the data.

The “race” cohorts

Let’s start by plotting the label distribution over the “race” cohorts.

[96]:
# BASELINE: "race"
cohort_set = CohortManager(
    transform_pipe=[
        dp.DataStandardScaler(verbose=False),
        dp.EncoderOHE(drop=False, unknown_err=False, verbose=False),
        get_model()
    ],
    cohort_col=["race"]
)
cohort_set.fit(X=X_train, y=y_train)
subsets = cohort_set.get_subsets(X_train, y_train, apply_transform=False)

print(y_train.value_counts())

for key in subsets.keys():
    print(f"\n{key} : {cohort_set.get_queries()[key]}:\n{subsets[key]['y'].value_counts()}\n***********")

plot_value_counts_cohort(y_train, subsets, normalize=False)
0    24720
1     7841
Name: income, dtype: int64

cohort_0 : (`race` == " Amer-Indian-Eskimo"):
0    275
1     36
Name: income, dtype: int64
***********

cohort_1 : (`race` == " Asian-Pac-Islander"):
0    763
1    276
Name: income, dtype: int64
***********

cohort_2 : (`race` == " Black"):
0    2737
1     387
Name: income, dtype: int64
***********

cohort_3 : (`race` == " Other"):
0    246
1     25
Name: income, dtype: int64
***********

cohort_4 : (`race` == " White"):
0    20699
1     7117
Name: income, dtype: int64
***********
../../../../_images/notebooks_cohort_case_study_decoupled_class_case_2_7_1.png

Clearly, there is an imbalance in both the size of the cohorts and their label distributions. Given the sensitivity of the “race” attribute, merging cohorts might not be an appropriate solution. Instead, let’s take a look at the effect of transfer learning:

Transfer Learning

Note: the following experiment is expected to fail. We’ll explain why it fails right after the cell.

[97]:
preprocessing = [dp.DataStandardScaler(verbose=False), dp.EncoderOHE(drop=False, unknown_err=False, verbose=False)]

dec_class = DecoupledClass(
    cohort_col=["race"],
    theta=True,
    min_fold_size_theta=15,
    min_cohort_pct=0.1,
    minority_min_rate=0.1,
    transform_pipe=preprocessing,
    estimator=get_model()
)
try:
    dec_class.fit(X_train, y_train)
except Exception as e:
    print(e)

INVALID COHORTS:
cohort_3:
        Size: 271
        Query:
                (`race` == " Other")
        Value Counts:
                0: 246 (90.77%)
                1: 25 (9.23%)
        Invalid: False
ERROR: Cannot use transfer learning over cohorts with skewed distributions.

The above code fails because at least one cohort has an invalid label distribution: its minority class falls below the minimum rate set by minority_min_rate (here, cohort_3 has only 9.23% positive labels against a minimum of 10%). To address this skewed distribution, we can use the Rebalance functionality of this library to rebalance the data.
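The check described above can be reproduced by hand from the label counts printed for the “race” cohorts. The snippet below is a minimal sketch in plain Python, not the library’s implementation: the helper `minority_rate` is a hypothetical name, and comparing the rate with a strict `<` against `minority_min_rate` is an assumption.

```python
# Label counts per "race" cohort, copied from the baseline output: (label 0, label 1)
cohort_counts = {
    "cohort_0": (275, 36),
    "cohort_1": (763, 276),
    "cohort_2": (2737, 387),
    "cohort_3": (246, 25),
    "cohort_4": (20699, 7117),
}
MINORITY_MIN_RATE = 0.1  # same value passed to DecoupledClass above


def minority_rate(neg, pos):
    """Share of the less frequent label within a cohort (hypothetical helper)."""
    return min(neg, pos) / (neg + pos)


rates = {name: minority_rate(*counts) for name, counts in cohort_counts.items()}
# Assumption: a cohort is flagged when its minority rate is strictly below the limit
skewed = [name for name, rate in rates.items() if rate < MINORITY_MIN_RATE]

for name, rate in rates.items():
    print(f"{name}: minority rate = {rate:.4f}")
print("Skewed cohorts:", skewed)
```

Under these assumptions only cohort_3 (25 / 271, roughly 9.23%) falls below the limit, which matches the cohort reported in the error message.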

[98]:
c0_pipe = [dp.Rebalance(strategy_over=0.5, verbose=False)]
c1_pipe = []
c2_pipe = []
c3_pipe = [dp.Rebalance(strategy_over=0.5, verbose=False)]
c4_pipe = []

cohort_set = CohortManager(
    transform_pipe=[c0_pipe, c1_pipe, c2_pipe, c3_pipe, c4_pipe],
    cohort_col=["race"]
)

rebalanced_X_train, rebalanced_y_train = cohort_set.fit_resample(X=X_train, y=y_train)
/home/mmendonca/anaconda3/envs/raipub/lib/python3.9/site-packages/sklearn/preprocessing/_encoders.py:808: FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its default value.
  warnings.warn(

Let’s look at the label distribution plots of the rebalanced data:

[99]:
cohort_set = CohortManager(
    transform_pipe=[
        dp.DataStandardScaler(verbose=False),
        dp.EncoderOHE(drop=False, unknown_err=False, verbose=False),
        get_model()
    ],
    cohort_col=["race"]
)
cohort_set.fit(X=rebalanced_X_train, y=rebalanced_y_train)
subsets = cohort_set.get_subsets(rebalanced_X_train, rebalanced_y_train, apply_transform=False)

print(rebalanced_y_train.value_counts())

for key in subsets.keys():
    print(f"\n{key} : {cohort_set.get_queries()[key]}:\n{subsets[key]['y'].value_counts()}\n***********")

plot_value_counts_cohort(rebalanced_y_train, subsets, normalize=False)

0    24720
1     8040
Name: income, dtype: int64

cohort_0 : (`race` == " Amer-Indian-Eskimo"):
0    275
1    137
Name: income, dtype: int64
***********

cohort_1 : (`race` == " Asian-Pac-Islander"):
0    763
1    276
Name: income, dtype: int64
***********

cohort_2 : (`race` == " Black"):
0    2737
1     387
Name: income, dtype: int64
***********

cohort_3 : (`race` == " Other"):
0    246
1    123
Name: income, dtype: int64
***********

cohort_4 : (`race` == " White"):
0    20699
1     7117
Name: income, dtype: int64
***********
../../../../_images/notebooks_cohort_case_study_decoupled_class_case_2_14_1.png
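The rebalanced counts for cohort_0 and cohort_3 are consistent with reading `strategy_over=0.5` as a target minority-to-majority ratio, with the result rounded down. The sketch below checks that arithmetic against the printed counts. It is an interpretation of the output, not a description of how Rebalance works internally, and `oversample_to_ratio` is a hypothetical helper.

```python
def oversample_to_ratio(neg, pos, ratio):
    """Counts after raising the minority class to ratio * majority, rounded down.

    Hypothetical helper; assumes label 0 is the majority class, as in this dataset.
    """
    target_pos = max(pos, int(ratio * neg))
    return neg, target_pos


# Original (label 0, label 1) counts per "race" cohort
original = {
    "cohort_0": (275, 36),
    "cohort_1": (763, 276),
    "cohort_2": (2737, 387),
    "cohort_3": (246, 25),
    "cohort_4": (20699, 7117),
}
# Only cohort_0 and cohort_3 were given a Rebalance step
rebalanced_cohorts = {"cohort_0", "cohort_3"}

expected = {
    name: oversample_to_ratio(*counts, ratio=0.5) if name in rebalanced_cohorts else counts
    for name, counts in original.items()
}
total_neg = sum(neg for neg, _ in expected.values())
total_pos = sum(pos for _, pos in expected.values())
print(expected["cohort_0"], expected["cohort_3"], total_neg, total_pos)
```

The expected counts (137 positives for cohort_0, 123 for cohort_3, and 8040 positives overall) agree with the output printed above.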

Then let’s look at the performance metrics of the rebalanced baseline cohorts:

[100]:
pred_cht = cohort_set.predict_proba(X_test)

pred_train = cohort_set.predict_proba(rebalanced_X_train)
metrics_train, th_dict = fetch_cohort_results(rebalanced_X_train, rebalanced_y_train, pred_train, cohort_col=["race"], return_th_dict=True)
fetch_cohort_results(X_test, y_test, pred_cht, cohort_col=["race"], fixed_th=th_dict)
[100]:
cohort cht_query roc precision recall f1 accuracy threshold num_pos %_pos cht_size
0 all all 0.919891 0.779694 0.836502 0.798674 0.838585 0.260096 5186 0.318531 16281
1 cohort_0 (`race` == " Amer-Indian-Eskimo") 0.719173 0.736372 0.594549 0.623024 0.886792 0.730358 7 0.044025 159
2 cohort_1 (`race` == " Asian-Pac-Islander") 0.881606 0.795724 0.750569 0.767495 0.827083 0.522477 104 0.216667 480
3 cohort_2 (`race` == " Black") 0.920955 0.771581 0.767609 0.769573 0.907111 0.367778 176 0.112748 1561
4 cohort_3 (`race` == " Other") 0.906909 0.839147 0.595455 0.617357 0.844444 0.943149 6 0.044444 135
5 cohort_4 (`race` == " White") 0.924379 0.780414 0.839327 0.797787 0.830346 0.254678 4860 0.348487 13946
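In the cell above, thresholds are obtained on the training set (`return_th_dict=True`) and then reused on the test set (`fixed_th=th_dict`). The sketch below illustrates what applying one threshold per cohort means. The probabilities and cohort assignments are made up for illustration, `apply_cohort_thresholds` is a hypothetical helper, and using `>=` for the comparison is an assumption.

```python
def apply_cohort_thresholds(probs, cohorts, thresholds):
    """Turn positive-class probabilities into 0/1 labels, one threshold per cohort."""
    return [int(p >= thresholds[c]) for p, c in zip(probs, cohorts)]


# Thresholds for two cohorts, rounded from the table above
thresholds = {"cohort_0": 0.730, "cohort_4": 0.255}

# Hypothetical predictions: the same probability can map to different labels
probs = [0.60, 0.60, 0.80, 0.20]
cohorts = ["cohort_0", "cohort_4", "cohort_0", "cohort_4"]

labels = apply_cohort_thresholds(probs, cohorts, thresholds)
print(labels)
```

A probability of 0.60 is labeled negative in cohort_0 but positive in cohort_4, which is why the `threshold` column matters when comparing `num_pos` and `%_pos` across cohorts.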

And now let’s attempt transfer learning again using the rebalanced data:

[101]:
preprocessing = [dp.DataStandardScaler(verbose=False), dp.EncoderOHE(drop=False, unknown_err=False, verbose=False)]

dec_class = DecoupledClass(
    cohort_col=["race"],
    theta=0.8,
    min_fold_size_theta=15,
    min_cohort_pct=0.1,
    minority_min_rate=0.1,
    transform_pipe=preprocessing,
    estimator=get_model()
)
dec_class.fit(rebalanced_X_train, rebalanced_y_train)

dec_class.print_cohorts()

subsets = dec_class.get_subsets(rebalanced_X_train, rebalanced_y_train, apply_transform=True)

print(rebalanced_y_train.value_counts())

for key in subsets.keys():
    print(f"\n{key} : {dec_class.get_queries()[key]}:\n{subsets[key]['y'].value_counts()}\n***********")

plot_value_counts_cohort(rebalanced_y_train, subsets, normalize=False)

FINAL COHORTS
cohort_0:
        Size: 412
        Query:
                (`race` == " Amer-Indian-Eskimo")
        Value Counts:
                0: 275 (66.75%)
                1: 137 (33.25%)
        Invalid: True
                Cohorts used as outside data: ['cohort_1', 'cohort_2', 'cohort_3', 'cohort_4']
                Theta = 0.8


cohort_1:
        Size: 1039
        Query:
                (`race` == " Asian-Pac-Islander")
        Value Counts:
                0: 763 (73.44%)
                1: 276 (26.56%)
        Invalid: True
                Cohorts used as outside data: ['cohort_0', 'cohort_2', 'cohort_3', 'cohort_4']
                Theta = 0.8


cohort_2:
        Size: 3124
        Query:
                (`race` == " Black")
        Value Counts:
                0: 2737 (87.61%)
                1: 387 (12.39%)
        Invalid: True
                Cohorts used as outside data: ['cohort_0', 'cohort_1', 'cohort_3', 'cohort_4']
                Theta = 0.8


cohort_3:
        Size: 369
        Query:
                (`race` == " Other")
        Value Counts:
                0: 246 (66.67%)
                1: 123 (33.33%)
        Invalid: True
                Cohorts used as outside data: ['cohort_0', 'cohort_1', 'cohort_2', 'cohort_4']
                Theta = 0.8


cohort_4:
        Size: 27816
        Query:
                (`race` == " White")
        Value Counts:
                0: 20699 (74.41%)
                1: 7117 (25.59%)
        Invalid: False


0    24720
1     8040
Name: income, dtype: int64

cohort_0 : (`race` == " Amer-Indian-Eskimo"):
0    275
1    137
Name: income, dtype: int64
***********

cohort_1 : (`race` == " Asian-Pac-Islander"):
0    763
1    276
Name: income, dtype: int64
***********

cohort_2 : (`race` == " Black"):
0    2737
1     387
Name: income, dtype: int64
***********

cohort_3 : (`race` == " Other"):
0    246
1    123
Name: income, dtype: int64
***********

cohort_4 : (`race` == " White"):
0    20699
1     7117
Name: income, dtype: int64
***********
../../../../_images/notebooks_cohort_case_study_decoupled_class_case_2_18_1.png
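In the summary printed above, the cohorts marked `Invalid: True` are exactly those that hold less than 10% of the rebalanced training set, which matches the `min_cohort_pct=0.1` argument. The following sketch reproduces that comparison from the printed sizes. Treating `min_cohort_pct` as a share of the full dataset is an assumption based on this output, and the library may apply additional size rules.

```python
# Cohort sizes copied from the "FINAL COHORTS" summary
cohort_sizes = {
    "cohort_0": 412,
    "cohort_1": 1039,
    "cohort_2": 3124,
    "cohort_3": 369,
    "cohort_4": 27816,
}
MIN_COHORT_PCT = 0.1  # same value passed to DecoupledClass above

total = sum(cohort_sizes.values())
shares = {name: size / total for name, size in cohort_sizes.items()}
# Assumption: a cohort is too small when its share is below min_cohort_pct
too_small = [name for name, share in shares.items() if share < MIN_COHORT_PCT]

for name, share in shares.items():
    print(f"{name}: {share:.2%} of the training data")
print("Cohorts below the minimum share:", too_small)
```

Under this reading, cohort_0 through cohort_3 are too small to be trained on their own data alone, so they borrow data from the other cohorts, while cohort_4 does not.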
[102]:
th_dict = dec_class.get_threasholds_dict()
pred = dec_class.predict_proba(X_test)
fetch_cohort_results(X_test, y_test, pred, cohort_def=dec_class, fixed_th=th_dict)
[102]:
cohort cht_query roc precision recall f1 accuracy threshold num_pos %_pos cht_size
0 all all 0.927809 0.835215 0.799236 0.814789 0.873288 0.500000 3285 0.201769 16281
1 cohort_0 (`race` == " Amer-Indian-Eskimo") 0.941729 0.799934 0.839850 0.817998 0.918239 0.406598 22 0.138365 159
2 cohort_1 (`race` == " Asian-Pac-Islander") 0.905202 0.793487 0.802702 0.797815 0.835417 0.408684 140 0.291667 480
3 cohort_2 (`race` == " Black") 0.950893 0.834517 0.809811 0.821497 0.930173 0.404750 164 0.105061 1561
4 cohort_3 (`race` == " Other") 0.973455 0.877273 0.877273 0.877273 0.925926 0.395015 25 0.185185 135
5 cohort_4 (`race` == " White") 0.924379 0.805904 0.820844 0.812812 0.856016 0.384067 3756 0.269325 13946

Thanks to the rebalancing, we were able to run transfer learning over the “race” cohorts.

Transfer learning seems to have a positive effect on most cohorts in the test set. The ROC improves for every cohort that used outside data (for example, from 0.719 to 0.942 for cohort_0 and from 0.907 to 0.973 for cohort_3), and the F1 scores and predicted-positive rates are more comparable across cohorts than in the baseline.
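To make the comparison concrete, the ROC values of the two “race” result tables (rebalanced baseline versus transfer learning) can be subtracted cohort by cohort. This is a small sketch using the numbers printed above; nothing here calls the library.

```python
# ROC per "race" cohort on the test set, copied from the two result tables
baseline_roc = {
    "cohort_0": 0.719173,
    "cohort_1": 0.881606,
    "cohort_2": 0.920955,
    "cohort_3": 0.906909,
    "cohort_4": 0.924379,
}
transfer_roc = {
    "cohort_0": 0.941729,
    "cohort_1": 0.905202,
    "cohort_2": 0.950893,
    "cohort_3": 0.973455,
    "cohort_4": 0.924379,
}

roc_gain = {name: transfer_roc[name] - baseline_roc[name] for name in baseline_roc}
improved = [name for name, gain in roc_gain.items() if gain > 0]

for name, gain in roc_gain.items():
    print(f"{name}: {gain:+.6f}")
print("Cohorts with a higher ROC:", improved)
```

The four smaller cohorts gain between roughly 0.02 and 0.22 in ROC, while cohort_4, which did not use outside data, is unchanged.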

The “gender” cohorts

Next, let’s look at the plots and label distribution of another sensitive attribute, “gender”:

[103]:
cohort_set = CohortManager(
    transform_pipe=[
        dp.DataStandardScaler(verbose=False),
        dp.EncoderOHE(drop=False, unknown_err=False, verbose=False),
        get_model()
    ],
    cohort_col=["gender"]
)
cohort_set.fit(X=X_train, y=y_train)
subsets = cohort_set.get_subsets(X_train, y_train, apply_transform=False)

print(y_train.value_counts())

for key in subsets.keys():
    print(f"\n{key} : {cohort_set.get_queries()[key]}:\n{subsets[key]['y'].value_counts()}\n***********")

plot_value_counts_cohort(y_train, subsets, normalize=False)
0    24720
1     7841
Name: income, dtype: int64

cohort_0 : (`gender` == " Female"):
0    9592
1    1179
Name: income, dtype: int64
***********

cohort_1 : (`gender` == " Male"):
0    15128
1     6662
Name: income, dtype: int64
***********
../../../../_images/notebooks_cohort_case_study_decoupled_class_case_2_22_1.png
[104]:
pred_cht = cohort_set.predict_proba(X_test)

pred_train = cohort_set.predict_proba(X_train)
metrics_train, th_dict = fetch_cohort_results(X_train, y_train, pred_train, cohort_col=["gender"], return_th_dict=True)
fetch_cohort_results(X_test, y_test, pred_cht, cohort_col=["gender"], fixed_th=th_dict)

[104]:
cohort cht_query roc precision recall f1 accuracy threshold num_pos %_pos cht_size
0 all all 0.924731 0.775651 0.840135 0.794981 0.832750 0.243250 5447 0.334562 16281
1 cohort_0 (`gender` == " Female") 0.939232 0.749313 0.854318 0.787376 0.899465 0.156009 895 0.165099 5421
2 cohort_1 (`gender` == " Male") 0.908182 0.778074 0.818033 0.787847 0.806906 0.280206 4349 0.400460 10860

The distribution of the “gender” column might not be as skewed as the “race” column, but we can still see that cohort_1 has a more balanced distribution than cohort_0. Let’s attempt to use transfer learning over these cohorts:

Transfer Learning

[105]:
preprocessing = [dp.DataStandardScaler(verbose=False), dp.EncoderOHE(drop=False, unknown_err=False, verbose=False)]

dec_class = DecoupledClass(
    cohort_col=["gender"],
    theta=True,
    min_fold_size_theta=15,
    min_cohort_pct=0.1,
    minority_min_rate=0.1,
    transform_pipe=preprocessing,
    estimator=get_model()
)
dec_class.fit(X_train, y_train)

dec_class.print_cohorts()

FINAL COHORTS
cohort_0:
        Size: 10771
        Query:
                (`gender` == " Female")
        Value Counts:
                0: 9592 (89.05%)
                1: 1179 (10.95%)
        Invalid: False


cohort_1:
        Size: 21790
        Query:
                (`gender` == " Male")
        Value Counts:
                0: 15128 (69.43%)
                1: 6662 (30.57%)
        Invalid: False


[106]:
th_dict = dec_class.get_threasholds_dict()
pred = dec_class.predict_proba(X_test)
fetch_cohort_results(X_test, y_test, pred, cohort_def=dec_class, fixed_th=th_dict)

[106]:
cohort cht_query roc precision recall f1 accuracy threshold num_pos %_pos cht_size
0 all all 0.924731 0.832133 0.794795 0.810810 0.870892 0.500000 3260 0.200233 16281
1 cohort_0 (`gender` == " Female") 0.939232 0.814410 0.824649 0.819414 0.928795 0.341452 612 0.112894 5421
2 cohort_1 (`gender` == " Male") 0.908182 0.791165 0.811141 0.799352 0.825414 0.365531 3690 0.339779 10860

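In the case of the “gender” column, comparing the last two result tables, we notice that precision, F1 score, and accuracy have improved for both cohorts (accuracy, for example, rises from 0.899 to 0.929 for cohort_0 and from 0.807 to 0.825 for cohort_1). Note that both cohorts are reported as valid and their ROC values are unchanged, which suggests that no outside data was used here and that the gains come from the thresholds selected by DecoupledClass. Either way, DecoupledClass has proved useful in this scenario compared to the more generic CohortManager approach.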

Fairness Optimization

One issue we can identify in the previous result is that the rate of predicted positive labels for cohort_0 (a %_pos of 0.113) is much smaller than the rate for cohort_1 (0.340). In some cases, this is not desirable, as it shows that the fitted model has some bias with regard to the sensitive feature (in this case, the gender column). Let’s try to mitigate this issue by using DecoupledClass with transfer learning plus fairness optimization using the Demographic Parity loss.

[107]:
preprocessing = [dp.DataStandardScaler(verbose=False), dp.EncoderOHE(drop=False, unknown_err=False, verbose=False)]

dec_class = DecoupledClass(
    cohort_col=["gender"],
    theta=True,
    min_fold_size_theta=15,
    min_cohort_pct=0.1,
    minority_min_rate=0.1,
    transform_pipe=preprocessing,
    fairness_loss="dem_parity",
    lambda_coef=0.3,
    max_joint_loss_time=200,
    estimator=get_model()
)
dec_class.fit(X_train, y_train)

dec_class.print_cohorts()

FINAL COHORTS
cohort_0:
        Size: 10771
        Query:
                (`gender` == " Female")
        Value Counts:
                0: 9592 (89.05%)
                1: 1179 (10.95%)
        Invalid: False


cohort_1:
        Size: 21790
        Query:
                (`gender` == " Male")
        Value Counts:
                0: 15128 (69.43%)
                1: 6662 (30.57%)
        Invalid: False


[108]:
th_dict = dec_class.get_threasholds_dict()
pred = dec_class.predict_proba(X_test)
fetch_cohort_results(X_test, y_test, pred, cohort_def=dec_class, fixed_th=th_dict)

[108]:
cohort cht_query roc precision recall f1 accuracy threshold num_pos %_pos cht_size
0 all all 0.924731 0.832133 0.794795 0.810810 0.870892 0.500000 3260 0.200233 16281
1 cohort_0 (`gender` == " Female") 0.939232 0.643159 0.834692 0.643880 0.749124 0.023640 1884 0.347537 5421
2 cohort_1 (`gender` == " Male") 0.908182 0.791165 0.811141 0.799352 0.825414 0.365531 3690 0.339779 10860
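Demographic parity asks for similar predicted-positive rates across cohorts. The sketch below computes the gap between the two “gender” cohorts before and after the fairness optimization, using the `num_pos` and `cht_size` columns of the two DecoupledClass result tables. The helper names are hypothetical, and the gap is defined here simply as the largest rate minus the smallest, which is one common way to summarize it and not necessarily the loss the library optimizes.

```python
def positive_rate(num_pos, cohort_size):
    """Fraction of instances predicted as positive within a cohort."""
    return num_pos / cohort_size


def demographic_parity_gap(rates):
    """Largest minus smallest predicted-positive rate across cohorts."""
    return max(rates.values()) - min(rates.values())


# (num_pos, cht_size) per cohort, copied from the result tables
before = {
    "cohort_0": positive_rate(612, 5421),
    "cohort_1": positive_rate(3690, 10860),
}
after = {
    "cohort_0": positive_rate(1884, 5421),
    "cohort_1": positive_rate(3690, 10860),
}

gap_before = demographic_parity_gap(before)
gap_after = demographic_parity_gap(after)
print(f"Gap without fairness loss: {gap_before:.4f}")
print(f"Gap with dem_parity loss:  {gap_after:.4f}")
```

The gap shrinks from about 0.23 to under 0.01, driven by a much lower threshold for cohort_0 (0.024 instead of 0.341). The table also shows the cost: for cohort_0, precision drops from 0.814 to 0.643 and accuracy from 0.929 to 0.749.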

The “relationship” cohorts

Lastly, we’ll take a closer look at the relationship column, starting with the baseline:

[82]:
cohort_set = CohortManager(
    transform_pipe=[
        dp.DataStandardScaler(verbose=False),
        dp.EncoderOHE(drop=False, unknown_err=False, verbose=False),
        get_model()
    ],
    cohort_col=["relationship"]
)
cohort_set.fit(X=X_train, y=y_train)
subsets = cohort_set.get_subsets(X_train, y_train, apply_transform=False)

print(y_train.value_counts())

for key in subsets.keys():
    print(f"\n{key} : {cohort_set.get_queries()[key]}:\n{subsets[key]['y'].value_counts()}\n***********")

plot_value_counts_cohort(y_train, subsets, normalize=False)

0    24720
1     7841
Name: income, dtype: int64

cohort_0 : (`relationship` == " Husband"):
0    7275
1    5918
Name: income, dtype: int64
***********

cohort_1 : (`relationship` == " Not-in-family"):
0    7449
1     856
Name: income, dtype: int64
***********

cohort_2 : (`relationship` == " Other-relative"):
0    944
1     37
Name: income, dtype: int64
***********

cohort_3 : (`relationship` == " Own-child"):
0    5001
1      67
Name: income, dtype: int64
***********

cohort_4 : (`relationship` == " Unmarried"):
0    3228
1     218
Name: income, dtype: int64
***********

cohort_5 : (`relationship` == " Wife"):
0    823
1    745
Name: income, dtype: int64
***********
../../../../_images/notebooks_cohort_case_study_decoupled_class_case_2_32_1.png

Similar to the “race” column, we can see that some cohorts are smaller and have a more skewed label distribution than others. Can we apply transfer learning in this case?

Transfer Learning

[83]:
preprocessing = [dp.DataStandardScaler(verbose=False), dp.EncoderOHE(drop=False, unknown_err=False, verbose=False)]

dec_class = DecoupledClass(
    cohort_col=["relationship"],
    theta=True,
    min_fold_size_theta=15,
    min_cohort_pct=0.1,
    minority_min_rate=0.1,
    transform_pipe=preprocessing,
    estimator=get_model()
)
try:
    dec_class.fit(X_train, y_train)
except Exception as e:
    print(e)

INVALID COHORTS:
cohort_2:
        Size: 981
        Query:
                (`relationship` == " Other-relative")
        Value Counts:
                0: 944 (96.23%)
                1: 37 (3.77%)
        Invalid: False
cohort_3:
        Size: 5068
        Query:
                (`relationship` == " Own-child")
        Value Counts:
                0: 5001 (98.68%)
                1: 67 (1.32%)
        Invalid: False
cohort_4:
        Size: 3446
        Query:
                (`relationship` == " Unmarried")
        Value Counts:
                0: 3228 (93.67%)
                1: 218 (6.33%)
        Invalid: False
ERROR: Cannot use transfer learning over cohorts with skewed distributions.

Similar to the “race” column, we are unable to directly apply transfer learning over these cohorts, so we will rebalance the training data first:

[84]:
c0_pipe = []
c1_pipe = [dp.Rebalance(strategy_over='minority', verbose=False)]
c2_pipe = [dp.Rebalance(strategy_over='minority', verbose=False)]
c3_pipe = [dp.Rebalance(strategy_over='minority', verbose=False)]
c4_pipe = [dp.Rebalance(strategy_over='minority', verbose=False)]
c5_pipe = []

cohort_set = CohortManager(
    transform_pipe=[c0_pipe, c1_pipe, c2_pipe, c3_pipe, c4_pipe, c5_pipe],
    cohort_col=["relationship"]
)

rebalanced_X_train, rebalanced_y_train = cohort_set.fit_resample(X=X_train, y=y_train)
/home/mmendonca/anaconda3/envs/raipub/lib/python3.9/site-packages/sklearn/preprocessing/_encoders.py:808: FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its default value.
  warnings.warn(

Now let’s run the baseline plots again:

[85]:
cohort_set = CohortManager(
    transform_pipe=[
        dp.DataStandardScaler(verbose=False),
        dp.EncoderOHE(drop=False, unknown_err=False, verbose=False),
        get_model()
    ],
    cohort_col=["relationship"]
)
cohort_set.fit(X=rebalanced_X_train, y=rebalanced_y_train)
subsets = cohort_set.get_subsets(rebalanced_X_train, rebalanced_y_train, apply_transform=False)

print(rebalanced_y_train.value_counts())

for key in subsets.keys():
    print(f"\n{key} : {cohort_set.get_queries()[key]}:\n{subsets[key]['y'].value_counts()}\n***********")

plot_value_counts_cohort(rebalanced_y_train, subsets, normalize=False)

0    24720
1    23285
Name: income, dtype: int64

cohort_0 : (`relationship` == " Husband"):
0    7275
1    5918
Name: income, dtype: int64
***********

cohort_1 : (`relationship` == " Not-in-family"):
0    7449
1    7449
Name: income, dtype: int64
***********

cohort_2 : (`relationship` == " Other-relative"):
0    944
1    944
Name: income, dtype: int64
***********

cohort_3 : (`relationship` == " Own-child"):
0    5001
1    5001
Name: income, dtype: int64
***********

cohort_4 : (`relationship` == " Unmarried"):
0    3228
1    3228
Name: income, dtype: int64
***********

cohort_5 : (`relationship` == " Wife"):
0    823
1    745
Name: income, dtype: int64
***********
../../../../_images/notebooks_cohort_case_study_decoupled_class_case_2_39_1.png
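The counts above show what `strategy_over='minority'` did in the four rebalanced cohorts: the minority class was oversampled until it matched the majority class. The sketch below checks the printed totals under that reading. It works on the counts only and does not call Rebalance.

```python
# Original (label 0, label 1) counts per "relationship" cohort
original = {
    "cohort_0": (7275, 5918),
    "cohort_1": (7449, 856),
    "cohort_2": (944, 37),
    "cohort_3": (5001, 67),
    "cohort_4": (3228, 218),
    "cohort_5": (823, 745),
}
# Cohorts that were given a Rebalance(strategy_over='minority') step
rebalanced_cohorts = {"cohort_1", "cohort_2", "cohort_3", "cohort_4"}


def balance_to_majority(neg, pos):
    """Counts after oversampling the minority class up to the majority size."""
    majority = max(neg, pos)
    return majority, majority


expected = {
    name: balance_to_majority(*counts) if name in rebalanced_cohorts else counts
    for name, counts in original.items()
}
total_neg = sum(neg for neg, _ in expected.values())
total_pos = sum(pos for _, pos in expected.values())
print(total_neg, total_pos)
```

The totals (24720 negatives and 23285 positives) match the overall value counts printed for the rebalanced training set, so the whole dataset is now close to balanced even though cohort_0 and cohort_5 were left untouched.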
[86]:
pred_cht = cohort_set.predict_proba(X_test)

pred_train = cohort_set.predict_proba(rebalanced_X_train)
metrics_train, th_dict = fetch_cohort_results(rebalanced_X_train, rebalanced_y_train, pred_train, cohort_col=["relationship"], return_th_dict=True)
fetch_cohort_results(X_test, y_test, pred_cht, cohort_col=["relationship"], fixed_th=th_dict)
[86]:
cohort cht_query roc precision recall f1 accuracy threshold num_pos %_pos cht_size
0 all all 0.913161 0.799514 0.797840 0.798671 0.855107 0.469123 3815 0.234322 16281
1 cohort_0 (`relationship` == " Husband") 0.847310 0.763712 0.760496 0.761651 0.765445 0.467993 2772 0.424958 6523
2 cohort_1 (`relationship` == " Not-in-family") 0.886510 0.745795 0.738989 0.742325 0.910005 0.535268 407 0.095138 4278
3 cohort_2 (`relationship` == " Other-relative") 0.905359 0.822736 0.631373 0.684159 0.975238 0.947751 6 0.011429 525
4 cohort_3 (`relationship` == " Own-child") 0.907388 0.706046 0.710644 0.708318 0.979706 0.631069 45 0.017907 2513
5 cohort_4 (`relationship` == " Unmarried") 0.882609 0.677055 0.709674 0.691634 0.930911 0.529153 109 0.064920 1679
6 cohort_5 (`relationship` == " Wife") 0.830681 0.734595 0.735099 0.734815 0.736566 0.510850 353 0.462647 763

Lastly, let’s reattempt transfer learning and compare its metrics to the baseline:

[89]:
preprocessing = [dp.DataStandardScaler(verbose=False), dp.EncoderOHE(drop=False, unknown_err=False, verbose=False)]

dec_class = DecoupledClass(
    cohort_col=["relationship"],
    theta=True,
    min_fold_size_theta=15,
    min_cohort_pct=0.1,
    minority_min_rate=0.1,
    transform_pipe=preprocessing,
    estimator=get_model()
)
dec_class.fit(rebalanced_X_train, rebalanced_y_train)

dec_class.print_cohorts()
FINAL COHORTS
cohort_0:
        Size: 13193
        Query:
                (`relationship` == " Husband")
        Value Counts:
                0: 7275 (55.14%)
                1: 5918 (44.86%)
        Invalid: False


cohort_1:
        Size: 14898
        Query:
                (`relationship` == " Not-in-family")
        Value Counts:
                0: 7449 (50.00%)
                1: 7449 (50.00%)
        Invalid: False


cohort_2:
        Size: 1888
        Query:
                (`relationship` == " Other-relative")
        Value Counts:
                0: 944 (50.00%)
                1: 944 (50.00%)
        Invalid: True
                Cohorts used as outside data: ['cohort_0', 'cohort_1', 'cohort_3', 'cohort_4', 'cohort_5']
                Theta = 0.2


cohort_3:
        Size: 10002
        Query:
                (`relationship` == " Own-child")
        Value Counts:
                0: 5001 (50.00%)
                1: 5001 (50.00%)
        Invalid: False


cohort_4:
        Size: 6456
        Query:
                (`relationship` == " Unmarried")
        Value Counts:
                0: 3228 (50.00%)
                1: 3228 (50.00%)
        Invalid: False


cohort_5:
        Size: 1568
        Query:
                (`relationship` == " Wife")
        Value Counts:
                0: 823 (52.49%)
                1: 745 (47.51%)
        Invalid: True
                Cohorts used as outside data: ['cohort_0', 'cohort_1', 'cohort_2', 'cohort_3', 'cohort_4']
                Theta = 0.2


[90]:
th_dict = dec_class.get_threasholds_dict()
pred = dec_class.predict_proba(X_test)
fetch_cohort_results(X_test, y_test, pred, cohort_def=dec_class, fixed_th=th_dict)
[90]:
cohort cht_query roc precision recall f1 accuracy threshold num_pos %_pos cht_size
0 all all 0.914236 0.809129 0.795081 0.801682 0.860082 0.500000 3600 0.221117 16281
1 cohort_0 (`relationship` == " Husband") 0.847310 0.746642 0.748831 0.743895 0.744136 0.367500 3395 0.520466 6523
2 cohort_1 (`relationship` == " Not-in-family") 0.886510 0.710561 0.753516 0.729050 0.894109 0.444907 519 0.121318 4278
3 cohort_2 (`relationship` == " Other-relative") 0.940261 0.746024 0.855882 0.789894 0.971429 0.506925 22 0.041905 525
4 cohort_3 (`relationship` == " Own-child") 0.907388 0.692846 0.710036 0.701050 0.978512 0.607914 48 0.019101 2513
5 cohort_4 (`relationship` == " Unmarried") 0.882609 0.677055 0.709674 0.691634 0.930911 0.529153 109 0.064920 1679
6 cohort_5 (`relationship` == " Wife") 0.865590 0.780855 0.782139 0.781301 0.782438 0.491991 360 0.471822 763

In this case, transfer learning hasn’t made a noticeable difference over the cohorts of the training data, since the rebalancing has already done most of the heavy lifting to even out the distribution. On the test set the gains are small and uneven: the ROC improves only for cohort_2 (0.905 to 0.940) and cohort_5 (0.831 to 0.866), the two cohorts that used outside data, and the overall accuracy moves from 0.855 to 0.860.
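One way to see where transfer learning acted is to compare the ROC columns of the two “relationship” result tables with the cohorts that the summary lists as using outside data. The sketch below does this with the printed values only.

```python
# ROC per "relationship" cohort on the test set, copied from the two result tables
baseline_roc = {
    "cohort_0": 0.847310,
    "cohort_1": 0.886510,
    "cohort_2": 0.905359,
    "cohort_3": 0.907388,
    "cohort_4": 0.882609,
    "cohort_5": 0.830681,
}
transfer_roc = {
    "cohort_0": 0.847310,
    "cohort_1": 0.886510,
    "cohort_2": 0.940261,
    "cohort_3": 0.907388,
    "cohort_4": 0.882609,
    "cohort_5": 0.865590,
}
# Cohorts reported with "Cohorts used as outside data" in the summary
used_outside_data = {"cohort_2", "cohort_5"}

changed = {name for name in baseline_roc if transfer_roc[name] != baseline_roc[name]}
print("Cohorts whose ROC changed:", sorted(changed))
print("Matches cohorts using outside data:", changed == used_outside_data)
```

Only the two cohorts trained with outside data show a different ROC, so any differences in the remaining rows of the table come from the thresholds and not from the fitted models.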