Cohort Case Study 1 - Rebalancing

This notebook is a continuation of the Case 1 notebook. Here, we’ll show a situation where using the CohortManager to rebalance the dataset is a good idea.

First of all, we’ll need to create the artificial dataset once again. We’ll just copy the code used in the previous notebook.

[11]:
import random
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.pipeline import Pipeline
from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression

from raimitigations.utils import split_data
import raimitigations.dataprocessing as dp
from raimitigations.cohort import (
    CohortDefinition,
    CohortManager,
    fetch_cohort_results,
    plot_value_counts_cohort
)

SEED = 51
#SEED = None
np.random.seed(SEED)
random.seed(SEED)

def _create_country_df(samples: int, sectors: dict, country_name: str):
    df = None
    for key in sectors.keys():
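        # each sector contributes samples in proportion to its probability of occurrence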
        size = int(samples * sectors[key]["prob_occur"])
        invest = np.random.uniform(low=sectors[key]["min"], high=sectors[key]["max"], size=size)
        min_invest = min(invest)
        max_invest = max(invest)
        range_invest = max_invest - min_invest
        bankrupt_th = sectors[key]["prob_success"] * range_invest
        inverted_behavior = sectors[key]["inverted_behavior"]
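        # label each instance: investments above the threshold get class 0, the rest
        # class 1; sectors flagged as inverted have their labels flipped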
        bankrupt = []
        for i in range(invest.shape[0]):
            inst_class = 1
            if invest[i] > bankrupt_th:
                inst_class = 0
            if inverted_behavior:
                inst_class = int(not inst_class)
            bankrupt.append(inst_class)
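        # flip 5% of the labels at random to inject label noise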
        noise_ind = np.random.choice(range(size), int(size*0.05), replace=False)
        for ind in noise_ind:
            bankrupt[ind] = int(not bankrupt[ind])
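        # replace 10% of the investment values with NaN to simulate missing data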
        noise_ind = np.random.choice(range(size), int(size*0.1), replace=False)
        for ind in noise_ind:
            invest[ind] = np.nan

        country_col = [country_name for _ in range(size)]
        sector_col = [key for _ in range(size)]
        df_sector = pd.DataFrame({
            "investment":invest,
            "sector":sector_col,
            "country":country_col,
            "bankrupt":bankrupt
        })

        if df is None:
            df = df_sector
        else:
            df = pd.concat([df, df_sector], axis=0)
    return df

def create_df_multiple_distributions(samples: int):
    sectors_c1 = {
        "s1": {"prob_occur":0.5, "prob_success":0.99, "inverted_behavior":False, "min":2e6, "max":1e7},
        "s2": {"prob_occur":0.1, "prob_success":0.2, "inverted_behavior":False, "min":1e7, "max":1.5e9},
        "s3": {"prob_occur":0.1, "prob_success":0.9, "inverted_behavior":True, "min":1e9, "max":1e10},
        "s4": {"prob_occur":0.3, "prob_success":0.7, "inverted_behavior":False, "min":4e10, "max":9e13},
    }
    sectors_c2 = {
        "s1": {"prob_occur":0.1, "prob_success":0.6, "inverted_behavior":True, "min":1e3, "max":5e3},
        "s2": {"prob_occur":0.3, "prob_success":0.9, "inverted_behavior":False, "min":1e5, "max":1.5e6},
        "s3": {"prob_occur":0.5, "prob_success":0.3, "inverted_behavior":False, "min":5e4, "max":3e5},
        "s4": {"prob_occur":0.1, "prob_success":0.8, "inverted_behavior":False, "min":1e6, "max":1e7},
    }
    sectors_c3 = {
        "s1": {"prob_occur":0.3, "prob_success":0.9, "inverted_behavior":False, "min":3e2, "max":6e2},
        "s2": {"prob_occur":0.6, "prob_success":0.7, "inverted_behavior":False, "min":5e3, "max":9e3},
        "s3": {"prob_occur":0.07, "prob_success":0.6, "inverted_behavior":False, "min":4e3, "max":2e4},
        "s4": {"prob_occur":0.03, "prob_success":0.1, "inverted_behavior":True, "min":6e5, "max":1.3e6},
    }
    countries = {
        "A":{"sectors":sectors_c1, "sample_rate":0.85},
        "B":{"sectors":sectors_c2, "sample_rate":0.05},
        "C":{"sectors":sectors_c2, "sample_rate":0.1}
    }
    df = None
    for key in countries.keys():
        n_sample = int(samples * countries[key]["sample_rate"])
        df_c = _create_country_df(n_sample, countries[key]["sectors"], key)
        if df is None:
            df = df_c
        else:
            df = pd.concat([df, df_c], axis=0)

    df = df.reset_index(drop=True)
    return df

Let’s now create our artificial dataset:

[12]:
df = create_df_multiple_distributions(10000)
df
[12]:
investment sector country bankrupt
0 7.405851e+06 s1 A 1
1 2.357697e+06 s1 A 1
2 4.746429e+06 s1 A 1
3 7.152158e+06 s1 A 1
4 NaN s1 A 1
... ... ... ... ...
9995 4.226512e+06 s4 C 1
9996 3.566758e+06 s4 C 0
9997 9.281006e+06 s4 C 0
9998 5.770378e+06 s4 C 1
9999 3.661511e+06 s4 C 1

10000 rows × 4 columns

We’ll now split our dataset into train and test sets:

[13]:
X_train, X_test, y_train, y_test = split_data(df, label="bankrupt", test_size=0.3)
[14]:
def get_model():
    #model = LGBMClassifier(random_state=SEED)
    model = LogisticRegression(random_state=SEED)
    return model

Rebalancing

In the following experiments, we’ll test different ways the Rebalance class can be used on this dataset.

Using the plot_value_counts_cohort function with the subsets returned by the CohortManager class, let’s take a look at the label distributions of the cohorts based on the sector and country columns:

[16]:
cohort_set = CohortManager(
    cohort_col=["sector", "country"]
)
cohort_set.fit(X=X_train, y=y_train)
subsets = cohort_set.get_subsets(X_train, y_train, apply_transform=False)

plot_value_counts_cohort(y_train, subsets, normalize=False)
../../../_images/notebooks_cohort_case_study_case_1_rebalance_9_0.png

As we can see, there is only a slight class imbalance in the entire dataset, but when we look at each cohort individually, this imbalance varies greatly. For some cohorts, the direction of the imbalance is even inverted relative to the full dataset.
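The plot gives this information visually; if we want the same numbers directly, a minimal sketch like the one below works. It assumes each entry of the subsets dictionary exposes the cohort’s labels under a "y" key, mirroring the "X" key used later in this notebook:

# Hedged sketch: per-cohort label frequencies computed from the subsets
# returned by get_subsets (assumes each subset stores its labels under "y").
for cht_name, subset in subsets.items():
    freqs = subset["y"].value_counts(normalize=True).round(3)
    print(f"{cht_name}: {freqs.to_dict()}")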

Rebalance the entire dataset

For our first experiment, let’s try rebalancing the entire dataset and see how this affects the results. Note that we don’t expect great changes, since the imbalance in the entire dataset is not very pronounced.

We’ll use the Rebalance class to rebalance the full dataset. We then plot the label distributions for all cohorts once again to see how the rebalance process affected the distributions inside each cohort.

[17]:
rebalance = dp.Rebalance(verbose=False)
new_X_train, new_y_train = rebalance.fit_resample(X_train, y_train)

cohort_set = CohortManager(
    cohort_col=["sector", "country"]
)
cohort_set.fit(X=X_train, y=y_train)
subsets = cohort_set.get_subsets(new_X_train, new_y_train, apply_transform=False)

plot_value_counts_cohort(new_y_train, subsets, normalize=False)
../../../_images/notebooks_cohort_case_study_case_1_rebalance_11_0.png

As we can see, after rebalancing the full dataset there is no longer a class imbalance at the full-dataset level. However, inside the cohorts whose class imbalance is inverted (compared to the full dataset), the imbalance has only become more pronounced.
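One way to avoid this side effect is to rebalance each cohort separately, so that SMOTE only interpolates between samples of the same cohort. The snippet below is a minimal sketch of that idea, assuming a flat transform_pipe is replicated to every cohort (the same pattern used for the estimator pipelines later in this notebook); we explore this approach in detail further below:

# Hedged sketch: apply Rebalance inside each cohort instead of globally.
rebalance_per_cohort = CohortManager(
    transform_pipe=[dp.Rebalance(verbose=False)],
    cohort_col=["sector", "country"]
)
X_reb, y_reb = rebalance_per_cohort.fit_resample(X_train, y_train)

For now, though, let’s continue with the globally rebalanced dataset.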

Let’s now check how this new rebalanced dataset impacts the training process when using the baseline pipeline (Case 1 - Baseline 3), where we train a single pipeline for the entire dataset.

[18]:
#EXPERIMENT: Rebalance-Base 1

pipe = Pipeline([
            ("imputer", dp.BasicImputer(verbose=False)),
            ("scaler", dp.DataMinMaxScaler(verbose=False)),
            ("encoder", dp.EncoderOHE(verbose=False)),
            ("estimator", get_model()),
        ])
pipe.fit(new_X_train, new_y_train)
pred_org = pipe.predict_proba(X_test)

pred_train_org = pipe.predict_proba(new_X_train)
_, th_dict = fetch_cohort_results(new_X_train, new_y_train, pred_train_org, cohort_col=["sector", "country"], return_th_dict=True)
fetch_cohort_results(X_test, y_test, pred_org, cohort_col=["sector", "country"], fixed_th=th_dict)
/home/mmendonca/ResponsibleAI/code/git/responsible-ai-mitigations/raimitigations/utils/metric_utils.py:189: RuntimeWarning: invalid value encountered in true_divide
  fscore = (2 * precision * recall) / (precision + recall)
[18]:
cohort cht_query roc precision recall f1 accuracy threshold cht_size
0 all all 0.803268 0.788877 0.790533 0.769316 0.769333 0.639927 3000
1 cohort_0 (`sector` == "s1") and (`country` == "A") 0.852323 0.860370 0.896209 0.873737 0.888436 0.639927 1228
2 cohort_1 (`sector` == "s1") and (`country` == "B") 0.238095 0.208333 0.416667 0.277778 0.384615 0.649560 13
3 cohort_2 (`sector` == "s1") and (`country` == "C") 0.161905 0.232143 0.464286 0.309524 0.448276 0.715199 29
4 cohort_3 (`sector` == "s2") and (`country` == "A") 0.803265 0.930028 0.852337 0.882836 0.920962 0.332854 291
5 cohort_4 (`sector` == "s2") and (`country` == "B") 0.813390 0.803529 0.867521 0.828205 0.880597 0.342261 67
6 cohort_5 (`sector` == "s2") and (`country` == "C") 0.775021 0.701744 0.787014 0.730828 0.858491 0.413488 106
7 cohort_6 (`sector` == "s3") and (`country` == "A") 0.188799 0.575122 0.529032 0.285017 0.304878 0.125266 246
8 cohort_7 (`sector` == "s3") and (`country` == "B") 0.710065 0.913043 0.684211 0.721612 0.842105 0.129994 76
9 cohort_8 (`sector` == "s3") and (`country` == "C") 0.766817 0.848348 0.785480 0.811509 0.899225 0.168353 129
10 cohort_9 (`sector` == "s4") and (`country` == "A") 0.870353 0.935381 0.894539 0.911125 0.925587 0.432214 766
11 cohort_10 (`sector` == "s4") and (`country` == "B") 0.944444 0.875000 0.933333 0.892857 0.904762 0.914203 21
12 cohort_11 (`sector` == "s4") and (`country` == "C") 0.794444 0.804813 0.816667 0.809524 0.821429 0.935217 28

From the previous results, we can see that the metrics haven’t changed much compared to Case 1 - Baseline 3. This is true for the full dataset, as well as for each cohort individually. Therefore, this rebalancing process had little impact on the baseline pipeline.

In the following cell, we use the rebalanced dataset in the cohort-based pipeline (Case 1 - Cohort 5).

[19]:
#EXPERIMENT: Rebalance-Cohort 1

cht_manager = CohortManager(
    transform_pipe=[
        dp.BasicImputer(verbose=False),
        dp.DataMinMaxScaler(verbose=False),
        dp.EncoderOHE(verbose=False),
        get_model()
    ],
    cohort_col=["sector", "country"]
)
cht_manager.fit(new_X_train, new_y_train)
pred_cht = cht_manager.predict_proba(X_test)

pred_train = cht_manager.predict_proba(new_X_train)
_, th_dict = fetch_cohort_results(new_X_train, new_y_train, pred_train, cohort_col=["sector", "country"], return_th_dict=True)
fetch_cohort_results(X_test, y_test, pred_cht, cohort_col=["sector", "country"], fixed_th=th_dict)
/home/mmendonca/ResponsibleAI/code/git/responsible-ai-mitigations/raimitigations/utils/metric_utils.py:189: RuntimeWarning: invalid value encountered in true_divide
  fscore = (2 * precision * recall) / (precision + recall)
/home/mmendonca/anaconda3/envs/raipub/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1318: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
[19]:
cohort cht_query roc precision recall f1 accuracy threshold cht_size
0 all all 0.827125 0.814789 0.820160 0.801492 0.801667 0.637556 3000
1 cohort_0 (`sector` == "s1") and (`country` == "A") 0.852323 0.860370 0.896209 0.873737 0.888436 0.637556 1228
2 cohort_1 (`sector` == "s1") and (`country` == "B") 0.761905 0.888889 0.833333 0.837500 0.846154 0.635485 13
3 cohort_2 (`sector` == "s1") and (`country` == "C") 0.838095 0.899038 0.895238 0.896057 0.896552 0.466120 29
4 cohort_3 (`sector` == "s2") and (`country` == "A") 0.196735 0.378007 0.500000 0.430528 0.756014 0.193607 291
5 cohort_4 (`sector` == "s2") and (`country` == "B") 0.186610 0.097015 0.500000 0.162500 0.194030 0.781296 67
6 cohort_5 (`sector` == "s2") and (`country` == "C") 0.224979 0.061321 0.500000 0.109244 0.122642 0.832238 106
7 cohort_6 (`sector` == "s3") and (`country` == "A") 0.188799 0.575122 0.529032 0.285017 0.304878 0.170009 246
8 cohort_7 (`sector` == "s3") and (`country` == "B") 0.710065 0.913043 0.684211 0.721612 0.842105 0.061715 76
9 cohort_8 (`sector` == "s3") and (`country` == "C") 0.766817 0.848348 0.785480 0.811509 0.899225 0.122870 129
10 cohort_9 (`sector` == "s4") and (`country` == "A") 0.907025 0.935381 0.894539 0.911125 0.925587 0.421985 766
11 cohort_10 (`sector` == "s4") and (`country` == "B") 0.944444 0.875000 0.933333 0.892857 0.904762 0.570675 21
12 cohort_11 (`sector` == "s4") and (`country` == "C") 0.794444 0.804813 0.816667 0.809524 0.821429 0.531133 28

Comparing these results with the ones obtained by the cohort-based pipeline over the original dataset (the one using cohorts based on the sector and country columns - Case 1 - Cohort 5), we can see that the rebalanced dataset performed considerably worse. With the original dataset, we managed to get good metrics for the entire dataset and for each individual cohort. With the rebalanced dataset, the metrics for some cohorts became severely worse (for example, cohort_4). One possible explanation is that, since each cohort behaves very differently from the others, the new instances created by the SMOTE method (used internally by the Rebalance class) don’t fit the behavior observed within each cohort. This is especially true for smaller cohorts, which have fewer data points for SMOTE’s interpolation to draw on. A single pipeline trained over the full dataset is barely affected by this, since these small cohorts already carry little weight there. But when we train a separate pipeline for each cohort and insert noisy data points into a small cohort, that noise has a great impact on the resulting estimator, as the previous results show.
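To make this intuition concrete, here is a toy illustration with made-up numbers (not the library’s internals): SMOTE synthesizes a point on the segment between two minority-class neighbors, so if those neighbors come from cohorts whose investment scales differ by orders of magnitude, the synthetic point lands far outside the smaller cohort’s range:

# Toy sketch of SMOTE-style interpolation across cohorts with very different
# scales. The two values below are hypothetical, chosen only for illustration.
a = 2.0e5    # minority sample from a small-scale cohort
b = 5.0e12   # minority neighbor from a large-scale cohort
gap = np.random.uniform(0, 1)
synthetic = a + gap * (b - a)
print(f"{synthetic:e}")  # typically on the order of trillions: unlike anything in the small cohort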

We can check this by printing the statistics of the investment column for all cohorts, both in the original and in the rebalanced dataset. Notice how the value range changes drastically from the original to the rebalanced dataset for several cohorts. The mean and standard deviation also change considerably.

[20]:
cohort_set = CohortManager(
    cohort_col=["sector", "country"]
)
cohort_set.fit(X=X_train, y=y_train)
subsets_original = cohort_set.get_subsets(X_train, y_train, apply_transform=False)
subsets_rebalance = cohort_set.get_subsets(new_X_train, new_y_train, apply_transform=False)

for cht_name in subsets_original.keys():
    min_value = subsets_original[cht_name]["X"]["investment"].min()
    max_value = subsets_original[cht_name]["X"]["investment"].max()
    mean = subsets_original[cht_name]["X"]["investment"].mean()
    std = subsets_original[cht_name]["X"]["investment"].std()
    print(f"{cht_name}:")
    print(f"\tORIGINAL: (min, max, mean, std) = ({min_value:e}, {max_value:e}, {mean:e}, {std:e})")
    min_value = subsets_rebalance[cht_name]["X"]["investment"].min()
    max_value = subsets_rebalance[cht_name]["X"]["investment"].max()
    mean = subsets_rebalance[cht_name]["X"]["investment"].mean()
    std = subsets_rebalance[cht_name]["X"]["investment"].std()
    print(f"\tREBALANCE: (min, max, mean, std) = ({min_value:e}, {max_value:e}, {mean:e}, {std:e})")
cohort_0:
        ORIGINAL: (min, max, mean, std) = (2.005068e+06, 9.998027e+06, 5.976259e+06, 2.339736e+06)
        REBALANCE: (min, max, mean, std) = (2.005068e+06, 1.098854e+13, 1.040341e+12, 3.217533e+12)
cohort_1:
        ORIGINAL: (min, max, mean, std) = (1.182911e+03, 4.987448e+03, 3.215314e+03, 1.206982e+03)
        REBALANCE: (min, max, mean, std) = (1.182911e+03, 1.098854e+13, 1.127030e+12, 3.377380e+12)
cohort_2:
        ORIGINAL: (min, max, mean, std) = (1.149726e+03, 4.982145e+03, 2.959479e+03, 1.251280e+03)
        REBALANCE: (min, max, mean, std) = (1.149726e+03, 1.098854e+13, 1.046528e+12, 3.244988e+12)
cohort_3:
        ORIGINAL: (min, max, mean, std) = (1.516971e+07, 1.499138e+09, 7.909766e+08, 4.417939e+08)
        REBALANCE: (min, max, mean, std) = (1.516971e+07, 1.098854e+13, 1.252196e+12, 3.492966e+12)
cohort_4:
        ORIGINAL: (min, max, mean, std) = (1.068535e+05, 1.493043e+06, 8.164178e+05, 4.428167e+05)
        REBALANCE: (min, max, mean, std) = (1.068535e+05, 1.098854e+13, 9.555261e+11, 3.113221e+12)
cohort_5:
        ORIGINAL: (min, max, mean, std) = (1.037734e+05, 1.498685e+06, 8.011541e+05, 3.926894e+05)
        REBALANCE: (min, max, mean, std) = (1.037734e+05, 1.098854e+13, 1.088488e+12, 3.290464e+12)
cohort_6:
        ORIGINAL: (min, max, mean, std) = (1.007322e+09, 9.996360e+09, 5.407672e+09, 2.557161e+09)
        REBALANCE: (min, max, mean, std) = (1.007322e+09, 1.098854e+13, 1.239427e+12, 3.471019e+12)
cohort_7:
        ORIGINAL: (min, max, mean, std) = (5.059673e+04, 2.994070e+05, 1.814661e+05, 7.001416e+04)
        REBALANCE: (min, max, mean, std) = (5.059673e+04, 1.098854e+13, 9.609752e+11, 3.110148e+12)
cohort_8:
        ORIGINAL: (min, max, mean, std) = (5.057161e+04, 2.981749e+05, 1.635852e+05, 7.342614e+04)
        REBALANCE: (min, max, mean, std) = (5.057161e+04, 1.098854e+13, 1.163756e+12, 3.384817e+12)
cohort_9:
        ORIGINAL: (min, max, mean, std) = (5.621367e+10, 8.999349e+13, 4.307630e+13, 2.579574e+13)
        REBALANCE: (min, max, mean, std) = (5.621367e+10, 8.999349e+13, 4.251200e+13, 2.731107e+13)
cohort_10:
        ORIGINAL: (min, max, mean, std) = (1.073877e+06, 9.817396e+06, 5.365237e+06, 2.469238e+06)
        REBALANCE: (min, max, mean, std) = (1.073877e+06, 1.098854e+13, 9.747005e+11, 3.089928e+12)
cohort_11:
        ORIGINAL: (min, max, mean, std) = (1.209695e+06, 9.919140e+06, 5.624768e+06, 2.545331e+06)
        REBALANCE: (min, max, mean, std) = (1.209695e+06, 1.098854e+13, 1.024707e+12, 3.191871e+12)

Rebalance only a specific cohort

Suppose now that we are interested in improving the performance over a single cohort, not the entire dataset. For example, let’s say we are satisfied with the results obtained for all cohorts except cohort_7 in the experiment Case 1 - Cohort 5. In other words, we want to improve the metrics for this cohort alone, while leaving the other cohorts as they are. To this end, we’ll use the Rebalance class once again, but this time inside the CohortManager class, which allows us to rebalance only a specific set of cohorts. We’ll pass an empty list of transformations for every cohort except cohort_7, which is the 8th cohort in the list (a sketch below shows how to build this list programmatically).
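Rather than counting the slots by hand, we can build this list programmatically. The sketch below is hypothetical (not from the original notebook) and assumes the key order of the subsets returned by get_subsets matches the cohort order that CohortManager expects in transform_pipe:

# Hedged sketch: place a Rebalance step only in cohort_7's slot.
target = list(subsets_original.keys()).index("cohort_7")
transform_pipe = [[] for _ in subsets_original]
transform_pipe[target] = [dp.Rebalance(verbose=False)]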

[21]:

rebalance_cohort = CohortManager(
    transform_pipe=[[], [], [], [], [], [], [], [dp.Rebalance(verbose=False)], [], [], [], []],
    cohort_col=["sector", "country"]
)
new_X_train, new_y_train = rebalance_cohort.fit_resample(X_train, y_train)

subsets = rebalance_cohort.get_subsets(new_X_train, new_y_train, apply_transform=False)
plot_value_counts_cohort(new_y_train, subsets, normalize=False)
../../../_images/notebooks_cohort_case_study_case_1_rebalance_19_0.png

Let’s check whether the new instances created for cohort cohort_7 are very different from its original instances:

[22]:
cohort_set = CohortManager(
    cohort_col=["sector", "country"]
)
cohort_set.fit(X=X_train, y=y_train)
subsets_original = cohort_set.get_subsets(X_train, y_train, apply_transform=False)
subsets_rebalance = cohort_set.get_subsets(new_X_train, new_y_train, apply_transform=False)

cht_name = "cohort_7"
min_value = subsets_original[cht_name]["X"]["investment"].min()
max_value = subsets_original[cht_name]["X"]["investment"].max()
mean = subsets_original[cht_name]["X"]["investment"].mean()
std = subsets_original[cht_name]["X"]["investment"].std()
print(f"{cht_name}:")
print(f"\tORIGINAL: (min, max, mean, std) = ({min_value:e}, {max_value:e}, {mean:e}, {std:e})")
min_value = subsets_rebalance[cht_name]["X"]["investment"].min()
max_value = subsets_rebalance[cht_name]["X"]["investment"].max()
mean = subsets_rebalance[cht_name]["X"]["investment"].mean()
std = subsets_rebalance[cht_name]["X"]["investment"].std()
print(f"\tREBALANCE: (min, max, mean, std) = ({min_value:e}, {max_value:e}, {mean:e}, {std:e})")
cohort_7:
        ORIGINAL: (min, max, mean, std) = (5.059673e+04, 2.994070e+05, 1.814661e+05, 7.001416e+04)
        REBALANCE: (min, max, mean, std) = (5.059673e+04, 2.994070e+05, 1.383858e+05, 7.739440e+04)

As we can see, the investment column’s mean and standard deviation changed only slightly, and its minimum and maximum are unchanged. This shows that, in some cases, rebalancing certain cohorts separately is less harmful than rebalancing the entire dataset.

With our new rebalanced dataset, let’s repeat the previous experiments:

  1. train the baseline pipeline over the rebalanced dataset (Rebalance-Base 1);

  2. train the cohort-based pipeline over the rebalanced dataset (Rebalance-Cohort 1).

In the following cell, we perform the first of these two experiments:

[23]:
#EXPERIMENT: Rebalance-Base 2

pipe = Pipeline([
            ("imputer", dp.BasicImputer(verbose=False)),
            ("scaler", dp.DataMinMaxScaler(verbose=False)),
            ("encoder", dp.EncoderOHE(verbose=False)),
            ("estimator", get_model()),
        ])
pipe.fit(new_X_train, new_y_train)
pred_org = pipe.predict_proba(X_test)

pred_train_org = pipe.predict_proba(new_X_train)
_, th_dict = fetch_cohort_results(new_X_train, new_y_train, pred_train_org, cohort_col=["sector", "country"], return_th_dict=True)
fetch_cohort_results(X_test, y_test, pred_org, cohort_col=["sector", "country"], fixed_th=th_dict)
/home/mmendonca/ResponsibleAI/code/git/responsible-ai-mitigations/raimitigations/utils/metric_utils.py:189: RuntimeWarning: invalid value encountered in true_divide
  fscore = (2 * precision * recall) / (precision + recall)
[23]:
cohort cht_query roc precision recall f1 accuracy threshold cht_size
0 all all 0.804837 0.788418 0.790132 0.768981 0.769000 0.712906 3000
1 cohort_0 (`sector` == "s1") and (`country` == "A") 0.852323 0.860370 0.896209 0.873737 0.888436 0.712906 1228
2 cohort_1 (`sector` == "s1") and (`country` == "B") 0.238095 0.208333 0.416667 0.277778 0.384615 0.893269 13
3 cohort_2 (`sector` == "s1") and (`country` == "C") 0.161905 0.232143 0.464286 0.309524 0.448276 0.782414 29
4 cohort_3 (`sector` == "s2") and (`country` == "A") 0.803265 0.867622 0.834155 0.848937 0.893471 0.377946 291
5 cohort_4 (`sector` == "s2") and (`country` == "B") 0.813390 0.803529 0.867521 0.828205 0.880597 0.671898 67
6 cohort_5 (`sector` == "s2") and (`country` == "C") 0.775021 0.701744 0.787014 0.730828 0.858491 0.468042 106
7 cohort_6 (`sector` == "s3") and (`country` == "A") 0.188799 0.575122 0.529032 0.285017 0.304878 0.196127 246
8 cohort_7 (`sector` == "s3") and (`country` == "B") 0.710065 0.913043 0.684211 0.721612 0.842105 0.451353 76
9 cohort_8 (`sector` == "s3") and (`country` == "C") 0.766817 0.889474 0.773175 0.814833 0.906977 0.261151 129
10 cohort_9 (`sector` == "s4") and (`country` == "A") 0.869792 0.935381 0.894539 0.911125 0.925587 0.503452 766
11 cohort_10 (`sector` == "s4") and (`country` == "B") 0.944444 0.875000 0.933333 0.892857 0.904762 0.981152 21
12 cohort_11 (`sector` == "s4") and (`country` == "C") 0.794444 0.804813 0.816667 0.809524 0.821429 0.957202 28

As we can see, the results obtained are very similar to those of the baseline pipeline over the original dataset (Case 1 - Baseline 3). This was expected: cohort_7 is a small cohort, and since we are training a single pipeline over the entire dataset, the newly added instances have little impact on the results of the other cohorts.

Let’s now check how this rebalancing process impacts the cohort-based pipeline:

[24]:
#EXPERIMENT: Rebalance-Cohort 2

cht_manager = CohortManager(
    transform_pipe=[
        dp.BasicImputer(verbose=False),
        dp.DataMinMaxScaler(verbose=False),
        dp.EncoderOHE(verbose=False),
        get_model()
    ],
    cohort_col=["sector", "country"]
)
cht_manager.fit(new_X_train, new_y_train)
pred_cht = cht_manager.predict_proba(X_test)

pred_train = cht_manager.predict_proba(new_X_train)
_, th_dict = fetch_cohort_results(new_X_train, new_y_train, pred_train, cohort_col=["sector", "country"], return_th_dict=True)
fetch_cohort_results(X_test, y_test, pred_cht, cohort_col=["sector", "country"], fixed_th=th_dict)
[24]:
cohort cht_query roc precision recall f1 accuracy threshold cht_size
0 all all 0.923553 0.908323 0.898396 0.902471 0.906333 0.475074 3000
1 cohort_0 (`sector` == "s1") and (`country` == "A") 0.911655 0.941368 0.894939 0.914116 0.931596 0.445645 1228
2 cohort_1 (`sector` == "s1") and (`country` == "B") 0.904762 0.888889 0.833333 0.837500 0.846154 0.616965 13
3 cohort_2 (`sector` == "s1") and (`country` == "C") 0.914286 0.899038 0.895238 0.896057 0.896552 0.471948 29
4 cohort_3 (`sector` == "s2") and (`country` == "A") 0.861716 0.867622 0.834155 0.848937 0.893471 0.488684 291
5 cohort_4 (`sector` == "s2") and (`country` == "B") 0.814815 0.914912 0.836895 0.868782 0.925373 0.521201 67
6 cohort_5 (`sector` == "s2") and (`country` == "C") 0.874276 0.929167 0.840778 0.878077 0.952830 0.662477 106
7 cohort_6 (`sector` == "s3") and (`country` == "A") 0.875000 0.943813 0.877957 0.905093 0.934959 0.477076 246
8 cohort_7 (`sector` == "s3") and (`country` == "B") 0.787627 0.913043 0.684211 0.721612 0.842105 0.771149 76
9 cohort_8 (`sector` == "s3") and (`country` == "C") 0.780353 0.889474 0.773175 0.814833 0.906977 0.388728 129
10 cohort_9 (`sector` == "s4") and (`country` == "A") 0.907772 0.935381 0.894539 0.911125 0.925587 0.480231 766
11 cohort_10 (`sector` == "s4") and (`country` == "B") 0.900000 0.837500 0.800000 0.815249 0.857143 0.559088 21
12 cohort_11 (`sector` == "s4") and (`country` == "C") 0.833333 0.862500 0.822222 0.836257 0.857143 0.560495 28

Notice that we managed to get a slight improvement in the ROC AUC metric for cohort_7 (compared to Case 1 - Cohort 5), while the other metrics were left unchanged. It is also interesting to note that the threshold used for cohort_7 changed considerably compared to the cohort-based pipeline trained over the original dataset (Case 1 - Cohort 5).
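If we want to inspect those thresholds directly, the th_dict returned by fetch_cohort_results can be printed per cohort. A minimal sketch, assuming th_dict maps cohort names to the thresholds selected over the training set (consistent with how it is passed as fixed_th above):

# Hedged sketch: print the per-cohort decision thresholds.
for cht, th in th_dict.items():
    print(f"{cht}: threshold = {th:.3f}")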