Cohort Case Study 1 - Rebalancing

This notebook is a continuation of the Case 1 notebook. Here, we’ll show a situation where using the CohortManager to rebalance the dataset is a good idea.

First of all, we’ll need to create the artificial dataset once again. We’ll just copy the code used in the previous notebook.

[11]:
import random
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.pipeline import Pipeline
from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression

from raimitigations.utils import split_data
import raimitigations.dataprocessing as dp
from raimitigations.cohort import (
    CohortDefinition,
    CohortManager,
    fetch_cohort_results,
    plot_value_counts_cohort
)

SEED = 51
#SEED = None
np.random.seed(SEED)
random.seed(SEED)

def _create_country_df(samples: int, sectors: dict, country_name: str):
    df = None
    for key in sectors.keys():
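        # each sector contributes samples in proportion to its probability of occurrence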
        size = int(samples * sectors[key]["prob_occur"])
        invest = np.random.uniform(low=sectors[key]["min"], high=sectors[key]["max"], size=size)
        min_invest = min(invest)
        max_invest = max(invest)
        range_invest = max_invest - min_invest
        bankrupt_th = sectors[key]["prob_success"] * range_invest
        inverted_behavior = sectors[key]["inverted_behavior"]
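        # label each instance: investments above the threshold get class 0, the rest
        # class 1; sectors flagged as inverted have their labels flipped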
        bankrupt = []
        for i in range(invest.shape[0]):
            inst_class = 1
            if invest[i] > bankrupt_th:
                inst_class = 0
            if inverted_behavior:
                inst_class = int(not inst_class)
            bankrupt.append(inst_class)
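        # flip 5% of the labels at random to inject label noise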
        noise_ind = np.random.choice(range(size), int(size*0.05), replace=False)
        for ind in noise_ind:
            bankrupt[ind] = int(not bankrupt[ind])
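        # replace 10% of the investment values with NaN to simulate missing data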
        noise_ind = np.random.choice(range(size), int(size*0.1), replace=False)
        for ind in noise_ind:
            invest[ind] = np.nan

        country_col = [country_name for _ in range(size)]
        sector_col = [key for _ in range(size)]
        df_sector = pd.DataFrame({
            "investment":invest,
            "sector":sector_col,
            "country":country_col,
            "bankrupt":bankrupt
        })

        if df is None:
            df = df_sector
        else:
            df = pd.concat([df, df_sector], axis=0)
    return df

def create_df_multiple_distributions(samples: int):
    sectors_c1 = {
        "s1": {"prob_occur":0.5, "prob_success":0.99, "inverted_behavior":False, "min":2e6, "max":1e7},
        "s2": {"prob_occur":0.1, "prob_success":0.2, "inverted_behavior":False, "min":1e7, "max":1.5e9},
        "s3": {"prob_occur":0.1, "prob_success":0.9, "inverted_behavior":True, "min":1e9, "max":1e10},
        "s4": {"prob_occur":0.3, "prob_success":0.7, "inverted_behavior":False, "min":4e10, "max":9e13},
    }
    sectors_c2 = {
        "s1": {"prob_occur":0.1, "prob_success":0.6, "inverted_behavior":True, "min":1e3, "max":5e3},
        "s2": {"prob_occur":0.3, "prob_success":0.9, "inverted_behavior":False, "min":1e5, "max":1.5e6},
        "s3": {"prob_occur":0.5, "prob_success":0.3, "inverted_behavior":False, "min":5e4, "max":3e5},
        "s4": {"prob_occur":0.1, "prob_success":0.8, "inverted_behavior":False, "min":1e6, "max":1e7},
    }
    sectors_c3 = {
        "s1": {"prob_occur":0.3, "prob_success":0.9, "inverted_behavior":False, "min":3e2, "max":6e2},
        "s2": {"prob_occur":0.6, "prob_success":0.7, "inverted_behavior":False, "min":5e3, "max":9e3},
        "s3": {"prob_occur":0.07, "prob_success":0.6, "inverted_behavior":False, "min":4e3, "max":2e4},
        "s4": {"prob_occur":0.03, "prob_success":0.1, "inverted_behavior":True, "min":6e5, "max":1.3e6},
    }
    countries = {
        "A":{"sectors":sectors_c1, "sample_rate":0.85},
        "B":{"sectors":sectors_c2, "sample_rate":0.05},
        "C":{"sectors":sectors_c2, "sample_rate":0.1}
    }
    df = None
    for key in countries.keys():
        n_sample = int(samples * countries[key]["sample_rate"])
        df_c = _create_country_df(n_sample, countries[key]["sectors"], key)
        if df is None:
            df = df_c
        else:
            df = pd.concat([df, df_c], axis=0)

    df = df.reset_index(drop=True)
    return df

Let’s now create our artificial dataset:

[12]:
df = create_df_multiple_distributions(10000)
df
[12]:
investment sector country bankrupt
0 7.405851e+06 s1 A 1
1 2.357697e+06 s1 A 1
2 4.746429e+06 s1 A 1
3 7.152158e+06 s1 A 1
4 NaN s1 A 1
... ... ... ... ...
9995 4.226512e+06 s4 C 1
9996 3.566758e+06 s4 C 0
9997 9.281006e+06 s4 C 0
9998 5.770378e+06 s4 C 1
9999 3.661511e+06 s4 C 1

10000 rows × 4 columns

We’ll now split our dataset into train and test sets:

[13]:
X_train, X_test, y_train, y_test = split_data(df, label="bankrupt", test_size=0.3)
[14]:
def get_model():
    #model = LGBMClassifier(random_state=SEED)
    model = LogisticRegression(random_state=SEED)
    return model

Rebalancing

In the following experiments, we’ll test different ways the Rebalance class can be used on this dataset.

Using the plot_value_counts_cohort function with the subsets returned by the CohortManager class, let’s take a look at the label distributions of the cohorts based on the sector and country columns:

[16]:
cohort_set = CohortManager(
    cohort_col=["sector", "country"]
)
cohort_set.fit(X=X_train, y=y_train)
subsets = cohort_set.get_subsets(X_train, y_train, apply_transform=False)

plot_value_counts_cohort(y_train, subsets, normalize=False)
../../../_images/notebooks_cohort_case_study_case_1_rebalance_9_0.png

As we can see, there is only a slight class imbalance in the entire dataset, but when we look at each cohort individually, this imbalance varies greatly. For some cohorts, the direction of the imbalance is even inverted relative to the full dataset.
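The plot gives this information visually; if we want the same numbers directly, a minimal sketch like the one below works. It assumes each entry of the subsets dictionary exposes the cohort’s labels under a "y" key, mirroring the "X" key used later in this notebook:

# Hedged sketch: per-cohort label frequencies computed from the subsets
# returned by get_subsets (assumes each subset stores its labels under "y").
for cht_name, subset in subsets.items():
    freqs = subset["y"].value_counts(normalize=True).round(3)
    print(f"{cht_name}: {freqs.to_dict()}")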

Rebalance the entire dataset

For our first experiment, let’s try rebalancing the entire dataset and see how this affects the results. Note that we don’t expect great changes, since the imbalance in the entire dataset is not very pronounced.

We’ll use the Rebalance class to rebalance the full dataset. We then plot the label distributions for all cohorts once again to see how the rebalance process affected the distributions inside each cohort.

[17]:
rebalance = dp.Rebalance(verbose=False)
new_X_train, new_y_train = rebalance.fit_resample(X_train, y_train)

cohort_set = CohortManager(
    cohort_col=["sector", "country"]
)
cohort_set.fit(X=X_train, y=y_train)
subsets = cohort_set.get_subsets(new_X_train, new_y_train, apply_transform=False)

plot_value_counts_cohort(new_y_train, subsets, normalize=False)
../../../_images/notebooks_cohort_case_study_case_1_rebalance_11_0.png

As we can see, after rebalancing the full dataset there is no longer a class imbalance at the full-dataset level. However, inside the cohorts whose class imbalance is inverted (compared to the full dataset), the imbalance has only become more pronounced.
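One way to avoid this side effect is to rebalance each cohort separately, so that SMOTE only interpolates between samples of the same cohort. The snippet below is a minimal sketch of that idea, assuming a flat transform_pipe is replicated to every cohort (the same pattern used for the estimator pipelines later in this notebook); we explore this approach in detail further below:

# Hedged sketch: apply Rebalance inside each cohort instead of globally.
rebalance_per_cohort = CohortManager(
    transform_pipe=[dp.Rebalance(verbose=False)],
    cohort_col=["sector", "country"]
)
X_reb, y_reb = rebalance_per_cohort.fit_resample(X_train, y_train)

For now, though, let’s continue with the globally rebalanced dataset.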

Let’s now check how this new rebalanced dataset impacts the training process when using the baseline pipeline (Case 1 - Baseline 3), where we train a single pipeline for the entire dataset.

[18]:
#EXPERIMENT: Rebalance-Base 1

pipe = Pipeline([
            ("imputer", dp.BasicImputer(verbose=False)),
            ("scaler", dp.DataMinMaxScaler(verbose=False)),
            ("encoder", dp.EncoderOHE(verbose=False)),
            ("estimator", get_model()),
        ])
pipe.fit(new_X_train, new_y_train)
pred_org = pipe.predict_proba(X_test)

pred_train_org = pipe.predict_proba(new_X_train)
_, th_dict = fetch_cohort_results(new_X_train, new_y_train, pred_train_org, cohort_col=["sector", "country"], return_th_dict=True)
fetch_cohort_results(X_test, y_test, pred_org, cohort_col=["sector", "country"], fixed_th=th_dict)
/home/mmendonca/ResponsibleAI/code/git/responsible-ai-mitigations/raimitigations/utils/metric_utils.py:189: RuntimeWarning: invalid value encountered in true_divide
  fscore = (2 * precision * recall) / (precision + recall)
[18]:
cohort cht_query roc precision recall f1 accuracy threshold cht_size
0 all all 0.803268 0.788877 0.790533 0.769316 0.769333 0.639927 3000
1 cohort_0 (`sector` == "s1") and (`country` == "A") 0.852323 0.860370 0.896209 0.873737 0.888436 0.639927 1228
2 cohort_1 (`sector` == "s1") and (`country` == "B") 0.238095 0.208333 0.416667 0.277778 0.384615 0.649560 13
3 cohort_2 (`sector` == "s1") and (`country` == "C") 0.161905 0.232143 0.464286 0.309524 0.448276 0.715199 29
4 cohort_3 (`sector` == "s2") and (`country` == "A") 0.803265 0.930028 0.852337 0.882836 0.920962 0.332854 291
5 cohort_4 (`sector` == "s2") and (`country` == "B") 0.813390 0.803529 0.867521 0.828205 0.880597 0.342261 67
6 cohort_5 (`sector` == "s2") and (`country` == "C") 0.775021 0.701744 0.787014 0.730828 0.858491 0.413488 106
7 cohort_6 (`sector` == "s3") and (`country` == "A") 0.188799 0.575122 0.529032 0.285017 0.304878 0.125266 246
8 cohort_7 (`sector` == "s3") and (`country` == "B") 0.710065 0.913043 0.684211 0.721612 0.842105 0.129994 76
9 cohort_8 (`sector` == "s3") and (`country` == "C") 0.766817 0.848348 0.785480 0.811509 0.899225 0.168353 129
10 cohort_9 (`sector` == "s4") and (`country` == "A") 0.870353 0.935381 0.894539 0.911125 0.925587 0.432214 766
11 cohort_10 (`sector` == "s4") and (`country` == "B") 0.944444 0.875000 0.933333 0.892857 0.904762 0.914203 21
12 cohort_11 (`sector` == "s4") and (`country` == "C") 0.794444 0.804813 0.816667 0.809524 0.821429 0.935217 28

From the previous results, we can see that the metrics haven’t changed much compared to Case 1 - Baseline 3. This is true for the full dataset, as well as for each cohort individually. Therefore, this rebalancing process had little impact on the baseline pipeline.

In the following cell, we use the rebalanced dataset in the cohort-based pipeline (Case 1 - Cohort 5).

[19]:
#EXPERIMENT: Rebalance-Cohort 1

cht_manager = CohortManager(
    transform_pipe=[
        dp.BasicImputer(verbose=False),
        dp.DataMinMaxScaler(verbose=False),
        dp.EncoderOHE(verbose=False),
        get_model()
    ],
    cohort_col=["sector", "country"]
)
cht_manager.fit(new_X_train, new_y_train)
pred_cht = cht_manager.predict_proba(X_test)

pred_train = cht_manager.predict_proba(new_X_train)
_, th_dict = fetch_cohort_results(new_X_train, new_y_train, pred_train, cohort_col=["sector", "country"], return_th_dict=True)
fetch_cohort_results(X_test, y_test, pred_cht, cohort_col=["sector", "country"], fixed_th=th_dict)
/home/mmendonca/ResponsibleAI/code/git/responsible-ai-mitigations/raimitigations/utils/metric_utils.py:189: RuntimeWarning: invalid value encountered in true_divide
  fscore = (2 * precision * recall) / (precision + recall)
/home/mmendonca/anaconda3/envs/raipub/lib/python3.9/site-packages/sklearn/metrics/_classification.py:1318: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
[19]:
cohort cht_query roc precision recall f1 accuracy threshold cht_size
0 all all 0.827125 0.814789 0.820160 0.801492 0.801667 0.637556 3000
1 cohort_0 (`sector` == "s1") and (`country` == "A") 0.852323 0.860370 0.896209 0.873737 0.888436 0.637556 1228
2 cohort_1 (`sector` == "s1") and (`country` == "B") 0.761905 0.888889 0.833333 0.837500 0.846154 0.635485 13
3 cohort_2 (`sector` == "s1") and (`country` == "C") 0.838095 0.899038 0.895238 0.896057 0.896552 0.466120 29
4 cohort_3 (`sector` == "s2") and (`country` == "A") 0.196735 0.378007 0.500000 0.430528 0.756014 0.193607 291
5 cohort_4 (`sector` == "s2") and (`country` == "B") 0.186610 0.097015 0.500000 0.162500 0.194030 0.781296 67
6 cohort_5 (`sector` == "s2") and (`country` == "C") 0.224979 0.061321 0.500000 0.109244 0.122642 0.832238 106
7 cohort_6 (`sector` == "s3") and (`country` == "A") 0.188799 0.575122 0.529032 0.285017 0.304878 0.170009 246
8 cohort_7 (`sector` == "s3") and (`country` == "B") 0.710065 0.913043 0.684211 0.721612 0.842105 0.061715 76
9 cohort_8 (`sector` == "s3") and (`country` == "C") 0.766817 0.848348 0.785480 0.811509 0.899225 0.122870 129
10 cohort_9 (`sector` == "s4") and (`country` == "A") 0.907025 0.935381 0.894539 0.911125 0.925587 0.421985 766
11 cohort_10 (`sector` == "s4") and (`country` == "B") 0.944444 0.875000 0.933333 0.892857 0.904762 0.570675 21
12 cohort_11 (`sector` == "s4") and (`country` == "C") 0.794444 0.804813 0.816667 0.809524 0.821429 0.531133 28

Comparing these results with the ones obtained by the cohort-based pipeline over the original dataset (the one using cohorts based on the sector and country columns - Case 1 - Cohort 5), we can see that the rebalanced dataset performed considerably worse. With the original dataset, we managed to get good metrics for the entire dataset and for each individual cohort. With the rebalanced dataset, the metrics for some cohorts became severely worse (for example, cohort_4). One possible explanation is that, since each cohort behaves very differently from the others, the new instances created by the SMOTE method (used internally by the Rebalance class) don’t fit the behavior observed within each cohort. This is especially true for smaller cohorts, which have fewer data points for SMOTE’s interpolation to draw on. A single pipeline trained over the full dataset is barely affected by this, since these small cohorts already carry little weight there. But when we train a separate pipeline for each cohort and insert noisy data points into a small cohort, that noise has a great impact on the resulting estimator, as the previous results show.
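To make this intuition concrete, here is a toy illustration with made-up numbers (not the library’s internals): SMOTE synthesizes a point on the segment between two minority-class neighbors, so if those neighbors come from cohorts whose investment scales differ by orders of magnitude, the synthetic point lands far outside the smaller cohort’s range:

# Toy sketch of SMOTE-style interpolation across cohorts with very different
# scales. The two values below are hypothetical, chosen only for illustration.
a = 2.0e5    # minority sample from a small-scale cohort
b = 5.0e12   # minority neighbor from a large-scale cohort
gap = np.random.uniform(0, 1)
synthetic = a + gap * (b - a)
print(f"{synthetic:e}")  # typically on the order of trillions: unlike anything in the small cohort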

We can check this by printing the statistics of the investment column for all cohorts, both in the original and in the rebalanced dataset. Notice how the value range changes drastically from the original to the rebalanced dataset for several cohorts. The mean and standard deviation also change considerably.

[20]:
cohort_set = CohortManager(
    cohort_col=["sector", "country"]
)
cohort_set.fit(X=X_train, y=y_train)
subsets_original = cohort_set.get_subsets(X_train, y_train, apply_transform=False)
subsets_rebalance = cohort_set.get_subsets(new_X_train, new_y_train, apply_transform=False)

for cht_name in subsets_original.keys():
    min_value = subsets_original[cht_name]["X"]["investment"].min()
    max_value = subsets_original[cht_name]["X"]["investment"].max()
    mean = subsets_original[cht_name]["X"]["investment"].mean()
    std = subsets_original[cht_name]["X"]["investment"].std()
    print(f"{cht_name}:")
    print(f"\tORIGINAL: (min, max, mean, std) = ({min_value:e}, {max_value:e}, {mean:e}, {std:e})")
    min_value = subsets_rebalance[cht_name]["X"]["investment"].min()
    max_value = subsets_rebalance[cht_name]["X"]["investment"].max()
    mean = subsets_rebalance[cht_name]["X"]["investment"].mean()
    std = subsets_rebalance[cht_name]["X"]["investment"].std()
    print(f"\tREBALANCE: (min, max, mean, std) = ({min_value:e}, {max_value:e}, {mean:e}, {std:e})")
cohort_0:
        ORIGINAL: (min, max, mean, std) = (2.005068e+06, 9.998027e+06, 5.976259e+06, 2.339736e+06)
        REBALANCE: (min, max, mean, std) = (2.005068e+06, 1.098854e+13, 1.040341e+12, 3.217533e+12)
cohort_1:
        ORIGINAL: (min, max, mean, std) = (1.182911e+03, 4.987448e+03, 3.215314e+03, 1.206982e+03)
        REBALANCE: (min, max, mean, std) = (1.182911e+03, 1.098854e+13, 1.127030e+12, 3.377380e+12)
cohort_2:
        ORIGINAL: (min, max, mean, std) = (1.149726e+03, 4.982145e+03, 2.959479e+03, 1.251280e+03)
        REBALANCE: (min, max, mean, std) = (1.149726e+03, 1.098854e+13, 1.046528e+12, 3.244988e+12)
cohort_3:
        ORIGINAL: (min, max, mean, std) = (1.516971e+07, 1.499138e+09, 7.909766e+08, 4.417939e+08)
        REBALANCE: (min, max, mean, std) = (1.516971e+07, 1.098854e+13, 1.252196e+12, 3.492966e+12)
cohort_4:
        ORIGINAL: (min, max, mean, std) = (1.068535e+05, 1.493043e+06, 8.164178e+05, 4.428167e+05)
        REBALANCE: (min, max, mean, std) = (1.068535e+05, 1.098854e+13, 9.555261e+11, 3.113221e+12)
cohort_5:
        ORIGINAL: (min, max, mean, std) = (1.037734e+05, 1.498685e+06, 8.011541e+05, 3.926894e+05)
        REBALANCE: (min, max, mean, std) = (1.037734e+05, 1.098854e+13, 1.088488e+12, 3.290464e+12)
cohort_6:
        ORIGINAL: (min, max, mean, std) = (1.007322e+09, 9.996360e+09, 5.407672e+09, 2.557161e+09)
        REBALANCE: (min, max, mean, std) = (1.007322e+09, 1.098854e+13, 1.239427e+12, 3.471019e+12)
cohort_7:
        ORIGINAL: (min, max, mean, std) = (5.059673e+04, 2.994070e+05, 1.814661e+05, 7.001416e+04)
        REBALANCE: (min, max, mean, std) = (5.059673e+04, 1.098854e+13, 9.609752e+11, 3.110148e+12)
cohort_8:
        ORIGINAL: (min, max, mean, std) = (5.057161e+04, 2.981749e+05, 1.635852e+05, 7.342614e+04)
        REBALANCE: (min, max, mean, std) = (5.057161e+04, 1.098854e+13, 1.163756e+12, 3.384817e+12)
cohort_9:
        ORIGINAL: (min, max, mean, std) = (5.621367e+10, 8.999349e+13, 4.307630e+13, 2.579574e+13)
        REBALANCE: (min, max, mean, std) = (5.621367e+10, 8.999349e+13, 4.251200e+13, 2.731107e+13)
cohort_10:
        ORIGINAL: (min, max, mean, std) = (1.073877e+06, 9.817396e+06, 5.365237e+06, 2.469238e+06)
        REBALANCE: (min, max, mean, std) = (1.073877e+06, 1.098854e+13, 9.747005e+11, 3.089928e+12)
cohort_11:
        ORIGINAL: (min, max, mean, std) = (1.209695e+06, 9.919140e+06, 5.624768e+06, 2.545331e+06)
        REBALANCE: (min, max, mean, std) = (1.209695e+06, 1.098854e+13, 1.024707e+12, 3.191871e+12)

Rebalance only a specific cohort

Suppose now that we are interested in improving the performance over a single cohort, not the entire dataset. For example, let’s say we are satisfied with the results obtained for all cohorts except cohort_7 in the experiment Case 1 - Cohort 5. In other words, we want to improve the metrics for this cohort alone, while leaving the other cohorts as they are. To this end, we’ll use the Rebalance class once again, but this time inside the CohortManager class, which allows us to rebalance only a specific set of cohorts. We’ll pass an empty list of transformations for every cohort except cohort_7, which is the 8th cohort in the list (a sketch below shows how to build this list programmatically).
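Rather than counting the slots by hand, we can build this list programmatically. The sketch below is hypothetical (not from the original notebook) and assumes the key order of the subsets returned by get_subsets matches the cohort order that CohortManager expects in transform_pipe:

# Hedged sketch: place a Rebalance step only in cohort_7's slot.
target = list(subsets_original.keys()).index("cohort_7")
transform_pipe = [[] for _ in subsets_original]
transform_pipe[target] = [dp.Rebalance(verbose=False)]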

[21]:

rebalance_cohort = CohortManager(
    transform_pipe=[[], [], [], [], [], [], [], [dp.Rebalance(verbose=False)], [], [], [], []],
    cohort_col=["sector", "country"]
)
new_X_train, new_y_train = rebalance_cohort.fit_resample(X_train, y_train)

subsets = rebalance_cohort.get_subsets(new_X_train, new_y_train, apply_transform=False)
plot_value_counts_cohort(new_y_train, subsets, normalize=False)
../../../_images/notebooks_cohort_case_study_case_1_rebalance_19_0.png

Let’s check whether the new instances created for cohort cohort_7 are very different from its original instances:

[22]:
cohort_set = CohortManager(
    cohort_col=["sector", "country"]
)
cohort_set.fit(X=X_train, y=y_train)
subsets_original = cohort_set.get_subsets(X_train, y_train, apply_transform=False)
subsets_rebalance = cohort_set.get_subsets(new_X_train, new_y_train, apply_transform=False)

cht_name = "cohort_7"
min_value = subsets_original[cht_name]["X"]["investment"].min()
max_value = subsets_original[cht_name]["X"]["investment"].max()
mean = subsets_original[cht_name]["X"]["investment"].mean()
std = subsets_original[cht_name]["X"]["investment"].std()
print(f"{cht_name}:")
print(f"\tORIGINAL: (min, max, mean, std) = ({min_value:e}, {max_value:e}, {mean:e}, {std:e})")
min_value = subsets_rebalance[cht_name]["X"]["investment"].min()
max_value = subsets_rebalance[cht_name]["X"]["investment"].max()
mean = subsets_rebalance[cht_name]["X"]["investment"].mean()
std = subsets_rebalance[cht_name]["X"]["investment"].std()
print(f"\tREBALANCE: (min, max, mean, std) = ({min_value:e}, {max_value:e}, {mean:e}, {std:e})")
cohort_7:
        ORIGINAL: (min, max, mean, std) = (5.059673e+04, 2.994070e+05, 1.814661e+05, 7.001416e+04)
        REBALANCE: (min, max, mean, std) = (5.059673e+04, 2.994070e+05, 1.383858e+05, 7.739440e+04)

As we can see, the investment column’s mean and standard deviation changed only slightly, and its minimum and maximum are unchanged. This shows that, in some cases, rebalancing certain cohorts separately is less harmful than rebalancing the entire dataset.

With our new rebalanced dataset, let’s repeat the previous experiments:

  1. train the baseline pipeline over the rebalanced dataset (Rebalance-Base 1);

  2. train the cohort-based pipeline over the rebalanced dataset (Rebalance-Cohort 1).

In the following cell, we perform the first of these two experiments:

[23]:
#EXPERIMENT: Rebalance-Base 2

pipe = Pipeline([
            ("imputer", dp.BasicImputer(verbose=False)),
            ("scaler", dp.DataMinMaxScaler(verbose=False)),
            ("encoder", dp.EncoderOHE(verbose=False)),
            ("estimator", get_model()),
        ])
pipe.fit(new_X_train, new_y_train)
pred_org = pipe.predict_proba(X_test)

pred_train_org = pipe.predict_proba(new_X_train)
_, th_dict = fetch_cohort_results(new_X_train, new_y_train, pred_train_org, cohort_col=["sector", "country"], return_th_dict=True)
fetch_cohort_results(X_test, y_test, pred_org, cohort_col=["sector", "country"], fixed_th=th_dict)
/home/mmendonca/ResponsibleAI/code/git/responsible-ai-mitigations/raimitigations/utils/metric_utils.py:189: RuntimeWarning: invalid value encountered in true_divide
  fscore = (2 * precision * recall) / (precision + recall)
[23]:
cohort cht_query roc precision recall f1 accuracy threshold cht_size
0 all all 0.804837 0.788418 0.790132 0.768981 0.769000 0.712906 3000
1 cohort_0 (`sector` == "s1") and (`country` == "A") 0.852323 0.860370 0.896209 0.873737 0.888436 0.712906 1228
2 cohort_1 (`sector` == "s1") and (`country` == "B") 0.238095 0.208333 0.416667 0.277778 0.384615 0.893269 13
3 cohort_2 (`sector` == "s1") and (`country` == "C") 0.161905 0.232143 0.464286 0.309524 0.448276 0.782414 29
4 cohort_3 (`sector` == "s2") and (`country` == "A") 0.803265 0.867622 0.834155 0.848937 0.893471 0.377946 291
5 cohort_4 (`sector` == "s2") and (`country` == "B") 0.813390 0.803529 0.867521 0.828205 0.880597 0.671898 67
6 cohort_5 (`sector` == "s2") and (`country` == "C") 0.775021 0.701744 0.787014 0.730828 0.858491 0.468042 106
7 cohort_6 (`sector` == "s3") and (`country` == "A") 0.188799 0.575122 0.529032 0.285017 0.304878 0.196127 246
8 cohort_7 (`sector` == "s3") and (`country` == "B") 0.710065 0.913043 0.684211 0.721612 0.842105 0.451353 76
9 cohort_8 (`sector` == "s3") and (`country` == "C") 0.766817 0.889474 0.773175 0.814833 0.906977 0.261151 129
10 cohort_9 (`sector` == "s4") and (`country` == "A") 0.869792 0.935381 0.894539 0.911125 0.925587 0.503452 766
11 cohort_10 (`sector` == "s4") and (`country` == "B") 0.944444 0.875000 0.933333 0.892857 0.904762 0.981152 21
12 cohort_11 (`sector` == "s4") and (`country` == "C") 0.794444 0.804813 0.816667 0.809524 0.821429 0.957202 28

As we can see, the results obtained are very similar to those of the baseline pipeline over the original dataset (Case 1 - Baseline 3). This was expected: cohort_7 is a small cohort, and since we are training a single pipeline over the entire dataset, the newly added instances have little impact on the results of the other cohorts.

Let’s now check how this rebalancing process impacts the cohort-based pipeline:

[24]:
#EXPERIMENT: Rebalance-Cohort 2

cht_manager = CohortManager(
    transform_pipe=[
        dp.BasicImputer(verbose=False),
        dp.DataMinMaxScaler(verbose=False),
        dp.EncoderOHE(verbose=False),
        get_model()
    ],
    cohort_col=["sector", "country"]
)
cht_manager.fit(new_X_train, new_y_train)
pred_cht = cht_manager.predict_proba(X_test)

pred_train = cht_manager.predict_proba(new_X_train)
_, th_dict = fetch_cohort_results(new_X_train, new_y_train, pred_train, cohort_col=["sector", "country"], return_th_dict=True)
fetch_cohort_results(X_test, y_test, pred_cht, cohort_col=["sector", "country"], fixed_th=th_dict)
[24]:
cohort cht_query roc precision recall f1 accuracy threshold cht_size
0 all all 0.923553 0.908323 0.898396 0.902471 0.906333 0.475074 3000
1 cohort_0 (`sector` == "s1") and (`country` == "A") 0.911655 0.941368 0.894939 0.914116 0.931596 0.445645 1228
2 cohort_1 (`sector` == "s1") and (`country` == "B") 0.904762 0.888889 0.833333 0.837500 0.846154 0.616965 13
3 cohort_2 (`sector` == "s1") and (`country` == "C") 0.914286 0.899038 0.895238 0.896057 0.896552 0.471948 29
4 cohort_3 (`sector` == "s2") and (`country` == "A") 0.861716 0.867622 0.834155 0.848937 0.893471 0.488684 291
5 cohort_4 (`sector` == "s2") and (`country` == "B") 0.814815 0.914912 0.836895 0.868782 0.925373 0.521201 67
6 cohort_5 (`sector` == "s2") and (`country` == "C") 0.874276 0.929167 0.840778 0.878077 0.952830 0.662477 106
7 cohort_6 (`sector` == "s3") and (`country` == "A") 0.875000 0.943813 0.877957 0.905093 0.934959 0.477076 246
8 cohort_7 (`sector` == "s3") and (`country` == "B") 0.787627 0.913043 0.684211 0.721612 0.842105 0.771149 76
9 cohort_8 (`sector` == "s3") and (`country` == "C") 0.780353 0.889474 0.773175 0.814833 0.906977 0.388728 129
10 cohort_9 (`sector` == "s4") and (`country` == "A") 0.907772 0.935381 0.894539 0.911125 0.925587 0.480231 766
11 cohort_10 (`sector` == "s4") and (`country` == "B") 0.900000 0.837500 0.800000 0.815249 0.857143 0.559088 21
12 cohort_11 (`sector` == "s4") and (`country` == "C") 0.833333 0.862500 0.822222 0.836257 0.857143 0.560495 28

Notice that we managed to get a slight improvement in the ROC AUC metric for cohort_7 (compared to Case 1 - Cohort 5), while the other metrics were left unchanged. It is also interesting to note that the threshold used for cohort_7 changed considerably compared to the cohort-based pipeline trained over the original dataset (Case 1 - Cohort 5).
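If we want to inspect those thresholds directly, the th_dict returned by fetch_cohort_results can be printed per cohort. A minimal sketch, assuming th_dict maps cohort names to the thresholds selected over the training set (consistent with how it is passed as fixed_th above):

# Hedged sketch: print the per-cohort decision thresholds.
for cht, th in th_dict.items():
    print(f"{cht}: threshold = {th:.3f}")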