Cohort Case Study 1

In this first case study for the cohort submodule, we’ll create a synthetic dataset with characteristics that highlight the advantages of applying a separate pre-processing pipeline to each cohort instead of a single pipeline over the entire dataset.

Adopting a separate pipeline for each cohort is not an approach that works in every situation. In fact, it is usually recommended to process the entire dataset instead of analyzing each cohort separately. In some cases, however, certain cohorts behave differently from the others, and the intensity of this difference indicates whether a cohort-based pipeline is the best approach. When each cohort has a distinct behavior, e.g., considerably different class distributions, different feature importances, opposite class behaviors, or different value distributions for certain features, then one cohort might end up degrading the performance of a model trained over the entire dataset when it is evaluated over the remaining cohorts. This is especially true when the cohorts behave very differently and one cohort comprises the majority of the instances: in this case, a model trained over the entire dataset might simply learn how to predict the class of the instances of this majority cohort and ignore the other cohorts. The model will still achieve good overall results, but at the cost of neglecting the minority cohorts.

This becomes a major concern when we are dealing with sensitive cohorts, that is, cohorts built from sensitive features, such as (but not limited to) gender, nationality, race, age, or a combination of these features. When a model performs well for one of the sensitive cohorts but under-performs for the remaining cohorts, the model is considered biased and may raise legal issues. To mitigate these discrepancies, we might need to apply different pre-processing operations over each cohort separately, or even train a different model for each cohort.

Creating the artificial dataset

Since scenarios that benefit from a different data pipeline for each cohort are not common, in this first case study we’ll use a synthetic dataset built specifically to produce this scenario. Here are the main characteristics that we want to see in our dataset:

  1. Different cohorts and sub-cohorts that behave differently: to simulate this, we can use different rules to establish each instance’s class based on which cohort it belongs to. By doing this, a model will have a hard time understanding the general classification rule if trained with the entire dataset, since learning how to classify instances from one cohort may harm the classification capabilities for another cohort;

  2. Different value ranges and distributions for numerical features: if the values of a feature vary considerably across cohorts, it may be useful to normalize that feature separately for each cohort instead of normalizing it over the entire dataset (see the sketch after this list).
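
To make point 2 concrete, here is a minimal sketch (plain pandas, with made-up cohort names and values, not part of the case study code) comparing global min-max scaling with scaling each cohort separately. Under global scaling, the cohort with the smaller value range is squashed near zero, while per-cohort scaling preserves the internal structure of both cohorts:

import pandas as pd

# Toy feature where two cohorts live in very different value ranges
toy = pd.DataFrame({
    "cohort": ["A"] * 3 + ["B"] * 3,
    "value": [2e6, 5e6, 9e6, 1e3, 3e3, 5e3],
})

# Global min-max scaling: cohort B collapses to values near zero
global_scaled = (toy["value"] - toy["value"].min()) / (toy["value"].max() - toy["value"].min())

# Per-cohort min-max scaling: each cohort uses its own min and max
per_cohort_scaled = toy.groupby("cohort")["value"].transform(
    lambda s: (s - s.min()) / (s.max() - s.min())
)

print(toy.assign(global_scaled=global_scaled.round(4), per_cohort=per_cohort_scaled.round(4)))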

Our dataset will detail whether a company went bankrupt (after a fixed number of months) or not. The only features used for each company are: the company’s country of origin, the industry sector to which it belongs, and the initial investment poured into the creation of the company, measured in the country’s local currency. Each country’s currency has a different value, so this feature is expected to vary depending on the country. Also, due to each country’s many characteristics (culture, environment, financial situation, social differences, etc.), each industry sector functions differently depending on the company’s country. For example, a company that sells tropical fruits requires a lot less investment to succeed in tropical countries than in countries farther away from tropical areas. Finally, the rule adopted to define the class of each company (1 if the company went bankrupt, and 0 otherwise) is the following: if the investment value is larger than a given threshold, then the company succeeded (class 0). For some sector/country combinations, however, this behavior is inverted: if the investment is larger than the threshold, the company goes bankrupt. The threshold is defined for each sector within each country.

We’ll also add some noise to the dataset by adding some missing values in the investment column, as well as inverting a small percentage of the final classes.

[1]:
import random
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.pipeline import Pipeline
from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression

from raimitigations.utils import split_data
import raimitigations.dataprocessing as dp
from raimitigations.cohort import (
    CohortDefinition,
    CohortManager,
    fetch_cohort_results
)

SEED = 51
#SEED = None
np.random.seed(SEED)
random.seed(SEED)

def _create_country_df(samples: int, sectors: dict, country_name: str):
    df = None
    for key in sectors.keys():
        # Number of instances and investment values for this sector
        size = int(samples * sectors[key]["prob_occur"])
        invest = np.random.uniform(low=sectors[key]["min"], high=sectors[key]["max"], size=size)
        min_invest = min(invest)
        max_invest = max(invest)
        range_invest = max_invest - min_invest
        # Threshold that separates successful companies from bankrupt ones
        bankrupt_th = sectors[key]["prob_success"] * range_invest
        inverted_behavior = sectors[key]["inverted_behavior"]
        bankrupt = []
        for i in range(invest.shape[0]):
            # Class rule: investment above the threshold -> success (class 0),
            # unless this sector uses the inverted behavior
            inst_class = 1
            if invest[i] > bankrupt_th:
                inst_class = 0
            if inverted_behavior:
                inst_class = int(not inst_class)
            bankrupt.append(inst_class)
        # Noise: flip 5% of the labels and set 10% of the investments to NaN
        noise_ind = np.random.choice(range(size), int(size*0.05), replace=False)
        for ind in noise_ind:
            bankrupt[ind] = int(not bankrupt[ind])
        noise_ind = np.random.choice(range(size), int(size*0.1), replace=False)
        for ind in noise_ind:
            invest[ind] = np.nan

        country_col = [country_name for _ in range(size)]
        sector_col = [key for _ in range(size)]
        df_sector = pd.DataFrame({
            "investment":invest,
            "sector":sector_col,
            "country":country_col,
            "bankrupt":bankrupt
        })

        if df is None:
            df = df_sector
        else:
            df = pd.concat([df, df_sector], axis=0)
    return df

def create_df_multiple_distributions(samples: list):
    sectors_c1 = {
        "s1": {"prob_occur":0.5, "prob_success":0.99, "inverted_behavior":False, "min":2e6, "max":1e7},
        "s2": {"prob_occur":0.1, "prob_success":0.2, "inverted_behavior":False, "min":1e7, "max":1.5e9},
        "s3": {"prob_occur":0.1, "prob_success":0.9, "inverted_behavior":True, "min":1e9, "max":1e10},
        "s4": {"prob_occur":0.3, "prob_success":0.7, "inverted_behavior":False, "min":4e10, "max":9e13},
    }
    sectors_c2 = {
        "s1": {"prob_occur":0.1, "prob_success":0.6, "inverted_behavior":True, "min":1e3, "max":5e3},
        "s2": {"prob_occur":0.3, "prob_success":0.9, "inverted_behavior":False, "min":1e5, "max":1.5e6},
        "s3": {"prob_occur":0.5, "prob_success":0.3, "inverted_behavior":False, "min":5e4, "max":3e5},
        "s4": {"prob_occur":0.1, "prob_success":0.8, "inverted_behavior":False, "min":1e6, "max":1e7},
    }
    sectors_c3 = {
        "s1": {"prob_occur":0.3, "prob_success":0.9, "inverted_behavior":False, "min":3e2, "max":6e2},
        "s2": {"prob_occur":0.6, "prob_success":0.7, "inverted_behavior":False, "min":5e3, "max":9e3},
        "s3": {"prob_occur":0.07, "prob_success":0.6, "inverted_behavior":False, "min":4e3, "max":2e4},
        "s4": {"prob_occur":0.03, "prob_success":0.1, "inverted_behavior":True, "min":6e5, "max":1.3e6},
    }
    countries = {
        "A":{"sectors":sectors_c1, "sample_rate":0.85},
        "B":{"sectors":sectors_c2, "sample_rate":0.05},
        "C":{"sectors":sectors_c2, "sample_rate":0.1}
    }
    df = None
    for key in countries.keys():
        n_sample = int(samples * countries[key]["sample_rate"])
        df_c = _create_country_df(n_sample, countries[key]["sectors"], key)
        if df is None:
            df = df_c
        else:
            df = pd.concat([df, df_c], axis=0)

    # Reset the index so it ranges from 0 to len(df) - 1
    df = df.reset_index(drop=True)
    return df

Let’s now create our artificial dataset:

[2]:
df = create_df_multiple_distributions(10000)
df
[2]:
investment sector country bankrupt
0 7.405851e+06 s1 A 1
1 2.357697e+06 s1 A 1
2 4.746429e+06 s1 A 1
3 7.152158e+06 s1 A 1
4 NaN s1 A 1
... ... ... ... ...
9995 4.226512e+06 s4 C 1
9996 3.566758e+06 s4 C 0
9997 9.281006e+06 s4 C 0
9998 5.770378e+06 s4 C 1
9999 3.661511e+06 s4 C 1

10000 rows × 4 columns

We’ll now split our dataset into train and test sets:

[3]:
X_train, X_test, y_train, y_test = split_data(df, label="bankrupt", test_size=0.3)
[4]:
def get_model():
    #model = LGBMClassifier(random_state=SEED)
    model = LogisticRegression(random_state=SEED)
    return model

Analyzing the “country” cohorts

Let’s create our baseline model. We’ll use a simple model since our goal is to test the efficiency of data processing pipelines, not test how different models behave. We’ll create a pipeline with an imputer, a data normalization transformer, a one-hot encoding transformer, and finally our simple estimator.

[5]:
#EXPERIMENT: Baseline 1

pipe = Pipeline([
            ("imputer", dp.BasicImputer(verbose=False)),
            ("scaler", dp.DataMinMaxScaler(verbose=False)),
            ("encoder", dp.EncoderOHE(verbose=False)),
            ("estimator", get_model()),
        ])
pipe.fit(X_train, y_train)
pred_org = pipe.predict_proba(X_test)

We’ll then evaluate our pipeline over the test set and analyze how it performs over different cohorts. This analysis is done using the fetch_cohort_results function, which shows the results obtained for the entire dataset, as well as for each cohort. Notice that this function computes all metrics for each cohort separately, and therefore a different threshold might be found for each cohort (the optimal threshold for a given set of predictions is found using the ROC curve, and this threshold is used to determine whether a predicted probability should be converted to class 1 or class 0). Since the thresholds are used to compute the precision, recall, accuracy, and F1 score, we must compute them using the training set; otherwise, we would have data leakage. Therefore, we first call fetch_cohort_results using the training set. Notice that we set return_th_dict to True (its default value is False), which makes the function return a dataframe with all of the computed metrics together with a dictionary containing the best threshold found for each cohort. We then pass this dictionary of thresholds when we call fetch_cohort_results a second time, now over the test dataset: since we want to reuse the thresholds computed for the training set, we set the fixed_th parameter to this dictionary.

Notice that the thresholds are only used for binary problems. For multi-class problems, the class chosen based on the probabilities is the class with the largest probability.
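
To make the threshold selection more concrete, here is a minimal sketch (using plain scikit-learn, and not necessarily the exact procedure used internally by fetch_cohort_results) of how an optimal binarization threshold can be derived from the ROC curve, for example by maximizing Youden’s J statistic (TPR - FPR):

import numpy as np
from sklearn.metrics import roc_curve

def best_threshold(y_true, y_prob):
    # y_prob holds the predicted probabilities of the positive class (class 1)
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    # Pick the threshold that maximizes TPR - FPR (Youden's J statistic)
    return thresholds[np.argmax(tpr - fpr)]

# Hypothetical usage with a set of training predictions:
# th = best_threshold(y_train, model_probs_train[:, 1])
# y_pred = (model_probs_test[:, 1] >= th).astype(int)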

For now, we’ll focus on the cohorts defined by the different countries in the dataset.

[6]:
pred_train_org = pipe.predict_proba(X_train)
metrics_train, th_dict = fetch_cohort_results(X_train, y_train, pred_train_org, cohort_col=["country"], return_th_dict=True)
fetch_cohort_results(X_test, y_test, pred_org, cohort_col=["country"], fixed_th=th_dict)
[6]:
cohort cht_query roc precision recall f1 accuracy threshold cht_size
0 all all 0.803805 0.788653 0.790248 0.768984 0.769000 0.714791 3000
1 cohort_0 (`country` == "A") 0.836172 0.821859 0.832952 0.813529 0.814303 0.714791 2531
2 cohort_1 (`country` == "B") 0.788575 0.797303 0.784403 0.786350 0.790960 0.176188 177
3 cohort_2 (`country` == "C") 0.798822 0.846590 0.830079 0.829746 0.832192 0.234479 292

We can see that our pipeline managed to achieve a decent performance for all countries, despite the different behaviors that we injected into each cohort. However, since we have only a single estimator, we should consider a single threshold for the entire dataset. The function fetch_cohort_results() analyzes the results separately for each cohort, that is, all metrics are computed using the isolated predictions of each cohort. That is why each cohort ended up with a different threshold (threshold column). But if we use the optimal threshold computed for the entire dataset for all cohorts (which makes sense in this case, since we have a single estimator), then we’ll notice that the results are very different.

In the following cell, we call fetch_cohort_results() again using the test dataset, but this time we set shared_th to True, and we again specify that the thresholds to be used are the ones from th_dict (which were computed using the training set). This way, the precision, recall, accuracy, and F1 score will be computed for all cohorts using the threshold value of the “all” cohort (because shared_th is True).

[7]:
#EXPERIMENT: Baseline 2
fetch_cohort_results(X_test, y_test, pred_org, cohort_col=["country"], shared_th=True, fixed_th=th_dict)
[7]:
cohort cht_query roc precision recall f1 accuracy threshold cht_size
0 all all 0.803805 0.788653 0.790248 0.768984 0.769000 0.714791 3000
1 cohort_0 (`country` == "A") 0.836172 0.821859 0.832952 0.813529 0.814303 0.714791 2531
2 cohort_1 (`country` == "B") 0.788575 0.579861 0.548716 0.486034 0.525424 0.714791 177
3 cohort_2 (`country` == "C") 0.798822 0.550498 0.530459 0.475221 0.523973 0.714791 292
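
Conceptually, setting shared_th=True binarizes every cohort’s probabilities with the threshold found for the “all” cohort before computing the threshold-dependent metrics. The rough sketch below (plain pandas/scikit-learn, not the library’s actual implementation, and not expected to reproduce the exact numbers in the table) illustrates why cohorts B and C degrade under the shared threshold:

from sklearn.metrics import f1_score

# Hypothetical sketch: apply one shared threshold (the "all" threshold
# from the table above) to every cohort and compare the resulting F1 scores.
shared_threshold = 0.714791
y_hat = (pred_org[:, 1] >= shared_threshold).astype(int)

for country in ["A", "B", "C"]:
    mask = (X_test["country"] == country).to_numpy()
    print(country, "macro F1:", round(f1_score(y_test[mask], y_hat[mask], average="macro"), 4))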

For the sake of comparability, we’ll use a different threshold for each cohort in all of our experiments from this point on (that is, shared_th will be set to False). This will make our analysis more straightforward and easier to understand.

Let’s see if we can improve these metrics (compared to Baseline 1) by applying some pre-processing steps over each cohort separately. To that end, let’s apply the imputation and normalization over each cohort separately and see how this impacts the resulting pipeline. We’ll use the CohortManager class to achieve this.

[8]:
#EXPERIMENT: Cohort 1

cht_manager = CohortManager(
    transform_pipe=[
        dp.BasicImputer(verbose=False),
        dp.DataMinMaxScaler(verbose=False),
    ],
    cohort_col=["country"]
)

pipe = Pipeline([
            ("cht_preprocess", cht_manager),
            ("encoder", dp.EncoderOHE(verbose=False)),
            ("estimator", get_model()),
        ])
pipe.fit(X_train, y_train)
pred_cht = pipe.predict_proba(X_test)

pred_train = pipe.predict_proba(X_train)
metrics_train, th_dict = fetch_cohort_results(X_train, y_train, pred_train, cohort_col=["country"], return_th_dict=True)
fetch_cohort_results(X_test, y_test, pred_cht, cohort_col=["country"], fixed_th=th_dict)
[8]:
cohort cht_query roc precision recall f1 accuracy threshold cht_size
0 all all 0.814191 0.795227 0.797164 0.775978 0.776000 0.717450 3000
1 cohort_0 (`country` == "A") 0.840887 0.826499 0.838250 0.819723 0.820624 0.717450 2531
2 cohort_1 (`country` == "B") 0.803594 0.801724 0.803273 0.801855 0.802260 0.367261 177
3 cohort_2 (`country` == "C") 0.820083 0.846590 0.830079 0.829746 0.832192 0.265146 292

Unfortunately, we only achieved a slight performance increase. Let’s now try using the same pipeline used for the baseline model, but this time each cohort has its own pipeline, that is, the pre-processing steps and the estimator are fitted for each cohort separately.

[9]:
#EXPERIMENT: Cohort 2

cht_manager = CohortManager(
    transform_pipe=[
        dp.BasicImputer(verbose=False),
        dp.DataMinMaxScaler(verbose=False),
        dp.EncoderOHE(verbose=False),
        get_model()
    ],
    cohort_col=["country"]
)
cht_manager.fit(X_train, y_train)
pred_cht = cht_manager.predict_proba(X_test)

pred_train = cht_manager.predict_proba(X_train)
metrics_train, th_dict = fetch_cohort_results(X_train, y_train, pred_train, cohort_col=["country"], return_th_dict=True)
fetch_cohort_results(X_test, y_test, pred_cht, cohort_col=["country"], fixed_th=th_dict)
[9]:
cohort cht_query roc precision recall f1 accuracy threshold cht_size
0 all all 0.829600 0.813206 0.818502 0.799828 0.800000 0.725327 3000
1 cohort_0 (`country` == "A") 0.837492 0.827085 0.838913 0.820496 0.821414 0.725327 2531
2 cohort_1 (`country` == "B") 0.830680 0.797303 0.784403 0.786350 0.790960 0.110570 177
3 cohort_2 (`country` == "C") 0.889919 0.846590 0.830079 0.829746 0.832192 0.174096 292

We managed to get a decent increase in the metrics by simply training a different model for each cohort; in this case, we trained a different pipeline for each country cohort. However, each industry sector also behaves differently, so even though we now look at each country separately, the differences between sectors still hinder each country’s model. Let’s now replicate the previous experiment, but this time train different pipelines for each industry sector instead of each country:

[10]:
#EXPERIMENT: Cohort 3

cht_manager = CohortManager(
    transform_pipe=[
        dp.BasicImputer(verbose=False),
        dp.DataMinMaxScaler(verbose=False),
        dp.EncoderOHE(verbose=False),
        get_model()
    ],
    cohort_col=["sector"]
)
cht_manager.fit(X_train, y_train)
pred_cht = cht_manager.predict_proba(X_test)

pred_train = cht_manager.predict_proba(X_train)
metrics_train, th_dict = fetch_cohort_results(X_train, y_train, pred_train, cohort_col=["country"], return_th_dict=True)
fetch_cohort_results(X_test, y_test, pred_cht, cohort_col=["country"], fixed_th=th_dict)
[10]:
cohort cht_query roc precision recall f1 accuracy threshold cht_size
0 all all 0.902798 0.902903 0.895324 0.898555 0.902333 0.464561 3000
1 cohort_0 (`country` == "A") 0.924799 0.927963 0.917449 0.921941 0.925721 0.464561 2531
2 cohort_1 (`country` == "B") 0.747497 0.797229 0.796085 0.790934 0.790960 0.747629 177
3 cohort_2 (`country` == "C") 0.808819 0.795423 0.793871 0.794047 0.794521 0.504412 292

From the previous results, we can see that the difference in behavior for each sector is greater than the differences imposed by the countries. By training separate pipelines for each sector, we managed to greatly improve the performance of our model.

Let’s now see if we can improve these results even further using a different pipeline for each combination of country and sector:

[11]:
#EXPERIMENT: Cohort 4

cht_manager = CohortManager(
    transform_pipe=[
        dp.BasicImputer(verbose=False),
        dp.DataMinMaxScaler(verbose=False),
        dp.EncoderOHE(verbose=False),
        get_model()
    ],
    cohort_col=["sector", "country"]
)
cht_manager.fit(X_train, y_train)
pred_cht = cht_manager.predict_proba(X_test)

pred_train = cht_manager.predict_proba(X_train)
metrics_train, th_dict = fetch_cohort_results(X_train, y_train, pred_train, cohort_col=["country"], return_th_dict=True)
fetch_cohort_results(X_test, y_test, pred_cht, cohort_col=["country"], fixed_th=th_dict)
[11]:
cohort cht_query roc precision recall f1 accuracy threshold cht_size
0 all all 0.921679 0.908100 0.902547 0.905012 0.908333 0.475074 3000
1 cohort_0 (`country` == "A") 0.925494 0.920170 0.911027 0.914990 0.919004 0.476344 2531
2 cohort_1 (`country` == "B") 0.869063 0.845424 0.845956 0.841803 0.841808 0.546313 177
3 cohort_2 (`country` == "C") 0.923664 0.874875 0.868682 0.869120 0.869863 0.388728 292

This resulted in only a slight additional increase in performance. These results show what we expected: for our artificial dataset, training a separate pipeline for each combination of sector and country gives the best results. This was already anticipated given how we created the artificial dataset, where we defined a different classification behavior for each subset with different sector and country values.

Checking the “sector” + “country” cohorts

Let’s now look into the metrics for each combination of country and sector. We’ll first check how the baseline pipeline, trained over the entire dataset (Baseline 1), performs for each of these cohorts:

[12]:
#EXPERIMENT: Baseline 3

_, th_dict = fetch_cohort_results(X_train, y_train, pred_train_org, cohort_col=["sector", "country"], return_th_dict=True)
fetch_cohort_results(X_test, y_test, pred_org, cohort_col=["sector", "country"], fixed_th=th_dict)
[12]:
cohort cht_query roc precision recall f1 accuracy threshold cht_size
0 all all 0.803805 0.788653 0.790248 0.768984 0.769000 0.714791 3000
1 cohort_0 (`sector` == "s1") and (`country` == "A") 0.852323 0.860370 0.896209 0.873737 0.888436 0.714791 1228
2 cohort_1 (`sector` == "s1") and (`country` == "B") 0.238095 0.208333 0.416667 0.277778 0.384615 0.732667 13
3 cohort_2 (`sector` == "s1") and (`country` == "C") 0.161905 0.232143 0.464286 0.309524 0.448276 0.796960 29
4 cohort_3 (`sector` == "s2") and (`country` == "A") 0.803265 0.867622 0.834155 0.848937 0.893471 0.400690 291
5 cohort_4 (`sector` == "s2") and (`country` == "B") 0.813390 0.803529 0.867521 0.828205 0.880597 0.422347 67
6 cohort_5 (`sector` == "s2") and (`country` == "C") 0.775021 0.701744 0.787014 0.730828 0.858491 0.511512 106
7 cohort_6 (`sector` == "s3") and (`country` == "A") 0.188799 0.575122 0.529032 0.285017 0.304878 0.163522 246
8 cohort_7 (`sector` == "s3") and (`country` == "B") 0.710065 0.913043 0.684211 0.721612 0.842105 0.176188 76
9 cohort_8 (`sector` == "s3") and (`country` == "C") 0.766817 0.889474 0.773175 0.814833 0.906977 0.234479 129
10 cohort_9 (`sector` == "s4") and (`country` == "A") 0.870353 0.935381 0.894539 0.911125 0.925587 0.503846 766
11 cohort_10 (`sector` == "s4") and (`country` == "B") 0.944444 0.875000 0.933333 0.892857 0.904762 0.944555 21
12 cohort_11 (`sector` == "s4") and (`country` == "C") 0.794444 0.804813 0.816667 0.809524 0.821429 0.960627 28

Notice that some cohorts have very poor performance. For example, for companies from sector s1, the model learned how to predict the correct class only for companies from country A (cohort_0). Companies from sector s1 in country B use the inverted behavior (the inverted_behavior flag in the function that creates the dataset), companies from country C have a very different range of investment values compared to country A, and instances from country A make up the majority of the dataset. As a result, the model prioritized learning how to classify only the instances with sector == s1 from country A.
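
We can confirm this on the raw data with a quick pandas check (not part of the original experiments): the class balance and investment range of the s1 cohort differ considerably across countries:

# Class balance and investment range of sector s1, per country.
# Exact values depend on the random seed.
s1 = df[df["sector"] == "s1"]
print(
    s1.groupby("country").agg(
        n=("bankrupt", "size"),
        bankrupt_rate=("bankrupt", "mean"),
        invest_min=("investment", "min"),
        invest_max=("investment", "max"),
    )
)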

Let’s now check the results for these cohorts using our best pipeline, which is the one where each cohort of sector and country had its own pipeline (Cohort 4):

[13]:
#EXPERIMENT: Cohort 5

_, th_dict = fetch_cohort_results(X_train, y_train, pred_train, cohort_col=["sector", "country"], return_th_dict=True)
fetch_cohort_results(X_test, y_test, pred_cht, cohort_col=["sector", "country"], fixed_th=th_dict)
[13]:
cohort cht_query roc precision recall f1 accuracy threshold cht_size
0 all all 0.921679 0.908100 0.902547 0.905012 0.908333 0.475074 3000
1 cohort_0 (`sector` == "s1") and (`country` == "A") 0.911655 0.941368 0.894939 0.914116 0.931596 0.445645 1228
2 cohort_1 (`sector` == "s1") and (`country` == "B") 0.904762 0.888889 0.833333 0.837500 0.846154 0.616965 13
3 cohort_2 (`sector` == "s1") and (`country` == "C") 0.914286 0.899038 0.895238 0.896057 0.896552 0.471948 29
4 cohort_3 (`sector` == "s2") and (`country` == "A") 0.861716 0.867622 0.834155 0.848937 0.893471 0.488684 291
5 cohort_4 (`sector` == "s2") and (`country` == "B") 0.814815 0.914912 0.836895 0.868782 0.925373 0.521201 67
6 cohort_5 (`sector` == "s2") and (`country` == "C") 0.874276 0.929167 0.840778 0.878077 0.952830 0.662477 106
7 cohort_6 (`sector` == "s3") and (`country` == "A") 0.875000 0.943813 0.877957 0.905093 0.934959 0.477076 246
8 cohort_7 (`sector` == "s3") and (`country` == "B") 0.762696 0.913043 0.684211 0.721612 0.842105 0.226486 76
9 cohort_8 (`sector` == "s3") and (`country` == "C") 0.780353 0.889474 0.773175 0.814833 0.906977 0.388728 129
10 cohort_9 (`sector` == "s4") and (`country` == "A") 0.907772 0.935381 0.894539 0.911125 0.925587 0.480231 766
11 cohort_10 (`sector` == "s4") and (`country` == "B") 0.900000 0.837500 0.800000 0.815249 0.857143 0.559088 21
12 cohort_11 (`sector` == "s4") and (`country` == "C") 0.833333 0.862500 0.822222 0.836257 0.857143 0.560495 28

As we can see, the results here are a lot more consistent, and no cohort shows a drastic performance difference compared to the others. This is because each cohort had its own estimator, so each estimator had only one behavior to learn.