Decoupled Classifiers Case Study 1
For the first case study, we’ll highlight the benefits of using decoupled classifiers over different cohorts of the data. This module implements techniques for searching and combining cohorts to optimize for different definitions of group fairness, based on the approach presented in the paper Decoupled classifiers for group-fair and efficient machine learning.
The techniques implemented in this module work with the Cohort module of this library to fit an estimator over each cohort, leveraging transfer learning and other optimization techniques for minority cohorts whose data would otherwise be insufficient.
[1]:
import pandas as pd
import numpy as np
import random
from sklearn.tree import DecisionTreeClassifier
from lightgbm import LGBMClassifier
from raimitigations.utils import split_data
import raimitigations.dataprocessing as dp
from raimitigations.cohort import DecoupledClass, CohortDefinition, CohortManager, fetch_cohort_results
from sklearn.pipeline import Pipeline
SEED = 100
Throughout this case study, we will recreate and use a synthetic dataset created as part of Cohort case study 1 to showcase the additional techniques this module can use to optimize fairness and performance over cohorts.
[2]:
def _create_country_df(samples: int, sectors: dict, country_name: str):
    df = None
    for key in sectors.keys():
        # Sample the investment values for this sector
        size = int(samples * sectors[key]["prob_occur"])
        invest = np.random.uniform(low=sectors[key]["min"], high=sectors[key]["max"], size=size)
        min_invest = min(invest)
        max_invest = max(invest)
        range_invest = max_invest - min_invest
        bankrupt_th = sectors[key]["prob_success"] * range_invest
        # Some sectors have the opposite relationship between
        # investment and bankruptcy
        inverted_behavior = sectors[key]["inverted_behavior"]
        bankrupt = []
        for i in range(invest.shape[0]):
            inst_class = 1
            if invest[i] > bankrupt_th:
                inst_class = 0
            if inverted_behavior:
                inst_class = int(not inst_class)
            bankrupt.append(inst_class)
        # Add label noise: flip the label of 5% of the instances
        noise_ind = np.random.choice(range(size), int(size*0.05), replace=False)
        for ind in noise_ind:
            bankrupt[ind] = int(not bankrupt[ind])
        # Add missing values to 10% of the investment values
        noise_ind = np.random.choice(range(size), int(size*0.1), replace=False)
        for ind in noise_ind:
            invest[ind] = np.nan

        country_col = [country_name for _ in range(size)]
        sector_col = [key for _ in range(size)]
        df_sector = pd.DataFrame({
            "investment": invest,
            "sector": sector_col,
            "country": country_col,
            "bankrupt": bankrupt
        })

        if df is None:
            df = df_sector
        else:
            df = pd.concat([df, df_sector], axis=0)
    return df
def create_df_multiple_distributions(samples: int):
    # Each country samples its sectors from a different distribution,
    # which creates distinct behaviors across the "country" cohorts
    sectors_c1 = {
        "s1": {"prob_occur":0.5, "prob_success":0.99, "inverted_behavior":False, "min":2e6, "max":1e7},
        "s2": {"prob_occur":0.1, "prob_success":0.2, "inverted_behavior":False, "min":1e7, "max":1.5e9},
        "s3": {"prob_occur":0.1, "prob_success":0.9, "inverted_behavior":True, "min":1e9, "max":1e10},
        "s4": {"prob_occur":0.3, "prob_success":0.7, "inverted_behavior":False, "min":4e10, "max":9e13},
    }
    sectors_c2 = {
        "s1": {"prob_occur":0.1, "prob_success":0.6, "inverted_behavior":True, "min":1e3, "max":5e3},
        "s2": {"prob_occur":0.3, "prob_success":0.9, "inverted_behavior":False, "min":1e5, "max":1.5e6},
        "s3": {"prob_occur":0.5, "prob_success":0.3, "inverted_behavior":False, "min":5e4, "max":3e5},
        "s4": {"prob_occur":0.1, "prob_success":0.8, "inverted_behavior":False, "min":1e6, "max":1e7},
    }
    sectors_c3 = {
        "s1": {"prob_occur":0.3, "prob_success":0.9, "inverted_behavior":False, "min":3e2, "max":6e2},
        "s2": {"prob_occur":0.6, "prob_success":0.7, "inverted_behavior":False, "min":5e3, "max":9e3},
        "s3": {"prob_occur":0.07, "prob_success":0.6, "inverted_behavior":False, "min":4e3, "max":2e4},
        "s4": {"prob_occur":0.03, "prob_success":0.1, "inverted_behavior":True, "min":6e5, "max":1.3e6},
    }
    # Note: sectors_c3 is defined but unused here; country "C" reuses sectors_c2.
    # Country "A" dominates the dataset, making "B" and "C" minority cohorts.
    countries = {
        "A":{"sectors":sectors_c1, "sample_rate":0.85},
        "B":{"sectors":sectors_c2, "sample_rate":0.05},
        "C":{"sectors":sectors_c2, "sample_rate":0.1}
    }
    df = None
    for key in countries.keys():
        n_sample = int(samples * countries[key]["sample_rate"])
        df_c = _create_country_df(n_sample, countries[key]["sectors"], key)
        if df is None:
            df = df_c
        else:
            df = pd.concat([df, df_c], axis=0)
    idx = pd.Index([i for i in range(df.shape[0])])
    df = df.set_index(idx)
    return df
Note: this dataset indicates whether a company has gone bankrupt (class 1) or not (class 0):
[3]:
np.random.seed(51)
df = create_df_multiple_distributions(10000)
df
[3]:
| | investment | sector | country | bankrupt |
|---|---|---|---|---|
| 0 | 7.405851e+06 | s1 | A | 1 |
| 1 | 2.357697e+06 | s1 | A | 1 |
| 2 | 4.746429e+06 | s1 | A | 1 |
| 3 | 7.152158e+06 | s1 | A | 1 |
| 4 | NaN | s1 | A | 1 |
| ... | ... | ... | ... | ... |
| 9995 | 4.226512e+06 | s4 | C | 1 |
| 9996 | 3.566758e+06 | s4 | C | 0 |
| 9997 | 9.281006e+06 | s4 | C | 0 |
| 9998 | 5.770378e+06 | s4 | C | 1 |
| 9999 | 3.661511e+06 | s4 | C | 1 |

10000 rows × 4 columns
Split data into train and test sets:
[3]:
X_train, X_test, y_train, y_test = split_data(df, label="bankrupt", test_size=0.3, random_state=SEED)
[4]:
def get_model():
    # model = DecisionTreeClassifier(max_features="sqrt")
    model = LGBMClassifier(random_state=SEED)
    return model
Baseline Case
To demonstrate the additional benefits of the decoupled classifiers, we start with the CohortManager class as the baseline. Let’s look at the metrics and performance of the “sector” and “country” cohorts:
[5]:
# BASELINE: "sector"
cht_manager = CohortManager(
transform_pipe=[
dp.BasicImputer(verbose=False),
dp.DataMinMaxScaler(verbose=False),
dp.EncoderOHE(verbose=False),
get_model()
],
cohort_col=["sector"]
)
cht_manager.fit(X_train, y_train)
pred_cht = cht_manager.predict_proba(X_test)
pred_train = cht_manager.predict_proba(X_train)
metrics_train, th_dict = fetch_cohort_results(X_train, y_train, pred_train, cohort_col=["sector"], return_th_dict=True)
fetch_cohort_results(X_test, y_test, pred_cht, cohort_col=["sector"], fixed_th=th_dict)
[5]:
| | cohort | cht_query | roc | precision | recall | f1 | accuracy | threshold | num_pos | %_pos | cht_size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | all | all | 0.947905 | 0.929772 | 0.921173 | 0.924828 | 0.927667 | 0.535672 | 1829 | 0.609667 | 3000 |
| 1 | cohort_0 | (`sector` == "s1") | 0.933652 | 0.863698 | 0.896230 | 0.876748 | 0.893650 | 0.783542 | 863 | 0.660291 | 1307 |
| 2 | cohort_1 | (`sector` == "s2") | 0.942994 | 0.931929 | 0.916137 | 0.921523 | 0.924084 | 0.381770 | 147 | 0.384817 | 382 |
| 3 | cohort_2 | (`sector` == "s3") | 0.860066 | 0.716087 | 0.787660 | 0.738242 | 0.817073 | 0.156317 | 133 | 0.270325 | 492 |
| 4 | cohort_3 | (`sector` == "s4") | 0.925276 | 0.859874 | 0.885490 | 0.870424 | 0.886447 | 0.724742 | 536 | 0.654457 | 819 |
[6]:
# BASELINE: "country"
cht_manager = CohortManager(
transform_pipe=[
dp.BasicImputer(verbose=False),
dp.DataMinMaxScaler(verbose=False),
dp.EncoderOHE(verbose=False),
get_model()
],
cohort_col=["country"]
)
cht_manager.fit(X_train, y_train)
pred_cht = cht_manager.predict_proba(X_test)
pred_train = cht_manager.predict_proba(X_train)
metrics_train, th_dict = fetch_cohort_results(X_train, y_train, pred_train, cohort_col=["country"], return_th_dict=True)
fetch_cohort_results(X_test, y_test, pred_cht, cohort_col=["country"], fixed_th=th_dict)
[6]:
| | cohort | cht_query | roc | precision | recall | f1 | accuracy | threshold | num_pos | %_pos | cht_size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | all | all | 0.944930 | 0.924198 | 0.917583 | 0.920483 | 0.923333 | 0.618757 | 1814 | 0.604667 | 3000 |
| 1 | cohort_0 | (`country` == "A") | 0.945985 | 0.927582 | 0.917092 | 0.921732 | 0.926562 | 0.664861 | 1628 | 0.635938 | 2560 |
| 2 | cohort_1 | (`country` == "B") | 0.935107 | 0.869528 | 0.860549 | 0.864621 | 0.875817 | 0.500140 | 53 | 0.346405 | 153 |
| 3 | cohort_2 | (`country` == "C") | 0.934542 | 0.925572 | 0.928491 | 0.926472 | 0.926829 | 0.340654 | 137 | 0.477352 | 287 |
In this case, the CohortManager class creates and trains a separate model for each unique value of these columns, regardless of each cohort’s label distribution or size; conceptually, the baseline amounts to the loop sketched below.
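As a rough illustration (not the CohortManager internals — plain scikit-learn components stand in for the dp transformers, and the per-cohort prediction routing is omitted):

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

def make_cohort_pipeline():
    # Impute + scale the numeric column, one-hot encode the categorical one
    pre = ColumnTransformer([
        ("num", Pipeline([("imp", SimpleImputer()), ("scale", MinMaxScaler())]), ["investment"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["country"]),
    ])
    return Pipeline([("pre", pre), ("model", get_model())])

# One independent model per unique "sector" value, however small the cohort
cohort_models = {}
for sector_value in X_train["sector"].unique():
    mask = X_train["sector"] == sector_value
    cohort_models[sector_value] = make_cohort_pipeline().fit(X_train[mask], y_train[mask])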
DecoupledClass Techniques
Instead, what happens if we use the DecoupledClass over the same columns, with the same pre-processing pipeline and estimator?
Let’s start with the “sector” column:
[7]:
preprocessing = [dp.BasicImputer(verbose=False), dp.DataMinMaxScaler(verbose=False), dp.EncoderOHE(verbose=False)]
dec_class = DecoupledClass(
    cohort_col=['sector'],
    transform_pipe=preprocessing,
    estimator=get_model()
)
dec_class.fit(X_train, y_train)
dec_class.print_cohorts()
FINAL COHORTS
cohort_0:
    Size: 3093
    Query:
        (`sector` == "s1")
    Value Counts:
        1: 2169 (70.13%)
        0: 924 (29.87%)
    Invalid: False

cohort_1:
    Size: 918
    Query:
        (`sector` == "s2")
    Value Counts:
        0: 519 (56.54%)
        1: 399 (43.46%)
    Invalid: False

cohort_2:
    Size: 1108
    Query:
        (`sector` == "s3")
    Value Counts:
        0: 889 (80.23%)
        1: 219 (19.77%)
    Invalid: False

cohort_3:
    Size: 1881
    Query:
        (`sector` == "s4")
    Value Counts:
        1: 1307 (69.48%)
        0: 574 (30.52%)
    Invalid: False
[8]:
th_dict = dec_class.get_threasholds_dict()
pred = dec_class.predict_proba(X_test)
fetch_cohort_results(X_test, y_test, pred, cohort_def=dec_class, fixed_th=th_dict)
[8]:
| | cohort | cht_query | roc | precision | recall | f1 | accuracy | threshold | num_pos | %_pos | cht_size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | all | all | 0.947905 | 0.929772 | 0.921173 | 0.924828 | 0.927667 | 0.500000 | 1829 | 0.609667 | 3000 |
| 1 | cohort_0 | (`sector` == "s1") | 0.933652 | 0.937394 | 0.886916 | 0.907627 | 0.928080 | 0.630641 | 994 | 0.760520 | 1307 |
| 2 | cohort_1 | (`sector` == "s2") | 0.942994 | 0.931929 | 0.916137 | 0.921523 | 0.924084 | 0.381770 | 147 | 0.384817 | 382 |
| 3 | cohort_2 | (`sector` == "s3") | 0.860066 | 0.898785 | 0.851521 | 0.872572 | 0.928862 | 0.310128 | 76 | 0.154472 | 492 |
| 4 | cohort_3 | (`sector` == "s4") | 0.925276 | 0.941381 | 0.885875 | 0.907826 | 0.926740 | 0.474869 | 619 | 0.755800 | 819 |
We can see that these 4 cohorts (one for each unique value of the column) are no different from the cohorts created by the CohortManager baseline; the reason is that none of the 4 cohorts was invalid. An invalid cohort is defined as a cohort whose size is < ``max(min_cohort_size, df.shape[0] * min_cohort_pct)``, or whose minority class (the label value with the fewest occurrences) has an occurrence rate < ``minority_min_rate``.
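In code terms, the validity rule amounts to the check below (a minimal sketch of the rule just stated, not the library’s internal implementation; the function name is hypothetical):

def is_cohort_valid(cohort_y, df, min_cohort_size, min_cohort_pct, minority_min_rate):
    # cohort_y: pandas Series with the cohort's labels; df: the full dataset
    # Rule 1: the cohort must be large enough
    if cohort_y.shape[0] < max(min_cohort_size, df.shape[0] * min_cohort_pct):
        return False
    # Rule 2: the minority class must occur often enough within the cohort
    minority_rate = cohort_y.value_counts(normalize=True).min()
    return minority_rate >= minority_min_rate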
When it encounters invalid cohorts, the DecoupledClass fit method uses a few techniques to create valid cohorts from the invalid ones. So how do we use the DecoupledClass to handle invalid cohorts? For the remainder of this case study, we’ll explore the cohorts of the “country” column to demonstrate these techniques.
Merging Invalid Cohorts
First, let’s look at merging invalid cohorts. This technique creates valid cohorts from invalid ones: each invalid cohort is merged with the smallest cohort other than itself, roughly as sketched below.
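The heuristic can be illustrated as follows (a simplified sketch; the actual merge logic inside DecoupledClass may differ in details such as tie-breaking and repeated merging):

def merge_invalid_cohorts(cohorts, is_valid):
    # cohorts: dict mapping cohort name -> DataFrame with that cohort's rows
    # is_valid: callable implementing the validity rule shown earlier
    merged = dict(cohorts)
    for name in list(merged):
        if name not in merged or is_valid(merged[name]):
            continue
        # Merge the invalid cohort into the smallest of the other cohorts
        partner = min((c for c in merged if c != name), key=lambda c: len(merged[c]))
        merged[partner] = pd.concat([merged[partner], merged[name]])
        del merged[name]
    return merged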
[9]:
preprocessing = [dp.BasicImputer(verbose=False), dp.DataMinMaxScaler(verbose=False), dp.EncoderOHE(verbose=False)]
dec_class = DecoupledClass(
    cohort_col=["country"],
    min_cohort_pct=0.15,
    minority_min_rate=0.15,
    transform_pipe=preprocessing,
    estimator=get_model()
)
dec_class.fit(X_train, y_train)
dec_class.print_cohorts()
FINAL COHORTS
cohort_0:
    Size: 5940
    Query:
        (`country` == "A")
    Value Counts:
        1: 3612 (60.81%)
        0: 2328 (39.19%)
    Invalid: False

cohort_1:
    Size: 1060
    Query:
        ((`country` == "B")) or ((`country` == "C"))
    Value Counts:
        0: 578 (54.53%)
        1: 482 (45.47%)
    Invalid: False
Using the “country” column above created 2 cohorts. Initially, we might have expected 3, one for each unique value (“A”, “B”, “C”); however, the DecoupledClass found invalid cohorts and merged the (country=="B") and (country=="C") cohorts into a single valid one. Now let’s see how this setup performs over the test set.
[11]:
th_dict = dec_class.get_threasholds_dict()
pred = dec_class.predict_proba(X_test)
fetch_cohort_results(X_test, y_test, pred, cohort_def=dec_class, fixed_th=th_dict)
[11]:
| | cohort | cht_query | roc | precision | recall | f1 | accuracy | threshold | num_pos | %_pos | cht_size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | all | all | 0.946290 | 0.927526 | 0.919864 | 0.923168 | 0.926000 | 0.500000 | 1822 | 0.607333 | 3000 |
| 1 | cohort_0 | (`country` == "A") | 0.945985 | 0.930449 | 0.916938 | 0.922738 | 0.927734 | 0.525770 | 1643 | 0.641797 | 2560 |
| 2 | cohort_1 | ((`country` == "B")) or ((`country` == "C")) | 0.941379 | 0.919319 | 0.917429 | 0.918328 | 0.920455 | 0.499605 | 183 | 0.415909 | 440 |
Comparing the metrics of these merged cohorts to the baseline, merging the invalid cohort (country=="B") with the cohort (country=="C") appears to have a positive effect on (country=="B") in this case, slightly improving its label distribution and, with it, its accuracy.
Transfer Learning
Next, let’s explore the other technique that distinguishes the decoupled classifiers: transfer learning.
In this approach, when the fit() method encounters an invalid cohort, it borrows data from other cohorts (the “out-data”) and down-weights those borrowed instances to create a valid cohort.
To use transfer learning with the DecoupledClass module, we simply need to pass a theta value. theta can be a fixed float, a list of floats (the best value in the list is found using cross-validation), or the boolean True to use a default list of floats optimized through cross-validation. If you’d like to learn more about how the out-data is selected and how to use the different types of theta with transfer learning, see the tutorial notebook for the decoupled classifiers.
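The core idea behind theta can be sketched with instance weights (an illustrative simplification assuming an estimator that accepts sample_weight, such as LGBMClassifier; this is not the DecoupledClass internals, and the function name is hypothetical):

def fit_with_out_data(estimator, X_in, y_in, X_out, y_out, theta):
    # In-cohort instances keep full weight; out-data instances are
    # down-weighted by theta (0 < theta < 1)
    X_all = pd.concat([X_in, X_out])
    y_all = pd.concat([y_in, y_out])
    weights = np.concatenate([np.ones(len(X_in)), np.full(len(X_out), theta)])
    estimator.fit(X_all, y_all, sample_weight=weights)
    return estimator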
Let’s take a look at how transfer learning handles the invalid cohorts in our case:
[12]:
preprocessing = [dp.BasicImputer(verbose=False), dp.DataMinMaxScaler(verbose=False), dp.EncoderOHE(verbose=False)]
dec_class = DecoupledClass(
    cohort_col=["country"],
    theta=True,
    min_fold_size_theta=5,
    min_cohort_pct=0.2,
    minority_min_rate=0.15,
    transform_pipe=preprocessing,
    estimator=get_model()
)
dec_class.fit(X_train, y_train)
dec_class.print_cohorts()
FINAL COHORTS
cohort_0:
    Size: 5940
    Query:
        (`country` == "A")
    Value Counts:
        1: 3612 (60.81%)
        0: 2328 (39.19%)
    Invalid: False

cohort_1:
    Size: 347
    Query:
        (`country` == "B")
    Value Counts:
        0: 190 (54.76%)
        1: 157 (45.24%)
    Invalid: True
    Cohorts used as outside data: ['cohort_0', 'cohort_2']
    Theta = 0.4

cohort_2:
    Size: 713
    Query:
        (`country` == "C")
    Value Counts:
        0: 388 (54.42%)
        1: 325 (45.58%)
    Invalid: True
    Cohorts used as outside data: ['cohort_0', 'cohort_1']
    Theta = 0.6
[13]:
th_dict = dec_class.get_threasholds_dict()
pred = dec_class.predict_proba(X_test)
fetch_cohort_results(X_test, y_test, pred, cohort_def=dec_class, fixed_th=th_dict)
[13]:
| | cohort | cht_query | roc | precision | recall | f1 | accuracy | threshold | num_pos | %_pos | cht_size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | all | all | 0.947784 | 0.930492 | 0.921226 | 0.925124 | 0.928000 | 0.500000 | 1834 | 0.611333 | 3000 |
| 1 | cohort_0 | (`country` == "A") | 0.945985 | 0.930449 | 0.916938 | 0.922738 | 0.927734 | 0.525770 | 1643 | 0.641797 | 2560 |
| 2 | cohort_1 | (`country` == "B") | 0.953976 | 0.911828 | 0.923049 | 0.916697 | 0.921569 | 0.475236 | 60 | 0.392157 | 153 |
| 3 | cohort_2 | (`country` == "C") | 0.945566 | 0.922287 | 0.923322 | 0.922759 | 0.923345 | 0.594635 | 132 | 0.459930 | 287 |
Comparing these results to the metrics of the baseline’s valid and invalid cohorts, transfer learning appears to have made a positive, though less pronounced, difference for cohort (country=="B") in this case.
Optimizing Fairness Metrics
Lastly, the DecoupledClass offers the option to optimize all models according to a fairness metric. The available fairness losses are “balanced”, “num_parity”, and “dem_parity”; for a more detailed look at these metrics and how to use them, see the tutorial notebook for the decoupled classifiers.
In this case, we’ll explore this feature over our cohorts using the “dem_parity” metric.
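The “dem_parity” loss is based on demographic parity, which pushes the positive-prediction rate to be similar across cohorts. One simple way to quantify a violation is the spread of those rates (an illustrative sketch, not the exact loss the library optimizes; the function name is hypothetical):

def dem_parity_gap(y_pred, cohort_labels):
    # y_pred: hard 0/1 predictions; cohort_labels: cohort of each instance.
    # Positive-prediction rate per cohort, aligned by position
    rates = pd.Series(y_pred).groupby(pd.Series(cohort_labels)).mean()
    # Gap between the most- and least-favored cohorts (0 = perfect parity)
    return rates.max() - rates.min()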
[14]:
preprocessing = [dp.BasicImputer(verbose=False), dp.DataMinMaxScaler(verbose=False), dp.EncoderOHE(verbose=False)]
dec_class = DecoupledClass(
    cohort_col=["country"],
    transform_pipe=preprocessing,
    estimator=get_model(),
    minority_min_rate=0.2,
    min_cohort_pct=0.15,
    theta=False,
    fairness_loss="dem_parity",
    lambda_coef=0.8,
    max_joint_loss_time=2000
)
dec_class.fit(X_train, y_train)
pred = dec_class.predict_proba(X_test)
fetch_cohort_results(X_test, y_test, pred, cohort_def=dec_class, fixed_th=True)
[14]:
| | cohort | cht_query | roc | precision | recall | f1 | accuracy | threshold | num_pos | %_pos | cht_size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | all | all | 0.946290 | 0.927526 | 0.919864 | 0.923168 | 0.926000 | 0.500000 | 1822 | 0.607333 | 3000 |
| 1 | cohort_0 | (`country` == "A") | 0.945985 | 0.886854 | 0.902609 | 0.891498 | 0.894531 | 0.829787 | 1420 | 0.554688 | 2560 |
| 2 | cohort_1 | ((`country` == "B")) or ((`country` == "C")) | 0.941379 | 0.845511 | 0.852320 | 0.838335 | 0.838636 | 0.151803 | 235 | 0.534091 | 440 |
Using fairness optimization in addition to merging cohorts (country=="B") and (country=="C") yields a noticeably more even positive-prediction rate across cohorts (%_pos of 0.55 vs. 0.53, compared to 0.64 vs. 0.42 with merging alone). However, one should pay attention to the trade-off between accuracy and label distribution when borrowing training data from other cohorts, as the merging capability of this tool does: here, accuracy drops for both cohorts.