Cohort Case Study 1 - Using RAI ErrorAnalysis

We’ll repeat the experiment demonstrated in Case 1, but this time we’ll be using the Error Analysis Dashboard, available in the Responsible AI Toolbox project) to determine which cohorts of data this model performs worse on. To install this library, use the following command:

> pip install raiwidgets

To see how to transfer cohorts between raimitigations and raiwidgets, check our tutorial on saving cohorts.

In the following cells, we’ll create the toy dataset once more, but this time we won’t add any missing values into the dataset.

[16]:

import random
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.pipeline import Pipeline
from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression

from raimitigations.utils import split_data
import raimitigations.dataprocessing as dp
from raimitigations.cohort import (
    CohortDefinition,
    CohortManager,
    fetch_cohort_results
)

SEED = 51
#SEED = None
np.random.seed(SEED)
random.seed(SEED)

def _create_country_df(samples: int, sectors: dict, country_name: str):
    df = None
    for key in sectors.keys():
        size = int(samples * sectors[key]["prob_occur"])
        invest = np.random.uniform(low=sectors[key]["min"], high=sectors[key]["max"], size=size)
        min_invest = min(invest)
        max_invest = max(invest)
        range_invest = max_invest - min_invest
        bankrupt_th = sectors[key]["prob_success"] * range_invest
        inverted_behavior = sectors[key]["inverted_behavior"]
        bankrupt = []
        for i in range(invest.shape[0]):
            inst_class = 1
            if invest[i] > bankrupt_th:
                inst_class = 0
            if inverted_behavior:
                inst_class = int(not inst_class)
            bankrupt.append(inst_class)
        noise_ind = np.random.choice(range(size), int(size*0.05), replace=False)
        for ind in noise_ind:
            bankrupt[ind] = int(not bankrupt[ind])

        country_col = [country_name for _ in range(size)]
        sector_col = [key for _ in range(size)]
        df_sector = pd.DataFrame({
            "investment":invest,
            "sector":sector_col,
            "country":country_col,
            "bankrupt":bankrupt
        })

        if df is None:
            df = df_sector
        else:
            df = pd.concat([df, df_sector], axis=0)
    return df

def create_df_multiple_distributions(samples: list):
    sectors_c1 = {
        "s1": {"prob_occur":0.5, "prob_success":0.99, "inverted_behavior":False, "min":2e6, "max":1e7},
        "s2": {"prob_occur":0.1, "prob_success":0.2, "inverted_behavior":False, "min":1e7, "max":1.5e9},
        "s3": {"prob_occur":0.1, "prob_success":0.9, "inverted_behavior":True, "min":1e9, "max":1e10},
        "s4": {"prob_occur":0.3, "prob_success":0.7, "inverted_behavior":False, "min":4e10, "max":9e13},
    }
    sectors_c2 = {
        "s1": {"prob_occur":0.1, "prob_success":0.6, "inverted_behavior":True, "min":1e3, "max":5e3},
        "s2": {"prob_occur":0.3, "prob_success":0.9, "inverted_behavior":False, "min":1e5, "max":1.5e6},
        "s3": {"prob_occur":0.5, "prob_success":0.3, "inverted_behavior":False, "min":5e4, "max":3e5},
        "s4": {"prob_occur":0.1, "prob_success":0.8, "inverted_behavior":False, "min":1e6, "max":1e7},
    }
    sectors_c3 = {
        "s1": {"prob_occur":0.3, "prob_success":0.9, "inverted_behavior":False, "min":3e2, "max":6e2},
        "s2": {"prob_occur":0.6, "prob_success":0.7, "inverted_behavior":False, "min":5e3, "max":9e3},
        "s3": {"prob_occur":0.07, "prob_success":0.6, "inverted_behavior":False, "min":4e3, "max":2e4},
        "s4": {"prob_occur":0.03, "prob_success":0.1, "inverted_behavior":True, "min":6e5, "max":1.3e6},
    }
    countries = {
        "A":{"sectors":sectors_c1, "sample_rate":0.85},
        "B":{"sectors":sectors_c2, "sample_rate":0.05},
        "C":{"sectors":sectors_c2, "sample_rate":0.1}
    }
    df = None
    for key in countries.keys():
        n_sample = int(samples * countries[key]["sample_rate"])
        df_c = _create_country_df(n_sample, countries[key]["sectors"], key)
        if df is None:
            df = df_c
        else:
            df = pd.concat([df, df_c], axis=0)

    idx = pd.Index([i for i in range(df.shape[0])])
    df = df.set_index(idx)
    return df

[17]:

df = create_df_multiple_distributions(10000)
df

[17]:

	investment	sector	country	bankrupt
0	7.405851e+06	s1	A	1
1	2.357697e+06	s1	A	1
2	4.746429e+06	s1	A	1
3	7.152158e+06	s1	A	1
4	4.273704e+06	s1	A	1
...	...	...	...	...
9995	5.628480e+06	s4	C	1
9996	4.907502e+06	s4	C	1
9997	9.941936e+06	s4	C	0
9998	5.940903e+06	s4	C	1
9999	8.602032e+06	s4	C	0

10000 rows × 4 columns

Single Pipeline for the entire dataset

We’ll now use our baseline approach, where we use a single pipeline to be trained using the entire dataset.

[18]:

label_col = "bankrupt"
df_train, df_test = split_data(df, label=label_col, test_size=0.3, full_df=True)

X_train = df_train.drop(columns=[label_col])
y_train = df_train[label_col]
X_test = df_test.drop(columns=[label_col])
y_test = df_test[label_col]

[19]:

def get_model():
    #model = LGBMClassifier(random_state=SEED)
    model = LogisticRegression(random_state=SEED)
    return model

[20]:

pipe = Pipeline([
            ("imputer", dp.BasicImputer(verbose=False)),
            ("scaler", dp.DataMinMaxScaler(verbose=False)),
            ("encoder", dp.EncoderOHE(verbose=False)),
            ("estimator", get_model()),
        ])
model = pipe.fit(X_train, y_train)

With the pipeline fitted, we can create the RAI Dashboard:

[21]:

from raiwidgets import ResponsibleAIDashboard
from responsibleai import RAIInsights
from raiwidgets.cohort import Cohort, CohortFilter, CohortFilterMethods

rai_insights = RAIInsights(model, df_train, df_test, label_col, 'classification',
                           categorical_features=pipe['encoder'].get_encoded_columns())

# Error Analysis
rai_insights.error_analysis.add()

rai_insights.compute()

Using categorical_feature in Dataset.
Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.

Finally, we use the following command to create a dashboard, which will be available in the localhost address printed below. We encourage users to interact with this dashboard and see all of the insights this dashboard offers.

[22]:

ResponsibleAIDashboard(rai_insights)

ResponsibleAI started at http://localhost:5002

[22]:

<raiwidgets.responsibleai_dashboard.ResponsibleAIDashboard at 0x7fc60ff64f70>

For now, our goal is to analyze the error distribution for the sector and country cohorts. In the dashboard, we’ll click on the “Heat map” option and add the sector column as the “Feature 1”, and the country column as “Feature 2”.

screen1

As we can see, the error is not evenly distributed among the cohorts. Some cohorts have high rates of errors, while other cohorts perform very well.

Different Pipelines for each cohort

We’ll now create a separate pipeline for each of the cohorts built according to the sector and country columns.

[23]:

cht_manager = CohortManager(
    transform_pipe=[
        dp.BasicImputer(verbose=False),
        dp.DataMinMaxScaler(verbose=False),
        dp.EncoderOHE(verbose=False),
        get_model()
    ],
    cohort_col=["sector", "country"]
)
cht_manager.fit(X_train, y_train)

rai_insights = RAIInsights(cht_manager, df_train, df_test, label_col, 'classification',
                           categorical_features=pipe['encoder'].get_encoded_columns())

# Error Analysis
rai_insights.error_analysis.add()

rai_insights.compute()

Using categorical_feature in Dataset.
Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.
Passing a set as an indexer is deprecated and will raise in a future version. Use a list instead.

[24]:

ResponsibleAIDashboard(rai_insights)

ResponsibleAI started at http://localhost:5003

[24]:

<raiwidgets.responsibleai_dashboard.ResponsibleAIDashboard at 0x7fc60fb3a9d0>

We’ll repeat the same analysis done for the baseline pipeline.

screen2

As we can see, the error is now evenly distributed among the cohorts.