Data Balance Analysis using the Adult Census Income dataset

In this example, we will conduct Data Balance Analysis, which consists of running three groups of measures, on the Adult Census Income dataset to determine how well features and feature values are represented in the dataset.

This dataset can be used to predict whether annual income exceeds $50,000/year or not based on demographic data from the 1994 U.S. Census. The dataset we’re reading contains 32,561 rows and 14 columns/features.

More info on the dataset here

[1]:
import sys
sys.path.append('../../notebooks')

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from download import download_datasets
[2]:
data_dir = "../../datasets/"
download_datasets(data_dir)

df = pd.read_csv(data_dir + "AdultCensusIncome.csv", skipinitialspace=True)

label_col = "income"

# convert to 0 and 1 encoding
df[label_col] = df[label_col].apply(lambda x: 0 if x == "<=50K" else 1)

First we can do some preliminary EDA on the data. We will look specifically at a couple of columns of interest and see how often each of their feature values shows up in the data.

[3]:
df["race"].value_counts().plot(kind="bar")
plt.xlabel("Race", labelpad=14)
plt.ylabel("Count", labelpad=14)
plt.title("Distribution of Race in the Data", y=1.02)
[3]:
Text(0.5, 1.02, 'Distribution of Race in the Data')
../../_images/notebooks_databalanceanalysis_data_balance_census_4_1.png
[4]:
df["sex"].value_counts().plot(kind="bar")
plt.xlabel("Sex", labelpad=14)
plt.ylabel("Count", labelpad=14)
plt.title("Distribution of Sex in the Data", y=1.02)
[4]:
Text(0.5, 1.02, 'Distribution of Sex in the Data')
../../_images/notebooks_databalanceanalysis_data_balance_census_5_1.png
[5]:
df["marital-status"].value_counts().plot(kind="bar")
plt.xlabel("Marital Status", labelpad=14)
plt.ylabel("Count", labelpad=14)
plt.title("Distribution of Marital Status in the Data", y=1.02)
[5]:
Text(0.5, 1.02, 'Distribution of Marital Status in the Data')
../../_images/notebooks_databalanceanalysis_data_balance_census_6_1.png

We will now select a few columns to focus on for this analysis. In this case we will focus specifically on race and sex.

[6]:
cols_of_interest = ["race", "sex"]
[7]:
from raimitigations.databalanceanalysis import (
    FeatureBalanceMeasure,
    AggregateBalanceMeasure,
    DistributionBalanceMeasure,
)
feature_measures = FeatureBalanceMeasure(cols_of_interest, label_col)

# compute the feature balance measures for every column of interest
feat_measures = feature_measures.measures(df)
[8]:
feat_measures
[8]:
    ClassA              ClassB              FeatureName         dp        pmi       sdc        ji         krc       llr     t_test  ttest_pvalue
 0  White               Black               race          0.131980   0.725320  0.164302  0.212784   -5.412507  2.911817  -1.354903      0.123458
 1  White               Asian-Pac-Islander  race         -0.009780  -0.037512  0.168515  0.217291   -8.742759  3.249841  -2.582092      0.030596
 2  White               Amer-Indian-Eskimo  race          0.140104   0.793149  0.195180  0.244934  -14.433135  5.286723  -4.895847      0.004034
 3  White               Other               race          0.163609   1.020118  0.196514  0.246278  -15.322864  5.651366  -5.256855      0.003134
 4  Black               Asian-Pac-Islander  race         -0.141760  -0.762832  0.004213  0.004507   -3.330252  0.338024  -1.227189      0.143519
 5  Black               Amer-Indian-Eskimo  race          0.008124   0.067829  0.030878  0.032150   -9.020628  2.374906  -3.540944      0.011997
 6  Black               Other               race          0.031629   0.294798  0.032212  0.033494   -9.910357  2.739549  -3.901952      0.008756
 7  Asian-Pac-Islander  Amer-Indian-Eskimo  race          0.149884   0.830661  0.026665  0.027642   -5.690376  2.036882  -2.313755      0.040847
 8  Asian-Pac-Islander  Other               race          0.173389   1.057630  0.027999  0.028987   -6.580105  2.401525  -2.674763      0.027765
 9  Amer-Indian-Eskimo  Other               race          0.023505   0.226969  0.001334  0.001344   -0.889729  0.364643  -0.361008      0.368176
 0  Male                Female              sex           0.196276   1.027159  0.161486  0.222413   -0.880102  1.731753  -0.372541      0.386486
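
For intuition about the dp column: demographic parity between two classes is commonly defined as the difference in their positive-label rates, P(Y=1 | ClassA) - P(Y=1 | ClassB). The following is a minimal sketch of that definition on a small made-up sample, assuming that convention. The helper names `positive_rates` and `demographic_parity` and the toy records are illustrative and not part of raimitigations.

```python
from collections import defaultdict


def positive_rates(records):
    """Return the share of positive labels (label == 1) for each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, label in records:
        totals[group] += 1
        positives[group] += label
    return {group: positives[group] / totals[group] for group in totals}


def demographic_parity(records, class_a, class_b):
    """Difference in positive-label rates: P(Y=1 | class_a) - P(Y=1 | class_b)."""
    rates = positive_rates(records)
    return rates[class_a] - rates[class_b]


# Hypothetical toy sample of (sex, income label) pairs, not taken from the dataset.
toy_records = [
    ("Male", 1), ("Male", 1), ("Male", 1), ("Male", 0),
    ("Female", 1), ("Female", 0), ("Female", 0), ("Female", 0),
]

dp = demographic_parity(toy_records, "Male", "Female")
print(dp)  # 0.75 - 0.25 = 0.5
```

Under this definition, a positive value means ClassA receives the positive label more often than ClassB, and swapping the two classes flips the sign.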

We will visualize the demographic parity values using a heat map to see which pairs of feature values differ most in their label outcomes.

[9]:
races = df["race"].unique()
race_dp_values = feat_measures[feat_measures["FeatureName"] == "race"][
    ["ClassA", "ClassB", "dp"]
]
race_dp_array = np.zeros((len(races), len(races)))

for _, row in race_dp_values.iterrows():
    class_a = row["ClassA"]
    class_b = row["ClassB"]
    # locate each class's position along the heat-map axes
    i = np.where(races == class_a)[0][0]
    j = np.where(races == class_b)[0][0]
    dp_value = round(row["dp"], 2)
    race_dp_array[i, j] = dp_value
    # fill the mirrored cell with the opposite sign
    race_dp_array[j, i] = -1 * dp_value

colormap = "RdBu"
dp_min, dp_max = -1.0, 1.0

fig, ax = plt.subplots()
im = ax.imshow(race_dp_array, vmin=dp_min, vmax=dp_max, cmap=colormap)

cbar = ax.figure.colorbar(im, ax=ax)
cbar.ax.set_ylabel("Demographic Parity", rotation=-90, va="bottom")

ax.set_xticks(np.arange(len(races)))
ax.set_yticks(np.arange(len(races)))
ax.set_xticklabels(races)
ax.set_yticklabels(races)

plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")

for i in range(len(races)):
    for j in range(len(races)):
        text = ax.text(j, i, race_dp_array[i, j], ha="center", va="center", color="k")

ax.set_title("Demographic Parity of Races in Adult Dataset")
fig.tight_layout()
plt.show()
../../_images/notebooks_databalanceanalysis_data_balance_census_12_0.png

Next we will look at the Distribution Balance Measures. These measures compare our data to a reference distribution, such as the uniform distribution, and are calculated separately for each specified sensitive column. The label column is not needed to calculate them.

[10]:
dist_measures = DistributionBalanceMeasure(cols_of_interest).measures(df)

race_measures = dist_measures[dist_measures["FeatureName"] == "race"].iloc[0]
sex_measures = dist_measures[dist_measures["FeatureName"] == "sex"].iloc[0]
[11]:
race_measures
[11]:
FeatureName                     race
kl_divergence               1.055793
js_dist                       0.5104
wasserstein_dist            0.261709
inf_norm_dist               0.654274
total_variation_dist        0.654274
chi_sq_p_value                   0.0
chi_sq_stat             87941.889193
Name: 0, dtype: object
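
To make the comparison against a uniform reference concrete, here is a minimal sketch of three of these measures using their textbook definitions on hypothetical category counts. The function names and the counts are illustrative assumptions, and the library's implementation may differ in details such as the logarithm base.

```python
import math


def to_probabilities(counts):
    """Normalise raw category counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """KL(p || q) with natural logarithms; zero-probability terms of p contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def total_variation_distance(p, q):
    """Half of the L1 distance between the two distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def inf_norm_distance(p, q):
    """Largest absolute difference in probability for any single category."""
    return max(abs(pi - qi) for pi, qi in zip(p, q))


# Hypothetical counts for a four-valued feature (not from the census data).
counts = [70, 10, 10, 10]
observed = to_probabilities(counts)
uniform = [1 / len(counts)] * len(counts)

print(kl_divergence(observed, uniform))
print(total_variation_distance(observed, uniform))
print(inf_norm_distance(observed, uniform))
```

In this toy case only one category sits above its uniform share, so the total variation and infinity-norm distances coincide. That is consistent with the equal values reported for race above.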
[12]:
measures_of_interest = [
    "kl_divergence",
    "js_dist",
    "inf_norm_dist",
    "total_variation_dist",
    "wasserstein_dist",
]
race_array = [round(race_measures[measure], 4) for measure in measures_of_interest]
sex_array = [round(sex_measures[measure], 4) for measure in measures_of_interest]

x = np.arange(len(measures_of_interest))
width = 0.35

fig, ax = plt.subplots()
rects1 = ax.bar(x - width / 2, race_array, width, label="Race")
rects2 = ax.bar(x + width / 2, sex_array, width, label="Sex")

ax.set_xlabel("Measure")
ax.set_ylabel("Value")
ax.set_title("Distribution Balance Measures of Sex and Race in Adult Dataset")
ax.set_xticks(x)
ax.set_xticklabels(measures_of_interest)
ax.legend()

plt.setp(ax.get_xticklabels(), rotation=20, ha="right", rotation_mode="default")


def autolabel(rects):
    for rect in rects:
        height = rect.get_height()
        ax.annotate(
            "{}".format(height),
            xy=(rect.get_x() + rect.get_width() / 2, height),
            xytext=(0, 1),  # 1 point vertical offset
            textcoords="offset points",
            ha="center",
            va="bottom",
        )


autolabel(rects1)
autolabel(rects2)

fig.tight_layout()

plt.show()
../../_images/notebooks_databalanceanalysis_data_balance_census_16_0.png

Interpret Distribution Balance Measures

Race has a JS Distance of 0.5104 while Sex has a JS Distance of 0.1217.

Knowing that JS Distance lies in [0, 1], where 0 means a perfectly balanced distribution, we can tell that:

  • There is a larger disparity between various races than between various sexes in our dataset.

  • Race is nowhere close to a perfectly balanced distribution (i.e. some races are seen a lot more than others in our dataset).

  • Sex is fairly close to a perfectly balanced distribution.
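
As a rough illustration of the measure being interpreted here, the sketch below computes the Jensen-Shannon distance between an observed distribution and the uniform distribution, using the standard definition (the square root of the Jensen-Shannon divergence). It assumes natural logarithms, for which the distance ranges from 0 to sqrt(ln 2) ≈ 0.83; with base-2 logarithms the upper bound is exactly 1. The function names and toy distributions are illustrative, not the library's code.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) with natural logarithms; terms where p is zero contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_distance(p, q):
    """Jensen-Shannon distance: square root of the mean KL divergence to the mixture."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(divergence, 0.0))


uniform = [0.5, 0.5]
balanced = [0.5, 0.5]     # hypothetical perfectly balanced feature
skewed = [0.9, 0.1]       # hypothetical imbalanced feature
degenerate = [1.0, 0.0]   # hypothetical feature with a single observed value

print(js_distance(balanced, uniform))  # 0.0
print(js_distance(skewed, uniform))
print(js_distance(degenerate, uniform))
```

The distance is zero only when the observed distribution matches the reference, and it grows as the observed distribution concentrates on fewer values.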

Calculate Aggregate Balance Measures

Aggregate Balance Measures allow us to obtain a higher-level notion of inequality. They are calculated on the global set of sensitive columns and don’t use the label column.

These measures look at the distribution of records across all combinations of sensitive columns. For example, if Sex and Race are sensitive columns, they try to quantify imbalance across all combinations: (Male, Black), (Female, White), (Male, Asian-Pac-Islander), etc.

[13]:
agg_measures = AggregateBalanceMeasure(cols_of_interest)
agg_measures.measures(df)
[13]:
   theil_l_index  theil_t_index  atkinson_index
0        1.50467       1.120823        0.777909
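
To show what these three numbers summarize, the sketch below applies the textbook formulas for the Theil L index, Theil T index, and Atkinson index (with inequality-aversion parameter 1) to the record counts of each sensitive-value combination. The choice of parameter and the toy counts are assumptions for illustration, and the function names are not part of raimitigations.

```python
import math


def theil_l_index(counts):
    """Mean log deviation: average of ln(mean / x) over all groups."""
    mean = sum(counts) / len(counts)
    return sum(math.log(mean / x) for x in counts) / len(counts)


def theil_t_index(counts):
    """Average of (x / mean) * ln(x / mean) over all groups."""
    mean = sum(counts) / len(counts)
    return sum((x / mean) * math.log(x / mean) for x in counts) / len(counts)


def atkinson_index(counts):
    """Atkinson index with epsilon = 1: one minus geometric mean over arithmetic mean."""
    mean = sum(counts) / len(counts)
    log_geometric_mean = sum(math.log(x) for x in counts) / len(counts)
    return 1 - math.exp(log_geometric_mean) / mean


# Hypothetical record counts for four (sex, race) combinations; all must be positive.
balanced_counts = [250, 250, 250, 250]
skewed_counts = [850, 100, 40, 10]

for name, counts in [("balanced", balanced_counts), ("skewed", skewed_counts)]:
    print(name, theil_l_index(counts), theil_t_index(counts), atkinson_index(counts))
```

All three indices are zero for perfectly balanced counts and grow with imbalance. With epsilon = 1, the Atkinson index equals 1 - exp(-Theil L), a relationship the reported values also satisfy: 1 - exp(-1.50467) ≈ 0.7779.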

Summary

Throughout the course of this sample notebook, we have:

  • Chosen “Race” and “Sex” as columns of interest in the Adult Census Income dataset.

  • Done preliminary analysis on our dataset.

  • Run the three groups of measures that compose our Data Balance Analysis:

  • Feature Balance Measures

    • Calculated Feature Balance Measures to see that the highest Demographic Parity is in “Sex”: Males see >50k income much more than Females.

    • Visualized Demographic Parity of Races to see that Asian-Pac-Islander sees >50k income much more than Other, in addition to other race combinations.

  • Distribution Balance Measures

    • Calculated Distribution Balance Measures to see that “Sex” is much closer to a perfectly balanced distribution than “Race”.

    • Visualized various distribution balance measures to compare their values for “Race” and “Sex”.

  • Aggregate Balance Measures

    • Calculated Aggregate Balance Measures to see that we need to forgo 77.79% of data points to have a perfectly balanced dataset. We identified that our dataset is leaning towards maximum inequality, and that we should take actionable steps to:

    • Upsample data points where the feature value is barely observed.

    • Downsample data points where the feature value is observed much more than others.
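
One simple way to act on these two steps is random resampling toward a common group size. The sketch below is a minimal illustration using only the standard library. The function `rebalance`, its `target_size` parameter, and the toy rows are hypothetical, and dedicated resampling tools may be preferable in practice.

```python
import random
from collections import defaultdict


def rebalance(rows, key, target_size, seed=0):
    """Resample rows so every value of `key` appears exactly `target_size` times.

    Groups larger than the target are downsampled without replacement;
    smaller groups are upsampled by drawing extra rows with replacement.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    balanced = []
    for members in groups.values():
        if len(members) >= target_size:
            balanced.extend(rng.sample(members, target_size))
        else:
            extra = rng.choices(members, k=target_size - len(members))
            balanced.extend(members + extra)
    return balanced


# Hypothetical toy rows: 8 records of one group and 2 of another.
rows = [{"sex": "Male"}] * 8 + [{"sex": "Female"}] * 2
balanced_rows = rebalance(rows, key="sex", target_size=4)
```

Upsampling by duplication adds no new information, so the balance measures improve while the underlying evidence for rare groups stays the same.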

In conclusion:

  • These measures provide an indicator of disparity in the data, allowing users to explore potential mitigations before proceeding to train.

  • Users can use these measures to set thresholds on their level of “tolerance” for data representation. Production pipelines can use these measures as a baseline for models that require frequent retraining on new data.

  • These measures can also be saved as key metadata for the model/service built and added to model cards or transparency notes, helping drive overall accountability for the ML service and its performance across different demographics or sensitive attributes.
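
As a sketch of the threshold idea, the hypothetical helper below compares a dictionary of computed measures against user-chosen upper limits and reports the ones that exceed them. The threshold values are arbitrary placeholders, and the measure values are copied from the outputs earlier in this notebook.

```python
def find_violations(measures, thresholds):
    """Return the measures whose value exceeds the configured upper limit."""
    return {
        name: value
        for name, value in measures.items()
        if name in thresholds and value > thresholds[name]
    }


# Values reported earlier for the "race" column and the aggregate measures.
measures = {"js_dist": 0.5104, "atkinson_index": 0.777909}

# Hypothetical tolerance levels; choose limits that suit your own application.
thresholds = {"js_dist": 0.2, "atkinson_index": 0.5}

violations = find_violations(measures, thresholds)
if violations:
    print("Measures above tolerance:", violations)
```

A pipeline could run a check of this kind on each new batch of training data and stop or alert when any measure drifts past its limit.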