Getting Started

This page provides an overview of the library, along with useful links to other parts of the documentation.

Encoder API

The Encoder API allows for ordinal or one-hot encoding of categorical features.

When is feature encoding a useful mitigation technique?

The Encoder API can be useful when a feature does not expose sufficient information about the task, either because (1) the semantic information of the feature content is hidden by the original encoding format, or because (2) the model may not have the capacity to interpret that semantic information.

Example

If a feature contains values such as {“agree”, “mostly agree”, “neutral”, “mostly disagree”, “disagree”}, the string interpretation (or alphabetical ordering) of these values cannot express the fact that “agree” ranks above “mostly agree”. In other cases, if the data contains a categorical feature with high cardinality but no inherent ordering between the feature values, the training algorithm may still assign an inappropriate ordering to the values.
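As a concrete illustration, the sketch below uses scikit-learn's OrdinalEncoder (rather than the library's own Encoder API) with a hypothetical opinion column, supplying an explicit category order so the encoding reflects the intended ranking instead of alphabetical order:

    # Illustrative sketch (not the library's own API): encoding an ordinal
    # survey feature with an explicit category order.
    import pandas as pd
    from sklearn.preprocessing import OrdinalEncoder

    df = pd.DataFrame({"opinion": ["agree", "neutral", "mostly disagree", "agree", "disagree"]})

    # Spell out the intended order so "disagree" < ... < "agree" instead of
    # letting the encoder fall back to alphabetical order.
    order = ["disagree", "mostly disagree", "neutral", "mostly agree", "agree"]
    encoder = OrdinalEncoder(categories=[order])
    df["opinion_encoded"] = encoder.fit_transform(df[["opinion"]]).ravel()
    print(df)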

Responsible AI tip about feature encoding

Although feature encoding is a generally useful technique in machine learning, it’s important to be aware that encoding can sometimes affect different data cohorts differently, which could result in fairness-related harms or reliability and safety failures. To illustrate, imagine you have two cohorts of interest: “non-immigrants” and “immigrants”. If the data contains the “country of birth” as a feature, and the value of that feature is mostly uniform within the “non-immigrant” cohort but highly variable across the “immigrant” cohort, then the wrong ordering interpretation will negatively impact the “immigrant” cohort more because there are more possible values of the “country of birth” feature.

Feature Selection

The Feature Selection API enables selecting a subset of features that are the most informative for the prediction task.

When is feature selection a useful mitigation technique?

Sometimes training datasets may contain features that either have very little information about the task or are redundant in the context of other existing features. Selecting the right feature subset may improve the predictive power of models, their generalization properties, and their inference time. Focusing only on a subset of features also helps practitioners in the process of model understanding and interpretation.
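For intuition, here is a minimal sketch of the idea using scikit-learn's SelectKBest on synthetic data; the library's Feature Selection API provides its own interface for this:

    # Conceptual sketch: keep the k features with the most mutual information
    # about the label, then inspect which ones were kept.
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, mutual_info_classif

    X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

    selector = SelectKBest(score_func=mutual_info_classif, k=5)
    X_selected = selector.fit_transform(X, y)
    print("selected feature indices:", selector.get_support(indices=True))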

Responsible AI tip about feature selection

It’s important to be aware that although feature selection is a generally useful machine-learning technique, it can sometimes affect various data cohorts differently, with the potential to result in fairness-related harms or reliability and safety failures. For example, it may be the case that within a particular cohort there exists full correlation between two features, but not with the rest of the data. In addition, if this cohort is also a minority group, the meaning and weight of a feature value can be drastically different.

Example

In the United States there are both private and public undergraduate schools, while in some countries all degree-granting schools are public. A university in the United States deciding which applicants to interview for graduate school uses the feature previous program type (meaning either private or public university). The university is interested in several location-based cohorts indicating where applicants did their recent undergraduate studies. However, a small group of applicants are from a country where all schools are public, so their “previous program type” is always set to “public”. The feature previous program type is therefore redundant for this cohort and not helpful to the prediction task of recommending whom to interview. Furthermore, this feature selection could be even more harmful if the model, due to existing correlations in the larger dataset, has learned a negative correlation between “public” undergraduate studies and acceptance rates in graduate school. For the graduate program, this may lead to harms of underrepresentation or even erasure of individuals from countries with only “public” education.

Imputers

The Imputer API enables a simple approach for replacing missing values across several columns with different parameters, simultaneously replacing them with the mean, median, most frequent, or a constant value in the dataset.

When is the Imputer API a useful mitigation technique?

Sometimes because of data collection practices, a given cohort may be missing data on a feature that is particularly helpful for prediction. This happens frequently when the training data comes from different sources of data collection (e.g., different hospitals collect different health indicators) or when the training data spans long periods of time, during which the data collection protocol may have changed.

Responsible AI tip about imputing value

It’s important to be aware that although imputing values is a generally useful machine-learning technique, it has the potential to result in fairness-related harms of over- or underrepresentation, which can impact quality of service or allocation of opportunities or resources, as well as reliability and safety.

It is recommended, for documentation and provenance purposes, to rename features after applying this mitigation so that the name conveys which values have been imputed and how.

To avoid overfitting, it is important that feature imputation for testing datasets is performed based on statistics (e.g., minimum, maximum, mean, frequency) that are retrieved from the training set only. This approach ensures no information from the other samples in the test set is used to improve the prediction on an individual test sample.
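A minimal sketch of that workflow with scikit-learn's SimpleImputer (the column name is hypothetical); note that the imputed column is also renamed, as recommended above:

    # Fit the imputer on the training split only, then reuse the training
    # statistics on the test split so no test-set information leaks in.
    import pandas as pd
    from sklearn.impute import SimpleImputer

    train = pd.DataFrame({"blood_pressure": [120.0, None, 135.0, 128.0]})
    test = pd.DataFrame({"blood_pressure": [None, 140.0]})

    imputer = SimpleImputer(strategy="mean")
    train["blood_pressure_mean_imputed"] = imputer.fit_transform(train[["blood_pressure"]]).ravel()
    # The test set reuses the training mean computed above.
    test["blood_pressure_mean_imputed"] = imputer.transform(test[["blood_pressure"]]).ravel()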

Sampling

The Sampling API enables data augmentation by rebalancing existing data or synthesizing new data.

When is the Sampling API a useful mitigation technique?

Sampling helps address data imbalance in a given class or feature, a common problem in machine learning.

Responsible AI tip about sampling

The problem of data imbalance is most commonly studied in the context of class imbalance. From the responsible AI perspective, however, the problem is much broader: feature-value imbalance may leave too little data for cohorts of interest, which in turn may lead to lower-quality predictions.

Example

Consider the task of predicting whether a house will sell for higher or lower than the asking price. Even when the class is balanced, there still may be feature imbalance for the geographic location because population densities vary in different areas. As such, if the goal is to improve model performance for areas with a lower population density, oversampling for this group may help the model to better represent these cohorts.
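A minimal sketch of this kind of cohort oversampling with sklearn.utils.resample (column names are hypothetical); the library's Sampling API provides richer rebalancing and synthesis options:

    # Oversample the low-density cohort so the model sees it as often as the
    # high-density cohort.
    import pandas as pd
    from sklearn.utils import resample

    df = pd.DataFrame({
        "area_type":  ["urban"] * 8 + ["rural"] * 2,
        "sold_above": [1, 0, 1, 1, 0, 1, 0, 1, 0, 1],
    })

    majority = df[df["area_type"] == "urban"]
    minority = df[df["area_type"] == "rural"]
    minority_upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=0)
    balanced = pd.concat([majority, minority_upsampled])
    print(balanced["area_type"].value_counts())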

Scalers

The Scaler API enables applying numerical scaling transformations to several features at the same time.

When is scaling feature values a useful mitigation technique?

In general, scaling feature values is important for training algorithms that compute distances between data samples based on several numerical features (e.g., KNNs, PCA). Because the semantic meaning and numeric range of different features can vary significantly, computing distances over scaled versions of those features is more meaningful.

Example

Consider training data with two numerical features: age and yearly wage. When computing distances across samples, the yearly wage feature will affect the distance far more than age, not because it is more important but because its range of values is much wider.
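A small numeric sketch of this effect, assuming the two features above and scikit-learn's StandardScaler:

    # Without scaling, the wage feature dominates the Euclidean distance;
    # after standardization both features contribute on comparable scales.
    import numpy as np
    from sklearn.preprocessing import StandardScaler

    X = np.array([[25, 40_000.0],
                  [55, 42_000.0],
                  [26, 90_000.0]])

    print(np.linalg.norm(X[0] - X[1]))  # ~2000: driven almost entirely by wage
    X_scaled = StandardScaler().fit_transform(X)
    print(np.linalg.norm(X_scaled[0] - X_scaled[1]))  # the age gap now dominates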

Scaling is also critical for the convergence of popular gradient-based optimization algorithms for neural networks, and it prevents fast saturation of activation functions (e.g., sigmoids).

Responsible AI tip about scalers

Note that scalers transform the feature values globally, meaning that they scale the feature based on all samples of the dataset. This may not always be the most fair or inclusive approach, depending on the use case.

For example, if a training dataset for predicting credit reliability combines data from several countries, individuals with a relatively high salary for their particular country may still fall in the lower-than-average range when minimum and maximum values for scaling are computed based on data from countries where salaries are a lot higher. This misinterpretation of their salary may then lead to a wrong prediction, potentially resulting in the withholding of opportunities and resources.

Similarly in the medical domain, people with different ancestry may have varied minimum and maximum values for specific disease indicators. Scaling globally could lead the algorithm to underdiagnose the disease of interest for some ancestry cohorts. Of course, depending on the capacity and non-linearity of the training algorithm, the algorithm itself may find other ways of circumventing such issues. Nevertheless, it may still be a good idea for AI practitioners to apply a more cohort-aware approach by scaling one cohort at a time.
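A minimal sketch of such a cohort-aware approach using plain pandas (column names and values are hypothetical):

    # Standardize a disease indicator within each ancestry cohort instead of
    # over the whole dataset.
    import pandas as pd

    df = pd.DataFrame({
        "ancestry":  ["A", "A", "A", "B", "B", "B"],
        "indicator": [1.0, 1.2, 1.4, 3.0, 3.5, 4.0],
    })

    def zscore(s: pd.Series) -> pd.Series:
        return (s - s.mean()) / s.std()

    df["indicator_global_scaled"] = zscore(df["indicator"])
    df["indicator_cohort_scaled"] = df.groupby("ancestry")["indicator"].transform(zscore)
    print(df)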

Data Balance Metrics

Aggregate measures

These measures look at the distribution of records across all value combinations of sensitive feature columns. For example, if sex and race are specified as sensitive features, the API tries to quantify imbalance across all combinations of the specified features (e.g., [Male, Black], [Female, White], [Male, Asian Pacific Islander]).

Measure: Atkinson index
Description: The Atkinson index presents the percentage of total income that a given society would have to forego in order to have more equal shares of income among its citizens. This measure depends on the degree of societal aversion to inequality (a theoretical parameter decided by the researcher), where a higher value entails greater social utility or willingness by individuals to accept smaller incomes in exchange for a more equal distribution. An important feature of the Atkinson index is that it can be decomposed into within-group and between-group inequality.
Interpretation: Range [0, 1]. 0 = perfect equality; 1 = maximum inequality. In this library, the quantity playing the role of income is the proportion of records for each combination of sensitive-column values.

Measure: Theil T index
Description: GE(1) = Theil T index, which is more sensitive to differences at the top of the distribution. The Theil index is a statistic used to measure economic inequality; it measures the entropic “distance” between the population and the “ideal” egalitarian state in which everyone has the same income.
Interpretation: If everyone has the same income, then T_T equals 0. If one person has all the income, then T_T equals ln(N). 0 means equal income; larger values mean a higher level of disproportion.

Measure: Theil L index
Description: GE(0) = Theil L index, which is more sensitive to differences at the lower end of the distribution. Theil L is the mean of the logarithm of (mean income)/(income i) over all the incomes included in the summation; it is also referred to as the mean log deviation measure. Because a transfer from a larger income to a smaller one will change the smaller income's ratio more than it changes the larger income's ratio, this index satisfies the transfer principle.
Interpretation: Same interpretation as the Theil T index.
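For reference, the standard definitions of these indices are shown below, where x_i is the number of records in the i-th feature-value combination, μ is their mean, N is the number of combinations, and ε is the inequality-aversion parameter:

    \begin{aligned}
    T_T &= \frac{1}{N}\sum_{i=1}^{N}\frac{x_i}{\mu}\,\ln\frac{x_i}{\mu}
      && \text{(Theil T, GE(1))} \\
    T_L &= \frac{1}{N}\sum_{i=1}^{N}\ln\frac{\mu}{x_i}
      && \text{(Theil L, GE(0))} \\
    A_{\varepsilon} &= 1-\frac{1}{\mu}\left(\frac{1}{N}\sum_{i=1}^{N}x_i^{\,1-\varepsilon}\right)^{1/(1-\varepsilon)},
      \quad \varepsilon \neq 1
      && \text{(Atkinson index)}
    \end{aligned}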

Distribution measures

These metrics compare the data with a reference distribution (currently only uniform distribution is supported). They are calculated per sensitive feature column and do not depend on the class label column.

Measure: KL divergence
Description: Kullback–Leibler (KL) divergence measures how one probability distribution differs from a second, reference probability distribution. It is the measure of the information gained when one revises one's beliefs from the prior probability distribution Q to the posterior probability distribution P; in other words, it is the amount of information lost when Q is used to approximate P.
Interpretation: Non-negative. 0 means P = Q.

Measure: JS distance
Description: The Jensen-Shannon (JS) distance measures the similarity between two probability distributions. It is the symmetrized and smoothed version of the Kullback–Leibler (KL) divergence and is the square root of the JS divergence.
Interpretation: Range [0, 1]. 0 means the distribution is identical to the (balanced) reference distribution.

Measure: Wasserstein distance
Description: This distance is also known as the Earth mover's distance (EMD), since it can be seen as the minimum amount of “work” required to transform u into v, where “work” is measured as the amount of distribution weight that must be moved, multiplied by the distance it has to be moved.
Interpretation: Non-negative. 0 means P = Q.

Measure: Infinite norm distance
Description: Also known as the Chebyshev distance or chessboard distance, this is the distance between two vectors that is the greatest of their differences along any coordinate dimension.
Interpretation: Non-negative. 0 means P = Q.

Measure: Total variation distance
Description: The total variation distance is equal to half the L1 (Manhattan) distance between the two distributions: take the difference between the two proportions in each category, add up the absolute values of all the differences, and then divide the sum by 2.
Interpretation: Non-negative. 0 means P = Q.

Measure: Chi-square test
Description: The chi-square test is used to test the null hypothesis that the categorical data has the given expected frequencies in each category.
Interpretation: The p-value quantifies the evidence against the null hypothesis that any difference between observed and expected frequencies is due to random chance.
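The sketch below computes these distribution measures for one sensitive column against the uniform reference, using numpy/scipy equivalents rather than the library's own API:

    import numpy as np
    from scipy.spatial.distance import jensenshannon
    from scipy.stats import chisquare, entropy, wasserstein_distance

    counts = np.array([50, 30, 20])        # records per feature value
    p = counts / counts.sum()              # observed distribution
    q = np.full_like(p, 1.0 / len(p))      # uniform reference

    print("KL divergence:     ", entropy(p, q))
    print("JS distance:       ", jensenshannon(p, q))
    print("Wasserstein:       ", wasserstein_distance(np.arange(len(p)), np.arange(len(p)), p, q))
    print("Infinite norm:     ", np.max(np.abs(p - q)))
    print("Total variation:   ", 0.5 * np.sum(np.abs(p - q)))
    print("Chi-square p-value:", chisquare(counts, counts.sum() * q).pvalue)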

Feature measures

These measures look at whether each combination of sensitive features receives the positive outcome (true prediction) at balanced probabilities. Many of these metrics were influenced by the paper Measuring Model Biases in the Absence of Ground Truth (Osman Aka, Ken Burke, Alex Bäuerle, Christina Greer, Margaret Mitchell).

Association metric: Statistical parity
Family: Fairness
Description: Each segment of a protected class (e.g., gender) should receive the positive outcome at equal rates.
Interpretation / Formula: Parity increases with proximity to 0. DP = P(Y=1 | A=“Male”) − P(Y=1 | A=“Female”)

Association metric: Pointwise mutual information (PMI), normalized PMI
Family: Entropy
Description: The PMI of a pair of feature values (e.g., Gender=Male and Gender=Female) quantifies the discrepancy between the probability of their coincidence, given their joint distribution and their individual distributions (assuming independence).
Interpretation / Formula: Range (normalized) [−1, 1]. −1 for no co-occurrences; 0 for co-occurrences at random; 1 for complete co-occurrence.

Association metric: Sorensen-Dice coefficient (SDC)
Family: Intersection over union
Description: The SDC is used to gauge the similarity of two samples and is related to the F1 score.
Interpretation / Formula: Equals twice the number of elements common to both sets divided by the sum of the number of elements in each set.

Association metric: Jaccard index
Family: Intersection over union
Description: Similar to the SDC, the Jaccard index gauges the similarity and diversity of sample sets.
Interpretation / Formula: Equals the size of the intersection divided by the size of the union of the sample sets.

Association metric: Kendall rank correlation
Family: Correlation and statistical tests
Description: This is used to measure the ordinal association between two measured quantities.
Interpretation / Formula: High when observations have a similar rank between the two variables and low when observations have a dissimilar rank.

Association metric: Log-likelihood ratio
Family: Correlation and statistical tests
Description: This metric calculates the degree to which the data supports one variable versus another. The log-likelihood ratio gives the probability of correctly predicting the label relative to the probability of incorrectly predicting the label.
Interpretation / Formula: If the likelihoods are similar, it should be close to 0.

Association metric: T-test
Family: Correlation and statistical tests
Description: The t-test is used to compare the means of two groups (pairwise).
Interpretation / Formula: The value that is assessed for statistical significance in the t-distribution.
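A small sketch of two of these measures, statistical parity and (normalized) PMI, computed directly from a toy dataset with hypothetical column names:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "gender": ["Male", "Male", "Male", "Female", "Female", "Female"],
        "label":  [1, 1, 0, 1, 0, 0],
    })

    # Statistical parity: difference in positive-outcome rates between segments.
    rates = df.groupby("gender")["label"].mean()
    print("statistical parity:", rates["Male"] - rates["Female"])

    # PMI between a feature value and the positive label:
    # pmi = log( P(value, label=1) / (P(value) * P(label=1)) ),
    # normalized here by -log P(value, label=1).
    p_xy = len(df[(df["gender"] == "Female") & (df["label"] == 1)]) / len(df)
    p_x = (df["gender"] == "Female").mean()
    p_y = (df["label"] == 1).mean()
    pmi = np.log(p_xy / (p_x * p_y))
    print("PMI:", pmi, "normalized PMI:", pmi / -np.log(p_xy))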

Cohort Management

The Cohort Management feature allows managing multiple cohorts using a simple interface. This is an important tool for guaranteeing fairness across different cohorts, as shown in the scenarios described here. The cohort.CohortManager allows the application of different data processing pipelines over each cohort, and therefore represents a powerful tool when dealing with sensitive cohorts.

Example: Imputing missing values for each cohort separately

Consider the following situation: a dataset describes several details of similar cars from a specific brand. The column price stores the price of a car model in US dollars, while the column country indicates the country where that price was observed. Due to differences in economy and local currency, the price of these models is expected to vary greatly based on the country column. Suppose now that we want to impute the missing values in the price column using the mean value of that column. Given that prices differ greatly across the country cohorts, this imputation approach will end up inserting a lot of noise into the price column. Instead, we could impute based on each cohort: compute the mean price value for each cohort and replace missing values with the mean of the cohort to which the instance belongs. This greatly reduces the noise inserted by the imputation method, and it can be easily achieved by using the cohort.CohortManager class.
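The snippet below sketches the idea with plain pandas and hypothetical column values; the cohort.CohortManager class automates this pattern for arbitrary processing pipelines:

    # Cohort-based mean imputation: fill missing prices with the mean of the
    # instance's own country cohort rather than the global mean.
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "country": ["US", "US", "US", "BR", "BR", "BR"],
        "price":   [30_000.0, np.nan, 32_000.0, 9_000.0, 10_000.0, np.nan],
    })

    # The global mean (~20,000) would distort both cohorts; the per-country
    # mean keeps imputed values consistent with each cohort.
    df["price"] = df["price"].fillna(df.groupby("country")["price"].transform("mean"))
    print(df)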

Cohort-based estimators

The Cohort module allows applying different mitigations to each cohort separately, as previously highlighted. But it goes beyond that: it also allows creating full pipelines, including an estimator, for each cohort, while using a familiar and easy-to-use interface. If we are faced with a dataset whose cohorts behave very differently from each other, we can create a custom pipeline for each cohort individually, which means that each pipeline is fitted separately on its cohort. This may help achieve fairer results, that is, performance for each cohort that is similar to that of the other cohorts.
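A minimal sketch of this idea with scikit-learn pipelines; the helper function and its parameters are illustrative, not the Cohort module's API:

    # Fit one pipeline per cohort; at prediction time each sample would be
    # routed to the pipeline of its own cohort.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    def fit_per_cohort(df: pd.DataFrame, cohort_col: str, feature_cols: list, label_col: str) -> dict:
        pipelines = {}
        for cohort, group in df.groupby(cohort_col):
            pipe = make_pipeline(StandardScaler(), LogisticRegression())
            pipe.fit(group[feature_cols], group[label_col])
            pipelines[cohort] = pipe
        return pipelines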

Tutorials and Examples

Check the Gallery page for some tutorials and examples of how to use the Cohort module.

  • The Tutorial - Cohort section has a set of tutorial notebooks that show all of the features implemented in the different classes inside the Cohort module, as well as when and how to use each one of them.

  • The Tutorial - Using the Cohort Module section presents a set of notebooks in which we analyze datasets that show behavioral differences between cohorts and use the Cohort module to create different pre-processing pipelines for each cohort, and in some cases even different estimators for each cohort.

Note that this module is more useful in scenarios where there are considerable behavioral differences between cohorts. If that is not the case, then applying mitigations and training a single estimator over the entire dataset might prove the best approach, instead of creating different pipelines for each cohort.

Decoupled Classifiers

This class implements techniques for learning different estimators (models) for different cohorts based on the approach presented in “Decoupled classifiers for group-fair and efficient machine learning.” (Cynthia Dwork, Nicole Immorlica, Adam Tauman Kalai, and Max Leiserson. Conference on fairness, accountability and transparency. PMLR, 2018). The approach searches and combines cohort-specific classifiers to optimize for different definitions of group fairness and can be used as a post-processing step on top of any model class. The current implementation in this library supports only binary classification and we welcome contributions that can extend these ideas for multi-class and regression problems.

The basic decoupling algorithm can be summarized in two steps:

  • A different family of classifiers is trained on each cohort of interest. The algorithm partitions the training data by cohort and learns a classifier for each cohort. Each cohort-specific trained classifier is then expanded into a family of potential classifiers by thresholding the classifier output at different values. For example, depending on which errors are most important to the application (e.g., false positives vs. false negatives for binary classification), thresholding the model prediction at different values of the model output (e.g., likelihood, softmax) will result in different classifiers. This step generates a whole family of classifiers per cohort.

  • Among the cohort-specific classifiers, the algorithm searches for one representative classifier for each cohort such that a joint loss is optimized. This step searches through all combinations of classifiers from the previous step to find the combination that best optimizes the joint loss across all cohorts. While there are different definitions of such a joint loss, this implementation currently supports the Balanced Loss, L1 loss, and Demographic Parity as examples of losses that focus on group fairness. More loss definitions are described in the longer version of the paper.
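The schematic sketch below illustrates the search in the second step under simplified assumptions (per-cohort score arrays and a single threshold per cohort); it is not the library's implementation:

    # Step 1 implicitly produces a family of thresholded classifiers per
    # cohort; step 2 searches all threshold combinations for the one that
    # minimizes a joint loss.
    from itertools import product
    import numpy as np

    def decoupled_thresholds(scores_by_cohort, labels_by_cohort, joint_loss,
                             thresholds=np.linspace(0.1, 0.9, 9)):
        """scores_by_cohort / labels_by_cohort: dicts of cohort -> 1-D arrays."""
        cohorts = list(scores_by_cohort)
        best_combo, best_loss = None, np.inf
        for combo in product(thresholds, repeat=len(cohorts)):
            preds = {c: (scores_by_cohort[c] >= t).astype(int)
                     for c, t in zip(cohorts, combo)}
            loss = joint_loss(preds, labels_by_cohort)
            if loss < best_loss:
                best_combo, best_loss = dict(zip(cohorts, combo)), loss
        return best_combo, best_loss

    # Example joint loss: average error rate plus a demographic-parity penalty.
    def joint_loss(preds, labels):
        errors = [np.mean(p != labels[c]) for c, p in preds.items()]
        pos_rates = [np.mean(p) for p in preds.values()]
        return np.mean(errors) + (max(pos_rates) - min(pos_rates))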

Get involved

In the future, we plan to integrate more functionalities around data and model-oriented mitigations. Some top-of-mind improvements for the team include bagging and boosting, better data synthesis, constrained optimizers, and handling data noise. If you would like to collaborate or contribute to any of these ideas, contact us at rai-toolbox@microsoft.com.