Getting Started

This page provides an overview of the library, along with useful links to other parts of the documentation.

Encoder API

The Encoder API allows for ordinal or one-hot encoding of categorical features.

When is feature encoding a useful mitigation technique?

The Encoder API can be useful when a feature does not expose sufficient information about the task, either because (1) the semantic information of the feature content is hidden by the original encoding format, or because (2) the model may not have the capacity to interpret that semantic information.

Example

If a feature contains values such as {“agree”, “mostly agree”, “neutral”, “mostly disagree”, “disagree”}, the string interpretation (or alphabetical ordering) of these values cannot express the fact that “agree” ranks above “mostly agree”. In other cases, if the data contains a categorical feature with high cardinality but no inherent ordering between the feature values, the training algorithm may still assign an inappropriate ordering to the values.
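As a concrete illustration, the sketch below uses scikit-learn's OrdinalEncoder (rather than the library's own Encoder API) with a hypothetical opinion column, supplying an explicit category order so the encoding reflects the intended ranking instead of alphabetical order:

    # Illustrative sketch (not the library's own API): encoding an ordinal
    # survey feature with an explicit category order.
    import pandas as pd
    from sklearn.preprocessing import OrdinalEncoder

    df = pd.DataFrame({"opinion": ["agree", "neutral", "mostly disagree", "agree", "disagree"]})

    # Spell out the intended order so "disagree" < ... < "agree" instead of
    # letting the encoder fall back to alphabetical order.
    order = ["disagree", "mostly disagree", "neutral", "mostly agree", "agree"]
    encoder = OrdinalEncoder(categories=[order])
    df["opinion_encoded"] = encoder.fit_transform(df[["opinion"]]).ravel()
    print(df)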

Responsible AI tip about feature encoding

Although feature encoding is a generally useful technique in machine learning, it’s important to be aware that encoding can sometimes affect different data cohorts differently, which could result in fairness-related harms or reliability and safety failures. To illustrate, imagine you have two cohorts of interest: “non-immigrants” and “immigrants”. If the data contains the “country of birth” as a feature, and the value of that feature is mostly uniform within the “non-immigrant” cohort but highly variable across the “immigrant” cohort, then the wrong ordering interpretation will negatively impact the “immigrant” cohort more because there are more possible values of the “country of birth” feature.

Feature Selection

The Feature Selection API enables selecting a subset of features that are the most informative for the prediction task.

When is feature selection a useful mitigation technique?

Sometimes training datasets may contain features that either have very little information about the task or are redundant in the context of other existing features. Selecting the right feature subset may improve the predictive power of models, their generalization properties, and their inference time. Focusing only on a subset of features also helps practitioners in the process of model understanding and interpretation.
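For intuition, here is a minimal sketch of the idea using scikit-learn's SelectKBest on synthetic data; the library's Feature Selection API provides its own interface for this:

    # Conceptual sketch: keep the k features with the most mutual information
    # about the label, then inspect which ones were kept.
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, mutual_info_classif

    X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

    selector = SelectKBest(score_func=mutual_info_classif, k=5)
    X_selected = selector.fit_transform(X, y)
    print("selected feature indices:", selector.get_support(indices=True))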

Responsible AI tip about feature selection

It’s important to be aware that although feature selection is a generally useful machine-learning technique, it can sometimes affect various data cohorts differently, with the potential to result in fairness-related harms or reliability and safety failures. For example, it may be the case that within a particular cohort there exists full correlation between two features, but not with the rest of the data. In addition, if this cohort is also a minority group, the meaning and weight of a feature value can be drastically different.

Example

In the United States there are both private and public undergraduate schools, while in some countries all degree-granting schools are public. A university in the United States deciding which applicants to interview for graduate school uses the feature previous program type (meaning either private or public university). The university is interested in several location-based cohorts indicating where applicants did their recent undergraduate studies. However, a small group of applicants are from a country where all schools are public, so their “previous program type” is always set to “public”. The feature previous program type is therefore redundant for this cohort and not helpful to the prediction task of recommending whom to interview. Furthermore, this feature selection could be even more harmful if the model, due to existing correlations in the larger dataset, has learned a negative correlation between “public” undergraduate studies and acceptance rates in graduate school. For the graduate program, this may lead to harms of underrepresentation or even erasure of individuals from countries with only “public” education.

Imputers

The Imputer API enables a simple approach for replacing missing values across several columns with different parameters, simultaneously replacing them with the mean, median, most frequent, or a constant value in the dataset.

When is the Imputer API a useful mitigation technique?

Sometimes because of data collection practices, a given cohort may be missing data on a feature that is particularly helpful for prediction. This happens frequently when the training data comes from different sources of data collection (e.g., different hospitals collect different health indicators) or when the training data spans long periods of time, during which the data collection protocol may have changed.

Responsible AI tip about imputing value

It’s important to be aware that although imputing values is a generally useful machine-learning technique, it has the potential to result in fairness-related harms of over- or underrepresentation, which can impact quality of service or allocation of opportunities or resources, as well as reliability and safety.

It is recommended, for documentation and provenance purposes, to rename features after applying this mitigation so that the name conveys which values have been imputed and how.

To avoid overfitting, it is important that feature imputation for testing datasets is performed based on statistics (e.g., minimum, maximum, mean, frequency) that are retrieved from the training set only. This approach ensures no information from the other samples in the test set is used to improve the prediction on an individual test sample.
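A minimal sketch of that workflow with scikit-learn's SimpleImputer (the column name is hypothetical); note that the imputed column is also renamed, as recommended above:

    # Fit the imputer on the training split only, then reuse the training
    # statistics on the test split so no test-set information leaks in.
    import pandas as pd
    from sklearn.impute import SimpleImputer

    train = pd.DataFrame({"blood_pressure": [120.0, None, 135.0, 128.0]})
    test = pd.DataFrame({"blood_pressure": [None, 140.0]})

    imputer = SimpleImputer(strategy="mean")
    train["blood_pressure_mean_imputed"] = imputer.fit_transform(train[["blood_pressure"]]).ravel()
    # The test set reuses the training mean computed above.
    test["blood_pressure_mean_imputed"] = imputer.transform(test[["blood_pressure"]]).ravel()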

Sampling

The Sampling API enables data augmentation by rebalancing existing data or synthesizing new data.

When is the Sampling API a useful mitigation technique?

Sampling helps address data imbalance in a given class or feature, a common problem in machine learning.

Responsible AI tip about sampling

The problem of data imbalance is most commonly studied in the context of class imbalance. From the responsible AI perspective, however, the problem is much broader: feature-value imbalance may leave too little data for cohorts of interest, which in turn may lead to lower-quality predictions.

Example

Consider the task of predicting whether a house will sell for higher or lower than the asking price. Even when the class is balanced, there still may be feature imbalance for the geographic location because population densities vary in different areas. As such, if the goal is to improve model performance for areas with a lower population density, oversampling for this group may help the model to better represent these cohorts.
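A minimal sketch of this kind of cohort oversampling with sklearn.utils.resample (column names are hypothetical); the library's Sampling API provides richer rebalancing and synthesis options:

    # Oversample the low-density cohort so the model sees it as often as the
    # high-density cohort.
    import pandas as pd
    from sklearn.utils import resample

    df = pd.DataFrame({
        "area_type":  ["urban"] * 8 + ["rural"] * 2,
        "sold_above": [1, 0, 1, 1, 0, 1, 0, 1, 0, 1],
    })

    majority = df[df["area_type"] == "urban"]
    minority = df[df["area_type"] == "rural"]
    minority_upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=0)
    balanced = pd.concat([majority, minority_upsampled])
    print(balanced["area_type"].value_counts())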

Scalers

The Scaler API enables applying numerical scaling transformations to several features at the same time.

When is scaling feature values a useful mitigation technique?

In general, scaling feature values is important for training algorithms that compute distances between data samples based on several numerical features (e.g., KNNs, PCA). Because the semantic meaning and numeric range of different features can vary significantly, computing distances over scaled versions of those features is more meaningful.

Example

Consider training data with two numerical features: age and yearly wage. When computing distances across samples, the yearly wage feature will affect the distance far more than age, not because it is more important but because its range of values is much wider.
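A small numeric sketch of this effect, assuming the two features above and scikit-learn's StandardScaler:

    # Without scaling, the wage feature dominates the Euclidean distance;
    # after standardization both features contribute on comparable scales.
    import numpy as np
    from sklearn.preprocessing import StandardScaler

    X = np.array([[25, 40_000.0],
                  [55, 42_000.0],
                  [26, 90_000.0]])

    print(np.linalg.norm(X[0] - X[1]))  # ~2000: driven almost entirely by wage
    X_scaled = StandardScaler().fit_transform(X)
    print(np.linalg.norm(X_scaled[0] - X_scaled[1]))  # the age gap now dominates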

Scaling is also critical for the convergence of popular gradient-based optimization algorithms for neural networks, and it prevents fast saturation of activation functions (e.g., sigmoids).

Responsible AI tip about scalers

Note that scalers transform the feature values globally, meaning that they scale the feature based on all samples of the dataset. This may not always be the most fair or inclusive approach, depending on the use case.

For example, if a training dataset for predicting credit reliability combines data from several countries, individuals with a relatively high salary for their particular country may still fall in the lower-than-average range when minimum and maximum values for scaling are computed based on data from countries where salaries are a lot higher. This misinterpretation of their salary may then lead to a wrong prediction, potentially resulting in the withholding of opportunities and resources.

Similarly in the medical domain, people with different ancestry may have varied minimum and maximum values for specific disease indicators. Scaling globally could lead the algorithm to underdiagnose the disease of interest for some ancestry cohorts. Of course, depending on the capacity and non-linearity of the training algorithm, the algorithm itself may find other ways of circumventing such issues. Nevertheless, it may still be a good idea for AI practitioners to apply a more cohort-aware approach by scaling one cohort at a time.
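A minimal sketch of such a cohort-aware approach using plain pandas (column names and values are hypothetical):

    # Standardize a disease indicator within each ancestry cohort instead of
    # over the whole dataset.
    import pandas as pd

    df = pd.DataFrame({
        "ancestry":  ["A", "A", "A", "B", "B", "B"],
        "indicator": [1.0, 1.2, 1.4, 3.0, 3.5, 4.0],
    })

    def zscore(s: pd.Series) -> pd.Series:
        return (s - s.mean()) / s.std()

    df["indicator_global_scaled"] = zscore(df["indicator"])
    df["indicator_cohort_scaled"] = df.groupby("ancestry")["indicator"].transform(zscore)
    print(df)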

Data Balance Metrics

Aggregate measures

These measures look at the distribution of records across all value combinations of sensitive feature columns. For example, if sex and race are specified as sensitive features, the API tries to quantify imbalance across all combinations of the specified features (e.g., [Male, Black], [Female, White], [Male, Asian Pacific Islander]).

Measure: Atkinson index
Description: The Atkinson index presents the percentage of total income that a given society would have to forego in order to have more equal shares of income among its citizens. This measure depends on the degree of societal aversion to inequality (a theoretical parameter decided by the researcher), where a higher value entails greater social utility or willingness by individuals to accept smaller incomes in exchange for a more equal distribution. An important feature of the Atkinson index is that it can be decomposed into within-group and between-group inequality.
Interpretation: Range [0, 1]. 0 = perfect equality; 1 = maximum inequality. In this library, the quantity playing the role of income is the proportion of records for each combination of sensitive-column values.

Measure: Theil T index
Description: GE(1) = Theil T index, which is more sensitive to differences at the top of the distribution. The Theil index is a statistic used to measure economic inequality; it measures the entropic “distance” between the population and the “ideal” egalitarian state in which everyone has the same income.
Interpretation: If everyone has the same income, then T_T equals 0. If one person has all the income, then T_T equals ln(N). 0 means equal income; larger values mean a higher level of disproportion.

Measure: Theil L index
Description: GE(0) = Theil L index, which is more sensitive to differences at the lower end of the distribution. Theil L is the mean of the logarithm of (mean income)/(income i) over all the incomes included in the summation; it is also referred to as the mean log deviation measure. Because a transfer from a larger income to a smaller one will change the smaller income's ratio more than it changes the larger income's ratio, this index satisfies the transfer principle.
Interpretation: Same interpretation as the Theil T index.
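For reference, the standard definitions of these indices are shown below, where x_i is the number of records in the i-th feature-value combination, μ is their mean, N is the number of combinations, and ε is the inequality-aversion parameter:

    \begin{aligned}
    T_T &= \frac{1}{N}\sum_{i=1}^{N}\frac{x_i}{\mu}\,\ln\frac{x_i}{\mu}
      && \text{(Theil T, GE(1))} \\
    T_L &= \frac{1}{N}\sum_{i=1}^{N}\ln\frac{\mu}{x_i}
      && \text{(Theil L, GE(0))} \\
    A_{\varepsilon} &= 1-\frac{1}{\mu}\left(\frac{1}{N}\sum_{i=1}^{N}x_i^{\,1-\varepsilon}\right)^{1/(1-\varepsilon)},
      \quad \varepsilon \neq 1
      && \text{(Atkinson index)}
    \end{aligned}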

Distribution measures

These metrics compare the data with a reference distribution (currently only uniform distribution is supported). They are calculated per sensitive feature column and do not depend on the class label column.

Measure: KL divergence
Description: Kullback–Leibler (KL) divergence measures how one probability distribution differs from a second, reference probability distribution. It is the measure of the information gained when one revises one's beliefs from the prior probability distribution Q to the posterior probability distribution P; in other words, it is the amount of information lost when Q is used to approximate P.
Interpretation: Non-negative. 0 means P = Q.

Measure: JS distance
Description: The Jensen-Shannon (JS) distance measures the similarity between two probability distributions. It is the symmetrized and smoothed version of the Kullback–Leibler (KL) divergence and is the square root of the JS divergence.
Interpretation: Range [0, 1]. 0 means the distribution is identical to the (balanced) reference distribution.

Measure: Wasserstein distance
Description: This distance is also known as the Earth mover's distance (EMD), since it can be seen as the minimum amount of “work” required to transform u into v, where “work” is measured as the amount of distribution weight that must be moved, multiplied by the distance it has to be moved.
Interpretation: Non-negative. 0 means P = Q.

Measure: Infinite norm distance
Description: Also known as the Chebyshev distance or chessboard distance, this is the distance between two vectors that is the greatest of their differences along any coordinate dimension.
Interpretation: Non-negative. 0 means P = Q.

Measure: Total variation distance
Description: The total variation distance is equal to half the L1 (Manhattan) distance between the two distributions: take the difference between the two proportions in each category, add up the absolute values of all the differences, and then divide the sum by 2.
Interpretation: Non-negative. 0 means P = Q.

Measure: Chi-square test
Description: The chi-square test is used to test the null hypothesis that the categorical data has the given expected frequencies in each category.
Interpretation: The p-value quantifies the evidence against the null hypothesis that any difference between observed and expected frequencies is due to random chance.
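The sketch below computes these distribution measures for one sensitive column against the uniform reference, using numpy/scipy equivalents rather than the library's own API:

    import numpy as np
    from scipy.spatial.distance import jensenshannon
    from scipy.stats import chisquare, entropy, wasserstein_distance

    counts = np.array([50, 30, 20])        # records per feature value
    p = counts / counts.sum()              # observed distribution
    q = np.full_like(p, 1.0 / len(p))      # uniform reference

    print("KL divergence:     ", entropy(p, q))
    print("JS distance:       ", jensenshannon(p, q))
    print("Wasserstein:       ", wasserstein_distance(np.arange(len(p)), np.arange(len(p)), p, q))
    print("Infinite norm:     ", np.max(np.abs(p - q)))
    print("Total variation:   ", 0.5 * np.sum(np.abs(p - q)))
    print("Chi-square p-value:", chisquare(counts, counts.sum() * q).pvalue)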

Feature measures

These measures look at whether each combination of sensitive features receives the positive outcome (true prediction) at balanced probabilities. Many of these metrics were influenced by the paper Measuring Model Biases in the Absence of Ground Truth (Osman Aka, Ken Burke, Alex Bäuerle, Christina Greer, Margaret Mitchell).

Association metric: Statistical parity
Family: Fairness
Description: Each segment of a protected class (e.g., gender) should receive the positive outcome at equal rates.
Interpretation / Formula: Parity increases with proximity to 0. DP = P(Y=1 | A=“Male”) − P(Y=1 | A=“Female”)

Association metric: Pointwise mutual information (PMI), normalized PMI
Family: Entropy
Description: The PMI of a pair of feature values (e.g., Gender=Male and Gender=Female) quantifies the discrepancy between the probability of their coincidence, given their joint distribution and their individual distributions (assuming independence).
Interpretation / Formula: Range (normalized) [−1, 1]. −1 for no co-occurrences; 0 for co-occurrences at random; 1 for complete co-occurrence.

Association metric: Sorensen-Dice coefficient (SDC)
Family: Intersection over union
Description: The SDC is used to gauge the similarity of two samples and is related to the F1 score.
Interpretation / Formula: Equals twice the number of elements common to both sets divided by the sum of the number of elements in each set.

Association metric: Jaccard index
Family: Intersection over union
Description: Similar to the SDC, the Jaccard index gauges the similarity and diversity of sample sets.
Interpretation / Formula: Equals the size of the intersection divided by the size of the union of the sample sets.

Association metric: Kendall rank correlation
Family: Correlation and statistical tests
Description: This is used to measure the ordinal association between two measured quantities.
Interpretation / Formula: High when observations have a similar rank between the two variables and low when observations have a dissimilar rank.

Association metric: Log-likelihood ratio
Family: Correlation and statistical tests
Description: This metric calculates the degree to which the data supports one variable versus another. The log-likelihood ratio gives the probability of correctly predicting the label relative to the probability of incorrectly predicting the label.
Interpretation / Formula: If the likelihoods are similar, it should be close to 0.

Association metric: T-test
Family: Correlation and statistical tests
Description: The t-test is used to compare the means of two groups (pairwise).
Interpretation / Formula: The value that is assessed for statistical significance in the t-distribution.
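A small sketch of two of these measures, statistical parity and (normalized) PMI, computed directly from a toy dataset with hypothetical column names:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "gender": ["Male", "Male", "Male", "Female", "Female", "Female"],
        "label":  [1, 1, 0, 1, 0, 0],
    })

    # Statistical parity: difference in positive-outcome rates between segments.
    rates = df.groupby("gender")["label"].mean()
    print("statistical parity:", rates["Male"] - rates["Female"])

    # PMI between a feature value and the positive label:
    # pmi = log( P(value, label=1) / (P(value) * P(label=1)) ),
    # normalized here by -log P(value, label=1).
    p_xy = len(df[(df["gender"] == "Female") & (df["label"] == 1)]) / len(df)
    p_x = (df["gender"] == "Female").mean()
    p_y = (df["label"] == 1).mean()
    pmi = np.log(p_xy / (p_x * p_y))
    print("PMI:", pmi, "normalized PMI:", pmi / -np.log(p_xy))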

Cohort Management

The Cohort Management feature allows managing multiple cohorts using a simple interface. This is an important tool for guaranteeing fairness across different cohorts, as shown in the scenarios described here. The cohort.CohortManager allows the application of different data processing pipelines over each cohort, and therefore represents a powerful tool when dealing with sensitive cohorts.

Example: Imputing missing values for each cohort separately

Consider the following situation: a dataset describes several details of similar cars from a specific brand. The column price stores the price of a car model in US dollars, while the column country indicates the country where that price was observed. Due to differences in economy and local currency, the price of these models is expected to vary greatly based on the country column. Suppose now that we want to impute the missing values in the price column using the mean value of that column. Given that prices differ greatly across the country cohorts, this imputation approach will end up inserting a lot of noise into the price column. Instead, we could impute based on each cohort: compute the mean price value for each cohort and replace missing values with the mean of the cohort to which the instance belongs. This greatly reduces the noise inserted by the imputation method, and it can be easily achieved by using the cohort.CohortManager class.
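The snippet below sketches the idea with plain pandas and hypothetical column values; the cohort.CohortManager class automates this pattern for arbitrary processing pipelines:

    # Cohort-based mean imputation: fill missing prices with the mean of the
    # instance's own country cohort rather than the global mean.
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "country": ["US", "US", "US", "BR", "BR", "BR"],
        "price":   [30_000.0, np.nan, 32_000.0, 9_000.0, 10_000.0, np.nan],
    })

    # The global mean (~20,000) would distort both cohorts; the per-country
    # mean keeps imputed values consistent with each cohort.
    df["price"] = df["price"].fillna(df.groupby("country")["price"].transform("mean"))
    print(df)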

Cohort-based estimators

The Cohort module allows applying different mitigations to each cohort separately, as previously highlighted. But it goes beyond that: it also allows creating full pipelines, including an estimator, for each cohort, while using a familiar and easy-to-use interface. If we are faced with a dataset whose cohorts behave very differently from each other, we can create a custom pipeline for each cohort individually, which means that each pipeline is fitted separately on its cohort. This may help achieve fairer results, that is, performance for each cohort that is similar to that of the other cohorts.
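A minimal sketch of this idea with scikit-learn pipelines; the helper function and its parameters are illustrative, not the Cohort module's API:

    # Fit one pipeline per cohort; at prediction time each sample would be
    # routed to the pipeline of its own cohort.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    def fit_per_cohort(df: pd.DataFrame, cohort_col: str, feature_cols: list, label_col: str) -> dict:
        pipelines = {}
        for cohort, group in df.groupby(cohort_col):
            pipe = make_pipeline(StandardScaler(), LogisticRegression())
            pipe.fit(group[feature_cols], group[label_col])
            pipelines[cohort] = pipe
        return pipelines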

Tutorials and Examples

Check the Gallery page for some tutorials and examples of how to use the Cohort module.

  • The Tutorial - Cohort section has a set of tutorial notebooks that show all of the features implemented in the different classes inside the Cohort module, as well as when and how to use each one of them.

  • The Tutorial - Using the Cohort Module section presents a set of notebooks in which we analyze datasets that show behavioral differences between cohorts and use the Cohort module to create different pre-processing pipelines for each cohort, and in some cases even different estimators for each cohort.

Note that this module is more useful in scenarios where there are considerable behavioral differences between cohorts. If that is not the case, then applying mitigations and training a single estimator over the entire dataset might prove the best approach, instead of creating different pipelines for each cohort.

Decoupled Classifiers

This class implements techniques for learning different estimators (models) for different cohorts based on the approach presented in “Decoupled classifiers for group-fair and efficient machine learning.” (Cynthia Dwork, Nicole Immorlica, Adam Tauman Kalai, and Max Leiserson. Conference on fairness, accountability and transparency. PMLR, 2018). The approach searches and combines cohort-specific classifiers to optimize for different definitions of group fairness and can be used as a post-processing step on top of any model class. The current implementation in this library supports only binary classification and we welcome contributions that can extend these ideas for multi-class and regression problems.

The basic decoupling algorithm can be summarized in two steps:

  • A different family of classifiers is trained on each cohort of interest. The algorithm partitions the training data by cohort and learns a classifier for each cohort. Each cohort-specific trained classifier is then expanded into a family of potential classifiers by thresholding the classifier output at different values. For example, depending on which errors are most important to the application (e.g., false positives vs. false negatives for binary classification), thresholding the model prediction at different values of the model output (e.g., likelihood, softmax) will result in different classifiers. This step generates a whole family of classifiers per cohort.

  • Among the cohort-specific classifiers, the algorithm searches for one representative classifier for each cohort such that a joint loss is optimized. This step searches through all combinations of classifiers from the previous step to find the combination that best optimizes the joint loss across all cohorts. While there are different definitions of such a joint loss, this implementation currently supports the Balanced Loss, L1 loss, and Demographic Parity as examples of losses that focus on group fairness. More loss definitions are described in the longer version of the paper.
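The schematic sketch below illustrates the search in the second step under simplified assumptions (per-cohort score arrays and a single threshold per cohort); it is not the library's implementation:

    # Step 1 implicitly produces a family of thresholded classifiers per
    # cohort; step 2 searches all threshold combinations for the one that
    # minimizes a joint loss.
    from itertools import product
    import numpy as np

    def decoupled_thresholds(scores_by_cohort, labels_by_cohort, joint_loss,
                             thresholds=np.linspace(0.1, 0.9, 9)):
        """scores_by_cohort / labels_by_cohort: dicts of cohort -> 1-D arrays."""
        cohorts = list(scores_by_cohort)
        best_combo, best_loss = None, np.inf
        for combo in product(thresholds, repeat=len(cohorts)):
            preds = {c: (scores_by_cohort[c] >= t).astype(int)
                     for c, t in zip(cohorts, combo)}
            loss = joint_loss(preds, labels_by_cohort)
            if loss < best_loss:
                best_combo, best_loss = dict(zip(cohorts, combo)), loss
        return best_combo, best_loss

    # Example joint loss: average error rate plus a demographic-parity penalty.
    def joint_loss(preds, labels):
        errors = [np.mean(p != labels[c]) for c, p in preds.items()]
        pos_rates = [np.mean(p) for p in preds.values()]
        return np.mean(errors) + (max(pos_rates) - min(pos_rates))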

Get involved

In the future, we plan to integrate more functionalities around data and model-oriented mitigations. Some top-of-mind improvements for the team include bagging and boosting, better data synthesis, constrained optimizers, and handling data noise. If you would like to collaborate or contribute to any of these ideas, contact us at rai-toolbox@microsoft.com.