Getting Started
Here we provide an overview of the library, along with useful links to other parts of the documentation.
Encoder API
The Encoder API allows for ordinal or one-hot encoding of categorical features.
When is feature encoding a useful mitigation technique?
The Encoder API can be useful in cases where a feature does not contain sufficient information about the task, either because (1) the semantic information of the feature content has been hidden by the original encoding format, or (2) the model may not have the capacity to interpret the semantic information.
Example
If a feature contains values such as {“agree”, “mostly agree”, “neutral”, “mostly disagree”, “disagree”}, the string interpretation (or ordering) of these values cannot express the fact that the “agree” value is better than “mostly agree”. In other cases, if the data contains a categorical feature with high cardinality but there is no inherent ordering between the feature values, the training algorithm may still assign an inappropriate ordering to the values.
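As a minimal sketch of explicit ordinal encoding (using scikit-learn's OrdinalEncoder and a hypothetical agreement column, rather than this library's Encoder API), passing the category order preserves the ranking that a default, alphabetical encoding would lose:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data: a survey-style categorical feature with an inherent order.
df = pd.DataFrame({"agreement": ["agree", "mostly agree", "neutral",
                                 "mostly disagree", "disagree", "agree"]})

# Passing the categories explicitly (from "disagree" to "agree") tells the
# encoder the intended ordering instead of the default alphabetical one.
order = ["disagree", "mostly disagree", "neutral", "mostly agree", "agree"]
encoder = OrdinalEncoder(categories=[order])
df["agreement_encoded"] = encoder.fit_transform(df[["agreement"]]).ravel()
print(df)
```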
Responsible AI tip about feature encoding
Although feature encoding is a generally useful technique in machine learning, it’s important to be aware that encoding can sometimes affect different data cohorts differently, which could result in fairness-related harms or reliability and safety failures. To illustrate, imagine you have two cohorts of interest: “non-immigrants” and “immigrants”. If the data contains the “country of birth” as a feature, and the value of that feature is mostly uniform within the “non-immigrant” cohort but highly variable across the “immigrant” cohort, then the wrong ordering interpretation will negatively impact the “immigrant” cohort more because there are more possible values of the “country of birth” feature.
Feature Selection
The Feature Selection API enables selecting a subset of features that are the most informative for the prediction task.
When is feature selection a useful mitigation technique?
Sometimes training datasets may contain features that either have very little information about the task or are redundant in the context of other existing features. Selecting the right feature subset may improve the predictive power of models, their generalization properties, and their inference time. Focusing only on a subset of features also helps practitioners in the process of model understanding and interpretation.
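As an illustrative sketch of the idea (using scikit-learn's SequentialFeatureSelector on synthetic data, rather than this library's Feature Selection API):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, only a few of which are informative.
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           n_redundant=2, random_state=0)

# Greedily select the 3 features that most improve a simple baseline model.
selector = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                     n_features_to_select=3)
selector.fit(X, y)
print("Selected feature indices:", selector.get_support(indices=True))
```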
Responsible AI tip about feature selection
It’s important to be aware that although feature selection is a generally useful machine-learning technique, it can sometimes affect different data cohorts differently, with the potential to result in fairness-related harms or reliability and safety failures. For example, two features may be fully correlated within a particular cohort but not in the rest of the data. In addition, if this cohort is also a minority group, the meaning and weight of a feature value can be drastically different.
Example
In the United States there are both private and public undergraduate schools, while in some countries all degree-granting schools are public. A university in the United States deciding which applicants to interview for graduate school uses the feature previous program type (meaning either private or public university). The university is interested in several location-based cohorts indicating where applicants did their recent undergraduate studies. However, a small group of applicants are from a country where all schools are public, so their previous program type is always set to “public”. The feature previous program type is therefore redundant for this cohort and not helpful to the prediction task of recommending whom to interview. Furthermore, selecting this feature could be even more harmful if the model, due to correlations in the larger dataset, has learned a negative correlation between “public” undergraduate studies and acceptance rates in graduate school. For the graduate program, this may even lead to harms of underrepresentation or even erasure of individuals from the countries with only “public” education.
Imputers
The Imputer API enables a simple approach for replacing missing values across several columns at once, each with its own parameters, replacing missing entries with the mean, the median, a constant, or the most frequent value of the column.
When is the Imputer API a useful mitigation technique?
Sometimes because of data collection practices, a given cohort may be missing data on a feature that is particularly helpful for prediction. This happens frequently when the training data comes from different sources of data collection (e.g., different hospitals collect different health indicators) or when the training data spans long periods of time, during which the data collection protocol may have changed.
Responsible AI tip about imputing values
It’s important to be aware that although imputing values is a generally useful machine-learning technique, it has the potential to result in fairness-related harms of over- or underrepresentation, which can impact quality of service or allocation of opportunities or resources, as well as reliability and safety.
It is recommended, for documentation and provenance purposes, to rename features after applying this mitigation so that the name conveys the information of which values have been imputed and how.
To avoid overfitting, it is important that feature imputation for testing datasets is performed based on statistics (e.g., minimum, maximum, mean, frequency) that are retrieved from the training set only. This approach ensures no information from the other samples in the test set is used to improve the prediction on an individual test sample.
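A minimal sketch of this train-only fitting pattern, using scikit-learn's SimpleImputer on hypothetical data (not this library's Imputer API):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical numeric feature matrix with missing values.
X = np.array([[1.0], [2.0], [np.nan], [4.0], [100.0], [np.nan]])
X_train, X_test = train_test_split(X, test_size=0.33, random_state=0)

# The mean is computed from the training split only...
imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train)

# ...and then reused, unchanged, to fill missing values in the test split,
# so no information from other test samples leaks into a test prediction.
X_train_filled = imputer.transform(X_train)
X_test_filled = imputer.transform(X_test)
```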
Sampling
The Sampling API enables data augmentation by rebalancing existing data or synthesizing new data.
When is the Sampling API a useful mitigation technique?
Sampling helps address data imbalance in a given class or feature, a common problem in machine learning.
Responsible AI tip about sampling
The problem of data imbalance is most commonly studied in the context of class imbalance. However, from the responsible AI perspective the problem is much broader: feature-value imbalance may leave cohorts of interest with too little data, which in turn may lead to lower-quality predictions for those cohorts.
Example
Consider the task of predicting whether a house will sell for higher or lower than the asking price. Even when the class is balanced, there still may be feature imbalance for the geographic location because population densities vary in different areas. As such, if the goal is to improve model performance for areas with a lower population density, oversampling for this group may help the model to better represent these cohorts.
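A minimal pandas sketch of oversampling a low-density cohort (the region and sold_above_asking columns are hypothetical, and the Sampling API offers more sophisticated strategies than plain resampling with replacement):

```python
import pandas as pd

# Hypothetical housing data: "region" is the cohort of interest and
# "sold_above_asking" is the (already balanced) class label.
df = pd.DataFrame({
    "region": ["urban"] * 8 + ["rural"] * 2,
    "sold_above_asking": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
})

# Oversample the under-represented "rural" cohort (with replacement)
# until it matches the size of the larger cohort.
target_size = df["region"].value_counts().max()
rural = df[df["region"] == "rural"]
extra = rural.sample(target_size - len(rural), replace=True, random_state=0)
df_balanced = pd.concat([df, extra], ignore_index=True)
print(df_balanced["region"].value_counts())
```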
Scalers
The Scaler API enables applying numerical scaling transformations to several features at the same time.
When is scaling feature values a useful mitigation technique?
In general, scaling feature values is important for training algorithms that compute distances between data samples based on several numerical features (e.g., KNNs, PCA). Because the ranges and units of different features can vary significantly, computing distances across scaled versions of these features is more meaningful.
Example
Consider training data with two numerical features, age and yearly wage. When computing distances across samples, the yearly wage feature will impact the distance significantly more than age, not because it is more important but because it has a much larger range of values.
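A small sketch of this effect, using scikit-learn's StandardScaler on made-up age and yearly wage values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical samples: columns are [age, yearly_wage].
X = np.array([[25.0, 40_000.0],
              [55.0, 42_000.0],
              [30.0, 90_000.0]])

# Unscaled, the wage difference dwarfs the age difference in any distance.
print(np.linalg.norm(X[0] - X[1]))  # dominated by the 2,000 wage gap
print(np.linalg.norm(X[0] - X[2]))  # dominated by the 50,000 wage gap

# After standardization, both features contribute on comparable scales.
X_scaled = StandardScaler().fit_transform(X)
print(np.linalg.norm(X_scaled[0] - X_scaled[1]))
print(np.linalg.norm(X_scaled[0] - X_scaled[2]))
```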
Scaling is also critical for the convergence of popular gradient-based optimization algorithms for neural networks. Scaling also prevents the phenomenon of fast saturation of activation functions (e.g., sigmoids) in neural networks.
Responsible AI tip about scalers
Note that scalers transform the feature values globally, meaning that they scale the feature based on all samples of the dataset. This may not always be the most fair or inclusive approach, depending on the use case.
For example, if a training dataset for predicting credit reliability combines data from several countries, individuals with a relatively high salary for their particular country may still fall in the lower-than-average range when minimum and maximum values for scaling are computed based on data from countries where salaries are a lot higher. This misinterpretation of their salary may then lead to a wrong prediction, potentially resulting in the withholding of opportunities and resources.
Similarly in the medical domain, people with different ancestry may have varied minimum and maximum values for specific disease indicators. Scaling globally could lead the algorithm to underdiagnose the disease of interest for some ancestry cohorts. Of course, depending on the capacity and non-linearity of the training algorithm, the algorithm itself may find other ways of circumventing such issues. Nevertheless, it may still be a good idea for AI practitioners to apply a more cohort-aware approach by scaling one cohort at a time.
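A minimal pandas sketch of such a cohort-aware approach (the country and salary columns are hypothetical; this shows the scaling logic itself, not this library's Scaler API):

```python
import pandas as pd

# Hypothetical data: salaries recorded in two countries with very different ranges.
df = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "salary":  [30_000, 45_000, 60_000, 300_000, 450_000, 600_000],
})

# Global min-max scaling: country A's highest earner still looks "low".
df["salary_scaled_globally"] = (df["salary"] - df["salary"].min()) / (
    df["salary"].max() - df["salary"].min()
)

# Cohort-aware scaling: the min and max are computed within each country.
df["salary_scaled_by_cohort"] = df.groupby("country")["salary"].transform(
    lambda s: (s - s.min()) / (s.max() - s.min())
)
print(df)
```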
Data Balance Metrics
Aggregate measures
These measures look at the distribution of records across all value combinations of sensitive feature columns. For example, if sex and race are specified as sensitive features, the API tries to quantify imbalance across all combinations of the specified features (e.g., [Male, Black], [Female, White], [Male, Asian Pacific Islander]).
Measure | Description | Interpretation
---|---|---
Atkinson index | The Atkinson index presents the percentage of total income that a given society would have to forego in order to have more equal shares of income among its members. It depends on a degree of societal aversion to inequality (a theoretical parameter chosen by the researcher). | Range [0, 1]. 0 indicates perfect equality; values closer to 1 indicate greater inequality.
Theil T index | GE(1) = Theil T, which is more sensitive to differences at the top of the distribution. | If everyone has the same income, the index is 0; if one person has all the income, the index is ln(N).
Theil L index | GE(0) = Theil L, which is more sensitive to differences at the lower end of the distribution. | Same interpretation as the Theil T index: 0 indicates perfect equality, and larger values indicate greater inequality.
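For intuition, the per-combination distribution that these aggregate measures summarize can be computed with a simple pandas group-by (hypothetical data):

```python
import pandas as pd

# Hypothetical dataset with two sensitive features.
df = pd.DataFrame({
    "sex":  ["Male", "Female", "Male", "Female", "Male", "Male"],
    "race": ["Black", "White", "Asian Pacific Islander", "Black", "Black", "White"],
})

# The aggregate measures above are computed over this per-combination distribution.
counts = df.groupby(["sex", "race"]).size()
proportions = counts / counts.sum()
print(proportions)
```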
Distribution measures
These metrics compare the data with a reference distribution (currently only uniform distribution is supported). They are calculated per sensitive feature column and do not depend on the class label column.
Measure | Description | Interpretation
---|---|---
Kullback–Leibler (KL) divergence | Measures how much the observed distribution of a feature differs from the reference (uniform) distribution. | Non-negative. 0 means the observed distribution matches the reference distribution exactly.
Jensen-Shannon (JS) distance | The Jensen-Shannon (JS) distance is the square root of the JS divergence, a symmetrized and smoothed version of the KL divergence. | Range [0, 1]. 0 means the distributions are identical.
Wasserstein distance | This distance is also known as the earth mover's distance: intuitively, the minimum amount of "work" required to transform one distribution into the other. | Non-negative. 0 means the distributions are identical.
Infinity norm distance | Also known as the Chebyshev distance or chessboard distance; it is the largest absolute difference between the two distributions at any single value. | Non-negative. 0 means the distributions are identical.
Total variation distance | The total variation distance is equal to half the L1 (Manhattan) distance between the two distributions. | Non-negative. 0 means the distributions are identical.
Chi-square test | The chi-square test is used to test the null hypothesis that the observed data follow the reference distribution. | The p-value gives evidence against the null hypothesis; a small p-value (e.g., < 0.05) indicates that the observed distribution differs significantly from the reference.
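As an illustration of the first of these measures, the KL divergence of an observed feature distribution from the uniform reference can be computed with scipy (hypothetical data; not this library's API):

```python
import numpy as np
import pandas as pd
from scipy.stats import entropy

# Hypothetical sensitive feature column.
race = pd.Series(["Black", "White", "White", "White",
                  "Asian Pacific Islander", "White"])

observed = race.value_counts(normalize=True).sort_index().to_numpy()
reference = np.full(len(observed), 1.0 / len(observed))  # uniform reference

# KL divergence of the observed distribution from the uniform reference:
# 0 means perfectly balanced, larger values mean more imbalance.
print(entropy(observed, reference))
```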
Feature measures
These measures assess whether each combination of sensitive feature values receives the positive outcome (true prediction) at balanced rates. Many of these metrics were influenced by the paper Measuring Model Biases in the Absence of Ground Truth (Osman Aka, Ken Burke, Alex Bäuerle, Christina Greer, Margaret Mitchell).
Association metric | Family | Description | Interpretation / Range
---|---|---|---
Demographic parity | Fairness | The proportion of each segment of a sensitive feature (e.g., gender) should receive the positive outcome at equal rates; the measure reports the difference in those rates between segments. | Parity increases with proximity to zero; zero means the segments receive the positive outcome at equal rates.
Pointwise mutual information (PMI), normalized PMI | Entropy | The PMI of a pair of feature values quantifies the discrepancy between the probability of their co-occurrence under their joint distribution and under their individual distributions, assuming independence. | Range (normalized) [-1, 1]: -1 for values that never co-occur, 0 for co-occurrence at the rate expected under independence, 1 for values that always co-occur.
Sorensen-Dice coefficient (SDC) | Intersection | The SDC is used to gauge the similarity of two samples and is related to the F1 score. | Equals twice the number of elements common to both sets divided by the sum of the number of elements in each set; ranges from 0 (no overlap) to 1 (identical sets).
Jaccard index | Intersection | Similar to SDC, the Jaccard index gauges the similarity and diversity of sample sets. | Equals the size of the intersection divided by the size of the union of the sample sets; ranges from 0 (no overlap) to 1 (identical sets).
Kendall rank correlation | Correlation | This is used to measure the ordinal association between two measured quantities. | High when observations have a similar rank between the two variables, and low when observations have a dissimilar rank.
Log-likelihood ratio | Correlation | This metric calculates the degree to which the data supports one variable over another: the likelihood of observing the outcome given the sensitive feature value, relative to the alternative. | If likelihoods are similar, the ratio is close to zero.
t-test | Correlation | The t-test is used to compare the means of two groups, e.g., whether a sensitive feature value is associated with a different mean outcome. | The value that is being assessed for statistical significance is the t-statistic; a small associated p-value (e.g., < 0.05) indicates a significant difference between the group means.
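As an illustration of one of these measures, the (unnormalized) PMI between a hypothetical sensitive feature value and the positive outcome can be computed directly from its definition:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one sensitive feature and a binary label.
df = pd.DataFrame({
    "gender": ["Male", "Male", "Female", "Female", "Male", "Female"],
    "label":  [1, 1, 0, 1, 0, 1],
})

p_x = (df["gender"] == "Female").mean()          # P(gender = Female)
p_y = (df["label"] == 1).mean()                  # P(label = 1)
p_xy = ((df["gender"] == "Female") & (df["label"] == 1)).mean()  # joint

# PMI = log( P(x, y) / (P(x) * P(y)) ); 0 means the pair co-occurs at
# exactly the rate expected under independence (as in this toy data).
pmi = np.log(p_xy / (p_x * p_y))
print("PMI(Female, positive outcome):", pmi)
```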
Cohort Management
The Cohort Management feature allows managing multiple cohorts using a simple interface. This is an important tool for guaranteeing fairness across different cohorts, as shown in the scenarios described here. The cohort.CohortManager allows the application of different data processing pipelines over each cohort, and therefore represents a powerful tool when dealing with sensitive cohorts.
Example: Imputing missing values for each cohort separately
Consider the following situation: a dataset that shows several details of similar cars from a specific brand. The column price stores the price of a car model in US dollars, while the column country indicates the country where that price was observed. Due to differences in economy and local currency, the price of these models is expected to vary greatly based on the country column. Suppose now that we want to impute the missing values in the price column using the mean value of that column. Given that prices differ greatly across the different country cohorts, this imputation approach is expected to insert a lot of noise into the price column. Instead, we could use the mean value of the price column within each cohort, that is: compute the mean price value for each cohort and impute the missing values based on the mean value of the cohort to which each instance belongs. This greatly reduces the noise inserted by the imputation method, and it can be easily achieved by using the cohort.CohortManager class.
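A minimal pandas sketch of the per-cohort imputation logic described above (the cohort.CohortManager class automates this pattern; the data here is made up):

```python
import pandas as pd

# Hypothetical car-price data with a missing price in each country cohort.
df = pd.DataFrame({
    "country": ["US", "US", "US", "BR", "BR", "BR"],
    "price":   [30_000, None, 34_000, 150_000, 170_000, None],
})

# Naive global imputation would fill both gaps with the overall mean (~96,000),
# which is far from typical prices in either country.
# Cohort-aware imputation fills each gap with the mean of its own country:
df["price_imputed"] = df.groupby("country")["price"].transform(
    lambda s: s.fillna(s.mean())
)
print(df)
```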
Cohort-based estimators
The Cohort module allows applying different mitigations to each cohort separately, as previously highlighted. But it allows us to go beyond that: it also allows creating full pipelines, including an estimator, for each cohort, while using a familiar and easy-to-use interface. If we are faced with a dataset that has a set of cohorts that behave very differently from each other, we are able to create a custom pipeline for each cohort individually, which means that each pipeline is fitted separately on its cohort. This might help achieve fairer results, that is, performance for each cohort that is similar to that of the other cohorts.
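A conceptual sketch of fitting one pipeline per cohort with scikit-learn (hypothetical data and columns; the Cohort module provides this behavior through its own interface):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data with a cohort column and two numeric features.
df = pd.DataFrame({
    "cohort": ["A"] * 6 + ["B"] * 6,
    "x1": [1, 2, 3, 4, 5, 6, 10, 20, 30, 40, 50, 60],
    "x2": [5, 4, 6, 5, 7, 6, 100, 90, 120, 110, 130, 105],
    "y":  [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1],
})

# One independently fitted pipeline (scaler + estimator) per cohort.
pipelines = {}
for name, group in df.groupby("cohort"):
    pipe = make_pipeline(StandardScaler(), LogisticRegression())
    pipe.fit(group[["x1", "x2"]], group["y"])
    pipelines[name] = pipe

# At prediction time, each sample is routed to its cohort's pipeline.
sample = df.iloc[[7]]
print(pipelines[sample["cohort"].iloc[0]].predict(sample[["x1", "x2"]]))
```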
Tutorials and Examples
Check the Gallery page for some tutorials and examples of how to use the Cohort module.
The Tutorial - Cohort section has a set of tutorial notebooks that show all of the features implemented in the different classes inside the Cohort module, as well as when and how to use each one of them.
The Tutorial - Using the Cohort Module section presents a set of notebooks where we analyze datasets that present behavioral differences between cohorts and use the Cohort module to create different pre-processing pipelines for each cohort; in some cases, we even create different estimators for each cohort.
Note that this module is more useful in scenarios where there are considerable behavioral differences between cohorts. If that is not the case, then applying mitigations and training a single estimator over the entire dataset might prove the best approach, instead of creating different pipelines for each cohort.
Decoupled Classifiers
This class implements techniques for learning different estimators (models) for different cohorts based on the approach presented in “Decoupled classifiers for group-fair and efficient machine learning.” (Cynthia Dwork, Nicole Immorlica, Adam Tauman Kalai, and Max Leiserson. Conference on fairness, accountability and transparency. PMLR, 2018). The approach searches and combines cohort-specific classifiers to optimize for different definitions of group fairness and can be used as a post-processing step on top of any model class. The current implementation in this library supports only binary classification and we welcome contributions that can extend these ideas for multi-class and regression problems.
The basic decoupling algorithm can be summarized in two steps:
A different family of classifiers is trained on each cohort of interest. The algorithm partitions the training data by cohort and learns a classifier for each cohort. Each cohort-specific classifier then gives rise to a family of potential classifiers, obtained by adjusting its output with different thresholds. For example, depending on which errors are most important to the application (e.g., false positives vs. false negatives for binary classification), thresholding the model output (e.g., likelihood, softmax) at different values results in different classifiers. This step therefore generates a whole family of classifiers per cohort, one for each threshold.
Among the cohort-specific classifiers, the algorithm searches for one representative classifier for each cohort such that a joint loss is optimized. This step searches through all combinations of classifiers from the previous step to find the combination that best optimizes the joint loss across all cohorts. While there are different definitions of such a joint loss, this implementation currently supports the Balanced Loss, the L1 loss, and Demographic Parity as examples of losses that focus on group fairness. More definitions of losses are described in the longer version of the paper.
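A rough, simplified sketch of these two steps using scikit-learn (hypothetical data; the joint loss here is an illustrative error-plus-parity-gap penalty, not the library's exact implementation):

```python
import itertools
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical binary-classification data split into two cohorts.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "cohort": np.repeat(["A", "B"], 200),
    "x": rng.normal(size=400),
})
df["y"] = (df["x"] + rng.normal(scale=0.5, size=400) > 0).astype(int)

# Step 1: train one classifier per cohort; sweeping the decision threshold
# over its scores yields a family of candidate classifiers per cohort.
thresholds = np.linspace(0.1, 0.9, 9)
scores, labels = {}, {}
for name, group in df.groupby("cohort"):
    clf = LogisticRegression().fit(group[["x"]], group["y"])
    scores[name] = clf.predict_proba(group[["x"]])[:, 1]
    labels[name] = group["y"].to_numpy()

# Step 2: search all threshold combinations for the one minimizing a joint
# loss; here, mean error plus a penalty on the gap in positive-prediction
# rates between the two cohorts (a rough demographic-parity-style penalty).
def joint_loss(combo):
    errs, pos_rates = [], []
    for name, thr in zip(scores, combo):
        preds = (scores[name] >= thr).astype(int)
        errs.append(np.mean(preds != labels[name]))
        pos_rates.append(preds.mean())
    return np.mean(errs) + abs(pos_rates[0] - pos_rates[1])

best = min(itertools.product(thresholds, repeat=len(scores)), key=joint_loss)
print("Chosen thresholds per cohort:", dict(zip(scores, best)))
```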
Get involved
In the future, we plan to integrate more functionalities around data and model-oriented mitigations. Some top-of-mind improvements for the team include bagging and boosting, better data synthesis, constrained optimizers, and handling data noise. If you would like to collaborate or contribute to any of these ideas, contact us at rai-toolbox@microsoft.com.