DistributionBalanceMeasure

These metrics compare the data with a reference distribution (currently only uniform distribution is supported). They are calculated per sensitive feature column and do not depend on the class label column.

Measure	Description	Interpretation
KL divergence	Kullback–Leibler (KL) divergence measures how a probability distribution P differs from a reference probability distribution Q. It is the measure of the information gained when one revises one’s beliefs from the prior probability distribution Q to the posterior probability distribution P. In other words, it is the amount of information lost when Q is used to approximate P.	Non-negative. `0` means `P = Q`.
JS distance	The Jensen-Shannon (JS) distance measures the similarity between two probability distributions. It is the square root of the JS divergence, which is a symmetrized and smoothed version of the Kullback–Leibler (KL) divergence.	Range `[0, 1]`. `0` means the distribution is identical to the balanced (uniform) distribution.
Wasserstein distance	This distance is also known as the Earth mover’s distance (EMD), since it can be seen as the minimum amount of “work” required to transform `P` into `Q`, where “work” is measured as the amount of distribution weight that must be moved, multiplied by the distance it has to be moved.	Non-negative. `0` means `P = Q`.
Infinity norm distance	Also known as the Chebyshev distance or chessboard distance, this is the distance between two vectors defined as the greatest of their differences along any coordinate dimension.	Non-negative. `0` means `P = Q`.
Total variation distance	The total variation distance is equal to half the L1 (Manhattan) distance between the two distributions. Take the difference between the two proportions in each category, add up the absolute values of all the differences, and then divide the sum by 2.	Non-negative. `0` means `P = Q`.
Chi-square test	The chi-square test is used to test the null hypothesis that the categorical data has the given expected frequencies in each category.	A small p-value is evidence against the null hypothesis that the differences between observed and expected frequencies are due to random chance.
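The definitions in the table can be sketched in plain Python for a single categorical column compared against a uniform reference. This is a minimal illustration under stated assumptions, not the library's implementation: the helper names, the sample data, and the use of the natural logarithm are all choices made here. The Wasserstein distance and the chi-square p-value are left out because they need more machinery than a short sketch allows.

```python
import math
from collections import Counter

def kl_divergence(p, q):
    # Sum of p_i * log(p_i / q_i); terms with p_i == 0 contribute 0 by convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_distance(p, q):
    # Square root of the JS divergence, computed against the mixture m = (p + q) / 2.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    js_div = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(js_div, 0.0))  # guard against tiny negative rounding error

def total_variation_distance(p, q):
    # Half the L1 distance between the two distributions.
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def infinity_norm_distance(p, q):
    # Largest absolute difference in any single category.
    return max(abs(pi - qi) for pi, qi in zip(p, q))

def chi_square_statistic(observed, expected):
    # Test statistic only; turning it into a p-value needs the chi-square CDF.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical sensitive column with three categories.
values = ["a"] * 6 + ["b"] * 3 + ["c"] * 1
counts = Counter(values)
categories = sorted(counts)
n = len(values)

observed = [counts[c] for c in categories]   # [6, 3, 1]
p = [o / n for o in observed]                # observed proportions
q = [1 / len(categories)] * len(categories)  # uniform reference
expected = [n * qi for qi in q]              # expected counts under uniform

kl = kl_divergence(p, q)
js = js_distance(p, q)
tv = total_variation_distance(p, q)
inf_norm = infinity_norm_distance(p, q)
chi2 = chi_square_statistic(observed, expected)
```

Every value is `0` when the observed proportions already match the uniform reference, which matches the interpretation column above.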

class raimitigations.databalanceanalysis.distribution_measures.DistributionBalanceMeasure(sensitive_cols: List[str])

Bases: BalanceMeasure

DISTRIBUTION_METRICS: Dict[Measures, Callable[[array, array], float]] = {Measures.CHISQ_PVALUE: <function get_chisq_pvalue>, Measures.CHISQ: <function get_chi_squared>, Measures.INF_NORM_DISTANCE: <function get_infinity_norm_distance>, Measures.JS_DISTANCE: <function get_js_distance>, Measures.KL_DIVERGENCE: <function get_kl_divergence>, Measures.TOTAL_VARIANCE_DISTANCE: <function get_total_variation_distance>, Measures.WS_DISTANCE: <function get_ws_distance>}

measures(df: DataFrame) → DataFrame

The output is a dataframe that maps each sensitive column name to a dictionary. The dictionary for each sensitive column maps the name of each measure to its value.

Kullback-Leibler Divergence - https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence

Jensen-Shannon Distance - https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence

Wasserstein Distance - https://en.wikipedia.org/wiki/Wasserstein_metric

Infinity Norm Distance - https://en.wikipedia.org/wiki/Chebyshev_distance

Total Variation Distance - https://en.wikipedia.org/wiki/Total_variation_distance_of_probability_measures

Chi-Squared Test - https://en.wikipedia.org/wiki/Chi-squared_test

There is one dictionary for each sensitive column specified.

Parameters: df (pd.DataFrame) – the dataframe on which to calculate all of the distribution measures
Returns: a dataframe with one column containing the sensitive column name and another column containing the dictionary that maps each measure name to its value for that sensitive feature.
Return type: pd.DataFrame
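The shape of the result described above, one dictionary of measure names and values per sensitive column, can be mimicked without the library. The sketch below is only an assumption-laden illustration: the row data, the column names, and the measure keys (`total_variation_distance`, `infinity_norm_distance`) are hypothetical, and it returns a plain dict rather than a pd.DataFrame.

```python
from collections import Counter

def proportions(values):
    # Observed share of each category in the column.
    counts = Counter(values)
    total = sum(counts.values())
    return [c / total for c in counts.values()]

def measures_for_column(values):
    # Compare observed proportions with a uniform reference over the
    # categories that actually appear in the column.
    p = proportions(values)
    q = 1 / len(p)
    diffs = [abs(pi - q) for pi in p]
    return {
        "total_variation_distance": 0.5 * sum(diffs),  # hypothetical key name
        "infinity_norm_distance": max(diffs),          # hypothetical key name
    }

# Hypothetical rows; the "label" column is ignored because these measures
# do not depend on the class label.
rows = [
    {"gender": "F", "race": "A", "label": 1},
    {"gender": "F", "race": "B", "label": 0},
    {"gender": "F", "race": "A", "label": 1},
    {"gender": "M", "race": "B", "label": 0},
]
sensitive_cols = ["gender", "race"]

# One dictionary of measures per sensitive column.
result = {
    col: measures_for_column([row[col] for row in rows])
    for col in sensitive_cols
}
```

Here `gender` is split 3 to 1 and gets non-zero distances, while `race` is perfectly balanced and gets `0` for both measures.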