DistributionBalanceMeasure

These metrics compare the data with a reference distribution (currently, only the uniform distribution is supported). They are calculated per sensitive feature column and do not depend on the class label column.

Each measure is listed below with its description and how to interpret its value.

Measure: KL divergence

Description: Kullback–Leibler (KL) divergence measures how one probability distribution differs from a second, reference probability distribution. It is the information gained when one revises one's beliefs from the prior probability distribution Q to the posterior probability distribution P; in other words, it is the amount of information lost when Q is used to approximate P.

Interpretation: Non-negative. 0 means P = Q.

Measure: JS distance

Description: The Jensen-Shannon (JS) distance measures the similarity between two probability distributions. It is the square root of the JS divergence, which is a symmetrized and smoothed version of the Kullback–Leibler (KL) divergence.

Interpretation: Range [0, 1]. 0 means the observed distribution exactly matches the (uniform) reference distribution.

Measure: Wasserstein distance

Description: This distance is also known as the earth mover's distance (EMD), since it can be seen as the minimum amount of "work" required to transform one distribution into the other, where "work" is measured as the amount of distribution weight that must be moved multiplied by the distance it has to be moved.

Interpretation: Non-negative. 0 means P = Q.

Measure: Infinity norm distance

Description: Also known as the Chebyshev distance or chessboard distance, this is the distance between two vectors defined as the greatest of their absolute differences along any coordinate dimension.

Interpretation: Non-negative. 0 means P = Q.

Measure: Total variation distance

Description: The total variation distance is equal to half the L1 (Manhattan) distance between the two distributions: take the difference between the two proportions in each category, sum the absolute values of all the differences, and divide the sum by 2.

Interpretation: Non-negative. 0 means P = Q.

Measure: Chi-square test

Description: The chi-square test is used to test the null hypothesis that the categorical data has the given expected frequencies (here, the frequencies of the uniform reference) in each category.

Interpretation: The p-value is the probability of observing a difference between the observed and expected frequencies at least as large as the one seen, assuming the null hypothesis holds; a small p-value is evidence that the difference is not due to random chance alone.
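As a rough illustration (not the library's implementation), the sketch below computes these measures by hand for a single categorical column against a uniform reference distribution, using numpy, pandas, and scipy. All variable names and data here are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import jensenshannon
from scipy.stats import chisquare, entropy, wasserstein_distance

# Toy sensitive-feature column; the values are illustrative.
values = pd.Series(["A", "A", "A", "B", "B", "C"])
counts = values.value_counts().sort_index()

obs = (counts / counts.sum()).to_numpy()       # observed distribution P
ref = np.full(len(counts), 1.0 / len(counts))  # uniform reference distribution Q

kl_divergence = entropy(obs, ref)                  # KL(P || Q); 0 iff P = Q
js_distance = jensenshannon(obs, ref, base=2)      # sqrt of JS divergence; in [0, 1]
inf_norm_distance = np.max(np.abs(obs - ref))      # Chebyshev / L-infinity distance
total_variation = 0.5 * np.sum(np.abs(obs - ref))  # half the L1 (Manhattan) distance

# One common convention for categorical data: treat the two probability
# vectors as 1-D samples and compute the earth mover's distance between them.
ws_distance = wasserstein_distance(obs, ref)

# Chi-square test against the uniform expected counts; the p-value is the
# probability of a difference this large arising under the null hypothesis.
chi_sq, chi_sq_pvalue = chisquare(counts.to_numpy(), f_exp=counts.sum() * ref)

print(kl_divergence, js_distance, inf_norm_distance,
      total_variation, ws_distance, chi_sq_pvalue)
```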

class raimitigations.databalanceanalysis.distribution_measures.DistributionBalanceMeasure(sensitive_cols: List[str])

Bases: BalanceMeasure

DISTRIBUTION_METRICS: Dict[Measures, Callable[[array, array], float]] = {
    Measures.CHISQ_PVALUE: <function get_chisq_pvalue>,
    Measures.CHISQ: <function get_chi_squared>,
    Measures.INF_NORM_DISTANCE: <function get_infinity_norm_distance>,
    Measures.JS_DISTANCE: <function get_js_distance>,
    Measures.KL_DIVERGENCE: <function get_kl_divergence>,
    Measures.TOTAL_VARIANCE_DISTANCE: <function get_total_variation_distance>,
    Measures.WS_DISTANCE: <function get_ws_distance>,
}
measures(df: DataFrame) → DataFrame
The output is a dataframe with one row per sensitive column specified:

each row maps the sensitive column name to a dictionary, and that dictionary maps the name of each measure to its value for that column.

Parameters

df (pd.DataFrame) – the dataframe on which to calculate all of the distribution measures

Returns

a dataframe with one column containing the sensitive column name and another column containing the dictionary that maps each measure name to its value for that sensitive feature.

Return type

pd.DataFrame
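
A minimal usage sketch, assuming a pandas DataFrame with an illustrative sensitive column named "gender" (the data and column names below are made up for the example):

```python
import pandas as pd
from raimitigations.databalanceanalysis.distribution_measures import DistributionBalanceMeasure

# Toy dataset; "gender" is an illustrative sensitive feature column.
df = pd.DataFrame({
    "gender": ["F", "F", "F", "M", "M", "X"],
    "label": [1, 0, 1, 0, 1, 0],
})

dist_measures = DistributionBalanceMeasure(sensitive_cols=["gender"])
result = dist_measures.measures(df)  # one row per sensitive column
print(result)
```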