DistributionBalanceMeasure
These metrics compare the data with a reference distribution (currently only the uniform distribution is supported). They are calculated per sensitive feature column and do not depend on the class label column.
Measure | Description | Interpretation |
---|---|---|
Kullback-Leibler Divergence | Kullback–Leibler (KL) divergence | Non-negative. |
Jensen-Shannon Distance | The Jensen-Shannon (JS) distance | Range |
Wasserstein Distance | This distance is also known as the | Non-negative. |
Infinity Norm Distance | Also known as the Chebyshev distance | Non-negative. |
Total Variation Distance | The total variation distance is equal | Non-negative. |
Chi-Squared Test | The chi-square test is used to test the | The p-value gives evidence |
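As a rough illustration of what several of these measures compute, the following is a minimal pure-Python sketch and not the library's implementation. It assumes the uniform reference is taken over the distinct values observed in the column and uses the natural logarithm; the helper names (`observed_and_uniform`, `kl_divergence`, and so on) are hypothetical and chosen only for this example.

```python
import math
from collections import Counter


def observed_and_uniform(values):
    """Return (observed, uniform) probability lists over the distinct values.

    Assumption: the uniform reference covers only values that actually occur.
    """
    counts = Counter(values)
    total = sum(counts.values())
    observed = [count / total for count in counts.values()]
    uniform = [1.0 / len(counts)] * len(counts)
    return observed, uniform


def kl_divergence(p, q):
    # Terms with p_i == 0 contribute nothing (0 * log 0 is taken as 0).
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_distance(p, q):
    # Square root of the Jensen-Shannon divergence against the midpoint m.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(divergence, 0.0))  # guard against tiny negative rounding


def infinity_norm_distance(p, q):
    # Largest absolute difference between matching probabilities.
    return max(abs(pi - qi) for pi, qi in zip(p, q))


def total_variation_distance(p, q):
    # Half of the summed absolute differences.
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


# Example: a column where one value occurs three times as often as the other.
p, q = observed_and_uniform(["a", "a", "a", "b"])
for name, fn in [
    ("kl_divergence", kl_divergence),
    ("js_distance", js_distance),
    ("infinity_norm_distance", infinity_norm_distance),
    ("total_variation_distance", total_variation_distance),
]:
    print(name, fn(p, q))
```

In this sketch, a perfectly balanced column yields 0 for every measure, and larger values indicate a distribution further from uniform.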
- class raimitigations.databalanceanalysis.distribution_measures.DistributionBalanceMeasure(sensitive_cols: List[str])
Bases: BalanceMeasure
- DISTRIBUTION_METRICS: Dict[Measures, Callable[[array, array], float]] = {Measures.CHISQ_PVALUE: <function get_chisq_pvalue>, Measures.CHISQ: <function get_chi_squared>, Measures.INF_NORM_DISTANCE: <function get_infinity_norm_distance>, Measures.JS_DISTANCE: <function get_js_distance>, Measures.KL_DIVERGENCE: <function get_kl_divergence>, Measures.TOTAL_VARIANCE_DISTANCE: <function get_total_variation_distance>, Measures.WS_DISTANCE: <function get_ws_distance>}
- measures(df: DataFrame) → DataFrame
- The output is a dataframe that maps each sensitive column name to a dictionary.
The dictionary for each sensitive column maps the name of each measure to its value.
Kullback-Leibler Divergence - https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence
Jensen-Shannon Distance - https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence
Wasserstein Distance - https://en.wikipedia.org/wiki/Wasserstein_metric
Infinity Norm Distance - https://en.wikipedia.org/wiki/Chebyshev_distance
Total Variation Distance - https://en.wikipedia.org/wiki/Total_variation_distance_of_probability_measures
Chi-Squared Test - https://en.wikipedia.org/wiki/Chi-squared_test
There is one dictionary for each of the specified sensitive columns.
- Parameters
df (pd.DataFrame) – the dataframe on which to calculate all of the distribution measures
- Returns
a dataframe with one column containing the sensitive column name and another column containing the dictionary that maps the name of each measure to its value for that sensitive feature.
- Return type
pd.DataFrame
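The per-column structure described above (one dictionary of measure name to value for each sensitive column, produced by looking up callables in a metric table) can be sketched as follows. This is an illustrative stand-in and not the library's code: `SketchMeasure`, `SKETCH_METRICS`, and `sketch_measures` are hypothetical names, a plain dict of lists stands in for the dataframe, and only two of the measures are included.

```python
from collections import Counter
from enum import Enum
from typing import Callable, Dict, List, Sequence


class SketchMeasure(Enum):
    """Hypothetical stand-in for the library's Measures enum."""
    INF_NORM_DISTANCE = "inf_norm_dist"
    TOTAL_VARIATION_DISTANCE = "total_variation_dist"


def inf_norm(p: Sequence[float], q: Sequence[float]) -> float:
    return max(abs(pi - qi) for pi, qi in zip(p, q))


def total_variation(p: Sequence[float], q: Sequence[float]) -> float:
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


# Metric table: each entry takes (observed, reference) and returns a float.
SKETCH_METRICS: Dict[SketchMeasure, Callable[[Sequence[float], Sequence[float]], float]] = {
    SketchMeasure.INF_NORM_DISTANCE: inf_norm,
    SketchMeasure.TOTAL_VARIATION_DISTANCE: total_variation,
}


def sketch_measures(
    columns: Dict[str, List[str]], sensitive_cols: List[str]
) -> Dict[str, Dict[str, float]]:
    """Map each sensitive column to a {measure name: value} dictionary."""
    result: Dict[str, Dict[str, float]] = {}
    for col in sensitive_cols:
        counts = Counter(columns[col])
        total = sum(counts.values())
        observed = [count / total for count in counts.values()]
        # Assumption: uniform reference over the distinct observed values.
        reference = [1.0 / len(counts)] * len(counts)
        result[col] = {
            measure.value: fn(observed, reference)
            for measure, fn in SKETCH_METRICS.items()
        }
    return result


data = {"gender": ["f", "m", "m", "m"], "region": ["n", "s"]}
print(sketch_measures(data, ["gender", "region"]))
```

Keeping the metrics in a single table means a new measure only needs one more entry, and the loop over sensitive columns stays unchanged.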