CorrelatedFeatures
- class raimitigations.dataprocessing.CorrelatedFeatures(df: Optional[Union[DataFrame, ndarray]] = None, label_col: Optional[str] = None, X: Optional[Union[DataFrame, ndarray]] = None, y: Optional[Union[DataFrame, ndarray]] = None, transform_pipe: Optional[list] = None, cor_features: Optional[list] = None, method_num_num: list = ['spearman'], num_corr_th: float = 0.85, num_pvalue_th: float = 0.05, method_num_cat: str = 'model', levene_pvalue: float = 0.01, anova_pvalue: float = 0.05, omega_th: float = 0.9, jensen_n_bins: Optional[int] = None, jensen_th: float = 0.8, model_metrics: list = ['f1', 'auc'], metric_th: float = 0.9, method_cat_cat: str = 'cramer', cat_corr_th: float = 0.85, cat_pvalue_th: float = 0.05, tie_method: str = 'missing', save_json: bool = True, json_summary: str = 'summary.json', json_corr: str = 'corr_pairs.json', json_uncorr: str = 'uncorr_pairs.json', compute_exact_matches: bool = True, verbose: bool = True)
Bases: FeatureSelection
Concrete class that measures the correlation between variables (numerical x numerical, categorical x numerical, and categorical x categorical) and drops features that are correlated with another feature.
- Parameters
df – the data frame to be used during the fit method. This data frame must contain all the features, including the label column (specified in the label_col parameter). This parameter is mandatory if label_col is also provided. The user can also provide this dataset (along with label_col) when calling the fit() method. If df is provided during the class instantiation, it is not necessary to provide it again when calling fit(). It is also possible to use X and y instead of df and label_col, although it is mandatory to pass the pair of parameters (X, y) or (df, label_col) either during the class instantiation or when calling the fit() method;
label_col – the name or index of the label column. This parameter is mandatory if df is provided;
X – contains only the features of the original dataset, that is, does not contain the label column. This is useful if the user has already separated the features from the label column prior to calling this class. This parameter is mandatory if y is provided;
y – contains only the label column of the original dataset. This parameter is mandatory if X is provided;
transform_pipe – a list of transformations to be used as a pre-processing pipeline. Each transformation in this list must be a valid subclass of the current library (EncoderOrdinal, BasicImputer, etc.). Some feature selection methods require a dataset with no categorical features or with no missing values (depending on the approach). If no transformations are provided, a set of default transformations will be used, which depends on the feature selection approach (subclass dependent);
cor_features – a list of the column names or indexes that should have their correlations checked. If None, all columns are checked for correlations, where each correlation is checked in pairs (all possible column pairs are checked);
method_num_num – the method used to test the correlation between numerical variables. Must be a list containing one or more methods (limited to the number of available methods). The available methods are: [“spearman”, “pearson”, “kendall”]. If None, none of the correlations between two numerical variables will be tested;
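As a rough illustration of how a numerical x numerical check can work, the sketch below computes a Spearman coefficient in plain Python and then applies a coefficient-plus-p-value rule using the constructor defaults (0.85 and 0.05). The helper names are hypothetical and are not part of the library; the p-value is taken as an input because the standard library has no routine for it.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally sized sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def is_num_pair_correlated(coef, p_value, num_corr_th=0.85, num_pvalue_th=0.05):
    """Hypothetical helper: both the coefficient and the p-value must pass."""
    return abs(coef) >= num_corr_th and p_value <= num_pvalue_th


x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]  # monotonic in x, so the ranks match exactly
rho = spearman(x, y)
```

In practice the coefficient and its p-value would usually come from a statistics package rather than hand-written code; the point here is only the two-part decision rule.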
num_corr_th – the correlation coefficient value used as a threshold for considering if there is a correlation between two numerical variables. That is, given two variables with a correlation coefficient ‘x’ (computed with the method specified by method_num_num) and an associated p-value ‘p’, the variables are considered correlated only if abs(x) >= num_corr_th and p <= num_pvalue_th;
num_pvalue_th – the p-value used as a threshold when considering if there is a correlation between two numerical variables. The same rule applies: the variables are considered correlated only if abs(x) >= num_corr_th and p <= num_pvalue_th;
method_num_cat –
the method used to compute the correlation between a categorical and a numerical variable. There are currently three approaches implemented:
‘anova’: uses the ANOVA test to identify a correlation. First, we use the Levene test to check whether the numerical variable has a similar variance across the different values of the categorical variable (homoscedastic data). If the test passes (that is, if the p-value of the Levene test is greater than levene_pvalue), then we perform the ANOVA test, in which we compute the F-statistic and its associated p-value to see if there is a correlation between the numerical and categorical variables. We also compute the omega-squared metric. If the p-value is less than anova_pvalue and the omega-squared is greater than omega_th, then both variables are considered to be correlated;
‘jensen’: first we cluster the numerical values according to their respective values of the categorical variable. We then compute the probability density function of the numerical variable for each cluster (we approximate the PDF with a histogram using jensen_n_bins different bins). The next step is to compute the Jensen-Shannon distance metric between the distribution functions of each pair of clusters. This distance metric varies from 0 to 1, where values closer to 0 mean that both distributions tested are similar and values closer to 1 mean that the distributions are different. If all pairs of distributions tested are considered different (a Jensen-Shannon metric above jensen_th for all pairs tested), then both variables are considered to be correlated;
‘model’: trains a simple decision tree using the numerical variable and predicts the categorical variable. Both variables are first divided into a training and a test set (70% and 30% of the size of the original variables, respectively). The training set is used to train the decision tree, where the only feature used by the model is the numerical variable and the predicted label is the value of the categorical variable. After training, the model is used to predict the values of the test set and a set of metrics is computed to assess the performance of the model (the metrics computed are defined by model_metrics). If all metrics computed are above the threshold metric_th, then both variables are considered to be correlated;
If set to None, then none of the correlations between numerical and categorical variables will be tested;
levene_pvalue – the threshold used to check if a set of samples are homoscedastic (similar variances across samples). This condition is necessary for the ANOVA test. This check is done using the Levene test, which considers that all samples have similar variances as the null hypothesis. If the p-value associated with this test is high, then the null hypothesis cannot be rejected, thus allowing the ANOVA test to be carried out. This parameter defines the threshold used by the p-value of this test: if p-value > levene_pvalue, then the data is considered to be homoscedastic. This parameter is ignored if method_num_cat != ‘anova’;
anova_pvalue – threshold used by the p-value associated with the F-statistic computed by the ANOVA test. If the p-value < anova_pvalue, then we consider that there is a statistically significant difference between the numerical values of different clusters (clusterized according to the values of the categorical variable). This implies a possible correlation between the numerical and categorical variables, although the F-statistic doesn’t tell us the magnitude of this difference. For that, we use the Omega-Squared metric. This parameter is ignored if method_num_cat != ‘anova’;
omega_th – the threshold used for the omega squared metric. The omega squared is a metric that varies between 0 and 1 that indicates the effect of a categorical variable over the variance of a numerical variable. A value closer to 0 indicates a weak effect, while values closer to 1 show that the categorical variable has a significant impact on the variance of the numerical variable, thus showing a high correlation. If the omega squared is greater than omega_th, then both variables being analyzed are considered to be correlated. This parameter is ignored if method_num_cat != ‘anova’;
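To make the ANOVA quantities concrete, here is a minimal standard-library sketch that computes the one-way F-statistic and omega-squared for a numerical variable grouped by a categorical one. It uses the common textbook formula for omega-squared, which is an assumption about what the library computes; the function names are hypothetical. The p-values of the Levene and ANOVA tests are not computed here (a statistics package such as SciPy would normally supply them), so the decision helper takes the ANOVA p-value as an argument.

```python
def anova_f_and_omega_squared(groups):
    """groups maps each category to the list of numerical values observed for it."""
    all_values = [v for vals in groups.values() for v in vals]
    n, k = len(all_values), len(groups)
    grand_mean = sum(all_values) / n
    means = {cat: sum(vals) / len(vals) for cat, vals in groups.items()}

    # Variation explained by the category (between) and left over (within).
    ss_between = sum(len(vals) * (means[cat] - grand_mean) ** 2
                     for cat, vals in groups.items())
    ss_within = sum((v - means[cat]) ** 2
                    for cat, vals in groups.items() for v in vals)
    ss_total = ss_between + ss_within

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    f_stat = ms_between / ms_within
    omega_sq = (ss_between - (k - 1) * ms_within) / (ss_total + ms_within)
    return f_stat, omega_sq


def is_anova_correlated(anova_p, omega_sq, anova_pvalue=0.05, omega_th=0.9):
    """Hypothetical helper: significant F-test and a large effect size."""
    return anova_p < anova_pvalue and omega_sq > omega_th


groups = {"a": [1, 2, 3], "b": [11, 12, 13], "c": [21, 22, 23]}
f_stat, omega_sq = anova_f_and_omega_squared(groups)
```

With well-separated groups like these, the effect size is close to 1, so the pair would pass the default omega_th of 0.9 provided the p-value is also below anova_pvalue.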
jensen_n_bins – the number of bins used for creating the histogram of each cluster of data when method_num_cat = ‘jensen’. For this method, we cluster the numerical data according to the categorical variable. For each cluster, we compute a histogram, which is used to approximate the Probability Density Function of that cluster. This parameter controls the number of bins used during the creation of the histogram. This parameter is ignored if method_num_cat != ‘jensen’. If None, the best number of bins for the numerical variable being analyzed is computed using the Freedman-Diaconis rule;
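The Freedman-Diaconis rule sets the bin width to 2 * IQR / n^(1/3) and derives the bin count from the data range. The sketch below is one plain-Python way to apply it; the function name, the quartile interpolation method, and the single-bin fallback for degenerate data are assumptions, not the library's implementation.

```python
from math import ceil
from statistics import quantiles


def freedman_diaconis_bins(values):
    """Number of histogram bins suggested by the Freedman-Diaconis rule."""
    data = sorted(values)
    # "inclusive" interpolates between data points, treating them as the full range.
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    spread = data[-1] - data[0]
    if iqr == 0 or spread == 0:
        return 1  # assumption: fall back to a single bin for degenerate data
    bin_width = 2 * iqr / len(data) ** (1 / 3)
    return max(1, ceil(spread / bin_width))


sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 10]
n_bins = freedman_diaconis_bins(sample)
```

Because the rule relies on the interquartile range rather than the standard deviation, a single outlier such as the value 10 above widens the range but does not inflate the bin width.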
jensen_th – when method_num_cat = ‘jensen’, we compare the distribution of each cluster of data using the Jensen-Shannon distance metric. If the distance is close to 1, then the distributions are considered different. If all pairs of clusters have a high distance, then both variables being analyzed are considered to be correlated. This parameter indicates the threshold used to check if a distance metric is high or not: if distance > jensen_th, then the distributions being compared are considered different. Must be a float value within [0, 1]. This parameter is ignored if method_num_cat != ‘jensen’;
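The ‘jensen’ procedure can be sketched with the standard library alone: build a normalized histogram per category over a shared range, compute the Jensen-Shannon distance (base-2 logarithm, so the result lies in [0, 1]) for every pair of categories, and require every distance to exceed the threshold. All function names below are hypothetical, and sharing one bin range across categories is an assumption.

```python
from itertools import combinations
from math import log2, sqrt


def histogram(values, n_bins, low, high):
    """Normalized histogram (bin probabilities) over the range [low, high]."""
    width = (high - low) / n_bins or 1.0  # avoid a zero width when low == high
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - low) / width), n_bins - 1)  # the maximum goes in the last bin
        counts[idx] += 1
    return [c / len(values) for c in counts]


def js_distance(p, q):
    """Jensen-Shannon distance between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        return sum(x * log2(x / y) for x, y in zip(a, b) if x > 0)

    divergence = (kl(p, m) + kl(q, m)) / 2
    return sqrt(max(0.0, divergence))  # guard against tiny negative rounding errors


def is_jensen_correlated(groups, n_bins, jensen_th=0.8):
    """True if every pair of per-category distributions is farther than jensen_th."""
    all_values = [v for vals in groups.values() for v in vals]
    low, high = min(all_values), max(all_values)
    hists = [histogram(vals, n_bins, low, high) for vals in groups.values()]
    return all(js_distance(p, q) > jensen_th for p, q in combinations(hists, 2))


separated = {"a": [1, 2, 3], "b": [11, 12, 13], "c": [21, 22, 23]}
overlapping = {"a": [1, 2, 3, 4], "b": [1, 2, 3, 4]}
```

SciPy offers a ready-made `scipy.spatial.distance.jensenshannon` that could replace `js_distance` here.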
model_metrics – a list of metric names that should be used when evaluating if a model trained using a single numerical variable to predict a categorical variable is good enough. If the trained model presents a good performance for the metrics in model_metrics, then both variables being analyzed are considered to be correlated. This parameter must be a list, and the allowed values in this list are: [“f1”, “auc”, “accuracy”, “precision”, “recall”]. This parameter is ignored if method_num_cat != ‘model’;
metric_th – given the metrics provided by model_metrics, a pair of variables being analyzed are only considered correlated if all metrics in model_metrics achieve a score greater than metric_th over the test set (the variables being analyzed are split into training and test set internally). This parameter is ignored if method_num_cat != ‘model’;
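The "all metrics must pass" rule is easy to misread as "any metric", so here is a small sketch of it. It scores a set of binary predictions in plain Python and checks that every requested metric is strictly above the threshold. The helper names are hypothetical, the predictions stand in for the output of the internal model on the test set, and "auc" is left out because it needs predicted probabilities rather than labels.

```python
def binary_scores(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def passes_all_metrics(scores, model_metrics, metric_th=0.9):
    """Correlated only if every listed metric is strictly above the threshold."""
    return all(scores[name] > metric_th for name in model_metrics)


y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]  # one false positive
scores = binary_scores(y_true, y_pred)
```

With these predictions F1 clears 0.9 but precision does not, so adding "precision" to the list flips the outcome from correlated to not correlated.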
method_cat_cat –
the method used to test the correlation between two categorical variables. There is only one option implemented:
’cramer’: performs the Cramer’s V test between two categorical variables. This test returns a value between 0 and 1, where values near 1 indicate a high correlation between the variables and a p-value associated with this metric. If the Cramer’s V correlation coefficient is greater than cat_corr_th and its p-value is smaller than cat_pvalue_th, then both variables are considered to be correlated.
If set to None, then none of the correlations between two categorical variables will be tested;
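For reference, Cramér's V is derived from the chi-squared statistic of the contingency table of the two variables. The sketch below computes the uncorrected coefficient with the standard library; whether the library applies a bias correction is not stated here, so treat this as an assumption. The function names are hypothetical, and the p-value is passed in because it requires the chi-squared distribution (available, for example, through SciPy's `chi2_contingency`).

```python
from collections import Counter
from math import sqrt


def cramers_v(a, b):
    """Uncorrected Cramér's V between two equally long categorical sequences."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)

    # Chi-squared statistic: compare observed cell counts with the counts
    # expected if the two variables were independent.
    chi2 = 0.0
    for value_a, n_a in count_a.items():
        for value_b, n_b in count_b.items():
            expected = n_a * n_b / n
            observed = joint.get((value_a, value_b), 0)
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(count_a), len(count_b)) - 1
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0


def is_cat_pair_correlated(v, p_value, cat_corr_th=0.85, cat_pvalue_th=0.05):
    """Hypothetical helper: high coefficient and a small p-value."""
    return v > cat_corr_th and p_value < cat_pvalue_th


city = ["x", "x", "y", "y", "z", "z"]
region = ["p", "p", "q", "q", "r", "r"]  # fully determined by city
v_perfect = cramers_v(city, region)
v_independent = cramers_v(["x", "x", "y", "y"], ["p", "q", "p", "q"])
```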
cat_corr_th – the threshold used for the Cramer’s V correlation coefficient. Values greater than cat_corr_th indicate a high correlation between two variables, but only if the p-value associated with this coefficient is smaller than cat_pvalue_th;
cat_pvalue_th – check the description for the parameter cat_corr_th for more information;
tie_method –
the method used to choose the variable to remove in case a correlation between them is identified. This is used for all types of correlations: numerical x numerical, categorical x numerical, and categorical x categorical. The possible values are:
“missing”: chooses the variable with the fewest missing values;
“var”: chooses the variable with the largest data dispersion (std / (V - v), where std is the standard deviation of the variable, and V and v are the maximum and minimum values observed in the variable, respectively). Works only for numerical x numerical analysis. Otherwise, it uses the cardinality approach internally;
“cardinality”: chooses the variable with the largest number of distinct values;
In all three cases, if both variables are tied (same dispersion, same number of missing values, or same cardinality), the variable to be removed will be selected randomly;
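The three criteria can be expressed as simple scores, as in the sketch below. It only shows which variable each criterion picks. Several details are assumptions: missing values are represented by None, the population standard deviation is used for the dispersion, and missing values are ignored when counting distinct values. The function names are hypothetical.

```python
import random
from statistics import pstdev


def _present(values):
    return [v for v in values if v is not None]


def missing_count(values):
    return sum(1 for v in values if v is None)


def dispersion(values):
    """std / (max - min), computed over the non-missing values."""
    present = _present(values)
    spread = max(present) - min(present)
    return pstdev(present) / spread if spread else 0.0


def cardinality(values):
    return len(set(_present(values)))


def choose_variable(columns, tie_method="missing"):
    """Return the column name picked by the given criterion."""
    if tie_method == "missing":
        # Fewer missing values is better, so negate the count.
        score = {name: -missing_count(vals) for name, vals in columns.items()}
    elif tie_method == "var":
        score = {name: dispersion(vals) for name, vals in columns.items()}
    elif tie_method == "cardinality":
        score = {name: cardinality(vals) for name, vals in columns.items()}
    else:
        raise ValueError(f"unknown tie_method: {tie_method!r}")
    best = max(score.values())
    winners = [name for name, s in score.items() if s == best]
    return random.choice(winners)  # ties are broken at random


columns = {"col_a": [20, 30, None, 50], "col_b": [1, 2, 2, 2]}
```

Note that the same pair of columns can yield a different choice under each criterion, which is why the setting matters when many pairs are correlated.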
save_json – if True, the summary JSON files are saved according to the paths json_summary, json_corr, and json_uncorr. If False, these JSON files are not saved;
json_summary – when calling the fit method, all correlations will be computed according to the many parameters detailed previously. After computing all this data, everything is saved in a JSON file, which can then be accessed and analyzed carefully. We recommend using a JSON viewing tool for this. This parameter indicates the name of the file where the JSON should be saved (including the path to the file). If set to None, no JSON file will be saved;
json_corr – similar to json_summary, but corresponds to the name of the JSON file that contains only the information of the pairs of variables considered to be correlated (with no repetitions);
json_uncorr – similar to json_summary, but corresponds to the name of the JSON file that contains only the information of the pairs of variables considered NOT to be correlated (with no repetitions);
compute_exact_matches – if True, compute the number of exact matches between two variables and save this information in the json_summary, json_uncorr, and json_corr files;
verbose – indicates whether internal messages should be printed or not.
- get_correlated_pairs()
Returns a copy of the dictionary mapping all correlated pairs found.
- Returns
a copy of the dictionary mapping all correlated pairs found.
- Return type
dict
- get_summary(print_summary: bool = True)
Fetches three internal dictionaries:
self.corr_dict: stores information and correlation metrics for each variable. There is one key for each variable analyzed;
self.corr_pairs: stores information and correlation metrics for all pairs of correlated variables. Each key of this dictionary follows the pattern “{key1} x {key2}”;
self.uncorr_pairs: stores information and correlation metrics for all pairs of uncorrelated variables. Each key of this dictionary follows the pattern “{key1} x {key2}”.
- Parameters
print_summary – if True, print the values stored in the correlated and the uncorrelated dictionary. If False, just return the three dictionaries previously mentioned.
- Returns
three internal dictionaries that summarize the correlations found.
- Return type
tuple
- update_selected_features(num_corr_th: Optional[float] = None, num_pvalue_th: Optional[float] = None, levene_pvalue: Optional[float] = None, anova_pvalue: Optional[float] = None, omega_th: Optional[float] = None, jensen_th: Optional[float] = None, model_metrics: Optional[float] = None, metric_th: Optional[float] = None, cat_corr_th: Optional[float] = None, cat_pvalue_th: Optional[float] = None, save_json: Optional[bool] = None, json_summary: Optional[str] = None, json_corr: Optional[str] = None, json_uncorr: Optional[str] = None)
Updates the parameters associated with the different types of correlations and recomputes the selected features using the new parameter values, without recomputing the correlations. This method allows users to change certain thresholds and metrics without having to recompute all of the correlations, which can be computationally expensive depending on the size of the dataset. The only parameters that can be changed are the ones accepted by this method. If any other parameter needs to be changed, it is necessary to instantiate a new object and call the fit() method again.
- Parameters
num_corr_th – the correlation coefficient value used as a threshold for considering if there is a correlation between two numerical variables. That is, given two variables with a correlation coefficient ‘x’ (computed with the method specified by method_num_num) and an associated p-value ‘p’, the variables are considered correlated only if abs(x) >= num_corr_th and p <= num_pvalue_th;
num_pvalue_th – the p-value used as a threshold when considering if there is a correlation between two numerical variables. The same rule applies: the variables are considered correlated only if abs(x) >= num_corr_th and p <= num_pvalue_th;
levene_pvalue – the threshold used to check if a set of samples are homoscedastic (similar variances across samples). This condition is necessary for the ANOVA test. This check is done using the Levene test, which considers that all samples have similar variances as the null hypothesis. If the p-value associated with this test is high, then the null hypothesis cannot be rejected, thus allowing the ANOVA test to be carried out. This parameter defines the threshold used by the p-value of this test: if p-value > levene_pvalue, then the data is considered to be homoscedastic. This parameter is ignored if method_num_cat != ‘anova’;
anova_pvalue – threshold used by the p-value associated with the F-statistic computed by the ANOVA test. If the p-value < anova_pvalue, then we consider that there is a statistically significant difference between the numerical values of different clusters (clusterized according to the values of the categorical variable). This implies a possible correlation between the numerical and categorical variables, although the F-statistic doesn’t tell us the magnitude of this difference. For that, we use the Omega-Squared metric. This parameter is ignored if method_num_cat != ‘anova’;
omega_th – the threshold used for the omega squared metric. The omega squared is a metric that varies between 0 and 1 that indicates the effect of a categorical variable over the variance of a numerical variable. A value closer to 0 indicates a weak effect, while values closer to 1 show that the categorical variable has a significant impact on the variance of the numerical variable, thus showing a high correlation. If the omega squared is greater than omega_th, then both variables being analyzed are considered to be correlated. This parameter is ignored if method_num_cat != ‘anova’;
jensen_th – when method_num_cat = ‘jensen’, we compare the distribution of each cluster of data using the Jensen-Shannon distance metric. If the distance is close to 1, then the distributions are considered different. If all pairs of clusters have a high distance, then both variables being analyzed are considered to be correlated. This parameter indicates the threshold used to check if a distance metric is high or not: if distance > jensen_th, then the distributions being compared are considered different. Must be a float value within [0, 1]. This parameter is ignored if method_num_cat != ‘jensen’;
model_metrics – a list of metric names that should be used when evaluating if a model trained using a single numerical variable to predict a categorical variable is good enough. If the trained model presents a good performance for the metrics in model_metrics, then both variables being analyzed are considered to be correlated. This parameter must be a list, and the allowed values in this list are: [“f1”, “auc”, “accuracy”, “precision”, “recall”]. This parameter is ignored if method_num_cat != ‘model’;
metric_th – given the metrics provided by model_metrics, a pair of variables being analyzed is only considered correlated if all metrics in model_metrics achieve a score greater than metric_th over the test set (the variables being analyzed are split into training and test set internally). This parameter is ignored if method_num_cat != ‘model’;
cat_corr_th – the threshold used for the Cramer’s V correlation coefficient. Values greater than cat_corr_th indicate a high correlation between two variables, but only if the p-value associated with this coefficient is smaller than cat_pvalue_th;
cat_pvalue_th – check the description for the parameter cat_corr_th for more information;
save_json – if True, the summary JSON files are saved according to the paths json_summary, json_corr, and json_uncorr. If False, these JSON files are not saved;
json_summary – when calling the fit method, all correlations will be computed according to the many parameters detailed previously. After computing all this data, everything is saved in a JSON file, which can then be accessed and analyzed carefully. We recommend using a JSON viewing tool for this. This parameter indicates the name of the file where the JSON should be saved (including the path to the file). If set to None, no JSON file will be saved;
json_corr – similar to json_summary, but corresponds to the name of the JSON file that contains only the information of the pairs of variables considered to be correlated (with no repetitions);
json_uncorr – similar to json_summary, but corresponds to the name of the JSON file that contains only the information of the pairs of variables considered NOT to be correlated (with no repetitions);
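The idea behind update_selected_features() is that the expensive step (computing the metrics for every pair) happens once, while the cheap step (comparing stored metrics against thresholds) can be repeated freely. The sketch below illustrates that separation with a hypothetical cache; the dictionary layout and the function name are assumptions for illustration only, and only the “{key1} x {key2}” key pattern comes from this page.

```python
# Hypothetical cache of metrics computed once during fitting.
cached_metrics = {
    "a x b": {"coef": 0.91, "p_value": 0.001},
    "a x c": {"coef": -0.88, "p_value": 0.03},
    "b x c": {"coef": 0.40, "p_value": 0.20},
}


def select_correlated_pairs(cached, num_corr_th, num_pvalue_th):
    """Re-apply thresholds to stored metrics; nothing is recomputed."""
    return sorted(
        pair
        for pair, metrics in cached.items()
        if abs(metrics["coef"]) >= num_corr_th and metrics["p_value"] <= num_pvalue_th
    )


default_pairs = select_correlated_pairs(cached_metrics, 0.85, 0.05)
stricter_coef = select_correlated_pairs(cached_metrics, 0.90, 0.05)
stricter_pval = select_correlated_pairs(cached_metrics, 0.85, 0.01)
```

Tightening either threshold shrinks the set of correlated pairs without touching the stored metrics, which is why only threshold-like parameters can be changed this way while a change of method requires fitting again.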
Class Diagram