CorrelatedFeatures

class raimitigations.dataprocessing.CorrelatedFeatures(df: Optional[Union[DataFrame, ndarray]] = None, label_col: Optional[str] = None, X: Optional[Union[DataFrame, ndarray]] = None, y: Optional[Union[DataFrame, ndarray]] = None, transform_pipe: Optional[list] = None, cor_features: Optional[list] = None, method_num_num: list = ['spearman'], num_corr_th: float = 0.85, num_pvalue_th: float = 0.05, method_num_cat: str = 'model', levene_pvalue: float = 0.01, anova_pvalue: float = 0.05, omega_th: float = 0.9, jensen_n_bins: Optional[int] = None, jensen_th: float = 0.8, model_metrics: list = ['f1', 'auc'], metric_th: float = 0.9, method_cat_cat: str = 'cramer', cat_corr_th: float = 0.85, cat_pvalue_th: float = 0.05, tie_method: str = 'missing', save_json: bool = True, json_summary: str = 'summary.json', json_corr: str = 'corr_pairs.json', json_uncorr: str = 'uncorr_pairs.json', compute_exact_matches: bool = True, verbose: bool = True)

Bases: FeatureSelection

Concrete class that measures the correlation between variables (numerical x numerical, categorical x numerical, and categorical x categorical) and drops features that are correlated with another feature.

Parameters
  • df – the data frame to be used during the fit method. This data frame must contain all the features, including the label column (specified in the label_col parameter). This parameter is mandatory if label_col is also provided. The user can also provide this dataset (along with the label_col) when calling the fit() method. If df is provided during the class instantiation, it is not necessary to provide it again when calling fit(). It is also possible to use the X and y instead of df and label_col, although it is mandatory to pass the pair of parameters (X,y) or (df, label_col) either during the class instantiation or during the fit() method;

  • label_col – the name or index of the label column. This parameter is mandatory if df is provided;

  • X – contains only the features of the original dataset, that is, does not contain the label column. This is useful if the user has already separated the features from the label column prior to calling this class. This parameter is mandatory if y is provided;

  • y – contains only the label column of the original dataset. This parameter is mandatory if X is provided;

  • transform_pipe – a list of transformations to be used as a pre-processing pipeline. Each transformation in this list must be a valid subclass of the current library (EncoderOrdinal, BasicImputer, etc.). Some feature selection methods require a dataset with no categorical features or with no missing values (depending on the approach). If no transformations are provided, a set of default transformations will be used, which depends on the feature selection approach (subclass dependent);

  • cor_features – a list of the column names or indexes that should have their correlations checked. If None, all columns are checked for correlations, where each correlation is checked in pairs (all possible column pairs are checked);

  • method_num_num – the method used to test the correlation between numerical variables. Must be a list containing one or more methods (limited to the number of available methods). The available methods are: [“spearman”, “pearson”, “kendall”]. If None, none of the correlations between two numerical variables will be tested;

  • num_corr_th – the correlation coefficient value used as a threshold for deciding whether two numerical variables are correlated. That is, given two variables with a correlation coefficient ‘x’ (computed with the method specified by method_num_num), the pair is considered correlated only if abs(x) >= num_corr_th and the associated p-value ‘p’ satisfies p <= num_pvalue_th;

  • num_pvalue_th – the p-value used as a threshold for deciding whether two numerical variables are correlated. That is, given two variables with a correlation coefficient ‘x’ (computed with the method specified by method_num_num), the pair is considered correlated only if abs(x) >= num_corr_th and the associated p-value ‘p’ satisfies p <= num_pvalue_th;

  • method_num_cat

    the method used to compute the correlation between a categorical and a numerical variable. There are currently three approaches implemented:

    • ’anova’: uses the ANOVA test to identify a correlation. First, we use the Levene test to see if the numerical variable has a similar variance across the different values of the categorical variable (homoscedastic data). If the test passes (that is, if the p-value of the Levene test is greater than levene_pvalue), then we perform the ANOVA test, in which we compute the F-statistic and its associated p-value to see if there is a correlation between the numerical and categorical variables. We also compute the omega-squared metric. If the p-value is less than anova_pvalue and the omega-squared is greater than omega_th, then both variables are considered to be correlated;

    • ’jensen’: first we cluster the numerical values according to their respective values of the categorical variable. We then compute the probability density function of the numerical variable for each cluster (we approximate the PDF with a histogram using jensen_n_bins different bins). The next step is to compute the Jensen-Shannon distance metric between the distribution functions of each pair of clusters. This distance metric varies from 0 to 1, where values closer to 0 mean that both distributions tested are similar and values closer to 1 mean that the distributions are different. If all pairs of distributions tested are considered different (a Jensen-Shannon distance above jensen_th for all pairs tested), then both variables are considered to be correlated;

    • ’model’: trains a simple decision tree using the numerical variable and predicts the categorical variable. Both variables are first divided into a training and test set (70% and 30% of the size of the original variables, respectively). The training set is used to train the decision tree, where the only feature used by the model is the numerical variable and the predicted label is the different values within the categorical variable. After training, the model is used to predict the values of the test set and a set of metrics is computed to assess the performance of the model (the metrics computed are defined by model_metrics). If all metrics computed are above the threshold metric_th, then both variables are considered to be correlated;

    If set to None, then none of the correlations between numerical and categorical variables will be tested;

  • levene_pvalue – the threshold used to check if a set of samples is homoscedastic (similar variances across samples). This condition is necessary for the ANOVA test. This check is done using the Levene test, whose null hypothesis is that all samples have similar variances. If the p-value associated with this test is high, then the null hypothesis is not rejected, thus allowing the ANOVA test to be carried out. This parameter defines the threshold applied to the p-value of this test: if p-value > levene_pvalue, then the data is considered to be homoscedastic. This parameter is ignored if method_num_cat != ‘anova’;

  • anova_pvalue – the threshold applied to the p-value associated with the F-statistic computed by the ANOVA test. If the p-value < anova_pvalue, then we consider that there is a statistically significant difference between the numerical values of different clusters (clustered according to the values of the categorical variable). This implies a possible correlation between the numerical and categorical variables, although the F-statistic doesn’t tell us the magnitude of this difference. For that, we use the omega-squared metric. This parameter is ignored if method_num_cat != ‘anova’;

  • omega_th – the threshold used for the omega squared metric. The omega squared is a metric that varies between 0 and 1 that indicates the effect of a categorical variable over the variance of a numerical variable. A value closer to 0 indicates a weak effect, while values closer to 1 show that the categorical variable has a significant impact on the variance of the numerical variable, thus showing a high correlation. If the omega squared is greater than omega_th, then both variables being analyzed are considered to be correlated. This parameter is ignored if method_num_cat != ‘anova’;

  • jensen_n_bins – the number of bins used for creating the histogram of each cluster of data when method_num_cat = ‘jensen’. For this method, we cluster the numerical data according to the categorical variable. For each cluster, we compute a histogram, which is used to approximate the Probability Density Function of that cluster. This parameter controls the number of bins used during the creation of the histogram. This parameter is ignored if method_num_cat != ‘jensen’. If None, the best number of bins for the numerical variable being analyzed is computed using the Freedman Diaconis rule;

  • jensen_th – when method_num_cat = ‘jensen’, we compare the distribution of each cluster of data using the Jensen-Shannon distance metric. If the distance is close to 1, then the distributions are considered different. If all pairs of clusters have a high distance, then both variables being analyzed are considered to be correlated. This parameter indicates the threshold used to check if a distance metric is high or not: if distance > jensen_th, then the distributions being compared are considered different. Must be a float value within [0, 1]. This parameter is ignored if method_num_cat != ‘jensen’;

  • model_metrics – a list of metric names that should be used when evaluating if a model trained using a single numerical variable to predict a categorical variable is good enough. If the trained model presents a good performance for the metrics in model_metrics, then both variables being analyzed are considered to be correlated. This parameter must be a list, and the allowed values in this list are: [“f1”, “auc”, “accuracy”, “precision”, “recall”]. This parameter is ignored if method_num_cat != ‘model’;

  • metric_th – given the metrics provided by model_metrics, a pair of variables being analyzed are only considered correlated if all metrics in model_metrics achieve a score greater than metric_th over the test set (the variables being analyzed are split into training and test set internally). This parameter is ignored if method_num_cat != ‘model’;

  • method_cat_cat

    the method used to test the correlation between two categorical variables. There is only one option implemented:

    • ’cramer’: performs the Cramer’s V test between two categorical variables. This test returns a value between 0 and 1 (where values near 1 indicate a high correlation between the variables), along with a p-value associated with this metric. If the Cramer’s V correlation coefficient is greater than cat_corr_th and its p-value is smaller than cat_pvalue_th, then both variables are considered to be correlated.

    If set to None, then none of the correlations between two categorical variables will be tested;

  • cat_corr_th – the threshold used for the Cramer’s V correlation coefficient. Values greater than cat_corr_th indicate a high correlation between two variables, but only if the p-value associated with this coefficient is smaller than cat_pvalue_th;

  • cat_pvalue_th – the threshold applied to the p-value associated with the Cramer’s V correlation coefficient. See the description of the parameter cat_corr_th for more information;

  • tie_method

    the method used to choose the variable to remove in case a correlation between them is identified. This is used for all types of correlations: numerical x numerical, categorical x numerical, and categorical x categorical. The possible values are:

    • ”missing”: chooses the variable with the fewest missing values;

    • ”var”: chooses the variable with the largest data dispersion (std / (V - v), where std is the standard deviation of the variable, V and v are the maximum and minimum values observed in the variable, respectively). Works only for numerical x numerical analysis. Otherwise, it uses the cardinality approach internally;

    • ”cardinality”: chooses the variable with the largest number of distinct values;

    In all three cases, if both variables are tied (same dispersion, same number of missing values, or same cardinality), the variable to be removed will be selected randomly;

  • save_json – if True, the summary jsons are saved according to the paths json_summary, json_corr, and json_uncorr. If False, these json files are not saved;

  • json_summary – when calling the fit method, all correlations will be computed according to the many parameters detailed previously. After computing all this data, everything is saved in a JSON file, which can then be accessed and analyzed carefully. We recommend using a JSON viewing tool for this. This parameter indicates the name of the file where the JSON should be saved (including the path to the file). If set to None, no JSON file will be saved;

  • json_corr – similar to json_summary, but corresponds to the name of the JSON file that contains only the information of the pairs of variables considered to be correlated (with no repetitions);

  • json_uncorr – similar to json_summary, but corresponds to the name of the JSON file that contains only the information of the pairs of variables considered NOT to be correlated (with no repetitions);

  • compute_exact_matches – if True, compute the number of exact matches between two variables and save this information in the json_summary, json_uncorr, and json_corr;

  • verbose – indicates whether internal messages should be printed or not.
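The ’jensen’ option of method_num_cat can be illustrated with a minimal, standard-library-only sketch. It is not the library's implementation: the helper names (histogram_probs, jensen_shannon_distance, jensen_correlated), the shared bin edges, and the base-2 logarithm (which keeps the distance within [0, 1]) are assumptions made for illustration.

```python
import math
from itertools import combinations


def histogram_probs(values, n_bins, lo, hi):
    """Approximate a PDF with a normalized histogram over [lo, hi]."""
    width = (hi - lo) / n_bins or 1.0  # avoid a zero width when lo == hi
    counts = [0] * n_bins
    for v in values:
        # Values equal to hi are clamped into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def jensen_shannon_distance(p, q):
    """Jensen-Shannon distance between two discrete distributions (base-2 logs)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; empty bins contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return math.sqrt((kl(p, m) + kl(q, m)) / 2)


def jensen_correlated(groups, n_bins=10, jensen_th=0.8):
    """groups maps each category value to its list of numerical values."""
    all_values = [v for vals in groups.values() for v in vals]
    lo, hi = min(all_values), max(all_values)
    probs = {k: histogram_probs(v, n_bins, lo, hi) for k, v in groups.items()}
    distances = [
        jensen_shannon_distance(probs[a], probs[b])
        for a, b in combinations(probs, 2)
    ]
    # Correlated only if every pair of clusters is farther apart than jensen_th.
    return bool(distances) and all(d > jensen_th for d in distances), distances


# Hypothetical data: two categories whose numerical values do not overlap.
groups = {"low": [0.9, 1.0, 1.1, 1.2], "high": [8.8, 9.0, 9.1, 9.2]}
correlated, distances = jensen_correlated(groups, n_bins=10, jensen_th=0.8)
print(correlated, distances)
```

In this sketch all clusters share the same bin edges so that their histograms are directly comparable; how the library chooses the edges is not stated in this reference.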

get_correlated_pairs()

Returns a copy of the dictionary mapping all correlated pairs found.

Returns

a copy of the dictionary mapping all correlated pairs found.

Return type

dict

get_summary(print_summary: bool = True)

Fetches three internal dictionaries:

  • self.corr_dict: stores information and correlation metrics for each variable. There is one key for each variable analyzed;

  • self.corr_pairs: stores information and correlation metrics for all pairs of correlated variables. Each key of this dictionary follows the pattern “{key1} x {key2}”;

  • self.uncorr_pairs: stores information and correlation metrics for all pairs of uncorrelated variables. Each key of this dictionary follows the pattern “{key1} x {key2}”.

Parameters

print_summary – if True, print the values stored in the correlated and the uncorrelated dictionary. If False, just return the three dictionaries previously mentioned.

Returns

three internal dictionaries that summarize the correlations found.

Return type

tuple
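As a small illustration of the “{key1} x {key2}” key pattern described above, the sketch below splits such keys back into variable names. The dictionary contents and the helper split_pair_key are hypothetical; the inner values of the real dictionaries are not specified in this reference.

```python
def split_pair_key(pair_key):
    """Split a '{key1} x {key2}' key into its two variable names."""
    left, sep, right = pair_key.partition(" x ")
    if not sep:
        raise ValueError(f"not a pair key: {pair_key!r}")
    return left, right


# Hypothetical stand-in for the dictionary of correlated pairs;
# the empty dicts are placeholders for the real per-pair metrics.
corr_pairs = {"age x years_employed": {}, "city x zip_code": {}}

# Build a lookup from each variable to the variables it is paired with.
partners = {}
for key in corr_pairs:
    a, b = split_pair_key(key)
    partners.setdefault(a, []).append(b)
    partners.setdefault(b, []).append(a)

print(partners)
```

Note that a column name that itself contains " x " would make this split ambiguous, so the helper is only a convenience for simple names.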

update_selected_features(num_corr_th: Optional[float] = None, num_pvalue_th: Optional[float] = None, levene_pvalue: Optional[float] = None, anova_pvalue: Optional[float] = None, omega_th: Optional[float] = None, jensen_th: Optional[float] = None, model_metrics: Optional[float] = None, metric_th: Optional[float] = None, cat_corr_th: Optional[float] = None, cat_pvalue_th: Optional[float] = None, save_json: Optional[bool] = None, json_summary: Optional[str] = None, json_corr: Optional[str] = None, json_uncorr: Optional[str] = None)

Updates the parameters associated with the different types of correlations and recomputes the selected features using these new parameter values without recomputing the correlations. This method allows users to change certain thresholds and metrics without having to recompute all of the correlations, which can be computationally expensive depending on the size of the dataset. The only parameters that can be changed are the ones accepted by this method. If any other parameter needs to be changed, it is necessary to instantiate a new object and call the fit() method again.
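The idea of re-applying thresholds to already computed statistics can be sketched as follows. This is a conceptual illustration for the numerical x numerical case only, not the library's code: PairStats, select_correlated, and the cached values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairStats:
    """Statistics computed once for a pair of numerical variables."""
    coef: float
    pvalue: float


def select_correlated(stats, num_corr_th, num_pvalue_th):
    """Re-apply the thresholds to cached statistics; nothing is recomputed."""
    return sorted(
        pair
        for pair, s in stats.items()
        if abs(s.coef) >= num_corr_th and s.pvalue <= num_pvalue_th
    )


# Hypothetical cache, as if produced by an earlier (expensive) fit.
cached = {
    ("a", "b"): PairStats(coef=0.91, pvalue=0.001),
    ("a", "c"): PairStats(coef=-0.80, pvalue=0.010),
    ("b", "c"): PairStats(coef=0.95, pvalue=0.200),
}

strict = select_correlated(cached, num_corr_th=0.85, num_pvalue_th=0.05)
loose = select_correlated(cached, num_corr_th=0.75, num_pvalue_th=0.05)
print(strict)
print(loose)
```

Lowering num_corr_th admits the pair ("a", "c"), while ("b", "c") stays excluded under both settings because its p-value exceeds num_pvalue_th.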

Parameters
  • num_corr_th – the correlation coefficient value used as a threshold for deciding whether two numerical variables are correlated. That is, given two variables with a correlation coefficient ‘x’ (computed with the method specified by method_num_num), the pair is considered correlated only if abs(x) >= num_corr_th and the associated p-value ‘p’ satisfies p <= num_pvalue_th;

  • num_pvalue_th – the p-value used as a threshold for deciding whether two numerical variables are correlated. That is, given two variables with a correlation coefficient ‘x’ (computed with the method specified by method_num_num), the pair is considered correlated only if abs(x) >= num_corr_th and the associated p-value ‘p’ satisfies p <= num_pvalue_th;

  • levene_pvalue – the threshold used to check if a set of samples is homoscedastic (similar variances across samples). This condition is necessary for the ANOVA test. This check is done using the Levene test, whose null hypothesis is that all samples have similar variances. If the p-value associated with this test is high, then the null hypothesis is not rejected, thus allowing the ANOVA test to be carried out. This parameter defines the threshold applied to the p-value of this test: if p-value > levene_pvalue, then the data is considered to be homoscedastic. This parameter is ignored if method_num_cat != ‘anova’;

  • anova_pvalue – the threshold applied to the p-value associated with the F-statistic computed by the ANOVA test. If the p-value < anova_pvalue, then we consider that there is a statistically significant difference between the numerical values of different clusters (clustered according to the values of the categorical variable). This implies a possible correlation between the numerical and categorical variables, although the F-statistic doesn’t tell us the magnitude of this difference. For that, we use the omega-squared metric. This parameter is ignored if method_num_cat != ‘anova’;

  • omega_th – the threshold used for the omega squared metric. The omega squared is a metric that varies between 0 and 1 that indicates the effect of a categorical variable over the variance of a numerical variable. A value closer to 0 indicates a weak effect, while values closer to 1 show that the categorical variable has a significant impact on the variance of the numerical variable, thus showing a high correlation. If the omega squared is greater than omega_th, then both variables being analyzed are considered to be correlated. This parameter is ignored if method_num_cat != ‘anova’;

  • jensen_th – when method_num_cat = ‘jensen’, we compare the distribution of each cluster of data using the Jensen-Shannon distance metric. If the distance is close to 1, then the distributions are considered different. If all pairs of clusters have a high distance, then both variables being analyzed are considered to be correlated. This parameter indicates the threshold used to check if a distance metric is high or not: if distance > jensen_th, then the distributions being compared are considered different. Must be a float value within [0, 1]. This parameter is ignored if method_num_cat != ‘jensen’;

  • model_metrics – a list of metric names that should be used when evaluating if a model trained using a single numerical variable to predict a categorical variable is good enough. If the trained model presents a good performance for the metrics in model_metrics, then both variables being analyzed are considered to be correlated. This parameter must be a list, and the allowed values in this list are: [“f1”, “auc”, “accuracy”, “precision”, “recall”]. This parameter is ignored if method_num_cat != ‘model’;

  • metric_th – given the metrics provided by model_metrics, a pair of variables being analyzed are only considered correlated if all metrics in model_metrics achieve a score greater than metric_th over the test set (the variables being analyzed are split into training and test set internally). This parameter is ignored if method_num_cat != ‘model’;

  • cat_corr_th – the threshold used for the Cramer’s V correlation coefficient. Values greater than cat_corr_th indicate a high correlation between two variables, but only if the p-value associated with this coefficient is smaller than cat_pvalue_th;

  • cat_pvalue_th – the threshold applied to the p-value associated with the Cramer’s V correlation coefficient. See the description of the parameter cat_corr_th for more information;

  • save_json – if True, the summary jsons are saved according to the paths json_summary, json_corr, and json_uncorr. If False, these json files are not saved;

  • json_summary – when calling the fit method, all correlations will be computed according to the many parameters detailed previously. After computing all this data, everything is saved in a JSON file, which can then be accessed and analyzed carefully. We recommend using a JSON viewing tool for this. This parameter indicates the name of the file where the JSON should be saved (including the path to the file). If set to None, no JSON file will be saved;

  • json_corr – similar to json_summary, but corresponds to the name of the JSON file that contains only the information of the pairs of variables considered to be correlated (with no repetitions);

  • json_uncorr – similar to json_summary, but corresponds to the name of the JSON file that contains only the information of the pairs of variables considered NOT to be correlated (with no repetitions);
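For readers unfamiliar with the coefficient compared against cat_corr_th, here is a minimal sketch of the textbook Cramer’s V computed from two categorical sequences. It is an assumption that this matches the library's computation (for example, whether a bias correction is applied is not stated in this reference), and the associated p-value is omitted because it requires the chi-squared distribution. The function name cramers_v is hypothetical.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Uncorrected Cramer's V between two equally long categorical sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))  # observed counts per (x, y) cell
    row = Counter(xs)             # marginal counts of the first variable
    col = Counter(ys)             # marginal counts of the second variable

    # Pearson chi-squared statistic over the full contingency table.
    chi2 = 0.0
    for r, r_count in row.items():
        for c, c_count in col.items():
            expected = r_count * c_count / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(row), len(col)) - 1
    if k == 0:
        return 0.0  # a constant variable carries no association
    return math.sqrt(chi2 / (n * k))


# Hypothetical data: the first pair is perfectly associated, the second is not.
v_dependent = cramers_v(["a", "a", "b", "b"], ["x", "x", "y", "y"])
v_independent = cramers_v(["a", "a", "b", "b"], ["x", "y", "x", "y"])
print(v_dependent, v_independent)
```

With cat_corr_th = 0.85, only the first pair would exceed the coefficient threshold; the p-value check against cat_pvalue_th would still need to pass.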

Class Diagram

Inheritance diagram of raimitigations.dataprocessing.CorrelatedFeatures

Examples