Feature Selection

This sub-module of the dataprocessing package collects transformers that remove unimportant features from a dataset. The feature selection approaches differ in how they compute feature importance. All feature selection methods in the dataprocessing package are based on the abstract class FeatureSelection, presented below.

class raimitigations.dataprocessing.FeatureSelection(df: Optional[Union[DataFrame, ndarray]] = None, label_col: Optional[str] = None, X: Optional[Union[DataFrame, ndarray]] = None, y: Optional[Union[DataFrame, ndarray]] = None, transform_pipe: Optional[list] = None, verbose: bool = True)

Bases: DataProcessing

Base class for all feature selection subclasses. Implements basic functionalities that can be used by all feature selection approaches.

Parameters

df – the data frame to be used during the fit method. This data frame must contain all the features, including the label column (specified in the label_col parameter). This parameter is mandatory if label_col is also provided. The user can also provide this dataset (along with label_col) when calling the fit method. If df is provided during the class instantiation, it is not necessary to provide it again when calling fit. It is also possible to use X and y instead of df and label_col, but one of the pairs (X, y) or (df, label_col) must be passed either during the class instantiation or when calling the fit method;
label_col – the name or index of the label column. This parameter is mandatory if df is provided;
X – contains only the features of the original dataset, that is, does not contain the label column. This is useful if the user has already separated the features from the label column prior to calling this class. This parameter is mandatory if y is provided;
y – contains only the label column of the original dataset. This parameter is mandatory if X is provided;
transform_pipe – a list of transformations to be used as a pre-processing pipeline. Each transformation in this list must be a valid subclass of the current library (EncoderOrdinal, BasicImputer, etc.). Some feature selection methods require a dataset with no categorical features or with no missing values (depending on the approach). If no transformations are provided, a set of default transformations will be used, which depends on the feature selection approach (subclass dependent);
verbose – indicates whether internal messages should be printed or not.
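The pairing rule for (df, label_col) and (X, y) can be illustrated with a minimal sketch. The helper below, resolve_inputs, is hypothetical and not part of the library; it uses plain dicts of lists in place of data frames, and rejecting the case where both pairs are given is a choice made for this sketch only.

```python
def resolve_inputs(df=None, label_col=None, X=None, y=None):
    """Hypothetical helper: return (features, labels) from one complete pair.

    In this toy version a dataset is a dict mapping column name -> list of values.
    """
    if (df is None) != (label_col is None):
        raise ValueError("df and label_col must be provided together.")
    if (X is None) != (y is None):
        raise ValueError("X and y must be provided together.")
    if (df is None) == (X is None):
        # Either no pair or both pairs were given; rejecting both is this sketch's choice.
        raise ValueError("Provide exactly one pair: (df, label_col) or (X, y).")
    if X is not None:
        return X, y
    # Split the label column away from the remaining feature columns.
    features = {name: col for name, col in df.items() if name != label_col}
    return features, df[label_col]


features, labels = resolve_inputs(df={"age": [30, 41], "label": [0, 1]}, label_col="label")
print(features)  # {'age': [30, 41]}
print(labels)    # [0, 1]
```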

fit(X: Optional[Union[DataFrame, ndarray]] = None, y: Optional[Union[Series, ndarray]] = None, df: Optional[Union[DataFrame, ndarray]] = None, label_col: Optional[str] = None)

Default fit method for all feature selection classes that inherit from the current class. The following steps are executed: (i) set the self.df_info and self.y_info attributes, (ii) set the transform list (or create a default one if needed), (iii) fit and then apply the transformations in the self.transform_pipe attribute to the dataset, (iv) call the concrete class’s specific _fit() method, and (v) set the self.selected_feat attribute.

Parameters

X – contains only the features of the original dataset, that is, does not contain the label column;
y – contains only the label column of the original dataset;
df – the full dataset;
label_col – the name or index of the label column.

Check the documentation of the _set_df_mult method (DataProcessing class) for more information on how these parameters work.
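The five steps of the fit method follow a template-method pattern, which the sketch below mirrors with a toy class. ToyFeatureSelection, FillMissingWithZero and DropConstantColumns are hypothetical names introduced for illustration; the selection criterion (dropping constant columns) is not one of the library's methods, and datasets are plain dicts of lists rather than data frames.

```python
class ToyFeatureSelection:
    """Hypothetical stand-in that mirrors the five documented fit steps."""

    def __init__(self, transform_pipe=None):
        self.transform_pipe = transform_pipe
        self.selected_feat = None

    def _get_default_transforms(self):
        # Assumption: concrete classes would supply their own defaults.
        return []

    def _fit(self):
        raise NotImplementedError

    def _get_selected_features(self):
        raise NotImplementedError

    def fit(self, X, y):
        # (i) store the features and the labels
        self.df_info = {name: list(col) for name, col in X.items()}
        self.y_info = list(y)
        # (ii) use the given transform list, or fall back to a default one
        if self.transform_pipe is None:
            self.transform_pipe = self._get_default_transforms()
        # (iii) fit, then apply, each transformation in order
        for step in self.transform_pipe:
            self.df_info = step.fit(self.df_info).transform(self.df_info)
        # (iv) run the concrete class's selection logic
        self._fit()
        # (v) record the features that were selected
        self.selected_feat = self._get_selected_features()
        return self


class FillMissingWithZero:
    """Toy pre-processing step: replace None with 0."""

    def fit(self, data):
        return self

    def transform(self, data):
        return {n: [0 if v is None else v for v in col] for n, col in data.items()}


class DropConstantColumns(ToyFeatureSelection):
    """Toy criterion: keep columns that hold more than one distinct value."""

    def _fit(self):
        self._keep = [n for n, col in self.df_info.items() if len(set(col)) > 1]

    def _get_selected_features(self):
        return self._keep


X = {"age": [30, None, 50], "const": [1, 1, 1]}
y = [0, 1, 0]
selector = DropConstantColumns(transform_pipe=[FillMissingWithZero()]).fit(X, y)
print(selector.selected_feat)  # ['age']
```

Keeping the shared steps in the base class means a concrete selector only has to implement _fit and _get_selected_features.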

get_selected_features()

Public method that returns the list of the selected features. The difference between this method and _get_selected_features is that the latter returns the list of features selected by the feature selection method, whereas the current method returns the list of selected features currently assigned to self.selected_feat, which can be manually changed by the user.

Returns: list containing the names or indices of the currently selected features.
Return type: list

set_selected_features(selected: Optional[list] = None)

Sets the self.selected_feat attribute, which indicates the currently selected features. Receives a list of column names that should be set as the currently selected features. If this list is None, the features selected by the feature selection method (implemented by a concrete class that inherits from the current class) are used instead. This is a public method meant to be called by the user, allowing them to manually set the features they want to select in case they disagree with the features chosen by the feature selection method. Before setting the self.selected_feat attribute, the method checks that every feature in the provided list is present in the dataset passed to the fit method. If any feature in the list is not present in the dataset, a ValueError is raised.

Parameters: selected – a list of column names or indexes representing the new selected features. If None, the features selected by the feature selection method are used instead.
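The relationship between get_selected_features, set_selected_features and _get_selected_features can be sketched as follows. SelectionState is a hypothetical class written for this illustration; it only models the bookkeeping described above and assumes the fitted column names and the automatically selected features are already known.

```python
class SelectionState:
    """Hypothetical holder for the selected-feature bookkeeping described above."""

    def __init__(self, fitted_columns, auto_selected):
        self._columns = list(fitted_columns)        # columns seen during fit
        self._auto_selected = list(auto_selected)   # chosen by the selection method
        self.selected_feat = list(auto_selected)

    def _get_selected_features(self):
        # Always reports what the selection method itself chose.
        return list(self._auto_selected)

    def get_selected_features(self):
        # Reports the current selection, which the user may have overridden.
        return list(self.selected_feat)

    def set_selected_features(self, selected=None):
        if selected is None:
            selected = self._get_selected_features()
        missing = [f for f in selected if f not in self._columns]
        if missing:
            raise ValueError(f"Features not found in the fitted dataset: {missing}")
        self.selected_feat = list(selected)


state = SelectionState(["age", "income", "zip"], auto_selected=["age", "income"])
state.set_selected_features(["age", "zip"])   # manual override
manual = state.get_selected_features()
state.set_selected_features()                 # None restores the automatic choice
restored = state.get_selected_features()
print(manual)    # ['age', 'zip']
print(restored)  # ['age', 'income']
```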

transform(df: Union[DataFrame, ndarray])

Default behavior for transforming the data for the different feature selection methods. If a concrete class requires a different behavior, just override this method.

Parameters: df – the dataset used for inference.
Returns: the transformed dataset.
Return type: pd.DataFrame or np.ndarray

The following is a list of all feature selection methods implemented in this module. All of the classes below inherit from the FeatureSelection class and thus have access to all of the methods shown above.

Child Classes

Class Diagram

Inheritance diagram of raimitigations.dataprocessing.CatBoostSelection, raimitigations.dataprocessing.SeqFeatSelection, raimitigations.dataprocessing.CorrelatedFeatures

Examples

SeqFeatSelection Example

CatBoostSelection Example

Checking Correlations Between Variables - A Comprehensive Guide