Feature Selection
This sub-module of the dataprocessing package collects a set of transformers that remove a set of unimportant features from a dataset. The difference between each feature selection approach lies in how this importance metric is computed. All the feature selection methods from the dataprocessing package are based on the abstract class presented below, called FeatureSelection.
- class raimitigations.dataprocessing.FeatureSelection(df: Optional[Union[DataFrame, ndarray]] = None, label_col: Optional[str] = None, X: Optional[Union[DataFrame, ndarray]] = None, y: Optional[Union[DataFrame, ndarray]] = None, transform_pipe: Optional[list] = None, verbose: bool = True)
Bases: DataProcessing
Base class for all feature selection subclasses. Implements basic functionalities that can be used by all feature selection approaches.
- Parameters
df – the data frame to be used during the fit method. This data frame must contain all the features, including the label column (specified in the label_col parameter). This parameter is mandatory if label_col is also provided. The user can also provide this dataset (along with the label_col) when calling the fit() method. If df is provided during the class instantiation, it is not necessary to provide it again when calling fit(). It is also possible to use X and y instead of df and label_col, although it is mandatory to pass the pair of parameters (X, y) or (df, label_col) either during the class instantiation or during the fit() method;
label_col – the name or index of the label column. This parameter is mandatory if df is provided;
X – contains only the features of the original dataset, that is, does not contain the label column. This is useful if the user has already separated the features from the label column prior to calling this class. This parameter is mandatory if y is provided;
y – contains only the label column of the original dataset. This parameter is mandatory if X is provided;
transform_pipe – a list of transformations to be used as a pre-processing pipeline. Each transformation in this list must be a valid subclass of the current library (EncoderOrdinal, BasicImputer, etc.). Some feature selection methods require a dataset with no categorical features or with no missing values (depending on the approach). If no transformations are provided, a set of default transformations will be used, which depends on the feature selection approach (subclass dependent);
verbose – indicates whether internal messages should be printed or not.
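The pairing rule for the dataset parameters can be illustrated with a small validation sketch. The helper below, resolve_dataset_args, is hypothetical and is not part of raimitigations. It also assumes that the two pairs are mutually exclusive and that passing nothing at instantiation simply defers the choice to fit(), neither of which the text above states explicitly.

```python
from typing import Any, Optional


def resolve_dataset_args(
    df: Optional[Any] = None,
    label_col: Optional[Any] = None,
    X: Optional[Any] = None,
    y: Optional[Any] = None,
) -> str:
    """Return which pair was supplied: "df", "xy", or "none".

    Hypothetical helper for illustration only; not the library's code.
    """
    # Each parameter is only valid together with its partner.
    if (df is None) != (label_col is None):
        raise ValueError("df and label_col must be provided together")
    if (X is None) != (y is None):
        raise ValueError("X and y must be provided together")

    has_df_pair = df is not None
    has_xy_pair = X is not None
    # Assumption: the two pairs are treated as mutually exclusive.
    if has_df_pair and has_xy_pair:
        raise ValueError("provide either (df, label_col) or (X, y), not both")
    if has_df_pair:
        return "df"
    if has_xy_pair:
        return "xy"
    # Nothing given yet: the pair would have to be passed to fit() instead.
    return "none"


print(resolve_dataset_args(df=[[1.0, 0]], label_col="label"))
print(resolve_dataset_args(X=[[1.0]], y=[0]))
```

Comparing against None rather than relying on truthiness matters here, because label_col may legitimately be the column index 0.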
- fit(X: Optional[Union[DataFrame, ndarray]] = None, y: Optional[Union[Series, ndarray]] = None, df: Optional[Union[DataFrame, ndarray]] = None, label_col: Optional[str] = None)
Default fit method for all feature selection classes that inherit from the current class. The following steps are executed: (i) set the self.df_info and self.y_info attributes, (ii) set the transform list (or create a default one if needed), (iii) fit and then apply the transformations in the self.transform_pipe attribute to the dataset, (iv) call the concrete class's specific _fit() method, and (v) set the self.selected_feat attribute.
- Parameters
X – contains only the features of the original dataset, that is, does not contain the label column;
y – contains only the label column of the original dataset;
df – the full dataset;
label_col – the name or index of the label column;
Check the documentation of the _set_df_mult method (DataProcessing class) for more information on how these parameters work.
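The five steps of fit follow the template-method pattern: the base class fixes the order of operations and each subclass supplies only the selection criterion. The sketch below is a simplified stand-in rather than the raimitigations implementation. SketchFeatureSelection, DropConstantFeatures, and _default_transforms are hypothetical names, the dataset is a plain dict of columns instead of a DataFrame, and the transformations are plain callables with no separate fitting stage.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Optional

# Column name -> values; a stand-in for a pandas DataFrame.
Dataset = Dict[str, List[float]]
Transform = Callable[[Dataset], Dataset]


class SketchFeatureSelection(ABC):
    """Hypothetical stand-in for the base class; not the library's code."""

    def __init__(self, transform_pipe: Optional[List[Transform]] = None):
        self.transform_pipe = transform_pipe
        self.selected_feat: Optional[List[str]] = None

    def _default_transforms(self) -> List[Transform]:
        # A subclass could return an imputer or encoder here; none by default.
        return []

    @abstractmethod
    def _fit(self, X: Dataset, y: List[float]) -> List[str]:
        """Return the names of the features to keep."""

    def fit(self, X: Dataset, y: List[float]) -> "SketchFeatureSelection":
        # (i) store the dataset information
        self.df_info = dict(X)
        self.y_info = list(y)
        # (ii) use the given transform list, or create a default one
        if self.transform_pipe is None:
            self.transform_pipe = self._default_transforms()
        # (iii) apply the pre-processing transformations in order
        data = self.df_info
        for transform in self.transform_pipe:
            data = transform(data)
        # (iv) delegate the actual selection to the concrete class
        selected = self._fit(data, self.y_info)
        # (v) record the result
        self.selected_feat = selected
        return self


class DropConstantFeatures(SketchFeatureSelection):
    """Toy criterion: a feature with a single distinct value is unimportant."""

    def _fit(self, X: Dataset, y: List[float]) -> List[str]:
        return [name for name, values in X.items() if len(set(values)) > 1]


selector = DropConstantFeatures().fit(
    X={"age": [31, 45, 27], "constant": [1, 1, 1], "income": [40.0, 52.5, 38.0]},
    y=[0, 1, 0],
)
print(selector.selected_feat)  # ['age', 'income']
```

Keeping the pre-processing inside fit means a subclass never has to repeat the imputation or encoding logic; it only implements _fit.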
- get_selected_features()
Public method that returns the list of the selected features. The difference between this method and _get_selected_features is that the latter returns the list of features selected by the feature selection method, whereas the current method returns the list of selected features currently assigned to self.selected_feat, which can be manually changed by the user.
- Returns
list containing the names or indices of the currently selected features.
- Return type
list
- set_selected_features(selected: Optional[list] = None)
Sets the self.selected_feat attribute, which indicates the currently selected features. Receives a list of column names that should be set as the currently selected features. If this list is None, then the features selected by the feature selection method (implemented by a concrete class that inherits from the current class) are used instead. This method is meant to be used from the outside by the user (not a private method), allowing the user to manually set the features they want to select in case they disagree with the features selected by the feature selection method. Before setting the self.selected_feat attribute, the method checks that all features in the provided list are present in the dataset provided to the fit method. If one of the features in the list is not present in the dataset, a ValueError is raised.
- Parameters
selected – a list of column names or indexes representing the new selected features. If None, the features selected by the feature selection method are used instead.
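The contract between get_selected_features and set_selected_features can be sketched as follows. SelectedFeaturesSketch is a hypothetical class written for illustration: it models only the behavior described above (manual override, None fallback, ValueError on unknown columns) and does not reproduce the library's internals.

```python
from typing import List, Optional


class SelectedFeaturesSketch:
    """Hypothetical illustration of the getter/setter contract."""

    def __init__(self, fitted_columns: List[str], method_selection: List[str]):
        self.fitted_columns = fitted_columns          # columns seen during fit
        self._method_selection = method_selection     # chosen by the selection method
        self.selected_feat = list(method_selection)   # currently active selection

    def set_selected_features(self, selected: Optional[List[str]] = None) -> None:
        if selected is None:
            # Fall back to what the feature selection method chose.
            self.selected_feat = list(self._method_selection)
            return
        # Validate before assigning, so a bad list leaves the state untouched.
        missing = [col for col in selected if col not in self.fitted_columns]
        if missing:
            raise ValueError(f"features not present in the fitted dataset: {missing}")
        self.selected_feat = list(selected)

    def get_selected_features(self) -> List[str]:
        # Returns the current selection, which may have been set manually.
        return list(self.selected_feat)


sketch = SelectedFeaturesSketch(
    fitted_columns=["age", "income", "zip_code"],
    method_selection=["age", "income"],
)
sketch.set_selected_features(["age", "zip_code"])  # manual override
print(sketch.get_selected_features())
sketch.set_selected_features()                     # back to the method's choice
print(sketch.get_selected_features())
```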
- transform(df: Union[DataFrame, ndarray])
Default behavior for transforming the data for the different feature selection methods. If a concrete class requires a different behavior, just override this method.
- Parameters
df – the dataset used for inference.
- Returns
the transformed dataset.
- Return type
pd.DataFrame or np.ndarray
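The text above does not spell out what the default transform does. A plausible reading, taken here purely as an assumption, is that it keeps only the columns listed in self.selected_feat. The sketch below illustrates that reading on a plain dict of columns; transform_sketch is a hypothetical function, not the library's method.

```python
from typing import Dict, List

# Column name -> values; a stand-in for a pandas DataFrame.
Dataset = Dict[str, List[float]]


def transform_sketch(data: Dataset, selected_feat: List[str]) -> Dataset:
    """Keep only the selected columns (assumed default behavior)."""
    missing = [name for name in selected_feat if name not in data]
    if missing:
        raise ValueError(f"columns missing from the dataset: {missing}")
    # Preserve the order given by the selection, not by the input dataset.
    return {name: data[name] for name in selected_feat}


new_data = {"age": [52, 19], "constant": [1, 1], "income": [61.0, 22.5]}
reduced = transform_sketch(new_data, ["age", "income"])
print(reduced)
```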
The following is a list of all feature selection methods implemented in this module. All of the classes below inherit from the FeatureSelection class, and thus, have access to all of the methods previously shown.
Class Diagram