SeqFeatSelection

class raimitigations.dataprocessing.SeqFeatSelection(df: Optional[Union[DataFrame, ndarray]] = None, label_col: Optional[str] = None, X: Optional[Union[DataFrame, ndarray]] = None, y: Optional[Union[DataFrame, ndarray]] = None, transform_pipe: Optional[list] = None, regression: Optional[bool] = None, estimator: Optional[BaseEstimator] = None, n_feat: Union[int, str, tuple] = 'best', fixed_cols: Optional[list] = None, cv: int = 3, scoring: Optional[str] = None, forward: bool = True, save_json: bool = False, json_summary: str = 'seq_feat_summary.json', n_jobs: int = 1, verbose: bool = True)

Bases: FeatureSelection

Concrete class that uses SequentialFeatureSelector over a dataset. Implements the sequential feature selection method using the mlxtend library. This approach uses an estimator (a classifier or a regressor) and sequentially changes the set of features used for training the model. There are two ways to perform this: (i) forward feature selection or (ii) backward feature selection. The former starts with an empty set of features and tests the performance of the model when inserting each of the non-selected features. The feature with the best score in the test set is added to the selected features. The process then repeats until the desired number of features is reached. Backward feature selection is the opposite: it starts with all the features and removes them one by one. At each step, the feature whose removal has the least impact on the test-set score is removed. This is repeated until the number of remaining features equals the desired number of features.
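The greedy procedure described above can be illustrated with a minimal, library-free sketch. It is not the mlxtend or raimitigations implementation: the function name sequential_selection, the score_fn callback, and the toy weights are all hypothetical, and the score here is a simple stand-in for a cross-validated model score (higher is better).

```python
from typing import Callable, FrozenSet, List, Sequence


def sequential_selection(
    features: Sequence[str],
    score_fn: Callable[[FrozenSet[str]], float],
    n_feat: int,
    forward: bool = True,
) -> List[str]:
    """Greedy forward or backward selection; score_fn returns higher-is-better."""
    if not 1 <= n_feat <= len(features):
        raise ValueError("n_feat must be between 1 and the number of features")

    # Forward starts from the empty set, backward from the full set.
    selected = set() if forward else set(features)

    while len(selected) != n_feat:
        best_feature, best_score = None, float("-inf")
        for feat in features:
            if forward and feat not in selected:
                trial = frozenset(selected | {feat})   # try adding feat
            elif not forward and feat in selected:
                trial = frozenset(selected - {feat})   # try removing feat
            else:
                continue
            score = score_fn(trial)
            if score > best_score:
                best_feature, best_score = feat, score
        # Commit the single change that produced the best score.
        if forward:
            selected.add(best_feature)
        else:
            selected.remove(best_feature)

    # Return the selected features in their original order.
    return [f for f in features if f in selected]


# Hypothetical stand-in for a model score: each feature has a fixed contribution.
FEATURES = ["age", "income", "noise", "zip"]
TOY_WEIGHTS = {"age": 3.0, "income": 2.0, "noise": -1.0, "zip": 0.5}


def toy_score(subset: FrozenSet[str]) -> float:
    return sum(TOY_WEIGHTS[f] for f in subset)


forward_result = sequential_selection(FEATURES, toy_score, n_feat=2, forward=True)
backward_result = sequential_selection(FEATURES, toy_score, n_feat=2, forward=False)
```

With this additive toy score both directions keep the same two features. With a real estimator the two directions can disagree, because the contribution of a feature depends on which other features are present.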

Parameters
  • df – the data frame to be used during the fit method. This data frame must contain all the features, including the label column (specified in the label_col parameter). This parameter is mandatory if label_col is also provided. The user can also provide this dataset (along with the label_col) when calling the fit() method. If df is provided during the class instantiation, it is not necessary to provide it again when calling fit(). It is also possible to use X and y instead of df and label_col, but one of the pairs of parameters, (X, y) or (df, label_col), must be passed either during the class instantiation or when calling the fit() method;

  • label_col – the name or index of the label column. This parameter is mandatory if df is provided;

  • X – contains only the features of the original dataset, that is, does not contain the label column. This is useful if the user has already separated the features from the label column prior to calling this class. This parameter is mandatory if y is provided;

  • y – contains only the label column of the original dataset. This parameter is mandatory if X is provided;

  • transform_pipe – a list of transformations to be used as a pre-processing pipeline. Each transformation in this list must be a valid subclass of the current library (EncoderOrdinal, BasicImputer, etc.). Some feature selection methods require a dataset with no categorical features or with no missing values (depending on the approach). If no transformations are provided, a set of default transformations will be used, which depends on the feature selection approach (subclass dependent);

  • regression – if True and no estimator is provided, then create a default CatBoostRegressor. If False, a CatBoostClassifier is created instead. This parameter is ignored if an estimator is provided using the ‘estimator’ parameter;

  • estimator – a sklearn estimator to be used during the sequential feature selection process. If no estimator is provided, a default classifier or regressor is used (BASE_CLASSIFIER and BASE_REGRESSOR, respectively);

  • n_feat

    the number of features to be selected. Can be an integer, string, or tuple:

    • int: a number between 1 and df.shape[1] (number of features);

    • string: the only value accepted in this case is the “best” string, which selects the number of features with the best score using cross-validation;

    • tuple: a tuple with exactly 2 values, (min, max), where min and max are the minimum and maximum number of features to be selected. The number of features selected is the one that achieved the best score in the cross-validation among the values between min and max;

  • fixed_cols – a list of column names or indices that should always be included in the set of selected features. Note that the number of columns included here must be smaller than n_feat, otherwise there is nothing for the class to do (that is: len(fixed_cols) < n_feat);

  • cv – the number of folds used for the cross-validation;

  • scoring

    the score used to indicate which set of features is better. The set of valid values for this parameter depends on the task being solved: regression or classification. The valid values are:

    • Regression: “neg_mean_squared_error”, “r2”, “neg_median_absolute_error”;

    • Classification: “accuracy”, “f1”, “precision”, “recall”, “roc_auc”.

    If None, “roc_auc” is used for classification tasks, and “r2” is used for regression tasks;

  • forward – if True, a forward sequential feature selection approach is used. If False, a backward sequential feature selection approach is used;

  • save_json – if True, the summary json will be saved in the path specified by the json_summary parameter after calling the fit() method. If False, this json file is not saved;

  • json_summary – the path where the summary with the results obtained by the feature selection process should be saved. This summary is saved after the fit() method is called. Note that this summary is only saved if save_json is set to True;

  • n_jobs – the number of workers used to run the sequential feature selection method;

  • verbose – indicates whether internal messages should be printed or not.
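The rules stated above for n_feat, fixed_cols, and scoring can be summarized in a small validation sketch. The helpers resolve_n_feat and resolve_scoring are hypothetical and are not part of raimitigations. Treating "best" as the range (1, number of features) is an assumption made for illustration, and so is comparing len(fixed_cols) against the upper bound of the range.

```python
from typing import Optional, Sequence, Tuple, Union

# Valid scoring names as listed in the parameter description.
REGRESSION_SCORES = ("neg_mean_squared_error", "r2", "neg_median_absolute_error")
CLASSIFICATION_SCORES = ("accuracy", "f1", "precision", "recall", "roc_auc")


def resolve_n_feat(
    n_feat: Union[int, str, tuple],
    n_columns: int,
    fixed_cols: Sequence = (),
) -> Tuple[int, int]:
    """Translate n_feat into an inclusive (min, max) range of feature counts."""
    if isinstance(n_feat, str):
        if n_feat != "best":
            raise ValueError('the only string accepted for n_feat is "best"')
        low, high = 1, n_columns  # assumption: search over every possible size
    elif isinstance(n_feat, int) and not isinstance(n_feat, bool):
        low = high = n_feat
    elif isinstance(n_feat, tuple) and len(n_feat) == 2:
        low, high = n_feat
    else:
        raise ValueError("n_feat must be an int, 'best', or a (min, max) tuple")

    if not 1 <= low <= high <= n_columns:
        raise ValueError("n_feat must lie between 1 and the number of features")
    # Documented constraint: len(fixed_cols) < n_feat.
    if len(fixed_cols) >= high:
        raise ValueError("len(fixed_cols) must be smaller than n_feat")
    return low, high


def resolve_scoring(scoring: Optional[str], regression: bool) -> str:
    """Apply the documented defaults and check the name against the task."""
    valid = REGRESSION_SCORES if regression else CLASSIFICATION_SCORES
    if scoring is None:
        return "r2" if regression else "roc_auc"
    if scoring not in valid:
        raise ValueError(f"scoring must be one of {valid}")
    return scoring
```

For instance, under these assumptions resolve_n_feat((2, 4), n_columns=5, fixed_cols=["age"]) is accepted, while resolve_n_feat(1, n_columns=5, fixed_cols=["age"]) is rejected because the fixed column already fills the requested size.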

get_summary()

Public method that returns the summary generated by the SequentialFeatureSelector class. This summary is a dictionary where each key represents a different run, which is associated with a secondary dictionary with all the relevant data regarding that particular run.

Returns

a dictionary where each key represents a different run, which is associated with a secondary dictionary with all the relevant data regarding that particular run.

Return type

dict

Class Diagram

Inheritance diagram of raimitigations.dataprocessing.SeqFeatSelection

Example