Rebalance
- class raimitigations.dataprocessing.Rebalance(df: Optional[Union[DataFrame, ndarray]] = None, rebalance_col: Optional[str] = None, X: Optional[Union[DataFrame, ndarray]] = None, y: Optional[Union[DataFrame, ndarray]] = None, transform_pipe: Optional[list] = None, cat_col: Optional[list] = None, strategy_over: Optional[Union[str, dict, float]] = None, k_neighbors: int = 4, over_sampler: Union[BaseSampler, bool] = True, strategy_under: Optional[Union[str, dict, float]] = None, under_sampler: Union[BaseSampler, bool] = False, n_jobs: int = 1, verbose: bool = True)
Bases: DataProcessing
Concrete class that uses undersampling and oversampling approaches (implemented in the imblearn library) to rebalance a dataset. This class serves as a facilitation layer over the imblearn library: it automates several steps and supplies default parameters, making it easier to rebalance a dataset using the approaches implemented in imblearn.
- Parameters
df – the dataset to be rebalanced, which is used during the fit method. This data frame must contain all the features, including the rebalance column (specified in the rebalance_col parameter). This parameter is mandatory if rebalance_col is also provided. The user can also provide this dataset (along with rebalance_col) when calling the fit() method. If df is provided during the class instantiation, it is not necessary to provide it again when calling fit(). It is also possible to use X and y instead of df and rebalance_col, but one of the pairs (X, y) or (df, rebalance_col) must be passed either during the class instantiation or when calling fit();
rebalance_col – the name or index of the column used to do the rebalance operation. This parameter is mandatory if df is provided;
X – contains only the features of the original dataset, that is, it does not contain the column used for rebalancing. This is useful if the user has already separated the features from the label column prior to calling this class. This parameter is mandatory if y is provided;
y – contains only the rebalance column of the original dataset. The rebalance operation is executed based on the data distribution of this column. This parameter is mandatory if X is provided;
transform_pipe – a list of transformations to be used as a pre-processing pipeline. Each transformation in this list must be a valid subclass of the current library (EncoderOrdinal, BasicImputer, etc.). Some feature selection methods require a dataset with no categorical features or with no missing values (depending on the approach). If no transformations are provided, a set of default transformations will be used, which depends on the feature selection approach (subclass dependent);
cat_col – a list of names or indexes of categorical columns. If None, this parameter will be set automatically as a list of all categorical variables in the dataset. These columns are used to determine the default SMOTE type that should be used: if cat_col is None, then SMOTE is used; if cat_col represents all columns of the dataset, then SMOTEN is used; if cat_col is a subset of the columns of the dataset, then SMOTENC is used. If a specific SMOTE object is provided in the constructor (using the over_sampler parameter), then the columns in cat_col will be automatically encoded using one-hot encoding (EncoderOHE), unless another encoding transformer is provided in the transform_pipe parameter;
strategy_over –
indicates which oversampling strategy should be used. This parameter can be a string, a float, or a dictionary, and their meanings are similar to those used by the imblearn library for the SMOTE classes:
Float: a value in the interval [0, 1] that represents the desired ratio of the number of instances of the minority class to the number of instances of the majority class. The ratio ‘r’ is given by \(r = N_m/N_M\), where \(N_m\) is the number of instances of the minority class after oversampling and \(N_M\) is the number of instances of the majority class;
String: must be one of the following values, which identify preset oversampling strategies (explanations taken from imblearn’s SMOTE documentation):
’minority’: resample only the minority class;
’not minority’: resample all classes but the minority class;
’not majority’: resample all classes but the majority class;
’all’: resample all classes;
’auto’: equivalent to ‘not majority’.
Dictionary: the dictionary must have one key for each of the possible classes found in the label column, and the value associated to each key represents the number of instances desired for that class after the oversampling process is done;
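The float and string forms of strategy_over can be illustrated with a small, library-free sketch. The helper names (minority_target, classes_to_oversample) are hypothetical and only mirror the rules listed above; truncating the target to an integer is an assumption about rounding, and none of this is meant to reproduce how Rebalance or imblearn is implemented internally.

```python
from collections import Counter


def minority_target(labels, ratio):
    """Size the minority class must reach so that N_m / N_M equals `ratio`."""
    counts = Counter(labels)
    if len(counts) != 2:
        # imblearn only accepts a float strategy for binary problems.
        raise ValueError("a float strategy needs exactly two classes")
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    n_majority = max(counts.values())
    # Assumption: the target is truncated to an integer.
    return int(ratio * n_majority)


def classes_to_oversample(labels, strategy="auto"):
    """Return the set of classes touched by a preset oversampling strategy."""
    counts = Counter(labels)
    majority = max(counts, key=counts.get)
    minority = min(counts, key=counts.get)
    if strategy == "auto":
        strategy = "not majority"  # 'auto' is an alias when oversampling
    if strategy == "minority":
        return {minority}
    if strategy == "not minority":
        return set(counts) - {minority}
    if strategy == "not majority":
        return set(counts) - {majority}
    if strategy == "all":
        return set(counts)
    raise ValueError(f"unknown strategy: {strategy!r}")


# 100 majority and 20 minority instances with r = 0.5: the minority class
# must grow until it holds half as many instances as the majority class.
binary_labels = [0] * 100 + [1] * 20
target = minority_target(binary_labels, 0.5)
```

With three classes of sizes 100, 30 and 10, 'auto' (or 'not majority') would select the two smaller classes, while 'minority' would select only the smallest one.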
k_neighbors – an integer representing the number of neighbors used when creating the artificial samples with SMOTE oversampling. This value is only used if no oversampling object is passed, that is, over_sampler=True;
over_sampler –
this parameter can be a boolean value or a sampler object from imblearn:
Boolean: if a boolean value is passed, it indicates whether the current class should use an oversampling method. If True, a default SMOTE is created internally using the parameters provided (such as k_neighbors, n_jobs, strategy_over, etc.), where the SMOTE type (SMOTE, SMOTEN, SMOTENC) is determined automatically based on the dataset provided: if the dataset contains only numerical data, SMOTE is used; if it contains only categorical features, SMOTEN is used; and if it contains both numerical and categorical data, SMOTENC is used;
BaseSampler object: if the value provided is an object that inherits from BaseSampler, then this sampler is used instead of creating a new sampler. The preprocessing steps automatically applied to the dataset change based on which SMOTE type is passed: if the object is SMOTE, then all categorical data is encoded using one-hot encoding (EncoderOHE) and all missing values are imputed using the BasicImputer, but if the object is another SMOTE type (SMOTEN or SMOTENC), then only the imputation preprocessing is applied;
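The rule for picking the default SMOTE variant can be written as a short decision function. This is a sketch of the rule as described above, not the library's code: default_smote_type is a hypothetical helper that returns the name of the variant rather than an imblearn object.

```python
def default_smote_type(columns, cat_col=None):
    """Name of the SMOTE variant matching the mix of column types.

    columns: all feature column names (or indexes) of the dataset.
    cat_col: the subset of `columns` that is categorical.
    """
    all_cols = set(columns)
    categorical = set(cat_col or [])
    unknown = categorical - all_cols
    if unknown:
        raise ValueError(f"cat_col has columns not in the dataset: {sorted(unknown)}")
    if not categorical:
        return "SMOTE"    # only numerical features
    if categorical == all_cols:
        return "SMOTEN"   # only categorical features
    return "SMOTENC"      # numerical and categorical features


# Hypothetical feature names used only for illustration.
features = ["age", "income", "city"]
variant = default_smote_type(features, cat_col=["city"])
```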
strategy_under –
similar to strategy_over, but instead specifies the strategy to be used for the undersampling approach. This parameter can be a string, a float, or a dictionary, and their meanings are similar to those used by the imblearn library for the ClusterCentroids class:
Float: a value in the interval [0, 1] that represents the desired ratio of the number of instances of the minority class to the number of instances of the majority class after undersampling. The ratio ‘r’ is given by \(r = N_m/N_M\), where \(N_m\) is the number of instances of the minority class and \(N_M\) is the number of instances of the majority class after undersampling. Note: this only works with undersampling approaches that allow controlling the number of instances to be undersampled, such as RandomUnderSampler and ClusterCentroids (from imblearn). If any other undersampler is provided in the under_sampler parameter along with a float value for the strategy_under parameter, an error will be raised;
Dictionary: the dictionary must have one key for each of the possible classes found in the label column, and the value associated with each key represents the number of instances desired for that class after the undersampling process is done. Note: this only works with undersampling approaches that allow controlling the number of instances to be undersampled, such as RandomUnderSampler and ClusterCentroids (from imblearn). If any other undersampler is provided in the under_sampler parameter along with a dictionary for the strategy_under parameter, an error will be raised;
String: must be one of the following values, which identify preset undersampling strategies (explanations taken from imblearn’s ClusterCentroids documentation):
’majority’: resample only the majority class;
’not minority’: resample all classes but the minority class;
’not majority’: resample all classes but the majority class;
’all’: resample all classes;
’auto’: equivalent to ‘not minority’;
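For undersampling, the float ratio fixes the size of the majority class instead of the minority class. The sketch below works through that arithmetic and checks the shape of a dictionary strategy. Both helpers (majority_target, check_dict_strategy) are hypothetical names, and the integer truncation is an assumption about rounding.

```python
from collections import Counter


def majority_target(labels, ratio):
    """Size the majority class is reduced to so that N_m / N_M equals `ratio`."""
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("a float strategy needs exactly two classes")
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    n_minority = min(counts.values())
    # r = N_m / N_M  =>  N_M = N_m / r (assumption: truncated to an integer).
    return int(n_minority / ratio)


def check_dict_strategy(labels, strategy):
    """Verify that a dictionary strategy has exactly one key per class."""
    classes = set(labels)
    missing = classes - set(strategy)
    extra = set(strategy) - classes
    if missing or extra:
        raise ValueError(f"missing classes: {missing}, unknown classes: {extra}")
    return True


# 100 majority and 20 minority instances with r = 0.5: the majority class
# is cut down until the minority class is half its size.
labels = [0] * 100 + [1] * 20
n_majority_after = majority_target(labels, 0.5)
dict_ok = check_dict_strategy(labels, {0: 40, 1: 20})
```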
under_sampler –
this parameter can be a boolean value or a sampler object from imblearn:
Boolean: if a boolean value is passed, it indicates whether the current class should use an undersampling method. If True, a default undersampler is created internally. There are two possible default undersamplers: (i) a ClusterCentroids is created if the value provided to the strategy_under parameter is a float or a dictionary (ClusterCentroids allows control over the number of instances that should be undersampled), and (ii) a TomekLinks otherwise;
BaseSampler object: if the value provided is an object that inherits from BaseSampler, then this sampler is used instead of creating a new sampler;
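The choice of the default undersampler depends only on the type of strategy_under. The following sketch restates that rule; default_undersampler_name is a hypothetical helper that returns a class name as a string and covers only the boolean form of under_sampler.

```python
def default_undersampler_name(under_sampler, strategy_under=None):
    """Name of the undersampler created by default, or None if disabled."""
    if under_sampler is False:
        return None  # undersampling is switched off
    if under_sampler is not True:
        raise TypeError("this sketch only covers the boolean form of under_sampler")
    # A float or a dictionary asks for specific class sizes, which requires
    # a sampler that can control how many instances are removed.
    if isinstance(strategy_under, (float, dict)):
        return "ClusterCentroids"
    return "TomekLinks"


chosen = default_undersampler_name(True, strategy_under=0.8)
```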
n_jobs – the number of workers used to run the sampling methods. This value is only used when a default sampler (under or over) is created, in which case it is passed to the n_jobs parameter of the imblearn classes;
verbose – indicates whether internal messages should be printed or not.
- fit_resample(X: Optional[Union[DataFrame, ndarray]] = None, y: Optional[Union[DataFrame, ndarray]] = None, df: Optional[Union[DataFrame, ndarray]] = None, rebalance_col: Optional[str] = None)
Runs the over and/or undersampling methods specified by the parameters provided in the constructor method. The following steps are performed: (i) set the dataset, (ii) set the list of categorical columns in the dataset, (iii) check for errors in the inputs provided, (iv) set, fit and apply the transforms in the transform_pipe (if any), (v) set the oversampler to be used, (vi) set the undersampler to be used, (vii) run the oversampler, (viii) run the undersampler, and finally (ix) create a new data frame with the modified data.
- Parameters
X – contains only the features of the original dataset, that is, it does not contain the column used for rebalancing. This is useful if the user has already separated the features from the label column prior to calling this class. This parameter is mandatory if y is provided;
y – contains only the rebalance column of the original dataset. The rebalance operation is executed based on the data distribution of this column. This parameter is mandatory if X is provided;
df – the dataset to be rebalanced, which is used during the fit() method. This data frame must contain all the features, including the rebalance column (specified in the rebalance_col parameter). This parameter is mandatory if rebalance_col is also provided. The user can also provide this dataset (along with rebalance_col) when calling the fit() method. If df is provided during the class instantiation, it is not necessary to provide it again when calling fit(). It is also possible to use X and y instead of df and rebalance_col, but one of the pairs (X, y) or (df, rebalance_col) must be passed either during the class instantiation or when calling fit();
rebalance_col – the name or index of the column used to do the rebalance operation. This parameter is mandatory if df is provided.
- Returns
the transformed dataset.
- Return type
pd.DataFrame or np.ndarray
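A minimal usage sketch, assuming pandas and raimitigations are installed. The toy dataset, the column names and the helpers make_toy_data, rebalance_toy_data and main are hypothetical; only the Rebalance constructor arguments and fit_resample() come from the signatures documented above. With the default arguments, only oversampling is enabled (over_sampler=True, under_sampler=False).

```python
from collections import Counter


def make_toy_data(n_majority=40, n_minority=8):
    """Build a small imbalanced dataset as a dict of columns (made-up values)."""
    n = n_majority + n_minority
    return {
        "num_0": [float(i % 7) for i in range(n)],
        "num_1": [float((i * 3) % 11) for i in range(n)],
        "label": [0] * n_majority + [1] * n_minority,
    }


def rebalance_toy_data(data):
    """Oversample the minority class of `data` with the default settings."""
    # Imported lazily so that make_toy_data works without these packages.
    import pandas as pd
    from raimitigations.dataprocessing import Rebalance

    df = pd.DataFrame(data)
    # df and rebalance_col are passed to the constructor, so fit_resample()
    # does not need them again.
    rebalance = Rebalance(df=df, rebalance_col="label", verbose=False)
    return rebalance.fit_resample()


def main():
    data = make_toy_data()
    print("class counts before:", Counter(data["label"]))
    new_df = rebalance_toy_data(data)
    # Assumption: the returned data frame keeps the "label" column.
    print("class counts after:", Counter(new_df["label"]))
```

Calling main() prints the class counts before and after rebalancing. The minority class has more than k_neighbors instances here, which SMOTE needs in order to find neighbors for each sample.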