Rebalance
- class raimitigations.dataprocessing.Rebalance(df: Optional[Union[DataFrame, ndarray]] = None, rebalance_col: Optional[str] = None, X: Optional[Union[DataFrame, ndarray]] = None, y: Optional[Union[DataFrame, ndarray]] = None, transform_pipe: Optional[list] = None, cat_col: Optional[list] = None, strategy_over: Optional[Union[str, dict, float]] = None, k_neighbors: int = 4, over_sampler: Union[BaseSampler, bool] = True, strategy_under: Optional[Union[str, dict, float]] = None, under_sampler: Union[BaseSampler, bool] = False, n_jobs: int = 1, verbose: bool = True)
Bases: DataProcessing
Concrete class that uses undersampling and oversampling approaches (implemented in the imblearn library) to rebalance a dataset. This class serves as a facilitation layer over the imblearn library: it implements several automation processes and default parameters, making it easier to rebalance a dataset using the approaches implemented in imblearn.
- Parameters
df – the dataset to be rebalanced, which is used during the fit method. This data frame must contain all the features, including the rebalance column (specified in the rebalance_col parameter). This parameter is mandatory if rebalance_col is also provided. The user can also provide this dataset (along with the rebalance_col) when calling the fit() method. If df is provided during the class instantiation, it is not necessary to provide it again when calling fit(). It is also possible to use X and y instead of df and rebalance_col, although it is mandatory to pass the pair of parameters (X, y) or (df, rebalance_col) either during the class instantiation or during the fit() method;
rebalance_col – the name or index of the column used to do the rebalance operation. This parameter is mandatory if df is provided;
X – contains only the features of the original dataset, that is, does not contain the column used for rebalancing. This is useful if the user has already separated the features from the label column prior to calling this class. This parameter is mandatory if y is provided;
y – contains only the rebalance column of the original dataset. The rebalance operation is executed based on the data distribution of this column. This parameter is mandatory if X is provided;
transform_pipe – a list of transformations to be used as a pre-processing pipeline. Each transformation in this list must be a valid subclass of the current library (EncoderOrdinal, BasicImputer, etc.). Some feature selection methods require a dataset with no categorical features or with no missing values (depending on the approach). If no transformations are provided, a set of default transformations will be used, which depends on the feature selection approach (subclass dependent);
cat_col – a list of names or indexes of categorical columns. If None, this parameter will be set automatically as a list of all categorical variables in the dataset. These columns are used to determine the default SMOTE type that should be used: if cat_col is None, then use SMOTE; if cat_col represents all columns of the dataset, then use SMOTEN; if cat_col is a subset of columns of the dataset, then use SMOTENC. If a specific SMOTE object is provided in the constructor (using the over_sampler parameter), then the columns in cat_col will be automatically encoded using one-hot encoding (EncoderOHE), unless another encoding transformer is provided in the transform_pipe parameter;
strategy_over –
indicates which oversampling strategy should be used. This parameter can be a string, a float, or a dictionary, and its meaning is similar to what is used by the imblearn library for the SMOTE classes:
Float: a value in the interval [0, 1] that represents the desired ratio between the number of instances of the minority class and the number of instances of the majority class. The ratio 'r' is given by: \(r = N_m/N_M\) where \(N_m\) is the number of instances of the minority class after oversampling and \(N_M\) is the number of instances of the majority class;
String: a string value must be one of the following, which identify preset oversampling strategies (explanations retrieved from imblearn's SMOTE documentation):
'minority': resample only the minority class;
'not minority': resample all classes but the minority class;
'not majority': resample all classes but the majority class;
'all': resample all classes;
'auto': equivalent to 'not majority'.
Dictionary: the dictionary must have one key for each of the possible classes found in the label column, and the value associated with each key represents the number of instances desired for that class after the oversampling process is done;
k_neighbors – an integer value representing the number of neighbors that should be used when creating the artificial samples with SMOTE oversampling. This value is only used if no oversampling object is passed, that is, over_sampler=True;
over_sampler –
this parameter can be a boolean value or a sampler object from imblearn:
Boolean: if a boolean value is passed, it indicates whether the current class should use an oversampling method or not. If True, a default SMOTE is created internally using the parameters provided (such as k_neighbors, n_jobs, strategy_over, etc.), where the SMOTE type (SMOTE, SMOTEN, SMOTENC) is determined automatically based on the dataset provided: if the dataset contains only numerical data, then SMOTE is used; if the dataset contains only categorical features, then SMOTEN is used; and if the dataset contains both numerical and categorical data, then SMOTENC is used;
BaseSampler object: if the value provided is an object that inherits from BaseSampler, then this sampler is used instead of creating a new sampler. The preprocessing steps automatically applied to the dataset change based on which SMOTE type is passed: if the object is SMOTE, then all categorical data is encoded using one-hot encoding (EncoderOHE) and all missing values are imputed using the BasicImputer, but if the object is another SMOTE type (SMOTEN or SMOTENC), then only the imputation preprocessing is applied;
strategy_under –
similar to strategy_over, but instead specifies the strategy to be used for the undersampling approach. This parameter can be a string, a float, or a dictionary, and its meaning is similar to what is used by the imblearn library for the ClusterCentroids class:
Float: a value in the interval [0, 1] that represents the desired ratio between the number of instances of the minority class and the number of instances of the majority class after undersampling. The ratio 'r' is given by: \(r = N_m/N_M\) where \(N_m\) is the number of instances of the minority class and \(N_M\) is the number of instances of the majority class after undersampling. Note: this parameter only works with undersampling approaches that allow controlling the number of instances to be undersampled, such as RandomUnderSampler and ClusterCentroids (from imblearn). If any other undersampler is provided in the under_sampler parameter along with a float value for the strategy_under parameter, an error will be raised;
Dictionary: the dictionary must have one key for each of the possible classes found in the label column, and the value associated with each key represents the number of instances desired for that class after the undersampling process is done. Note: this parameter only works with undersampling approaches that allow controlling the number of instances to be undersampled, such as RandomUnderSampler and ClusterCentroids (from imblearn). If any other undersampler is provided in the under_sampler parameter along with a float value for the strategy_under parameter, an error will be raised;
String: a string value must be one of the following, which identify preset undersampling strategies (explanations retrieved from imblearn's ClusterCentroids documentation):
'majority': resample only the majority class;
'not minority': resample all classes but the minority class;
'not majority': resample all classes but the majority class;
'all': resample all classes;
'auto': equivalent to 'not minority';
under_sampler –
this parameter can be a boolean value or a sampler object from imblearn:
Boolean: if a boolean value is passed, it indicates whether the current class should use an undersampling method or not. If True, a default undersampler is created internally. There are two possible default undersamplers that can be created: (i) a ClusterCentroids is created if the value provided to the strategy_under parameter is a float value or a dictionary (ClusterCentroids allows control over the number of instances that should be undersampled), and (ii) a TomekLinks otherwise;
BaseSampler object: if the value provided is an object that inherits from BaseSampler, then this sampler is used instead of creating a new sampler;
n_jobs – the number of workers used to run the sampling methods. This value is only used when a default sampler (under or over) is created, in which case it is passed to the n_jobs parameter of the imblearn classes;
verbose – indicates whether internal messages should be printed or not.
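The default sampler choices described under cat_col, over_sampler and under_sampler can be summarised as a small decision sketch. This is not the library's code: the two helper functions below are hypothetical, they return the imblearn class names as plain strings, and they assume that cat_col has already been resolved to a concrete list (the automatic detection performed when cat_col is None is not reproduced here).

```python
from typing import Optional, Sequence, Union


def default_over_sampler_name(all_columns: Sequence[str],
                              cat_col: Optional[Sequence[str]]) -> str:
    """Hypothetical helper: name of the SMOTE variant implied by the categorical columns."""
    cat = set(cat_col or [])
    if not cat:
        return "SMOTE"      # only numerical features
    if cat >= set(all_columns):
        return "SMOTEN"     # every feature is categorical
    return "SMOTENC"        # mix of numerical and categorical features


def default_under_sampler_name(strategy_under: Union[str, dict, float, None]) -> str:
    """Hypothetical helper: default undersampler implied by the type of strategy_under."""
    if isinstance(strategy_under, (float, dict)):
        return "ClusterCentroids"   # allows control over the resulting class sizes
    return "TomekLinks"             # string strategies or None


features = ["age", "income", "city"]
print(default_over_sampler_name(features, []))          # SMOTE
print(default_over_sampler_name(features, ["city"]))    # SMOTENC
print(default_over_sampler_name(features, features))    # SMOTEN
print(default_under_sampler_name(0.5))                  # ClusterCentroids
print(default_under_sampler_name("auto"))               # TomekLinks
```

Passing an explicit BaseSampler object through over_sampler or under_sampler bypasses these defaults entirely.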
- fit_resample(X: Optional[Union[DataFrame, ndarray]] = None, y: Optional[Union[DataFrame, ndarray]] = None, df: Optional[Union[DataFrame, ndarray]] = None, rebalance_col: Optional[str] = None)
Runs the oversampling and/or undersampling methods specified by the parameters provided in the constructor method. The following steps are performed: (i) set the dataset, (ii) set the list of categorical columns in the dataset, (iii) check for errors in the inputs provided, (iv) set, fit and apply the transforms in the transform_pipe (if any), (v) set the oversampler to be used, (vi) set the undersampler to be used, (vii) run the oversampler, (viii) run the undersampler, and finally (ix) create a new data frame with the modified data.
- Parameters
X – contains only the features of the original dataset, that is, does not contain the column used for rebalancing. This is useful if the user has already separated the features from the label column prior to calling this class. This parameter is mandatory if y is provided;
y – contains only the rebalance column of the original dataset. The rebalance operation is executed based on the data distribution of this column. This parameter is mandatory if X is provided;
df – the dataset to be rebalanced, which is used during the fit() method. This data frame must contain all the features, including the rebalance column (specified in the rebalance_col parameter). This parameter is mandatory if rebalance_col is also provided. The user can also provide this dataset (along with the rebalance_col) when calling the fit() method. If df is provided during the class instantiation, it is not necessary to provide it again when calling fit(). It is also possible to use X and y instead of df and rebalance_col, although it is mandatory to pass the pair of parameters (X, y) or (df, rebalance_col) either during the class instantiation or during the fit() method;
rebalance_col – the name or index of the column used to do the rebalance operation. This parameter is mandatory if df is provided.
- Returns
the transformed dataset.
- Return type
pd.DataFrame or np.ndarray
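To make the float form of strategy_over and strategy_under concrete, the following sketch works out the class counts that the returned dataset would be expected to have for a binary label column, using the ratio \(r = N_m/N_M\) from the parameter descriptions. It is plain arithmetic rather than the library's implementation: the function name is hypothetical, and truncating the result to an integer is an assumption of this sketch.

```python
from collections import Counter


def counts_after_float_strategy(labels, ratio, mode):
    """Class counts implied by r = N_m / N_M for a binary label column.

    mode="over":  the minority class grows until N_m / N_M reaches `ratio`.
    mode="under": the majority class shrinks until N_m / N_M reaches `ratio`.
    """
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("this sketch only handles a binary label column")
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    # most_common() sorts by count, so the majority class comes first
    (major, n_major), (minor, n_minor) = counts.most_common()
    if mode == "over":
        # N_M is fixed; solve N_m = r * N_M (never shrink the minority class)
        return {major: n_major, minor: max(n_minor, int(ratio * n_major))}
    if mode == "under":
        # N_m is fixed; solve N_M = N_m / r (never grow the majority class)
        return {major: min(n_major, int(n_minor / ratio)), minor: n_minor}
    raise ValueError("mode must be 'over' or 'under'")


labels = ["neg"] * 900 + ["pos"] * 100
print(counts_after_float_strategy(labels, 0.5, "over"))   # {'neg': 900, 'pos': 450}
print(counts_after_float_strategy(labels, 0.5, "under"))  # {'neg': 200, 'pos': 100}
```

With 900 majority and 100 minority instances, a ratio of 0.5 therefore means either 350 synthetic minority instances (oversampling) or 700 fewer majority instances (undersampling).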
Class Diagram