Synthesizer
- class raimitigations.dataprocessing.Synthesizer(df: Optional[DataFrame] = None, label_col: Optional[str] = None, X: Optional[DataFrame] = None, y: Optional[DataFrame] = None, transform_pipe: Optional[list] = None, model: Union[BaseTabularModel, str] = 'ctgan', epochs: int = 50, save_file: Optional[str] = None, load_existing: bool = True, strategy: Optional[Union[str, dict, float]] = None, verbose: bool = True)
Bases: DataProcessing
Concrete class that uses generative models (implemented in the sdv library) to create synthetic data for imbalanced datasets. This class can create synthetic data according to a set of conditions specified by the user or according to predefined strategies based on the minority and majority classes (both considering the label column only).
- Parameters
df – the dataset to be rebalanced, which is used during the fit() method. This data frame must contain all the features, including the label column (specified in the label_col parameter). This parameter is mandatory if label_col is also provided. The user can also provide this dataset (along with the label_col) when calling the fit() method. If df is provided during the class instantiation, it is not necessary to provide it again when calling fit(). It is also possible to use X and y instead of df and label_col, although it is mandatory to pass the pair of parameters (X, y) or (df, label_col) either during the class instantiation or during the fit() method;
label_col – the name or index of the label column. This parameter is mandatory if df is provided;
X – contains only the features of the original dataset, that is, does not contain the label column. This is useful if the user has already separated the features from the label column prior to calling this class. This parameter is mandatory if y is provided;
y – contains only the label column of the original dataset. This parameter is mandatory if X is provided;
transform_pipe – a list of transformations to be used as a pre-processing pipeline. Each transformation in this list must be a valid subclass of the current library (EncoderOrdinal, BasicImputer, etc.). Some feature selection methods require a dataset with no categorical features or with no missing values (depending on the approach). If no transformations are provided, a set of default transformations will be used, which depends on the feature selection approach (subclass dependent);
model –
the model that should be used to generate the synthetic instances. Can be a string or an object that inherits from sdv.tabular.base.BaseTabularModel:
- BaseTabularModel: an object from one of the following classes: CTGAN, TVAE, CopulaGAN, all from the sdv.tabular module;
- str: a string that identifies which base model should be created. The base models supported are: CTGAN, TVAE, GaussianCopula, and CopulaGAN. The string values associated with these models are, respectively: “ctgan”, “tvae”, “copula”, and “copula_gan”;
epochs – the number of epochs for which the model (specified by the model parameter) should be trained. This parameter is not used when the selected model is from the class GaussianCopula;
save_file – the name of the file containing the data of the trained model. After training the model (when calling fit()), the model’s weights are saved in the path specified by this parameter, so they can be loaded and reused later. This is useful when training over a large dataset, since this results in an extended training time. If the provided value is None, then a default name will be created based on the model’s type and number of epochs. If load_existing is True, then this parameter will indicate which save file should be loaded;
load_existing – a boolean value indicating if the model should be loaded or not. If False, a new save file will be created (or overwritten if the file specified in save_file already exists) containing the weights of a new model. If True, the model will be loaded from the file save_file;
strategy –
represents the strategy used to generate the artificial instances. This parameter is ignored when n_samples is provided. Strategy can assume the following values:
- String: one of the following predefined strategies:
‘minority’: generates synthetic samples for only the minority class;
‘not majority’: generates synthetic samples for all classes but the majority class;
‘auto’: equivalent to ‘minority’;
Note that for a binary classification problem, “minority” is similar to “not majority”;
- Dictionary: the dictionary must have one key for each of the possible classes found in the label column, and the value associated with each key represents the number of instances desired for that class after the undersampling process is done. Note: this parameter only works with undersampling approaches that allow controlling the number of instances to be undersampled, such as RandomUnderSampler, ClusterCentroids (from imblearn). If any other undersampler is provided in the under_sampler parameter along with a float value for the strategy_under parameter, an error will be raised;
- Float: a value in the range [0, 1] that represents the desired ratio between the number of instances of the minority class over the majority class after undersampling. The ratio ‘r’ is given by: \(r = N_m/N_M\) where \(N_m\) is the number of instances of the minority class and \(N_M\) is the number of instances of the majority class after undersampling. Note: this parameter only works with undersampling approaches that allow controlling the number of instances to be undersampled, such as RandomUnderSampler, ClusterCentroids (from imblearn). If any other undersampler is provided in the under_sampler parameter along with a float value for the strategy_under parameter, an error will be raised;
If None, the default value is set to “auto”, which is the same as “minority”.
verbose – indicates whether internal messages should be printed or not
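The strategy values described above can be made concrete with a small sketch that computes how many synthetic rows each class would need. This is not the library's implementation: the helper name `synthetic_counts` is hypothetical, and it assumes that 'minority' and 'not majority' top classes up to the majority count and that a float ratio is applied as \(N_m = r \cdot N_M\).

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Union

Strategy = Union[str, dict, float, None]


def synthetic_counts(labels: Iterable[Hashable], strategy: Strategy = None) -> Dict[Hashable, int]:
    """Return how many synthetic rows each class would need (illustrative only)."""
    counts = Counter(labels)
    majority_cls, majority_n = counts.most_common(1)[0]
    minority_cls = min(counts, key=counts.get)

    if strategy is None or strategy == "auto":
        strategy = "minority"  # documented: None -> "auto" -> "minority"

    if strategy == "minority":
        # Assumption: the minority class is topped up to the majority count.
        targets = {minority_cls: majority_n}
    elif strategy == "not majority":
        # Assumption: every non-majority class is topped up to the majority count.
        targets = {c: majority_n for c in counts if c != majority_cls}
    elif isinstance(strategy, dict):
        # Values are the desired totals per class after resampling.
        targets = dict(strategy)
    elif isinstance(strategy, float):
        # r = N_m / N_M  =>  desired N_m = r * N_M
        targets = {minority_cls: int(strategy * majority_n)}
    else:
        raise ValueError(f"unsupported strategy: {strategy!r}")

    # Only classes that are below their target receive synthetic rows.
    return {c: t - counts.get(c, 0) for c, t in targets.items() if t > counts.get(c, 0)}
```

For a binary label column with 90 and 10 instances, 'minority' and 'not majority' give the same result in this sketch, which matches the note that the two are similar for binary problems.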
- fit(X: Optional[Union[DataFrame, ndarray]] = None, y: Optional[Union[Series, ndarray]] = None, df: Optional[Union[DataFrame, ndarray]] = None, label_col: Optional[str] = None)
Prepare the dataset and then call fit(). If the model was loaded, then there is no need to call fit().
- Parameters
X – contains only the features of the original dataset, that is, does not contain the label column;
y – contains only the label column of the original dataset;
df – the full dataset;
label_col – the name or index of the label column;
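The pairing rule for these parameters, (X, y) or (df, label_col), can be sketched as a small validator. The function name `resolve_inputs` and the exact error messages are hypothetical, and rejecting a mix of both styles is an assumption for the constructor and fit() (the documentation states it explicitly only for fit_resample()); the library's own checks may differ.

```python
from typing import Any, Optional, Tuple


def resolve_inputs(
    df: Optional[Any] = None,
    label_col: Optional[str] = None,
    X: Optional[Any] = None,
    y: Optional[Any] = None,
) -> Tuple[str, Any, Any]:
    """Return ("df", df, label_col) or ("xy", X, y), depending on the inputs given."""
    has_df = df is not None or label_col is not None
    has_xy = X is not None or y is not None

    if has_df and has_xy:
        # Assumption: mixing the two input styles is treated as an error.
        raise ValueError("pass either (df, label_col) or (X, y), not both")
    if has_df:
        if df is None or label_col is None:
            raise ValueError("df and label_col must be provided together")
        return ("df", df, label_col)
    if has_xy:
        if X is None or y is None:
            raise ValueError("X and y must be provided together")
        return ("xy", X, y)
    raise ValueError("no dataset provided: pass (df, label_col) or (X, y)")
```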
- fit_resample(X: Optional[Union[DataFrame, ndarray]] = None, y: Optional[Union[DataFrame, ndarray]] = None, df: Optional[Union[DataFrame, ndarray]] = None, n_samples: Optional[int] = None, conditions: Optional[dict] = None, strategy: Optional[Union[str, dict, float]] = None, label_col: Optional[str] = None)
Transforms a dataset by adding synthetic instances to it. The types of instances created depend on the number of samples provided and the set of conditions specified, or the chosen strategy. Returns a dataset with the original data and the synthetic data generated.
- Parameters
df – the full dataset to be transformed, which contains the label column (specified during fit());
X – contains only the features of the dataset, that is, does not contain the label column;
y – contains only the label column of the dataset to be transformed. If the user provides df, X and y must be left as None. Alternatively, if the user provides (X, y), df must be left as None;
n_samples – the number of samples that should be created using the set of conditions specified by the ‘conditions’ parameter;
conditions – a set of conditions, specified by a dictionary, that defines the characteristics of the synthetic instances that should be created. This parameter indicates the values for certain features that the synthetic instances should have. If None, then no restrictions will be imposed on how to generate the synthetic data;
strategy –
represents the strategy used to generate the artificial instances. This parameter is ignored when n_samples is provided. Strategy can assume the following values:
- String: one of the following predefined strategies:
‘minority’: generates synthetic samples for only the minority class;
‘not majority’: generates synthetic samples for all classes but the majority class;
‘auto’: equivalent to ‘minority’;
Note that for a binary classification problem, “minority” is similar to “not majority”;
- Dictionary: the dictionary must have one key for each of the possible classes found in the label column, and the value associated with each key represents the number of instances desired for that class after the undersampling process is done. Note: this parameter only works with undersampling approaches that allow controlling the number of instances to be undersampled, such as RandomUnderSampler, ClusterCentroids (from imblearn). If any other undersampler is provided in the under_sampler parameter along with a float value for the strategy_under parameter, an error will be raised;
- Float: a value in the range [0, 1] that represents the desired ratio between the number of instances of the minority class over the majority class after undersampling. The ratio ‘r’ is given by: \(r = N_m/N_M\) where \(N_m\) is the number of instances of the minority class and \(N_M\) is the number of instances of the majority class after undersampling. Note: this parameter only works with undersampling approaches that allow controlling the number of instances to be undersampled, such as RandomUnderSampler, ClusterCentroids (from imblearn). If any other undersampler is provided in the under_sampler parameter along with a float value for the strategy_under parameter, an error will be raised;
If None, the default value is set to “auto”, which is the same as “minority”.
label_col – the name or index of the label column;
- Returns
the transformed dataset.
- Return type
pd.DataFrame or np.ndarray
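A minimal usage sketch of the two ways of driving fit_resample(), using only the parameters listed in the signatures above. The toy data, the column names, and the helper names `toy_rows` and `rebalance_examples` are hypothetical. It also assumes that pandas and raimitigations are installed and that the label column may be used as a key in `conditions`.

```python
import random
from typing import Any, Dict, List


def toy_rows(n_major: int = 90, n_minor: int = 10, seed: int = 0) -> List[Dict[str, Any]]:
    """Build a small imbalanced dataset; the column names are hypothetical."""
    rng = random.Random(seed)
    rows = [{"age": rng.randint(18, 80), "income": rng.random(), "label": 0} for _ in range(n_major)]
    rows += [{"age": rng.randint(18, 80), "income": rng.random(), "label": 1} for _ in range(n_minor)]
    return rows


def rebalance_examples(rows: List[Dict[str, Any]], label_col: str = "label"):
    # Imported lazily: both packages are only needed when the sketch is actually run.
    import pandas as pd
    from raimitigations.dataprocessing import Synthesizer

    df = pd.DataFrame(rows)
    synth = Synthesizer(df=df, label_col=label_col, model="ctgan", epochs=50, load_existing=False)
    synth.fit()  # df and label_col were already given to the constructor

    # Strategy-driven: generate synthetic rows for the minority class only.
    by_strategy = synth.fit_resample(df=df, strategy="minority")

    # Condition-driven: n_samples is provided, so any strategy would be ignored.
    by_conditions = synth.fit_resample(df=df, n_samples=20, conditions={label_col: 1})
    return by_strategy, by_conditions
```

Both calls return the original data together with the synthetic rows, so the results are larger than the input dataset.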
- sample(n_samples: int, conditions: Optional[dict] = None)
Encapsulates sample() from the models that inherit from sdv.tabular.base.BaseTabularModel. This allows users to call this method without directly accessing the model object (self.model).
- Parameters
n_samples – the number of samples to be generated;
conditions – a set of conditions, specified by a dictionary, that defines the characteristics of the synthetic instances that should be created. This parameter indicates the values for certain features that the synthetic instances should have. If None, then no restrictions will be imposed on how to generate the synthetic data.
- Returns
a dataset containing the artificial samples.
- Return type
pd.DataFrame
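To illustrate what the `conditions` dictionary means, the sketch below checks rows against it: every synthetic row is expected to carry the given value for each listed feature. The helper `rows_matching` and the column names are hypothetical. This is a way to inspect a returned sample, not how the underlying sdv model generates it.

```python
from typing import Any, Dict, Iterable, List, Optional


def rows_matching(rows: Iterable[Dict[str, Any]], conditions: Optional[Dict[str, Any]] = None) -> List[Dict[str, Any]]:
    """Keep the rows whose features equal every value listed in `conditions`."""
    if not conditions:
        # No conditions: nothing is restricted, so every row qualifies.
        return list(rows)
    return [row for row in rows if all(row.get(col) == val for col, val in conditions.items())]


sample = [
    {"age": 30, "gender": "F", "label": 1},
    {"age": 45, "gender": "M", "label": 1},
    {"age": 52, "gender": "F", "label": 0},
]
matching = rows_matching(sample, {"gender": "F", "label": 1})
```

With a real model, the equivalent call would be along the lines of `synth.sample(n_samples=100, conditions={"gender": "F", "label": 1})`, after which all returned rows should pass this check.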
Class Diagram