Synthesizer

class raimitigations.dataprocessing.Synthesizer(df: Optional[DataFrame] = None, label_col: Optional[str] = None, X: Optional[DataFrame] = None, y: Optional[DataFrame] = None, transform_pipe: Optional[list] = None, model: Union[BaseTabularModel, str] = 'ctgan', epochs: int = 50, save_file: Optional[str] = None, load_existing: bool = True, strategy: Optional[Union[str, dict, float]] = None, verbose: bool = True)

Bases: DataProcessing

Concrete class that uses generative models (implemented in the sdv library) to create synthetic data for imbalanced datasets. Synthetic data can be created either according to a set of conditions specified by the user or according to predefined strategies based on the minority and majority classes (both considering the label column only).

Parameters
  • df – the dataset to be rebalanced, which is used during the fit() method. This data frame must contain all the features, including the label column (specified in the label_col parameter). This parameter is mandatory if label_col is also provided. The user can also provide this dataset (along with the label_col) when calling the fit() method. If df is provided during the class instantiation, it is not necessary to provide it again when calling fit(). It is also possible to use the X and y instead of df and label_col, although it is mandatory to pass the pair of parameters (X,y) or (df, label_col) either during the class instantiation or during the fit() method;

  • label_col – the name or index of the label column. This parameter is mandatory if df is provided;

  • X – contains only the features of the original dataset, that is, does not contain the label column. This is useful if the user has already separated the features from the label column prior to calling this class. This parameter is mandatory if y is provided;

  • y – contains only the label column of the original dataset. This parameter is mandatory if X is provided;

  • transform_pipe – a list of transformations to be used as a pre-processing pipeline. Each transformation in this list must be a valid subclass of the current library (EncoderOrdinal, BasicImputer, etc.). Some feature selection methods require a dataset with no categorical features or with no missing values (depending on the approach). If no transformations are provided, a set of default transformations will be used, which depends on the feature selection approach (subclass dependent);

  • model

    the model that should be used to generate the synthetic instances. Can be a string or an object that inherits from sdv.tabular.base.BaseTabularModel:

    • BaseTabularModel: an object from one of the following classes: CTGAN, TVAE, CopulaGAN, all from the sdv.tabular module;

    • str: a string that identifies which base model should be created. The base models supported are: CTGAN, TVAE, GaussianCopula, and CopulaGAN. The strings associated with these models are, respectively: “ctgan”, “tvae”, “copula”, and “copula_gan”;

  • epochs – the number of epochs for which the model (specified by the model parameter) should be trained. This parameter is not used when the selected model is from the class GaussianCopula;

  • save_file – the name of the file containing the data of the trained model. After training the model (when calling fit()), the model’s weights are saved in the path specified by this parameter, which can then be loaded and reused for future use. This is useful when training over a large dataset since this results in an extended training time. If the provided value is None, then a default name will be created based on the model’s type and number of epochs. If load_existing is True, then this parameter will indicate which save file should be loaded;

  • load_existing – a boolean value indicating if the model should be loaded or not. If False, a new model is trained and its weights are written to a new save file (overwriting the file specified in save_file if it already exists). If True, the model will be loaded from the file save_file;

  • strategy

    represents the strategy used to generate the artificial instances. This parameter is ignored when n_samples is provided. Strategy can assume the following values:

    • String: one of the following predefined strategies:

      • ’minority’: generates synthetic samples for only the minority class;

      • ’not majority’: generates synthetic samples for all classes but the majority class;

      • ’auto’: equivalent to ‘minority’;

      Note that for a binary classification problem, “minority” is similar to “not majority”;

    • Dictionary: the dictionary must have one key for each of the possible classes found in the label column, and the value associated with each key represents the number of instances desired for that class after the undersampling process is done. Note: this parameter only works with undersampling approaches that allow controlling the number of instances to be undersampled, such as RandomUnderSampler, ClusterCentroids (from imblearn). If any other undersampler is provided in the under_sampler parameter along with a float value for the strategy_under parameter, an error will be raised;

    • Float: a value between [0, 1] that represents the desired ratio between the number of instances of the minority class over the majority class after undersampling. The ratio ‘r’ is given by: \(r = N_m/N_M\) where \(N_m\) is the number of instances of the minority class and \(N_M\) is the number of instances of the majority class after undersampling. Note: this parameter only works with undersampling approaches that allow controlling the number of instances to be undersampled, such as RandomUnderSampler, ClusterCentroids (from imblearn). If any other undersampler is provided in the under_sampler parameter along with a float value for the strategy_under parameter, an error will be raised; If None, the default value is set to “auto”, which is the same as “minority”.

  • verbose – indicates whether internal messages should be printed or not
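
The float form of strategy is a target ratio \(r = N_m/N_M\). As a rough illustration of the arithmetic, and not the library's implementation, the sketch below computes how many minority-class instances would have to be added to reach a given ratio. It assumes the majority class is left unchanged and that new rows are added only to the minority class. The helper name synthetic_needed is hypothetical.

```python
import math
from collections import Counter


def synthetic_needed(labels, ratio):
    """Return how many minority-class rows to add so that N_m / N_M >= ratio.

    Hypothetical helper for illustration only. Assumes the majority class
    keeps its current size and only the smallest class receives new rows.
    """
    if not 0 <= ratio <= 1:
        raise ValueError("ratio must be in [0, 1]")
    counts = Counter(labels)
    n_major = max(counts.values())  # N_M: size of the largest class
    n_minor = min(counts.values())  # N_m: size of the smallest class
    target = math.ceil(ratio * n_major)  # minority size needed for the ratio
    return max(0, target - n_minor)


labels = [0] * 90 + [1] * 10
print(synthetic_needed(labels, 0.5))  # target is 45 minority rows, so 35 more
```

With ratio set to 1.0 the same helper asks for enough rows to make the two classes equal in size.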

fit(X: Optional[Union[DataFrame, ndarray]] = None, y: Optional[Union[Series, ndarray]] = None, df: Optional[Union[DataFrame, ndarray]] = None, label_col: Optional[str] = None)

Prepares the dataset and then calls the fit() method of the underlying model. If the model was loaded from a save file, there is no need to call fit().

Parameters
  • X – contains only the features of the original dataset, that is, does not contain the label column;

  • y – contains only the label column of the original dataset;

  • df – the full dataset;

  • label_col – the name or index of the label column;
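
The data can be given as the pair (df, label_col) or as the pair (X, y). The sketch below shows one way such a pairing rule could be checked. It is an illustration under stated assumptions, not the library's own validation code: the function name resolve_inputs and the error messages are hypothetical, and plain lists stand in for data frames.

```python
def resolve_inputs(df=None, label_col=None, X=None, y=None):
    """Return which input pair was supplied, or raise if the pairing is invalid.

    Hypothetical helper: mirrors the documented rule that the data comes
    either as (df, label_col) or as (X, y).
    """
    uses_df = df is not None or label_col is not None
    uses_xy = X is not None or y is not None
    if uses_df and uses_xy:
        raise ValueError("pass either (df, label_col) or (X, y), not both")
    if uses_df:
        if df is None or label_col is None:
            raise ValueError("df and label_col must be provided together")
        return "df", df, label_col
    if uses_xy:
        if X is None or y is None:
            raise ValueError("X and y must be provided together")
        return "xy", X, y
    # Nothing supplied here: the data must have been given at instantiation.
    return "none", None, None


print(resolve_inputs(df=[[1, 0], [2, 1]], label_col=1)[0])
print(resolve_inputs(X=[[1], [2]], y=[0, 1])[0])
```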

fit_resample(X: Optional[Union[DataFrame, ndarray]] = None, y: Optional[Union[DataFrame, ndarray]] = None, df: Optional[Union[DataFrame, ndarray]] = None, n_samples: Optional[int] = None, conditions: Optional[dict] = None, strategy: Optional[Union[str, dict, float]] = None, label_col: Optional[str] = None)

Transforms a dataset by adding synthetic instances to it. The types of instances created depend on the number of samples provided and the set of conditions specified, or the chosen strategy. Returns a dataset with the original data and the synthetic data generated.

Parameters
  • df – the full dataset to be transformed, which contains the label column (specified during fit());

  • X – contains only the features of the dataset, that is, does not contain the label column;

  • y – contains only the label column of the dataset to be transformed. If the user provides df, X and y must be left as None. Alternatively, if the user provides (X, y), df must be left as None;

  • n_samples – the number of samples that should be created using the set of conditions specified by the ‘conditions’ parameter;

  • conditions – a set of conditions, specified by a dictionary, that defines the characteristics of the synthetic instances that should be created. This parameter indicates the values for certain features that the synthetic instances should have. If None, then no restrictions will be imposed on how to generate the synthetic data;

  • strategy

    represents the strategy used to generate the artificial instances. This parameter is ignored when n_samples is provided. Strategy can assume the following values:

    • String: one of the following predefined strategies:

      • ’minority’: generates synthetic samples for only the minority class;

      • ’not majority’: generates synthetic samples for all classes but the majority class;

      • ’auto’: equivalent to ‘minority’;

      Note that for a binary classification problem, “minority” is similar to “not majority”;

    • Dictionary: the dictionary must have one key for each of the possible classes found in the label column, and the value associated with each key represents the number of instances desired for that class after the undersampling process is done. Note: this parameter only works with undersampling approaches that allow controlling the number of instances to be undersampled, such as RandomUnderSampler, ClusterCentroids (from imblearn). If any other undersampler is provided in the under_sampler parameter along with a float value for the strategy_under parameter, an error will be raised;

    • Float: a value between [0, 1] that represents the desired ratio between the number of instances of the minority class over the majority class after undersampling. The ratio ‘r’ is given by: \(r = N_m/N_M\) where \(N_m\) is the number of instances of the minority class and \(N_M\) is the number of instances of the majority class after undersampling. Note: this parameter only works with undersampling approaches that allow controlling the number of instances to be undersampled, such as RandomUnderSampler, ClusterCentroids (from imblearn). If any other undersampler is provided in the under_sampler parameter along with a float value for the strategy_under parameter, an error will be raised; If None, the default value is set to “auto”, which is the same as “minority”.

  • label_col – the name or index of the label column;

Returns

the transformed dataset.

Return type

pd.DataFrame or np.ndarray
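
A typical call sequence, based only on the signatures documented above, might look like the following sketch. The wrapper rebalance_minority and its synthesizer_cls argument are hypothetical; they exist only so the sequence can be exercised with a stand-in class instead of training a real model. By default the wrapper lazily imports raimitigations.dataprocessing.Synthesizer.

```python
def rebalance_minority(df, label_col, synthesizer_cls=None, epochs=50):
    """Fit a synthesizer on `df` and return `df` plus synthetic minority rows.

    Hypothetical wrapper: `synthesizer_cls` defaults to the documented
    Synthesizer class and can be replaced by a stand-in when testing.
    """
    if synthesizer_cls is None:
        # Imported lazily so this module loads even without raimitigations.
        from raimitigations.dataprocessing import Synthesizer as synthesizer_cls

    synth = synthesizer_cls(
        df=df,
        label_col=label_col,
        model="ctgan",
        epochs=epochs,
        load_existing=False,  # train a new model instead of loading a save file
        verbose=False,
    )
    synth.fit()  # df and label_col were already given to the constructor
    # n_samples is left unset, so the strategy decides what gets generated.
    return synth.fit_resample(df=df, label_col=label_col, strategy="minority")
```

Passing n_samples (optionally with conditions) to fit_resample() instead would cause the strategy argument to be ignored, as described in the parameter list.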

sample(n_samples: int, conditions: Optional[dict] = None)

Encapsulates sample() from the models that inherit from sdv.tabular.base.BaseTabularModel. This allows users to call this method without having to access the model object (self.model) directly.

Parameters
  • n_samples – the number of samples to be generated;

  • conditions – a set of conditions, specified by a dictionary, that defines the characteristics of the synthetic instances that should be created. This parameter indicates the values for certain features that the synthetic instances should have. If None, then no restrictions will be imposed on how to generate the synthetic data.

Returns

a dataset containing the artificial samples.

Return type

pd.DataFrame
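
The conditions dictionary maps feature names to the values the synthetic instances should have. The sketch below only illustrates that format by checking plain dictionaries against a conditions dictionary. It does not reproduce how the sdv models generate conditioned samples, and the helper name rows_matching is hypothetical.

```python
def rows_matching(rows, conditions=None):
    """Keep only the rows whose features equal every value in `conditions`.

    Hypothetical helper: with conditions=None no restriction is applied,
    mirroring the documented behaviour of the `conditions` parameter.
    """
    if not conditions:
        return list(rows)
    return [
        row
        for row in rows
        if all(row.get(name) == value for name, value in conditions.items())
    ]


rows = [
    {"age": 30, "label": 1},
    {"age": 41, "label": 0},
    {"age": 25, "label": 1},
]
print(rows_matching(rows, {"label": 1}))
```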

Class Diagram

Inheritance diagram of raimitigations.dataprocessing.Synthesizer

Example