[1]:
import sys
sys.path.append('../../../notebooks')

import pandas as pd
import numpy as np
from raimitigations.dataprocessing import Synthesizer
from download import download_datasets

Synthesizer Class (SDV)

This class is a wrapper over the generative models for tabular data available in the SDV library. The Synthesizer enhances the interface of these models by allowing the user to specify predefined strategies similar to those used in the imblearn library. This notebook explores different ways to use this class to generate synthetic data and to balance a dataset.

First of all, let’s load the dataset.

[2]:
data_dir = '../../../datasets/'
download_datasets(data_dir)
dataset = pd.read_csv(data_dir + 'hr_promotion/train.csv')
dataset.drop(columns=['employee_id'], inplace=True)
dataset
[2]:
department region education gender recruitment_channel no_of_trainings age previous_year_rating length_of_service KPIs_met >80% awards_won? avg_training_score is_promoted
0 Sales & Marketing region_7 Master's & above f sourcing 1 35 5.0 8 1 0 49 0
1 Operations region_22 Bachelor's m other 1 30 5.0 4 0 0 60 0
2 Sales & Marketing region_19 Bachelor's m sourcing 1 34 3.0 7 0 0 50 0
3 Sales & Marketing region_23 Bachelor's m other 2 39 1.0 10 0 0 50 0
4 Technology region_26 Bachelor's m other 1 45 3.0 2 0 0 73 0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
54803 Technology region_14 Bachelor's m sourcing 1 48 3.0 17 0 0 78 0
54804 Operations region_27 Master's & above f other 1 37 2.0 6 0 0 56 0
54805 Analytics region_1 Bachelor's m other 1 27 5.0 3 1 0 79 0
54806 Sales & Marketing region_9 NaN m sourcing 1 29 1.0 2 0 0 45 0
54807 HR region_22 Bachelor's m other 1 27 1.0 5 0 0 49 0

54808 rows × 13 columns

We can check that this dataset is imbalanced.

[3]:
dataset['is_promoted'].value_counts()
[3]:
0    50140
1     4668
Name: is_promoted, dtype: int64
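
To put a number on the imbalance, the counts above can be turned into a minority-to-majority ratio and a minority share. The snippet below is a minimal sketch that hard-codes the counts printed by value_counts(); the helper name imbalance_summary is hypothetical and is not part of raimitigations.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 50140, 1: 4668}


def imbalance_summary(class_counts):
    """Return (minority/majority ratio, minority share of all rows)."""
    minority = min(class_counts.values())
    majority = max(class_counts.values())
    total = sum(class_counts.values())
    return minority / majority, minority / total


ratio, share = imbalance_summary(counts)
print(f"minority/majority ratio: {ratio:.3f}")
print(f"minority share of rows:  {share:.1%}")
```

For this dataset the ratio is about 0.093, so promoted employees make up roughly 8.5% of the rows.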

We can now instantiate the class. We’ll use the default constructor values for now, except for the epochs parameter: a lower number of epochs makes the fit method run faster. In general, though, training for more epochs lets the model capture more of the structure of the dataset, and thus generate more realistic synthetic data.

The fit method will train the generative model specified. The default model used is called CTGAN, but there are also other models, which can be changed using the “model” parameter. The allowed values for this parameter are: [“ctgan”, “copula”, “copula_gan”, “tvae”]. It is important to note that these models are usually large, and it takes a considerable amount of time to train them.

[4]:
synth = Synthesizer(epochs=10)
synth.fit(df=dataset, label_col='is_promoted')
/home/mmendonca/anaconda3/envs/resp/lib/python3.9/site-packages/sklearn/mixture/_base.py:277: ConvergenceWarning: Initialization 1 did not converge. Try different init parameters, or increase max_iter, tol or check for degenerate data.
  warnings.warn(
/home/mmendonca/anaconda3/envs/resp/lib/python3.9/site-packages/sklearn/mixture/_base.py:143: ConvergenceWarning: Number of distinct clusters (6) found smaller than n_clusters (10). Possibly due to duplicate points in X.
  cluster.KMeans(
/home/mmendonca/anaconda3/envs/resp/lib/python3.9/site-packages/ctgan/data_transformer.py:111: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[column_name] = data[column_name].to_numpy().flatten()

We can now generate synthetic data. First, let’s create a sample of synthetic data without specifying anything else. We can do this with the sample() method, which requires the number of samples to generate and optionally accepts a set of conditions for these samples. For now, let’s create 500 samples without any conditions:

[10]:
df_sample = synth.sample(500)
df_sample
[10]:
department region education gender recruitment_channel no_of_trainings age previous_year_rating length_of_service awards_won? avg_training_score is_promoted
0 HR region_10 Bachelor's m sourcing 1 28 4.0 11 1 52 1
1 Operations region_26 Bachelor's f sourcing 1 21 3.0 12 0 91 0
2 HR region_28 Bachelor's f other 1 34 3.0 1 0 64 0
3 Analytics region_7 Bachelor's m sourcing 1 50 NaN 1 0 50 1
4 Operations region_34 Bachelor's m sourcing 2 20 3.0 6 0 57 0
... ... ... ... ... ... ... ... ... ... ... ... ...
495 HR region_22 Bachelor's f sourcing 1 29 3.0 9 0 67 0
496 Operations region_8 Bachelor's f other 1 27 NaN 6 0 74 0
497 Technology region_2 Master's & above f other 1 34 3.0 3 0 50 0
498 Operations region_23 Bachelor's m sourcing 4 37 2.0 6 0 62 0
499 Sales & Marketing region_26 Bachelor's m other 2 22 NaN 3 1 95 0

500 rows × 12 columns

We can also create a set of samples and automatically add it to the original dataset. For that, we use the fit_resample() method, which requires a dataset and optionally accepts the number of samples to create:

[11]:
df_resample = synth.fit_resample(df=dataset, label_col='is_promoted', n_samples=1000)
print(df_resample['is_promoted'].value_counts())
0    51038
1     4770
Name: is_promoted, dtype: int64
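
As a sanity check on the output above, the per-class increase can be computed by subtracting the original counts. This is a small sketch that hard-codes the two value_counts() results shown in this notebook; the variable names are illustrative only.

```python
before = {0: 50140, 1: 4668}  # original dataset
after = {0: 51038, 1: 4770}   # after fit_resample(..., n_samples=1000)

# Number of synthetic rows that ended up in each class.
added = {label: after[label] - before[label] for label in before}
total_added = sum(added.values())

print(added)
print(total_added)
```

The 1000 synthetic rows are split across both classes (898 for class 0 and 102 for class 1), so passing only n_samples enlarges the dataset without balancing it.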

If we don’t specify any of the optional parameters of the fit_resample() method, the default behavior is to use the “minority” strategy, which creates synthetic data for the minority class until it has as many samples as the majority class.

[6]:
df_resample = synth.fit_resample(df=dataset, label_col='is_promoted')
print(df_resample['is_promoted'].value_counts())
0    50140
1    50140
Name: is_promoted, dtype: int64
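
The number of synthetic rows implied by this result is simply the gap between the two classes. The following sketch works it out from the counts printed earlier; it only does arithmetic on those counts and does not call the library.

```python
counts = {0: 50140, 1: 4668}  # original class counts

majority_label = max(counts, key=counts.get)
minority_label = min(counts, key=counts.get)

# Rows that must be generated so the minority class matches the majority class.
n_synthetic = counts[majority_label] - counts[minority_label]
final_total = sum(counts.values()) + n_synthetic

print(n_synthetic, final_total)
```

Here 45472 synthetic rows of class 1 are needed, which brings the dataset to 100280 rows in total.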

The strategy parameter is similar to imblearn’s strategy parameter: it specifies predefined behaviors for balancing the dataset. Here is the description for this parameter:

  • strategy: represents the strategy used to generate the artificial instances. This parameter is ignored when n_samples is provided. Strategy can assume the following values:

    • String: one of the following predefined strategies:

      • ‘minority’: generates synthetic samples for only the minority class;

      • ‘not majority’: generates synthetic samples for all classes but the majority class;

      • ‘auto’: equivalent to ‘minority’. Note that for a binary classification problem, “minority” and “not majority” behave the same way;

    • Dictionary: the dictionary must have one key for each of the possible classes found in the label column, and the value associated with each key represents the number of instances desired for that class after the synthetic instances are generated;

    • Float: a value in the range [0, 1] that represents the desired ratio of the number of instances of the minority class to the number of instances of the majority class after resampling. The ratio ‘r’ is given by: \(r = N_m/N_M\) where \(N_m\) is the number of instances of the minority class and \(N_M\) is the number of instances of the majority class after resampling;

    If None, the default value is set to “auto”, which is the same as “minority”.

[7]:
df_resample = synth.fit_resample(df=dataset, label_col='is_promoted', strategy='not majority')
print(df_resample['is_promoted'].value_counts())
0    50140
1    50140
Name: is_promoted, dtype: int64
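
The two string strategies give the same result above because the dataset has only two classes. The sketch below illustrates where they would differ with three classes. It is not the library's implementation: the function synthetic_rows_needed is hypothetical, the three-class counts are made up, and it assumes that every selected class is raised to the size of the majority class.

```python
def synthetic_rows_needed(class_counts, strategy="auto"):
    """Return {label: synthetic rows to add} for a string strategy.

    Illustrative assumption: each selected class is raised to the size
    of the majority class.
    """
    majority = max(class_counts.values())
    minority_label = min(class_counts, key=class_counts.get)

    if strategy in ("auto", "minority"):
        targets = [minority_label]
    elif strategy == "not majority":
        targets = [c for c, n in class_counts.items() if n < majority]
    else:
        raise ValueError(f"unsupported strategy: {strategy!r}")

    return {label: majority - class_counts[label] for label in targets}


binary = {0: 50140, 1: 4668}                   # counts from this notebook
three_class = {"a": 1000, "b": 400, "c": 100}  # hypothetical counts

print(synthetic_rows_needed(binary, "minority"))
print(synthetic_rows_needed(binary, "not majority"))
print(synthetic_rows_needed(three_class, "minority"))
print(synthetic_rows_needed(three_class, "not majority"))
```

In the binary case both strategies target only class 1. In the hypothetical three-class case, ‘minority’ targets only class "c", while ‘not majority’ targets both "b" and "c".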

Let’s show an example of how to use the conditions parameter. Here is a description of this parameter:

  • conditions: a set of conditions, specified by a dictionary, that defines the characteristics of the synthetic instances that should be created. This parameter indicates the values for certain features that the synthetic instances should have. If None, then no restrictions will be imposed on how to generate the synthetic data;

Let’s use the following conditions: the education feature must be “Below Secondary” and the is_promoted feature must be set to 1.

[12]:
conditions = {"education": "Below Secondary", "is_promoted":1}
df_resample = synth.fit_resample(df=dataset, label_col='is_promoted', n_samples=5000, conditions=conditions)
print(df_resample['is_promoted'].value_counts())
df_resample
0    50140
1     9668
Name: is_promoted, dtype: int64
[12]:
department region education gender recruitment_channel no_of_trainings age previous_year_rating length_of_service awards_won? avg_training_score is_promoted
0 Sales & Marketing region_7 Master's & above f sourcing 1 35 5.0 8 0 49 0
1 Operations region_22 Bachelor's m other 1 30 5.0 4 0 60 0
2 Sales & Marketing region_19 Bachelor's m sourcing 1 34 3.0 7 0 50 0
3 Sales & Marketing region_23 Bachelor's m other 2 39 1.0 10 0 50 0
4 Technology region_26 Bachelor's m other 1 45 3.0 2 0 73 0
... ... ... ... ... ... ... ... ... ... ... ... ...
4995 Operations region_23 Below Secondary m sourcing 1 24 3.0 3 0 99 1
4996 Technology region_2 Below Secondary f other 1 23 3.0 3 0 79 1
4997 Procurement region_2 Below Secondary m sourcing 1 37 3.0 1 0 65 1
4998 Sales & Marketing region_22 Below Secondary m sourcing 1 22 NaN 6 0 57 1
4999 Procurement region_16 Below Secondary m sourcing 1 25 4.0 5 0 67 1

59808 rows × 12 columns

We can see that the 5000 new data instances were automatically merged into the original dataset. This is because we are using the fit_resample() method. If we just want to create these samples with the predefined conditions, without the original dataset, we can use the sample() method, as previously mentioned.

[13]:
conditions = {"education": "Below Secondary", "is_promoted":1}
sample_df = synth.sample(n_samples=5000, conditions=conditions)
print(sample_df['is_promoted'].value_counts())
sample_df
1    5000
Name: is_promoted, dtype: int64
[13]:
department region education gender recruitment_channel no_of_trainings age previous_year_rating length_of_service awards_won? avg_training_score is_promoted
0 Sales & Marketing region_15 Below Secondary m sourcing 1 23 3.0 3 0 53 1
1 Technology region_23 Below Secondary m other 1 23 NaN 3 0 95 1
2 Technology region_2 Below Secondary f sourcing 1 37 3.0 10 0 93 1
3 Sales & Marketing region_27 Below Secondary m sourcing 1 28 3.0 1 0 65 1
4 Procurement region_8 Below Secondary m other 1 39 3.0 6 0 77 1
... ... ... ... ... ... ... ... ... ... ... ... ...
4995 Operations region_9 Below Secondary m sourcing 2 29 3.0 14 0 67 1
4996 Sales & Marketing region_17 Below Secondary m sourcing 1 33 3.0 3 0 66 1
4997 Operations region_2 Below Secondary f sourcing 1 26 4.0 2 0 57 1
4998 Sales & Marketing region_23 Below Secondary f other 4 30 3.0 10 0 65 1
4999 HR region_31 Below Secondary m other 1 27 3.0 1 0 92 1

5000 rows × 12 columns

As we can see, the sampled dataset now contains only instances that satisfy both conditions.

Note that generating data using the conditions parameter is slower than not using it.
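
One way conditional sampling can be implemented is rejection sampling: draw unconditioned rows and keep only those that match the conditions. Whether the Synthesizer or the underlying SDV model works this way is an assumption here, not something this notebook states, but the sketch below illustrates why conditioned generation can require more draws than unconditioned generation. The toy generator, its probabilities, and all function names are hypothetical.

```python
import random


def draw_row(rng):
    """Toy stand-in for a generative model: returns one random row."""
    return {
        "education": rng.choice(
            ["Bachelor's", "Master's & above", "Below Secondary"]
        ),
        "is_promoted": 1 if rng.random() < 0.1 else 0,
    }


def matches(row, conditions):
    """True when the row has the requested value for every conditioned feature."""
    return all(row[col] == value for col, value in conditions.items())


def sample_with_conditions(n_samples, conditions, rng, max_draws=1_000_000):
    """Rejection sampling: draw rows until n_samples of them satisfy conditions."""
    kept, draws = [], 0
    while len(kept) < n_samples and draws < max_draws:
        row = draw_row(rng)
        draws += 1
        if matches(row, conditions):
            kept.append(row)
    return kept, draws


rng = random.Random(0)
conditions = {"education": "Below Secondary", "is_promoted": 1}
rows, draws = sample_with_conditions(50, conditions, rng)
print(len(rows), "rows kept after", draws, "draws")
```

In this toy setup only about one draw in thirty satisfies both conditions, so many more rows are generated than are kept. The matches helper can also be used on its own to check that every returned row meets the conditions.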