[1]:
import sys
sys.path.append('../../../notebooks')
import pandas as pd
import numpy as np
from raimitigations.dataprocessing import Synthesizer
from download import download_datasets
Synthesizer Class (SDV)
This class is a wrapper over the generative models for tabular data available in the SDV library. The Synthesizer enhances the interface of these models by allowing the user to specify predefined balancing strategies similar to those used in the imblearn library. This notebook explores different ways to use this class for generating synthetic data and for balancing a dataset.
First of all, let’s load the dataset.
[2]:
data_dir = '../../../datasets/'
download_datasets(data_dir)
dataset = pd.read_csv(data_dir + 'hr_promotion/train.csv')
dataset.drop(columns=['employee_id'], inplace=True)
dataset
[2]:
department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | KPIs_met >80% | awards_won? | avg_training_score | is_promoted | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Sales & Marketing | region_7 | Master's & above | f | sourcing | 1 | 35 | 5.0 | 8 | 1 | 0 | 49 | 0 |
1 | Operations | region_22 | Bachelor's | m | other | 1 | 30 | 5.0 | 4 | 0 | 0 | 60 | 0 |
2 | Sales & Marketing | region_19 | Bachelor's | m | sourcing | 1 | 34 | 3.0 | 7 | 0 | 0 | 50 | 0 |
3 | Sales & Marketing | region_23 | Bachelor's | m | other | 2 | 39 | 1.0 | 10 | 0 | 0 | 50 | 0 |
4 | Technology | region_26 | Bachelor's | m | other | 1 | 45 | 3.0 | 2 | 0 | 0 | 73 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
54803 | Technology | region_14 | Bachelor's | m | sourcing | 1 | 48 | 3.0 | 17 | 0 | 0 | 78 | 0 |
54804 | Operations | region_27 | Master's & above | f | other | 1 | 37 | 2.0 | 6 | 0 | 0 | 56 | 0 |
54805 | Analytics | region_1 | Bachelor's | m | other | 1 | 27 | 5.0 | 3 | 1 | 0 | 79 | 0 |
54806 | Sales & Marketing | region_9 | NaN | m | sourcing | 1 | 29 | 1.0 | 2 | 0 | 0 | 45 | 0 |
54807 | HR | region_22 | Bachelor's | m | other | 1 | 27 | 1.0 | 5 | 0 | 0 | 49 | 0 |
54808 rows × 13 columns
We can check that this dataset is imbalanced.
[3]:
dataset['is_promoted'].value_counts()
[3]:
0 50140
1 4668
Name: is_promoted, dtype: int64
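The minority class corresponds to less than 10% of the instances. A quick way to quantify this, using only the pandas objects already loaded, is to compute the ratio between the minority and majority class counts:

counts = dataset['is_promoted'].value_counts()
ratio = counts.min() / counts.max()  # minority count over majority count
print(f'Imbalance ratio: {ratio:.3f}')  # ~0.093 for this dataset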
We can now instantiate the class. We'll use the default values for the constructor, with the exception of the epochs parameter: a lower number of epochs makes the fit method run faster. Usually, however, the more epochs, the better the model captures the structure of the provided dataset, and thus the more realistic the generated artificial data.
The fit method will train the specified generative model. The default model is CTGAN, but other models can be selected through the model parameter, whose allowed values are: ["ctgan", "copula", "copula_gan", "tvae"]. It is important to note that these models are usually large, and training them takes a considerable amount of time.
[4]:
synth = Synthesizer(epochs=10)
synth.fit(df=dataset, label_col='is_promoted')
/home/mmendonca/anaconda3/envs/resp/lib/python3.9/site-packages/sklearn/mixture/_base.py:277: ConvergenceWarning: Initialization 1 did not converge. Try different init parameters, or increase max_iter, tol or check for degenerate data.
warnings.warn(
/home/mmendonca/anaconda3/envs/resp/lib/python3.9/site-packages/sklearn/mixture/_base.py:143: ConvergenceWarning: Number of distinct clusters (6) found smaller than n_clusters (10). Possibly due to duplicate points in X.
cluster.KMeans(
/home/mmendonca/anaconda3/envs/resp/lib/python3.9/site-packages/ctgan/data_transformer.py:111: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
data[column_name] = data[column_name].to_numpy().flatten()
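For reference, here is a sketch of how a different generative model could be selected, assuming the constructor accepts the model parameter with the values listed above (training times vary considerably between models):

# Sketch: select a different generative model ("ctgan" is the default).
synth_tvae = Synthesizer(model='tvae', epochs=10)
synth_tvae.fit(df=dataset, label_col='is_promoted')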
We can now generate synthetic data. First, let's create a sample of synthetic data without specifying anything else. We do this using the sample() method, which receives the number of samples to be generated and, optionally, a set of conditions for these samples. For now, let's create 500 samples without any conditions:
[10]:
df_sample = synth.sample(500)
df_sample
[10]:
department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | awards_won? | avg_training_score | is_promoted | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | HR | region_10 | Bachelor's | m | sourcing | 1 | 28 | 4.0 | 11 | 1 | 52 | 1 |
1 | Operations | region_26 | Bachelor's | f | sourcing | 1 | 21 | 3.0 | 12 | 0 | 91 | 0 |
2 | HR | region_28 | Bachelor's | f | other | 1 | 34 | 3.0 | 1 | 0 | 64 | 0 |
3 | Analytics | region_7 | Bachelor's | m | sourcing | 1 | 50 | NaN | 1 | 0 | 50 | 1 |
4 | Operations | region_34 | Bachelor's | m | sourcing | 2 | 20 | 3.0 | 6 | 0 | 57 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
495 | HR | region_22 | Bachelor's | f | sourcing | 1 | 29 | 3.0 | 9 | 0 | 67 | 0 |
496 | Operations | region_8 | Bachelor's | f | other | 1 | 27 | NaN | 6 | 0 | 74 | 0 |
497 | Technology | region_2 | Master's & above | f | other | 1 | 34 | 3.0 | 3 | 0 | 50 | 0 |
498 | Operations | region_23 | Bachelor's | m | sourcing | 4 | 37 | 2.0 | 6 | 0 | 62 | 0 |
499 | Sales & Marketing | region_26 | Bachelor's | m | other | 2 | 22 | NaN | 3 | 1 | 95 | 0 |
500 rows × 12 columns
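Since no conditions were given, the generated sample should roughly follow the distribution learned from the original data. We can verify this, for instance, by inspecting the label distribution of the sample:

df_sample['is_promoted'].value_counts()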
We can also create a set of samples and automatically add them to the original dataset. For that, we use the fit_resample() method, which must receive a dataset and the name of the label column, along with the (optional) number of samples to be created:
[11]:
df_resample = synth.fit_resample(df=dataset, label_col='is_promoted', n_samples=1000)
print(df_resample['is_promoted'].value_counts())
0 51038
1 4770
Name: is_promoted, dtype: int64
If we don't specify any of the optional parameters of the fit_resample() method, the default behavior is to use the "minority" strategy, which creates synthetic data for the minority class until its number of samples equals that of the majority class.
[6]:
df_resample = synth.fit_resample(df=dataset, label_col='is_promoted')
print(df_resample['is_promoted'].value_counts())
0 50140
1 50140
Name: is_promoted, dtype: int64
The strategy parameter is similar to imblearn’s strategy parameter: it specifies predefined behaviors for balancing the dataset. Here is the description for this parameter:
strategy: represents the strategy used to generate the artificial instances. This parameter is ignored when n_samples is provided. strategy can assume the following values:
String: one of the following predefined strategies:
‘minority’: generates synthetic samples only for the minority class;
‘not majority’: generates synthetic samples for all classes but the majority class;
‘auto’: equivalent to ‘minority’. Note that, for a binary classification problem, ‘minority’ behaves the same as ‘not majority’;
Dictionary: the dictionary must have one key for each of the possible classes found in the label column, and the value associated with each key represents the number of instances desired for that class after the synthetic instances are added;
Float: a value in the range [0, 1] that represents the desired ratio between the number of instances of the minority class and the majority class after the synthetic instances are added. The ratio ‘r’ is given by \(r = N_m/N_M\), where \(N_m\) is the number of instances of the minority class and \(N_M\) is the number of instances of the majority class after resampling;
If None, the default value is set to ‘auto’, which is the same as ‘minority’.
[7]:
df_resample = synth.fit_resample(df=dataset, label_col='is_promoted', strategy='not majority')
print(df_resample['is_promoted'].value_counts())
0 50140
1 50140
Name: is_promoted, dtype: int64
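The dictionary and float forms of the strategy parameter are not demonstrated in this notebook, but, assuming they behave as described above, a usage sketch would look like this (the target count and ratio below are purely illustrative):

# Sketch: dictionary strategy - desired number of instances per class
# after the synthetic instances are added.
df_dict = synth.fit_resample(df=dataset, label_col='is_promoted',
                             strategy={0: 50140, 1: 25000})

# Sketch: float strategy - desired minority/majority ratio r = N_m/N_M.
df_ratio = synth.fit_resample(df=dataset, label_col='is_promoted', strategy=0.5)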
Let’s show an example of how to use the conditions parameter. Here is a description of this parameter:
conditions: a set of conditions, specified by a dictionary, that defines the characteristics of the synthetic instances that should be created. This parameter indicates the values for certain features that the synthetic instances should have. If None, then no restrictions will be imposed on how to generate the synthetic data;
Let’s use the following conditions: the education feature must be “Below Secondary” and the is_promoted feature must be set to 1.
[12]:
conditions = {"education": "Below Secondary", "is_promoted":1}
df_resample = synth.fit_resample(df=dataset, label_col='is_promoted', n_samples=5000, conditions=conditions)
print(df_resample['is_promoted'].value_counts())
df_resample
0 50140
1 9668
Name: is_promoted, dtype: int64
[12]:
department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | awards_won? | avg_training_score | is_promoted | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Sales & Marketing | region_7 | Master's & above | f | sourcing | 1 | 35 | 5.0 | 8 | 0 | 49 | 0 |
1 | Operations | region_22 | Bachelor's | m | other | 1 | 30 | 5.0 | 4 | 0 | 60 | 0 |
2 | Sales & Marketing | region_19 | Bachelor's | m | sourcing | 1 | 34 | 3.0 | 7 | 0 | 50 | 0 |
3 | Sales & Marketing | region_23 | Bachelor's | m | other | 2 | 39 | 1.0 | 10 | 0 | 50 | 0 |
4 | Technology | region_26 | Bachelor's | m | other | 1 | 45 | 3.0 | 2 | 0 | 73 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4995 | Operations | region_23 | Below Secondary | m | sourcing | 1 | 24 | 3.0 | 3 | 0 | 99 | 1 |
4996 | Technology | region_2 | Below Secondary | f | other | 1 | 23 | 3.0 | 3 | 0 | 79 | 1 |
4997 | Procurement | region_2 | Below Secondary | m | sourcing | 1 | 37 | 3.0 | 1 | 0 | 65 | 1 |
4998 | Sales & Marketing | region_22 | Below Secondary | m | sourcing | 1 | 22 | NaN | 6 | 0 | 57 | 1 |
4999 | Procurement | region_16 | Below Secondary | m | sourcing | 1 | 25 | 4.0 | 5 | 0 | 67 | 1 |
59808 rows × 12 columns
We can see that the 5000 new data instances were automatically merged into the original dataset. This is because we are using the fit_resample() method. If we just want to create these samples with the predefined conditions, without the original dataset, we can use the sample() method, as previously mentioned.
[13]:
conditions = {"education": "Below Secondary", "is_promoted":1}
sample_df = synth.sample(n_samples=5000, conditions=conditions)
print(sample_df['is_promoted'].value_counts())
sample_df
1 5000
Name: is_promoted, dtype: int64
[13]:
department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | awards_won? | avg_training_score | is_promoted | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Sales & Marketing | region_15 | Below Secondary | m | sourcing | 1 | 23 | 3.0 | 3 | 0 | 53 | 1 |
1 | Technology | region_23 | Below Secondary | m | other | 1 | 23 | NaN | 3 | 0 | 95 | 1 |
2 | Technology | region_2 | Below Secondary | f | sourcing | 1 | 37 | 3.0 | 10 | 0 | 93 | 1 |
3 | Sales & Marketing | region_27 | Below Secondary | m | sourcing | 1 | 28 | 3.0 | 1 | 0 | 65 | 1 |
4 | Procurement | region_8 | Below Secondary | m | other | 1 | 39 | 3.0 | 6 | 0 | 77 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4995 | Operations | region_9 | Below Secondary | m | sourcing | 2 | 29 | 3.0 | 14 | 0 | 67 | 1 |
4996 | Sales & Marketing | region_17 | Below Secondary | m | sourcing | 1 | 33 | 3.0 | 3 | 0 | 66 | 1 |
4997 | Operations | region_2 | Below Secondary | f | sourcing | 1 | 26 | 4.0 | 2 | 0 | 57 | 1 |
4998 | Sales & Marketing | region_23 | Below Secondary | f | other | 4 | 30 | 3.0 | 10 | 0 | 65 | 1 |
4999 | HR | region_31 | Below Secondary | m | other | 1 | 27 | 3.0 | 1 | 0 | 92 | 1 |
5000 rows × 12 columns
As we can see, the sampled dataset now contains only instances that satisfy the specified conditions.
Note that generating data with the conditions parameter is slower than generating unconditioned samples.
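To measure this difference on your own setup, a simple timing sketch using the standard library's time module could look like the following (absolute run times depend heavily on hardware and on the trained model):

import time

start = time.perf_counter()
synth.sample(n_samples=500)  # unconditioned sampling
print(f'unconditioned: {time.perf_counter() - start:.2f}s')

start = time.perf_counter()
synth.sample(n_samples=500, conditions={'is_promoted': 1})  # conditioned sampling
print(f'conditioned: {time.perf_counter() - start:.2f}s')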