[1]:
import sys
sys.path.append('../../../notebooks')
import pandas as pd
import numpy as np
from raimitigations.dataprocessing import Synthesizer
from download import download_datasets
Synthesizer Class (SDV)
This class is a wrapper over the generative models for tabular data available in the SDV library. The Synthesizer enhances the interface of these models by allowing the user to specify predefined balancing strategies similar to those used in the imblearn library. This notebook explores different ways to use this class for generating synthetic data and for balancing a dataset.
First of all, let’s load the dataset.
[2]:
data_dir = '../../../datasets/'
download_datasets(data_dir)
dataset = pd.read_csv(data_dir + 'hr_promotion/train.csv')
dataset.drop(columns=['employee_id'], inplace=True)
dataset
[2]:
department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | KPIs_met >80% | awards_won? | avg_training_score | is_promoted | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Sales & Marketing | region_7 | Master's & above | f | sourcing | 1 | 35 | 5.0 | 8 | 1 | 0 | 49 | 0 |
1 | Operations | region_22 | Bachelor's | m | other | 1 | 30 | 5.0 | 4 | 0 | 0 | 60 | 0 |
2 | Sales & Marketing | region_19 | Bachelor's | m | sourcing | 1 | 34 | 3.0 | 7 | 0 | 0 | 50 | 0 |
3 | Sales & Marketing | region_23 | Bachelor's | m | other | 2 | 39 | 1.0 | 10 | 0 | 0 | 50 | 0 |
4 | Technology | region_26 | Bachelor's | m | other | 1 | 45 | 3.0 | 2 | 0 | 0 | 73 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
54803 | Technology | region_14 | Bachelor's | m | sourcing | 1 | 48 | 3.0 | 17 | 0 | 0 | 78 | 0 |
54804 | Operations | region_27 | Master's & above | f | other | 1 | 37 | 2.0 | 6 | 0 | 0 | 56 | 0 |
54805 | Analytics | region_1 | Bachelor's | m | other | 1 | 27 | 5.0 | 3 | 1 | 0 | 79 | 0 |
54806 | Sales & Marketing | region_9 | NaN | m | sourcing | 1 | 29 | 1.0 | 2 | 0 | 0 | 45 | 0 |
54807 | HR | region_22 | Bachelor's | m | other | 1 | 27 | 1.0 | 5 | 0 | 0 | 49 | 0 |
54808 rows × 13 columns
We can check that this dataset is imbalanced.
[3]:
dataset['is_promoted'].value_counts()
[3]:
0 50140
1 4668
Name: is_promoted, dtype: int64
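The minority class corresponds to less than 10% of the instances. A quick way to quantify this, using only the pandas objects already loaded, is to compute the ratio between the minority and majority class counts:

counts = dataset['is_promoted'].value_counts()
ratio = counts.min() / counts.max()  # minority count over majority count
print(f'Imbalance ratio: {ratio:.3f}')  # ~0.093 for this dataset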
We can now instantiate the class. We'll use the default values for the constructor, with the exception of the epochs parameter: a lower number of epochs makes the fit method run faster. Usually, however, the more epochs, the better the model captures the structure of the provided dataset, and thus the more realistic the generated artificial data.
The fit method will train the specified generative model. The default model is CTGAN, but other models can be selected through the model parameter, whose allowed values are: ["ctgan", "copula", "copula_gan", "tvae"]. It is important to note that these models are usually large, and training them takes a considerable amount of time.
[4]:
synth = Synthesizer(epochs=10)
synth.fit(df=dataset, label_col='is_promoted')
/home/mmendonca/anaconda3/envs/resp/lib/python3.9/site-packages/sklearn/mixture/_base.py:277: ConvergenceWarning: Initialization 1 did not converge. Try different init parameters, or increase max_iter, tol or check for degenerate data.
warnings.warn(
/home/mmendonca/anaconda3/envs/resp/lib/python3.9/site-packages/sklearn/mixture/_base.py:143: ConvergenceWarning: Number of distinct clusters (6) found smaller than n_clusters (10). Possibly due to duplicate points in X.
cluster.KMeans(
/home/mmendonca/anaconda3/envs/resp/lib/python3.9/site-packages/ctgan/data_transformer.py:111: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
data[column_name] = data[column_name].to_numpy().flatten()
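For reference, here is a sketch of how a different generative model could be selected, assuming the constructor accepts the model parameter with the values listed above (training times vary considerably between models):

# Sketch: select a different generative model ("ctgan" is the default).
synth_tvae = Synthesizer(model='tvae', epochs=10)
synth_tvae.fit(df=dataset, label_col='is_promoted')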
We can now generate synthetic data. First, let's create a sample of synthetic data without specifying anything else. We do this using the sample() method, which receives the number of samples to be generated and, optionally, a set of conditions for these samples. For now, let's create 500 samples without any conditions:
[10]:
df_sample = synth.sample(500)
df_sample
[10]:
department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | awards_won? | avg_training_score | is_promoted | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | HR | region_10 | Bachelor's | m | sourcing | 1 | 28 | 4.0 | 11 | 1 | 52 | 1 |
1 | Operations | region_26 | Bachelor's | f | sourcing | 1 | 21 | 3.0 | 12 | 0 | 91 | 0 |
2 | HR | region_28 | Bachelor's | f | other | 1 | 34 | 3.0 | 1 | 0 | 64 | 0 |
3 | Analytics | region_7 | Bachelor's | m | sourcing | 1 | 50 | NaN | 1 | 0 | 50 | 1 |
4 | Operations | region_34 | Bachelor's | m | sourcing | 2 | 20 | 3.0 | 6 | 0 | 57 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
495 | HR | region_22 | Bachelor's | f | sourcing | 1 | 29 | 3.0 | 9 | 0 | 67 | 0 |
496 | Operations | region_8 | Bachelor's | f | other | 1 | 27 | NaN | 6 | 0 | 74 | 0 |
497 | Technology | region_2 | Master's & above | f | other | 1 | 34 | 3.0 | 3 | 0 | 50 | 0 |
498 | Operations | region_23 | Bachelor's | m | sourcing | 4 | 37 | 2.0 | 6 | 0 | 62 | 0 |
499 | Sales & Marketing | region_26 | Bachelor's | m | other | 2 | 22 | NaN | 3 | 1 | 95 | 0 |
500 rows × 12 columns
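Since no conditions were given, the generated sample should roughly follow the distribution learned from the original data. We can verify this, for instance, by inspecting the label distribution of the sample:

df_sample['is_promoted'].value_counts()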
We can also create a set of samples and automatically add them to the original dataset. For that, we use the fit_resample() method, which must receive a dataset and the name of the label column, along with the (optional) number of samples to be created:
[11]:
df_resample = synth.fit_resample(df=dataset, label_col='is_promoted', n_samples=1000)
print(df_resample['is_promoted'].value_counts())
0 51038
1 4770
Name: is_promoted, dtype: int64
If we don't specify any of the optional parameters of the fit_resample() method, the default behavior is to use the "minority" strategy, which creates synthetic data for the minority class until its number of samples equals that of the majority class.
[6]:
df_resample = synth.fit_resample(df=dataset, label_col='is_promoted')
print(df_resample['is_promoted'].value_counts())
0 50140
1 50140
Name: is_promoted, dtype: int64
The strategy parameter is similar to imblearn’s strategy parameter: it specifies predefined behaviors for balancing the dataset. Here is the description for this parameter:
strategy: represents the strategy used to generate the artificial instances. This parameter is ignored when n_samples is provided. strategy can assume the following values:
String: one of the following predefined strategies:
‘minority’: generates synthetic samples only for the minority class;
‘not majority’: generates synthetic samples for all classes but the majority class;
‘auto’: equivalent to ‘minority’. Note that, for a binary classification problem, ‘minority’ behaves the same as ‘not majority’;
Dictionary: the dictionary must have one key for each of the possible classes found in the label column, and the value associated with each key represents the number of instances desired for that class after the synthetic instances are added;
Float: a value in the range [0, 1] that represents the desired ratio between the number of instances of the minority class and the majority class after the synthetic instances are added. The ratio ‘r’ is given by \(r = N_m/N_M\), where \(N_m\) is the number of instances of the minority class and \(N_M\) is the number of instances of the majority class after resampling;
If None, the default value is set to ‘auto’, which is the same as ‘minority’.
[7]:
df_resample = synth.fit_resample(df=dataset, label_col='is_promoted', strategy='not majority')
print(df_resample['is_promoted'].value_counts())
0 50140
1 50140
Name: is_promoted, dtype: int64
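The dictionary and float forms of the strategy parameter are not demonstrated in this notebook, but, assuming they behave as described above, a usage sketch would look like this (the target count and ratio below are purely illustrative):

# Sketch: dictionary strategy - desired number of instances per class
# after the synthetic instances are added.
df_dict = synth.fit_resample(df=dataset, label_col='is_promoted',
                             strategy={0: 50140, 1: 25000})

# Sketch: float strategy - desired minority/majority ratio r = N_m/N_M.
df_ratio = synth.fit_resample(df=dataset, label_col='is_promoted', strategy=0.5)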
Let’s show an example of how to use the conditions parameter. Here is a description of this parameter:
conditions: a set of conditions, specified by a dictionary, that defines the characteristics of the synthetic instances that should be created. This parameter indicates the values for certain features that the synthetic instances should have. If None, then no restrictions will be imposed on how to generate the synthetic data;
Let’s use the following conditions: the education feature must be “Below Secondary” and the is_promoted feature must be set to 1.
[12]:
conditions = {"education": "Below Secondary", "is_promoted":1}
df_resample = synth.fit_resample(df=dataset, label_col='is_promoted', n_samples=5000, conditions=conditions)
print(df_resample['is_promoted'].value_counts())
df_resample
0 50140
1 9668
Name: is_promoted, dtype: int64
[12]:
department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | awards_won? | avg_training_score | is_promoted | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Sales & Marketing | region_7 | Master's & above | f | sourcing | 1 | 35 | 5.0 | 8 | 0 | 49 | 0 |
1 | Operations | region_22 | Bachelor's | m | other | 1 | 30 | 5.0 | 4 | 0 | 60 | 0 |
2 | Sales & Marketing | region_19 | Bachelor's | m | sourcing | 1 | 34 | 3.0 | 7 | 0 | 50 | 0 |
3 | Sales & Marketing | region_23 | Bachelor's | m | other | 2 | 39 | 1.0 | 10 | 0 | 50 | 0 |
4 | Technology | region_26 | Bachelor's | m | other | 1 | 45 | 3.0 | 2 | 0 | 73 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4995 | Operations | region_23 | Below Secondary | m | sourcing | 1 | 24 | 3.0 | 3 | 0 | 99 | 1 |
4996 | Technology | region_2 | Below Secondary | f | other | 1 | 23 | 3.0 | 3 | 0 | 79 | 1 |
4997 | Procurement | region_2 | Below Secondary | m | sourcing | 1 | 37 | 3.0 | 1 | 0 | 65 | 1 |
4998 | Sales & Marketing | region_22 | Below Secondary | m | sourcing | 1 | 22 | NaN | 6 | 0 | 57 | 1 |
4999 | Procurement | region_16 | Below Secondary | m | sourcing | 1 | 25 | 4.0 | 5 | 0 | 67 | 1 |
59808 rows × 12 columns
We can see that the 5000 new data instances were automatically merged into the original dataset. This is because we are using the fit_resample() method. If we just want to create these samples with the predefined conditions, without the original dataset, we can use the sample() method, as previously mentioned.
[13]:
conditions = {"education": "Below Secondary", "is_promoted":1}
sample_df = synth.sample(n_samples=5000, conditions=conditions)
print(sample_df['is_promoted'].value_counts())
sample_df
1 5000
Name: is_promoted, dtype: int64
[13]:
department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | awards_won? | avg_training_score | is_promoted | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Sales & Marketing | region_15 | Below Secondary | m | sourcing | 1 | 23 | 3.0 | 3 | 0 | 53 | 1 |
1 | Technology | region_23 | Below Secondary | m | other | 1 | 23 | NaN | 3 | 0 | 95 | 1 |
2 | Technology | region_2 | Below Secondary | f | sourcing | 1 | 37 | 3.0 | 10 | 0 | 93 | 1 |
3 | Sales & Marketing | region_27 | Below Secondary | m | sourcing | 1 | 28 | 3.0 | 1 | 0 | 65 | 1 |
4 | Procurement | region_8 | Below Secondary | m | other | 1 | 39 | 3.0 | 6 | 0 | 77 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4995 | Operations | region_9 | Below Secondary | m | sourcing | 2 | 29 | 3.0 | 14 | 0 | 67 | 1 |
4996 | Sales & Marketing | region_17 | Below Secondary | m | sourcing | 1 | 33 | 3.0 | 3 | 0 | 66 | 1 |
4997 | Operations | region_2 | Below Secondary | f | sourcing | 1 | 26 | 4.0 | 2 | 0 | 57 | 1 |
4998 | Sales & Marketing | region_23 | Below Secondary | f | other | 4 | 30 | 3.0 | 10 | 0 | 65 | 1 |
4999 | HR | region_31 | Below Secondary | m | other | 1 | 27 | 3.0 | 1 | 0 | 92 | 1 |
5000 rows × 12 columns
As we can see, the sampled dataset now contains only instances that satisfy the specified conditions.
Note that generating data with the conditions parameter is slower than generating unconditioned samples.
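To measure this difference on your own setup, a simple timing sketch using the standard library's time module could look like the following (absolute run times depend heavily on hardware and on the trained model):

import time

start = time.perf_counter()
synth.sample(n_samples=500)  # unconditioned sampling
print(f'unconditioned: {time.perf_counter() - start:.2f}s')

start = time.perf_counter()
synth.sample(n_samples=500, conditions={'is_promoted': 1})  # conditioned sampling
print(f'conditioned: {time.perf_counter() - start:.2f}s')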