Rebalance Class (imblearn)

This notebook explores the Rebalance class, which lets the user generate synthetic data to fix imbalanced datasets.

[1]:
import sys
sys.path.append('../../../notebooks')

import pandas as pd
import numpy as np
from raimitigations.dataprocessing import Rebalance
from download import download_datasets

1 - Dataset with Headers

First, we will load the HR promotions dataset, which includes information about whether an employee was promoted or not.

[2]:
data_dir = '../../../datasets/'
download_datasets(data_dir)
dataset = pd.read_csv(data_dir + 'hr_promotion/train.csv')
dataset.drop(columns=['employee_id'], inplace=True)
dataset
[2]:
department region education gender recruitment_channel no_of_trainings age previous_year_rating length_of_service KPIs_met >80% awards_won? avg_training_score is_promoted
0 Sales & Marketing region_7 Master's & above f sourcing 1 35 5.0 8 1 0 49 0
1 Operations region_22 Bachelor's m other 1 30 5.0 4 0 0 60 0
2 Sales & Marketing region_19 Bachelor's m sourcing 1 34 3.0 7 0 0 50 0
3 Sales & Marketing region_23 Bachelor's m other 2 39 1.0 10 0 0 50 0
4 Technology region_26 Bachelor's m other 1 45 3.0 2 0 0 73 0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
54803 Technology region_14 Bachelor's m sourcing 1 48 3.0 17 0 0 78 0
54804 Operations region_27 Master's & above f other 1 37 2.0 6 0 0 56 0
54805 Analytics region_1 Bachelor's m other 1 27 5.0 3 1 0 79 0
54806 Sales & Marketing region_9 NaN m sourcing 1 29 1.0 2 0 0 45 0
54807 HR region_22 Bachelor's m other 1 27 1.0 5 0 0 49 0

54808 rows × 13 columns

We can check that this dataset is imbalanced.

[3]:
dataset['is_promoted'].value_counts()
[3]:
0    50140
1     4668
Name: is_promoted, dtype: int64
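
To put a number on the imbalance, the counts printed above can be turned into a majority-to-minority ratio and a minority share. This is a small sketch that uses only the two counts shown in the output; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 50140, 1: 4668}

majority = max(counts.values())
minority = min(counts.values())
total = sum(counts.values())

# How many majority samples exist for each minority sample.
imbalance_ratio = majority / minority
# Fraction of the dataset that belongs to the minority class.
minority_share = minority / total

print(f"imbalance ratio: {imbalance_ratio:.2f}")
print(f"minority share:  {minority_share:.1%}")
```

Roughly one employee in twelve was promoted, so a model that always predicts "not promoted" would already look accurate, which is why rebalancing is worth considering here.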

To rebalance a dataset, we specify the column to rebalance (rebalance_col). By default, the class uses the SMOTE oversampling method and creates enough samples of the minority class to match the majority class. Here we also set the number of neighbors that SMOTE uses (k_neighbors) and suppress the output (verbose=False).

[4]:
rebalance = Rebalance(
                            df=dataset,
                            rebalance_col='is_promoted',
                            k_neighbors=6,
                            verbose=False
                    )
df_resample = rebalance.fit_resample()
print(df_resample['is_promoted'].value_counts())
df_resample
0    50140
1    50140
Name: is_promoted, dtype: int64
[4]:
department region education gender recruitment_channel no_of_trainings age previous_year_rating length_of_service KPIs_met >80% awards_won? avg_training_score is_promoted
0 Sales & Marketing region_7 Master's & above f sourcing 1 35 5.000000 8 1 0 49 0
1 Operations region_22 Bachelor's m other 1 30 5.000000 4 0 0 60 0
2 Sales & Marketing region_19 Bachelor's m sourcing 1 34 3.000000 7 0 0 50 0
3 Sales & Marketing region_23 Bachelor's m other 2 39 1.000000 10 0 0 50 0
4 Technology region_26 Bachelor's m other 1 45 3.000000 2 0 0 73 0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
100275 Operations region_2 Master's & above f sourcing 1 36 3.000000 6 0 0 70 1
100276 Sales & Marketing region_22 Bachelor's m other 1 24 3.329256 1 1 0 51 1
100277 Technology region_2 Bachelor's m sourcing 1 24 4.941969 2 0 0 85 1
100278 Sales & Marketing region_2 Master's & above m other 1 50 3.234562 15 1 0 51 1
100279 Analytics region_22 Bachelor's m other 1 26 3.329256 1 0 0 80 1

100280 rows × 13 columns
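
The fractional previous_year_rating values in the last rows above (for example 3.329256) are what one would expect from SMOTE-style interpolation: a synthetic point is placed somewhere on the line between a minority sample and one of its nearest minority neighbors. The following is a minimal pure-Python sketch of that idea, not the imblearn implementation; the function name `smote_like_samples` and the toy points are assumptions made for illustration.

```python
import math
import random


def smote_like_samples(minority, k, n_new, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest neighbors of `base` among the other minority points.
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        other = rng.choice(neighbors)
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(b + lam * (o - b) for b, o in zip(base, other)))
    return synthetic


# Toy minority class with two numeric features (age, rating).
minority = [(24.0, 3.0), (26.0, 4.0), (25.0, 5.0), (50.0, 3.0)]
new_points = smote_like_samples(minority, k=2, n_new=5)
for point in new_points:
    print(point)
```

Every synthetic point lies between two existing minority points, so it never leaves the range already covered by the minority class.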

If we do not want the samples of each class to be equal, we can specify the number of samples for each class with a dictionary in the strategy_over parameter.

[5]:
rebalance = Rebalance(
                            df=dataset,
                            rebalance_col='is_promoted',
                            strategy_over={0:50140, 1:20000},
                            k_neighbors=6
                    )
df_resample = rebalance.fit_resample()
print(df_resample['is_promoted'].value_counts())
No categorical columns specified. These columns have been automatically identfied as the following:
['department', 'region', 'education', 'gender', 'recruitment_channel']
No columns specified for imputation. These columns have been automatically identified:
['education', 'previous_year_rating']
Running oversampling...
...finished
0    50140
1    20000
Name: is_promoted, dtype: int64
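
A dictionary strategy states the final size of each class, so the number of synthetic rows that must be generated is the difference between the target and the current count. The sketch below works through that arithmetic for the counts above; `samples_to_generate` is a hypothetical helper, not part of raimitigations or imblearn.

```python
def samples_to_generate(current_counts, target_counts):
    """Return how many new rows each class needs to reach its target size."""
    return {
        label: target_counts[label] - current_counts[label]
        for label in target_counts
    }


current = {0: 50140, 1: 4668}    # counts before resampling
target = {0: 50140, 1: 20000}    # the strategy_over dictionary used above

extra = samples_to_generate(current, target)
print(extra)  # {0: 0, 1: 15332}
```

Class 0 is left untouched, while class 1 receives 15332 synthetic rows.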

The Rebalance class accepts either a pandas DataFrame or an (X, y) pair, passed to either the __init__() or the fit_resample() method. If we do not want the Rebalance class to consider certain columns, we remove them from the DataFrame first.

Instead of a dict, strategy_over can also be a float giving the desired ratio of minority-class to majority-class instances after resampling.

[6]:
cat_df = dataset.drop(columns=['no_of_trainings', 'age', 'previous_year_rating', 'length_of_service', 'awards_won?', 'avg_training_score'])
X = cat_df.drop(columns=['is_promoted'])
y = cat_df['is_promoted']

print(type(X))

rebalance = Rebalance(
                            strategy_over=0.5,
                            k_neighbors=4
                    )
x_resample, y_resample = rebalance.fit_resample(X=X, y=y)
y_resample.value_counts()
<class 'pandas.core.frame.DataFrame'>
No categorical columns specified. These columns have been automatically identfied as the following:
['department', 'region', 'education', 'gender', 'recruitment_channel']
No columns specified for imputation. These columns have been automatically identified:
['education']
Running oversampling...
...finished
[6]:
0    50140
1    25070
Name: is_promoted, dtype: int64
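
The result above is consistent with reading the float as the ratio of minority to majority samples after resampling, which is how imblearn documents a float sampling strategy for binary problems. A short sketch of that arithmetic, with `minority_target_from_ratio` as a hypothetical helper name:

```python
def minority_target_from_ratio(majority_count, ratio):
    """Minority class size implied by a minority/majority ratio."""
    return int(ratio * majority_count)


majority_count = 50140
minority_target = minority_target_from_ratio(majority_count, 0.5)
print(minority_target)  # 25070
```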

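By default, Rebalance picks a SMOTE variant from imblearn: SMOTE (numerical columns only), SMOTEN (categorical columns only), or SMOTENC (both categorical and numerical columns). We can also choose the method ourselves by passing an oversampler object to Rebalance through the over_sampler parameter. Note that in the output below both classes end up with 50140 samples even though strategy_over requests 20000 for class 1, which suggests that the settings of the provided sampler take precedence over strategy_over.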

[7]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=0)

rebalance = Rebalance(
                            df=dataset,
                            rebalance_col='is_promoted',
                            strategy_over={0:50140, 1:20000},
                            over_sampler=smote
                    )
df_resample = rebalance.fit_resample()
print(df_resample['is_promoted'].value_counts())
df_resample
No categorical columns specified. These columns have been automatically identfied as the following:
['department', 'region', 'education', 'gender', 'recruitment_channel']
No columns specified for imputation. These columns have been automatically identified:
['education', 'previous_year_rating']

Over Sampler already provided.

Running oversampling...
...finished
0    50140
1    50140
Name: is_promoted, dtype: int64
[7]:
no_of_trainings age previous_year_rating length_of_service KPIs_met >80% awards_won? avg_training_score department_Finance department_HR department_Legal ... region_region_7 region_region_8 region_region_9 education_Below Secondary education_Master's & above education_NULL gender_m recruitment_channel_referred recruitment_channel_sourcing is_promoted
0 1 35 5.000000 8 1 0 49 0 0 0 ... 1 0 0 0 1 0 0 0 1 0
1 1 30 5.000000 4 0 0 60 0 0 0 ... 0 0 0 0 0 0 1 0 0 0
2 1 34 3.000000 7 0 0 50 0 0 0 ... 0 0 0 0 0 0 1 0 1 0
3 2 39 1.000000 10 0 0 50 0 0 0 ... 0 0 0 0 0 0 1 0 0 0
4 1 45 3.000000 2 0 0 73 0 0 0 ... 0 0 0 0 0 0 1 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
100275 1 33 5.000000 7 1 0 57 1 0 0 ... 0 0 0 0 0 0 0 0 0 1
100276 1 29 3.000000 4 1 0 87 0 0 0 ... 0 0 0 0 0 0 1 0 0 1
100277 1 50 3.013012 6 1 0 63 0 0 0 ... 0 0 0 0 1 0 1 0 0 1
100278 1 32 4.763964 2 1 0 60 0 0 0 ... 0 0 0 0 0 0 0 0 1 1
100279 1 33 4.000000 9 0 0 97 0 0 0 ... 0 0 0 0 0 0 0 0 0 1

100280 rows × 55 columns

We can perform both oversampling and undersampling in the same function call. Oversampling is performed first, then undersampling. For undersampling, the default method depends on the value of strategy_under:

  • If Float or Dictionary, ClusterCentroids will be used (since we need a specific number of each class)

  • If String, then TomekLinks will be used. Options for which classes to resample are ‘majority’, ‘not minority’, ‘not majority’, ‘all’, or ‘auto’ (‘auto’ is equivalent to ‘not minority’).

  • If None, the default is also TomekLinks
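
The selection rule in the list above can be summarised as a small dispatch on the type of strategy_under. This is only a sketch of the rule as described in this notebook; `pick_under_sampler` is a hypothetical function that returns the sampler name as a string and is not part of raimitigations.

```python
def pick_under_sampler(strategy_under):
    """Return the name of the default undersampler for a strategy_under value."""
    if isinstance(strategy_under, (float, dict)):
        # A ratio or explicit counts require a specific number per class.
        return "ClusterCentroids"
    if strategy_under is None or isinstance(strategy_under, str):
        # Strings such as 'auto' or 'majority' only name which classes to clean.
        return "TomekLinks"
    raise TypeError(f"Unsupported strategy_under: {strategy_under!r}")


for value in (0.5, {0: 20000, 1: 10000}, "auto", None):
    print(repr(value), "->", pick_under_sampler(value))
```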

Here we specify ‘auto’ (i.e. ‘not minority’), so a TomekLinks method will be used, and all but the minority class will be resampled.

[8]:
rebalance = Rebalance(
                            df=dataset,
                            rebalance_col='is_promoted',
                            strategy_over={0:50140, 1:10000},
                            strategy_under='auto'
                    )
df_resample = rebalance.fit_resample()
print(df_resample['is_promoted'].value_counts())
df_resample
No categorical columns specified. These columns have been automatically identfied as the following:
['department', 'region', 'education', 'gender', 'recruitment_channel']
No columns specified for imputation. These columns have been automatically identified:
['education', 'previous_year_rating']
Running oversampling...
...finished
Running undersampling...
...finished
0    49208
1    10000
Name: is_promoted, dtype: int64
[8]:
no_of_trainings age previous_year_rating length_of_service KPIs_met >80% awards_won? avg_training_score department_Finance department_HR department_Legal ... region_region_7 region_region_8 region_region_9 education_Below Secondary education_Master's & above education_NULL gender_m recruitment_channel_referred recruitment_channel_sourcing is_promoted
0 1 35 5.0 8 1 0 49 0 0 0 ... 1 0 0 0 1 0 0 0 1 0
1 1 30 5.0 4 0 0 60 0 0 0 ... 0 0 0 0 0 0 1 0 0 0
2 1 34 3.0 7 0 0 50 0 0 0 ... 0 0 0 0 0 0 1 0 1 0
3 2 39 1.0 10 0 0 50 0 0 0 ... 0 0 0 0 0 0 1 0 0 0
4 1 45 3.0 2 0 0 73 0 0 0 ... 0 0 0 0 0 0 1 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
59203 1 31 5.0 6 1 0 49 0 0 0 ... 0 0 0 0 0 0 1 0 0 1
59204 1 31 3.0 6 1 0 88 0 0 0 ... 0 0 0 0 0 0 1 0 0 1
59205 1 28 3.0 4 0 0 88 0 0 0 ... 0 0 0 0 0 0 1 0 0 1
59206 1 34 3.0 3 0 0 62 0 0 0 ... 0 0 0 0 1 0 1 0 0 1
59207 1 35 3.0 8 0 0 83 0 0 0 ... 0 0 0 0 1 0 0 0 0 1

59208 rows × 55 columns
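
Only 932 rows of class 0 were removed above (50140 down to 49208) because TomekLinks is a cleaning method rather than a method that targets a class size. A Tomek link is a pair of samples from different classes that are each other's nearest neighbor, and the sample from the class being resampled is dropped. The pure-Python sketch below illustrates the idea on toy data; it is not the imblearn implementation, and `tomek_link_majority_indices` is a name chosen for this example.

```python
import math


def tomek_link_majority_indices(points, labels, majority_label):
    """Return indices of majority-class points that are part of a Tomek link."""

    def nearest(i):
        # Index of the point closest to points[i], excluding itself.
        return min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )

    to_remove = set()
    for i in range(len(points)):
        j = nearest(i)
        # Tomek link: i and j are mutual nearest neighbors with different labels.
        if nearest(j) == i and labels[i] != labels[j]:
            if labels[i] == majority_label:
                to_remove.add(i)
            if labels[j] == majority_label:
                to_remove.add(j)
    return sorted(to_remove)


points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.05, 1.0), (3.0, 3.0)]
labels = [0, 0, 0, 1, 0]
removed = tomek_link_majority_indices(points, labels, majority_label=0)
print(removed)  # [2]
```

Only the majority point that sits right next to the minority point is removed; majority points far from the class boundary are kept.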

Here, instead of setting strategy_under, we set “under_sampler=True”, which uses the default undersampler. Since “strategy_under=None”, the TomekLinks undersampler is used again.

[9]:
rebalance = Rebalance(
                            df=dataset,
                            rebalance_col='is_promoted',
                            strategy_over={0:50140, 1:10000},
                            under_sampler=True
                    )
df_resample = rebalance.fit_resample()
print(df_resample['is_promoted'].value_counts())
df_resample
No categorical columns specified. These columns have been automatically identfied as the following:
['department', 'region', 'education', 'gender', 'recruitment_channel']
No columns specified for imputation. These columns have been automatically identified:
['education', 'previous_year_rating']
Running oversampling...
...finished
Running undersampling...
...finished
0    49199
1    10000
Name: is_promoted, dtype: int64
[9]:
no_of_trainings age previous_year_rating length_of_service KPIs_met >80% awards_won? avg_training_score department_Finance department_HR department_Legal ... region_region_7 region_region_8 region_region_9 education_Below Secondary education_Master's & above education_NULL gender_m recruitment_channel_referred recruitment_channel_sourcing is_promoted
0 1 35 5.000000 8 1 0 49 0 0 0 ... 1 0 0 0 1 0 0 0 1 0
1 1 30 5.000000 4 0 0 60 0 0 0 ... 0 0 0 0 0 0 1 0 0 0
2 1 34 3.000000 7 0 0 50 0 0 0 ... 0 0 0 0 0 0 1 0 1 0
3 2 39 1.000000 10 0 0 50 0 0 0 ... 0 0 0 0 0 0 1 0 0 0
4 1 45 3.000000 2 0 0 73 0 0 0 ... 0 0 0 0 0 0 1 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
59194 1 30 3.229203 6 1 0 50 0 0 0 ... 0 0 0 0 1 0 1 0 0 1
59195 1 32 5.000000 6 1 0 64 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
59196 1 38 3.000000 6 1 0 58 0 0 0 ... 0 0 0 0 1 0 1 0 0 1
59197 1 35 3.000000 4 0 0 82 0 0 0 ... 0 0 0 0 0 0 0 0 1 1
59198 1 31 5.000000 9 1 0 69 0 0 0 ... 0 0 0 0 0 0 0 0 1 1

59199 rows × 55 columns

Resampling is not limited to columns with binary values. Below, we perform resampling on no_of_trainings, which has 10 distinct values. We set “strategy_over” to a dict because several classes are tied for the fewest instances.

[10]:
dataset['no_of_trainings'].value_counts()
[10]:
1     44378
2      7987
3      1776
4       468
5       128
6        44
7        12
8         5
10        5
9         5
Name: no_of_trainings, dtype: int64
[11]:
rebalance = Rebalance(
                            df=dataset,
                            rebalance_col='no_of_trainings',
                            strategy_over={8:12, 9:12, 10:12},
                            under_sampler=False
                    )
df_resample = rebalance.fit_resample()
print(df_resample['no_of_trainings'].value_counts())
df_resample
No categorical columns specified. These columns have been automatically identfied as the following:
['department', 'region', 'education', 'gender', 'recruitment_channel']
No columns specified for imputation. These columns have been automatically identified:
['education', 'previous_year_rating']
Running oversampling...
...finished
1     44378
2      7987
3      1776
4       468
5       128
6        44
7        12
8        12
10       12
9        12
Name: no_of_trainings, dtype: int64
[11]:
department region education gender recruitment_channel age previous_year_rating length_of_service KPIs_met >80% awards_won? avg_training_score is_promoted no_of_trainings
0 Sales & Marketing region_7 Master's & above f sourcing 35 5.000000 8 1 0 49 0 1
1 Operations region_22 Bachelor's m other 30 5.000000 4 0 0 60 0 1
2 Sales & Marketing region_19 Bachelor's m sourcing 34 3.000000 7 0 0 50 0 1
3 Sales & Marketing region_23 Bachelor's m other 39 1.000000 10 0 0 50 0 2
4 Technology region_26 Bachelor's m other 45 3.000000 2 0 0 73 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ...
54824 Procurement region_2 Bachelor's m sourcing 33 3.000000 2 0 0 70 0 10
54825 Procurement region_2 Bachelor's m other 43 1.815431 10 0 0 67 0 10
54826 Procurement region_2 Bachelor's m other 56 3.000000 8 0 0 69 0 10
54827 Procurement region_2 Bachelor's m other 49 2.436705 11 0 0 68 0 10
54828 Procurement region_2 Bachelor's m other 46 3.000000 12 0 0 56 0 10

54829 rows × 13 columns
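
The dictionary passed as strategy_over above can also be derived from the class counts instead of being typed by hand. The sketch below raises every class that falls under a chosen floor up to that floor; `raise_rare_classes` is a hypothetical helper and the floor of 12 simply mirrors the value used above.

```python
def raise_rare_classes(counts, floor):
    """Build an oversampling dict that lifts classes below `floor` up to it."""
    return {label: floor for label, n in counts.items() if n < floor}


# Counts copied from the value_counts() output of no_of_trainings.
counts = {1: 44378, 2: 7987, 3: 1776, 4: 468, 5: 128,
          6: 44, 7: 12, 8: 5, 10: 5, 9: 5}

strategy = raise_rare_classes(counts, floor=12)
added_rows = sum(target - counts[label] for label, target in strategy.items())

print(strategy)
print(sum(counts.values()) + added_rows)  # 54829
```

The total of 54829 matches the row count of the resampled DataFrame shown above: 54808 original rows plus 7 synthetic rows for each of the three rarest classes.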
