Rebalance Class (imblearn)
This notebook explores the Rebalance class, which lets the user create synthetic data to fix an imbalanced dataset.
[1]:
import sys
sys.path.append('../../../notebooks')
import pandas as pd
import numpy as np
from raimitigations.dataprocessing import Rebalance
from download import download_datasets
1 - Dataset with Headers
First, we will load the HR promotions dataset, which includes information about whether an employee was promoted or not.
[2]:
data_dir = '../../../datasets/'
download_datasets(data_dir)
dataset = pd.read_csv(data_dir + 'hr_promotion/train.csv')
dataset.drop(columns=['employee_id'], inplace=True)
dataset
[2]:
| | department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | KPIs_met >80% | awards_won? | avg_training_score | is_promoted |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Sales & Marketing | region_7 | Master's & above | f | sourcing | 1 | 35 | 5.0 | 8 | 1 | 0 | 49 | 0 |
1 | Operations | region_22 | Bachelor's | m | other | 1 | 30 | 5.0 | 4 | 0 | 0 | 60 | 0 |
2 | Sales & Marketing | region_19 | Bachelor's | m | sourcing | 1 | 34 | 3.0 | 7 | 0 | 0 | 50 | 0 |
3 | Sales & Marketing | region_23 | Bachelor's | m | other | 2 | 39 | 1.0 | 10 | 0 | 0 | 50 | 0 |
4 | Technology | region_26 | Bachelor's | m | other | 1 | 45 | 3.0 | 2 | 0 | 0 | 73 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
54803 | Technology | region_14 | Bachelor's | m | sourcing | 1 | 48 | 3.0 | 17 | 0 | 0 | 78 | 0 |
54804 | Operations | region_27 | Master's & above | f | other | 1 | 37 | 2.0 | 6 | 0 | 0 | 56 | 0 |
54805 | Analytics | region_1 | Bachelor's | m | other | 1 | 27 | 5.0 | 3 | 1 | 0 | 79 | 0 |
54806 | Sales & Marketing | region_9 | NaN | m | sourcing | 1 | 29 | 1.0 | 2 | 0 | 0 | 45 | 0 |
54807 | HR | region_22 | Bachelor's | m | other | 1 | 27 | 1.0 | 5 | 0 | 0 | 49 | 0 |
54808 rows × 13 columns
We can check that this dataset is imbalanced.
[3]:
dataset['is_promoted'].value_counts()
[3]:
0 50140
1 4668
Name: is_promoted, dtype: int64
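To put a number on the imbalance, the short sketch below computes each class's share and the majority-to-minority ratio. It uses plain Python with the counts copied from the output above, so it runs without the dataset; the variable names are illustrative only.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 50140, 1: 4668}

total = sum(counts.values())

# Share of each class in the whole dataset.
shares = {label: n / total for label, n in counts.items()}

# Majority-to-minority ratio: non-promoted employees per promoted employee.
imbalance_ratio = max(counts.values()) / min(counts.values())

print({label: round(share, 4) for label, share in shares.items()})
print(round(imbalance_ratio, 2))
```

Promoted employees make up roughly 8.5% of the rows, which is about one promoted employee for every 10.7 who were not promoted.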
To rebalance a dataset, we specify the column to rebalance (rebalance_col). By default, the class uses the SMOTE oversampling method and creates enough samples of the minority class to match the majority class. Here we also set the number of neighbors used by SMOTE (k_neighbors) and suppress the log output (verbose=False).
[4]:
rebalance = Rebalance(
    df=dataset,
    rebalance_col='is_promoted',
    k_neighbors=6,
    verbose=False
)
df_resample = rebalance.fit_resample()
print(df_resample['is_promoted'].value_counts())
df_resample
0 50140
1 50140
Name: is_promoted, dtype: int64
[4]:
| | department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | KPIs_met >80% | awards_won? | avg_training_score | is_promoted |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Sales & Marketing | region_7 | Master's & above | f | sourcing | 1 | 35 | 5.000000 | 8 | 1 | 0 | 49 | 0 |
1 | Operations | region_22 | Bachelor's | m | other | 1 | 30 | 5.000000 | 4 | 0 | 0 | 60 | 0 |
2 | Sales & Marketing | region_19 | Bachelor's | m | sourcing | 1 | 34 | 3.000000 | 7 | 0 | 0 | 50 | 0 |
3 | Sales & Marketing | region_23 | Bachelor's | m | other | 2 | 39 | 1.000000 | 10 | 0 | 0 | 50 | 0 |
4 | Technology | region_26 | Bachelor's | m | other | 1 | 45 | 3.000000 | 2 | 0 | 0 | 73 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
100275 | Operations | region_2 | Master's & above | f | sourcing | 1 | 36 | 3.000000 | 6 | 0 | 0 | 70 | 1 |
100276 | Sales & Marketing | region_22 | Bachelor's | m | other | 1 | 24 | 3.329256 | 1 | 1 | 0 | 51 | 1 |
100277 | Technology | region_2 | Bachelor's | m | sourcing | 1 | 24 | 4.941969 | 2 | 0 | 0 | 85 | 1 |
100278 | Sales & Marketing | region_2 | Master's & above | m | other | 1 | 50 | 3.234562 | 15 | 1 | 0 | 51 | 1 |
100279 | Analytics | region_22 | Bachelor's | m | other | 1 | 26 | 3.329256 | 1 | 0 | 0 | 80 | 1 |
100280 rows × 13 columns
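The fractional previous_year_rating values in the synthetic rows (for example 3.329256) are consistent with how SMOTE builds numerical features: a new point is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbors. The sketch below illustrates only that interpolation step. It is not the imblearn implementation, and the function name and the two sample rows are hypothetical.

```python
import random


def smote_like_sample(x, neighbor, rng):
    """Return a point on the line segment between x and neighbor."""
    lam = rng.random()  # uniform in [0, 1), shared by all features
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]


rng = random.Random(0)

# Two hypothetical minority-class rows with two numerical features each.
x = [24.0, 3.0]
neighbor = [26.0, 5.0]

synthetic = smote_like_sample(x, neighbor, rng)
print(synthetic)
```

Every feature of the synthetic row lies between the corresponding values of the two original rows, which is why interpolated ratings are no longer whole numbers.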
If we do not want the samples of each class to be equal, we can specify the number of samples for each class with a dictionary in the strategy_over parameter.
[5]:
rebalance = Rebalance(
    df=dataset,
    rebalance_col='is_promoted',
    strategy_over={0:50140, 1:20000},
    k_neighbors=6
)
df_resample = rebalance.fit_resample()
print(df_resample['is_promoted'].value_counts())
No categorical columns specified. These columns have been automatically identfied as the following:
['department', 'region', 'education', 'gender', 'recruitment_channel']
No columns specified for imputation. These columns have been automatically identified:
['education', 'previous_year_rating']
Running oversampling...
...finished
0 50140
1 20000
Name: is_promoted, dtype: int64
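With a dict strategy, the number of synthetic rows per class is simply the target count minus the current count. The sketch below works through that arithmetic for the example above. The helper name is hypothetical, and the check reflects the assumption that an oversampling target cannot be smaller than the current class size.

```python
def samples_to_generate(current, target):
    """Number of synthetic rows needed per class to reach the target counts."""
    needed = {}
    for label, wanted in target.items():
        if wanted < current[label]:
            # Oversampling only adds rows, so a smaller target makes no sense.
            raise ValueError(f"target for class {label} is below its current size")
        needed[label] = wanted - current[label]
    return needed


current = {0: 50140, 1: 4668}
target = {0: 50140, 1: 20000}

print(samples_to_generate(current, target))
```

Here class 0 is left untouched and class 1 needs 15332 synthetic rows to reach 20000.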
The Rebalance class accepts either a pandas DataFrame or an (X, y) pair, passed to either the __init__() or the fit_resample() method. If we do not want Rebalance to consider certain columns, we remove them from the DataFrame first.
Instead of a dict, we can also pass a float giving the desired ratio of minority-class to majority-class instances after resampling.
[6]:
cat_df = dataset.drop(columns=['no_of_trainings', 'age', 'previous_year_rating', 'length_of_service', 'awards_won?', 'avg_training_score'])
X = cat_df.drop(columns=['is_promoted'])
y = cat_df['is_promoted']
print(type(X))
rebalance = Rebalance(
    strategy_over=0.5,
    k_neighbors=4
)
x_resample, y_resample = rebalance.fit_resample(X=X, y=y)
y_resample.value_counts()
<class 'pandas.core.frame.DataFrame'>
No categorical columns specified. These columns have been automatically identfied as the following:
['department', 'region', 'education', 'gender', 'recruitment_channel']
No columns specified for imputation. These columns have been automatically identified:
['education']
Running oversampling...
...finished
[6]:
0 50140
1 25070
Name: is_promoted, dtype: int64
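The float form can be read as "minority count divided by majority count after resampling". A minimal sketch of that arithmetic, assuming the target is truncated to an integer (the helper name is hypothetical):

```python
def minority_target(n_majority, ratio):
    """Minority class size implied by a minority/majority ratio."""
    return int(ratio * n_majority)


n_majority, n_minority = 50140, 4668

target = minority_target(n_majority, 0.5)
print(target)               # size of class 1 after oversampling
print(target - n_minority)  # synthetic rows that must be created
```

For a ratio of 0.5 this gives 25070 rows for class 1, matching the output above.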
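By default, Rebalance uses a SMOTE method from imblearn: SMOTE (numerical columns only), SMOTEN (categorical columns only), or SMOTENC (both categorical and numerical columns). We can also choose the method ourselves by passing an oversampling object to Rebalance through the over_sampler parameter. Note that in the output below both classes end up with 50140 rows despite the strategy_over dict, which suggests that the settings of the supplied sampler take precedence.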
[7]:
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=0)
rebalance = Rebalance(
    df=dataset,
    rebalance_col='is_promoted',
    strategy_over={0:50140, 1:20000},
    over_sampler=smote
)
df_resample = rebalance.fit_resample()
print(df_resample['is_promoted'].value_counts())
df_resample
No categorical columns specified. These columns have been automatically identfied as the following:
['department', 'region', 'education', 'gender', 'recruitment_channel']
No columns specified for imputation. These columns have been automatically identified:
['education', 'previous_year_rating']
Over Sampler already provided.
Running oversampling...
...finished
0 50140
1 50140
Name: is_promoted, dtype: int64
[7]:
| | no_of_trainings | age | previous_year_rating | length_of_service | KPIs_met >80% | awards_won? | avg_training_score | department_Finance | department_HR | department_Legal | ... | region_region_7 | region_region_8 | region_region_9 | education_Below Secondary | education_Master's & above | education_NULL | gender_m | recruitment_channel_referred | recruitment_channel_sourcing | is_promoted |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 35 | 5.000000 | 8 | 1 | 0 | 49 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
1 | 1 | 30 | 5.000000 | 4 | 0 | 0 | 60 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2 | 1 | 34 | 3.000000 | 7 | 0 | 0 | 50 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
3 | 2 | 39 | 1.000000 | 10 | 0 | 0 | 50 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
4 | 1 | 45 | 3.000000 | 2 | 0 | 0 | 73 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
100275 | 1 | 33 | 5.000000 | 7 | 1 | 0 | 57 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
100276 | 1 | 29 | 3.000000 | 4 | 1 | 0 | 87 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
100277 | 1 | 50 | 3.013012 | 6 | 1 | 0 | 63 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
100278 | 1 | 32 | 4.763964 | 2 | 1 | 0 | 60 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
100279 | 1 | 33 | 4.000000 | 9 | 0 | 0 | 97 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
100280 rows × 55 columns
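The default choice among the three SMOTE variants described above depends only on which kinds of columns are present. The sketch below captures that rule as a small function. It is an illustration of the rule as stated in the text, not the library's code, and the function name is hypothetical.

```python
def pick_smote_variant(columns, categorical):
    """Name of the SMOTE variant suited to the given mix of column types."""
    has_categorical = any(col in categorical for col in columns)
    has_numerical = any(col not in categorical for col in columns)
    if has_categorical and has_numerical:
        return "SMOTENC"  # mixed categorical and numerical columns
    if has_categorical:
        return "SMOTEN"   # categorical columns only
    return "SMOTE"        # numerical columns only


categorical = {"department", "region", "education", "gender", "recruitment_channel"}

print(pick_smote_variant(["age", "avg_training_score"], categorical))
print(pick_smote_variant(["department", "gender"], categorical))
print(pick_smote_variant(["department", "age"], categorical))
```

Passing a plain SMOTE object explicitly, as in the cell above, bypasses this choice, which is presumably why the categorical columns appear one-hot encoded in the result.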
We can perform both oversampling and undersampling in the same call. Oversampling is performed first, then undersampling. For undersampling, the default method depends on the type of the value passed to strategy_under:
- If a float or dictionary, ClusterCentroids is used (since we need a specific number of instances of each class).
- If a string, TomekLinks is used. The options for which classes to resample are ‘majority’, ‘not minority’, ‘not majority’, ‘all’, or ‘auto’ (‘auto’ is equivalent to ‘not minority’).
- If None, TomekLinks is also used by default.
Here we specify ‘auto’ (i.e. ‘not minority’), so the TomekLinks method is used and every class except the minority class is resampled.
[8]:
rebalance = Rebalance(
    df=dataset,
    rebalance_col='is_promoted',
    strategy_over={0:50140, 1:10000},
    strategy_under='auto'
)
df_resample = rebalance.fit_resample()
print(df_resample['is_promoted'].value_counts())
df_resample
No categorical columns specified. These columns have been automatically identfied as the following:
['department', 'region', 'education', 'gender', 'recruitment_channel']
No columns specified for imputation. These columns have been automatically identified:
['education', 'previous_year_rating']
Running oversampling...
...finished
Running undersampling...
...finished
0 49208
1 10000
Name: is_promoted, dtype: int64
[8]:
| | no_of_trainings | age | previous_year_rating | length_of_service | KPIs_met >80% | awards_won? | avg_training_score | department_Finance | department_HR | department_Legal | ... | region_region_7 | region_region_8 | region_region_9 | education_Below Secondary | education_Master's & above | education_NULL | gender_m | recruitment_channel_referred | recruitment_channel_sourcing | is_promoted |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 35 | 5.0 | 8 | 1 | 0 | 49 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
1 | 1 | 30 | 5.0 | 4 | 0 | 0 | 60 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2 | 1 | 34 | 3.0 | 7 | 0 | 0 | 50 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
3 | 2 | 39 | 1.0 | 10 | 0 | 0 | 50 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
4 | 1 | 45 | 3.0 | 2 | 0 | 0 | 73 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
59203 | 1 | 31 | 5.0 | 6 | 1 | 0 | 49 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
59204 | 1 | 31 | 3.0 | 6 | 1 | 0 | 88 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
59205 | 1 | 28 | 3.0 | 4 | 0 | 0 | 88 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
59206 | 1 | 34 | 3.0 | 3 | 0 | 0 | 62 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
59207 | 1 | 35 | 3.0 | 8 | 0 | 0 | 83 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
59208 rows × 55 columns
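The mapping from the type of strategy_under to a default undersampler, as described before the cell above, can be summarized in a few lines. This is only a sketch of the documented rule. The function name is hypothetical, and it returns method names as strings rather than imblearn objects.

```python
VALID_STRINGS = {"majority", "not minority", "not majority", "all", "auto"}


def pick_under_sampler(strategy_under):
    """Name of the default undersampling method for a given strategy value."""
    if isinstance(strategy_under, (float, dict)):
        # An exact class size is requested, so a count-based method is needed.
        return "ClusterCentroids"
    if strategy_under is None:
        return "TomekLinks"
    if isinstance(strategy_under, str):
        if strategy_under not in VALID_STRINGS:
            raise ValueError(f"unknown strategy: {strategy_under!r}")
        return "TomekLinks"
    raise TypeError("strategy_under must be a float, dict, str, or None")


print(pick_under_sampler("auto"))
print(pick_under_sampler(0.8))
print(pick_under_sampler({0: 20000, 1: 10000}))
```

The distinction matters for the result: the string options only select which classes may lose rows, so the final class sizes are not known in advance.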
Here, instead of setting strategy_under, we set under_sampler=True, which uses the default undersampler. Since strategy_under is None, the TomekLinks undersampler is used again.
[9]:
rebalance = Rebalance(
    df=dataset,
    rebalance_col='is_promoted',
    strategy_over={0:50140, 1:10000},
    under_sampler=True
)
df_resample = rebalance.fit_resample()
print(df_resample['is_promoted'].value_counts())
df_resample
No categorical columns specified. These columns have been automatically identfied as the following:
['department', 'region', 'education', 'gender', 'recruitment_channel']
No columns specified for imputation. These columns have been automatically identified:
['education', 'previous_year_rating']
Running oversampling...
...finished
Running undersampling...
...finished
0 49199
1 10000
Name: is_promoted, dtype: int64
[9]:
| | no_of_trainings | age | previous_year_rating | length_of_service | KPIs_met >80% | awards_won? | avg_training_score | department_Finance | department_HR | department_Legal | ... | region_region_7 | region_region_8 | region_region_9 | education_Below Secondary | education_Master's & above | education_NULL | gender_m | recruitment_channel_referred | recruitment_channel_sourcing | is_promoted |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 35 | 5.000000 | 8 | 1 | 0 | 49 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
1 | 1 | 30 | 5.000000 | 4 | 0 | 0 | 60 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
2 | 1 | 34 | 3.000000 | 7 | 0 | 0 | 50 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
3 | 2 | 39 | 1.000000 | 10 | 0 | 0 | 50 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
4 | 1 | 45 | 3.000000 | 2 | 0 | 0 | 73 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
59194 | 1 | 30 | 3.229203 | 6 | 1 | 0 | 50 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
59195 | 1 | 32 | 5.000000 | 6 | 1 | 0 | 64 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
59196 | 1 | 38 | 3.000000 | 6 | 1 | 0 | 58 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
59197 | 1 | 35 | 3.000000 | 4 | 0 | 0 | 82 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
59198 | 1 | 31 | 5.000000 | 9 | 1 | 0 | 69 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
59199 rows × 55 columns
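In both undersampling runs, class 0 shrinks by fewer than a thousand rows. That is what one would expect from Tomek links: a Tomek link is a pair of samples from different classes that are each other's nearest neighbor, and only the non-minority member of such a pair is removed. The toy sketch below finds Tomek links by brute force on five hypothetical 2D points. It is meant to illustrate the idea and is not how imblearn implements it.

```python
import math
from collections import Counter


def nearest(i, points):
    """Index of the point closest to points[i], excluding i itself."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: math.dist(points[i], points[j]))


def tomek_links(points, labels):
    """Pairs (i, j) that are mutual nearest neighbors with different labels."""
    links = []
    for i in range(len(points)):
        j = nearest(i, points)
        if i < j and nearest(j, points) == i and labels[i] != labels[j]:
            links.append((i, j))
    return links


# Hypothetical toy data: class 0 is the majority, class 1 the minority.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (3.0, 3.0)]
labels = [0, 0, 0, 1, 1]

links = tomek_links(points, labels)

# 'not minority' behaviour: drop only the non-minority member of each link.
label_counts = Counter(labels)
minority = min(label_counts, key=label_counts.get)
to_drop = {k for pair in links for k in pair if labels[k] != minority}
kept = [k for k in range(len(points)) if k not in to_drop]

print(links, kept)
```

Only the majority point that sits right next to a minority point is dropped, so the class boundary is cleaned up without forcing the classes to a particular size.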
Resampling is not limited to columns with binary values. Below, we resample on no_of_trainings, which has 10 distinct values. We set strategy_over to a dict because several classes are tied for the fewest instances (8, 9, and 10 each have 5 instances), and we raise each of them to 12.
[10]:
dataset['no_of_trainings'].value_counts()
[10]:
1 44378
2 7987
3 1776
4 468
5 128
6 44
7 12
8 5
10 5
9 5
Name: no_of_trainings, dtype: int64
[11]:
rebalance = Rebalance(
    df=dataset,
    rebalance_col='no_of_trainings',
    strategy_over={8:12, 9:12, 10:12},
    under_sampler=False
)
df_resample = rebalance.fit_resample()
print(df_resample['no_of_trainings'].value_counts())
df_resample
No categorical columns specified. These columns have been automatically identfied as the following:
['department', 'region', 'education', 'gender', 'recruitment_channel']
No columns specified for imputation. These columns have been automatically identified:
['education', 'previous_year_rating']
Running oversampling...
...finished
1 44378
2 7987
3 1776
4 468
5 128
6 44
7 12
8 12
10 12
9 12
Name: no_of_trainings, dtype: int64
[11]:
| | department | region | education | gender | recruitment_channel | age | previous_year_rating | length_of_service | KPIs_met >80% | awards_won? | avg_training_score | is_promoted | no_of_trainings |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Sales & Marketing | region_7 | Master's & above | f | sourcing | 35 | 5.000000 | 8 | 1 | 0 | 49 | 0 | 1 |
1 | Operations | region_22 | Bachelor's | m | other | 30 | 5.000000 | 4 | 0 | 0 | 60 | 0 | 1 |
2 | Sales & Marketing | region_19 | Bachelor's | m | sourcing | 34 | 3.000000 | 7 | 0 | 0 | 50 | 0 | 1 |
3 | Sales & Marketing | region_23 | Bachelor's | m | other | 39 | 1.000000 | 10 | 0 | 0 | 50 | 0 | 2 |
4 | Technology | region_26 | Bachelor's | m | other | 45 | 3.000000 | 2 | 0 | 0 | 73 | 0 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
54824 | Procurement | region_2 | Bachelor's | m | sourcing | 33 | 3.000000 | 2 | 0 | 0 | 70 | 0 | 10 |
54825 | Procurement | region_2 | Bachelor's | m | other | 43 | 1.815431 | 10 | 0 | 0 | 67 | 0 | 10 |
54826 | Procurement | region_2 | Bachelor's | m | other | 56 | 3.000000 | 8 | 0 | 0 | 69 | 0 | 10 |
54827 | Procurement | region_2 | Bachelor's | m | other | 49 | 2.436705 | 11 | 0 | 0 | 68 | 0 | 10 |
54828 | Procurement | region_2 | Bachelor's | m | other | 46 | 3.000000 | 12 | 0 | 0 | 56 | 0 | 10 |
54829 rows × 13 columns
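For a column with many classes, the strategy dict does not have to be written by hand. One possible approach, sketched below with the counts copied from the value_counts() output above, is to raise every class that falls below a chosen floor up to that floor. The floor of 12 and the variable names are illustrative choices, not part of the Rebalance API.

```python
# Class counts of no_of_trainings, copied from the output above.
counts = {1: 44378, 2: 7987, 3: 1776, 4: 468, 5: 128,
          6: 44, 7: 12, 8: 5, 10: 5, 9: 5}

floor = 12  # minimum number of rows we want for every class

# Only classes below the floor need a target; the rest are left untouched.
strategy_over = {label: floor for label, n in counts.items() if n < floor}

# Expected size of the resampled dataset.
new_total = sum(counts.values()) + sum(
    floor - counts[label] for label in strategy_over
)

print(strategy_over)
print(new_total)
```

This reproduces the dict used in the cell above, and the expected total of 54829 rows agrees with the resampled DataFrame.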