Encoding Examples

This notebook demonstrates the different ways the encoder classes, EncoderOrdinal and EncoderOHE, can be used.

[1]:
import sys
sys.path.append('../../../notebooks')

import pandas as pd
import numpy as np
from raimitigations.dataprocessing import EncoderOHE, EncoderOrdinal

from download import download_datasets

Ordinal Encoding

In Ordinal Encoding, an integer is given to each value of the categorical variable. This is typically used when the categorical variable is ordered (e.g. education, where High School < Bachelors < Masters). In this example, we will see the different ways EncoderOrdinal can be used.
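As a library-independent illustration of the idea (not of how EncoderOrdinal is implemented), the sketch below maps an ordered categorical variable to integers with a plain dictionary. The toy `education` values and the `order` list are assumptions made up for this example.

```python
import pandas as pd

# Toy data (hypothetical): an ordered categorical variable.
toy = pd.DataFrame({"education": ["Bachelors", "High School", "Masters", "Bachelors"]})

# Explicit order: High School < Bachelors < Masters.
order = ["High School", "Bachelors", "Masters"]
value_to_code = {value: code for code, value in enumerate(order)}

# Replace each category with its integer rank.
toy["education_encoded"] = toy["education"].map(value_to_code)
print(toy)
```

Because the integers follow the stated order, comparisons such as "at least Bachelors" become simple numeric comparisons on the encoded column.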

Using a dataset with headers

[2]:
data_dir = '../../../datasets/'
download_datasets(data_dir)
dataset = pd.read_csv(data_dir + 'hr_promotion/train.csv')
dataset
[2]:
employee_id department region education gender recruitment_channel no_of_trainings age previous_year_rating length_of_service KPIs_met >80% awards_won? avg_training_score is_promoted
0 65438 Sales & Marketing region_7 Master's & above f sourcing 1 35 5.0 8 1 0 49 0
1 65141 Operations region_22 Bachelor's m other 1 30 5.0 4 0 0 60 0
2 7513 Sales & Marketing region_19 Bachelor's m sourcing 1 34 3.0 7 0 0 50 0
3 2542 Sales & Marketing region_23 Bachelor's m other 2 39 1.0 10 0 0 50 0
4 48945 Technology region_26 Bachelor's m other 1 45 3.0 2 0 0 73 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
54803 3030 Technology region_14 Bachelor's m sourcing 1 48 3.0 17 0 0 78 0
54804 74592 Operations region_27 Master's & above f other 1 37 2.0 6 0 0 56 0
54805 13918 Analytics region_1 Bachelor's m other 1 27 5.0 3 1 0 79 0
54806 13614 Sales & Marketing region_9 NaN m sourcing 1 29 1.0 2 0 0 45 0
54807 51526 HR region_22 Bachelor's m other 1 27 1.0 5 0 0 49 0

54808 rows × 14 columns

Here we don’t specify which columns to encode, so the class identifies them automatically.

[3]:
encode = EncoderOrdinal(
    df=dataset,
    col_encode=None
)
encode.fit()
new_df = encode.transform(dataset)
new_df
No columns specified for encoding. These columns have been automatically identfied as the following:
['department', 'region', 'education', 'gender', 'recruitment_channel']
[3]:
employee_id department region education gender recruitment_channel no_of_trainings age previous_year_rating length_of_service KPIs_met >80% awards_won? avg_training_score is_promoted
0 65438 7 31 2 0 2 1 35 5.0 8 1 0 49 0
1 65141 4 14 0 1 0 1 30 5.0 4 0 0 60 0
2 7513 7 10 0 1 2 1 34 3.0 7 0 0 50 0
3 2542 7 15 0 1 0 2 39 1.0 10 0 0 50 0
4 48945 8 18 0 1 0 1 45 3.0 2 0 0 73 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
54803 3030 8 5 0 1 2 1 48 3.0 17 0 0 78 0
54804 74592 4 19 2 0 0 1 37 2.0 6 0 0 56 0
54805 13918 0 0 0 1 0 1 27 5.0 3 1 0 79 0
54806 13614 7 33 -1 1 2 1 29 1.0 2 0 0 45 0
54807 51526 2 14 0 1 0 1 27 1.0 5 0 0 49 0

54808 rows × 14 columns

We can then get the mapping dictionary used by the encoder by calling the get_mapping() method:

[4]:
encode.get_mapping()
[4]:
{'department': {'values': ['Analytics',
   'Finance',
   'HR',
   'Legal',
   'Operations',
   'Procurement',
   'R&D',
   'Sales & Marketing',
   'Technology',
   'UNKNOWN'],
  'labels': [0, 1, 2, 3, 4, 5, 6, 7, 8, -1],
  'n_labels': 10},
 'region': {'values': ['region_1',
   'region_10',
   'region_11',
   'region_12',
   'region_13',
   'region_14',
   'region_15',
   'region_16',
   'region_17',
   'region_18',
   'region_19',
   'region_2',
   'region_20',
   'region_21',
   'region_22',
   'region_23',
   'region_24',
   'region_25',
   'region_26',
   'region_27',
   'region_28',
   'region_29',
   'region_3',
   'region_30',
   'region_31',
   'region_32',
   'region_33',
   'region_34',
   'region_4',
   'region_5',
   'region_6',
   'region_7',
   'region_8',
   'region_9',
   'UNKNOWN'],
  'labels': [0,
   1,
   2,
   3,
   4,
   5,
   6,
   7,
   8,
   9,
   10,
   11,
   12,
   13,
   14,
   15,
   16,
   17,
   18,
   19,
   20,
   21,
   22,
   23,
   24,
   25,
   26,
   27,
   28,
   29,
   30,
   31,
   32,
   33,
   -1],
  'n_labels': 35},
 'education': {'values': ["Bachelor's",
   'Below Secondary',
   "Master's & above",
   'UNKNOWN'],
  'labels': [0, 1, 2, -1],
  'n_labels': 4},
 'gender': {'values': ['f', 'm', 'UNKNOWN'],
  'labels': [0, 1, -1],
  'n_labels': 3},
 'recruitment_channel': {'values': ['other',
   'referred',
   'sourcing',
   'UNKNOWN'],
  'labels': [0, 1, 2, -1],
  'n_labels': 4}}
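The returned dictionary stores, for each column, parallel `values` and `labels` lists, so a value-to-label lookup can be built by zipping them. In the sketch below, the `mapping` literal is a subset of the output above, copied by hand so that the snippet runs without the library; the name `lookup` is introduced only for this illustration.

```python
# Subset of the get_mapping() output shown above, copied as a literal.
mapping = {
    "education": {
        "values": ["Bachelor's", "Below Secondary", "Master's & above", "UNKNOWN"],
        "labels": [0, 1, 2, -1],
        "n_labels": 4,
    },
    "gender": {"values": ["f", "m", "UNKNOWN"], "labels": [0, 1, -1], "n_labels": 3},
}

# Pair each original value with its integer label, column by column.
lookup = {
    column: dict(zip(info["values"], info["labels"]))
    for column, info in mapping.items()
}

print(lookup["education"]["Master's & above"])
print(lookup["gender"]["UNKNOWN"])
```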

We can also provide the dataset only when calling the fit method, instead of providing it when instantiating the object.

[5]:
encode = EncoderOrdinal(col_encode=None)
encode.fit(df=dataset)
new_df = encode.transform(dataset)
new_df
No columns specified for encoding. These columns have been automatically identfied as the following:
['department', 'region', 'education', 'gender', 'recruitment_channel']
[5]:
employee_id department region education gender recruitment_channel no_of_trainings age previous_year_rating length_of_service KPIs_met >80% awards_won? avg_training_score is_promoted
0 65438 7 31 2 0 2 1 35 5.0 8 1 0 49 0
1 65141 4 14 0 1 0 1 30 5.0 4 0 0 60 0
2 7513 7 10 0 1 2 1 34 3.0 7 0 0 50 0
3 2542 7 15 0 1 0 2 39 1.0 10 0 0 50 0
4 48945 8 18 0 1 0 1 45 3.0 2 0 0 73 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
54803 3030 8 5 0 1 2 1 48 3.0 17 0 0 78 0
54804 74592 4 19 2 0 0 1 37 2.0 6 0 0 56 0
54805 13918 0 0 0 1 0 1 27 5.0 3 1 0 79 0
54806 13614 7 33 -1 1 2 1 29 1.0 2 0 0 45 0
54807 51526 2 14 0 1 0 1 27 1.0 5 0 0 49 0

54808 rows × 14 columns

We can also specify which columns to encode using the column names:

[6]:
encode = EncoderOrdinal(
    df=dataset,
    col_encode=["department", "region", "education"]
)
encode.fit()
new_df = encode.transform(dataset)
new_df
[6]:
employee_id department region education gender recruitment_channel no_of_trainings age previous_year_rating length_of_service KPIs_met >80% awards_won? avg_training_score is_promoted
0 65438 7 31 2 f sourcing 1 35 5.0 8 1 0 49 0
1 65141 4 14 0 m other 1 30 5.0 4 0 0 60 0
2 7513 7 10 0 m sourcing 1 34 3.0 7 0 0 50 0
3 2542 7 15 0 m other 2 39 1.0 10 0 0 50 0
4 48945 8 18 0 m other 1 45 3.0 2 0 0 73 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
54803 3030 8 5 0 m sourcing 1 48 3.0 17 0 0 78 0
54804 74592 4 19 2 f other 1 37 2.0 6 0 0 56 0
54805 13918 0 0 0 m other 1 27 5.0 3 1 0 79 0
54806 13614 7 33 -1 m sourcing 1 29 1.0 2 0 0 45 0
54807 51526 2 14 0 m other 1 27 1.0 5 0 0 49 0

54808 rows × 14 columns

or the column indices:

[7]:
encode = EncoderOrdinal(
    df=dataset,
    col_encode=[1, 2, 3]
)
encode.fit()
new_df = encode.transform(dataset)
new_df
[7]:
employee_id department region education gender recruitment_channel no_of_trainings age previous_year_rating length_of_service KPIs_met >80% awards_won? avg_training_score is_promoted
0 65438 7 31 2 f sourcing 1 35 5.0 8 1 0 49 0
1 65141 4 14 0 m other 1 30 5.0 4 0 0 60 0
2 7513 7 10 0 m sourcing 1 34 3.0 7 0 0 50 0
3 2542 7 15 0 m other 2 39 1.0 10 0 0 50 0
4 48945 8 18 0 m other 1 45 3.0 2 0 0 73 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
54803 3030 8 5 0 m sourcing 1 48 3.0 17 0 0 78 0
54804 74592 4 19 2 f other 1 37 2.0 6 0 0 56 0
54805 13918 0 0 0 m other 1 27 5.0 3 1 0 79 0
54806 13614 7 33 -1 m sourcing 1 29 1.0 2 0 0 45 0
54807 51526 2 14 0 m other 1 27 1.0 5 0 0 49 0

54808 rows × 14 columns

We can also specify the order of the encodings through the categories parameter. For example, let’s set the order of the labels for the ‘education’ column.

[8]:
dataset['education'].unique()
[8]:
array(["Master's & above", "Bachelor's", nan, 'Below Secondary'],
      dtype=object)
[9]:
encode = EncoderOrdinal(
    df=dataset,
    col_encode=["education", "gender"],
    categories={'education': ["Below Secondary", "Bachelor's", "Master's & above"]}
)
encode.fit()
new_df = encode.transform(dataset)
new_df
[9]:
employee_id department region education gender recruitment_channel no_of_trainings age previous_year_rating length_of_service KPIs_met >80% awards_won? avg_training_score is_promoted
0 65438 Sales & Marketing region_7 2 0 sourcing 1 35 5.0 8 1 0 49 0
1 65141 Operations region_22 1 1 other 1 30 5.0 4 0 0 60 0
2 7513 Sales & Marketing region_19 1 1 sourcing 1 34 3.0 7 0 0 50 0
3 2542 Sales & Marketing region_23 1 1 other 2 39 1.0 10 0 0 50 0
4 48945 Technology region_26 1 1 other 1 45 3.0 2 0 0 73 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
54803 3030 Technology region_14 1 1 sourcing 1 48 3.0 17 0 0 78 0
54804 74592 Operations region_27 2 0 other 1 37 2.0 6 0 0 56 0
54805 13918 Analytics region_1 1 1 other 1 27 5.0 3 1 0 79 0
54806 13614 Sales & Marketing region_9 -1 1 sourcing 1 29 1.0 2 0 0 45 0
54807 51526 HR region_22 1 1 other 1 27 1.0 5 0 0 49 0

54808 rows × 14 columns
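For comparison, pandas offers a similar notion of an explicit category order through `pd.Categorical`. The sketch below uses a small hypothetical Series rather than the HR dataset. As in the output above, the missing value receives the code -1, although this is pandas behavior and says nothing about how EncoderOrdinal works internally.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of the 'education' column, including a missing value.
education = pd.Series(["Master's & above", "Bachelor's", np.nan, "Below Secondary"])

# Same order as the list passed to the `categories` parameter above.
order = ["Below Secondary", "Bachelor's", "Master's & above"]
codes = pd.Categorical(education, categories=order, ordered=True).codes

print(codes.tolist())
```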

The get_encoded_columns() method returns the list of encoded columns. For ordinal encoding, this is the list passed to the col_encode parameter if one was provided, or the automatically identified categorical columns otherwise.

[10]:
encode.get_encoded_columns()
[10]:
['education', 'gender']

Finally, we can recover the original values by calling the inverse_transform() method:

[11]:
org_df = encode.inverse_transform(new_df)
org_df
[11]:
employee_id department region education gender recruitment_channel no_of_trainings age previous_year_rating length_of_service KPIs_met >80% awards_won? avg_training_score is_promoted
0 65438 Sales & Marketing region_7 Master's & above f sourcing 1 35 5.0 8 1 0 49 0
1 65141 Operations region_22 Bachelor's m other 1 30 5.0 4 0 0 60 0
2 7513 Sales & Marketing region_19 Bachelor's m sourcing 1 34 3.0 7 0 0 50 0
3 2542 Sales & Marketing region_23 Bachelor's m other 2 39 1.0 10 0 0 50 0
4 48945 Technology region_26 Bachelor's m other 1 45 3.0 2 0 0 73 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
54803 3030 Technology region_14 Bachelor's m sourcing 1 48 3.0 17 0 0 78 0
54804 74592 Operations region_27 Master's & above f other 1 37 2.0 6 0 0 56 0
54805 13918 Analytics region_1 Bachelor's m other 1 27 5.0 3 1 0 79 0
54806 13614 Sales & Marketing region_9 None m sourcing 1 29 1.0 2 0 0 45 0
54807 51526 HR region_22 Bachelor's m other 1 27 1.0 5 0 0 49 0

54808 rows × 14 columns

Using a dataset without headers

The same procedure can be performed with a dataset without headers. Here, column indices are used instead of column names.

[12]:
dataset = pd.read_csv(data_dir + 'hr_promotion/train.csv', header=None, skiprows=1)
dataset.drop(columns=[0], inplace=True)
dataset
[12]:
1 2 3 4 5 6 7 8 9 10 11 12 13
0 Sales & Marketing region_7 Master's & above f sourcing 1 35 5.0 8 1 0 49 0
1 Operations region_22 Bachelor's m other 1 30 5.0 4 0 0 60 0
2 Sales & Marketing region_19 Bachelor's m sourcing 1 34 3.0 7 0 0 50 0
3 Sales & Marketing region_23 Bachelor's m other 2 39 1.0 10 0 0 50 0
4 Technology region_26 Bachelor's m other 1 45 3.0 2 0 0 73 0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
54803 Technology region_14 Bachelor's m sourcing 1 48 3.0 17 0 0 78 0
54804 Operations region_27 Master's & above f other 1 37 2.0 6 0 0 56 0
54805 Analytics region_1 Bachelor's m other 1 27 5.0 3 1 0 79 0
54806 Sales & Marketing region_9 NaN m sourcing 1 29 1.0 2 0 0 45 0
54807 HR region_22 Bachelor's m other 1 27 1.0 5 0 0 49 0

54808 rows × 13 columns

[13]:
encode = EncoderOrdinal(
    df=dataset,
    col_encode=[3, 2],
    categories={2: ["Below Secondary", "Bachelor's", "Master's & above"]}
)
encode.fit()
new_df = encode.transform(dataset)
new_df
[13]:
0 1 2 3 4 5 6 7 8 9 10 11 12
0 Sales & Marketing region_7 2 0 sourcing 1 35 5.0 8 1 0 49 0
1 Operations region_22 0 1 other 1 30 5.0 4 0 0 60 0
2 Sales & Marketing region_19 0 1 sourcing 1 34 3.0 7 0 0 50 0
3 Sales & Marketing region_23 0 1 other 2 39 1.0 10 0 0 50 0
4 Technology region_26 0 1 other 1 45 3.0 2 0 0 73 0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
54803 Technology region_14 0 1 sourcing 1 48 3.0 17 0 0 78 0
54804 Operations region_27 2 0 other 1 37 2.0 6 0 0 56 0
54805 Analytics region_1 0 1 other 1 27 5.0 3 1 0 79 0
54806 Sales & Marketing region_9 -1 1 sourcing 1 29 1.0 2 0 0 45 0
54807 HR region_22 0 1 other 1 27 1.0 5 0 0 49 0

54808 rows × 13 columns
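In the cell above, `col_encode=[3, 2]` led to the education and gender columns being encoded. After `drop(columns=[0])` those columns carry the labels 3 and 4 but sit at positions 2 and 3, which suggests that the integers are read as positions. This is an inference from the displayed output, not a documented guarantee. The pandas-only sketch below, built on a made-up two-row CSV, shows the difference between addressing a column by label and by position.

```python
import io
import pandas as pd

# Made-up CSV standing in for the headerless file (hypothetical values).
csv_text = "id,department,education\n1,Sales,Bachelor's\n2,HR,Master's & above\n"

# header=None with skiprows=1 discards the header row and labels columns 0, 1, 2.
df = pd.read_csv(io.StringIO(csv_text), header=None, skiprows=1)
df = df.drop(columns=[0])

print(list(df.columns))         # labels of the remaining columns
print(df.iloc[:, 1].tolist())   # second remaining column, by position
print(df[2].tolist())           # the same column, by label
```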

[14]:
org_df = encode.inverse_transform(new_df)
org_df
[14]:
0 1 2 3 4 5 6 7 8 9 10 11 12
0 Sales & Marketing region_7 Master's & above f sourcing 1 35 5.0 8 1 0 49 0
1 Operations region_22 Bachelor's m other 1 30 5.0 4 0 0 60 0
2 Sales & Marketing region_19 Bachelor's m sourcing 1 34 3.0 7 0 0 50 0
3 Sales & Marketing region_23 Bachelor's m other 2 39 1.0 10 0 0 50 0
4 Technology region_26 Bachelor's m other 1 45 3.0 2 0 0 73 0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
54803 Technology region_14 Bachelor's m sourcing 1 48 3.0 17 0 0 78 0
54804 Operations region_27 Master's & above f other 1 37 2.0 6 0 0 56 0
54805 Analytics region_1 Bachelor's m other 1 27 5.0 3 1 0 79 0
54806 Sales & Marketing region_9 None m sourcing 1 29 1.0 2 0 0 45 0
54807 HR region_22 Bachelor's m other 1 27 1.0 5 0 0 49 0

54808 rows × 13 columns

One-Hot Encoding

The following cell shows an example using the one-hot encoding class.

[15]:
dataset = pd.read_csv(data_dir + 'hr_promotion/train.csv')
dataset.head()

encode = EncoderOHE(col_encode=None)
encode.fit(df=dataset)
new_df = encode.transform(dataset)
new_df
No columns specified for encoding. These columns have been automatically identfied as the following:
['department', 'region', 'education', 'gender', 'recruitment_channel']
[15]:
employee_id no_of_trainings age previous_year_rating length_of_service KPIs_met >80% awards_won? avg_training_score is_promoted department_Finance ... region_region_6 region_region_7 region_region_8 region_region_9 education_Below Secondary education_Master's & above education_nan gender_m recruitment_channel_referred recruitment_channel_sourcing
0 65438 1 35 5.0 8 1 0 49 0 0 ... 0 1 0 0 0 1 0 0 0 1
1 65141 1 30 5.0 4 0 0 60 0 0 ... 0 0 0 0 0 0 0 1 0 0
2 7513 1 34 3.0 7 0 0 50 0 0 ... 0 0 0 0 0 0 0 1 0 1
3 2542 2 39 1.0 10 0 0 50 0 0 ... 0 0 0 0 0 0 0 1 0 0
4 48945 1 45 3.0 2 0 0 73 0 0 ... 0 0 0 0 0 0 0 1 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
54803 3030 1 48 3.0 17 0 0 78 0 0 ... 0 0 0 0 0 0 0 1 0 1
54804 74592 1 37 2.0 6 0 0 56 0 0 ... 0 0 0 0 0 1 0 0 0 0
54805 13918 1 27 5.0 3 1 0 79 0 0 ... 0 0 0 0 0 0 0 1 0 0
54806 13614 1 29 1.0 2 0 0 45 0 0 ... 0 0 0 1 0 0 1 1 0 1
54807 51526 1 27 1.0 5 0 0 49 0 0 ... 0 0 0 0 0 0 0 1 0 0

54808 rows × 56 columns
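To make the resulting layout concrete without depending on the library, here is a rough pandas analogue on a hypothetical three-row frame. `pd.get_dummies` is a different implementation from EncoderOHE; it is used here only because it produces the same kind of `<column>_<value>` indicator columns, including one for missing values when `dummy_na=True`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one numeric column and one categorical column with a NaN.
toy = pd.DataFrame({
    "age": [35, 30, 29],
    "education": ["Master's & above", "Bachelor's", np.nan],
})

# One indicator column per distinct value, plus one for NaN.
one_hot = pd.get_dummies(toy, columns=["education"], dummy_na=True, dtype=int)
print(list(one_hot.columns))
print(one_hot)
```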

With EncoderOHE, the get_encoded_columns() method returns the list of original columns that were encoded. These columns are no longer present in the transformed dataset, because each one was replaced by one or more one-hot encoded columns (one for each value in the column).

[16]:
encode.get_encoded_columns()
[16]:
['department', 'region', 'education', 'gender', 'recruitment_channel']

To get the list of all new columns created with the one-hot encodings, use the get_one_hot_columns() method:

[17]:
encode.get_one_hot_columns()
[17]:
['department_Finance',
 'department_HR',
 'department_Legal',
 'department_Operations',
 'department_Procurement',
 'department_R&D',
 'department_Sales & Marketing',
 'department_Technology',
 'region_region_10',
 'region_region_11',
 'region_region_12',
 'region_region_13',
 'region_region_14',
 'region_region_15',
 'region_region_16',
 'region_region_17',
 'region_region_18',
 'region_region_19',
 'region_region_2',
 'region_region_20',
 'region_region_21',
 'region_region_22',
 'region_region_23',
 'region_region_24',
 'region_region_25',
 'region_region_26',
 'region_region_27',
 'region_region_28',
 'region_region_29',
 'region_region_3',
 'region_region_30',
 'region_region_31',
 'region_region_32',
 'region_region_33',
 'region_region_34',
 'region_region_4',
 'region_region_5',
 'region_region_6',
 'region_region_7',
 'region_region_8',
 'region_region_9',
 'education_Below Secondary',
 "education_Master's & above",
 'education_nan',
 'gender_m',
 'recruitment_channel_referred',
 'recruitment_channel_sourcing']

We can also specify which columns to encode using the column names:

[18]:
encode = EncoderOHE(col_encode=['education'])
encode.fit(df=dataset)
new_df = encode.transform(dataset)
new_df
[18]:
employee_id department region gender recruitment_channel no_of_trainings age previous_year_rating length_of_service KPIs_met >80% awards_won? avg_training_score is_promoted education_Below Secondary education_Master's & above education_nan
0 65438 Sales & Marketing region_7 f sourcing 1 35 5.0 8 1 0 49 0 0 1 0
1 65141 Operations region_22 m other 1 30 5.0 4 0 0 60 0 0 0 0
2 7513 Sales & Marketing region_19 m sourcing 1 34 3.0 7 0 0 50 0 0 0 0
3 2542 Sales & Marketing region_23 m other 2 39 1.0 10 0 0 50 0 0 0 0
4 48945 Technology region_26 m other 1 45 3.0 2 0 0 73 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
54803 3030 Technology region_14 m sourcing 1 48 3.0 17 0 0 78 0 0 0 0
54804 74592 Operations region_27 f other 1 37 2.0 6 0 0 56 0 0 1 0
54805 13918 Analytics region_1 m other 1 27 5.0 3 1 0 79 0 0 0 0
54806 13614 Sales & Marketing region_9 m sourcing 1 29 1.0 2 0 0 45 0 0 0 1
54807 51526 HR region_22 m other 1 27 1.0 5 0 0 49 0 0 0 0

54808 rows × 16 columns

or the column indices:

[19]:
encode = EncoderOHE(col_encode=[3])
encode.fit(df=dataset)
new_df = encode.transform(dataset)
new_df
[19]:
employee_id department region gender recruitment_channel no_of_trainings age previous_year_rating length_of_service KPIs_met >80% awards_won? avg_training_score is_promoted education_Below Secondary education_Master's & above education_nan
0 65438 Sales & Marketing region_7 f sourcing 1 35 5.0 8 1 0 49 0 0 1 0
1 65141 Operations region_22 m other 1 30 5.0 4 0 0 60 0 0 0 0
2 7513 Sales & Marketing region_19 m sourcing 1 34 3.0 7 0 0 50 0 0 0 0
3 2542 Sales & Marketing region_23 m other 2 39 1.0 10 0 0 50 0 0 0 0
4 48945 Technology region_26 m other 1 45 3.0 2 0 0 73 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
54803 3030 Technology region_14 m sourcing 1 48 3.0 17 0 0 78 0 0 0 0
54804 74592 Operations region_27 f other 1 37 2.0 6 0 0 56 0 0 1 0
54805 13918 Analytics region_1 m other 1 27 5.0 3 1 0 79 0 0 0 0
54806 13614 Sales & Marketing region_9 m sourcing 1 29 1.0 2 0 0 45 0 0 0 1
54807 51526 HR region_22 m other 1 27 1.0 5 0 0 49 0 0 0 0

54808 rows × 16 columns

We can also revert the encoded columns to their original state by using the inverse_transform() method.

[20]:
org_df = encode.inverse_transform(new_df)
org_df
[20]:
education employee_id department region gender recruitment_channel no_of_trainings age previous_year_rating length_of_service KPIs_met >80% awards_won? avg_training_score is_promoted
0 Master's & above 65438 Sales & Marketing region_7 f sourcing 1 35 5.0 8 1 0 49 0
1 Bachelor's 65141 Operations region_22 m other 1 30 5.0 4 0 0 60 0
2 Bachelor's 7513 Sales & Marketing region_19 m sourcing 1 34 3.0 7 0 0 50 0
3 Bachelor's 2542 Sales & Marketing region_23 m other 2 39 1.0 10 0 0 50 0
4 Bachelor's 48945 Technology region_26 m other 1 45 3.0 2 0 0 73 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
54803 Bachelor's 3030 Technology region_14 m sourcing 1 48 3.0 17 0 0 78 0
54804 Master's & above 74592 Operations region_27 f other 1 37 2.0 6 0 0 56 0
54805 Bachelor's 13918 Analytics region_1 m other 1 27 5.0 3 1 0 79 0
54806 NaN 13614 Sales & Marketing region_9 m sourcing 1 29 1.0 2 0 0 45 0
54807 Bachelor's 51526 HR region_22 m other 1 27 1.0 5 0 0 49 0

54808 rows × 14 columns

Let us now test the ‘drop’ parameter. We repeat the same procedure as before, but this time with drop = False (drop = True is the default behavior). The result with drop = False (below) has one extra column, education_Bachelor’s, while the earlier result with drop = True doesn’t have this column, since it was dropped.

[21]:
encode = EncoderOHE(col_encode=['education'], drop=False)
encode.fit(df=dataset)
new_df = encode.transform(dataset)
new_df
[21]:
employee_id department region gender recruitment_channel no_of_trainings age previous_year_rating length_of_service KPIs_met >80% awards_won? avg_training_score is_promoted education_Bachelor's education_Below Secondary education_Master's & above education_nan
0 65438 Sales & Marketing region_7 f sourcing 1 35 5.0 8 1 0 49 0 0 0 1 0
1 65141 Operations region_22 m other 1 30 5.0 4 0 0 60 0 1 0 0 0
2 7513 Sales & Marketing region_19 m sourcing 1 34 3.0 7 0 0 50 0 1 0 0 0
3 2542 Sales & Marketing region_23 m other 2 39 1.0 10 0 0 50 0 1 0 0 0
4 48945 Technology region_26 m other 1 45 3.0 2 0 0 73 0 1 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
54803 3030 Technology region_14 m sourcing 1 48 3.0 17 0 0 78 0 1 0 0 0
54804 74592 Operations region_27 f other 1 37 2.0 6 0 0 56 0 0 0 1 0
54805 13918 Analytics region_1 m other 1 27 5.0 3 1 0 79 0 1 0 0 0
54806 13614 Sales & Marketing region_9 m sourcing 1 29 1.0 2 0 0 45 0 0 0 0 1
54807 51526 HR region_22 m other 1 27 1.0 5 0 0 49 0 1 0 0 0

54808 rows × 17 columns

[22]:
org_df = encode.inverse_transform(new_df)
org_df
[22]:
education employee_id department region gender recruitment_channel no_of_trainings age previous_year_rating length_of_service KPIs_met >80% awards_won? avg_training_score is_promoted
0 Master's & above 65438 Sales & Marketing region_7 f sourcing 1 35 5.0 8 1 0 49 0
1 Bachelor's 65141 Operations region_22 m other 1 30 5.0 4 0 0 60 0
2 Bachelor's 7513 Sales & Marketing region_19 m sourcing 1 34 3.0 7 0 0 50 0
3 Bachelor's 2542 Sales & Marketing region_23 m other 2 39 1.0 10 0 0 50 0
4 Bachelor's 48945 Technology region_26 m other 1 45 3.0 2 0 0 73 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
54803 Bachelor's 3030 Technology region_14 m sourcing 1 48 3.0 17 0 0 78 0
54804 Master's & above 74592 Operations region_27 f other 1 37 2.0 6 0 0 56 0
54805 Bachelor's 13918 Analytics region_1 m other 1 27 5.0 3 1 0 79 0
54806 NaN 13614 Sales & Marketing region_9 m sourcing 1 29 1.0 2 0 0 45 0
54807 Bachelor's 51526 HR region_22 m other 1 27 1.0 5 0 0 49 0

54808 rows × 14 columns
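The effect of dropping one indicator column can also be reproduced with pandas, where the corresponding option is `drop_first`. This is only an analogy on hypothetical data: it shows why one column is redundant, not how the `drop` parameter of EncoderOHE is implemented.

```python
import pandas as pd

# Hypothetical column with three distinct values and no missing data.
toy = pd.DataFrame({"education": ["Bachelor's", "Below Secondary", "Master's & above"]})

kept = pd.get_dummies(toy, columns=["education"], dtype=int)
dropped = pd.get_dummies(toy, columns=["education"], drop_first=True, dtype=int)

# The dropped column is implied by the others: a row of all zeros means "Bachelor's".
print(list(kept.columns))
print(list(dropped.columns))
```

Dropping one level removes a column that is a linear combination of the others, which matters for models that are sensitive to collinear features.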

Using a dataset without headers

[23]:
dataset = pd.read_csv(data_dir + 'hr_promotion/train.csv', header=None, skiprows=1)
dataset.drop(columns=[0], inplace=True)
dataset
[23]:
1 2 3 4 5 6 7 8 9 10 11 12 13
0 Sales & Marketing region_7 Master's & above f sourcing 1 35 5.0 8 1 0 49 0
1 Operations region_22 Bachelor's m other 1 30 5.0 4 0 0 60 0
2 Sales & Marketing region_19 Bachelor's m sourcing 1 34 3.0 7 0 0 50 0
3 Sales & Marketing region_23 Bachelor's m other 2 39 1.0 10 0 0 50 0
4 Technology region_26 Bachelor's m other 1 45 3.0 2 0 0 73 0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
54803 Technology region_14 Bachelor's m sourcing 1 48 3.0 17 0 0 78 0
54804 Operations region_27 Master's & above f other 1 37 2.0 6 0 0 56 0
54805 Analytics region_1 Bachelor's m other 1 27 5.0 3 1 0 79 0
54806 Sales & Marketing region_9 NaN m sourcing 1 29 1.0 2 0 0 45 0
54807 HR region_22 Bachelor's m other 1 27 1.0 5 0 0 49 0

54808 rows × 13 columns

[24]:
encode = EncoderOHE(col_encode=[2, 1])
encode.fit(df=dataset)
new_df = encode.transform(dataset)
new_df
[24]:
0 3 4 5 6 7 8 9 10 11 ... 1_region_31 1_region_32 1_region_33 1_region_34 1_region_4 1_region_5 1_region_6 1_region_7 1_region_8 1_region_9
0 Sales & Marketing f sourcing 1 35 5.0 8 1 0 49 ... 0 0 0 0 0 0 0 1 0 0
1 Operations m other 1 30 5.0 4 0 0 60 ... 0 0 0 0 0 0 0 0 0 0
2 Sales & Marketing m sourcing 1 34 3.0 7 0 0 50 ... 0 0 0 0 0 0 0 0 0 0
3 Sales & Marketing m other 2 39 1.0 10 0 0 50 ... 0 0 0 0 0 0 0 0 0 0
4 Technology m other 1 45 3.0 2 0 0 73 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
54803 Technology m sourcing 1 48 3.0 17 0 0 78 ... 0 0 0 0 0 0 0 0 0 0
54804 Operations f other 1 37 2.0 6 0 0 56 ... 0 0 0 0 0 0 0 0 0 0
54805 Analytics m other 1 27 5.0 3 1 0 79 ... 0 0 0 0 0 0 0 0 0 0
54806 Sales & Marketing m sourcing 1 29 1.0 2 0 0 45 ... 0 0 0 0 0 0 0 0 0 1
54807 HR m other 1 27 1.0 5 0 0 49 ... 0 0 0 0 0 0 0 0 0 0

54808 rows × 47 columns

[25]:
org_df = encode.inverse_transform(new_df)
org_df
[25]:
2 1 0 3 4 5 6 7 8 9 10 11 12
0 Master's & above region_7 Sales & Marketing f sourcing 1 35 5.0 8 1 0 49 0
1 Bachelor's region_22 Operations m other 1 30 5.0 4 0 0 60 0
2 Bachelor's region_19 Sales & Marketing m sourcing 1 34 3.0 7 0 0 50 0
3 Bachelor's region_23 Sales & Marketing m other 2 39 1.0 10 0 0 50 0
4 Bachelor's region_26 Technology m other 1 45 3.0 2 0 0 73 0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
54803 Bachelor's region_14 Technology m sourcing 1 48 3.0 17 0 0 78 0
54804 Master's & above region_27 Operations f other 1 37 2.0 6 0 0 56 0
54805 Bachelor's region_1 Analytics m other 1 27 5.0 3 1 0 79 0
54806 NaN region_9 Sales & Marketing m sourcing 1 29 1.0 2 0 0 45 0
54807 Bachelor's region_22 HR m other 1 27 1.0 5 0 0 49 0

54808 rows × 13 columns