Encoding Examples
This notebook demonstrates the different ways the encoder classes, EncoderOrdinal and EncoderOHE, can be used.
[1]:
import sys
sys.path.append('../../../notebooks')
import pandas as pd
import numpy as np
from raimitigations.dataprocessing import EncoderOHE, EncoderOrdinal
from download import download_datasets
Ordinal Encoding
In ordinal encoding, each value of a categorical variable is assigned an integer. This is typically used when the categories are ordered (e.g. education, where High School < Bachelors < Masters). In this example, we will see the different ways EncoderOrdinal can be used.
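To make the idea concrete, here is a minimal pure-Python sketch of ordinal encoding. It is not how EncoderOrdinal is implemented: the helper names `build_ordinal_mapping` and `encode_values` are hypothetical, and both the alphabetical ordering and the use of -1 for missing values are assumptions chosen to resemble the outputs shown later in this notebook.

```python
def build_ordinal_mapping(values):
    """Map each distinct non-missing value to an integer, in sorted order."""
    distinct = sorted({v for v in values if v is not None})
    return {value: label for label, value in enumerate(distinct)}


def encode_values(values, mapping, unknown_label=-1):
    """Replace each value with its label; missing or unseen values get unknown_label."""
    return [mapping.get(v, unknown_label) for v in values]


# Small hand-made sample (None stands in for a missing entry).
education = ["Master's & above", "Bachelor's", None, "Below Secondary", "Bachelor's"]

mapping = build_ordinal_mapping(education)
encoded = encode_values(education, mapping)

print(mapping)  # {"Bachelor's": 0, 'Below Secondary': 1, "Master's & above": 2}
print(encoded)  # [2, 0, -1, 1, 0]
```

Note that alphabetical order does not match the natural order of education levels, which is why an explicit category order is often preferable for truly ordinal variables.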
Using a dataset with headers
[2]:
data_dir = '../../../datasets/'
download_datasets(data_dir)
dataset = pd.read_csv(data_dir + 'hr_promotion/train.csv')
dataset
[2]:
| | employee_id | department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | KPIs_met >80% | awards_won? | avg_training_score | is_promoted |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 65438 | Sales & Marketing | region_7 | Master's & above | f | sourcing | 1 | 35 | 5.0 | 8 | 1 | 0 | 49 | 0 |
1 | 65141 | Operations | region_22 | Bachelor's | m | other | 1 | 30 | 5.0 | 4 | 0 | 0 | 60 | 0 |
2 | 7513 | Sales & Marketing | region_19 | Bachelor's | m | sourcing | 1 | 34 | 3.0 | 7 | 0 | 0 | 50 | 0 |
3 | 2542 | Sales & Marketing | region_23 | Bachelor's | m | other | 2 | 39 | 1.0 | 10 | 0 | 0 | 50 | 0 |
4 | 48945 | Technology | region_26 | Bachelor's | m | other | 1 | 45 | 3.0 | 2 | 0 | 0 | 73 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
54803 | 3030 | Technology | region_14 | Bachelor's | m | sourcing | 1 | 48 | 3.0 | 17 | 0 | 0 | 78 | 0 |
54804 | 74592 | Operations | region_27 | Master's & above | f | other | 1 | 37 | 2.0 | 6 | 0 | 0 | 56 | 0 |
54805 | 13918 | Analytics | region_1 | Bachelor's | m | other | 1 | 27 | 5.0 | 3 | 1 | 0 | 79 | 0 |
54806 | 13614 | Sales & Marketing | region_9 | NaN | m | sourcing | 1 | 29 | 1.0 | 2 | 0 | 0 | 45 | 0 |
54807 | 51526 | HR | region_22 | Bachelor's | m | other | 1 | 27 | 1.0 | 5 | 0 | 0 | 49 | 0 |
54808 rows × 14 columns
Here we don’t specify which columns to encode (col_encode=None), so the class identifies them automatically.
[3]:
encode = EncoderOrdinal(
df=dataset,
col_encode=None
)
encode.fit()
new_df = encode.transform(dataset)
new_df
No columns specified for encoding. These columns have been automatically identfied as the following:
['department', 'region', 'education', 'gender', 'recruitment_channel']
[3]:
| | employee_id | department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | KPIs_met >80% | awards_won? | avg_training_score | is_promoted |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 65438 | 7 | 31 | 2 | 0 | 2 | 1 | 35 | 5.0 | 8 | 1 | 0 | 49 | 0 |
1 | 65141 | 4 | 14 | 0 | 1 | 0 | 1 | 30 | 5.0 | 4 | 0 | 0 | 60 | 0 |
2 | 7513 | 7 | 10 | 0 | 1 | 2 | 1 | 34 | 3.0 | 7 | 0 | 0 | 50 | 0 |
3 | 2542 | 7 | 15 | 0 | 1 | 0 | 2 | 39 | 1.0 | 10 | 0 | 0 | 50 | 0 |
4 | 48945 | 8 | 18 | 0 | 1 | 0 | 1 | 45 | 3.0 | 2 | 0 | 0 | 73 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
54803 | 3030 | 8 | 5 | 0 | 1 | 2 | 1 | 48 | 3.0 | 17 | 0 | 0 | 78 | 0 |
54804 | 74592 | 4 | 19 | 2 | 0 | 0 | 1 | 37 | 2.0 | 6 | 0 | 0 | 56 | 0 |
54805 | 13918 | 0 | 0 | 0 | 1 | 0 | 1 | 27 | 5.0 | 3 | 1 | 0 | 79 | 0 |
54806 | 13614 | 7 | 33 | -1 | 1 | 2 | 1 | 29 | 1.0 | 2 | 0 | 0 | 45 | 0 |
54807 | 51526 | 2 | 14 | 0 | 1 | 0 | 1 | 27 | 1.0 | 5 | 0 | 0 | 49 | 0 |
54808 rows × 14 columns
We can then get the mapping dictionary used by the encoder by calling the get_mapping() method:
[4]:
encode.get_mapping()
[4]:
{'department': {'values': ['Analytics',
'Finance',
'HR',
'Legal',
'Operations',
'Procurement',
'R&D',
'Sales & Marketing',
'Technology',
'UNKNOWN'],
'labels': [0, 1, 2, 3, 4, 5, 6, 7, 8, -1],
'n_labels': 10},
'region': {'values': ['region_1',
'region_10',
'region_11',
'region_12',
'region_13',
'region_14',
'region_15',
'region_16',
'region_17',
'region_18',
'region_19',
'region_2',
'region_20',
'region_21',
'region_22',
'region_23',
'region_24',
'region_25',
'region_26',
'region_27',
'region_28',
'region_29',
'region_3',
'region_30',
'region_31',
'region_32',
'region_33',
'region_34',
'region_4',
'region_5',
'region_6',
'region_7',
'region_8',
'region_9',
'UNKNOWN'],
'labels': [0,
1,
2,
3,
4,
5,
6,
7,
8,
9,
10,
11,
12,
13,
14,
15,
16,
17,
18,
19,
20,
21,
22,
23,
24,
25,
26,
27,
28,
29,
30,
31,
32,
33,
-1],
'n_labels': 35},
'education': {'values': ["Bachelor's",
'Below Secondary',
"Master's & above",
'UNKNOWN'],
'labels': [0, 1, 2, -1],
'n_labels': 4},
'gender': {'values': ['f', 'm', 'UNKNOWN'],
'labels': [0, 1, -1],
'n_labels': 3},
'recruitment_channel': {'values': ['other',
'referred',
'sourcing',
'UNKNOWN'],
'labels': [0, 1, 2, -1],
'n_labels': 4}}
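The dictionary above can be turned into plain lookup tables. The sketch below copies the 'education' entry from that output and assumes, as the output suggests, that 'values' and 'labels' are parallel lists; the variable names are illustrative only.

```python
# The 'education' entry, copied from the get_mapping() output above.
education_map = {
    "values": ["Bachelor's", "Below Secondary", "Master's & above", "UNKNOWN"],
    "labels": [0, 1, 2, -1],
    "n_labels": 4,
}

# Pair each value with its label to get lookups in both directions.
value_to_label = dict(zip(education_map["values"], education_map["labels"]))
label_to_value = dict(zip(education_map["labels"], education_map["values"]))

print(value_to_label["Master's & above"])  # 2
print(label_to_value[-1])                  # UNKNOWN
```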
We can also provide the dataset only when calling the fit method, instead of providing it when instantiating the object.
[5]:
encode = EncoderOrdinal(col_encode=None)
encode.fit(df=dataset)
new_df = encode.transform(dataset)
new_df
No columns specified for encoding. These columns have been automatically identfied as the following:
['department', 'region', 'education', 'gender', 'recruitment_channel']
[5]:
| | employee_id | department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | KPIs_met >80% | awards_won? | avg_training_score | is_promoted |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 65438 | 7 | 31 | 2 | 0 | 2 | 1 | 35 | 5.0 | 8 | 1 | 0 | 49 | 0 |
1 | 65141 | 4 | 14 | 0 | 1 | 0 | 1 | 30 | 5.0 | 4 | 0 | 0 | 60 | 0 |
2 | 7513 | 7 | 10 | 0 | 1 | 2 | 1 | 34 | 3.0 | 7 | 0 | 0 | 50 | 0 |
3 | 2542 | 7 | 15 | 0 | 1 | 0 | 2 | 39 | 1.0 | 10 | 0 | 0 | 50 | 0 |
4 | 48945 | 8 | 18 | 0 | 1 | 0 | 1 | 45 | 3.0 | 2 | 0 | 0 | 73 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
54803 | 3030 | 8 | 5 | 0 | 1 | 2 | 1 | 48 | 3.0 | 17 | 0 | 0 | 78 | 0 |
54804 | 74592 | 4 | 19 | 2 | 0 | 0 | 1 | 37 | 2.0 | 6 | 0 | 0 | 56 | 0 |
54805 | 13918 | 0 | 0 | 0 | 1 | 0 | 1 | 27 | 5.0 | 3 | 1 | 0 | 79 | 0 |
54806 | 13614 | 7 | 33 | -1 | 1 | 2 | 1 | 29 | 1.0 | 2 | 0 | 0 | 45 | 0 |
54807 | 51526 | 2 | 14 | 0 | 1 | 0 | 1 | 27 | 1.0 | 5 | 0 | 0 | 49 | 0 |
54808 rows × 14 columns
We can also specify which columns to encode using the column names:
[6]:
encode = EncoderOrdinal(
df=dataset,
col_encode=["department", "region", "education"]
)
encode.fit()
new_df = encode.transform(dataset)
new_df
[6]:
| | employee_id | department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | KPIs_met >80% | awards_won? | avg_training_score | is_promoted |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 65438 | 7 | 31 | 2 | f | sourcing | 1 | 35 | 5.0 | 8 | 1 | 0 | 49 | 0 |
1 | 65141 | 4 | 14 | 0 | m | other | 1 | 30 | 5.0 | 4 | 0 | 0 | 60 | 0 |
2 | 7513 | 7 | 10 | 0 | m | sourcing | 1 | 34 | 3.0 | 7 | 0 | 0 | 50 | 0 |
3 | 2542 | 7 | 15 | 0 | m | other | 2 | 39 | 1.0 | 10 | 0 | 0 | 50 | 0 |
4 | 48945 | 8 | 18 | 0 | m | other | 1 | 45 | 3.0 | 2 | 0 | 0 | 73 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
54803 | 3030 | 8 | 5 | 0 | m | sourcing | 1 | 48 | 3.0 | 17 | 0 | 0 | 78 | 0 |
54804 | 74592 | 4 | 19 | 2 | f | other | 1 | 37 | 2.0 | 6 | 0 | 0 | 56 | 0 |
54805 | 13918 | 0 | 0 | 0 | m | other | 1 | 27 | 5.0 | 3 | 1 | 0 | 79 | 0 |
54806 | 13614 | 7 | 33 | -1 | m | sourcing | 1 | 29 | 1.0 | 2 | 0 | 0 | 45 | 0 |
54807 | 51526 | 2 | 14 | 0 | m | other | 1 | 27 | 1.0 | 5 | 0 | 0 | 49 | 0 |
54808 rows × 14 columns
or the column indices:
[7]:
encode = EncoderOrdinal(
df=dataset,
col_encode=[1,2,3]
)
encode.fit()
new_df = encode.transform(dataset)
new_df
[7]:
| | employee_id | department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | KPIs_met >80% | awards_won? | avg_training_score | is_promoted |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 65438 | 7 | 31 | 2 | f | sourcing | 1 | 35 | 5.0 | 8 | 1 | 0 | 49 | 0 |
1 | 65141 | 4 | 14 | 0 | m | other | 1 | 30 | 5.0 | 4 | 0 | 0 | 60 | 0 |
2 | 7513 | 7 | 10 | 0 | m | sourcing | 1 | 34 | 3.0 | 7 | 0 | 0 | 50 | 0 |
3 | 2542 | 7 | 15 | 0 | m | other | 2 | 39 | 1.0 | 10 | 0 | 0 | 50 | 0 |
4 | 48945 | 8 | 18 | 0 | m | other | 1 | 45 | 3.0 | 2 | 0 | 0 | 73 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
54803 | 3030 | 8 | 5 | 0 | m | sourcing | 1 | 48 | 3.0 | 17 | 0 | 0 | 78 | 0 |
54804 | 74592 | 4 | 19 | 2 | f | other | 1 | 37 | 2.0 | 6 | 0 | 0 | 56 | 0 |
54805 | 13918 | 0 | 0 | 0 | m | other | 1 | 27 | 5.0 | 3 | 1 | 0 | 79 | 0 |
54806 | 13614 | 7 | 33 | -1 | m | sourcing | 1 | 29 | 1.0 | 2 | 0 | 0 | 45 | 0 |
54807 | 51526 | 2 | 14 | 0 | m | other | 1 | 27 | 1.0 | 5 | 0 | 0 | 49 | 0 |
54808 rows × 14 columns
We can also specify the order of the labels assigned by the ordinal encoder, using the categories parameter. For example, let’s set the label order for the ‘education’ column, after first checking which values it contains.
[8]:
dataset['education'].unique()
[8]:
array(["Master's & above", "Bachelor's", nan, 'Below Secondary'],
dtype=object)
[9]:
encode = EncoderOrdinal(
df=dataset,
col_encode=["education", "gender"],
categories={'education': ["Below Secondary", "Bachelor's", "Master's & above"]}
)
encode.fit()
new_df = encode.transform(dataset)
new_df
[9]:
| | employee_id | department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | KPIs_met >80% | awards_won? | avg_training_score | is_promoted |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 65438 | Sales & Marketing | region_7 | 2 | 0 | sourcing | 1 | 35 | 5.0 | 8 | 1 | 0 | 49 | 0 |
1 | 65141 | Operations | region_22 | 1 | 1 | other | 1 | 30 | 5.0 | 4 | 0 | 0 | 60 | 0 |
2 | 7513 | Sales & Marketing | region_19 | 1 | 1 | sourcing | 1 | 34 | 3.0 | 7 | 0 | 0 | 50 | 0 |
3 | 2542 | Sales & Marketing | region_23 | 1 | 1 | other | 2 | 39 | 1.0 | 10 | 0 | 0 | 50 | 0 |
4 | 48945 | Technology | region_26 | 1 | 1 | other | 1 | 45 | 3.0 | 2 | 0 | 0 | 73 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
54803 | 3030 | Technology | region_14 | 1 | 1 | sourcing | 1 | 48 | 3.0 | 17 | 0 | 0 | 78 | 0 |
54804 | 74592 | Operations | region_27 | 2 | 0 | other | 1 | 37 | 2.0 | 6 | 0 | 0 | 56 | 0 |
54805 | 13918 | Analytics | region_1 | 1 | 1 | other | 1 | 27 | 5.0 | 3 | 1 | 0 | 79 | 0 |
54806 | 13614 | Sales & Marketing | region_9 | -1 | 1 | sourcing | 1 | 29 | 1.0 | 2 | 0 | 0 | 45 | 0 |
54807 | 51526 | HR | region_22 | 1 | 1 | other | 1 | 27 | 1.0 | 5 | 0 | 0 | 49 | 0 |
54808 rows × 14 columns
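The effect of an explicit category order can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the library's code: `encode_with_order` is a hypothetical helper, and mapping missing values to -1 is assumed here because that is what the table above shows for the row with a missing education.

```python
def encode_with_order(values, ordered_categories, unknown_label=-1):
    """Label each value by its position in ordered_categories."""
    position = {category: i for i, category in enumerate(ordered_categories)}
    return [position.get(v, unknown_label) for v in values]


# Same order as the one passed through `categories` in the cell above.
order = ["Below Secondary", "Bachelor's", "Master's & above"]
sample = ["Master's & above", "Bachelor's", None, "Below Secondary"]

labels = encode_with_order(sample, order)
print(labels)  # [2, 1, -1, 0]
```

With this order, comparing two labels numerically agrees with comparing the corresponding education levels, which is the property that makes the encoding ordinal.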
The get_encoded_columns() method returns the list of encoded columns. For ordinal encoding, this is the list passed to the col_encode parameter if one was provided; otherwise, it is the list of categorical columns that were identified automatically.
[10]:
encode.get_encoded_columns()
[10]:
['education', 'gender']
Finally, we can recover the original values by calling the inverse_transform() method:
[11]:
org_df = encode.inverse_transform(new_df)
org_df
[11]:
| | employee_id | department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | KPIs_met >80% | awards_won? | avg_training_score | is_promoted |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 65438 | Sales & Marketing | region_7 | Master's & above | f | sourcing | 1 | 35 | 5.0 | 8 | 1 | 0 | 49 | 0 |
1 | 65141 | Operations | region_22 | Bachelor's | m | other | 1 | 30 | 5.0 | 4 | 0 | 0 | 60 | 0 |
2 | 7513 | Sales & Marketing | region_19 | Bachelor's | m | sourcing | 1 | 34 | 3.0 | 7 | 0 | 0 | 50 | 0 |
3 | 2542 | Sales & Marketing | region_23 | Bachelor's | m | other | 2 | 39 | 1.0 | 10 | 0 | 0 | 50 | 0 |
4 | 48945 | Technology | region_26 | Bachelor's | m | other | 1 | 45 | 3.0 | 2 | 0 | 0 | 73 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
54803 | 3030 | Technology | region_14 | Bachelor's | m | sourcing | 1 | 48 | 3.0 | 17 | 0 | 0 | 78 | 0 |
54804 | 74592 | Operations | region_27 | Master's & above | f | other | 1 | 37 | 2.0 | 6 | 0 | 0 | 56 | 0 |
54805 | 13918 | Analytics | region_1 | Bachelor's | m | other | 1 | 27 | 5.0 | 3 | 1 | 0 | 79 | 0 |
54806 | 13614 | Sales & Marketing | region_9 | None | m | sourcing | 1 | 29 | 1.0 | 2 | 0 | 0 | 45 | 0 |
54807 | 51526 | HR | region_22 | Bachelor's | m | other | 1 | 27 | 1.0 | 5 | 0 | 0 | 49 | 0 |
54808 rows × 14 columns
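Conceptually, the inverse step is a lookup from label back to category. The sketch below is a simplified stand-in with a hypothetical helper, `inverse_encode`; it maps the -1 label to None, mirroring the output above, where the missing education value comes back as None.

```python
def inverse_encode(labels, ordered_categories, unknown_label=-1):
    """Turn integer labels back into category values; unknown_label becomes None."""
    return [
        None if label == unknown_label else ordered_categories[label]
        for label in labels
    ]


order = ["Below Secondary", "Bachelor's", "Master's & above"]
labels = [2, 1, -1, 0]

restored = inverse_encode(labels, order)
print(restored)  # ["Master's & above", "Bachelor's", None, 'Below Secondary']
```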
Using a dataset without headers
The same procedure can be performed with a dataset without headers. Here, column indices are used instead of column names.
[12]:
dataset = pd.read_csv(data_dir + 'hr_promotion/train.csv', header=None, skiprows=1)
dataset.drop(columns=[0], inplace=True)
dataset
[12]:
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Sales & Marketing | region_7 | Master's & above | f | sourcing | 1 | 35 | 5.0 | 8 | 1 | 0 | 49 | 0 |
1 | Operations | region_22 | Bachelor's | m | other | 1 | 30 | 5.0 | 4 | 0 | 0 | 60 | 0 |
2 | Sales & Marketing | region_19 | Bachelor's | m | sourcing | 1 | 34 | 3.0 | 7 | 0 | 0 | 50 | 0 |
3 | Sales & Marketing | region_23 | Bachelor's | m | other | 2 | 39 | 1.0 | 10 | 0 | 0 | 50 | 0 |
4 | Technology | region_26 | Bachelor's | m | other | 1 | 45 | 3.0 | 2 | 0 | 0 | 73 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
54803 | Technology | region_14 | Bachelor's | m | sourcing | 1 | 48 | 3.0 | 17 | 0 | 0 | 78 | 0 |
54804 | Operations | region_27 | Master's & above | f | other | 1 | 37 | 2.0 | 6 | 0 | 0 | 56 | 0 |
54805 | Analytics | region_1 | Bachelor's | m | other | 1 | 27 | 5.0 | 3 | 1 | 0 | 79 | 0 |
54806 | Sales & Marketing | region_9 | NaN | m | sourcing | 1 | 29 | 1.0 | 2 | 0 | 0 | 45 | 0 |
54807 | HR | region_22 | Bachelor's | m | other | 1 | 27 | 1.0 | 5 | 0 | 0 | 49 | 0 |
54808 rows × 13 columns
[13]:
encode = EncoderOrdinal(
df=dataset,
col_encode=[3, 2],
categories={2: ["Below Secondary", "Bachelor's", "Master's & above"]}
)
encode.fit()
new_df = encode.transform(dataset)
new_df
[13]:
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Sales & Marketing | region_7 | 2 | 0 | sourcing | 1 | 35 | 5.0 | 8 | 1 | 0 | 49 | 0 |
1 | Operations | region_22 | 0 | 1 | other | 1 | 30 | 5.0 | 4 | 0 | 0 | 60 | 0 |
2 | Sales & Marketing | region_19 | 0 | 1 | sourcing | 1 | 34 | 3.0 | 7 | 0 | 0 | 50 | 0 |
3 | Sales & Marketing | region_23 | 0 | 1 | other | 2 | 39 | 1.0 | 10 | 0 | 0 | 50 | 0 |
4 | Technology | region_26 | 0 | 1 | other | 1 | 45 | 3.0 | 2 | 0 | 0 | 73 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
54803 | Technology | region_14 | 0 | 1 | sourcing | 1 | 48 | 3.0 | 17 | 0 | 0 | 78 | 0 |
54804 | Operations | region_27 | 2 | 0 | other | 1 | 37 | 2.0 | 6 | 0 | 0 | 56 | 0 |
54805 | Analytics | region_1 | 0 | 1 | other | 1 | 27 | 5.0 | 3 | 1 | 0 | 79 | 0 |
54806 | Sales & Marketing | region_9 | -1 | 1 | sourcing | 1 | 29 | 1.0 | 2 | 0 | 0 | 45 | 0 |
54807 | HR | region_22 | 0 | 1 | other | 1 | 27 | 1.0 | 5 | 0 | 0 | 49 | 0 |
54808 rows × 13 columns
[14]:
org_df = encode.inverse_transform(new_df)
org_df
[14]:
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Sales & Marketing | region_7 | Master's & above | f | sourcing | 1 | 35 | 5.0 | 8 | 1 | 0 | 49 | 0 |
1 | Operations | region_22 | Bachelor's | m | other | 1 | 30 | 5.0 | 4 | 0 | 0 | 60 | 0 |
2 | Sales & Marketing | region_19 | Bachelor's | m | sourcing | 1 | 34 | 3.0 | 7 | 0 | 0 | 50 | 0 |
3 | Sales & Marketing | region_23 | Bachelor's | m | other | 2 | 39 | 1.0 | 10 | 0 | 0 | 50 | 0 |
4 | Technology | region_26 | Bachelor's | m | other | 1 | 45 | 3.0 | 2 | 0 | 0 | 73 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
54803 | Technology | region_14 | Bachelor's | m | sourcing | 1 | 48 | 3.0 | 17 | 0 | 0 | 78 | 0 |
54804 | Operations | region_27 | Master's & above | f | other | 1 | 37 | 2.0 | 6 | 0 | 0 | 56 | 0 |
54805 | Analytics | region_1 | Bachelor's | m | other | 1 | 27 | 5.0 | 3 | 1 | 0 | 79 | 0 |
54806 | Sales & Marketing | region_9 | None | m | sourcing | 1 | 29 | 1.0 | 2 | 0 | 0 | 45 | 0 |
54807 | HR | region_22 | Bachelor's | m | other | 1 | 27 | 1.0 | 5 | 0 | 0 | 49 | 0 |
54808 rows × 13 columns
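Working by position rather than by name can be illustrated without pandas. The sketch below uses only the standard library and a tiny made-up CSV (the data, the `EDUCATION` constant and the zero-based position convention are all assumptions for illustration); it skips the header row, drops the first column, and then encodes a column identified purely by its position.

```python
import csv
import io

# Tiny stand-in for a CSV file (hypothetical data; the first line is the header).
raw = "id,department,education\n1,Sales,Bachelor's\n2,HR,Master's & above\n3,HR,\n"

reader = csv.reader(io.StringIO(raw))
next(reader)                        # skip the header row, similar to skiprows=1
rows = [row[1:] for row in reader]  # drop the first column (the id)

EDUCATION = 1  # zero-based position of the education column after the drop
order = ["Below Secondary", "Bachelor's", "Master's & above"]
position = {category: i for i, category in enumerate(order)}

for row in rows:
    # An empty string (missing value) is not in `position`, so it becomes -1.
    row[EDUCATION] = position.get(row[EDUCATION], -1)

print(rows)  # [['Sales', 1], ['HR', 2], ['HR', -1]]
```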
One-Hot Encoding
The following cell shows an example using the one-hot encoding class.
[15]:
dataset = pd.read_csv(data_dir + 'hr_promotion/train.csv')
dataset.head()
encode = EncoderOHE(col_encode=None)
encode.fit(df=dataset)
new_df = encode.transform(dataset)
new_df
No columns specified for encoding. These columns have been automatically identfied as the following:
['department', 'region', 'education', 'gender', 'recruitment_channel']
[15]:
| | employee_id | no_of_trainings | age | previous_year_rating | length_of_service | KPIs_met >80% | awards_won? | avg_training_score | is_promoted | department_Finance | ... | region_region_6 | region_region_7 | region_region_8 | region_region_9 | education_Below Secondary | education_Master's & above | education_nan | gender_m | recruitment_channel_referred | recruitment_channel_sourcing |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 65438 | 1 | 35 | 5.0 | 8 | 1 | 0 | 49 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
1 | 65141 | 1 | 30 | 5.0 | 4 | 0 | 0 | 60 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
2 | 7513 | 1 | 34 | 3.0 | 7 | 0 | 0 | 50 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
3 | 2542 | 2 | 39 | 1.0 | 10 | 0 | 0 | 50 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
4 | 48945 | 1 | 45 | 3.0 | 2 | 0 | 0 | 73 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
54803 | 3030 | 1 | 48 | 3.0 | 17 | 0 | 0 | 78 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
54804 | 74592 | 1 | 37 | 2.0 | 6 | 0 | 0 | 56 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
54805 | 13918 | 1 | 27 | 5.0 | 3 | 1 | 0 | 79 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
54806 | 13614 | 1 | 29 | 1.0 | 2 | 0 | 0 | 45 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 |
54807 | 51526 | 1 | 27 | 1.0 | 5 | 0 | 0 | 49 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
54808 rows × 56 columns
With EncoderOHE, the get_encoded_columns() method returns the list of original columns that were encoded. These columns are no longer present in the transformed dataset, because each one was replaced by one or more one-hot encoded columns (one per value in the column, apart from the value that is dropped by default, as discussed below).
[16]:
encode.get_encoded_columns()
[16]:
['department', 'region', 'education', 'gender', 'recruitment_channel']
To get the list of all new columns created with the one-hot encodings, use the get_one_hot_columns() method:
[17]:
encode.get_one_hot_columns()
[17]:
['department_Finance',
'department_HR',
'department_Legal',
'department_Operations',
'department_Procurement',
'department_R&D',
'department_Sales & Marketing',
'department_Technology',
'region_region_10',
'region_region_11',
'region_region_12',
'region_region_13',
'region_region_14',
'region_region_15',
'region_region_16',
'region_region_17',
'region_region_18',
'region_region_19',
'region_region_2',
'region_region_20',
'region_region_21',
'region_region_22',
'region_region_23',
'region_region_24',
'region_region_25',
'region_region_26',
'region_region_27',
'region_region_28',
'region_region_29',
'region_region_3',
'region_region_30',
'region_region_31',
'region_region_32',
'region_region_33',
'region_region_34',
'region_region_4',
'region_region_5',
'region_region_6',
'region_region_7',
'region_region_8',
'region_region_9',
'education_Below Secondary',
"education_Master's & above",
'education_nan',
'gender_m',
'recruitment_channel_referred',
'recruitment_channel_sourcing']
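The names in this list follow a visible pattern: the original column name, an underscore, then the category value, with the first category of each column absent. The sketch below reproduces that pattern in plain Python. It is a guess at the naming scheme based only on the output above, not the library's implementation; `one_hot_columns` is a hypothetical helper, and treating a missing value as the string "nan" is an assumption made to match the education_nan column.

```python
def one_hot_columns(column, values, drop_first=True):
    """Build '<column>_<value>' names, optionally skipping the first category."""
    categories = sorted({str(v) for v in values})
    if drop_first:
        categories = categories[1:]
    return [f"{column}_{category}" for category in categories]


gender = ["f", "m", "m", "f"]
education = ["Master's & above", "Bachelor's", float("nan"), "Below Secondary"]

print(one_hot_columns("gender", gender))
# ['gender_m']
print(one_hot_columns("education", education))
# ['education_Below Secondary', "education_Master's & above", 'education_nan']
```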
We can also specify which columns to encode using the column names:
[18]:
encode = EncoderOHE(col_encode=['education'])
encode.fit(df=dataset)
new_df = encode.transform(dataset)
new_df
[18]:
| | employee_id | department | region | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | KPIs_met >80% | awards_won? | avg_training_score | is_promoted | education_Below Secondary | education_Master's & above | education_nan |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 65438 | Sales & Marketing | region_7 | f | sourcing | 1 | 35 | 5.0 | 8 | 1 | 0 | 49 | 0 | 0 | 1 | 0 |
1 | 65141 | Operations | region_22 | m | other | 1 | 30 | 5.0 | 4 | 0 | 0 | 60 | 0 | 0 | 0 | 0 |
2 | 7513 | Sales & Marketing | region_19 | m | sourcing | 1 | 34 | 3.0 | 7 | 0 | 0 | 50 | 0 | 0 | 0 | 0 |
3 | 2542 | Sales & Marketing | region_23 | m | other | 2 | 39 | 1.0 | 10 | 0 | 0 | 50 | 0 | 0 | 0 | 0 |
4 | 48945 | Technology | region_26 | m | other | 1 | 45 | 3.0 | 2 | 0 | 0 | 73 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
54803 | 3030 | Technology | region_14 | m | sourcing | 1 | 48 | 3.0 | 17 | 0 | 0 | 78 | 0 | 0 | 0 | 0 |
54804 | 74592 | Operations | region_27 | f | other | 1 | 37 | 2.0 | 6 | 0 | 0 | 56 | 0 | 0 | 1 | 0 |
54805 | 13918 | Analytics | region_1 | m | other | 1 | 27 | 5.0 | 3 | 1 | 0 | 79 | 0 | 0 | 0 | 0 |
54806 | 13614 | Sales & Marketing | region_9 | m | sourcing | 1 | 29 | 1.0 | 2 | 0 | 0 | 45 | 0 | 0 | 0 | 1 |
54807 | 51526 | HR | region_22 | m | other | 1 | 27 | 1.0 | 5 | 0 | 0 | 49 | 0 | 0 | 0 | 0 |
54808 rows × 16 columns
or the column indices:
[19]:
encode = EncoderOHE(col_encode=[3])
encode.fit(df=dataset)
new_df = encode.transform(dataset)
new_df
[19]:
| | employee_id | department | region | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | KPIs_met >80% | awards_won? | avg_training_score | is_promoted | education_Below Secondary | education_Master's & above | education_nan |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 65438 | Sales & Marketing | region_7 | f | sourcing | 1 | 35 | 5.0 | 8 | 1 | 0 | 49 | 0 | 0 | 1 | 0 |
1 | 65141 | Operations | region_22 | m | other | 1 | 30 | 5.0 | 4 | 0 | 0 | 60 | 0 | 0 | 0 | 0 |
2 | 7513 | Sales & Marketing | region_19 | m | sourcing | 1 | 34 | 3.0 | 7 | 0 | 0 | 50 | 0 | 0 | 0 | 0 |
3 | 2542 | Sales & Marketing | region_23 | m | other | 2 | 39 | 1.0 | 10 | 0 | 0 | 50 | 0 | 0 | 0 | 0 |
4 | 48945 | Technology | region_26 | m | other | 1 | 45 | 3.0 | 2 | 0 | 0 | 73 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
54803 | 3030 | Technology | region_14 | m | sourcing | 1 | 48 | 3.0 | 17 | 0 | 0 | 78 | 0 | 0 | 0 | 0 |
54804 | 74592 | Operations | region_27 | f | other | 1 | 37 | 2.0 | 6 | 0 | 0 | 56 | 0 | 0 | 1 | 0 |
54805 | 13918 | Analytics | region_1 | m | other | 1 | 27 | 5.0 | 3 | 1 | 0 | 79 | 0 | 0 | 0 | 0 |
54806 | 13614 | Sales & Marketing | region_9 | m | sourcing | 1 | 29 | 1.0 | 2 | 0 | 0 | 45 | 0 | 0 | 0 | 1 |
54807 | 51526 | HR | region_22 | m | other | 1 | 27 | 1.0 | 5 | 0 | 0 | 49 | 0 | 0 | 0 | 0 |
54808 rows × 16 columns
We can also revert the encoded columns to their original state by using the inverse_transform() method.
[20]:
org_df = encode.inverse_transform(new_df)
org_df
[20]:
| | education | employee_id | department | region | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | KPIs_met >80% | awards_won? | avg_training_score | is_promoted |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Master's & above | 65438 | Sales & Marketing | region_7 | f | sourcing | 1 | 35 | 5.0 | 8 | 1 | 0 | 49 | 0 |
1 | Bachelor's | 65141 | Operations | region_22 | m | other | 1 | 30 | 5.0 | 4 | 0 | 0 | 60 | 0 |
2 | Bachelor's | 7513 | Sales & Marketing | region_19 | m | sourcing | 1 | 34 | 3.0 | 7 | 0 | 0 | 50 | 0 |
3 | Bachelor's | 2542 | Sales & Marketing | region_23 | m | other | 2 | 39 | 1.0 | 10 | 0 | 0 | 50 | 0 |
4 | Bachelor's | 48945 | Technology | region_26 | m | other | 1 | 45 | 3.0 | 2 | 0 | 0 | 73 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
54803 | Bachelor's | 3030 | Technology | region_14 | m | sourcing | 1 | 48 | 3.0 | 17 | 0 | 0 | 78 | 0 |
54804 | Master's & above | 74592 | Operations | region_27 | f | other | 1 | 37 | 2.0 | 6 | 0 | 0 | 56 | 0 |
54805 | Bachelor's | 13918 | Analytics | region_1 | m | other | 1 | 27 | 5.0 | 3 | 1 | 0 | 79 | 0 |
54806 | NaN | 13614 | Sales & Marketing | region_9 | m | sourcing | 1 | 29 | 1.0 | 2 | 0 | 0 | 45 | 0 |
54807 | Bachelor's | 51526 | HR | region_22 | m | other | 1 | 27 | 1.0 | 5 | 0 | 0 | 49 | 0 |
54808 rows × 14 columns
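The encoded table in cells [18] and [19] has no education_Bachelor's column, yet the inverse transform still recovers Bachelor's. One way to see why this is possible: a row with zeros in every one-hot column can only belong to the category that has no column. The sketch below illustrates this reasoning with hypothetical helpers (`one_hot_row`, `inverse_one_hot`) and a reduced category list; it is not the library's code.

```python
CATEGORIES = ["Bachelor's", "Below Secondary", "Master's & above"]
KEPT = CATEGORIES[1:]  # the first category gets no column of its own


def one_hot_row(value):
    """One flag per kept category; the first category encodes as all zeros."""
    return [1 if value == category else 0 for category in KEPT]


def inverse_one_hot(row):
    """Return the category whose flag is set, or the dropped one if none is."""
    for flag, category in zip(row, KEPT):
        if flag == 1:
            return category
    return CATEGORIES[0]


print(one_hot_row("Bachelor's"))  # [0, 0]
print(inverse_one_hot([0, 0]))    # Bachelor's
print(inverse_one_hot([0, 1]))    # Master's & above
```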
Let us now test the ‘drop’ parameter. We repeat the same procedure as in the previous cells, but this time with drop = False (drop = True is the default behavior). The output of the cell that uses drop = False (below) has one extra column, education_Bachelor's, which is absent from the output of the cells that use drop = True (above) because it was dropped.
[21]:
encode = EncoderOHE(col_encode=['education'], drop=False)
encode.fit(df=dataset)
new_df = encode.transform(dataset)
new_df
[21]:
| | employee_id | department | region | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | KPIs_met >80% | awards_won? | avg_training_score | is_promoted | education_Bachelor's | education_Below Secondary | education_Master's & above | education_nan |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 65438 | Sales & Marketing | region_7 | f | sourcing | 1 | 35 | 5.0 | 8 | 1 | 0 | 49 | 0 | 0 | 0 | 1 | 0 |
1 | 65141 | Operations | region_22 | m | other | 1 | 30 | 5.0 | 4 | 0 | 0 | 60 | 0 | 1 | 0 | 0 | 0 |
2 | 7513 | Sales & Marketing | region_19 | m | sourcing | 1 | 34 | 3.0 | 7 | 0 | 0 | 50 | 0 | 1 | 0 | 0 | 0 |
3 | 2542 | Sales & Marketing | region_23 | m | other | 2 | 39 | 1.0 | 10 | 0 | 0 | 50 | 0 | 1 | 0 | 0 | 0 |
4 | 48945 | Technology | region_26 | m | other | 1 | 45 | 3.0 | 2 | 0 | 0 | 73 | 0 | 1 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
54803 | 3030 | Technology | region_14 | m | sourcing | 1 | 48 | 3.0 | 17 | 0 | 0 | 78 | 0 | 1 | 0 | 0 | 0 |
54804 | 74592 | Operations | region_27 | f | other | 1 | 37 | 2.0 | 6 | 0 | 0 | 56 | 0 | 0 | 0 | 1 | 0 |
54805 | 13918 | Analytics | region_1 | m | other | 1 | 27 | 5.0 | 3 | 1 | 0 | 79 | 0 | 1 | 0 | 0 | 0 |
54806 | 13614 | Sales & Marketing | region_9 | m | sourcing | 1 | 29 | 1.0 | 2 | 0 | 0 | 45 | 0 | 0 | 0 | 0 | 1 |
54807 | 51526 | HR | region_22 | m | other | 1 | 27 | 1.0 | 5 | 0 | 0 | 49 | 0 | 1 | 0 | 0 | 0 |
54808 rows × 17 columns
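Dropping one column loses no information, because in a full one-hot encoding exactly one flag is set per row, so any single column can be recomputed from the others. The sketch below shows this for the education categories seen in the output above; the `one_hot` helper is hypothetical and "nan" is treated as an ordinary category, as an assumption matching the education_nan column.

```python
CATEGORIES = ["Bachelor's", "Below Secondary", "Master's & above", "nan"]


def one_hot(value, drop):
    """Encode value as 0/1 flags, leaving out the first category when drop is True."""
    kept = CATEGORIES[1:] if drop else CATEGORIES
    return [int(value == category) for category in kept]


full = one_hot("Bachelor's", drop=False)
reduced = one_hot("Bachelor's", drop=True)

# The dropped flag is 1 exactly when none of the remaining flags is set.
recovered_first = 1 - sum(reduced)

print(full)             # [1, 0, 0, 0]
print(reduced)          # [0, 0, 0]
print(recovered_first)  # 1
```

Keeping all columns (drop = False) is easier to read, while dropping one avoids feeding a perfectly redundant column to models that are sensitive to collinearity.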
[22]:
org_df = encode.inverse_transform(new_df)
org_df
[22]:
| | education | employee_id | department | region | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | KPIs_met >80% | awards_won? | avg_training_score | is_promoted |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Master's & above | 65438 | Sales & Marketing | region_7 | f | sourcing | 1 | 35 | 5.0 | 8 | 1 | 0 | 49 | 0 |
1 | Bachelor's | 65141 | Operations | region_22 | m | other | 1 | 30 | 5.0 | 4 | 0 | 0 | 60 | 0 |
2 | Bachelor's | 7513 | Sales & Marketing | region_19 | m | sourcing | 1 | 34 | 3.0 | 7 | 0 | 0 | 50 | 0 |
3 | Bachelor's | 2542 | Sales & Marketing | region_23 | m | other | 2 | 39 | 1.0 | 10 | 0 | 0 | 50 | 0 |
4 | Bachelor's | 48945 | Technology | region_26 | m | other | 1 | 45 | 3.0 | 2 | 0 | 0 | 73 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
54803 | Bachelor's | 3030 | Technology | region_14 | m | sourcing | 1 | 48 | 3.0 | 17 | 0 | 0 | 78 | 0 |
54804 | Master's & above | 74592 | Operations | region_27 | f | other | 1 | 37 | 2.0 | 6 | 0 | 0 | 56 | 0 |
54805 | Bachelor's | 13918 | Analytics | region_1 | m | other | 1 | 27 | 5.0 | 3 | 1 | 0 | 79 | 0 |
54806 | NaN | 13614 | Sales & Marketing | region_9 | m | sourcing | 1 | 29 | 1.0 | 2 | 0 | 0 | 45 | 0 |
54807 | Bachelor's | 51526 | HR | region_22 | m | other | 1 | 27 | 1.0 | 5 | 0 | 0 | 49 | 0 |
54808 rows × 14 columns
Using a dataset without headers
[23]:
dataset = pd.read_csv(data_dir + 'hr_promotion/train.csv', header=None, skiprows=1)
dataset.drop(columns=[0], inplace=True)
dataset
[23]:
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Sales & Marketing | region_7 | Master's & above | f | sourcing | 1 | 35 | 5.0 | 8 | 1 | 0 | 49 | 0 |
1 | Operations | region_22 | Bachelor's | m | other | 1 | 30 | 5.0 | 4 | 0 | 0 | 60 | 0 |
2 | Sales & Marketing | region_19 | Bachelor's | m | sourcing | 1 | 34 | 3.0 | 7 | 0 | 0 | 50 | 0 |
3 | Sales & Marketing | region_23 | Bachelor's | m | other | 2 | 39 | 1.0 | 10 | 0 | 0 | 50 | 0 |
4 | Technology | region_26 | Bachelor's | m | other | 1 | 45 | 3.0 | 2 | 0 | 0 | 73 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
54803 | Technology | region_14 | Bachelor's | m | sourcing | 1 | 48 | 3.0 | 17 | 0 | 0 | 78 | 0 |
54804 | Operations | region_27 | Master's & above | f | other | 1 | 37 | 2.0 | 6 | 0 | 0 | 56 | 0 |
54805 | Analytics | region_1 | Bachelor's | m | other | 1 | 27 | 5.0 | 3 | 1 | 0 | 79 | 0 |
54806 | Sales & Marketing | region_9 | NaN | m | sourcing | 1 | 29 | 1.0 | 2 | 0 | 0 | 45 | 0 |
54807 | HR | region_22 | Bachelor's | m | other | 1 | 27 | 1.0 | 5 | 0 | 0 | 49 | 0 |
54808 rows × 13 columns
[24]:
encode = EncoderOHE(col_encode=[2, 1])
encode.fit(df=dataset)
new_df = encode.transform(dataset)
new_df
[24]:
| | 0 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | ... | 1_region_31 | 1_region_32 | 1_region_33 | 1_region_34 | 1_region_4 | 1_region_5 | 1_region_6 | 1_region_7 | 1_region_8 | 1_region_9 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Sales & Marketing | f | sourcing | 1 | 35 | 5.0 | 8 | 1 | 0 | 49 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
1 | Operations | m | other | 1 | 30 | 5.0 | 4 | 0 | 0 | 60 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | Sales & Marketing | m | sourcing | 1 | 34 | 3.0 | 7 | 0 | 0 | 50 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | Sales & Marketing | m | other | 2 | 39 | 1.0 | 10 | 0 | 0 | 50 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | Technology | m | other | 1 | 45 | 3.0 | 2 | 0 | 0 | 73 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
54803 | Technology | m | sourcing | 1 | 48 | 3.0 | 17 | 0 | 0 | 78 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
54804 | Operations | f | other | 1 | 37 | 2.0 | 6 | 0 | 0 | 56 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
54805 | Analytics | m | other | 1 | 27 | 5.0 | 3 | 1 | 0 | 79 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
54806 | Sales & Marketing | m | sourcing | 1 | 29 | 1.0 | 2 | 0 | 0 | 45 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
54807 | HR | m | other | 1 | 27 | 1.0 | 5 | 0 | 0 | 49 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
54808 rows × 47 columns
[25]:
org_df = encode.inverse_transform(new_df)
org_df
[25]:
| | 2 | 1 | 0 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Master's & above | region_7 | Sales & Marketing | f | sourcing | 1 | 35 | 5.0 | 8 | 1 | 0 | 49 | 0 |
1 | Bachelor's | region_22 | Operations | m | other | 1 | 30 | 5.0 | 4 | 0 | 0 | 60 | 0 |
2 | Bachelor's | region_19 | Sales & Marketing | m | sourcing | 1 | 34 | 3.0 | 7 | 0 | 0 | 50 | 0 |
3 | Bachelor's | region_23 | Sales & Marketing | m | other | 2 | 39 | 1.0 | 10 | 0 | 0 | 50 | 0 |
4 | Bachelor's | region_26 | Technology | m | other | 1 | 45 | 3.0 | 2 | 0 | 0 | 73 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
54803 | Bachelor's | region_14 | Technology | m | sourcing | 1 | 48 | 3.0 | 17 | 0 | 0 | 78 | 0 |
54804 | Master's & above | region_27 | Operations | f | other | 1 | 37 | 2.0 | 6 | 0 | 0 | 56 | 0 |
54805 | Bachelor's | region_1 | Analytics | m | other | 1 | 27 | 5.0 | 3 | 1 | 0 | 79 | 0 |
54806 | NaN | region_9 | Sales & Marketing | m | sourcing | 1 | 29 | 1.0 | 2 | 0 | 0 | 45 | 0 |
54807 | Bachelor's | region_22 | HR | m | other | 1 | 27 | 1.0 | 5 | 0 | 0 | 49 | 0 |
54808 rows × 13 columns