Encoding Examples

In this notebook, we will demonstrate the different ways the Encoder classes can be used.

[1]:

import sys
sys.path.append('../../../notebooks')

import pandas as pd
import numpy as np
from raimitigations.dataprocessing import EncoderOHE, EncoderOrdinal

from download import download_datasets

Ordinal Encoding

In Ordinal Encoding, an integer is given to each value of the categorical variable. This is typically used when the categorical variable is ordered (e.g. education, where High School < Bachelors < Masters). In this example, we will see the different ways EncoderOrdinal can be used.
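
To make the idea concrete before using EncoderOrdinal, here is a minimal sketch of ordinal encoding done with plain pandas. It is only an illustration of the concept, not how the library implements it; the toy `education` values and the chosen order are assumptions made for this example.

```python
import pandas as pd

# Toy data (hypothetical values, used only for illustration)
education = pd.Series(["Bachelors", "High School", "Masters", "Bachelors"])

# Declare the category order explicitly; integer codes follow this order
ordered = pd.Categorical(
    education,
    categories=["High School", "Bachelors", "Masters"],
    ordered=True,
)

# Each value is replaced by its position in the ordered category list
codes = pd.Series(ordered.codes)
print(codes.tolist())  # [1, 0, 2, 1]
```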

Using a dataset with headers

[2]:

data_dir = '../../../datasets/'
download_datasets(data_dir)
dataset = pd.read_csv(data_dir + 'hr_promotion/train.csv')
dataset

[2]:

	employee_id	department	region	education	gender	recruitment_channel	no_of_trainings	age	previous_year_rating	length_of_service	KPIs_met >80%	awards_won?	avg_training_score	is_promoted
0	65438	Sales & Marketing	region_7	Master's & above	f	sourcing	1	35	5.0	8	1	0	49	0
1	65141	Operations	region_22	Bachelor's	m	other	1	30	5.0	4	0	0	60	0
2	7513	Sales & Marketing	region_19	Bachelor's	m	sourcing	1	34	3.0	7	0	0	50	0
3	2542	Sales & Marketing	region_23	Bachelor's	m	other	2	39	1.0	10	0	0	50	0
4	48945	Technology	region_26	Bachelor's	m	other	1	45	3.0	2	0	0	73	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
54803	3030	Technology	region_14	Bachelor's	m	sourcing	1	48	3.0	17	0	0	78	0
54804	74592	Operations	region_27	Master's & above	f	other	1	37	2.0	6	0	0	56	0
54805	13918	Analytics	region_1	Bachelor's	m	other	1	27	5.0	3	1	0	79	0
54806	13614	Sales & Marketing	region_9	NaN	m	sourcing	1	29	1.0	2	0	0	45	0
54807	51526	HR	region_22	Bachelor's	m	other	1	27	1.0	5	0	0	49	0

54808 rows × 14 columns

Here we don’t specify which columns should be encoded (col_encode=None), so the class identifies the categorical columns automatically. Missing values, such as the NaN in the education column of row 54806, are encoded as -1.

[3]:

encode = EncoderOrdinal(
    df=dataset,
    col_encode=None
)
encode.fit()
new_df = encode.transform(dataset)
new_df

No columns specified for encoding. These columns have been automatically identfied as the following:
['department', 'region', 'education', 'gender', 'recruitment_channel']

[3]:

	employee_id	department	region	education	gender	recruitment_channel	no_of_trainings	age	previous_year_rating	length_of_service	KPIs_met >80%	awards_won?	avg_training_score	is_promoted
0	65438	7	31	2	0	2	1	35	5.0	8	1	0	49	0
1	65141	4	14	0	1	0	1	30	5.0	4	0	0	60	0
2	7513	7	10	0	1	2	1	34	3.0	7	0	0	50	0
3	2542	7	15	0	1	0	2	39	1.0	10	0	0	50	0
4	48945	8	18	0	1	0	1	45	3.0	2	0	0	73	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
54803	3030	8	5	0	1	2	1	48	3.0	17	0	0	78	0
54804	74592	4	19	2	0	0	1	37	2.0	6	0	0	56	0
54805	13918	0	0	0	1	0	1	27	5.0	3	1	0	79	0
54806	13614	7	33	-1	1	2	1	29	1.0	2	0	0	45	0
54807	51526	2	14	0	1	0	1	27	1.0	5	0	0	49	0

54808 rows × 14 columns

We can then get the mapping dictionary used by the encoder by calling the get_mapping() method:

[4]:

encode.get_mapping()

[4]:

{'department': {'values': ['Analytics',
   'Finance',
   'HR',
   'Legal',
   'Operations',
   'Procurement',
   'R&D',
   'Sales & Marketing',
   'Technology',
   'UNKNOWN'],
  'labels': [0, 1, 2, 3, 4, 5, 6, 7, 8, -1],
  'n_labels': 10},
 'region': {'values': ['region_1',
   'region_10',
   'region_11',
   'region_12',
   'region_13',
   'region_14',
   'region_15',
   'region_16',
   'region_17',
   'region_18',
   'region_19',
   'region_2',
   'region_20',
   'region_21',
   'region_22',
   'region_23',
   'region_24',
   'region_25',
   'region_26',
   'region_27',
   'region_28',
   'region_29',
   'region_3',
   'region_30',
   'region_31',
   'region_32',
   'region_33',
   'region_34',
   'region_4',
   'region_5',
   'region_6',
   'region_7',
   'region_8',
   'region_9',
   'UNKNOWN'],
  'labels': [0,
   1,
   2,
   3,
   4,
   5,
   6,
   7,
   8,
   9,
   10,
   11,
   12,
   13,
   14,
   15,
   16,
   17,
   18,
   19,
   20,
   21,
   22,
   23,
   24,
   25,
   26,
   27,
   28,
   29,
   30,
   31,
   32,
   33,
   -1],
  'n_labels': 35},
 'education': {'values': ["Bachelor's",
   'Below Secondary',
   "Master's & above",
   'UNKNOWN'],
  'labels': [0, 1, 2, -1],
  'n_labels': 4},
 'gender': {'values': ['f', 'm', 'UNKNOWN'],
  'labels': [0, 1, -1],
  'n_labels': 3},
 'recruitment_channel': {'values': ['other',
   'referred',
   'sourcing',
   'UNKNOWN'],
  'labels': [0, 1, 2, -1],
  'n_labels': 4}}
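
The dictionary pairs each entry in 'values' with the entry at the same position in 'labels'. As a small sketch (the helper `value_to_label` is hypothetical and not part of the library), a per-column lookup table can be built from that structure. The `mapping` literal below copies two entries from the output above so that the snippet runs on its own.

```python
# Two entries copied from the get_mapping() output shown above
mapping = {
    "education": {
        "values": ["Bachelor's", "Below Secondary", "Master's & above", "UNKNOWN"],
        "labels": [0, 1, 2, -1],
        "n_labels": 4,
    },
    "gender": {"values": ["f", "m", "UNKNOWN"], "labels": [0, 1, -1], "n_labels": 3},
}


def value_to_label(mapping, column):
    """Hypothetical helper: pair each original value with its integer label."""
    entry = mapping[column]
    return dict(zip(entry["values"], entry["labels"]))


lookup = value_to_label(mapping, "education")
print(lookup["Master's & above"])  # 2
print(lookup["UNKNOWN"])           # -1
```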

We can also provide the dataset only when calling the fit method, instead of providing it when instantiating the object.

[5]:

encode = EncoderOrdinal(col_encode=None)
encode.fit(df=dataset)
new_df = encode.transform(dataset)
new_df

No columns specified for encoding. These columns have been automatically identfied as the following:
['department', 'region', 'education', 'gender', 'recruitment_channel']

[5]:

	employee_id	department	region	education	gender	recruitment_channel	no_of_trainings	age	previous_year_rating	length_of_service	KPIs_met >80%	awards_won?	avg_training_score	is_promoted
0	65438	7	31	2	0	2	1	35	5.0	8	1	0	49	0
1	65141	4	14	0	1	0	1	30	5.0	4	0	0	60	0
2	7513	7	10	0	1	2	1	34	3.0	7	0	0	50	0
3	2542	7	15	0	1	0	2	39	1.0	10	0	0	50	0
4	48945	8	18	0	1	0	1	45	3.0	2	0	0	73	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
54803	3030	8	5	0	1	2	1	48	3.0	17	0	0	78	0
54804	74592	4	19	2	0	0	1	37	2.0	6	0	0	56	0
54805	13918	0	0	0	1	0	1	27	5.0	3	1	0	79	0
54806	13614	7	33	-1	1	2	1	29	1.0	2	0	0	45	0
54807	51526	2	14	0	1	0	1	27	1.0	5	0	0	49	0

54808 rows × 14 columns

We can also specify which columns to encode using the column names:

[6]:

encode = EncoderOrdinal(
    df=dataset,
    col_encode=["department", "region", "education"]
)
encode.fit()
new_df = encode.transform(dataset)
new_df

[6]:

	employee_id	department	region	education	gender	recruitment_channel	no_of_trainings	age	previous_year_rating	length_of_service	KPIs_met >80%	awards_won?	avg_training_score	is_promoted
0	65438	7	31	2	f	sourcing	1	35	5.0	8	1	0	49	0
1	65141	4	14	0	m	other	1	30	5.0	4	0	0	60	0
2	7513	7	10	0	m	sourcing	1	34	3.0	7	0	0	50	0
3	2542	7	15	0	m	other	2	39	1.0	10	0	0	50	0
4	48945	8	18	0	m	other	1	45	3.0	2	0	0	73	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
54803	3030	8	5	0	m	sourcing	1	48	3.0	17	0	0	78	0
54804	74592	4	19	2	f	other	1	37	2.0	6	0	0	56	0
54805	13918	0	0	0	m	other	1	27	5.0	3	1	0	79	0
54806	13614	7	33	-1	m	sourcing	1	29	1.0	2	0	0	45	0
54807	51526	2	14	0	m	other	1	27	1.0	5	0	0	49	0

54808 rows × 14 columns

or the column indices:

[7]:

encode = EncoderOrdinal(
    df=dataset,
    col_encode=[1, 2, 3]
)
encode.fit()
new_df = encode.transform(dataset)
new_df

[7]:

	employee_id	department	region	education	gender	recruitment_channel	no_of_trainings	age	previous_year_rating	length_of_service	KPIs_met >80%	awards_won?	avg_training_score	is_promoted
0	65438	7	31	2	f	sourcing	1	35	5.0	8	1	0	49	0
1	65141	4	14	0	m	other	1	30	5.0	4	0	0	60	0
2	7513	7	10	0	m	sourcing	1	34	3.0	7	0	0	50	0
3	2542	7	15	0	m	other	2	39	1.0	10	0	0	50	0
4	48945	8	18	0	m	other	1	45	3.0	2	0	0	73	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
54803	3030	8	5	0	m	sourcing	1	48	3.0	17	0	0	78	0
54804	74592	4	19	2	f	other	1	37	2.0	6	0	0	56	0
54805	13918	0	0	0	m	other	1	27	5.0	3	1	0	79	0
54806	13614	7	33	-1	m	sourcing	1	29	1.0	2	0	0	45	0
54807	51526	2	14	0	m	other	1	27	1.0	5	0	0	49	0

54808 rows × 14 columns

We can also specify the order of the encodings for the ordinal encoder. For example, let’s specify the order of the labels for the ‘education’ column.

[8]:

dataset['education'].unique()

[8]:

array(["Master's & above", "Bachelor's", nan, 'Below Secondary'],
      dtype=object)

[9]:

encode = EncoderOrdinal(
    df=dataset,
    col_encode=["education", "gender"],
    categories={'education': ["Below Secondary", "Bachelor's", "Master's & above"]}
)
encode.fit()
new_df = encode.transform(dataset)
new_df

[9]:

	employee_id	department	region	education	gender	recruitment_channel	no_of_trainings	age	previous_year_rating	length_of_service	KPIs_met >80%	awards_won?	avg_training_score	is_promoted
0	65438	Sales & Marketing	region_7	2	0	sourcing	1	35	5.0	8	1	0	49	0
1	65141	Operations	region_22	1	1	other	1	30	5.0	4	0	0	60	0
2	7513	Sales & Marketing	region_19	1	1	sourcing	1	34	3.0	7	0	0	50	0
3	2542	Sales & Marketing	region_23	1	1	other	2	39	1.0	10	0	0	50	0
4	48945	Technology	region_26	1	1	other	1	45	3.0	2	0	0	73	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
54803	3030	Technology	region_14	1	1	sourcing	1	48	3.0	17	0	0	78	0
54804	74592	Operations	region_27	2	0	other	1	37	2.0	6	0	0	56	0
54805	13918	Analytics	region_1	1	1	other	1	27	5.0	3	1	0	79	0
54806	13614	Sales & Marketing	region_9	-1	1	sourcing	1	29	1.0	2	0	0	45	0
54807	51526	HR	region_22	1	1	other	1	27	1.0	5	0	0	49	0

54808 rows × 14 columns

The get_encoded_columns() method returns the list of encoded columns. For ordinal encoding, this is the list passed to the col_encode parameter if one was provided, or the automatically identified list of categorical columns otherwise.

[10]:

encode.get_encoded_columns()

[10]:

['education', 'gender']

Finally, we can recover the original values by calling the inverse_transform() method. In the output below, the missing education value (encoded as -1) is shown as None:

[11]:

org_df = encode.inverse_transform(new_df)
org_df

[11]:

	employee_id	department	region	education	gender	recruitment_channel	no_of_trainings	age	previous_year_rating	length_of_service	KPIs_met >80%	awards_won?	avg_training_score	is_promoted
0	65438	Sales & Marketing	region_7	Master's & above	f	sourcing	1	35	5.0	8	1	0	49	0
1	65141	Operations	region_22	Bachelor's	m	other	1	30	5.0	4	0	0	60	0
2	7513	Sales & Marketing	region_19	Bachelor's	m	sourcing	1	34	3.0	7	0	0	50	0
3	2542	Sales & Marketing	region_23	Bachelor's	m	other	2	39	1.0	10	0	0	50	0
4	48945	Technology	region_26	Bachelor's	m	other	1	45	3.0	2	0	0	73	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
54803	3030	Technology	region_14	Bachelor's	m	sourcing	1	48	3.0	17	0	0	78	0
54804	74592	Operations	region_27	Master's & above	f	other	1	37	2.0	6	0	0	56	0
54805	13918	Analytics	region_1	Bachelor's	m	other	1	27	5.0	3	1	0	79	0
54806	13614	Sales & Marketing	region_9	None	m	sourcing	1	29	1.0	2	0	0	45	0
54807	51526	HR	region_22	Bachelor's	m	other	1	27	1.0	5	0	0	49	0

54808 rows × 14 columns
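
Conceptually, the inverse step maps each integer label back to its original value, with the -1 label having no original value to return to. The following is a minimal sketch of that idea in plain pandas; it is an assumption-laden illustration, not the library's implementation, and the `labels_to_values` dictionary is written by hand for this example.

```python
import pandas as pd

# Hand-written inverse of the custom order used above (illustrative only)
labels_to_values = {0: "Below Secondary", 1: "Bachelor's", 2: "Master's & above"}

encoded = pd.Series([2, 1, -1, 0])

# Labels without an entry in the dictionary (here -1) become missing values
decoded = encoded.map(labels_to_values)
print(decoded.isna().tolist())  # [False, False, True, False]
```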

Using a dataset without headers

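The same procedure can be performed with a dataset without headers. Here, column indices are used instead. In the encoded outputs below, the columns are relabeled 0 to 12, so indices 2 and 3 refer to the education and gender columns.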

[12]:

# Skip the header row so that the columns get integer labels
dataset = pd.read_csv(data_dir + 'hr_promotion/train.csv', header=None, skiprows=1)
# Drop the first column (employee_id)
dataset.drop(columns=[0], inplace=True)
dataset

[12]:

	1	2	3	4	5	6	7	8	9	10	11	12	13
0	Sales & Marketing	region_7	Master's & above	f	sourcing	1	35	5.0	8	1	0	49	0
1	Operations	region_22	Bachelor's	m	other	1	30	5.0	4	0	0	60	0
2	Sales & Marketing	region_19	Bachelor's	m	sourcing	1	34	3.0	7	0	0	50	0
3	Sales & Marketing	region_23	Bachelor's	m	other	2	39	1.0	10	0	0	50	0
4	Technology	region_26	Bachelor's	m	other	1	45	3.0	2	0	0	73	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...
54803	Technology	region_14	Bachelor's	m	sourcing	1	48	3.0	17	0	0	78	0
54804	Operations	region_27	Master's & above	f	other	1	37	2.0	6	0	0	56	0
54805	Analytics	region_1	Bachelor's	m	other	1	27	5.0	3	1	0	79	0
54806	Sales & Marketing	region_9	NaN	m	sourcing	1	29	1.0	2	0	0	45	0
54807	HR	region_22	Bachelor's	m	other	1	27	1.0	5	0	0	49	0

54808 rows × 13 columns
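
The effect of header=None and skiprows=1 can be checked on a tiny in-memory CSV. This sketch uses made-up rows (an assumption for illustration, not the real dataset) and shows that the remaining column labels start at 1 after column 0 is dropped, as in the output above.

```python
import io

import pandas as pd

# Tiny made-up CSV with a header row (illustrative data only)
csv_text = (
    "employee_id,department,education\n"
    "1,Sales,Bachelor's\n"
    "2,HR,Master's & above\n"
)

# header=None gives integer column labels; skiprows=1 discards the header line
raw = pd.read_csv(io.StringIO(csv_text), header=None, skiprows=1)
print(list(raw.columns))  # [0, 1, 2]

# Dropping column 0 leaves the labels 1 and 2 (labels are not renumbered)
no_id = raw.drop(columns=[0])
print(list(no_id.columns))  # [1, 2]
```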

[13]:

encode = EncoderOrdinal(
    df=dataset,
    col_encode=[3, 2],
    categories={2: ["Below Secondary", "Bachelor's", "Master's & above"]}
)
encode.fit()
new_df = encode.transform(dataset)
new_df

[13]:

	0	1	2	3	4	5	6	7	8	9	10	11	12
0	Sales & Marketing	region_7	2	0	sourcing	1	35	5.0	8	1	0	49	0
1	Operations	region_22	0	1	other	1	30	5.0	4	0	0	60	0
2	Sales & Marketing	region_19	0	1	sourcing	1	34	3.0	7	0	0	50	0
3	Sales & Marketing	region_23	0	1	other	2	39	1.0	10	0	0	50	0
4	Technology	region_26	0	1	other	1	45	3.0	2	0	0	73	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...
54803	Technology	region_14	0	1	sourcing	1	48	3.0	17	0	0	78	0
54804	Operations	region_27	2	0	other	1	37	2.0	6	0	0	56	0
54805	Analytics	region_1	0	1	other	1	27	5.0	3	1	0	79	0
54806	Sales & Marketing	region_9	-1	1	sourcing	1	29	1.0	2	0	0	45	0
54807	HR	region_22	0	1	other	1	27	1.0	5	0	0	49	0

54808 rows × 13 columns

[14]:

org_df = encode.inverse_transform(new_df)
org_df

[14]:

	0	1	2	3	4	5	6	7	8	9	10	11	12
0	Sales & Marketing	region_7	Master's & above	f	sourcing	1	35	5.0	8	1	0	49	0
1	Operations	region_22	Bachelor's	m	other	1	30	5.0	4	0	0	60	0
2	Sales & Marketing	region_19	Bachelor's	m	sourcing	1	34	3.0	7	0	0	50	0
3	Sales & Marketing	region_23	Bachelor's	m	other	2	39	1.0	10	0	0	50	0
4	Technology	region_26	Bachelor's	m	other	1	45	3.0	2	0	0	73	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...
54803	Technology	region_14	Bachelor's	m	sourcing	1	48	3.0	17	0	0	78	0
54804	Operations	region_27	Master's & above	f	other	1	37	2.0	6	0	0	56	0
54805	Analytics	region_1	Bachelor's	m	other	1	27	5.0	3	1	0	79	0
54806	Sales & Marketing	region_9	None	m	sourcing	1	29	1.0	2	0	0	45	0
54807	HR	region_22	Bachelor's	m	other	1	27	1.0	5	0	0	49	0

54808 rows × 13 columns

One-Hot Encoding

The following cell shows an example using the one-hot encoding class.

[15]:

dataset = pd.read_csv(data_dir + 'hr_promotion/train.csv')

encode = EncoderOHE(col_encode=None)
encode.fit(df=dataset)
new_df = encode.transform(dataset)
new_df

No columns specified for encoding. These columns have been automatically identfied as the following:
['department', 'region', 'education', 'gender', 'recruitment_channel']

[15]:

	employee_id	no_of_trainings	age	previous_year_rating	length_of_service	KPIs_met >80%	awards_won?	avg_training_score	is_promoted	department_Finance	...	region_region_6	region_region_7	region_region_8	region_region_9	education_Below Secondary	education_Master's & above	education_nan	gender_m	recruitment_channel_referred	recruitment_channel_sourcing
0	65438	1	35	5.0	8	1	0	49	0	0	...	0	1	0	0	0	1	0	0	0	1
1	65141	1	30	5.0	4	0	0	60	0	0	...	0	0	0	0	0	0	0	1	0	0
2	7513	1	34	3.0	7	0	0	50	0	0	...	0	0	0	0	0	0	0	1	0	1
3	2542	2	39	1.0	10	0	0	50	0	0	...	0	0	0	0	0	0	0	1	0	0
4	48945	1	45	3.0	2	0	0	73	0	0	...	0	0	0	0	0	0	0	1	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
54803	3030	1	48	3.0	17	0	0	78	0	0	...	0	0	0	0	0	0	0	1	0	1
54804	74592	1	37	2.0	6	0	0	56	0	0	...	0	0	0	0	0	1	0	0	0	0
54805	13918	1	27	5.0	3	1	0	79	0	0	...	0	0	0	0	0	0	0	1	0	0
54806	13614	1	29	1.0	2	0	0	45	0	0	...	0	0	0	1	0	0	1	1	0	1
54807	51526	1	27	1.0	5	0	0	49	0	0	...	0	0	0	0	0	0	0	1	0	0

54808 rows × 56 columns

When using EncoderOHE, the get_encoded_columns() method returns the list of original columns that were encoded. These columns are no longer present in the transformed dataset, because each one was replaced by one or more one-hot encoded columns (one for each value in the column, except the value that is dropped by default).

[16]:

encode.get_encoded_columns()

[16]:

['department', 'region', 'education', 'gender', 'recruitment_channel']

To get the list of all new columns created with the one-hot encodings, use the get_one_hot_columns() method:

[17]:

encode.get_one_hot_columns()

[17]:

['department_Finance',
 'department_HR',
 'department_Legal',
 'department_Operations',
 'department_Procurement',
 'department_R&D',
 'department_Sales & Marketing',
 'department_Technology',
 'region_region_10',
 'region_region_11',
 'region_region_12',
 'region_region_13',
 'region_region_14',
 'region_region_15',
 'region_region_16',
 'region_region_17',
 'region_region_18',
 'region_region_19',
 'region_region_2',
 'region_region_20',
 'region_region_21',
 'region_region_22',
 'region_region_23',
 'region_region_24',
 'region_region_25',
 'region_region_26',
 'region_region_27',
 'region_region_28',
 'region_region_29',
 'region_region_3',
 'region_region_30',
 'region_region_31',
 'region_region_32',
 'region_region_33',
 'region_region_34',
 'region_region_4',
 'region_region_5',
 'region_region_6',
 'region_region_7',
 'region_region_8',
 'region_region_9',
 'education_Below Secondary',
 "education_Master's & above",
 'education_nan',
 'gender_m',
 'recruitment_channel_referred',
 'recruitment_channel_sourcing']
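
Judging from the output above, the new column names follow a `<column>_<value>` pattern. Under that assumption, the one-hot columns can be grouped by the original column they came from. The sketch below uses a shortened copy of the two lists so that it runs on its own; the prefix matching would need more care if one column name were a prefix of another.

```python
# Shortened copies of the get_encoded_columns() and get_one_hot_columns() outputs
encoded_columns = ["education", "gender", "recruitment_channel"]
one_hot_columns = [
    "education_Below Secondary",
    "education_Master's & above",
    "education_nan",
    "gender_m",
    "recruitment_channel_referred",
    "recruitment_channel_sourcing",
]

# Group the one-hot columns by the original column name used as prefix
groups = {
    col: [name for name in one_hot_columns if name.startswith(col + "_")]
    for col in encoded_columns
}
print(groups["gender"])          # ['gender_m']
print(len(groups["education"]))  # 3
```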

We can also specify which columns to encode using the column names:

[18]:

encode = EncoderOHE(col_encode=['education'])
encode.fit(df=dataset)
new_df = encode.transform(dataset)
new_df

[18]:

	employee_id	department	region	gender	recruitment_channel	no_of_trainings	age	previous_year_rating	length_of_service	KPIs_met >80%	awards_won?	avg_training_score	is_promoted	education_Below Secondary	education_Master's & above	education_nan
0	65438	Sales & Marketing	region_7	f	sourcing	1	35	5.0	8	1	0	49	0	0	1	0
1	65141	Operations	region_22	m	other	1	30	5.0	4	0	0	60	0	0	0	0
2	7513	Sales & Marketing	region_19	m	sourcing	1	34	3.0	7	0	0	50	0	0	0	0
3	2542	Sales & Marketing	region_23	m	other	2	39	1.0	10	0	0	50	0	0	0	0
4	48945	Technology	region_26	m	other	1	45	3.0	2	0	0	73	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
54803	3030	Technology	region_14	m	sourcing	1	48	3.0	17	0	0	78	0	0	0	0
54804	74592	Operations	region_27	f	other	1	37	2.0	6	0	0	56	0	0	1	0
54805	13918	Analytics	region_1	m	other	1	27	5.0	3	1	0	79	0	0	0	0
54806	13614	Sales & Marketing	region_9	m	sourcing	1	29	1.0	2	0	0	45	0	0	0	1
54807	51526	HR	region_22	m	other	1	27	1.0	5	0	0	49	0	0	0	0

54808 rows × 16 columns

or the column indices:

[19]:

encode = EncoderOHE(col_encode=[3])
encode.fit(df=dataset)
new_df = encode.transform(dataset)
new_df

[19]:

	employee_id	department	region	gender	recruitment_channel	no_of_trainings	age	previous_year_rating	length_of_service	KPIs_met >80%	awards_won?	avg_training_score	is_promoted	education_Below Secondary	education_Master's & above	education_nan
0	65438	Sales & Marketing	region_7	f	sourcing	1	35	5.0	8	1	0	49	0	0	1	0
1	65141	Operations	region_22	m	other	1	30	5.0	4	0	0	60	0	0	0	0
2	7513	Sales & Marketing	region_19	m	sourcing	1	34	3.0	7	0	0	50	0	0	0	0
3	2542	Sales & Marketing	region_23	m	other	2	39	1.0	10	0	0	50	0	0	0	0
4	48945	Technology	region_26	m	other	1	45	3.0	2	0	0	73	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
54803	3030	Technology	region_14	m	sourcing	1	48	3.0	17	0	0	78	0	0	0	0
54804	74592	Operations	region_27	f	other	1	37	2.0	6	0	0	56	0	0	1	0
54805	13918	Analytics	region_1	m	other	1	27	5.0	3	1	0	79	0	0	0	0
54806	13614	Sales & Marketing	region_9	m	sourcing	1	29	1.0	2	0	0	45	0	0	0	1
54807	51526	HR	region_22	m	other	1	27	1.0	5	0	0	49	0	0	0	0

54808 rows × 16 columns

We can also revert the encoded columns to their original state by using the inverse_transform() method. Note that the restored column is placed first in the output.

[20]:

org_df = encode.inverse_transform(new_df)
org_df

[20]:

	education	employee_id	department	region	gender	recruitment_channel	no_of_trainings	age	previous_year_rating	length_of_service	KPIs_met >80%	awards_won?	avg_training_score	is_promoted
0	Master's & above	65438	Sales & Marketing	region_7	f	sourcing	1	35	5.0	8	1	0	49	0
1	Bachelor's	65141	Operations	region_22	m	other	1	30	5.0	4	0	0	60	0
2	Bachelor's	7513	Sales & Marketing	region_19	m	sourcing	1	34	3.0	7	0	0	50	0
3	Bachelor's	2542	Sales & Marketing	region_23	m	other	2	39	1.0	10	0	0	50	0
4	Bachelor's	48945	Technology	region_26	m	other	1	45	3.0	2	0	0	73	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
54803	Bachelor's	3030	Technology	region_14	m	sourcing	1	48	3.0	17	0	0	78	0
54804	Master's & above	74592	Operations	region_27	f	other	1	37	2.0	6	0	0	56	0
54805	Bachelor's	13918	Analytics	region_1	m	other	1	27	5.0	3	1	0	79	0
54806	NaN	13614	Sales & Marketing	region_9	m	sourcing	1	29	1.0	2	0	0	45	0
54807	Bachelor's	51526	HR	region_22	m	other	1	27	1.0	5	0	0	49	0

54808 rows × 14 columns

Let us now test the ‘drop’ parameter. We repeat the encoding of the education column, this time with drop = False (drop = True is the default behavior). The output below has one extra column, education_Bachelor’s, which is absent from the earlier outputs produced with drop = True because it was dropped.

[21]:

encode = EncoderOHE(col_encode=['education'], drop=False)
encode.fit(df=dataset)
new_df = encode.transform(dataset)
new_df

[21]:

	employee_id	department	region	gender	recruitment_channel	no_of_trainings	age	previous_year_rating	length_of_service	KPIs_met >80%	awards_won?	avg_training_score	is_promoted	education_Bachelor's	education_Below Secondary	education_Master's & above	education_nan
0	65438	Sales & Marketing	region_7	f	sourcing	1	35	5.0	8	1	0	49	0	0	0	1	0
1	65141	Operations	region_22	m	other	1	30	5.0	4	0	0	60	0	1	0	0	0
2	7513	Sales & Marketing	region_19	m	sourcing	1	34	3.0	7	0	0	50	0	1	0	0	0
3	2542	Sales & Marketing	region_23	m	other	2	39	1.0	10	0	0	50	0	1	0	0	0
4	48945	Technology	region_26	m	other	1	45	3.0	2	0	0	73	0	1	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
54803	3030	Technology	region_14	m	sourcing	1	48	3.0	17	0	0	78	0	1	0	0	0
54804	74592	Operations	region_27	f	other	1	37	2.0	6	0	0	56	0	0	0	1	0
54805	13918	Analytics	region_1	m	other	1	27	5.0	3	1	0	79	0	1	0	0	0
54806	13614	Sales & Marketing	region_9	m	sourcing	1	29	1.0	2	0	0	45	0	0	0	0	1
54807	51526	HR	region_22	m	other	1	27	1.0	5	0	0	49	0	1	0	0	0

54808 rows × 17 columns
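
An analogous effect can be reproduced with plain pandas, which may help when reasoning about which column disappears. This is a sketch using pd.get_dummies on toy values, not the EncoderOHE implementation; `drop_first=True` here plays a role similar to drop = True.

```python
import pandas as pd

# Toy column with one missing value (illustrative data only)
education = pd.Series(
    ["Master's & above", "Bachelor's", None, "Below Secondary"],
    name="education",
)

# Keep every indicator column, including one for missing values
full = pd.get_dummies(education, prefix="education", dummy_na=True)

# Drop the first category's indicator column to avoid a redundant column
reduced = pd.get_dummies(education, prefix="education", dummy_na=True, drop_first=True)

dropped = [c for c in full.columns if c not in reduced.columns]
print(dropped)  # ["education_Bachelor's"]
```

The dropped indicator carries no extra information: a row with zeros in all remaining indicator columns must belong to the dropped category.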

[22]:

org_df = encode.inverse_transform(new_df)
org_df

[22]:

	education	employee_id	department	region	gender	recruitment_channel	no_of_trainings	age	previous_year_rating	length_of_service	KPIs_met >80%	awards_won?	avg_training_score	is_promoted
0	Master's & above	65438	Sales & Marketing	region_7	f	sourcing	1	35	5.0	8	1	0	49	0
1	Bachelor's	65141	Operations	region_22	m	other	1	30	5.0	4	0	0	60	0
2	Bachelor's	7513	Sales & Marketing	region_19	m	sourcing	1	34	3.0	7	0	0	50	0
3	Bachelor's	2542	Sales & Marketing	region_23	m	other	2	39	1.0	10	0	0	50	0
4	Bachelor's	48945	Technology	region_26	m	other	1	45	3.0	2	0	0	73	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
54803	Bachelor's	3030	Technology	region_14	m	sourcing	1	48	3.0	17	0	0	78	0
54804	Master's & above	74592	Operations	region_27	f	other	1	37	2.0	6	0	0	56	0
54805	Bachelor's	13918	Analytics	region_1	m	other	1	27	5.0	3	1	0	79	0
54806	NaN	13614	Sales & Marketing	region_9	m	sourcing	1	29	1.0	2	0	0	45	0
54807	Bachelor's	51526	HR	region_22	m	other	1	27	1.0	5	0	0	49	0

54808 rows × 14 columns

Using a dataset without headers

[23]:

# Skip the header row so that the columns get integer labels
dataset = pd.read_csv(data_dir + 'hr_promotion/train.csv', header=None, skiprows=1)
# Drop the first column (employee_id)
dataset.drop(columns=[0], inplace=True)
dataset

[23]:

	1	2	3	4	5	6	7	8	9	10	11	12	13
0	Sales & Marketing	region_7	Master's & above	f	sourcing	1	35	5.0	8	1	0	49	0
1	Operations	region_22	Bachelor's	m	other	1	30	5.0	4	0	0	60	0
2	Sales & Marketing	region_19	Bachelor's	m	sourcing	1	34	3.0	7	0	0	50	0
3	Sales & Marketing	region_23	Bachelor's	m	other	2	39	1.0	10	0	0	50	0
4	Technology	region_26	Bachelor's	m	other	1	45	3.0	2	0	0	73	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...
54803	Technology	region_14	Bachelor's	m	sourcing	1	48	3.0	17	0	0	78	0
54804	Operations	region_27	Master's & above	f	other	1	37	2.0	6	0	0	56	0
54805	Analytics	region_1	Bachelor's	m	other	1	27	5.0	3	1	0	79	0
54806	Sales & Marketing	region_9	NaN	m	sourcing	1	29	1.0	2	0	0	45	0
54807	HR	region_22	Bachelor's	m	other	1	27	1.0	5	0	0	49	0

54808 rows × 13 columns

[24]:

encode = EncoderOHE(col_encode=[2, 1])
encode.fit(df=dataset)
new_df = encode.transform(dataset)
new_df

[24]:

	0	3	4	5	6	7	8	9	10	11	...	1_region_31	1_region_32	1_region_33	1_region_34	1_region_4	1_region_5	1_region_6	1_region_7	1_region_8	1_region_9
0	Sales & Marketing	f	sourcing	1	35	5.0	8	1	0	49	...	0	0	0	0	0	0	0	1	0	0
1	Operations	m	other	1	30	5.0	4	0	0	60	...	0	0	0	0	0	0	0	0	0	0
2	Sales & Marketing	m	sourcing	1	34	3.0	7	0	0	50	...	0	0	0	0	0	0	0	0	0	0
3	Sales & Marketing	m	other	2	39	1.0	10	0	0	50	...	0	0	0	0	0	0	0	0	0	0
4	Technology	m	other	1	45	3.0	2	0	0	73	...	0	0	0	0	0	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
54803	Technology	m	sourcing	1	48	3.0	17	0	0	78	...	0	0	0	0	0	0	0	0	0	0
54804	Operations	f	other	1	37	2.0	6	0	0	56	...	0	0	0	0	0	0	0	0	0	0
54805	Analytics	m	other	1	27	5.0	3	1	0	79	...	0	0	0	0	0	0	0	0	0	0
54806	Sales & Marketing	m	sourcing	1	29	1.0	2	0	0	45	...	0	0	0	0	0	0	0	0	0	1
54807	HR	m	other	1	27	1.0	5	0	0	49	...	0	0	0	0	0	0	0	0	0	0

54808 rows × 47 columns

[25]:

org_df = encode.inverse_transform(new_df)
org_df

[25]:

	2	1	0	3	4	5	6	7	8	9	10	11	12
0	Master's & above	region_7	Sales & Marketing	f	sourcing	1	35	5.0	8	1	0	49	0
1	Bachelor's	region_22	Operations	m	other	1	30	5.0	4	0	0	60	0
2	Bachelor's	region_19	Sales & Marketing	m	sourcing	1	34	3.0	7	0	0	50	0
3	Bachelor's	region_23	Sales & Marketing	m	other	2	39	1.0	10	0	0	50	0
4	Bachelor's	region_26	Technology	m	other	1	45	3.0	2	0	0	73	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...
54803	Bachelor's	region_14	Technology	m	sourcing	1	48	3.0	17	0	0	78	0
54804	Master's & above	region_27	Operations	f	other	1	37	2.0	6	0	0	56	0
54805	Bachelor's	region_1	Analytics	m	other	1	27	5.0	3	1	0	79	0
54806	NaN	region_9	Sales & Marketing	m	sourcing	1	29	1.0	2	0	0	45	0
54807	Bachelor's	region_22	HR	m	other	1	27	1.0	5	0	0	49	0

54808 rows × 13 columns