BasicImputer Example

In this example, we will explore how to perform imputation with the HR promotion dataset.

[1]:
import sys
sys.path.append('../../../notebooks')

import pandas as pd
import numpy as np
from raimitigations.dataprocessing import BasicImputer

from download import download_datasets

Handling a DataFrame with column names

[2]:
data_dir = '../../../datasets/'
download_datasets(data_dir)
dataset = pd.read_csv(data_dir + 'hr_promotion/train.csv')
dataset
[2]:
employee_id department region education gender recruitment_channel no_of_trainings age previous_year_rating length_of_service KPIs_met >80% awards_won? avg_training_score is_promoted
0 65438 Sales & Marketing region_7 Master's & above f sourcing 1 35 5.0 8 1 0 49 0
1 65141 Operations region_22 Bachelor's m other 1 30 5.0 4 0 0 60 0
2 7513 Sales & Marketing region_19 Bachelor's m sourcing 1 34 3.0 7 0 0 50 0
3 2542 Sales & Marketing region_23 Bachelor's m other 2 39 1.0 10 0 0 50 0
4 48945 Technology region_26 Bachelor's m other 1 45 3.0 2 0 0 73 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
54803 3030 Technology region_14 Bachelor's m sourcing 1 48 3.0 17 0 0 78 0
54804 74592 Operations region_27 Master's & above f other 1 37 2.0 6 0 0 56 0
54805 13918 Analytics region_1 Bachelor's m other 1 27 5.0 3 1 0 79 0
54806 13614 Sales & Marketing region_9 NaN m sourcing 1 29 1.0 2 0 0 45 0
54807 51526 HR region_22 Bachelor's m other 1 27 1.0 5 0 0 49 0

54808 rows × 14 columns

[3]:
print(dataset.isna().any())
print(dataset['education'].unique())
print(dataset['previous_year_rating'].unique())
employee_id             False
department              False
region                  False
education                True
gender                  False
recruitment_channel     False
no_of_trainings         False
age                     False
previous_year_rating     True
length_of_service       False
KPIs_met >80%           False
awards_won?             False
avg_training_score      False
is_promoted             False
dtype: bool
["Master's & above" "Bachelor's" nan 'Below Secondary']
[ 5.  3.  1.  4. nan  2.]

As we can see above, both the education and previous_year_rating columns have missing values. Below, we do not specify which columns to impute (BasicImputer will determine this for us), but we do specify an imputation strategy for previous_year_rating: fill missing values with the constant -100.

[4]:
imputer = BasicImputer(
    df=dataset,
    col_impute=None,
    specific_col={'previous_year_rating': {'missing_values': np.nan,
                                           'strategy': 'constant',
                                           'fill_value': -100}}
)
imputer.fit()
new_df = imputer.transform(dataset)
new_df
[4]:
employee_id department region education gender recruitment_channel no_of_trainings age previous_year_rating length_of_service KPIs_met >80% awards_won? avg_training_score is_promoted
0 65438 Sales & Marketing region_7 Master's & above f sourcing 1 35 5.0 8 1 0 49 0
1 65141 Operations region_22 Bachelor's m other 1 30 5.0 4 0 0 60 0
2 7513 Sales & Marketing region_19 Bachelor's m sourcing 1 34 3.0 7 0 0 50 0
3 2542 Sales & Marketing region_23 Bachelor's m other 2 39 1.0 10 0 0 50 0
4 48945 Technology region_26 Bachelor's m other 1 45 3.0 2 0 0 73 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
54803 3030 Technology region_14 Bachelor's m sourcing 1 48 3.0 17 0 0 78 0
54804 74592 Operations region_27 Master's & above f other 1 37 2.0 6 0 0 56 0
54805 13918 Analytics region_1 Bachelor's m other 1 27 5.0 3 1 0 79 0
54806 13614 Sales & Marketing region_9 NULL m sourcing 1 29 1.0 2 0 0 45 0
54807 51526 HR region_22 Bachelor's m other 1 27 1.0 5 0 0 49 0

54808 rows × 14 columns

[5]:
print(new_df.isna().any())
print(new_df['education'].unique())
print(new_df['previous_year_rating'].unique())
employee_id             False
department              False
region                  False
education               False
gender                  False
recruitment_channel     False
no_of_trainings         False
age                     False
previous_year_rating    False
length_of_service       False
KPIs_met >80%           False
awards_won?             False
avg_training_score      False
is_promoted             False
dtype: bool
["Master's & above" "Bachelor's" 'NULL' 'Below Secondary']
[   5.    3.    1.    4. -100.    2.]

We can also specify which columns to impute using either the column names or the column indices. A name-based sketch is shown first for comparison; the executed cell that follows uses the column indices.
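This is a minimal sketch of the name-based variant (not executed in this notebook), reusing the same fill rule for previous_year_rating as in cell [4]:

# Sketch only: restrict imputation to two columns identified by name.
imputer = BasicImputer(
    df=dataset,
    col_impute=['education', 'previous_year_rating'],
    specific_col={'previous_year_rating': {'missing_values': np.nan,
                                           'strategy': 'constant',
                                           'fill_value': -100}}
)
imputer.fit()
name_based_df = imputer.transform(dataset)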

[6]:
imputer = BasicImputer(
    df=dataset,
    col_impute=[3, 8],
    specific_col={8: {'missing_values': np.nan,
                      'strategy': 'constant',
                      'fill_value': -100}}
)
imputer.fit()
new_df = imputer.transform(dataset)
new_df
[6]:
employee_id department region education gender recruitment_channel no_of_trainings age previous_year_rating length_of_service KPIs_met >80% awards_won? avg_training_score is_promoted
0 65438 Sales & Marketing region_7 Master's & above f sourcing 1 35 5.0 8 1 0 49 0
1 65141 Operations region_22 Bachelor's m other 1 30 5.0 4 0 0 60 0
2 7513 Sales & Marketing region_19 Bachelor's m sourcing 1 34 3.0 7 0 0 50 0
3 2542 Sales & Marketing region_23 Bachelor's m other 2 39 1.0 10 0 0 50 0
4 48945 Technology region_26 Bachelor's m other 1 45 3.0 2 0 0 73 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
54803 3030 Technology region_14 Bachelor's m sourcing 1 48 3.0 17 0 0 78 0
54804 74592 Operations region_27 Master's & above f other 1 37 2.0 6 0 0 56 0
54805 13918 Analytics region_1 Bachelor's m other 1 27 5.0 3 1 0 79 0
54806 13614 Sales & Marketing region_9 NULL m sourcing 1 29 1.0 2 0 0 45 0
54807 51526 HR region_22 Bachelor's m other 1 27 1.0 5 0 0 49 0

54808 rows × 14 columns

[7]:
print(new_df.isna().any())
print(new_df['education'].unique())
print(new_df['previous_year_rating'].unique())
employee_id             False
department              False
region                  False
education               False
gender                  False
recruitment_channel     False
no_of_trainings         False
age                     False
previous_year_rating    False
length_of_service       False
KPIs_met >80%           False
awards_won?             False
avg_training_score      False
is_promoted             False
dtype: bool
["Master's & above" "Bachelor's" 'NULL' 'Below Secondary']
[   5.    3.    1.    4. -100.    2.]

Another approach is to specify a single imputation dictionary for all categorical (or all numerical) features instead of using the “specific_col” parameter. Both can also be used at the same time, in which case “specific_col” takes priority over the “categorical” and “numerical” parameters: any column listed in the “specific_col” dictionary follows its own rule there, while the “categorical” and “numerical” defaults apply to the remaining columns.

[8]:
imputer = BasicImputer(
    categorical={'missing_values': np.nan,
                 'strategy': 'constant',
                 'fill_value': 'null'}
)
imputer.fit(df=dataset)
new_df = imputer.transform(dataset)

print(new_df['education'].unique())
["Master's & above" "Bachelor's" 'null' 'Below Secondary']
[9]:
imputer = BasicImputer(
    numerical={'missing_values': np.nan,
               'strategy': 'constant',
               'fill_value': -5}
)
imputer.fit(df=dataset)
new_df = imputer.transform(dataset)

print(new_df['previous_year_rating'].unique())
[ 5.  3.  1.  4. -5.  2.]
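To illustrate the priority rule described above, the following sketch (not executed here) passes all three parameters at once: the “categorical” and “numerical” dictionaries act as defaults, while the “specific_col” entry overrides the numerical rule for previous_year_rating:

# Sketch only: specific_col takes priority over the categorical/numerical defaults.
imputer_combined = BasicImputer(
    categorical={'missing_values': np.nan,   # default rule for categorical columns
                 'strategy': 'constant',
                 'fill_value': 'null'},
    numerical={'missing_values': np.nan,     # default rule for numerical columns
               'strategy': 'constant',
               'fill_value': -5},
    specific_col={'previous_year_rating': {'missing_values': np.nan,
                                           'strategy': 'constant',
                                           'fill_value': -100}}
)
imputer_combined.fit(df=dataset)
combined_df = imputer_combined.transform(dataset)
# Expected result: education is filled with 'null', while previous_year_rating
# receives -100 from its specific_col rule rather than the numerical default of -5.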

Now that the missing values have been filled, we can encode the categorical features and fit a decision tree. We will use EncoderOrdinal in the next cell; the EncoderOHE class imported alongside it offers a one-hot alternative, sketched briefly below.
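The sketch below (not executed here) assumes EncoderOHE follows the same df / col_encode interface as EncoderOrdinal; check the library documentation for its exact signature:

# Sketch only: one-hot encode the categorical columns instead of ordinal-encoding them.
# Assumes EncoderOHE mirrors EncoderOrdinal's df/col_encode interface.
from raimitigations.dataprocessing import EncoderOHE

encode_ohe = EncoderOHE(
    df=new_df,          # the imputed DataFrame, before any other encoding
    col_encode=None     # let the encoder identify the categorical columns
)
encode_ohe.fit()
ohe_df = encode_ohe.transform(new_df)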

[10]:
from raimitigations.dataprocessing import EncoderOHE, EncoderOrdinal

encode = EncoderOrdinal(
    df=new_df,
    col_encode=None
)
encode.fit()
new_df = encode.transform(new_df)
new_df
No columns specified for encoding. These columns have been automatically identfied as the following:
['department', 'region', 'education', 'gender', 'recruitment_channel']
[10]:
employee_id department region education gender recruitment_channel no_of_trainings age previous_year_rating length_of_service KPIs_met >80% awards_won? avg_training_score is_promoted
0 65438 7 31 2 0 2 1 35 5.0 8 1 0 49 0
1 65141 4 14 0 1 0 1 30 5.0 4 0 0 60 0
2 7513 7 10 0 1 2 1 34 3.0 7 0 0 50 0
3 2542 7 15 0 1 0 2 39 1.0 10 0 0 50 0
4 48945 8 18 0 1 0 1 45 3.0 2 0 0 73 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
54803 3030 8 5 0 1 2 1 48 3.0 17 0 0 78 0
54804 74592 4 19 2 0 0 1 37 2.0 6 0 0 56 0
54805 13918 0 0 0 1 0 1 27 5.0 3 1 0 79 0
54806 13614 7 33 3 1 2 1 29 1.0 2 0 0 45 0
54807 51526 2 14 0 1 0 1 27 1.0 5 0 0 49 0

54808 rows × 14 columns

[11]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import BayesianRidge

#estimator = KNeighborsClassifier(n_neighbors=5)
estimator = DecisionTreeClassifier(max_features="sqrt", random_state=0)
#estimator = BayesianRidge()
X = new_df.drop(columns=['is_promoted', 'employee_id'])
Y = new_df['is_promoted']

train_X, test_X, train_y, test_y = train_test_split(X, Y, test_size=0.2, random_state=0, stratify=Y)

estimator.fit(train_X, train_y)
estimator.score(test_X, test_y)
[11]:
0.8881590950556468
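As a side note on the fit/transform pattern used throughout this notebook: in a full training pipeline the imputer would usually be fitted on the training rows only and then reused to transform both splits, so that the imputation is learned without looking at the test data. A minimal sketch with the raw dataset (not executed here):

# Sketch only: fit the imputer on the training rows and reuse it on the test rows.
train_raw, test_raw = train_test_split(dataset, test_size=0.2, random_state=0)

imputer_split = BasicImputer(
    specific_col={'previous_year_rating': {'missing_values': np.nan,
                                           'strategy': 'constant',
                                           'fill_value': -100}}
)
imputer_split.fit(df=train_raw)

train_imputed = imputer_split.transform(train_raw)
test_imputed = imputer_split.transform(test_raw)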

Handling a DataFrame without column names

Even if the dataset has no column headers, we can perform the same operations by referring to columns by their indices. The next few cells demonstrate how to do this.

[12]:
dataset = pd.read_csv(data_dir + 'hr_promotion/train.csv', header=None, skiprows=1)
dataset.drop(columns=[0], inplace=True)
dataset
[12]:
1 2 3 4 5 6 7 8 9 10 11 12 13
0 Sales & Marketing region_7 Master's & above f sourcing 1 35 5.0 8 1 0 49 0
1 Operations region_22 Bachelor's m other 1 30 5.0 4 0 0 60 0
2 Sales & Marketing region_19 Bachelor's m sourcing 1 34 3.0 7 0 0 50 0
3 Sales & Marketing region_23 Bachelor's m other 2 39 1.0 10 0 0 50 0
4 Technology region_26 Bachelor's m other 1 45 3.0 2 0 0 73 0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
54803 Technology region_14 Bachelor's m sourcing 1 48 3.0 17 0 0 78 0
54804 Operations region_27 Master's & above f other 1 37 2.0 6 0 0 56 0
54805 Analytics region_1 Bachelor's m other 1 27 5.0 3 1 0 79 0
54806 Sales & Marketing region_9 NaN m sourcing 1 29 1.0 2 0 0 45 0
54807 HR region_22 Bachelor's m other 1 27 1.0 5 0 0 49 0

54808 rows × 13 columns

[13]:
print(dataset.isna().any())
print(dataset.iloc[:,2].unique())
print(dataset.iloc[:,7].unique())
1     False
2     False
3      True
4     False
5     False
6     False
7     False
8      True
9     False
10    False
11    False
12    False
13    False
dtype: bool
["Master's & above" "Bachelor's" nan 'Below Secondary']
[ 5.  3.  1.  4. nan  2.]
[14]:
imputer = BasicImputer(
    df=dataset,
    col_impute=[2, 7],
    specific_col={7: {'missing_values': np.nan,
                      'strategy': 'constant',
                      'fill_value': -100}}
)
imputer.fit()
new_df = imputer.transform(dataset)
new_df
[14]:
0 1 2 3 4 5 6 7 8 9 10 11 12
0 Sales & Marketing region_7 Master's & above f sourcing 1 35 5.0 8 1 0 49 0
1 Operations region_22 Bachelor's m other 1 30 5.0 4 0 0 60 0
2 Sales & Marketing region_19 Bachelor's m sourcing 1 34 3.0 7 0 0 50 0
3 Sales & Marketing region_23 Bachelor's m other 2 39 1.0 10 0 0 50 0
4 Technology region_26 Bachelor's m other 1 45 3.0 2 0 0 73 0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
54803 Technology region_14 Bachelor's m sourcing 1 48 3.0 17 0 0 78 0
54804 Operations region_27 Master's & above f other 1 37 2.0 6 0 0 56 0
54805 Analytics region_1 Bachelor's m other 1 27 5.0 3 1 0 79 0
54806 Sales & Marketing region_9 NULL m sourcing 1 29 1.0 2 0 0 45 0
54807 HR region_22 Bachelor's m other 1 27 1.0 5 0 0 49 0

54808 rows × 13 columns

[15]:
print(new_df.isna().any())
print(new_df.iloc[:,2].unique())
print(new_df.iloc[:,7].unique())
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
dtype: bool
["Master's & above" "Bachelor's" 'NULL' 'Below Secondary']
[   5.    3.    1.    4. -100.    2.]