BasicImputer Example

In this example, we will explore how to perform imputation with the HR promotion dataset.

[1]:
import sys
sys.path.append('../../../notebooks')

import pandas as pd
import numpy as np
from raimitigations.dataprocessing import BasicImputer

from download import download_datasets

Handling a DataFrame with column names

[2]:
data_dir = '../../../datasets/'
download_datasets(data_dir)
dataset = pd.read_csv(data_dir + 'hr_promotion/train.csv')
dataset
[2]:
employee_id department region education gender recruitment_channel no_of_trainings age previous_year_rating length_of_service KPIs_met >80% awards_won? avg_training_score is_promoted
0 65438 Sales & Marketing region_7 Master's & above f sourcing 1 35 5.0 8 1 0 49 0
1 65141 Operations region_22 Bachelor's m other 1 30 5.0 4 0 0 60 0
2 7513 Sales & Marketing region_19 Bachelor's m sourcing 1 34 3.0 7 0 0 50 0
3 2542 Sales & Marketing region_23 Bachelor's m other 2 39 1.0 10 0 0 50 0
4 48945 Technology region_26 Bachelor's m other 1 45 3.0 2 0 0 73 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
54803 3030 Technology region_14 Bachelor's m sourcing 1 48 3.0 17 0 0 78 0
54804 74592 Operations region_27 Master's & above f other 1 37 2.0 6 0 0 56 0
54805 13918 Analytics region_1 Bachelor's m other 1 27 5.0 3 1 0 79 0
54806 13614 Sales & Marketing region_9 NaN m sourcing 1 29 1.0 2 0 0 45 0
54807 51526 HR region_22 Bachelor's m other 1 27 1.0 5 0 0 49 0

54808 rows × 14 columns

[3]:
print(dataset.isna().any())
print(dataset['education'].unique())
print(dataset['previous_year_rating'].unique())
employee_id             False
department              False
region                  False
education                True
gender                  False
recruitment_channel     False
no_of_trainings         False
age                     False
previous_year_rating     True
length_of_service       False
KPIs_met >80%           False
awards_won?             False
avg_training_score      False
is_promoted             False
dtype: bool
["Master's & above" "Bachelor's" nan 'Below Secondary']
[ 5.  3.  1.  4. nan  2.]

As we can see above, both the education and previous_year_rating columns have missing values. Below, we do not specify which columns to impute (BasicImputer will determine this for us), but we do specify an imputation strategy for previous_year_rating: fill missing values with the constant -100.

[4]:
imputer = BasicImputer(
    df=dataset,
    col_impute=None,
    specific_col={'previous_year_rating': {'missing_values': np.nan,
                                           'strategy': 'constant',
                                           'fill_value': -100}}
)
imputer.fit()
new_df = imputer.transform(dataset)
new_df
[4]:
employee_id department region education gender recruitment_channel no_of_trainings age previous_year_rating length_of_service KPIs_met >80% awards_won? avg_training_score is_promoted
0 65438 Sales & Marketing region_7 Master's & above f sourcing 1 35 5.0 8 1 0 49 0
1 65141 Operations region_22 Bachelor's m other 1 30 5.0 4 0 0 60 0
2 7513 Sales & Marketing region_19 Bachelor's m sourcing 1 34 3.0 7 0 0 50 0
3 2542 Sales & Marketing region_23 Bachelor's m other 2 39 1.0 10 0 0 50 0
4 48945 Technology region_26 Bachelor's m other 1 45 3.0 2 0 0 73 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
54803 3030 Technology region_14 Bachelor's m sourcing 1 48 3.0 17 0 0 78 0
54804 74592 Operations region_27 Master's & above f other 1 37 2.0 6 0 0 56 0
54805 13918 Analytics region_1 Bachelor's m other 1 27 5.0 3 1 0 79 0
54806 13614 Sales & Marketing region_9 NULL m sourcing 1 29 1.0 2 0 0 45 0
54807 51526 HR region_22 Bachelor's m other 1 27 1.0 5 0 0 49 0

54808 rows × 14 columns

[5]:
print(new_df.isna().any())
print(new_df['education'].unique())
print(new_df['previous_year_rating'].unique())
employee_id             False
department              False
region                  False
education               False
gender                  False
recruitment_channel     False
no_of_trainings         False
age                     False
previous_year_rating    False
length_of_service       False
KPIs_met >80%           False
awards_won?             False
avg_training_score      False
is_promoted             False
dtype: bool
["Master's & above" "Bachelor's" 'NULL' 'Below Secondary']
[   5.    3.    1.    4. -100.    2.]

We can also specify which columns to impute using either the column names or the column indices. A name-based sketch is shown first for comparison; the executed cell that follows uses the column indices.
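This is a minimal sketch of the name-based variant (not executed in this notebook), reusing the same fill rule for previous_year_rating as in cell [4]:

# Sketch only: restrict imputation to two columns identified by name.
imputer = BasicImputer(
    df=dataset,
    col_impute=['education', 'previous_year_rating'],
    specific_col={'previous_year_rating': {'missing_values': np.nan,
                                           'strategy': 'constant',
                                           'fill_value': -100}}
)
imputer.fit()
name_based_df = imputer.transform(dataset)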

[6]:
imputer = BasicImputer(
    df=dataset,
    col_impute=[3, 8],
    specific_col={8: {'missing_values': np.nan,
                      'strategy': 'constant',
                      'fill_value': -100}}
)
imputer.fit()
new_df = imputer.transform(dataset)
new_df
[6]:
employee_id department region education gender recruitment_channel no_of_trainings age previous_year_rating length_of_service KPIs_met >80% awards_won? avg_training_score is_promoted
0 65438 Sales & Marketing region_7 Master's & above f sourcing 1 35 5.0 8 1 0 49 0
1 65141 Operations region_22 Bachelor's m other 1 30 5.0 4 0 0 60 0
2 7513 Sales & Marketing region_19 Bachelor's m sourcing 1 34 3.0 7 0 0 50 0
3 2542 Sales & Marketing region_23 Bachelor's m other 2 39 1.0 10 0 0 50 0
4 48945 Technology region_26 Bachelor's m other 1 45 3.0 2 0 0 73 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
54803 3030 Technology region_14 Bachelor's m sourcing 1 48 3.0 17 0 0 78 0
54804 74592 Operations region_27 Master's & above f other 1 37 2.0 6 0 0 56 0
54805 13918 Analytics region_1 Bachelor's m other 1 27 5.0 3 1 0 79 0
54806 13614 Sales & Marketing region_9 NULL m sourcing 1 29 1.0 2 0 0 45 0
54807 51526 HR region_22 Bachelor's m other 1 27 1.0 5 0 0 49 0

54808 rows × 14 columns

[7]:
print(new_df.isna().any())
print(new_df['education'].unique())
print(new_df['previous_year_rating'].unique())
employee_id             False
department              False
region                  False
education               False
gender                  False
recruitment_channel     False
no_of_trainings         False
age                     False
previous_year_rating    False
length_of_service       False
KPIs_met >80%           False
awards_won?             False
avg_training_score      False
is_promoted             False
dtype: bool
["Master's & above" "Bachelor's" 'NULL' 'Below Secondary']
[   5.    3.    1.    4. -100.    2.]

Another approach is to specify a single imputation dictionary for all categorical (or all numerical) features instead of using the “specific_col” parameter. Both can also be used at the same time, in which case “specific_col” takes priority over the “categorical” and “numerical” parameters: any column listed in the “specific_col” dictionary follows its own rule there, while the “categorical” and “numerical” defaults apply to the remaining columns.

[8]:
imputer = BasicImputer(
    categorical={'missing_values': np.nan,
                 'strategy': 'constant',
                 'fill_value': 'null'}
)
imputer.fit(df=dataset)
new_df = imputer.transform(dataset)

print(new_df['education'].unique())
["Master's & above" "Bachelor's" 'null' 'Below Secondary']
[9]:
imputer = BasicImputer(
    numerical={'missing_values': np.nan,
               'strategy': 'constant',
               'fill_value': -5}
)
imputer.fit(df=dataset)
new_df = imputer.transform(dataset)

print(new_df['previous_year_rating'].unique())
[ 5.  3.  1.  4. -5.  2.]
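To illustrate the priority rule described above, the following sketch (not executed here) passes all three parameters at once: the “categorical” and “numerical” dictionaries act as defaults, while the “specific_col” entry overrides the numerical rule for previous_year_rating:

# Sketch only: specific_col takes priority over the categorical/numerical defaults.
imputer_combined = BasicImputer(
    categorical={'missing_values': np.nan,   # default rule for categorical columns
                 'strategy': 'constant',
                 'fill_value': 'null'},
    numerical={'missing_values': np.nan,     # default rule for numerical columns
               'strategy': 'constant',
               'fill_value': -5},
    specific_col={'previous_year_rating': {'missing_values': np.nan,
                                           'strategy': 'constant',
                                           'fill_value': -100}}
)
imputer_combined.fit(df=dataset)
combined_df = imputer_combined.transform(dataset)
# Expected result: education is filled with 'null', while previous_year_rating
# receives -100 from its specific_col rule rather than the numerical default of -5.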

Now that the missing values have been filled, we can encode the categorical features and fit a decision tree. We will use EncoderOrdinal in the next cell; the EncoderOHE class imported alongside it offers a one-hot alternative, sketched briefly below.
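The sketch below (not executed here) assumes EncoderOHE follows the same df / col_encode interface as EncoderOrdinal; check the library documentation for its exact signature:

# Sketch only: one-hot encode the categorical columns instead of ordinal-encoding them.
# Assumes EncoderOHE mirrors EncoderOrdinal's df/col_encode interface.
from raimitigations.dataprocessing import EncoderOHE

encode_ohe = EncoderOHE(
    df=new_df,          # the imputed DataFrame, before any other encoding
    col_encode=None     # let the encoder identify the categorical columns
)
encode_ohe.fit()
ohe_df = encode_ohe.transform(new_df)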

[10]:
from raimitigations.dataprocessing import EncoderOHE, EncoderOrdinal

encode = EncoderOrdinal(
    df=new_df,
    col_encode=None
)
encode.fit()
new_df = encode.transform(new_df)
new_df
No columns specified for encoding. These columns have been automatically identfied as the following:
['department', 'region', 'education', 'gender', 'recruitment_channel']
[10]:
employee_id department region education gender recruitment_channel no_of_trainings age previous_year_rating length_of_service KPIs_met >80% awards_won? avg_training_score is_promoted
0 65438 7 31 2 0 2 1 35 5.0 8 1 0 49 0
1 65141 4 14 0 1 0 1 30 5.0 4 0 0 60 0
2 7513 7 10 0 1 2 1 34 3.0 7 0 0 50 0
3 2542 7 15 0 1 0 2 39 1.0 10 0 0 50 0
4 48945 8 18 0 1 0 1 45 3.0 2 0 0 73 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
54803 3030 8 5 0 1 2 1 48 3.0 17 0 0 78 0
54804 74592 4 19 2 0 0 1 37 2.0 6 0 0 56 0
54805 13918 0 0 0 1 0 1 27 5.0 3 1 0 79 0
54806 13614 7 33 3 1 2 1 29 1.0 2 0 0 45 0
54807 51526 2 14 0 1 0 1 27 1.0 5 0 0 49 0

54808 rows × 14 columns

[11]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import BayesianRidge

#estimator = KNeighborsClassifier(n_neighbors=5)
estimator = DecisionTreeClassifier(max_features="sqrt", random_state=0)
#estimator = BayesianRidge()
X = new_df.drop(columns=['is_promoted', 'employee_id'])
Y = new_df['is_promoted']

train_X, test_X, train_y, test_y = train_test_split(X, Y, test_size=0.2, random_state=0, stratify=Y)

estimator.fit(train_X, train_y)
estimator.score(test_X, test_y)
[11]:
0.8881590950556468
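As a side note on the fit/transform pattern used throughout this notebook: in a full training pipeline the imputer would usually be fitted on the training rows only and then reused to transform both splits, so that the imputation is learned without looking at the test data. A minimal sketch with the raw dataset (not executed here):

# Sketch only: fit the imputer on the training rows and reuse it on the test rows.
train_raw, test_raw = train_test_split(dataset, test_size=0.2, random_state=0)

imputer_split = BasicImputer(
    specific_col={'previous_year_rating': {'missing_values': np.nan,
                                           'strategy': 'constant',
                                           'fill_value': -100}}
)
imputer_split.fit(df=train_raw)

train_imputed = imputer_split.transform(train_raw)
test_imputed = imputer_split.transform(test_raw)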

Handling a DataFrame without column names

Even if the dataset has no column headers, we can perform the same operations by referring to columns by their indices. The next few cells demonstrate how to do this.

[12]:
dataset = pd.read_csv(data_dir + 'hr_promotion/train.csv', header=None, skiprows=1)
dataset.drop(columns=[0], inplace=True)
dataset
[12]:
1 2 3 4 5 6 7 8 9 10 11 12 13
0 Sales & Marketing region_7 Master's & above f sourcing 1 35 5.0 8 1 0 49 0
1 Operations region_22 Bachelor's m other 1 30 5.0 4 0 0 60 0
2 Sales & Marketing region_19 Bachelor's m sourcing 1 34 3.0 7 0 0 50 0
3 Sales & Marketing region_23 Bachelor's m other 2 39 1.0 10 0 0 50 0
4 Technology region_26 Bachelor's m other 1 45 3.0 2 0 0 73 0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
54803 Technology region_14 Bachelor's m sourcing 1 48 3.0 17 0 0 78 0
54804 Operations region_27 Master's & above f other 1 37 2.0 6 0 0 56 0
54805 Analytics region_1 Bachelor's m other 1 27 5.0 3 1 0 79 0
54806 Sales & Marketing region_9 NaN m sourcing 1 29 1.0 2 0 0 45 0
54807 HR region_22 Bachelor's m other 1 27 1.0 5 0 0 49 0

54808 rows × 13 columns

[13]:
print(dataset.isna().any())
print(dataset.iloc[:,2].unique())
print(dataset.iloc[:,7].unique())
1     False
2     False
3      True
4     False
5     False
6     False
7     False
8      True
9     False
10    False
11    False
12    False
13    False
dtype: bool
["Master's & above" "Bachelor's" nan 'Below Secondary']
[ 5.  3.  1.  4. nan  2.]
[14]:
imputer = BasicImputer(
    df=dataset,
    col_impute=[2, 7],
    specific_col={7: {'missing_values': np.nan,
                      'strategy': 'constant',
                      'fill_value': -100}}
)
imputer.fit()
new_df = imputer.transform(dataset)
new_df
[14]:
0 1 2 3 4 5 6 7 8 9 10 11 12
0 Sales & Marketing region_7 Master's & above f sourcing 1 35 5.0 8 1 0 49 0
1 Operations region_22 Bachelor's m other 1 30 5.0 4 0 0 60 0
2 Sales & Marketing region_19 Bachelor's m sourcing 1 34 3.0 7 0 0 50 0
3 Sales & Marketing region_23 Bachelor's m other 2 39 1.0 10 0 0 50 0
4 Technology region_26 Bachelor's m other 1 45 3.0 2 0 0 73 0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
54803 Technology region_14 Bachelor's m sourcing 1 48 3.0 17 0 0 78 0
54804 Operations region_27 Master's & above f other 1 37 2.0 6 0 0 56 0
54805 Analytics region_1 Bachelor's m other 1 27 5.0 3 1 0 79 0
54806 Sales & Marketing region_9 NULL m sourcing 1 29 1.0 2 0 0 45 0
54807 HR region_22 Bachelor's m other 1 27 1.0 5 0 0 49 0

54808 rows × 13 columns

[15]:
print(new_df.isna().any())
print(new_df.iloc[:,2].unique())
print(new_df.iloc[:,7].unique())
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
dtype: bool
["Master's & above" "Bachelor's" 'NULL' 'Below Secondary']
[   5.    3.    1.    4. -100.    2.]