BasicImputer Example
In this example, we will explore how to perform imputation with the HR promotion dataset.
[1]:
import sys
sys.path.append('../../../notebooks')
import pandas as pd
import numpy as np
from raimitigations.dataprocessing import BasicImputer
from download import download_datasets
Handling a DataFrame with column names
[2]:
data_dir = '../../../datasets/'
download_datasets(data_dir)
dataset = pd.read_csv(data_dir + 'hr_promotion/train.csv')
dataset
[2]:
 | employee_id | department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | KPIs_met >80% | awards_won? | avg_training_score | is_promoted |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 65438 | Sales & Marketing | region_7 | Master's & above | f | sourcing | 1 | 35 | 5.0 | 8 | 1 | 0 | 49 | 0 |
1 | 65141 | Operations | region_22 | Bachelor's | m | other | 1 | 30 | 5.0 | 4 | 0 | 0 | 60 | 0 |
2 | 7513 | Sales & Marketing | region_19 | Bachelor's | m | sourcing | 1 | 34 | 3.0 | 7 | 0 | 0 | 50 | 0 |
3 | 2542 | Sales & Marketing | region_23 | Bachelor's | m | other | 2 | 39 | 1.0 | 10 | 0 | 0 | 50 | 0 |
4 | 48945 | Technology | region_26 | Bachelor's | m | other | 1 | 45 | 3.0 | 2 | 0 | 0 | 73 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
54803 | 3030 | Technology | region_14 | Bachelor's | m | sourcing | 1 | 48 | 3.0 | 17 | 0 | 0 | 78 | 0 |
54804 | 74592 | Operations | region_27 | Master's & above | f | other | 1 | 37 | 2.0 | 6 | 0 | 0 | 56 | 0 |
54805 | 13918 | Analytics | region_1 | Bachelor's | m | other | 1 | 27 | 5.0 | 3 | 1 | 0 | 79 | 0 |
54806 | 13614 | Sales & Marketing | region_9 | NaN | m | sourcing | 1 | 29 | 1.0 | 2 | 0 | 0 | 45 | 0 |
54807 | 51526 | HR | region_22 | Bachelor's | m | other | 1 | 27 | 1.0 | 5 | 0 | 0 | 49 | 0 |
54808 rows × 14 columns
[3]:
print(dataset.isna().any())
print(dataset['education'].unique())
print(dataset['previous_year_rating'].unique())
employee_id False
department False
region False
education True
gender False
recruitment_channel False
no_of_trainings False
age False
previous_year_rating True
length_of_service False
KPIs_met >80% False
awards_won? False
avg_training_score False
is_promoted False
dtype: bool
["Master's & above" "Bachelor's" nan 'Below Secondary']
[ 5. 3. 1. 4. nan 2.]
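If a count of the missing entries is more useful than the boolean flags above, plain pandas (not part of BasicImputer) can provide one:

missing_counts = dataset.isna().sum()  # number of NaNs per column
print(missing_counts[missing_counts > 0])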
As we can see above, both the education and previous_year_rating columns have missing values. Below, we do not specify which columns to impute (BasicImputer will determine this for us), but we do specify an imputation strategy (a constant fill value of -100) for previous_year_rating.
[4]:
imputer = BasicImputer(
df=dataset,
col_impute=None,
specific_col={'previous_year_rating': { 'missing_values':np.nan,
'strategy':'constant',
'fill_value': -100 } }
)
imputer.fit()
new_df = imputer.transform(dataset)
new_df
[4]:
 | employee_id | department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | KPIs_met >80% | awards_won? | avg_training_score | is_promoted |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 65438 | Sales & Marketing | region_7 | Master's & above | f | sourcing | 1 | 35 | 5.0 | 8 | 1 | 0 | 49 | 0 |
1 | 65141 | Operations | region_22 | Bachelor's | m | other | 1 | 30 | 5.0 | 4 | 0 | 0 | 60 | 0 |
2 | 7513 | Sales & Marketing | region_19 | Bachelor's | m | sourcing | 1 | 34 | 3.0 | 7 | 0 | 0 | 50 | 0 |
3 | 2542 | Sales & Marketing | region_23 | Bachelor's | m | other | 2 | 39 | 1.0 | 10 | 0 | 0 | 50 | 0 |
4 | 48945 | Technology | region_26 | Bachelor's | m | other | 1 | 45 | 3.0 | 2 | 0 | 0 | 73 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
54803 | 3030 | Technology | region_14 | Bachelor's | m | sourcing | 1 | 48 | 3.0 | 17 | 0 | 0 | 78 | 0 |
54804 | 74592 | Operations | region_27 | Master's & above | f | other | 1 | 37 | 2.0 | 6 | 0 | 0 | 56 | 0 |
54805 | 13918 | Analytics | region_1 | Bachelor's | m | other | 1 | 27 | 5.0 | 3 | 1 | 0 | 79 | 0 |
54806 | 13614 | Sales & Marketing | region_9 | NULL | m | sourcing | 1 | 29 | 1.0 | 2 | 0 | 0 | 45 | 0 |
54807 | 51526 | HR | region_22 | Bachelor's | m | other | 1 | 27 | 1.0 | 5 | 0 | 0 | 49 | 0 |
54808 rows × 14 columns
[5]:
print(new_df.isna().any())
print(new_df['education'].unique())
print(new_df['previous_year_rating'].unique())
employee_id False
department False
region False
education False
gender False
recruitment_channel False
no_of_trainings False
age False
previous_year_rating False
length_of_service False
KPIs_met >80% False
awards_won? False
avg_training_score False
is_promoted False
dtype: bool
["Master's & above" "Bachelor's" 'NULL' 'Below Secondary']
[ 5. 3. 1. 4. -100. 2.]
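Note that fit and transform are separate steps, so the imputer fitted above can be reused on any other DataFrame with the same columns, mirroring the scikit-learn transformer pattern. A minimal sketch (the holdout slice below is hypothetical, just for illustration):

holdout = dataset.sample(frac=0.2, random_state=0)  # hypothetical holdout slice
imputed_holdout = imputer.transform(holdout)        # reuses what was learned in fit()
print(imputed_holdout['previous_year_rating'].unique())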
We can also specify which columns to impute using either the column names or the column indices. In the following cell, we use the latter approach:
[6]:
imputer = BasicImputer(
df=dataset,
col_impute=[3, 8],
specific_col={8: { 'missing_values':np.nan,
'strategy':'constant',
'fill_value':-100 } }
)
imputer.fit()
new_df = imputer.transform(dataset)
new_df
[6]:
 | employee_id | department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | KPIs_met >80% | awards_won? | avg_training_score | is_promoted |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 65438 | Sales & Marketing | region_7 | Master's & above | f | sourcing | 1 | 35 | 5.0 | 8 | 1 | 0 | 49 | 0 |
1 | 65141 | Operations | region_22 | Bachelor's | m | other | 1 | 30 | 5.0 | 4 | 0 | 0 | 60 | 0 |
2 | 7513 | Sales & Marketing | region_19 | Bachelor's | m | sourcing | 1 | 34 | 3.0 | 7 | 0 | 0 | 50 | 0 |
3 | 2542 | Sales & Marketing | region_23 | Bachelor's | m | other | 2 | 39 | 1.0 | 10 | 0 | 0 | 50 | 0 |
4 | 48945 | Technology | region_26 | Bachelor's | m | other | 1 | 45 | 3.0 | 2 | 0 | 0 | 73 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
54803 | 3030 | Technology | region_14 | Bachelor's | m | sourcing | 1 | 48 | 3.0 | 17 | 0 | 0 | 78 | 0 |
54804 | 74592 | Operations | region_27 | Master's & above | f | other | 1 | 37 | 2.0 | 6 | 0 | 0 | 56 | 0 |
54805 | 13918 | Analytics | region_1 | Bachelor's | m | other | 1 | 27 | 5.0 | 3 | 1 | 0 | 79 | 0 |
54806 | 13614 | Sales & Marketing | region_9 | NULL | m | sourcing | 1 | 29 | 1.0 | 2 | 0 | 0 | 45 | 0 |
54807 | 51526 | HR | region_22 | Bachelor's | m | other | 1 | 27 | 1.0 | 5 | 0 | 0 | 49 | 0 |
54808 rows × 14 columns
[7]:
print(new_df.isna().any())
print(new_df['education'].unique())
print(new_df['previous_year_rating'].unique())
employee_id False
department False
region False
education False
gender False
recruitment_channel False
no_of_trainings False
age False
previous_year_rating False
length_of_service False
KPIs_met >80% False
awards_won? False
avg_training_score False
is_promoted False
dtype: bool
["Master's & above" "Bachelor's" 'NULL' 'Below Secondary']
[ 5. 3. 1. 4. -100. 2.]
Another approach is to specify a default imputation dictionary for all categorical features through the “categorical” parameter (or for all numerical features through “numerical”) instead of using the “specific_col” parameter. Both can be used at the same time, in which case “specific_col” takes priority over “categorical” and “numerical”: any column that appears in the “specific_col” dictionary follows its rule there, while the remaining columns fall back to the categorical or numerical defaults. The next two cells use each default on its own; a combined sketch follows them.
[8]:
imputer = BasicImputer(
categorical={ 'missing_values':np.nan,
'strategy':'constant',
'fill_value': 'null' }
)
imputer.fit(df=dataset)
new_df = imputer.transform(dataset)
print(new_df['education'].unique())
["Master's & above" "Bachelor's" 'null' 'Below Secondary']
[9]:
imputer = BasicImputer(
numerical={ 'missing_values':np.nan,
'strategy':'constant',
'fill_value': -5 }
)
imputer.fit(df=dataset)
new_df = imputer.transform(dataset)
print(new_df['previous_year_rating'].unique())
[ 5. 3. 1. 4. -5. 2.]
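As noted above, the “categorical”, “numerical”, and “specific_col” parameters can also be combined. In the following sketch (the combined_df name is ours, just for illustration), previous_year_rating follows its specific_col rule (-100) rather than the numerical default (-5), while education follows the categorical default ('null'):

imputer = BasicImputer(
		numerical={ 'missing_values':np.nan,
					'strategy':'constant',
					'fill_value': -5 },
		categorical={ 'missing_values':np.nan,
					'strategy':'constant',
					'fill_value': 'null' },
		specific_col={'previous_year_rating': { 'missing_values':np.nan,
					'strategy':'constant',
					'fill_value': -100 } }
)
imputer.fit(df=dataset)
combined_df = imputer.transform(dataset)
# specific_col wins for previous_year_rating; categorical handles education
print(combined_df['education'].unique())
print(combined_df['previous_year_rating'].unique())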
Now that the missing values have been filled, we can encode the categorical variables and fit a decision tree.
[10]:
from raimitigations.dataprocessing import EncoderOHE, EncoderOrdinal
encode = EncoderOrdinal(
df=new_df,
col_encode=None
)
encode.fit()
new_df = encode.transform(new_df)
new_df
No columns specified for encoding. These columns have been automatically identified as the following:
['department', 'region', 'education', 'gender', 'recruitment_channel']
[10]:
 | employee_id | department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | KPIs_met >80% | awards_won? | avg_training_score | is_promoted |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 65438 | 7 | 31 | 2 | 0 | 2 | 1 | 35 | 5.0 | 8 | 1 | 0 | 49 | 0 |
1 | 65141 | 4 | 14 | 0 | 1 | 0 | 1 | 30 | 5.0 | 4 | 0 | 0 | 60 | 0 |
2 | 7513 | 7 | 10 | 0 | 1 | 2 | 1 | 34 | 3.0 | 7 | 0 | 0 | 50 | 0 |
3 | 2542 | 7 | 15 | 0 | 1 | 0 | 2 | 39 | 1.0 | 10 | 0 | 0 | 50 | 0 |
4 | 48945 | 8 | 18 | 0 | 1 | 0 | 1 | 45 | 3.0 | 2 | 0 | 0 | 73 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
54803 | 3030 | 8 | 5 | 0 | 1 | 2 | 1 | 48 | 3.0 | 17 | 0 | 0 | 78 | 0 |
54804 | 74592 | 4 | 19 | 2 | 0 | 0 | 1 | 37 | 2.0 | 6 | 0 | 0 | 56 | 0 |
54805 | 13918 | 0 | 0 | 0 | 1 | 0 | 1 | 27 | 5.0 | 3 | 1 | 0 | 79 | 0 |
54806 | 13614 | 7 | 33 | 3 | 1 | 2 | 1 | 29 | 1.0 | 2 | 0 | 0 | 45 | 0 |
54807 | 51526 | 2 | 14 | 0 | 1 | 0 | 1 | 27 | 1.0 | 5 | 0 | 0 | 49 | 0 |
54808 rows × 14 columns
[11]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import BayesianRidge
# Alternative estimators to experiment with (note that BayesianRidge is a
# regressor, so it would need a different evaluation than the classification
# score used below):
#estimator = KNeighborsClassifier(n_neighbors=5)
estimator = DecisionTreeClassifier(max_features="sqrt", random_state=0)
#estimator = BayesianRidge()
X = new_df.drop(columns=['is_promoted', 'employee_id'])
Y = new_df['is_promoted']
train_X, test_X, train_y, test_y = train_test_split(X, Y, test_size=0.2, random_state=0, stratify=Y)
estimator.fit(train_X, train_y)
estimator.score(test_X, test_y)
[11]:
0.8881590950556468
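The EncoderOHE class imported above is an alternative to EncoderOrdinal. A sketch of swapping it in, assuming it shares the df/col_encode interface shown for EncoderOrdinal and that BasicImputer's defaults apply when no strategy is given; it must be applied to the imputed DataFrame before any ordinal encoding:

imputer = BasicImputer()              # default strategies for every column
imputer.fit(df=dataset)
imputed_df = imputer.transform(dataset)
encode_ohe = EncoderOHE(df=imputed_df, col_encode=None)
encode_ohe.fit()
ohe_df = encode_ohe.transform(imputed_df)  # one binary column per category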
Handling a DataFrame without column names
Even if the dataset has no column header, we can perform the same operations using column indices instead. The next few cells demonstrate how.
[12]:
dataset = pd.read_csv(data_dir + 'hr_promotion/train.csv', header=None, skiprows=1)
dataset.drop(columns=[0], inplace=True)
dataset
[12]:
 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | Sales & Marketing | region_7 | Master's & above | f | sourcing | 1 | 35 | 5.0 | 8 | 1 | 0 | 49 | 0 |
1 | Operations | region_22 | Bachelor's | m | other | 1 | 30 | 5.0 | 4 | 0 | 0 | 60 | 0 |
2 | Sales & Marketing | region_19 | Bachelor's | m | sourcing | 1 | 34 | 3.0 | 7 | 0 | 0 | 50 | 0 |
3 | Sales & Marketing | region_23 | Bachelor's | m | other | 2 | 39 | 1.0 | 10 | 0 | 0 | 50 | 0 |
4 | Technology | region_26 | Bachelor's | m | other | 1 | 45 | 3.0 | 2 | 0 | 0 | 73 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
54803 | Technology | region_14 | Bachelor's | m | sourcing | 1 | 48 | 3.0 | 17 | 0 | 0 | 78 | 0 |
54804 | Operations | region_27 | Master's & above | f | other | 1 | 37 | 2.0 | 6 | 0 | 0 | 56 | 0 |
54805 | Analytics | region_1 | Bachelor's | m | other | 1 | 27 | 5.0 | 3 | 1 | 0 | 79 | 0 |
54806 | Sales & Marketing | region_9 | NaN | m | sourcing | 1 | 29 | 1.0 | 2 | 0 | 0 | 45 | 0 |
54807 | HR | region_22 | Bachelor's | m | other | 1 | 27 | 1.0 | 5 | 0 | 0 | 49 | 0 |
54808 rows × 13 columns
[13]:
print(dataset.isna().any())
print(dataset.iloc[:,2].unique())
print(dataset.iloc[:,7].unique())
1 False
2 False
3 True
4 False
5 False
6 False
7 False
8 True
9 False
10 False
11 False
12 False
13 False
dtype: bool
["Master's & above" "Bachelor's" nan 'Below Secondary']
[ 5. 3. 1. 4. nan 2.]
[14]:
imputer = BasicImputer(
df=dataset,
col_impute=[2,7],
specific_col={7: { 'missing_values':np.nan,
'strategy':'constant',
'fill_value':-100 } }
)
imputer.fit()
new_df = imputer.transform(dataset)
new_df
[14]:
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | Sales & Marketing | region_7 | Master's & above | f | sourcing | 1 | 35 | 5.0 | 8 | 1 | 0 | 49 | 0 |
1 | Operations | region_22 | Bachelor's | m | other | 1 | 30 | 5.0 | 4 | 0 | 0 | 60 | 0 |
2 | Sales & Marketing | region_19 | Bachelor's | m | sourcing | 1 | 34 | 3.0 | 7 | 0 | 0 | 50 | 0 |
3 | Sales & Marketing | region_23 | Bachelor's | m | other | 2 | 39 | 1.0 | 10 | 0 | 0 | 50 | 0 |
4 | Technology | region_26 | Bachelor's | m | other | 1 | 45 | 3.0 | 2 | 0 | 0 | 73 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
54803 | Technology | region_14 | Bachelor's | m | sourcing | 1 | 48 | 3.0 | 17 | 0 | 0 | 78 | 0 |
54804 | Operations | region_27 | Master's & above | f | other | 1 | 37 | 2.0 | 6 | 0 | 0 | 56 | 0 |
54805 | Analytics | region_1 | Bachelor's | m | other | 1 | 27 | 5.0 | 3 | 1 | 0 | 79 | 0 |
54806 | Sales & Marketing | region_9 | NULL | m | sourcing | 1 | 29 | 1.0 | 2 | 0 | 0 | 45 | 0 |
54807 | HR | region_22 | Bachelor's | m | other | 1 | 27 | 1.0 | 5 | 0 | 0 | 49 | 0 |
54808 rows × 13 columns
[15]:
print(new_df.isna().any())
print(new_df.iloc[:,2].unique())
print(new_df.iloc[:,7].unique())
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
dtype: bool
["Master's & above" "Bachelor's" 'NULL' 'Below Secondary']
[ 5. 3. 1. 4. -100. 2.]
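If readable column names are needed downstream, they can simply be reassigned after the transform. This is plain pandas; the names below are taken from the first section, minus the dropped employee_id column:

new_df.columns = ['department', 'region', 'education', 'gender',
                  'recruitment_channel', 'no_of_trainings', 'age',
                  'previous_year_rating', 'length_of_service',
                  'KPIs_met >80%', 'awards_won?', 'avg_training_score',
                  'is_promoted']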