KNNDataImputer Example

In this example, we will explore how to perform imputation with the HR promotion dataset using KNN imputation.

This method imputes a feature’s missing values using the k-nearest neighbours found in the dataset: each missing value is replaced with the mean value of that feature across the k nearest neighbors. Two samples are considered close if the features that neither sample is missing are close.

This subclass uses the sklearn.impute.KNNImputer class from sklearn in the background.
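To see the mechanism on its own, here is a minimal sketch using sklearn.impute.KNNImputer directly on a toy matrix (the toy values are invented for illustration). The missing entry in the second row is filled with the mean of the first feature across the two nearest rows, where distance is computed on the features both samples have (the nan_euclidean metric):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix: the second sample is missing its first feature.
X = np.array([
    [1.0, 2.0],
    [np.nan, 4.0],
    [3.0, 6.0],
    [5.0, 8.0],
])

# With k=2, distances to the incomplete row are measured on the shared,
# non-missing feature (the second column), so its two nearest neighbors
# are the first and third rows. The missing value becomes the mean of
# their first-feature values: (1.0 + 3.0) / 2 = 2.0.
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```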

[1]:
import sys
sys.path.append('../../../notebooks')

import pandas as pd
import numpy as np
from raimitigations.dataprocessing import KNNDataImputer

from download import download_datasets

Handling a DataFrame with column names

[2]:
data_dir = '../../../datasets/'
download_datasets(data_dir)
dataset = pd.read_csv(data_dir + 'hr_promotion/train.csv')
dataset = dataset[:10000]

dataset
[2]:
employee_id department region education gender recruitment_channel no_of_trainings age previous_year_rating length_of_service KPIs_met >80% awards_won? avg_training_score is_promoted
0 65438 Sales & Marketing region_7 Master's & above f sourcing 1 35 5.0 8 1 0 49 0
1 65141 Operations region_22 Bachelor's m other 1 30 5.0 4 0 0 60 0
2 7513 Sales & Marketing region_19 Bachelor's m sourcing 1 34 3.0 7 0 0 50 0
3 2542 Sales & Marketing region_23 Bachelor's m other 2 39 1.0 10 0 0 50 0
4 48945 Technology region_26 Bachelor's m other 1 45 3.0 2 0 0 73 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9995 14934 Procurement region_13 Master's & above f other 1 37 4.0 7 1 0 71 0
9996 22040 Sales & Marketing region_33 Master's & above m sourcing 1 39 3.0 7 0 0 48 0
9997 14188 Finance region_13 Master's & above f sourcing 1 33 4.0 4 1 0 58 0
9998 73566 Operations region_28 Master's & above m other 1 32 4.0 4 1 0 57 1
9999 21372 Procurement region_13 Bachelor's f sourcing 1 32 3.0 6 0 0 71 0

10000 rows × 14 columns

[3]:
print(dataset.isna().any())
print(dataset['education'].unique())
print(dataset['previous_year_rating'].unique())
employee_id             False
department              False
region                  False
education                True
gender                  False
recruitment_channel     False
no_of_trainings         False
age                     False
previous_year_rating     True
length_of_service       False
KPIs_met >80%           False
awards_won?             False
avg_training_score      False
is_promoted             False
dtype: bool
["Master's & above" "Bachelor's" nan 'Below Secondary']
[ 5.  3.  1.  4. nan  2.]

Without Enabling Encoding

As we can see above, both the education and previous_year_rating columns have missing values.

However, note that the dataset includes categorical columns such as education, while sklearn’s sklearn.impute.KNNImputer can only handle numerical data. Fortunately, KNNDataImputer can handle categorical data through the boolean parameter enable_encoder.

First, let’s use the default value enable_encoder=False, in which case categorical data is excluded from the imputation process. We can also use col_impute to specify that only previous_year_rating should be imputed.

Additionally, we can specify sklearn.impute.KNNImputer’s parameters using the knn_params dictionary, as long as it follows the format below (if nothing is passed to this parameter, default values are used).

[4]:
imputer = KNNDataImputer(
    df=dataset,
    col_impute=['previous_year_rating'],
    enable_encoder=False,
    knn_params={
        "missing_values": np.nan,
        "n_neighbors": 4,
        "weights": "uniform",
        "metric": "nan_euclidean",
        "copy": True,
    }
)
imputer.fit()
new_df = imputer.transform(dataset)
new_df


WARNING: Categorical columns will be excluded from the iterative imputation process.
If you'd like to include these columns, you need to set 'enable_encoder'=True.
If you'd like to use a different type of encoding before imputation, consider using the Pipeline class and call your own encoder before calling this subclass for imputation.
[4]:
employee_id no_of_trainings age previous_year_rating length_of_service KPIs_met >80% awards_won? avg_training_score is_promoted department education recruitment_channel region gender
0 65438.0 1.0 35.0 5.0 8.0 1.0 0.0 49.0 0.0 Sales & Marketing Master's & above sourcing region_7 f
1 65141.0 1.0 30.0 5.0 4.0 0.0 0.0 60.0 0.0 Operations Bachelor's other region_22 m
2 7513.0 1.0 34.0 3.0 7.0 0.0 0.0 50.0 0.0 Sales & Marketing Bachelor's sourcing region_19 m
3 2542.0 2.0 39.0 1.0 10.0 0.0 0.0 50.0 0.0 Sales & Marketing Bachelor's other region_23 m
4 48945.0 1.0 45.0 3.0 2.0 0.0 0.0 73.0 0.0 Technology Bachelor's other region_26 m
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9995 14934.0 1.0 37.0 4.0 7.0 1.0 0.0 71.0 0.0 Procurement Master's & above other region_13 f
9996 22040.0 1.0 39.0 3.0 7.0 0.0 0.0 48.0 0.0 Sales & Marketing Master's & above sourcing region_33 m
9997 14188.0 1.0 33.0 4.0 4.0 1.0 0.0 58.0 0.0 Finance Master's & above sourcing region_13 f
9998 73566.0 1.0 32.0 4.0 4.0 1.0 0.0 57.0 1.0 Operations Master's & above other region_28 m
9999 21372.0 1.0 32.0 3.0 6.0 0.0 0.0 71.0 0.0 Procurement Bachelor's sourcing region_13 f

10000 rows × 14 columns

[5]:
print(new_df.isna().any())
employee_id             False
no_of_trainings         False
age                     False
previous_year_rating    False
length_of_service       False
KPIs_met >80%           False
awards_won?             False
avg_training_score      False
is_promoted             False
department              False
education                True
recruitment_channel     False
region                  False
gender                  False
dtype: bool

With Enabling Encoding

Now if we use enable_encoder=True, categorical data can also be included in the imputation.

Using the original dataset, we know that both education and previous_year_rating have missing values. Below, we don’t specify which columns to impute (KNNDataImputer will determine this for us).

The enable_encoder parameter allows the class to apply ordinal encoding internally before imputation; however, it uses the default ‘auto’ option for the encoder’s categories parameter. We recommend this internal encoding when you don’t need to specify ordinal semantics or a category order. Otherwise, consider using the Pipeline class to apply custom encoding before calling the KNNDataImputer class for imputation.

[6]:
imputer = KNNDataImputer(
    df=dataset,
    col_impute=None,
    enable_encoder=True,
    knn_params=None
)
imputer.fit()
new_df = imputer.transform(dataset)
new_df


WARNING: 'enable_encoder'=True and categorical columns will be encoded using ordinal encoding before applying the iterative imputation process.
If you'd like to use a different type of encoding before imputation, consider using the Pipeline class and call your own encoder before calling this subclass for imputation.
No columns specified for imputation. These columns have been automatically identified at transform time:
['education', 'previous_year_rating']

WARNING: Note that encoded columns are not guaranteed to reverse transform if they have imputed values.
If you'd like to use a different type of encoding before imputation, consider using the Pipeline class and call your own encoder before calling this subclass.

Imputed categorical columns' reverse encoding transformation complete.
[6]:
employee_id department region education gender recruitment_channel no_of_trainings age previous_year_rating length_of_service KPIs_met >80% awards_won? avg_training_score is_promoted
0 65438.0 Sales & Marketing region_7 Master's & above f sourcing 1.0 35.0 5.0 8.0 1.0 0.0 49.0 0.0
1 65141.0 Operations region_22 Bachelor's m other 1.0 30.0 5.0 4.0 0.0 0.0 60.0 0.0
2 7513.0 Sales & Marketing region_19 Bachelor's m sourcing 1.0 34.0 3.0 7.0 0.0 0.0 50.0 0.0
3 2542.0 Sales & Marketing region_23 Bachelor's m other 2.0 39.0 1.0 10.0 0.0 0.0 50.0 0.0
4 48945.0 Technology region_26 Bachelor's m other 1.0 45.0 3.0 2.0 0.0 0.0 73.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9995 14934.0 Procurement region_13 Master's & above f other 1.0 37.0 4.0 7.0 1.0 0.0 71.0 0.0
9996 22040.0 Sales & Marketing region_33 Master's & above m sourcing 1.0 39.0 3.0 7.0 0.0 0.0 48.0 0.0
9997 14188.0 Finance region_13 Master's & above f sourcing 1.0 33.0 4.0 4.0 1.0 0.0 58.0 0.0
9998 73566.0 Operations region_28 Master's & above m other 1.0 32.0 4.0 4.0 1.0 0.0 57.0 1.0
9999 21372.0 Procurement region_13 Bachelor's f sourcing 1.0 32.0 3.0 6.0 0.0 0.0 71.0 0.0

10000 rows × 14 columns

Note that the encoded education column isn’t always guaranteed to be reverse transformed after imputation, since it now includes new imputed values that the encoder can’t always map back to the original categorical data.
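To see why, consider that KNN imputation averages neighbors’ ordinal codes, so an imputed code may be fractional (e.g. 1.6) and match no category. A common workaround (a sketch only; not necessarily what KNNDataImputer does internally) is to round and clip imputed codes to the valid range before mapping them back:

```python
import numpy as np

# Categories in their encoded order (from the HR dataset's education column).
categories = ["Below Secondary", "Bachelor's", "Master's & above"]

# Hypothetical codes after imputation: 1.6 and 2.4 are averaged values
# produced by the imputer and map to no category as-is.
imputed_codes = np.array([0.0, 1.6, 2.0, 2.4])

# Round to the nearest code and clip to the valid range, then look up labels.
codes = np.clip(np.round(imputed_codes), 0, len(categories) - 1).astype(int)
labels = [categories[c] for c in codes]
print(labels)
```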

[7]:
print(new_df.isna().any())
employee_id             False
department              False
region                  False
education               False
gender                  False
recruitment_channel     False
no_of_trainings         False
age                     False
previous_year_rating    False
length_of_service       False
KPIs_met >80%           False
awards_won?             False
avg_training_score      False
is_promoted             False
dtype: bool

Now that the missing values have been filled, we can fit a model to the data. Here we do so using a decision tree.

[8]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from raimitigations.dataprocessing import EncoderOrdinal

encode = EncoderOrdinal(
    df=new_df,
    col_encode=None
)
encode.fit()
new_df = encode.transform(new_df)

estimator = DecisionTreeClassifier(max_features="sqrt", random_state=0)
X = new_df.drop(columns=['is_promoted', 'employee_id'])
Y = new_df['is_promoted']

train_X, test_X, train_y, test_y = train_test_split(X, Y, test_size=0.2, random_state=0, stratify=Y)

estimator.fit(train_X, train_y)
estimator.score(test_X, test_y)
No columns specified for encoding. These columns have been automatically identfied as the following:
['department', 'region', 'education', 'gender', 'recruitment_channel']
[8]:
0.8865

Handling a DataFrame without column names

Even if the dataset has no column headers, we can perform the same operations using column indices instead. The next few cells demonstrate how to do this.

[9]:
dataset = pd.read_csv(data_dir + 'hr_promotion/train.csv', header=None, skiprows=1)
dataset = dataset[:10000]
dataset
[9]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 65438 Sales & Marketing region_7 Master's & above f sourcing 1 35 5.0 8 1 0 49 0
1 65141 Operations region_22 Bachelor's m other 1 30 5.0 4 0 0 60 0
2 7513 Sales & Marketing region_19 Bachelor's m sourcing 1 34 3.0 7 0 0 50 0
3 2542 Sales & Marketing region_23 Bachelor's m other 2 39 1.0 10 0 0 50 0
4 48945 Technology region_26 Bachelor's m other 1 45 3.0 2 0 0 73 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9995 14934 Procurement region_13 Master's & above f other 1 37 4.0 7 1 0 71 0
9996 22040 Sales & Marketing region_33 Master's & above m sourcing 1 39 3.0 7 0 0 48 0
9997 14188 Finance region_13 Master's & above f sourcing 1 33 4.0 4 1 0 58 0
9998 73566 Operations region_28 Master's & above m other 1 32 4.0 4 1 0 57 1
9999 21372 Procurement region_13 Bachelor's f sourcing 1 32 3.0 6 0 0 71 0

10000 rows × 14 columns

[10]:
print(dataset.isna().any())
print(dataset.iloc[:, 3].unique())
print(dataset.iloc[:, 8].unique())
0     False
1     False
2     False
3      True
4     False
5     False
6     False
7     False
8      True
9     False
10    False
11    False
12    False
13    False
dtype: bool
["Master's & above" "Bachelor's" nan 'Below Secondary']
[ 5.  3.  1.  4. nan  2.]

Without Enabling Encoding

Using the sklearn_obj parameter, we have the option to pass a pre-defined sklearn.impute.KNNImputer object. If it is provided, knn_params will be overridden.

[11]:
from sklearn.impute import KNNImputer

imputer = KNNDataImputer(
    df=dataset,
    col_impute=[8],
    enable_encoder=False,
    sklearn_obj=KNNImputer(n_neighbors=5)
)
imputer.fit()
new_df = imputer.transform(dataset)
new_df


WARNING: Categorical columns will be excluded from the iterative imputation process.
If you'd like to include these columns, you need to set 'enable_encoder'=True.
If you'd like to use a different type of encoding before imputation, consider using the Pipeline class and call your own encoder before calling this subclass for imputation.
[11]:
0 6 7 8 9 10 11 12 13 1 4 5 2 3
0 65438.0 1.0 35.0 5.0 8.0 1.0 0.0 49.0 0.0 Sales & Marketing f sourcing region_7 Master's & above
1 65141.0 1.0 30.0 5.0 4.0 0.0 0.0 60.0 0.0 Operations m other region_22 Bachelor's
2 7513.0 1.0 34.0 3.0 7.0 0.0 0.0 50.0 0.0 Sales & Marketing m sourcing region_19 Bachelor's
3 2542.0 2.0 39.0 1.0 10.0 0.0 0.0 50.0 0.0 Sales & Marketing m other region_23 Bachelor's
4 48945.0 1.0 45.0 3.0 2.0 0.0 0.0 73.0 0.0 Technology m other region_26 Bachelor's
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9995 14934.0 1.0 37.0 4.0 7.0 1.0 0.0 71.0 0.0 Procurement f other region_13 Master's & above
9996 22040.0 1.0 39.0 3.0 7.0 0.0 0.0 48.0 0.0 Sales & Marketing m sourcing region_33 Master's & above
9997 14188.0 1.0 33.0 4.0 4.0 1.0 0.0 58.0 0.0 Finance f sourcing region_13 Master's & above
9998 73566.0 1.0 32.0 4.0 4.0 1.0 0.0 57.0 1.0 Operations m other region_28 Master's & above
9999 21372.0 1.0 32.0 3.0 6.0 0.0 0.0 71.0 0.0 Procurement f sourcing region_13 Bachelor's

10000 rows × 14 columns

[12]:
print(new_df.isna().any())
0     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
1     False
4     False
5     False
2     False
3      True
dtype: bool

With Enabling Encoding

[13]:
imputer = KNNDataImputer(
    df=dataset,
    col_impute=None,
    enable_encoder=True,
    knn_params={
        "missing_values": np.nan,
        "n_neighbors": 6,
        "weights": "uniform",
        "metric": "nan_euclidean",
        "copy": True,
    }
)
imputer.fit()
new_df = imputer.transform(dataset)
new_df


WARNING: 'enable_encoder'=True and categorical columns will be encoded using ordinal encoding before applying the iterative imputation process.
If you'd like to use a different type of encoding before imputation, consider using the Pipeline class and call your own encoder before calling this subclass for imputation.
No columns specified for imputation. These columns have been automatically identified at transform time:
['3', '8']

WARNING: Note that encoded columns are not guaranteed to reverse transform if they have imputed values.
If you'd like to use a different type of encoding before imputation, consider using the Pipeline class and call your own encoder before calling this subclass.

Imputed categorical columns' reverse encoding transformation complete.
[13]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 65438.0 Sales & Marketing region_7 Master's & above f sourcing 1.0 35.0 5.0 8.0 1.0 0.0 49.0 0.0
1 65141.0 Operations region_22 Bachelor's m other 1.0 30.0 5.0 4.0 0.0 0.0 60.0 0.0
2 7513.0 Sales & Marketing region_19 Bachelor's m sourcing 1.0 34.0 3.0 7.0 0.0 0.0 50.0 0.0
3 2542.0 Sales & Marketing region_23 Bachelor's m other 2.0 39.0 1.0 10.0 0.0 0.0 50.0 0.0
4 48945.0 Technology region_26 Bachelor's m other 1.0 45.0 3.0 2.0 0.0 0.0 73.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9995 14934.0 Procurement region_13 Master's & above f other 1.0 37.0 4.0 7.0 1.0 0.0 71.0 0.0
9996 22040.0 Sales & Marketing region_33 Master's & above m sourcing 1.0 39.0 3.0 7.0 0.0 0.0 48.0 0.0
9997 14188.0 Finance region_13 Master's & above f sourcing 1.0 33.0 4.0 4.0 1.0 0.0 58.0 0.0
9998 73566.0 Operations region_28 Master's & above m other 1.0 32.0 4.0 4.0 1.0 0.0 57.0 1.0
9999 21372.0 Procurement region_13 Bachelor's f sourcing 1.0 32.0 3.0 6.0 0.0 0.0 71.0 0.0

10000 rows × 14 columns

[14]:
print(new_df.isna().any())
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
dtype: bool