KNNDataImputer Example
In this example, we will explore how to perform imputation with the HR promotion dataset using KNN imputation.
This method imputes a feature’s missing values using the mean value of that feature from the k nearest neighbors found in the dataset. Two samples are considered close if the features that neither is missing are close.
This subclass uses the sklearn.impute.KNNImputer
class from sklearn
under the hood.
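To see the mechanics on a toy example, here is a minimal sketch using plain sklearn (independent of the raimitigations wrapper), with made-up data:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix with one missing value in the first feature.
X = np.array([
    [1.0, 2.0],
    [3.0, 4.0],
    [np.nan, 6.0],
    [5.0, 8.0],
])

# Neighbors are found with the nan_euclidean metric, which only compares
# coordinates present in both samples. With n_neighbors=2, the two rows
# closest to [nan, 6.0] are [3.0, 4.0] and [5.0, 8.0], so the missing
# value becomes the mean of their first feature: (3.0 + 5.0) / 2 = 4.0.
X_imputed = KNNImputer(n_neighbors=2).fit_transform(X)
print(X_imputed[2, 0])  # 4.0
```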
[1]:
import sys
sys.path.append('../../../notebooks')
import pandas as pd
import numpy as np
from raimitigations.dataprocessing import KNNDataImputer
from download import download_datasets
Handling a DataFrame with column names
[2]:
data_dir = '../../../datasets/'
download_datasets(data_dir)
dataset = pd.read_csv(data_dir + 'hr_promotion/train.csv')
dataset = dataset[:10000]
dataset
[2]:
employee_id | department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | KPIs_met >80% | awards_won? | avg_training_score | is_promoted | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 65438 | Sales & Marketing | region_7 | Master's & above | f | sourcing | 1 | 35 | 5.0 | 8 | 1 | 0 | 49 | 0 |
1 | 65141 | Operations | region_22 | Bachelor's | m | other | 1 | 30 | 5.0 | 4 | 0 | 0 | 60 | 0 |
2 | 7513 | Sales & Marketing | region_19 | Bachelor's | m | sourcing | 1 | 34 | 3.0 | 7 | 0 | 0 | 50 | 0 |
3 | 2542 | Sales & Marketing | region_23 | Bachelor's | m | other | 2 | 39 | 1.0 | 10 | 0 | 0 | 50 | 0 |
4 | 48945 | Technology | region_26 | Bachelor's | m | other | 1 | 45 | 3.0 | 2 | 0 | 0 | 73 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9995 | 14934 | Procurement | region_13 | Master's & above | f | other | 1 | 37 | 4.0 | 7 | 1 | 0 | 71 | 0 |
9996 | 22040 | Sales & Marketing | region_33 | Master's & above | m | sourcing | 1 | 39 | 3.0 | 7 | 0 | 0 | 48 | 0 |
9997 | 14188 | Finance | region_13 | Master's & above | f | sourcing | 1 | 33 | 4.0 | 4 | 1 | 0 | 58 | 0 |
9998 | 73566 | Operations | region_28 | Master's & above | m | other | 1 | 32 | 4.0 | 4 | 1 | 0 | 57 | 1 |
9999 | 21372 | Procurement | region_13 | Bachelor's | f | sourcing | 1 | 32 | 3.0 | 6 | 0 | 0 | 71 | 0 |
10000 rows × 14 columns
[3]:
print(dataset.isna().any())
print(dataset['education'].unique())
print(dataset['previous_year_rating'].unique())
employee_id False
department False
region False
education True
gender False
recruitment_channel False
no_of_trainings False
age False
previous_year_rating True
length_of_service False
KPIs_met >80% False
awards_won? False
avg_training_score False
is_promoted False
dtype: bool
["Master's & above" "Bachelor's" nan 'Below Secondary']
[ 5. 3. 1. 4. nan 2.]
Without Enabling Encoding
As we can see above, both education and previous_year_rating have missing values.
However, note that the dataset includes categorical columns such as education, while sklearn’s sklearn.impute.KNNImputer
can only handle numerical data. Fortunately, KNNDataImputer
can handle categorical data through the boolean parameter enable_encoder
.
First, let’s use the default value enable_encoder=False
, in which case categorical data is excluded from the imputation process. We can also use col_impute
to specify that only previous_year_rating
should be imputed.
Additionally, we can specify sklearn.impute.KNNImputer
’s parameters through the knn_params
dictionary, as long as it follows the format below (if you choose not to pass anything to this parameter, default values will be used).
[4]:
imputer = KNNDataImputer(
df=dataset,
col_impute=['previous_year_rating'],
enable_encoder=False,
knn_params={
"missing_values": np.nan,
"n_neighbors": 4,
"weights": "uniform",
"metric": "nan_euclidean",
"copy": True,
}
)
imputer.fit()
new_df = imputer.transform(dataset)
new_df
WARNING: Categorical columns will be excluded from the iterative imputation process.
If you'd like to include these columns, you need to set 'enable_encoder'=True.
If you'd like to use a different type of encoding before imputation, consider using the Pipeline class and call your own encoder before calling this subclass for imputation.
[4]:
employee_id | no_of_trainings | age | previous_year_rating | length_of_service | KPIs_met >80% | awards_won? | avg_training_score | is_promoted | department | education | recruitment_channel | region | gender | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 65438.0 | 1.0 | 35.0 | 5.0 | 8.0 | 1.0 | 0.0 | 49.0 | 0.0 | Sales & Marketing | Master's & above | sourcing | region_7 | f |
1 | 65141.0 | 1.0 | 30.0 | 5.0 | 4.0 | 0.0 | 0.0 | 60.0 | 0.0 | Operations | Bachelor's | other | region_22 | m |
2 | 7513.0 | 1.0 | 34.0 | 3.0 | 7.0 | 0.0 | 0.0 | 50.0 | 0.0 | Sales & Marketing | Bachelor's | sourcing | region_19 | m |
3 | 2542.0 | 2.0 | 39.0 | 1.0 | 10.0 | 0.0 | 0.0 | 50.0 | 0.0 | Sales & Marketing | Bachelor's | other | region_23 | m |
4 | 48945.0 | 1.0 | 45.0 | 3.0 | 2.0 | 0.0 | 0.0 | 73.0 | 0.0 | Technology | Bachelor's | other | region_26 | m |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9995 | 14934.0 | 1.0 | 37.0 | 4.0 | 7.0 | 1.0 | 0.0 | 71.0 | 0.0 | Procurement | Master's & above | other | region_13 | f |
9996 | 22040.0 | 1.0 | 39.0 | 3.0 | 7.0 | 0.0 | 0.0 | 48.0 | 0.0 | Sales & Marketing | Master's & above | sourcing | region_33 | m |
9997 | 14188.0 | 1.0 | 33.0 | 4.0 | 4.0 | 1.0 | 0.0 | 58.0 | 0.0 | Finance | Master's & above | sourcing | region_13 | f |
9998 | 73566.0 | 1.0 | 32.0 | 4.0 | 4.0 | 1.0 | 0.0 | 57.0 | 1.0 | Operations | Master's & above | other | region_28 | m |
9999 | 21372.0 | 1.0 | 32.0 | 3.0 | 6.0 | 0.0 | 0.0 | 71.0 | 0.0 | Procurement | Bachelor's | sourcing | region_13 | f |
10000 rows × 14 columns
[5]:
print(new_df.isna().any())
employee_id False
no_of_trainings False
age False
previous_year_rating False
length_of_service False
KPIs_met >80% False
awards_won? False
avg_training_score False
is_promoted False
department False
education True
recruitment_channel False
region False
gender False
dtype: bool
With Enabling Encoding
Now, if we use enable_encoder=True
, categorical columns can also be imputed.
Using the original dataset, we know that both education and previous_year_rating have missing values. Below, we will not specify the columns to use for imputation (KNNDataImputer
will determine this for us).
The
enable_encoder
parameter allows the class to use ordinal encoding internally before imputation; however, it uses the default ‘auto’ option for its categories
parameter. We recommend using this internal encoding when you don’t need to specify ordinal semantics or a category order. Otherwise, consider using the Pipeline class to apply custom encoding before calling the KNNDataImputer
class for imputation.
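As a sketch of that alternative, you could encode education with an explicit order yourself, impute, and round the codes back to valid categories. The toy data and the chosen category order below are illustrative assumptions, not part of this dataset:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# An assumed ordinal order for education, lowest to highest.
order = ['Below Secondary', "Bachelor's", "Master's & above"]

df = pd.DataFrame({
    'education': ["Master's & above", "Bachelor's", np.nan, 'Below Secondary'],
    'age': [35, 30, 34, 39],
})

# Encode with the chosen order; missing values stay NaN.
codes = df['education'].map({cat: i for i, cat in enumerate(order)})
X = np.column_stack([codes.to_numpy(dtype=float),
                     df['age'].to_numpy(dtype=float)])

# Impute, then round and clip the codes back to valid category indices.
X_imp = KNNImputer(n_neighbors=2).fit_transform(X)
idx = np.clip(np.rint(X_imp[:, 0]), 0, len(order) - 1).astype(int)
df['education'] = [order[i] for i in idx]
print(df['education'].tolist())
```

The rounding step is what makes the mapping back to categories safe: the imputed code is a mean of neighbor codes and is generally not an integer.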
[6]:
imputer = KNNDataImputer(
df=dataset,
col_impute=None,
enable_encoder=True,
knn_params=None
)
imputer.fit()
new_df = imputer.transform(dataset)
new_df
WARNING: 'enable_encoder'=True and categorical columns will be encoded using ordinal encoding before applying the iterative imputation process.
If you'd like to use a different type of encoding before imputation, consider using the Pipeline class and call your own encoder before calling this subclass for imputation.
No columns specified for imputation. These columns have been automatically identified at transform time:
['education', 'previous_year_rating']
WARNING: Note that encoded columns are not guaranteed to reverse transform if they have imputed values.
If you'd like to use a different type of encoding before imputation, consider using the Pipeline class and call your own encoder before calling this subclass.
Imputed categorical columns' reverse encoding transformation complete.
[6]:
employee_id | department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | KPIs_met >80% | awards_won? | avg_training_score | is_promoted | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 65438.0 | Sales & Marketing | region_7 | Master's & above | f | sourcing | 1.0 | 35.0 | 5.0 | 8.0 | 1.0 | 0.0 | 49.0 | 0.0 |
1 | 65141.0 | Operations | region_22 | Bachelor's | m | other | 1.0 | 30.0 | 5.0 | 4.0 | 0.0 | 0.0 | 60.0 | 0.0 |
2 | 7513.0 | Sales & Marketing | region_19 | Bachelor's | m | sourcing | 1.0 | 34.0 | 3.0 | 7.0 | 0.0 | 0.0 | 50.0 | 0.0 |
3 | 2542.0 | Sales & Marketing | region_23 | Bachelor's | m | other | 2.0 | 39.0 | 1.0 | 10.0 | 0.0 | 0.0 | 50.0 | 0.0 |
4 | 48945.0 | Technology | region_26 | Bachelor's | m | other | 1.0 | 45.0 | 3.0 | 2.0 | 0.0 | 0.0 | 73.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9995 | 14934.0 | Procurement | region_13 | Master's & above | f | other | 1.0 | 37.0 | 4.0 | 7.0 | 1.0 | 0.0 | 71.0 | 0.0 |
9996 | 22040.0 | Sales & Marketing | region_33 | Master's & above | m | sourcing | 1.0 | 39.0 | 3.0 | 7.0 | 0.0 | 0.0 | 48.0 | 0.0 |
9997 | 14188.0 | Finance | region_13 | Master's & above | f | sourcing | 1.0 | 33.0 | 4.0 | 4.0 | 1.0 | 0.0 | 58.0 | 0.0 |
9998 | 73566.0 | Operations | region_28 | Master's & above | m | other | 1.0 | 32.0 | 4.0 | 4.0 | 1.0 | 0.0 | 57.0 | 1.0 |
9999 | 21372.0 | Procurement | region_13 | Bachelor's | f | sourcing | 1.0 | 32.0 | 3.0 | 6.0 | 0.0 | 0.0 | 71.0 | 0.0 |
10000 rows × 14 columns
Note that when the encoder is used before imputation, the education column isn’t always guaranteed to be reverse transformed, as it now contains new imputed values that the encoder can’t always map back to the original categories.
[7]:
print(new_df.isna().any())
employee_id False
department False
region False
education False
gender False
recruitment_channel False
no_of_trainings False
age False
previous_year_rating False
length_of_service False
KPIs_met >80% False
awards_won? False
avg_training_score False
is_promoted False
dtype: bool
Now that the missing values have been filled, we can fit a model to the data. Here we do so using a decision tree.
[8]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from raimitigations.dataprocessing import EncoderOrdinal
encode = EncoderOrdinal(
df=new_df,
col_encode=None
)
encode.fit()
new_df = encode.transform(new_df)
estimator = DecisionTreeClassifier(max_features="sqrt", random_state=0)
X = new_df.drop(columns=['is_promoted', 'employee_id'])
Y = new_df['is_promoted']
train_X, test_X, train_y, test_y = train_test_split(X, Y, test_size=0.2, random_state=0, stratify=Y)
estimator.fit(train_X, train_y)
estimator.score(test_X, test_y)
No columns specified for encoding. These columns have been automatically identfied as the following:
['department', 'region', 'education', 'gender', 'recruitment_channel']
[8]:
0.8865
Handling a DataFrame without column names
Even if the dataset has no column headers, we can perform the same operations using column indices instead. The next few cells demonstrate how.
[9]:
dataset = pd.read_csv(data_dir + 'hr_promotion/train.csv', header=None, skiprows=1)
dataset = dataset[:10000]
dataset
[9]:
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 65438 | Sales & Marketing | region_7 | Master's & above | f | sourcing | 1 | 35 | 5.0 | 8 | 1 | 0 | 49 | 0 |
1 | 65141 | Operations | region_22 | Bachelor's | m | other | 1 | 30 | 5.0 | 4 | 0 | 0 | 60 | 0 |
2 | 7513 | Sales & Marketing | region_19 | Bachelor's | m | sourcing | 1 | 34 | 3.0 | 7 | 0 | 0 | 50 | 0 |
3 | 2542 | Sales & Marketing | region_23 | Bachelor's | m | other | 2 | 39 | 1.0 | 10 | 0 | 0 | 50 | 0 |
4 | 48945 | Technology | region_26 | Bachelor's | m | other | 1 | 45 | 3.0 | 2 | 0 | 0 | 73 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9995 | 14934 | Procurement | region_13 | Master's & above | f | other | 1 | 37 | 4.0 | 7 | 1 | 0 | 71 | 0 |
9996 | 22040 | Sales & Marketing | region_33 | Master's & above | m | sourcing | 1 | 39 | 3.0 | 7 | 0 | 0 | 48 | 0 |
9997 | 14188 | Finance | region_13 | Master's & above | f | sourcing | 1 | 33 | 4.0 | 4 | 1 | 0 | 58 | 0 |
9998 | 73566 | Operations | region_28 | Master's & above | m | other | 1 | 32 | 4.0 | 4 | 1 | 0 | 57 | 1 |
9999 | 21372 | Procurement | region_13 | Bachelor's | f | sourcing | 1 | 32 | 3.0 | 6 | 0 | 0 | 71 | 0 |
10000 rows × 14 columns
[10]:
print(dataset.isna().any())
print(dataset.iloc[:, 3].unique())
print(dataset.iloc[:, 8].unique())
0 False
1 False
2 False
3 True
4 False
5 False
6 False
7 False
8 True
9 False
10 False
11 False
12 False
13 False
dtype: bool
["Master's & above" "Bachelor's" nan 'Below Secondary']
[ 5. 3. 1. 4. nan 2.]
Without Enabling Encoding
Using the sklearn_obj
parameter, we can pass a pre-defined sklearn.impute.KNNImputer
object. If one is provided, knn_params
will be overridden.
[11]:
from sklearn.impute import KNNImputer
imputer = KNNDataImputer(
df=dataset,
col_impute=[8],
enable_encoder=False,
sklearn_obj=KNNImputer(n_neighbors=5)
)
imputer.fit()
new_df = imputer.transform(dataset)
new_df
WARNING: Categorical columns will be excluded from the iterative imputation process.
If you'd like to include these columns, you need to set 'enable_encoder'=True.
If you'd like to use a different type of encoding before imputation, consider using the Pipeline class and call your own encoder before calling this subclass for imputation.
[11]:
0 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 1 | 4 | 5 | 2 | 3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 65438.0 | 1.0 | 35.0 | 5.0 | 8.0 | 1.0 | 0.0 | 49.0 | 0.0 | Sales & Marketing | f | sourcing | region_7 | Master's & above |
1 | 65141.0 | 1.0 | 30.0 | 5.0 | 4.0 | 0.0 | 0.0 | 60.0 | 0.0 | Operations | m | other | region_22 | Bachelor's |
2 | 7513.0 | 1.0 | 34.0 | 3.0 | 7.0 | 0.0 | 0.0 | 50.0 | 0.0 | Sales & Marketing | m | sourcing | region_19 | Bachelor's |
3 | 2542.0 | 2.0 | 39.0 | 1.0 | 10.0 | 0.0 | 0.0 | 50.0 | 0.0 | Sales & Marketing | m | other | region_23 | Bachelor's |
4 | 48945.0 | 1.0 | 45.0 | 3.0 | 2.0 | 0.0 | 0.0 | 73.0 | 0.0 | Technology | m | other | region_26 | Bachelor's |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9995 | 14934.0 | 1.0 | 37.0 | 4.0 | 7.0 | 1.0 | 0.0 | 71.0 | 0.0 | Procurement | f | other | region_13 | Master's & above |
9996 | 22040.0 | 1.0 | 39.0 | 3.0 | 7.0 | 0.0 | 0.0 | 48.0 | 0.0 | Sales & Marketing | m | sourcing | region_33 | Master's & above |
9997 | 14188.0 | 1.0 | 33.0 | 4.0 | 4.0 | 1.0 | 0.0 | 58.0 | 0.0 | Finance | f | sourcing | region_13 | Master's & above |
9998 | 73566.0 | 1.0 | 32.0 | 4.0 | 4.0 | 1.0 | 0.0 | 57.0 | 1.0 | Operations | m | other | region_28 | Master's & above |
9999 | 21372.0 | 1.0 | 32.0 | 3.0 | 6.0 | 0.0 | 0.0 | 71.0 | 0.0 | Procurement | f | sourcing | region_13 | Bachelor's |
10000 rows × 14 columns
[12]:
print(new_df.isna().any())
0 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
1 False
4 False
5 False
2 False
3 True
dtype: bool
With Enabling Encoding
[13]:
imputer = KNNDataImputer(
df=dataset,
col_impute=None,
enable_encoder=True,
knn_params={
"missing_values": np.nan,
"n_neighbors": 6,
"weights": "uniform",
"metric": "nan_euclidean",
"copy": True,
}
)
imputer.fit()
new_df = imputer.transform(dataset)
new_df
WARNING: 'enable_encoder'=True and categorical columns will be encoded using ordinal encoding before applying the iterative imputation process.
If you'd like to use a different type of encoding before imputation, consider using the Pipeline class and call your own encoder before calling this subclass for imputation.
No columns specified for imputation. These columns have been automatically identified at transform time:
['3', '8']
WARNING: Note that encoded columns are not guaranteed to reverse transform if they have imputed values.
If you'd like to use a different type of encoding before imputation, consider using the Pipeline class and call your own encoder before calling this subclass.
Imputed categorical columns' reverse encoding transformation complete.
[13]:
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 65438.0 | Sales & Marketing | region_7 | Master's & above | f | sourcing | 1.0 | 35.0 | 5.0 | 8.0 | 1.0 | 0.0 | 49.0 | 0.0 |
1 | 65141.0 | Operations | region_22 | Bachelor's | m | other | 1.0 | 30.0 | 5.0 | 4.0 | 0.0 | 0.0 | 60.0 | 0.0 |
2 | 7513.0 | Sales & Marketing | region_19 | Bachelor's | m | sourcing | 1.0 | 34.0 | 3.0 | 7.0 | 0.0 | 0.0 | 50.0 | 0.0 |
3 | 2542.0 | Sales & Marketing | region_23 | Bachelor's | m | other | 2.0 | 39.0 | 1.0 | 10.0 | 0.0 | 0.0 | 50.0 | 0.0 |
4 | 48945.0 | Technology | region_26 | Bachelor's | m | other | 1.0 | 45.0 | 3.0 | 2.0 | 0.0 | 0.0 | 73.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9995 | 14934.0 | Procurement | region_13 | Master's & above | f | other | 1.0 | 37.0 | 4.0 | 7.0 | 1.0 | 0.0 | 71.0 | 0.0 |
9996 | 22040.0 | Sales & Marketing | region_33 | Master's & above | m | sourcing | 1.0 | 39.0 | 3.0 | 7.0 | 0.0 | 0.0 | 48.0 | 0.0 |
9997 | 14188.0 | Finance | region_13 | Master's & above | f | sourcing | 1.0 | 33.0 | 4.0 | 4.0 | 1.0 | 0.0 | 58.0 | 0.0 |
9998 | 73566.0 | Operations | region_28 | Master's & above | m | other | 1.0 | 32.0 | 4.0 | 4.0 | 1.0 | 0.0 | 57.0 | 1.0 |
9999 | 21372.0 | Procurement | region_13 | Bachelor's | f | sourcing | 1.0 | 32.0 | 3.0 | 6.0 | 0.0 | 0.0 | 71.0 | 0.0 |
10000 rows × 14 columns
[14]:
print(new_df.isna().any())
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
dtype: bool