IterativeDataImputer Example

In this example, we will explore how to perform imputation with the HR promotion dataset using iterative imputation.

This method imputes missing data of a feature using the other features. It uses a round-robin method of modeling each feature with missing values to be imputed as a function of the other features.
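The round-robin idea can be sketched in a few lines of NumPy. This is a simplified illustration, not the code that IterativeDataImputer or sklearn runs: the function name `round_robin_impute`, the plain least-squares model, and the toy matrix are all assumptions made for the example.

```python
import numpy as np

def round_robin_impute(X, n_iter=10):
    """Simplified round-robin imputation using one linear model per column."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()
    # Step 1: start from a simple guess (the mean of each column's observed values).
    col_means = np.nanmean(X, axis=0)
    filled[missing] = np.take(col_means, np.where(missing)[1])
    # Step 2: revisit every incomplete column in turn, several times.
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            rows = missing[:, j]
            if not rows.any():
                continue  # nothing to impute in this column
            others = np.delete(filled, j, axis=1)
            # Add an intercept column and fit least squares on the observed rows.
            A = np.column_stack([np.ones(len(others)), others])
            coef, *_ = np.linalg.lstsq(A[~rows], filled[~rows, j], rcond=None)
            # Replace the current guesses with the model's predictions.
            filled[rows, j] = A[rows] @ coef
    return filled

# Toy matrix (hypothetical data) with two missing entries.
X = np.array([
    [1.0, 2.0, 3.0],
    [2.0, np.nan, 6.0],
    [3.0, 6.0, np.nan],
    [4.0, 8.0, 12.0],
    [5.0, 10.0, 15.0],
])
X_filled = round_robin_impute(X)
print(np.round(X_filled, 2))
```

Only the entries that were missing are changed; observed values are left as they were.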

This subclass uses sklearn’s sklearn.impute.IterativeImputer class in the background (note that this sklearn class is still experimental).
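Because the sklearn class is experimental, using it directly requires an explicit opt-in import before the class itself can be imported. The snippet below is a minimal sketch on a made-up array; it calls sklearn directly and does not go through IterativeDataImputer.

```python
import numpy as np
# IterativeImputer is experimental: this import must come first to enable it.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical toy data with two missing entries.
X = np.array([
    [1.0, 2.1, 3.0],
    [2.0, np.nan, 6.2],
    [3.0, 5.9, np.nan],
    [4.0, 8.2, 11.8],
    [5.0, 9.9, 15.1],
])

imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)
print(np.round(X_filled, 2))
```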

[1]:

import sys
# Make the helper module used below (download.py) importable from the notebooks folder.
sys.path.append('../../../notebooks')

import pandas as pd
import numpy as np
from raimitigations.dataprocessing import IterativeDataImputer

from download import download_datasets

from sklearn.linear_model import BayesianRidge, Ridge
from sklearn.ensemble import RandomForestRegressor

Handling a DataFrame with column names

[2]:

data_dir = '../../../datasets/'
download_datasets(data_dir)
dataset = pd.read_csv(data_dir + 'hr_promotion/train.csv')
dataset = dataset[:10000]

dataset

[2]:

	employee_id	department	region	education	gender	recruitment_channel	no_of_trainings	age	previous_year_rating	length_of_service	KPIs_met >80%	awards_won?	avg_training_score	is_promoted
0	65438	Sales & Marketing	region_7	Master's & above	f	sourcing	1	35	5.0	8	1	0	49	0
1	65141	Operations	region_22	Bachelor's	m	other	1	30	5.0	4	0	0	60	0
2	7513	Sales & Marketing	region_19	Bachelor's	m	sourcing	1	34	3.0	7	0	0	50	0
3	2542	Sales & Marketing	region_23	Bachelor's	m	other	2	39	1.0	10	0	0	50	0
4	48945	Technology	region_26	Bachelor's	m	other	1	45	3.0	2	0	0	73	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
9995	14934	Procurement	region_13	Master's & above	f	other	1	37	4.0	7	1	0	71	0
9996	22040	Sales & Marketing	region_33	Master's & above	m	sourcing	1	39	3.0	7	0	0	48	0
9997	14188	Finance	region_13	Master's & above	f	sourcing	1	33	4.0	4	1	0	58	0
9998	73566	Operations	region_28	Master's & above	m	other	1	32	4.0	4	1	0	57	1
9999	21372	Procurement	region_13	Bachelor's	f	sourcing	1	32	3.0	6	0	0	71	0

10000 rows × 14 columns

[3]:

print(dataset.isna().any())
print(dataset['education'].unique())
print(dataset['previous_year_rating'].unique())

employee_id             False
department              False
region                  False
education                True
gender                  False
recruitment_channel     False
no_of_trainings         False
age                     False
previous_year_rating     True
length_of_service       False
KPIs_met >80%           False
awards_won?             False
avg_training_score      False
is_promoted             False
dtype: bool
["Master's & above" "Bachelor's" nan 'Below Secondary']
[ 5.  3.  1.  4. nan  2.]

Without Enabling Encoding

As we can see above, both the education and previous_year_rating have missing values.
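When a boolean flag per column is not enough, the number and share of missing values can be listed as well. The sketch below uses a small made-up DataFrame (the `toy` frame and its values are hypothetical) rather than the HR promotion data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset, for illustration only.
toy = pd.DataFrame({
    "education": ["Bachelor's", None, "Master's & above", None],
    "previous_year_rating": [5.0, np.nan, 3.0, 4.0],
    "age": [35, 30, 34, 39],
})

# Count missing values per column and keep only the affected columns.
missing_counts = toy.isna().sum()
cols_with_missing = missing_counts[missing_counts > 0]
print(cols_with_missing)

# Share of missing values per column, in percent.
print((toy.isna().mean() * 100).round(1))
```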

However, note that the dataset includes categorical columns such as education, while sklearn’s sklearn.impute.IterativeImputer can only handle numerical data. Fortunately, IterativeDataImputer can handle categorical data through the boolean parameter enable_encoder.

First, let’s use the default value enable_encoder=False. With this setting, categorical data is excluded from the imputation process, whether it has missing values or not. We can also use col_impute to specify that only previous_year_rating should be imputed.
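Conceptually, excluding categorical data means that only the numeric columns are handed to the imputer while the remaining columns are left untouched. The sketch below illustrates that idea with plain pandas and sklearn on a made-up frame; it is an assumption about the general approach, not the library's actual implementation.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical miniature of the dataset, for illustration only.
toy = pd.DataFrame({
    "education": ["Bachelor's", None, "Master's & above", "Bachelor's", None],
    "age": [35, 30, 34, 39, 45],
    "previous_year_rating": [5.0, np.nan, 3.0, 1.0, 3.0],
    "avg_training_score": [49, 60, 50, 50, 73],
})

# Only numeric columns take part in the imputation.
numeric_cols = toy.select_dtypes(include="number").columns

imputed = toy.copy()
imputed[numeric_cols] = IterativeImputer(random_state=0).fit_transform(toy[numeric_cols])

# The numeric gap is filled; the categorical gaps are still there.
print(imputed.isna().any())
```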

Additionally, we can specify sklearn.impute.IterativeImputer’s parameters through the iterative_params dictionary, as long as it follows the format shown below (if you don’t pass anything to this parameter, default values are used).

[4]:

imputer = IterativeDataImputer(
    df=dataset,
    col_impute=['previous_year_rating'],
    enable_encoder=False,
    iterative_params={
        'estimator': RandomForestRegressor(),
        'missing_values': np.nan,
        'sample_posterior': False,
        'max_iter': 10,
        'tol': 1e-3,
        'n_nearest_features': None,
        'initial_strategy': 'mean',
        'imputation_order': 'ascending',
        'skip_complete': False,
        'min_value': -np.inf,
        'max_value': np.inf,
        'random_state': None}
)
imputer.fit()
new_df = imputer.transform(dataset)
new_df


WARNING: Categorical columns will be excluded from the iterative imputation process.
If you'd like to include these columns, you need to set 'enable_encoder'=True.
If you'd like to use a different type of encoding before imputation, consider using the Pipeline class and call your own encoder before calling this subclass for imputation.
[IterativeImputer] Completing matrix with shape (10000, 9)
[IterativeImputer] Change: 1.5071906463137383, scaled tolerance: 78.297
[IterativeImputer] Early stopping criterion reached.
[IterativeImputer] Completing matrix with shape (10000, 9)

[4]:

	employee_id	no_of_trainings	age	previous_year_rating	length_of_service	KPIs_met >80%	awards_won?	avg_training_score	is_promoted	gender	recruitment_channel	education	region	department
0	65438.0	1.0	35.0	5.0	8.0	1.0	0.0	49.0	0.0	f	sourcing	Master's & above	region_7	Sales & Marketing
1	65141.0	1.0	30.0	5.0	4.0	0.0	0.0	60.0	0.0	m	other	Bachelor's	region_22	Operations
2	7513.0	1.0	34.0	3.0	7.0	0.0	0.0	50.0	0.0	m	sourcing	Bachelor's	region_19	Sales & Marketing
3	2542.0	2.0	39.0	1.0	10.0	0.0	0.0	50.0	0.0	m	other	Bachelor's	region_23	Sales & Marketing
4	48945.0	1.0	45.0	3.0	2.0	0.0	0.0	73.0	0.0	m	other	Bachelor's	region_26	Technology
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
9995	14934.0	1.0	37.0	4.0	7.0	1.0	0.0	71.0	0.0	f	other	Master's & above	region_13	Procurement
9996	22040.0	1.0	39.0	3.0	7.0	0.0	0.0	48.0	0.0	m	sourcing	Master's & above	region_33	Sales & Marketing
9997	14188.0	1.0	33.0	4.0	4.0	1.0	0.0	58.0	0.0	f	sourcing	Master's & above	region_13	Finance
9998	73566.0	1.0	32.0	4.0	4.0	1.0	0.0	57.0	1.0	m	other	Master's & above	region_28	Operations
9999	21372.0	1.0	32.0	3.0	6.0	0.0	0.0	71.0	0.0	f	sourcing	Bachelor's	region_13	Procurement

10000 rows × 14 columns
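In the output above, the categorical columns have moved to the end and the numeric columns have been converted to floats. If the original layout matters downstream, it can be restored with plain pandas. The two small frames below are hypothetical stand-ins for `dataset` and `new_df`.

```python
import pandas as pd

# Hypothetical stand-in for the original dataset.
original = pd.DataFrame({
    "employee_id": [65438, 65141],
    "department": ["Sales & Marketing", "Operations"],
    "age": [35, 30],
})
# Hypothetical stand-in for the imputer output: reordered and cast to float.
imputed = pd.DataFrame({
    "employee_id": [65438.0, 65141.0],
    "age": [35.0, 30.0],
    "department": ["Sales & Marketing", "Operations"],
})

# Put the columns back in their original order.
restored = imputed[original.columns]

# Cast back only the columns that were integers to begin with.
int_cols = [c for c in original.columns if pd.api.types.is_integer_dtype(original[c])]
restored = restored.astype({c: original[c].dtype for c in int_cols})
print(restored.dtypes)
```

Casting back to an integer type is only safe for columns whose values are still whole numbers; an imputed column such as previous_year_rating may contain fractional values and is better left as float.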

[5]:

print(new_df.isna().any())

employee_id             False
no_of_trainings         False
age                     False
previous_year_rating    False
length_of_service       False
KPIs_met >80%           False
awards_won?             False
avg_training_score      False
is_promoted             False
gender                  False
recruitment_channel     False
education                True
region                  False
department              False
dtype: bool

With Enabling Encoding

Now if we use enable_encoder=True, we can include categorical data both to be imputed and to be used in the imputation of other features.

Using the original dataset, we know that both education and previous_year_rating have missing values. Below, we will not specify the columns to impute (IterativeDataImputer will determine them for us).

The enable_encoder parameter makes the class apply ordinal encoding internally before imputation. However, the encoder uses the default ‘auto’ option for its categories parameter. We recommend using this internal encoding when you don’t need to specify ordinal semantics or order. Otherwise, consider using the Pipeline class to apply custom encoding before calling the IterativeDataImputer class for imputation.
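To see why the ‘auto’ option may not reflect a meaningful order, the sketch below compares it with an explicit category list using sklearn's OrdinalEncoder directly. It assumes the internal encoder behaves like sklearn's OrdinalEncoder with categories='auto'; the `levels` ordering is our own choice for illustration.

```python
from sklearn.preprocessing import OrdinalEncoder

# One column, three education levels (one row each).
education = [["Below Secondary"], ["Bachelor's"], ["Master's & above"]]

# 'auto' derives the categories from the data in sorted (alphabetical) order.
auto_enc = OrdinalEncoder(categories="auto")
auto_codes = auto_enc.fit_transform(education).ravel()

# An explicit list lets the codes follow the intended ranking (assumed here).
levels = ["Below Secondary", "Bachelor's", "Master's & above"]
ordered_enc = OrdinalEncoder(categories=[levels])
ordered_codes = ordered_enc.fit_transform(education).ravel()

print(auto_enc.categories_[0])   # sorted category order used by 'auto'
print(auto_codes)                # codes under 'auto'
print(ordered_codes)             # codes under the explicit ordering
```

With 'auto', "Bachelor's" sorts before "Below Secondary", so the codes do not follow the education ranking. A regressor that treats the codes as numbers will see that arbitrary order.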

[6]:

imputer = IterativeDataImputer(
    df=dataset,
    col_impute=None,
    enable_encoder=True,
    iterative_params={
        'estimator': RandomForestRegressor(),
        'missing_values': np.nan,
        'sample_posterior': False,
        'max_iter': 10,
        'tol': 1e-3,
        'n_nearest_features': None,
        'initial_strategy': 'mean',
        'imputation_order': 'ascending',
        'skip_complete': False,
        'min_value': -np.inf,
        'max_value': np.inf,
        'random_state': None}
)
imputer.fit()
new_df = imputer.transform(dataset)
new_df


WARNING: 'enable_encoder'=True and categorical columns will be encoded using ordinal encoding before applying the iterative imputation process.
If you'd like to use a different type of encoding before imputation, consider using the Pipeline class and call your own encoder before calling this subclass for imputation.
[IterativeImputer] Completing matrix with shape (10000, 14)
[IterativeImputer] Change: 1.624667046565469, scaled tolerance: 78.297
[IterativeImputer] Early stopping criterion reached.
No columns specified for imputation. These columns have been automatically identified at transform time:
['education', 'previous_year_rating']

WARNING: Note that encoded columns are not guaranteed to reverse transform if they have imputed values.
If you'd like to use a different type of encoding before imputation, consider using the Pipeline class and call your own encoder before calling this subclass.
[IterativeImputer] Completing matrix with shape (10000, 14)

Imputed categorical columns' reverse encoding transformation complete.

[6]:

	employee_id	department	region	education	gender	recruitment_channel	no_of_trainings	age	previous_year_rating	length_of_service	KPIs_met >80%	awards_won?	avg_training_score	is_promoted
0	65438.0	Sales & Marketing	region_7	Master's & above	f	sourcing	1.0	35.0	5.0	8.0	1.0	0.0	49.0	0.0
1	65141.0	Operations	region_22	Bachelor's	m	other	1.0	30.0	5.0	4.0	0.0	0.0	60.0	0.0
2	7513.0	Sales & Marketing	region_19	Bachelor's	m	sourcing	1.0	34.0	3.0	7.0	0.0	0.0	50.0	0.0
3	2542.0	Sales & Marketing	region_23	Bachelor's	m	other	2.0	39.0	1.0	10.0	0.0	0.0	50.0	0.0
4	48945.0	Technology	region_26	Bachelor's	m	other	1.0	45.0	3.0	2.0	0.0	0.0	73.0	0.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
9995	14934.0	Procurement	region_13	Master's & above	f	other	1.0	37.0	4.0	7.0	1.0	0.0	71.0	0.0
9996	22040.0	Sales & Marketing	region_33	Master's & above	m	sourcing	1.0	39.0	3.0	7.0	0.0	0.0	48.0	0.0
9997	14188.0	Finance	region_13	Master's & above	f	sourcing	1.0	33.0	4.0	4.0	1.0	0.0	58.0	0.0
9998	73566.0	Operations	region_28	Master's & above	m	other	1.0	32.0	4.0	4.0	1.0	0.0	57.0	1.0
9999	21372.0	Procurement	region_13	Bachelor's	f	sourcing	1.0	32.0	3.0	6.0	0.0	0.0	71.0	0.0

10000 rows × 14 columns

Note that the encoded education column isn’t always guaranteed to reverse transform after imputation, since it now contains imputed values that the encoder can’t always map back to categories.
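The difficulty is that a regressor can produce values such as 1.4 or 2.7, which are not valid category codes. One generic way to handle this, shown here only as a sketch and not necessarily what IterativeDataImputer does, is to round each value to the nearest code and clip it to the valid range before mapping back.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder()
enc.fit([["Bachelor's"], ["Below Secondary"], ["Master's & above"]])
n_categories = len(enc.categories_[0])

# Hypothetical imputed codes: two of them are not valid integer codes.
imputed_codes = np.array([[0.0], [1.4], [2.7]])

# Round to the nearest code, then clip into [0, n_categories - 1].
valid_codes = np.clip(np.rint(imputed_codes), 0, n_categories - 1)

labels = enc.inverse_transform(valid_codes).ravel()
print(labels)
```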

[7]:

print(new_df.isna().any())

employee_id             False
department              False
region                  False
education               False
gender                  False
recruitment_channel     False
no_of_trainings         False
age                     False
previous_year_rating    False
length_of_service       False
KPIs_met >80%           False
awards_won?             False
avg_training_score      False
is_promoted             False
dtype: bool

Now that the missing values have been filled, we can fit a model to the data. Here we do so using a decision tree…

[8]:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from raimitigations.dataprocessing import EncoderOrdinal

encode = EncoderOrdinal(
    df=new_df,
    col_encode=None
)
encode.fit()
new_df = encode.transform(new_df)

estimator = DecisionTreeClassifier(max_features="sqrt", random_state=0)
X = new_df.drop(columns=['is_promoted', 'employee_id'])
Y = new_df['is_promoted']

train_X, test_X, train_y, test_y = train_test_split(X, Y, test_size=0.2, random_state=0, stratify=Y)

estimator.fit(train_X, train_y)
estimator.score(test_X, test_y)

No columns specified for encoding. These columns have been automatically identfied as the following:
['department', 'region', 'education', 'gender', 'recruitment_channel']

[8]:

0.879
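An accuracy score is easier to interpret next to a trivial baseline, especially if the target classes are imbalanced. The sketch below uses synthetic data (not the HR promotion dataset) in which the features carry no signal, and compares a decision tree with a classifier that always predicts the most frequent class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: roughly 10% positives, labels unrelated to the features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (rng.random(1000) < 0.1).astype(int)

train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Baseline: always predict the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(train_X, train_y)
tree = DecisionTreeClassifier(max_features="sqrt", random_state=0).fit(train_X, train_y)

baseline_score = baseline.score(test_X, test_y)
tree_score = tree.score(test_X, test_y)
print(f"baseline: {baseline_score:.3f}  tree: {tree_score:.3f}")
```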

Handling a DataFrame without column names

Even if the dataset has no column headers, we can perform the same operations using column indices instead. The next few cells demonstrate how to do this.

[9]:

dataset = pd.read_csv(data_dir + 'hr_promotion/train.csv', header=None, skiprows=1)
dataset = dataset[:10000]
dataset

[9]:

	0	1	2	3	4	5	6	7	8	9	10	11	12	13
0	65438	Sales & Marketing	region_7	Master's & above	f	sourcing	1	35	5.0	8	1	0	49	0
1	65141	Operations	region_22	Bachelor's	m	other	1	30	5.0	4	0	0	60	0
2	7513	Sales & Marketing	region_19	Bachelor's	m	sourcing	1	34	3.0	7	0	0	50	0
3	2542	Sales & Marketing	region_23	Bachelor's	m	other	2	39	1.0	10	0	0	50	0
4	48945	Technology	region_26	Bachelor's	m	other	1	45	3.0	2	0	0	73	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
9995	14934	Procurement	region_13	Master's & above	f	other	1	37	4.0	7	1	0	71	0
9996	22040	Sales & Marketing	region_33	Master's & above	m	sourcing	1	39	3.0	7	0	0	48	0
9997	14188	Finance	region_13	Master's & above	f	sourcing	1	33	4.0	4	1	0	58	0
9998	73566	Operations	region_28	Master's & above	m	other	1	32	4.0	4	1	0	57	1
9999	21372	Procurement	region_13	Bachelor's	f	sourcing	1	32	3.0	6	0	0	71	0

10000 rows × 14 columns

[10]:

print(dataset.isna().any())
print(dataset.iloc[:, 3].unique())
print(dataset.iloc[:, 8].unique())

0     False
1     False
2     False
3      True
4     False
5     False
6     False
7     False
8      True
9     False
10    False
11    False
12    False
13    False
dtype: bool
["Master's & above" "Bachelor's" nan 'Below Secondary']
[ 5.  3.  1.  4. nan  2.]
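With integer labels it is easy to lose track of which column is which. If the header row is available, as it is in this CSV, a position-to-name lookup can be built from it. The sketch below uses a small in-memory CSV with made-up rows; `position_to_name` is a helper name introduced only for this example.

```python
import io
import pandas as pd

# Hypothetical three-column CSV with a header row.
csv_text = (
    "employee_id,education,previous_year_rating\n"
    "1,Bachelor's,5.0\n"
    "2,,3.0\n"
    "3,Master's & above,\n"
)

# Skip the header row so the columns are labeled 0, 1, 2, ...
headerless = pd.read_csv(io.StringIO(csv_text), header=None, skiprows=1)

# Read only the header row to recover the names by position.
names = pd.read_csv(io.StringIO(csv_text), nrows=0).columns
position_to_name = dict(enumerate(names))

missing_positions = [c for c in headerless.columns if headerless[c].isna().any()]
print(missing_positions)
print([position_to_name[p] for p in missing_positions])
```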

Without Enabling Encoding

Using the sklearn_obj parameter, we can instead pass a pre-defined sklearn.impute.IterativeImputer object. If sklearn_obj is provided, it overrides iterative_params.

[11]:

from sklearn.experimental import enable_iterative_imputer  # noqa # pylint: disable=unused-import
from sklearn.impute import IterativeImputer

imputer = IterativeDataImputer(
    df=dataset,
    col_impute=[8],
    enable_encoder=False,
    sklearn_obj=IterativeImputer(estimator=RandomForestRegressor(), random_state=100),
)
imputer.fit()
new_df = imputer.transform(dataset)
new_df


WARNING: Categorical columns will be excluded from the iterative imputation process.
If you'd like to include these columns, you need to set 'enable_encoder'=True.
If you'd like to use a different type of encoding before imputation, consider using the Pipeline class and call your own encoder before calling this subclass for imputation.

[11]:

	0	6	7	8	9	10	11	12	13	1	2	4	5	3
0	65438.0	1.0	35.0	5.0	8.0	1.0	0.0	49.0	0.0	Sales & Marketing	region_7	f	sourcing	Master's & above
1	65141.0	1.0	30.0	5.0	4.0	0.0	0.0	60.0	0.0	Operations	region_22	m	other	Bachelor's
2	7513.0	1.0	34.0	3.0	7.0	0.0	0.0	50.0	0.0	Sales & Marketing	region_19	m	sourcing	Bachelor's
3	2542.0	2.0	39.0	1.0	10.0	0.0	0.0	50.0	0.0	Sales & Marketing	region_23	m	other	Bachelor's
4	48945.0	1.0	45.0	3.0	2.0	0.0	0.0	73.0	0.0	Technology	region_26	m	other	Bachelor's
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
9995	14934.0	1.0	37.0	4.0	7.0	1.0	0.0	71.0	0.0	Procurement	region_13	f	other	Master's & above
9996	22040.0	1.0	39.0	3.0	7.0	0.0	0.0	48.0	0.0	Sales & Marketing	region_33	m	sourcing	Master's & above
9997	14188.0	1.0	33.0	4.0	4.0	1.0	0.0	58.0	0.0	Finance	region_13	f	sourcing	Master's & above
9998	73566.0	1.0	32.0	4.0	4.0	1.0	0.0	57.0	1.0	Operations	region_28	m	other	Master's & above
9999	21372.0	1.0	32.0	3.0	6.0	0.0	0.0	71.0	0.0	Procurement	region_13	f	sourcing	Bachelor's

10000 rows × 14 columns

[12]:

print(new_df.isna().any())

0     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
1     False
2     False
4     False
5     False
3      True
dtype: bool
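Cell [11] sets random_state on the IterativeImputer but not on the RandomForestRegressor. Since the forest has its own source of randomness, it is probably safest to seed both when repeatable imputations are needed. The sketch below does this on synthetic data with sklearn directly; the helper `impute` and the reduced `n_estimators` and `max_iter` values are choices made only to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic data with roughly 10% of the entries removed.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
X[rng.random(X.shape) < 0.1] = np.nan

def impute(seed):
    # Seed both the estimator and the imputer.
    estimator = RandomForestRegressor(n_estimators=10, random_state=seed)
    imputer = IterativeImputer(estimator=estimator, max_iter=3, random_state=seed)
    return imputer.fit_transform(X)

first = impute(100)
second = impute(100)
print(np.allclose(first, second))
```

An object configured this way could then be passed through the sklearn_obj parameter in the same way as in cell [11].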

With Enabling Encoding

[13]:

imputer = IterativeDataImputer(
    df=dataset,
    col_impute=None,
    enable_encoder=True,
    iterative_params={
        'estimator': RandomForestRegressor(),
        'missing_values': np.nan,
        'sample_posterior': False,
        'max_iter': 10,
        'tol': 1e-3,
        'n_nearest_features': None,
        'initial_strategy': 'mean',
        'imputation_order': 'ascending',
        'skip_complete': False,
        'min_value': -np.inf,
        'max_value': np.inf,
        'random_state': None}
)
imputer.fit()
new_df = imputer.transform(dataset)
new_df


WARNING: 'enable_encoder'=True and categorical columns will be encoded using ordinal encoding before applying the iterative imputation process.
If you'd like to use a different type of encoding before imputation, consider using the Pipeline class and call your own encoder before calling this subclass for imputation.
[IterativeImputer] Completing matrix with shape (10000, 14)
[IterativeImputer] Change: 1.744667046565469, scaled tolerance: 78.297
[IterativeImputer] Early stopping criterion reached.
No columns specified for imputation. These columns have been automatically identified at transform time:
['3', '8']

WARNING: Note that encoded columns are not guaranteed to reverse transform if they have imputed values.
If you'd like to use a different type of encoding before imputation, consider using the Pipeline class and call your own encoder before calling this subclass.
[IterativeImputer] Completing matrix with shape (10000, 14)

Imputed categorical columns' reverse encoding transformation complete.

[13]:

	0	1	2	3	4	5	6	7	8	9	10	11	12	13
0	65438.0	Sales & Marketing	region_7	Master's & above	f	sourcing	1.0	35.0	5.0	8.0	1.0	0.0	49.0	0.0
1	65141.0	Operations	region_22	Bachelor's	m	other	1.0	30.0	5.0	4.0	0.0	0.0	60.0	0.0
2	7513.0	Sales & Marketing	region_19	Bachelor's	m	sourcing	1.0	34.0	3.0	7.0	0.0	0.0	50.0	0.0
3	2542.0	Sales & Marketing	region_23	Bachelor's	m	other	2.0	39.0	1.0	10.0	0.0	0.0	50.0	0.0
4	48945.0	Technology	region_26	Bachelor's	m	other	1.0	45.0	3.0	2.0	0.0	0.0	73.0	0.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
9995	14934.0	Procurement	region_13	Master's & above	f	other	1.0	37.0	4.0	7.0	1.0	0.0	71.0	0.0
9996	22040.0	Sales & Marketing	region_33	Master's & above	m	sourcing	1.0	39.0	3.0	7.0	0.0	0.0	48.0	0.0
9997	14188.0	Finance	region_13	Master's & above	f	sourcing	1.0	33.0	4.0	4.0	1.0	0.0	58.0	0.0
9998	73566.0	Operations	region_28	Master's & above	m	other	1.0	32.0	4.0	4.0	1.0	0.0	57.0	1.0
9999	21372.0	Procurement	region_13	Bachelor's	f	sourcing	1.0	32.0	3.0	6.0	0.0	0.0	71.0	0.0

10000 rows × 14 columns

[14]:

print(new_df.isna().any())

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
dtype: bool