Imputers

This sub-module of the dataprocessing package collects all imputation transformers implemented here. Imputers are responsible for removing missing values from a dataset by replacing them with valid values inferred from known data. The imputers differ in how these replacement values are computed.

We support three types of imputation: Basic (Simple), Iterative, and K-Nearest Neighbor (KNN). BasicImputer is a univariate imputation algorithm: to impute the missing values of a feature, it uses only the non-missing values of that feature. IterativeDataImputer and KNNDataImputer are multivariate algorithms that use the entire set of features to estimate the missing values. Hence, you should consider the relationships between the features in your data when choosing an imputation method.

  • BasicImputer: provides basic strategies to fill missing values, using a constant value or statistics of the non-missing data. It handles each feature independently, one at a time.

  • IterativeDataImputer: allows for a more sophisticated and flexible approach, predicting each feature with missing values as a function of the other features in the set. The user chooses the regressor used to train, predict and impute data.

  • KNNDataImputer: finds the rows nearest to each row with missing data and imputes each missing value using the uniform or distance-weighted average of the n nearest neighbors in the set.
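To make the univariate/multivariate distinction concrete, here is a minimal NumPy sketch. It does not use the raimitigations classes; the helper functions mean_impute and knn_impute are hypothetical names written only for illustration, and the distance used (root mean squared difference over the features observed in both rows) is an assumption rather than the library's documented metric.

```python
import numpy as np


def mean_impute(X):
    """Univariate: fill each column's NaNs with that column's own mean."""
    X = np.asarray(X, dtype=float).copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X


def knn_impute(X, n_neighbors=2):
    """Multivariate: fill each NaN with the uniform average of that feature
    over the n_neighbors nearest rows in which the feature is observed."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    for i, j in zip(*np.where(np.isnan(X))):
        candidates = []
        for k in range(X.shape[0]):
            if k == i or np.isnan(X[k, j]):
                continue
            # Compare only the features observed in both rows.
            shared = ~np.isnan(X[i]) & ~np.isnan(X[k])
            if shared.any():
                dist = np.sqrt(np.mean((X[i, shared] - X[k, shared]) ** 2))
                candidates.append((dist, X[k, j]))
        # If no usable neighbor exists, the value is left as NaN.
        if candidates:
            candidates.sort(key=lambda c: c[0])
            out[i, j] = np.mean([v for _, v in candidates[:n_neighbors]])
    return out


# Toy data: the second feature is ten times the first.
data = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],
    [9.0, 90.0],
])
mean_filled = mean_impute(data)
knn_filled = knn_impute(data, n_neighbors=2)
```

On this toy data, the column mean fills the gap with 40.0, which is pulled upward by the distant row, while the two nearest rows give 15.0. When features are correlated, as they are here, a multivariate method can use that relationship; a univariate method cannot.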

All the imputers in the dataprocessing package are based on the abstract class DataImputer, presented below.

class raimitigations.dataprocessing.DataImputer(df: Optional[Union[DataFrame, ndarray]] = None, col_impute: Optional[list] = None, verbose: bool = True)

Bases: DataProcessing

Base class for all imputer subclasses. Implements basic functionalities that can be used by all imputation approaches.

Parameters
  • df – pandas data frame that contains the columns to be imputed;

  • col_impute – a list of the column names or indexes that will be imputed. If None, this parameter will be set automatically to be the list of columns with at least one NaN value;

  • verbose – indicates whether internal messages should be printed or not.

fit(df: Optional[Union[DataFrame, ndarray]] = None, y: Optional[Union[Series, ndarray]] = None)

Default fit method for all imputation methods that inherit from the current class. The following steps are executed: (i) set the self.df_info attribute, (ii) set the list of columns to impute (or create a default one if needed), (iii) check if the dataset provided is valid (contains all columns that should be imputed), and (iv) call the concrete class’s specific _fit method.

Parameters
  • df – the full dataset;

  • y – ignored. This exists for compatibility with sklearn’s Pipeline class.
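The four steps of fit follow a template-method pattern, which can be sketched roughly as below. This is not the library's source code: SketchImputer and MeanSketchImputer are hypothetical classes, the contents of df_info are a placeholder, and returning self from fit is an assumption borrowed from the scikit-learn convention.

```python
from abc import ABC, abstractmethod

import numpy as np
import pandas as pd


class SketchImputer(ABC):
    """Hypothetical stand-in for the base class; not the library's code."""

    def __init__(self, col_impute=None):
        self.col_impute = col_impute

    def fit(self, df, y=None):  # y is accepted but ignored
        # (i) record information about the dataset (placeholder content)
        self.df_info = {"columns": list(df.columns)}
        # (ii) default to every column with at least one NaN value
        if self.col_impute is None:
            self.col_impute = [c for c in df.columns if df[c].isna().any()]
        # (iii) check that all columns to impute are present
        missing = [c for c in self.col_impute if c not in df.columns]
        if missing:
            raise ValueError(f"Columns not found in dataset: {missing}")
        # (iv) delegate to the concrete subclass
        self._fit(df)
        return self

    @abstractmethod
    def _fit(self, df): ...

    @abstractmethod
    def transform(self, df): ...


class MeanSketchImputer(SketchImputer):
    """Hypothetical subclass that fills NaNs with per-column means."""

    def _fit(self, df):
        self.fill_values_ = df[self.col_impute].mean()

    def transform(self, df):
        out = df.copy()
        out[self.col_impute] = out[self.col_impute].fillna(self.fill_values_)
        return out


df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})
imputer = MeanSketchImputer().fit(df)
result = imputer.transform(df)
```

In this sketch, a subclass only has to supply _fit and transform; the column selection and validation live once in the base class.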

transform(df: Union[DataFrame, ndarray])

Default behavior for transforming the data for the different imputation methods.

Parameters
  • df – the full dataset with the columns to be imputed.

Returns

the transformed dataset.

Return type

pd.DataFrame or np.ndarray

The following is a list of all imputers implemented in this module. All of the classes below inherit from the DataImputer class and thus have access to all of the methods shown above.

Class Diagram

Inheritance diagram of raimitigations.dataprocessing.BasicImputer, raimitigations.dataprocessing.IterativeDataImputer, raimitigations.dataprocessing.KNNDataImputer

Examples