KNNDataImputer
- class raimitigations.dataprocessing.KNNDataImputer(df: Optional[Union[DataFrame, ndarray]] = None, col_impute: Optional[list] = None, enable_encoder: bool = False, knn_params: Optional[dict] = None, sklearn_obj: Optional[object] = None, verbose: bool = True)
Bases: DataImputer
Concrete class that imputes missing data of a feature using K-nearest neighbors. A feature's missing values are imputed using the mean value from the k nearest neighbors in the dataset. Two samples are close if the features that neither is missing are close. This subclass uses the KNNImputer class from sklearn in the background. sklearn.impute.KNNImputer can only handle numerical data; this subclass, however, also accepts categorical input by applying ordinal encoding before calling the sklearn class. To enable this behavior, set enable_encoder=True. Note that encoded columns are not guaranteed to reverse transform if they have imputed values. If you'd like to use a different type of encoding before imputation, consider using the Pipeline class and calling your own encoder before calling this subclass for imputation. For more details see: https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html#
- Parameters
df – pandas data frame that contains the columns to be imputed;
col_impute – a list of the column names or indexes that will be imputed. If None, this parameter will be set automatically as being a list of all columns with any NaN value;
enable_encoder – a boolean flag to allow for applying ordinal encoding of categorical data before applying the KNNImputer since it only accepts numerical values.
knn_params – a dict indicating the parameters used by KNNImputer. The dict has the following structure:
    {
        'missing_values': np.nan,
        'n_neighbors': 5,
        'weights': 'uniform',
        'metric': 'nan_euclidean',
        'copy': True,
    }
where these are the parameters used by sklearn's KNNImputer. If None, this dict will be auto-filled as the one above. Note: 'weights' can take one of these values: ['uniform', 'distance'] or a callable, and 'metric' can take one of these values: ['nan_euclidean'] or a callable;
sklearn_obj – an sklearn.impute.KNNImputer object to use directly. If this parameter is used, knn_params will be overwritten.
verbose – indicates whether internal messages should be printed or not.
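To make the imputation rule described above concrete, here is a minimal standard-library sketch of mean imputation over the k nearest neighbors, using a NaN-aware Euclidean distance and uniform weights. It is not the code used by this class or by sklearn. The helper names nan_euclidean and knn_impute, the use of None for missing cells, and the sample matrix are all assumptions made for illustration.

```python
import math


def nan_euclidean(a, b):
    """Euclidean distance over the coordinates present in both rows,
    rescaled by (total coordinates / shared coordinates)."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return math.inf  # no shared features: treat the rows as infinitely far apart
    squared = sum((x - y) ** 2 for x, y in pairs)
    return math.sqrt(len(a) / len(pairs) * squared)


def knn_impute(rows, n_neighbors=2):
    """Fill each missing cell (None) with the mean of that column taken over
    the n_neighbors closest rows that have a value in that column."""
    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows where column j is observed.
            donors = [
                (nan_euclidean(row, other), other[j])
                for k, other in enumerate(rows)
                if k != i and other[j] is not None
            ]
            donors = [d for d in donors if d[0] != math.inf]
            donors.sort(key=lambda d: d[0])
            nearest = donors[:n_neighbors]
            if nearest:
                # Uniform weights: a plain average of the donor values.
                filled[i][j] = sum(v for _, v in nearest) / len(nearest)
    return filled


# Hypothetical sample data; None marks a missing value.
data = [
    [1.0, 2.0, None],
    [3.0, 4.0, 3.0],
    [None, 6.0, 5.0],
    [8.0, 8.0, 7.0],
]
imputed = knn_impute(data, n_neighbors=2)
print(imputed)
```

In this sketch, n_neighbors plays the same role as the 'n_neighbors' entry of knn_params, and the plain average corresponds to 'weights': 'uniform'. A 'distance' weighting would instead weight each donor by the inverse of its distance.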
Class Diagram