[1]:
import random
import pandas as pd
import numpy as np
import scipy.stats as stat
from sklearn.preprocessing import (
    MinMaxScaler,
    RobustScaler,
    QuantileTransformer
)

from raimitigations.utils import create_dummy_dataset
from raimitigations.dataprocessing import (
    BasicImputer,
    EncoderOHE,
    DataRobustScaler,
    DataPowerTransformer,
    DataNormalizer,
    DataStandardScaler,
    DataMinMaxScaler,
    DataQuantileTransformer
)

Data Scaling

In this notebook we show how to use the data transformation classes. Each class applies a different transformation to the numerical data, and all of them were designed so that several transformations can be applied through a simple, easy-to-use interface.

Let’s start with the basic scenarios, where we use only default values. But before jumping to the different classes, let’s first create a dummy dataset with numerical and categorical features. We also define an add_nan helper that can scatter NaN values along different columns, although we don’t apply it in this run, so the dataset has no missing values.

[2]:
# -----------------------------------
def add_nan(vec, pct):
    # Inject NaN values into a fraction (pct) of a column's entries.
    # Defined for convenience; it is not applied to the dataset in this run.
    vec = list(vec)
    nan_index = random.sample(range(len(vec)), int(pct * len(vec)))
    for index in nan_index:
        vec[index] = np.nan
    return vec

# -----------------------------------
def build_dummy_dataset():
df = create_dummy_dataset(
        samples=3000,
        n_features=6,
        n_num_num=2,
        n_cat_num=2,
        n_cat_cat=2,
        num_num_noise=[0.01, 0.05],
        pct_change=[0.05, 0.1]
    )
    label_col = "label"

    return df, label_col

df, label_col = build_dummy_dataset()
df
[2]:
num_0 num_1 num_2 num_3 num_4 num_5 label num_c0_num_0 num_c1_num_1 CN_0_num_0 CN_1_num_1 CC_0_num_0 CC_1_num_1
0 1.838682 -1.408271 2.758909 -1.558914 0.606338 2.828664 1 1.796571 -1.397257 val0_2 val1_2 val0_1 val1_1
1 3.249825 -3.925450 2.953185 3.540991 -2.340552 3.398367 1 3.257607 -3.925562 val0_1 val1_1 val0_1 val1_0
2 0.978148 0.330699 1.483723 3.198539 -4.134112 2.312435 1 1.041164 0.338390 val0_2 val1_3 val0_0 val1_1
3 -0.847425 1.008353 1.192225 3.496178 -1.120895 1.910650 0 -0.875685 1.020226 val0_1 val1_0 val0_0 val1_2
4 0.972314 -2.305080 2.136697 2.224677 -2.409424 2.504838 1 1.012618 -2.295010 val0_2 val1_0 val0_0 val1_1
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2995 0.740834 -1.171719 0.357037 1.863302 -2.869701 0.658475 1 0.776097 -1.167798 val0_2 val1_2 val0_0 val1_1
2996 1.865602 -1.226943 3.212207 0.961388 -0.925646 2.316213 1 1.848407 -1.231393 val0_2 val1_2 val0_1 val1_1
2997 -0.854474 -2.369469 0.457335 3.216465 -4.680820 3.322093 1 -0.876535 -2.366553 val0_1 val1_1 val0_0 val1_1
2998 0.892994 -2.115929 2.741942 -0.218843 -0.097098 1.893662 1 0.890398 -2.143039 val0_2 val1_1 val0_0 val1_1
2999 1.664427 -1.075205 3.487387 -0.182207 0.033374 2.417933 1 1.654759 -1.052976 val0_2 val1_2 val0_1 val1_1

3000 rows × 13 columns

[3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   num_0         3000 non-null   float64
 1   num_1         3000 non-null   float64
 2   num_2         3000 non-null   float64
 3   num_3         3000 non-null   float64
 4   num_4         3000 non-null   float64
 5   num_5         3000 non-null   float64
 6   label         3000 non-null   int64
 7   num_c0_num_0  3000 non-null   float64
 8   num_c1_num_1  3000 non-null   float64
 9   CN_0_num_0    3000 non-null   object
 10  CN_1_num_1    3000 non-null   object
 11  CC_0_num_0    3000 non-null   object
 12  CC_1_num_1    3000 non-null   object
dtypes: float64(8), int64(1), object(4)
memory usage: 304.8+ KB

Basic Use Case: Robust Scaler

As we can see, the dataset we are working with has several numerical and several categorical columns. When using data scalers or transformers directly from sklearn, we usually need to provide a dataset containing only the numerical columns. This requires the user to separate these columns from the original dataset and later merge the scaled columns back with the categorical ones. We encapsulate all of these steps inside the main scaler class, making it easier to use.

Let’s see an example of how to use the robust scaler through the DataRobustScaler class implemented in this library. For now, we will use only the default values, with the exception of the exclude_cols parameter. This parameter is a list of column names or column indices that should be ignored during the scaling process, that is, all columns that should not be scaled. By default, this parameter is set to the list of all column names associated with categorical columns. But since our label is numerical and we don’t want to transform the values in the label column, we will force the label column to be ignored as well (along with the categorical columns). This way, we create the object by passing exclude_cols=['label']. The categorical columns are added internally to the exclude_cols list if they are not already specified.

After creating the DataRobustScaler object, we must call the fit() and transform() methods, following the pattern established in this library (a pattern also followed by several well-established machine learning libraries). Here we fit and transform using the same dataset, but remember that we should call the fit method on the training dataset only and then call the transform method on both the training and test datasets, as sketched below.
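
For illustration, here is a minimal sketch of that train/test usage; the split itself is hypothetical and not part of this notebook:

from sklearn.model_selection import train_test_split

# Hypothetical split, for illustration only
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

scaler = DataRobustScaler(exclude_cols=['label'])
scaler.fit(train_df)                       # fit on the training data only
scaled_train = scaler.transform(train_df)  # transform the training data
scaled_test = scaler.transform(test_df)    # reuse the fitted scaler on the test data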

[4]:
scaler = DataRobustScaler(exclude_cols=['label'])
scaler.fit(df)
scaled_df = scaler.transform(df)
scaled_df
[4]:
num_0 num_1 num_2 num_3 num_4 num_5 label num_c0_num_0 num_c1_num_1 CN_0_num_0 CN_1_num_1 CC_0_num_0 CC_1_num_1
0 0.118733 0.237866 0.492773 -1.849241 1.363420 0.372973 1 0.105654 0.243515 val0_2 val1_2 val0_1 val1_1
1 0.601353 -0.795141 0.621367 0.807144 -0.197305 0.629804 1 0.604797 -0.796681 val0_1 val1_1 val0_1 val1_0
2 -0.175575 0.951508 -0.351287 0.628771 -1.147206 0.140249 1 -0.152420 0.957595 val0_2 val1_3 val0_0 val1_1
3 -0.799932 1.229606 -0.544233 0.783802 0.448647 -0.040882 0 -0.807285 1.238116 val0_1 val1_0 val0_0 val1_2
4 -0.177570 -0.130169 0.080924 0.121516 -0.233781 0.226987 1 -0.162172 -0.125839 val0_2 val1_0 val0_0 val1_1
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2995 -0.256738 0.334942 -1.097054 -0.066713 -0.477552 -0.605382 1 -0.242977 0.337919 val0_2 val1_2 val0_0 val1_1
2996 0.127940 0.312279 0.792817 -0.536493 0.552054 0.141952 1 0.123363 0.311755 val0_2 val1_2 val0_1 val1_1
2997 -0.802343 -0.156593 -1.030666 0.638108 -1.436752 0.595418 1 -0.807575 -0.155273 val0_1 val1_1 val0_0 val1_1
2998 -0.204698 -0.052545 0.481543 -1.151239 0.990868 -0.048541 1 -0.203927 -0.063315 val0_2 val1_1 val0_0 val1_1
2999 0.059137 0.374550 0.974962 -1.132156 1.059968 0.187809 1 0.057206 0.385159 val0_2 val1_2 val0_1 val1_1

3000 rows × 13 columns

[5]:
scaled_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   num_0         3000 non-null   float64
 1   num_1         3000 non-null   float64
 2   num_2         3000 non-null   float64
 3   num_3         3000 non-null   float64
 4   num_4         3000 non-null   float64
 5   num_5         3000 non-null   float64
 6   label         3000 non-null   int64
 7   num_c0_num_0  3000 non-null   float64
 8   num_c1_num_1  3000 non-null   float64
 9   CN_0_num_0    3000 non-null   object
 10  CN_1_num_1    3000 non-null   object
 11  CC_0_num_0    3000 non-null   object
 12  CC_1_num_1    3000 non-null   object
dtypes: float64(8), int64(1), object(4)
memory usage: 304.8+ KB

Inverse Transform

The scalers implemented in sklearn generally have an inverse transform method, which reverts the scaling performed over the numerical data, returning the dataset to its original state. We encapsulate this behavior in a method with the same name used by sklearn: inverse_transform(). Note that in our case, the inverse_transform method receives a dataset with scaled numerical columns as well as categorical columns, and returns a new dataframe in which only the scaled numerical columns are reverted (the label column is kept unchanged, since it was set to be ignored).

[6]:
org_df = scaler.inverse_transform(scaled_df)
org_df
[6]:
num_0 num_1 num_2 num_3 num_4 num_5 label num_c0_num_0 num_c1_num_1 CN_0_num_0 CN_1_num_1 CC_0_num_0 CC_1_num_1
0 1.838682 -1.408271 2.758909 -1.558914 0.606338 2.828664 1 1.796571 -1.397257 val0_2 val1_2 val0_1 val1_1
1 3.249825 -3.925450 2.953185 3.540991 -2.340552 3.398367 1 3.257607 -3.925562 val0_1 val1_1 val0_1 val1_0
2 0.978148 0.330699 1.483723 3.198539 -4.134112 2.312435 1 1.041164 0.338390 val0_2 val1_3 val0_0 val1_1
3 -0.847425 1.008353 1.192225 3.496178 -1.120895 1.910650 0 -0.875685 1.020226 val0_1 val1_0 val0_0 val1_2
4 0.972314 -2.305080 2.136697 2.224677 -2.409424 2.504838 1 1.012618 -2.295010 val0_2 val1_0 val0_0 val1_1
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2995 0.740834 -1.171719 0.357037 1.863302 -2.869701 0.658475 1 0.776097 -1.167798 val0_2 val1_2 val0_0 val1_1
2996 1.865602 -1.226943 3.212207 0.961388 -0.925646 2.316213 1 1.848407 -1.231393 val0_2 val1_2 val0_1 val1_1
2997 -0.854474 -2.369469 0.457335 3.216465 -4.680820 3.322093 1 -0.876535 -2.366553 val0_1 val1_1 val0_0 val1_1
2998 0.892994 -2.115929 2.741942 -0.218843 -0.097098 1.893662 1 0.890398 -2.143039 val0_2 val1_1 val0_0 val1_1
2999 1.664427 -1.075205 3.487387 -0.182207 0.033374 2.417933 1 1.654759 -1.052976 val0_2 val1_2 val0_1 val1_1

3000 rows × 13 columns

Power Transform
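
The DataPowerTransformer encapsulates sklearn’s PowerTransformer, which applies a power transformation (Yeo-Johnson by default in sklearn) to make the data more Gaussian-like. It follows the same interface as the previous scaler: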

[7]:
scaler = DataPowerTransformer(exclude_cols=['label'])
scaler.fit(df)
scaled_df = scaler.transform(df)
scaled_df
[7]:
num_0 num_1 num_2 num_3 num_4 num_5 label num_c0_num_0 num_c1_num_1 CN_0_num_0 CN_1_num_1 CC_0_num_0 CC_1_num_1
0 0.204668 0.340865 0.624318 -2.355844 1.892091 0.497892 1 0.184078 0.347054 val0_2 val1_2 val0_1 val1_1
1 0.938741 -1.046138 0.791909 1.060248 -0.260243 0.853418 1 0.943084 -1.046297 val0_1 val1_1 val0_1 val1_0
2 -0.211606 1.280236 -0.462634 0.821744 -1.544629 0.177042 1 -0.181683 1.284699 val0_2 val1_3 val0_0 val1_1
3 -0.969191 1.639213 -0.707276 1.028996 0.621462 -0.071696 0 -0.979450 1.645924 val0_1 val1_0 val0_0 val1_2
4 -0.214327 -0.150747 0.090894 0.147964 -0.309796 0.296469 1 -0.195061 -0.145133 val0_2 val1_0 val0_0 val1_1
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2995 -0.321017 0.469917 -1.397383 -0.100105 -0.640434 -0.839805 1 -0.304500 0.472252 val0_2 val1_2 val0_0 val1_1
2996 0.218123 0.439816 1.016051 -0.713380 0.763497 0.179385 1 0.209931 0.437581 val0_2 val1_2 val0_1 val1_1
2997 -0.971766 -0.186168 -1.315517 0.834209 -1.934027 0.805738 1 -0.979760 -0.184495 val0_1 val1_1 val0_0 val1_1
2998 -0.251167 -0.046782 0.609704 -1.497267 1.370180 -0.082193 1 -0.251934 -0.061582 val0_2 val1_1 val0_0 val1_1
2999 0.118160 0.522483 1.255006 -1.473387 1.466475 0.242502 1 0.113814 0.534794 val0_2 val1_2 val0_1 val1_1

3000 rows × 13 columns

Normalizer
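
The DataNormalizer encapsulates sklearn’s Normalizer, which rescales each sample (row) individually to unit norm: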

[8]:
scaler = DataNormalizer(exclude_cols=['label'])
scaler.fit(df)
scaled_df = scaler.transform(df)
scaled_df
No columns specified for imputation. These columns have been automatically identified:
[]
[8]:
num_0 num_1 num_2 num_3 num_4 num_5 label num_c0_num_0 num_c1_num_1 CN_0_num_0 CN_1_num_1 CC_0_num_0 CC_1_num_1
0 0.341701 -0.261714 0.512717 -0.289709 0.112682 0.525680 1.0 0.333876 -0.259667 val0_2 val1_2 val0_1 val1_1
1 0.342031 -0.413138 0.310811 0.372675 -0.246334 0.357664 1.0 0.342850 -0.413149 val0_1 val1_1 val0_1 val1_0
2 0.160514 0.054268 0.243479 0.524880 -0.678407 0.379470 1.0 0.170855 0.055530 val0_2 val1_3 val0_0 val1_1
3 -0.180286 0.214523 0.253640 0.743796 -0.238466 0.406482 0.0 -0.186298 0.217048 val0_1 val1_0 val0_0 val1_2
4 0.166395 -0.394475 0.365659 0.380715 -0.412332 0.428660 1.0 0.173292 -0.392752 val0_2 val1_0 val0_0 val1_1
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2995 0.184312 -0.291512 0.088827 0.463571 -0.713954 0.163822 1.0 0.193085 -0.290537 val0_2 val1_2 val0_0 val1_1
2996 0.356516 -0.234468 0.613851 0.183721 -0.176890 0.442627 1.0 0.353230 -0.235319 val0_2 val1_2 val0_1 val1_1
2997 -0.113966 -0.316029 0.060997 0.428997 -0.624306 0.443085 1.0 -0.116908 -0.315640 val0_1 val1_1 val0_0 val1_1
2998 0.191164 -0.452958 0.586970 -0.046848 -0.020786 0.405378 1.0 0.190608 -0.458762 val0_2 val1_1 val0_0 val1_1
2999 0.327582 -0.211615 0.686367 -0.035861 0.006568 0.475883 1.0 0.325680 -0.207240 val0_2 val1_2 val0_1 val1_1

3000 rows × 13 columns

Standard Scaler
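
The DataStandardScaler encapsulates sklearn’s StandardScaler, which standardizes each column by removing its mean and scaling it to unit variance: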

[9]:
scaler = DataStandardScaler(exclude_cols=['label'])
scaler.fit(df)
scaled_df = scaler.transform(df)
scaled_df
[9]:
num_0 num_1 num_2 num_3 num_4 num_5 label num_c0_num_0 num_c1_num_1 CN_0_num_0 CN_1_num_1 CC_0_num_0 CC_1_num_1
0 0.279543 0.332628 0.628637 -2.422861 1.875546 0.502397 1 0.260887 0.338829 val0_2 val1_2 val0_1 val1_1
1 0.908693 -1.043038 0.793104 1.055566 -0.255713 0.853082 1 0.912494 -1.043198 val0_1 val1_1 val0_1 val1_0
2 -0.104122 1.282994 -0.450880 0.821995 -1.552858 0.184628 1 -0.076018 1.287573 val0_2 val1_3 val0_0 val1_1
3 -0.918043 1.653340 -0.697650 1.025001 0.626371 -0.062694 0 -0.930914 1.660279 val0_1 val1_0 val0_0 val1_2
4 -0.106723 -0.157488 0.101900 0.157765 -0.305523 0.303063 1 -0.088749 -0.151902 val0_2 val1_0 val0_0 val1_1
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2995 -0.209927 0.461906 -1.404684 -0.088714 -0.638406 -0.833480 1 -0.194235 0.464257 val0_2 val1_2 val0_0 val1_1
2996 0.291545 0.431726 1.012380 -0.703871 0.767580 0.186954 1 0.284005 0.429494 val0_2 val1_2 val0_1 val1_1
2997 -0.921186 -0.192677 -1.319776 0.834221 -1.948250 0.806131 1 -0.931293 -0.191009 val0_1 val1_1 val0_0 val1_1
2998 -0.142087 -0.054115 0.614274 -1.508856 1.366805 -0.073151 1 -0.143258 -0.068831 val0_2 val1_1 val0_0 val1_1
2999 0.201852 0.514652 1.245336 -1.483868 1.461165 0.249568 1 0.197640 0.527021 val0_2 val1_2 val0_1 val1_1

3000 rows × 13 columns

[10]:
scaled_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   num_0         3000 non-null   float64
 1   num_1         3000 non-null   float64
 2   num_2         3000 non-null   float64
 3   num_3         3000 non-null   float64
 4   num_4         3000 non-null   float64
 5   num_5         3000 non-null   float64
 6   label         3000 non-null   int64
 7   num_c0_num_0  3000 non-null   float64
 8   num_c1_num_1  3000 non-null   float64
 9   CN_0_num_0    3000 non-null   object
 10  CN_1_num_1    3000 non-null   object
 11  CC_0_num_0    3000 non-null   object
 12  CC_1_num_1    3000 non-null   object
dtypes: float64(8), int64(1), object(4)
memory usage: 304.8+ KB

Min Max Scaler
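
The DataMinMaxScaler encapsulates sklearn’s MinMaxScaler, which rescales each column to a fixed range ([0, 1] by default):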

[11]:
scaler = DataMinMaxScaler(exclude_cols=['label'])
scaler.fit(df)
scaled_df = scaler.transform(df)
scaled_df
[11]:
num_0 num_1 num_2 num_3 num_4 num_5 label num_c0_num_0 num_c1_num_1 CN_0_num_0 CN_1_num_1 CC_0_num_0 CC_1_num_1
0 0.538771 0.495719 0.588105 0.161065 0.773957 0.542666 1 0.537741 0.496233 val0_2 val1_2 val0_1 val1_1
1 0.634397 0.317108 0.608658 0.621755 0.492937 0.590060 1 0.637324 0.317234 val0_1 val1_1 val0_1 val1_0
2 0.480457 0.619110 0.453200 0.590821 0.321901 0.499721 1 0.486254 0.619113 val0_2 val1_3 val0_0 val1_1
3 0.356747 0.667194 0.422362 0.617707 0.609246 0.466296 0 0.355604 0.667385 val0_1 val1_0 val0_0 val1_2
4 0.480061 0.432084 0.522280 0.502849 0.486370 0.515727 1 0.484308 0.432674 val0_2 val1_0 val0_0 val1_1
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2995 0.464375 0.512504 0.334005 0.470205 0.442477 0.362127 1 0.468187 0.512478 val0_2 val1_2 val0_0 val1_1
2996 0.540595 0.508585 0.636061 0.388732 0.627865 0.500035 1 0.541275 0.507975 val0_2 val1_2 val0_1 val1_1
2997 0.356269 0.427515 0.344616 0.592440 0.269766 0.583715 1 0.355546 0.427609 val0_1 val1_1 val0_0 val1_1
2998 0.474686 0.445506 0.586310 0.282118 0.706876 0.464883 1 0.475978 0.443433 val0_2 val1_1 val0_0 val1_1
2999 0.526962 0.519352 0.665173 0.285427 0.719318 0.508497 1 0.528076 0.520607 val0_2 val1_2 val0_1 val1_1

3000 rows × 13 columns

Quantile Transform
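
The DataQuantileTransformer encapsulates sklearn’s QuantileTransformer, which uses quantile information to map each column to a uniform distribution (by default):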

[12]:
scaler = DataQuantileTransformer(exclude_cols=['label'])
scaler.fit(df)
scaled_df = scaler.transform(df)
scaled_df
[12]:
num_0 num_1 num_2 num_3 num_4 num_5 label num_c0_num_0 num_c1_num_1 CN_0_num_0 CN_1_num_1 CC_0_num_0 CC_1_num_1
0 0.563929 0.629615 0.748641 0.010215 0.970001 0.687788 1 0.554528 0.632123 val0_2 val1_2 val0_1 val1_1
1 0.823907 0.149258 0.792533 0.857954 0.400110 0.800766 1 0.825264 0.148166 val0_1 val1_1 val0_1 val1_0
2 0.406645 0.906626 0.326266 0.799757 0.057018 0.564204 1 0.412858 0.907607 val0_2 val1_3 val0_0 val1_1
3 0.182652 0.951966 0.232610 0.852475 0.735349 0.478572 0 0.180338 0.952968 val0_1 val1_0 val0_0 val1_2
4 0.405598 0.430716 0.542525 0.566506 0.379286 0.606670 1 0.409656 0.434478 val0_2 val1_0 val0_0 val1_1
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2995 0.364130 0.678050 0.075035 0.462473 0.266227 0.194694 1 0.372719 0.678672 val0_2 val1_2 val0_0 val1_1
2996 0.568323 0.666334 0.857290 0.229171 0.778167 0.564785 1 0.564750 0.665816 val0_2 val1_2 val0_1 val1_1
2997 0.182363 0.415749 0.088817 0.804872 0.024023 0.786688 1 0.180254 0.417115 val0_1 val1_1 val0_0 val1_1
2998 0.391255 0.475301 0.744740 0.065804 0.913498 0.473249 1 0.392185 0.470084 val0_2 val1_1 val0_0 val1_1
2999 0.531529 0.698122 0.897845 0.069955 0.927006 0.584906 1 0.529527 0.702879 val0_2 val1_2 val0_1 val1_1

3000 rows × 13 columns

Using Multiple Scalers: Fit and Transform

In case the user wants to apply multiple scalers in succession, they can use the transform_pipe parameter. This parameter is used by many other classes in this library, and it allows users to provide a list of other objects from this library (each implementing fit and transform methods) to be used internally. To do this, instantiate the different scaler classes implemented in the dataprocessing module, create a list of these objects, and provide this list through the transform_pipe parameter. In the example below, we create a list with the following scalers: DataRobustScaler, DataMinMaxScaler, and DataQuantileTransformer. These scalers are passed as a parameter when creating an object of the DataRobustScaler class. Note, however, that we also pass an object of the BasicImputer class in this list. The order of the list is important: it dictates the order in which the fit() and transform() methods are called. When the fit method is called, we call the fit method of all transforms in the transform_pipe before calling the fit method of the main scaler (DataRobustScaler in this case). Therefore, we first call fit and transform for the BasicImputer; the resulting data frame is then used to call the fit and transform methods of the first DataRobustScaler in the pipe, and so on. The same process is used for the transform method.

Note that this time, instead of using the default values for each scaler, we also provide a specific sklearn scaler to each of the scalers created. For instance, when creating the first DataRobustScaler object, we specify that instead of internally creating a default RobustScaler (from sklearn), we want to use the specific RobustScaler instance that we provide through the scaler_obj parameter. This allows users to customize the sklearn instances used inside each scaler class.

[13]:
imputer = BasicImputer()
s1 = DataRobustScaler(
                    scaler_obj=RobustScaler(),
                    exclude_cols=['num_3', 'num_4', 'num_5', 'label', 'CN_0_num_0', 'CN_1_num_1', 'CC_0_num_0', 'CC_1_num_1']
                    )
s2 = DataMinMaxScaler(
                    scaler_obj=MinMaxScaler(),
                    exclude_cols=['num_0', 'num_1', 'num_2', 'num_4', 'num_5', 'label', 'CN_0_num_0', 'CN_1_num_1', 'CC_0_num_0', 'CC_1_num_1']
                    )
s3 = DataQuantileTransformer(
                    scaler_obj=QuantileTransformer(),
                    exclude_cols=['num_0', 'num_1', 'num_2', 'num_3', 'num_5', 'label', 'CN_0_num_0', 'CN_1_num_1', 'CC_0_num_0', 'CC_1_num_1']
                    )
scaler = DataRobustScaler(
                    scaler_obj=RobustScaler(),
                    exclude_cols=['num_0', 'num_1', 'num_2', 'num_3', 'num_4', 'label', 'CN_0_num_0', 'CN_1_num_1', 'CC_0_num_0', 'CC_1_num_1'],
                    transform_pipe=[imputer, s1, s2, s3]
                    )


scaler.fit(df)
scaled_df = scaler.transform(df)
scaled_df
No columns specified for imputation. These columns have been automatically identified:
[]
[13]:
num_0 num_1 num_2 num_3 num_4 num_5 label num_c0_num_0 num_c1_num_1 CN_0_num_0 CN_1_num_1 CC_0_num_0 CC_1_num_1
0 0.118733 0.237866 0.492773 0.161065 0.970001 0.372973 1.0 0.108853 0.264084 val0_2 val1_2 val0_1 val1_1
1 0.601353 -0.795141 0.621367 0.621755 0.400110 0.629804 1.0 0.650448 -0.703694 val0_1 val1_1 val0_1 val1_0
2 -0.175575 0.951508 -0.351287 0.590821 0.057018 0.140249 1.0 -0.174553 0.814974 val0_2 val1_3 val0_0 val1_1
3 -0.799932 1.229606 -0.544233 0.617707 0.735349 -0.040882 0.0 -0.639699 0.905684 val0_1 val1_0 val0_0 val1_2
4 -0.177570 -0.130169 0.080924 0.502849 0.379286 0.226987 1.0 -0.180957 -0.131151 val0_2 val1_0 val0_0 val1_1
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2995 -0.256738 0.334942 -1.097054 0.470205 0.266227 -0.605382 1.0 -0.254850 0.357168 val0_2 val1_2 val0_0 val1_1
2996 0.127940 0.312279 0.792817 0.388732 0.778167 0.141952 1.0 0.129300 0.331460 val0_2 val1_2 val0_1 val1_1
2997 -0.802343 -0.156593 -1.030666 0.592440 0.024023 0.595418 1.0 -0.639868 -0.165871 val0_1 val1_1 val0_0 val1_1
2998 -0.204698 -0.052545 0.481543 0.282118 0.913498 -0.048541 1.0 -0.215909 -0.059949 val0_2 val1_1 val0_0 val1_1
2999 0.059137 0.374550 0.974962 0.285427 0.927006 0.187809 1.0 0.058840 0.405575 val0_2 val1_2 val0_1 val1_1

3000 rows × 13 columns

Using Multiple Scalers: Inverse Transform

When calling inverse_transform while using the transform_pipe parameter, the logic used for the fit and transform methods is reversed: we call the inverse_transform of each scaler in the reverse order in which they were called during the transform method. In this case, the order of the inverse_transform calls is the following: the main DataRobustScaler, DataQuantileTransformer, DataMinMaxScaler, and finally the DataRobustScaler from the transform_pipe (the BasicImputer, being a non-scaler transform, is not inverted, as explained next). By doing this in the reverse order, we can recover the original dataset, as shown below.

Note, however, that the inverse_transform() method is only called for the scalers that appear after the last transform object that doesn’t inherit from the DataScaler() class. For example, if we create a DataRobustScaler() object using transform_pipe = [BasicImputer(), DataQuantileTransformer(), EncoderOHE(), DataMinMaxScaler()], then when calling the fit or transform method of the base scaler (DataRobustScaler), we execute the fit or transform of the objects in the transform_pipe in the order they appear. But when the inverse_transform() of the base scaler is called, the inverse transforms are called in the following order: DataRobustScaler, followed by DataMinMaxScaler. The inverse transform of the DataQuantileTransformer isn’t called because it appears between two non-scaler transforms (BasicImputer and EncoderOHE). Since the EncoderOHE doesn’t have an inverse transform, we stop the chain of inverse transformations when we reach the first non-scaler transform object moving from the last object to the first, which in this case is the EncoderOHE.
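
The sketch below shows how this hypothetical pipeline could be built; it is illustrative only and is not executed in this notebook:

# Hypothetical pipeline illustrating the inverse_transform chain (not executed).
# fit/transform order: BasicImputer -> DataQuantileTransformer -> EncoderOHE
# -> DataMinMaxScaler -> DataRobustScaler (the base scaler).
# inverse_transform order: DataRobustScaler -> DataMinMaxScaler, stopping at the
# EncoderOHE, so the DataQuantileTransformer is never inverted.
base_scaler = DataRobustScaler(
    exclude_cols=['label'],
    transform_pipe=[
        BasicImputer(),
        DataQuantileTransformer(exclude_cols=['label']),
        EncoderOHE(),
        DataMinMaxScaler(exclude_cols=['label']),
    ]
)
base_scaler.fit(df)
partially_inverted_df = base_scaler.inverse_transform(base_scaler.transform(df))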

[14]:
org_df = scaler.inverse_transform(scaled_df)
org_df
[14]:
num_0 num_1 num_2 num_3 num_4 num_5 label num_c0_num_0 num_c1_num_1 CN_0_num_0 CN_1_num_1 CC_0_num_0 CC_1_num_1
0 1.838682 -1.408271 2.758909 -1.558914 0.606338 2.828664 1.0 1.796571 -1.397257 val0_2 val1_2 val0_1 val1_1
1 3.249825 -3.925450 2.953185 3.540991 -2.340552 3.398367 1.0 3.257607 -3.925562 val0_1 val1_1 val0_1 val1_0
2 0.978148 0.330699 1.483723 3.198539 -4.134112 2.312435 1.0 1.041164 0.338390 val0_2 val1_3 val0_0 val1_1
3 -0.847425 1.008353 1.192225 3.496178 -1.120895 1.910650 0.0 -0.875685 1.020226 val0_1 val1_0 val0_0 val1_2
4 0.972314 -2.305080 2.136697 2.224677 -2.409424 2.504838 1.0 1.012618 -2.295010 val0_2 val1_0 val0_0 val1_1
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2995 0.740834 -1.171719 0.357037 1.863302 -2.869701 0.658475 1.0 0.776097 -1.167798 val0_2 val1_2 val0_0 val1_1
2996 1.865602 -1.226943 3.212207 0.961388 -0.925646 2.316213 1.0 1.848407 -1.231393 val0_2 val1_2 val0_1 val1_1
2997 -0.854474 -2.369469 0.457335 3.216465 -4.680820 3.322093 1.0 -0.876535 -2.366553 val0_1 val1_1 val0_0 val1_1
2998 0.892994 -2.115929 2.741942 -0.218843 -0.097098 1.893662 1.0 0.890398 -2.143039 val0_2 val1_1 val0_0 val1_1
2999 1.664427 -1.075205 3.487387 -0.182207 0.033374 2.417933 1.0 1.654759 -1.052976 val0_2 val1_2 val0_1 val1_1

3000 rows × 13 columns

Using the include_cols instead of the exclude_cols parameter

The exclude_cols parameter is useful when we want to leave only a few features out of the scaling process. However, in some cases (such as the one presented above), we might want to create scalers that are applied to only a few columns. Using exclude_cols can become tedious and error-prone in these situations, since we need to pass a long list of columns that should be ignored. In these scenarios, we can instead use the include_cols parameter, which works as the inverse of exclude_cols: the data transformations are applied only to the columns listed in include_cols. Below we replicate the experiment presented two cells above, but this time using the include_cols parameter instead of exclude_cols:

[15]:
imputer = BasicImputer()
encoder = EncoderOHE()
s1 = DataRobustScaler(
                    scaler_obj=RobustScaler(),
                    include_cols=['num_0', 'num_1', 'num_2']
                    )
s2 = DataMinMaxScaler(
                    scaler_obj=MinMaxScaler(),
                    include_cols=['num_3']
                    )
s3 = DataQuantileTransformer(
                    scaler_obj=QuantileTransformer(),
                    include_cols=['num_4']
                    )
scaler = DataRobustScaler(
                    scaler_obj=RobustScaler(),
                    include_cols=['num_5'],
                    transform_pipe=[imputer, encoder, s1, s2, s3]
                    )


scaler.fit(df)
scaled_df = scaler.transform(df)
scaled_df
No columns specified for imputation. These columns have been automatically identified:
[]
No columns specified for encoding. These columns have been automatically identfied as the following:
['CN_0_num_0', 'CN_1_num_1', 'CC_0_num_0', 'CC_1_num_1']
[15]:
num_0 num_1 num_2 num_3 num_4 num_5 label num_c0_num_0 num_c1_num_1 CN_0_num_0_val0_1 CN_0_num_0_val0_2 CN_0_num_0_val0_3 CN_0_num_0_val0_4 CN_1_num_1_val1_1 CN_1_num_1_val1_2 CN_1_num_1_val1_3 CC_0_num_0_val0_1 CC_1_num_1_val1_1 CC_1_num_1_val1_2
0 0.118733 0.237866 0.492773 0.161065 0.970001 0.372973 1.0 1.796571 -1.397257 0 1 0 0 0 1 0 1 1 0
1 0.601353 -0.795141 0.621367 0.621755 0.400110 0.629804 1.0 3.257607 -3.925562 1 0 0 0 1 0 0 1 0 0
2 -0.175575 0.951508 -0.351287 0.590821 0.057018 0.140249 1.0 1.041164 0.338390 0 1 0 0 0 0 1 0 1 0
3 -0.799932 1.229606 -0.544233 0.617707 0.735349 -0.040882 0.0 -0.875685 1.020226 1 0 0 0 0 0 0 0 0 1
4 -0.177570 -0.130169 0.080924 0.502849 0.379286 0.226987 1.0 1.012618 -2.295010 0 1 0 0 0 0 0 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2995 -0.256738 0.334942 -1.097054 0.470205 0.266227 -0.605382 1.0 0.776097 -1.167798 0 1 0 0 0 1 0 0 1 0
2996 0.127940 0.312279 0.792817 0.388732 0.778167 0.141952 1.0 1.848407 -1.231393 0 1 0 0 0 1 0 1 1 0
2997 -0.802343 -0.156593 -1.030666 0.592440 0.024023 0.595418 1.0 -0.876535 -2.366553 1 0 0 0 1 0 0 0 1 0
2998 -0.204698 -0.052545 0.481543 0.282118 0.913498 -0.048541 1.0 0.890398 -2.143039 0 1 0 0 1 0 0 0 1 0
2999 0.059137 0.374550 0.974962 0.285427 0.927006 0.187809 1.0 1.654759 -1.052976 0 1 0 0 0 1 0 1 1 0

3000 rows × 19 columns

[16]:
org_df = scaler.inverse_transform(scaled_df)
org_df
[16]:
CN_0_num_0 CN_1_num_1 CC_0_num_0 CC_1_num_1 num_0 num_1 num_2 num_3 num_4 num_5 label num_c0_num_0 num_c1_num_1
0 val0_2 val1_2 val0_1 val1_1 1.838682 -1.408271 2.758909 -1.558914 0.606338 2.828664 1.0 1.796571 -1.397257
1 val0_1 val1_1 val0_1 val1_0 3.249825 -3.925450 2.953185 3.540991 -2.340552 3.398367 1.0 3.257607 -3.925562
2 val0_2 val1_3 val0_0 val1_1 0.978148 0.330699 1.483723 3.198539 -4.134112 2.312435 1.0 1.041164 0.338390
3 val0_1 val1_0 val0_0 val1_2 -0.847425 1.008353 1.192225 3.496178 -1.120895 1.910650 0.0 -0.875685 1.020226
4 val0_2 val1_0 val0_0 val1_1 0.972314 -2.305080 2.136697 2.224677 -2.409424 2.504838 1.0 1.012618 -2.295010
... ... ... ... ... ... ... ... ... ... ... ... ... ...
2995 val0_2 val1_2 val0_0 val1_1 0.740834 -1.171719 0.357037 1.863302 -2.869701 0.658475 1.0 0.776097 -1.167798
2996 val0_2 val1_2 val0_1 val1_1 1.865602 -1.226943 3.212207 0.961388 -0.925646 2.316213 1.0 1.848407 -1.231393
2997 val0_1 val1_1 val0_0 val1_1 -0.854474 -2.369469 0.457335 3.216465 -4.680820 3.322093 1.0 -0.876535 -2.366553
2998 val0_2 val1_1 val0_0 val1_1 0.892994 -2.115929 2.741942 -0.218843 -0.097098 1.893662 1.0 0.890398 -2.143039
2999 val0_2 val1_2 val0_1 val1_1 1.664427 -1.075205 3.487387 -0.182207 0.033374 2.417933 1.0 1.654759 -1.052976

3000 rows × 13 columns

Other non-scaler Transforms

As a final note, the scaler classes presented here also accept a list of other transformations that are not scalers, such as DataImputer, EncoderOHE, and CatBoostSelection, among other classes from the dataprocessing module. Simply create a list with these objects and pass it to the transform_pipe parameter. These transforms will be applied before the numerical columns are scaled, as sketched below.
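
For instance, a minimal sketch of this usage (reusing the dataset from above) could look like the following; the choice of DataStandardScaler here is arbitrary:

# Sketch: non-scaler transforms placed in the transform_pipe are fitted and
# applied before the scaling step of the main scaler.
scaler = DataStandardScaler(
    exclude_cols=['label'],
    transform_pipe=[BasicImputer(), EncoderOHE()]
)
scaler.fit(df)
scaled_df = scaler.transform(df)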