[1]:
import random
import pandas as pd
import numpy as np
import scipy.stats as stat
from sklearn.preprocessing import (
    MinMaxScaler,
    RobustScaler,
    QuantileTransformer
)
from raimitigations.utils import create_dummy_dataset
from raimitigations.dataprocessing import (
    BasicImputer,
    EncoderOHE,
    DataRobustScaler,
    DataPowerTransformer,
    DataNormalizer,
    DataStandardScaler,
    DataMinMaxScaler,
    DataQuantileTransformer
)
Data Scaling
In this notebook, we show how to use the data transformation classes. Each class applies a different transformation to the numerical data, and they were designed so that several transformations can be applied through a simple, easy-to-use interface.
Let’s start with the basic scenarios where we use only default values. But before jumping to the different classes, let’s first create a dummy dataset with numerical and categorical features. We also define an add_nan helper that can scatter NaN values across different columns, although it is not applied in this run.
[2]:
# -----------------------------------
def add_nan(vec, pct):
    """Replace a fraction pct of the values in vec with np.nan."""
    vec = list(vec)
    nan_index = random.sample(range(len(vec)), int(pct * len(vec)))
    for index in nan_index:
        vec[index] = np.nan
    return vec
# -----------------------------------
def build_dummy_dataset():
    df = create_dummy_dataset(
        samples=3000,
        n_features=6,
        n_num_num=2,
        n_cat_num=2,
        n_cat_cat=2,
        num_num_noise=[0.01, 0.05],
        pct_change=[0.05, 0.1]
    )
    label_col = "label"
    return df, label_col

df, label_col = build_dummy_dataset()
df
[2]:
num_0 | num_1 | num_2 | num_3 | num_4 | num_5 | label | num_c0_num_0 | num_c1_num_1 | CN_0_num_0 | CN_1_num_1 | CC_0_num_0 | CC_1_num_1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.838682 | -1.408271 | 2.758909 | -1.558914 | 0.606338 | 2.828664 | 1 | 1.796571 | -1.397257 | val0_2 | val1_2 | val0_1 | val1_1 |
1 | 3.249825 | -3.925450 | 2.953185 | 3.540991 | -2.340552 | 3.398367 | 1 | 3.257607 | -3.925562 | val0_1 | val1_1 | val0_1 | val1_0 |
2 | 0.978148 | 0.330699 | 1.483723 | 3.198539 | -4.134112 | 2.312435 | 1 | 1.041164 | 0.338390 | val0_2 | val1_3 | val0_0 | val1_1 |
3 | -0.847425 | 1.008353 | 1.192225 | 3.496178 | -1.120895 | 1.910650 | 0 | -0.875685 | 1.020226 | val0_1 | val1_0 | val0_0 | val1_2 |
4 | 0.972314 | -2.305080 | 2.136697 | 2.224677 | -2.409424 | 2.504838 | 1 | 1.012618 | -2.295010 | val0_2 | val1_0 | val0_0 | val1_1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2995 | 0.740834 | -1.171719 | 0.357037 | 1.863302 | -2.869701 | 0.658475 | 1 | 0.776097 | -1.167798 | val0_2 | val1_2 | val0_0 | val1_1 |
2996 | 1.865602 | -1.226943 | 3.212207 | 0.961388 | -0.925646 | 2.316213 | 1 | 1.848407 | -1.231393 | val0_2 | val1_2 | val0_1 | val1_1 |
2997 | -0.854474 | -2.369469 | 0.457335 | 3.216465 | -4.680820 | 3.322093 | 1 | -0.876535 | -2.366553 | val0_1 | val1_1 | val0_0 | val1_1 |
2998 | 0.892994 | -2.115929 | 2.741942 | -0.218843 | -0.097098 | 1.893662 | 1 | 0.890398 | -2.143039 | val0_2 | val1_1 | val0_0 | val1_1 |
2999 | 1.664427 | -1.075205 | 3.487387 | -0.182207 | 0.033374 | 2.417933 | 1 | 1.654759 | -1.052976 | val0_2 | val1_2 | val0_1 | val1_1 |
3000 rows × 13 columns
[3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 num_0 3000 non-null float64
1 num_1 3000 non-null float64
2 num_2 3000 non-null float64
3 num_3 3000 non-null float64
4 num_4 3000 non-null float64
5 num_5 3000 non-null float64
6 label 3000 non-null int64
7 num_c0_num_0 3000 non-null float64
8 num_c1_num_1 3000 non-null float64
9 CN_0_num_0 3000 non-null object
10 CN_1_num_1 3000 non-null object
11 CC_0_num_0 3000 non-null object
12 CC_1_num_1 3000 non-null object
dtypes: float64(8), int64(1), object(4)
memory usage: 304.8+ KB
Basic Use Case: Robust Scaler
As we can see, the dataset we are working with has several numerical and several categorical columns. When using data scalers or transformers directly from scikit-learn, we usually need to provide a dataset containing only the numerical columns. This requires the user to separate these columns from the original dataset and then merge the scaled columns back with the categorical columns afterwards. We encapsulate all of these steps inside the scaler classes, making them easier to use.
Let’s see an example of how to use the Robust scaler through the DataRobustScaler class implemented in this library. For now, we will use only the default values, with the exception of the exclude_cols parameter. This parameter is a list of column names or column indices that should be ignored during the scaling process, that is, all columns that should not be scaled. By default, this parameter is set to the list of all column names associated with categorical columns. But since our label is numerical and we don’t want to transform the values in the label column, we will force the label column to be ignored as well (along with the categorical columns). This way, we create the object by passing exclude_cols=['label']. The categorical columns are added internally to the exclude_cols list if they are not already specified.
After creating the DataRobustScaler object, we must call the fit() and transform() methods, following the pattern established in this library (and followed by several well-established machine learning libraries). Here we will fit and transform the data using the same dataset, but remember that, in practice, fit() should be called on the training dataset only, and transform() should then be called on both the training and test datasets.
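For reference, the usual train/test pattern would look roughly like the sketch below (not executed in this notebook; the train_test_split call and the train_df/test_df names are only for illustration):

from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

split_scaler = DataRobustScaler(exclude_cols=['label'])
split_scaler.fit(train_df)                           # fit only on the training data
scaled_train_df = split_scaler.transform(train_df)   # transform both splits
scaled_test_df = split_scaler.transform(test_df)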
[4]:
scaler = DataRobustScaler(exclude_cols=['label'])
scaler.fit(df)
scaled_df = scaler.transform(df)
scaled_df
[4]:
num_0 | num_1 | num_2 | num_3 | num_4 | num_5 | label | num_c0_num_0 | num_c1_num_1 | CN_0_num_0 | CN_1_num_1 | CC_0_num_0 | CC_1_num_1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.118733 | 0.237866 | 0.492773 | -1.849241 | 1.363420 | 0.372973 | 1 | 0.105654 | 0.243515 | val0_2 | val1_2 | val0_1 | val1_1 |
1 | 0.601353 | -0.795141 | 0.621367 | 0.807144 | -0.197305 | 0.629804 | 1 | 0.604797 | -0.796681 | val0_1 | val1_1 | val0_1 | val1_0 |
2 | -0.175575 | 0.951508 | -0.351287 | 0.628771 | -1.147206 | 0.140249 | 1 | -0.152420 | 0.957595 | val0_2 | val1_3 | val0_0 | val1_1 |
3 | -0.799932 | 1.229606 | -0.544233 | 0.783802 | 0.448647 | -0.040882 | 0 | -0.807285 | 1.238116 | val0_1 | val1_0 | val0_0 | val1_2 |
4 | -0.177570 | -0.130169 | 0.080924 | 0.121516 | -0.233781 | 0.226987 | 1 | -0.162172 | -0.125839 | val0_2 | val1_0 | val0_0 | val1_1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2995 | -0.256738 | 0.334942 | -1.097054 | -0.066713 | -0.477552 | -0.605382 | 1 | -0.242977 | 0.337919 | val0_2 | val1_2 | val0_0 | val1_1 |
2996 | 0.127940 | 0.312279 | 0.792817 | -0.536493 | 0.552054 | 0.141952 | 1 | 0.123363 | 0.311755 | val0_2 | val1_2 | val0_1 | val1_1 |
2997 | -0.802343 | -0.156593 | -1.030666 | 0.638108 | -1.436752 | 0.595418 | 1 | -0.807575 | -0.155273 | val0_1 | val1_1 | val0_0 | val1_1 |
2998 | -0.204698 | -0.052545 | 0.481543 | -1.151239 | 0.990868 | -0.048541 | 1 | -0.203927 | -0.063315 | val0_2 | val1_1 | val0_0 | val1_1 |
2999 | 0.059137 | 0.374550 | 0.974962 | -1.132156 | 1.059968 | 0.187809 | 1 | 0.057206 | 0.385159 | val0_2 | val1_2 | val0_1 | val1_1 |
3000 rows × 13 columns
[5]:
scaled_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 num_0 3000 non-null float64
1 num_1 3000 non-null float64
2 num_2 3000 non-null float64
3 num_3 3000 non-null float64
4 num_4 3000 non-null float64
5 num_5 3000 non-null float64
6 label 3000 non-null int64
7 num_c0_num_0 3000 non-null float64
8 num_c1_num_1 3000 non-null float64
9 CN_0_num_0 3000 non-null object
10 CN_1_num_1 3000 non-null object
11 CC_0_num_0 3000 non-null object
12 CC_1_num_1 3000 non-null object
dtypes: float64(8), int64(1), object(4)
memory usage: 304.8+ KB
Inverse Transform
The scalers implemented in scikit-learn all have an inverse transform method, which reverts the scaling performed on the numerical data, returning the dataset to its original state. We encapsulate this behavior in a method with the same name used by scikit-learn: inverse_transform(). Note that in our case, the inverse_transform() method receives a dataset containing the scaled numerical columns as well as the categorical columns, and returns a new dataframe in which the scaling of the numerical columns has been reversed (the label column is kept unchanged, since it was set to be ignored).
[6]:
org_df = scaler.inverse_transform(scaled_df)
org_df
[6]:
num_0 | num_1 | num_2 | num_3 | num_4 | num_5 | label | num_c0_num_0 | num_c1_num_1 | CN_0_num_0 | CN_1_num_1 | CC_0_num_0 | CC_1_num_1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.838682 | -1.408271 | 2.758909 | -1.558914 | 0.606338 | 2.828664 | 1 | 1.796571 | -1.397257 | val0_2 | val1_2 | val0_1 | val1_1 |
1 | 3.249825 | -3.925450 | 2.953185 | 3.540991 | -2.340552 | 3.398367 | 1 | 3.257607 | -3.925562 | val0_1 | val1_1 | val0_1 | val1_0 |
2 | 0.978148 | 0.330699 | 1.483723 | 3.198539 | -4.134112 | 2.312435 | 1 | 1.041164 | 0.338390 | val0_2 | val1_3 | val0_0 | val1_1 |
3 | -0.847425 | 1.008353 | 1.192225 | 3.496178 | -1.120895 | 1.910650 | 0 | -0.875685 | 1.020226 | val0_1 | val1_0 | val0_0 | val1_2 |
4 | 0.972314 | -2.305080 | 2.136697 | 2.224677 | -2.409424 | 2.504838 | 1 | 1.012618 | -2.295010 | val0_2 | val1_0 | val0_0 | val1_1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2995 | 0.740834 | -1.171719 | 0.357037 | 1.863302 | -2.869701 | 0.658475 | 1 | 0.776097 | -1.167798 | val0_2 | val1_2 | val0_0 | val1_1 |
2996 | 1.865602 | -1.226943 | 3.212207 | 0.961388 | -0.925646 | 2.316213 | 1 | 1.848407 | -1.231393 | val0_2 | val1_2 | val0_1 | val1_1 |
2997 | -0.854474 | -2.369469 | 0.457335 | 3.216465 | -4.680820 | 3.322093 | 1 | -0.876535 | -2.366553 | val0_1 | val1_1 | val0_0 | val1_1 |
2998 | 0.892994 | -2.115929 | 2.741942 | -0.218843 | -0.097098 | 1.893662 | 1 | 0.890398 | -2.143039 | val0_2 | val1_1 | val0_0 | val1_1 |
2999 | 1.664427 | -1.075205 | 3.487387 | -0.182207 | 0.033374 | 2.417933 | 1 | 1.654759 | -1.052976 | val0_2 | val1_2 | val0_1 | val1_1 |
3000 rows × 13 columns
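As a sanity check, we can confirm that the recovered dataframe matches the original one. A minimal sketch (check_exact=False tolerates the small floating-point differences introduced by the scaling round trip):

pd.testing.assert_frame_equal(org_df, df, check_exact=False, check_dtype=False)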
Power Transform
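The DataPowerTransformer class uses scikit-learn’s PowerTransformer internally (Yeo-Johnson by default), which applies a power transformation to make each numerical column more Gaussian-like. The interface is the same as before: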
[7]:
scaler = DataPowerTransformer(exclude_cols=['label'])
scaler.fit(df)
scaled_df = scaler.transform(df)
scaled_df
[7]:
num_0 | num_1 | num_2 | num_3 | num_4 | num_5 | label | num_c0_num_0 | num_c1_num_1 | CN_0_num_0 | CN_1_num_1 | CC_0_num_0 | CC_1_num_1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.204668 | 0.340865 | 0.624318 | -2.355844 | 1.892091 | 0.497892 | 1 | 0.184078 | 0.347054 | val0_2 | val1_2 | val0_1 | val1_1 |
1 | 0.938741 | -1.046138 | 0.791909 | 1.060248 | -0.260243 | 0.853418 | 1 | 0.943084 | -1.046297 | val0_1 | val1_1 | val0_1 | val1_0 |
2 | -0.211606 | 1.280236 | -0.462634 | 0.821744 | -1.544629 | 0.177042 | 1 | -0.181683 | 1.284699 | val0_2 | val1_3 | val0_0 | val1_1 |
3 | -0.969191 | 1.639213 | -0.707276 | 1.028996 | 0.621462 | -0.071696 | 0 | -0.979450 | 1.645924 | val0_1 | val1_0 | val0_0 | val1_2 |
4 | -0.214327 | -0.150747 | 0.090894 | 0.147964 | -0.309796 | 0.296469 | 1 | -0.195061 | -0.145133 | val0_2 | val1_0 | val0_0 | val1_1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2995 | -0.321017 | 0.469917 | -1.397383 | -0.100105 | -0.640434 | -0.839805 | 1 | -0.304500 | 0.472252 | val0_2 | val1_2 | val0_0 | val1_1 |
2996 | 0.218123 | 0.439816 | 1.016051 | -0.713380 | 0.763497 | 0.179385 | 1 | 0.209931 | 0.437581 | val0_2 | val1_2 | val0_1 | val1_1 |
2997 | -0.971766 | -0.186168 | -1.315517 | 0.834209 | -1.934027 | 0.805738 | 1 | -0.979760 | -0.184495 | val0_1 | val1_1 | val0_0 | val1_1 |
2998 | -0.251167 | -0.046782 | 0.609704 | -1.497267 | 1.370180 | -0.082193 | 1 | -0.251934 | -0.061582 | val0_2 | val1_1 | val0_0 | val1_1 |
2999 | 0.118160 | 0.522483 | 1.255006 | -1.473387 | 1.466475 | 0.242502 | 1 | 0.113814 | 0.534794 | val0_2 | val1_2 | val0_1 | val1_1 |
3000 rows × 13 columns
Normalizer
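The DataNormalizer class follows the same interface; internally it relies on scikit-learn’s Normalizer, which rescales each sample (row) to unit norm rather than scaling column by column: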
[8]:
scaler = DataNormalizer(exclude_cols=['label'])
scaler.fit(df)
scaled_df = scaler.transform(df)
scaled_df
No columns specified for imputation. These columns have been automatically identified:
[]
[8]:
num_0 | num_1 | num_2 | num_3 | num_4 | num_5 | label | num_c0_num_0 | num_c1_num_1 | CN_0_num_0 | CN_1_num_1 | CC_0_num_0 | CC_1_num_1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.341701 | -0.261714 | 0.512717 | -0.289709 | 0.112682 | 0.525680 | 1.0 | 0.333876 | -0.259667 | val0_2 | val1_2 | val0_1 | val1_1 |
1 | 0.342031 | -0.413138 | 0.310811 | 0.372675 | -0.246334 | 0.357664 | 1.0 | 0.342850 | -0.413149 | val0_1 | val1_1 | val0_1 | val1_0 |
2 | 0.160514 | 0.054268 | 0.243479 | 0.524880 | -0.678407 | 0.379470 | 1.0 | 0.170855 | 0.055530 | val0_2 | val1_3 | val0_0 | val1_1 |
3 | -0.180286 | 0.214523 | 0.253640 | 0.743796 | -0.238466 | 0.406482 | 0.0 | -0.186298 | 0.217048 | val0_1 | val1_0 | val0_0 | val1_2 |
4 | 0.166395 | -0.394475 | 0.365659 | 0.380715 | -0.412332 | 0.428660 | 1.0 | 0.173292 | -0.392752 | val0_2 | val1_0 | val0_0 | val1_1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2995 | 0.184312 | -0.291512 | 0.088827 | 0.463571 | -0.713954 | 0.163822 | 1.0 | 0.193085 | -0.290537 | val0_2 | val1_2 | val0_0 | val1_1 |
2996 | 0.356516 | -0.234468 | 0.613851 | 0.183721 | -0.176890 | 0.442627 | 1.0 | 0.353230 | -0.235319 | val0_2 | val1_2 | val0_1 | val1_1 |
2997 | -0.113966 | -0.316029 | 0.060997 | 0.428997 | -0.624306 | 0.443085 | 1.0 | -0.116908 | -0.315640 | val0_1 | val1_1 | val0_0 | val1_1 |
2998 | 0.191164 | -0.452958 | 0.586970 | -0.046848 | -0.020786 | 0.405378 | 1.0 | 0.190608 | -0.458762 | val0_2 | val1_1 | val0_0 | val1_1 |
2999 | 0.327582 | -0.211615 | 0.686367 | -0.035861 | 0.006568 | 0.475883 | 1.0 | 0.325680 | -0.207240 | val0_2 | val1_2 | val0_1 | val1_1 |
3000 rows × 13 columns
Standard Scaler
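The DataStandardScaler class applies scikit-learn’s StandardScaler, standardizing each numerical column to zero mean and unit variance: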
[9]:
scaler = DataStandardScaler(exclude_cols=['label'])
scaler.fit(df)
scaled_df = scaler.transform(df)
scaled_df
[9]:
num_0 | num_1 | num_2 | num_3 | num_4 | num_5 | label | num_c0_num_0 | num_c1_num_1 | CN_0_num_0 | CN_1_num_1 | CC_0_num_0 | CC_1_num_1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.279543 | 0.332628 | 0.628637 | -2.422861 | 1.875546 | 0.502397 | 1 | 0.260887 | 0.338829 | val0_2 | val1_2 | val0_1 | val1_1 |
1 | 0.908693 | -1.043038 | 0.793104 | 1.055566 | -0.255713 | 0.853082 | 1 | 0.912494 | -1.043198 | val0_1 | val1_1 | val0_1 | val1_0 |
2 | -0.104122 | 1.282994 | -0.450880 | 0.821995 | -1.552858 | 0.184628 | 1 | -0.076018 | 1.287573 | val0_2 | val1_3 | val0_0 | val1_1 |
3 | -0.918043 | 1.653340 | -0.697650 | 1.025001 | 0.626371 | -0.062694 | 0 | -0.930914 | 1.660279 | val0_1 | val1_0 | val0_0 | val1_2 |
4 | -0.106723 | -0.157488 | 0.101900 | 0.157765 | -0.305523 | 0.303063 | 1 | -0.088749 | -0.151902 | val0_2 | val1_0 | val0_0 | val1_1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2995 | -0.209927 | 0.461906 | -1.404684 | -0.088714 | -0.638406 | -0.833480 | 1 | -0.194235 | 0.464257 | val0_2 | val1_2 | val0_0 | val1_1 |
2996 | 0.291545 | 0.431726 | 1.012380 | -0.703871 | 0.767580 | 0.186954 | 1 | 0.284005 | 0.429494 | val0_2 | val1_2 | val0_1 | val1_1 |
2997 | -0.921186 | -0.192677 | -1.319776 | 0.834221 | -1.948250 | 0.806131 | 1 | -0.931293 | -0.191009 | val0_1 | val1_1 | val0_0 | val1_1 |
2998 | -0.142087 | -0.054115 | 0.614274 | -1.508856 | 1.366805 | -0.073151 | 1 | -0.143258 | -0.068831 | val0_2 | val1_1 | val0_0 | val1_1 |
2999 | 0.201852 | 0.514652 | 1.245336 | -1.483868 | 1.461165 | 0.249568 | 1 | 0.197640 | 0.527021 | val0_2 | val1_2 | val0_1 | val1_1 |
3000 rows × 13 columns
[10]:
scaled_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 num_0 3000 non-null float64
1 num_1 3000 non-null float64
2 num_2 3000 non-null float64
3 num_3 3000 non-null float64
4 num_4 3000 non-null float64
5 num_5 3000 non-null float64
6 label 3000 non-null int64
7 num_c0_num_0 3000 non-null float64
8 num_c1_num_1 3000 non-null float64
9 CN_0_num_0 3000 non-null object
10 CN_1_num_1 3000 non-null object
11 CC_0_num_0 3000 non-null object
12 CC_1_num_1 3000 non-null object
dtypes: float64(8), int64(1), object(4)
memory usage: 304.8+ KB
Min Max Scaler
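The DataMinMaxScaler class applies scikit-learn’s MinMaxScaler, rescaling each numerical column to the [0, 1] range by default: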
[11]:
scaler = DataMinMaxScaler(exclude_cols=['label'])
scaler.fit(df)
scaled_df = scaler.transform(df)
scaled_df
[11]:
num_0 | num_1 | num_2 | num_3 | num_4 | num_5 | label | num_c0_num_0 | num_c1_num_1 | CN_0_num_0 | CN_1_num_1 | CC_0_num_0 | CC_1_num_1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.538771 | 0.495719 | 0.588105 | 0.161065 | 0.773957 | 0.542666 | 1 | 0.537741 | 0.496233 | val0_2 | val1_2 | val0_1 | val1_1 |
1 | 0.634397 | 0.317108 | 0.608658 | 0.621755 | 0.492937 | 0.590060 | 1 | 0.637324 | 0.317234 | val0_1 | val1_1 | val0_1 | val1_0 |
2 | 0.480457 | 0.619110 | 0.453200 | 0.590821 | 0.321901 | 0.499721 | 1 | 0.486254 | 0.619113 | val0_2 | val1_3 | val0_0 | val1_1 |
3 | 0.356747 | 0.667194 | 0.422362 | 0.617707 | 0.609246 | 0.466296 | 0 | 0.355604 | 0.667385 | val0_1 | val1_0 | val0_0 | val1_2 |
4 | 0.480061 | 0.432084 | 0.522280 | 0.502849 | 0.486370 | 0.515727 | 1 | 0.484308 | 0.432674 | val0_2 | val1_0 | val0_0 | val1_1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2995 | 0.464375 | 0.512504 | 0.334005 | 0.470205 | 0.442477 | 0.362127 | 1 | 0.468187 | 0.512478 | val0_2 | val1_2 | val0_0 | val1_1 |
2996 | 0.540595 | 0.508585 | 0.636061 | 0.388732 | 0.627865 | 0.500035 | 1 | 0.541275 | 0.507975 | val0_2 | val1_2 | val0_1 | val1_1 |
2997 | 0.356269 | 0.427515 | 0.344616 | 0.592440 | 0.269766 | 0.583715 | 1 | 0.355546 | 0.427609 | val0_1 | val1_1 | val0_0 | val1_1 |
2998 | 0.474686 | 0.445506 | 0.586310 | 0.282118 | 0.706876 | 0.464883 | 1 | 0.475978 | 0.443433 | val0_2 | val1_1 | val0_0 | val1_1 |
2999 | 0.526962 | 0.519352 | 0.665173 | 0.285427 | 0.719318 | 0.508497 | 1 | 0.528076 | 0.520607 | val0_2 | val1_2 | val0_1 | val1_1 |
3000 rows × 13 columns
Quantile Transform
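The DataQuantileTransformer class applies scikit-learn’s QuantileTransformer, which maps each numerical column to a uniform distribution by default. If a different output distribution is desired, a customized sklearn object can be passed through the scaler_obj parameter (covered in more detail below). A minimal sketch, not executed here:

scaler = DataQuantileTransformer(
    scaler_obj=QuantileTransformer(output_distribution='normal'),
    exclude_cols=['label']
)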
[12]:
scaler = DataQuantileTransformer(exclude_cols=['label'])
scaler.fit(df)
scaled_df = scaler.transform(df)
scaled_df
[12]:
num_0 | num_1 | num_2 | num_3 | num_4 | num_5 | label | num_c0_num_0 | num_c1_num_1 | CN_0_num_0 | CN_1_num_1 | CC_0_num_0 | CC_1_num_1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.563929 | 0.629615 | 0.748641 | 0.010215 | 0.970001 | 0.687788 | 1 | 0.554528 | 0.632123 | val0_2 | val1_2 | val0_1 | val1_1 |
1 | 0.823907 | 0.149258 | 0.792533 | 0.857954 | 0.400110 | 0.800766 | 1 | 0.825264 | 0.148166 | val0_1 | val1_1 | val0_1 | val1_0 |
2 | 0.406645 | 0.906626 | 0.326266 | 0.799757 | 0.057018 | 0.564204 | 1 | 0.412858 | 0.907607 | val0_2 | val1_3 | val0_0 | val1_1 |
3 | 0.182652 | 0.951966 | 0.232610 | 0.852475 | 0.735349 | 0.478572 | 0 | 0.180338 | 0.952968 | val0_1 | val1_0 | val0_0 | val1_2 |
4 | 0.405598 | 0.430716 | 0.542525 | 0.566506 | 0.379286 | 0.606670 | 1 | 0.409656 | 0.434478 | val0_2 | val1_0 | val0_0 | val1_1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2995 | 0.364130 | 0.678050 | 0.075035 | 0.462473 | 0.266227 | 0.194694 | 1 | 0.372719 | 0.678672 | val0_2 | val1_2 | val0_0 | val1_1 |
2996 | 0.568323 | 0.666334 | 0.857290 | 0.229171 | 0.778167 | 0.564785 | 1 | 0.564750 | 0.665816 | val0_2 | val1_2 | val0_1 | val1_1 |
2997 | 0.182363 | 0.415749 | 0.088817 | 0.804872 | 0.024023 | 0.786688 | 1 | 0.180254 | 0.417115 | val0_1 | val1_1 | val0_0 | val1_1 |
2998 | 0.391255 | 0.475301 | 0.744740 | 0.065804 | 0.913498 | 0.473249 | 1 | 0.392185 | 0.470084 | val0_2 | val1_1 | val0_0 | val1_1 |
2999 | 0.531529 | 0.698122 | 0.897845 | 0.069955 | 0.927006 | 0.584906 | 1 | 0.529527 | 0.702879 | val0_2 | val1_2 | val0_1 | val1_1 |
3000 rows × 13 columns
Using Multiple Scalers: Fit and Transform
If the user wants to apply multiple scalers in succession, they can use the transform_pipe parameter. This parameter is used by many other classes in this library, and it allows users to provide a list of other objects from this library (each implementing fit and transform methods) to be used internally. To do this, instantiate the different scaler classes implemented in the dataprocessing module, create a list of these objects, and provide this list to the transform_pipe parameter. In the example below, we create a list with the following scalers: DataRobustScaler, DataMinMaxScaler, and DataQuantileTransformer. These scalers are passed as a parameter when creating an object of the DataRobustScaler class. Note, however, that we also add an object of the BasicImputer class to this list. The order of the list is important: it dictates the order in which the fit() and transform() methods will be called. When the fit method is called, the fit method of all transforms in the transform_pipe is called before the fit method of the main scaler (DataRobustScaler in this case). Therefore, we first call fit and transform for the BasicImputer, the resulting dataframe is used to call the fit and transform methods of the first DataRobustScaler, and so on. The same process is used for the transform method as well.
Note that this time, instead of using the default values for each scaler, we also provide a specific sklearn scaler to each of the scalers created. For instance, when creating the first DataRobustScaler object, we specify that instead of creating a default RobustScaler (from sklearn) internally, we want to use a specific object of the RobustScaler class. This allows users to customize the sklearn instances used inside each scaler class.
[13]:
imputer = BasicImputer()
s1 = DataRobustScaler(
    scaler_obj=RobustScaler(),
    exclude_cols=['num_3', 'num_4', 'num_5', 'label', 'CN_0_num_0', 'CN_1_num_1', 'CC_0_num_0', 'CC_1_num_1']
)
s2 = DataMinMaxScaler(
    scaler_obj=MinMaxScaler(),
    exclude_cols=['num_0', 'num_1', 'num_2', 'num_4', 'num_5', 'label', 'CN_0_num_0', 'CN_1_num_1', 'CC_0_num_0', 'CC_1_num_1']
)
s3 = DataQuantileTransformer(
    scaler_obj=QuantileTransformer(),
    exclude_cols=['num_0', 'num_1', 'num_2', 'num_3', 'num_5', 'label', 'CN_0_num_0', 'CN_1_num_1', 'CC_0_num_0', 'CC_1_num_1']
)
scaler = DataRobustScaler(
    scaler_obj=RobustScaler(),
    exclude_cols=['num_0', 'num_1', 'num_2', 'num_3', 'num_4', 'label', 'CN_0_num_0', 'CN_1_num_1', 'CC_0_num_0', 'CC_1_num_1'],
    transform_pipe=[imputer, s1, s2, s3]
)
scaler.fit(df)
scaled_df = scaler.transform(df)
scaled_df
No columns specified for imputation. These columns have been automatically identified:
[]
[13]:
num_0 | num_1 | num_2 | num_3 | num_4 | num_5 | label | num_c0_num_0 | num_c1_num_1 | CN_0_num_0 | CN_1_num_1 | CC_0_num_0 | CC_1_num_1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.118733 | 0.237866 | 0.492773 | 0.161065 | 0.970001 | 0.372973 | 1.0 | 0.108853 | 0.264084 | val0_2 | val1_2 | val0_1 | val1_1 |
1 | 0.601353 | -0.795141 | 0.621367 | 0.621755 | 0.400110 | 0.629804 | 1.0 | 0.650448 | -0.703694 | val0_1 | val1_1 | val0_1 | val1_0 |
2 | -0.175575 | 0.951508 | -0.351287 | 0.590821 | 0.057018 | 0.140249 | 1.0 | -0.174553 | 0.814974 | val0_2 | val1_3 | val0_0 | val1_1 |
3 | -0.799932 | 1.229606 | -0.544233 | 0.617707 | 0.735349 | -0.040882 | 0.0 | -0.639699 | 0.905684 | val0_1 | val1_0 | val0_0 | val1_2 |
4 | -0.177570 | -0.130169 | 0.080924 | 0.502849 | 0.379286 | 0.226987 | 1.0 | -0.180957 | -0.131151 | val0_2 | val1_0 | val0_0 | val1_1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2995 | -0.256738 | 0.334942 | -1.097054 | 0.470205 | 0.266227 | -0.605382 | 1.0 | -0.254850 | 0.357168 | val0_2 | val1_2 | val0_0 | val1_1 |
2996 | 0.127940 | 0.312279 | 0.792817 | 0.388732 | 0.778167 | 0.141952 | 1.0 | 0.129300 | 0.331460 | val0_2 | val1_2 | val0_1 | val1_1 |
2997 | -0.802343 | -0.156593 | -1.030666 | 0.592440 | 0.024023 | 0.595418 | 1.0 | -0.639868 | -0.165871 | val0_1 | val1_1 | val0_0 | val1_1 |
2998 | -0.204698 | -0.052545 | 0.481543 | 0.282118 | 0.913498 | -0.048541 | 1.0 | -0.215909 | -0.059949 | val0_2 | val1_1 | val0_0 | val1_1 |
2999 | 0.059137 | 0.374550 | 0.974962 | 0.285427 | 0.927006 | 0.187809 | 1.0 | 0.058840 | 0.405575 | val0_2 | val1_2 | val0_1 | val1_1 |
3000 rows × 13 columns
Using Multiple Scalers: Inverse Transform
When calling inverse_transform(), the reverse of the logic used for fit() and transform() with the transform_pipe parameter is applied: the inverse_transform() of each scaler is called in the reverse order in which they were applied during the transform. In this case, the order of the inverse_transform() calls is the following: the main DataRobustScaler, then DataQuantileTransformer, DataMinMaxScaler, and finally the DataRobustScaler from the transform_pipe. By doing this in reverse order, we can recover the original dataset, as shown below.
Note, however, that the inverse_transform() method is only called for the scalers that appear after the last transform object in the pipe that doesn’t inherit from the DataScaler() class. For example, if we create a DataRobustScaler() object using transform_pipe = [BasicImputer(), DataQuantileTransformer(), EncoderOHE(), DataMinMaxScaler()], then calling the fit or transform method of the base scaler (DataRobustScaler) executes the fit or transform of the objects in the transform_pipe in the order they appear. But when the inverse_transform() of the base scaler is called, the inverse transforms are called in the following order: DataRobustScaler, followed by DataMinMaxScaler. The inverse transform of the DataQuantileTransformer isn’t called because it appears before a non-scaler transform (EncoderOHE). Since the EncoderOHE doesn’t have an inverse transform, the chain of inverse transformations stops at the first non-scaler transform object found when moving from the last object to the first, which in this case is the EncoderOHE.
[14]:
org_df = scaler.inverse_transform(scaled_df)
org_df
[14]:
num_0 | num_1 | num_2 | num_3 | num_4 | num_5 | label | num_c0_num_0 | num_c1_num_1 | CN_0_num_0 | CN_1_num_1 | CC_0_num_0 | CC_1_num_1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.838682 | -1.408271 | 2.758909 | -1.558914 | 0.606338 | 2.828664 | 1.0 | 1.796571 | -1.397257 | val0_2 | val1_2 | val0_1 | val1_1 |
1 | 3.249825 | -3.925450 | 2.953185 | 3.540991 | -2.340552 | 3.398367 | 1.0 | 3.257607 | -3.925562 | val0_1 | val1_1 | val0_1 | val1_0 |
2 | 0.978148 | 0.330699 | 1.483723 | 3.198539 | -4.134112 | 2.312435 | 1.0 | 1.041164 | 0.338390 | val0_2 | val1_3 | val0_0 | val1_1 |
3 | -0.847425 | 1.008353 | 1.192225 | 3.496178 | -1.120895 | 1.910650 | 0.0 | -0.875685 | 1.020226 | val0_1 | val1_0 | val0_0 | val1_2 |
4 | 0.972314 | -2.305080 | 2.136697 | 2.224677 | -2.409424 | 2.504838 | 1.0 | 1.012618 | -2.295010 | val0_2 | val1_0 | val0_0 | val1_1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2995 | 0.740834 | -1.171719 | 0.357037 | 1.863302 | -2.869701 | 0.658475 | 1.0 | 0.776097 | -1.167798 | val0_2 | val1_2 | val0_0 | val1_1 |
2996 | 1.865602 | -1.226943 | 3.212207 | 0.961388 | -0.925646 | 2.316213 | 1.0 | 1.848407 | -1.231393 | val0_2 | val1_2 | val0_1 | val1_1 |
2997 | -0.854474 | -2.369469 | 0.457335 | 3.216465 | -4.680820 | 3.322093 | 1.0 | -0.876535 | -2.366553 | val0_1 | val1_1 | val0_0 | val1_1 |
2998 | 0.892994 | -2.115929 | 2.741942 | -0.218843 | -0.097098 | 1.893662 | 1.0 | 0.890398 | -2.143039 | val0_2 | val1_1 | val0_0 | val1_1 |
2999 | 1.664427 | -1.075205 | 3.487387 | -0.182207 | 0.033374 | 2.417933 | 1.0 | 1.654759 | -1.052976 | val0_2 | val1_2 | val0_1 | val1_1 |
3000 rows × 13 columns
Using the include_cols instead of the exclude_cols parameter
The exclude_cols parameter is useful when we want to leave only a few features out of the scaling process. However, in some cases (such as the one presented above), we might want to create scalers that are applied to only a few columns. Using exclude_cols can become tedious and error-prone in these situations, since we need to pass a long list of columns that should be ignored. In these scenarios, we can instead use the include_cols parameter, which works as the inverse of exclude_cols: the data transformations are applied only to the columns listed in include_cols. Below we replicate the experiment presented two cells above, but this time using the include_cols parameter instead of exclude_cols:
[15]:
imputer = BasicImputer()
encoder = EncoderOHE()
s1 = DataRobustScaler(
    scaler_obj=RobustScaler(),
    include_cols=['num_0', 'num_1', 'num_2']
)
s2 = DataMinMaxScaler(
    scaler_obj=MinMaxScaler(),
    include_cols=['num_3']
)
s3 = DataQuantileTransformer(
    scaler_obj=QuantileTransformer(),
    include_cols=['num_4']
)
scaler = DataRobustScaler(
    scaler_obj=RobustScaler(),
    include_cols=['num_5'],
    transform_pipe=[imputer, encoder, s1, s2, s3]
)
scaler.fit(df)
scaled_df = scaler.transform(df)
scaled_df
No columns specified for imputation. These columns have been automatically identified:
[]
No columns specified for encoding. These columns have been automatically identfied as the following:
['CN_0_num_0', 'CN_1_num_1', 'CC_0_num_0', 'CC_1_num_1']
[15]:
num_0 | num_1 | num_2 | num_3 | num_4 | num_5 | label | num_c0_num_0 | num_c1_num_1 | CN_0_num_0_val0_1 | CN_0_num_0_val0_2 | CN_0_num_0_val0_3 | CN_0_num_0_val0_4 | CN_1_num_1_val1_1 | CN_1_num_1_val1_2 | CN_1_num_1_val1_3 | CC_0_num_0_val0_1 | CC_1_num_1_val1_1 | CC_1_num_1_val1_2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.118733 | 0.237866 | 0.492773 | 0.161065 | 0.970001 | 0.372973 | 1.0 | 1.796571 | -1.397257 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
1 | 0.601353 | -0.795141 | 0.621367 | 0.621755 | 0.400110 | 0.629804 | 1.0 | 3.257607 | -3.925562 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
2 | -0.175575 | 0.951508 | -0.351287 | 0.590821 | 0.057018 | 0.140249 | 1.0 | 1.041164 | 0.338390 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
3 | -0.799932 | 1.229606 | -0.544233 | 0.617707 | 0.735349 | -0.040882 | 0.0 | -0.875685 | 1.020226 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
4 | -0.177570 | -0.130169 | 0.080924 | 0.502849 | 0.379286 | 0.226987 | 1.0 | 1.012618 | -2.295010 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2995 | -0.256738 | 0.334942 | -1.097054 | 0.470205 | 0.266227 | -0.605382 | 1.0 | 0.776097 | -1.167798 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
2996 | 0.127940 | 0.312279 | 0.792817 | 0.388732 | 0.778167 | 0.141952 | 1.0 | 1.848407 | -1.231393 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
2997 | -0.802343 | -0.156593 | -1.030666 | 0.592440 | 0.024023 | 0.595418 | 1.0 | -0.876535 | -2.366553 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
2998 | -0.204698 | -0.052545 | 0.481543 | 0.282118 | 0.913498 | -0.048541 | 1.0 | 0.890398 | -2.143039 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
2999 | 0.059137 | 0.374550 | 0.974962 | 0.285427 | 0.927006 | 0.187809 | 1.0 | 1.654759 | -1.052976 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
3000 rows × 19 columns
[16]:
org_df = scaler.inverse_transform(scaled_df)
org_df
[16]:
CN_0_num_0 | CN_1_num_1 | CC_0_num_0 | CC_1_num_1 | num_0 | num_1 | num_2 | num_3 | num_4 | num_5 | label | num_c0_num_0 | num_c1_num_1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | val0_2 | val1_2 | val0_1 | val1_1 | 1.838682 | -1.408271 | 2.758909 | -1.558914 | 0.606338 | 2.828664 | 1.0 | 1.796571 | -1.397257 |
1 | val0_1 | val1_1 | val0_1 | val1_0 | 3.249825 | -3.925450 | 2.953185 | 3.540991 | -2.340552 | 3.398367 | 1.0 | 3.257607 | -3.925562 |
2 | val0_2 | val1_3 | val0_0 | val1_1 | 0.978148 | 0.330699 | 1.483723 | 3.198539 | -4.134112 | 2.312435 | 1.0 | 1.041164 | 0.338390 |
3 | val0_1 | val1_0 | val0_0 | val1_2 | -0.847425 | 1.008353 | 1.192225 | 3.496178 | -1.120895 | 1.910650 | 0.0 | -0.875685 | 1.020226 |
4 | val0_2 | val1_0 | val0_0 | val1_1 | 0.972314 | -2.305080 | 2.136697 | 2.224677 | -2.409424 | 2.504838 | 1.0 | 1.012618 | -2.295010 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2995 | val0_2 | val1_2 | val0_0 | val1_1 | 0.740834 | -1.171719 | 0.357037 | 1.863302 | -2.869701 | 0.658475 | 1.0 | 0.776097 | -1.167798 |
2996 | val0_2 | val1_2 | val0_1 | val1_1 | 1.865602 | -1.226943 | 3.212207 | 0.961388 | -0.925646 | 2.316213 | 1.0 | 1.848407 | -1.231393 |
2997 | val0_1 | val1_1 | val0_0 | val1_1 | -0.854474 | -2.369469 | 0.457335 | 3.216465 | -4.680820 | 3.322093 | 1.0 | -0.876535 | -2.366553 |
2998 | val0_2 | val1_1 | val0_0 | val1_1 | 0.892994 | -2.115929 | 2.741942 | -0.218843 | -0.097098 | 1.893662 | 1.0 | 0.890398 | -2.143039 |
2999 | val0_2 | val1_2 | val0_1 | val1_1 | 1.664427 | -1.075205 | 3.487387 | -0.182207 | 0.033374 | 2.417933 | 1.0 | 1.654759 | -1.052976 |
3000 rows × 13 columns
Other non-scaler Transforms
As a final note, the scaler classes presented here also accept a list of other transformations that are not scalers, such as DataImputer, EncoderOHE, and CatBoostSelection, among other classes from the dataprocessing module. Just create a list with these objects and pass it to the transform_pipe parameter. These transforms are applied before the numerical columns are scaled.
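For instance, a minimal sketch (assuming default parameters for each transform) that imputes missing values and one-hot encodes the categorical columns before standardizing num_0 and num_1:

scaler = DataStandardScaler(
    include_cols=['num_0', 'num_1'],
    transform_pipe=[BasicImputer(), EncoderOHE()]
)
scaler.fit(df)
scaled_df = scaler.transform(df)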