[1]:
import random
import pandas as pd
import numpy as np
import scipy.stats as stat
from sklearn.preprocessing import (
    MinMaxScaler,
    RobustScaler,
    QuantileTransformer
)
from raimitigations.utils import create_dummy_dataset
from raimitigations.dataprocessing import (
    BasicImputer,
    EncoderOHE,
    DataRobustScaler,
    DataPowerTransformer,
    DataNormalizer,
    DataStandardScaler,
    DataMinMaxScaler,
    DataQuantileTransformer
)
Data Scaling
In this notebook, we show how to use the data transformation classes. Each class applies a different transformation to the numerical data, and they were designed so that several transformations can be applied through a simple, easy-to-use interface.
Let’s start with the basic scenarios where we use only default values. But before jumping to the different classes, let’s first create a dummy dataset with numerical and categorical features. We also define an add_nan helper that can scatter NaN values across different columns, although it is not applied in this run.
[2]:
# -----------------------------------
def add_nan(vec, pct):
    """Replace a fraction pct of the values in vec with np.nan."""
    vec = list(vec)
    nan_index = random.sample(range(len(vec)), int(pct * len(vec)))
    for index in nan_index:
        vec[index] = np.nan
    return vec
# -----------------------------------
def build_dummy_dataset():
    df = create_dummy_dataset(
        samples=3000,
        n_features=6,
        n_num_num=2,
        n_cat_num=2,
        n_cat_cat=2,
        num_num_noise=[0.01, 0.05],
        pct_change=[0.05, 0.1]
    )
    label_col = "label"
    return df, label_col

df, label_col = build_dummy_dataset()
df
[2]:
num_0 | num_1 | num_2 | num_3 | num_4 | num_5 | label | num_c0_num_0 | num_c1_num_1 | CN_0_num_0 | CN_1_num_1 | CC_0_num_0 | CC_1_num_1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.838682 | -1.408271 | 2.758909 | -1.558914 | 0.606338 | 2.828664 | 1 | 1.796571 | -1.397257 | val0_2 | val1_2 | val0_1 | val1_1 |
1 | 3.249825 | -3.925450 | 2.953185 | 3.540991 | -2.340552 | 3.398367 | 1 | 3.257607 | -3.925562 | val0_1 | val1_1 | val0_1 | val1_0 |
2 | 0.978148 | 0.330699 | 1.483723 | 3.198539 | -4.134112 | 2.312435 | 1 | 1.041164 | 0.338390 | val0_2 | val1_3 | val0_0 | val1_1 |
3 | -0.847425 | 1.008353 | 1.192225 | 3.496178 | -1.120895 | 1.910650 | 0 | -0.875685 | 1.020226 | val0_1 | val1_0 | val0_0 | val1_2 |
4 | 0.972314 | -2.305080 | 2.136697 | 2.224677 | -2.409424 | 2.504838 | 1 | 1.012618 | -2.295010 | val0_2 | val1_0 | val0_0 | val1_1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2995 | 0.740834 | -1.171719 | 0.357037 | 1.863302 | -2.869701 | 0.658475 | 1 | 0.776097 | -1.167798 | val0_2 | val1_2 | val0_0 | val1_1 |
2996 | 1.865602 | -1.226943 | 3.212207 | 0.961388 | -0.925646 | 2.316213 | 1 | 1.848407 | -1.231393 | val0_2 | val1_2 | val0_1 | val1_1 |
2997 | -0.854474 | -2.369469 | 0.457335 | 3.216465 | -4.680820 | 3.322093 | 1 | -0.876535 | -2.366553 | val0_1 | val1_1 | val0_0 | val1_1 |
2998 | 0.892994 | -2.115929 | 2.741942 | -0.218843 | -0.097098 | 1.893662 | 1 | 0.890398 | -2.143039 | val0_2 | val1_1 | val0_0 | val1_1 |
2999 | 1.664427 | -1.075205 | 3.487387 | -0.182207 | 0.033374 | 2.417933 | 1 | 1.654759 | -1.052976 | val0_2 | val1_2 | val0_1 | val1_1 |
3000 rows × 13 columns
[3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 num_0 3000 non-null float64
1 num_1 3000 non-null float64
2 num_2 3000 non-null float64
3 num_3 3000 non-null float64
4 num_4 3000 non-null float64
5 num_5 3000 non-null float64
6 label 3000 non-null int64
7 num_c0_num_0 3000 non-null float64
8 num_c1_num_1 3000 non-null float64
9 CN_0_num_0 3000 non-null object
10 CN_1_num_1 3000 non-null object
11 CC_0_num_0 3000 non-null object
12 CC_1_num_1 3000 non-null object
dtypes: float64(8), int64(1), object(4)
memory usage: 304.8+ KB
Basic Use Case: Robust Scaler
As we can see, the dataset we are working with has several numerical and several categorical columns. When using data scalers or transformers directly from scikit-learn, we usually need to provide a dataset containing only the numerical columns. This requires the user to separate these columns from the original dataset and then merge the scaled columns back with the categorical columns afterwards. We encapsulate all of these steps inside the scaler classes, making them easier to use.
Let’s see an example of how to use the Robust scaler through the DataRobustScaler class implemented in this library. For now, we will use only the default values, with the exception of the exclude_cols parameter. This parameter is a list of column names or column indices that should be ignored during the scaling process, that is, all columns that should not be scaled. By default, this parameter is set to the list of all column names associated with categorical columns. But since our label is numerical and we don’t want to transform the values in the label column, we will force the label column to be ignored as well (along with the categorical columns). This way, we create the object by passing exclude_cols=['label']. The categorical columns are added internally to the exclude_cols list if they are not already specified.
After creating the DataRobustScaler object, we must call the fit() and transform() methods, following the pattern established in this library (and followed by several well-established machine learning libraries). Here we will fit and transform the data using the same dataset, but remember that, in practice, fit() should be called on the training dataset only, and transform() should then be called on both the training and test datasets.
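For reference, the usual train/test pattern would look roughly like the sketch below (not executed in this notebook; the train_test_split call and the train_df/test_df names are only for illustration):

from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

split_scaler = DataRobustScaler(exclude_cols=['label'])
split_scaler.fit(train_df)                           # fit only on the training data
scaled_train_df = split_scaler.transform(train_df)   # transform both splits
scaled_test_df = split_scaler.transform(test_df)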
[4]:
scaler = DataRobustScaler(exclude_cols=['label'])
scaler.fit(df)
scaled_df = scaler.transform(df)
scaled_df
[4]:
num_0 | num_1 | num_2 | num_3 | num_4 | num_5 | label | num_c0_num_0 | num_c1_num_1 | CN_0_num_0 | CN_1_num_1 | CC_0_num_0 | CC_1_num_1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.118733 | 0.237866 | 0.492773 | -1.849241 | 1.363420 | 0.372973 | 1 | 0.105654 | 0.243515 | val0_2 | val1_2 | val0_1 | val1_1 |
1 | 0.601353 | -0.795141 | 0.621367 | 0.807144 | -0.197305 | 0.629804 | 1 | 0.604797 | -0.796681 | val0_1 | val1_1 | val0_1 | val1_0 |
2 | -0.175575 | 0.951508 | -0.351287 | 0.628771 | -1.147206 | 0.140249 | 1 | -0.152420 | 0.957595 | val0_2 | val1_3 | val0_0 | val1_1 |
3 | -0.799932 | 1.229606 | -0.544233 | 0.783802 | 0.448647 | -0.040882 | 0 | -0.807285 | 1.238116 | val0_1 | val1_0 | val0_0 | val1_2 |
4 | -0.177570 | -0.130169 | 0.080924 | 0.121516 | -0.233781 | 0.226987 | 1 | -0.162172 | -0.125839 | val0_2 | val1_0 | val0_0 | val1_1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2995 | -0.256738 | 0.334942 | -1.097054 | -0.066713 | -0.477552 | -0.605382 | 1 | -0.242977 | 0.337919 | val0_2 | val1_2 | val0_0 | val1_1 |
2996 | 0.127940 | 0.312279 | 0.792817 | -0.536493 | 0.552054 | 0.141952 | 1 | 0.123363 | 0.311755 | val0_2 | val1_2 | val0_1 | val1_1 |
2997 | -0.802343 | -0.156593 | -1.030666 | 0.638108 | -1.436752 | 0.595418 | 1 | -0.807575 | -0.155273 | val0_1 | val1_1 | val0_0 | val1_1 |
2998 | -0.204698 | -0.052545 | 0.481543 | -1.151239 | 0.990868 | -0.048541 | 1 | -0.203927 | -0.063315 | val0_2 | val1_1 | val0_0 | val1_1 |
2999 | 0.059137 | 0.374550 | 0.974962 | -1.132156 | 1.059968 | 0.187809 | 1 | 0.057206 | 0.385159 | val0_2 | val1_2 | val0_1 | val1_1 |
3000 rows × 13 columns
[5]:
scaled_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 num_0 3000 non-null float64
1 num_1 3000 non-null float64
2 num_2 3000 non-null float64
3 num_3 3000 non-null float64
4 num_4 3000 non-null float64
5 num_5 3000 non-null float64
6 label 3000 non-null int64
7 num_c0_num_0 3000 non-null float64
8 num_c1_num_1 3000 non-null float64
9 CN_0_num_0 3000 non-null object
10 CN_1_num_1 3000 non-null object
11 CC_0_num_0 3000 non-null object
12 CC_1_num_1 3000 non-null object
dtypes: float64(8), int64(1), object(4)
memory usage: 304.8+ KB
Inverse Transform
The scalers implemented in scikit-learn all have an inverse transform method, which reverts the scaling performed on the numerical data, returning the dataset to its original state. We encapsulate this behavior in a method with the same name used by scikit-learn: inverse_transform(). Note that in our case, the inverse_transform() method receives a dataset containing the scaled numerical columns as well as the categorical columns, and returns a new dataframe in which the scaling of the numerical columns has been reversed (the label column is kept unchanged, since it was set to be ignored).
[6]:
org_df = scaler.inverse_transform(scaled_df)
org_df
[6]:
num_0 | num_1 | num_2 | num_3 | num_4 | num_5 | label | num_c0_num_0 | num_c1_num_1 | CN_0_num_0 | CN_1_num_1 | CC_0_num_0 | CC_1_num_1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.838682 | -1.408271 | 2.758909 | -1.558914 | 0.606338 | 2.828664 | 1 | 1.796571 | -1.397257 | val0_2 | val1_2 | val0_1 | val1_1 |
1 | 3.249825 | -3.925450 | 2.953185 | 3.540991 | -2.340552 | 3.398367 | 1 | 3.257607 | -3.925562 | val0_1 | val1_1 | val0_1 | val1_0 |
2 | 0.978148 | 0.330699 | 1.483723 | 3.198539 | -4.134112 | 2.312435 | 1 | 1.041164 | 0.338390 | val0_2 | val1_3 | val0_0 | val1_1 |
3 | -0.847425 | 1.008353 | 1.192225 | 3.496178 | -1.120895 | 1.910650 | 0 | -0.875685 | 1.020226 | val0_1 | val1_0 | val0_0 | val1_2 |
4 | 0.972314 | -2.305080 | 2.136697 | 2.224677 | -2.409424 | 2.504838 | 1 | 1.012618 | -2.295010 | val0_2 | val1_0 | val0_0 | val1_1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2995 | 0.740834 | -1.171719 | 0.357037 | 1.863302 | -2.869701 | 0.658475 | 1 | 0.776097 | -1.167798 | val0_2 | val1_2 | val0_0 | val1_1 |
2996 | 1.865602 | -1.226943 | 3.212207 | 0.961388 | -0.925646 | 2.316213 | 1 | 1.848407 | -1.231393 | val0_2 | val1_2 | val0_1 | val1_1 |
2997 | -0.854474 | -2.369469 | 0.457335 | 3.216465 | -4.680820 | 3.322093 | 1 | -0.876535 | -2.366553 | val0_1 | val1_1 | val0_0 | val1_1 |
2998 | 0.892994 | -2.115929 | 2.741942 | -0.218843 | -0.097098 | 1.893662 | 1 | 0.890398 | -2.143039 | val0_2 | val1_1 | val0_0 | val1_1 |
2999 | 1.664427 | -1.075205 | 3.487387 | -0.182207 | 0.033374 | 2.417933 | 1 | 1.654759 | -1.052976 | val0_2 | val1_2 | val0_1 | val1_1 |
3000 rows × 13 columns
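As a sanity check, we can confirm that the recovered dataframe matches the original one. A minimal sketch (check_exact=False tolerates the small floating-point differences introduced by the scaling round trip):

pd.testing.assert_frame_equal(org_df, df, check_exact=False, check_dtype=False)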
Power Transform
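The DataPowerTransformer class uses scikit-learn’s PowerTransformer internally (Yeo-Johnson by default), which applies a power transformation to make each numerical column more Gaussian-like. The interface is the same as before: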
[7]:
scaler = DataPowerTransformer(exclude_cols=['label'])
scaler.fit(df)
scaled_df = scaler.transform(df)
scaled_df
[7]:
num_0 | num_1 | num_2 | num_3 | num_4 | num_5 | label | num_c0_num_0 | num_c1_num_1 | CN_0_num_0 | CN_1_num_1 | CC_0_num_0 | CC_1_num_1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.204668 | 0.340865 | 0.624318 | -2.355844 | 1.892091 | 0.497892 | 1 | 0.184078 | 0.347054 | val0_2 | val1_2 | val0_1 | val1_1 |
1 | 0.938741 | -1.046138 | 0.791909 | 1.060248 | -0.260243 | 0.853418 | 1 | 0.943084 | -1.046297 | val0_1 | val1_1 | val0_1 | val1_0 |
2 | -0.211606 | 1.280236 | -0.462634 | 0.821744 | -1.544629 | 0.177042 | 1 | -0.181683 | 1.284699 | val0_2 | val1_3 | val0_0 | val1_1 |
3 | -0.969191 | 1.639213 | -0.707276 | 1.028996 | 0.621462 | -0.071696 | 0 | -0.979450 | 1.645924 | val0_1 | val1_0 | val0_0 | val1_2 |
4 | -0.214327 | -0.150747 | 0.090894 | 0.147964 | -0.309796 | 0.296469 | 1 | -0.195061 | -0.145133 | val0_2 | val1_0 | val0_0 | val1_1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2995 | -0.321017 | 0.469917 | -1.397383 | -0.100105 | -0.640434 | -0.839805 | 1 | -0.304500 | 0.472252 | val0_2 | val1_2 | val0_0 | val1_1 |
2996 | 0.218123 | 0.439816 | 1.016051 | -0.713380 | 0.763497 | 0.179385 | 1 | 0.209931 | 0.437581 | val0_2 | val1_2 | val0_1 | val1_1 |
2997 | -0.971766 | -0.186168 | -1.315517 | 0.834209 | -1.934027 | 0.805738 | 1 | -0.979760 | -0.184495 | val0_1 | val1_1 | val0_0 | val1_1 |
2998 | -0.251167 | -0.046782 | 0.609704 | -1.497267 | 1.370180 | -0.082193 | 1 | -0.251934 | -0.061582 | val0_2 | val1_1 | val0_0 | val1_1 |
2999 | 0.118160 | 0.522483 | 1.255006 | -1.473387 | 1.466475 | 0.242502 | 1 | 0.113814 | 0.534794 | val0_2 | val1_2 | val0_1 | val1_1 |
3000 rows × 13 columns
Normalizer
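The DataNormalizer class follows the same interface; internally it relies on scikit-learn’s Normalizer, which rescales each sample (row) to unit norm rather than scaling column by column: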
[8]:
scaler = DataNormalizer(exclude_cols=['label'])
scaler.fit(df)
scaled_df = scaler.transform(df)
scaled_df
No columns specified for imputation. These columns have been automatically identified:
[]
[8]:
num_0 | num_1 | num_2 | num_3 | num_4 | num_5 | label | num_c0_num_0 | num_c1_num_1 | CN_0_num_0 | CN_1_num_1 | CC_0_num_0 | CC_1_num_1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.341701 | -0.261714 | 0.512717 | -0.289709 | 0.112682 | 0.525680 | 1.0 | 0.333876 | -0.259667 | val0_2 | val1_2 | val0_1 | val1_1 |
1 | 0.342031 | -0.413138 | 0.310811 | 0.372675 | -0.246334 | 0.357664 | 1.0 | 0.342850 | -0.413149 | val0_1 | val1_1 | val0_1 | val1_0 |
2 | 0.160514 | 0.054268 | 0.243479 | 0.524880 | -0.678407 | 0.379470 | 1.0 | 0.170855 | 0.055530 | val0_2 | val1_3 | val0_0 | val1_1 |
3 | -0.180286 | 0.214523 | 0.253640 | 0.743796 | -0.238466 | 0.406482 | 0.0 | -0.186298 | 0.217048 | val0_1 | val1_0 | val0_0 | val1_2 |
4 | 0.166395 | -0.394475 | 0.365659 | 0.380715 | -0.412332 | 0.428660 | 1.0 | 0.173292 | -0.392752 | val0_2 | val1_0 | val0_0 | val1_1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2995 | 0.184312 | -0.291512 | 0.088827 | 0.463571 | -0.713954 | 0.163822 | 1.0 | 0.193085 | -0.290537 | val0_2 | val1_2 | val0_0 | val1_1 |
2996 | 0.356516 | -0.234468 | 0.613851 | 0.183721 | -0.176890 | 0.442627 | 1.0 | 0.353230 | -0.235319 | val0_2 | val1_2 | val0_1 | val1_1 |
2997 | -0.113966 | -0.316029 | 0.060997 | 0.428997 | -0.624306 | 0.443085 | 1.0 | -0.116908 | -0.315640 | val0_1 | val1_1 | val0_0 | val1_1 |
2998 | 0.191164 | -0.452958 | 0.586970 | -0.046848 | -0.020786 | 0.405378 | 1.0 | 0.190608 | -0.458762 | val0_2 | val1_1 | val0_0 | val1_1 |
2999 | 0.327582 | -0.211615 | 0.686367 | -0.035861 | 0.006568 | 0.475883 | 1.0 | 0.325680 | -0.207240 | val0_2 | val1_2 | val0_1 | val1_1 |
3000 rows × 13 columns
Standard Scaler
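The DataStandardScaler class applies scikit-learn’s StandardScaler, standardizing each numerical column to zero mean and unit variance: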
[9]:
scaler = DataStandardScaler(exclude_cols=['label'])
scaler.fit(df)
scaled_df = scaler.transform(df)
scaled_df
[9]:
num_0 | num_1 | num_2 | num_3 | num_4 | num_5 | label | num_c0_num_0 | num_c1_num_1 | CN_0_num_0 | CN_1_num_1 | CC_0_num_0 | CC_1_num_1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.279543 | 0.332628 | 0.628637 | -2.422861 | 1.875546 | 0.502397 | 1 | 0.260887 | 0.338829 | val0_2 | val1_2 | val0_1 | val1_1 |
1 | 0.908693 | -1.043038 | 0.793104 | 1.055566 | -0.255713 | 0.853082 | 1 | 0.912494 | -1.043198 | val0_1 | val1_1 | val0_1 | val1_0 |
2 | -0.104122 | 1.282994 | -0.450880 | 0.821995 | -1.552858 | 0.184628 | 1 | -0.076018 | 1.287573 | val0_2 | val1_3 | val0_0 | val1_1 |
3 | -0.918043 | 1.653340 | -0.697650 | 1.025001 | 0.626371 | -0.062694 | 0 | -0.930914 | 1.660279 | val0_1 | val1_0 | val0_0 | val1_2 |
4 | -0.106723 | -0.157488 | 0.101900 | 0.157765 | -0.305523 | 0.303063 | 1 | -0.088749 | -0.151902 | val0_2 | val1_0 | val0_0 | val1_1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2995 | -0.209927 | 0.461906 | -1.404684 | -0.088714 | -0.638406 | -0.833480 | 1 | -0.194235 | 0.464257 | val0_2 | val1_2 | val0_0 | val1_1 |
2996 | 0.291545 | 0.431726 | 1.012380 | -0.703871 | 0.767580 | 0.186954 | 1 | 0.284005 | 0.429494 | val0_2 | val1_2 | val0_1 | val1_1 |
2997 | -0.921186 | -0.192677 | -1.319776 | 0.834221 | -1.948250 | 0.806131 | 1 | -0.931293 | -0.191009 | val0_1 | val1_1 | val0_0 | val1_1 |
2998 | -0.142087 | -0.054115 | 0.614274 | -1.508856 | 1.366805 | -0.073151 | 1 | -0.143258 | -0.068831 | val0_2 | val1_1 | val0_0 | val1_1 |
2999 | 0.201852 | 0.514652 | 1.245336 | -1.483868 | 1.461165 | 0.249568 | 1 | 0.197640 | 0.527021 | val0_2 | val1_2 | val0_1 | val1_1 |
3000 rows × 13 columns
[10]:
scaled_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 num_0 3000 non-null float64
1 num_1 3000 non-null float64
2 num_2 3000 non-null float64
3 num_3 3000 non-null float64
4 num_4 3000 non-null float64
5 num_5 3000 non-null float64
6 label 3000 non-null int64
7 num_c0_num_0 3000 non-null float64
8 num_c1_num_1 3000 non-null float64
9 CN_0_num_0 3000 non-null object
10 CN_1_num_1 3000 non-null object
11 CC_0_num_0 3000 non-null object
12 CC_1_num_1 3000 non-null object
dtypes: float64(8), int64(1), object(4)
memory usage: 304.8+ KB
Min Max Scaler
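The DataMinMaxScaler class applies scikit-learn’s MinMaxScaler, rescaling each numerical column to the [0, 1] range by default: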
[11]:
scaler = DataMinMaxScaler(exclude_cols=['label'])
scaler.fit(df)
scaled_df = scaler.transform(df)
scaled_df
[11]:
num_0 | num_1 | num_2 | num_3 | num_4 | num_5 | label | num_c0_num_0 | num_c1_num_1 | CN_0_num_0 | CN_1_num_1 | CC_0_num_0 | CC_1_num_1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.538771 | 0.495719 | 0.588105 | 0.161065 | 0.773957 | 0.542666 | 1 | 0.537741 | 0.496233 | val0_2 | val1_2 | val0_1 | val1_1 |
1 | 0.634397 | 0.317108 | 0.608658 | 0.621755 | 0.492937 | 0.590060 | 1 | 0.637324 | 0.317234 | val0_1 | val1_1 | val0_1 | val1_0 |
2 | 0.480457 | 0.619110 | 0.453200 | 0.590821 | 0.321901 | 0.499721 | 1 | 0.486254 | 0.619113 | val0_2 | val1_3 | val0_0 | val1_1 |
3 | 0.356747 | 0.667194 | 0.422362 | 0.617707 | 0.609246 | 0.466296 | 0 | 0.355604 | 0.667385 | val0_1 | val1_0 | val0_0 | val1_2 |
4 | 0.480061 | 0.432084 | 0.522280 | 0.502849 | 0.486370 | 0.515727 | 1 | 0.484308 | 0.432674 | val0_2 | val1_0 | val0_0 | val1_1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2995 | 0.464375 | 0.512504 | 0.334005 | 0.470205 | 0.442477 | 0.362127 | 1 | 0.468187 | 0.512478 | val0_2 | val1_2 | val0_0 | val1_1 |
2996 | 0.540595 | 0.508585 | 0.636061 | 0.388732 | 0.627865 | 0.500035 | 1 | 0.541275 | 0.507975 | val0_2 | val1_2 | val0_1 | val1_1 |
2997 | 0.356269 | 0.427515 | 0.344616 | 0.592440 | 0.269766 | 0.583715 | 1 | 0.355546 | 0.427609 | val0_1 | val1_1 | val0_0 | val1_1 |
2998 | 0.474686 | 0.445506 | 0.586310 | 0.282118 | 0.706876 | 0.464883 | 1 | 0.475978 | 0.443433 | val0_2 | val1_1 | val0_0 | val1_1 |
2999 | 0.526962 | 0.519352 | 0.665173 | 0.285427 | 0.719318 | 0.508497 | 1 | 0.528076 | 0.520607 | val0_2 | val1_2 | val0_1 | val1_1 |
3000 rows × 13 columns
Quantile Transform
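The DataQuantileTransformer class applies scikit-learn’s QuantileTransformer, which maps each numerical column to a uniform distribution by default. If a different output distribution is desired, a customized sklearn object can be passed through the scaler_obj parameter (covered in more detail below). A minimal sketch, not executed here:

scaler = DataQuantileTransformer(
    scaler_obj=QuantileTransformer(output_distribution='normal'),
    exclude_cols=['label']
)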
[12]:
scaler = DataQuantileTransformer(exclude_cols=['label'])
scaler.fit(df)
scaled_df = scaler.transform(df)
scaled_df
[12]:
num_0 | num_1 | num_2 | num_3 | num_4 | num_5 | label | num_c0_num_0 | num_c1_num_1 | CN_0_num_0 | CN_1_num_1 | CC_0_num_0 | CC_1_num_1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.563929 | 0.629615 | 0.748641 | 0.010215 | 0.970001 | 0.687788 | 1 | 0.554528 | 0.632123 | val0_2 | val1_2 | val0_1 | val1_1 |
1 | 0.823907 | 0.149258 | 0.792533 | 0.857954 | 0.400110 | 0.800766 | 1 | 0.825264 | 0.148166 | val0_1 | val1_1 | val0_1 | val1_0 |
2 | 0.406645 | 0.906626 | 0.326266 | 0.799757 | 0.057018 | 0.564204 | 1 | 0.412858 | 0.907607 | val0_2 | val1_3 | val0_0 | val1_1 |
3 | 0.182652 | 0.951966 | 0.232610 | 0.852475 | 0.735349 | 0.478572 | 0 | 0.180338 | 0.952968 | val0_1 | val1_0 | val0_0 | val1_2 |
4 | 0.405598 | 0.430716 | 0.542525 | 0.566506 | 0.379286 | 0.606670 | 1 | 0.409656 | 0.434478 | val0_2 | val1_0 | val0_0 | val1_1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2995 | 0.364130 | 0.678050 | 0.075035 | 0.462473 | 0.266227 | 0.194694 | 1 | 0.372719 | 0.678672 | val0_2 | val1_2 | val0_0 | val1_1 |
2996 | 0.568323 | 0.666334 | 0.857290 | 0.229171 | 0.778167 | 0.564785 | 1 | 0.564750 | 0.665816 | val0_2 | val1_2 | val0_1 | val1_1 |
2997 | 0.182363 | 0.415749 | 0.088817 | 0.804872 | 0.024023 | 0.786688 | 1 | 0.180254 | 0.417115 | val0_1 | val1_1 | val0_0 | val1_1 |
2998 | 0.391255 | 0.475301 | 0.744740 | 0.065804 | 0.913498 | 0.473249 | 1 | 0.392185 | 0.470084 | val0_2 | val1_1 | val0_0 | val1_1 |
2999 | 0.531529 | 0.698122 | 0.897845 | 0.069955 | 0.927006 | 0.584906 | 1 | 0.529527 | 0.702879 | val0_2 | val1_2 | val0_1 | val1_1 |
3000 rows × 13 columns
Using Multiple Scalers: Fit and Transform
If the user wants to apply multiple scalers in succession, they can use the transform_pipe parameter. This parameter is used by many other classes in this library, and it allows users to provide a list of other objects from this library (each implementing fit and transform methods) to be used internally. To do this, instantiate the different scaler classes implemented in the dataprocessing module, create a list of these objects, and provide this list to the transform_pipe parameter. In the example below, we create a list with the following scalers: DataRobustScaler, DataMinMaxScaler, and DataQuantileTransformer. These scalers are passed as a parameter when creating an object of the DataRobustScaler class. Note, however, that we also add an object of the BasicImputer class to this list. The order of the list is important: it dictates the order in which the fit() and transform() methods will be called. When the fit method is called, the fit method of all transforms in the transform_pipe is called before the fit method of the main scaler (DataRobustScaler in this case). Therefore, we first call fit and transform for the BasicImputer, the resulting dataframe is used to call the fit and transform methods of the first DataRobustScaler, and so on. The same process is used for the transform method as well.
Note that this time, instead of using the default values for each scaler, we also provide a specific sklearn scaler to each of the scalers created. For instance, when creating the first DataRobustScaler object, we specify that instead of creating a default RobustScaler (from sklearn) internally, we want to use a specific object of the RobustScaler class. This allows users to customize the sklearn instances used inside each scaler class.
[13]:
imputer = BasicImputer()
s1 = DataRobustScaler(
    scaler_obj=RobustScaler(),
    exclude_cols=['num_3', 'num_4', 'num_5', 'label', 'CN_0_num_0', 'CN_1_num_1', 'CC_0_num_0', 'CC_1_num_1']
)
s2 = DataMinMaxScaler(
    scaler_obj=MinMaxScaler(),
    exclude_cols=['num_0', 'num_1', 'num_2', 'num_4', 'num_5', 'label', 'CN_0_num_0', 'CN_1_num_1', 'CC_0_num_0', 'CC_1_num_1']
)
s3 = DataQuantileTransformer(
    scaler_obj=QuantileTransformer(),
    exclude_cols=['num_0', 'num_1', 'num_2', 'num_3', 'num_5', 'label', 'CN_0_num_0', 'CN_1_num_1', 'CC_0_num_0', 'CC_1_num_1']
)
scaler = DataRobustScaler(
    scaler_obj=RobustScaler(),
    exclude_cols=['num_0', 'num_1', 'num_2', 'num_3', 'num_4', 'label', 'CN_0_num_0', 'CN_1_num_1', 'CC_0_num_0', 'CC_1_num_1'],
    transform_pipe=[imputer, s1, s2, s3]
)
scaler.fit(df)
scaled_df = scaler.transform(df)
scaled_df
No columns specified for imputation. These columns have been automatically identified:
[]
[13]:
num_0 | num_1 | num_2 | num_3 | num_4 | num_5 | label | num_c0_num_0 | num_c1_num_1 | CN_0_num_0 | CN_1_num_1 | CC_0_num_0 | CC_1_num_1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.118733 | 0.237866 | 0.492773 | 0.161065 | 0.970001 | 0.372973 | 1.0 | 0.108853 | 0.264084 | val0_2 | val1_2 | val0_1 | val1_1 |
1 | 0.601353 | -0.795141 | 0.621367 | 0.621755 | 0.400110 | 0.629804 | 1.0 | 0.650448 | -0.703694 | val0_1 | val1_1 | val0_1 | val1_0 |
2 | -0.175575 | 0.951508 | -0.351287 | 0.590821 | 0.057018 | 0.140249 | 1.0 | -0.174553 | 0.814974 | val0_2 | val1_3 | val0_0 | val1_1 |
3 | -0.799932 | 1.229606 | -0.544233 | 0.617707 | 0.735349 | -0.040882 | 0.0 | -0.639699 | 0.905684 | val0_1 | val1_0 | val0_0 | val1_2 |
4 | -0.177570 | -0.130169 | 0.080924 | 0.502849 | 0.379286 | 0.226987 | 1.0 | -0.180957 | -0.131151 | val0_2 | val1_0 | val0_0 | val1_1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2995 | -0.256738 | 0.334942 | -1.097054 | 0.470205 | 0.266227 | -0.605382 | 1.0 | -0.254850 | 0.357168 | val0_2 | val1_2 | val0_0 | val1_1 |
2996 | 0.127940 | 0.312279 | 0.792817 | 0.388732 | 0.778167 | 0.141952 | 1.0 | 0.129300 | 0.331460 | val0_2 | val1_2 | val0_1 | val1_1 |
2997 | -0.802343 | -0.156593 | -1.030666 | 0.592440 | 0.024023 | 0.595418 | 1.0 | -0.639868 | -0.165871 | val0_1 | val1_1 | val0_0 | val1_1 |
2998 | -0.204698 | -0.052545 | 0.481543 | 0.282118 | 0.913498 | -0.048541 | 1.0 | -0.215909 | -0.059949 | val0_2 | val1_1 | val0_0 | val1_1 |
2999 | 0.059137 | 0.374550 | 0.974962 | 0.285427 | 0.927006 | 0.187809 | 1.0 | 0.058840 | 0.405575 | val0_2 | val1_2 | val0_1 | val1_1 |
3000 rows × 13 columns
Using Multiple Scalers: Inverse Transform
When calling inverse_transform(), the reverse of the logic used for fit() and transform() with the transform_pipe parameter is applied: the inverse_transform() of each scaler is called in the reverse order in which they were applied during the transform. In this case, the order of the inverse_transform() calls is the following: the main DataRobustScaler, then DataQuantileTransformer, DataMinMaxScaler, and finally the DataRobustScaler from the transform_pipe. By doing this in reverse order, we can recover the original dataset, as shown below.
Note, however, that the inverse_transform() method is only called for the scalers that appear after the last transform object in the pipe that doesn’t inherit from the DataScaler() class. For example, if we create a DataRobustScaler() object using transform_pipe = [BasicImputer(), DataQuantileTransformer(), EncoderOHE(), DataMinMaxScaler()], then calling the fit or transform method of the base scaler (DataRobustScaler) executes the fit or transform of the objects in the transform_pipe in the order they appear. But when the inverse_transform() of the base scaler is called, the inverse transforms are called in the following order: DataRobustScaler, followed by DataMinMaxScaler. The inverse transform of the DataQuantileTransformer isn’t called because it appears before a non-scaler transform (EncoderOHE). Since the EncoderOHE doesn’t have an inverse transform, the chain of inverse transformations stops at the first non-scaler transform object found when moving from the last object to the first, which in this case is the EncoderOHE.
[14]:
org_df = scaler.inverse_transform(scaled_df)
org_df
[14]:
num_0 | num_1 | num_2 | num_3 | num_4 | num_5 | label | num_c0_num_0 | num_c1_num_1 | CN_0_num_0 | CN_1_num_1 | CC_0_num_0 | CC_1_num_1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.838682 | -1.408271 | 2.758909 | -1.558914 | 0.606338 | 2.828664 | 1.0 | 1.796571 | -1.397257 | val0_2 | val1_2 | val0_1 | val1_1 |
1 | 3.249825 | -3.925450 | 2.953185 | 3.540991 | -2.340552 | 3.398367 | 1.0 | 3.257607 | -3.925562 | val0_1 | val1_1 | val0_1 | val1_0 |
2 | 0.978148 | 0.330699 | 1.483723 | 3.198539 | -4.134112 | 2.312435 | 1.0 | 1.041164 | 0.338390 | val0_2 | val1_3 | val0_0 | val1_1 |
3 | -0.847425 | 1.008353 | 1.192225 | 3.496178 | -1.120895 | 1.910650 | 0.0 | -0.875685 | 1.020226 | val0_1 | val1_0 | val0_0 | val1_2 |
4 | 0.972314 | -2.305080 | 2.136697 | 2.224677 | -2.409424 | 2.504838 | 1.0 | 1.012618 | -2.295010 | val0_2 | val1_0 | val0_0 | val1_1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2995 | 0.740834 | -1.171719 | 0.357037 | 1.863302 | -2.869701 | 0.658475 | 1.0 | 0.776097 | -1.167798 | val0_2 | val1_2 | val0_0 | val1_1 |
2996 | 1.865602 | -1.226943 | 3.212207 | 0.961388 | -0.925646 | 2.316213 | 1.0 | 1.848407 | -1.231393 | val0_2 | val1_2 | val0_1 | val1_1 |
2997 | -0.854474 | -2.369469 | 0.457335 | 3.216465 | -4.680820 | 3.322093 | 1.0 | -0.876535 | -2.366553 | val0_1 | val1_1 | val0_0 | val1_1 |
2998 | 0.892994 | -2.115929 | 2.741942 | -0.218843 | -0.097098 | 1.893662 | 1.0 | 0.890398 | -2.143039 | val0_2 | val1_1 | val0_0 | val1_1 |
2999 | 1.664427 | -1.075205 | 3.487387 | -0.182207 | 0.033374 | 2.417933 | 1.0 | 1.654759 | -1.052976 | val0_2 | val1_2 | val0_1 | val1_1 |
3000 rows × 13 columns
Using the include_cols instead of the exclude_cols parameter
The exclude_cols parameter is useful when we want to leave only a few features out of the scaling process. However, in some cases (such as the one presented above), we might want to create scalers that are applied to only a few columns. Using exclude_cols can become tedious and error-prone in these situations, since we need to pass a long list of columns that should be ignored. In these scenarios, we can instead use the include_cols parameter, which works as the inverse of exclude_cols: the data transformations are applied only to the columns listed in include_cols. Below we replicate the experiment presented two cells above, but this time using the include_cols parameter instead of exclude_cols:
[15]:
imputer = BasicImputer()
encoder = EncoderOHE()
s1 = DataRobustScaler(
    scaler_obj=RobustScaler(),
    include_cols=['num_0', 'num_1', 'num_2']
)
s2 = DataMinMaxScaler(
    scaler_obj=MinMaxScaler(),
    include_cols=['num_3']
)
s3 = DataQuantileTransformer(
    scaler_obj=QuantileTransformer(),
    include_cols=['num_4']
)
scaler = DataRobustScaler(
    scaler_obj=RobustScaler(),
    include_cols=['num_5'],
    transform_pipe=[imputer, encoder, s1, s2, s3]
)
scaler.fit(df)
scaled_df = scaler.transform(df)
scaled_df
No columns specified for imputation. These columns have been automatically identified:
[]
No columns specified for encoding. These columns have been automatically identfied as the following:
['CN_0_num_0', 'CN_1_num_1', 'CC_0_num_0', 'CC_1_num_1']
[15]:
num_0 | num_1 | num_2 | num_3 | num_4 | num_5 | label | num_c0_num_0 | num_c1_num_1 | CN_0_num_0_val0_1 | CN_0_num_0_val0_2 | CN_0_num_0_val0_3 | CN_0_num_0_val0_4 | CN_1_num_1_val1_1 | CN_1_num_1_val1_2 | CN_1_num_1_val1_3 | CC_0_num_0_val0_1 | CC_1_num_1_val1_1 | CC_1_num_1_val1_2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.118733 | 0.237866 | 0.492773 | 0.161065 | 0.970001 | 0.372973 | 1.0 | 1.796571 | -1.397257 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
1 | 0.601353 | -0.795141 | 0.621367 | 0.621755 | 0.400110 | 0.629804 | 1.0 | 3.257607 | -3.925562 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
2 | -0.175575 | 0.951508 | -0.351287 | 0.590821 | 0.057018 | 0.140249 | 1.0 | 1.041164 | 0.338390 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
3 | -0.799932 | 1.229606 | -0.544233 | 0.617707 | 0.735349 | -0.040882 | 0.0 | -0.875685 | 1.020226 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
4 | -0.177570 | -0.130169 | 0.080924 | 0.502849 | 0.379286 | 0.226987 | 1.0 | 1.012618 | -2.295010 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2995 | -0.256738 | 0.334942 | -1.097054 | 0.470205 | 0.266227 | -0.605382 | 1.0 | 0.776097 | -1.167798 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
2996 | 0.127940 | 0.312279 | 0.792817 | 0.388732 | 0.778167 | 0.141952 | 1.0 | 1.848407 | -1.231393 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
2997 | -0.802343 | -0.156593 | -1.030666 | 0.592440 | 0.024023 | 0.595418 | 1.0 | -0.876535 | -2.366553 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
2998 | -0.204698 | -0.052545 | 0.481543 | 0.282118 | 0.913498 | -0.048541 | 1.0 | 0.890398 | -2.143039 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
2999 | 0.059137 | 0.374550 | 0.974962 | 0.285427 | 0.927006 | 0.187809 | 1.0 | 1.654759 | -1.052976 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
3000 rows × 19 columns
[16]:
org_df = scaler.inverse_transform(scaled_df)
org_df
[16]:
CN_0_num_0 | CN_1_num_1 | CC_0_num_0 | CC_1_num_1 | num_0 | num_1 | num_2 | num_3 | num_4 | num_5 | label | num_c0_num_0 | num_c1_num_1 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | val0_2 | val1_2 | val0_1 | val1_1 | 1.838682 | -1.408271 | 2.758909 | -1.558914 | 0.606338 | 2.828664 | 1.0 | 1.796571 | -1.397257 |
1 | val0_1 | val1_1 | val0_1 | val1_0 | 3.249825 | -3.925450 | 2.953185 | 3.540991 | -2.340552 | 3.398367 | 1.0 | 3.257607 | -3.925562 |
2 | val0_2 | val1_3 | val0_0 | val1_1 | 0.978148 | 0.330699 | 1.483723 | 3.198539 | -4.134112 | 2.312435 | 1.0 | 1.041164 | 0.338390 |
3 | val0_1 | val1_0 | val0_0 | val1_2 | -0.847425 | 1.008353 | 1.192225 | 3.496178 | -1.120895 | 1.910650 | 0.0 | -0.875685 | 1.020226 |
4 | val0_2 | val1_0 | val0_0 | val1_1 | 0.972314 | -2.305080 | 2.136697 | 2.224677 | -2.409424 | 2.504838 | 1.0 | 1.012618 | -2.295010 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2995 | val0_2 | val1_2 | val0_0 | val1_1 | 0.740834 | -1.171719 | 0.357037 | 1.863302 | -2.869701 | 0.658475 | 1.0 | 0.776097 | -1.167798 |
2996 | val0_2 | val1_2 | val0_1 | val1_1 | 1.865602 | -1.226943 | 3.212207 | 0.961388 | -0.925646 | 2.316213 | 1.0 | 1.848407 | -1.231393 |
2997 | val0_1 | val1_1 | val0_0 | val1_1 | -0.854474 | -2.369469 | 0.457335 | 3.216465 | -4.680820 | 3.322093 | 1.0 | -0.876535 | -2.366553 |
2998 | val0_2 | val1_1 | val0_0 | val1_1 | 0.892994 | -2.115929 | 2.741942 | -0.218843 | -0.097098 | 1.893662 | 1.0 | 0.890398 | -2.143039 |
2999 | val0_2 | val1_2 | val0_1 | val1_1 | 1.664427 | -1.075205 | 3.487387 | -0.182207 | 0.033374 | 2.417933 | 1.0 | 1.654759 | -1.052976 |
3000 rows × 13 columns
Other non-scaler Transforms
As a final note, the scaler classes presented here also accept a list of other transformations that are not scalers, such as DataImputer, EncoderOHE, and CatBoostSelection, among other classes from the dataprocessing module. Just create a list with these objects and pass it to the transform_pipe parameter. These transforms are applied before the numerical columns are scaled.
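For instance, a minimal sketch (assuming default parameters for each transform) that imputes missing values and one-hot encodes the categorical columns before standardizing num_0 and num_1:

scaler = DataStandardScaler(
    include_cols=['num_0', 'num_1'],
    transform_pipe=[BasicImputer(), EncoderOHE()]
)
scaler.fit(df)
scaled_df = scaler.transform(df)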