Using Pipelines

The dataprocessing library works with the Pipeline class from SKLearn. In this notebook, we show a few examples of how to use pipelines with transformers from the dataprocessing library.

[1]:
import pandas as pd
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
import uci_dataset as database

from raimitigations.utils import evaluate_set, split_data
import raimitigations.dataprocessing as dp

In this notebook, we’ll use the thyroid disease dataset. Let’s load this dataset and take a look at it:

[2]:
df = database.load_thyroid_disease()
label_col = "sick-euthyroid"
df[label_col] = df[label_col].replace({"sick-euthyroid": 1, "negative": 0})
df
[2]:
sick-euthyroid age sex on_thyroxine query_on_thyroxine on_antithyroid_medication thyroid_surgery query_hypothyroid query_hyperthyroid pregnant ... T3_measured T3 TT4_measured TT4 T4U_measured T4U FTI_measured FTI TBG_measured TBG
0 1 72.0 M f f f f f f f ... y 1.0 y 83.0 y 0.95 y 87.0 n NaN
1 1 45.0 F f f f f f f f ... y 1.0 y 82.0 y 0.73 y 112.0 n NaN
2 1 64.0 F f f f f f f f ... y 1.0 y 101.0 y 0.82 y 123.0 n NaN
3 1 56.0 M f f f f f f f ... y 0.8 y 76.0 y 0.77 y 99.0 n NaN
4 1 78.0 F t f f f t f f ... y 0.3 y 87.0 y 0.95 y 91.0 n NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3158 0 40.0 F f f f f f f f ... y 1.2 y 76.0 y 0.90 y 84.0 n NaN
3159 0 69.0 F f f f f f f f ... y 1.8 y 126.0 y 1.02 y 124.0 n NaN
3160 0 58.0 F f f f f f f f ... y 1.7 y 86.0 y 0.91 y 95.0 n NaN
3161 0 29.0 F f f f f f f f ... y 1.8 y 99.0 y 1.01 y 98.0 n NaN
3162 0 56.0 F t f f f f f f ... y 1.8 y 139.0 y 0.97 y 143.0 n NaN

3163 rows × 26 columns

[3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3163 entries, 0 to 3162
Data columns (total 26 columns):
 #   Column                     Non-Null Count  Dtype
---  ------                     --------------  -----
 0   sick-euthyroid             3163 non-null   int64
 1   age                        2717 non-null   float64
 2   sex                        3090 non-null   object
 3   on_thyroxine               3163 non-null   object
 4   query_on_thyroxine         3163 non-null   object
 5   on_antithyroid_medication  3163 non-null   object
 6   thyroid_surgery            3163 non-null   object
 7   query_hypothyroid          3163 non-null   object
 8   query_hyperthyroid         3163 non-null   object
 9   pregnant                   3163 non-null   object
 10  sick                       3163 non-null   object
 11  tumor                      3163 non-null   object
 12  lithium                    3163 non-null   object
 13  goitre                     3163 non-null   object
 14  TSH_measured               3163 non-null   object
 15  TSH                        2695 non-null   float64
 16  T3_measured                3163 non-null   object
 17  T3                         2468 non-null   float64
 18  TT4_measured               3163 non-null   object
 19  TT4                        2914 non-null   float64
 20  T4U_measured               3163 non-null   object
 21  T4U                        2915 non-null   float64
 22  FTI_measured               3163 non-null   object
 23  FTI                        2916 non-null   float64
 24  TBG_measured               3163 non-null   object
 25  TBG                        260 non-null    float64
dtypes: float64(7), int64(1), object(18)
memory usage: 642.6+ KB

Comparing different pipelines

Using pipelines allows us to compare different pre-processing methods in an easy and efficient way. Here we will build several pipelines, each with a different set of transformations, and then compare the results obtained by each one. But before we can do this, let's build a function that calls the pipeline's fit and predict_proba methods, followed by the computation of a few metrics, such as ROC AUC, F1, precision, recall, accuracy, and the confusion matrix. To compute these metrics, we use the evaluate_set function from the utils.model_utils module.

[4]:
def fit_and_get_results_bin(
    pipeline: Pipeline,
    x_train: pd.DataFrame,
    y_train: pd.DataFrame,
    x_test: pd.DataFrame,
    y_test: pd.DataFrame,
    best_th_auc: bool = True
):
    # fit the pipeline, then evaluate its predicted probabilities over the
    # test set: evaluate_set prints the confusion matrix, ROC AUC, precision,
    # recall, F1, and accuracy
    pipeline.fit(x_train, y_train)
    pred_test = pipeline.predict_proba(x_test)
    _ = evaluate_set(y_test, pred_test, regression=False, plot_pr=False, best_th_auc=best_th_auc, classes=pipeline.classes_)

We'll also test the efficiency of applying scalers to the numerical features of the dataset. It only makes sense to scale the numerical features that don't represent one-hot encodings, so let's get the list of all float columns in the dataset. This list defines the columns to which we'll apply the scaler transformations.

[5]:
float_cols = [col for col in df.columns if 'float' in df[col].dtype.name]
float_cols
[5]:
['age', 'TSH', 'T3', 'TT4', 'T4U', 'FTI', 'TBG']
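
Equivalently, the same list can be obtained directly with pandas' select_dtypes:

# idiomatic alternative: let pandas pick the float columns
float_cols = df.select_dtypes(include="float").columns.tolist()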

Each pipeline can have different transformations. However, some of these transformations are repeated across pipelines; for example, every pipeline here needs an imputer transformer. In the following cell we instantiate a few transformers up front to simplify the pipeline creation process.

[6]:
ohe_encoder = dp.EncoderOHE(drop=False, unknown_err=False, verbose=False)
imputer = dp.BasicImputer(
    numerical={'missing_values': np.nan, 'strategy': 'constant', 'fill_value': -1},
    verbose=False
)
feat_sel = dp.CatBoostSelection(catboost_log=False, verbose=False)
cor_feat = dp.CorrelatedFeatures(save_json=False)

Now we just have to split the original dataset into train and test sets. After that, we can start comparing the different pipelines.

[7]:
train_x, test_x, train_y, test_y = split_data(df, label_col, test_size=0.2)

Pipeline 0

[8]:
pipe0 = Pipeline([
            ("encoder", ohe_encoder),
            ("imputer", imputer),
            ("estimator", SVC(probability=True)),
        ])
fit_and_get_results_bin(pipe0, train_x, train_y, test_x, test_y, best_th_auc=True)
[[501  73]
 [  7  52]]
[image: evaluation plot]
ROC AUC: 0.9299001948857261
Precision: 0.7011102362204724
Recall: 0.877089115927479
F1: 0.7456401189423771
Accuracy: 0.8736176935229067
Optimal Threshold (ROC curve): 0.1250311491470217
Optimal Threshold (Precision x Recall curve): 0.251147863435757
Threshold used: 0.1250311491470217

Pipeline 1

[9]:
pipe1 = Pipeline([
            ("encoder", ohe_encoder),
            ("imputer", imputer),
            ("scaler", dp.DataStandardScaler(include_cols=float_cols, verbose=False)),
            ("estimator", SVC(probability=True)),
        ])
fit_and_get_results_bin(pipe1, train_x, train_y, test_x, test_y, best_th_auc=True)
[[555  19]
 [  7  52]]
[image: evaluation plot]
ROC AUC: 0.9250280517333019
Precision: 0.8599694250914741
Recall: 0.9241274434536113
F1: 0.888556338028169
Accuracy: 0.9589257503949447
Optimal Threshold (ROC curve): 0.29745720239980394
Optimal Threshold (Precision x Recall curve): 0.43828324901046595
Threshold used: 0.29745720239980394

Pipeline 2

[10]:
pipe2 = Pipeline([
            ("encoder", ohe_encoder),
            ("imputer", imputer),
            ("scaler", dp.DataStandardScaler(include_cols=float_cols)),
            ("feat_sel", feat_sel),
            ("estimator", SVC(probability=True)),
        ])
fit_and_get_results_bin(pipe2, train_x, train_y, test_x, test_y, best_th_auc=True)
[[548  26]
 [  7  52]]
[image: evaluation plot]
ROC AUC: 0.9340046063898896
Precision: 0.827027027027027
Recall: 0.9180298824780015
F1: 0.8649473405183838
Accuracy: 0.9478672985781991
Optimal Threshold (ROC curve): 0.22808377020466744
Optimal Threshold (Precision x Recall curve): 0.4029785100782772
Threshold used: 0.22808377020466744
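
Comparing the three runs above: adding the scaler (Pipeline 1) greatly improves precision, F1, and accuracy over the unscaled baseline (Pipeline 0), while adding feature selection on top (Pipeline 2) yields the highest ROC AUC at a small cost in precision and accuracy relative to Pipeline 1. Since the train/test split is random, the exact numbers may vary between executions.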

Using SKLearn and Dataprocessing transformers in the same Pipeline

The transformers in SKLearn always return a numpy array, while the transformers in our dataprocessing library usually operate over a Pandas DataFrame (although they also accept numpy arrays as input). Our library uses dataframes so that specific columns can be filtered by name, just like we did when we created the DataStandardScaler object and specified that only the columns in the float_cols list should be scaled. We can also do this with a numpy array, but since numpy arrays don't have column names, we are forced to use column indices instead.
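
For instance, a quick sanity check (using the StandardScaler imported at the top of this notebook) shows that an SKLearn transformer returns a plain numpy array even when it receives a dataframe:

# fit_transform returns a numpy array: the float_cols names are gone
scaled = StandardScaler().fit_transform(df[float_cols])
print(type(scaled))  # <class 'numpy.ndarray'>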

Therefore, when we place a transformer from the SKLearn library before a transformer from the dataprocessing library in a Pipeline, we must be aware that we'll lose the column names (assuming the input to the Pipeline is indeed a Pandas dataframe) after the SKLearn transformer calls its transform() method. All transformers after that point must use column indices instead of column names, as depicted in the following example. Here we'll use SKLearn's SimpleImputer instead of our similar class, the BasicImputer, precisely so we can test the behavior of the Pipeline class with transformers from both the SKLearn and dataprocessing libraries. We'll execute the following steps in this pipeline:

  1. encode the columns “sex” and “on_thyroxine” using one-hot encoding,

  2. impute missing data by replacing it with a -100 value,

  3. apply a min-max scaler over the column “age”,

  4. apply an ordinal encoding over all of the remaining categorical variables.

[11]:
pipe = Pipeline([
            ("encoder_ohe", dp.EncoderOHE(col_encode=["sex", "on_thyroxine"])), # dataprocessing
            ("imputer", SimpleImputer(strategy="constant", fill_value=-100)),   # sklearn
            ("scaler", dp.DataMinMaxScaler(include_cols=[0])),                  # dataprocessing
            ("ordinal_encoder", dp.EncoderOrdinal(verbose=False)),              # dataprocessing
        ])
pipe.fit(train_x, train_y)
new_x_test = pipe.transform(test_x)
new_x_test
[11]:
0 1 2 3 4 5 6 7 8 9 ... 16 17 18 19 20 21 22 23 24 25
0 0.641414 0 0 0 0 0 0 0 0 0 ... 10.0 1 0.9 1 12.0 0 -100 0 0 0
1 0.929293 0 0 0 0 0 0 0 0 0 ... 162.0 1 1.03 1 158.0 0 -100 0 0 0
2 0.000000 0 0 0 0 0 0 0 0 0 ... 175.0 1 0.86 1 204.0 0 -100 0 1 0
3 0.858586 0 0 0 1 0 0 0 0 0 ... 69.0 1 0.83 1 83.0 0 -100 0 0 0
4 0.808081 0 0 0 0 0 0 0 0 0 ... 130.0 1 0.94 1 138.0 0 -100 1 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
628 0.863636 0 0 0 0 0 0 0 0 0 ... 137.0 1 1.11 1 123.0 0 -100 1 0 0
629 0.606061 0 0 0 1 0 0 0 0 0 ... 110.0 1 0.92 1 120.0 0 -100 0 0 0
630 0.000000 0 0 0 0 0 0 0 0 0 ... 79.0 1 1.0 1 80.0 0 -100 0 0 0
631 0.858586 0 0 0 1 0 0 0 0 0 ... 68.0 1 0.0 1 78.0 0 -100 0 0 0
632 0.616162 0 0 0 0 0 0 0 0 0 ... -100 0 -100 0 -100 1 110.0 0 0 0

633 rows × 26 columns

As we can see in the previous cell, we used column names when instantiating the EncoderOHE class, but column indices when specifying the columns that should be scaled by the DataMinMaxScaler. This is required because the SimpleImputer object returns a numpy array, which has no column names; from there onward, the column names are lost. We could use indices for all transformers, which would result in the same new test set, as shown in the following cell:

[12]:
pipe = Pipeline([
            ("encoder_ohe", dp.EncoderOHE(col_encode=[1, 2])),                  # dataprocessing
            ("imputer", SimpleImputer(strategy="constant", fill_value=-100)),   # sklearn
            ("scaler", dp.DataMinMaxScaler(include_cols=[0])),                  # dataprocessing
            ("ordinal_encoder", dp.EncoderOrdinal(verbose=False)),              # dataprocessing
        ])
pipe.fit(train_x, train_y)
new_x_test = pipe.transform(test_x)
new_x_test
[12]:
0 1 2 3 4 5 6 7 8 9 ... 16 17 18 19 20 21 22 23 24 25
0 0.641414 0 0 0 0 0 0 0 0 0 ... 10.0 1 0.9 1 12.0 0 -100 0 0 0
1 0.929293 0 0 0 0 0 0 0 0 0 ... 162.0 1 1.03 1 158.0 0 -100 0 0 0
2 0.000000 0 0 0 0 0 0 0 0 0 ... 175.0 1 0.86 1 204.0 0 -100 0 1 0
3 0.858586 0 0 0 1 0 0 0 0 0 ... 69.0 1 0.83 1 83.0 0 -100 0 0 0
4 0.808081 0 0 0 0 0 0 0 0 0 ... 130.0 1 0.94 1 138.0 0 -100 1 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
628 0.863636 0 0 0 0 0 0 0 0 0 ... 137.0 1 1.11 1 123.0 0 -100 1 0 0
629 0.606061 0 0 0 1 0 0 0 0 0 ... 110.0 1 0.92 1 120.0 0 -100 0 0 0
630 0.000000 0 0 0 0 0 0 0 0 0 ... 79.0 1 1.0 1 80.0 0 -100 0 0 0
631 0.858586 0 0 0 1 0 0 0 0 0 ... 68.0 1 0.0 1 78.0 0 -100 0 0 0
632 0.616162 0 0 0 0 0 0 0 0 0 ... -100 0 -100 0 -100 1 110.0 0 0 0

633 rows × 26 columns

Using column indices is not as intuitive as using column names. Indices are also more error prone, especially in a pipeline whose transformations can change the order of the columns, such as the EncoderOHE and feature selection methods. In these scenarios, the index of a column might change in the middle of the pipeline, making it hard to know exactly which column index to use for a specific operation over a specific column. It is therefore in the user's interest to keep using column names whenever possible. In the remainder of this section, we'll present three solutions to this problem.
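
Before moving on, here is a minimal illustration of the index-shift pitfall, using plain pandas get_dummies (rather than this library's EncoderOHE) just for demonstration:

# "TSH" starts at index 2 in this small slice of the dataset
demo = df[["sex", "age", "TSH"]]
print(list(demo.columns).index("TSH"))     # 2

# one-hot encoding "sex" moves the dummy columns to the end,
# silently shifting "TSH" to index 1
encoded = pd.get_dummies(demo, columns=["sex"])
print(list(encoded.columns))               # ['age', 'TSH', 'sex_F', 'sex_M']
print(list(encoded.columns).index("TSH"))  # 1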

Solution 1: use the Pipeline class with only transformers from the dataprocessing library

The first solution is straightforward: use only transformers from the dataprocessing library. Note that an estimator from SKLearn or another library can still be used at the end of the pipeline, but every transformer must come from the dataprocessing library. Since all transformers in this library operate over Pandas DataFrames and return dataframes, at no point in the pipeline is the dataframe lost. This guarantees that every transformer receives a dataframe as input, with the column names intact, so all transformers can filter columns by name.

Let's recreate the previous pipeline using only transformers from our library. For this, we'll replace the SimpleImputer with the BasicImputer, which allows us to use column names when specifying the columns the DataMinMaxScaler should be applied to:

[13]:
miss_dict = {'missing_values': np.nan, 'strategy': 'constant', 'fill_value': -100}

pipe = Pipeline([
            ("encoder_ohe", dp.EncoderOHE(col_encode=["sex", "on_thyroxine"])),                         # dataprocessing
            ("imputer", dp.BasicImputer(numerical=miss_dict, categorical=miss_dict, verbose=False)),    # dataprocessing
            ("scaler", dp.DataMinMaxScaler(include_cols=["age"])),                                      # dataprocessing
            ("ordinal_encoder", dp.EncoderOrdinal(verbose=False)),                                      # dataprocessing
        ])
pipe.fit(train_x, train_y)
new_x_test = pipe.transform(test_x)
new_x_test
[13]:
age query_on_thyroxine on_antithyroid_medication thyroid_surgery query_hypothyroid query_hyperthyroid pregnant sick tumor lithium ... TT4 T4U_measured T4U FTI_measured FTI TBG_measured TBG sex_M sex_nan on_thyroxine_t
3073 0.641414 0 0 0 0 0 0 0 0 0 ... 10.0 1 0.90 1 12.0 0 -100.0 0 0 0
1839 0.929293 0 0 0 0 0 0 0 0 0 ... 162.0 1 1.03 1 158.0 0 -100.0 0 0 0
3096 0.000000 0 0 0 0 0 0 0 0 0 ... 175.0 1 0.86 1 204.0 0 -100.0 0 1 0
1147 0.858586 0 0 0 1 0 0 0 0 0 ... 69.0 1 0.83 1 83.0 0 -100.0 0 0 0
2413 0.808081 0 0 0 0 0 0 0 0 0 ... 130.0 1 0.94 1 138.0 0 -100.0 1 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
148 0.863636 0 0 0 0 0 0 0 0 0 ... 137.0 1 1.11 1 123.0 0 -100.0 1 0 0
2526 0.606061 0 0 0 1 0 0 0 0 0 ... 110.0 1 0.92 1 120.0 0 -100.0 0 0 0
593 0.000000 0 0 0 0 0 0 0 0 0 ... 79.0 1 1.00 1 80.0 0 -100.0 0 0 0
2203 0.858586 0 0 0 1 0 0 0 0 0 ... 68.0 1 0.00 1 78.0 0 -100.0 0 0 0
1302 0.616162 0 0 0 0 0 0 0 0 0 ... -100.0 0 -100.00 0 -100.0 1 110.0 0 0 0

633 rows × 26 columns

Solution 2: use the SklearnTransformerWrapper class

If a specific transformer from the SKLearn library has no counterpart in the dataprocessing library, then the user has no alternative but to use the SKLearn transformer. The SklearnTransformerWrapper class, from the Feature-engine library, is a wrapper for SKLearn transformers that, among other features, forces the output of a transformer to be a pandas dataframe whenever the input is also a dataframe. This solves our problem, since this way the dataframe is never lost. To use it, the user must pass the transformer object to the SklearnTransformerWrapper constructor when instantiating it. The resulting object exposes the usual fit and transform methods, but now the transform method returns a pandas dataframe instead of a numpy array.

The following cell shows how to use the SklearnTransformerWrapper to wrap a transformer from the SKLearn library and then use it in a Pipeline. Before running it, make sure the feature-engine package is installed (pip install feature_engine).

[14]:
from feature_engine.wrappers import SklearnTransformerWrapper

imputer = SklearnTransformerWrapper(transformer=SimpleImputer(strategy="constant", fill_value=-100))

pipe = Pipeline([
            ("encoder_ohe", dp.EncoderOHE(col_encode=["sex", "on_thyroxine"])), # dataprocessing
            ("imputer", imputer),                                               # sklearn
            ("scaler", dp.DataMinMaxScaler(include_cols=["age"])),                  # dataprocessing
            ("ordinal_encoder", dp.EncoderOrdinal(verbose=False)),              # dataprocessing
        ])
pipe.fit(train_x, train_y)
new_x_test = pipe.transform(test_x)
new_x_test
[14]:
age query_on_thyroxine on_antithyroid_medication thyroid_surgery query_hypothyroid query_hyperthyroid pregnant sick tumor lithium ... TT4 T4U_measured T4U FTI_measured FTI TBG_measured TBG sex_M sex_nan on_thyroxine_t
3073 0.641414 0 0 0 0 0 0 0 0 0 ... 10.0 1 0.9 1 12.0 0 -100 0 0 0
1839 0.929293 0 0 0 0 0 0 0 0 0 ... 162.0 1 1.03 1 158.0 0 -100 0 0 0
3096 0.000000 0 0 0 0 0 0 0 0 0 ... 175.0 1 0.86 1 204.0 0 -100 0 1 0
1147 0.858586 0 0 0 1 0 0 0 0 0 ... 69.0 1 0.83 1 83.0 0 -100 0 0 0
2413 0.808081 0 0 0 0 0 0 0 0 0 ... 130.0 1 0.94 1 138.0 0 -100 1 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
148 0.863636 0 0 0 0 0 0 0 0 0 ... 137.0 1 1.11 1 123.0 0 -100 1 0 0
2526 0.606061 0 0 0 1 0 0 0 0 0 ... 110.0 1 0.92 1 120.0 0 -100 0 0 0
593 0.000000 0 0 0 0 0 0 0 0 0 ... 79.0 1 1.0 1 80.0 0 -100 0 0 0
2203 0.858586 0 0 0 1 0 0 0 0 0 ... 68.0 1 0.0 1 78.0 0 -100 0 0 0
1302 0.616162 0 0 0 0 0 0 0 0 0 ... -100 0 -100 0 -100 1 110.0 0 0 0

633 rows × 26 columns

Solution 3: Wrap the Pipeline class and apply solution 2 automatically

We can also create a wrapper over the Pipeline class that applies Solution 2 to the transformers in the pipeline automatically. This extension, called here the PipelineWrapper, makes a single modification: every transformer that is not from the dataprocessing library is wrapped in the SklearnTransformerWrapper class. The PipelineWrapper therefore executes the previous solution internally, removing from the user the burden of wrapping all of SKLearn's transformers. It doesn't change any other behavior of the Pipeline class, so the Pipeline class can be replaced by the PipelineWrapper without any change in the behavior of the code.

First, let's create the PipelineWrapper class, which inherits from the Pipeline class. The only modification is a call to a new method, _wrap_skl_transf(), inside the __init__ method. This method checks all transformers in the pipeline and wraps each of SKLearn's transformers in a SklearnTransformerWrapper. This guarantees that, if the input is a pandas dataframe, the transformers from SKLearn will also output a dataframe.

[15]:
from feature_engine.wrappers.wrappers import _ALL_TRANSFORMERS

class PipelineWrapper(Pipeline):

    # -----------------------------------
    def __init__(self, steps, *, memory=None, verbose=False):
        super().__init__(steps, memory=memory, verbose=verbose)
        self._wrap_skl_transf()

    # -----------------------------------
    def _wrap_skl_transf(self):
        # wrap every SKLearn transformer in a SklearnTransformerWrapper, leaving
        # dataprocessing transformers and already wrapped transformers untouched
        for i in range(len(self.steps)):
            name, transform = self.steps[i]
            if (
                not isinstance(transform, (dp.DataProcessing, SklearnTransformerWrapper))
                and transform.__class__.__name__ in _ALL_TRANSFORMERS
            ):
                self.steps[i] = (name, SklearnTransformerWrapper(transformer=transform))

Now let's use the PipelineWrapper instead of the Pipeline class. Note that nothing else changes, except that we can now use column names even after a transformer from the SKLearn library.

[16]:
pipe = PipelineWrapper([
            ("encoder_ohe", dp.EncoderOHE(col_encode=["sex", "on_thyroxine"])), # dataprocessing
            ("imputer", SimpleImputer(strategy="constant", fill_value=-100)),   # sklearn
            ("scaler", dp.DataMinMaxScaler(include_cols=["age"])),              # dataprocessing
            ("ordinal_encoder", dp.EncoderOrdinal(verbose=False)),              # dataprocessing
        ])
pipe.fit(train_x, train_y)
new_x_test = pipe.transform(test_x)
new_x_test
[16]:
age query_on_thyroxine on_antithyroid_medication thyroid_surgery query_hypothyroid query_hyperthyroid pregnant sick tumor lithium ... TT4 T4U_measured T4U FTI_measured FTI TBG_measured TBG sex_M sex_nan on_thyroxine_t
3073 0.641414 0 0 0 0 0 0 0 0 0 ... 10.0 1 0.9 1 12.0 0 -100 0 0 0
1839 0.929293 0 0 0 0 0 0 0 0 0 ... 162.0 1 1.03 1 158.0 0 -100 0 0 0
3096 0.000000 0 0 0 0 0 0 0 0 0 ... 175.0 1 0.86 1 204.0 0 -100 0 1 0
1147 0.858586 0 0 0 1 0 0 0 0 0 ... 69.0 1 0.83 1 83.0 0 -100 0 0 0
2413 0.808081 0 0 0 0 0 0 0 0 0 ... 130.0 1 0.94 1 138.0 0 -100 1 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
148 0.863636 0 0 0 0 0 0 0 0 0 ... 137.0 1 1.11 1 123.0 0 -100 1 0 0
2526 0.606061 0 0 0 1 0 0 0 0 0 ... 110.0 1 0.92 1 120.0 0 -100 0 0 0
593 0.000000 0 0 0 0 0 0 0 0 0 ... 79.0 1 1.0 1 80.0 0 -100 0 0 0
2203 0.858586 0 0 0 1 0 0 0 0 0 ... 68.0 1 0.0 1 78.0 0 -100 0 0 0
1302 0.616162 0 0 0 0 0 0 0 0 0 ... -100 0 -100 0 -100 1 110.0 0 0 0

633 rows × 26 columns
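
Since the PipelineWrapper behaves exactly like a Pipeline, we can also append an estimator and reuse the evaluation helper defined at the beginning of this notebook. The cell below is a sketch of that end-to-end flow; its metrics should be comparable to the pipelines evaluated earlier:

pipe = PipelineWrapper([
            ("encoder_ohe", dp.EncoderOHE(col_encode=["sex", "on_thyroxine"])), # dataprocessing
            ("imputer", SimpleImputer(strategy="constant", fill_value=-100)),   # sklearn (auto-wrapped)
            ("scaler", dp.DataMinMaxScaler(include_cols=["age"])),              # dataprocessing
            ("ordinal_encoder", dp.EncoderOrdinal(verbose=False)),              # dataprocessing
            ("estimator", SVC(probability=True)),
        ])
fit_and_get_results_bin(pipe, train_x, train_y, test_x, test_y)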