Utilities (Case Study)
- case_study_utils.add_results_df(result_df, result_stat, test_name)
Adds a new experiment to a data frame that holds the results of multiple experiments. The new experiment is identified by the test_name parameter and described by the result dictionary passed in the result_stat parameter. Returns the result_df data frame with the data of the new experiment added.
- Parameters
result_df – a data frame containing the data of experiments executed previously (or None if this is the first experiment);
result_stat – the result dictionary returned by the result_statistics function containing the metrics for the new experiment to be added to the result_df data frame;
test_name – the name of the new experiment.
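The accumulation pattern described above can be sketched as follows. This is a minimal illustration under assumptions, not the library's implementation: add_results_sketch is a hypothetical stand-in, the one-row-per-experiment layout is assumed, and the metric names roc and f1 are invented for the example (the real keys are whatever result_statistics returns).

```python
import pandas as pd

def add_results_sketch(result_df, result_stat, test_name):
    """Hypothetical stand-in: append one experiment as a new row."""
    # One row per experiment, tagged with the experiment name.
    row = pd.DataFrame([{"test": test_name, **result_stat}])
    if result_df is None:
        # First experiment: the new row is the whole data frame.
        return row
    return pd.concat([result_df, row], ignore_index=True)

# Metric names and values below are assumptions made for illustration only.
res = add_results_sketch(None, {"roc": 0.70, "f1": 0.61}, "baseline")
res = add_results_sketch(res, {"roc": 0.73, "f1": 0.64}, "with_scaling")
print(res)
```

Passing None for the first call mirrors the documented behavior of result_df for the first experiment.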
- case_study_utils.artificial_ctgan(train_x, train_y, strategy, savefile, epochs=400)
Creates a Synthesizer object that uses the CTGAN model and the strategy specified by the strategy parameter, fits it on the train_x and train_y datasets, and then generates new artificial instances. Returns the original train_x and train_y sets augmented with the new artificial instances.
- Parameters
train_x – the data frame containing only the feature columns of the training set;
train_y – the data frame containing only the label column of the training set;
strategy – the strategy used for the transform method of the Synthesizer class;
epochs – the number of epochs that the Synthesizer class should train for.
- case_study_utils.artificial_smote(train_x, train_y, strategy, under_sample)
Creates a Rebalance object using the oversampling strategy specified by the strategy parameter and the undersampling method specified by the under_sample parameter, fits it on the train_x and train_y datasets, and then resamples the training set provided. Returns the resampled train_x and train_y sets.
- Parameters
train_x – the data frame containing only the feature columns of the training set;
train_y – the data frame containing only the label column of the training set;
strategy – the strategy used for the oversampling method. This is the same as the strategy_over parameter from the Rebalance class;
under_sample – the undersampling method used. This is the same as the under_sampler parameter from the Rebalance class.
- case_study_utils.feature_selection(train_x, train_y, test_x, feat_sel_type='fwd')
Creates a feature selection object, fits it on the train_x and train_y datasets, and then removes the features that were not selected from the datasets train_x and test_x. Returns the datasets train_x and test_x containing only the selected features.
- Parameters
train_x – the data frame containing only the feature columns of the training set;
train_y – the data frame containing only the label column of the training set;
test_x – the data frame containing only the feature columns of the test set;
feat_sel_type – specifies which feature selection approach is used:
'forward': SeqFeatSelection object using the forward strategy;
'backward': SeqFeatSelection object using the backward strategy;
'catboost': CatBoostSelection object.
- case_study_utils.plot_results(res_df, y_lim=[0.5, 0.75])
Creates a bar plot with the metrics of multiple experiments.
- Parameters
res_df – the data frame containing the metrics of one or more experiments. Must be a data frame returned by the add_results_df function;
y_lim – a list with two float values specifying the range of the Y axis of the bar plot.
- case_study_utils.remove_corr_feat(df, label_col)
Creates a CorrelatedFeatures object, fits it on the dataset df, and then removes a subset of the correlated features from the dataset. Returns the dataset without the removed features.
- Parameters
df – the full data frame to be analyzed;
label_col – the column name or column index of the label column.
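The idea behind removing correlated features can be sketched with plain pandas. This is an illustration under assumptions, not the CorrelatedFeatures implementation: drop_correlated_sketch, the Pearson correlation measure, and the 0.9 threshold are all hypothetical choices made for the example, and label_col is assumed to be a column name rather than an index.

```python
import numpy as np
import pandas as pd

def drop_correlated_sketch(df, label_col, threshold=0.9):
    """Hypothetical stand-in: drop one feature from each highly correlated pair."""
    features = df.drop(columns=[label_col])
    corr = features.corr().abs()
    # Keep only the upper triangle so each feature pair is inspected once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    # A column is dropped if it correlates strongly with any earlier column.
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],  # perfectly correlated with "a"
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
    "label": [0, 1, 0, 1, 0],
})
reduced = drop_correlated_sketch(df, "label")
print(list(reduced.columns))
```

Only one feature of each correlated pair is removed, so the information carried by the pair is still represented in the returned data frame.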
- case_study_utils.result_statistics(result_list)
Builds a dictionary with the statistics of the results obtained.
- Parameters
result_list – a list of result metrics. Each element of this list must be a list of result metrics returned by the utils.train_model_fetch_results function.
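A rough sketch of this kind of aggregation is shown below. It rests on assumptions the documentation above does not confirm: each run is assumed to be a flat list of numeric metrics in a fixed order, the statistics are assumed to be the mean and standard deviation, and result_statistics_sketch and the metric names are hypothetical.

```python
from statistics import mean, stdev

def result_statistics_sketch(result_list, metric_names):
    """Hypothetical stand-in: summarize each metric over repeated runs."""
    stats = {}
    for i, name in enumerate(metric_names):
        # Collect the i-th metric from every run.
        values = [run[i] for run in result_list]
        stats[name] = {"mean": mean(values), "std": stdev(values)}
    return stats

# Three repeated runs, each reporting two made-up metrics.
runs = [[0.70, 0.60], [0.72, 0.64], [0.74, 0.62]]
summary = result_statistics_sketch(runs, ["roc", "f1"])
print(summary)
```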
- case_study_utils.transform_num_data(train_x, test_x, scaler_ref, num_col)
Creates a scaler object (specified by the scaler_ref parameter), fits it on the train_x dataset, and then applies the scaler to the train_x and test_x datasets. Returns the train_x and test_x datasets with all numerical columns specified by num_col scaled.
- Parameters
train_x – the data frame containing only the feature columns of the training set;
test_x – the data frame containing only the feature columns of the test set;
scaler_ref – the class reference for the scaler to be used. Must be one of the scalers implemented in the dataprocessing.scaler submodule;
num_col – a list with the name of the numerical columns.
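The fit-on-train, apply-to-both pattern described above can be sketched as follows. StandardScalerSketch and transform_num_data_sketch are hypothetical stand-ins written for this example; they do not reproduce the interface of the scalers in the dataprocessing.scaler submodule. The point is that the scaling statistics come only from train_x, so no information from the test set leaks into the transformation.

```python
import pandas as pd

class StandardScalerSketch:
    """Hypothetical scaler class; not one of the library's scalers."""

    def fit(self, df, cols):
        # Learn the statistics from the given data frame only.
        self.cols = list(cols)
        self.mean = df[self.cols].mean()
        self.std = df[self.cols].std(ddof=0)
        return self

    def transform(self, df):
        out = df.copy()
        out[self.cols] = (df[self.cols] - self.mean) / self.std
        return out

def transform_num_data_sketch(train_x, test_x, scaler_ref, num_col):
    # scaler_ref is a class reference, so it is instantiated here.
    scaler = scaler_ref().fit(train_x, num_col)
    return scaler.transform(train_x), scaler.transform(test_x)

train_x = pd.DataFrame({"age": [20.0, 30.0, 40.0], "city": ["a", "b", "c"]})
test_x = pd.DataFrame({"age": [50.0], "city": ["a"]})
train_s, test_s = transform_num_data_sketch(
    train_x, test_x, StandardScalerSketch, ["age"]
)
print(train_s)
print(test_s)
```

Columns not listed in num_col (here, city) pass through unchanged.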