Model Utils

raimitigations.utils.model_utils.evaluate_set(y: ndarray, y_pred: ndarray, regression: bool = False, plot_pr: bool = True, best_th_auc: bool = True, classes: Optional[Union[list, ndarray]] = None)

Evaluates the performance of a prediction array against its true values. This function computes a set of metrics and prints them. If the problem is a classification problem, it also plots the confusion matrix and the precision x recall graph. Finally, it returns a dictionary with the following metrics, depending on whether the problem is a regression or a classification problem:

  • Classification: ROC AUC, Precision, Recall, F1, accuracy, log loss, threshold used, and final classes predicted (using the probabilities with the threshold);

  • Regression: MSE, RMSE, MAE, and R2

Parameters
  • y – an array with the true labels or true values;

  • y_pred – an array with the predicted values. For classification problems, this array must contain the probabilities for each class, with shape = (N, C), where N is the number of rows and C is the number of classes;

  • regression – if True, regression metrics are computed. If False, only classification metrics are computed;

  • plot_pr – if True, plots a graph showing the precision and recall values for different threshold values;

  • best_th_auc – if True, the best threshold is computed using the ROC graph. If False, the threshold is computed using the precision x recall graph;

  • classes – an array or list with the unique classes of the label column. These classes are used when plotting the confusion matrix.

Returns

a dictionary with the following metrics:

  • Classification: ROC AUC, Precision, Recall, F1, accuracy, log loss, threshold used, and final classes predicted (using the probabilities with the threshold);

  • Regression: MSE, RMSE, MAE, and R2

Return type

dict
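The regression metrics listed above follow their standard definitions. The sketch below computes them with the Python standard library only. It is an illustration, not the library's implementation: the function name regression_metrics and the dictionary keys are hypothetical, and the keys that evaluate_set actually returns may differ.

```python
import math


def regression_metrics(y, y_pred):
    """Return MSE, RMSE, MAE and R2 for two equal-length sequences.

    Hypothetical helper for illustration; not part of raimitigations.
    """
    if len(y) != len(y_pred) or len(y) == 0:
        raise ValueError("y and y_pred must be non-empty and of equal length")

    n = len(y)
    errors = [true - pred for true, pred in zip(y, y_pred)]

    ss_res = sum(e * e for e in errors)           # residual sum of squares
    mean_y = sum(y) / n
    ss_tot = sum((true - mean_y) ** 2 for true in y)  # total sum of squares

    mse = ss_res / n
    mae = sum(abs(e) for e in errors) / n
    # R2 is undefined when the true values are constant (ss_tot == 0).
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": r2}


metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.0, 8.0, 9.5])
print(metrics)
```

With these inputs the residuals are 0.5, 0, -1 and -0.5, so the MSE is 0.375, the MAE is 0.5, and R2 is 1 - 1.5 / 20 = 0.925.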

raimitigations.utils.model_utils.split_data(df: DataFrame, label: str, test_size: float = 0.2, full_df: bool = False, regression: bool = False, random_state: Optional[int] = None)

Splits the dataset given by df into train and test sets.

Parameters
  • df – the dataset that will be split;

  • label – the name of the label column;

  • test_size – a value in the range [0.0, 1.0] that indicates the fraction of the original dataset used as the test set. For example, if test_size = 0.2, then 20% of the original dataset will be used as a test set;

  • full_df

    If full_df is set to True, this function returns 2 dataframes: a train and a test dataframe, where both datasets include the label column given by the parameter label. Otherwise, 4 values are returned:

    • train_x: the train dataset containing all features (all columns except the label column);

    • test_x: the test dataset containing all features (all columns except the label column);

    • train_y: the train dataset containing only the label column;

    • test_y: the test dataset containing only the label column;

  • regression – if True, the problem is treated as a regression problem, and the split between train and test is random, without any stratification. If False, the problem is treated as a classification problem, and the split is stratified on the label column so that the train and test sets keep approximately the same proportion of classes;

  • random_state – controls the randomness of how the data is divided.

Returns

if full_df is set to True, this function returns 2 dataframes: a train and a test dataframe, where both datasets include the label column given by the parameter label. Otherwise, 4 values are returned:

  • train_x: the train dataset containing all features (all columns except the label column);

  • test_x: the test dataset containing all features (all columns except the label column);

  • train_y: the train dataset containing only the label column;

  • test_y: the test dataset containing only the label column;

Return type

tuple
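The stratified behaviour described for regression = False can be sketched as follows. This is a minimal standard-library illustration of the idea, not the code split_data runs: the helper name stratified_split_indices is hypothetical, and the real function may round the per-class test counts differently.

```python
import random
from collections import defaultdict


def stratified_split_indices(labels, test_size=0.2, random_state=None):
    """Split row indices so each class keeps roughly the same share
    in the train and test sets.

    Hypothetical helper for illustration; not part of raimitigations.
    """
    rng = random.Random(random_state)  # a fixed seed gives a repeatable split

    # Group the row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    return sorted(train_idx), sorted(test_idx)


labels = ["a"] * 80 + ["b"] * 20
train_idx, test_idx = stratified_split_indices(labels, test_size=0.2, random_state=42)
print(len(train_idx), len(test_idx))
```

In this example the test set receives 16 rows of class "a" and 4 rows of class "b", so both sets keep the original 80/20 class ratio. A purely random split, as used when regression = True, gives no such guarantee.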

raimitigations.utils.model_utils.train_model_fetch_results(x: DataFrame, y: Series, x_test: DataFrame, y_test: Series, model: Union[BaseEstimator, str] = 'tree', best_th_auc: bool = True, regression: bool = False)

Given train and test sets and a model (or a model name), this function instantiates the model, fits it to the train dataset, predicts the output for the test set, and then computes a set of metrics that evaluate the performance of the model on the test set. Returns a dictionary with the computed metrics (the same dictionary returned by the get_metrics() function).

Parameters
  • x – the feature columns of the train dataset;

  • y – the label column of the train dataset;

  • x_test – the feature columns of the test dataset;

  • y_test – the label column of the test dataset;

  • model

    the object of the model to be used, or a string specifying the model to be used. The string values allowed are:

    • “tree”: Decision Tree Classifier

    • “knn”: KNN Classifier

    • “xgb”: XGBoost

    • “log”: Logistic Regression

  • best_th_auc – if True, the best threshold is computed using the ROC graph. If False, the threshold is computed using the precision x recall graph;

  • regression – if True, regression metrics are computed. If False, only classification metrics are computed.

Returns

a dictionary with the computed metrics (the same dictionary returned by the get_metrics() function).

Return type

dict

raimitigations.utils.model_utils.train_model_plot_results(x: DataFrame, y: Series, x_test: DataFrame, y_test: Series, model: Union[BaseEstimator, str] = 'tree', train_result: bool = True, plot_pr: bool = True, best_th_auc: bool = True)

Given train and test sets and a model name, this function instantiates the model, fits it to the train dataset, predicts the output for the train and test sets, and then computes a set of metrics that evaluate the performance of the model on both sets. If the “model” parameter is an estimator object, then this object is used instead. The metrics are printed to stdout. Returns the trained model. This function assumes a classification problem.

Parameters
  • x – the feature columns of the train dataset;

  • y – the label column of the train dataset;

  • x_test – the feature columns of the test dataset;

  • y_test – the label column of the test dataset;

  • model

    the object of the model to be used, or a string specifying the model to be used. The string values allowed are:

    • “tree”: Decision Tree Classifier

    • “knn”: KNN Classifier

    • “xgb”: XGBoost

    • “log”: Logistic Regression

  • train_result – if True, shows the results for the train dataset. If False, shows the results only for the test dataset;

  • plot_pr – if True, plots a graph showing the precision and recall values for different threshold values;

  • best_th_auc – if True, the best threshold is computed using the ROC graph. If False, the threshold is computed using the precision x recall graph.

Returns

returns the model object used to fit the dataset provided.

Return type

reference to the model object used
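The best_th_auc parameter used by the functions in this module selects a decision threshold from the ROC graph. The documentation does not state which criterion is applied, so the sketch below assumes one common choice: the threshold that maximizes the true positive rate minus the false positive rate (Youden's J statistic). The helper name best_threshold_roc is hypothetical, and the example covers only the binary case, where scores would be the positive-class column of the (N, C) probability array.

```python
def best_threshold_roc(y_true, scores):
    """Return the score threshold that maximizes TPR - FPR (Youden's J).

    Hypothetical helper for illustration; the criterion is an assumption,
    not necessarily the one raimitigations uses.
    """
    positives = sum(1 for t in y_true if t == 1)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present in y_true")

    best_th, best_j = None, -1.0
    # Every distinct score is a candidate threshold.
    for th in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= th and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= th and t == 0)
        j = tp / positives - fp / negatives
        if j > best_j:
            best_th, best_j = th, j
    return best_th


y_true = [0, 0, 1, 1, 0, 1, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.5]

threshold = best_threshold_roc(y_true, scores)
# Final predicted classes: positive when the score reaches the threshold.
predicted = [1 if s >= threshold else 0 for s in scores]
print(threshold, predicted)
```

Here the threshold 0.6 captures three of the four positives with no false positives (J = 0.75), which is the best value among all candidates. When best_th_auc = False, the threshold is instead derived from the precision x recall graph.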