For the Data Scientist

This page describes the SQL Server VM on Azure SQL Server On-Premises HDInsight Spark solution.

For the Data Scientist - Develop with R

Loan Credit Risk

Development Stage

Step 0: Create Intermediate Directories
Step 1: Merging and Cleaning
Step 2: Splitting and Feature Engineering
Step 3: Training, Testing and Evaluating
Step 4: Operational Metrics Computation and Scores Transformation

This implementation on Azure SQL Server ML Services is a great option which takes advantage of the power of SQL Server and RevoScaleR (Microsoft ML Server).

This solution package shows how to pre-process data (cleaning and feature engineering), train prediction models, and perform scoring on the SQL Server machine.

Data scientists who are testing and developing solutions can work from the convenience of their R IDE on their client machine, while setting the computation context to SQL (see R folder for code). They can also deploy the completed solutions to SQL Server (2016 or higher) by embedding calls to R in stored procedures (see SQLR folder for code). These solutions can then be further automated by the use of SQL Server Integration Services and SQL Server agent: a PowerShell script (.ps1 file) automates the running of the SQL code.

Loan Credit Risk

When a financial institution examines a request for a loan, it is crucial to assess the risk of default to determine whether to grant it. This solution is based on simulated data for a small personal loan financial institution, containing the borrower’s financial history as well as information about the requested loan. View more information about the data.

In this solution, the final scored database table Scores is created in SQL Server. This data is then visualized in PowerBI.

To try this out yourself, visit the Quick Start page.

Below is a description of what happens in each of the steps: data preparation, feature engineering, model development, prediction, and deployment in more detail.

The file modeling_main.R enables the user to define the input and call all the steps. Inputs are: paths to the raw data files, database name, server name, username and password. The database is created if it does not not already exist, and the connection string as well as the SQL compute context are defined.

Step 1: Merging and Cleaning

In this step, the raw data is loaded into SQL in two tables called Loan and Borrower. They are then merged into one, Merged.

Then, if there are missing values, the data is cleaned by replacing missing values with the mode (categorical variables) or mean (float variables). This assumes that the ID variables (loanId and memberId) as well as loanStatus do not contain blanks.

The cleaned data is written to the SQL table Merged_Cleaned. The Statistics are written to SQL if you want to run a batch scoring from SQL after a development stage in R.

Input:

Raw data: Loan.csv and Borrower.csv.

Output:

Loan and Borrower SQL tables with the raw data.
Merged_Cleaned SQL table , with missing values replaced if applicable.
Stats SQL table , with global means or modes for every variable.

Related files:

step1_preprocessing.R

Step 2: Splitting and Feature Engineering

Visualize

For feature engineering, we want to design new features:

Categorical versions of all the numeric variables. This is done for interpretability and is a standard practice in the Credit Score industry.
isBad: the label, specifying whether the loan has been charged off or has defaulted (isBad = 1) or if it is in good standing (isBad = 0), based on loanStatus.

This is done by following these steps:

Create the label isBad with rxDataStep function into the table Merged_Labeled.

Create the label isBad with rxDataStep function. Outputs are written to the HDFS directory “MergedLabeled”. We create at the same time the variable hashCode with values corresponding to hashing loanId to integers. It will be used for splitting. This hashing function ensures repeatability of the splitting procedure.
Split the data set into a training and a testing set. This is done by selecting randomly 70% of loanId to be part of the training set. In order to ensure repeatability, loanId values are mapped to integers through a hash function, with the mapping and loanId written to the Hash_Id SQL table. The splitting is performed prior to feature engineering instead of in the training step because the feature engineering creates bins based on conditional inference trees that should be built only on the training set. If the bins were computed with the whole data set, the evaluation step would be rigged.

Split the data set into a training and a testing set. This is done by selecting randomly a proportion (equal to the user-specified splitting ratio) of the MergedLabeled data. The output is written to the “Train” directory.
Compute the bins that will be used to create the categorical variables with smbinning. Because some of the numeric variables have too few unique values, or because the binning function did not return significant splits, we decided to manually specify default bins in case smbinning does not return the splits. These default bins have been determined through an analysis of the data or through running smbinning on a larger data set.
The bins computation is optimized by running smbinning in parallel across the different cores of the server, through the use of rxExec function applied in a Local Parallel (localpar) compute context. The rxElemArg argument it takes is used to specify the list of variables (here the numeric variables names) we want to apply smbinning on. They are saved to SQL in case you want to run a production stage with SQL after running a development stage in R.
Bucketize the variables based on the computed/specified bins with the function bucketize, wrapped into an rxDataStep function. The final output is written into the SQL table Merged_Features.
Finally, we convert the newly created variables from character to factors, and save the variable information in the directory LocalModelsDir, to be used for the Production or Web Scoring stages. The data with the correct variable types is written to the directory “MergedFeaturesFactors” on HDFS.

Input:

Merged_Cleaned SQL table.
Working directories on the edge node and HDFS.
The splitting ratio, corresponding to the proportion of the input data set that will go to the training set.

Output:

Merged_Features SQL table containing new features.
Hash_Id SQL table containing the loanId and the mapping through the hash function.
Bins SQL table containing the serialized list of cutoffs to be used in a future Production stage.
Cutoffs saved to the local edge node in the LocalModelsDir folder.
Factor information saved to the local edge node in the LocalModelsDir folder.
Analytical data set with correct variable types MergedFeaturesFactors on HDFS in xdf format.

Related files:

step2_feature_engineering.R

Step 3: Training, Testing and Evaluating

Visualize

After converting the strings to factors (with stringsAsFactors = TRUE), we get the variables information (types and levels) of the Merged_Features SQL table with rxCreateColInfo. We then point to the training and testing sets with the correct column information.

Then we build a Logistic Regression Model on the training set. The trained model is serialized and uploaded to a SQL table Model if needed later, through an Odbc connection.

Training a Logistic Regression for loan credit risk prediction is a standard practice in the Credit Score industry. Contrary to more complex models such as random forests or neural networks, it is easily understandable through the simple formula generated during the training. Also, the presence of bucketed numeric variables helps understand the impact of each category and variable on the probability of default. The variables used and their respective coefficients, sorted by order of magnitude, are stored in the data frame Logistic_Coeff returned by the step 3 function.

Finally, we compute predictions on the testing set, as well as performance metrics:

KS (Kolmogorov-Smirnov) statistic. The KS statistic is a standard performance metric in the credit score industry. It represents how well the model can differenciate between the Good Credit applicants and the Bad Credit applicants in the testing set. We also draw the KS plot which corresponds to two cumulative distributions of the predicted probabilities. One is a subset of the predictions for which the observed values were bad loans (is_bad = 1) and the other concerns good loans (is_bad = 0). KS will be the biggest distance between those two curves.

Visualize

Various classification performance metrics computed on the confusion matrix. These are dependent on the threshold chosen to decide whether to classify a predicted probability as good or bad. Here, we use as a threshold the point of the x axis in the KS plot where the curves are the farthest possible.
AUC (Area Under the Curve) for the ROC. This represents how well the model can differenciate between the Good Credit applicants from the Bad Credit applicants given a good decision threshold in the testing set. We draw the ROC, representing the true positive rate in function of the false positive rate for various possible cutoffs.

Visualize

The Lift Chart. The lift chart represents how well the model can perform compared to a naive approach. For instance, at the level where a naive effort could produce a 10% rate of positive predictions, we draw a vertical line on x = 0.10 and read the lift value where the vertical line crosses the lift curve. If the lift value is 3, it means that the model would produce 3 times the 10%, ie. 30% rate of positive predictions.

Visualize

Input:

Merged_Features SQL table containing new features.
Hash_Id SQL table containing the loanId and the mapping through the hash function.
Working directories on the edge node and HDFS.
The splitting ratio, corresponding to the proportion of the input data set that will go to the training set. It should be the same as the one used in Step 2.

Output:

Model SQL table containing the serialized logistic regression model.
Logistics_Coeff data frame returned by the function. It contains variables names and coefficients of the logistic regression formula. They are sorted in decreasing order of the absolute value of the coefficients.
Predictions_Logistic SQL table containing the predictions made on the testing set.
Column_Info SQL table containing the serialized list of factor levels to be used in a future Production stage.
Performance metrics returned by the step 3 function.
Logistic Regression model saved to the local edge node in the LocalModelsDir folder.
Logistic Regression formula saved to the local edge node in the LocalModelsDir folder.
Prediction results given by the model on the testing set, “PredictionsLogistic” on HDFS in xdf format.

Related files:

step3_train_score_evaluate.R

Step 4: Operational Metrics Computation and Scores Transformation

Visualize

In this step, we create two functions compute_operational_metrics, and apply_score_transformation.

The first, compute_operational_metrics will:

Apply a sigmoid function to the output scores of the logistic regression, in order to spread them in [0,1] and make them more interpretable. This sigmoid uses the average predicted score, which is saved to SQL in case you want to run a production stage through SQL after a development stage with R.
Compute bins for the scores, based on quantiles (we compute the 1%-99% percentiles).
Take each lower bound of each bin as a decision threshold for default loan classification, and compute the rate of bad loans among loans with a score higher than the threshold.

It outputs the data frame Operational_Metrics, which is also saved to SQL. It can be read in the following way: If the score cutoff of the 91th score percentile is 0.9834, and we read a bad rate of 0.6449, this means that if 0.9834 is used as a threshold to classify loans as bad, we would have a bad rate of 64.49%. This bad rate is equal to the number of observed bad loans over the total number of loans with a score greater than the threshold.

The second, apply_score_transformation will:

Apply the same sigmoid function to the output scores of the logistic regression, in order to spread them in [0,1].
Asssign each score to a percentile bin with the bad rates given by the Operational_Metrics table.

Input:

(assume the predictions on the testing set, PredictionsLogistic, are already created there by Step 3)

Predictions_Logistic SQL table storing the predictions from the tested model.
Working directories on the edge node and HDFS.

Output:

Operational_Metrics SQL table and data frame containing the percentiles from 1% to 99%, the scores thresholds each one corresponds to, and the observed bad rate among loans with a score higher than the corresponding threshold.
Scores SQL table containing the transformed scores for each record of the testing set, together with the percentiles they belong to, the corresponding score cutoff, and the observed bad rate among loans with a higher score than this cutoff.
Scores_Average SQL table , with the average score on the testing set, to be used in a future Production stage.
Average of the predicted scores on the testing set of the Development stage, saved to the local edge node in the LocalModelsDir folder.
Operational Metrics saved to the local edge node in the LocalModelsDir folder.
Scores on HDFS in xdf format. It contains the transformed scores for each record of the testing set, together with the percentiles they belong to, the corresponding score cutoff, and the observed bad rate among loans with a higher score than this cutoff.
ScoresData on HDFS in Hive format for visualizations in PowerBI.

Related files:

step4_operational_metrics.R

The modeling_main.R script uses the Operational_Metrics table to plot the rates of bad loans among those with scores higher than each decision threshold. The decision thresholds correspond to the beginning of each percentile-based bin.

For example, if the score cutoff of the 91th score percentile is 0.9834, and we read a bad rate of 0.6449. This means that if 0.9834 is used as a threshold to classify loans as bad, we would have a bad rate of 64.49%. This bad rate is equal to the number of observed bad loans over the total number of loans with a score greater than the threshold.

Production Stage

The R code from each of the above steps is operationalized in the SQL Server as stored procedures. In the Production pipeline, the data from the files Loan_Prod.csv and Borrower_Prod.csv is uploaded through PowerShell to the Loan_Prod and Borrower_Prod tables.

The tables Stats, Bins, Column_Info, Model, Operational_Metrics and Scores_Average created during the Development pipeline are then moved to the Production database.

Visualize Results

The score results are reside in the table Scores of the Loans database. The production data is in a new database, Loans_Prod, in the table Scores_Prod. The final step of this solution visualizes both predictions tables in PowerBI.

The score results are in the Hive table ScoresData (testing set from the Development Stage), and ScoresData_Prod (data set from the Production stage). The final step of this solution visualizes predictions of both the test and productions results.

See For the Business Manager for details of the PowerBI dashboard.

System Requirements

The following are required to run the scripts in this solution:

SQL Server (2016 or higher) with Microsoft ML Server (version 9.1.0 or higher) installed and configured.
The SQL user name and password, and the user configured properly to execute R scripts in-memory.
SQL Database which the user has write permission and execute stored procedures.
For more information about SQL server and ML Services, please visit: https://docs.microsoft.com/en-us/sql/advanced-analytics/what-s-new-in-sql-server-machine-learning-services

Template Contents

View the contents of this solution template.

To try this out yourself:

View the Quick Start.

< Home

Loan Credit Risk