Predicting Hospital Length of Stay

Implemented with Microsoft Machine Learning Services

For the Database Analyst - Operationalize with SQL


As businesses are starting to acknowledge the power of data, leveraging machine learning techniques to grow has become a must. In order for hospitals to optimize resource allocation, it is important to predict accurately how long a newly admitted patient will stay in the hospital.

Among the key variables to learn from data are number of previous admissions as well as various diagnostic codes and lab results. (View the full data set description. )

For businesses that prefer an on-prem solution, the implementation with Microsoft Machine Learning Services is a great option, which takes advantage of the power of SQL Server and RevoScaleR (Microsoft R Server). In this template, we implemented all steps in SQL stored procedures: data preprocessing, and feature engineering are implemented in pure SQL, while data cleaning, and the model training, scoring and evaluation steps are implemented with SQL stored procedures calling R (Microsoft R Server) code.

All the steps can be executed on SQL Server client environment (SQL Server Management Studio). We provide a Windows PowerShell script which invokes the SQL scripts and demonstrates the end-to-end modeling process.

System Requirements


To run the scripts requires the following:
  • for R: SQL Server 2016 with Microsoft R Server (version 9.0.1) or higher installed and configured
  • for Python: SQL Server 2017 with Machine Learning Services (In-Database) installed and configured
  • The SQL user name and password, and the user configured properly to execute R scripts in-memory;
  • SQL Database which the user has write permission and execute stored procedures;
  • For more information about Machine Learning Services in SQL Server, please visit: https://msdn.microsoft.com/en-us/library/mt604847.aspx

Workflow Automation


Follow the PowerShell instructions to execute all the scripts described below. Click here to view the details all tables created in this solution.

The steps below describe the R solution (found in the Hospital_R database) and the Python solution (found in the Hospital_Py database).

Step 0: Creating Tables


The data set LengthOfStay.csv is provided in the Data directory.

In this step, we create a table LengthOfStay in a SQL Server database, and the data is uploaded to these tables using bcp command in PowerShell. This is done through either load_data.ps1 or through running the beginning of Length_Of_Stay.ps1.

Input:

  • Raw data: LengthOfStay.csv

Output:

  • 1 Table filled with the raw data: LengthOfStay (filled through PowerShell).
  • step0_create_tables.sql

Example:

EXEC [dbo].[compute_stats];

Step 1: Pre-Processing and Cleaning


In this step, the raw data is cleaned in-place. This assumes that the ID variable (eid) does not contain blanks. There are two ways to replace missing values:

  • The first provided stored procedure, [fill_NA_explicit], will replace the missing values with “missing” (character variables) or -1 (numeric variables). It should be used if it is important to know where the missing values were.

  • The second stored procedure, [fill_NA_mode_mean], will replace the missing values with the mode (categorical variables) or mean (float variables).

If running the stored procedures yourself you will have the opportunity to choose between the two stored procedures. During the initial deployment [fill_NA_mode_mean] will be automatically used.

Input:

  • 1 Table filled with the raw data: LengthOfStay (filled through PowerShell).

Output:

  • The same table, with missing values replaced.
  • step1_data_processing.sql

Example:

EXEC [dbo].[fill_NA_mode_mean] @input='LengthOfStay', @output = 'LoS0';

Step 2: Feature Engineering


In this step, we create a stored procedure [dbo].[feature_engineering] that designs new features:

  • The continuous laboratory measurements (e.g. hemo, hematocritic, sodium, glucose etc.) are standardized: we substract the mean and divide by the standard deviation.
  • number_of_issues: the total number of preidentified medical conditions.

Input:

  • LengthOfStay table.

Output:

  • LoS table containing new features.
  • step2_feature_engineering.sql

Example:

EXEC [dbo].[feature_engineering]  @input = 'LoS0', @output = 'LoS', @is_production = '0';
EXEC [dbo].[get_column_info] @input = 'LoS';

Step 3a: Splitting the data set


In this step, we create a stored procedure [dbo].[splitting] that splits the data into a training set and a testing set. The user has to specify a splitting percentage. For example, if the splitting percentage is 70, 70% of the data will be put in the training set, while the other 30% will be assigned to the testing set. The eid that will end in the training set, are stored in the table Train_Id.

Input:

  • LoS table.

Output:

  • Train_Id table containing the eid that will end in the training set.
  • step3a_splitting.sql

Example:

EXEC [dbo].[splitting] @splitting_percent = 70, @input = 'LoS';

Step 3b: Training


In this step, we create a stored procedure [dbo].[train_model] that trains a regression Random Forest (rxDForest implementation) on the training set. The trained models are serialized then stored in a table called “Models”.

Input:

  • LoS and Train_Id tables.

Output:

  • Models table containing the regression RF trained models.
  • step3b_training.sql

Example:

exec [dbo].[train_model] @model_name = 'RF', @dataset_name = 'LoS';

Step 3c: Scoring


In this step, we create a stored procedure [dbo].[score] that scores a trained model on the testing set. The Predictions are stored in a SQL table.

Input:

LoS,Train_Id, and Models tables.

Output:

Table(s) storing the predictions from the tested model(s).

step3c_scoring.sql

Example:

EXEC [dbo].[score] @model_name = 'RF',
	@inquery = 'SELECT * FROM LoS WHERE eid NOT IN (SELECT eid FROM Train_Id)',
	@output = 'Forest_Prediction';

Step 3d: Evaluating


In this step, we create a stored procedure [dbo].[evaluate] that computes regression performance metrics written in Metrics.

Input:

Table(s) storing the predictions from the tested model(s).

Output:

Metrics table containing the performance metrics of the model(s).

step3d_evaluating.sql Finally, a table LoS_Predictions stores data from the testing set as well as predicted discharge dates from the rxFastTrees model, and will be used for PowerBI. The stored procedure that creates it can be found in the step4_full_table.sql file.

Example:

exec [dbo].[evaluate] @model_name  = 'RF',
	@predictions_table = 'Forest_Prediction';

Step 4: The Production Pipeline


The stored procedure prod_lengthofstay can be used to score new data during production. To run this with the sample production data you can execute this procdure as follows:

USE [Hospital_R]
GO

DECLARE	@return_value int

EXEC	@return_value = [dbo].[prod_lengthofstay]
        @input = N'LengthOfStay_Prod',
        @dev_db = N'Hospital_R'

SELECT	'Return Value' = @return_value

GO

LengthOfStay_Prod is then cleaned as in Step 1, and a feature engineered view is created as in Step 2 (both using the Stats table). Finally, the view is scored on the model(s) stored in the Models table, using the ColInfo table information. The predictions are stored in a SQL table.