For the Database Analyst - Operationalize with SQL
This solution package shows how to preprocess images (cleaning and feature engineering), train prediction models, and perform scoring on the SQL Server machine with stored procedures that include Python code.
All the steps can be executed in a SQL Server client environment such as SQL Server Management Studio. We provide a Windows PowerShell script that invokes the SQL scripts and demonstrates the end-to-end modeling process.
System Requirements
Running the scripts requires the following:
- SQL Server 2017 with Machine Learning Services (In-Database) installed and configured
- A SQL user name and password, with the user configured properly to execute Python scripts in-memory
- A SQL database in which the user has permission to write and to execute stored procedures
- For more information about Machine Learning Services in SQL Server, please visit: https://docs.microsoft.com/en-us/sql/advanced-analytics/what-s-new-in-sql-server-machine-learning-services
Workflow Automation
Follow the PowerShell instructions to execute all the scripts described below. The SQL database tables created in this solution are described in the steps that follow.
The SQL stored procedure Initial_Run_Once_Py can be used to re-run steps 1-5 below.
Step 0: Creating Tables
Input:
- Images in the directory C:\Solution\ImageSimilarity\data\fashionTexture are used to populate the FileTable created in this step
Output:
- ImageStore
Example:
EXEC CreateTables
Then copy images into the directory \\computer-name\MSSQLSERVER\FileTableimages\ImageStore
Step 1: Featurization of images with pre-trained DNN model
This step generates features from the images using a pre-trained Resnet in microsoftml. The input is the FileTable @image_table, which contains the images; the output is the SQL table @feature_table, which stores each image's path, label, and DNN features. The dimension of the features depends on which Resnet model is used in this step. Here we used Resnet18, which generates 512-dimensional features for each image.
The stored procedure FeaturizeImages contains three steps:
- First, get the image paths from the FileTable and map the distinct categories of all the images to factor labels.
- Second, assign a label to each image based on its category.
- Third, calculate the features with the microsoftml library, given the image paths. You can find the code in image_similarity/image_similarity_utils.py.
Input:
- ImageStore table
Output:
- features table
Example:
EXEC FeaturizeImages 'ImageStore', 'features'
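The label-mapping part of FeaturizeImages can be sketched in plain Python. This is an illustrative sketch, not the stored procedure's actual code; the function name and example paths are hypothetical, and the DNN featurization itself is done by the pre-trained Resnet18 in microsoftml (see image_similarity/image_similarity_utils.py).

```python
from pathlib import Path

def map_images_to_labels(image_paths):
    # Map each distinct category (the image's parent folder name) to an
    # integer factor label, mirroring the first two steps of FeaturizeImages.
    categories = sorted({Path(p).parent.name for p in image_paths})
    label_of = {c: i for i, c in enumerate(categories)}
    return [(p, Path(p).parent.name, label_of[Path(p).parent.name])
            for p in image_paths]

# Hypothetical paths under the fashionTexture data directory:
rows = map_images_to_labels([
    "C:/Solution/ImageSimilarity/data/fashionTexture/dotted/a.jpg",
    "C:/Solution/ImageSimilarity/data/fashionTexture/striped/b.jpg",
    "C:/Solution/ImageSimilarity/data/fashionTexture/dotted/c.jpg",
])
```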
Step 2: Prepare training/testing/evaluation set
This step prepares the training/testing/evaluation image sets. Here is how the training/testing/evaluation sets are generated:
- Randomly split all the images into training and testing sets based on category information and the train/test ratio. Users can change the parameter @ratioTrainTest according to the number of total images they have. For example, if @ratioTrainTest = 0.7, then for each category 70% of the images are randomly selected as training images and the remaining 30% as testing images.
- Once the testing images are inserted into the SQL table, we generate the evaluation image set from the testing images, since we do not want the evaluation images to overlap with the training images.
- Randomly select images from each category as query images, then for each query image randomly select 1 positive image from the same category and some negative images from the other categories. Users can set the parameter @queryImagePerCat to decide how many query images to select from each category, and the parameter @negImgsPerQueryImg to decide how many negative images to select for each query image.
- For example, in this sample we set @queryImagePerCat = 20 and @negImgsPerQueryImg = 100. The evaluation set then contains 220 query images, since the image set contains 11 categories, and each query image has 101 candidates (1 positive image and 100 negative images).
Input:
- features table
Output:
- training_images table
- testing_images table
- evaluation_images table
Example:
EXEC PrepareData 'features', 'training_images', 'testing_images', 'evaluation_images', 0.75, 20, 100
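The splitting and pair-generation logic above can be sketched as follows. This is a minimal stand-alone sketch with scaled-down parameters (the sample uses 20 query images per category and 100 negatives per query); the function and variable names are ours, not the stored procedure's.

```python
import random
from collections import defaultdict

def prepare_data(image_labels, ratio_train_test=0.7,
                 query_image_per_cat=1, neg_imgs_per_query_img=3):
    # Per-category train/test split, as in PrepareData.
    by_cat = defaultdict(list)
    for img, cat in image_labels:
        by_cat[cat].append(img)

    train, test = [], []
    for cat, imgs in by_cat.items():
        imgs = imgs[:]
        random.shuffle(imgs)
        n_train = int(len(imgs) * ratio_train_test)
        train += [(i, cat) for i in imgs[:n_train]]
        test += [(i, cat) for i in imgs[n_train:]]

    # Evaluation pairs are built from the testing images only: for each
    # query image, 1 positive from the same category and
    # neg_imgs_per_query_img negatives from the other categories.
    test_by_cat = defaultdict(list)
    for img, cat in test:
        test_by_cat[cat].append(img)

    pairs = []  # (query, candidate, is_positive)
    for cat, imgs in test_by_cat.items():
        for q in random.sample(imgs, min(query_image_per_cat, len(imgs))):
            positives = [i for i in imgs if i != q]
            if positives:
                pairs.append((q, random.choice(positives), 1))
            negatives = [i for c, others in test_by_cat.items() if c != cat
                         for i in others]
            for n in random.sample(negatives,
                                   min(neg_imgs_per_query_img, len(negatives))):
                pairs.append((q, n, 0))
    return train, test, pairs

random.seed(0)
imgs = [(f"cat{c}_{i}.jpg", f"cat{c}") for c in range(2) for i in range(10)]
train, test, pairs = prepare_data(imgs)
```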
Step 3: Training multi-class classifier
Once the features are computed and the training and testing images are inserted into the SQL table, we can use them to train a neural network model using the microsoftml library and then save the model into a SQL table.
- Get the DNN features for the training and testing images from the feature table @feature_table, then train a multi-class classifier using the neural network algorithm in the microsoftml library. Finally, evaluate the performance of the classifier on the testing images.
- Overall accuracy is calculated to measure the performance of the classifier:

| Pre-trained model | Classifier | Accuracy on train set | Accuracy on test set |
| --- | --- | --- | --- |
| Resnet18 | rx_neural_network | 89.7% | 75.1% |

- Get the predicted scores of all the images in the training and testing tables using the trained classifier, and save the predicted scores into the SQL table @scores_table. Here we use all the images as the candidates for the last step. Users can also supply their own candidate images: first featurize the candidate images using the pre-trained Resnet, then load the classifier to calculate the predicted scores for those candidates.
Input:
- features table
- training_images table
- testing_images table
Output:
- scores table
- model table
Example:
EXEC TrainClassifier 'features', 'training_images', 'testing_images', 'scores', 'model'
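The overall-accuracy measure used above can be illustrated as follows. This is a sketch with made-up 3-class score vectors; the actual classifier is trained with rx_neural_network in microsoftml, and in the sample each score vector is 11-dimensional.

```python
def overall_accuracy(scores, labels):
    # The predicted label for an image is the argmax of its per-class
    # score vector; overall accuracy is the fraction predicted correctly.
    correct = sum(1 for s, y in zip(scores, labels)
                  if max(range(len(s)), key=s.__getitem__) == y)
    return correct / len(labels)

# Hypothetical 3-class score vectors and true labels:
acc = overall_accuracy(
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4], [0.6, 0.3, 0.1]],
    [0, 1, 2, 1],
)
```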
Step 4: Evaluate model
Once the model and the predicted scores of all the images are saved into SQL tables, we can get the predicted scores from the @scores_table for all the image pairs in the evaluation table @evaluation_table. Based on the predicted scores, we calculate the distance between each image pair to measure their similarity, so that we can evaluate the performance of the model in terms of ranking.
- Load the predicted scores for all the images. In this sample the image set contains 11 categories, so the predicted score is an 11-dimensional vector for each image.
- Load the image pairs from the evaluation table. For each image pair we get two 11-dimensional vectors, and we calculate the L2 and Cosine distances between these two vectors to measure similarity. So for each image pair, we get two distances.
- We calculate top 1, 2, 4, 5, 8, 10, 15, 20, 28 and 32 accuracy to measure the ranking performance.
Input:
- scores table
- evaluation_images table
Output:
- accuracy measures
Example:
EXEC EvaluateModel 'scores', 'evaluation_images'
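The two distance measures and the top-k ranking accuracy described above can be sketched as below. These are illustrative helper functions (the names are ours, not the stored procedure's); top-k accuracy here is the fraction of query images whose positive candidate lands in the top k when candidates are ranked by ascending distance.

```python
import math

def l2_distance(u, v):
    # Euclidean distance between two score vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    # 1 - cosine similarity; smaller means more similar.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def top_k_accuracy(ranked_pairs, k):
    # ranked_pairs: {query: [(distance, is_positive), ...]}
    hits = 0
    for cands in ranked_pairs.values():
        top = sorted(cands)[:k]
        hits += any(pos for _, pos in top)
    return hits / len(ranked_pairs)
```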
Step 5: Ranking candidates for each query image
Once the accuracy of the image ranking system satisfies the requirements, we can rank the candidates for the query images.
- To get the similar images for each query image quickly, the predicted scores of all the candidate images must be ready before this step. We explained how to get the predicted scores for users' own candidate images in step 3. So we assume the predicted scores of all the candidate images are already saved in the SQL table @scores_table; we only need to load them from the SQL table, not calculate them in this step.
- Assume all the query images are already saved in the SQL table @query_table. We load the query images from the SQL table and then featurize them using the pre-trained Resnet; here you have to use the same pre-trained model used in step 1.
- Load the model trained in step 3 from the SQL table @model_table, and calculate the predicted scores for all the query images using the model.
- Calculate the Cosine distance between each query image and all the candidates; based on the distance, return the top K similar images for each query image. Users can set the parameter @topKCandidates to decide how many similar images should be returned for each query image. For example, here we set @topKCandidates equal to 10, so in the result table @results_table each query image has 10 similar images.
Input:
- @topKCandidates - number of images to return for each input query image
- query_images table
- scores table
- model table
Output:
- ranking_results table
Example:
EXEC RankCandidates 10, 'query_images', 'scores', 'model', 'ranking_results'
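The final ranking step can be sketched as follows: rank every candidate by ascending Cosine distance between its predicted-score vector and the query's, and keep the top K. This is a minimal sketch with hypothetical 2-dimensional score vectors; function names and file names are ours, not the stored procedure's.

```python
import math

def rank_candidates(query_scores, candidate_scores, top_k=10):
    # Return the top_k most similar candidate image names for each query,
    # ranked by ascending cosine distance between score vectors.
    def cosine_distance(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = (math.sqrt(sum(a * a for a in u))
                * math.sqrt(sum(b * b for b in v)))
        return 1.0 - dot / norm

    results = {}
    for q, qs in query_scores.items():
        ranked = sorted(candidate_scores,
                        key=lambda c: cosine_distance(qs, candidate_scores[c]))
        results[q] = ranked[:top_k]
    return results

# Hypothetical predicted-score vectors:
top = rank_candidates(
    {"query.jpg": [0.9, 0.1]},
    {"a.jpg": [0.8, 0.2], "b.jpg": [0.1, 0.9], "c.jpg": [0.95, 0.05]},
    top_k=2,
)
```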