For the Data Scientist - Develop with R or Python
- Step 1: Loading the data to SQL Server
- Step 2: Create features on the fly for the training set and train the model
- Step 3: Create features on the fly for the testing set, make predictions, and evaluate the model
This solution package shows how to pre-process data (cleaning and feature engineering), train prediction models, and perform scoring on the SQL Server machine in either R or Python.
When a customer sends a support ticket, it is important to route it to the right team in order to examine the issue and solve it in the fastest way possible. We use sample data containing a Subject, a Text, and a Label to build a machine learning model that can predict a label for new incoming data.
This solution takes advantage of the power of SQL Server and RevoScaleR and RevoScalePy. The tables are all stored in a SQL Server, and most of the computations are done by loading chunks of data in-memory instead of the whole dataset.
All these steps can be executed in a R or Python IDE, and are then operationalized into SQL Server Stored Procedures. Scoring of new data is then performed in SQL Server.
Scientists who are testing and developing solutions can work from the convenience of their preferred IDE on their client machine, while setting the computation context to SQL (see R or Python folder for code). They can also deploy the completed solutions to SQL Server 2017 by embedding calls to R or Python in stored procedures. These solutions can then be further automated by the use of SQL Server Integration Services and SQL Server agent: a PowerShell script (.ps1 file) automates the running of the SQL code.
To try this out yourself, see the Quick Start section on the main page.
This page describes what happens in each of the steps.
Running the R or Python scripts requires the following:
- SQL server 2017 with RevoscalePy (version 9.2) and MicrosoftML (version 1.5.0) installed and configured;
- SQL Database which the user has write permission and execute stored procedures;
- For more information about Machine Learning Services in SQL Server, please visit: https://docs.microsoft.com/en-us/sql/advanced-analytics/what-s-new-in-sql-server-machine-learning-services
This code was tested using SQL Server 2017, and assumes that SQL Server with ML Services was installed. The installation process can be found here.
This solution uses a preprocessed version of NewsGroups20, containing a Subject, a Text, and a Label. It has a similar structure to a support ticket data set which would also have two data fields: Title, and Problem description. Thus, you can easily change this solution to use your support ticket data simply by providing the same structure for the test and train datasets.
Click here to view the SQL database tables created in this solution.
The file R/modeling_main.R or Python/modeling_main.py enables the user to define the input and perform all the steps. Inputs are: paths to the raw data files, database name, server name, username and password. The database is created if necessary, and the connection string as well as the SQL compute context are defined.
The steps below describe the process which applies identically to the R or Python code.
Step 1: Loading the data to SQL Server
The NewsGroups20 data has already been downloaded and preprocessed in R and Python, then saved to disk. The raw data has been uploaded to SQL Server. It consists of:
- News_Train: Training set.
- News_Test: Testing set.
- Label_Names: Link between the Label integers and the Label names.
The preprocessing consists of:
- Separating the Subject from the rest of the text in the text variable. This creates two separate text variables: “Subject” and “Text”. In this context, the data set has a similar structure to a Support Ticket classification problem.
- Remove unnecessary words or special characters (“\t”, “\n” etc.).
- Remove rows with an empty text body.
- Add an Id variable.
Step 2: Create features on the fly for the training set and train the model
For feature engineering, we want to featurize the text variables in the data: Subject and Text. The Subject and Text are featurized separately in order to give to the words in the Subject the same weight as those in the Text. This approach is also applicable to Support ticket Classification problems.
For each of the Subject and the Text separately, we:
- Remove stopwords, diacritics, punctuation and numbers.
- Change capital letters to lower case.
- Hash the different words and characters.
The parameters or options can be further optimized by parameter sweeping.
This is done by following these steps:
Get the factor levels of the label in the order of encounter based on the Id, serialize them, and save them to SQL Server for future use.
Define the text transformation to be used to generate features, in the form of a list. It will be applied on the fly during training and testing.
Train a multiclass logistic regression on the training set, using the text transformation list.
The model trained is then serialized and saved to SQL Server for future use.
Step 3: Create features on the fly for the testing set, make predictions, and evaluate the model
In the same fashion, we call the prediction function on the testing set News_Test. It makes predictions while featurizing the text variables (Subject and Text) separately on the fly. This will automatically use the same text transformation as in the training, encoded in logistic_model.
Once we get the predictions, we perform an inner join with the “Label_Names” table in order to get the names of the predicted labels.
Finally, we compute performance metrics in order to evaluate the model for this multi-class classification problem.
- Micro Average Accuracy: It is the overall accuracy, which corresponds to the number of correct class predictions divided by the total number of observations in the testing set (ie. rate of correct classification). In this sense, performing poorly on some rare classes but very accurately predicting the rest might still give a good result.
- Macro Average Accuracy: It corresponds to the average of the per-class accuracy. In this sense, it treats all the classes equally, even if some of the classes are rare.
These two metrics should thus be analyzed together, especially in the case of class imbalances.
Step 4: Visualize the Results
The Power BI file TextClassification.pbix included with this solution can be used to visualize how the model performs.
Training Summary tab, you can see the predicted and actual labels for the Test data, along with the Micro Average Accuracy and Macro Average Accuracy values.
Scoring New Text tab you can view the predicted labels for new text.
To try this out yourself:
- View the Quick Start.