AML Tool Selection Guide


Microsoft Azure offers a myriad of services and capabilities. Building an end-to-end machine learning pipeline from experimentation to deployment often requires bringing together a set of services from across Azure. While it may be possible to have one pipeline do it all, there are tradeoffs when you don't use the services for what they're best at.

The question then becomes: when is it worth adopting each service for your use case? The answer often depends on details that aren't strictly functional requirements. The main factors are:

  • The skillsets on your team.

  • How the Azure service plugs into your existing architecture.

  • The maintainability of the solution that you build using the service.

  • The cost of these services at scale.

This document focuses on Azure services that you can use to support data or machine learning workloads. While not exhaustive, this document covers the most popular Azure service options for supporting the end-to-end workflow:

  1. Experimentation

  2. Overall Orchestration/Scheduling

  3. Data Transfer

  4. Data Transformation

  5. Model Training

  6. Model Deployment

  7. Monitoring

Services for each stage

Listed below are the service options that you should consider for each stage of an end-to-end machine learning pipeline:

Service options

Experimentation
  • Azure Machine Learning Notebook VMs
  • Databricks Notebooks
  • Azure Machine Learning Experiment for Python SDK
  • Azure Data Science Virtual Machine (DSVM)

Overall Orchestration / Scheduling
  • Azure Logic Apps
  • Azure Data Factory
  • Azure Machine Learning Pipelines
  • Azure DevOps

Data Transfer
  • Azure Data Factory Copy Activity
  • Azure Machine Learning DataTransferStep

Compute
  • Databricks
  • Azure Machine Learning Compute

Tracking / Versioning
  • Experiment/Hyper-tuning Tracking: Azure Machine Learning Experiments; Databricks and MLflow Tracking
  • Data Versioning/Data Drift: Azure Machine Learning Datasets
  • Model Versioning: Azure Machine Learning Model Management Service; Databricks and MLflow Model Registry

Model Training
  • Azure Machine Learning Pipelines
  • Databricks

Model Deployment
  • Batch Scoring in Azure Machine Learning Pipeline
  • Real-time Deployment in Azure Machine Learning Service: Azure Kubernetes Service (AKS), Azure Container Instance, Azure App Service, Azure Functions, IoT Edge
  • Azure Machine Learning Model Deployment

Monitoring
  • Azure Monitor
  • Application Insights
  • Azure Dashboards
  • Power BI

Use a No Code or Code implementation approach

The first decision you need to make is whether to use a No Code implementation approach or the traditional Code implementation approach. Each has its own benefits and tradeoffs.

No Code

If you don't want to code your own solutions, a set of tools is available for building workflows without writing any code:

  • For experimentation, use Azure Machine Learning Designer.

  • For overall orchestration/scheduling, use Logic Apps, especially if you're integrating with the Microsoft 365 suite.

  • For data transfer and data transformation, use Data Factory Data Flows. If datasets are simple and on a small scale, Azure Machine Learning Designer can also handle them.

  • For model training and model deployment, use Azure Machine Learning Designer. It supports both real-time and batch deployments.

  • For monitoring, use Azure Monitor with Azure Dashboards, which lets you click to pin visuals and set up alerts without code. For more configuration, you can use Power BI to create historical dashboards.

The primary issue you'll come across here is that you must work within the constraints of the services. However, if your use case fits within these limitations, these services could be a good solution for you. They're always evolving and their capabilities will expand over time. So you should familiarize yourself with their latest features at the time you consider them. This diagram summarizes the process for the No Code option.

no code option process diagram

Code

If you want to code, or need the flexibility that a coded solution offers, all of the options described have a "code-like" interface. Each option can also export a representation of its processing logic to JSON or YAML format, so you can check the exported files into a code repository. From there, deployment is handled through Azure DevOps or scripts. This diagram summarizes the Code option process.

code option process diagram
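Because the exported definition is a structured file, it can be validated and versioned like any other code. The sketch below illustrates that idea with a hypothetical pipeline definition (the field names here are illustrative, not a real Azure export schema):

```python
import json

# Hypothetical pipeline definition. A real one would come from the
# service's own export command and follow its schema.
pipeline = {
    "name": "train-and-score",
    "steps": [
        {"name": "prep", "compute": "cpu-cluster"},
        {"name": "train", "compute": "gpu-cluster"},
    ],
}

def validate(defn: dict) -> list:
    """Return a list of problems found in a pipeline definition."""
    problems = []
    if not defn.get("name"):
        problems.append("missing pipeline name")
    for i, step in enumerate(defn.get("steps", [])):
        if "compute" not in step:
            problems.append(f"step {i} has no compute target")
    return problems

# Serialize for check-in; a CI job can re-run validate() on every commit.
serialized = json.dumps(pipeline, indent=2)
print(validate(pipeline))  # [] -> nothing to fix, ready to commit
```

A check like this can run as a pre-commit hook or build step, so a malformed definition never reaches deployment.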

Experimentation: Notebooks vs. Python/R scripts

Your team's data scientists and engineers may be more comfortable with notebooks or with plain scripts; there are experimentation choices that support both.

Notebooks

  • Azure Machine Learning Notebook VMs: Managed by Azure Machine Learning; the data scientist just navigates to a link to work in Jupyter notebooks. Each is backed by a single VM that you can stop and start at will, and you can pick the compute option (memory, CPU, or GPU) that your experimentation needs. The Notebook VM uses the Azure Machine Learning SDKs to interact with data stored in Azure.
  • Databricks Notebooks: Stored in the Azure Databricks workspace, with Git integration. To use them, set up a cluster to run the notebooks. They use the built-in dbutils to access data stored in Azure, and they cost more than Azure Machine Learning Notebook VMs.
  • Jupyter Notebook: See Azure Machine Learning Notebook VMs above.

Python/R scripts

  • Azure Machine Learning Experiment from the Python SDK: If you're working in Python scripts, you can submit them directly to Azure Machine Learning as steps in a pipeline. You can also run a Databricks Notebook as a pipeline step, but either way you still need to create the pipeline itself using scripts of some kind. If you go this route, you'll have to upskill your engineers in Azure Machine Learning Pipelines. A benefit is that you can leverage dataset connections to existing data in Azure. Note that pipeline startup cost can be prohibitive to iterating quickly.
  • DSVM: A catch-all for teams who want a GPU or non-GPU VM with standard machine learning frameworks pre-installed, but full flexibility in which tooling to use for coding. Little upskilling is needed.
  • Local experimentation: If you don't need compute power in the cloud, local experimentation is also an option.

Overall Orchestration and Scheduling

Trigger Options

Each orchestration service is best suited to a particular kind of trigger:

  • Code changes: Azure DevOps
  • Schedule: Azure Machine Learning Pipelines (a pipeline can only trigger itself)
  • New data or a schedule: Azure Data Factory
  • Events, alerts, and other non-Azure products: Azure Logic Apps

Trigger options with scheduling

This section adds scheduling and trigger details for each service.

Azure DevOps
  • Schedule: cron schedules.
  • Event-based triggers: pull request, branch, and build completion triggers. Artifact triggers aren't available in the new YAML builds.
  • Manual intervention or approval: yes, limited.
  • Integration with other orchestrators: yes, limited. Supports deployment to most Azure services and can call and wait on an Azure Machine Learning Pipeline from an agentless task.

Azure Machine Learning Pipeline
  • Schedule: recurrence-based (run at these hours on these days).
  • Event-based triggers: none.
  • Manual intervention or approval: no.
  • Integration with other orchestrators: no built-in support for Azure DevOps, Azure Data Factory, or Logic Apps.

Azure Data Factory
  • Schedule: recurrence-based, with additional support for tumbling windows.
  • Event-based triggers: blob creation and blob deletion events only.
  • Manual intervention or approval: no.
  • Integration with other orchestrators: yes, limited. Can run an Azure Machine Learning Pipeline.

Logic Apps
  • Schedule: recurrence-based.
  • Event-based triggers: many triggers from Microsoft and non-Microsoft services (Twitter, Dropbox, SharePoint, and so on).
  • Manual intervention or approval: no.
  • Integration with other orchestrators: yes, limited. Can trigger an Azure DevOps build and fire-and-forget trigger Azure Data Factory. No integration with Azure Machine Learning Pipelines.
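A tumbling window, as used by Azure Data Factory, slices a time range into contiguous, non-overlapping, fixed-size windows, each of which triggers one run. The sketch below only illustrates the concept with the standard library; it is not Data Factory's API:

```python
from datetime import datetime, timedelta

def tumbling_windows(start: datetime, end: datetime, size: timedelta):
    """Yield contiguous, non-overlapping (window_start, window_end) pairs."""
    cursor = start
    while cursor < end:
        # The final window is clipped so the range is covered exactly.
        yield cursor, min(cursor + size, end)
        cursor += size

# One day split into 6-hour windows: four runs, no gaps, no overlaps.
windows = list(tumbling_windows(
    datetime(2024, 1, 1), datetime(2024, 1, 2), timedelta(hours=6)))
for ws, we in windows:
    print(ws.isoformat(), "->", we.isoformat())
```

Recurrence-based schedules, by contrast, only say *when* to run; tumbling windows also tell each run *which slice of data* it owns, which is why they suit incremental data loads.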

Data Transfer Options

  • Azure Data Factory Copy Activity: large scale (GBs to TBs), with many options for sources and sinks.
  • Azure Machine Learning DataTransferStep: small scale (MBs to GBs), with limited options for sources and sinks.
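One way to make the sizing guidance above concrete in a pipeline script is a small routing helper. The 1 GB cutoff below is an illustrative assumption, not an official service limit:

```python
def pick_transfer_service(size_gb: float) -> str:
    """Route a transfer by payload size.

    The 1 GB threshold is a hypothetical rule of thumb derived from
    the guidance above (MBs-to-GBs vs. GBs-to-TBs), not a documented limit.
    """
    if size_gb >= 1:
        return "Azure Data Factory Copy Activity"
    return "Azure Machine Learning DataTransferStep"

print(pick_transfer_service(500))   # large payload -> Copy Activity
print(pick_transfer_service(0.2))   # small payload -> DataTransferStep
```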

Compute Options

  • Azure Machine Learning Compute Instance: scalable compute that works for GPU or non-GPU clusters. You run Python or R code in configurable Conda environments managed by Azure Machine Learning. It scales out multiple independent jobs, but doesn't handle distributed data partitioning/execution except in unique cases.
  • Databricks: scalable compute that handles distributed data partitioning and job execution on top of Spark. Big data jobs will likely execute faster on Databricks, but you have to manage dependencies and environments yourself, and Databricks compute is more expensive.
  • Azure Synapse (preview): distributed, open-source Spark/RDD processing for big data analytics.
  • Big Data Cluster/SQL Server 2019: big data analytics.
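The distinction between "scaling out multiple jobs" and "distributed data partitioning" is worth pinning down. Scale-out runs many independent jobs side by side, each with its own inputs; distributed execution splits one job's data across workers. A minimal sketch of the scale-out pattern (the `run_job` function is a stand-in for a real training job):

```python
from concurrent.futures import ThreadPoolExecutor

def run_job(config: dict) -> float:
    """Stand-in for one independent training job; returns a fake score."""
    return 1.0 / (1 + config["lr"])

# Three independent jobs, e.g. a small hyperparameter sweep.
configs = [{"lr": lr} for lr in (0.1, 0.01, 0.001)]

# Scale-out: each job runs on its own worker with its own inputs.
# Distributed execution of ONE job over partitioned data (the
# Spark/Databricks model) is a different pattern.
with ThreadPoolExecutor() as pool:
    scores = list(pool.map(run_job, configs))
print(scores)
```

If your workload looks like the sweep above, Azure Machine Learning Compute fits; if a single job must shuffle or join data too large for one node, Spark-based compute is the better match.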

Tracking and Versioning options

  • Experiment/Hyper-tuning Tracking: Azure Machine Learning Experiments; Databricks and MLflow Tracking
  • Data Versioning/Data Drift: Azure Machine Learning Datasets
  • Model Versioning: Azure Machine Learning Model Management Service; Databricks and MLflow Model Registry

Model Training options

  • Azure Machine Learning Pipelines
  • Databricks

Model deployment options

  • Batch scoring: batch deployment and scoring in an Azure Machine Learning Pipeline.
  • Real-time deployment: Azure Machine Learning Service supports real-time deployment and scoring using AKS, Azure Container Instance, Azure App Service, Azure Functions, IoT Edge, and more.
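Whatever service hosts it, a batch scoring step usually follows the same shape: load the model once, then score inputs in chunks and collect the results. A minimal sketch of that pattern (the threshold "model" here is a hypothetical stand-in for a registered model):

```python
from typing import Iterable, List

def load_model():
    """Stand-in for loading a registered model; here just a threshold rule."""
    return lambda x: 1 if x > 0.5 else 0

def batches(items: List[float], size: int) -> Iterable[List[float]]:
    """Split the inputs into fixed-size chunks."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def batch_score(inputs: List[float], batch_size: int = 2) -> List[int]:
    model = load_model()          # load once, reuse across all batches
    results = []
    for chunk in batches(inputs, batch_size):
        results.extend(model(x) for x in chunk)
    return results

predictions = batch_score([0.1, 0.9, 0.7, 0.2, 0.6])
print(predictions)  # [0, 1, 1, 0, 1]
```

Real-time deployment inverts this: the model stays loaded behind an endpoint and scores one request at a time, trading throughput for latency.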

Monitoring options

  • Azure Monitor: Application Insights and Azure Dashboards.
  • Power BI: analytics and reports.