MLOps best practices with Azure Machine Learning

Source: Azure CAF: Machine Learning DevOps Guide

Azure Machine Learning offers several asset management, orchestration, and automation services to help you manage the lifecycle of your model training and deployment workflows. This section discusses best practices and recommendations to apply MLOps across the areas of people, process, and technology supported by Azure Machine Learning.

People

  • Work in project teams to best utilize specialist and domain knowledge in your organization. Organize and set up Azure Machine Learning Workspaces on a project basis to comply with use case segregation requirements.
  • Define each set of responsibilities and tasks in your organization as a role; a single team member on an MLOps project team might fulfill multiple roles. Use custom roles in Azure to define the set of granular Azure RBAC operations for Azure Machine Learning that each role can perform (a custom role sketch follows this list).
  • Standardize on a project lifecycle and agile methodology. The Team Data Science Process provides a reference lifecycle implementation.
  • Create balanced teams that can execute all MLOps stages, from exploration through development to operations.
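
As referenced above, the following is a minimal sketch of creating a custom role scoped to a single workspace, assuming the azure-mgmt-authorization package. The role name, scope, and selected operations are illustrative placeholders; adjust them to your own segregation requirements.

```python
# Sketch: a custom "ML Experimenter" role scoped to one workspace.
# Assumes azure-mgmt-authorization; all names and IDs are placeholders.
import uuid

from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient
from azure.mgmt.authorization.models import Permission, RoleDefinition

subscription_id = "<subscription-id>"
scope = (
    f"/subscriptions/{subscription_id}/resourceGroups/<resource-group>"
    "/providers/Microsoft.MachineLearningServices/workspaces/<workspace-name>"
)

client = AuthorizationManagementClient(DefaultAzureCredential(), subscription_id)

role = RoleDefinition(
    role_name="ML Experimenter",
    description="Can read workspace assets and submit experiment runs.",
    role_type="CustomRole",
    permissions=[
        Permission(
            actions=[
                # documented Azure Machine Learning RBAC operations
                "Microsoft.MachineLearningServices/workspaces/*/read",
                "Microsoft.MachineLearningServices/workspaces/experiments/runs/submit/action",
            ],
            not_actions=[],
        )
    ],
    assignable_scopes=[scope],
)

client.role_definitions.create_or_update(
    scope=scope, role_definition_id=str(uuid.uuid4()), role_definition=role
)
```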

Process

  • Standardize on a code template to allow for code reuse and to reduce ramp-up time at project start or when a new team member joins the project. Azure Machine Learning pipelines, job submission scripts, and CI/CD pipelines lend themselves well to templating.
  • Use version control. Jobs that are submitted from a Git-backed folder automatically track repository metadata with the job in Azure Machine Learning for reproducibility (see the job submission sketch after this list).
  • Version experiment inputs and outputs to enable reproducibility. Use the Azure Machine Learning data, model management, and environment management capabilities to do so (see the asset versioning sketch after this list).
  • Build up a run history of your experiments to allow for comparison, planning, and collaboration. Use an experiment tracking framework like MLflow for metric collection (see the MLflow sketch after this list).
  • Continuously measure and control the quality of your team's work through continuous integration on the full experimentation code base.
  • Early-terminate training when a model doesn't converge. Use an experiment tracking framework in combination with the run history in Azure Machine Learning to monitor job execution (an early termination sketch follows this list).
  • Define an experiment and model management strategy. Consider using a name like Champion to refer to the current baseline model, and Challenger for candidate models that could outperform the Champion model in production. Apply tags in Azure Machine Learning to mark experiments and models accordingly (see the tagging sketch after this list). In some scenarios, such as sales forecasting, it can take months to determine whether the model's predictions are accurate.
  • Elevate continuous integration to continuous training by including model training as part of the build. For example, start model training on the full dataset with each pull request.
  • Shorten the time-to-feedback on the quality of the machine learning pipeline by running an automated build on a sample of the data. Use Azure Machine Learning pipeline parameters to parameterize input datasets (see the pipeline parameter sketch after this list).
  • Use continuous deployment for machine learning models to automate the deployment and testing of real-time scoring services across your Azure environments (development, test, production); a deployment sketch follows this list.
  • In some regulated industries, model validation steps might be required before a machine learning model can be used in a production environment. Automating validation steps, at least in part, can accelerate time to delivery. When manual review or validation steps remain the bottleneck, consider whether it's possible to certify the automated model validation pipeline instead. Use resource tags in Azure Machine Learning to indicate asset compliance, candidates for review, or triggers for deployment.
  • Don't retrain in production and directly replace the production model without integration testing. Even when model performance and functional requirements look good, a retrained model can have other problems; for example, its environment footprint might have grown, breaking the serving environment.
  • When production data access is only available in production, use Azure RBAC and custom roles to give a select number of machine learning practitioners the read access they require, for example for data exploration. Alternatively, make a data copy available in the non-production environments.
  • Agree on naming conventions and tags for Azure Machine Learning experiments to differentiate retraining baseline machine learning pipelines from experimental work.
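
As referenced above, here's a minimal sketch of a reusable, templated job submission script using the Azure Machine Learning Python SDK v2 (azure-ai-ml). The workspace, compute, environment, and data asset names are placeholders. When ./src lives in a Git-backed folder, the repository metadata is tracked with the job automatically.

```python
# Sketch: a templated job submission script (Azure ML SDK v2, placeholder names).
from azure.ai.ml import Input, MLClient, command
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

job = command(
    code="./src",  # a Git-backed folder: repo metadata is recorded with the job
    command="python train.py --data ${{inputs.training_data}}",
    inputs={"training_data": Input(type="uri_folder", path="azureml:my-training-data:1")},
    environment="azureml:training-env:3",
    compute="cpu-cluster",
    experiment_name="churn-model-baseline",
)

ml_client.jobs.create_or_update(job)
```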
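
For asset versioning, a sketch of registering versioned data and environment assets, reusing the ml_client from the previous sketch; the names, paths, and base image shown are placeholders.

```python
# Sketch: register versioned data and environment assets (placeholder names).
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.entities import Data, Environment

data_asset = Data(
    name="my-training-data",
    version="2",
    type=AssetTypes.URI_FOLDER,
    path="abfss://datasets@<storage-account>.dfs.core.windows.net/churn/2024-06/",
    description="Churn training snapshot, June 2024.",
)
ml_client.data.create_or_update(data_asset)

env = Environment(
    name="training-env",
    version="3",
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
    conda_file="./environment/conda.yml",  # pinned dependencies under version control
)
ml_client.environments.create_or_update(env)
```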
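
Inside the training script itself, metric collection might look like the following MLflow sketch. Jobs submitted to Azure Machine Learning are MLflow-enabled, so logged values land in the job's run history; the metric values here are placeholders for real training results.

```python
# Sketch: metric collection with MLflow inside a training script.
import mlflow

mlflow.log_param("learning_rate", 0.01)

for epoch in range(10):
    # placeholders standing in for real per-epoch results
    train_loss = 1.0 / (epoch + 1)
    val_auc = 0.70 + 0.02 * epoch
    mlflow.log_metric("train_loss", train_loss, step=epoch)
    mlflow.log_metric("val_auc", val_auc, step=epoch)
```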
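
For early termination, one option is a hyperparameter sweep with a bandit policy, as in this sketch; names are again placeholders, and it assumes the training script logs a val_auc metric as in the MLflow sketch above.

```python
# Sketch: early-terminate underperforming runs in a sweep (SDK v2).
from azure.ai.ml import command
from azure.ai.ml.sweep import BanditPolicy, Uniform

trial = command(
    code="./src",
    command="python train.py --lr ${{inputs.lr}}",
    inputs={"lr": 0.01},
    environment="azureml:training-env:3",
)

sweep_job = trial(lr=Uniform(min_value=0.001, max_value=0.1)).sweep(
    compute="cpu-cluster",
    sampling_algorithm="random",
    primary_metric="val_auc",
    goal="Maximize",
)
# stop runs that fall more than 10% behind the best run so far
sweep_job.early_termination = BanditPolicy(
    slack_factor=0.1, evaluation_interval=1, delay_evaluation=5
)
sweep_job.set_limits(max_total_trials=20, max_concurrent_trials=4)
ml_client.jobs.create_or_update(sweep_job)
```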
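
Marking the Champion and Challenger models can be as simple as updating tags on the registered models, as in this sketch. Model names and versions are placeholders, and it assumes tags can be updated by re-submitting the fetched model entity.

```python
# Sketch: tag registered models to track a Champion/Challenger strategy.
# Assumption: re-submitting a fetched model entity updates its mutable tags.
champion = ml_client.models.get(name="churn-model", version="4")
champion.tags["role"] = "champion"
ml_client.models.create_or_update(champion)

challenger = ml_client.models.get(name="churn-model", version="7")
challenger.tags["role"] = "challenger"
ml_client.models.create_or_update(challenger)
```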
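
A sketch of a pipeline whose input dataset is a parameter, so a CI build can pass a small data sample while scheduled retraining passes the full dataset. The component definition at ./components/train.yml, and its data input and model output, are hypothetical stand-ins for components you would define yourself.

```python
# Sketch: parameterize a pipeline's input dataset (SDK v2, placeholder names).
from azure.ai.ml import Input, load_component
from azure.ai.ml.dsl import pipeline

# hypothetical component definition checked into the repository
train_component = load_component(source="./components/train.yml")

@pipeline(default_compute="cpu-cluster")
def training_pipeline(training_data: Input):
    train_step = train_component(data=training_data)
    return {"model": train_step.outputs.model}

# CI build: fast feedback on a registered data sample
ci_job = training_pipeline(
    training_data=Input(type="uri_folder", path="azureml:my-training-data-sample:1")
)
ml_client.jobs.create_or_update(ci_job, experiment_name="ci-sample-build")

# scheduled retraining: the same pipeline on the full dataset
full_job = training_pipeline(
    training_data=Input(type="uri_folder", path="azureml:my-training-data:2")
)
ml_client.jobs.create_or_update(full_job, experiment_name="retraining-baseline")
```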
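
Finally, a sketch of the deployment step a continuous deployment pipeline might automate, using a managed online endpoint. Names and the instance SKU are placeholders; a model in MLflow format needs no separate scoring script, while other model formats also require an environment and code configuration.

```python
# Sketch: deploy a registered model to a managed online endpoint (placeholder names).
from azure.ai.ml.entities import ManagedOnlineDeployment, ManagedOnlineEndpoint

endpoint = ManagedOnlineEndpoint(name="churn-scoring-test", auth_mode="key")
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="churn-scoring-test",
    model="azureml:churn-model:7",  # MLflow-format model: no scoring script needed
    instance_type="Standard_DS3_v2",
    instance_count=1,
)
ml_client.online_deployments.begin_create_or_update(deployment).result()
```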

Technology

  • Instead of submitting jobs via the studio user interface (UI) or the SDK, use the CLI or the Azure DevOps Machine Learning tasks to configure automation pipeline steps. This approach can reduce the code footprint by reusing the same job submissions directly from automation pipelines.
  • Use event-based programming. For example, trigger an offline model testing pipeline with an Azure Function once a new model gets registered, or send a notification to an Ops email alias when a critical pipeline fails to run. Azure Machine Learning publishes events to Azure Event Grid that you can subscribe to (see the subscription sketch after this list).
  • When you use Azure DevOps for automation, use the Azure DevOps Machine Learning tasks so that machine learning models can act as pipeline triggers.
  • When you develop Python packages for your machine learning application, you can host them in an Azure DevOps repository as artifacts and publish them as a feed. This approach allows you to integrate the DevOps workflow for building packages with your Azure Machine Learning workspace.
  • Consider using a staging environment to run system integration tests of machine learning pipelines with upstream and downstream application components.
  • Create unit and integration tests for your inference endpoints for enhanced debugging and accelerated time to deployment (an endpoint test sketch follows this list).
  • To trigger retraining, use dataset monitors and event-driven workflows: subscribe to data drift events and automate the triggering of machine learning pipelines for retraining (the Event Grid sketch after this list includes the drift event type).
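
As referenced above, the following sketch subscribes a webhook (for example, an Azure Function) to workspace events, assuming the azure-mgmt-eventgrid package. The endpoint URL and names are placeholders; the two event types shown are the documented model registration and data drift events, so the same mechanism serves both the offline-testing trigger and drift-based retraining.

```python
# Sketch: subscribe a webhook to Azure ML workspace events via Event Grid.
# Assumes azure-mgmt-eventgrid; endpoint URL and names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.eventgrid import EventGridManagementClient
from azure.mgmt.eventgrid.models import (
    EventSubscription,
    EventSubscriptionFilter,
    WebHookEventSubscriptionDestination,
)

subscription_id = "<subscription-id>"
workspace_scope = (
    f"/subscriptions/{subscription_id}/resourceGroups/<resource-group>"
    "/providers/Microsoft.MachineLearningServices/workspaces/<workspace-name>"
)

eg_client = EventGridManagementClient(DefaultAzureCredential(), subscription_id)

subscription = EventSubscription(
    destination=WebHookEventSubscriptionDestination(
        endpoint_url="https://<function-app>.azurewebsites.net/runtime/webhooks/eventgrid"
        "?functionName=OnWorkspaceEvent&code=<key>"
    ),
    filter=EventSubscriptionFilter(
        included_event_types=[
            "Microsoft.MachineLearningServices.ModelRegistered",
            "Microsoft.MachineLearningServices.DatasetDriftDetected",
        ]
    ),
)

eg_client.event_subscriptions.begin_create_or_update(
    scope=workspace_scope,
    event_subscription_name="workspace-event-handler",
    event_subscription_info=subscription,
).result()
```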
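
An endpoint test can stay small, as in this pytest-style smoke test. It assumes an endpoint and deployment named as in the earlier deployment sketch, plus a hypothetical sample request payload checked into the repository alongside the tests.

```python
# Sketch: pytest smoke test for a managed online endpoint (placeholder names).
import json

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential


def test_endpoint_returns_predictions():
    ml_client = MLClient(
        DefaultAzureCredential(),
        subscription_id="<subscription-id>",
        resource_group_name="<resource-group>",
        workspace_name="<workspace-name>",
    )
    # hypothetical payload checked into the repository with the tests
    response = ml_client.online_endpoints.invoke(
        endpoint_name="churn-scoring-test",
        deployment_name="blue",
        request_file="tests/sample-request.json",
    )
    predictions = json.loads(response)
    assert isinstance(predictions, list) and len(predictions) > 0
```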