
Observability in Machine Learning

The development process of a software system with a machine learning component is more complex than that of traditional software. We need to monitor changes and variations in three dimensions: the code, the model, and the data. We can distinguish two stages in the lifespan of such a system: experimentation and production. Each requires a different approach to observability, as discussed below:

Model experimentation and tuning

Experimentation is the process of finding a suitable machine learning model and its parameters by training and evaluating candidate models with one or more datasets.

When developing and tuning machine learning models, data scientists are interested in observing and comparing selected performance metrics for various model parameters. They also need a reliable way to reproduce a training process, such that a given dataset and given parameters produce the same model.

There are many model metric evaluation solutions available, both open source (like MLFlow) and proprietary (like Azure Machine Learning service), some of which serve different purposes. Among others, the following options are available to capture model metrics:

Azure Machine Learning Service SDK: Azure Machine Learning service provides an SDK for Python, R and C# to capture your evaluation metrics in an Azure Machine Learning service (AML) Experiment. Experiments are viewed in the AML dashboard. Reproducibility is achieved by storing a code or notebook snapshot together with the logged metrics. You can create versioned Datasets within Azure Machine Learning service.
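
A minimal sketch of logging metrics to an AML Experiment with the azureml-core Python SDK; the workspace configuration, experiment name, and metric values are illustrative:

```python
# Log metrics to an AML Experiment (experiment name and values are illustrative).
from azureml.core import Experiment, Workspace

ws = Workspace.from_config()                 # loads workspace details from config.json
experiment = Experiment(workspace=ws, name="model-tuning")

run = experiment.start_logging()             # creates an interactive run
run.log("learning_rate", 0.01)               # log a single value
run.log("accuracy", 0.91)
run.log_list("losses", [0.9, 0.5, 0.3])      # log a list of values plotted over steps
run.complete()                               # mark the run as finished
```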

MLFlow (for Databricks): MLFlow is an open source framework that can be hosted on Azure Databricks as its remote tracking server (it is currently the only solution with first-party integration in Databricks). You can use the MLFlow SDK tracking component to capture your evaluation metrics, or any parameter you would like, and track it on the experimentation board in Azure Databricks. Source code and dataset version are also saved with the log snapshot to provide reproducibility.
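
A minimal sketch of tracking a run with the MLFlow SDK against a Databricks workspace; the experiment path and logged names/values are illustrative:

```python
# Track parameters, metrics and artifacts with MLFlow (values are illustrative).
import mlflow

mlflow.set_tracking_uri("databricks")        # point the client at the Databricks tracking server
mlflow.set_experiment("/Users/someone@example.com/model-tuning")

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)          # hyperparameters
    mlflow.log_metric("accuracy", 0.91)              # evaluation metrics
    mlflow.log_artifact("confusion_matrix.png")      # a plot or other output file (path is illustrative)
```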

TensorBoard: TensorBoard is a popular tool amongst data scientists to visualize specific metrics of deep learning runs, especially of TensorFlow runs. TensorBoard is not an MLOps tool like AML/MLFlow, and therefore does not offer extensive logging capabilities. It is meant to be transient, and can therefore be used as an addition to an end-to-end MLOps tool like AML, but not as a complete MLOps tool.
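
A minimal sketch of writing scalar metrics for TensorBoard with the TensorFlow `tf.summary` API; the log directory and loss values are illustrative:

```python
# Write scalar summaries that TensorBoard can visualize (values are illustrative).
import tensorflow as tf

writer = tf.summary.create_file_writer("logs/run-1")

with writer.as_default():
    for step in range(100):
        loss = 1.0 / (step + 1)                      # placeholder for a real training loss
        tf.summary.scalar("loss", loss, step=step)

# Inspect the run with: tensorboard --logdir logs
```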

Application Insights: Application Insights can be used as an alternative sink to capture model metrics and offers more extensive options, as metrics can be transferred to, e.g., a Power BI dashboard. It also enables log querying. However, this solution requires writing a custom application that sends logs to Application Insights (using, for example, the OpenCensus Python SDK), which means the extra effort of creating and maintaining custom code.
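
A minimal sketch of sending model metrics to Application Insights with the OpenCensus Azure log exporter; the connection string, logger name, and metric values are illustrative:

```python
# Send model metrics to Application Insights via OpenCensus (values are illustrative).
import logging
from opencensus.ext.azure.log_exporter import AzureLogHandler

logger = logging.getLogger("model_metrics")
logger.setLevel(logging.INFO)
logger.addHandler(AzureLogHandler(
    connection_string="InstrumentationKey=00000000-0000-0000-0000-000000000000"))

# custom_dimensions appear as queryable fields on the trace records
logger.info("evaluation finished",
            extra={"custom_dimensions": {"accuracy": 0.91, "model_version": "v3"}})
```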

An extensive comparison of the four tools:

|  | Azure ML | MLFlow | TensorBoard | Application Insights |
|---|---|---|---|---|
| Metrics support | Values, images, matrices, logs | Values, images, matrices and plots as files | Metrics relevant to the DL research phase | Values, images, matrices, logs |
| Customizable | Basic | Basic | Very basic | High |
| Metrics accessible | AML portal, AML SDK | MLFlow UI, Tracking service API | TensorBoard UI, history object | Application Insights |
| Logs accessible | Rolling logs written to .txt files in blob storage, accessible via blob or the AML portal; not query-able | Rolling logs are not stored | Rolling logs are not stored | Application Insights in the Azure portal; query-able with KQL |
| Ease of use and setup | Very straightforward, only one portal | More moving parts due to the remote tracking server | A bit of process overhead; also depends on the ML framework | More moving parts, as a custom app needs to be maintained |
| Shareability | Across people with access to the AML workspace | Across people with access to the remote tracking server | Across people with access to the same directory | Across people with access to Application Insights |

Model in production

The trained model can be deployed to production as a container. Azure Machine Learning service provides an SDK to deploy the model as an Azure Container Instance and publishes a REST endpoint. You can monitor it using microservice observability methods (for more details, refer to the Recipes section). MLFlow is an alternative way to deploy an ML model as a service.
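
A sketch of deploying a registered model as an Azure Container Instance with the azureml-core SDK; the model name, scoring script, environment file, and compute sizes are illustrative:

```python
# Deploy a registered model to ACI and expose a REST endpoint (names are illustrative).
from azureml.core import Environment, Workspace
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice

ws = Workspace.from_config()
model = Model(ws, name="my-model")                            # previously registered model
env = Environment.from_conda_specification("inference-env", "environment.yml")

inference_config = InferenceConfig(entry_script="score.py", environment=env)
deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)

service = Model.deploy(ws, "my-model-service", [model], inference_config, deployment_config)
service.wait_for_deployment(show_output=True)
print(service.scoring_uri)                                    # the published REST endpoint
```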

Training and re-training

To automatically retrain the model, you can use AML Pipelines or Azure Databricks. When re-training with AML Pipelines, you can monitor the information of each run, including the output, logs, and various metrics, in the Azure portal experiment dashboard, or extract it manually using the AML SDK.
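
A minimal sketch of extracting run information from an experiment with the azureml-core SDK; the experiment name and log destination are illustrative:

```python
# Pull run status, metrics and logs for an experiment's runs (names are illustrative).
from azureml.core import Experiment, Workspace

ws = Workspace.from_config()
experiment = Experiment(workspace=ws, name="retraining-pipeline")

for run in experiment.get_runs():                # most recent runs first
    print(run.id, run.get_status())
    print(run.get_metrics())                     # metrics logged during the run
    run.get_all_logs(destination="./logs")       # download the run's log files
```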

Model performance over time: data drift

We re-train machine learning models to improve their performance and keep them aligned with data that changes over time. However, in some cases model performance may degrade. This can happen when the data changes dramatically and no longer exhibits the patterns we observed during model development. This effect is called data drift. Azure Machine Learning service has a preview feature to observe and report data drift. This article describes it in detail.
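
A sketch of setting up a dataset drift monitor with the azureml-datadrift preview package; it assumes a registered baseline dataset, a target dataset, and a compute cluster already exist, and since the feature is in preview the exact API surface may change:

```python
# Monitor drift between a baseline and a target dataset (names are illustrative;
# the azureml-datadrift package is a preview feature).
from azureml.core import Dataset, Workspace
from azureml.datadrift import DataDriftDetector

ws = Workspace.from_config()
baseline = Dataset.get_by_name(ws, "training-data")
target = Dataset.get_by_name(ws, "scoring-data")

monitor = DataDriftDetector.create_from_datasets(
    ws, "my-drift-monitor", baseline, target,
    compute_target="cpu-cluster",
    frequency="Week",                  # how often the monitor runs
    drift_threshold=0.3)               # report drift above this level
monitor.enable_schedule()              # start the scheduled drift runs
```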

Data versioning

It is recommended practice to version all datasets. You can create a versioned Azure ML Dataset for this purpose, or version them manually if using other systems.
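
A minimal sketch of registering a versioned Dataset with the azureml-core SDK; the datastore path and dataset name are illustrative:

```python
# Register a tabular dataset and create a new version on each registration (names are illustrative).
from azureml.core import Dataset, Workspace

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

dataset = Dataset.Tabular.from_delimited_files(path=(datastore, "data/train.csv"))
dataset = dataset.register(workspace=ws,
                           name="training-data",
                           create_new_version=True)   # bumps the version if the name already exists
print(dataset.version)
```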


Last update: December 15, 2022