Experiment Tracking

This page covers experiment tracking for Isaac Lab and LeRobot training workflows. Managed MLflow tracking from Azure ML is available on both platforms: jobs submitted to Azure ML use it directly, and OSMO jobs reach it through the Azure ML backend.

📊 MLflow Tracking

Azure ML manages MLflow as the default experiment tracking backend. Isaac Lab training with SKRL logs metrics automatically through monkey-patching.

Isaac Lab (Automatic)

SKRL training logs metrics to MLflow without additional configuration. Metrics include episode rewards, training losses, optimization stats, and timing data.
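
The monkey-patching approach mentioned above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `DummyAgent`, its `track_data` method, and `patch_metric_logging` are hypothetical stand-ins, not the repository's actual integration or the real SKRL classes. The sketch wraps a method so that each tracked value is also forwarded to a logging callable.

```python
import functools
from typing import Callable, Dict, List, Tuple


class DummyAgent:
    """Stand-in for a training agent; not the real SKRL class."""

    def __init__(self) -> None:
        self.tracked: Dict[str, List[float]] = {}

    def track_data(self, tag: str, value: float) -> None:
        # Original behaviour: keep the value in an in-memory buffer.
        self.tracked.setdefault(tag, []).append(value)


def patch_metric_logging(
    agent_cls: type, log_metric: Callable[[str, float], None]
) -> None:
    """Wrap agent_cls.track_data so every tracked value is also forwarded."""
    original = agent_cls.track_data

    @functools.wraps(original)
    def wrapper(self, tag: str, value: float) -> None:
        original(self, tag, value)     # keep the original behaviour
        log_metric(tag, float(value))  # then forward to the tracker

    agent_cls.track_data = wrapper


# Collect forwarded metrics in a list so the sketch runs without an MLflow server.
forwarded: List[Tuple[str, float]] = []
patch_metric_logging(DummyAgent, lambda tag, value: forwarded.append((tag, value)))

agent = DummyAgent()
agent.track_data("Reward/total_reward_mean", 1.5)
```

In a real run the callable would presumably be something like `mlflow.log_metric`, with a step number added. The patch leaves the training code itself unchanged, which is why no extra configuration is needed on the caller's side.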

Configure logging frequency with --mlflow_log_interval:

| Interval | Behavior | Use Case |
| --- | --- | --- |
| step | Log every training step | Debugging |
| balanced | Log every 10 steps (default) | Standard training |
| rollout | Log once per rollout cycle | Long runs |
| Integer | Custom step interval | Tuned granularity |
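
One way to read the interval options above is as a mapping from a preset name or integer to a number of steps between log calls. The sketch below is illustrative only: `resolve_log_interval`, `should_log`, and the `rollout_steps` parameter are hypothetical names, and it assumes a rollout cycle spans a fixed number of steps, which the source does not state.

```python
from typing import Union


def resolve_log_interval(value: Union[str, int], rollout_steps: int) -> int:
    """Return the number of training steps between MLflow log calls."""
    # Presets from the table; "rollout" depends on the (assumed) rollout length.
    presets = {"step": 1, "balanced": 10, "rollout": rollout_steps}
    text = str(value).strip().lower()
    if text in presets:
        return presets[text]
    try:
        interval = int(text)
    except ValueError:
        raise ValueError(f"unknown log interval: {value!r}") from None
    if interval < 1:
        raise ValueError("log interval must be a positive integer")
    return interval


def should_log(step: int, interval: int) -> bool:
    """Log on every `interval`-th step, counting steps from 1."""
    return step % interval == 0
```

For example, `resolve_log_interval("balanced", rollout_steps=24)` yields 10, while `"rollout"` yields 24 under the same assumption. Larger intervals reduce the number of tracking requests, which matters most on long runs.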

See MLflow Integration for SKRL metric categories, filtering, and troubleshooting.

LeRobot

MLflow is enabled automatically for LeRobot training on both OSMO and Azure ML. Submit an OSMO training job:

training/il/scripts/submit-osmo-lerobot-training.sh \
  -d user/dataset

MLflow Configuration

| Parameter | Default | Description | Source |
| --- | --- | --- | --- |
| --mlflow-token-retries | 3 | MLflow token refresh retry count | MLFLOW_TRACKING_TOKEN_REFRESH_RETRIES |
| --mlflow-http-timeout | 60 | MLflow HTTP request timeout (sec) | MLFLOW_HTTP_REQUEST_TIMEOUT |
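
The Source column suggests that each flag ends up in an environment variable. As a rough sketch of that mapping, assuming the flags are simply copied into the variables named in the table, the values could be applied as follows. `apply_mlflow_settings` is a hypothetical helper, not a function from the repository.

```python
import os
from typing import Dict, MutableMapping, Optional


def apply_mlflow_settings(
    token_retries: int = 3,
    http_timeout: int = 60,
    env: Optional[MutableMapping[str, str]] = None,
) -> Dict[str, str]:
    """Write the two settings from the table into an environment mapping."""
    # Default to the real process environment; tests can pass a plain dict.
    target = os.environ if env is None else env
    settings = {
        "MLFLOW_TRACKING_TOKEN_REFRESH_RETRIES": str(token_retries),
        "MLFLOW_HTTP_REQUEST_TIMEOUT": str(http_timeout),
    }
    target.update(settings)
    return settings


# Raise the HTTP timeout for a slow network; write into a dict for illustration.
example_env: Dict[str, str] = {}
apply_mlflow_settings(http_timeout=120, env=example_env)
```

Environment variables hold strings, so the integers are converted before being stored. Settings like these generally need to be in place before the tracking client reads them.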

Model Registration

Training scripts register model checkpoints to Azure ML automatically at completion.

Registration Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| --register-checkpoint | Derived from task | Model name for registration |
| --skip-register-checkpoint | false | Skip automatic registration |
| --register-model | (none) | Model name (LeRobot inference) |

Registration Examples

# Isaac Lab: custom model name
training/rl/scripts/submit-azureml-training.sh \
  --register-checkpoint my-anymal-model

# Isaac Lab: skip registration
training/rl/scripts/submit-osmo-training.sh \
  --skip-register-checkpoint

# LeRobot: register after evaluation
evaluation/sil/scripts/submit-osmo-lerobot-eval.sh \
  --policy-repo-id user/trained-policy \
  -r my-evaluated-model

Retrieve Registered Models

# Download from Azure ML
az ml model download \
  --name anymal-c-velocity --version 1 \
  --download-path ./checkpoint

# Download from HuggingFace Hub
huggingface-cli download user/trained-policy --local-dir ./checkpoint

🔄 Checkpoint Workflows

Training supports three checkpoint initialization modes:

| Mode | Weights | Optimizer | Counters | Use Case |
| --- | --- | --- | --- | --- |
| from-scratch | Random | Fresh | Reset | Initial training |
| warm-start | Loaded | Fresh | Reset | Transfer learning |
| resume | Loaded | Loaded | Loaded | Continue interrupted training |

# Resume training from MLflow artifact
training/rl/scripts/submit-azureml-training.sh \
  --checkpoint-uri "runs:/abc123/checkpoint" \
  --checkpoint-mode resume

# Warm-start from registered model
training/rl/scripts/submit-osmo-training.sh \
  --checkpoint-uri "models:/anymal-c-velocity/1" \
  --checkpoint-mode warm-start
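
The difference between the three checkpoint modes comes down to which parts of the saved state are restored. The sketch below models that with plain Python data. `TrainingState` and `init_state` are hypothetical names for illustration, and the dictionaries stand in for real model weights and optimizer state; this is not how the training scripts are implemented.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class TrainingState:
    """Hypothetical container for what a checkpoint can restore."""

    weights: Dict[str, Any] = field(default_factory=dict)
    optimizer: Dict[str, Any] = field(default_factory=dict)
    step: int = 0


def init_state(mode: str, checkpoint: Optional[TrainingState] = None) -> TrainingState:
    """Build the starting state for one of the three checkpoint modes."""
    if mode == "from-scratch":
        # Empty state stands in for random weights, a fresh optimizer, step 0.
        return TrainingState()
    if checkpoint is None:
        raise ValueError(f"mode {mode!r} requires a checkpoint")
    if mode == "warm-start":
        # Reuse the weights only; optimizer state and counters start over.
        return TrainingState(weights=dict(checkpoint.weights))
    if mode == "resume":
        # Restore everything so training continues where it stopped.
        return TrainingState(
            weights=dict(checkpoint.weights),
            optimizer=dict(checkpoint.optimizer),
            step=checkpoint.step,
        )
    raise ValueError(f"unknown checkpoint mode: {mode!r}")


saved = TrainingState(weights={"w": 0.5}, optimizer={"lr": 3e-4}, step=1200)
warm = init_state("warm-start", saved)
resumed = init_state("resume", saved)
```

Here `warm` keeps the saved weights but starts at step 0 with an empty optimizer state, while `resumed` carries over the weights, the optimizer state, and the step counter.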
