Experiment Tracking

This page covers experiment tracking for Isaac Lab and LeRobot training workflows. Azure ML provides managed MLflow tracking. OSMO supports both WANDB (the default for LeRobot) and MLflow (through the Azure ML backend).

📊 MLflow Tracking

Azure ML manages MLflow as the default experiment tracking backend. Isaac Lab training with SKRL logs metrics automatically through monkey-patching.

Isaac Lab (Automatic)

SKRL training logs metrics to MLflow without additional configuration. Metrics include episode rewards, training losses, optimization stats, and timing data.

Configure logging frequency with --mlflow_log_interval:

| Interval | Behavior | Use Case |
|----------|----------|----------|
| step | Log every training step | Debugging |
| balanced | Log every 10 steps (default) | Standard training |
| rollout | Log once per rollout cycle | Long runs |
| Integer | Custom step interval | Tuned granularity |

See MLflow Integration for SKRL metric categories, filtering, and troubleshooting.
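
As a rough illustration of how the interval values above could map to a step count, here is a minimal sketch. The function `resolve_log_interval` and its second argument (steps per rollout cycle) are hypothetical and are not part of the training scripts. The actual logging is handled by the SKRL integration.

```shell
# Hypothetical helper: maps an --mlflow_log_interval value to a step count.
# $1: interval value (defaults to "balanced")
# $2: assumed number of steps in one rollout cycle (defaults to 1)
resolve_log_interval() {
  case "${1:-balanced}" in
    step)     echo 1 ;;          # log every training step
    balanced) echo 10 ;;         # documented default: every 10 steps
    rollout)  echo "${2:-1}" ;;  # once per rollout cycle
    *[!0-9]*|0*)
      # Anything that is not a positive integer without leading zeros.
      echo "invalid interval: $1" >&2
      return 1
      ;;
    *) echo "$1" ;;              # custom integer step interval
  esac
}

resolve_log_interval balanced
resolve_log_interval rollout 24
resolve_log_interval 50
```

Keeping the presets and the integer form behind one resolver means the logging loop only ever compares against a single number.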

LeRobot

Enable MLflow for LeRobot on OSMO:

./scripts/submit-osmo-lerobot-training.sh \
-d user/dataset \
--mlflow-enable

Azure ML LeRobot submissions use MLflow automatically.

MLflow Configuration

| Parameter | Default | Description | Source |
|-----------|---------|-------------|--------|
| --mlflow-token-retries | 3 | MLflow token refresh retry count | MLFLOW_TRACKING_TOKEN_REFRESH_RETRIES |
| --mlflow-http-timeout | 60 | MLflow HTTP request timeout (seconds) | MLFLOW_HTTP_REQUEST_TIMEOUT |
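
If the Source column names the environment variable behind each flag (an assumption drawn from the table, not confirmed here), the same defaults could be applied in a shell session as sketched below. The function name `apply_mlflow_defaults` is hypothetical.

```shell
# Hypothetical helper: applies the documented defaults unless the
# variables are already set in the environment.
apply_mlflow_defaults() {
  export MLFLOW_TRACKING_TOKEN_REFRESH_RETRIES="${MLFLOW_TRACKING_TOKEN_REFRESH_RETRIES:-3}"
  export MLFLOW_HTTP_REQUEST_TIMEOUT="${MLFLOW_HTTP_REQUEST_TIMEOUT:-60}"
}

apply_mlflow_defaults
echo "token refresh retries: ${MLFLOW_TRACKING_TOKEN_REFRESH_RETRIES}"
echo "HTTP timeout (seconds): ${MLFLOW_HTTP_REQUEST_TIMEOUT}"
```

Using `${VAR:-default}` keeps any value already exported by the caller, so an override such as a longer timeout for a slow network survives.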

📈 WANDB Integration

WANDB is the default experiment tracker for LeRobot workflows on OSMO. It tracks training loss, evaluation metrics, and model outputs.

Credential Setup

# Set WANDB API key (required)
osmo credential set wandb_api_key --generic --value "..."

# Set HuggingFace token (required for private datasets)
osmo credential set hf_token --generic --value "hf_..."

Enable and Disable

WANDB is enabled by default on OSMO LeRobot workflows:

# Explicitly enable (default)
./scripts/submit-osmo-lerobot-training.sh -d user/dataset --wandb-enable

# Disable WANDB
./scripts/submit-osmo-lerobot-training.sh -d user/dataset --wandb-disable

# Use MLflow instead
./scripts/submit-osmo-lerobot-training.sh -d user/dataset --mlflow-enable
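
The flag behavior described above can be pictured with a small sketch. The function `select_trackers` is hypothetical and does not reflect the submit script's actual parsing. In particular, treating `--mlflow-enable` as switching WANDB off is an assumption based on the word "instead" in the example above.

```shell
# Hypothetical sketch of tracker selection from command-line flags.
# Runs in a subshell so the variables do not leak into the caller.
select_trackers() (
  wandb=true    # WANDB is enabled by default on OSMO LeRobot workflows
  mlflow=false
  for arg in "$@"; do
    case "$arg" in
      --wandb-enable)  wandb=true ;;
      --wandb-disable) wandb=false ;;
      # Assumption: enabling MLflow replaces WANDB rather than adding to it.
      --mlflow-enable) mlflow=true; wandb=false ;;
    esac
  done
  echo "wandb=$wandb mlflow=$mlflow"
)

select_trackers -d user/dataset
select_trackers -d user/dataset --wandb-disable
select_trackers -d user/dataset --mlflow-enable
```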

📦 Model Registration

Training scripts register model checkpoints to Azure ML automatically at completion.

Registration Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| --register-checkpoint | Derived from task | Model name for registration |
| --skip-register-checkpoint | false | Skip automatic registration |
| --register-model | (none) | Model name (LeRobot inference) |

Registration Examples

# Isaac Lab: custom model name
./scripts/submit-azureml-training.sh \
--register-checkpoint my-anymal-model

# Isaac Lab: skip registration
./scripts/submit-osmo-training.sh \
--skip-register-checkpoint

# LeRobot: register after inference
./scripts/submit-osmo-lerobot-inference.sh \
--policy-repo-id user/trained-policy \
-r my-evaluated-model

Retrieve Registered Models

# Download from Azure ML
az ml model download \
--name anymal-c-velocity --version 1 \
--download-path ./checkpoint

# Download from HuggingFace Hub
huggingface-cli download user/trained-policy --local-dir ./checkpoint

🔄 Checkpoint Workflows

Training supports four checkpoint initialization modes:

| Mode | Weights | Optimizer | Counters | Use Case |
|------|---------|-----------|----------|----------|
| from-scratch | Random | Fresh | Reset | Initial training |
| warm-start | Loaded | Fresh | Reset | Transfer learning |
| resume | Loaded | Loaded | Loaded | Continue interrupted training |
| fresh | Random | Fresh | Reset | Architecture-only initialization |

# Resume training from MLflow artifact
./scripts/submit-azureml-training.sh \
--checkpoint-uri "runs:/abc123/checkpoint" \
--checkpoint-mode resume

# Warm-start from registered model
./scripts/submit-osmo-training.sh \
--checkpoint-uri "models:/anymal-c-velocity/1" \
--checkpoint-mode warm-start
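
To make the mode table easier to scan, the sketch below restates it as a lookup. The function `describe_checkpoint_mode` is hypothetical and only mirrors the table. It says nothing about how the training scripts implement each mode. The table lists `from-scratch` and `fresh` with identical state, so the sketch groups them.

```shell
# Hypothetical lookup mirroring the checkpoint mode table.
# Prints which parts of the training state are loaded for a given mode.
describe_checkpoint_mode() {
  case "$1" in
    from-scratch|fresh)
      echo "weights=random optimizer=fresh counters=reset" ;;
    warm-start)
      echo "weights=loaded optimizer=fresh counters=reset" ;;
    resume)
      echo "weights=loaded optimizer=loaded counters=loaded" ;;
    *)
      echo "unknown checkpoint mode: $1" >&2
      return 1
      ;;
  esac
}

describe_checkpoint_mode warm-start
describe_checkpoint_mode resume
```

The practical difference between the two loading modes is that `warm-start` keeps only the weights, while `resume` also restores the optimizer state and the counters.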
