physical-ai-toolchain

Experiment tracking for Isaac Lab and LeRobot training workflows. Azure ML provides managed MLflow tracking. OSMO supports both WANDB (default for LeRobot) and MLflow (via Azure ML backend).

📊 MLflow Tracking

Azure ML manages MLflow as the default experiment tracking backend. Isaac Lab training with SKRL logs metrics automatically through monkey-patching.

Isaac Lab (Automatic)

SKRL training logs metrics to MLflow without additional configuration. Metrics include episode rewards, training losses, optimization stats, and timing data.

Configure logging frequency with `--mlflow_log_interval`:

| Interval | Behavior | Use Case |
| --- | --- | --- |
| `step` | Log every training step | Debugging |
| `balanced` | Log every 10 steps (default) | Standard training |
| `rollout` | Log once per rollout cycle | Long runs |
| Integer | Custom step interval | Tuned granularity |

See MLflow Integration for SKRL metric categories, filtering, and troubleshooting.
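The interval values can be read as a mapping from a name to a step count. The sketch below illustrates that mapping under stated assumptions: `resolve_log_interval` is a hypothetical helper, not part of the toolchain, and the second argument (steps per rollout cycle) is an assumed input, because the documentation does not say how the rollout length is obtained.

```shell
# Hypothetical helper: translate an --mlflow_log_interval value into a step count.
# The training scripts resolve this internally and may do so differently.
resolve_log_interval() (
  interval="${1:-balanced}"   # "balanced" is the documented default
  rollout_steps="${2:-}"      # assumed input: steps per rollout cycle
  case "$interval" in
    step)     echo 1 ;;
    balanced) echo 10 ;;
    rollout)
      if [ -z "$rollout_steps" ]; then
        echo "rollout interval needs the rollout length" >&2
        exit 1
      fi
      echo "$rollout_steps" ;;
    *[!0-9]*)
      # Anything that is not a known name or a plain integer is rejected.
      echo "invalid interval: $interval" >&2
      exit 1 ;;
    *)
      if [ "$interval" -gt 0 ]; then
        echo "$interval"
      else
        echo "interval must be a positive integer" >&2
        exit 1
      fi ;;
  esac
)

resolve_log_interval balanced     # prints 10
resolve_log_interval rollout 24   # prints 24
```

A smaller step count gives finer-grained curves at the cost of more tracking requests, which is why `step` is suggested only for debugging.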

LeRobot

Enable MLflow for LeRobot on OSMO:

./scripts/submit-osmo-lerobot-training.sh \
  -d user/dataset \
  --mlflow-enable

Azure ML LeRobot submissions use MLflow automatically.

MLflow Configuration

| Parameter | Default | Description | Source |
| --- | --- | --- | --- |
| `--mlflow-token-retries` | `3` | MLflow token refresh retry count | `MLFLOW_TRACKING_TOKEN_REFRESH_RETRIES` |
| `--mlflow-http-timeout` | `60` | MLflow HTTP request timeout (sec) | `MLFLOW_HTTP_REQUEST_TIMEOUT` |
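Assuming the Source column names the environment variable behind each flag, the documented defaults can also be applied in the shell before submitting a job. The function name `apply_mlflow_defaults` is hypothetical, and whether the submit scripts read these variables from the calling environment is an assumption, not something the table confirms.

```shell
# Hypothetical helper: apply the documented defaults only when the
# variables are not already set, so explicit overrides are preserved.
apply_mlflow_defaults() {
  : "${MLFLOW_TRACKING_TOKEN_REFRESH_RETRIES:=3}"
  : "${MLFLOW_HTTP_REQUEST_TIMEOUT:=60}"   # seconds
  export MLFLOW_TRACKING_TOKEN_REFRESH_RETRIES MLFLOW_HTTP_REQUEST_TIMEOUT
}

apply_mlflow_defaults
echo "token refresh retries: $MLFLOW_TRACKING_TOKEN_REFRESH_RETRIES"
echo "HTTP timeout (sec): $MLFLOW_HTTP_REQUEST_TIMEOUT"
```

Raising the timeout may help on slow links to the tracking server; the right value depends on the network and is not prescribed here.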

📈 WANDB Integration

WANDB is the default experiment tracker for LeRobot workflows on OSMO. It tracks training loss, evaluation metrics, and model outputs.

Credential Setup

# Set WANDB API key (required)
osmo credential set wandb_api_key --generic --value "..."

# Set HuggingFace token (required for private datasets)
osmo credential set hf_token --generic --value "hf_..."

Enable and Disable

WANDB is enabled by default on OSMO LeRobot workflows:

# Explicitly enable (default)
./scripts/submit-osmo-lerobot-training.sh -d user/dataset --wandb-enable

# Disable WANDB
./scripts/submit-osmo-lerobot-training.sh -d user/dataset --wandb-disable

# Use MLflow instead
./scripts/submit-osmo-lerobot-training.sh -d user/dataset --mlflow-enable
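When the tracker choice comes from a variable (for example in a CI job), the three flags above can be selected by a small wrapper. This is a sketch only: `tracker_flag` and the names `wandb`, `mlflow`, and `none` are hypothetical conventions, and the final line prints the command as a dry run instead of submitting it.

```shell
# Hypothetical helper: map a tracker name to the documented submit flag.
tracker_flag() (
  case "${1:-wandb}" in           # WANDB is the documented default on OSMO
    wandb)  echo "--wandb-enable" ;;
    mlflow) echo "--mlflow-enable" ;;
    none)   echo "--wandb-disable" ;;
    *)
      echo "unknown tracker: $1" >&2
      exit 1 ;;
  esac
)

# Dry run: print the command that would be submitted.
echo ./scripts/submit-osmo-lerobot-training.sh -d user/dataset "$(tracker_flag mlflow)"
```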

📦 Model Registration

Training scripts register model checkpoints to Azure ML automatically at completion.

Registration Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| `--register-checkpoint` | Derived from task | Model name for registration |
| `--skip-register-checkpoint` | `false` | Skip automatic registration |
| `--register-model` | (none) | Model name (LeRobot inference) |

Registration Examples

# Isaac Lab: custom model name
./scripts/submit-azureml-training.sh \
  --register-checkpoint my-anymal-model

# Isaac Lab: skip registration
./scripts/submit-osmo-training.sh \
  --skip-register-checkpoint

# LeRobot: register after inference
./scripts/submit-osmo-lerobot-inference.sh \
  --policy-repo-id user/trained-policy \
  -r my-evaluated-model

Retrieve Registered Models

# Download from Azure ML
az ml model download \
  --name anymal-c-velocity --version 1 \
  --download-path ./checkpoint

# Download from HuggingFace Hub
huggingface-cli download user/trained-policy --local-dir ./checkpoint

🔄 Checkpoint Workflows

Training supports four checkpoint initialization modes:

| Mode | Weights | Optimizer | Counters | Use Case |
| --- | --- | --- | --- | --- |
| `from-scratch` | Random | Fresh | Reset | Initial training |
| `warm-start` | Loaded | Fresh | Reset | Transfer learning |
| `resume` | Loaded | Loaded | Loaded | Continue interrupted training |
| `fresh` | Random | Fresh | Reset | Architecture-only initialization |

# Resume training from MLflow artifact
./scripts/submit-azureml-training.sh \
  --checkpoint-uri "runs:/abc123/checkpoint" \
  --checkpoint-mode resume

# Warm-start from registered model
./scripts/submit-osmo-training.sh \
  --checkpoint-uri "models:/anymal-c-velocity/1" \
  --checkpoint-mode warm-start
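The mode table implies that only `warm-start` and `resume` read a checkpoint, so a pre-submit check can catch a missing URI early. The sketch below is a hypothetical helper, not the scripts' own validation. It accepts only the two URI forms shown in the examples above (`runs:/` and `models:/`); the scripts may accept other forms.

```shell
# Hypothetical pre-submit check: report what a mode loads and
# reject modes that need a checkpoint URI but did not get one.
check_checkpoint_args() (
  mode="${1:-}"
  uri="${2:-}"
  case "$mode" in
    from-scratch|fresh)
      # Weights start random, so no checkpoint is read.
      echo "mode=$mode loads=nothing" ;;
    warm-start|resume)
      case "$uri" in
        runs:/?*|models:/?*) ;;   # MLflow run artifact or registered model
        *)
          echo "mode '$mode' needs a runs:/ or models:/ URI" >&2
          exit 1 ;;
      esac
      if [ "$mode" = "resume" ]; then
        echo "mode=$mode loads=weights,optimizer,counters"
      else
        echo "mode=$mode loads=weights"
      fi ;;
    *)
      echo "unknown checkpoint mode: $mode" >&2
      exit 1 ;;
  esac
)

check_checkpoint_args resume "runs:/abc123/checkpoint"
check_checkpoint_args warm-start "models:/anymal-c-velocity/1"
```

The practical difference between the two loading modes is the optimizer state and counters: `resume` restores them to continue the same run, while `warm-start` resets them so the loaded weights seed a new run.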

MLflow Integration for SKRL metric logging internals
Isaac Lab Training for RL training workflows
LeRobot Training for behavioral cloning workflows
Scripts Reference for full CLI parameter tables
