
Isaac Lab Training

Train Isaac Lab reinforcement learning policies with the SKRL and RSL-RL backends. Both the Azure ML and OSMO platforms support distributed GPU training with automatic checkpointing and MLflow experiment tracking.

📋 Prerequisites

| Component | Requirement |
|---|---|
| Infrastructure | AKS cluster deployed via the Infrastructure Guide |
| Azure ML | Extension installed via `02-deploy-azureml-extension.sh` |
| OSMO | Control plane and backend via `03-deploy-osmo-control-plane.sh` |
| Terraform outputs | Available in `infrastructure/terraform/` (or provide values via CLI / environment variables) |
| Azure CLI | `az` with the `ml` extension for Azure ML submissions |
| OSMO CLI | `osmo` CLI installed and authenticated for OSMO submissions |

🚀 Quick Start

Azure ML

```bash
./scripts/submit-azureml-training.sh \
  --task Isaac-Velocity-Rough-Anymal-C-v0 \
  --num-envs 2048 \
  --stream
```

OSMO (Base64 Payload)

```bash
./scripts/submit-osmo-training.sh \
  --task Isaac-Velocity-Rough-Anymal-C-v0 \
  --num-envs 2048
```

OSMO (Dataset Injection)

```bash
./scripts/submit-osmo-dataset-training.sh \
  --task Isaac-Velocity-Rough-Anymal-C-v0 \
  --dataset-name my-training-v1
```

Dataset injection removes the ~1 MB payload size limit of base64-encoded archives and enables dataset reuse across runs.
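Because the ~1 MB ceiling applies to the base64-encoded archive, a quick pre-flight estimate tells you whether you need dataset injection. A minimal Python sketch (the `PAYLOAD_LIMIT` constant and helper name are assumptions for illustration, not part of the scripts):

```python
import base64
import io
import pathlib
import tarfile

PAYLOAD_LIMIT = 1024 * 1024  # assumed ~1 MB ceiling on the encoded payload


def encoded_payload_size(directory: str) -> int:
    """Size in bytes of `directory` tarred, gzipped, and base64-encoded,
    i.e. the form a base64 payload takes on submission."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        tar.add(directory, arcname=pathlib.Path(directory).name)
    return len(base64.b64encode(buf.getvalue()))
```

If the result exceeds the limit, switch to `submit-osmo-dataset-training.sh`.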

⚖️ Platform Selection

| Aspect | Azure ML | OSMO |
|---|---|---|
| Submission | `az ml job create` via YAML templates | `osmo workflow submit` |
| Orchestration | AKS compute targets | KAI Scheduler / Volcano integration |
| Experiment tracking | MLflow (managed) | MLflow (Azure ML backend) |
| Dataset delivery | Azure ML datastores | Base64 payload or OSMO bucket upload |
| Monitoring | Azure ML Studio | OSMO UI Dashboard |
| Payload modes | Single (YAML template) | Base64 or dataset folder injection |

Azure ML provides managed compute and experiment tracking through Azure ML Studio. OSMO adds distributed training coordination, KAI Scheduler integration, and a dataset versioning system.

⚙️ Training Configuration

Core parameters shared across platforms:

| Parameter | Default | Description |
|---|---|---|
| `--task` | `Isaac-Velocity-Rough-Anymal-C-v0` | Isaac Lab task identifier |
| `--num-envs` | `2048` | Parallel simulation environments |
| `--max-iterations` | (unset) | Training iteration limit |
| `--image` | `nvcr.io/nvidia/isaac-lab:2.2.0` | Container image |
| `--backend` | `skrl` | Training backend: `skrl` or `rsl_rl` |
| `--headless` | `true` | Disable rendering |

Values resolve in order: CLI arguments → environment variables → Terraform outputs.
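The precedence can be pictured with a small helper (a sketch; the function name and the Terraform output names are illustrative, not taken from the scripts):

```python
import os
import subprocess


def resolve(cli_value, env_var, tf_output):
    """Pick a value by precedence: CLI argument first, then an
    environment variable, then a Terraform output as the last resort."""
    if cli_value is not None:
        return cli_value
    env_value = os.environ.get(env_var)
    if env_value:
        return env_value
    # Fall back to `terraform output -raw <name>` in the infra directory.
    result = subprocess.run(
        ["terraform", "-chdir=infrastructure/terraform", "output", "-raw", tf_output],
        capture_output=True,
        text=True,
    )
    return result.stdout.strip() or None
```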

Training Backends

| Backend | Algorithms | Use Case |
|---|---|---|
| SKRL | PPO, IPPO, MAPPO, AMP, SAC | General-purpose RL with MLflow |
| RSL-RL | PPO, Distillation | Locomotion-focused, teacher-student |

SKRL is the default backend and supports automatic MLflow metric logging via monkey-patching. See MLflow Integration for metric details.
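The exact patch the SKRL integration applies is not shown here, but the general monkey-patching pattern looks like the sketch below, with an illustrative `track_data` hook and a plain list standing in for `mlflow.log_metric` (all names are assumptions):

```python
import types


class Trainer:
    """Stand-in for an SKRL-style trainer whose track_data hook
    normally writes scalar metrics to TensorBoard only."""

    def track_data(self, tag, value):
        pass


def mirror_metrics(trainer, sink):
    """Monkey-patch track_data so every scalar is also sent to `sink`
    (in the real integration, something like mlflow.log_metric)."""
    original = trainer.track_data  # keep the bound original

    def patched(self, tag, value):
        sink.append((tag, value))  # mirror the metric
        return original(tag, value)  # then call the original behavior

    trainer.track_data = types.MethodType(patched, trainer)


metrics = []
trainer = Trainer()
mirror_metrics(trainer, metrics)
trainer.track_data("Reward / Total reward (mean)", 1.5)
```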

🔄 Checkpoint Workflows

Four checkpoint modes control how training initializes:

| Mode | Behavior |
|---|---|
| `from-scratch` | Default. No checkpoint loaded; training starts fresh. |
| `warm-start` | Load weights only. Resets optimizer and iteration counters. |
| `resume` | Load full state. Continues from the exact training position. |
| `fresh` | Load model architecture only. Reinitializes all parameters. |
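In pseudocode terms, the four modes differ only in which parts of the checkpoint they keep. A sketch on plain state dicts (not the actual loader; the key names assume a typical PyTorch-style checkpoint):

```python
def apply_checkpoint_mode(ckpt, fresh_model, fresh_optimizer, mode):
    """Return (model_state, optimizer_state, start_iteration) for each
    checkpoint mode. `ckpt` is a dict with 'model', 'optimizer', and
    'iteration' keys."""
    if mode == "from-scratch":
        return fresh_model, fresh_optimizer, 0
    if mode == "warm-start":
        # Weights only: optimizer and iteration counter are reset.
        return ckpt["model"], fresh_optimizer, 0
    if mode == "resume":
        # Full state: continue from the exact training position.
        return ckpt["model"], ckpt["optimizer"], ckpt["iteration"]
    if mode == "fresh":
        # Architecture only: keep the freshly initialized parameters.
        return fresh_model, fresh_optimizer, 0
    raise ValueError(f"unknown checkpoint mode: {mode}")
```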

Checkpoint Examples

```bash
# Resume interrupted training (Azure ML)
./scripts/submit-azureml-training.sh \
  --checkpoint-uri "runs:/abc123/checkpoint" \
  --checkpoint-mode resume

# Warm-start from a registered model (OSMO)
./scripts/submit-osmo-training.sh \
  --checkpoint-uri "models:/anymal-c-velocity/1" \
  --checkpoint-mode warm-start
```

Model Registration

Training scripts register checkpoints to Azure ML automatically. Override the model name or skip registration:

```bash
# Custom model name
./scripts/submit-azureml-training.sh \
  --register-checkpoint my-custom-model

# Skip registration
./scripts/submit-osmo-training.sh \
  --skip-register-checkpoint
```

💾 Dataset Injection (OSMO)

OSMO supports two payload delivery modes for training code:

| Mode | Script | Size Limit | Versioning |
|---|---|---|---|
| Base64 payload | `submit-osmo-training.sh` | ~1 MB | None |
| Dataset injection | `submit-osmo-dataset-training.sh` | Unlimited | Automatic |

Dataset injection uploads `training/rl/` as a versioned OSMO dataset, mounted at `/data/<dataset_name>/training` in the container:

```bash
./scripts/submit-osmo-dataset-training.sh \
  --dataset-bucket custom-bucket \
  --dataset-name my-training-v1 \
  --task Isaac-Velocity-Rough-Anymal-C-v0
```

Before upload, the script stages files into a clean copy, excluding `__pycache__` and build artifacts via `.amlignore` patterns.
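The staging step amounts to a filtered copy. A sketch using `shutil.ignore_patterns` (the pattern list here is illustrative; the real filter comes from the `.amlignore` file):

```python
import pathlib
import shutil
import tempfile


def stage_for_upload(src, patterns=("__pycache__", "*.pyc", "build", "*.egg-info")):
    """Copy `src` into a temporary staging directory, skipping cache
    and build artifacts, and return the staged path."""
    staged = pathlib.Path(tempfile.mkdtemp()) / "staged"
    shutil.copytree(src, staged, ignore=shutil.ignore_patterns(*patterns))
    return staged
```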
