OSMO Training Workflows

Submit distributed Isaac Lab training jobs through NVIDIA OSMO workflow orchestration on Azure Kubernetes Service. OSMO provides multi-GPU scheduling, automatic checkpointing, and a monitoring dashboard.

📋 Prerequisites

| Component | Requirement |
| --- | --- |
| OSMO control plane | Deployed via 03-deploy-osmo-control-plane.sh |
| OSMO backend | Installed via 04-deploy-osmo-backend.sh |
| Storage | Checkpoint storage configured |
| OSMO CLI | Installed and authenticated (see Accessing OSMO) |

📦 Available Templates

| Template | Purpose | Submission Script |
| --- | --- | --- |
| train.yaml | Isaac Lab training (base64 inline) | scripts/submit-osmo-training.sh |
| train-dataset.yaml | Isaac Lab training (dataset upload) | scripts/submit-osmo-dataset-training.sh |
| lerobot-train.yaml | LeRobot behavioral cloning | scripts/submit-osmo-lerobot-training.sh |
| lerobot-infer.yaml | LeRobot inference/evaluation | scripts/submit-osmo-lerobot-inference.sh |

⚙️ Workflow Comparison

| Aspect | train.yaml | train-dataset.yaml |
| --- | --- | --- |
| Payload | Base64-encoded archive | Dataset folder upload |
| Size limit | ~1 MB | Unlimited |
| Versioning | None | Automatic |
| Reusability | Per-run | Across runs |
| Setup | None | Bucket configured |
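
The size limit is the main practical difference between the two templates. As a rough sketch of how a wrapper could choose between them, the helper below measures the base64-encoded size of a gzip-compressed tar of the training folder and switches to the dataset workflow above roughly 1 MB. The function name `choose_submit_script`, the exact byte threshold, and the tar + gzip + base64 packaging are assumptions for illustration; the real submission script may package the payload differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pick a submission script from the encoded payload size.
# ASSUMPTION: the documented "~1MB" limit is treated as 1048576 bytes.
INLINE_LIMIT_BYTES=1048576

choose_submit_script() {
  local training_dir="$1"
  local encoded_bytes
  # Size of the folder after tar + gzip + base64 (assumed packaging format).
  # The arithmetic expansion also strips any padding that wc adds on some systems.
  encoded_bytes=$(( $(tar -czf - -C "$training_dir" . | base64 | wc -c) ))
  if [ "$encoded_bytes" -le "$INLINE_LIMIT_BYTES" ]; then
    echo "scripts/submit-osmo-training.sh"
  else
    echo "scripts/submit-osmo-dataset-training.sh"
  fi
}
```

Calling `choose_submit_script ./training` prints the path of the script that fits the payload, which can then be invoked directly.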

🏋️ Isaac Lab Training

The train.yaml workflow runs multi-GPU distributed training with KAI Scheduler / Volcano integration, automatic checkpointing, and monitoring through the OSMO UI.

Training Parameters

| Parameter | Description |
| --- | --- |
| azure_subscription_id | Azure subscription ID |
| azure_resource_group | Resource group name |
| azure_workspace_name | ML workspace name |
| task | Isaac Lab task name |
| num_envs | Parallel environments |
| max_iterations | Training iterations |

Submit Training

# Default configuration from Terraform outputs
./scripts/submit-osmo-training.sh

# Override parameters
./scripts/submit-osmo-training.sh \
--azure-subscription-id "your-subscription-id" \
--azure-resource-group "rg-custom"

📂 Isaac Lab Dataset Training

The train-dataset.yaml workflow injects the training folder through the OSMO bucket system instead of a base64-encoded archive. The training folder is mounted at /data/<dataset_name>/training.

Dataset Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| dataset_bucket | training | OSMO bucket for training code |
| dataset_name | training-code | Dataset name in bucket |
| training_localpath | (required) | Local path to training/ relative to workflow |
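
To make the mount path concrete, this small sketch builds it from a dataset name. The helper name `training_mount_path` is hypothetical; only the `/data/<dataset_name>/training` layout and the `training-code` default come from this section.

```shell
# Hypothetical helper: where the training folder appears inside the workflow container.
training_mount_path() {
  local dataset_name="${1:-training-code}"  # documented default for dataset_name
  printf '/data/%s/training\n' "$dataset_name"
}

training_mount_path                   # prints /data/training-code/training
training_mount_path my-training-code  # prints /data/my-training-code/training
```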

Submit Dataset Training

# Default configuration
./scripts/submit-osmo-dataset-training.sh

# Custom dataset bucket
./scripts/submit-osmo-dataset-training.sh \
--dataset-bucket custom-bucket \
--dataset-name my-training-code

🔧 Environment Variables

| Variable | Description |
| --- | --- |
| AZURE_SUBSCRIPTION_ID | Azure subscription ID |
| AZURE_RESOURCE_GROUP | Resource group name |
| WORKFLOW_TEMPLATE | Path to workflow template |
| OSMO_CONFIG_DIR | OSMO configuration directory |
| OSMO_DATASET_BUCKET | Dataset bucket name (default: training) |
| OSMO_DATASET_NAME | Dataset name (default: training-code) |
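
The defaults listed for the dataset variables map naturally onto shell parameter expansion. The sketch below shows one plausible way a wrapper might resolve them into the `--dataset-bucket` and `--dataset-name` flags. The function name `resolve_dataset_settings` is hypothetical, and whether the submission scripts read the variables exactly this way, or how they rank against explicit flags, is an assumption.

```shell
# Sketch: resolve dataset settings from the environment,
# falling back to the documented defaults when a variable is unset or empty.
resolve_dataset_settings() {
  local bucket="${OSMO_DATASET_BUCKET:-training}"
  local name="${OSMO_DATASET_NAME:-training-code}"
  echo "--dataset-bucket $bucket --dataset-name $name"
}

# Example: override only the dataset name, in a subshell so the
# change does not leak into the calling shell.
(OSMO_DATASET_NAME=my-training-code; resolve_dataset_settings)
```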

🔌 Accessing OSMO

OSMO services deploy to the osmo-control-plane namespace. The access method depends on the cluster's network configuration.

Via VPN (Default Private Cluster)

| Service | URL |
| --- | --- |
| UI Dashboard | http://10.0.5.7 |
| API Service | http://10.0.5.7/api |

osmo login http://10.0.5.7 --method=dev --username=testuser
osmo info

[!NOTE] Verify the internal load balancer IP: kubectl get svc -n azureml azureml-nginx-ingress -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
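
Because the load balancer IP can differ between deployments, it may be safer to derive the login URL from the cluster than to hard-code `10.0.5.7`. The sketch below wraps the `kubectl` query from the note. The function names `osmo_lb_ip` and `osmo_base_url` are hypothetical, and the sketch assumes the ingress service reports an IP rather than a hostname.

```shell
# Hypothetical helpers around the kubectl query shown in the note above.
osmo_lb_ip() {
  kubectl get svc -n azureml azureml-nginx-ingress \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
}

osmo_base_url() {
  local ip
  ip="$(osmo_lb_ip)"
  # An empty result usually means the load balancer has no address yet.
  if [ -z "$ip" ]; then
    echo "load balancer IP not assigned yet" >&2
    return 1
  fi
  echo "http://$ip"
}

# Usage (requires VPN and cluster access):
#   osmo login "$(osmo_base_url)" --method=dev --username=testuser
```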

Via Port-Forward (Public Cluster without VPN)

| Service | Port-Forward Command | Local URL |
| --- | --- | --- |
| UI Dashboard | kubectl port-forward svc/osmo-ui 3000:80 -n osmo-control-plane | http://localhost:3000 |
| API Service | kubectl port-forward svc/osmo-service 9000:80 -n osmo-control-plane | http://localhost:9000 |
| Router | kubectl port-forward svc/osmo-router 8080:80 -n osmo-control-plane | http://localhost:8080 |

# Start port-forward in background
kubectl port-forward svc/osmo-service 9000:80 -n osmo-control-plane &

# Login and verify
osmo login http://localhost:9000 --method=dev --username=testuser
osmo info

[!NOTE] Port-forwarding does not support the osmo workflow exec or osmo workflow port-forward commands. Both require the router service to be accessible via ingress.
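
A background port-forward takes a moment to start listening, and it keeps running after the login command returns. The sketch below polls the local port before logging in and stops the forward when the shell exits. `wait_for_port` is a hypothetical helper, it relies on Bash's `/dev/tcp` redirection (so it needs Bash rather than a plain POSIX shell), and the 15-second default timeout is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Hypothetical helper: return 0 once host:port accepts TCP connections, 1 on timeout.
wait_for_port() {
  local host="$1" port="$2" timeout_s="${3:-15}" waited=0
  # Bash-specific: opening /dev/tcp/HOST/PORT attempts a TCP connection.
  # The subshell keeps the test descriptor from leaking into the caller.
  until (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; do
    waited=$((waited + 1))
    [ "$waited" -ge "$timeout_s" ] && return 1
    sleep 1
  done
  return 0
}

# Usage (requires cluster access):
#   kubectl port-forward svc/osmo-service 9000:80 -n osmo-control-plane &
#   pf_pid=$!
#   trap 'kill "$pf_pid" 2>/dev/null' EXIT
#   wait_for_port localhost 9000 && \
#     osmo login http://localhost:9000 --method=dev --username=testuser
```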

📊 Monitoring

Access the OSMO UI dashboard:

| Access Method | URL |
| --- | --- |
| VPN | http://10.0.5.7 |
| Port-forward | http://localhost:3000 (after kubectl port-forward svc/osmo-ui 3000:80 -n osmo-control-plane) |

🚀 Quick Start

# Isaac Lab training with defaults
./scripts/submit-osmo-training.sh

# Isaac Lab training with custom parameters
./scripts/submit-osmo-training.sh \
--task Isaac-Cartpole-v0 \
--num-envs 512

# Dataset-based training
./scripts/submit-osmo-dataset-training.sh \
--dataset-bucket training \
--dataset-name my-code
