physical-ai-toolchain

Inventory of submission scripts for training, validation, and inference workflows on Azure ML and OSMO platforms. Each entry includes CLI arguments, environment variable overrides, and Terraform output resolution.

Note: For detailed submission examples, see Script Examples.

Submission Scripts

| Script | Purpose | Platform |
|---|---|---|
| `submit-azureml-training.sh` | Package code and submit Azure ML training job | Azure ML |
| `submit-azureml-validation.sh` | Submit model validation job | Azure ML |
| `submit-azureml-lerobot-training.sh` | Submit LeRobot training to Azure ML | Azure ML |
| `submit-osmo-training.sh` | Package code and submit OSMO workflow (base64) | OSMO |
| `submit-osmo-dataset-training.sh` | Submit OSMO workflow using dataset folder injection | OSMO |
| `submit-osmo-lerobot-training.sh` | Submit LeRobot behavioral cloning training | OSMO |
| `submit-osmo-lerobot-inference.sh` | Submit LeRobot inference/evaluation | OSMO |
| `run-lerobot-pipeline.sh` | End-to-end train → evaluate → register pipeline | OSMO |

Quick Start

Scripts auto-detect Azure context from Terraform outputs in infrastructure/terraform/:

# Azure ML training
./submit-azureml-training.sh --task Isaac-Velocity-Rough-Anymal-C-v0

# OSMO training (base64 encoded)
./submit-osmo-training.sh --task Isaac-Velocity-Rough-Anymal-C-v0

# OSMO training (dataset folder upload)
./submit-osmo-dataset-training.sh --task Isaac-Velocity-Rough-Anymal-C-v0

# LeRobot behavioral cloning (OSMO)
./submit-osmo-lerobot-training.sh -d lerobot/aloha_sim_insertion_human

# LeRobot behavioral cloning (Azure ML)
./submit-azureml-lerobot-training.sh -d lerobot/aloha_sim_insertion_human

# LeRobot inference/evaluation
./submit-osmo-lerobot-inference.sh --policy-repo-id user/trained-policy

# End-to-end pipeline: train → evaluate → register
./run-lerobot-pipeline.sh \
  -d lerobot/aloha_sim_insertion_human \
  --policy-repo-id user/my-policy \
  -r my-model

# Validation (requires registered model)
./submit-azureml-validation.sh --model-name anymal-c-velocity --model-version 1

Prerequisites

Common requirements:

Script-specific tools:

CLI Arguments

Values resolve in order: CLI arguments → environment variables → Terraform outputs (when applicable).
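The resolution order above can be illustrated with a small Bash helper. This is a minimal sketch, not the scripts' actual implementation: `resolve_value`, `cli_workspace`, and `tf_workspace` are hypothetical names introduced for illustration, and the sketch treats an empty string as "not set", which may differ from how individual options handle an explicitly empty value.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the scripts): print the first non-empty
# candidate. Pass candidates in priority order:
# CLI value, environment value, Terraform value.
resolve_value() {
  local candidate
  for candidate in "$@"; do
    if [[ -n "$candidate" ]]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1  # nothing was set at any level
}

# Illustrative values only; tf_workspace stands in for a Terraform output.
cli_workspace=""
tf_workspace="ws-from-terraform"
workspace="$(resolve_value "$cli_workspace" "${AZUREML_WORKSPACE_NAME:-}" "$tf_workspace")"
echo "workspace=$workspace"
```

With no CLI value supplied, the result comes from `AZUREML_WORKSPACE_NAME` if it is set in the environment, and from the Terraform stand-in otherwise.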

submit-azureml-training.sh

| Option | Default | Description | Source |
|---|---|---|---|
| `--environment-name` | `isaaclab-training-env` | Azure ML environment name | CLI |
| `--environment-version` | `2.3.2` | Azure ML environment version | CLI |
| `--image` / `-i` | `nvcr.io/nvidia/isaac-lab:2.3.2` | Container image | CLI |
| `--assets-only` | `false` | Register environment without submitting a job | CLI |
| `--job-file` / `-w` | `workflows/azureml/train.yaml` | Job YAML template | CLI |
| `--task` / `-t` | `Isaac-Velocity-Rough-Anymal-C-v0` | IsaacLab task | `TASK` |
| `--num-envs` / `-n` | `2048` | Number of parallel environments | `NUM_ENVS` |
| `--max-iterations` / `-m` | unset | Max iterations (empty to unset) | `MAX_ITERATIONS` |
| `--checkpoint-uri` / `-c` | unset | MLflow checkpoint artifact URI | `CHECKPOINT_URI` |
| `--checkpoint-mode` / `-M` | `from-scratch` | Checkpoint mode: `from-scratch`, `warm-start`, `resume`, or `fresh` | `CHECKPOINT_MODE` |
| `--register-checkpoint` / `-r` | derived from task | Model name for checkpoint registration | `REGISTER_CHECKPOINT` |
| `--skip-register-checkpoint` | `false` | Skip automatic model registration | CLI |
| `--headless` | `true` | Force headless rendering | CLI |
| `--gui` / `--no-headless` | `false` | Disable headless mode | CLI |
| `--run-smoke-test` / `-s` | `false` | Run Azure connectivity smoke test before submit | `RUN_AZURE_SMOKE_TEST` |
| `--mode` | `train` | Execution mode | CLI |
| `--subscription-id` | from TF | Azure subscription ID | `AZURE_SUBSCRIPTION_ID` / TF |
| `--resource-group` | from TF | Azure resource group | `AZURE_RESOURCE_GROUP` / TF |
| `--workspace-name` | from TF | Azure ML workspace | `AZUREML_WORKSPACE_NAME` / TF |
| `--compute` | from TF | Compute target override | `AZUREML_COMPUTE` / TF |
| `--instance-type` | `gpuspot` | Instance type | CLI |
| `--experiment-name` | unset | Experiment name override | CLI |
| `--job-name` | unset | Job name override | CLI |
| `--display-name` | unset | Display name override | CLI |
| `--stream` | `false` | Stream logs after submission | CLI |
| `--mlflow-token-retries` | `3` | MLflow token refresh retries | `MLFLOW_TRACKING_TOKEN_REFRESH_RETRIES` |
| `--mlflow-http-timeout` | `60` | MLflow HTTP request timeout (seconds) | `MLFLOW_HTTP_REQUEST_TIMEOUT` |
| `--` | n/a | Forward remaining args to `az ml job create` | CLI |

Example:

./submit-azureml-training.sh \
  --task Isaac-Velocity-Rough-Anymal-C-v0 \
  --num-envs 1024 \
  --stream
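The `--checkpoint-mode` option listed for this script accepts four values. The sketch below shows one way a wrapper could check the value before submitting a job. `validate_checkpoint_mode` is a hypothetical function, not something the scripts are known to define, and it makes no assumption about how each mode interacts with `--checkpoint-uri`.

```shell
#!/usr/bin/env bash
# Hypothetical validator for the documented --checkpoint-mode values.
validate_checkpoint_mode() {
  case "${1:-}" in
    from-scratch|warm-start|resume|fresh)
      return 0
      ;;
    *)
      echo "Invalid checkpoint mode: '${1:-}' (expected from-scratch, warm-start, resume, or fresh)" >&2
      return 1
      ;;
  esac
}

# Fall back to the documented default when CHECKPOINT_MODE is not set.
mode="${CHECKPOINT_MODE:-from-scratch}"
if validate_checkpoint_mode "$mode"; then
  echo "checkpoint mode: $mode"
fi
```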

submit-azureml-validation.sh

| Option | Default | Description | Source |
|---|---|---|---|
| `--model-name` | derived from task | Azure ML model name | CLI |
| `--model-version` | `latest` | Azure ML model version | CLI |
| `--environment-name` | `isaaclab-training-env` | Azure ML environment name | CLI |
| `--environment-version` | `2.3.2` | Azure ML environment version | CLI |
| `--image` | `nvcr.io/nvidia/isaac-lab:2.3.2` | Container image | CLI |
| `--task` | `Isaac-Velocity-Rough-Anymal-C-v0` | Override task ID | `TASK` |
| `--framework` | unset | Override framework | CLI |
| `--eval-episodes` | `100` | Evaluation episodes | CLI |
| `--num-envs` | `64` | Parallel environments | CLI |
| `--success-threshold` | unset | Success threshold (defaults from model metadata) | CLI |
| `--headless` | `true` | Run headless | CLI |
| `--gui` | `false` | Disable headless mode | CLI |
| `--job-file` | `workflows/azureml/validate.yaml` | Job YAML template | CLI |
| `--compute` | from TF | Compute target override | `AZUREML_COMPUTE` / TF |
| `--instance-type` | `gpuspot` | Instance type | CLI |
| `--experiment-name` | unset | Experiment name override | CLI |
| `--job-name` | unset | Job name override | CLI |
| `--stream` | `false` | Stream logs after submission | CLI |
| `--subscription-id` | from TF | Azure subscription ID | `AZURE_SUBSCRIPTION_ID` / TF |
| `--resource-group` | from TF | Azure resource group | `AZURE_RESOURCE_GROUP` / TF |
| `--workspace-name` | from TF | Azure ML workspace | `AZUREML_WORKSPACE_NAME` / TF |

Example:

./submit-azureml-validation.sh \
  --model-name anymal-c-velocity \
  --model-version 1 \
  --stream

submit-osmo-training.sh (base64 payload)

| Option | Default | Description | Source |
|---|---|---|---|
| `--workflow` / `-w` | `workflows/osmo/train.yaml` | Workflow template | CLI |
| `--task` / `-t` | `Isaac-Velocity-Rough-Anymal-C-v0` | IsaacLab task | `TASK` |
| `--num-envs` / `-n` | `2048` | Number of parallel environments | `NUM_ENVS` |
| `--max-iterations` / `-m` | unset | Max iterations (empty to unset) | `MAX_ITERATIONS` |
| `--image` / `-i` | `nvcr.io/nvidia/isaac-lab:2.3.2` | Container image | `IMAGE` |
| `--payload-root` / `-p` | `/workspace/isaac_payload` | Runtime extraction root | `PAYLOAD_ROOT` |
| `--backend` / `-b` | `skrl` | Training backend: `skrl` (default) or `rsl_rl` | `TRAINING_BACKEND` |
| `--checkpoint-uri` / `-c` | unset | MLflow checkpoint artifact URI | `CHECKPOINT_URI` |
| `--checkpoint-mode` / `-M` | `from-scratch` | Checkpoint mode: `from-scratch`, `warm-start`, `resume`, or `fresh` | `CHECKPOINT_MODE` |
| `--register-checkpoint` / `-r` | derived from task | Model name for checkpoint registration | `REGISTER_CHECKPOINT` |
| `--skip-register-checkpoint` | `false` | Skip automatic model registration | CLI |
| `--sleep-after-unpack` | unset | Seconds to sleep after unpacking (debug) | `SLEEP_AFTER_UNPACK` |
| `--run-smoke-test` / `-s` | `false` | Enable Azure connectivity smoke test | `RUN_AZURE_SMOKE_TEST` |
| `--azure-subscription-id` | from TF | Azure subscription ID | `AZURE_SUBSCRIPTION_ID` / TF |
| `--azure-resource-group` | from TF | Azure resource group | `AZURE_RESOURCE_GROUP` / TF |
| `--azure-workspace-name` | from TF | Azure ML workspace | `AZUREML_WORKSPACE_NAME` / TF |
| `--` | n/a | Forward remaining args to `osmo workflow submit` | CLI |

Example:

./submit-osmo-training.sh \
  --task Isaac-Velocity-Rough-Anymal-C-v0 \
  --backend skrl \
  -- --dry-run
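The trailing `--` in the example above passes everything after it to the underlying submit command. A common way to implement this kind of pass-through in Bash is sketched below. `parse_args`, `task`, and `forward_args` are hypothetical names, only one script option is handled for brevity, and the real scripts may parse their arguments differently.

```shell
#!/usr/bin/env bash
# Hypothetical parser: split script options from pass-through arguments.
parse_args() {
  task=""
  forward_args=()
  while [[ $# -gt 0 ]]; do
    case "$1" in
      -t|--task)
        if [[ $# -lt 2 ]]; then
          echo "Missing value for $1" >&2
          return 1
        fi
        task="$2"
        shift 2
        ;;
      --)
        shift
        forward_args=("$@")  # everything after -- is kept untouched
        break
        ;;
      *)
        echo "Unknown option: $1" >&2
        return 1
        ;;
    esac
  done
}

parse_args --task Isaac-Velocity-Rough-Anymal-C-v0 -- --dry-run
echo "task=$task forward=${forward_args[*]}"
# prints: task=Isaac-Velocity-Rough-Anymal-C-v0 forward=--dry-run
```

Storing the forwarded arguments in an array and later expanding it as `"${forward_args[@]}"` preserves arguments that contain spaces.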

submit-osmo-dataset-training.sh (dataset injection)

| Option | Default | Description | Source |
|---|---|---|---|
| `--workflow` / `-w` | `workflows/osmo/train-dataset.yaml` | Workflow template | CLI |
| `--task` / `-t` | `Isaac-Velocity-Rough-Anymal-C-v0` | IsaacLab task | `TASK` |
| `--num-envs` / `-n` | `2048` | Number of parallel environments | `NUM_ENVS` |
| `--max-iterations` / `-m` | unset | Max iterations (empty to unset) | `MAX_ITERATIONS` |
| `--image` / `-i` | `nvcr.io/nvidia/isaac-lab:2.3.2` | Container image | `IMAGE` |
| `--backend` / `-b` | `skrl` | Training backend: `skrl` (default) or `rsl_rl` | `TRAINING_BACKEND` |
| `--dataset-bucket` | `training` | OSMO bucket name | `OSMO_DATASET_BUCKET` |
| `--dataset-name` | `training-code` | Dataset name (auto-versioned) | `OSMO_DATASET_NAME` |
| `--training-path` | `training/` | Local path to upload | `TRAINING_PATH` |
| `--checkpoint-uri` / `-c` | unset | MLflow checkpoint artifact URI | `CHECKPOINT_URI` |
| `--checkpoint-mode` / `-M` | `from-scratch` | Checkpoint mode: `from-scratch`, `warm-start`, `resume`, or `fresh` | `CHECKPOINT_MODE` |
| `--register-checkpoint` / `-r` | derived from task | Model name for checkpoint registration | `REGISTER_CHECKPOINT` |
| `--skip-register-checkpoint` | `false` | Skip automatic model registration | CLI |
| `--run-smoke-test` / `-s` | `false` | Enable Azure connectivity smoke test | `RUN_AZURE_SMOKE_TEST` |
| `--azure-subscription-id` | from TF | Azure subscription ID | `AZURE_SUBSCRIPTION_ID` / TF |
| `--azure-resource-group` | from TF | Azure resource group | `AZURE_RESOURCE_GROUP` / TF |
| `--azure-workspace-name` | from TF | Azure ML workspace | `AZUREML_WORKSPACE_NAME` / TF |
| `--` | n/a | Forward remaining args to `osmo workflow submit` | CLI |

Example:

./submit-osmo-dataset-training.sh \
  --task Isaac-Velocity-Rough-Anymal-C-v0 \
  --dataset-name my-training-v1

Configuration

Scripts resolve values in order: CLI arguments → environment variables → Terraform outputs. The scripts read the following environment variables:

| Variable | Description |
|---|---|
| `AZURE_SUBSCRIPTION_ID` | Azure subscription |
| `AZURE_RESOURCE_GROUP` | Resource group name |
| `AZUREML_WORKSPACE_NAME` | ML workspace name |
| `TASK` | IsaacLab task name |
| `NUM_ENVS` | Number of parallel environments |
| `OSMO_DATASET_BUCKET` | Dataset bucket for OSMO training |
| `OSMO_DATASET_NAME` | Dataset name for OSMO training |
| `DATASET_REPO_ID` | HuggingFace dataset repo ID |
| `POLICY_TYPE` | LeRobot policy architecture |
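To see how an environment variable interacts with a documented default, the sketch below prints the value a run would fall back to when no CLI argument is given. `print_effective_config` is a hypothetical helper, not part of the scripts. The defaults are copied from the option tables in this document, and the two OSMO dataset defaults come from the `submit-osmo-dataset-training.sh` table.

```shell
#!/usr/bin/env bash
# Hypothetical helper: show each variable's value, or the documented
# default when the variable is unset or empty.
print_effective_config() {
  printf 'TASK=%s\n' "${TASK:-Isaac-Velocity-Rough-Anymal-C-v0}"
  printf 'NUM_ENVS=%s\n' "${NUM_ENVS:-2048}"
  printf 'OSMO_DATASET_BUCKET=%s\n' "${OSMO_DATASET_BUCKET:-training}"
  printf 'OSMO_DATASET_NAME=%s\n' "${OSMO_DATASET_NAME:-training-code}"
}

# Override one variable for a single call, the same way a variable can be
# prefixed to a script invocation such as: NUM_ENVS=512 ./submit-osmo-training.sh
NUM_ENVS=512 print_effective_config
```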

Script Library

| File | Purpose |
|---|---|
| `lib/terraform-outputs.sh` | Shared functions for reading Terraform outputs |

Source the library to use helper functions:

source lib/terraform-outputs.sh
read_terraform_outputs ../infrastructure/terraform
get_aks_cluster_name   # Returns AKS cluster name
get_azureml_workspace  # Returns ML workspace name
