physical-ai-toolchain

LeRobot behavioral cloning training for the ACT and Diffusion policy architectures. Training runs on Azure ML or OSMO, pulls datasets from HuggingFace Hub, and tracks experiments with WANDB and MLflow.

📋 Prerequisites

| Component | Requirement |
| --- | --- |
| Infrastructure | AKS cluster deployed via Infrastructure Guide |
| Azure ML or OSMO | At least one platform configured (see Platform Selection section) |
| HuggingFace token | Required for private datasets (hf_token credential) |
| WANDB API key | Required when --wandb-enable is set (default on OSMO) |

🚀 Quick Start

Azure ML

./scripts/submit-azureml-lerobot-training.sh \
  -d lerobot/aloha_sim_insertion_human

OSMO

./scripts/submit-osmo-lerobot-training.sh \
  -d lerobot/aloha_sim_insertion_human

End-to-End Pipeline (OSMO)

Train, evaluate, and register in one command:

./scripts/run-lerobot-pipeline.sh \
  -d lerobot/aloha_sim_insertion_human \
  --policy-repo-id user/my-act-policy \
  -r my-act-model

🧠 Policy Architectures

| Architecture | Type | Strengths |
| --- | --- | --- |
| ACT | Action Chunking with Transformers | Multi-step prediction, temporal coherence |
| Diffusion | Denoising Diffusion Policy | Multi-modal action distributions |

Select the architecture with --policy-type (short form -p):

# ACT policy (default)
./scripts/submit-osmo-lerobot-training.sh -d user/dataset -p act

# Diffusion policy
./scripts/submit-osmo-lerobot-training.sh -d user/dataset -p diffusion
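
Since only two values are documented for --policy-type, a small wrapper can reject typos before a job is queued. The sketch below is illustrative only: `validate_policy_type` is a hypothetical helper, not part of the repository scripts, and it assumes `act` and `diffusion` are the only accepted values.

```shell
# Hypothetical helper (not part of the repository scripts).
# Prints the policy type if it is one of the two documented values;
# otherwise writes an error to stderr and returns non-zero.
validate_policy_type() {
  case "$1" in
    act|diffusion)
      printf '%s\n' "$1"
      ;;
    *)
      printf 'unsupported policy type: %s (expected act or diffusion)\n' "$1" >&2
      return 1
      ;;
  esac
}

# Example use, commented out so the sketch has no side effects:
# policy="$(validate_policy_type "$1")" || exit 1
# ./scripts/submit-osmo-lerobot-training.sh -d user/dataset -p "$policy"
```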

⚖️ Platform Selection

| Aspect | Azure ML | OSMO |
| --- | --- | --- |
| Submission | az ml job create | osmo workflow submit |
| Experiment tracking | MLflow (managed) | WANDB (default) + MLflow (optional) |
| Credential handling | Azure ML environment variables | osmo credential set injection |
| Dataset delivery | HuggingFace Hub download | Hub download or OSMO bucket mount |
| Pipeline support | Manual multi-step | run-lerobot-pipeline.sh orchestration |
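
If both platforms are configured, the choice can be made once in a wrapper instead of remembering two script names. This is a minimal sketch: `submit_script_for` and the platform keys `azureml` and `osmo` are hypothetical names chosen for illustration. Only the two script paths come from the Quick Start.

```shell
# Hypothetical dispatcher: maps a platform key to the submission script
# shown in the Quick Start. It prints the path and does not execute it.
submit_script_for() {
  case "$1" in
    azureml) printf '%s\n' './scripts/submit-azureml-lerobot-training.sh' ;;
    osmo)    printf '%s\n' './scripts/submit-osmo-lerobot-training.sh' ;;
    *)
      printf 'unknown platform: %s (expected azureml or osmo)\n' "$1" >&2
      return 1
      ;;
  esac
}

# Example use, commented out so the sketch has no side effects:
# script="$(submit_script_for osmo)" || exit 1
# "$script" -d lerobot/aloha_sim_insertion_human
```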

⚙️ Training Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| --dataset-repo-id | (required) | HuggingFace dataset repository |
| --policy-type | act | Policy: act or diffusion |
| --job-name | lerobot-act-training | Job identifier |
| --image | pytorch/pytorch:2.4.1-cuda12.4-cudnn9-runtime | Container image |
| --training-steps | (LeRobot default) | Total training iterations |
| --batch-size | (LeRobot default) | Training batch size |
| --save-freq | 5000 | Checkpoint save frequency |
| --policy-repo-id | (none) | Pre-trained policy for fine-tuning |

Fine-Tuning from Existing Policy

./scripts/submit-osmo-lerobot-training.sh \
  -d user/my-dataset \
  --policy-repo-id user/pretrained-act \
  --training-steps 50000 \
  --batch-size 16

🔑 Credential Setup

OSMO Credentials

OSMO injects credentials at workflow runtime:

# HuggingFace token (required for private datasets)
osmo credential set hf_token --generic --value "hf_..."

# WANDB API key (required when wandb_enable=true)
osmo credential set wandb_api_key --generic --value "..."
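
To avoid pasting secrets into the command line by hand, the same two commands can read their values from environment variables. The sketch below assumes the token and key are already exported as `HF_TOKEN` and `WANDB_API_KEY` (names chosen here, not mandated by the scripts). `set_osmo_credential` and the `OSMO_CMD` override are hypothetical, added only so the call can be dry-run with `echo`.

```shell
# Hypothetical wrapper around the `osmo credential set` call shown above.
# OSMO_CMD is an illustrative override: set OSMO_CMD=echo for a dry run.
set_osmo_credential() {
  cred_name="$1"
  cred_value="$2"
  # Refuse to store an empty credential.
  if [ -z "$cred_value" ]; then
    printf 'no value provided for %s; skipping\n' "$cred_name" >&2
    return 1
  fi
  ${OSMO_CMD:-osmo} credential set "$cred_name" --generic --value "$cred_value"
}

# Example use, commented out so the sketch has no side effects:
# set_osmo_credential hf_token "${HF_TOKEN:-}"
# set_osmo_credential wandb_api_key "${WANDB_API_KEY:-}"
```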

Azure ML Credentials

Azure ML uses workspace-managed identity. Set environment variables for custom configurations:

| Variable | Description |
| --- | --- |
| AZURE_SUBSCRIPTION_ID | Azure subscription ID |
| AZURE_RESOURCE_GROUP | Resource group name |
| AZUREML_WORKSPACE_NAME | Azure ML workspace name |
| AZUREML_COMPUTE | Compute target name |
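
When a custom configuration relies on all four variables, a preflight check can report missing ones before submission. This is a sketch under that assumption: `check_azureml_env` is a hypothetical helper, and setups that rely on defaults may not need every variable.

```shell
# Hypothetical preflight check: reports which of the four variables
# listed above are unset or empty. Returns non-zero if any are missing.
check_azureml_env() {
  missing=""
  for var_name in AZURE_SUBSCRIPTION_ID AZURE_RESOURCE_GROUP \
                  AZUREML_WORKSPACE_NAME AZUREML_COMPUTE; do
    # Indirect lookup via eval keeps the sketch POSIX sh compatible.
    eval "var_value=\${$var_name:-}"
    if [ -z "$var_value" ]; then
      missing="$missing $var_name"
    fi
  done
  if [ -n "$missing" ]; then
    printf 'missing:%s\n' "$missing" >&2
    return 1
  fi
}

# Example use, commented out so the sketch has no side effects:
# check_azureml_env || exit 1
# ./scripts/submit-azureml-lerobot-training.sh -d user/dataset
```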

📊 Experiment Logging

WANDB (Default on OSMO)

WANDB logging is enabled by default on OSMO workflows and requires the wandb_api_key credential.

# Disable WANDB
./scripts/submit-osmo-lerobot-training.sh \
  -d user/dataset \
  --wandb-disable

MLflow (Azure ML Managed)

Azure ML training uses MLflow automatically. Enable MLflow on OSMO with:

./scripts/submit-osmo-lerobot-training.sh \
  -d user/dataset \
  --mlflow-enable

See Experiment Tracking for platform comparison and configuration details.

💾 Dataset Workflows

HuggingFace Hub (Default)

LeRobot downloads datasets from HuggingFace Hub at runtime. Specify datasets with --dataset-repo-id:

./scripts/submit-osmo-lerobot-training.sh \
  -d lerobot/aloha_sim_insertion_human

OSMO Dataset Mount

Mount datasets from OSMO buckets backed by Azure Blob Storage:

./scripts/submit-osmo-lerobot-training.sh \
  -w workflows/osmo/lerobot-train-dataset.yaml \
  -d user/fallback-dataset \
  --dataset-bucket my-bucket \
  --dataset-name my-lerobot-data

The workflow falls back to a HuggingFace Hub download of the -d repository when no dataset mount is available.
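
The fallback rule can be pictured as a simple check on the mount directory. The sketch below is not the workflow's actual logic: `resolve_dataset_source` is a hypothetical function, and treating "mount available" as "directory exists and is non-empty" is an assumption made for illustration.

```shell
# Hypothetical illustration of the mount-or-Hub decision.
# $1: directory where the bucket would be mounted (assumed path)
# $2: HuggingFace dataset repository used as the fallback
resolve_dataset_source() {
  mount_dir="$1"
  repo_id="$2"
  # The mount counts as available only if it exists and has entries.
  if [ -d "$mount_dir" ] && [ -n "$(ls -A "$mount_dir" 2>/dev/null)" ]; then
    printf 'mount:%s\n' "$mount_dir"
  else
    printf 'hub:%s\n' "$repo_id"
  fi
}
```

Calling it with an empty or missing directory prints the `hub:` form with the fallback repository, and calling it with a populated directory prints the `mount:` form.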

🔄 End-to-End Pipeline

The run-lerobot-pipeline.sh script orchestrates the full lifecycle on OSMO:

| Stage | Action |
| --- | --- |
| 1 | Submit training workflow |
| 2 | Poll workflow status until completion |
| 3 | Submit inference/evaluation workflow |

# Full pipeline
./scripts/run-lerobot-pipeline.sh \
  -d lerobot/aloha_sim_insertion_human \
  --policy-repo-id user/my-policy \
  -r my-model

# Training only with polling (skip inference)
./scripts/run-lerobot-pipeline.sh \
  -d user/dataset \
  --skip-inference

# Async mode (submit and exit)
./scripts/run-lerobot-pipeline.sh \
  -d user/dataset \
  --skip-wait
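
The polling stage can be sketched as a bounded loop. Everything here is a placeholder rather than the script's real implementation: `get_status` stands in for whatever status query your OSMO setup provides, and the status strings `COMPLETED` and `FAILED` are assumed values.

```shell
# Hypothetical stub: replace the body with a real status query.
# It must print a single status word on stdout.
get_status() {
  printf 'COMPLETED\n'
}

# Polls get_status every $1 seconds, at most $2 times.
# Returns 0 on COMPLETED, 1 on FAILED, 2 if attempts run out.
poll_until_done() {
  interval="$1"
  max_attempts="$2"
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    status="$(get_status)"
    case "$status" in
      COMPLETED) return 0 ;;
      FAILED)    return 1 ;;
    esac
    sleep "$interval"
    attempt=$((attempt + 1))
  done
  return 2
}

# Example use, commented out so the sketch has no side effects:
# poll_until_done 30 120 || echo "training did not complete" >&2
```

Bounding the number of attempts keeps a stuck workflow from blocking the caller indefinitely, and the distinct return codes let the caller tell a failure from a timeout.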
