physical-ai-toolchain

Technical documentation for deploying, training, and operating robotics workloads on Azure with NVIDIA Isaac and OSMO. This index organizes every guide, reference, and walkthrough in the repository by topic so you can find what you need based on where you are in the workflow.

Documentation spans the full lifecycle: provisioning Azure infrastructure with Terraform, training reinforcement-learning policies with Isaac Lab and AzureML, and running inference on edge devices. Each section targets a specific audience and phase of the project.

👤 Audience Guide

Role | Start here
First-time deployer | Getting Started (coming soon), then Deploy
ML / Robotics engineer | Training (coming soon) and Inference (coming soon)
Platform operator | Operations and Security Guide
Contributor | Contributing

📖 Documentation Index

Section | Description | Status
Getting Started | Environment setup, prerequisites, and first deployment walkthrough | Coming soon
Deploy | Infrastructure provisioning with Terraform, AKS cluster setup, and networking | Available
Training | Model training pipelines with Isaac Lab, AzureML jobs, and OSMO orchestration | Coming soon
Inference | Serving trained policies for real-time control on edge and cloud | Coming soon
Workflows | AzureML and OSMO job templates, pipeline configuration, and submission scripts | Coming soon
Operations | Monitoring, scaling, troubleshooting, and cost management for running clusters | Available
Security | Identity, networking, compliance, and hardening for production deployments | Coming soon
Reference | CLI parameter tables, script usage, workflow templates, and configuration reference | Available
Contributing | Contribution guidelines, PR process, deployment validation, and coding conventions | Available

📄 Current Guides

The following standalone guides are available now. They cover common tasks and will move into their respective topic sections as the documentation structure expands.

Guide | Description
AzureML Validation Job Debugging | Diagnosing and resolving AzureML validation job failures on AKS, including pod scheduling and resource quota issues
LeRobot Inference | Running LeRobot inference workloads with pre-trained policies on Azure infrastructure
MLflow Integration | Configuring MLflow experiment tracking for SKRL training agents with automatic metric logging to Azure ML
Security Guide | Security configuration inventory, deployment responsibilities, and hardening checklist for robotics workloads

🚀 Next Steps

🤖 Crafted with precision by ✨Copilot following brilliant human instruction, then carefully refined by our team of discerning human reviewers.