physical-ai-toolchain

Centralized hub for operational documentation covering monitoring, troubleshooting, security configuration, and GPU management for Azure-NVIDIA robotics deployments.

📖 Operations Guides

Guide	Description
Troubleshooting	Symptom-based resolution for common deployment, GPU, and workflow errors
Security Guide	Security configuration inventory and deployment checklist
GPU Configuration	Driver selection, MIG strategy, and GPU Operator configuration
AzureML Validation Job Debugging	Debug AzureML extension and InstanceType validation failures
Deployment Validation	Post-deployment verification steps
Cost Considerations	Azure resource cost guidance

📋 Operational Overview

The reference architecture deploys configurable monitoring components through Terraform feature flags.

Component	Purpose	Feature Flag
Log Analytics workspace	Central log aggregation	Always deployed
Application Insights	Application performance monitoring	Always deployed
Azure Monitor workspace	Prometheus metrics backend	`should_deploy_monitor_workspace`
Managed Grafana	Visualization dashboards	`should_deploy_grafana`
Container Insights	AKS container telemetry	`should_deploy_monitor_workspace`
Prometheus data collection rules	Metric scraping configuration	`should_deploy_monitor_workspace`
Azure Monitor Private Link Scope	Private network monitoring	`should_deploy_ampls`
Data collection endpoint	Private ingestion endpoint	`should_deploy_dce`

[!IMPORTANT] The default configuration deploys a private AKS cluster. Connect through the VPN Gateway before running any kubectl or Helm commands. See VPN Gateway for setup instructions.

🤖 Crafted with precision by ✨Copilot following brilliant human instruction, then carefully refined by our team of discerning human reviewers.

This site is open source. Improve this page.

physical-ai-toolchain

📖 Operations Guides

📋 Operational Overview

🔗 Related Resources