physical-ai-toolchain

Centralized hub for operational documentation covering monitoring, troubleshooting, security configuration, and GPU management for Azure-NVIDIA robotics deployments.

📖 Operations Guides

Guide Description
Troubleshooting Symptom-based resolution for common deployment, GPU, and workflow errors
Security Guide Security configuration inventory and deployment checklist
GPU Configuration Driver selection, MIG strategy, and GPU Operator configuration
AzureML Validation Job Debugging Debug AzureML extension and InstanceType validation failures
Deployment Validation Post-deployment verification steps
Cost Considerations Azure resource cost guidance

📋 Operational Overview

The reference architecture deploys configurable monitoring components through Terraform feature flags.

Component Purpose Feature Flag
Log Analytics workspace Central log aggregation Always deployed
Application Insights Application performance monitoring Always deployed
Azure Monitor workspace Prometheus metrics backend should_deploy_monitor_workspace
Managed Grafana Visualization dashboards should_deploy_grafana
Container Insights AKS container telemetry should_deploy_monitor_workspace
Prometheus data collection rules Metric scraping configuration should_deploy_monitor_workspace
Azure Monitor Private Link Scope Private network monitoring should_deploy_ampls
Data collection endpoint Private ingestion endpoint should_deploy_dce

[!IMPORTANT] The default configuration deploys a private AKS cluster. Connect through the VPN Gateway before running any kubectl or Helm commands. See VPN Gateway for setup instructions.

🤖 Crafted with precision by ✨Copilot following brilliant human instruction, then carefully refined by our team of discerning human reviewers.