Centralized hub for operational documentation covering monitoring, troubleshooting, security configuration, and GPU management for Azure-NVIDIA robotics deployments.
| Guide | Description |
|---|---|
| Troubleshooting | Symptom-based resolution for common deployment, GPU, and workflow errors |
| Security Guide | Security configuration inventory and deployment checklist |
| GPU Configuration | Driver selection, MIG strategy, and GPU Operator configuration |
| AzureML Validation Job Debugging | Debug AzureML extension and InstanceType validation failures |
| Deployment Validation | Post-deployment verification steps |
| Cost Considerations | Azure resource cost guidance |
The reference architecture deploys configurable monitoring components through Terraform feature flags.
| Component | Purpose | Feature Flag |
|---|---|---|
| Log Analytics workspace | Central log aggregation | Always deployed |
| Application Insights | Application performance monitoring | Always deployed |
| Azure Monitor workspace | Prometheus metrics backend | should_deploy_monitor_workspace |
| Managed Grafana | Visualization dashboards | should_deploy_grafana |
| Container Insights | AKS container telemetry | should_deploy_monitor_workspace |
| Prometheus data collection rules | Metric scraping configuration | should_deploy_monitor_workspace |
| Azure Monitor Private Link Scope | Private network monitoring | should_deploy_ampls |
| Data collection endpoint | Private ingestion endpoint | should_deploy_dce |
[!IMPORTANT] The default configuration deploys a private AKS cluster. Connect through the VPN Gateway before running any
kubectlor Helm commands. See VPN Gateway for setup instructions.
🤖 Crafted with precision by ✨Copilot following brilliant human instruction, then carefully refined by our team of discerning human reviewers.