
Infrastructure Deployment

Terraform configuration for the robotics reference architecture. Deploys Azure resources including AKS with GPU node pools, an Azure ML workspace, storage, and OSMO backend services.

> [!NOTE]
> This page is part of the deployment guide. Return there for the full deployment sequence.

📋 Prerequisites

| Tool | Version | Setup or Check |
|------|---------|----------------|
| Azure CLI | Latest | `az login` |
| Terraform | 1.5+ | `terraform version` |
| GPU VM quota | Region-specific | e.g., `Standard_NV36ads_A10_v5` |
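The checks in the prerequisites table can be scripted. The following is a minimal sketch, assuming bash and a `sort` that supports `-V` (GNU coreutils or a recent macOS). The helper names `version_ge` and `check_prereqs` are hypothetical, and the `NVADS` filter assumes the quota family for A10 v5 sizes contains that string; adjust it for the VM size you plan to use.

```shell
#!/usr/bin/env bash
# Sketch only: pre-flight checks for the prerequisites table.
# Helper names are hypothetical; assumes bash and `sort -V`.

# Succeeds when version $1 is >= version $2 (e.g. "1.5.7" >= "1.5").
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Usage: check_prereqs <azure-region>
check_prereqs() {
  local location="$1" tf_version
  command -v az >/dev/null 2>&1 || { echo "Azure CLI not found" >&2; return 1; }
  command -v terraform >/dev/null 2>&1 || { echo "Terraform not found" >&2; return 1; }

  # The first line of `terraform version` looks like "Terraform v1.5.7".
  tf_version="$(terraform version | head -n1 | sed 's/^Terraform v//')"
  version_ge "$tf_version" "1.5" || { echo "Terraform $tf_version is older than 1.5" >&2; return 1; }

  # Fails when there is no active `az login` session.
  az account show --output none || { echo "Run 'az login' first" >&2; return 1; }

  # Show vCPU usage and limits for quota families matching "NVADS" (assumed A10 v5 family).
  az vm list-usage --location "$location" --output table | grep -i 'NVADS' \
    || echo "No NVADS quota rows found in $location" >&2
}
```

Run it as `check_prereqs eastus`, substituting the region you plan to deploy to.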

Azure RBAC Permissions

| Role | Scope |
|------|-------|
| Contributor | Subscription (new RG) or Resource Group (existing RG) |
| Role Based Access Control Administrator | Subscription (new RG) or Resource Group (existing RG) |

Terraform creates role assignments for managed identities, which requires the `Microsoft.Authorization/roleAssignments/write` permission. The Contributor role excludes this action (its `NotActions` include `Microsoft.Authorization/*/Write`); the Role Based Access Control Administrator role grants it.

> [!NOTE]
> Use subscription scope if Terraform creates a new resource group (`should_create_resource_group = true`). Use resource group scope if the resource group already exists.

Alternatively, assign the Owner role, which grants more permissions than required.
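Per the note above, the assignment scope depends on whether Terraform creates the resource group. A minimal sketch with the Azure CLI, assuming bash; `rbac_scope` and `grant_deploy_roles` are hypothetical helper names, and the assignee is whichever user or service principal will run `terraform apply`.

```shell
#!/usr/bin/env bash
# Sketch only: assign the two roles from the RBAC table at the right scope.

# Usage: rbac_scope <subscription-id> <resource-group> <create-rg: true|false>
# Prints a subscription scope when Terraform creates the resource group,
# otherwise a resource group scope.
rbac_scope() {
  local sub="$1" rg="$2" create_rg="$3"
  if [ "$create_rg" = "true" ]; then
    printf '/subscriptions/%s\n' "$sub"
  else
    printf '/subscriptions/%s/resourceGroups/%s\n' "$sub" "$rg"
  fi
}

# Usage: grant_deploy_roles <assignee-object-id-or-upn> <scope>
grant_deploy_roles() {
  local assignee="$1" scope="$2" role
  for role in "Contributor" "Role Based Access Control Administrator"; do
    az role assignment create --assignee "$assignee" --role "$role" \
      --scope "$scope" --output none || return 1
  done
}
```

For example, to grant the roles to the signed-in user for an existing resource group, pass `$(az ad signed-in-user show --query id --output tsv)` as the assignee and `$(rbac_scope <subscription-id> <resource-group> false)` as the scope. Whoever runs the script must already be allowed to create role assignments at that scope.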

🚀 Quick Start

```shell
cd infrastructure/terraform
source prerequisites/az-sub-init.sh
cp terraform.tfvars.example terraform.tfvars
terraform init && terraform apply -var-file=terraform.tfvars
```

> [!IMPORTANT]
> The default configuration creates a private AKS cluster (`should_enable_private_aks_cluster = true`). After deploying the infrastructure, you must deploy the VPN Gateway and connect to it before running `kubectl` commands or cluster setup scripts.
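For the default private cluster, the post-apply sequence might look like the following sketch. It assumes bash, and assumes the VPN configuration lives in a `vpn` subdirectory of `infrastructure/terraform` that is a separate Terraform root module needing its own `terraform init`. `needs_vpn` and `post_apply` are hypothetical helpers, and the resource group and cluster names are placeholders you supply.

```shell
#!/usr/bin/env bash
# Sketch only: decide whether a VPN is needed, then fetch cluster credentials.

# Succeeds unless the tfvars file explicitly sets should_enable_private_aks_cluster = false.
# The default is true, so a missing file or missing line means "private".
needs_vpn() {
  local tfvars="${1:-terraform.tfvars}"
  ! grep -Eq '^[[:space:]]*should_enable_private_aks_cluster[[:space:]]*=[[:space:]]*false' "$tfvars" 2>/dev/null
}

# Usage: post_apply <resource-group> <aks-cluster-name>   (both placeholders)
post_apply() {
  local rg="$1" cluster="$2"
  if needs_vpn terraform.tfvars; then
    # Assumes ./vpn is its own root module that needs init before apply.
    (cd vpn && terraform init && terraform apply) || return 1
    # Connecting the VPN client is a manual step.
    read -r -p "Connect your VPN client, then press Enter to continue... "
  fi
  az aks get-credentials --resource-group "$rg" --name "$cluster"
  kubectl get nodes
}
```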

⚙️ Configuration

Core Variables

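| Variable | Description | Required |
|----------|-------------|----------|
| `environment` | Deployment environment (`dev`, `test`, `prod`) | Yes |
| `resource_prefix` | Resource naming prefix | Yes |
| `location` | Azure region | Yes |
| `instance` | Instance identifier | No (default: `"001"`) |
| `tags` | Resource group tags | No (default: `{}`) |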
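A minimal `terraform.tfvars` covering the core variables could be generated as follows. The sketch assumes bash; `write_core_tfvars`, the `owner` tag, and the guard restricting `environment` to the three values listed in the table are this example's own choices, not confirmed validation in the Terraform module.

```shell
#!/usr/bin/env bash
# Sketch only: write the core variables to a tfvars file.
# Usage: write_core_tfvars <file> <environment> <resource-prefix> <location>
write_core_tfvars() {
  local out="$1" env="$2" prefix="$3" location="$4"
  # This sketch's own guard: accept only the values listed in the table.
  case "$env" in
    dev|test|prod) ;;
    *) echo "environment must be dev, test, or prod" >&2; return 1 ;;
  esac
  # instance and tags are optional; shown with the default instance
  # and a hypothetical owner tag.
  cat > "$out" <<EOF
environment     = "$env"
resource_prefix = "$prefix"
location        = "$location"
instance        = "001"
tags = {
  owner = "robotics-team"
}
EOF
}
```

For example, `write_core_tfvars terraform.tfvars dev rbt eastus` writes a dev configuration with the hypothetical prefix `rbt`.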

AKS System Node Pool

| Variable | Description | Default |
|----------|-------------|---------|
| `system_node_pool_vm_size` | VM size for AKS system node pool | `Standard_D8ds_v5` |
| `system_node_pool_node_count` | Number of nodes for AKS system node pool | `1` |
| `system_node_pool_zones` | Availability zones for system node pool | `null` |
| `should_enable_system_node_pool_auto_scaling` | Enable auto-scaling for system node pool | `false` |
| `system_node_pool_min_count` | Minimum nodes when auto-scaling is enabled | `null` |
| `system_node_pool_max_count` | Maximum nodes when auto-scaling is enabled | `null` |
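When auto-scaling is enabled, the minimum and maximum counts (both `null` by default) likely need explicit values. A minimal sketch that prints the corresponding tfvars lines, assuming bash; `system_pool_autoscale_tfvars` is a hypothetical helper, and its requirement that the counts satisfy 1 ≤ min ≤ max is this example's own sanity check (AKS system pools cannot scale to zero), not documented module validation.

```shell
#!/usr/bin/env bash
# Sketch only: print tfvars lines enabling system node pool auto-scaling.

# Succeeds when $1 is a non-empty string of digits.
is_uint() { case "$1" in ''|*[!0-9]*) return 1 ;; *) return 0 ;; esac; }

# Usage: system_pool_autoscale_tfvars <min-count> <max-count>
system_pool_autoscale_tfvars() {
  local min="$1" max="$2"
  is_uint "$min" && is_uint "$max" || { echo "counts must be integers" >&2; return 1; }
  # This sketch's own check: system pools need at least one node.
  [ "$min" -ge 1 ] && [ "$min" -le "$max" ] || { echo "need 1 <= min <= max" >&2; return 1; }
  cat <<EOF
should_enable_system_node_pool_auto_scaling = true
system_node_pool_min_count = $min
system_node_pool_max_count = $max
EOF
}
```

Append the output with `system_pool_autoscale_tfvars 1 3 >> terraform.tfvars` only if those keys are not already set, since Terraform rejects an attribute defined twice in the same file.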

Feature Flags

| Variable | Description | Default |
|----------|-------------|---------|
| `should_enable_nat_gateway` | Deploy NAT Gateway for outbound connectivity | `true` |
| `should_enable_private_endpoint` | Deploy private endpoints and DNS zones for Azure services | `true` |
| `should_enable_private_aks_cluster` | Make AKS API endpoint private (requires VPN for `kubectl`) | `true` |
| `should_enable_public_network_access` | Allow public access to resources | `true` |
| `should_deploy_postgresql` | Deploy PostgreSQL Flexible Server for OSMO | `true` |
| `should_deploy_redis` | Deploy Azure Managed Redis for OSMO | `true` |
| `should_deploy_grafana` | Deploy Azure Managed Grafana dashboard | `true` |
| `should_deploy_monitor_workspace` | Deploy Azure Monitor Workspace for Prometheus metrics | `true` |
| `should_deploy_ampls` | Deploy Azure Monitor Private Link Scope and endpoint | `true` |
| `should_deploy_dce` | Deploy Data Collection Endpoint for observability | `true` |
| `should_deploy_aml_compute` | Deploy AzureML managed GPU compute cluster | `false` |
| `should_include_aks_dns_zone` | Include AKS private DNS zone in core DNS zones | `true` |
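Feature flags are plain booleans in `terraform.tfvars`, so toggling one is a single-line change. A minimal sketch of an idempotent setter, assuming bash and a tfvars file where each flag sits on its own unindented line; `tfvars_set` is a hypothetical helper and does not handle nested blocks such as `osmo_config` or values containing `|` or `&`.

```shell
#!/usr/bin/env bash
# Sketch only: set a top-level flag in a tfvars file, replacing an existing
# unindented assignment or appending a new one. Not for nested blocks
# (e.g. osmo_config) or values containing '|' or '&'.
# Usage: tfvars_set <file> <key> <value>
tfvars_set() {
  local file="$1" key="$2" value="$3" tmp
  touch "$file"
  if grep -Eq "^${key}[[:space:]]*=" "$file"; then
    # Rewrite via a temp file so this works with both GNU and BSD sed.
    tmp="$(mktemp)"
    sed -E "s|^${key}[[:space:]]*=.*|${key} = ${value}|" "$file" > "$tmp" && mv "$tmp" "$file"
  else
    printf '%s = %s\n' "$key" "$value" >> "$file"
  fi
}
```

For example, `tfvars_set terraform.tfvars should_deploy_aml_compute true` enables the AzureML compute cluster; running it again with `false` rewrites the same line rather than adding a duplicate.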

Network Configuration Modes

Three network deployment modes are supported, depending on your security requirements:

Full Private (Default)

All Azure services use private endpoints, and AKS has a private control plane. All access requires a VPN connection.

```hcl
# terraform.tfvars (default values)
should_enable_private_endpoint    = true
should_enable_private_aks_cluster = true
```

Deploy the VPN Gateway after the main infrastructure: `cd vpn && terraform apply`

Hybrid: Private Services, Public AKS

Azure services (Storage, Key Vault, ACR, PostgreSQL, Redis) use private endpoints, but the AKS control plane is publicly accessible. No VPN is required for `kubectl` access.

```hcl
# terraform.tfvars
should_enable_private_endpoint    = true
should_enable_private_aks_cluster = false
```

This mode protects Azure resources with private endpoints while still allowing cluster management without a VPN.

Full Public

All endpoints are publicly accessible. This mode is not recommended for production without additional hardening.

```hcl
# terraform.tfvars
should_enable_private_endpoint    = false
should_enable_private_aks_cluster = false
```

> [!WARNING]
> Public endpoints expose services to the internet. When using this configuration, you must secure cluster workloads:
>
> - AzureML extension: Configure HTTPS and restrict access through the inference router settings. See Secure online endpoints and Inference routing.
> - OSMO UI: Enable Keycloak authentication to protect the web interface. See OSMO Keycloak configuration.
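The two flags map onto the three modes as follows. A minimal sketch, assuming bash; `network_mode` and `aks_is_private` are hypothetical helpers. Public services combined with a private AKS API is not one of the documented modes, so the sketch reports it as undocumented rather than guessing. `aks_is_private` reads the `apiServerAccessProfile.enablePrivateCluster` property of a deployed cluster; the resource group and cluster names are placeholders.

```shell
#!/usr/bin/env bash
# Sketch only: classify the network mode from the two flags.
# Usage: network_mode <should_enable_private_endpoint> <should_enable_private_aks_cluster>
network_mode() {
  case "$1:$2" in
    true:true)   echo "full-private" ;;   # VPN required for kubectl
    true:false)  echo "hybrid" ;;         # private services, public AKS API
    false:false) echo "full-public" ;;    # harden workloads before production use
    *)           echo "undocumented" ;;   # not one of the three modes on this page
  esac
}

# Usage: aks_is_private <resource-group> <cluster-name>
# Prints the deployed cluster's enablePrivateCluster setting;
# the output may be empty if the property is unset.
aks_is_private() {
  az aks show --resource-group "$1" --name "$2" \
    --query apiServerAccessProfile.enablePrivateCluster --output tsv
}
```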

OSMO Workload Identity

Enable managed identities and workload identity federation for OSMO services (recommended for production):

```hcl
osmo_config = {
  should_enable_identity   = true
  should_federate_identity = true
  control_plane_namespace  = "osmo-control-plane"
  operator_namespace       = "osmo-operator"
  workflows_namespace      = "osmo-workflows"
}
```

See variables.tf for all configuration options.
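After applying with `should_federate_identity = true`, you can check which Kubernetes subjects a managed identity trusts. AKS workload identity federates service accounts using subjects of the form `system:serviceaccount:<namespace>:<service-account>`. The sketch below assumes bash; `fic_subject` and `check_federation` are hypothetical helpers, the identity, resource group, and service account names are placeholders, and it assumes (unverified) that the module creates federated credentials for service accounts in the namespaces set in `osmo_config`.

```shell
#!/usr/bin/env bash
# Sketch only: verify a federated identity credential for an OSMO service account.

# Builds the subject string AKS workload identity uses for a service account.
# Usage: fic_subject <namespace> <service-account>
fic_subject() {
  printf 'system:serviceaccount:%s:%s\n' "$1" "$2"
}

# Usage: check_federation <identity-name> <resource-group> <namespace> <service-account>
# All four arguments are placeholders; the names the module creates are not assumed here.
check_federation() {
  local identity="$1" rg="$2" subject
  subject="$(fic_subject "$3" "$4")"
  if az identity federated-credential list --identity-name "$identity" --resource-group "$rg" \
       --query "[].subject" --output tsv | grep -qxF "$subject"; then
    echo "federated: $subject"
  else
    echo "missing: $subject" >&2
    return 1
  fi
}
```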

🤖 Crafted with precision by ✨Copilot following brilliant human instruction, then carefully refined by our team of discerning human reviewers.