Infrastructure Reference
Architecture details, module structure, Terraform outputs, and troubleshooting for the infrastructure deployment.
[!NOTE] This page is part of the deployment guide. Return there for the full deployment sequence.
🏗️ Architecture
Directory Structure
001-iac/
├── main.tf # Module composition
├── variables.tf # Input variables
├── outputs.tf # Output values
├── versions.tf # Provider versions
├── terraform.tfvars # Configuration (gitignored)
├── modules/
│ ├── platform/
│ │ ├── networking.tf # VNet, subnets, NAT Gateway, DNS resolver
│ │ ├── security.tf # Key Vault, managed identities
│ │ ├── observability.tf # LAW, App Insights; optional Monitor, Grafana, AMPLS
│ │ ├── storage.tf # Storage Account
│ │ ├── acr.tf # Container Registry
│ │ ├── azureml.tf # ML Workspace, optional compute cluster
│ │ ├── postgresql.tf # PostgreSQL Flexible Server
│ │ ├── redis.tf # Azure Managed Redis
│ │ └── private-dns-zones.tf # Private DNS zones
│ ├── sil/
│ │ ├── aks.tf # AKS cluster, node pools
│ │ ├── networking.tf # AKS subnets, NAT associations
│ │ ├── observability.tf # Container Insights, Prometheus DCRs
│ │ └── osmo-federated-credentials.tf # OSMO workload identity
│ ├── vpn/ # VPN Gateway module
│ └── automation/ # Automation Account module
├── vpn/ # Standalone VPN deployment
├── dns/ # OSMO UI DNS configuration
└── automation/ # Scheduled startup deployment
Module Structure
Root Module (001-iac/)
├── Platform Module # Shared Azure services
│ ├── Networking # VNet, subnets, NAT Gateway, DNS resolver
│ ├── Security # Key Vault (RBAC), managed identities
│ ├── Observability # LAW, App Insights (always); Monitor, Grafana, DCE, AMPLS (optional)
│ ├── Storage # Storage Account, ACR
│ ├── Machine Learning # AzureML Workspace, optional compute cluster
│ └── OSMO Backend # PostgreSQL, Redis
│
└── SiL Module # AKS-specific infrastructure
├── AKS Cluster # Azure CNI Overlay, workload identity
├── GPU Node Pools # Configurable via node_pools variable
└── Observability # Container Insights, Prometheus DCRs
Resources by Category
| Category | Resources |
|---|---|
| Networking | VNet, subnets (main, PE, AKS, GPU pools), NSG, NAT Gateway, DNS Private Resolver |
| Security | Key Vault (RBAC mode), ML identity, OSMO identity |
| Observability | Log Analytics (always), App Insights (always), Monitor Workspace, Grafana, DCE, AMPLS (optional) |
| Storage | Storage Account (blob/file), Container Registry (Premium) |
| Machine Learning | AzureML Workspace, optional managed compute cluster |
| AKS | Cluster with Azure CNI Overlay, system pool, GPU node pools |
| Private DNS | 6 base zones + conditional AKS zone + conditional monitor zones (up to 11) |
| OSMO Services | PostgreSQL Flexible Server (HA), Azure Managed Redis |
Conditional Resources
| Condition | Resources Created |
|---|---|
should_enable_private_endpoint | Private endpoints, DNS zones, DNS resolver |
should_enable_nat_gateway | NAT Gateway, Public IP, subnet associations |
should_deploy_postgresql | PostgreSQL server, databases, delegated subnet, DNS zone |
should_deploy_redis | Redis cache, private endpoint (if PE enabled), DNS zone |
should_deploy_grafana | Azure Managed Grafana, role assignments |
should_deploy_monitor_workspace | Azure Monitor Workspace for Prometheus |
should_deploy_ampls | AMPLS, scoped services, private endpoint (if PE enabled) |
should_deploy_dce | Data Collection Endpoint, AMPLS link (if AMPLS enabled) |
aml_compute_clusters | AzureML managed compute clusters keyed by cluster name |
should_include_aks_dns_zone | AKS private DNS zone in core zones |
AzureML managed network isolation
Use aml_managed_network_isolation_mode to control the AzureML workspace managed network directly.
| Value | AzureML workspace behavior |
|---|---|
Disabled | AzureML managed network is off, and AML compute can use a subnet ID |
AllowInternetOutbound | AzureML managed network is on with Microsoft-managed outbound access |
AllowOnlyApprovedOutbound | AzureML managed network is on and outbound access is restricted |
Treat changes to aml_managed_network_isolation_mode as AzureML redeploy operations. AzureML does not support disabling managed network isolation after it is enabled, or switching between AllowInternetOutbound and AllowOnlyApprovedOutbound in place. Delete and recreate managed compute resources when enabling managed networking on an existing workspace; recreate the workspace for unsupported mode transitions.
AzureML compute cluster migration
[!WARNING] Existing deployments that used the previous singular AzureML compute cluster resource require a one-time Terraform state migration. Without this migration, Terraform plans to destroy the existing cluster and create a replacement at the new keyed address.
Replace the previous should_deploy_aml_compute and aml_compute_config settings with the aml_compute_clusters map in terraform.tfvars:
aml_compute_clusters = {
"gpu-cluster" = {
vm_size = "Standard_NC4as_T4_v3"
vm_priority = "LowPriority"
min_node_count = 0
max_node_count = 1
scale_down_after_idle = "PT5M"
identity_type = "SystemAssigned"
}
}
Set identity_type = "SystemAssigned" when preserving the previous compute-cluster identity behavior. Omit identity_type to use the new default, UserAssigned, with the platform managed identity.
Move existing Terraform state from the former count address to the new cluster-name key before applying:
terraform state mv \
'module.platform.azurerm_machine_learning_compute_cluster.gpu[0]' \
'module.platform.azurerm_machine_learning_compute_cluster.gpu["gpu-cluster"]'
Update output consumers to read the cluster entry from the new map output:
terraform output -json aml_compute_clusters | jq '."gpu-cluster"'
📦 Modules
| Module | Purpose |
|---|---|
| platform | Networking, storage, Key Vault, ML workspace, PostgreSQL, Redis |
| sil | AKS cluster with GPU node pools |
| vpn | VPN Gateway module (used by vpn/ standalone deployment) |
📤 Outputs
terraform output
# AKS cluster details
terraform output -json aks_cluster | jq -r '.name'
# OSMO connection details
terraform output postgresql_connection_info
terraform output managed_redis_connection_info
# Key Vault name (for 002-setup scripts)
terraform output key_vault_name
# DNS server IP (for VPN clients)
terraform output dns_server_ip
# AzureML compute clusters keyed by cluster name
terraform output -json aml_compute_clusters | jq -r '."gpu-cluster".name'
🔧 Optional Components
Standalone deployments extend the base infrastructure.
VPN Gateway
Point-to-Site VPN for secure remote access to the private AKS cluster and Azure services.
[!IMPORTANT] Required for default configuration. With
should_enable_private_aks_cluster = true, you cannot runkubectlcommands or cluster setup scripts without VPN connectivity. To skip VPN, setshould_enable_private_aks_cluster = falsein yourterraform.tfvars.
cd vpn
cp terraform.tfvars.example terraform.tfvars
terraform init && terraform apply -var-file=terraform.tfvars
See VPN Gateway for configuration options and VPN client setup.
Private DNS for OSMO UI
Configure DNS resolution for the OSMO UI LoadBalancer after setup from infrastructure/setup/03-deploy-osmo-control-plane.sh (requires VPN):
cd dns
terraform init
terraform apply -var="osmo_loadbalancer_ip=10.0.x.x"
See dns/README.md for details.
Automation Account
Scheduled startup of AKS and PostgreSQL to reduce costs:
cd automation
cp terraform.tfvars.example terraform.tfvars
terraform init && terraform apply -var-file=terraform.tfvars
See automation/README.md for schedule configuration.
🔍 Troubleshooting
Issues and resolutions encountered during infrastructure deployment and destroy.
Destroy Takes a Long Time
Terraform destroy removes resources in dependency order. Private Endpoints, AKS clusters, and PostgreSQL servers commonly take 5-10 minutes each. Full destruction typically takes 20-30 minutes.
Monitor remaining resources during destruction:
az resource list --resource-group <resource-group> --query "[].{name:name, type:type}" -o table
Soft-Deleted Resources Block Redeployment
Azure retains certain deleted resources in a soft-deleted state. Redeployment fails when Terraform attempts to create a resource with the same name as a soft-deleted one.
| Resource | Soft Delete | Retention Period | Blocks Redeployment |
|---|---|---|---|
| Key Vault | Mandatory | 7-90 days (configurable) | Yes |
| Azure ML Workspace | Mandatory | 14 days (fixed) | Yes |
| Container Registry | Opt-in (preview) | 1-90 days (configurable) | No (disabled by default) |
| Storage Account | Recovery only | 14 days | No (same-name creation allowed) |
Purge Soft-Deleted Key Vault
az keyvault list-deleted --subscription <subscription-id> --resource-type vault -o table
az keyvault purge --subscription <subscription-id> --name <key-vault-name>
[!NOTE] Key Vaults with
purge_protection_enabled = truecannot be purged and must wait for retention expiry. This configuration defaults toshould_enable_purge_protection = false.
Purge Soft-Deleted Azure ML Workspace
Azure ML workspaces enter soft-delete for 14 days after deletion. List via Azure Portal under Azure Machine Learning > Manage deleted workspaces.
az ml workspace delete \
--name <workspace-name> \
--resource-group <resource-group> \
--permanently-delete
Terraform State Mismatch
Resources manually deleted or created outside Terraform cause state mismatches.
Refresh State for Deleted Resources
terraform refresh -var-file=terraform.tfvars
terraform plan -var-file=terraform.tfvars
Import Existing Resources
terraform plan -var-file=terraform.tfvars
terraform import -var-file=terraform.tfvars '<resource_address>' '<azure_resource_id>'
# Example: Import a resource group
terraform import -var-file=terraform.tfvars \
'module.platform.azurerm_resource_group.main' \
'/subscriptions/<sub-id>/resourceGroups/<rg-name>'
# Example: Import an AKS cluster
terraform import -var-file=terraform.tfvars \
'module.sil.azurerm_kubernetes_cluster.main' \
'/subscriptions/<sub-id>/resourceGroups/<rg-name>/providers/Microsoft.ContainerService/managedClusters/<aks-name>'
After import, run terraform plan to verify the imported resource matches the configuration.
Resource Locks Prevent Deletion
az lock list --resource-group <resource-group> -o table
az lock delete --name <lock-name> --resource-group <resource-group>
🔗 Related
- Infrastructure Deployment — deploy and configure Terraform resources
🤖 Crafted with precision by ✨Copilot following brilliant human instruction, then carefully refined by our team of discerning human reviewers.