Manage Node Pools

Add, remove, and resize AKS GPU and CPU node pools on a running cluster, then reconcile OSMO pool, platform, and pod-template configs without redeploying infrastructure.

[!NOTE] This workflow is for adjusting pool composition after initial deployment. For first-time cluster provisioning, see Cluster Setup.

When to Use

Use this when a workload requires resources the existing pools cannot provide. An AKS node pool has a single VM SKU, so changing the SKU means provisioning a new pool — node pool resources cannot be edited in place beyond a few mutable fields (see What Can and Cannot Change in Place).

Examples:

An SDG workflow requires >= 6.5 vCPU but the initial pool uses Standard_B4 (4 vCPU). Add a new pool with a larger SKU.
A new model needs H100 GPUs, but only A10 Spot nodes exist. Add a new H100 pool alongside the existing A10 pool.
A pool is no longer used and should be removed to reclaim quota.

How It Works

All node pools are driven by the node_pools Terraform variable in infrastructure/terraform/. The variable is a map keyed by pool name; Terraform uses for_each over the map to manage each pool, its subnet, NSG associations, and NAT gateway associations independently.

Pool changes follow the standard repo flow:

Edit node_pools in infrastructure/terraform/terraform.tfvars.
Run terraform apply to create, destroy, or update pool resources.
If the new pool requires different nodeSelector, tolerations, or resource overrides, edit infrastructure/setup/values/osmo-platforms.yaml and rerun infrastructure/setup/03-deploy-osmo.sh.

Script 03 deploys osmo-platforms.yaml as a Helm values overlay. Rerun is only needed when platform configuration changes (nodeSelector, tolerations, resource limits) — not for count-only scaling changes.
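
The rerun rule above can be expressed as a small shell check. This is a minimal sketch, not part of the repo: the function name `needs_osmo_rerun` is hypothetical, and it assumes that `node_count`, `min_count`, and `max_count` are the only count-only fields, treating every other edited field conservatively as a reason to rerun script 03.

```bash
# Hypothetical helper (not shipped with the repo).
# Arguments: the node_pools field names you edited in terraform.tfvars.
# Exit status 0 means "rerun 03-deploy-osmo.sh"; 1 means "count-only, skip".
needs_osmo_rerun() {
  local field
  for field in "$@"; do
    case "$field" in
      node_count|min_count|max_count) ;;  # count-only scaling: no rerun
      *) return 0 ;;                      # assumption: anything else needs a rerun
    esac
  done
  return 1
}
```

For example, `needs_osmo_rerun max_count` fails (no rerun needed), while `needs_osmo_rerun node_taints max_count` succeeds.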

[!NOTE] Single-node multi-GPU jobs require an OSMO platform whose pod-template nodeSelector targets a multi-GPU node SKU. The shipped gpu_platform_2x platform (pod template gpu_tpl_2x) binds a 2x A100 pool and is selected via the workflow platform field — for example submit-osmo-lerobot-training.sh --num-gpus 2 --platform gpu_platform_2x. The single-GPU gpu_platform cannot satisfy a 2-GPU request because its node SKU exposes one GPU. Add further multi-GPU platforms by copying this pair in osmo-platforms.yaml.

Prerequisites

Terraform state in infrastructure/terraform/ matches the deployed cluster.
kubectl, terraform, az, helm, osmo, and jq available on PATH.
Active Azure CLI session (az login) with rights to modify the cluster resource group.
VPN connection if the cluster is private (default).
The same flags you originally passed to 03-deploy-osmo.sh (for example, --use-acr).
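
A quick preflight for the tool prerequisite could look like the following. It is a sketch only: `missing_tools` is a hypothetical helper name, and it checks presence on PATH, not versions or login state.

```bash
# Hypothetical helper: print every tool from the argument list that is not
# found on PATH. Prints nothing when all tools are available.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

Running `missing_tools kubectl terraform az helm osmo jq` before starting should print nothing; any name it prints needs to be installed first.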

What Can and Cannot Change in Place

The table below shows which fields on a node_pools entry can change in place. Fields marked "No" are ForceNew: editing them destroys and recreates the pool under the same name.

| Field | In-place? | Notes |
| --- | --- | --- |
| `vm_size` | No | VMSS SKU is immutable; AKS rejects in-place SKU changes |
| `subnet_address_prefixes` | No | The subnet itself is also a `ForceNew` resource |
| `zones` | No | Availability zone is set at pool creation |
| `priority` | No | `Regular` vs `Spot` is set at pool creation |
| `eviction_policy` | No | Tied to `priority`; only valid for `Spot` |
| `gpu_driver` | No | Affects pool creation flags |
| `node_count` | Yes | When autoscaler is disabled |
| `min_count`, `max_count` | Yes | When autoscaler is enabled |
| `should_enable_auto_scaling` | Yes | Toggling on/off updates the existing pool |
| `node_labels` | Yes | Applied to existing nodes |
| `node_taints` | Yes | Applied to existing nodes (workloads may be evicted) |

Changing any field marked "No" means choosing between two flows:

Add new pool, then remove old (recommended for SKU upgrades): zero capacity gap, no forced eviction. Workloads migrate at your pace.
In-place replace (simpler, but the old pool is destroyed before its replacement is ready): brief capacity gap, all pods on the pool evicted at once.
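
Before applying, it can help to confirm whether an edit will trigger a replacement. The sketch below filters `terraform plan` text for the "must be replaced" marker that Terraform prints next to resources it will destroy and recreate. The function name `replaced_resources` is hypothetical, and the resource addresses it prints depend on how the module names its resources.

```bash
# Hypothetical helper: read `terraform plan -no-color` output on stdin and
# print the address of each resource the plan marks for replacement.
replaced_resources() {
  local line
  while IFS= read -r line; do
    case "$line" in
      *"# "*" must be replaced")
        line=${line#*"# "}                        # strip indentation and "# "
        printf '%s\n' "${line% must be replaced}" # strip the trailing marker
        ;;
    esac
  done
}
```

A possible invocation is `terraform -chdir=infrastructure/terraform plan -no-color | replaced_resources`. Empty output suggests the change is applied in place; any listed node pool will be recreated.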

Workflows

List Current Pools

terraform -chdir=infrastructure/terraform output -json | \
  jq -r '.node_pools.value | to_entries[] | "\(.key)\t\(.value.vm_size)\t\(.value.priority)"'

Resize an Existing Pool (In-Place)

Resizing means changing node_count, min_count, max_count, node_labels, or node_taints. None of these recreate the pool.

Edit infrastructure/terraform/terraform.tfvars:

node_pools = {
  gpu = {
    vm_size                    = "Standard_NV36ads_A10_v5"
    subnet_address_prefixes    = ["10.0.7.0/24"]
    priority                   = "Spot"
    should_enable_auto_scaling = true
    min_count                  = 1
    max_count                  = 4   # changed from 1
    eviction_policy            = "Delete"
    node_taints                = ["nvidia.com/gpu:NoSchedule", "kubernetes.azure.com/scalesetpriority=spot:NoSchedule"]
    gpu_driver                 = "Install"
  }
}

Apply:

source infrastructure/terraform/prerequisites/az-sub-init.sh
terraform -chdir=infrastructure/terraform apply

Rerun the OSMO control-plane script if taints or labels changed (not needed for count-only changes):
```
bash infrastructure/setup/03-deploy-osmo.sh --use-acr
```

Add a New Pool

Use this to add capacity (different SKU, different priority, different zones) without disturbing existing pools.

Add a new map entry in terraform.tfvars alongside the existing pools. Pick a non-overlapping subnet:

node_pools = {
  gpu = { ... }                                     # existing - unchanged
  sdgcpu = {                                        # new
    vm_size                    = "Standard_D8ds_v5"
    subnet_address_prefixes    = ["10.0.12.0/24"]
    priority                   = "Regular"
    should_enable_auto_scaling = false
    node_count                 = 1
  }
}

Apply Terraform (for_each creates only the new pool, its subnet, and NSG/NAT associations):
```
terraform -chdir=infrastructure/terraform apply
```
Rerun the OSMO control-plane script so the new pool appears in POOL, PLATFORM, and POD_TEMPLATE configs:
```
bash infrastructure/setup/03-deploy-osmo.sh --use-acr
```
Verify:
```
kubectl get nodes -l agentpool=sdgcpu
az aks nodepool list --resource-group <rg> --cluster-name <aks> -o table
```
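
To compare the number of schedulable nodes against the configured node_count, the `kubectl get nodes` output can be counted. This is a sketch under the assumption that the standard column layout (NAME, STATUS, ...) is used with `--no-headers`; `count_ready_nodes` is a hypothetical helper name.

```bash
# Hypothetical helper: read `kubectl get nodes --no-headers` output on stdin
# and print how many nodes report exactly "Ready". Cordoned nodes
# ("Ready,SchedulingDisabled") and "NotReady" nodes are not counted.
count_ready_nodes() {
  local name status rest ready=0
  while read -r name status rest; do
    if [ "$status" = "Ready" ]; then
      ready=$((ready + 1))
    fi
  done
  echo "$ready"
}
```

For the pool above, `kubectl get nodes -l agentpool=sdgcpu --no-headers | count_ready_nodes` should eventually print the pool's node_count.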
The OSMO-side pool and platform configuration is applied by the 03-deploy-osmo.sh rerun above; a successful run is the confirmation. The pool definition itself lives in infrastructure/setup/values/osmo-platforms.yaml.

Remove a Pool

Drain workloads off the pool. For OSMO workflows, stop submitting to that pool and let active workflows finish, or cordon the nodes:

kubectl get nodes -l agentpool=<pool> -o name | xargs -I {} kubectl cordon {}
kubectl get nodes -l agentpool=<pool> -o name | xargs -I {} kubectl drain {} --ignore-daemonsets --delete-emptydir-data

Delete the map entry from terraform.tfvars and apply:
```
terraform -chdir=infrastructure/terraform apply
```
Edit infrastructure/setup/values/osmo-platforms.yaml if it referenced the removed pool, then rerun the OSMO control-plane script:
```
bash infrastructure/setup/03-deploy-osmo.sh --use-acr
```

Replace a Pool SKU (Two-Step, No Capacity Gap)

Recommended path for upgrading from one SKU to another without evicting workloads.

Add the new pool with a different name (see Add a New Pool).
Migrate workloads. For OSMO, submit new workflows targeting the new pool; let active workflows on the old pool drain.
Remove the old pool (see Remove a Pool).

Replace a Pool SKU (In-Place, With Capacity Gap)

Faster but disruptive. Use only when no workloads are running on the pool, or when downtime is acceptable.

Edit vm_size on the existing map entry:

node_pools = {
  gpu = {
    vm_size = "Standard_NC40ads_H100_v5"   # changed
    # ...
  }
}

Apply. Terraform plans a -/+ action (destroy, then create a replacement) for the pool:
```
terraform -chdir=infrastructure/terraform apply
```
All nodes in the pool are evicted at once; new nodes come up under the same pool name.

Rerun the OSMO control-plane script:

bash infrastructure/setup/03-deploy-osmo.sh --use-acr

Operational Notes

Subnet planning. Every pool gets its own subnet. Pick a CIDR that does not overlap aks_subnet_config or any other pool's subnet_address_prefixes. AKS Overlay mode applies to pods; the node IP space is what you size here.
OSMO flag parity. Pass the same flags you used for the initial 03-deploy-osmo.sh run (for example, --use-acr). Omitting them reverts the deployment to defaults.
Spot constraints. Azure rejects upgrade_settings for Spot pools; the Terraform module already handles this. eviction_policy applies only when priority = "Spot".
Autoscaling. min_count = 0 is allowed; the pool scales up on demand from pending pods. KAI/Volcano coscheduling requires whole-pool capacity for gang-scheduled jobs.
Scale-from-zero for AzureML. GPU pools used by AzureML jobs must declare node_labels = { accelerator = "nvidia" }. Without this static label, the cluster autoscaler cannot prove a from-zero scale-up would satisfy AzureML InstanceTypes that select on accelerator: nvidia, and refuses to scale. See Azure ML Training Workflows — Scale-from-zero GPU Pools.
Rerun script 03 after pool changes. 03-deploy-osmo.sh reconciles the unified OSMO deployment, including pool, platform, and backend config.

Related

Cluster Setup: initial deployment and scenarios
Cluster Operations: troubleshooting and optional scripts
Infrastructure Reference: node_pools variable schema

