# Volcano scheduler config tuned for AKS cluster autoscaler scale-from-zero.
#
# The AzureML Kubernetes extension installs Volcano with the `overcommit` and
# `proportion` plugins in the third tier. Both implement `JobEnqueueable`, which
# the `enqueue` action calls to decide whether a PodGroup may transition from
# Pending to Inqueue. Their decision is based on currently-Ready node capacity
# only (proportion: queue.Allocated + queue.Free, overcommit: total * factor),
# so on a cluster whose GPU node pools sit at count=0 they reject every
# PodGroup that requests GPUs. No such PodGroup is enqueued, Volcano never
# creates the underlying Pod, no Pending Pod is visible to the AKS cluster
# autoscaler, and the GPU pool is never scaled up: a self-induced deadlock.
#
# Removing both plugins makes `enqueue` permissive (the result defaults to
# Permit when no plugin rejects), so Volcano creates the Pod immediately. The
# Pod stays Pending because no node offers nvidia.com/gpu, the autoscaler
# scales the pool from 0 to 1, and once the node is Ready the `allocate`
# action binds the Pod. Gang scheduling (the `gang` plugin in the second tier)
# still gates `allocate`, so multi-pod jobs continue to wait for minAvailable
# before any task starts.
#
# Trade-off: queue-level capacity fairness across multiple PodGroups is no
# longer enforced at enqueue time. Acceptable on single-tenant dev/training
# clusters; re-enable proportion/overcommit on multi-tenant production clusters
# (pass --enforce-volcano-capacity-check to 02-deploy-azureml-extension.sh).
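#
# Sketch of the third tier with the capacity check re-enabled. Assumption: the
# two plugins are simply added back to the last tier of this file; the config
# that 02-deploy-azureml-extension.sh actually generates may differ.
#
#   - plugins:
#       - name: overcommit
#       - name: proportion
#       - name: predicates
#       - name: nodeorder
#       - name: binpack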
actions: "enqueue, allocate, backfill"
tiers:
  - plugins:
      - name: priority
        enableJobStarving: false
      - name: conformance
  - plugins:
      - name: gang
      - name: drf
        enablePreemptable: false
  - plugins:
      - name: predicates
      - name: nodeorder
      - name: binpack