virtualnodesOnAzureContainerInstances

Node Customizations

If you would like an alternate way to install virtual nodes on ACI, the Helm chart in this repo is also published to the chart repository https://microsoft.github.io/virtualnodesOnAzureContainerInstances/.

Customizations to the virtual node Node configuration are generally done by modifying the values.yaml file for the HELM install and then running a HELM upgrade action.

High Level Section List for convenient jumping:

  Standby Pools
  Non-StandbyPool Node customizations
  How to run more than one type of customized virtual node in the same AKS
  Scaling virtual nodes Up / Down

Standby Pools

To reduce boot latency, Standby Pools allow ACI to pre-create UVMs and cache images on them. General information about Standby Pools can be found here.

Prepare subscription

Register the providers and preview feature below to get access:

Register-AzResourceProvider -ProviderNamespace Microsoft.ContainerInstance
Register-AzResourceProvider -ProviderNamespace Microsoft.StandbyPool
Register-AzProviderFeature -FeatureName StandbyContainerGroupPoolPreview -ProviderNamespace Microsoft.StandbyPool

Configure the appropriate RBAC roles:

  1. In the Azure Portal, navigate to your subscriptions.
  2. Select the subscription whose RBAC permissions you want to adjust.
  3. Select Access Control (IAM).
  4. Select Add -> Add Custom Role.
  5. Name your role ContainersContributor.
  6. Move to the Permissions Tab.
  7. Select Add Permissions.
  8. Search for Microsoft.Container and select Microsoft Container Instance.
  9. Search for Microsoft.Network/virtualNetworks/subnets/join/action and select it.
  10. Select the permissions box to select all the permissions available.
  11. Select Add.
  12. Select Review + create.
  13. Select Create.
  14. Select Add -> Add role assignment.
  15. Under the Roles tab, search for the custom role you created earlier called ContainersContributor and select it.
  16. Move to the Members tab.
  17. Select + Select Members.
  18. Search for Standby Pool Resource Provider.
  19. Select the Standby Pool Resource Provider and select Review + Assign.
  20. If you do not have Contributor/Owner/Administrator roles on the subscription you are using ACI Standby Pools in, you will also need to set up Standby Pools RBAC roles (Standby Pool create, read, etc.) and assign them to yourself.

Install VN2 with Standby Pools

Modify the Helm chart values.yaml to set up standby pools using the parameters below.

| value | Short Summary |
| --- | --- |
| sandboxProviderType | Set to StandbyPool to configure the virtual node to use standby pools, or OnDemand (the default) if not |
| standbyPool.standbyPoolsCpu | How many cores to allocate for each standby pool UVM |
| standbyPool.standbyPoolsMemory | Memory in GB to allocate for each standby pool UVM |
| standbyPool.maxReadyCapacity | Number of warm, unused UVMs the standby pool will try to keep ready at all times |
| standbyPool.ccePolicy | Sets the CCE policy applied to the standby pool UVMs, and therefore to pods running on this node when the standby pool is used |
| standbyPool.zones | Semi-colon delimited list of zone names for the standby pool to ready UVMs in |
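
For illustration, here is a minimal values.yaml sketch assuming the dotted names above map to nested YAML keys; the CPU, memory, capacity, and zone values are placeholders rather than recommendations, and the exact value types may differ in the published chart:

sandboxProviderType: StandbyPool    # use OnDemand (the default) to disable standby pools
standbyPool:
  standbyPoolsCpu: 4                # cores to allocate for each standby pool UVM (placeholder)
  standbyPoolsMemory: 16            # memory in GB for each standby pool UVM (placeholder)
  maxReadyCapacity: 10              # warm, unused UVMs to keep ready (placeholder)
  ccePolicy: ''                     # optional CCE policy applied to the standby pool UVMs
  zones: '1;2'                      # semi-colon delimited zone names for readying UVMs (placeholder)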

Some notes, to be explicit about how standby pools carry the above settings into the node level:

Image Caching

To cache an image in your standby pool, you will need to create a pod with the annotation

"microsoft.containerinstance.virtualnode.imagecachepod": "true"

and schedule it on the virtual node(s). When a virtual node sees this annotation, it will not actually run the pod; instead it lists the images the pod references and caches them on every standby pool UVM. A single pod can reference multiple images, and multiple image cache pods can be defined. Any image credential type will work as well.

If this pod is deleted, the virtual node will stop ensuring the images contained in it are pre-cached into the standby pool for the node.

Example Pod YAML:

apiVersion: v1
kind: Pod
metadata:
  annotations:   
    microsoft.containerinstance.virtualnode.imagecachepod: "true"
  name: demo-pod
spec:
  containers:
  - command:
    - /bin/bash
    - -c
    - 'counter=1; while true; do echo "Hello, World! Counter: $counter"; counter=$((counter+1)); sleep 1; done'
    image: mcr.microsoft.com/azure-cli
    name: hello-world-counter
    resources:
      limits:
        cpu: 2250m
        memory: 2256Mi
      requests:
        cpu: 100m
        memory: 128Mi
  nodeSelector:
    virtualization: virtualnode2
  tolerations:
  - effect: NoSchedule
    key: virtual-kubelet.io/provider
    operator: Exists

Non-StandbyPool Node customizations

A non-exhaustive list of the non-standby-pool-specific configuration values available, and how to use them:

| value | Short Summary |
| --- | --- |
| replicaCount | Count of VN2 node pods. See Scaling virtual nodes Up / Down for more details |
| admissionControllerReplicaCount | Count of VN2 admission controller pods. See Scaling virtual nodes Up / Down for more details |
| aciSubnetName | A comma delimited list of subnets to potentially use as the node default. See this section on behaviors |
| aciResourceGroupName | The name of the Azure Resource Group to put virtual node’s ACI CGs into. See this section on behaviors |
| zones | A semi-colon delimited list of Azure Zones to deploy pods to. See this section on behaviors |
| containerLogsVolumeHostPath | Overrides directory behavior for virtual node container logs for niche customer scenarios. See documentation here |
| priorityClassName | Name of the Kubernetes Priority Class to assign to the virtual node pods. See Using Priority Classes for more details |
| admissionControllerPriorityClassName | Name of the Kubernetes Priority Class to assign to the VN2 admission controller pods. See Using Priority Classes for more details |
| podDisruptionBudget | Configuration for the Kubernetes Pod Disruption Budget (PDB) resource to use for the virtual node deployment. See Kubernetes documentation for available fields |
| admissionControllerPodDisruptionBudget | Configuration for the Kubernetes Pod Disruption Budget (PDB) resource to use for the VN2 admission controller deployment. See Kubernetes documentation for available fields |
| kubeProxyEnabled | Controls whether the node can create pods with a kube-proxy sidecar; set to false to disable. See Disabling the Kube-Proxy |
| customTags | Sets ARM tags on created ACI CGs. See Using Custom ARM Tags |
| acrTrustedAccess | Replaces the default image credential retriever with one capable of pulling images from a private network ACR. See Using a Private ACR with Trusted Access |
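
As an illustration of the two Pod Disruption Budget values, here is a minimal sketch that assumes the standard PDB spec fields (such as minAvailable or maxUnavailable) nest directly under these keys; check the chart's values.yaml for the exact schema it expects:

podDisruptionBudget:
  minAvailable: 1                            # assumed field: keep at least one virtual node pod available
admissionControllerPodDisruptionBudget:
  minAvailable: 1                            # assumed field: keep at least one admission controller pod available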

Default ACI Subnet behaviors with a customized aciSubnetName

This suboptimally-named field is actually a comma delimited list of subnets to potentially use as the default for the node.

Why is it a list, when the setting is meant to control which subnet is used as the default? One would reasonably assume only a single default is possible!

What values can be in this list? Each value can either be a subnet name OR a subnet resource ID.

Can my list have both subnet names and subnet resource IDs? It can indeed!
For example:

aciSubnetName: cg,/subscriptions/mySubGuid/resourceGroups/myAksRg/providers/Microsoft.Network/virtualNetworks/aks-vnet-25784907/subnets/cg,/subscriptions/mySubGuid/resourceGroups/myAksRg/providers/Microsoft.Network/virtualNetworks/adifferentvnet/subnets/adifferentsubnet

Changing the Azure Resource Group used for ACI resources via aciResourceGroupName

By default, virtual node will put its ACI resources into the same resource group as the AKS infrastructure (default AKS RG name MC_<aks rg name>_<aks cluster name>_<aks region>). However, this behavior can be controlled by updating the HELM value aciResourceGroupName.

When empty, the default is used; when overridden, it should contain just the name of the desired resource group, which must exist within the same subscription as the AKS cluster that the virtual nodes are being used with.

IMPORTANT: If this override is used, customers must ensure they do not reuse the same RG as the virtual node target for multiple AKS clusters! The product actively manages the target RG and ensures it matches what it expects, so if virtual nodes deployed to different AKS clusters all target the same aciResourceGroupName, they can and will fight with each other! This is not an issue for multiple virtual nodes within the same AKS cluster.

aciResourceGroupName: my_great_rg_name

Default Azure Zone behaviors with a customized zones

Azure has a concept of Availability Zones, which are separated groups of datacenters within the same region. If your scenario calls for it, you can specify the zone within your region that your pods should be hosted in.

zones: '<semi-colon delimited string of zones>'

NOTE: Today, ACI only supports providing a single zone as part of the request to allocate a sandbox for your pod. If you provide multiple, you should get an informative error effectively saying you can only provide one.

This setting applies a node-level default zone, so pods which do not have a pod-level zone annotation will have it applied. When set to an empty string, no zone will be used as this default.
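
For example, to default the node's pods to a single (illustrative) zone:

zones: '1'   # pods without a pod-level zone annotation will default to zone 1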

Override the virtual node log directory with a hostPath mount

For some niche customer scenarios, it can be useful for the container logs from the virtual node infrastructure pods to be available on the physical K8s host VMs.

If the value containerLogsVolumeHostPath is blank or empty (the default), there will be no change in behavior—container logs will use an emptyDir only within the virtual node infra pod.

If the value containerLogsVolumeHostPath is present and not an empty string, it will be used as the directory on the virtual node infra pod's host to which the container logs volume is hostPath mounted. The directory will be created if it doesn't already exist.

Example usage:

containerLogsVolumeHostPath: '/var/log/virtualnode'

Using Priority Classes

If you would like to use Kubernetes Priority Classes with the virtual node pods, you can specify the name of the priority class to use in the values.yaml file for the HELM chart using the following settings:

priorityClassName: <name of priority class for virtual node infra pods>
admissionControllerPriorityClassName: <name of priority class for admission controller pods>

priorityClassName will be used for the virtual node infra pods, while admissionControllerPriorityClassName will be used for the Admission Controller pods.

IMPORTANT: if you specify a priority class name that does not exist in the cluster, the virtual node infra pods will fail to start. Ensure that the specified priority class exists in the cluster before deploying the virtual node infra pods. The assigned priority classes should also exist as long as the virtual nodes infra pods are running.

An example of how to create a priority class in Kubernetes is as follows:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-virtnode
value: 1000000
globalDefault: false
description: "This priority class should be used for virtual node infra pods only."

You can then set the priority class names in the values.yaml file:

priorityClassName: high-priority-virtnode
admissionControllerPriorityClassName: high-priority-virtnode

Setting separate priority class names for the virtual node pods and the admission controller pods is also possible. You can also specify an existing priority class name that was separately created in the cluster.
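
For instance, the two deployments can use different (illustrative) class names, both of which must already exist in the cluster:

priorityClassName: high-priority-virtnode
admissionControllerPriorityClassName: medium-priority-admission   # hypothetical class created separately in the cluster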

Disabling the Kube-Proxy

You can disable the capability for pods created by the virtual node to have a kube-proxy sidecar added to them (by default enabled for non-confidential pods) by setting this value to false:

kubeProxyEnabled: false

The default is true, which preserves the existing behavior. Note that explicitly setting the pod-level setting to true will not override this, as this option removes the node-level configuration necessary to create that hookup.

Using Custom ARM Tags

Azure has a concept of resources having Tags, and this applies to the ACI CGs created by virtual nodes.

Customers who want to set these tags on the ACI CGs created by virtual nodes can set a default as part of the HELM values.

customTags: 'env=Dev;project=silverSpoon' # Custom tags to add to ACI container groups, in the format key1=value1;key2=value2

Azure ARM Tags are key-value pairs, so the setting is a semi-colon delimited string of key=value pairs. The above example has two tags: env=Dev and project=silverSpoon.

If custom tags are provided both at the node and pod level, and a given key is defined for both, the value from the pod level will override.

If custom tags are provided whose keys are already used by virtual nodes for system-level information, only the key=value pairs that overlap with a system-level key will be ignored.

Using a Private ACR with Trusted Access

Some customers have use cases where they would like to use ACRs which are not accessible from the public internet. This is now supported for use with virtual nodes, though it is a node-wide configuration that replaces the original image credential retrieval with new logic able to work in this scenario.

Update the ACR so that it cannot be publicly accessed but has Trusted Access enabled:

ACR without Public Access but Set with Trusted Access

Important: Trusted Access must be enabled for this feature to work!

You will then need a managed identity which has access to the ACR (this can be granted from the ACR's Access Control (IAM) with a role like AcrPull). You will need both the MI's full resource ID and its principal ID (both easily retrieved in the portal from the MI's Overview blade).

Then update these values in the values.yaml:

acrTrustedAccess:
  enabled: false
  identityResourceId: '' # Resource ID of managed identity with ACR access eg -  /subscriptions/<subId>/resourceGroups/<rgName>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/<miName>
  identityPrincipalId: '' # Principal ID of the managed identity with ACR access

enabled must be set to true
identityResourceId should be the full resource ID to the MI which has access to the private networked ACR
identityPrincipalId must be set to the principal ID for the MI in the resource ID above
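
For example, a filled-in configuration might look like the following, with the placeholders swapped for your own subscription, resource group, identity name, and principal GUID:

acrTrustedAccess:
  enabled: true
  identityResourceId: '/subscriptions/<subId>/resourceGroups/<rgName>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/<miName>'
  identityPrincipalId: '<principal ID GUID of the managed identity>'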

Nodes created with this configuration will be able to retrieve images from private networked ACRs to which the identity they were configured with has permissions!

How to run more than one type of customized virtual node in the same AKS

You may have a scenario where you want to run more than one virtual node HELM configuration in a single AKS cluster.

To achieve this, you will need to ensure only one of those HELM releases' values.yaml files has a non-zero replica count for the Admission Controller, which also implicitly controls registering the webhook. The default value is 1, as the Admission Controller is a required service for virtual node to function.

admissionControllerReplicaCount: 1

It is also strongly recommended to update the values.yaml namespace used for deploying each virtual node configuration so each has its own unique namespace.

namespace: <Something unique for each config>
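
As an illustration, two releases' values.yaml files might differ as follows (the namespace names are placeholders); only the first registers the Admission Controller:

# values.yaml for the first virtual node configuration
namespace: vn2-config-a
admissionControllerReplicaCount: 1   # only one release should have a non-zero count

# values.yaml for an additional virtual node configuration
namespace: vn2-config-b
admissionControllerReplicaCount: 0   # relies on the admission controller from the first release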

Scaling virtual nodes Up / Down

As would be expected from K8s resources, virtual node can be scaled up or down by modifying the replica count, either in place or with a HELM update.

The number of virtual node pods and Admission Controller pods can each be scaled separately. The virtual node pods are responsible for most K8s interactions with the virtualized pods, and can support at most 200 pods each. The Admission Controllers are present to ensure certain state about the virtual nodes is kept updated for the K8s Control Plane, as well as to make modifications to the pods being sent to the virtual nodes, which enables some functionality.

For every 200 pods you wish to host on virtual nodes, you will need to scale out an additional virtual node replica.
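
For example, hosting 450 pods requires ceil(450 / 200) = 3 virtual node replicas, which could be expressed in the values.yaml as:

replicaCount: 3   # 3 replicas x 200 pods each = capacity for up to 600 pods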

NOTE: Regardless which method is used, scaling down your virtual nodes requires some manual cleanup.

Manual Cleanup when Scaling Down virtual node pods

If you scale DOWN replicas for the virtual node, this will remove the virtual node backing pods but it will NOT clean up the “fake” K8s Nodes, and they will still appear to be “Ready” to the control plane. These will need to be cleaned up manually, via a command like:

kubectl delete node <nodeName>

To determine which nodes need to be cleaned up, look for the ones which no longer have a backing pod (virtual node names are the same as the pods backing them), which can be queried like so:

kubectl get pods -n <HELM NAMESPACE>
kubectl get nodes

Updating Replica Count in-place via Kubectl

Replica count for the resources can be updated in-place via Kubectl commands, like:

kubectl scale StatefulSet <HELM RELEASE NAME> -n <HELM RELEASE NAMESPACE> --replicas=<DESIRED REPLICA COUNT>

EG: kubectl scale StatefulSet virtualnode -n vn2 --replicas=1

Pitfall: The danger with this method is that if you do not align the HELM chart values, the next HELM update will overwrite the replica count and force a scale up or down to whatever is in the HELM values.

Updating Replica Count via HELM

The HELM’s values.yaml file has two values for controlling replica counts:

replicaCount: 1
admissionControllerReplicaCount: 1 

replicaCount controls the virtual node pod replicas, while admissionControllerReplicaCount controls the Admission Controller pod replicas.