Vertical Pod Autoscaler (VPA)

The Vertical Pod Autoscaler (VPA) automatically adjusts the CPU and memory requests and limits of pods based on observed resource utilization. Unlike the Horizontal Pod Autoscaler (HPA), which changes the number of pods, VPA changes the resources allocated to each pod so that they better match actual usage. This approach is particularly valuable for applications whose resource requirements are predictable but may change over time.

Operation Modes and Behaviour

VPA operates in several modes that determine how resource recommendations are applied. The “Off” mode provides recommendations without making any changes, allowing administrators to review suggested modifications before implementation. The “Initial” mode sets resource requests only when pods are created, while the “Auto” mode actively updates resource requests for running pods by recreating them with new specifications.
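As an illustration of where the mode is set, the following Python sketch builds a VerticalPodAutoscaler manifest as a plain dictionary and prints it as JSON. The object name "web-vpa" and the Deployment name "web" are hypothetical, and the helper build_vpa is not part of any library. The set of accepted modes also includes "Recreate", which the upstream API defines alongside the three modes described above.

```python
import json

# Update modes accepted by this sketch; "Recreate" exists in the upstream API
# in addition to the three modes discussed in the text.
VALID_UPDATE_MODES = {"Off", "Initial", "Recreate", "Auto"}


def build_vpa(name, deployment, update_mode="Off"):
    """Return a VerticalPodAutoscaler manifest as a plain dict (hypothetical helper)."""
    if update_mode not in VALID_UPDATE_MODES:
        raise ValueError(f"unsupported update mode: {update_mode!r}")
    return {
        "apiVersion": "autoscaling.k8s.io/v1",
        "kind": "VerticalPodAutoscaler",
        "metadata": {"name": name},
        "spec": {
            # The workload whose pods VPA should observe and, depending on the mode, update.
            "targetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment,
            },
            "updatePolicy": {"updateMode": update_mode},
        },
    }


if __name__ == "__main__":
    # Start in "Off" mode to review recommendations before allowing changes.
    print(json.dumps(build_vpa("web-vpa", "web", "Off"), indent=2))
```

Starting in "Off" mode and switching to "Initial" or "Auto" only after reviewing the recommendations is one cautious way to roll VPA out.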

The recreation process in Auto mode involves graceful pod termination and restart with updated resource specifications. This approach ensures that resource changes are applied consistently but may result in temporary service interruption for applications that cannot tolerate pod restarts. Organizations must carefully consider the impact of pod recreation on application availability and user experience.

VPA continuously analyses resource utilization patterns over time to generate recommendations. The system examines historical usage data and applies statistical analysis to determine appropriate resource requests that balance efficiency with reliability. The recommendation engine considers factors such as peak usage, average consumption, and usage variability to establish resource specifications that accommodate normal operational variations.

Resource Recommendation Engine

The VPA recommendation engine applies statistical analysis to historical resource consumption, rather than a trained machine learning model, to generate resource specifications. The upstream recommender aggregates CPU and memory usage samples into time-decayed histograms, so recent behaviour carries more weight than older samples. By default it retains about eight days of history, which limits how much long-term or seasonal behaviour the recommendations can reflect.
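One simple way to turn a usage history into a request is a time-weighted percentile, in which each sample's weight halves after a fixed period. The sketch below illustrates that idea only; it is not the recommender's actual code, and the function name, the 0.9 percentile, and the 24-hour half-life are assumptions chosen for the example.

```python
def decayed_percentile(samples, percentile=0.9, half_life_hours=24.0):
    """Weighted percentile of usage samples, where newer samples weigh more.

    samples: list of (age_hours, usage) pairs; age 0 means "just observed".
    """
    if not samples:
        raise ValueError("no samples")
    # A sample's weight halves every half_life_hours of age.
    weighted = sorted(
        (usage, 0.5 ** (age / half_life_hours)) for age, usage in samples
    )
    total = sum(weight for _, weight in weighted)
    running = 0.0
    # Walk the samples from lowest to highest usage until the requested
    # share of the total weight has been covered.
    for usage, weight in weighted:
        running += weight
        if running >= percentile * total:
            return usage
    return weighted[-1][0]


if __name__ == "__main__":
    # Eight recent samples at 1.0 core and two ten-day-old spikes at 10 cores.
    history = [(0, 1.0)] * 8 + [(240, 10.0)] * 2
    print(decayed_percentile(history))
```

With the decay applied, the two old spikes carry almost no weight, so the result follows recent usage; if all samples had the same age, the spikes would dominate the 90th percentile.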

CPU recommendations focus on ensuring adequate processing capacity while avoiding over-provisioning that wastes cluster resources. The recommendation engine considers factors such as CPU burst patterns, sustained usage levels, and application responsiveness requirements to establish appropriate CPU requests and limits.

Memory recommendations are particularly critical, as memory allocation directly impacts application performance and stability. The recommendation engine analyses memory usage patterns, including baseline consumption, peak usage, and growth trends, to establish memory specifications that prevent out-of-memory conditions while minimizing waste.

Resource Type	Recommendation Approach	Key Considerations	Impact on Performance
CPU Requests	Based on sustained usage patterns and burst requirements	Burst capacity, response time requirements, cost optimization	Affects scheduling and performance under load
CPU Limits	Considers peak usage and performance requirements	Prevents resource starvation, balances fairness	Can impact application responsiveness during peaks
Memory Requests	Analyses baseline consumption and growth trends	Startup requirements, caching behaviour, data processing	Critical for scheduling and avoiding OOM conditions
Memory Limits	Based on peak usage and safety margins	Prevents memory leaks from impacting other applications	Essential for cluster stability and resource isolation

Integration with Application Lifecycle

VPA integration with application lifecycle management requires careful consideration of application characteristics and operational requirements. Stateless applications generally adapt well to VPA, as pod recreation has minimal impact on service availability. Stateful applications may require more sophisticated approaches, such as coordinated pod replacement or integration with application-specific scaling mechanisms.

Applications with persistent state or long-running connections may experience service disruption during VPA-initiated pod recreation. Organizations should evaluate the trade-offs between resource optimization and service availability when implementing VPA for such applications. Alternative approaches include using VPA in recommendation mode only or implementing custom scaling logic that considers application state.

VPA can be combined with HPA for applications that benefit from both horizontal and vertical scaling, provided the two do not act on the same metrics. In practice this means letting HPA scale on custom or external metrics while VPA manages CPU and memory requests; if both react to CPU or memory, they can work against each other. Used this way, the combination adjusts both pod count and per-pod resource allocation to suit varying load patterns and resource requirements.

Monitoring and Observability

Effective VPA implementation requires comprehensive monitoring of resource utilization patterns and scaling events. Organizations should establish monitoring dashboards that track resource consumption trends, VPA recommendations, and the impact of resource changes on application performance. This visibility enables continuous optimization of VPA configuration and helps identify applications that benefit most from vertical scaling.

Resource utilization metrics should be collected at both the pod and application level to provide comprehensive visibility into scaling effectiveness. Key metrics include CPU and memory utilization before and after VPA adjustments, application performance indicators, and resource waste metrics that indicate over-provisioning.
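A resource waste metric can be as simple as the fraction of the request that goes unused. The following sketch computes it per pod and per application; the function names and the sample figures are hypothetical, and real values would come from whatever metrics pipeline is in use.

```python
def overprovision_ratio(request, usage):
    """Fraction of the request that went unused (0.0 means fully used or exceeded)."""
    if request <= 0:
        raise ValueError("request must be positive")
    return max(0.0, (request - usage) / request)


def app_overprovision_ratio(pods):
    """Application-level ratio from (request, usage) pairs, one pair per pod."""
    pods = list(pods)
    total_request = sum(request for request, _ in pods)
    total_usage = sum(usage for _, usage in pods)
    return overprovision_ratio(total_request, total_usage)


if __name__ == "__main__":
    # Hypothetical CPU figures in cores, before and after a VPA adjustment.
    before = [(1.0, 0.25), (1.0, 0.75)]
    after = [(0.5, 0.25), (1.0, 0.75)]
    print("waste before:", app_overprovision_ratio(before))
    print("waste after: ", app_overprovision_ratio(after))
```

Tracking this ratio before and after each adjustment gives a direct view of how much over-provisioning VPA has removed.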

VPA events and recommendations should be logged and analysed to understand scaling patterns and identify opportunities for optimization. Regular review of VPA recommendations helps ensure that resource specifications remain appropriate as application requirements evolve and traffic patterns change.
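Recommendations are published in the status of each VerticalPodAutoscaler object, under status.recommendation.containerRecommendations. The sketch below assumes the object has already been fetched and decoded into a dictionary (for example from JSON output) and extracts the target values for logging or review. The quantity parser is deliberately minimal and handles only the suffixes listed in the code.

```python
# Subset of Kubernetes quantity suffixes handled by this sketch.
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30,
    "k": 10**3, "M": 10**6, "G": 10**9,
}


def parse_quantity(quantity):
    """Convert a quantity string such as '250m' or '256Mi' to a float."""
    quantity = str(quantity)
    if quantity.endswith("m"):  # millicores, e.g. "250m" is 0.25 cores
        return int(quantity[:-1]) / 1000
    for suffix, factor in _SUFFIXES.items():
        if quantity.endswith(suffix):
            return float(quantity[: -len(suffix)]) * factor
    return float(quantity)


def recommended_targets(vpa):
    """Map container name to its target values, in cores for CPU and bytes for memory."""
    recommendations = (
        vpa.get("status", {})
        .get("recommendation", {})
        .get("containerRecommendations", [])
    )
    return {
        rec["containerName"]: {
            resource: parse_quantity(value)
            for resource, value in rec["target"].items()
        }
        for rec in recommendations
    }


if __name__ == "__main__":
    # Hypothetical VPA object with a single container recommendation.
    vpa = {"status": {"recommendation": {"containerRecommendations": [
        {"containerName": "web", "target": {"cpu": "250m", "memory": "256Mi"}},
    ]}}}
    print(recommended_targets(vpa))
```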

VPA Limitations on AKS

While VPA provides significant benefits for resource optimization, several limitations must be considered when implementing it on AKS clusters.

Pod Limit: A maximum of 1,000 pods per cluster can be managed by VPA; plan carefully for large deployments.
Resource Availability: VPA may recommend more resources than the cluster can provide, leaving pods unschedulable; LimitRange and the VPA maximum settings help, but they are static values.
HPA Conflict: Avoid using VPA and HPA together on the same workload if both scale on CPU or memory, as this can cause instability.
Short Data Retention: The VPA Recommender keeps only 8 days of history, which limits accuracy for workloads with long-term or seasonal patterns.
JVM Workloads: VPA recommendations may be inaccurate for Java applications, because JVM memory management obscures the application's true memory usage.
Windows Container Support: VPA works only with Linux containers; Windows containers are not supported.
Custom Implementations: An additional or customized recommender is supported, but running your own VPA implementation alongside the AKS-managed one is not.
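The VPA maximum settings mentioned under Resource Availability correspond to the minAllowed and maxAllowed fields of a container policy in the VPA resourcePolicy. The sketch below shows how such static bounds cap a recommendation; the helper names and the example values are hypothetical, and the clamping function only mimics the effect of the bounds rather than reproducing VPA's own logic.

```python
def clamp(value, min_allowed, max_allowed):
    """Keep a recommended value within static bounds."""
    if min_allowed > max_allowed:
        raise ValueError("min_allowed must not exceed max_allowed")
    return max(min_allowed, min(value, max_allowed))


def container_policy(container, min_allowed, max_allowed):
    """Build one entry for spec.resourcePolicy.containerPolicies (hypothetical helper)."""
    return {
        "containerName": container,
        "minAllowed": dict(min_allowed),
        "maxAllowed": dict(max_allowed),
    }


if __name__ == "__main__":
    policy = container_policy(
        "web",
        min_allowed={"cpu": "100m", "memory": "128Mi"},
        max_allowed={"cpu": "2", "memory": "4Gi"},
    )
    print(policy)
    # An 8-core recommendation is capped at the 2-core maximum.
    print(clamp(8.0, 0.1, 2.0))
```

Because the bounds are static, they need to be revisited whenever node sizes or cluster capacity change.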

Choosing Between HPA and VPA

Selecting the appropriate scaling strategy depends on application characteristics and operational requirements. As a general rule, avoid running HPA and VPA against the same workload when both would act on CPU or memory, and choose the one that best fits the application.
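A pre-deployment check for the overlap described above might look like the following sketch. It assumes the HPA (autoscaling/v2) and VPA objects have already been decoded into dictionaries, and it simplifies matters by ignoring namespaces and by inspecting only HPA metrics of type Resource. The function names are hypothetical.

```python
def hpa_resource_metrics(hpa):
    """Names of the resource metrics (such as 'cpu') that an HPA scales on."""
    return {
        metric["resource"]["name"]
        for metric in hpa.get("spec", {}).get("metrics", [])
        if metric.get("type") == "Resource"
    }


def hpa_vpa_conflict(hpa, vpa):
    """True if both objects target the same workload and both act on CPU or memory."""
    hpa_target = hpa["spec"]["scaleTargetRef"]
    vpa_target = vpa["spec"]["targetRef"]
    same_target = (hpa_target["kind"], hpa_target["name"]) == (
        vpa_target["kind"], vpa_target["name"]
    )
    # A VPA in "Off" mode only recommends, so it cannot fight the HPA.
    mode = vpa["spec"].get("updatePolicy", {}).get("updateMode", "Auto")
    overlaps = bool(hpa_resource_metrics(hpa) & {"cpu", "memory"})
    return same_target and mode != "Off" and overlaps


if __name__ == "__main__":
    target = {"apiVersion": "apps/v1", "kind": "Deployment", "name": "web"}
    hpa = {"spec": {"scaleTargetRef": target, "metrics": [
        {"type": "Resource", "resource": {"name": "cpu"}},
    ]}}
    vpa = {"spec": {"targetRef": target, "updatePolicy": {"updateMode": "Auto"}}}
    print(hpa_vpa_conflict(hpa, vpa))
```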

When to Use HPA

HPA is ideal for stateless applications that can efficiently distribute load across multiple instances, such as web servers, API gateways, and microservices. Applications with variable load patterns benefit from HPA’s ability to scale out during peaks and scale in during quiet periods, providing both performance and cost optimization.

HPA provides better fault tolerance since load is distributed across multiple instances, making it essential for high availability applications that cannot tolerate service interruptions.

When to Use VPA

VPA suits applications that are difficult to scale horizontally, such as stateful workloads and those with significant startup costs because they maintain in-memory state or establish expensive connections. Applications with predictable, steady resource requirements benefit from VPA’s right-sizing capabilities, particularly those initially configured with conservative resource estimates.

VPA excels at optimizing resource utilization by eliminating over-provisioning, but applying changes to running pods requires pod recreation, which can cause brief service interruptions.