Intro to Kubernetes Autoscaling and Best Practices for Successful Implementations

Autoscaling in Kubernetes is all about having the right resources at the right time: the goal is to balance cost and reliability through automation, fine-tuning resource allocation while ensuring application reliability under varying load levels. Autoscaling enables efficient use of cloud resources, allowing for elastic on-demand scalability and ensuring you pay only for the needed resources. This process is managed through three primary scaling dimensions: cluster scaling, horizontal scaling, and vertical scaling.

Autoscaling for each of these dimensions comes with its own set of challenges. Cluster autoscaling must balance cost, efficiency, and response times, often necessitating complex configurations. Horizontal scaling can be rapid and robust but may require careful setup to trigger based on metrics other than current resource consumption, which is needed to scale predictively rather than reactively. Vertical scaling is theoretically applicable to all workloads, but it’s tricky and can lead to service interruptions or reliability issues if attempted without sufficiently predictive intelligence driving each scaling decision. Using horizontal and vertical scaling together is ideal, but harmonization is required to prevent the two mechanisms from thrashing and ensure that pod counts and pod sizes scale effectively in tandem.

This article dives deep into Kubernetes autoscaling, exploring key components and practical challenges that Kubernetes administrators, DevOps engineers, and cloud architects encounter. It offers a comprehensive overview of advanced autoscaling strategies and tools.

Summary of key Kubernetes autoscaling concepts

Concept | Description
Understanding Kubernetes autoscaling dimensions | Autoscaling dynamically adjusts resources: cluster scaling manages overall capacity to meet demand, horizontal scaling changes pod counts based on metrics, and vertical scaling tunes pod resource requests for efficiency.
Getting started with cluster autoscaling | Cluster Autoscaler is the standard option with broad cloud platform support. Karpenter is a newer, more advanced project offering rapid, resource-efficient scaling decisions and workload consolidation.
Getting started with horizontal autoscaling | HPA is built into Kubernetes and easy to use. KEDA extends HPA, enabling responsive, event-driven scaling and the ability to scale workloads to zero for optimal resource usage.
Getting started with vertical autoscaling | VPA is an official Kubernetes project, installed separately from the core platform, but it has low adoption due to considerable caveats and limitations. StormForge applies machine learning algorithms and policy-based automation to the vertical scaling problem to overcome these limitations and make automated vertical scaling broadly practical.
Best practices for successful Kubernetes autoscaling | Prioritize monitoring and visibility; adopt an iterative approach; harmonize cluster, horizontal, and vertical autoscaling; simplify developer experience.

Understanding Kubernetes autoscaling dimensions

To address the diverse scaling needs within a Kubernetes environment, autoscaling is typically approached in three dimensions: cluster scaling, horizontal scaling, and vertical scaling. Each dimension has unique strengths and weaknesses for resource optimization. The best resource management outcomes come from adopting autoscaling mechanisms on all three dimensions and blending them intelligently.

Kubernetes autoscaling dimensions

Cluster scaling

Cluster scaling addresses the need to adjust the overall capacity of the Kubernetes cluster based on the sum total of workload resource requirements across the whole cluster. The scale of a cluster is determined by the number of nodes and the allocatable CPU, memory, and other resources those nodes contribute.

A cluster scaling strategy, whether manual or automated, static or elastic, is crucial for managing the cost and reliability of the cluster as a whole. If the cluster doesn’t have enough nodes to schedule all pods, some workloads simply won’t run. If the cluster’s nodes are predominantly idle or underutilized, the cluster costs money with no real return.

All other scaling mechanisms rely on appropriate cluster scaling. The cluster must have enough node capacity to schedule all requested pods at their requested sizes. A cluster autoscaling solution automatically adds or removes nodes from the cluster to ensure sufficient node resources to run all pods while avoiding costly overprovisioning.
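To make the capacity question concrete, here is a minimal Python sketch that estimates how many identical nodes a set of pod requests needs, using first-fit-decreasing packing. It is an illustration only: the `Pod` class, the function name, and the sample sizes are hypothetical, and real cluster autoscalers simulate the scheduler with many more constraints (affinity, taints, daemonsets) than this.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    cpu_m: int   # CPU request in millicores
    mem_mi: int  # memory request in MiB


def nodes_needed(pods, node_cpu_m, node_mem_mi):
    """Estimate node count by packing pod requests onto identical nodes."""
    free = []  # remaining [cpu, memory] on each node opened so far
    # Place the largest pods first (first-fit decreasing).
    for pod in sorted(pods, key=lambda p: (p.cpu_m, p.mem_mi), reverse=True):
        if pod.cpu_m > node_cpu_m or pod.mem_mi > node_mem_mi:
            raise ValueError("pod does not fit on an empty node")
        for slot in free:
            if slot[0] >= pod.cpu_m and slot[1] >= pod.mem_mi:
                slot[0] -= pod.cpu_m
                slot[1] -= pod.mem_mi
                break
        else:
            # No existing node has room: this is a scale-up by one node.
            free.append([node_cpu_m - pod.cpu_m, node_mem_mi - pod.mem_mi])
    return len(free)


if __name__ == "__main__":
    workload = [Pod(1500, 2048)] * 3 + [Pod(500, 512)] * 2
    print(nodes_needed(workload, node_cpu_m=4000, node_mem_mi=8192))
```

The gap between this estimate and the number of nodes actually running is a rough indicator of either missing capacity or overprovisioning.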

Horizontal scaling

Horizontal scaling, primarily managed through the Horizontal Pod Autoscaler (HPA), dynamically adjusts the number of pod replicas in a scalable workload such as a Deployment or StatefulSet. This adjustment is based on real-time metrics, such as CPU utilization or custom metrics that reflect or predict the application’s performance and demands.
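The core of the HPA calculation is a simple ratio, documented by Kubernetes as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), with no change made when the ratio is within a tolerance (0.1 by default). The Python sketch below reproduces that arithmetic; the function name and the replica bounds are illustrative, and it leaves out stabilization windows, pod readiness handling, and multi-metric logic.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the HPA replica calculation for a single metric."""
    ratio = current_metric / target_metric
    # Within tolerance of the target, the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured minReplicas/maxReplicas bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 4 replicas averaging 90% CPU against a 60% target.
    print(desired_replicas(4, current_metric=90, target_metric=60))
```

Because the result is proportional to the metric ratio, a target set too close to typical utilization makes the replica count sensitive to small fluctuations.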

For workloads that support it, horizontal autoscaling is ideally suited to enable rapid elastic capacity changes, quickly scaling up or scaling down pod replicas in response to fluctuating demand on a component or application. This rapid response characteristic often makes horizontal autoscaling essential for the purpose of ensuring application reliability in the face of unpredictable traffic or events.

Horizontal scaling adjusts a workload’s resource allocation in increments of whole pods, so it is not well suited to workloads that do not need or benefit from running more than one instance of a pod.

Vertical scaling

Vertical scaling, often associated with the Kubernetes project’s Vertical Pod Autoscaler (VPA), an add-on installed separately from the core platform, focuses on adjusting individual pods’ CPU and memory allocations to best match observed or predicted workload resource consumption.

Vertical scaling as a practice can apply to all workloads, regardless of whether they have one replica or many, making it more broadly applicable than HPA. However, vertical scale-up or scale-down activity usually requires restarting pods. This makes vertical scaling less appropriate than HPA for rapid response to unexpected spikes in demand.

Vertical scaling is best suited for recurring adjustments to resource allocations to ensure each pod’s base resource allocation is set at an optimal balance between cost and reliability. A vertical scaling solution performs these tailored adjustments automatically. Adjustments can run weekly, daily, or even dynamically in response to reliability events.
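One simple way to picture a vertical sizing decision is to take a high percentile of observed usage and add headroom. The Python sketch below does exactly that. It is a deliberately naive illustration, not the algorithm used by VPA or any particular product; the function name, the percentile, and the headroom value are assumptions chosen for the example.

```python
import math


def recommend_request(usage_samples, percentile=0.95, headroom=0.15):
    """Suggest a resource request from observed usage samples.

    Uses the nearest-rank percentile of the samples plus fractional headroom.
    """
    if not usage_samples:
        raise ValueError("need at least one usage sample")
    ordered = sorted(usage_samples)
    rank = max(0, math.ceil(percentile * len(ordered)) - 1)
    return math.ceil(ordered[rank] * (1 + headroom))


if __name__ == "__main__":
    cpu_millicores = [100, 120, 110, 400, 130, 125, 115, 105, 135, 140]
    # A lower percentile ignores the single 400m spike; a higher one covers it.
    print(recommend_request(cpu_millicores, percentile=0.90))
    print(recommend_request(cpu_millicores, percentile=0.95))
```

The choice of percentile is the cost/reliability trade-off in miniature: a lower percentile saves resources but leaves spikes uncovered, which for memory can mean OOM kills.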

Assessment of workload requirements

Before diving into autoscaling configurations, assess your applications’ dynamic needs. This assessment will guide you in choosing the right mix of cluster, horizontal, and vertical autoscaling to meet your application’s specific requirements. Understanding workload patterns, peak usage times, and resource consumption trends allows you to tailor an autoscaling strategy that ensures optimal performance and cost efficiency.

Effectively autoscaling Kubernetes workloads requires a nuanced understanding of application requirements, the intelligent integration of scaling metrics, and the seamless operation of different autoscaling mechanisms. In the next section, we will explore key strategies with the standard tooling for autoscaling on each dimension, the associated challenges, and advanced solutions that improve upon or replace the standard tooling.

Getting started with cluster autoscaling

Kubernetes cluster autoscaling tools

Cluster autoscaler

Cluster Autoscaler is the standard tool for managing cluster nodes dynamically, and has support for autoscaling Kubernetes across more than twenty different cloud providers.

Cluster Autoscaler is a reliable workhorse with predictable performance. When it comes to ensuring that enough nodes are deployed to schedule all of a cluster’s pods, there is little to complain about. 

Dealing with multiple node configurations adds complexity, however. Unless there is a very good reason for workloads to request a specific node type (which they can do through a nodeAffinity rule in the pod spec), it’s simplest to define a single node type that Cluster Autoscaler will deploy when it needs to provision additional capacity.

Additionally, while scaling up is very responsive, Cluster Autoscaler deprovisions nodes only one at a time with a predefined delay before the next step-down. As a result of this behavior, a rapid scale-up of many nodes may take some time to slowly return to previous levels once the nodes are no longer needed.


Pros:
  • Broad applicability with support for more than twenty different cloud providers
  • Fast and reliable scale-up behavior


Cons:
  • Node pool configuration is very static
  • Limited control over node type selection for scale-up and scale-down events
  • Conservative scale-down behavior


Karpenter

Karpenter is an advanced project that can replace Cluster Autoscaler. Initially released by AWS in 2021 and donated to the CNCF in 2023, its stated goals are to take full advantage of cloud capabilities while remaining fast and simple to use.

Karpenter’s approach to cluster autoscaling is designed to improve how clusters respond to dynamic workloads by scaling faster and requiring less manual configuration for optimal outcomes. Unlike the traditional cluster autoscaler that requires users to specifically configure which node instance types to use and reacts slowly to reductions in workload demand, Karpenter automates the selection of specific instance types and responds quickly to opportunities to consolidate nodes, resulting in autoscaling that’s faster, simpler, and more efficient.

Here’s how Karpenter improves Kubernetes autoscaling:

  • Node management: Cluster Autoscaler relies on preconfigured node groups, such as AWS Auto Scaling Groups (ASGs), and may react more slowly to changes in workload demand. Karpenter instead provisions and removes instances directly. This lets it make more targeted scaling decisions, such as selecting the most appropriate of the hundreds of available instance types, without modifying ASGs, and results in faster responses to workload changes.
  • Resource consolidation: Karpenter looks for opportunities to reduce cluster cost by changing the mix of node types in the cluster. By analyzing node allocation efficiency, it picks nodes to eliminate, and sometimes different nodes to replace them with, to reduce waste and improve overall cluster efficiency. The goal is for every node to be used efficiently, matching resource supply closely with demand and reducing the overall resource footprint.
  • Spot instance management: Spot instances are temporarily available VMs offered at a lower price than standard instances due to excess capacity. Karpenter factors their availability and cost benefits into its scaling decisions, integrating them into the cluster while managing the risks of spot instance volatility. This allows the cluster to add capacity at reduced cost while limiting the impact on performance and reliability.
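The instance-selection idea in the list above can be sketched in a few lines of Python: given the aggregate requests of pending pods, pick the cheapest instance type that fits them. Everything here is hypothetical, including the catalog names, sizes, and prices, and the logic is far simpler than Karpenter's actual scheduling and consolidation behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str            # hypothetical label, not a real cloud instance name
    cpu_m: int           # allocatable CPU in millicores
    mem_mi: int          # allocatable memory in MiB
    hourly_price: float  # made-up price for illustration


def cheapest_fit(pending_cpu_m, pending_mem_mi, catalog):
    """Return the cheapest instance type that fits the pending requests."""
    candidates = [
        t for t in catalog
        if t.cpu_m >= pending_cpu_m and t.mem_mi >= pending_mem_mi
    ]
    if not candidates:
        return None  # nothing fits; a real autoscaler would split across nodes
    return min(candidates, key=lambda t: t.hourly_price)


CATALOG = [
    InstanceType("small", 2000, 4096, 0.05),
    InstanceType("medium", 4000, 8192, 0.10),
    InstanceType("large", 8000, 16384, 0.20),
]

if __name__ == "__main__":
    choice = cheapest_fit(3000, 6000, CATALOG)
    print(choice.name if choice else "no single instance type fits")
```

Choosing from a catalog at provisioning time, rather than from a fixed node group, is what lets the node mix follow the workload instead of the other way around.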

At the time of writing, Karpenter is supported on AWS and Azure.

Pros:
  • Automates instance selection, enhancing speed and simplifying configuration
  • Proactively manages node scaling for quick adjustments to workload demands
  • Uses spot instances intelligently to cut costs without impacting reliability

Cons:
  • Requires an understanding of cloud services to take full advantage
  • Might involve complexities in managing spot instance volatility
  • A transition from traditional autoscalers could require adjusting existing workflows

Getting started with horizontal autoscaling

Kubernetes horizontal autoscaling tools

The HorizontalPodAutoscaler (HPA)

The HPA is a core Kubernetes resource type built into the Kubernetes platform. HPA is widely adopted for its simplicity and effectiveness. Horizontal scaling isn’t appropriate for all workloads since not all workloads are designed for parallelism, but when HPA can be used, it’s one of the best ways available to ensure automated elasticity on a per-workload basis.

HPA is simplest when using Kubernetes’ built-in resource metrics pipeline, setting targets for percent CPU or memory utilization. CPU and memory consumption are usually trailing indicators of load rather than leading ones, so HPA also supports custom or external metrics, such as work queue depth or upstream gateway connection count. Scaling on these metrics is possible but requires a deeper dive into Kubernetes’ custom and external metrics APIs, the adapters that serve them, and the workload’s performance characteristics. The struggle some users face in plumbing external metrics points to the need for solutions that simplify this process.

Notably, users often face an either/or choice between HPA and VPA when the two are considered in isolation, since both often use the same metrics to make scaling decisions. Any adjustment made by a VPA may be reversed or undone by an unharmonized HPA when the two systems each try to steer the same observation, such as CPU or memory usage, toward a different steady state.


Pros:
  • Rapid automatic capacity adjustments
  • Easy to scale up or scale down pod count based on built-in metrics such as pod CPU or memory usage
  • Possible to scale based on more sophisticated or custom metrics

Cons:
  • Not all workloads are designed to be run with parallelism, restricting which workloads are candidates for horizontal scaling
  • Setting up custom scaling metrics can be intricate and may require specific expertise
  • Pods still have vertical sizes, but HPA does not help decide or manage vertical sizing
  • The standard HPA cannot be used in conjunction with the standard VPA when both are scaling on the same metric

Kubernetes Event-Driven Autoscaling (KEDA)

KEDA is an advanced project that builds on and enhances HPA, effectively replacing the HPA interface. KEDA was originally created to address a critical gap in HPA: easily scaling on arbitrary indicators or metrics. It became an official CNCF project in 2020.

KEDA introduces a shift in Kubernetes horizontal autoscaling by enabling event-driven scaling. This approach allows applications to scale based on the occurrence of specific events, offering a more granular and responsive scaling mechanism.

A differentiator compared to HPA is KEDA’s wide selection of built-in “scalers,” or plugins for interfacing with external scaling indicators or event sources, which drastically simplifies the work required to autoscale on common leading indicators of load. Easily scaling on leading indicators or events helps developers ensure that applications can efficiently handle bursts of activity, scaling up rapidly to meet demand and scaling down to conserve resources when the activity decreases.

A distinctive feature of KEDA is its ability to scale workloads down to zero replicas. This capability is particularly beneficial for workloads that experience sporadic activity, ensuring that resources are not consumed when the workload is idle. Scaling to zero can lead to substantial cost savings, especially for applications with variable traffic patterns.
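The scale-to-zero behavior can be pictured with a small Python sketch driven by a queue length, a common leading indicator. This is a simplified model rather than KEDA's implementation: the function and parameter names are hypothetical, the 300-second cooldown is an assumption for the example, and keeping one replica during the cooldown stands in for the normal HPA minimum.

```python
import math


def event_driven_replicas(queue_length, target_per_replica, idle_seconds,
                          cooldown_seconds=300, max_replicas=20):
    """Replica count for a queue consumer that is allowed to scale to zero."""
    if queue_length == 0:
        # Stay at one replica until the workload has been idle long enough,
        # then release all resources.
        return 0 if idle_seconds >= cooldown_seconds else 1
    # With work waiting, size the consumer pool to the backlog.
    wanted = math.ceil(queue_length / target_per_replica)
    return min(max_replicas, max(1, wanted))


if __name__ == "__main__":
    print(event_driven_replicas(0, 5, idle_seconds=600))   # idle past cooldown
    print(event_driven_replicas(42, 5, idle_seconds=0))    # backlog of 42 items
```

The first message to arrive after a scale-to-zero has to wait for a pod to start, which is the cold-start cost to weigh against the savings.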


Pros:
  • Enables scaling based on specific events for granular control
  • Offers built-in scalers for easy integration with external events and metrics
  • Allows scaling of workloads down to zero to save resources during idle times

Cons:
  • Implementing event-driven scaling requires understanding the specific events that impact application performance
  • Relies on external events, which might introduce complexity in monitoring and managing scaling conditions
  • Scaling to zero could lead to cold starts, potentially affecting performance when scaling back up

Getting started with vertical autoscaling

Kubernetes vertical autoscaling tools

The VerticalPodAutoscaler (VPA)

The VPA is the most commonly known tooling for vertical autoscaling and is maintained as part of the autoscaler repository alongside the cluster autoscaler.

VPA is less commonly used than either cluster autoscaler or HPA by a significant margin. The 2023 Datadog Container Report found that over half of the surveyed organizations used HPA, while less than 1% used VPA. This may be due to the high configuration effort required to set up VPA at the scale necessary to achieve a significant ROI, or to few users trusting that the standard VPA’s scaling recommendations won’t negatively impact their workload reliability. The fact that the standard HPA and VPA cannot easily be deployed together for the same workload almost certainly plays a role as well. If HPA is seen as improving reliability and VPA is seen as improving cost efficiency, reliability will almost always win out.

The simultaneous use of VPA and HPA poses a challenge primarily due to their potentially conflicting actions: HPA scales pod numbers based on utilization metrics, while VPA adjusts pod resource requests, and those requests are the denominator in the utilization calculation. This interplay can break the autoscaling strategy unless carefully managed.
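A toy simulation makes the feedback loop visible. In the Python sketch below, a hypothetical HPA targets 80% CPU utilization (usage divided by request), while a deliberately naive vertical scaler sets each pod's request equal to its observed usage. Both behaviors are assumptions for illustration: the real VPA adds safety margins and the real HPA applies tolerances and stabilization, so actual clusters will behave differently in detail.

```python
import math


def simulate_conflict(total_load_m, replicas, request_m, target_pct, rounds):
    """Alternate an HPA step and a naive vertical-sizing step.

    Returns a list of (replicas, request_m) pairs, one per round.
    """
    history = []
    for _ in range(rounds):
        usage_per_pod = total_load_m / replicas
        utilization_pct = 100.0 * usage_per_pod / request_m
        # HPA step: scale replicas by the ratio of utilization to target.
        replicas = math.ceil(replicas * utilization_pct / target_pct)
        # Naive vertical step: shrink the request to match per-pod usage,
        # which pushes utilization back to 100% on the next round.
        request_m = total_load_m / replicas
        history.append((replicas, round(request_m)))
    return history


if __name__ == "__main__":
    for step in simulate_conflict(1600, replicas=4, request_m=500,
                                  target_pct=80, rounds=4):
        print(step)
```

With a constant total load, the replica count keeps climbing while each pod keeps shrinking: neither controller is wrong on its own terms, but together they never settle.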

Configuring horizontal and vertical scaling to work in concert has been a sought-after project goal for years, but it is not appropriate today to use the standard HPA and VPA implementations together on the same workload.

Pros:
  • Can, in theory, be applied to any and all workloads
  • Automatic pod-level resource allocation can free developers from needing to spend time calculating and updating these numbers themselves

Cons:
  • Use requires a VPA resource per workload to autoscale
  • It cannot be easily combined with HPA on the same workload
  • Any change to the vertical scale (up or down) requires pods to restart and reschedule
  • The quality and real-world reliability of VPA’s vertical sizing recommendations are low

StormForge Optimize Live

StormForge Optimize Live is an advanced solution that replaces VPA and addresses the limitations that have prevented VPA’s widespread adoption.

AI and machine learning are being applied to nearly every tech field, including Kubernetes. In the context of Kubernetes autoscaling, these techniques can anticipate workload demands by leveraging historical data and real-time metrics, moving autoscaling decisions toward more predictive and dynamically optimized methods.

In the context of vertical scaling specifically, tools like StormForge use these technologies to address quality and harmonization issues that have historically held back the widespread adoption of tools like the standard VPA.

StormForge’s Optimize Live platform integrates with Kubernetes to offer continuous vertical right-sizing of applications, so that they run with an efficient allocation of resources.

A screenshot of the StormForge Optimize Live UI overview page

Here’s how StormForge improves Kubernetes autoscaling:

  • Continuous right-sizing: StormForge’s ML-driven approach analyzes historical and real-time data to produce recommendations for resource requests, limits, and HPA target utilizations. This continuous optimization keeps applications configured for good performance and resource usage, reducing waste and minimizing costs.
  • Harmonizing VPA and HPA: As mentioned earlier, a key challenge in Kubernetes autoscaling is ensuring that vertical and horizontal scaling work in concert without conflict. StormForge addresses this by deploying an agent within the cluster that automatically discovers workloads and uses ML to analyze their metrics. The agent produces right-sizing recommendations that consider and, if necessary, adjust both horizontal and vertical scaling settings, layering in vertical scaling that harmonizes with HPA.
  • Complementary to KEDA and Karpenter: While KEDA focuses on event-driven scaling and Karpenter emphasizes improving cluster autoscaling efficiency, StormForge complements these solutions by improving resource allocation within the pods and nodes they scale. As KEDA and Karpenter adjust the number of pods and nodes, StormForge ensures that each workload runs with a configuration suited to its current needs.
  • Human-centric approach: Beyond its technological innovations, StormForge emphasizes a human-centric approach to autoscaling. By automating the tedious and complex resource optimization process, StormForge enables DevOps teams and Kubernetes administrators to focus on strategic tasks, augmenting their capabilities without replacing their roles. The platform’s minimal cluster footprint and limited access permissions align with operational best practices and security considerations.

As Kubernetes environments grow increasingly dynamic and complex, integrating AI and ML into autoscaling solutions like StormForge represents an advance in vertical autoscaling capability.

            Solution Rightsizing recommendations Automation Fully compatible with HPA  Powered by machine learning  Historical metrics analysis Trend forecasting
            VPA ✖️ ✖️ ✖️ ✖️

Best practices for successful Kubernetes autoscaling

Incorporating best practices for successful autoscaling in Kubernetes environments is crucial for maintaining optimal application performance and resource efficiency. Here, we outline key considerations and practices, including insights on integrating various tools to guide you in refining your autoscaling strategy.

Prioritize monitoring and visibility

Autoscaling is about optimization, and success in optimization requires understanding. You need to be able to measure your performance as you adopt or tune your autoscaling strategy.

Use tools like Prometheus for monitoring and Grafana for visualizing Kubernetes metrics. Establish baseline performance metrics and identify key performance indicators (KPIs) that accurately reflect your cluster’s health and efficiency levels.

For cluster autoscaling, measure and track resource allocation efficiency at the node level and in aggregate. For horizontal autoscaling, pay attention to workloads idling at minReplicas or frequently topping out at maxReplicas. For vertical scaling, measure and track resource usage against resource requests at the pod level, as well as activity like OOMKills and throttling.
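Two of these checks are easy to express once the data is exported from a monitoring system. The Python sketch below flags an HPA that spends most of its time pinned at a bound and computes usage against request for a pod. The function names, the 50% threshold, and the sample data are illustrative assumptions; in practice the inputs would come from your own metrics queries.

```python
def hpa_saturation_flags(replica_samples, min_replicas, max_replicas,
                         threshold=0.5):
    """Flag an HPA whose replica count is mostly pinned at a bound."""
    if not replica_samples:
        return []
    n = len(replica_samples)
    at_min = sum(1 for r in replica_samples if r <= min_replicas) / n
    at_max = sum(1 for r in replica_samples if r >= max_replicas) / n
    flags = []
    if at_min >= threshold:
        flags.append("idling at minReplicas")
    if at_max >= threshold:
        flags.append("topping out at maxReplicas")
    return flags


def request_utilization(usage, request):
    """Ratio of observed usage to the resource request (1.0 = fully used)."""
    if request <= 0:
        raise ValueError("request must be positive")
    return usage / request


if __name__ == "__main__":
    print(hpa_saturation_flags([2, 2, 2, 3, 2, 2], min_replicas=2, max_replicas=10))
    print(request_utilization(usage=250, request=1000))
```

A workload flagged at minReplicas may have its minimum set too high, while one flagged at maxReplicas may be under-provisioned or have pods that are too small.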

Tools like Kubecost can provide some of these insights in a Kubernetes-specific packaged offering and attach cost estimates to the raw resource insights.

Adopt an iterative approach

Each autoscaling mechanism provides value independently, and the Pareto principle commonly applies: 80% of the value is often delivered through 20% of the possible effort. Identify which elements of autoscaling are easy to implement while delivering a broad impact, and adopt those first.

Cluster autoscaling is a good mechanism to adopt early due to the relative simplicity of the problem space. While optimization of either cluster autoscaler or Karpenter can go deep, getting even a fairly basic autoscaler config in place immediately lets the cluster scale down nodes when they are not needed, directly reducing cluster costs.

Because horizontal autoscaling is implemented per workload, the cost/performance impact of using HPA depends on the size and load fluctuation of each workload. Sometimes, applying an HPA to even one or a handful of workloads can make a big impact.

Vertical scaling with tools like StormForge can have a cluster-wide impact for a very small time investment. Even on clusters with some workloads that you don’t want to vertically autoscale, the ability to target or exclude swaths of applications by namespace or individual deployment lets you apply vertical autoscaling to a majority of workloads without needing to invest time in achieving autoscaling coverage for everything.

Harmonize cluster, horizontal, and vertical autoscaling

Simultaneous use of cluster, vertical, and horizontal autoscaling is required to achieve total elasticity and optimal cost-effectiveness while maintaining high reliability for applications on Kubernetes. The challenge lies in making all of these autoscaling dimensions operate in concert, so that they complement each other in optimizing resource allocation and application performance rather than conflicting.

Adopting advanced projects that simplify or enhance each autoscaling mechanism, and that naturally work in concert with each other, is key to getting sustained value from all three. Identify and select appropriate tools early in the adoption process to accelerate time to value.

Simplify developer experience

At the end of the day, Kubernetes is a platform supporting workloads owned by many different teams or developers, and spending time mastering Kubernetes is not the end goal or value-add in and of itself. Autoscaling should be made either as accessible or as invisible to developers as possible, reducing the cognitive load required for them to take advantage of it when building and deploying their applications.

Toward that end, create a knowledge base with resources that explain your environment’s autoscaling behavior and choices. Provide case studies and tool documentation. Organize regular short training sessions and workshops, including hands-on labs, on using or simply understanding the Kubernetes autoscaling tools you’ve selected.

Encourage a culture of experimentation and learning where team members can safely explore current or new autoscaling mechanisms in sandbox environments.


Conclusion

In this article, we explored the three key dimensions of Kubernetes autoscaling: cluster, horizontal, and vertical. We discussed the challenges associated with autoscaling on each dimension, such as the need for efficient cluster autoscaling, the complexity of implementing custom metrics with HPA, and the fairly significant limitations of traditional VPA. To address some of these challenges, we highlighted advanced projects like Karpenter, KEDA, and StormForge, each enhancing Kubernetes autoscaling’s flexibility and efficiency for one of the three dimensions.

Finally, we outlined best practices for successful Kubernetes autoscaling: prioritizing monitoring and visibility, adopting an iterative approach, harmonizing the three autoscaling dimensions, and simplifying the developer experience.

As Kubernetes environments evolve, staying informed about the latest autoscaling strategies and tools is essential. By leveraging advanced solutions like Karpenter, KEDA, and StormForge, and by following these best practices, you can optimize your Kubernetes deployments for performance, cost-efficiency, and resilience in the face of dynamic workloads.
