A look at how Kubernetes autoscaling works, the different autoscaling methods Kubernetes provides and potential errors that can arise.

Autoscaling refers to dynamically assigning resources to match an application’s changing demands. It’s also one of the most impactful capabilities that Kubernetes clusters have.

Kubernetes has built-in service discovery and self-healing features, and it can schedule and distribute your application across thousands of nodes. With autoscaling, your cluster can also increase or decrease the number of pods and nodes available to match changes in resource requirements.

When implementing autoscaling in Kubernetes, we must prioritize efficiency. Failing to implement it properly can have significant consequences, from increased costs to an outage. Therefore, optimization is key.

This article will review how Kubernetes autoscaling works, discuss the different autoscaling methods Kubernetes provides and highlight potential errors that can arise during autoscaling. Then, most importantly, we’ll explore how to resolve them.

Using Kubernetes Autoscaling Effectively

To autoscale effectively, we need to allocate resources properly. To do so, we need to understand our resource availability and application requirements and assign resource limits accordingly. These limits will generally depend on workload fluctuation, which is frequently unpredictable and prone to spikes.

Optimal resource assignments are paramount. Too few resources result in outages, latency and poor user experience, while overallocation is costly and inefficient, leading to wasted time and resources.

This is where autoscaling comes into play. Autoscaling enables us to scale up components in the event of a surge and scale them back down when things return to normal. Horizontal autoscaling in Kubernetes works by specifying a target CPU utilization percentage along with a minimum and maximum replica count. The target is compared to the actual average CPU consumption. If usage is above the target, the cluster increases the number of replicas, up to the maximum. If usage is below the target, the number of replicas decreases, down to the minimum. Otherwise, it maintains the status quo.
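
To make the comparison concrete, here is a minimal Python sketch of the replica calculation described in the Kubernetes documentation: the desired count is the current count scaled by the ratio of observed to target utilization, rounded up, then clamped to the configured bounds. The function name and the sample numbers are illustrative assumptions, and the 10% tolerance reflects the documented default rather than anything stated in this article.

```python
import math

def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas, max_replicas, tolerance=0.1):
    """Return the replica count an HPA-style controller would aim for."""
    ratio = current_utilization / target_utilization
    # If usage is close enough to the target, keep the current count.
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Never go below the minimum or above the maximum replica count.
    return max(min_replicas, min(max_replicas, desired))

print(desired_replicas(4, 90, 60, 2, 10))  # usage above target: scale up
print(desired_replicas(4, 20, 60, 2, 10))  # usage below target: scale down
print(desired_replicas(4, 63, 60, 2, 10))  # within tolerance: no change
```

With 4 replicas at 90% average CPU and a 60% target, the sketch asks for 6 replicas. The real controller also handles missing metrics, pods that are not yet ready and stabilization windows, which this sketch leaves out.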

Methods of Autoscaling in Kubernetes

Kubernetes provides three main built-in methods of autoscaling. In the following sections, we’ll explain each method and how it works, and highlight a use case where using that method of autoscaling is beneficial.

Using Horizontal Pod Autoscaling

The Horizontal Pod Autoscaler (HPA) allows us to increase or decrease the number of pod replicas of a given resource based on CPU use or other user-defined metrics.

HPA works by periodically querying the metrics server (or another metrics source) for resource usage and adjusting the number of pod replicas to match the observed metrics. With a CPU target, its goal is to keep the average CPU utilization across all pods at the specified value.

For example, if a deployment experiences high traffic, the Kubernetes cluster can scale up the number of replicas to handle the load.
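
As a rough illustration of what such a configuration might look like, the Python sketch below builds an HPA manifest using the autoscaling/v2 API and prints it as JSON. The deployment name "web", the HPA name "web-hpa" and the numbers are hypothetical placeholders, not values from a real cluster.

```python
import json

def hpa_manifest(name, deployment, min_replicas, max_replicas, cpu_target_percent):
    """Build a HorizontalPodAutoscaler manifest targeting average CPU utilization."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": name},
        "spec": {
            # The workload whose replica count the HPA adjusts.
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment,
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [{
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    "target": {
                        "type": "Utilization",
                        "averageUtilization": cpu_target_percent,
                    },
                },
            }],
        },
    }

# Hypothetical names and values, for illustration only.
manifest = hpa_manifest("web-hpa", "web", 2, 10, 60)
print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed output could be saved to a file and applied with kubectl apply -f. Note that CPU utilization is measured relative to each container's CPU request, so the target deployment needs requests defined.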

Using Vertical Pod Autoscaling

Unlike HPA, the Vertical Pod Autoscaler (VPA) adjusts the resources allocated to existing pods based on use, rather than changing the number of pods. The VPA recommends target values for the pods' resource requests by tracking the application’s resource use.

This scaling method aims to match resource allotment to actual usage. For example, if a replica uses too much CPU, the Kubernetes cluster can scale up its CPU limit to accommodate it.

VPA is useful for preventing the Kubernetes scheduler from overcommitting resources on the false assumption that pods will stay within their initial requests rather than growing toward their limits. The VPA lets us set resource values using live data rather than relying on guesswork.
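
The idea of sizing from live data can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the VPA's actual algorithm, which is more involved: it takes a high percentile of observed CPU usage and adds a safety margin. The function name, the 90th percentile, the 15% margin and the sample values are all assumptions made for the example.

```python
import math

def recommend_cpu_request(usage_millicores, percentile=90, margin_percent=15):
    """Suggest a CPU request (millicores) from observed usage samples."""
    if not usage_millicores:
        raise ValueError("need at least one usage sample")
    ordered = sorted(usage_millicores)
    # Nearest-rank percentile: the smallest sample covering `percentile`% of observations.
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    baseline = ordered[rank - 1]
    # Add headroom so brief spikes above the baseline are still covered.
    return math.ceil(baseline * (100 + margin_percent) / 100)

# Hypothetical usage samples in millicores.
samples = [120, 135, 150, 160, 180, 210, 240, 260, 300, 520]
print(recommend_cpu_request(samples))
```

For these samples, the 90th percentile is 300m, so the sketch suggests 345m. Using a percentile instead of the maximum keeps a single outlier (520m here) from inflating the request.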

Using the Cluster Autoscaler

The cluster autoscaler monitors and adjusts the number of nodes and the overall size of the cluster to meet demand. It works by watching for pods that cannot be scheduled due to insufficient resources, as well as for nodes that are underused, and determining the appropriate response, which can include:

  • Adding another node to unblock the pod.
  • Increasing the size of the node pool.
  • Reallocating the currently deployed pods onto a smaller number of nodes.
  • Evicting pods from an underused node so that it can be removed entirely.
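
The scale-up side of this decision can be approximated with a small bin-packing sketch in Python. It is a simplified illustration and not the cluster autoscaler's real logic: it considers only CPU requests and a single node type, and it uses a first-fit-decreasing heuristic. The function name and the numbers are assumptions.

```python
def nodes_needed(pending_cpu_requests, node_cpu_capacity):
    """Estimate how many new nodes would fit the pending pods (first-fit decreasing)."""
    free = []  # remaining CPU (millicores) on each new node
    for request in sorted(pending_cpu_requests, reverse=True):
        if request > node_cpu_capacity:
            raise ValueError(f"a pod requesting {request}m cannot fit on this node type")
        for i, remaining in enumerate(free):
            if request <= remaining:
                free[i] = remaining - request  # place the pod on an existing new node
                break
        else:
            # No new node has room, so add another one.
            free.append(node_cpu_capacity - request)
    return len(free)

# Hypothetical CPU requests (millicores) of unschedulable pods.
pending = [1500, 500, 1000, 2000, 750]
print(nodes_needed(pending, 4000))
```

Here five pending pods requesting 5,750m in total fit onto two 4,000m nodes. A real autoscaler would also account for memory, taints, affinity rules and the limits of each node pool.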

It’s important to note that because this method involves adding and deleting infrastructure, we must ensure that any credentials are adequately secured.

Configuring Kubernetes Autoscaling

While autoscaling is an excellent solution for better efficiency during workload fluctuations, it’s not a hands-off process. Regardless of which method we choose, it’s crucial that we properly configure and tune Kubernetes resource allocation to prevent errors and wasted resources.

The best way to manage our resource use during autoscaling is to establish requests and limits for CPU and memory. Requests are the minimum amount of a resource a container needs, which the scheduler reserves for it, while limits are the maximum amount of a resource a container is allowed to consume.

However, we must carefully set and adjust our limits and requests in response to actual use. Remember that CPU and memory on a single machine are finite, and every active workload in a cluster needs adequate resources. Otherwise, workloads can run out of memory, resulting in failures and increased costs.
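
A quick way to reason about this is to add up requests and limits for the pods on a node and compare them with what the node can offer. The Python sketch below does this for memory only. The pod names, field names and sizes are hypothetical.

```python
def node_summary(pods, allocatable_mib):
    """Compare summed memory requests and limits with a node's allocatable memory."""
    requested = sum(p["request_mib"] for p in pods)
    limited = sum(p["limit_mib"] for p in pods)
    return {
        # Requests must fit, or the scheduler will not place all the pods here.
        "requests_fit": requested <= allocatable_mib,
        "requested_mib": requested,
        # A value above 1.0 means the pods could together demand more than the node has.
        "limit_overcommit": limited / allocatable_mib,
    }

# Hypothetical pods on a node with 4096 MiB allocatable.
pods = [
    {"name": "api", "request_mib": 512, "limit_mib": 1024},
    {"name": "worker", "request_mib": 1024, "limit_mib": 2048},
    {"name": "cache", "request_mib": 256, "limit_mib": 2048},
]
print(node_summary(pods, allocatable_mib=4096))
```

In this example the requests (1,792 MiB) fit comfortably, but the limits add up to 1.25 times the node's memory. That is allowed, yet it means some pods may be killed if they all approach their limits at once.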

Making wise use of requests and limits also gives the Kubernetes scheduler better insight into our service needs. The scheduler places pods on nodes based on their resource requests, and Kubernetes assigns each pod a quality of service (QoS) class based on its requests and limits. The QoS class determines a pod’s priority when a node comes under resource pressure. Upon creation, pods are automatically assigned one of three possible QoS classes. These include, from highest to lowest priority:

  • Guaranteed
  • Burstable
  • Best effort

Best-effort QoS means that when a node runs out of memory, these low-priority pods are the first to be destroyed to free up resources, protecting higher-priority pods from out-of-memory (OOM) errors and crashes. Proper configuration of requests and limits is essential to ensuring a high QoS.
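
The rules Kubernetes documents for assigning these classes can be sketched as a small Python function. The input shape (a list of dicts with "requests" and "limits") is an assumption made for the example, and the sketch compares quantity strings literally, so "500m" and "0.5" would be treated as different even though Kubernetes considers them equal.

```python
def qos_class(containers):
    """Classify a pod from its containers' CPU and memory requests and limits."""
    any_set = False
    all_guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in ("cpu", "memory"):
            req, lim = requests.get(resource), limits.get(resource)
            if req is not None or lim is not None:
                any_set = True
            # Kubernetes defaults a missing request to the limit.
            if req is None:
                req = lim
            # Guaranteed needs a limit for every resource, equal to the request.
            if lim is None or req != lim:
                all_guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"

same = {"cpu": "500m", "memory": "256Mi"}
print(qos_class([{"requests": same, "limits": same}]))  # Guaranteed
print(qos_class([{"requests": {"cpu": "250m"}}]))       # Burstable
print(qos_class([{}]))                                  # BestEffort
```

A pod lands in the highest class only when every container sets both CPU and memory limits that match its requests, which is why leaving out even one value can lower a pod's priority under memory pressure.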

Limitations of Kubernetes Autoscaling

As we’ve seen, effective Kubernetes autoscaling requires a lot of fine-tuning. However, knowing how to best set configuration options isn’t always intuitive or obvious. Each of Kubernetes’ three autoscaling methods has challenges and limitations that we need to note if we choose to use them.

For example, neither HPA nor VPA takes input/output operations per second (IOPS), the network, or storage into account when making their calculations. This leaves applications susceptible to slowdowns or breakdowns. VPA also doesn’t allow us to update resource limits on actively running pods, meaning we have to remove the pods and create new ones to set new limits.

Additionally, the cluster autoscaler has limited compatibility with newer, third-party tools, meaning we can only use it on supported platforms. Cluster autoscaling also looks only at a pod’s requests when making scaling decisions, rather than assessing actual use. As a result, it can’t identify resources that a pod requests but never uses. This can lead to inefficient, wasteful clusters.
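
The gap between requested and used resources is easy to quantify once both numbers are available. The Python sketch below computes it for CPU. The pod names, field names and values are hypothetical, and in practice the usage figures would come from a metrics source.

```python
def request_slack(pods):
    """Return requested-minus-used CPU (millicores) for each pod name."""
    return {p["name"]: p["cpu_request_m"] - p["cpu_used_m"] for p in pods}

# Hypothetical requests and observed usage in millicores.
pods = [
    {"name": "api", "cpu_request_m": 1000, "cpu_used_m": 200},
    {"name": "worker", "cpu_request_m": 2000, "cpu_used_m": 1500},
    {"name": "batch", "cpu_request_m": 1000, "cpu_used_m": 100},
]
slack = request_slack(pods)
total_requested = sum(p["cpu_request_m"] for p in pods)
idle_share = sum(slack.values()) / total_requested

print(slack)
print(f"{idle_share:.0%} of requested CPU is idle")
```

In this example more than half of the requested CPU sits idle, yet an autoscaler that sees only requests would treat the node as nearly full and add capacity for the next pending pod.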

Optimizing Kubernetes Autoscaling with Machine Learning

Resource inefficiency with Kubernetes autoscaling often begins with improper vertical scaling. Then horizontal scaling compounds these issues, which manifest as high cloud costs as the cluster autoscaler adds instances to an already over-requested cluster.

Machine learning-based optimization solutions, such as StormForge Optimize Live, can address these challenges by automating the process of maintaining Kubernetes resource efficiency at scale. Using data from observability solutions and machine learning, these tools offer more advanced recommendations on modifying configurations beyond the Kubernetes default.

Optimization solutions like StormForge essentially act as a replacement VPA, handling pod sizing for us. Furthermore, if such a solution detects that we also have an HPA running, it will recommend a target utilization setting that reduces resource usage without compromising our application’s performance.

Using machine learning to optimize Kubernetes environments and lower costs eliminates the guesswork for automating resource management. As a result, we can concentrate on innovating rather than troubleshooting.

Conclusion

Kubernetes autoscaling helps ensure that your application makes the best use of the resources available to it, no matter the workload, saving time and money.

Of course, there are a few things to remember when implementing autoscaling in Kubernetes. Proper configuration is key to high QoS, as is continually monitoring your application to ensure that it’s scaling correctly. This fine-tuning process can be challenging, so turning to solutions like StormForge Optimize Live can help ensure that your CPU and memory resources are properly allocated to keep your application running at optimal capacity and low cost.

When taking advantage of autoscaling in Kubernetes, make sure you’re using the latest version of Kubernetes and that you’ve defined the resource requests to avoid autoscaling miscalculations. Doing so ensures that your autoscaling implementation is comprehensive, efficient and, most importantly, accurate.