The Horizontal Pod Autoscaler (HPA) is a key component of the Kubernetes autoscaling feature set. It adjusts the number of replicas for a workload based on observed metrics such as CPU utilization or custom metrics. Configuring the HPA in your clusters can help ensure optimal application performance, reliability, and cost efficiency in Kubernetes.

Scaling on CPU usage is the simplest setup: it needs only the Metrics Server and CPU requests on the pods. Leveraging the HPA effectively is harder. Custom metrics in particular require precise configuration plus additional components, which you must set up and maintain, to collect, aggregate, and expose those metrics to the autoscaler. Fine-tuning the autoscaling parameters is a further challenge.

This article describes the fundamental concepts, components, and inner workings of the HPA. We then walk through setting up, configuring, and observing it under various load scenarios in a practical demonstration. We also explore common obstacles and advanced tools for running the HPA in production environments. By the end of this article, you will have a solid foundation in the HPA and be well-equipped to use its capabilities in your Kubernetes clusters.

Summary of key Horizontal Pod Autoscaler concepts #

Concept	Description
Understanding the Kubernetes Horizontal Pod Autoscaler (HPA)	The HPA automatically scales the number of pods for scalable Kubernetes resources, including built-in types like deployments and replica sets as well as custom resources designed to work with the HPA. It makes decisions using metrics and scaling algorithms and consists of key components such as the HPA resource definition, the Metrics Server, and the custom and external metrics APIs.
Common HPA pitfalls and challenges	Common issues with the HPA include incorrect resource requests and limits, a lack of appropriate metrics for autoscaling, and difficulty tuning HPA parameters. Relying solely on CPU utilization can be problematic because it may not accurately reflect the actual load on the application. Scaling limitations due to cluster capacity and monitoring and troubleshooting issues also pose significant challenges to using the HPA effectively.

Understanding the Kubernetes Horizontal Pod Autoscaler (HPA) #

What Is the HPA?

The Horizontal Pod Autoscaler is a Kubernetes resource that automatically scales the number of pods in scalable resources such as deployments and replica sets based on observed metrics. The primary goal of the HPA is to ensure that applications have sufficient resources to handle varying levels of load while also avoiding overprovisioning and wasting resources during periods of low demand.

Evolution of the HPA: from CPU-based scaling to custom metrics

The HPA has evolved significantly since its introduction. In the early versions of Kubernetes, HPA v1 only supported scaling based on CPU utilization. This meant that the HPA could only make scaling decisions by comparing the observed CPU usage of pods against a target CPU utilization percentage set by the user.

To address this limitation, Kubernetes introduced the autoscaling/v2 API (stable since Kubernetes v1.23), which added support for scaling based on memory utilization and custom metrics. Users can now define their own metrics so that the HPA makes scaling decisions tailored to the specific needs of their applications.

How the HPA works

The HPA can scale based on five types of metrics:

Resource metrics: These are metrics related to the resource usage of pods, such as CPU and memory utilization. The HPA compares the observed resource usage against a target value specified by the user.
Container resource metrics: These are similar to resource metrics but let you target a particular container within the pod rather than the overall pod resource utilization. In the HPA spec, this metric type is named ContainerResource.
Pods metrics: These are metrics that describe the behavior of the pods being scaled. They are collected from each pod in the scaling target and then aggregated. For example, you might have a custom metric exposed by your application, such as "requests_per_second". The HPA will fetch this metric from each pod, calculate the average across all pods, and use that for scaling decisions.
Object metrics: These refer to a single metric describing a single Kubernetes object other than the pods being scaled. This could be the scaling target itself (like a Deployment) or a different object entirely (like an Ingress). For instance, you might want to scale based on the number of requests an Ingress is receiving. Object metrics can use either the Value target type (comparing the metric directly to a threshold) or AverageValue (dividing the metric value by the number of pods before comparing to the threshold).
External metrics: These come from sources outside the Kubernetes cluster, exposed through the External Metrics API. External metrics allow the HPA to make scaling decisions based on metrics from third-party systems, such as cloud provider monitoring services or other external tools.
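The difference between the Value and AverageValue target types for object metrics is easiest to see with numbers. The Python sketch below is illustrative only: the function name and the sample figures are assumptions made for this example, not part of any Kubernetes API.

```python
def object_metric_ratio(metric_value: float, target: float,
                        target_type: str, current_pods: int) -> float:
    """Return how far an object metric is from its target.

    A ratio above 1.0 suggests scaling out, below 1.0 scaling in.
    """
    if target_type == "Value":
        # Compare the raw metric directly to the threshold.
        return metric_value / target
    if target_type == "AverageValue":
        # Divide the metric by the number of pods before comparing.
        return (metric_value / current_pods) / target
    raise ValueError(f"unsupported target type: {target_type}")


# Hypothetical Ingress receiving 600 requests per second, served by 4 pods.
print(object_metric_ratio(600, 300, "Value", 4))         # threshold: 300 in total
print(object_metric_ratio(600, 100, "AverageValue", 4))  # threshold: 100 per pod
```

With Value, the threshold describes the object as a whole; with AverageValue, it describes the share each pod should carry.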

Scaling algorithms and calculations

The HPA periodically fetches the specified metrics and calculates the desired number of replicas based on the observed values and the target values set by the user. The basic scaling algorithm can be summarized as follows:

desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]

For example, if the current metric value is 200m (millicores) and the desired value is 100m, the number of replicas doubles (200 / 100 = 2.0). If the current value is 50m, the number of replicas is halved (50 / 100 = 0.5), with the result rounded up to the next whole number, so two replicas become one.

The HPA also considers the minReplicas and maxReplicas values specified in the HPA definition to ensure that the number of replicas stays within the configured bounds.

When multiple metrics are specified, the HPA calculates each metric’s desired number of replicas independently and then takes the maximum value as the final desired replica count.
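The calculation described above can be sketched in a few lines of Python. This is a simplified model, not the controller's actual code: the function names are made up for this example, and it ignores details such as pod readiness and missing metrics. The 0.1 tolerance mirrors the documented default, under which the HPA skips scaling when the ratio is close to 1.0.

```python
import math


def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, min_replicas: int,
                     max_replicas: int, tolerance: float = 0.1) -> int:
    """Apply desiredReplicas = ceil(currentReplicas * current / target)."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current replica count.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * current_value / target_value)
    # Keep the result within the minReplicas/maxReplicas bounds.
    return max(min_replicas, min(max_replicas, desired))


def desired_for_all_metrics(current_replicas: int,
                            metrics: list[tuple[float, float]],
                            min_replicas: int, max_replicas: int) -> int:
    """Evaluate each (current, target) pair and take the largest result."""
    return max(
        desired_replicas(current_replicas, current, target,
                         min_replicas, max_replicas)
        for current, target in metrics
    )


print(desired_replicas(1, 200, 100, 1, 10))   # load is double the target
print(desired_replicas(2, 50, 100, 1, 10))    # load is half the target
print(desired_for_all_metrics(2, [(200, 100), (50, 100)], 1, 10))
```

Taking the maximum across metrics means the most demanding metric wins, which errs on the side of having enough capacity.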

Scaling down behavior

When the load drops, the HPA doesn't immediately scale down the number of replicas. Instead, it uses a stabilization window to prevent rapid fluctuations in the number of replicas due to dynamic load patterns.

You can customize the scaling-down behavior using the behavior field in the HPA specification. For example:

behavior:
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
    - type: Percent
      value: 100
      periodSeconds: 15

This configuration sets a 5-minute stabilization window and allows scaling down by 100% of the current replicas every 15 seconds. You can further refine this behavior by specifying multiple policies and using the selectPolicy field to choose how these policies are applied.
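To make the two settings concrete, here is a rough Python model of how a stabilization window and a Percent policy interact. It is a sketch under simplifying assumptions: the function names are hypothetical, the rounding is a simplification, and the real controller also tracks timestamps and how many replicas were already removed within the current period.

```python
def stabilized_recommendation(recent_recommendations: list[int]) -> int:
    """Scale-down stabilization: act on the highest replica count
    recommended during the window, so a brief dip in load is ignored."""
    return max(recent_recommendations)


def limit_scale_down(current_replicas: int, desired_replicas: int,
                     percent: int) -> int:
    """Percent policy: remove at most `percent` of the current replicas
    in one period (rounding simplified for illustration)."""
    max_removed = (current_replicas * percent) // 100
    lowest_allowed = current_replicas - max_removed
    return max(desired_replicas, lowest_allowed)


# Recommendations computed during the last 300 seconds (hypothetical values).
window = [5, 3, 4, 2]
print(stabilized_recommendation(window))   # stays at 5 until the 5 ages out

# With value: 100 the policy does not restrict the drop at all...
print(limit_scale_down(10, 2, 100))
# ...whereas value: 50 would allow removing only half per period.
print(limit_scale_down(10, 2, 50))
```

In this model, the stabilization window decides when a scale-down may start, and the policy decides how fast it proceeds once it does.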

It's also possible to disable scale-down altogether:

behavior:
  scaleDown:
    selectPolicy: Disabled

Key components of the HPA

Metrics Server

The Metrics Server is a cluster-wide aggregator of resource usage data in Kubernetes. It collects CPU and memory usage metrics from the kubelet on each node and exposes them through the Kubernetes Metrics API (metrics.k8s.io). The HPA uses the Metrics Server to access resource metrics for scaling decisions.

To use the HPA with resource metrics, you must deploy the Metrics Server in your Kubernetes cluster. The Metrics Server is not deployed by default in most Kubernetes distributions, so you may need to install it separately.

Custom and External Metrics APIs

The Custom Metrics API and External Metrics API allow the HPA to access application-specific and external metrics for scaling decisions. These APIs are provided by third-party adapters that collect metrics from various sources and expose them in a format compatible with the HPA.

Some popular custom and external metrics adapters include the following:

Prometheus Adapter: Exposes metrics collected by Prometheus as custom metrics for the HPA
Stackdriver Adapter: Provides custom metrics from Google Cloud Platform’s Stackdriver monitoring service
Azure Adapter: Exposes custom metrics from Microsoft Azure Monitor
AWS Adapter: Provides custom metrics from Amazon CloudWatch

To use custom or external metrics with the HPA, you must deploy the appropriate adapter in your Kubernetes cluster and configure it to expose the desired metrics. The adapter then registers itself with the Kubernetes API server, making the metrics available for the HPA to consume.

Horizontal Pod Autoscaler Demo #

In this section, we walk through a hands-on demo of the HPA to demonstrate its usage and behavior in a real-world scenario. Before we begin, be sure that you have the following prerequisites in place:

Kubernetes Cluster (v1.23 onwards)
kubectl
Metrics Server

Deploying a demo application

We deploy a web application for this demo and expose an endpoint using a service.

Create a new file named hpa-demo.yaml with the following content:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: hpa-demo
spec:
  selector:
    matchLabels:
      run: hpa-demo
  template:
    metadata:
      labels:
        run: hpa-demo
    spec:
      containers:
      - name: hpa-demo
        image: registry.k8s.io/hpa-example
        ports:
        - containerPort: 80
        resources:
          limits:
            cpu: 500m
          requests:
            cpu: 200m
---
apiVersion: v1
kind: Service
metadata:
  name: hpa-demo
  labels:
    run: hpa-demo
spec:
  ports:
  - port: 80
  selector:
    run: hpa-demo

To deploy, run the command below:

> kubectl apply -f hpa-demo.yaml

deployment.apps/hpa-demo created
service/hpa-demo created

Next, verify the deployment status:

> kubectl get deploy/hpa-demo

NAME       READY   UP-TO-DATE   AVAILABLE   AGE
hpa-demo   1/1     1            1           71s

Now that the demo app is running, we can create the autoscaler.

Create a new file named hpa.yaml with the following definition:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-demo
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hpa-demo
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50

This HPA configuration will automatically scale the hpa-demo deployment based on the average CPU utilization of its pods. By adjusting the number of replicas between 1 and 10, it will try to maintain an average CPU utilization of 50%.
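The percentage in a Utilization target is measured against the CPU the pods request, not against their limits or the node's capacity. A minimal Python sketch, assuming utilization is computed as total usage over total requests and that every pod has the same request (the function name and the usage figures are made up for illustration):

```python
def average_cpu_utilization(usage_millicores: list[int],
                            request_millicores: int) -> int:
    """Average utilization, as a percentage of the requested CPU,
    across all pods of the scaling target."""
    total_requested = request_millicores * len(usage_millicores)
    return (sum(usage_millicores) * 100) // total_requested


# The demo deployment requests 200m of CPU per pod and is limited to 500m.
print(average_cpu_utilization([136], 200))        # one pod using 136m
print(average_cpu_utilization([500], 200))        # one pod pinned at its limit
print(average_cpu_utilization([300, 100], 200))   # two pods, uneven load
```

Because the limit (500m) is 2.5 times the request (200m), a single pod in this demo can report up to 250% utilization.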

To create the HPA, run this command:

> kubectl apply -f hpa.yaml

horizontalpodautoscaler.autoscaling/hpa-demo created

Testing the HPA under load conditions

To observe how the autoscaler responds to increased load, we’ll create a separate pod that acts as a client. This client pod will continuously send requests to the hpa-demo service in an infinite loop.

In a new terminal, run the following command to create the load generator pod:

> kubectl run -i --tty test-load-generator --rm --image=busybox:1.35 --restart=Never -- /bin/sh -c "while sleep 0.01; do wget -q -O- http://hpa-demo; done"

This command runs the load generator pod, which sends requests to the hpa-demo service in a loop, pausing 0.01 seconds between requests.

In the original terminal, execute the following command to monitor the HPA:

> kubectl get hpa hpa-demo --watch

Within a short period, you should notice an increase in CPU load. As the load continues, the number of replicas will increase. For example:

NAME       REFERENCE             TARGETS    MINPODS   MAXPODS   REPLICAS   AGE
hpa-demo   Deployment/hpa-demo   0%/50%     1         10        1          8m52s
hpa-demo   Deployment/hpa-demo   68%/50%    1         10        1          9m16s
hpa-demo   Deployment/hpa-demo   250%/50%   1         10        2          9m31s
hpa-demo   Deployment/hpa-demo   200%/50%   1         10        4          9m46s
hpa-demo   Deployment/hpa-demo   105%/50%   1         10        5          10m

In this case, CPU utilization peaked at 250% of the requested CPU, five times the 50% target, and the Horizontal Pod Autoscaler scaled the deployment out to five replicas.

To verify the number of replicas, run:

> kubectl get deployment hpa-demo

The output should display the replica count, matching the value from the HPA:

NAME       READY   UP-TO-DATE   AVAILABLE   AGE
hpa-demo   5/5     5            5           10m

To complete the demonstration, you must stop generating load on the hpa-demo service.

In the terminal where you created the load generator pod using the busybox image, press <Ctrl> + C to terminate the load generation process.

After allowing a brief period for the system to adjust, observe the HorizontalPodAutoscaler’s status by running:

> kubectl get hpa hpa-demo --watch

You should see output resembling the following:

NAME       REFERENCE             TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
hpa-demo   Deployment/hpa-demo   0%/50%    1         10        1          15m

This indicates that the CPU utilization has dropped to 0%, falling below the target of 50%.

To confirm that the deployment has scaled down, run:

> kubectl get deployment hpa-demo

The output should show that the number of replicas has been reduced to one:

NAME       READY   UP-TO-DATE   AVAILABLE   AGE
hpa-demo   1/1     1            1           20m

As the CPU utilization decreased to 0%, the HPA automatically scaled the deployment to one replica.

Remember that scaling down may take a few minutes: by default, the HPA waits out a five-minute stabilization window before removing replicas, so be patient while the system adjusts.

Implementing pod-metrics-based HPA

In the example above, we covered CPU usage-based scaling for pods. However, scaling on metrics like CPU and memory will often be insufficient. To address this, HPA provides custom pod-metrics-based scaling, as mentioned earlier.

In this section, we deploy a demo app that exposes a custom metric called request_count and deploy an HPA that scales the demo based on this custom metric.

We must install the Prometheus Operator and Prometheus Adapter to implement our use case:

Prometheus Operator (Installation)
Prometheus Adapter (Installation)

The Prometheus Adapter bridges Prometheus and the Kubernetes metrics APIs, exposing custom metrics collected by Prometheus in a format that Kubernetes can understand and use for autoscaling.

After installation, we must update the adapter rules to ensure that the Prometheus Adapter collects the metrics from our demo app. Save the rules below as values.yaml:

rules:
  default: true
  custom:
    - seriesQuery: 'request_count'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "^(.*)_count"
        as: "${1}_per_second"
      metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'

    - seriesQuery: |
         {namespace!="",__name__!~"^container_.*"}
      resources:
        template: "<<.Resource>>"
      name:
        matches: "^(.*)_count"
        as: ""
      metricsQuery: |
        sum by (<<.GroupBy>>) (
          irate (
            <<.Series>>{<<.LabelMatchers>>}[1m]
          )
        )

Then upgrade the Helm release to apply the rules:

> helm upgrade -f values.yaml prometheus-adapter prometheus-community/prometheus-adapter

To implement our use case, apply the manifest below.

> kubectl apply -f https://gist.githubusercontent.com/decisivedevops/d23cb8620275af24d1c0ac4097518f49/raw/9b332a69504e3ae8998387e298c62aeb8fc374ec/pod-metrics-hpa-demo.yaml

Here’s a brief overview of each resource:

Deployment: This defines the py-hpa-demo application, specifying the container image, resource requests, and limits. It starts with one replica of the pod running on port 8080. The app source code, Dockerfile, and requirements.txt are available here.
Service: Exposes the py-hpa-demo application internally within the Kubernetes cluster on port 8080, allowing other services to communicate with it.
Horizontal Pod Autoscaler: Configures automatic scaling for the py-hpa-demo deployment. It adjusts the number of pod replicas between 1 and 10 based on the custom metric request_count, targeting an average value of 50 requests.
ServiceMonitor: Tells Prometheus to scrape metrics from the py-hpa-demo application’s /metrics endpoint on port 8080, which makes the custom metric available for the HPA.

Next, ensure that the application is deployed and running.

> kubectl get pods

NAME                                   READY   STATUS    RESTARTS   AGE
prometheus-adapter-65bf5d48b-5rprb     1/1     Running   0          4m40s
prometheus-operator-865844f8b4-jxhqd   1/1     Running   0          2m17s
prometheus-prometheus-0                2/2     Running   0          2m3s
prometheus-prometheus-1                2/2     Running   0          2m3s
py-hpa-demo-746778999b-kxw7r           1/1     Running   0          111s

Next, verify that the custom metrics API is serving the metrics exposed by the application:

> kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .

You should see the pods/request_count metric.
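The discovery response can be long when the adapter's default rules are enabled. As an alternative to scanning the jq output by eye, the following Python sketch extracts just the metric names. The sample payload is a hand-written, abbreviated illustration of the response shape, not output captured from a real cluster, and the function name is hypothetical.

```python
import json

# Abbreviated, hand-written example of a discovery response (illustrative only).
SAMPLE_RESPONSE = """
{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "custom.metrics.k8s.io/v1beta1",
  "resources": [
    {"name": "pods/request_count", "namespaced": true,
     "kind": "MetricValueList", "verbs": ["get"]},
    {"name": "namespaces/request_count", "namespaced": false,
     "kind": "MetricValueList", "verbs": ["get"]}
  ]
}
"""


def list_custom_metrics(discovery_json: str, prefix: str = "") -> list[str]:
    """Return the sorted metric names from a discovery response,
    optionally keeping only those that start with `prefix`."""
    document = json.loads(discovery_json)
    names = (resource["name"] for resource in document.get("resources", []))
    return sorted(name for name in names if name.startswith(prefix))


print(list_custom_metrics(SAMPLE_RESPONSE, prefix="pods/"))
```

To run it against a cluster, save the output of the kubectl command above to a file and pass the file's contents to list_custom_metrics.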

Now we can confirm that the HPA is reading the pod metrics:

> kubectl get hpa

NAME              REFERENCE                TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
py-hpa-demo-hpa   Deployment/py-hpa-demo   0/50      1         10        1          5m15s

TARGETS 0/50 indicates that the demo app has not received any requests yet, so HPA is idle.

To generate the requests, you can use the test-load-generator loop below:

> kubectl run -i --tty test-load-generator --rm --image=busybox:1.35 --restart=Never -- /bin/sh -c "while sleep 0.01; do wget -q -O- http://py-hpa-demo:8080; done"

If you don't see a command prompt, try pressing enter.
Hello, World! Request count: 11
Hello, World! Request count: 12
Hello, World! Request count: 13
Hello, World! Request count: 14
Hello, World! Request count: 15
Hello, World! Request count: 16

After a few seconds, you can see that HPA TARGETS is updated:

> kubectl get hpa

NAME              REFERENCE                TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
py-hpa-demo-hpa   Deployment/py-hpa-demo   28/50     1         10        1          8m4s

Continue to run that loop to see the HPA in action. After 1-2 minutes of running the test-load-generator loop, check the HPA again:

> kubectl get hpa

NAME              REFERENCE                TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
py-hpa-demo-hpa   Deployment/py-hpa-demo   154/50    1         10        2          9m59s

This confirms that the HPA is indeed scaling the pods based on the custom metrics that the app is exposing.

Common HPA pitfalls and challenges #

While the Horizontal Pod Autoscaler (HPA) is a powerful tool for automatically scaling applications in Kubernetes, users may encounter several common pitfalls and challenges when working with HPA. Let’s explore them in detail.

Challenges with setting up custom metrics

While the HPA supports custom metrics, setting them up can be complex. It requires additional components and configuration, such as a monitoring system to collect the metrics and an adapter that serves them through the custom metrics API.

Collecting and exposing custom metrics often involves integrating with external monitoring systems or instrumenting the application code. This setup process can be time-consuming and requires expertise in Kubernetes and the chosen monitoring solution. Ensuring the reliability and scalability of the custom metrics pipeline adds another layer of complexity to the overall setup.

Difficulty tuning HPA parameters

HPA has several parameters that must be tuned based on the application’s requirements and behavior. These parameters include the scaling thresholds, number of replicas, and scaling behavior.

Finding the correct values for these parameters can be challenging because it requires understanding the application’s performance characteristics and load patterns. Incorrectly tuned parameters can lead to aggressive scaling or slow response to changes in demand, impacting the application’s performance and cost efficiency.

Scaling limitations due to cluster capacity

The HPA can only scale applications within the limits of the available cluster resources. If the cluster cannot accommodate the increased demand, new pods remain in the Pending state and the application stops scaling at the cluster’s maximum capacity. This can result in performance degradation or even service unavailability. It’s vital to ensure that the cluster has enough capacity to handle the expected workload and to plan for potential scaling needs in advance.
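A quick back-of-the-envelope check can show whether maxReplicas is even reachable. The Python sketch below is a deliberately rough estimate: it considers CPU requests only and assumes the nodes are otherwise empty, ignoring memory, other workloads, and scheduling constraints. The function name and node sizes are hypothetical.

```python
def max_schedulable_replicas(allocatable_millicores_per_node: list[int],
                             request_millicores: int) -> int:
    """Estimate how many pods fit on the cluster by CPU request alone.
    A pod cannot be split across nodes, so count whole pods per node."""
    return sum(node // request_millicores
               for node in allocatable_millicores_per_node)


# Three hypothetical nodes with 1900m of allocatable CPU each,
# and pods that request 200m (as in the demo deployment).
capacity = max_schedulable_replicas([1900, 1900, 1900], 200)
print(capacity)

max_replicas = 10
if capacity < max_replicas:
    print("maxReplicas exceeds what the cluster can schedule")
```

If the estimate falls below maxReplicas, either lower the bound or pair the HPA with a mechanism that adds nodes.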

Monitoring and troubleshooting issues

Monitoring and troubleshooting HPA can be challenging, especially when dealing with complex applications and large-scale deployments. Proper monitoring and logging mechanisms are required to track scaling behavior and identify issues.

Troubleshooting HPA issues may involve analyzing metrics, reviewing HPA events and logs, and correlating them with application behavior. Effective monitoring and troubleshooting practices are essential to ensure the smooth operation of HPA and to resolve any issues that arise quickly.

Conclusion #

In this comprehensive guide, we explored the Horizontal Pod Autoscaler (HPA), a powerful Kubernetes feature that automatically scales the number of pods based on observed metrics. We learned the core concepts, components, and working principles of HPA and implemented it through a hands-on demonstration, showcasing its ability to scale a deployment under varying load conditions.

We also discussed the common pitfalls and challenges associated with the HPA, such as the complexities of setting up custom metrics, the difficulty in tuning HPA parameters, and the limitations imposed by cluster capacity. Advanced tools such as Kubernetes Event-driven Autoscaling (KEDA) can help address these challenges and extend the HPA’s capabilities. KEDA simplifies the process of autoscaling based on event-driven and custom metrics, providing a more flexible and intuitive approach to scaling applications.

As you continue to explore Kubernetes autoscaling, remember to assess your application’s requirements carefully, choose the appropriate metrics for scaling, and leverage the right tools and techniques to overcome the challenges associated with HPA. With a solid understanding of HPA and the benefits of advanced tools like KEDA, you’ll be well-equipped to build and manage scalable applications on Kubernetes confidently.
