Kubernetes users often rely on the Horizontal Pod Autoscaler (HPA) and cluster autoscaling to scale applications. Using a web application as an example, we show how optimizing the whole application with StormForge Ops alongside the HPA improves both cost and performance.

What is the Kubernetes Horizontal Pod Autoscaler?

The Horizontal Pod Autoscaler (HPA) in Kubernetes scales the number of replicas in a deployment or stateful set up and down based on metrics specified by the user. The most common metrics are the CPU and memory utilization of the target pods.

To deploy the HPA, the user sets target metrics for all replicas in a deployment as well as the minimum and maximum number of replicas. The HPA is responsible for adding or deleting replicas to keep the observed metrics close to the target values while keeping the number of replicas within the prescribed bounds.
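The scaling rule described above can be sketched in a few lines of Python. This is a minimal illustration of the formula given in the Kubernetes documentation (desired replicas = ceil(current replicas × current metric / target metric)), not the controller's actual code. The function name and the 0.1 tolerance default are assumptions made for this sketch, and details such as pod readiness and stabilization are left out.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas, max_replicas, tolerance=0.1):
    """Replica count suggested by the documented HPA scaling rule (simplified)."""
    ratio = current_metric / target_metric
    # Inside the tolerance band the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Always stay within the user-prescribed bounds.
    return max(min_replicas, min(max_replicas, desired))

# 3 replicas averaging 90% CPU against a 65% target, bounded to 3..7 replicas.
print(desired_replicas(3, 90, 65, min_replicas=3, max_replicas=7))
```

The clamp on the last line of the function is what keeps the replica count within the prescribed bounds even when the metric is far from its target.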

When scaling based on CPU and memory utilization, the HPA uses the metrics.k8s.io API implemented by the metrics-server. The HPA can also use custom metrics or external metrics (e.g. number of requests per second on the ingress) implemented by a third party or the user.

HPA tuning challenges

Optimizing the target metrics of the HPA for all applications and their specific workloads can be frustrating. While newer versions of Kubernetes support more in-depth configuration of the HPA through policies*, many users are left with a minimal set of configuration options: namely a CPU/memory utilization target and the minimum and maximum number of replicas.

If the application is a web server, for example, the speed at which the HPA adds replicas is critical for accommodating bursts in traffic. A simple fix would be to reduce the CPU utilization target to a small value (say 15%) so that the HPA adds replicas early when traffic increases. However, this increases your cloud cost because many replicas sit underutilized. To limit the cost, one could instead set a high CPU utilization target. But then, when traffic increases, the existing replicas experience CPU throttling while the new replicas are still starting up, so HTTP request latency increases and degrades the clients' experience.
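To make this tradeoff concrete, here is a rough back-of-the-envelope sketch. It assumes that CPU demand grows linearly with traffic and that each pod is capped at its CPU request; both function names and the 2-core demand figure are hypothetical and chosen only for illustration.

```python
import math

def steady_state_replicas(cpu_demand_cores, cpu_per_replica, target_utilization):
    """Replicas needed so that average utilization sits at the HPA target."""
    return math.ceil(cpu_demand_cores / (cpu_per_replica * target_utilization))

def burst_headroom(target_utilization):
    """Traffic multiplier the existing replicas can absorb before reaching 100% CPU."""
    return 1.0 / target_utilization

# Compare a low and a high utilization target for the same (assumed) 2-core demand.
for target in (0.15, 0.80):
    replicas = steady_state_replicas(2.0, 1.0, target)
    print(f"target={target:.0%} replicas={replicas} "
          f"headroom={burst_headroom(target):.2f}x")
```

Under these assumptions a low target buys a lot of burst headroom at the price of many mostly idle replicas, while a high target is cheap but leaves little room before throttling sets in.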

An ML-powered experimentation engine such as StormForge Ops can be used to design a highly available, scalable and cost-efficient application. Let’s demonstrate this using an example web application.

Example web application

In the following example, we will optimize the Docker example voting app using StormForge Ops. This app is a simple distributed application that allows the user to vote between two options – cats or dogs.

A Redis queue collects the votes. Workers consume the votes and insert them into a Postgres database. Finally, there is a Node.js web app that shows the results of the voting in real time. You can deploy this application in a dedicated namespace with:

kustomize build github.com/redskyops/redskyops-recipes/voting-webapp/application | kubectl apply -f -

StormForge Ops experiments

An “experiment” is made up of multiple trials in which the StormForge Ops server patches the whole application to find the optimal configuration. For each trial, the application is exercised with a scalability test. At the end of each test, the StormForge Ops controller measures the metrics being optimized. In this experiment, we optimize for two competing metrics: the cost of running the application (in $/month) and the p95 latency (in ms). Performance improves with more resources, but cost then becomes problematic; conversely, starving the application of resources degrades the user experience. Machine learning helps find the best tradeoff. You can find detailed instructions to run the experiment yourself here with Locust.

We are using the HPA to scale the number of replicas of the front-end deployment, which is responsible for the user experience, up and down. For simplicity, we tune the HPA using a target utilization available via the metrics server. The parameters that we are optimizing are the minimum and maximum number of replicas, the target utilization used by the HPA, and the CPU requests for the voting-service pod. Note that every pod runs with the Guaranteed QoS class (limits=requests).
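As an illustration of where these four parameters live, the sketch below builds an HPA manifest and a resources stanza as Python dictionaries and prints them as JSON, which kubectl also accepts. The resource names ("voting-service") and the helper functions are assumptions made for this example, not taken from the experiment files. The autoscaling/v2beta2 API version matches the one used elsewhere in this post; newer clusters serve the same fields under autoscaling/v2.

```python
import json

def hpa_manifest(min_replicas, max_replicas, cpu_target_percent,
                 deployment="voting-service"):  # deployment name is an assumption
    """HPA object scaling a deployment on average CPU utilization."""
    return {
        "apiVersion": "autoscaling/v2beta2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": deployment},
        "spec": {
            "scaleTargetRef": {"apiVersion": "apps/v1", "kind": "Deployment",
                               "name": deployment},
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [{
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    "target": {"type": "Utilization",
                               "averageUtilization": cpu_target_percent},
                },
            }],
        },
    }

def guaranteed_cpu_resources(cpu_cores):
    """Container resources with limits equal to requests (CPU only; the
    Guaranteed QoS class also needs matching memory values, omitted here)."""
    cpu = f"{round(cpu_cores * 1000)}m"
    return {"requests": {"cpu": cpu}, "limits": {"cpu": cpu}}

# Baseline values from this post: 3..7 replicas, 65% CPU target, 2 CPUs per replica.
print(json.dumps(hpa_manifest(3, 7, 65), indent=2))
print(json.dumps(guaranteed_cpu_resources(2)))
```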

We run the experiment for 400 trials, and consider a trial as failed if the response latency is greater than one second.
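The failure criterion can be sketched as follows. The post does not say which latency statistic is compared against the one-second threshold, so this example assumes it is the p95 latency, computed here with the nearest-rank method. The function names are hypothetical.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

def trial_failed(latencies_ms, threshold_ms=1000.0):
    """A trial fails when its p95 latency exceeds the threshold (assumed statistic)."""
    return percentile(latencies_ms, 95) > threshold_ms

# One slow request out of twenty does not move the p95; two of them do.
print(trial_failed([100] * 19 + [5000]))
print(trial_failed([100] * 18 + [5000] * 2))
```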

Scalability test

During each trial, we load test the application with increasing requests per second (RPS): 100 RPS for one minute, 500 RPS for one minute, 1000 RPS for one minute, and 2000 RPS for one minute. This allows us to test the scalability of the application and make sure that the HPA is configured correctly. The application should minimize cost at a low level of traffic (100 RPS) but still be able to scale up fast enough to handle 2000 RPS.
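The staged load profile above can be written down as a small lookup table. This is only a sketch of the schedule, not the actual test definition used in the experiment; the names STAGES and target_rps are hypothetical. A function like this could drive a custom load shape in a load-testing tool.

```python
# (duration in seconds, requests per second) for each stage of the test
STAGES = [(60, 100), (60, 500), (60, 1000), (60, 2000)]

def target_rps(elapsed_seconds, stages=STAGES):
    """Return the RPS for the current stage, or None once the test is over."""
    stage_end = 0
    for duration, rps in stages:
        stage_end += duration
        if elapsed_seconds < stage_end:
            return rps
    return None

# Sample the schedule at the start of each minute and just after the end.
for t in (0, 60, 120, 180, 240):
    print(t, target_rps(t))
```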

Experiment results

We first set a baseline configuration with:

Minimum replicas=3
Maximum replicas=7
CPU utilization target=65%
CPU per replica=2

With this baseline, the application costs $481/month to run and has a p95 latency of 579 milliseconds.

Each dot represents the metric values measured at the end of a trial (see Figure 1). The red dots are the best trials found during the experiment, i.e. those for which no other configuration improves one metric without worsening the other. After some exploration of the parameter space, the algorithm converges towards optimal configurations. We find a sharp transition in latency around a cost of $330 per month, where the most satisfying performance is achieved. We find that the best configuration is:

Minimum replicas=10
Maximum replicas=15
CPU utilization target=80%
CPU per replica=0.855

The cost of running this application is $365/month (24% savings) while the latency is 27.8 milliseconds (a 95% reduction in latency). The advantage of providing multiple best configurations is that the user can pick one based on experience. For example, if an experienced DevOps engineer wants a more scalable application in case of traffic spikes larger than the ones created for the load test, the following configuration can be chosen:

Minimum replicas=3
Maximum replicas=9
CPU utilization target=10%
CPU per replica=0.849

The cost is $344/month (28% savings) while the latency is 60 milliseconds (an 89% reduction in latency). Because the CPU utilization target per replica is lower in this case, a sudden burst in traffic triggers a scale-up from the HPA early, giving the newly added replicas time to become available.
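The percentages quoted above follow directly from the baseline figures ($481/month and 579 ms). A short check, with the configuration labels chosen here only for readability:

```python
def percent_reduction(baseline, value):
    """Relative reduction with respect to the baseline, in percent."""
    return (baseline - value) / baseline * 100

BASELINE_COST = 481      # $/month
BASELINE_P95_MS = 579    # milliseconds

for label, cost, p95_ms in [("lowest latency", 365, 27.8),
                            ("early scale-up", 344, 60)]:
    print(f"{label}: cost -{percent_reduction(BASELINE_COST, cost):.1f}%, "
          f"p95 latency -{percent_reduction(BASELINE_P95_MS, p95_ms):.1f}%")
```

The figures in the text correspond to these values rounded down to whole percentages.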

Figure 1. StormForge Ops experiment results
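The "best trials" highlighted in the figure are what is usually called a Pareto front: trials that no other trial beats on both cost and latency at once. The sketch below shows one simple way to extract such a front from (cost, latency) pairs. It illustrates the concept only and is not StormForge's algorithm; the first three points are the configurations discussed in this post, and the last one is made up.

```python
def pareto_front(trials):
    """Return the (cost, latency) pairs not dominated by any other trial.

    A trial is dominated when another trial is no worse on both metrics
    and strictly better on at least one (lower is better for both).
    """
    front = []
    for i, (cost, latency) in enumerate(trials):
        dominated = any(
            c <= cost and l <= latency and (c < cost or l < latency)
            for j, (c, l) in enumerate(trials) if j != i
        )
        if not dominated:
            front.append((cost, latency))
    return front

trials = [
    (481, 579),   # baseline
    (365, 27.8),  # lowest-latency configuration
    (344, 60),    # cheaper configuration with an earlier scale-up
    (400, 100),   # hypothetical trial, added for illustration
]
print(pareto_front(trials))
```

This pairwise comparison is quadratic in the number of trials, which is fine for a few hundred trials such as the 400 run here.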


Using StormForge Ops, we can deploy a web application with an HPA that scales efficiently and avoids overprovisioning for spikes in traffic.

We decided to tune the CPU target utilization of the HPA. This is more of an infrastructure monitoring approach that is made available quite easily by the metrics server. A different approach would have been to tune the HPA based on the number of requests-per-second on the ingress. Check out this great blog post on how to set up the external metrics server to work with the HPA.

Like the USE and RED methods for monitoring your infrastructure and user experience, StormForge Ops experiments can be written to optimize your infrastructure cost and usage, the user experience of your deployed application, or both.

Finally, StormForge Ops allows you to efficiently find the optimal configurations when the correlations are too complicated for a human to understand. For this example, we have oversimplified the application to easily interpret the results, but in production one would tune the resources of all the deployments.

To try it for yourself, create a free StormForge Ops account.


*Note: Starting with Kubernetes v1.18, the HPA API makes the scaling behavior configurable, allowing the user to define the scale-up and scale-down policies. For example:

behavior:
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
    - type: Percent
      value: 100
      periodSeconds: 15
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
    - type: Percent
      value: 100
      periodSeconds: 15
    - type: Pods
      value: 4
      periodSeconds: 15
    selectPolicy: Max
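To see what the scale-up policies above permit, the sketch below computes the largest replica count reachable within one 15-second period: the Percent policy allows the count to grow by 100% (doubling), the Pods policy allows 4 more pods, and selectPolicy: Max picks whichever allows the larger change. This is a simplified reading of the policy semantics. It ignores stabilization windows and the minimum and maximum replica bounds, and the function name is hypothetical.

```python
import math

def scale_up_limit(current_replicas, percent=100, pods=4):
    """Largest replica count allowed after one period with selectPolicy: Max."""
    by_percent = math.ceil(current_replicas * (1 + percent / 100))
    by_pods = current_replicas + pods
    # selectPolicy: Max keeps the policy that permits the biggest change.
    return max(by_percent, by_pods)

for replicas in (1, 3, 10):
    print(replicas, "->", scale_up_limit(replicas))
```

For small deployments the Pods policy dominates, while for larger ones the Percent policy does.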

To check whether your Kubernetes cluster has the behavior field available, run:

kubectl explain --api-version=autoscaling/v2beta2 hpa.spec.behavior