One source of performance problems that we encountered even before we started StormForge Performance Testing, and that we still see all the time, is missing HTTP keep-alive. This article explains why this is still a relevant problem and why performance testing is an important tool for uncovering such issues.

At 24 years old, HTTP is a well-aged fellow among the web protocols. [1] Today we mostly use HTTP/1.1 [2] or HTTP/2, and if you have fully embraced the new HTTP/2 world in your entire system, this article is mostly an anecdote about past issues. But HTTP/1.1 is still alive and kicking in many systems. And even given its age, people still forget about a very important feature that previous versions did not provide by default: keep-alive. [3]

To clarify, I’m not talking about TCP keep-alive (which is disabled by default). Also I’m not talking about other kinds of keep-alive mechanisms for other protocols, which are equally important to keep an eye on. Today, we will focus on HTTP keep-alive.

How does HTTP work?

HTTP (at least prior to HTTP/2) is a very simple protocol. For a given request to fetch data from a server, the following steps happen (simplified):

  • a DNS lookup is made,
  • a new TCP connection is established,
  • the TLS handshake is performed,
  • the request headers and optional payload are sent,
  • the response is read and
  • the connection is closed.

The last point is the topic of this article: Don’t close the connection!

HTTP/1.1 learned to reuse an existing connection: once the response has been read entirely, a new request can be sent over the same connection. This happens automatically if both parties understand it. Unless the client sets the Connection: close request header or the server actively closes the connection, it will be reused for subsequent requests. Sounds like a no-brainer, right?
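
To make the difference tangible, here is a minimal Node.js sketch (not part of our test setup) that starts a throwaway local server and counts how many TCP connections it accepts for the same number of requests, once through a keep-alive agent and once with Connection: close. The names sendRequests and countConnections are made up for this illustration.

```javascript
const http = require("node:http");

// Sends `count` sequential GET requests and resolves once every response
// has been read completely.
async function sendRequests(port, count, options) {
  for (let i = 0; i < count; i++) {
    await new Promise((resolve, reject) => {
      const req = http.get(
        { host: "127.0.0.1", port, path: "/", ...options },
        (res) => {
          res.resume(); // drain the body so the socket can be reused
          res.on("end", resolve);
        }
      );
      req.on("error", reject);
    });
  }
}

// Starts a local server, runs both variants, and reports how many TCP
// connections the server accepted for each of them.
async function countConnections(requests = 5) {
  let accepted = 0;
  const server = http.createServer((req, res) => res.end("ok"));
  server.on("connection", () => { accepted += 1; });
  await new Promise((resolve) => server.listen(0, "127.0.0.1", resolve));
  const { port } = server.address();

  // Variant 1: a keep-alive agent limited to a single socket.
  const agent = new http.Agent({ keepAlive: true, maxSockets: 1 });
  await sendRequests(port, requests, { agent });
  const withKeepAlive = accepted;
  agent.destroy();

  // Variant 2: "Connection: close" asks for a fresh connection per request.
  accepted = 0;
  await sendRequests(port, requests, {
    agent: false,
    headers: { Connection: "close" },
  });
  const withoutKeepAlive = accepted;

  await new Promise((resolve) => server.close(resolve));
  return { withKeepAlive, withoutKeepAlive };
}
```

Calling countConnections(5) should resolve to one accepted connection for the keep-alive variant and five for the variant that closes the connection after every request.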

Why is this important? Why bother?

We seem to forget about the fact that there might be an issue with keep-alive. Almost everyone seems to be aware that this concept exists, but few are actively checking that everything is working as expected. You might be surprised how often keep-alive is not configured properly!

The other issue is: Developers and operations people heavily underestimate the impact of doing a DNS lookup, establishing a TCP connection, and making a TLS handshake. Over and over again. For every single HTTP request. Every. Single. Time.

From our experience, we can tell that the overhead will add up very quickly. And it does not make a big difference what kind of system you are building. Even for internal or even local systems, there is usually not really anything to gain from closing the connection. You don’t have to take our word for it – there are many resources out there supporting this.

What we and our customers observe when running tests with missing keep-alive is slower response times, even under moderate load. If more and more requests take longer to process, more connections stay active, so more resources are consumed and blocked. In many cases, the systems under test do not recover until traffic stops.

Here is a quick example I used a while back for a talk at the AWS User Group in Cologne. I used a simple StormForge test case to give you an idea of how TCP reconnects impact latency (find the test definition at the end of this article). The following image is a latency histogram over all requests made by this test (available in all StormForge reports):

[Figure: latency histogram showing a bimodal distribution]

You might have already guessed it: Left is with keep-alive, right is without. Same target, same request, same response.

Yes, this is a simple and somewhat artificial example, but it is not far from many setups our customers are testing. We see a clear bimodal distribution: one peak where new connections need to be established and another where an existing connection is reused. The difference is rather significant.

The difference comes from multiple factors:

  • With keep-alive, you pay for DNS, TCP and TLS only once per peer (or a few times if you are using a pool of connections).
  • Allocating a TCP socket is not free either, especially when the system is under load.
  • Resources are finite, and sockets that are kept around can add up quickly. Also look out for sockets in the TIME_WAIT state.
  • Worst case: you can run out of ephemeral ports.
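
A rough back-of-the-envelope model shows how quickly the first point adds up. The sketch below only counts network round trips and rests on simplifying assumptions: DNS is answered from a cache, handshakes are full (no TLS session resumption, no TCP Fast Open), every request takes exactly one round trip, and server processing time is ignored. The function name and the numbers are purely illustrative, not measurements.

```javascript
// Round trips spent on connection setup before the first request can be sent.
const SETUP_ROUND_TRIPS = {
  plain: 1, // TCP three-way handshake
  tls12: 3, // TCP handshake + full TLS 1.2 handshake (2 round trips)
  tls13: 2, // TCP handshake + full TLS 1.3 handshake (1 round trip)
};

// Estimates the total time spent waiting on the network for a series of
// sequential requests to the same peer.
function totalLatencyMs({ requests, rttMs, mode, keepAlive }) {
  // With keep-alive the setup is paid once, otherwise once per request.
  const setups = keepAlive ? 1 : requests;
  const setupCost = setups * SETUP_ROUND_TRIPS[mode] * rttMs;
  const requestCost = requests * rttMs; // one round trip per request/response
  return setupCost + requestCost;
}

const scenario = { requests: 25, rttMs: 20, mode: "tls12" };
console.log("without keep-alive:", totalLatencyMs({ ...scenario, keepAlive: false }), "ms");
console.log("with keep-alive:   ", totalLatencyMs({ ...scenario, keepAlive: true }), "ms");
```

With 25 requests over TLS 1.2 and a 20 ms round-trip time, this model yields 2000 ms without keep-alive and 560 ms with it.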

If you want to learn more about TCP, sockets and TIME_WAIT and how to optimize your servers, check out this great article by Vincent Bernat.

Keep-Alive and Current Architectural Approaches

Overlooking keep-alive hurts even more with some of the currently trending architectural approaches.

For example, take Server-less or Function-as-a-Service (FaaS). [4] With FaaS, you need to be stateless, but an application is usually not really fully stateless. Most of the time you solve this by externalizing state to other components and services. And how do you access the state again? Quite often it is done via HTTP. You should also check out Yan Cui’s article on HTTP keep-alive as an optimization for AWS Lambda.
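
As a sketch of what this can look like in a Node.js function: the agent below is created once per process, outside the handler, so that warm invocations can reuse already established connections. STATE_URL, fetchState and handler are hypothetical names for illustration, and how a handler gets registered depends on your FaaS platform.

```javascript
const https = require("node:https");

// Created once per process, not per invocation, so connections survive
// between invocations that hit the same warm instance.
const keepAliveAgent = new https.Agent({ keepAlive: true });

// Placeholder URL of an external state service (hypothetical).
const STATE_URL = "https://state.example.com/items/42";

// Fetches the externalized state over the shared keep-alive agent.
function fetchState(url = STATE_URL) {
  return new Promise((resolve, reject) => {
    https
      .get(url, { agent: keepAliveAgent }, (res) => {
        const chunks = [];
        res.on("data", (chunk) => chunks.push(chunk));
        res.on("end", () => resolve(Buffer.concat(chunks).toString("utf8")));
      })
      .on("error", reject);
  });
}

// Hypothetical function entry point.
async function handler() {
  return { statusCode: 200, body: await fetchState() };
}
```

The important design choice is the scope of the agent: creating it inside the handler would throw the connection pool away after every invocation.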

This especially affects microservices, where HTTP is often the communication protocol of choice.

Again and again, we see our customers uncover these problems with performance tests and achieve rather quick wins in terms of latency, stability and general efficiency.


Use HTTP keep-alive. Always.

More importantly, don’t just assume it is used; check it. You can easily test it by running curl -v and looking for * Connection #0 to host left intact at the end of the output. Testing it on a larger scale, and especially measuring the impact, is also easy with a performance test using StormForge. Catching a misconfiguration or an unintended configuration change with automated performance testing is even better, because you minimize the risk of potential havoc.
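
If you would rather check this from code, a small Node.js probe can do something similar. This is only a sketch, assuming Node.js 12.16 or later (for request.reusedSocket); probe and supportsKeepAlive are hypothetical helper names. It sends two sequential requests through a single-socket keep-alive agent and reports whether the second one reused the first connection.

```javascript
const http = require("node:http");
const https = require("node:https");

// Performs one GET and resolves with whether it ran on a reused socket.
function probe(client, url, agent) {
  return new Promise((resolve, reject) => {
    const req = client.get(url, { agent }, (res) => {
      res.resume(); // drain the body so the socket can go back to the pool
      res.on("end", () => resolve(req.reusedSocket));
    });
    req.on("error", reject);
  });
}

// Resolves to true if the second of two sequential requests reused the
// connection opened by the first one.
async function supportsKeepAlive(url) {
  const client = new URL(url).protocol === "https:" ? https : http;
  const agent = new client.Agent({ keepAlive: true, maxSockets: 1 });
  try {
    await probe(client, url, agent);
    // Yield once so the agent has returned the socket to its pool.
    await new Promise((resolve) => setImmediate(resolve));
    return await probe(client, url, agent);
  } finally {
    agent.destroy();
  }
}
```

A call such as supportsKeepAlive("https://example.com/") resolves to false when the server closes the connection after the first response, which is exactly the situation you want to catch.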

More Details

I’ve been using a simple test case to showcase the impact of HTTP keep-alive. We have two scenarios, each weighted 50%. One session does 25 HTTP requests with keep-alive (which is the default with StormForge) and the other one does 25 HTTP requests without keep-alive.

Note that our testapp does HTTP keep-alive by default:

definition.session("keep-alive", function(session) {
  // Every client gets a new environment, so the first
  // request cannot reuse an existing connection.
  session.get("", { tag: "no-keep-alive" });
  // HTTP Keep-Alive is the default, so for all the following
  // requests in this loop, we can reuse the connection.
  session.times(26, function(context) {
    context.get("", { tag: "keep-alive" });
  });
});

definition.session("no-keep-alive", function(session) {
  // By setting the "Connection: close" header, we tell our
  // client to close the connection when the transfer has
  // finished, regardless of whether the server offers to
  // keep the connection intact.
  session.times(25, function(context) {
    context.get("", {
      tag: "no-keep-alive",
      headers: { Connection: "close" },
    });
  });
});

  1. Actually, HTTP is even older, but I’m referring to RFC 1945, i.e. HTTP/1.0. HTTP/0.9 dates back almost 30 years.
  2. HTTP/1.1 is actually a collection of RFCs: RFC 7230, HTTP/1.1: Message Syntax and Routing; RFC 7231, HTTP/1.1: Semantics and Content; RFC 7232, HTTP/1.1: Conditional Requests; RFC 7233, HTTP/1.1: Range Requests; RFC 7234, HTTP/1.1: Caching; RFC 7235, HTTP/1.1: Authentication.
  3. Technically, HTTP/1.0 could also support keep-alive, but it was opt-in and it was never specified in detail how it should work. If the client wanted a connection to be reused, it had to send Connection: keep-alive and check whether the server responded with the same header. Only then (depending on the implementation) was the connection kept intact after a request.
  4. Node.js’s HTTP client, or rather its HTTP Agent, has historically not kept connections alive by default (the default agent only does so since Node.js 19). On older versions you have to configure it explicitly, which is a bummer, because Node.js is a pretty popular technology for FaaS and Server-less applications.