Demo Video

StormForge Optimize Live Demo

In this video, we provide a brief overview of Optimize Live and how it works. Kubernetes resource optimization can be done with just a few clicks or fully automated in real time.

StormForge is the leader in automated Kubernetes optimization, making it easy to achieve cloud resource efficiency at scale. It uses patent-pending machine learning in both pre-production and production to drive resource and cost savings, reduce risk of performance issues, and uncover insights that improve your cloud native architecture.

Kubernetes gives developers a lot of freedom. That freedom, along with the dynamic nature of Kubernetes-based applications, can make it challenging for DevOps, SRE, and FinOps teams to ensure applications run efficiently, minimizing resource usage and cost while still ensuring performance and stability. Today, teams attempting to optimize their Kubernetes environments have to rely on a manual trial-and-error approach.

They guess at the right resource settings, deploy the app, wait to see how it behaves, update YAML files, and then try again. With hundreds or thousands of containers, it's not humanly possible to optimize effectively this way. At StormForge, we recognized that large amounts of structured data and multi-dimensional, real-time optimization sound very much like a machine learning problem.
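To make that loop concrete, the YAML being hand-edited on every iteration is the per-container resources stanza. A minimal sketch is below; the container name, image, and values are illustrative placeholders, not taken from this demo.

```yaml
# The part of a Deployment manifest that trial-and-error tuning keeps touching.
# Container name, image, and values are illustrative placeholders.
spec:
  containers:
    - name: my-service
      image: registry.example.com/my-service:1.0
      resources:
        requests:
          cpu: 500m      # an initial guess at how much CPU the container needs
          memory: 1Gi    # an initial guess at memory
        limits:
          cpu: "1"
          memory: 1Gi
```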

That observation is why we created Optimize Live. Optimize Live analyzes observability data you already capture and recommends optimized CPU and memory configurations on a per-container basis, in real time. And it's very simple to get started. Here I've logged in to the StormForge web user interface at app.stormforge.io and chosen my application, which I've named Microservices Demo.

I chose the namespace demo, which is where my application is deployed. Within that namespace in my Kubernetes cluster, I have 11 microservices that interact with each other as a sort of shopping website. This website has all of the typical components you might expect: a checkout service, an ad service, a frontend for inbound traffic, a product catalog, recommendations, a cart cache backed by Redis, and a shipping service.

There are about 11 components here in total. What I've also done is define boundaries for this. Within the demo namespace, I could further refine the scope and use label selectors to narrow down which containers I'm most interested in having Optimize Live watch and provide recommendations for. But I also have to provide boundaries for it to work within, such as the minimum and maximum amount of CPU I want to give to any container.

Same with memory. Now, an interesting element here is risk tolerance, which defines how closely the recommendations follow the observed load. As we make these recommendations, we feel pretty confident in what the machine learning algorithm is recommending. However, maybe you don't want to cut it so close; you want to leave some more headroom, and that's especially important with memory.

Maybe we'll set our risk tolerance to low here, because an out-of-memory error on a container is a bad thing and will certainly cause user experience problems. Finally, we get to choose how frequently we want recommendations and whether we want them applied automatically or manually.
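Pulling those inputs together, the scope and boundaries amount to a namespace, a label selector, per-container CPU and memory bounds, a risk tolerance, a recommendation frequency, and an apply mode. The sketch below is purely illustrative; these settings are made in the Optimize Live UI, and the field names and bound values here are assumptions, not the product's actual configuration schema.

```yaml
# Illustrative sketch only -- not Optimize Live's actual configuration schema.
# It simply restates the inputs chosen in the UI for this demo; field names
# and bound values are assumed for illustration.
namespace: demo
workloadSelector:
  matchLabels:
    app.kubernetes.io/part-of: microservices-demo   # hypothetical label
containerBounds:
  cpu:
    min: 50m          # smallest CPU value a recommendation may use
    max: 2000m        # largest CPU value a recommendation may use
  memory:
    min: 128Mi
    max: 6Gi
riskTolerance: low             # more headroom, especially important for memory
recommendationFrequency: daily
applyMode: manual              # review and approve rather than auto-apply
```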

Now, I've chosen manual recommendations here because I want to see what the system is recommending. I'm going to look at the ad service first. All of these deployments have been defined with 500 millicores of CPU and 2 gigabytes of memory for their requests. Here, Optimize Live is recommending 60 millicores of CPU, which is a substantial reduction, and 195 megabytes of memory, which is also a substantial reduction.

Now, I have this mapped in my Grafana dashboard here, just to see what the data looks like. The blue line is the actual CPU usage we've seen over time; it's grown a little, but it's still quite small. If you look at the summary, which is all pulling from my Prometheus environment pointed at this cluster, the last usage recorded was 52.5 millicores.

OK, it just changed to 33.1 millicores. So that's really, really small. But you can see here the requests are set to 500 millicores and the limit is set to 1000. Our latest recommendation says you should set your CPU limit to 213 millicores, but more importantly, the requests can be set as low as 63. That's substantial, and I agree with it.

That's great. And I can look at the graph and see how the recommendation hugs the actual usage line.
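In Deployment terms, acting on that recommendation is a small change to the ad service's resources stanza. The snippet below uses the numbers quoted in the demo; the stanza itself is a sketch, and the memory limit is omitted because it isn't shown here.

```yaml
# Before: the values the ad service was deployed with.
resources:
  requests:
    cpu: 500m
    memory: 2Gi
  limits:
    cpu: 1000m

# After: the values Optimize Live is recommending for the same container.
resources:
  requests:
    cpu: 63m        # down from 500m (quoted as roughly 60-63 millicores)
    memory: 195Mi   # down from 2Gi
  limits:
    cpu: 213m       # down from 1000m
```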

Now, while I'm here, I'll also look at the Redis service. The reason I want to look at this one is that it has an interesting memory usage profile. Redis is an in-memory cache, so by its nature it's going to use quite a bit of memory. To that end, the deployment was configured with 6 GB as the memory limit and about 500 MB for the request.

But even with all of that, the usage has only been around 450 MB. Optimize Live has noticed this trend line growing and is saying that perhaps we want to increase the request from 537 up to 575. So you can see that Optimize Live won't always recommend going down; it bases its recommendations on what it's determined over the last two weeks of history.
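For the Redis cart cache, the adjustment goes the other way. Here is a sketch of the memory portion using the figures quoted above; the stanza is illustrative and the units are assumed, since the demo doesn't show the full manifest.

```yaml
# Redis cart cache, memory only. Figures are the ones quoted in the demo;
# the units are assumed.
resources:
  requests:
    memory: 575Mi   # recommended, up from roughly 537Mi today
  limits:
    memory: 6Gi     # generous limit kept in place for an in-memory cache
```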

Now, I can also look at these metrics inside Datadog, which we can also use as a source for the metrics, and we'll see some of the same results there. I'll hover over the last recommendation that was made; here I've got a current usage of about 39.7 millicores. You might see some discrepancies between the Grafana dashboard I was showing and the Datadog dashboard.

That's because they use different resolutions for the metrics, so please don't be distracted by that; it's expected behavior. But what we're saying here is that instead of the current request of 500 millicores, we can safely bring the ad service's manifest down to 63. That is significant, and it would really help from a node density perspective.
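To put the node density point in rough numbers: assuming, purely for illustration, a node with 4 allocatable vCPUs (4,000 millicores), the scheduler can place at most 8 pods requesting 500 millicores each, but around 63 pods requesting 63 millicores each, ignoring other resources and system overhead. Lower requests mean more pods per node, which means fewer nodes for the same workload.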

And the way I can do that is simply to hit the "Approve Recommendation" button. This will apply all of the recommendations shown here, the same values I observed in my dashboards, and restart those pods. Now, in my example, I have only a single replica running, so this would cause a brief outage, but in your environment you're going to have more than one, and the rollout would follow your deployments' update strategy so it happens without downtime for your users.
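What makes that restart safe in a multi-replica environment is standard Kubernetes rolling-update behavior, not anything specific to this demo. A minimal sketch, assuming an illustrative deployment name, labels, and image:

```yaml
# Standard Kubernetes settings that let a resource change roll out without
# downtime. Deployment name, labels, and image are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: adservice
  namespace: demo
spec:
  replicas: 2                  # more than one replica, unlike this demo
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0        # keep the full replica count serving throughout
      maxSurge: 1              # bring up one updated pod at a time
  selector:
    matchLabels:
      app: adservice
  template:
    metadata:
      labels:
        app: adservice
    spec:
      containers:
        - name: adservice
          image: adservice:latest    # placeholder image
```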

These are some of the key things we think are important as you consider using machine learning for optimizing your Kubernetes environment and what makes StormForge unique. Most organizations are already generating massive amounts of data with their observability solutions, but there’s often a gap between visibility and actionable insights. We help you close that gap and turn observability into actionability.

Lastly, it's important to think about how much time and effort are needed to start seeing results. If there's a lot of configuration involving editing YAML files, or if the process is mostly manual, you have to weigh the cost against the benefit. With StormForge, we've tried to remove as much of that manual work as possible while still leaving you in full control.

It’s easy for you to try StormForge for yourself. Sign up to test and optimize your own app for free at stormforge.io and see what automatic Kubernetes resource efficiency can do for you.