On-Demand

Beyond Cloud Cost Management

Air Date: July 29, 2021

Rich Bentley: Alright, well, I think we will go ahead and get started. Welcome everyone. Thank you for joining today’s webinar which is called, “Beyond Cloud Cost Management”. We’re going to be talking about optimizing Kubernetes for cost, performance, and productivity.

Before we get started, just a couple of quick housekeeping things. If you have questions, we would love to answer them, and we’ll get to as many as we can at the end of today’s webinar, so just enter those questions using the Q&A button within the Zoom interface. So let’s go ahead and dive in.

A colleague of mine said something along these lines a little while back and it’s always stuck with me because I thought it was a really good comment: asking engineers to define resource requests and limits for their Kubernetes apps is a little like asking your kids to take only what they need at a candy store, right? If you’re an engineer and you want your application to perform well, minimizing resources or minimizing costs isn’t something you’re going to be too worried about. Your primary motivation is making sure that your app works well and that it delivers business value, and that really gets to the crux of the challenge that we’re talking about today.

The way that we make decisions about compute resources has changed and we’ll talk about that. But really developers and engineers are now in charge of those decisions and, while cost is a really important aspect, they’re not always empowered to understand what the costs of running their application are. They don’t always know what the ideal configuration is going to be to minimize those costs, and it’s really hard to figure out what that is.

And you need to also think about more than just cost, right? You need to think about business value, the customer experience, the performance, the reliability of your application, but also the time and effort that you need to put into configuring your application and making those decisions.

This webinar is going to focus on how to empower and enable engineers to make smart resource decisions, decisions that minimize the cost of running the application, while ensuring that business goals are met, without spending a huge amount of time and effort.

So, just a quick introduction. I’m Rich Bentley. I lead up product marketing at StormForge. I’ve been in the industry for a long time, starting out as a software engineer, but more recently in marketing roles, and I’m based out of Michigan.

And I’m joined today by Erwin Daria, who is a Principal Sales Engineer for StormForge, and he has also been in the industry for quite a while, both on the vendor side of things, and from an IT side of things. So he’s been in roles very similar to many of you on the webinar today and will be able to bring a really good perspective from that point of view. Erwin is out in California, so, Erwin, welcome to the webinar.

Alright, so I wanted to spend a little bit of time talking about how resource decisions have evolved over time. We’ve gone from traditional on-premise environments with data centers and servers, into more of a cloud, infrastructure-as-a-service type environment, and then all the way to containers and microservices, where we’re talking about running applications on Kubernetes with much smaller footprints.

When we think about the way that resource decisions get made, resource decisions really equate to cost, right? How many resources we’re using determines how much it’s costing us to run that application.

So first off, think about the scope of those decisions. Back in the days of data centers, the decisions that we were making were about: do we need a new data center? Do we need new servers? Do we need new storage? They were hardware decisions, right?

When we moved to cloud and infrastructure as a service, we were still making similar decisions, but they were more about cloud-based servers or hosts and cloud-based storage. But as we move to microservice and containerized environments, we’re making decisions that are much more granular. They’re about the actual resources that are going to be consumed or needed by our containers or our applications: CPU, memory, replicas, things like that.

When we talk about who owns those decisions: in traditional environments, everything had to go through a centralized procurement team, right? Those were the folks that had the authority to approve those major capital expenditures in most cases. When we moved to the cloud, it probably moved to more of a workgroup level. What do we need to run our application for our team?

But now as we’ve moved into a containerized world, those decisions are completely decentralized. So every engineer who is deploying and owning the software that they’re building needs to make those decisions on what resources do I need to run this application.

And then, in terms of the process, the process used to be much more formal, much more structured, right? We had capacity management processes in place for looking at our long-term capacity needs over time, and that drove a lot of the decisions around what we needed to purchase in a data center environment.

As we moved to cloud, it became a less structured, less formal process, and then again in the world of containers and microservices, that process is very agile, right? It’s very lean, it’s very quick. We’ve got to make decisions on the fly that change the resources we need for our applications.

And then, lastly, there’s the horizon over which our decisions have an impact, right? When we’re talking about data centers, we’re talking about years. If we’re buying a server, that server is amortized over three years or five years or whatever it might be.

When we move to cloud, it’s much shorter than that, right? We can obviously turn on and off an instance in the cloud very quickly.

But generally, when we’re deploying something, we don’t plan to turn it off really quickly. Then as we move to containers and microservices, we’re talking about minutes and seconds, right? We’re making decisions about resources that could change on the fly at any moment as our needs and our traffic increase and decrease. So it’s a big change in the way that we make these decisions, and that’s really the crux of the challenge: trying to effectively manage and control costs while ensuring your applications are delivering business value and performing effectively.

So one of the challenges that we have in this new world is that engineers have very different goals from the finance team. If I’m an engineer, I really want to work on something that’s meaningful, that I enjoy working on. I want to deliver software fast and reliably.

I want to use resources efficiently, right. I don’t want to waste, but at the same time, I don’t want to worry about cost, right? I don’t want to have to think about that.

I just want to stay up to speed on the latest technology, and I really want to focus on innovating and delivering capabilities that are going to differentiate my company and provide business value.

On the finance side of things, it’s really more about accurately forecasting and predicting spend, charging back and allocating costs to where they belong, controlling and reducing costs, and being aware of budget risks.

So having these two very different motivations can be a real challenge when we bring those worlds together and get to the point where we really have to think about the cost of running the application, especially when it’s the engineering team that is making the decisions that impact that cost so directly.

One of the reasons this is so challenging is the complexity of managing resources within Kubernetes. Think about the decisions that you have to make as an engineer when you’re deploying your application: your application may consist of a number of different services running in containers, and for each of those containers you have to set your CPU requests and limits, memory requests and limits, and replicas.

And then there’s also application specific settings that you have to configure as well. So heap size, garbage collection settings. All of those things can impact the cost of running the application, they can impact the performance of the application, they can impact the reliability of the application.

And they’re all interrelated as well, so if you change one thing, it may impact another thing, and if you think about that, for a typical application, there may be 15, 20, 30 different settings that you have to configure when you deploy that application.

So as a human being, the number of combinations, the number of possibilities, is essentially infinite, and it’s really difficult to handle without spending a huge amount of time and effort trying things out, seeing how they work, going back and making changes, trying again… it’s really just not feasible to do that. From a technology perspective, that’s one of the big drivers of the challenge that we’re talking about today.
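To make the combinatorics concrete, here is a rough sketch in Python. The parameter names and value counts below are hypothetical, invented purely for illustration, but they show how even a modest set of per-container tunables multiplies into a search space far beyond manual trial and error.

```python
import math

# Hypothetical tunables for one container, with a rough count of the
# discrete values an engineer might plausibly try for each (invented).
tunables_per_container = {
    "cpu_request":    20,   # e.g. 100m to 2000m in 100m steps
    "cpu_limit":      20,
    "memory_request": 16,   # e.g. 128Mi to 2Gi in 128Mi steps
    "memory_limit":   16,
    "replicas":       10,   # 1 to 10
}

containers = 3  # a small multi-service app
combos_per_container = math.prod(tunables_per_container.values())
total_combinations = combos_per_container ** containers

print(f"Combinations per container: {combos_per_container:,}")
print(f"Total for {containers} containers: {total_combinations:.2e}")
```

Even this small, made-up example yields on the order of 10^18 total combinations across three containers, which is why exhaustively trying configurations by hand is not feasible.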

And there are a lot of tools that you’re probably using today to help address some of those challenges, and each of these tools has their place. Each tool has value in different ways, but a lot of these tools are very focused on one aspect of the problem or another. 

So performance and load testing tools are a great way to see how your application will perform under load before you deploy it. The problem is, when you come back and find out that there’s an issue, the question is: what do you do about it then? How do you address that issue before you deploy? It’s very much focused on performance, right, not cost, not the productivity of your engineering team or anything like that.

We’ve got auto-scaling tools that are part of Kubernetes that are helping to dynamically scale the number of pods that you’re running in production, for example.

And that helps with right-sizing your application, but it takes configuration to deploy and optimize it. There are settings you have to configure when you’re using the horizontal pod autoscaler, and if you’re not setting those correctly, you’re not running as efficiently as you could be.
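For context on what “setting those correctly” means: the Kubernetes horizontal pod autoscaler documents its core scaling rule as desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue), clamped between the configured minReplicas and maxReplicas. A small illustrative sketch, with invented utilization numbers:

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas, max_replicas):
    """Kubernetes HPA rule: ceil(current * currentMetric / targetMetric),
    clamped to the configured minReplicas / maxReplicas bounds."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))

# 4 pods averaging 90% CPU against a 60% target scale out to 6 pods...
print(desired_replicas(4, 90, 60, min_replicas=2, max_replicas=10))
# ...but a maxReplicas of 5 caps it, leaving the app under-provisioned.
print(desired_replicas(4, 90, 60, min_replicas=2, max_replicas=5))
```

Mis-set bounds or targets mean the autoscaler either wastes resources or starves the application, which is the inefficiency being described here.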

You’ve got monitoring and observability tools. Most organizations have a lot of these tools. They’re great for identifying problems and helping to troubleshoot them, but again, they don’t necessarily tell you what to do about a problem once you find it. They’re also very focused on performance, not on cost. And they’re reactive by nature, because they’re running in production: by the time you find a problem, it has already impacted the user experience. So it’s reactive just by definition, because you’re monitoring what’s happening in production.

And then finally you’ve got cloud cost management tools, right? These are tools that help you allocate and track the costs of running in the cloud, but they’re focused on cost only. They don’t take into account business value. They don’t take into account performance. They’re really about getting visibility into costs, but looking at that in a silo.

So there’s a lot of tools out there, and when you think about deploying an application on Kubernetes, there’s a few different things that you have to think about.

If you’ve ever worked in the project management world, there is the golden triangle, sometimes called the iron triangle, of project management.

It’s sometimes referred to as Cost, Scope, and Time, or Good, Fast, and Cheap: you can pick two, but you can’t have all three. Very similarly for Kubernetes, there are different aspects that you have to think about.

A change in any one of these aspects has an impact on at least one of the other aspects. So if I want to run my application at a lower cost, I either need to sacrifice performance, or I need to put a lot more time and effort into configuring my app and figuring out the best way to lower my cost. If I want my app to perform better, that’s going to either cost me more, or again it’s going to take me more time and effort. And if I want to spend less time and effort, then it’s either going to cost me more to run my app, or it’s going to perform worse.

So you can almost think about the area within the triangle as the total cost of running your application on Kubernetes because it’s all those things you need to think about. It’s not just the cloud costs itself, it’s also the time and effort you have to put into it. And it’s also the level of performance and reliability that you can expect from that application.

We’re going to revisit this triangle a little bit later and talk about sort of how you can change the equation a little bit and how you can reshape that, but for now I think this is a really important thing to think about when we talk about going beyond just thinking about costs.

It’s really these things that all sort of work together and are all interrelated.

So there’s an organization, many of you have probably heard of it called the FinOps Foundation, which is a relatively new organization. It’s part of the Linux Foundation alongside CNCF, but really the mission of the FinOps Foundation is all around advancing the discipline of cloud financial management through best practices, education, and standards. They’ve published this FinOps lifecycle, which I think is a really good way to think about it as you’re looking to tackle this problem.

And there are three phases to that lifecycle. Inform is the first step: it’s all about getting visibility into where your cloud costs are going and allocating them. Optimize is the second step: how do you actually make sure that your apps are running at peak efficiency, with the minimum amount of time and effort going into that?

And then Operate, right. As you’re actually operating, running your applications, how do you continuously improve? How do you make sure that you’re doing that effectively and also efficiently?

So diving in a little bit on each of these areas. So Inform, again, this is about visibility and allocation of costs and that’s really the first step. Get an understanding of where your cloud spend is going and who that should be allocated to.

Again, keeping with the theme of going beyond just costs, you also have to think about performance and business value.

So you want to give visibility to the people that are making resource decisions on a day to day basis, so your engineering team.

You want to make sure that they have a near-real-time understanding of the cost of running their app, the performance their application is delivering, and the business value their application is providing, and give them all that visibility to drive better behavior. You also want visibility into how efficiently your application is running: how well is it actually utilizing the resources that you’ve allocated, or have you over-provisioned? Is there cloud waste? Having that sort of visibility, that level of information, is really key, because people, engineers, want to make the right decision.

But if they’re not empowered with the information to make the right decision, then they’re not going to. They just don’t have the information. They don’t want to spend the time on having to come up with that information. So it’s really all about empowering developers to make the right decisions.

And then when we think about optimizing, part of this is setting goals: here’s where we are today, now where do we want to be in terms of efficiency and the utilization of resources? Really, the goal is that we want to meet our service level objectives at minimal cost and effort, right? We want to make sure the application is performing to expectations, but we don’t want to spend any more time or money than we need to in order to make that happen.

And again, we want to empower engineering teams to make smart resource decisions at the container level. So as an engineer configuring a Kubernetes app: what is the best configuration that I can deliver in order to run my application effectively, with the right performance at the lowest cost? And we want them to really understand the trade-offs that have to happen to make sure they’re driving business value.

And then the last phase is about operating, right? This is all about continuous improvement: how do we make sure that, for what we’re doing in production, we have a process in place, we have a methodology in place, and we’re improving that over time? A big part of that is automating wherever possible. If we can build optimization and visibility into our CI/CD workflow, it just becomes a regular part of our work. It becomes part of our standard practice, part of our culture.

And again, it gives us the ability to be more proactive in how we ensure that applications are going to perform the way that we expect them to.

There are also things that we need to think about in the Operate phase, like efficient auto-scaling: things like the horizontal pod autoscaler, making sure those are running at peak efficiency and giving us the right level of performance at the right cost.

So the goal really is to create an environment where our infrastructure is developer-ready. I really love this phrase, developer-ready infrastructure, because the goal, whether you’re an SRE, part of the cloud ops team, or part of a platform team, is to provide a platform so that developers can really focus on innovating and not have to worry about the nuts and bolts of the platform. We don’t want our engineers spending a lot of their time tweaking, tuning, and troubleshooting at the Kubernetes level, right? We want them focused on delivering new capabilities that are going to provide business value. And we want cost consciousness and efficiency to just be part of the culture of the organization; we want everybody to think in those terms.

We want teams to have real time visibility into, again, not just cost, but also performance, and efficiency, and utilization.

And I’ll repeat it again because it’s so important, we want our engineers to be empowered to make smart resource decisions and to do that without spending hours or days or weeks on trying to figure out what the smartest resource decision is.

We want those decisions driven by business value, what’s actually going to provide the most value to the business, what’s going to provide the best user experience.

And to get there, we need teams to be working from common data and sharing a common language.

So thinking back to the slide earlier with the engineering team and the finance team, everybody has to be working off of the same playbook, the same data, looking at the same information in order to be able to communicate effectively and really have a productive dialogue. 

Last but not least, we want everyone on our engineering team to take ownership of their cloud resource usage. The mantra of DevOps is: you build it, you run it.

Part of that includes owning the performance of the application, owning the cost of the application, all of those things. We want to give engineers the tools, the knowledge, the information that they need to really fully own their application throughout the lifecycle.

So a couple of quotes that I really like here. There’s a book called “Cloud FinOps” by J.R. Storment and Mike Fuller, which, if you haven’t read it, I would highly recommend. It goes through this lifecycle that I talked through.

One of the quotes that I really love is, “If it seems that FinOps is about saving money, then think again. It’s really about making money.” It’s about delivering business value and minimizing the cost of doing that.

On a related note, Andreessen Horowitz is a venture capital firm focused on the technology space, and there’s a really good quote in an article they wrote a little while ago: a billion-dollar private software company told them that their public cloud spending amounted to 81% of their cost of revenue.

And that’s not an uncommon number for organizations in the tech space, and probably even beyond it, because every company these days is a software company in a lot of ways. So 75 to 80% of your cost of revenue going to cloud spend: that’s a huge expense, a huge chunk of your revenue. Then, maybe even more importantly, there’s the second part of the quote: for every dollar of gross profit that you can save, the value of your company, the market cap of your company, will rise on average 24 to 25 times the cost savings. So if you’re able to save $1 on your cloud spend, that increases the value of your company by roughly $25. If you’re able to save a million dollars on your cloud spend, that’s increasing the value of your company by $25 million. So there’s a huge impact in being able to address this challenge and focusing on it not just from a cost perspective, but also from a business value perspective.
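The arithmetic behind that claim is simple to sketch. The 25x multiple below is the figure from the a16z article quoted above; the function name is ours, for illustration only:

```python
# The a16z figure quoted above: each dollar of gross profit saved raises
# market cap roughly 24-25x the savings. We use 25 as the estimate here.
MULTIPLE = 25

def market_cap_impact(cloud_savings, multiple=MULTIPLE):
    """Rough estimated market-cap increase from a cloud cost saving."""
    return cloud_savings * multiple

print(market_cap_impact(1))          # $1 saved
print(market_cap_impact(1_000_000))  # $1M saved
```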

So I want to jump back to the golden triangle here, and this is the last thing I’ll cover before I turn it over to Erwin to dive into how we help address the challenge here with StormForge.

So if you want to reshape the golden triangle, if you want to change the equation: remember, I said before that if you change one point on this triangle, you have to change at least one other point. Well, the good thing is that we now have technologies like artificial intelligence, machine learning, and automation that can help address that problem.

So what we can do now with AI is we can find better configurations for our application.

And we can do that to save money on running our application without affecting its performance, and without spending more time and effort trying to find that configuration, right? AI can do it for us.

Same thing on the performance side: we can find ways to improve the performance of your application without it costing more and without spending more time and effort. And lastly, with automation, we can save a huge amount of time without impacting the cost of running the application or its performance and reliability.

So, if you remember what I said before about thinking of the area inside that triangle as the total cost of running or owning our application: what we’ve just done is eliminate a huge amount of that cost, by changing the equation and looking at the iron triangle in a different way, using technologies like AI and automation.

So with that we’re going to dive into how we do that with StormForge. I’m going to let Erwin take it away and talk through that.

Erwin Daria: Thanks, Rich. So let’s talk about StormForge, the platform that we’re bringing to market, StormForge Optimize, and how it can really align with all the various challenges Rich spent the last few minutes describing. 

So first and foremost, again, calling back to the early FinOps slides, our platform is really built and focused on creating a way for us to experiment across various configurations in Kubernetes for the applications that we’re trying to optimize, and then provide that intelligent business decision making, right? So we’re working within that FinOps framework. Maybe your company is in the midst of it, maybe it’s at the very beginning, maybe you’re pretty mature on that journey.

What we really rely on in terms of making that whole cycle work is ensuring that every stakeholder in that process understands what the impact is of all those various potential decisions. 

We’re using machine learning to provide that visibility, that knowledge around what those potential impacts might be. The other thing that we can do, again related to the visuals Rich just showed, is use machine learning to look at competing, or more specifically, multi-dimensional optimizations. So, like we talked about: somebody from finance says, hey, we need to cut cloud costs for a given application for a given BU, while the developers, engineers, SREs, and platform teams are incentivized to make sure things are running and performant. Those two things can be competing goals. Using our machine learning in this multi-objective way allows us to understand where those trade-offs might be, so we can inform our various stakeholders about what those configurations look like and what those impacts might be.
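As an aside on what a multi-objective result looks like in practice: with two competing goals, say cost to minimize and throughput to maximize, there is usually no single best configuration, only a set of Pareto-optimal trade-offs where improving one objective must hurt the other. A minimal generic sketch with invented trial data (this illustrates the concept, not StormForge’s actual algorithm):

```python
# Each trial result: (cost_per_hour_usd, throughput_rps). Invented data.
trials = [
    (10.0, 500), (8.0, 480), (12.0, 620), (8.0, 300),
    (15.0, 640), (9.0, 550), (11.0, 540),
]

def pareto_front(points):
    """Return the non-dominated points: q dominates p if q costs no more
    AND delivers at least as much throughput (and q is a different point)."""
    front = [
        p for p in points
        if not any(q != p and q[0] <= p[0] and q[1] >= p[1] for q in points)
    ]
    return sorted(front)

print(pareto_front(trials))
```

Each point on the resulting front is a defensible configuration; which one to pick is exactly the business decision the stakeholders need visibility into.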

And then, again, we want to empower developers to make smart resource decisions. If we use our experimentation, our machine learning, and our various trial execution methods to find those answers, then the developers don’t have to worry about interfacing with other stakeholders. They have the ability to use our platform and find those right answers.

Second, we really do focus on proactive resource optimization. This is not about installing some tools and some instrumentation in production. Rich showed a slide earlier with a whole array of tools that measure things in production and do cloud cost optimization in production; we want to shift all of that burden left. Everyone is relatively familiar with the shift-left nomenclature: we’ve talked about it in security, we’ve talked about it in other areas like performance testing.

We want to be able to understand the impact of various configuration changes before we push those changes to production, so that we’re not causing things like escalations and poor user experiences. So our tool can be used, and we suggest using it, in pre-prod. I’ll show you how that looks in contrast to a production-focused implementation in the next couple of slides, but we want to shift all that optimization to the left, so that when we push a given configuration to production, we feel comfortable that we’ve met all of our SLAs and SLOs, we’ve met our business objectives when it comes to cost, and all the stakeholders within that FinOps framework are aware of what’s going on in production.

Then, lastly, we are purpose-built for Kubernetes. I’ll talk about the architecture in the next couple of slides, but we are deeply integrated into Kubernetes.

All of the various tools, features, and functions are really focused on that, whereas other optimization tools can be a little more broadly targeted; they also don’t have the ability to optimize to the level that we can in the world of Kubernetes.

So this graphic here is highly simplified, but it’s meant to demonstrate what the current paradigm looks like. Generally, starting on the left, you’ve got an application deployed in production. We’ve got a series of observability tools, maybe an APM, maybe a custom-built dashboard, and what we’re doing is trying to figure out whether or not the application is meeting whatever SLIs or SLOs we’ve defined for it, right?

And we’re trying to estimate whether or not the user experience for a given application is meeting expectations.

And, in some cases, you know we might be instrumenting it for cost as well.

When there’s something that’s maybe not well aligned with our goals, it’s up to a team of people to identify what those impacts might be, what those risks might be, and then make a series of changes to the configuration files. In the case of Kubernetes, these could be straight-up manifests, Helm charts, or ConfigMaps.

And the various changes or settings for each one of those things. Then in the next cycle, we may push a new version of the manifest to production and then that cycle continues.

A lot of what I’ve heard from Rich I’m going to echo here. This is really a reactive approach: by virtue of measuring what’s happening in production, you cannot make any changes until you see how they impact your end users.

And in that case, again, we like to emphasize that a lot of these things can be found, using the StormForge platform, prior to production.

And then, instead of trial and error, where you try something and the errors impact your end users before you can figure out what the next changes are, we want to use machine learning and rapid experimentation to find those answers before we push to prod.

Now, the StormForge platform, again highly simplified, looks like this. We’ve got the StormForge Optimize controller deployed in a pre-prod environment. This could be QA, this could be staging; whatever it is, it’s generally before the production environment. StormForge will then, as part of a trial cycle, create load against your application. This load test can be our load test; we provide one free with the Optimize solution.

We can also integrate your own performance tests, so if you’ve got performance tests already, maybe a QA team that already does things with JMeter or Locust or BlazeMeter or something else, we can use those tests as well.

And then what we do is generate load against the application and configure StormForge Optimize to optimize against a certain number of metrics. These metrics would be the SLOs or SLIs that you define, either in an APM or a performance test: maybe something like P95 latency for given transactions, or things like CPU and memory utilization for a given application.
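As a concrete example of one such metric, P95 latency is simply the value below which 95% of observed request latencies fall. A minimal nearest-rank sketch, with invented sample data:

```python
import math

# Latency samples in milliseconds from a hypothetical load test.
samples = sorted([120, 95, 110, 300, 105, 98, 250, 115, 102, 130,
                  99, 101, 108, 140, 160, 97, 103, 125, 118, 400])

def percentile(sorted_samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct% of all samples are less than or equal to it."""
    rank = math.ceil(pct / 100 * len(sorted_samples))
    return sorted_samples[rank - 1]

print(f"P95 latency: {percentile(samples, 95)} ms")
print(f"P50 latency: {percentile(samples, 50)} ms")
```

Note how the tail percentile (300 ms here) is far above the median; tail latency is why SLOs are usually defined at P95 or P99 rather than the average.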

The StormForge machine learning will then iterate that cycle: kick off load, watch the metrics, change the configurations for the various parameters that we expose in the manifest, and it will do that over and over and over again until it finds the optimal configuration.
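That iterate-until-optimal cycle can be sketched generically as a loop: propose a configuration, run a trial under load, record the metric, repeat, keep the best. The sketch below substitutes random search for StormForge’s actual machine learning and uses an invented toy metric, purely to show the shape of the loop:

```python
import random

# Hypothetical parameter bounds, like those an engineer might expose
# via a Kubernetes manifest (names and ranges invented for illustration).
BOUNDS = {"cpu_millicores": (100, 2000), "memory_mib": (128, 2048)}

def run_trial(config):
    """Stand-in for 'apply config, generate load, measure the metric'.
    Here: a toy score rewarding throughput per unit of resource cost."""
    cost = config["cpu_millicores"] * 0.001 + config["memory_mib"] * 0.0005
    throughput = min(config["cpu_millicores"], config["memory_mib"]) ** 0.5
    return throughput / cost

def optimize(trials=50, seed=42):
    """Repeat the trial cycle, keeping the best configuration seen."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(trials):
        # A real system's ML proposes the next config; we sample randomly.
        config = {k: rng.randint(lo, hi) for k, (lo, hi) in BOUNDS.items()}
        score = run_trial(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

config, score = optimize()
print(config, round(score, 2))
```

The value of replacing random search with machine learning is sample efficiency: fewer trials, and therefore fewer load-test runs, to reach a good configuration.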

Once you find an optimal configuration in pre-prod, you can pore over the results in our GUI, find the right configuration, and export it to the target. That target can be those ConfigMaps or Helm charts, or just the text manifest if you want, going to a version control system; or if you’re using a continuous delivery platform, we can integrate with that as well.

And then you can use it as a gating factor for ensuring that anything that gets pushed to prod has gone through this level of optimization.

This is, again, focused on being proactive. This is about using AI and machine learning, as opposed to trial and error where human beings are constantly waiting for results in an observability platform and making changes. And this is rapid, meaning the time to value, the time to results for the optimization, is not something like weeks or months; it’s something like hours. The AI can rapidly test more configurations than human beings could ever do, in a very short amount of time.

So with that let’s go into a demo and we’ll take a look at what the results can look like after an experiment and I’ll walk you through all that.

Alright, what you see on the screen right now is the StormForge Optimize user interface on the web. So: you sign up with us, you download the controller, you install that controller in your pre-prod environment, which is very simple, and then you go through the process of designing an experiment. That experiment is going to ask you for things like: what is your performance test, what namespace do you want to optimize, what parameters do you want to optimize? We can ingest things from a ConfigMap, from a Helm chart, from any manifest. So if there are certain knobs and buttons that you’ve exposed via Kubernetes for tuning, these can be things like Rich mentioned: CPU and memory reservations and limits, replicas, HPA configurations, or application-specific parameters like JVM parameters, heap size, garbage collectors, a whole host of different things. If it can be exposed through a Kubernetes manifest, it’s something that our machine learning can tune for.

When you click on experiments, you’ll see these are a series of experiments that we’ve done towards optimizing different types of applications.

I like to start with the Voting Web App. It's something that anybody who uses Docker containers quickly understands. Let's walk through this to understand the power of machine learning and what rapid experimentation in pre-prod can do.

Especially given that we're really talking about not only optimizing cloud costs, but optimizing all of the various costs around the operations of our applications, right. So it's not just about the number of dollars that your organization has to pay the cloud providers, it's also about understanding what the costs are for getting a configuration wrong, having an escalation, doing a root cause analysis, having one of your SREs or one of your developers woken up in the middle of the night because something broke... What are the costs associated with that, and how do we use something like StormForge to mitigate those types of risks?

So here on the screen, I'm going to refer to things as gray boxes. I know that our entire UI is gray boxes, but if you look at this one with the dots on it, we call that the parameter space. This is essentially a representation of the space in which the AI is able to change configurations and meet specific goals. On the X axis, we've defined the experiment to try to find the optimal cost for an application, and then on the Y axis we're trying to maximize throughput.

And so, this blue dot here, let me step back here, all the dots that you see on the screen represent the results of a trial that was undertaken by our Optimize controller and submitted to the machine learning for recording. I'll start with this blue dot. This blue dot is our baseline. So if I scroll down here, you can see that the knobs and buttons that we've given to the machine learning, the parameters that we've exposed via the manifest, are things like the CPU and memory for the given layers in this multi-microservice application. Things like replicas, things like CPU, things like memory, these are all the knobs and buttons we've exposed via the manifest, and these are the baseline settings. So we've set a range of 100 to 2,000, you can see these things on screen.

And without doing any kind of tuning, these are the results, right. So we've got an application that can hit this relative throughput at this cost. The cost is something that we calculate as part of the experiment, but if you've got an APM that has a different type of instrumentation, that calculates cost a different way, we can query that APM to find a more accurate cost number.

Now, again I mentioned every one of these dots is the result of a specific trial as part of this experiment. So the experiment in and of itself is kind of a big umbrella term that we use for an optimization. Each trial is a series of settings for these given parameters and then the measured results for that given trial.

You'll notice that some of these dots are a different color, right. So some of them are teal, some of them are a peach color, and one of them is orange.

And then they all adhere to this curve, right. This curve is called a Pareto Front, and mathematically this is where machine learning has found the optimal configuration in any one of the two dimensions, right.

The most throughput or the least cost, right. The dark orange dot is the exact middle of that curve. This is where we found the highest throughput at the lowest cost. Any configuration other than this one is going to either give up throughput or give up cost.
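The Pareto-front idea Erwin describes, the set of trials where you cannot improve one objective without giving up the other, can be sketched in a few lines. This is a minimal illustration, not StormForge's actual algorithm, and the trial data and field names are made up for the example:

```python
# Minimal sketch of a two-objective Pareto front (cost vs. throughput).
# Trial data and field names are illustrative, not from the product.

def pareto_front(trials):
    """Return the trials not dominated by any other trial.

    One trial dominates another if it is at least as cheap AND at
    least as fast, and strictly better on one of the two objectives.
    """
    front = []
    for t in trials:
        dominated = any(
            o["cost"] <= t["cost"]
            and o["throughput"] >= t["throughput"]
            and (o["cost"] < t["cost"] or o["throughput"] > t["throughput"])
            for o in trials
        )
        if not dominated:
            front.append(t)
    return front

trials = [
    {"name": "baseline", "cost": 100, "throughput": 900},
    {"name": "trial-12", "cost": 40, "throughput": 800},  # cheapest
    {"name": "trial-63", "cost": 55, "throughput": 950},  # balanced
    {"name": "trial-70", "cost": 90, "throughput": 700},  # dominated
]
print([t["name"] for t in pareto_front(trials)])
```

Here the baseline and trial-70 fall off the front because trial-63 beats them on both axes at once, which is exactly why every remaining dot on the curve represents a real trade-off.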

Now, Rich mentioned the notion within FinOps about informing the various business stakeholders around what changes to make, what optimizations to make, and I think this is really important to illustrate here.

One reason that we do things in pre-prod is so that we can understand what the impact of the changes are given our priorities before we impact the end users.

And a tool like this is really important, because it allows either the platform team or developers to really explicitly communicate to, if it’s finance or the business owner, what the impact is of the various priorities that we might impose on the developer team.

So in this particular case, I've clicked on what our machine learning has honed in on as the optimal configuration. Again, this is the most throughput at the least cost, without giving up significantly on either of those two axes. You can see that the machine learning found that DB CPU for the Postgres layer in this voting app doesn't matter much. It's not that important, and it does not increase throughput for this particular application, given the load that we've generated against it.

But you might find that you don’t want to sacrifice 12% throughput from maximum.

And maybe 72% savings on cost is really, really good, but maybe we do want to prioritize something else. We want to prioritize a higher throughput.

What does it look like if we chose this configuration? Well, in this configuration, we still save 57% on CPU and memory utilization for this application, but we don't give up anything on the throughput side, we actually increased throughput by 2%. Now this is something you can take back to your business stakeholders, your finance team, whoever is part of the decision-making process in your organization: an explicit representation of what we're going to do and what we can expect the application to do. Then simply export this config. We provide a really easy export command, to take the results of this experiment and export the settings for all these various parameters to a target of your choosing. Again, this is just exporting it to a local text file, you can replace this with wherever your configurations live, but it also tags this as trial number 63 of all these various trials. Furthermore, you can actually use this as a gating factor or quality check on subsequent releases. Let's say you decided this is your optimal configuration, you can use the results of this trial to measure subsequent optimizations against.

So maybe you don't want to do the 200 trials it's going to take to optimize the application next time, just compare it to this one trial. That's the really powerful thing to do if you implement this. We don't have this in the slides, but I'd be happy to talk about it if you decide to reach out, and we'll cover next steps in a little bit. You can use this, again, as a quality gating factor on subsequent releases, right. So it's really important, again, to Rich's point, when we talk about optimizing and continuing to operate, that we have the systems in place so that you're not spending all these cycles, and that we continue to optimize and continue to reduce time to value for these applications.

Another thing I just want to cover, because I think it sometimes doesn't get covered explicitly when we talk about Kubernetes. What you saw in the voting web app was things like CPU and memory reservations and limits, replicas, things like that.

But I think it's also important to understand that we can optimize anything that can be exposed via a Kubernetes manifest. So in this particular case, we're optimizing a Spark TeraSort process, right. We're tuning things that are not explicitly Kubernetes objects or parameters, things like driver cores, shuffle partitions, and shuffle file buffers. Again, these are things that were exposed within the manifest config maps, and our machine learning can still optimize against those things, without being limited to just the native Kubernetes object parameters. So again, a very powerful tool, a very powerful way of being able to rapidly experiment against your own load test. If you've got your own load test, we can certainly use that; if you don't, we do provide one free as part of the Optimize suite.
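As a hedged illustration of what exposing Spark tuning knobs through a config map might look like: the property names below are standard Spark configuration settings matching the parameters Erwin mentions, but the ConfigMap name, values, and the decision to surface them this way are assumptions for the example, not the demo's actual setup.

```yaml
# Hypothetical ConfigMap exposing Spark properties as tunable parameters.
# Property keys are standard Spark settings; the values are illustrative.
apiVersion: v1
kind: ConfigMap
metadata:
  name: spark-terasort-conf
data:
  spark.driver.cores: "2"              # tunable: driver CPU cores
  spark.sql.shuffle.partitions: "200"  # tunable: shuffle parallelism
  spark.shuffle.file.buffer: "32k"     # tunable: per-shuffle write buffer
```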

Again, the notion is that you can use this tool to experiment against an application in pre-prod, so that when you go to your decision makers, you're doing it with explicitly known trade-offs.

Let me stop this, and we’ll go back. I want to show you the results of what we’ve been able to do with one of our customers.

So what you've seen so far is a high-level demonstration of how powerful our machine learning is, how it's able to hone in on the right configuration, given this rapid, experiment-based approach to optimization. In the voting web app example that I showed you a few minutes ago, you saw things like 50% savings on cost and 12% improvement in total throughput. But what does it mean for real applications, for real customers?

This is one that we were able to help one of our customers with. We can’t share the logo, it’s a large travel website. They advise people on what attractions to go see given a place that you want to go visit.

And this particular application is their main, customer-facing application, and they had a team of nine working on it for, I believe, a year or more, trying to optimize it. They were following that whole manual process I showed you in the earlier slide: they instrumented their application, exposed a bunch of metrics to an APM, and had a team of developers looking at the results, making changes to the manifests or configuration files, and pushing the next round of improvements to production, over and over as a manual cycle. Nine people have been poring over the configurations for this given application over the course of a year.

 

What you'll see here is that they were really good at driving latency out of the application, but they weren't doing very much on cost. So, by implementing the StormForge Optimize platform, not only were they able to validate that they had driven as much latency out of their stack as possible, the machine learning was also able to hone in on a configuration that gave them just as low latency while reducing costs by 50%.

So this is something that our platform, the machine learning, was able to do in just a matter of, I think, a day or two.

And again, this is the power of the machine learning: being able to trigger a test very quickly, watch an APM, scrub the APM for the results of that performance test, feed that to the machine learning, have the machine learning come up with a series of parameter changes, and then have the controller implement those changes in a loop. I don't remember the number of trials that went into this particular experiment, but it's on the order of maybe 100–200.

You can just imagine how the system is continuously doing this over and over and over again, rapidly honing in on this Pareto front that you see here on the screen and finding the right configuration for maintaining that really low latency, 1.2 milliseconds here, while also ensuring that the cost is significantly improved.

This is something that would be very difficult for human beings to do. You might be able to do it, but over a much longer time horizon. So to Rich's point, a lot of times with these new, modern application platforms, the impact of the decisions we make is seen in seconds, minutes, or days, but we don't have years to fix these things, especially given the context of the Andreessen Horowitz quote.

At the very best you might have quarter-to-quarter flexibility, because we want to be able to share with our investors, our stakeholders, that we're able to save money. This is not something you have two or three years to optimize for. 50% savings is really impactful, not only for the developer teams, but for all the various stakeholders and investors tied to our companies and organizations.

So with that, I want to hand it back to Rich to tell you about next steps. If there’s anything that you’ve seen so far that you want us to follow up on, or if anybody has any questions, I hand it back to Rich.

Rich Bentley: Thanks, Erwin. Really good information – great demo. We’re a little bit over time, but I do want to talk quickly about next steps and then take a couple of the questions that came in just to make sure that we’ve covered them.

But in terms of next steps, we have an e-book that we just came out with a little while ago that you can find here that I think is probably really helpful as a next step.

But also we’re really happy to jump on a call with you if you’re interested to do a demo, kind of find out more about your specific situation and your needs and really walk through the product in more detail. You can also sign up for a free trial of the product as well if you’d like to go that route. So handy QR code on the screen here that you can use to access all this information, plus a lot more. 

So with that, Erwin, I think you touched on both of these questions a little bit, but I want to make sure that we answer them, make sure that people understand. So the first question is really around how long does it take to run one of your experiments on an application to better optimize it?

Erwin Daria: Yeah, so there are a couple of different factors that go into how long it takes for an experiment to run. One is the number of parameters. If you've got a small number of parameters, something like CPU and memory reservations and limits, that's going to take less time than something that has a lot of parameters. The other thing is the performance test. Generally you're creating a performance test that has a specific traffic shape. So if you're just looking to optimize for peak performance, that traffic shape can be very sharp.

You have a load phase, you've got peak traffic, and then you have a drop-off. That could be maybe just a couple of minutes.

And so each iterative trial within an experiment might be a couple of minutes. So, if there are only, let's say, 50 or 100 trials within an experiment, that's 50 or 100 times however long the performance test takes, give or take a few minutes here or there for the controller to set up the environment, tear it back down, and get ready between the various trials.

So in that case, it could be just a couple hours to find an optimization. 
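The back-of-the-envelope arithmetic behind that estimate can be sketched directly; the function and the per-trial overhead figure below are illustrative assumptions, not measured numbers from the product:

```python
# Rough wall-time estimate for an experiment, following the arithmetic
# described above: trials x (test duration + setup/teardown overhead).
# All numbers are illustrative assumptions.

def experiment_hours(trials, test_minutes, overhead_minutes=2):
    """Estimated total experiment duration in hours."""
    total_minutes = trials * (test_minutes + overhead_minutes)
    return total_minutes / 60

# 100 trials with a 3-minute peak-load test and ~2 minutes of
# controller setup/teardown per trial comes to roughly 8 hours.
print(round(experiment_hours(100, 3), 1))
```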

When you're talking about something that's a lot more complex, that maybe has on the order of a dozen parameters, and the test has a more dynamic traffic shape, maybe you want to do something like HPA configuration where you want it to look at a very small amount of traffic, then a very large amount of traffic, with a very steep ramp between the two, and you want to make sure that the scaling behavior for a given application is appropriate, then that could take a little bit longer. But what I think is important is, even if you hear it takes a couple hours, or it might take a couple days, this is dramatically less time than it would take a human being, or even a team of human beings, to optimize, right? Generally when we talk about human-driven optimization, this is something on the order of weeks or months.

 

Maybe even never in some cases. We showed the example of the travel website where they had a team of nine working over a year, and they were able to address one aspect, right – performance, latency. But they weren't able to derive the savings in the other dimensions.

That was a long answer, but I think it's important to understand the timescales.

Rich Bentley: Yeah, thanks for that, Erwin. For the other question, I'm going to group two questions together because they're pretty similar: when our AI detects a needed change to a configuration, like a change in the number of replicas or the amount of resources, how does it get that new configuration into production? Is it an auto-scaling type of thing? Is it a rolling upgrade? How does it go from our AI detecting it to actually getting it working in production?

Erwin Daria: Yeah, so great question. So one, I just want to cover this explicitly.

When you do your first optimization, imagine what I showed you in the demo is the first time we've ever tried to do it. What we don't want to do is impose any changes on a production application without understanding what the impact is. Sometimes there's a single stakeholder, maybe a developer team that says, I think this is amazing. But in more cases, going into the future, based on what we're seeing with the FinOps Foundation, these first-time optimizations are going to need multiple stakeholders. And in that case what we would do is go over the GUI, maybe use it for reporting — screenshots, or maybe walk through it live with all the various stakeholders in a meeting — and we'd all agree on what that optimization looks like: is it trial 63, is it trial 171, whatever it might be, right.

Our tool then gives you a pre-formatted command line that you can script against, locally, on your system. What that does is it allows you to quickly export the right configuration to the target of your choosing. The target, again, can be as basic as a file, like a manifest file; generally it will get exported to the very same target that we scanned the parameter configurations from, and with the additional context of it being, say, trial 63, as in the case of the demo.

For subsequent optimizations, and I think this is where the question is really going, if I've chosen the optimal configuration, how do I ensure that the process stays fully optimized? In that case, again, our tool is not only fully integrated into Kubernetes, it's built around CLIs and scripting.

So if you've got a pipeline with certain gating or quality functions built into the pre-CD part, then we would integrate right there. So if you've got a continuous delivery platform, or a GitOps framework where changes are submitted to a repo and then accepted by either somebody or another series of tests, we could integrate directly into that, and the whole process would be fully automated.

Rich Bentley: Awesome. Thanks, Erwin. Thanks everybody for staying on. Hopefully, this was helpful and valuable information. Again, just want to encourage everyone to go check out some of the other resources we have available, or to reach out and we’d be happy to go through a demo in person and talk more specifically about your needs. Thank you, everyone. Thank you, Erwin. That will conclude today’s webinar. Have a great day.

Erwin Daria: Thanks, everyone.