Crossing the Kubernetes Performance Chasm

How to Proactively Improve User Experience While Reducing Costs

Air Date: May 13, 2021

Rich: All right. We will go ahead and get started. A few people are still joining, but we’ll dive in here. Thank you everybody for joining today’s webinar, Crossing the Kubernetes Performance Chasm, where we’ll be talking about how to proactively improve user experience, while at the same time reducing your cloud costs. Let’s dive in, and Brad if you want to go to the next slide here… 

My name is Rich Bentley. I’m the Senior Director of Product Marketing at StormForge coming to you from Ann Arbor, Michigan and really happy to be with you today. I’m going to be joined by Brad Ascar who is the Senior Solutions Architect for StormForge. Brad has a wealth of industry experience as an architect and cloud practitioner at companies like Red Hat and Autotrader and HP as well. Brad is a Certified Kubernetes Administrator and he is coming to you from Atlanta, Georgia today. So happy to be with you and to share the content for today.

So we know that there are a lot of advantages to cloud native applications running in containers on Kubernetes, right? So it’s all about moving faster, being more agile, providing a better user experience, better scalability, and really becoming more efficient in terms of resource utilization and cost. But, if you go to the next slide Brad… 

As I’m sure everybody has seen, realizing the value of cloud native technologies is not always as easy as it might sound, right? There are challenges that are related to performance, where we know that application performance results in conversions, brand perception, customer satisfaction, and ultimately revenues to your company.  Related to that is reliability, right, where any downtime has the potential to result in lost revenues, lost customers, customer satisfaction issues, and so on. Scalability and capacity management are critical to success in the cloud, but if we’re not able to effectively scale and adapt to the changing traffic patterns and workloads, that can be a big challenge. Then last, but not least is cost efficiency, right? So for a lot of organizations, the reason you’re moving to the cloud and to cloud native architectures is to be a lot more efficient with the way that you’re delivering services. So if you’re not able to do that effectively, you’re not getting the value from your cloud migration. Let’s go to the next slide…

So one of the things that really contributes to a lot of these challenges is just the complexity of deploying and running applications on Kubernetes. So when you deploy a containerized application, there are a number of different configuration decisions you have to make, right, from memory and CPU requests and limits, to replicas, and then application-specific settings as well: JVM heap size, garbage collection if you’re running a Java app, for example. The list goes on and on. And if you multiply the number of different settings by the number of containers that are part of your application, the number of possible configuration combinations becomes essentially infinite, and the impact of making the wrong choice can be really significant. The result can be application failures, poor performance, downtime, over-provisioning, cloud costs that are a lot higher than they need to be… And all of these different settings and parameters are interrelated as well. So as a human being trying to configure and deploy your app correctly, it can be nearly impossible to know exactly what the impact of a change in your application configuration will actually be. If you go to the next slide…
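The combinatorial explosion Rich describes can be sketched in a few lines of Python. The knob counts below are illustrative assumptions, not measurements from any real application:

```python
# Hypothetical illustration: a handful of tuning knobs per container
# multiplies into an enormous configuration space.
from math import prod

# Assumed number of plausible settings per knob for one Java container:
knobs = {
    "cpu_request": 8,       # e.g. 250m .. 2000m in 250m steps
    "memory_request": 8,
    "replicas": 5,
    "jvm_heap_size": 10,
    "gc_threads": 6,
}

combos_per_container = prod(knobs.values())
containers = 10  # a modest microservices app
total = combos_per_container ** containers

print(combos_per_container)  # 19200 combinations for ONE container
print(total)                 # astronomically large; far beyond trial-and-error
```

Even with coarse steps for each knob, ten containers put the search space in the region of 10^42 combinations, which is why trial-and-error can never effectively cover it.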

So we’ve seen this at enough companies to have developed a name for it. We talk about it as the Kubernetes Application Performance Chasm. It’s really the gap between the promise of cloud native, realizing all the benefits that we talked about, and all the challenges that come into play if you’re unable to realize those benefits. So you get into this mode where you’re always reacting to performance and availability issues. You’re reacting, dealing with over-provisioning of applications and budget overruns, and that requires a lot of time and energy, which can have a serious impact on your team’s productivity. So instead of focusing on developing new innovations, you’re spending your time reacting, troubleshooting, and manually tuning and configuring applications. If you go to the next slide…

So most companies, and I’m sure all of you on the call today are familiar with these, have a number of tools and techniques that you might be using to try to address this gap, and each of these approaches can provide value, but each also has its drawbacks and limitations. So trial-and-error is a really easy way to get started. You deploy your application, you see how it’s performing, you see the resource utilization, and then you go back and tweak it and make changes and try it again. It’s an easy way to get started, but a great way to waste your engineering team’s time as well. With the complexity that I talked about, it’s really not even possible to get to a point where you’re effectively optimizing the performance of your application through a trial-and-error approach. Performance testing and load testing tools are really great for telling you how your application will perform when it’s under load, but by themselves they can’t really tell you how to address the issues that they find, right? They’ll give you information, but they won’t tell you what to do about it. And then Kubernetes has resources like the Horizontal Pod Autoscaler that can help you dynamically scale your deployment, but the HPA itself requires tuning to ensure that it’s performing and that it’s efficient in the work that it’s doing. And again, that adds another layer of complexity. And then last, but not least, you’ve got your monitoring tools. I’m sure you all have a number of monitoring tools in place, both at an infrastructure and an application level, and those tools can alert you to problems that are happening in your production environment, which is great, but by the time that you’re alerted, the problem has already impacted your users and you’re still kind of in this reaction mode, this troubleshooting and firefighting mode. So those are the challenges and what people are doing to address them today, and what we want to move into now is really how to address those.
So I’m going to turn it over to Brad to dive into that part of the presentation.

Brad: Thanks, Rich! So before I leave this slide, one of the things I’d like to point out, especially on the HPA tuning, is that by default it’s scaling on memory and CPU utilization of the pods. That might not be the way that you want to scale your application. In fact, usually it’s not the way you’re going to want to. It’s not the most efficient way, and it’s not a leading indicator. You actually want to be able to tune it for application-specific things like the backlog of a queue, how busy things are, latency, things that are going on in the application itself that are indicators that you need to scale up, and also to scale down. Both of those can be tuned, and that’s an important part of what we talk about.
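As a minimal sketch of the idea Brad describes (this is not Kubernetes HPA or StormForge code), scaling on an application-level leading indicator like queue backlog, rather than raw CPU or memory, might look like this. The target-per-replica and bounds are invented for illustration:

```python
import math

def desired_replicas(current_replicas: int, queue_depth: int,
                     target_per_replica: int = 100,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Size the deployment so each replica handles ~target_per_replica items.

    queue_depth is the application-specific leading indicator (e.g. message
    backlog), which reacts before CPU or memory pressure shows up.
    """
    needed = math.ceil(queue_depth / target_per_replica)
    # Clamp to configured bounds; covers both scale-up and scale-down.
    return max(min_replicas, min(max_replicas, needed))

print(desired_replicas(3, 850))  # 9: backlog grew, scale up
print(desired_replicas(3, 120))  # 2: backlog shrank, scale down
```

The point of the sketch is that the scaling signal comes from inside the application (the backlog), so the decision leads the load rather than lagging behind CPU utilization.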

So we’ll talk about the three steps to closing the gap. So number one: test it. Actually do performance testing in your app as a regular part of your release process. Number two: optimize it. So it’s great to test it and see what the behavior is, so once you see that the behavior is probably not what you want it to be, you need to optimize it. You need to be able to take your application and determine how it needs to behave and then once you’re done doing that, then you’re going to want to automate it so that you’re regularly doing this as part of your regular deployment process. So you test it, you optimize it, and then you automate it. 

So for each one of those steps, we need to consider the people that make that happen, the processes that make that happen, and the technology that’s necessary for it, and the changes that you possibly need in your technology. 

So step one: make performance testing a regular part of your release process. That’s the way you move to being more agile. Everybody’s starting to go to CI/CD. You might have been doing it for a long time, and that is a huge improvement over where we were before, but it only allows you to do things up to a point. So from a performance standpoint, we want to be able to move from the column on the left to what’s on the right. That’s the move to really becoming more proactive in what you’re doing. So we want to go from being siloed to actually having performance as part of each team’s work. Go from manual to automated, from reactive to proactive, from infrequent to continuous, and from a low priority, or a later priority, to just being part of it. In the same way that DevSecOps is making sure that everybody knows security is part of their job, we want performance to be part of that as well, because it has a huge impact on your application.

Performance testing cycle. So in your normal performance testing cycle you do your discovery: you determine what it is you want to do, you define what you’re testing and what the system under test is, all of the things that you do in a standard performance testing cycle. Then it comes around to your build process, your test data, your test cases, your automation, and then the execution, what it is that you execute to actually measure performance. And then we think it’s super important to do the last piece here, which is adapt. Now that you’ve tested to determine how it’s behaving, how do you adapt? How do you optimize your application to make sure that you can meet the SLAs and SLOs that you have for your application?

So when you’re testing it, these are the things to consider. On the people side, it’s developer ownership, consistent with DevOps, right? Performance is part of their regular everyday job, and QA responsibilities evolve to empower the dev team. Sometimes that means helping directly to create those tests, but ultimately it’s all about working together to make performance a central part of what you’re doing. From a process standpoint, you’re moving toward agile performance testing and matching the agility of your DevOps and development practices. Then ultimately, in the technology, performance testing is code, integrated into your CI/CD process. There’s also the ability, and we’ve got an open source project for this, to record and replay real traffic in your environment, because one of the hardest things about getting started with performance testing is writing the performance tests in the first place.

Moving on to step two: optimize it. So you optimize your applications for performance, stability, and cost efficiency. Optimizing isn’t just about savings; performance and stability are a large part of it as well.

So what is optimization? It’s choosing the inputs that produce the best outputs, right? So your applications need tuning. For Kubernetes it’s things like CPU and memory requests and limits, things like the number of replicas and how you handle autoscalers, but even more important are the application-specific parameters, like the JVM heap size, like shuffle buffers, like the things that actually are the knobs and levers of application performance. Because ultimately what you want is for the Kubernetes part to fade into the background, right? It is what, where, and how you deliver your application, but it’s not the application itself. The application itself still needs to behave in the way that you want it to, which is to increase your performance and stability, and ultimately you probably also want to reduce the cost of doing that, or at least be as efficient as you can be. If you can’t reduce the cost, you still want to be the most efficient that you can.

But that’s always a trade-off, right? So application performance, cost and resource utilization, and time and effort are all trade-offs in how you run your application. You can get unlimited performance if you don’t mind how much money you spend. This is the same way that I buy cars, right? I could go out and buy a really, really expensive sports car to get what I want, but then it’s not going to be cost effective for what I want to do. From a time and effort standpoint, I might have to get a second or third job to be able to pay for that sports car, right? So time and effort is a reality as you’re trying to balance all of these. That triangle of the three is really important.

What if you could do that in a better way? What if we actually took the time and effort part and shrunk it, and didn’t spend as much time there? So somebody else is working that second or third job for me, right? That’s really one of the ways that you can shrink the amount of investment you have to make to do that kind of optimization.

So under optimization, let’s talk about the people part. Devs have to understand the importance of the optimization, and app owners must establish and clearly communicate the business goals, right: what are the SLAs, what are the SLOs. From a process standpoint, we want optimization incorporated as part of the pre-deployment process, not looked at after the fact the way a lot of people do with their current tooling, right? APMs and performance dashboards tell you what is happening or what did happen. You want to work in a way that moves that earlier in your environment, to where you’re doing it proactively. Then ultimately, technology. One of the things we’re going to demonstrate to you today is a Rapid Experimentation Engine that combines machine learning with performance testing. That’s ultimately where you want to go: reducing that amount of time by using automation.

So getting to automate it. So automated performance testing and optimization to drive continuous optimization. 

So as we talked about with your CI/CD pipelines, we want to do a shift left, and part of that shift left is in your continuous integration toward continuous delivery, which you’re probably already doing; we want to insert this step in the middle. We think it’s super important for you to understand how your application performs and how it performs best so that you can optimize, and then ultimately, inside of your CI/CD cycle, you test to make sure that the application is still performant in the way it was. As you make changes to your application, some of them are intentional: intentional changes because you’re adding business functionality to your application. Some of them are unintentional: things that slip into code that you don’t realize have an impact on performance. Maybe it’s an underlying library. Everything’s built on other people’s libraries nowadays; there are layers and layers of libraries. You didn’t write all of those. There are changes that happen to those, there are security fixes, ultimately CVEs that get fixed. Those have an impact on your application. It could be that one of those got rolled into that release and you didn’t realize it, somebody else didn’t realize it. Maybe they just made a mistake in a configuration file. But before you even push it out to delivery, you want to know that it happened, and you want to know that it had a major impact on performance, or latency, throughput, cost, whatever it is that you’re sensitive to. You want to test for those before you push out a release. Of course you have processes like canary builds or blue-green deployments, but those still ultimately mean that some portion of your users get a bad experience. Sometimes you don’t even catch it while you’re in your canary phase, or when you’re testing in your blue-green everything looks okay.
You do a full roll out and then the performance problem happens because you actually take the load that exposes that performance problem. 

So when you’re automating it, the things you want to consider are extending the concept of “you build it, you run it”: if you build it, you own its performance. Eliminate siloed QA teams and bring them together with development. From a process standpoint, build continuous performance testing and optimization into your CI/CD pipeline as a quality gate. Make sure of how it performs, not just that it passed unit tests. And ultimately, establish a business-level SLA as testing criteria, right? So if we have an SLA, we actually check before we do a release that it can still meet that SLA. And then from a technology standpoint, automate all of that. You don’t want this to become a burdensome manual process. You want to just automate it in your technology stack.
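A performance quality gate like the one described above can be sketched in a few lines. This is a hedged illustration, not any particular CI system’s API; the metric names and thresholds are invented:

```python
def passes_sla(results: dict, sla: dict) -> bool:
    """Gate: p95 latency must stay under the SLA and error rate near zero.

    In a real pipeline, `results` would come from the latest automated
    load-test run; here both dicts are made-up examples.
    """
    return (results["p95_latency_ms"] <= sla["p95_latency_ms"]
            and results["error_rate"] <= sla["error_rate"])

sla = {"p95_latency_ms": 300, "error_rate": 0.01}

good_run = {"p95_latency_ms": 240, "error_rate": 0.002}
regressed = {"p95_latency_ms": 410, "error_rate": 0.002}  # e.g. a 25%+ slowdown

print(passes_sla(good_run, sla))   # True  -> promote the release
print(passes_sla(regressed, sla))  # False -> block the release, re-optimize
```

The gate turns the business SLA into a pass/fail check that runs on every build, so a performance regression fails the pipeline the same way a failing unit test would.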

So now I’ll talk about the StormForge platform. The StormForge platform has two very important pieces. The first is performance testing, which gives you the ability to quickly create performance tests as a service. This is a hosted service, hosted in AWS data centers, able to send performance testing load anywhere in the world, usually within 30 to 60 seconds, right? So that’s part of it. A lot of people have a lot of spin-up time in their testing process; some of the testing systems out there take tens of minutes, and sometimes hours, to fire themselves up and get going. That doesn’t work really well with automated systems. You automate that testing as code in your CI/CD workflow. You can check that code in the same way that you do the rest of your code, and then execute realistic tests with an open workload model. An open workload model is one where the system that’s doing the testing actually monitors itself so that it keeps doing what you intend from a load standpoint. A lot of current and older tools, by contrast, use a closed workload model. A closed workload model unfortunately suffers from an artifact: when it’s testing a system that slows down, that system actually slows down the system that’s testing it. That gives you a false sense of security that everything’s running okay, but in fact what’s happening is that the system under test slows down, the system that’s testing it slows down, and you don’t get real-world kinds of behaviors. In the real world, when you have a spike in traffic from your customers, they keep coming. They don’t… In fact, if your system’s running slow, they start hitting refresh. You get even more traffic because of that. That’s an important thing, and that’s what an open workload model is. And then the second piece is the application optimization. This is the machine learning part, which is the Rapid Experimentation Engine.
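The open-versus-closed workload distinction can be made concrete with a toy simulation. This is an illustrative model under stated assumptions (a fixed virtual-user pool for the closed driver), not how any real load tool is implemented:

```python
def requests_sent(model: str, duration_s: float, rate_per_s: float,
                  service_latency_s: float) -> int:
    """Toy model of load delivered during a test.

    Open: arrivals are driven by the intended rate, like real users who keep
    coming regardless of how the service behaves.
    Closed: each virtual user waits for its response before sending again, so
    a slow service throttles the test itself.
    """
    if model == "open":
        return int(duration_s * rate_per_s)
    users = int(rate_per_s * 0.1)  # assumed fixed pool of virtual users
    return int(users * duration_s / service_latency_s)

# Service degrades to 1s per request during a 60s test at 100 req/s:
print(requests_sent("open", 60, 100, 1.0))    # 6000: pressure keeps coming
print(requests_sent("closed", 60, 100, 1.0))  # 600:  the test slowed itself down
```

When the service is fast, the two models deliver similar load; the difference only appears once the system under test degrades, which is exactly when you need realistic pressure.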
A couple things that are really important because a lot of people have seen some machine learning. Number one: it’s true machine learning. We have machine learning scientists. That’s what they do for a living. They spend all day long every day making sure that our algorithms are the optimal algorithms for what we do. It requires no upfront data. A lot of solutions that are in the machine learning space want months and months worth of your data to be able to look at that data and determine what it is you need to do. From what I will show you in how we experiment, you don’t need any upfront data. Once you create the experiment and do start the testing, we immediately start giving you feedback on that. And when the experiments are done, you’ve got the information necessary to know how your application behaves. And you can automatically implement the optimal configuration based on your goals and that’s part of the design of the experiment. And that also allows you to identify high risk configurations because there’s plenty of settings that you can use in your application that actually don’t work well together. You’re changing a lot of knobs and levers in places and you need to understand what those are. Machine learning actually with the experimentation shows you those configurations that you would never want to deploy because they fail in some way or they don’t meet your SLAs. 

That’s where I jump into the demo to give you a little bit of a view of what we’re doing on the StormForge platform. So here I’m going to do a cooking show: I’m going to show you the results after running the experiment, because the running and experimenting takes time, minutes and tens of minutes, to run those experiments. So I’m going to show you the configuration and what we did for optimizing several applications. I’m going to start with one that I like to show because a lot of people are running things like Java applications in their environment. Java is very prominent in a lot of environments, and Java is very hard to tune in Kubernetes, as it turns out. It wasn’t designed originally for that environment and so it has some challenges. As I show you the interface here, you will see a bunch of dots on the screen. These green dots are the trials that we ran in the experiment. These diamonds are what we call the optimal configurations. The square is the original configuration; so for this particular experiment this was the original configuration of this application. If I click on this, it will show what the configuration was. So for this particular Java app, these are the knobs and levers of things like Java: things like parallel garbage collection threads and heap sizes and concurrent garbage collection threads, on top of how much CPU is allocated and how much memory is allocated. These are the kinds of things that make a difference in the tuning of an application. For those of you working in Java, it’s the -XX parameters that you configure for that.

Let me filter this screen down so that you will see the optimal configurations, because it’s nice to know we ran a bunch of trials, but ultimately you want the best. That ultimately leaves you with what’s called the Pareto front, given what you’re measuring for. In this case, it’s cost versus duration. This was a fixed amount of work: what are the best points taking those into balance? And then this one that’s darker colored here actually shows you what the machine learning says is the best configuration based on balancing what you’re trying to optimize for, cost versus duration. If you look at the bottom as I click on this dot, there are a few changes that happened; in fact, I believe all the items changed on this one. All of the settings changed a little bit. After doing this experimenting, we found the best configuration for what’s going on. In this case it increased the duration by about 16%, so you might be sensitive to that and you can trade off from there, but it also saves 75%. That’s a huge savings for minimal changes, but this isn’t something that is intuitively obvious to people who are doing this, because there are a lot of possible combinations. As you add more and more parameters, it becomes harder and harder for a human to reason about, because you’re talking about millions and billions of possible combinations for what you’re doing in your application.
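The Pareto front itself is a simple concept: keep every trial that no other trial beats on both objectives at once. A small sketch, with made-up trial data (lower cost and lower duration are both better):

```python
def pareto_front(trials):
    """Return the (cost, duration) trials not dominated by any other trial.

    A trial is dominated if some other trial is at least as good on both
    objectives; the survivors form the Pareto front.
    """
    front = []
    for a in trials:
        dominated = any(b[0] <= a[0] and b[1] <= a[1] and b != a
                        for b in trials)
        if not dominated:
            front.append(a)
    return sorted(front)

# Made-up (cost, duration) results from six trials:
trials = [(100, 50), (80, 60), (120, 40), (90, 55), (80, 45), (150, 70)]
print(pareto_front(trials))  # [(80, 45), (120, 40)]
```

Every point off the front, like (100, 50), is strictly worse than some front point on both measures, so the front is the only set of configurations worth choosing among.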

So I’m going to go back and give you an example of another kind of application. A lot of you are now changing your applications to microservices applications. Microservice applications become more interesting because there are more moving parts. As you break up your application into these smaller, independent microservices, the effect of each one of them in the ecosystem is more pronounced, because each one of them can have an impact on the other services and on the application that’s calling it. So in this case, this is the Docker voting web app example. For those of you that know the Docker voting cats-versus-dogs example, this is the original configuration and the amount of memory that’s configured for each one of the microservices: things like the number of replicas, how much memory is allocated to each one of the services, how much CPU is allocated for it. Again, I’m going to drill back down to the optimal configurations, and then from here I’m going to pick a point. Now, it may be that I want to choose this point that the machine learning picks, but in this case there was actually a reduction in throughput. So I’m going to pick one that’s going to give me a throughput that’s equivalent to or better than what we had. So in this case, based on the changes that we recommended, if you look at the bottom as I change these, these changes drive a behavior that reduces cost by 46 percent while also increasing the throughput, right? So nearly 50% savings, and it’s a faster application once you’ve made this configuration change. It’s great because we’ve now experimented and told you. This is done in down-level environments. This is not done in production, because you can’t make some of these changes in production. You’re making changes to the architecture of the application, in some cases in a way that’s not compatible with doing it in production.
So you’re doing it in a down-level environment to find the configuration that you need, and then ultimately you have the ability to export this as the manifests that deploy your application. Depending on which way you deploy your application, you can check those files right into your repository, your git repository or whatever you’re using for source control, and then promote it within your environment. So this is an example of getting an optimization and understanding what the optimization is.

A couple of other really important things. Number one, this is where you start doing things like building this into your CI/CD pipeline. If this is the configuration you’ve chosen as your best configuration for what you want to do in your production environment, you check this in. Then in your CI/CD process, remember where we talked about inserting those steps: as part of the gating feature you actually run this trial again, just this one, not all of them, which runs very quickly, and it tells you: is the code still performing the way it did, or did it drift? It could be, as I described earlier, that the code changed significantly, and if it changed significantly it would fail because it no longer performs in the same way. It may be 25% slower because of some change. You want to know that. Now, it could be that you actually added business functionality to your application. You’re trying to make your applications better for your customers. In fact, we’re freeing up the time that your developers spend trying to optimize the application, by doing the machine-learning-based optimization in the background, and they’re actually building new business functions, which allows you to drive to the business outcomes that you want. But they’re adding functionality, and functionality takes CPU and horsepower, and also settings changes to go with that. Hopefully that’s the case. So as you’re doing that, you need to understand: did I change something that ultimately affects performance? And if so, it’s time to re-optimize. You re-run this experiment, over minutes or hours, and then it tells you the net-new configuration. You pick the new configuration based on what your desires are, and then you do this process again.

Once you’ve done that and you’re going along, we also have tooling that helps you determine how your application actually behaves. So say this is the configuration that I want. You can start doing some what-ifs within our environment; that is, let me see how the various parts of this application affect the performance, and what I can change to make my application faster, less costly, whatever it is you’re trying to optimize for. So you can drill down and say, for all these things at the bottom, how do they affect the application? Is it the database whose configuration I need to change? If I click on the database, I very quickly find that almost everything’s clustered down here. This is not a place where you need to spend time in your application trying to figure out a better configuration; it’s not going to really move the needle on the way the application behaves. So it’s not the database CPU. Maybe it’s the replicas. Replicas are an important thing in Kubernetes. Maybe it’s the replicas of the front-end server. So in this case it’s the front-end service that handles all the web traffic. As I click on this, I get a very easy, very clear picture. Most of it’s centered here, but it behaves the way you expect the front-end service to behave. You get lower levels of performance for a single replica. As I start adding replicas, I get higher levels of performance, and to reach the highest level there’s a real jump. So I go from one to two to three replicas, right, and it scales in the way that you expect a front-end service to scale. But all replicas aren’t built the same. If you look down here at the workers, which are the back-end processes, they actually behave very differently. They actually don’t need a lot of changes.
So as you’re trying to make changes in your application, understanding the current behavior, whether that behavior needs to change, and what the things are that you need to work on helps your developers understand the behavior of the application. Ultimately, as you make these changes and come back, you then want to rerun the optimization to determine: did I actually change the application the way I expected? That is, is it a better application? In this case, I would want this to move up and to the left, less cost and more performance, but it may be that you made a change and it actually goes down and to the right. That’s really important to know too, because you may have unlocked performance in one part of the application, and another part of the application isn’t ready to handle that. Before, it wasn’t stressed, but now that you’ve made this change, another part of the application is stressed. That’s really, really common in microservice kinds of applications, where each one of the pieces plays an important part, but as you change the application and start unlocking performance gains in some areas, you’ll find other areas where there are weaknesses, and it gives you an opportunity to go after those. So I’ve shown you what the cake looks like after it’s baked. We don’t have time right now to show you how we bake the cake. That’s really where we’d love to engage with you and talk about how we do what we do, and how we do the experimenting to be able to do that.

Let me jump back to my deck here and present. So, working with our customers is a lot of fun. It’s a lot of fun for me as a solutions architect to really go in and work with companies. In this case, we worked with a very large travel website and were able to achieve a 50% cost efficiency improvement using StormForge. As you can imagine, that makes a big difference in times like COVID for companies that do things like travel, and this was a company that was highly optimized in their application. They have a team that’s focused just on optimizing the application. They knew the application inside and out, and they still had some challenges with it. In setting this up, running the POC, and ultimately getting to the answers for them, we were able to, in a matter of a few hours once the experiment ran, show them a configuration that was 50 percent more cost efficient, changing just a few parameters. The interesting thing is, in talking with the team after we showed our results, they said that wasn’t intuitive: those are changes we probably would never have made, because we know how it works. But the machine learning actually figured it out, right? The interesting thing is that our machine learning treats your application as a black box. It actually doesn’t know anything about your application. All it knows is that it’s given parameters that it’s trying to optimize. The design of the experiment says what it is we’re trying to optimize for, throughput, latency, cost, whatever we’re trying to optimize for, what minimum or maximum levels of performance or non-performance we’re able to accept, and whether they fit within the SLA and SLO, and then the machine learning optimizes the application based on that. It does it through an iterative cycle where the machine learning says, okay, try this set of settings and report back to me how it performed. So it tries that set of settings, reports back, and says here is the performance of that.
The machine learning then says, okay, I've learned a little bit about this. That cycle runs in the background, which is of course happening automatically, nobody has to be involved, and it determines how the application behaves best and what the best configuration is. Because it's machine learning, it learns how your application behaves and gets better and better over time, ultimately giving you a Pareto front of the best configurations for your environment, so that you know scientifically what the best configuration is for your environment.
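The try/measure/learn cycle and the Pareto front Brad describes can be sketched in a few lines. Everything below is illustrative: the metric names, the toy cost and latency models, and the random search standing in for StormForge's actual machine learning (which is proprietary and much smarter about picking the next settings).

```python
import random

def run_trial(settings):
    """Hypothetical stand-in: deploy the app with these settings,
    drive load against it, and report back the measured metrics."""
    cpu, mem = settings["cpu_millicores"], settings["memory_mib"]
    cost = cpu * 0.02 + mem * 0.005        # toy cost model
    latency = 2000.0 / cpu + 500.0 / mem   # toy performance model
    return {"cost": cost, "p95_latency_ms": latency}

def is_dominated(a, b):
    """True if trial b is at least as good as a on every metric and
    strictly better on at least one (lower is better for both here)."""
    return (b["cost"] <= a["cost"]
            and b["p95_latency_ms"] <= a["p95_latency_ms"]
            and (b["cost"] < a["cost"]
                 or b["p95_latency_ms"] < a["p95_latency_ms"]))

def optimize(n_trials=50, seed=0):
    rng = random.Random(seed)
    results = []
    for _ in range(n_trials):
        # The real product's ML chooses the next settings based on what it
        # has learned; random search here just illustrates the cycle.
        settings = {"cpu_millicores": rng.randint(100, 2000),
                    "memory_mib": rng.randint(128, 4096)}
        results.append((settings, run_trial(settings)))
    # Keep only the non-dominated configurations: the Pareto front.
    return [(s, m) for s, m in results
            if not any(is_dominated(m, other) for _, other in results)]
```

Each point on the returned front is a configuration where you cannot improve one metric without giving up the other, which is what lets you pick a trade-off scientifically rather than by intuition.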

So I think from here, I’ll turn it back over to Rich. 

Rich: Yeah, thanks a lot Brad. Great overview. We've got a lot of really good questions coming in. Before we get to as many of those as we can, I just wanted to let you know about a couple of other things you can do to follow up and get more information. Next Tuesday, we're having a webinar that will go a lot deeper into the product offering. As Brad said, he showed you the cake once it's been baked; if you want to see how the cake is actually baked and how it's put together, next Tuesday would be a great opportunity to do that. Brad will be presenting on that along with Noah, and it's a really good opportunity to see the product in action at a deeper level. The other thing I wanted to point out is a number of things you can do as next steps. We have an ebook that covers a lot of the same content we talked through today in a bit more detail, so if you'd like to download that ebook, you can do that. Also, we'd love to talk with you personally and show you a demo that's more customized to your own needs, so definitely feel free to reach out to us. If you scan the QR code here, you'll be sent to a page that has all these next steps and ways to further engage with us. So we definitely encourage you to follow up, and we'd love to talk with you more about the solution.

So with that, we’ve got time to take a few questions I think Brad. So we’ve got a few good ones here. A couple of folks asked can we use StormForge for on-premise environments in addition to cloud environments? 

Brad: That's actually a great question. So the beauty of what we're doing is that it can run anywhere Kubernetes runs, right? On-premises, in public clouds, in managed environments like AKS and EKS. The functionality we use inside of Kubernetes is pretty basic Kubernetes functionality, with our secret sauce on top, so it runs pretty much everywhere Kubernetes runs.

Rich: Great, thanks Brad. Another question: we talked a bit about the tuning parameters and the configuration, things like that, but maybe just to clarify or reiterate, when we're talking about the tuning parameters, what is that exactly? Are those basically arguments to the container? Can you expand a little bit on that, Brad?

Brad: Sure, yeah, they can be arguments for the container, like memory and CPU, right? The normal Kubernetes objects. They could be things you set as environment variables that get picked up by the underlying software. Really, anything that can be exposed or configured in a way that allows variability, so that you can inject changes, which is a traditional Kubernetes pattern, right? Those are things in configuration files, command-line options of the technologies you're running, anything the software exposes that can be changed. So anything that's a tunable parameter for your application.
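To make the kinds of tunables Brad lists concrete, here is a hypothetical sketch of a parameter space spanning container resources, an environment variable, and an application command-line flag. The names and schema are illustrative only, not StormForge's actual experiment format.

```python
# Illustrative parameter space: anything the app exposes can be a tunable.
parameters = [
    # Standard Kubernetes container resources
    {"name": "cpu",    "kind": "resource", "min": "250m",  "max": "2000m"},
    {"name": "memory", "kind": "resource", "min": "256Mi", "max": "4Gi"},
    # An environment variable picked up by the underlying software
    {"name": "JVM_HEAP", "kind": "env",
     "values": ["-Xmx512m", "-Xmx1g", "-Xmx2g"]},
    # A command-line option of the application itself
    {"name": "worker-threads", "kind": "arg", "min": 2, "max": 32},
]

def render_env(trial_assignment):
    """Turn one trial's chosen values into the env-var list a pod
    spec expects, so each trial can inject its own configuration."""
    return [{"name": k, "value": str(v)} for k, v in trial_assignment.items()]
```

The point is the pattern: each trial renders one assignment of values into whatever mechanism (resource spec, env var, flag, config file) the application already supports.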


Rich: Great, and then the other question is around cost. So a lot of the results graphs you showed had cost as an axis. We talked about that a little bit. Can you talk a little bit more about when we show cost in the solution, what is the unit, or what are we actually showing there when it comes to cost? 

Brad: Great question. So from a cost standpoint, as we design experiments, we use a costing factor that's part of your experiment design, because everybody runs different kinds of infrastructure with different costs. If you're in the public cloud, you've got different discounting that happens based on your account, that sort of stuff. So you can put in your own number, and we can also calculate it for common things like AWS and GCP pricing, but it's a stand-in for the cost of the compute resources in your environment: the kinds of nodes you've got and what they cost. The numbers presented are ultimately there so you can compare one configuration to another. The cost is a stand-in, but if you get your cost number in there correctly, then it's the cost of running that application, under the load we're testing, 24 hours a day for 30 days. That's what it rolls up to. So it gives you an estimation. We're not a direct costing tool, but if one configuration is $600 and another is $300, that gives you the understanding that one's 50% less expensive.
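The roll-up Brad describes is simple arithmetic: a per-resource unit price (your costing factor) extrapolated to 24 hours a day for 30 days. The dollar figures below are made up purely for illustration; the real value comes from putting your own prices in.

```python
# A minimal sketch of the cost extrapolation: hourly resource cost
# projected over a full month (24 h/day * 30 days = 720 hours).
HOURS_PER_MONTH = 24 * 30  # 720

def monthly_cost(cpu_cores, memory_gib,
                 cost_per_core_hour=0.03, cost_per_gib_hour=0.004):
    """Estimated monthly cost of one configuration. The unit prices
    are illustrative defaults, not real cloud pricing."""
    hourly = cpu_cores * cost_per_core_hour + memory_gib * cost_per_gib_hour
    return hourly * HOURS_PER_MONTH

baseline  = monthly_cost(cpu_cores=4, memory_gib=16)
optimized = monthly_cost(cpu_cores=2, memory_gib=8)
savings = 1 - optimized / baseline  # halving resources halves the estimate
```

As Brad notes, the absolute number is an estimate, but the ratio between two configurations (here, 50% cheaper) is what you use to compare them.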

Rich: Yeah, and we talked about using the solution on-premises as well as in the cloud. When you're using it on-premises and we're talking about cost, is that just basically resource utilization?

Brad: That's exactly what it is. It's resource utilization. Some organizations actually build out their own costs; they treat their private cloud as a cloud so that they understand their costing factor. Resource efficiency is actually a bigger play on-premises because, of course, you have a fixed amount of capacity. You can't just burst out more because you take more traffic. You really have to be efficient, and make sure you have enough overhead in case you do get those kinds of spikes. So it's even more critical there. The same applies to edge and IoT kinds of applications, where you've got a constrained resource and you can't go outside of it. Everything that runs in that environment needs to be, one, efficient and, two, predictable in the way it behaves. That's also where the experimenting we do comes in: you want to be able to run an experiment and ask, what happens when I take three times normal traffic? What happens when I take ten times normal traffic? It's Black Friday for me, whatever your high point is. We were talking with a customer the other day who makes IoT devices, doorbell cameras and security things, and they found out Halloween is their big day, because people open and close their doors a lot, so the cameras come on more than any other day of the year. It was interesting, but they have a certain amount of capacity and they want to make sure they can test for it.

Rich: Great, a couple of questions here before we wrap up just about using the StormForge product. So one is how much effort and how much Kubernetes knowledge is required to actually use the StormForge product for doing optimization of your applications? 

Brad: We're making it easier and easier. In fact, you need to know less and less about Kubernetes to do it. At the very easiest level, if you're doing a very rudimentary kind of tuning, it takes almost no knowledge: just the ability to get in there and run the experiments, and a cluster with the permissions you need. So it can be very simple. Of course, the more complex your scenario and the bigger the application, the more complexity there is, but you're talking about minutes to a few hours at the more complex end of things to configure the optimization. It also requires a load test, so you either have one already, or we have an easy way to do load testing as well. But generally, you're talking about minutes to hours to do that effectively.

Rich: Great, and in the webinar next week we’ll go into more detail on how to set up those optimization experiments. So that will give you a better idea of kind of what’s involved with that. Then the last question, it’s kind of a related question, but it’s really around how much time it takes to actually run the optimization experiment?

Brad: Sure, so part of that is in how you design your performance test and how long it runs, because ultimately you're measuring the application under load. You want the most representative behavior of your application in the minimal amount of time. Generally that's going to be 5, 10, 15 minutes, something like that, because we're running dozens, if not hundreds, of trials in the experiment, depending on how many parameters you have. At some point you just run into wall-clock time, right? If you're doing 100 of something that takes 15 minutes, that's 1,500 minutes. Of course, we give you the ability to run trials in parallel as well, and as part of the way we instruct you, we show you how to find the lowest trial duration that still gives you a real answer, so you can go faster. Most customers' experiments generally take minutes to a few hours. Every once in a while a really heavy-duty case, something like a Spark application with a really large data set, might run overnight. That's generally the biggest I've seen, and of course it's all happening automatically in the background, so you just go look at the results once it's done.
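The wall-clock arithmetic Brad walks through is worth writing down: total time is trial duration times the number of trial batches, where parallelism shrinks the number of batches. The function name and the parallelism figure are illustrative, not a StormForge setting.

```python
import math

def wall_clock_minutes(n_trials, trial_minutes, parallelism=1):
    """Estimated experiment duration: trials run in batches of size
    `parallelism`, and each batch takes one full trial duration."""
    return math.ceil(n_trials / parallelism) * trial_minutes

serial   = wall_clock_minutes(100, 15)                 # 1500 min (25 hours)
parallel = wall_clock_minutes(100, 15, parallelism=5)  # 300 min (5 hours)
```

This is why the two levers Brad mentions matter: shortening each trial to the minimum that still gives a representative answer, and running trials in parallel, multiply together to shrink the total experiment time.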

Rich: Great, all right well thanks Brad! So we are out of time, but wanted to first of all thank you, Brad, for sharing your knowledge and experience on this topic and thank everyone for attending today. 

I hope that it’s been really valuable and hope to see you again on a future event. So thank you and have a great rest of your day.