On-Demand Webinar

Your Testing is Flawed

Introducing a New Open Source Tool for Accurate Kubernetes Application Testing

Air Date: February 10, 2021

About the Webinar

Analyzing the performance and behavior of applications running on Kubernetes is often challenging, which makes optimizing prior to production a must. However, that raises a question: How do you get an accurate measurement of application performance or other behavior without accurate testing or an accurate representation of how the application will run in production?

In this webinar, StormForge’s Noah Abrahams presents and discusses a new fully Open Source tool for creating the needed tests with which to accurately measure your applications. We hope you learn more about this tool, and find out how you can help contribute.

Noah: Welcome to everyone. Welcome to the webinar. As Christina mentioned, my name is Noah and I’m going to talk to you today about our new open source tool, which hopefully will also be your new open source tool about testing in Kubernetes applications. So if my clicker works, the agenda we’re going to go through today is we’ll start with who we are at StormForge, why we created this product, why we care about testing, why it’s important to us, and what the problem is that we’re trying to solve. We’re going to talk a little bit about current testing models, how current testing models are flawed, and yes, they are all flawed in some way.

We’re going to talk about what we can do about it, then we’ll go on to the project itself. We’ll talk about how it attempts to solve the problems and then how folks can get involved, either start using, or involved with helping out with the project. 

So first of all who are we? We are StormForge. You may have known us previously as Carbon Relay up until last November. We are a market leader in performance testing and Kubernetes application optimization because those have to go hand in hand. If you don’t have good testing, it is difficult to optimize and if you are doing testing without acting upon it then why are you doing that testing? So we’ve been doing this for a while now. We are a company who is backed by a strong machine learning presence because optimization is difficult to do by hand, so letting machines make the decisions as opposed to having to try and juggle 10 or 12 different parameters at the same time, that only makes sense. We are all in on the cloud native ecosystem. As you can see we are focused on application optimization of Kubernetes applications and all the related products that go along with that. We’re also all in on open source as you might have guessed by my job title of Open Source Advocate and the fact that we are presenting an open source program and project to you today. If you want to know more about us you can find us on our website www.stormforge.io.

We have a public Slack and we also host bi-weekly office hours if you want to just come in and ask us questions about any or all of these topics, or none whatsoever. Anything that’s tangential we’ll also happily take. So as a group that is focused on testing and optimization, we want to first establish what is optimization and how did it get us to this particular question that we wanted to solve? 

So, optimization, at least as I’m defining it, is getting your application to behave how you want it. It is however you want your application to act once you have finished the act of optimization. Whether you are trying to normalize for cost, whether you’re trying to increase your performance, whether you’re trying to get better resiliency, whatever parameter, whatever property it is that you’re trying to get into your application, that is part of optimization. It is your desired behavior. So how do you measure desired behavior? 

That’s an interesting one because in order to know what that desired behavior is, you have to be able to have some form of measurement. You have to understand the tool, you have to understand the application, and understand what actual properties are the desired properties in the first place, which in the world of Kubernetes applications, in distributed systems, in this cloud-native ecosystem, that keeps getting harder. Applications are more spread out. We’ve gone from a world of monoliths to a world of microservices. Every individual piece is behaving in its own unique way and the idea of how to optimize is more and more complicated as we make the apps more and more abstracted. So in order to understand how those all fit together, we need to have some form of yardstick to do a measurement against. But in order to see that behavior, we have to stimulate the behavior in the first place. If you don’t have any sort of good load, if you don’t have a good way to stimulate the behavior that you want to see, you can’t see what the response is when you are stimulating it, obviously. If you want to see what your production apps look like then you have to know what a production behavior looks like. If you want to know what isolated behavior looks like then you are going to generate load in isolation. It’s pretty straightforward. 

So, we are interested in optimizing the applications which means we need to stimulate behavior that will provide us with a response that shows the properties and the qualities that we’re trying to optimize for, which gets us to load testing. 

As we said early on, load testing is flawed, but load testing is also a spectrum. At one end of the spectrum, you don’t really know what you’re testing for. Most load tests start with, “I’m just going to throw as many requests as I possibly can at the front end of my application and I’m going to see what the results are.” What’s the failure rate, what happens when I hit the root of my application, what happens when I hit a particular API endpoint, and you’re really not looking at responses. You’re looking at more of an overall status, whether or not it’s even functioning at all. But when you’re looking at trying to optimize for any sort of particular behavior, that’s not really relevant. It’s definitely not accurate, it’s not giving you anything useful. If you were to put an application out in the world, you’re not just going to get a large number of people hitting the front page of your website all the time. It’s not what’s going to happen.

On the other end of the spectrum, you’ve got some understanding about your application. You’re making some quality guesses. You have some understanding about what the traffic looks like. You have some understanding what the page architecture looks like, or application architecture. You’re analyzing the flow and the patterns that come into that application and then you recreate the patterns, and this is going to be a recurring theme throughout this: the fact that you’re taking existing data, stepping away from it, and then trying to recreate that existing data in the first place. Recreating that existing traffic requires a lot of work. It certainly has a lot of uses, but it does require a lot of engineering time and effort and you end up with these handcrafted load tests, which require understanding of the application in the first place, and which a lot of teams are frankly not suited to do.

So, that brings us on to, as we talk about that spectrum, what are we trying to accomplish with these load tests and why is understanding about your application an issue? So one of the big problems that we run into is a problem of accuracy and if you asked your average person on the street what accuracy is they might fumble through a definition, and one of the things that I think is interesting here is the difference between accuracy and precision, and we’re focused on trying to get more accurate tests.

If you asked an average person, they might tell you that accuracy and precision are synonyms – that they mean approximately the same thing. But in the world of measurement they don’t mean the same thing. Accuracy is about how on target you are and precision is about the width of your measurement, how tightly your results group together. Now, when you are using early tests without any understanding of your application, you are missing accuracy significantly.
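The distinction can be made concrete with a few numbers. The following is a minimal Python sketch, not anything from the talk's slides: the function names and the latency figures are made up for illustration. It treats accuracy as the distance between the average measurement and the true value, and precision as the spread of the measurements.

```python
from statistics import mean, pstdev


def accuracy_error(measurements, true_value):
    """Distance between the average measurement and the true value (lower is more accurate)."""
    return abs(mean(measurements) - true_value)


def precision_spread(measurements):
    """Population standard deviation of the measurements (lower is more precise)."""
    return pstdev(measurements)


# Hypothetical latency readings in milliseconds; the "real" latency is 100 ms.
true_latency_ms = 100.0
tight_but_off = [140.0, 141.0, 139.0, 140.0]          # precise, not accurate
scattered_but_centered = [70.0, 130.0, 85.0, 115.0]   # accurate on average, not precise

for name, readings in [("tight_but_off", tight_but_off),
                       ("scattered_but_centered", scattered_but_centered)]:
    print(name,
          "accuracy error:", accuracy_error(readings, true_latency_ms),
          "spread:", round(precision_spread(readings), 2))
```

A load test that hammers one endpoint can produce very repeatable numbers, like the first list, while still being far from what production traffic would show.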

You’re just kind of, you know, what we call throwing it at the wall and seeing what sticks. That’s not accurate at all because you don’t actually know what you’re aiming for, and honestly in the wild that’s a lot of what load testing is. It’s not a matter of whether or not you understand the application’s behavior, it’s a matter of whether or not when you perform a test did it fall over, is it on fire? And if it’s not, people just ship it out and they say well it stood up. We assumed that it behaved the way that we wanted it to, looks good to me. 

And if you are putting in that effort, well then you’re getting better tests. You’re getting a better understanding of your application if you’re putting in the effort to do so, in that you know what the results should be. But as we said previously that takes a lot of work. And since you’re stepping away from the initial traffic in the first place in order to analyze it and get back to recreating tests, this is where we make the statement that all tests are flawed because you’ve already stepped away from reality. You are not in a world where you’re actually working with the real data. You’re already working with an abstracted set of analysis. So across the entire spectrum, all of your tests are flawed. Whether you’re doing them in an inaccurate sense or whether you’re putting in the work to get as close back to what you started with in the first place.

So how do you get past not having that reality in the first place? Well, one of the best quotes that I’ve seen is simply about testing in prod. This is very easily defining that when you test in prod, you’re testing in reality and if you’re not testing in prod, you’re not testing in reality. It’s a pretty straightforward binary and there’s a lot of value to be said for both sides. You want to test before you get to prod, but testing in prod is the only way to see what the actual behavior is going to look like. However, why does it not reflect reality when you’re testing outside of prod?

Well, this is actually a lot more complicated of a question than what I’ve positioned here. It’s not the differences in the systems. It’s not infrastructure. If it were a matter of infrastructure, then you could just change the instance types that you’re requesting. That’s not a big deal. That’s pretty easily solved. Is it a matter of how your operational teams are taking care of and managing the systems? Are your dev systems being managed by hand, but your production systems are all, you know, infrastructure as code and everything is checked in? Well, we know that that happens in reality, but really that shouldn’t be a difference. You can easily solve that by adapting those operational practices to your development environments in your pre-prod environments, in addition to your production environments. 

Now almost all of the differences are based in usage, and I say almost, there’s some other stragglers that come in. The concept of state in a production environment is one of them, but for the most part it is the usage patterns that are coming into your production environments that are preventing your non-production environments from actually having testing that reflects reality. So all of your testing is inherently flawed if you cannot get production to be simulated in your non-production environments, and there is always a bit of danger in potentially testing in production. Maybe you’re doing destructive testing and you don’t want to point that at your live production environments, or how are you testing against payment card data, or any number of things. So all of that testing is inherently flawed and we wanted to try and solve that problem because we wanted to have tests that could optimize an application for how it’s going to function in prod and we wanted those tests to be able to be run pretty much anywhere so that we could simulate the prod environments outside of prod. 

So we’ve presented this project which is currently called “VHS,” and that’s in quotes for a reason we’ll talk about it in a little bit. So what is VHS? VHS is a tool that will take your production traffic and allow you to use it in non-production. So how does it work? What is VHS today?

First of all, since we said it is a focus on Kubernetes applications, it deploys as a sidecar. It shows up inside your Kubernetes application and allows you to understand and make use of the traffic that’s going into that app. It records and replays any and all traffic that you tell it to, and it does so by storing that traffic in some form of file or manifest, wherever you tell it to. Now, this is sort of an any-to-any format, so we’ve taken the ability to record traffic and we’ve made all of these plugins, whether it is recording the inbound web traffic and storing it on local disk or in an S3 bucket or whatever it happens to be, and we’ve also taken the inverse where it is reading that manifest, where it’s reading that stored traffic and it’s playing it back into your application for the purposes of testing. And really that’s the majority of what it is, which is interesting because we saw that there was a problem in a testing environment, that there was a need for getting these tests to work correctly, but we didn’t really solve it with a testing application. We solved it with a traffic application because we found that that was going to be the most useful way for stimulating the behavior that we wanted to see in the applications by completely bypassing the idea of creating tests and having to understand what the application looks like in the first place. You still have to understand it, still very important. If you just throw production traffic in an application you don’t understand what it’s doing, you’re not going to get a lot of benefit, but it does take it out of the idea of creating the test.

So where would something like VHS fit into your lifecycle today? You’ve got this record and replay tool, so you throw it in as a sidecar on your Kubernetes application. If it’s receiving web traffic then it would take that HTTP traffic, store it as a .har file, the standard HTTP Archive format, in the storage of your choice, wherever you want to put it, and then as we go back and approach your CI/CD pipeline, it can play back that traffic into any number of stages. Now the ones where VHS appears here, this is a typical CI/CD pipeline that I’ve seen many times: check in your code, do your build, run your unit tests, integration tests, additional end-to-end tests, optimize the application, and then finally deploy into your production environments. And the areas where VHS is here in this pipeline, this is about where StormForge is mostly concerned in the pipeline. We want to help with the optimization. We were focused on what the testing looks like. However, this turns it a little bit on its ear, because while this is where it would fit in a typical CI/CD pipeline, the very concept of these being stages in your pipeline might change because you have the ability to simulate that production traffic in your test cycle. So this entire test cycle might change by virtue of you being able to throw live traffic at it, and like I said this is just a typical one that I’ve seen before. I’m sure that any number of people on this call can come up with new ideas, use your imagination, on interesting ways that this could be placed for anything that needs to replay traffic. Maybe you want to use it for testing. Maybe you want to use it for security. Maybe you want to use it for… come up with your own ideas. 
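To make the playback half a little more concrete, here is a minimal Python sketch of reading a HAR file and turning each recorded request into one aimed at a test environment. It assumes the standard HAR layout (`log.entries[].request` with `method`, `url`, `headers`, and optional `postData.text`). It is not VHS's own code, and the function names, the sample data, and the staging URL are hypothetical.

```python
import json
from datetime import datetime
from urllib.parse import urlsplit, urlunsplit
from urllib.request import Request

# A tiny, made-up recording in HAR layout, deliberately stored out of order.
SAMPLE_HAR = json.dumps({
    "log": {
        "version": "1.2",
        "entries": [
            {"startedDateTime": "2021-02-10T10:00:02+00:00",
             "request": {"method": "POST",
                         "url": "https://shop.example.com/cart?id=7",
                         "headers": [{"name": "Host", "value": "shop.example.com"},
                                     {"name": "Content-Type", "value": "application/json"}],
                         "postData": {"mimeType": "application/json",
                                      "text": '{"qty": 2}'}}},
            {"startedDateTime": "2021-02-10T10:00:01+00:00",
             "request": {"method": "GET",
                         "url": "https://shop.example.com/",
                         "headers": [{"name": "Host", "value": "shop.example.com"}]}},
        ],
    }
})


def load_entries(har_text):
    """Parse HAR JSON and return its entries ordered by their recorded start time."""
    entries = json.loads(har_text)["log"]["entries"]
    # Older Python versions do not accept a trailing "Z", so normalize it first.
    return sorted(entries, key=lambda e: datetime.fromisoformat(
        e["startedDateTime"].replace("Z", "+00:00")))


def to_request(entry, target_base):
    """Build a request that replays a recorded entry against target_base instead of the original host."""
    recorded = entry["request"]
    original = urlsplit(recorded["url"])
    target = urlsplit(target_base)
    url = urlunsplit((target.scheme, target.netloc, original.path, original.query, ""))
    # Drop the recorded Host header so the HTTP client sets one for the new target.
    headers = {h["name"]: h["value"] for h in recorded.get("headers", [])
               if h["name"].lower() != "host"}
    body = recorded.get("postData", {}).get("text")
    data = body.encode("utf-8") if body is not None else None
    return Request(url, data=data, headers=headers, method=recorded["method"])


for entry in load_entries(SAMPLE_HAR):
    req = to_request(entry, "http://staging.local:8080")
    print(req.get_method(), req.full_url)
```

The sketch only builds the requests. Actually sending them, for example with `urllib.request.urlopen(req)`, and pacing them according to the recorded timestamps is left out so that it runs without a live target.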

So that’s what we’ve got. We’ve got a framework really that does record and replay of any traffic to and from any scenario ostensibly, but it’s a new project. It’s a community project and there’s a lot of potential directions that it could go. For the core of the framework itself, one of the places that it could go could be to provide additional metrics. Maybe your application isn’t running Prometheus. Maybe you don’t really know how to get that visibility back out of your application. Well we’d like to throw some metrics into the core of the product so that when you run the test, when you provide that traffic, you can understand what the reactions are to that traffic internal to VHS. We’re also looking at potentially testing things outside of the realm of just the Kubernetes applications themselves. Maybe you want to test the Kubernetes platform. Maybe this is a sidecar that could be applied to the API server in the kube-system namespace and you want to test not what the workload looks like of a particular application on Black Friday, but you want to test what it looks like when your cluster that is a multi-tenant cluster is being used by thousands of users at the same time and will your API server stand up to that much kubectl action? Yeah, there’s a lot of potential there for testing things outside the realms of just the app, and that leads us into versions of the app that are not just a single application sidecar. We’ve had some discussions around maybe the VHS implementation is to be a standalone application on the cluster, or maybe it’s not even a containerized version. We’ve had some people ask us about how would I apply this to a bare metal system because I want to test what a bare metal response looks like and I’m not interested in the application, I haven’t gotten that far yet. So there’s a few different changes that could potentially be on the horizon based on what the community needs because once again this is an open source project. This is a community project and we’re looking for a lot of people to get involved to help out with these types of directions. 

Let’s talk about the plugins a little bit. We said that this is an any-to-any scenario, so what if we want other storage mediums? Well how many different storage types do you want to have? Today we don’t have a plugin for Azure. If anyone on the call is well versed in writing Azure storage plugins, pop on by. We’d love to have you. What about the output that is being stored in the traffic manifests? What about encrypting that data? Right now it’s dumping it as a flat file. What about sanitizing that data? What about parsing it in some other way? These are all plugins because they’re all part of just getting into the workflow, and we would like people to come jump into our sessions and tell us whether these are things that you find useful or, you know, maybe they’re not. Maybe everyone is going to say, “hey I’m just gonna store this locally and I don’t care about encryption.” Unlikely, but possible. 

What about session management? So right now we’re doing mostly HTTP traffic. How do you handle HTTP sessions? How are you handling that across multiple instances, multiple replicas within your web tier? What does that look like and how are we handling that at the plugin level, and since I said we’re managing mostly HTTP traffic today because that’s what we started with, you need to have a starting point, what does it look like to handle other traffic types? Maybe we want to be able to capture SQL traffic. Maybe we want to capture gRPC traffic. Maybe we want to capture any number of other things. And as a corollary to that, if we’re capturing and playing back other traffic types, what if it’s not just on the front end of the application? What if when we’re testing our application we’re actually using this as something closer to service virtualization where we’re simulating a database playback on the back end of the application? Because it’s an any-to-any configuration, this is 100% possible. 

We’re also looking at, once the data is stored, once that traffic is somewhere in a file, what can you do with it? Do we have additional tools that we want to apply to that stored traffic? What if you want to do some manual traffic shaping, those handcrafted artisanal load tests that I talked about? Well we said they’re a lot of work, but what if we could make them a lot less work by giving you a good starting point and allowing you to do simple shaping, like I want to add some delays in between some transactions to see if my e-commerce site still works? I want to increase the volume of a particular type of stored transaction. All of these things, and this goes into our ML backing, our machine learning backing. 
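The two shaping operations mentioned here, adding delays between transactions and increasing the volume of one transaction type, could be sketched roughly as follows. This is a hypothetical illustration in Python, not a VHS feature: the `RecordedRequest` type and both functions are invented for the example, and it assumes the recording is already sorted by time.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RecordedRequest:
    offset_s: float  # seconds since the start of the recording
    method: str
    path: str


def add_delay(records, extra_s):
    """Widen every gap between consecutive requests by extra_s seconds."""
    # The i-th request has i gaps before it, so it moves back by i * extra_s.
    return [replace(r, offset_s=r.offset_s + i * extra_s)
            for i, r in enumerate(records)]


def amplify(records, matches, copies):
    """Repeat each matching request so it occurs `copies` times; leave the rest alone."""
    shaped = []
    for r in records:
        shaped.extend([r] * (copies if matches(r) else 1))
    return shaped


recording = [
    RecordedRequest(0.0, "GET", "/"),
    RecordedRequest(1.0, "POST", "/checkout"),
    RecordedRequest(1.5, "GET", "/cart"),
]

slower = add_delay(recording, 0.5)
heavier = amplify(recording, lambda r: r.path == "/checkout", copies=3)
print([r.offset_s for r in slower])
print([r.path for r in heavier])
```

Starting from recorded traffic and applying small transformations like these keeps the shaped test close to real usage, which is the point being made about reducing the handcrafting effort.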

What if we want to analyze that traffic and be able to either narrow or broaden the traffic types and the request types that are coming into your particular application? What if we want to increase the dial on just one particular type and say, “give me more of the incorrectly authorized admin traffic?” See what happens when the application hits that. 

And one that I am particularly concerned about, once you have that traffic stored, what do you do with any sort of personal information? Any sort of sensitive data? And this goes for a couple of different reasons. You’ve got things like GDPR where you don’t want to store any sensitive or personal data if you can avoid it. But also what if you’re playing back something that has payment card data and you’re now replaying a credit card transaction? You don’t want that to actually process and go through because you’re running it in part of a test simulation. You want to be able to change that up to say a test string. So all of these are ways that we see the project evolving. 
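As a rough idea of what swapping card data for a test string might look like, here is a minimal Python sketch. It is an assumption-laden illustration rather than an existing VHS plugin: it looks for 13 to 19 digit sequences (optionally separated by spaces or dashes), uses the Luhn checksum to skip numbers that cannot be card numbers, and substitutes a widely used test card number. The names `scrub_cards`, `luhn_valid`, and `PLACEHOLDER` are hypothetical.

```python
import re

# 13 to 19 digits, optionally separated by single spaces or dashes.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")
# A commonly used test card number that passes the Luhn check.
PLACEHOLDER = "4111111111111111"


def luhn_valid(digits):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def scrub_cards(text):
    """Replace anything that looks like a valid card number with the test placeholder."""
    def swap(match):
        digits = re.sub(r"[ -]", "", match.group())
        return PLACEHOLDER if luhn_valid(digits) else match.group()
    return CARD_CANDIDATE.sub(swap, text)


body = '{"card": "5500 0000 0000 0004", "order": "1234567890123"}'
print(scrub_cards(body))
```

A pattern match like this is only a starting point. Real sanitization would also need to cover names, addresses, tokens, and other personal data, ideally before the traffic is ever written to storage.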

But what we run into in this any-to-any scenario is that we’ve made a bit of a Swiss Army knife. So where we’re starting today, we’ve got HTTP traffic, the ability to de-zip and turrets, the ability to store it in a handful of locations. That’s a great framework and what we want to get to is something that a lot of people would find useful. We want the community to be involved, to tell us and to help drive this project forward toward what the most useful use cases will be. How are people going to interact with this project? Maybe it’s in ways we haven’t thought of yet. I’m great with that. Please come help us drive those forward, but if we attend to everything that is on that list and do all of the possible plugins, we’re going to end up with something that’s sort of untenable. So part of the conversation is around what are people going to find useful in your organization when you’re doing testing, or maybe you’re not doing it today because you find it too difficult and you wanted a way to start there? What would you find to be most useful in a traffic replay tool? 

So, that brings us to the project itself. What we have and what we don’t have. We have weekly meetings to talk about these things. What do people want? What is the community going to bring in? We have a mailing list, and a contributor’s guide, and a Code of Conduct, and we have a GitHub org, and all of the things that you need in order to manage a project. 

What we don’t have yet is significant contribution outside of StormForge and we want to change that. We don’t have a formal maintainer process because we don’t have, quite frankly, you: all of you who are here on this call. We want you to get involved in this project. We’re trying to make this a community project that is not owned by any one company and we want everyone to help us drive it forward. The other thing we don’t have is a new name. From earlier, you’ll notice that VHS was in quotes. Well there’s a couple of reasons we are intending to rename this project. VHS, it should surprise no one, is somewhat difficult to go do a Google search on and in almost every context that we were interested in, the name was already taken. You’re not going to get a GitHub org named VHS. It’s not going to happen. The other issue that’s not written here is that it is someone’s registered trademark. VHS belongs to a company and so while it’s quippy and is evocative of the record and playback functionality that this project has been built with, really we want something that as a name belongs more to the community. We want to be a little more unique and also we want it to be evocative of the cloud native ecosystem that we’re making this a part of. So if you say, “hey I’ve got a good name for that,” or “I want to see what suggestions other people are coming up with,” you can come join our meetings and we’re going to be opening a section for suggestions during our meetings. Along with storing things in a document, we’re expecting that sometime around the end of this quarter, we are going to have a poll based on the suggestions that everyone brings forth to us and we’re going to formally rename the project, probably in the beginning of April depending on the number of suggestions and the activity we get around them. 

So, if you wanted to get started with this project, how would you get involved? Well first of all you would come join us on GitHub in the appropriately named “rename-this” org, for obvious reasons. We have a mailing list over on Google Groups, which is going to be our central point of contact. It is VHS-pre-rename-launch. That should be pretty easy for everyone to just remember, right? I’m just going to type down those. Yeah, that’s not going to happen so… 

We also have a public Slack. Unfortunately our public Slack registration page is currently down, so it’s a good thing the link is here, but we’re going to be announcing when that’s fixed on the mailing list so please come join us on the mailing list. I’m going to leave this up for a second or two for anyone who’s interested to go join the mailing list, and if you’re on the mailing list, then you will get the invite to our weekly Zoom meetings. They are 9am Monday mornings Pacific Time / 12pm Eastern Time / 5pm GMT. There is a public document with all of our meeting minutes and agendas and that same mailing list will also be used for control of the renaming documents. So that’s it. That’s all I’ve got. I hope that everyone found this to be interesting. We’ve got, you know, what we think is the framework for a really great solution to testing in prod, to being able to provide your production traffic in a cloud-native way. 

Now we’re going to maybe… maybe I’m going to move on to questions. There we go. So, we’ve got some questions here in the Q&A panel. Let’s see what we’ve got.

Question: What part of StormForge is open source and what part is proprietary? Are the service aggregation and results stored on StormForge servers, or do we install all the dependent services on our own servers?

Okay, so one of the lines that we’ve drawn as a company is that anything that a customer is going to run, and I say the word customer even though it’s open source, anything that an end user is going to run on their own systems absolutely should be open source. So for StormForge, we’ve got this particular application. This is an open source application running in your cluster. If it reaches back to other things, that is a matter of discussion, like we talked about the machine learning. That piece wouldn’t necessarily be open source, but anything that you’re running for the optimization, for local testing, anything that you’re running locally is going to be open source. We do have some service offerings and those service offerings aren’t necessarily open source. We’re still looking at continuing to grow that. I’m a big proponent of pretty much everything should be open source, so I’m going to be pushing that forward. But yeah, I hope that answers your question.

All of the stuff that you’re going to install, which would include the ability to like run experiments and whatnot is all open source.

Question: Storage overhead required?

It is primarily text, so it’s not a lot. If you’ve ever seen the output of, you know, something like Wireshark, it’s just a big blob of text. It doesn’t take a whole lot and it goes from there. The VHS object itself is an image, so that’s not taking a lot of space. I can’t give you a direct answer because that’s sort of like asking how much water is there in a swimming pool or how long is a piece of rope, but I can tell you that it’s text files and doesn’t take up a whole lot of space. 

Question: What happens if a node’s clock has become skewed? For example, very active nodes with a lot of CPU load may go off-kilter in their NTP if not set up properly, so you could find a five-minute or more mismatch. 

I’m not sure I understand what the context of the question is. Node clocks are a problem. I’ve seen that in production many times, but if we could get some more clarification on what you’re asking with respect to a particular product or if you’re just asking in general about node clock skew. I’d appreciate some more clarification on that question. 

In the meantime, what’s the next one?

Question: What about the GDPR from capturing a real traffic point of view? 

That is, I mean like I said in one of the previous slides, that is one of our primary interests on how we handle traffic and we want there to be some form of plugin to handle sanitization of personal data. Now whether that happens during the capture or it’s something that has to happen as a tool after the fact? I would prefer it to happen during the capture so that it never actually makes it to the stored data in the first place, as opposed to sanitizing the data after the fact which is sort of like you’re not really adhering to GDPR if you’re doing that. It’s a great question and it’s something that we really need to be concerned about and we would love people to come help us with that sort of endeavor around data filtering. Because right now, it’s not doing it and it needs to. 

We’ll give a minute or two to see if any other questions pop up, but other than that barring any other questions, I hope that folks found this interesting… 

Question: Replay is the issue (this is regarding the clock skew earlier). I assume you need to align event records with regard to timestamps. Imagine a reply logged before a query. This could affect Kubernetes also in frustrating ways.

Yeah, so that’s going to come down to being able to replay events as they came in. If there are timestamps in there and the timestamps need to be updated, then realistically that would have to become part of data sanitization, I think: to say I’m replaying this in a scenario where I want to update any clock signatures to now, and that would have to be covered at the plugin level. Because otherwise you’re playing back traffic that already has a timestamp built into it and there’s not really a reasonable way to associate that, and that might be part of the test. Can your system even handle a playback that has a timestamp other than now? That’s frankly something that I hadn’t really thought about until just now and I would encourage the person that asked this question to come on by our meetings and present it as a formal issue. It’s something that we need to be concerned about and something that we need to manage in the plugin management.
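One way to picture "updating clock signatures to now" is to shift every recorded timestamp by the same amount, so the earliest event lands on the replay start time and the gaps between events are preserved. The Python sketch below is only an illustration of that idea under those assumptions. The `rebase` function is hypothetical, not part of VHS, and it does not address skew between nodes that recorded different events.

```python
from datetime import datetime, timedelta, timezone


def rebase(timestamps, new_start):
    """Shift all timestamps so the earliest one equals new_start, keeping relative gaps."""
    shift = new_start - min(timestamps)
    return [t + shift for t in timestamps]


# Made-up recording times and a made-up replay start.
recorded = [
    datetime(2021, 2, 10, 10, 0, 0, tzinfo=timezone.utc),
    datetime(2021, 2, 10, 10, 0, 2, tzinfo=timezone.utc),
    datetime(2021, 2, 10, 10, 0, 7, tzinfo=timezone.utc),
]
replay_start = datetime(2021, 3, 1, 9, 30, 0, tzinfo=timezone.utc)

for original, shifted in zip(recorded, rebase(recorded, replay_start)):
    print(original.isoformat(), "->", shifted.isoformat())
```

In a real replay, `replay_start` would typically be `datetime.now(timezone.utc)`. A fixed value is used here so the output is repeatable.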

Thank you for that question. That’s really interesting. Barring any other questions, we’ll give it a minute. I hope folks saw this as interesting and want to come join our meetings, the mailing list is the place to start, which I will put back up here. Looks like we’re clear on questions. Thank you all for coming out today and I hope this was interesting and educational.