How Trilio Uses StormForge to Deliver Better Kubernetes Performance at 64% Lower Cost

Air Date: March 17, 2021

View the Slides

JP: Excellent. Thank you everyone for joining. I’m sure some more folks will be trickling in over the next few minutes, but we can get started. So first and foremost, thank you all for joining today’s webinar, and Happy St. Patrick’s Day on this beautiful March 17th, 2021. Today we are going to be presenting on how Trilio utilizes StormForge to deliver better Kubernetes performance at 64% lower cost for their customers. Our two presenters today will be Prashanto Kochavara, Director of Product, Kubernetes at Trilio, and Andrew Dawson, Senior Sales Engineer at StormForge. My name is Jonathan Phillips. I’m a Senior Account Executive here at StormForge and will be your moderator and host for today.

Just some insight on our drawing today: there are two ways to win a $100 Amazon gift card. One is the drawing at the end of this webinar. You’ll receive one entry in the drawing for attending the event, plus another entry per question that you ask. We’ll leave about five to ten minutes at the end for Q&A, and each question asked earns another entry for the Amazon gift card. The second way to win is our event survey. Following the event, we will send a quick three-minute survey, which earns another entry into the drawing. Great, so without further ado, let’s get started.

Great, so we all know the shift to cloud-native deployment comes with a slew of challenges, right? The complexities of deploying and running apps on Kubernetes end up taking the majority of the effort, rather than actually scaling forward and making those applications work correctly. With all the configuration decisions that have to be made, whether for memory and CPU requests and limits, replicas, or application-specific settings like JVM heap size and garbage collection, it becomes almost impossible to fine-tune your applications to fit your specific needs, whether for optimal performance or for the lowest cost.

So StormForge is the platform that helps organizations accelerate transformation while enabling a performance culture: moving from a reactive approach, monitor and alert, to a proactive stance where failures and performance problems are eliminated before they happen, and going from a time-consuming manual trial-and-error process to an automated approach based on machine learning. This is really so that your DevOps teams can put their focus on building and delivering new capabilities, not having to go back and troubleshoot or remediate resource-based failures or other issues that come up throughout the deployment process. It’s also about ensuring outstanding service delivery, right? Improving performance and reducing failures so your customers have the best possible user experience. And last but not least, it’s about helping you become more efficient, eliminating overprovisioning and reducing your cloud costs.

This gives you that foundation for Kubernetes success, using the best technology to support growth at scale. So today we’re going to highlight how Trilio and StormForge have partnered to help support this massive shift toward cloud-native deployment. I’ll now pass it over to Prashanto Kochavara, Director of Product at Trilio, to introduce Trilio Vault for Kubernetes and the benefits our partnership provides for their customers.

Prashanto: Great, thank you, Jonathan. Firstly, I’m really happy to be here talking about Kubernetes and how we are using StormForge, so I’m really excited to share that knowledge with the audience here. I’ll start off with a little background about myself: I’m the Director of Product at Trilio, spearheading the Kubernetes offering around data protection for applications running within Kubernetes. A little background about Trilio: the company was founded in 2013, so we have been around for seven or eight years now. When we launched, we entered the market supporting OpenStack workloads and quickly came to lead that space. We are currently still the leader in the OpenStack space, and we then went on to support virtualization workloads as well.

Very recently we’ve moved into the Kubernetes space, considering Kubernetes is growing at a phenomenal pace. We were asked by industry leaders and experts to support the Kubernetes landscape, and that’s how we started our journey into modernization as well. Our technology is fully proprietary: we built it ourselves and patented it end to end. We have global engineering, sales, and services, and customers across all the different industry verticals, whether it’s telco, defense, or automotive, all across the globe. We have a decent amount of funding from VCs, and a lot of industry experts and trusted advisors on board as well. From a partner-ecosystem perspective, we work closely with Red Hat and all the other partners that interface tightly with the offerings around OpenStack and Kubernetes. And finally, our technology is very modular, and we have gone through rigorous certifications with multiple distributions like Rancher, Red Hat, and IBM, with more to come. So that’s Trilio in a nutshell: what we do is data protection for applications running in your OpenStack, Kubernetes, or virtualization environment.

To explain what Trilio does, we can build out the entire slide here. What we provide is point-in-time recovery for the applications running within a Kubernetes namespace or cluster. As you can see, in this diagram we have two Kubernetes clusters, with a couple of namespaces shown for the first cluster and another namespace for the second. What we enable, firstly, is point-in-time data protection: if you lose your namespace, you can bring those applications back again from a centralized repository where you store them as insurance, as a backup for a rainy day. You can also extend that point-in-time data protection workflow to do data management and data orchestration, by taking that same data and restoring it into another namespace of the same cluster, or into a namespace in a completely different cluster. So if you are running on-prem or in the cloud, you can use Trilio to capture these applications and move them from one environment into another, and in case of data corruption or accidental deletion of a namespace or the application itself, Trilio can safeguard you from that perspective. I’m not going to jump into all the details, but the namespace is just one of the scopes of protection we provide; we also support applications packaged with Helm, Operators, and the other application packaging tools that are out there. What we end up doing is capturing the metadata and the data objects and moving them into the centralized repository, which can be NFS or S3, where they are stored so they can be used again for data protection or data management.
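For readers who want a concrete picture of how a backup like this might be requested: as Prashanto explains below, Trilio is driven by custom resources. The sketch here is purely illustrative; the API group, kind, and field names are assumptions for the example, not Trilio’s actual schema.

```python
# Hypothetical shape of a namespace-backup custom resource, built as a
# Python dict as it would be before submission to the Kubernetes API.
# All names below (apiVersion, kind, spec fields) are illustrative
# assumptions, not the product's real schema.

def make_backup_cr(name: str, namespace: str, backup_plan: str) -> dict:
    """Build a hypothetical Backup custom-resource body."""
    return {
        "apiVersion": "triliovault.trilio.io/v1",   # assumed group/version
        "kind": "Backup",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "Full",                          # e.g. Full or Incremental
            "backupPlan": {"name": backup_plan},     # references a backup-plan CR
        },
    }

cr = make_backup_cr("nightly-backup", "prod-apps", "prod-apps-plan")
print(cr["kind"], cr["metadata"]["namespace"])
```

In practice a body like this would be applied to the cluster (for example with `kubectl apply` or a Kubernetes client), and the product’s controllers would pick it up and move the captured metadata and data into the NFS or S3 target repository.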

Next, I’ll talk about how we use StormForge internally. I think that’s the right animation we want… so what we’ve done is build the entire Trilio Vault product around custom resource definitions. At the very base layer, users interact with the product by creating instances of custom resources, whether for performing a backup or restore, creating the target repository, setting up automated policies, or defining application-specific hooks; you can do all of that through the CRDs. So the point I want to deliver here is that the entire product is highly modular. And because it’s modular, we want to make sure our customers get a fine-grained, fine-tuned understanding of the resources they need to manage these custom resources as they operate and work with the product. As you were saying earlier, they should proactively be able to tell what their resource consumption is going to look like. Building the slide out more… the control plane layer is where we capture all the metadata objects around an application. We communicate with the API server and capture the config maps, secrets, service accounts, anything and everything associated with the application, and we put that into a target repository. At the control plane layer, we leverage StormForge to ensure that there are no memory leaks in the control plane itself.
The way we do this is we create an experiment, and within that experiment we have a bunch of backups and restores defined, and we run them sequentially thousands of times: performing a backup, doing a restore, editing, performing a backup, doing a restore, just to stress-test the control plane and verify its stability, ensuring there are no memory leaks or other issues at that level. After that, we stress-test it further. Instead of taking a backup, doing the restore, and then deleting it, we let the backups accumulate, again with thousands and thousands of trials running, so that we are testing at multiple different levels. And we are not doing that sequentially and manually; we are leveraging StormForge’s machine learning capabilities, firstly to cut down the time it takes to figure these different aspects out. The stress testing then gives us a good understanding of the resource utilization, the memory and CPU consumption, if a customer is doing 1,000 backups, 2,000 backups, or 3,000 backups. Based on the information StormForge provides as we test, we can give guidance to our end users: hey, Mr. Customer, if you’re going to be doing about 2,000 backups and that’s what you will be storing, these are the resources you need at your control plane layer. We also end up using this for regression testing, in a way, because we compare the memory and CPU consumption of one release against the next, and we create charts to see how the code has been progressing every time we add new pieces to it.
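The leak check described here, running many backup/restore cycles and watching control-plane memory, could be reduced to something like the following sketch. The sampling, window size, and threshold are invented for the example; in practice the memory figures would come from cluster metrics gathered during StormForge trials.

```python
# Simplified leak heuristic: given memory samples (MiB) taken after each
# backup/restore cycle, flag a leak if average usage at the end of the run
# is substantially higher than at the start. Window and threshold values
# are illustrative assumptions.

def looks_like_leak(samples_mib, window: int = 100, growth_mib: float = 50.0) -> bool:
    """Flag a leak if the mean of the last window exceeds the first by growth_mib."""
    if len(samples_mib) < 2 * window:
        return False  # not enough cycles to judge a trend
    first = sum(samples_mib[:window]) / window
    last = sum(samples_mib[-window:]) / window
    return (last - first) > growth_mib

stable = [512.0] * 1000                            # flat usage: no leak
leaking = [512.0 + 0.5 * i for i in range(1000)]   # steady growth per cycle
print(looks_like_leak(stable), looks_like_leak(leaking))  # -> False True
```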

Now moving to the data plane layer, which is the third layer: this is where we capture the data objects. The control plane layer handled the metadata objects; at the data plane layer we leverage CSI to talk to the persistent volumes, capture the data out of those volumes, and move it into the target repository. The way we do this is with independent data mover pods, one for each persistent volume, so there is massive scalability in the architecture; these data mover pods do their work, capture the data, and move it into the target repository. Initially, we started with one particular understanding of how the data mover pod would function and the resources it would need, but we soon realized that, depending on the applications we were dealing with and the resources available to the data mover pod and the application itself, we could tune the speed of the backups and restores. So we started testing multiple different data sizes and persistent volumes, as well as different sizes of persistent volumes, with StormForge, at different memory and CPU levels, to see how the performance changed and how the backup and restore speeds were affected. As a result, we were able to do our testing in a much more seamless fashion, very quickly, without having to do things sequentially, by leveraging the machine learning capabilities.
And secondly, we are able to use that same data to provide guidance and recommendations to our customers: hey, if your data volumes are this big, or if this is the amount of incremental data you will be backing up every few hours, then these are the memory and CPU resources you need. So not only for internal testing, but also for external guidance and recommendations, StormForge is really helping us in these areas.
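Turning experiment results into sizing guidance like this could look something like the toy function below. The tier thresholds and resource figures are invented for the example; real recommendations would be derived from StormForge trial data.

```python
# Toy sizing guide: map an expected backup count to control-plane resource
# requests. All numbers are illustrative, not Trilio's actual guidance.

def recommend_resources(num_backups: int) -> dict:
    """Return CPU/memory requests for an expected backup count."""
    tiers = [
        (1000, {"cpu_millicores": 250, "memory_mib": 512}),
        (2000, {"cpu_millicores": 500, "memory_mib": 1024}),
        (3000, {"cpu_millicores": 750, "memory_mib": 2048}),
    ]
    for limit, resources in tiers:
        if num_backups <= limit:
            return resources
    # Beyond the measured range, extrapolate linearly from the top tier.
    factor = num_backups / 3000
    top = tiers[-1][1]
    return {k: int(v * factor) for k, v in top.items()}

print(recommend_resources(2000))  # -> {'cpu_millicores': 500, 'memory_mib': 1024}
```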

Now, that was all about the internal use cases, how we leverage StormForge internally to test and provide guidance. When we started using StormForge, that was the initial intention, but as we got more familiar with the product and understood it better ourselves, we realized the use cases could certainly be extended and there was a lot more value we could deliver to our customers, within our product itself, from an external-facing point of view. With Trilio, we provide a feature called a disaster recovery plan. The DR plan, as the name suggests, is something a user defines in case a particular cluster goes down, in case there is an outage or a disaster, right? So we’ve provided the capability of defining what the plan should look like in case there’s an outage, and we are leveraging StormForge to measure the recovery time objective. The recovery time objective, in the data protection world, is the time taken to restore operations. Our intent is to leverage StormForge to tell us the time taken to restore your mission-critical applications, whether in cluster one, two, three, or four, any or all of those clusters, and it will give you both the time taken and how that time can be fine-tuned based on memory and CPU resources. Obviously, disaster recovery is not something that happens every day; it’s more of an insurance policy that you keep. So the idea is for customers to run this on a daily basis to understand: if a disaster happens today, how do my other clusters look in case I need to restore my mission-critical apps to them? What is my cost looking like? What is the time looking like, and how can I tune or redefine things to make it go faster?
It’s an automated process, and we are leveraging StormForge to provide these capabilities to the end user; based on that, a lot of guidance and recommendations can be created as well. Overall, it’s a crawl-walk-run approach. Right now it’s a little bit manual, where folks need to use the DR plan and StormForge together to figure out how it pertains to the RPO/RTO timings, but eventually we want to productize this within our product itself so that we can provide that value to our end customers directly through our user console.
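The RTO measurement at the heart of this is, at its simplest, timing the restore of each mission-critical app and comparing the total against a target. Here is a minimal sketch; `restore_app` is a stand-in for the real restore call, which this example does not perform.

```python
import time

# Minimal RTO check: time the restore of each app and report whether the
# total stayed within the recovery-time-objective target. The restore
# callable is a placeholder for the real restore operation.

def measure_rto(apps, restore_app, rto_target_s: float) -> dict:
    start = time.monotonic()
    for app in apps:
        restore_app(app)  # restore one mission-critical app
    elapsed = time.monotonic() - start
    return {"elapsed_s": elapsed, "within_target": elapsed <= rto_target_s}

# Example with a no-op restore:
result = measure_rto(["db", "api"], lambda app: None, rto_target_s=600.0)
print(result["within_target"])  # -> True for a no-op restore
```

Run daily, a check like this answers the question Prashanto poses: if a disaster happened today, how long would recovery take, and is that within the objective?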

So that’s how Trilio uses and leverages StormForge, for our internal use cases as well as our external use cases.

Andrew: Thanks, Prashanto. So to summarize what we did together with the two products: we essentially created a StormForge experiment that automatically runs these TVK (Trilio Vault for Kubernetes) backups and restores that we define and measures how long they take to complete. We spent some time looking at what parameters matter for this specific application, and we figured out the data mover and the metamover were good targets. Every time a StormForge trial runs, the StormForge controller patches new CPU, memory, and environment values suggested by our machine learning API onto the TVK deployment pods and then re-tests the scenario. So essentially you have machine learning exploring all kinds of different CPU, memory, and environment-variable configurations and how they translate into real-world metrics for Trilio Vault. In our case, the metrics we care about are the trial duration, meaning how long it takes for all of the different Trilio Vault steps, whether for the DR plan or for the backup or recovery, to complete successfully, and also the cost: how many resources are needed to get that fast trial duration? Can we crank up certain memory areas and get a faster speed? We find out all these answers by running the StormForge experiment. If I go to my next slide… with some of the experiment results we saw running some of the basic examples, we were really able to cut down on the resources needed. The thing about these experiments is that each one is going to have a different reaction to your application, because every single backup and recovery scenario is different, just as every other application we optimize is different. So these results can vary from app to app, but the goal is to optimize for your selected metrics.
In this case, we’re recommending an optimized configuration for our specific scenario that reduced the cost to complete that backup, the amount of resources needed adding up to the cost, by 64%, along with a 6% reduction in duration. But almost more importantly, on the right side of this, you’ll see we’re actually removing configurations that fail. When we run these test scenarios for Trilio Vault, if there was a configuration that didn’t allow the whole scenario to complete, we remove it completely; we’re putting real stress on the cluster by executing these scenarios. You can see we separated out 26 high-risk configurations that the dev team just didn’t need to deal with after that. So it turns things from reactive, where you deploy something out to a customer and they see the error and report it to you, into proactive: you have that insurance policy beforehand that you’ve weeded out high-risk configurations. And in the StormForge product, you have the ability to go in and look at all the different results the machine learning automation produced. Each one of these dots represents a scenario run defined in our experiment that Prashanto’s team can essentially kick off with the click of a button; this one has 40 trials, but realistically we see about 400 per experiment, 400 trials measuring the duration versus the cost of each configuration. And if you look below, you see the different parameters we’re actually targeting for each Trilio Vault scenario. So we’re able to give you a recommended configuration, a tested, completed set of CPU, memory, and environment variables, and show how they actually affect your application. So it replaces that trial-and-error process of having to go in and do all this by hand, look at your APM, make some adjustments, and run it again.
You turn it into almost load testing as code, where you can run this scenario, or your own testing scenario for your application, in an automated way powered by machine learning. The good thing about both these products is that there’s a free trial, or free-tier version, for each of them. So if you’re on the call today and want to test these products out, StormForge Optimize offers a free tier: up to five parameters you can optimize for free. When you start to go deeper than that, we do have paid and enterprise versions. And if you want to run an evaluation on your own application to see how we can optimize the CPU and memory, we can usually build something custom in less than two weeks. We’re seeing customers essentially pay back their purchase in less than three months in real cost savings, letting you get high performance with fewer resources. Trilio Vault also offers a free trial: if you’re interested in checking that product out, the 30-day free trial gives you the full product with unlimited backups and restores, so you can really test out both products together on your own backups or your own applications. I see we have a few questions coming in and we’re almost at 12:30 already, so JP, I’ll kick it off to you to moderate some of these questions.
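The per-trial patching Andrew describes, the controller applying ML-suggested CPU and memory values to the deployment pods, amounts to building a patch against the container spec. A minimal sketch of such a patch body, assuming a container name of `datamover` for illustration:

```python
# Build a strategic-merge-patch body that sets CPU/memory requests and
# limits on one container of a Deployment. Actually applying it would go
# through the Kubernetes API (e.g. a PATCH on the Deployment); this sketch
# only constructs the body. The container name is a placeholder.

def build_resource_patch(container: str, cpu: str, memory: str) -> dict:
    resources = {
        "requests": {"cpu": cpu, "memory": memory},
        "limits": {"cpu": cpu, "memory": memory},
    }
    return {
        "spec": {
            "template": {
                "spec": {
                    # Strategic merge matches list entries by container name.
                    "containers": [{"name": container, "resources": resources}]
                }
            }
        }
    }

patch = build_resource_patch("datamover", cpu="500m", memory="1Gi")
```

Each trial would apply a body like this with fresh values, wait for the rollout, run the backup/restore scenario, and record duration and cost for the machine learning service to learn from.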

JP: Excellent, great. Thank you, Andrew and thanks again Prashanto. That was great.

So a couple of questions have come in… This one is directed toward Prashanto: what is the benefit of using Trilio when using GitOps, where all my data is already in Git, reflecting my actual state in the cluster? My PVs are getting snapshotted all the time, and my databases are running outside with point-in-time recovery.

Prashanto: Generally, when we think about applications running in Kubernetes, if your applications are going to be fairly static and never changing, then GitOps is still the right way of supporting and deploying them. But let’s say your applications are going to change their footprint, whether it’s the metadata changing, say security policies, or the service accounts and secrets they’re leveraging in the environment. If those items are changing, then you can still leverage Trilio to protect and back up those point-in-time states of the application itself. Where Trilio becomes even more important is obviously when you have data, persistent volumes. The first wave was more about getting stateless applications into Kubernetes, but now, with the push toward more storage within Kubernetes, we’re seeing a lot more databases and stateful workloads in general coming into the Kubernetes landscape. If we go by the CNCF survey numbers, based on the 2020 survey, 55% are using storage-based workloads already, with another 25% planning to move storage into Kubernetes. With any new architecture, any new paradigm, you first bring in the stateless applications, which you can play around with and tune, and then the mission-critical, intricate applications follow, so the data portion of it will follow.
That’s what we’ve been noticing and observing in the market. But to come back to your question: if your application is completely deployed via GitOps and managed outside, for now you would definitely be fine, as long as the application’s DNA or footprint is not changing. But eventually, if you want a standardized Kubernetes central plane, manage everything there, and bring the application data inside, that is when you would want to use Trilio to protect it more granularly and on a more day-to-day basis as well.

JP: Excellent, the next question is for Andrew. Does StormForge hold a central repository of already-optimized baseline configurations? For example, MySQL version 5.5 with a defined set of SLOs, or do I have to make them up on my own?

Andrew: As we’re working and getting more data, we’re looking toward things like this. Right now, we find that the experiments you run produce very unique results from cluster to cluster. That’s why you run it on your dev cluster, in your environment, simulating your actual application, so we haven’t found it to be really one-to-one. We do have general recommendations, though, and we can absolutely help get you started. We could give you a general baseline, and then the machine learning, as it starts running against your MySQL app and sees how it experiences load in the way you’re using it, will dial that configuration in to best suit your performance. You also have the ability with StormForge to measure the things you specifically care about. We talked about trial duration today being an important metric for Trilio Vault, but many of our customers are optimizing for latency for a web application, or to maximize the throughput of their application, or, if they’re benchmarking their application with a fixed workload, maybe the duration of that workload. So experiments are very unique, we’ll help you build them for your specific application, and the results are usually very unique to each app. But I’m happy to chat with whoever asked that after this if you want to follow up.

JP: Excellent, great, and another question for Prashanto. Actually it’s a two-part question: are there data transfer speed issues during or after configuring the cluster and applications?

Prashanto: On data transfer and speed, from the Trilio platform perspective, I would say never, because we have built the application to be fully cloud native. We spin up our pods on demand to move the data. Now, if there is network latency or other network activity happening, that could fluctuate the data transfer speeds at any time, but from our platform and architecture perspective, everything should flow smoothly, and as long as the infrastructure is intact, everything should be pretty clean and pretty predictable as well.

JP: Excellent, and then the second part of that question: are there limitations with respect to huge data when measuring for RTO? It says here, database size limitations, since data is increasing every day.

Prashanto: No, not at all. We have patented the technology around these transfer capabilities, in terms of how we move the data into the target repository and how we restore the data as well. Our CTO has put a lot of effort and time into making sure that our backups are deduplicated, so not only are we using less storage capacity, but the amount of data we send over the wire is also very minimal. All these nuances and enhancements that are part of the Trilio platform enable us to be very nimble and have a fully scalable architecture, where no matter the size of the data, we have the appropriate resources and services coming up to make sure that the backup or restore concludes in the fastest possible manner.

JP: Excellent, and this is a question for both Andrew and Prashanto. Did the trials cover the full Trilio platform end to end, or just the MySQL component?

Andrew: I can answer that. The sample we looked at before was essentially testing the whole Trilio deployment: all of the deployment pieces, the deployment pods running on the cluster, as well as the job pod that starts when the backup happens. We were taking the CPU and memory of all those pieces into account. MySQL was just the target in that scenario. I had a sample MySQL database with some fake data in it, so it wasn’t just a blank database, and that can affect the different pieces of it. So in that specific example, MySQL was just the target; the optimization was happening on all the pieces of Trilio Vault itself as it ran the backup and restore scenarios.

JP: Excellent and another question, and this is for Prashanto, how intrusive was incorporating StormForge into your development chain?

Prashanto: It was pretty straightforward. Firstly, I would say that the StormForge team has been excellent in making sure my team was able to get started with the product, understand it, and use it to our benefit, and we were able to see the value right away. Then, because of how the StormForge product is built, we were able to align the Trilio architecture to it very easily, given, as we mentioned, the modular nature of Trilio. Integrating it into our development chain, as part of the test and validation aspects, was very straightforward, because both sides speak the same language in terms of being Kubernetes-centric. So the StormForge team, the way they have built the product, and the Trilio product together made the overall experience very simple, in my opinion.

JP: Excellent and another question for Prashanto. How would you integrate Trilio with CI/CD?

Prashanto: Great question. When you think about CD, the continuous delivery part, where you want to test and validate your applications and so on, if you double-click on the test aspect, you can use Trilio to take the piece of code you’ve been developing locally and move it into multiple public cloud environments, where you can get massive scalability for your application. So Trilio can help you test on multiple platforms, figure out what your breaking points are, and gather additional testing metrics and parameters. The other side of it is when you’re running in production and there is an issue: you can leverage Trilio to capture a point in time of that production landscape, bring it back into dev, make changes, move it back into test as part of the CD portion, and then release it back into production. So it supports you through the entire 360-degree lifecycle.

Andrew: I’ll add on to that too. I know no one asked, but StormForge can integrate with CI/CD as well. Once you build a repeatable experiment, everything can be kicked off with your normal Kubernetes APIs. So once you have a repeatable StormForge experiment, you could bake it into a pipeline to check whether a specific new addition to your code has regressed performance in any way, and turn that into a continuous test.
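The pipeline gate Andrew describes, failing a build when a new change regresses the measured metrics, could be sketched like this. The metric names, baseline figures, and tolerance are illustrative assumptions, not StormForge output formats.

```python
# Toy CI gate: compare the current release's best-trial metrics against the
# previous release's baseline and pass only if neither duration nor cost
# regresses beyond a tolerance. All names and numbers are illustrative.

def check_regression(baseline: dict, current: dict, tolerance: float = 0.10) -> bool:
    """Return True if the current release stays within tolerance of baseline."""
    for metric in ("duration_s", "cost"):
        if current[metric] > baseline[metric] * (1 + tolerance):
            return False  # regression beyond tolerance: fail the pipeline
    return True

baseline = {"duration_s": 120.0, "cost": 1.00}   # from the last release's experiment
current = {"duration_s": 125.0, "cost": 0.95}    # from this build's experiment
print(check_regression(baseline, current))  # -> True (within 10% tolerance)
```

A pipeline step would kick off the experiment through the Kubernetes API, wait for the trials to complete, feed the resulting metrics into a check like this, and fail the build on a False.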

JP: Excellent and I know we’re coming up on time here, but a couple more questions. This one is for Andrew, can you utilize StormForge for any containerized application and are there limitations for on-prem?

Andrew: Yeah, honestly, the great thing about our product is that it works with any flavor of Kubernetes, whether cloud or on-prem; it just has to be running Kubernetes 1.14 or later, I believe, or 1.13, but we’re able to support anything. So StormForge can definitely help on-premise, where you’re resource-limited and you really want your application to ask for fewer resources. You can create an experiment for any type of application. You just target your deployments, stateful sets, and pod containers that have the ability to set parameters, turn that into an experiment, understand what your load test is, so figure out how to stress your system to get a response, and essentially you can test anything. We’re happy to work with anyone out there; if you want to contact us to build something custom for you in a product evaluation, happy to do that.

JP: Great and last question for Prashanto. Do you have recommended configurations for TVK?

Prashanto: Yes, for all our pods we provide the resource limits and some guidance around what they should be. It’s all within our public documentation, at doc.trilio.io/kubernetes; you can find it there. And for some of the data mover resources and guidance, we have leveraged StormForge to come up with those numbers as well. So we definitely have that, and we’re definitely using StormForge to build it out and fine-tune it over and over.

JP: Excellent. Well, thank you so much, both Andrew and Prashanto. This has been a fantastic webinar, and the recording will be available for distribution very shortly. We’ll make sure to reach out to the folks who win our raffle, and look out for the survey coming your way shortly as well. Thank you all so much, and Happy St. Patrick’s Day once again. Enjoy the rest of your week, and we look forward to having you join us for the next one. Have a great rest of the day.