Why Kubernetes Desperately Needs Machine Learning

Fireside Chat with Thibaut Perol, Ph.D.

Air Date: February 26, 2021

Noah: And we are recording. Welcome everyone to our first Fireside Chat in the StormForge Fireside Chat series. Thank you all for coming today. The purpose of the Fireside Chat series is to speak with folks in the ecosystem around cloud native, DevOps, and other related topics, about the topics that they care about and find really interesting. So I am your co-host Noah Abrahams. I’m the Open Source Advocate at StormForge and my other co-host…

Cody: I’m Cody Crudgington. I’m a Senior Consultant here at StormForge.

Noah: And with us today we have Thibaut Perol, who is a Lead Machine Learning Scientist. Is that correct, Thibaut?

Thibaut: Yeah it is.

Noah: For those of you that are here about the raffle, because we did mention that we were going to be doing that, we’re going to tell you up front that we’ll be raffling off a couple of gift cards for Amazon. You have to stay to the end and you’ll get one entry for being here and one entry for every question you ask, whether or not we get to it. We’ll also be doing a survey at the end of this to talk about your experience, whether or not you found this valuable, things like that and that will get you another entry into another gift card. So keep an eye out for that survey and make sure you fill it out.

So let’s get started shall we? Thibaut, let’s start with you since you’re the guest of honor. We had set today to be talking about Kubernetes and machine learning, but I think some of the people that are here may not have a full understanding of machine learning. They may not really fully know what’s going on, what it means. So would you like to start with talking a little bit about what is machine learning, what is machine learning not, what are some of the buzzwords and misconceptions that might help people understand what it is and what it is not.

Thibaut: Yeah, sure let’s start with that. Machine learning. So in general what I would say is it’s a set of methods or algorithms that learn to execute a task based on experiments or data. So the way it’s different from another type of algorithm is that those would be more rules-based. You don’t say if there is a wall, I need to go left or need to go right. What you look at is data of, you know, a car going towards a wall and suddenly going left, and you mark that as a good example, and then if it goes into the wall as a bad example, and then make the algorithm learn that this is the output that you want. So that’ll be one example. The most famous example obviously is the cat and dog, trying to recognize, you know, a cat or a dog in a picture, and the way it’s done is by feeding in a lot of pictures of cats and dogs that are labeled and then having the algorithm learn what it needs to learn to figure out how to differentiate them, and then being able to generalize to a new image and predict whether it’s a cat or a dog. The best thing is to predict that with some kind of probability. So yeah, that’s I think a basic example of what machine learning is and what a lot of people think machine learning is. But I think it’s important to add that this is just a subset of it. This is what we call supervised learning in the machine learning community. You know, you have a task, you know what the output is, and then you train the algorithm to do that. But there is another set of algorithms, which is unsupervised, which is learning from unstructured data and training, for example, to look at how you can cluster data. Another type of algorithm can be doing recommendations. For example, there was this huge Netflix challenge that I think was up for $1 million, which was paid in 2009 I believe, where the idea was to figure out the best algorithm to recommend you a movie based on the movies that you like and what other users like.
And so basically the idea is to figure out how to map you to a similar user and then use that information to figure out what movie you might like. So yeah, I think the broad definition is just learning from data instead of hard-coding rules, is what I would say about it. And then a couple of things. You know, I know a lot of people became aware of it I would say after 2012 because, you know, this is when we had a big advancement in the field with AlexNet, which was this deep learning algorithm that was coded on the GPU and was now suddenly able to recognize many different images and generalize very well. And then afterwards it started exploding and, you know, there was a lot of confusion around things like machine learning, AI, deep learning, and it’s something to know. Deep learning is just a subset of machine learning that uses a specific way of creating those things, and then after that it exploded with natural language processing. Speech recognition became much better using those methods. So yeah, that’s kind of my short answer to what is machine learning – or I guess I would say short, some people will find it long – but what is machine learning, what is deep learning.

Cody: No, that’s really good. So what limitations do you see that machine learning has, if any? I mean, I know you know this is obviously your focus, but are there any limitations of machine learning?

Thibaut: Yeah, so are we talking machine learning in general or are we talking about how we can apply it to Kubernetes? I think, you know, it works for a bunch of very specific tasks. So obviously it’s amazing for image classification, image segmentation, natural language processing, speech recognition, and there’s been a lot of work in the reinforcement learning field that is applied to robotics or other types of work. A lot of people want to apply that to supply chain. I think it has a lot of potential. One of the issues that we’ll sometimes find, and it’s been a long time now, is how to make a product out of it. One of the issues with those great algorithms is the inference time, so when we have the algorithm trained and then we make a prediction with it, which would be like in the example we mentioned before, this is a cat from an image, it can sometimes take, you know, quite some time simply because the algorithm is becoming more and more complex. And then there’s a trade-off of whether you want to be 99% accurate and, you know, serve a prediction in like 10 seconds, or you’re fine being 96% accurate and serving that prediction in like 0.5 seconds. So I think what is interesting now in machine learning, obviously there is a tremendous amount of research in machine learning, but what I really like right now is how to use these things and make products out of them that, you know, would be user friendly. So I think yeah, this is one of the biggest challenges that I see right now.

Cody: Well yes, well yeah that, I mean, it’s interesting and you talk about productizing machine learning. So why is machine learning a good fit for Kubernetes? Why do you think that?

Thibaut: Yeah, so I guess I can say why, you know, it’s a good fit and why now we think it’s a really good fit, or I can just say, you know, from the history of the company, how it actually happened. We just started using Kubernetes for our own servers and then it just became obvious. Instead of just saying, oh, this just makes some sense, I could just say yeah, it makes a little sense. But no, what happened is we started using it and we were like, oh wait, but actually we do that all day long. You know, you set up your cluster, you set up your pods. There are so many parameters, that was the first thing. It’s just like, oh my god, it’s just a huge amount of data. You can’t comprehend that amount of data. It’s pretty obvious that this is something that needs to be done by someone other than myself. You can’t look at all those correlations. Humans are not very good at figuring out how playing with more than three parameters influences some set of metrics, in our case, or whatever it is, outcomes. So it was pretty obvious, but then I can tell you that now it just makes sense. Kubernetes generates data left and right. It is data which is time series. You get it everywhere, from the metrics server or Prometheus, if you have Prometheus, and that can help you monitor CPU, memory, disk, networking. These are all time series data and it’s a tremendous amount of things to analyze and use to take actions.

And then there are logs, more unstructured data. Right now, we look at just, you know, visualizing those logs, filtering those logs, but there are a lot of things to do with them. With parsing the logs too, to actually infer actions that need to be taken, so I think it’s the perfect example for using machine learning. In a lot of fields, it was the same, where you start with a reactive thing where you see the data and then the first approach is something that is again rules-based, which was the first intuition when you do AI, to set a bunch of rules. Like that’s how you made computers play chess. You code those rules, so that would be the same thing, something reactive. I’m looking at the CPU and the utilization is much lower than what I’ve requested, well, I’ve created a rule that would tell you that you can probably reduce the amount of resources that you requested.

And then you can go a step further, which is… with machine learning you take all this data and it can fit a distribution that would figure out if there is a spike, and you don’t even need to hard-code that. You just… it’s falling outside of the prior distribution. And so that would go from a rules-based thing to machine learning.

And then you can go even further, which is instead of being reactive, you can be proactive. You can combine those models with instead of just looking at the past and being like, oh there’s something to happen, maybe you want to change it. You can use a model that would tell you, well I predict that this is the trend, therefore, you need to adjust right now so that doesn’t happen – because there’s a fair amount of latency in Kubernetes, like spinning up pods, it’s not instantaneous, so if you want to scale it up, it’s better to do that beforehand, before you have this surge. So yeah, being an engineering scientist and working in Kubernetes is just amazing because you can find ideas everywhere. Yeah, I’ll pause there because I’ve talked a lot.
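To make the contrast between a hard-coded rule and a fitted distribution concrete, here is a minimal Python sketch. It is an illustration only, not StormForge’s method: the function names, the 800-millicore threshold, the 3-sigma cutoff, and the sample values are all assumptions chosen for the example.

```python
from statistics import mean, stdev

def rule_based_flag(samples, threshold=800):
    """Hard-coded rule: flag any sample above a fixed threshold (assumed value)."""
    return [i for i, v in enumerate(samples) if v > threshold]

def distribution_flag(history, new_samples, z_cutoff=3.0):
    """Fit a normal distribution to past samples and flag new samples
    that fall more than z_cutoff standard deviations from the mean."""
    mu = mean(history)
    sigma = stdev(history)
    return [i for i, v in enumerate(new_samples) if abs(v - mu) > z_cutoff * sigma]

# Hypothetical CPU usage in millicores for one pod.
history = [200, 210, 190, 205, 195, 198, 202, 207, 193, 200]
new_samples = [204, 196, 450, 201]

rule_hits = rule_based_flag(new_samples)         # fixed threshold misses the spike
spikes = distribution_flag(history, new_samples)  # learned baseline catches index 2
```

The fixed rule only fires above 800 millicores, so it never notices that 450 is far outside what this pod normally does. The fitted baseline flags it without anyone having chosen a threshold for this workload.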

Noah: Well you talked a lot about the benefits, but one of the things that really struck me when you presented this topic to us was the vehemence with which you stated it was that Kubernetes needs some love. It’s really imperative. It’s really absolutely necessary that we get some ML in there. Can you talk a little bit about why you feel so strongly about that? 

Thibaut: Yeah, for example as a new user of Kubernetes, you arrive and you set your pods and they’re scheduled and it’s very abstracted away from you – well, you can check where they’re scheduled or things like this – but it’s made so that you don’t have to worry about it, which is great. But at the same time it makes things less reproducible. You’ll launch your app on the same cluster and you’ll do it again, it’ll be scheduled differently, the apps that are running concurrently are gonna create noise, so for me it calls for some kind of probabilistic understanding of what is happening. You cannot say that I have launched my app and this is how it’s gonna perform. You need to figure out, yeah, this is an idea of how it’s going to perform, plus or minus this in terms of performance. So for me the whole infrastructure… Kubernetes is great because it abstracts away a lot of things, but at the same time you also need to help the user figure it out and understand it better, like why you’re seeing what you’re seeing. I think sometimes part of your frustration with Kubernetes is trying to understand why it’s behaving in this way. And then you have to go and figure out how the whole thing was set up and scheduled, and it’s a bit harder to understand.
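The "plus or minus" idea can be sketched in a few lines of Python: run the same workload several times and report a central value together with its spread instead of a single number. The latency values and the function name below are hypothetical, and a real analysis would likely use more runs and a proper confidence interval.

```python
from statistics import mean, stdev

def summarize_runs(latencies_ms):
    """Return (mean, sample standard deviation) over repeated runs."""
    return mean(latencies_ms), stdev(latencies_ms)

# Hypothetical latency (ms) from five launches of the same app on the same cluster.
runs = [120.0, 135.0, 110.0, 128.0, 142.0]
mu, spread = summarize_runs(runs)
print(f"latency: {mu:.1f} ms +/- {spread:.1f} ms")
```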

Cody: Yeah, that’s a good point. I find it interesting, we’ve talked a lot about optimizing and using machine learning at the application level, but really for any component inside Kubernetes we could use machine learning to help optimize, like say the scheduler or etcd, right?

Thibaut: Yeah that’s a great point. Yeah, we can talk about applications. We can talk about the whole cluster. There is the cluster autoscaler, but you can also just use machine learning to say, I have a set of applications that are currently running and now I just want to look at how I can optimize my overall infrastructure, which is, if I’m on GCP, I want to change the type of machines, I want to change the number of nodes, and I want to do that so that all my applications that are running on my infrastructure are still performing well, and I’m reducing the cost. So that’s another type of idea. And then you have, as you mentioned, etcd. With etcd, underneath you have Raft. Raft is an algorithm that is famous in distributed computing, but Raft also has a bunch of parameters that are hard-coded. I don’t know, application timeouts or things like this. They’re hard-coded because they’re based on the experience of people running those things in multiple different settings, but that also calls for machine learning. You could also potentially optimize that, yeah.

Noah: So, we’ve got a question that is slightly tangential, I think, but it’s been asked a couple times now because you brought up the concept of supervised learning versus reinforcement learning. I’m not sure if we can answer this reasonably or not, but our machine learning that we promote, which type of learning are we using and can we even talk about that?

Thibaut: Yeah, so the difference is supervised learning, unsupervised learning. Reinforcement learning is a subset of supervised learning. The idea is you have a set of tasks and you want to take an action that would maximize the reward in the future. So I just want to clarify supervised learning versus unsupervised learning and deep learning and reinforcement learning is part of this learning bucket. Now concerning what we do, I mean I can’t really say being fully transparent about it. We have a patent pending for our set of optimization methods, but what I would say is obviously we have this probabilistic model that I was talking about before because we do think that it is not deterministic because there’s just so many things going on you need to be… and it’s not reversible, so you need to understand the variance in your data and it is obviously yeah… that’s all I can really say about what’s under the hood, unfortunately, because it’s patent pending, but yeah. 

Cody: So, Prashanta said here in the chat, so what we’re saying with ML related to Kubernetes, are you more focused on using ML to automatically load balance and scale up? I think it’s a lot more than that, so Thibaut, if you don’t mind just kind of talking to that for a minute…

Thibaut: Right, so the question is… ML… can you repeat the question?

Cody: Yeah, so they’re asking are we more focused on using ML to automatically load balance and scale up? Obviously we do a lot more than that.

Thibaut: Yeah, that’s one way of doing it. Let’s take the example of scaling up and we can take the example of HPA, for example. It’s a good segue. 

So as we were talking about before, you can be proactive. So you see that there is a trend so therefore you need to scale up and you want to scale up before this happens because it takes some time. But the thing is… so yes, that involves machine learning. But there is also how you’re gonna scale up. What does the number of replicas need to be? How are they gonna scale up? Is it based on the CPU utilization or is it based on some number of requests on the load balancer? And so that involves a lot of parameters and this is what I was trying to touch on earlier. There are parameters left and right when you write your Kubernetes application. It could be the CPU, memory, it could also be whatever parameters you have inside your application, and then for HPA you have a number of replicas, the minimum, maximum. How do you scale up? And it’s hard to figure out how they all play together to ensure application performance, and this is what we do at StormForge. We have this experiment that we run to figure out what is the best combination of those parameters to maintain a level of performance that is adequate for what you’re looking for and reduce your cost.

So yeah, this just scaling up and down is definitely best done with machine learning.

Cody: And that’s just one part of it though, right? So when we’re talking we’re optimizing at say 10, 12, 15 parameters, those combinations can be exponential which is why it’s important to have machine learning do it for you versus trial and error, right?

Thibaut: Yeah, absolutely. For the nerds in the room, this is what we call in machine learning the curse of dimensionality. Yeah, it just gets tougher and tougher. Obviously, as everyone who’s done it manually figures out, you’re just exploring a tiny corner of the state space, and there is a lot of literature showing that in a lot of cases, it’s better to just choose randomly than to pick what you think would be the best. And then you can do better than random. You can do what we have, which is a full-on machine learning model.
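A short Python sketch can show both points: how quickly an exhaustive grid grows with the number of parameters, and what choosing configurations at random looks like. This is plain random search over a made-up cost function, not StormForge’s model; `toy_cost`, the parameter names, and the bounds are assumptions for illustration.

```python
import random

def grid_size(values_per_parameter, num_parameters):
    """Number of combinations in an exhaustive grid."""
    return values_per_parameter ** num_parameters

def random_search(objective, bounds, trials, seed=0):
    """Sample configurations uniformly at random and keep the lowest-cost one."""
    rng = random.Random(seed)
    best_point, best_score = None, float("inf")
    for _ in range(trials):
        point = {name: rng.uniform(lo, hi) for name, (lo, hi) in bounds.items()}
        score = objective(point)
        if score < best_score:
            best_point, best_score = point, score
    return best_point, best_score

def toy_cost(p):
    # Hypothetical stand-in for a measured cost; a real trial would deploy
    # the configuration and measure it under load.
    return (p["cpu"] - 0.5) ** 2 + (p["memory"] - 1.5) ** 2

# With 10 candidate values for each of 12 parameters, a full grid has 10**12 points.
combinations = grid_size(10, 12)

bounds = {"cpu": (0.1, 2.0), "memory": (0.25, 4.0)}  # assumed ranges
best_point, best_score = random_search(toy_cost, bounds, trials=200)
```

A model-based optimizer would replace the uniform sampling with a choice informed by the results of earlier trials.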

Cody: Yeah, someone asked and I think it’s a pretty kind of relevant question today. Do you think machine learning can be a security risk?

Thibaut: So there is a lot of work on that, but it’s funny. It’s not really on the risk for Kubernetes. For example, let’s talk about what people do concerning the risk for image recognition. So what a lot of people have been working on is, we have a set of algorithms that would say that this is a dog, and then people start inserting a tiny bit of white noise into this image, so that you as a human cannot see it. But the algorithm would say, well it’s actually not a cat, it’s a dog, or the opposite – I can’t remember what I was saying initially. But that’s the idea of trying to tweak, to play with the algorithm, so that it would start making mistakes. And obviously it’s a big concern for self-driving cars.

Now in the space of Kubernetes, and especially for what we do here, there’s absolutely no problem, because the type of experiment that we’re doing is just changing the resources, and we do that in dev. I can’t really see anything happening on that side, but maybe if you start working on HCG, but I mean it’s hard for me to figure out what would be the attack without even thinking about the problem fully right now, so… I’ll be curious to hear from people what they think could be a potential attack and that’ll be a lot of fun to discuss. But yeah, as in any other field, when you start creating those new things you have to think about the security perspectives, but…

Cody: Just to add on to that… So, Jeremy says some of the security risk comes from the use of the API to interact with the ML, so what role does DevOps play in orchestrating Kubernetes? I think that’s an interesting question because in the answer, I mean, really, what are you using to secure the communication between the API and the machine learning back end, right? And I think that’s just part of the implementation, how it’s built out, so it’s only as insecure as today’s technology really. What do you think?

Thibaut: Yeah, I’m not sure if the question is more about our products specifically. It sounds to me like it’s more of a how do we communicate between the cluster and our machine learning server? 


Thibaut: Yeah, so you’re not receiving any information you’re asking for from your cluster, which is part of the answer in terms of security. There is probably much more to say about that, but I can’t really go into details right now. But if that’s the question, yeah, that’s the small answer I have about this. There’s nothing you don’t receive anything you just ask for something so oh you’re not opening up…

Noah: I’d say as a corollary to that, most of the security risk is not going to be inherent to the ML itself. It’s going to be inherent to whatever you’re acting upon.

So, we’ve got a bunch of other questions. They’ve been pouring in here. One of them is kind of an interesting question, actually a couple of them, about how ML has changed and if it’s changed any before the pandemic versus during the pandemic? What does it look like over the past year? Have hardware requirements changed? What can you tell us about the evolution of machine learning in the recent past?

Thibaut: Yeah, I can talk a bit about that. Obviously I don’t know everything, but what I would say is… a brief history… What I was talking about in 2012 that created this huge, not a revolution, but made it very popular, a lot of it was about infrastructure. A lot of it was about being able to train those, in the case of image recognition, ConvNets on GPU. So then there was being able to feed more data. Being able to train faster. Being able to have much larger networks. That was the revolution in terms of hardware. I think afterwards there was a tremendous amount of work that was specific to deep learning for a couple of years. And everywhere in the field of speech recognition, natural language processing, but also time series analysis. For example, for the nerds, the attention model was something that everyone was talking about. And then it became quite fancy to talk about reinforcement learning. Reinforcement learning was the buzzword for quite some time and it is great, it has tremendous applications. It is hard to make a product out of it simply because it requires a lot of training and takes a long time to converge. You need a lot of safeguards to make sure that you’re not trying weird configurations, so when you talk about doing that for Kubernetes on the cluster and trying not to mess up the whole cluster and the other concurrent applications, that’s kind of tricky. So that’s in broad strokes the evolution of machine learning. To respond to the change during the pandemic, I don’t think I would say there was a major paper that I’ve seen published in the past year that has changed everything. Obviously DeepMind does tremendous work, and it’s not like we’re looking at seeing how we can apply that to Kubernetes so far. Maybe in a couple of years…

Noah: Right, I see two other questions that I think would be interesting to put together. One is talking about whether or not this is an abstract conversation and is there a preferred machine learning framework for Kubernetes? Another question that I think sort of goes along with that is, if you’re just starting out with Kubernetes, is there a lot of learning and knowledge required to set this up? So, let’s kind of look at the two of those together. Is there a preferred package and what does it take to sort of get that in there?

Thibaut: Yeah, so concerning the framework, I’m definitely not gonna go into the debate of what is my favorite framework. I think we’re long past that in the machine learning community, but it’s definitely something that was all over Twitter for the past couple of years. There is no preferred machine learning framework that would be specific to Kubernetes, which is… I think the way we see it is having your pod be a server with your machine learning model, for example, and then you do whatever you want there. No one really cares about the framework you’re using here. And then I don’t know if the question is in terms of framework. I can’t really see why we would be debating the type of framework here. If it’s in terms of the development process or the serving part, I can’t really go into much detail, but my simple answer would be no, this doesn’t really matter at that point.

Noah: So based on that, how would you say, somebody if they want to get started with machine learning, obviously we would push our product, but in a general sense, what would it take to get started with being able to apply some machine learning to your Kubernetes environment? 

Thibaut: Well the data is everywhere. So if you have your cluster and you have Prometheus set up, you can do a bunch of curl and get all the data you need, and then with your data, you start doing machine learning with any type of package that you want. Now, it’s hard to figure out how to incorporate it with feeding it back to Kubernetes, versus doing machine learning by itself. We can talk about this, but I’m not sure this is what we want to hear about. No, the cool thing is to get data from Kubernetes into a machine learning algorithm and feed it back into Kubernetes to make sure that we achieve the goal that we want to achieve.

Yeah, I think we should just talk about machine learning here. It’s more about Kubernetes, and otherwise it’s a very different conversation. I’m not sure everyone is interested in that. 
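As a rough illustration of "do a bunch of curl and get all the data you need", the sketch below builds a range-query URL for the Prometheus HTTP API (`/api/v1/query_range`), which returns a time series as JSON. It assumes Prometheus is the metrics source; the host name, the PromQL expression, and the time window are placeholders, and the code only constructs the URL rather than calling a server.

```python
from urllib.parse import urlencode

def build_range_query_url(base_url, promql, start, end, step):
    """Build a Prometheus range-query URL; start and end are Unix timestamps."""
    params = urlencode({"query": promql, "start": start, "end": end, "step": step})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"

# Placeholder host and query: per-container CPU usage rate in one namespace.
url = build_range_query_url(
    "http://prometheus.example:9090",
    'rate(container_cpu_usage_seconds_total{namespace="default"}[5m])',
    start=1614297600,
    end=1614301200,
    step="60s",
)
print(url)
```

The resulting URL can be fetched with curl or any HTTP client, and the JSON response loaded into whatever machine learning package is being used.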

Cody: Well, on that, Kurt is asking, so I think he’s trying to make the argument that when ML is used in parameter tuning in places like, say, network optimizations or applications that don’t generally fit or have a standard set of metrics, the applications can actually perform worse after the optimization than what the ML was trained on. Since Kubernetes needs for scaling and load balancing vary widely across applications, what methods and steps would be taken to ensure that the ML is more reliable and flexible for applications that are more chaotic in nature? So I’m guessing it’s specific to network optimizations, like packet dropping and things like that.

Thibaut: Yeah, I’m not sure how to respond to that. I would love to hear about, have a specific example here, of an application for which you don’t have any metrics you’re trying to tune…

Cody: Hey Kurt, if you’re listening, if you don’t mind just kind of elaborating on your question there maybe we can get people to answer it for you.

Noah: While we’re waiting, let’s see what else we’ve got for other questions. One of them is asking, going back to the security topic, do you know of any ML CSP, I don’t know what that acronym is, that are FedRAMP certified?

Cody: I think it’s constraint – I had to look this up – Constraint Satisfaction Problem? 

Thibaut: Not the best acronyms, so I can’t help you.

Noah: …today I learned… oh, Cloud Service Provider!

Cody: Man, I was way off… 

Thibaut: Yeah, no I’m not aware of that. This is a good question. 

Noah: I’m not sure that they necessarily… that one precludes the other?

Cody: Yeah, I’m not even sure of a cloud service provider that has an ML offering that is FedRAMP certified…

Noah: Well actually that comes into another topic that we wanted to bring up which was the new offering from GKE. 

Thibaut: Yeah, Autopilot. Yeah, I love it. I think it’s great. You know, it’s for us, I think it’s a tremendous opportunity. We… Abstracting away all the infrastructure like the way they do it where at the end of the day, you’ll just pay for the resources that you’re asking for, how you provision your pods for a specific application, is… it’s great. I like this. I really like this initiative, and in our case it’s actually even better because once you abstract away the infrastructure and you just focus on the pods, ML here is perfect to help you provision realistically your pods that are for those specific applications, and then at the end of the day, you really know what you’re paying for. One of the issues of trying to tune your pods in a defined cluster is that at the end of the day, you pay for the nodes that you have and so you can optimize. So you can pack a bit more, but with that type of setup, the great thing is that you’ll get a great impact on your billing. 

So, yeah, it’s great. I think it’s fantastic and I’ve been playing a bit with it. The whole Kubernetes API is available in this type of cluster and it’s a lot of fun.

Cody: We do a lot of work with the HPA, right? That’s one of the things that Autopilot doesn’t necessarily configure for you. It’s still a manual tuning process that you’re able to complement that fairly well is…

Thibaut: Autopilot is more like cluster autoscaling. If you look at the demo, you have your application and then you start to increase the load, and then obviously because you have HPA you’ll spin up new pods, and then this is where Autopilot comes in. Autopilot looks at all the pods that are not schedulable right now because you don’t have enough nodes, and this is where it starts spinning up new nodes. So it’s taking care of that side of the infrastructure for you to make sure that your pods will be able to be scheduled, because you’re creating more replicas to accommodate your traffic. So thinking about auto-scaling for Autopilot is thinking about cluster autoscaling.

Noah: I think to take that back to the earlier question, it really sounds like the behavior that we’re looking for out of ML. Anything that’s going to be affecting on, acting upon, doesn’t necessarily have to line up with whether or not any service provider is certified in any particular area. I think it really is a bit orthogonal.

Thibaut: Yeah, I think that’s a good way to formulate that. 

Cody: We got a pretty cool question here that I don’t think we’ve touched on. Are there any ML techniques that are applied to basic kube metrics for future prediction of failure in your Kubernetes cluster?

Thibaut: Yeah, that’s a great one. Predictive maintenance is a big thing and it is a hard problem, but it is something that obviously a lot of people do already in the industry, not in Kubernetes, but in assembly lines, for example. Everyone is all about, like, Industry 4.0, where you have predictive maintenance on some of your machines. So there is a fair amount of work here that is just begging to be transposed to Kubernetes. So I think it’s a great question and it’s going towards there. Like you’re looking at Prometheus data, you’re looking at the time series and what you can get from this. If you start to see spikes or things that are outside of the normal distribution, obviously you can start to create an algorithm that would flag that there’s something bad going on and a potential failure with a probability estimate. That’s a great question.

I think better than that is to do what I think we are working on, which is not to flag that there will be a failure, but to make something robust right away. So this is essentially what we have at StormForge, where when we do an experiment, you want to put a certain load on your system, and we do test all those things. Like we apply a load with, let’s say, a very small configuration and we do see that it’s failing and we’re learning from the fact that it’s failing. So our model learns to change the configuration so that you will have an application that is reliable and is performant enough. So essentially we at StormForge are going a step further than that. It’s not failing because we are doing the stress test and we know that under this type of load, it would not be failing. And on top of it we can tell you what the performance and the cost would be. But yeah, that’s a really good question.

Cody: Kurt has rephrased his question. So he says, for applications that are more unique in their Kubernetes needs, what would be done to prevent bad performance from an ML that could be overtrained on commonplace Kubernetes metrics? 

Thibaut: Okay, so overtrain is… I’m not sure if you’re asking about overfitting, or what we call in the ML community overfitting. So basically, you just have an ML model that is not general enough to handle new cases. So, I don’t know, taking the example of a cat. You have seen a thousand cats. There’s a cat with a rainbow around it and your algorithm is not able to figure out this is a cat. This is what we’re talking about.

Yeah, I think this is obviously an issue all the time in machine learning, but the way to solve it, in our specific case, is to create a load, or various types of loads, that you would expect to see in production. And then, to be completely sure, you continue the monitoring that you currently have and see if anything is changing. So that’s something that we work with clients on: making sure that we’re designing the types of loads that would be representative of what they would be experiencing.

We do have a cool project called VHS, which Noah would love to talk about, which records the traffic and replays it to make sure that we have something that is representative of what’s happening, so that we don’t run into this issue of overfitting and just train an application to perform in a very specific setup. But yeah, no it’s… I understand the question now. It’s very interesting…

Cody: We have one more I’d like to call out from Prashanta. I think this is really, not to answer for you beforehand, but it’s about being proactive so I’m just gonna ask it. What’s the advantage of using ML to predict and update your parameters beforehand versus just load balancing on demand like elastic scaling based on incoming volume like a lot of applications do these days?

Thibaut: Yeah, that’s a fantastic question. So let’s take an example. You have a certain amount of traffic and so, simple example, the base level of traffic needs a certain number of replicas on your UI, let’s say three. Then suddenly there’s a spike in traffic, and you’re very smart: you know that your HPA needs to scale not on the CPU utilization, but on the level of traffic on the load balancer. Well, if you just rely on that, the issue you’re going to see is that when the traffic spikes, the HPA calculates the new desired number of replicas and says, oh, I need to go to five to accommodate that based on the rules, which is the algorithm in the Kubernetes API. And so it will start to spin up those replicas, but they take some time to be up and running to accommodate that level of traffic. So during the whole period where you’re spinning up those replicas, there is just no one to accommodate this burst in traffic, and you have a lot of failed requests. If you are able to have some prediction of the trend in your traffic, then when the spike arrives, your machine learning already saw that it was going to happen and started to scale a bit earlier, so the replicas that you’re spinning up are already available to handle that traffic. You’re reducing the latency, you’re reducing the number of errors, and all of this needs to be combined with machine learning and proper tuning of the parameters of your HPA, which is the number of replicas and how you scale. This is what we have. We have a blog post on the StormForge website where we show that with our method we can actually do this tuning properly: we have a bunch of different plateaus in the load, and we see that by properly tuning those parameters, the replicas will always be available to handle the new level of traffic.
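To make the reactive versus predictive difference concrete, here is a small sketch. The `desired_replicas` function follows the scaling rule documented for the Kubernetes Horizontal Pod Autoscaler, `ceil(currentReplicas * currentMetricValue / desiredMetricValue)`, while ignoring its tolerance and stabilization settings. Everything else is an assumption for illustration: the `forecast_linear` helper is a deliberately naive stand-in for a learned traffic model, and the traffic numbers and the target of 100 requests per second per replica are invented.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric):
    """HPA-style rule: scale replicas by the ratio of observed to target metric."""
    return math.ceil(current_replicas * current_metric / target_metric)


def forecast_linear(history, steps_ahead):
    """Naive forecast (hypothetical): extend the most recent slope forward."""
    slope = history[-1] - history[-2]
    return history[-1] + slope * steps_ahead


replicas = 3
target_rps_per_replica = 100           # assumed target, not from the talk
total_rps_history = [300, 300, 340]    # invented traffic, just starting to climb

# Reactive: scale on what the load balancer sees right now.
reactive = desired_replicas(
    replicas, total_rps_history[-1] / replicas, target_rps_per_replica
)

# Predictive: scale on where traffic is expected to be once new pods are ready.
predicted_total = forecast_linear(total_rps_history, steps_ahead=3)
predictive = desired_replicas(
    replicas, predicted_total / replicas, target_rps_per_replica
)
```

With these numbers the reactive rule asks for four replicas while the forecast-driven one already asks for five, so the extra replicas are starting before the spike fully arrives.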

Noah: So there’s another good question in here. We’ve been talking about machine learning as it applies to the application parameters, as it applies to those various behaviors, but the question is what to use machine learning for in Kubernetes? Are you applying the machine learning model on the data generated by running experiments to determine the best settings, or are you using the machine learning to determine what experiments to even run in the first place? That’s I think an interesting place to apply it. Could you talk a little bit about that? 

Thibaut: Yeah sure. So the way it’s working right now is we talk with a customer about the application and what they want to achieve in terms of performance. If I understand the question correctly, the question is can we actually use machine learning to learn what you want to achieve? That’s not how we approach the problem. It’s interesting, I’ll let you think more about that, but yeah, the way it works for us is more that you have an application that you have been running for clients, and we design the experiment to achieve that goal.

Noah: And I think a good one to follow that up is the question why is this a good problem to be solved by deep RL in comparison to brute forcing the problem? I think this goes back to our machine versus human conversation. Do you want to try to generate…

Thibaut: So there is deep RL that is flagged here. I obviously have never said that we’re doing deep RL. I’ve said that there are a lot of issues with deep RL actually, mostly the time it takes to converge and whether you’re willing to run an application on your dev cluster for that long before knowing if it’s going to perform well. So that’s just to rule out deep RL right away. Then the question is whether to do brute force, which is rules based, versus machine learning. Well, I think rules based is the first approach. I don’t think it’s really hard to prove that machine learning can beat rules based. We’ve done it in every single field. So yeah, I mean rules based is nice because you understand it and you can comprehend it, but at the end of the day, you’ll do something that is based on a couple of parameters where you understand how they play together.

Where machine learning shines is in combining a larger set of parameters and making them work towards a unified goal. That is just much harder to do by crafting your own rules. Rules obviously work well and are very predictable. For example, you look at how the scheduler works. In Kubernetes, it’s a set of rules and it works fantastically well, but it’s… The reason why we’re hearing so much about machine learning is because it has come into all those problems and has proposed a new solution that is more efficient, so I do think it’s a better option.

Noah: And I think to follow on that for what looks like it’s gonna be our last question, is how reliable is machine learning? If we’re looking at this in the context of we’ve got these various algorithms that are going to be providing us with recommendations, what is the confidence that you have in the recommendations that are being made?

Thibaut: That’s a fantastic question and this is why I was talking so much about having some sort of probability or confidence in those predictions. So the obvious way is to hide it. This is how we do it for weather forecasts. We tell you tomorrow is gonna be 50 Fahrenheit. We don’t tell you plus or minus five or ten. So that’s one way of doing it, but I think the most efficient way and the most transparent way, which is what we wanna do for Kubernetes, where you can check pretty much everything that is going on, is to provide some sort of confidence, or at least a probabilistic model that would basically say this is the prediction plus or minus whatever. Therefore, we think you should take that action. But yeah, it is a great question and this is why I think this type of model is the right way to do it.
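One simple way to picture "the prediction plus or minus whatever" is to report an estimate together with an interval instead of a bare number. The sketch below does this for a handful of repeated measurements using a normal approximation. It is only an illustration of the reporting style, not the probabilistic model discussed in the talk; the function name `estimate_with_interval`, the 1.96 multiplier for roughly 95% coverage, and the sample values are all assumptions.

```python
import math
from statistics import mean, stdev


def estimate_with_interval(samples, z=1.96):
    """Return (center, half_width) for the mean of `samples`.

    Uses the normal approximation: half_width = z * standard error.
    With very few samples a t-distribution multiplier would be more
    appropriate; z = 1.96 is kept here for simplicity.
    """
    center = mean(samples)
    half_width = z * stdev(samples) / math.sqrt(len(samples))
    return center, half_width


# Invented latency measurements (ms) from repeated trials of one configuration.
trial_latency_ms = [48, 52, 50, 49, 51]
center, half_width = estimate_with_interval(trial_latency_ms)
report = f"predicted latency: {center:.1f} ms +/- {half_width:.1f} ms"
```

A recommendation that carries its interval lets the reader judge whether the predicted gain is larger than the uncertainty before acting on it.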

Noah: Fantastic, this has been a great session. I think everybody’s learned a lot. We’ve had a lot of great questions. We have plenty more that I don’t think we’re gonna have time to get to. So we’ve got our folks in the background running a randomization, trying to figure out who’s going to be the gift card winner, but I want to give a big thanks to Thibaut for coming out and talking to us for a while, and to my co-host Cody, and to everyone who joined on the call for giving us a lot of great questions. There were a lot of different aspects that we were able to cover. 

Cody: Fantastic.

Noah: I’m totally stalling for time as I wait for the randomization to come back. I’ll just hum the Jeopardy theme for a little bit… 

So this was the first in our Fireside Chat series. We’ll be doing more of these conversations with as many people as we can get in here to talk about the topics that are interesting to them. 

We have a winner! Okay, it’s being announced… 

Brittina: Prashantha, you won! 

Noah: Congratulations. We will be sending out a survey after this to everyone who attended. Please fill it out. Hopefully this was a useful session, you learned a lot, and you can give us some feedback on you know what you thought and maybe what you’d like to see next. We’ll be getting more people here in the future and thank you all for coming. Have a good day. Thank you.