On-Demand

StormForge Office Hours

Air Date: February 18, 2021

Topics of the Week

  • Bad CI/CD practices
  • Machine Learning: When is it a buzzword, and when is it applicable?
  • Application Optimization vs Platform Optimization
  • Active vs. Passive Optimization
  • What are the organizational challenges and how do companies get started with Optimization?

Transcript

Noah: Thank you everyone for joining us. We are up on this session of StormForge Office Hours. The date is February 18? Yeah, 18th. The purpose of StormForge Office Hours is as usual to answer questions and engage in discussion with any member of the community that’s interested in any of the topics that we have to talk about – Kubernetes, optimization, testing, whatever else comes up.

No one’s here to ask us questions though this week. Slightly sad, but that’s okay because we always have The Hat to fall back upon. This week, because everyone is cold, I brought a warm hat – a hat with flaps.

So we’ll be live tweeting these questions so that if people want to jump in and talk about these topics with us, they can join us. So while we are waiting for anyone to join us on the line, hello Caller, thank you for joining. Shall we start off with topics from the hat?

Brad Ascar: Get a good one!

Noah: Okay see what we got, oh this is: Bad CI/CD practices.

Who wants to go first?

Brad Beam: Nope.

Noah: Do I put that one back in the hat?

Brad Ascar: Having a CI/CD process that you don’t trust? I was talking with a customer one day and he called it “Constant Illusion/Constant Delusion” because it didn’t actually provide the benefits that it’s supposed to provide, right, which is to actually have some consistency around what you’re delivering. So when it doesn’t work properly, doesn’t work consistently, then it’s problematic. It’s the right tool for the job, but a tool not implemented correctly messes up that job.

Noah: Okay, do you want to go deeper?

Brad Ascar: Anybody else have any comments there?

Brad Beam: I feel like one of the biggest pitfalls is really just the stability of your CI/CD pipeline and it kind of hinges on something you said where, I forgot what you just said about the new name for CI/CD, but…

Brad Ascar: Illusion/delusion…

Brad Beam: Yeah. But it’s one of those where there’s a lot of promises for it. It’s very valuable, but if you don’t spend the time to put together an appropriate pipeline, put together a resilient pipeline, then it becomes a hassle. It becomes a chore and when you get in that state, it’s not providing value and it’s exactly what you said Illusion Delusion. Yeah, so it’s something that you need to actually put effort in. It’s not going to just magically be there.

Brad Ascar: Yeah, effort and maintenance, right? So things that test things also have to be tested themselves, to make sure they’re still working and testing the way that you expect. There’s nothing worse than relying on information that’s completely wrong, right? You think you got a pass and it was okay, but the test wasn’t right, therefore you keep passing it as okay. It’s not actually testing what you think it’s testing. So inspecting the process – you know, who watches the watchers – you gotta watch the tools and systems that do things for you to make sure they’re continuing to do those things the right way.

Noah: I’ve seen quite a few interesting and weird CI/CD practices over the years.

I think a lot of them end up coming down to understanding what your CI/CD pipeline is actually doing. You get a lot of people who just throw it in and say “oh, pipeline ran,” and whether or not it’s a good pipeline, they don’t even understand what the pipeline itself is supposed to be doing. Like, I deployed my code, and therefore the pipeline took it, so it’s good? But did your code run in production? Did your code run inside your GitLab runner? Those are two completely different environments. When you went to pull it, is your runner even indicative of your live environment, or are you still using the same runner that nobody’s updated from three years ago, using obsolete packages, using an old version of Java, using whatever it is… there’s a major consistency issue in understanding what the pipeline itself is actually supposed to look like. Anyone else want to comment? Anybody else got horror stories?
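One way to picture the runner-drift problem Noah describes – purely as an illustrative sketch, with made-up package names and versions rather than anything from a real pipeline – is a check that diffs the runner’s toolchain against a manifest of what production actually runs:

```python
def environment_drift(runner_env, prod_env):
    """Return packages whose versions differ between the CI runner and production.

    Both arguments are hypothetical {package: version} manifests; a real
    setup would generate them from the runner image and the production image.
    """
    drift = {}
    for pkg, prod_ver in prod_env.items():
        runner_ver = runner_env.get(pkg)  # None if the runner lacks the package
        if runner_ver != prod_ver:
            drift[pkg] = (runner_ver, prod_ver)
    return drift

# A three-years-stale runner shows up immediately:
stale = environment_drift({"java": "8", "openssl": "1.1.1"},
                          {"java": "17", "openssl": "1.1.1"})
```

Run as an early pipeline step, a non-empty result could fail the build before a green-but-meaningless test run ever happens.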

No, that’s a big no.

Brad Ascar: No, no horror stories.

Noah: I was really expecting more terror from CI/CD practices that had gone horribly awry.

Brad Beam: No, the only horror stories I have really are just around the pain points I mentioned earlier. Like, it just needs attention, and you need to have a lot of confidence that it works. We had a pipeline to try to handle the management of, I don’t know, probably six or seven clusters, and the tooling wasn’t quite where it needed to be for it, and it just became a huge pain point any time we were getting ready for a release and actually testing things out. So nothing to the point that we really got burned by it, but it was one of those where we couldn’t trust the process. So just growing pains with that.

Noah: I think tangential to that, one of the problems that I see in a lot of places is the concept of ownership of the CI/CD pipeline. Everyone assumes that the developers are just going to sort of somehow magically maintain it alongside all of the other duties that they have throughout the day, and – this is not unique to CI/CD – when everybody owns it, nobody owns it. So you really need to have somebody own the pipeline if you’ve got more than just a couple people using it. Somebody has to own updating it, somebody has to own the tooling, somebody has to own even the question of, you know, is our version of our CI/CD software up to date? If nobody owns that, then it’s just gonna fall over.

Yep, well that was less climactic than I thought it was gonna be. Should we pull another topic from the hat? Maybe we’ll get some more lively conversation?

Brad Ascar: Welcome to our two new guests!

Noah: They are also our people. Okay, this is the joke one. I pulled out the one that says “Sandwiches.”

Brad Beam: Oh, sandwiches.

Noah: Do we want to talk about sandwiches? I mean have we gone that far? I don’t think we have.

Nope, we already had that one…

We already had that one…

Brad Beam: I mean for me, a little roast beef, provolone, and mayo…

Noah: I’m with you on that one.

Here’s a good one that we haven’t done before – machine learning. And this ties into the Fireside Chat that we’re going to be having next week.

We’re going to have Tibo come on specifically to talk about this topic for a while. Machine learning: when is it a buzzword and when is it applicable?

When have you seen machine learning in the wild be a buzzword, instead of actually being real?

Brad Ascar: Our marketing people are going to hate this, and it’s not the case here, but I’ve been places where people are like, “well, if we put this on here it will look more attractive, right? It’s a buzzword everybody’s looking for,” and it’s the very thing that I warn people about. One of the things for me in actually coming over to StormForge was making sure that the machine learning was actually machine learning. Like, there are machine learning scientists, people that really know what they’re doing in the space, because it is its own discipline, and just adding it as a marketing shtick doesn’t make it so. Just having a script that does something doesn’t make it machine learning, and so I think it’s the rigor behind it, and the actual work behind it as a science, that makes a difference.

Brad Beam: Yeah, I know in the past I’ve been caught in the hype train of a logging platform that said they had machine learning and everything else incorporated into it to better detect noise, events, whatever else, but that was just nonsense. Especially when you still needed to explicitly state, like, here’s the format of my log entry, here’s this, here’s this, here’s this – you very, very quickly realize that was just a whole lot of fluff. You got sold, and not what you needed.

Noah: We’ve got a good question in the chat.

What are three questions companies can ask vendors to test the reality of their ML?

That’s a good one. If you were going out there and you wanted to actually see whether somebody’s ML was realistic or if it was entirely smoke and mirrors and buzzwords, what would you ask? What would you folks ask?

Brad Ascar: I think it’d start with: what kind of data sets do you need to be able to do the kind of machine learning that you do? Different kinds of machine learning require different kinds of data sets. I would ask how they train the machine learning, and also how they evaluate the efficacy of what they’re doing, and what kinds of technologies they use. There are a bunch of different practices, a bunch of different kinds of networks that you can use. What is it that you’re using for doing that, right? If they can’t tell you those things, it becomes clear pretty quickly that they’re not doing machine learning.

Noah: We come back with a whole lot of bash scripts then… bunch of pipe regret…

Brad Ascar: No, that’s not machine learning.

Brad Beam: I think there’s definitely a balance between how much manual effort is still involved versus what the machine learning is actually doing. I know one of the things that we get criticized on a lot is that our experiment definition is difficult. And there is some manual work to do with that, but once you get that up and running, it goes. There’s not really much hand-holding you need to do, or things like that, to really have the machine learning go at it and do what it needs to do. But if you’re in there all the time constantly tweaking this or that so that the machine learning can actually work, and you’re having to do all this legwork that could otherwise be handled there, I think that’s a huge red flag.

Noah: Any other comments? Any other questions?

Brad Ascar: I was gonna say, you need to have a hat for when our machine learning people are on here with us.

Noah: I really do. I need to have a separate machine learning hat. Shall I dive back into the hat again?

We don’t have quite the excitement that we had last time, but that’s okay.

Here’s a good one: application optimization versus platform optimization. Here’s one that I would expect people on this call would have a lot to say about, and go…

Brad Ascar: Other Brad go first.

Brad Beam: Both are important.

It’s one of those things where, as you start to look more at it, you have this macro versus micro scale – I have my platform with every different component on there, versus just here’s my application – and I think a lot of the time you can get hung up on one or the other instead of looking at the holistic picture. I come from a systems background, so I very much think about going after the platform and everything else, and only very loosely get into the application side of things, because I just like systems. But in terms of optimization and performance testing, looking at the whole thing, you really have to be mindful of what your goals are. If you have some website, e-commerce site, whatever else, then the customer experience – how quickly they can actually load the page, find what they’re looking for, see the inventory, get the purchase price, check out, do all that stuff – that’s the critical flow that you need to be concerned with. If instead you’re worried about “oh, I can save off, you know, two milliseconds of latency on my storage back end,” well, that’s great, but if that doesn’t actually come forward into the end user experience, that hot path, you’re spinning your wheels. So you need to be very mindful of what the actual hot path is, what the user experience is, what you’re actually measuring or trying to measure, and what you’re trying to optimize for, for that to come into play and for you to actually be on the right track of doing what you need to do.

Brad Ascar: I tend to look at it – having been a systems guy for a long time, but an app person longer than that – as looking at quite different things. I kind of liken it to cars being carried on a train. You can optimize the car, and it’s a totally different thing than optimizing the train. Some of the things are the same: you’re trying to shed weight, and when you’re pulling things around with trains, the less it weighs, the more efficient you’re going to be. But they’re very different things. You’re really optimizing at the application level quite differently than you are at the platform level. The platform level is again looking at macro concepts, looking at things in bulk and aggregate; scaling behaviors there are going to be quite different than application scaling behaviors, and so they’re two totally different practices because of the scope of their boundary, right? The app has a very defined boundary, and that is delivering those executables, as Brad was talking about, and then delivering the user experience piece of it. Now, if the platform’s wrong, that impacts the application itself, but given that the platform’s operating right, how you scale, how you maintain, how you operationalize, how you monitor the platform is a completely different thing, and just because you can pull switches and do scaling at that level doesn’t necessarily translate down to the application. So both are very, very different practices for very different reasons.

Noah: So there’s a question in the chat about what would you say is the difference between optimizing in a passive versus an active sense and where do we fall in that.

Brad Ascar: Interesting. From a reactive performance standpoint, things that work reactively of course work reactively; some of them have some look-ahead kinds of capabilities, but ultimately they react to what went wrong, what’s going on at the time. I have this conversation with our customers – you get the difference between doing something proactively and reactively. Reactively, I may be looking at my application – the application’s got a particular configuration or particular architecture to it – and when you’re doing it reactively, you’re generally relying on some sort of monitoring tools, and those monitoring tools are telling you the state of what is. The state of what is, to me, is similar to the dashboard of an airplane. The dashboard of an airplane tells me how fast I’m going, how far off the ground I am, how much fuel I’m burning – all the things I need to know to fly that airplane correctly. What that dashboard can’t tell you is: am I doing things most efficiently for what my end goal is? So I’m flying along in the Learjet and everything’s trimmed for as much efficiency as possible for that Learjet, but the work I need to do is actually carrying cargo. Not the right tool for the job. So I’m doing everything as fast and as best as I can – I can go to a higher altitude or whatever – but if I actually need to carry, you know, 100 times the weight that that thing carries, I’ve got the wrong tool for the job. Proactive is going back to the drawing board and saying: what is it I’m trying to do, and then how does this need to be configured to do the job? And in doing that, you go all the way back to the drawing board and the wind tunnel for an airplane design and say, if I’m going to carry a lot of cargo, what does that look like? And we’ve seen those planes where the noses tip up and have the huge doors on the side for handling cargo – quite a different design than a Learjet.
So you can do things in a proactive manner and ask questions and do what-ifs in a totally different way than just tweaking what’s running at the time. You can only tweak so much on the airplane that’s flying, and what you can’t do is say, well, let’s just have a hundred times the capacity. That’s not gonna work. You’re flying an airplane; you’re within those parameters. So there’s a big difference between reactively responding to something and proactively asking: what is it that we’re doing in the first place? And so I think that’s really important, and that’s the area that we specialize in – taking that step back and saying what do you want to do, and let’s create the experiments, and let’s create the way that you test that, to determine what the application should look like, because that answer could be very different than what you’re currently running. What you’re currently running may serve its purpose, but you may have another purpose for it, and that’s really where the proactive side comes in.

Brad Beam: I think too, with looking at how you handle performance measurement and things like that in terms of an active/passive role, one of the challenges – and this is where ML really shines – is that if you’re going through the effort of trying to performance test or benchmark your application to understand how it responds in different load scenarios, then trying to keep track of what you tuned and how that impacts performance, even just going with a spreadsheet, is a heavy, heavy lift, and finding out where to go next from there is really just a crapshoot. So I think that’s one of the strengths that we really got from having a machine learning back-end service: it takes out a lot of that cognitive load of what do I need to do, and what do I need to do next, and how did this parameter impact this measurement?

So if you can get it to the point where, when you’re going through these tests, it’s more passive – there’s nothing you really need to do, and you can let whatever orchestration platform facilitate all of the suggestions and what things should be looked at, and try to correlate those together – you’re going to have much, much better results.

Brad Ascar: The other advantage machine learning gives is that the machine learning doesn’t have any presuppositions as to what these things do – as to how a replica affects something, how much memory affects something, how much CPU affects something. I do, as a professional practitioner with a lot of years in. I’ve tuned a lot of things, so I have my own suppositions: when I see this kind of thing, what do I do? I give it more CPU. I give it more replicas. I give it whatever. Sometimes that’s the right way – not most of the time, though. Most of the time it’s not going to react the way that you expect, so then you’re going to tune something else, and then you’re going to tune something else. And because I’m human, I don’t tend to say, I tried one thing, I raised this and that didn’t do it, so let’s lower that back down and raise the other thing. I just raise the next thing, raise the next thing, raise the next thing, and what I end up with is that I finally might get my answer, but I’m massively over-provisioned, because I’ve given way too much to get the output that I wanted – and that’s because sometimes the information I have isn’t actually true. When you scientifically experiment, the machine learning only works on truths, because it measures and then has the results, but my gut instincts aren’t always right, and the way that I test isn’t the same. And so that’s a very big difference: there’s no presupposition or bias from the human in the system. Now, you do have to create the proper experiment, which is a place where bias can be introduced, but once you have the experiment correct, the human is factored out of it, and that’s kind of the important part, because it has no bias.
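Brad’s point about the raise-only habit can be sketched with a toy example. Everything here – the cost surface, the parameter ranges – is invented for illustration and is not StormForge’s actual method; the idea is just that a search free of presuppositions, which is willing to lower parameters, can find configurations that a tuner who only ever adds resources never will:

```python
import random

def cost(cpu, mem):
    # Toy cost surface: the sweet spot is cpu=2, mem=4, so
    # over-provisioning is penalized just like under-provisioning.
    return (cpu - 2) ** 2 + (mem - 4) ** 2

def greedy_raise_only(cpu=4.0, mem=4.0, steps=20):
    """Mimic the human habit: only ever try *adding* CPU or memory."""
    best = cost(cpu, mem)
    for _ in range(steps):
        for cand in [(cpu + 1, mem), (cpu, mem + 1)]:
            c = cost(*cand)
            if c < best:
                best, (cpu, mem) = c, cand
    return best  # stuck: from here, every raise makes things worse

def unbiased_search(samples=200, seed=0):
    """No presuppositions: sample the whole range, lower values included."""
    rng = random.Random(seed)
    return min(cost(rng.uniform(0.5, 8), rng.uniform(0.5, 8))
               for _ in range(samples))
```

Real optimizers use far smarter sampling than plain random search, but even this naive version beats the raise-only heuristic on the toy surface, simply because it carries no bias about which direction helps.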

Brad Beam: Yeah, and that’s, I think, one of the interesting things that I’ve seen as we’ve started doing experiments with customers. As a platform engineer, or in any sort of SRE type of role, you typically have a fairly good mental model of how you think the platform operates: which application is CPU constrained, which one is memory constrained, which one has the network as its bottleneck. But as you start to introduce machine learning, which pretty much starts from a green field of “I know nothing about how these components interact,” it’s very interesting to see the outcomes you get, because they don’t always align with what you expect. I think a lot of that comes from the fact that you understand how each component works independently, but you miss the larger picture of how they all operate together. So even though I might have some web front end that’s CPU constrained, it doesn’t make sense to give it all the CPU in the world if, at the end of the day, I’m limited by however fast my back end can process. So it’s interesting. It sometimes gives you a little pause, like, this doesn’t quite align with my expectations for how this application should perform, but when you look at the whole picture of things, it actually paints a really interesting picture.

Brad Ascar: I think also, just with the complexity of complex systems – and we’re talking about a lot of complex systems – we don’t necessarily know how every component works, right? Take a database that you use all the time and have used a lot: what are its fallback and saving strategies when you get to 80% memory or disk utilization? How does it start behaving differently? When does it panic and start dropping stuff down to disk that it would normally hold in memory? There’s a certain amount of caching that it does, but how does the caching behave differently based on the number of transactions coming through the system, and how fast it can write its back-end logs, and things like that? So even these complex systems have fallbacks for when they get to the edge of their limits. When you’re running experiments that push things to the edge of their limits, you then test those limits, and if you’re at the edge of what’s happening in the database layer and in the messaging layer or whatever, and you’re pushing those limits, you also experience what they do when you hit those edge cases. And they may be edge cases that, as a human, you never noticed, or you may have had great systems that were always running at 50%, 60%, and weren’t pushing 90%, 95%, 100% utilization – and that difference makes a very big difference in how it handles. And so one of the things that I’ve run into in my career, in trying to diagnose really high-scale kinds of problems, is that there are anomalies introduced into large complex systems, and with some of them you can see the pattern – exactly every five minutes this is happening. Why?
And then at some point somebody digs in and goes, “oh, this thing – if you don’t give enough memory to this part of the database, to this queueing system, it starts writing to disk in a different way and then pulling that back out when it gets less busy.” It starts doing weird things, and in large complex systems you kind of have this cascading problem, where one part of the ecosystem impacts another part of the ecosystem and therefore gives you dynamics that you never saw before, because you never actually had these two things working together before. And so that’s also a great thing you find out when you’re proactively testing an application, or a suite of applications, whatever it is that you’re testing: you build the experiment for expected load, unexpected load, breaking the expected load – how does it behave? – so that you understand the places where you may have had gaps in your knowledge of how it worked. But truly, the machine learning doesn’t care about that when it’s testing. It just goes and says, let’s go do this thing, and then it measures the results, and it doesn’t lie.

Noah: So, I wanted to take this in a slightly different direction and we have another question in the chat that I think kind of goes in the same way that I wanted to take a little bit. Ignore the backing vocals from my leaf blowers out there. I was thinking about since this is supposed to be a session where ostensibly users, customers, whoever could come ask us questions, my thought was along the lines of how does someone really get started with sort of understanding what they’ve got and the question from the chat is you know sort of orthogonal to that, why have companies historically not had optimization strategies for k8s in the cloud and I think those two sort of go hand in hand. What are the difficulties and how do they overcome them and how do they get started if somebody wants to get started with really looking at optimizing their platform and not really understanding what to pull in and what to bring in.

So when we were building out our platform, one of the challenges we had was that, since we were on-prem, it didn’t really matter so much in terms of resource utilization, because we had fairly beefy nodes – we kind of specced out enough to get whatever level of resiliency we needed – but then when it came time to actually start bringing on end user applications, we started to be a bit more concerned about cluster utilization, how much capacity we had, and things like that.

In order for us to answer those questions, it took some time. The way we ended up addressing it – and it definitely wasn’t the best option, but it was an appropriate option for where we were at – was to use our monitoring tools to see what our resource consumption was like in our pre-prod environments, give it some fairly arbitrary buffer – 10%, 20% – and then put those numbers in for production. A lot of that was struggles with tooling, and not having something there to help figure out what those numbers should be and what the sweet spot was. Another part of it was actually load testing. If I have an LDAP server, how do I properly load test that? How do I make sure that I have enough resources for my LDAP server to accommodate whatever load is coming at it, based on this many users? Based on these many applications that integrate with LDAP and then host their own users? There were just a lot of things that we didn’t have a good tooling strategy for, that we didn’t have a good practice around, so really it was kind of wing it and see how far we could get with that. We had enough resources that it wasn’t really a hard constraint for us, so it was trying to just limp along as far as you can, and then, when we actually got to a point where we were seeing either instability or latency that was too high – which again was some arbitrary number we picked – then it was time to go back and look at what we did, and more often than not it was just buffering up, adding additional resources to those applications.
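The “observed peak plus an arbitrary buffer” approach Brad describes is simple enough to write down. This is just a sketch of that back-of-the-envelope arithmetic, with made-up numbers – it is the workaround being critiqued, not a recommended sizing method:

```python
import math

def buffered_request(observed_peak, buffer_pct=0.2, round_to=50):
    """Turn a pre-prod peak (e.g. CPU millicores or MiB of memory) into a
    production request: add an arbitrary safety buffer, then round up to
    a tidy step so the manifest numbers look clean."""
    raw = observed_peak * (1 + buffer_pct)
    return math.ceil(raw / round_to) * round_to

# e.g. a 480m CPU peak in pre-prod with a 20% buffer becomes a 600m request
request = buffered_request(480, buffer_pct=0.2)  # -> 600
```

The weakness is exactly what the discussion points out: both the buffer and the rounding step are guesses, and nothing ties them back to what the application actually needs under real load.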

So a lot of my read on that was just that the tooling at the time wasn’t mature enough, wasn’t at a point where it would provide value, and I think that was the biggest struggle that we had with it.

Brad Ascar: So I read this question differently – it’s not that companies haven’t historically had an optimization strategy. Sometimes they have, and it’s this: I don’t want to get called at 3am when there’s a problem, so the way I’ve optimized is for not getting calls at 3am, and the way I do that is to just throw too much memory, too much CPU at it. It doesn’t always help, but a lot of ills are fixed by that. But it’s just like anything else – you’re massively changing the efficiency of what you’re doing, and so you’re way, way over-provisioned. So it’s what you’ve optimized for, whether you realize you’ve optimized for it or not. Now, we tend to think of optimization as using the least amount of resources possible to give the maximum amount of performance possible for what we’re doing, and holding those in tension, and the tooling has not been great around that, to Brad’s point – that’s the challenge people have in trying to do it. I think one of the other interesting pieces – and we talk about this a little bit – is the way that we do experimenting. We send load against the application and measure how the application behaves, so this measurement side is always a very important part of it. To me, that’s the dashboard of the airplane you’re in, telling you how this is performing. It’s very important to have those measurements, to see how it’s behaving, so that you can determine what the optimal strategy is. The challenge is that you’ve also got the charts on the other side, which is the load – sending in load to simulate whatever it is you’re trying to simulate.
When there’s a problem – and that’s one of the reasons we run a bunch of trials in our experiment, to see how it behaves – I can see the graph on this side of the load that’s coming in, and I can see the graph on this side of how the application’s behaving, and I always say, I can see that at five minutes after the hour there’s a problem and it starts going bad. Now, as a human who doesn’t necessarily understand this system, what do I do with that information? I see the problem; I don’t know how to fix this problem. Sometimes there’s tribal knowledge, somebody in the organization – oh, when you see the thing that looks like that, you go do this – but if you can’t interpret what you see on the performance side and take action against it, then you’re just admiring the graphs, not acting on them. And that’s really, for me, the biggest thing, and it’s kind of the breakthrough thing, right: for us to be able to do this in a way that you don’t have to interpret that graph. We actually measure the performance with the machine learning and determine that that’s not an optimal way to go, and then it tries other things to find that optimal place. As those things are happening, part of what’s happening in the machine learning is that it’s learning how the application behaves, so it says, oh, when you change that, this really gets ugly – let’s go change something else, because it doesn’t behave properly. And that’s the advantage, because in complex systems it’s sometimes not just one thing; a lot of different things go wrong all at the same time, which causes these kinds of errors. And the more complex the environment is – Kubernetes is a complex environment – the more things can be affected, and even if you were instrumenting all of the things, which is itself a challenge, when you see the problem, what do you do about it?
I’m looking at the graph, I have no idea what to change to actually make this better.

Brad Beam: Yeah, and to kind of tail onto some of that: when you have those graphs and you’re looking at them and acting on them, they’re great and they’re helpful, but there’s a huge question of whether they’re actually the right thing. And I feel like, with how fast Kubernetes picked up and how fast the transition was into Kubernetes, containers, and all that, there was a huge knowledge gap too. I have 30 applications now that I need to be mindful of, I have 100 applications that I’m responsible for – I don’t have the depth of knowledge to know all the intricacies of each of those. So instead of being that highly tuned specialist who knows I can tweak this, I can tweak this, I can tweak this, you’re left in much more of a generalist position: I know enough about this to operate it, I know enough about that to operate it, and I can work with that. But the scale at which things have migrated to Kubernetes, to containers, to cloud scale, is much quicker than engineers have been able to scale their internal knowledge of how the application works, how the platform works, and things like that. It’s a huge, huge catch-up game, and that’s where I feel like the more we can build out tooling to help with that, the better off everyone is going to be.

Brad Ascar: It was also easier when everything was being built as monoliths, because you could measure how the monolith was behaving. A lot of times you could have the same tooling on the same system and trace things from beginning to end, right? By nature, going to microservices and containers, it’s now distributed. Just like it was always a challenge, and still is, for people to write properly threaded applications, running things in a bunch of different containers just moves your threading to a different place, but it’s the same problem: how do you actually do that well, and then how do you trace through it when there is a problem? How do I diagnose what’s going on when I’m bouncing across a bunch of different systems, interfaces, and containers to troubleshoot something that previously might have been easier to troubleshoot? On a single system, it was a monolith; with multiple systems, a three-tier architecture, you could just look at what’s going on on each tier. Now it’s a lot of different tiers and a lot of different pieces, all distributed, and distributed debugging is hard.

Noah: So one of the other questions that came through in the chat, and I think it emphasizes a lot of what you’ve said, was: can you manually optimize Kubernetes and your cloud usage today? Why not just do that? We touched a little bit on that, but I don’t think it’s clear to the people watching just how painful it is to attempt that without having something like us in place. Anybody want to take a stab at walking through what it’s like to optimize a cluster or an application manually, in boring and awful detail?

Brad Beam: I mean, sure. So you start off with a Linux server, right? And I know Kubernetes supports Windows too, but we’ll just talk about Linux, because that’s where I have experience.

So you start off with the Linux server, and this is applicable to clouds too, just a cloud server, not necessarily Kubernetes. You’re looking at the system and figuring out what kernel tunables you need to adjust: what does your network stack look like, do you need to change any buffer sizes, do you need to change the I/O scheduler, do you need to change the CPU governor, do you need to change any weird BIOS settings, what does your caching look like? There’s a lot to be concerned with just from the host itself before you even get into the application. Then it’s understanding the application patterns: is this an I/O-heavy load, is it read-heavy, is it write-heavy, where do I need to adjust my tunings for that? Do I need a caching layer on top of all this, do I need Memcached, do I need a queuing system? Do I need to be worried about it being a Java application, do I have to adjust the heap and the garbage collection, do I need to tune JVM args? So on a typical cloud server, that’s not necessarily excruciating detail, but those are a number of the things you’re looking at. Then you add Kubernetes on top of that, and the complexity goes up a lot more. You now have to understand all those components at the bottom to see how they impact Kubernetes itself, but also how they impact your application. And with Kubernetes having the CNI plugins, you now have to understand networking a bit more too, and a lot of those operate in a very different manner from each other: you have Flannel with VXLAN encapsulation, or Calico, which is BGP-based IP routing, or Cilium using eBPF. There are a number of different technologies involved in each of those, and that’s just the abstraction on top of the low-level network stack.
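To give a flavor of the host-level knobs being described, here is a minimal sketch that reads a few common kernel tunables. The specific sysctls and fallback values are illustrative choices, not recommendations; the right values depend entirely on the workload.

```shell
# Illustrative only: peek at a few of the host-level tunables mentioned above.
# Falls back to common default values when /proc isn't readable (e.g. non-Linux).

read_tunable() {
    # Print the current value of a sysctl, or the given fallback if unreadable.
    path="/proc/sys/$1"
    if [ -r "$path" ]; then
        cat "$path"
    else
        echo "$2"
    fi
}

somaxconn=$(read_tunable net/core/somaxconn 4096)   # TCP accept backlog
rmem_max=$(read_tunable net/core/rmem_max 212992)   # max socket receive buffer
swappiness=$(read_tunable vm/swappiness 60)         # kernel swap eagerness

echo "somaxconn=$somaxconn rmem_max=$rmem_max swappiness=$swappiness"
```

And that is just three of the dozens of host settings in play, before any application-level tuning starts.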
Then you have the application stuff inside containers, and there was a neat article a while back on a bug someone found with CPU throttling: when you set limits for your containers and your container is too active and hits that limit, all of a sudden you’re throttled on your CPU usage and you have poor performance.
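The throttling mechanism mentioned here can be sketched with a bit of arithmetic: a Kubernetes CPU limit is translated into a CFS quota per scheduling period. The limit value below is a hypothetical example; the period is the kernel’s default.

```shell
# A rough sketch of how a Kubernetes CPU limit maps onto the Linux CFS
# quota that does the throttling. The limit is a hypothetical example.

limit_millicores=500    # e.g. "cpu: 500m" in a pod spec (assumed example value)
period_us=100000        # cfs_period_us, the default 100ms scheduling window

# quota (microseconds of CPU per period) = millicores / 1000 * period
quota_us=$(( limit_millicores * period_us / 1000 ))

echo "cfs_quota_us=$quota_us cfs_period_us=$period_us"
# A container that burns its 50ms of quota in the first half of each 100ms
# window then sits throttled for the rest of it, even if the node is idle.
```

That stall in the tail of each period is exactly the kind of poor performance the article described: the limit looks generous on paper, but bursty workloads hit it repeatedly.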

You then have to be concerned with what your state looks like. Do you have an external database for managing that? Because with Kubernetes, if a pod dies or your server dies, it’s going to get rescheduled, which is great, it helps with optimizing for not getting called in the middle of the night, but then you have to be concerned about how the storage moves with that, whether it’s something backed by Ceph or Rook or… OBS? No, that’s the streaming stuff, I can’t think of the other one, but either way you have to understand the storage stack for that too. So you’re looking at the network not only from a single node now, but from the entire platform. You’re looking at storage not only from a single node, but across the entire platform and how it moves with you. You have to worry about the state of the storage system if you’re running on-prem: do you have enough replicas, is your cluster degraded, are you using erasure coding? You have to be concerned about all these different things, so the number of components just starts adding up and adding up and adding up, to where it’s frankly obnoxious to be an expert on all of them. So then you’re in this mode of: here’s everything I need to handle…

How? You’re almost in analysis paralysis because there’s so much to do. So like Brad mentioned before, it’s a matter of what you optimize for. I’m going to optimize for not getting called at 3am, because I don’t want to be. That’s going to cost however many extra dollars in capacity, but if it makes things easier, I’m probably going to lean toward that and say, you know what, I need beefier servers, because I just don’t have the ability to dig into all of this. Then getting visibility into all of those things, getting to the point of being able to see what’s going on with your system, and then having the knowledge of how to address it, that’s significant. There’s a lot of work that goes into that, and into the whole cycle of digging deeper to find out what’s going on and finding the root cause. So yes, you can manually optimize it, but you have to be aware of every single stack you’re working with. Going from bare metal, to potentially a virtual machine on top of that, to containers on top of that, there are at least three or four layers involved, each an area that could be impacted. If I change something on the bare-metal machine, how is that going to affect how the VM performs, and then how does that impact what’s running on it from an application standpoint? There’s just a lot. Is that excruciating enough detail, or is that still too high-level?

Brad Ascar: Well, in what you just described, you’re optimizing for all the things you need to know about the system. What you didn’t get to yet is: what does actual load look like at that point, to see how the application really behaves? Because you make decisions and trade-offs based on assumptions, some of which may or may not have been the right assumptions, about how a complex system interacts with itself. Now you introduce the actual load, and the first time anyone determines that maybe both of the Brads made bad architectural decisions on this application is when you’re running into a problem and all of a sudden it’s not working for your customers, right? So you have an operational risk and a reputational risk in the way your application performs. The better path would be actually putting the load you expect against the application, and maybe 2x that load, and maybe 5x, and maybe 20x if you’re in retail and need to handle Black Friday kinds of events, right? You need to see how it performs under that load so that you can start making tweaks and go, okay, now it’s blowing up; let’s figure out what I need to change to make this thing behave better. And you still have all of those things Brad described as tunables you might change when it doesn’t do what you expected. Even if you architected it and made all the right decisions for performance, it may be that all the right decisions were made and it’s still really expensive to run, right?
And so, you know, there are cars that can go really, really fast because people spend millions of dollars on them. Most of us don’t have millions of dollars to spend on a car, so reality comes back in: I need to be able to do this within my budget, and my budget is probably not unlimited if I’m in some sort of organization where I charge customers for my goods, and therefore the cost of my goods can’t outrun the amount of money coming in the front door. So begin with the end in mind: what is it I want to do, how much load does it need to be able to take and handle, and then how do I tune the resources to their most efficient values so that I can actually make money at the end of the day? Because if you just want to throw away money, I’ve got a lot of great ideas about how to do things super inefficiently; I can order the biggest thing Amazon offers at every tier of my application, but that’s a whole different story, right? Doing it efficiently is the point. So it really comes down to this proactive approach, and we keep saying proactive, but really, at the end of the day, begin with the end in mind. Create the load, create what it is that you want it to do, and then experiment so that you can determine whether the design is up to it. And if you’re going through and saying ten different things need to change, but you have a thousand different options for each one, or say even a hundred, start doing the math. A hundred options, times a hundred, times a hundred, and so on: you very quickly get to billions of combinations and beyond. Create yourself a spreadsheet and try to walk through every possible combination; you don’t have enough time in your lifetime to do that if each test took five minutes. That’s the reality.
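The back-of-the-envelope math above can be sketched directly. The parameter counts here are the illustrative figures from the conversation (ten tunables, a hundred candidate values each, five minutes per test), not measurements.

```shell
# Search-space math for the scenario described: 10 tunables, ~100 candidate
# values each, 5 minutes per test. awk is used because these counts overflow
# 64-bit shell integer arithmetic almost immediately.

awk 'BEGIN {
    params = 10; values = 100; minutes_per_test = 5
    combos = values ^ params                          # 100^10 = 1e20 combinations
    years  = combos * minutes_per_test / (60 * 24 * 365)
    printf "combinations=%g exhaustive-search-years=%g\n", combos, years
}'
```

Even sampling one combination per second instead of per five minutes leaves the exhaustive search hopeless, which is why guided search (like ML-driven experimentation) rather than enumeration is the only practical approach.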

Brad Beam: Yeah, I mean, that’s really the reason why, oftentimes, people just deploy things at the defaults, see how it runs, and then once there’s a hot spot, start digging into it. It becomes much more of a reactive methodology: wait until there’s a fire, then go put it out. And that works. It works, it’s not ideal, and it’s definitely a major source of burnout for the ops team managing it, but it’s a completely valid strategy.

Noah: And it’s an expensive strategy.

Brad Ascar: It’s an expensive strategy because what’s the cost of doing that? I’m an expensive spreadsheet filler.

Noah: Well, that was a good one. I like that…

I like when folks get excited and slightly angry about these topics. We’re coming up on our hour, so I think it’s probably getting close to time to call it for this session, unless folks want to go back to the sandwiches topic. That’s totally valid, you can.

Brad Ascar: I’ve got an affinity lately for bánh mì sandwiches… or Korean barbecue, either one.

You know what, now that you’ve mentioned that, I might have to go get myself some bánh mì for lunch. I’m usually more of a Reuben guy; I like that sauce on there. That’s so good, on a nice seeded rye. I’m very happy with that.

Brad Beam: It’s all about the kraut for me on Reubens.

One thing I’ve been disappointed with after moving out here is we don’t have just assorted subs. I miss that from New York.

Noah: Having originally been from the Northeast and moved out to the desert, I do miss the pizza-and-sub shop that was on pretty much every corner; those don’t really exist so much out here. Anyway.

Thanks for talking about sandwiches, and everything else, which I suppose is also important. We’ll be back here in two weeks, hopefully with more people to ask questions. It was nice having some questions come in from the chat to help steer things, and we’ll see everybody at our next session! I’m going to end the recording now.

Brad Ascar: See you next time.

Noah: Bye, folks.