The Real Definition of Observability & Why It Matters

Fireside Chat with Charity Majors

Air Date: June 23, 2021

Noah Abrahams: So one of the things we always like to start with, something you already started to lean into, is your story. How did you get here? How did Charity become CEO of a company?

Charity Majors: CTO now. I was CEO for the first three and a half years. I’ve been CTO for the last year and a half.

Very long and winding and strange road I guess. I was homeschooled and I grew up in Idaho in the middle of nowhere. Didn’t have computers growing up and went to college on a classical piano performance scholarship. But pretty quickly figured out that all musicians were poor and I didn’t want to be poor.

So I swapped keyboards. I’ve always loved just kind of tinkering with computers, so I got jobs as a system admin. Came to San Francisco shortly thereafter and I’ve been ops-ing ever since. I kind of made a career out of being the early, the very first infrastructure engineer that gets hired into a team where the software engineers think they have something that’s real, but they need somebody to help them help it grow up. So that’s what I’ve been doing. I did it at Linden Lab. I did it at a couple startups that nobody’s ever heard of, and then Parse, which was acquired by Facebook, which leads us here.

Noah Abrahams: So that’s a bit of a winding tale. How does that get you to observability as your domain of expertise?

Charity Majors: Well, you know at Parse we were building this giant multi…. For those of you who aren’t familiar with it, Parse was kind of like a Heroku, but for mobile. You have a mobile app, you have APIs and SDKs, so you can write a whole mobile app without ever having to do any back end scaling stuff.

And it was this large multi tenant platform, and it was really kind of professionally humiliating, just how poor our reliability was. Up and down all the time, because we had a pool of web servers and whenever they got saturated, whenever any of the databases, or whenever any app hit the top 10 in the iTunes store, or anything happened, they all went down and I tried everything.

To get more visibility, I tried every monitoring software, I tried every logging tool, I tried every APM on offer and it just like… they were great at helping you figure out the things you already knew would break, right. Once I knew something would take the system down, I could create a monitor and check for it, or I could create, you know, dashboards for it.

But this system was unlike any other that I had ever worked on, where it wasn’t failing repeatedly in the same way, over and over. It was a different freaking app every single time. There was no way you know… breaking down by, you know, a high cardinality dimension like app ID, it just wasn’t possible. I tried every tool out there. None of that helped until we were at Facebook. There was this one tool there that we started using called Scuba, which is a butt ugly tool that is aggressively hostile to users. But it does one thing really well. It lets you slice and dice on high cardinality dimensions in real time. None of this craft a query and go off to go to the bathroom and get coffee. None of that, but just like you can slice and dice.

So you could break down by app ID, break down by endpoint, break down by part, and just chain them all together and having this kind of a tool… The amount of time it took us to diagnose and fix these problems dropped from God knows hours, days, sometimes never, to like not even minutes. Like seconds. It wasn’t even an engineering problem anymore. It was a support problem and this made such an impact on me because, obviously right, and we started getting a handle on reliability and blah blah blah, but when I was leaving Facebook, I was planning to go be an engineering manager at Slack or Stripe or something and I suddenly went, oh shit I don’t know how to engineer anymore without this tooling that we’ve built around this ability… it was so unthinkable to me, I felt like such a poor engineer without the ability because it wasn’t just about getting the site back up when it was down, it was about… It was like my five senses for production. It’s all about how I understood, how do we understand what’s going on, how do I decide what to work on, how do I validate as I’m building it that it was doing what I expected it to do. There was just this tight feedback loop that I’d become accustomed to that just wasn’t possible using anything else.

That plus I had a pedigree for the first time in my life, right. Leaving Facebook, I have a pedigree, so VCs are chasing me, which is kind of annoying, because I wasn’t a better engineer at Facebook than I was before then, but they’re like, have some dollars, and I’m like well okay, I’ll take your money.

We’re definitely going to fail because I saw it as being this very niche thing for large multi tenant platforms, you know. And I’m like well there’s probably a couple dozen companies in the world that need this. I’ll take your money, we’ll go build. We will fail and we’ll open source it. So then I’ll still have it, right? Deal.

Over the course of the first year of it, we started out writing a storage engine, which our investors were like what are you doing. Because you’re supposed to get it in front of users as soon as possible, and we’re over here building a storage engine first. They’re like oh, could you not.

But also just trying to figure out how to talk about what we were building because we knew it wasn’t monitoring, right. We knew it wasn’t a reactive tool. We knew that it wasn’t anything quite like any of the tools that were out there.

And seven months after… we started in January. It was July, I found in my chat logs. In July, I Googled the definition of observability. Nobody was using it, right. It was just kind of like I think I’d heard it a few years ago, whatever.

But I Googled the definition and it came from mechanical engineering, from control systems theory, and it was all about can you understand what’s happening on the inside of your system just by looking at it from the outside? And I just started having light bulbs go off. Like oh my God, this is what we’re trying to do, right? It’s about understanding any novel state your system can get itself into, whether or not you’ve ever experienced it before, whether or not it’s ever happened before.

Just be able to ask the right questions from the outside, to understand it, without having to ship new code to understand it, right. Because so often it takes like all this wizardry and we finally figure it out by guessing and all this stuff. So we’ll ship some custom code to handle it, right, so we can find it next time.

And once I started to realize that’s what we’re building, so much falls out of that.

If we say that what we want is to be able to ask and answer any question about any state our system is in, well what proceeds from that? Well, it means that anything that has a fixed schema is verboten, right. Because you can’t define up front what kind of dimensions you’re going to need to understand.

You need to encourage people to just toss in details whenever they might think of them, right. Just like oh, this might be useful someday. Throw it in. This might be helpful someday. Throw it in, right?

Anything that involves indexing is verboten, because you don’t want to have to decide up front what’s going to be efficient to query on.

You need everything to be equally efficient, right? So that’s why we needed to build a columnar store database.

Other things, like it has to be able to handle high cardinality and high dimensionality too. The ability to stack together as many of these high cardinality dimensions as you want. Because that’s how you’re able to ask and answer questions. Like you know oh, there’s a spike in errors. Well what do they have in common, right? Well it relies on these arbitrarily wide structured data blobs because all of that stuff is bundled together around the user’s perspective of their experience, right.

So you’re able to correlate and go oh those errors, all of them are for users that are on this version of iOS, this language pack, using this database shard, using this build ID, from this region, right. You have to be able to, like all of them in like…
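To make the "what do these errors have in common" idea concrete, here is a minimal Python sketch. It is not something shown in the chat and not how Honeycomb or Scuba work internally: the events, field names, and the helpers `breakdown` and `shared_values` are all hypothetical, and a real system would do this over a purpose-built store rather than a list of dicts.

```python
from collections import Counter

# Hypothetical wide events: one dict per request, carrying whatever fields
# the service happened to attach. All names and values are made up.
events = [
    {"app_id": "a1", "endpoint": "/login", "ios_version": "14.4", "shard": "db3", "build_id": "b77", "status": 500},
    {"app_id": "a2", "endpoint": "/feed", "ios_version": "14.4", "shard": "db3", "build_id": "b77", "status": 500},
    {"app_id": "a3", "endpoint": "/login", "ios_version": "14.4", "shard": "db3", "build_id": "b77", "status": 502},
    {"app_id": "a1", "endpoint": "/login", "ios_version": "13.0", "shard": "db1", "build_id": "b76", "status": 200},
    {"app_id": "a2", "endpoint": "/feed", "ios_version": "14.4", "shard": "db2", "build_id": "b77", "status": 200},
]


def breakdown(rows, *fields):
    """Count rows grouped by any chain of fields (missing fields become None)."""
    return Counter(tuple(row.get(f) for f in fields) for row in rows)


def shared_values(rows):
    """Return the (field, value) pairs that appear on every row."""
    if not rows:
        return set()
    counts = Counter((f, v) for row in rows for f, v in row.items())
    return {pair for pair, n in counts.items() if n == len(rows)}


errors = [e for e in events if e["status"] >= 500]
print(breakdown(errors, "shard", "build_id"))
print(sorted(shared_values(errors)))
```

Because no field is declared up front, any dimension that was tossed into an event can be used in `breakdown` later, which is the property the fixed-schema and indexing remarks above are about.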

So anyway, I could go on and on, but, like all of these things proceed from that simple definition that you need to be able to ask any question to understand your systems without any prior knowledge. So that’s how Honeycomb came about. And the thing is once we realized that, we started to realize oh, this isn’t a niche problem after all. This is an everyone problem. This is a… and all of the trends in computing are going towards, you know, heterogeneity, these far flung complex systems, many of which you don’t have insight into in the background, and all you can rely on is your instrumentation.

And like being able to go from app ID all the way down to like your containers, and everything’s high cardinality, everything matters, and everyone now has these problems.

Cody Crudgington: Yeah, me too. So I believe Noah and I both have an Ops background, so it’s nice having someone on here who can sink that in a little bit. I like the idea of cardinality, in doing things and being able to stack those bits of information.

Charity Majors: Can we define that for people real quick?

Cody Crudgington: Yeah, go for it.

Charity Majors: Yeah, high cardinality. It just means like imagine you have a collection of like 100 million users and a bunch of data about them. The highest possible cardinality dimensions will always be any unique ID. So like social security number, right, or row number.

The lowest possible cardinality is always going to be like species equals human, right. So like high cardinality dimensions will be first name and last name, right. They’re not quite as high as row number. Gender is going to be fairly low cardinality, although not as low as species. So when you’re looking at debugging and understanding a complex system, what is the most identifying information? Well it’s almost always going to be those high cardinality dimensions, because if you can track it down to one of those… as soon as you’ve found like a trace ID that describes your problem, job over, right?

Which is why it’s unfortunate that, like metric systems are explicitly designed for only low cardinality data. That’s how they are so cheap.
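A small Python sketch of what cardinality means in practice, under made-up data: the cardinality of a field is just the number of distinct values it takes. The `users` rows and the `cardinality` helper are hypothetical and only mirror the row number, first name, and species examples from the conversation.

```python
# Hypothetical user rows; field names and values are illustrative only.
users = [
    {"row": 1, "first_name": "Ana", "species": "human"},
    {"row": 2, "first_name": "Ben", "species": "human"},
    {"row": 3, "first_name": "Ana", "species": "human"},
    {"row": 4, "first_name": "Cy", "species": "human"},
]


def cardinality(rows, field):
    """Number of distinct values a field takes across the rows."""
    return len({row[field] for row in rows if field in row})


for field in ("row", "first_name", "species"):
    print(field, cardinality(users, field))
```

A unique ID has one distinct value per row, so filtering on it narrows the data the most, while `species` cannot narrow anything.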

Cody Crudgington: Yeah, so going off that… What is the difference, can you explain to everybody, what’s the difference between observability and telemetry?

Charity Majors: Oh yeah telemetry is just a broad… it’s just a name that encompasses all metrics, data, all the data that you use to understand your system.

What’s unfortunate is that more and more I keep hearing people use observability as a synonym for telemetry, and I’m like no guys, we already have the word telemetry. We don’t need two words for it. My preference would be that observability be understood to be something different, because like I said, it’s more identifying, right. I think that it helps people understand what is different about these newer kinds of tools and how they solve these newer problems, because monitoring is a very mature, well understood discipline, but many of its best practices are like the opposite of the best practices for observability.

I understand why all of the marketing departments out there are like, “we do observability too,” because it’s always cheaper to adopt a new term than it is to build the actual feature sets.

But it’s a little bit unfortunate and I hope that, as their tools start to get those feature sets, maybe their interests will lie a little bit more in being more specific with their words as well, but we’ll see.

Noah Abrahams: I think that ties really into one of the things that really excited me about getting you on and having this conversation when you suggested the topic, “The Real Definition of Observability.”

The word observability in and of itself is something that people can hear, unlike telemetry, which is not something that comes up in day to day conversation.

But you say the word “observe” and people sort of have this innate understanding of okay, I know what the verb ‘observe’ means, so I have some idea of what observability is going to be. Even if they are wrong about what it is going to be. I know you mentioned earlier on that, in the existing definitions of observability, people are putting things in buckets and you didn’t like that definition, right? Can you talk a little bit…

Charity Majors: You mean the three pillars thing?

Noah Abrahams: Yes, yes.

Charity Majors: Yeah, I mean it seems pretty obvious from my point of view that the reason that they say there are three pillars is because they have three different products to sell you.

They have a metrics product, they have a logging product, and they have a tracing product, therefore, you must need all three.

When in fact, that’s pretty ass backwards if you ask me because, I mean, how many ways to go about describing this…

Observability relies on… Its source needs to be this arbitrarily wide structured data blob that describes a user’s experience, right. One per request per service.

And you can think of that in a lot of ways, it’s like we blew up the monolith. So we can no longer attach a debugger. We can’t just go to GDB and step through, right. We can’t do that anymore.

So we needed to find a way to like take that context and pass it around so that you can correlate things that are happening in this service versus that service that are part of the same request, right. You need to be able to pass those IDs around and stuff. So that’s part of what we’re doing in the instrumentation way of observability and… What was your question? What was your question actually…

Noah Abrahams: It was how are the pillars wrong…

Charity Majors: Yeah, how are the pillars…

Right, so honestly observability itself speaks more to a quality of the system, right. It’s the ability to answer these questions. The only reason I’m saying that it needs to be focused on the arbitrarily wide structured data blob is because I have not yet seen anybody implement a system based on metrics, or logs, or traces that is capable of answering these questions, right. So I’m not trying to get religious and dogmatic about it. It’s just that definitionally, this is the only way I’ve seen to do it. Because you can derive metrics from an arbitrarily wide structured data blob. You can derive logs from it, you can derive traces from it, if you have trace IDs you propagate.

But you can’t go in the other direction, right. You can’t have a bunch of metrics and get a request you know, a wide blob request for it, because you discarded all that connective tissue at write time.
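The one-way nature of that derivation can be sketched in a few lines of Python. This is an illustration under assumptions, not any vendor’s pipeline: the event fields and the `to_metrics` helper are hypothetical. It aggregates wide events into per-endpoint counters the way a metrics system would at write time, and the result no longer says which app or which request produced an error.

```python
from collections import defaultdict

# Hypothetical wide events; field names and values are made up.
events = [
    {"endpoint": "/login", "app_id": "a1", "status": 500, "duration_ms": 300},
    {"endpoint": "/login", "app_id": "a2", "status": 200, "duration_ms": 50},
    {"endpoint": "/feed", "app_id": "a1", "status": 200, "duration_ms": 80},
]


def to_metrics(rows):
    """Collapse events into per-endpoint counters, as a metrics store would."""
    metrics = defaultdict(lambda: {"requests": 0, "errors": 0, "total_ms": 0})
    for row in rows:
        m = metrics[row["endpoint"]]
        m["requests"] += 1
        m["errors"] += int(row["status"] >= 500)
        m["total_ms"] += row["duration_ms"]
    return dict(metrics)


metrics = to_metrics(events)
print(metrics["/login"])
# The counters cannot tell you that the failing /login request came from
# app a1: that link was discarded when the events were aggregated.
```

Going from `events` to `metrics` is a simple fold; there is no function from `metrics` back to `events`, which is the "connective tissue" point made above.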

So, I think that it’s a bit disingenuous. I think that, ultimately, what matters is what you’re able to do with the tool, your ability to like understand the system.

It’s like metrics and logs and they’re just data types, right. It’s kind of like saying the three pillars of databases are indexes, rows, and columns. It’s kind of senseless. It doesn’t mean anything, and what matters is, can you get the job done and fundamentally like… I don’t want to throw too many bombs here, but like monitoring and metrics are the right tool for the job for infrastructure, more or less. I think infrastructure being defined as the stuff that you have to do to get to the stuff that you want to do.

Right? The stuff that is your bread and butter, you know, if you’re a software company, the code that you write, that is your bread and butter. That’s the stuff that you want to do, and for that you’re always going to need observability, not monitoring and metrics, because you care about the end user experience. You care about that more than, is my service up. What is the meaning of up? If your users aren’t using it, it’s not up, right?

The other thing that it does is it speaks the language of functions, speaks the language of endpoints, not the language of low level system stuff, right. Because I don’t think that you can expect most software engineers to stare at the four different types of RAM and all of the proc IPv6 blah blah blah, you know.

Our job as ops has traditionally been to sit between the software engineers and the hardware and kind of explain to each other what’s going on, right.

But that’s kind of inefficient and I think that observability is part of helping developers actually own their own services.

Cody Crudgington: Yeah, that’s great. This is a good segue into my next question.

Sitting in between the software developer and the infrastructure, eBPF lately, especially in like the last three or four years, has become really big in observability, especially in the Kubernetes track where things are so ephemeral.

And it’s great, and everybody’s starting to tap into that, and it’s been really awesome it’s pushing the observability forward.

Moving beyond that, where do you see us going now that eBPF is being used so widely and it’s accepted, and people are starting to actually execute on eBPF?

Charity Majors: Yeah, it’s interesting because I see eBPF as being some lower level systems stuff and observability.

If what you are doing is… If you’re Amazon, you’re taking care of racks of servers and everything then holy help, you need that stuff.

But I honestly don’t think it’s the right tool most of the time for software engineers who are writing and shipping code every day.

I think that they need a little bit of it. A taste of it, right, but it’s not their bread and butter. The reason so many people automatically go to that is because we have such a long history of only having ops tools and like low level systems tools. 

So, for example, think about Serverless, right. Thinking about the way Serverless does instrumentation is a really good way of thinking about, you know, what the platonic ideal of instrumentation should look like for your average software engineer. You don’t have access to the underlying stuff, right, and somehow you manage. You know how to instrument your code to handle retries, you know, backoffs. You learn when something is down, you know how to handle it. But you don’t have to have all of the information of the underlying services to get your work done.

That said, when you have some it can be really handy, right. It’s really great if you can ship some code and then go and look and see, oh, the memory usage is triple.

That’s a great use, right. Or if you’re suddenly saturating your CPU, right. But there’s really not more than like maybe half a dozen of those that I think really regularly need to be exported into the realm of the engineer who’s writing and shipping code in production.

And for the most part, it’s not that they’re not useful. It’s just that you shouldn’t need them to get your job done as a person writing and shipping code as application software. There’s something wrong if you have to resort to that too much.

Noah Abrahams: That totally makes sense. So in that sense of evolution, eBPF is a big thing that’s come up. You’ve got whatever your developer tools are. What do you see as having been the most important piece sort of historically, and what do you see as being the most important piece in terms of development? You said, like, observability has sort of evolved out of, not really out of a set of tools, but more out of a lack of observability.

What do you see that growth being and what do you see having been the most important piece? Let’s talk a little bit about timeline here.

Charity Majors: Sorry, can you rephrase?

Noah Abrahams: I’m sort of rolling two questions together, what do you think were the most important pieces historically that brought us to here, and where do you think we’re going?

Charity Majors: I think, historically, like other than Honeycomb, which you’re right kind of came out of left field. It did not grow out of 20-30 years of traditional systems monitoring software. Tracing I think is the most important antecedent or precedent. I think tracing honestly… a trace is just an arbitrarily wide structured data blob with a trace ID, and a span ID, and some special fields appended. I think that along with blowing up the monolith, and your request now hop hop hop hop hopping around, tracing has suddenly become not optional for a lot of people.
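As a rough Python sketch of "a trace is just a wide event with a trace ID and a span ID": the spans below are ordinary dicts with three extra fields. The `parent_id` field, the sample data, and the `render_trace` helper are assumptions made for illustration; the conversation only names the trace ID, the span ID, and "some special fields".

```python
from collections import defaultdict

# Hypothetical spans: wide events plus trace_id, span_id, and an assumed
# parent_id field that links each span to the one that called it.
spans = [
    {"trace_id": "t1", "span_id": "s1", "parent_id": None, "service": "api", "duration_ms": 120, "app_id": "a1"},
    {"trace_id": "t1", "span_id": "s2", "parent_id": "s1", "service": "auth", "duration_ms": 30},
    {"trace_id": "t1", "span_id": "s3", "parent_id": "s1", "service": "db", "duration_ms": 70, "shard": "db3"},
    {"trace_id": "t2", "span_id": "s9", "parent_id": None, "service": "api", "duration_ms": 15},
]


def render_trace(rows, trace_id):
    """Rebuild one request's call tree from its spans as indented lines."""
    children = defaultdict(list)
    for row in rows:
        if row["trace_id"] == trace_id:
            children[row["parent_id"]].append(row)

    lines = []

    def walk(parent_id, depth):
        for span in children.get(parent_id, []):
            lines.append("  " * depth + f'{span["service"]} ({span["duration_ms"]} ms)')
            walk(span["span_id"], depth + 1)

    walk(None, 0)
    return lines


print("\n".join(render_trace(spans, "t1")))
```

Since each span is still a wide event, the same records can be filtered by fields like `app_id` or `shard` first and only then drawn as a tree.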

Tracing, however, has a history of being very hard. Unfriendly. Sort of hard to roll out. There are a couple of obstacles, a couple of things I’ve observed when it comes to tracing.

One is that until it’s like 100% rolled out, you don’t get any benefit from it, which means it’s hard to get the buy-in to actually like expend all the energy to get it rolled out everywhere.

And secondly, a thing that I’ve noticed again and again, is that you look at a shop that rolled it all out, got everything instrumented. You go there a year later, and you want to trace something and everybody’s like oh, go ask that person. That’s the tracing guy.

It’s not a tool that finds full permeation, that the entire team puts in their toolkit and like uses every day.

But it’s like there’s one or two people who know how to use it, and when you have trouble you go to those people as your experts. That’s unfortunate and I think that’s something we’ve got to block. That’s a pattern we have to break.

I think that a lot of times the hardest part in finding bugs in the systems and so forth is not debugging the code or revising the code, it’s finding where in the system the code is that I need to debug or fix.

And tracing is such a key important part of that. Just because you’ve got so many things. It’s not even just all the services. It’s all the different storage systems, you know it’s your polyglot.

You’ve probably got this and that. The reason monitoring worked so well for so long was that you had the load balancer, the application, and the database.

Right and it’s like okay, dashboards? Super helpful. You’ve got very clear inputs and outputs, all the complexity is really wrapped up inside the application.

A little bit in the database, you’ve got your DBA for that. Now it’s just like okay well now ops is everyone’s problem, right? Because it’s just blink blink blink blink blink blink blink blink.

So, and the databases, you know, are not a priesthood so much anymore either. It’s really in the realm of the ops people and the developers to understand. 

So where we come from, I think that tracing has a lot to do with observability. I think of it as like, you know, early observability stuff. It’s just that instead of starting… So the tracing tools out there, they’re kind of trace first, high cardinality second, and I think what you actually want is the Honeycomb model of high cardinality first and trace second because you want to find the problem and then trace it. As far as where we’re going, I mean you know I get a little cynical or like… I talk about various players in the marketplace from time to time, but I do think that all the major players in logging, monitoring, metrics, APM, even some adjacent like database companies have all changed their roadmap, to the point where they’re trying to look like Honeycomb.

They’re trying to move their technology to match Honeycomb’s tech faster than we can grow up our business side to match theirs. So I think that over the next couple years you’re going to see a lot of these tools bake in exactly what Honeycomb has done. I don’t say that because I’m like we’re the best! It’s just like some of these things are just kind of by default the right things. We just made the right trade offs, because we were able to kind of step outside of the tradition and start from scratch and they’re just the right way to do it.

But that doesn’t mean we will win, so. The history is littered with the stories of companies that did it right and didn’t win.

Cody Crudgington: That’s so true, we have a good question.

Charity Majors: Another thing that I would throw in there is the trend towards having consumer quality developer tools. Towards not building tools for developers and engineers like they’re engineers. Like no Vim bindings, you know, and building it for us like we’re humans, like we have a job to do, like we are trying to get our job done, not trying to learn the tool.

Right, and I think that, to the extent that these tools can make you know these complex systems sensible and intuitive and you know, to the point where you get a good dopamine hit every time you go and look at your tool, instead of a oh gosh you know, every time you go and look at your tool, right to the extent that we can solve that I think that we will win.

Noah Abrahams: So on that point, do you think that the growth of observability in general is going to be more based on the evolution of the tooling or is it gonna be more based on the adoption of the ideology of observability.

Charity Majors: That’s an interesting question. Similar, same question, you know, because these are socio technical systems where there’s no start point or endpoint, they feed back into each other, right.

And what I immediately thought of when you asked that is just, I was thinking back towards, you know, five years ago, when we started talking about high cardinality and stuff.

I was CEO… my first like marketing grand plan was to go out there and go, “high cardinality!” I think we got two customers that way.

Two people out there, searching for high cardinality, right, and if you look at the industry now, like everybody knows what high cardinality means. Everybody knows that they need it.

And there’s been a lot of the industry, and you know our awareness of tooling has grown a lot in the direction of the stuff that we’ve been talking about, to the point that you know we have occasionally like gone, we just reflexively gone, oh no we can’t talk about that because that’s alienating and then we realized that that was two or three years ago, and now we can talk about it this way, people are super into it. So that’s kind of exciting.

What do I think? I think that observability is not what I would pay attention to, honestly. Observability is a means to an end. I think that the end is helping engineers own their own systems in production and not be afraid of them, right. And this doesn’t mean ops people like us are going away.

They’re never going to not need people like us, trust me. But you know we’re going to be more force amplifiers than we are going to be… we’re not gonna be the first line of defense anymore. We’re going to be what you escalate to. We’re going to be the experts that you escalate to. We’re going to be the ones who set the golden path and the company standards then help team by team, you know help get your code to a place where it is maintainable and where it isn’t crashing all the time and help you define alerts and SLOs and SLAs and SLIs and everything.

You know we’re going to be the one to help you, you software engineer, be the first line of defense for your code. The reason for that is because our systems are getting to the point where it can’t be done in any other way, right. 

This is not about… We have a reputation for masochism. I know. This is not about saying alright, time for you to suffer too.

It’s about saying okay software engineers, the only way we can actually make this better is if you’re intimately involved in the shipping and the maintaining of your code in production.

Because we can’t just treat them like a black box anymore, right. Like the way that we did it with handing over ownership to ops was just like we’re just going to patch it blindly a little bit. That doesn’t work anymore. Having systems that are run that way, they degenerate over time, right. You just attract all this cruft from things being like patched over and like shoved under the rug, not fully understood, and now you get, you know, 10 years later to a system that nobody wants to be on call for because it’s breaking all the time.

Well, of course, it is. No one’s ever understood it. Nobody has any chance of understanding it, but this is not necessary, right.

Systems don’t have to be this way. You can build systems where things are well understood from day one, and they don’t get all these crufty corners that nobody knows how to deal with. So like that’s the goal, right. Observability is a way of achieving that goal, for the same reason I put my glasses on before I go driving down the freeway, because there’s a whole lot less stabbing around in the dark if you can see what you’re doing, you can move more confidently, you can move more swiftly. You know all of these things kind of go together, right. The CI/CD, fulfilling the promise of CI/CD, and having continuous delivery where you’re actually shipping to production within a few minutes of writing the code.

Having observability, instrumenting as you go so that the software engineer who’s instrumented it can then go and look at it and say is it doing what I expected it to do, does anything else look weird, right, and then finding and fixing 80% of all bugs right there in that few minute loop, right.

So it’s fresh. It’s exactly what you’re trying to do. You know all the trade-offs you just made. You know exactly what you didn’t do. You’re going to find… it gets exponentially more expensive in time and money to find and fix bugs the longer it has been, starting at the moment that you wrote it.

It’s all about shoveling and that’s not just me. There’s a Facebook research paper that they published recently that showed this. So pushing this finding of problems forward, making systems more understandable, more comprehensible, more friendly to humans and maintainers, it has these amazing business benefits, too, because you know if it takes like X number of developers to build and maintain the system where you’re shipping within 15 minutes, then if you’re shipping within hours, it takes at least twice as many more, and if you’re shipping within days it takes twice as many more again. For shipping within weeks it takes twice as many more again.

I’m definitely not exaggerating. I might be too conservative, it might be exponentially more, but that’s just an incredible amount of waste.

And that’s not waste in a good way where you’re like oh well, keeping jobs, you know.

People are gonna be obsolete. No, that’s waste and all the things that burn people out, right. People don’t get burned out from shipping too much code. They get burned out from never shipping code, or from being called in to work on code that they wrote six to 12 months ago that they can’t remember.

All of that waste out there is the stuff that makes software engineering a terrible job. All of the fun stuff that I’m trying to like proselytize here, I’m not saying that you know people should be working themselves to death. I think that any engineer has about three, maybe four hours of concentrated focus time in them per day, tops. But that’s what moves the business forward. Solving new creative problems that move the business forward three or four hours a day, day after day.

Instead of spending, you know, 16 to 20 hours just shoveling shit and, you know, dealing with yak shaving and stuff, which is what historically ops has ended up with the lion’s share of. But I guess that’s my long winded answer to your question. It’s not about observability. I think that the extent to which people are adopting observability in the service of these goals is something to watch, and I do think that maybe the answer to your question is yes, it’s about the ideology, right. It’s about buying into… so many people think that they’re not tall enough to ride this ride, right. They’re not good enough, this is just for the Googles and the Facebooks.

That’s not true. It’s actually the opposite. The way that they’re doing things now is the hard way. It is so much easier to write and ship code when you have a tight feedback loop, and you can see what you’re doing, and you get bugs fixed much more quickly.

It is so much easier, but so many people don’t have the confidence to strike out and start you know, raising their standards for themselves.

You can blame a lot of this on engineering leadership. I think a lot of engineering leadership has been too timid, too afraid to actually push for, you know, higher standards here. You can blame a lot on inertia, but in the end it doesn’t really matter, right. What matters is that we understand there’s a better way to write and ship and support code, and it starts with little steps like observability and your CI/CD pipeline.

Noah Abrahams: You’re speaking to our hearts. We’re not here to talk about us, but StormForge as a company, we focus on waste and efficiency and all those sorts of things. 

Charity Majors: Nice. What do you do as a company?

Noah Abrahams: Oh, I don’t want to derail with….

Charity Majors: Give me your elevator pitch. I would love to know.

Noah Abrahams: We help companies optimize their software in… How do I want to phrase this. I don’t usually do the sales pitch. We help companies optimize their software, so that they have less waste, so they run more efficiently, and so that they get better bang for their buck. Run things better, faster, cheaper.

But without the concepts of observability, without visibility into that ecosystem that is your own software, you can’t actually get to that stage. You need to have testing, you need to have observability.

So, hearing you talk about waste, not just from a technological perspective, but also from a human perspective, that’s something where we’re just sitting here going, oh, this plays exactly into the arena that we work in.

Charity Majors: This whole subject really is like the mythological elephant, right. Where everyone’s out there blind and like patting the elephant and seeing a different side of the problem, right.

And it’s like, I got the nose, I’ve got the feet, you know, but we’ve all got pieces of it, but we all want the same thing, right. We see that the way things are being done now can be done so much better.

Cody Crudgington: Absolutely, so what are the key things, and this is a question from the Q&A, what are the key things you see people get wrong in observability and do you have any pointers on addressing that?

Charity Majors: Well, first of all, the easy answer, the one that people tend to make because there are so many vendors out there spending lots of money to help you make this mistake, is treating it like a three pillars problem. Where you’re just like oh well, I must need a logging, a metrics, and a tracing tool. That’s just kind of nonsense.

That said, a lot of people have metrics tools and systems that they have been working on for years and… All right, whenever you’re asking someone to change the way they’re doing something, I think two things… first of all, what you’re offering them has to be an order of magnitude better than what they’ve got, just to account for the costs.

And number two, I think you want to make it as easy of a transition as possible.

So, for example, I really resisted picking up Scuba at Facebook. I was just like, one more tool…

The way that I was convinced to try it was we were using Ganglia at the time and Ganglia Graphite. We had all these Ganglia graphs, right. Well, Ganglia stores them in a giant XML dump once a minute on one server, and so we wrote a little script to just like take the XML dump and pack it and ship it over into Scuba using these wide events. And it was incredibly ugly. It was not the right way to do things, but it was right enough that it let me kind of see my world in there.

I knew the variable names, I knew the names of what I was looking for, I knew how things were structured, you know, so even though it wasn’t really correct, I could see the value and it started to click for me. Nothing was being taken away, right, nothing was being taken away from me. I was only being offered this new tool that could answer so many questions that I simply couldn’t answer using my older tool. So I’m always in favor of like building bridges like that. I’m looking for ways to build bridges between what people have in any way.
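As a rough illustration of the bridge described here, the sketch below turns a metrics XML dump into one wide event per host. The sample XML is a simplified stand-in loosely modeled on a Ganglia-style dump (its element and attribute names are assumptions for this example), `to_wide_events` is a hypothetical helper, and the step that ships events to Scuba is left out because nothing about it is described in the conversation.

```python
import json
import xml.etree.ElementTree as ET

# Simplified, assumed sample of a per-minute metrics dump (not a real capture).
SAMPLE_XML = """
<GANGLIA_XML VERSION="3.1" SOURCE="gmetad">
  <CLUSTER NAME="web">
    <HOST NAME="web1" REPORTED="1624400000">
      <METRIC NAME="load_one" VAL="0.42" TYPE="float"/>
      <METRIC NAME="mem_free" VAL="1024" TYPE="uint32"/>
    </HOST>
    <HOST NAME="web2" REPORTED="1624400000">
      <METRIC NAME="load_one" VAL="3.10" TYPE="float"/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>
"""


def to_wide_events(xml_text):
    """Flatten the dump into one dict per host, keeping the metric names as keys."""
    root = ET.fromstring(xml_text.strip())
    events = []
    for cluster in root.iter("CLUSTER"):
        for host in cluster.iter("HOST"):
            event = {
                "cluster": cluster.get("NAME"),
                "host": host.get("NAME"),
                "reported": int(host.get("REPORTED")),
            }
            # Every metric becomes a column on the same event ("wide").
            for metric in host.iter("METRIC"):
                event[metric.get("NAME")] = float(metric.get("VAL"))
            events.append(event)
    return events


events = to_wide_events(SAMPLE_XML)
for event in events:
    print(json.dumps(event, sort_keys=True))
```

Keeping the original metric names as field names is the point of the exercise: the familiar vocabulary carries over into the new tool, so nothing is taken away.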

Other key things that people will get wrong in observability.

A big one, is just like keep every log line. Logs are precious. Every log is sacred.

No, your logs are trash. Your logs are pure trash. Especially if they’re not structured, they’re just there. This is like security theater, right. It’s just like logging theater. It’s just like, I feel better about the world because I’ve got something and I’m paying a lot of money for it, yeah.

So, a good midway step is often to just start by structuring your logs, right. Add some structure to them, do what you can to bundle them up around one line per request per service and see how much more value you get out of them. Another is dashboards. Just infinite proliferation of stupid dashboards.
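A minimal sketch of that midway step, assuming a plain JSON-lines format: instead of scattering free-text log lines through a handler, fields are collected on one object and written out as a single structured line per request per service. `RequestEvent`, `handle_request`, and all field names are hypothetical and chosen only for illustration.

```python
import io
import json
import time


class RequestEvent:
    """Collects fields during one request and emits a single JSON line."""

    def __init__(self, service, stream):
        self.fields = {"service": service}
        self.stream = stream
        self._start = time.monotonic()

    def add(self, **fields):
        # Anything learned along the way is attached to the same event.
        self.fields.update(fields)

    def emit(self):
        elapsed_ms = (time.monotonic() - self._start) * 1000
        self.fields["duration_ms"] = round(elapsed_ms, 3)
        self.stream.write(json.dumps(self.fields, sort_keys=True) + "\n")


def handle_request(stream, user_id, path):
    event = RequestEvent("api", stream)
    event.add(user_id=user_id, path=path)
    # ... the real work of the request would happen here ...
    event.add(status=200, db_calls=2)
    event.emit()  # exactly one line for this request in this service


out = io.StringIO()
handle_request(out, user_id="u-123", path="/checkout")
lines = out.getvalue().splitlines()
record = json.loads(lines[0])
print(record)
```

Because every field lands on the same line, a later query can filter or group on any of them without stitching several log lines back together.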

Because we have this pattern, right. First of all, we have this pattern in debugging where we’re like, okay, here’s my dashboard, there’s a spike, that must be bad. So I’m going to jump over into my logging tool and I’m just gonna look for something roughly around the same time, and start searching for it, right.

So there are many problems with this, and one of the bigger problems is that when you’re searching for it in your logging system, you’re only searching for things that you already know exist, or for past problems, right.

And most people don’t have that good of a story there. And then often when they also want to trace it, then they jump again. They find a log ID, they jump over to the tracing thing, and they try and correlate again visually. This is just like one step above using entrails for divining, as far as I’m concerned. It’s not very scientific whatsoever and it’s got a lot of guessing going on.

Cody Crudgington: God forbid your timestamps are off.

Charity Majors: Yeah, God forbid your timestamps are off. You’re never going to find it. It’s like the dirty secret in systems, it’s just like how many outages and problems are never understood. It’s just like, it resolves itself and you’re just like, anybody see that? Nope? Okay. Go walk away quietly and let me know if it happens again.

Because it’s very hard. It’s very hard to explain problems using this current tool set.

So. Oh, and dashboards, we also have this pattern in systems where when we do figure something out, we post mortem, right. We do a retro.

And then we craft this perfect dashboard. We’re like, next time I’ve got the dashboard that will describe that. I can flip to it instantly and understand what’s going wrong. But how many of those do you have? 15,000? You just accumulate more and more and more dashboards, and then you can’t find it in your sea of dashboards, and by the time you do find it, maybe you haven’t looked at it in six months and half the data sources are no longer shipping to it. It’s just like, dashboards are not a debugging tool. Dashboards are a monitoring tool. We’ve gotten used to using them for debugging because we have had nothing else, but it’s the wrong model. What you want is to be using a model that’s much more like using GDB or something, where you’re just like, what about this, what about this, what about this, what about this?

You want to be asking questions and taking an action on the result of the question that you just asked. Dashboards are answers, right.

They’re just answers, and you’re just flipping through answers and looking for one that maybe matches the question that you have, instead of asking questions to get to the right answer. So I have just a real prejudice against dashboards. I think that they’re the worst and people should try to get rid of them.
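The "what about this, what about this" loop can be sketched over a handful of structured events, where each question is chosen based on the answer to the previous one. The events, the field names, and the `where` and `count_by` helpers are all made up for illustration; this is not any particular tool's query API.

```python
from collections import Counter

# Hypothetical request events, one dict per request.
EVENTS = [
    {"endpoint": "/export", "status": 500, "user_id": "u-7", "build": "b42"},
    {"endpoint": "/export", "status": 500, "user_id": "u-7", "build": "b42"},
    {"endpoint": "/export", "status": 200, "user_id": "u-1", "build": "b41"},
    {"endpoint": "/home", "status": 200, "user_id": "u-2", "build": "b42"},
    {"endpoint": "/home", "status": 200, "user_id": "u-7", "build": "b41"},
]


def where(events, **conditions):
    """Keep only the events whose fields match every condition."""
    return [e for e in events if all(e.get(k) == v for k, v in conditions.items())]


def count_by(events, field):
    """Count events per distinct value of one field."""
    return Counter(e.get(field) for e in events)


# Question 1: where are the errors?
errors = where(EVENTS, status=500)
by_endpoint = count_by(errors, "endpoint")

# Question 2, picked because of answer 1: who is hitting the failing endpoint?
narrowed = where(errors, endpoint="/export")
by_user = count_by(narrowed, "user_id")

# Question 3: is it tied to a particular build?
by_build = count_by(narrowed, "build")

print(by_endpoint, by_user, by_build)
```

None of these breakdowns had to be predicted in advance, which is the contrast with a dashboard built after the last incident.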

Noah Abrahams: That lines up with actually there’s a question in the Q&A about single panes of glass and is that even a reasonable thing for something like that?

Charity Majors: I mean depends on what you want it for. If it’s so that you know… Well, I don’t know like people have like TV shoots and stuff and they want to look cool in the office and have a bunch of blinky blinky blinky lights behind them, they’re great for that purpose.

If you’re in a data center and you have like a static number of racks or something, you want to just like have a pretty display, great for that.

What it is not great for is anything that has to do with writing and shipping software.

It’s just not. It’s gonna give you a false sense of… you can maybe have a top-level thing that has your requests, latency, errors. Great, everyone should have those.

But beyond that, it’s just visual noise, and most systems are too dynamic and constantly changing for a single pane of glass to even make sense.

And when you are in the mindset of having a single pane of glass that will answer all your questions, you’re fixing that mindset and that means that you’re not looking at all the things you probably should be looking at. I just think that they’re used by sales teams to sell you on things and that does not map to actual functionality or usefulness for most users.

Noah Abrahams: And it’s going to limit the number of questions that you’re asking in the first place, right?

Charity Majors: Yeah, yeah. Absolutely. You’re just… yeah you’re locked in.

Noah Abrahams: So speaking of number of questions that we ask, we’ve come to the portion of the Fireside Chat where Cody and I like to do a handful of just fun rapid fire questions. 

Charity Majors: Okay! 

Noah Abrahams: Are you game?

Charity Majors: Yeah! 

Noah Abrahams: Want to start, Cody?

Cody Crudgington: Sure, pineapple on pizza. Yes or no?

Charity Majors: Yes.

Noah Abrahams: Favorite Open Source project outside of Honeycomb.

Charity Majors: I would have to go with… it’s the screen replacement that… I have it aliased to something so I don’t remember what it’s called… the screen replacement that reconnects when you’re on a flaky network connection.

Noah Abrahams: Tmux.

Charity Majors: Yeah, Tmux!

Cody Crudgington: Top three music albums of all time.

Charity Majors: Top three. St. Vincent. The one before this one with the woman and her pink bot on the cover. I don’t remember what it’s called. Janelle Monáe, I’m bad at album names apparently. And probably something Leonard Cohen.

Noah Abrahams: A favorite single malt.

Charity Majors: Oh. Okay well. So. I’m definitely one of the Ardbeg single one of… the Ardbeg bottle… The Ardbeg Supernova.

They only did it that one year and it’s just like drinking coffee grinds and dirt and chocolate and it’s just so good. Ardbeg Supernova, yeah.

Noah Abrahams: With just a little bit of being hit in the head with a campfire.

Charity Majors: Yeah yeah.

Cody Crudgington: Side note, Stockholm has an Ardbeg Embassy and it’s awesome. Yeah, if you’re ever in Sweden go to Stockholm Ardbeg in Old Town. It’s great.

Projects or initiatives you’re most proud of.

Charity Majors: Projects or initiatives… well, I’m currently almost done refinishing my hardwood floors with a 155-pound sander and I’m pretty stoked about it. And I also… see if I can show you a picture. I just got divorced and I’ve been retaking my home, and I will show you a picture of my ex-wife’s wall. So she was very, very no-nonsense. So I’ve painted it to look like that.

Cody Crudgington: I love it.

Charity Majors: Revenge painting.

Noah Abrahams: Late night or sleep?

Charity Majors: Late night.

Noah Abrahams: And our last one, the fun one.

Cody Crudgington: Most complicated.

Noah Abrahams: If you could change one thing in the world, what would it be?

Charity Majors: Oh gosh. What a dangerous power. I would cap CEO pay to a multiple of the worker pay.

Cody Crudgington: That’s fantastic.

Charity Majors: Or put a charge on the high frequency trading. One of those two. I just think it would radically serve the world for good.

Cody Crudgington: That’s fantastic.

Noah Abrahams: Thank you for joining us, and thank you for putting up with our questions.

Charity Majors: This is great, thanks for having me!

Noah Abrahams: This has been a fantastic, fantastic session. Throw up our… Where did I… If I can find our final slide I will throw it up. We don’t usually do a lot of slides for things like this, but…

Cody Crudgington: And I just want to say, Charity, thank you for coming on. Longtime listener, first time caller, so I’m, you know, a bit of a fanboy. So I’m really glad to have you here.

Charity Majors: Awesome, this is really fun. Thank you for having me.

Noah Abrahams: And a huge thank you to all of our attendees who, as a reminder, are benefiting charity by their attendance. Join us next month on July 15. We’ve been… I hope I got that date right… our next guest for the Fireside Chat is going to be Kris Nova. That’s going to be a great one as well.

Charity Majors: Tell Kris I said hi.

Cody Crudgington: Will do.

Noah Abrahams: Thank you everyone for coming. We’re gonna stop the recording and end the session. Thanks everyone.