Talking with Matt Klein about “A Culture of Reliability”
A discussion with David N. Blank-Edelman and Matt Klein
Matt Klein is a software engineer at Lyft and the creator of Envoy. He’s a big fan of service mesh architectures as a way of improving the reliability and observability of environments, especially those that are microservices-based (Envoy is at the heart of a number of service mesh projects like Istio). Though it is always a pleasure to have a deep dive technical discussion on service meshes with Matt, I’ve also found his observations on SRE culture to be a good starting point for conversations about the field. That’s the direction we took for this discussion.
David: Let’s do a level set. How do you define SRE?
Matt: I think a lot of people have different definitions of what SRE is. The way that I think about SRE is that it’s really a software engineering and a thought discipline that is focused on system and product reliability, and not necessarily building features for a specific product. So obviously it’s pretty nuanced. But I think at the end of the day, that’s the only way to really cleanly break it down. Traditional software engineers are focused on building particular product features, typically driven by product managers and business requirements. And SREs or production engineers, or reliability engineers, or whatever you want to call them, I think their focus is not necessarily on building features, it’s on making sure that those features have a baseline level of reliability for the customers who end up using those features.
David: Do you feel that the culture for SREs is the same as the culture for feature producing software engineers?
Matt: So I guess it depends. This topic is obviously fairly complicated. We can talk about the culture of a product engineer, or the culture of a reliability engineer. But I find it more useful to think about culture in the sense of “what is a reliability culture?” I think that spans product engineers, and it spans reliability engineers.
David: Okay, so that’s a great distinction. So what does a culture of reliability entail?
Matt: I think that is actually what is most interesting about this discussion. Because if you look at the differences in terms of how companies have thought about this problem—dealt with this problem of building features and having those features actually end up being reliable, there isn’t one answer. The industry has actually been all over the place.
We have gone from having the very traditional siloed roles of software engineers, test engineers, systems engineers, release engineers. And then we’ve gone full stop the other way, where we have this theoretical concept of DevOps. DevOps means different things to different people, but at least at a high level, the idea of DevOps is a much more agile development process—the idea that we’re doing infrastructure through code, the idea that everyone can do their own operations.
And then we have somewhere in the middle of this a reliability engineering discipline where we try to enable product engineers to have operational excellence. Provide them the tooling and the culture. So I think it’s important to separate culture from discipline.
At a super high level, reliability culture is, honestly, just respecting the user. It’s respecting production. And I actually think it’s that simple.
David: When you say “respecting the user” or “respecting production,” what do you mean?
Matt: We can have a much more nuanced conversation about it, but there’s a way to think about software development or product development where we’re going to add features X, Y, or Z. We don’t do that in complete isolation without thinking about how our actions actually go through and affect the user. If we have a bug, what impact does that have on the user? If we for example deploy our new feature at a peak time, and we break the website, is that the most user respecting way that we could have done that? If we don’t use feature flags, so that we can incrementally test things, is that respecting the user? Are we doing development in a way where we reduce the chance of impacting the user experience as much as possible?
SRE culture is balancing velocity. Balancing feature development with risk assessment. And it’s always a balance of how do we have as much feature velocity as possible, while having as little risk as possible. And that’s why this is such an interesting topic. Because there’s no one answer—there’s no one way of actually solving this problem.
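For readers who have not seen a feature flag used for an incremental rollout, the idea Matt mentions above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names (`rollout_bucket`, `is_enabled`), the flag name, and the hash-based bucketing scheme are hypothetical and are not taken from Lyft, Envoy, or any particular feature-flag product.

```python
import hashlib


def rollout_bucket(flag_name: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99.

    Hashing keeps the assignment deterministic, so a given user sees the
    same behavior on every request while the rollout percentage is fixed.
    """
    key = f"{flag_name}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """Return True if this user falls inside the current rollout percentage."""
    if rollout_percent <= 0:
        return False  # flag is off for everyone
    if rollout_percent >= 100:
        return True   # flag is on for everyone
    return rollout_bucket(flag_name, user_id) < rollout_percent


if __name__ == "__main__":
    # Hypothetical flag and user IDs, used only for demonstration.
    users = [f"user-{i}" for i in range(1000)]
    enabled = sum(is_enabled("new_checkout", u, 10) for u in users)
    print(f"{enabled} of {len(users)} users see the new feature at 10%")
```

Because a user's bucket never changes, raising the percentage from 10 to 50 only adds users; nobody who already had the feature loses it. Setting the percentage back to 0 turns the feature off without a deploy, which is the property that limits user impact when something goes wrong.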
David: And respecting production?
Matt: Respecting users is a superset of respecting production: if production goes down, the user doesn’t have anything to use. I think they’re kind of synonyms, just in the sense that, implicitly, by respecting production we are respecting the user who uses production.
David: If I were to walk into an environment and I was trying to determine whether that environment had the sort of reliability culture that you were speaking of, what would be some of the indications of that?
Matt: Right, that’s where we can pop down a level. I think as an industry, we have evolved towards a set of best practices that allow us to have high velocity, while maintaining respect for the user or respecting production. We have a good understanding of things like using feature flags and doing incremental rollouts, how to do canarying, SLAs/SLOs, actually alarming on issues, dashboards that show not only infrastructure metrics but also business metrics to understand the impact to the user, a strong culture of automated testing, how we do config management, can we roll back, code reviews… And I can go on, and on, and on. There’s a set of industry best practices. And these are DevOps best practices also.
But the idea here is that the more of these best practices that an organization uses, the more likely it is that they can maintain high velocity, but still respect production, respect the user, and hopefully have as few issues as possible.
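Two of the practices listed above, SLOs and canarying with the ability to roll back, come down to simple arithmetic on request counts. The Python sketch below is one hedged way to express them. The names (`RequestStats`, `error_budget_remaining`, `canary_should_roll_back`) and the default thresholds are assumptions made for illustration. They do not describe how any specific company or tool implements these checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestStats:
    """Request counts observed over some window (hypothetical data shape)."""
    total: int
    failed: int

    @property
    def error_rate(self) -> float:
        return self.failed / self.total if self.total else 0.0


def error_budget_remaining(stats: RequestStats, slo_target: float) -> float:
    """Fraction of the error budget still unspent; negative if overspent.

    With a 99.9% SLO over 100,000 requests, about 100 failures are allowed.
    """
    allowed_failures = (1.0 - slo_target) * stats.total
    if allowed_failures == 0:
        return 1.0 if stats.failed == 0 else float("-inf")
    return 1.0 - stats.failed / allowed_failures


def canary_should_roll_back(
    baseline: RequestStats,
    canary: RequestStats,
    max_extra_error_rate: float = 0.01,  # assumed tolerance, tune per service
    min_requests: int = 100,             # assumed minimum sample size
) -> bool:
    """Return True if the canary is clearly worse than the baseline."""
    if canary.total < min_requests:
        return False  # not enough traffic yet; keep observing
    return canary.error_rate > baseline.error_rate + max_extra_error_rate


if __name__ == "__main__":
    window = RequestStats(total=100_000, failed=25)
    print(f"budget remaining: {error_budget_remaining(window, 0.999):.2f}")

    baseline = RequestStats(total=10_000, failed=50)
    canary = RequestStats(total=500, failed=20)
    print("roll back canary:", canary_should_roll_back(baseline, canary))
```

A real canary analysis would usually compare more signals than error rate (latency and business metrics, for example) and apply a statistical test rather than a fixed margin. The fixed margin is used here only to keep the sketch short.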
David: Are there best practices that are human based—related to how we treat the humans and what the humans do that you think are important as part of a culture of reliability?
Matt: Yeah. And so that’s where I actually think this conversation gets pretty interesting. Because as an industry, I think that is the area that we have the least agreement on.
I think most people, if you sit them down, and ask them about best practices, would broadly agree. Like I would be amazed if anyone disagrees at this point that you should have a canary process, you should probably do feature flagging, continuous integration, etc. There’s some basic table stakes most people agree with.
Where I think it gets a lot more interesting, and there’s a lot less agreement, is on things like “are all developers on call? Or, are only a subset of people on call? Who is responsible for documentation? And who’s dedicated to keeping that documentation up to date? Who is responsible for new hire education, continuing education, teaching engineers how to understand the concepts of canary and feature flagging?”
Because we don’t teach this in college, right? You can’t assume that people just come in and know these things. We have to actually teach them. So there is a strong component of education, of documentation, of mentorship. And I think as an industry, there is the least amount of agreement on how we go about doing that.
As an industry we haven’t decided on the right people layout. By people layout I mean, “do we have all engineers being on call, doing deploys and those types of things?” Or do we have a set of SREs who tend to do those things? Or do we have a completely siloed software engineering team versus a systems engineering team? Very siloed roles in terms of who touches production, who has gates on how we monitor things, how we deploy, when we deploy. There’s a range all the way from the silo to true “DevOps” with an everyone is on call culture, where everyone deploys, everyone manages their own runbooks, everyone manages their own systems.
I’ve been thinking about this area, this human stuff, a lot recently and wrote a blog post called The Human Scalability of “DevOps”. I think that this is the area where as an industry, we have to do the most evolution. I’ve become increasingly convinced that agile DevOps development practices for all engineers are an absolute no-brainer. We should do CI, we should allow everyone to deploy. These are very basic things. But at the same time, I feel that as an industry, we don’t invest in the human side of things.
David: Can you give an example?
Matt: I think increasingly we have human scalability issues where we’re not respecting the fact that people don’t know how to necessarily use modern infrastructure, or don’t know off the top of their head all of these site reliability best practices. You can go and read about them in the Google Site Reliability book, or the new book that’s coming out. But these things are just not obvious to people. And as an industry, I don’t think we educate folks enough, and do enough continuous education and support them. And I don’t think we do enough mentorship. There’s really good agreement on the things that people need to do, but there’s not good agreement on how we educate people, and how we help them actually do those things.
I think there are two main problems: first, as an industry, we have an over-inflated view of how easy cloud native infrastructure actually is to use. We like to claim that we can go and use all this modern technology, infrastructure as code, etc. and the days of yore in which you needed an old-style system admin, that just doesn’t apply anymore.
We are very far ahead of where we were 10 years ago. But I think it is a gross exaggeration to say that it’s easy. If you look at the current state of the world, including Envoy, Kubernetes, all of the tooling that we’ve built—it’s still really hard to use. So trying to expect people that don’t have the domain experience either in core infrastructure or in reliability engineering, to come in and just know how to do networking, or know how to do provisioning, or know how to do containers…to me, it’s a little nutty. So that’s number one.
I think the other side of things is that we are engineers, and engineers like to build things. And a lot of engineers don’t necessarily value softer but important things. The two things that I’m thinking of are continuing education and documentation.
The dream of the cloud native experience is that it’s self-service…we’re expecting everyone to come in and through APIs and code, and through documentation, build amazing applications. I think we’re getting there, but at a lot of companies, particularly larger ones that have infrastructures that have scaled beyond what some of the current cloud abstractions can give people (Lyft is one of those companies) there are engineers that like to build stuff and solve problems. We don’t like to write documentation. We don’t like to build new hire or continuing education classes. And I think we grossly underinvest in those two things.
David: Do you feel that the industry has come to consensus on how to handle the situations where things break, where the reliability is compromised?
Matt: There’s no consensus on how organizations do postmortems. Not only postmortems, but how people treat follow up actions. Like do companies actually fix their follow-up actions? How do we prevent the same incident from happening again? Or even during an incident, how do we communicate? Is there an incident leader? Do we get on a call? Do we use Slack? There isn’t consensus, there isn’t common tooling that people use around this. There’s a common theme here. We’re running before we’re actually crawling.
You can look at older organizations like Amazon or Google where they have a strong set of procedures that are born out of, not only years of experience, but also because they started when the industry was in a very different place. They’re going to have an incident management team that knows how to deal with an incident and run the incident, and open a call bridge, and do all of those things. How to do follow-up, etc. Whereas in newer companies typically there isn’t an incident leader role; it can be chaos when an incident happens. Or there’s no incident command center, there’s no central monitoring command center.
These are just areas where we as an industry want to do things with as few specialized people as possible, because rightly so, we want people to write business logic, that’s how we make money. But, we see this theme, again, and again, and again, where the automated tooling that we assume will save us—it’s not quite there yet.
And there’s still a lot of these human issues. We don’t have consensus on “what are the right roles that we have to hire for? How do those roles actually interoperate with other folks at the company? What roles are required? Do you need an incident command center? Do you need a monitoring command center?” Most modern newer companies would say no. But then you wind up with issues during incident handling, so it’s complicated.
David: Do you think we’re going to get there? Do you think that the industry will coalesce?
Matt: I do think we will get there. I think it’s going to be longer than a lot of people would like. If you look at the way that most of the cloud native infrastructure is heading, look out 5, 10, 15 years, I can totally buy that people are mostly going to be writing their applications as functions. You know, they’re using lambdas and serverless, and all that stuff. At a visionary level, that sounds great. People write application logic snippets, you can talk to a database that you don’t really need to know how it works, and you can make network calls to functions, and it just works.
Unfortunately, we can barely run our container-based infrastructures today, let alone serverless, which I would argue is an order of magnitude more complicated in terms of the concerns around auto scaling and dynamic infrastructure, and networking, and observability. We’re just not there yet.
So, if I look out 10 years, I do think that eventually we will be using a lot more pre-canned solutions, where a lot of things will be a lot more automatic in terms of write some code snippets, have them run. The operational concerns around deploying, and all of those things, they’re just going to work. That doesn’t mean though, that 10, or 15, or 20 years from now, when people are writing their applications with amazing functional substrates, it doesn’t mean that there doesn’t have to be a reliability culture. Because you can write a functional snippet that breaks all of production.
I do believe that over time, cloud native infrastructure will become easier to use. A lot of the stuff that we struggle with today in terms of how people do data storage, and how people do multi-region, and how they do routing, and all of these things that are a mess today, I think a lot of them are going to become just built in to cloud platforms. But there’s still all the human concerns around things like “okay, just because the cloud platform supports this amazing canary system, do people use it?” You can’t force people to do things. You can’t force people to use feature flags. You can’t force people to think about how to test their software.
So I don’t think we’re ever going to reach a point in which we get rid of this reliability engineering role. I think there always is going to have to be this duality of roles, where we have people that are thinking about product, we have people that are thinking about reliability. It may be that in the future, what took 200 people on an infrastructure team takes four, but it doesn’t mean that we don’t still have to be thinking about these basic human things.