The Blameless Podcast

Resilience in Action E5:

Tammy Bryant and Eric Roberts The Importance of Glue Work

August 14, 2020

Kurt Andersen

Kurt Andersen is a practitioner and an active thought leader in the SRE community. He speaks at major DevOps & SRE conferences and publishes his work through O'Reilly in quintessential SRE books such as Seeking SRE, What is SRE?, and 97 Things Every SRE Should Know. Before joining Blameless, Kurt was a Sr. Staff SRE at LinkedIn, implementing SLOs (reliability metrics) at scale across the board for thousands of independently deployable services. Kurt is a member of the USENIX Board of Directors and part of the steering committee for the world-wide SREcon conferences.

Resilience in Action is a podcast about all things resilience, from SRE to software engineering, to how it affects our personal lives, and more. Resilience in Action is hosted by Blameless Staff SRE Amy Tobey. Amy has been an SRE and DevOps practitioner since before those names existed. She cares deeply about her community of SREs and wants to take what she’s learned over the 20+ years of her career to help others. 

In our fifth episode, Amy chats with Tammy Bryant, Principal SRE at Gremlin, skateboarder, and horror movie lover, and Eric Roberts, Sr. Manager SRE at Under Armour, performer/writer/recorder of music, and coffee aficionado.

See the full transcript of their conversation below, which has been lightly edited for length and clarity.

Amy Tobey: Let’s do introductions.

Eric Roberts: My name is Eric Roberts and I've been manager of our SRE team at Under Armour for close to a year. Prior to that I was a senior lead reliability engineer.

Tammy Bryant: I'm Tammy and I work at Gremlin. I'm a Principal SRE. I've been at Gremlin for almost three years, and I started as the ninth employee. Prior to Gremlin, I worked at Dropbox where I was the SRE Manager for databases and storage. I've done a lot of chaos engineering over the years. It's something I really love to do. I’ve been doing it for over 10 years now, which is pretty cool.

Amy Tobey: I'm really glad to have you both here. Before we recorded, the thing that got me most excited is that Eric is building an SLO program for Under Armour. Eric, I wonder if you could just spend a little bit of time telling us how you got started in that. How did you get that conversation going with your leadership and your teams?

Eric Roberts: So, we're a bit of a larger engineering organization and we've got multiple products. We've got MyFitnessPal, MapMyRun, and a couple others. We really needed a framework to get alignment across these teams, because they don't all work the same way. We also needed alignment on how we measure success for ourselves. One of the things that was attractive about SLOs is they really orient the conversation towards measuring from the user's perspective.

When you frame all of these SLO conversations around the user's perspective, the technical part of the conversation fades away and you start to understand the organizational and social parts of these conversations. It really helps you get to the core of the problem that you're trying to solve. To use the SLO parlance, these are: what are the key user journeys that we care about that are important to our users and our business, how do we measure those accurately, and when an incident occurs, what's the real impact of that incident after the fact?

Amy Tobey: I love how you went into a space of thinking about the customer experience first. When you started this, was there already an idea of what the critical user journeys were or was that discovered as part of the process?

Eric Roberts: I think people knew what the 80/20 critical user journeys were, but it was interesting as we got into the conversations and put a spreadsheet together identifying those things that some other things kind of crept to the top. There are the obvious things that are part of the core experience of your application. Those things drive revenue, but there are other things that may not be as obvious that are not tied to money but they are important to the user experience. What do we do with those? Do we consider those in a critical user journey? Do we want to invest in the reliability of those things? It really added clarity to the list of things that we actually care about for each of our products.

Amy Tobey: That's really cool. Tammy, you work with a lot of folks all around the world doing similar things with chaos programs. I'm curious if you've run into parallels where, in a new chaos implementation, a lot of folks are still figuring out what the connection to the customers is. Is that pretty common in your experience?

Tammy Bryant: Yeah, that's pretty common. It just makes me think of so many experiences I've had over the last few years around understanding those user journeys for customers. It's really important to know what that is, but I think it's also a great opportunity to talk to other teams. You get to actually talk to the finance team, the product team, and the marketing team.

That's really important because, for example, if you're responsible for leading databases and capacity planning, you really need to know when marketing is doing its next campaign. Is that going to bring us millions of users, hundreds of millions? How do I prepare for that? How do I make sure that I inject failure in advance to know that I'm ready for when that campaign launches? What are the common user journeys that are going to be utilized? Are they adding new ones? It's really hard to know that unless you're actually asking the right questions, because people don't just email you and say, "Hey Tammy, I’m adding three user journeys this week." I wish they did, but they don't.

Amy Tobey: Often, what I've seen when I'm helping people is there's a disconnect between the technology solution and aligning it with that user journey and all those different departments. Like, marketing can be so important for us.

Tammy Bryant: There are so many different teams that need to know about those journeys and the different critical parts, and your backup plans if something fails during that journey. Obviously, if you launch a brand new campaign, and then suddenly, a critical part of that journey doesn't work, that's very bad. Maybe you spent millions of dollars on ads and then no one's able to actually sign up or make purchases.

Amy Tobey: It's even worse, right? Because they might not come back.

Tammy Bryant: I didn't think about that much when I started out as an engineer. I remember one day I found out how much money the application I was responsible for was making every day, and then I was like, "Whoa."

Amy Tobey: That's a terrifying realization sometimes. I used to say when you are incident commander at Netflix, it's like being the captain of a $6 billion starship.

Tammy Bryant: You don't realize that when you're just starting out. You're like, "Wow, this is a big responsibility on my shoulders, to keep everything up and running."

Eric Roberts: One of the most exciting things for me is these conversations where you get engineers, product, PMO, and customer happiness all in the same room talking about user journeys. You see these light bulbs going off like, "Oh my gosh, they care about that?" Or everybody just really rallies around something that needs attention.

Amy Tobey: I'm starting to think that is really the true measure of the people that should be staffed in principal roles: when you live for the light bulb. It's not for the commits or the big political wins, it's when you get a couple of engineers in a room and you see them go, "Oh, my gosh." I like to orient them around all of our different aspects of the work in SRE, like incidents. In an incident analysis, my favorite thing is when we're going through analysis and then somebody comes afterwards and says, "I had no idea."

Tammy Bryant: That happens all the time. Pretty much every single incident I've ever worked on.

Amy Tobey: My favorite one was the big GitHub outage of 2018. I was interviewing people. We were talking about what their experience was, and I went through a list of questions. One of the questions was, "What surprised you?" People were like, "I just didn't know that database was that important." That, to me, was a light bulb moment. This is interesting because it wasn't the developer’s fault for not knowing the importance. They were working at a high level of abstraction. Each of these methods, like SLOs, incidents, and chaos, helps us create those situations. SLOs and chaos create those situations under less stressful circumstances.

Tammy Bryant: It helps you realize you always need a backup plan, you always need redundancy, and everything's important. If it's a critical service, then it's probably important. You don't really know how it's going to fail. Until it does fail, it's always a surprise.

Amy Tobey: So, Eric, you've started having these discussions with the teams and you've had some light bulb moments, and then those teams go away. What's your next step? Do they implement SLOs on their own? Is there a check-in process?

Eric Roberts: The way that we're approaching it is not putting a bunch of work on the teams, at least not in this initial rollout. That may be something unique about how our SRE team operates. We're very big on empathy and collaboration, and realizing that they've already got a lot on their plates. The way that we're initially approaching this is, SRE is going to take on the brunt of the work to get it going. Then we need to figure out a way to transfer that ownership to the teams themselves. Part of the thought there is that by having them as stakeholders and part of the process, they understand what they're agreeing to when we're assigning these SLO and error budget policies.

It's not like at the end of this, it's some waterfall method where we just hand them the policies we created. I don't like the ivory tower way of looking at that. The SRE team is actually taking it upon itself to go and find the SLIs that we'll use to make up the SLOs. If those don't exist, we work with the teams to understand, "Okay, if we want to measure this application event, where would we do that in the code?" Either the SRE team instruments the code or cuts a ticket for the team to instrument that, but we’re doing the bulk of the work to set things up for now.

Amy Tobey: Instead of the ivory tower, it's like the ivory bungalow where you do all the heavy lifting up front, but you keep them engaged. That's the part that really caught my attention, putting them in as a stakeholder. Is that along with the product owners or are engineers also the product owners?

Eric Roberts: That's in addition to the product owners. The product team, engineering team, leads, managers, customer happiness all have a particular perspective which is important when you're trying to lay all this stuff out.

Amy Tobey: Absolutely. In chaos programs, I feel like there are a ton of parallels. You see a lot of processes, Tammy. Drawing from what Eric said, what do you see as the successful things that work for you with aligning everyone?

Tammy Bryant: For the last three years my team has been responsible for running all of the chaos engineering at Gremlin, using Gremlin on Gremlin. It's really evolved over the last few years. When I first came in, I was like, "Okay, I want to do it learning from what I've done in the past, but then also from our founders." They've done chaos engineering at Netflix and at Amazon, so it was cool to be able to merge all those different ideas and see what we could do. We were running these really awesome game days where we would invite the entire company to come along and see how it works: this is our hypothesis, now we're going to run some attacks, now we're going to record the results and set 1-3 key action items for how we can make an improvement before the next game day. That actually worked really well for a long time.

Over the last six months, we were like, now we need to scale this because we can't really invite the entire company to a game day. We had to stop doing that a while ago, but that's really cool when you can do that. That's like having a festival.

Now it's much smaller. We actually have a bot. We use the Donut bot to match people into mini game days with three people running a game day together. The engineers are running the game day and we're there to take their feedback always. Everyone gets to say what they think should be done to improve. That's a really big thing I think is important. You've always got to listen to everyone. Because if people don't like it or don't want to do it, then you're going to hear about it. That's the same for every single SRE practice.

Amy Tobey: There's that underlying learning that we can't really track. I like how you both valued the input and the participation of the stakeholders and the engineers. It creates that environment where they're in a place where they can learn beyond the follow-ups that we discovered.

Tammy Bryant: I actually had marketing, product, and sales, in the game days, which was really cool. That's why I started to think a lot more about how different marketing campaigns impact you as an SRE, because I worked on some big ones. I was there for the Dropbox launch of Dropbox Business and we got like 100 million new customers in one year. That's a lot of scale.

I totally agree with Eric. Stakeholder management is super important. You don't really learn it at university, it's something you just have to learn on the job.

Amy Tobey: The School of Hard Knocks is the only place I know that teaches it. Eric, have you noticed in your process those out-of-band people discovering things that they didn't know before and learning about the business and the processes that surround them as part of the SLO implementation process?

Eric Roberts: We were talking before the call about how implementing SLOs—putting the metrics in place and getting the SLOs measured—is the easy part. It's getting everybody on the same level of understanding of what we're doing. It's like popping the stack for engineers. We need to be talking in terms of not the low-level technical details, but the higher level—going back to user-oriented language.

Amy Tobey: I like the cultural element. It always seems that the hardest part of doing SRE work isn't the technical stuff. All of us can write shell scripts and implement Kubernetes. The hard part comes once we've shaken out all of the stuff we can do in the infrastructure to make things more reliable, and now we need to have things cut across the organization and the culture. I really like the idea that these implementation processes are almost more of a cultural change than a technical implementation. The technical side almost pales in comparison.

Eric Roberts: Referencing what Tammy was saying a minute ago about, “We can only go so far by inviting the entire company to game day,” you have to figure out how to scale the culture.

Tammy Bryant: It's all about scale. One of the things that I learned when working at Dropbox was great SRE builds products and then sells them internally. I love that idea.

That totally changed how I thought about things. You're building things, and you're looking for customers: "Oh, I've got this idea for how we could totally revamp how we do monitoring or change how we do alerting to make it way better to improve on the time to detection." But you've got to build a prototype or MVP, show it to a few people, get some feedback, maybe build a little team around it, and get a budget. If you follow that mindset and just keep trying to do that and get better, you learn a lot of great skills. It's also a lot of fun. You find people across the company that you didn't know would be very passionate about that or are really good at it. You uncover people's skills.


Eric Roberts: I didn't think that being a salesperson or a cultural champion was ever going to be part of my job description. I found it to be both challenging and gratifying at the same time because you see these little light bulbs and these connections that you're making between people.

Amy Tobey: When we talk about measuring engineering work, my favorite way to talk about it is leverage. High leverage work versus low leverage work. Fixing bugs is relatively low leverage, but changing minds is probably the highest leverage. What's at the top of your scale or most gratifying to you personally?

Eric Roberts: For me, it's establishing an idea or a goal and convincing everybody that this is the right thing to do. You see the momentum turn the corner and then everybody's talking about the thing.

Tammy Bryant: I think mine is honestly continuing to push yourself, and everyone doing that together so you can create something you've never done before. I really love that feeling where you're like, "Wow, we got to this point and that's amazing. We've got some awesome results, but we also had fun along the way." I'm really a big fan of having fun at work too, not making it so serious, because you can get great results and have a good time doing it. Then you get to that moment where you're like, "Whoa." You look back and you're like, "How did we even do that? That's awesome."

Amy Tobey: Hindsight especially is gratifying. I feel like this would be a good point for us to shift to the dark side of the force and talk a little bit more about how you go from where we are in the process in terms of changing minds and bringing on light bulbs to understand how these processes serve us. How do we make it concrete? That's probably where we start putting in the chaos injection or the SLI metrics.

Eric Roberts: One of the things that I know is on the horizon for us is the first time we have an SLO miss that's significant and we have to go back to the agreement and deal with the consequences that we set. Either our roadmap is going to change, or the next couple sprints are just all about whatever work that was a result of that SLO miss.

Amy Tobey: What does that look like? I think this is where a lot of people get stuck on SLOs. Most people I've ever talked to are like, "This is a great idea. I love it." Then they go and implement a bunch of graphs and stuff, and then they go, "Wait, why isn't this doing anything?" It's that part you just talked about. How do we turn it into the right kind of feedback and accountability?

Tammy Bryant: One of the things that we've been doing recently, like today, is we ran four mini game days at the same time. We are really going into the idea of scale. One of the themes that we're focusing on right now is validating and verifying SLOs and SLIs using chaos engineering. So, inject failure and then see that you can actually measure that with your SLO.

But another chaos engineering attack that we like to do is trigger it to page. Like make a non-critical service unavailable, but have a set SLO. Then make the primary get paged. Then we do another attack where we say, “Let's have the primary not respond and see if it falls through correctly to the secondary.” These are all the things that you learn over time. Sometimes the primary doesn't have reception, or something's gone wrong and it drops to the secondary. You want to still be able to respond fast if you breach SLO. How do we use what we've built to get some value out of it as well?
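The two attacks Tammy describes can be sketched in code. This is a minimal illustration, not Gremlin's tooling: the function names, the request counts, and the 99.9% target are all assumptions made up for the example. It checks that an injected failure actually pushes a request-based SLI below its target, and that a page falls through to the secondary when the primary does not acknowledge.

```python
from typing import List, Optional, Set


def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded; an empty window counts as healthy."""
    if total_requests == 0:
        return 1.0
    return good_requests / total_requests


def attack_is_visible(baseline: float, during_attack: float, slo_target: float) -> bool:
    """True if the SLI met the target before the attack and missed it during."""
    return baseline >= slo_target and during_attack < slo_target


def first_responder(on_call_chain: List[str], acknowledged: Set[str]) -> Optional[str]:
    """Walk the on-call chain in order; return the first person who acknowledges."""
    for person in on_call_chain:
        if person in acknowledged:
            return person
    return None  # nobody answered: the escalation policy itself needs work


# Hypothetical game day numbers, not real measurements.
SLO_TARGET = 0.999
baseline = availability_sli(good_requests=9_995, total_requests=10_000)
during_attack = availability_sli(good_requests=9_400, total_requests=10_000)

slo_measures_failure = attack_is_visible(baseline, during_attack, SLO_TARGET)
# Simulate the primary having no reception: only the secondary acknowledges.
paged = first_responder(["primary", "secondary"], acknowledged={"secondary"})
```

If `slo_measures_failure` came back false in a real game day, that would suggest the SLI is measuring the wrong thing or is too coarse to notice the failure, which is exactly the kind of finding the exercise is meant to surface.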

Amy Tobey: This is the part that I like: when you have a result, say, the primary gets paged and the secondary maybe is five minutes late, and then you go in and you identify that there's something wrong, either they need new devices or whatever. How do you take that information back to the organization to effect change?

Tammy Bryant: One of the things that we always do is a write-up. I'm actually a big fan of writing up reports or doing little presentations. I think that's a really good skill. Just getting good at writing, enjoying it, and having fun with it. Being able to tell stories, I guess that's what it's all about, storytelling. Take those experiences and don’t just send the alerts, the file, or whatever, in an email and be like, "Done." Make it more like, "Okay, let's tell the story. When we tried to set these SLOs and we went to verify them, we noticed these issues came up, but these are some ideas we have to be able to fix this going forward."

I think a big thing is actually using it as a brainstorming working session. That's one of my favorite things to do, take those findings and then have a 30 minute meeting. In 30 minutes, you're just trying to see what we can build, what we can improve that is going to give us a big positive impact but isn't going to be a ton of work. I don't want to have to do a five month project if it's something very small.

Amy Tobey: So many of these things are like, "Oh, crap, we just need to fix this one thing."

Tammy Bryant: Exactly. So, in 30 minutes with three people as a working session, you can get a ton done within an SRE team.

Amy Tobey: I feel like that's something that works really well in organizations that are already pretty Agile. Eric, I imagine as you're implementing your SLO program, you're running into this a lot. Did you have anything to add about how you get that information back into the teams and products? For example, we implement an SLO, maybe we run a chaos test and we see that the error budget responds as we expected, and then we go and we dig in and we find out that maybe it went more than we expected, and now we've learned something about our system, which is useless until we get it fed back into the apparatus that builds the system. This is not what we, as SREs, own. That's owned by product and engineering. How are you approaching that? Is that a close partnership? Are they on the hook for the performance of SLOs? What's your mechanism for creating that flow?

Eric Roberts: It's funny that you mention that because that's an active thing that I'm working on, engaging our project management organization. Sometimes you get a lot of incidents and you get action items associated with those incidents, but what happens with those action items? How do we make sure that action items are fed back into the team's work stream? How do you make sure that you're having the right conversations about those outcomes from the incident? For example, sometimes it's just an easy task, right?

But sometimes it's bigger in scope, more ambiguous, and it deserves further discussion. Part of the conversation that is happening right now is, how do we surface this, adapt, and have the right conversations to generate the right ideas? That's where I really like Tammy's idea of making it fun. Rather than calling it a meeting, call it a brainstorming session, because you go into that with a different frame of mind—playful, maybe. And then being sure to follow through on that to capture the concrete action items. Is that a story that goes into the next sprint or is it a roadmap item that comes back on six months down the road?

Back to your original question, SRE can't do that for everyone, right? It’s about establishing relationships with our project management organization to say like, "Hey, how can we make sure that we never lose the thread, that there's a smooth transition of this work that comes out of incidents into sort of the regular way of working for teams?"

Tammy Bryant: One thing that I really liked that I've seen done before was this idea of bucketing engineering work. When I joined Dropbox, I think I'd been there for maybe two or three months, and they sent out this developer efficiency survey for every engineering manager to fill out. I had to fill it out for my team and ask my team to help me do the metrics. They asked us to measure how much time we spend on features, on core products, and on KTLO (keep the lights on). They wanted to make sure you're actually spending time on all three of those buckets. I love that, encouraging people to do maintenance work. They know you have to upgrade Ubuntu. We're not going to push you to continuously do features, because then you'll have issues. That worked really well. It's giving engineering managers the permission to allocate time to that work.

Amy Tobey: It's pretty explicit permission for everybody in the system. It's the autonomous choice. I'm giving you a baseline minimum amount of stewardship, which is the opposite of what everybody else does, which is like, "I want to push this as close to zero as possible." At least that's how it's felt for most of my career.

Tammy Bryant: I was like, "Oh, how much percentage should KTLO be?" They're like, "Oh, it's better to just measure it first and then see if you can improve it over time." I was like, "That is so cool." When I first measured it, it was really high for our team. I think it was like 70% of our time was spent on maintenance work.

It was like thousands of machines, tens of thousands, very small team, three engineers. So, heavily automated, but still a lot of work to do that. There's all this stuff that has to be done that's manual, sometimes. But we actually got it down. We got it down to like 30% within a year. I think if we never measured that, then we would never know what we actually got it down to. But it felt good that I knew it didn't have to get to zero. They're like, "If you could get to 20 to 30%, that would obviously be better for your team. You're getting paged less." It makes sense. But I don't like the idea of encouraging every team in a company to have 0% time allocated to doing maintenance work.
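The "measure it first" approach Tammy describes comes down to simple bookkeeping. Here is a minimal sketch, assuming time is logged in hours against the three buckets she names. The function name and the hour figures are invented for illustration and chosen only so the shares land on the 70% and 30% she mentions; they are not Dropbox's survey data.

```python
from typing import Dict


def time_shares(hours_by_bucket: Dict[str, float]) -> Dict[str, float]:
    """Convert hours logged per bucket into each bucket's share of total time."""
    total = sum(hours_by_bucket.values())
    if total == 0:
        return {bucket: 0.0 for bucket in hours_by_bucket}
    return {bucket: hours / total for bucket, hours in hours_by_bucket.items()}


# Made-up weekly hours for a three-person team (3 x 40 = 120 hours).
first_survey = time_shares({"features": 18, "core_products": 18, "ktlo": 84})
a_year_later = time_shares({"features": 48, "core_products": 36, "ktlo": 36})

print(f"KTLO share: {first_survey['ktlo']:.0%} -> {a_year_later['ktlo']:.0%}")
# prints: KTLO share: 70% -> 30%
```

The point of the sketch is the ordering: record a baseline before setting any target, then track the share over time instead of trying to drive it to zero.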

Amy Tobey: It should never be zero, and it probably should never be less than 5-10% if we're being responsible.

Tammy Bryant: Think about a car; you need to do an oil change, you need to take it to get cleaned. It’s a system.

Amy Tobey: Cars have changed over the years. When they were built, especially in the '70s and '80s, which is most of the cars I used to work on, they were designed for very short maintenance intervals. 2,000 miles was the oil change frequency, right? That's the only one they paid attention to, but all of the other things on the car were three months, six months before you were supposed to replace them. Cars would just fall apart because people wouldn't do the maintenance, yet they were designed for constant maintenance. Then, starting in the '90s, that changed: the cars we have today are designed for something like 100,000-mile maintenance intervals. It's a completely different game.

Tammy Bryant: Focusing on the teams, you need to talk to them about the maintenance that you have to do, like, "We have to upgrade the operating system, we have to work on this firmware stuff, this is just baseline things that we need to do." I always used to say that I have work that I need to do to have a healthy fleet. I'm just trying to talk to people like I had a farm of animals or something, like I can't just not feed them all.

Amy Tobey: Somebody's got to pick up all the bits that fall on the floor and sweep them up and get them out of the data center, right?

Eric Roberts: I can't remember the exact quote, but within the last year, something I came across was, with respect to incidents, what incidents didn't happen today, and the fact that there are humans doing work every day, keeping the lights on.

Amy Tobey: That's the new view of safety. If we meet a 99.9% SLO, we are 99.9% successful. I love that reading so much more than the old thing about downtime. "I don't want to do downtime, I'm scared of downtime." Instead be like, "I want to maximize my success." It feels like a little tiny math trick, but it has such a drastic change in orientation of the people.
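The "little tiny math trick" can be written out. The sketch below is an illustration under stated assumptions: a 30-day window for the time-based view and one million events for the request-based view are both hypothetical, as are the function names. It shows that a 99.9% target leaves a small but non-zero allowance for failure, and how much of that allowance is still unspent.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full unavailability a time-based SLO tolerates in the window."""
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of an event-based error budget still unspent (negative if overspent)."""
    allowed_bad = (1.0 - slo_target) * total_events
    if allowed_bad == 0:
        return 0.0
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


# Roughly 43.2 minutes of downtime fit inside a 99.9% SLO over 30 days.
minutes = error_budget_minutes(0.999)

# 500 failures out of 1,000,000 events spends about half of a 99.9% budget.
remaining = budget_remaining(0.999, good_events=999_500, total_events=1_000_000)
```

Framed this way, the same number reads as "we succeeded 99.95% of the time and have about half our budget left" instead of "we had 500 failures".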

Eric Roberts: It's a psychological thing, right? Reframe it in a more positive way. I like that.

Amy Tobey: I've been playing with this idea of maybe we misnamed SREs and really should be thinking of ourselves as culture engineers. So much of our work is culture engineering.

Tammy Bryant: Most SREs, often, will be working with every team in the whole company. It ends up being what happens as you go through your career.

Amy Tobey: Where do you feel like we are? Concerning the centralized SRE model or the service SRE model, there are a bunch of different ways that people do this. Are you centralized where you own big piles of stuff and support lots of teams, or do you embed people sometimes?

Eric Roberts: We're more of a central SRE team with product specific SREs that have domain expertise in the products; they have skin in the game. They're on on-call rotation, either for the product or maybe for our incident commander rotation.

Tammy Bryant: I think it depends on the company size for what works well. A lot of people ask me, "How do I build and scale my SRE team? How many SREs do I need?" I've seen people say, "Well, if you have this many product engineers, then you probably need this many SREs." You can kind of figure it out, but then it's very hard to find those SREs and hire them. That's why I do like this idea of, you can have a small, really amazing SRE team that builds tools for other teams to utilize as long as they're really good at working with other teams, taking feedback, continuously building tools, and measuring that people are actually using those tools and getting value out of it. If they don't, then they're deprecating their SRE tools, too.

I like that model a lot, but it's because it's hard to hire SREs. If you're at a bank with 10,000 engineers then it would be awesome to have a few hundred SREs, but where do you find them? I don't know.

Amy Tobey: I did some interviews for a talk I didn't end up giving because the conference got canceled, but I interviewed a whole bunch of people from all over SRE orgs and I found that there wasn't a consistent ratio of engineers to SREs. I asked everybody about it, and I heard everything from 3:1 to 100:1.

The other story that really stuck in my brain was about hiring SREs being so hard. They said that the recruiters at their company, if they get stuck on SRE recruiting, their main goal in life is to do well enough to get reassigned off of SRE because it's such a tough job to do the hiring for and they're still held accountable to the metrics. We're a difficult bunch.

Eric Roberts: I've had some folks reach out about job descriptions for SRE. An example would be, "We keep getting infrastructure engineers." I think it comes back to what you said, Amy, about how SRE has such a big cultural component that it's hard to hire for that, much less the sort of technical aspects of SRE, because it's all over the map, or can be.

Amy Tobey: You're not just learning some new tool. Usually, when you're hired in a new place, there's something in the tech stack that you’ve got to learn from scratch. It’s usually not a big deal to us. We get good at learning tech. But the part that I find really daunting in joining a new organization is learning its culture, and then discovering all the little dark spots, and the bright spots. That has taken me 20 years to learn and I still don't know how to teach it. That's the other tricky part: you can't just sit down and explain the facts to somebody and then be like, "Go forth and make SRE stuff happen."

Eric Roberts: You kind of have to just do it. You have to look for people who have done it before.

Tammy Bryant: I did hire a few apprentice SREs when I was at Dropbox. That was really cool to be able to do that program. I had them from Holberton School. We sat down and thought, what could a program look like? It was a six month apprenticeship. Afterwards they all stayed on and they're still there now doing really well.

They've been there for a few years now. That was pretty amazing to me because I always felt it's tough to hire SREs because you need people with the technical skills like coding, automation, but then also deep Linux skills, knowing about kernels or databases or some special domain area, and then all the people skills too, like getting buy-in and stakeholder management, great writing skills, presenting in a meeting. We actually helped them learn all those different skills as part of their apprenticeship.

Amy Tobey: I think it gets left out a lot, right? Eric and I were talking about this earlier. Especially when I was younger, I would go straight to the technical problem that I understood and not want to worry about this, because if I could solve the technical problem, maybe I don't have to fix the people thing.

Tammy Bryant: Then the company might say, "No, but we really need you to fix this other issue."

Amy Tobey: In that training program, I think it's really cool that you had those glue skills or the soft skills on top of that without getting hyper focused on learning storage technology or something.

Tammy Bryant: I wanted them to be able to float around. That was a big thing. I was like, "I want you to be able to be an SRE later on the traffic team, or on Kafka, Nginx, or whatever, so you don't feel stuck in your career." I think that's pretty important because technology changes so fast. Yeah, you're so right, Amy. I love that word too, glue skills. That's nice.

Amy Tobey: That comes from Tanya Reilly's post about glue work.

Eric Roberts: Honestly, I wonder what it says about the space that we have created this role that has to have this huge bucket of skills that is really hard to define. It's like we're trying to be more efficient in how we do this, but the fact that an SRE... I mean, there are other roles as well that have a wide range of skills, but just within this conversation, we've gone from, "Yeah, you have to be a cultural champion, and also a technical guru and everything in between."

Amy Tobey: The reason why I'm here talking about this high-level stuff is the years of pain. Our peers spend decades trying to improve reliability and hitting their heads against the wall, because the technology's not getting you there, and then finally having that epiphany, the light bulb moment, and going, "Oh, God, now I’ve got to learn people. If I'm going to fix this, I've got to do people work."

Tammy Bryant: It's really true. You've got to do it. When I first started, the reason I started to do reliability work was actually when I first was put onto a team building mortgage broking software. Everything I built was so slow. Then I went backwards and backwards and had to talk to every single person that owned every layer of the stack until I could figure out how I could make it work. That's my first experience, just sheer pain.


Amy Tobey: Then you start to discover the stuff that works. SREs evolved out of this need: the directive we have, which is to improve availability and reliability of the system, and then the realization that we can't keep working down in a subgraph of the system. We have to work on the whole thing if we're going to improve availability.

Eric Roberts: I think it goes back to where we started with SLOs and chaos engineering in general. What's our North Star? It's the user, right? Because if you get mired down in those rabbit holes of technicality and you've lost the North Star of the user, that's how you end up having siloed groups and people not focusing on the right things. To me, that's comforting, to know that I can always turn around and be like, "Okay, I may have lost the script a little bit but I know I can come back to what is it from the user's perspective that we're trying to do here?"
