The Essential Guide to SRE
Why Site Reliability Engineering
In the world of technology, the stakes have never been higher. The move to the cloud and microservices to maximize agility has given rise to digital disruptors and unprecedented competitive threats. As distributed systems become increasingly complex, the scale of ‘unknown unknowns’ increases. On top of this, customer expectations are sky-high. The cost of downtime is catastrophic, with customers willing to churn if their needs are not promptly met. According to Gartner, the average cost of downtime is $300,000 per hour. For some companies, this number is considerably higher; for example, Amazon lost approximately $90 million during its Prime Day outage in 2018, and the outage only lasted 75 minutes.
Organizations need to prioritize reliability so they can innovate as quickly as possible on top of a strong foundation that won’t compromise customer experience. This will become even more critical as more businesses move toward distributed systems with high reliability requirements.
That’s where site reliability engineering (SRE) comes in. The SRE function is growing quickly (30-70% YoY growth in job listings), but there is not enough skilled talent in the market to compensate. In other words, it will be important to understand how you can not just hire SREs, but grow your existing organization to adopt the practices and mindsets required for production excellence. With the shortage of SREs for hire, what can you do to ensure your service’s reliability?
To answer this question, you’ll need a deeper understanding of what SRE actually is.
What is SRE?
SRE is a practice first coined by Google in 2003 that seeks to create systems and services that are reliable enough to satisfy customer expectations. Since then, many large organizations such as LinkedIn and Netflix have adopted SRE best practices.
In recent years, SRE has become more widely adopted by many organizations globally, with the goal of reliability and resilience in mind in light of exponentially growing customer expectations as well as systems complexity.
SRE is based on a customer-first mentality. This means that SRE efforts are all tied to customer satisfaction, even if the customers using the service are actually internal users. Each decision should result in protecting or improving customer satisfaction.
Teams work together to determine which factors and experiences affect customer happiness, measure them, set goals, and balance reliability requirements with the innovation velocity required to stay viable in an increasingly competitive digital landscape.
To achieve this goal, SREs and teams that have adopted SRE best practices refer to several key tenets of SRE.
According to Google, these include:
- Ensuring a durable focus on engineering
- Pursuing maximum change velocity without violating a service level objective (SLO)
- Monitoring, including alerts, ticketing, and logging
- Emergency response
- Change management
- Demand forecasting and capacity planning
- Provisioning, and
- Efficiency and performance
According to Forrester, 46% of the tenets can be applied out-of-the-box for most software teams in the enterprise, but the rest require customizations or won’t make sense for the vast majority of organizations. The important question to ask yourself is how these tenets fit in with what you’re already doing, and how your teams can improve. We’ve got more answers below.
What sort of systems do SREs manage? Is it just websites?
The "site" in site reliability engineering may make you think that SRE principles are limited to maintaining websites. However, SREs manage the reliability of any type of software system, from legacy applications, to device-specific firmware, to modern day apps, and of course, websites themselves.
Understanding How SRE Fits Into Your Operations Model
A common early mistake is assuming that adopting SRE best practices means you’ll need to rip and replace your current procedures, which simply isn’t true. In fact, SRE can work as a complement to both DevOps and ITIL methodologies. The trick is to ensure that regardless of your organization’s different operating models or toolchains, there is shared visibility, communication, and collaboration across teams.
This will allow your disparate teams to stay aligned while using the best practices from each methodology.
How SRE works within other technical teams
Before we dive into how SRE interacts with other infrastructure frameworks like ITIL or DevOps, we should look at how it fits within the general technical scope of your teams. Are SREs developers? Do they work with the operations team? Do they all sit together? Who do they report to? The short answer for all of these is: it depends.
Depending on the size of your organization and your specific needs, your SRE team model will look different. Some organizations prefer a distributed model, where one or more SREs are assigned to each development team and work with them to ensure their projects meet reliability standards. Other orgs want a centralized team, where SREs work together to develop practices and infrastructure for all development teams to use. Other orgs won't have dedicated SREs at all, but instead distribute the work of an SRE among developers and operators. For more information on the many forms an SRE team can take, check out this blog post.
How SRE works with DevOps
Think of SRE as the practice that brings life to the DevOps philosophy. The core principles of DevOps and SRE are nearly identical.
According to Google’s course on SRE, “class SRE implements DevOps,” the 5 DevOps principles are as follows:
- Reduce organizational silos: SRE helps by sharing ownership across developers and production teams, and unifying tooling.
- Accept failure as normal: Blameless postmortems are an SRE best practice that ensures that all incidents are used as learning opportunities. SRE also creates a safe space and guardrails for failure through SLOs and error budgets.
- Implement gradual change: This is done by canarying rollouts to a small subset of customers before allowing all users to interact with new features. Smaller changes are easier and safer to dissect and iterate on.
- Leverage tooling and automation: SREs work to eliminate toil by measuring it and creating automation to do repetitive tasks without needing human intervention. This way, humans can focus on higher-value work.
- Measure everything: SRE specifically focuses on measuring toil and reliability to make sure that both customers and software teams are happy with the service.
With these common principles defined, it’s easy to see how SRE and DevOps fit really well together, with SRE codifying practices that make it easier to achieve the promises of DevOps.
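The "implement gradual change" principle above can be sketched in a few lines. This is a minimal illustration, not a description of any particular rollout tool: the function name `in_canary`, the feature name, and the bucketing scheme are all assumptions made for the example. It hashes each user into a stable bucket so that the same small subset of customers sees a new feature until the percentage is raised.

```python
import hashlib


def in_canary(user_id: str, feature: str, percent: float) -> bool:
    """Return True if this user falls inside the canary cohort for a feature.

    Hypothetical helper: hashes feature and user together so every feature
    gets its own stable cohort, then compares the bucket to the rollout size.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash onto a bucket in [0, 10000).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


if __name__ == "__main__":
    users = [f"user-{n}" for n in range(1000)]
    exposed = sum(in_canary(u, "new-checkout", 5) for u in users)
    print(f"{exposed} of {len(users)} users see the canary")
```

Because the bucket depends only on the user and the feature, raising the percentage from 5 to 50 keeps the original canary users in the cohort and simply adds more, which makes regressions easier to trace.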
How SRE works with ITIL
In practice, ITIL and SRE can also make for a great combination. The first reason why is simple: every organization wants happy customers, and ITIL and SRE can help different functions work together to make that a reality. Embedding reliability throughout the software lifecycle can ensure a higher rate of customer happiness.
With the newest revision of ITIL (ITIL 4), which introduces seven guiding principles, SRE and ITIL align even more closely.
- Start Where You Are: Adopting SRE best practices is not one-size-fits-all, and everyone starts somewhere. Taking the first steps and implementing and iterating as you go is what matters most.
- Keep it Simple and Practical: In the Google SRE book’s chapter on simplicity, it states “Unlike just about everything else in life, ‘boring’ is actually a positive attribute when it comes to software! We don’t want our programs to be spontaneous and interesting; we want them to stick to the script and predictably accomplish their business goals.” Simplicity in both software and business operations streamlines communication, increases velocity, and helps ensure that reliability isn’t compromised. Less is more.
- Optimize and Automate: One of the goals of SRE is to automate toil-heavy processes and free up developer time to focus on innovation instead of unplanned work. This optimizes workflows and allows new features to ship faster.
- Progress Iteratively with Feedback: SREs set alerts for the most important and user-centric metrics. The metrics, alerts, and SLOs they’re tied to are all iterated upon to better satisfy customer needs.
- Collaborate and Promote Visibility: SRE is culturally collaborative. It focuses on a blameless work culture that values learning from failure, and trusting that each team member is doing what he or she thinks is best for the organization.
- Focus on Value: Without customers, there is no value in software. Business value is created when customers want, and get, what they need from a product. SRE best practices ensure that the product is reliable enough to provide value to the customers, and also protect the most important customer journeys. Thus, they provide significant value to the organization in helping to drive shared focus.
- Think and Work Holistically: By breaking down silos and focusing on scalability and reliability on a holistic level, SREs are able to provide significant benefits in maturing the organization. Business-wide success is in the hands of every team member, and SREs work to make sure that the company’s product, systems, and procedures are resilient enough to not just meet but exceed customer standards.
For a visual on how SRE, DevOps, and ITIL’s best practices can be used in conjunction with each other, here is a handy graph.
Whether you identify as a DevOps or ITIL shop, your organization has something to gain by following the principles of SRE.
Let’s dive into what exactly these principles entail.
Principle #1: Create a Mindset of Resiliency
Resiliency isn’t something that just happens; it’s a result of dedication and hard work. To reach your optimal state of resilience, there are some crucial SRE best practices you should adopt to strengthen your processes.
Incident Playbooks
As you know, failure is not an option… because actually, it’s inevitable. Things will go wrong, especially with growing systems complexity and reliance on third-party service providers. You’ll need to be prepared to make the right decisions fast. There’s nothing worse than being called in the wee hours of a Sunday morning to handle a situation where thousands of dollars are going down the drain every second. Your brain is foggy, and you’ll likely need time to adjust to the extreme pressure of a critical incident. In these cases (and really, all cases where an incident is involved), incident playbooks can help guide you through the process and maximize the use of time.
According to Chris Taylor at Taksati Consulting, good incident playbooks help you cover all your bases. They typically include flowcharts and checklists to depict both the big picture and the minute details, a RACI (responsible, accountable, consulted, informed) chart for each step, and a list of environmental influences that are unique to your system.
To create your incident playbook, Chris recommends aggregating the following information:
- An inventory of relevant tools
- The right personnel/subject matter experts to engage in response
- Knowing the problem to solve, or the workflow you’re trying to document
- Current state (whether this is a new process, or updating an old one)
By developing incident playbooks and practicing running through them, you’ll be more prepared for the inevitable.
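One way to keep a playbook's checklist and RACI chart together is to store them as structured data. The sketch below is an assumption about how that might look, not a format recommended by the source: the `PlaybookStep` class, its field names, and the example roles are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PlaybookStep:
    """One checklist item with its RACI assignments (hypothetical structure)."""
    description: str
    responsible: list[str]
    accountable: str
    consulted: list[str] = field(default_factory=list)
    informed: list[str] = field(default_factory=list)
    tools: list[str] = field(default_factory=list)
    done: bool = False


def validate(steps: list[PlaybookStep]) -> list[str]:
    """List the steps that are missing an owner."""
    problems = []
    for number, step in enumerate(steps, start=1):
        if not step.responsible:
            problems.append(f"step {number}: nobody is responsible")
        if not step.accountable:
            problems.append(f"step {number}: nobody is accountable")
    return problems


def next_step(steps: list[PlaybookStep]):
    """Return the first unfinished step, or None when the checklist is complete."""
    return next((step for step in steps if not step.done), None)


if __name__ == "__main__":
    playbook = [
        PlaybookStep("Acknowledge the page", ["on-call engineer"], "incident commander",
                     tools=["paging tool"]),
        PlaybookStep("Post a status update", ["communications lead"], "incident commander",
                     informed=["support team"]),
    ]
    print(validate(playbook) or "every step has an owner")
    print("next:", next_step(playbook).description)
```

Running `validate` as part of a regular playbook review is a cheap way to catch steps whose owners have changed teams or left.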
Change Management
Change management is often done haphazardly, if at all. This means that organizations are unable to manage the risk of pushing new code, possibly leading to more incidents. Rather than employ ITIL’s arduous CAB method, SRE seeks to empower teams to push code according to their own schedule while still managing risk. To do this, SRE uses SLOs and error budgets.
SLOs, or service level objectives, are internal goals for service availability and speed which are set according to customer needs. These SLOs serve as a benchmark for safety. Each month, you have a certain allowable amount of downtime determined by your SLO. This allowance is your error budget, and you can spend it on pushing new features. If a feature is at risk of exceeding your error budget, it cannot be pushed until the next window. If the feature poses little to no risk to your SLO, then you can push it.
Each month, teams should aspire to use the entirety of their error budgets without exceeding them. This way, your organization can optimize for innovation, but do so safely without risking unacceptable levels of customer impact.
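The push decision described above can be sketched as a simple gate. This is an illustration under stated assumptions: the `ErrorBudget` class, the `can_push` function, and the idea of expressing a release's risk as estimated minutes of downtime are hypothetical simplifications, and real teams usually estimate risk less precisely.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    slo: float               # availability target, e.g. 0.999
    window_minutes: float    # length of the SLO window
    consumed_minutes: float  # downtime already spent in this window

    @property
    def total_minutes(self) -> float:
        """Downtime the SLO allows over the whole window."""
        return (1 - self.slo) * self.window_minutes

    @property
    def remaining_minutes(self) -> float:
        """Downtime still available to spend (negative when overspent)."""
        return self.total_minutes - self.consumed_minutes


def can_push(budget: ErrorBudget, estimated_risk_minutes: float) -> bool:
    """Allow a release only if its estimated risk fits in the remaining budget."""
    return estimated_risk_minutes <= budget.remaining_minutes


if __name__ == "__main__":
    # 30-day window, 99.9% SLO, 30 minutes of downtime already used.
    budget = ErrorBudget(slo=0.999, window_minutes=30 * 24 * 60, consumed_minutes=30)
    print(f"remaining budget: {budget.remaining_minutes:.1f} minutes")
    print("push low-risk change:", can_push(budget, 5))
    print("push high-risk change:", can_push(budget, 20))
```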
Capacity Planning
Black Friday outages, scaling, moving to the cloud: all of these big events require heightened capacity planning. If you don’t have enough load balancers on Black Friday or Cyber Monday, you might be sunk. Or, if your company is simply growing quickly, you’ll need to adopt best practices to make sure that your team has everything it needs to be successful. There are two types of demand that require additional capacity: organic demand (your organization’s natural growth) and inorganic demand (the growth that happens due to a marketing campaign or holiday). To prepare for these events, you’ll need to forecast the demand and plan time for acquisition.
Important facets of capacity planning include regular load testing and accurate provisioning. Regular load testing allows you to see how your system is operating under the average strain of daily users. As Google SRE Stephen Thorne writes, “It’s important to know that when you reach boundary conditions (such as CPU starvation or memory limits) things can go catastrophic, so sometimes it’s important to know where those limits are.” If your service is struggling to load balance, or the CPU usage is through the roof, you know that you’ll need to add capacity in the event of increased demand. That’s where provisioning comes in.
Adding capacity in any form can be expensive, so knowing where you need additional resources is key. It’s important to routinely plan for inorganic demand so you have time to provision correctly. The process of adding capacity can sometimes be a lengthy effort, especially if it’s the case of moving to cloud. You’ll also need to know how many hands you’ll need on deck for these momentous occasions.
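To make the forecasting step concrete, here is a rough sketch that combines organic and inorganic demand into a provisioning estimate. Everything in it is an assumption for illustration: the compound-growth model, the event multiplier, the 30% headroom, and the function names are not taken from the source, and the per-instance throughput would come from your own load tests.

```python
import math


def forecast_peak_rps(current_peak_rps: float, monthly_growth: float,
                      months_ahead: int, event_multiplier: float = 1.0) -> float:
    """Project peak requests per second.

    Organic demand is modeled as compound monthly growth; inorganic demand
    (a campaign or holiday) is modeled as a multiplier on top of it.
    """
    organic = current_peak_rps * (1 + monthly_growth) ** months_ahead
    return organic * event_multiplier


def instances_needed(peak_rps: float, rps_per_instance: float,
                     headroom: float = 0.3) -> int:
    """Instances required to serve the peak while keeping spare headroom.

    rps_per_instance should be the limit found through load testing.
    """
    usable = rps_per_instance * (1 - headroom)
    return math.ceil(peak_rps / usable)


if __name__ == "__main__":
    # Hypothetical numbers: 1,000 rps today, 5% monthly growth, and a holiday
    # six months out that is expected to triple traffic.
    peak = forecast_peak_rps(1000, 0.05, 6, event_multiplier=3.0)
    print(f"forecast peak: {peak:.0f} rps")
    print(f"instances to provision: {instances_needed(peak, 200)}")
```

Rerunning the forecast whenever load tests or growth figures change gives you an early signal of how much lead time acquisition will need.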
Capacity planning is an important part of having a resilient system because in thinking about the allocation of resources, your team members matter. They need time off for holidays, personal vacations, and the obligatory annual cold. When you fail to plan for time off, you won’t have enough hands on deck to handle incidents as they occur. Denying people time off is obviously not the answer, as that will only lead to burnout and churn. So it’s important to develop a capacity plan that can accommodate people being, well, people.
Johann Strasser shares four steps you can take to develop a capacity plan that will eliminate staffing insecurity:
- Establish all necessary processes with the appropriate staff – from top management to team leaders. Decide how often you will need to revise/revisit this process and make sure that everyone is in agreement on this.
- Provide for complete and up-to-date project data and prioritize your projects. What projects are the most important, and which can be put on the back burner for now? Additionally, how long will each project take? You’ll need this data to be able to move forward with accurate plans.
- Identify the capacities across your existing team, as well as your infrastructure and services. Is the team equipped and system architected in a way that minimizes performance regressions, to protect efficiency and capacity?
- Consolidate the requirements (step 2) and the capacities (step 3). Identify underload as well as overload and try to balance them.
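The final step above, comparing requirements against capacities, can be sketched as follows. The team names, the hours-based unit, and the 10% tolerance are hypothetical choices made for the example rather than part of Strasser's method.

```python
def available_hours(headcount: int, hours_per_person: float,
                    time_off_hours: float) -> float:
    """Team capacity for a period after subtracting planned time off."""
    return max(headcount * hours_per_person - time_off_hours, 0.0)


def balance(required: dict[str, float], capacity: dict[str, float],
            tolerance: float = 0.1) -> dict[str, str]:
    """Label each team as overloaded, underloaded, or balanced.

    A team counts as balanced when required hours are within the given
    tolerance of its capacity.
    """
    report = {}
    for team in sorted(set(required) | set(capacity)):
        need = required.get(team, 0.0)
        have = capacity.get(team, 0.0)
        if need > have * (1 + tolerance):
            report[team] = "overload"
        elif need < have * (1 - tolerance):
            report[team] = "underload"
        else:
            report[team] = "balanced"
    return report


if __name__ == "__main__":
    capacity = {
        "platform": available_hours(5, 160, time_off_hours=120),
        "payments": available_hours(4, 160, time_off_hours=0),
    }
    required = {"platform": 900, "payments": 400}
    for team, status in balance(required, capacity).items():
        print(f"{team}: {status}")
```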
So, now you’ve got the people and the process, but how can you learn and improve on your resilience? For that, you’ll need great postmortem practices in place that facilitate real introspection, psychological safety, and forward-looking accountability.
Postmortem (or Incident Retrospective) Best Practices
When something goes wrong, it’s important to learn from it to prevent the same mistake from happening again. To do this, it’s important to craft and analyze postmortems (or post-incident reviews, RCA reports, or whatever you like to call them). To have postmortems worthy of analysis, applying SRE best practices will be key. In fact, postmortems are a great place to begin your SRE adoption journey.
As Steve McGhee, SRE Leader at Google, shares, “Conducting blameless postmortems will enable you to see gaps in your current monitoring as well as operational processes.”
Armed with better monitoring, you will find it easier and faster to detect, triage, and resolve incidents. More effective incident resolution will then free up time and mental bandwidth for more in-depth learning during postmortems, leading to even better monitoring.
Building a postmortem practice will eventually enable you to identify and tackle classes of issues, including fixing deeply rooted technical debt. With time, you’ll be able to directly improve systems continuously.
One of the most important elements of a postmortem, and of SRE as a whole, is the notion of blamelessness. To learn from postmortems, there needs to be total transparency. Opening up about mistakes can often be frightening, and requires a psychologically safe space to do so. Positive intent should always be assumed in order to foster the trust that allows for true openness. Blaming team members or defining people as the root cause for failure will only lead to more insecurity, covering up the important truths that postmortems are meant to uncover.
To craft great postmortems, there are four other best practices that will ensure your incidents are being used to their full advantage:
- Use visuals in your postmortems: As Steve McGhee says, “A ‘what happened’ narrative with graphs is the best textbook-let for teaching other engineers how to get better at progressing through future incidents.” Graphs provide an engineer with a quickly readable yet in-depth explanation for what was happening during the incident days, weeks, or even years later.
- Be a historian: Timelines can be invaluable for parsing through a particularly dense incident. Chat logs can be cluttered, and it’s difficult to quickly find what you’re looking for. Thus, it’s important to have a centralized timeline that gives a clean, clear summary of the events. This also provides the context that helps relevant team members analyze what happened.
- Tell a story: An incident is a story. To tell a story well, many components must work together. Without sufficient background knowledge, this story loses depth and context. Without a timeline dictating what happened during an incident, the story loses its plot. Without a plan to rectify outstanding action items, the story loses a resolution.
- Publish promptly: Promptness has two main benefits: first, it allows the authors of the postmortem to report on the incident with a clear mind, and second, it soothes affected customers. Best-in-class companies like Google, Uber, and others have internal SLOs around publishing their postmortems within 48 hours.
Creating incident playbooks, utilizing change management and capacity planning, and following postmortem best practices will all contribute to your system’s resilience, but that’s not all that SRE seeks to do.
Principle #2: Reduce Engineering Problems / Innovation Blockers
Focusing on the customer has been a key business strategy since the beginning of time. But how do you really know what your customers want, and how can you guarantee you’re providing it? SRE’s concept of SLIs (service level indicators), SLOs (service level objectives), and error budgets will keep your organization aligned on what customer success looks like.
Service Level Indicators
When you look at your product through the eyes of your user, you aren’t just finding the right SLIs, but creating key information for constructing a user journey. A user journey is a powerful tool for many aspects of product design as it helps designers focus on users’ priorities. The lessons you learn from developing and analyzing user journeys can be insightful in the most fundamental areas of product design, but for these insights to be accurate, the underlying data must be carefully selected.
The touch points between the user and your service all involve requests and responses – the building blocks of SLIs. For each touchpoint you identify, you should be able to break down the specific SLIs measuring that interaction. From there, you can follow each branch that the user could take, gathering the SLIs for the following requests into a bundle for that journey.
To understand user intent, you must identify potential pain points for the chosen journey. Your bundle of SLIs can be instrumental in finding pains that might otherwise be invisible.
Let’s say that a user’s journey involves making a dozen or more requests to the same service component, like clicking through many pages of search results. Separately, these requests return quickly enough that users won’t be bothered, maybe under a second, and a user looking at just one or two pages will be satisfied with this speed. However, if your user journey involves looking through twenty pages, the annoyance of nearly a second wait, repeated twenty times, could be intolerable. Only through looking at relevant monitoring data as well as understanding the broader context could you discover this point of user frustration.
Finding these pain points along the user journey could lead to a radical redesign of the service as a whole. Additionally, it opens up a path to solutions deep in the backend and helps determine priorities for development. In our example above, you could either redesign the catalog to avoid the need to look through twenty pages, or you could optimize the components serving those pages until the total delay across the twenty pages is still acceptable.
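The search-results example can be expressed as a small check over a bundle of SLIs. This is a sketch with made-up thresholds: the one-second per-request limit, the ten-second journey limit, and the function name `journey_pain_points` are assumptions chosen to mirror the scenario, not measured values.

```python
def journey_pain_points(journeys, per_request_limit_s, journey_limit_s):
    """Find journeys whose requests each look fine but add up to too much waiting.

    journeys maps a journey name to the request latencies (in seconds)
    observed along it.
    """
    painful = {}
    for name, latencies in journeys.items():
        every_request_ok = all(t <= per_request_limit_s for t in latencies)
        total = sum(latencies)
        if every_request_ok and total > journey_limit_s:
            painful[name] = total
    return painful


if __name__ == "__main__":
    journeys = {
        "quick search": [0.9, 0.9],  # two result pages
        "deep browse": [0.9] * 20,   # twenty result pages
    }
    for name, total in journey_pain_points(journeys, 1.0, 10.0).items():
        print(f"{name}: {total:.1f}s of cumulative waiting")
```

A per-request latency alert would never fire for the "deep browse" journey, which is why looking at the journey as a whole surfaces pain that individual SLIs hide.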
Once you identify what makes your customer happy, it’s important to set goals to reach them.
Service Level Objectives
Service Level Objectives, or SLOs, are internal goals for the essential metrics of a service, such as uptime or response speed, and they correlate to customer happiness.
As SLOs are always set to be more stringent than any external-facing agreements you have with your clients (SLAs), they provide a safety net to ensure that issues are addressed before the user experience becomes unacceptable. For example, you may have an agreement with your client that the service will be available 99% of the time each month. You could then set an internal SLO where alerts activate when availability dips below 99.9%. This provides you a significant time buffer to resolve the issue before violating the agreement:
Service Level Agreement with Clients: 99% availability – 7.31 hours acceptable downtime per month
Service Level Objective Internally: 99.9% availability – 43.83 minutes acceptable downtime per month
Safety Buffer: 6.58 hours
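The arithmetic behind these figures is short enough to check directly. The sketch below assumes an average month of 730.5 hours (365.25 days spread over 12 months), which should land on the figures above give or take rounding in the last digit; the function names are hypothetical.

```python
# Average month length in hours: 365.25 days per year over 12 months.
HOURS_PER_MONTH = 365.25 * 24 / 12


def allowed_downtime_hours(availability: float,
                           window_hours: float = HOURS_PER_MONTH) -> float:
    """Downtime that an availability target tolerates over one window, in hours."""
    if not 0 <= availability <= 1:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1 - availability) * window_hours


def safety_buffer_hours(sla: float, slo: float) -> float:
    """Time between breaching the internal SLO and breaching the external SLA."""
    if slo < sla:
        raise ValueError("the SLO should be at least as strict as the SLA")
    return allowed_downtime_hours(sla) - allowed_downtime_hours(slo)


if __name__ == "__main__":
    sla, slo = 0.99, 0.999
    print(f"SLA allowance: {allowed_downtime_hours(sla):.2f} hours")
    print(f"SLO allowance: {allowed_downtime_hours(slo) * 60:.2f} minutes")
    print(f"Safety buffer: {safety_buffer_hours(sla, slo):.1f} hours")
```

Trying a few SLO values here shows how quickly the buffer shrinks as the SLO approaches the SLA, which is useful when deciding how much response time your team realistically needs.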
Knowing that you’ll have over six and a half hours between your internal objective and an agreement breach can provide some peace of mind as you deploy. However, it can be difficult to determine a buffer that provides sufficient time to respond when disruptions occur. Garrett Plasky, who previously led Evernote’s SRE team, describes this challenge:
“Setting an appropriate SLO is an art in and of itself, but ultimately you should endeavor to set a target that is above the point at which your users feel pain and also one that you can realistically meet (i.e. SLOs should not be aspirational).”
It may be tempting from a management perspective to set an SLO of 100%, but it just isn’t realistic. Development would be paralyzed by fear that the smallest change could trigger an SLO breach. Moreover, such a high target isn’t helpful. As Garrett points out, the SLO should still be set above the point where the users of the service are pained, as any refinement beyond that quickly gives diminishing returns for additional user satisfaction.
Setting SLOs can also positively impact development velocity by giving developers the opportunity to use small amounts of downtime to improve the service. This amount of time allowed is called an error budget.
Error Budgets
Error budgets are the amount of downtime that can be spared per window before violating an SLO. Setting error budgets can positively impact your organization in many ways. First, it can increase the rate of innovation. Developers no longer need to spend time consulting with other teams before doing a code push, as long as the push won’t endanger the SLO and falls within the error budget. They can spend down the error budget on new features, or choose to allocate time instead to fixing technical debt or infrastructure. This also ensures that pushes don’t threaten the reliability of your system or customer satisfaction.
Beyond increasing innovation, error budgets also align different parts of the organization on incentives and consequences. With an error budget in place, developers can push code as fast as they need to without compromising reliability. Thus, developers, product, and production teams are all happy. If error budgets are overextended for a certain period of time, there are also consequences predetermined by the error budget policy, such as a code freeze.
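An error budget policy can be written down as a small rule that maps budget consumption to a consequence. The sketch below makes two assumptions that go beyond the text: it counts the budget in failed requests rather than minutes of downtime, and it adds a hypothetical "caution" tier before the code freeze. The names and the 25% threshold are illustrative only.

```python
from enum import Enum


class Action(Enum):
    SHIP = "ship normally"
    CAUTION = "slow down risky releases"  # assumed intermediate tier
    FREEZE = "code freeze"


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    if total_requests == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_failures = (1 - slo) * total_requests
    return 1 - failed_requests / allowed_failures


def policy_action(remaining: float, caution_below: float = 0.25) -> Action:
    """Map the remaining budget to the consequence the policy prescribes."""
    if remaining <= 0:
        return Action.FREEZE
    if remaining < caution_below:
        return Action.CAUTION
    return Action.SHIP


if __name__ == "__main__":
    for failed in (500, 900, 1200):
        remaining = budget_remaining(0.999, 1_000_000, failed)
        print(f"{failed} failures: {policy_action(remaining).value}")
```

Agreeing on thresholds like these before an incident is what lets developers, product, and production teams treat the consequence as a shared rule rather than a negotiation.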
Improve Engineering Efficiency and Morale
SRE not only helps customers stay happy, it also boosts morale within the organization.
Happy engineers mean happy customers, as engineers won’t build the best products possible without support from the organization.
There are two major ways that SRE can help brighten engineering’s day.
- Minimizing toil: One of the main focuses of SRE is automation. Toil is a waste of precious engineering time. When SREs create frameworks, processes, and internal tooling to eliminate it, engineers can get back to innovating.
- Reducing tech debt: SREs create accountability around postmortem follow-up action items to make sure that old issues aren’t buried under new code. SREs also put together frameworks to help developers deliver more performant code, prioritizing what matters most to the customer experience. Pinpointing the tech debt build-up that hurts customer experience is important to guide refactoring initiatives and other practices to help teams spend less time on reactive, unplanned work and more time on the things that matter for the business. This establishes a baseline for healthy engineering practices to help minimize future accrual of tech debt.
Additionally, SREs invest in cultural change that prevents more tech debt from accruing in the future, while still making way for innovation. Jean Hsu, Co-Founder of Co Leadership, wrote about refactoring Medium's codebase, and realized that the most important thing she could do for her team wasn’t just to fix spaghetti code; it was to create a culture that fixes technical debt as it goes along, deleting old code as needed.
Jean wrote, “I realized that if I always did this type of work myself, I would be constantly refactoring, and the rest of the team would take away the lesson that I'd cleanup after them. Though I did enjoy it myself, I really wanted to foster a long-term culture where engineers felt pride and ownership over this type of work.”
SREs are often the cultural drivers for this sort of work, improving the way engineering teams function as a whole rather than simply going from project to project fixing bugs. These changes are long-term initiatives that spark growth and adoption of best practices for the entire organization.
As you can see, SRE could positively impact each engineer’s day-to-day productivity. In fact, SRE is not about tooling or job titles, and is rather a more human-centric approach to systems as a whole.
With this context in mind, adoption brings positive business benefits for everyone in the organization.
Principle #3: Approach Systems from a Human Perspective
Resiliency engineering as a practice looks at systems holistically, considering not only infrastructure but also human, process, and cultural factors. Without adopting the culture and mindset behind SRE, you’ll simply have new processes with no uniting value at the center to keep the initiative in place. Focusing on the human approach to systems requires reevaluating your organization’s attitude towards the following: On-call & full service ownership practices, keeping burnout at bay, and celebrating failure.
Any organization can adopt SRE best practices, and it can begin in small increments. The most important change you will make will be the cultural one. As organizations are made of people, any organization can foster continuous learning, blameless culture, and psychological safety so long as its people are committed to a growth mindset. Once these cultural factors are in place, it becomes much easier to implement the practices, processes, and tools that scale that culture of excellence.