Getting the Most Out of SRE, SLOs, and Error Budgets with Joseph Bironas at Collective Health
What nuances in execution and mentality separate successful SRE implementations from the failed ones? How can you get the most out of your SLOs and error budgets?
Joseph Bironas shares often-overlooked but critical insights that answer these questions. Joseph has 14 years of experience in SRE, 12 of them at Google. His insider’s insights are incisive, multi-disciplinary, and empathetic, linking the significance of SRE to both business and engineering.
Joseph currently leads the SRE team at Collective Health, a company that is transforming the employer-driven healthcare economy, redefining the way health benefits work.
This “CliffsNotes” summary curates the key points that Joseph Bironas discussed in the 50-minute interview. It is not a standalone article and is most valuable when contextualized by the podcast. The points have been reordered and regrouped from the podcast for clarity and flow.
The significance of reliability:
- To engineering: product quality is just as important as product functionality.
- To business: a reliable product is key to a company’s brand & its customers’ trust.
- SLIs are user experience-centric.
- SLOs are an organizational guardrail for managing risk.
- SRE teams set a perimeter of defense, then slowly expand.
- Operate without blame.
Counterintuitive Mentality Shifts
For successful SRE implementations
- Consider reliability as a core feature.
- People’s minds are implicitly fixed to 100% reliability, but we should never aim for 100% reliability.
- It’s not enough to set a boundary for risk with SLOs; you want to proactively control the risk by running experiments that test and address key system vulnerabilities.
Key Components of Managing Risk and Ensuring Reliability in SRE
- SLI (Service Level Indicator)
- Definition: A user-centric metric
- E.g. success rate, availability, latency
- SLO (Service Level Objective)
- Definition: Internal objective for an SLI (e.g. 99.95% availability)
- The ideal is to hover on the SLO line (i.e. be ever-so-slightly above or below the SLO)
- Acting on SLOs:
- When you are meeting SLO, developers are signaled to move faster.
- When SLA < actual < SLO, blamelessly focus effort on resolving the SLO-violating issues at hand
- Code Yellow:
- Definition: A practice triggered when the SLO is violated beyond a predetermined threshold (one that thoughtfully balances risk and agility)
- E.g. Code freeze when availability is at 99.92% given a 99.95% SLO and 99.9% SLA
- Error Budget:
- Definition: 1 – SLO. A non-zero level of acceptable risk (E.g. 0.05% of monthly service unavailability given a 99.95% SLO)
- When you first start, it’s natural to spend freely until you’re about to burn the entire budget.
- Eventually, you want to have controlled risk/unavailability. How can you control risk and strategically spend the budget? Go to [34:00] to find out.
- E.g. Experiment with traffic control
- The heart of SRE is increasing development velocity with these decisions.
- E.g. Incident walkthrough.
- Sharing results
- E.g. Display monitors of a custom-built dashboard in the hallways.
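To make the SLO, SLA, Code Yellow, and error budget figures above concrete, here is a minimal Python sketch. It uses the numbers from this summary (99.95% SLO, 99.9% SLA, Code Yellow at 99.92%). The 30-day window, the function and class names, and the status labels are illustrative assumptions, not something taken from the interview.

```python
from dataclasses import dataclass

# Assumption: the "monthly" window is 30 days (43,200 minutes).
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60


@dataclass(frozen=True)
class Targets:
    """Availability targets, all in percent (hypothetical structure)."""
    slo: float          # internal objective, e.g. 99.95
    sla: float          # contractual floor, e.g. 99.9
    code_yellow: float  # freeze threshold between SLA and SLO, e.g. 99.92


def error_budget_percent(slo: float) -> float:
    """Error budget = 100% - SLO, in percent (rounded to hide float noise)."""
    return round(100.0 - slo, 6)


def error_budget_minutes(slo: float,
                         window_minutes: int = MINUTES_PER_30_DAY_MONTH) -> float:
    """Allowed unavailability over the window, in minutes."""
    return round(window_minutes * error_budget_percent(slo) / 100.0, 3)


def status(actual: float, t: Targets) -> str:
    """Map measured availability to the action described in the notes."""
    if actual >= t.slo:
        return "meeting SLO: developers can move faster"
    if actual > t.code_yellow:
        return "below SLO: blamelessly fix the issue at hand"
    if actual >= t.sla:
        return "code yellow: code freeze"
    return "SLA violated"


if __name__ == "__main__":
    targets = Targets(slo=99.95, sla=99.9, code_yellow=99.92)
    print(error_budget_percent(targets.slo))
    print(error_budget_minutes(targets.slo))
    for measured in (99.97, 99.94, 99.92, 99.85):
        print(measured, "->", status(measured, targets))
```

With a 99.95% SLO, the budget works out to 0.05%, or roughly 21.6 minutes of unavailability over a 30-day window. Where exactly the Code Yellow threshold sits between the SLA and the SLO is a judgment call for each organization.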
Code with Empathy
Joseph’s unique insights about the cross-disciplinary human emotions involved with SRE
- BD + customers know that the SLA is up to human negotiation. Agreement on what happens when SLA violations take place can only be reached when we understand each other’s wants and needs.
- Developers are incredibly aware when a risky release is going out.
- Customer support has less pressure and doesn’t have to run defense around things that cause customer pain when SRE is done well.
- Leadership must be willing to sacrifice/change when the SLO line is reached (i.e. when the error budget is burned to zero). Leadership needs to empower someone to say “a code freeze is needed now”.
- The SRE team may feel guilty when they violate SLOs (even though the SLA hasn’t been violated). It’s best to treat the violation like an incident: be blameless, relieve the guilt, and focus on action!
- Cognitive load is a leading indicator of stress and burnout, but it can be addressed by re-scoping the problem domain.
- Be blameless when responding to problems like service outages.
- SREs should feel empowered to raise issues and fearless about doing so.
- SLO is an objective measure that we can look at without emotions or blame.
Step-by-Step Process to Implementing SLOs
- If you are in a large company, you must first overcome inertia.
- Obtain leadership buy-in. (This was already present at Collective Health)
- Set an SLO: determine an acceptable level of risk – error budget.
- Set up the technical infrastructure for data collection for SLOs.
- Train the engineering org. on how to interpret and act on SLOs and error budgets.
- Learn to strategically allocate the error budget and control risk.
- It’s normal to start with a centralized SRE team, but many companies eventually create specialized/embedded SRE teams for different products.
- When should you specialize? Go to [36:30] to find out!
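For the data-collection and error-budget steps above, a rough Python sketch of what the collected data might feed into is shown here. It assumes a request-based availability SLI (good requests divided by total requests). The function names and the sample counts are hypothetical and are not drawn from the interview.

```python
def availability_sli(good: int, total: int) -> float:
    """Fraction of requests that succeeded (assumed request-based SLI)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= good <= total:
        raise ValueError("good must be between 0 and total")
    return good / total


def budget_consumed(good: int, total: int, slo: float) -> float:
    """Fraction of the error budget used so far (1.0 means fully burned).

    slo is a fraction such as 0.9995. An SLO of 100% leaves no budget,
    so it is rejected rather than dividing by zero.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    # Validate the counts the same way as availability_sli.
    availability_sli(good, total)
    allowed_failures = (1.0 - slo) * total
    actual_failures = total - good
    return actual_failures / allowed_failures


if __name__ == "__main__":
    # Hypothetical counts for one measurement window.
    good, total = 999_750, 1_000_000
    print(f"SLI: {availability_sli(good, total):.4%}")
    print(f"Budget consumed: {budget_consumed(good, total, 0.9995):.0%}")
```

In this made-up window, 250 failed requests out of 1,000,000 against a 99.95% SLO use about half of the budget, since 500 failures would be allowed. A number like this is what lets a team decide how much risk it can still spend.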
SLO/Error Budget Implementation Challenges:
- The technical challenge of getting data out of the system.
- Visibility & Perception
- Educate the leadership team and the entire engineering org. about what the numbers mean and what we can do about them.
- Training and messaging are critical to success.