Getting to 99.999% Availability with Twilio’s Tyler Wells
A remarkable milestone for any company's site reliability engineering (SRE) is five 9s availability. That's less than 30 seconds of service unavailability per month! It's exactly what Twilio has accomplished. Twilio is the world's leading communication platform with more than two million developer accounts. When we get an anonymous call or text from our Uber/Lyft drivers, that's Twilio at work.
Tyler Wells is the Director of Engineering at Twilio. He oversees all Programmable Video (WebRTC) and Client SDK teams distributed across the globe: Vancouver, NYC, Austin, Madrid, and Latvia, to name a few. He shares how Twilio guides widely distributed teams to exceptional operational excellence. Throughout the interview, Tyler exudes rigor and discipline in his thinking and expression. Here are the building blocks he shares for getting to five 9s: Empathy, Chaos Engineering, and the Operational Maturity Model (OMM). The key points Tyler shared are summarized in the article below. (If you haven't heard of SRE, SLA, SLO, or SLIs, now is a good time to quickly read through this cheat sheet.)
If our service is not reliable, a person considering suicide may not get the help at a time of their greatest need.
At Twilio, each one of the 300+ engineers practices SRE principles. Small autonomous teams at Twilio take their products from idea/concept all the way through production. Teams are responsible for the operational excellence and upkeep of their systems.
Empathy - The foundation of 99.999% availability
To achieve greatness, it all starts with empathy for the customer. The engineering org must understand how it impacts people's lives when you are not providing five nines. For instance, Crisis Text Line builds on top of us. That's suicide prevention. If we are not providing five nines, we are not providing the most reliable communication for someone at a time of their greatest need. In less dire cases, it would still be awful if grandma is right in the middle of wishing happy birthday to little Joey when the call drops, or if the doctor can't clearly hear or see the patient describing their symptoms during a call or video session. With empathy as the foundation, companies can turn to the engineering problems of achieving five 9s availability.
To achieve five 9s availability, the engineering org must understand how it impacts people's lives when you are not providing five nines.
Chaos engineering - Break everything yourself
It's not that Twilio has fewer incidents, it's that Twilio resolves most of the incidents before production. Before a system ever even reaches production, you should've broken it a thousand times. As an engineering organization, you should be in there purposely breaking all of your stuff, because you know once it's out there it's gonna break.
Do your own chaos engineering. Break everything yourself. Use a tool like Gremlin. Understand:
- How long does it take for you to detect something that's gone wrong in your systems?
- How long does it take to resolve once you've detected that something has gone wrong?
- What are the tools and instrumentation you put in place that give you the signal that something is not quite right? (See the sketch below.)
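To make those questions concrete, here is a minimal, self-contained sketch. It is not Twilio's tooling; the ToyService class, the 5% alarm threshold, and the polling interval are all illustrative assumptions. It injects a fault into a toy service and measures how long a naive error-rate monitor takes to notice.

```python
import random
import time

class ToyService:
    """Stand-in for a real service; fail_rate simulates an injected fault."""
    def __init__(self):
        self.fail_rate = 0.0

    def handle_request(self) -> int:
        return 500 if random.random() < self.fail_rate else 200

def error_rate_alarm(service, threshold=0.05, window=200) -> bool:
    """Naive monitor: sample a window of requests and compare the error rate to a threshold."""
    errors = sum(1 for _ in range(window) if service.handle_request() == 500)
    return errors / window > threshold

if __name__ == "__main__":
    svc = ToyService()
    fault_injected_at = time.monotonic()
    svc.fail_rate = 0.30                 # the "break it yourself" step
    while not error_rate_alarm(svc):     # how long until the monitor notices?
        time.sleep(0.1)
    print(f"time to detect: {time.monotonic() - fault_injected_at:.2f}s")
```

The same loop structure works for the second question: keep polling after the fix is applied and record how long it takes for the alarm to clear.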
Before a system ever even reaches production, you should've broken it a thousand times.
Removing fear with chaos testing
At Twilio, teams roll their changes into stage, with stage mirroring production as closely as possible. They then inject faults/chaos into those systems to create mock incidents inside of stage. The goal is to simulate as many situations as possible before getting into production.
By the time they get to production, teams have the muscle memory and know how to react to incidents. They have validated their graphs and their incident tests and know that they are getting a clean signal. Teams should be confident that the monitors they've created will provide directionality on why things are breaking.
Delays that improve availability are okay
Does chaos engineering delay shipping of the product? Yes. During chaos engineering, the first question I ask my team is "What is your prediction here? Based on your understanding of the systems, what do you think is going to happen when you do X?" X can be black-holing an API, injecting packet loss, or spinning the CPU up super high. We then enact chaos.
There's been a case during chaos testing when we expected media failure, but our dashboard showed nothing! We then stopped the test to see what was wrong. This type of delay is okay. By breaking everything yourselves, you can prevent the worst thing: letting the customer tell you about something that's failed inside of your systems.
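A hypothesis-first structure keeps these experiments honest: write down the prediction before injecting the fault, then compare it with what the monitors actually showed. The sketch below is an assumed shape, not Twilio's internal tooling; the fault description, prediction text, and tool name are placeholders.

```python
from dataclasses import dataclass

@dataclass
class ChaosExperiment:
    fault: str           # e.g. "black-hole the media API"
    prediction: str      # what the team expects to happen
    observed: str = ""   # filled in after the fault is injected

    def surprised(self) -> bool:
        # Any mismatch between prediction and observation is a finding to chase
        # down in stage, before a customer ever sees it in production.
        return self.observed != self.prediction

exp = ChaosExperiment(
    fault="inject 20% packet loss on the media servers",
    prediction="call-quality dashboard alarms within 2 minutes",
)
# ...run the fault injection with your tooling of choice (e.g. Gremlin), then record:
exp.observed = "no dashboard change"   # the 'media failed but the dashboard stayed quiet' case
if exp.surprised():
    print("Stop the test and close the monitoring gap before shipping.")
```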
Operational Maturity Model (OMM): Don't expect five 9s from day one
The OMM gamifies the path to five 9s, dividing it into three stages: Aware, Scaling, and Ironman. Individual teams at Twilio must meet criteria across a wide array of dimensions and policies in order to get their product into production (e.g. the Operational Maturity Model for Logging). Example criteria at the Aware stage:
- Be willing to communicate and include an SLA for your service in a contract
- If there's a breach in SLO/SLA, you know the customer impact and which customers are affected
Products don't start at five 9s. Newer products going from Aware to Scaling go into beta at three to four 9s (99.9%-99.99%). The beta launch gives the team time to have early incidents and learn from them, improving SLO performance for availability before launch. Teams can take from a couple of weeks to a couple of months to get Ironman certified. When a product is Ironman certified, it means that the product is generally available (GA) and published to customers. Customers can now trust that we have put a lot of time into operating, securing, and scaling our systems and making them reliable.
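For context, the downtime budgets implied by those targets work out as follows. This is standard availability arithmetic over an assumed 30-day month, not Twilio-specific figures.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 2,592,000 seconds in a 30-day month

for target in (0.999, 0.9999, 0.99999):
    budget = SECONDS_PER_MONTH * (1 - target)
    print(f"{target:.3%} availability -> {budget:,.1f} s of downtime per month")

# 99.900% -> 2,592.0 s (~43 minutes)
# 99.990% ->   259.2 s (~4.3 minutes)
# 99.999% ->    25.9 s
```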
Meeting SLOs - How we win customers' trust
Twilio's number one goal is to earn our customers' trust. SLAs are important because that's what our customers rely upon. Inside Twilio, however, SLOs are important because that's how we go above and beyond customers' expectations. One SLO we have is five 9s availability. We also have SLOs for response time, MOS scores (for call quality), packet loss, mean time to resolution (MTTR), and more.
Using our performance against SLOs, we are always analyzing which services are throwing the most 500s and why. The answers enable us to act and constantly improve our availability (a small analysis sketch follows the examples below):
- If a service is error-prone because of failures in a downstream dependency, can I ensure that it doesn't cause an outage? (E.g. Can I route around it? Can I find another carrier that can handle that traffic by shifting everything from US East to US West?)
- If we see a concentration of 400 level errors inside of specific customers, we would reach out to these customers and ask "how can we reduce these errors for the future?" (E.g. better documentation, UI, etc.)
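As a rough illustration of that analysis, here is a sketch over a made-up request log. The service names, customers, and log shape are assumptions for the example, not Twilio data.

```python
from collections import Counter

# Assumed log shape: one record per request with service, customer, and HTTP status.
requests = [
    {"service": "video-api", "customer": "acme",   "status": 200},
    {"service": "video-api", "customer": "acme",   "status": 500},
    {"service": "rest-api",  "customer": "globex", "status": 404},
    {"service": "rest-api",  "customer": "globex", "status": 404},
]

# Which service is throwing the most 500-level errors? That is an availability problem to fix.
server_errors = Counter(r["service"] for r in requests if 500 <= r["status"] < 600)
print("5xx by service:", server_errors.most_common())

# Which customers see a concentration of 400-level errors? That is a documentation/UI
# conversation with the customer, not an outage.
client_errors = Counter(r["customer"] for r in requests if 400 <= r["status"] < 500)
print("4xx by customer:", client_errors.most_common())
```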
Your site is reliable if it is available + functional + resilient
Available: Is the service available? (E.g. Can the REST API actually respond?)
Functional: Are we responding correctly, accurately, and in a timely manner with the expected response codes? (E.g. Is the service returning 500s when I'm expecting 200s?)
Resilient: Can we resolve incidents quickly or prevent them in the first place? (E.g. Is time taken to resolve incidents decreasing over time?)
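A tiny synthetic probe illustrates how the first two dimensions differ in practice. The URL and expected status below are placeholders, not real endpoints.

```python
import urllib.error
import urllib.request

def probe(url: str, expected_status: int = 200) -> dict:
    """Synthetic check that separates 'available' from 'functional'."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return {"available": True, "functional": resp.status == expected_status}
    except urllib.error.HTTPError:
        # The service answered, so it is available, but not with the expected code
        # (e.g. a 500 where a 200 was expected): functionality takes the hit.
        return {"available": True, "functional": False}
    except (urllib.error.URLError, TimeoutError):
        # No answer at all: availability takes the hit.
        return {"available": False, "functional": False}

print(probe("https://api.example.com/health"))  # placeholder URL
```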
Ultimately, SRE is a specialty that needs to be embedded in the minds of every developer.