Eventbrite Mitigates Risk by Improving MTTA by 10X
The Challenge
Eventbrite brings people together through live experiences, allowing them to discover events that match their passions or create their own with online ticketing tools.
Before Blameless, the Site Reliability Engineering team struggled with several challenges such as:
- Manual steps to create communication channels, alert key team members, assign roles, create tasks, and keep tabs on status
- Lack of visibility into incidents and status updates
- Internally built tooling that lacked refinement, regular maintenance, and a customer success team, because building it detracted from core competencies
- Lack of an integrated solution to tie together Postmortems, SLOs, Error Budgets, and Reliability Insights for a complete reliability process
John Shuping, Director of Site Reliability Engineering, and his team sought a solution that went beyond incident management to also cover SLOs and error budgets. They wanted to replace internally built tooling, which took focus away from core competencies, with a modern, repeatable process for orchestrating reliability efforts. Finally, they wanted to eliminate the intensive, tedious manual effort involved in incident management and in maintaining SLOs and error budgets.
Before Blameless, there was significant toil tied to incidents as well as maintaining SLOs and Error Budgets.
The Solution
Blameless' integrated chatbot, SLOs and Error Budgets, and Reliability Insights features helped the Eventbrite team achieve the following benefits.
Benefits of Blameless
- Engage the right people and teams to stop incidents fast, ensuring customer satisfaction
- Automatically bring relevant information and context to Blameless Postmortems to learn without pointing fingers, ensuring continuous improvements
- Create SLOs and gain insights into Error Budget burndown, providing context to make informed decisions between releasing new features and meeting reliability requirements
- Query event data across the entire DevOps stack and create custom dashboards to quickly find signal in the noise
- Minimize customer impact and resolve incidents faster by allowing Incident Commanders to orchestrate parallel streams of investigations for complex incidents with Swimlanes for Incident Resolution
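To make the error budget burndown mentioned above concrete, here is a minimal sketch of the underlying arithmetic. It is not Blameless's implementation; the function names, the request-based SLO, and the sample numbers are assumptions chosen only for illustration.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates over the window.

    slo_target is a fraction, e.g. 0.999 for a 99.9% availability SLO
    (hypothetical target, not a figure from Eventbrite).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(slo_target, total_requests)
    return 1.0 - failed_requests / budget


if __name__ == "__main__":
    # Illustrative numbers: 1,000,000 requests in the window, 250 failures.
    remaining = budget_remaining(0.999, 1_000_000, 250)
    print(f"Error budget remaining: {remaining:.0%}")
```

When the remaining fraction is high, a team has room to ship new features; as it approaches zero, the same number argues for prioritizing reliability work.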
Finding a platform that finally went beyond incident management into SLOs and error budgets drove the decision to choose Blameless.
Reliability Toolchain
Here are the tools that Eventbrite relies on to maximize its reliability efforts.
- Blameless
- Testing frameworks
- Monitoring with Datadog
- Server redundancy
- Orchestrators such as Kubernetes
The Results
With Blameless, Eventbrite saw the following positive business results, helping cross-functional teams improve alignment and effectiveness to deliver great software experiences.
- Decreased MTTA and MTTR by as much as 10X (roughly a 90% reduction)
- Quantified the frequency, duration, and severity of incidents
- Codified internal processes, turning focus on building great customer products vs. internal reliability tooling
- Provided executive-level reporting that highlights MTTA and MTTR
- Drove organization-wide adoption, powering communication within and between engineering, customer service, and IT teams
Before Blameless, it would take 5-10 minutes to get the right people on an incident. Now it's as fast as 1 minute.
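The improvement quoted above can be checked with simple arithmetic. The sketch below uses hypothetical helper names and only the 5-10 minute and 1 minute figures from the quote; it shows that going from 10 minutes to 1 minute is a 10X speedup, which is a 90% reduction in time to assemble responders, and going from 5 minutes to 1 minute is a 5X speedup, or an 80% reduction.

```python
def speedup(before_minutes: float, after_minutes: float) -> float:
    """How many times faster the new time is compared with the old one."""
    return before_minutes / after_minutes


def percent_reduction(before_minutes: float, after_minutes: float) -> float:
    """Percentage by which the time dropped relative to the old value."""
    return 100 * (before_minutes - after_minutes) / before_minutes


if __name__ == "__main__":
    # Figures taken from the quote: 5-10 minutes before, 1 minute after.
    for before in (5, 10):
        print(before, speedup(before, 1), percent_reduction(before, 1))
```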