Resources
Browse through videos, guides, and other educational resources that cover incident management, reliability, team culture, and more.
Blog
Ebook
8.6.2020
The Importance of Reliability Engineering
What makes reliability engineering so important? In this blog, we’ll look at three big benefits of investing in reliability and explain how you can get started on your journey to reliability excellence.
7.30.2020
How to Improve On-Call with Better Practices and Tools
Establishing equitable on-call rotations, putting the right guardrails and automation in place, and practicing incident response regularly are key to minimizing the stress of on-call. In this blog, we’ll share key tools and practices to ensure your on-call engineers are set up for success.
7.29.2020
Enabling the Stripe and Lyft Platforms Through Modern Safety Science
Jacob Scott is an experienced engineer and enthusiastic participant in the resilience engineering community, having spent time caring for the technology systems powering high-growth startups as well as unicorns like Lyft and Stripe. See our interview with him here.
7.23.2020
How to Choose Monitoring Tools for DevOps and SRE
Deciding what to monitor, and how, shapes everything that follows. We’ll walk you through the basics in this blog post and suggest a few popular monitoring tools for your consideration.
7.22.2020
Leaders, Here's How to Encourage Full Service Ownership
Service ownership is becoming common practice, and its benefits are well known. Leadership will need to encourage and empower teams to adopt the “you build it, you run it” mentality. Here are some ways to get teams on board.
7.17.2020
The Essential List of Top SRE Resources
Are you looking to get up to speed on SRE fundamentals with the best SRE books and best DevOps books? Or are you hoping to expand your SRE knowledge into new domains? Either way, we’ve got you covered in our list of essential SRE resources!
7.16.2020
5 Tips for Getting Alert Fatigue Under Control
It’s important to minimize alert or pager fatigue as much as possible for the health and well-being of your team members. After all, the health of your systems depends on the health of your people. Here are 5 tips on how to cut down on alert fatigue and improve your signal-to-noise ratio.
7.15.2020
Leadership and Innovation with Instacart's VP of Infrastructure
Blameless CEO Ashar Rizqi recently had the pleasure of interviewing Dustin Pearce in a virtual executive fireside chat and AMA. Below is the transcript of their conversation.
7.8.2020
How to Classify Incidents
Learn the benefits of classifying incidents, how classification differs from incident triage, and how to set up your own classification system.
7.1.2020
SLO Adoption at Twitter
The concepts of service level objectives (SLOs) and error budgets have been key to this transformation, as SLOs shape an organization’s ability to make data-oriented decisions around reliability. (Read here for a definition of SLOs and how they transformed Evernote.) Today, the Twitter team has invested in centralized tooling to measure, track, and visualize SLOs and their corresponding error budgets.
Incident Impact Calculator
Find out how much you could save
Incidents can do real damage to companies that aren't sufficiently prepared for them. Use our calculator to estimate the full cost of incidents for your team.
use the calculator