Blog

Blog

Ebook

9.2.2020

Determining Error Budgets and Policies that Work for Your Team

In this blog, we’ll look at the basics of error budgeting, how to set corresponding policies, and how to operationalize SLOs for the long term.

Blog

Ebook

9.1.2020

How to Build Your SRE Team

In this blog post, we’ll look at some of the many roles an SRE can play, and how to find people with those skill sets.

Blog

Ebook

8.20.2020

What is a Kubernetes Operator and Why it Matters for SRE

In this blog post, we’ll explain the Kubernetes Operator—the Kubernetes function at the heart of customized automation—and discuss how it can evolve your SRE solution.

Blog

Ebook

8.19.2020

Here are the Metrics you Need to Understand Operational Health

In this blog post, we’ll walk you through holistic measures and best practices that you can employ starting today. These will include challenges and pain points in gaining insight as well as key metrics and how they evolve as organizations mature.

Blog

Ebook

8.13.2020

Choosing the Right SRE Tools

Implementing SRE practices and culture can be challenging. In this blog, we’ll talk about what to look for in SRE tools and how they’ll help you on your journey to reliability excellence.

Blog

Ebook

8.6.2020

The Importance of Reliability Engineering

What makes reliability engineering so important? In this blog, we’ll look at three big benefits of investing in reliability and explain how you can get started on your journey to reliability excellence.

Blog

Ebook

7.30.2020

How to Improve On-Call with Better Practices and Tools

Establishing equitable on-call rotations, putting the right guardrails and automation in place, and practicing incident response regularly are key to minimizing the stress of on-call. In this blog, we’ll share key tools and practices to ensure your on-call engineers are set up for success.

Blog

Ebook

7.29.2020

Enabling the Stripe and Lyft Platforms Through Modern Safety Science

Jacob Scott is an experienced engineer and enthusiastic participant in the resilience engineering community, having spent time caring for the technology systems powering high-growth startups as well as unicorns like Lyft and Stripe. See our interview with him here.

Blog

Ebook

7.23.2020

How to Choose Monitoring Tools for DevOps and SRE

Deciding what and how to monitor is an important decision. We’ll walk you through the basics in this blog post. We’ll also suggest a few popular monitoring tools for your consideration.

Blog

Ebook

7.22.2020

Leaders, Here's How to Encourage Full Service Ownership

Service ownership is becoming common practice and its benefits are well-known. Leadership will need to encourage and empower teams to adopt the “you build it, you run it” mentality. Here are some ways to get teams on board.

Blog

6.29.2022

Development Velocity (And How To Balance Reliability)

Development velocity is a measurement of how much work a software team can complete, based on similar work completed in previous iterations.

Blog

6.9.2022

Incident vs. Problem [Understanding the Differences]

Curious about incidents vs. problems? We explain the differences and how to handle each one.

Blog

6.7.2022

Incident Priority Matrix (Understanding Impact and Urgency)

An incident priority matrix helps set priority levels for your incidents based on four levels of impact. Here's how to determine an incident's urgency.

Blog

6.2.2022

Software Engineers vs Site Reliability Engineering Explained

We discuss software engineering and site reliability engineering, and explain their differences and their importance in the software development process.

Blog

5.31.2022

DevOps Team Structure | Roles & Responsibilities

We explain how a DevOps team is structured, the roles and responsibilities within the team, and the balance between an individual contributor and the needs of the team.

Blog

5.26.2022

What Is DevOps Automation & What Are The Benefits?

Looking into DevOps automation? We explain how automation can improve your process, how to prioritize which tasks to automate, best practices, and how to avoid common mistakes.

Blog

5.10.2022

DevOps Pipeline | Best Practices, Tips, & Techniques

Looking into DevOps pipelines? We explain what a DevOps pipeline is, how to build one, and the best practices for building one for your team.

Blog

5.4.2022

The Reverse Red Herring

Our VP of Engineering relates a story where a seemingly innocuous clue turns out to be key: a reverse red herring!

Blog

5.3.2022

CI/CD Pipeline | What It Is & How It Works

Wondering about CI/CD pipelines? We explain what the CI/CD pipeline is, the steps involved, and best practices along the way.

Blog

4.28.2022

Post-Incident Review | Why It’s Important & How It’s Done

A post-incident review is an evaluation of the incident response process. The goal is to have clear actions to improve the process and prevent further incidents.