Blog

6.29.2022

Development Velocity (And How To Balance Reliability)

Development velocity is a measurement of how much work a software team can complete, based on similar work completed in previous iterations.

6.9.2022

Incident vs. Problem [Understanding the Differences]

Curious about incidents vs. problems? We explain the differences and how to handle each one.

6.7.2022

Incident Priority Matrix (Understanding Impact and Urgency)

An incident priority matrix helps set priority levels for your incidents based on four levels of impact. Here's how to determine an incident's urgency.

6.2.2022

Software Engineers vs Site Reliability Engineering Explained

We discuss what software engineers and site reliability engineering are and explain their differences and their importance in the software development process.

5.31.2022

DevOps Team Structure | Roles & Responsibilities

We explain how a DevOps team is structured, the roles and responsibilities within the team, and the balance between an individual contributor and the needs of the team.

5.26.2022

What Is DevOps Automation & What Are The Benefits?

Looking into DevOps automation? We explain how automation can improve your process, how to prioritize which tasks to automate, best practices, and how to avoid common mistakes.

5.10.2022

DevOps Pipeline | Best Practices, Tips, & Techniques

Looking into DevOps pipelines? We explain what a DevOps pipeline is, how to build one, and the best practices for building one for your team.

5.4.2022

The Reverse Red Herring

Our VP of Engineering relates a story in which a seemingly innocuous clue turns out to be key: a reverse red herring!

5.3.2022

CI/CD Pipeline | What It Is & How It Works

Wondering about CI/CD pipelines? We explain what the CI/CD pipeline is, the steps involved, and best practices along the way.

4.28.2022

Post-Incident Review | Why It’s Important & How It’s Done

A post-incident review is an evaluation of the incident response process. The goal is to have clear actions to improve the process and prevent further incidents.

9.2.2020

Determining Error Budgets and Policies that Work for Your Team

In this blog, we’ll look at the basics of error budgeting, how to set corresponding policies, and how to operationalize SLOs for the long term.

9.1.2020

How to Build Your SRE Team

In this blog post, we’ll look at some of the many roles an SRE can play, and how to find people with those skill sets.

8.20.2020

What is a Kubernetes Operator and Why it Matters for SRE

In this blog post, we’ll explain the Kubernetes Operator—the Kubernetes function at the heart of customized automation—and discuss how it can evolve your SRE solution.

8.19.2020

Here are the Metrics you Need to Understand Operational Health

In this blog post, we’ll walk you through holistic measures and best practices that you can employ starting today. These will include challenges and pain points in gaining insight as well as key metrics and how they evolve as organizations mature.

8.13.2020

Choosing the Right SRE Tools

Implementing SRE practices and culture can be challenging. In this blog, we’ll talk about what to look for in an SRE tool, and how the right tools will help you on your journey to reliability excellence.

8.6.2020

The Importance of Reliability Engineering

What makes reliability engineering so important? In this blog, we’ll look at three big benefits of investing in reliability and explain how you can get started on your journey to reliability excellence.

7.30.2020

How to Improve On-Call with Better Practices and Tools

Establishing equitable on-call rotations, putting the right guardrails and automation in place, and practicing incident response regularly are key to minimizing the stress of on-call. In this blog, we’ll share key tools and practices to ensure your on-call engineers are set up for success.

7.29.2020

Enabling the Stripe and Lyft Platforms Through Modern Safety Science

Jacob Scott is an experienced engineer and enthusiastic participant in the resilience engineering community, having spent time caring for the technology systems powering high-growth startups as well as unicorns like Lyft and Stripe. See our interview with him here.

7.23.2020

How to Choose Monitoring Tools for DevOps and SRE

Choosing what to monitor, and how, is an important decision. We’ll walk you through the basics in this blog post and suggest a few popular monitoring tools for your consideration.

7.22.2020

Leaders, Here's How to Encourage Full Service Ownership

Service ownership is becoming common practice, and its benefits are well known. Leadership will need to encourage and empower teams to adopt the “you build it, you run it” mentality. Here are some ways to get teams on board.
