The Ultimate Incident Postmortem (Retrospective) Template

Incident retrospectives (or postmortems, post-incident reports, RCAs, etc.) are the most important part of an incident. This is where you take the gift of that experience and turn it into knowledge. This knowledge then feeds back into the product, improving reliability and ensuring that no incident is a wasted learning opportunity. Every incident is an unplanned investment and teams should strive to make the most of it.

Yet, many teams find themselves unable to complete incident retrospectives on a regular basis. One common reason for this is that day-to-day tasks such as fixing bugs, managing fire drills, and deploying new features take precedence, making it hard to invest in a process to streamline post-incident report completion. To make the most of each incident, teams need a solid post-incident template that can help minimize cognitive load during the analysis process. In this article we provide an example of what a comprehensive, narrative incident retrospective could look like.

What is a postmortem (retrospective) template?

A retrospective template is a tool for your team to report on an incident, detailing what happened, what went wrong, and what systemic improvements could be made.

Why are postmortems important?

Incident management is a stressful process in which everyone is working hard to resolve the issue as soon as possible. With the focus on restoring service, there is often no time during the incident to reflect or dig deep into how to prevent it from happening again.

The immediate need is to resolve the issue and ensure customers have the best experience with the solution. However, after the incident is dealt with, a retrospective is a crucial part of the process. A retrospective enables teams to look deeper at their effort and identify areas of improvement for next time. 

But that’s not all. Retrospectives have value in other ways too. For example, they are useful when big projects are completed, or as a regularly scheduled check on how the team is doing. The ultimate goal of a retrospective is to look back on the work completed and identify what went well, what didn’t, and where improvement is needed.

Using this information, teams work together to create and implement solutions. Doing so fosters collaboration among teams while also ensuring that everyone’s voice is heard during the process. The retrospective can serve as a hub for this follow-up work. Keep track of all the systemic improvements that result from a given incident in the retrospective, and check in with it to make sure people are on track.

What are the different types of retrospectives I can do?

There are many types of retrospectives that your team can do, so it’s essential to look for templates that can support that. 

Types of retrospectives could include:

  • Habit building: What are things your team needs to start doing and/or stop doing? What are some things to continue doing? Group ideas together as they come in, and talk about common themes and next steps. Building and putting the spotlight on helpful processes can make team members execute them more naturally.
  • Emotional: Another type of retrospective is to consider emotional health and how that can improve. What are team members upset about, and what are they happy about? How does that change before, during, and after an incident? If team members feel heard and supported, they’ll have psychological safety to continue working at their best.
  • Vision building: How do teams envision their work personally and in the larger context? What is stopping them from achieving that vision, and how can the team move forward as a whole and as individuals?
  • Process: This can work for incident management and other situations where teams come together to identify what works in their current processes and what needs to improve.
  • Incident management: After a major incident, teams can come together to discuss what went well, what didn’t, and what they learned to improve incident management moving forward. Below is an example of an incident retrospective template.


Sections of a good retrospective

1. Summary

This should contain 2-3 sentences that give the reader an overview of the incident’s contributing factors, resolution, classification, and customer impact level. The briefer the better, as this is what engineers will look at first when trying to solve a similar incident.

Template

Write a summary of the incident in a few sentences. Keep it as brief as possible, as this is what engineers will refer to when attempting to solve similar incidents. 

Time & Date Range:

Event Symptoms:

Event Trigger:

Contributing Factors:

Severity Level:

% of Users Affected:

Resolution:

Example

Google Compute Engine Incident #17007

This summary states “On Wednesday 5 April 2017, requests to the Google Cloud HTTP(S) Load Balancer experienced a 25% error rate for a duration of 22 minutes. We apologize for this incident. We understand that the Load Balancer needs to be very reliable for you to offer a high quality service to your customers. We have taken and will be taking various measures to prevent this type of incident from recurring.”

2. Customer impact

This section describes the level of customer impact. How many customers did the incident affect? Did customers lose partial or total functionality? Adding tags can be helpful here as well to help with future reporting, filtering and search.
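For teams that track retrospectives programmatically, tags make this kind of filtering straightforward. Below is a minimal Python sketch, assuming a hypothetical Incident record and illustrative tag values rather than any particular incident management product’s API:

```python
# Minimal sketch: filtering retrospectives by tag.
# The Incident structure and tag values here are illustrative only.
from dataclasses import dataclass, field


@dataclass
class Incident:
    title: str
    severity: str
    tags: set = field(default_factory=set)


incidents = [
    Incident("Load balancer 5xx spike", "SEV1", {"load-balancer", "customer-facing"}),
    Incident("Nightly batch job delay", "SEV3", {"batch", "internal"}),
    Incident("CDN configuration rollback", "SEV2", {"cdn", "customer-facing"}),
]


def find_by_tags(items, required_tags):
    """Return incidents whose tags include every required tag."""
    return [i for i in items if required_tags <= i.tags]


# Surface every past customer-facing incident during a review.
for match in find_by_tags(incidents, {"customer-facing"}):
    print(match.severity, match.title)
```

Applying the same tags consistently across reports is what makes this kind of lookup reliable over time.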

Template

The description of customer impact should be more detailed, as it provides valuable insights that inform your incident prioritization process. As mentioned, tags that allow quick searches for incidents using logical filters help streamline reviews, and they work best when the same tags are applied consistently in future reporting.

Overview:

Number of Customers Affected:

Capabilities Affected:

Level of Functional Loss:

Number of Support Cases Generated:

Detailed Description of Symptoms:

Example

Google Cloud Networking Incident #19009

In the section titled “DETAILED DESCRIPTION OF IMPACT,” the authors thoroughly break down which users and capabilities were affected.

3. Follow-up actions

This section is incredibly important for ensuring forward-looking accountability around addressing an incident’s contributing factors. Follow-up actions can include upgrading your monitoring and observability, fixing bugs, or even larger initiatives like refactoring part of the code base. The best follow-up actions also detail who is responsible for each item and when the rest of the team should expect an update.
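If your team tracks follow-up actions in a lightweight script rather than a dedicated tool, a small structure like the following Python sketch can capture ownership and update expectations. The field names, owners, and dates are hypothetical examples, not prescribed values:

```python
# Minimal sketch: follow-up actions with an owner and an expected-update date,
# so accountability persists after the retrospective meeting ends.
# All names and dates below are hypothetical.
from dataclasses import dataclass
from datetime import date


@dataclass
class FollowUpAction:
    description: str
    owner: str        # who is responsible for the item
    update_due: date  # when the rest of the team should expect an update
    done: bool = False


actions = [
    FollowUpAction("Add load testing to the deploy pipeline", "alice", date(2025, 7, 1)),
    FollowUpAction("Alert on elevated database write rates", "bob", date(2025, 6, 15)),
]

# Flag any item whose promised update date has passed.
for action in actions:
    if not action.done and action.update_due < date.today():
        print(f"Overdue update from {action.owner}: {action.description}")
```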

Template

Gathering feedback from the incident response team and participants provides the information needed to create an actionable checklist that addresses gaps and weak spots in the response.

Who Responded to the Incident?

Date & Time of Response:

Does the Responder Have a Background in the Affected System?

Who is Typically Responsible for this Type of Incident?

Delays or Obstacles to the Response: 

Follow-up Actions (be as detailed as possible): 

(Examples include upgrading your monitoring and observability, bug fixes, or even larger initiatives like refactoring part of the code base)

When the Rest of the Team Should Expect an Update:

Example

Sentry’s Security Incident (June 12 2016)

While detailed action items are rarely visible to the public, Sentry did publish a list of improvements the team planned to make after this outage covering both fixes and process changes.

4. Contributing factors

With the increase in system complexity, it’s harder than ever to pinpoint a single root cause for an incident. Each incident might involve multiple dependencies that impact the service, and each dependency might result in its own action items, so there is rarely a single root cause. To identify contributing factors, consider using “because, why” statements.
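The chain of “because, why” statements can also be captured as simple structured data, so that each answer feeds the next question. Here is a minimal Python sketch using the locked-database example from the template below; the wording is illustrative only:

```python
# Minimal sketch: a "because, why" chain recorded as ordered contributing
# factors rather than a single root cause. Wording is illustrative.
contributing_factors = [
    ("Why did the application go down?",
     "Because the database was locked."),
    ("Why was the database locked?",
     "Because there were too many writes to the database."),
    ("Why were there so many writes?",
     "Because a pushed change didn't take elevated writes into consideration."),
    ("Why weren't elevated writes considered?",
     "Because the development process doesn't include load testing changes."),
]

for question, answer in contributing_factors:
    print(f"{question}\n  -> {answer}")
```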

Template

Fill out the following fields and answer the questions to help determine the contributing factors and root cause of the incident.

Why did this incident happen? (Main cause or failure):

(Example - There was an application outage because of a locked database): 

Why did the main cause or failure occur? (Which details caused this to occur):

(Example - Because of too many writes to the database):

Why were these details allowed to occur? (Which mistake or process issue allowed this to happen)

(Example – There were so many writes because when we pushed a change to the service, we didn’t take elevated writes into consideration)

Why were these details not considered in the process? 

(Example - Because our development process doesn’t include load testing changes)

Why is this element not included in the process? 

(Example - Because it didn’t seem necessary due to our scalability levels, until now)

Example

Travis CI’s Container-based Linux Precise infrastructure emergency maintenance

In this retrospective, the authors cover contributing factors such as a change in how the Docker backend executes build scripts, missing alerting coverage for the errors, and more.

5. Narrative

This section is one of the most important, yet one of the most rarely filled out. The narrative section is where you write out an incident like you’re telling a story. 

Who are the characters and how did they feel and react during the incident? What were the plot points? How did the story end? This will be incomplete without everyone’s perspective. 

Make sure the entire team involved in the incident gets a chance to write their own part of this narrative, whether through async document collaboration, templated questions, or other means.

This provides firsthand explanations and different viewpoints on the when, what, where, and why of the incident. Writing it can be difficult for team members who feel their actions contributed to the issues. However, a good narrative is blameless documentation of an incident: it allows engineers to understand what happened and lets those involved determine whether the response plan was viable. They can look for shortfalls so the same mistakes are not repeated and, preferably, are prevented. Mistakes can be unintentional, such as coders having preconceived notions about how an application might behave. It might also be that your process delivers information in an inefficient way, for example being so detailed that the incident is hard to categorize, or omitting crucial pieces of information that leave the incident unclear.

Template

Discuss the incident as if telling a story. Include all of the main participants, and how they felt and reacted during the incident. Make sure every member of the team writes their own part of the narrative, for the most complete version possible. 

Main Characters:

Brief Description of the Incident:

What Went into the Response:

What Could Have Been Improved for this Response:

Opportunities for Improvement in the Future:

6. Timeline

The timeline is a crucial snapshot of the incident, detailing its most important moments. It can contain key communications, screenshots, and logs. Assembling it is often one of the most time-consuming parts of a post-incident report, which is why we recommend a tool for automation. The timeline can be aggregated automatically via tooling such as Jira Service Management; these systems automate the timeline process with robust tracking on a customizable platform. Your timeline needs to capture the entire incident resolution, and automatically recording the timeline of events makes for easy tracking and review. That said, you will likely need to integrate timeline information from several sources, such as chat transcripts about the incident, severity-level changes in your incident management software, alerts, and more.
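If you assemble the timeline yourself rather than through an incident management platform, merging events from each source by timestamp is the core of the job. The following Python sketch is illustrative; the source names, timestamps, and event text are made up for the example:

```python
# Minimal sketch: merging timeline events from several sources (alerts,
# chat transcripts, severity changes) into one ordered incident timeline.
# Timestamps, sources, and event text below are illustrative only.
from datetime import datetime, timezone

alert_events = [
    (datetime(2025, 4, 5, 13, 58, tzinfo=timezone.utc), "alert",
     "Error rate on the HTTP load balancer exceeds 5%"),
]
chat_events = [
    (datetime(2025, 4, 5, 14, 3, tzinfo=timezone.utc), "chat",
     "On-call engineer pages the database team"),
]
severity_events = [
    (datetime(2025, 4, 5, 14, 10, tzinfo=timezone.utc), "incident-tool",
     "Severity raised to SEV1"),
]

# Sort all events by timestamp to produce the combined timeline.
timeline = sorted(alert_events + chat_events + severity_events)

for when, source, activity in timeline:
    print(f"{when:%H:%M} UTC - [{source}] {activity}")
```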

Template

Detail the incident timeline, including significant events such as:

  • Lead up to the incident
  • First known impact
  • Escalations
  • Critical decisions and changes as it is resolved
  • Post-incident events

Use the following template to document the timeline:

XX:XX UTC - INCIDENT ACTIVITY; ACTION TAKEN

XX:XX UTC - INCIDENT ACTIVITY; ACTION TAKEN

XX:XX UTC - INCIDENT ACTIVITY; ACTION TAKEN

7. Technical analysis

Technical analyses are key to any successful retrospective. After all, this section serves as a record and a possible resolution path for future incidents. Any information relevant to the incident, from architecture diagrams to related incidents to recurring bugs, should be detailed here.

Here are some questions to answer with your team:

  • Have you seen an incident like this before?
  • Has this bug occurred previously, and if so, how often?
  • What dependencies came into play here?

Template

Detail any information that’s relevant to the incident, using the prompts listed below.

Have you seen an incident like this before?

Has this particular bug occurred previously, and if so, how often?

What dependencies came into play with this specific incident?

Additional information that’s relevant:

8. Incident management process analysis

At the heart of every incident is a team trying to right the ship. But how does that process go? Is your team panicked, hanging by a thread and relying on heroics? Or, does your team have a codified process that keeps everyone cool? This is the time to reflect on how the team worked together. 

Here are some questions to answer with your team:

  • What went well? 
  • What went poorly?
  • Where did you get lucky and how can you improve moving forward?
  • Did your monitoring and alerting capture this issue?

9. Messaging

Communication during an incident is a necessity. Stakeholders such as managers, lines of business (e.g. sales, support, PR), C-levels, and customers will all want updates. But internal and external communication might look very different. Even internal communication might differ between what you would send a VP of Engineering and what you would send your sales team.

Here, document the messaging that was disseminated to different categories of stakeholders. This way, you can build templates for the future to continue streamlining communication.

Remember, there are several people impacted or involved in the incident. The way you communicate can make or break your reputation and relationships both with customers and internal stakeholders. Effective messaging keeps everyone involved in the loop and ensures appropriate language and messaging methods are used based on context.

You can compare the effectiveness of tools such as:

Dedicated status pages: Customers and colleagues can track the status of the incident in real time and even subscribe to receive updates when you use a hosted solution such as Statuspage. Consider questions such as: Did your status page help relieve the support burden? Did it reduce interruptions to those trying to resolve the incident, or did having to update the status page cause issues for them? Who is the best person to update the status page? How is the URL for the status page shared?

Embedded status: Embedded widgets allow you to provide status information on any website so customers know when an incident is in progress. This can help avoid support calls and provide a click-through for further information. How many people clicked the widget during the incident? Were you able to keep the information up to date? Did you reduce support calls?

Email: This is more effective as a subscription-based update to report incidents and keep people in the loop both internally and externally. Consider who responded best or clicked on any informational links provided. Can this inform the types of people you should be including in your messaging? Also, is a subscriber list the best approach, given that it alerts people to issues that may not impact them?

Chat tools: You can use tools that sync your customer chat tools with internal tools like Slack, as well as with your ticket creation. This allows you to chat in real time and keep conversations in context based on the user. Did this generate more or fewer chat conversations? Was it easy or difficult to keep different conversational tones and language in context?

Social media: Social channels like Twitter allow anyone to follow progress without those involved in the fix being the sole means of communication. Was this too public, opening a can of worms? Who made the posts, and what were the ensuing comments, replies, and shares? Was the response positive or negative?

SMS: Text messages are immediate, making them ideal for internal stakeholders as well as for mass messaging customers who have raised incident-related support calls. It’s best to maintain an SMS list specifically for incidents, as people subscribed to service or marketing SMS won’t be interested, which can lead to unsubscribes.

Example

Google Compute Engine Incident #15056

In this incident, Google ensures that all major updates are regularly communicated. The team also lets users know when they can next expect to be updated. “We are still working on restoring the service of Google Compute Engine Persistent Disks in europe-west1-b. We will provide another status update by 19:00 US/Pacific with current details.”

Other Best Practices to Keep in Mind

  • Complete the report within 48 hours
  • Ensure reports are housed such that they can be dynamically surfaced during incidents
  • Add graphics and charts to help readers visualize the incident
  • Be blameless. Remember that everyone is doing their best and failure is an opportunity to learn

Parting Thoughts

Failure is the most powerful learning tool, and deserves time and attention. Each retrospective you complete pushes you closer to optimal reliability. While they do take time and effort, the result is an artifact that is useful long after the incident is resolved. 

By using this template, your team is on the way to taking full advantage of every incident.
