NASA mision control room
By Sieuwert Otterloo on Unsplash

Software incidents suck big time. Suddenly, you have more work to do; the sprint is derailed; stakeholders raise their eyebrows and start asking questions. Nobody wants that! Unfortunately, incidents do happen. Complex software systems are bound to fail from time to time.

An incident can also be an opportunity. Your job is to make sure to seize it. Incident review docs followed by a discussion help a lot. Above all else, the main goal of incident reviews is to learn, although great incident reviews can achieve much more.

  • Identify ways to make the system more robust.
  • Identify and close process gaps.
  • Learn about the system they maintain and develop.
  • Protect the trust with the rest of the company and customers by being transparent.
  • Understand user impact better and develop user empathy.
  • Help future team members understand the systems' weak points and how incidents are handled.

I recommend the following approach, which makes the process more inclusive.

  1. The incident leader/responder writes the review doc.
  2. The team reads and comments async.
  3. Everyone gets together to reflect and align on actions.

Here's an incident review doc template that has worked well in my teams.

✍️ Authors

List the people who put the incident review document together. It will help future readers know who to ask for more information.

🗞️ Summary

Write out the TLDR version of what happened.

🔥 Impact

When talking about impact, you should consider several dimensions.

  • 🚶‍♂️ User: were users unable to carry out a particular action? Was their data lost? Was their data compromised? How did that affect their lives or jobs?
  • 👩‍💼 Business: a helpful way to think about this one is to try to answer the question, how much money did the company lose with this incident?
  • Team: did someone wake up in the middle of the night to respond? Did people work late? Was the sprint interrupted because many team members had to work on remediation?

🚨 Detection

How did the team become aware of the issue? At best, an alert told you something was wrong even before users became aware. At worst, a user had to submit a ticket to Product Support.

🔍 Triage

What were the actions taken by the responder to assess the severity of the issue and escalate?

🐛 Root causes

Share the team's steps to investigate the issue and identify the root cause. Include information gathered from observability tools like screenshots or links to dashboards. Do not spare details here. Everyone can learn a lot about how the system works from this section.

I am a firm believer in blameless postmortems, although I am conscious of how easy it is to avoid attribution. Focus on the circumstances that triggered the incident instead of the individuals. But be careful not to skip the root cause entirely.

🧯 Resolution

What actions did the team take to bring the system back to normal? How did you make sure everything was, indeed, okay?

Again, don't forget the charts!

⏱️ Timeline

Share the complete list of events. Compiling the actual timeline is much easier when the incident leader takes notes during the incident.

  • Timestamp.
  • Action or discovery.

👩‍🎓 Lessons learned

Knowing what happened, it's time to reflect. I like to focus on three categories.

  • 👍 What went well
  • 👎 What went wrong
  • 🤞 Where we got lucky

Think about it in terms of software and processes and avoid attributing individual blame.

⏭️ Actions and issues

List of remediation and preemptive actions. Ideally, all of them have a corresponding ticket. Focus on process improvements and technical work to address what went wrong and avoid depending on luck. As an engineering leader, it's your job to work with Product Management and prioritize those tickets.

👐 Share it

Always share the document with the entire organization. In some cases, consider sharing an adapted version with customers. It may feel scary and shameful. However, being open about mistakes and your actions to improve is a great way to protect the trust you worked so hard to build.

🙌 Thanks for reading. Hopefully, this template can help you make the most of the situation when your system blows up. I would love to hear from you if you're approaching things differently.