Incident response can be stressful. The system is on fire, the team has to fix it ASAP, and you're working against the clock. Your users are getting frustrated, the company loses money, stakeholders constantly check for updates, and trust is at risk. What a nightmare!
The tension can get so high that the incident responder forgets to communicate effectively, or worries so much about keeping folks up to date that they take the wrong action and compromise the system further.
Your team may greatly benefit from an incident playbook at times like this. When a new incident kicks off, the responder creates a new document from a template. This document serves as a guide on how to respond to the incident. It removes a lot of stress from the situation and helps the team resolve the problem faster.
At Plot, we have a Notion template for our incidents that looks something like this.
✅ Checklist
The responder ticks off actions as they go through them.
- Assign yourself and acknowledge the alert or the issue.
- Start populating the log below.
Start communications:
- Post a message in the #downtime Slack channel about the active outage.
- Start a call to better coordinate and send the link to the #engineering Slack channel.
- Update status page.
Work on the issue:
- If you need help from a team member with domain expertise, page them.
- First, stabilize the system.
- Then, resolve the issue.
When the issue is resolved:
- Update status page.
- Update the #downtime channel.
- Write an incident review doc and share it in #downtime.
- Schedule an incident review call with the engineering team.
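If your team keeps its playbook outside Notion, the checklist above can also be generated from a small script whenever an incident kicks off. The following is only a minimal sketch: the `CHECKLIST` structure, the `render_checklist` function, and the `- [ ]` task-list syntax are assumptions made for illustration, not how our Notion template works.

```python
# Hypothetical data structure: (section heading, items) pairs copied from the checklist above.
CHECKLIST = [
    (None, [
        "Assign yourself and acknowledge the alert or the issue.",
        "Start populating the log below.",
    ]),
    ("Start communications:", [
        "Post a message in the #downtime Slack channel about the active outage.",
        "Start a call to better coordinate and send the link to the #engineering Slack channel.",
        "Update status page.",
    ]),
    ("Work on the issue:", [
        "If you need help from a team member with domain expertise, page them.",
        "First, stabilize the system.",
        "Then, resolve the issue.",
    ]),
    ("When the issue is resolved:", [
        "Update status page.",
        "Update the #downtime channel.",
        "Write an incident review doc and share it in #downtime.",
        "Schedule an incident review call with the engineering team.",
    ]),
]


def render_checklist(title, sections=CHECKLIST):
    """Return a fresh, unticked checklist as plain text for a new incident."""
    lines = [title, ""]
    for heading, items in sections:
        if heading:
            lines.append(heading)
        # Every item starts unticked; the responder ticks them off as they go.
        lines.extend(f"- [ ] {item}" for item in items)
    return "\n".join(lines)


# Example title is made up for illustration.
doc = render_checklist("Incident: checkout errors")
print(doc)
```

Keeping the items in one place means every new incident document starts from the same steps, whichever tool ends up hosting it.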
✍️ Log
Populating the following log during the incident helps the responder put together an incident review document.
Finding or action taken | Details | Timestamp |
---|---|---|
Alert triggered | <link> | March 11, 2022 10:17 |
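Typing timestamps by hand is easy to forget in the middle of an incident. As a rough sketch of how the log could be kept with a script instead, the snippet below records entries and renders them in the same three columns as the table above. The `IncidentLog` class and its methods are hypothetical names chosen for this example, and the month name assumes the default English locale.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class IncidentLog:
    """In-memory log; each entry is (finding or action, details, timestamp)."""
    entries: list = field(default_factory=list)

    def add(self, finding, details="", when=None):
        # Default to the current time so the responder only has to type the finding.
        self.entries.append((finding, details, when or datetime.now()))

    def to_table(self):
        # Same column order as the log table in the template.
        lines = [
            "Finding or action taken | Details | Timestamp |",
            "---|---|---|",
        ]
        for finding, details, when in self.entries:
            stamp = f"{when:%B} {when.day}, {when.year} {when:%H:%M}"
            lines.append(f"{finding} | {details} | {stamp} |")
        return "\n".join(lines)


log = IncidentLog()
# Explicit timestamp here only to reproduce the sample row from the template.
log.add("Alert triggered", "<link>", datetime(2022, 3, 11, 10, 17))
print(log.to_table())
```

During a real incident you would call `log.add(...)` without the `when` argument and paste the rendered table into the incident document once things calm down.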
🚀 Additional tips for the incident responder
- 🤝 You are not alone in this. Ask for help if you need it and delegate.
- 🗣 High-severity incidents: consider making another team member the owner of communications so that you can focus on the other steps.
- 🔐 Security incidents: choose appropriate wording for public communications to prevent the vulnerability from being exploited further.
- ⚠️ If you stop being available during an incident, make sure a continuity plan is in place.
Of course, you can adapt this template to your needs. The main idea is that having a checklist in the first place reduces stress when responding to a problem.
🙌 Thanks for reading! I would love to hear from you if you have any thoughts around checklists for incident response.