No matter how stable your software product is, occasionally things go wrong in production, and Jobber is committed to doing a post-mortem investigation to follow up and learn from each incident.
At a high level, an incident post-mortem answers these questions:
As we’ve grown and moved to a remote working environment, we’ve changed our process to work better for remote teams and super busy schedules. This is a summary of what we’re doing to make sure that incidents remain rare and our customers can keep getting their work done!
Our process is broken down into four steps: resolve the incident, investigate it, debrief about it, then share the results.
Collect data during the incident. We collect as much data as we can in a Slack channel dedicated to incidents, keeping it organized with threads. This includes server graphs, snippets from logs, and screenshots showing what was going on at each point in the incident. It doesn’t all end up being useful, but it’s nice to have everything collected when you start going through the investigation.
Start the investigation right away. We get one of the involved people to take on the role of lead investigator, which really means they’re in charge of making sure the investigation gets done, the post-mortem document gets filled in, and the debrief gets held. Starting it right away makes sure nothing gets lost.
Review the results within a week. While things are still fresh, hold a debrief to review the post-mortem document, discuss the action items, and make any edits needed. This is a 30-60 minute Zoom session with the team involved in the incident as well as reps from other departments (mainly the customer support/escalation team).
Share the results as soon as the debrief is done, so everyone gets a chance to learn from it! We post it to a Slack channel that the whole company has access to, for transparency.
With a larger company, people working in all sorts of time zones, and everyone being remote, scheduling and coordinating got a lot more complicated. The process is still mostly the same, but with some tweaks to keep it effective.
We’ve shortened the timeline expectations: getting the incident doc started faster and the debrief done sooner helps us capture all the data while it’s fresh and lets everyone involved get back to their sprint work sooner.
Scheduling the debrief sooner means that it’s harder to find a spot in everyone’s calendars. Rather than pushing the meeting further and further out, we do more of the work asynchronously. Make sure the document can stand on its own, and use Slack to ask people for their contributions.
We also record the debrief (easy with Zoom) so that anyone who couldn’t attend can watch it later and nobody has to worry about missing out.
We’re using a wiki template for consistency, and over time we’ve repeatedly simplified the template so there are fewer sections to worry about.
Setting it up with a button to auto-create the new page from the template works well.
The template has sections for:
Our customer success team always has great input and is able to help fill in gaps in the timeline. We reach out to them early so there’s time for their input to be added into the post-mortem doc before the debrief. Waiting for the debrief is too late!
Why track action item progress in an incident doc when we already have a standard tool for tracking work? As soon as we can, we get all action items from post-mortems in as Jira tickets so they can be assigned to backlogs and don’t get lost.
We also have some reports set up to view the list of outstanding post-mortem actions, driven by a post-mortem label on the items.
Realistically, not all action items are actually actionable: some are more aspirational, or are something we just need everyone to keep in mind. To keep the Jira action items clearer, we’ve included this section as a spot to put the things we think are important but couldn’t turn into assignable, trackable work.
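As a rough illustration of the label-driven approach, here is a minimal sketch of what filing an action item and querying the outstanding ones could look like. It only builds the request body for Jira's "create issue" REST endpoint and a JQL string for a report; it does not call Jira. The label value `post-mortem`, the project key, the `Task` issue type, the wiki URL, and both function names are assumptions for the example, not a description of our actual setup.

```python
import json

# Assumed label name; the report below filters on it.
POSTMORTEM_LABEL = "post-mortem"


def build_action_item_issue(project_key, summary, incident_doc_url,
                            label=POSTMORTEM_LABEL):
    """Build a JSON body suitable for POSTing to Jira's create-issue endpoint.

    The plain-string description matches the v2 REST API format; the v3 API
    expects a structured document for that field instead.
    """
    return {
        "fields": {
            "project": {"key": project_key},    # hypothetical project key
            "summary": summary,
            "description": f"Action item from post-mortem: {incident_doc_url}",
            "issuetype": {"name": "Task"},      # assumed issue type
            "labels": [label],                  # what the report keys on
        }
    }


def outstanding_actions_jql(label=POSTMORTEM_LABEL):
    """Return a JQL query listing post-mortem actions that are not done yet."""
    return f'labels = "{label}" AND statusCategory != Done ORDER BY created ASC'


if __name__ == "__main__":
    issue = build_action_item_issue(
        "OPS",                                   # hypothetical project
        "Add alert for queue depth",
        "https://wiki.example.com/postmortems/123",  # placeholder URL
    )
    print(json.dumps(issue, indent=2))
    print(outstanding_actions_jql())
```

Keeping the label in one constant means the ticket-creation side and the report side cannot drift apart, which is the whole point of driving the report from a label.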
Our approach is that it’s better to have a smaller set of action items that we actually do than a giant list of things we’d like to do given infinite time.
This one isn’t actually new, but it’s well worth repeating! We’re interested in what happened and what we’re going to do to fix it going forward, not in pointing fingers.
"Removing blame from a postmortem gives people the confidence to escalate issues without fear."
– the SRE book
We're hiring for remote positions across Canada at all software engineering levels!
Our awesome Jobber technology teams span Payments, Infrastructure, AI/ML, Business Workflows & Communications. We work on cutting-edge, modern tech stacks using React, React Native, Ruby on Rails, & GraphQL.
If you want to be a part of a collaborative work culture, help small home service businesses scale and create a positive impact on our communities, then visit our careers site to learn more!