Best Tools for Remote Team Post-Mortems
A post-mortem that happens three weeks after an incident, written by one person from memory, with no action items tracked to completion, is theater. Remote teams need a structured, blame-free process that starts within 48 hours, collects input asynchronously, and produces tracked action items that close before the next incident.
The Anatomy of a Good Remote Post-Mortem
- Timeline: Exact timestamps, who noticed what and when, what actions were taken
- Impact: Affected users/customers, duration, severity
- Root cause: Not “human error” — the systemic conditions that made the error possible
- Contributing factors: What made detection or resolution slower
- Action items: Specific, assigned, with due dates — not “improve monitoring”
The process has to work across time zones. Nobody should be blocked waiting for a live meeting to add their observations.
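The fields above map naturally onto a simple record. A minimal sketch of that structure (the class and field names are illustrative, not from any tool's API):

```python
from dataclasses import dataclass, field
from datetime import date, datetime


@dataclass
class ActionItem:
    description: str   # specific, e.g. "Add error rate alert at 2% threshold"
    owner: str         # a person, not a team alias
    due: date
    done: bool = False


@dataclass
class PostMortem:
    title: str
    severity: str                      # "P1" / "P2" / "P3"
    started: datetime                  # UTC
    resolved: datetime                 # UTC
    timeline: list[tuple[datetime, str, str]] = field(default_factory=list)  # (time, event, actor)
    root_cause: str = ""               # systemic, never "human error"
    contributing_factors: list[str] = field(default_factory=list)
    went_well: list[str] = field(default_factory=list)
    action_items: list[ActionItem] = field(default_factory=list)

    def duration_minutes(self) -> int:
        return int((self.resolved - self.started).total_seconds() // 60)

    def open_items(self) -> list[ActionItem]:
        return [a for a in self.action_items if not a.done]
```

Whatever tool you pick, the post-mortem should be reducible to something like this; if a field has nowhere to live, the tool is hiding part of the process.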
Tool 1: Notion (Best for Async Input)
Notion’s commenting system lets distributed team members add observations to specific sections of a post-mortem doc without waiting for a meeting. Use a database with templates for consistent structure.
Post-Mortem Database Template in Notion:
```markdown
# Incident: [Brief Description] — [Date]

**Severity**: P1 / P2 / P3
**Duration**: [start time] → [end time] (X minutes)
**Services affected**:
**Customers affected**: ~N users

---

## Timeline

| Time (UTC) | Event | Actor |
|------------|-------|-------|
| 14:32 | Alert fired: error rate > 5% | PagerDuty |
| 14:35 | @alice acknowledged | Alice |
| 14:41 | Identified bad deploy at 14:20 | Alice |
| 14:45 | Rollback initiated | Alice |
| 14:51 | Error rate returned to baseline | Auto |

---

## What Happened
[Narrative description — written collaboratively via comments]

## Root Cause
[The systemic reason this happened — not "someone made a mistake"]

## Contributing Factors
- [What slowed detection]
- [What slowed resolution]

## What Went Well
- [Things that worked correctly during the incident]

## Action Items

| Item | Owner | Due | Status |
|------|-------|-----|--------|
| Add error rate alert at 2% threshold | @ops | 2026-04-01 | Open |
| Write runbook for rollback procedure | @alice | 2026-04-05 | Open |
```
Notion API to create a post-mortem from an incident:
```python
#!/usr/bin/env python3
# create-postmortem.py
import os
from datetime import datetime, timezone

import requests

NOTION_TOKEN = os.environ["NOTION_TOKEN"]
PM_DATABASE_ID = os.environ["NOTION_PM_DATABASE_ID"]


def create_postmortem(title: str, severity: str, service: str) -> str:
    response = requests.post(
        "https://api.notion.com/v1/pages",
        headers={
            "Authorization": f"Bearer {NOTION_TOKEN}",
            "Notion-Version": "2022-06-28",
            "Content-Type": "application/json",
        },
        json={
            "parent": {"database_id": PM_DATABASE_ID},
            "properties": {
                "Name": {"title": [{"text": {"content": title}}]},
                "Severity": {"select": {"name": severity}},
                "Service": {"rich_text": [{"text": {"content": service}}]},
                "Date": {"date": {"start": datetime.now(timezone.utc).date().isoformat()}},
                "Status": {"select": {"name": "Draft"}},
            },
            "children": [
                {
                    "object": "block",
                    "type": "heading_2",
                    "heading_2": {"rich_text": [{"text": {"content": "Timeline"}}]},
                },
                {
                    "object": "block",
                    "type": "paragraph",
                    "paragraph": {"rich_text": [{"text": {"content": "Add timeline entries as table rows below."}}]},
                },
            ],
        },
        timeout=10,
    )
    response.raise_for_status()
    data = response.json()
    print(f"Post-mortem created: {data['url']}")
    return data["url"]


# Triggered from your incident management tool or PagerDuty webhook
if __name__ == "__main__":
    create_postmortem(
        title=f"Incident: Payment service 500s — {datetime.now(timezone.utc):%Y-%m-%d}",
        severity="P1",
        service="payments",
    )
```
Tool 2: Jeli (Purpose-Built)
Jeli (now part of PagerDuty) imports PagerDuty/Opsgenie timelines, Slack message history, and deployment logs automatically, so the distributed team adds annotations and context instead of building a timeline from scratch.
PagerDuty webhook to create Jeli investigation:
```shell
# Configure in PagerDuty → Integrations → Webhooks
# Endpoint: https://app.jeli.io/api/v1/incidents/pagerduty
# Event: incident.triggered (P1/P2 only)

# Manually create an investigation from CLI
curl -X POST https://app.jeli.io/api/v1/investigations \
  -H "Authorization: Bearer $JELI_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Payment service outage 2026-03-22",
    "severity": "sev1",
    "summary": "Payment service returned 503s for 12 minutes",
    "slack_channel_id": "C0XXXXXXXXX"
  }'
```
Jeli automatically imports the Slack conversation from the incident channel into the timeline, making async post-mortems much faster.
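If you are not using Jeli, you can approximate the Slack import yourself with Slack's Web API. A minimal sketch using `conversations.history` (the token and channel-ID environment variable names are assumptions; pagination and threaded replies are ignored):

```python
#!/usr/bin/env python3
# slack-timeline.py — dump an incident channel as post-mortem timeline rows
from datetime import datetime, timezone

import requests


def fetch_channel_messages(token: str, channel_id: str) -> list[dict]:
    """Fetch up to 200 recent messages via Slack's conversations.history."""
    resp = requests.get(
        "https://slack.com/api/conversations.history",
        headers={"Authorization": f"Bearer {token}"},
        params={"channel": channel_id, "limit": 200},
        timeout=10,
    )
    resp.raise_for_status()
    body = resp.json()
    if not body.get("ok"):
        raise RuntimeError(f"Slack API error: {body.get('error')}")
    return body["messages"]


def as_timeline_rows(messages: list[dict]) -> str:
    """Render messages oldest-first as a markdown timeline table."""
    rows = ["| Time (UTC) | Event | Actor |", "|------------|-------|-------|"]
    for msg in sorted(messages, key=lambda m: float(m["ts"])):
        when = datetime.fromtimestamp(float(msg["ts"]), tz=timezone.utc)
        rows.append(f"| {when:%H:%M} | {msg.get('text', '')} | {msg.get('user', '?')} |")
    return "\n".join(rows)
```

Run `as_timeline_rows(fetch_channel_messages(...))` after the incident and paste the output into the Timeline section; the bot token needs the `channels:history` scope.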
Tool 3: GitHub Issues (Free, Integrated)
For teams already living in GitHub, a structured GitHub Issue template is the lowest-friction option.
`.github/ISSUE_TEMPLATE/postmortem.md`:

```markdown
---
name: Post-Mortem
about: Document an incident for learning and improvement
title: "Post-Mortem: [Brief description] — [YYYY-MM-DD]"
labels: post-mortem, needs-review
assignees: ""
---

## Summary
**Severity**: <!-- P1/P2/P3 -->
**Duration**: <!-- HH:MM UTC → HH:MM UTC (X minutes) -->
**Impact**: <!-- N users affected, X% error rate -->

## Timeline
<!-- Use UTC timestamps -->
| Time | Event | Actor |
|------|-------|-------|
| | | |

## Root Cause
<!-- Systemic cause — not human error -->

## Contributing Factors
-

## What Went Well
-

## Action Items
<!-- Use task list format so items show in issue sidebar -->
- [ ] @owner: Description of action item by YYYY-MM-DD
- [ ] @owner: Description of action item by YYYY-MM-DD

## Lessons Learned
<!-- What would you tell another team experiencing the same incident? -->
```
A script, run weekly from GitHub Actions, that flags overdue action items and posts them to Slack:
```python
#!/usr/bin/env python3
# check-pm-actions.py — run weekly, flag overdue action items
import os
import re
from datetime import datetime, timezone

import requests

GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]
REPO = os.environ["GITHUB_REPO"]  # e.g. "acme/platform"
SLACK_HOOK = os.environ.get("SLACK_WEBHOOK_URL", "")

headers = {"Authorization": f"Bearer {GITHUB_TOKEN}"}
issues = requests.get(
    f"https://api.github.com/repos/{REPO}/issues",
    headers=headers,
    params={"labels": "post-mortem", "state": "open", "per_page": 50},
    timeout=10,
).json()

overdue = []
today = datetime.now(timezone.utc)
for issue in issues:
    if "pull_request" in issue:
        continue  # the issues endpoint also returns pull requests
    body = issue.get("body") or ""  # body is null for empty issues
    # Find unchecked task items with due dates: "- [ ] @owner: ... by YYYY-MM-DD"
    matches = re.findall(r"- \[ \] .*?by (\d{4}-\d{2}-\d{2})", body)
    for date_str in matches:
        due = datetime.fromisoformat(date_str).replace(tzinfo=timezone.utc)
        if due < today:
            overdue.append({
                "title": issue["title"],
                "url": issue["html_url"],
                "due": date_str,
            })

if overdue and SLACK_HOOK:
    lines = "\n".join(f"• <{i['url']}|{i['title']}> (due {i['due']})" for i in overdue)
    requests.post(
        SLACK_HOOK,
        json={"text": f":warning: *Overdue post-mortem action items*\n{lines}"},
        timeout=10,
    )
```
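The script can be scheduled with a workflow along these lines (the file path, script location, and `SLACK_WEBHOOK_URL` secret name are assumptions for your repo):

```yaml
# .github/workflows/pm-action-check.yml
name: Post-mortem action item check
on:
  schedule:
    - cron: "0 9 * * 1"   # Mondays, 09:00 UTC
  workflow_dispatch: {}    # allow manual runs

jobs:
  check:
    runs-on: ubuntu-latest
    permissions:
      issues: read
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install requests
      - run: python scripts/check-pm-actions.py
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          GITHUB_REPO: ${{ github.repository }}
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
```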
Running the Async Post-Mortem Process
- Hour 0: Incident resolved. Create the post-mortem document immediately with just the title and a timeline stub. Don't write conclusions yet.
- Hours 1–24: Everyone involved adds their observations asynchronously. Use comments for additions, not edits. Stick to facts, not blame.
- Hours 24–48: The incident lead synthesizes the timeline into a root cause analysis and drafts action items with owners (who must be consulted, not just assigned).
- Hour 48: The review comment period opens. The team has 48 hours to add corrections.
- Hour 96: The document is marked final. Action items are filed as tickets.
- Week 4: Action item owners report progress. Incomplete items get rescheduled, not silently dropped.
Related Reading
- Best Tools for Remote Team Changelog Review
- Best Tools for Remote Team Sprint Velocity
- How to Create Automated Rollback Systems
Built by theluckystrike — More at zovo.one