AI Postmortem Generation

When production incidents occur, writing postmortems becomes a critical but often time-consuming task. Teams must gather logs, identify root causes, document timelines, and extract lessons learned—all while managing the aftermath of an outage. AI postmortem generation tools are transforming this process, helping developers automate incident analysis and produce documentation in minutes rather than hours.

What Is AI Postmortem Generation?

AI postmortem generation refers to using artificial intelligence to analyze incident data—logs, metrics, chat transcripts, and code changes—and automatically produce structured postmortem documents. These tools ingest raw incident information, apply pattern recognition and causal analysis, and output formatted reports following best practices like the Etsy blameless postmortem format.

The core value proposition is straightforward: reduce the time engineers spend on documentation while improving consistency and completeness. A well-crafted AI postmortem generator captures relevant context, identifies probable root causes, and structures findings in a way that helps learning and prevention.

How AI Postmortem Generation Works

Most AI postmortem generation systems follow a multi-stage pipeline:

Data Collection: The system aggregates logs, metrics, traces, incident channel messages, and version control commits related to the incident timeframe.
Temporal Analysis: AI models correlate events across different data sources and order them into a single sequence leading up to the incident. Correlation in time narrows the list of candidate causes; it does not by itself establish causality.
Root Cause Inference: Using trained models or LLM reasoning, the system proposes potential root causes based on patterns like error spikes, configuration changes, or dependency failures.
Document Synthesis: The final stage generates a structured postmortem with sections for summary, impact, timeline, root cause, resolution, and action items.
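
The temporal analysis stage can be sketched as a merge of timestamped events from several sources, followed by a search for changes that landed shortly before the alert. This is a minimal illustration, not how any particular tool works: the Event type, the source labels, the 30-minute window, and the sample events are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    timestamp: datetime
    source: str   # illustrative labels: "alert", "deploy", "chat"
    message: str


def build_timeline(*sources: list[Event]) -> list[Event]:
    """Merge events from several sources into one chronological list."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda e: e.timestamp)


def changes_before(timeline: list[Event], alert: Event,
                   window_minutes: int = 30) -> list[Event]:
    """Return deploy events that happened shortly before the alert.

    These are candidates for a human to investigate, not proven causes.
    """
    return [
        e for e in timeline
        if e.source == "deploy"
        and 0 <= (alert.timestamp - e.timestamp).total_seconds() <= window_minutes * 60
    ]


# Made-up sample data for illustration only
alerts = [Event(datetime(2026, 3, 20, 14, 23), "alert", "Error rate > 5%")]
deploys = [Event(datetime(2026, 3, 20, 14, 10), "deploy", "api-gateway config change")]
chat = [Event(datetime(2026, 3, 20, 14, 25), "chat", "On-call paged")]

timeline = build_timeline(alerts, deploys, chat)
for e in timeline:
    print(f"{e.timestamp:%H:%M} [{e.source}] {e.message}")

suspects = changes_before(timeline, alerts[0])
```

The suspects list is a starting point for the root cause stage: it tells the model (or a reviewer) which changes are close enough in time to be worth examining.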

Practical Implementation Approaches

Using LLMs Directly

The most flexible approach involves feeding incident data directly to a large language model with appropriate prompting. Here’s a Python example using OpenAI’s API:

from openai import OpenAI
import json
from datetime import datetime

def generate_postmortem(incident_data: dict, client: OpenAI) -> str:
    """Generate a postmortem document from incident data."""

    prompt = f"""Generate a blameless postmortem for the following incident.

## Incident Summary
- Title: {incident_data.get('title', 'Unknown')}
- Severity: {incident_data.get('severity', 'SEV-3')}
- Duration: {incident_data.get('duration_minutes', 0)} minutes
- Impact: {incident_data.get('impact', 'Unknown')}

## Timeline Events
{incident_data.get('timeline', '')}

## Key Logs

{incident_data.get('logs', '')}

## Code Changes During Incident
{incident_data.get('commits', '')}

Generate a complete postmortem with:
1. Executive Summary
2. Impact Assessment
3. Timeline (at least 5 key events)
4. Root Cause Analysis
5. Resolution Steps
6. Action Items (at least 3)

Use a blameless tone focused on learning."""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are an SRE expert specializing in incident postmortems."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.3
    )

    return response.choices[0].message.content

# Example usage
incident = {
    "title": "API Gateway 5xx Errors",
    "severity": "SEV-2",
    "duration_minutes": 47,
    "impact": "2,300 users unable to access dashboard",
    "timeline": """2026-03-20 14:23 - Alert fired: Error rate > 5%
2026-03-20 14:25 - On-call paged, investigation started
2026-03-20 14:31 - Root cause identified: config deployment
2026-03-20 14:45 - Rollback completed
2026-03-20 14:50 - Error rates normalized""",
    "logs": """level=error msg="connection refused" service=auth-service
level=error msg="upstream timeout" service=api-gateway
level=info msg="config loaded" service=api-gateway version="v2.1.0-bad" """,
    "commits": "abc123 - Update auth service config\ndef456 - Rollback to v2.0.9"
}

client = OpenAI()  # reads OPENAI_API_KEY from the environment; avoid hardcoding keys
postmortem = generate_postmortem(incident, client)
print(postmortem)

This approach offers maximum flexibility but requires careful prompt engineering and potentially multiple iterations to get quality output.

Specialized Postmortem Platforms

Several specialized tools have emerged that handle the entire pipeline:

Platform	Best For	Key Features
Incident.io	Slack-integrated teams	Automatic timeline generation from incident channels
Blameless	Enterprise compliance	Integration with ITSM tools, action item tracking
FireHydrant	Mid-market teams	Flexible templates, postmortem libraries
Vela	Incident response automation	AI-powered root cause suggestions

Building a Custom Pipeline

For organizations wanting more control, building a custom pipeline provides the greatest flexibility. Here’s a conceptual approach:

from datetime import datetime, timedelta

class PostmortemGenerator:
    def __init__(self, llm_client, log_aggregator, metrics_client):
        self.llm = llm_client
        self.logs = log_aggregator
        self.metrics = metrics_client

    def collect_incident_data(self, incident_id: str, window_minutes: int = 60):
        """Collect all relevant data around the incident."""
        end_time = datetime.utcnow()
        start_time = end_time - timedelta(minutes=window_minutes)

        return {
            "logs": self.logs.query(
                service=["api-gateway", "auth-service"],
                start=start_time,
                end=end_time,
                level="error"
            ),
            "metrics": self.metrics.query(
                metric="request_error_rate",
                start=start_time,
                end=end_time
            ),
            "deployments": self.get_deployments(start_time, end_time),
            "incidents": self.get_incident_channel(incident_id)
        }

    def analyze_root_cause(self, data: dict) -> dict:
        """Use AI to identify probable root cause."""
        analysis_prompt = f"""
        Analyze this incident data and identify the root cause.
        Focus on: temporal correlation, error patterns, recent changes.

        Logs: {data['logs'][:2000]}
        Metrics: {data['metrics']}
        Recent Deployments: {data['deployments']}

        Return a JSON object with:
        - root_cause: primary cause description
        - contributing_factors: array of contributing factors
        - confidence: confidence score 0-1
        """
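        # Placeholder: send analysis_prompt to self.llm and parse the JSON reply.
        raise NotImplementedError("analyze_root_cause is not implemented yet")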

    def generate_document(self, incident_id: str, window: int = 60) -> str:
        """Generate complete postmortem document."""
        data = self.collect_incident_data(incident_id, window)
        analysis = self.analyze_root_cause(data)

        return self.render_postmortem(incident_id, data, analysis)
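
The analyze_root_cause step above asks the model for a JSON object with root_cause, contributing_factors, and confidence. One way to handle the reply is sketched below. The function name parse_root_cause, the needs_review flag, and the 0.7 threshold are assumptions for illustration, and the sample response is invented.

```python
import json


def parse_root_cause(raw_response: str) -> dict:
    """Parse and validate the JSON object requested by the analysis prompt."""
    # Models sometimes wrap JSON in prose or a code fence, so take the
    # outermost {...} span rather than parsing the whole string.
    start, end = raw_response.find("{"), raw_response.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model response")
    result = json.loads(raw_response[start:end + 1])

    required = {"root_cause", "contributing_factors", "confidence"}
    missing = required - set(result)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")

    confidence = float(result["confidence"])
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    result["confidence"] = confidence

    # Flag low-confidence answers for closer human review (arbitrary threshold).
    result["needs_review"] = confidence < 0.7
    return result


# Invented sample reply for illustration
sample = (
    "Here is the analysis:\n"
    '{"root_cause": "config deployment", '
    '"contributing_factors": ["no canary stage"], "confidence": 0.6}'
)
analysis = parse_root_cause(sample)
```

Validating the shape up front keeps a malformed reply from silently flowing into the rendered document.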

Best Practices for AI Postmortem Generation

Validate before publishing. AI generates drafts, not final documents. Always have a human reviewer verify technical accuracy and add context that AI might miss.

Provide rich context. The quality of output depends heavily on input. Supply logs, metrics, commits, and chat transcripts together so the model can cross-reference them.

Iterate on prompts. If using LLMs directly, refine your prompts based on output quality. Include examples of good postmortems in your prompt engineering.

Maintain human ownership. Final postmortems should always be owned by team members who can speak to accuracy and commit to action items.

Challenges and Limitations

AI postmortem generation faces several challenges. Context windows limit how much historical data models can process, requiring careful selection of relevant logs and events. Root cause inference remains probabilistic—AI can suggest probable causes but cannot replace human investigation for complex issues. Additionally, certain root causes like race conditions or distributed system timing issues are inherently difficult for AI to identify without deep system knowledge.
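
Selecting relevant logs before prompting can be as simple as keeping unique error and warning lines within a character budget, instead of slicing the first N characters. The sketch below assumes the level=... log format used in the earlier example; the function name and the default budget are illustrative.

```python
def select_log_lines(log_text: str, max_chars: int = 3000) -> str:
    """Keep unique error/warning lines, in order, within a character budget."""
    seen: set[str] = set()
    kept: list[str] = []
    used = 0
    for line in log_text.splitlines():
        line = line.strip()
        if not line or line in seen:
            continue
        if "level=error" not in line and "level=warn" not in line:
            continue
        if used + len(line) + 1 > max_chars:
            break
        seen.add(line)
        kept.append(line)
        used += len(line) + 1  # +1 for the newline added by join
    return "\n".join(kept)


raw = """level=info msg="config loaded" service=api-gateway
level=error msg="connection refused" service=auth-service
level=error msg="connection refused" service=auth-service
level=error msg="upstream timeout" service=api-gateway"""

selected = select_log_lines(raw)
print(selected)
```

Real logs usually carry timestamps, so exact-match deduplication would need the timestamp stripped first; that step is left out here.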

Tool Comparison: Cost and Features

Several dedicated platforms handle AI postmortem generation. Here’s a practical comparison:

Platform	Cost	Best For	Key Features
Incident.io	Free to $300/month	Slack teams	Auto timeline, AI root cause, incident library
Blameless	Custom pricing	Enterprise	ITSM integration, automated remediation tracking
FireHydrant	$50-300/month	Teams <100	Custom templates, playbooks, AI enrichment
OpenAI API	$0.03-0.06 per 1K tokens	Custom builds	Full control, lowest cost for volume

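For teams with the engineering time, a custom build on the Claude API ($0.003 per 1K input tokens) or GPT-4o ($0.015 per 1K tokens) usually costs less per postmortem than an enterprise platform subscription and offers more flexibility. Per-token prices vary by model and change often, so check the provider's current pricing before budgeting.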
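
The per-postmortem cost of a custom build is simple arithmetic over token counts. In the sketch below, the input price is the $0.003 per 1K tokens figure quoted above; the output price and the token counts are placeholder assumptions, not quoted rates.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Estimate the API cost in dollars for one generation call."""
    return (
        (input_tokens / 1000) * input_price_per_1k
        + (output_tokens / 1000) * output_price_per_1k
    )


# Assumed workload: 8,000 prompt tokens of incident data, 2,000 tokens of output.
cost = estimate_cost(
    input_tokens=8000,
    output_tokens=2000,
    input_price_per_1k=0.003,   # figure quoted in the text above
    output_price_per_1k=0.015,  # placeholder assumption; check current pricing
)
print(f"Estimated cost per postmortem: ${cost:.3f}")
```

Under these assumptions a single draft costs a few cents, so the comparison with a monthly subscription depends mostly on incident volume and on the engineering time spent maintaining the pipeline.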

CLI-Based Postmortem Generation Workflow

Here’s a complete bash/Python workflow for automating postmortem generation from logs:

#!/bin/bash
# fetch-incident-data.sh - Collect incident artifacts

INCIDENT_ID=$1
INCIDENT_START=$2
INCIDENT_END=$3

# Fetch logs from ELK/DataDog/CloudWatch
curl -s "https://logs.example.com/api/logs" \
  -H "Authorization: Bearer $LOG_TOKEN" \
  -d "incident_id=$INCIDENT_ID&start=$INCIDENT_START&end=$INCIDENT_END" \
  > logs.json

# Get metrics spike data
curl -s "https://metrics.example.com/api/timeseries" \
  -H "Authorization: Bearer $METRICS_TOKEN" \
  -d "metric=error_rate&start=$INCIDENT_START&end=$INCIDENT_END" \
  > metrics.json

# Fetch deployment info
git log --oneline --after="$INCIDENT_START" --before="$INCIDENT_END" \
  > deployments.txt

# Get Slack incident channel transcript
# (INCIDENT_CHANNEL_ID must be set in the environment)
python3 extract_slack_thread.py "$INCIDENT_CHANNEL_ID" > slack_discussion.txt

# Generate postmortem
python3 generate_postmortem.py \
  --logs logs.json \
  --metrics metrics.json \
  --deployments deployments.txt \
  --slack slack_discussion.txt \
  --output postmortem.md

Then use the Claude API to generate the draft from the collected files:

import anthropic
import json

def generate_postmortem_from_files(logs_file, metrics_file, slack_file):
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    # Read collected data
    with open(logs_file) as f:
        logs = f.read()
    with open(metrics_file) as f:
        metrics = f.read()
    with open(slack_file) as f:
        slack = f.read()

    prompt = f"""You are an SRE writing a blameless postmortem.

## Raw Incident Data

### Error Logs
{logs[:3000]}

### Metrics During Incident
{metrics[:2000]}

### Team Discussion
{slack[:2000]}

Generate a comprehensive postmortem with:
1. Executive summary (2-3 sentences)
2. Impact: affected services, user count, duration
3. Timeline: at least 5 key events with timestamps
4. Root cause: primary cause + contributing factors
5. Resolution: what stopped the bleeding + full fix
6. Prevention: 3-5 specific action items to prevent recurrence

Use markdown format. Be specific with numbers and timeframes."""

    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=2000,
        messages=[
            {"role": "user", "content": prompt}
        ]
    )

    return response.content[0].text

# Generate and save
postmortem = generate_postmortem_from_files("logs.json", "metrics.json", "slack_discussion.txt")
with open("postmortem.md", "w") as f:
    f.write(postmortem)
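
The shell script passes --logs, --metrics, --deployments, --slack, and --output to generate_postmortem.py, but the Python listing above does not parse them. A minimal sketch of that wiring with argparse follows; the help strings and the default output path are assumptions, and the flag names are taken from the shell script.

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the flags that the shell script passes to generate_postmortem.py."""
    parser = argparse.ArgumentParser(
        description="Generate a postmortem draft from collected incident files."
    )
    parser.add_argument("--logs", required=True, help="path to the logs file")
    parser.add_argument("--metrics", required=True, help="path to the metrics file")
    parser.add_argument("--deployments", required=True, help="path to the git log output")
    parser.add_argument("--slack", required=True, help="path to the channel transcript")
    parser.add_argument("--output", default="postmortem.md", help="where to write the draft")
    return parser.parse_args(argv)


# In the real script this would be parse_args() with no arguments, so that
# argparse reads sys.argv. An explicit list is used here to keep the example
# self-contained.
args = parse_args([
    "--logs", "logs.json",
    "--metrics", "metrics.json",
    "--deployments", "deployments.txt",
    "--slack", "slack_discussion.txt",
])
print(args.logs, args.metrics, args.deployments, args.slack, args.output)
```

The parsed paths can then be handed to generate_postmortem_from_files, with args.output replacing the hardcoded postmortem.md.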

Real-World Postmortem Example

Here’s what AI-generated postmortems typically look like (this example was created with Claude):

# Postmortem: Database Connection Pool Exhaustion

## Impact
- 2,847 users unable to log in
- 3 dependent services degraded (API, websockets, admin panel)
- SLA violation: 27-minute outage (14:23 to 14:50 UTC) against a 99.95% availability target

## Timeline
- 14:23 UTC: Authentication service v2.4.1 deployed
- 14:25 UTC: Error rate alert fires (5% vs. 0.1% baseline)
- 14:27 UTC: On-call team paged, investigation begins
- 14:31 UTC: Database connection pool at 95% utilization identified
- 14:39 UTC: Service rolled back to v2.4.0
- 14:50 UTC: Error rates return to baseline

## Root Cause
Connection pool timeout was changed from 30s to 5s in the new release, causing connections
to be recycled too aggressively. This created connection churn that exhausted the pool.

## Resolution
Immediate: Rollback to v2.4.0. The previous connection timeout of 30s had been validated
in production for 6 months.

Permanent: Update connection pool configuration to use 45s timeout with monitoring.

Addressing AI Limitations in Postmortem Generation

AI struggles with certain aspects of postmortems. Always manually verify:

Causality claims — AI suggests correlations as causes. Verify temporal causality with domain experts.
Specific numbers — Check all metrics, error counts, and impact figures against actual data.
Action items — AI generates generic fixes. Replace with specific, assigned tasks your team will actually do.
Context dependencies — AI may miss that this is the third similar incident. Add historical context manually.
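
Part of the number check can be automated: extract every number from the draft and flag those that never appear in the source data. This is a rough sketch under obvious assumptions (exact string matching, a simple regular expression, invented sample text), and it cannot tell whether a number that does appear in the source is used correctly.

```python
import re

# Matches integers plus common compound forms such as 2,847, 14:23, 0.1, 2.4.1
NUMBER = re.compile(r"\d+(?:[.,:]\d+)*")


def unverified_numbers(draft: str, source_data: str) -> list[str]:
    """Return numbers that appear in the draft but not in the source data."""
    known = set(NUMBER.findall(source_data))
    flagged: list[str] = []
    for match in NUMBER.findall(draft):
        if match not in known and match not in flagged:
            flagged.append(match)
    return flagged


# Invented sample text for illustration
source = "14:23 deploy v2.4.1; 14:25 error rate 5%"
draft = (
    "At 14:23, v2.4.1 was deployed. By 14:25 the error rate hit 5%, "
    "affecting 2,847 users for 38 minutes."
)

flagged = unverified_numbers(draft, source)
print(flagged)
```

Anything in the flagged list is a figure the reviewer should trace back to real data before the postmortem is published.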

Automated Postmortem Storage and Search

Store generated postmortems in a searchable database so teams learn from patterns:

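from datetime import datetime, timezone

# postmortem_db and extract_root_causes are placeholders for your own
# storage client and tagging helper; they are not defined here.

def save_postmortem(postmortem_text, incident_id, severity, services):
    doc = {
        "incident_id": incident_id,
        "generated_at": datetime.now(timezone.utc).isoformat(),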
        "severity": severity,
        "affected_services": services,
        "text": postmortem_text,
        "root_cause_tags": extract_root_causes(postmortem_text),
        "searchable": postmortem_text.lower()
    }

    # Store in MongoDB, Elasticsearch, or similar
    postmortem_db.insert(doc)

    # Make searchable for "previous similar incidents"
    return doc

# Later, when new incident occurs:
def find_similar_incidents(current_error_message):
    similar = postmortem_db.search(
        query=current_error_message,
        limit=5
    )
    return similar  # Show team what they fixed before

This can shorten incident response by letting responders see how similar past incidents were diagnosed and fixed.
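
The search call above depends on whatever database sits behind postmortem_db. To show the idea without one, here is a self-contained sketch that ranks stored postmortems by keyword overlap with the current error message. The function names, the scoring rule, and the sample documents are assumptions for illustration; a real deployment would more likely use the full-text or vector search of its datastore.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def find_similar(query: str, documents: list[dict], limit: int = 5) -> list[str]:
    """Return incident IDs ranked by how many query tokens each document shares."""
    query_tokens = tokenize(query)
    scored = []
    for doc in documents:
        overlap = len(query_tokens & tokenize(doc["text"]))
        if overlap:
            scored.append((overlap, doc["incident_id"]))
    # Highest overlap first; ties broken by incident ID for stable output
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [incident_id for _, incident_id in scored[:limit]]


# Invented sample documents for illustration
docs = [
    {"incident_id": "INC-1", "text": "Database connection pool exhaustion after timeout change"},
    {"incident_id": "INC-2", "text": "Expired TLS certificate on API gateway"},
]

matches = find_similar("connection pool timeout errors", docs)
print(matches)
```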

Built by theluckystrike — More at zovo.one