How to Use AI to Help Sre Teams Create on Call Handoff Docum

On-call rotations are a cornerstone of site reliability engineering, but the handoff process between shifts often becomes a time-consuming manual task. SRE teams frequently struggle to capture the right details, maintain consistency, and ensure the next engineer has everything needed to handle incoming incidents. AI tools offer practical ways to automate and improve on-call handoff documentation without adding cognitive burden to already-taxed teams.

The Problem with Manual Handoff Documents

When engineers manually write handoff notes, several issues emerge. Notes become inconsistent between shifts—some engineers detail every system interaction while others provide minimal context. Important details get omitted because the outgoing engineer assumes “everyone knows” certain systems or processes. Time pressure leads to rushed documentation that lacks the specificity needed for effective incident response.

A well-structured handoff document should answer key questions: What issues are currently open? Which systems require active monitoring? What recent changes might have introduced risk? What workarounds exist for known issues? Manually answering these questions every shift drains time that engineers could spend on proactive reliability work.

AI-Assisted Handoff Workflows

AI tools can assist at multiple stages of the handoff process. The most effective approach combines automated data gathering with intelligent summarization, then allows engineers to review and augment the output.

Collecting Relevant Context

Start by gathering data from your monitoring systems, incident trackers, and recent changes. A script can pull this information together before the handoff meeting:

import subprocess
from datetime import datetime, timedelta

def gather_handoff_context():
    """Gather relevant context for on-call handoff."""
    context = {}

    # Get open incidents from your ticketing system
    context['open_incidents'] = subprocess.run(
        ['jira', 'search', '--jql', 'status=Open AND labels=production'],
        capture_output=True, text=True
    ).stdout

    # Fetch recent deployments from the last 24 hours
    context['recent_deployments'] = subprocess.run(
        ['gh', 'run', 'list', '--limit', '50'],
        capture_output=True, text=True
    ).stdout

    # Get current system metrics snapshots
    context['active_alerts'] = subprocess.run(
        ['prometheus', 'query', 'up{job="production"} == 0'],
        capture_output=True, text=True
    ).stdout

    return context

This script aggregates raw data from multiple sources. The outgoing engineer reviews the output and identifies items requiring attention. This approach ensures nothing falls through the cracks while reducing the manual effort of searching through multiple systems.

Generating Summaries with AI

Once you have raw context, AI can synthesize this information into readable summaries. Rather than copying and pasting from various dashboards, feed the gathered data into an AI model with clear instructions about what the summary should contain:

Context: Recent deployments, open incidents, active alerts
Task: Write a concise handoff summary for the incoming on-call engineer
Include: Known issues, systems requiring attention, recent changes that might cause problems
Style: Technical but accessible, bullet points preferred

The AI generates a first draft that the outgoing engineer reviews and refines. This human-in-the-loop approach maintains accuracy while dramatically reducing documentation time.

Practical Templates and Examples

Effective AI-assisted handoffs work best with structured templates. Define what information matters for your team, then use AI to populate those sections intelligently.

Incident Status Section

Instead of manually listing every open incident, provide the AI with incident IDs and ask for status summaries:

## Current Incidents

### INC-1234: Elevated latency in payment service
- Status: Investigating
- Timeline: Identified 2 hours ago, root cause not yet determined
- Customer impact: 15% of transactions experiencing 500ms delay
- Notes from on-call: We believe this relates to the database connection pool
  changes from yesterday's deployment. The rollback was delayed because...

### INC-1235: SSL certificate warning on api.example.com
- Status: Monitoring
- Timeline: Detected 6 hours ago, temporary fix applied
- Workaround: Using fallback certificate while renewal processes

The AI helps maintain consistent formatting and ensures each incident includes the information engineers actually need: timeline, impact, and current status.

System Health Overview

For the systems overview section, generate a quick health check from monitoring data:

## Systems Status

| System | Status | Notes |
|--------|--------|-------|
| api-gateway | Healthy | Response times normal |
| user-service | Degraded | Memory usage elevated, investigating |
| payment-processor | Healthy | |
| analytics-pipeline | Healthy | Slight delay in processing (< 1 min) |

AI can transform raw metric outputs into this table format, saving manual formatting time while maintaining readability.

Action Items for Incoming Engineer

Clearly list tasks that need attention during the next shift:

## Action Items

1. Monitor user-service memory usage - may need to scale if it continues trending up
2. Follow up with security team on INC-1235 certificate renewal
3. Review INC-1234 if it escalates - database team available until 3pm PST
4. Complete the canary deployment for feature flag "new-checkout-flow"
   if no errors after 2 hours

Automating the Workflow

For teams ready to fully automate, integrate AI handoff generation into existing tooling. A CI/CD pipeline can run the context-gathering script, feed results to an AI model, and post the draft handoff document to your team’s communication channel before each shift ends.

This automation works best when paired with clear team conventions. Define which data sources to include, establish the required sections, and create a review process. The AI handles the heavy lifting of aggregation and formatting while engineers provide the critical domain expertise that cannot be automated.

Tips for Effective AI-Assisted Handoffs

Start small. Use AI to assist with one section of your handoff document initially. Measure the time savings and refine your approach before expanding to other sections.

Always include human review. AI generates drafts, not final documents. The outgoing engineer understands context the AI cannot capture—recent conversations with other teams, tacit knowledge about system behavior, and nuances that don’t appear in metrics.

Maintain consistency. Use the same template every shift. This predictability helps incoming engineers find information quickly and ensures no critical sections get omitted.

Store historical handoffs. Having a searchable archive of past handoffs helps AI models improve their output over time and allows engineers to reference previous incidents.

Handoff Document Template with AI Placeholders

# On-Call Handoff Report
**Date:** [AUTO: current date]
**Outgoing Engineer:** [NAME]
**Incoming Engineer:** [NAME]
**Time:** [AUTO: current time]

## Executive Summary
[AI SECTION: 2-3 sentence summary of current state]

## Critical Issues Requiring Attention
[AI: Extract from incident tracking system]

### INC-[ID]: [AI: Generated title]
- **Status:** [AI: Current status]
- **Impact:** [MANUAL: Human assessment of business impact]
- **Timeline:** [AI: Structured timeline with timestamps]
- **Next Steps:** [MANUAL: Specific actions needed]

## System Health Dashboard
[AI: Table generated from monitoring metrics]

| System | Status | Trend | Alert Threshold |
|--------|--------|-------|-----------------|
| [AI: auto-populated] | [AI: auto-populated] | [AI: trend analysis] | [MANUAL: configured value] |

## Recent Changes in Last 24 Hours
[AI: Pull from deployment logs]

| Service | Change | Deployment Time | Status |
|---------|--------|-----------------|--------|
| [AI: service name] | [AI: brief change description] | [AI: timestamp] | [AI: success/pending] |

## Known Workarounds and Limitations
[AI: Extract from incident notes]

1. **[MANUAL: Issue Title]** - [AI: Summary of workaround]
   - Implemented: [AI: timestamp]
   - Estimated resolution: [MANUAL: timeline]

## Monitoring Alerts to Watch For
[AI: High-alert items from monitoring system]

- Alert name: [AI: populated from alert rules]
- Current level: [AI: current value]
- Action if triggered: [MANUAL: response procedure]

## Escalation Contacts
[MANUAL: Team-specific contacts]

## Action Items for This Shift
[AI: Priority-ordered list from tickets]

1. [Task description] - Assigned to: [incoming engineer] - Deadline: [AI: calculated from urgency]
2. [Continue for each priority item]

---
*Generated with AI assistance at [timestamp] | Review time: [estimate minutes]*

Automation Scripts for Data Collection

For teams ready to fully automate, create scripts that feed AI with structured data:

#!/usr/bin/env python3
"""Collect handoff data and generate documentation via AI."""

import subprocess
import json
import os
from datetime import datetime, timedelta
from typing import Dict, Any

class HandoffDataCollector:
    def __init__(self):
        self.collection_time = datetime.now()
        self.data = {}

    def get_open_incidents(self) -> Dict[str, Any]:
        """Fetch incidents from ticketing system."""
        result = subprocess.run(
            ['jira', 'search', '--jql',
             'status in (Open, "In Progress") AND labels=production',
             '--format=json'],
            capture_output=True, text=True
        )
        incidents = json.loads(result.stdout)
        return {
            'total': len(incidents),
            'high_priority': [i for i in incidents if i['priority'] == 'High'],
            'items': incidents
        }

    def get_recent_deployments(self) -> Dict[str, Any]:
        """Fetch recent deployments from CI/CD system."""
        twenty_four_hours_ago = (
            self.collection_time - timedelta(hours=24)
        ).isoformat()

        result = subprocess.run(
            ['gh', 'run', 'list',
             '--created', f'>{twenty_four_hours_ago}',
             '--json', 'name,conclusion,createdAt'],
            capture_output=True, text=True
        )
        deployments = json.loads(result.stdout)
        return {
            'total': len(deployments),
            'succeeded': sum(1 for d in deployments if d['conclusion'] == 'success'),
            'failed': sum(1 for d in deployments if d['conclusion'] == 'failure'),
            'items': deployments
        }

    def get_current_alerts(self) -> Dict[str, Any]:
        """Fetch active alerts from monitoring system."""
        result = subprocess.run(
            ['prometheus_query',
             'ALERTS{alertstate="firing"}',
             '--format=json'],
            capture_output=True, text=True
        )
        alerts = json.loads(result.stdout)
        return {
            'total': len(alerts),
            'critical': sum(1 for a in alerts if a['severity'] == 'critical'),
            'warning': sum(1 for a in alerts if a['severity'] == 'warning'),
            'items': alerts
        }

    def get_database_status(self) -> Dict[str, Any]:
        """Check database replication and backup status."""
        result = subprocess.run(
            ['aws', 'rds', 'describe-db-instances',
             '--query', 'DBInstances[*].[DBInstanceIdentifier,DBInstanceStatus]',
             '--output', 'json'],
            capture_output=True, text=True
        )
        instances = json.loads(result.stdout)
        return {
            'total': len(instances),
            'healthy': sum(1 for i in instances if i[1] == 'available'),
            'instances': instances
        }

    def collect_all(self) -> Dict[str, Any]:
        """Collect all handoff-relevant data."""
        print("Collecting handoff data...")
        self.data = {
            'timestamp': self.collection_time.isoformat(),
            'incidents': self.get_open_incidents(),
            'deployments': self.get_recent_deployments(),
            'alerts': self.get_current_alerts(),
            'databases': self.get_database_status(),
        }
        return self.data

    def save_to_file(self, filepath: str):
        """Save collected data for AI processing."""
        with open(filepath, 'w') as f:
            json.dump(self.data, f, indent=2, default=str)
        print(f"Data saved to {filepath}")

if __name__ == "__main__":
    collector = HandoffDataCollector()
    collector.collect_all()
    collector.save_to_file("handoff-data.json")

    # Feed to AI for summary generation
    print("\nData ready for AI processing. Example usage:")
    print("claude --file handoff-data.json " +
          "'Generate concise handoff summary from this data'")

Integration with Slack for Automated Handoffs

For teams using Slack, automate handoff document posting:

"""Post handoff documents to Slack channel."""

from slack_sdk import WebClient
from slack_sdk.errors import SlackApiError
import json

class HandoffSlackNotifier:
    def __init__(self, token: str, channel: str):
        self.client = WebClient(token=token)
        self.channel = channel

    def post_handoff_summary(self, handoff_data: Dict):
        """Post formatted handoff to Slack."""
        blocks = [
            {
                "type": "header",
                "text": {
                    "type": "plain_text",
                    "text": "On-Call Handoff Report"
                }
            },
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": (
                        f"*Incidents:* {handoff_data['incidents']['total']} " +
                        f"({handoff_data['incidents']['high_priority']} high)\n" +
                        f"*Deployments:* {handoff_data['deployments']['succeeded']} " +
                        f"successful, {handoff_data['deployments']['failed']} failed\n" +
                        f"*Alerts:* {handoff_data['alerts']['critical']} critical, " +
                        f"{handoff_data['alerts']['warning']} warning"
                    )
                }
            },
            {
                "type": "divider"
            }
        ]

        # Add action items
        if handoff_data['incidents']['high_priority']:
            blocks.append({
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": "*Requires Immediate Attention:*\n" +
                            "\n".join([
                                f"• {inc['key']}: {inc['summary']}"
                                for inc in handoff_data['incidents']['high_priority']
                            ])
                }
            })

        try:
            response = self.client.chat_postMessage(
                channel=self.channel,
                blocks=blocks
            )
            print(f"Handoff posted to Slack: {response['ts']}")
        except SlackApiError as e:
            print(f"Error posting to Slack: {e}")

Measuring Handoff Quality

Track whether handoffs effectively prevent information loss:

"""Measure handoff documentation quality and incident context transfer."""

class HandoffQualityMetrics:
    def __init__(self):
        self.metrics = {
            'avg_incident_clarity': 0,  # 1-10 scale
            'context_loss_incidents': 0,  # incidents with insufficient context
            'action_item_completion': 0,  # % of items completed as described
            'engineer_satisfaction': 0,  # incoming engineer's confidence score
            'documentation_completeness': 0,  # % of fields filled
        }

    def calculate_metrics(self, incident_db, handoff_history):
        """Calculate handoff quality metrics."""
        total_incidents = len(incident_db)
        clarity_scores = []
        lost_context_count = 0

        for incident in incident_db:
            # Check if handoff documented context clearly
            if incident.get('context_loss_reported'):
                lost_context_count += 1
            clarity_scores.append(incident.get('clarity_score', 5))

        self.metrics['avg_incident_clarity'] = (
            sum(clarity_scores) / len(clarity_scores) if clarity_scores else 0
        )
        self.metrics['context_loss_incidents'] = lost_context_count
        self.metrics['action_item_completion'] = self._calculate_completion_rate(
            handoff_history
        )

        return self.metrics

    def _calculate_completion_rate(self, handoff_history):
        """Calculate how many action items were completed as described."""
        completed = sum(
            1 for handoff in handoff_history
            if all(action.get('completed') for action in handoff.get('actions', []))
        )
        total = len([
            action
            for handoff in handoff_history
            for action in handoff.get('actions', [])
        ])
        return (completed / total * 100) if total > 0 else 0

Built by theluckystrike — More at zovo.one