A playbook is a documented procedure that any team member can execute without needing the author present. For remote teams across time zones, good playbooks turn what would be a “wake someone up” situation into a “follow these steps” situation. The difference is in the structure and completeness of the template.
This guide provides templates for three core playbook types: incident response, deployment, and onboarding.
Incident Response Playbook Template
# Incident: [INCIDENT-NAME]
**Severity**: P1 / P2 / P3
**Status**: Active / Resolved
**Incident Commander**: @[owner]
**Started**: YYYY-MM-DD HH:MM UTC
**Resolved**: YYYY-MM-DD HH:MM UTC (fill when resolved)
---
## What Is Happening
One paragraph plain-language description of the incident. What is affected? Who is affected? What is the user-visible impact?
> Example: "The payments API is returning 502 errors for ~40% of checkout attempts. Approximately 200 users per hour are unable to complete purchases. The error started at 14:23 UTC."
## Current Status
> Example: "Identified root cause (database connection pool exhausted). Implementing fix. ETA 30 minutes."
## Timeline
| Time (UTC) | Event |
|------------|-------|
| 14:23 | First error alerts fired |
| 14:31 | On-call acknowledged |
| 14:45 | Root cause identified |
| 15:10 | Fix deployed |
| 15:15 | Confirmed resolved |
## Impact
- **Services affected**: payments-api, checkout-frontend
- **Error rate**: ~40%
- **Users affected**: ~200/hour
- **Revenue impact**: ~$4,000/hour estimate
- **External customers notified**: [Yes/No] via [status.yourcompany.com]
---
## Diagnosis Steps
Run these commands to gather context:
```bash
# Check service health
kubectl get pods -n production | grep payments
# View recent error logs
kubectl logs -n production deployment/payments-api --since=30m | grep ERROR | tail -50
# Check database connectivity
kubectl exec -n production deployment/payments-api -- \
pg_isready -h $DB_HOST -p 5432
# View active DB connections
psql -h $DB_HOST -U postgres -c \
"SELECT count(*), state FROM pg_stat_activity GROUP BY state;"
Mitigation Options
| Option | Risk | ETA | Steps |
|---|---|---|---|
| Restart pods | Low | 2 min | kubectl rollout restart deployment/payments-api -n production |
| Roll back deployment | Low | 5 min | kubectl rollout undo deployment/payments-api -n production |
| Increase DB pool size | Medium | 15 min | Edit DB_POOL_SIZE env var and redeploy |
| Enable maintenance mode | Medium | 2 min | Set MAINTENANCE_MODE=true in config and redeploy |
Resolution
What was done to resolve the incident. What was the root cause?
Follow-up Actions
- Write post-mortem by [DATE] — @[owner]
- Add alert for [condition] — @[owner]
- Fix root cause permanently — ENG-[ticket]
- Update runbook with new steps — @[owner] ```
Deployment Playbook Template
# Deployment: [SERVICE-NAME] v[VERSION]
**Deployer**: @[name]
**Date**: YYYY-MM-DD
**Environment**: staging / production
**Deploy type**: Standard / Hotfix / Rollback
**PR / Release**: [link]
---
## Pre-Deployment Checklist
### Code
- [ ] PR approved by required reviewers
- [ ] All CI checks passing
- [ ] CHANGELOG updated
- [ ] Migration scripts reviewed (if applicable)
### Staging Verified
- [ ] Deployed to staging successfully
- [ ] Smoke tests passing on staging
- [ ] New feature tested on staging
### Dependencies
- [ ] Dependent services notified
- [ ] External APIs/webhooks compatible with new version
- [ ] Feature flags configured for gradual rollout
### Rollback Plan
- [ ] Previous version noted: `v[PREVIOUS_VERSION]`
- [ ] Rollback command tested: `kubectl rollout undo deployment/[service]`
- [ ] Database migrations are reversible: [Yes / No — explain if No]
---
## Deployment Steps
### 1. Announce
Post in #deployments Slack channel:
Deploying [service] v[version] to production. Changes: [brief summary] Risk: Low / Medium / High Rollback ready: yes
### 2. Deploy
```bash
# Tag and push (if not automated)
git tag v[version]
git push origin v[version]
# Trigger deploy (if manual)
kubectl set image deployment/[service] [service]=[registry]/[service]:v[version] -n production
# Wait for rollout
kubectl rollout status deployment/[service] -n production --timeout=5m
3. Verify
# Check pods are running
kubectl get pods -n production -l app=[service]
# Confirm new version
kubectl describe deployment/[service] -n production | grep Image
# Check error rate (first 5 minutes)
# Run this every 60 seconds x5
curl -s "https://monitoring.yourcompany.com/api/v1/query?query=rate(http_requests_total{service='[service]',status=~'5..'}[1m])" \
| jq '.data.result[0].value[1]'
4. Post-Deploy
- Confirm smoke tests pass in production
- Update status page if maintenance window was posted
- Announce completion in #deployments
- Enable feature flags for gradual rollout (if applicable)
Rollback Procedure
If error rate increases or critical errors appear:
# Immediate rollback
kubectl rollout undo deployment/[service] -n production
kubectl rollout status deployment/[service] -n production
# Verify rollback
kubectl describe deployment/[service] -n production | grep Image
Post in #deployments:
ROLLBACK: [service] rolled back to v[previous_version].
Reason: [brief explanation]
Investigation ongoing in #incidents
---
## Onboarding Playbook Template
```markdown
# Onboarding: [ENGINEER_NAME]
**Start Date**: YYYY-MM-DD
**Role**: [role]
**Manager**: @[manager]
**Buddy**: @[buddy]
**Team**: [team name]
---
## Week 1: Foundation
### Day 1 — Access and Setup
**IT/Admin tasks (Manager)**
- [ ] Google Workspace account created
- [ ] GitHub org invitation sent
- [ ] Slack invitation sent
- [ ] 1Password team invitation sent
- [ ] PagerDuty account created (if on-call eligible)
- [ ] AWS/GCP/Azure console access configured
**Environment setup (New Hire)**
```bash
# Clone the onboarding repo for setup scripts
git clone git@github.com:your-org/onboarding.git
cd onboarding && make setup
- Dev environment set up (run
make setup) - Can run the project locally
- VPN configured and tested
- 2FA enabled on all accounts
Day 2-3 — Codebase Orientation
- Read architecture overview in Notion: [link]
- Read team norms doc: [link]
- Review last 3 post-mortems: [link]
- Complete first “good first issue”: [ticket link]
- First PR submitted and reviewed
Day 4-5 — Process and Context
- Attended team standup
- Met with buddy (30-min async Loom or sync call)
- Met with manager (1:1 scheduled)
- Read current sprint goals
Week 2: Contributing
- First PR merged to main
- Attended or watched team retro recording
- Added to on-call rotation schedule (shadow week)
- Read and acknowledged security policies
- Set up error tracking alerts for owned services
Ongoing: 30/60/90 Day Goals
30 Days
- Ship 3 non-trivial PRs
- Understand the data model for core services
- Be able to debug a production issue independently
60 Days
- Own at least one feature end-to-end
- Be confident with deployment process
- Contribute to team norms doc with at least one edit
90 Days
- Lead a project or feature
- Mentor a more junior engineer on a PR review
- Identify and fix one piece of technical debt
Key Resources
| Resource | Link | |———-|——| | Architecture overview | [link] | | API documentation | [link] | | Deployment runbook | [link] | | Incident response playbook | [link] | | On-call rotation | [link] | | Team calendar | [link] |
---
## Storing and Accessing Playbooks
Playbooks rot if they're not maintained. The best storage is wherever your team already looks:
```bash
# Notion database with properties
Title: [text]
Type: [Incident / Deployment / Onboarding / Process]
Owner: [person]
Last Reviewed: [date]
Status: [Active / Draft / Archived]
# GitHub Wiki (version-controlled)
docs/
├── playbooks/
│ ├── incident-response.md
│ ├── deployment-standard.md
│ └── onboarding.md
Trigger playbook reminders via GitHub Actions to review stale playbooks:
name: Playbook Review Reminder
on:
schedule:
- cron: '0 9 1 * *' # First of each month
jobs:
remind:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Check for stale playbooks
run: |
find docs/playbooks/ -name "*.md" -mtime +90 \
-exec echo "STALE PLAYBOOK: {}" \; \
| while read msg; do
curl -s -X POST \
-H 'Content-type: application/json' \
--data "{\"text\":\"$msg — please review and update\"}" \
"${{ secrets.SLACK_WEBHOOK }}"
done
Related Reading
- Remote Team Code Review Checklist Template
- How to Create a Remote Dev Environment Template
- How to Create Automated Status Pages
- Calculate reasonable response windows based on overlap