Runbooks turn undocumented institutional knowledge into step-by-step procedures anyone on the team can follow at 3am. Good runbooks are opinionated, tested, and short — they list commands to run, not theory to understand. This guide builds the templates and tooling for a remote engineering team’s runbook library.
## Table of Contents

- The Runbook Template
- Template 1: Service Restart
- Template 2: Database Backup Verification
- Template 3: SSL Certificate Renewal
- Runbook CI — Auto-Test Commands
- Runbook Index Template
- Template 4: High Traffic / Scaling Response
- Making Runbooks Findable at 3am
- Troubleshooting
- Related Reading
## The Runbook Template

Every runbook in the library follows the same skeleton:

```markdown
# [Operation Name] Runbook

**Owner:** @team-name
**Last tested:** YYYY-MM-DD
**Estimated time:** N minutes
**Severity:** Critical / High / Medium / Low

## Purpose
One sentence: what does this runbook do?

## Prerequisites
- Access required
- Tools needed
- Checks to do first

## Steps
1. Step with command
2. Step with expected output
3. Verification step

## Verification
How to confirm the operation succeeded.

## Rollback
How to undo this if something goes wrong.

## Escalation
Who to page if this doesn't work.
```
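Template completeness is easy to enforce in CI rather than in review. A minimal sketch: a shell function that checks a runbook file for every required section heading (the section names match the template above; the file path in the example is hypothetical):

```shell
# check_runbook: verify a runbook file contains every required section heading.
# Returns 0 if complete, 1 if any section is missing.
check_runbook() {
  local file=$1 missing=0 section
  for section in Purpose Prerequisites Steps Verification Rollback Escalation; do
    # Match any heading level that mentions the section name
    if ! grep -qE "^#+ .*${section}" "$file"; then
      echo "missing section: ${section} in ${file}" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example: check_runbook runbooks/service-restart.md
```

Run it over `runbooks/**/*.md` in a pre-merge hook so an incomplete runbook never lands.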
## Template 1: Service Restart

**Owner:** @platform-team
**Last tested:** 2026-03-15
**Estimated time:** 5 minutes
**Severity:** High

### Purpose
Safely restart a production service without extended downtime.

### Prerequisites
- SSH access to production servers (systemd), or `kubectl` access (Kubernetes): confirm with `kubectl get pods -n production`
- Alert #incidents that a restart is in progress

### Steps

#### Kubernetes
1. Check current pod status:
   ```bash
   kubectl get pods -n production -l app=your-service
   ```
   Expected: all pods in `Running` state before proceeding.
2. Scale down to zero (optional, for critical services that must fully stop):
   ```bash
   kubectl scale deployment your-service -n production --replicas=0
   kubectl wait --for=delete pods -l app=your-service -n production --timeout=60s
   ```
3. Restart with a rolling update (preferred):
   ```bash
   kubectl rollout restart deployment/your-service -n production
   ```
4. Monitor the rollout:
   ```bash
   kubectl rollout status deployment/your-service -n production --timeout=120s
   ```
   Expected output:
   ```
   deployment "your-service" successfully rolled out
   ```

#### Docker / systemd
1. Check service health before restart:
   ```bash
   systemctl status your-service
   ```
2. Restart:
   ```bash
   sudo systemctl restart your-service
   ```
3. Check for errors:
   ```bash
   sudo journalctl -u your-service -n 50 --no-pager
   ```

### Verification
```bash
# Check the service responds
curl -s --max-time 10 https://api.example.com/health | jq .
# Expected: {"status": "ok"}
```
Check the error rate in Grafana (Dashboard: Service Health > Error Rate > last 5 minutes). Expected: < 0.1% errors.

### Rollback
If the service doesn’t come back up:
```bash
# Kubernetes: roll back to the previous version
kubectl rollout undo deployment/your-service -n production
kubectl rollout status deployment/your-service -n production

# Docker: start the previous container
docker start your-service_previous
```

### Escalation
Service still down after 10 minutes: page @on-call-engineer via PagerDuty.
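During a restart the verification `curl` tends to be re-run by hand every few seconds. A small retry wrapper saves that; a sketch (the health URL is the example endpoint from this template, not a real service):

```shell
# wait_for: retry a command every 2 seconds until it succeeds
# or a timeout elapses.
# Usage: wait_for <timeout_seconds> <command...>
wait_for() {
  local timeout=$1; shift
  local deadline=$(( $(date +%s) + timeout ))
  until "$@"; do
    if (( $(date +%s) >= deadline )); then
      echo "wait_for: timed out waiting for: $*" >&2
      return 1
    fi
    sleep 2
  done
}

# Example, using the health endpoint from the Verification step:
# wait_for 120 curl -sf --max-time 10 https://api.example.com/health
```

The `-f` flag makes `curl` exit non-zero on HTTP errors, so a 502 during rollout keeps the loop waiting instead of reporting success.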
## Template 2: Database Backup Verification

**Owner:** @database-team
**Last tested:** 2026-03-01
**Estimated time:** 20 minutes
**Severity:** Medium

### Purpose
Verify that the most recent database backup is valid and can be restored.

### Prerequisites
- Access to backup storage (S3/MinIO)
- Test restore environment available
- At least 10GB free disk space on the test host

### Steps
1. List recent backups and confirm the latest is recent:
   ```bash
   mc ls company/backups/postgres/ --recursive | sort | tail -10
   # Confirm the latest backup is < 24 hours old
   ```
2. Download the latest backup to the test host:
   ```bash
   BACKUP=$(mc ls company/backups/postgres/ --recursive | sort | tail -1 | awk '{print $NF}')
   mc cp "company/backups/postgres/${BACKUP}" /tmp/test-restore.sql.gz
   echo "Backup size: $(du -sh /tmp/test-restore.sql.gz)"
   ```
3. Create a test database:
   ```bash
   createdb -U postgres test_restore_$(date +%Y%m%d)
   ```
4. Restore the backup:
   ```bash
   DB_NAME="test_restore_$(date +%Y%m%d)"
   gunzip -c /tmp/test-restore.sql.gz | psql -U postgres "$DB_NAME"
   ```
5. Verify the table count matches production:
   ```bash
   # On production:
   psql -U postgres appdb -c "SELECT count(*) FROM information_schema.tables WHERE table_schema = 'public';"
   # On the test restore:
   psql -U postgres "$DB_NAME" -c "SELECT count(*) FROM information_schema.tables WHERE table_schema = 'public';"
   # Counts should match
   ```
6. Verify recent data exists:
   ```bash
   psql -U postgres "$DB_NAME" -c "SELECT MAX(created_at) FROM orders;"
   # Should be within the last 24 hours
   ```

### Cleanup
```bash
dropdb -U postgres "test_restore_$(date +%Y%m%d)"
rm /tmp/test-restore.sql.gz
```

### Verification
Record backup test results in the backup log:
```
Date: YYYY-MM-DD
Backup file: filename.sql.gz
Backup size: NNN MB
Restore time: N minutes
Table count match: YES/NO
Latest data date: YYYY-MM-DD
Tested by: @username
```

### Escalation
Backup older than 36 hours or restore fails: page @database-team immediately.
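The "< 24 hours old" freshness check in step 1 is the part most worth automating, since a stale backup is exactly what this runbook exists to catch. A sketch assuming GNU coreutils (`stat -c`; on macOS/BSD use `stat -f %m` instead):

```shell
# file_age_hours: print the age of a file in whole hours (GNU stat).
file_age_hours() {
  local mtime
  mtime=$(stat -c %Y "$1")   # seconds since epoch at last modification
  echo $(( ( $(date +%s) - mtime ) / 3600 ))
}

# Fail loudly if the downloaded backup is older than 24 hours:
# [ "$(file_age_hours /tmp/test-restore.sql.gz)" -lt 24 ] \
#   || echo "backup is stale -- escalate per this runbook" >&2
```

This checks the downloaded file's modification time, which for most `mc cp` setups mirrors the object's upload time; verify that assumption against your storage before relying on it.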
## Template 3: SSL Certificate Renewal

**Owner:** @platform-team
**Last tested:** 2026-01-10
**Estimated time:** 15 minutes (automated) / 45 minutes (manual)

### Purpose
Renew SSL certificates before expiry. Run this 30 days before expiry.

### Prerequisites
- Root/sudo access to servers running nginx/apache
- Certbot installed, or access to the certificate provider's dashboard

### Check Current Expiry
```bash
# Check all certs on a server
for domain in api.example.com git.example.com auth.example.com; do
  echo -n "$domain: "
  echo | openssl s_client -servername "$domain" -connect "$domain:443" 2>/dev/null \
    | openssl x509 -noout -dates 2>/dev/null | grep notAfter
done
```

### Automated Renewal (Let’s Encrypt)
```bash
# Test renewal (dry run)
sudo certbot renew --dry-run

# Renew all certs
sudo certbot renew

# Reload nginx after renewal
sudo systemctl reload nginx

# Verify renewal
sudo certbot certificates
```

### Manual Renewal (Other CA)
1. Generate a new CSR:
   ```bash
   openssl req -new -newkey rsa:2048 -nodes \
     -keyout /etc/ssl/private/example.com.key \
     -out /tmp/example.com.csr \
     -subj "/C=US/ST=NY/O=YourCompany/CN=example.com"
   ```
2. Submit the CSR to your CA and download the new certificate.
3. Install the new certificate:
   ```bash
   sudo cp new-cert.crt /etc/ssl/certs/example.com.crt
   sudo nginx -t && sudo systemctl reload nginx
   ```

### Verification
```bash
# Verify the new expiry date
echo | openssl s_client -servername api.example.com \
  -connect api.example.com:443 2>/dev/null \
  | openssl x509 -noout -dates
# notAfter should be ~90 days out (Let's Encrypt) or per your CA
```
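To turn the expiry check into something a cron job can alert on, convert `notAfter` into days remaining. A sketch assuming GNU `date -d` (on macOS you would need `date -j -f` with an explicit format instead):

```shell
# days_until_expiry: given an openssl notAfter date string
# (e.g. "Jun  1 12:00:00 2026 GMT"), print whole days until expiry.
days_until_expiry() {
  local end_epoch
  end_epoch=$(date -d "$1" +%s)   # GNU date parses the openssl date format
  echo $(( (end_epoch - $(date +%s)) / 86400 ))
}

# Example wiring into the expiry-check loop above:
# not_after=$(echo | openssl s_client -servername api.example.com -connect api.example.com:443 2>/dev/null \
#   | openssl x509 -noout -enddate | cut -d= -f2)
# [ "$(days_until_expiry "$not_after")" -gt 30 ] || echo "renew soon!" >&2
```

The 30-day threshold matches this runbook's "run 30 days before expiry" rule.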
## Runbook CI — Auto-Test Commands

Test that runbook commands haven’t drifted from reality:

```yaml
# .github/workflows/test-runbooks.yml
name: Test Runbook Commands
on:
  schedule:
    - cron: '0 6 * * 1'  # Weekly, Monday 06:00 UTC
  pull_request:
    paths: ['runbooks/**']
jobs:
  test-cert-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Test certificate check command
        run: |
          echo | openssl s_client -servername google.com \
            -connect google.com:443 2>/dev/null \
            | openssl x509 -noout -dates
```
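The same workflow can go further and lint every fenced `bash` block in the runbooks, so a snippet with a typo fails the PR instead of failing at 3am. A sketch of the extraction half (plain awk; `shellcheck` would need to be installed on the runner):

```shell
# extract_bash_blocks: print the contents of all fenced bash blocks
# in a markdown file (lines between a "bash"-tagged fence and the
# next closing fence).
extract_bash_blocks() {
  awk '/^```bash$/ { in_block=1; next }
       /^```$/     { in_block=0 }
       in_block    { print }' "$1"
}

# In CI, lint every extracted snippet:
# for f in runbooks/**/*.md; do
#   extract_bash_blocks "$f" | shellcheck --shell=bash - || exit 1
# done
```

Static linting catches quoting and syntax mistakes; it won't catch commands that are valid but wrong, which is what the scheduled live checks above are for.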
## Runbook Index Template

```markdown
# Runbook Index

## Incident Response

| Runbook | Owner | Last Tested | Time |
|---------|-------|-------------|------|
| [Service Restart](./service-restart.md) | @platform | 2026-03-15 | 5m |
| [Database Failover](./db-failover.md) | @dba | 2026-02-01 | 30m |
| [High Traffic Response](./high-traffic.md) | @sre | 2026-03-01 | 15m |

## Deployments

| Runbook | Owner | Last Tested | Time |
|---------|-------|-------------|------|
| [Deploy Hotfix](./deploy-hotfix.md) | @engineering | 2026-03-10 | 20m |
| [Rollback Release](./rollback.md) | @engineering | 2026-03-05 | 10m |

## Maintenance

| Runbook | Owner | Last Tested | Time |
|---------|-------|-------------|------|
| [SSL Renewal](./ssl-renewal.md) | @platform | 2026-01-10 | 15m |
| [Backup Verification](./backup-verify.md) | @dba | 2026-03-01 | 20m |
| [Server Patching](./server-patching.md) | @platform | 2026-03-20 | 60m |
```
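The index itself can drift: a table row can point at a runbook that was renamed or deleted. A quick CI check, assuming the index lives alongside the runbook files and links use the relative `./name.md` form shown above (the `INDEX.md` path is a hypothetical example):

```shell
# check_index_links: verify every (./*.md) link in an index file
# points at a file that exists next to the index.
check_index_links() {
  local index=$1 dir rc=0 link
  dir=$(dirname "$index")
  while read -r link; do
    if [ ! -f "${dir}/${link}" ]; then
      echo "broken index link: ${link}" >&2
      rc=1
    fi
  done < <(grep -oE '\(\./[^)]+\.md\)' "$index" | sed 's/^(\.\///; s/)$//')
  return "$rc"
}

# Example: check_index_links runbooks/INDEX.md
```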
## Slack Command for Quick Runbook Access

```
# Post this to #ops when an incident starts
/runbooks incident service-restart
# Returns a link to the runbook + its last-tested date
```

Create a simple slash command webhook that queries your runbook index.
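The lookup side of that webhook can be as small as a grep over the index table. A sketch of the handler's core; the docs base URL and default index path are hypothetical, and the HTTP wrapper around it (Lambda, a tiny Flask app, etc.) is up to you:

```shell
# runbook_lookup: given a runbook slug, print a Slack-ready line with
# the docs link and the last-tested date pulled from the index table.
# Index rows look like:
# | [Service Restart](./service-restart.md) | @platform | 2026-03-15 | 5m |
runbook_lookup() {
  local slug=$1 index=${2:-runbooks/INDEX.md} row last_tested
  row=$(grep -F "(./${slug}.md)" "$index" | head -1)
  if [ -z "$row" ]; then
    echo "no runbook found for: ${slug}" >&2
    return 1
  fi
  # Field 4 of the pipe-separated row is the Last Tested column
  last_tested=$(echo "$row" | awk -F'|' '{gsub(/ /,"",$4); print $4}')
  echo "https://docs.example.com/runbooks/${slug} (last tested: ${last_tested})"
}
```

Surfacing the last-tested date in the reply is deliberate: it tells the responder at a glance how much to trust the runbook before following it.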
## Template 4: High Traffic / Scaling Response

**Owner:** @sre-team
**Last tested:** 2026-03-01
**Estimated time:** 15 minutes
**Severity:** Critical

### Purpose
Scale production to handle traffic spikes without service degradation.

### Prerequisites
- Access to the AWS Console or `kubectl` with the production context
- Grafana dashboard: "Service Health > Request Rate"
- Confirm this is a real traffic spike, not a metrics scrape bug

### Steps

#### Kubernetes — Horizontal Scaling
1. Check the current pod count and CPU/memory:
   ```bash
   kubectl get hpa -n production
   kubectl top pods -n production -l app=your-service
   ```
2. Manually scale if the HPA is not triggering fast enough:
   ```bash
   kubectl scale deployment your-service -n production --replicas=10
   ```
3. Verify the new pods start healthy:
   ```bash
   kubectl rollout status deployment/your-service -n production
   kubectl get pods -n production -l app=your-service
   ```

#### Database — Connection Pool Check
1. Check the Postgres connection count:
   ```bash
   psql -U postgres -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"
   ```
   If active connections exceed 80% of `max_connections`, enable PgBouncer connection pooling.
2. Route read-heavy traffic to read replicas:
   ```bash
   # Point reporting queries to the replica
   export DB_READ_HOST=db-replica-01.example.com
   ```

#### CDN / Cache
1. Check the cache hit rate in the Cloudflare dashboard.
2. Purge stale cache if serving outdated content:
   ```bash
   curl -X POST "https://api.cloudflare.com/client/v4/zones/${CF_ZONE_ID}/purge_cache" \
     -H "Authorization: Bearer ${CF_API_TOKEN}" \
     -H "Content-Type: application/json" \
     --data '{"purge_everything":true}'
   ```

### Verification
Traffic is handled when:
- Error rate < 0.5% (Grafana: Service Health > Error Rate)
- P95 response time < 500ms
- Pod CPU usage < 70% under load

### Rollback / Scale Down
After traffic returns to normal (monitor for 30 minutes):
```bash
# Let the HPA handle it, or manually scale back
kubectl scale deployment your-service -n production --replicas=3
```

### Escalation
Traffic still unmanageable after 20 minutes: page @infrastructure-lead and open a Cloudflare support ticket if the CDN appears to be the bottleneck.
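Picking `--replicas=10` under pressure is guesswork; a back-of-envelope formula is calmer: ceiling of observed RPS over per-pod capacity, clamped to sane bounds. A sketch with hypothetical capacity numbers (the 500 RPS per pod and the 3–20 clamp are assumptions to replace with your own load-test data):

```shell
# replicas_for: naive pod count = ceil(rps / per_pod_rps),
# clamped to [min, max].
# Usage: replicas_for <rps> <per_pod_rps> [min] [max]
replicas_for() {
  local rps=$1 per_pod=$2 min=${3:-3} max=${4:-20}
  local n=$(( (rps + per_pod - 1) / per_pod ))   # integer ceiling
  if (( n < min )); then n=$min; fi
  if (( n > max )); then n=$max; fi
  echo "$n"
}

# Example: 4500 RPS observed, each pod handles ~500 RPS
# kubectl scale deployment your-service -n production --replicas="$(replicas_for 4500 500)"
```

The upper clamp matters: it keeps a fat-fingered RPS reading from requesting hundreds of pods and turning a traffic incident into a capacity one.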
## Making Runbooks Findable at 3am
A runbook nobody can find in an incident is useless. Three places every runbook must live:
**1. The repo (source of truth):**
```
runbooks/
  incident/
    service-restart.md
    db-failover.md
    high-traffic.md
  deployments/
    deploy-hotfix.md
    rollback.md
  maintenance/
    ssl-renewal.md
    backup-verify.md
    server-patching.md
```
**2. Your internal docs tool** (Notion, Confluence, or a static site built from the same markdown). Mirror the repo structure exactly, so runbook links shared in Slack keep working no matter how people navigate the docs site.
**3. Pinned in `#incidents`:**
Post a pinned message at the top of your incidents Slack channel with direct links to the five most-used runbooks. During an incident, people do not have time to navigate a wiki — the link should be one click away.
A runbook library only works if the team trusts it. Trust comes from: commands that actually run without modification, time estimates that are close to reality, and rollback steps that have actually been tested. Each time you use a runbook in a real incident, update the `Last tested` field and fix anything that was inaccurate. This feedback loop — use it, fix it, trust it more — is what separates a living runbook from documentation theatre.
## Troubleshooting
**Configuration changes not taking effect**
Restart the relevant service or application after making changes. Some settings require a full system reboot. Verify the configuration file path is correct and the syntax is valid.
**Permission denied errors**
Run the command with `sudo` for system-level operations, or check that your user account has the necessary permissions. On macOS, you may need to grant terminal access in System Settings > Privacy & Security.
**Connection or network-related failures**
Check your internet connection and firewall settings. If using a VPN, try disconnecting temporarily to isolate the issue. Verify that the target server or service is accessible from your network.
## Related Reading
- [How to Write Runbooks for Remote Engineering Teams](/remote-work-tools/how-to-write-runbooks-remote-engineering-teams/)
- [Best Practice for Remote Team Escalation Paths](/remote-work-tools/best-practice-for-remote-team-escalation-paths-that-scale-wi/)
- [Best Practices for Remote Incident Communication](/remote-work-tools/best-practices-for-remote-incident-communication/)
- [How to Create Remote Team Playbook Templates](/remote-work-tools/how-to-create-remote-team-playbook-templates/)
---
## Related Articles
- [How to Build a Remote Team Runbook Library 2026](/remote-work-tools/how-to-build-remote-team-runbook-library-2026/)
- [How to Organize Remote Team Runbook Documentation for](/remote-work-tools/how-to-organize-remote-team-runbook-documentation-for-on-cal/)
- [Remote Team Charter Template Guide 2026](/remote-work-tools/remote-team-charter-template-guide-2026/)
Built by theluckystrike — More at [zovo.one](https://zovo.one)