# Remote Team Runbook Template for Database Failover Procedure with Distributed DevOps Staff
When your primary database instance fails at 3 AM while your DBA is eight time zones away, the difference between a 15-minute recovery and a multi-hour outage often comes down to having a well-practiced failover runbook. Database failures don’t wait for business hours, and distributed DevOps teams can’t rely on synchronous handoffs during critical incidents. This guide provides a runbook template that remote engineering teams can adapt for handling database failovers across distributed staff.
## The Challenge of Database Failover in Distributed Teams
Traditional database operations assume that the person with the most knowledge about the system is available when problems arise. In distributed teams spanning multiple time zones, this assumption breaks down. A failover that requires senior DBA approval can stall for hours simply because the right person is sleeping.
Effective remote team runbooks solve this by establishing clear decision criteria, predefined automation, and async-compatible verification steps. The goal is not to eliminate human judgment but to structure decisions so that qualified team members can make informed choices without waiting for specific individuals.
## Pre-Failover Preparation Checklist
Before any failover scenario, ensure these prerequisites are in place:
- **Replication topology is documented** — Know your primary-replica configuration, replication lag thresholds, and any read-replica hierarchies
- **Failover credentials are configured** — All authorized engineers have appropriate access to execute failover commands
- **Monitoring alerts are tuned** — Set clear thresholds for when to trigger failover (not every blip warrants it)
- **Communication channels are established** — Know where to post status updates and who to notify
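The checklist above can be verified mechanically before any incident. The following is a sketch, with assumed variable and file names (`DB_PRIMARY_HOST`, `SLACK_WEBHOOK_URL`, `TOPOLOGY_FILE`); adapt it to whatever your environment actually uses:

```shell
#!/bin/bash
# check-failover-prereqs.sh - sketch: fail fast if the prerequisites
# above are not in place. Variable and file names are illustrative.
check_prereqs() {
  local missing=0
  # Credentials / notification config must be present as environment variables
  for var in DB_PRIMARY_HOST SLACK_WEBHOOK_URL; do
    if [ -z "${!var}" ]; then
      echo "MISSING: $var"
      missing=1
    fi
  done
  # Topology documentation must exist on the host running the check
  if [ ! -f "${TOPOLOGY_FILE:-config/database-failover.yaml}" ]; then
    echo "MISSING: topology file"
    missing=1
  fi
  return "$missing"
}
```

Running this from CI or a nightly cron job catches drift (revoked credentials, deleted config) long before a 3 AM failover exposes it.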
### Database Connection String Template
Store your connection information in a standardized format accessible to all team members:
```yaml
# config/database-failover.yaml
production:
  primary:
    host: db-primary.internal
    port: 5432
    region: us-east-1
  replicas:
    - host: db-replica-1.internal
      region: us-east-1
      priority: 1
    - host: db-replica-2.internal
      region: us-west-2
      priority: 2
  failover:
    trigger_conditions:
      cpu_percent: 90
      connection_count: 1000
      replication_lag_seconds: 30
      error_rate_percent: 5
    auto_failover_enabled: true
    max_replication_lag_before_failover: 30
```
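When more than one replica is a candidate, the `priority` field above decides the promotion target (lower number wins). A minimal sketch of that selection, assuming the replica list has already been flattened to `host priority` lines (e.g. by a YAML tool; the parsing step is not shown):

```shell
#!/bin/bash
# pick-replica.sh - sketch: choose the promotion target by priority.
# stdin: one "<host> <priority>" pair per line (lower number = preferred).
pick_replica() {
  sort -k2 -n | head -n 1 | awk '{print $1}'
}

# Example:
# printf 'db-replica-2.internal 2\ndb-replica-1.internal 1\n' | pick_replica
# -> db-replica-1.internal
```

In a real incident you would also filter out unhealthy or badly lagged replicas before applying the priority tie-break.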
## The Database Failover Runbook
### Phase 1: Detection and Initial Assessment (0-3 minutes)
**Trigger:** Monitoring alert or on-call engineer notification

**Actions:**
1. Acknowledge the alert in #database-alerts Slack channel
2. Verify the alert is not a false positive (check metrics directly)
3. Identify current database topology status
4. Determine if this is a primary or replica failure
5. Check replication lag across all replicas
**Automated Detection Script:**

```bash
#!/bin/bash
# db-health-check.sh - Run this on detection
DB_PRIMARY="${DB_PRIMARY_HOST:-db-primary.internal}"
DB_REPLICAS=("db-replica-1.internal" "db-replica-2.internal")
SLACK_WEBHOOK="${SLACK_WEBHOOK_URL}"

echo "=== Database Health Check ==="
echo "Primary: $DB_PRIMARY"

# Check primary availability
if pg_isready -h "$DB_PRIMARY" -p 5432 -U readonly; then
  echo "✅ Primary is reachable"
else
  echo "❌ Primary is NOT reachable"
  curl -X POST "$SLACK_WEBHOOK" \
    -H 'Content-type: application/json' \
    --data '{"text": "🚨 DATABASE ALERT: Primary database unreachable!"}'
fi

# Check replica replication lag
for replica in "${DB_REPLICAS[@]}"; do
  LAG=$(psql -h "$replica" -U readonly -t -c "SELECT now() - pg_last_xact_replay_timestamp();" 2>/dev/null | xargs)
  echo "Replica $replica lag: $LAG"
done
```
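The lag query above returns a Postgres interval such as `00:00:12.345678`. A small helper makes that comparable against the 30-second threshold; this is a sketch assuming lag stays under 24 hours (longer intervals print a `N days` prefix, which this does not handle):

```shell
#!/bin/bash
# interval-to-seconds.sh - sketch: convert an "HH:MM:SS[.ffffff]" interval
# to whole seconds for threshold comparison. Assumes lag < 24 hours.
interval_to_seconds() {
  local iv="${1%%.*}"   # drop fractional seconds
  local h m s
  IFS=: read -r h m s <<< "$iv"
  # 10# forces base-10 so "08"/"09" are not read as invalid octal
  echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Example: [ "$(interval_to_seconds '00:00:12.345678')" -gt 30 ] || echo "lag OK"
```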
### Phase 2: Decision to Failover (3-10 minutes)
Use this decision matrix to determine if failover is appropriate:
| Condition | Action |
|---|---|
| Primary unreachable, replicas healthy | Initiate failover to most current replica |
| Primary reachable but degraded | Investigate root cause first; failover if no improvement in 10 minutes |
| Replica lag > 30 seconds | Hold failover; investigate replication issues first (promoting a lagging replica loses recent writes) |
| Multiple replicas down | Assess scope; consider data loss scenarios |
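To keep decisions consistent across on-call engineers, the matrix can be encoded directly. This sketch uses illustrative inputs (a reachability flag, a healthy-replica count, and worst-case lag in seconds) rather than any real monitoring API:

```shell
#!/bin/bash
# failover-decision.sh - sketch: encode the decision matrix above.
failover_decision() {
  local primary_up="$1"        # 1 = reachable, 0 = unreachable
  local replicas_healthy="$2"  # count of healthy replicas
  local max_lag_s="$3"         # worst replica lag in seconds
  if [ "$replicas_healthy" -eq 0 ]; then
    echo "assess-scope"             # multiple replicas down: consider data loss
  elif [ "$max_lag_s" -gt 30 ]; then
    echo "investigate-replication"  # promoting now would lose recent writes
  elif [ "$primary_up" -eq 0 ]; then
    echo "failover"                 # primary down, healthy current replica exists
  else
    echo "investigate-primary"      # degraded but reachable: give it 10 minutes
  fi
}
```

Checks are ordered from most to least severe, so the scariest condition wins when several apply at once.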
**Async Decision Protocol for Distributed Teams:**
When the on-call engineer is not the primary DBA:
- Post the current situation in #database-incidents with the `?failover` tag
- Include: current metrics, time since failure, affected services
- Wait 3 minutes for objections or additional context
- If no blocking objections, proceed with failover
- Document all decisions in the incident thread
```markdown
## Failover Decision Request

**Current Status:** [Primary unreachable / High latency / etc]
**Time Since Issue:** [X minutes]
**Affected Services:** [list]
**Replication Status:** [replica-1: Xs lag, replica-2: Ys lag]
**Recommendation:** Failover to [replica name]

@oncall-dba @senior-engineer - Any objections to proceeding?
```
### Phase 3: Failover Execution (10-20 minutes)
Execute the failover using your database’s native tools. The following examples use PostgreSQL with Patroni, but adapt to your specific setup:
#### Step 1: Promote the Target Replica
```bash
#!/bin/bash
# promote-replica.sh - Promote the chosen replica to primary
REPLICA_HOST="${1:-db-replica-1.internal}"
NEW_PRIMARY="$REPLICA_HOST"
SLACK_WEBHOOK="${SLACK_WEBHOOK_URL}"

echo "Promoting $REPLICA_HOST to primary..."

# For Patroni-managed clusters: patronictl has no "promote" subcommand;
# an unplanned promotion is a forced failover to the candidate
patronictl -c /etc/patroni.yml failover --candidate "$REPLICA_HOST" --force

# For manual PostgreSQL (run locally on the replica host)
# pg_ctl promote -D /var/lib/postgresql/data

# Verify promotion succeeded: the new primary must be reachable AND
# out of recovery (pg_isready alone would pass on an unpromoted replica)
if [ "$(psql -h "$NEW_PRIMARY" -t -c 'SELECT pg_is_in_recovery();' | xargs)" = "f" ]; then
  echo "✅ Failover completed successfully"
  # Notify the team
  curl -X POST "$SLACK_WEBHOOK" \
    -H 'Content-type: application/json' \
    --data '{"text": "✅ Database failover completed. New primary: '"$NEW_PRIMARY"'"}'
else
  echo "❌ Failover verification failed"
  exit 1
fi
```
#### Step 2: Update Application Connection Strings
```bash
#!/bin/bash
# update-db-dns.sh - Point applications to new primary
NEW_PRIMARY="${1:-db-replica-1.internal}"
NEW_PRIMARY_IP=$(host "$NEW_PRIMARY" | awk '{print $NF}')

# For service discovery (Consul example)
consul kv put database/primary/host "$NEW_PRIMARY_IP"

# For environment-based deployments
aws ssm put-parameter \
  --name "/app/production/db_host" \
  --value "$NEW_PRIMARY_IP" \
  --type "String" \
  --overwrite

# Restart application pods to pick up new connections
kubectl rollout restart deployment/production-api
```
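Pods reconnect asynchronously after the restart, so it helps to gate the next phase on a health command actually succeeding rather than sleeping a fixed amount. A generic polling helper (a sketch; the `pg_isready` line at the end is just an example invocation):

```shell
#!/bin/bash
# wait-for.sh - sketch: poll a command until it succeeds or attempts run out.
# Usage: wait_for <attempts> <delay_seconds> <command...>
wait_for() {
  local attempts="$1" delay="$2"
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    sleep "$delay"
  done
  echo "gave up after $attempts attempts" >&2
  return 1
}

# Example: block up to ~60s for the new primary to accept connections
# wait_for 12 5 pg_isready -h db-replica-1.internal -p 5432
```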
#### Step 3: Verify Replication to Old Primary (Now Replica)
```bash
#!/bin/bash
# verify-new-replica.sh - Reattach the old primary as a replica and
# monitor until it catches up
OLD_PRIMARY="db-primary.internal"
NEW_PRIMARY="db-replica-1.internal"

# Configure the old primary as a replica of the new primary.
# This depends on your specific replication setup; with streaming
# replication, run pg_rewind on the old primary's host, then start
# it as a standby:
# pg_rewind --target-pgdata=/var/lib/postgresql/data \
#   --source-server="host=$NEW_PRIMARY user=replication"
# touch /var/lib/postgresql/data/standby.signal
# pg_ctl start -D /var/lib/postgresql/data

# Monitor until caught up (lag in whole seconds; exact zero is rare)
while true; do
  LAG=$(psql -h "$OLD_PRIMARY" -t -c "SELECT COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()), 0)::int;" | xargs)
  echo "Replication lag: ${LAG}s"
  if [ "$LAG" -le 1 ]; then
    echo "✅ Fully caught up"
    break
  fi
  sleep 5
done
```
### Phase 4: Post-Failover Verification (20-30 minutes)
Run checks to ensure the failover was successful:
```bash
#!/bin/bash
# post-failover-verification.sh
NEW_PRIMARY="${1:-db-replica-1.internal}"

echo "=== Post-Failover Verification ==="

# 1. Basic connectivity
echo "1. Testing connectivity..."
pg_isready -h "$NEW_PRIMARY" || exit 1

# 2. Query execution
echo "2. Testing query execution..."
psql -h "$NEW_PRIMARY" -t -c "SELECT 1;" | grep -q "1" || exit 1

# 3. Application smoke test
echo "3. Running application smoke tests..."
SMOKE=$(curl -s https://api.yourapp.com/health/db)
echo "$SMOKE" | grep -q "healthy" || exit 1

# 4. Test write operations
echo "4. Testing write operations..."
WRITE_TEST=$(psql -h "$NEW_PRIMARY" -c "CREATE TABLE IF NOT EXISTS failover_test (id SERIAL, ts TIMESTAMP DEFAULT NOW()); DROP TABLE failover_test;" 2>&1)
echo "$WRITE_TEST" | grep -q "DROP" || exit 1

# 5. Verify backups can run against the new primary
echo "5. Checking backup jobs..."
pg_dump -h "$NEW_PRIMARY" -F c -f /tmp/backup_test.dump || exit 1
rm -f /tmp/backup_test.dump

echo "✅ All verification checks passed"
```
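Exiting on the first failed check keeps the script simple, but it hides later failures from whoever reads the incident thread. An alternative sketch runs every check and reports a summary; check names and commands are read as `name command` lines and are illustrative only:

```shell
#!/bin/bash
# run-checks.sh - sketch: run all verification checks and report each
# result instead of stopping at the first failure.
run_checks() {
  local failed=0 name cmd
  # stdin: one "<name> <command...>" per line
  while read -r name cmd; do
    if eval "$cmd" >/dev/null 2>&1; then
      echo "PASS $name"
    else
      echo "FAIL $name"
      failed=1
    fi
  done
  return "$failed"
}

# Example:
# printf 'connectivity pg_isready -h db-replica-1.internal\n' | run_checks
```

Pasting the full PASS/FAIL summary into the incident channel gives remote teammates the complete picture in one message.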
### Phase 5: Incident Documentation and Follow-Up
Create an incident report within 24 hours of the failover:
```markdown
## Database Failover Incident Report

**Date:** [ISO timestamp]
**Duration:** [start to full resolution]
**Root Cause:** [what caused the original failure]

### Timeline
- 02:15 - Alert triggered: primary unreachable
- 02:18 - Failover decision made (async approval)
- 02:25 - Failover completed
- 02:35 - All systems verified

### What Went Well
- [List successful aspects]

### What Needs Improvement
- [List action items]

### Action Items
- [ ] Schedule post-mortem review
- [ ] Update runbook with lessons learned
- [ ] Review monitoring thresholds
```
## Key Principles for Remote Team Database Failovers
**Automate detection but humanize decisions.** Automated monitoring should catch issues immediately, but failover execution benefits from human oversight when time permits. Structure your runbook so that clear-cut cases can proceed automatically while ambiguous situations trigger async escalation.

**Document your topology in code.** Keep your database configuration in version control alongside your application code. This ensures every team member can access current infrastructure information regardless of time zone.

**Practice the runbook regularly.** Schedule quarterly failover drills. Test the process with a non-production database to identify gaps before real incidents expose them.

**Establish clear ownership rotation.** Ensure that failover authority is not limited to a single person. Train multiple team members and rotate on-call schedules to provide coverage across time zones.
## Related Articles
- Remote Team Runbook Template for Deploying Hotfix to
- Remote Team Runbook Template for SSL Certificate Renewal
- How to Set Up Reliable Backup Internet for Remote Work
Built by theluckystrike — More at zovo.one