How to Create Automated Rollback Systems
A deployment that breaks at 3 AM and waits for an on-call engineer to wake up and roll back is a deployment with two failure windows: the failure itself, and the time-to-rollback. Automated rollback closes the second window by detecting the failure and reverting without human intervention.
Kubernetes: Built-In Rollback with Health Gates
Kubernetes has rollback built in via kubectl rollout undo. The automation goal is to trigger it based on health conditions, not manual observation.
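The underlying commands are simple; everything else in this article is orchestration around them. A minimal sketch (deployment and namespace names are placeholders) that composes the undo command as a string so a wrapper can log or dry-run it before executing:

```shell
#!/bin/bash
# Sketch of the primitive this article automates. build_undo_cmd only
# composes the kubectl command line; "myapp" and "production" are
# placeholder names.
build_undo_cmd() {
  local deployment="$1" namespace="$2" revision="${3:-}"
  local cmd="kubectl rollout undo deployment/${deployment} -n ${namespace}"
  if [[ -n "$revision" ]]; then
    cmd+=" --to-revision=${revision}"
  fi
  echo "$cmd"
}

# List revisions first, then revert:
#   kubectl rollout history deployment/myapp -n production
build_undo_cmd myapp production      # revert to the previous revision
build_undo_cmd myapp production 3    # revert to a specific revision from history
```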
Deployment with proper health checks:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  namespace: production
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0  # never kill old pods before new ones are ready
      maxSurge: 1        # spin up one extra pod during update
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp  # must match spec.selector.matchLabels
    spec:
      containers:
        - name: myapp
          image: ghcr.io/yourorg/myapp:v1.2.3  # pin a tag; rollback is ambiguous with :latest
          readinessProbe:
            httpGet:
              path: /ready
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 15
            periodSeconds: 10
            failureThreshold: 3
Automated rollback after a failed rollout:
#!/bin/bash
# k8s-deploy-with-rollback.sh
set -euo pipefail
DEPLOYMENT="${1:?Usage: $0 <deployment> <image>}"
IMAGE="${2:?}"
NAMESPACE="${3:-production}"
TIMEOUT="${4:-300}"
echo "Deploying $IMAGE to $DEPLOYMENT in $NAMESPACE"
# Record current image for rollback reference
CURRENT_IMAGE=$(kubectl get deployment "$DEPLOYMENT" \
  -n "$NAMESPACE" \
  -o jsonpath='{.spec.template.spec.containers[0].image}')

# Apply new image (assumes the container is named after the deployment)
kubectl set image deployment/"$DEPLOYMENT" \
  "${DEPLOYMENT}=${IMAGE}" \
  -n "$NAMESPACE"

# Wait for rollout with timeout
if ! kubectl rollout status deployment/"$DEPLOYMENT" \
    -n "$NAMESPACE" \
    --timeout="${TIMEOUT}s"; then
  echo "ERROR: Rollout failed — rolling back to $CURRENT_IMAGE"
  kubectl rollout undo deployment/"$DEPLOYMENT" -n "$NAMESPACE"
  kubectl rollout status deployment/"$DEPLOYMENT" -n "$NAMESPACE" --timeout=120s

  # Notify Slack (guarded: an unset webhook would abort the script under set -u)
  if [[ -n "${SLACK_WEBHOOK_URL:-}" ]]; then
    curl -s -X POST "$SLACK_WEBHOOK_URL" \
      -H "Content-Type: application/json" \
      -d "{\"text\": \":rotating_light: Deployment FAILED and ROLLED BACK\nDeployment: \`$DEPLOYMENT\`\nFailed image: \`$IMAGE\`\nReverted to: \`$CURRENT_IMAGE\`\"}"
  fi
  exit 1
fi
echo "Deployment successful: $IMAGE"
Kubernetes: Prometheus-Based Rollback
Roll back based on error rate crossing a threshold after deploy — more reliable than timeouts alone:
#!/bin/bash
# prometheus-gate.sh — post-deploy error rate check
set -euo pipefail
PROM_URL="${PROMETHEUS_URL:-http://prometheus:9090}"
DEPLOYMENT="${1:?Usage: $0 <deployment>}"
THRESHOLD="0.05" # 5% error rate
OBSERVATION_WINDOW="5m"
WAIT_SECONDS=120 # wait 2 min after deploy before checking
echo "Waiting ${WAIT_SECONDS}s for metrics to stabilize..."
sleep "$WAIT_SECONDS"
# Query 5-minute error rate for this deployment
QUERY="sum(rate(http_requests_total{deployment=\"${DEPLOYMENT}\",status=~\"5..\"}[${OBSERVATION_WINDOW}])) / sum(rate(http_requests_total{deployment=\"${DEPLOYMENT}\"}[${OBSERVATION_WINDOW}]))"
ERROR_RATE=$(curl -s "${PROM_URL}/api/v1/query" \
  --data-urlencode "query=${QUERY}" \
  | jq -r '.data.result[0].value[1] // "0"')
echo "Current error rate: $ERROR_RATE (threshold: $THRESHOLD)"
if (( $(echo "$ERROR_RATE > $THRESHOLD" | bc -l) )); then
  echo "ERROR RATE ABOVE THRESHOLD — triggering rollback"
  kubectl rollout undo deployment/"$DEPLOYMENT" -n production
  kubectl rollout status deployment/"$DEPLOYMENT" -n production --timeout=120s
  exit 1
fi
echo "Error rate OK — deployment accepted"
Docker Compose: Health-Check Rollback
For non-Kubernetes deployments with Docker Compose:
#!/bin/bash
# compose-deploy.sh
set -euo pipefail
COMPOSE_FILE="${1:-/opt/myapp/docker-compose.yml}"
SERVICE="${2:?Usage: $0 <compose-file> <service>}"
HEALTH_URL="${3:?}" # e.g., http://localhost:3000/health
MAX_WAIT=120
# Save current image tag (docker-compose config indents service names two spaces)
CURRENT_IMAGE=$(docker-compose -f "$COMPOSE_FILE" config | \
  awk "/^  ${SERVICE}:/{found=1} found && /image:/{print \$2; exit}")
echo "Pulling new image for $SERVICE"
docker-compose -f "$COMPOSE_FILE" pull "$SERVICE"
NEW_IMAGE=$(docker-compose -f "$COMPOSE_FILE" config | \
  awk "/^  ${SERVICE}:/{found=1} found && /image:/{print \$2; exit}")
if [[ "$CURRENT_IMAGE" == "$NEW_IMAGE" ]]; then
  echo "No new image available — skipping"
  exit 0
fi
echo "Upgrading $SERVICE: $CURRENT_IMAGE -> $NEW_IMAGE"
docker-compose -f "$COMPOSE_FILE" up -d --no-deps "$SERVICE"
# Poll health endpoint
elapsed=0
while [[ $elapsed -lt $MAX_WAIT ]]; do
  status=$(curl -s -o /dev/null -w "%{http_code}" "$HEALTH_URL" 2>/dev/null || echo "000")
  if [[ "$status" == "200" ]]; then
    echo "Health check passed after ${elapsed}s"
    docker image prune -f
    exit 0
  fi
  sleep 5
  elapsed=$((elapsed + 5))
  echo "Waiting for health check... ${elapsed}s (HTTP $status)"
done
# Rollback
echo "Health check timed out — rolling back to $CURRENT_IMAGE"
# Override image in a temp compose file and re-deploy
# (assumes the image tag appears literally in the compose file, not via env interpolation)
sed "s|$NEW_IMAGE|$CURRENT_IMAGE|" "$COMPOSE_FILE" > /tmp/rollback-compose.yml
docker-compose -f /tmp/rollback-compose.yml up -d --no-deps "$SERVICE"
if [[ -n "${SLACK_WEBHOOK_URL:-}" ]]; then
  curl -s -X POST "$SLACK_WEBHOOK_URL" \
    -H "Content-Type: application/json" \
    -d "{\"text\": \":warning: Deploy FAILED on \`$(hostname)\` for \`$SERVICE\` — reverted to \`$CURRENT_IMAGE\`\"}"
fi
rm /tmp/rollback-compose.yml
exit 1
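The Compose script lends itself to scheduled pull-based upgrades. A hypothetical cron entry (paths, user, and log location are assumptions for illustration):

```shell
# /etc/cron.d/myapp-deploy: poll for a new image every 15 minutes
# (hypothetical paths; export SLACK_WEBHOOK_URL in the environment for notifications)
*/15 * * * * deploy /opt/myapp/compose-deploy.sh /opt/myapp/docker-compose.yml myapp http://localhost:3000/health >> /var/log/myapp-deploy.log 2>&1
```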
Lambda: Version Aliases with Automatic Traffic Shift
AWS Lambda supports weighted aliases. Deploy to a new version, shift traffic gradually, roll back if error rate rises:
#!/bin/bash
# lambda-deploy.sh
set -euo pipefail
FUNCTION_NAME="${1:?}"
NEW_ZIP="${2:?}"
REGION="${AWS_DEFAULT_REGION:-us-east-1}"
# Deploy new version
aws lambda update-function-code \
  --function-name "$FUNCTION_NAME" \
  --zip-file "fileb://$NEW_ZIP" \
  --region "$REGION"
aws lambda wait function-updated \
  --function-name "$FUNCTION_NAME" \
  --region "$REGION"
NEW_VERSION=$(aws lambda publish-version \
  --function-name "$FUNCTION_NAME" \
  --region "$REGION" \
  --query 'Version' --output text)
CURRENT_VERSION=$(aws lambda get-alias \
  --function-name "$FUNCTION_NAME" \
  --name production \
  --region "$REGION" \
  --query 'FunctionVersion' --output text 2>/dev/null || echo "1")
echo "New version: $NEW_VERSION, Current: $CURRENT_VERSION"

# Update alias to point 10% of traffic at the new version
aws lambda update-alias \
  --function-name "$FUNCTION_NAME" \
  --name production \
  --function-version "$CURRENT_VERSION" \
  --routing-config "AdditionalVersionWeights={$NEW_VERSION=0.1}" \
  --region "$REGION"
echo "Observing for 5 minutes at 10% traffic..."
sleep 300
# Check error rate in CloudWatch
# (--query is JMESPath, so defaults use ||, not jq's // operator)
ERROR_RATE=$(aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name Errors \
  --dimensions "Name=FunctionName,Value=$FUNCTION_NAME" "Name=Resource,Value=$FUNCTION_NAME:$NEW_VERSION" \
  --start-time "$(date -u -d '5 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Sum \
  --query 'Datapoints[0].Sum || `0`' \
  --output text \
  --region "$REGION")
INVOCATIONS=$(aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name Invocations \
  --dimensions "Name=FunctionName,Value=$FUNCTION_NAME" "Name=Resource,Value=$FUNCTION_NAME:$NEW_VERSION" \
  --start-time "$(date -u -d '5 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Sum \
  --query 'Datapoints[0].Sum || `1`' \
  --output text \
  --region "$REGION")
# Guard against division by zero when the new version saw no traffic
if (( $(echo "$INVOCATIONS == 0" | bc -l) )); then INVOCATIONS=1; fi
RATE=$(echo "scale=4; $ERROR_RATE / $INVOCATIONS" | bc)
echo "Error rate on new version: $RATE"
if (( $(echo "$RATE > 0.02" | bc -l) )); then
  echo "ERROR RATE TOO HIGH — rolling back"
  aws lambda update-alias \
    --function-name "$FUNCTION_NAME" \
    --name production \
    --function-version "$CURRENT_VERSION" \
    --routing-config 'AdditionalVersionWeights={}' \
    --region "$REGION"
  exit 1
fi
# Promote to 100%
aws lambda update-alias \
  --function-name "$FUNCTION_NAME" \
  --name production \
  --function-version "$NEW_VERSION" \
  --routing-config 'AdditionalVersionWeights={}' \
  --region "$REGION"
echo "Deployment complete: version $NEW_VERSION at 100%"
GitHub Actions: Automated Rollback Gate in CI/CD
Integrating rollback directly into your GitHub Actions pipeline catches failures before they fully propagate:
# .github/workflows/deploy.yml
name: Deploy with Auto-Rollback

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4

      - name: Configure kubectl
        uses: azure/k8s-set-context@v3
        with:
          kubeconfig: ${{ secrets.KUBECONFIG }}

      - name: Deploy new image
        id: deploy
        run: |
          NEW_IMAGE="ghcr.io/${{ github.repository }}:${{ github.sha }}"
          CURRENT_IMAGE=$(kubectl get deployment myapp -n production \
            -o jsonpath='{.spec.template.spec.containers[0].image}')
          echo "current-image=$CURRENT_IMAGE" >> "$GITHUB_OUTPUT"
          kubectl set image deployment/myapp myapp="$NEW_IMAGE" -n production

      - name: Wait for rollout
        id: rollout
        run: |
          if ! kubectl rollout status deployment/myapp -n production --timeout=5m; then
            echo "rollout-failed=true" >> "$GITHUB_OUTPUT"
          else
            echo "rollout-failed=false" >> "$GITHUB_OUTPUT"
          fi

      - name: Rollback on failure
        if: steps.rollout.outputs.rollout-failed == 'true'
        run: |
          echo "::error::Rollout failed — rolling back"
          kubectl rollout undo deployment/myapp -n production
          kubectl rollout status deployment/myapp -n production --timeout=120s
          exit 1

      - name: Post-deploy smoke test
        run: |
          sleep 10
          STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
            "https://myapp.example.com/health")
          if [[ "$STATUS" != "200" ]]; then
            echo "::error::Smoke test failed (HTTP $STATUS) — rolling back"
            kubectl rollout undo deployment/myapp -n production
            exit 1
          fi
          echo "Smoke test passed: HTTP $STATUS"

      - name: Notify Slack on rollback
        if: failure()
        uses: slackapi/slack-github-action@v1.27.0
        with:
          payload: |
            {
              "text": ":rotating_light: Deploy FAILED and ROLLED BACK",
              "blocks": [{
                "type": "section",
                "text": {
                  "type": "mrkdwn",
                  "text": "*Deploy failed*\nRepo: `${{ github.repository }}`\nCommit: `${{ github.sha }}`\nReverted to: `${{ steps.deploy.outputs.current-image }}`\n<${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}|View run>"
                }
              }]
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
Rollback Decision Matrix
Not all failures should trigger automatic rollback. Use a matrix to decide:
| Signal | Threshold | Action |
|---|---|---|
| Readiness probe fails | Pod stays unready > 5 min | Auto rollback |
| HTTP 5xx error rate | > 5% over 5 min | Auto rollback |
| p99 latency | > 3× baseline over 5 min | Alert + manual decision |
| CPU usage | > 90% sustained | Scale up, not rollback |
| Memory leak | Slow growth over hours | Alert, schedule fix |
| Deployment timeout | Pods not ready in N min | Auto rollback |
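The matrix translates directly into a routing function for an alert handler. A sketch in bash (signal names are illustrative; map your real alert labels onto them):

```shell
#!/bin/bash
# Route a monitoring signal to a response, mirroring the decision matrix.
# Unknown signals deliberately fall through to a human decision rather
# than an automatic rollback.
decide_action() {
  case "$1" in
    readiness_failure|error_rate_5xx|deploy_timeout) echo "auto-rollback" ;;
    latency_p99)    echo "alert-manual-decision" ;;
    cpu_saturation) echo "scale-up" ;;
    memory_leak)    echo "alert-schedule-fix" ;;
    *)              echo "alert-manual-decision" ;;
  esac
}

decide_action error_rate_5xx   # prints: auto-rollback
decide_action latency_p99      # prints: alert-manual-decision
```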
For latency spikes, automatic rollback can mask underlying capacity problems rather than solve them. Prefer alerting and manual decision for ambiguous signals, and reserve automatic rollback for clear failure signals: readiness failures, hard error rate thresholds, and deployment timeouts.
Keep rollback procedures in your team runbook so the on-call engineer knows what automatic rollback covers and what still requires manual intervention:
# ops/runbook/rollback.md quick reference
# What rolls back automatically:
# - k8s readiness failures (kubectl rollout undo)
# - Docker Compose health check timeout
# - Lambda error rate > 2% during 10% canary
# - GitHub Actions post-deploy smoke test failure
#
# What requires manual rollback:
# - Database migrations (requires migration revert script)
# - Feature flag changes (toggle in LaunchDarkly)
# - S3 / external config changes
# - CDN cache / edge config changes
Related Reading
- How to Set Up Keel for Continuous Delivery
- How to Automate Docker Container Updates
- How to Set Up ArgoCD for GitOps Workflows
Built by theluckystrike — More at zovo.one