Distributed teams face a structural problem: production incidents happen at 3am in someone's timezone. Forcing the whole team into 24/7 on-call coverage burns people out. Building a rotation that respects timezones, distributes load fairly, and compensates appropriately requires deliberate design.
Why Remote On-Call is Harder
Co-located teams can get away with “let’s rotate who stays late.” Distributed teams have to solve:
- Timezone fairness: APAC engineer shouldn’t carry US incident load just because US is HQ
- Context switching: Waking at 3am to debug unfamiliar code is slower than during business hours
- Burnout metrics: How many on-call weeks per year is fair? (Industry standard: 1 week per 8 weeks)
- Compensation: If on-call at night, do we pay per incident? Per week? Do we give comp time?
- Escalation boundaries: When does on-call engineer wake up manager? When do we page the CEO?
Poor rotations create hero culture: “Sarah always saves us, let’s keep her on-call” is how you lose Sarah.
On-Call Rotation Models
Model 1: Single Engineer On-Call (One Person)
Setup: One person on-call for 1 week at a time. Rotates every Monday.
Pros:
- Simple to manage (8 people = ~6.5 on-call weeks/year each)
- Clear ownership
- Incidents flow to one person (no confusion)
Cons:
- Brutal for remote teams (covers all timezones)
- Sleep deprivation if incidents cluster at night
- High burnout
When to use: Teams < 8 people, or SLA allows 4-hour response time (non-critical incidents)
Compensation: Each on-call week = 1 comp day off the following week. Incidents that wake you = $100-300 per incident (on top of salary) or 4 hours comp time.
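A single-engineer weekly rotation is simple enough to compute directly rather than track by hand. A minimal sketch (the names and anchor date are hypothetical):

```python
from datetime import date

def on_call_for(day: date, roster: list[str], anchor: date) -> str:
    """Return who is on-call on `day`, rotating weekly from `anchor` (a Monday)."""
    weeks_elapsed = (day - anchor).days // 7
    return roster[weeks_elapsed % len(roster)]

roster = ["Ana", "Ben", "Chloe", "Dan"]   # hypothetical 4-person team
anchor = date(2024, 1, 1)                 # a Monday: rotation start
print(on_call_for(date(2024, 1, 3), roster, anchor))  # week 0 -> Ana
print(on_call_for(date(2024, 1, 8), roster, anchor))  # week 1 -> Ben
```

With 4 people this means one on-call week per month; with 8 it stretches to one per 8 weeks, which is where the industry-standard cadence comes from.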
Model 2: Primary + Escalation Engineer (Two Person)
Setup: Engineer A is primary (pages first), Engineer B is escalation (pages if A unresponsive for 15 minutes).
Pros:
- Backup if primary is sick/unavailable
- Shared load (A takes incident, B monitors)
- Better for 8-16 person teams
Cons:
- Escalation engineer also loses sleep
- Still covers all timezones if primary on-call for full week
When to use: Teams 8-16 people, critical infrastructure
Compensation: Primary takes $150-250/incident or 2h comp time. Escalation (if paged) takes $50-100 or 1h comp time. Both get 1 comp day per week.
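The primary/escalation handoff is a small piece of paging logic. A sketch of the 15-minute ack window described above:

```python
def who_to_page(minutes_since_page: int, acked: bool) -> str:
    """Model 2 escalation: the primary is paged immediately; the escalation
    engineer is paged only if the primary has not acked within 15 minutes."""
    if acked:
        return "nobody"          # incident is being handled
    if minutes_since_page < 15:
        return "primary"         # still inside the primary's ack window
    return "escalation"

print(who_to_page(5, acked=False))    # primary
print(who_to_page(20, acked=False))   # escalation
```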
Model 3: Timezone-Based Rotation (Multiple Regions)
Setup:
- US team: 5pm PT → 8am PT (15 hours, covers US evening + morning)
- EU team: 8am CET → 5pm CET (9 hours, covers EU business + overlap)
- APAC team: 5pm SGT → 8am SGT (15 hours, overnight)
Rotate by week, but each person only on-call during their business hours + night.
Pros:
- Incidents during your waking hours (faster response)
- Far less deep-night paging than a single global rotation
- Fair load distribution
Cons:
- Complex scheduling (need to track 3+ time zones)
- Coverage gaps: what happens when an incident lands between one region’s window ending and the next beginning?
- Escalation paths must be clear
When to use: Teams 12+ with global distribution, critical SLA (99.99%)
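The three windows above can be resolved programmatically. A sketch using fixed winter UTC offsets (PT = UTC-8, CET = UTC+1, SGT = UTC+8); this deliberately ignores DST, so a real implementation should use proper timezone data instead:

```python
# Each window in local hours, plus the region's fixed UTC offset.
# DST is ignored here - an assumption to keep the sketch simple.
WINDOWS = {
    "US":   {"offset": -8, "start": 17, "end": 8},   # 5pm-8am PT
    "EU":   {"offset": +1, "start": 8,  "end": 17},  # 8am-5pm CET
    "APAC": {"offset": +8, "start": 17, "end": 8},   # 5pm-8am SGT
}

def regions_on_call(utc_hour: int) -> list[str]:
    """Which regions' windows cover this UTC hour? Windows that wrap
    past midnight (start > end) are handled explicitly."""
    covering = []
    for region, w in WINDOWS.items():
        local = (utc_hour + w["offset"]) % 24
        if w["start"] < w["end"]:
            in_window = w["start"] <= local < w["end"]
        else:  # wraps past midnight
            in_window = local >= w["start"] or local < w["end"]
        if in_window:
            covering.append(region)
    return covering

print(regions_on_call(2))   # -> ['US']  (2:00 UTC = 6pm PT)
```

Running this across all 24 hours is a quick way to spot coverage gaps and overlaps before committing the schedule to a paging tool.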
Tool setup:
- PagerDuty: Create schedule with time windows per timezone
- Opsgenie (free tier): Same capability
- Google Calendar: Manual but free
Example escalation (incident at 5pm PT; with winter offsets, 2am CET and 9am SGT are the same moment):
- 5pm PT (US): Incident pages US engineer
- 2am CET: If the US engineer is unresponsive, pages the on-call manager (US-based, still awake at 5pm local)
- 9am SGT: If still unresolved, pages APAC escalation (morning there)
Compensation:
- On-call during business hours: Included in salary (no extra pay)
- On-call overnight (9pm-7am in your timezone): $200-400/week or 8h comp time
- Each escalation page: Additional $50-100
Model 4: Follow-the-Sun Rotation
Setup: Incident ownership passes from timezone to timezone as the Earth rotates.
- 5pm PT: US engineer works the incident through their day, documents everything at end of shift
- 8am CET (next EU morning): EU engineer reads the notes, continues investigation
- 8am SGT: APAC engineer continues, escalates to US if critical
Pros:
- Each incident is investigated during business hours
- No night-time incident response (except critical)
- Excellent knowledge transfer
Cons:
- Requires async handoff discipline (clear notes between shifts)
- Slower incident resolution (may take 24 hours for complex issues)
- Not suitable for critical incidents
When to use: Teams with SLA > 4 hours (e.g., non-critical infrastructure, batch jobs)
Tool setup:
- Incident post-mortem document: Start with US notes, EU adds findings, APAC adds resolution
- Jira/GitHub Issues: Tag by timezone owner
- Slack thread: Async updates with @timezone-owner mentions
Compensation: No emergency pay (all business hours), just standard salary.
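Follow-the-sun lives or dies on handoff discipline, so it helps to make the handoff note a fixed structure rather than freeform prose. A sketch; the field names here are suggestions, not a standard:

```python
from dataclasses import dataclass, field

@dataclass
class HandoffNote:
    """Minimal follow-the-sun shift handoff structure (hypothetical schema)."""
    incident_id: str
    from_region: str
    to_region: str
    status: str                                   # e.g. "investigating", "mitigated"
    findings: list[str] = field(default_factory=list)
    next_steps: list[str] = field(default_factory=list)

note = HandoffNote("INC-42", "US", "EU", "investigating",
                   findings=["error rate spiked at 4:10pm PT"],
                   next_steps=["check yesterday's deploy diff"])
print(note.to_region, note.status)   # EU investigating
```

The same structure maps cleanly onto a shared post-mortem doc or a Jira/GitHub issue template, which is where most teams would actually keep it.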
Tool Comparison
PagerDuty
Price: $49/user/month (Standard), $199/user/month (Enterprise)
Best for: Critical infrastructure, large teams
Features:
- Time-window schedules (8am-5pm PT, 5pm PT-8am PT)
- Escalation policies (primary → escalation → manager)
- On-call analytics (who takes most incidents, busiest shift)
- Mobile app with push notifications
- Integration: Slack, email, SMS, phone calls
Example config:
- Create schedule “US-On-Call” (5pm PT - 8am PT)
- Create schedule “EU-On-Call” (8am CET - 5pm CET)
- Set escalation: If US doesn’t ack in 5 min, escalate to EU
- Rotation: 1-week on-call per person
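The config above can also be driven through PagerDuty's REST API (`POST https://api.pagerduty.com/schedules`). A sketch that only builds the request body; the field names follow REST API v2 as I understand it, and the user IDs are placeholders, so verify against current PagerDuty docs before relying on this:

```python
def weekly_schedule_payload(name: str, time_zone: str, user_ids: list[str]) -> dict:
    """Build the JSON body for creating a weekly-rotation PagerDuty schedule."""
    return {
        "schedule": {
            "type": "schedule",
            "name": name,
            "time_zone": time_zone,
            "schedule_layers": [{
                "start": "2024-01-01T17:00:00-08:00",                 # 5pm PT
                "rotation_virtual_start": "2024-01-01T17:00:00-08:00",
                "rotation_turn_length_seconds": 7 * 24 * 3600,        # 1-week turns
                "users": [{"user": {"id": uid, "type": "user_reference"}}
                          for uid in user_ids],
            }],
        }
    }

payload = weekly_schedule_payload("US-On-Call", "America/Los_Angeles",
                                  ["PUSER1", "PUSER2"])  # placeholder IDs
print(payload["schedule"]["schedule_layers"][0]["rotation_turn_length_seconds"])
```

Keeping the schedule definition in code makes the rotation reviewable in a pull request instead of living only in the PagerDuty UI.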
Cost analysis: $49 × 8 people = $392/month. High but justified for critical systems.
Opsgenie (Atlassian)
Price: Free tier (1 on-call schedule, 5 users), $29/user/month (Standard)
Best for: Teams 5-20, budget-conscious
Features:
- Time-window schedules
- Escalation policies
- Mobile app
- Slack integration
- Cheaper than PagerDuty
Example config:
- Create schedule with time windows
- Set escalation policy: email (15 min) → Slack (30 min) → SMS (60 min)
- Rotation pattern: 1-week on-call
Cost analysis: Free tier covers 1 schedule + 5 people (sufficient for small teams). Paid $29 × 8 = $232/month if you outgrow free.
Google Calendar + Slack Bot
Price: Free (if already using Google Workspace + Slack)
Best for: Teams < 10, simple rotations
Setup:
- Create shared “On-Call” calendar
- Add events: “US On-Call: John (5pm-8am)” with color coding
- Slack bot reads the calendar and posts who’s on duty to #on-call
Example:
John: Mon-Sun 5pm PT - 8am PT
Sarah: Mon-Sun 5pm CET - 8am CET
Mike: Mon-Sun 5pm SGT - 8am SGT
Repeat next week
Cost analysis: Free if you have Google Workspace. No SMS/escalation automation.
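The Slack side of this setup is little more than string formatting. A sketch of the daily post (in practice the region-to-engineer mapping would be read from the shared “On-Call” calendar, which is omitted here):

```python
def on_call_message(entries: dict[str, str]) -> str:
    """Format the daily #on-call post from a region -> engineer mapping."""
    lines = [f"- {region}: {name}" for region, name in entries.items()]
    return "Who's on duty today:\n" + "\n".join(lines)

print(on_call_message({"US": "John", "EU": "Sarah", "APAC": "Mike"}))
```

The resulting string can be sent with a plain Slack incoming webhook; there is no escalation automation here, which is the real cost of the free option.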
Grafana OnCall
Price: Free (basic), $240/month (Pro)
Best for: Teams already using Grafana, good observability integration
Features:
- Tight integration with Grafana alerts
- Escalation policies
- Mobile app
- Cheaper than PagerDuty
Cost analysis: $240/month (fixed, not per-user). Good ROI for Grafana-heavy shops.
Setting Up Timezone-Based Rotation (Recommended)
Scenario: 12-person team across US, EU, APAC
Step 1: Define on-call windows
US: 5pm PT - 8am PT (15h, covers evening + morning)
EU: 8am CET - 5pm CET (9h, covers business + overlap)
APAC: 5pm SGT - 8am SGT (15h, covers evening + morning)
Step 2: Define rotation
Each person does 1 on-call week per 8 weeks ≈ 6.5 weeks/year.
Week 1: US: John | EU: Sarah | APAC: Mike
Week 2: US: Jessica | EU: Philipp | APAC: Kim
...
Week 8: (repeat)
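The week-by-week table above is easy to generate rather than maintain by hand. A sketch; the names beyond those in the table are made up, and note that with 4 people per region each person recurs every `len(team)` weeks:

```python
def build_roster(regions: dict[str, list[str]], weeks: int) -> list[dict[str, str]]:
    """Week-by-week assignment: each region rotates through its own engineers."""
    return [
        {region: people[w % len(people)] for region, people in regions.items()}
        for w in range(weeks)
    ]

regions = {
    "US":   ["John", "Jessica", "Tom", "Lena"],    # hypothetical beyond the table
    "EU":   ["Sarah", "Philipp", "Marta", "Noor"],
    "APAC": ["Mike", "Kim", "Priya", "Wei"],
}
for week, assignment in enumerate(build_roster(regions, 2), start=1):
    print(f"Week {week}: {assignment}")
```

Dumping the output into the shared calendar (or a paging tool's API) keeps the source of truth in one reviewable place.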
Step 3: Set compensation
- On-call during the evening (5pm-10pm = 5h): No extra pay
- On-call overnight (10pm-7am = 9h): $300/week or 8h comp time
- Each incident page: +$50-100 or +1h comp time
- Critical incident (requires 3+ hours): +$200 or +8h comp time
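The Step 3 numbers reduce to simple arithmetic, which is worth encoding so payroll and engineers compute the same figure. A sketch using the midpoint ($75) of the $50-100 per-page range:

```python
def weekly_on_call_pay(pages: int, critical_incidents: int) -> int:
    """Weekly on-call pay in dollars, per the Step 3 rates:
    $300 overnight base, +$75 per incident page (midpoint of $50-100),
    +$200 per critical (3+ hour) incident."""
    return 300 + 75 * pages + 200 * critical_incidents

print(weekly_on_call_pay(pages=2, critical_incidents=1))  # 300 + 150 + 200 = 650
```

Engineers who prefer comp time instead would substitute the hour equivalents listed above.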
Step 4: Set escalation policy
1. Primary on-call (5min to ack)
↓
2. Escalation engineer (10min timeout)
↓
3. On-call manager (20min timeout)
↓
4. VP Engineering (hard page)
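The chain above, with its cumulative timeouts, can be expressed as a small lookup:

```python
# Each level gets its timeout window before the page moves down the chain.
ESCALATION_CHAIN = [
    ("primary on-call", 5),
    ("escalation engineer", 10),
    ("on-call manager", 20),
]

def current_pagee(minutes_unacked: int) -> str:
    """Who is being paged after this many minutes without an ack?"""
    elapsed = 0
    for role, timeout in ESCALATION_CHAIN:
        elapsed += timeout
        if minutes_unacked < elapsed:
            return role
    return "VP Engineering"   # hard page once all timeouts expire

print(current_pagee(3))    # primary on-call
print(current_pagee(12))   # escalation engineer
print(current_pagee(40))   # VP Engineering
```

This is exactly the policy a tool like PagerDuty or Opsgenie encodes for you; writing it out once makes the Step 5 test alert easy to sanity-check.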
Step 5: Test in PagerDuty or Opsgenie
- Create 3 schedules (US, EU, APAC)
- Add 4 people to each (rotating weekly)
- Create escalation policy chaining them
- Send test alert from Slack/monitoring system
Red Flags in On-Call Rotations
Red flag 1: One person takes 3x incidents per rotation
Indicates: Unfair incident distribution, or that person is better at fixing things (promote them, don’t burn them out).
Fix: Analyze incident sources, fix root causes, rotate on-call.
Red flag 2: On-call engineer sleeps with pager, then works full day
Indicates: Team is unsustainable, incidents cluster at night.
Fix: Give comp time (a night of broken sleep = the next day off), reduce on-call frequency, or hire contractor coverage for night shifts.
Red flag 3: On-call engineer avoids “on-call week” by calling in sick
Indicates: Rotation is unfair, compensation is inadequate, or culture is blame-heavy.
Fix: Review rotations for fairness, increase compensation, fix blame culture (blameless post-mortems).
Red flag 4: Manager is never on-call
Indicates: Culture problem (managers avoid pain), or you’re protecting managers from reality.
Fix: Managers take on-call rotation. They understand incidents better, and it builds empathy.
Compensation Framework
| Scenario | Payment Model |
|---|---|
| On-call during business hours (no incidents) | Included in salary, no extra |
| On-call during business hours (1-2 incidents) | +$100-150 for the week |
| On-call during night (9pm-7am, no incidents) | $300-400/week or 8h comp time |
| On-call during night (1+ incidents) | $400-600/week or 12h comp time |
| Critical incident (3+ hours, wakes you at 3am) | $200-300 + 8h comp time |
| On-call manager (escalation calls only) | $500/week (usually salary bump) |
Why comp time > cash?: Some people prefer time off. Offer both, let them choose.
FAQ
Q: What’s a fair on-call frequency? A: 1 week per 8 weeks (~6.5 weeks/year) is the industry standard. 1 week per 6 weeks is aggressive; 1 week per 12 weeks is cushy.
Q: Do we page the CEO? A: Only for critical incidents: “service down for all users” or “data loss in progress”. Not for “p99 latency is high” or “one customer affected”.
Q: What if someone refuses on-call? A: It’s part of the job for most engineering roles. If someone refuses, you either: (1) hire someone else, (2) restructure the role (not infrastructure-facing), or (3) hire an ops/SRE team to absorb on-call.
Q: How do we prevent on-call burnout? A: (1) Limit frequency, (2) Pay well, (3) Invest in automation (fewer incidents = fewer pages), (4) Blameless post-mortems (reduce anxiety), (5) Give comp time.
Q: Should interns be on-call? A: No. Engineers should join the rotation only once they have 2+ years of experience and deep system knowledge.
Q: What if we’re a startup and can’t pay on-call premium? A: Use Model 1 (single engineer, simple rotation) + comp time (day off each week). It’s less fair but sustainable for early stage.
Related Articles
- Building Blameless Post-Mortem Cultures
- Incident Response Playbooks
- Distributed Team Communication Tools
- SLA and SLO Definitions
- Scaling Engineering Teams Across Timezones
Built by theluckystrike — More at zovo.one