Ops Scale Checklist
Day 31–90: lock in process, SLAs, runbooks, and monitoring so we can add customers without adding chaos. Use the calculators to size capacity and trigger hires.
Targets
Scale KPIs
- First Response Time ≤ 15m (P1), ≤ 4h (Std).
- MTTR ≤ 2h (P1), ≤ 24h (Std). FCR ≥ 70%.
- CSAT ≥ 4.6/5. Backlog < 1 day of work.
- Uptime target ≥ 99.9% (error budget ≈ 43.2m per 30 days).
Gate
Scale Decision
- Greenlight: KPIs at/above targets 4 weeks straight.
- Scale lever: add headcount if the forecast shows a KPI breach within 30 days.
- Freeze: if MTTR or CSAT trends negative for 2+ weeks.
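The three gate rules above can be sketched as one small function. This is only an illustration: the function name, the input names, and the choice to let a freeze override the greenlight are assumptions, not part of the checklist.

```python
from typing import Optional


def scale_gate(weeks_on_target: int,
               days_to_forecast_breach: Optional[float],
               mttr_negative_weeks: int,
               csat_negative_weeks: int) -> dict:
    """Evaluate the Scale Decision rules (hypothetical helper).

    weeks_on_target: consecutive weeks with all KPIs at/above target.
    days_to_forecast_breach: days until the forecast predicts a breach,
        or None if no breach is forecast.
    *_negative_weeks: consecutive weeks the metric has trended negative.
    """
    freeze = mttr_negative_weeks >= 2 or csat_negative_weeks >= 2
    return {
        "freeze": freeze,
        # Assumption: an active freeze overrides the greenlight.
        "greenlight": weeks_on_target >= 4 and not freeze,
        "add_headcount": (days_to_forecast_breach is not None
                          and days_to_forecast_breach <= 30),
    }


print(scale_gate(4, 21, 0, 0))
```

Returning separate flags rather than a single verdict keeps "greenlight" and "add headcount" independent, since both can be true in the same week.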
Process
Core Readiness
Tooling
Stack
Quality
QA & Training
Security
Controls
Forecast
Volume → Headcount
Capacity per agent per day ≈ hours × 60 × occupancy ÷ AHT (average handle time, in minutes). Agents needed ≈ tickets per day ÷ capacity.
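A minimal sketch of the sizing formula, assuming occupancy is a fraction between 0 and 1 and AHT is in minutes. The function names and the sample numbers are illustrative, and rounding up to a whole agent is an added assumption.

```python
import math


def capacity_per_agent_per_day(hours: float, occupancy: float,
                               aht_minutes: float) -> float:
    """Tickets one agent can handle per day: hours x 60 x occupancy / AHT."""
    return hours * 60 * occupancy / aht_minutes


def agents_needed(tickets_per_day: float, capacity: float) -> int:
    """Agents required for the daily volume, rounded up (assumption)."""
    return math.ceil(tickets_per_day / capacity)


# Illustrative inputs: 8h shift, 75% occupancy, 12-minute AHT, 200 tickets/day.
cap = capacity_per_agent_per_day(hours=8, occupancy=0.75, aht_minutes=12)
print(cap)                      # 30.0 tickets per agent per day
print(agents_needed(200, cap))  # 7
```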
SLAs
Response & Resolution
Rule of thumb: if avg wait exceeds target or backlog > 1 day, trigger hire.
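The rule of thumb can be written as a single check. In this sketch the backlog is converted to "days of work" by dividing open tickets by the team's daily capacity; that conversion and all names are assumptions.

```python
def should_trigger_hire(avg_wait_minutes: float,
                        target_wait_minutes: float,
                        backlog_tickets: float,
                        team_capacity_per_day: float) -> bool:
    """True if average wait exceeds target or backlog exceeds one day of work."""
    backlog_days = backlog_tickets / team_capacity_per_day
    return avg_wait_minutes > target_wait_minutes or backlog_days > 1


# Wait is within target, but 250 open tickets vs. 210/day capacity
# is more than one day of work.
print(should_trigger_hire(10, 15, 250, 210))  # True
```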
On-Call
Rotation Helper
Aim: ≤ 1 P1/page per week; no back-to-back weekends when team > 4.
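One simple way to meet the weekend aim is a weekly round-robin, sketched below. The engineer names, the Monday start date, and the one-primary-per-week model are hypothetical; a real rotation would also need overrides for leave and swaps.

```python
from datetime import date, timedelta


def weekly_rotation(engineers, start, weeks):
    """Assign one primary per week, round-robin.

    Returns a list of (week_start_date, engineer) tuples.
    """
    return [(start + timedelta(weeks=i), engineers[i % len(engineers)])
            for i in range(weeks)]


def has_back_to_back(schedule):
    """True if anyone is on call in two consecutive weeks (and weekends)."""
    return any(a[1] == b[1] for a, b in zip(schedule, schedule[1:]))


team = ["ana", "ben", "chi", "dev", "eli"]  # hypothetical team of 5
schedule = weekly_rotation(team, date(2025, 1, 6), weeks=7)
for week_start, name in schedule:
    print(week_start.isoformat(), name)
print(has_back_to_back(schedule))  # False
```

With a team of two or more, a plain round-robin never repeats the same person in adjacent weeks; the check is there to validate schedules after manual swaps.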
Template
Severity Matrix
Sev | Definition | Targets |
---|---|---|
P1 | Site down / major impact | FR 15m • MTTR 2h |
P2 | Speed/security degraded | FR 1h • MTTR 8h |
P3 | Single feature bug | FR 4h • MTTR 24h |
P4 | Question/task | FR 1d • MTTR 5d |
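The matrix can be encoded as a lookup so tickets get due times automatically. This sketch assumes calendar time (not business hours), reads "1d" as 24 hours, and treats the MTTR target as a per-ticket resolution deadline; the names are illustrative.

```python
from datetime import datetime, timedelta

# Severity -> (first response target, resolution target), from the matrix above.
SEVERITY_TARGETS = {
    "P1": (timedelta(minutes=15), timedelta(hours=2)),
    "P2": (timedelta(hours=1), timedelta(hours=8)),
    "P3": (timedelta(hours=4), timedelta(hours=24)),
    "P4": (timedelta(days=1), timedelta(days=5)),
}


def sla_deadlines(severity: str, opened_at: datetime):
    """Return (first_response_due, resolution_due) for a ticket."""
    first_response, resolution = SEVERITY_TARGETS[severity]
    return opened_at + first_response, opened_at + resolution


fr_due, res_due = sla_deadlines("P1", datetime(2025, 1, 1, 9, 0))
print(fr_due)   # 2025-01-01 09:15:00
print(res_due)  # 2025-01-01 11:00:00
```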
Runbook
Site Down (P1)
Trigger: 5xx or monitor alert; user cannot access site.
1) Acknowledge in 5m; set status page to “Investigating”.
2) Checks: DNS, TLS, origin, WAF, PHP-FPM, DB, cache, CDN.
3) Mitigate: failover cache or rollback last deploy; restore backup if needed.
4) Communicate every 15m until resolved; update RCA stub.
5) Close: postmortem within 48h; add KB if systemic.
Runbook
Slow Performance (P2)
Trigger: LCP/TTFB spike; user reports “slow”.
1) Confirm scope; check origin load & recent changes.
2) Inspect queries, cache headers, image weight, blockers (3P scripts).
3) Hotfix: enable page caching; serve fallback; throttle cron.
4) Plan: ticket for long-term fix; share before/after chart.
Runbook
Security Incident (P1/P2)
Trigger: WAF hit surge, malware flag, auth anomalies.
1) Isolate: lock logins; read-only mode if needed.
2) Scope: logs, file diff, admin users, plugins.
3) Remediate: patch, remove payload, rotate secrets.
4) Notify customer: facts only; timeline; next steps.
5) Post-incident review; tighten rules; add checks.
Comms
Status Update (P1)
[Time] Investigating an issue affecting some WordPress sites.
Impact: pages may fail to load.
Next update in 15 minutes. Reference: INC-{{id}}
SLO
Error Budget (30 days)
Budget (min) ≈ (100 − target)% × 43,200, where target is the uptime percentage and 43,200 is the number of minutes in 30 days.
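A small sketch of the budget formula, with a helper for how much budget is left after recorded downtime. The function names and the 12.5-minute downtime figure are illustrative assumptions.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200


def error_budget_minutes(target_pct: float,
                         window_minutes: int = MINUTES_PER_30_DAYS) -> float:
    """Allowed downtime in minutes: (100 - target)% of the window."""
    return (100 - target_pct) / 100 * window_minutes


def budget_remaining(target_pct: float, downtime_minutes: float) -> float:
    """Budget left after the downtime already spent in the window."""
    return error_budget_minutes(target_pct) - downtime_minutes


print(round(error_budget_minutes(99.9), 1))    # 43.2
print(round(budget_remaining(99.9, 12.5), 1))  # 30.7
```

Rounding is applied only for display, since floating-point arithmetic on values like 99.9 leaves tiny residuals.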
Monitors
Checks Table
Name | Type | Sev | URL/Target |
---|---|---|---|
Export
Build Report
Exports include progress %, capacity result, SLA plan, and monitors list.