Ops Scale Checklist

Posted September 3, 2025

Updated September 3, 2025

By nous-buildons

Day 31–90: lock in process, SLAs, runbooks, and monitoring so we can add customers without adding chaos. Use the calculators to size capacity and trigger hires.

Targets

Scale KPIs

First Response Time ≤ 15m (P1), ≤ 4h (Std).
MTTR ≤ 2h (P1), ≤ 24h (Std). FCR ≥ 70%.
CSAT ≥ 4.6/5. Backlog < 1 day of work.
Uptime target ≥ 99.9% (error budget ≈ 43.2m per 30 days).
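The targets above can be checked mechanically once a week. The following is a minimal sketch, not a definitive implementation: the `WeeklyKpis` record, its field names, and the `kpi_misses` helper are hypothetical, and it assumes each KPI has already been aggregated into a single weekly number.

```python
from dataclasses import dataclass


@dataclass
class WeeklyKpis:
    """One week of aggregated KPIs (hypothetical field names)."""
    frt_p1_min: float      # First Response Time for P1, in minutes
    frt_std_hours: float   # First Response Time for standard tickets, in hours
    mttr_p1_hours: float   # MTTR for P1, in hours
    mttr_std_hours: float  # MTTR for standard tickets, in hours
    fcr_pct: float         # First Contact Resolution, in percent
    csat: float            # CSAT on a 5-point scale
    backlog_days: float    # Backlog expressed in days of work
    uptime_pct: float      # Measured uptime, in percent


def kpi_misses(k: WeeklyKpis) -> list:
    """Return the names of the targets that this week's numbers miss."""
    checks = [
        ("FRT P1 <= 15m", k.frt_p1_min <= 15),
        ("FRT Std <= 4h", k.frt_std_hours <= 4),
        ("MTTR P1 <= 2h", k.mttr_p1_hours <= 2),
        ("MTTR Std <= 24h", k.mttr_std_hours <= 24),
        ("FCR >= 70%", k.fcr_pct >= 70),
        ("CSAT >= 4.6", k.csat >= 4.6),
        ("Backlog < 1 day", k.backlog_days < 1),
        ("Uptime >= 99.9%", k.uptime_pct >= 99.9),
    ]
    return [name for name, ok in checks if not ok]
```

An empty list means every target was met that week; anything else names the targets to investigate.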

Gate

Scale Decision

Greenlight: all KPIs meet targets for 4 consecutive weeks.
Scale lever: add headcount if the forecast shows a target breach within 30 days.
Freeze: pause scaling if MTTR or CSAT trends negative for 2+ weeks.
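As a rough illustration, the three rules can be expressed as one small function. This is a sketch under stated assumptions: the function name and parameters are hypothetical, the inputs are assumed to be precomputed elsewhere, and letting a freeze override a greenlight is an assumption, since the checklist does not state a precedence.

```python
from typing import Optional


def scale_decision(weeks_on_target: int,
                   days_to_forecast_breach: Optional[float],
                   mttr_negative_weeks: int,
                   csat_negative_weeks: int) -> dict:
    """Evaluate the three gate rules and return them as flags.

    weeks_on_target: consecutive weeks with every KPI meeting its target.
    days_to_forecast_breach: days until the forecast breaches a target,
        or None if no breach is forecast.
    *_negative_weeks: consecutive weeks the metric has trended negative.
    """
    # Freeze: MTTR or CSAT trending negative for 2+ weeks.
    freeze = mttr_negative_weeks >= 2 or csat_negative_weeks >= 2
    # Greenlight: 4 straight weeks on target (assumption: freeze wins).
    greenlight = weeks_on_target >= 4 and not freeze
    # Scale lever: forecast breach within 30 days.
    add_headcount = (days_to_forecast_breach is not None
                     and days_to_forecast_breach <= 30)
    return {
        "greenlight": greenlight,
        "add_headcount": add_headcount,
        "freeze": freeze,
    }
```

Returning separate flags rather than a single verdict keeps the headcount trigger visible even while growth is frozen.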

Quality

QA & Training

10 ticket reviews/week: scorecard

Macro library: curated tone + accuracy

Shadowing program: L1→L2 rotations

CSAT/NPS loop: weekly review

Security

Controls

MFA enforced: all consoles

Least-privilege roles: runbooks note scope

Backups tested: monthly restore drill

Incident comms script: P1 template

Template

Severity Matrix

Sev	Definition	Targets
P1	Site down / major impact	FR 15m • MTTR 2h
P2	Speed/security degraded	FR 1h • MTTR 8h
P3	Single feature bug	FR 4h • MTTR 24h
P4	Question/task	FR 1d • MTTR 5d
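The matrix translates directly into a lookup table that a ticketing integration could use to stamp deadlines on new tickets. A minimal sketch, assuming targets run on calendar time (not business hours) and that the MTTR target is treated as a per-ticket resolution deadline; `SEVERITY_TARGETS` and `sla_deadlines` are hypothetical names.

```python
from datetime import datetime, timedelta

# (first response target, resolution target) per severity, from the matrix.
SEVERITY_TARGETS = {
    "P1": (timedelta(minutes=15), timedelta(hours=2)),
    "P2": (timedelta(hours=1), timedelta(hours=8)),
    "P3": (timedelta(hours=4), timedelta(hours=24)),
    "P4": (timedelta(days=1), timedelta(days=5)),
}


def sla_deadlines(severity: str, opened_at: datetime):
    """Return (first_response_due, resolution_due) for a ticket.

    Assumption: the clock runs continuously from opened_at.
    Raises KeyError for an unknown severity.
    """
    first_response, resolution = SEVERITY_TARGETS[severity]
    return opened_at + first_response, opened_at + resolution
```

For example, a P1 opened at 09:00 would be due a first response by 09:15 and resolution by 11:00 on the same day.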

Runbook

Site Down (P1)

Trigger: 5xx or monitor alert; user cannot access site.
1) Acknowledge in 5m; set status page to “Investigating”.
2) Checks: DNS, TLS, origin, WAF, PHP-FPM, DB, cache, CDN.
3) Mitigate: failover cache or rollback last deploy; restore backup if needed.
4) Communicate q15m until resolved; update RCA stub.
5) Close: postmortem within 48h; add KB if systemic.
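Step 2 of this runbook can be sketched as a walk through the layers that stops at the first unhealthy one. This is an illustration only: it assumes the order in which the layers are listed is the order in which to check them, and that each probe is a caller-supplied function returning True when its layer is healthy. `CHECK_ORDER` and `first_failing_layer` are hypothetical names, and no real probes are included.

```python
from typing import Callable, Dict, Optional

# Layers in the order the runbook lists them (assumed to be the check order).
CHECK_ORDER = ["DNS", "TLS", "origin", "WAF", "PHP-FPM", "DB", "cache", "CDN"]


def first_failing_layer(probes: Dict[str, Callable[[], bool]]) -> Optional[str]:
    """Run probes in runbook order and return the first unhealthy layer.

    A probe that raises is treated as unhealthy. Layers without a probe
    are skipped. Returns None if every available probe passes.
    """
    for layer in CHECK_ORDER:
        probe = probes.get(layer)
        if probe is None:
            continue  # no probe wired up for this layer
        try:
            healthy = probe()
        except Exception:
            healthy = False
        if not healthy:
            return layer
    return None
```

Stopping at the first failure keeps the responder focused on the outermost broken layer, since failures further down the stack are often just consequences of it.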

Runbook

Slow Performance (P2)

Trigger: LCP/TTFB spike; user reports “slow”.
1) Confirm scope; check origin load & recent changes.
2) Inspect queries, cache headers, image weight, blockers (3P scripts).
3) Hotfix: enable page caching; serve fallback; throttle cron.
4) Plan: ticket for long-term fix; share before/after chart.
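The runbook does not define what counts as a spike, so the following is only one possible reading: compare the median of recent TTFB samples against the median of a baseline window. The function name, the millisecond samples, and the default 1.5x ratio are all assumptions to be tuned per site.

```python
from statistics import median


def is_ttfb_spike(baseline_ms, recent_ms, ratio=1.5):
    """Return True if the recent median TTFB exceeds ratio x baseline median.

    baseline_ms / recent_ms: non-empty sequences of TTFB samples in ms.
    ratio: hypothetical threshold; 1.5 is an assumed default, not a standard.
    """
    if not baseline_ms or not recent_ms:
        raise ValueError("both sample windows must be non-empty")
    return median(recent_ms) > ratio * median(baseline_ms)
```

Medians are used here instead of means so that a single slow request does not trigger the runbook. The same two windows can later feed the before/after chart in step 4.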

Runbook

Security Incident (P1/P2)

Trigger: WAF hit surge, malware flag, auth anomalies.
1) Isolate: lock logins; read-only mode if needed.
2) Scope: logs, file diff, admin users, plugins.
3) Remediate: patch, remove payload, rotate secrets.
4) Notify customer: facts only; timeline; next steps.
5) Post-incident review; tighten rules; add checks.

Comms

Status Update (P1)

[Time] Investigating an issue affecting some WordPress sites. Impact: pages may fail to load. Next update in 15 minutes. Reference: INC-{{id}}
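Filling in the template by hand during a P1 is error-prone, so it may be worth rendering it from code. A minimal sketch: the `render_status_update` helper is hypothetical, and the `HH:MM UTC` timestamp format and the timezone-aware `now` argument are assumptions, since the template only specifies `[Time]` and `INC-{{id}}`.

```python
from datetime import datetime, timezone

# Same wording as the template above, with Python format fields.
STATUS_TEMPLATE = (
    "[{time}] Investigating an issue affecting some WordPress sites. "
    "Impact: pages may fail to load. "
    "Next update in {interval} minutes. "
    "Reference: INC-{incident_id}"
)


def render_status_update(incident_id, now, interval_min=15):
    """Render the P1 status update.

    now: a timezone-aware datetime; it is converted to UTC for display
         (assumed format, adjust to the status page's convention).
    interval_min: minutes until the next update (15 per the P1 runbook).
    """
    stamp = now.astimezone(timezone.utc).strftime("%H:%M UTC")
    return STATUS_TEMPLATE.format(
        time=stamp, interval=interval_min, incident_id=incident_id
    )
```

Calling `render_status_update(1042, datetime.now(timezone.utc))` produces a message ready to paste into the status page, with the incident ID `1042` standing in for a real one.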

SLO

Error Budget (30 days)

Inputs: uptime target (%) and downtime used (min).

Budget (min) ≈ (100 − target)/100 × 43,200, where 43,200 is the number of minutes in 30 days.
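The budget formula is simple enough to sketch in a few lines. The function names below are hypothetical, and the 30-day window follows the section heading.

```python
WINDOW_MINUTES = 30 * 24 * 60  # 43,200 minutes in a 30-day window


def error_budget_minutes(target_pct):
    """Allowed downtime in minutes over 30 days for an uptime target in %."""
    if not 0 < target_pct <= 100:
        raise ValueError("target_pct must be in (0, 100]")
    return (100 - target_pct) / 100 * WINDOW_MINUTES


def budget_remaining(target_pct, downtime_used_min):
    """Minutes of budget left; a negative value means the budget is blown."""
    return error_budget_minutes(target_pct) - downtime_used_min
```

With a 99.9% target the budget comes out at about 43.2 minutes, so 30 minutes of downtime leaves roughly 13.2 minutes for the rest of the window. Results are floats, so round them before display.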

Monitors

Checks Table

Name	Type	Sev	URL/Target

Export

Build Report

Exports include progress %, capacity result, SLA plan, and monitors list.
