diff --git a/docs/superpowers/phase-1/runbook.md b/docs/superpowers/phase-1/runbook.md
new file mode 100644
index 00000000..da00c019
--- /dev/null
+++ b/docs/superpowers/phase-1/runbook.md
@@ -0,0 +1,290 @@
+# Flights Web — Operational Runbook
+
+**Version:** 1.0 (Phase 1I)
+**Last updated:** 2026-04-14
+
+---
+
+## 1. Incident Response Decision Tree
+
+```
+Is the service returning errors?
+  |
+  +-- YES: Check /health endpoint
+  |     |
+  |     +-- /health returns 503
+  |     |     -> Upstream API issue (see Section 6.1)
+  |     |
+  |     +-- /health returns 200 but users see errors
+  |     |     -> Application-level bug. Check logs (Section 5).
+  |     |     -> If recent deploy: rollback (Section 3).
+  |     |
+  |     +-- /health unreachable (connection refused / timeout)
+  |           -> Container/VM is down.
+  |           -> Check container orchestrator status.
+  |           -> If all replicas down: escalate to infra team (Severity 1).
+  |           -> If partial: rely on load balancer, investigate affected nodes.
+  |
+  +-- NO: Check for degraded performance
+        |
+        +-- Latency > 2x baseline
+        |     -> Check OTel metrics for slow spans.
+        |     -> Check upstream API latency.
+        |     -> If upstream: see Section 6.1.
+        |     -> If internal: check for memory pressure, CPU saturation.
+        |
+        +-- Intermittent errors in logs
+              -> Check error rate trend.
+              -> If rising: prepare for rollback.
+              -> If stable/low: monitor for 15 min, then investigate.
+```
+
+### Severity Levels
+
+| Severity | Criteria | Response Time | Who to Page |
+|----------|----------|---------------|-------------|
+| S1 | Service fully down, all users affected | Immediate | On-call engineer + team lead |
+| S2 | Partial outage, >10% error rate | 15 min | On-call engineer |
+| S3 | Degraded performance, no data loss | 1 hour | On-call engineer (next business day if after hours) |
+| S4 | Minor issue, workaround exists | Next business day | Assigned engineer |
+
+---
+
+## 2. Canary Rollout Procedure
+
+### Pre-rollout Checklist
+
+- [ ] All CI checks pass (typecheck, lint, test)
+- [ ] Docker images built and pushed to registry
+- [ ] Rollback image tag identified (current production tag)
+- [ ] Monitoring dashboards open
+
+### Rollout Steps
+
+1. **Deploy canary** (5% traffic) to a single node in one geographic region
+2. **Monitor for 10 minutes:**
+   - Error rate must stay below 0.5%
+   - p99 latency must not exceed 2x baseline
+   - `/health` must return 200 on the canary
+   - No new error patterns in logs
+3. **Expand to 25%** if canary is healthy
+4. **Monitor for 15 minutes** with same criteria
+5. **Expand to 100%** across all geographic regions
+6. **Post-deploy verification:**
+   - `/health` returns 200 on all nodes
+   - Smoke test passes end-to-end
+   - No error rate spike in the first 30 minutes
+
+### Abort Criteria
+
+Roll back immediately if any of these occur during canary:
+- Error rate exceeds 1%
+- `/health` returns 503 on canary nodes
+- p99 latency exceeds 5x baseline
+- Any S1/S2 incident triggered
+
+---
+
+## 3. Rollback Procedure
+
+### 3.1 Automatic Rollback
+
+The deploy pipeline monitors `/health` after deployment. If the health check fails within the first 5 minutes post-deploy:
+
+1. Pipeline automatically reverts to the previous image tag
+2. Alert fires to the on-call channel
+3. Engineer investigates the failed deployment logs
+
+**No manual action required** for auto-rollback. Verify the rollback succeeded by checking:
+- `/health` returns 200
+- Error rate returns to baseline
+- Previous image tag is running on all nodes
+
+### 3.2 Manual Rollback
+
+If auto-rollback did not trigger or a problem is discovered later:
+
+1. **Identify the last-known-good image tag** from the deployment history
+2. **Redeploy the previous image:**
+   ```bash
+   # Placeholder — actual commands depend on customer's deployment tool
+   # Example:
+   # deploy --image $REGISTRY/flights-web-standalone:$PREVIOUS_SHA --env production
+   ```
+3. **Verify rollback:**
+   - `/health` returns 200 on all nodes
+   - Error rate returns to baseline
+   - Smoke test passes
+4. **Post-mortem:** file an incident report within 24 hours
+
+---
+
+## 4. Health-Check Interpretation
+
+### Endpoint: `GET /health`
+
+| Response | Status | Meaning | Action |
+|----------|--------|---------|--------|
+| `{ "status": "ok" }` | 200 | Upstream API reachable within last 60s | None |
+| `{ "status": "degraded", "reason": "upstream_unreachable" }` | 503 | No successful upstream ping in 60s | Check upstream API status; see Section 6.1 |
+
+### Common Causes of 503
+
+1. **Upstream API is down** — check upstream service status page / monitoring
+2. **Network partition** — the node cannot reach the upstream API; check network policies
+3. **DNS resolution failure** — verify DNS configuration on the node
+4. **Upstream API overloaded** — ping times out; coordinate with upstream team
+
+### Load Balancer Behavior
+
+When `/health` returns 503, the load balancer should stop routing traffic to that node. When the upstream recovers and `/health` returns 200 again, traffic automatically resumes.
+
+---
+
+## 5. Log Query Cookbook
+
+Logs are shipped in JSON Lines format to the customer's log aggregation system.
+
+### Log Structure
+
+```json
+{
+  "ts": "2026-04-14T12:00:00.000Z",
+  "level": "error",
+  "msg": "Request failed",
+  "fields": {
+    "traceId": "abc123",
+    "path": "/api/flights",
+    "status": 500,
+    "err": "TypeError: Cannot read properties of undefined"
+  }
+}
+```
+
+### Common Queries
+
+**Find all errors in the last hour:**
+```
+level:error AND ts:[now-1h TO now]
+```
+
+**Find errors for a specific trace:**
+```
+fields.traceId:"abc123"
+```
+
+**Find slow requests (logged by the API client on timeout):**
+```
+msg:"Retrying request" OR msg:"upstream_timeout"
+```
+
+**Find health-check failures:**
+```
+msg:"upstream_unreachable" OR (path:"/health" AND status:503)
+```
+
+**Find graceful shutdown events:**
+```
+msg:"SIGTERM received" OR msg:"Server closed" OR msg:"Drain timeout exceeded"
+```
+
+**Find CSP violations (if CSP reporting is enabled):**
+```
+msg:"csp-violation" OR fields.type:"csp-report"
+```
+
+---
+
+## 6. Known-Failure Playbooks
+
+### 6.1 Upstream API Down
+
+**Symptoms:** `/health` returns 503; API client logs show retry exhaustion.
+
+**Impact:** Users see error pages or stale data (if caching is in place).
+
+**Steps:**
+1. Confirm upstream status via the upstream team's status page or monitoring
+2. If upstream is aware and working on it: monitor, no action needed on our side
+3. If upstream is unaware: escalate via agreed communication channel
+4. If outage exceeds 30 minutes: consider enabling a maintenance page
+5. Recovery is automatic — once upstream responds, `/health` returns 200 within 60s
+
+### 6.2 SignalR Hub Offline
+
+**Symptoms:** Real-time flight updates stop; SignalR reconnection logs appear.
+
+**Impact:** Users see stale board data; manual refresh still works via REST API.
+
+**Steps:**
+1. Check SignalR hub process/container status
+2. Verify WebSocket connectivity from the node to the SignalR hub
+3. The client auto-reconnects with exponential backoff — recovery is usually automatic
+4. If hub is permanently down: REST polling fallback should activate (if implemented)
+5. Inform users if downtime exceeds 5 minutes
+
+### 6.3 CSP Violation Spike
+
+**Symptoms:** Spike in CSP violation reports; possibly broken page functionality.
+
+**Impact:** Scripts or styles blocked by Content-Security-Policy; UI may be partially broken.
+
+**Steps:**
+1. Check CSP violation reports for the blocked resource URL
+2. If a legitimate resource is blocked: update CSP policy in `src/server/middleware/csp.ts`
+3. If a third-party script is the source: investigate whether it was injected (security concern)
+4. If after a deploy: the new code may reference resources not in the CSP allowlist — fix or rollback
+5. CSP is in report-only mode during Phase 1 — no user impact, but violations should be tracked
+
+### 6.4 Analytics Adapter Load Failure
+
+**Symptoms:** `flights.analytics.load_failed` counter increases; analytics data gaps.
+
+**Impact:** Analytics data not collected; no user-facing impact.
+
+**Steps:**
+1. Check which adapter(s) failed (Yandex.Metrica, CTM, Variocube, Dynatrace)
+2. Verify the adapter's external script URL is reachable from the client
+3. Check for CORS or CSP blocking the adapter script
+4. If a single adapter: low priority, monitor
+5. If all adapters: likely a CSP or network issue affecting all external scripts
+
+### 6.5 OTel Exporter Unreachable
+
+**Symptoms:** Metrics and traces stop appearing in the monitoring dashboard.
+
+**Impact:** No observability data; no user-facing impact.
+
+**Steps:**
+1. Check the OTel collector/exporter endpoint connectivity
+2. Verify the `OTEL_EXPORTER_OTLP_ENDPOINT` environment variable is correct
+3. Check for network policy changes that may block the exporter
+4. The SDK buffers data locally — some data may be recoverable once the exporter is reachable again
+5. If the exporter is permanently moved: update the endpoint configuration and redeploy
+
+### 6.6 Memory Pressure / OOM Kill
+
+**Symptoms:** Container restarts; OOM kill events in container orchestrator logs.
+
+**Impact:** Requests in flight are dropped; load balancer reroutes to healthy nodes.
+
+**Steps:**
+1. Check container memory limits vs actual usage
+2. Review recent deploys for memory leaks (new dependencies, unbounded caches)
+3. If a specific route causes high memory: check for large API responses or unbounded data structures
+4. Short-term: increase memory limits
+5. Long-term: profile the application to find the leak; fix and redeploy
+
+---
+
+## Recovery SLA
+
+**Target:** Service recovery within 6 hours after infrastructure is restored.
+
+**Recovery steps:**
+1. Infrastructure team restores VMs / containers across geographic regions
+2. Deployment tool re-deploys the last-known-good image
+3. `/health` checks confirm upstream connectivity
+4. Load balancer re-enables traffic to recovered nodes
+5. On-call engineer verifies end-to-end functionality
+6. Incident report filed within 24 hours of resolution
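
Recovery step 3 depends on reading `/health` the way the Section 4 table describes. The sketch below shows one way that interpretation could be encoded, so that traffic is re-enabled only once every recovered node reports healthy. It is a minimal illustration, not part of the runbook's tooling: the names `HealthBody`, `HealthVerdict`, `interpretHealth`, and `allNodesRecovered` are hypothetical. It also assumes the response shapes are exactly the two rows in the Section 4 table; anything else is treated as unknown rather than guessed at.

```typescript
// Hypothetical types and helpers; only the two response shapes come from Section 4.
type HealthBody = { status: string; reason?: string };

type HealthVerdict = "healthy" | "upstream_unreachable" | "unknown";

// Map one /health response (HTTP status plus parsed JSON body) to a verdict.
function interpretHealth(httpStatus: number, body: HealthBody): HealthVerdict {
  if (httpStatus === 200 && body.status === "ok") {
    return "healthy";
  }
  if (
    httpStatus === 503 &&
    body.status === "degraded" &&
    body.reason === "upstream_unreachable"
  ) {
    return "upstream_unreachable";
  }
  // Any other combination is not covered by the Section 4 table.
  return "unknown";
}

// True only when at least one node responded and every node is healthy.
function allNodesRecovered(
  responses: Array<{ httpStatus: number; body: HealthBody }>
): boolean {
  return (
    responses.length > 0 &&
    responses.every((r) => interpretHealth(r.httpStatus, r.body) === "healthy")
  );
}
```

An empty response list counts as "not recovered", because it usually means the check itself did not run, not that every node is healthy.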