Add operational runbook covering incident response and failure playbooks

Covers incident decision tree, canary rollout, rollback procedures, health-check interpretation, log query cookbook, and 6 known-failure playbooks per master plan requirements.
2026-04-15 00:56:24 +03:00
parent ca6ae0eea2
commit 56cc9e1af2
1 changed file with 290 additions and 0 deletions
@@ -0,0 +1,290 @@
+# Flights Web — Operational Runbook
+
+**Version:** 1.0 (Phase 1I)
+**Last updated:** 2026-04-14
+
+---
+
+## 1. Incident Response Decision Tree
+
+```
+Is the service returning errors?
+  |
+  +-- YES: Check /health endpoint
+  |     |
+  |     +-- /health returns 503
+  |     |     -> Upstream API issue (see Section 6.1)
+  |     |
+  |     +-- /health returns 200 but users see errors
+  |     |     -> Application-level bug. Check logs (Section 5).
+  |     |     -> If recent deploy: rollback (Section 3).
+  |     |
+  |     +-- /health unreachable (connection refused / timeout)
+  |           -> Container/VM is down.
+  |           -> Check container orchestrator status.
+  |           -> If all replicas down: escalate to infra team (Severity 1).
+  |           -> If partial: rely on load balancer, investigate affected nodes.
+  |
+  +-- NO: Check for degraded performance
+        |
+        +-- Latency > 2x baseline
+        |     -> Check OTel metrics for slow spans.
+        |     -> Check upstream API latency.
+        |     -> If upstream: see Section 6.1.
+        |     -> If internal: check for memory pressure, CPU saturation.
+        |
+        +-- Intermittent errors in logs
+              -> Check error rate trend.
+              -> If rising: prepare for rollback.
+              -> If stable/low: monitor for 15 min, then investigate.
+```
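
The tree above can also be written as a small pure function, which may be useful for tabletop drills or alert routing. This is a minimal sketch, not code from the Flights Web repository: the `TriageInput` shape, the `triage` name, and the final "No action needed" fallback are assumptions made for illustration.

```typescript
// Hypothetical summary of what the responder has observed so far.
type HealthProbe = "ok" | "unavailable" | "unreachable"; // 200, 503, no response

interface TriageInput {
  serviceReturningErrors: boolean;
  health: HealthProbe;
  recentDeploy: boolean;
  allReplicasDown: boolean;
  latencyRatio: number; // current latency divided by baseline latency
  intermittentErrorsInLogs: boolean;
  errorRateRising: boolean;
}

// Walks the decision tree top to bottom and returns the first matching action.
function triage(o: TriageInput): string {
  if (o.serviceReturningErrors) {
    if (o.health === "unavailable") return "Upstream API issue: see Section 6.1";
    if (o.health === "unreachable") {
      return o.allReplicasDown
        ? "All replicas down: escalate to infra team (Severity 1)"
        : "Partial outage: rely on load balancer, investigate affected nodes";
    }
    // /health returns 200 but users still see errors.
    return o.recentDeploy
      ? "Application-level bug after a deploy: roll back (Section 3)"
      : "Application-level bug: check logs (Section 5)";
  }
  if (o.latencyRatio > 2) {
    return "Latency above 2x baseline: check OTel spans and upstream latency";
  }
  if (o.intermittentErrorsInLogs) {
    return o.errorRateRising
      ? "Error rate rising: prepare for rollback"
      : "Error rate stable: monitor for 15 min, then investigate";
  }
  // Assumption: the tree does not name this case explicitly.
  return "No action needed";
}
```

Keeping the function free of I/O means each branch of the tree can be exercised in a unit test without a running service.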
+
+### Severity Levels
+
+| Severity | Criteria | Response Time | Who to Page |
+|----------|----------|---------------|-------------|
+| S1 | Service fully down, all users affected | Immediate | On-call engineer + team lead |
+| S2 | Partial outage, >10% error rate | 15 min | On-call engineer |
+| S3 | Degraded performance, no data loss | 1 hour (next business day if after hours) | On-call engineer |
+| S4 | Minor issue, workaround exists | Next business day | Assigned engineer |
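
The criteria in the table can be checked in order from most to least severe. The sketch below assumes a hypothetical `IncidentSignal` input. How "fully down" and "degraded" are detected is left to the monitoring setup.

```typescript
type SeverityLevel = "S1" | "S2" | "S3" | "S4";

// Hypothetical inputs; field names are illustrative only.
interface IncidentSignal {
  fullyDown: boolean; // service down for all users
  errorRate: number;  // fraction of failed requests, 0.12 = 12%
  degraded: boolean;  // degraded performance, no data loss
}

// Checks the most severe criterion first so the highest level wins.
function classifySeverity(s: IncidentSignal): SeverityLevel {
  if (s.fullyDown) return "S1";
  if (s.errorRate > 0.10) return "S2";
  if (s.degraded) return "S3";
  return "S4";
}
```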
+
+---
+
+## 2. Canary Rollout Procedure
+
+### Pre-rollout Checklist
+
+- [ ] All CI checks pass (typecheck, lint, test)
+- [ ] Docker images built and pushed to registry
+- [ ] Rollback image tag identified (current production tag)
+- [ ] Monitoring dashboards open
+
+### Rollout Steps
+
+1. **Deploy canary** (5% traffic) to a single node in one geographic region
+2. **Monitor for 10 minutes:**
+   - Error rate must stay below 0.5%
+   - p99 latency must not exceed 2x baseline
+   - `/health` must return 200 on the canary
+   - No new error patterns in logs
+3. **Expand to 25%** if canary is healthy
+4. **Monitor for 15 minutes** with the same criteria
+5. **Expand to 100%** across all geographic regions
+6. **Post-deploy verification:**
+   - `/health` returns 200 on all nodes
+   - Smoke test passes end-to-end
+   - No error rate spike in the first 30 minutes
+
+### Abort Criteria
+
+Roll back immediately if any of these occur during the canary rollout:
+- Error rate exceeds 1%
+- `/health` returns 503 on canary nodes
+- p99 latency exceeds 5x baseline
+- Any S1/S2 incident triggered
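
The expansion criteria from the rollout steps and the abort criteria above can be combined into one gate. The sketch below is illustrative: `CanaryMetrics` and `evaluateCanary` are hypothetical names, and treating the range between the two thresholds (for example an error rate between 0.5% and 1%) as "hold" is an assumption, since the runbook does not define that case.

```typescript
// Hypothetical snapshot of canary metrics over one monitoring window.
interface CanaryMetrics {
  errorRate: number;        // fraction of failed requests, 0.004 = 0.4%
  p99LatencyRatio: number;  // canary p99 divided by baseline p99
  healthStatusCode: number; // last HTTP status returned by /health on the canary
  newErrorPatterns: boolean;
  s1OrS2Incident: boolean;
}

type CanaryDecision = "expand" | "hold" | "abort";

function evaluateCanary(m: CanaryMetrics): CanaryDecision {
  // Abort criteria take precedence over everything else.
  if (
    m.errorRate > 0.01 ||
    m.healthStatusCode === 503 ||
    m.p99LatencyRatio > 5 ||
    m.s1OrS2Incident
  ) {
    return "abort";
  }
  // Expansion requires every monitoring criterion to pass.
  const healthy =
    m.errorRate < 0.005 &&
    m.p99LatencyRatio <= 2 &&
    m.healthStatusCode === 200 &&
    !m.newErrorPatterns;
  // Assumption: anything between the two thresholds means "do not expand yet".
  return healthy ? "expand" : "hold";
}
```

Evaluating the abort criteria first ensures that a single hard failure cannot be masked by otherwise healthy metrics.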
+
+---
+
+## 3. Rollback Procedure
+
+### 3.1 Automatic Rollback
+
+The deploy pipeline monitors `/health` after deployment. If the health check fails within the first 5 minutes post-deploy:
+
+1. Pipeline automatically reverts to the previous image tag
+2. Alert fires to the on-call channel
+3. Engineer investigates the failed deployment logs
+
+**No manual action is required** to trigger the auto-rollback, but verify that it succeeded by checking:
+- `/health` returns 200
+- Error rate returns to baseline
+- Previous image tag is running on all nodes
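
The post-deploy watch can be pictured as a check over the health probes collected in the first 5 minutes. This sketch is not the pipeline's actual implementation: `HealthSample` and `shouldAutoRollback` are hypothetical, and it assumes that a single non-200 probe inside the window counts as a failed health check.

```typescript
// One /health probe taken by the (hypothetical) deploy pipeline.
interface HealthSample {
  secondsSinceDeploy: number;
  statusCode: number | null; // null = connection refused or timed out
}

const ROLLBACK_WINDOW_SECONDS = 5 * 60;

// Assumption: any non-200 probe inside the window triggers the rollback.
function shouldAutoRollback(samples: HealthSample[]): boolean {
  return samples.some(
    (s) => s.secondsSinceDeploy <= ROLLBACK_WINDOW_SECONDS && s.statusCode !== 200
  );
}
```

A real pipeline may require several consecutive failures before reverting, which would reduce rollbacks caused by a single transient probe error.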
+
+### 3.2 Manual Rollback
+
+If auto-rollback did not trigger or a problem is discovered later:
+
+1. **Identify the last-known-good image tag** from the deployment history
+2. **Redeploy the previous image:**
+   ```bash
+   # Placeholder — actual commands depend on customer's deployment tool
+   # Example:
+   # deploy --image $REGISTRY/flights-web-standalone:$PREVIOUS_SHA --env production
+   ```
+3. **Verify rollback:**
+   - `/health` returns 200 on all nodes
+   - Error rate returns to baseline
+   - Smoke test passes
+4. **Post-mortem:** file an incident report within 24 hours
+
+---
+
+## 4. Health-Check Interpretation
+
+### Endpoint: `GET /health`
+
+| Response body | HTTP status | Meaning | Action |
+|---------------|-------------|---------|--------|
+| `{ "status": "ok" }` | 200 | Upstream API was reached successfully within the last 60 s | None |
+| `{ "status": "degraded", "reason": "upstream_unreachable" }` | 503 | No successful upstream ping in the last 60 s | Check upstream API status; see Section 6.1 |
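
The two rows of the table correspond to a freshness check on the last successful upstream ping. The sketch below shows that logic under stated assumptions: `evaluateHealth` and the way the last-success timestamp is tracked are hypothetical, and the real handler may differ.

```typescript
interface HealthResult {
  statusCode: 200 | 503;
  body: { status: "ok" } | { status: "degraded"; reason: "upstream_unreachable" };
}

const UPSTREAM_FRESHNESS_MS = 60 * 1000;

// lastUpstreamSuccessMs: epoch millis of the last successful upstream ping,
// or null if none has succeeded yet (hypothetical bookkeeping).
function evaluateHealth(lastUpstreamSuccessMs: number | null, nowMs: number): HealthResult {
  const fresh =
    lastUpstreamSuccessMs !== null &&
    nowMs - lastUpstreamSuccessMs <= UPSTREAM_FRESHNESS_MS;
  return fresh
    ? { statusCode: 200, body: { status: "ok" } }
    : { statusCode: 503, body: { status: "degraded", reason: "upstream_unreachable" } };
}
```

Passing the current time as a parameter keeps the function deterministic, so the 60-second boundary can be tested without waiting.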
+
+### Common Causes of 503
+
+1. **Upstream API is down** — check upstream service status page / monitoring
+2. **Network partition** — the node cannot reach the upstream API; check network policies
+3. **DNS resolution failure** — verify DNS configuration on the node
+4. **Upstream API overloaded** — ping times out; coordinate with upstream team
+
+### Load Balancer Behavior
+
+When `/health` returns 503 on a node, the load balancer should stop routing traffic to that node. Once the upstream recovers and `/health` returns 200 again, traffic to the node resumes automatically.
+
+---
+
+## 5. Log Query Cookbook
+
+Logs are shipped in JSON Lines format to the customer's log aggregation system.
+
+### Log Structure
+
+```json
+{
+  "ts": "2026-04-14T12:00:00.000Z",
+  "level": "error",
+  "msg": "Request failed",
+  "fields": {
+    "traceId": "abc123",
+    "path": "/api/flights",
+    "status": 500,
+    "err": "TypeError: Cannot read properties of undefined"
+  }
+}
+```
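
When working with exported log files outside the aggregation system, each line can be parsed and validated against the structure above. This is a minimal sketch: `LogEntry` and `parseLogLine` are hypothetical names, and only the four top-level keys shown in the example are checked.

```typescript
interface LogEntry {
  ts: string;
  level: string;
  msg: string;
  fields: { [key: string]: unknown };
}

// Returns null for lines that are not valid JSON or lack the expected keys.
function parseLogLine(line: string): LogEntry | null {
  let raw: unknown;
  try {
    raw = JSON.parse(line);
  } catch {
    return null;
  }
  if (typeof raw !== "object" || raw === null) return null;
  const o = raw as { [key: string]: unknown };
  const ts = o.ts;
  const level = o.level;
  const msg = o.msg;
  if (typeof ts !== "string" || typeof level !== "string" || typeof msg !== "string") {
    return null;
  }
  // A missing or non-object "fields" value is normalized to an empty object.
  const f = o.fields;
  const fields: { [key: string]: unknown } =
    typeof f === "object" && f !== null ? (f as { [key: string]: unknown }) : {};
  return { ts, level, msg, fields };
}
```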
+
+### Common Queries
+
+**Find all errors in the last hour:**
+```
+level:error AND ts:[now-1h TO now]
+```
+
+**Find errors for a specific trace:**
+```
+fields.traceId:"abc123"
+```
+
+**Find timed-out or retried upstream requests (logged by the API client):**
+```
+msg:"Retrying request" OR msg:"upstream_timeout"
+```
+
+**Find health-check failures:**
+```
+msg:"upstream_unreachable" OR (fields.path:"/health" AND fields.status:503)
+```
+
+**Find graceful shutdown events:**
+```
+msg:"SIGTERM received" OR msg:"Server closed" OR msg:"Drain timeout exceeded"
+```
+
+**Find CSP violations (if CSP reporting is enabled):**
+```
+msg:"csp-violation" OR fields.type:"csp-report"
+```
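
If the aggregation system is unavailable during an incident, the first two queries can be approximated over already-parsed log entries. The sketch below is an assumption-laden stand-in, not a replacement for the query language: `ParsedLog`, `errorsInLastHour`, and `byTraceId` are hypothetical names.

```typescript
interface ParsedLog {
  ts: string;
  level: string;
  msg: string;
  fields: { traceId?: string; path?: string; status?: number };
}

// Approximates: level:error AND ts:[now-1h TO now]
function errorsInLastHour(entries: ParsedLog[], nowMs: number): ParsedLog[] {
  const oneHourMs = 60 * 60 * 1000;
  return entries.filter((e) => {
    const t = Date.parse(e.ts); // NaN for unparseable timestamps, which then fail the range check
    return e.level === "error" && t >= nowMs - oneHourMs && t <= nowMs;
  });
}

// Approximates: fields.traceId:"abc123"
function byTraceId(entries: ParsedLog[], traceId: string): ParsedLog[] {
  return entries.filter((e) => e.fields.traceId === traceId);
}
```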
+
+---
+
+## 6. Known-Failure Playbooks
+
+### 6.1 Upstream API Down
+
+**Symptoms:** `/health` returns 503; API client logs show retry exhaustion.
+
+**Impact:** Users see error pages or stale data (if caching is in place).
+
+**Steps:**
+1. Confirm upstream status via the upstream team's status page or monitoring
+2. If upstream is aware and working on it: monitor, no action needed on our side
+3. If upstream is unaware: escalate via agreed communication channel
+4. If outage exceeds 30 minutes: consider enabling a maintenance page
+5. Recovery is automatic — once upstream responds, `/health` returns 200 within 60s
+
+### 6.2 SignalR Hub Offline
+
+**Symptoms:** Real-time flight updates stop; SignalR reconnection logs appear.
+
+**Impact:** Users see stale board data; manual refresh still works via REST API.
+
+**Steps:**
+1. Check SignalR hub process/container status
+2. Verify WebSocket connectivity from the node to the SignalR hub
+3. The client auto-reconnects with exponential backoff — recovery is usually automatic
+4. If hub is permanently down: REST polling fallback should activate (if implemented)
+5. Inform users if downtime exceeds 5 minutes
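
To make the reconnect timing in step 3 concrete, the sketch below computes capped exponential delays. It illustrates the general technique rather than the client's actual configuration: the base delay and cap are assumed values, and the object is only shaped like the retry-policy object that the SignalR JavaScript client accepts, which should be verified against the client version in use.

```typescript
// Shape modeled on the retry context passed to a SignalR retry policy;
// only the field used here is declared.
interface RetryContextLike {
  previousRetryCount: number;
}

const BASE_DELAY_MS = 1000; // assumed first delay
const MAX_DELAY_MS = 30000; // assumed cap

const exponentialBackoff = {
  // Doubles the delay on every attempt: 1 s, 2 s, 4 s, 8 s, ... up to the cap.
  nextRetryDelayInMilliseconds(ctx: RetryContextLike): number {
    return Math.min(MAX_DELAY_MS, BASE_DELAY_MS * Math.pow(2, ctx.previousRetryCount));
  },
};
```

With these assumed values the delay reaches the 30 s cap on the sixth attempt, so a hub outage of several minutes produces a steady stream of reconnection log lines at 30 s intervals.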
+
+### 6.3 CSP Violation Spike
+
+**Symptoms:** Spike in CSP violation reports; possibly broken page functionality.
+
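+**Impact:** While CSP is in report-only mode (Phase 1), violations are reported but nothing is blocked. Under an enforcing policy, scripts or styles are blocked and the UI may be partially broken.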
+
+**Steps:**
+1. Check CSP violation reports for the blocked resource URL
+2. If a legitimate resource is blocked: update CSP policy in `src/server/middleware/csp.ts`
+3. If a third-party script is the source: investigate whether it was injected (security concern)
+4. If after a deploy: the new code may reference resources not in the CSP allowlist — fix or rollback
+5. CSP is in report-only mode during Phase 1 — no user impact, but violations should be tracked
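
The difference between report-only and enforcing mode comes down to which response header carries the policy. The sketch below shows how an allowlist could be serialized for either mode. It is not the contents of `src/server/middleware/csp.ts`, and `buildCspHeader`, `CspDirectives`, and the example allowlist are hypothetical.

```typescript
type CspDirectives = { [directive: string]: string[] };

// Returns the [header name, header value] pair for the given mode.
function buildCspHeader(directives: CspDirectives, reportOnly: boolean): [string, string] {
  const headerName = reportOnly
    ? "Content-Security-Policy-Report-Only"
    : "Content-Security-Policy";
  const headerValue = Object.keys(directives)
    .map((d) => d + " " + (directives[d] || []).join(" "))
    .join("; ");
  return [headerName, headerValue];
}

// Hypothetical allowlist for illustration only.
const exampleDirectives: CspDirectives = {
  "default-src": ["'self'"],
  "script-src": ["'self'", "https://analytics.example.com"],
};
```

Adding a legitimate resource in step 2 then amounts to appending its origin to the relevant directive's source list.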
+
+### 6.4 Analytics Adapter Load Failure
+
+**Symptoms:** `flights.analytics.load_failed` counter increases; analytics data gaps.
+
+**Impact:** Analytics data not collected; no user-facing impact.
+
+**Steps:**
+1. Check which adapter(s) failed (Yandex.Metrica, CTM, Variocube, Dynatrace)
+2. Verify the adapter's external script URL is reachable from the client
+3. Check for CORS or CSP blocking the adapter script
+4. If a single adapter: low priority, monitor
+5. If all adapters: likely a CSP or network issue affecting all external scripts
+
+### 6.5 OTel Exporter Unreachable
+
+**Symptoms:** Metrics and traces stop appearing in the monitoring dashboard.
+
+**Impact:** No observability data; no user-facing impact.
+
+**Steps:**
+1. Check the OTel collector/exporter endpoint connectivity
+2. Verify the `OTEL_EXPORTER_OTLP_ENDPOINT` environment variable is correct
+3. Check for network policy changes that may block the exporter
+4. The SDK buffers only a bounded amount of data in memory, so recent data may still be exported once the exporter is reachable again, but data from a longer outage is likely lost
+5. If the exporter is permanently moved: update the endpoint configuration and redeploy
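
Step 2 can be partly automated with a sanity check on the configured value. The sketch below assumes the endpoint is given as an http(s) URL, as the OpenTelemetry exporter configuration describes. `checkOtlpEndpoint` is a hypothetical helper, and it checks only the format of the value, not whether the endpoint is reachable.

```typescript
// Returns a description of the problem, or null if the value looks usable.
function checkOtlpEndpoint(value: string | undefined): string | null {
  if (value === undefined || value.trim() === "") {
    return "OTEL_EXPORTER_OTLP_ENDPOINT is not set";
  }
  // Requires an http or https scheme followed by a non-empty host.
  if (!/^https?:\/\/[^\s\/]+/.test(value)) {
    return "OTEL_EXPORTER_OTLP_ENDPOINT must be an http(s) URL with a host";
  }
  return null;
}
```

In a Node.js process the value would typically be read from `process.env.OTEL_EXPORTER_OTLP_ENDPOINT` and passed to the helper at startup.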
+
+### 6.6 Memory Pressure / OOM Kill
+
+**Symptoms:** Container restarts; OOM kill events in container orchestrator logs.
+
+**Impact:** Requests in flight are dropped; load balancer reroutes to healthy nodes.
+
+**Steps:**
+1. Check container memory limits vs actual usage
+2. Review recent deploys for memory leaks (new dependencies, unbounded caches)
+3. If a specific route causes high memory: check for large API responses or unbounded data structures
+4. Short-term: increase memory limits
+5. Long-term: profile the application to find the leak; fix and redeploy
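
For step 1, comparing usage against the limit as a fraction makes the headroom easy to alert on. This is a minimal sketch: `MemoryReading`, `memoryPressure`, and the 85% warning threshold are assumptions, not values from this runbook.

```typescript
interface MemoryReading {
  usedBytes: number;  // e.g. resident set size reported by the runtime
  limitBytes: number; // container memory limit
}

const WARN_FRACTION = 0.85; // assumed alerting threshold

function memoryPressure(r: MemoryReading): { fraction: number; warn: boolean } {
  const fraction = r.usedBytes / r.limitBytes;
  return { fraction, warn: fraction >= WARN_FRACTION };
}
```

In Node.js, `process.memoryUsage().rss` is one source for `usedBytes`. The limit usually comes from the orchestrator's container spec.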
+
+---
+
+## 7. Recovery SLA
+
+**Target:** Service recovery within 6 hours after infrastructure is restored.
+
+**Recovery steps:**
+1. Infrastructure team restores VMs / containers across geographic regions
+2. Deployment tool re-deploys the last-known-good image
+3. `/health` checks confirm upstream connectivity
+4. Load balancer re-enables traffic to recovered nodes
+5. On-call engineer verifies end-to-end functionality
+6. Incident report filed within 24 hours of resolution