Add operational runbook covering incident response and failure playbooks
Covers incident decision tree, canary rollout, rollback procedures, health-check interpretation, log query cookbook, and 6 known-failure playbooks per master plan requirements.
@@ -0,0 +1,290 @@
# Flights Web: Operational Runbook

**Version:** 1.0 (Phase 1I)
**Last updated:** 2026-04-14

---

## 1. Incident Response Decision Tree

```
Is the service returning errors?
|
+-- YES: Check /health endpoint
|   |
|   +-- /health returns 503
|   |     -> Upstream API issue (see Section 6.1)
|   |
|   +-- /health returns 200 but users see errors
|   |     -> Application-level bug. Check logs (Section 5).
|   |     -> If recent deploy: roll back (Section 3).
|   |
|   +-- /health unreachable (connection refused / timeout)
|         -> Container/VM is down.
|         -> Check container orchestrator status.
|         -> If all replicas down: escalate to infra team (Severity 1).
|         -> If partial: rely on load balancer, investigate affected nodes.
|
+-- NO: Check for degraded performance
    |
    +-- Latency > 2x baseline
    |     -> Check OTel metrics for slow spans.
    |     -> Check upstream API latency.
    |     -> If upstream: see Section 6.1.
    |     -> If internal: check for memory pressure, CPU saturation.
    |
    +-- Intermittent errors in logs
          -> Check error rate trend.
          -> If rising: prepare for rollback.
          -> If stable/low: monitor for 15 min, then investigate.
```

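The "YES" branch of the tree can also be read as a small lookup from the `/health` probe result to a next step. The sketch below is illustrative only: the names `HealthProbe`, `TriageResult`, and `triage` are hypothetical, and the final fallback for unexpected status codes is an assumption that the tree itself does not cover.

```typescript
// Hypothetical types; nothing here is part of the Flights Web codebase.
type HealthProbe =
  | { kind: "status"; code: number } // /health answered with an HTTP status
  | { kind: "unreachable" };         // connection refused or timeout

interface TriageResult {
  diagnosis: string;
  next: string;
}

// Maps a /health probe result to the action listed in the decision tree.
function triage(probe: HealthProbe, allReplicasDown: boolean = false): TriageResult {
  if (probe.kind === "unreachable") {
    return allReplicasDown
      ? { diagnosis: "Container/VM is down", next: "Escalate to infra team (Severity 1)" }
      : { diagnosis: "Container/VM is down", next: "Rely on load balancer; investigate affected nodes" };
  }
  if (probe.code === 503) {
    return { diagnosis: "Upstream API issue", next: "See Section 6.1" };
  }
  if (probe.code === 200) {
    return {
      diagnosis: "Application-level bug",
      next: "Check logs (Section 5); roll back if recently deployed (Section 3)",
    };
  }
  // Assumption: any other status is treated as an unknown failure.
  return { diagnosis: "Unexpected health status", next: "Check logs (Section 5)" };
}
```

A function like this is mainly useful as an executable checklist in on-call tooling; it does not replace the judgment calls in the tree.
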
### Severity Levels

| Severity | Criteria | Response Time | Who to Page |
|----------|----------|---------------|-------------|
| S1 | Service fully down, all users affected | Immediate | On-call engineer + team lead |
| S2 | Partial outage, >10% error rate | 15 min | On-call engineer |
| S3 | Degraded performance, no data loss | 1 hour | On-call engineer (next business day if after hours) |
| S4 | Minor issue, workaround exists | Next business day | Assigned engineer |

---

## 2. Canary Rollout Procedure

### Pre-rollout Checklist

- [ ] All CI checks pass (typecheck, lint, test)
- [ ] Docker images built and pushed to registry
- [ ] Rollback image tag identified (current production tag)
- [ ] Monitoring dashboards open

### Rollout Steps

1. **Deploy canary** (5% traffic) to a single node in one geographic region
2. **Monitor for 10 minutes:**
   - Error rate must stay below 0.5%
   - p99 latency must not exceed 2x baseline
   - `/health` must return 200 on the canary
   - No new error patterns in logs
3. **Expand to 25%** if canary is healthy
4. **Monitor for 15 minutes** with same criteria
5. **Expand to 100%** across all geographic regions
6. **Post-deploy verification:**
   - `/health` returns 200 on all nodes
   - Smoke test passes end-to-end
   - No error rate spike in the first 30 minutes

### Abort Criteria

Roll back immediately if any of these occur during canary:
- Error rate exceeds 1%
- `/health` returns 503 on canary nodes
- p99 latency exceeds 5x baseline
- Any S1/S2 incident triggered

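The monitoring and abort thresholds above can be combined into one gate check. This is a minimal sketch, not the pipeline's actual logic: `CanaryMetrics`, `Verdict`, and `evaluateCanary` are hypothetical names, and the `"hold"` verdict is an assumption for the gap the runbook leaves open (for example, an error rate between 0.5% and 1%), where the canary is neither healthy enough to expand nor bad enough to abort.

```typescript
// Hypothetical shape for the metrics observed during a canary window.
interface CanaryMetrics {
  errorRate: number;         // fraction of failed requests, e.g. 0.004 = 0.4%
  p99Ms: number;             // observed p99 latency on the canary
  baselineP99Ms: number;     // p99 latency of the current production version
  healthStatus: number;      // HTTP status returned by /health on the canary
  newErrorPatterns: boolean; // true if logs show errors not seen before
  s1OrS2Incident: boolean;   // true if an S1/S2 incident was triggered
}

type Verdict = "promote" | "hold" | "abort";

function evaluateCanary(m: CanaryMetrics): Verdict {
  const latencyRatio = m.p99Ms / m.baselineP99Ms;

  // Abort criteria: any single one triggers an immediate rollback.
  if (m.errorRate > 0.01 || m.healthStatus === 503 || latencyRatio > 5 || m.s1OrS2Incident) {
    return "abort";
  }

  // Expansion criteria: all must hold for the whole monitoring window.
  const healthy =
    m.errorRate < 0.005 &&
    latencyRatio <= 2 &&
    m.healthStatus === 200 &&
    !m.newErrorPatterns;

  // Assumption: anything in between keeps the canary at its current traffic share.
  return healthy ? "promote" : "hold";
}
```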
---

## 3. Rollback Procedure

### 3.1 Automatic Rollback

The deploy pipeline monitors `/health` after deployment. If the health check fails within the first 5 minutes post-deploy:

1. Pipeline automatically reverts to the previous image tag
2. Alert fires to the on-call channel
3. Engineer investigates the failed deployment logs

**No manual action is required** to trigger the auto-rollback, but verify that it succeeded by checking:
- `/health` returns 200
- Error rate returns to baseline
- Previous image tag is running on all nodes

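The trigger condition for auto-rollback can be sketched as a pure check over health samples collected after a deploy. The names `HealthSample` and `shouldAutoRollback` are hypothetical, and the sketch assumes a single failed check inside the 5-minute window is enough to trigger a rollback; the real pipeline may require several consecutive failures.

```typescript
// Hypothetical record of one /health poll.
interface HealthSample {
  atMs: number;          // timestamp of the poll, in epoch milliseconds
  status: number | null; // HTTP status, or null if /health was unreachable
}

const WATCH_WINDOW_MS = 5 * 60 * 1000; // "first 5 minutes post-deploy"

// Returns true if any sample inside the watch window is not a 200.
function shouldAutoRollback(deployedAtMs: number, samples: HealthSample[]): boolean {
  return samples.some(
    (s) =>
      s.atMs >= deployedAtMs &&
      s.atMs - deployedAtMs <= WATCH_WINDOW_MS &&
      s.status !== 200,
  );
}
```

Failures outside the window return `false` here, which matches the case covered by manual rollback: a problem discovered later.
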
### 3.2 Manual Rollback

If auto-rollback did not trigger or a problem is discovered later:

1. **Identify the last-known-good image tag** from the deployment history
2. **Redeploy the previous image:**
   ```bash
   # Placeholder: actual commands depend on the customer's deployment tool
   # Example:
   # deploy --image $REGISTRY/flights-web-standalone:$PREVIOUS_SHA --env production
   ```
3. **Verify rollback:**
   - `/health` returns 200 on all nodes
   - Error rate returns to baseline
   - Smoke test passes
4. **Post-mortem:** file an incident report within 24 hours

---

## 4. Health-Check Interpretation

### Endpoint: `GET /health`

| Response body | HTTP status | Meaning | Action |
|---------------|-------------|---------|--------|
| `{ "status": "ok" }` | 200 | Upstream API reachable within last 60s | None |
| `{ "status": "degraded", "reason": "upstream_unreachable" }` | 503 | No successful upstream ping in 60s | Check upstream API status; see Section 6.1 |

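The two rows of the table reduce to one freshness check on the last successful upstream ping. The following is a sketch of that rule, not the service's actual handler: `evaluateHealth`, `HealthResult`, and `UPSTREAM_FRESHNESS_MS` are hypothetical names, and treating "exactly 60s ago" as still fresh is an assumption.

```typescript
const UPSTREAM_FRESHNESS_MS = 60 * 1000; // 60s window from the table above

interface HealthResult {
  httpStatus: 200 | 503;
  body: { status: "ok" } | { status: "degraded"; reason: "upstream_unreachable" };
}

// lastUpstreamSuccessMs is null if no upstream ping has ever succeeded.
function evaluateHealth(lastUpstreamSuccessMs: number | null, nowMs: number): HealthResult {
  const fresh =
    lastUpstreamSuccessMs !== null &&
    nowMs - lastUpstreamSuccessMs <= UPSTREAM_FRESHNESS_MS;

  if (fresh) {
    return { httpStatus: 200, body: { status: "ok" } };
  }
  return { httpStatus: 503, body: { status: "degraded", reason: "upstream_unreachable" } };
}
```

Reading the endpoint this way makes one consequence explicit: a 503 describes the upstream dependency, not the health of the node's own process.
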
### Common Causes of 503

1. **Upstream API is down:** check upstream service status page / monitoring
2. **Network partition:** the node cannot reach the upstream API; check network policies
3. **DNS resolution failure:** verify DNS configuration on the node
4. **Upstream API overloaded:** ping times out; coordinate with upstream team

### Load Balancer Behavior

When `/health` returns 503, the load balancer should stop routing traffic to that node. When the upstream recovers and `/health` returns 200 again, traffic automatically resumes.

---

## 5. Log Query Cookbook

Logs are shipped in JSON Lines format to the customer's log aggregation system.

### Log Structure

```json
{
  "ts": "2026-04-14T12:00:00.000Z",
  "level": "error",
  "msg": "Request failed",
  "fields": {
    "traceId": "abc123",
    "path": "/api/flights",
    "status": 500,
    "err": "TypeError: Cannot read properties of undefined"
  }
}
```

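When the log aggregation system is unavailable, an exported JSON Lines file can be filtered locally. The sketch below assumes the log structure shown above; `LogEntry`, `parseJsonLines`, and `errorsSince` are hypothetical helper names, and malformed lines are not handled (`JSON.parse` will throw on them).

```typescript
// Mirrors the log structure above; `fields` is optional as an assumption.
interface LogEntry {
  ts: string;
  level: string;
  msg: string;
  fields?: Record<string, unknown>;
}

// Parses JSON Lines text: one JSON object per non-empty line.
function parseJsonLines(text: string): LogEntry[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as LogEntry);
}

// Keeps error-level entries whose timestamp is at or after sinceMs.
function errorsSince(entries: LogEntry[], sinceMs: number): LogEntry[] {
  return entries.filter((e) => e.level === "error" && Date.parse(e.ts) >= sinceMs);
}
```

For example, `errorsSince(parseJsonLines(fileText), Date.now() - 60 * 60 * 1000)` returns the errors from the last hour, where `fileText` is the content of an exported log file.
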
### Common Queries

The queries below use Lucene-style syntax; adapt them to the customer's log aggregation system.

**Find all errors in the last hour:**
```
level:error AND ts:[now-1h TO now]
```

**Find all entries for a specific trace:**
```
fields.traceId:"abc123"
```

**Find slow requests (logged by the API client on timeout):**
```
msg:"Retrying request" OR msg:"upstream_timeout"
```

**Find health-check failures:**
```
msg:"upstream_unreachable" OR (fields.path:"/health" AND fields.status:503)
```

**Find graceful shutdown events:**
```
msg:"SIGTERM received" OR msg:"Server closed" OR msg:"Drain timeout exceeded"
```

**Find CSP violations (if CSP reporting is enabled):**
```
msg:"csp-violation" OR fields.type:"csp-report"
```

---

## 6. Known-Failure Playbooks

### 6.1 Upstream API Down

**Symptoms:** `/health` returns 503; API client logs show retry exhaustion.

**Impact:** Users see error pages or stale data (if caching is in place).

**Steps:**
1. Confirm upstream status via the upstream team's status page or monitoring
2. If upstream is aware and working on it: monitor, no action needed on our side
3. If upstream is unaware: escalate via the agreed communication channel
4. If outage exceeds 30 minutes: consider enabling a maintenance page
5. Recovery is automatic: once upstream responds, `/health` returns 200 within 60s

### 6.2 SignalR Hub Offline

**Symptoms:** Real-time flight updates stop; SignalR reconnection logs appear.

**Impact:** Users see stale board data; manual refresh still works via REST API.

**Steps:**
1. Check SignalR hub process/container status
2. Verify WebSocket connectivity from the node to the SignalR hub
3. The client auto-reconnects with exponential backoff, so recovery is usually automatic
4. If hub is permanently down: REST polling fallback should activate (if implemented)
5. Inform users if downtime exceeds 5 minutes

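For reference when reading reconnection logs, exponential backoff doubles the wait between attempts up to a cap. The sketch below shows the general technique only; `backoffDelayMs` is a hypothetical helper, and the 1s base and 30s cap are assumed values, not the client's actual reconnect settings.

```typescript
// Delay before reconnect attempt N (0-based): base * 2^N, capped at maxMs.
// baseMs and maxMs are illustrative defaults, not values from the client config.
function backoffDelayMs(attempt: number, baseMs: number = 1000, maxMs: number = 30000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

// First few delays with the assumed defaults: 1000, 2000, 4000, 8000, 16000, 30000 ms.
const exampleSchedule: number[] = [0, 1, 2, 3, 4, 5].map((n) => backoffDelayMs(n));
```

With a schedule like this, gaps between reconnection log lines should grow and then level off; evenly spaced or missing retries suggest the client is not retrying as expected.
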
### 6.3 CSP Violation Spike

**Symptoms:** Spike in CSP violation reports; possibly broken page functionality.

**Impact:** In enforcing mode, scripts or styles are blocked by Content-Security-Policy and the UI may be partially broken. In report-only mode (Phase 1), nothing is blocked and there is no user impact.

**Steps:**
1. Check CSP violation reports for the blocked resource URL
2. If a legitimate resource is blocked: update CSP policy in `src/server/middleware/csp.ts`
3. If a third-party script is the source: investigate whether it was injected (security concern)
4. If after a deploy: the new code may reference resources not in the CSP allowlist; fix or roll back
5. CSP is in report-only mode during Phase 1, so violations have no user impact but should still be tracked

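The difference between report-only and enforcing mode comes down to which response header carries the policy. The sketch below illustrates that with a hypothetical `buildCspHeader` helper; it is not the content of `src/server/middleware/csp.ts`, and the directive values in the example are placeholders.

```typescript
// Directive name -> list of allowed sources, e.g. { "script-src": ["'self'"] }.
type CspDirectives = Record<string, string[]>;

// Serializes directives and picks the header name for the chosen mode.
function buildCspHeader(
  directives: CspDirectives,
  reportOnly: boolean,
): { name: string; value: string } {
  const value = Object.keys(directives)
    .map((directive) => [directive].concat(directives[directive]).join(" "))
    .join("; ");

  return {
    // Report-only: browsers report violations but do not block anything.
    name: reportOnly ? "Content-Security-Policy-Report-Only" : "Content-Security-Policy",
    value,
  };
}

// Placeholder policy; https://analytics.example is not a real allowlist entry.
const exampleCsp = buildCspHeader(
  { "default-src": ["'self'"], "script-src": ["'self'", "https://analytics.example"] },
  true,
);
```

Allowing a legitimate resource (step 2) corresponds to adding its origin to the relevant directive's source list.
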
### 6.4 Analytics Adapter Load Failure

**Symptoms:** `flights.analytics.load_failed` counter increases; analytics data gaps.

**Impact:** Analytics data not collected; no user-facing impact.

**Steps:**
1. Check which adapter(s) failed (Yandex.Metrica, CTM, Variocube, Dynatrace)
2. Verify the adapter's external script URL is reachable from the client
3. Check for CORS or CSP blocking the adapter script
4. If a single adapter: low priority, monitor
5. If all adapters: likely a CSP or network issue affecting all external scripts

### 6.5 OTel Exporter Unreachable

**Symptoms:** Metrics and traces stop appearing in the monitoring dashboard.

**Impact:** No observability data; no user-facing impact.

**Steps:**
1. Check the OTel collector/exporter endpoint connectivity
2. Verify the `OTEL_EXPORTER_OTLP_ENDPOINT` environment variable is correct
3. Check for network policy changes that may block the exporter
4. The SDK buffers data locally, so some data may be recoverable once the exporter is reachable again (the buffer is bounded, so older data may be dropped during a long outage)
5. If the exporter is permanently moved: update the endpoint configuration and redeploy

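Step 2 can be partly automated with a basic sanity check on the variable's value. This is a minimal sketch: `checkOtlpEndpoint` is a hypothetical helper, the regular expression only checks for an `http://` or `https://` scheme followed by a host, and it does not test whether the endpoint is actually reachable.

```typescript
// Returns a problem description, or null if the value looks like an http(s) URL.
function checkOtlpEndpoint(value: string | undefined): string | null {
  if (value === undefined || value.trim() === "") {
    return "OTEL_EXPORTER_OTLP_ENDPOINT is not set";
  }
  // Simplified check: scheme plus at least one host character.
  if (!/^https?:\/\/[^\s\/]+/.test(value.trim())) {
    return "OTEL_EXPORTER_OTLP_ENDPOINT is not an http(s) URL";
  }
  return null;
}
```

A missing scheme (for example `otel-collector:4318` instead of `http://otel-collector:4318`, where the host name is a placeholder) is the kind of misconfiguration this check catches.
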
### 6.6 Memory Pressure / OOM Kill

**Symptoms:** Container restarts; OOM kill events in container orchestrator logs.

**Impact:** Requests in flight are dropped; load balancer reroutes to healthy nodes.

**Steps:**
1. Check container memory limits vs actual usage
2. Review recent deploys for memory leaks (new dependencies, unbounded caches)
3. If a specific route causes high memory: check for large API responses or unbounded data structures
4. Short-term: increase memory limits
5. Long-term: profile the application to find the leak; fix and redeploy

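Step 1 compares actual usage against the container limit. A sketch of that comparison is shown below; `memoryLevel` is a hypothetical helper, and the 80% and 95% thresholds are assumed values, not limits defined by this runbook. If the service runs on Node.js, `process.memoryUsage().rss` is one possible source for the usage figure.

```typescript
type MemoryLevel = "ok" | "warning" | "critical";

// Classifies memory usage relative to the container limit.
// warnAt and criticalAt are illustrative thresholds (fractions of the limit).
function memoryLevel(
  usedBytes: number,
  limitBytes: number,
  warnAt: number = 0.8,
  criticalAt: number = 0.95,
): MemoryLevel {
  if (limitBytes <= 0) {
    throw new RangeError("limitBytes must be positive");
  }
  const ratio = usedBytes / limitBytes;
  if (ratio >= criticalAt) return "critical";
  if (ratio >= warnAt) return "warning";
  return "ok";
}
```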
---

## Recovery SLA

**Target:** Service recovery within 6 hours after infrastructure is restored.

**Recovery steps:**
1. Infrastructure team restores VMs / containers across geographic regions
2. Deployment tool re-deploys the last-known-good image
3. `/health` checks confirm upstream connectivity
4. Load balancer re-enables traffic to recovered nodes
5. On-call engineer verifies end-to-end functionality
6. Incident report filed within 24 hours of resolution