# Flights Web: Operational Runbook

**Version:** 1.0 (Phase 1I)
**Last updated:** 2026-04-14

---

## 1. Incident Response Decision Tree

```
Is the service returning errors?
|
+-- YES: Check /health endpoint
|   |
|   +-- /health returns 503
|   |   -> Upstream API issue (see Section 6.1)
|   |
|   +-- /health returns 200 but users see errors
|   |   -> Application-level bug. Check logs (Section 5).
|   |   -> If recent deploy: rollback (Section 3).
|   |
|   +-- /health unreachable (connection refused / timeout)
|       -> Container/VM is down.
|       -> Check container orchestrator status.
|       -> If all replicas down: escalate to infra team (Severity 1).
|       -> If partial: rely on load balancer, investigate affected nodes.
|
+-- NO: Check for degraded performance
    |
    +-- Latency > 2x baseline
    |   -> Check OTel metrics for slow spans.
    |   -> Check upstream API latency.
    |   -> If upstream: see Section 6.1.
    |   -> If internal: check for memory pressure, CPU saturation.
    |
    +-- Intermittent errors in logs
        -> Check error rate trend.
        -> If rising: prepare for rollback.
        -> If stable/low: monitor for 15 min, then investigate.
```

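The tree above can also be read as a small pure function, which may be useful for alert routing or a chat-ops helper. The following is a minimal sketch only: the `Observation` shape, the `triage` name, and the returned strings are assumptions made for illustration and are not part of the Flights Web codebase.

```typescript
// Hypothetical encoding of the decision tree; every name here is illustrative.
type HealthProbe = "ok_200" | "degraded_503" | "unreachable";

interface Observation {
  returningErrors: boolean;    // is the service returning errors?
  health: HealthProbe;         // result of probing /health
  recentDeploy: boolean;       // was there a deploy shortly before the errors?
  allReplicasDown: boolean;    // only relevant when /health is unreachable
  latencyVsBaseline: number;   // current latency divided by baseline latency
  intermittentErrors: boolean; // sporadic errors visible in logs
  errorRateRising: boolean;    // trend of those errors
}

function triage(o: Observation): string {
  if (o.returningErrors) {
    if (o.health === "degraded_503") return "Upstream API issue: see Section 6.1";
    if (o.health === "ok_200") {
      return o.recentDeploy
        ? "Application-level bug after a deploy: check logs (Section 5), roll back (Section 3)"
        : "Application-level bug: check logs (Section 5)";
    }
    // /health unreachable: the container or VM is down.
    return o.allReplicasDown
      ? "All replicas down: escalate to infra team (Severity 1)"
      : "Partial outage: rely on load balancer, investigate affected nodes";
  }
  if (o.latencyVsBaseline > 2) {
    return "Latency above 2x baseline: check OTel spans, then upstream API latency";
  }
  if (o.intermittentErrors) {
    return o.errorRateRising
      ? "Error rate rising: prepare for rollback"
      : "Error rate stable/low: monitor for 15 min, then investigate";
  }
  return "No action indicated by the tree";
}
```

The final fallback string is not in the tree; it only covers the case where none of the listed conditions hold.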
### Severity Levels

| Severity | Criteria | Response Time | Who to Page |
|----------|----------|---------------|-------------|
| S1 | Service fully down, all users affected | Immediate | On-call engineer + team lead |
| S2 | Partial outage, >10% error rate | 15 min | On-call engineer |
| S3 | Degraded performance, no data loss | 1 hour | On-call engineer (next business day if after hours) |
| S4 | Minor issue, workaround exists | Next business day | Assigned engineer |

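As a rough illustration of how the table could drive automated paging, the sketch below maps a few observed signals to a severity level. The `Signals` fields and the `classifySeverity` name are assumptions. The table defines a numeric threshold only for S2 (>10% error rate), so the S3/S4 split here relies on a hypothetical boolean flag rather than anything the runbook specifies.

```typescript
// Illustrative only: field names and the S3/S4 split are assumptions.
type Severity = "S1" | "S2" | "S3" | "S4";

interface Signals {
  fullyDown: boolean;           // service down for all users
  errorRate: number;            // fraction of failed requests, 0..1
  degradedPerformance: boolean; // slow but functional, no data loss
}

function classifySeverity(s: Signals): Severity {
  if (s.fullyDown) return "S1";           // page on-call engineer + team lead
  if (s.errorRate > 0.1) return "S2";     // partial outage, >10% error rate
  if (s.degradedPerformance) return "S3"; // respond within 1 hour
  return "S4";                            // minor issue, workaround exists
}
```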
---

## 2. Canary Rollout Procedure

### Pre-rollout Checklist

- [ ] All CI checks pass (typecheck, lint, test)
- [ ] Docker images built and pushed to registry
- [ ] Rollback image tag identified (current production tag)
- [ ] Monitoring dashboards open

### Rollout Steps

1. **Deploy canary** (5% traffic) to a single node in one geographic region
2. **Monitor for 10 minutes:**
   - Error rate must stay below 0.5%
   - p99 latency must not exceed 2x baseline
   - `/health` must return 200 on the canary
   - No new error patterns in logs
3. **Expand to 25%** if the canary is healthy
4. **Monitor for 15 minutes** with the same criteria
5. **Expand to 100%** across all geographic regions
6. **Post-deploy verification:**
   - `/health` returns 200 on all nodes
   - Smoke test passes end-to-end
   - No error rate spike in the first 30 minutes

### Abort Criteria

Roll back immediately if any of these occur during canary:

- Error rate exceeds 1%
- `/health` returns 503 on canary nodes
- p99 latency exceeds 5x baseline
- Any S1/S2 incident triggered

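The monitoring gates and abort criteria can be combined into one check, sketched below under stated assumptions. The `CanaryMetrics` shape and the `evaluateCanary` name are hypothetical. The `"hold"` verdict is also an assumption: the runbook does not say what to do when metrics sit between the healthy thresholds (0.5% errors, 2x p99) and the abort thresholds (1% errors, 5x p99), so this sketch simply declines to expand in that range.

```typescript
// Illustrative gate for the canary steps; names and the "hold" state are assumptions.
interface CanaryMetrics {
  errorRate: number;         // fraction, e.g. 0.004 means 0.4%
  p99Ms: number;             // observed p99 latency on the canary
  baselineP99Ms: number;     // p99 latency before the rollout
  healthStatus: number;      // HTTP status returned by /health on the canary
  newErrorPatterns: boolean; // previously unseen error messages in logs
  activeS1OrS2: boolean;     // an S1/S2 incident was triggered
}

type Verdict = "abort" | "hold" | "proceed";

function evaluateCanary(m: CanaryMetrics): Verdict {
  const latencyRatio = m.p99Ms / m.baselineP99Ms;

  // Abort criteria: any single one triggers an immediate rollback.
  if (m.errorRate > 0.01 || m.healthStatus === 503 || latencyRatio > 5 || m.activeS1OrS2) {
    return "abort";
  }

  // Healthy criteria: all must hold before expanding traffic.
  const healthy =
    m.errorRate < 0.005 &&
    latencyRatio <= 2 &&
    m.healthStatus === 200 &&
    !m.newErrorPatterns;

  return healthy ? "proceed" : "hold";
}
```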
---

## 3. Rollback Procedure

### 3.1 Automatic Rollback

The deploy pipeline monitors `/health` after deployment. If the health check fails within the first 5 minutes post-deploy:

1. Pipeline automatically reverts to the previous image tag
2. Alert fires to the on-call channel
3. Engineer investigates the failed deployment logs

**No manual action is required** to trigger auto-rollback, but verify that it succeeded by checking:
- `/health` returns 200
- Error rate returns to baseline
- Previous image tag is running on all nodes

### 3.2 Manual Rollback

If auto-rollback did not trigger or a problem is discovered later:

1. **Identify the last-known-good image tag** from the deployment history
2. **Redeploy the previous image:**
   ```bash
   # Placeholder: actual commands depend on the customer's deployment tool
   # Example:
   # deploy --image $REGISTRY/flights-web-standalone:$PREVIOUS_SHA --env production
   ```
3. **Verify rollback:**
   - `/health` returns 200 on all nodes
   - Error rate returns to baseline
   - Smoke test passes
4. **Post-mortem:** file an incident report within 24 hours

---

## 4. Health-Check Interpretation

### Endpoint: `GET /health`

| Response | Status | Meaning | Action |
|----------|--------|---------|--------|
| `{ "status": "ok" }` | 200 | Upstream API reachable within last 60s | None |
| `{ "status": "degraded", "reason": "upstream_unreachable" }` | 503 | No successful upstream ping in 60s | Check upstream API status; see Section 6.1 |

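The two rows of the table amount to a freshness check on the last successful upstream ping. The sketch below shows one way that rule could be expressed. It is not the actual handler: `evaluateHealth`, `HealthResult`, and the idea of passing timestamps in as arguments are assumptions chosen to keep the example self-contained.

```typescript
// Illustrative freshness rule behind GET /health; not the real handler.
interface HealthResult {
  httpStatus: 200 | 503;
  body: { status: "ok" } | { status: "degraded"; reason: "upstream_unreachable" };
}

const UPSTREAM_FRESHNESS_MS = 60 * 1000; // the 60s window from the table

function evaluateHealth(lastUpstreamSuccessMs: number | null, nowMs: number): HealthResult {
  // null means no upstream ping has ever succeeded on this node.
  const fresh =
    lastUpstreamSuccessMs !== null &&
    nowMs - lastUpstreamSuccessMs <= UPSTREAM_FRESHNESS_MS;

  if (fresh) {
    return { httpStatus: 200, body: { status: "ok" } };
  }
  return { httpStatus: 503, body: { status: "degraded", reason: "upstream_unreachable" } };
}
```

Under this reading, a node flips to 503 once 60 seconds pass without a successful ping and flips back on the first success afterwards.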
### Common Causes of 503

1. **Upstream API is down:** check upstream service status page / monitoring
2. **Network partition:** the node cannot reach the upstream API; check network policies
3. **DNS resolution failure:** verify DNS configuration on the node
4. **Upstream API overloaded:** ping times out; coordinate with upstream team

### Load Balancer Behavior

When `/health` returns 503, the load balancer should stop routing traffic to that node. When the upstream recovers and `/health` returns 200 again, traffic automatically resumes.

---

## 5. Log Query Cookbook

Logs are shipped in JSON Lines format to the customer's log aggregation system.

### Log Structure

```json
{
  "ts": "2026-04-14T12:00:00.000Z",
  "level": "error",
  "msg": "Request failed",
  "fields": {
    "traceId": "abc123",
    "path": "/api/flights",
    "status": 500,
    "err": "TypeError: Cannot read properties of undefined"
  }
}
```

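When the log aggregation system is unavailable, the same structure can be filtered locally from raw JSON Lines. The sketch below is an assumption-laden example: `LogEntry`, `parseLogLine`, and `errorsSince` are hypothetical names, and only the `ts`, `level`, `msg`, and `fields` keys shown above are relied on.

```typescript
// Illustrative local filter over JSON Lines; names are not from the codebase.
interface LogEntry {
  ts: string;    // ISO 8601 timestamp
  level: string; // e.g. "error"
  msg: string;
  fields?: { [key: string]: unknown };
}

// Returns null for malformed lines or lines missing the required keys.
function parseLogLine(line: string): LogEntry | null {
  try {
    const v = JSON.parse(line);
    if (v && typeof v.ts === "string" && typeof v.level === "string" && typeof v.msg === "string") {
      return v as LogEntry;
    }
  } catch (_err) {
    // Not valid JSON: fall through and return null.
  }
  return null;
}

// Keeps error-level entries whose timestamp is at or after sinceMs (epoch milliseconds).
function errorsSince(lines: string[], sinceMs: number): LogEntry[] {
  const out: LogEntry[] = [];
  for (let i = 0; i < lines.length; i++) {
    const entry = parseLogLine(lines[i]);
    if (entry && entry.level === "error" && Date.parse(entry.ts) >= sinceMs) {
      out.push(entry);
    }
  }
  return out;
}
```

Passing `Date.now() - 60 * 60 * 1000` as `sinceMs` would select errors from the last hour.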
### Common Queries

**Find all errors in the last hour:**
```
level:error AND ts:[now-1h TO now]
```

**Find errors for a specific trace:**
```
fields.traceId:"abc123"
```

**Find slow requests (logged by the API client on timeout):**
```
msg:"Retrying request" OR msg:"upstream_timeout"
```

**Find health-check failures:**
```
msg:"upstream_unreachable" OR (path:"/health" AND status:503)
```

**Find graceful shutdown events:**
```
msg:"SIGTERM received" OR msg:"Server closed" OR msg:"Drain timeout exceeded"
```

**Find CSP violations (if CSP reporting is enabled):**
```
msg:"csp-violation" OR fields.type:"csp-report"
```

---

## 6. Known-Failure Playbooks

### 6.1 Upstream API Down

**Symptoms:** `/health` returns 503; API client logs show retry exhaustion.

**Impact:** Users see error pages or stale data (if caching is in place).

**Steps:**
1. Confirm upstream status via the upstream team's status page or monitoring
2. If upstream is aware and working on it: monitor, no action needed on our side
3. If upstream is unaware: escalate via agreed communication channel
4. If outage exceeds 30 minutes: consider enabling a maintenance page
5. Recovery is automatic: once upstream responds, `/health` returns 200 within 60s

### 6.2 SignalR Hub Offline

**Symptoms:** Real-time flight updates stop; SignalR reconnection logs appear.

**Impact:** Users see stale board data; manual refresh still works via REST API.

**Steps:**
1. Check SignalR hub process/container status
2. Verify WebSocket connectivity from the node to the SignalR hub
3. The client auto-reconnects with exponential backoff, so recovery is usually automatic
4. If hub is permanently down: REST polling fallback should activate (if implemented)
5. Inform users if downtime exceeds 5 minutes

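Step 3 relies on exponential backoff, which determines how long recovery can take after the hub comes back. The sketch below shows how such delays are commonly computed. It is a generic illustration, not the Flights Web client's actual retry policy: `backoffDelayMs`, the 1-second base, and the 30-second cap are all assumed values.

```typescript
// Generic capped exponential backoff; base and cap are assumed example values.
function backoffDelayMs(attempt: number, baseMs: number = 1000, capMs: number = 30000): number {
  // attempt is 0-based: the first retry waits baseMs, each later retry doubles the wait.
  const uncapped = baseMs * Math.pow(2, attempt);
  return Math.min(uncapped, capMs);
}
```

With these example values, the first six delays are 1s, 2s, 4s, 8s, 16s, and 30s (capped), so a client that has been retrying for a while may take up to the cap to notice that the hub is back.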
### 6.3 CSP Violation Spike

**Symptoms:** Spike in CSP violation reports; possibly broken page functionality.

**Impact:** When the policy is enforced, scripts or styles are blocked by Content-Security-Policy and the UI may be partially broken. In report-only mode (Phase 1, see step 5), violations are reported but nothing is blocked.

**Steps:**
1. Check CSP violation reports for the blocked resource URL
2. If a legitimate resource is blocked: update CSP policy in `src/server/middleware/csp.ts`
3. If a third-party script is the source: investigate whether it was injected (security concern)
4. If after a deploy: the new code may reference resources not in the CSP allowlist; fix or rollback
5. CSP is in report-only mode during Phase 1, so there is no user impact, but violations should be tracked

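The difference between report-only and enforcing mode comes down to which response header carries the policy. The sketch below illustrates that with a small header builder. It makes no claim about the contents of `src/server/middleware/csp.ts`: `buildCspHeader`, the `Directives` type, and the example sources (including `https://cdn.example.com`) are hypothetical.

```typescript
// Illustrative header builder; not the contents of csp.ts.
type Directives = { [directive: string]: string[] };

function buildCspHeader(directives: Directives, reportOnly: boolean): { name: string; value: string } {
  // Each directive becomes "name source1 source2", joined with "; ".
  const parts = Object.keys(directives).map(function (key) {
    return key + " " + directives[key].join(" ");
  });
  return {
    // Report-only mode reports violations without blocking anything.
    name: reportOnly ? "Content-Security-Policy-Report-Only" : "Content-Security-Policy",
    value: parts.join("; "),
  };
}

// Example: allowing one additional script origin (placeholder URL).
const exampleHeader = buildCspHeader(
  {
    "default-src": ["'self'"],
    "script-src": ["'self'", "https://cdn.example.com"],
  },
  true
);
```

Adding a legitimate blocked origin (step 2) would mean appending it to the relevant directive's source list.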
### 6.4 Analytics Adapter Load Failure

**Symptoms:** `flights.analytics.load_failed` counter increases; analytics data gaps.

**Impact:** Analytics data not collected; no user-facing impact.

**Steps:**
1. Check which adapter(s) failed (Yandex.Metrica, CTM, Variocube, Dynatrace)
2. Verify the adapter's external script URL is reachable from the client
3. Check for CORS or CSP blocking the adapter script
4. If a single adapter: low priority, monitor
5. If all adapters: likely a CSP or network issue affecting all external scripts

### 6.5 OTel Exporter Unreachable

**Symptoms:** Metrics and traces stop appearing in the monitoring dashboard.

**Impact:** No observability data; no user-facing impact.

**Steps:**
1. Check the OTel collector/exporter endpoint connectivity
2. Verify the `OTEL_EXPORTER_OTLP_ENDPOINT` environment variable is correct
3. Check for network policy changes that may block the exporter
4. The SDK batches telemetry in a bounded in-memory queue, so data from a short outage may still be exported once the exporter is reachable again; data beyond the queue limit is dropped
5. If the exporter is permanently moved: update the endpoint configuration and redeploy

### 6.6 Memory Pressure / OOM Kill

**Symptoms:** Container restarts; OOM kill events in container orchestrator logs.

**Impact:** Requests in flight are dropped; load balancer reroutes to healthy nodes.

**Steps:**
1. Check container memory limits vs actual usage
2. Review recent deploys for memory leaks (new dependencies, unbounded caches)
3. If a specific route causes high memory: check for large API responses or unbounded data structures
4. Short-term: increase memory limits
5. Long-term: profile the application to find the leak; fix and redeploy

---

## Recovery SLA

**Target:** Service recovery within 6 hours after infrastructure is restored.

**Recovery steps:**
1. Infrastructure team restores VMs / containers across geographic regions
2. Deployment tool re-deploys the last-known-good image
3. `/health` checks confirm upstream connectivity
4. Load balancer re-enables traffic to recovered nodes
5. On-call engineer verifies end-to-end functionality
6. Incident report filed within 24 hours of resolution

---

## 7. Phase 6 Cutover Reference

This section cross-references the full cutover runbook at `docs/superpowers/plans/2026-04-15-phase-6-cutover.md`.

### 7.1 Traffic Ramp Quick Reference

During cutover, traffic shifts from Angular to React over 72 hours:
- **T+0h:** 5% React / 95% Angular
- **T+12h:** 25% React / 75% Angular
- **T+24h:** 50% React / 50% Angular
- **T+48h:** 100% React / 0% Angular

Each step requires explicit go/no-go from the on-call engineer. Full details in the cutover runbook Section 3.

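For dashboards or a go/no-go checklist, the schedule above can be looked up programmatically. The sketch below encodes the four listed steps; `RAMP` and `plannedReactPercent` are hypothetical names, and the function returns only the planned share for a given time. It does not model the go/no-go decision, which remains a human call.

```typescript
// The four ramp steps listed above; names are illustrative.
interface RampStep {
  atHours: number;      // hours since cutover start (T+0h)
  reactPercent: number; // planned share of traffic served by React
}

const RAMP: RampStep[] = [
  { atHours: 0, reactPercent: 5 },
  { atHours: 12, reactPercent: 25 },
  { atHours: 24, reactPercent: 50 },
  { atHours: 48, reactPercent: 100 },
];

// Planned React share at a given time; the Angular share is 100 minus this value.
function plannedReactPercent(hoursSinceStart: number): number {
  let percent = 0; // before T+0h all traffic is still on Angular
  for (let i = 0; i < RAMP.length; i++) {
    if (hoursSinceStart >= RAMP[i].atHours) {
      percent = RAMP[i].reactPercent;
    }
  }
  return percent;
}
```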
### 7.2 Cutover Rollback

During or after the traffic ramp, if a rollback is needed:

1. Flip proxy weights back to Angular (< 1 minute)
2. Verify Angular is serving traffic via response headers
3. Confirm error rate and latency return to baseline
4. File post-mortem within 24 hours

**Trigger criteria:** error rate > 1% for 5+ minutes, p95 > 2x baseline for 10+ minutes, or > 50% of React nodes returning 503.

Full rollback procedure in the cutover runbook Section 5.

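The trigger criteria in this section mix instantaneous and sustained conditions, which is easy to get wrong in an alert rule. The sketch below evaluates them over per-minute samples. It assumes one sample per minute with the most recent last; `MinuteSample`, `sustained`, and `shouldRollBack` are hypothetical names, not part of the cutover runbook.

```typescript
// Illustrative evaluation of the trigger criteria; assumes one sample per minute.
interface MinuteSample {
  errorRate: number;     // fraction, e.g. 0.012 means 1.2%
  p95VsBaseline: number; // observed p95 divided by baseline p95
}

// True when the last `minutes` samples all breach the given condition.
function sustained(
  samples: MinuteSample[],
  minutes: number,
  breached: (s: MinuteSample) => boolean
): boolean {
  if (samples.length < minutes) return false;
  return samples.slice(samples.length - minutes).every(breached);
}

function shouldRollBack(samples: MinuteSample[], reactNodes: number, reactNodes503: number): boolean {
  const errorBreach = sustained(samples, 5, function (s) { return s.errorRate > 0.01; });
  const latencyBreach = sustained(samples, 10, function (s) { return s.p95VsBaseline > 2; });
  const nodeBreach = reactNodes > 0 && reactNodes503 / reactNodes > 0.5;
  return errorBreach || latencyBreach || nodeBreach;
}
```

A single bad minute does not trigger a rollback under this reading; the node criterion is the only one that fires immediately.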
### 7.3 Post-Cutover Soak

After reaching 100% React traffic, a 7-day soak period is required before Angular decommission. Soak pass criteria:
- Zero Angular hits in access logs
- Error rate < 0.1%
- p95 < 500ms
- Core Web Vitals within the "Good" thresholds
- No Search Console regressions

Full soak criteria in the cutover runbook Section 6.

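The soak criteria lend themselves to a simple pass/fail report at sign-off. The sketch below lists which criteria failed for a given summary. The `SoakReport` shape and `failedSoakCriteria` name are assumptions, and Core Web Vitals and Search Console are reduced to pre-computed inputs because this section does not define how they are measured.

```typescript
// Illustrative sign-off check; field names are assumptions.
interface SoakReport {
  angularHits: number;              // Angular hits in access logs during the soak
  errorRate: number;                // fraction, e.g. 0.0005 means 0.05%
  p95Ms: number;                    // p95 latency in milliseconds
  coreWebVitalsGood: boolean;       // all Core Web Vitals within "Good"
  searchConsoleRegressions: number; // count of regressions found
}

// Returns the list of failed criteria; an empty list means the soak passed.
function failedSoakCriteria(r: SoakReport): string[] {
  const failures: string[] = [];
  if (r.angularHits !== 0) failures.push("Angular hits in access logs");
  if (!(r.errorRate < 0.001)) failures.push("Error rate not below 0.1%");
  if (!(r.p95Ms < 500)) failures.push("p95 not below 500ms");
  if (!r.coreWebVitalsGood) failures.push("Core Web Vitals outside Good");
  if (r.searchConsoleRegressions !== 0) failures.push("Search Console regressions");
  return failures;
}
```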
### 7.4 Angular Decommission

After soak sign-off:
1. Tag the Angular codebase: `git tag -a angular-final`
2. Create archive branch: `archive/angular-spa`
3. Remove Angular/ASP.NET files (requires customer approval)
4. Infrastructure cleanup

Full decommission steps in the cutover runbook Section 7.