Comprehensive operational procedure for the Angular-to-React traffic cutover: pre-cutover gates, proxy config templates (nginx/HAProxy), 72-hour traffic ramp schedule, monitoring checklist, rollback procedure, 1-week soak criteria, and Angular decommission steps. Also adds Phase 6 cross-reference sections to the Phase 1 runbook.
Flights Web — Operational Runbook
Version: 1.0 (Phase 1I)
Last updated: 2026-04-14
1. Incident Response Decision Tree
Is the service returning errors?
|
+-- YES: Check /health endpoint
|   |
|   +-- /health returns 503
|   |     -> Upstream API issue (see Section 6.1)
|   |
|   +-- /health returns 200 but users see errors
|   |     -> Application-level bug. Check logs (Section 5).
|   |     -> If recent deploy: rollback (Section 3).
|   |
|   +-- /health unreachable (connection refused / timeout)
|         -> Container/VM is down.
|         -> Check container orchestrator status.
|         -> If all replicas down: escalate to infra team (Severity 1).
|         -> If partial: rely on load balancer, investigate affected nodes.
|
+-- NO: Check for degraded performance
    |
    +-- Latency > 2x baseline
    |     -> Check OTel metrics for slow spans.
    |     -> Check upstream API latency.
    |     -> If upstream: see Section 6.1.
    |     -> If internal: check for memory pressure, CPU saturation.
    |
    +-- Intermittent errors in logs
          -> Check error rate trend.
          -> If rising: prepare for rollback.
          -> If stable/low: monitor for 15 min, then investigate.
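For on-call tooling, the tree above could be encoded as a small lookup function. This is only a sketch: the function name `triage`, its parameters, and the returned strings are hypothetical and not part of the service. It assumes the caller already knows the `/health` status (or `None` when the endpoint is unreachable) and the ratio of current latency to baseline.

```python
from typing import Optional


def triage(errors_seen: bool,
           health_status: Optional[int] = None,
           recent_deploy: bool = False,
           all_replicas_down: bool = False,
           latency_ratio: float = 1.0,
           error_trend: str = "stable") -> str:
    """Return the first action suggested by the decision tree.

    health_status: HTTP status of GET /health, or None if unreachable.
    latency_ratio: current latency divided by baseline latency.
    error_trend:   "rising" or "stable" (hypothetical labels).
    """
    if errors_seen:
        if health_status is None:
            # Connection refused or timeout: container/VM is down.
            if all_replicas_down:
                return "escalate to infra team (S1)"
            return "rely on load balancer; investigate affected nodes"
        if health_status == 503:
            return "upstream API issue (Section 6.1)"
        if health_status == 200:
            if recent_deploy:
                return "check logs (Section 5); roll back (Section 3)"
            return "check logs (Section 5)"
        # The tree does not cover other status codes; fall back to logs.
        return "unexpected /health status; check logs (Section 5)"
    if latency_ratio > 2.0:
        return "check OTel slow spans and upstream latency"
    if error_trend == "rising":
        return "prepare for rollback"
    return "monitor for 15 min, then investigate"
```

For example, `triage(True, 503)` points to the upstream playbook, while `triage(True, None, all_replicas_down=True)` escalates as Severity 1.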
Severity Levels
| Severity | Criteria | Response Time | Who to Page |
|---|---|---|---|
| S1 | Service fully down, all users affected | Immediate | On-call engineer + team lead |
| S2 | Partial outage, >10% error rate | 15 min | On-call engineer |
| S3 | Degraded performance, no data loss | 1 hour | On-call engineer (next business day if after hours) |
| S4 | Minor issue, workaround exists | Next business day | Assigned engineer |
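If alerts need to be labelled automatically, the table above could be approximated as follows. This is a rough sketch: `classify_severity` and its inputs are hypothetical, and a real classification would still need human judgment (for example, on data loss or available workarounds).

```python
# Response times copied from the severity table.
RESPONSE_TIME = {
    "S1": "Immediate",
    "S2": "15 min",
    "S3": "1 hour",
    "S4": "Next business day",
}


def classify_severity(service_down: bool, error_rate: float, degraded: bool) -> str:
    """Map coarse signals to a severity level.

    error_rate is a fraction (0.10 means 10%).
    """
    if service_down:
        return "S1"   # service fully down, all users affected
    if error_rate > 0.10:
        return "S2"   # partial outage
    if degraded:
        return "S3"   # degraded performance, no data loss
    return "S4"       # minor issue, workaround exists
```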
2. Canary Rollout Procedure
Pre-rollout Checklist
- All CI checks pass (typecheck, lint, test)
- Docker images built and pushed to registry
- Rollback image tag identified (current production tag)
- Monitoring dashboards open
Rollout Steps
- Deploy canary (5% traffic) to a single node in one geographic region
- Monitor for 10 minutes:
- Error rate must stay below 0.5%
- p99 latency must not exceed 2x baseline
- /health must return 200 on the canary
- No new error patterns in logs
- Expand to 25% if canary is healthy
- Monitor for 15 minutes with same criteria
- Expand to 100% across all geographic regions
- Post-deploy verification:
- /health returns 200 on all nodes
- Smoke test passes end-to-end
- No error rate spike in the first 30 minutes
Abort Criteria
Roll back immediately if any of these occur during canary:
- Error rate exceeds 1%
- /health returns 503 on canary nodes
- p99 latency exceeds 5x baseline
- Any S1/S2 incident triggered
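The monitoring and abort thresholds above can be combined into a single gate check. The sketch below is illustrative only: `CanarySample` and `canary_decision` are hypothetical names, and the "hold" outcome is an assumption for readings that miss the healthy thresholds without reaching an abort criterion (the runbook does not say what to do in that gap).

```python
from dataclasses import dataclass


@dataclass
class CanarySample:
    error_rate: float                 # fraction of failed requests (0.004 = 0.4%)
    p99_ratio: float                  # canary p99 latency divided by baseline p99
    health_status: int                # HTTP status of GET /health on the canary
    incident_open: bool = False       # an S1/S2 incident was triggered
    new_error_patterns: bool = False  # unfamiliar errors seen in logs


def canary_decision(s: CanarySample) -> str:
    """Return "abort", "promote", or "hold" for one monitoring reading."""
    # Abort criteria: any single one triggers an immediate rollback.
    if (s.error_rate > 0.01
            or s.health_status == 503
            or s.p99_ratio > 5.0
            or s.incident_open):
        return "abort"
    # Healthy criteria from the rollout steps.
    healthy = (s.error_rate < 0.005
               and s.p99_ratio <= 2.0
               and s.health_status == 200
               and not s.new_error_patterns)
    # Assumption: anything in between is held at the current traffic share.
    return "promote" if healthy else "hold"
```

A reading such as `CanarySample(0.002, 1.3, 200)` would allow expansion to the next step, while `CanarySample(0.02, 1.3, 200)` aborts.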
3. Rollback Procedure
3.1 Automatic Rollback
The deploy pipeline monitors /health after deployment. If the health check fails within the first 5 minutes post-deploy:
- Pipeline automatically reverts to the previous image tag
- Alert fires to the on-call channel
- Engineer investigates the failed deployment logs
No manual action required for auto-rollback. Verify the rollback succeeded by checking:
- /health returns 200
- Error rate returns to baseline
- Previous image tag is running on all nodes
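The post-deploy watch described above might look roughly like the following. This is not the pipeline's actual implementation: `watch_after_deploy` is a hypothetical name, the 10-second polling interval is assumed, and the sketch treats a single non-200 response as a failure, whereas a real pipeline may tolerate a few. The `probe` callable is expected to perform `GET /health` and return the HTTP status.

```python
import time
from typing import Callable


def watch_after_deploy(probe: Callable[[], int],
                       window_s: float = 300.0,
                       interval_s: float = 10.0,
                       sleep: Callable[[float], None] = time.sleep,
                       clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll /health for the first 5 minutes after a deploy.

    Returns True if every probe returned 200 during the window.
    Returns False on the first failing probe; the caller would then
    revert to the previous image tag and alert the on-call channel.
    """
    deadline = clock() + window_s
    while clock() < deadline:
        if probe() != 200:
            return False
        sleep(interval_s)
    return True
```

Injecting `sleep` and `clock` keeps the function testable without waiting five real minutes.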
3.2 Manual Rollback
If auto-rollback did not trigger or a problem is discovered later:
- Identify the last-known-good image tag from the deployment history
- Redeploy the previous image:
# Placeholder: actual commands depend on customer's deployment tool
# Example:
# deploy --image $REGISTRY/flights-web-standalone:$PREVIOUS_SHA --env production
- Verify rollback:
- /health returns 200 on all nodes
- Error rate returns to baseline
- Smoke test passes
- Post-mortem: file an incident report within 24 hours
4. Health-Check Interpretation
Endpoint: GET /health
| Response | Status | Meaning | Action |
|---|---|---|---|
| { "status": "ok" } | 200 | Upstream API reachable within last 60s | None |
| { "status": "degraded", "reason": "upstream_unreachable" } | 503 | No successful upstream ping in 60s | Check upstream API status; see Section 6.1 |
Common Causes of 503
- Upstream API is down — check upstream service status page / monitoring
- Network partition — the node cannot reach the upstream API; check network policies
- DNS resolution failure — verify DNS configuration on the node
- Upstream API overloaded — ping times out; coordinate with upstream team
Load Balancer Behavior
When /health returns 503, the load balancer should stop routing traffic to that node. When the upstream recovers and /health returns 200 again, traffic automatically resumes.
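The 60-second freshness rule behind the two responses can be sketched as a pure function. The name `health_response` and the idea of tracking a "last successful ping" timestamp are assumptions made for illustration; the service's real handler may be structured differently.

```python
import json
from typing import Optional, Tuple

UPSTREAM_FRESHNESS_S = 60.0  # window taken from the table above


def health_response(last_ok_ping: Optional[float], now: float) -> Tuple[int, str]:
    """Return (HTTP status, JSON body) for GET /health.

    last_ok_ping: timestamp (seconds) of the last successful upstream
                  ping, or None if there has been none yet.
    now:          current timestamp on the same clock.
    """
    if last_ok_ping is not None and now - last_ok_ping <= UPSTREAM_FRESHNESS_S:
        return 200, json.dumps({"status": "ok"})
    return 503, json.dumps({"status": "degraded",
                            "reason": "upstream_unreachable"})
```

Because the result depends only on the age of the last successful ping, the endpoint recovers on its own once the upstream responds again, which is what lets the load balancer resume traffic without manual action.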
5. Log Query Cookbook
Logs are shipped in JSON Lines format to the customer's log aggregation system.
Log Structure
{
"ts": "2026-04-14T12:00:00.000Z",
"level": "error",
"msg": "Request failed",
"fields": {
"traceId": "abc123",
"path": "/api/flights",
"status": 500,
"err": "TypeError: Cannot read properties of undefined"
}
}
Common Queries
Find all errors in the last hour:
level:error AND ts:[now-1h TO now]
Find errors for a specific trace:
fields.traceId:"abc123"
Find slow requests (logged by the API client on timeout):
msg:"Retrying request" OR msg:"upstream_timeout"
Find health-check failures:
msg:"upstream_unreachable" OR (path:"/health" AND status:503)
Find graceful shutdown events:
msg:"SIGTERM received" OR msg:"Server closed" OR msg:"Drain timeout exceeded"
Find CSP violations (if CSP reporting is enabled):
msg:"csp-violation" OR fields.type:"csp-report"
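When the log aggregation system is unavailable, the same filters can be applied to a raw JSON Lines file. The sketch below assumes only the log structure shown above; `recent_errors` and `parse_ts` are hypothetical helper names, and the query syntax of the customer's aggregation system is not modelled.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Iterable, Iterator, Optional


def parse_ts(ts: str) -> datetime:
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def recent_errors(lines: Iterable[str],
                  now: datetime,
                  window: timedelta = timedelta(hours=1),
                  trace_id: Optional[str] = None) -> Iterator[dict]:
    """Yield error records from the last `window`, optionally for one trace."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        rec = json.loads(line)
        if rec.get("level") != "error":
            continue
        if not (now - window <= parse_ts(rec["ts"]) <= now):
            continue
        if trace_id is not None and rec.get("fields", {}).get("traceId") != trace_id:
            continue
        yield rec
```

Typical usage would be `recent_errors(open(path), datetime.now(timezone.utc))`, where `path` points at a local copy of the log file.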
6. Known-Failure Playbooks
6.1 Upstream API Down
Symptoms: /health returns 503; API client logs show retry exhaustion.
Impact: Users see error pages or stale data (if caching is in place).
Steps:
- Confirm upstream status via the upstream team's status page or monitoring
- If upstream is aware and working on it: monitor, no action needed on our side
- If upstream is unaware: escalate via agreed communication channel
- If outage exceeds 30 minutes: consider enabling a maintenance page
- Recovery is automatic: once upstream responds, /health returns 200 within 60s
6.2 SignalR Hub Offline
Symptoms: Real-time flight updates stop; SignalR reconnection logs appear.
Impact: Users see stale board data; manual refresh still works via REST API.
Steps:
- Check SignalR hub process/container status
- Verify WebSocket connectivity from the node to the SignalR hub
- The client auto-reconnects with exponential backoff — recovery is usually automatic
- If hub is permanently down: REST polling fallback should activate (if implemented)
- Inform users if downtime exceeds 5 minutes
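To estimate how long automatic recovery may take, it helps to see what an exponential backoff schedule looks like. The values below (1 s base, doubling, 30 s cap) are assumptions chosen for illustration; the client's actual reconnect parameters are not stated in this runbook, and jitter is omitted.

```python
from typing import List


def backoff_delays(attempts: int,
                   base_s: float = 1.0,
                   factor: float = 2.0,
                   cap_s: float = 30.0) -> List[float]:
    """Delay before each reconnect attempt, capped at cap_s seconds."""
    return [min(cap_s, base_s * factor ** i) for i in range(attempts)]


# With the assumed defaults, six attempts wait 1, 2, 4, 8, 16, then 30 seconds.
delays = backoff_delays(6)
```

Under these assumed values the first six attempts span about a minute, so a hub outage much longer than that leaves clients retrying at the capped interval.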
6.3 CSP Violation Spike
Symptoms: Spike in CSP violation reports; possibly broken page functionality.
Impact: Scripts or styles blocked by Content-Security-Policy; UI may be partially broken.
Steps:
- Check CSP violation reports for the blocked resource URL
- If a legitimate resource is blocked: update CSP policy in src/server/middleware/csp.ts
- If a third-party script is the source: investigate whether it was injected (security concern)
- If after a deploy: the new code may reference resources not in the CSP allowlist — fix or rollback
- CSP is in report-only mode during Phase 1 — no user impact, but violations should be tracked
6.4 Analytics Adapter Load Failure
Symptoms: flights.analytics.load_failed counter increases; analytics data gaps.
Impact: Analytics data not collected; no user-facing impact.
Steps:
- Check which adapter(s) failed (Yandex.Metrica, CTM, Variocube, Dynatrace)
- Verify the adapter's external script URL is reachable from the client
- Check for CORS or CSP blocking the adapter script
- If a single adapter: low priority, monitor
- If all adapters: likely a CSP or network issue affecting all external scripts
6.5 OTel Exporter Unreachable
Symptoms: Metrics and traces stop appearing in the monitoring dashboard.
Impact: No observability data; no user-facing impact.
Steps:
- Check the OTel collector/exporter endpoint connectivity
- Verify the OTEL_EXPORTER_OTLP_ENDPOINT environment variable is correct
- Check for network policy changes that may block the exporter
- The SDK buffers data locally — some data may be recoverable once the exporter is reachable again
- If the exporter is permanently moved: update the endpoint configuration and redeploy
6.6 Memory Pressure / OOM Kill
Symptoms: Container restarts; OOM kill events in container orchestrator logs.
Impact: Requests in flight are dropped; load balancer reroutes to healthy nodes.
Steps:
- Check container memory limits vs actual usage
- Review recent deploys for memory leaks (new dependencies, unbounded caches)
- If a specific route causes high memory: check for large API responses or unbounded data structures
- Short-term: increase memory limits
- Long-term: profile the application to find the leak; fix and redeploy
Recovery SLA
Target: Service recovery within 6 hours after infrastructure is restored.
Recovery steps:
- Infrastructure team restores VMs / containers across geographic regions
- Deployment tool re-deploys the last-known-good image
- /health checks confirm upstream connectivity
- Load balancer re-enables traffic to recovered nodes
- On-call engineer verifies end-to-end functionality
- Incident report filed within 24 hours of resolution
7. Phase 6 Cutover Reference
This section cross-references the full cutover runbook at docs/superpowers/plans/2026-04-15-phase-6-cutover.md.
7.1 Traffic Ramp Quick Reference
During cutover, traffic shifts from Angular to React over 72 hours:
- T+0h: 5% React / 95% Angular
- T+12h: 25% React / 75% Angular
- T+24h: 50% React / 50% Angular
- T+48h: 100% React / 0% Angular
Each step requires explicit go/no-go from the on-call engineer. Full details in the cutover runbook Section 3.
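The schedule can be expressed as a lookup from elapsed hours to the planned React share. This sketch uses only the four steps listed above (the last listed step is T+48h); `RAMP_STEPS` and `react_share` are hypothetical names. It returns the planned target only, since the actual share also depends on each go/no-go decision.

```python
from typing import List, Tuple

# (hours since cutover start, percent of traffic routed to React)
RAMP_STEPS: List[Tuple[int, int]] = [(0, 5), (12, 25), (24, 50), (48, 100)]


def react_share(hours_since_start: float) -> int:
    """Planned React traffic percentage at a given point in the ramp."""
    share = 0  # before T+0h all traffic stays on Angular
    for start_h, pct in RAMP_STEPS:
        if hours_since_start >= start_h:
            share = pct
    return share
```

The Angular share is simply `100 - react_share(t)`.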
7.2 Cutover Rollback
During or after the traffic ramp, if a rollback is needed:
- Flip proxy weights back to Angular (< 1 minute)
- Verify Angular is serving traffic via response headers
- Confirm error rate and latency return to baseline
- File post-mortem within 24 hours
Trigger criteria: error rate > 1% for 5+ minutes, p95 > 2x baseline for 10+ minutes, or > 50% of React nodes returning 503.
Full rollback procedure in the cutover runbook Section 5.
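The trigger criteria can be checked mechanically against per-minute metric samples. The sketch below assumes one sample per minute, newest last, and reads "for N+ minutes" as "the last N samples all exceed the threshold"; `sustained` and `should_roll_back` are hypothetical names, and the cutover runbook remains the authority on the exact evaluation.

```python
from typing import Sequence


def sustained(samples: Sequence[float], threshold: float, minutes: int) -> bool:
    """True if the most recent `minutes` samples all exceed `threshold`."""
    return len(samples) >= minutes and all(v > threshold for v in samples[-minutes:])


def should_roll_back(error_rates: Sequence[float],
                     p95_ratios: Sequence[float],
                     nodes_503: int,
                     nodes_total: int) -> bool:
    """Evaluate the three cutover rollback triggers.

    error_rates: per-minute error rate as a fraction (0.01 = 1%).
    p95_ratios:  per-minute p95 latency divided by baseline p95.
    nodes_503:   React nodes currently returning 503 on /health.
    """
    return (sustained(error_rates, 0.01, 5)
            or sustained(p95_ratios, 2.0, 10)
            or (nodes_total > 0 and nodes_503 / nodes_total > 0.5))
```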
7.3 Post-Cutover Soak
After reaching 100% React traffic, a 7-day soak period is required before Angular decommission. Soak pass criteria:
- Zero Angular hits in access logs
- Error rate < 0.1%
- p95 < 500ms
- Core Web Vitals in "Good" threshold
- No Search Console regressions
Full soak criteria in the cutover runbook Section 6.
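A soak sign-off check could collect the criteria above into one report. This is a sketch under assumptions: `SoakReport` and `soak_failures` are hypothetical names, and how each input is measured (for example, the aggregation window for p95) is left to the cutover runbook.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SoakReport:
    days_at_full_traffic: int        # days since reaching 100% React
    angular_hits: int                # Angular requests in access logs
    error_rate: float                # fraction (0.001 = 0.1%)
    p95_ms: float                    # p95 latency in milliseconds
    core_web_vitals_good: bool       # all vitals within the "Good" threshold
    search_console_regressions: int  # count of regressions observed


def soak_failures(r: SoakReport) -> List[str]:
    """Return the list of unmet criteria; an empty list means the soak passed."""
    checks = [
        ("soak shorter than 7 days", r.days_at_full_traffic >= 7),
        ("Angular still receiving hits", r.angular_hits == 0),
        ("error rate not below 0.1%", r.error_rate < 0.001),
        ("p95 not below 500 ms", r.p95_ms < 500.0),
        ("Core Web Vitals not in Good", r.core_web_vitals_good),
        ("Search Console regressions", r.search_console_regressions == 0),
    ]
    return [name for name, passed in checks if not passed]
```

Returning the failed criteria by name, rather than a single boolean, makes the sign-off record explain why decommission is blocked.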
7.4 Angular Decommission
After soak sign-off:
- Tag the Angular codebase: git tag -a angular-final
- Create archive branch: archive/angular-spa
- Remove Angular/ASP.NET files (requires customer approval)
- Infrastructure cleanup
Full decommission steps in the cutover runbook Section 7.