gnezim 98c6eca90e Add Phase 6 cutover runbook and operational checklist

Comprehensive operational procedure for the Angular-to-React traffic
cutover: pre-cutover gates, proxy config templates (nginx/HAProxy),
72-hour traffic ramp schedule, monitoring checklist, rollback procedure,
1-week soak criteria, and Angular decommission steps. Also adds Phase 6
cross-reference sections to the Phase 1 runbook.

2026-04-15 10:56:34 +03:00

Flights Web — Operational Runbook

Version: 1.0 (Phase 1I)
Last updated: 2026-04-14

1. Incident Response Decision Tree

Is the service returning errors?
  |
  +-- YES: Check /health endpoint
  |     |
  |     +-- /health returns 503
  |     |     -> Upstream API issue (see Section 6.1)
  |     |
  |     +-- /health returns 200 but users see errors
  |     |     -> Application-level bug. Check logs (Section 5).
  |     |     -> If recent deploy: rollback (Section 3).
  |     |
  |     +-- /health unreachable (connection refused / timeout)
  |           -> Container/VM is down.
  |           -> Check container orchestrator status.
  |           -> If all replicas down: escalate to infra team (Severity 1).
  |           -> If partial: rely on load balancer, investigate affected nodes.
  |
  +-- NO: Check for degraded performance
        |
        +-- Latency > 2x baseline
        |     -> Check OTel metrics for slow spans.
        |     -> Check upstream API latency.
        |     -> If upstream: see Section 6.1.
        |     -> If internal: check for memory pressure, CPU saturation.
        |
        +-- Intermittent errors in logs
              -> Check error rate trend.
              -> If rising: prepare for rollback.
              -> If stable/low: monitor for 15 min, then investigate.
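The first levels of the decision tree above can be sketched as a small triage function, for example to back a chat-ops helper. This is only an illustration: the `Health` enum, the `triage` function, and its parameters are hypothetical names and not part of the service.

```python
from enum import Enum


class Health(Enum):
    """Outcome of probing GET /health (names are illustrative)."""
    OK = "200"
    DEGRADED = "503"
    UNREACHABLE = "unreachable"


def triage(errors_seen: bool, health: Health, recent_deploy: bool = False,
           all_replicas_down: bool = False) -> str:
    """Map the top of the decision tree to a next action."""
    if not errors_seen:
        return "Check for degraded performance (latency, intermittent errors)"
    if health is Health.DEGRADED:
        return "Upstream API issue: see Section 6.1"
    if health is Health.OK:
        if recent_deploy:
            return "Application-level bug after deploy: roll back (Section 3)"
        return "Application-level bug: check logs (Section 5)"
    # /health is unreachable: the container or VM is down.
    if all_replicas_down:
        return "Escalate to infra team (Severity 1)"
    return "Rely on load balancer; investigate affected nodes"
```

The degraded-performance branch is left as a single pointer because its checks (OTel spans, upstream latency, resource pressure) need human judgment.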

Severity Levels

| Severity | Criteria | Response Time | Who to Page |
|----------|----------|---------------|-------------|
| S1 | Service fully down, all users affected | Immediate | On-call engineer + team lead |
| S2 | Partial outage, >10% error rate | 15 min | On-call engineer |
| S3 | Degraded performance, no data loss | 1 hour | On-call engineer (next business day if after hours) |
| S4 | Minor issue, workaround exists | Next business day | Assigned engineer |

2. Canary Rollout Procedure

Pre-rollout Checklist

All CI checks pass (typecheck, lint, test)
Docker images built and pushed to registry
Rollback image tag identified (current production tag)
Monitoring dashboards open

Rollout Steps

1. Deploy canary (5% traffic) to a single node in one geographic region
2. Monitor for 10 minutes:
   - Error rate must stay below 0.5%
   - p99 latency must not exceed 2x baseline
   - /health must return 200 on the canary
   - No new error patterns in logs
3. Expand to 25% if canary is healthy
4. Monitor for 15 minutes with same criteria
5. Expand to 100% across all geographic regions
6. Post-deploy verification:
   - /health returns 200 on all nodes
   - Smoke test passes end-to-end
   - No error rate spike in the first 30 minutes

Abort Criteria

Roll back immediately if any of these occur during canary:

Error rate exceeds 1%
/health returns 503 on canary nodes
p99 latency exceeds 5x baseline
Any S1/S2 incident triggered
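The rollout and abort thresholds above can be combined into one gate check. The sketch below is illustrative: `canary_gate` and its parameters are hypothetical, and the "hold" outcome for readings between the healthy and abort thresholds is an assumption, since the runbook does not say what to do in that range. The "no new error patterns in logs" check is not modeled.

```python
def canary_gate(error_rate: float, p99_ms: float, baseline_p99_ms: float,
                health_status: int, incident_open: bool = False) -> str:
    """Return "abort", "hold", or "expand" for one canary observation.

    error_rate is a fraction, so 0.005 means 0.5%.
    """
    latency_ratio = p99_ms / baseline_p99_ms
    # Abort criteria: any single one triggers an immediate rollback.
    if (error_rate > 0.01 or health_status == 503
            or latency_ratio > 5 or incident_open):
        return "abort"
    # Health criteria that must all hold before expanding.
    if error_rate < 0.005 and latency_ratio <= 2 and health_status == 200:
        return "expand"
    # Assumption: between the two sets of thresholds, stay at the current stage.
    return "hold"
```

For example, `canary_gate(0.007, 180, 100, 200)` returns `"hold"`: the error rate is above the 0.5% health limit but below the 1% abort limit.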

3. Rollback Procedure

3.1 Automatic Rollback

The deploy pipeline monitors /health after deployment. If the health check fails within the first 5 minutes post-deploy:

Pipeline automatically reverts to the previous image tag
Alert fires to the on-call channel
Engineer investigates the failed deployment logs

Auto-rollback requires no manual action. Verify that it succeeded by checking:

/health returns 200
Error rate returns to baseline
Previous image tag is running on all nodes
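The post-deploy watch described in this section could look roughly like the loop below. This is a sketch under assumptions, not the pipeline's actual logic: `watch_deploy`, the 10-second polling interval, and rolling back on the first failed probe are all hypothetical choices. The probe and rollback actions are passed in as callables so the loop stays independent of any deployment tool.

```python
import time
from typing import Callable


def watch_deploy(probe: Callable[[], int], rollback: Callable[[], None],
                 window_s: int = 300, interval_s: int = 10,
                 sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll /health for window_s seconds and roll back on the first failure.

    probe returns the HTTP status of GET /health.
    Returns True if the deploy stayed healthy, False if it was rolled back.
    """
    elapsed = 0
    while elapsed < window_s:
        if probe() != 200:
            rollback()  # revert to the previous image tag
            return False
        sleep(interval_s)
        elapsed += interval_s
    return True
```

A real pipeline would likely tolerate a small number of failed probes before reverting; that threshold is not specified here.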

3.2 Manual Rollback

If auto-rollback did not trigger or a problem is discovered later:

1. Identify the last-known-good image tag from the deployment history
2. Redeploy the previous image:

    # Placeholder: actual commands depend on the customer's deployment tool
    # Example:
    # deploy --image $REGISTRY/flights-web-standalone:$PREVIOUS_SHA --env production

3. Verify rollback:
   - /health returns 200 on all nodes
   - Error rate returns to baseline
   - Smoke test passes
4. Post-mortem: file an incident report within 24 hours

4. Health-Check Interpretation

Endpoint: `GET /health`

| Response | Status | Meaning | Action |
|----------|--------|---------|--------|
| `{ "status": "ok" }` | 200 | Upstream API reachable within last 60s | None |
| `{ "status": "degraded", "reason": "upstream_unreachable" }` | 503 | No successful upstream ping in 60s | Check upstream API status; see Section 6.1 |

Common Causes of 503

Upstream API is down — check upstream service status page / monitoring
Network partition — the node cannot reach the upstream API; check network policies
DNS resolution failure — verify DNS configuration on the node
Upstream API overloaded — ping times out; coordinate with upstream team

Load Balancer Behavior

When /health returns 503, the load balancer should stop routing traffic to that node. When the upstream recovers and /health returns 200 again, traffic automatically resumes.
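The health-check contract in this section comes down to a freshness test on the last successful upstream ping. The sketch below shows that rule and the load balancer's view of it; `health_response` and `in_rotation` are hypothetical names, and the real handler in the service may be structured differently.

```python
import json


def health_response(last_upstream_ok: float, now: float,
                    max_age_s: float = 60.0) -> tuple[int, str]:
    """Return (status code, JSON body) for GET /health.

    last_upstream_ok and now are Unix timestamps in seconds.
    """
    if now - last_upstream_ok <= max_age_s:
        return 200, json.dumps({"status": "ok"})
    body = {"status": "degraded", "reason": "upstream_unreachable"}
    return 503, json.dumps(body)


def in_rotation(status_code: int) -> bool:
    """Load balancer view: only nodes answering 200 receive traffic."""
    return status_code == 200
```

Because the status is derived from the age of the last successful ping, a node returns to rotation on its own once the upstream answers again.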

5. Log Query Cookbook

Logs are shipped in JSON Lines format to the customer's log aggregation system.

Log Structure

{
  "ts": "2026-04-14T12:00:00.000Z",
  "level": "error",
  "msg": "Request failed",
  "fields": {
    "traceId": "abc123",
    "path": "/api/flights",
    "status": 500,
    "err": "TypeError: Cannot read properties of undefined"
  }
}

Common Queries

Find all errors in the last hour:

level:error AND ts:[now-1h TO now]

Find errors for a specific trace:

fields.traceId:"abc123"

Find slow requests (logged by the API client on timeout):

msg:"Retrying request" OR msg:"upstream_timeout"

Find health-check failures:

msg:"upstream_unreachable" OR (fields.path:"/health" AND fields.status:503)

Find graceful shutdown events:

msg:"SIGTERM received" OR msg:"Server closed" OR msg:"Drain timeout exceeded"

Find CSP violations (if CSP reporting is enabled):

msg:"csp-violation" OR fields.type:"csp-report"
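When the log aggregation system is unavailable, the "errors in the last hour" query can be approximated locally against a raw JSON Lines file. The sketch below assumes the log structure shown in this section; `parse_ts` and `recent_errors` are hypothetical helper names.

```python
import json
from datetime import datetime, timedelta, timezone


def parse_ts(value: str) -> datetime:
    """Parse the log timestamp format, e.g. 2026-04-14T12:00:00.000Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ").replace(
        tzinfo=timezone.utc)


def recent_errors(lines, now: datetime, window=timedelta(hours=1)):
    """Yield error-level records whose ts falls in the window before now."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        if record.get("level") != "error":
            continue
        if now - window <= parse_ts(record["ts"]) <= now:
            yield record
```

Usage: open the log file and pass the file object as `lines`, with `now` set to `datetime.now(timezone.utc)`. Filtering by trace is then a matter of comparing `record["fields"]["traceId"]`.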

6. Known-Failure Playbooks

6.1 Upstream API Down

Symptoms: /health returns 503; API client logs show retry exhaustion.

Impact: Users see error pages or stale data (if caching is in place).

Steps:

Confirm upstream status via the upstream team's status page or monitoring
If upstream is aware and working on it: monitor, no action needed on our side
If upstream is unaware: escalate via agreed communication channel
If outage exceeds 30 minutes: consider enabling a maintenance page
Recovery is automatic — once upstream responds, /health returns 200 within 60s

6.2 SignalR Hub Offline

Symptoms: Real-time flight updates stop; SignalR reconnection logs appear.

Impact: Users see stale board data; manual refresh still works via REST API.

Steps:

Check SignalR hub process/container status
Verify WebSocket connectivity from the node to the SignalR hub
The client auto-reconnects with exponential backoff — recovery is usually automatic
If hub is permanently down: REST polling fallback should activate (if implemented)
Inform users if downtime exceeds 5 minutes
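To judge whether reconnection logs look normal, it helps to know what an exponential backoff sequence looks like. The sketch below is generic: the base delay, factor, cap, and attempt count are assumptions, and the actual client's retry policy may use different values.

```python
def backoff_delays(base_s: float = 1.0, factor: float = 2.0,
                   cap_s: float = 30.0, attempts: int = 6) -> list[float]:
    """Delay before each reconnect attempt: base * factor**n, capped at cap_s."""
    return [min(cap_s, base_s * factor ** n) for n in range(attempts)]


print(backoff_delays())  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

With these assumed values, six attempts span about one minute, well inside the 5-minute threshold for informing users.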

6.3 CSP Violation Spike

Symptoms: Spike in CSP violation reports; possibly broken page functionality.

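Impact: When the policy is enforced, scripts or styles are blocked by Content-Security-Policy and the UI may be partially broken. In report-only mode violations are reported but nothing is blocked.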

Steps:

Check CSP violation reports for the blocked resource URL
If a legitimate resource is blocked: update CSP policy in src/server/middleware/csp.ts
If a third-party script is the source: investigate whether it was injected (security concern)
If after a deploy: the new code may reference resources not in the CSP allowlist — fix or rollback
CSP is in report-only mode during Phase 1 — no user impact, but violations should be tracked
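During a spike, grouping reports by blocked resource quickly shows whether one URL accounts for most violations. The sketch below assumes reports are available as raw JSON strings in the standard `report-uri` shape (a top-level `csp-report` object with `blocked-uri` and `violated-directive` keys); how this service stores its reports is not specified here, and `top_blocked` is a hypothetical helper.

```python
import json
from collections import Counter


def top_blocked(reports: list[str], limit: int = 5):
    """Count violations by (blocked-uri, violated-directive), most common first."""
    counts: Counter = Counter()
    for raw in reports:
        report = json.loads(raw).get("csp-report", {})
        key = (report.get("blocked-uri", "unknown"),
               report.get("violated-directive", "unknown"))
        counts[key] += 1
    return counts.most_common(limit)
```

A single dominant entry that appeared right after a deploy points to a missing allowlist entry; many unrelated third-party URLs point to injected scripts and deserve a security review.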

6.4 Analytics Adapter Load Failure

Symptoms: flights.analytics.load_failed counter increases; analytics data gaps.

Impact: Analytics data not collected; no user-facing impact.

Steps:

Check which adapter(s) failed (Yandex.Metrica, CTM, Variocube, Dynatrace)
Verify the adapter's external script URL is reachable from the client
Check for CORS or CSP blocking the adapter script
If a single adapter: low priority, monitor
If all adapters: likely a CSP or network issue affecting all external scripts
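The single-versus-all distinction in the last two steps can be expressed as a small classifier. This sketch is illustrative: `classify_adapter_failures` is a hypothetical name, and the handling of "several but not all" adapters is an assumption, since the steps above only cover the two extremes.

```python
ADAPTERS = frozenset({"Yandex.Metrica", "CTM", "Variocube", "Dynatrace"})


def classify_adapter_failures(failed: set[str]) -> str:
    """Suggest a priority from the set of adapters that failed to load."""
    unknown = set(failed) - ADAPTERS
    if unknown:
        raise ValueError(f"unknown adapter(s): {sorted(unknown)}")
    if not failed:
        return "ok"
    if set(failed) == ADAPTERS:
        return "all adapters: likely CSP or network issue"
    if len(failed) == 1:
        return "single adapter: low priority, monitor"
    # Assumption: partial failures are checked adapter by adapter.
    return "multiple adapters: check CORS/CSP for each"
```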

6.5 OTel Exporter Unreachable

Symptoms: Metrics and traces stop appearing in the monitoring dashboard.

Impact: No observability data; no user-facing impact.

Steps:

Check the OTel collector/exporter endpoint connectivity
Verify the OTEL_EXPORTER_OTLP_ENDPOINT environment variable is correct
Check for network policy changes that may block the exporter
The SDK buffers data locally — some data may be recoverable once the exporter is reachable again
If the exporter is permanently moved: update the endpoint configuration and redeploy
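The first two steps can be combined into a quick connectivity probe run from the affected node. The sketch below is a minimal example: `otlp_target` and `can_connect` are hypothetical helpers, and falling back to the scheme's default port when the URL has none is an assumption (exporters may apply their own defaults).

```python
import socket
from urllib.parse import urlparse


def otlp_target(endpoint: str) -> tuple[str, int]:
    """Extract (host, port) from an OTLP endpoint URL."""
    parsed = urlparse(endpoint)
    if not parsed.hostname:
        raise ValueError(f"not a valid endpoint URL: {endpoint!r}")
    # Assumption: use the scheme's default port when none is given.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    return parsed.hostname, port


def can_connect(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False
```

Usage: read `OTEL_EXPORTER_OTLP_ENDPOINT` with `os.environ`, pass it to `otlp_target`, then call `can_connect` with the result. A `ValueError` usually means the variable is missing its scheme; a `False` result points to a network policy or a moved collector.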

6.6 Memory Pressure / OOM Kill

Symptoms: Container restarts; OOM kill events in container orchestrator logs.

Impact: Requests in flight are dropped; load balancer reroutes to healthy nodes.

Steps:

Check container memory limits vs actual usage
Review recent deploys for memory leaks (new dependencies, unbounded caches)
If a specific route causes high memory: check for large API responses or unbounded data structures
Short-term: increase memory limits
Long-term: profile the application to find the leak; fix and redeploy

Recovery SLA

Target: Service recovery within 6 hours after infrastructure is restored.

Recovery steps:

Infrastructure team restores VMs / containers across geographic regions
Deployment tool re-deploys the last-known-good image
/health checks confirm upstream connectivity
Load balancer re-enables traffic to recovered nodes
On-call engineer verifies end-to-end functionality
Incident report filed within 24 hours of resolution

7. Phase 6 Cutover Reference

This section cross-references the full cutover runbook at docs/superpowers/plans/2026-04-15-phase-6-cutover.md.

7.1 Traffic Ramp Quick Reference

During cutover, traffic shifts from Angular to React over 72 hours:

T+0h: 5% React / 95% Angular
T+12h: 25% React / 75% Angular
T+24h: 50% React / 50% Angular
T+48h: 100% React / 0% Angular

Each step requires explicit go/no-go from the on-call engineer. Full details in the cutover runbook Section 3.
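The ramp schedule can be encoded as a lookup table, for instance to check that the proxy weights match the plan at a given time. The sketch below is illustrative and `react_weight` is a hypothetical helper. It returns the scheduled maximum only; a step still requires the go/no-go described above before it is applied.

```python
RAMP = [(0, 5), (12, 25), (24, 50), (48, 100)]  # (hours since start, % React)


def react_weight(hours_since_start: float) -> int:
    """Return the scheduled React traffic percentage at a given time."""
    if hours_since_start < 0:
        return 0  # cutover has not started: all traffic stays on Angular
    weight = 0
    for start_h, percent in RAMP:
        if hours_since_start >= start_h:
            weight = percent
    return weight
```

The Angular share is always `100 - react_weight(t)`.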

7.2 Cutover Rollback

During or after the traffic ramp, if a rollback is needed:

Flip proxy weights back to Angular (< 1 minute)
Verify Angular is serving traffic via response headers
Confirm error rate and latency return to baseline
File post-mortem within 24 hours

Trigger criteria: error rate > 1% for 5+ minutes, p95 > 2x baseline for 10+ minutes, or > 50% of React nodes returning 503.

Full rollback procedure in the cutover runbook Section 5.
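The trigger criteria above combine thresholds with minimum durations, which is easy to misread under pressure. The sketch below evaluates them over per-minute samples; `should_roll_back` and the one-sample-per-minute input format are assumptions made for illustration.

```python
def sustained(samples: list[float], threshold: float, minutes: int) -> bool:
    """True if each of the last `minutes` samples exceeds the threshold."""
    recent = samples[-minutes:]
    return len(recent) == minutes and all(s > threshold for s in recent)


def should_roll_back(error_rate_pct: list[float], p95_ratio: list[float],
                     react_nodes_503: int, react_nodes_total: int) -> bool:
    """Evaluate the cutover rollback triggers on per-minute samples.

    error_rate_pct: error rate in percent, one sample per minute.
    p95_ratio: p95 latency divided by baseline, one sample per minute.
    """
    if sustained(error_rate_pct, 1.0, 5):
        return True
    if sustained(p95_ratio, 2.0, 10):
        return True
    # More than half of the React nodes are failing their health check.
    return react_nodes_total > 0 and react_nodes_503 / react_nodes_total > 0.5
```

A short burst does not trigger a rollback: four minutes above 1% errors followed by a recovery leaves the result `False`.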

7.3 Post-Cutover Soak

After reaching 100% React traffic, a 7-day soak period is required before Angular decommission. Soak pass criteria:

Zero Angular hits in access logs
Error rate < 0.1%
p95 < 500ms
Core Web Vitals in "Good" threshold
No Search Console regressions

Full soak criteria in the cutover runbook Section 6.
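The soak pass criteria above can be checked mechanically at sign-off. The sketch below is an illustration only: `SoakMetrics` and its field names are hypothetical, and how each value is aggregated over the 7 days is left to the full cutover runbook.

```python
from dataclasses import dataclass


@dataclass
class SoakMetrics:
    """Aggregates for the soak period; field names are illustrative."""
    angular_hits: int
    error_rate_pct: float
    p95_ms: float
    core_web_vitals_good: bool
    search_console_regressions: int


def soak_failures(m: SoakMetrics) -> list[str]:
    """Return the criteria that failed; an empty list means the soak passed."""
    failures = []
    if m.angular_hits > 0:
        failures.append("Angular hits in access logs")
    if m.error_rate_pct >= 0.1:
        failures.append("error rate not below 0.1%")
    if m.p95_ms >= 500:
        failures.append("p95 not below 500ms")
    if not m.core_web_vitals_good:
        failures.append("Core Web Vitals outside Good threshold")
    if m.search_console_regressions > 0:
        failures.append("Search Console regressions")
    return failures
```

Returning the list of failed criteria, rather than a single boolean, gives the sign-off record a concrete reason when the soak has to be extended.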

7.4 Angular Decommission

After soak sign-off:

Tag the Angular codebase: git tag -a angular-final
Create archive branch: archive/angular-spa
Remove Angular/ASP.NET files (requires customer approval)
Infrastructure cleanup

Full decommission steps in the cutover runbook Section 7.
