flights_web/docs/superpowers/phase-1/runbook.md

# Flights Web — Operational Runbook

**Version:** 1.0 (Phase 1I)
**Last updated:** 2026-04-14

---

## 1. Incident Response Decision Tree

```
Is the service returning errors?
  |
  +-- YES: Check /health endpoint
  |     |
  |     +-- /health returns 503
  |     |     -> Upstream API issue (see Section 6.1)
  |     |
  |     +-- /health returns 200 but users see errors
  |     |     -> Application-level bug. Check logs (Section 5).
  |     |     -> If recent deploy: rollback (Section 3).
  |     |
  |     +-- /health unreachable (connection refused / timeout)
  |           -> Container/VM is down.
  |           -> Check container orchestrator status.
  |           -> If all replicas down: escalate to infra team (Severity 1).
  |           -> If partial: rely on load balancer, investigate affected nodes.
  |
  +-- NO: Check for degraded performance
        |
        +-- Latency > 2x baseline
        |     -> Check OTel metrics for slow spans.
        |     -> Check upstream API latency.
        |     -> If upstream: see Section 6.1.
        |     -> If internal: check for memory pressure, CPU saturation.
        |
        +-- Intermittent errors in logs
              -> Check error rate trend.
              -> If rising: prepare for rollback.
              -> If stable/low: monitor for 15 min, then investigate.
```
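
The tree above can also be read as a small pure function, which is handy for tabletop exercises. The sketch below is illustrative only: the `TriageSignals` fields, the `triage` name, and the returned action strings are assumptions made for this example, and the "intermittent errors in logs" branch is left out for brevity.

```typescript
// Hypothetical summary of what the on-call engineer has observed so far.
type HealthProbe = "ok" | "unavailable" | "unreachable"; // 200, 503, no response

interface TriageSignals {
  health: HealthProbe;
  usersSeeErrors: boolean;
  recentDeploy: boolean;
  allReplicasDown: boolean;
  latencyRatio: number; // current latency divided by baseline latency
}

// Maps the observed signals to the next action named in the decision tree.
function triage(s: TriageSignals): string {
  if (s.health === "unreachable") {
    return s.allReplicasDown
      ? "escalate to infra team (S1)"
      : "rely on load balancer, investigate affected nodes";
  }
  if (s.health === "unavailable") return "upstream API issue (Section 6.1)";
  if (s.usersSeeErrors) {
    return s.recentDeploy
      ? "check logs, then roll back (Section 3)"
      : "check logs (Section 5)";
  }
  if (s.latencyRatio > 2) return "check OTel spans and upstream latency";
  return "monitor";
}
```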

### Severity Levels

| Severity | Criteria | Response Time | Who to Page |
|----------|----------|---------------|-------------|
| S1 | Service fully down, all users affected | Immediate | On-call engineer + team lead |
| S2 | Partial outage, >10% error rate | 15 min | On-call engineer |
| S3 | Degraded performance, no data loss | 1 hour during business hours; next business day if after hours | On-call engineer |
| S4 | Minor issue, workaround exists | Next business day | Assigned engineer |
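
As a rough illustration of how the table's criteria could be applied in order, here is a minimal sketch. The `IncidentFacts` shape and the `classifySeverity` name are hypothetical; S4 is treated as the fallback when no higher criterion matches.

```typescript
type Severity = "S1" | "S2" | "S3" | "S4";

// Hypothetical facts gathered at the start of an incident.
interface IncidentFacts {
  fullyDown: boolean;   // service fully down, all users affected
  errorRatePct: number; // failing requests as a percentage, 0 to 100
  degraded: boolean;    // degraded performance without data loss
}

// Checks the criteria from most to least severe and returns the first match.
function classifySeverity(f: IncidentFacts): Severity {
  if (f.fullyDown) return "S1";
  if (f.errorRatePct > 10) return "S2";
  if (f.degraded) return "S3";
  return "S4";
}
```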

---

## 2. Canary Rollout Procedure

### Pre-rollout Checklist

- [ ] All CI checks pass (typecheck, lint, test)
- [ ] Docker images built and pushed to registry
- [ ] Rollback image tag identified (current production tag)
- [ ] Monitoring dashboards open

### Rollout Steps

1. **Deploy canary** (5% traffic) to a single node in one geographic region
2. **Monitor for 10 minutes:**
   - Error rate must stay below 0.5%
   - p99 latency must not exceed 2x baseline
   - `/health` must return 200 on the canary
   - No new error patterns in logs
3. **Expand to 25%** if canary is healthy
4. **Monitor for 15 minutes** against the same criteria
5. **Expand to 100%** across all geographic regions
6. **Post-deploy verification:**
   - `/health` returns 200 on all nodes
   - Smoke test passes end-to-end
   - No error rate spike in the first 30 minutes

### Abort Criteria

Roll back immediately if any of these occur during canary:
- Error rate exceeds 1%
- `/health` returns 503 on canary nodes
- p99 latency exceeds 5x baseline
- Any S1/S2 incident triggered
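
The promotion thresholds from the rollout steps and the abort criteria above can be combined into one gate. The sketch below is an assumption-laden illustration: the `CanaryMetrics` shape and the `"hold"` outcome (for values between the promotion and abort thresholds) are not defined by this runbook, and the "no new error patterns in logs" check still needs a human.

```typescript
// Hypothetical snapshot of canary metrics at the end of a monitoring window.
interface CanaryMetrics {
  errorRatePct: number;    // failing requests as a percentage, 0 to 100
  p99LatencyRatio: number; // canary p99 divided by baseline p99
  healthStatus: number;    // HTTP status returned by /health on the canary
  incidentOpen: boolean;   // true if an S1/S2 incident was triggered
}

type CanaryDecision = "abort" | "hold" | "proceed";

function evaluateCanary(m: CanaryMetrics): CanaryDecision {
  // Abort criteria take precedence over promotion criteria.
  if (
    m.errorRatePct > 1 ||
    m.healthStatus === 503 ||
    m.p99LatencyRatio > 5 ||
    m.incidentOpen
  ) {
    return "abort";
  }
  // Promotion requires every monitoring threshold to be met.
  const healthy =
    m.errorRatePct < 0.5 && m.p99LatencyRatio <= 2 && m.healthStatus === 200;
  // Assumption: anything in between blocks promotion without forcing a rollback.
  return healthy ? "proceed" : "hold";
}
```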

---

## 3. Rollback Procedure

### 3.1 Automatic Rollback

The deploy pipeline monitors `/health` after deployment. If the health check fails within the first 5 minutes post-deploy:

1. Pipeline automatically reverts to the previous image tag
2. Alert fires to the on-call channel
3. Engineer investigates the failed deployment logs

**No manual action is required** to trigger the auto-rollback, but the on-call engineer should verify that it succeeded by checking:
- `/health` returns 200
- Error rate returns to baseline
- Previous image tag is running on all nodes
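
To make the 5-minute watch window concrete, the following sketch decides whether a series of post-deploy health probes should trigger a rollback. It is not the pipeline's actual logic: the `HealthSample` shape is hypothetical, and treating a single failed probe as sufficient is an assumption (a real pipeline may require several consecutive failures).

```typescript
// Hypothetical record of one /health probe taken after a deployment.
interface HealthSample {
  secondsSinceDeploy: number;
  statusCode: number | null; // null means the probe got no response
}

const WATCH_WINDOW_SECONDS = 5 * 60;

// Returns true if any probe inside the watch window did not return 200.
// Assumption: one failed probe is enough to trigger the rollback.
function shouldAutoRollback(samples: HealthSample[]): boolean {
  return samples.some(
    (s) => s.secondsSinceDeploy <= WATCH_WINDOW_SECONDS && s.statusCode !== 200,
  );
}
```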

### 3.2 Manual Rollback

If auto-rollback did not trigger or a problem is discovered later:

1. **Identify the last-known-good image tag** from the deployment history
2. **Redeploy the previous image:**
   ```bash
   # Placeholder — actual commands depend on customer's deployment tool
   # Example:
   # deploy --image $REGISTRY/flights-web-standalone:$PREVIOUS_SHA --env production
   ```
3. **Verify rollback:**
   - `/health` returns 200 on all nodes
   - Error rate returns to baseline
   - Smoke test passes
4. **Post-mortem:** file an incident report within 24 hours

---

## 4. Health-Check Interpretation

### Endpoint: `GET /health`

| Response | Status | Meaning | Action |
|----------|--------|---------|--------|
| `{ "status": "ok" }` | 200 | Upstream API reachable within last 60s | None |
| `{ "status": "degraded", "reason": "upstream_unreachable" }` | 503 | No successful upstream ping in 60s | Check upstream API status; see Section 6.1 |
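
The table implies a freshness rule: the endpoint is healthy only if an upstream ping succeeded within the last 60 seconds. A minimal sketch of that rule follows; the `evaluateHealth` name and the way the last-success timestamp is passed in are assumptions, not the service's actual handler.

```typescript
interface HealthResult {
  httpStatus: 200 | 503;
  body: { status: "ok" } | { status: "degraded"; reason: "upstream_unreachable" };
}

const FRESHNESS_MS = 60 * 1000;

// lastUpstreamOkAt: epoch milliseconds of the last successful upstream ping,
// or null if no ping has succeeded yet. now: current epoch milliseconds.
function evaluateHealth(lastUpstreamOkAt: number | null, now: number): HealthResult {
  const fresh = lastUpstreamOkAt !== null && now - lastUpstreamOkAt <= FRESHNESS_MS;
  return fresh
    ? { httpStatus: 200, body: { status: "ok" } }
    : { httpStatus: 503, body: { status: "degraded", reason: "upstream_unreachable" } };
}
```

Passing `now` as a parameter keeps the rule deterministic and easy to unit test.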

### Common Causes of 503

1. **Upstream API is down** — check upstream service status page / monitoring
2. **Network partition** — the node cannot reach the upstream API; check network policies
3. **DNS resolution failure** — verify DNS configuration on the node
4. **Upstream API overloaded** — ping times out; coordinate with upstream team

### Load Balancer Behavior

When `/health` returns 503, the load balancer should stop routing traffic to that node. When the upstream recovers and `/health` returns 200 again, traffic automatically resumes.

---

## 5. Log Query Cookbook

Logs are shipped in JSON Lines format to the customer's log aggregation system.

### Log Structure

```json
{
  "ts": "2026-04-14T12:00:00.000Z",
  "level": "error",
  "msg": "Request failed",
  "fields": {
    "traceId": "abc123",
    "path": "/api/flights",
    "status": 500,
    "err": "TypeError: Cannot read properties of undefined"
  }
}
```
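
When working with a raw log export rather than the aggregation system, the structure above can be parsed line by line. The sketch below assumes only the four top-level keys shown in the sample; `LogEntry` and `parseJsonLines` are names chosen for this example.

```typescript
interface LogEntry {
  ts: string;
  level: string;
  msg: string;
  fields: Record<string, unknown>;
}

// Parses JSON Lines text, skipping blank lines, invalid JSON, and
// records that lack the expected string keys.
function parseJsonLines(text: string): LogEntry[] {
  const entries: LogEntry[] = [];
  for (const line of text.split("\n")) {
    if (line.trim() === "") continue;
    try {
      const parsed = JSON.parse(line) as Partial<LogEntry>;
      if (
        typeof parsed.ts === "string" &&
        typeof parsed.level === "string" &&
        typeof parsed.msg === "string"
      ) {
        const fields =
          typeof parsed.fields === "object" && parsed.fields !== null
            ? parsed.fields
            : {};
        entries.push({ ts: parsed.ts, level: parsed.level, msg: parsed.msg, fields });
      }
    } catch {
      // Not valid JSON (or not an object): skip the line.
    }
  }
  return entries;
}
```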

### Common Queries

**Find all errors in the last hour:**
```
level:error AND ts:[now-1h TO now]
```

**Find errors for a specific trace:**
```
fields.traceId:"abc123"
```

**Find slow requests (logged by the API client on timeout):**
```
msg:"Retrying request" OR msg:"upstream_timeout"
```

**Find health-check failures:**
```
msg:"upstream_unreachable" OR (fields.path:"/health" AND fields.status:503)
```

**Find graceful shutdown events:**
```
msg:"SIGTERM received" OR msg:"Server closed" OR msg:"Drain timeout exceeded"
```

**Find CSP violations (if CSP reporting is enabled):**
```
msg:"csp-violation" OR fields.type:"csp-report"
```
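
The query syntax above depends on the customer's log aggregation system. For ad hoc analysis of exported logs, the first query ("all errors in the last hour") can be approximated locally. This is a sketch under that assumption; `MinimalLogEntry` and `errorsInLastHour` are hypothetical names.

```typescript
interface MinimalLogEntry {
  ts: string;    // ISO 8601 timestamp, as in the log structure
  level: string;
  msg: string;
}

// Keeps entries with level "error" whose timestamp falls within the hour
// ending at `now`. Entries with unparseable timestamps are dropped.
function errorsInLastHour(entries: MinimalLogEntry[], now: Date): MinimalLogEntry[] {
  const end = now.getTime();
  const start = end - 60 * 60 * 1000;
  return entries.filter((e) => {
    const t = Date.parse(e.ts);
    return e.level === "error" && !Number.isNaN(t) && t >= start && t <= end;
  });
}
```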

---

## 6. Known-Failure Playbooks

### 6.1 Upstream API Down

**Symptoms:** `/health` returns 503; API client logs show retry exhaustion.

**Impact:** Users see error pages or stale data (if caching is in place).

**Steps:**
1. Confirm upstream status via the upstream team's status page or monitoring
2. If upstream is aware and working on it: monitor, no action needed on our side
3. If upstream is unaware: escalate via agreed communication channel
4. If outage exceeds 30 minutes: consider enabling a maintenance page
5. Recovery is automatic — once upstream responds, `/health` returns 200 within 60s

### 6.2 SignalR Hub Offline

**Symptoms:** Real-time flight updates stop; SignalR reconnection logs appear.

**Impact:** Users see stale board data; manual refresh still works via REST API.

**Steps:**
1. Check SignalR hub process/container status
2. Verify WebSocket connectivity from the node to the SignalR hub
3. The client auto-reconnects with exponential backoff — recovery is usually automatic
4. If hub is permanently down: REST polling fallback should activate (if implemented)
5. Inform users if downtime exceeds 5 minutes
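
For readers unfamiliar with exponential backoff, the sketch below shows the general shape of the delay schedule. The base delay and cap are assumptions for illustration; this runbook does not state the client's actual reconnect parameters.

```typescript
// Delay before reconnect attempt number `attempt` (zero-based):
// base, 2x base, 4x base, ... capped at capMs.
// baseMs and capMs are illustrative defaults, not the client's real settings.
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 30000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```

With these defaults the first attempts wait 1 s, 2 s, 4 s, and 8 s, and every attempt from the sixth onward waits the 30 s cap. When comparing against step 5, keep in mind that several reconnect cycles can fit within the 5-minute notification threshold.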

### 6.3 CSP Violation Spike

**Symptoms:** Spike in CSP violation reports; possibly broken page functionality.

**Impact:** None for users while CSP is in report-only mode (Phase 1). Once the policy is enforced, the browser blocks the offending scripts or styles and the UI may be partially broken.

**Steps:**
1. Check CSP violation reports for the blocked resource URL
2. If a legitimate resource is blocked: update CSP policy in `src/server/middleware/csp.ts`
3. If a third-party script is the source: investigate whether it was injected (security concern)
4. If after a deploy: the new code may reference resources not in the CSP allowlist — fix or rollback
5. CSP is in report-only mode during Phase 1 — no user impact, but violations should be tracked
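
Step 1 is easier when reports are grouped by blocked resource. The sketch below counts reports per blocked URI. It assumes reports carry the `blocked-uri` and `violated-directive` keys of the standard `report-uri` payload; how this service actually stores them in its logs is not specified here.

```typescript
// Subset of the standard CSP report body used for triage.
interface CspReportFields {
  "blocked-uri": string;
  "violated-directive": string;
}

// Counts how many reports reference each blocked URI.
function countByBlockedUri(reports: CspReportFields[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const r of reports) {
    const uri = r["blocked-uri"];
    counts.set(uri, (counts.get(uri) ?? 0) + 1);
  }
  return counts;
}
```

A single URI dominating the counts right after a deploy points to step 4; many unrelated URIs from unknown origins point to step 3.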

### 6.4 Analytics Adapter Load Failure

**Symptoms:** `flights.analytics.load_failed` counter increases; analytics data gaps.

**Impact:** Analytics data not collected; no user-facing impact.

**Steps:**
1. Check which adapter(s) failed (Yandex.Metrica, CTM, Variocube, Dynatrace)
2. Verify the adapter's external script URL is reachable from the client
3. Check for CORS or CSP blocking the adapter script
4. If a single adapter: low priority, monitor
5. If all adapters: likely a CSP or network issue affecting all external scripts

### 6.5 OTel Exporter Unreachable

**Symptoms:** Metrics and traces stop appearing in the monitoring dashboard.

**Impact:** No observability data; no user-facing impact.

**Steps:**
1. Check the OTel collector/exporter endpoint connectivity
2. Verify the `OTEL_EXPORTER_OTLP_ENDPOINT` environment variable is correct
3. Check for network policy changes that may block the exporter
4. The SDK buffers a bounded amount of data in memory, so telemetry from a short outage may still be exported once the exporter is reachable again; data beyond the buffer is dropped
5. If the exporter is permanently moved: update the endpoint configuration and redeploy

### 6.6 Memory Pressure / OOM Kill

**Symptoms:** Container restarts; OOM kill events in container orchestrator logs.

**Impact:** Requests in flight are dropped; load balancer reroutes to healthy nodes.

**Steps:**
1. Check container memory limits vs actual usage
2. Review recent deploys for memory leaks (new dependencies, unbounded caches)
3. If a specific route causes high memory: check for large API responses or unbounded data structures
4. Short-term: increase memory limits
5. Long-term: profile the application to find the leak; fix and redeploy

---

## Recovery SLA

**Target:** Service recovery within 6 hours after infrastructure is restored.

**Recovery steps:**
1. Infrastructure team restores VMs / containers across geographic regions
2. Deployment tool re-deploys the last-known-good image
3. `/health` checks confirm upstream connectivity
4. Load balancer re-enables traffic to recovered nodes
5. On-call engineer verifies end-to-end functionality
6. Incident report filed within 24 hours of resolution

---

## 7. Phase 6 Cutover Reference

This section cross-references the full cutover runbook at `docs/superpowers/plans/2026-04-15-phase-6-cutover.md`.

### 7.1 Traffic Ramp Quick Reference

During cutover, traffic shifts from Angular to React over 72 hours:
- **T+0h:** 5% React / 95% Angular
- **T+12h:** 25% React / 75% Angular
- **T+24h:** 50% React / 50% Angular
- **T+48h:** 100% React / 0% Angular

Each step requires explicit go/no-go from the on-call engineer. Full details in the cutover runbook Section 3.
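
The schedule above can be expressed as a lookup from elapsed time to the planned React share. This is only a planning aid: the names are hypothetical, and the function returns the planned target, not the live proxy weight, since each step still needs an explicit go/no-go.

```typescript
// Planned ramp steps, taken from the schedule above.
const RAMP_STEPS = [
  { atHour: 0, reactPct: 5 },
  { atHour: 12, reactPct: 25 },
  { atHour: 24, reactPct: 50 },
  { atHour: 48, reactPct: 100 },
];

// Returns the planned React traffic share for a given number of hours
// since T+0. Before the cutover starts, the planned share is 0.
function plannedReactPct(hoursSinceStart: number): number {
  let pct = 0;
  for (const step of RAMP_STEPS) {
    if (hoursSinceStart >= step.atHour) pct = step.reactPct;
  }
  return pct;
}
```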

### 7.2 Cutover Rollback

During or after the traffic ramp, if a rollback is needed:

1. Flip proxy weights back to Angular (< 1 minute)
2. Verify Angular is serving traffic via response headers
3. Confirm error rate and latency return to baseline
4. File post-mortem within 24 hours

**Trigger criteria:** error rate > 1% for 5+ minutes, p95 > 2x baseline for 10+ minutes, or > 50% of React nodes returning 503.

Full rollback procedure in the cutover runbook Section 5.
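
The trigger criteria can be checked mechanically once the durations are measured. In the sketch below, the `CutoverWindow` shape is an assumption: it presumes the monitoring system can report how many consecutive minutes each threshold has been breached.

```typescript
// Hypothetical measurements taken during the traffic ramp.
interface CutoverWindow {
  minutesErrorRateAbove1Pct: number; // consecutive minutes with error rate > 1%
  minutesP95Above2xBaseline: number; // consecutive minutes with p95 > 2x baseline
  reactNodesTotal: number;
  reactNodesReturning503: number;
}

// Returns true if any one of the three trigger criteria is met.
function shouldRollBackCutover(w: CutoverWindow): boolean {
  const errorTrigger = w.minutesErrorRateAbove1Pct >= 5;
  const latencyTrigger = w.minutesP95Above2xBaseline >= 10;
  const nodeTrigger =
    w.reactNodesTotal > 0 &&
    w.reactNodesReturning503 / w.reactNodesTotal > 0.5;
  return errorTrigger || latencyTrigger || nodeTrigger;
}
```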

### 7.3 Post-Cutover Soak

After reaching 100% React traffic, a 7-day soak period is required before Angular decommission. Soak pass criteria:
- Zero Angular hits in access logs
- Error rate < 0.1%
- p95 < 500ms
- Core Web Vitals in "Good" threshold
- No Search Console regressions

Full soak criteria in the cutover runbook Section 6.
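
A sign-off check for the criteria listed above might look like the following sketch. The `SoakMetrics` fields are hypothetical stand-ins for values read from access logs, monitoring, and Search Console; the function returns the list of failed criteria so that an empty list means the soak passed.

```typescript
// Hypothetical summary of the 7-day soak period.
interface SoakMetrics {
  angularHits: number;              // Angular requests seen in access logs
  errorRatePct: number;             // failing requests as a percentage
  p95Ms: number;                    // p95 latency in milliseconds
  coreWebVitalsGood: boolean;       // all Core Web Vitals in the "Good" threshold
  searchConsoleRegressions: number; // count of regressions found
}

// Returns a description of every criterion that failed; empty means pass.
function soakFailures(m: SoakMetrics): string[] {
  const failures: string[] = [];
  if (m.angularHits !== 0) failures.push("Angular hits in access logs");
  if (m.errorRatePct >= 0.1) failures.push("error rate not below 0.1%");
  if (m.p95Ms >= 500) failures.push("p95 not below 500ms");
  if (!m.coreWebVitalsGood) failures.push("Core Web Vitals not in Good threshold");
  if (m.searchConsoleRegressions > 0) failures.push("Search Console regressions");
  return failures;
}
```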

### 7.4 Angular Decommission

After soak sign-off:
1. Tag the Angular codebase: `git tag -a angular-final`
2. Create archive branch: `archive/angular-spa`
3. Remove Angular/ASP.NET files (requires customer approval)
4. Infrastructure cleanup

Full decommission steps in the cutover runbook Section 7.