From 98c6eca90ec0366e2ac1efdcd6080ac77d7d12b3 Mon Sep 17 00:00:00 2001
From: gnezim
Date: Wed, 15 Apr 2026 10:56:34 +0300
Subject: [PATCH] Add Phase 6 cutover runbook and operational checklist

Comprehensive operational procedure for the Angular-to-React traffic
cutover: pre-cutover gates, proxy config templates (nginx/HAProxy),
72-hour traffic ramp schedule, monitoring checklist, rollback procedure,
1-week soak criteria, and Angular decommission steps.

Also adds Phase 6 cross-reference sections to the Phase 1 runbook.
---
 docs/superpowers/phase-1/runbook.md           |  50 ++
 .../plans/2026-04-15-phase-6-cutover.md       | 443 ++++++++++++++++++
 2 files changed, 493 insertions(+)
 create mode 100644 docs/superpowers/plans/2026-04-15-phase-6-cutover.md

diff --git a/docs/superpowers/phase-1/runbook.md b/docs/superpowers/phase-1/runbook.md
index da00c019..6b470330 100644
--- a/docs/superpowers/phase-1/runbook.md
+++ b/docs/superpowers/phase-1/runbook.md
@@ -288,3 +288,53 @@ msg:"csp-violation" OR fields.type:"csp-report"
 4. Load balancer re-enables traffic to recovered nodes
 5. On-call engineer verifies end-to-end functionality
 6. Incident report filed within 24 hours of resolution
+
+---
+
+## 7. Phase 6 Cutover Reference
+
+This section cross-references the full cutover runbook at `docs/superpowers/plans/2026-04-15-phase-6-cutover.md`.
+
+### 7.1 Traffic Ramp Quick Reference
+
+During cutover, traffic shifts from Angular to React over 72 hours:
+- **T+0h:** 5% React / 95% Angular
+- **T+12h:** 25% React / 75% Angular
+- **T+24h:** 50% React / 50% Angular
+- **T+48h:** 100% React / 0% Angular
+
+Each step requires explicit go/no-go from the on-call engineer. Full details in the cutover runbook Section 3.
+
+### 7.2 Cutover Rollback
+
+During or after the traffic ramp, if a rollback is needed:
+
+1. Flip proxy weights back to Angular (< 1 minute)
+2. Verify Angular is serving traffic via response headers
+3. Confirm error rate and latency return to baseline
+4. File post-mortem within 24 hours
+
+**Trigger criteria:** error rate > 1% for 5+ minutes, p95 > 2x baseline for 10+ minutes, or > 50% of React nodes returning 503.
+
+Full rollback procedure in the cutover runbook Section 5.
+
+### 7.3 Post-Cutover Soak
+
+After reaching 100% React traffic, a 7-day soak period is required before Angular decommission. Soak pass criteria:
+- Zero Angular hits in access logs
+- Error rate < 0.1%
+- p95 < 500ms
+- Core Web Vitals in "Good" threshold
+- No Search Console regressions
+
+Full soak criteria in the cutover runbook Section 6.
+
+### 7.4 Angular Decommission
+
+After soak sign-off:
+1. Tag the Angular codebase: `git tag -a angular-final`
+2. Create archive branch: `archive/angular-spa`
+3. Remove Angular/ASP.NET files (requires customer approval)
+4. Infrastructure cleanup
+
+Full decommission steps in the cutover runbook Section 7.
diff --git a/docs/superpowers/plans/2026-04-15-phase-6-cutover.md b/docs/superpowers/plans/2026-04-15-phase-6-cutover.md
new file mode 100644
index 00000000..0f098b75
--- /dev/null
+++ b/docs/superpowers/plans/2026-04-15-phase-6-cutover.md
@@ -0,0 +1,443 @@
+# Phase 6: Cutover & Decommission Runbook
+
+**Version:** 1.0
+**Date:** 2026-04-15
+**Target:** Flip all traffic from Angular SPA to React Module Federation remote, soak for 1 week, then archive the Angular codebase.
+
+---
+
+## 1. Pre-Cutover Checklist
+
+Every gate below must be **green** before starting the traffic ramp.
+
+| # | Gate | Verified by | Status |
+|---|------|-------------|--------|
+| 1 | Phase 1 exit: Foundation complete (ModernJS SSR, MF 2.0, i18n, API client, SignalR, layout, SEO, analytics, logger, metrics, security, deploy) | Tech lead | [ ] |
+| 2 | Phase 2 exit: Online Board feature parity confirmed (URL serializer, API hooks, SignalR wiring, routes, SEO, integration tests) | QA lead | [ ] |
+| 3 | Phase 3 exit: Schedule feature parity confirmed | QA lead | [ ] |
+| 4 | Phase 4 exit: Flights Map feature parity confirmed | QA lead | [ ] |
+| 5 | Phase 5 exit: Popular Requests feature parity confirmed | QA lead | [ ] |
+| 6 | Load test passed at 100 RPS sustained for 30 minutes with p95 < 500ms | Performance engineer | [ ] |
+| 7 | Staging environment fully verified (all routes, SSR, MF remote loading) | QA lead | [ ] |
+| 8 | Rollback procedure rehearsed in staging (< 1 minute switchover confirmed) | Ops engineer | [ ] |
+| 9 | Monitoring dashboards operational (error rate, p95 latency, Web Vitals, SignalR health) | SRE | [ ] |
+| 10 | Search Console coverage verified (no indexing regressions on staging) | SEO lead | [ ] |
+| 11 | All `pnpm typecheck && pnpm lint && pnpm test && pnpm build:both` pass on the release branch | CI | [ ] |
+| 12 | Incident communication plan agreed (who to notify, escalation path) | Team lead | [ ] |
+| 13 | Customer sign-off for production traffic shift | Project manager | [ ] |
+
+---
+
+## 2. Proxy Rule Configuration Template
+
+The actual LB/proxy technology depends on the customer's infrastructure. Below are placeholder templates for common setups. Replace `` and `` with the real backend addresses.
+
+### 2.1 nginx
+
+```nginx
+# --- Phase 6 cutover: traffic split ---
+# Adjust the split_clients percentages to control the traffic ramp (see Section 3).
+# Note: nginx rejects weight=0 on an upstream server, so do not use it to drain a backend.
+
+upstream react_backend {
+    server weight=100;
+}
+
+upstream angular_backend {
+    server weight=1;  # nginx rejects weight=0; the split_clients block below sets the share
+}
+
+# Split block, used during ramp
+split_clients "${remote_addr}${request_uri}" $backend {
+    # Adjust percentages during ramp (Section 3)
+    # 5%   react_backend;  # Step 1
+    # 25%  react_backend;  # Step 2
+    # 50%  react_backend;  # Step 3
+    100%   react_backend;  # Step 4 (final)
+    *      angular_backend;
+}
+
+server {
+    listen 443 ssl;
+    server_name flights.example.com;
+
+    location / {
+        proxy_pass http://$backend;
+        proxy_set_header Host $host;
+        proxy_set_header X-Real-IP $remote_addr;
+        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
+        proxy_set_header X-Forwarded-Proto $scheme;
+
+        # WebSocket support for SignalR
+        proxy_http_version 1.1;
+        proxy_set_header Upgrade $http_upgrade;
+        proxy_set_header Connection "upgrade";
+    }
+
+    # Health check: always probe the active backend
+    location /health {
+        proxy_pass http://react_backend/health;
+    }
+}
+```
+
+### 2.2 HAProxy
+
+```haproxy
+# --- Phase 6 cutover: traffic split ---
+frontend flights_frontend
+    bind *:443 ssl crt /etc/ssl/certs/flights.pem
+    default_backend react_backend
+
+    # During ramp, use an ACL on the rand() sample fetch for the percentage split:
+    # acl is_canary rand(100) lt 5    # 5%
+    # use_backend react_backend if is_canary
+    # default_backend angular_backend
+
+backend react_backend
+    server react1 check
+    option httpchk GET /health
+    http-check expect status 200
+
+backend angular_backend
+    server angular1 check
+    # Set to "disabled" state after 100% cutover:
+    # server angular1 check disabled
+```
+
+### 2.3 Customer LB (generic)
+
+```
+# If the customer uses a proprietary load balancer or CDN (e.g., F5, AWS ALB, Cloudflare):
+#
+# 1. Create a weighted target group / origin pool with two backends:
+#    - React: weight = 
+#    - Angular: weight = <100 - RAMP_PERCENTAGE>
+#
+# 2. Attach both targets to the main listener rule for flights.example.com/*
+#
+# 3. Adjust weights per the schedule in Section 3.
+#
+# 4. At 100% cutover, remove the Angular target entirely.
+```
+
+---
+
+## 3. Traffic Ramp Schedule
+
+Total ramp duration: **72 hours** (3 days). Each step requires explicit go/no-go from the on-call engineer.
+
+| Step | Time (T+) | React % | Angular % | Duration before next step | Go/No-Go by |
+|------|-----------|---------|-----------|---------------------------|-------------|
+| 1 | T+0h | 5% | 95% | 12 hours | On-call engineer |
+| 2 | T+12h | 25% | 75% | 12 hours | On-call engineer |
+| 3 | T+24h | 50% | 50% | 24 hours | On-call engineer |
+| 4 | T+48h | 100% | 0% | Start soak period | Tech lead |
+
+### Step Execution
+
+For each step:
+
+1. **Update proxy weights** per Section 2 templates
+2. **Reload proxy config** (graceful reload, no dropped connections):
+   ```bash
+   # nginx
+   nginx -t && nginx -s reload
+
+   # HAProxy
+   haproxy -c -f /etc/haproxy/haproxy.cfg && systemctl reload haproxy
+   ```
+3. **Verify the split** by checking access logs for both backends
+4. **Monitor for the hold period** (Section 4)
+5. **Record go/no-go decision** in the incident channel with timestamp
+
+---
+
+## 4. Monitoring Checklist During Ramp
+
+Monitor these metrics continuously during each ramp step. Any breach triggers the hold or rollback procedure.
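The 5xx error-rate check in this section can be spot-checked straight from an access log while the dashboards are being watched. The following is a minimal sketch, not part of the runbook's tooling: it assumes the nginx default `combined` log format (status code in the 9th whitespace-separated field), and the function names `error_rate_5xx` and `within_error_budget` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only. Assumption: nginx "combined" log format, where the HTTP
# status code is the 9th whitespace-separated field. Adjust the field
# index if the deployment uses a custom log_format.

# Print the percentage of 5xx responses in the given log, two decimals.
error_rate_5xx() {
    awk '
        { total++ }
        $9 ~ /^5[0-9][0-9]$/ { errors++ }
        END {
            if (total == 0) { print "0.00"; exit }
            printf "%.2f\n", (errors / total) * 100
        }
    ' "$1"
}

# Succeed (exit 0) while the 5xx rate is strictly below the threshold (%).
within_error_budget() {
    local rate
    rate=$(error_rate_5xx "$1")
    awk -v r="$rate" -v t="$2" 'BEGIN { exit !(r < t) }'
}
```

For example, `within_error_budget /var/log/nginx/access.log 0.5 || echo "HOLD: 5xx rate above 0.5%"` would flag a breach of the ramp threshold. A whole-file average hides short spikes, so this complements the monitoring dashboard rather than replacing it.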
+
+### 4.1 Error Rate
+
+- [ ] Overall error rate (5xx) < 0.5% throughout ramp step
+- [ ] No new error patterns in application logs
+- [ ] No CSP violation spikes
+- [ ] No unhandled promise rejections in client-side error tracking
+
+### 4.2 Latency
+
+- [ ] p50 latency within 10% of Angular baseline
+- [ ] p95 latency < 500ms
+- [ ] p99 latency < 2s
+- [ ] Time to First Byte (TTFB) < 800ms
+
+### 4.3 Web Vitals (Core Web Vitals from RUM / Dynatrace)
+
+- [ ] LCP (Largest Contentful Paint) < 2.5s
+- [ ] INP (Interaction to Next Paint, the successor to FID) < 200ms
+- [ ] CLS (Cumulative Layout Shift) < 0.1
+
+### 4.4 Search Console
+
+- [ ] No indexing coverage drops
+- [ ] No new crawl errors
+- [ ] Structured data (JSON-LD) validated without errors
+- [ ] No mobile usability regressions
+
+### 4.5 SignalR Health
+
+- [ ] SignalR hub connections stable (no reconnection storms)
+- [ ] Real-time flight updates arriving within 2s of server push
+- [ ] WebSocket upgrade success rate > 99%
+
+### 4.6 Business Metrics
+
+- [ ] Page views per session consistent with Angular baseline
+- [ ] Bounce rate no more than 5% above baseline
+- [ ] Analytics events firing (Yandex.Metrica, CTM, Variocube, Dynatrace)
+
+### 4.7 Infrastructure
+
+- [ ] CPU utilization < 70% on React nodes
+- [ ] Memory utilization stable (no upward trend)
+- [ ] No OOM kills
+- [ ] Health check (`/health`) returning 200 on all React nodes
+
+---
+
+## 5. Rollback Procedure
+
+**Target: < 1 minute to restore Angular traffic.**
+
+### 5.1 Trigger Criteria
+
+Roll back immediately if any of:
+- Error rate exceeds 1% for more than 5 minutes
+- p95 latency exceeds 2x Angular baseline for more than 10 minutes
+- `/health` returns 503 on > 50% of React nodes
+- SignalR connections drop and do not recover within 5 minutes
+- Any S1 incident is declared
+
+### 5.2 Rollback Steps
+
+1. **Flip proxy weights back to Angular:**
+   ```bash
+   # nginx: edit the split_clients block so all traffic maps to angular_backend
+   # (nginx rejects weight=0, so do not zero an upstream weight)
+   nginx -t && nginx -s reload
+
+   # HAProxy: switch default_backend to angular_backend
+   haproxy -c -f /etc/haproxy/haproxy.cfg && systemctl reload haproxy
+
+   # Customer LB: set React target weight to 0, Angular to 100
+   ```
+
+2. **Verify Angular is serving traffic:**
+   ```bash
+   curl -sI https://flights.example.com/ | grep -i server
+   # Should show the ASP.NET/Angular response headers
+   ```
+
+3. **Confirm health:**
+   - [ ] Error rate returning to baseline
+   - [ ] p95 latency returning to baseline
+   - [ ] `/health` returning 200
+
+4. **Notify stakeholders** in the incident channel
+
+5. **Post-mortem:** File within 24 hours. Determine root cause before attempting another ramp.
+
+### 5.3 Post-Rollback
+
+- Do NOT re-attempt ramp until the root cause is identified and fixed
+- Re-run the full pre-cutover checklist (Section 1) before the next attempt
+- Consider a smaller initial percentage (2% instead of 5%) for the retry
+
+---
+
+## 6. One-Week Soak Criteria
+
+After reaching 100% React traffic (T+48h), maintain for **7 calendar days** before proceeding to decommission.
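To avoid ambiguity about when the soak clock runs out, the earliest sign-off time can be computed from the timestamp at which 100% React traffic was reached. This is a small sketch under stated assumptions: it relies on GNU `date` (coreutils; BSD/macOS `date` takes different flags), works in UTC, and the function name `soak_end` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only. Assumption: GNU date (coreutils) is available.

# Print the earliest soak sign-off time: start timestamp plus 7 days, in UTC.
soak_end() {
    local start_epoch
    # Convert the start timestamp to epoch seconds; fail on unparsable input.
    start_epoch=$(date -u -d "$1" +%s) || return 1
    # Epoch arithmetic avoids the ambiguities of relative date strings.
    date -u -d "@$((start_epoch + 7 * 86400))" '+%Y-%m-%d %H:%M UTC'
}
```

Usage: `soak_end '2026-04-17 10:00:00 UTC'`. If the soak clock is restarted after a deployed fix (Section 6.2), re-run it with the new start time.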
+
+### 6.1 Soak Pass Criteria
+
+All of these must be true for the entire 7-day period:
+
+| Criterion | Threshold | How to verify |
+|-----------|-----------|---------------|
+| Angular traffic | **Zero hits** in Angular access logs | `grep -c "angular_backend" /var/log/nginx/access.log` returns 0 for each day (requires a log format that records the upstream name) |
+| Error rate | < 0.1% (5xx responses) | Monitoring dashboard daily average |
+| p95 latency | < 500ms | Monitoring dashboard daily p95 |
+| p99 latency | < 2s | Monitoring dashboard daily p99 |
+| Core Web Vitals | All "Good" threshold | RUM / Dynatrace daily report |
+| Search Console | No coverage regression | Weekly Search Console report |
+| SignalR health | No reconnection storms, < 0.1% dropped connections | SignalR hub metrics |
+| Analytics parity | Event counts within 5% of pre-cutover Angular baseline | Analytics dashboard comparison |
+| OOM / restart count | Zero unexpected container restarts | Container orchestrator logs |
+
+### 6.2 Soak Failure
+
+If any criterion is breached during the soak:
+
+1. **Do NOT immediately roll back** unless it meets Section 5.1 trigger criteria
+2. Investigate the root cause
+3. If a fix is deployed, restart the 7-day soak clock
+4. If the issue is transient and recovers within 1 hour, document but do not restart the clock
+
+### 6.3 Soak Sign-Off
+
+At the end of the 7-day soak, obtain written sign-off from:
+
+- [ ] Tech lead
+- [ ] QA lead
+- [ ] Project manager
+- [ ] Customer representative (if required by contract)
+
+---
+
+## 7. Angular Decommission Steps
+
+Only proceed after soak sign-off (Section 6.3).
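Before tagging and archiving, it may be worth re-running the "zero Angular hits" soak criterion over all seven daily logs as a final gate. The sketch below is illustrative only: it assumes the access log format records the chosen upstream name (for example by logging the `$backend` variable from the `split_clients` block), and the function names and log file paths are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only. Assumption: each access log line served by Angular contains
# the literal string "angular_backend" (requires a custom log_format).

# Print the number of lines in the log that mention the Angular upstream.
angular_hits() {
    # grep -c exits non-zero when the count is 0, so mask the exit status.
    grep -c "angular_backend" "$1" || true
}

# Succeed only if every given log file is readable and has zero Angular hits.
soak_logs_clean() {
    local f hits
    for f in "$@"; do
        if [ ! -r "$f" ]; then
            echo "FAIL: cannot read $f" >&2
            return 1
        fi
        hits=$(angular_hits "$f")
        if [ "$hits" -ne 0 ]; then
            echo "FAIL: $f has $hits Angular hits" >&2
            return 1
        fi
    done
    echo "OK: no Angular hits in $# log file(s)"
}
```

Usage, with hypothetical rotated log names: `soak_logs_clean /var/log/nginx/access.log /var/log/nginx/access.log.1`. An unreadable or missing file is treated as a failure so that a gap in the logs cannot pass as a clean day.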
+
+### 7.1 Git Tag
+
+Tag the last commit that includes the Angular code:
+
+```bash
+# Identify the current HEAD (should be the release branch tip)
+git log --oneline -1
+
+# Create an annotated tag
+git tag -a angular-final -m "Final state of Angular codebase before decommission"
+
+# Push the tag to the remote
+git push origin angular-final
+```
+
+### 7.2 Archive Branch
+
+Create an archive branch preserving the full Angular codebase:
+
+```bash
+# Create archive branch from current HEAD
+git checkout -b archive/angular-spa
+
+# Push to remote
+git push -u origin archive/angular-spa
+
+# Return to the main development branch
+git checkout plan/react-rewrite  # or main, depending on workflow
+```
+
+### 7.3 Remove Angular / ASP.NET Files
+
+**WARNING: Only do this after customer approval. This runbook does NOT execute this step automatically.**
+
+Files and directories to remove (review with the customer first):
+
+```
+ClientApp/                      # Angular SPA source
+  src/
+  angular.json
+  karma.conf.js
+  tsconfig*.json (Angular-specific)
+  package.json (Angular)
+  cypress/
+  .storybook/
+  ...
+
+Aeroflot.Flights.Web.csproj     # ASP.NET host project
+Startup.cs                      # ASP.NET startup
+Program.cs                      # ASP.NET entry point
+Controllers/                    # ASP.NET controllers (if any)
+wwwroot/                        # Static assets served by ASP.NET
+appsettings*.json               # ASP.NET configuration
+*.sln                           # .NET solution file
+```
+
+Steps:
+
+1. Create a new branch for the cleanup:
+   ```bash
+   git checkout -b chore/remove-angular-code
+   ```
+
+2. Remove the files listed above (verify the list with `git status` before committing)
+
+3. Update `.gitignore` to remove Angular/ASP.NET-specific entries
+
+4. Run the full verification suite:
+   ```bash
+   pnpm typecheck && pnpm lint && pnpm test && pnpm build:both
+   ```
+
+5. Commit and create a PR for review
+
+6. After merge, verify production deployment is unaffected
+
+### 7.4 Infrastructure Cleanup
+
+After the Angular code is removed and deployed:
+
+- [ ] Decommission Angular backend VMs / containers
+- [ ] Remove Angular upstream from load balancer configuration
+- [ ] Remove Angular-specific monitoring dashboards (or archive them)
+- [ ] Update DNS records if Angular was on a separate subdomain
+- [ ] Revoke Angular-specific secrets / certificates if any
+- [ ] Update architecture diagrams and documentation
+
+### 7.5 Post-Decommission Verification
+
+- [ ] All routes return correct responses from React
+- [ ] No references to Angular backend in proxy configs
+- [ ] No orphaned infrastructure resources
+- [ ] Documentation updated to reflect React-only architecture
+- [ ] Incident runbook (docs/superpowers/phase-1/runbook.md) updated to remove Angular references
+
+---
+
+## Appendix A: Quick Reference Commands
+
+```bash
+# Check which backend is serving a request
+curl -sI https://flights.example.com/ | grep -E "Server|X-Powered-By"
+
+# Check Angular access log for hits (should be zero during soak)
+grep -c "angular_backend" /var/log/nginx/access.log
+
+# Verify React health
+curl -s https://flights.example.com/health | jq .
+
+# Check MF manifest is accessible
+curl -sI https://flights.example.com/mf-manifest.json
+
+# Verify SSR is working (check for rendered HTML in response body)
+curl -s https://flights.example.com/ru/onlineboard | grep -c "data-mf-expose"
+
+# Check SignalR hub connectivity
+curl -s https://flights.example.com/signalr/negotiate -X POST | jq .
+```
+
+## Appendix B: Contacts
+
+| Role | Name | Contact |
+|------|------|---------|
+| Tech lead | TBD | TBD |
+| On-call engineer | TBD | TBD |
+| QA lead | TBD | TBD |
+| SRE | TBD | TBD |
+| Customer representative | TBD | TBD |
+| Incident channel | TBD | TBD |
+
+---
+
+## Appendix C: Timeline Summary
+
+```
+Day 0     : Pre-cutover checklist complete, customer sign-off
+Day 0-3   : Traffic ramp (5% -> 25% -> 50% -> 100%)
+Day 3-10  : One-week soak at 100% React
+Day 10    : Soak sign-off
+Day 10+   : Angular decommission (git tag, archive, file removal)
+Day 11+   : Infrastructure cleanup
+```