Add Phase 6 cutover runbook and operational checklist
Comprehensive operational procedure for the Angular-to-React traffic cutover: pre-cutover gates, proxy config templates (nginx/HAProxy), 72-hour traffic ramp schedule, monitoring checklist, rollback procedure, 1-week soak criteria, and Angular decommission steps. Also adds Phase 6 cross-reference sections to the Phase 1 runbook.
@@ -288,3 +288,53 @@ msg:"csp-violation" OR fields.type:"csp-report"
4. Load balancer re-enables traffic to recovered nodes
5. On-call engineer verifies end-to-end functionality
6. Incident report filed within 24 hours of resolution

---

## 7. Phase 6 Cutover Reference

This section cross-references the full cutover runbook at `docs/superpowers/plans/2026-04-15-phase-6-cutover.md`.

### 7.1 Traffic Ramp Quick Reference

During cutover, traffic shifts from Angular to React over 72 hours:
- **T+0h:** 5% React / 95% Angular
- **T+12h:** 25% React / 75% Angular
- **T+24h:** 50% React / 50% Angular
- **T+48h:** 100% React / 0% Angular

Each step requires explicit go/no-go from the on-call engineer. Full details in the cutover runbook Section 3.

### 7.2 Cutover Rollback

During or after the traffic ramp, if a rollback is needed:

1. Flip proxy weights back to Angular (< 1 minute)
2. Verify Angular is serving traffic via response headers
3. Confirm error rate and latency return to baseline
4. File post-mortem within 24 hours

**Trigger criteria:** error rate > 1% for 5+ minutes, p95 > 2x baseline for 10+ minutes, or > 50% of React nodes returning 503.

Full rollback procedure in the cutover runbook Section 5.

### 7.3 Post-Cutover Soak

After reaching 100% React traffic, a 7-day soak period is required before Angular decommission. Soak pass criteria:
- Zero Angular hits in access logs
- Error rate < 0.1%
- p95 < 500ms
- Core Web Vitals in "Good" threshold
- No Search Console regressions

Full soak criteria in the cutover runbook Section 6.

### 7.4 Angular Decommission

After soak sign-off:
1. Tag the Angular codebase: `git tag -a angular-final`
2. Create archive branch: `archive/angular-spa`
3. Remove Angular/ASP.NET files (requires customer approval)
4. Infrastructure cleanup

Full decommission steps in the cutover runbook Section 7.

@@ -0,0 +1,443 @@
# Phase 6: Cutover & Decommission Runbook

**Version:** 1.0
**Date:** 2026-04-15
**Target:** Flip all traffic from Angular SPA to React Module Federation remote, soak for 1 week, then archive the Angular codebase.

---

## 1. Pre-Cutover Checklist

Every gate below must be **green** before starting the traffic ramp.

| # | Gate | Verified by | Status |
|---|------|-------------|--------|
| 1 | Phase 1 exit: Foundation complete (ModernJS SSR, MF 2.0, i18n, API client, SignalR, layout, SEO, analytics, logger, metrics, security, deploy) | Tech lead | [ ] |
| 2 | Phase 2 exit: Online Board feature parity confirmed (URL serializer, API hooks, SignalR wiring, routes, SEO, integration tests) | QA lead | [ ] |
| 3 | Phase 3 exit: Schedule feature parity confirmed | QA lead | [ ] |
| 4 | Phase 4 exit: Flights Map feature parity confirmed | QA lead | [ ] |
| 5 | Phase 5 exit: Popular Requests feature parity confirmed | QA lead | [ ] |
| 6 | Load test passed at 100 RPS sustained for 30 minutes with p95 < 500ms | Performance engineer | [ ] |
| 7 | Staging environment fully verified (all routes, SSR, MF remote loading) | QA lead | [ ] |
| 8 | Rollback procedure rehearsed in staging (< 1 minute switchover confirmed) | Ops engineer | [ ] |
| 9 | Monitoring dashboards operational (error rate, p95 latency, Web Vitals, SignalR health) | SRE | [ ] |
| 10 | Search Console coverage verified (no indexing regressions on staging) | SEO lead | [ ] |
| 11 | All `pnpm typecheck && pnpm lint && pnpm test && pnpm build:both` pass on the release branch | CI | [ ] |
| 12 | Incident communication plan agreed (who to notify, escalation path) | Team lead | [ ] |
| 13 | Customer sign-off for production traffic shift | Project manager | [ ] |

---

## 2. Proxy Rule Configuration Template

The actual LB/proxy technology depends on the customer's infrastructure. Below are placeholder templates for common setups. Replace `<REACT_UPSTREAM>` and `<ANGULAR_UPSTREAM>` with the real backend addresses.

### 2.1 nginx

```nginx
# --- Phase 6 cutover: traffic split ---
# The ramp is controlled by the split_clients percentages below (see Section 3),
# not by upstream server weights: each upstream has a single server, and
# nginx rejects weight=0 as an invalid parameter.

upstream react_backend {
    server <REACT_UPSTREAM>;
}

upstream angular_backend {
    server <ANGULAR_UPSTREAM>;  # Receives no traffic once split_clients sends 100% to React
}

# Split block: used during ramp.
# Keying on $request_uri as well as $remote_addr means one client can reach
# different backends for different URLs; key on $remote_addr alone if each
# client must stay on a single backend.
split_clients "${remote_addr}${request_uri}" $backend {
    # Keep exactly one percentage line active for the current step (Section 3)
    # 5%    react_backend;    # Step 1
    # 25%   react_backend;    # Step 2
    # 50%   react_backend;    # Step 3
    100%    react_backend;    # Step 4 (final)
    *       angular_backend;
}

server {
    listen 443 ssl;
    server_name flights.example.com;
    # ssl_certificate / ssl_certificate_key are omitted here; set them per environment.

    location / {
        proxy_pass http://$backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # WebSocket support for SignalR
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }

    # Health check: always probe the active backend
    location /health {
        proxy_pass http://react_backend/health;
    }
}
```

### 2.2 HAProxy

```haproxy
# --- Phase 6 cutover: traffic split ---
frontend flights_frontend
    bind *:443 ssl crt /etc/ssl/certs/flights.pem
    default_backend react_backend

    # During ramp, replace the default_backend line above with a per-request
    # random split (rand() is a sample fetch; no stick-table is involved):
    # acl is_canary rand(100) lt 5    # 5%
    # use_backend react_backend if is_canary
    # default_backend angular_backend

backend react_backend
    server react1 <REACT_UPSTREAM> check
    option httpchk GET /health
    http-check expect status 200

backend angular_backend
    server angular1 <ANGULAR_UPSTREAM> check
    # Set to "disabled" state after 100% cutover:
    # server angular1 <ANGULAR_UPSTREAM> check disabled
```

### 2.3 Customer LB (generic)

```
# If the customer uses a proprietary load balancer or CDN (e.g., F5, AWS ALB, Cloudflare):
#
# 1. Create a weighted target group / origin pool with two backends:
#    - React: weight = <RAMP_PERCENTAGE>
#    - Angular: weight = <100 - RAMP_PERCENTAGE>
#
# 2. Attach both targets to the main listener rule for flights.example.com/*
#
# 3. Adjust weights per the schedule in Section 3.
#
# 4. At 100% cutover, remove the Angular target entirely.
```

---

## 3. Traffic Ramp Schedule

Total ramp duration: **72 hours** (3 days). Each step requires an explicit go/no-go decision from the owner listed in the table (the on-call engineer for steps 1-3, the tech lead for step 4).

| Step | Time (T+) | React % | Angular % | Duration before next step | Go/No-Go by |
|------|-----------|---------|-----------|---------------------------|-------------|
| 1 | T+0h | 5% | 95% | 12 hours | On-call engineer |
| 2 | T+12h | 25% | 75% | 12 hours | On-call engineer |
| 3 | T+24h | 50% | 50% | 24 hours | On-call engineer |
| 4 | T+48h | 100% | 0% | Start soak period | Tech lead |

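
As a rough illustration of the schedule table, the sketch below maps elapsed hours since T+0 to the React percentage the table calls for. The function name `expected_react_percent` is hypothetical and not part of any existing tooling; it also ignores the go/no-go gate, so it gives the earliest percentage the schedule allows, not the percentage that must be live.

```bash
# Hypothetical helper (not part of the runbook tooling): prints the React
# traffic percentage scheduled for a given number of whole hours since T+0.
expected_react_percent() {
  hours=$1
  if   [ "$hours" -lt 0 ];  then echo 0     # ramp not started
  elif [ "$hours" -lt 12 ]; then echo 5     # Step 1
  elif [ "$hours" -lt 24 ]; then echo 25    # Step 2
  elif [ "$hours" -lt 48 ]; then echo 50    # Step 3
  else                           echo 100   # Step 4, soak period begins
  fi
}
```

For example, `expected_react_percent 30` falls in the Step 3 window, so a proxy still serving 25% React at that point is behind schedule (possibly on purpose, after a no-go).
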
### Step Execution

For each step:

1. **Update proxy weights** per Section 2 templates
2. **Reload proxy config** (graceful reload, no dropped connections):
   ```bash
   # nginx
   nginx -t && nginx -s reload

   # HAProxy
   haproxy -c -f /etc/haproxy/haproxy.cfg && systemctl reload haproxy
   ```
3. **Verify the split** by checking access logs for both backends
4. **Monitor for the hold period** (Section 4)
5. **Record go/no-go decision** in the incident channel with timestamp

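
One way to carry out the "verify the split" step is to count requests per backend in a recent slice of the access log. The sketch below assumes each log line carries a token such as `backend=react_backend`, which requires a custom `log_format` that records the selected backend; nginx's default format does not include it. The function name `split_share` is hypothetical. A typical call would be `tail -n 10000 /var/log/nginx/access.log | split_share`. Because the split is hash-based, the observed shares only approximate the configured percentages.

```bash
# Reads access-log lines on stdin and prints the share of requests per backend.
# Assumption: every relevant line contains "backend=<upstream name>".
split_share() {
  awk '
    match($0, /backend=[a-z_]+/) {
      # Skip the 8 characters of "backend=" to get the upstream name
      name = substr($0, RSTART + 8, RLENGTH - 8)
      count[name]++
      total++
    }
    END {
      if (total == 0) { print "no tagged requests"; exit 1 }
      printf "react=%.1f%% angular=%.1f%%\n",
             100 * count["react_backend"] / total,
             100 * count["angular_backend"] / total
    }'
}
```
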
---

## 4. Monitoring Checklist During Ramp

Monitor these metrics continuously during each ramp step. Any breach triggers the hold or rollback procedure.

### 4.1 Error Rate

- [ ] Overall error rate (5xx) < 0.5% throughout ramp step
- [ ] No new error patterns in application logs
- [ ] No CSP violation spikes
- [ ] No unhandled promise rejections in client-side error tracking

### 4.2 Latency

- [ ] p50 latency within 10% of Angular baseline
- [ ] p95 latency < 500ms
- [ ] p99 latency < 2s
- [ ] Time to First Byte (TTFB) < 800ms

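
The monitoring dashboards remain the source of truth for these thresholds. For an ad hoc cross-check against raw samples, a nearest-rank percentile can be computed with standard tools. The sketch below assumes one latency value in milliseconds per line on stdin (for example, extracted from a log field such as nginx's `$request_time`, converted to ms); the function name `percentile` is hypothetical.

```bash
# Nearest-rank percentile: prints the value at rank ceil(p/100 * N)
# for an integer percentile p and N numeric samples (one per line on stdin).
percentile() {
  sort -n | awk -v p="$1" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((p * NR + 99) / 100)   # integer ceiling of p*NR/100
      if (rank < 1) rank = 1
      print v[rank]
    }'
}
```

With the samples 120, 80, 450, 300 and 95, `percentile 95` selects rank 5 of 5 and prints 450, which would pass the p95 < 500ms check.
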
### 4.3 Web Vitals (Core Web Vitals from RUM / Dynatrace)

- [ ] LCP (Largest Contentful Paint) < 2.5s
- [ ] INP (Interaction to Next Paint) < 200ms
- [ ] CLS (Cumulative Layout Shift) < 0.1

### 4.4 Search Console

- [ ] No indexing coverage drops
- [ ] No new crawl errors
- [ ] Structured data (JSON-LD) validated without errors
- [ ] No mobile usability regressions

### 4.5 SignalR Health

- [ ] SignalR hub connections stable (no reconnection storms)
- [ ] Real-time flight updates arriving within 2s of server push
- [ ] WebSocket upgrade success rate > 99%

### 4.6 Business Metrics

- [ ] Page views per session consistent with Angular baseline
- [ ] Bounce rate not elevated > 5% above baseline
- [ ] Analytics events firing (Yandex.Metrica, CTM, Variocube, Dynatrace)

### 4.7 Infrastructure

- [ ] CPU utilization < 70% on React nodes
- [ ] Memory utilization stable (no upward trend)
- [ ] No OOM kills
- [ ] Health check (`/health`) returning 200 on all React nodes

---

## 5. Rollback Procedure

**Target: < 1 minute to restore Angular traffic.**

### 5.1 Trigger Criteria

Roll back immediately if any of:
- Error rate exceeds 1% for more than 5 minutes
- p95 latency exceeds 2x Angular baseline for more than 10 minutes
- `/health` returns 503 on > 50% of React nodes
- SignalR connections drop and do not recover within 5 minutes
- Any S1 incident is declared

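
The three numeric triggers above can be expressed as a small check, which may help keep alert rules and manual decisions aligned. This is a sketch under stated assumptions: the function name `should_rollback` and its argument order are hypothetical, the inputs are expected to come from whatever monitoring system is in use, and the SignalR and S1 triggers are left out because they are not simple numeric thresholds.

```bash
# should_rollback ERR_PCT ERR_MINUTES P95_MS BASELINE_P95_MS P95_MINUTES UNHEALTHY_NODE_PCT
# Prints the first trigger that is met and exits 0, or prints "none" and exits 1.
should_rollback() {
  awk -v err="$1" -v err_min="$2" -v p95="$3" -v base="$4" \
      -v p95_min="$5" -v unhealthy="$6" 'BEGIN {
    if (err > 1 && err_min > 5)         { print "error rate";      exit 0 }
    if (p95 > 2 * base && p95_min > 10) { print "p95 latency";     exit 0 }
    if (unhealthy > 50)                 { print "unhealthy nodes"; exit 0 }
    print "none"; exit 1
  }'
}
```

For instance, `should_rollback 1.5 6 300 200 0 0` reports `error rate`, because a 1.5% error rate has persisted for more than 5 minutes.
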
### 5.2 Rollback Steps

1. **Flip proxy weights back to Angular:**
   ```bash
   # nginx: route all traffic to angular_backend in the split_clients block
   # (comment out the react_backend percentage line so "*" matches every request)
   nginx -t && nginx -s reload

   # HAProxy: set default_backend to angular_backend and remove any use_backend react_backend rule
   haproxy -c -f /etc/haproxy/haproxy.cfg && systemctl reload haproxy

   # Customer LB: set React target weight to 0, Angular to 100
   ```

2. **Verify Angular is serving traffic:**
   ```bash
   curl -sI https://flights.example.com/ | grep -iE '^(server|x-powered-by):'
   # Should show the ASP.NET/Angular response headers.
   # Behind nginx the upstream Server header is not passed through by default,
   # so X-Powered-By is the more reliable signal there.
   ```

3. **Confirm health:**
   - [ ] Error rate returning to baseline
   - [ ] p95 latency returning to baseline
   - [ ] `/health` returning 200

4. **Notify stakeholders** in the incident channel

5. **Post-mortem:** File within 24 hours. Determine root cause before attempting another ramp.

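
Step 2 of the rollback relies on reading response headers by eye. A scripted version of that check might look like the sketch below. The header values it matches (`X-Powered-By: ASP.NET`, `Server: Kestrel` or `Server: Microsoft-IIS`) are assumptions about what the ASP.NET host returns and must be adjusted to the real Angular backend; the function name `serving_backend` is hypothetical.

```bash
# Reads HTTP response headers on stdin (e.g. from `curl -sI`) and prints
# "angular" when an ASP.NET-style header is present, otherwise "unknown".
# Assumption: the matched header values identify the Angular/ASP.NET host.
serving_backend() {
  if grep -iqE '^(x-powered-by:[[:space:]]*asp\.net|server:[[:space:]]*(kestrel|microsoft-iis))'; then
    echo angular
  else
    echo unknown
  fi
}
```

Usage would be along the lines of `curl -sI https://flights.example.com/ | serving_backend`, repeated a few times, since a single response only shows which backend served that one request.
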
### 5.3 Post-Rollback

- Do NOT re-attempt ramp until the root cause is identified and fixed
- Re-run the full pre-cutover checklist (Section 1) before the next attempt
- Consider a smaller initial percentage (2% instead of 5%) for the retry

---

## 6. One-Week Soak Criteria

After reaching 100% React traffic (T+48h), maintain for **7 calendar days** before proceeding to decommission.

### 6.1 Soak Pass Criteria

All of these must be true for the entire 7-day period:

| Criterion | Threshold | How to verify |
|-----------|-----------|---------------|
| Angular traffic | **Zero hits** in Angular access logs | `grep -c "angular_backend" /var/log/nginx/access.log` returns 0 for each day (requires a log format that records the selected backend) |
| Error rate | < 0.1% (5xx responses) | Monitoring dashboard daily average |
| p95 latency | < 500ms | Monitoring dashboard daily p95 |
| p99 latency | < 2s | Monitoring dashboard daily p99 |
| Core Web Vitals | All "Good" threshold | RUM / Dynatrace daily report |
| Search Console | No coverage regression | Weekly Search Console report |
| SignalR health | No reconnection storms, < 0.1% dropped connections | SignalR hub metrics |
| Analytics parity | Event counts within 5% of pre-cutover Angular baseline | Analytics dashboard comparison |
| OOM / restart count | Zero unexpected container restarts | Container orchestrator logs |

### 6.2 Soak Failure

If any criterion is breached during the soak:

1. **Do NOT immediately roll back** unless it meets Section 5.1 trigger criteria
2. Investigate the root cause
3. If a fix is deployed, restart the 7-day soak clock
4. If the issue is transient and recovers within 1 hour, document but do not restart the clock

### 6.3 Soak Sign-Off

At the end of the 7-day soak, obtain written sign-off from:

- [ ] Tech lead
- [ ] QA lead
- [ ] Project manager
- [ ] Customer representative (if required by contract)

---

## 7. Angular Decommission Steps

Only proceed after soak sign-off (Section 6.3).

### 7.1 Git Tag

Tag the last commit that includes the Angular code:

```bash
# Identify the current HEAD (should be the release branch tip)
git log --oneline -1

# Create an annotated tag
git tag -a angular-final -m "Final state of Angular codebase before decommission"

# Push the tag to the remote
git push origin angular-final
```

### 7.2 Archive Branch

Create an archive branch preserving the full Angular codebase:

```bash
# Create archive branch from current HEAD
git checkout -b archive/angular-spa

# Push to remote
git push -u origin archive/angular-spa

# Return to the main development branch
git checkout plan/react-rewrite  # or main, depending on workflow
```

### 7.3 Remove Angular / ASP.NET Files

**WARNING: Only do this after customer approval. This runbook does NOT execute this step automatically.**

Files and directories to remove (review with the customer first):

```
ClientApp/                      # Angular SPA source
  src/
  angular.json
  karma.conf.js
  tsconfig*.json (Angular-specific)
  package.json (Angular)
  cypress/
  .storybook/
  ...

Aeroflot.Flights.Web.csproj     # ASP.NET host project
Startup.cs                      # ASP.NET startup
Program.cs                      # ASP.NET entry point
Controllers/                    # ASP.NET controllers (if any)
wwwroot/                        # Static assets served by ASP.NET
appsettings*.json               # ASP.NET configuration
*.sln                           # .NET solution file
```

Steps:

1. Create a new branch for the cleanup:
   ```bash
   git checkout -b chore/remove-angular-code
   ```

2. Remove the files listed above (verify the list with `git status` before committing)

3. Update `.gitignore` to remove Angular/ASP.NET-specific entries

4. Run the full verification suite:
   ```bash
   pnpm typecheck && pnpm lint && pnpm test && pnpm build:both
   ```

5. Commit and create a PR for review

6. After merge, verify production deployment is unaffected

### 7.4 Infrastructure Cleanup

After the Angular code is removed and deployed:

- [ ] Decommission Angular backend VMs / containers
- [ ] Remove Angular upstream from load balancer configuration
- [ ] Remove Angular-specific monitoring dashboards (or archive them)
- [ ] Update DNS records if Angular was on a separate subdomain
- [ ] Revoke Angular-specific secrets / certificates if any
- [ ] Update architecture diagrams and documentation

### 7.5 Post-Decommission Verification

- [ ] All routes return correct responses from React
- [ ] No references to Angular backend in proxy configs
- [ ] No orphaned infrastructure resources
- [ ] Documentation updated to reflect React-only architecture
- [ ] Incident runbook (`docs/superpowers/phase-1/runbook.md`) updated to remove Angular references

---

## Appendix A: Quick Reference Commands

```bash
# Check which backend is serving a request
# (case-insensitive match: header names are lowercase over HTTP/2)
curl -sI https://flights.example.com/ | grep -iE '^(server|x-powered-by):'

# Check Angular access log for hits (should be zero during soak)
# Only meaningful if the access log format records the selected backend
grep -c "angular_backend" /var/log/nginx/access.log

# Verify React health
curl -s https://flights.example.com/health | jq .

# Check MF manifest is accessible
curl -sI https://flights.example.com/mf-manifest.json

# Verify SSR is working (check for rendered HTML in response body)
curl -s https://flights.example.com/ru/onlineboard | grep -c "data-mf-expose"

# Check SignalR hub connectivity
curl -s https://flights.example.com/signalr/negotiate -X POST | jq .
```

## Appendix B: Contacts

| Role | Name | Contact |
|------|------|---------|
| Tech lead | TBD | TBD |
| On-call engineer | TBD | TBD |
| QA lead | TBD | TBD |
| SRE | TBD | TBD |
| Customer representative | TBD | TBD |
| Incident channel | TBD | TBD |

---

## Appendix C: Timeline Summary

```
Day 0     : Pre-cutover checklist complete, customer sign-off
Day 0-3   : Traffic ramp (5% -> 25% -> 50% -> 100%)
Day 3-10  : One-week soak at 100% React
Day 10    : Soak sign-off
Day 10+   : Angular decommission (git tag, archive, file removal)
Day 11+   : Infrastructure cleanup
```