deployment: bootstrap runbook + failure-path rehearsals

2026-04-25 02:12:25 +03:00
parent 0508f0f33d
commit 1fbd8ef23f
@@ -0,0 +1,188 @@
# pve-201 Deployment Runbook
This is the bootstrap procedure for hosting `https://ui-dashboard.gnerim.ru/` on pve-201, plus rehearsal recipes for the CI/CD pipeline failure paths. The full design rationale lives in `docs/superpowers/specs/2026-04-25-cicd-pipeline-design.md`.
## One-time setup
### 1. Routing pve-201 → TIM API (via webzavod)
**On webzavod (192.168.88.58)** — verify IP forwarding and MASQUERADE:
```bash
sysctl net.ipv4.ip_forward # expect: 1
sudo iptables -t nat -L POSTROUTING -nv | grep ppp0 # expect: MASQUERADE rule
```
If missing:
```bash
echo 'net.ipv4.ip_forward=1' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
sudo iptables -t nat -A POSTROUTING -o ppp0 -j MASQUERADE
sudo apt install iptables-persistent
sudo netfilter-persistent save
```
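Re-running `iptables -A` appends a duplicate rule each time. The following is a minimal sketch of an idempotent variant, assuming it runs as root. `ensure_masquerade` is a hypothetical helper name, not part of this repo; `-C` is the standard iptables rule-check flag.

```bash
# Hypothetical helper: add the MASQUERADE rule only if it is not already present.
ensure_masquerade() {
  local iface="$1"
  # -C exits 0 when an identical rule already exists in the chain
  if iptables -t nat -C POSTROUTING -o "$iface" -j MASQUERADE 2>/dev/null; then
    echo "present"
  else
    iptables -t nat -A POSTROUTING -o "$iface" -j MASQUERADE && echo "added"
  fi
}

# Usage (as root on webzavod):
#   ensure_masquerade ppp0 && netfilter-persistent save
```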
**On pve-201** — add a persistent static route to TIM via webzavod:
```yaml
# /etc/netplan/01-routes.yaml — adjust NIC name as needed
network:
  version: 2
  ethernets:
    eth0:
      routes:
        - to: 172.18.0.0/16
          via: 192.168.88.58
```bash
sudo netplan apply
```
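To confirm the route took effect, one option is to ask the kernel which gateway it would use for a TIM address. This is a sketch; `route_gateway` is a hypothetical helper that only parses the output of `ip route get`.

```bash
# Hypothetical helper: print the gateway (the word after "via") from
# `ip route get` output read on stdin. Prints nothing for on-link routes.
route_gateway() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "via") { print $(i + 1); exit } }'
}

# Usage on pve-201 (expect 192.168.88.58):
#   ip route get 172.18.0.121 | route_gateway
```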
**On pve-201** — pin TIM hostnames to reachable A records (TIM DNS returns duplicate As, one of which is dead):
```bash
echo '172.18.0.121 flights.test.aeroflot.ru' | sudo tee -a /etc/hosts
```
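Because `tee -a` appends unconditionally, repeating the command above leaves duplicate entries in `/etc/hosts`. A possible idempotent form is sketched below; `pin_host` is a hypothetical helper and needs root to write the real file.

```bash
# Hypothetical helper: append "<ip> <name>" to a hosts file only if that
# exact line is not already there.
pin_host() {
  local ip="$1" name="$2" file="${3:-/etc/hosts}"
  # -F: fixed string, -x: match the whole line, -q: quiet
  grep -qxF "$ip $name" "$file" 2>/dev/null || echo "$ip $name" >> "$file"
}

# Usage (as root on pve-201):
#   pin_host 172.18.0.121 flights.test.aeroflot.ru
```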
**Smoke test:**
```bash
curl -v https://flights.test.aeroflot.ru/swagger/ # expect: 401 in <300ms
```
If this fails, fix routing/DNS before proceeding — nothing else will work.
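The same check can be scripted so that it fails loudly instead of relying on reading the `curl -v` output. This is a sketch under the thresholds stated above (HTTP 401, under 300 ms); `smoke_ok` is a hypothetical helper, and `%{http_code}` and `%{time_total}` are standard `curl -w` variables.

```bash
# Hypothetical helper: succeed only for HTTP 401 returned in under 0.3 s.
smoke_ok() {
  local code="$1" seconds="$2"
  [ "$code" = "401" ] && awk -v t="$seconds" 'BEGIN { exit !(t < 0.3) }'
}

# Usage on pve-201:
#   read -r code seconds < <(curl -s -o /dev/null \
#     -w '%{http_code} %{time_total}' https://flights.test.aeroflot.ru/swagger/)
#   smoke_ok "$code" "$seconds" && echo "smoke OK" || echo "smoke FAILED ($code, ${seconds}s)"
```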
### 2. nginx vhost
```bash
sudo cp deployment/nginx/ui-dashboard.gnerim.ru.conf /etc/nginx/sites-available/
sudo ln -s /etc/nginx/sites-available/ui-dashboard.gnerim.ru.conf /etc/nginx/sites-enabled/
sudo mkdir -p /etc/nginx/htpasswd
sudo nginx -t
sudo systemctl reload nginx
```
The `htpasswd` file is created by `scripts/ci/install-htpasswd.sh` on first deploy.
### 3. Gitea runner setup
The runner must be in the `docker` group (so it can talk to the Docker socket without sudo) and reach all upstream services:
```bash
sudo usermod -aG docker <runner-user> # then restart the runner service so the new group membership applies
docker ps # must work without sudo for the runner user
```
Reachability checks the runner must pass:
```bash
curl -fsS https://git.gnerim.ru/ # Gitea
curl -fsSI https://teamscore.gitlab.yandexcloud.net/ # GitLab
curl -fsSI http://jenkins.yc.devwebzavod.ru:8080/ # Jenkins (via static route)
curl -fsSI http://flights-ui.devwebzavod.ru/ # Customer URL (via static route)
```
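When several endpoints are checked in a row it is easy to miss one failure. The sketch below loops over the URLs and returns non-zero if any probe fails. `check_reachable` is a hypothetical helper; unlike the commands above it sends a HEAD request to every URL, and the 10-second `--max-time` is an assumed value.

```bash
# Hypothetical helper: probe each URL, report per-URL status, and
# return 1 if at least one is unreachable.
check_reachable() {
  local failed=0 url
  for url in "$@"; do
    if curl -fsSI --max-time 10 "$url" >/dev/null 2>&1; then
      echo "ok   $url"
    else
      echo "FAIL $url"
      failed=1
    fi
  done
  return "$failed"
}

# Usage (as the runner user):
#   check_reachable https://git.gnerim.ru/ \
#     https://teamscore.gitlab.yandexcloud.net/ \
#     http://jenkins.yc.devwebzavod.ru:8080/ \
#     http://flights-ui.devwebzavod.ru/
```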
### 4. GitLab Personal Access Token
GitLab → User Settings → Access Tokens → create a token with scopes `api` and `write_repository`. Store it as the Gitea Actions secret `GITLAB_PAT`.
### 5. Allow self-approve on GitLab project
GitLab → flights-front project → Settings → Merge requests → Approval rules → uncheck **"Prevent approval by author"**.
Verify by running (locally, after PAT is in place):
```bash
GITLAB_PAT=<pat> ./scripts/ci/check-gitlab-project.sh
```
It prints the numeric project ID (store as `GITLAB_PROJECT_ID` secret) and confirms self-approve is allowed.
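For reference, the self-approve setting can also be read directly from the GitLab API. The sketch below is not the contents of `check-gitlab-project.sh`. It assumes the project-level approvals endpoint (`/api/v4/projects/:id/approvals`) and its `merge_requests_author_approval` field, and `self_approve_allowed` is a hypothetical helper name.

```bash
# Hypothetical helper: read the approvals JSON on stdin and succeed when
# authors are allowed to approve their own merge requests.
self_approve_allowed() {
  grep -Eq '"merge_requests_author_approval"[[:space:]]*:[[:space:]]*true'
}

# Usage (assumes GITLAB_PAT and GITLAB_PROJECT_ID are exported):
#   curl -fsS -H "PRIVATE-TOKEN: $GITLAB_PAT" \
#     "https://teamscore.gitlab.yandexcloud.net/api/v4/projects/$GITLAB_PROJECT_ID/approvals" \
#     | self_approve_allowed && echo "self-approve allowed"
```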
### 6. Jenkins remote trigger token
Jenkins → `Aeroflot2/Flights-Front-Dev` job → Configure → check **"Trigger builds remotely"** → set token (e.g. `flights-cd-trigger`). Store as `JENKINS_TRIGGER_TOKEN`.
Also: Jenkins → User → Configure → API Token → Add new token. Store username as `JENKINS_USER`, token as `JENKINS_API_TOKEN`.
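To check the three Jenkins secrets together, the remote trigger can be exercised by hand. The sketch below builds the trigger URL, assuming `Aeroflot2` is a Jenkins folder (so the path nests as `job/<folder>/job/<name>`) and that Jenkins is served at the address used in the reachability checks. `jenkins_build_url` is a hypothetical helper.

```bash
# Hypothetical helper: build the remote-trigger URL for a (possibly
# folder-nested) job such as "Aeroflot2/Flights-Front-Dev".
jenkins_build_url() {
  local base="${1%/}" job="$2" token="$3"
  local path
  # Each "/" in the job name becomes "/job/" in the URL path
  path="job/$(printf '%s' "$job" | sed 's#/#/job/#g')"
  echo "$base/$path/build?token=$token"
}

# Usage (assumes the secrets are exported in the shell):
#   curl -fsS -X POST -u "$JENKINS_USER:$JENKINS_API_TOKEN" \
#     "$(jenkins_build_url http://jenkins.yc.devwebzavod.ru:8080 \
#        Aeroflot2/Flights-Front-Dev "$JENKINS_TRIGGER_TOKEN")"
```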
### 7. Telegram bot
Use an existing bot or create one via @BotFather. Get the `chat_id` by sending the bot a message and then querying `https://api.telegram.org/bot<TOKEN>/getUpdates`. Store the values as `TELEGRAM_BOT_TOKEN` and `TELEGRAM_CHAT_ID`.
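Before saving the two values as secrets, it may be worth confirming that the token and chat ID work as a pair by sending a test message. This is a minimal sketch using the Bot API `sendMessage` method; `tg_send` is a hypothetical helper, not the notification code the workflows use.

```bash
# Hypothetical helper: send a plain-text message to the configured chat.
# Expects TELEGRAM_BOT_TOKEN and TELEGRAM_CHAT_ID in the environment.
tg_send() {
  curl -fsS -X POST "https://api.telegram.org/bot${TELEGRAM_BOT_TOKEN}/sendMessage" \
    --data-urlencode "chat_id=${TELEGRAM_CHAT_ID}" \
    --data-urlencode "text=$1"
}

# Usage:
#   TELEGRAM_BOT_TOKEN=<token> TELEGRAM_CHAT_ID=<chat-id> tg_send "runbook test"
```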
### 8. Gitea Actions secrets summary
Repo → Settings → Actions → Secrets — set all of:
| Secret | Purpose |
|---|---|
| `BASIC_AUTH_USER`, `BASIC_AUTH_PASS` | nginx htpasswd |
| `MAP_TILE_URL` | Default `/map/api/tile/{z}/{x}/{y}.jpeg` |
| `API_BASE_URL` | Default `/api` |
| `GITLAB_PAT`, `GITLAB_PROJECT_ID` | GitLab MR API |
| `JENKINS_USER`, `JENKINS_API_TOKEN`, `JENKINS_TRIGGER_TOKEN` | Jenkins API |
| `TELEGRAM_BOT_TOKEN`, `TELEGRAM_CHAT_ID` | Notifications |
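A missing secret usually surfaces as an obscure failure deep in a workflow. One way to catch it early is a preflight step that lists unset variables. This sketch assumes the secrets are exposed to the step as environment variables of the same names; `missing_secrets` is a hypothetical helper and relies on bash indirect expansion.

```bash
# Hypothetical helper: print the name of every listed variable that is
# unset or empty in the current environment.
missing_secrets() {
  local name
  for name in "$@"; do
    # ${!name} expands to the value of the variable whose name is in $name
    [ -n "${!name:-}" ] || echo "$name"
  done
}

# Usage in a workflow step (variables with documented defaults omitted):
#   missing=$(missing_secrets BASIC_AUTH_USER BASIC_AUTH_PASS GITLAB_PAT \
#     GITLAB_PROJECT_ID JENKINS_USER JENKINS_API_TOKEN JENKINS_TRIGGER_TOKEN \
#     TELEGRAM_BOT_TOKEN TELEGRAM_CHAT_ID)
#   [ -z "$missing" ] || { echo "missing secrets: $missing" >&2; exit 1; }
```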
## Verifying failure paths
Run at least the rollback and "release blocked" rehearsals once before declaring the pipeline production-grade.
### A: e2e fail → rollback
Push a commit that adds `console.error('rehearsal')` somewhere that runs on every page (e.g. `src/routes/layout.tsx`). Workflow A runs, e2e fails on the console-gate, rollback to `:previous` triggers. Verify:
- Telegram message: `❌ ci-deploy FAILED at step "Run Playwright e2e" — rolled back to <prev-sha>`
- `https://ui-dashboard.gnerim.ru/` still serves the previous version (check the page or `docker inspect flights-web`).
Revert the rehearsal commit when done.
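The `docker inspect` check can be made precise by comparing image IDs rather than eyeballing the output. The sketch assumes the `flights-web:previous` tag still points at the image that was rolled back to; `serving_image` is a hypothetical helper.

```bash
# Hypothetical helper: succeed when the container runs exactly the image
# that the given tag points to. Returns 2 if either lookup fails.
serving_image() {
  local container="$1" tag="$2" running wanted
  running=$(docker inspect --format '{{.Image}}' "$container") || return 2
  wanted=$(docker image inspect --format '{{.Id}}' "$tag") || return 2
  [ "$running" = "$wanted" ]
}

# Usage on pve-201:
#   serving_image flights-web flights-web:previous && echo "rollback image is live"
```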
### A: rollback itself fails
```bash
ssh pve-201 'docker rmi flights-web:previous'
```
Then push a commit that fails e2e. Rollback step finds no `:previous` and bails. Verify:
- Telegram message: `🔥 ci-deploy ROLLBACK FAILED — site is DOWN`
- `https://ui-dashboard.gnerim.ru/` returns 502.
- Manual recovery: `ssh pve-201 'docker run -d --name flights-web -p 127.0.0.1:8081:8080 flights-web:<known-good-sha>'`.
### B: blocked on A not green
Trigger Workflow B (manual or tag) for a SHA that has no green Workflow A run. Verify:
- Telegram message: `⚠️ release blocked — workflow ci-deploy is not green for <sha>`
- B exits early; nothing changes in GitLab.
### B: Jenkins poll timeout
Set `JENKINS_TIMEOUT=30` as a secret override and trigger B. Polling should give up after 30s and report timeout.
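For orientation, a poll-with-deadline loop of the kind this rehearsal exercises might look like the sketch below. It is not the workflow's actual implementation. `poll_until` is a hypothetical helper, the timeout is taken to be in seconds, and the exit code 124 is borrowed from the `timeout(1)` convention.

```bash
# Hypothetical helper: re-run a command every <interval> seconds until it
# succeeds or <timeout> seconds have elapsed. Returns 124 on timeout.
poll_until() {
  local timeout="$1" interval="$2"
  shift 2
  # SECONDS is a bash builtin counting seconds since the shell started
  local deadline=$((SECONDS + timeout))
  while ! "$@"; do
    if [ "$SECONDS" -ge "$deadline" ]; then
      echo "timeout after ${timeout}s" >&2
      return 124
    fi
    sleep "$interval"
  done
}

# Usage (build_finished is a placeholder for the real status check):
#   poll_until "${JENKINS_TIMEOUT:-600}" 10 build_finished
```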
## Manual recovery scenarios
### Workflow B failed at steps 12-13 (Jenkins): MR merged but customer site stale
GitLab is already at the new commit; Jenkins didn't deploy. Recovery:
1. Open Jenkins UI → click "Build Now" on the same job, or
2. Push a new commit to GitLab to re-trigger the build (this works only if the Jenkins job is configured to poll the repository), or
3. Re-run Workflow B from a green Workflow A — but only if you also pushed new code; otherwise B will sync a no-op and skip.
### Container running but nginx returns 502
Check the bind:
```bash
ssh pve-201
docker ps --filter name=flights-web
curl -v http://127.0.0.1:8081/ # should return 200 (or whatever the SSR root returns)
sudo nginx -t && sudo systemctl reload nginx
```
If the container died, the `unless-stopped` restart policy should bring it back. If it does not, inspect the logs and recreate the container:
```bash
docker logs flights-web --tail 200
docker rm -f flights-web # remove the dead container so its name is free again
docker run -d --name flights-web -p 127.0.0.1:8081:8080 flights-web:current
```