deployment: bootstrap runbook + failure-path rehearsals
# pve-201 Deployment Runbook

This is the bootstrap procedure for hosting `https://ui-dashboard.gnerim.ru/` on pve-201, plus rehearsal recipes for the CI/CD pipeline failure paths. The full design rationale lives in `docs/superpowers/specs/2026-04-25-cicd-pipeline-design.md`.

## One-time setup

### 1. Routing pve-201 → TIM API (via webzavod)

**On webzavod (192.168.88.58)**, verify IP forwarding and the MASQUERADE rule:

```bash
sysctl net.ipv4.ip_forward                           # expect: 1
sudo iptables -t nat -L POSTROUTING -nv | grep ppp0  # expect: MASQUERADE rule
```

If either is missing:

```bash
# Enable forwarding persistently, then load it into the running kernel
echo 'net.ipv4.ip_forward=1' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
# NAT everything leaving via ppp0, and persist the rule across reboots
sudo iptables -t nat -A POSTROUTING -o ppp0 -j MASQUERADE
sudo apt install iptables-persistent
sudo netfilter-persistent save
```

**On pve-201**, add a persistent static route to TIM via webzavod:

```yaml
# /etc/netplan/01-routes.yaml (adjust the NIC name as needed)
network:
  version: 2
  ethernets:
    eth0:
      routes:
        - to: 172.18.0.0/16
          via: 192.168.88.58
```

```bash
sudo netplan apply
```
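
One way to confirm the route took effect is to ask the kernel which next hop it would use for a TIM address. The sketch below wraps `ip route get` in a small helper. `route_via_gateway` is a hypothetical name, and `172.18.0.121` is simply the TIM address used elsewhere in this runbook.

```bash
# route_via_gateway DEST GATEWAY
# Succeeds if the kernel would send traffic for DEST through GATEWAY.
route_via_gateway() {
  dest=$1
  gateway=$2
  # `ip route get` prints e.g. "172.18.0.121 via 192.168.88.58 dev eth0 src ..."
  # -F treats the gateway as a literal string, so the dots are not regex wildcards.
  ip route get "$dest" 2>/dev/null | grep -qF " via $gateway "
}

# Usage on pve-201:
#   route_via_gateway 172.18.0.121 192.168.88.58 && echo "route OK" || echo "route MISSING"
```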

**On pve-201**, pin the TIM hostname to its reachable address (TIM DNS returns multiple A records for it, one of which is dead):

```bash
echo '172.18.0.121 flights.test.aeroflot.ru' | sudo tee -a /etc/hosts
```
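
Because `tee -a` appends unconditionally, re-running the command leaves duplicate lines in `/etc/hosts`. A minimal idempotent variant could look like the sketch below. The helper `pin_host` and the optional `SUDO` variable are assumptions for illustration, not part of the repository.

```bash
# pin_host IP HOSTNAME [FILE]
# Appends "IP HOSTNAME" to FILE (default /etc/hosts) unless that exact line exists.
# Set SUDO=sudo when FILE is only writable by root (hypothetical convention).
pin_host() {
  entry="$1 $2"
  file=${3:-/etc/hosts}
  # -x: match the whole line, -F: literal string, -q: no output
  if grep -qxF "$entry" "$file"; then
    echo "already pinned: $entry"
  else
    printf '%s\n' "$entry" | ${SUDO:-} tee -a "$file" >/dev/null
    echo "pinned: $entry"
  fi
}

# Usage on pve-201:
#   SUDO=sudo pin_host 172.18.0.121 flights.test.aeroflot.ru
```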

**Smoke test:**

```bash
curl -v https://flights.test.aeroflot.ru/swagger/   # expect: 401 in <300ms
```

If this fails, fix routing/DNS before proceeding; nothing else will work.
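
The expectation in the comment (HTTP 401 in under 300 ms) can also be checked mechanically. The following is a sketch, not the pipeline's actual check: `smoke_verdict` and `smoke_test` are hypothetical helper names, and the 0.3 s threshold is taken from the comment above.

```bash
# smoke_verdict STATUS SECONDS
# Succeeds only for HTTP 401 answered in under 0.3 s.
smoke_verdict() {
  status=$1
  seconds=$2
  [ "$status" = "401" ] || { echo "FAIL: expected 401, got $status"; return 1; }
  # Shell arithmetic is integer-only, so the float comparison is done in awk.
  if awk -v t="$seconds" 'BEGIN { exit !(t < 0.3) }'; then
    echo "OK: 401 in ${seconds}s"
  else
    echo "FAIL: 401 but took ${seconds}s"
    return 1
  fi
}

# smoke_test URL
# Fetches URL and judges the result. %{http_code} and %{time_total} are
# curl --write-out variables.
smoke_test() {
  result=$(curl -s -o /dev/null -w '%{http_code} %{time_total}' "$1")
  # Intentional word splitting: "401 0.123" becomes two arguments.
  smoke_verdict $result
}

# Usage on pve-201:
#   smoke_test https://flights.test.aeroflot.ru/swagger/
```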

### 2. nginx vhost

```bash
sudo cp deployment/nginx/ui-dashboard.gnerim.ru.conf /etc/nginx/sites-available/
sudo ln -s /etc/nginx/sites-available/ui-dashboard.gnerim.ru.conf /etc/nginx/sites-enabled/
sudo mkdir -p /etc/nginx/htpasswd
sudo nginx -t
sudo systemctl reload nginx
```

The `htpasswd` file is created by `scripts/ci/install-htpasswd.sh` on first deploy.

### 3. Gitea runner setup

The runner user must be in the `docker` group (so it can talk to the Docker socket without sudo) and must be able to reach all upstream services:

```bash
sudo usermod -aG docker <runner-user>   # then restart the runner service so it picks up the new group
docker ps                               # must work without sudo for the runner user
```

Reachability checks the runner must pass:

```bash
curl -fsS https://git.gnerim.ru/                       # Gitea
curl -fsSI https://teamscore.gitlab.yandexcloud.net/   # GitLab
curl -fsSI http://jenkins.yc.devwebzavod.ru:8080/      # Jenkins (via static route)
curl -fsSI http://flights-ui.devwebzavod.ru/           # Customer URL (via static route)
```
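
Running the four probes one by one stops being convenient once they need repeating. As a sketch (the helper name `check_reachable` and the 10-second timeout are assumptions), a loop can probe every URL and report all failures at once instead of stopping at the first:

```bash
# check_reachable URL...
# Probes every URL, prints one line per result, returns non-zero if any failed.
check_reachable() {
  failed=0
  for url in "$@"; do
    # -f: fail on HTTP >= 400, -sS: quiet but still show errors,
    # --max-time: give up instead of hanging on a dead route
    if curl -fsS --max-time 10 -o /dev/null "$url"; then
      echo "ok   $url"
    else
      echo "FAIL $url"
      failed=1
    fi
  done
  return "$failed"
}

# Usage as the runner user:
#   check_reachable https://git.gnerim.ru/ \
#     https://teamscore.gitlab.yandexcloud.net/ \
#     http://jenkins.yc.devwebzavod.ru:8080/ \
#     http://flights-ui.devwebzavod.ru/
```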

### 4. GitLab Personal Access Token

GitLab → User Settings → Access Tokens → create a token with scopes `api` and `write_repository`. Store it as the Gitea Actions secret `GITLAB_PAT`.

### 5. Allow self-approve on GitLab project

GitLab → flights-front project → Settings → Merge requests → Approval rules → uncheck **"Prevent approval by author"**.

Verify by running (locally, after the PAT is in place):

```bash
GITLAB_PAT=<pat> ./scripts/ci/check-gitlab-project.sh
```

The script prints the numeric project ID (store it as the `GITLAB_PROJECT_ID` secret) and confirms that self-approval is allowed.

### 6. Jenkins remote trigger token

Jenkins → `Aeroflot2/Flights-Front-Dev` job → Configure → check **"Trigger builds remotely"** → set a token (e.g. `flights-cd-trigger`). Store it as `JENKINS_TRIGGER_TOKEN`.

Also: Jenkins → User → Configure → API Token → Add new token. Store the username as `JENKINS_USER` and the token as `JENKINS_API_TOKEN`.
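
For reference, a remote trigger of this kind is normally fired with an authenticated POST to the job's `build` endpoint. The sketch below assumes that `Aeroflot2` is a Jenkins folder (so the job URL nests as `/job/Aeroflot2/job/Flights-Front-Dev`) and that Jenkins is reachable at the address used in the reachability checks. The helper names are hypothetical, and the pipeline's real scripts may differ.

```bash
# jenkins_job_path "Folder/Job"  ->  "/job/Folder/job/Job"
# Jenkins nests every folder level under its own /job/ segment.
jenkins_job_path() {
  printf '%s\n' "$1" | sed 's#/#/job/#g; s#^#/job/#'
}

# trigger_jenkins_build BASE_URL JOB
# Fires the remote trigger using the three Jenkins secrets from the environment.
trigger_jenkins_build() {
  base=${1%/}                       # drop a trailing slash, if any
  path=$(jenkins_job_path "$2")
  curl -fsS -X POST \
    -u "${JENKINS_USER}:${JENKINS_API_TOKEN}" \
    "${base}${path}/build?token=${JENKINS_TRIGGER_TOKEN}"
}

# Usage (with the three variables exported):
#   trigger_jenkins_build http://jenkins.yc.devwebzavod.ru:8080 Aeroflot2/Flights-Front-Dev
```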

### 7. Telegram bot

Use an existing bot or create one via @BotFather. To get the chat ID, send the bot a message and query `https://api.telegram.org/bot<TOKEN>/getUpdates`; the ID is the `message.chat.id` field of the returned update. Store the bot token as `TELEGRAM_BOT_TOKEN` and the chat ID as `TELEGRAM_CHAT_ID`.
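
Once both values are known, a test message confirms them. This sketch uses the Bot API `sendMessage` method. `telegram_send` is a hypothetical helper, and it assumes both variables are exported in the current shell.

```bash
# telegram_send TEXT
# Posts TEXT to the configured chat. --data-urlencode makes curl send a POST
# and takes care of escaping spaces, emoji and other special characters.
telegram_send() {
  curl -fsS --max-time 10 -o /dev/null \
    --data-urlencode "chat_id=${TELEGRAM_CHAT_ID}" \
    --data-urlencode "text=$1" \
    "https://api.telegram.org/bot${TELEGRAM_BOT_TOKEN}/sendMessage"
}

# Usage:
#   telegram_send "runbook test from pve-201" && echo "delivered"
```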

### 8. Gitea Actions secrets summary

Repo → Settings → Actions → Secrets. Set all of:

| Secret | Purpose |
|---|---|
| `BASIC_AUTH_USER`, `BASIC_AUTH_PASS` | nginx htpasswd credentials |
| `MAP_TILE_URL` | Map tile URL template (default `/map/api/tile/{z}/{x}/{y}.jpeg`) |
| `API_BASE_URL` | API base path (default `/api`) |
| `GITLAB_PAT`, `GITLAB_PROJECT_ID` | GitLab MR API |
| `JENKINS_USER`, `JENKINS_API_TOKEN`, `JENKINS_TRIGGER_TOKEN` | Jenkins API |
| `TELEGRAM_BOT_TOKEN`, `TELEGRAM_CHAT_ID` | Notifications |
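
A missing secret tends to surface late, as an obscure failure in some later step. A preflight check along these lines can fail fast instead. It is a sketch that assumes the workflow exposes each secret as an environment variable of the same name; `missing_vars` and `REQUIRED_SECRETS` are hypothetical names.

```bash
# missing_vars NAME...
# Prints each NAME whose variable is unset or empty, one per line.
missing_vars() {
  for name in "$@"; do
    # Indirect lookup via eval; acceptable here because the names are fixed literals.
    eval "value=\${$name:-}"
    [ -n "$value" ] || echo "$name"
  done
}

# The secrets listed in the table above.
REQUIRED_SECRETS="BASIC_AUTH_USER BASIC_AUTH_PASS MAP_TILE_URL API_BASE_URL \
GITLAB_PAT GITLAB_PROJECT_ID JENKINS_USER JENKINS_API_TOKEN \
JENKINS_TRIGGER_TOKEN TELEGRAM_BOT_TOKEN TELEGRAM_CHAT_ID"

# Usage in an early workflow step (word splitting of the list is intentional):
#   missing=$(missing_vars $REQUIRED_SECRETS)
#   [ -z "$missing" ] || { echo "missing secrets: $missing"; exit 1; }
```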

## Verifying failure paths

Run at least the rollback and "release blocked" rehearsals once before declaring the pipeline production-grade.

### A: e2e fail → rollback

Push a commit that adds `console.error('rehearsal')` somewhere that runs on every page (e.g. `src/routes/layout.tsx`). Workflow A runs, e2e fails on the console gate, and the rollback to `:previous` triggers. Verify:

- Telegram message: `❌ ci-deploy FAILED at step "Run Playwright e2e" — rolled back to <prev-sha>`
- `https://ui-dashboard.gnerim.ru/` still serves the previous version (check the page or `docker inspect flights-web`).

Revert the rehearsal commit when done.

### A: rollback itself fails

```bash
ssh pve-201 'docker rmi flights-web:previous'
```

Then push a commit that fails e2e. The rollback step finds no `:previous` image and bails. Verify:

- Telegram message: `🔥 ci-deploy ROLLBACK FAILED — site is DOWN`
- `https://ui-dashboard.gnerim.ru/` returns 502.
- Manual recovery: `ssh pve-201 'docker run -d --name flights-web -p 127.0.0.1:8081:8080 flights-web:<known-good-sha>'`.

### B: blocked on A not green

Trigger Workflow B (manually or by tag) for a SHA that has no green Workflow A run. Verify:

- Telegram message: `⚠️ release blocked — workflow ci-deploy is not green for <sha>`
- B exits early; nothing changes in GitLab.

### B: Jenkins poll timeout

Set `JENKINS_TIMEOUT=30` as a secret override and trigger B. Polling should give up after 30 s and report a timeout.
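
The behaviour being rehearsed is a poll loop with a deadline. The pipeline's actual implementation is not shown in this runbook; a generic sketch of the pattern, with the hypothetical helper `poll_until` and an exit code of 124 borrowed from `timeout(1)`, might look like this:

```bash
# poll_until TIMEOUT INTERVAL CMD...
# Reruns CMD every INTERVAL seconds until it succeeds (returns 0) or
# TIMEOUT seconds have elapsed (returns 124).
poll_until() {
  timeout_s=$1
  interval_s=$2
  shift 2
  deadline=$(( $(date +%s) + timeout_s ))
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out after ${timeout_s}s" >&2
      return 124
    fi
    sleep "$interval_s"
  done
}

# Usage, where build_finished is a hypothetical command that succeeds
# once the Jenkins build has completed:
#   poll_until "$JENKINS_TIMEOUT" 5 build_finished
```

With `JENKINS_TIMEOUT=30`, a loop of this shape stops retrying on the first failed attempt made at or after the 30-second mark, which is the outcome the rehearsal looks for.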

## Manual recovery scenarios

### Workflow B failed at steps 12-13 (Jenkins): MR merged but customer site stale

GitLab is already at the new commit, but Jenkins did not deploy it. Recover with one of:

1. Open the Jenkins UI and click "Build Now" on the same job, or
2. Push a new commit to GitLab to re-trigger Jenkins polling (if it is set up that way), or
3. Re-run Workflow B from a green Workflow A, but only if you also pushed new code; otherwise B will sync a no-op and skip.

### Container running but nginx returns 502

Check the bind:

```bash
ssh pve-201
docker ps --filter name=flights-web
curl -v http://127.0.0.1:8081/       # should return 200 (or whatever the SSR root returns)
sudo nginx -t && sudo systemctl reload nginx
```

If the container died, the restart policy `unless-stopped` should bring it back. If it does not:

```bash
docker logs flights-web --tail 200
# Remove the dead container first; otherwise `docker run` fails because the name is taken
docker rm -f flights-web
docker run -d --name flights-web -p 127.0.0.1:8081:8080 flights-web:current
```
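
The two cases in this section (container gone, or container up but not answering) can be told apart with a short triage helper. This is a sketch under the assumptions already used here: container name `flights-web` and loopback bind `127.0.0.1:8081`. `diagnose_502` is a hypothetical name.

```bash
# diagnose_502 [CONTAINER] [URL]
# Reports which hop behind nginx is broken.
# Returns 1 if the container is not running, 2 if it is up but not answering.
diagnose_502() {
  name=${1:-flights-web}
  url=${2:-http://127.0.0.1:8081/}
  # `docker ps -q` prints IDs of running containers only, so empty output means "down".
  if [ -z "$(docker ps -q --filter "name=$name")" ]; then
    echo "container $name is not running"
    return 1
  fi
  # -f makes curl fail on HTTP >= 400 as well as on connection errors.
  if ! curl -fsS --max-time 5 -o /dev/null "$url"; then
    echo "container is up but $url does not answer"
    return 2
  fi
  echo "upstream healthy; check the nginx vhost and reload nginx"
}

# Usage on pve-201:
#   diagnose_502
```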