From 1fec2bb9b17e5a27b2643244b3254a23e5cec554 Mon Sep 17 00:00:00 2001 From: gnezim Date: Sat, 25 Apr 2026 01:34:43 +0300 Subject: [PATCH] spec: design Gitea Actions CI/CD pipeline to pve-201, GitLab MR, Jenkins Captures the agreed two-workflow shape (push-deploy + manual release) so the implementation plan has an unambiguous source of truth before touching scripts, Dockerfile build-args, or nginx config. --- .../specs/2026-04-25-cicd-pipeline-design.md | 487 ++++++++++++++++++ 1 file changed, 487 insertions(+) create mode 100644 docs/superpowers/specs/2026-04-25-cicd-pipeline-design.md diff --git a/docs/superpowers/specs/2026-04-25-cicd-pipeline-design.md b/docs/superpowers/specs/2026-04-25-cicd-pipeline-design.md new file mode 100644 index 00000000..3e1ea0b0 --- /dev/null +++ b/docs/superpowers/specs/2026-04-25-cicd-pipeline-design.md @@ -0,0 +1,487 @@ +# CI/CD Pipeline Design — Gitea Actions → pve-201 → GitLab → Jenkins + +**Status:** Approved design, ready for implementation plan. +**Date:** 2026-04-25 +**Author:** gnezim (with Claude) + +## Summary + +A two-workflow Gitea Actions pipeline that builds and deploys this React SSR app to your own infrastructure (pve-201, behind `https://ui-dashboard.gnerim.ru/`) on every push, then — on explicit trigger — syncs sources to the customer's GitLab, opens and auto-merges an MR, fires the Jenkins build, and runs end-to-end tests against the customer's dev URL. All notifications via Telegram. + +Two workflow files: + +- **`ci-deploy.yml`** — push-triggered. Build → unit tests → Docker build → swap container → e2e on `ui-dashboard.gnerim.ru`. Auto-rollback to previous image on any post-build failure. +- **`release.yml`** — manually triggered (UI button or `release-*` git tag). Verifies `ci-deploy` is green for the same SHA, then GitLab sync → MR → approve → merge → Jenkins trigger → poll → e2e on `flights-ui.devwebzavod.ru`. Halts on any failure. 
+ +The Gitea runner runs on pve-201 itself, with Docker socket access — no SSH, no registry hop. Image-versioning uses `flights-web:` plus moving aliases `:current` and `:previous` for one-step rollback. Future migration to a private registry is a config change, not a refactor. + +## Architecture + +``` +┌──────────────────┐ push to main ┌─────────────────────────────────┐ +│ dev pc (you) │ ─────────────────────► │ git.gnerim.ru (Gitea server) │ +└──────────────────┘ manual / tag push └────────────────┬────────────────┘ + │ webhook + ▼ + ┌──────────────────────────────┐ + │ Gitea Actions runner │ + │ on pve-201 (Docker socket) │ + └──┬───────────────────────────┘ + │ + ┌──────────────────────────────┼──────────────────────────────┐ + │ │ │ + ▼ on push ▼ on tag/manual ▼ Telegram + ┌────────────────────┐ ┌────────────────────┐ ┌────────────────────┐ + │ Workflow A │ │ Workflow B │ │ Notify on every │ + │ ci-deploy.yml │ │ release.yml │ │ stage start / end │ + │ │ │ │ │ / failure │ + │ build & test │ │ verify A is green │ └────────────────────┘ + │ ↓ │ │ ↓ │ + │ docker build :SHA │ │ sync → GitLab MR │ + │ ↓ │ │ ↓ │ + │ swap container │ │ approve & merge │ + │ ↓ │ │ ↓ │ + │ smoke /health │ │ trigger Jenkins │ + │ ↓ │ │ ↓ │ + │ playwright e2e on │ │ poll until SUCCESS │ + │ ui-dashboard │ │ ↓ │ + │ ↓ on fail │ │ playwright e2e on │ + │ rollback to │ │ flights-ui.devweb │ + │ :previous │ │ ↓ on fail │ + └────────────────────┘ │ halt + dump logs │ + └────────────────────┘ +``` + +### Key invariants + +- **Workflow A is the gatekeeper.** Workflow B always queries Gitea for the latest A run for the same commit SHA; if not green, B refuses to start. +- **Image tag aliases.** Every build is tagged `flights-web:`. Two moving aliases on the host: `flights-web:current` (live container source) and `flights-web:previous` (rollback target). Pruning keeps the last 5 SHA tags + the two aliases. +- **Container is named `flights-web`** (singleton). 
Restart sequence: `docker stop flights-web && docker rm flights-web && docker run -d --name flights-web --restart unless-stopped -p 127.0.0.1:8081:8080 flights-web:current`. +- **Nginx on pve-201** terminates TLS for `ui-dashboard.gnerim.ru` and proxies to `127.0.0.1:8081`. +- **All four major stages emit Telegram messages** (start / pass / fail). Failure messages include log tail and a clickable link to the Gitea run. +- **`.github/workflows/` files are deleted** in PR #2 (not the same PR that adds the new workflows; see "Layer 2 — staged rollout" under "Testing the pipeline itself"). + +### Architectural choices already made + +- **Runner-on-host with direct Docker socket** (vs SSH-back-to-localhost or local registry) — least moving parts; runner is in the `docker` group on pve-201. +- **Two independent workflow files** (vs one file with conditional jobs, vs shared composite action) — short and focused beats clever. +- **Manual trigger for Workflow B + git tag fallback** (vs commit-message keyword) — explicit; can't ship to customer by accident. + +## Routing, build-args, and access control + +The build-args change the most across this design — they go from absolute (TIM hostnames) to relative paths, which moves the burden onto nginx on pve-201. + +### Routing pve-201 → TIM API + +The customer API at `https://flights.test.aeroflot.ru/api/*` is reachable only through the corp VPN. `webzavod` (192.168.88.58) on the same LAN as pve-201 (192.168.88.167) already has a working L2TP/IPsec tunnel to TIM via `ppp0`. The cleanest way to make pve-201 reach TIM is a static route through webzavod, which leverages the existing VPN setup. + +**One-time host setup (manual, not in workflows):** + +1. 
**On webzavod** — verify IP forwarding and MASQUERADE on `ppp0`: + + ```bash + sysctl net.ipv4.ip_forward # expect: 1 + sudo iptables -t nat -L POSTROUTING -nv | grep ppp0 # expect: MASQUERADE + ``` + + If not set: + + ```bash + echo 'net.ipv4.ip_forward=1' | sudo tee -a /etc/sysctl.conf + sudo sysctl -p + sudo iptables -t nat -A POSTROUTING -o ppp0 -j MASQUERADE + sudo apt install iptables-persistent && sudo netfilter-persistent save + ``` + +2. **On pve-201** — add a persistent static route to TIM via webzavod: + + ```yaml + # /etc/netplan/01-routes.yaml + network: + version: 2 + ethernets: + eth0: # rename to actual NIC name + routes: + - to: 172.18.0.0/16 + via: 192.168.88.58 + ``` + + ```bash + sudo netplan apply + ``` + +3. **On pve-201** — pin TIM hostnames to reachable A records (mirrors the duplicate-DNS workaround documented in `~/_projects/gnezim/knowledge/projects/work/tim/ui-dashboard/mac-via-windows-jump.md`): + + ```bash + # /etc/hosts + 172.18.0.121 flights.test.aeroflot.ru + ``` + +4. **Smoke test from pve-201:** + + ```bash + curl -v https://flights.test.aeroflot.ru/swagger/ # expect: 401 in ~70ms + ``` + + Failure here means routing is broken — fix before any pipeline run. 
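The smoke test in step 4 can be wrapped in a small preflight that fails loudly before any pipeline run. The following is a minimal sketch, assuming bash and curl on pve-201; the function names `tim_status` and `tim_preflight` are hypothetical and not part of the spec, and the expected `401` is taken from the smoke test above.

```shell
#!/usr/bin/env bash
# Hypothetical preflight for the pve-201 -> TIM route; not one of the spec's scripts.

# Print the HTTP status code for a URL, or 000 when the connection fails.
tim_status() {
  local url="$1" code
  code="$(curl -s -o /dev/null -m 10 -w '%{http_code}' "$url" 2>/dev/null)" || code="000"
  printf '%s\n' "${code:-000}"
}

# Succeed only when the status matches what the smoke test expects (401 by default).
tim_preflight() {
  local url="${1:-https://flights.test.aeroflot.ru/swagger/}" want="${2:-401}" got
  got="$(tim_status "$url")"
  if [ "$got" = "$want" ]; then
    echo "ok: $url returned $got"
  else
    echo "FAIL: $url returned $got, expected $want (check route, /etc/hosts, VPN)" >&2
    return 1
  fi
}
```

Calling `tim_preflight` with no arguments checks the swagger URL from step 4; a non-zero exit means the route, the hosts pin, or the tunnel on webzavod needs attention first.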
+ +### nginx vhost on pve-201 + +```nginx +server { + listen 443 ssl http2; + server_name ui-dashboard.gnerim.ru; + # ssl_certificate, ssl_certificate_key — existing certbot config + + auth_basic "ui-dashboard"; + auth_basic_user_file /etc/nginx/htpasswd/ui-dashboard; + + location / { + proxy_pass http://127.0.0.1:8081; + proxy_set_header Host $host; + proxy_set_header X-Forwarded-Proto $scheme; + proxy_set_header X-Real-IP $remote_addr; + } + + location /api/ { + auth_basic off; # API path open behind nginx — basic auth gates the HTML, not the API + proxy_pass https://flights.test.aeroflot.ru; + proxy_set_header Host flights.test.aeroflot.ru; + proxy_ssl_server_name on; + } + + location /map/api/ { + auth_basic off; + proxy_pass https://flights.test.aeroflot.ru; + proxy_set_header Host flights.test.aeroflot.ru; + proxy_ssl_server_name on; + } +} +``` + +This file is checked in at `deployment/nginx/ui-dashboard.gnerim.ru.conf` and symlinked into `/etc/nginx/sites-enabled/` by hand on first setup. + +### Dockerfile build-args become relative + +Workflow A passes: + +``` +--build-arg MAP_TILE_URL=/map/api/tile/{z}/{x}/{y}.jpeg +--build-arg API_BASE_URL=/api +``` + +Same-origin URLs in the client bundle, no CORS, the TIM hostname never leaks into the browser. The customer's prod build (Jenkins) keeps its own absolute URLs because their nginx is configured differently — independent. + +### Basic auth and how e2e bypasses it + +- **Credentials** stored as Gitea Actions secrets `BASIC_AUTH_USER` and `BASIC_AUTH_PASS`. +- **Workflow A's deploy step** regenerates `/etc/nginx/htpasswd/ui-dashboard` (using `htpasswd -bn`) and runs `nginx -s reload`. Rotating creds = re-run Workflow A. +- **Public smoke check** in Workflow A step 10 hits `https://ui-dashboard.gnerim.ru/` with `--user $BASIC_AUTH_USER:$BASIC_AUTH_PASS` to validate TLS + nginx + auth + container in one curl. Catches nginx misconfig. 
+- **Full e2e** in Workflow A step 11 runs against `BASE_URL=http://127.0.0.1:8081` (loopback, skips nginx and auth). This is faster and keeps credentials out of the tests; a regression in the nginx layer is already caught by step 10.
+
+## Workflow A — `ci-deploy.yml`
+
+**Triggers:** `push` to `main`; `workflow_dispatch` for re-runs.
+
+**Single sequential job** on the pve-201 runner:
+
+| # | Step | What it does | On failure |
+|---|------|---|---|
+| 1 | Checkout | `actions/checkout@v4`, full history | hard fail |
+| 2 | Setup pnpm + Node 24 | from `.nvmrc` | hard fail |
+| 3 | Restore pnpm cache | `~/.pnpm-store` keyed on `pnpm-lock.yaml` | continue (cache miss is fine) |
+| 4 | Install deps | `pnpm install --frozen-lockfile` | hard fail → Telegram |
+| 5 | Typecheck + lint + unit tests | `pnpm typecheck && pnpm lint && pnpm test` | hard fail → Telegram |
+| 6 | Build SSR image | `docker build -f Dockerfile.react -t flights-web:${GITHUB_SHA} --build-arg MAP_TILE_URL=/map/api/tile/{z}/{x}/{y}.jpeg --build-arg API_BASE_URL=/api .` | hard fail → Telegram |
+| 7 | Tag the outgoing `:current` as `:previous` | `docker tag flights-web:current flights-web:previous` (skip if first deploy) | continue |
+| 8 | Tag SHA as current | `docker tag flights-web:${GITHUB_SHA} flights-web:current` | hard fail |
+| 9 | Restart container | `scripts/ci/deploy-container.sh swap` | trigger rollback (step 12) |
+| 10 | Wait for health | `scripts/ci/wait-for-url.sh https://ui-dashboard.gnerim.ru/ 30 2` (with basic auth) | trigger rollback |
+| 11 | Run Playwright e2e | `BASE_URL=http://127.0.0.1:8081 pnpm test:e2e` (full suite + console-error gate) | trigger rollback |
+| 12 | **Rollback (only if 9/10/11 failed)** | `scripts/ci/deploy-container.sh rollback` — runs `:previous`, swaps aliases back, verifies health | always Telegram with logs |
+| 13 | Prune old images | keep last 5 `flights-web:*` SHA tags + the two aliases | continue |
+| 14 | Telegram (success or failure) | `if: always()` — `notify-telegram.sh ok\|fail ci-deploy ` | 
continue | + +**Estimated runtime:** 6–10 min cached; 12–15 min cold. + +**Console-error gate (step 11):** + +A Playwright fixture `tests/e2e/fixtures/console-gate.ts` attaches a listener to every page, collects all `console.error` and `console.warn` messages, filters out anything matching patterns in `tests/e2e/fixtures/console-allowlist.json`, and asserts the remaining list is empty in `afterEach`. Per the agreed policy: **zero tolerance with explicit allowlist**. Each allowlist entry has a `reason` field; lint enforces non-empty. + +## Workflow B — `release.yml` + +**Triggers:** `workflow_dispatch` (manual button); `push` of tags matching `release-*` (e.g., `git tag release-2026-04-25 && git push --tags`). + +**Single sequential job** on the pve-201 runner: + +| # | Step | What it does | On failure | +|---|------|---|---| +| 1 | Checkout (full history + tags) | needed for sync to operate on real source tree | hard fail | +| 2 | **Verify Workflow A is green for this SHA** | query Gitea API `GET /repos/{owner}/{repo}/actions/runs?head_sha=&workflow=ci-deploy.yml`; require status=success | hard fail → Telegram "release blocked: A not green" | +| 3 | Setup pnpm + Node 24, install deps | needed for paranoid lint/test re-run | hard fail | +| 4 | **Re-run lint + typecheck + unit tests** | belt-and-suspenders: catch flakiness; confirms commit still passes locally before sending to customer | hard fail → Telegram | +| 5 | Clone GitLab target into temp dir | `git clone https://oauth2:$GITLAB_PAT@teamscore.gitlab.yandexcloud.net/aeroflot2/flights-front.git /tmp/flights-front` | hard fail | +| 6 | **Run sync (CI variant)** | `scripts/ci/sync-to-gitlab.sh /tmp/flights-front/Aeroflot.Flights.Front` | hard fail | +| 7 | Commit on a feature branch | `cd /tmp/flights-front && git checkout -b auto/sync-` then `git add -A && git commit -m "auto: sync from gitea "` (skip if no diff) | hard fail; if no diff → log "nothing to sync" + skip 8-13 + Telegram info | +| 8 | Push branch | 
`git push -u origin auto/sync-` | hard fail |
+| 9 | **Open MR** | `POST /api/v4/projects//merge_requests` with `source_branch=auto/sync-`, `target_branch=main`, title `"auto: sync from gitea "`, description with link back to Gitea run | hard fail |
+| 10 | **Approve MR** | `POST /api/v4/projects//merge_requests//approve` | hard fail; on 401/403 → log explicit "self-approve blocked, configure project to allow author approval" |
+| 11 | **Merge MR** | `PUT /api/v4/projects//merge_requests//merge` with `merge_when_pipeline_succeeds=false`, `should_remove_source_branch=true`, `squash=true` | on fail → close MR + delete branch → Telegram |
+| 12 | **Trigger Jenkins** | `curl -u "$JENKINS_USER:$JENKINS_API_TOKEN" "http://jenkins.yc.devwebzavod.ru:8080/job/Aeroflot2/job/Flights-Front-Dev/build?token=$JENKINS_TRIGGER_TOKEN"` — returns the queue item URL in the `Location` header (double quotes so the shell expands the secrets) | hard fail |
+| 13 | **Poll Jenkins for completion** | `scripts/ci/jenkins-trigger-and-wait.sh` — parses queue URL, gets build URL once it leaves queue, polls `/api/json` for `result != null`. Timeout: 30 min. Required: `result == "SUCCESS"`. | hard fail (`UNSTABLE`/`FAILURE`/timeout) → Telegram with Jenkins console URL |
+| 14 | Wait for the customer URL to update | `scripts/ci/wait-for-url.sh http://flights-ui.devwebzavod.ru/ru-ru/onlineboard 60 5` (5 min window) | hard fail |
+| 15 | **Run Playwright e2e** against `http://flights-ui.devwebzavod.ru/` | `BASE_URL=http://flights-ui.devwebzavod.ru pnpm test:e2e` (full suite + console-error gate) | hard fail → Telegram with playwright report attached |
+| 16 | Telegram (success or failure) | `if: always()` — final notification with full chain links | continue |
+
+**Estimated runtime:** 15–25 min (most of it Jenkins build + e2e).
+
+**Self-approve note.** If GitLab's project setting "*Prevent approvals by author*" is enabled, step 10 returns 401 with `"You cannot approve your own merge request"`. Prereq #9 in "One-time manual setup" unchecks this. 
If you can't (org policy), fallback is to skip step 10 and rely on `merge_when_pipeline_succeeds=false` + branch protection allowing maintainer push. + +**Jenkins polling race.** Naive polling has a race where the queue item hasn't materialized into a build yet. `jenkins-trigger-and-wait.sh` polls the queue URL first, then the build URL once it appears. + +**The `auto/sync-` branch** lives forever in GitLab unless step 11 succeeds (which deletes it via `should_remove_source_branch=true`). On step 11 failure, the script closes the MR + deletes the branch. + +**The Gitea runner needs network reachability to TIM** for steps 12-15 (Jenkins host, customer URL). That works automatically once the static route from "Routing pve-201 → TIM API" is in place — the runner shares pve-201's routes. + +## Prerequisites and secrets + +### One-time manual setup + +| # | What | Where | Why | +|---|------|-------|-----| +| 1 | Verify webzavod IP forwarding + MASQUERADE on `ppp0` | webzavod | see "Routing pve-201 → TIM API" | +| 2 | Add static route `172.18.0.0/16 via 192.168.88.58` in netplan | pve-201 | see "Routing pve-201 → TIM API" | +| 3 | Pin `172.18.0.121 flights.test.aeroflot.ru` in `/etc/hosts` | pve-201 | duplicate DNS gotcha — see "Routing pve-201 → TIM API" | +| 4 | Verify from pve-201: `curl -v https://flights.test.aeroflot.ru/swagger/` returns 401 (typically <300ms) | pve-201 | smoke test | +| 5 | Install nginx vhost from `deployment/nginx/ui-dashboard.gnerim.ru.conf` | pve-201 | see "nginx vhost on pve-201" | +| 6 | Confirm Gitea runner has docker socket access (`docker ps` from runner user, no sudo) | pve-201 | required for runner-on-host deploy | +| 7 | Confirm Gitea runner can reach `git.gnerim.ru`, `teamscore.gitlab.yandexcloud.net`, `jenkins.yc.devwebzavod.ru:8080`, `flights-ui.devwebzavod.ru` | pve-201 | last two via static route from #2 | +| 8 | **Create GitLab Personal Access Token** with scopes `api`, `write_repository` | GitLab → Settings → Access Tokens | 
Workflow B steps 9-11 | +| 9 | **Uncheck "Prevent approvals by author"** on the GitLab project | GitLab → flights-front → Settings → Merge requests → Approval rules | so Workflow B step 10 works | +| 10 | **Configure Jenkins remote trigger token** on `Aeroflot2/Flights-Front-Dev` job | Jenkins → job → Configure → "Trigger builds remotely" | Workflow B step 12 | +| 11 | **Generate Jenkins API token** for your user | Jenkins → user → Configure → API Token | Workflow B steps 12-13 | +| 12 | Create the Telegram bot (or reuse existing) and capture chat_id | Telegram BotFather | all notifications | +| 13 | Pick + reserve port `:8081` on pve-201 (or substitute another free port consistently) | pve-201 | container's host-side bind | +| 14 | **Clean uncommitted work in this repo before flipping the switch** | dev pc | the first push to `main` after merging the pipeline will fire Workflow A on whatever's in `main` | +| 15 | Run `scripts/ci/check-gitlab-project.sh` once after creating the PAT | dev pc | captures numeric `GITLAB_PROJECT_ID` for the secret + verifies approval-rule config | + +### Gitea Actions secrets + +Stored at **repo → Settings → Actions → Secrets**. Workflows reference as `${{ secrets.NAME }}`. 
+ +| Secret | Used in | Notes | +|---|---|---| +| `BASIC_AUTH_USER` | Workflow A (deploy) | nginx htpasswd; rotate by re-running A | +| `BASIC_AUTH_PASS` | Workflow A (deploy) | same | +| `MAP_TILE_URL` | Workflow A (build) | default `/map/api/tile/{z}/{x}/{y}.jpeg` — secret so it can be overridden per env | +| `API_BASE_URL` | Workflow A (build) | default `/api` | +| `GITLAB_PAT` | Workflow B (steps 5, 8-11) | from prereq #8 | +| `GITLAB_PROJECT_ID` | Workflow B (steps 9-11) | numeric, from prereq #15 | +| `JENKINS_USER` | Workflow B (steps 12-13) | username | +| `JENKINS_API_TOKEN` | Workflow B (steps 12-13) | from prereq #11 | +| `JENKINS_TRIGGER_TOKEN` | Workflow B (step 12) | from prereq #10 | +| `TELEGRAM_BOT_TOKEN` | both workflows | from prereq #12 | +| `TELEGRAM_CHAT_ID` | both workflows | DM or group | + +### What lives in plain repo files (not secrets) + +- `.gitea/workflows/ci-deploy.yml` and `.gitea/workflows/release.yml` — public, parameterized via secrets. +- `scripts/ci/sync-to-gitlab.sh` — refactored from `sync-to-flights-front.sh`. The original becomes a thin wrapper that calls this with the local sibling-dir default. +- `scripts/ci/notify-telegram.sh` — reads `TELEGRAM_BOT_TOKEN`/`TELEGRAM_CHAT_ID` from env. Has `--dry-run`. +- `scripts/ci/jenkins-trigger-and-wait.sh` — polling logic for B steps 12-13. Has `--mock-mode`. +- `scripts/ci/wait-for-url.sh` — generic curl-with-retry. +- `scripts/ci/deploy-container.sh` — `swap` and `rollback` subcommands. Has `--dry-run`. +- `scripts/ci/install-htpasswd.sh` — renders htpasswd + reloads nginx. +- `scripts/ci/check-gitlab-project.sh` — one-shot setup helper (not used by workflows). +- `tests/e2e/fixtures/console-gate.ts` — Playwright fixture. +- `tests/e2e/fixtures/console-allowlist.json` — empty starter; grows on first runs. +- `deployment/nginx/ui-dashboard.gnerim.ru.conf` — nginx vhost. +- `deployment/README.md` — bootstrap runbook + failure-path rehearsal recipes. 
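The generic curl-with-retry helper listed above could look roughly like this. It is a sketch under stated assumptions, not the actual `scripts/ci/wait-for-url.sh`: the argument order (URL, attempts, delay in seconds) is inferred from the calls `... 30 2` and `... 60 5` in the workflow tables, only 2xx counts as "up", and the function names `fetch_status` and `wait_for_url` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of a curl-with-retry helper in the spirit of scripts/ci/wait-for-url.sh.
# Assumed argument order: url, attempts, delay in seconds.

# Print the HTTP status code (000 on connection failure).
# BASIC_AUTH_USER / BASIC_AUTH_PASS are sent only when both are set.
fetch_status() {
  local url="$1"
  local -a auth=()
  if [ -n "${BASIC_AUTH_USER:-}" ] && [ -n "${BASIC_AUTH_PASS:-}" ]; then
    auth=(--user "$BASIC_AUTH_USER:$BASIC_AUTH_PASS")
  fi
  curl -s -o /dev/null -m 10 -w '%{http_code}' "${auth[@]}" "$url" 2>/dev/null || true
}

# Retry until the URL answers with a 2xx status; return 1 once attempts run out.
wait_for_url() {
  local url="$1" attempts="${2:-30}" delay="${3:-2}" i code
  for ((i = 1; i <= attempts; i++)); do
    code="$(fetch_status "$url")"
    case "$code" in
      2??) echo "up after $i attempt(s): $url ($code)"; return 0 ;;
    esac
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
  done
  echo "gave up after $attempts attempt(s): $url (last status: ${code:-none})" >&2
  return 1
}
```

Keeping the HTTP call in its own function makes the retry loop testable without a network: a test can redefine `fetch_status` to return canned status codes.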
+
+### What gets deleted (in PR #2, not PR #1)
+
+- `.github/workflows/ci.yml`
+- `.github/workflows/deploy.yml`
+
+## Scripts to add
+
+| Path | Purpose | Approx LOC |
+|---|---|---|
+| `scripts/ci/sync-to-gitlab.sh` | Refactored from `sync-to-flights-front.sh`; takes target dir as required arg, no `make`-related output. | ~150 |
+| `scripts/ci/notify-telegram.sh` | `notify-telegram.sh []`; HTML mode; failure messages include Gitea run URL. | ~40 |
+| `scripts/ci/jenkins-trigger-and-wait.sh` | Triggers, parses `Location`, polls queue then build, exits 0 only on `SUCCESS`. | ~80 |
+| `scripts/ci/wait-for-url.sh` | `wait-for-url.sh [] []`. | ~25 |
+| `scripts/ci/deploy-container.sh` | `swap` and `rollback` subcommands. Encapsulates the alias dance + health check. Image source parameterized so registry migration is a config flip. | ~70 |
+| `scripts/ci/install-htpasswd.sh` | Renders `/etc/nginx/htpasswd/ui-dashboard` from env + `nginx -s reload`. | ~15 |
+| `scripts/ci/check-gitlab-project.sh` | One-shot: print numeric project ID + approval rule config + self-approve allowed (yes/no). | ~25 |
+| `scripts/ci/audit-console-allowlist.sh` | Run e2e with allowlist disabled, report which entries didn't fire (dead config). | ~30 |
+| `tests/e2e/fixtures/console-gate.ts` | Playwright fixture for the console-error gate. | ~50 |
+| `tests/e2e/fixtures/console-allowlist.json` | Empty starter `{ "patterns": [] }`. | n/a |
+| `.gitea/workflows/ci-deploy.yml` | Workflow A. | ~80 |
+| `.gitea/workflows/release.yml` | Workflow B. | ~100 |
+| `deployment/nginx/ui-dashboard.gnerim.ru.conf` | nginx vhost from "nginx vhost on pve-201". | ~30 |
+| `deployment/README.md` | Setup runbook + failure-path rehearsals. | ~200 |
+| `tests/ci/*.bats` (or shell) | Unit tests for the testable scripts. | ~80 |
+| `tests/ci/fixtures/jenkins-success-flow.json` | Mock fixture for `jenkins-trigger-and-wait.sh --mock-mode`. 
| ~40 | + +## Failure handling and notifications + +### Telegram message shapes + +All messages use `parse_mode=HTML`. + +**Start (one per workflow run):** +``` +🚀 ci-deploy started +commit: abc1234 — fix: schedule width regression +gitea run: +``` + +**Success:** +``` +✅ ci-deploy passed (8m 42s) +commit: abc1234 — fix: schedule width regression +deployed: https://ui-dashboard.gnerim.ru/ +gitea run: +``` + +**Failure:** +``` +❌ ci-deploy FAILED at step "Run Playwright e2e" (6m 18s) +commit: abc1234 — fix: schedule width regression +gitea run: + +last 30 lines of step output: +
... e2e log tail ...
+ +artifacts: +- container logs +- playwright report +``` + +Workflow B failures include MR URL, Jenkins build URL, customer URL as appropriate. + +### Per-stage failure contracts + +| Failure point | Action | Notification | +|---|---|---| +| **A:1-6** (build/lint/test/dockerbuild) | hard fail; nothing was deployed | `❌ ci-deploy FAILED at step ""` + tail | +| **A:7-11** (deploy/health/e2e) | trigger A:12 rollback to `:previous`, verify rollback healthy | `❌ ci-deploy FAILED at step "" — rolled back to ` + container logs + playwright report (if e2e) | +| **A:12 rollback fails** | container stopped, site is 502 | `🔥 ci-deploy ROLLBACK FAILED — site is DOWN. Manual intervention required. Last good image: flights-web:` | +| **B:2** (A not green for SHA) | refuse to start | `⚠️ release blocked — workflow ci-deploy is not green for . Re-run A first.` | +| **B:3-4** (lint/test re-run) | hard fail | `❌ release FAILED at lint/test re-run (paranoid check). Investigate and re-trigger.` | +| **B:5-8** (sync, branch, push) | hard fail; if MR was created, close it; if branch was pushed, delete it | `❌ release FAILED at "" — cleanup done` | +| **B:9-11** (MR open/approve/merge) | hard fail; close MR + delete branch | `❌ release FAILED at MR — MR closed, branch deleted. ` | +| **B:12-13** (Jenkins trigger/poll) | hard fail; do NOT close the GitLab MR (already merged, can't unmerge) | `❌ release FAILED at Jenkins build — gitlab MR already merged. Jenkins console: ` | +| **B:14** (customer URL not responding) | hard fail | `❌ release FAILED — Jenkins reported SUCCESS but flights-ui.devwebzavod.ru not responding. Investigate.` | +| **B:15** (e2e on customer URL) | hard fail; no auto-rollback (we can't), notify with logs | `❌ release FAILED at e2e on customer URL — gitlab MR merged + Jenkins # green but app misbehaves. Playwright report attached.` | + +### Recovery from B:12-13 failure (awkward case) + +GitLab MR is already merged but customer site has previous code. 
Recovery is manual:
+
+1. Open Jenkins UI → click "Build Now" on the same job, or
+2. Push a new commit to GitLab to re-trigger Jenkins polling.
+
+A "retry just the Jenkins half" workflow file is **not** included — the manual path is rare enough to not warrant the abstraction.
+
+### Implementation pattern
+
+Both workflows end with finalize steps that still run after an earlier step has failed, split here into a success and a failure variant:
+
+```yaml
+- name: Notify (success)
+  if: success()
+  run: scripts/ci/notify-telegram.sh ok ci-deploy
+
+- name: Notify (failure)
+  if: failure()
+  run: scripts/ci/notify-telegram.sh fail ci-deploy "${{ steps.failed_step.outputs.name }}"
+```
+
+The `failed_step` step ID propagates the failed-step name through its `name` output, so an earlier step has to set that output. Slightly verbose but no magic.
+
+### Artifacts on failure
+
+Uploaded on every failure, never on success. 7-day retention.
+
+- **Workflow A:** `docker logs flights-web --tail 500`, `playwright-report/` (if e2e ran), nginx error log tail.
+- **Workflow B:** `playwright-report/` (if e2e ran), the rendered MR/Jenkins API responses (for debugging integration), tail of `git log` on the sync branch.
+
+### Deliberately NOT done
+
+- No PagerDuty / SMS escalation. Telegram is enough.
+- No automatic re-runs on flake. A flaky e2e fail = real signal worth investigating.
+- No "previous run was already failing, suppress notification" logic. Spam is a feature; silence is dangerous.
+- No Slack/email mirror. Single channel.
+
+## Testing the pipeline itself
+
+### Layer 1 — Unit tests for the testable bits
+
+Bash scripts under `scripts/ci/` with logic worth testing:
+
+- **`notify-telegram.sh`** — `--dry-run` prints the rendered payload to stdout instead of POSTing. Tests verify the three message shapes.
+- **`wait-for-url.sh`** — testable with a local `python3 -m http.server`; assert exit codes for 200, 404, network failure, timeout.
+- **`jenkins-trigger-and-wait.sh`** — `--mock-mode` reads from `tests/ci/fixtures/jenkins-success-flow.json`. 
Tests verify queue-then-build polling + SUCCESS / FAILURE / UNSTABLE / timeout branches. +- **`deploy-container.sh`** — `--dry-run` prints docker commands instead of running them. Test verifies alias-swap order. + +Run via `make test-ci` or as a step in Workflow A itself (~10 sec total). + +### Layer 2 — Workflow A first-run validation (staged rollout) + +Plan for the first run to fail. Stage the rollout: + +1. **PR #1** — adds workflows + scripts + console-gate fixture + nginx config + `deployment/README.md`. Does **not** delete `.github/workflows/`. Workflow A starts firing on push. +2. **First few runs will fail** at: portability of e2e specs to remote `BASE_URL`, missing `BASE_URL` overrides in test setup, console-gate revealing real warnings to allowlist, network/DNS/route gotchas. Each failure → fix → re-push. +3. **PR #2** — deletes `.github/workflows/` and any compatibility shims, only after A has run green for a few consecutive commits. +4. **PR #3** — Workflow B. First run triggered manually. Once it works once end-to-end, it's "live". + +**Budget 1-2 dev days of "debug the pipeline against reality" after merging PR #1.** Expecting green on first run is wrong. + +### Layer 3 — Documented rehearsal of failure paths + +`deployment/README.md` includes recipes for inducing each failure path: + +| Failure | How to induce | +|---|---| +| A e2e fail → rollback | push a commit that adds `console.error('test')` to `App.tsx`. Verify rollback. | +| A rollback fail | break the `:previous` tag manually (`docker rmi flights-web:previous`), trigger an e2e fail. | +| B blocked on A not green | push a commit that fails A, then trigger B for that SHA. | +| B Jenkins poll timeout | reduce `JENKINS_TIMEOUT` to 30s and trigger B. | +| B e2e fail on customer URL | manually break the customer URL (trigger an old Jenkins build), then run B without a code change. | + +Run at least the "rollback" and "release blocked" rehearsals once before declaring the pipeline production-grade. 
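Both the Layer 1 dry-run test and the rollback rehearsals depend on the order of the alias operations, so a sketch of that order may help. This is a sketch under stated assumptions rather than the real `scripts/ci/deploy-container.sh`: the `DRY_RUN` variable, the `run` helper, and passing the tag to `swap` are assumptions, and the health check is left out. The container name, port binding, and `docker run` flags are copied from the key invariants.

```shell
#!/usr/bin/env bash
# Sketch of the swap/rollback ordering; not the real scripts/ci/deploy-container.sh.
# Defaults mirror the invariants in this spec; DRY_RUN is an assumed knob.
IMAGE="${IMAGE:-flights-web}"
NAME="${NAME:-flights-web}"
PORTS="${PORTS:-127.0.0.1:8081:8080}"

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Replace the singleton container with one started from :current.
restart_container() {
  run docker stop "$NAME" || true   # tolerate a missing container on first deploy
  run docker rm "$NAME" || true
  run docker run -d --name "$NAME" --restart unless-stopped -p "$PORTS" "$IMAGE:current"
}

# swap <tag>: keep the outgoing image as :previous, promote <tag> to :current, restart.
swap() {
  local tag="$1"
  run docker tag "$IMAGE:current" "$IMAGE:previous" || true   # no :current on first deploy
  run docker tag "$IMAGE:$tag" "$IMAGE:current" || return 1
  restart_container
}

# rollback: point :current back at :previous and restart.
rollback() {
  run docker tag "$IMAGE:previous" "$IMAGE:current" || return 1
  restart_container
}
```

With `DRY_RUN=1`, `swap abc1234` prints the five docker commands in order, which is the property the alias-swap test would assert. Overriding `IMAGE` with a registry-qualified name is one way to keep the registry migration a configuration change.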
+ +### Console-allowlist seeding strategy + +- **Don't pre-seed.** Run e2e once locally against `https://ui-dashboard.gnerim.ru/` (or `http://localhost:8081`), capture every console message, decide which are real bugs vs allowlist material. +- **Each allowlist entry has a `reason` field**, lint-enforced. +- **Re-evaluate quarterly** via `scripts/ci/audit-console-allowlist.sh` — entries that didn't fire are dead config. + +### Deliberately NOT tested + +- The actual GitLab API integration (no way to mock GitLab without GitLab; first B run is the test). +- The actual Jenkins API integration (same; polling logic *is* tested via mock-mode). +- The Telegram bot (tested via `--dry-run`; failed delivery observable as "no message arrived"). + +## Future seam: container registry + +When a private registry comes online (eventual `registry.gnerim.ru`), changes: + +- **Workflow A** — replace local `docker tag flights-web:current` + `docker run` with: + ```yaml + - run: docker push registry.gnerim.ru/flights-web:${GITHUB_SHA} + - run: ssh deploy@pve-201 'docker pull registry.gnerim.ru/flights-web: && ...' + ``` + Runner can move off pve-201 — anywhere with reach to registry + SSH key to deploy host. +- **Add secrets** `REGISTRY_USER`, `REGISTRY_PASS`, `DEPLOY_SSH_KEY`. +- **Rollback semantics identical** — `docker pull ` instead of relying on local cache. +- **No script rewrites** — `scripts/ci/deploy-container.sh` accepts image-source as a parameter from day one. `flights-web:current` / `:previous` becomes `:` / `:`, same shape. + +## Open questions and known gaps + +1. **`GITLAB_PROJECT_ID`** — numeric ID is unknown until the PAT exists. `scripts/ci/check-gitlab-project.sh` resolves it post-PAT. +2. **The 9 untracked `snap-*.yml` files at repo root** look like throwaway parity-snapshot artifacts. Add to `.gitignore` or commit? Verify before flipping pipeline on (prereq #14). +3. **e2e portability to remote `BASE_URL`** — existing specs were written against localhost. 
Many likely hardcode paths or rely on dev-only state. Layer 2 of testing strategy budgets time for this. +4. **Initial console-allowlist content** — empty starter; will be populated on first runs ("we'll figure it out in future" per design discussion).
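
For open question 1, the numeric project ID can be read from the GitLab projects endpoint once the PAT exists. This is a minimal sketch of that lookup only, assuming curl and jq are installed. The host and project path are taken from the clone URL in Workflow B step 5; the function names and the `GITLAB_HOST` / `GITLAB_PROJECT_PATH` variables are hypothetical, and the approval-rule check that `check-gitlab-project.sh` is meant to perform is not covered.

```shell
#!/usr/bin/env bash
# Sketch of the project-ID lookup only; not the real scripts/ci/check-gitlab-project.sh.
GITLAB_HOST="${GITLAB_HOST:-teamscore.gitlab.yandexcloud.net}"
GITLAB_PROJECT_PATH="${GITLAB_PROJECT_PATH:-aeroflot2/flights-front}"

# URL-encode a namespace/project path for /api/v4/projects/:id.
# Only "/" is escaped, which is enough for the path used here.
encode_path() {
  printf '%s\n' "${1//\//%2F}"
}

# Build the API URL for the configured project.
project_api_url() {
  printf 'https://%s/api/v4/projects/%s\n' "$GITLAB_HOST" "$(encode_path "$GITLAB_PROJECT_PATH")"
}

# Print the numeric project ID; needs GITLAB_PAT in the environment and jq on PATH.
project_id() {
  curl -fsS --header "PRIVATE-TOKEN: ${GITLAB_PAT:?set GITLAB_PAT first}" "$(project_api_url)" \
    | jq -r '.id'
}
```

Running `project_id` with `GITLAB_PAT` exported prints the value to store as the `GITLAB_PROJECT_ID` secret.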