CI/CD Pipeline Design — Gitea Actions → pve-201 → GitLab → Jenkins
Status: Approved design, ready for implementation plan.
Date: 2026-04-25
Author: gnezim (with Claude)
Summary
A two-workflow Gitea Actions pipeline that builds and deploys this React SSR app to your own infrastructure (pve-201, behind https://ui-dashboard.gnerim.ru/) on every push, then — on explicit trigger — syncs sources to the customer's GitLab, opens and auto-merges an MR, fires the Jenkins build, and runs end-to-end tests against the customer's dev URL. All notifications via Telegram.
Two workflow files:
- ci-deploy.yml: push-triggered. Build → unit tests → Docker build → swap container → e2e on ui-dashboard.gnerim.ru. Auto-rollback to previous image on any post-build failure.
- release.yml: manually triggered (UI button or release-* git tag). Verifies ci-deploy is green for the same SHA, then GitLab sync → MR → approve → merge → Jenkins trigger → poll → e2e on flights-ui.devwebzavod.ru. Halts on any failure.
The Gitea runner runs on pve-201 itself, with Docker socket access — no SSH, no registry hop. Image-versioning uses flights-web:<sha> plus moving aliases :current and :previous for one-step rollback. Future migration to a private registry is a config change, not a refactor.
Architecture
┌──────────────────┐      push to main      ┌─────────────────────────────────┐
│   dev pc (you)   │ ─────────────────────► │  git.gnerim.ru (Gitea server)   │
└──────────────────┘   manual / tag push    └────────────────┬────────────────┘
                                                             │ webhook
                                                             ▼
                                              ┌──────────────────────────────┐
                                              │ Gitea Actions runner         │
                                              │ on pve-201 (Docker socket)   │
                                              └──┬───────────────────────────┘
                                                 │
                  ┌──────────────────────────────┼──────────────────────────────┐
                  │                              │                              │
                  ▼ on push                      ▼ on tag/manual                ▼ Telegram
        ┌────────────────────┐         ┌────────────────────┐         ┌────────────────────┐
        │ Workflow A         │         │ Workflow B         │         │ Notify on every    │
        │ ci-deploy.yml      │         │ release.yml        │         │ stage start / end  │
        │                    │         │                    │         │ / failure          │
        │ build & test       │         │ verify A is green  │         └────────────────────┘
        │         ↓          │         │         ↓          │
        │ docker build :SHA  │         │ sync → GitLab MR   │
        │         ↓          │         │         ↓          │
        │ swap container     │         │ approve & merge    │
        │         ↓          │         │         ↓          │
        │ smoke /health      │         │ trigger Jenkins    │
        │         ↓          │         │         ↓          │
        │ playwright e2e on  │         │ poll until SUCCESS │
        │ ui-dashboard       │         │         ↓          │
        │         ↓ on fail  │         │ playwright e2e on  │
        │ rollback to        │         │ flights-ui.devweb  │
        │ :previous          │         │         ↓ on fail  │
        └────────────────────┘         │ halt + dump logs   │
                                       └────────────────────┘
Key invariants
- Workflow A is the gatekeeper. Workflow B always queries Gitea for the latest A run for the same commit SHA; if not green, B refuses to start.
- Image tag aliases. Every build is tagged flights-web:<sha>. Two moving aliases on the host: flights-web:current (live container source) and flights-web:previous (rollback target). Pruning keeps the last 5 SHA tags + the two aliases.
- Container is named flights-web (singleton). Restart sequence: docker stop flights-web && docker rm flights-web && docker run -d --name flights-web --restart unless-stopped -p 127.0.0.1:8081:8080 flights-web:current.
- Nginx on pve-201 terminates TLS for ui-dashboard.gnerim.ru and proxies to 127.0.0.1:8081.
- All four major stages emit Telegram messages (start / pass / fail). Failure messages include log tail and a clickable link to the Gitea run.
- .github/workflows/ files are deleted in PR #2 (not the same PR that adds the new workflows; see "Layer 2 — staged rollout" under "Testing the pipeline itself").
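The alias rotation and restart sequence from the invariants above can be sketched in shell. This is a minimal illustration, not the contents of scripts/ci/deploy-container.sh: the run helper, the swap function, and the IMAGE, CONTAINER, HOST_PORT and DRY_RUN variables are names assumed for the example. Only the docker commands themselves come from the text.

```shell
#!/bin/sh
# Sketch of the alias swap. All variable and function names are assumptions.
IMAGE="${IMAGE:-flights-web}"
CONTAINER="${CONTAINER:-flights-web}"
HOST_PORT="${HOST_PORT:-8081}"   # the 2026-04-27 addendum moves this to 3002
DRY_RUN="${DRY_RUN:-1}"          # 1 = print commands only, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

swap() {
  sha="$1"
  # Keep the live image reachable as the rollback target. On a first deploy
  # there is no :current yet, so a failure here is tolerated.
  run docker tag "$IMAGE:current" "$IMAGE:previous" || true
  # Point :current at the freshly built image.
  run docker tag "$IMAGE:$sha" "$IMAGE:current"
  # Replace the singleton container; stop/rm may fail if it does not exist yet.
  run docker stop "$CONTAINER" || true
  run docker rm "$CONTAINER" || true
  run docker run -d --name "$CONTAINER" --restart unless-stopped \
    -p "127.0.0.1:$HOST_PORT:8080" "$IMAGE:current"
}
```

With DRY_RUN=1, calling swap abc1234 prints the five docker commands in order, which is the property an alias-swap-order test would check. Tagging :previous before moving :current matters: reversing the two would make both aliases point at the new image and leave nothing to roll back to.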
Architectural choices already made
- Runner-on-host with direct Docker socket (vs SSH-back-to-localhost or local registry): least moving parts; runner is in the docker group on pve-201.
- Two independent workflow files (vs one file with conditional jobs, vs shared composite action): short and focused beats clever.
- Manual trigger for Workflow B + git tag fallback (vs commit-message keyword) — explicit; can't ship to customer by accident.
Routing, build-args, and access control
The build-args change the most across this design — they go from absolute (TIM hostnames) to relative paths, which moves the burden onto nginx on pve-201.
Routing pve-201 → TIM API
The customer API at https://flights.test.aeroflot.ru/api/* is reachable only through the corp VPN. webzavod (192.168.88.58) on the same LAN as pve-201 (192.168.88.167) already has a working L2TP/IPsec tunnel to TIM via ppp0. The cleanest way to make pve-201 reach TIM is a static route through webzavod, which leverages the existing VPN setup.
One-time host setup (manual, not in workflows):
- On webzavod: verify IP forwarding and MASQUERADE on ppp0:
    sysctl net.ipv4.ip_forward   # expect: 1
    sudo iptables -t nat -L POSTROUTING -nv | grep ppp0   # expect: MASQUERADE
  If not set:
    echo 'net.ipv4.ip_forward=1' | sudo tee -a /etc/sysctl.conf
    sudo sysctl -p
    sudo iptables -t nat -A POSTROUTING -o ppp0 -j MASQUERADE
    sudo apt install iptables-persistent && sudo netfilter-persistent save
- On pve-201: add a persistent static route to TIM via webzavod:
    # /etc/netplan/01-routes.yaml
    network:
      version: 2
      ethernets:
        eth0:   # rename to actual NIC name
          routes:
            - to: 172.18.0.0/16
              via: 192.168.88.58
  Then apply it:
    sudo netplan apply
- On pve-201: pin TIM hostnames to reachable A records (mirrors the duplicate-DNS workaround documented in ~/_projects/gnezim/knowledge/projects/work/tim/ui-dashboard/mac-via-windows-jump.md):
    # /etc/hosts
    172.18.0.121 flights.test.aeroflot.ru
- Smoke test from pve-201:
    curl -v https://flights.test.aeroflot.ru/swagger/   # expect: 401 in ~70ms
  Failure here means routing is broken; fix it before any pipeline run.
nginx vhost on pve-201
server {
    listen 443 ssl http2;
    server_name ui-dashboard.gnerim.ru;
    # ssl_certificate, ssl_certificate_key: existing certbot config

    auth_basic "ui-dashboard";
    auth_basic_user_file /etc/nginx/htpasswd/ui-dashboard;

    location / {
        proxy_pass http://127.0.0.1:8081;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_set_header X-Real-IP $remote_addr;
    }

    location /api/ {
        auth_basic off;  # API path open behind nginx; basic auth gates the HTML, not the API
        proxy_pass https://flights.test.aeroflot.ru;
        proxy_set_header Host flights.test.aeroflot.ru;
        proxy_ssl_server_name on;
    }

    location /map/api/ {
        auth_basic off;
        proxy_pass https://flights.test.aeroflot.ru;
        proxy_set_header Host flights.test.aeroflot.ru;
        proxy_ssl_server_name on;
    }
}
This file is checked in at deployment/nginx/ui-dashboard.gnerim.ru.conf and symlinked into /etc/nginx/sites-enabled/ by hand on first setup.
Dockerfile build-args become relative
Workflow A passes:
--build-arg MAP_TILE_URL=/map/api/tile/{z}/{x}/{y}.jpeg
--build-arg API_BASE_URL=/api
The client bundle therefore uses same-origin URLs only: no CORS is needed, and the TIM hostname never leaks into the browser. The customer's prod build (Jenkins) keeps its own absolute URLs because their nginx is configured differently; the two builds are independent.
Basic auth and how e2e bypasses it
- Credentials stored as Gitea Actions secrets BASIC_AUTH_USER and BASIC_AUTH_PASS.
- Workflow A's deploy step regenerates /etc/nginx/htpasswd/ui-dashboard (using htpasswd -bn) and runs nginx -s reload. Rotating creds = re-run Workflow A.
- Public smoke check in Workflow A step 10 hits https://ui-dashboard.gnerim.ru/ with --user $BASIC_AUTH_USER:$BASIC_AUTH_PASS to validate TLS + nginx + auth + container in one curl. Catches nginx misconfig.
- Full e2e in Workflow A step 11 runs against BASE_URL=http://127.0.0.1:8081 (loopback, skips nginx and auth). Faster, no creds in test, regression in nginx layer already caught by step 10.
Workflow A — ci-deploy.yml
Triggers: push to main; workflow_dispatch for re-runs.
Single sequential job on the pve-201 runner:
| # | Step | What it does | On failure |
|---|---|---|---|
| 1 | Checkout | actions/checkout@v4, full history | hard fail |
| 2 | Setup pnpm + Node 24 | from .nvmrc | hard fail |
| 3 | Restore pnpm cache | ~/.pnpm-store keyed on pnpm-lock.yaml | continue (cache miss is fine) |
| 4 | Install deps | pnpm install --frozen-lockfile | hard fail → Telegram |
| 5 | Typecheck + lint + unit tests | pnpm typecheck && pnpm lint && pnpm test | hard fail → Telegram |
| 6 | Build SSR image | docker build -f Dockerfile.react -t flights-web:${GITHUB_SHA} --build-arg MAP_TILE_URL=/map/api/tile/{z}/{x}/{y}.jpeg --build-arg API_BASE_URL=/api . | hard fail → Telegram |
| 7 | Tag previous-current as previous | docker tag flights-web:current flights-web:previous (skip if first deploy) | continue |
| 8 | Tag SHA as current | docker tag flights-web:${GITHUB_SHA} flights-web:current | hard fail |
| 9 | Restart container | scripts/ci/deploy-container.sh swap | trigger rollback (step 12) |
| 10 | Wait for health | scripts/ci/wait-for-url.sh https://ui-dashboard.gnerim.ru/ 30 2 (with basic auth) | trigger rollback |
| 11 | Run Playwright e2e | BASE_URL=http://127.0.0.1:8081 pnpm test:e2e (full suite + console-error gate) | trigger rollback |
| 12 | Rollback (only if 9/10/11 failed) | scripts/ci/deploy-container.sh rollback: runs :previous, swaps aliases back, verifies health | always Telegram with logs |
| 13 | Prune old images | keep last 5 flights-web:* SHA tags + the two aliases | continue |
| 14 | Telegram (success or failure) | if: always(), runs notify-telegram.sh ok\|fail ci-deploy <step-name> | continue |
Estimated runtime: 6–10 min cached; 12–15 min cold.
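The retention rule in step 13 (keep the last 5 SHA tags plus the two aliases) reduces to a small filter. The sketch below is one possible shape, not the pipeline's actual prune step: the function name tags_to_prune and its argument are assumptions, and it expects the tag list to arrive newest first.

```shell
#!/bin/sh
# Reads image tags on stdin (one per line, newest first) and prints the tags
# that fall outside the retention window. The two moving aliases are never
# candidates. The function name and the "keep" argument are assumptions.
tags_to_prune() {
  keep="${1:-5}"
  grep -v -x -e current -e previous | awk -v keep="$keep" 'NR > keep'
}

# Possible wiring (assumes `docker images` lists the newest image first):
#   docker images flights-web --format '{{.Tag}}' \
#     | tags_to_prune 5 \
#     | while read -r tag; do docker rmi "flights-web:$tag"; done
```

Keeping the selection separate from the docker rmi call makes the rule testable without a Docker daemon.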
Console-error gate (step 11):
A Playwright fixture tests/e2e/fixtures/console-gate.ts attaches a listener to every page, collects all console.error and console.warn messages, filters out anything matching patterns in tests/e2e/fixtures/console-allowlist.json, and asserts the remaining list is empty in afterEach. Per the agreed policy: zero tolerance with explicit allowlist. Each allowlist entry has a reason field; lint enforces non-empty.
Workflow B — release.yml
Triggers: workflow_dispatch (manual button); push of tags matching release-* (e.g., git tag release-2026-04-25 && git push --tags).
Single sequential job on the pve-201 runner:
| # | Step | What it does | On failure |
|---|---|---|---|
| 1 | Checkout (full history + tags) | needed for sync to operate on real source tree | hard fail |
| 2 | Verify Workflow A is green for this SHA | query Gitea API GET /repos/{owner}/{repo}/actions/runs?head_sha=<sha>&workflow=ci-deploy.yml; require status=success | hard fail → Telegram "release blocked: A not green" |
| 3 | Setup pnpm + Node 24, install deps | needed for paranoid lint/test re-run | hard fail |
| 4 | Re-run lint + typecheck + unit tests | belt-and-suspenders: catch flakiness; confirms commit still passes locally before sending to customer | hard fail → Telegram |
| 5 | Clone GitLab target into temp dir | git clone https://oauth2:$GITLAB_PAT@teamscore.gitlab.yandexcloud.net/aeroflot2/flights-front.git /tmp/flights-front | hard fail |
| 6 | Run sync (CI variant) | scripts/ci/sync-to-gitlab.sh /tmp/flights-front/Aeroflot.Flights.Front | hard fail |
| 7 | Commit on a feature branch | cd /tmp/flights-front && git checkout -b auto/sync-<sha> then git add -A && git commit -m "auto: sync from gitea <sha>" (skip if no diff) | hard fail; if no diff → log "nothing to sync" + skip 8-13 + Telegram info |
| 8 | Push branch | git push -u origin auto/sync-<sha> | hard fail |
| 9 | Open MR | POST /api/v4/projects/<id>/merge_requests with source_branch=auto/sync-<sha>, target_branch=main, title "auto: sync from gitea <sha-short>", description with link back to Gitea run | hard fail |
| 10 | Approve MR | POST /api/v4/projects/<id>/merge_requests/<iid>/approve | hard fail; on 401/403 → log explicit "self-approve blocked, configure project to allow author approval" |
| 11 | Merge MR | PUT /api/v4/projects/<id>/merge_requests/<iid>/merge with merge_when_pipeline_succeeds=false, should_remove_source_branch=true, squash=true | on fail → close MR + delete branch → Telegram |
| 12 | Trigger Jenkins | curl -u "$JENKINS_USER:$JENKINS_API_TOKEN" "http://jenkins.yc.devwebzavod.ru:8080/job/Aeroflot2/job/Flights-Front-Dev/build?token=$JENKINS_TRIGGER_TOKEN" (double quotes so the shell expands the variables; returns the queue item URL in the Location header) | hard fail |
| 13 | Poll Jenkins for completion | scripts/ci/jenkins-trigger-and-wait.sh: parses queue URL, gets build URL once it leaves queue, polls <build_url>/api/json for result != null. Timeout: 30 min. Required: result == "SUCCESS". | hard fail (UNSTABLE/FAILURE/timeout) → Telegram with Jenkins console URL |
| 14 | Wait for the customer URL to update | scripts/ci/wait-for-url.sh http://flights-ui.devwebzavod.ru/ru-ru/onlineboard 60 5 (5 min window) | hard fail |
| 15 | Run Playwright e2e against http://flights-ui.devwebzavod.ru/ | BASE_URL=http://flights-ui.devwebzavod.ru pnpm test:e2e (full suite + console-error gate) | hard fail → Telegram with playwright report attached |
| 16 | Telegram (success or failure) | if: always(), final notification with full chain links | continue |
Estimated runtime: 15–25 min (most of it Jenkins build + e2e).
Self-approve note. If GitLab's project setting "Prevent approvals by author" is enabled, step 10 returns 401 with "You cannot approve your own merge request". Prereq #9 in "One-time manual setup" unchecks this. If you can't (org policy), fallback is to skip step 10 and rely on merge_when_pipeline_succeeds=false + branch protection allowing maintainer push.
Jenkins polling race. Naive polling has a race where the queue item hasn't materialized into a build yet. jenkins-trigger-and-wait.sh polls the queue URL first, then the build URL once it appears.
The auto/sync-<sha> branch lives forever in GitLab unless step 11 succeeds (which deletes it via should_remove_source_branch=true). On step 11 failure, the script closes the MR + deletes the branch.
The Gitea runner needs network reachability to TIM for steps 12-15 (Jenkins host, customer URL). That works automatically once the static route from "Routing pve-201 → TIM API" is in place — the runner shares pve-201's routes.
Prerequisites and secrets
One-time manual setup
| # | What | Where | Why |
|---|---|---|---|
| 1 | Verify webzavod IP forwarding + MASQUERADE on ppp0 | webzavod | see "Routing pve-201 → TIM API" |
| 2 | Add static route 172.18.0.0/16 via 192.168.88.58 in netplan | pve-201 | see "Routing pve-201 → TIM API" |
| 3 | Pin 172.18.0.121 flights.test.aeroflot.ru in /etc/hosts | pve-201 | duplicate DNS gotcha; see "Routing pve-201 → TIM API" |
| 4 | Verify from pve-201: curl -v https://flights.test.aeroflot.ru/swagger/ returns 401 (typically <300ms) | pve-201 | smoke test |
| 5 | Install nginx vhost from deployment/nginx/ui-dashboard.gnerim.ru.conf | pve-201 | see "nginx vhost on pve-201" |
| 6 | Confirm Gitea runner has docker socket access (docker ps from runner user, no sudo) | pve-201 | required for runner-on-host deploy |
| 7 | Confirm Gitea runner can reach git.gnerim.ru, teamscore.gitlab.yandexcloud.net, jenkins.yc.devwebzavod.ru:8080, flights-ui.devwebzavod.ru | pve-201 | last two via static route from #2 |
| 8 | Create GitLab Personal Access Token with scopes api, write_repository | GitLab → Settings → Access Tokens | Workflow B steps 9-11 |
| 9 | Uncheck "Prevent approvals by author" on the GitLab project | GitLab → flights-front → Settings → Merge requests → Approval rules | so Workflow B step 10 works |
| 10 | Configure Jenkins remote trigger token on Aeroflot2/Flights-Front-Dev job | Jenkins → job → Configure → "Trigger builds remotely" | Workflow B step 12 |
| 11 | Generate Jenkins API token for your user | Jenkins → user → Configure → API Token | Workflow B steps 12-13 |
| 12 | Create the Telegram bot (or reuse existing) and capture chat_id | Telegram BotFather | all notifications |
| 13 | Pick + reserve port :8081 on pve-201 (or substitute another free port consistently) | pve-201 | container's host-side bind |
| 14 | Clean uncommitted work in this repo before flipping the switch | dev pc | the first push to main after merging the pipeline will fire Workflow A on whatever's in main |
| 15 | Run scripts/ci/check-gitlab-project.sh once after creating the PAT | dev pc | captures numeric GITLAB_PROJECT_ID for the secret + verifies approval-rule config |
Gitea Actions secrets
Stored at repo → Settings → Actions → Secrets. Workflows reference as ${{ secrets.NAME }}.
| Secret | Used in | Notes |
|---|---|---|
| BASIC_AUTH_USER | Workflow A (deploy) | nginx htpasswd; rotate by re-running A |
| BASIC_AUTH_PASS | Workflow A (deploy) | same |
| MAP_TILE_URL | Workflow A (build) | default /map/api/tile/{z}/{x}/{y}.jpeg; a secret so it can be overridden per env |
| API_BASE_URL | Workflow A (build) | default /api |
| GITLAB_PAT | Workflow B (steps 5, 8-11) | from prereq #8 |
| GITLAB_PROJECT_ID | Workflow B (steps 9-11) | numeric, from prereq #15 |
| JENKINS_USER | Workflow B (steps 12-13) | username |
| JENKINS_API_TOKEN | Workflow B (steps 12-13) | from prereq #11 |
| JENKINS_TRIGGER_TOKEN | Workflow B (step 12) | from prereq #10 |
| TELEGRAM_BOT_TOKEN | both workflows | from prereq #12 |
| TELEGRAM_CHAT_ID | both workflows | DM or group |
What lives in plain repo files (not secrets)
- .gitea/workflows/ci-deploy.yml and .gitea/workflows/release.yml: public, parameterized via secrets.
- scripts/ci/sync-to-gitlab.sh: refactored from sync-to-flights-front.sh. The original becomes a thin wrapper that calls this with the local sibling-dir default.
- scripts/ci/notify-telegram.sh: reads TELEGRAM_BOT_TOKEN / TELEGRAM_CHAT_ID from env. Has --dry-run.
- scripts/ci/jenkins-trigger-and-wait.sh: polling logic for B steps 12-13. Has --mock-mode.
- scripts/ci/wait-for-url.sh: generic curl-with-retry.
- scripts/ci/deploy-container.sh: swap and rollback subcommands. Has --dry-run.
- scripts/ci/install-htpasswd.sh: renders htpasswd + reloads nginx.
- scripts/ci/check-gitlab-project.sh: one-shot setup helper (not used by workflows).
- tests/e2e/fixtures/console-gate.ts: Playwright fixture.
- tests/e2e/fixtures/console-allowlist.json: empty starter; grows on first runs.
- deployment/nginx/ui-dashboard.gnerim.ru.conf: nginx vhost.
- deployment/README.md: bootstrap runbook + failure-path rehearsal recipes.
What gets deleted (in PR #2, not PR #1)
- .github/workflows/ci.yml
- .github/workflows/deploy.yml
Scripts to add
| Path | Purpose | Approx LOC |
|---|---|---|
| scripts/ci/sync-to-gitlab.sh | Refactored from sync-to-flights-front.sh; takes target dir as required arg, no make-related output. | ~150 |
| scripts/ci/notify-telegram.sh | notify-telegram.sh <ok\|fail> <stage> [<extra-context>]; HTML mode; failure messages include Gitea run URL. | ~40 |
| scripts/ci/jenkins-trigger-and-wait.sh | Triggers, parses Location, polls queue then build, exits 0 only on SUCCESS. | ~80 |
| scripts/ci/wait-for-url.sh | wait-for-url.sh <url> [<max-attempts>] [<delay>]. | ~25 |
| scripts/ci/deploy-container.sh | swap and rollback subcommands. Encapsulates the alias dance + health check. Image source parameterized so registry migration is a config flip. | ~70 |
| scripts/ci/install-htpasswd.sh | Renders /etc/nginx/htpasswd/ui-dashboard from env + nginx -s reload. | ~15 |
| scripts/ci/check-gitlab-project.sh | One-shot: print numeric project ID + approval rule config + self-approve allowed (yes/no). | ~25 |
| scripts/ci/audit-console-allowlist.sh | Run e2e with allowlist disabled, report which entries didn't fire (dead config). | ~30 |
| tests/e2e/fixtures/console-gate.ts | Playwright fixture for the console-error gate. | ~50 |
| tests/e2e/fixtures/console-allowlist.json | Empty starter { patterns: [] }. | n/a |
| .gitea/workflows/ci-deploy.yml | Workflow A. | ~80 |
| .gitea/workflows/release.yml | Workflow B. | ~100 |
| deployment/nginx/ui-dashboard.gnerim.ru.conf | nginx vhost from "nginx vhost on pve-201". | ~30 |
| deployment/README.md | Setup runbook + failure-path rehearsals. | ~200 |
| tests/ci/*.bats (or shell) | Unit tests for the testable scripts. | ~80 |
| tests/ci/fixtures/jenkins-success-flow.json | Mock fixture for jenkins-trigger-and-wait.sh --mock-mode. | ~40 |
Failure handling and notifications
Telegram message shapes
All messages use parse_mode=HTML.
Start (one per workflow run):
🚀 ci-deploy started
commit: abc1234 — fix: schedule width regression
gitea run: <link>
Success:
✅ ci-deploy passed (8m 42s)
commit: abc1234 — fix: schedule width regression
deployed: https://ui-dashboard.gnerim.ru/
gitea run: <link>
Failure:
❌ ci-deploy FAILED at step "Run Playwright e2e" (6m 18s)
commit: abc1234 — fix: schedule width regression
gitea run: <link>
last 30 lines of step output:
<pre>... e2e log tail ...</pre>
artifacts:
- container logs
- playwright report
Workflow B failures include MR URL, Jenkins build URL, customer URL as appropriate.
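The message shapes above, together with the --dry-run mode and the skip-when-unconfigured behaviour from the 2026-04-27 addendum, could be implemented roughly as follows. This is a hedged sketch rather than the real notify-telegram.sh: the function names and the COMMIT_LINE, RUN_URL and DRY_RUN variables are assumptions. The sendMessage endpoint with chat_id, parse_mode and text is the public Telegram Bot API; HTML parse mode requires &, < and > in free text to be escaped, which is why html_escape exists.

```shell
#!/bin/sh
# Sketch only: function names, COMMIT_LINE, RUN_URL and DRY_RUN are assumptions.

# Escape the three characters that Telegram's HTML parse mode treats as markup.
html_escape() {
  sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# render_message <ok|fail> <stage> [<failed-step>]
render_message() {
  status="$1"; stage="$2"; step="${3:-}"
  case "$status" in
    ok)   head="✅ $stage passed" ;;
    fail) head="❌ $stage FAILED"
          if [ -n "$step" ]; then head="$head at step \"$step\""; fi ;;
    *)    echo "usage: render_message <ok|fail> <stage> [<step>]" >&2
          return 2 ;;
  esac
  printf '%s\ncommit: %s\ngitea run: %s\n' \
    "$head" "${COMMIT_LINE:-unknown}" "${RUN_URL:-unknown}"
}

notify() {
  text="$(render_message "$@")" || return 2
  text="$(printf '%s\n' "$text" | html_escape)"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$text"          # print the payload instead of POSTing it
    return 0
  fi
  if [ -z "${TELEGRAM_BOT_TOKEN:-}" ] || [ -z "${TELEGRAM_CHAT_ID:-}" ]; then
    echo "telegram secrets unset, skipping notification" >&2
    return 0                       # skip cleanly instead of failing the run
  fi
  curl -fsS -o /dev/null \
    "https://api.telegram.org/bot${TELEGRAM_BOT_TOKEN}/sendMessage" \
    --data-urlencode "chat_id=${TELEGRAM_CHAT_ID}" \
    --data-urlencode "parse_mode=HTML" \
    --data-urlencode "text=${text}"
}
```

Because this sketch escapes the whole message, a real script that wraps the log tail in a pre tag would escape only the tail and add the tag afterwards.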
Per-stage failure contracts
| Failure point | Action | Notification |
|---|---|---|
| A:1-6 (build/lint/test/dockerbuild) | hard fail; nothing was deployed | ❌ ci-deploy FAILED at step "<name>" + tail |
| A:7-11 (deploy/health/e2e) | trigger A:12 rollback to :previous, verify rollback healthy | ❌ ci-deploy FAILED at step "<name>" — rolled back to <prev-sha> + container logs + playwright report (if e2e) |
| A:12 rollback fails | container stopped, site is 502 | 🔥 ci-deploy ROLLBACK FAILED — site is DOWN. Manual intervention required. Last good image: flights-web:<prev-sha> |
| B:2 (A not green for SHA) | refuse to start | ⚠️ release blocked — workflow ci-deploy is not green for <sha>. Re-run A first. |
| B:3-4 (lint/test re-run) | hard fail | ❌ release FAILED at lint/test re-run (paranoid check). Investigate and re-trigger. |
| B:5-8 (sync, branch, push) | hard fail; if MR was created, close it; if branch was pushed, delete it | ❌ release FAILED at "<step>" — cleanup done |
| B:9-11 (MR open/approve/merge) | hard fail; close MR + delete branch | ❌ release FAILED at MR <step> — MR closed, branch deleted. <link> |
| B:12-13 (Jenkins trigger/poll) | hard fail; do NOT close the GitLab MR (already merged, can't unmerge) | ❌ release FAILED at Jenkins build — gitlab MR <iid> already merged. Jenkins console: <link> |
| B:14 (customer URL not responding) | hard fail | ❌ release FAILED — Jenkins reported SUCCESS but flights-ui.devwebzavod.ru not responding. Investigate. |
| B:15 (e2e on customer URL) | hard fail; no auto-rollback (we can't), notify with logs | ❌ release FAILED at e2e on customer URL — gitlab MR <iid> merged + Jenkins #<n> green but app misbehaves. Playwright report attached. |
Recovery from B:12-13 failure (awkward case)
GitLab MR is already merged but customer site has previous code. Recovery is manual:
- Open Jenkins UI → click "Build Now" on the same job, or
- Push a new commit to GitLab to re-trigger Jenkins polling.
A "retry just the Jenkins half" workflow file is not included — the manual path is rare enough to not warrant the abstraction.
Implementation pattern
Both workflows end with a pair of finalize steps, one gated on success() and one on failure(), so a notification is sent whichever way the run ends:
- name: Notify (success)
  if: success()
  run: scripts/ci/notify-telegram.sh ok ci-deploy
- name: Notify (failure)
  if: failure()
  run: scripts/ci/notify-telegram.sh fail ci-deploy "${{ steps.failed_step.outputs.name }}"
Step IDs propagate the failed-step name. Slightly verbose but no magic.
Artifacts on failure
Always uploaded on failure (never on success). 7-day retention.
- Workflow A:
docker logs flights-web --tail 500,playwright-report/(if e2e ran), nginx error log tail. - Workflow B:
playwright-report/(if e2e ran), the rendered MR/Jenkins API responses (for debugging integration), tail ofgit logon the sync branch.
Deliberately NOT done
- No PagerDuty / SMS escalation. Telegram is enough.
- No automatic re-runs on flake. A flaky e2e fail = real signal worth investigating.
- No "previous run was already failing, suppress notification" logic. Spam is a feature; silence is dangerous.
- No Slack/email mirror. Single channel.
Testing the pipeline itself
Layer 1 — Unit tests for the testable bits
Bash scripts under scripts/ci/ with logic worth testing:
- notify-telegram.sh: --dry-run prints the rendered payload to stdout instead of POSTing. Tests verify the three message shapes.
- wait-for-url.sh: testable with a local python3 -m http.server; assert exit codes for 200, 404, network failure, timeout.
- jenkins-trigger-and-wait.sh: --mock-mode reads from tests/ci/fixtures/jenkins-success-flow.json. Tests verify queue-then-build polling + SUCCESS / FAILURE / UNSTABLE / timeout branches.
- deploy-container.sh: --dry-run prints docker commands instead of running them. Test verifies alias-swap order.
Run via make test-ci or as a step in Workflow A itself (~10 sec total).
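As an illustration of why wait-for-url.sh is easy to unit test, the retry loop can be kept separate from the probe it runs. The sketch below is an assumption about one workable structure, not the actual script: wait_for and probe_url are hypothetical names. The curl flags used are standard: -f turns HTTP status 400 and above into a non-zero exit, and --max-time bounds each attempt.

```shell
#!/bin/sh
# wait_for <max-attempts> <delay-seconds> <command...>
# Runs the command until it succeeds or the attempts run out.
# Both function names are assumptions for illustration.
wait_for() {
  max="$1"; delay="$2"; shift 2
  attempt=1
  while [ "$attempt" -le "$max" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -lt "$max" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# URL probe: fails on connection errors, timeouts, and HTTP status >= 400.
probe_url() {
  curl -fsS -o /dev/null --max-time 10 "$1"
}

# Example: wait_for 30 2 probe_url https://ui-dashboard.gnerim.ru/
# (behind basic auth the probe would also need curl's --user option)
```

Tests can then pass any command as the probe, for example one that fails twice before succeeding, without starting an HTTP server.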
Layer 2 — Workflow A first-run validation (staged rollout)
Plan for the first run to fail. Stage the rollout:
- PR #1: adds workflows + scripts + console-gate fixture + nginx config + deployment/README.md. Does not delete .github/workflows/. Workflow A starts firing on push.
- First few runs will fail at: portability of e2e specs to remote BASE_URL, missing BASE_URL overrides in test setup, console-gate revealing real warnings to allowlist, network/DNS/route gotchas. Each failure → fix → re-push.
- PR #2: deletes .github/workflows/ and any compatibility shims, only after A has run green for a few consecutive commits.
- PR #3: Workflow B. First run triggered manually. Once it works once end-to-end, it's "live".
Budget 1-2 dev days of "debug the pipeline against reality" after merging PR #1. Expecting green on first run is wrong.
Layer 3 — Documented rehearsal of failure paths
deployment/README.md includes recipes for inducing each failure path:
| Failure | How to induce |
|---|---|
| A e2e fail → rollback | push a commit that adds console.error('test') to App.tsx. Verify rollback. |
| A rollback fail | break the :previous tag manually (docker rmi flights-web:previous), trigger an e2e fail. |
| B blocked on A not green | push a commit that fails A, then trigger B for that SHA. |
| B Jenkins poll timeout | reduce JENKINS_TIMEOUT to 30s and trigger B. |
| B e2e fail on customer URL | manually break the customer URL (trigger an old Jenkins build), then run B without a code change. |
Run at least the "rollback" and "release blocked" rehearsals once before declaring the pipeline production-grade.
Console-allowlist seeding strategy
- Don't pre-seed. Run e2e once locally against https://ui-dashboard.gnerim.ru/ (or http://localhost:8081), capture every console message, decide which are real bugs vs allowlist material.
- Each allowlist entry has a reason field, lint-enforced.
- Re-evaluate quarterly via scripts/ci/audit-console-allowlist.sh; entries that didn't fire are dead config.
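The "entries that didn't fire" check is a set difference between allowlist patterns and the console messages actually seen. A minimal sketch of that comparison follows, assuming the patterns have already been extracted from the JSON file into a plain text file (one extended regex per line) and that a run with the allowlist disabled has written its console messages to a second file. Both file formats and the function name dead_patterns are assumptions, not the real script's interface.

```shell
#!/bin/sh
# dead_patterns <patterns-file> <messages-file>
# Prints every pattern that matched none of the captured console messages.
# File formats and the function name are assumptions for illustration.
dead_patterns() {
  patterns_file="$1"; messages_file="$2"
  while IFS= read -r pat || [ -n "$pat" ]; do
    if [ -z "$pat" ]; then
      continue                      # ignore blank lines
    fi
    if ! grep -E -q -e "$pat" "$messages_file"; then
      printf '%s\n' "$pat"
    fi
  done < "$patterns_file"
}
```

Any pattern this prints no longer suppresses anything and is a candidate for removal at the quarterly review.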
Deliberately NOT tested
- The actual GitLab API integration (no way to mock GitLab without GitLab; first B run is the test).
- The actual Jenkins API integration (same; polling logic is tested via mock-mode).
- The Telegram bot (tested via --dry-run; failed delivery observable as "no message arrived").
Future seam: container registry
When a private registry comes online (eventual registry.gnerim.ru), changes:
- Workflow A: replace local docker tag flights-web:current + docker run with:
    - run: docker push registry.gnerim.ru/flights-web:${GITHUB_SHA}
    - run: ssh deploy@pve-201 'docker pull registry.gnerim.ru/flights-web:<sha> && ...'
  Runner can move off pve-201, to anywhere with reach to the registry + an SSH key to the deploy host.
- Add secrets REGISTRY_USER, REGISTRY_PASS, DEPLOY_SSH_KEY.
- Rollback semantics identical: docker pull <prev-sha> instead of relying on local cache.
- No script rewrites: scripts/ci/deploy-container.sh accepts image-source as a parameter from day one. flights-web:current / :previous becomes <repo>:<sha> / <repo>:<prev-sha>, same shape.
Open questions and known gaps
- GITLAB_PROJECT_ID: numeric ID is unknown until the PAT exists. scripts/ci/check-gitlab-project.sh resolves it post-PAT.
- The 9 untracked snap-*.yml files at repo root look like throwaway parity-snapshot artifacts. Add to .gitignore or commit? Verify before flipping pipeline on (prereq #14).
- e2e portability to remote BASE_URL: existing specs were written against localhost. Many likely hardcode paths or rely on dev-only state. Layer 2 of testing strategy budgets time for this.
- Initial console-allowlist content: empty starter; will be populated on first runs ("we'll figure it out in future" per design discussion).
Addendum 2026-04-27 — routing change + manual Jenkins trigger
Two design pivots discovered during Phase B prerequisites work:
Routing: ssh -L tunnel instead of static-route + NAT
Original design: static route on pve-201 pushes <TIM-CIDR> via webzavod's LAN IP, webzavod NATs LAN→ppp0, /etc/hosts pins flights.test.aeroflot.ru to an internal A record.
Discovered:
- flights.test.aeroflot.ru resolves to public IPs from both pve-201 and webzavod (no internal A record exists).
- pve-201 reaches the public IP directly with HTTP 200, but the response is a WAF interstitial: the customer WAF returns 200/HTML for non-corp egress and 401/JSON-ready for corp egress.
- The same URL from webzavod returns 401 (real backend): webzavod's ppp0 egress IP is whitelisted.
New design: persistent ssh -L 127.0.0.1:8443:flights.test.aeroflot.ru:443 from pve-201 to webzavod via systemd unit deployment/systemd/flights-tim-tunnel.service. nginx proxies /api/ and /map/api/ to https://127.0.0.1:8443 with Host and proxy_ssl_name overrides so SNI/cert validation still target the real hostname.
Webzavod-side authorisation pinned with command="exit 1",no-pty,no-X11-forwarding,no-agent-forwarding,no-user-rc,permitopen="flights.test.aeroflot.ru:443" — the key cannot open a shell, agent-forward, or forward any other host:port.
Trade-offs vs. original:
- ✅ No webzavod kernel changes (no ip_forward toggle, no MASQUERADE rule, no iptables-persistent).
- ✅ No /etc/hosts pin needed (DNS resolution happens on webzavod, where the real IPs work).
- ✅ Recoverable in seconds (systemctl restart flights-tim-tunnel).
- ⚠ Per-host SSH tunnel: adding another upstream means another -L line. Currently only one upstream.
- ⚠ Discovered OpenSSH 9.6 quirk: restrict + permitopen causes TLS handshake to EOF mid-stream. Using explicit no-* options instead of restrict works.
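Because the WAF answers 200 to non-corp egress, a tunnel smoke test has to assert the 401 from the real backend, not mere reachability. The sketch below shows one way to check that from pve-201. The function names and the TIM_HOST and TUNNEL_PORT variables are assumptions, and the /swagger/ path is borrowed from the original smoke test. curl's --connect-to option redirects the TCP connection to the local tunnel end while the URL, SNI and Host header keep the real hostname, and -w '%{http_code}' prints 000 when no HTTP response was received.

```shell
#!/bin/sh
# Sketch only: TIM_HOST, TUNNEL_PORT and the function names are assumptions.
TIM_HOST="${TIM_HOST:-flights.test.aeroflot.ru}"
TUNNEL_PORT="${TUNNEL_PORT:-8443}"

# Map an HTTP status to what it says about the egress path, per the findings
# above: 401 is the real backend, 200 is the WAF interstitial.
classify_status() {
  case "$1" in
    401) echo "backend" ;;
    200) echo "waf-interstitial" ;;
    000) echo "unreachable" ;;
    *)   echo "unexpected" ;;
  esac
}

# Send the request through the local tunnel end while keeping the real
# hostname for SNI, the Host header, and certificate validation.
tunnel_smoke() {
  code="$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
    --connect-to "$TIM_HOST:443:127.0.0.1:$TUNNEL_PORT" \
    "https://$TIM_HOST/swagger/" || true)"
  classify_status "$code"
}
```

A health check could treat anything other than "backend" as a reason to restart flights-tim-tunnel or to investigate the egress path.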
Workflow B: drop Jenkins automation
Original design: Workflow B triggers Jenkins via remote-build token, polls build status via authenticated API, then runs e2e against customer URL.
Constraint: operator does not have Jenkins job-configure access (no remote-trigger token) nor Jenkins user API token access. Authenticated API trigger and polling are not possible without admin involvement.
New design:
- Workflow B (release.yml): sync to GitLab, open MR, auto-approve, auto-merge, stop. Telegram notify includes the Jenkins job URL with instructions to trigger by hand.
- Workflow C (release-verify.yml): workflow_dispatch only. Operator runs manually after Jenkins finishes. Probes customer URL until reachable, runs Playwright e2e against http://flights-ui.devwebzavod.ru with the console-error gate, notifies Telegram.
Removed from the repo:
- scripts/ci/jenkins-trigger-and-wait.sh
- tests/ci/test-jenkins-trigger.sh
- tests/ci/fixtures/jenkins-{success,failure}-flow.json
- JENKINS_USER, JENKINS_API_TOKEN, JENKINS_TRIGGER_TOKEN secrets
Trade-off: lose automated end-to-end pipeline. Acceptable because (a) operator already triggers Jenkins manually today, (b) the manual step is a checkpoint where build failures surface clearly, (c) future Jenkins API access can swap C back into B without changing the rest of the design.
Other small adjustments
- SSR container loopback port changed from 8081 → 3002 (port 8081 already in use on pve-201 by openwebui).
- notify-telegram.sh now skips cleanly when Telegram secrets are unset (was: hard-fail). Lets the pipeline run end-to-end without TG configured.