
gnezim 03eeddfbf8 CI/CD pipeline: ssh -L tunnel for TIM API + manual Jenkins trigger

Two design pivots discovered during Phase B prerequisites:

Routing: Replace static-route + NAT plan with persistent ssh -L tunnel
from pve-201 to webzavod (deployment/systemd/flights-tim-tunnel.service).
nginx proxies /api/ and /map/api/ to https://127.0.0.1:8443 with SNI/Host
overrides so cert validation still targets the real hostname. No webzavod
kernel changes (no ip_forward/MASQUERADE), no /etc/hosts pin needed.

Workflow B: Drop Jenkins trigger/poll automation (operator lacks Jenkins
job-configure access and user API token access). release.yml now stops
after MR merge with a Telegram message containing the Jenkins job URL.
release-verify.yml (new, workflow_dispatch only) runs the customer-URL
e2e suite once the operator has triggered Jenkins manually and it has
completed.

Other:
- SSR loopback port 8081 -> 3002 (8081 was taken by openwebui on pve-201)
- notify-telegram.sh skips cleanly when TG secrets unset (was: hard-fail)
- README + spec addendum cover the new prereqs and removed steps

2026-04-27 11:58:39 +03:00


CI/CD Pipeline Design — Gitea Actions → pve-201 → GitLab → Jenkins

Status: Approved design, ready for implementation plan.
Date: 2026-04-25
Author: gnezim (with Claude)

Summary

A two-workflow Gitea Actions pipeline that builds and deploys this React SSR app to your own infrastructure (pve-201, behind https://ui-dashboard.gnerim.ru/) on every push, then — on explicit trigger — syncs sources to the customer's GitLab, opens and auto-merges an MR, fires the Jenkins build, and runs end-to-end tests against the customer's dev URL. All notifications via Telegram.

Two workflow files:

ci-deploy.yml — push-triggered. Build → unit tests → Docker build → swap container → e2e on ui-dashboard.gnerim.ru. Auto-rollback to previous image on any post-build failure.
release.yml — manually triggered (UI button or release-* git tag). Verifies ci-deploy is green for the same SHA, then GitLab sync → MR → approve → merge → Jenkins trigger → poll → e2e on flights-ui.devwebzavod.ru. Halts on any failure.

The Gitea runner runs on pve-201 itself, with Docker socket access — no SSH, no registry hop. Image-versioning uses flights-web:<sha> plus moving aliases :current and :previous for one-step rollback. Future migration to a private registry is a config change, not a refactor.

Architecture

┌──────────────────┐  push to main          ┌─────────────────────────────────┐
│  dev pc (you)    │ ─────────────────────► │  git.gnerim.ru (Gitea server)   │
└──────────────────┘  manual / tag push     └────────────────┬────────────────┘
                                                              │ webhook
                                                              ▼
                                              ┌──────────────────────────────┐
                                              │  Gitea Actions runner        │
                                              │  on pve-201 (Docker socket)  │
                                              └──┬───────────────────────────┘
                                                 │
                  ┌──────────────────────────────┼──────────────────────────────┐
                  │                              │                              │
                  ▼ on push                      ▼ on tag/manual                ▼ Telegram
        ┌────────────────────┐         ┌────────────────────┐         ┌────────────────────┐
        │ Workflow A         │         │ Workflow B         │         │ Notify on every    │
        │ ci-deploy.yml      │         │ release.yml        │         │ stage start / end  │
        │                    │         │                    │         │ / failure          │
        │ build & test       │         │ verify A is green  │         └────────────────────┘
        │   ↓                │         │   ↓                │
        │ docker build :SHA  │         │ sync → GitLab MR   │
        │   ↓                │         │   ↓                │
        │ swap container     │         │ approve & merge    │
        │   ↓                │         │   ↓                │
        │ smoke /health      │         │ trigger Jenkins    │
        │   ↓                │         │   ↓                │
        │ playwright e2e on  │         │ poll until SUCCESS │
        │ ui-dashboard       │         │   ↓                │
        │   ↓ on fail        │         │ playwright e2e on  │
        │ rollback to        │         │ flights-ui.devweb  │
        │ :previous          │         │   ↓ on fail        │
        └────────────────────┘         │ halt + dump logs   │
                                       └────────────────────┘

Key invariants

Workflow A is the gatekeeper. Workflow B always queries Gitea for the latest A run for the same commit SHA; if not green, B refuses to start.
Image tag aliases. Every build is tagged flights-web:<sha>. Two moving aliases on the host: flights-web:current (live container source) and flights-web:previous (rollback target). Pruning keeps the last 5 SHA tags + the two aliases.
Container is named flights-web (singleton). Restart sequence: docker stop flights-web && docker rm flights-web && docker run -d --name flights-web --restart unless-stopped -p 127.0.0.1:8081:8080 flights-web:current.
Nginx on pve-201 terminates TLS for ui-dashboard.gnerim.ru and proxies to 127.0.0.1:8081.
All four major stages emit Telegram messages (start / pass / fail). Failure messages include log tail and a clickable link to the Gitea run.
.github/workflows/ files are deleted in PR #2 (not the same PR that adds the new workflows; see "Layer 2 — staged rollout" under "Testing the pipeline itself").
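
The pruning rule in the invariants above (keep the last 5 SHA tags plus the two aliases) can be sketched as a pure filter. This is a minimal sketch, not the real pruning step: the function name `select_prunable_tags` is hypothetical, and it assumes the tag list arrives newest first, which is how `docker images` orders its output by default but is worth verifying on the host.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decide which flights-web tags may be pruned.
# stdin:  one tag per line, newest first, for example from
#         docker images flights-web --format '{{.Tag}}'
# stdout: tags that fall outside the keep window.
# The moving aliases (current, previous) are never printed.
select_prunable_tags() {
  local keep="${1:-5}" seen=0 tag
  while IFS= read -r tag; do
    case "$tag" in
      ""|current|previous|"<none>") continue ;;
    esac
    seen=$((seen + 1))
    if [ "$seen" -gt "$keep" ]; then
      printf '%s\n' "$tag"
    fi
  done
}
```

Each printed tag would then be passed to `docker rmi flights-web:<tag>`. If `:previous` still points at one of those images, `docker rmi` only removes the SHA tag and the image itself stays available for rollback.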

Architectural choices already made

Runner-on-host with direct Docker socket (vs SSH-back-to-localhost or local registry) — least moving parts; runner is in the docker group on pve-201.
Two independent workflow files (vs one file with conditional jobs, vs shared composite action) — short and focused beats clever.
Manual trigger for Workflow B + git tag fallback (vs commit-message keyword) — explicit; can't ship to customer by accident.

Routing, build-args, and access control

The build-args change the most across this design — they go from absolute (TIM hostnames) to relative paths, which moves the burden onto nginx on pve-201.

Routing pve-201 → TIM API

The customer API at https://flights.test.aeroflot.ru/api/* is reachable only through the corp VPN. webzavod (192.168.88.58) on the same LAN as pve-201 (192.168.88.167) already has a working L2TP/IPsec tunnel to TIM via ppp0. The cleanest way to make pve-201 reach TIM is a static route through webzavod, which leverages the existing VPN setup.

One-time host setup (manual, not in workflows):

On webzavod — verify IP forwarding and MASQUERADE on ppp0:

sysctl net.ipv4.ip_forward                          # expect: 1
sudo iptables -t nat -L POSTROUTING -nv | grep ppp0 # expect: MASQUERADE

If not set:

echo 'net.ipv4.ip_forward=1' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
sudo iptables -t nat -A POSTROUTING -o ppp0 -j MASQUERADE
sudo apt install iptables-persistent && sudo netfilter-persistent save

On pve-201 — add a persistent static route to TIM via webzavod:

# /etc/netplan/01-routes.yaml
network:
  version: 2
  ethernets:
    eth0:                          # rename to actual NIC name
      routes:
        - to: 172.18.0.0/16
          via: 192.168.88.58

sudo netplan apply

On pve-201 — pin TIM hostnames to reachable A records (mirrors the duplicate-DNS workaround documented in ~/_projects/gnezim/knowledge/projects/work/tim/ui-dashboard/mac-via-windows-jump.md):
```
# /etc/hosts
172.18.0.121 flights.test.aeroflot.ru
```
Smoke test from pve-201:
```
curl -v https://flights.test.aeroflot.ru/swagger/  # expect: 401 in ~70ms
```
Failure here means routing is broken — fix before any pipeline run.

nginx vhost on pve-201

server {
  listen 443 ssl http2;
  server_name ui-dashboard.gnerim.ru;
  # ssl_certificate, ssl_certificate_key — existing certbot config

  auth_basic "ui-dashboard";
  auth_basic_user_file /etc/nginx/htpasswd/ui-dashboard;

  location / {
    proxy_pass http://127.0.0.1:8081;
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_set_header X-Real-IP $remote_addr;
  }

  location /api/ {
    auth_basic off;                                   # API path open behind nginx — basic auth gates the HTML, not the API
    proxy_pass https://flights.test.aeroflot.ru;
    proxy_set_header Host flights.test.aeroflot.ru;
    proxy_ssl_server_name on;
  }

  location /map/api/ {
    auth_basic off;
    proxy_pass https://flights.test.aeroflot.ru;
    proxy_set_header Host flights.test.aeroflot.ru;
    proxy_ssl_server_name on;
  }
}

This file is checked in at deployment/nginx/ui-dashboard.gnerim.ru.conf and symlinked into /etc/nginx/sites-enabled/ by hand on first setup.

Dockerfile build-args become relative

Workflow A passes:

--build-arg MAP_TILE_URL=/map/api/tile/{z}/{x}/{y}.jpeg
--build-arg API_BASE_URL=/api

Same-origin URLs in the client bundle, no CORS, the TIM hostname never leaks into the browser. The customer's prod build (Jenkins) keeps its own absolute URLs because their nginx is configured differently — independent.

Basic auth and how e2e bypasses it

Credentials stored as Gitea Actions secrets BASIC_AUTH_USER and BASIC_AUTH_PASS.
Workflow A's deploy step regenerates /etc/nginx/htpasswd/ui-dashboard (using htpasswd -bn) and runs nginx -s reload. Rotating creds = re-run Workflow A.
Public smoke check in Workflow A step 10 hits https://ui-dashboard.gnerim.ru/ with --user $BASIC_AUTH_USER:$BASIC_AUTH_PASS to validate TLS + nginx + auth + container in one curl. Catches nginx misconfig.
Full e2e in Workflow A step 11 runs against BASE_URL=http://127.0.0.1:8081 (loopback, skips nginx and auth). Faster, no creds in test, regression in nginx layer already caught by step 10.

Workflow A — `ci-deploy.yml`

Triggers: push to main; workflow_dispatch for re-runs.

Single sequential job on the pve-201 runner:

#	Step	What it does	On failure
1	Checkout	`actions/checkout@v4`, full history	hard fail
2	Setup pnpm + Node 24	from `.nvmrc`	hard fail
3	Restore pnpm cache	`~/.pnpm-store` keyed on `pnpm-lock.yaml`	continue (cache miss is fine)
4	Install deps	`pnpm install --frozen-lockfile`	hard fail → Telegram
5	Typecheck + lint + unit tests	`pnpm typecheck && pnpm lint && pnpm test`	hard fail → Telegram
6	Build SSR image	`docker build -f Dockerfile.react -t flights-web:${GITHUB_SHA} --build-arg MAP_TILE_URL=/map/api/tile/{z}/{x}/{y}.jpeg --build-arg API_BASE_URL=/api .`	hard fail → Telegram
7	Tag previous-current as previous	`docker tag flights-web:current flights-web:previous` (skip if first deploy)	continue
8	Tag SHA as current	`docker tag flights-web:${GITHUB_SHA} flights-web:current`	hard fail
9	Restart container	`scripts/ci/deploy-container.sh swap`	trigger rollback (step 12)
10	Wait for health	`scripts/ci/wait-for-url.sh https://ui-dashboard.gnerim.ru/ 30 2` (with basic auth)	trigger rollback
11	Run Playwright e2e	`BASE_URL=http://127.0.0.1:8081 pnpm test:e2e` (full suite + console-error gate)	trigger rollback
12	Rollback (only if 9/10/11 failed)	`scripts/ci/deploy-container.sh rollback` — runs `:previous`, swaps aliases back, verifies health	always Telegram with logs
13	Prune old images	keep last 5 `flights-web:*` SHA tags + the two aliases	continue
14	Telegram (success or failure)	`if: always()` — `notify-telegram.sh ok\|fail ci-deploy <step-name>`	continue

Estimated runtime: 6–10 min cached; 12–15 min cold.
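
Steps 9 and 12 both go through `scripts/ci/deploy-container.sh`. The following is a rough sketch of how its `swap` and `rollback` subcommands could be shaped, assuming the alias tags were already moved by steps 7 and 8. The environment variable names (`IMAGE_REPO`, `CONTAINER`, `HOST_BIND`, `DRY_RUN`) are illustrative assumptions, dry-run is modelled as a variable rather than the `--dry-run` flag, and the health check that the real rollback performs is left out.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of scripts/ci/deploy-container.sh; not the real script.
IMAGE_REPO="${IMAGE_REPO:-flights-web}"    # image source, parameterized for a future registry
CONTAINER="${CONTAINER:-flights-web}"      # singleton container name
HOST_BIND="${HOST_BIND:-127.0.0.1:8081}"   # host-side bind from the design
DRY_RUN="${DRY_RUN:-0}"

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Replace the singleton container with one started from the given tag.
start_from() {
  local tag="$1"
  # Tolerate a missing container (first deploy).
  run docker rm -f "$CONTAINER" || true
  run docker run -d --name "$CONTAINER" --restart unless-stopped \
    -p "${HOST_BIND}:8080" "${IMAGE_REPO}:${tag}"
}

deploy_container() {
  case "${1:-}" in
    swap)
      start_from current
      ;;
    rollback)
      # Point :current back at :previous, then restart from it.
      run docker tag "${IMAGE_REPO}:previous" "${IMAGE_REPO}:current"
      start_from current
      ;;
    *)
      echo "usage: deploy-container.sh swap|rollback" >&2
      return 2
      ;;
  esac
}
```

With `DRY_RUN=1`, `deploy_container rollback` prints the three docker commands in order, which is the property the Layer 1 alias-swap test is meant to check.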

Console-error gate (step 11):

A Playwright fixture tests/e2e/fixtures/console-gate.ts attaches a listener to every page, collects all console.error and console.warn messages, filters out anything matching patterns in tests/e2e/fixtures/console-allowlist.json, and asserts the remaining list is empty in afterEach. Per the agreed policy: zero tolerance with explicit allowlist. Each allowlist entry has a reason field; lint enforces non-empty.

Workflow B — `release.yml`

Triggers: workflow_dispatch (manual button); push of tags matching release-* (e.g., git tag release-2026-04-25 && git push --tags).

Single sequential job on the pve-201 runner:

#	Step	What it does	On failure
1	Checkout (full history + tags)	needed for sync to operate on real source tree	hard fail
2	Verify Workflow A is green for this SHA	query Gitea API `GET /repos/{owner}/{repo}/actions/runs?head_sha=<sha>&workflow=ci-deploy.yml`; require status=success	hard fail → Telegram "release blocked: A not green"
3	Setup pnpm + Node 24, install deps	needed for paranoid lint/test re-run	hard fail
4	Re-run lint + typecheck + unit tests	belt-and-suspenders: catch flakiness; confirms commit still passes locally before sending to customer	hard fail → Telegram
5	Clone GitLab target into temp dir	`git clone https://oauth2:$GITLAB_PAT@teamscore.gitlab.yandexcloud.net/aeroflot2/flights-front.git /tmp/flights-front`	hard fail
6	Run sync (CI variant)	`scripts/ci/sync-to-gitlab.sh /tmp/flights-front/Aeroflot.Flights.Front`	hard fail
7	Commit on a feature branch	`cd /tmp/flights-front && git checkout -b auto/sync-<sha>` then `git add -A && git commit -m "auto: sync from gitea <sha>"` (skip if no diff)	hard fail; if no diff → log "nothing to sync" + skip 8-13 + Telegram info
8	Push branch	`git push -u origin auto/sync-<sha>`	hard fail
9	Open MR	`POST /api/v4/projects/<id>/merge_requests` with `source_branch=auto/sync-<sha>`, `target_branch=main`, title `"auto: sync from gitea <sha-short>"`, description with link back to Gitea run	hard fail
10	Approve MR	`POST /api/v4/projects/<id>/merge_requests/<iid>/approve`	hard fail; on 401/403 → log explicit "self-approve blocked, configure project to allow author approval"
11	Merge MR	`PUT /api/v4/projects/<id>/merge_requests/<iid>/merge` with `merge_when_pipeline_succeeds=false`, `should_remove_source_branch=true`, `squash=true`	on fail → close MR + delete branch → Telegram
12	Trigger Jenkins	`curl -u "$JENKINS_USER:$JENKINS_API_TOKEN" "http://jenkins.yc.devwebzavod.ru:8080/job/Aeroflot2/job/Flights-Front-Dev/build?token=$JENKINS_TRIGGER_TOKEN"` (double quotes so the shell expands the token variable; returns the queue item URL in the `Location` header)	hard fail
13	Poll Jenkins for completion	`scripts/ci/jenkins-trigger-and-wait.sh` — parses queue URL, gets build URL once it leaves queue, polls `<build_url>/api/json` for `result != null`. Timeout: 30 min. Required: `result == "SUCCESS"`.	hard fail (`UNSTABLE`/`FAILURE`/timeout) → Telegram with Jenkins console URL
14	Wait for the customer URL to update	`scripts/ci/wait-for-url.sh http://flights-ui.devwebzavod.ru/ru-ru/onlineboard 60 5` (5 min window)	hard fail
15	Run Playwright e2e against `http://flights-ui.devwebzavod.ru/`	`BASE_URL=http://flights-ui.devwebzavod.ru pnpm test:e2e` (full suite + console-error gate)	hard fail → Telegram with playwright report attached
16	Telegram (success or failure)	`if: always()` — final notification with full chain links	continue

Estimated runtime: 15–25 min (most of it Jenkins build + e2e).
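
Step 7's "skip if no diff" rule can be sketched with `git diff --cached --quiet`, which exits 0 when nothing is staged. This is only an illustration: the function name `commit_if_changed` is hypothetical, and the real workflow would also need to skip steps 8-13 when it sees the "nothing to sync" result.

```shell
#!/usr/bin/env bash
# Hypothetical helper for Workflow B step 7: commit only when the sync changed something.
commit_if_changed() {
  local repo_dir="$1" message="$2"
  git -C "$repo_dir" add -A
  # Exit status 0 from diff --cached --quiet means the index matches HEAD.
  if git -C "$repo_dir" diff --cached --quiet; then
    echo "nothing to sync"
    return 0
  fi
  git -C "$repo_dir" commit --quiet -m "$message"
  echo "committed"
}
```

A caller could compare the printed word against "committed" to decide whether to push the branch and open the MR.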

Self-approve note. If GitLab's project setting "Prevent approvals by author" is enabled, step 10 returns 401 with "You cannot approve your own merge request". Prereq #9 in "One-time manual setup" unchecks this. If you can't (org policy), fallback is to skip step 10 and rely on merge_when_pipeline_succeeds=false + branch protection allowing maintainer push.
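
Steps 9-11 are three GitLab REST calls against the endpoints listed in the table. The sketch below shows one possible shape for them, with a dry-run mode so the request lines can be inspected without a token. The function names and the `DRY_RUN` variable are assumptions, the 7-character short SHA is a guess at what `<sha-short>` means, and extracting the MR `iid` from the create response is deliberately left out.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of Workflow B steps 9-11; names are illustrative.
GITLAB_API="${GITLAB_API:-https://teamscore.gitlab.yandexcloud.net/api/v4}"
DRY_RUN="${DRY_RUN:-0}"

# Call the GitLab API. With DRY_RUN=1, print "<METHOD> <url> <params>" instead
# (the token is never printed).
gitlab_call() {
  local method="$1" path="$2"
  shift 2
  local url="${GITLAB_API}/projects/${GITLAB_PROJECT_ID}${path}"
  if [ "$DRY_RUN" = "1" ]; then
    local line="$method $url"
    if [ "$#" -gt 0 ]; then line="$line $*"; fi
    printf '%s\n' "$line"
    return 0
  fi
  local args=() kv
  for kv in "$@"; do args+=(--data-urlencode "$kv"); done
  curl --fail --silent --show-error --request "$method" \
    --header "PRIVATE-TOKEN: ${GITLAB_PAT}" "${args[@]}" "$url"
}

# $1 = full commit SHA
open_mr() {
  gitlab_call POST "/merge_requests" \
    "source_branch=auto/sync-$1" "target_branch=main" \
    "title=auto: sync from gitea ${1:0:7}"
}

# $1 = merge request iid
approve_mr() { gitlab_call POST "/merge_requests/$1/approve"; }

merge_mr() {
  gitlab_call PUT "/merge_requests/$1/merge" \
    "merge_when_pipeline_succeeds=false" \
    "should_remove_source_branch=true" "squash=true"
}
```

If self-approval is blocked by org policy, the fallback described above amounts to calling `open_mr` and `merge_mr` and never calling `approve_mr`.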

Jenkins polling race. Naive polling has a race where the queue item hasn't materialized into a build yet. jenkins-trigger-and-wait.sh polls the queue URL first, then the build URL once it appears.
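
The two-phase polling can be sketched as follows, as an illustration of this original design only. It assumes the queue item JSON gains an `executable` object with a `url` once a build number is assigned, and that the build JSON carries `"result": null` while running. The function names are hypothetical, `fetch_json` is the single network touchpoint so a test can redefine it (a stand-in for `--mock-mode`), and the `sed` extraction is deliberately naive; `jq` would be more robust where it is installed.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the queue-then-build polling; not the real script.

# Only network touchpoint; tests may redefine this function.
fetch_json() {
  curl --fail --silent --user "${JENKINS_USER}:${JENKINS_API_TOKEN}" "$1"
}

# Extract the build URL from a queue item document (empty if not assigned yet).
extract_build_url() {
  sed -n 's/.*"executable"[^}]*"url"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Extract a finished build's result (empty while "result" is null).
extract_result() {
  sed -n 's/.*"result"[[:space:]]*:[[:space:]]*"\([A-Z_]*\)".*/\1/p' | head -n 1
}

# $1 = queue item URL from the Location header (ends with a slash)
# $2 = max polls (360 polls x 5 s is about 30 min), $3 = delay in seconds
wait_for_jenkins() {
  local queue_url="$1" max_polls="${2:-360}" delay="${3:-5}"
  local build_url="" result="" i
  for ((i = 0; i < max_polls; i++)); do
    if [ -z "$build_url" ]; then
      # Phase 1: wait for the queue item to turn into a build.
      build_url="$(fetch_json "${queue_url}api/json" | extract_build_url)"
    else
      # Phase 2: wait for the build to finish.
      result="$(fetch_json "${build_url}api/json" | extract_result)"
      if [ -n "$result" ]; then
        echo "$result"
        [ "$result" = "SUCCESS" ]
        return
      fi
    fi
    sleep "$delay"
  done
  echo "TIMEOUT"
  return 1
}
```

The function prints the final result and returns 0 only for `SUCCESS`, so `UNSTABLE`, `FAILURE`, and a timeout all fail the step.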

The auto/sync-<sha> branch lives forever in GitLab unless step 11 succeeds (which deletes it via should_remove_source_branch=true). On step 11 failure, the script closes the MR + deletes the branch.

The Gitea runner needs network reachability to TIM for steps 12-15 (Jenkins host, customer URL). That works automatically once the static route from "Routing pve-201 → TIM API" is in place — the runner shares pve-201's routes.

Prerequisites and secrets

One-time manual setup

#	What	Where	Why
1	Verify webzavod IP forwarding + MASQUERADE on `ppp0`	webzavod	see "Routing pve-201 → TIM API"
2	Add static route `172.18.0.0/16 via 192.168.88.58` in netplan	pve-201	see "Routing pve-201 → TIM API"
3	Pin `172.18.0.121 flights.test.aeroflot.ru` in `/etc/hosts`	pve-201	duplicate DNS gotcha — see "Routing pve-201 → TIM API"
4	Verify from pve-201: `curl -v https://flights.test.aeroflot.ru/swagger/` returns 401 (typically <300ms)	pve-201	smoke test
5	Install nginx vhost from `deployment/nginx/ui-dashboard.gnerim.ru.conf`	pve-201	see "nginx vhost on pve-201"
6	Confirm Gitea runner has docker socket access (`docker ps` from runner user, no sudo)	pve-201	required for runner-on-host deploy
7	Confirm Gitea runner can reach `git.gnerim.ru`, `teamscore.gitlab.yandexcloud.net`, `jenkins.yc.devwebzavod.ru:8080`, `flights-ui.devwebzavod.ru`	pve-201	last two via static route from #2
8	Create GitLab Personal Access Token with scopes `api`, `write_repository`	GitLab → Settings → Access Tokens	Workflow B steps 9-11
9	Uncheck "Prevent approvals by author" on the GitLab project	GitLab → flights-front → Settings → Merge requests → Approval rules	so Workflow B step 10 works
10	Configure Jenkins remote trigger token on `Aeroflot2/Flights-Front-Dev` job	Jenkins → job → Configure → "Trigger builds remotely"	Workflow B step 12
11	Generate Jenkins API token for your user	Jenkins → user → Configure → API Token	Workflow B steps 12-13
12	Create the Telegram bot (or reuse existing) and capture chat_id	Telegram BotFather	all notifications
13	Pick + reserve port `:8081` on pve-201 (or substitute another free port consistently)	pve-201	container's host-side bind
14	Clean uncommitted work in this repo before flipping the switch	dev pc	the first push to `main` after merging the pipeline will fire Workflow A on whatever's in `main`
15	Run `scripts/ci/check-gitlab-project.sh` once after creating the PAT	dev pc	captures numeric `GITLAB_PROJECT_ID` for the secret + verifies approval-rule config

Gitea Actions secrets

Stored at repo → Settings → Actions → Secrets. Workflows reference as ${{ secrets.NAME }}.

Secret	Used in	Notes
`BASIC_AUTH_USER`	Workflow A (deploy)	nginx htpasswd; rotate by re-running A
`BASIC_AUTH_PASS`	Workflow A (deploy)	same
`MAP_TILE_URL`	Workflow A (build)	default `/map/api/tile/{z}/{x}/{y}.jpeg` — secret so it can be overridden per env
`API_BASE_URL`	Workflow A (build)	default `/api`
`GITLAB_PAT`	Workflow B (steps 5, 8-11)	from prereq #8
`GITLAB_PROJECT_ID`	Workflow B (steps 9-11)	numeric, from prereq #15
`JENKINS_USER`	Workflow B (steps 12-13)	username
`JENKINS_API_TOKEN`	Workflow B (steps 12-13)	from prereq #11
`JENKINS_TRIGGER_TOKEN`	Workflow B (step 12)	from prereq #10
`TELEGRAM_BOT_TOKEN`	both workflows	from prereq #12
`TELEGRAM_CHAT_ID`	both workflows	DM or group

What lives in plain repo files (not secrets)

.gitea/workflows/ci-deploy.yml and .gitea/workflows/release.yml — public, parameterized via secrets.
scripts/ci/sync-to-gitlab.sh — refactored from sync-to-flights-front.sh. The original becomes a thin wrapper that calls this with the local sibling-dir default.
scripts/ci/notify-telegram.sh — reads TELEGRAM_BOT_TOKEN/TELEGRAM_CHAT_ID from env. Has --dry-run.
scripts/ci/jenkins-trigger-and-wait.sh — polling logic for B steps 12-13. Has --mock-mode.
scripts/ci/wait-for-url.sh — generic curl-with-retry.
scripts/ci/deploy-container.sh — swap and rollback subcommands. Has --dry-run.
scripts/ci/install-htpasswd.sh — renders htpasswd + reloads nginx.
scripts/ci/check-gitlab-project.sh — one-shot setup helper (not used by workflows).
tests/e2e/fixtures/console-gate.ts — Playwright fixture.
tests/e2e/fixtures/console-allowlist.json — empty starter; grows on first runs.
deployment/nginx/ui-dashboard.gnerim.ru.conf — nginx vhost.
deployment/README.md — bootstrap runbook + failure-path rehearsal recipes.

What gets deleted (in PR #2, not PR #1)

.github/workflows/ci.yml
.github/workflows/deploy.yml

Scripts to add

Path	Purpose	Approx LOC
`scripts/ci/sync-to-gitlab.sh`	Refactored from `sync-to-flights-front.sh`; takes target dir as required arg, no `make`-related output.	~150
`scripts/ci/notify-telegram.sh`	`notify-telegram.sh <ok\|fail> <stage> [<extra-context>]`; HTML mode; failure messages include Gitea run URL.	~40
`scripts/ci/jenkins-trigger-and-wait.sh`	Triggers, parses `Location`, polls queue then build, exits 0 only on `SUCCESS`.	~80
`scripts/ci/wait-for-url.sh`	`wait-for-url.sh <url> [<max-attempts>] [<delay>]`.	~25
`scripts/ci/deploy-container.sh`	`swap` and `rollback` subcommands. Encapsulates the alias dance + health check. Image source parameterized so registry migration is a config flip.	~70
`scripts/ci/install-htpasswd.sh`	Renders `/etc/nginx/htpasswd/ui-dashboard` from env + `nginx -s reload`.	~15
`scripts/ci/check-gitlab-project.sh`	One-shot: print numeric project ID + approval rule config + self-approve allowed (yes/no).	~25
`scripts/ci/audit-console-allowlist.sh`	Run e2e with allowlist disabled, report which entries didn't fire (dead config).	~30
`tests/e2e/fixtures/console-gate.ts`	Playwright fixture for the console-error gate.	~50
`tests/e2e/fixtures/console-allowlist.json`	Empty starter `{ "patterns": [] }`.	n/a
`.gitea/workflows/ci-deploy.yml`	Workflow A.	~80
`.gitea/workflows/release.yml`	Workflow B.	~100
`deployment/nginx/ui-dashboard.gnerim.ru.conf`	nginx vhost from "nginx vhost on pve-201".	~30
`deployment/README.md`	Setup runbook + failure-path rehearsals.	~200
`tests/ci/*.bats` (or shell)	Unit tests for the testable scripts.	~80
`tests/ci/fixtures/jenkins-success-flow.json`	Mock fixture for `jenkins-trigger-and-wait.sh --mock-mode`.	~40

Failure handling and notifications

Telegram message shapes

All messages use parse_mode=HTML.

Start (one per workflow run):

🚀 ci-deploy started
commit: abc1234 — fix: schedule width regression
gitea run: <link>

Success:

✅ ci-deploy passed (8m 42s)
commit: abc1234 — fix: schedule width regression
deployed: https://ui-dashboard.gnerim.ru/
gitea run: <link>

Failure:

❌ ci-deploy FAILED at step "Run Playwright e2e" (6m 18s)
commit: abc1234 — fix: schedule width regression
gitea run: <link>

last 30 lines of step output:
<pre>... e2e log tail ...</pre>

artifacts:
- container logs
- playwright report

Workflow B failures include MR URL, Jenkins build URL, customer URL as appropriate.
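
A minimal sketch of how `notify-telegram.sh` could render and send these shapes, using the Bot API `sendMessage` method with `parse_mode=HTML`. Several details are assumptions: `GITEA_RUN_URL` is a hypothetical variable the workflow would export, the durations and commit lines from the shapes above are omitted, and any log tail placed in the text would need `<`, `>`, and `&` escaped for HTML mode.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of notify-telegram.sh [--dry-run] <ok|fail> <stage> [<extra-context>].
notify_telegram() {
  local dry_run=0
  if [ "${1:-}" = "--dry-run" ]; then
    dry_run=1
    shift
  fi
  local status="${1:-}" stage="${2:-}" extra="${3:-}" text

  case "$status" in
    ok)
      text="✅ ${stage} passed"
      ;;
    fail)
      text="❌ ${stage} FAILED"
      if [ -n "$extra" ]; then text="${text} at step \"${extra}\""; fi
      ;;
    *)
      echo "usage: notify-telegram.sh [--dry-run] <ok|fail> <stage> [<extra>]" >&2
      return 2
      ;;
  esac
  # GITEA_RUN_URL is an assumed variable, not something Gitea sets by itself.
  if [ -n "${GITEA_RUN_URL:-}" ]; then
    text="${text}"$'\n'"gitea run: ${GITEA_RUN_URL}"
  fi

  if [ "$dry_run" = "1" ]; then
    printf '%s\n' "$text"
    return 0
  fi
  # Skip cleanly when Telegram is not configured.
  if [ -z "${TELEGRAM_BOT_TOKEN:-}" ] || [ -z "${TELEGRAM_CHAT_ID:-}" ]; then
    echo "telegram secrets unset, skipping notification" >&2
    return 0
  fi
  curl --fail --silent --show-error \
    "https://api.telegram.org/bot${TELEGRAM_BOT_TOKEN}/sendMessage" \
    --data-urlencode "chat_id=${TELEGRAM_CHAT_ID}" \
    --data-urlencode "parse_mode=HTML" \
    --data-urlencode "text=${text}" >/dev/null
}
```

For example, `notify_telegram --dry-run fail ci-deploy "Run Playwright e2e"` prints the failure headline without contacting Telegram.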

Per-stage failure contracts

Failure point	Action	Notification
A:1-8 (build/lint/test/docker build/tagging)	hard fail; the running container was not touched, so nothing was deployed	`❌ ci-deploy FAILED at step "<name>"` + tail
A:9-11 (restart/health/e2e)	trigger A:12 rollback to `:previous`, verify rollback healthy	`❌ ci-deploy FAILED at step "<name>" — rolled back to <prev-sha>` + container logs + playwright report (if e2e)
A:12 rollback fails	container stopped, site is 502	`🔥 ci-deploy ROLLBACK FAILED — site is DOWN. Manual intervention required. Last good image: flights-web:<prev-sha>`
B:2 (A not green for SHA)	refuse to start	`⚠️ release blocked — workflow ci-deploy is not green for <sha>. Re-run A first.`
B:3-4 (lint/test re-run)	hard fail	`❌ release FAILED at lint/test re-run (paranoid check). Investigate and re-trigger.`
B:5-8 (sync, branch, push)	hard fail; no MR exists yet at this point, so cleanup is limited to deleting the branch if it was already pushed	`❌ release FAILED at "<step>" — cleanup done`
B:9-11 (MR open/approve/merge)	hard fail; close MR + delete branch	`❌ release FAILED at MR <step> — MR closed, branch deleted. <link>`
B:12-13 (Jenkins trigger/poll)	hard fail; do NOT close the GitLab MR (already merged, can't unmerge)	`❌ release FAILED at Jenkins build — gitlab MR <iid> already merged. Jenkins console: <link>`
B:14 (customer URL not responding)	hard fail	`❌ release FAILED — Jenkins reported SUCCESS but flights-ui.devwebzavod.ru not responding. Investigate.`
B:15 (e2e on customer URL)	hard fail; no auto-rollback (we can't), notify with logs	`❌ release FAILED at e2e on customer URL — gitlab MR <iid> merged + Jenkins #<n> green but app misbehaves. Playwright report attached.`

Recovery from B:12-13 failure (awkward case)

GitLab MR is already merged but customer site has previous code. Recovery is manual:

Open Jenkins UI → click "Build Now" on the same job, or
Push a new commit to GitLab to re-trigger Jenkins polling.

A "retry just the Jenkins half" workflow file is not included — the manual path is rare enough to not warrant the abstraction.

Implementation pattern

Both workflows end with a pair of finalize steps, one gated on success() and one on failure(), so exactly one notification fires per run:

- name: Notify (success)
  if: success()
  run: scripts/ci/notify-telegram.sh ok ci-deploy

- name: Notify (failure)
  if: failure()
  run: scripts/ci/notify-telegram.sh fail ci-deploy "${{ steps.failed_step.outputs.name }}"

Step IDs propagate the failed-step name. Slightly verbose but no magic.

Artifacts on failure

Always uploaded on failure (never on success). 7-day retention.

Workflow A: docker logs flights-web --tail 500, playwright-report/ (if e2e ran), nginx error log tail.
Workflow B: playwright-report/ (if e2e ran), the rendered MR/Jenkins API responses (for debugging integration), tail of git log on the sync branch.

Deliberately NOT done

No PagerDuty / SMS escalation. Telegram is enough.
No automatic re-runs on flake. A flaky e2e fail = real signal worth investigating.
No "previous run was already failing, suppress notification" logic. Spam is a feature; silence is dangerous.
No Slack/email mirror. Single channel.

Testing the pipeline itself

Layer 1 — Unit tests for the testable bits

Bash scripts under scripts/ci/ with logic worth testing:

notify-telegram.sh — --dry-run prints the rendered payload to stdout instead of POSTing. Tests verify the three message shapes.
wait-for-url.sh — testable with a local python3 -m http.server; assert exit codes for 200, 404, network failure, timeout.
jenkins-trigger-and-wait.sh — --mock-mode reads from tests/ci/fixtures/jenkins-success-flow.json. Tests verify queue-then-build polling + SUCCESS / FAILURE / UNSTABLE / timeout branches.
deploy-container.sh — --dry-run prints docker commands instead of running them. Test verifies alias-swap order.

Run via make test-ci or as a step in Workflow A itself (~10 sec total).
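
For reference, a generic curl-with-retry in the shape `wait-for-url.sh <url> [<max-attempts>] [<delay>]` could look like the sketch below. It is an assumption-laden illustration: only 2xx responses count as ready (so 404 and connection failures keep retrying), basic-auth handling is omitted, and the 10-second per-request timeout is arbitrary.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of wait-for-url.sh <url> [<max-attempts>] [<delay>].
wait_for_url() {
  local url="$1" max_attempts="${2:-30}" delay="${3:-2}"
  local attempt code=""
  for ((attempt = 1; attempt <= max_attempts; attempt++)); do
    # %{http_code} is 000 when the connection itself fails.
    code="$(curl --silent --output /dev/null --max-time 10 \
      --write-out '%{http_code}' "$url" || true)"
    case "$code" in
      2??)
        echo "ready after ${attempt} attempt(s): HTTP ${code}"
        return 0
        ;;
    esac
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay"
    fi
  done
  echo "gave up after ${max_attempts} attempt(s): last HTTP ${code}" >&2
  return 1
}
```

With the arguments used in Workflow A step 10 (`30 2`) this waits roughly one minute in the worst case, plus request time.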

Layer 2 — Workflow A first-run validation (staged rollout)

Plan for the first run to fail. Stage the rollout:

PR #1 — adds workflows + scripts + console-gate fixture + nginx config + deployment/README.md. Does not delete .github/workflows/. Workflow A starts firing on push.
First few runs will fail at: portability of e2e specs to remote BASE_URL, missing BASE_URL overrides in test setup, console-gate revealing real warnings to allowlist, network/DNS/route gotchas. Each failure → fix → re-push.
PR #2 — deletes .github/workflows/ and any compatibility shims, only after A has run green for a few consecutive commits.
PR #3 — Workflow B. First run triggered manually. Once it works once end-to-end, it's "live".

Budget 1-2 dev days of "debug the pipeline against reality" after merging PR #1. Expecting green on first run is wrong.

Layer 3 — Documented rehearsal of failure paths

deployment/README.md includes recipes for inducing each failure path:

Failure	How to induce
A e2e fail → rollback	push a commit that adds `console.error('test')` to `App.tsx`. Verify rollback.
A rollback fail	break the `:previous` tag manually (`docker rmi flights-web:previous`), trigger an e2e fail.
B blocked on A not green	push a commit that fails A, then trigger B for that SHA.
B Jenkins poll timeout	reduce `JENKINS_TIMEOUT` to 30s and trigger B.
B e2e fail on customer URL	manually break the customer URL (trigger an old Jenkins build), then run B without a code change.

Run at least the "rollback" and "release blocked" rehearsals once before declaring the pipeline production-grade.

Console-allowlist seeding strategy

Don't pre-seed. Run e2e once locally against https://ui-dashboard.gnerim.ru/ (or http://localhost:8081), capture every console message, decide which are real bugs vs allowlist material.
Each allowlist entry has a reason field, lint-enforced.
Re-evaluate quarterly via scripts/ci/audit-console-allowlist.sh — entries that didn't fire are dead config.

Deliberately NOT tested

The actual GitLab API integration (no way to mock GitLab without GitLab; first B run is the test).
The actual Jenkins API integration (same; polling logic is tested via mock-mode).
The Telegram bot (tested via --dry-run; failed delivery observable as "no message arrived").

Future seam: container registry

When a private registry comes online (eventual registry.gnerim.ru), changes:

Workflow A — replace local docker tag flights-web:current + docker run with:
```
- run: docker push registry.gnerim.ru/flights-web:${GITHUB_SHA}
- run: ssh deploy@pve-201 'docker pull registry.gnerim.ru/flights-web:<sha> && ...'
```
Runner can move off pve-201 — anywhere with reach to registry + SSH key to deploy host.
Add secrets REGISTRY_USER, REGISTRY_PASS, DEPLOY_SSH_KEY.
Rollback semantics identical — docker pull <prev-sha> instead of relying on local cache.
No script rewrites — scripts/ci/deploy-container.sh accepts image-source as a parameter from day one. flights-web:current / :previous becomes <repo>:<sha> / <repo>:<prev-sha>, same shape.

Open questions and known gaps

GITLAB_PROJECT_ID — numeric ID is unknown until the PAT exists. scripts/ci/check-gitlab-project.sh resolves it post-PAT.
The 9 untracked snap-*.yml files at repo root look like throwaway parity-snapshot artifacts. Add to .gitignore or commit? Verify before flipping pipeline on (prereq #14).
e2e portability to remote BASE_URL — existing specs were written against localhost. Many likely hardcode paths or rely on dev-only state. Layer 2 of testing strategy budgets time for this.
Initial console-allowlist content — empty starter; will be populated on first runs ("we'll figure it out in future" per design discussion).

Addendum 2026-04-27 — routing change + manual Jenkins trigger

Two design pivots discovered during Phase B prerequisites work:

Routing: ssh -L tunnel instead of static-route + NAT

Original design: static route on pve-201 pushes <TIM-CIDR> via webzavod's LAN IP, webzavod NATs LAN→ppp0, /etc/hosts pins flights.test.aeroflot.ru to an internal A record.

Discovered:

flights.test.aeroflot.ru resolves to public IPs from both pve-201 and webzavod (no internal A record exists).
pve-201 reaches the public IP directly and gets HTTP 200, but the response is a WAF interstitial: the customer WAF answers non-corp egress with a 200 HTML page and lets corp egress through to the backend, which answers 401.
The same URL from webzavod returns 401 (real backend) — webzavod's ppp0 egress IP is whitelisted.
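
Because a 200 here is the failure signal and a 401 is the healthy one, a probe has to classify the status code rather than just check for success. A small sketch of that classification, assuming an unauthenticated request to `/swagger/` as in the original smoke test; the function names and category labels are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical classifier for an unauthenticated probe of the TIM API.
classify_tim_probe() {
  case "$1" in
    401) echo "backend" ;;           # real API answered (whitelisted egress)
    200) echo "waf-interstitial" ;;  # WAF page served to non-corp egress
    000) echo "unreachable" ;;       # curl could not connect at all
    *)   echo "unexpected" ;;
  esac
}

# Probe a URL and print the classification of its HTTP status.
probe_tim() {
  local url="${1:-https://flights.test.aeroflot.ru/swagger/}" code
  code="$(curl --silent --output /dev/null --max-time 10 \
    --write-out '%{http_code}' "$url" || true)"
  classify_tim_probe "$code"
}
```

Run directly from pve-201, `probe_tim` would be expected to print `waf-interstitial`; from webzavod it would be expected to print `backend`.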

New design: persistent ssh -L 127.0.0.1:8443:flights.test.aeroflot.ru:443 from pve-201 to webzavod via systemd unit deployment/systemd/flights-tim-tunnel.service. nginx proxies /api/ and /map/api/ to https://127.0.0.1:8443 with Host and proxy_ssl_name overrides so SNI/cert validation still target the real hostname.

Webzavod-side authorisation pinned with command="exit 1",no-pty,no-X11-forwarding,no-agent-forwarding,no-user-rc,permitopen="flights.test.aeroflot.ru:443" — the key cannot open a shell, agent-forward, or forward any other host:port.
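
The command behind the tunnel unit, and a probe through it, might look roughly like this. It is a sketch rather than the contents of `flights-tim-tunnel.service`: the `tunnel` user name and the `ExitOnForwardFailure` / `ServerAliveInterval` options are assumptions, while 192.168.88.58 is webzavod's LAN address from the original routing section. The probe uses curl's `--connect-to` so that SNI and the Host header still carry the real hostname, mirroring what nginx does with its overrides.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: print the ssh command a tunnel unit could run.
# TUNNEL_USER and TUNNEL_HOST are illustrative variables.
tunnel_command() {
  local user="${TUNNEL_USER:-tunnel}" host="${TUNNEL_HOST:-192.168.88.58}"
  printf '%s\n' "ssh -N -o ExitOnForwardFailure=yes -o ServerAliveInterval=30 -L 127.0.0.1:8443:flights.test.aeroflot.ru:443 ${user}@${host}"
}

# Probe the backend through the tunnel; prints the HTTP status code.
# TLS is still validated against flights.test.aeroflot.ru.
probe_via_tunnel() {
  curl --silent --output /dev/null --max-time 10 --write-out '%{http_code}\n' \
    --connect-to flights.test.aeroflot.ru:443:127.0.0.1:8443 \
    https://flights.test.aeroflot.ru/swagger/
}
```

With the tunnel up, `probe_via_tunnel` on pve-201 should print 401, the same answer webzavod gets directly.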

Trade-offs vs. original:

✅ No webzavod kernel changes (no ip_forward toggle, no MASQUERADE rule, no iptables-persistent).
✅ No /etc/hosts pin needed (DNS resolution happens on webzavod, where the real IPs work).
✅ Recoverable in seconds (systemctl restart flights-tim-tunnel).
⚠ Per-host SSH tunnel — adding another upstream means another -L line. Currently only one upstream.
⚠ Discovered OpenSSH 9.6 quirk: restrict + permitopen causes TLS handshake to EOF mid-stream. Using explicit no-* options instead of restrict works.

Workflow B: drop Jenkins automation

Original design: Workflow B triggers Jenkins via remote-build token, polls build status via authenticated API, then runs e2e against customer URL.

Constraint: operator does not have Jenkins job-configure access (no remote-trigger token) nor Jenkins user API token access. Authenticated API trigger and polling are not possible without admin involvement.

New design:

Workflow B (release.yml) — sync to GitLab, open MR, auto-approve, auto-merge, stop. Telegram notify includes the Jenkins job URL with instructions to trigger by hand.
Workflow C (release-verify.yml) — workflow_dispatch only. Operator runs manually after Jenkins finishes. Probes customer URL until reachable, runs Playwright e2e against http://flights-ui.devwebzavod.ru with the console-error gate, notifies Telegram.

Removed from the repo:

scripts/ci/jenkins-trigger-and-wait.sh
tests/ci/test-jenkins-trigger.sh
tests/ci/fixtures/jenkins-{success,failure}-flow.json
JENKINS_USER, JENKINS_API_TOKEN, JENKINS_TRIGGER_TOKEN secrets

Trade-off: lose automated end-to-end pipeline. Acceptable because (a) operator already triggers Jenkins manually today, (b) the manual step is a checkpoint where build failures surface clearly, (c) future Jenkins API access can swap C back into B without changing the rest of the design.

Other small adjustments

SSR container loopback port changed from 8081 → 3002 (port 8081 already in use on pve-201 by openwebui).
notify-telegram.sh now skips cleanly when Telegram secrets are unset (was: hard-fail). Lets the pipeline run end-to-end without TG configured.
