flights_web/docs/superpowers/specs/2026-04-25-cicd-pipeline-design.md
CI/CD Pipeline Design — Gitea Actions → pve-201 → GitLab → Jenkins

Status: Approved design, ready for implementation plan.
Date: 2026-04-25
Author: gnezim (with Claude)

Summary

A two-workflow Gitea Actions pipeline that builds and deploys this React SSR app to your own infrastructure (pve-201, behind https://ui-dashboard.gnerim.ru/) on every push, then — on explicit trigger — syncs sources to the customer's GitLab, opens and auto-merges an MR, fires the Jenkins build, and runs end-to-end tests against the customer's dev URL. All notifications via Telegram.

Two workflow files:

  • ci-deploy.yml — push-triggered. Build → unit tests → Docker build → swap container → e2e on ui-dashboard.gnerim.ru. Auto-rollback to previous image on any post-build failure.
  • release.yml — manually triggered (UI button or release-* git tag). Verifies ci-deploy is green for the same SHA, then GitLab sync → MR → approve → merge → Jenkins trigger → poll → e2e on flights-ui.devwebzavod.ru. Halts on any failure.

The Gitea runner runs on pve-201 itself, with Docker socket access — no SSH, no registry hop. Image-versioning uses flights-web:<sha> plus moving aliases :current and :previous for one-step rollback. Future migration to a private registry is a config change, not a refactor.

Architecture

┌──────────────────┐  push to main          ┌─────────────────────────────────┐
│  dev pc (you)    │ ─────────────────────► │  git.gnerim.ru (Gitea server)   │
└──────────────────┘  manual / tag push     └────────────────┬────────────────┘
                                                              │ webhook
                                                              ▼
                                              ┌──────────────────────────────┐
                                              │  Gitea Actions runner        │
                                              │  on pve-201 (Docker socket)  │
                                              └──┬───────────────────────────┘
                                                 │
                  ┌──────────────────────────────┼──────────────────────────────┐
                  │                              │                              │
                  ▼ on push                      ▼ on tag/manual                ▼ Telegram
        ┌────────────────────┐         ┌────────────────────┐         ┌────────────────────┐
        │ Workflow A         │         │ Workflow B         │         │ Notify on every    │
        │ ci-deploy.yml      │         │ release.yml        │         │ stage start / end  │
        │                    │         │                    │         │ / failure          │
        │ build & test       │         │ verify A is green  │         └────────────────────┘
        │   ↓                │         │   ↓                │
        │ docker build :SHA  │         │ sync → GitLab MR   │
        │   ↓                │         │   ↓                │
        │ swap container     │         │ approve & merge    │
        │   ↓                │         │   ↓                │
        │ smoke /health      │         │ trigger Jenkins    │
        │   ↓                │         │   ↓                │
        │ playwright e2e on  │         │ poll until SUCCESS │
        │ ui-dashboard       │         │   ↓                │
        │   ↓ on fail        │         │ playwright e2e on  │
        │ rollback to        │         │ flights-ui.devweb  │
        │ :previous          │         │   ↓ on fail        │
        └────────────────────┘         │ halt + dump logs   │
                                       └────────────────────┘

Key invariants

  • Workflow A is the gatekeeper. Workflow B always queries Gitea for the latest A run for the same commit SHA; if not green, B refuses to start.
  • Image tag aliases. Every build is tagged flights-web:<sha>. Two moving aliases on the host: flights-web:current (live container source) and flights-web:previous (rollback target). Pruning keeps the last 5 SHA tags + the two aliases.
  • Container is named flights-web (singleton). Restart sequence: docker stop flights-web && docker rm flights-web && docker run -d --name flights-web --restart unless-stopped -p 127.0.0.1:8081:8080 flights-web:current.
  • Nginx on pve-201 terminates TLS for ui-dashboard.gnerim.ru and proxies to 127.0.0.1:8081.
  • All four major stages emit Telegram messages (start / pass / fail). Failure messages include log tail and a clickable link to the Gitea run.
  • .github/workflows/ files are deleted in PR #2 (not the same PR that adds the new workflows; see "Layer 2 — Workflow A first-run validation (staged rollout)" under "Testing the pipeline itself").
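
The pruning rule in the invariants above (keep the last 5 SHA tags plus the two aliases) can be sketched as a pure filter. This is a minimal sketch, not the pipeline's actual script: the function name select_prunable_tags is hypothetical, and it assumes the tags arrive on stdin newest first, one per line.

```shell
#!/bin/sh
# Hypothetical helper: reads image tags from stdin, newest first, one per line.
# Prints the SHA tags that fall outside the newest $1 (default 5).
# The moving aliases "current" and "previous" are never printed.
select_prunable_tags() {
  keep="${1:-5}"
  kept=0
  while IFS= read -r tag; do
    case "$tag" in
      current|previous|'') continue ;;   # aliases and blank lines are always kept
    esac
    if [ "$kept" -lt "$keep" ]; then
      kept=$((kept + 1))                 # still inside the retention window
    else
      printf '%s\n' "$tag"               # older than the window: removal candidate
    fi
  done
}
```

The printed tags could then be fed to docker rmi as flights-web:<tag>. Whether the tag listing is really ordered newest first depends on how it is produced, so that ordering would need to be verified on pve-201 before relying on it.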

Architectural choices already made

  • Runner-on-host with direct Docker socket (vs SSH-back-to-localhost or local registry) — least moving parts; runner is in the docker group on pve-201.
  • Two independent workflow files (vs one file with conditional jobs, vs shared composite action) — short and focused beats clever.
  • Manual trigger for Workflow B + git tag fallback (vs commit-message keyword) — explicit; can't ship to customer by accident.

Routing, build-args, and access control

The build-args change the most across this design: they go from absolute URLs (TIM hostnames) to relative paths, which moves the routing burden onto nginx on pve-201.

Routing pve-201 → TIM API

The customer API at https://flights.test.aeroflot.ru/api/* is reachable only through the corp VPN. webzavod (192.168.88.58), which sits on the same LAN as pve-201 (192.168.88.167), already has a working L2TP/IPsec tunnel to TIM via ppp0. The cleanest way to make pve-201 reach TIM is a static route through webzavod, which reuses the existing VPN setup.

One-time host setup (manual, not in workflows):

  1. On webzavod — verify IP forwarding and MASQUERADE on ppp0:

    sysctl net.ipv4.ip_forward                          # expect: 1
    sudo iptables -t nat -L POSTROUTING -nv | grep ppp0 # expect: MASQUERADE
    

    If not set:

    echo 'net.ipv4.ip_forward=1' | sudo tee -a /etc/sysctl.conf
    sudo sysctl -p
    sudo iptables -t nat -A POSTROUTING -o ppp0 -j MASQUERADE
    sudo apt install iptables-persistent && sudo netfilter-persistent save
    
  2. On pve-201 — add a persistent static route to TIM via webzavod:

    # /etc/netplan/01-routes.yaml
    network:
      version: 2
      ethernets:
        eth0:                          # rename to actual NIC name
          routes:
            - to: 172.18.0.0/16
              via: 192.168.88.58
    
    sudo netplan apply
    
  3. On pve-201 — pin TIM hostnames to reachable A records (mirrors the duplicate-DNS workaround documented in ~/_projects/gnezim/knowledge/projects/work/tim/ui-dashboard/mac-via-windows-jump.md):

    # /etc/hosts
    172.18.0.121 flights.test.aeroflot.ru
    
  4. Smoke test from pve-201:

    curl -v https://flights.test.aeroflot.ru/swagger/  # expect: 401 in ~70ms
    

    Failure here means routing is broken — fix before any pipeline run.

nginx vhost on pve-201

server {
  listen 443 ssl http2;
  server_name ui-dashboard.gnerim.ru;
  # ssl_certificate, ssl_certificate_key — existing certbot config

  auth_basic "ui-dashboard";
  auth_basic_user_file /etc/nginx/htpasswd/ui-dashboard;

  location / {
    proxy_pass http://127.0.0.1:8081;
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_set_header X-Real-IP $remote_addr;
  }

  location /api/ {
    auth_basic off;                                   # API path open behind nginx — basic auth gates the HTML, not the API
    proxy_pass https://flights.test.aeroflot.ru;
    proxy_set_header Host flights.test.aeroflot.ru;
    proxy_ssl_server_name on;
  }

  location /map/api/ {
    auth_basic off;
    proxy_pass https://flights.test.aeroflot.ru;
    proxy_set_header Host flights.test.aeroflot.ru;
    proxy_ssl_server_name on;
  }
}

This file is checked in at deployment/nginx/ui-dashboard.gnerim.ru.conf and symlinked into /etc/nginx/sites-enabled/ by hand on first setup.

Dockerfile build-args become relative

Workflow A passes:

--build-arg MAP_TILE_URL=/map/api/tile/{z}/{x}/{y}.jpeg
--build-arg API_BASE_URL=/api

Same-origin URLs in the client bundle mean no CORS, and the TIM hostname never leaks into the browser. The customer's prod build (Jenkins) keeps its own absolute URLs because their nginx is configured differently, so the two builds stay independent.

Basic auth and how e2e bypasses it

  • Credentials stored as Gitea Actions secrets BASIC_AUTH_USER and BASIC_AUTH_PASS.
  • Workflow A's deploy step regenerates /etc/nginx/htpasswd/ui-dashboard (using htpasswd -bn) and runs nginx -s reload. Rotating creds = re-run Workflow A.
  • Public smoke check in Workflow A step 10 hits https://ui-dashboard.gnerim.ru/ with --user $BASIC_AUTH_USER:$BASIC_AUTH_PASS to validate TLS + nginx + auth + container in one curl. Catches nginx misconfig.
  • Full e2e in Workflow A step 11 runs against BASE_URL=http://127.0.0.1:8081 (loopback, skips nginx and auth). Faster, no creds in test, regression in nginx layer already caught by step 10.

Workflow A — ci-deploy.yml

Triggers: push to main; workflow_dispatch for re-runs.

Single sequential job on the pve-201 runner:

| # | Step | What it does | On failure |
|---|------|--------------|------------|
| 1 | Checkout | actions/checkout@v4, full history | hard fail |
| 2 | Setup pnpm + Node 24 | from .nvmrc | hard fail |
| 3 | Restore pnpm cache | ~/.pnpm-store keyed on pnpm-lock.yaml | continue (cache miss is fine) |
| 4 | Install deps | pnpm install --frozen-lockfile | hard fail → Telegram |
| 5 | Typecheck + lint + unit tests | pnpm typecheck && pnpm lint && pnpm test | hard fail → Telegram |
| 6 | Build SSR image | docker build -f Dockerfile.react -t flights-web:${GITHUB_SHA} --build-arg MAP_TILE_URL=/map/api/tile/{z}/{x}/{y}.jpeg --build-arg API_BASE_URL=/api . | hard fail → Telegram |
| 7 | Tag previous-current as previous | docker tag flights-web:current flights-web:previous (skip if first deploy) | continue |
| 8 | Tag SHA as current | docker tag flights-web:${GITHUB_SHA} flights-web:current | hard fail |
| 9 | Restart container | scripts/ci/deploy-container.sh swap | trigger rollback (step 12) |
| 10 | Wait for health | scripts/ci/wait-for-url.sh https://ui-dashboard.gnerim.ru/ 30 2 (with basic auth) | trigger rollback |
| 11 | Run Playwright e2e | BASE_URL=http://127.0.0.1:8081 pnpm test:e2e (full suite + console-error gate) | trigger rollback |
| 12 | Rollback (only if 9/10/11 failed) | scripts/ci/deploy-container.sh rollback — runs :previous, swaps aliases back, verifies health | always Telegram with logs |
| 13 | Prune old images | keep last 5 flights-web:* SHA tags + the two aliases | continue |
| 14 | Telegram (success or failure) | if: always(), runs notify-telegram.sh ok\|fail ci-deploy <step-name> | continue |

Estimated runtime: 6-10 min cached; 12-15 min cold.
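
The ordering of the alias swap and the rollback is easy to get wrong, so it may help to see it spelled out. The following is a minimal sketch and not the real scripts/ci/deploy-container.sh: it folds steps 7 to 9 into one swap function for illustration, leaves out the health check, tolerates a missing container on first deploy, and uses hypothetical variable names (DEPLOY_IMAGE, DEPLOY_NAME, DEPLOY_BIND, DRY_RUN).

```shell
#!/bin/sh
# Sketch of the alias swap and rollback order. Variable names are hypothetical.
DEPLOY_IMAGE="${DEPLOY_IMAGE:-flights-web}"          # image source, parameterized
DEPLOY_NAME="${DEPLOY_NAME:-flights-web}"            # singleton container name
DEPLOY_BIND="${DEPLOY_BIND:-127.0.0.1:8081:8080}"    # host-side loopback bind

# With DRY_RUN=1, print the command instead of running it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

restart_container() {
  # Assumption: stop/rm may fail when no container exists yet; that is tolerated.
  run docker stop "$DEPLOY_NAME" || true
  run docker rm "$DEPLOY_NAME" || true
  run docker run -d --name "$DEPLOY_NAME" --restart unless-stopped \
    -p "$DEPLOY_BIND" "$DEPLOY_IMAGE:current"
}

swap() {
  sha="$1"
  # Keep the outgoing image reachable as :previous (absent on first deploy).
  run docker tag "$DEPLOY_IMAGE:current" "$DEPLOY_IMAGE:previous" || true
  run docker tag "$DEPLOY_IMAGE:$sha" "$DEPLOY_IMAGE:current" || return 1
  restart_container
}

rollback() {
  # Point :current back at the last known-good image, then restart.
  run docker tag "$DEPLOY_IMAGE:previous" "$DEPLOY_IMAGE:current" || return 1
  restart_container
}
```

Running the sketch with DRY_RUN=1 prints the docker commands in order, which is the property the alias-swap unit test is meant to check.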

Console-error gate (step 11):

A Playwright fixture tests/e2e/fixtures/console-gate.ts attaches a listener to every page, collects all console.error and console.warn messages, filters out anything matching patterns in tests/e2e/fixtures/console-allowlist.json, and asserts the remaining list is empty in afterEach. Per the agreed policy: zero tolerance with explicit allowlist. Each allowlist entry has a reason field; lint enforces non-empty.

Workflow B — release.yml

Triggers: workflow_dispatch (manual button); push of tags matching release-* (e.g., git tag release-2026-04-25 && git push --tags).

Single sequential job on the pve-201 runner:

| # | Step | What it does | On failure |
|---|------|--------------|------------|
| 1 | Checkout (full history + tags) | needed for sync to operate on real source tree | hard fail |
| 2 | Verify Workflow A is green for this SHA | query Gitea API GET /repos/{owner}/{repo}/actions/runs?head_sha=<sha>&workflow=ci-deploy.yml; require status=success | hard fail → Telegram "release blocked: A not green" |
| 3 | Setup pnpm + Node 24, install deps | needed for paranoid lint/test re-run | hard fail |
| 4 | Re-run lint + typecheck + unit tests | belt-and-suspenders: catch flakiness; confirms commit still passes locally before sending to customer | hard fail → Telegram |
| 5 | Clone GitLab target into temp dir | git clone https://oauth2:$GITLAB_PAT@teamscore.gitlab.yandexcloud.net/aeroflot2/flights-front.git /tmp/flights-front | hard fail |
| 6 | Run sync (CI variant) | scripts/ci/sync-to-gitlab.sh /tmp/flights-front/Aeroflot.Flights.Front | hard fail |
| 7 | Commit on a feature branch | cd /tmp/flights-front && git checkout -b auto/sync-<sha> then git add -A && git commit -m "auto: sync from gitea <sha>" (skip if no diff) | hard fail; if no diff → log "nothing to sync" + skip 8-13 + Telegram info |
| 8 | Push branch | git push -u origin auto/sync-<sha> | hard fail |
| 9 | Open MR | POST /api/v4/projects/<id>/merge_requests with source_branch=auto/sync-<sha>, target_branch=main, title "auto: sync from gitea <sha-short>", description with link back to Gitea run | hard fail |
| 10 | Approve MR | POST /api/v4/projects/<id>/merge_requests/<iid>/approve | hard fail; on 401/403 → log explicit "self-approve blocked, configure project to allow author approval" |
| 11 | Merge MR | PUT /api/v4/projects/<id>/merge_requests/<iid>/merge with merge_when_pipeline_succeeds=false, should_remove_source_branch=true, squash=true | on fail → close MR + delete branch → Telegram |
| 12 | Trigger Jenkins | curl -u $JENKINS_USER:$JENKINS_API_TOKEN "http://jenkins.yc.devwebzavod.ru:8080/job/Aeroflot2/job/Flights-Front-Dev/build?token=$JENKINS_TRIGGER_TOKEN" (double quotes so the shell expands the token) — returns the queue item URL in Location header | hard fail |
| 13 | Poll Jenkins for completion | scripts/ci/jenkins-trigger-and-wait.sh — parses queue URL, gets build URL once it leaves queue, polls <build_url>/api/json for result != null. Timeout: 30 min. Required: result == "SUCCESS". | hard fail (UNSTABLE/FAILURE/timeout) → Telegram with Jenkins console URL |
| 14 | Wait for the customer URL to update | scripts/ci/wait-for-url.sh http://flights-ui.devwebzavod.ru/ru-ru/onlineboard 60 5 (5 min window) | hard fail |
| 15 | Run Playwright e2e against http://flights-ui.devwebzavod.ru/ | BASE_URL=http://flights-ui.devwebzavod.ru pnpm test:e2e (full suite + console-error gate) | hard fail → Telegram with playwright report attached |
| 16 | Telegram (success or failure) | if: always() — final notification with full chain links | continue |

Estimated runtime: 15-25 min (most of it Jenkins build + e2e).
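
Step 9 (open MR) maps onto a single GitLab REST call. The sketch below shows one way it might look with curl. It is illustrative only: GITLAB_HOST is an assumed variable, the 7-character short SHA is an assumption, the MR description is left out, and the function names are hypothetical. GITLAB_PAT and GITLAB_PROJECT_ID are the secrets named in this spec.

```shell
#!/bin/sh
# Sketch of Workflow B step 9 via the GitLab merge requests endpoint.
GITLAB_HOST="${GITLAB_HOST:-teamscore.gitlab.yandexcloud.net}"   # assumed variable

# Branch naming convention from step 7: auto/sync-<sha>.
sync_branch_name() {
  printf 'auto/sync-%s\n' "$1"
}

# open_mr <full-sha>: prints the API response on success, fails on HTTP errors.
open_mr() {
  sha="$1"
  short=$(printf '%s' "$sha" | cut -c1-7)   # short SHA length is an assumption
  curl --fail --silent --show-error \
    --request POST \
    --header "PRIVATE-TOKEN: ${GITLAB_PAT}" \
    --data-urlencode "source_branch=$(sync_branch_name "$sha")" \
    --data-urlencode "target_branch=main" \
    --data-urlencode "title=auto: sync from gitea ${short}" \
    "https://${GITLAB_HOST}/api/v4/projects/${GITLAB_PROJECT_ID}/merge_requests"
}
```

The response body carries the MR iid that steps 10 and 11 need. Extracting it is left out here because the spec does not say which JSON tool the scripts use.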

Self-approve note. If GitLab's project setting "Prevent approvals by author" is enabled, step 10 returns 401 with "You cannot approve your own merge request". Prereq #9 in "One-time manual setup" unchecks this. If you can't (org policy), fallback is to skip step 10 and rely on merge_when_pipeline_succeeds=false + branch protection allowing maintainer push.

Jenkins polling race. Naive polling has a race where the queue item hasn't materialized into a build yet. jenkins-trigger-and-wait.sh polls the queue URL first, then the build URL once it appears.

The auto/sync-<sha> branch lives forever in GitLab unless step 11 succeeds (which deletes it via should_remove_source_branch=true). On step 11 failure, the script closes the MR + deletes the branch.

The Gitea runner needs network reachability to TIM for steps 12-15 (Jenkins host, customer URL). That works automatically once the static route from "Routing pve-201 → TIM API" is in place — the runner shares pve-201's routes.

Prerequisites and secrets

One-time manual setup

| # | What | Where | Why |
|---|------|-------|-----|
| 1 | Verify webzavod IP forwarding + MASQUERADE on ppp0 | webzavod | see "Routing pve-201 → TIM API" |
| 2 | Add static route 172.18.0.0/16 via 192.168.88.58 in netplan | pve-201 | see "Routing pve-201 → TIM API" |
| 3 | Pin 172.18.0.121 flights.test.aeroflot.ru in /etc/hosts | pve-201 | duplicate DNS gotcha — see "Routing pve-201 → TIM API" |
| 4 | Verify from pve-201: curl -v https://flights.test.aeroflot.ru/swagger/ returns 401 (typically <300ms) | pve-201 | smoke test |
| 5 | Install nginx vhost from deployment/nginx/ui-dashboard.gnerim.ru.conf | pve-201 | see "nginx vhost on pve-201" |
| 6 | Confirm Gitea runner has docker socket access (docker ps from runner user, no sudo) | pve-201 | required for runner-on-host deploy |
| 7 | Confirm Gitea runner can reach git.gnerim.ru, teamscore.gitlab.yandexcloud.net, jenkins.yc.devwebzavod.ru:8080, flights-ui.devwebzavod.ru | pve-201 | last two via static route from #2 |
| 8 | Create GitLab Personal Access Token with scopes api, write_repository | GitLab → Settings → Access Tokens | Workflow B steps 9-11 |
| 9 | Uncheck "Prevent approvals by author" on the GitLab project | GitLab → flights-front → Settings → Merge requests → Approval rules | so Workflow B step 10 works |
| 10 | Configure Jenkins remote trigger token on Aeroflot2/Flights-Front-Dev job | Jenkins → job → Configure → "Trigger builds remotely" | Workflow B step 12 |
| 11 | Generate Jenkins API token for your user | Jenkins → user → Configure → API Token | Workflow B steps 12-13 |
| 12 | Create the Telegram bot (or reuse existing) and capture chat_id | Telegram BotFather | all notifications |
| 13 | Pick + reserve port :8081 on pve-201 (or substitute another free port consistently) | pve-201 | container's host-side bind |
| 14 | Clean uncommitted work in this repo before flipping the switch | dev pc | the first push to main after merging the pipeline will fire Workflow A on whatever's in main |
| 15 | Run scripts/ci/check-gitlab-project.sh once after creating the PAT | dev pc | captures numeric GITLAB_PROJECT_ID for the secret + verifies approval-rule config |

Gitea Actions secrets

Stored at repo → Settings → Actions → Secrets. Workflows reference as ${{ secrets.NAME }}.

| Secret | Used in | Notes |
|--------|---------|-------|
| BASIC_AUTH_USER | Workflow A (deploy) | nginx htpasswd; rotate by re-running A |
| BASIC_AUTH_PASS | Workflow A (deploy) | same |
| MAP_TILE_URL | Workflow A (build) | default /map/api/tile/{z}/{x}/{y}.jpeg — secret so it can be overridden per env |
| API_BASE_URL | Workflow A (build) | default /api |
| GITLAB_PAT | Workflow B (steps 5, 8-11) | from prereq #8 |
| GITLAB_PROJECT_ID | Workflow B (steps 9-11) | numeric, from prereq #15 |
| JENKINS_USER | Workflow B (steps 12-13) | username |
| JENKINS_API_TOKEN | Workflow B (steps 12-13) | from prereq #11 |
| JENKINS_TRIGGER_TOKEN | Workflow B (step 12) | from prereq #10 |
| TELEGRAM_BOT_TOKEN | both workflows | from prereq #12 |
| TELEGRAM_CHAT_ID | both workflows | DM or group |

What lives in plain repo files (not secrets)

  • .gitea/workflows/ci-deploy.yml and .gitea/workflows/release.yml — public, parameterized via secrets.
  • scripts/ci/sync-to-gitlab.sh — refactored from sync-to-flights-front.sh. The original becomes a thin wrapper that calls this with the local sibling-dir default.
  • scripts/ci/notify-telegram.sh — reads TELEGRAM_BOT_TOKEN/TELEGRAM_CHAT_ID from env. Has --dry-run.
  • scripts/ci/jenkins-trigger-and-wait.sh — polling logic for B steps 12-13. Has --mock-mode.
  • scripts/ci/wait-for-url.sh — generic curl-with-retry.
  • scripts/ci/deploy-container.sh: swap and rollback subcommands. Has --dry-run.
  • scripts/ci/install-htpasswd.sh — renders htpasswd + reloads nginx.
  • scripts/ci/check-gitlab-project.sh — one-shot setup helper (not used by workflows).
  • tests/e2e/fixtures/console-gate.ts — Playwright fixture.
  • tests/e2e/fixtures/console-allowlist.json — empty starter; grows on first runs.
  • deployment/nginx/ui-dashboard.gnerim.ru.conf — nginx vhost.
  • deployment/README.md — bootstrap runbook + failure-path rehearsal recipes.

What gets deleted (in PR #2, not PR #1)

  • .github/workflows/ci.yml
  • .github/workflows/deploy.yml

Scripts to add

| Path | Purpose | Approx LOC |
|------|---------|------------|
| scripts/ci/sync-to-gitlab.sh | Refactored from sync-to-flights-front.sh; takes target dir as required arg, no make-related output. | ~150 |
| scripts/ci/notify-telegram.sh | notify-telegram.sh <ok\|fail> <stage> [<extra-context>]; HTML mode; failure messages include Gitea run URL. | ~40 |
| scripts/ci/jenkins-trigger-and-wait.sh | Triggers, parses Location, polls queue then build, exits 0 only on SUCCESS. | ~80 |
| scripts/ci/wait-for-url.sh | wait-for-url.sh <url> [<max-attempts>] [<delay>]. | ~25 |
| scripts/ci/deploy-container.sh | swap and rollback subcommands. Encapsulates the alias dance + health check. Image source parameterized so registry migration is a config flip. | ~70 |
| scripts/ci/install-htpasswd.sh | Renders /etc/nginx/htpasswd/ui-dashboard from env + nginx -s reload. | ~15 |
| scripts/ci/check-gitlab-project.sh | One-shot: print numeric project ID + approval rule config + self-approve allowed (yes/no). | ~25 |
| scripts/ci/audit-console-allowlist.sh | Run e2e with allowlist disabled, report which entries didn't fire (dead config). | ~30 |
| tests/e2e/fixtures/console-gate.ts | Playwright fixture for the console-error gate. | ~50 |
| tests/e2e/fixtures/console-allowlist.json | Empty starter { patterns: [] }. | n/a |
| .gitea/workflows/ci-deploy.yml | Workflow A. | ~80 |
| .gitea/workflows/release.yml | Workflow B. | ~100 |
| deployment/nginx/ui-dashboard.gnerim.ru.conf | nginx vhost from "nginx vhost on pve-201". | ~30 |
| deployment/README.md | Setup runbook + failure-path rehearsals. | ~200 |
| tests/ci/*.bats (or shell) | Unit tests for the testable scripts. | ~80 |
| tests/ci/fixtures/jenkins-success-flow.json | Mock fixture for jenkins-trigger-and-wait.sh --mock-mode. | ~40 |

Failure handling and notifications

Telegram message shapes

All messages use parse_mode=HTML.

Start (one per workflow run):

🚀 ci-deploy started
commit: abc1234 — fix: schedule width regression
gitea run: <link>

Success:

✅ ci-deploy passed (8m 42s)
commit: abc1234 — fix: schedule width regression
deployed: https://ui-dashboard.gnerim.ru/
gitea run: <link>

Failure:

❌ ci-deploy FAILED at step "Run Playwright e2e" (6m 18s)
commit: abc1234 — fix: schedule width regression
gitea run: <link>

last 30 lines of step output:
<pre>... e2e log tail ...</pre>

artifacts:
- container logs
- playwright report

Workflow B failures include MR URL, Jenkins build URL, customer URL as appropriate.
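
The message shapes above can be rendered by a small function. The sketch below is a simplified illustration, not the real notify-telegram.sh: the function names and argument order are hypothetical, and the duration and "deployed:" lines are omitted. It escapes &, < and > in free text, since Telegram's HTML parse mode treats those characters as markup.

```shell
#!/bin/sh
# Sketch of message rendering for parse_mode=HTML. Names are hypothetical.

# Escape the three characters that HTML parse mode treats as markup.
html_escape() {
  sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# render_message <start|ok|fail> <workflow> <commit-line> <run-url> [<step>] [<log-tail>]
render_message() {
  kind="$1"; workflow="$2"; commit="$3"; url="$4"
  step="${5:-}"; tail_text="${6:-}"
  case "$kind" in
    start) printf '🚀 %s started\n' "$workflow" ;;
    ok)    printf '✅ %s passed\n' "$workflow" ;;
    fail)  printf '❌ %s FAILED at step "%s"\n' "$workflow" "$step" ;;
    *)     echo "unknown kind: $kind" >&2; return 2 ;;
  esac
  printf 'commit: %s\n' "$(printf '%s' "$commit" | html_escape)"
  printf 'gitea run: %s\n' "$(printf '%s' "$url" | html_escape)"
  # Only failures carry a log tail, wrapped in <pre> as in the shape above.
  if [ "$kind" = "fail" ] && [ -n "$tail_text" ]; then
    printf '\nlast 30 lines of step output:\n<pre>%s</pre>\n' \
      "$(printf '%s' "$tail_text" | html_escape)"
  fi
}
```

Keeping rendering separate from delivery is what makes a --dry-run mode cheap: the same function output can be printed to stdout or POSTed.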

Per-stage failure contracts

| Failure point | Action | Notification |
|---------------|--------|--------------|
| A:1-6 (build/lint/test/dockerbuild) | hard fail; nothing was deployed | ❌ ci-deploy FAILED at step "<name>" + tail |
| A:7-11 (deploy/health/e2e) | trigger A:12 rollback to :previous, verify rollback healthy | ❌ ci-deploy FAILED at step "<name>" — rolled back to <prev-sha> + container logs + playwright report (if e2e) |
| A:12 rollback fails | container stopped, site is 502 | 🔥 ci-deploy ROLLBACK FAILED — site is DOWN. Manual intervention required. Last good image: flights-web:<prev-sha> |
| B:2 (A not green for SHA) | refuse to start | ⚠️ release blocked — workflow ci-deploy is not green for <sha>. Re-run A first. |
| B:3-4 (lint/test re-run) | hard fail | ❌ release FAILED at lint/test re-run (paranoid check). Investigate and re-trigger. |
| B:5-8 (sync, branch, push) | hard fail; if MR was created, close it; if branch was pushed, delete it | ❌ release FAILED at "<step>" — cleanup done |
| B:9-11 (MR open/approve/merge) | hard fail; close MR + delete branch | ❌ release FAILED at MR <step> — MR closed, branch deleted. <link> |
| B:12-13 (Jenkins trigger/poll) | hard fail; do NOT close the GitLab MR (already merged, can't unmerge) | ❌ release FAILED at Jenkins build — gitlab MR <iid> already merged. Jenkins console: <link> |
| B:14 (customer URL not responding) | hard fail | ❌ release FAILED — Jenkins reported SUCCESS but flights-ui.devwebzavod.ru not responding. Investigate. |
| B:15 (e2e on customer URL) | hard fail; no auto-rollback (we can't), notify with logs | ❌ release FAILED at e2e on customer URL — gitlab MR <iid> merged + Jenkins #<n> green but app misbehaves. Playwright report attached. |

Recovery from B:12-13 failure (awkward case)

GitLab MR is already merged but customer site has previous code. Recovery is manual:

  1. Open Jenkins UI → click "Build Now" on the same job, or
  2. Push a new commit to GitLab to re-trigger Jenkins polling.

A "retry just the Jenkins half" workflow file is not included — the manual path is rare enough to not warrant the abstraction.

Implementation pattern

Both workflows end with if: always() finalize steps:

- name: Notify (success)
  if: success()
  run: scripts/ci/notify-telegram.sh ok ci-deploy

- name: Notify (failure)
  if: failure()
  run: scripts/ci/notify-telegram.sh fail ci-deploy "${{ steps.failed_step.outputs.name }}"

Step IDs propagate the failed-step name. Slightly verbose but no magic.

Artifacts on failure

Always uploaded on failure (never on success). 7-day retention.

  • Workflow A: docker logs flights-web --tail 500, playwright-report/ (if e2e ran), nginx error log tail.
  • Workflow B: playwright-report/ (if e2e ran), the rendered MR/Jenkins API responses (for debugging integration), tail of git log on the sync branch.

Deliberately NOT done

  • No PagerDuty / SMS escalation. Telegram is enough.
  • No automatic re-runs on flake. A flaky e2e fail = real signal worth investigating.
  • No "previous run was already failing, suppress notification" logic. Spam is a feature; silence is dangerous.
  • No Slack/email mirror. Single channel.

Testing the pipeline itself

Layer 1 — Unit tests for the testable bits

Bash scripts under scripts/ci/ with logic worth testing:

  • notify-telegram.sh: --dry-run prints the rendered payload to stdout instead of POSTing. Tests verify the three message shapes.
  • wait-for-url.sh — testable with a local python3 -m http.server; assert exit codes for 200, 404, network failure, timeout.
  • jenkins-trigger-and-wait.sh: --mock-mode reads from tests/ci/fixtures/jenkins-success-flow.json. Tests verify queue-then-build polling + SUCCESS / FAILURE / UNSTABLE / timeout branches.
  • deploy-container.sh: --dry-run prints docker commands instead of running them. Test verifies alias-swap order.

Run via make test-ci or as a step in Workflow A itself (~10 sec total).
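
For reference, the retry loop that the wait-for-url.sh tests exercise could look roughly like this. It is a minimal sketch under stated assumptions, not the real script: the defaults of 30 attempts and a 2 second delay are borrowed from Workflow A step 10, the 10 second per-request timeout is an assumption, and basic auth is left out.

```shell
#!/bin/sh
# Sketch of a curl-with-retry loop.
# wait_for_url <url> [<max-attempts>] [<delay-seconds>]
wait_for_url() {
  url="$1"; max="${2:-30}"; delay="${3:-2}"
  attempt=1
  while [ "$attempt" -le "$max" ]; do
    # --fail makes curl exit non-zero on HTTP status >= 400,
    # so 404 and network errors are both treated as "not up yet".
    if curl --fail --silent --output /dev/null --max-time 10 "$url"; then
      echo "up after $attempt attempt(s): $url"
      return 0
    fi
    attempt=$((attempt + 1))
    # Sleep only when another attempt will follow.
    [ "$attempt" -le "$max" ] && sleep "$delay"
  done
  echo "gave up after $max attempt(s): $url" >&2
  return 1
}
```

Because the function only depends on curl's exit status, a unit test can replace curl with a shell function and check the 200, 404, network-failure and timeout branches without a real server.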

Layer 2 — Workflow A first-run validation (staged rollout)

Plan for the first run to fail. Stage the rollout:

  1. PR #1 — adds workflows + scripts + console-gate fixture + nginx config + deployment/README.md. Does not delete .github/workflows/. Workflow A starts firing on push.
  2. First few runs will fail at: portability of e2e specs to remote BASE_URL, missing BASE_URL overrides in test setup, console-gate revealing real warnings to allowlist, network/DNS/route gotchas. Each failure → fix → re-push.
  3. PR #2 — deletes .github/workflows/ and any compatibility shims, only after A has run green for a few consecutive commits.
  4. PR #3 — Workflow B. First run triggered manually. Once it works once end-to-end, it's "live".

Budget 1-2 dev days of "debug the pipeline against reality" after merging PR #1. Expecting green on first run is wrong.

Layer 3 — Documented rehearsal of failure paths

deployment/README.md includes recipes for inducing each failure path:

| Failure | How to induce |
|---------|---------------|
| A e2e fail → rollback | push a commit that adds console.error('test') to App.tsx. Verify rollback. |
| A rollback fail | break the :previous tag manually (docker rmi flights-web:previous), trigger an e2e fail. |
| B blocked on A not green | push a commit that fails A, then trigger B for that SHA. |
| B Jenkins poll timeout | reduce JENKINS_TIMEOUT to 30s and trigger B. |
| B e2e fail on customer URL | manually break the customer URL (trigger an old Jenkins build), then run B without a code change. |

Run at least the "rollback" and "release blocked" rehearsals once before declaring the pipeline production-grade.

Console-allowlist seeding strategy

  • Don't pre-seed. Run e2e once locally against https://ui-dashboard.gnerim.ru/ (or http://localhost:8081), capture every console message, decide which are real bugs vs allowlist material.
  • Each allowlist entry has a reason field, lint-enforced.
  • Re-evaluate quarterly via scripts/ci/audit-console-allowlist.sh — entries that didn't fire are dead config.

Deliberately NOT tested

  • The actual GitLab API integration (no way to mock GitLab without GitLab; first B run is the test).
  • The actual Jenkins API integration (same; polling logic is tested via mock-mode).
  • The Telegram bot (tested via --dry-run; failed delivery observable as "no message arrived").

Future seam: container registry

When a private registry comes online (eventual registry.gnerim.ru), changes:

  • Workflow A — replace local docker tag flights-web:current + docker run with:
    - run: docker push registry.gnerim.ru/flights-web:${GITHUB_SHA}
    - run: ssh deploy@pve-201 'docker pull registry.gnerim.ru/flights-web:<sha> && ...'
    
    Runner can move off pve-201 — anywhere with reach to registry + SSH key to deploy host.
  • Add secrets REGISTRY_USER, REGISTRY_PASS, DEPLOY_SSH_KEY.
  • Rollback semantics identical: docker pull <prev-sha> instead of relying on local cache.
  • No script rewrites: scripts/ci/deploy-container.sh accepts image-source as a parameter from day one. flights-web:current / :previous becomes <repo>:<sha> / <repo>:<prev-sha>, same shape.

Open questions and known gaps

  1. GITLAB_PROJECT_ID — numeric ID is unknown until the PAT exists. scripts/ci/check-gitlab-project.sh resolves it post-PAT.
  2. The 9 untracked snap-*.yml files at repo root look like throwaway parity-snapshot artifacts. Add to .gitignore or commit? Verify before flipping pipeline on (prereq #14).
  3. e2e portability to remote BASE_URL — existing specs were written against localhost. Many likely hardcode paths or rely on dev-only state. Layer 2 of testing strategy budgets time for this.
  4. Initial console-allowlist content — empty starter; will be populated on first runs ("we'll figure it out in future" per design discussion).

Addendum 2026-04-27 — routing change + manual Jenkins trigger

Two design pivots discovered during Phase B prerequisites work:

Routing: ssh -L tunnel instead of static-route + NAT

Original design: static route on pve-201 pushes <TIM-CIDR> via webzavod's LAN IP, webzavod NATs LAN→ppp0, /etc/hosts pins flights.test.aeroflot.ru to an internal A record.

Discovered:

  • flights.test.aeroflot.ru resolves to public IPs from both pve-201 and webzavod (no internal A record exists).
  • pve-201 reaches the public IP directly with HTTP 200, but the response is a WAF interstitial — the customer WAF returns 200/HTML for non-corp egress and 401/JSON-ready for corp egress.
  • The same URL from webzavod returns 401 (real backend) — webzavod's ppp0 egress IP is whitelisted.
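
Since a 200 here is the misleading case, a probe has to look at the status code rather than at reachability alone. The sketch below turns the observations above into a small classifier. The function names and labels are hypothetical, and the mapping only reflects the behaviour described in this addendum (401 from the real backend, 200 from the WAF interstitial).

```shell
#!/bin/sh
# Sketch: classify which side of the customer WAF a request landed on.
# classify_tim_status <http-status>
classify_tim_status() {
  case "$1" in
    401) echo "backend" ;;            # real API, as seen from webzavod's egress
    200) echo "waf-interstitial" ;;   # HTML interstitial served to non-corp egress
    000) echo "unreachable" ;;        # curl prints 000 when no HTTP response arrived
    *)   echo "unexpected" ;;
  esac
}

# probe_tim <url>: performs one request and prints the classification.
probe_tim() {
  status=$(curl --silent --output /dev/null --max-time 10 \
    --write-out '%{http_code}' "$1")
  classify_tim_status "$status"
}
```

A smoke test built on this would treat anything other than "backend" as a routing failure, which catches the case where pve-201 silently falls back to its own public egress.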

New design: persistent ssh -L 127.0.0.1:8443:flights.test.aeroflot.ru:443 from pve-201 to webzavod via systemd unit deployment/systemd/flights-tim-tunnel.service. nginx proxies /api/ and /map/api/ to https://127.0.0.1:8443 with Host and proxy_ssl_name overrides so SNI/cert validation still target the real hostname.
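
The unit file itself is not reproduced in this spec, so the following is only a sketch of the command such a unit might keep alive. The forward specification comes from the design above. The login (TUNNEL_USER, TUNNEL_HOST) is hypothetical, and the -N, ExitOnForwardFailure and ServerAliveInterval options are assumptions about sensible hardening, not confirmed settings.

```shell
#!/bin/sh
# Sketch: print the ssh invocation for the persistent local forward.
# TUNNEL_USER and TUNNEL_HOST are hypothetical names for the webzavod login.
tunnel_command() {
  user_host="${TUNNEL_USER:-tunnel}@${TUNNEL_HOST:-webzavod}"
  # -N: forward only, run no remote command.
  # ExitOnForwardFailure: exit if the listener cannot bind, so systemd can restart it.
  # ServerAliveInterval: detect a dead link instead of hanging forever.
  echo "ssh -N" \
    "-o ExitOnForwardFailure=yes" \
    "-o ServerAliveInterval=30" \
    "-L 127.0.0.1:8443:flights.test.aeroflot.ru:443" \
    "$user_host"
}
```

Once the tunnel is up, one way to check it from pve-201 without touching nginx would be curl with --connect-to flights.test.aeroflot.ru:443:127.0.0.1:8443 against https://flights.test.aeroflot.ru/swagger/, which keeps the real hostname for SNI and certificate validation while sending the bytes through the loopback port.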

Webzavod-side authorisation pinned with command="exit 1",no-pty,no-X11-forwarding,no-agent-forwarding,no-user-rc,permitopen="flights.test.aeroflot.ru:443" — the key cannot open a shell, agent-forward, or forward any other host:port.

Trade-offs vs. original:

  • No webzavod kernel changes (no ip_forward toggle, no MASQUERADE rule, no iptables-persistent).
  • No /etc/hosts pin needed (DNS resolution happens on webzavod, where the real IPs work).
  • Recoverable in seconds (systemctl restart flights-tim-tunnel).
  • ⚠ Per-host SSH tunnel — adding another upstream means another -L line. Currently only one upstream.
  • ⚠ Discovered OpenSSH 9.6 quirk: restrict + permitopen causes TLS handshake to EOF mid-stream. Using explicit no-* options instead of restrict works.

Workflow B: drop Jenkins automation

Original design: Workflow B triggers Jenkins via remote-build token, polls build status via authenticated API, then runs e2e against customer URL.

Constraint: the operator has neither Jenkins job-configure access (so no remote-trigger token) nor Jenkins user API token access. Authenticated API trigger and polling are not possible without admin involvement.

New design:

  • Workflow B (release.yml) — sync to GitLab, open MR, auto-approve, auto-merge, stop. Telegram notify includes the Jenkins job URL with instructions to trigger by hand.
  • Workflow C (release-verify.yml): workflow_dispatch only. Operator runs manually after Jenkins finishes. Probes customer URL until reachable, runs Playwright e2e against http://flights-ui.devwebzavod.ru with the console-error gate, notifies Telegram.

Removed from the repo:

  • scripts/ci/jenkins-trigger-and-wait.sh
  • tests/ci/test-jenkins-trigger.sh
  • tests/ci/fixtures/jenkins-{success,failure}-flow.json
  • JENKINS_USER, JENKINS_API_TOKEN, JENKINS_TRIGGER_TOKEN secrets

Trade-off: lose automated end-to-end pipeline. Acceptable because (a) operator already triggers Jenkins manually today, (b) the manual step is a checkpoint where build failures surface clearly, (c) future Jenkins API access can swap C back into B without changing the rest of the design.

Other small adjustments

  • SSR container loopback port changed from 8081 → 3002 (port 8081 already in use on pve-201 by openwebui).
  • notify-telegram.sh now skips cleanly when Telegram secrets are unset (was: hard-fail). Lets the pipeline run end-to-end without TG configured.
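
The "skip cleanly" behaviour amounts to a guard in front of the delivery call. The sketch below illustrates the idea and is not the real notify-telegram.sh: the function name send_telegram is hypothetical, while the sendMessage endpoint and its chat_id, parse_mode and text parameters are those of the Telegram Bot API.

```shell
#!/bin/sh
# Sketch of the skip-when-unset guard.
# send_telegram <text>
send_telegram() {
  if [ -z "${TELEGRAM_BOT_TOKEN:-}" ] || [ -z "${TELEGRAM_CHAT_ID:-}" ]; then
    echo "telegram: secrets unset, skipping notification" >&2
    return 0                      # not an error: the pipeline keeps going
  fi
  curl --fail --silent --show-error --output /dev/null \
    --data-urlencode "chat_id=${TELEGRAM_CHAT_ID}" \
    --data-urlencode "parse_mode=HTML" \
    --data-urlencode "text=$1" \
    "https://api.telegram.org/bot${TELEGRAM_BOT_TOKEN}/sendMessage"
}
```

Returning 0 in the unset case keeps a notify step from failing the job, while the message on stderr still leaves a trace in the run log that no notification was sent.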