firezone/.github/workflows/_integration_tests.yml
Jamil 0ccd4bbf24 feat(ci): enable relay eBPF offloading (#10160)
In CI, eBPF in driver mode works with no changes to our existing tests,
provided we apply a few workarounds and bugfixes:

- The interface learning mechanism had two flaws: (1) it only learned
per-CPU, so the risk of a missing entry grew with the core count of the
relay host, and (2) it did not filter for unicast IPs, so it picked up
broadcast and link-local addresses, causing cross-relay paths to fail
occasionally.
- The `relay-relay` candidate where both relays are the same relay
causes packet drops / loops in the Docker bridge setup, and possibly in
GCP too. I'm not sure this is a valid path that solves a real
connectivity issue in the wild. I can understand relay-relay paths where
the two relays are different hosts, and the client and gateway both talk
over their TURN channel to each other (i.e. WireGuard is blocked in each
of their networks), but I can't think of an advantage for a relay-relay
candidate where the traffic simply hairpins (or is dropped) off the
nearest switch. This case is now detected with a new `PacketLoop` error
that triggers whenever source_ip == dest_ip.
- The relays in CI need a common next hop for the MAC address swapping
to work. A simple router service is added that acts as a basic L3 router
(no NAT) and serves as that next hop.
- The `veth` driver has some peculiar requirements to allow it to
function with XDP_TX. If you send a packet out of one interface of a
veth pair with XDP_TX, you need to either make sure both interfaces have
GRO enabled, or you need to attach a dummy XDP program that simply does
XDP_PASS to the other interface so that the sk_buff is allocated before
going up the stack to the Docker bridge. The GRO method was unreliable
and didn't work in our case, causing massive packet delays and
unpredictable bursts that prevented ICE from working, so we use the
XDP_PASS method instead. A simple Docker image that handles this is
built and lives at https://github.com/firezone/xdp-pass.
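
The unicast filter and the `PacketLoop` check described in the list above
can be sketched in a few lines. This is only an illustration in Python,
not the relay's actual eBPF code: the function names, the optional
`network` parameter, and the sample addresses are assumptions made for
the example. Only the `PacketLoop` name and the source_ip == dest_ip
condition come from the commit message.

```python
import ipaddress
from typing import Optional


class PacketLoop(Exception):
    """Raised when a packet's source and destination IP are identical."""


def is_learnable_unicast(addr: str, network: Optional[str] = None) -> bool:
    """Return True if `addr` looks like a unicast address worth learning.

    `network` is a hypothetical parameter: if the subnet is known, its
    directed broadcast address is rejected as well.
    """
    ip = ipaddress.ip_address(addr)
    # Reject unspecified, multicast, link-local and loopback addresses.
    if ip.is_unspecified or ip.is_multicast or ip.is_link_local or ip.is_loopback:
        return False
    # The IPv4 limited broadcast address is not covered by the flags above.
    if ip == ipaddress.ip_address("255.255.255.255"):
        return False
    # A directed broadcast can only be recognised when the subnet is known.
    if network is not None and ip.version == 4:
        if ip == ipaddress.ip_network(network).broadcast_address:
            return False
    return True


def check_no_loop(source_ip: str, dest_ip: str) -> None:
    """Raise PacketLoop when source and destination are the same address."""
    if ipaddress.ip_address(source_ip) == ipaddress.ip_address(dest_ip):
        raise PacketLoop(f"source_ip == dest_ip ({source_ip})")
```

Without the subnet, a directed broadcast such as the last address of a
/24 is indistinguishable from a host address, which is why the sketch
takes the network as an optional argument.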

Related: #10138 
Related: #10260
2025-08-31 23:37:03 +00:00

196 lines
6.7 KiB
YAML

name: Integration Tests
run-name: Triggered from ${{ github.event_name }} by ${{ github.actor }}
on:
  workflow_call:
    inputs:
      domain_image:
        required: false
        type: string
        default: "ghcr.io/firezone/domain"
      domain_tag:
        required: false
        type: string
        default: ${{ github.sha }}
      api_image:
        required: false
        type: string
        default: "ghcr.io/firezone/api"
      api_tag:
        required: false
        type: string
        default: ${{ github.sha }}
      web_image:
        required: false
        type: string
        default: "ghcr.io/firezone/web"
      web_tag:
        required: false
        type: string
        default: ${{ github.sha }}
      elixir_image:
        required: false
        type: string
        default: "ghcr.io/firezone/elixir"
      elixir_tag:
        required: false
        type: string
        default: ${{ github.sha }}
      relay_image:
        required: false
        type: string
        default: "ghcr.io/firezone/debug/relay"
      relay_tag:
        required: false
        type: string
        default: ${{ github.sha }}
      gateway_image:
        required: false
        type: string
        default: "ghcr.io/firezone/debug/gateway"
      gateway_tag:
        required: false
        type: string
        default: ${{ github.sha }}
      client_image:
        required: false
        type: string
        default: "ghcr.io/firezone/debug/client"
      client_tag:
        required: false
        type: string
        default: ${{ github.sha }}
      http_test_server_image:
        required: false
        type: string
        default: "ghcr.io/firezone/debug/http-test-server"
      http_test_server_tag:
        required: false
        type: string
        default: ${{ github.sha }}
env:
  COMPOSE_PARALLEL_LIMIT: 1 # Temporary fix for https://github.com/docker/compose/pull/12752 until compose v2.36.0 lands on GitHub actions runners.
jobs:
  integration-tests:
    name: ${{ matrix.test.name }}
    runs-on: ubuntu-22.04
    permissions:
      contents: read
      id-token: write
      pull-requests: write
    env:
      DOMAIN_IMAGE: ${{ inputs.domain_image }}
      DOMAIN_TAG: ${{ inputs.domain_tag }}
      API_IMAGE: ${{ inputs.api_image }}
      API_TAG: ${{ inputs.api_tag }}
      WEB_IMAGE: ${{ inputs.web_image }}
      WEB_TAG: ${{ inputs.web_tag }}
      RELAY_IMAGE: ${{ inputs.relay_image }}
      RELAY_TAG: ${{ inputs.relay_tag }}
      GATEWAY_IMAGE: ${{ inputs.gateway_image }}
      GATEWAY_TAG: ${{ inputs.gateway_tag }}
      CLIENT_IMAGE: ${{ inputs.client_image }}
      CLIENT_TAG: ${{ inputs.client_tag }}
      ELIXIR_IMAGE: ${{ inputs.elixir_image }}
      ELIXIR_TAG: ${{ inputs.elixir_tag }}
      HTTP_TEST_SERVER_IMAGE: ${{ inputs.http_test_server_image }}
      HTTP_TEST_SERVER_TAG: ${{ inputs.http_test_server_tag }}
    strategy:
      fail-fast: false
      matrix:
        test:
          - name: direct-curl-api-down
          - name: direct-curl-api-restart
          - name: direct-curl-ecn
          - name: direct-download-packet-loss
          - name: direct-dns-api-down
          - name: direct-dns-two-resources
          - name: direct-dns
          - name: direct-download-roaming-network
            # Too noisy can cause flaky tests due to the amount of data
            rust_log: debug
          - name: dns-nm
          - name: tcp-dns
          - name: relay-graceful-shutdown
          - name: systemd/dns-systemd-resolved
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
      - uses: ./.github/actions/ghcr-docker-login
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
      - name: Seed database
        run: docker compose run elixir /bin/sh -c 'cd apps/domain && mix ecto.migrate --migrations-path priv/repo/migrations --migrations-path priv/repo/manual_migrations && mix ecto.seed'
      - name: Start docker compose in the background
        run: |
          set -xe
          if [[ -n "${{ matrix.test.rust_log }}" ]]; then
            export RUST_LOG="${{ matrix.test.rust_log }}"
          fi
          # Start one-by-one to avoid variability in service startup order
          docker compose up -d dns.httpbin.search.test --no-build
          docker compose up -d httpbin --no-build
          docker compose up -d download.httpbin --no-build
          docker compose up -d api web domain --no-build
          docker compose up -d otel --no-build
          docker compose up -d relay-1 --no-build
          docker compose up -d relay-2 --no-build
          docker compose up -d gateway --no-build
          docker compose up -d client --no-build
          docker compose up veth-config
          # Wait a few seconds for the services to fully start. GH runners are
          # slow, so this gives the Client enough time to initialize its tun interface,
          # for example.
          # Intended to mitigate <https://github.com/firezone/firezone/issues/5830>
          sleep 3
      - name: Add 50ms simulated API latency
        run: |
          docker compose exec -T -u root api sh -c 'apk add --update --no-cache iproute2-tc'
          docker compose exec -T -u root api sh -c 'tc qdisc add dev eth0 root netem delay 50ms'
      - name: Add 10ms simulated gateway latency
        run: |
          # compatibility test images won't have the `tc` command
          docker compose exec -T gateway sh -c 'apk add --update --no-cache iproute2-tc'
          docker compose exec -T gateway sh -c 'tc qdisc add dev eth0 root netem delay 10ms'
      - run: ./scripts/tests/${{ matrix.test.name }}.sh
      - name: Ensure Client emitted no warnings
        if: "!cancelled()"
        run: |
          # Disabling checksum offloading causes one or two "I/O error (os error 5)" warnings
          docker compose logs client | \
            grep --invert "I/O error (os error 5)" | \
            grep "WARN" && exit 1 || exit 0
      - name: Show Client logs
        if: "!cancelled()"
        run: docker compose logs client
      - name: Show Relay-1 logs
        if: "!cancelled()"
        run: docker compose logs relay-1
      - name: Show Relay-2 logs
        if: "!cancelled()"
        run: docker compose logs relay-2
      - name: Ensure Gateway emitted no warnings
        if: "!cancelled()"
        run: |
          # Disabling checksum offloading causes one or two "I/O error (os error 5)" warnings
          docker compose logs gateway | \
            grep --invert "I/O error (os error 5)" | \
            grep "WARN" && exit 1 || exit 0
      - name: Show Gateway logs
        if: "!cancelled()"
        run: docker compose logs gateway
      - name: Show API logs
        if: "!cancelled()"
        run: docker compose logs api
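
The two "Ensure ... emitted no warnings" steps in the workflow above fail
the job when a log contains a `WARN` line other than the expected
"I/O error (os error 5)" one. A minimal Python sketch of that filter
logic follows. The function name, the constant, and the sample log lines
are hypothetical; only the two match strings are taken from the
workflow's `grep` pipeline.

```python
from typing import Iterable, List

# Substrings that mark a warning as expected, mirroring the grep filter.
ALLOWED_WARNINGS = ("I/O error (os error 5)",)


def unexpected_warnings(
    log_text: str, allowed: Iterable[str] = ALLOWED_WARNINGS
) -> List[str]:
    """Return the WARN lines that contain none of the allowed substrings."""
    allowed = tuple(allowed)
    return [
        line
        for line in log_text.splitlines()
        if "WARN" in line and not any(pattern in line for pattern in allowed)
    ]


if __name__ == "__main__":
    # Hypothetical log excerpt, for illustration only.
    sample = (
        "INFO client started\n"
        "WARN send failed: I/O error (os error 5)\n"
        "WARN something unexpected\n"
    )
    # A non-empty result corresponds to the step's `exit 1` branch.
    raise SystemExit(1 if unexpected_warnings(sample) else 0)
```

As in the shell version, the allow-list is applied first and any
remaining `WARN` line is treated as a failure.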