Commit Graph

573 Commits

Author SHA1 Message Date
Thomas Eizinger
e374560ecc ci: increase txqueuelen in containers (#10713)
By default, Docker creates network interfaces with a txqueuelen of 1000.
This is pretty small and causes unnecessary packet drops when running
perf tests in that setup.
2025-10-27 04:40:45 +00:00
Thomas Eizinger
b97761903b fix: initialize num_ipv4_tuples variable (#10712)
I believe this test may be flaky because a declared variable in Bash is
perhaps not necessarily initialised.

Resolves: #10711
2025-10-27 04:32:52 +00:00
Thomas Eizinger
f4ce6f12a0 test: handle IPv6-only roaming (#10703)
It is possible that our "roaming" integration test only ever connects
Client and Gateway over IPv6. In that case, the number of unique source
tuples after roaming is only 2 and not 3. To handle this case, we count
the number of IPv4 tuples and change the assertion accordingly.

In the current state, this causes test failures on `main`.
2025-10-24 05:06:42 +00:00
Thomas Eizinger
28ea0730b6 feat(apt): import .deb files from import- directory (#10694)
Currently, the `sync-apt.sh` script just generates metadata for all
packages found in the `.deb` directory. Unfortunately, this requires the
packages to already be uploaded with a certain naming convention,
otherwise `apt-ftparchive packages` doesn't actually detect them and
creates an empty `Packages` file.

The solution here is to extend the `sync-apt.sh` script to normalize the
filename to what we need it to be. This requires us to upload the new
`.deb` files to the `pool` directory. Instead of messing around with the
existing files in there, we slightly change how the `sync-apt.sh` script
works.

In its new version, it expects packages to be in the `import-stable` and
`import-preview` directories. It will then download these, normalize
their names and move them to a local `pool-stable` and `pool-preview`
directory respectively (potentially overwriting and existing one that is
already there, this allows for updating packages).

As a final step, it will generate the metadata for all packages in
`pool-stable` and `pool-preview`, upload both directories, upload the
metadata and then delete the imported `.deb` files.
2025-10-23 10:09:07 +00:00
Thomas Eizinger
ce297999c9 feat(gateway): capture domain name of flow (#10692)
Whenever we route a packet from the Client to a DNS resource, we now
also capture the domain name. If this is the first packet and we are
thus creating a new flow, we'll save that domain in it. Later packets
for the same IP are rolled up under the same flow and thus don't need to
re-set the domain.

Resolves: #10691
2025-10-23 04:25:37 +00:00
Thomas Eizinger
883d95c2c8 feat(apt): sign contents of APT repository (#10688)
In order to secure an APT repository, the `Release` file containing the
hashes of all packages needs to be signed with a GPG key. These
signatures simply need to be synced back up to the repository. The rest
is handled by `apt` itself.

Resolves: #10599
2025-10-22 23:44:48 +00:00
Thomas Eizinger
c1a6a968ac test: assert that we have at least 3 different source tuples (#10689)
This integration test is currently flaky because we might "roam" between
IPv4 and IPv6 during ICE already. To assert that we actually roamed, we
need to check that we have at least 3 different source tuples in our
list of flows.
2025-10-22 23:17:10 +00:00
Thomas Eizinger
5aa2069008 test: assert that we have at least 3 flows (#10687)
Fixes a flaky test where we sometimes would also find the already RST'd
flows from curl's happy-eyeballs connection attempt.
2025-10-22 22:14:20 +00:00
Thomas Eizinger
d734ee0d21 ci: compare source tuple instead of only port (#10680)
This test is currently flaky on main because it may happen that we first
roam from our IPv4 address to the IPv6 one. Therefore, to make the
assertion pass we need to check that all flows have a different source
tuple and not just a different source port.
2025-10-22 19:14:00 +00:00
Thomas Eizinger
2026d08aea chore(apt): introduce stable and preview distributions (#10679)
In order to distribute pre-releases, it is useful to have a `preview`
distribution in addition to the `stable` one.

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
2025-10-22 12:11:33 +00:00
Thomas Eizinger
6a538368cb feat(gateway): add flow-logs MVP (#10576)
Network flow logs are a common feature of VPNs. Due to the nature of a
shared exit node, it is of great interest to a network analyst, which
TCP connections are getting routed through the tunnel, who is initiating
them, for long do they last and how much traffic is sent across them.

With this PR, the Firezone Gateway gains the ability of detecting the
TCP and UDP flows that are being routed through it. The information we
want to attach to these flows is spread out over several layers of the
packet handling code. To simplify the implementation and not complicate
the APIs unnecessarily, we chose to rely on TLS (thread-local storage)
for gathering all the necessary data as a packet gets passed through the
various layers. When using a const initializer, the overhead of a TLS
variable over an actual local variable is basically zero. The entire
routing state of the Gateway is also never sent across any threads,
making TLS variables a particularly good choice for this problem.

In its MVP form, the detected flows are only emitted on stdout and also
that only if `flow_logs=trace` is set using `RUST_LOG`. Early adopters
of this feature are encouraged to enable these logs as described and
then ingest the Gateway's logs into the SIEM of their choice for further
analysis.

Related: #8353
2025-10-22 03:10:21 +00:00
Firezone Bot
76d86545a6 chore: publish apple-client 1.5.9 (#10654) 2025-10-20 14:04:08 +00:00
Brian Manifold
27565ea5c8 refactor(portal): remove soft delete elements from portal code (#10607)
Why:

* In previous commits, the portal code had been updated to use hard
deletion rather than soft deletion of data. The fields used in the soft
deletion were still kept in the DB and the code to allow for zero
downtime rollout and an easy rollback if necessary. To continue with
that work the portal code has now been updated to remove any reference
to the soft deleted fields (e.g. deleted_at, persistent_id, etc...).
While the code has been updated the actual data in the DB will need to
remain for now, to once again allow for a zero downtime rollout. Once
this commit has been deployed to production another PR can follow to
remove the columns from the necessary tables in the DB.


Related: #8187
2025-10-18 17:02:26 +00:00
Firezone Bot
9b6ebb01ed chore: publish android-client 1.5.5 (#10614) 2025-10-18 16:54:35 +00:00
Thomas Eizinger
7e5ec7c2d7 ci: upload .deb from releases to APT repository (#10587)
This PR creates the necessary CI infrastructure to copy `.deb` packages
from releases to our APT repository. Re-generation of the index is
separated out into a dedicated workflow to avoid concurrency issues and
so we can re-generate it without making a release.

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-10-16 19:39:35 +00:00
Thomas Eizinger
17ab1a6d04 ci: remove jitter from docker-compose (#10589)
Jitter causes packets to get re-ordered which makes it really hard to
get predictable performance results. With jitter disabled, we get more
consistent performance numbers.
2025-10-16 13:34:59 +00:00
Firezone Bot
5272e0c992 chore: publish headless-client 1.5.4 (#10590) 2025-10-16 09:15:32 +00:00
Firezone Bot
f78cccea1b chore: publish gui-client 1.5.8 (#10591) 2025-10-16 08:47:35 +00:00
Firezone Bot
e3bb2fb931 chore: publish gateway 1.4.17 (#10584) 2025-10-16 05:38:12 +00:00
Thomas Eizinger
a1b2ca195c ci(apple): explicitly select Xcode 26.0 (#10511)
In order to build the iOS app with the Xcode version that is installed
on the GitHub runners, we need to select the Xcode version by major and
minor version. Currently, the iOS builds are failing because Xcode 26.1
also exists but iOS 26.1 isn't supported (or released?).

See
https://github.com/firezone/firezone/actions/runs/18239282351/job/51938727311.
2025-10-06 16:07:34 +00:00
Thomas Eizinger
0d61cacb08 ci: add 20% jitter in test environment (#10504)
To simulate the real-world more accurately, we add a 20% jitter to the
specified latency on the router containers.
2025-10-02 05:27:05 +00:00
Thomas Eizinger
b11adfcfe4 feat(connlib): create flow on ICMP error "prohibited" (#10462)
In Firezone, a Client requests an "access authorization" for a Resource
on the fly when it sees the first packet for said Resource going through
the tunnel. If we don't have a connection to the Gateway yet, this is
also where we will establish a connection and create the WireGuard
tunnel.

In order for this to work, the access authorization state between the
Client and the Gateway MUST NOT get out of sync. If the Client thinks it
has access to a Resource, it will just route the traffic to the Gateway.
If the access authorization on the Gateway has expired or vanished
otherwise, the packets will be black-holed.

Starting with #9816, the Gateway sends ICMP errors back to the
application whenever it filters a packet. This can happen either because
the access authorization is gone or because the traffic wasn't allowed
by the specific filter rules on the Resource.

With this patch, the Client will attempt to create a new flow (i.e.
re-authorize) traffic for this resource whenever it sees such an ICMP
error, therefore acting as a way of synchronizing the view of the world
between Client and Gateway should they ever run out of sync.

Testing turned out to be a bit tricky. If we let the authorization on
the Gateway lapse naturally, we portal will also toggle the Resource off
and on on the Client, resulting in "flushing" the current
authorizations. Additionally, it the Client had only access to one
Resource, then the Gateway will gracefully close the connection, also
resulting in the Client creating a new flow for the next packet.

To actually trigger this new behaviour we need to:

- Access at least two resources via the same Gateway
- Directly send `reject_access` to the Gateway for this particular
resource

To achieve this, we dynamically eval some code on the API node and
instruct the Gateway channel to send `reject_access`. The connection
stays intact because there is still another active access authorization
but packets for the other resource are answered with ICMP errors.

To achieve a safe roll-out, the new behaviour is feature-flagged. In
order to still test it, we now also allow feature flags to be set via
env variables.

Resolves: #10074

---------

Co-authored-by: Mariusz Klochowicz <mariusz@klochowicz.com>
2025-09-30 08:23:39 +00:00
Thomas Eizinger
9865e03343 ci: fix double symmetric NAT test failure (#10410)
As it turns out, the flaky test was caused by a bug in the eBPF kernel where we read the old channel data header from the wrong offset. This made us essentially read garbage data for the channel number, causing us to:

a. Compute a bad checksum
b. Send the packet on a completely wrong channel

The reason this caused a flaky test is that it requires on side to pick IPv4 to talk to the relay and the other side IPv6. The happy-eyeballs approach of the `allocation` module made that non-deterministic, only exposing this bug occasionally.

To ensure these kind of things are detected earlier in the future, I am adding an additional CI step that checks all packets emitted by the eBPF kernel for checksum errors.

Fixes: #10404

Co-authored-by: Jamil Bou Kheir <jamilbk@users.noreply.github.com>
2025-09-25 17:53:17 +10:00
Thomas Eizinger
cf837c5087 ci: fix build context for relay container (#10426)
The build context is taken relative from where the file is defined,
meaning we first need to navigate to directories up.
2025-09-23 04:01:55 +00:00
Jamil
8b2bf97513 fix(ci): RUN_MANUAL_MIGRATIONS=true (#10377)
This variable was renamed and not updated in our docker-compose.yml,
causing intermittent errors like this one:


https://github.com/firezone/firezone/actions/runs/17835644646/job/50712540454
2025-09-18 17:43:02 +00:00
Firezone Bot
8f46007674 chore: publish android-client 1.5.4 (#10374) 2025-09-18 10:37:20 -07:00
Thomas Eizinger
22eac1ad6d ci: add latency to routers (#10352)
Now that we have a more realistic network setup in our compose file, we
can extend our router containers to apply the latency on the network
path. This means any use of the compose file has a latency by default,
simplifying our CI setup. It also allows us to restart containers
without having to re-apply the latency which is useful during
performance testing.
2025-09-16 20:27:47 +00:00
Thomas Eizinger
737137df97 chore: remove nix flake (#10364)
I am not longer using Nix so this is now effectively unmaintained. Let's
remove it so it doesn't got stale.
2025-09-16 10:27:18 +00:00
Thomas Eizinger
e2dce710f1 refactor: tidy up docker-compose.yml (#10334)
Our `docker-compose.yml` file has grown to a degree where it is almost
unmanageable. Docker compose offers several tools to deal with complex
compose setups, like include files and yaml anchors.

We refactor our setup using these tools to organise these services and
their configuration a bit better.
2025-09-15 03:37:39 +00:00
Thomas Eizinger
0b89959354 fix(relay): handle relay-relay candidate pairs in eBPF (#10286)
Currently, the eBPF module can translate from channel data messages to
UDP packets and vice versa. It can even do that across IP stacks, i.e.
translate from an IPv6 UDP packet to an IPv4 channel data messages.

What it cannot do is handle packets to itself. This can happen if both -
Client and Gateway - pick the same relay to make an allocation. When
exchanging candidates, ICE will then form pairs between both relay
candidates, essentially requiring the relay to loop packets back to
itself.

In eBPF, we cannot do that. When sending a packet back out with
`XDP_TX`, it will actually go out on the wire without an additional
check whether they are for our own IP.

Properly handling this in eBPF (by comparing the destination IP to our
public IP) adds more cases we need to handle. The current module
structure where everything is one file makes this quite hard to
understand, which is why I opted to create four sub-modules:

- `from_ipv4_channel`
- `from_ipv4_udp`
- `from_ipv6_channel`
- `from_ipv6_udp`

For traffic arriving via a data-channel, it is possible that we also
need to send it back out via a data-channel if the peer address we are
sending to is the relay itself. Therefore, the `from_ipX_channel`
modules have four sub-modules:

- `to_ipv4_channel`
- `to_ipv4_udp`
- `to_ipv6_channel`
- `to_ipv6_udp`

For the traffic arriving on an allocation port (`from_ipX_udp`), we
always map to a data-channel and therefore can never get into a routing
loop, resulting in only two modules:

- `to_ipv4_channel`
- `to_ipv6_channel`

The actual implementation of the new code paths is rather simple and
mostly copied from the existing ones. For half of them, we don't need to
make any adjustments to the buffer size (i.e. IPv4 channel to IPv4
channel). For the other half, we need to adjust for the difference in
the IP header size.

To test these changes, we add a new integration test that makes use of
the new docker-compose setup added in #10301 and configures masquerading
for both Client and Gateway. To make this more useful, we also remove
the `direct-` prefix from all tests as the test script itself no longer
makes any decisions as to whether it is operating over a direct or
relayed connection.

Resolves: #7518
2025-09-11 07:19:23 +00:00
Thomas Eizinger
9cd25d70d8 ci: prevent packet reordering by router containers (#10328)
By default, RPS (Receive Packet Steering) is disabled on Linux which
means the CPU handling the interrupt for an incoming packet also handles
the packet. Under high-load, this can causes packet reordering in your
test setup where at least two routers are in the path between Client and
Gateway.

To ensure our test suite is deterministic, we enable RPS and set it to
1, meaning always CPU 1 will handle all packets.

Local testing has shown that this fixes the warnings of "packet counter
too old" on the Gateway and instead, all packets arrive entirely in
order.

Source:
https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/6/html/performance_tuning_guide/network-rps
2025-09-11 06:54:05 +00:00
Thomas Eizinger
83171d3a2d ci: add integration test for graceful Gateway shutdown (#10077)
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
2025-09-10 23:41:55 +00:00
Thomas Eizinger
d1d46fdfb4 ci: create a more realistic network setup (#10301)
Currently, the setup we have in docker-compose does not reflect
real-world scenarios very well because most components share the same
subnet. In reality, Clients, Gateways, relays and the backend are all in
separate subnets, connected via multiple routers on the Internet.

The current setup makes it hard to properly test relayed connections. To
fix this, we move all components into their own subnet with a dedicated
router container that performs source and destination NAT as well as
acts as a firewall for the client and gateway containers to not allow
inbound traffic.

This setup will allow us to more easily test #10286 which requires port
randomization for outgoing traffic on the Client and Gateway side.
2025-09-10 23:37:16 +00:00
Firezone Bot
d8079c869f chore: publish apple-client 1.5.8 (#10323) 2025-09-10 17:06:40 +00:00
Thomas Eizinger
f96cc3d583 feat(relay): remove graceful shutdown (#10322)
Initially, we added the graceful shutdown functionality to the relay to
better deal with deploys and achieve as minimal downtime as possible.
With the split of app and infrastructure that we now have, this
functionality is no longer necessary as portal deploys don't touch the
relay infra at all.

Thus, we can remove this functionality which will actually speed-up
deploys of the relays as systemd no longer has to time-out after sending
the SIGTERM to the binary.
2025-09-10 07:00:20 +00:00
Firezone Bot
af7f4c9992 chore: publish headless-client 1.5.3 (#10320) 2025-09-10 05:25:24 +00:00
Firezone Bot
cacef44b4b chore: publish gateway 1.4.16 (#10321) 2025-09-10 04:50:43 +00:00
Firezone Bot
ff8781b7b6 chore: publish gui-client 1.5.7 (#10319) 2025-09-10 04:22:09 +00:00
Thomas Eizinger
3cffeef483 ci: reduce target bitrate for UDP perf tests to 600Mbit/s (#10312)
To achieve a more stable CI, we need to reduce the target bitrate of the
UDP perf tests. Now that we no longer have GSO enabled in the tests, the
most we can achieve in CI is 600Mbit/s. Forcing more packets through the
tunnel results in all sorts of warnings which end up failing CI.
2025-09-09 12:58:33 +00:00
Thomas Eizinger
b762c3acde ci: don't restart portal at the beginning of the test (#10274)
Restarting the portal at the beginning of the test is useless. We
haven't made any connections yet so restarting it will just get us back
to the same state that we are already in.
2025-09-01 13:43:50 +00:00
Jamil
0ccd4bbf24 feat(ci): enable relay eBPF offloading (#10160)
In CI, eBPF in driver mode actually functions just fine with no changes
to our existing tests, given we apply a few workarounds and bugfixes:

- The interface learning mechanism had two flaws: (1) it only learned
per-CPU, which meant the risk for a missing entry grew as the core count
of the relay host grew, and (2) it did not filter for unicast IPs, so it
picked up broadcast and link-local addresses, causing cross-relay paths
to fail occasionally
- The `relay-relay` candidate where the two relays are the same relay
causes packet drops / loops in the Docker bridge setup, and possibly in
GCP too. I'm not sure this is a valid path that solves a real
connectivity issue in the wild. I can understand relay-relay paths where
two relays are different hosts, and the client and gateway both talk
over their TURN channel to each other (i.e. WireGuard is blocked in each
of their networks), but I can't think of an advantage for a relay-relay
candidate where the traffic simply hairpins (or is dropped) off the
nearest switch. This has been now detected with a new `PacketLoop` error
that triggers whenever source_ip == dest_ip.
- The relays in CI need a common next-hop to talk to for the MAC address
swapping to work. A simple router service is added which functions as a
basic L3 router (no NAT) that allows the MAC swapping to work.
- The `veth` driver has some peculiar requirements to allow it to
function with XDP_TX. If you send a packet out of one interface of a
veth pair with XDP_TX, you need to either make sure both interfaces have
GRO enabled, or you need to attach a dummy XDP program that simply does
XDP_PASS to the other interface so that the sk_buff is allocated before
going up the stack to the Docker bridge. The GRO method was unreliable
and didn't work in our case, causing massive packet delays and
unpredictable bursts that prevented ICE from working, so we use the
XDP_PASS method instead. A simple docker image is built and lives at
https://github.com/firezone/xdp-pass to handle this.

Related: #10138 
Related: #10260
2025-08-31 23:37:03 +00:00
Jamil
516be7417e fix(ci): remove extraneous caching (#10258)
- Removes the swift DerivedData cache. This was added to attempt to
speed up the Swift builds in CI but in reality, those are already fast
and the cache did not speed them up.
- Removes the runner.os/arch specifier from the Webview installer cache
key. The binary download is hardcoded for a specific windows version /
arch already so the cache key just adds unneeded complexity.

These caches are getting saved on PR runs which consumes excess GHA
cache storage.
2025-08-27 05:01:02 -07:00
Jamil
8eb738e66a chore(ci): downgrade runners to free tier (#10248)
To avoid burning Azure credits, we move the runners back down to the
free tier. Now that caching is properly set up, this should incur only a
minor increase in CI time.
2025-08-26 10:48:45 -07:00
Jamil
0698e0d35f ci: test IPv6 for CIDR resources (#10168)
Docker for Mac finally supports IPv6 in general availability. It's time
to add IPv6 to our suite of integration tests.

The thinking behind this PR is try and not slow down CI much, if at all,
by testing IPv6 side-by-side with the existing IPv4 tests.

More comprehensive testing is being developed in #10131 that will test
things like IPv4-in-6 relaying, client / gateway IP stack mismatches,
and so forth.
2025-08-18 20:59:40 +00:00
Firezone Bot
95ee111e62 chore: publish apple-client 1.5.7 (#10159) 2025-08-07 04:38:03 +00:00
Thomas Eizinger
456fde5b60 ci: increase bitrate of direct connection UDP perf tests (#10154)
We can easily handle 1GBit/s for the direct connections.
2025-08-06 14:02:47 +00:00
Thomas Eizinger
b5e3ee8065 ci: reduce UDP perf test bitrate (#10153)
Forcing 500MBit/s through a relayed connection in CI makes the
user-space relay fall-over and drop control messages, leading to ICE
timeouts of the connection.
2025-08-06 09:11:57 +00:00
Firezone Bot
ea960cce74 chore: publish android-client 1.5.3 (#10141) 2025-08-05 16:38:23 +00:00
Jamil
56f5405849 chore(ci): increase perf test time to 30s (#10133)
Our ICE timeout is ~15s, so it would be a good idea to ensure the perf
tests span a possible ICE timeout if it occurs in the test, so that we
may detect cases where high throughput may cause an ICE timeout.
2025-08-05 07:42:17 +00:00
Firezone Bot
3e529ed36c chore: publish gateway 1.4.15 (#10134) 2025-08-05 17:17:25 +10:00