Commit Graph

549 Commits

Author SHA1 Message Date
Jamil
8b2bf97513 fix(ci): RUN_MANUAL_MIGRATIONS=true (#10377)
This variable was renamed and not updated in our docker-compose.yml,
causing intermittent errors like this one:


https://github.com/firezone/firezone/actions/runs/17835644646/job/50712540454
2025-09-18 17:43:02 +00:00
Firezone Bot
8f46007674 chore: publish android-client 1.5.4 (#10374) 2025-09-18 10:37:20 -07:00
Thomas Eizinger
22eac1ad6d ci: add latency to routers (#10352)
Now that we have a more realistic network setup in our compose file, we
can extend our router containers to apply the latency on the network
path. This means any use of the compose file has a latency by default,
simplifying our CI setup. It also allows us to restart containers
without having to re-apply the latency which is useful during
performance testing.
2025-09-16 20:27:47 +00:00
Thomas Eizinger
737137df97 chore: remove nix flake (#10364)
I am not longer using Nix so this is now effectively unmaintained. Let's
remove it so it doesn't got stale.
2025-09-16 10:27:18 +00:00
Thomas Eizinger
e2dce710f1 refactor: tidy up docker-compose.yml (#10334)
Our `docker-compose.yml` file has grown to a degree where it is almost
unmanageable. Docker compose offers several tools to deal with complex
compose setups, like include files and yaml anchors.

We refactor our setup using these tools to organise these services and
their configuration a bit better.
2025-09-15 03:37:39 +00:00
Thomas Eizinger
0b89959354 fix(relay): handle relay-relay candidate pairs in eBPF (#10286)
Currently, the eBPF module can translate from channel data messages to
UDP packets and vice versa. It can even do that across IP stacks, i.e.
translate from an IPv6 UDP packet to an IPv4 channel data messages.

What it cannot do is handle packets to itself. This can happen if both -
Client and Gateway - pick the same relay to make an allocation. When
exchanging candidates, ICE will then form pairs between both relay
candidates, essentially requiring the relay to loop packets back to
itself.

In eBPF, we cannot do that. When sending a packet back out with
`XDP_TX`, it will actually go out on the wire without an additional
check whether they are for our own IP.

Properly handling this in eBPF (by comparing the destination IP to our
public IP) adds more cases we need to handle. The current module
structure where everything is one file makes this quite hard to
understand, which is why I opted to create four sub-modules:

- `from_ipv4_channel`
- `from_ipv4_udp`
- `from_ipv6_channel`
- `from_ipv6_udp`

For traffic arriving via a data-channel, it is possible that we also
need to send it back out via a data-channel if the peer address we are
sending to is the relay itself. Therefore, the `from_ipX_channel`
modules have four sub-modules:

- `to_ipv4_channel`
- `to_ipv4_udp`
- `to_ipv6_channel`
- `to_ipv6_udp`

For the traffic arriving on an allocation port (`from_ipX_udp`), we
always map to a data-channel and therefore can never get into a routing
loop, resulting in only two modules:

- `to_ipv4_channel`
- `to_ipv6_channel`

The actual implementation of the new code paths is rather simple and
mostly copied from the existing ones. For half of them, we don't need to
make any adjustments to the buffer size (i.e. IPv4 channel to IPv4
channel). For the other half, we need to adjust for the difference in
the IP header size.

To test these changes, we add a new integration test that makes use of
the new docker-compose setup added in #10301 and configures masquerading
for both Client and Gateway. To make this more useful, we also remove
the `direct-` prefix from all tests as the test script itself no longer
makes any decisions as to whether it is operating over a direct or
relayed connection.

Resolves: #7518
2025-09-11 07:19:23 +00:00
Thomas Eizinger
9cd25d70d8 ci: prevent packet reordering by router containers (#10328)
By default, RPS (Receive Packet Steering) is disabled on Linux which
means the CPU handling the interrupt for an incoming packet also handles
the packet. Under high-load, this can causes packet reordering in your
test setup where at least two routers are in the path between Client and
Gateway.

To ensure our test suite is deterministic, we enable RPS and set it to
1, meaning always CPU 1 will handle all packets.

Local testing has shown that this fixes the warnings of "packet counter
too old" on the Gateway and instead, all packets arrive entirely in
order.

Source:
https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/6/html/performance_tuning_guide/network-rps
2025-09-11 06:54:05 +00:00
Thomas Eizinger
83171d3a2d ci: add integration test for graceful Gateway shutdown (#10077)
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
2025-09-10 23:41:55 +00:00
Thomas Eizinger
d1d46fdfb4 ci: create a more realistic network setup (#10301)
Currently, the setup we have in docker-compose does not reflect
real-world scenarios very well because most components share the same
subnet. In reality, Clients, Gateways, relays and the backend are all in
separate subnets, connected via multiple routers on the Internet.

The current setup makes it hard to properly test relayed connections. To
fix this, we move all components into their own subnet with a dedicated
router container that performs source and destination NAT as well as
acts as a firewall for the client and gateway containers to not allow
inbound traffic.

This setup will allow us to more easily test #10286 which requires port
randomization for outgoing traffic on the Client and Gateway side.
2025-09-10 23:37:16 +00:00
Firezone Bot
d8079c869f chore: publish apple-client 1.5.8 (#10323) 2025-09-10 17:06:40 +00:00
Thomas Eizinger
f96cc3d583 feat(relay): remove graceful shutdown (#10322)
Initially, we added the graceful shutdown functionality to the relay to
better deal with deploys and achieve as minimal downtime as possible.
With the split of app and infrastructure that we now have, this
functionality is no longer necessary as portal deploys don't touch the
relay infra at all.

Thus, we can remove this functionality which will actually speed-up
deploys of the relays as systemd no longer has to time-out after sending
the SIGTERM to the binary.
2025-09-10 07:00:20 +00:00
Firezone Bot
af7f4c9992 chore: publish headless-client 1.5.3 (#10320) 2025-09-10 05:25:24 +00:00
Firezone Bot
cacef44b4b chore: publish gateway 1.4.16 (#10321) 2025-09-10 04:50:43 +00:00
Firezone Bot
ff8781b7b6 chore: publish gui-client 1.5.7 (#10319) 2025-09-10 04:22:09 +00:00
Thomas Eizinger
3cffeef483 ci: reduce target bitrate for UDP perf tests to 600Mbit/s (#10312)
To achieve a more stable CI, we need to reduce the target bitrate of the
UDP perf tests. Now that we no longer have GSO enabled in the tests, the
most we can achieve in CI is 600Mbit/s. Forcing more packets through the
tunnel results in all sorts of warnings which end up failing CI.
2025-09-09 12:58:33 +00:00
Thomas Eizinger
b762c3acde ci: don't restart portal at the beginning of the test (#10274)
Restarting the portal at the beginning of the test is useless. We
haven't made any connections yet so restarting it will just get us back
to the same state that we are already in.
2025-09-01 13:43:50 +00:00
Jamil
0ccd4bbf24 feat(ci): enable relay eBPF offloading (#10160)
In CI, eBPF in driver mode actually functions just fine with no changes
to our existing tests, given we apply a few workarounds and bugfixes:

- The interface learning mechanism had two flaws: (1) it only learned
per-CPU, which meant the risk for a missing entry grew as the core count
of the relay host grew, and (2) it did not filter for unicast IPs, so it
picked up broadcast and link-local addresses, causing cross-relay paths
to fail occasionally
- The `relay-relay` candidate where the two relays are the same relay
causes packet drops / loops in the Docker bridge setup, and possibly in
GCP too. I'm not sure this is a valid path that solves a real
connectivity issue in the wild. I can understand relay-relay paths where
two relays are different hosts, and the client and gateway both talk
over their TURN channel to each other (i.e. WireGuard is blocked in each
of their networks), but I can't think of an advantage for a relay-relay
candidate where the traffic simply hairpins (or is dropped) off the
nearest switch. This has been now detected with a new `PacketLoop` error
that triggers whenever source_ip == dest_ip.
- The relays in CI need a common next-hop to talk to for the MAC address
swapping to work. A simple router service is added which functions as a
basic L3 router (no NAT) that allows the MAC swapping to work.
- The `veth` driver has some peculiar requirements to allow it to
function with XDP_TX. If you send a packet out of one interface of a
veth pair with XDP_TX, you need to either make sure both interfaces have
GRO enabled, or you need to attach a dummy XDP program that simply does
XDP_PASS to the other interface so that the sk_buff is allocated before
going up the stack to the Docker bridge. The GRO method was unreliable
and didn't work in our case, causing massive packet delays and
unpredictable bursts that prevented ICE from working, so we use the
XDP_PASS method instead. A simple docker image is built and lives at
https://github.com/firezone/xdp-pass to handle this.

Related: #10138 
Related: #10260
2025-08-31 23:37:03 +00:00
Jamil
516be7417e fix(ci): remove extraneous caching (#10258)
- Removes the swift DerivedData cache. This was added to attempt to
speed up the Swift builds in CI but in reality, those are already fast
and the cache did not speed them up.
- Removes the runner.os/arch specifier from the Webview installer cache
key. The binary download is hardcoded for a specific windows version /
arch already so the cache key just adds unneeded complexity.

These caches are getting saved on PR runs which consumes excess GHA
cache storage.
2025-08-27 05:01:02 -07:00
Jamil
8eb738e66a chore(ci): downgrade runners to free tier (#10248)
To avoid burning Azure credits, we move the runners back down to the
free tier. Now that caching is properly set up, this should incur only a
minor increase in CI time.
2025-08-26 10:48:45 -07:00
Jamil
0698e0d35f ci: test IPv6 for CIDR resources (#10168)
Docker for Mac finally supports IPv6 in general availability. It's time
to add IPv6 to our suite of integration tests.

The thinking behind this PR is try and not slow down CI much, if at all,
by testing IPv6 side-by-side with the existing IPv4 tests.

More comprehensive testing is being developed in #10131 that will test
things like IPv4-in-6 relaying, client / gateway IP stack mismatches,
and so forth.
2025-08-18 20:59:40 +00:00
Firezone Bot
95ee111e62 chore: publish apple-client 1.5.7 (#10159) 2025-08-07 04:38:03 +00:00
Thomas Eizinger
456fde5b60 ci: increase bitrate of direct connection UDP perf tests (#10154)
We can easily handle 1GBit/s for the direct connections.
2025-08-06 14:02:47 +00:00
Thomas Eizinger
b5e3ee8065 ci: reduce UDP perf test bitrate (#10153)
Forcing 500MBit/s through a relayed connection in CI makes the
user-space relay fall-over and drop control messages, leading to ICE
timeouts of the connection.
2025-08-06 09:11:57 +00:00
Firezone Bot
ea960cce74 chore: publish android-client 1.5.3 (#10141) 2025-08-05 16:38:23 +00:00
Jamil
56f5405849 chore(ci): increase perf test time to 30s (#10133)
Our ICE timeout is ~15s, so it would be a good idea to ensure the perf
tests span a possible ICE timeout if it occurs in the test, so that we
may detect cases where high throughput may cause an ICE timeout.
2025-08-05 07:42:17 +00:00
Firezone Bot
3e529ed36c chore: publish gateway 1.4.15 (#10134) 2025-08-05 17:17:25 +10:00
Firezone Bot
acf52ccf1e chore: publish apple-client 1.5.6 (#10106) 2025-08-02 19:43:35 +00:00
Thomas Eizinger
a7ba15c8c1 ci: test packet loss behaviour using download (#10067) 2025-08-01 01:55:02 +00:00
Thomas Eizinger
69f9a03ee8 refactor(connlib): simplify IpPacket struct (#9795)
With the removal of the NAT64/46 modules, we can now simplify the
internals of our `IpPacket` struct. The requirements for our `IpPacket`
struct are somewhat delicate.

On the one hand, we don't want to be overly restrictive in our parsing /
validation code because there is a lot of broken software out there that
doesn't necessarily follow RFCs. Hence, we want to be as lenient as
possible in what we accept.

On the other hand, we do need to verify certain aspects of the packet,
like the payload lengths. At the moment, we are somewhat too lenient
there which causes errors on the Gateway where we have to NAT or
otherwise manipulate the packets. See #9567 or #9552 for example.

To fix this, we make the parsing in the `IpPacket` constructor more
restrictive. If it is a UDP, TCP or ICMP packet, we attempt to fully
parse its headers and validate the payload lengths.

This parsing allows us to then rely on the integrity of the packet as
part of the implementation. This does create several code paths that can
in theory panic but in practice, should be impossible to hit. To ensure
that this does in fact not happen, we also tackle an issue that is long
overdue: Fuzzing.

Resolves: #6667 
Resolves: #9567
Resolves: #9552
2025-07-29 04:42:57 +00:00
Jamil
1763113511 test(ci): test 20% packet loss (#9846)
Packet loss is a reality on the modern internet. Ideally, Firezone
should be able to handle some level of packet loss and still function
reliably, especially considering all of the UDP-based protocols we rely
on.

To test this, we set an extreme packet loss of 20% and perform a 10 MB
download through Firezone. Doing so actually exposed a bug:

For DNS resources, we need to set up the DNS resource NAT on the Gateway
which happens through the p2p control protocol. This packet is resent at
most every 2s but only if there are any other DNS queries. If we don't
receive another DNS query but get traffic for the resource, we keep
buffering those packets without trying to re-send the `AssignedIp`s
packet.
2025-07-28 22:51:04 +00:00
Firezone Bot
e6fc7e62da chore: publish apple-client 1.5.5 (#10035) 2025-07-28 20:14:12 +00:00
Firezone Bot
2309be11fc chore: publish headless-client 1.5.2 (#10029) 2025-07-28 06:17:42 +00:00
Firezone Bot
cf40f4dd96 chore: publish gateway 1.4.14 (#10030) 2025-07-28 06:14:07 +00:00
Firezone Bot
7b8daf4074 chore: publish gui-client 1.5.6 (#10028) 2025-07-28 06:08:01 +00:00
Firezone Bot
a11983e4b3 chore: publish gateway 1.4.13 (#9969) 2025-07-22 18:56:40 +00:00
Thomas Eizinger
47b35d6e3c ci: increase timeout for download roaming test (#9945)
Now that we don't tolerate any failures in the download, this test
sometimes fails because the timeout is a bit too tight.
2025-07-21 04:06:37 +00:00
Thomas Eizinger
72fbe306b6 test: remove curl retry in favor of keep-alive (#9888)
At present, the `direct-download-roaming-network` integration test is a
bit odd. It uses the `--retry` switch from `curl` to retry the download
once it failed. However, what we want to show with this integration test
is that a TCP connection can survive network roaming. We can show that
successfully but only if we specify the `--keepalive-time` option,
otherwise the download stalls.

From inspecting the network logs, this is because `curl` simply waits
for more data to be downloaded. After a network reset, the connection
however is gone and the _client_ (in this case `curl`) needs to send at
least 1 packet to re-establish the connection. By using the keep-alive
option, we can send such a packet and the download completes
successfully.
2025-07-16 16:17:27 +00:00
Thomas Eizinger
cf2470ba1e test(iperf): install iptables rule inside of container (#9880)
In Docker environments, applying iptables rules to filter
container-container traffic on the Docker bridged network is not
reliable, leading to direct connections being established in our relayed
tests. To fix this, we insert the rules directly from the client
container itself.

---------

Co-authored-by: Jamil Bou Kheir <jamilbk@users.noreply.github.com>
2025-07-16 10:29:33 +00:00
Jamil
12351e5985 ci: publish apple 1.5.4 clients (#9842) 2025-07-11 16:35:25 +00:00
Thomas Eizinger
55aef6ae11 chore: publish gui-client 1.5.5 (#9811) 2025-07-09 12:44:38 +00:00
Jamil
4a02e89b43 ci: publish headless 1.5.1 (#9791) 2025-07-05 08:18:14 +00:00
Thomas Eizinger
0617c96b69 chore(nix): follow system release channel (#9784)
In order to avoid glibc mismatches, the dev-shell started by the nix
flake should follow the system channel instead of pinning a version by
itself.
2025-07-04 14:27:53 +00:00
Jamil
4091457788 ci: publish android 1.5.2 (#9735)
**NOTE**: This is for last week's release of 1.5.2. We will still need
to do a release to cut 1.5.3.
2025-07-01 14:11:48 +00:00
Jamil
a4cf3ead0f ci: publish gateway 1.4.12 (#9736) 2025-07-01 14:04:21 +00:00
Jamil
699739deae fix(docs): use sha256sum over sha256 (#9690)
`sha256` isn't found by default on some machines.
2025-06-27 20:08:41 +00:00
Jamil
1acfcd5678 fix(gateway): bump timeoutstart to 15s (#9685)
3s on a fresh install may be too low for the binary to download.

Related:
https://firezonehq.slack.com/archives/C08FPHECLUF/p1750946191078759?thread_ts=1750940488.328739&cid=C08FPHECLUF
2025-06-26 14:19:44 +00:00
Thomas Eizinger
40f0609d90 ci: lint GitHub workflows with actionlint (#9590)
[`actionlint`](https://github.com/rhysd/actionlint) is a static analysis
tool for GitHub workflows and actions. It detects various issues ahead
of time and runs shellcheck on all `run` blocks. It is worth noting that
this does **not** lint the contents of composite actions so we still
need to be vigilant when working with those.
2025-06-24 08:05:10 +00:00
Thomas Eizinger
a91dda139f feat(connlib): only conditionally hash firezone ID (#9633)
A bit of legacy that we have inherited around our Firezone ID is that
the ID stored on the user's device is sha'd before being passed to the
portal as the "external ID". This makes it difficult to correlate IDs in
Sentry and PostHog with the data we have in the portal. For Sentry and
PostHog, we submit the raw UUID stored on the user's device.

As a first step in overcoming this, we embed an "external ID" in those
services as well IF the provided Firezone ID is a valid UUID. This will
allow us to immediately correlate those events.

As a second step, we automatically generate all new Firezone IDs for the
Windows and Linux Client as `hex(sha256(uuid))`. These won't parse as
valid UUIDs and therefore will be submitted as is to the portal.

As a third step, we update all documentation around generating Firezone
IDs to use `uuidgen | sha256` instead of just `uuidgen`. This is
effectively the equivalent of (2) but for the Headless Client and
Gateway where the Firezone ID can be configured via environment
variables.

Resolves: #9382

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
2025-06-24 07:05:48 +00:00
Jamil
081b075f2c chore: bump gui, apple, gateway (#9586)
The new publish automation still [has some
kinks](https://github.com/firezone/firezone/actions/runs/15764891111) so
publishing this manually.
2025-06-19 12:29:46 -07:00
Jamil
f50fa95778 fix(ci): lock xcode major (#9585)
Apple won't allow apps built with Xcode betas to be reviewed.

<img width="1146" alt="Screenshot 2025-06-19 at 9 04 17 AM"
src="https://github.com/user-attachments/assets/11470f04-603b-4c5c-aad2-fba0e4eb391a"
/>
2025-06-19 09:21:58 -07:00