20 Commits

Thomas Eizinger
d1d46fdfb4 ci: create a more realistic network setup (#10301)
Currently, our docker-compose setup does not reflect real-world scenarios
very well because most components share the same subnet. In reality,
Clients, Gateways, Relays and the backend all live in separate subnets,
connected via multiple routers across the Internet.

The current setup makes it hard to properly test relayed connections. To
fix this, we move all components into their own subnet, with a dedicated
router container that performs source and destination NAT and acts as a
firewall so that the Client and Gateway containers do not accept inbound
traffic.

This setup will allow us to more easily test #10286 which requires port
randomization for outgoing traffic on the Client and Gateway side.
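
Conceptually, source NAT plus a deny-inbound firewall is just a
connection-tracking table: outbound packets create a mapping, and inbound
packets are only accepted if they match one. A rough Rust sketch of the idea
(illustrative only, not the actual router container's implementation):

```rust
// Illustrative sketch only, not the router container's implementation:
// source NAT whose connection-tracking table doubles as the firewall.
use std::collections::HashMap;
use std::net::{IpAddr, SocketAddr};

struct Nat {
    public_ip: IpAddr,
    next_port: u16,
    outbound: HashMap<(SocketAddr, SocketAddr), u16>, // (internal src, dst) -> public port
    inbound: HashMap<u16, (SocketAddr, SocketAddr)>,  // public port -> (internal src, dst)
}

impl Nat {
    /// Outbound packets always pass and create (or reuse) a mapping.
    fn translate_outbound(&mut self, src: SocketAddr, dst: SocketAddr) -> SocketAddr {
        let key = (src, dst);
        let port = match self.outbound.get(&key) {
            Some(port) => *port,
            None => {
                let port = self.next_port;
                self.next_port = self.next_port.wrapping_add(1);
                self.outbound.insert(key, port);
                self.inbound.insert(port, key);
                port
            }
        };
        SocketAddr::new(self.public_ip, port)
    }

    /// Inbound packets only pass if a prior outbound packet created a
    /// mapping; unsolicited traffic is dropped, which gives the Client and
    /// Gateway containers their "no inbound" firewall behaviour.
    fn translate_inbound(&self, src: SocketAddr, dst_port: u16) -> Option<SocketAddr> {
        let (internal_src, external_dst) = self.inbound.get(&dst_port)?;
        (src == *external_dst).then_some(*internal_src)
    }
}
```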
2025-09-10 23:37:16 +00:00
Thomas Eizinger
3cffeef483 ci: reduce target bitrate for UDP perf tests to 600Mbit/s (#10312)
To achieve a more stable CI, we need to reduce the target bitrate of the
UDP perf tests. Now that we no longer have GSO enabled in the tests, the
most we can achieve in CI is 600Mbit/s. Forcing more packets through the
tunnel results in all sorts of warnings which end up failing CI.
2025-09-09 12:58:33 +00:00
Jamil
0ccd4bbf24 feat(ci): enable relay eBPF offloading (#10160)
In CI, eBPF in driver mode actually functions just fine with no changes
to our existing tests, provided we apply a few workarounds and bugfixes:

- The interface learning mechanism had two flaws: (1) it only learned
per-CPU, which meant the risk of a missing entry grew with the core count
of the relay host, and (2) it did not filter for unicast IPs, so it
picked up broadcast and link-local addresses, occasionally causing
cross-relay paths to fail.
- The `relay-relay` candidate where the two relays are the same relay
causes packet drops / loops in the Docker bridge setup, and possibly in
GCP too. I'm not sure this is a valid path that solves a real
connectivity issue in the wild. I can understand relay-relay paths where
the two relays are different hosts and the client and gateway each talk
to the other over their own TURN channel (i.e. WireGuard is blocked in
both of their networks), but I can't think of an advantage for a
relay-relay candidate where the traffic simply hairpins off (or is
dropped by) the nearest switch. This is now detected with a new
`PacketLoop` error that triggers whenever `source_ip == dest_ip`.
- The relays in CI need a common next-hop for the MAC address swapping to
work. We add a simple router service that acts as a basic L3 router (no
NAT) and provides that next-hop.
- The `veth` driver has some peculiar requirements for XDP_TX to work. If
you send a packet out of one interface of a veth pair with XDP_TX, you
must either ensure that both interfaces have GRO enabled, or attach a
dummy XDP program that simply does XDP_PASS to the other interface so
that the sk_buff is allocated before the packet goes up the stack to the
Docker bridge. The GRO method was unreliable in our case, causing massive
packet delays and unpredictable bursts that prevented ICE from working,
so we use the XDP_PASS method instead (see the sketch below). A simple
docker image for this is built and lives at
https://github.com/firezone/xdp-pass.
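
For reference, such a dummy program is tiny. A sketch in Rust using the
`aya-ebpf` crate (the actual firezone/xdp-pass image may well be implemented
differently, e.g. in C):

```rust
// Sketch of a minimal XDP_PASS program using the `aya-ebpf` crate; the
// actual firezone/xdp-pass image may be implemented differently.
#![no_std]
#![no_main]

use aya_ebpf::{bindings::xdp_action::XDP_PASS, macros::xdp, programs::XdpContext};

#[xdp]
pub fn xdp_pass(_ctx: XdpContext) -> u32 {
    // Touch nothing: handing every frame to the regular network stack is
    // enough to get an sk_buff allocated on the peer veth interface.
    XDP_PASS
}

#[panic_handler]
fn panic(_info: &core::panic::PanicInfo) -> ! {
    loop {}
}
```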

Related: #10138 
Related: #10260
2025-08-31 23:37:03 +00:00
Jamil
8eb738e66a chore(ci): downgrade runners to free tier (#10248)
To avoid burning Azure credits, we move the runners back down to the
free tier. Now that caching is properly set up, this should incur only a
minor increase in CI time.
2025-08-26 10:48:45 -07:00
Thomas Eizinger
456fde5b60 ci: increase bitrate of direct connection UDP perf tests (#10154)
We can easily handle 1GBit/s for the direct connections.
2025-08-06 14:02:47 +00:00
Thomas Eizinger
b5e3ee8065 ci: reduce UDP perf test bitrate (#10153)
Forcing 500MBit/s through a relayed connection in CI makes the
user-space relay fall over and drop control messages, leading to ICE
timeouts for the connection.
2025-08-06 09:11:57 +00:00
Jamil
56f5405849 chore(ci): increase perf test time to 30s (#10133)
Our ICE timeout is ~15s, so the perf tests should run long enough to span
a possible ICE timeout, letting us detect cases where high throughput
causes one.
2025-08-05 07:42:17 +00:00
Thomas Eizinger
a8aafc9e14 ci: use bencher.dev for continuous benchmarking (#5915)
Currently, we have a homegrown benchmark suite that reports results of
the iperf runs within CI by comparing a run on `main` with the current
branch.

These comments are noisy because they happen on every PR, regardless of
the performance results. As a result, they tend to be skimmed over by
devs and not actually considered. To properly track performance, we need
to record benchmark results over time and use statistics to detect
regressions.

https://bencher.dev does exactly that. It supports various benchmark
harnesses for collecting benchmark results automatically. In our case, we
simply use the generic JSON adapter to extract the relevant metrics from
the iperf results and report them to the bencher backend.

With these metrics in place, bencher can plot the results over time and
alert us to regressions using thresholds based on statistical tests.
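
A sketch of that extraction step (the iperf3 field names and the
`throughput` measure slug are assumptions; the real CI script may differ):

```rust
// Sketch (not the actual CI script): turn an iperf3 JSON report into
// Bencher Metric Format (BMF) for the generic JSON adapter.
use serde_json::{json, Map, Value};

fn iperf_to_bmf(report: &Value, benchmark_name: &str) -> Value {
    // iperf3 puts the run summary under `end`: `sum_received` for TCP runs,
    // `sum` for UDP runs.
    let bits_per_second = report["end"]["sum_received"]["bits_per_second"]
        .as_f64()
        .or_else(|| report["end"]["sum"]["bits_per_second"].as_f64())
        .unwrap_or(0.0);

    // BMF is: benchmark name -> measure slug -> metric. The `throughput`
    // measure slug is an assumption; it must match the measure configured
    // in Bencher.
    let mut bmf = Map::new();
    bmf.insert(
        benchmark_name.to_owned(),
        json!({ "throughput": { "value": bits_per_second } }),
    );
    Value::Object(bmf)
}
```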

Resolves: #5818.

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Reactor Scram <ReactorScram@users.noreply.github.com>
2024-07-24 01:22:17 +00:00
Jamil
a43f39ae8b perf: increase UDP send rate for performance test (#4793)
Now that we've worked the flakiness out of the iperf tests, we should
increase the UDP send rate so we have some benchmark of how many packets
we can actually handle before dropping any.
2024-04-26 21:11:44 +00:00
Thomas Eizinger
51089b89e7 feat(connlib): smoothly migrate relayed connections (#4568)
Whenever we receive a `relays_presence` message from the portal, we
invalidate the candidates of all now-disconnected relays and make
allocations on the new ones. This triggers signalling of new candidates
to the remote party and migrates the connection to the newly nominated
socket.
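
As a rough sketch with hypothetical types (connlib's real API differs), the
handling boils down to dropping state for disconnected relays and
allocating on new ones:

```rust
// Hypothetical types for illustration; connlib's real API differs.
use std::collections::HashMap;
use std::net::SocketAddr;

type RelayId = u64;

struct RelaysPresence {
    disconnected_ids: Vec<RelayId>,
    connected: Vec<(RelayId, SocketAddr)>,
}

#[derive(Default)]
struct Node {
    allocations: HashMap<RelayId, SocketAddr>,
}

impl Node {
    fn handle_relays_presence(&mut self, msg: RelaysPresence) {
        for id in &msg.disconnected_ids {
            // Dropping an allocation invalidates its relay candidates,
            // which gets signalled to the remote party.
            self.allocations.remove(id);
        }
        for (id, addr) in msg.connected {
            // Fresh allocations yield new relay candidates; ICE then
            // nominates a new socket and the connection migrates.
            self.allocations.entry(id).or_insert(addr);
        }
    }
}
```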

This still relies on #4613 until we have #4634.

Resolves: #4548.

---------

Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
2024-04-20 06:16:35 +00:00
Thomas Eizinger
4972e49b34 ci: run assertions inside docker container (#4680)
As part of #4568, we are adding a 2nd relay, which exposed some
shortcomings of the current process-state assertions: because they ran
outside the docker containers, they listed the processes of all relays as
soon as there was more than one.
2024-04-18 23:48:42 +00:00
Thomas Eizinger
8d49452668 ci: assert that nothing busy loops after the perf tests (#4546)
The Clients, Gateway and Relay all employ an internal design based on an
eventloop. This gives us a lot of control over how the various IO
components interact with each other. Great control, however, also comes
with a source of bugs, the latest of which made the relay busy-loop once
it started relaying traffic.

Eventloops are notoriously hard to unit-test because they compose
various IO bits together. Instead of writing unit tests, we can assert on
the process state after the performance tests: those generate a fair bit
of load on all our components, but once they are done, the processes
should suspend.

The most effective tests survive even large refactorings; for that, they
need to be coded against a stable API / property. Asserting that a
process sleeps when it is idle from an application PoV is such a
property.
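
A sketch of such an assertion (how the CI script actually checks this may
differ): read the state field from `/proc/<pid>/stat` and require `S`
(sleeping).

```rust
// Sketch of the idle assertion: a busy-looping process shows up as `R`
// (running) in /proc/<pid>/stat, an idle one as `S` (sleeping).
use std::fs;
use std::io;

fn process_state(pid: u32) -> io::Result<char> {
    let stat = fs::read_to_string(format!("/proc/{pid}/stat"))?;
    // The state field follows the parenthesised command name,
    // e.g. `1234 (relay) S ...`.
    stat.rsplitn(2, ')')
        .next()
        .and_then(|rest| rest.trim_start().chars().next())
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidData, "malformed stat"))
}

fn main() -> io::Result<()> {
    let pid: u32 = std::env::args()
        .nth(1)
        .expect("usage: assert-idle <pid>")
        .parse()
        .expect("pid must be numeric");

    let state = process_state(pid)?;
    assert_eq!(state, 'S', "process {pid} is not sleeping (state: {state})");

    Ok(())
}
```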

Related: #4511.
2024-04-09 07:09:50 +00:00
Jamil
391150f0e1 chore(ci): Fix new issues in cd.yml (#4085)
Fixes some issues encountered after the merge of #4049:

- Fix the performance tests to only run using base_ref and head_ref,
avoiding a dependence on `main`
- Fix some typos
- Prevent a catch-22 where breaking compatibility meant we wouldn't be
able to deploy to production
2024-03-12 02:06:19 +00:00
Jamil
3bd7dc504e fix(ci): Fix flaky iperf3 "Bad file descriptor" (#3731)
- Lower the UDP bandwidth to 50 Mbit/s -- this fixes the intermittent
file descriptor errors caused by overloading iperf3 for more than 5
seconds
- Simplify the iperf3 setup to the minimum that makes the tests reliable
2024-02-22 19:57:22 +00:00
Jamil
5bd717b877 fix(ci): Use workflow id to fetch perf results (#3710) 2024-02-20 19:40:16 -08:00
Jamil
63cdd09a01 refactor(ci): Merge perf results into one comment (#3707)
One comment vs eight, need I say more?
2024-02-20 18:17:48 -08:00
Jamil
2d208b1991 fix(ci): Fix js typo (#3704)
More fixes from the perf test refactor
2024-02-20 16:38:05 -08:00
Jamil
0598ca55c3 fix(ci): Fix result overwrite (#3700)
Buttoning up fixes from #3695
2024-02-20 15:46:59 -08:00
Jamil
7ff40b82ed fix(ci): Run each perf test in its own matrix job (#3695)
The iperf3 server sometimes hangs, or takes a while to start up.

Rather than trying to reset iperf3's state between performance tests,
this PR refactors them so that each runs in its own matrix job. This
ensures every performance test runs on a separate VM, unaffected by
previous runs, eliminating any effect residual network buffer state could
have on a particular test.

It also makes sure the server is listening via a `healthcheck`.
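
The healthcheck boils down to "don't start the client until the server
accepts connections". A sketch of the same idea (the `iperf3-server`
hostname is made up; 5201 is iperf3's default port):

```rust
// Sketch of the healthcheck idea: poll until the iperf3 server accepts
// TCP connections before kicking off the client. The `iperf3-server`
// hostname is made up; 5201 is iperf3's default port.
use std::net::TcpStream;
use std::thread::sleep;
use std::time::Duration;

fn wait_until_listening(addr: &str, attempts: u32) -> bool {
    for _ in 0..attempts {
        if TcpStream::connect(addr).is_ok() {
            return true;
        }
        sleep(Duration::from_secs(1));
    }
    false
}

fn main() {
    assert!(
        wait_until_listening("iperf3-server:5201", 30),
        "iperf3 server never became reachable"
    );
}
```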
2024-02-20 22:44:20 +00:00
Jamil
eebd7fc7f1 fix(ci): Ensure integration-tests allow for at least 30 seconds to establish a connection (#3676)
So the cause of the flaky tests is that they aren't waiting long enough
for a connection to be established. Both the test in #3666 and the
`iperf` tests have a timeout of 10 seconds.

Connections _should_ be established **very quickly** in CI. However, I
have a few guesses as to why they might not be, forcing us to wait for a
timeout before re-initiating a connection request:

- Packets arrive out of order or too quickly for the WireGuard state
machine to establish a handshake.
- Too many ICE candidates are gathered (the gateway has 3 interfaces).

This PR:

- Refactors the iperf tests to be a little easier to maintain
- Ensures `integration-tests` run for at least 30 seconds before timing
out

In any case, we can debug / optimize this further after snownet is
merged, which might just solve the problem completely.
2024-02-19 20:50:58 +00:00