Currently, the setup we have in docker-compose does not reflect
real-world scenarios very well because most components share the same
subnet. In reality, Clients, Gateways, Relays, and the backend are all in
separate subnets, connected via multiple routers on the Internet.
The current setup makes it hard to properly test relayed connections. To
fix this, we move every component into its own subnet behind a dedicated
router container that performs source and destination NAT and also acts
as a firewall for the Client and Gateway containers, blocking all
inbound traffic to them.
This setup will make it easier to test #10286, which requires port
randomization for outgoing traffic on the Client and Gateway side.
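As a minimal illustration of the idea (not necessarily how #10286
implements it): binding a UDP socket to port 0 lets the kernel pick the
ephemeral source port, so every new outgoing socket shows up at the
router with a hard-to-predict port.

```rust
use std::net::UdpSocket;

// Sketch only: port 0 asks the kernel to assign an ephemeral source port
// from its local port range instead of us choosing a fixed one.
fn bind_with_ephemeral_port() -> std::io::Result<UdpSocket> {
    let socket = UdpSocket::bind(("0.0.0.0", 0))?;
    println!("bound to {}", socket.local_addr()?);
    Ok(socket)
}
```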
To achieve a more stable CI, we need to reduce the target bitrate of the
UDP perf tests. Now that we no longer have GSO enabled in the tests, the
most we can achieve in CI is 600Mbit/s. Forcing more packets through the
tunnel results in all sorts of warnings which end up failing CI.
In CI, eBPF in driver mode actually functions just fine with no changes
to our existing tests, provided we apply a few workarounds and bug fixes:
- The interface learning mechanism had two flaws: (1) it only learned
per-CPU, which meant the risk of a missing entry grew with the core count
of the relay host, and (2) it did not filter for unicast IPs, so it
picked up broadcast and link-local addresses, causing cross-relay paths
to fail occasionally (see the unicast-filter sketch after this list).
- The `relay-relay` candidate where the two relays are the same relay
causes packet drops / loops in the Docker bridge setup, and possibly in
GCP too. I'm not sure this is a valid path that solves a real
connectivity issue in the wild. I can understand relay-relay paths where
the two relays are different hosts, and the client and gateway both talk
over their TURN channel to each other (i.e. WireGuard is blocked in each
of their networks), but I can't think of an advantage for a relay-relay
candidate where the traffic simply hairpins (or is dropped) off the
nearest switch. This case is now detected with a new `PacketLoop` error
that triggers whenever source_ip == dest_ip (see the loop-check sketch
after this list).
- The relays in CI need a common next hop for the MAC address swapping
to work. A simple router service is added that functions as a basic L3
router (no NAT) and provides that next hop.
- The `veth` driver has some peculiar requirements to allow it to
function with XDP_TX. If you send a packet out of one interface of a
veth pair with XDP_TX, you either need to make sure both interfaces have
GRO enabled, or you need to attach a dummy XDP program that simply
returns XDP_PASS to the other interface so that the sk_buff is allocated
before the packet goes up the stack to the Docker bridge. The GRO method
was unreliable and didn't work in our case, causing massive packet
delays and unpredictable bursts that prevented ICE from working, so we
use the XDP_PASS method instead (see the XDP_PASS sketch after this
list). A small Docker image that handles this lives at
https://github.com/firezone/xdp-pass.
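For the first bullet, the unicast filter boils down to a check of
roughly this shape (a sketch with illustrative names, not the actual
relay code):

```rust
use std::net::Ipv4Addr;

/// Sketch: only learn interfaces for plausible unicast addresses, so that
/// broadcast, multicast, link-local and unspecified addresses never end up
/// in the learned-interfaces map.
fn is_learnable_unicast(ip: Ipv4Addr) -> bool {
    !ip.is_broadcast() && !ip.is_multicast() && !ip.is_link_local() && !ip.is_unspecified()
}
```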
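The `PacketLoop` detection from the second bullet is essentially this
check (types simplified for illustration):

```rust
use std::net::IpAddr;

#[derive(Debug)]
enum RoutingError {
    /// The packet would be sent straight back to where it came from.
    PacketLoop,
}

/// Sketch of the loop check: a relay-relay path that hairpins on itself
/// shows up as a packet whose source and destination IP are identical.
fn check_for_loop(source_ip: IpAddr, dest_ip: IpAddr) -> Result<(), RoutingError> {
    if source_ip == dest_ip {
        return Err(RoutingError::PacketLoop);
    }

    Ok(())
}
```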
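The dummy program from the last bullet is about as small as XDP programs
get; written with aya it would look roughly like this (the actual image
at https://github.com/firezone/xdp-pass may be implemented differently):

```rust
#![no_std]
#![no_main]

use aya_ebpf::{bindings::xdp_action, macros::xdp, programs::XdpContext};

// Attached to the peer end of the veth pair: pass every packet up the
// stack so an sk_buff is allocated, which makes XDP_TX on the other end
// of the pair behave.
#[xdp]
pub fn pass(_ctx: XdpContext) -> u32 {
    xdp_action::XDP_PASS
}

#[panic_handler]
fn panic(_info: &core::panic::PanicInfo) -> ! {
    loop {}
}
```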
Related: #10138
Related: #10260
To avoid burning Azure credits, we move the runners back down to the
free tier. Now that caching is properly set up, this should incur only a
minor increase in CI time.
Forcing 500 Mbit/s through a relayed connection in CI makes the
user-space relay fall over and drop control messages, leading to ICE
timeouts on the connection.
Our ICE timeout is ~15s, so it would be a good idea to run the perf
tests long enough to span a possible ICE timeout; that way we can detect
cases where high throughput causes one.
Currently, we have a homegrown benchmark suite that reports the results
of the iperf runs as a comment on each PR, comparing a run on `main`
with the current branch.
These comments are noisy because they appear on every PR, regardless of
the performance results. As a result, they tend to be skimmed over by
devs and not actually considered. To properly track performance, we need
to record benchmark results over time and use statistics to detect
regressions.
https://bencher.dev does exactly that. It supports various benchmark
harnesses to automatically collect benchmark results. For our case, we
simply use the generic JSON adapter to extract the relevant metrics from
the iperf results and report them to the Bencher backend.
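For reference, the conversion is roughly of this shape (a sketch: the
file name, benchmark name, and `throughput` measure slug are made up,
and the real workflow may extract different fields). iperf3's JSON
report is reduced to Bencher Metric Format, a plain map of benchmark
name to measure to value.

```rust
use serde_json::{json, Value};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Sketch: read an iperf3 JSON report and emit Bencher Metric Format.
    let report: Value = serde_json::from_str(&std::fs::read_to_string("iperf3.json")?)?;

    // TCP runs report the received rate under `end.sum_received`,
    // UDP runs under `end.sum`.
    let bits_per_second = report["end"]["sum_received"]["bits_per_second"]
        .as_f64()
        .or_else(|| report["end"]["sum"]["bits_per_second"].as_f64())
        .ok_or("no bitrate in iperf3 report")?;

    // Hypothetical benchmark name and measure slug.
    let bmf = json!({
        "relayed-tcp-client2server": {
            "throughput": { "value": bits_per_second }
        }
    });

    println!("{}", serde_json::to_string_pretty(&bmf)?);
    Ok(())
}
```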
With these metrics in place, Bencher can plot the results over time and
alert us to regressions using thresholds based on statistical tests.
Resolves: #5818.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Reactor Scram <ReactorScram@users.noreply.github.com>
Now that we've worked out the flakiness from the iperf tests, we should
increase the UDP send rate so we have a benchmark of how many packets we
can actually handle before dropping any.
Whenever we receive a `relays_presence` message from the portal, we
invalidate the candidates of all now-disconnected relays and make
allocations on the new ones. This triggers signalling of new candidates
to the remote party and migrates the connection to the newly nominated
socket.
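In sketch form, the handling looks roughly like this (the type and
method names are illustrative, not the actual connlib / snownet API):

```rust
use std::collections::{BTreeMap, BTreeSet};
use std::net::SocketAddr;

/// Illustrative only; does not match the real message type.
struct RelaysPresence {
    connected: BTreeSet<SocketAddr>,
    disconnected: BTreeSet<SocketAddr>,
}

#[derive(Default)]
struct Node {
    /// Relay sockets we currently hold an allocation on.
    allocations: BTreeSet<SocketAddr>,
    /// Relay candidates we have signalled, keyed by the relay they came from.
    relay_candidates: BTreeMap<SocketAddr, Vec<SocketAddr>>,
}

impl Node {
    fn handle_relays_presence(&mut self, presence: RelaysPresence) {
        // Invalidate the candidates of all now-disconnected relays ...
        for relay in &presence.disconnected {
            self.allocations.remove(relay);
            self.relay_candidates.remove(relay);
            // ... the real code also signals the invalidation to the remote party.
        }

        // ... and make allocations on the new relays. The candidates gathered
        // from those allocations are then signalled to the remote party, which
        // migrates the connection to the newly nominated socket.
        for relay in presence.connected {
            self.allocations.insert(relay);
        }
    }
}
```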
This still relies on #4613 until we have #4634.
Resolves: #4548.
---------
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
As part of #4568, we are adding a second relay, which exposed some
shortcomings of the current process-state assertions: they were running
outside the Docker containers and thus listed all relay processes as
soon as there was more than one relay.
The clients, gateway, and relay all employ an internal design based on
an event loop. This gives us a lot of control over how the various IO
components interact with each other. That control also comes with bugs,
the latest of which made the relay busy-loop once it started relaying
traffic.
Event loops are notoriously hard to unit-test because they compose
various IO bits together. Instead of writing unit tests, we can assert
the process state after the performance tests. Those generate a fair bit
of load on all of our components, but after that, the processes should
suspend again.
The most effective tests survive even large refactorings, and for that
they need to be written against a stable API / property. Asserting that
the process sleeps when it is idle from an application point of view is
such a property.
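One way to check that property on Linux (a sketch of the idea, not
necessarily what the CI assertion does): read the process state from
`/proc/<pid>/stat`. An idle event loop should be blocked in `epoll` and
report state `S` (sleeping), while a busy-looping one will mostly report
`R` (running).

```rust
use std::fs;

/// Sketch: returns the one-letter process state ("S" for sleeping,
/// "R" for running, ...) from /proc/<pid>/stat on Linux.
fn process_state(pid: u32) -> std::io::Result<String> {
    let stat = fs::read_to_string(format!("/proc/{pid}/stat"))?;
    // The state is the field right after the closing ')' of the command
    // name, which itself may contain spaces.
    let after_comm = &stat[stat.rfind(')').unwrap_or(0) + 1..];
    Ok(after_comm
        .split_whitespace()
        .next()
        .unwrap_or_default()
        .to_string())
}

fn main() -> std::io::Result<()> {
    let state = process_state(1)?; // e.g. the relay's PID inside its container
    assert_eq!(state, "S", "process should be sleeping when idle");
    Ok(())
}
```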
Related: #4511.
Fixes some issues encountered after the merge of #4049:
- Fix the performance tests to only run using base_ref and head_ref,
avoiding a dependence on `main`
- Fix some typos
- Prevent a catch-22 condition where breaking compatibility meant we
wouldn't be able to deploy to production
- Lower the UDP bandwidth to 50 Mbit/s -- this fixes intermittent file
descriptor issues caused by overloading iperf3 for more than 5 seconds
- Simplify the iperf3 options to the minimum set that makes the tests
reliable
The iperf3 server sometimes hangs, or takes a while to start up.
Rather than trying to reset the iperf3 state between performance tests,
this PR refactors them so that each one runs in its own matrix job. This
ensures each performance test runs on a separate VM, unaffected by
previous test runs, eliminating the effect any residual network buffer
state could have on a particular test.
It also uses a `healthcheck` to make sure the server is listening.
So the cause of the flaky tests is that they aren't waiting long enough
for a connection to be established. Both the test in #3666 and the
`iperf` tests have a timeout of 10 seconds.
Connections _should_ be established **very quickly** in CI. However, I
have a few guesses as to why they might not be, essentially forcing us
to wait for a timeout before re-initiating a connection request:
- Packets arrive out of order or too quickly for the WireGuard state
machine to establish a handshake.
- Too many ICE candidates are gathered (the gateway has 3 interfaces).
This PR:
- Refactors the iperf tests to be a little easier to maintain
- Ensures `integration-tests` run for at least 30 seconds before timing
out
In any case, we can debug / optimize this further after snownet is
merged, which might just solve the problem completely.