firezone

mirror of https://github.com/outbackdingo/firezone.git synced 2026-01-27 10:18:54 +00:00

Author	SHA1	Message	Date
Thomas Eizinger	b7dc897eea	refactor(rust): introduce `libs/` directory (#10964 ) The current Rust workspace isn't as consistent as it could be. To make navigation a bit easier, we move a few crates around. Generally, we follow the idea that entry-points should be at the top-level. `rust/` now looks like this (directories only): ``` . ├── cli # Firezone CLI ├── client-ffi # Entry point for Apple & Android ├── gateway # Gateway ├── gui-client # GUI client ├── headless-client # Headless client ├── libs # Library crates ├── relay # Relay ├── target # Compile artifacts ├── tests # Crates for testing └── tools # Local tools ``` To further enforce this structure, we also drop the `firezone-` prefix from all crates that are not top-level binary crates.	2025-11-25 10:59:11 +00:00
Thomas Eizinger	bcf4ccf817	fix(rust): introduce dedicated downcast functions for `anyhow` (#10966 ) The downcasting abilities of `anyhow` are pretty powerful. Unfortunately, they can also be a bit tricky to get right. Whilst `is` and `downcast` work fine for any errors that are within the `anyhow` error chain, they don't check the chain of errors prior to that. In other words, if we already have a nested `std::error::Error` with several causes, `anyhow` cannot downcast to these causes directly. In order to avoid this footgun, we create a thin-layer on top of the `anyhow` crate with some downcasting functions that always try to do the right thing.	2025-11-25 04:14:17 +00:00
Thomas Eizinger	d09bab3d0c	test(relay): go back to the future before healthcheck (#10961 ) The health-check tests for the relay use `Instant::elapsed` which implicitly uses `Instant::now`. On a freshly booted Windows machine, these tests might easily fail because we are subtracting 15 minutes from `Instant::now` which might result in an underflow as Windows cannot represent `Instant`s prior to the boot time. Related: #10927	2025-11-25 00:48:24 +00:00
Thomas Eizinger	9016ffc9dc	build(rust): bump to Rust 1.91.0 (#10767 ) Rust 1.91 has been released and brings with it a few new lints that we need to tidy up. In addition, it also stabilizes `BTreeMap::extract_if`: A really nifty std-lib function that allows us to conditionally take elements from a map. We need that in a bunch of places.	2025-11-03 01:56:12 +00:00
dependabot[bot]	941f6f3d1c	build(deps): bump secrecy from 0.8.0 to 0.10.3 in /rust (#10631 ) Bumps [secrecy](https://github.com/iqlusioninc/crates) from 0.8.0 to 0.10.3. --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Thomas Eizinger <thomas@eizinger.io>	2025-10-30 01:17:10 +00:00
Thomas Eizinger	20d0298a8a	chore: fix clippy warnings about HashMap iteration (#10661 ) Not quite sure how these didn't get picked up by CI but they showed in my local IDE. --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-10-21 02:54:20 +00:00
Thomas Eizinger	685acdac3a	feat: add more specific component type to user-agent header (#10457 ) To allow the portal to more easily classify what kind of component is connecting, we extend the `get_user_agent` header to include a component type instead of the generic `connlib/`. --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io> Co-authored-by: Jamil <jamilbk@users.noreply.github.com>	2025-09-26 00:18:36 +00:00
Thomas Eizinger	94a56fc6bc	build(deps): update `aya` to latest `main` (#10424 ) We haven't updated `aya` in a while. Unfortunately, the update is not without problems. For one, the logging infrastructure changed, requiring us to drop the error details from `xdp_adjust_head`. See https://github.com/aya-rs/aya/issues/1348. Two, the `tokio` feature flag got removed but luckily that can be worked around quite easily. Resolves: #10344	2025-09-23 17:45:59 +10:00
Thomas Eizinger	69afe71215	refactor(connlib): remove concept of "ReplyMessages" (#10361 ) In earlier versions of Firezone, the WebSocket protocol with the portal was using the request-response semantics built into Phoenix. This, however, is quite cumbersome to work with due to the polymorphic nature of the protocol design. We ended up moving away from it and instead only use one-way messages where each event directly corresponds to a message type. However, we have never removed the reply-message capability from the `phoenix-channel` module; instead, all usages just set it to `()`. We can simplify the code here by always setting this to `()`. Resolves: #7091	2025-09-17 04:10:56 +00:00
Thomas Eizinger	0b89959354	fix(relay): handle relay-relay candidate pairs in eBPF (#10286 ) Currently, the eBPF module can translate from channel data messages to UDP packets and vice versa. It can even do that across IP stacks, i.e. translate from an IPv6 UDP packet to an IPv4 channel data message. What it cannot do is handle packets to itself. This can happen if both Client and Gateway pick the same relay to make an allocation. When exchanging candidates, ICE will then form pairs between both relay candidates, essentially requiring the relay to loop packets back to itself. In eBPF, we cannot do that. When sending a packet back out with `XDP_TX`, it will actually go out on the wire without an additional check whether it is for our own IP. Properly handling this in eBPF (by comparing the destination IP to our public IP) adds more cases we need to handle. The current module structure where everything is one file makes this quite hard to understand, which is why I opted to create four sub-modules: - `from_ipv4_channel` - `from_ipv4_udp` - `from_ipv6_channel` - `from_ipv6_udp` For traffic arriving via a data-channel, it is possible that we also need to send it back out via a data-channel if the peer address we are sending to is the relay itself. Therefore, the `from_ipX_channel` modules have four sub-modules: - `to_ipv4_channel` - `to_ipv4_udp` - `to_ipv6_channel` - `to_ipv6_udp` For the traffic arriving on an allocation port (`from_ipX_udp`), we always map to a data-channel and therefore can never get into a routing loop, resulting in only two modules: - `to_ipv4_channel` - `to_ipv6_channel` The actual implementation of the new code paths is rather simple and mostly copied from the existing ones. For half of them, we don't need to make any adjustments to the buffer size (i.e. IPv4 channel to IPv4 channel). For the other half, we need to adjust for the difference in the IP header size. 
To test these changes, we add a new integration test that makes use of the new docker-compose setup added in #10301 and configures masquerading for both Client and Gateway. To make this more useful, we also remove the `direct-` prefix from all tests as the test script itself no longer makes any decisions as to whether it is operating over a direct or relayed connection. Resolves: #7518	2025-09-11 07:19:23 +00:00
Thomas Eizinger	f96cc3d583	feat(relay): remove graceful shutdown (#10322 ) Initially, we added the graceful shutdown functionality to the relay to better deal with deploys and achieve as minimal downtime as possible. With the split of app and infrastructure that we now have, this functionality is no longer necessary as portal deploys don't touch the relay infra at all. Thus, we can remove this functionality which will actually speed-up deploys of the relays as systemd no longer has to time-out after sending the SIGTERM to the binary.	2025-09-10 07:00:20 +00:00
Thomas Eizinger	4a612da189	fix(relay): filter traces by log filter (#10317 ) We want to control which traces are collected and sent to OTEL with the log filter. To do that, we need to also apply the supplied log filter to the tracer.	2025-09-09 23:32:57 +00:00
Thomas Eizinger	c891d9c864	fix(relay): re-add eBPF channel map entry on refresh (#10291 ) TURN channels have a 5 minute cooldown period after they expire where they cannot be rebound to another peer but can be refreshed and thus "reactivated". To stop routing packets when the channel expires, we remove it from the channel map of the eBPF code. The client, however, knows that it still has a channel that it can reactivate for another 5 minutes. In case it chooses to do so, we refresh the channel in userspace but, until now, forgot to re-populate the eBPF map. This effectively blocks this communication path from working because the relay reports the channel as being refreshed successfully, yet the new eBPF kernel drops all packets without a map entry.	2025-09-05 01:29:50 +00:00
Thomas Eizinger	9cddfe59fa	fix(rust): don't require Internet on startup (#10264 ) With the introduction of the pre-resolved Sentry host, all Firezone clients now require Internet on startup. That is a significant usability hit that we can easily fix by simply falling back to resolving the host on-demand.	2025-09-01 01:31:05 +00:00
Jamil	0ccd4bbf24	feat(ci): enable relay eBPF offloading (#10160 ) In CI, eBPF in driver mode actually functions just fine with no changes to our existing tests, given we apply a few workarounds and bugfixes: - The interface learning mechanism had two flaws: (1) it only learned per-CPU, which meant the risk for a missing entry grew as the core count of the relay host grew, and (2) it did not filter for unicast IPs, so it picked up broadcast and link-local addresses, causing cross-relay paths to fail occasionally - The `relay-relay` candidate where the two relays are the same relay causes packet drops / loops in the Docker bridge setup, and possibly in GCP too. I'm not sure this is a valid path that solves a real connectivity issue in the wild. I can understand relay-relay paths where two relays are different hosts, and the client and gateway both talk over their TURN channel to each other (i.e. WireGuard is blocked in each of their networks), but I can't think of an advantage for a relay-relay candidate where the traffic simply hairpins (or is dropped) off the nearest switch. This has been now detected with a new `PacketLoop` error that triggers whenever source_ip == dest_ip. - The relays in CI need a common next-hop to talk to for the MAC address swapping to work. A simple router service is added which functions as a basic L3 router (no NAT) that allows the MAC swapping to work. - The `veth` driver has some peculiar requirements to allow it to function with XDP_TX. If you send a packet out of one interface of a veth pair with XDP_TX, you need to either make sure both interfaces have GRO enabled, or you need to attach a dummy XDP program that simply does XDP_PASS to the other interface so that the sk_buff is allocated before going up the stack to the Docker bridge. The GRO method was unreliable and didn't work in our case, causing massive packet delays and unpredictable bursts that prevented ICE from working, so we use the XDP_PASS method instead. 
A simple docker image is built and lives at https://github.com/firezone/xdp-pass to handle this. Related: #10138 Related: #10260	2025-08-31 23:37:03 +00:00
Thomas Eizinger	c70c88c856	build(deps): upgrade to opentelemetry 0.30 (#10239 )	2025-08-21 22:47:39 +00:00
Thomas Eizinger	46afa52f78	feat(telemetry): pre-resolve Sentry ingest host (#10206 ) Our Sentry client needs to resolve DNS before being able to send logs or errors to the backend. Currently, this DNS resolution happens on-demand as we don't take any control of the underlying HTTP client. In addition, this will use HTTP/1.1 by default which isn't as efficient as it could be, especially with concurrent requests. Finally, if we ever decide to proxy all Sentry traffic through our own domain, we have to take control of the underlying client anyway. To resolve all of the above, we create a custom `TransportFactory` where we reuse the existing `ReqwestHttpTransport` but provide an already configured `reqwest::Client` that always uses HTTP/2 with a pre-configured set of DNS records for the given ingest host.	2025-08-21 03:28:05 +00:00
Thomas Eizinger	da00848549	build(deps): bump to Rust 1.89 (#10208 ) Rust 1.89 comes with a new lint that wants us to explicitly refer to lifetimes, even if they are elided.	2025-08-18 05:04:55 +00:00
Thomas Eizinger	70a930e45d	chore(relay): use existing `ebpf` module import (#10202 )	2025-08-17 23:45:36 +00:00
Jamil	b07fa341cf	feat(relay): XDP driver (native) mode for gVNIC (#10177 ) This updates our eBPF module to use DRV_MODE for less CPU overhead and better performance for all same-stack TURN relaying. Notably, gVNIC does not seem to support the `bpf_xdp_adjust_head` helper, so unfortunately we need to extend / shrink the packet tail and move the payload instead. Comprehensive benchmarks have not been performed, but early results show that we can saturate about 1 Gbps per E2 core on GCP: ``` [SUM] 0.00-30.04 sec 3.16 GBytes 904 Mbits/sec 12088 sender [SUM] 0.00-30.00 sec 3.12 GBytes 894 Mbits/sec receiver ``` This is with 64 TCP streams. More streams will better utilize all available RX queues, and lead to better performance. Related: #10138 Fixes: #8633	2025-08-17 15:04:19 +00:00
Thomas Eizinger	2dde3b8573	fix(relay): read from most-recently-ready socket first (#10148 ) The relay uses `mio` to react to readiness events from multiple sockets at once. Including the control port 3478, the relay needs to also send and receive traffic from up to 16384 sockets (one for each possible allocation). We need to process readiness events from these sockets as fairly as possible. Under high load, it may otherwise happen that we don't read packets from an allocation socket, resulting in ICE timeouts of the connection being relayed. To achieve this fairness, we collect all readiness tokens into a set and store each one with the number of packets we have read so far from its socket. Then, we always read next from the socket that we have so far read the fewest packets from.	2025-08-06 09:13:05 +00:00
Thomas Eizinger	f27683760a	fix(relay): check for ANSI support on stdout (#10149 )	2025-08-06 07:42:54 +00:00
Thomas Eizinger	0e32f1c4f2	fix(relay): increase nonce usage to 10000 (#10128 ) On a Gateway with busy connections, only being able to use a nonce 100 times causes unnecessary churn. We increase this to 10000 to be able to better handle bursts of messages such as channel bindings.	2025-08-05 02:00:57 +00:00
Thomas Eizinger	fbf96c261e	chore(relay): remove spans (#9962 ) These are flooding our monitoring infra and don't really add that much value. Pretty much all of the processing the relay does is request in and out and none of the spans are nested. We can therefore almost 1-to-1 replicate the logging we do with spans by adding the fields to each log message. Resolves: #9954	2025-07-22 13:24:58 +00:00
Thomas Eizinger	0f1c5f2818	refactor(relay): simplify auth module (#9873 ) Whilst looking through the auth module of the relay, I noticed that we unnecessarily convert back and forth between expiry timestamps and username formats when we could just be using the already parsed version.	2025-07-15 14:14:51 +00:00
Thomas Eizinger	2b70596636	fix(rust): only apply filter to select tracing layers (#9872 ) Applying a filter globally to the entire subscriber means it filters events for all layers. This prevents the Sentry layer from uploading DEBUG logs if configured.	2025-07-15 13:44:53 +00:00
Thomas Eizinger	d01701148b	fix(rust): remove jemalloc (#9849 ) I am no longer able to compile `jemalloc` on my system in a debug build. It fails with the following error: ``` src/malloc_io.c: In function ‘buferror’: src/malloc_io.c:107:16: error: returning ‘char *’ from a function with return type ‘int’ makes integer from pointer without a cast [-Wint-conversion] 107 \| return strerror_r(err, buf, buflen); \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~ ``` This appears to be a problem with modern versions of clang/gcc. I believe this started happening when I recently upgraded my system. The upstream [`jemalloc`](https://github.com/jemalloc/jemalloc) repository is now archived and thus unmaintained. I am not sure if we ever measured a significant benefit in using `jemalloc`. Related: https://github.com/servo/servo/issues/31059	2025-07-12 19:22:06 +00:00
Thomas Eizinger	d6805d7e48	chore(rust): bump to Rust 1.88 (#9714 ) Rust 1.88 has been released and brings with it a quite exciting feature: let-chains! It allows us to mix-and-match `if` and `let` expressions, therefore often reducing the "right-drift" of the relevant code, making it easier to read. Rust 1.88 also comes with a new clippy lint that warns when creating a mutable reference from an immutable pointer. Attempting to fix this revealed that this is exactly what we are doing in the eBPF kernel. Unfortunately, it doesn't seem to be possible to design this in a way that is both accepted by the borrow-checker AND by the eBPF verifier. Hence, we simply make the function `unsafe` and document for the programmer what needs to be upheld.	2025-07-12 06:42:50 +00:00
Thomas Eizinger	3b972643b1	feat(rust): stream logs to Sentry when enabled in PostHog (#9635 ) Sentry has a new "Logs" feature where we can stream logs directly to Sentry. Doing this for all Clients and Gateways would be way too much data to collect though. In order to aid debugging from customer installations, we add a PostHog-managed feature flag that - if set to `true` - enables the streaming of logs to Sentry. This feature flag is evaluated every time the telemetry context is initialised: - For all FFI usages of connlib, this happens every time a new session is created. - For the Windows/Linux Tunnel service, this also happens every time we create a new session. - For the Headless Client and Gateway, it happens on startup and afterwards, every minute. The feature-flag context itself is only checked every 5 minutes though so it might take up to 5 minutes before this takes effect. The default value - like all feature flags - is `false`. Therefore, if there is any issue with the PostHog service, we will fallback to the previous behaviour where logs are simply stored locally. Resolves: #9600	2025-06-25 16:14:14 +00:00
Thomas Eizinger	fccf5021e6	fix(relay): don't fail event-loop on interrupt (#9592 ) When profiling the relay, certain syscalls may get interrupted by the kernel. At present, this crashes the relay which makes profiling impossible. Co-authored-by: Antoine Labarussias <antoinelabarussias@gmail.com>	2025-06-20 18:42:57 +00:00
Thomas Eizinger	e05c98bfca	ci: update to new `cargo sort` release (#9354 ) The latest release now also sorts workspace dependencies, as well as different dependency sections. Keeping these things sorted reduces the chances of merge conflicts when multiple PRs edit these files.	2025-06-02 02:01:09 +00:00
Thomas Eizinger	cee4be9e24	build(deps): bump Rust dependencies (#9192 ) A mass upgrade of our Rust dependencies. Most crucially, these remove several duplicated dependencies from our tree. - The Tauri plugins have been stuck on `windows v0.60` for a while. They are now updated to use `windows v0.61` which is what the rest of our dependency tree uses. - By bumping `axum`, can also bump `reqwest` which reduces a few more duplicated dependencies. - By removing `env_logger`, we can get rid of a few dependencies.	2025-05-22 13:15:01 +00:00
Thomas Eizinger	37529803ce	build(rust): bump otel ecosystem crates to 0.29 (#9029 )	2025-05-05 12:33:07 +00:00
Thomas Eizinger	6114bb274f	chore(rust): make most of the Rust code compile on MacOS (#8924 ) When working on the Rust code of Firezone from a MacOS computer, it is useful to have pretty much all of the code at least compile in order to detect problems early. Eventually, once we target features like a headless MacOS client, some of these stubs will actually be filled in and be functional.	2025-04-29 11:20:09 +00:00
Thomas Eizinger	1af7f4f8c1	fix(rust): don't use jemalloc on ARMv7 (#8859 ) Doesn't compile on ARMv7 so we just fallback to the default allocator there.	2025-04-19 22:20:05 +00:00
Thomas Eizinger	34f28e2ae6	feat(rust): use jemalloc for Gateway and Relay (#8846 ) `jemalloc` is a modern allocator that is designed for multi-threaded systems and can better handle smaller allocations that may otherwise fragment the heap. Firezone's components, especially Relays and Gateways are intended to operate with a long uptime and therefore need to handle memory efficiently.	2025-04-19 12:25:46 +00:00
Thomas Eizinger	c52d88f421	fix(relay): stateless encoding/decoding (#8810 ) The STUN message encoder & decoder from `stun_codec` are stateful operations. However, they only operate on one datagram at a time. If encoding or decoding fails, their internal state is corrupted and must be discarded. At present, this doesn't happen which leads to further failures down the line because new datagrams coming in cannot be correctly decoded. To fix this, we scope the stateful nature of these encoders and decoders to their respective functions. Resolves: #8808	2025-04-18 15:12:46 +00:00
Thomas Eizinger	96e739439b	fix(relay): remove `Config` caching (#8809 ) In #8650, we originally added a feature-flag for toggling the eBPF TURN router on and off at runtime. This later got removed again in #8681. What remained was a "caching system" of the config that the eBPF kernel and user space share with each other. This config was initialised to the default configuration. If the to-be-set config was the same as the current config, the config would not actually apply to the array that was shared with the eBPF kernel. At the time, we assumed that, if the config was not set in the kernel, the lookup in the array would yield `None` and we would fall back to the `Default` implementation of `Config`. This assumption was wrong. It appears that look-ups in the array always yield an element: all zeros. Initialising our config with all zeros yields the following: ![image](https://github.com/user-attachments/assets/6556f32d-8cff-4fba-aa29-f9ac7349ace6) Of course, if this range is not initialised correctly, we can never actually route packets arriving on allocation ports and with UDP checksumming turned off, all packets routed the other way will have an invalid checksum and therefore be dropped by the receiving host. Our integration test did not catch this because in there, we purposely disable UDP checksumming. That meant that the "caching" check in the `ebpf::Program` did not trigger and we actually did set a `Config` in the array, therefore initialising the allocation port range correctly and allowing the packet to be routed. To fix this, we remove this caching check again which means every `Config` we set on the eBPF program actually gets copied to the shared array. Originally, this caching check was introduced to avoid a syscall on every event-loop iteration as part of checking the feature-flag. Now that the feature-flag has been removed, we don't need to have this cache anymore.	2025-04-18 13:50:42 +00:00
Thomas Eizinger	0079f76ebd	fix(eBPF): store allocation port-range in big-endian (#8804 ) Any communication between user-space and the eBPF kernel happens via maps. The keys and values in these maps are serialised to bytes, meaning the endianness of how these values are encoded matters! When debugging why the eBPF kernels were not performing as much as we thought they would, I noticed that only very small packets were getting relayed. In particular, only packets encoded as channel-data packets were getting unwrapped correctly. The reverse didn't happen at all. Turning the log-level up to TRACE did reveal that we do in fact see these packets but they don't get handled. Here is the relevant section that handles these packets: `74ccf8e0b2/rust/relay/ebpf-turn-router/src/main.rs (L127-L151)` We can see the `trace!` log in the logs and we know that it should be handled by the first `if`. But for some reason it doesn't. x86 systems like the machines running in GCP are typically little-endian. Network-byte ordering is big-endian. My current theory is that we are comparing the port range with the wrong endianness and therefore, this branch never gets hit, causing the relaying to be offloaded to user space. By storing the fields within `Config` in byte-arrays, we can take explicit control over which endianness is used to store these fields.	2025-04-18 04:51:40 +00:00
Thomas Eizinger	38dedb8275	feat(relay): allow controlling log-level at runtime (#8800 ) When debugging issues with the relays on GCP, it is useful to be able to change the log-level at runtime without having to redeploy them. We can achieve this by running an additional HTTP server as part of the relay that responds to HTTP POST requests that contain new logging directives. --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io>	2025-04-17 13:22:12 +00:00
Thomas Eizinger	07a82d2254	chore(relay): remove feature flag for eBPF TURN router (#8681 ) The original idea of this feature flag was that we can easily disable the eBPF router in case it causes issues in production. However, something seems to be not working in reliably turning this on / off. Without an explicit toggle of the feature-flag, the eBPF program doesn't seem to be loaded correctly. The uncertainty in this makes me not trust the metrics that we are seeing because we don't know whether really all relays are using the eBPF router to relay TURN traffic. In order to draw truthful conclusions as to how much traffic we are relaying via eBPF, this patch removes the feature flag again. As of #8656, we can disable the eBPF program by not setting the `EBPF_OFFLOADING` env variable. This requires a re-deploy / restart of relays to take effect which isn't quite as fast as toggling a feature flag but is much more reliable and easier to maintain.	2025-04-07 03:31:22 +00:00
Thomas Eizinger	391e94ebed	fix(relay): set a Firezone ID to enable feature-flags (#8657 ) Our feature-flags are currently coupled to our Firezone ID. Without a Firezone ID, we cannot evaluate feature flags. In order to be able to use the feature flags to enable / disable the eBPF TURN router, we set a random UUID as the Firezone ID upon startup of the relay. Not setting this currently causes the eBPF router to be disabled as soon as we start up because the default of the feature flag is false and we don't reevaluate it later due to the missing ID.	2025-04-04 07:13:56 +00:00
Thomas Eizinger	6fe7e77f76	refactor(relay): fail if eBPF offloading is requested but fails (#8656 ) It happens a bunch of times to me during testing that I'd forget to set the right interface onto which the eBPF kernel should be loaded and was wondering why it didn't work. Defaulting to `eth0` wasn't a very smart decision because it means users cannot disable the eBPF kernel at all (other than via the feature-flag). It makes more sense to default to not loading the program at all AND hard-fail if we are requested to load it but cannot. This allows us to catch configuration errors early.	2025-04-04 07:00:29 +00:00
Thomas Eizinger	cd94dd8a2c	fix(relay): update cached eBPF config when it changes (#8655 )	2025-04-04 05:45:11 +00:00
Thomas Eizinger	941ef6c668	feat(relay): introduce feature-flag for toggling eBPF program (#8650 ) This PR implements a feature-flag in PostHog that we can use to toggle the use of the eBPF data plane at runtime. At every tick of the event-loop, the relay will compare the (cached) configuration of the eBPF program with the (cached) value of the feature-flag. If they differ, the flag will be updated and upon the next packet, the eBPF program will act accordingly. Feature-flags are re-evaluated every 5 minutes, meaning there is some delay until this gets applied. The default value of all our feature-flags is `false`, meaning if there is some problem with evaluating them, we'd turn the eBPF data plane off. Performing routing in userspace is slower but it is a safer default. Resolves: #8548	2025-04-04 02:51:52 +00:00
Thomas Eizinger	ebb71e0f54	fix(relay): increase page size for metrics to 4096 (#8646 ) The default here is 2 which is nowhere near enough of a batch-size for us to read all perf events generated by the kernel when it is actually relaying data via eBPF (we generate 1 perf event per relayed packet). If we don't read them fast enough, the kernel has to drop some, meaning we skew our metrics as to how much data we've relayed via eBPF. This has been tested in my local setup and I've seen north of 500 events being read in a single batch now. --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io>	2025-04-04 01:28:22 +00:00
Thomas Eizinger	40fb7d0565	fix(eBPF): explicitly attach in SKB mode (#8628 ) It appears that the gVNIC driver in Google Cloud doesn't give us enough headroom to use `bpf_xdp_adjust_head` with a delta of 4 bytes. Currently, we are loading the XDP program with default flags. By loading it explicitly in SKB mode, we should be able to bypass these driver limitations at the expense of some performance (which should still be better than userspace!). Related: https://github.com/GoogleCloudPlatform/compute-virtual-ethernet-linux/issues/70	2025-04-03 07:51:45 +00:00
Thomas Eizinger	e7cf00eb53	chore(relay): log when encountering unsupported channel mappings (#8617 ) Currently, the relay's eBPF module only supports routing from IPv4 to IPv4 as well as IPv6 to IPv6. In general, TURN servers can also route from IPv4 to IPv6 and vice versa. Our userspace routing supports that but doing the same in the eBPF code is a bit more involved. We'd need to move around the headers a bit more (IPv4 and IPv6 headers are different in size), as well as configure the respective "source" address for each interface. Currently, we simply take the destination address of the incoming packet as the new source address. When routing across IP versions, that doesn't work. To gain some more insight into how often this happens, we add these additional maps and populate them. This allows us to emit a dedicated log message whenever we encounter a packet for such a mapping. First, we always check for an entry in the maps that we can handle. If we can't find one, we check the other map and special-case the error. Otherwise, we fall back to the previous "no entry" error. We shouldn't really see these "no entry" errors anymore now, unless someone starts probing our relays for active channels.	2025-04-02 12:07:59 +00:00
Thomas Eizinger	4695f289a0	chore(relay): add more logs to eBPF stats reporting (#8613 )	2025-04-02 06:50:01 +00:00
Thomas Eizinger	59453bd063	chore(eBPF): improve log messages (#8611 )	2025-04-02 04:52:45 +00:00
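
The downcasting problem described in commit bcf4ccf817 can be illustrated without `anyhow` at all. The sketch below is not the code from that commit; it uses only `std::error::Error` and a hypothetical `ConnectError` wrapper and `find_in_chain` helper to show the general idea of walking the whole `source()` chain instead of inspecting only the outermost error.

```rust
use std::error::Error;
use std::fmt;
use std::io;

/// Hypothetical wrapper error that carries an `io::Error` as its cause.
#[derive(Debug)]
struct ConnectError {
    source: io::Error,
}

impl fmt::Display for ConnectError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "failed to connect")
    }
}

impl Error for ConnectError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&self.source)
    }
}

/// Walks `err` and all of its causes, returning the first one of type `T`.
/// (Illustrative helper; the name is an assumption, not part of any crate.)
fn find_in_chain<'a, T: Error + 'static>(err: &'a (dyn Error + 'static)) -> Option<&'a T> {
    let mut current = Some(err);
    while let Some(e) = current {
        if let Some(found) = e.downcast_ref::<T>() {
            return Some(found);
        }
        current = e.source();
    }
    None
}

fn main() {
    let err = ConnectError {
        source: io::Error::new(io::ErrorKind::TimedOut, "timed out"),
    };

    // The outermost error is a `ConnectError`, but the nested cause is still found.
    let io_err = find_in_chain::<io::Error>(&err).expect("io::Error is in the chain");
    println!("{:?}", io_err.kind());
}
```

A helper of this shape checks every cause, which is the behaviour the commit message says plain `is` / `downcast` do not provide for errors that were already nested before being wrapped.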


59 Commits
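
The endianness theory from commit 0079f76ebd can be sketched in plain Rust. This is not the relay's actual `Config`; the `PortRange` type, its field names, and the 49152..=65535 range are assumptions chosen for illustration. The point is only that storing the fields as byte arrays makes the byte order explicit, so both sides of a shared map agree on how to read them.

```rust
/// Hypothetical stand-in for a config shared between user space and an eBPF program.
/// Fields are stored as big-endian byte arrays so the byte order is explicit.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq)]
struct PortRange {
    lowest_be: [u8; 2],
    highest_be: [u8; 2],
}

impl PortRange {
    fn new(lowest: u16, highest: u16) -> Self {
        Self {
            lowest_be: lowest.to_be_bytes(),
            highest_be: highest.to_be_bytes(),
        }
    }

    /// Checks whether `port` (already converted to host order) falls within the range.
    fn contains(&self, port: u16) -> bool {
        let lowest = u16::from_be_bytes(self.lowest_be);
        let highest = u16::from_be_bytes(self.highest_be);
        (lowest..=highest).contains(&port)
    }
}

fn main() {
    // Illustrative range only; the real relay's range is not taken from the source.
    let range = PortRange::new(49152, 65535);

    // Port 49153 as it would appear in a UDP header (network byte order).
    let wire = [0xC0u8, 0x01];

    // Reading the bytes as big-endian yields the intended port.
    let port = u16::from_be_bytes(wire);
    println!("{port} in range: {}", range.contains(port));

    // Reading the same bytes as little-endian yields 448, which misses the range.
    let misread = u16::from_le_bytes(wire);
    println!("{misread} in range: {}", range.contains(misread));
}
```

On a little-endian host, a native-endian read of those two bytes would behave like the second case, which matches the symptom described in the commit: the range check never matches and the packet falls through to user space.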