firezone

mirror of https://github.com/outbackdingo/firezone.git synced 2026-01-27 10:18:54 +00:00

Author	SHA1	Message	Date
Thomas Eizinger	3b972643b1	feat(rust): stream logs to Sentry when enabled in PostHog (#9635 ) Sentry has a new "Logs" feature where we can stream logs directly to Sentry. Doing this for all Clients and Gateways would be way too much data to collect though. In order to aid debugging from customer installations, we add a PostHog-managed feature flag that - if set to `true` - enables the streaming of logs to Sentry. This feature flag is evaluated every time the telemetry context is initialised: - For all FFI usages of connlib, this happens every time a new session is created. - For the Windows/Linux Tunnel service, this also happens every time we create a new session. - For the Headless Client and Gateway, it happens on startup and afterwards, every minute. The feature-flag context itself is only checked every 5 minutes though so it might take up to 5 minutes before this takes effect. The default value - like all feature flags - is `false`. Therefore, if there is any issue with the PostHog service, we will fall back to the previous behaviour where logs are simply stored locally. Resolves: #9600	2025-06-25 16:14:14 +00:00
Thomas Eizinger	fccf5021e6	fix(relay): don't fail event-loop on interrupt (#9592 ) When profiling the relay, certain syscalls may get interrupted by the kernel. At present, this crashes the relay which makes profiling impossible. Co-authored-by: Antoine Labarussias <antoinelabarussias@gmail.com>	2025-06-20 18:42:57 +00:00
Thomas Eizinger	e05c98bfca	ci: update to new `cargo sort` release (#9354 ) The latest release now also sorts workspace dependencies, as well as different dependency sections. Keeping these things sorted reduces the chances of merge conflicts when multiple PRs edit these files.	2025-06-02 02:01:09 +00:00
Thomas Eizinger	cee4be9e24	build(deps): bump Rust dependencies (#9192 ) A mass upgrade of our Rust dependencies. Most crucially, these remove several duplicated dependencies from our tree. - The Tauri plugins have been stuck on `windows v0.60` for a while. They are now updated to use `windows v0.61` which is what the rest of our dependency tree uses. - By bumping `axum`, we can also bump `reqwest`, which removes a few more duplicated dependencies. - By removing `env_logger`, we can get rid of a few dependencies.	2025-05-22 13:15:01 +00:00
Thomas Eizinger	37529803ce	build(rust): bump otel ecosystem crates to 0.29 (#9029 )	2025-05-05 12:33:07 +00:00
Thomas Eizinger	6114bb274f	chore(rust): make most of the Rust code compile on MacOS (#8924 ) When working on the Rust code of Firezone from a MacOS computer, it is useful to have pretty much all of the code at least compile in order to detect problems early. Eventually, once we target features like a headless MacOS client, some of these stubs will actually be filled in and be functional.	2025-04-29 11:20:09 +00:00
Thomas Eizinger	bcbc8cd212	build(rust): bump `aya` to include BTF information feature (#8883 ) The latest version of `aya-build` automatically builds our eBPF program with BTF information enabled. Related: https://github.com/aya-rs/aya/pull/1250	2025-04-22 00:36:41 +00:00
Thomas Eizinger	1af7f4f8c1	fix(rust): don't use jemalloc on ARMv7 (#8859 ) `jemalloc` doesn't compile on ARMv7, so we just fall back to the default allocator there.	2025-04-19 22:20:05 +00:00
Thomas Eizinger	a41395a165	feat(eBPF): embed BTF information in eBPF kernel (#8842 ) It turns out that the Rust compiler doesn't always say that it is adding debug information to a binary even when it does! The build output only displays `[optimized]` when in fact it does actually emit debug information. Adding an additional linker flag configures `bpf-linker` to include the necessary BTF information in our kernel. This makes debugging verifier errors much easier as the program output contains source code annotations. It also should make it easier to debug issues using `xdpdump` which relies on BTF information. Resolves: #8503	2025-04-19 12:38:59 +00:00
Thomas Eizinger	34f28e2ae6	feat(rust): use jemalloc for Gateway and Relay (#8846 ) `jemalloc` is a modern allocator that is designed for multi-threaded systems and can better handle smaller allocations that may otherwise fragment the heap. Firezone's components, especially Relays and Gateways are intended to operate with a long uptime and therefore need to handle memory efficiently.	2025-04-19 12:25:46 +00:00
Thomas Eizinger	b4afd0bffb	refactor(eBPF): reduce size of maps (#8849 ) Whilst developing the eBPF module for the relay, I needed to manually add padding within the key and value structs used in the maps in order for the kernel to be able to correctly retrieve the data. For some reason, this seems no longer necessary as the integration test now passes without this as well. Being able to remove the padding drastically reduces the size of these maps for the current number of entries that we allow. This brings the overall memory usage of the relay down. Resolves: #8682	2025-04-19 11:46:58 +00:00
Thomas Eizinger	f51fd53708	chore(eBPF): use `RangeInclusive::contains` again (#8812 ) Now that we have figured out what the problem was with the eBPF kernel not routing certain packets, we can undo the manual implementation of the allocation range checking again and use the more concise `RangeInclusive::contains`. Related: #8809 Related: #8807	2025-04-18 15:49:23 +00:00
Thomas Eizinger	492e54efaa	build(rust): bump `network-types` to `v0.0.8` (#8811 ) This new release includes several patches we have made upstream that allow us to remove some of the vendored types from the crate. All fields that we access from `network-types` are now stored as byte-arrays and thus retain the big-endian byte ordering from the network. Resolves: #8686 Related: https://github.com/vadorovsky/network-types/pull/34 Related: https://github.com/vadorovsky/network-types/pull/36 Related: https://github.com/vadorovsky/network-types/pull/38	2025-04-18 15:21:14 +00:00
Thomas Eizinger	c52d88f421	fix(relay): stateless encoding/decoding (#8810 ) The STUN message encoder & decoder from `stun_codec` are stateful operations. However, they only operate on one datagram at a time. If encoding or decoding fails, their internal state is corrupted and must be discarded. At present, this doesn't happen which leads to further failures down the line because new datagrams coming in cannot be correctly decoded. To fix this, we scope the stateful nature of these encoders and decoders to their respective functions. Resolves: #8808	2025-04-18 15:12:46 +00:00
Thomas Eizinger	96e739439b	fix(relay): remove `Config` caching (#8809 ) In #8650, we originally added a feature-flag for toggling the eBPF TURN router on and off at runtime. This later got removed again in #8681. What remained was a "caching system" of the config that the eBPF kernel and user space share with each other. This config was initialised to the default configuration. If the to-be-set config was the same as the current config, the config would not actually apply to the array that was shared with the eBPF kernel. At the time, we assumed that, if the config was not set in the kernel, the lookup in the array would yield `None` and we would fall back to the `Default` implementation of `Config`. This assumption was wrong. It appears that look-ups in the array always yield an element: all zeros. Initialising our config with all zeros yields the following: ![image](https://github.com/user-attachments/assets/6556f32d-8cff-4fba-aa29-f9ac7349ace6) Of course, if this range is not initialised correctly, we can never actually route packets arriving on allocation ports and with UDP checksumming turned off, all packets routed the other way will have an invalid checksum and therefore be dropped by the receiving host. Our integration test did not catch this because in there, we purposely disable UDP checksumming. That meant that the "caching" check in the `ebpf::Program` did not trigger and we actually did set a `Config` in the array, therefore initialising the allocation port range correctly and allowing the packet to be routed. To fix this, we remove this caching check again which means every `Config` we set on the eBPF program actually gets copied to the shared array. Originally, this caching check was introduced to avoid a syscall on every event-loop iteration as part of checking the feature-flag. Now that the feature-flag has been removed, we don't need to have this cache anymore.	2025-04-18 13:50:42 +00:00
Thomas Eizinger	4ade88b1b1	fix(eBPF): implement "is port in allocation range" ourselves (#8807 ) I suspect that something is wrong with the check that a port is indeed within the allocation range. Thus, we now implement this check ourselves with two simple conditions.	2025-04-18 06:43:33 +00:00
Thomas Eizinger	c5c195f282	chore(eBPF): change error log-levels (#8805 ) Neither of the moved error cases should happen very often so it is fine to log them on debug. - `Error::NotTurn` only happens if we receive a UDP packet that isn't STUN traffic (port 3478) or not in the allocation-port range. I am suspecting there to be a bug that I am aiming to fix in #8804. - `Error::NotAChannelDataMessage` will happen for all STUN control traffic, like channel bindings, allocation requests, etc. Those only happen occasionally so won't spam too much. - `Ipv4PacketWithOptions` should basically not happen at all because - as far as I know - IPv4 options aren't used a lot. In any case, when debugging, it is useful to see when we do hit these cases to know, why a packet was offloaded to user space.	2025-04-18 04:55:43 +00:00
Thomas Eizinger	0079f76ebd	fix(eBPF): store allocation port-range in big-endian (#8804 ) Any communication between user-space and the eBPF kernel happens via maps. The keys and values in these maps are serialised to bytes, meaning the endianness of how these values are encoded matters! When debugging why the eBPF kernels were not performing as much as we thought they would, I noticed that only very small packets were getting relayed. In particular, only packets encoded as channel-data packets were getting unwrapped correctly. The reverse didn't happen at all. Turning the log-level up to TRACE did reveal that we do in fact see these packets but they don't get handled. Here is the relevant section that handles these packets: `74ccf8e0b2/rust/relay/ebpf-turn-router/src/main.rs (L127-L151)` We can see the `trace!` log in the logs and we know that it should be handled by the first `if`. But for some reason it doesn't. x86 systems like the machines running in GCP are typically little-endian. Network-byte ordering is big-endian. My current theory is that we are comparing the port range with the wrong endianness and therefore, this branch never gets hit, causing the relaying to be offloaded to user space. By storing the fields within `Config` in byte-arrays, we can take explicit control over which endianness is used to store these fields.	2025-04-18 04:51:40 +00:00
Thomas Eizinger	38dedb8275	feat(relay): allow controlling log-level at runtime (#8800 ) When debugging issues with the relays on GCP, it is useful to be able to change the log-level at runtime without having to redeploy them. We can achieve this by running an additional HTTP server as part of the relay that responds to HTTP POST requests that contain new logging directives. --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io>	2025-04-17 13:22:12 +00:00
Thomas Eizinger	07a82d2254	chore(relay): remove feature flag for eBPF TURN router (#8681 ) The original idea of this feature flag was that we can easily disable the eBPF router in case it causes issues in production. However, toggling it on / off does not seem to work reliably. Without an explicit toggle of the feature-flag, the eBPF program doesn't seem to be loaded correctly. This uncertainty makes me not trust the metrics that we are seeing because we don't know whether really all relays are using the eBPF router to relay TURN traffic. In order to draw truthful conclusions as to how much traffic we are relaying via eBPF, this patch removes the feature flag again. As of #8656, we can disable the eBPF program by not setting the `EBPF_OFFLOADING` env variable. This requires a re-deploy / restart of relays to take effect which isn't quite as fast as toggling a feature flag but is much more reliable and easier to maintain.	2025-04-07 03:31:22 +00:00
Thomas Eizinger	391e94ebed	fix(relay): set a Firezone ID to enable feature-flags (#8657 ) Our feature-flags are currently coupled to our Firezone ID. Without a Firezone ID, we cannot evaluate feature flags. In order to be able to use the feature flags to enable / disable the eBPF TURN router, we set a random UUID as the Firezone ID upon startup of the relay. Not setting this currently causes the eBPF router to be disabled as soon as we start up because the default of the feature flag is false and we don't reevaluate it later due to the missing ID.	2025-04-04 07:13:56 +00:00
Thomas Eizinger	6fe7e77f76	refactor(relay): fail if eBPF offloading is requested but fails (#8656 ) It happened a bunch of times during testing that I forgot to set the right interface onto which the eBPF kernel should be loaded and then wondered why it didn't work. Defaulting to `eth0` wasn't a very smart decision because it means users cannot disable the eBPF kernel at all (other than via the feature-flag). It makes more sense to default to not loading the program at all AND hard-fail if we are requested to load it but cannot. This allows us to catch configuration errors early.	2025-04-04 07:00:29 +00:00
Thomas Eizinger	cd94dd8a2c	fix(relay): update cached eBPF config when it changes (#8655 )	2025-04-04 05:45:11 +00:00
Thomas Eizinger	941ef6c668	feat(relay): introduce feature-flag for toggling eBPF program (#8650 ) This PR implements a feature-flag in PostHog that we can use to toggle the use of the eBPF data plane at runtime. At every tick of the event-loop, the relay will compare the (cached) configuration of the eBPF program with the (cached) value of the feature-flag. If they differ, the flag will be updated and upon the next packet, the eBPF program will act accordingly. Feature-flags are re-evaluated every 5 minutes, meaning there is some delay until this gets applied. The default value of all our feature-flags is `false`, meaning if there is some problem with evaluating them, we'd turn the eBPF data plane off. Performing routing in userspace is slower but it is a safer default. Resolves: #8548	2025-04-04 02:51:52 +00:00
Thomas Eizinger	f0a6367c7f	refactor(eBPF): rename `slice_mut_at` module (#8634 ) The name `slice_mut_at` came from a time where this function actually returned a slice of bytes. It has since been refactored to return a mutable reference to a type T that gets set by the caller. Thus, `ref_mut_at` is a much more fitting name.	2025-04-04 02:00:05 +00:00
Thomas Eizinger	ebb71e0f54	fix(relay): increase page size for metrics to 4096 (#8646 ) The default here is 2 which is nowhere near enough of a batch-size for us to read all perf events generated by the kernel when it is actually relaying data via eBPF (we generate 1 perf event per relayed packet). If we don't read them fast enough, the kernel has to drop some, meaning we skew our metrics as to how much data we've relayed via eBPF. This has been tested in my local setup and I've seen north of 500 events being read in a single batch now. --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io>	2025-04-04 01:28:22 +00:00
Thomas Eizinger	634c5ee38f	refactor(eBPF): reuse `CdHdr` struct (#8635 ) Instead of passing just a 4-byte array, we can pass a `CdHdr` struct that we have already defined. This is more type-safe and correctly captures the invariant of the order of fields in the header.	2025-04-03 14:22:38 +00:00
Thomas Eizinger	2b1527b48c	chore(eBPF): warn when dropping packets (#8630 ) When we decide to drop a packet, it means something is seriously off and we should look into it. These warnings will propagate to userspace and trigger a warning that gets reported to Sentry (if telemetry is enabled).	2025-04-03 14:14:27 +00:00
Thomas Eizinger	b863febac8	chore(eBPF): fix bad error message (#8629 ) Not sure how this one snuck in there. Must have made a mistake with my multi-line cursors.	2025-04-03 14:14:07 +00:00
Thomas Eizinger	6a83b06f9e	feat(eBPF): log Ethernet header update (#8632 ) Similar to IPv4, IPv6 and UDP, this adds a debug log describing how we are updating the Ethernet header of a packet. --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-04-03 14:09:47 +00:00
Thomas Eizinger	40fb7d0565	fix(eBPF): explicitly attach in SKB mode (#8628 ) It appears that the gVNIC driver in Google Cloud doesn't give us enough headroom to use `bpf_xdp_adjust_head` with a delta of 4 bytes. Currently, we are loading the XDP program with default flags. By loading it explicitly in SKB mode, we should be able to bypass these driver limitations at the expense of some performance (which should still be better than userspace!). Related: https://github.com/GoogleCloudPlatform/compute-virtual-ethernet-linux/issues/70	2025-04-03 07:51:45 +00:00
Thomas Eizinger	8c55c2a46a	chore(eBPF): include return value in errors (#8626 ) At present, we only check for the return value of the various helper functions and bail out if they fail. What we don't learn is what the actual return code is. To further help with debugging, we include the return code in the error so we can print it later. We can't use the formatting macro within the `write` function so we need to stitch the message together ourselves.	2025-04-03 01:21:47 +00:00
Thomas Eizinger	d00995a91e	fix(eBPF): drop messed up packets (#8618 ) In case any of the xdp store/adjust/load functions fail, we need to drop the packet. By the time we get to these functions, we have already overwritten the Ethernet, IP and UDP headers and would only need to copy them either forwards or backwards to get rid of or add the channel data header. Forwarding these packets to userspace is pointless.	2025-04-03 00:27:44 +00:00
Thomas Eizinger	e7cf00eb53	chore(relay): log when encountering unsupported channel mappings (#8617 ) Currently, the relay's eBPF module only supports routing from IPv4 to IPv4 as well as IPv6 to IPv6. In general, TURN servers can also route from IPv4 to IPv6 and vice versa. Our userspace routing supports that but doing the same in the eBPF code is a bit more involved. We'd need to move around the headers a bit more (IPv4 and IPv6 headers are different in size), as well as configure the respective "source" address for each interface. Currently, we simply take the destination address of the incoming packet as the new source address. When routing across IP versions, that doesn't work. To gain some more insight into how often this happens, we add these additional maps and populate them. This allows us to emit a dedicated log message whenever we encounter a packet for such a mapping. First, we always check for an entry in the maps that we can handle. If we can't find one, we check the other map and special-case the error. Otherwise, we fall back to the previous "no entry" error. We shouldn't really see these "no entry" errors anymore now, unless someone starts probing our relays for active channels.	2025-04-02 12:07:59 +00:00
Thomas Eizinger	4695f289a0	chore(relay): add more logs to eBPF stats reporting (#8613 )	2025-04-02 06:50:01 +00:00
Thomas Eizinger	59453bd063	chore(eBPF): improve log messages (#8611 )	2025-04-02 04:52:45 +00:00
Thomas Eizinger	fb1311991a	fix(eBPF): correctly set Ethernet addresses (#8601 ) At present, the eBPF code assumes that the incoming packet needs to be sent back to the same MAC address that it came from. This is only true if there is at least one IP layer hop in-between the relay and the Client / Gateway. When setting up Firezone in my local LAN to debug the eBPF code, all components are within the same subnet and thus can send packets directly to each other, without having to go through the router. In such a scenario, simply swapping the Ethernet addresses is not correct. As part of witnessing traffic coming in via the network, we can build up a mapping of IP to MAC address. This mapping can then later be used to set the correct MAC address for a given destination IP. All of this functions entirely without interaction from userspace. Unless you are running in a LAN environment, most if not all IPs will point to the same MAC address (the one of the next IP layer hop, i.e. the router). For the very first packet that we want to relay, we will not have a MAC address for the destination IP. This doesn't matter though, we simply pass that packet up to userspace and handle it there. Pretty much all communication on the Internet is bi-directional because you need some kind of ACK. As soon as we receive the first ACK, e.g. the response to a binding request, we will learn the MAC address for the given target IP and the eBPF router can kick in for all packets going forward. Related: #7518	2025-04-02 03:20:37 +00:00
Thomas Eizinger	f71995f7a5	fix(eBPF): incorporate change in UDP payload into checksum (#8603 ) The UDP checksum also includes the entire payload. Removing and adding bytes to the payload therefore needs to be reflected in the checksum update that we perform. When we add the channel data header, we need to add the bytes to the checksum and when we remove them, they need to be removed. Related: #7518	2025-04-01 16:23:44 +00:00
Thomas Eizinger	e58ec73bbc	refactor(eBPF): imply `XDP_TX` from `Ok(())` (#8604 ) Currently, the eBPF code isn't consistent in how it handles XDP actions. For some cases, we return errors and then map them to `XDP_PASS` or `XDP_DROP`. For others, we return `Ok(XDP_PASS)`. This is unnecessarily hard to understand. We refactor the eBPF kernel to ALWAYS use `Error`s for all code-paths that don't end in `XDP_TX`, i.e. when we successfully modified the packet and want to send it back out. In addition, we also change the way we log these errors. Not all errors are equal and most `XDP_PASS` actions don't need to be logged. Those packets are simply passing through. Finally, we also introduce new checks in case any calls to the eBPF helper functions fail. Related: #7518	2025-04-01 13:42:00 +00:00
Thomas Eizinger	cff14b3da0	feat(relay): make interface for eBPF program configurable (#8592 )	2025-04-01 08:20:27 +00:00
Thomas Eizinger	a942dee723	chore(eBPF): don't count channel data header as relayed bytes (#8590 )	2025-04-01 04:31:06 +00:00
Thomas Eizinger	bb36156ea8	chore(eBPF): remove commented out codeblock (#8588 ) This is a leftover from debugging trying to make the verifier happy.	2025-04-01 00:10:36 +00:00
Thomas Eizinger	db76cc3844	fix(relay): reduce memory usage of eBPF program to < 100MB (#8587 ) At present, the eBPF program would try to pre-allocate around 800MB of memory for all entries in the maps. This would allow for 1 million channel mappings. We don't need that many to begin with. Reducing the max number of channels down to 65536 reduces our memory usage to less than 100MB. Related: #7518 --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-04-01 00:08:07 +00:00
Thomas Eizinger	1d0ecf94b8	feat(relay): record metrics about bytes relayed via eBPF (#8556 ) Perf events are designed to be an extremely efficient way of transferring data from an eBPF kernel to the user-space program. In order to monitor, how much traffic we are actually relaying via eBPF, we introduce a dedicated `STATS` map that is a `PerfEventArray`. The events from that array are read asynchronously in user-space and fed into our OTEL metrics. They will show up in our Google Cloud metrics as `data_relayed_ebpf_bytes`. We already have a metric for the total relayed bytes. That counter is renamed to `data_relayed_userspace_bytes` so we can clearly differentiate the two.	2025-03-31 21:57:31 +00:00
Thomas Eizinger	b51a68def0	feat(relay): implement eBPF routing for IPv6 (#8554 ) This fills in the boilerplate for handling IPv6 packets in the eBPF code. Unfortunately, we cannot add an integration test for this because IPv6 doesn't have a checksum and thus doesn't allow the UDP checksum to be set to 0. Because Linux (and other OSs too I'd assume) offloads UDP checksumming to the NIC, yet packets on the loopback interface never get to the NIC, our eBPF code sees only a partial checksum and thus updates the checksum incorrectly. Related: #7518 Related: #8502 --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-03-31 21:22:11 +00:00
Thomas Eizinger	a4851ee76f	feat(relay): implement the reverse IPv4 eBPF code path (#8544 ) This PR implements the "reverse path" of handling TURN traffic, i.e. UDP datagrams that arrive on an allocation port and need to be wrapped in a channel-data message to be sent to the TURN client. In order to achieve that, I had to rewrite most of the TURN code to not use the `etherparse` crate. I couldn't quite figure out the details but the eBPF verifier rejected my code in mysterious ways that I didn't understand. Commenting out random code-paths seemed to make it happy but all code-paths combined caused an error. Eventually, I decided that we simply have to use less abstractions to implement the same logic. All the "parsing" code is now using types inspired by `network-types`. The only modification here is that we use byte-arrays within our structs in order to directly receive them in big-endian ordering. `network-types` uses `u16`s and `u32`s which get interpreted as little-endian on x86. Instead of converting around between the endianness, constructing those values where we want them using the right endianness is deemed much simpler. I opened an issue with upstream which - if accepted - will allow us to remove our own structs and instead depend on upstream again. I also had to aggressively add `#[inline(always)]` to several functions, otherwise the compiler would not optimise away our function calls, causing the linker and / or eBPF verifier to fail. This PR also fixes numerous bugs that I've found in the already existing eBPF code. The number of bugs makes me question how this has been working so far at all! - We did not swap the Ethernet source and destination MAC address when re-routing the packet. The integration-test didn't catch this because it only operates on the loopback interface. Further testing on staging should allow us to confirm that this is indeed working now. - The UDP checksum update did not incorporate the new src and dst port. 
The integration-test didn't catch that because it has UDP checksumming disabled. We need to have that disabled in the test because UDP checksumming is typically offloaded to the NIC and packets on the loopback interface never leave the device. Related: https://github.com/vadorovsky/network-types/issues/32. Related: #7518	2025-03-31 12:32:35 +00:00
Thomas Eizinger	ae157bce12	fix(relay): turn regression tests back on (#8541 ) As part of iterating on #8496, the API of `relay::Server` had changed and I had commented out the regression tests to move quicker. In later iterations, those API changes were reverted but I forgot to uncomment them.	2025-03-31 08:55:26 +00:00
Thomas Eizinger	afa6814ab4	chore(relay): ignore eBPF integration test (#8543 ) This needs elevated privileges to run. Our current pattern for these is to set them as ignored. In CI, we run all tests, including the ignored ones.	2025-03-29 01:49:43 +00:00
Thomas Eizinger	e231ba9407	fix(relay): update `aya-build` dependency to latest version (#8540 ) As part of working on https://github.com/aya-rs/aya/pull/1228, which I am depending on in here I had to force-push which will break CI. Opening this to fix it.	2025-03-29 00:12:14 +00:00
Thomas Eizinger	3c7ac084c0	feat(relay): MVP for routing channel data message in eBPF kernel (#8496 ) ## Abstract This pull-request implements the first stage of off-loading routing of TURN data channel messages to the kernel via an eBPF XDP program. In particular, the eBPF kernel implemented here only handles the decapsulation of IPv4 data channel messages into their embedded UDP payload. Implementation of other data paths, such as the receiving of UDP traffic on an allocation and wrapping it in a TURN channel data message is deferred to a later point for reasons explained further down. As it stands, this PR implements the bare minimum for us to start experimenting and benefiting from eBPF. It is already massive as it is due to the infrastructure required for actually doing this. Let's dive into it! ## A refresher on TURN channel-data messages TURN specifies a channel-data message for relaying data between two peers. A channel data message has a fixed 4-byte header: - The first two bytes specify the channel number - The second two bytes specify the length of the encapsulated payload Like all TURN traffic, channel data messages run over UDP by default, meaning this header sits at the very front of the UDP payload. This will be important later. After making an allocation with a TURN server (i.e. reserving a port on the TURN server's interfaces), a TURN client can bind channels on that allocation. As such, channel numbers are scoped to a client's allocation. Channel numbers are allocated by the client within a given range (0x4000 - 0x4FFF). When binding a channel, the client specifies the remote's peer address that they'd like the data sent on the channel to be sent to. Given this setup, when a TURN server receives a channel data message, it first looks at the sender's IP + port to infer the allocation (a client can only ever have 1 allocation at a time). 
Within that allocation, the server then looks for the channel number and retrieves the target socket address from that. The allocation itself is a port on the relay's interface. With that, we can now "unpack" the payload of the channel data message and rewrite it to the new receiver: - The new source IP can be set from the old dst IP (when operating in user-space mode this is irrelevant because we are working with the socket API). - The new source port is the client's allocation. - The new destination IP is retrieved from the mapping retrieved via the channel number. - The new destination port is retrieved from the mapping retrieved via the channel number. Last but not least, all that is left is removing the channel data header from the UDP payload and we can send out the packet. In other words, we need to cut off the first 4 bytes of the UDP payload. ## User-space relaying At present, we implement the above flow in user-space. This is tricky to do because we need to bind _many_ sockets, one for each possible allocation port (of which there can be 16383). The actual work to be done on these packets is also extremely minimal. All we do is cut off (or add on) the data-channel header. Benchmarks show that we spend pretty much all of our time copying data between user-space and kernel-space. Cutting this out should give us a massive increase in performance. ## Implementing an eBPF XDP TURN router eBPF has been shown to be a very efficient way of speeding up a TURN server [0]. After many failed experiments (e.g. using TC instead of XDP) and countless rabbit-holes, we have also arrived at the design documented within the paper. Most notably: - The eBPF program is entirely optional. We try to load it on startup, but if that fails, we will simply use the user-space mode. - Retaining the user-space mode is also important because under certain circumstances, the eBPF kernel needs to pass on the packet, for example, when receiving IPv4 packets with options. 
Those make the header dynamically-sized which makes further processing difficult because the eBPF verifier disallows indexing into the packet with data derived from the packet itself. - In order to add/remove the channel-data header, we shift the packet headers backwards / forwards and leave the payload in place as the packet headers are constant in size and can thus easily and cheaply be copied out. In order to perform the relaying flow explained above, we introduce maps that are shared with user-space. These maps go from a tuple of (client-socket, channel-number) to a tuple of (allocation-port, peer-socket) and thus give us all the data necessary to rewrite the packet. ## Integration with our relay Last but not least, to actually integrate the eBPF kernel with our relay, we need to extend the `Server` with two more events so we can learn, when channel bindings are created and when they expire. Using these events, we can then update the eBPF maps accordingly and therefore influence the routing behaviour in the kernel. ## Scope What is implemented here is only one of several possible data paths. Implementing the others isn't conceptually difficult but it does increase the scope. Landing something that already works allows us to gain experience running it in staging (and possibly production). Additionally, I've hit some issues with the eBPF verifier when adding more codepaths to the kernel. I expect those to be possible to resolve given sufficient debugging but I'd like to do so after merging this. --- Depends-On: #8506 Depends-On: #8507 Depends-On: #8500 Resolves: #8501 [0]: https://dl.acm.org/doi/pdf/10.1145/3609021.3609296	2025-03-27 10:59:40 +00:00
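
As a rough illustration of the channel-data framing described in 3c7ac084c0 (#8496), the sketch below parses the fixed 4-byte header from the front of a UDP payload in plain user-space Rust. The `ChannelDataHeader` type and the `parse_channel_data` function are hypothetical names made up for this example; they are not taken from the Firezone code base, and the real eBPF code has to satisfy the verifier and therefore likely looks quite different. The channel range is the one quoted in the commit message.

```rust
/// Hypothetical, illustrative type; NOT the `CdHdr` struct from the relay.
#[derive(Debug, PartialEq, Eq)]
struct ChannelDataHeader {
    /// Channel number (first two bytes, big-endian on the wire).
    channel: u16,
    /// Length of the encapsulated payload (second two bytes, big-endian).
    length: u16,
}

/// Splits a UDP payload into the channel-data header and the encapsulated data.
///
/// Returns `None` if the payload is too short, the channel number is outside
/// the range quoted in the commit message (0x4000 - 0x4FFF), or the announced
/// length exceeds the bytes that are actually present.
fn parse_channel_data(udp_payload: &[u8]) -> Option<(ChannelDataHeader, &[u8])> {
    if udp_payload.len() < 4 {
        return None;
    }

    let (hdr, rest) = udp_payload.split_at(4);
    let channel = u16::from_be_bytes([hdr[0], hdr[1]]);
    let length = u16::from_be_bytes([hdr[2], hdr[3]]);

    if !(0x4000..=0x4FFF).contains(&channel) {
        return None;
    }

    // Anything after `length` bytes is treated as padding and ignored.
    let data = rest.get(..usize::from(length))?;

    Some((ChannelDataHeader { channel, length }, data))
}
```

Relaying towards the peer then amounts to sending `data` on without the first 4 bytes, which matches the "cut off the first 4 bytes of the UDP payload" step from the commit message.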

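
Commit f71995f7a5 (#8603) notes that removing or adding the channel-data header must be reflected in the UDP checksum. The sketch below shows the general idea of such an incremental update using the ones' complement arithmetic of the Internet checksum (RFC 1624 with the new value taken as zero). It is an assumption-laden illustration, not the relay's code: the function names are hypothetical, data is modelled as 16-bit words, and the change to the UDP length field (which is also covered by the checksum) is deliberately left out.

```rust
/// Ones' complement sum of 16-bit words with end-around carry.
fn ones_complement_sum(words: &[u16]) -> u16 {
    let mut sum: u32 = 0;
    for &word in words {
        sum += u32::from(word);
        // Fold the carry back in; afterwards `sum` fits into 16 bits again.
        sum = (sum & 0xFFFF) + (sum >> 16);
    }
    sum as u16
}

/// Internet checksum: the complement of the ones' complement sum.
fn checksum(words: &[u16]) -> u16 {
    !ones_complement_sum(words)
}

/// Hypothetical helper: updates `old` after `removed` words were cut out of
/// the checksummed data, without touching the remaining data.
///
/// Subtracting a word in ones' complement arithmetic is the same as adding
/// its complement, hence `HC' = ~(~HC + ~m)` for every removed word `m`.
fn checksum_after_removal(old: u16, removed: &[u16]) -> u16 {
    let mut sum = !old;
    for &word in removed {
        sum = ones_complement_sum(&[sum, !word]);
    }
    !sum
}
```

For example, with the words `[0x4000, 0x0004, 0x1234, 0xABCD]`, removing the first two words incrementally gives the same result as recomputing the checksum over `[0x1234, 0xABCD]` from scratch. Adding a header works the other way around: the new words are added to the sum instead of their complements.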

311 Commits
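
The endianness theory from 0079f76ebd (#8804) can be illustrated with a small user-space sketch. Everything here is hypothetical: `PortRange`, `naive_contains` and the concrete range 49152-65535 are assumptions chosen for the example and do not mirror the relay's actual `Config` layout. The point is only that storing the bounds as big-endian byte arrays makes the comparison independent of the host's byte order, whereas reinterpreting wire bytes in native order goes wrong on little-endian machines.

```rust
/// Hypothetical config entry that keeps its bounds in network byte order.
#[repr(C)]
#[derive(Clone, Copy)]
struct PortRange {
    lowest: [u8; 2],
    highest: [u8; 2],
}

impl PortRange {
    /// Stores both bounds explicitly as big-endian bytes.
    fn new(lowest: u16, highest: u16) -> Self {
        Self {
            lowest: lowest.to_be_bytes(),
            highest: highest.to_be_bytes(),
        }
    }

    /// Checks a port exactly as it appears in the UDP header (big-endian).
    fn contains(&self, port_on_wire: [u8; 2]) -> bool {
        let port = u16::from_be_bytes(port_on_wire);
        let lowest = u16::from_be_bytes(self.lowest);
        let highest = u16::from_be_bytes(self.highest);

        (lowest..=highest).contains(&port)
    }
}

/// The kind of bug the commit suspects: the wire bytes are interpreted in the
/// host's native byte order, which is only correct on big-endian machines.
fn naive_contains(lowest: u16, highest: u16, port_on_wire: [u8; 2]) -> bool {
    let port = u16::from_ne_bytes(port_on_wire);

    (lowest..=highest).contains(&port)
}
```

On a little-endian host, port 50000 (`0xC350`) arrives as the bytes `[0xC3, 0x50]`; `naive_contains` reads them as `0x50C3` (20675) and rejects the port, while `PortRange::contains` accepts it.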