The current Rust workspace isn't as consistent as it could be. To make
navigation a bit easier, we move a few crates around. Generally, we
follow the idea that entry-points should be at the top-level. `rust/`
now looks like this (directories only):
```
.
├── cli # Firezone CLI
├── client-ffi # Entry point for Apple & Android
├── gateway # Gateway
├── gui-client # GUI client
├── headless-client # Headless client
├── libs # Library crates
├── relay # Relay
├── target # Compile artifacts
├── tests # Crates for testing
└── tools # Local tools
```
To further enforce this structure, we also drop the `firezone-` prefix
from all crates that are not top-level binary crates.
In CI, eBPF in driver mode actually functions just fine with no changes
to our existing tests, given we apply a few workarounds and bugfixes:
- The interface learning mechanism had two flaws: (1) it only learned
per-CPU, which meant the risk for a missing entry grew as the core count
of the relay host grew, and (2) it did not filter for unicast IPs, so it
picked up broadcast and link-local addresses, causing cross-relay paths
to fail occasionally
- The `relay-relay` candidate where the two relays are the same relay
causes packet drops / loops in the Docker bridge setup, and possibly in
GCP too. I'm not sure this is a valid path that solves a real
connectivity issue in the wild. I can understand relay-relay paths where
two relays are different hosts, and the client and gateway both talk
over their TURN channel to each other (i.e. WireGuard is blocked in each
of their networks), but I can't think of an advantage for a relay-relay
candidate where the traffic simply hairpins (or is dropped) off the
nearest switch. This has been now detected with a new `PacketLoop` error
that triggers whenever source_ip == dest_ip.
- The relays in CI need a common next-hop to talk to for the MAC address
swapping to work. A simple router service is added which functions as a
basic L3 router (no NAT) that allows the MAC swapping to work.
- The `veth` driver has some peculiar requirements to allow it to
function with XDP_TX. If you send a packet out of one interface of a
veth pair with XDP_TX, you need to either make sure both interfaces have
GRO enabled, or you need to attach a dummy XDP program that simply does
XDP_PASS to the other interface so that the sk_buff is allocated before
going up the stack to the Docker bridge. The GRO method was unreliable
and didn't work in our case, causing massive packet delays and
unpredictable bursts that prevented ICE from working, so we use the
XDP_PASS method instead. A simple docker image is built and lives at
https://github.com/firezone/xdp-pass to handle this.
Related: #10138
Related: #10260
This updates our eBPF module to use DRV_MODE for less CPU overhead and
better performance for all same-stack TURN relaying.
Notably, gVNIC does not seem to support the `bpf_xdp_adjust_head`
helper, so unfortunately we need to extend / shrink the packet tail and
move the payload instead.
Comprehensive benchmarks have not been performed, but early results show
that we can saturate about 1 Gbps per E2 core on GCP:
```
[SUM] 0.00-30.04 sec 3.16 GBytes 904 Mbits/sec 12088 sender
[SUM] 0.00-30.00 sec 3.12 GBytes 894 Mbits/sec receiver
```
This is with 64 TCP streams. More streams will better utilize all
available RX queues, and lead to better performance.
Related: #10138Fixes: #8633
Rust 1.88 has been released and brings with it a quite exciting feature:
let-chains! It allows us to mix-and-match `if` and `let` expressions,
therefore often reducing the "right-drift" of the relevant code, making
it easier to read.
Rust.188 also comes with a new clippy lint that warns when creating a
mutable reference from an immutable pointer. Attempting to fix this
revealed that this is exactly what we are doing in the eBPF kernel.
Unfortunately, it doesn't seem to be possible to design this in a way
that is both accepted by the borrow-checker AND by the eBPF verifier.
Hence, we simply make the function `unsafe` and document for the
programmer, what needs to be upheld.
A mass upgrade of our Rust dependencies. Most crucially, these remove
several duplicated dependencies from our tree.
- The Tauri plugins have been stuck on `windows v0.60` for a while. They
are now updated to use `windows v0.61` which is what the rest of our
dependency tree uses.
- By bumping `axum`, can also bump `reqwest` which reduces a few more
duplicated dependencies.
- By removing `env_logger`, we can get rid of a few dependencies.
Any communication between user-space and the eBPF kernel happens via
maps. The keys and values in these maps are serialised to bytes, meaning
the endianness of how these values are encoded matters!
When debugging why the eBPF kernels were not performing as much as we
thought they would, I noticed that only very small packets were getting
relayed. In particular, only packets encoded as channel-data packets
were getting unwrapped correctly. The reverse didn't happen at all.
Turning the log-level up to TRACE did reveal that we do in fact see
these packets but they don't get handled.
Here is the relevant section that handles these packets:
74ccf8e0b2/rust/relay/ebpf-turn-router/src/main.rs (L127-L151)
We can see the `trace!` log in the logs and we know that it should be
handled by the first `if`. But for some reason it doesn't.
x86 systems like the machines running in GCP are typically
little-endian. Network-byte ordering is big-endian. My current theory is
that we are comparing the port range with the wrong endianness and
therefore, this branch never gets hit, causing the relaying to be
offloaded to user space.
By storing the fields within `Config` in byte-arrays, we can take
explicit control over which endianness is used to store these fields.
Perf events are designed to be an extremely efficient way of
transferring data from an eBPF kernel to the user-space program. In
order to monitor, how much traffic we are actually relaying via eBPF, we
introduce a dedicated `STATS` map that is a `PerfEventArray`.
The events from that array are read asynchronously in user-space and fed
into our OTEL metrics. They will show up in our Google Cloud metrics as
`data_relayed_ebpf_bytes`. We already have a metric for the total
relayed bytes. That counter is renamed to `data_relayed_userspace_bytes`
so we can clearly differentiate the two.
This PR implements the "reverse path" of handling TURN traffic, i.e. UDP
datagrams that arrive on an allocation port and need to be wrapped in a
channel-data message to be sent to the TURN client.
In order to achieve that, I had to rewrite most of the TURN code to not
use the `etherparse` crate. I couldn't quite figure out the details but
the eBPF verifier rejected my code in mysterious ways that I didn't
understand. Commenting out random code-paths seemed to make it happy but
all code-paths combined caused an error. Eventually, I decided that we
simply have to use less abstractions to implement the same logic.
All the "parsing" code is now using types inspired by `network-types`.
The only modification here is that we use byte-arrays within our structs
in order to directly receive them in big-endian ordering.
`network-types` uses `u16`s and `u32`s which get interpreted as
little-endian on x86. Instead of converting around between the
endianness, constructing those values where we want them using the right
endianness is deemed much simpler. I opened an issue with upstream which
- if accepted - will allow us to remove our own structs and instead
depend on upstream again.
I also had to aggressively add `#[inline(always)]` to several functions,
otherwise the compiler would not optimise away our function calls,
causing the linker and / or eBPF verifier to fail.
This PR also fixes numerous bugs that I've found in the already existing
eBPF code. The number of bugs makes me question how this has been
working so far at all!
- We did not swap the Ethernet source and destination MAC address when
re-routing the packet. The integration-test didn't catch this because it
only operates on the loopback interface. Further testing on staging
should allow us to confirm that this is indeed working now.
- The UDP checksum update did not incorporate the new src and dst port.
The integration-test didnt' catch that because it has UDP checksumming
disabled. We need to have that disabled in the test because UDP
checksumming is typically offloaded to the NIC and packets on the
loopback interface never leave the device.
Related: https://github.com/vadorovsky/network-types/issues/32.
Related: #7518
As part of iterating on #8496, the API of `relay::Server` had changed
and I had commented out the regression tests to move quicker. In later
iterations, those API changes were reverted but I forgot to uncomment
them.
## Abstract
This pull-request implements the first stage of off-loading routing of
TURN data channel messages to the kernel via an eBPF XDP program. In
particular, the eBPF kernel implemented here **only** handles the
decapsulation of IPv4 data channel messages into their embedded UDP
payload. Implementation of other data paths, such as the receiving of
UDP traffic on an allocation and wrapping it in a TURN channel data
message is deferred to a later point for reasons explained further down.
As it stands, this PR implements the bare minimum for us to start
experimenting and benefiting from eBPF. It is already massive as it is
due to the infrastructure required for actually doing this. Let's dive
into it!
## A refresher on TURN channel-data messages
TURN specifies a channel-data message for relaying data between two
peers. A channel data message has a fixed 4-byte header:
- The first two bytes specify the channel number
- The second two bytes specify the length of the encapsulated payload
Like all TURN traffic, channel data messages run over UDP by default,
meaning this header sits at the very front of the UDP payload. This will
be important later.
After making an allocation with a TURN server (i.e. reserving a port on
the TURN server's interfaces), a TURN client can bind channels on that
allocation. As such, channel numbers are scoped to a client's
allocation. Channel numbers are allocated by the client within a given
range (0x4000 - 0x4FFF). When binding a channel, the client specifies
the remote's peer address that they'd like the data sent on the channel
to be sent to.
Given this setup, when a TURN server receives a channel data message, it
first looks at the sender's IP + port to infer the allocation (a client
can only ever have 1 allocation at a time). Within that allocation, the
server then looks for the channel number and retrieves the target socket
address from that. The allocation itself is a port on the relay's
interface. With that, we can now "unpack" the payload of the channel
data message and rewrite it to the new receiver:
- The new source IP can be set from the old dst IP (when operating in
user-space mode this is irrelevant because we are working with the
socket API).
- The new source port is the client's allocation.
- The new destination IP is retrieved from the mapping retrieved via the
channel number.
- The new destination port is retrieved from the mapping retrieved via
the channel number.
Last but not least, all that is left is removing the channel data header
from the UDP payload and we can send out the packet. In other words, we
need to cut off the first 4 bytes of the UDP payload.
## User-space relaying
At present, we implement the above flow in user-space. This is tricky to
do because we need to bind _many_ sockets, one for each possible
allocation port (of which there can be 16383). The actual work to be
done on these packets is also extremely minimal. All we do is cut off
(or add on) the data-channel header. Benchmarks show that we spend
pretty much all of our time copying data between user-space and
kernel-space. Cutting this out should give us a massive increase in
performance.
## Implementing an eBPF XDP TURN router
eBPF has been shown to be a very efficient way of speeding up a TURN
server [0]. After many failed experiments (e.g. using TC instead of XDP)
and countless rabbit-holes, we have also arrived at the design
documented within the paper. Most notably:
- The eBPF program is entirely optional. We try to load it on startup,
but if that fails, we will simply use the user-space mode.
- Retaining the user-space mode is also important because under certain
circumstances, the eBPF kernel needs to pass on the packet, for example,
when receiving IPv4 packets with options. Those make the header
dynamically-sized which makes further processing difficult because the
eBPF verifier disallows indexing into the packet with data derived from
the packet itself.
- In order to add/remove the channel-data header, we shift the packet
headers backwards / forwards and leave the payload in place as the
packet headers are constant in size and can thus easily and cheaply be
copied out.
In order to perform the relaying flow explained above, we introduce maps
that are shared with user-space. These maps go from a tuple of
(client-socket, channel-number) to a tuple of (allocation-port,
peer-socket) and thus give us all the data necessary to rewrite the
packet.
## Integration with our relay
Last but not least, to actually integrate the eBPF kernel with our
relay, we need to extend the `Server` with two more events so we can
learn, when channel bindings are created and when they expire. Using
these events, we can then update the eBPF maps accordingly and therefore
influence the routing behaviour in the kernel.
## Scope
What is implemented here is only one of several possible data paths.
Implementing the others isn't conceptually difficult but it does
increase the scope. Landing something that already works allows us to
gain experience running it in staging (and possibly production).
Additionally, I've hit some issues with the eBPF verifier when adding
more codepaths to the kernel. I expect those to be possible to resolve
given sufficient debugging but I'd like to do so after merging this.
---
Depends-On: #8506
Depends-On: #8507
Depends-On: #8500Resolves: #8501
[0]: https://dl.acm.org/doi/pdf/10.1145/3609021.3609296