Currently, the eBPF module can translate from channel data messages to
UDP packets and vice versa. It can even do that across IP stacks, i.e.
translate from an IPv6 UDP packet to an IPv4 channel data message.
What it cannot do is handle packets addressed to itself. This can happen
if both the Client and the Gateway pick the same relay to make an
allocation. When
exchanging candidates, ICE will then form pairs between both relay
candidates, essentially requiring the relay to loop packets back to
itself.
In eBPF, we cannot do that. When sending a packet back out with
`XDP_TX`, it actually goes out on the wire; there is no additional check
whether it is destined for our own IP.
Properly handling this in eBPF (by comparing the destination IP to our
public IP) adds more cases we need to handle. The current module
structure where everything is one file makes this quite hard to
understand, which is why I opted to create four sub-modules:
- `from_ipv4_channel`
- `from_ipv4_udp`
- `from_ipv6_channel`
- `from_ipv6_udp`
For traffic arriving via a data-channel, it is possible that we also
need to send it back out via a data-channel if the peer address we are
sending to is the relay itself. Therefore, the `from_ipX_channel`
modules have four sub-modules:
- `to_ipv4_channel`
- `to_ipv4_udp`
- `to_ipv6_channel`
- `to_ipv6_udp`
For the traffic arriving on an allocation port (`from_ipX_udp`), we
always map to a data-channel and can therefore never get into a routing
loop, resulting in only two sub-modules:
- `to_ipv4_channel`
- `to_ipv6_channel`
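Putting those lists together, the resulting module tree looks roughly like this (an illustrative sketch of the hierarchy, not the actual file layout):

```rust
// Illustrative sketch of the module hierarchy only; the real code splits
// these into separate files.
mod from_ipv4_channel {
    mod to_ipv4_channel {}
    mod to_ipv4_udp {}
    mod to_ipv6_channel {}
    mod to_ipv6_udp {}
}
mod from_ipv6_channel {
    mod to_ipv4_channel {}
    mod to_ipv4_udp {}
    mod to_ipv6_channel {}
    mod to_ipv6_udp {}
}
mod from_ipv4_udp {
    mod to_ipv4_channel {}
    mod to_ipv6_channel {}
}
mod from_ipv6_udp {
    mod to_ipv4_channel {}
    mod to_ipv6_channel {}
}
```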
The actual implementation of the new code paths is rather simple and
mostly copied from the existing ones. For half of them, we don't need to
make any adjustments to the buffer size (e.g. IPv4 channel to IPv4
channel). For the other half, we need to adjust for the difference in
the IP header size.
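As a rough sketch (constants and helper are illustrative, not the actual code), the adjustment for the cross-stack channel-to-channel paths boils down to the difference between the fixed IPv4 and IPv6 header sizes:

```rust
// Illustrative constants and helper; the real code derives these from its
// header types.
const IPV4_HDR_LEN: i32 = 20; // fixed IPv4 header, no options
const IPV6_HDR_LEN: i32 = 40; // fixed IPv6 header

/// How much the packet needs to grow (positive) or shrink (negative) when
/// relaying a channel-data message from one IP stack to the other.
/// Same-stack paths need no adjustment at all.
fn channel_to_channel_delta(from_ipv6: bool, to_ipv6: bool) -> i32 {
    match (from_ipv6, to_ipv6) {
        (false, true) => IPV6_HDR_LEN - IPV4_HDR_LEN, // IPv4 -> IPv6: +20 bytes
        (true, false) => IPV4_HDR_LEN - IPV6_HDR_LEN, // IPv6 -> IPv4: -20 bytes
        _ => 0,                                       // same stack: no change
    }
}
```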
To test these changes, we add a new integration test that makes use of
the new docker-compose setup added in #10301 and configures masquerading
for both Client and Gateway. To make this more useful, we also remove
the `direct-` prefix from all tests as the test script itself no longer
makes any decisions as to whether it is operating over a direct or
relayed connection.
Resolves: #7518
In CI, eBPF in driver mode actually functions just fine with no changes
to our existing tests, given we apply a few workarounds and bugfixes:
- The interface learning mechanism had two flaws: (1) it only learned
per-CPU, which meant the risk of a missing entry grew as the core count
of the relay host grew, and (2) it did not filter for unicast IPs, so it
picked up broadcast and link-local addresses, causing cross-relay paths
to fail occasionally
- The `relay-relay` candidate where the two relays are the same relay
causes packet drops / loops in the Docker bridge setup, and possibly in
GCP too. I'm not sure this is a valid path that solves a real
connectivity issue in the wild. I can understand relay-relay paths where
two relays are different hosts, and the client and gateway both talk
over their TURN channel to each other (i.e. WireGuard is blocked in each
of their networks), but I can't think of an advantage for a relay-relay
candidate where the traffic simply hairpins (or is dropped) off the
nearest switch. This case is now detected via a new `PacketLoop` error
that triggers whenever source_ip == dest_ip.
- The relays in CI need a common next-hop for the MAC address swapping to
work. A simple router service, which functions as a basic L3 router (no
NAT), is added to provide that next-hop.
- The `veth` driver has some peculiar requirements to allow it to
function with XDP_TX. If you send a packet out of one interface of a
veth pair with XDP_TX, you need to either make sure both interfaces have
GRO enabled, or you need to attach a dummy XDP program that simply does
XDP_PASS to the other interface so that the sk_buff is allocated before
going up the stack to the Docker bridge. The GRO method was unreliable
and didn't work in our case, causing massive packet delays and
unpredictable bursts that prevented ICE from working, so we use the
XDP_PASS method instead. A simple Docker image that handles this is built
and lives at https://github.com/firezone/xdp-pass.
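Such a dummy program is only a few lines with aya; the following is a minimal sketch, not necessarily the exact contents of that image:

```rust
#![no_std]
#![no_main]

use aya_ebpf::{bindings::xdp_action, macros::xdp, programs::XdpContext};

// Attached to the peer interface of the veth pair so that packets sent with
// XDP_TX get an sk_buff allocated before travelling up to the Docker bridge.
#[xdp]
pub fn xdp_pass(_ctx: XdpContext) -> u32 {
    xdp_action::XDP_PASS
}

#[panic_handler]
fn panic(_info: &core::panic::PanicInfo) -> ! {
    loop {}
}
```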
Related: #10138
Related: #10260
In order to support cross-stack relaying, we need to know which source IP
to write outgoing packets from. We can learn this by simply recording the
destination IP address of incoming packets to our XDP program.
A separate cache is used per IP stack in order to be a bit more cache-line
friendly and to avoid contention when only one IP stack's lookup is needed.
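A minimal sketch of what such per-stack caches could look like with aya (map names, key types and sizes are illustrative):

```rust
use aya_ebpf::{macros::map, maps::HashMap};

// Illustrative only: one map per IP stack so that a lookup for one stack
// never touches the other stack's cache lines. Keys are the interface IPs
// learned from the destination address of incoming packets; the value is
// an unused marker.
#[map]
static LEARNED_IPS_V4: HashMap<[u8; 4], u8> = HashMap::with_max_entries(16, 0);

#[map]
static LEARNED_IPS_V6: HashMap<[u8; 16], u8> = HashMap::with_max_entries(16, 0);
```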
Related: #10192
Whilst developing the eBPF module for the relay, I needed to manually
add padding within the key and value structs used in the maps in order
for the kernel to be able to correctly retrieve the data.
For some reason, this no longer seems to be necessary: the integration
test now passes without the padding.
Being able to remove the padding drastically reduces the size of these
maps for the current number of entries that we allow. This brings the
overall memory usage of the relay down.
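As a purely hypothetical illustration (struct and field names are made up), the manual padding looked roughly like this; removing the trailing padding field shrinks every map entry accordingly:

```rust
// Hypothetical example of a map key with manual padding; dropping
// `_padding` halves the size of this particular entry.
#[repr(C)]
struct ChannelKeyV4 {
    client_ip: [u8; 4],
    client_port: [u8; 2],
    channel_number: [u8; 2],
    _padding: [u8; 8], // previously required for the kernel to read the key correctly
}
```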
Resolves: #8682
Any communication between user-space and the eBPF kernel happens via
maps. The keys and values in these maps are serialised to bytes, meaning
the endianness of how these values are encoded matters!
When debugging why the eBPF kernels were not relaying as much traffic as we
expected, I noticed that only very small packets were getting relayed. In
particular, only packets encoded as channel-data messages were getting
unwrapped correctly; the reverse direction didn't happen at all.
Turning the log-level up to TRACE did reveal that we do in fact see
these packets but they don't get handled.
Here is the relevant section that handles these packets:
74ccf8e0b2/rust/relay/ebpf-turn-router/src/main.rs (L127-L151)
We can see the `trace!` log in the logs and we know that it should be
handled by the first `if`. But for some reason, it isn't.
x86 systems like the machines running in GCP are typically
little-endian. Network-byte ordering is big-endian. My current theory is
that we are comparing the port range with the wrong endianness and
therefore this branch never gets hit, causing the relaying to fall back
to user space.
By storing the fields within `Config` in byte-arrays, we can take
explicit control over which endianness is used to store these fields.
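A minimal sketch of the idea, with made-up field names, using the allocation port range mentioned above as an example:

```rust
// Illustrative sketch: the fields are stored as big-endian byte arrays so
// there is no ambiguity about their layout when user-space writes the
// struct into the map and the eBPF program reads it back.
#[repr(C)]
#[derive(Clone, Copy)]
pub struct Config {
    lowest_allocation_port: [u8; 2],
    highest_allocation_port: [u8; 2],
}

impl Config {
    pub fn new(lowest: u16, highest: u16) -> Self {
        Self {
            lowest_allocation_port: lowest.to_be_bytes(),
            highest_allocation_port: highest.to_be_bytes(),
        }
    }

    pub fn lowest_allocation_port(&self) -> u16 {
        u16::from_be_bytes(self.lowest_allocation_port)
    }

    pub fn highest_allocation_port(&self) -> u16 {
        u16::from_be_bytes(self.highest_allocation_port)
    }
}
```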
The original idea of this feature flag was that we could easily disable
the eBPF router in case it causes issues in production. However, toggling
it on and off does not seem to work reliably: without an explicit toggle of
the feature-flag, the eBPF program doesn't seem to be loaded correctly.
This uncertainty makes me not trust the metrics that we are seeing because
we don't know whether all relays really are using the eBPF router to relay
TURN traffic.
In order to draw truthful conclusions as to how much traffic we are
relaying via eBPF, this patch removes the feature flag again. As of
#8656, we can disable the eBPF program by not setting the
`EBPF_OFFLOADING` env variable. This requires a re-deploy / restart of
relays to take effect, which isn't quite as fast as toggling a feature
flag but is much more reliable and easier to maintain.
This PR implements a feature-flag in PostHog that we can use to toggle
the use of the eBPF data plane at runtime. At every tick of the
event-loop, the relay will compare the (cached) configuration of the
eBPF program with the (cached) value of the feature-flag. If they
differ, the eBPF configuration is updated and, upon the next packet, the
eBPF program will act accordingly.
Feature-flags are re-evaluated every 5 minutes, meaning there is some
delay until this gets applied.
The default value of all our feature-flags is `false`, meaning if
there is some problem with evaluating them, we'd turn the eBPF data
plane off. Performing routing in userspace is slower but it is a safer
default.
Resolves: #8548
Perf events are designed to be an extremely efficient way of
transferring data from an eBPF kernel to the user-space program. In
order to monitor how much traffic we are actually relaying via eBPF, we
introduce a dedicated `STATS` map that is a `PerfEventArray`.
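With aya, the map and the emitting side could look roughly like this (the event struct and names are illustrative):

```rust
use aya_ebpf::{macros::map, maps::PerfEventArray, programs::XdpContext};

/// Illustrative event payload; the real struct may carry more fields.
#[repr(C)]
#[derive(Clone, Copy)]
struct RelayedBytes {
    bytes: u64,
}

#[map]
static STATS: PerfEventArray<RelayedBytes> = PerfEventArray::new(0);

fn report_relayed(ctx: &XdpContext, bytes: u64) {
    // One event per relayed packet; user-space aggregates these into the
    // `data_relayed_ebpf_bytes` OTEL counter.
    STATS.output(ctx, &RelayedBytes { bytes }, 0);
}
```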
The events from that array are read asynchronously in user-space and fed
into our OTEL metrics. They will show up in our Google Cloud metrics as
`data_relayed_ebpf_bytes`. We already have a metric for the total
relayed bytes. That counter is renamed to `data_relayed_userspace_bytes`
so we can clearly differentiate the two.
This fills in the boilerplate for handling IPv6 packets in the eBPF
code. Unfortunately, we cannot add an integration test for this because
the IPv6 header doesn't have a checksum and therefore doesn't allow the
UDP checksum to be set to 0. Linux (and I'd assume other OSs too) offloads
UDP checksumming to the NIC, yet packets on the loopback interface never
reach the NIC, so our eBPF code sees only a partial checksum and thus
updates the checksum incorrectly.
Related: #7518
Related: #8502
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This PR implements the "reverse path" of handling TURN traffic, i.e. UDP
datagrams that arrive on an allocation port and need to be wrapped in a
channel-data message to be sent to the TURN client.
In order to achieve that, I had to rewrite most of the TURN code to not
use the `etherparse` crate. I couldn't quite figure out the details but
the eBPF verifier rejected my code in mysterious ways that I didn't
understand. Commenting out random code-paths seemed to make it happy but
all code-paths combined caused an error. Eventually, I decided that we
simply have to use fewer abstractions to implement the same logic.
All the "parsing" code is now using types inspired by `network-types`.
The only modification here is that we use byte-arrays within our structs
in order to directly receive them in big-endian ordering.
`network-types` uses `u16`s and `u32`s which get interpreted as
little-endian on x86. Instead of converting back and forth between
endiannesses, it is much simpler to construct those values with the right
endianness where we need them. I opened an issue with upstream which, if
accepted, will allow us to remove our own structs and instead depend on
upstream again.
I also had to aggressively add `#[inline(always)]` to several functions,
otherwise the compiler would not optimise away our function calls,
causing the linker and / or eBPF verifier to fail.
This PR also fixes numerous bugs that I've found in the already existing
eBPF code. The number of bugs makes me question how this has been
working so far at all!
- We did not swap the Ethernet source and destination MAC addresses when
re-routing the packet. The integration-test didn't catch this because it
only operates on the loopback interface. Further testing on staging
should allow us to confirm that this is indeed working now.
- The UDP checksum update did not incorporate the new src and dst port.
The integration-test didn't catch that because it has UDP checksumming
disabled. We need to have that disabled in the test because UDP
checksumming is typically offloaded to the NIC and packets on the
loopback interface never leave the device.
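The fix for the first bug essentially amounts to swapping the two MAC addresses in the Ethernet header before transmitting the packet; a minimal sketch, assuming a byte-array based header struct as described above:

```rust
/// Simplified Ethernet header with byte-array fields (illustrative).
#[repr(C)]
struct EthHdr {
    dst_addr: [u8; 6],
    src_addr: [u8; 6],
    ether_type: [u8; 2],
}

fn swap_mac_addresses(eth: &mut EthHdr) {
    // When bouncing the packet back out with XDP_TX, the old source becomes
    // the new destination and vice versa.
    core::mem::swap(&mut eth.dst_addr, &mut eth.src_addr);
}
```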
Related: https://github.com/vadorovsky/network-types/issues/32.
Related: #7518
## Abstract
This pull-request implements the first stage of off-loading routing of
TURN data channel messages to the kernel via an eBPF XDP program. In
particular, the eBPF kernel implemented here **only** handles the
decapsulation of IPv4 data channel messages into their embedded UDP
payload. Implementation of other data paths, such as the receiving of
UDP traffic on an allocation and wrapping it in a TURN channel data
message, is deferred to a later point for reasons explained further down.
As it stands, this PR implements the bare minimum for us to start
experimenting and benefiting from eBPF. It is already massive as it is
due to the infrastructure required for actually doing this. Let's dive
into it!
## A refresher on TURN channel-data messages
TURN specifies a channel-data message for relaying data between two
peers. A channel data message has a fixed 4-byte header:
- The first two bytes specify the channel number
- The next two bytes specify the length of the encapsulated payload
Like all TURN traffic, channel data messages run over UDP by default,
meaning this header sits at the very front of the UDP payload. This will
be important later.
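As a sketch, the header can be modelled as a plain struct with big-endian byte arrays (field names are illustrative):

```rust
/// The 4-byte TURN channel-data header, stored as big-endian byte arrays
/// so the in-memory layout matches the wire format.
#[repr(C)]
struct ChannelDataHeader {
    number: [u8; 2], // channel number
    length: [u8; 2], // length of the encapsulated payload in bytes
}

impl ChannelDataHeader {
    fn number(&self) -> u16 {
        u16::from_be_bytes(self.number)
    }

    fn length(&self) -> u16 {
        u16::from_be_bytes(self.length)
    }
}
```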
After making an allocation with a TURN server (i.e. reserving a port on
the TURN server's interfaces), a TURN client can bind channels on that
allocation. As such, channel numbers are scoped to a client's
allocation. Channel numbers are allocated by the client within a given
range (0x4000 - 0x4FFF). When binding a channel, the client specifies
the peer address that they'd like the data sent on the channel to be
forwarded to.
Given this setup, when a TURN server receives a channel data message, it
first looks at the sender's IP + port to infer the allocation (a client
can only ever have 1 allocation at a time). Within that allocation, the
server then looks for the channel number and retrieves the target socket
address from that. The allocation itself is a port on the relay's
interface. With that, we can now "unpack" the payload of the channel
data message and rewrite it to the new receiver:
- The new source IP can be set from the old dst IP (when operating in
user-space mode this is irrelevant because we are working with the
socket API).
- The new source port is the client's allocation.
- The new destination IP is taken from the mapping looked up via the
channel number.
- The new destination port is taken from the same mapping.
Last but not least, all that is left is removing the channel data header
from the UDP payload and we can send out the packet. In other words, we
need to cut off the first 4 bytes of the UDP payload.
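Putting that together, the rewrite boils down to something like the following sketch (simplified header structs and hypothetical names, not the actual implementation):

```rust
// Simplified header structs for illustration only; real headers have more fields.
#[repr(C)]
struct Ipv4Hdr {
    src_addr: [u8; 4],
    dst_addr: [u8; 4],
}

#[repr(C)]
struct UdpHdr {
    source: [u8; 2],
    dest: [u8; 2],
}

/// Hypothetical mapping value looked up via (client socket, channel number).
struct ChannelBinding {
    allocation_port: u16,
    peer_ip: [u8; 4],
    peer_port: u16,
}

fn rewrite(binding: &ChannelBinding, ip: &mut Ipv4Hdr, udp: &mut UdpHdr) {
    ip.src_addr = ip.dst_addr;                           // new src = old dst (the relay)
    ip.dst_addr = binding.peer_ip;                       // from the channel mapping
    udp.source = binding.allocation_port.to_be_bytes();  // the client's allocation port
    udp.dest = binding.peer_port.to_be_bytes();          // from the channel mapping
    // Finally, the 4-byte channel-data header is cut off the front of the
    // UDP payload (and the length/checksum fields are updated accordingly).
}
```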
## User-space relaying
At present, we implement the above flow in user-space. This is tricky to
do because we need to bind _many_ sockets, one for each possible
allocation port (of which there can be 16383). The actual work to be
done on these packets is also extremely minimal. All we do is cut off
(or add on) the data-channel header. Benchmarks show that we spend
pretty much all of our time copying data between user-space and
kernel-space. Cutting this out should give us a massive increase in
performance.
## Implementing an eBPF XDP TURN router
eBPF has been shown to be a very efficient way of speeding up a TURN
server [0]. After many failed experiments (e.g. using TC instead of XDP)
and countless rabbit-holes, we have also arrived at the design
documented within the paper. Most notably:
- The eBPF program is entirely optional. We try to load it on startup,
but if that fails, we will simply use the user-space mode.
- Retaining the user-space mode is also important because under certain
circumstances, the eBPF kernel needs to pass on the packet, for example,
when receiving IPv4 packets with options. Those make the header
dynamically-sized which makes further processing difficult because the
eBPF verifier disallows indexing into the packet with data derived from
the packet itself.
- In order to add/remove the channel-data header, we shift the packet
headers backwards / forwards and leave the payload in place as the
packet headers are constant in size and can thus easily and cheaply be
copied out.
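For the decapsulation path, that shifting amounts to moving the packet start forward by the 4-byte channel-data header; a sketch using the `bpf_xdp_adjust_head` helper as exposed by aya (error handling and the actual header rewriting are elided):

```rust
use aya_ebpf::{helpers::bpf_xdp_adjust_head, programs::XdpContext};

/// Sketch only: the Ethernet/IP/UDP headers are copied out beforehand,
/// the packet start is moved forward by 4 bytes, and the headers are then
/// written back at the new start. The payload itself never moves.
fn strip_channel_data_header(ctx: &XdpContext) -> Result<(), ()> {
    // A positive delta moves the packet start forward, i.e. drops 4 bytes
    // at the head of the packet.
    if unsafe { bpf_xdp_adjust_head(ctx.ctx, 4) } != 0 {
        return Err(());
    }
    // ...write the previously copied headers back at the new packet start...
    Ok(())
}
```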
In order to perform the relaying flow explained above, we introduce maps
that are shared with user-space. These maps go from a tuple of
(client-socket, channel-number) to a tuple of (allocation-port,
peer-socket) and thus give us all the data necessary to rewrite the
packet.
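A sketch of what such a map could look like on the eBPF side (struct layout, names and sizes are illustrative):

```rust
use aya_ebpf::{macros::map, maps::HashMap};

/// Illustrative key: identifies a channel on a client's allocation.
#[repr(C)]
#[derive(Clone, Copy)]
struct ClientAndChannel {
    client_ip: [u8; 4],
    client_port: [u8; 2],
    channel_number: [u8; 2],
}

/// Illustrative value: where to send the unwrapped payload from.
#[repr(C)]
#[derive(Clone, Copy)]
struct PortAndPeer {
    allocation_port: [u8; 2],
    peer_ip: [u8; 4],
    peer_port: [u8; 2],
}

// User-space keeps this map in sync with the relay's channel bindings.
#[map]
static CHANNEL_TO_PEER_V4: HashMap<ClientAndChannel, PortAndPeer> =
    HashMap::with_max_entries(65536, 0);
```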
## Integration with our relay
Last but not least, to actually integrate the eBPF kernel with our
relay, we need to extend the `Server` with two more events so we can
learn when channel bindings are created and when they expire. Using
these events, we can then update the eBPF maps accordingly and therefore
influence the routing behaviour in the kernel.
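As an illustration (the names are made up), the two events could look something like this:

```rust
use std::net::SocketAddr;

/// Hypothetical events emitted by the `Server` so that the event-loop can
/// keep the eBPF maps in sync with the relay's channel bindings.
enum ChannelEvent {
    Bound {
        client: SocketAddr,
        channel: u16,
        allocation_port: u16,
        peer: SocketAddr,
    },
    Expired {
        client: SocketAddr,
        channel: u16,
    },
}
```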
## Scope
What is implemented here is only one of several possible data paths.
Implementing the others isn't conceptually difficult but it does
increase the scope. Landing something that already works allows us to
gain experience running it in staging (and possibly production).
Additionally, I've hit some issues with the eBPF verifier when adding
more codepaths to the kernel. I expect those to be possible to resolve
given sufficient debugging but I'd like to do so after merging this.
---
Depends-On: #8506
Depends-On: #8507
Depends-On: #8500
Resolves: #8501
[0]: https://dl.acm.org/doi/pdf/10.1145/3609021.3609296