`jemalloc` is a modern allocator that is designed for multi-threaded
systems and can better handle smaller allocations that may otherwise
fragment the heap. Firezone's components, especially Relays and
Gateways, are intended to operate with long uptimes and therefore need
to handle memory efficiently.
Whilst developing the eBPF module for the relay, I needed to manually
add padding within the key and value structs used in the maps in order
for the kernel to be able to correctly retrieve the data.
For some reason, this no longer seems to be necessary: the integration
test now passes without it.
Being able to remove the padding drastically reduces the size of these
maps for the current number of entries that we allow. This brings the
overall memory usage of the relay down.
Resolves: #8682
Now that we have figured out what the problem was with the eBPF kernel
not routing certain packets, we can undo the manual implementation of
the allocation range checking again and use the more concise
`RangeInclusive::contains`.
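For reference, a minimal sketch of the check using `RangeInclusive::contains`; the config field names are assumptions:

```rust
/// Sketch: check whether a destination port falls into the allocation range.
fn is_allocation_port(port: u16, lowest_port: u16, highest_port: u16) -> bool {
    // Equivalent to `port >= lowest_port && port <= highest_port`.
    (lowest_port..=highest_port).contains(&port)
}
```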
Related: #8809
Related: #8807
The STUN message encoder & decoder from `stun_codec` are stateful
operations. However, they only operate on one datagram at a time. If
encoding or decoding fails, their internal state is corrupted and must
be discarded. At present, this doesn't happen which leads to further
failures down the line because new datagrams coming in cannot be
correctly decoded.
To fix this, we scope the stateful nature of these encoders and decoders
to their respective functions.
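To illustrate the idea for the decoder side (the encoder is handled the same way), a minimal sketch assuming `stun_codec`'s `MessageDecoder` and `bytecodec`'s `DecodeExt`:

```rust
use bytecodec::DecodeExt;
use stun_codec::rfc5389::Attribute;
use stun_codec::{Message, MessageDecoder};

/// Sketch: decode a single datagram with a decoder that is scoped to this
/// function. If decoding fails, the (possibly corrupted) decoder state is
/// simply dropped and cannot affect the next datagram.
fn decode_stun(datagram: &[u8]) -> Option<Message<Attribute>> {
    let mut decoder = MessageDecoder::<Attribute>::new();

    decoder
        .decode_from_bytes(datagram)
        .ok()? // codec error: discard the decoder together with its state
        .ok() // broken STUN message: also discard
}
```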
Resolves: #8808
In #8650, we originally added a feature-flag for toggling the eBPF TURN
router on and off at runtime. This later got removed again in #8681.
What remained was a "caching system" of the config that the eBPF kernel
and user space share with each other.
This config was initialised to the default configuration. If the
to-be-set config was the same as the current config, the config would
not actually apply to the array that was shared with the eBPF kernel.
At the time, we assumed that, if the config was not set in the kernel,
the lookup in the array would yield `None` and we would fall back to the
`Default` implementation of `Config`. This assumption was wrong. It
appears that look-ups in the array always yield an element: all zeros.
Initialising our config with all zeros yields the following:
- an allocation port range of 0 to 0
- UDP checksumming turned off
Of course, if this range is not initialised correctly, we can never
actually route packets arriving on allocation ports and with UDP
checksumming turned off, all packets routed the other way will have an
invalid checksum and therefore be dropped by the receiving host.
Our integration test did not catch this because it purposely disables
UDP checksumming. That meant that the "caching" check in the
`ebpf::Program` did not trigger and we actually did set a `Config` in
the array, therefore initialising the allocation port range correctly
and allowing the packet to be routed.
To fix this, we remove this caching check again which means every
`Config` we set on the eBPF program actually gets copied to the shared
array. Originally, this caching check was introduced to avoid a syscall
on every event-loop iteration as part of checking the feature-flag. Now
that the feature-flag has been removed, we don't need to have this cache
anymore.
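To illustrate, a minimal sketch of unconditionally writing the config into the shared array using aya's `Array::set`; the map name `CONFIG`, index `0`, and the stand-in `Config` struct are assumptions:

```rust
use anyhow::Context as _;
use aya::maps::Array;
use aya::{Ebpf, Pod};

/// Minimal stand-in for the real config; the actual fields live in the crate.
#[derive(Clone, Copy)]
#[repr(C)]
pub struct Config {
    pub lowest_allocation_port: u16,
    pub highest_allocation_port: u16,
    pub udp_checksum_enabled: u8,
    pub _pad: [u8; 3],
}

// Safety: `Config` is `#[repr(C)]`, `Copy` and only contains plain bytes.
unsafe impl Pod for Config {}

/// Sketch: always copy the current config into the array shared with the
/// eBPF program instead of comparing it against a userspace cache first.
fn set_config(ebpf: &mut Ebpf, config: Config) -> anyhow::Result<()> {
    let mut array: Array<_, Config> =
        Array::try_from(ebpf.map_mut("CONFIG").context("no CONFIG map")?)?;
    array.set(0, config, 0)?;

    Ok(())
}
```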
I suspect that something is wrong with the check that a port is
indeed within that range. Thus, we now implement this ourselves with
two simple conditions.
None of the moved error cases should happen very often, so it is fine
to log them on debug.
- `Error::NotTurn` only happens if we receive a UDP packet that is
neither STUN traffic (port 3478) nor within the allocation-port range. I
suspect there is a bug here that I am aiming to fix in #8804.
- `Error::NotAChannelDataMessage` will happen for all STUN control
traffic, like channel bindings, allocation requests, etc. Those only
happen occasionally, so they won't spam the logs too much.
- `Ipv4PacketWithOptions` should basically not happen at all because -
as far as I know - IPv4 options aren't used a lot.
In any case, when debugging, it is useful to see when we do hit these
cases to know why a packet was offloaded to user space.
Any communication between user-space and the eBPF kernel happens via
maps. The keys and values in these maps are serialised to bytes, meaning
the endianness of how these values are encoded matters!
When debugging why the eBPF kernels were not relaying as much traffic as
we expected, I noticed that only very small packets were getting
relayed. In particular, only packets encoded as channel-data packets
were getting unwrapped correctly. The reverse didn't happen at all.
Turning the log-level up to TRACE did reveal that we do in fact see
these packets but they don't get handled.
Here is the relevant section that handles these packets:
74ccf8e0b2/rust/relay/ebpf-turn-router/src/main.rs (L127-L151)
We can see the `trace!` log in the logs and we know that the packet
should be handled by the first `if`. But for some reason, it isn't.
x86 systems like the machines running in GCP are typically
little-endian. Network-byte ordering is big-endian. My current theory is
that we are comparing the port range with the wrong endianness and
therefore, this branch never gets hit, causing the relaying to be
offloaded to user space.
By storing the fields within `Config` in byte-arrays, we can take
explicit control over which endianness is used to store these fields.
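A minimal sketch of what this could look like; the field names are illustrative and the accessors make the chosen endianness explicit for both userspace and the eBPF program:

```rust
/// Sketch of a config struct whose fields are stored as byte arrays so that
/// the representation shared via the map has one explicit endianness,
/// regardless of which side reads it. Field names are illustrative.
#[repr(C)]
#[derive(Clone, Copy, Default)]
pub struct Config {
    lowest_allocation_port: [u8; 2],
    highest_allocation_port: [u8; 2],
    udp_checksum_enabled: u8,
}

impl Config {
    pub fn new(lowest: u16, highest: u16, udp_checksum: bool) -> Self {
        Self {
            lowest_allocation_port: lowest.to_be_bytes(),
            highest_allocation_port: highest.to_be_bytes(),
            udp_checksum_enabled: udp_checksum as u8,
        }
    }

    pub fn lowest_allocation_port(&self) -> u16 {
        u16::from_be_bytes(self.lowest_allocation_port)
    }

    pub fn highest_allocation_port(&self) -> u16 {
        u16::from_be_bytes(self.highest_allocation_port)
    }
}
```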
When debugging issues with the relays on GCP, it is useful to be able to
change the log-level at runtime without having to redeploy them. We can
achieve this by running an additional HTTP server as part of the relay
that responds to HTTP POST requests containing new logging directives.
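A minimal sketch of such an endpoint, assuming `axum` and `tracing-subscriber`'s reload layer; the route, port and types are illustrative:

```rust
use axum::{http::StatusCode, routing::post, Router};
use tracing_subscriber::{reload, EnvFilter};

/// Sketch: a tiny HTTP server that accepts new logging directives via POST.
/// The bind address and route are assumptions for illustration.
async fn serve_log_control(
    handle: reload::Handle<EnvFilter, tracing_subscriber::Registry>,
) -> std::io::Result<()> {
    let app = Router::new().route(
        "/log-directives",
        post(move |body: String| async move {
            match EnvFilter::try_new(body.trim()) {
                Ok(filter) => match handle.reload(filter) {
                    Ok(()) => StatusCode::OK,
                    Err(_) => StatusCode::INTERNAL_SERVER_ERROR,
                },
                Err(_) => StatusCode::BAD_REQUEST,
            }
        }),
    );

    let listener = tokio::net::TcpListener::bind("127.0.0.1:8080").await?;
    axum::serve(listener, app).await
}
```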
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
The original idea of this feature flag was that we can easily disable
the eBPF router in case it causes issues in production. However,
something isn't working reliably when toggling it on / off.
Without an explicit toggle of the feature-flag, the eBPF program doesn't
seem to be loaded correctly. This uncertainty makes me not trust the
metrics that we are seeing because we don't know whether all relays are
really using the eBPF router to relay TURN traffic.
In order to draw truthful conclusions as to how much traffic we are
relaying via eBPF, this patch removes the feature flag again. As of
#8656, we can disable the eBPF program by not setting the
`EBPF_OFFLOADING` env variable. This requires a re-deploy / restart of
relays to take effect, which isn't quite as fast as toggling a feature
flag but is much more reliable and easier to maintain.
Our feature-flags are currently coupled to our Firezone ID. Without a
Firezone ID, we cannot evaluate feature flags. In order to be able to
use the feature flags to enable / disable the eBPF TURN router, we set a
random UUID as the Firezone ID upon startup of the relay.
Without this, the eBPF router currently gets disabled instantly on
startup because the default of the feature flag is false and we don't
re-evaluate it later due to the missing ID.
It happened to me a bunch of times during testing that I'd forget to set
the right interface onto which the eBPF kernel should be loaded and
would then wonder why it didn't work. Defaulting to `eth0` wasn't a very smart
decision because it means users cannot disable the eBPF kernel at all
(other than via the feature-flag).
It makes more sense to default to not loading the program at all AND
hard-fail if we are requested to load it but cannot. This allows us to
catch configuration errors early.
This PR implements a feature-flag in PostHog that we can use to toggle
the use of the eBPF data plane at runtime. At every tick of the
event-loop, the relay will compare the (cached) configuration of the
eBPF program with the (cached) value of the feature-flag. If they
differ, the flag will be updated and upon the next packet, the eBPF
program will act accordingly.
Feature-flags are re-evaluated every 5 minutes, meaning there is some
delay until this gets applied.
The default value of all our feature-flags is `false`, meaning if
there is some problem with evaluating them, we'd turn the eBPF data
plane off. Performing routing in userspace is slower but it is a safer
default.
Resolves: #8548
The name `slice_mut_at` came from a time when this function actually
returned a slice of bytes. It has since been refactored to return a
mutable reference to a type T that gets set by the caller. Thus,
`ref_mut_at` is a much more fitting name.
The default here is 2 which is nowhere near enough of a batch-size for
us to read all perf events generated by the kernel when it is actually
relaying data via eBPF (we generate 1 perf event per relayed packet). If
we don't read them fast enough, the kernel has to drop some, meaning we
skew our metrics as to how much data we've relayed via eBPF.
This has been tested in my local setup and I've seen north of 500 events
being read in a single batch now.
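For illustration, a sketch of reading a larger batch of perf events with aya's `AsyncPerfEventArray`; the batch size and per-event buffer size are assumptions:

```rust
use aya::maps::perf::AsyncPerfEventArrayBuffer;
use aya::maps::MapData;
use bytes::BytesMut;

/// Sketch: read perf events from one CPU's ring with a large batch of
/// buffers instead of the small default. The batch size of 1024 and the
/// per-event size of 64 bytes are assumptions for illustration.
async fn read_stats(mut ring: AsyncPerfEventArrayBuffer<MapData>) {
    let mut buffers = (0..1024)
        .map(|_| BytesMut::with_capacity(64))
        .collect::<Vec<_>>();

    loop {
        let Ok(events) = ring.read_events(&mut buffers).await else {
            continue;
        };

        // `events.read` tells us how many buffers were filled in this batch,
        // `events.lost` how many events the kernel had to drop.
        for buf in &buffers[..events.read] {
            let _ = buf; // decode the event and feed it into metrics here
        }
    }
}
```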
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Instead of passing just a 4-byte array, we can pass a `CdHdr` struct
that we have already defined. This is more type-safe and correctly
captures the invariant of the order of fields in the header.
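For illustration, a sketch of what such a header struct could look like; the field names are assumptions:

```rust
/// Sketch of a channel-data header as a `#[repr(C)]` struct; storing the
/// fields as byte arrays keeps them in network byte order and fixes their
/// relative order in the type itself.
#[repr(C)]
#[derive(Clone, Copy)]
pub struct CdHdr {
    pub number: [u8; 2], // channel number, big-endian
    pub length: [u8; 2], // length of the encapsulated payload, big-endian
}

impl CdHdr {
    pub const LEN: usize = core::mem::size_of::<CdHdr>(); // 4 bytes

    pub fn new(channel: u16, payload_len: u16) -> Self {
        Self {
            number: channel.to_be_bytes(),
            length: payload_len.to_be_bytes(),
        }
    }
}
```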
When we decide to drop a packet, it means something is seriously off and
we should look into it. These warnings will propagate to userspace and
trigger a warning that gets reported to Sentry (if telemetry is
enabled).
Similar to IPv4, IPv6 and UDP, this adds a debug log describing how we
are updating the Ethernet header of a packet.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
It appears that the gVNIC driver in Google Cloud doesn't give us enough
headroom to use `bpf_xdp_adjust_head` with a delta of 4 bytes.
Currently, we are loading the XDP program with default flags. By loading
it explicitly in SKB mode, we should be able to bypass these driver
limitations at the expense of some performance (which should still be
better than userspace!).
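A minimal sketch of attaching in SKB mode with aya; the program name and interface are assumptions:

```rust
use aya::programs::{Xdp, XdpFlags};
use aya::Ebpf;

/// Sketch: attach the XDP program in SKB (generic) mode to bypass driver
/// headroom limitations. The program name "turn_router" is an assumption.
fn attach_skb_mode(ebpf: &mut Ebpf, interface: &str) -> anyhow::Result<()> {
    let program: &mut Xdp = ebpf
        .program_mut("turn_router")
        .ok_or_else(|| anyhow::anyhow!("no such program"))?
        .try_into()?;

    program.load()?;
    program.attach(interface, XdpFlags::SKB_MODE)?;

    Ok(())
}
```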
Related:
https://github.com/GoogleCloudPlatform/compute-virtual-ethernet-linux/issues/70
At present, we only check whether the various helper functions succeed
and bail out if they fail. What we don't learn is what the
actual return code is. To further help with debugging, we include the
return code in the error so we can print it later.
We can't use the formatting macro within the `write` function so we need
to stitch the message together ourselves.
In case any of the xdp store/adjust/load functions fail, we need to drop
the packet. By the time we get to these functions, we have already
overwritten the Ethernet, IP and UDP headers and would only need to copy
them either forwards or backwards to get rid of or add the channel data
header. Forwarding these packets to userspace is pointless.
Currently, the relay's eBPF module only supports routing from IPv4 to
IPv4 as well as IPv6 to IPv6. In general, TURN servers can also route
from IPv4 to IPv6 and vice versa. Our userspace routing supports that
but doing the same in the eBPF code is a bit more involved. We'd need to
move around the headers a bit more (IPv4 and IPv6 headers are different
in size), as well as configure the respective "source" address for each
interface. Currently, we simply take the destination address of the
incoming packet as the new source address. When routing across IP
versions, that doesn't work.
To gain some more insight into how often this happens, we add these
additional maps and populate them. This allows us to emit a dedicated
log message whenever we encounter a packet for such a mapping.
First, we always check for an entry in the maps that we can handle.
If we can't find one, we check the other map and special-case the error.
Otherwise, we fall back to the previous "no entry" error. We shouldn't
really see these "no entry" errors anymore now, unless someone starts
probing our relays for active channels.
At present, the eBPF code assumes that the incoming packet needs to be
sent back to the same MAC address that it came from. This is only true
if there is at least one IP layer hop in-between the relay and the
Client / Gateway. When setting up Firezone in my local LAN to debug the
eBPF code, all components are within the same subnet and thus can send
packets directly to each other, without having to go through the router.
In such a scenario, simply swapping the Ethernet addresses is not
correct.
As part of witnessing traffic coming in via the network, we can build up
a mapping of IP to MAC address. This mapping can then later be used to
set the correct MAC address for a given destination IP. All of this
functions entirely without interaction from userspace.
Unless you are running in a LAN environment, most if not all IPs will
point to the same MAC address (the one of the next IP layer hop, i.e.
the router). For the very first packet that we want to relay, we will
not have a MAC address for the destination IP. This doesn't matter
though; we simply pass that packet up to userspace and handle it there.
Pretty much all communication on the Internet is bi-directional because
you need some kind of ACK. As soon as we receive the first ACK, e.g. the
response to a binding request, we will learn the MAC address for the
given target IP and the eBPF router can kick in for all packets going
forward.
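A minimal sketch of such a learning map on the eBPF side, assuming aya-ebpf's `LruHashMap`; the map name, key/value layout and capacity are illustrative:

```rust
use aya_ebpf::macros::map;
use aya_ebpf::maps::LruHashMap;

/// Sketch: learn the MAC address behind each IPv4 address from packets we
/// see, and use it later when rewriting the Ethernet header. Map name and
/// capacity are assumptions for illustration.
#[map]
static IPV4_TO_MAC: LruHashMap<[u8; 4], [u8; 6]> =
    LruHashMap::with_max_entries(65536, 0);

/// Remember which MAC address a given source IP came from.
fn learn_mac(src_ip: [u8; 4], src_mac: [u8; 6]) {
    // Ignore failures; worst case the packet is handled in userspace.
    let _ = IPV4_TO_MAC.insert(&src_ip, &src_mac, 0);
}

/// Look up the MAC address for a destination IP, if we have seen it before.
fn lookup_mac(dst_ip: [u8; 4]) -> Option<[u8; 6]> {
    // Safety: the value is a plain byte array and is copied out immediately.
    unsafe { IPV4_TO_MAC.get(&dst_ip) }.copied()
}
```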
Related: #7518
The UDP checksum also includes the entire payload. Removing and adding
bytes to the payload therefore needs to be reflected in the checksum
update that we perform. When we add the channel-data header, its bytes
need to be added to the checksum, and when we remove it, they need to be
subtracted again.
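A simplified sketch of the ones'-complement arithmetic involved; it deliberately ignores that the UDP length field changes as well and must be folded in the same way:

```rust
/// Sketch: incrementally update a UDP checksum when 16-bit words are added
/// to or removed from the datagram (RFC 1624 style).
fn checksum_add(checksum: u16, added_words: &[u16]) -> u16 {
    let mut sum = (!checksum) as u32;

    for &word in added_words {
        sum += word as u32;
    }

    fold(sum)
}

fn checksum_remove(checksum: u16, removed_words: &[u16]) -> u16 {
    let mut sum = (!checksum) as u32;

    for &word in removed_words {
        // Adding the ones' complement of a word subtracts it in
        // ones' complement arithmetic.
        sum += (!word) as u32;
    }

    fold(sum)
}

fn fold(mut sum: u32) -> u16 {
    // Fold the carries back in, then negate to get the new checksum.
    while sum >> 16 != 0 {
        sum = (sum & 0xffff) + (sum >> 16);
    }

    !(sum as u16)
}
```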
Related: #7518
Currently, the eBPF code isn't consistent in how it handles XDP actions.
For some cases, we return errors and then map them to `XDP_PASS` or
`XDP_DROP`. For others, we return `Ok(XDP_PASS)`. This is unnecessarily
hard to understand.
We refactor the eBPF kernel to ALWAYS use `Error`s for all code-paths
that don't end in `XDP_TX`, i.e. when we successfully modified the
packet and want to send it back out.
In addition, we also change the way we log these errors. Not all errors
are equal and most `XDP_PASS` actions don't need to be logged. Those
packets are simply passing through.
Finally, we also introduce new checks in case any calls to the eBPF
helper functions fail.
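A condensed sketch of the resulting pattern; the error variants and the `try_handle` function are illustrative:

```rust
use aya_ebpf::bindings::xdp_action::{XDP_DROP, XDP_PASS, XDP_TX};
use aya_ebpf::programs::XdpContext;

enum Error {
    NotTurn,                  // not for us: pass to the network stack
    NotAChannelDataMessage,   // STUN control traffic: pass to userspace
    XdpAdjustHeadFailed(i64), // headers already mangled: must drop
}

fn handle(ctx: XdpContext) -> u32 {
    match try_handle(&ctx) {
        // Packet was successfully rewritten; send it back out.
        Ok(()) => XDP_TX,
        // Everything else is expressed as an `Error` and mapped to an action.
        Err(Error::NotTurn) | Err(Error::NotAChannelDataMessage) => XDP_PASS,
        Err(Error::XdpAdjustHeadFailed(_)) => XDP_DROP,
    }
}

fn try_handle(_ctx: &XdpContext) -> Result<(), Error> {
    Err(Error::NotTurn) // placeholder for the actual routing logic
}
```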
Related: #7518
At present, the eBPF program would try to pre-allocate around 800MB of
memory for all entries in the maps. This would allow for 1 million
channel mappings. We don't need that many to begin with. Reducing the
max number of channels down to 65536 reduces our memory usage to less
than 100MB.
Related: #7518
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Perf events are designed to be an extremely efficient way of
transferring data from an eBPF kernel to the user-space program. In
order to monitor how much traffic we are actually relaying via eBPF, we
introduce a dedicated `STATS` map that is a `PerfEventArray`.
The events from that array are read asynchronously in user-space and fed
into our OTEL metrics. They will show up in our Google Cloud metrics as
`data_relayed_ebpf_bytes`. We already have a metric for the total
relayed bytes. That counter is renamed to `data_relayed_userspace_bytes`
so we can clearly differentiate the two.
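A minimal sketch of the eBPF side of this, assuming aya-ebpf's `PerfEventArray`; the map name and event layout are illustrative:

```rust
use aya_ebpf::macros::map;
use aya_ebpf::maps::PerfEventArray;
use aya_ebpf::programs::XdpContext;

/// Illustrative event payload: how many bytes we just relayed.
#[repr(C)]
#[derive(Clone, Copy)]
pub struct RelayEvent {
    pub bytes: u32,
}

/// Perf event array shared with userspace; the name is an assumption.
#[map]
static STATS: PerfEventArray<RelayEvent> = PerfEventArray::new(0);

/// Emit one event per relayed packet; userspace reads these asynchronously
/// and feeds them into the `data_relayed_ebpf_bytes` metric.
fn report_relayed(ctx: &XdpContext, bytes: u32) {
    STATS.output(ctx, &RelayEvent { bytes }, 0);
}
```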
This fills in the boilerplate for handling IPv6 packets in the eBPF
code. Unfortunately, we cannot add an integration test for this because
the IPv6 header doesn't have a checksum and thus the UDP checksum is
mandatory and cannot be set to 0. Because Linux (and I'd assume other
OSs too) offloads UDP checksumming to the NIC, yet packets on the
loopback interface never reach the NIC, our eBPF code sees only a
partial checksum and would thus update the checksum incorrectly.
Related: #7518
Related: #8502
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This PR implements the "reverse path" of handling TURN traffic, i.e. UDP
datagrams that arrive on an allocation port and need to be wrapped in a
channel-data message to be sent to the TURN client.
In order to achieve that, I had to rewrite most of the TURN code to not
use the `etherparse` crate. I couldn't quite figure out the details but
the eBPF verifier rejected my code in mysterious ways that I didn't
understand. Commenting out random code-paths seemed to make it happy but
all code-paths combined caused an error. Eventually, I decided that we
simply have to use fewer abstractions to implement the same logic.
All the "parsing" code is now using types inspired by `network-types`.
The only modification here is that we use byte-arrays within our structs
in order to directly receive them in big-endian ordering.
`network-types` uses `u16`s and `u32`s which get interpreted as
little-endian on x86. Instead of converting back and forth between
endiannesses, it is much simpler to construct those values with the
right endianness where we need them. I opened an issue with upstream which
- if accepted - will allow us to remove our own structs and instead
depend on upstream again.
I also had to aggressively add `#[inline(always)]` to several functions,
otherwise the compiler would not optimise away our function calls,
causing the linker and / or eBPF verifier to fail.
This PR also fixes numerous bugs that I've found in the already existing
eBPF code. The number of bugs makes me question how this has been
working so far at all!
- We did not swap the Ethernet source and destination MAC address when
re-routing the packet. The integration-test didn't catch this because it
only operates on the loopback interface. Further testing on staging
should allow us to confirm that this is indeed working now.
- The UDP checksum update did not incorporate the new src and dst port.
The integration-test didn't catch that because it has UDP checksumming
disabled. We need to have that disabled in the test because UDP
checksumming is typically offloaded to the NIC and packets on the
loopback interface never leave the device.
Related: https://github.com/vadorovsky/network-types/issues/32.
Related: #7518
As part of iterating on #8496, the API of `relay::Server` had changed
and I had commented out the regression tests to move quicker. In later
iterations, those API changes were reverted but I forgot to uncomment
them.
As part of working on https://github.com/aya-rs/aya/pull/1228, which I
am depending on here, I had to force-push, which will break CI. Opening
this PR to fix it.
## Abstract
This pull-request implements the first stage of off-loading routing of
TURN data channel messages to the kernel via an eBPF XDP program. In
particular, the eBPF kernel implemented here **only** handles the
decapsulation of IPv4 data channel messages into their embedded UDP
payload. Implementation of other data paths, such as the receiving of
UDP traffic on an allocation and wrapping it in a TURN channel data
message is deferred to a later point for reasons explained further down.
As it stands, this PR implements the bare minimum for us to start
experimenting and benefiting from eBPF. It is already massive as it is
due to the infrastructure required for actually doing this. Let's dive
into it!
## A refresher on TURN channel-data messages
TURN specifies a channel-data message for relaying data between two
peers. A channel data message has a fixed 4-byte header:
- The first two bytes specify the channel number
- The second two bytes specify the length of the encapsulated payload
Like all TURN traffic, channel data messages run over UDP by default,
meaning this header sits at the very front of the UDP payload. This will
be important later.
After making an allocation with a TURN server (i.e. reserving a port on
the TURN server's interfaces), a TURN client can bind channels on that
allocation. As such, channel numbers are scoped to a client's
allocation. Channel numbers are allocated by the client within a given
range (0x4000 - 0x4FFF). When binding a channel, the client specifies
the remote peer's address that it wants data sent on the channel to be
forwarded to.
Given this setup, when a TURN server receives a channel data message, it
first looks at the sender's IP + port to infer the allocation (a client
can only ever have 1 allocation at a time). Within that allocation, the
server then looks for the channel number and retrieves the target socket
address from that. The allocation itself is a port on the relay's
interface. With that, we can now "unpack" the payload of the channel
data message and rewrite it to the new receiver:
- The new source IP can be set from the old dst IP (when operating in
user-space mode this is irrelevant because we are working with the
socket API).
- The new source port is the client's allocation.
- The new destination IP is retrieved from the mapping retrieved via the
channel number.
- The new destination port is retrieved from the mapping retrieved via
the channel number.
Last but not least, all that is left is removing the channel data header
from the UDP payload and we can send out the packet. In other words, we
need to cut off the first 4 bytes of the UDP payload.
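The whole flow, condensed into a small sketch; the lookup function and field names are assumptions:

```rust
/// Condensed sketch of the rewrite, with illustrative types. `lookup` stands
/// in for the (client socket, channel number) -> peer mapping described above.
struct Rewrite {
    new_src_port: u16, // the client's allocation port on the relay
    new_dst_ip: [u8; 4],
    new_dst_port: u16,
}

fn route_channel_data(
    client: ([u8; 4], u16), // the TURN client's IP + port (identifies the allocation)
    channel_number: u16,    // first two bytes of the UDP payload
    lookup: impl Fn(([u8; 4], u16), u16) -> Option<(u16, [u8; 4], u16)>,
) -> Option<Rewrite> {
    // Allocation + channel give us everything needed to rewrite the packet.
    let (allocation_port, peer_ip, peer_port) = lookup(client, channel_number)?;

    Some(Rewrite {
        new_src_port: allocation_port,
        new_dst_ip: peer_ip,
        new_dst_port: peer_port,
    })
    // ...after this, the new source IP is taken from the old destination IP,
    // the 4-byte channel-data header is cut off the payload and the IP/UDP
    // checksums are updated accordingly.
}
```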
## User-space relaying
At present, we implement the above flow in user-space. This is tricky to
do because we need to bind _many_ sockets, one for each possible
allocation port (of which there can be 16383). The actual work to be
done on these packets is also extremely minimal. All we do is cut off
(or add on) the data-channel header. Benchmarks show that we spend
pretty much all of our time copying data between user-space and
kernel-space. Cutting this out should give us a massive increase in
performance.
## Implementing an eBPF XDP TURN router
eBPF has been shown to be a very efficient way of speeding up a TURN
server [0]. After many failed experiments (e.g. using TC instead of XDP)
and countless rabbit-holes, we have also arrived at the design
documented within the paper. Most notably:
- The eBPF program is entirely optional. We try to load it on startup,
but if that fails, we will simply use the user-space mode.
- Retaining the user-space mode is also important because under certain
circumstances, the eBPF kernel needs to pass on the packet, for example,
when receiving IPv4 packets with options. Those make the header
dynamically-sized which makes further processing difficult because the
eBPF verifier disallows indexing into the packet with data derived from
the packet itself.
- In order to add/remove the channel-data header, we shift the packet
headers backwards / forwards and leave the payload in place as the
packet headers are constant in size and can thus easily and cheaply be
copied out.
In order to perform the relaying flow explained above, we introduce maps
that are shared with user-space. These maps go from a tuple of
(client-socket, channel-number) to a tuple of (allocation-port,
peer-socket) and thus give us all the data necessary to rewrite the
packet.
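A minimal sketch of such a map on the eBPF side, assuming aya-ebpf; the names, layouts and capacity are illustrative:

```rust
use aya_ebpf::macros::map;
use aya_ebpf::maps::HashMap;

/// Illustrative key: the TURN client's socket plus the channel number.
#[repr(C)]
#[derive(Clone, Copy)]
pub struct ClientAndChannelV4 {
    pub client_ip: [u8; 4],
    pub client_port: [u8; 2],
    pub channel_number: [u8; 2],
}

/// Illustrative value: the allocation port and the peer's socket.
#[repr(C)]
#[derive(Clone, Copy)]
pub struct PortAndPeerV4 {
    pub allocation_port: [u8; 2],
    pub peer_ip: [u8; 4],
    pub peer_port: [u8; 2],
}

/// Map shared with userspace; name and capacity are assumptions.
#[map]
static CHAN_TO_UDP_V4: HashMap<ClientAndChannelV4, PortAndPeerV4> =
    HashMap::with_max_entries(65536, 0);
```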
## Integration with our relay
Last but not least, to actually integrate the eBPF kernel with our
relay, we need to extend the `Server` with two more events so we can
learn when channel bindings are created and when they expire. Using
these events, we can then update the eBPF maps accordingly and therefore
influence the routing behaviour in the kernel.
## Scope
What is implemented here is only one of several possible data paths.
Implementing the others isn't conceptually difficult but it does
increase the scope. Landing something that already works allows us to
gain experience running it in staging (and possibly production).
Additionally, I've hit some issues with the eBPF verifier when adding
more codepaths to the kernel. I expect those to be possible to resolve
given sufficient debugging but I'd like to do so after merging this.
---
Depends-On: #8506
Depends-On: #8507
Depends-On: #8500
Resolves: #8501
[0]: https://dl.acm.org/doi/pdf/10.1145/3609021.3609296
The WebSocket connection to the portal from within the Clients, Gateways
and Relays may be temporarily interrupted by IO errors. In such cases we
simply reconnect to it. This isn't as much of a problem for Clients and
Gateways. For Relays however, a disconnect can be disruptive for
customers because the portal will send `relays_presence` events to all
Clients and Gateways. Any relayed connection will therefore be
interrupted. See #8177.
Relays run on our own infrastructure and we want to be notified if their
connection flaps.
In order to differentiate between these scenarios, we remove the logging
from within `phoenix-channel` and report these connection hiccups one
layer up. This allows Clients and Gateways to log them on DEBUG whereas
the Relay can log them on WARN.
Related: #8177
Related: #7004
At present, the relay uses a priority in the event-loop that favors
routing traffic. Whenever a task further up in the loop is
`Poll::Ready`, we loop back to the top to continue processing. The issue
with that is that in very busy times, this can lead to starvation in
processing timers and messages from the portal. If we then finally get
to process portal messages, we think that the portal hasn't replied in
some time and proactively cut the connection and reconnect.
As a result, the portal will send `relays_presence` messages to the
clients and gateways which in turn will locally remove the relay. This
breaks relayed connections.
To fix this, instead of immediately traversing to the top of the
event-loop with `continue`, we only set a boolean. This gives each
element of the event-loop a chance to execute, even when a certain
component is very busy.
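A condensed sketch of the polling pattern; the component names are illustrative:

```rust
use std::task::{Context, Poll};

/// Sketch of the polling pattern: instead of `continue`-ing back to the top
/// as soon as one component is ready, remember that progress was made, give
/// every other component a chance to run, and only then loop again.
fn poll_event_loop(cx: &mut Context<'_>) -> Poll<()> {
    loop {
        let mut made_progress = false;

        if poll_sockets(cx).is_ready() {
            made_progress = true; // previously: `continue`
        }

        if poll_timers(cx).is_ready() {
            made_progress = true;
        }

        if poll_portal(cx).is_ready() {
            made_progress = true;
        }

        if !made_progress {
            return Poll::Pending; // all wakers registered; wait to be woken
        }
    }
}

// Stand-ins for the relay's actual components.
fn poll_sockets(_: &mut Context<'_>) -> Poll<()> { Poll::Pending }
fn poll_timers(_: &mut Context<'_>) -> Poll<()> { Poll::Pending }
fn poll_portal(_: &mut Context<'_>) -> Poll<()> { Poll::Pending }
```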
Related: #8165
Related: #8176
Developing on Windows is much easier if all Rust code compiles without
errors or warnings because you can "trust" your IDE that your code is
error free if it says "0 errors; 0 warnings". We are not far off from
achieving this!
Apart from the "graceful termination" feature in the relay, both the
relay and gateway should actually also work on Windows just fine, thanks
to the platform-agnostic abstractions we have been building up for the
GUI and headless client.
As it turns out, the effort in #7104 was not a good idea. By logging
errors as values, most of our Sentry reports have the same title and
thus cannot be differentiated from within the overview at all. To fix
this, we stringify errors with all their sources whenever they get
logged. This ensures log messages are unique and all Sentry issues will
have a useful title.
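A minimal sketch of such a helper, walking `std::error::Error::source`:

```rust
use std::error::Error;
use std::fmt::Write;

/// Sketch: render an error together with all of its sources into a single
/// string so the resulting log message (and Sentry title) is unique.
fn display_error_chain(error: &dyn Error) -> String {
    let mut message = error.to_string();
    let mut source = error.source();

    while let Some(cause) = source {
        let _ = write!(message, ": {cause}");
        source = cause.source();
    }

    message
}
```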
Previously, it was possible to use the Firezone relay in "standalone"
mode where it would not attempt to connect to a portal. A long time ago,
this mode was introduced in order for us to test the TURN compatibility
of the relay with non-Firezone TURN clients. These tests have long been
removed and thus the mode is no longer required.
The positive side-effect of this is that we can make the
`FIREZONE_API_URL` a mandatory parameter and thus direct self-hosted
users towards setting this to the endpoint of their self-hosted portal.
Currently, telemetry via Sentry in our relay code is opt-out but won't
actually activate for a portal instance that isn't our staging or
production environment. However, this isn't enough to prevent alerts
from relay instances that aren't ours. It turns out that some
self-hosted customers don't realise that they have to change the portal
URL to their self-hosted portal. Without changing that, the relay will
attempt to authenticate to our production portal with an unknown token
and error out with a 401, logging a false-positive to Sentry.