In CI, eBPF in driver mode actually functions just fine with no changes
to our existing tests, given we apply a few workarounds and bugfixes:
- The interface learning mechanism had two flaws: (1) it only learned
per-CPU, which meant the risk for a missing entry grew as the core count
of the relay host grew, and (2) it did not filter for unicast IPs, so it
picked up broadcast and link-local addresses, causing cross-relay paths
to fail occasionally
- The `relay-relay` candidate where the two relays are the same relay
causes packet drops / loops in the Docker bridge setup, and possibly in
GCP too. I'm not sure this is a valid path that solves a real
connectivity issue in the wild. I can understand relay-relay paths where
two relays are different hosts, and the client and gateway both talk
over their TURN channel to each other (i.e. WireGuard is blocked in each
of their networks), but I can't think of an advantage for a relay-relay
candidate where the traffic simply hairpins (or is dropped) off the
nearest switch. This has been now detected with a new `PacketLoop` error
that triggers whenever source_ip == dest_ip.
- The relays in CI need a common next-hop to talk to for the MAC address
swapping to work. A simple router service is added which functions as a
basic L3 router (no NAT) that allows the MAC swapping to work.
- The `veth` driver has some peculiar requirements to allow it to
function with XDP_TX. If you send a packet out of one interface of a
veth pair with XDP_TX, you need to either make sure both interfaces have
GRO enabled, or you need to attach a dummy XDP program that simply does
XDP_PASS to the other interface so that the sk_buff is allocated before
going up the stack to the Docker bridge. The GRO method was unreliable
and didn't work in our case, causing massive packet delays and
unpredictable bursts that prevented ICE from working, so we use the
XDP_PASS method instead. A simple docker image is built and lives at
https://github.com/firezone/xdp-pass to handle this.
Related: #10138
Related: #10260
Previously, the `service.name` attribute got overridden with "unknown
service" from the detector used in `Resource::default`. To avoid this,
we are now manually composing the two other detectors.
This gives us a useful set of default labels from within the code yet it
allows overriding all of them using `OTEL_RESOURCE_ATTRIBUTES`.
I recently discovered that the metrics reporting to Google Cloud Metrics
for the relays is actually working. Unfortunately, they are all bucketed
together because we don't set the metadata correctly.
This PR aims to fix that be setting some useful default metadata for
traces and metrics and additionally, discoveres instance ID and name
from GCE metadata.
Related: #2033.
Without masquerading, packets sent by the gateway through the TUN
interface use the wrong source address (the TUN device's address)
instead of the gateway's actual network interface.
We set this env variable in all our uses of the gateway, thus we might
as well remove it and always perform unconditionally.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Reactor Scram <ReactorScram@users.noreply.github.com>