In CI, eBPF in driver mode actually functions just fine with no changes
to our existing tests, given we apply a few workarounds and bugfixes:
- The interface learning mechanism had two flaws: (1) it only learned
per-CPU, which meant the risk for a missing entry grew as the core count
of the relay host grew, and (2) it did not filter for unicast IPs, so it
picked up broadcast and link-local addresses, causing cross-relay paths
to fail occasionally
- The `relay-relay` candidate where the two relays are the same relay
causes packet drops / loops in the Docker bridge setup, and possibly in
GCP too. I'm not sure this is a valid path that solves a real
connectivity issue in the wild. I can understand relay-relay paths where
two relays are different hosts, and the client and gateway both talk
over their TURN channel to each other (i.e. WireGuard is blocked in each
of their networks), but I can't think of an advantage for a relay-relay
candidate where the traffic simply hairpins (or is dropped) off the
nearest switch. This has been now detected with a new `PacketLoop` error
that triggers whenever source_ip == dest_ip.
- The relays in CI need a common next-hop to talk to for the MAC address
swapping to work. A simple router service is added which functions as a
basic L3 router (no NAT) that allows the MAC swapping to work.
- The `veth` driver has some peculiar requirements to allow it to
function with XDP_TX. If you send a packet out of one interface of a
veth pair with XDP_TX, you need to either make sure both interfaces have
GRO enabled, or you need to attach a dummy XDP program that simply does
XDP_PASS to the other interface so that the sk_buff is allocated before
going up the stack to the Docker bridge. The GRO method was unreliable
and didn't work in our case, causing massive packet delays and
unpredictable bursts that prevented ICE from working, so we use the
XDP_PASS method instead. A simple docker image is built and lives at
https://github.com/firezone/xdp-pass to handle this.
Related: #10138
Related: #10260
- Removes the swift DerivedData cache. This was added to attempt to
speed up the Swift builds in CI but in reality, those are already fast
and the cache did not speed them up.
- Removes the runner.os/arch specifier from the Webview installer cache
key. The binary download is hardcoded for a specific windows version /
arch already so the cache key just adds unneeded complexity.
These caches are getting saved on PR runs which consumes excess GHA
cache storage.
To avoid burning Azure credits, we move the runners back down to the
free tier. Now that caching is properly set up, this should incur only a
minor increase in CI time.
Docker for Mac finally supports IPv6 in general availability. It's time
to add IPv6 to our suite of integration tests.
The thinking behind this PR is try and not slow down CI much, if at all,
by testing IPv6 side-by-side with the existing IPv4 tests.
More comprehensive testing is being developed in #10131 that will test
things like IPv4-in-6 relaying, client / gateway IP stack mismatches,
and so forth.
Forcing 500MBit/s through a relayed connection in CI makes the
user-space relay fall-over and drop control messages, leading to ICE
timeouts of the connection.
Our ICE timeout is ~15s, so it would be a good idea to ensure the perf
tests span a possible ICE timeout if it occurs in the test, so that we
may detect cases where high throughput may cause an ICE timeout.
With the removal of the NAT64/46 modules, we can now simplify the
internals of our `IpPacket` struct. The requirements for our `IpPacket`
struct are somewhat delicate.
On the one hand, we don't want to be overly restrictive in our parsing /
validation code because there is a lot of broken software out there that
doesn't necessarily follow RFCs. Hence, we want to be as lenient as
possible in what we accept.
On the other hand, we do need to verify certain aspects of the packet,
like the payload lengths. At the moment, we are somewhat too lenient
there which causes errors on the Gateway where we have to NAT or
otherwise manipulate the packets. See #9567 or #9552 for example.
To fix this, we make the parsing in the `IpPacket` constructor more
restrictive. If it is a UDP, TCP or ICMP packet, we attempt to fully
parse its headers and validate the payload lengths.
This parsing allows us to then rely on the integrity of the packet as
part of the implementation. This does create several code paths that can
in theory panic but in practice, should be impossible to hit. To ensure
that this does in fact not happen, we also tackle an issue that is long
overdue: Fuzzing.
Resolves: #6667Resolves: #9567Resolves: #9552
Packet loss is a reality on the modern internet. Ideally, Firezone
should be able to handle some level of packet loss and still function
reliably, especially considering all of the UDP-based protocols we rely
on.
To test this, we set an extreme packet loss of 20% and perform a 10 MB
download through Firezone. Doing so actually exposed a bug:
For DNS resources, we need to set up the DNS resource NAT on the Gateway
which happens through the p2p control protocol. This packet is resent at
most every 2s but only if there are any other DNS queries. If we don't
receive another DNS query but get traffic for the resource, we keep
buffering those packets without trying to re-send the `AssignedIp`s
packet.
At present, the `direct-download-roaming-network` integration test is a
bit odd. It uses the `--retry` switch from `curl` to retry the download
once it failed. However, what we want to show with this integration test
is that a TCP connection can survive network roaming. We can show that
successfully but only if we specify the `--keepalive-time` option,
otherwise the download stalls.
From inspecting the network logs, this is because `curl` simply waits
for more data to be downloaded. After a network reset, the connection
however is gone and the _client_ (in this case `curl`) needs to send at
least 1 packet to re-establish the connection. By using the keep-alive
option, we can send such a packet and the download completes
successfully.
In Docker environments, applying iptables rules to filter
container-container traffic on the Docker bridged network is not
reliable, leading to direct connections being established in our relayed
tests. To fix this, we insert the rules directly from the client
container itself.
---------
Co-authored-by: Jamil Bou Kheir <jamilbk@users.noreply.github.com>
[`actionlint`](https://github.com/rhysd/actionlint) is a static analysis
tool for GitHub workflows and actions. It detects various issues ahead
of time and runs shellcheck on all `run` blocks. It is worth noting that
this does **not** lint the contents of composite actions so we still
need to be vigilant when working with those.
A bit of legacy that we have inherited around our Firezone ID is that
the ID stored on the user's device is sha'd before being passed to the
portal as the "external ID". This makes it difficult to correlate IDs in
Sentry and PostHog with the data we have in the portal. For Sentry and
PostHog, we submit the raw UUID stored on the user's device.
As a first step in overcoming this, we embed an "external ID" in those
services as well IF the provided Firezone ID is a valid UUID. This will
allow us to immediately correlate those events.
As a second step, we automatically generate all new Firezone IDs for the
Windows and Linux Client as `hex(sha256(uuid))`. These won't parse as
valid UUIDs and therefore will be submitted as is to the portal.
As a third step, we update all documentation around generating Firezone
IDs to use `uuidgen | sha256` instead of just `uuidgen`. This is
effectively the equivalent of (2) but for the Headless Client and
Gateway where the Firezone ID can be configured via environment
variables.
Resolves: #9382
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
To make releases even more smoother, this PR creates a bit of automation
that automatically bumps the versions in the `scripts/bump-versions.sh`
script and opens a PR for it.
To make our FFI layer between Android and Rust safer, we adopt the
UniFFI tool from Mozilla. UniFFI allows us to create a dedicated crate
(here `client-ffi`) that contains Rust structs annotated with various
attributes. These macros then generate code at compile time that is
built into the shared object. Using a dedicated CLI from the UniFFI
project, we can then generate Kotlin bindings from this shared object.
The primary motivation for this effort is memory safety across the FFI
boundary. Most importantly, we want to ensure that:
- The session pointer is not used after it has been free'd
- Disconnecting the session frees the pointer
- Freeing the session does not happen as part of a callback as that
triggers a cyclic dependency on the Rust side (callbacks are executed on
a runtime and that runtime is dropped as part of dropping the session)
To achieve all of these goals, we move away from callbacks altogether.
UniFFI has great support for async functions. We leverage this support
to expose a `suspend fn` to Android that returns `Event`s. These events
map to the current callback functions. Internally, these events are read
from a channel with a capacity of 1000 events. It is therefore not very
time-critical that the app reads from this channel. `connlib` will
happily continue even if the channel is full. 1000 events should be more
than sufficient though in case the host app cannot immediately process
them. We don't send events very often after all.
This event-based design has major advantages: It allows us to make use
of `AutoCloseable` on the Kotlin side, meaning the `session` pointer is
only ever accessed as part of a `use` block and automatically closed
(and therefore free'd) at the end of the block.
To communicate with the session, we introduce a `TunnelCommand` which
represents all actions that the host app can send to `connlib`. These
are passed through a channel to the `suspend fn` which continuously
listens for events and commands.
Resolves: #9499
Related: #3959
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil Bou Kheir <jamilbk@users.noreply.github.com>
The latest release now also sorts workspace dependencies, as well as
different dependency sections. Keeping these things sorted reduces the
chances of merge conflicts when multiple PRs edit these files.