Commit Graph

1236 Commits

Author SHA1 Message Date
Thomas Eizinger
20d0298a8a chore: fix clippy warnings about HashMap iteration (#10661)
Not quite sure how these didn't get picked up by CI but they showed in
my local IDE.

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-10-21 02:54:20 +00:00
Thomas Eizinger
fc97816d6e chore: remove redundant clone (#10662) 2025-10-21 01:11:03 +00:00
Thomas Eizinger
fcda9c3b65 chore(connlib): add unit test for site-name change (#10622)
Turns out name changes of sites are already ignored as per the
`PartialEq` implementation of `Site`. This adds a unit-test to assert
that.
2025-10-19 23:57:45 +00:00
Thomas Eizinger
a07dfc9869 test(connlib): workaround DNS cache in proptests (#10602)
With the introduction of the DNS cache for Clients in #10533, we now
enable a behaviour where we don't necessarily need to establish a
connection to a Gateway to resolve a DNS query if we still have a valid
entry in the DNS cache. In particular, the proptests discovered that:

- a DNS query for an upstream resolver
- which happens to be a resource
- and has a valid entry in the DNS cache
- but (no longer) a connection to the corresponding Gateway

will now serve the cached DNS records instead of establishing a new
connection to the Gateway. As a result, the site status which we assert
in the proptests remains in "unknown" instead of the expected "online".

Modelling the caching behaviour in the tests is rather tedious. To avoid
that, we set the TTL of all simulated upstream DNS responses to 1 which
effectively bypasses the cache. Whilst not an ideal solution, it ensures
that CI is consistently green without flaky tests. The DNS cache itself
is already unit-tested.
2025-10-17 16:17:52 +00:00
Thomas Eizinger
928d8a2512 fix(connlib): handle resources changing site (#10604)
Similar to how resources can be edited to change their address, IP stack
or other properties, they can also be moved between different sites.
Currently, `connlib` requires the portal to explicitly remove the
resource and then re-add it for this to work.

Our system gets more robust if we also detect that the sites of a
resource have changed and handle it like other addressability changes.

To ensure that this works correctly, we also extend the proptests to
simulate addressability changes of resources.

Resolves: #9881
Related: #10593
2025-10-17 14:52:14 +00:00
Thomas Eizinger
6b3f2a32ce feat(gateway): associate packets with resource ID (#10588)
In order to support flow logs, we need to associate each IP packet that
gets routed with its corresponding resource ID. Currently, we only track
what is necessary for the actual routing behaviour: The IP addresses and
the filters. Therefore, we extend the data structures in `peer` to also
track the `ResourceId` now.

The entire code within `peer` became a bit hard to manage so I took this
opportunity to split it out into two dedicated modules.

This PR forms the base for recording flow logs in #10576.
2025-10-16 13:53:53 +00:00
Thomas Eizinger
08f8e886f1 chore(connlib): tune down INFO logs (#10574)
Several of these INFO logs are actually quite noisy, like exchanging
candidates with Gateways or updating the allocation. We barely look at
the INFO logs from customers and primarily investigate issues with DEBUG
logs streamed to Sentry.
2025-10-15 05:52:43 +00:00
Thomas Eizinger
df601be538 chore(rust): ban keys and values from HashMap (#10569)
In addition to the `iter` functions, `keys` and `values` also iterate
over the contents of a `HashMap` and are thus non-deterministic. This
can create problems where our test-suite is non-deterministic.
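
As a minimal sketch of the deterministic alternative (the helper name
`names_in_key_order` and the key/value types are assumptions made for
illustration, not code from the repository), an ordered map yields the
same iteration order on every run:

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: returns the values of the map in key order.
///
/// `BTreeMap` iterates in sorted key order, so the result is identical on
/// every run, whereas `HashMap::values` makes no promise about order.
fn names_in_key_order(names_by_id: &BTreeMap<u32, String>) -> Vec<String> {
    names_by_id.values().cloned().collect()
}
```

Inserting the keys 3, 1 and 2 in that order still yields the values for
1, 2 and 3, regardless of insertion order or process.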
2025-10-14 22:44:17 +00:00
Thomas Eizinger
039d0be7b8 fix(connlib): drop packets with bad source IP on clients (#10552)
When using the Internet Resource, it can happen that Clients are still
receiving packets with a source IP that is different from the TUN IP.
Such packets are dropped on the Gateway already today and therefore have
never been routed to their destination.

The Gateway cannot route these packets because the reply packets would
have the original source address set as the destination and that one is
not unique across all Firezone Clients. Without a unique destination,
the Gateway cannot send the packet to the correct Client.

Today, these packets are filtered on the Gateway and thus trigger an
ICMP error. With the addition of #10462, we create a new flow for each
one of these packets. To prevent this spam, we drop such packets early
in the Client and don't even route them to the Gateway.
2025-10-13 22:54:26 +00:00
Thomas Eizinger
8ccf8b90bc chore(tests): remove comments from regression seeds file (#10534)
Whilst the regression seeds file itself is useful for having a fixed set
of tests that are always run, the comments describing what a specific
seed samples quickly get outdated as the test suite evolves. Therefore,
we remove the comments so as not to confuse developers.
2025-10-08 05:21:47 +00:00
Thomas Eizinger
1140f6ffa3 feat(clients): cache DNS responses (#10533)
Firezone Clients set themselves as the system-wide DNS resolver on
startup. This is necessary to intercept queries for DNS resources which
resolve to proxy IPs whilst Firezone is active.

All DNS queries for non-resources are forwarded to either the resolver
defined on the system or the ones defined in the portal (if any). These
DNS servers can also be CIDR resources in which case the queries get
forwarded through the tunnel to a Gateway.

Right now, the responses from these DNS servers are never cached. Most
systems rely heavily on DNS, and having DNS fail or be slow usually
results in a bad user experience.

To improve on this, we embed a small DNS cache into connlib where for
each query, we first try to answer it from the cache. Queries otherwise
forwarded to the system/upstream resolver or through the tunnel will see
a much improved response time with this change.

When serving responses from this cache, the TTL is decremented
automatically based on how much time has passed since the entry was
first added to the cache. Outside of the response time being ~1ms, this
makes the cache fully transparent.
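
The TTL handling described above can be sketched roughly as follows.
This is an illustration under assumptions, not connlib's implementation:
the `DnsCache` and `Entry` types, their fields and the use of plain
strings for records are all hypothetical.

```rust
use std::collections::BTreeMap;
use std::time::{Duration, Instant};

/// Hypothetical cache entry: the records, their original TTL and the
/// time at which they were added to the cache.
struct Entry {
    records: Vec<String>,
    ttl: Duration,
    inserted_at: Instant,
}

/// Hypothetical, minimal DNS cache keyed by domain name.
#[derive(Default)]
struct DnsCache {
    entries: BTreeMap<String, Entry>,
}

impl DnsCache {
    fn insert(&mut self, domain: &str, records: Vec<String>, ttl: Duration, now: Instant) {
        let entry = Entry { records, ttl, inserted_at: now };
        self.entries.insert(domain.to_owned(), entry);
    }

    /// Returns the cached records together with the remaining TTL, or
    /// `None` if there is no entry or the entry has expired.
    fn get(&mut self, domain: &str, now: Instant) -> Option<(Vec<String>, Duration)> {
        let entry = self.entries.get(domain)?;
        let age = now.saturating_duration_since(entry.inserted_at);
        // The TTL we hand out shrinks by the time spent in the cache.
        let remaining = entry.ttl.saturating_sub(age);

        if remaining.is_zero() {
            self.entries.remove(domain);
            return None;
        }

        Some((entry.records.clone(), remaining))
    }
}
```

An entry inserted with a TTL of 60s and queried 20s later is served with
a TTL of 40s; once the full 60s have passed, the query misses the cache
and would be forwarded as before.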

Resolves: #10508
2025-10-08 03:26:27 +00:00
Thomas Eizinger
8fc2ef8ad1 fix(clients): set Internet Resource state on startup (#10509)
Building on top of #10507, setting the initial Internet Resource state
is a piece of cake. All we need to do is thread a boolean variable
through to all call-sites of `Session::connect`. Without the need for
the Internet Resource's ID, we can simply pass in the boolean that is
saved in the configuration of each client.

Resolves: #10255
2025-10-07 07:13:52 +00:00
Thomas Eizinger
36dfee2c42 refactor(connlib): explicitly enable/disable Internet Resource (#10507)
Instead of the generic "disable any kind of resource"-functionality that
connlib currently exposes, we now provide an API to only enable /
disable the Internet Resource. This is a lot simpler to deal with and
reason about than the previous system, especially when it comes to the
proptests. Those need to model connlib's behaviour correctly across its
entire API surface which makes them unnecessarily complex if we only
ever use the `set_disabled_resources` API with a single resource.

In preparation for #4789, I want to extend the proptests to cover
traffic filters (#7126). This will make them a fair bit more
complicated, so any prior removal of complexity is appreciated.

Simplifying the implementation here is also a good starting point to fix
#10255. Not implicitly enabling the Internet Resource when it gets added
should be quite simple after this change.

Finally, resolving #8885 should also be quite easy. We just need to
store the state of the Internet Resource once per API URL instead of
globally.

Resolves: #8404

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-10-07 00:26:07 +00:00
Thomas Eizinger
e9e8792512 feat(connlib): tune down logs for recently disconnected clients (#10501)
When a Client disconnects from a Gateway, we might still be receiving
packets that are either in-flight or are still being sent by the
resource. For some amount of time after a disconnect, this is expected
and not worth logging a warning for.

With this PR, we define this time to be 60s. If we cannot look up a
connection either by ID, session index or public key but the peer has
disconnected within the last 60s, we will now only print a DEBUG log
instead of a WARN.
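
A rough sketch of that decision, assuming we remember when a peer
disconnected (the `Level` enum and the function name are hypothetical
and only serve the illustration):

```rust
use std::time::{Duration, Instant};

/// The grace period named above.
const RECENTLY_DISCONNECTED: Duration = Duration::from_secs(60);

#[derive(Debug, PartialEq)]
enum Level {
    Debug,
    Warn,
}

/// Hypothetical helper: picks the log level for a packet that we cannot
/// map to a connection. `disconnected_at` is `None` if we have no record
/// of the peer ever disconnecting.
fn level_for_unknown_connection(disconnected_at: Option<Instant>, now: Instant) -> Level {
    match disconnected_at {
        // Still within the grace period: stray packets are expected.
        Some(at) if now.saturating_duration_since(at) <= RECENTLY_DISCONNECTED => Level::Debug,
        _ => Level::Warn,
    }
}
```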

Resolves: #10175
2025-10-03 13:08:06 +00:00
Thomas Eizinger
2cc13cea24 refactor(connlib): set ECN bits directly on Transmit (#10497)
Instead of mirroring the ECN bits of an IP packet on the resulting UDP
packet in the event-loop, we can extend `Transmit` with an `ecn` field
and directly set it every time we construct a `Transmit`, mirroring the
ECN bits from the inner IP packet if the UDP packet contains an
encapsulated IP packet.

Extracted from #10485
2025-10-03 13:02:17 +00:00
Thomas Eizinger
881514edfc fix(connlib): log fragmented IP packets on debug (#10488)
When an application sends UDP packets that are larger than the MTU of
the underlying interface, the kernel fragments the packet at the IP
level. Firezone does not support fragmented IP packets because we need
to pack each IP packet into a UDP packet.

Right now, we don't check for fragmented IP packets which results in
packet parsing errors because the slice we are trying to parse the
packet from is not long enough.

To avoid spamming Sentry in these cases, we explicitly check for
fragmented IP packets and only log those on DEBUG.
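
For IPv4, such a check could look roughly like the sketch below. It
reads the flags and fragment offset from bytes 6 and 7 of the header;
the function name is hypothetical and IPv6, which signals fragmentation
via an extension header, is not covered here.

```rust
/// Returns `true` if `packet` starts with an IPv4 header that marks the
/// packet as a fragment (hypothetical helper, IPv4 only).
fn is_ipv4_fragment(packet: &[u8]) -> bool {
    // A minimal IPv4 header is 20 bytes; the version is the high nibble
    // of the first byte.
    if packet.len() < 20 || (packet[0] >> 4) != 4 {
        return false;
    }

    // Bytes 6..8 hold 3 flag bits followed by a 13-bit fragment offset.
    let flags_and_offset = u16::from_be_bytes([packet[6], packet[7]]);
    let more_fragments = (flags_and_offset & 0x2000) != 0;
    let fragment_offset = flags_and_offset & 0x1FFF;

    // The first fragment has MF set, every later one a non-zero offset.
    more_fragments || fragment_offset != 0
}
```

A packet that only carries the "don't fragment" flag is not a fragment
and is left alone by this check.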

Resolves: #10335
2025-10-02 05:03:12 +00:00
Thomas Eizinger
cfbdc30123 refactor(connlib): move log into state (#10498)
Instead of logging this inside the event-loop, it is better to move it
into the corresponding handler function to free up the event-loop from
as much "logic" as possible. It should ideally only be concerned with
linking the state machine with the IO components that actually cause the
side-effects.
2025-10-01 04:16:41 +00:00
Thomas Eizinger
a297c6dbbd chore: differentiate between shutdown and shut down (#10494)
In a prior code review, Copilot flagged that we were using the noun
"shutdown" as a verb in certain places.

Resolves: #10425
2025-10-01 02:55:22 +00:00
Thomas Eizinger
b11adfcfe4 feat(connlib): create flow on ICMP error "prohibited" (#10462)
In Firezone, a Client requests an "access authorization" for a Resource
on the fly when it sees the first packet for said Resource going through
the tunnel. If we don't have a connection to the Gateway yet, this is
also where we will establish a connection and create the WireGuard
tunnel.

In order for this to work, the access authorization state between the
Client and the Gateway MUST NOT get out of sync. If the Client thinks it
has access to a Resource, it will just route the traffic to the Gateway.
If the access authorization on the Gateway has expired or vanished
otherwise, the packets will be black-holed.

Starting with #9816, the Gateway sends ICMP errors back to the
application whenever it filters a packet. This can happen either because
the access authorization is gone or because the traffic wasn't allowed
by the specific filter rules on the Resource.

With this patch, the Client will attempt to create a new flow (i.e.
re-authorize traffic) for this resource whenever it sees such an ICMP
error, therefore acting as a way of synchronizing the view of the world
between Client and Gateway should they ever get out of sync.

Testing turned out to be a bit tricky. If we let the authorization on
the Gateway lapse naturally, the portal will also toggle the Resource off
and on on the Client, resulting in "flushing" the current
authorizations. Additionally, if the Client only had access to one
Resource, then the Gateway will gracefully close the connection, also
resulting in the Client creating a new flow for the next packet.

To actually trigger this new behaviour we need to:

- Access at least two resources via the same Gateway
- Directly send `reject_access` to the Gateway for this particular
resource

To achieve this, we dynamically eval some code on the API node and
instruct the Gateway channel to send `reject_access`. The connection
stays intact because there is still another active access authorization
but packets for the other resource are answered with ICMP errors.

To achieve a safe roll-out, the new behaviour is feature-flagged. In
order to still test it, we now also allow feature flags to be set via
env variables.

Resolves: #10074

---------

Co-authored-by: Mariusz Klochowicz <mariusz@klochowicz.com>
2025-09-30 08:23:39 +00:00
Thomas Eizinger
685acdac3a feat: add more specific component type to user-agent header (#10457)
In order to allow the portal to more easily classify what kind of
component is connecting, we extend the `get_user_agent` header to
include a component type instead of the generic `connlib/`.

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
2025-09-26 00:18:36 +00:00
Thomas Eizinger
0310bafbcd feat(clients): gracefully close connections on shutdown (#10400)
In #10076, connlib gained the ability to gracefully close connections
between peers. The Gateway already uses this when it is being gracefully
shut down, such as during an upgrade. This allows Clients to immediately
fail-over to a different Gateway instead of waiting for an ICE timeout.

When a Client signs out, we currently just drop all the state, resulting
in an ICE timeout on the Gateway ~15 seconds later. This makes it
difficult for us to analyze whether an ICE timeout in the logs represents
an actual problem where a network connection got cut or whether the
Client simply signed out.

Whilst not water-tight, attempting to gracefully close our connections
when the Client signs out is better than nothing so we implement this
here.

All Clients use the `Session` abstraction from `client-shared` which
spawns the event-loop into a dedicated task.

- For the Linux and Windows GUI client, the already present tokio
runtime instance of the tunnel service is used for this.
- For Android and Apple, we create a dedicated, single-threaded runtime
instance for connlib.
- For the headless client, we also reuse the already existing tokio
runtime instance of the binary.

In case of Android, Apple and the headless client, this means we need to
ensure the tokio runtime instance stays alive long enough to actually
complete the graceful shutdown task. We achieve this by draining the
`EventStream` returned from `Session`. The `EventStream` is a wrapper
around a channel connected to the event-loop. This stream only finishes
once the event-loop is entirely dropped (and therefore completed the
graceful shutdown) as it holds the sender-end of the channel.
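
The draining idea can be illustrated with a std-only sketch. It uses a
thread and `std::sync::mpsc` as stand-ins for the tokio task and the
`EventStream`; the function names and event strings are made up for the
example.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for the event-loop: it owns the sender and only
/// drops it once its shutdown work is done.
fn spawn_event_loop() -> mpsc::Receiver<&'static str> {
    let (tx, rx) = mpsc::channel();

    thread::spawn(move || {
        tx.send("connected").ok();
        // A real event-loop would perform the graceful shutdown here.
        tx.send("goodbye sent").ok();
        // `tx` is dropped at the end of this closure, closing the channel.
    });

    rx
}

/// Blocks until the channel is closed, i.e. until the event-loop has
/// dropped its sender, and returns everything that was received.
fn drain(events: mpsc::Receiver<&'static str>) -> Vec<&'static str> {
    events.iter().collect()
}
```

Because `drain` only returns once the sender is gone, the caller can
safely tear down whatever was driving the event-loop afterwards.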

In case of the Linux and Windows GUI client, the runtime outlives the
`Session` because it is scoped to the entire tunnel process. Therefore,
no additional measures are necessary there to ensure the graceful
shutdown task completes.
2025-09-23 03:40:52 +00:00
Thomas Eizinger
8e00870942 refactor(gateway): close connections on error (#10401)
Previously, the Gateway would only proactively close connections to its
peers when it was shut down gracefully via a SIGTERM or SIGINT signal. By
copying the same design for the event-loop as I've implemented in
#10400, we can now also initiate the graceful shutdown in case the
event-loop exits with an error.
2025-09-20 20:55:48 +00:00
Thomas Eizinger
e20929ad73 build(deps): bump Rust version to 1.90 (#10380)
One of the quieter Rust releases with no new clippy lints that would
require code updates.
2025-09-20 04:28:03 +00:00
Thomas Eizinger
9c8101a3ee chore: render contextual information more Sentry-friendly (#10386)
Sentry can group issues together that have unique identifiers in their
message. Unfortunately, it only does that well for integers and UUIDs
and not so much for hex-values. To avoid alert fatigue, we render the
public key as a u256 which hopefully allows Sentry to group these
together.
2025-09-20 12:08:03 +10:00
Thomas Eizinger
90d10a8634 refactor(connlib): improve fairness of event-loop (#10347)
The event-loop inside `Tunnel` processes input according to a certain
priority. We only take input from lower priority sources when the higher
priority sources are not ready. The current priorities are:

- Flush all buffers
- Read from UDP sockets
- Read from TUN device
- Read from DNS servers
- Process recursive DNS queries
- Check timeout

The idea of this priority ordering is to keep all kinds of processing
bounded and "finish" any kind of work that is on-going before taking on
new work. Anything that sits in a buffer is basically done with
processing and just needs to be written out to the network / device.
Arriving UDP packets have already traversed the network and been
encrypted on the other end, meaning they are higher priority than
reading from the TUN device. Packets from the TUN device still need to
be encrypted and sent to the remote.

Whilst there is merit in this design, it also bears the potential of
starving input sources further down if the top ones are extremely busy.
To prevent this, we refactor `Io` to read from all input sources and
present it to the event-loop as a batch, allowing all sources to make
progress before looping around. Since this event-loop was first
conceived, we have refactored `Io` to use background threads for the UDP
sockets and TUN device, meaning they will make progress by themselves
anyway until the channels to the main-thread fill up. As such, there
shouldn't be any latency increase in processing packets even though we
are performing slightly more work per event-loop tick.

This kind of batch-processing highlights a problem: Bailing out with an
error midway through processing a batch leaves the remainder of the
batch unprocessed, essentially dropping packets. To fix this, we
introduce a new `TunnelError` type that presents a collection of errors
that we encountered while processing the batch. This might actually also
be a problem with what is currently in `main` because we are already
batch-processing packets there but possibly are bailing out midway
through the batch.
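
The idea of collecting errors instead of bailing out can be sketched as
follows. `BatchError` is a simplified, hypothetical stand-in for the
`TunnelError` mentioned above, and `process_packet` is a placeholder for
the real per-packet work.

```rust
use std::fmt;

/// Hypothetical, simplified error type that carries every error we hit
/// while processing a batch.
#[derive(Debug)]
struct BatchError {
    errors: Vec<String>,
}

impl fmt::Display for BatchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{} packet(s) failed: {}", self.errors.len(), self.errors.join("; "))
    }
}

impl std::error::Error for BatchError {}

/// Placeholder for the real work: rejects empty packets.
fn process_packet(packet: &[u8]) -> Result<Vec<u8>, String> {
    if packet.is_empty() {
        return Err("empty packet".to_owned());
    }
    Ok(packet.to_vec())
}

/// Processes the whole batch. A failing packet is recorded but does not
/// stop the remaining packets from being processed.
fn process_batch(batch: &[Vec<u8>], sink: &mut Vec<Vec<u8>>) -> Result<(), BatchError> {
    let mut errors = Vec::new();

    for (index, packet) in batch.iter().enumerate() {
        match process_packet(packet) {
            Ok(processed) => sink.push(processed),
            Err(e) => errors.push(format!("packet {index}: {e}")),
        }
    }

    if errors.is_empty() {
        Ok(())
    } else {
        Err(BatchError { errors })
    }
}
```

With an early `?` inside the loop, the packet after the failing one
would never reach `sink`; here it does, and the caller still learns
about every failure.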

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Mariusz Klochowicz <mariusz@klochowicz.com>
2025-09-17 23:28:36 +00:00
Thomas Eizinger
3e6094af8d feat(linux): try to set rmem_max and wmem_max on startup (#10349)
The default send and receive buffer sizes on Linux are too small (only
~200 KB). Checking `nstat` after an iperf run revealed that the number
of dropped packets in the first interval directly correlates with the
number of receive buffer errors reported by `nstat`.

We already try to increase the send and receive buffer sizes for our UDP
socket but unfortunately, we cannot increase them beyond what the system
limits them to. To work around this, we try to set `rmem_max` and
`wmem_max` during startup of the Linux headless client and Gateway. This
behaviour can be disabled by setting `FIREZONE_NO_INC_BUF=true`.

This doesn't work in Docker unfortunately, so we set the values manually
in the CI perf tests and verify after the test that we didn't encounter
any send and receive buffer errors.

It is yet to be determined how we should deal with this problem for all
the GUI clients. See #10350 as an issue tracking that.

Unfortunately, this doesn't fix all packet drops during the first iperf
interval. With this PR, we now see packet drops on the interface itself.
2025-09-17 23:05:01 +00:00
Thomas Eizinger
7222167b13 fix(connlib): limit the number of optimistic candidates (#10367)
To facilitate direct connections, `connlib` generates "optimistic"
candidates that combine the port of the host candidate with the IP of
the server-reflexive candidate. This allows sysadmins to port-forward
the Firezone port 52625 on the Gateway, allowing for direct connections
to happen behind symmetric NAT.

This feature is only really useful for IPv4 as IPv6 doesn't need
symmetric NAT due to the larger address space. It is also quite common
that users have multiple IPv6 addresses on a single interface. The
combination of the two can result in CPU spikes on the Gateway if a
client connects and sends over e.g. 10 IPv6 host candidates and various
IPv6 server-reflexive candidates. The Gateway then ends up in a loop
where it creates an NxM matrix of all these candidates.

To mitigate this, we disable optimistic candidates for IPv6 altogether
and limit the number of IPv4 optimistic candidates to 2.
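
A minimal sketch of the resulting rule, assuming candidates are plain
socket addresses (the function name and the de-duplication step are
illustrative assumptions, not the actual implementation):

```rust
use std::net::SocketAddr;

/// Upper bound on optimistic candidates, as stated above.
const MAX_OPTIMISTIC_CANDIDATES: usize = 2;

/// Hypothetical helper: combines the IP of each IPv4 server-reflexive
/// candidate with the port of each IPv4 host candidate, stopping once
/// the limit is reached. IPv6 candidates are ignored entirely.
fn optimistic_candidates(hosts: &[SocketAddr], server_reflexive: &[SocketAddr]) -> Vec<SocketAddr> {
    let mut candidates = Vec::new();

    for srflx in server_reflexive.iter().filter(|c| c.is_ipv4()) {
        for host in hosts.iter().filter(|c| c.is_ipv4()) {
            let candidate = SocketAddr::new(srflx.ip(), host.port());

            if !candidates.contains(&candidate) {
                candidates.push(candidate);
            }

            if candidates.len() == MAX_OPTIMISTIC_CANDIDATES {
                return candidates;
            }
        }
    }

    candidates
}
```

The early return bounds the work per connection, so a client that sends
many host and server-reflexive candidates can no longer trigger the NxM
expansion.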
2025-09-17 19:52:29 +00:00
Thomas Eizinger
69afe71215 refactor(connlib): remove concept of "ReplyMessages" (#10361)
In earlier versions of Firezone, the WebSocket protocol with the portal
was using the request-response semantics built into Phoenix. This
however is quite cumbersome to work with due to the polymorphic
nature of the protocol design.

We ended up moving away from it and instead only use one-way messages
where each event directly corresponds to a message type. However, we
have never removed the capability for reply messages from the
`phoenix-channel` module; instead, all usages just set it to `()`.

We can simplify the code here by always setting this to `()`.

Resolves: #7091
2025-09-17 04:10:56 +00:00
Thomas Eizinger
a66a18782e chore(connlib): add context to IP packet parse errors (#10337)
We are seeing some very strange IP packet parse errors coming from macOS
devices. To better understand these, we extend the error messages with
the src and dst IP as well as the L4 header.

Related: #10335
2025-09-12 14:11:12 +00:00
Thomas Eizinger
33a75f6fee chore(headless-client): don't make failures look like crashes (#10290)
Returning an error from `main` by default prints a backtrace. This may
lead users to believe that the program is crashing when in fact it is
exiting in a controlled way but with an error (such as when we don't
have Internet during startup).

Printing the chain of errors ourselves resolves this.
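
Rendering the chain by hand can be sketched with the standard library
alone. `render_chain` and `StartupError` are hypothetical names used for
illustration; the actual code may well rely on a helper crate instead.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical helper: renders an error and all of its sources on a
/// single line, separated by ": ".
fn render_chain(error: &dyn Error) -> String {
    let mut rendered = error.to_string();
    let mut source = error.source();

    while let Some(cause) = source {
        rendered.push_str(": ");
        rendered.push_str(&cause.to_string());
        source = cause.source();
    }

    rendered
}

/// Hypothetical top-level error that wraps an IO error as its source.
#[derive(Debug)]
struct StartupError {
    source: std::io::Error,
}

impl fmt::Display for StartupError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "failed to connect to portal")
    }
}

impl Error for StartupError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&self.source)
    }
}
```

`main` can then print the rendered chain to stderr and exit with a
non-zero code instead of returning the error.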
2025-09-10 01:08:32 +00:00
Thomas Eizinger
03ac73ac00 fix(gateway): reset DNS resource NAT if proxy IPs change (#10310)
In #10040, we decided to persist a peer's routing state on the Gateway
across ICE sessions. This routing state also includes the DNS resource
NAT.

Prior to #10104 (which is not released yet), when a Client signs out and
back in, it resets the proxy IP mapping for DNS resources and will start
numbering them again from the front, i.e. starting from 100.96.0.1. With
the state still being preserved on the Gateway, this represents a
problem: We keep existing mappings around if there is still a NAT
session for this proxy IP. However, if the proxy IP is actually for a
different domain, this NAT session is meaningless. In fact, not
replacing the IP is problematic as we will now route packets for the new
proxy IP to the wrong destination.

The persistent DNS resource mapping from #10104 fixes this. In this PR,
we add an additional check to the Gateway where we detect whether the
Client has started to re-assign proxy IPs and if so, we completely reset
the DNS resource NAT state including all existing NAT sessions.

Fixes #10268
2025-09-09 02:08:26 +00:00
Thomas Eizinger
ead1f40101 chore(gateway): only log skipped NAT entry if IP differs (#10285)
When we resolve a DNS resource domain name on the Gateway, we establish
the mapping between proxy IPs and resolved IPs in order to correctly NAT
traffic. These domains are re-resolved every time the Client sees a DNS
query for it. Thus, established connections could be interrupted if the
IPs returned by consecutive DNS queries are different.

Many SaaS products (GitHub for example) use DNS to load balance between
different IPs. In order to not interrupt those connections, we check
whether we have an open NAT session for an existing mapping every time
we re-resolve DNS.

This log is currently printed too often though because it doesn't take
into account whether the IPs actually changed. If the IP is the same, we
don't need to print this because the update is a no-op.
2025-09-04 21:12:46 +00:00
Thomas Eizinger
fb7b001cbf chore(rust): fix unused variable warning (#10283) 2025-09-03 01:17:11 +00:00
Thomas Eizinger
d718c5de8e fix(connlib): retry packets on IO error 5 (#10279)
Unfortunately, it isn't very easy to detect whether a socket supports
GSO on Linux. Hence, `quinn-udp` simply probes for its support by trying
to send GSO batches and effectively disables GSO by setting the
`max-gso-segments` state variable to 1 if it encounters either EINVAL
(-22) or EIO (-5).

For EINVAL, `quinn-udp` has an internal retry mechanism. For EIO, the
`Transmit` which is passed to `quinn-udp` needs to be re-chunked and
thus cannot be automatically retried.

In order to avoid dropping packets, we therefore add a once-off retry
step to sending a datagram whenever we hit EIO on Linux or Android. If
the error was due to GSO not being supported, the 2nd attempt should be
successful and going forward, even the first one should be until we roam
the socket (where this state variable gets reset).
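
The once-off retry can be sketched like this, with the actual send
operation abstracted into a closure (the function name and the closure
signature are assumptions made for the example):

```rust
use std::io;

/// EIO as a raw OS error code on Linux and Android.
const EIO: i32 = 5;

/// Hypothetical helper: calls `send` and retries exactly once if the
/// first attempt fails with EIO. Any other outcome is returned as is.
fn send_with_retry<F>(mut send: F) -> io::Result<()>
where
    F: FnMut() -> io::Result<()>,
{
    match send() {
        Err(e) if e.raw_os_error() == Some(EIO) => send(),
        other => other,
    }
}
```

A second EIO is passed on to the caller, so a genuinely broken socket
cannot cause an endless retry loop.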

These packet drops have been causing flakiness in CI ever since we
merged the eBPF tests. Those disable checksum offloading which appears
to trigger these errors.
2025-09-02 21:31:57 +00:00
Thomas Eizinger
e84bdc5566 refactor(connlib): periodically record queue depths (#10242)
Instead of recording the queue depths on every event-loop tick, we now
record them once a second by setting a Gauge. Not only is that a simpler
instrument to work with but it is significantly more performant. The
current version - when metrics are enabled - takes on quite a bit of CPU
time.

Resolves: #10237
2025-09-02 02:57:36 +00:00
Thomas Eizinger
a9e1b0fbfb chore(connlib): print full error when failing to read IP packet (#10275)
The error returned from `IpPacket::new` is an `anyhow::Error` but in
order to return it from `async_io`, we need to wrap it in an
`io::Error`. Printing an `io::Error` only prints the top-level error. To
fix this, we re-wrap the `io::Error` in an `anyhow::Error` again and
toggle "alternate" printing mode to see the full error chain.
2025-09-01 13:39:26 +00:00
Thomas Eizinger
0c2e54f54c feat(connlib): persistent DNS resource records across sessions (#10104)
When we receive a DNS query for a DNS resource in Firezone, we take the
next available 4 IPs from the CG-NAT range and assign them to the domain
name. For example, if `example.com` is a DNS resource and it is the
first resource being queried in a Firezone session, we will assign the
IPs `100.96.0.1` - `100.96.0.4` to it. If the user now restarts Firezone
or signs out and back in, this state is lost and we assign those same
IPs to the next DNS query coming in.

This creates a problem for applications that do not re-query DNS very
often or never. They expect these IPs to not change. Restarting software
or signing out and back in is a common approach to fixing software
problems, yet in this specific case, doing so may create even more
problems for the user.

To mitigate this, `ClientState` introduces a new event
`DnsRecordsChanged` that gets emitted to the event-loop every time we
assign new records. The event-loop then caches this in memory and reuses
it in case a new session is initiated. The records are only stored
in-memory and not on disk. Most likely, the tunnel process will be alive
for the entire OS session.
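
To illustrate why re-seeding matters, here is a simplified sketch of
such an allocation scheme. The `DnsRecords` type and its sequential
allocation are assumptions for the example and not connlib's actual
data structures.

```rust
use std::collections::BTreeMap;
use std::net::Ipv4Addr;

/// Number of proxy IPs assigned per domain, as described above.
const IPS_PER_DOMAIN: u32 = 4;
/// First proxy IP from the example above.
const FIRST_PROXY_IP: Ipv4Addr = Ipv4Addr::new(100, 96, 0, 1);

/// Hypothetical record store: maps each domain to its proxy IPs.
#[derive(Clone, Default)]
struct DnsRecords {
    by_domain: BTreeMap<String, Vec<Ipv4Addr>>,
}

impl DnsRecords {
    /// Returns the existing proxy IPs for `domain` or assigns the next
    /// four unused ones.
    fn resolve(&mut self, domain: &str) -> Vec<Ipv4Addr> {
        if let Some(ips) = self.by_domain.get(domain) {
            return ips.clone();
        }

        let already_assigned = self.by_domain.len() as u32 * IPS_PER_DOMAIN;
        let base = u32::from(FIRST_PROXY_IP) + already_assigned;
        let ips: Vec<Ipv4Addr> = (0..IPS_PER_DOMAIN)
            .map(|offset| Ipv4Addr::from(base + offset))
            .collect();

        self.by_domain.insert(domain.to_owned(), ips.clone());
        ips
    }
}
```

If a new session starts from a clone of the previous session's
`DnsRecords` rather than from an empty one, `example.com` keeps
`100.96.0.1` and the next domain starts at `100.96.0.5`.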

To verify this behaviour, we add a new `RestartClient` transition to our
proptests. In the proptests, we already keep a mapping of all DNS names
we ever resolved, including DNS resources. When generating IP traffic,
we sample from this list of IPs and then expect the packet to be routed.
By replacing the `ClientState` as part of this transition and re-seeding
it with the previously exported DNS records, we can verify that packets
to IPs resolved from a previous session still get successfully routed to
the resource.

Related: #5498
2025-09-01 07:29:28 +00:00
Thomas Eizinger
533f4c319b feat(connlib): gracefully shutdown connections (#10076)
Right now, connections cannot be actively closed in Firezone. The
WireGuard tunnel and the ICE agent are coupled together, meaning only if
either one of them fails will we clean up the connection. One exception
here is when the Client roams. In that case, the Client simply clears
its local memory completely and then re-establishes all necessary
connections by re-requesting access.

There are three cases where gracefully closing a connection is useful:

1. If an access authorization is revoked or expires and this was the
last resource authorization for that peer, we don't currently remove the
connection on the Gateway. Instead, the Client is still able to send
packets but they'll be dropped because we don't have a peer state
anymore.
2. If a Gateway gets restarted due to e.g. an upgrade or other
maintenance work, it loses all its connections and every Client needs to
wait for the ICE timeout (~15 seconds) before it can establish a new
one.
3. If a Client has its access revoked for all resources it has access to
in a particular site, we also don't remove this connection, even though
it has become practically useless.

All of these cases are fixed with this PR. Here we introduce a way to
gracefully shut down a connection without forcing the other side into an
ICE timeout. The graceful connection shutdown works by introducing a new
"goodbye" p2p control protocol message. Like all our p2p control
protocol messages, this is based on IP and therefore delivery is not
guaranteed. In other words, this "goodbye" message is sent on a
best-effort basis.

In the case of shutdown, the Gateway will wait for all UDP packets to be
flushed but will not resend them or wait for an ACK.

If either end receives such a "goodbye" message, they simply remove the
local peer and connection state just as if the connection would have
failed due to either ICE or WireGuard. For the Client, this means that
the next packet for a resource will trigger a new access authorization
request.
2025-09-01 06:30:13 +00:00
Thomas Eizinger
544ba11f21 chore(rust): allow too_many_arguments repo-wide (#10236)
We always end up allowing this lint when it pops up so we can also just
allow it for the whole repo in general. Most of the time, the reason for
too many arguments are borrow-checker limitations of Rust where mutable
references need to be tracked explicitly.
2025-08-22 13:21:07 +00:00
Thomas Eizinger
c70c88c856 build(deps): upgrade to opentelemetry 0.30 (#10239) 2025-08-21 22:47:39 +00:00
Thomas Eizinger
99155490c5 chore(connlib): make UDP buffer sizes tunable at runtime (#10234)
For easier benchmarking, we make the UDP socket send and receive buffers
runtime-tunable.

Related: #7452
2025-08-21 18:18:14 +00:00
Thomas Eizinger
f85ae75ae0 refactor(connlib): increase UDP queues on desktop platforms (#10235)
On desktop platforms, we can easily afford to have larger queues here
despite each item in there being 65k. Benchmarking showed that we do
sometimes fill these up.

Related: #7452
2025-08-21 08:56:14 +00:00
Thomas Eizinger
a109c1a2ef feat(connlib): discard intermediate resource and TUN updates (#10223)
Right now, the Client event-loops have a channel with 1000 items for
sending new resource lists and updates to the TUN device to the host
app. This is kind of unnecessary as we always only care about the last
version of these. Intermediate updates that the host app doesn't process
are effectively irrelevant.

We've had an issue before where a bug in the portal caused us to receive
many updates to resources which ended up crashing Client apps because
this channel filled up.

To be more resilient on this front, we refactor the Client event loop to
use a `watch` channel for this. Watch channels only retain the last
value that got sent into them.
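
The "last value wins" semantics can be illustrated with a std-only
stand-in. The `Latest` type below is hypothetical and much simpler than
a real watch channel (it has no change notification), but it shows why
intermediate updates can no longer pile up.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical std-only stand-in for a watch-style channel: a shared
/// slot that only ever holds the most recent value.
struct Latest<T> {
    slot: Arc<Mutex<Option<T>>>,
}

impl<T> Latest<T> {
    fn new() -> Self {
        Self { slot: Arc::new(Mutex::new(None)) }
    }

    /// Returns a second handle to the same slot, e.g. for the producer.
    fn handle(&self) -> Self {
        Self { slot: Arc::clone(&self.slot) }
    }

    /// Overwrites whatever value the consumer has not picked up yet.
    fn send(&self, value: T) {
        *self.slot.lock().unwrap() = Some(value);
    }

    /// Takes the most recent value, if there is one.
    fn take(&self) -> Option<T> {
        self.slot.lock().unwrap().take()
    }
}
```

Sending 1000 updates before the consumer runs leaves exactly one value
to process, so memory use stays constant no matter how fast updates
arrive.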
2025-08-21 05:42:54 +00:00
Thomas Eizinger
4e11112d9b feat(connlib): improve throughput on higher latencies (#10231)
Turns out the multi-threaded access of the TUN device on the Gateway
causes packet reordering which makes the TCP congestion controller
throttle the connection. Additionally, the default TX queue length of a
TUN device on Linux is only 500 packets.

With just a single thread and an increased TX queue length, we get a
throughput performance of just over 1 GBit/s for a 20ms link between
Client and Gateway with basically no packet drops:

```
Connecting to host 172.20.0.110, port 5201
[  5] local 100.79.130.70 port 49546 connected to 172.20.0.110 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   116 MBytes   977 Mbits/sec    0   6.40 MBytes       
[  5]   1.00-2.00   sec   137 MBytes  1.15 Gbits/sec    0   6.40 MBytes       
[  5]   2.00-3.00   sec   134 MBytes  1.13 Gbits/sec    0   6.40 MBytes       
[  5]   3.00-4.00   sec   136 MBytes  1.14 Gbits/sec   47   6.40 MBytes       
[  5]   4.00-5.00   sec   137 MBytes  1.15 Gbits/sec    0   6.40 MBytes       
[  5]   5.00-6.00   sec   138 MBytes  1.16 Gbits/sec    0   6.40 MBytes       
[  5]   6.00-7.00   sec   138 MBytes  1.15 Gbits/sec    0   6.40 MBytes       
[  5]   7.00-8.00   sec   138 MBytes  1.15 Gbits/sec    0   6.40 MBytes       
[  5]   8.00-9.00   sec   138 MBytes  1.16 Gbits/sec    0   6.40 MBytes       
[  5]   9.00-10.00  sec   138 MBytes  1.15 Gbits/sec    0   6.40 MBytes       
[  5]  10.00-11.00  sec   139 MBytes  1.17 Gbits/sec    0   6.40 MBytes       
[  5]  11.00-12.00  sec   139 MBytes  1.17 Gbits/sec    0   6.40 MBytes       
[  5]  12.00-13.00  sec   136 MBytes  1.14 Gbits/sec    0   6.40 MBytes       
[  5]  13.00-14.00  sec   139 MBytes  1.17 Gbits/sec    0   6.40 MBytes       
[  5]  14.00-15.00  sec   140 MBytes  1.17 Gbits/sec    0   6.40 MBytes       
[  5]  15.00-16.00  sec   138 MBytes  1.16 Gbits/sec    0   6.40 MBytes       
[  5]  16.00-17.00  sec   137 MBytes  1.15 Gbits/sec    0   6.40 MBytes       
[  5]  17.00-18.00  sec   139 MBytes  1.17 Gbits/sec    0   6.40 MBytes       
[  5]  18.00-19.00  sec   138 MBytes  1.16 Gbits/sec    0   6.40 MBytes       
[  5]  19.00-20.00  sec   136 MBytes  1.14 Gbits/sec    0   6.40 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-20.00  sec  2.67 GBytes  1.15 Gbits/sec   47             sender
[  5]   0.00-20.02  sec  2.67 GBytes  1.15 Gbits/sec                  receiver

iperf Done.

```

For further debugging in the future, we are now recording the send and
receive queue depths of both the TUN device and the UDP sockets. Neither
of those appeared to be full in my testing, which leads me to conclude
that no buffer inside Firezone is too small here.

Related: #7452

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
2025-08-20 23:08:56 +00:00
Thomas Eizinger
da00848549 build(deps): bump to Rust 1.89 (#10208)
Rust 1.89 comes with a new lint that wants us to explicitly refer to
lifetimes, even if they are elided.
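
As an illustration of the kind of change this implies (assuming the lint in question is `mismatched_lifetime_syntaxes`; the commit does not name it), a return type that borrows from `&self` now spells out the elided lifetime with `'_`. The `Buffer` type and `sum` function below are hypothetical.

```rust
struct Buffer {
    bytes: Vec<u8>,
}

impl Buffer {
    // Writing the return type as `std::slice::Iter<u8>` hides the fact that
    // the iterator borrows from `self`. The `'_` placeholder makes the
    // (still elided) lifetime visible in the signature.
    fn iter(&self) -> std::slice::Iter<'_, u8> {
        self.bytes.iter()
    }
}

/// Sums all bytes, widening to u32 to avoid overflow.
fn sum(buf: &Buffer) -> u32 {
    buf.iter().map(|b| u32::from(*b)).sum()
}

fn main() {
    let buf = Buffer { bytes: vec![1, 2, 3] };
    println!("{}", sum(&buf));
}
```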
2025-08-18 05:04:55 +00:00
Thomas Eizinger
507a8957c2 chore(connlib): only debug-assert non-retransmitted DNS queries (#10136)
When we receive the same TCP DNS query twice, we currently wrongly hit a
debug assert.
2025-08-06 11:26:51 +00:00
Thomas Eizinger
2841fd0017 chore(connlib): spawn dedicated tasks for UDP send/recv (#10147)
At the moment, `connlib`'s UDP thread spawns a single task for reading
and writing to the UDP socket. It will always first try to write data
before reading new data. To avoid scheduling issues, we split this into
two dedicated tasks and insert

```rust
tokio::task::yield_now().await;
```

into each loop. This allows the `tokio` runtime to schedule each of the
tasks fairly even if one of them is very busy.

For example, if we are very busy writing data (because we are receiving
a lot of IP traffic), this ensures that we will occasionally also read
from our socket to receive STUN control messages from our peers.
2025-08-06 07:38:01 +00:00
Thomas Eizinger
3e46727362 chore(snownet): improve logging of boringtun session index (#10135)
Previously, boringtun's sender/receiver index of a session would just be
rendered as a full u32. In reality, this u32 contains two pieces of
information: The higher 24 bits identify the peer and the lower 8 bits
identify the session with that peer. With the update to boringtun in
https://github.com/firezone/boringtun/pull/112, we encode this logic in
a dedicated type that prints this information separately. Here is
what the logs now look like:

```
2025-08-05T07:38:37.742Z DEBUG boringtun::noise: Received handshake_response local_idx=(3428714|1) remote_idx=(1937676|1)
2025-08-05T07:38:37.743Z DEBUG boringtun::noise: New session idx=(3428714|1)
2025-08-05T07:38:37.743Z DEBUG boringtun::noise: Sending keepalive local_idx=(3428714|1)
```
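
The split described above can be sketched as follows. This is a hypothetical `SessionIndex` wrapper written for illustration, not the type added in the boringtun PR; only the 24/8-bit layout and the `(peer|session)` rendering are taken from the commit message and logs.

```rust
use std::fmt;

/// Hypothetical wrapper around the raw u32 index.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct SessionIndex(u32);

impl SessionIndex {
    /// The higher 24 bits identify the peer.
    fn peer(self) -> u32 {
        self.0 >> 8
    }

    /// The lower 8 bits identify the session with that peer.
    fn session(self) -> u8 {
        (self.0 & 0xFF) as u8
    }
}

impl fmt::Display for SessionIndex {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "({}|{})", self.peer(), self.session())
    }
}

fn main() {
    // Peer 3428714, session 1, packed the same way as in the log lines above.
    let idx = SessionIndex((3428714 << 8) | 1);
    println!("local_idx={idx}");
}
```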
2025-08-05 13:08:32 +00:00
Thomas Eizinger
96579483d8 fix(phoenix-channel): timeout room join after 5s (#10130)
If we fail to join a given room for longer than 5s, we fail the
WebSocket connection and reconnect.
2025-08-05 02:00:26 +00:00
Thomas Eizinger
d1cbf4f76d chore(snownet): fix relay sampling spam (#10127)
When we disconnect from a relay, we currently spam `Failed to sample new
relay for connection` until we connect to a new one.
2025-08-05 00:16:28 +00:00