Commit Graph

1128 Commits

Author SHA1 Message Date
Thomas Eizinger
7689402c50 chore(snownet): print packets of unknown format (#9818)
When receiving UDP packets that we cannot decode we log an error. In
order to identify whether we might have bugs in our decoding logic, we
now also print the hex-encoding of the packet for further analysis on
DEBUG.
2025-07-10 15:11:54 +00:00
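To illustrate the hex-encoding mentioned in #9818, here is a minimal sketch using only the standard library. The helper name `hex_encode` is hypothetical and not taken from `snownet`; the actual logging call is not shown.

```rust
use std::fmt::Write;

/// Hex-encodes a packet so it can be attached to a DEBUG log line.
/// `hex_encode` is a hypothetical helper, not the actual `snownet` code.
fn hex_encode(packet: &[u8]) -> String {
    let mut out = String::with_capacity(packet.len() * 2);
    for byte in packet {
        // Two lowercase hex digits per byte; writing to a `String` cannot fail.
        write!(out, "{:02x}", byte).expect("writing to a String never fails");
    }
    out
}
```

A caller would pass the raw datagram, e.g. `hex_encode(&buffer[..len])`, and include the result in the log event.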
Thomas Eizinger
0c151a2a96 chore(gateway): include ID of unknown peer in error message (#9819)
This will help with diagnosing issues in Sentry.
2025-07-10 14:32:05 +00:00
Thomas Eizinger
f98fcca542 refactor(connlib): directly implement async fn (#9806)
At present, and as a result of how `connlib` evolved, we still implement
a `Poll`-based function for receiving data on our UDP socket. Ever since
we moved to dedicated threads for the UDP socket, we can directly
"block" on receiving datagrams and don't have to poll the socket.

This simplifies the implementation a fair bit. Additionally, it made me
realise that we currently don't expose any errors on the UDP socket.
Likely, those will be ephemeral but it is still better than completely
silencing them.
2025-07-10 13:54:44 +00:00
Thomas Eizinger
237bd62b20 fix(snownet): don't generate candidates of mixed IP version (#9804)
When we shipped the feature of optimistic server-reflexive candidates, we
failed to add a check to only combine address and base such that they
are the same IP version. This is not harmful but unnecessary noise.
2025-07-07 22:47:40 +00:00
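The check described in #9804 can be sketched as follows. This assumes an optimistic candidate is formed from the IP of a server-reflexive address and the port of a host address (the base); the function name and signature are illustrative only and do not reflect the `snownet` API.

```rust
use std::net::SocketAddr;

/// Builds an optimistic server-reflexive address from the IP of `srflx` and
/// the port of `base`, but only if both share the same IP version.
/// Name and signature are assumptions made for this illustration.
fn optimistic_candidate(srflx: SocketAddr, base: SocketAddr) -> Option<SocketAddr> {
    if srflx.is_ipv4() != base.is_ipv4() {
        return None; // Mixed IP versions would only add noise.
    }

    Some(SocketAddr::new(srflx.ip(), base.port()))
}
```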
Thomas Eizinger
e5fb6adbb4 fix(connlib): always signal server-reflexive candidates (#9802)
When we create a new connection, we seed the local ICE agent with all
known local candidates, i.e. host addresses and allocations on relays.
Server-reflexive candidates are never added to the local agent because
you cannot send directly from a server-reflexive address. Instead, an
agent sends from the _base_ of a server-reflexive candidate which in
turn is known as a host candidate.

The server-reflexive candidate is however signaled to the remote so it
can try and send packets to it. Those will then be mapped by the NAT to
our host candidate.

In case we have just performed a network reset, our own server-reflexive
candidate may not be known yet and therefore the seeding doesn't add any
candidates. With no candidates being seeded, we also can't signal them
to the remote.

For candidates discovered later in this process, the signalling happens
as part of adding them to the local agent. Because server-reflexive
candidates are not added to the local agent, we currently miss out on
signaling those to the remote IF they weren't already present when the
ICE agent got created.

This scenario can happen right after a network reset. In practice, it
shouldn't be much of an issue though. As soon as we start sending from
our host candidate, the remote will create a peer-reflexive candidate
for it. It is however cleaner to directly send the server-reflexive
candidate once we discover it.
2025-07-07 22:46:46 +00:00
Thomas Eizinger
c48ed2a1a0 feat(connlib): introduce 2s grace-period upon ICE disconnect (#9793)
When Firezone detects that the user is switching networks, we perform an
internal reset where we clear all connections and also all local
candidates. As part of the reset, we then send STUN requests to our
relays to re-discover our host and server-reflexive candidates. In this
scenario, the Gateway is still connected to its network and is therefore
able to send its candidates as soon as it receives the connection intent
from the portal.

This opens us up to the following race condition which leads to a
false-positive "ICE timeout":

1. Client roams network and clears all local state.
2. Client sends STUN binding requests to relays.
3. Client initiates a new connection.
4. Gateway acknowledges connection.
5. Client creates new ICE agent and attempts to seed it with local
candidates. We don't have a response from the relays yet and therefore
don't have any local candidates.
6. Client receives remote candidates and adds them to the agent.
7. ICE agent is unable to form pairs and therefore concludes that it is
disconnected.
8. We treat the disconnected event as a connection failure and clear the
connection.
9. Relays respond to STUN binding requests but we cannot add the new
candidates to the connection because it is already cleared.

The ICE spec states that after an agent transitions into the
"disconnected" state, it may transition back to "connected" if e.g. new
candidates are added as those allow the forming of new pairs. In
general, it is recommended to not treat "disconnected" as a permanent
state. To honor this recommendation, we introduce a 2s grace-period in
which we can recover from such a "disconnected" state.
2025-07-05 18:52:59 +00:00
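The grace period from #9793 can be pictured as a small timer that is armed on "disconnected" and cleared on "connected". This is a sketch under assumptions: the type `DisconnectTimer` and its methods are hypothetical, and the real implementation is driven by the ICE agent's events.

```rust
use std::time::{Duration, Instant};

/// Grace period in which a "disconnected" ICE agent may still recover.
const GRACE_PERIOD: Duration = Duration::from_secs(2);

/// Tracks since when the agent has been disconnected (hypothetical type).
#[derive(Default)]
struct DisconnectTimer {
    disconnected_at: Option<Instant>,
}

impl DisconnectTimer {
    /// Called when the ICE agent reports "disconnected".
    fn on_disconnected(&mut self, now: Instant) {
        // Keep the earliest timestamp if we are already disconnected.
        if self.disconnected_at.is_none() {
            self.disconnected_at = Some(now);
        }
    }

    /// Called when the ICE agent reports "connected" again, e.g. after new
    /// candidates allowed it to form a working pair.
    fn on_connected(&mut self) {
        self.disconnected_at = None;
    }

    /// Only once the grace period has elapsed is the connection treated as failed.
    fn is_failed(&self, now: Instant) -> bool {
        match self.disconnected_at {
            Some(since) => now.duration_since(since) >= GRACE_PERIOD,
            None => false,
        }
    }
}
```

In the race described above, step 7 would arm the timer and step 9 would clear it, provided the relays respond within the 2s window.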
Thomas Eizinger
b01984addb fix(phoenix-channel): replace all non-ASCII chars in user agent (#9725)
HTTP headers only reliably support ASCII characters. We include
information like the user's kernel build name in there and therefore
need to strip non-ASCII characters from that to avoid encoding errors.

Fixes: #9706
2025-06-30 15:20:55 +00:00
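A minimal sketch of the sanitisation described in #9725. The replacement character `?` and the function name are assumptions; the commit only states that non-ASCII characters must not end up in the header.

```rust
/// Replaces every non-ASCII character with `?` so the value is safe to use
/// in an HTTP header. The choice of `?` is an assumption for illustration.
fn ascii_only(input: &str) -> String {
    input
        .chars()
        .map(|c| if c.is_ascii() { c } else { '?' })
        .collect()
}
```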
Thomas Eizinger
178a9da24d chore(rust): bump tokio-tungstenite (#9711) 2025-06-30 14:18:36 +00:00
Thomas Eizinger
0a14d72646 fix(phoenix-channel): don't pipeline messages (#9716)
In #9656, we already tried to fix the pipelining of messages to the
portal. Unfortunately, a bug was introduced in a last-minute refactoring
where we would _only_ send messages while we were joining a room. Due to a
2nd bug where we weren't actually processing the room join replies
correctly, this didn't matter so the PR was effectively a no-op and
didn't change any behaviour.

Further investigation of the code surfaced additional problems. For one,
we were not re-queuing the message into the correct buffer. Two, we were
only flushing after sending a message.

To fix both of these, we move the flushing out of the message sending
branch completely and duplicate some of the code for sending messages in
order to correctly handle join requests before other messages.

Finally, join requests have an _empty_ payload and are therefore
processed in a different branch. By moving the checking for the replies
of join requests, we can correctly update the state and continue sending
messages once the join is successful.

Resolves: #9647
2025-06-30 13:18:34 +00:00
Thomas Eizinger
8cfc7ad865 chore(snownet): add more logging for connections (#9695)
In a recent release, `str0m` downgraded all INFO logs to DEBUG. Whilst
generally appreciated, it means we don't have a lot of visibility
anymore into which candidates are being exchanged and what the ICE
credentials of the connections are.

We re-add this information to our existing logs when creating and
updating connections.
2025-06-27 18:11:45 +00:00
Thomas Eizinger
2eedc23b82 chore(snownet): embed more context in WireGuard errors (#9687) 2025-06-26 15:49:07 +00:00
Thomas Eizinger
46931e0a68 chore(connlib): display WireGuardError using fmt::Display (#9686)
We've since added an `fmt::Display` implementation for these errors in
our `boringtun` fork so we can make use of it in our error
implementation.
2025-06-26 14:47:36 +00:00
Thomas Eizinger
5f38ccaeab feat(gateway): free TCP NAT bindings on RSTs (#9682)
Whenever we see a TCP packet with the RST bit set, we clear the current
NAT binding and move it to the `expired` list.
2025-06-26 14:20:01 +00:00
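To make the RST handling from #9682 concrete: the RST flag is bit `0x04` of the flags byte at offset 13 of the TCP header. The sketch below checks that bit and moves a binding to an `expired` list. `NatTable`, its fields, and the `u32` binding key are hypothetical stand-ins for the Gateway's actual data structures.

```rust
use std::collections::HashMap;

/// Offset of the flags byte within a TCP header.
const TCP_FLAGS_OFFSET: usize = 13;
/// Bit mask of the RST flag within the flags byte.
const TCP_RST: u8 = 0x04;

/// Returns whether the given TCP header has the RST bit set.
fn is_rst(tcp_header: &[u8]) -> bool {
    match tcp_header.get(TCP_FLAGS_OFFSET) {
        Some(flags) => (flags & TCP_RST) != 0,
        None => false, // Truncated header: nothing to inspect.
    }
}

/// Hypothetical NAT table keyed by an opaque binding ID.
#[derive(Default)]
struct NatTable {
    active: HashMap<u32, String>,
    expired: Vec<(u32, String)>,
}

impl NatTable {
    /// Moves the binding to the `expired` list if the packet is a RST.
    fn handle_tcp(&mut self, binding: u32, tcp_header: &[u8]) {
        if !is_rst(tcp_header) {
            return;
        }

        if let Some(entry) = self.active.remove(&binding) {
            self.expired.push((binding, entry));
        }
    }
}
```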
Thomas Eizinger
eddc4b95fb docs(connlib): explain why DNS resource NAT needs L4 component (#9675) 2025-06-25 20:26:07 +00:00
Thomas Eizinger
f435510dab fix(connlib): wait for room join before sending messages (#9656)
To avoid race conditions, we wait for all room joins on the WebSocket to
be successful before sending any messages to the portal. This requires
us to split room join messages from other messages so we can still send
them separately.

Resolves: #9647
2025-06-25 17:34:53 +00:00
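The intended behaviour of #9656 (join requests first, everything else only after all joins are confirmed) can be sketched with two queues. `Outbox` and its methods are hypothetical and much simpler than `phoenix-channel`; messages are plain strings here.

```rust
use std::collections::{HashSet, VecDeque};

/// Hypothetical outbox that holds back messages until all rooms are joined.
#[derive(Default)]
struct Outbox {
    pending_joins: HashSet<String>,
    join_requests: VecDeque<String>,
    messages: VecDeque<String>,
}

impl Outbox {
    /// Queues a join request for `topic` and remembers that it is unconfirmed.
    fn join(&mut self, topic: &str) {
        self.pending_joins.insert(topic.to_owned());
        self.join_requests.push_back(topic.to_owned());
    }

    /// Queues a regular message.
    fn send(&mut self, message: &str) {
        self.messages.push_back(message.to_owned());
    }

    /// Marks the join of `topic` as confirmed by the portal.
    fn on_join_reply(&mut self, topic: &str) {
        self.pending_joins.remove(topic);
    }

    /// Join requests always go first; other messages wait for all joins.
    fn next_to_send(&mut self) -> Option<String> {
        if let Some(topic) = self.join_requests.pop_front() {
            return Some(format!("join:{}", topic));
        }

        if !self.pending_joins.is_empty() {
            return None;
        }

        self.messages.pop_front()
    }
}
```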
Thomas Eizinger
bf03e13cf0 feat(gateway): vary DNS resource NAT TTL by protocol (#9655)
Instead of a 1 minute TTL for all connections, we vary the TTL based on
the protocol being used. For TCP, that is 2 hours. For UDP and ICMP, we
use 2 minutes.

Resolves: #9645
2025-06-25 17:24:40 +00:00
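The TTL values from #9655 map directly onto a `match`. The enum and function names below are illustrative and not taken from the Gateway code.

```rust
use std::time::Duration;

/// Layer 4 protocol of a NAT entry (illustrative enum).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Protocol {
    Tcp,
    Udp,
    Icmp,
}

/// TTL of a DNS resource NAT entry, varied by protocol.
fn nat_ttl(protocol: Protocol) -> Duration {
    match protocol {
        Protocol::Tcp => Duration::from_secs(2 * 60 * 60), // 2 hours
        Protocol::Udp | Protocol::Icmp => Duration::from_secs(2 * 60), // 2 minutes
    }
}
```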
Thomas Eizinger
d5be185ae4 chore(rust): remove telemetry spans and events (#9634)
Originally, we introduced these to gather some data from logs / warnings
that we considered to be too spammy. We've since merged a
burst-protection that will at most submit the same event once every 5
minutes.

The data from the telemetry spans themselves has not been used at all.
2025-06-25 17:15:57 +00:00
Thomas Eizinger
4be73da21c fix(gateway): reply with cookie when rate limit is hit (#9657)
WireGuard implements a rate-limit mechanism when the number of handshake
initiations exceeds a certain limit. This is important because
handshakes involve asymmetric cryptography and are computationally
expensive. To prevent DoS attacks where other peers repeatedly ask for
new handshakes, the rate limiter implements a cookie mechanism where -
when under load - the remote peer needs to include a given cookie in new
handshakes. This cookie is tied to the peer's IP address to prevent it
from being reused by other peers.

Up until now, we have not been passing the sender's IP address to
`boringtun` and therefore, the only option when the rate limit was hit
was to error with `UnderLoad`.

By passing the source IP of the packet, `boringtun` can engage in the
cookie-reply mechanism and therefore avoid the `UnderLoad` error.

Resolves: #9643
2025-06-24 11:33:38 +00:00
Thomas Eizinger
91edd11a47 feat(gateway): send $identify event with account-slug (#9658)
When we receive the `account_slug` from the portal, the Gateway now
sends a `$identify` event to PostHog. This will allow us to target
Gateways with feature-flags based on the account they are connected to.
2025-06-24 11:31:56 +00:00
Thomas Eizinger
3c0e866e77 feat(connlib): listen on 52625 by default (#9593)
Presently, `connlib` always just lets the OS pick a random port for our
UDP socket. This works well in many cases but has the downside that IF
network admins would like to aid in the process of establishing direct
connections, they cannot open a specific port because it is always
random.

It doesn't cost us anything to try and bind to a particular port (here
52625) and fall back to a random one if something is already listening there.

The port 52625 was chosen because:

- It is within the ephemeral port range and will therefore never be
registered to anything else.
- It is a palindrome and therefore easy to remember.
- When typing FIRE on a phone keypad, you get the numbers 3473. 52625
is the port at the offset 3473 from the ephemeral port range.

In order for this port to be useful in establishing direct connections,
we generate optimistic candidates based on existing remote candidates by
combining the IP of all server-reflexive candidates with the port of all
host candidates.

This patch deliberately does not publicly announce this feature in the
docs or the changelog so we can first gather experience with it in our
own test environment.

Resolves: #9559
2025-06-24 08:41:08 +00:00
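The port arithmetic and the bind-with-fallback from #9593 can be sketched as below. The ephemeral range is taken to start at 49152 (the IANA dynamic port range); the constant and function names are assumptions, and `connlib`'s actual socket setup is more involved.

```rust
use std::io;
use std::net::UdpSocket;

/// Start of the IANA dynamic / ephemeral port range (49152-65535).
const EPHEMERAL_START: u16 = 49152;
/// "FIRE" on a phone keypad: F=3, I=4, R=7, E=3.
const FIRE_OFFSET: u16 = 3473;
/// 49152 + 3473 = 52625
const PREFERRED_PORT: u16 = EPHEMERAL_START + FIRE_OFFSET;

/// Tries the preferred port first and falls back to an OS-assigned port
/// if binding fails, e.g. because something else is already listening there.
fn bind_preferred() -> io::Result<UdpSocket> {
    UdpSocket::bind(("0.0.0.0", PREFERRED_PORT)).or_else(|_| UdpSocket::bind(("0.0.0.0", 0)))
}
```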
Thomas Eizinger
a91dda139f feat(connlib): only conditionally hash firezone ID (#9633)
A bit of legacy that we have inherited around our Firezone ID is that
the ID stored on the user's device is sha'd before being passed to the
portal as the "external ID". This makes it difficult to correlate IDs in
Sentry and PostHog with the data we have in the portal. For Sentry and
PostHog, we submit the raw UUID stored on the user's device.

As a first step in overcoming this, we embed an "external ID" in those
services as well IF the provided Firezone ID is a valid UUID. This will
allow us to immediately correlate those events.

As a second step, we automatically generate all new Firezone IDs for the
Windows and Linux Client as `hex(sha256(uuid))`. These won't parse as
valid UUIDs and therefore will be submitted as is to the portal.

As a third step, we update all documentation around generating Firezone
IDs to use `uuidgen | sha256` instead of just `uuidgen`. This is
effectively the equivalent of (2) but for the Headless Client and
Gateway where the Firezone ID can be configured via environment
variables.

Resolves: #9382

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
2025-06-24 07:05:48 +00:00
Thomas Eizinger
686918f1d1 chore(rust): bump str0m (#9591)
The latest `main` of str0m undoes a breaking change in the constructor
of `Candidate::relayed` by flipping the parameters back. This will make
it easier to upgrade to the latest release once it is out.
2025-06-24 06:57:55 +00:00
Thomas Eizinger
1bd3d2a382 chore(gateway): remove NAT64/46 module (#9626)
This has been disabled for several releases now and is not causing any
problems in production. We can therefore safely remove it.

It is about time we do this because our tests are actually still testing
the variant without the feature flag and therefore deviate from what we
do in production. We therefore have to convert the tests as well. Doing
so uncovered a minor problem in our ICMP error parsing code: We
attempted to parse the payload of an ICMP error as a fully-valid layer 4
header (e.g. TCP header or UDP header). However, per the RFC a node only
needs to embed the first 8 bytes of the original packet in an ICMPv4
error. That is not enough to parse a valid TCP header as those are at
least 20 bytes.

I don't expect this to be a huge problem in production right now though.
We only use this code to parse ICMP errors arriving on the Gateway and I
_think_ most devices actually include more than 8 bytes. This only
surfaced because we are very strict with only embedding exactly 8 bytes
when we generate an ICMP error.

Additionally, we change our ICMP errors to be sent from the resource IP
rather than the Gateway's TUN device. Given that we perform NAT on these
IPs anyway, I think this can still be argued to be RFC-conformant. The
_proxy_ IP which we are trying to contact can be reached but it cannot
be routed further. Therefore the destination is unreachable, yet the
source of this error is the proxy IP itself. I think this is actually
more correct than sending the packets from the Gateway's TUN device
because the TUN device itself is not a routing hop per-se: its IP won't
ever show up in the routing path.
2025-06-24 06:48:30 +00:00
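The ICMP parsing issue from #9626 comes down to this: both TCP and UDP place the source and destination port in the first 4 bytes of their header, so the ports can be recovered even if only 8 bytes of the original layer 4 header are embedded. A minimal sketch, with a hypothetical function name:

```rust
/// Extracts source and destination port from the (possibly truncated) layer 4
/// header embedded in an ICMP error. Works for TCP and UDP because both start
/// with a 16-bit source port followed by a 16-bit destination port.
fn embedded_ports(l4_header: &[u8]) -> Option<(u16, u16)> {
    let src = l4_header.get(0..2)?;
    let dst = l4_header.get(2..4)?;

    Some((
        u16::from_be_bytes([src[0], src[1]]),
        u16::from_be_bytes([dst[0], dst[1]]),
    ))
}
```

Parsing only these fields avoids requiring a full 20-byte TCP header that the sender of the ICMP error is not obliged to include.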
Thomas Eizinger
950afd9b2d chore(gateway): set account-slug in telemetry context (#9545)
This PR adds an optional field `account_slug` to the Gateway's init
message. If populated, we will use this field to set the account-slug in
the telemetry context. This will allow us to know which customers a
particular Sentry issue is related to.
2025-06-23 18:52:39 +00:00
Thomas Eizinger
c8a4a20818 feat(snownet): increase ICE timeout (#9569)
Some of our users are facing issues on what looks to be very unreliable
network connections. At present, we consider a connection dead if we
don't receive a response within 9.25 seconds. Cutting a connection and
re-establishing it _should_ not be a problem in general and TCP
connections happening through Firezone should resume gracefully. Further
work on whether that is actually the case is due in #9531. Until then,
we increase the ICE timeout to ~15s.

Related: #9526
2025-06-18 22:16:32 +00:00
Thomas Eizinger
650cf893ba feat(snownet): decrease idle connection ICE timeout (#9570)
Any well-behaved NAT should keep the port mappings of an established UDP
connection open for 120s, even without seeing any traffic. Not all NATs
in the wild are well-behaved though and a discarded port mapping causes
connectivity loss for customers.

To combat these situations, we decrease the timer for STUN probes on
idle connections from 60s to 25s.

Related: #9526
2025-06-18 16:53:26 +00:00
Thomas Eizinger
d3ff59ab84 chore(rust): bump str0m (#9564)
The recent changes to str0m include a bug fix for network constellations
where both peers are behind symmetric NAT and therefore need a
relay-relay candidate pair to succeed. In the current version, such
candidate pairs would erroneously be rejected as redundant with host
candidates.

Fixes: #9514
2025-06-17 22:04:13 +00:00
Thomas Eizinger
f3dcd06115 chore(snownet): document current ICE timeouts with tests (#9558)
This ensures we always know what the ICE timeouts of the agent are.
With the backoff implemented in the agent, it is not trivial to compute
this from the input parameters.
2025-06-17 21:38:08 +00:00
Jamil
805ba085c2 fix(connlib): re-add resource if ip_stack changes (#9372)
In #9300, we added logic to control whether we emit A and/or AAAA
records for a DNS resource based on the `ip_stack` property of the
`Resource` struct.

Unfortunately this didn't take updates into account when the client was
signed in, so updating a DNS resource's ip_stack failed to update the
client's local Resource copy.

To fix this, we determine if `resource_addressability_changed` which is
true if the resource's address, or ip_stack, has changed meaningfully.
If so, we remove the resource prior to evaluating the remaining logic of
the `resource_created_or_updated` handler, which in turn causes the
resource to be re-added, effectively updating its ip_stack.

Related:
https://github.com/firezone/firezone/pull/9300#issuecomment-2932365798
2025-06-03 03:00:19 +00:00
Thomas Eizinger
218c711789 fix(connlib): don't hard-fail if buffer increase is rejected (#9366)
When `connlib` creates new UDP sockets for the p2p traffic, it tries to
increase the send and receive buffers for improved performance. Failure
to do so currently results in `connlib` failing to start entirely. This
is unnecessarily harsh; we can simply log a warning instead and move on.
2025-06-02 15:20:58 +00:00
Thomas Eizinger
29f8dd8688 fix(connlib): block until UDP thread has been set up (#9363)
Internally, `connlib` spawns a new thread for handling IO on the UDP
socket. In order to make sure that this thread is operational, we
intended to block `connlib`'s main thread until the setup of the UDP
thread has successfully completed.

Unfortunately, this isn't quite the case because we already send an
`Ok(())` value into the channel once we've successfully bound the
socket. Following the binding, we also try to increase the maximum
buffer size of the socket. Even though the intention here was to also
log this error, the error value sent into the channel there is never
read because we only ever read one value from the `error_tx` channel.

To fix this, we move the sending of the `Ok(())` value to the very
bottom of the UDP thread, just before we kick it off. Whilst this does
not fix the actual issue as to why the setup of the UDP thread fails,
these changes will at least surface the error.
2025-06-02 12:37:38 +00:00
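The fix in #9363 follows a common pattern: the spawned thread reports `Ok(())` only after every setup step has succeeded, and the caller blocks on exactly one value. The sketch below uses `std::sync::mpsc` with generic `setup` and `work` closures; `spawn_worker` is a hypothetical helper, not `connlib`'s API.

```rust
use std::io;
use std::sync::mpsc;
use std::thread::{self, JoinHandle};

/// Spawns a thread and blocks until its setup has completed or failed.
fn spawn_worker<S, W>(setup: S, work: W) -> io::Result<JoinHandle<()>>
where
    S: FnOnce() -> io::Result<()> + Send + 'static,
    W: FnOnce() + Send + 'static,
{
    let (result_tx, result_rx) = mpsc::sync_channel(1);

    let handle = thread::spawn(move || {
        // Run *all* setup steps first and report the first failure.
        if let Err(e) = setup() {
            let _ = result_tx.send(Err(e));
            return;
        }

        // Report success at the very end of setup, just before the actual work.
        let _ = result_tx.send(Ok(()));
        work();
    });

    // Exactly one value is read, so it must reflect the outcome of the whole setup.
    match result_rx.recv() {
        Ok(Ok(())) => Ok(handle),
        Ok(Err(e)) => Err(e),
        Err(_) => Err(io::Error::new(
            io::ErrorKind::Other,
            "thread exited during setup",
        )),
    }
}
```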
Thomas Eizinger
e05c98bfca ci: update to new cargo sort release (#9354)
The latest release now also sorts workspace dependencies, as well as
different dependency sections. Keeping these things sorted reduces the
chances of merge conflicts when multiple PRs edit these files.
2025-06-02 02:01:09 +00:00
Thomas Eizinger
02638582fe feat(connlib): allow controlling IP stack per DNS resource (#9300)
With this patch, `connlib` exposes a new, optional field `ip_stack`
within the resource description of each DNS resource that controls the
supported IP stack.

By default, the IP stack is set to `Dual` to preserve the current
behaviour. When set to `IPv4Only` or `IPv6Only`, `connlib` will not
assign any IPv6 or IPv4 addresses, respectively, when receiving DNS queries for such a
resource. The DNS query will still respond successfully with NOERROR
(and not NXDOMAIN) but the list of IPs will be empty.

This is useful to e.g. allow sys-admins to disable IPv6 for resources
with buggy clients such as the MongoDB atlas driver. The MongoDB driver
does not correctly handle happy-eyeballs and instead fails the
connection early on any connection error.

Additionally, customers operating in IPv6-exclusive networks can disable
IPv4 addresses with this setting.

Related: https://jira.mongodb.org/browse/NODE-4678
Related: #9042
Related: #8892
2025-05-31 00:27:59 +00:00
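How the `ip_stack` setting from #9300 gates A and AAAA answers can be sketched as follows. The variant names follow the commit message; `RecordType` and the method `assigns_addresses` are hypothetical.

```rust
/// Supported IP stack of a DNS resource; `Dual` is the default.
#[derive(Clone, Copy, Debug, PartialEq)]
enum IpStack {
    Dual,
    IPv4Only,
    IPv6Only,
}

/// Query type of an incoming DNS query (illustrative subset).
#[derive(Clone, Copy, Debug, PartialEq)]
enum RecordType {
    A,
    Aaaa,
}

impl IpStack {
    /// Whether addresses should be assigned for the given query type.
    /// If `false`, the response is still NOERROR but carries no records.
    fn assigns_addresses(self, qtype: RecordType) -> bool {
        match (self, qtype) {
            (IpStack::Dual, _) => true,
            (IpStack::IPv4Only, RecordType::A) => true,
            (IpStack::IPv6Only, RecordType::Aaaa) => true,
            _ => false,
        }
    }
}
```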
Thomas Eizinger
e6f13a124a fix(connlib): optimise logging of activated CIDR resources (#9293)
Instead of always logging when CIDR resources change, we add an
additional condition to the already existing `Activated resource` log
that suppresses it in case the currently active CIDR resource is
actively routing traffic.

Resolves: #9281
2025-05-29 02:19:33 +00:00
dependabot[bot]
82d097baa0 build(deps): bump domain from 0.10.4 to 0.11.0 in /rust (#9274)
Bumps [domain](https://github.com/nlnetlabs/domain) from 0.10.4 to
0.11.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/nlnetlabs/domain/releases">domain's
releases</a>.</em></p>
<blockquote>
<h2>Release 0.11.0</h2>
<p>Breaking changes</p>
<ul>
<li>FIX: Use base 16 per RFC 4034 for the DS digest, not base 64. (<a
href="https://redirect.github.com/nlnetlabs/domain/issues/423">#423</a>)</li>
<li>FIX: NSEC3 salt strings should only be accepted if within the salt
size limit. (<a
href="https://redirect.github.com/nlnetlabs/domain/issues/431">#431</a>)</li>
<li>Stricter RFC 1035 compliance by default in the <code>Zonefile</code>
parser. (<a
href="https://redirect.github.com/nlnetlabs/domain/issues/477">#477</a>)</li>
<li>Rename {DigestAlg, Nsec3HashAlg, SecAlg, ZonemdAlg} to
{DigestAlgorithm, Nsec3HashAlgorithm, SecurityAlgorithm,
ZonemdAlgorithm}</li>
</ul>
<p>New</p>
<ul>
<li>Added <code>HashCompressor</code>, an unlimited name compressor that
uses a hash map rather than a tree. (<a
href="https://redirect.github.com/nlnetlabs/domain/issues/396">#396</a>)</li>
<li>Changed <code>fmt::Display</code> for <code>HINFO</code> records to
show a quoted string. (<a
href="https://redirect.github.com/nlnetlabs/domain/issues/421">#421</a>)</li>
<li>Added support for <code>NAPTR</code> record type. (<a
href="https://redirect.github.com/nlnetlabs/domain/issues/427">#427</a>
by [<a
href="https://github.com/weilence"><code>@​weilence</code></a>])</li>
<li>Added initial fuzz testing support for some types via a new
<code>arbitrary</code> feature (not enabled by default). (<a
href="https://redirect.github.com/nlnetlabs/domain/issues/441">#441</a>)</li>
<li>Added <code>StubResolver::add_connection()</code> to allow adding a
connection to the running resolver. In combination with
<code>ResolvConf::new()</code> this can also be used to control the
connections made when testing code that uses the stub resolver. (<a
href="https://redirect.github.com/nlnetlabs/domain/issues/440">#440</a>)</li>
<li>Added <code>ZonefileFmt</code> trait for printing records as
zonefiles. (<a
href="https://redirect.github.com/nlnetlabs/domain/issues/379">#379</a>,
<a
href="https://redirect.github.com/nlnetlabs/domain/issues/446">#446</a>,
<a
href="https://redirect.github.com/nlnetlabs/domain/issues/463">#463</a>)</li>
</ul>
<p>Bug fixes</p>
<ul>
<li>NSEC records should include themselves in the generated bitmap. (<a
href="https://redirect.github.com/nlnetlabs/domain/issues/417">#417</a>)</li>
<li>Trailing double quote wrongly preserved when parsing record data.
(<a
href="https://redirect.github.com/nlnetlabs/domain/issues/470">#470</a>,
<a
href="https://redirect.github.com/nlnetlabs/domain/issues/472">#472</a>)</li>
<li>Don't error with unexpected end of entry for RFC 3597 RDATA of
length zero. (<a
href="https://redirect.github.com/nlnetlabs/domain/issues/475">#475</a>)</li>
</ul>
<p>Unstable features</p>
<ul>
<li>
<p>New unstable feature <code>unstable-crypto</code> that enables
cryptography support for features that do not rely on secret keys. This
feature needs either or both of the features <code>ring</code> and
<code>openssl</code> (<a
href="https://redirect.github.com/nlnetlabs/domain/issues/416">#416</a>)</p>
</li>
<li>
<p>New unstable feature <code>unstable-crypto-sign</code> that enables
cryptography support including features that rely on secret keys. This
feature needs either or both of the features <code>ring</code> and
<code>openssl</code> (<a
href="https://redirect.github.com/nlnetlabs/domain/issues/416">#416</a>)</p>
</li>
<li>
<p>New unstable feature <code>unstable-client-cache</code> that enables
the client transport cache. The reason is that the client cache uses the
<code>moka</code> crate.</p>
</li>
<li>
<p>New unstable feature <code>unstable-new</code> that introduces a new
API for all of domain (currently only with <code>base</code>,
<code>rdata</code>, and <code>edns</code> modules). Also see the
[associated blog post][new-base-post].</p>
</li>
<li>
<p><code>unstable-server-transport</code></p>
<ul>
<li>The trait <code>SingleService</code> which is a simplified service
trait for requests that should generate a single response (<a
href="https://redirect.github.com/nlnetlabs/domain/issues/353">#353</a>).</li>
<li>The trait <code>ComposeReply</code> and an implementation of the
trait (<code>ReplyMessage</code>) to assist in capturing EDNS(0) options
that should be included in a response message (<a
href="https://redirect.github.com/nlnetlabs/domain/issues/353">#353</a>).</li>
<li>Adapters to implement <code>Service</code> for
<code>SingleService</code> and to implement <code>SingleService</code>
for <code>SendRequest</code> (<a
href="https://redirect.github.com/nlnetlabs/domain/issues/353">#353</a>).</li>
<li>Conversion of a <code>Request</code> to a
<code>RequestMessage</code> (<a
href="https://redirect.github.com/nlnetlabs/domain/issues/353">#353</a>).</li>
<li>A sample query router, called <code>QnameRouter</code>, that routes
requests based on the QNAME field in the request (<a
href="https://redirect.github.com/nlnetlabs/domain/issues/353">#353</a>).</li>
</ul>
</li>
<li>
<p><code>unstable-client-transport</code></p>
<ul>
<li>introduce timeout option in multi_stream (<a
href="https://redirect.github.com/nlnetlabs/domain/issues/424">#424</a>).</li>
<li>improve probing in redundant (<a
href="https://redirect.github.com/nlnetlabs/domain/issues/424">#424</a>).</li>
<li>restructure configuration for multi_stream and redundant (<a
href="https://redirect.github.com/nlnetlabs/domain/issues/424">#424</a>).</li>
<li>introduce a load balancer client transport. This transport tries to
distribute requests equally over upstream transports (<a
href="https://redirect.github.com/nlnetlabs/domain/issues/425">#425</a>).</li>
<li>the client cache now has its own feature
<code>unstable-client-cache</code>.</li>
</ul>
</li>
<li>
<p><code>unstable-sign</code></p>
<ul>
<li>add key lifecycle management (<a
href="https://redirect.github.com/nlnetlabs/domain/issues/459">#459</a>).</li>
<li>add support for adding NSEC3 records when signing.</li>
<li>add support for ZONEMD.</li>
</ul>
</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/NLnetLabs/domain/blob/main/Changelog.md">domain's
changelog</a>.</em></p>
<blockquote>
<h2>0.11.0</h2>
<p>Released 2025-05-21.</p>
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="84152353c9"><code>8415235</code></a>
Release 0.11.0 (<a
href="https://redirect.github.com/nlnetlabs/domain/issues/533">#533</a>)</li>
<li><a
href="16d7f364ce"><code>16d7f36</code></a>
Revert boxing 'ring::sign::KeyPair' (<a
href="https://redirect.github.com/nlnetlabs/domain/issues/532">#532</a>)</li>
<li><a
href="51a8360649"><code>51a8360</code></a>
Bump openssl from 0.10.71 to 0.10.72 (<a
href="https://redirect.github.com/nlnetlabs/domain/issues/512">#512</a>)</li>
<li><a
href="1f9de15431"><code>1f9de15</code></a>
Introduce <code>domain::new</code> (<a
href="https://redirect.github.com/nlnetlabs/domain/issues/474">#474</a>)</li>
<li><a
href="72b42a3991"><code>72b42a3</code></a>
Adjust for Clippy 1.87 lints (<a
href="https://redirect.github.com/nlnetlabs/domain/issues/530">#530</a>)</li>
<li><a
href="a0bf99c922"><code>a0bf99c</code></a>
Merge pull request <a
href="https://redirect.github.com/nlnetlabs/domain/issues/515">#515</a>
from NLnetLabs/ends-to-edns</li>
<li><a
href="8e4280af39"><code>8e4280a</code></a>
Don't panic on mismatched private and public keys. (<a
href="https://redirect.github.com/nlnetlabs/domain/issues/528">#528</a>)</li>
<li><a
href="473f871036"><code>473f871</code></a>
Pass &amp;N instead of N and also remove thereby an unnecessary clone().
(<a
href="https://redirect.github.com/nlnetlabs/domain/issues/526">#526</a>)</li>
<li><a
href="2a390420af"><code>2a39042</code></a>
Remove incorrect logic for determining the apex from signing function.
(<a
href="https://redirect.github.com/nlnetlabs/domain/issues/521">#521</a>)</li>
<li><a
href="f43d53d010"><code>f43d53d</code></a>
Remove no longer needed mut on GenerateNsec3Config and SigningConfig
which ha...</li>
<li>Additional commits viewable in <a
href="https://github.com/nlnetlabs/domain/compare/v0.10.4...v0.11.0">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=domain&package-manager=cargo&previous-version=0.10.4&new-version=0.11.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)


Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-05-28 10:37:16 +00:00
Thomas Eizinger
6165555add build(deps): bump Rust to 1.87.0 (#9159) 2025-05-16 01:58:17 +00:00
Thomas Eizinger
ce06996a14 fix(connlib): allow more than one host candidate per IP version (#9147)
Currently, on machines that have multiple routable egress interfaces,
`connlib` may bounce between the two instead of settling on one. This
happens because we have a dedicated `CandidateSet` that we use to filter
out "duplicate" candidates of the same type. Doing that is important
because if the other party is behind a symmetric NAT, they will send us
many server-reflexive candidates that differ only by their port,
none of which will actually be routable.

To prevent sending many of these candidates to the remote, we first
gather them locally in our `CandidateSet` and de-duplicate them.
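
The de-duplication described above could look roughly like the following. This is a minimal sketch, not `connlib`'s actual code: the `Kind`, `Candidate` and `CandidateSet` types here are simplified stand-ins, and treating server-reflexive candidates with the same IP as duplicates is an assumption made for illustration.

```rust
use std::collections::HashSet;
use std::net::SocketAddr;

/// Simplified candidate kinds (hypothetical, for illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Kind {
    Host,
    ServerReflexive,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct Candidate {
    pub kind: Kind,
    pub addr: SocketAddr,
}

#[derive(Debug, Default)]
pub struct CandidateSet {
    inner: HashSet<Candidate>,
}

impl CandidateSet {
    /// Returns `true` if the candidate is new and should be signalled to the remote.
    pub fn insert(&mut self, new: Candidate) -> bool {
        if new.kind == Kind::ServerReflexive {
            // Assumption: server-reflexive candidates that differ only by
            // port are considered duplicates of each other.
            let same_ip_known = self
                .inner
                .iter()
                .any(|c| c.kind == Kind::ServerReflexive && c.addr.ip() == new.addr.ip());

            if same_ip_known {
                return false;
            }
        }

        // Host candidates are only rejected if they are exact duplicates, so
        // several host candidates of the same IP version can coexist.
        self.inner.insert(new)
    }
}
```

With this shape, two host candidates on different interfaces are both kept, while a second server-reflexive candidate on the same IP is dropped.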
2025-05-15 02:08:53 +00:00
Thomas Eizinger
09191549eb test(rust): don't delay delivering scheduled Transmits (#9129)
To simulate varying network conditions in our tests, each `Host` in our
test network has an "inbox" that contains all incoming network packets
with an added latency. When another host sends a packet, the packet
gets added to the inbox. Internally, the inbox has a binary heap that
sorts incoming `Transmits` by their latency and only delivers them to
the node when that delay is up.

Currently, this delivery doesn't always happen because we fail to take
into account the timestamp at which the next `Transmit` is due when we
figure out what to do next.

Instead of just looking at the inner state via `poll_transmit`, we now
also consult the inbox of messages as to when the next message is due
and wake up at the correct time.

Not doing this caused our state machine to think that packets got
dropped because `REFRESH` messages to the relays were timing out.
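
As an illustration of the idea, here is a minimal sketch of an inbox backed by a binary heap, plus a helper that picks the earlier of the node's own timeout and the inbox's next due time. All names (`Inbox`, `Scheduled`, `next_wakeup`) are hypothetical and not taken from the test harness.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::time::{Duration, Instant};

/// A packet together with the instant at which it may be delivered.
/// Field order matters: the derived `Ord` compares `due` first.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Scheduled {
    due: Instant,
    payload: Vec<u8>,
}

#[derive(Default)]
pub struct Inbox {
    // `Reverse` turns the max-heap into a min-heap on `due`.
    heap: BinaryHeap<Reverse<Scheduled>>,
}

impl Inbox {
    pub fn push(&mut self, now: Instant, latency: Duration, payload: Vec<u8>) {
        let due = now + latency;
        self.heap.push(Reverse(Scheduled { due, payload }));
    }

    /// When the next packet becomes deliverable, if any.
    pub fn next_due(&self) -> Option<Instant> {
        self.heap.peek().map(|Reverse(s)| s.due)
    }

    /// Pops the next packet only if its delay has elapsed.
    pub fn pop_due(&mut self, now: Instant) -> Option<Vec<u8>> {
        if self.next_due()? > now {
            return None;
        }

        self.heap.pop().map(|Reverse(s)| s.payload)
    }
}

/// The next wake-up is the earlier of the node's timeout and the inbox.
pub fn next_wakeup(node_timeout: Option<Instant>, inbox: &Inbox) -> Option<Instant> {
    match (node_timeout, inbox.next_due()) {
        (Some(a), Some(b)) => Some(a.min(b)),
        (a, b) => a.or(b),
    }
}
```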

Resolves: #9118
2025-05-14 05:37:05 +00:00
Thomas Eizinger
45924eb90b fix(connlib): ignore scopes for IPv6 link-local addresses (#9115)
To send UDP DNS queries to upstream DNS servers, we have a
`UdpSocket::handshake` function that turns a UDP socket into a
single-use object where exactly one datagram is expected from the
address we send a message to. The way this is enforced is via an
equality check.

It appears that this equality check fails if users run an upstream DNS
server on a link-local IPv6 address within a setup that utilises IPv6
scopes. At the time when we receive the response, the packet has already
been successfully routed back to us so we should accept it, even if we
didn't specify a scope as the destination address.
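
A scope-insensitive comparison could be sketched as follows. The function names are hypothetical, and for simplicity this version ignores the scope ID for every IPv6 address, not just link-local ones; the actual change may be narrower.

```rust
use std::net::{SocketAddr, SocketAddrV6};

/// Returns the address with its IPv6 scope ID reset to 0.
fn without_scope(addr: SocketAddr) -> SocketAddr {
    match addr {
        SocketAddr::V4(_) => addr,
        SocketAddr::V6(v6) => {
            SocketAddr::V6(SocketAddrV6::new(*v6.ip(), v6.port(), v6.flowinfo(), 0))
        }
    }
}

/// Compares the address we sent to with the address a datagram came from,
/// ignoring any IPv6 scope ID on either side.
pub fn is_from_expected_peer(expected: SocketAddr, received_from: SocketAddr) -> bool {
    without_scope(expected) == without_scope(received_from)
}
```

A plain `==` on `SocketAddr` also compares the scope ID, which is why an unscoped destination does not equal the scoped source of the reply.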
2025-05-13 13:33:28 +00:00
Thomas Eizinger
b8738448df refactor(connlib): forward error from source IP resolver (#9116)
In order to avoid routing loops on Windows, our UDP and TCP sockets in
`connlib` embed a "source IP resolver" that finds the "next best"
interface after our TUN device according to Windows' routing metrics.
This ensures that packets don't get routed back into our TUN device.

Currently, errors during this process are only logged on TRACE and
therefore not visible in Sentry. We fix this by moving around some of
the function interfaces and forward the error from the source IP
resolver together with some context of the destination IP.
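
Forwarding the error with the destination IP as context might look like this. It is a sketch only: `resolve_source_ip` and the closure-based resolver are hypothetical and do not mirror the real function interfaces.

```rust
use std::io;
use std::net::IpAddr;

/// Runs the (hypothetical) source IP resolver and, on failure, returns the
/// error to the caller with the destination IP attached instead of only
/// logging it.
pub fn resolve_source_ip(
    dst: IpAddr,
    resolver: impl Fn(IpAddr) -> io::Result<IpAddr>,
) -> io::Result<IpAddr> {
    resolver(dst).map_err(|e| {
        io::Error::new(
            e.kind(),
            format!("failed to find source IP for {dst}: {e}"),
        )
    })
}
```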
2025-05-13 13:33:15 +00:00
Thomas Eizinger
945fed8e9d chore(phoenix-channel): downgrade log about dropped messages (#9092)
This can easily happen if we are briefly disconnected from the portal.
It is not the end of the world and not worth creating Sentry alerts for.

Originally, this was intended to be a way of detecting "bad
connectivity" but that didn't really work.
2025-05-12 11:40:40 +00:00
Thomas Eizinger
f01fd4ddf6 fix(connlib): clear pending sockets on DNS server re-creation (#9093)
Our DNS over TCP implementation uses `smoltcp` which requires us to
manage sockets individually, i.e. there is no such thing as a listening
socket. Instead, we have to create multiple sockets and rotate through
them.

Whenever we receive new DNS servers from the host app, we throw away all
of those sockets and create new ones.

The way we refer to these sockets internally is via `smoltcp`'s
`SocketHandle`. These are just indices into a `Vec` and this access can
panic when it is out of range. Normally that doesn't happen because such
a `SocketHandle` is only created when the socket is created and
therefore, each `SocketHandle` in existence should be valid.

What we overlooked is that these sockets get destroyed and re-created
when we call `set_listen_addresses` which happens when the host app
tells us about new DNS servers. In that case, sockets that we had just
received a query on and are waiting for a response have their handles
stored in a temporary `HashMap`. Attempting to send back a response for
one of those queries will then either fail with an error that the socket
is not in the right state or - worse - panic with an out of bounds error
if we previously had more listen addresses than we have now.

To fix this, we need to clear this map of pending queries every time we
call `set_listen_addresses`.
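
The failure mode and the fix can be sketched with plain standard-library types. `Handle` stands in for `smoltcp`'s `SocketHandle`, and the `Server` struct, its fields and `respond` are hypothetical simplifications.

```rust
use std::collections::HashMap;

/// Stand-in for `smoltcp`'s `SocketHandle`: an index into the socket list.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct Handle(pub usize);

#[derive(Default)]
pub struct Server {
    /// One entry per socket; here just the listen address it belongs to.
    sockets: Vec<String>,
    /// Query ID -> socket on which the query arrived.
    pending: HashMap<u16, Handle>,
}

impl Server {
    pub fn set_listen_addresses(&mut self, addrs: &[&str]) {
        // All sockets are destroyed and re-created ...
        self.sockets = addrs.iter().map(|a| a.to_string()).collect();
        // ... so every stored handle is stale and must be dropped too.
        self.pending.clear();
    }

    pub fn on_query(&mut self, query_id: u16, handle: Handle) {
        self.pending.insert(query_id, handle);
    }

    /// Returns the listen address to respond from, if the query is still known.
    pub fn respond(&mut self, query_id: u16) -> Option<&str> {
        let handle = self.pending.remove(&query_id)?;

        self.sockets.get(handle.0).map(String::as_str)
    }
}
```

Without the `clear()` call, a response for a query received before the re-creation would look up a handle that may point past the end of the new socket list.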
2025-05-12 11:39:59 +00:00
Thomas Eizinger
7e4fe68485 fix(connlib): take into account header overhead for GSO (#9088)
When calculating the maximum size of the UDP payload we can send in a
single syscall, we need to take into account the overhead of the IP and
UDP headers.
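
The arithmetic can be sketched as below. The constants assume a 65,535-byte limit on the whole IP packet, an IPv4 header without options and a fixed 40-byte IPv6 header; these values and the function names are assumptions for illustration, not the values used in `connlib`.

```rust
const MAX_IP_PACKET: usize = 65_535; // Assumed upper bound for one send.
const IPV4_HEADER: usize = 20; // IPv4 header without options.
const IPV6_HEADER: usize = 40; // Fixed IPv6 header.
const UDP_HEADER: usize = 8;

/// Largest UDP payload that fits once both headers are accounted for.
pub fn max_udp_payload(is_ipv6: bool) -> usize {
    let ip_header = if is_ipv6 { IPV6_HEADER } else { IPV4_HEADER };

    MAX_IP_PACKET - ip_header - UDP_HEADER
}

/// Number of equally sized segments that fit into a single batch.
pub fn max_segments(is_ipv6: bool, segment_size: usize) -> usize {
    max_udp_payload(is_ipv6)
        .checked_div(segment_size)
        .unwrap_or(0)
}
```

For a segment size of 1285 bytes, ignoring the headers yields 51 segments (51 * 1285 = 65,535), whereas the header-aware calculation yields 50 for IPv4.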
2025-05-12 11:36:10 +00:00
Jamil
537295d8a3 fix(rust): Downgrade fastest nameserver to DEBUG (#9071)
These run every minute and add a lot of noise to the logs.

```
May 11 18:21:14 gateway-z1w4 firezone-gateway[2007]: 2025-05-11T18:21:14.154Z  INFO firezone_tunnel::io::nameserver_set: Evaluating fastest nameserver ips={127.0.0.53}
May 11 18:21:14 gateway-z1w4 firezone-gateway[2007]: 2025-05-11T18:21:14.155Z  INFO firezone_tunnel::io::nameserver_set: Evaluated fastest nameserver fastest=127.0.0.53
May 11 18:22:14 gateway-z1w4 firezone-gateway[2007]: 2025-05-11T18:22:14.154Z  INFO firezone_tunnel::io::nameserver_set: Evaluating fastest nameserver ips={127.0.0.53}
May 11 18:22:14 gateway-z1w4 firezone-gateway[2007]: 2025-05-11T18:22:14.155Z  INFO firezone_tunnel::io::nameserver_set: Evaluated fastest nameserver fastest=127.0.0.53
May 11 18:23:14 gateway-z1w4 firezone-gateway[2007]: 2025-05-11T18:23:14.153Z  INFO firezone_tunnel::io::nameserver_set: Evaluating fastest nameserver ips={127.0.0.53}
May 11 18:23:14 gateway-z1w4 firezone-gateway[2007]: 2025-05-11T18:23:14.155Z  INFO firezone_tunnel::io::nameserver_set: Evaluated fastest nameserver fastest=127.0.0.53
May 11 18:24:14 gateway-z1w4 firezone-gateway[2007]: 2025-05-11T18:24:14.154Z  INFO firezone_tunnel::io::nameserver_set: Evaluating fastest nameserver ips={127.0.0.53}
May 11 18:24:14 gateway-z1w4 firezone-gateway[2007]: 2025-05-11T18:24:14.155Z  INFO firezone_tunnel::io::nameserver_set: Evaluated fastest nameserver fastest=127.0.0.53
May 11 18:25:14 gateway-z1w4 firezone-gateway[2007]: 2025-05-11T18:25:14.153Z  INFO firezone_tunnel::io::nameserver_set: Evaluating fastest nameserver ips={127.0.0.53}
```
2025-05-12 01:58:17 +00:00
Thomas Eizinger
5566f1847f refactor(rust): move crates into a more sensical hierarchy (#9066)
The current `rust/` directory is a bit of a wild-west in terms of how
the crates are organised. Most of them are simply at the top-level when
in reality, they are all `connlib`-related. The Apple and Android FFI
crates - which are entrypoints into the Rust code - are defined several
layers deep.

To improve the situation, we move around and rename several crates. The
end result is that all top-level crates / directories are:

- Either entrypoints into the Rust code, i.e. applications such as
Gateway, Relay or a Client
- Or crates shared across all those entrypoints, such as `telemetry` or
`logging`
2025-05-12 01:04:17 +00:00
Thomas Eizinger
3f4e004a48 fix(connlib): don't recreate DNS resource NAT for failed domains (#9064)
Before a Client can send packets to a DNS resource, the Gateway must
first set up a NAT table between the IPs assigned by the Client and the
IPs the domain actually resolves to. This is what we call the DNS
resource NAT.

The communication for this process happens over IP through the tunnel
which is an unreliable transport. To ensure that this works reliably
even in the presence of packet loss on the wire, the Client uses an
idempotent algorithm where it tracks the state of the NAT for each
domain that it has ever assigned IPs for (i.e. received an A or AAAA
query from an application). This algorithm ensures that if we don't hear
anything back from the Gateway within 2s, another packet for setting up
the NAT is sent as soon as we receive _any_ DNS query.

This design balances efficiency (we don't try forever) with reliability
(we always check all of them).

In case a domain does not resolve at all or there are resolution errors,
the Gateway replies with `NatStatus::Inactive`. At present, the Client
doesn't handle this in any particular way other than logging that it was
not able to successfully set up the NAT.

The combination of the above results in an undesirable behaviour: If an
application queries a domain without A and AAAA records once, we will
keep retrying forever to resolve it upon every other DNS query issued to
the system. To fix this, we introduce `dns_resource_nat::State::Failed`.
Entries in this state are ignored as part of the above algorithm and
only recreated when explicitly told to do so which we only do when we
receive another DNS query for this domain.

To handle the increased complexity around this system, we extract it
into its own component and add a fleet of unit tests for its behaviour.
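
The retry logic with the new `Failed` state can be sketched as follows. Only the 2s timeout and the `Failed` state come from the description above; the other variant names, the `DnsResourceNat` struct and its methods are hypothetical.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

const RETRY_AFTER: Duration = Duration::from_secs(2);

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Pending { sent_at: Instant },
    Active,
    Failed,
}

#[derive(Default)]
pub struct DnsResourceNat {
    domains: HashMap<String, State>,
}

impl DnsResourceNat {
    /// Called for every DNS query. Returns the domains for which a NAT setup
    /// packet should be sent to the Gateway.
    pub fn on_dns_query(&mut self, queried: &str, now: Instant) -> Vec<String> {
        let mut to_send = Vec::new();

        // A query for the domain itself (re)creates unknown or failed entries.
        let recreate = matches!(self.domains.get(queried), None | Some(State::Failed));
        if recreate {
            self.domains
                .insert(queried.to_owned(), State::Pending { sent_at: now });
            to_send.push(queried.to_owned());
        }

        // Any query retries entries that are still pending after the timeout.
        // `Failed` and `Active` entries are skipped here.
        for (domain, state) in self.domains.iter_mut() {
            if let State::Pending { sent_at } = state {
                if now.duration_since(*sent_at) >= RETRY_AFTER {
                    *sent_at = now;
                    to_send.push(domain.clone());
                }
            }
        }

        to_send.sort(); // Deterministic order for callers and tests.
        to_send
    }

    /// Records the Gateway's answer for a domain.
    pub fn on_gateway_reply(&mut self, domain: &str, active: bool) {
        if let Some(state) = self.domains.get_mut(domain) {
            *state = if active { State::Active } else { State::Failed };
        }
    }
}
```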
2025-05-09 15:04:21 +00:00
Thomas Eizinger
fa790b231a fix(gateway): respond with SERVFAIL for missing nameserver (#9061)
When we implemented #8350, we chose an error handling strategy that
would shut down the Gateway in case we didn't have a nameserver selected
for handling those SRV and TXT queries. At the time, this was deemed to
be sufficiently rare to be an adequate strategy. We have since learned
that this can indeed happen when the Gateway starts without network
connectivity which is quite common when using tools such as terraform to
provision infrastructure.

In #9060, we fix this by re-evaluating the fastest nameserver on a
timer. This however doesn't change the error handling strategy when we
don't have a working nameserver at all. It is practically impossible to
have a working Gateway that is unable to select a nameserver. We
read them from `/etc/resolv.conf` which is what `libc` uses to also
resolve the domain we connect to for the WebSocket. A working WebSocket
connection is required for us to establish connections to Clients, which
in turn is a precursor to us receiving DNS queries from a Client.

It causes unnecessary complexity to have a code path that can
potentially terminate the Gateway, yet is practically unreachable. To
fix this situation, we remove this code path and instead reply with a
DNS SERVFAIL error.
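
In code, the new strategy amounts to replacing a fatal error with a normal response. The sketch below uses hypothetical names (`Action`, `route_query`); the only fixed value is the DNS response code for SERVFAIL, which is 2.

```rust
use std::net::IpAddr;

/// DNS RCODE for "server failure".
pub const RCODE_SERVFAIL: u8 = 2;

#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    /// Forward the query to the selected nameserver.
    Forward { nameserver: IpAddr },
    /// Answer the Client directly with the given response code.
    Reply { rcode: u8 },
}

/// Decides what to do with a forwarded SRV or TXT query.
pub fn route_query(fastest_nameserver: Option<IpAddr>) -> Action {
    match fastest_nameserver {
        Some(nameserver) => Action::Forward { nameserver },
        // Previously a fatal condition; now just an error response.
        None => Action::Reply {
            rcode: RCODE_SERVFAIL,
        },
    }
}
```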

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-05-09 05:55:48 +00:00
Thomas Eizinger
ac339ff63b fix(gateway): evaluate fastest nameserver every 60s (#9060)
Currently, the Gateway reads all nameservers from `/etc/resolv.conf` on
startup and evaluates the fastest one to use for SRV and TXT DNS queries
that are forwarded by the Client. If the machine just booted and we do
not have Internet connectivity just yet, this fails which leaves the
Gateway in a state where it cannot fulfill those queries.

In order to ensure we always use the fastest one and to self-heal from
such situations, we add a 60s timer that refreshes this state.
Currently, this will **not** re-read the nameservers from
`/etc/resolv.conf` but still use the same IPs read on startup.
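
A timer-driven re-evaluation could be sketched like this. The `Nameservers` struct, `handle_timeout` and the `probe` callback (which returns a round-trip time, or `None` if the server is unreachable) are hypothetical; only the 60s interval and the fixed list of IPs come from the description above.

```rust
use std::net::IpAddr;
use std::time::{Duration, Instant};

const REEVALUATE_EVERY: Duration = Duration::from_secs(60);

pub struct Nameservers {
    /// Read once on startup and not refreshed afterwards.
    ips: Vec<IpAddr>,
    fastest: Option<IpAddr>,
    next_evaluation: Instant,
}

impl Nameservers {
    pub fn new(ips: Vec<IpAddr>, now: Instant) -> Self {
        Self {
            ips,
            fastest: None,
            next_evaluation: now,
        }
    }

    pub fn fastest(&self) -> Option<IpAddr> {
        self.fastest
    }

    /// Re-evaluates the fastest nameserver if the timer has expired.
    pub fn handle_timeout(&mut self, now: Instant, probe: impl Fn(IpAddr) -> Option<Duration>) {
        if now < self.next_evaluation {
            return;
        }

        self.fastest = self
            .ips
            .iter()
            .copied()
            .filter_map(|ip| probe(ip).map(|rtt| (rtt, ip)))
            .min() // Smallest round-trip time wins.
            .map(|(_, ip)| ip);
        self.next_evaluation = now + REEVALUATE_EVERY;
    }
}
```

If every probe fails at boot, `fastest` stays `None` until the next tick succeeds, which is the self-healing behaviour described above.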
2025-05-09 03:38:35 +00:00
Thomas Eizinger
33d5c32f35 fix(gateway): truncate payload of ICMP errors (#9059)
When the Gateway is handed an IP packet for a DNS resource that it
cannot route, it sends back an ICMP unreachable error. According to RFC
792 [0] (for ICMPv4) and RFC 4443 [1] (for ICMPv6), parts of the
original packet should be included in the ICMP error payload to allow
the sending party to correlate what could not be sent.

For ICMPv4, the RFC says:

```
Internet Header + 64 bits of Data Datagram

The internet header plus the first 64 bits of the original
datagram's data.  This data is used by the host to match the
message to the appropriate process.  If a higher level protocol
uses port numbers, they are assumed to be in the first 64 data
bits of the original datagram's data.
```

For ICMPv6, the RFC says:

```
As much of invoking packet as possible without the ICMPv6 packet exceeding the minimum IPv6 MTU
```

[0]: https://datatracker.ietf.org/doc/html/rfc792
[1]: https://datatracker.ietf.org/doc/html/rfc4443#section-3.1
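
The two truncation rules quoted above can be sketched as follows. The function names are hypothetical. The ICMPv6 limit assumes a 40-byte IPv6 header and an 8-byte ICMPv6 error header within the 1280-byte minimum MTU, and the ICMPv4 variant reads the header length from the IHL field of the original packet.

```rust
const IPV6_MIN_MTU: usize = 1280;
const IPV6_HEADER: usize = 40;
const ICMPV6_HEADER: usize = 8;

/// ICMPv4: the original IP header plus the first 64 bits (8 bytes) of data.
pub fn icmpv4_error_payload(original: &[u8]) -> &[u8] {
    // The IHL is the low nibble of the first byte, counted in 32-bit words.
    let header_len = original
        .first()
        .map(|b| usize::from(b & 0x0f) * 4)
        .unwrap_or(0);
    let len = (header_len + 8).min(original.len());

    &original[..len]
}

/// ICMPv6: as much of the original packet as fits into the minimum IPv6 MTU.
pub fn icmpv6_error_payload(original: &[u8]) -> &[u8] {
    let max = IPV6_MIN_MTU - IPV6_HEADER - ICMPV6_HEADER;

    &original[..original.len().min(max)]
}
```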
2025-05-09 01:38:31 +00:00
Thomas Eizinger
18ec6c6860 refactor(rust): move service implementation to GUI client (#9045)
The module and crate structure around the GUI client and its background
service are currently a mess of circular dependencies. Most of the
service implementation actually sits in `firezone-headless-client`
because the headless-client and the service share certain modules. We
have recently moved most of these to `firezone-bin-shared` which is the
correct place for these modules.

In order to move the background service to `firezone-gui-client`, we
need to untangle a few more things in the GUI client. Those are done
commit-by-commit in this PR. With that out of the way, we can finally move
the service module to the GUI client, where it should actually live
given that it has nothing to do with the headless client.

As a result, the headless-client is - as one would expect - really just
a thin wrapper around connlib itself and is reduced down to 4 files with
this PR.

To make things more consistent in the GUI client, we move the `main.rs`
file also into `bin/`. By convention `bin/` is where you define binaries
if a crate has more than one. cargo will then build all of them.

Eventually, we can optimise the compile-times for `firezone-gui-client`
by splitting it into multiple crates:

- Shared structs like IPC messages
- Background service
- GUI client

This will be useful because it allows re-compiling only the GUI
client if nothing in `connlib` changes and vice versa.

Resolves: #6913
Resolves: #5754
2025-05-08 13:22:09 +00:00