As part of going through the changes since the last Client and Gateway
relies, I noticed that for several of the things we fixed, it might be
worth adding changelog entries.
To send UDP DNS queries to upstream DNS servers, we have a
`UdpSocket::handshake` function that turns a UDP socket into a
single-use object where exactly one datagram is expected from the
address we send a message to. The way this is enforced is via an
equality check.
It appears that this equality check fails if users run an upstream DNS
server on a link-local IPv6 address within a setup that utilises IPv6
scopes. At the time when we receive the response, the packet has already
been successfully routed back to us so we should accept it, even if we
didn't specify a scope as the destination address.
In order to avoid routing loops on Windows, our UDP and TCP sockets in
`connlib` embed a "source IP resolver" that finds the "next best"
interface after our TUN device according to Windows' routing metrics.
This ensures that packets don't get routed back into our TUN device.
Currently, errors during this process are only logged on TRACE and
therefore not visible in Sentry. We fix this by moving around some of
the function interfaces and forward the error from the source IP
resolver together with some context of the destination IP.
If `authURL`, `apiURL`, or `logFilter` are set in the managed
configuration, we disable each of these fields respectively from user
editing.
If all of them are overridden, we disable the `Apply` and `Reset to
Defaults` buttons.
Related #4505
As part of launching the Tauri GUI client, we need to observe a specific
initialisation order. In particular, we need to wait until Tauri sends
us a `RunEvent::Ready` before we can initialise things like the tray
menu.
To make this more convenient, Tauri offers a so-called "setup hook" that
can be set on the app builder. Unfortunately, Tauri internally panics if
this provided setup-hook returns an `Err`. Removing this is tracked
upstream: https://github.com/tauri-apps/tauri/issues/12815.
Until this is fixed, we stop using this "setup hook" and instead spawn
our own task that performs this work. This task needs to wait until
Tauri is ready. To achieve that, we introduce an additional mpsc channel
that sends a notification every time we receive a `RunEvent::Ready`.
That should only happen once. We only read from the receiver once, which
is why we ignore the error on the sending side in case the receiver has
already been dropped.
Resolves: #9101
When this crate is compiled by itself, these features are required. This
doesn't show up in CI because there we compile the entire workspace and
some crate somewhere already activates these features then.
One simple way we can tell the GUI app which configuration fields have
been overridden by MDM is to specify an `overriddenKeys` string array.
If provided, this will disable the relevant configuration from being set
/ editable in the GUI app and communicate to the user as such.
Related: #4505
---------
Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Similar to how we fetch new resources, we add a Configuration poller
that fetches new configuration every 1s. If the configuration is
unchanged, we respond to the caller with a cached copy to avoid needing
to serialize the data over IPC.
Related: #4505
Bumps [divan](https://github.com/nvzqz/divan) from 0.1.17 to 0.1.21.
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/nvzqz/divan/blob/main/CHANGELOG.md">divan's
changelog</a>.</em></p>
<blockquote>
<h2>[0.1.21] - 2025-04-09</h2>
<h3>Fixed</h3>
<ul>
<li><code>Divan::skip_exact</code> behaved incorrectly in
<code>v0.1.19</code>.</li>
</ul>
<h3>Changed</h3>
<ul>
<li>Improved handling of internal code around filters and those
responsible for
sacking the people who have just been sacked have been sacked.</li>
</ul>
<h2>[0.1.20] - 2025-04-09</h2>
<h3>Fixed</h3>
<ul>
<li><code>Divan::skip_regex</code> accidentally dropped
<a
href="https://docs.rs/regex-lite/latest/regex_lite/struct.Regex.html"><code>regex_lite::Regex</code></a>
and behaved incorrectly in <code>v0.1.19</code>.</li>
</ul>
<h2>[0.1.19] - 2025-04-09</h2>
<h3>Fixed</h3>
<ul>
<li>[<code>cargo-nextest</code>] no longer skips benchmarks with
argument parameters (<a
href="https://redirect.github.com/nvzqz/divan/issues/75">#75</a>).</li>
</ul>
<h3>Changed</h3>
<ul>
<li>Organized positive and negative filters into a split buffer.</li>
</ul>
<h2>[0.1.18] - 2025-04-05</h2>
<h3>Added</h3>
<ul>
<li>
<p>Support for [<code>cargo-nextest</code>] running benchmarks as
tests.</p>
</li>
<li>
<p>[<code>prelude</code>] module for simplifying imports of
[<code>#[bench]</code>][bench_attr],
[<code>#[bench_group]</code>][bench_group_attr],
[<code>black_box</code>], [<code>black_box_drop</code>],
[<code>AllocProfiler</code>], [<code>Bencher</code>], and
[<code>Divan</code>].</p>
</li>
<li>
<p>Support <code>wasi</code> and <code>emscripten</code> targets.</p>
</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="52f9d4983e"><code>52f9d49</code></a>
Release v0.1.21</li>
<li><a
href="5afb095486"><code>5afb095</code></a>
Fix broken <code>Divan::skip_exact</code></li>
<li><a
href="e4d112ccbc"><code>e4d112c</code></a>
Release v0.1.20</li>
<li><a
href="1d74108bbe"><code>1d74108</code></a>
Fix broken <code>Divan::skip_regex</code></li>
<li><a
href="58988fc304"><code>58988fc</code></a>
Release v0.1.19</li>
<li><a
href="f43a742d0f"><code>f43a742</code></a>
docs: Change "changes" to "changed" in
changelog</li>
<li><a
href="f348769fb4"><code>f348769</code></a>
docs: List most recent changes in changelog</li>
<li><a
href="07111fb6f8"><code>07111fb</code></a>
Abstract filters into <code>FilterSet</code></li>
<li><a
href="2c865cdf24"><code>2c865cd</code></a>
docs: Add link to <code>prelude</code> module in changelog</li>
<li><a
href="9075b9e0ed"><code>9075b9e</code></a>
fix nextest support for parameterized benches</li>
<li>Additional commits viewable in <a
href="https://github.com/nvzqz/divan/compare/v0.1.17...v0.1.21">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
For at least 1 user, the threads shut down correctly, but we didn't seem
to have exited the loop. In
https://firezone-inc.sentry.io/issues/6335839279/events/c11596de18924ee3a1b64ced89b1fba2/?project=4508008945549312,
we can see that both flags are marked as `true` yet we still emitted the
message.
The only way how I can explain this is that the thread shut down in
between the two times we've called the `is_finished` function. To ensure
this doesn't happen, we now only read it once.
This however also shows that 5s may not be enough time for WinTUN to
shutdown. Therefore, we increase the grace period to 10s.
This can easily happen if we are briefly disconnected from the portal.
It is not the end of the world and not worth creating Sentry alerts for.
Originally, this was intended to be a way of detecting "bad
connectivity" but that didn't really work.
Our DNS over TCP implementation uses `smoltcp` which requires us to
manage sockets individually, i.e. there is no such thing as a listening
socket. Instead, we have to create multiple sockets and rotate through
them.
Whenever we receive new DNS servers from the host app, we throw away all
of those sockets and create new ones.
The way we refer to these sockets internally is via `smoltcp`'s
`SocketHandle`. These are just indices into a `Vec` and this access can
panic when it is out of range. Normally that doesn't happen because such
a `SocketHandle` is only created when the socket is created and
therefore, each `SocketHandle` in existence should be valid.
What we overlooked is that these sockets get destroyed and re-created
when we call `set_listen_addresses` which happens when the host app
tells us about new DNS servers. In that case, sockets that we had just
received a query on and are waiting for a response have their handles
stored in a temporary `HashMap`. Attempting to send back a response for
one of those queries will then either fail with an error that the socket
is not in the right state or - worse - panic with an out of bounds error
if the previously had more listen addresses than we have now.
To fix this, we need to clear this map of pending queries every time we
call `set_listen_addresses`.
When calculating the maximum size of the UDP payload we can send in a
single syscall, we need to take into account the overhead of the IP and
UDP headers.
We are currently storing app configuration across three places:
- UserDefaults (favorite resources)
- VPN configuration (Settings)
- Disk (firezone id)
These can be consolidated to UserDefaults, which is the standard way to
store app configuration like this.
UserDefaults is the umbrella persistence store for regular app
configuration (`plist` files which are just XML dictionaries),
iCloud-synced app configuration across a user's devices, and managed app
configuration (MDM). They provide a cached, thread-safe, and
interprocess-supported mechanism for handling app configuration. We can
also subscribe to changes on this app configuration to react to changes.
Unfortunately, the System Extension ruins some of our fun because it
runs as root, and is confined to a different group container, meaning we
cannot share configuration directly between GUI and tunnel procs.
To address this, we use the tunnel process to store all vital
configuration and introduce IPC calls to set and fetch these.
Commit-by-commit review recommended, but things got a little crazy
towards the end when I realized that we can't share a single
UserDefaults between both procs.
Our link checker `lychee` doesn't appear to de-duplicate requests to the
same URL which causes 429 errors with GitHub. To workaround this, we
reduce the concurrency to 1 and activate `lychee`'s cache. This cache is
just a file on disk. We don't need to actually save this in GitHub
actions' cache because all we want is for lychee to not make a request
to same URL again in the same session.
Related: https://github.com/lycheeverse/lychee-action/issues/289
These run every minute and add a lot of noise to the logs.
```
May 11 18:21:14 gateway-z1w4 firezone-gateway[2007]: 2025-05-11T18:21:14.154Z INFO firezone_tunnel::io::nameserver_set: Evaluating fastest nameserver ips={127.0.0.53}
May 11 18:21:14 gateway-z1w4 firezone-gateway[2007]: 2025-05-11T18:21:14.155Z INFO firezone_tunnel::io::nameserver_set: Evaluated fastest nameserver fastest=127.0.0.53
May 11 18:22:14 gateway-z1w4 firezone-gateway[2007]: 2025-05-11T18:22:14.154Z INFO firezone_tunnel::io::nameserver_set: Evaluating fastest nameserver ips={127.0.0.53}
May 11 18:22:14 gateway-z1w4 firezone-gateway[2007]: 2025-05-11T18:22:14.155Z INFO firezone_tunnel::io::nameserver_set: Evaluated fastest nameserver fastest=127.0.0.53
May 11 18:23:14 gateway-z1w4 firezone-gateway[2007]: 2025-05-11T18:23:14.153Z INFO firezone_tunnel::io::nameserver_set: Evaluating fastest nameserver ips={127.0.0.53}
May 11 18:23:14 gateway-z1w4 firezone-gateway[2007]: 2025-05-11T18:23:14.155Z INFO firezone_tunnel::io::nameserver_set: Evaluated fastest nameserver fastest=127.0.0.53
May 11 18:24:14 gateway-z1w4 firezone-gateway[2007]: 2025-05-11T18:24:14.154Z INFO firezone_tunnel::io::nameserver_set: Evaluating fastest nameserver ips={127.0.0.53}
May 11 18:24:14 gateway-z1w4 firezone-gateway[2007]: 2025-05-11T18:24:14.155Z INFO firezone_tunnel::io::nameserver_set: Evaluated fastest nameserver fastest=127.0.0.53
May 11 18:25:14 gateway-z1w4 firezone-gateway[2007]: 2025-05-11T18:25:14.153Z INFO firezone_tunnel::io::nameserver_set: Evaluating fastest nameserver ips={127.0.0.53}
```
The current `rust/` directory is a bit of a wild-west in terms of how
the crates are organised. Most of them are simply at the top-level when
in reality, they are all `connlib`-related. The Apple and Android FFI
crates - which are entrypoints in the Rust code are defined several
layers deep.
To improve the situation, we move around and rename several crates. The
end result is that all top-level crates / directories are:
- Either entrypoints into the Rust code, i.e. applications such as
Gateway, Relay or a Client
- Or crates shared across all those entrypoints, such as `telemetry` or
`logging`
When developing system extensions, Apple's
[documentation](https://developer.apple.com/documentation/DriverKit/debugging-and-testing-system-extensions)
instructs developers to disable SIP and turn on system extension
developer mode to disable certain runtime checks that allow the
extension to run.
It turns out this is completely unnecessary - any properly set up Xcode
toolchain can build a functioning macOS debug client.
When developing the macOS app, we always build the exact same version
and build code for each build. ~~This _may_ be one reason why we
constantly have to deactivate the extension before the new one will
launch.~~ Edit: Just tested, and I can verify that this does fix the
issue on dev builds, so no more having to uninstall the sysex between
builds.
Even if that's not the reason, this is a cleaner approach than building
it in our prod-only scripts.
---------
Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
When updating the provisioning profiles (i.e. when changing anything the
Apple Developer Portal), we needed to manually update these build
scripts to point to the new UUIDs.
This can be made simpler to automatically pull it out of the profiles in
CI.
Before a Client can send packets to a DNS resource, the Gateway must
first setup a NAT table between the IPs assigned by the Client and the
IPs the domain actually resolves to. This is what we call the DNS
resource NAT.
The communication for this process happens over IP through the tunnel
which is an unreliable transport. To ensure that this works reliably
even in the presence of packet loss on the wire, the Client uses an
idempotent algorithm where it tracks the state of the NAT for each
domain that is has ever assigned IPs for (i.e. received an A or AAAA
query from an application). This algorithm ensures that if we don't hear
anything back from the Gateway within 2s, another packet for setting up
the NAT is sent as soon as we receive _any_ DNS query.
This design balances efficiency (we don't try forever) with reliability
(we always check all of them).
In case a domain does not resolve at all or there are resolution errors,
the Gateway replies with `NatStatus::Inactive`. At present, the Client
doesn't handle this in any particular way other than logging that it was
not able to successfully setup the NAT.
The combination of the above results in an undesirable behaviour: If an
application queries a domain without A and AAAA records once, we will
keep retrying forever to resolve it upon every other DNS query issued to
the system. To fix this, we introduce `dns_resource_nat::State::Failed`.
Entries in this state are ignored as part of the above algorithm and
only recreated when explicitly told to do so which we only do when we
receive another DNS query for this domain.
To handle the increased complexity around this system, we extract it
into its own component and add a fleet of unit tests for its behaviour.
Currently, the Tauri build is broken on `main` because #9045
accidentally merged a bit too soon. In that PR, the two binaries that
the `gui-client` crate is composed of are now both defined in `src/bin`.
For some reason, this breaks Tauri's bundler and now on aarch64, it
stops including the `firezone-client-ipc` binary in the bundle. I don't
fully understand why and how that even works for x64 in the first place.
Nowhere in our repository can I find a configuration for the bundler as
to why it should even include that binary in the first place.
To fix this, we now explicitly copy this binary into the correct path
and also rebuild the `data` archive in addition to the `control`
archive.
When we implemented #8350, we chose an error handling strategy that
would shutdown the Gateway in case we didn't have a nameserver selected
for handling those SRV and TXT queries. At the time, this was deemed to
be sufficiently rare to be an adequate strategy. We have since learned
that this can indeed happen when the Gateway starts without network
connectivity which is quite common when using tools such as terraform to
provision infrastructure.
In #9060, we fix this by re-evaluating the fastest nameserver on a
timer. This however doesn't change the error handling strategy when we
don't have a working nameserver at all. It is practically impossible to
have a working Gateway yet us being unable to select a nameserver. We
read them from `/etc/resolv.conf` which is what `libc` uses to also
resolve the domain we connect to for the WebSocket. A working WebSocket
connection is required for us to establish connections to Clients, which
in turn is a precursor to us receiving DNS queries from a Client.
It causes unnecessary complexity to have a code path that can
potentially terminate the Gateway, yet is practically unreachable. To
fix this situation, we remove this code path and instead reply with a
DNS SERVFAIL error.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Why:
* It was pointed out that the way Postgresql does compound indexes there
is no need to have an individual index on the first column of the
compound index. This commit removes the redundant index on the
`actor_id` for the `actor_group_membership` table.
Opentelemetry logs a DEBUG log every time it creates a new meter. That
happens fairly often now in our codebase which spams the logs on the
DEBUG level.
```
2025-05-09T02:31:51.147Z DEBUG opentelemetry_sdk: name="MeterProvider.ExistingMeterReturned" meter_name="connlib"
```
We fix that be setting `opentelemetry_sdk` to default to `INFO` if it is
not specified explicitly.
Currently, the Gateway reads all nameservers from `/etc/resolv.conf` on
startup and evaluates the fastest one to use for SRV and TXT DNS queries
that are forwarded by the Client. If the machine just booted and we do
not have Internet connectivity just yet, this fails which leaves the
Gateway in state where it cannot fulfill those queries.
In order to ensure we always use the fastest one and to self-heal from
such situations, we add a 60s timer that refreshes this state.
Currently, this will **not** re-read the nameservers from
`/etc/resolv.conf` but still use the same IPs read on startup.
Why:
* As we move towards hard deleting data one issue we've run into is with
cascading deletes on the actor_group_memberships table. In order to
solve this problem indexes have been created on the `actor_id` and
`group_id` columns of the actor_group_memberships.
When the Gateway is handed an IP packet for a DNS resource that it
cannot route, it sends back an ICMP unreachable error. According to RFC
792 [0] (for ICMPv4) and RFC 4443 [1] (for ICMPv6), parts of the
original packet should be included in the ICMP error payload to allow
the sending party to correlate, what could not be sent.
For ICMPv4, the RFC says:
```
Internet Header + 64 bits of Data Datagram
The internet header plus the first 64 bits of the original
datagram's data. This data is used by the host to match the
message to the appropriate process. If a higher level protocol
uses port numbers, they are assumed to be in the first 64 data
bits of the original datagram's data.
```
For ICMPv6, the RFC says:
```
As much of invoking packet as possible without the ICMPv6 packet exceeding the minimum IPv6 MTU
```
[0]: https://datatracker.ietf.org/doc/html/rfc792
[1]: https://datatracker.ietf.org/doc/html/rfc4443#section-3.1
The module and crate structure around the GUI client and its background
service are currently a mess of circular dependencies. Most of the
service implementation actually sits in `firezone-headless-client`
because the headless-client and the service share certain modules. We
have recently moved most of these to `firezone-bin-shared` which is the
correct place for these modules.
In order to move the background service to `firezone-gui-client`, we
need to untangle a few more things in the GUI client. Those are done
commit-by-commit in this PR. With that out the way, we can finally move
the service module to the GUI client; where is should actually live
given that it has nothing to do with the headless client.
As a result, the headless-client is - as one would expect - really just
a thin wrapper around connlib itself and is reduced down to 4 files with
this PR.
To make things more consistent in the GUI client, we move the `main.rs`
file also into `bin/`. By convention `bin/` is where you define binaries
if a crate has more than one. cargo will then build all of them.
Eventually, we can optimise the compile-times for `firezone-gui-client`
by splitting it into multiple crates:
- Shared structs like IPC messages
- Background service
- GUI client
This will be useful because it allows only re-compiling of the GUI
client alone if nothing in `connlib` changes and vice versa.
Resolves: #6913Resolves: #5754
Both `device_id` and `device_info` are used by the headless-client and
the GUI client / IPC service. They should therefore be defined in the
`bin-shared` crate.
Somewhere between Xcode 16.0 and Xcode 16.3, the API for the libresolv
functions we call changed slightly, and we can now pass the return value
of `__res_9_state()` directly to the `res_9_ninit`, `res_9_ndestroy` and
`res_9_getservers` functions.
In order to experiment with alternative GUI libraries, we extracted a
`gui-client-common` crate that would hold GUI-library agnostic code.
We've since upgraded to Tauri v2 and settled on that as the GUI
framework for the Windows and Linux Firezone Clients. Therefore this
abstraction is unnecessary and can be removed again.
This makes it easier to work on the GUI client and also allows the
compiler to flag unused code more easily.