The current Rust workspace isn't as consistent as it could be. To make
navigation a bit easier, we move a few crates around. Generally, we
follow the idea that entry-points should be at the top-level. `rust/`
now looks like this (directories only):
```
.
├── cli # Firezone CLI
├── client-ffi # Entry point for Apple & Android
├── gateway # Gateway
├── gui-client # GUI client
├── headless-client # Headless client
├── libs # Library crates
├── relay # Relay
├── target # Compile artifacts
├── tests # Crates for testing
└── tools # Local tools
```
To further enforce this structure, we also drop the `firezone-` prefix
from all crates that are not top-level binary crates.
The downcasting abilities of `anyhow` are pretty powerful.
Unfortunately, they can also be a bit tricky to get right. Whilst `is`
and `downcast` work fine for any errors that are within the `anyhow`
error chain, they don't check the chain of errors prior to that. In
other words, if we already have a nested `std::error::Error` with
several causes, `anyhow` cannot downcast to these causes directly.
In order to avoid this footgun, we create a thin layer on top of the
`anyhow` crate with some downcasting functions that always try to do the
right thing.
All of our Linux applications have a soft-dependency on systemd. That
is, in the default configuration, we expect systemd to be present on the
machine. The only exceptions here are the Docker containers for Headless
Client and Gateway.
For the GUI client in particular, systemd is a hard-dependency in order
to control DNS on the system which we do via `systemd-resolved`. To
secure the communication between the GUI client and its tunnel process,
we automatically create a group called `firezone-client` to which the
user gets added. All members of the group are allowed to access the unix
socket which is used for IPC between the two processes. Membership in
this group is also a prerequisite for accessing any of the configuration
files.
On the first launch of the GUI client on a Linux system, this presents a
problem. For group membership changes to take effect, the user needs
to reboot. We say that in the documentation but it is unclear whether
all users will read that thoroughly enough. To help the user, the GUI
client checks for membership of the current user in the group and alerts
the user via a dialog box if that isn't the case. This would all be fine
if it actually worked. Unfortunately, that check ends up being too
late in the process. If we aren't a member of the group, we cannot read
the device ID and bail early, thus never reaching the check and
terminating the process without any dialog box or user-visible error.
We could attempt to fix this by shuffling around some of the startup
init code. That is a sub-optimal solution however because a) it may get
broken again in the future and b) it means we have to delay
initialisation of telemetry until a much later point.
Given that this is only a problem on Linux, a better solution is to
simply not rely on the disk-based device ID at all. Instead, we can
integrate with systemd and deterministically derive a device ID from the
unique machine ID and a randomly chosen "app ID".
For backwards-compatibility reasons, the disk-based device ID is still
prioritised. For all new installs however, we will use the one based on
`/etc/machine-id`.
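
The lookup order can be sketched roughly as follows. This is an illustration only: the `APP_ID` value, the function names and the FNV-1a hash are assumptions made for this example. In particular, the hash is a simple stand-in to show determinism and is not the derivation that the systemd integration performs.

```rust
/// Hypothetical app ID: chosen randomly once and then hard-coded.
const APP_ID: &str = "d3c1f0a2b4e64f7a9c0e5b8a7d6f1e2c";

/// Stand-in hash (64-bit FNV-1a). It only illustrates that the same input
/// always yields the same output; it is not a cryptographic hash.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;

    for byte in bytes {
        hash ^= u64::from(*byte);
        hash = hash.wrapping_mul(0x100000001b3);
    }

    hash
}

/// Deterministically derives an ID from the machine ID and the app ID.
pub fn derive_device_id(machine_id: &str) -> String {
    let mut input = Vec::new();
    input.extend_from_slice(APP_ID.as_bytes());
    // `/etc/machine-id` ends with a newline, so trim before hashing.
    input.extend_from_slice(machine_id.trim().as_bytes());

    format!("{:016x}", fnv1a(&input))
}

/// The disk-based ID wins if present; otherwise we fall back to the derived one.
pub fn device_id(disk_id: Option<String>, machine_id: Option<&str>) -> Option<String> {
    disk_id.or_else(|| machine_id.map(derive_device_id))
}
```

Because the derivation has no state of its own, it can run before any group-protected file is read, which is what sidesteps the startup ordering problem described above.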
The current test installation fails because it is operating in a
headless environment without a display user. Some more testing of the
`who` command showed that we can simply take the first user. That avoids
`grep` which was previously failing with an exit code of 1, aborting the
installation because our `postinst` script has `pipefail` set.
Specifying `sudo` in the script is unnecessary as it already runs as
root. Additionally, only executing `systemd-sysusers` for our config
file is better because it narrows the scope of what should be done.
By checking various environment variables, we can automatically add the
current user to the `firezone-client` group which allows them to connect
to the IPC socket of the tunnel process. Unfortunately, they still have
to create a new login session / reboot for that to be reflected.
The docs update for this will follow once we have cut a release with
this code in it.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Right now, connlib hands out a `BiMap` of sentinel IPs <> upstream
servers whenever it emits a `TunInterfaceUpdated` event. This `BiMap`
internally uses two `HashMap`s. The iteration order of `HashMap`s is
non-deterministic and therefore, we lose the order in which the upstream
/ system resolvers have been passed to us originally.
To prevent that, we now emit a dedicated `DnsMapping` type that does not
expose its internal data structure but only getters for retrieving the
sentinel and upstream servers. Internally, it uses a `Vec` to store this
mapping and thus retains the original order. This is asserted as part of
our proptests by comparing the resulting `Vec`s.
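
A simplified sketch of such a type is shown below. The field layout, the constructor and the getter names are assumptions for illustration and may not match the real `DnsMapping`; the point is merely that a `Vec` of pairs keeps the insertion order that two `HashMap`s would lose.

```rust
use std::net::IpAddr;

/// Order-preserving mapping between sentinel IPs and upstream servers.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct DnsMapping {
    // (sentinel, upstream) pairs, in the order the resolvers were passed to us.
    inner: Vec<(IpAddr, IpAddr)>,
}

impl DnsMapping {
    pub fn new(pairs: impl IntoIterator<Item = (IpAddr, IpAddr)>) -> Self {
        Self {
            inner: pairs.into_iter().collect(),
        }
    }

    /// Sentinel IPs in their original order.
    pub fn sentinel_ips(&self) -> Vec<IpAddr> {
        self.inner.iter().map(|(sentinel, _)| *sentinel).collect()
    }

    /// Upstream servers in their original order.
    pub fn upstream_servers(&self) -> Vec<IpAddr> {
        self.inner.iter().map(|(_, upstream)| *upstream).collect()
    }

    /// Looks up the upstream server behind a sentinel IP.
    pub fn upstream_by_sentinel(&self, sentinel: IpAddr) -> Option<IpAddr> {
        self.inner
            .iter()
            .find(|(s, _)| *s == sentinel)
            .map(|(_, upstream)| *upstream)
    }
}
```

The lookup is a linear scan, which is a reasonable trade-off for the handful of resolvers a system typically has.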
This fix is preceded by a few refactorings that encapsulate the code for
creating and updating this DNS mapping.
Resolves: #8439
On Ubuntu, this should be the default anyway and already be installed
but to be correct, we should list this dependency in the `depends`
section of our `.deb`. That way, it will automatically get installed
again if a user chooses to install the GUI client from our repository
and doesn't have `systemd-resolved` installed.
Rust 1.91 has been released and brings with it a few new lints that we
need to tidy up. In addition, it also stabilizes `BTreeMap::extract_if`,
a really nifty std-lib function that allows us to conditionally take
elements from a map. We need that in a bunch of places.
Bumps [secrecy](https://github.com/iqlusioninc/crates) from 0.8.0 to
0.10.3.
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
On Fedora, when a package gets upgraded, the new package is installed
first, followed by the uninstall of the old package. As a result, the
`prerm` script is called after the `postinst` script of the new package.
In our `prerm` script, we stop the tunnel service. On package upgrades,
this results in us stopping the tunnel service after installing the new
package, confronting the user with an error that the tunnel service is
not running.
`rpm` passes arguments to these maintenance scripts. In the case of
`prerm`, we receive the count of how many other instances of this
package are installed. To fix this bug, we check whether the first
argument to the script is "1", meaning that we are being upgraded and
should not stop the tunnel service.
Building on top of #10507, setting the initial Internet Resource state
is a piece of cake. All we need to do is thread a boolean variable
through to all call-sites of `Session::connect`. Without the need for
the Internet Resource's ID, we can simply pass in the boolean that is
saved in the configuration of each client.
Resolves: #10255
Instead of the generic "disable any kind of resource"-functionality that
connlib currently exposes, we now provide an API to only enable /
disable the Internet Resource. This is a lot simpler to deal with and
reason about than the previous system, especially when it comes to the
proptests. Those need to model connlib's behaviour correctly across its
entire API surface which makes them unnecessarily complex if we only
ever use the `set_disabled_resources` API with a single resource.
In preparation for #4789, I want to extend the proptests to cover
traffic filters (#7126). This will make them a fair bit more
complicated, so any prior removal of complexity is appreciated.
Simplifying the implementation here is also a good starting point to fix
#10255. Not implicitly enabling the Internet Resource when it gets added
should be quite simple after this change.
Finally, resolving #8885 should also be quite easy. We just need to
store the state of the Internet Resource once per API URL instead of
globally.
Resolves: #8404
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
In order to allow the portal to more easily classify what kind of
component is connecting, we extend the user-agent header produced by
`get_user_agent` to include a component type instead of the generic
`connlib/`.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
With the introduction of the pre-resolved Sentry host, all Firezone
clients now require Internet on startup. That is a significant usability
hit that we can easily fix by simply falling back to resolving the host
on-demand.
We always end up allowing this lint when it pops up, so we may as well
allow it for the whole repo. Most of the time, the reason for too many
arguments is a borrow-checker limitation of Rust where mutable
references need to be tracked explicitly.
Right now, the Client event-loops have a channel with 1000 items for
sending new resource lists and updates to the TUN device to the host
app. This is kind of unnecessary as we always only care about the last
version of these. Intermediate updates that the host app doesn't process
are effectively irrelevant.
We've had an issue before where a bug in the portal caused us to receive
many updates to resources which ended up crashing Client apps because
this channel filled up.
To be more resilient on this front, we refactor the Client event loop to
use a `watch` channel for this. Watch channels only retain the last
value that got sent into them.
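
The "only the latest value matters" semantics can be shown with a small std-only stand-in. `Latest` is a hypothetical type written for this example and is not the `watch` channel itself, which additionally notifies the receiver about changes.

```rust
use std::sync::{Arc, Mutex};

/// A single-slot "channel": sending overwrites whatever has not been read yet.
pub struct Latest<T> {
    slot: Arc<Mutex<Option<T>>>,
}

impl<T> Clone for Latest<T> {
    fn clone(&self) -> Self {
        Self {
            slot: Arc::clone(&self.slot),
        }
    }
}

impl<T> Latest<T> {
    pub fn new() -> Self {
        Self {
            slot: Arc::new(Mutex::new(None)),
        }
    }

    /// Replaces any pending value; memory use stays constant no matter how
    /// many updates are sent before the receiver catches up.
    pub fn send(&self, value: T) {
        *self.slot.lock().unwrap() = Some(value);
    }

    /// Takes the most recent value, if there is one that has not been read.
    pub fn take(&self) -> Option<T> {
        self.slot.lock().unwrap().take()
    }
}
```

With this shape, a burst of thousands of resource-list updates cannot fill anything up: a slow host app simply sees the newest list the next time it looks.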
Our Sentry client needs to resolve DNS before being able to send logs or
errors to the backend. Currently, this DNS resolution happens on-demand
as we don't take any control of the underlying HTTP client.
In addition, this will use HTTP/1.1 by default which isn't as efficient
as it could be, especially with concurrent requests.
Finally, if we decide to ever proxy all Sentry traffic through our
own domain, we have to take control of the underlying client anyway.
To resolve all of the above, we create a custom `TransportFactory` where
we reuse the existing `ReqwestHttpTransport` but provide an already
configured `reqwest::Client` that always uses HTTP/2 with a
pre-configured set of DNS records for the given ingest host.
By default, dropping a `tokio` runtime waits until all tasks have
finished. The tasks we spawn within `connlib` can have complex
dependencies with each other. To ensure that we can shut down in any
case and don't hang, we apply a timeout of 1s to the runtime.
These don't really tell us much. It appears that Windows is sometimes
failing to access the pipe but then succeeds on the next attempt, hence
why we have the retry loop in the first place. Logging a warning here
just spams Sentry unnecessarily.
Despite still being in development, the `tauri-specta` project already
proves to be quite useful. It allows us to generate TypeScript bindings
for our commands and events, creating a type-safe contract between the
frontend and the backend.
For example, this ensures that the TypeScript code calls a command with
the required parameters and thus avoids runtime failures.
Similarly, the frontend can listen on type-safe events without having to
use any magic strings.
When looking through customer logs, we see a lot of "Resolved best route
outside of tunnel" messages. Those get logged every time we need to
rerun our re-implementation of Windows' weighting algorithm as to which
source interface / IP a packet should be sent from.
Currently, this gets cached in every socket instance so for the
peer-to-peer socket, this is only computed once per destination IP.
However, for DNS queries, we make a new socket for every query. Using a
new source port for each DNS query is recommended to avoid fingerprinting of
DNS queries. Using a new socket also means that we need to re-run this
algorithm every time we make a DNS query which is why we see this log so
often.
To fix this, we need to share this cache across all UDP sockets. Cache
invalidation is one of the hardest problems in computer science and this
instance is no different. This cache needs to be reset every time we
roam as that changes the weighting of which source interface to use.
To achieve this, we extend the `SocketFactory` trait with a `reset`
method. This method is called whenever we roam and can then reset a
shared cache inside the `UdpSocketFactory`. The "source IP resolver"
function that is passed to the UDP socket now simply accesses this
shared cache and inserts a new entry when it needs to resolve the IP.
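
In simplified form, the shared cache could look like the sketch below. `SourceIpCache`, `get_or_resolve` and `cache_for_new_socket` are hypothetical names for this example, the factory is shown as a plain struct rather than through the `SocketFactory` trait, and resolution is assumed to be infallible to keep it short.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::sync::{Arc, Mutex};

/// Cache of "destination IP -> source IP", shared by all UDP sockets.
#[derive(Clone, Default)]
pub struct SourceIpCache {
    inner: Arc<Mutex<HashMap<IpAddr, IpAddr>>>,
}

impl SourceIpCache {
    /// Returns the cached source IP for `dst`; `resolve` only runs on a miss.
    pub fn get_or_resolve(&self, dst: IpAddr, resolve: impl FnOnce(IpAddr) -> IpAddr) -> IpAddr {
        let mut cache = self.inner.lock().unwrap();

        *cache.entry(dst).or_insert_with(|| resolve(dst))
    }

    /// Forgets all entries, e.g. because the interface weighting changed.
    pub fn reset(&self) {
        self.inner.lock().unwrap().clear();
    }
}

/// Simplified factory: every socket it creates gets a handle to the same cache.
#[derive(Default)]
pub struct UdpSocketFactory {
    cache: SourceIpCache,
}

impl UdpSocketFactory {
    pub fn new() -> Self {
        Self::default()
    }

    /// Handle that a newly created socket uses as its "source IP resolver" cache.
    pub fn cache_for_new_socket(&self) -> SourceIpCache {
        self.cache.clone()
    }

    /// Called whenever we roam.
    pub fn reset(&self) {
        self.cache.reset();
    }
}
```

Since short-lived DNS sockets and the long-lived peer-to-peer socket all share one map, a second query to the same destination hits the cache until the next roam.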
As an added benefit, this may speed up DNS queries on Windows a bit
(although I haven't benchmarked it). It should certainly drastically
reduce the amount of syscalls we make on Windows.
Rust 1.88 has been released and brings with it a quite exciting feature:
let-chains! It allows us to mix-and-match `if` and `let` expressions,
therefore often reducing the "right-drift" of the relevant code, making
it easier to read.
Rust 1.88 also comes with a new clippy lint that warns when creating a
mutable reference from an immutable pointer. Attempting to fix this
revealed that this is exactly what we are doing in the eBPF kernel.
Unfortunately, it doesn't seem to be possible to design this in a way
that is both accepted by the borrow-checker AND by the eBPF verifier.
Hence, we simply make the function `unsafe` and document what the
programmer needs to uphold.
At present, our primary indicator as to whether telemetry is active is
whether we have a Sentry session. For our analytics events however, we
currently require passing in the Firezone ID and API URL again. This
makes it difficult to send analytics events from areas of the code that
don't have this information available.
To still allow for that, we integrate the `analytics` module more
tightly with the Sentry session. This allows us to drop two parameters
from the `$identify` event and also means we now respect the
`NO_TELEMETRY` setting for these events except for `new_session`. This
event is sent regardless because it allows us to track how many on-prem
installations of Firezone are out there.
A recent release of `tslink` now supports configuration via the
`package.metadata` table, which resolves a warning about "unknown key"
that we have been seeing for a while.
This log is misplaced within the current `try_connect` function because
that will also be called when we didn't have Internet while we tried to
first connect and then witnessed a network change (in which case the
token is cached in the `WaitingForNetwork` session state).
Thus, we move the log to the `Connect` msg handler where we shouldn't
have any existing session.
Bumps [tslink](https://github.com/icsmw/tslink) from 0.3.0 to 0.4.2.
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/icsmw/tslink/blob/master/changelog.md">tslink's
changelog</a>.</em></p>
<blockquote>
<h1>0.4.2 (08.06.2025)</h1>
<h2>Changes</h2>
<ul>
<li>Migrate settings to <code>package.metadata.tslink</code></li>
</ul>
<h1>0.4.1 (08.06.2025)</h1>
<h2>Changes</h2>
<ul>
<li>Add support arrays in the context of <code>const</code></li>
</ul>
<h1>0.4.0 (08.06.2025)</h1>
<h2>Features</h2>
<ul>
<li>Add support of <code>const</code> for primitive types</li>
</ul>
</blockquote>
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Instead of conditionally enabling the `logs` feature in the Sentry
client, we always enable it and use the `tracing` integration to control
which events should get forwarded to Sentry. The feature-flag check
accesses only shared-memory and is therefore really fast.
We already re-evaluate feature flags on a timer which means this boolean
will flip over automatically and logs will be streamed to Sentry.
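
As a rough sketch of why this check is cheap: the flag can live in an atomic that the per-event filter reads. The names `STREAM_LOGS`, `set_stream_logs`, `Forward` and `filter_event` are hypothetical and only illustrate the pattern; they are not the actual filter hook of the Sentry `tracing` integration.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical process-wide flag. Like all feature flags, it defaults to `false`.
static STREAM_LOGS: AtomicBool = AtomicBool::new(false);

/// Called by the timer that re-evaluates feature flags.
pub fn set_stream_logs(enabled: bool) {
    STREAM_LOGS.store(enabled, Ordering::Relaxed);
}

/// What to do with a single log event.
#[derive(Debug, PartialEq, Eq)]
pub enum Forward {
    Ignore,
    Log,
}

/// Runs for every event, so it must not do more than a single atomic load.
pub fn filter_event() -> Forward {
    if STREAM_LOGS.load(Ordering::Relaxed) {
        Forward::Log
    } else {
        Forward::Ignore
    }
}
```

Flipping the flag takes effect on the very next event, without having to re-initialise the client.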
A customer hit what seems to be a rare race condition where we try to
connect whilst we already have a session. I don't know which state it is
in so I am replacing it with a WARN log to learn more about this in
Sentry in case it gets hit again.
I believe some of the recent changes around how we load the
`firezone-id.json` from the GUI client surfaced that we in fact don't
always have access to it. Previously, this was silenced because we would
only optionally add it as context to the Sentry client.
Now, we need it to initialise telemetry so we know whether or not to
send logs to Sentry.
In order to be able to access the file, we need to change the config's
directory and the file to be owned by the `firezone-client` group.
The GUI client binary performs quite a few checks prior to setting up
logging. In order to log at least something, we have a bootstrap logger
config that logs to stdout based on the `RUST_LOG` env var.
However, in the context of an error, the logger guard was dropped too
early and therefore we couldn't actually see the error.
To fix this, we pass a mutable `Option` in to `try_main` instead. This
allows the function to drop the bootstrap logger once the real one is
set up but also keep logging using the bootstrap logger in case of an
error.
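
The ownership pattern looks roughly like this. `LoggerGuard` and the `checks_pass` parameter are hypothetical stand-ins for this example; the real `try_main` has a different signature and sets up actual loggers.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Stand-in for a logger guard: the logger is active until the guard is dropped.
pub struct LoggerGuard {
    active: Arc<AtomicBool>,
}

impl LoggerGuard {
    pub fn install(active: Arc<AtomicBool>) -> Self {
        active.store(true, Ordering::SeqCst);

        Self { active }
    }
}

impl Drop for LoggerGuard {
    fn drop(&mut self) {
        self.active.store(false, Ordering::SeqCst);
    }
}

pub fn try_main(bootstrap: &mut Option<LoggerGuard>, checks_pass: bool) -> Result<(), String> {
    if !checks_pass {
        // Early error: the guard stays inside the caller's `Option`, so the
        // caller can still log this error with the bootstrap logger.
        return Err("pre-flight check failed".to_string());
    }

    // The real logger would be set up here. Afterwards, the bootstrap
    // logger is no longer needed and we drop it explicitly.
    drop(bootstrap.take());

    Ok(())
}
```

The caller creates `Some(guard)` before calling `try_main` and only lets it go out of scope after it has logged a potential error.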
Originally, we introduced these to gather some data from logs / warnings
that we considered to be too spammy. We've since merged a
burst-protection that will at most submit the same event once every 5
minutes.
The data from the telemetry spans themselves have not been used at all.
I suspect that the new Windows runners are "too fast" and we hit a race
condition in the use of the keyring on Windows which causes failing CI
jobs. The attempt to fix this is to sleep for 1 second before every
assert in the test.
Sentry has a new "Logs" feature where we can stream logs directly to
Sentry. Doing this for all Clients and Gateways would be way too much
data to collect though.
In order to aid debugging from customer installations, we add a
PostHog-managed feature flag that - if set to `true` - enables the
streaming of logs to Sentry. This feature flag is evaluated every time
the telemetry context is initialised:
- For all FFI usages of connlib, this happens every time a new session
is created.
- For the Windows/Linux Tunnel service, this also happens every time we
create a new session.
- For the Headless Client and Gateway, it happens on startup and
afterwards, every minute. The feature-flag context itself is only
checked every 5 minutes though so it might take up to 5 minutes before
this takes effect.
The default value - like all feature flags - is `false`. Therefore, if
there is any issue with the PostHog service, we will fall back to the
previous behaviour where logs are simply stored locally.
Resolves: #9600
In order to more easily target customers with certain feature flags, we
include the `account_slug` in the `$identify` event to PostHog. This
will allow us to create Cohorts in PostHog and enable / disable feature
flags for all installations of Firezone for a particular customer.