When a Gateway or Client is running in an environment without IPv4 or
IPv6 connectivity, our initial probes for sending packets to the relays
will fail with network unreachable. That isn't a very big concern and
happens a lot in the wild. There is no need to report these as telemetry
events.
Resolves: #7514.
For persistent applications like the IPC service, it is possible that
telemetry gets initialised with different parameters depending on what
the user logs in with. Currently, only the first one is persisted and
all subsequent ones are ignored, leading to events that may be wrongly
tagged for a certain user / environment.
To fix this, we only skip the init if we are still in the same
environment. Otherwise, we close the previous session and initialise a
new one.
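
The re-initialisation rule above can be sketched roughly as follows. This
is a minimal illustration, not the actual telemetry code: `Params`,
`Telemetry`, `start` and `closed_sessions` are hypothetical names, and
the real client would flush and close a Sentry session where this sketch
only bumps a counter.

```rust
/// Hypothetical parameters that identify a telemetry session.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Params {
    pub environment: String,
    pub release: String,
}

/// Stand-in for the real telemetry client; it only records what happened.
#[derive(Debug, Default)]
pub struct Telemetry {
    active: Option<Params>,
    /// Number of sessions that were closed because the parameters changed.
    pub closed_sessions: usize,
}

impl Telemetry {
    /// Starts a session and returns `true` if a new one was initialised.
    pub fn start(&mut self, params: Params) -> bool {
        // Same environment as the running session: skip the init.
        if self.active.as_ref() == Some(&params) {
            return false;
        }

        // Different environment: close the previous session first so that
        // later events are not tagged with stale parameters.
        if self.active.take().is_some() {
            self.closed_sessions += 1;
        }

        self.active = Some(params);
        true
    }

    /// The parameters of the currently running session, if any.
    pub fn active(&self) -> Option<&Params> {
        self.active.as_ref()
    }
}
```

Calling `start` twice with equal parameters leaves the session untouched;
calling it with different parameters replaces the session.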
Fixes: #7525.
In order to achieve concurrency within `connlib`, we needed to create a
way for IP packets to own the piece of memory they are sitting in. This
allows us to concurrently read IP packets and then batch-process them
(as opposed to having a dedicated buffer and referencing it). At the moment,
those IP packets are defined on the stack. With a size of ~1300 bytes
that isn't very large but still causes _some_ amount of copying.
We can avoid this copying by relying on a buffer pool:
1. When reading a new IP packet, we request a new buffer from the pool.
2. When the IP packet gets dropped, the buffer gets returned to the
pool.
This allows us to reuse an allocation for a packet once it finished
processing, resulting in less CPU time spent on copying around memory.
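
The two steps above can be sketched as a small pool whose buffers return
themselves on drop. This is an illustrative sketch only: `BufferPool`,
`Buffer`, `pull` and `idle` are hypothetical names and do not claim to
match the types used in `connlib`. A `Mutex`-protected free list is
assumed for simplicity, and returned buffers are not zeroed.

```rust
use std::ops::{Deref, DerefMut};
use std::sync::{Arc, Mutex};

/// A pool of reusable, fixed-size byte buffers.
#[derive(Clone)]
pub struct BufferPool {
    free: Arc<Mutex<Vec<Vec<u8>>>>,
    buffer_size: usize,
}

/// A buffer on loan from the pool; it is returned automatically on drop.
pub struct Buffer {
    // `Option` so that `Drop` can move the allocation back into the pool.
    inner: Option<Vec<u8>>,
    free: Arc<Mutex<Vec<Vec<u8>>>>,
}

impl BufferPool {
    pub fn new(buffer_size: usize) -> Self {
        Self {
            free: Arc::new(Mutex::new(Vec::new())),
            buffer_size,
        }
    }

    /// Reuses a returned buffer if one is available, otherwise allocates.
    pub fn pull(&self) -> Buffer {
        let reused = self.free.lock().unwrap().pop();
        let inner = reused.unwrap_or_else(|| vec![0u8; self.buffer_size]);

        Buffer {
            inner: Some(inner),
            free: Arc::clone(&self.free),
        }
    }

    /// Number of buffers currently sitting idle in the pool.
    pub fn idle(&self) -> usize {
        self.free.lock().unwrap().len()
    }
}

impl Deref for Buffer {
    type Target = [u8];

    fn deref(&self) -> &[u8] {
        self.inner.as_deref().expect("buffer is present until drop")
    }
}

impl DerefMut for Buffer {
    fn deref_mut(&mut self) -> &mut [u8] {
        self.inner.as_deref_mut().expect("buffer is present until drop")
    }
}

impl Drop for Buffer {
    fn drop(&mut self) {
        // Hand the allocation back instead of freeing it. If the lock is
        // poisoned, the buffer is simply deallocated.
        if let Some(buf) = self.inner.take() {
            if let Ok(mut free) = self.free.lock() {
                free.push(buf);
            }
        }
    }
}
```

With this shape, a packet type only needs to hold a `Buffer`; dropping
the packet makes its allocation available to the next `pull`.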
This causes us to make more _individual_ heap-allocations in the
beginning: Each packet that is being processed by `connlib` is allocated on
the heap somewhere. At some point during the lifetime of the tunnel,
this will settle in an ideal state where we have allocated enough slots
to cover new packets whilst also reusing memory from packets that
finished processing already.
The actual `IpPacket` data type is now just a pointer. As a result, the
channels to and from the TUN thread (where we were holding multiple of
these packets) are now significantly smaller, leading to roughly the
same memory usage overall.
In my local testing on Linux, the client still only uses about 15MB of
RAM even with multiple concurrent speedtests running.
Similar to #7497, when we receive a `ConnectResult`, we can simply
silently bail out of the function and not change our state instead of
printing a loud warning.
#7522 won't successfully complete on production because of the migration
in this PR. So, instead, we need to modify this migration, and then
manually apply the same operation to staging.
Normally, there should always be exactly one pending flow per resource. It
appears though that it can sometimes happen that we first request a flow
for a resource but by the time it is authorised, we've already cleared
its local state.
Regardless, this isn't a concerning error and not worth logging on WARN
(which happens one layer up).
Windows appears to randomly fail to update the tray menu. There is
nothing we can do about that. Hence, we downgrade these errors to debug
and make the functions infallible, reducing the complexity for the
caller.
There is nothing we can do if the user doesn't have any DNS servers
defined. The default log level is INFO so a user reading the logs will
still come across this message in case they are trying to debug what is
happening.
Long term, problems like these would probably warrant some kind of
notification channel from `connlib` to the GUI where we can display
messages to the user.
There are several reasons why we can disconnect from a relay at runtime:
- STUN is blocked
- We have invalid credentials
- The TURN server is not protocol-compliant
The first two are very much possible in production and there is nothing
we can do about them. When relays reboot, their credentials change and
if the Internet connection of a user / gateway gets cut, we may
disconnect from the relay because the messages get lost.
The last one should never happen if we are connected to our own relays.
Firezone can be self-hosted so ultimately, we don't have control over
what we are talking to. That error however is more of a safe-guard for
`connlib` itself to disconnect from the server as soon as it detects
that it is behaving weirdly.
None of these reasons are worth reporting to Sentry as a problem because
they aren't really fixable as such. It is more important that the user
sees them in the logs if they decide to dig into them, which they still
will because we keep logging them on INFO level.
Why:
* Currently, when using the API, a user has no way of easily identifying
what identities they are pulling back as the response only includes the
`provider_identifier` which for most of our AuthProviders is an ID for
the IdP and not an email address. Along with that, when adding users to
an OIDC provider within Firezone, there is no check for whether or not
an identity has already been added with a given email address. By
creating a separate email column on the `auth_identities` table, it will
be very straight forward to know whether an email address exists for a
given identity, return it in an API response and allow the admin of a
Firezone account to track users (Identities) by email rather than IdP
identifier.
Fixes #7392
The communication between the GUI client, the IPC service and `connlib`
is asynchronous. As such, it may happen that the state machines get out
of sync. Receiving a `TunnelReady` despite not being in the right state
for that is no concern and can be handled gracefully.
In most cases, the caller of this function already handled the case of
it failing gracefully by logging. From Sentry alerts, we can see that if
this fails, there isn't much we can do about it and most likely, the
next refresh will work again (this has only happened a single time).
Logging this on `debug` is good enough in case something doesn't work
and we need to reproduce it, or in case something really bad happens and
we need to see it in the breadcrumbs of another Sentry event.
When a client disconnects, we clear up the connection on the gateway.
There might still be packets arriving from resources that we then cannot
route. This isn't worth returning an error.
We are already handling one case where we are trying to remove a route
that doesn't exist. `ESRCH` is another variant of this error that
manifests as "No such process". According to the Internet, this just
means the route doesn't exist so we can bail out early here.
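
A minimal sketch of that early bail-out, assuming the failure surfaces as
a `std::io::Error` carrying the raw OS error code. The helper name
`ignore_missing_route` is hypothetical, and the constant is hard-coded to
the Linux value of `ESRCH` to keep the example dependency-free; real code
would more likely take it from the `libc` crate.

```rust
use std::io;

/// `ESRCH` ("No such process"). The value 3 is the Linux definition and
/// is hard-coded here only to avoid a dependency in this sketch.
const ESRCH: i32 = 3;

/// Treats "the route does not exist" as success when removing a route.
/// All other errors are passed through unchanged.
pub fn ignore_missing_route(result: io::Result<()>) -> io::Result<()> {
    match result {
        Err(e) if e.raw_os_error() == Some(ESRCH) => Ok(()),
        other => other,
    }
}
```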
Currently, the Gateway logs all errors that happen when the event-loop
exits on ERROR level. This creates Sentry alerts for things like
"Unauthorized" errors or "404 Not found".
That isn't useful to us. To mitigate this, we polish the code a bit to
only log an ERROR when we actually fail to set up something during
startup (like the TUN device). In all other cases, we now log a more
user-friendly message on INFO but still exit with the appropriate exit
code (0 on CTRL+C, 1 on any other error).
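
The policy described above boils down to a small mapping from the reason
for exiting to a log level and an exit code. The sketch below is only an
illustration of that mapping; `Exit`, `Level` and `classify` are
hypothetical names and not the Gateway's actual types.

```rust
/// Log level to use when reporting why the process exits.
#[derive(Debug, PartialEq, Eq)]
pub enum Level {
    Info,
    Error,
}

/// Hypothetical reasons for the Gateway process to exit.
pub enum Exit {
    /// The user pressed CTRL+C.
    CtrlC,
    /// Something failed during startup, e.g. creating the TUN device.
    StartupFailed(String),
    /// The event-loop exited, e.g. with "Unauthorized" or "404 Not found".
    EventLoop(String),
}

/// Returns the log level and the process exit code for a given reason.
pub fn classify(exit: &Exit) -> (Level, u8) {
    match exit {
        Exit::CtrlC => (Level::Info, 0),
        // Only startup failures are worth an ERROR (and thus an alert).
        Exit::StartupFailed(_) => (Level::Error, 1),
        // Still a non-zero exit code, but logged in a user-friendly way.
        Exit::EventLoop(_) => (Level::Info, 1),
    }
}
```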
Why:
* The API endpoint for updating Resources was using
`Resources.fetch_resource_by_id_or_persistent_id`, however that function
was fetching all Resources, which included deleted Resources. In order
to prevent an API user from attempting to update a Resource that is
deleted, a new function was added to fetch active Resources only.
Fixes: #7492
Attempting to refresh an allocation is the only idempotent way in TURN
to test whether one has an active allocation. As such, logging this on
WARN is too aggressive.
Resolves: #7481.
At present, `connlib` will always drop all IP packets until a connection
is established and the DNS resource NAT is created. This causes an
unnecessary delay until the connection is working because we need to
wait for retransmission timers of the host's network stack to resend
those packets.
With the new idempotent control protocol, it is now much easier to
buffer these packets and send them to the gateway once the connection is
established.
The buffer sizes are chosen somewhat conservatively to ensure we don't
consume a lot of memory. The hypothesis here is that every protocol -
even if the transport layer is unreliable like UDP - will start with a
handshake involving only one or at most a few packets and waiting for a
reply before sending more. Thus, as long as we can set up a connection
quicker than the re-transmit timer in the host's network stack,
buffering those packets should result in no packet loss. Typically,
setting up a new connection takes at most 500ms which should be fast
enough to not trigger any re-transmits.
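
A bounded buffer of this kind could look roughly like the sketch below.
The names (`PendingPackets`, `push`, `drain`) are hypothetical, packets
are represented as plain `Vec<u8>`, and evicting the oldest packet when
the buffer is full is an assumption made for this example; the
description above does not say which packets get dropped.

```rust
use std::collections::VecDeque;

/// Bounded FIFO for packets that arrive before the connection is ready.
pub struct PendingPackets {
    packets: VecDeque<Vec<u8>>,
    capacity: usize,
}

impl PendingPackets {
    pub fn new(capacity: usize) -> Self {
        Self {
            packets: VecDeque::with_capacity(capacity),
            capacity,
        }
    }

    /// Buffers a packet. Returns the packet that had to be dropped to
    /// stay within the capacity, if any.
    pub fn push(&mut self, packet: Vec<u8>) -> Option<Vec<u8>> {
        if self.capacity == 0 {
            return Some(packet);
        }

        let evicted = if self.packets.len() >= self.capacity {
            self.packets.pop_front()
        } else {
            None
        };
        self.packets.push_back(packet);

        evicted
    }

    /// Hands out all buffered packets in arrival order, e.g. once the
    /// connection is established, and leaves the buffer empty.
    pub fn drain(&mut self) -> impl Iterator<Item = Vec<u8>> + '_ {
        self.packets.drain(..)
    }
}
```

Keeping the capacity small bounds the memory used per connection while
still covering the first handshake packets of most protocols.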
Resolves: #3246.
Xcode has decent support for skipping certain build phases when input
files haven't changed. This only happens for build phases within a
single target, and not for entire Target dependencies.
Before, we defined `Connlib` as its own bona fide build target, and then
added it as a target dependency for the network extension targets. This
causes Xcode to always run our `build-rust.sh` script, which takes
around 30s on my M1 even when `rust/` hasn't changed.
Instead, we can remove the `Connlib` target, and add a "Run script"
phase to the network extension targets themselves. By configuring the
input file list, Xcode will skip this phase if `rust/**/*.rs`,
`rust/**/*.toml` and `rust/Cargo.lock` haven't changed.
This makes it **much** faster to iterate on Swift code -- Xcode is
_very_ fast when building pure Swift (sometimes under 1s).
<img width="1016" alt="Screenshot 2024-12-11 at 6 10 45 PM"
src="https://github.com/user-attachments/assets/29b5f073-3d58-4c07-9592-f9209033c966"
/>
Bumps the npm_and_yarn group in /website with 1 update:
[nanoid](https://github.com/ai/nanoid).
Updates `nanoid` from 3.3.7 to 3.3.8
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/ai/nanoid/blob/main/CHANGELOG.md">nanoid's
changelog</a>.</em></p>
<blockquote>
<h2>3.3.8</h2>
<ul>
<li>Fixed a way to break Nano ID by passing non-integer size (by <a
href="https://github.com/myndzi"><code>@myndzi</code></a>).</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="3044cd5e73"><code>3044cd5</code></a>
Release 3.3.8 version</li>
<li><a
href="4fe34959c3"><code>4fe3495</code></a>
Update size limit</li>
<li><a
href="d643045f40"><code>d643045</code></a>
Fix pool pollution, infinite loop (<a
href="https://redirect.github.com/ai/nanoid/issues/510">#510</a>)</li>
<li>See full diff in <a
href="https://github.com/ai/nanoid/compare/3.3.7...3.3.8">compare
view</a></li>
</ul>
</details>
<br />
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
On the one hand, learning in which edge cases our software fails is
useful and thus having telemetry also active for self-hosted users is
beneficial. On the other hand, we have neither control over nor a point
of contact for those self-hosted deployments, and whatever they are
doing might spam our Sentry account with errors that we can't do
anything about.
To mitigate this, we disable telemetry for self-hosted users with the
next release.
Once we have more resources, we can consider enabling this again.
As per the RFC, the IPv6 traffic class should be 1-to-1 translated to
the IPv4 DSCP value. However, it appears that not all values here are
valid. In particular, when attempting to reach GitHub over IPv6, we
receive an IPv6 packet that has a traffic class value of 72 which is
out-of-range for the IPv4 DSCP value, resulting in the following error
on the Gateway:
```
Failed to translate packet: NAT64 failed: Error '72' is too big to be a 'IPv4 DSCP (Differentiated Services Code Point)' (maximum allowed value is '63')
```
The bigger scope of this issue is that this causes the ICMP packets
returned to the client to be dropped which means that `ssh` spawned by
`git` doesn't learn that the IPv6 address assigned by Firezone is not
actually routable.
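
For context on the numbers: the 8-bit IPv6 traffic class carries a 6-bit
DSCP in its upper bits and a 2-bit ECN field in its lower bits, which is
why 72 cannot be used as a DSCP directly. The sketch below shows one way
to split the value so that each part is in range. It illustrates the bit
layout only and is not necessarily how this PR resolves the error;
`split_traffic_class` and `join_traffic_class` are hypothetical helpers.

```rust
/// Splits an 8-bit IPv6 traffic class into `(dscp, ecn)`.
///
/// The DSCP occupies the upper 6 bits (0..=63) and the ECN field the
/// lower 2 bits (0..=3).
pub fn split_traffic_class(traffic_class: u8) -> (u8, u8) {
    (traffic_class >> 2, traffic_class & 0b11)
}

/// Recombines a DSCP and ECN value into a single octet.
/// Values outside their bit width are masked off.
pub fn join_traffic_class(dscp: u8, ecn: u8) -> u8 {
    ((dscp & 0b0011_1111) << 2) | (ecn & 0b11)
}
```

For the traffic class of 72 mentioned above, this yields a DSCP of 18
and an ECN of 0.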
Related: #7476.
In order for Sentry to parse our releases as semver, they need to be in
the form of `package@version` [0]. Without this, the feature of "Mark
this issue as resolved in the _next_ version" doesn't work properly
because Sentry orders the versions by when it first saw them instead of
parsing the semver string itself. We test versions prior to releasing
them, meaning Sentry learns about a 1.4.0 version before it is actually
released. This causes false-positive "regressions" even though they are
fixed in a later (as per semver) release.
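
Building such a release identifier is a one-liner. The helper below is a
sketch with a hypothetical name (`sentry_release`), and the package name
in the usage note is made up for illustration.

```rust
/// Formats a release identifier in the `package@version` form.
pub fn sentry_release(package: &str, version: &str) -> String {
    format!("{package}@{version}")
}
```

For example, `sentry_release("gui-client", "1.4.0")` produces
`gui-client@1.4.0`.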
This creates some redundancy with the different DSNs that we are already
using. I think it would make sense to consider merging the two projects
we have for the GUI client for example. That is really just one project
that happens to run as two binaries.
For all other projects, I think the separation still makes sense because
we e.g. may add Sentry to the "host" applications of Android and
MacOS/iOS as well. For those, we would reuse the DSN and thus funnel the
issues into the same Sentry project.
As per Sentry's docs, releases are organisation-wide and therefore need
a package identifier to be grouped correctly.
[0]:
https://docs.sentry.io/platforms/javascript/configuration/releases/#bind-the-version
Unlike the App extension which runs as the user, the system extension
introduced in macOS client 1.4.0 runs as `root` and thus cannot read the
App Group container directory for the GUI process. However, both
processes can read and write to the shared Keychain, which is how we
pass the token between the two processes already.
This PR does two things:
1. Tries to read an existing `firezone-id` from the pre-1.4.0 App Group
container upon app launch. This needs to be done from the GUI process.
If found, it stores it in the Keychain.
1. Refactors the `firezone-id` to be stored in the Keychain instead of a
plaintext file going forward.
The Keychain API is also cleaned up and abstracted to be more ergonomic
to use for both Token and Firezone ID storage purposes.
- Installs the system extension on app launch instead of each time we
start the tunnel, as [recommended by
Apple](https://developer.apple.com/documentation/systemextensions/installing-system-extensions-and-drivers).
This will typically happen when the app is installed for the first time,
or upgraded / downgraded.
- Changes the completion handler functionality for observing the system
extension status to an observed property on the class. This allows us to
update the MenuBar based on the status of the installation, preventing
the user from attempting to sign in unless the system extension has been
installed.
~~This PR exposes a new, subtle issue - since we don't reinstall the
system extension on each startTunnel, the process stays running. This is
expected. However, now the logging handle needs to be maintained across
connlib sessions, similar to the Android tunnel lifetime.~~ Fixed in
#7460
Expect one or two more PRs to handle further edge cases with improved UX
as more testing with the release build and upgrade/downgrade workflows
is attempted.