Commit Graph

183 Commits

Author SHA1 Message Date
Thomas Eizinger
bc2febed99 fix(connlib): use correct constant for truncating DNS responses (#7551)
In case an upstream DNS server responds with a payload that exceeds the
available buffer space of an IP packet, we need to truncate the
response. Currently, this truncation uses the **wrong** constant to
check for the maximum allowed length. Instead of the
`MAX_DATAGRAM_PAYLOAD`, we actually need to check against a limit that
is less than the MTU as the IP layer and the UDP layer both add an
overhead.

To fix this, we introduce such a constant and provide additional
documentation on the remaining ones to hopefully avoid future errors.
2024-12-19 17:15:43 +00:00
Thomas Eizinger
d39b6ff1b9 chore(gateway): don't log errors for untranslatable packets (#7541)
Certain packets cannot be translated as part of NAT64/46. The RFC says
to "Silently drop" those. Currently, we log all errors that happens
during the translation and don't follow this guideline.

Most of these "silently drop" errors are related to ICMP types that
cannot be represented in the other version such as ICMPv6 Neighbor
Solicitation.

To fix this, we introduce a new error type in the `ip_packet` module:
`ImpossibleTranslation`. For convenience reasons, we carry that one
through all layers as an `anyhow::Error` and test at the very top of the
event-loop, whether the root-cause of the error is such a failed
translation. If so, we ignore the error and move on. This isn't as
type-safe as it could be but it is much easier to implement.
Additionally, the risk of a bug here (i.e. if we stop emitting this
error within the IP packet translation layer) is merely that the log
will pop up again.

Resolves: #7516.
2024-12-18 20:35:08 +00:00
Thomas Eizinger
8a1b6f26b4 fix(connlib): don't log warnings for unreachable errors (#7537)
When a Gateway or Client is running in an environment without IPv4 or
IPv6 connectivity, our initial probes for sending packets to the relays
will fail with network unreachable. That isn't a very big concern and
happens a lot in the wild. There is no need to report these as telemetry
events.

Resolves: #7514.
2024-12-17 17:59:20 +00:00
Thomas Eizinger
7309428cae chore(gateway): release version 1.4.2 (#7494)
Gateway 1.4.2 has been released
(https://github.com/firezone/firezone/releases/tag/gateway-1.4.2). This
PR updates the changelog and version numbers accordingly.
2024-12-13 05:49:19 +00:00
Thomas Eizinger
30376cd79a fix(gateway): polish error handling in main (#7500)
Currently, the Gateway logs all errors that happen when the event-loop
exits on ERROR level. This creates Sentry alerts for things like
"Unauthorized" errors or "404 Not found".

That isn't useful to us. To mitigate this, we polish the code a bit to
only log an ERROR when we actually fail to setup something during
startup (like the TUN device). In all other cases, we now log a more
user-friendly message on INFO but still exit with the appropriate exit
code (0 on CTRL+C, 1 on any other error).
2024-12-13 04:51:58 +00:00
Thomas Eizinger
81f71cba62 fix(telemetry): use package@version notation for releases (#7466)
In order for Sentry to parse our releases as semver, they need to be in
the form of `package@version` [0]. Without this, the feature of "Mark
this issue as resolved in the _next_ version" doesn't work properly
because Sentry compares the versions as to when it first saw them vs
parsing the semver string itself. We test versions prior to releasing
them, meaning Sentry learns about a 1.4.0 version before it is actually
released. This causes false-positive "regressions" even though they are
fixed in a later (as per semver) release.

This create some redundancy with the different DSNs that we are already
using. I think it would make sense to consider merging the two projects
we have for the GUI client for example. That is really just one project
that happens to run as two binaries.

For all other projects, I think the separation still makes sense because
we e.g. may add Sentry to the "host" applications of Android and
MacOS/iOS as well. For those, we would reuse the DSN and thus funnel the
issues into the same Sentry project.

As per Sentry's docs, releases are organisation-wide and therefore need
a package identifier to be grouped correctly.

[0]:
https://docs.sentry.io/platforms/javascript/configuration/releases/#bind-the-version
2024-12-09 05:04:45 +00:00
Thomas Eizinger
90cf191a7c feat(linux): multi-threaded TUN device operations (#7449)
## Context

At present, we only have a single thread that reads and writes to the
TUN device on all platforms. On Linux, it is possible to open the file
descriptor of a TUN device multiple times by setting the
`IFF_MULTI_QUEUE` option using `ioctl`. Using multi-queue, we can then
spawn multiple threads that concurrently read and write to the TUN
device. This is critical for achieving a better throughput.

## Solution

`IFF_MULTI_QUEUE` is a Linux-only thing and therefore only applies to
headless-client, GUI-client on Linux and the Gateway (it may also be
possible on Android, I haven't tried). As such, we need to first change
our internal abstractions a bit to move the creation of the TUN thread
to the `Tun` abstraction itself. For this, we change the interface of
`Tun` to the following:

- `poll_recv_many`: An API, inspired by tokio's `mpsc::Receiver` where
multiple items in a channel can be batch-received.
- `poll_send_ready`: Mimics the API of `Sink` to check whether more
items can be written.
- `send`: Mimics the API of `Sink` to actually send an item.

With these APIs in place, we can implement various (performance)
improvements for the different platforms.

- On Linux, this allows us to spawn multiple threads to read and write
from the TUN device and send all packets into the same channel. The `Io`
component of `connlib` then uses `poll_recv_many` to read batches of up
to 100 packets at once. This ties in well with #7210 because we can then
use GSO to send the encrypted packets in single syscalls to the OS.
- On Windows, we already have a dedicated recv thread because `WinTun`'s
most-convenient API uses blocking IO. As such, we can now also tie into
that by batch-receiving from this channel.
- In addition to using multiple threads, this API now also uses correct
readiness checks on Linux, Darwin and Android to uphold backpressure in
case we cannot write to the TUN device.

## Configuration

Local testing has shown that 2 threads give the best performance for a
local `iperf3` run. I suspect this is because there is only so much
traffic that a single application (i.e. `iperf3`) can generate. With
more than 2 threads, the throughput actually drops drastically because
`connlib`'s main thread is too busy with lock-contention and triggering
`Waker`s for the TUN threads (which mostly idle around if there are 4+
of them). I've made it configurable on the Gateway though so we can
experiment with this during concurrent speedtests etc.

In addition, switching `connlib` to a single-threaded tokio runtime
further increased the throughput. I suspect due to less task / context
switching.

## Results

Local testing with `iperf3` shows some very promising results. We now
achieve a throughput of 2+ Gbit/s.

```
Connecting to host 172.20.0.110, port 5201
Reverse mode, remote host 172.20.0.110 is sending
[  5] local 100.80.159.34 port 57040 connected to 172.20.0.110 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec   274 MBytes  2.30 Gbits/sec
[  5]   1.00-2.00   sec   279 MBytes  2.34 Gbits/sec
[  5]   2.00-3.00   sec   216 MBytes  1.82 Gbits/sec
[  5]   3.00-4.00   sec   224 MBytes  1.88 Gbits/sec
[  5]   4.00-5.00   sec   234 MBytes  1.96 Gbits/sec
[  5]   5.00-6.00   sec   238 MBytes  2.00 Gbits/sec
[  5]   6.00-7.00   sec   229 MBytes  1.92 Gbits/sec
[  5]   7.00-8.00   sec   222 MBytes  1.86 Gbits/sec
[  5]   8.00-9.00   sec   223 MBytes  1.87 Gbits/sec
[  5]   9.00-10.00  sec   217 MBytes  1.82 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  2.30 GBytes  1.98 Gbits/sec  22247             sender
[  5]   0.00-10.00  sec  2.30 GBytes  1.98 Gbits/sec                  receiver

iperf Done.
```

This is a pretty solid improvement over what is in `main`:

```
Connecting to host 172.20.0.110, port 5201
[  5] local 100.65.159.3 port 56970 connected to 172.20.0.110 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  90.4 MBytes   758 Mbits/sec  1800    106 KBytes
[  5]   1.00-2.00   sec  93.4 MBytes   783 Mbits/sec  1550   51.6 KBytes
[  5]   2.00-3.00   sec  92.6 MBytes   777 Mbits/sec  1350   76.8 KBytes
[  5]   3.00-4.00   sec  92.9 MBytes   779 Mbits/sec  1800   56.4 KBytes
[  5]   4.00-5.00   sec  93.4 MBytes   783 Mbits/sec  1650   69.6 KBytes
[  5]   5.00-6.00   sec  90.6 MBytes   760 Mbits/sec  1500   73.2 KBytes
[  5]   6.00-7.00   sec  87.6 MBytes   735 Mbits/sec  1400   76.8 KBytes
[  5]   7.00-8.00   sec  92.6 MBytes   777 Mbits/sec  1600   82.7 KBytes
[  5]   8.00-9.00   sec  91.1 MBytes   764 Mbits/sec  1500   70.8 KBytes
[  5]   9.00-10.00  sec  92.0 MBytes   771 Mbits/sec  1550   85.1 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   917 MBytes   769 Mbits/sec  15700             sender
[  5]   0.00-10.00  sec   916 MBytes   768 Mbits/sec                  receiver

iperf Done.
```
2024-12-05 00:18:20 +00:00
Thomas Eizinger
b802021cc4 feat(connlib): implement idempotent control protocol for client (#6942)
Building on top of the gateway PR (#6941), this PR transitions the
clients to the new control protocol. Clients are **not**
backwards-compatible with old gateways. As a result, a certain customer
environment MUST have at least one gateway with the above PR running in
order for clients to be able to establish connections.

With this transition, Clients send explicit events to Gateways whenever
they assign IPs to a DNS resource name. The actual assignment only
happens once and the IPs then remain stable for the duration of the
client session.

When the Gateway receives such an event, it will perform a DNS
resolution of the requested domain name and set up the NAT between the
assigned proxy IPs and the IPs the domain actually resolves to. In order
to support self-healing of any problems that happen during this process,
the client will send an "Assigned IPs" event every time it receives a
DNS query for a particular domain. This in turn will trigger another DNS
resolution on the Gateway. Effectively, this means that DNS queries for
DNS resources propagate to the Gateway, triggering a DNS resolution
there. In case the domain resolves to the same set of IPs, no state is
changed to ensure existing connections are not interrupted.

With this new functionality in place, we can delete the old logic around
detecting "expired" IPs. This is considered a bugfix as this logic isn't
currently working as intended. It has been observed multiple times that
the Gateway can loop on this behaviour and resolving the same domain
over and over again. The only theoretical "incompatibility" here is that
pre-1.4.0 clients won't have access to this functionality of triggering
DNS refreshes on a Gateway 1.4.2+ Gateway. However, as soon as this PR
merges, we expect all admins to have already upgraded to a 1.4.0+
Gateway anyway which already mandates clients to be on 1.4.0+.

Resolves: #7391.
Resolves: #6828.
2024-12-04 12:05:35 +00:00
Thomas Eizinger
dd6b52b236 chore(rust): share edition key via workspace table (#7451) 2024-12-03 00:28:06 +00:00
Thomas Eizinger
932f6791fb fix(phoenix-channel): lazily create backoff timer (#7414)
Our `phoenix-channel` component is responsible for maintaining a
WebSocket connection to the portal. In case that connection fails, we
want to reconnect to it using an exponential backoff, eventually giving
up after a certain amount of time.

Unfortunately, the code we have today doesn't quite do that. An
`ExponentialBackoff` has a setting for the `max_elapsed_time`.
Regardless of how many and how often we retry something, we won't ever
wait longer than this amount of time. For the Relay, this is set to
15min. For other components its indefinite (Gateway, headless-client),
or very long (30 days for Android, 1 day for Apple).

The point in time from which this duration is counted is when the
`ExponentialBackoff` is **constructed** which translates to when we
**first** connected to the portal. As a result, our backoff would
immediately fail on the first error if it has been longer than
`max_elapsed_time` since we first connected. For most components, this
codepath is not relevant because the `max_elapsed_time` is so long. For
the Relay however, that is only 15 minutes so chances are, the Relay
would immediately fail (and get rebooted) on the first connection error
with the portal.

To fix this, we now lazily create the `ExponentialBackoff` on the first
error.

This bug has some interesting consequences: When a relay reboots, it
looses all its state, i.e. allocations, channel bindings, available
nonces etc, stamp-secret. Thus, all credentials and state that got
distributed to Clients and Gateways get invalidated, causing disconnects
from the Relay. We have observed these alerts in Sentry for a while and
couldn't explain them. Most likely, this is the root cause for those
because whilst a Relay disconnects, the portal also cannot detect its
presence and pro-actively inform Clients and Gateways to no longer use
this Relay.
2024-11-29 20:19:11 +00:00
Thomas Eizinger
78674a8b14 refactor(gateway): start telemetry earlier (#7404)
By removing the use of the `#[tokio::main]`, we can ensure that
telemetry is initialised as early as possible.

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
2024-11-28 20:47:11 +00:00
Thomas Eizinger
2c26fc9c0e ci: lint Rust dependencies using cargo deny (#7390)
One of Rust's promises is "if it compiles, it works". However, there are
certain situations in which this isn't true. In particular, when using
dynamic typing patterns where trait objects are downcast to concrete
types, having two versions of the same dependency can silently break
things.

This happened in #7379 where I forgot to patch a certain Sentry
dependency. A similar problem exists with our `tracing-stackdriver`
dependency (see #7241).

Lastly, duplicate dependencies increase the compile-times of a project,
so we should aim for having as few duplicate versions of a particular
dependency as possible in our dependency graph.

This PR introduces `cargo deny`, a linter for Rust dependencies. In
addition to linting for duplicate dependencies, it also enforces that
all dependencies are compatible with an allow-list of licenses and it
warns when a dependency is referred to from multiple crates without
introducing a workspace dependency. Thanks to existing tooling
(https://github.com/mainmatter/cargo-autoinherit), transitioning all
dependencies to workspace dependencies was quite easy.

Resolves: #7241.
2024-11-22 00:17:28 +00:00
dependabot[bot]
4014373dc2 build(deps): Bump clap from 4.5.20 to 4.5.21 in /rust (#7369)
Bumps [clap](https://github.com/clap-rs/clap) from 4.5.20 to 4.5.21.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/clap-rs/clap/releases">clap's
releases</a>.</em></p>
<blockquote>
<h2>v4.5.21</h2>
<h2>[4.5.21] - 2024-11-13</h2>
<h3>Fixes</h3>
<ul>
<li><em>(parser)</em> Ensure defaults are filled in on error with
<code>ignore_errors(true)</code></li>
</ul>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/clap-rs/clap/blob/master/CHANGELOG.md">clap's
changelog</a>.</em></p>
<blockquote>
<h2>[4.5.21] - 2024-11-13</h2>
<h3>Fixes</h3>
<ul>
<li><em>(parser)</em> Ensure defaults are filled in on error with
<code>ignore_errors(true)</code></li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="03d722625a"><code>03d7226</code></a>
chore: Release</li>
<li><a
href="3df70fb2b6"><code>3df70fb</code></a>
docs: Update changelog</li>
<li><a
href="3266c36abf"><code>3266c36</code></a>
Merge pull request <a
href="https://redirect.github.com/clap-rs/clap/issues/5691">#5691</a>
from epage/custom</li>
<li><a
href="951762db57"><code>951762d</code></a>
feat(complete): Allow any OsString-compatible type to be a
CompletionCandidate</li>
<li><a
href="bb6493e890"><code>bb6493e</code></a>
feat(complete): Offer - as a path option</li>
<li><a
href="27b348dbcb"><code>27b348d</code></a>
refactor(complete): Simplify ArgValueCandidates code</li>
<li><a
href="49b8108f8c"><code>49b8108</code></a>
feat(complete): Add PathCompleter</li>
<li><a
href="82a360aa54"><code>82a360a</code></a>
feat(complete): Add ArgValueCompleter</li>
<li><a
href="47aedc6906"><code>47aedc6</code></a>
fix(complete): Ensure paths are sorted</li>
<li><a
href="431e2bc931"><code>431e2bc</code></a>
test(complete): Ensure ArgValueCandidates get filtered</li>
<li>Additional commits viewable in <a
href="https://github.com/clap-rs/clap/compare/clap_complete-v4.5.20...clap_complete-v4.5.21">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=clap&package-manager=cargo&previous-version=4.5.20&new-version=4.5.21)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)


</details>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-19 06:16:53 +00:00
Thomas Eizinger
de35bb067e fix(telemetry): don't embed errors values in telemetry_event! (#7366)
Due to https://github.com/getsentry/sentry-rust/issues/702, errors which
are embedded as `tracing::Value` unfortunately get silently discarded
when reported as part of Sentry "Event"s and not "Exception"s.

The design idea of these telemetry events is that they aren't fatal
errors so we don't need to treat them with the highest priority. They
may also appear quite often, so to save performance and bandwidth, we
sample them at a rate of 1% at creation time.

In order to not lose the context of these errors, we instead format them
into the message. This makes them completely identical to the `debug!`
logs which we have on every call-site of `telemetry_event!` which
prompted me to make that implicit as part of creating the
`telemetry_event!`.

Resolves: #7343.
2024-11-18 18:17:08 +00:00
Thomas Eizinger
fd04812cde chore(gateway): proactive close telemetry session (#7361)
This is important for the "Release Health" statistics of Sentry.
2024-11-16 16:28:42 +00:00
Thomas Eizinger
4db3a457a9 chore(gateway): publish version 1.4.1 (#7347) 2024-11-15 05:40:12 +00:00
Thomas Eizinger
48ba2869a8 chore(rust): ban the use of .unwrap except in tests (#7319)
Using the clippy lint `unwrap_used`, we can automatically lint against
all uses of `.unwrap()` on `Result` and `Option`. This turns up quite a
few results actually. In most cases, they are invariants that can't
actually be hit. For these, we change them to `Option`. In other cases,
they can actually be hit. For example, if the user supplies an invalid
log-filter.

Activating this lint ensures the compiler will yell at us every time we
use `.unwrap` to double-check whether we do indeed want to panic here.

Resolves: #7292.
2024-11-13 03:59:22 +00:00
Thomas Eizinger
19f51568c2 chore(rust): don't pass errors as values for debug logs (#7318)
Our logging library `tracing` supports structured logging. Structured
logging means we can include values within a `tracing::Event` without
having to immediately format it as a string. Processing these values -
such as errors - as their original type allows the various `tracing`
layers to capture and represent them as they see fit.

One of these layers is responsible for sending ERROR and WARN events to
Sentry, as part of which `std::error::Error` values get automatically
captured as so-called "sentry exceptions".

Unfortunately, there is a caveat: If an `std::error::Error` value is
included in an event that does not get mapped to an exception, the
`error` field is completely lost. See
https://github.com/getsentry/sentry-rust/issues/702 for details.

To work around this, we introduce a `err_with_sources` adapter that an
error and all its sources together into a string. For all
`tracing::debug!` statements, we then use this to report these errors.

It is really unfortunate that we have to do this and cannot use the same
mechanism, regardless of the log level. However, until this is fixed
upstream, this will do and gives us better information in the log
submitted to Sentry.
2024-11-12 04:00:02 +00:00
Thomas Eizinger
488c599d5b chore(telemetry): capture Firezone ID and account in user ctx (#7310)
Sentry has a feature called the "User context" which allows us to assign
events to individual users. This in turn will give us statistics in
Sentry, how many users are affected by a certain issue.

Unfortunately, Sentry's user context cannot be built-up step-by-step but
has to be set as a whole. To achieve this, we need to slightly refactor
`Telemetry` to not be `clone`d and instead passed around by mutable
reference.

Resolves: #7248.
Related: https://github.com/getsentry/sentry-rust/issues/706.
2024-11-11 19:50:14 +00:00
Thomas Eizinger
62cb32b7a3 chore(gateway): report more tunnel errors to event-loop (#7299)
Currently, the Gateway's state machine functions for processing packets
use type-signature that only return `Option`. Any errors while
processing packets are logged internally. This makes it difficult to
consistently log these errors.

We refactor these functions to return `Result<Option<T>>` in most cases,
indicating that they may fail for various reasons and also sometimes
succeed without producing an output.

This allows us to consistently log these errors in the event-loop.
Logging them on WARN or ERROR would be too spammy though. In order to
still be alerted about some of these, we use the `telemetry_event!`
macro which samples them at a rate of 1%. This will alert us about cases
that happen often and allows us to handle them explicitly.

Once this is deployed to staging, I will monitor the alerts in Sentry to
ensure we won't get spammed with events from customers on the next
release.
2024-11-11 03:50:27 +00:00
Thomas Eizinger
e261cb3c27 chore: remove git_version! (#7270)
Reading the Git version requires the entire Git repository to be
present, including all tags. The tags are only created _after_ the
artifact is being built, when we publish the release. Therefore, these
tags are never included in the actual released binary.

For Sentry, we use the `CARGO_PKG_VERSION` variable instead. This
doesn't tell us whether somebody built a client from source and then
used it so there could be some confusion in Sentry events. It is quite
unlikely that this happens though so for the majority of Sentry alerts,
this will give us the correct version.

For the Android client, we also depend on the `GITHUB_SHA` env variable
at compile-time. We do the same thing for the GUI client here.

Resolves: #6925.
2024-11-07 22:56:17 +00:00
Thomas Eizinger
47e45a3cf3 chore(telemetry): improve telemetry spans and events (#7206)
DNS resolution is a critical part of `connlib`. If it is slow for
whatever reason, users will notice this. To make sure we notice as well,
we add `telemetry` spans to the client's and gateway's DNS resolution.
For the client, this applies to all DNS queries that we forward to the
upstream servers. For the gateway, this applies to all DNS resources.

In addition to those IO operations, we also instrument the
`match_resource_linear` function. This function operates in `O(n)` of
all defined DNS resources. It _should_ be fast enough to not create an
impact but it can't hurt to measure this regardless.

Lastly, we also instrument `refresh_translations` on the gateway.
Refreshing the DNS resolution of a DNS resource should really only
happen, when the previous IP addresses become stale yet the user is
still trying to send traffic to them. We don't actually have any data on
how often that happens. By instrumenting it, we can gather some of this
data.

To make sure that none of these telemetry events and spans hurt the
end-user performance, we introduce macros to `firezone-logging` that
sample the creation of these events and spans at a rate of 1%. I ran a
flamegraph and none of these even showed up. The most critical one here
is probably the `match_resource_linear` span because it happens on every
DNS query.

Resolves: #7198.

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
2024-11-06 01:17:57 +00:00
Thomas Eizinger
c48f3669c1 chore(gateway): log domain as field in dns resolution warning (#7204)
Logging the `domain` as part of the log message makes Sentry think that
these are distinct errors when in fact it is the same error but for
different domains.

Resolves: #7199.
2024-11-01 15:50:40 +00:00
Jamil
e9b2e4735a ci: Publish Gateway 1.4.0 (#7187)
Publish the 1.4.0 release so it's available at `/api/releases` and will
send upgrade Gateway notifications.
2024-10-30 20:44:33 +00:00
Thomas Eizinger
a2c9d148ac chore(gateway): bump version to 1.4.0 (#7090)
In order to release #6941, we need to bump the gateway's version to
1.4.0. The portal has a version gate that only allows connection clients
which have version >= 1.4.0. Thus, in order to test #6941 on staging,
the version must not yet be bumped and is thus split out into this PR.
2024-10-29 23:20:46 +00:00
Thomas Eizinger
f7a388345b fix(connlib): reconnect in case we lose all relays (#7164)
During normal operation, we should never lose connectivity to the set of
assigned relays in a client or gateway. In the presence of odd network
conditions and partitions however, it is possible that we disconnect
from a relay that is in fact only temporarily unavailable. Without an
explicit mechanism to retrieve new relays, this means that both clients
and gateways can end up with no relays at all. For clients, this can be
fixed by either roaming or signing out and in again. For gateways, this
can only be fixed by a restart!

Without connected relays, no connections can be established. With #7163,
we will at least be able to still establish direct connections. Yet,
that isn't good enough and we need a mechanism for restoring full
connectivity in such a case.

We creating a new connection, we already sample one of our relays and
assign it to this particular connection. This ensures that we don't
create an excessive amount of candidates for each individual connection.
Currently, this selection is allowed to be silently fallible. With this
PR, we make this a hard-error and bubble up the error that all the way
to the client's and gateway's event-loop. There, we initiate a reconnect
to the portal as a compensating action. Reconnecting to the portal means
we will receive another `init` message that allows us to reconnect the
relays.

Due to the nature of this implementation, this fix may only apply with a
certain delay from when we actually lost connectivity to the last relay.
However, this design has the advantage that we don't have to introduce
an additional state within `snownet`: Connections now simply fail to
establish and the next one soon after _should_ succeed again because we
will have received a new `init` message.

Resolves: #7162.
2024-10-29 01:01:47 +00:00
Thomas Eizinger
c48c33d935 chore(gateway): lower "Tunnel error" to debug (#7165)
This is spamming Sentry and we have almost reached our rate limit for
the amounts of events ingested.
2024-10-28 14:04:49 +00:00
Thomas Eizinger
8ad290f024 chore(gateway): fix bad docs on --no-telemetry flag (#7127) 2024-10-22 23:30:20 +00:00
Thomas Eizinger
990324b2ec chore(rust): enable sentry-tracing integration (#7105)
Using the `sentry-tracing` integration, we can automatically capture
events based on what we log via `tracing`. The mapping is defined as
follows:

- ERROR: Gets captured as a fatal error
- WARN: Gets captured as a message
- INFO: Gets captured as a breadcrumb
- `_`: Does not get captured at all

If telemetry isn't active / configured, this integration does nothing.
It is therefore safe to just always enable it.
2024-10-22 23:23:49 +00:00
Thomas Eizinger
b7b7626cfa feat(gateway): add error reporting via Sentry (#7103)
Similar to the GUI and headless clients, adding error reporting via
Sentry should give us much better insight into how well gateways are
performing.

Resolves: #7099.

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
2024-10-22 20:40:28 +00:00
Thomas Eizinger
73eebd2c4d refactor(rust): consistently record errors as tracing::Value (#7104)
Our logging library, `tracing` supports structured logging. This is
useful because it preserves the more than just the string representation
of a value and thus allows the active logging backend(s) to capture more
information for a particular value.

In the case of errors, this is especially useful because it allows us to
capture the sources of a particular error.

Unfortunately, recording an error as a tracing value is a bit cumbersome
because `tracing::Value` is only implemented for `&dyn
std::error::Error`. Casting an error to this is quite verbose. To make
it easier, we introduce two utility functions in `firezone-logging`:

- `std_dyn_err`
- `anyhow_dyn_err`

Tracking errors as correct `tracing::Value`s will be especially helpful
once we enable Sentry's `tracing` integration:
https://docs.rs/sentry-tracing/latest/sentry_tracing/#tracking-errors
2024-10-22 04:46:26 +00:00
Thomas Eizinger
ce1e59c9fe feat(connlib): implement idempotent control protocol for gateway (#6941)
This PR implements the new idempotent control protocol for the gateway.
We retain backwards-compatibility with old clients to allow admins to
perform a disruption-free update to the latest version.

With this new control protocol, we are moving the responsibility of
exchanging the proxy IPs we assigned to DNS resources to a p2p protocol
between client and gateway. As a result, wildcard DNS resources only get
authorized on the first access. Accessing a new domain within the same
resource will thus no longer require a roundtrip to the portal.

Overall, users will see a greatly decreased connection setup latency. On
top of that, the new protocol will allow us to more easily implement
packet buffering which will be another UX boost for Firezone.
2024-10-18 15:59:47 +00:00
Thomas Eizinger
dbe618c080 refactor(connlib): expose &mut TRoleState for direct access (#7026)
Currently, we have a lot of stupid code to forward data from the
`{Client,Gateway}Tunnel` interface to `{Client,Gateway}State`. Recent
refactorings such as #6919 made it possible to get rid of this
forwarding layer by directly exposing `&mut TRoleState`.

To maintain some type-privacy, several functions are made generic to
accept `impl Into` or `impl TryInto`.
2024-10-15 01:05:35 +00:00
Thomas Eizinger
8c4f6bdb0f chore(gateway): don't log WouldBlock on WARN (#6984)
This mirrors what we do on the clients, there is no need to log
`WouldBlock` on `WARN` as those can happen during normal operation.
2024-10-10 19:50:54 +00:00
Thomas Eizinger
2d4818e007 refactor(connlib): rotate tunnel private key on reset (#6909)
With the new control protocol specified in #6461, the client will no
longer initiate new connections. Instead, the credentials are generated
deterministically by the portal based on the gateway's and the client's
public key. For as long as they use the same public key, they also have
the same in-memory state which makes creating connections idempotent.

What we didn't consider in the new design at first is that when clients
roam, they discard all connections but keep the same private key. As a
result, the portal would generate the same ICE credentials which means
the gateway thinks it can reuse the existing connection when new flows
get authorized. The client however discarded all connections (and
rotated its ports and maybe IPs), meaning the previous candidates sent
to the gateway are no longer valid and connectivity fails.

We fix this by also rotating the private keys upon reset. Rotating the
keys itself isn't enough, we also need to propagate the new public key
all the way "over" to the phoenix channel component which lives
separately from connlib's data plane.

To achieve this, we change `PhoenixChannel` to now start in the
"disconnected" state and require an explicit `connect` call. In
addition, the `LoginUrl` constructed by various components now acts
merely as a "prototype", which may require additional data to construct
a fully valid URL. In the case of client and gateway, this is the public
key of the `Node`. This additional parameter needs to be passed to
`PhoenixChannel` in the `connect` call, thus forming a type-safe
contract that ensures we never attempt to connect without providing a
public key.

For the relay, this doesn't apply.

Lastly, this allows us to tidy up the code a bit by:

a) generating the `Node`'s private key from the existing RNG
b) removing `ConnectArgs` which only had two members left

Related: #6461.
Related: #6732.
2024-10-07 22:28:51 +00:00
dependabot[bot]
1140656b83 build(deps): Bump clap from 4.5.18 to 4.5.19 in /rust (#6950)
Bumps [clap](https://github.com/clap-rs/clap) from 4.5.18 to 4.5.19.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/clap-rs/clap/releases">clap's
releases</a>.</em></p>
<blockquote>
<h2>v4.5.19</h2>
<h2>[4.5.19] - 2024-10-01</h2>
<h3>Internal</h3>
<ul>
<li>Update dependencies</li>
</ul>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/clap-rs/clap/blob/master/CHANGELOG.md">clap's
changelog</a>.</em></p>
<blockquote>
<h2>[4.5.19] - 2024-10-01</h2>
<h3>Internal</h3>
<ul>
<li>Update dependencies</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="108907385c"><code>1089073</code></a>
chore: Release</li>
<li><a
href="c9b8c85f09"><code>c9b8c85</code></a>
docs: Update changelog</li>
<li><a
href="8b3de18a8d"><code>8b3de18</code></a>
Merge pull request <a
href="https://redirect.github.com/clap-rs/clap/issues/5685">#5685</a>
from epage/engine</li>
<li><a
href="b38538d7c4"><code>b38538d</code></a>
fix(complete)!: Rename dynamic to engine</li>
<li><a
href="232af62f7d"><code>232af62</code></a>
Merge pull request <a
href="https://redirect.github.com/clap-rs/clap/issues/5684">#5684</a>
from epage/endless</li>
<li><a
href="0209a79031"><code>0209a79</code></a>
fix(complete): Don't cause endless completions for bash/zsh</li>
<li>See full diff in <a
href="https://github.com/clap-rs/clap/compare/clap_complete-v4.5.18...clap_complete-v4.5.19">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=clap&package-manager=cargo&previous-version=4.5.18&new-version=4.5.19)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)


</details>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-07 14:19:56 +00:00
Thomas Eizinger
be250f1e00 refactor(connlib): repurpose connlib-shared as connlib-model (#6919)
The `connlib-shared` crate has become a bit of a dependency magnet
without a clear purpose. It hosts utilities like `get_user_agent`,
messages for the client and gateway to communicate with the portal and
domain types like `ResourceId`.

To create a better dependency structure in our workspace, we repurpose
`connlib-shared` as a `connlib-model` crate. Its purpose is to host
domain-specific model types that multiple crates may want to use. For
that purpose, we rename the `callbacks::ResourceDescription` type to
`ResourceView`, designating that this is a _view_ onto a resource as
seen by `connlib`. The message types which currently double up as
connlib-internal model thus become an implementation detail of
`firezone-tunnel` and shouldn't be used for anything else.

---------

Signed-off-by: Reactor Scram <ReactorScram@users.noreply.github.com>
Co-authored-by: Reactor Scram <ReactorScram@users.noreply.github.com>
2024-10-03 14:47:58 +00:00
Jamil
613127d298 ci: Bump all clients and gateway (#6923)
Main fix: idle connection timing. These have already been released.

---------

Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
2024-10-03 07:12:52 -07:00
Thomas Eizinger
e901d51550 refactor(gateway): split proxy IP assignment from authorisation (#6812)
At the moment, the mapping of proxy IPs to the resolved IPs of a DNS
resource happens at the same time as the "authorisation" that the client
is allowed to talk to that resource. This is somewhat convoluted
because:

- Mapping proxy IPs to resolved IPs only needs to happen for DNS
resources, yet it is called for all resources (and internally skipped).
- Wildcard DNS resources only need to be authorised once, after which
the client is allowed to communicate with any domain matching the
wildcard address.
- The code that models resources within `ClientOnGateway` doesn't
differentiate between resource types at all.

With #6461, the authorisation of a resource will be completely decoupled
from the domain resolution for a particular domain of a DNS resource. To
make that easier to implement, we re-model the internals of
`ClientOnGateway` to differentiate the various resource types. Instead
of holding a single vec of addresses, the IPs are now indexed by the
respective domain. For CIDR resources, we only hold a single address
anyway and for the Internet Resource, the IP networks are static.

This new model now implies that allowing a resource that has already
been allowed essentially implies an update and the filters get
re-calculated.
2024-09-26 23:04:03 +00:00
Thomas Eizinger
29bc276bf2 refactor(connlib): parallelise TUN operations (#6673)
Currently, `connlib` is entirely single-threaded. This allows us to
reuse a single buffer for processing IP packets and makes reasoning of
the packet processing code very simple. Being single-threaded also means
we can only make use of a single CPU core and all operations have to be
sequential.

Analyzing `connlib` using `perf` shows that we spend 26% of our CPU time
writing packets to the TUN interface [0]. Because we are
single-threaded, `connlib` cannot do anything else during this time. If
we could offload the writing of these packets to a different thread,
`connlib` could already process the next packet while the current one is
writing.

Packets that we send to the TUN interface arrived as an encrypted WG
packet over UDP and get decrypted into a - currently - shared buffer.
Moving the writing to a different thread implies that we have to have
more of these buffer that the next packet(s) can be decrypted into.

To avoid IP fragmentation, we set the maximum IP MTU to 1280 bytes on
the TUN interface. That actually isn't very big and easily fits into a
stackframe. The default stack size for threads is 2MB [1].

Instead of creating more buffers and cycling through them, we can also
simply stack-allocate our IP packets. This incurs some overhead from
copying packets but it is only ~3.5% [2] (This was measured without a
separate thread). With stack-allocated packets, almost all
lifetime-annotations go away which in itself is already a welcome
ergonomics boost. Stack-allocated packets also means we can simply spawn
a new thread for the packet processing. This thread is connected with
two channel to connlib's main thread. The capacity of 1000 packets will
at most consume an additional 3.5 MB of memory which is fine even on our
most-constrained devices such as iOS.

[0]: https://share.firefox.dev/3z78CzD
[1]: https://doc.rust-lang.org/std/thread/#stack-size
[2]: https://share.firefox.dev/3Bf4zla

Resolves: #6653.
Resolves: #5541.
2024-09-26 03:03:35 +00:00
dependabot[bot]
fec6cc9923 build(deps): Bump clap from 4.5.4 to 4.5.13 in /rust (#6800)
Bumps [clap](https://github.com/clap-rs/clap) from 4.5.4 to 4.5.13.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/clap-rs/clap/releases">clap's
releases</a>.</em></p>
<blockquote>
<h2>v4.5.13</h2>
<h2>[4.5.13] - 2024-07-31</h2>
<h3>Fixes</h3>
<ul>
<li><em>(derive)</em> Improve error message when
<code>#[flatten]</code>ing an optional <code>#[group(skip)]</code></li>
<li><em>(help)</em> Properly wrap long subcommand descriptions in
help</li>
</ul>
<h2>v4.5.12</h2>
<h2>[4.5.12] - 2024-07-31</h2>
<h2>v4.5.10</h2>
<h2>[4.5.10] - 2024-07-23</h2>
<h2>v4.5.9</h2>
<h2>[4.5.9] - 2024-07-09</h2>
<h3>Fixes</h3>
<ul>
<li><em>(error)</em> When defining a custom help flag, be sure to
suggest it like we do the built-in one</li>
</ul>
<h2>v4.5.8</h2>
<h2>[4.5.8] - 2024-06-28</h2>
<h3>Fixes</h3>
<ul>
<li>Reduce extra flushes</li>
</ul>
<h2>v4.5.7</h2>
<h2>[4.5.7] - 2024-06-10</h2>
<h3>Fixes</h3>
<ul>
<li>Clean up error message when too few arguments for
<code>num_args</code></li>
</ul>
<h2>v4.5.6</h2>
<h2>[4.5.6] - 2024-06-06</h2>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/clap-rs/clap/blob/master/CHANGELOG.md">clap's
changelog</a>.</em></p>
<blockquote>
<h2>[4.5.13] - 2024-07-31</h2>
<h3>Fixes</h3>
<ul>
<li><em>(derive)</em> Improve error message when
<code>#[flatten]</code>ing an optional <code>#[group(skip)]</code></li>
<li><em>(help)</em> Properly wrap long subcommand descriptions in
help</li>
</ul>
<h2>[4.5.12] - 2024-07-31</h2>
<h2>[4.5.11] - 2024-07-25</h2>
<h2>[4.5.10] - 2024-07-23</h2>
<h2>[4.5.9] - 2024-07-09</h2>
<h3>Fixes</h3>
<ul>
<li><em>(error)</em> When defining a custom help flag, be sure to
suggest it like we do the built-in one</li>
</ul>
<h2>[4.5.8] - 2024-06-28</h2>
<h3>Fixes</h3>
<ul>
<li>Reduce extra flushes</li>
</ul>
<h2>[4.5.7] - 2024-06-10</h2>
<h3>Fixes</h3>
<ul>
<li>Clean up error message when too few arguments for
<code>num_args</code></li>
</ul>
<h2>[4.5.6] - 2024-06-06</h2>
<h2>[4.5.5] - 2024-06-06</h2>
<h3>Fixes</h3>
<ul>
<li>Allow <code>exclusive</code> to override
<code>required_unless_present</code>,
<code>required_unless_present_any</code>,
<code>required_unless_present_all</code></li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="d222ae4cb6"><code>d222ae4</code></a>
chore: Release</li>
<li><a
href="a8abcb40c5"><code>a8abcb4</code></a>
docs: Update changelog</li>
<li><a
href="2690e1bdb1"><code>2690e1b</code></a>
Merge pull request <a
href="https://redirect.github.com/clap-rs/clap/issues/5621">#5621</a>
from shannmu/dynamic_valuehint</li>
<li><a
href="7fd7b3e40b"><code>7fd7b3e</code></a>
feat(clap_complete): Support to complete custom value of argument</li>
<li><a
href="fc6aaca52b"><code>fc6aaca</code></a>
Merge pull request <a
href="https://redirect.github.com/clap-rs/clap/issues/5638">#5638</a>
from epage/cargo</li>
<li><a
href="631e54bc71"><code>631e54b</code></a>
docs(cookbook): Style cargo plugin</li>
<li><a
href="6fb49d08bb"><code>6fb49d0</code></a>
Merge pull request <a
href="https://redirect.github.com/clap-rs/clap/issues/5636">#5636</a>
from gibfahn/styles_const</li>
<li><a
href="6f215eee98"><code>6f215ee</code></a>
refactor(styles): make styles example use a const</li>
<li><a
href="bbb2e6fdde"><code>bbb2e6f</code></a>
test: Add test case for completing custom value of argument</li>
<li><a
href="999071c46d"><code>999071c</code></a>
fix: Change <code>visible</code> to <code>hidden</code></li>
<li>Additional commits viewable in <a
href="https://github.com/clap-rs/clap/compare/clap_complete-v4.5.4...clap_complete-v4.5.13">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=clap&package-manager=cargo&previous-version=4.5.4&new-version=4.5.13)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)


</details>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Reactor Scram <ReactorScram@users.noreply.github.com>
2024-09-24 14:22:12 +00:00
Thomas Eizinger
480a065bf8 chore(connlib): mitigate WARN logs from phoenix-channel (#6759)
Merging #6708 had an unintended side-effect that we are seeing a lot of
WARN logs from phoenix-channel because we can no longer parse the
response from gateways. We didn't do anything with these responses but
gateways are sending them for backwards-compatibility reasons.

To not confuse ourselves while debugging, we revert the client-side bit
of #6708 to remove these warnings.
2024-09-18 20:36:04 +00:00
Thomas Eizinger
5ae06a7b8c chore(gateway): remove domain response (breaks < 1.1.0 clients) (#6708)
Prior to version 1.1.0, clients did not have an embedded DNS resolver
and relied on the gateway for DNS resolution. In that design, the
gateway responded with the IPs that the domain resolved to.

Our next iteration of the control protocol (#6461) will decouple the
details of how DNS works from the flow-authorization. As a result, we
will need to be able to establish a flow for a DNS resource without
knowing which concrete domain the client is going to access.

Without a concrete domain, we cannot send anything back to these old
clients, meaning we unfortunately have to break compatibility with <
1.1.0 clients as part of implementing the new control protocol.
2024-09-18 14:12:46 +00:00
Jamil
ae5613b223 ci: Update changelog for 1.3.1ish clients (#6612)
Bumps internet resource UI.
2024-09-06 00:07:52 +00:00
Jamil
c6b0b0a922 ci: Release 1.3.0 for Internet Resource (#6503)
This publishes the 1.3.0 clients and gateways so that Internet Resources
will work.

The feature is still disabled for the Stripe plans until we publish the
launch post. Select customers have the feature enabled.

Closes #2667
2024-08-30 01:21:34 -07:00
Thomas Eizinger
35017537c7 feat(gateway): allow out-of-order allow_access requests (#6403)
Currently, the gateway requires a strict ordering of first receiving a
`request_connection` message, following by multiple `allow_access`
messages. Additionally, access can be granted as part of the initial
`request_connection` message too.

This isn't an ideal design. Setting up a new connection is infallible,
all we need to do is send our ICE credentials back to the client.
However, untangling that will require a bit more effort.

Starting with #6335, following this strict order on the client is a more
difficult. Whilst we can send them in order, it is harder to maintain
those ordering guarantees across all our systems.

To avoid this, we change the gateway to perform an upsert for its local
ACLs for a client. In case that an `allow_access` call would somehow get
to the gateway earlier, we can simply already create the `Peer` and only
set up the actual connection later.

---------

Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
2024-08-28 13:10:06 +00:00
Jamil
ea33b7868f ci: Bump GUI to 1.2.1 (#6462) 2024-08-27 22:19:26 -07:00
Thomas Eizinger
a1049b7d78 feat(connlib): suspend if we don't have UDP sockets (#6398)
Previously, failing to bind to any interfaces was a hard-error. In
reality and in `connlib`'s current state, this is quite unlikely because
machines will at least have a loopback interface that we will bind to.

However, with #6382 in the pipeline, it may be more likely that we
actually end up with no functional UDP sockets. Furthermore, we are
considering to extend those connectivity checks in the future.

Thus, it is important that the case of "no available UDP sockets" is
gracefully handled.

Instead of failing with a hard-error, we now suspend `connlib's` network
stack. The connectivity to the portal is unaffected by this and we will
still also receive commands from the client application like `reset`.
When we receive a `reset`, we attempt to rebind the sockets and thus
retry connectivity.

Because we are suspending the entire eventloop, this won't send any
messages or trigger any timers whatsoever. For example, if we
hypothetically started up without network interfaces, this is now the
log output:

```
2024-08-22T01:50:42.170101Z  INFO firezone_headless_client: arch="x86_64" git_version="headless-client-1.2.0-2-gc8eed5938-modified"
2024-08-22T01:50:42.178777Z DEBUG phoenix_channel: Connecting to portal host=api.firez.one user_agent=NixOS/24.5.0 connlib/1.2.1 (x86_64; 6.8.12)
2024-08-22T01:50:42.178978Z DEBUG firezone_headless_client::dns_control::linux: Deactivating DNS control...
2024-08-22T01:50:42.180691Z ERROR firezone_tunnel::sockets: No available UDP sockets
2024-08-22T01:50:42.197098Z  INFO firezone_tunnel::device_channel: Initializing TUN device name=tun-firezone
2024-08-22T01:50:42.197165Z DEBUG firezone_tunnel::client: Unable to update DNS servesr without interface configuration
2024-08-22T01:50:42.453988Z DEBUG tungstenite::handshake::client: Client handshake done.
2024-08-22T01:50:42.454161Z  INFO phoenix_channel: Connected to portal host=api.firez.one
2024-08-22T01:50:42.676825Z DEBUG firezone_tunnel::client: Updating DNS servers mapping={fd00:2021:1111:8000:100:100:111:0 <> [2606:4700:4700::1111]:53, 100.100.111.1 <> 1.1.1.1:53}
2024-08-22T01:50:42.677084Z  INFO firezone_tunnel::client: Activating resource name=IPerf3 address=10.0.32.101/32 sites=AWS Dev (Gateways track `main`)
2024-08-22T01:50:42.677173Z  INFO firezone_tunnel::client: Activating resource name=*.slack.com address=**.slack.com sites=Vultr Stable (Latest Release Gateways)
2024-08-22T01:50:42.677223Z  INFO firezone_tunnel::client: Activating resource name=*.slack-edge.com address=**.slack-edge.com sites=Vultr Stable (Latest Release Gateways)
2024-08-22T01:50:42.677283Z  INFO firezone_tunnel::client: Activating resource name=*.spotify.com address=**.spotify.com sites=AWS Dev (Gateways track `main`)
2024-08-22T01:50:42.677345Z  INFO firezone_tunnel::client: Activating resource name=*.github.com address=**.github.com sites=AWS Dev (Gateways track `main`)
2024-08-22T01:50:42.677418Z  INFO firezone_tunnel::client: Activating resource name=whatismyip.com address=**.whatismyip.com sites=AWS Dev (Gateways track `main`)
2024-08-22T01:50:42.677489Z  INFO firezone_tunnel::client: Activating resource name=ifconfig.net address=ifconfig.net sites=Vultr Stable (Latest Release Gateways)
2024-08-22T01:50:42.677538Z  INFO firezone_tunnel::client: Activating resource name=*.google.com address=**.google.com sites=AWS Dev (Gateways track `main`)
2024-08-22T01:50:42.677632Z  INFO firezone_tunnel::client: Activating resource name=*.fastmail.com address=**.fastmail.com sites=AWS Dev (Gateways track `main`)
2024-08-22T01:50:42.677682Z  INFO firezone_tunnel::client: Activating resource name=speed.cloudflare.com address=speed.cloudflare.com sites=Vultr Stable (Latest Release Gateways)
2024-08-22T01:50:42.678212Z  INFO snownet::node: Added new TURN server rid=b6fc4d73-9c8e-44df-a941-da7d2134cb70 address=Dual { v4: 34.40.133.55:3478, v6: [2600:1900:40b0:1504:0:97::]:3478 }
2024-08-22T01:50:42.678322Z  INFO snownet::node: Added new TURN server rid=c818b11a-d0cc-4f2a-bb88-473d8298a885 address=Dual { v4: 34.81.229.132:3478, v6: [2600:1900:4030:b0d9:0:9b::]:3478 }
2024-08-22T01:50:42.678365Z  INFO connlib_client_shared::eventloop: Firezone Started!
```

After this, nothing will happen other than receiving messages via from
the portal or the client app.

Related: #6382.
Related: #6385.
2024-08-22 04:15:31 +00:00
Jamil
c8eed59387 ci: Release 1.2.0 (#6395)
Releasing 1.2.0 to unblock portal deploy! Some of these have already
been published.
2024-08-22 00:18:27 +00:00
Thomas Eizinger
d399e65246 build(deps): bump tokio-tungstenite to 0.23 (#5509)
With the upgrade to 0.23, `tokio-tungstenite` pulls in `rustls` 0.27
which supports multiple crypto providers. By default, this uses the
`aws-lc-crypto` provider. The previous default was `ring`.

This PR bumps the necessary versions and installs the `ring` crypto
provider at the beginning of each application, before connlib starts. We
try and do this as early as possible to make it obvious that it only
needs to happen once per process.

Resolves: #5380.
2024-08-15 06:02:17 +00:00