`MACOSX_DEPLOYMENT_TARGET` is a standard env var read by gcc and rustc
that determines which version of macOS to target binaries for. This
variable was being removed inadvertently in our rust build script.
Exposing this var fixes a slew of warnings about object files being
built for a newer macOS target than being linked that were showing up in
Xcode for some time now.
Hasn't caused any issues thus far from what I can tell, but possibly
related to #7442
## Context
At present, `connlib` sends UDP packets one at a time. Sending a packet
requires us to make a syscall which is quite expensive. Under load, i.e.
during a speedtest, syscalls account for over 50% of our CPU time [0].
In order to improve this situation, we need to somehow make use of GSO
(generic segmentation offload). With GSO, we can send multiple packets
to the same destination in a single syscall.
The tricky question here is, how can we achieve having multiple UDP
packets ready at once so we can send them in a single syscall? Our TUN
interface only feeds us packets one at a time and `connlib`'s state
machine is single-threaded. Additionally, we currently only have a
single `EncryptBuffer` in which the to-be-sent datagram sits.
## 1. Stack-allocating encrypted IP packets
As a first step, we get rid of the single `EncryptBuffer` and instead
stack-allocate each encrypted IP packet. Due to our small MTU, these
packets are only around 1300 bytes. Stack-allocating that requires a few
memcpy's but those are in the single-digit % range in the terms of CPU
time performance hit. That is nothing compared to how much time we are
spending on UDP syscalls. With the `EncryptBuffer` out the way, we can
now "freely" move around the `EncryptedPacket` structs and - technically
- we can have multiple of them at the same time.
## 2. Implementing GSO
The GSO interface allows you to pass multiple packets **of the same
length and for the same destination** in a single syscall, meaning we
cannot just batch-up arbitrary UDP packets. Counterintuitively, making
use of GSO requires us to do more copying: In particular, we change the
interface of `Io` such that "sending" a packet performs essentially a
lookup of a `BytesMut`-buffer by destination and packet length and
appends the payload to that packet.
## 3. Batch-read IP packets
In order to actually perform GSO, we need to process more than a single
IP packet in one event-loop tick. We achieve this by batch-reading up to
50 IP packets from the mpsc-channel that connects `connlib`'s main
event-loop with the dedicated thread that reads and writes to the TUN
device. These reads and writes happen concurrently to `connlib`'s packet
processing. Thus, it is likely that by the time `connlib` is ready to
process another IP packet, multiple have been read from the device and
are sitting in the channel. Batch-processing these IP packets means that
the buffers in our `GsoQueue` are more likely to contain more than a
single datagram.
Imagine you are running a file upload. The OS will send many packets to
the same destination IP and likely max MTU to the TUN device. It is
likely, that we read 10-20 of these packets in one batch (i.e. within a
single "tick" of the event-loop). All packets will be appended to the
same buffer in the `GsoQueue` and on the next event-loop tick, they will
all be flushed out in a single syscall.
## Results
Overall, this results in a significant reduction of syscalls for sending
UDP message. In [1], we spend only a total of 16% of our CPU time in
`udpv6_sendmsg` whereas in [0] (main), we spent a total of 34%. Do note
that these numbers are relative to the total CPU time spent per program
run and thus can't be compared directly (i.e. you cannot just do 34 - 16
and say we now spend 18% less time sending UDP packets). Nevertheless,
this appears to be a great improvement.
In terms of throughput, we achieve a ~60% improvement in our benchmark
suite. That one is running on localhost though so it might not
necessarily be reflect like that in a real network.
[0]: https://share.firefox.dev/4hvoPju
[1]: https://share.firefox.dev/4frhCPv
A client may have lost its state and therefore "probe" the relay whether
or not is still has an allocation. If it does, it will react to the
error, delete it and make a new one. This is no reason to print a
warning on the relay side.
Windows appears to sometimes send ICMP (error?) packets to our DNS
resolver IPs. There is nothing we can do with these but the current code
path logs them as a warning because we expect everything that isn't TCP
to be UDP.
With this patch, we slightly change the parsing logic to first attempt
extracting the UDP packet and fail only with a DEBUG log if it isn't.
Our `phoenix-channel` component is responsible for maintaining a
WebSocket connection to the portal. In case that connection fails, we
want to reconnect to it using an exponential backoff, eventually giving
up after a certain amount of time.
Unfortunately, the code we have today doesn't quite do that. An
`ExponentialBackoff` has a setting for the `max_elapsed_time`.
Regardless of how many and how often we retry something, we won't ever
wait longer than this amount of time. For the Relay, this is set to
15min. For other components its indefinite (Gateway, headless-client),
or very long (30 days for Android, 1 day for Apple).
The point in time from which this duration is counted is when the
`ExponentialBackoff` is **constructed** which translates to when we
**first** connected to the portal. As a result, our backoff would
immediately fail on the first error if it has been longer than
`max_elapsed_time` since we first connected. For most components, this
codepath is not relevant because the `max_elapsed_time` is so long. For
the Relay however, that is only 15 minutes so chances are, the Relay
would immediately fail (and get rebooted) on the first connection error
with the portal.
To fix this, we now lazily create the `ExponentialBackoff` on the first
error.
This bug has some interesting consequences: When a relay reboots, it
looses all its state, i.e. allocations, channel bindings, available
nonces etc, stamp-secret. Thus, all credentials and state that got
distributed to Clients and Gateways get invalidated, causing disconnects
from the Relay. We have observed these alerts in Sentry for a while and
couldn't explain them. Most likely, this is the root cause for those
because whilst a Relay disconnects, the portal also cannot detect its
presence and pro-actively inform Clients and Gateways to no longer use
this Relay.
Rust 1.83 comes with a bunch of new lints for elidible lifetimes. Those
also trigger in the generated code of `derivative`. That crate is
actually unmaintained so we replace our usages of it with `derive_more`.
With this PR, we reduce some of the warnings emitted by the relay. If we
can only partially fulfill an allocation, we now only emit a warning.
Similarly, if we receive a repeated SIGTERM signal, we shut down
successfully (i.e. exit with code 0) instead of failing the event-loop.
During normal operation, we wait for all allocations to expire before we
shut down. On CI however, the relay gets shutdown much earlier so this
would generate unnecessary errors.
Receiving another SIGTERM is a user-initiated action so we shouldn't
fail as a result but instead just comply with it.
Our relays aren't semver-versioned like other components. So for the
version reported to Sentry, we use the current Git SHA. This one is only
available as an ENV variable because we are building within a docker
container using `cargo cross`. By default, no env variables are passed
through to the container. To fix this, we need to add a configuration
file that explicitly opts-in to the necessary ENV variable.
We ignore some advisories of unmaintained crates flagged by `cargo
deny`. As long as the crates work for us, there is not much reason to
directly remove them, especially if it requires upstream effort. We will
get rid of these as they cause problems.
To avoid having to look up what they correspond to, we add a comment to
each line.
Bumps
[tauri-winrt-notification](https://github.com/tauri-apps/winrt-notification)
from 0.6.0 to 0.7.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/tauri-apps/winrt-notification/releases">tauri-winrt-notification's
releases</a>.</em></p>
<blockquote>
<h2>tauri-winrt-notification v0.7.0</h2>
<p>Updating crates.io index
Locking 25 packages to latest compatible versions
Adding quick-xml v0.31.0 (latest: v0.37.0)
Adding windows-strings v0.1.0 (latest: v0.2.0)</p>
<!-- raw HTML omitted -->
<pre><code>Fetching advisory database from
`https://github.com/RustSec/advisory-db.git`
Loaded 664 security advisories (from /home/runner/.cargo/advisory-db)
Updating crates.io index
Scanning Cargo.lock for vulnerabilities (25 crate dependencies)
</code></pre>
<!-- raw HTML omitted -->
<h2>[0.7.0]</h2>
<ul>
<li><a
href="987f44fe47"><code>987f44f</code></a>
(<a
href="https://redirect.github.com/tauri-apps/winrt-notification/pull/37">#37</a>
by <a
href="https://github.com/tauri-apps/winrt-notification/../../iKineticate"><code>@iKineticate</code></a>)
Added progress bar APIs, <code>Toast::progress</code> and
<code>Toast::set_progress</code></li>
</ul>
<!-- raw HTML omitted -->
<pre><code>Updating crates.io index
Packaging tauri-winrt-notification v0.7.0
(/home/runner/work/winrt-notification/winrt-notification)
Updating crates.io index
Packaged 31 files, 98.2KiB (44.3KiB compressed)
Uploading tauri-winrt-notification v0.7.0
(/home/runner/work/winrt-notification/winrt-notification)
Uploaded tauri-winrt-notification v0.7.0 to registry `crates-io`
note: waiting for `tauri-winrt-notification v0.7.0` to be available at
registry `crates-io`.
You may press ctrl-c to skip waiting; the crate should be available
shortly.
Published tauri-winrt-notification v0.7.0 at registry `crates-io`
</code></pre>
<!-- raw HTML omitted -->
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/tauri-apps/winrt-notification/blob/dev/CHANGELOG.md">tauri-winrt-notification's
changelog</a>.</em></p>
<blockquote>
<h2>[0.7.0]</h2>
<ul>
<li><a
href="987f44fe47"><code>987f44f</code></a>
(<a
href="https://redirect.github.com/tauri-apps/winrt-notification/pull/37">#37</a>
by <a
href="https://github.com/tauri-apps/winrt-notification/../../iKineticate"><code>@iKineticate</code></a>)
Added progress bar APIs, <code>Toast::progress</code> and
<code>Toast::set_progress</code></li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="98d351ba24"><code>98d351b</code></a>
Publish New Versions (<a
href="https://redirect.github.com/tauri-apps/winrt-notification/issues/38">#38</a>)</li>
<li><a
href="987f44fe47"><code>987f44f</code></a>
feat: add progress bar support (<a
href="https://redirect.github.com/tauri-apps/winrt-notification/issues/37">#37</a>)</li>
<li>See full diff in <a
href="https://github.com/tauri-apps/winrt-notification/compare/tauri-winrt-notification-v0.6...tauri-winrt-notification-v0.7">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [serde_json](https://github.com/serde-rs/json) from 1.0.132 to
1.0.133.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/serde-rs/json/releases">serde_json's
releases</a>.</em></p>
<blockquote>
<h2>v1.0.133</h2>
<ul>
<li>Implement From<[T; N]> for serde_json::Value (<a
href="https://redirect.github.com/serde-rs/json/issues/1215">#1215</a>)</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="0903de449c"><code>0903de4</code></a>
Release 1.0.133</li>
<li><a
href="2b65ca0949"><code>2b65ca0</code></a>
Merge pull request <a
href="https://redirect.github.com/serde-rs/json/issues/1215">#1215</a>
from dtolnay/fromarray</li>
<li><a
href="4e5f985958"><code>4e5f985</code></a>
Implement From<[T; N]> for Value</li>
<li><a
href="2ccb5b67ca"><code>2ccb5b6</code></a>
Disable question_mark clippy lint in lexical test</li>
<li><a
href="a11f5f2bc4"><code>a11f5f2</code></a>
Resolve unnecessary_map_or clippy lints</li>
<li><a
href="07f280a79c"><code>07f280a</code></a>
Wrap PR 1213 to 80 columns</li>
<li><a
href="75ed44722d"><code>75ed447</code></a>
Merge pull request <a
href="https://redirect.github.com/serde-rs/json/issues/1213">#1213</a>
from djmitche/safety-comment</li>
<li><a
href="73011c0b2b"><code>73011c0</code></a>
Add a safety comment to unsafe block</li>
<li><a
href="be2198a54d"><code>be2198a</code></a>
Prevent upload-artifact step from causing CI failure</li>
<li><a
href="7cce517f53"><code>7cce517</code></a>
Raise minimum version for preserve_order feature to Rust 1.65</li>
<li>Additional commits viewable in <a
href="https://github.com/serde-rs/json/compare/v1.0.132...v1.0.133">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
In addition to monitoring clients and gateways, it is also useful to
monitor relays in the same way. This gives us alerts on ERROR and WARN
messages logged by the relay as well as panics.
By removing the use of the `#[tokio::main]`, we can ensure that
telemetry is initialised as early as possible.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
When `proptest` discovers a test failure, it will attempt to "shrink"
the input to identify, what exactly causes the issue. How this is done
depends on the data type but mostly performs things such as binary
search to be efficient. Not every input within our tests is relevant for
a failure. For example, which ID we have sampled for a client or a
gateway doesn't at all affect whether or not the test will fail.
`proptest` doesn't know that though so it will still happily spend
shrinking time and cycles on figuring out the minimal difference in IDs
(which is 1 because they have to be different). This is a huge waste of
time for no benefit. We are getting much more value out of `proptest` if
it tries to adjust other things such as the transitions involved in a
test, how many gateways and relays there are etc.
By marking the strategies for the IDs and private keys with `no_shrink`,
we can achieve that.
When generating random input data in property-based tests, we have to
ensure that the data conforms to certain criteria. For example, IP
addresses must not be multicast or unspecified addresses and they must
not be within our reserved IP ranges.
Currently, we ensure this using "filtering" which is a pretty poor
technique [0]. To improve on this, we refactor the generation of IPs to
automatically exclude all IPs within certain ranges. On very big
test-runs (i.e. > 30000 test cases), too many local rejections lead to
the test suite being aborted early.
[0]:
https://proptest-rs.github.io/proptest/proptest/tutorial/filtering.html
One of Rust's promises is "if it compiles, it works". However, there are
certain situations in which this isn't true. In particular, when using
dynamic typing patterns where trait objects are downcast to concrete
types, having two versions of the same dependency can silently break
things.
This happened in #7379 where I forgot to patch a certain Sentry
dependency. A similar problem exists with our `tracing-stackdriver`
dependency (see #7241).
Lastly, duplicate dependencies increase the compile-times of a project,
so we should aim for having as few duplicate versions of a particular
dependency as possible in our dependency graph.
This PR introduces `cargo deny`, a linter for Rust dependencies. In
addition to linting for duplicate dependencies, it also enforces that
all dependencies are compatible with an allow-list of licenses and it
warns when a dependency is referred to from multiple crates without
introducing a workspace dependency. Thanks to existing tooling
(https://github.com/mainmatter/cargo-autoinherit), transitioning all
dependencies to workspace dependencies was quite easy.
Resolves: #7241.
This switches our `sentry-tracing` dependency to a fork that includes
https://github.com/getsentry/sentry-rust/pull/708. Recording our span
fields with breadcrumbs is important to provide accurate context of the
message. Without the span fields, the messages give us a lot less
information.
Since the last release, the open issue on `flush` having a flipped
return value got fixed as well.
In order to avoid processing of responses of relays that somehow got
altered on the network path, we now use the client's `password` as a
shared secret for the relay to also authenticate its responses. This
means that not all message can be authenticated. In particular, BINDING
requests will still be unauthenticated.
Performing this validation now requires every component that crafts
input to the `Allocation` to include a valid `MessageIntegrity`
attribute. This is somewhat problematic for the regression tests of the
relay and the unit tests of `Allocation`. In both cases, we implement
workarounds so we don't have to actually compute a valid
`MessageIntegrity`. This is deemed acceptable because:
- Both of these are just tests.
- We do test the validation path using `tunnel_test` because there we
run an actual relay.
When debugging issues related to our TURN allocation code, we sometimes
only have the logs that code submitted to Sentry. As part of the event,
we submit the last 500 debug logs as breadcrumbs to give more context to
the error.
Unconditionally printing the attributes of each request-response pair
will help us in more easily diagnosing, why certain errors happen.
Bumps [clap](https://github.com/clap-rs/clap) from 4.5.20 to 4.5.21.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/clap-rs/clap/releases">clap's
releases</a>.</em></p>
<blockquote>
<h2>v4.5.21</h2>
<h2>[4.5.21] - 2024-11-13</h2>
<h3>Fixes</h3>
<ul>
<li><em>(parser)</em> Ensure defaults are filled in on error with
<code>ignore_errors(true)</code></li>
</ul>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/clap-rs/clap/blob/master/CHANGELOG.md">clap's
changelog</a>.</em></p>
<blockquote>
<h2>[4.5.21] - 2024-11-13</h2>
<h3>Fixes</h3>
<ul>
<li><em>(parser)</em> Ensure defaults are filled in on error with
<code>ignore_errors(true)</code></li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="03d722625a"><code>03d7226</code></a>
chore: Release</li>
<li><a
href="3df70fb2b6"><code>3df70fb</code></a>
docs: Update changelog</li>
<li><a
href="3266c36abf"><code>3266c36</code></a>
Merge pull request <a
href="https://redirect.github.com/clap-rs/clap/issues/5691">#5691</a>
from epage/custom</li>
<li><a
href="951762db57"><code>951762d</code></a>
feat(complete): Allow any OsString-compatible type to be a
CompletionCandidate</li>
<li><a
href="bb6493e890"><code>bb6493e</code></a>
feat(complete): Offer - as a path option</li>
<li><a
href="27b348dbcb"><code>27b348d</code></a>
refactor(complete): Simplify ArgValueCandidates code</li>
<li><a
href="49b8108f8c"><code>49b8108</code></a>
feat(complete): Add PathCompleter</li>
<li><a
href="82a360aa54"><code>82a360a</code></a>
feat(complete): Add ArgValueCompleter</li>
<li><a
href="47aedc6906"><code>47aedc6</code></a>
fix(complete): Ensure paths are sorted</li>
<li><a
href="431e2bc931"><code>431e2bc</code></a>
test(complete): Ensure ArgValueCandidates get filtered</li>
<li>Additional commits viewable in <a
href="https://github.com/clap-rs/clap/compare/clap_complete-v4.5.20...clap_complete-v4.5.21">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
In the latest version, we added a warning log to str0m when the maximum
number of candidate pairs is exceeded:
https://github.com/algesten/str0m/pull/587.
We only ever add the candidates of a single relay to an agent (2
candidates), plus at most 2 server-reflexive candidates and at most 2
host candidates. Unless there is a bug like what we fixed in #7334,
exceeding the default number of candidate _pairs_ (100) should never
happen.
In case it does, the newly added `warn` log in `str0m` will trigger a
Sentry alert.
It was already a bit sus that we didn't receive as many errors in Sentry
from the IPC service as from the GUI client. Turns out that we forgot to
initialise our `sentry_layer` there. Additionally, we also didn't
initialise the `LogTracer`, meaning we didn't capture logs from the
`log` crate which is used by some of the dependencies, for example
`wintun`.
Due to https://github.com/getsentry/sentry-rust/issues/702, errors which
are embedded as `tracing::Value` unfortunately get silently discarded
when reported as part of Sentry "Event"s and not "Exception"s.
The design idea of these telemetry events is that they aren't fatal
errors so we don't need to treat them with the highest priority. They
may also appear quite often, so to save performance and bandwidth, we
sample them at a rate of 1% at creation time.
In order to not lose the context of these errors, we instead format them
into the message. This makes them completely identical to the `debug!`
logs which we have on every call-site of `telemetry_event!` which
prompted me to make that implicit as part of creating the
`telemetry_event!`.
Resolves: #7343.
Adding more context to these errors makes it easier to identify, which
of the operations fails. In addition, we remove some usages of the "log
and return" anti-pattern to avoid duplicate reports of the same issue.
All our logic for handling errors is based on the error code. Even
though there should be a 1:1 mapping between error code and reason
phrase, I am seeing odd reports in Sentry for a case that we should be
handling but aren't.
I noticed that in case there is an error when reading from the TUN
device, we currently exit that thread and we don't have a mechanism at
the moment to restart it. Discarding the thread also means we can no
longer send new instances of `Tun` into it.
Instead of exiting the thread, we now just log the error and continue.
In case the error was caused by the FD being closed, we discard the
instance of `Tun` and wait for a new one.
Previously, we printed only the size of each individual packet in the
`wire::net` logs. This makes it impossible to tell whether or not GRO
was used to receive this packet. The total number of bytes can still be
computed by calculating `num_packets * segment_size + trailing_bytes`.
Thus, the new log is strictly superior.
With the parallelisation of TUN and UDP operations, we lost
backpressure: Packets can now be read quicker from the UDP sockets than
they can be sent out the TUN device, causing packet loss in extremely
high-throughput situations.
To avoid this, we don't directly send packets into the channel to the
TUN device thread. This channel is bounded, meaning sending can fail if
reading UDP packets is faster than writing packets to the TUN device.
Due to GRO, we may read multiple UDP packets in one go, requiring us to
write multiple IP packets to the TUN device as part of a single
iteration in the event-loop. Thus, we cannot know, how much space we
need in the channel for outgoing IP packets.
By introducing a dedicated buffer, we can temporarily hold on to all of
these packets and on the next call to `poll`, we flush them out into the
channel. If the channel is full, we will suspend and only continue once
there is space in the channel. This behaviour restores backpressue
because we won't read UDP packets from the socket unless we have space
to write the corresponding packet to the TUN device.
UDP itself actually doesn't have any backpressure, instead the packets
will simply get dropped once the receive buffer overflows. The UDP
packets however carry encrypted IP packets, meaning whatever protocol
sits inside these packets will detect the packet loss and should
throttle their sending-pace accordingly.
Currently, some errors are double-logged when we show them to the user
because of the `tracing::error!` statements within the generation of the
user-friendly error message for the error dialog.
To get rid of these, we generalise the `show_error_dialog` function to
take just the message and move the generation of the message to a
function on the `Error` itself. This also allows us to split out a
separate error type that is only used for the elevation check, thereby
reducing the complexity of the other error enum.
I think I finally understood and correctly traced, where the use of ANSI
escape codes came from. It turns out, the `with_ansi` switch on
`tracing_subscriber::fmt::Layer` is what you want to toggle. From there,
it trickles down to the `Writer` which we can then test for in our
`Format`.
Resolves: #7284.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
A TURN server that doesn't understand certain attributes should return
"Unknown attributes" as part of its response. Whilst we aim to be as
spec-compliant as possible, Firezone doesn't officially support other
TURN servers than our own relay.
If we encounter a TURN server that sends us an "Unknown attribute", we
now immediately fail this allocation and clear it as we cannot make any
more assumptions about what the connected relay actually supports.
Errors from the tunnel can potentially happen on a per-packet basis. In
order to not flood Sentry, reduce the log-level down to `debug` and only
report 1% of all errors. We did the same thing for the gateway in #7299.