Due to https://github.com/getsentry/sentry-rust/issues/702, errors
which are embedded as a `tracing::Value` unfortunately get silently
discarded when they are reported as part of Sentry "Events" rather than
"Exceptions".
The design idea of these telemetry events is that they aren't fatal
errors so we don't need to treat them with the highest priority. They
may also appear quite often, so to save performance and bandwidth, we
sample them at a rate of 1% at creation time.
In order to not lose the context of these errors, we instead format them
into the message. This makes them identical to the `debug!` logs we
already have at every call-site of `telemetry_event!`, which prompted me
to make that `debug!` log an implicit part of `telemetry_event!` itself.
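Roughly, the macro now behaves like this (a minimal sketch; the real
macro's argument handling and sampling helper differ):

```rust
/// Logs the message on DEBUG and additionally reports it to Sentry as
/// a sampled, non-fatal event. Sketch only: format-style args assumed.
macro_rules! telemetry_event {
    ($($arg:tt)*) => {{
        tracing::debug!($($arg)*);

        // Only 1% of these events are actually submitted.
        if rand::random::<f64>() < 0.01 {
            sentry::capture_message(&format!($($arg)*), sentry::Level::Error);
        }
    }};
}
```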
Resolves: #7343.
Adding more context to these errors makes it easier to identify which
of the operations failed. In addition, we remove some usages of the "log
and return" anti-pattern to avoid duplicate reports of the same issue.
All our logic for handling errors is based on the error code. Even
though there should be a 1:1 mapping between error code and reason
phrase, I am seeing odd reports in Sentry for a case that we should be
handling but aren't.
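As a sketch of the intended pattern (types are illustrative; 437 and
438 are the TURN "Allocation Mismatch" and "Stale Nonce" codes):

```rust
// Branch on the numeric code only; the reason phrase is free-form
// text that servers may fill in however they like.
fn handle_error_response(code: u16) -> Action {
    match code {
        437 => Action::RecreateAllocation, // Allocation Mismatch
        438 => Action::RetryWithNewNonce,  // Stale Nonce
        _ => Action::FailAllocation,
    }
}

enum Action {
    RecreateAllocation,
    RetryWithNewNonce,
    FailAllocation,
}
```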
I noticed that in case there is an error when reading from the TUN
device, we currently exit that thread and we don't have a mechanism at
the moment to restart it. Discarding the thread also means we can no
longer send new instances of `Tun` into it.
Instead of exiting the thread, we now just log the error and continue.
In case the error was caused by the FD being closed, we discard the
instance of `Tun` and wait for a new one.
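The revised loop, sketched with illustrative names (treating `EBADF`
as the signal for a closed FD is an assumption of this sketch):

```rust
use std::io::Read;
use std::sync::mpsc;

fn tun_thread(
    mut tun: Box<dyn Read + Send>,
    tun_rx: mpsc::Receiver<Box<dyn Read + Send>>,
) {
    let mut buf = [0u8; 1504];

    loop {
        match tun.read(&mut buf) {
            Ok(n) => handle_packet(&buf[..n]),
            // The FD was closed underneath us: discard this `Tun` and
            // block until a new instance is sent into the thread.
            Err(e) if e.raw_os_error() == Some(libc::EBADF) => {
                match tun_rx.recv() {
                    Ok(new_tun) => tun = new_tun,
                    Err(_) => return, // sender gone; shut down for real
                }
            }
            // Any other error: log it and keep the thread alive.
            Err(e) => tracing::error!("Failed to read from TUN device: {e}"),
        }
    }
}

fn handle_packet(_packet: &[u8]) {}
```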
Previously, we printed only the size of each individual packet in the
`wire::net` logs. This made it impossible to tell whether or not GRO
was used to receive a packet. The total number of bytes can still be
computed by calculating `num_packets * segment_size + trailing_bytes`.
Thus, the new log is strictly superior.
With the parallelisation of TUN and UDP operations, we lost
backpressure: Packets can now be read quicker from the UDP sockets than
they can be sent out the TUN device, causing packet loss in extremely
high-throughput situations.
To avoid this, we don't directly send packets into the channel to the
TUN device thread. This channel is bounded, meaning sending can fail if
reading UDP packets is faster than writing packets to the TUN device.
Due to GRO, we may read multiple UDP packets in one go, requiring us to
write multiple IP packets to the TUN device as part of a single
iteration in the event-loop. Thus, we cannot know how much space we
need in the channel for outgoing IP packets.
By introducing a dedicated buffer, we can temporarily hold on to all of
these packets and on the next call to `poll`, we flush them out into the
channel. If the channel is full, we will suspend and only continue once
there is space in the channel. This behaviour restores backpressure
because we won't read UDP packets from the socket unless we have space
to write the corresponding packet to the TUN device.
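A minimal sketch of that flush step, assuming a `tokio` bounded
channel wrapped in `tokio_util::sync::PollSender` (all names are
illustrative):

```rust
use std::collections::VecDeque;
use std::task::{Context, Poll};

use tokio_util::sync::PollSender;

struct IpPacket(Vec<u8>); // stand-in for the real packet type

struct TunQueue {
    buffer: VecDeque<IpPacket>,    // packets produced by one GRO read
    channel: PollSender<IpPacket>, // bounded channel to the TUN thread
}

impl TunQueue {
    /// Flushes buffered packets into the channel. Returning `Pending`
    /// suspends the event-loop, restoring backpressure: no new UDP
    /// packets are read until the TUN thread catches up.
    fn poll_flush(&mut self, cx: &mut Context<'_>) -> Poll<()> {
        while !self.buffer.is_empty() {
            match self.channel.poll_reserve(cx) {
                Poll::Ready(Ok(())) => {}
                Poll::Ready(Err(_)) => return Poll::Ready(()), // channel closed
                Poll::Pending => return Poll::Pending,
            }

            let packet = self.buffer.pop_front().expect("buffer is not empty");
            let _ = self.channel.send_item(packet);
        }

        Poll::Ready(())
    }
}
```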
UDP itself actually doesn't have any backpressure, instead the packets
will simply get dropped once the receive buffer overflows. The UDP
packets, however, carry encrypted IP packets, meaning whatever protocol
sits inside these packets will detect the packet loss and should
throttle its sending pace accordingly.
Currently, some errors are double-logged when we show them to the user
because the code that generates the user-friendly message for the error
dialog also emits `tracing::error!` statements.
To get rid of these, we generalise the `show_error_dialog` function to
take just the message and move the generation of the message to a
function on the `Error` itself. This also allows us to split out a
separate error type that is only used for the elevation check, thereby
reducing the complexity of the other error enum.
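Roughly, the new split looks like this (illustrative shape; the real
enum has more variants):

```rust
enum Error {
    Io(std::io::Error),
    Other(anyhow::Error),
}

impl Error {
    /// Builds the user-friendly message in one place instead of
    /// logging with `tracing::error!` while constructing it.
    fn user_friendly_msg(&self) -> String {
        match self {
            Error::Io(e) => format!("A system error occurred: {e}"),
            Error::Other(e) => format!("{e:#}"),
        }
    }
}

/// Only displays; the caller decides what, if anything, to log.
fn show_error_dialog(msg: &str) {
    let _ = msg; // native dialog call elided
}
```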
I think I finally understood and correctly traced where the use of ANSI
escape codes came from. It turns out the `with_ansi` switch on
`tracing_subscriber::fmt::Layer` is what you want to toggle. From there,
it trickles down to the `Writer`, which we can then test for in our
`Format`.
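For example, a custom formatter can branch on that flag (a sketch; the
function name is made up):

```rust
use std::fmt::Write as _;

use tracing_subscriber::fmt::format::Writer;

// The writer tells us whether `with_ansi` enabled escape codes.
fn write_level(writer: &mut Writer<'_>, level: &tracing::Level) -> std::fmt::Result {
    if writer.has_ansi_escapes() {
        write!(writer, "\x1b[31m{level}\x1b[0m") // coloured, e.g. red
    } else {
        write!(writer, "{level}")
    }
}
```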
Resolves: #7284.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
A TURN server that doesn't understand certain attributes should return
"Unknown attributes" as part of its response. Whilst we aim to be as
spec-compliant as possible, Firezone doesn't officially support TURN
servers other than our own relay.
If we encounter a TURN server that sends us an "Unknown attribute", we
now immediately fail this allocation and clear it as we cannot make any
more assumptions about what the connected relay actually supports.
Errors from the tunnel can potentially happen on a per-packet basis. In
order to not flood Sentry, we reduce the log level to `debug` and only
report 1% of all errors. We did the same thing for the gateway in #7299.
In #7163, we introduced a shared cache of server-reflexive candidates
within a `snownet::Node`. What we unfortunately overlooked is that if a
node (i.e. a client or a gateway) is behind symmetric NAT, then we will
repeatedly create "new" server-reflexive candidates, thereby filling up
this cache.
This cache is used to initialise the agents with local candidates, which
manifests in us sending dozens if not hundreds of candidates to the
other party. Whilst not harmful in itself, it does create quite a lot of
spam. To fix this, we introduce a limit of keeping around only one
server-reflexive candidate per IP version, i.e. at most one IPv4 and one
IPv6 address.
At present, `connlib` only supports a single egress interface meaning
for now, we are fine with making this assumption.
In case we encounter a new candidate of the same kind and same IP
version, we evict the old one and replace it with the new one. Thus, for
subsequent connections, only the new candidate is used.
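Sketched with illustrative names, the cache boils down to one slot per
IP version:

```rust
use std::net::SocketAddr;

/// At most one server-reflexive candidate per IP version.
#[derive(Default)]
struct ServerReflexiveCandidates {
    v4: Option<SocketAddr>,
    v6: Option<SocketAddr>,
}

impl ServerReflexiveCandidates {
    /// A new candidate evicts the previous one of the same IP
    /// version; subsequent connections only use the new one.
    fn upsert(&mut self, candidate: SocketAddr) {
        match candidate {
            SocketAddr::V4(_) => self.v4 = Some(candidate),
            SocketAddr::V6(_) => self.v6 = Some(candidate),
        }
    }
}
```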
This one has been lurking in the codebase for a while. Fortunately, it
is not very critical because invalidation of server-reflexive addresses
happens pretty rarely.
If we see these, something fishy is going on (see #7332), so we should
definitely know about them by recording Sentry events. These can
potentially occur per packet, so we only send a telemetry event, which
gets sampled at a rate of 1%.
Using the clippy lint `unwrap_used`, we can automatically lint against
all uses of `.unwrap()` on `Result` and `Option`. This turns up quite a
few results actually. In most cases, they are invariants that can't
actually be hit. For these, we change them to `Option`. In other cases,
they can actually be hit. For example, if the user supplies an invalid
log-filter.
Activating this lint ensures the compiler will yell at us every time we
use `.unwrap` to double-check whether we do indeed want to panic here.
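For example (a sketch), the user-supplied log-filter case now reads:

```rust
#![deny(clippy::unwrap_used)]

use anyhow::Result;
use tracing_subscriber::EnvFilter;

fn parse_log_filter(input: &str) -> Result<EnvFilter> {
    // A user-supplied filter can genuinely be invalid, so this must
    // surface as an error; `input.parse().unwrap()` would be flagged
    // by the lint.
    input
        .parse()
        .map_err(|e| anyhow::anyhow!("invalid log filter: {e}"))
}
```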
Resolves: #7292.
This ensures that we run prettier across all supported filetypes to
check for any formatting / style inconsistencies. Previously, it was
only run
for files in the website/ directory using a deprecated pre-commit
plugin.
The benefit to keeping this in our pre-commit config is that devs can
optionally run these checks locally with `pre-commit run --config
.github/pre-commit-config.yaml`.
---------
Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
Bundles together several minor improvements around telemetry:
- Removes the obsolete "Firezone" context: This is now included in the
user context as of #7310.
- Entirely encapsulates `sentry` within the `telemetry` module
- Concludes sessions that were not explicitly closed as "abnormal"
All warnings trigger events in Sentry. This particular warning is of
no concern: it simply means that the user clicked "Sign out" while we
were trying to set up the tunnel.
Resolves: #7250.
Currently, we don't report very detailed errors when we fail to parse
certain IP packets. With this patch, we use `Result` in more places and
also extend the validation of IP packets to:
- enforce a length of at most 1280 bytes. This should already be the
case due to our MTU, but bad things may happen if that is off for some
reason.
- validate the entire IP packet instead of just its header.
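The shape of that validation, sketched with stand-in types:

```rust
const MAX_IP_PACKET_SIZE: usize = 1280; // derived from our MTU

#[derive(Debug)]
enum ParseError {
    TooLong { len: usize },
    Malformed(String),
}

fn validate_ip_packet(buf: &[u8]) -> Result<(), ParseError> {
    // (a) enforce the overall length bound
    if buf.len() > MAX_IP_PACKET_SIZE {
        return Err(ParseError::TooLong { len: buf.len() });
    }

    // (b) validate the whole packet, not just the header
    parse_whole_packet(buf).map_err(ParseError::Malformed)
}

fn parse_whole_packet(buf: &[u8]) -> Result<(), String> {
    // Stand-in for the real parser, which walks headers *and* payload.
    if buf.is_empty() {
        return Err("empty packet".to_owned());
    }

    Ok(())
}
```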
Our logging library `tracing` supports structured logging. Structured
logging means we can include values within a `tracing::Event` without
having to immediately format them as strings. Processing these values -
such as errors - as their original type allows the various `tracing`
layers to capture and represent them as they see fit.
One of these layers is responsible for sending ERROR and WARN events to
Sentry, as part of which `std::error::Error` values get automatically
captured as so-called "sentry exceptions".
Unfortunately, there is a caveat: If an `std::error::Error` value is
included in an event that does not get mapped to an exception, the
`error` field is completely lost. See
https://github.com/getsentry/sentry-rust/issues/702 for details.
To work around this, we introduce an `err_with_sources` adapter that
formats an error and all its sources together into a string. We then
use this in all `tracing::debug!` statements to report these errors.
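A minimal sketch of such an adapter (the one in the codebase may
differ in detail):

```rust
use std::error::Error;
use std::fmt::Write as _;

/// Formats an error and all its sources into one string, e.g.
/// "failed to connect: tls handshake failed: connection reset".
pub fn err_with_sources(e: &dyn Error) -> String {
    let mut out = e.to_string();

    let mut source = e.source();
    while let Some(s) = source {
        let _ = write!(out, ": {s}");
        source = s.source();
    }

    out
}
```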
It is really unfortunate that we have to do this and cannot use the
same mechanism regardless of the log level. However, until this is
fixed upstream, this will do and gives us better information in the
logs submitted to Sentry.
With a retry-mechanism in place, there is no need to log a warning when
`connect_to_service` fails. Instead, we just log this on DEBUG and
continue trying. If it fails after all attempts, the entire function
will bail out and we will receive a Sentry event from error handling
higher up the callstack.
This switches our dependency on `boringtun` over to our fork at
https://github.com/firezone/boringtun. The idea of the fork is to
carefully patch only selective parts such that upstreaming changes
later is still possible. The complete diff can be seen here:
https://github.com/cloudflare/boringtun/compare/master...firezone:boringtun:master
So far, the only patches in the fork are dependency bumps, linter fixes,
adjustments to log levels and the removal of panics when the destination
buffer is too small.
The `Server::new` function already returns a `Future`. Calling `.await`
on that within an `async` block is equivalent to just calling the `new`
function itself.
Within our test suite, we "spin" for several (simulated) seconds after
each state transition to allow for packets being sent between the
different nodes. The test suite simulates different latencies by
delaying the delivery of some of these packets.
`connlib` has several timers for sending packets, e.g. STUN bindings,
WG keep-alives, etc. These timers never end, so we cannot simply spin
"until we no longer want to send any packets". Currently, we simply
hard-stop after a few seconds, drop the remaining packets, and move on
to the next state transition.
At present, this isn't an issue because only our ICE agent adheres to
the simulated time advancement. `boringtun` is still impure and thus we
usually don't get to see any of the WireGuard packets like keep-alives
and session timeouts etc. in our tests. The STUN messages are pretty
resilient to retransmissions, so the current packet drop doesn't matter.
In the process of adopting our boringtun fork
(https://github.com/firezone/boringtun) where we will eventually fix the
time impurity, dropping some of these packets caused problems.
To fix this, we now drain all remaining packets that are sitting in the
"yet-to-be-delivered" buffer. These packets are delivered to an "inbox"
that is per-host, meaning the host (i.e. client, gateway or relay) will
still perceive the incoming packet with the correct latency.
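Roughly (stand-in types), draining looks like this:

```rust
use std::collections::HashMap;
use std::time::Instant;

type HostId = u64; // stand-in for the real host identifier

struct Transmit {
    dst: HostId,
    arrives_at: Instant, // already includes the simulated latency
    payload: Vec<u8>,
}

/// Moves all yet-to-be-delivered packets into per-host inboxes so no
/// packet is dropped between state transitions; each host still
/// observes the packet at its simulated arrival time.
fn drain_in_flight(
    in_flight: &mut Vec<Transmit>,
    inboxes: &mut HashMap<HostId, Vec<(Instant, Vec<u8>)>>,
) {
    for t in in_flight.drain(..) {
        inboxes
            .entry(t.dst)
            .or_default()
            .push((t.arrives_at, t.payload));
    }
}
```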
We extract this functionality from #7120 because it is generally useful.
Instead of collapsing multiple of these errors into one, we emit a
dedicated error message for each case. This will allow us to distinguish
them within Sentry events.
Windows has some funny behaviour where creating the deep-link server
sometimes fails and we have to try again. Currently, each of these
attempts is logged as a warning, even when a later attempt succeeds.
This creates unnecessary Sentry alerts.
If we run out of attempts to create the deep-link server (currently
10), the entire function fails, which will be logged as an error
further down. The last 500 INFO and DEBUG logs will be captured as
breadcrumbs together with the event, meaning we still get to see the
error messages explaining why creating the deep-link server failed.
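The resulting shape, sketched with illustrative names:

```rust
use anyhow::{bail, Result};

struct Server; // stand-in for the real deep-link server type

impl Server {
    async fn bind() -> Result<Self> {
        unimplemented!("the real, Windows-specific binding logic")
    }
}

const MAX_ATTEMPTS: u32 = 10;

async fn bind_deep_link_server() -> Result<Server> {
    for attempt in 1..=MAX_ATTEMPTS {
        match Server::bind().await {
            Ok(server) => return Ok(server),
            // DEBUG instead of WARN: a later attempt usually succeeds,
            // and the logs still surface as breadcrumbs if we give up.
            Err(e) => {
                tracing::debug!("Failed to bind deep-link server (attempt {attempt}): {e:#}")
            }
        }
    }

    bail!("failed to bind deep-link server after {MAX_ATTEMPTS} attempts")
}
```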
Resolves: #7238.
"Just let it crash" is terrible advice for software that is shipped to
end users. Where possible, we should use proper error handling and only
fail the current function / task that is active, e.g. drop a particular
packet instead of failing all of connlib. We more or less already do
that.
Activating the clippy lint `unwrap_in_result` surfaced a few more places
where we panic despite being in a function that is fallible already.
These cases can easily be converted to not panic and return an error
instead.
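For instance (a sketch), inside an already-fallible function the
conversion is mechanical:

```rust
#![warn(clippy::unwrap_in_result)]

use anyhow::{Context, Result};

fn read_device_id(path: &std::path::Path) -> Result<String> {
    // Before: `std::fs::read_to_string(path).unwrap()` - a panic
    // inside a function that can already report errors.
    let id = std::fs::read_to_string(path).context("failed to read device ID")?;

    Ok(id.trim().to_owned())
}
```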
When `connlib` fails to establish a session, the GUI client currently
only captures the top-level error within `connect_to_firezone` because
it uses `.to_string()` for all errors. Unfortunately, that doesn't print
any of the sources of an error.
To conveniently capture all sources, we can use `anyhow` and its
alternate formatting using `format!("{e:#}")` (notice the `#`). Not all
errors within `connect_to_firezone` should be captured like this
however. Certain IO errors, in particular when trying to resolve the
domain of the portal, need to be captured separately because they may
resolve by themselves once we regain connectivity. This is important:
otherwise we discard the user's token when they boot up a machine
without internet access while Firezone is auto-starting.
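A small demonstration of the difference:

```rust
use anyhow::anyhow;

fn main() {
    let e = anyhow!("connection refused")
        .context("failed to resolve portal domain")
        .context("failed to establish session");

    // `.to_string()` only prints the top-level error:
    assert_eq!(e.to_string(), "failed to establish session");

    // Alternate formatting walks the entire source chain:
    assert_eq!(
        format!("{e:#}"),
        "failed to establish session: failed to resolve portal domain: connection refused"
    );
}
```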
To make this more ergonomic, we trim down `IpcServiceError` to two
variants: The IO variant we need to special-case and everything else.
This allows us to create `From` impls which "do the right thing" by
capturing more error information using `anyhow`'s alternate formatting.
Bumps [test-strategy](https://github.com/frozenlib/test-strategy) from
0.3.1 to 0.4.0.
<details>
<summary>Commits</summary>

- `c683eb3` Version 0.4.0.
- `17706bc` Update MSRV to 1.70.0.
- `90a5efb` Update dependencies.
- `cff2ede` Changed the strategy generated by `#[filter(...)]` to reduce `Too many local ...`
- `34cc6d2` Update expected compile error message.
- `a4427e2` Update CI settings.
- `ecb7dba` Clippy.
- `637f29e` Made it so an error occurs when an unsupported attribute is specified for enu...
- `6d66057` Use `test` instead of `check` with `cargo hack --rust-version`.
- `cee2ebb` Fix CI settings.
- Additional commits viewable in the [compare view](https://github.com/frozenlib/test-strategy/compare/v0.3.1...v0.4.0).

</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
Within the Tauri client, we invoke commands from TypeScript on the Rust
side. These commands can sometimes fail, which is why they return a
`Result`.
Most of our commands actually only send messages through a channel to an
event-loop. This can only fail if the other side of the channel is
closed, which should(?) only happen if the program is shutting down or
some part of it crashed. Regardless, these errors can directly be
forwarded to the TypeScript code where they will get caught and logged
to the browser console.
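A sketch of such a command (names are illustrative):

```rust
use tokio::sync::mpsc;

enum ControllerRequest {
    SignOut,
}

struct Managed {
    ctlr_tx: mpsc::Sender<ControllerRequest>,
}

#[tauri::command]
async fn sign_out(state: tauri::State<'_, Managed>) -> Result<(), String> {
    state
        .ctlr_tx
        .send(ControllerRequest::SignOut)
        .await
        // Sending only fails if the event-loop side of the channel is
        // gone, i.e. during shutdown or after a crash. The error is
        // forwarded to TypeScript and logged to the browser console.
        .map_err(|e| e.to_string())
}
```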
In the future, we can install Sentry's TypeScript client in the GUI code
to automatically report errors on the TypeScript side too.
Resolves: #7256.
Most of the time, these errors are a result of a limited IP stack, for
example IPv6 not being available. Reporting these as errors to Sentry is
unnecessarily noisy.
If something else goes wrong further down the line, the last 500 DEBUG
and INFO logs will be sent along with the error report, so we will
still see these in the breadcrumbs if an actual error happens later.
Resolves: #7245.
Sentry has a feature called the "User context" which allows us to
assign events to individual users. This in turn gives us statistics in
Sentry on how many users are affected by a certain issue.
Unfortunately, Sentry's user context cannot be built up step-by-step
but has to be set as a whole. To achieve this, we slightly refactor
`Telemetry` so that it is no longer `Clone`d and is instead passed
around by mutable reference.
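With the `sentry` crate, that looks roughly like this:

```rust
fn set_sentry_user(user_id: String) {
    // The user context is replaced wholesale; there is no API to
    // update individual fields in place, hence building it up
    // step-by-step does not work.
    sentry::configure_scope(|scope| {
        scope.set_user(Some(sentry::protocol::User {
            id: Some(user_id),
            ..Default::default()
        }));
    });
}
```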
Resolves: #7248.
Related: https://github.com/getsentry/sentry-rust/issues/706.
The communication between the native Apple client and `connlib` uses
JSON encoding. The deserialisation of these should never fail because a
particular version of `connlib` is always bundled with the native
client. Thus, panicking here is justified.
In case it does ever happen, we improve the panic message with this
patch.
We don't have Sentry yet in the native Apple client, meaning we don't
yet learn about errors returned from the `connect` call. Normally,
logging and returning an error is an anti-pattern. We are okay with that
in this case until we can report these errors in the native Apple
client.
Currently, the Gateway's state machine functions for processing packets
use type signatures that only return `Option`. Any errors encountered
while processing packets are logged internally, which makes it
difficult to log these errors consistently.
We refactor these functions to return `Result<Option<T>>` in most cases,
indicating that they may fail for various reasons and also sometimes
succeed without producing an output.
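Sketched with stand-in types:

```rust
use anyhow::{Context as _, Result};

struct IpPacket;
struct GatewayState;

impl GatewayState {
    /// `Err(..)` for failures, `Ok(None)` for "handled, nothing to
    /// send", `Ok(Some(..))` for a packet to emit.
    fn handle_packet(&mut self, _packet: IpPacket) -> Result<Option<IpPacket>> {
        let _peer = self.lookup_peer().context("unknown peer")?;

        Ok(None)
    }

    fn lookup_peer(&self) -> Option<()> {
        None // illustrative
    }
}
```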
This allows us to consistently log these errors in the event-loop.
Logging them on WARN or ERROR would be too spammy though. In order to
still be alerted about some of these, we use the `telemetry_event!`
macro which samples them at a rate of 1%. This will alert us about cases
that happen often and allows us to handle them explicitly.
Once this is deployed to staging, I will monitor the alerts in Sentry to
ensure we won't get spammed with events from customers on the next
release.
Within `connlib`, we have many nested state machines. Many of them have
internal timers by means of timestamps with which they indicate when
they'd like to be "woken" to perform time-related processing. For
example, the `Allocation` state machine would indicate with a timestamp
5 minutes from the time an allocation is created that it needs to be
woken again in order to send the refresh message to the relay.
When we reset our network connections, we pretty much discard all state
within connlib and together with that, all of these timers. Thus the
`poll_timeout` function would return `None`, indicating that our state
machines are not waiting for anything.
Within the event-loop, the outermost state machine, i.e. `ClientState`,
is paired with an `Io` component that actually implements the timer by
scheduling a wake-up at the earliest deadline across all state
machines.
In order to not fire the same timer multiple times in a row, we already
intended to reset the timer once it fired. It turns out that this never
worked and the timer still lingered around.
When we call `reset`, `poll_timeout` - which feeds this timer - returns
`None` and the timer doesn't get updated until it finally returns
`Some` with an `Instant`. Because the previous timer didn't get cleared
when it fired, this caused `connlib` to busy-loop and prevented some(?)
other parts of it from progressing, resulting in us never being able to
reconnect to the portal. Yet, because the event-loop itself was still
operating, we could still resolve DNS queries and such.
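The essence of the fix, as a sketch with illustrative names:

```rust
use std::pin::Pin;
use std::task::Context;

use futures::FutureExt as _;

struct Eventloop {
    timer: Option<Pin<Box<tokio::time::Sleep>>>,
}

impl Eventloop {
    fn poll_timer(&mut self, cx: &mut Context<'_>) {
        if let Some(timer) = self.timer.as_mut() {
            if timer.poll_unpin(cx).is_ready() {
                // The actual fix: clear the fired timer. Without this,
                // a `poll_timeout()` of `None` after `reset` left the
                // old deadline armed and the event-loop busy-looped.
                self.timer = None;
            }
        }
    }
}
```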
Resolves: #7254.
---------
Co-authored-by: Jamil Bou Kheir <jamilbk@users.noreply.github.com>
Currently, `connlib` uses the feature of "known hosts" to provide DNS
functionality for some domains even without any network connectivity.
This is primarily used to ensure that when we reconnect to the portal,
we can resolve the domain name which allows us to then create network
connections.
With recent changes to how our phoenix-channel implementation works,
this is actually no longer necessary. Currently, we re-resolve the
domain every time we connect, even though we already resolved it when
connecting for the first time. This step is unnecessary: we can simply
use the previously resolved IP addresses for the portal domain
directly.
When waiting on multiple futures concurrently within a loop, it is
important that they all get re-created whenever one of them resolves.
Currently, due to the `.fuse` call, the SIGHUP signal is only handled
once and future signals get ignored.
As a more general fix, I swapped the `futures::select!` macro for the
`tokio::select!` macro, which allows referencing these futures without
pinning and fusing. Ideally, we'd not use any of these macros here and
write our own event-loop, but that is a larger refactoring.
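Sketched for the SIGHUP case:

```rust
use tokio::signal::unix::{signal, SignalKind};

async fn event_loop() -> std::io::Result<()> {
    let mut sighup = signal(SignalKind::hangup())?;

    loop {
        // The futures inside `select!` are created fresh on every
        // iteration, so every SIGHUP is observed, not just the first.
        tokio::select! {
            _ = sighup.recv() => {
                // reload configuration (illustrative)
            }
        }
    }
}
```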
Reading the Git version requires the entire Git repository to be
present, including all tags. The tags are only created _after_ the
artifact is built, when we publish the release. Therefore, these tags
are never included in the actual released binary.
For Sentry, we use the `CARGO_PKG_VERSION` variable instead. This
doesn't tell us whether somebody built a client from source and then
used it, so there could be some confusion in Sentry events. It is quite
unlikely that this happens though, so for the majority of Sentry
alerts, this will give us the correct version.
For the Android client, we also depend on the `GITHUB_SHA` env variable
at compile-time. We do the same thing for the GUI client here.
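In Rust, both values can be embedded at compile time:

```rust
// `GITHUB_SHA` is only present in CI, so a from-source build yields
// `None` here.
const VERSION: &str = env!("CARGO_PKG_VERSION");
const GIT_SHA: Option<&str> = option_env!("GITHUB_SHA");
```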
Resolves: #6925.