This is another attempt at fixing #7386. Previous PR was #7379. The
difference is, this time it works! In the following screenshot,
`handle_input` is a currently active span.

I had to make some patches to Sentry, most notably:
- https://github.com/getsentry/sentry-rust/pull/708
- https://github.com/getsentry/sentry-rust/pull/712
The way we configure Sentry is quite tricky:
First and foremost, we need to understand that the `tracing` adapter for
Sentry has a `span_filter` configuration. When a span gets filtered out
there, the rest of `sentry-tracing` never sees the data in that span.
Thus, in order to capture variables from spans, we need to have a fairly
generous span filter. In this PR, we change this span filter to include
all spans except those on TRACE level.
Secondly, by default, the Sentry SDK doesn't send any spans to the
backend, i.e. the sampling rate is 0. Previously, we set the sampling
rate to 1.0 because the `span_filter` was already filtering out all
non-telemetry spans. A telemetry span is a concept that we invented. It
is a span that gets sampled at _creation_ time with a probability of 1%.
This is useful because creating a lot of spans is also expensive, so we
don't want to do it e.g. on a per-packet basis. With just these
configuration options, we now have a problem: We don't want to submit
all spans to Sentry but we need the `span_filter` to allow all spans
otherwise we can't capture the contextual fields from the span in
breadcrumbs. Luckily, the Sentry SDK has another configuration option:
`traces_sampler`.
The `traces_sampler` gets to compute a sampling rate for each individual
span. This allows us to discard all spans from being sent to Sentry
unless they are `telemetry` spans.
Resolves: #7386.
Rust 1.83 comes with a bunch of new lints for elidible lifetimes. Those
also trigger in the generated code of `derivative`. That crate is
actually unmaintained so we replace our usages of it with `derive_more`.
One of Rust's promises is "if it compiles, it works". However, there are
certain situations in which this isn't true. In particular, when using
dynamic typing patterns where trait objects are downcast to concrete
types, having two versions of the same dependency can silently break
things.
This happened in #7379 where I forgot to patch a certain Sentry
dependency. A similar problem exists with our `tracing-stackdriver`
dependency (see #7241).
Lastly, duplicate dependencies increase the compile-times of a project,
so we should aim for having as few duplicate versions of a particular
dependency as possible in our dependency graph.
This PR introduces `cargo deny`, a linter for Rust dependencies. In
addition to linting for duplicate dependencies, it also enforces that
all dependencies are compatible with an allow-list of licenses and it
warns when a dependency is referred to from multiple crates without
introducing a workspace dependency. Thanks to existing tooling
(https://github.com/mainmatter/cargo-autoinherit), transitioning all
dependencies to workspace dependencies was quite easy.
Resolves: #7241.
It was already a bit sus that we didn't receive as many errors in Sentry
from the IPC service as from the GUI client. Turns out that we forgot to
initialise our `sentry_layer` there. Additionally, we also didn't
initialise the `LogTracer`, meaning we didn't capture logs from the
`log` crate which is used by some of the dependencies, for example
`wintun`.
Due to https://github.com/getsentry/sentry-rust/issues/702, errors which
are embedded as `tracing::Value` unfortunately get silently discarded
when reported as part of Sentry "Event"s and not "Exception"s.
The design idea of these telemetry events is that they aren't fatal
errors so we don't need to treat them with the highest priority. They
may also appear quite often, so to save performance and bandwidth, we
sample them at a rate of 1% at creation time.
In order to not lose the context of these errors, we instead format them
into the message. This makes them completely identical to the `debug!`
logs which we have on every call-site of `telemetry_event!` which
prompted me to make that implicit as part of creating the
`telemetry_event!`.
Resolves: #7343.
I think I finally understood and correctly traced, where the use of ANSI
escape codes came from. It turns out, the `with_ansi` switch on
`tracing_subscriber::fmt::Layer` is what you want to toggle. From there,
it trickles down to the `Writer` which we can then test for in our
`Format`.
Resolves: #7284.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Using the clippy lint `unwrap_used`, we can automatically lint against
all uses of `.unwrap()` on `Result` and `Option`. This turns up quite a
few results actually. In most cases, they are invariants that can't
actually be hit. For these, we change them to `Option`. In other cases,
they can actually be hit. For example, if the user supplies an invalid
log-filter.
Activating this lint ensures the compiler will yell at us every time we
use `.unwrap` to double-check whether we do indeed want to panic here.
Resolves: #7292.
Our logging library `tracing` supports structured logging. Structured
logging means we can include values within a `tracing::Event` without
having to immediately format it as a string. Processing these values -
such as errors - as their original type allows the various `tracing`
layers to capture and represent them as they see fit.
One of these layers is responsible for sending ERROR and WARN events to
Sentry, as part of which `std::error::Error` values get automatically
captured as so-called "sentry exceptions".
Unfortunately, there is a caveat: If an `std::error::Error` value is
included in an event that does not get mapped to an exception, the
`error` field is completely lost. See
https://github.com/getsentry/sentry-rust/issues/702 for details.
To work around this, we introduce a `err_with_sources` adapter that an
error and all its sources together into a string. For all
`tracing::debug!` statements, we then use this to report these errors.
It is really unfortunate that we have to do this and cannot use the same
mechanism, regardless of the log level. However, until this is fixed
upstream, this will do and gives us better information in the log
submitted to Sentry.
"Just let it crash" is terrible advice for software that is shipped to
end users. Where possible, we should use proper error handling and only
fail the current function / task that is active, e.g. drop a particular
packet instead of failing all of connlib. We more or less already do
that.
Activating the clippy lint `unwrap_in_result` surfaced a few more places
where we panic despite being in a function that is fallible already.
These cases can easily be converted to not panic and return an error
instead.
DNS resolution is a critical part of `connlib`. If it is slow for
whatever reason, users will notice this. To make sure we notice as well,
we add `telemetry` spans to the client's and gateway's DNS resolution.
For the client, this applies to all DNS queries that we forward to the
upstream servers. For the gateway, this applies to all DNS resources.
In addition to those IO operations, we also instrument the
`match_resource_linear` function. This function operates in `O(n)` of
all defined DNS resources. It _should_ be fast enough to not create an
impact but it can't hurt to measure this regardless.
Lastly, we also instrument `refresh_translations` on the gateway.
Refreshing the DNS resolution of a DNS resource should really only
happen, when the previous IP addresses become stale yet the user is
still trying to send traffic to them. We don't actually have any data on
how often that happens. By instrumenting it, we can gather some of this
data.
To make sure that none of these telemetry events and spans hurt the
end-user performance, we introduce macros to `firezone-logging` that
sample the creation of these events and spans at a rate of 1%. I ran a
flamegraph and none of these even showed up. The most critical one here
is probably the `match_resource_linear` span because it happens on every
DNS query.
Resolves: #7198.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
This should give us much more context for a particular error without
having to bother a customer with sending us the logs / digging for them
ourselves in our staging or production environment.
Resolves: #7176.
It appears that I have misunderstood the documentation of
`sentry-tracing`. When a message gets logged as an event (rather than an
"exception") `std::error::Error`s attached as tracing `Value`s do not
get recorded. It doesn't really matter whether we record our events as
exceptions or messages. We should ideally look at all of them and
particularly noisy ones can be muted forever in Sentry so we don't end
up in a "boy who cried wolf" situation. Therefore, this PR changes our
event filter to also submit WARNs as exceptions to make sure they get
logged accordingly.
Resolves: #7161.
Related: https://github.com/getsentry/sentry-rust/issues/702.
With the introduction of the `tracing-sentry` integration in #7105, we
started sending tracing spans to Sentry. By default, all spans with
level INFO and above get sampled at the configured rate and sent to
Sentry.
This results in a lot of useless transaction in Sentry because we use
INFO level spans in multiple places in connlib to attach contextual
information like the current connection ID.
This PR introduces the concept of `telemetry` spans which - similar to
the `telemetry` log target in #7147 - qualifies a span for being sent to
Sentry. By convention, these are also defined as requiring the TRACE
level. This ensures we won't ever see them as part of regular log
output.
With #7105, all ERROR events from `tracing` get logged as exceptions in
Sentry and all WARN events get logged as "messages". We don't want to
fill up the user's harddrive with logs which means we have to be
somewhat conservative, what gets logged on INFO and above (with INFO
being the default log level). There are certain events though where it
would be useful to know, how often they happen because too many of them
can indicate a problem.
To solve this problem, we introduce a dedicated `telemetry` log target
that the tracing-sentry integration layer watches for. Events for the
`telemetry` log target that gets logged on TRACE will be sampled at a
rate of 1% and submitted as messages to Sentry.
Using a trait means the call-site of the log message will always be the
`log_unwrap` module, despite the `#[track_caller]` annotation. That one
only works for `std::panic::Location` unfortunately which `tracing`
isn't using.
Macros will be evaluated earlier and thus the messages will show up with
the correct module name.
My theory for this issue is that we receive a UDP DNS response from an
upstream server that is bigger than our MTU and thus forwarding it
fails.
This PR doesn't fix that issue by itself but only mitigates the actual
panic. To properly fix the underlying issue, we need to parse the DNS
message. Truncate it and set the TC bit.
Related: #7121.
Using the `sentry-tracing` integration, we can automatically capture
events based on what we log via `tracing`. The mapping is defined as
follows:
- ERROR: Gets captured as a fatal error
- WARN: Gets captured as a message
- INFO: Gets captured as a breadcrumb
- `_`: Does not get captured at all
If telemetry isn't active / configured, this integration does nothing.
It is therefore safe to just always enable it.
Our logging library, `tracing` supports structured logging. This is
useful because it preserves the more than just the string representation
of a value and thus allows the active logging backend(s) to capture more
information for a particular value.
In the case of errors, this is especially useful because it allows us to
capture the sources of a particular error.
Unfortunately, recording an error as a tracing value is a bit cumbersome
because `tracing::Value` is only implemented for `&dyn
std::error::Error`. Casting an error to this is quite verbose. To make
it easier, we introduce two utility functions in `firezone-logging`:
- `std_dyn_err`
- `anyhow_dyn_err`
Tracking errors as correct `tracing::Value`s will be especially helpful
once we enable Sentry's `tracing` integration:
https://docs.rs/sentry-tracing/latest/sentry_tracing/#tracking-errors
Currently, any failure in the `StubResolver` while processing a query
results only in a log but no response. For UDP DNS queries, that isn't
too bad because the client will simply retry. With the upcoming support
for TCP DNS queries in #6944, we should really always reply with a
message.
This PR refactors the handling of DNS messages to always generate a
reply. In the case of an error, we reply with SERVFAIL.
This opens up a few more refactorings where we can now collapse the
handling of some branches into the same. As part of that, I noticed the
recurring need for "unwrapping" a `Result<(), E>` and logging the error.
To make that easier, I introduced an extension trait that does exactly
that.
This PR introduces a custom logging format for all Rust-components. It
is more or less a copy of `tracing_subscriber::fmt::format::Compact`
with the main difference that span-names don't get logged.
Spans are super useful because they allow us to record contextual
values, like the current connection ID, for a certain scope. What is IMO
less useful about them is that in the default formatter configuration,
active spans cause a right-drift of the actual log message.
The actual log message is still what most accurately describes, what
`connlib` is currently doing. Spans only add contextual information that
the reader may use for further understand what is happening. This
optional nature of the utility of spans IMO means that they should come
_after_ the actual log message.
Resolves: #7014.
The `expect` attribute is similar to `allow` in that it will silence a
particular lint. In addition to `allow` however, `expect` will fail as
soon as the lint is no longer emitted. This ensures we don't end up with
stale `allow` attributes in our codebase. Additionally, it provides a
way of adding a `reason` to document, why the lint is being suppressed.
On Windows, the network notifier always notifies once at startup. We
make the DNS notifier and Linux match this behavior, and we assert it in
the unit test.
Part of a yak shave towards removing Tauri.
Setting up a logger is something that pretty much every entrypoint needs
to do, be it a test, a shared library embedded in another app or a
standalone application. Thus, it makes sense to introduce a dedicated
crate that allows us to bundle all the things together, how we want to
do logging.
This allows us to introduce convenience functions like
`firezone_logging::test` which allow you to construct a logger for a
test as a one-liner.
Crucially though, introducing `firezone-logging` gives us a place to
store a default log directive that silences very noisy crates. When
looking into a problem, it is common to start by simply setting the
log-filter to `debug`. Without further action, this floods the output
with logs from crates like `netlink_proto` on Linux. It is very unlikely
that those are the logs that you want to see. Without a preset filter,
the only alternative here is to explicitly turn off the log filter for
`netlink_proto` by typing something like
`RUST_LOG=netlink_proto=off,debug`. Especially when debugging issues
with customers, this is annoying.
Log filters can be overridden, i.e. a 2nd filter that matches the exact
same scope overrides a previous one. Thus, with this design it is still
possible to activate certain logs at runtime, even if they have silenced
by default.
I'd expect `firezone-logging` to attract more functionality in the
future. For example, we want to support re-loading of log-filters on
other platforms. Additionally, where logs get stored could also be
defined in this crate.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Reactor Scram <ReactorScram@users.noreply.github.com>