This is another attempt at fixing #7386. Previous PR was #7379. The
difference is that this time it works! In the following screenshot,
`handle_input` is the currently active span.

I had to make some patches to Sentry, most notably:
- https://github.com/getsentry/sentry-rust/pull/708
- https://github.com/getsentry/sentry-rust/pull/712
The way we configure Sentry is quite tricky:
First and foremost, we need to understand that the `tracing` adapter for
Sentry has a `span_filter` configuration. When a span gets filtered out
there, the rest of `sentry-tracing` never sees the data in that span.
Thus, in order to capture variables from spans, we need a fairly
generous span filter. In this PR, we change the span filter to include
all spans except those at TRACE level.
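Roughly, the filter now looks like this (a sketch, assuming
`sentry-tracing`'s `span_filter` hook):

```rust
use tracing::Level;
use tracing_subscriber::prelude::*;

fn main() {
    // Sketch: only TRACE-level spans get dropped; everything else stays
    // visible to `sentry-tracing`, so its fields can land in breadcrumbs.
    let sentry_layer =
        sentry_tracing::layer().span_filter(|metadata| *metadata.level() != Level::TRACE);

    tracing_subscriber::registry().with(sentry_layer).init();
}
```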
Secondly, by default, the Sentry SDK doesn't send any spans to the
backend, i.e. the sampling rate is 0. Previously, we set the sampling
rate to 1.0 because the `span_filter` was already filtering out all
non-telemetry spans. A telemetry span is a concept that we invented. It
is a span that gets sampled at _creation_ time with a probability of 1%.
This is useful because creating a lot of spans is also expensive, so we
don't want to do it e.g. on a per-packet basis. With just these
configuration options, we now have a problem: We don't want to submit
all spans to Sentry, but we need the `span_filter` to allow all spans;
otherwise, we can't capture the spans' contextual fields in
breadcrumbs. Luckily, the Sentry SDK has another configuration option:
`traces_sampler`.
The `traces_sampler` gets to compute a sampling rate for each individual
span. This allows us to discard all spans from being sent to Sentry
unless they are `telemetry` spans.
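A minimal sketch of such a sampler; matching on the span name
`telemetry` is illustrative, the real check may differ:

```rust
use std::sync::Arc;

use sentry::{ClientOptions, TransactionContext};

// Sketch: keep every `telemetry` span, drop the rest.
fn client_options() -> ClientOptions {
    ClientOptions {
        traces_sampler: Some(Arc::new(|ctx: &TransactionContext| {
            if ctx.name() == "telemetry" {
                1.0
            } else {
                0.0
            }
        })),
        ..Default::default()
    }
}
```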
Resolves: #7386.
In addition to monitoring clients and gateways, it is also useful to
monitor relays in the same way. This gives us alerts on ERROR and WARN
messages logged by the relay as well as panics.
One of Rust's promises is "if it compiles, it works". However, there are
certain situations in which this isn't true. In particular, when using
dynamic typing patterns where trait objects are downcast to concrete
types, having two versions of the same dependency can silently break
things.
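A minimal, self-contained illustration of the failure mode (the
`v1`/`v2` modules stand in for two compiled copies of the same
dependency):

```rust
use std::any::Any;

mod v1 {
    pub struct Error;
}
mod v2 {
    pub struct Error;
}

fn main() {
    // One crate boxes the error using "version 1" of the dependency...
    let err: Box<dyn Any> = Box::new(v1::Error);

    // ...another crate downcasts with "version 2". The two `Error`
    // types are distinct to the compiler, so the downcast silently
    // returns `None`, even though both are "the same" type to a human.
    assert!(err.downcast_ref::<v2::Error>().is_none());
    assert!(err.downcast_ref::<v1::Error>().is_some());
}
```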
This happened in #7379 where I forgot to patch a certain Sentry
dependency. A similar problem exists with our `tracing-stackdriver`
dependency (see #7241).
Lastly, duplicate dependencies increase the compile-times of a project,
so we should aim for having as few duplicate versions of a particular
dependency as possible in our dependency graph.
This PR introduces `cargo deny`, a linter for Rust dependencies. In
addition to linting for duplicate dependencies, it also enforces that
all dependencies are compatible with an allow-list of licenses, and it
warns when a dependency is referenced from multiple crates without
being declared as a workspace dependency. Thanks to existing tooling
(https://github.com/mainmatter/cargo-autoinherit), transitioning all
dependencies to workspace dependencies was quite easy.
Resolves: #7241.
This switches our `sentry-tracing` dependency to a fork that includes
https://github.com/getsentry/sentry-rust/pull/708. Recording our span
fields with breadcrumbs is important to provide accurate context for the
message. Without the span fields, the messages give us a lot less
information.
Since the last release, the open issue about `flush` having a flipped
return value has been fixed as well.
Using the clippy lint `unwrap_used`, we can automatically lint against
all uses of `.unwrap()` on `Result` and `Option`. This actually turns
up quite a few results. In most cases, they are invariants that can't
actually be hit; for these, we change them to `Option`. In other cases,
they can actually be hit, for example when the user supplies an invalid
log-filter.
Activating this lint ensures the compiler will yell at us every time we
use `.unwrap`, making us double-check whether we do indeed want to panic
there.
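A small sketch of the lint in action (the log-filter scenario is
simplified for illustration):

```rust
#![deny(clippy::unwrap_used)]

fn main() {
    // A user-supplied value that may well be invalid.
    let filter: Result<u32, _> = "not-a-number".parse();

    // let level = filter.unwrap(); // rejected by `clippy::unwrap_used`

    // The lint forces a decision: document the invariant or handle it.
    match filter {
        Ok(level) => println!("log level {level}"),
        Err(e) => eprintln!("invalid log-filter: {e}"), // user error, handled
    }
}
```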
Resolves: #7292.
Bundles together several minor improvements around telemetry:
- Removes the obsolete "Firezone" context: This is now included in the
user context as of #7310.
- Entirely encapsulates `sentry` within the `telemetry` module
- Concludes sessions that were not explicitly closed as "abnormal"
Sentry has a feature called the "User context" which allows us to assign
events to individual users. This in turn gives us statistics in
Sentry on how many users are affected by a certain issue.
Unfortunately, Sentry's user context cannot be built up step-by-step but
has to be set as a whole. To achieve this, we need to slightly refactor
`Telemetry` so it is no longer `clone`d but instead passed around by
mutable reference.
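A sketch of how the user context gets applied in one go (field values
are illustrative):

```rust
use sentry::protocol::User;

// Sketch: the user context must be set as a whole, so `Telemetry`
// assembles the complete `User` first and applies it in one call.
fn set_user(id: String, email: Option<String>) {
    sentry::configure_scope(|scope| {
        scope.set_user(Some(User {
            id: Some(id),
            email,
            ..Default::default()
        }));
    });
}
```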
Resolves: #7248.
Related: https://github.com/getsentry/sentry-rust/issues/706.
Reading the Git version requires the entire Git repository to be
present, including all tags. The tags are only created _after_ the
artifact is built, when we publish the release. Therefore, these
tags are never included in the actual released binary.
For Sentry, we use the `CARGO_PKG_VERSION` variable instead. This
doesn't tell us whether somebody built a client from source and then
used it, so there could be some confusion in Sentry events. It is quite
unlikely that this happens, though, so for the majority of Sentry
alerts, this will give us the correct version.
For the Android client, we also depend on the `GITHUB_SHA` env variable
at compile-time. We do the same thing for the GUI client here.
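A sketch of the compile-time plumbing (combining version and SHA like
this is illustrative):

```rust
// Sketch: `CARGO_PKG_VERSION` is always available at compile time;
// `GITHUB_SHA` only exists when CI sets it.
pub fn release() -> String {
    match option_env!("GITHUB_SHA") {
        Some(sha) => format!("{}-{sha}", env!("CARGO_PKG_VERSION")),
        None => env!("CARGO_PKG_VERSION").to_owned(),
    }
}
```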
Resolves: #6925.
DNS resolution is a critical part of `connlib`. If it is slow for
whatever reason, users will notice this. To make sure we notice as well,
we add `telemetry` spans to the client's and gateway's DNS resolution.
For the client, this applies to all DNS queries that we forward to the
upstream servers. For the gateway, this applies to all DNS resources.
In addition to those IO operations, we also instrument the
`match_resource_linear` function. This function operates in `O(n)` in
the number of defined DNS resources. It _should_ be fast enough not to
create a noticeable impact, but it can't hurt to measure this regardless.
Lastly, we also instrument `refresh_translations` on the gateway.
Refreshing the DNS resolution of a DNS resource should really only
happen when the previous IP addresses become stale yet the user is
still trying to send traffic to them. We don't actually have any data on
how often that happens. By instrumenting it, we can gather some of this
data.
To make sure that none of these telemetry events and spans hurt the
end-user performance, we introduce macros to `firezone-logging` that
sample the creation of these events and spans at a rate of 1%. I ran a
flamegraph and none of these even showed up. The most critical one here
is probably the `match_resource_linear` span because it happens on every
DNS query.
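A sketch of the sampling idea behind these macros (the real
`firezone-logging` macros may differ):

```rust
use rand::Rng;
use tracing::{trace_span, Span};

// Sketch: only pay for span creation on 1% of calls so the hot path
// (e.g. per DNS query) stays cheap.
fn maybe_telemetry_span() -> Span {
    if rand::thread_rng().gen_bool(0.01) {
        trace_span!("telemetry")
    } else {
        Span::none()
    }
}

fn handle_dns_query() {
    let _guard = maybe_telemetry_span().entered(); // no-op for `Span::none()`
    // ... resolve the query ...
}
```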
Resolves: #7198.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
`sentry`'s transport layer appears to be using blocking IO for flushing
events. Performing blocking IO within a future that is running on a
tokio worker thread causes this operation to hang and eventually time
out after 5 seconds. As a result, many events - especially traces -
don't get flushed to Sentry when an app is being shut down.
To fix this, we make `Telemetry::stop` an `async fn` and offload the
flushing to a task on tokio's thread-pool for blocking IO.
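A sketch of the offloading (the 5-second timeout mirrors the one
mentioned above):

```rust
use std::time::Duration;

// Sketch: move the (blocking) flush off the async worker thread.
pub async fn stop() {
    let _ = tokio::task::spawn_blocking(|| {
        if let Some(client) = sentry::Hub::current().client() {
            client.flush(Some(Duration::from_secs(5)));
        }
    })
    .await;
}
```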
Closes #7175
Also fixes a bug with the initialization order of Tokio and Sentry.
Previously:
1. Start Tokio; executor threads inherit the main thread's (still empty) context
2. Load device ID and set it on the main telemetry hub
Now:
1. Load device ID and set it on the main telemetry hub
2. Start Tokio; executor threads inherit the main thread's (now populated) context
The context and possibly tags didn't seem to propagate from the main hub
if we set them after the worker threads spawned.
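A sketch of the corrected order (`load_device_id` is a hypothetical
helper for illustration):

```rust
fn load_device_id() -> String {
    "hypothetical-device-id".to_owned()
}

fn main() -> std::io::Result<()> {
    // 1. Populate the main hub's context first...
    sentry::configure_scope(|scope| scope.set_tag("device_id", load_device_id()));

    // 2. ...then start the runtime so worker threads inherit it.
    tokio::runtime::Builder::new_multi_thread()
        .enable_all()
        .build()?
        .block_on(async { /* run the app */ });

    Ok(())
}
```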
Based on this understanding, the IPC service process is still wrong, but
a fix will have to wait, because telemetry in the IPC service is more
complicated than in the GUI process.
<img width="818" alt="image"
src="https://github.com/user-attachments/assets/9c9efec8-fc55-4863-99eb-5fe9ba5b36fa">
This should give us much more context for a particular error without
having to bother a customer with sending us the logs / digging for them
ourselves in our staging or production environment.
Resolves: #7176.
With the introduction of the `tracing-sentry` integration in #7105, we
started sending tracing spans to Sentry. By default, all spans with
level INFO and above get sampled at the configured rate and sent to
Sentry.
This results in a lot of useless transactions in Sentry because we use
INFO level spans in multiple places in connlib to attach contextual
information like the current connection ID.
This PR introduces the concept of `telemetry` spans which - similar to
the `telemetry` log target in #7147 - qualify a span for being sent to
Sentry. By convention, these are also defined as requiring the TRACE
level. This ensures we won't ever see them as part of regular log
output.
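A sketch of the convention (target and span name are illustrative):

```rust
use tracing::trace_span;

// Sketch: `telemetry` spans are TRACE level, so they never show up in
// regular (INFO-and-above) log output but can still be sent to Sentry.
fn handle_input() {
    let _span = trace_span!(target: "telemetry", "handle_input").entered();
    // ... actual work ...
}
```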
As a first step for integrating Sentry into the Android app, we launch
the Sentry Rust agent as soon as a `connlib` session starts up. At a
later point, we can also integrate Sentry into the Android app itself
using the Java / Kotlin SDK.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
This starts up telemetry together with each `connlib` session. At a
later point, we can also integrate the native Swift SDK into the macOS /
iOS app to catch non-connlib specific problems.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Using the `sentry-tracing` integration, we can automatically capture
events based on what we log via `tracing`. The mapping is defined as
follows:
- ERROR: Gets captured as a fatal error
- WARN: Gets captured as a message
- INFO: Gets captured as a breadcrumb
- `_`: Does not get captured at all
If telemetry isn't active / configured, this integration does nothing.
It is therefore safe to just always enable it.
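A sketch of this mapping, assuming `sentry-tracing`'s `event_filter`
hook and its `EventFilter` variants:

```rust
use sentry_tracing::EventFilter;
use tracing::Level;
use tracing_subscriber::prelude::*;

fn main() {
    let sentry_layer = sentry_tracing::layer().event_filter(|metadata| match *metadata.level() {
        Level::ERROR => EventFilter::Exception, // captured as a (fatal) error
        Level::WARN => EventFilter::Event,      // captured as a message
        Level::INFO => EventFilter::Breadcrumb, // captured as a breadcrumb
        _ => EventFilter::Ignore,               // not captured at all
    });

    tracing_subscriber::registry().with(sentry_layer).init();
}
```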
Similar to the GUI and headless clients, adding error reporting via
Sentry should give us much better insight into how well gateways are
performing.
Resolves: #7099.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
Do we want to track 401s in Sentry? If we see a lot of them, something
is likely wrong, but I guess there is some level of 401s that users will
just run into.
Is there a way of marking these as "might not be a really bad error"?
---------
Co-authored-by: Not Applicable <ReactorScram@users.noreply.github.com>
This makes it easier to ignore random issues from my dev system.
Also added OS tag (`linux` or `windows`) since that doesn't seem to be a
default for Sentry.
```[tasklist]
- [ ] Bikeshed the name `firezone_id` since it'll be hard to change later
```
<img width="367" alt="image"
src="https://github.com/user-attachments/assets/2e936aea-5c36-4208-965a-c578ff8407b7">
Closes #6854
- Sets release version from the GUI Client / Headless Client version
instead of the `firezone-telemetry` version
- Sets environment to "production" and "staging" for well-known API URLs,
and "self-hosted" for others, since environments in Sentry can't have
slashes in them (see the sketch below)
- Sets API URL as a tag
- Sets release to `unit test` for unit testing `firezone-telemetry`
itself, since it has no good version number
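A sketch of that environment mapping (the `contains("staging")` check
is hypothetical; only the production URL appears in this description):

```rust
// Sketch: map well-known API URLs to Sentry environments.
fn environment(api_url: &str) -> &'static str {
    match api_url {
        "wss://api.firezone.dev" => "production",
        url if url.contains("staging") => "staging",
        _ => "self-hosted",
    }
}
```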
<img width="398" alt="image"
src="https://github.com/user-attachments/assets/86f71193-2511-45c1-8304-413db8e5ef90">
Refs #6138
Sentry is always enabled for now. In the near future we'll make it
opt-out per device and opt-in per org (see #6138 for details)
- Replaces the `crash_handling` module
- Catches panics in GUI process, tunnel daemon, and Headless Client
- Adds a couple "breadcrumbs" to play with that feature (see the sketch
after this list)
- User ID is not set yet
- Environment is set to the API URL, e.g. `wss://api.firezone.dev`
- Reports panics from the connlib async task
- Release should be automatically pulled from the Cargo version which we
automatically set in the version Makefile
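For reference, adding such a breadcrumb is a one-liner (category and
message are made up for illustration):

```rust
use sentry::protocol::Breadcrumb;

// Sketch: one of the "breadcrumbs" mentioned above.
fn record_session_start() {
    sentry::add_breadcrumb(Breadcrumb {
        category: Some("session".into()),
        message: Some("connlib session started".into()),
        ..Default::default()
    });
}
```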
Example screenshot of sentry.io with a caught panic:
<img width="861" alt="image"
src="https://github.com/user-attachments/assets/c5188d86-10d0-4d94-b503-3fba51a21a90">