Instead of recording the queue depths on every event-loop tick, we now
record them once a second by setting a Gauge. Not only is that a simpler
instrument to work with but it is significantly more performant. The
current version - when metrics are enabled - takes on quite a bit of CPU
time.
Resolves: #10237
With the introduction of the pre-resolved Sentry host, all Firezone
clients now require Internet on startup. That is a signficant usability
hit that we can easily fix by simply falling back to resolving the host
on-demand.
We always end up allow this lint when it pops up so we can also just
allow it for the whole repo in general. Most of the time, the reason for
too many arguments are borrow-checker limitations of Rust where mutable
references need to be tracked explicitly.
Right now, the Client event-loops have a channel with 1000 items for
sending new resource lists and updates to the TUN device to the host
app. This is kind of unnecessary as we always only care about the last
version of these. Intermediate updates that the host app doesn't process
are effectively irrelevant.
We've had an issue before where a bug in the portal caused us to receive
many updates to resources which ended up crashing Client apps because
this channel filled up.
To be more resilient on this front, we refactor the Client event loop to
use a `watch` channel for this. Watch channels only retain the last
value that got sent into them.
Our Sentry client needs to resolve DNS before being able to send logs or
errors to the backend. Currently, this DNS resolution happens on-demand
as we don't take any control of the underlying HTTP client.
In addition, this will use HTTP/1.1 by default which isn't as efficient
as it could be, especially with concurrent requests.
Finally, if we decide to ever proxy all Sentry for traffic through our
own domain, we have to take control of the underlying client anyway.
To resolve all of the above, we create a custom `TransportFactory` where
we reuse the existing `ReqwestHttpTransport` but provide an already
configured `reqwest::Client` that always uses HTTP/2 with a
pre-configured set of DNS records for the given ingest host.
By default, dropping a `tokio` runtime waits until all tasks have
finished. The tasks we spawn within `connlib` can have complex
dependencies with each other. To ensure that we can shut down in any
case and don't hang, we apply a timeout of 1s to the runtime.
Applying a filter globally to the entire subscriber means it filters
events for all layers. This prevents the Sentry layer from uploading
DEBUG logs if configured.
At present, our primary indicator as to whether telemetry is active is
whether we have a Sentry session. For our analytics events however, we
currently require passing in the Firezone ID and API url again. This
makes it difficult to send analytics events from areas of the code that
don't have this information available.
To still allow for that, we integrate the `analytics` module more
tightly with the Sentry session. This allows us to drop two parameters
from the `$identify` event and also means we now respect the
`NO_TELEMETRY` setting for these events except for `new_session`. This
event is sent regardless because it allows us to track, how many on-prem
installations of Firezone are out there.
Sentry has a new "Logs" feature where we can stream logs directly to
Sentry. Doing this for all Clients and Gateways would be way too much
data to collect though.
In order to aid debugging from customer installations, we add a
PostHog-managed feature flag that - if set to `true` - enables the
streaming of logs to Sentry. This feature flag is evaluated every time
the telemetry context is initialised:
- For all FFI usages of connlib, this happens every time a new session
is created.
- For the Windows/Linux Tunnel service, this also happens every time we
create a new session.
- For the Headless Client and Gateway, it happens on startup and
afterwards, every minute. The feature-flag context itself is only
checked every 5 minutes though so it might take up to 5 minutes before
this takes effect.
The default value - like all feature flags - is `false`. Therefore, if
there is any issue with the PostHog service, we will fallback to the
previous behaviour where logs are simply stored locally.
Resolves: #9600
In order to more easily target customers with certain feature flags, we
include the `account_slug` in the `$identify` event to PostHog. This
will allow us to create Cohorts in PostHog and enable / disable feature
flags for all installations of Firezone for a particular customer.
To make our FFI layer between Android and Rust safer, we adopt the
UniFFI tool from Mozilla. UniFFI allows us to create a dedicated crate
(here `client-ffi`) that contains Rust structs annotated with various
attributes. These macros then generate code at compile time that is
built into the shared object. Using a dedicated CLI from the UniFFI
project, we can then generate Kotlin bindings from this shared object.
The primary motivation for this effort is memory safety across the FFI
boundary. Most importantly, we want to ensure that:
- The session pointer is not used after it has been free'd
- Disconnecting the session frees the pointer
- Freeing the session does not happen as part of a callback as that
triggers a cyclic dependency on the Rust side (callbacks are executed on
a runtime and that runtime is dropped as part of dropping the session)
To achieve all of these goals, we move away from callbacks altogether.
UniFFI has great support for async functions. We leverage this support
to expose a `suspend fn` to Android that returns `Event`s. These events
map to the current callback functions. Internally, these events are read
from a channel with a capacity of 1000 events. It is therefore not very
time-critical that the app reads from this channel. `connlib` will
happily continue even if the channel is full. 1000 events should be more
than sufficient though in case the host app cannot immediately process
them. We don't send events very often after all.
This event-based design has major advantages: It allows us to make use
of `AutoCloseable` on the Kotlin side, meaning the `session` pointer is
only ever accessed as part of a `use` block and automatically closed
(and therefore free'd) at the end of the block.
To communicate with the session, we introduce a `TunnelCommand` which
represents all actions that the host app can send to `connlib`. These
are passed through a channel to the `suspend fn` which continuously
listens for events and commands.
Resolves: #9499
Related: #3959
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil Bou Kheir <jamilbk@users.noreply.github.com>