In Firezone, a Client requests an "access authorization" for a Resource
on the fly when it sees the first packet for said Resource going through
the tunnel. If we don't have a connection to the Gateway yet, this is
also where we will establish a connection and create the WireGuard
tunnel.
In order for this to work, the access authorization state between the
Client and the Gateway MUST NOT get out of sync. If the Client thinks it
has access to a Resource, it will just route the traffic to the Gateway.
If the access authorization on the Gateway has expired or vanished
otherwise, the packets will be black-holed.
Starting with #9816, the Gateway sends ICMP errors back to the
application whenever it filters a packet. This can happen either because
the access authorization is gone or because the traffic wasn't allowed
by the specific filter rules on the Resource.
With this patch, the Client will attempt to create a new flow (i.e.
re-authorize) traffic for this resource whenever it sees such an ICMP
error, therefore acting as a way of synchronizing the view of the world
between Client and Gateway should they ever run out of sync.
Testing turned out to be a bit tricky. If we let the authorization on
the Gateway lapse naturally, we portal will also toggle the Resource off
and on on the Client, resulting in "flushing" the current
authorizations. Additionally, it the Client had only access to one
Resource, then the Gateway will gracefully close the connection, also
resulting in the Client creating a new flow for the next packet.
To actually trigger this new behaviour we need to:
- Access at least two resources via the same Gateway
- Directly send `reject_access` to the Gateway for this particular
resource
To achieve this, we dynamically eval some code on the API node and
instruct the Gateway channel to send `reject_access`. The connection
stays intact because there is still another active access authorization
but packets for the other resource are answered with ICMP errors.
To achieve a safe roll-out, the new behaviour is feature-flagged. In
order to still test it, we now also allow feature flags to be set via
env variables.
Resolves: #10074
---------
Co-authored-by: Mariusz Klochowicz <mariusz@klochowicz.com>
At present, the Gateway performs DNS resolution for A & AAAA queries via
`libc`. The `resolve` system call only provides us with the resolved IPs
but not any of the metadata around the query such as TTL. As a result,
we can only cache DNS queries for a static amount of time, currently
30s. It would be more correct to cache them for their TTL instead.
To do so, we re-introduce `hickory-resolver` to our codebase.
Deliberately, we only use it for resolving A and AAAA records on the
Gateway for now. DNS resolution for SRV & TXT records happens one layer
below and uses the same infrastructure as DNS resolution on the Client.
Merging this is difficult however because the Gateway still supports the
control protocol of 1.3.x clients. That one requires DNS resolution
prior to setting up the connection of DNS resources which means it needs
to happen in the event-loop of the Gateway binary and cannot be moved
into the `Tunnel` where DNS resolution for Client and SRV/TXT records
happen.
Once we can drop support for 1.3.x clients, this Gateway's event-loop
will simplify drastically which will allow us to refactor this to a more
unified approach of DNS resolution. Until then, we can at least fix the
hardcoded TTL by using `hickory-resolver` in the event-loop.
The functionality is guarded behind a feature-flag which - as usual - is
off by default (i.e. for as long as we haven't fetched the flags). The
feature flag is already configured to `true` for staging and production
so we can test the new behaviour.
Resolves: #8232
Related: #10385
In order to effectively share the HTTP client for requests to PostHog,
we pre-resolve the IPs of the host and create a lazily initialised
`reqwest::Client` that gets shared between all analytics calls.
In addition to sending true/false for a feature-flag, PostHog also
allows us to send a payload with them. We can use this to carry the
log-filter we'd like to stream logs for. With this, we can dynamically
change which logs we are getting forwarded to Sentry.
Unfortunately, this cannot be done on a per-user basis, meaning we will
always have the same log filter for all users where the feature-flag is
enabled.
From Sentry reports and user-submitted logs, we know that it is possible
for Client and Gateway to de-sync in regards to what each other's public
key is. In such a scenario, ICE will succeed to make a connection but
`boringtun` will fail to handshake a tunnel. By default, `boringtun`
tries for 90s to handshake a session before it gives up and expires it.
In Firezone, the ICE agent takes care of establishing connectivity
whereas `boringtun` itself just encrypts and decrypts packets. As such,
if ICE is working, we know that packets aren't getting lost but instead,
there must be some other issue as to why we cannot establish a session.
To improve the UX in these error cases, we reduce the rekey-attempt-time
to 15s. This roughly matches our ICE timeout. Those 15s count from the
moment we send the first handshake which is just after ICE completes.
Thus we can be sure that after at most 15s, we either have a working
WireGuard session or the connection gets cleaned up.
Related: #9890
Related: #9850
As per the WireGuard paper, `boringtun` tries to handshake with the
remote peer for 90s before it gives up. This timeout is important
because when a session is discarded due to e.g. missing replies,
WireGuard attempts to handshake a new session. Without this timeout, we
would then try to handshake a session forever.
Unfortunately, `boringtun` does not distinguish a missing handshake
response from a bad one. Decryption errors whilst decoding a handshake
response are simply passed up to the upper layer, in our case `snownet`.
I am not sure how we can actually fail to decrypt a handshake but the
pattern we are seeing in customer logs is that this happens over and
over again, so there is no point in having `boringtun` retry the
handshake. Therefore, we immediately fail the connection when this
happens.
Failed connections are immediately removed, triggering the client send a
new connection-intent to the portal. Such a new connection intent will
then sync-up the state between Client and Gateway so both of them use
the most recent public key.
Resolves: #9845
As a first step in preparation for sending OTEL metrics from Clients and
Gateways to a cloud-hosted OTEL collector, we extend the CLI of the
Gateway with configuration options to provide a gRPC endpoint to an OTEL
collector.
If `FIREZONE_METRICS` is set to `otel-collector` and an endpoint is
configured via `OTLP_GRPC_ENDPOINT`, we will report our metrics to that
collector.
The future plan for extending this is such that if `FIREZONE_METRICS` is
set to `otel-collector` (which will likely be the default) and no
`OTLP_GRPC_ENDPOINT` is set, then we will use our own, hosted OTEL
collector and report metrics IF the `export-metrics` feature-flag is set
to `true`.
This is a similar integration as we have done it with streaming logs to
Sentry. We can therefore enable it on a similar granularity as we do
with the logs and e.g. only enable it for the `firezone` account to
start with.
In meantime, customers can already make use of those metrics if they'd
like by using the current integration.
Resolves: #1550
Related: #7419
---------
Co-authored-by: Antoine Labarussias <antoinelabarussias@gmail.com>
Socket APIs across operating systems vary in how they handle
back-pressure. In most cases, a non-blocking socket should return
`EWOULDBLOCK` when it cannot send a given datagram and would have to
block to wait for resources to free up.
It appears that macOS doesn't always behave like that. In particular, we
are seeing error logs from a few users where sending a datagram fails
with
> No buffer space available (os error 55)
Digging through `libc`, I've found that this error is known as `ENOBUFS`
[0].
There are reports on the Apple developer forum [1] that recommend
retrying when this error happens. It is however unclear as to whether it
is entirely safe to map this error to `EWOULDBLOCK`. Other non-blocking
event-loop implementations [2] appear to do that but we don't know
whether it is fully correct.
At present, Firezone's behaviour here is to drop the packet. This means
the host networking stack has to fall-back to running into a timeout and
re-send the packet. This very likely negatively impacts the UX for the
users hitting this.
In order to validate this assumption, we implement a feature-flag. This
allows us to ship this code but switch back to the old behaviour, should
it negatively impact how Firezone behaves. In particular, if the
assumption that mapping `ENOBUFS` to `EWOULDBLOCK` is safe turns out
wrong and `kqueue` does in fact not signal readiness when more buffers
are available, then we may have missing wake-ups which would lead a
further delay in datagrams being sent.
[0]:
8e6f36c6ba/src/unix/bsd/apple/mod.rs (L2998)
[1]: https://developer.apple.com/forums/thread/42334
[2]:
aac866f399/src/unix/stream.c (L820)
We are currently in the process of transitioning the Firezone Clients
away from always hashing the ID before sending it to the portal. This
will make lookups and correlation of data between our systems much
easier.
The way we are performing this migration is that new installations of
Firezone will directly generate a 64 char hex-string as the Firezone ID.
If the ID looks like a UUID (which is the old format), we still hash it
and send it to the portal, otherwise we send it as-is.
Presently, the telemetry integration with Sentry and PostHog do the
opposite. They always sets the Firezone ID as-is and includes an
`external_id` that is the hashed form if it detects that it is a UUID
(or in the case of PostHog, create an alias). It is much better to flip
this around and always set the hex-string as the user id. That way, we
can simply always filter by the `user.id` attribute in Sentry and always
refer to the ID that we are seeing in the portal.
Feature flags may be accessed _very_ often such as on every log
statement with #9780. To make sure this is as performant as possible, we
move from an `RwLock` to atomic booleans with relaxed ordering.
Sentry has a new "Logs" feature where we can stream logs directly to
Sentry. Doing this for all Clients and Gateways would be way too much
data to collect though.
In order to aid debugging from customer installations, we add a
PostHog-managed feature flag that - if set to `true` - enables the
streaming of logs to Sentry. This feature flag is evaluated every time
the telemetry context is initialised:
- For all FFI usages of connlib, this happens every time a new session
is created.
- For the Windows/Linux Tunnel service, this also happens every time we
create a new session.
- For the Headless Client and Gateway, it happens on startup and
afterwards, every minute. The feature-flag context itself is only
checked every 5 minutes though so it might take up to 5 minutes before
this takes effect.
The default value - like all feature flags - is `false`. Therefore, if
there is any issue with the PostHog service, we will fallback to the
previous behaviour where logs are simply stored locally.
Resolves: #9600
This PR adds basic analytics to `connlib` by sending two events to
PostHog:
1. `new_session` which is sent every time we establish a new session
with a Firezone backend. This could be our production or staging
instance but also a session to an on-premise installation of Firezone.
We include the API URL in the event payload to further distinguish
these.
2. `$identify` to link the client + version as well as the operating
system to the user. The user is identified by the Firezone ID.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
The original idea of this feature flag was that we can easily disable
the eBPF router in case it causes issues in production. However,
something seems to be not working in reliably turning this on / off.
Without an explicit toggle of the feature-flag, the eBPF program doesn't
seem to be loaded correctly. The uncertainty in this makes me not the
trust the metrics that we are seeing because we don't know, whether
really all relays are using the eBPF router to relay TURN traffic.
In order to draw truthful conclusions as too how much traffic we are
relaying via eBPF, this patch removes the feature flag again. As of
#8656, we can disable the eBPF program by not setting the
`EBPF_OFFLOADING` env variable. This requires a re-deploy / restart of
relays to take effect which isn't quite as fast as toggling a feature
flag but much reliable and easier to maintain.
This PR implements a feature-flag in PostHog that we can use to toggle
the use of the eBPF data plane at runtime. At every tick of the
event-loop, the relay will compare the (cached) configuration of the
eBPF program with the (cached) value of the feature-flag. If they
differ, the flag will be updated and upon the next packet, the eBPF
program will act accordingly.
Feature-flags are re-evaluated every 5 minutes, meaning there is some
delay until this gets applied.
The default value of our all our feature-flags is `false`, meaning if
there is some problem with evaluating them, we'd turn the eBPF data
plane off. Performing routing in userspace is slower but it is a safer
default.
Resolves: #8548
As per PostHog's recommendation [0], we now use different projects to
manage the feature-flags. This allows us to turn feature flags in
staging or production on / off without affecting the other.
[0]: https://posthog.com/tutorials/multiple-environments
In order to be able to dynamically configure long-running applications
such as the Gateway via feature-flags, we need to regularly re-evaluate
them by sending another POST request to the `/decide` endpoint.
To do this without impacting anything else, we create a separate runtime
that is lazily initialised on first access and use that to run the async
code for connecting to the PostHog service. In addition to that, we also
spawn a task that re-evaluates the feature flags for the currently set
user in the Sentry context every 5 minutes.
Resolves: #8454
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>