firezone

mirror of https://github.com/outbackdingo/firezone.git synced 2026-03-22 09:41:59 +00:00

Author	SHA1	Message	Date
Thomas Eizinger	b11adfcfe4	feat(connlib): create flow on ICMP error "prohibited" (#10462 ) In Firezone, a Client requests an "access authorization" for a Resource on the fly when it sees the first packet for said Resource going through the tunnel. If we don't have a connection to the Gateway yet, this is also where we will establish a connection and create the WireGuard tunnel. In order for this to work, the access authorization state between the Client and the Gateway MUST NOT get out of sync. If the Client thinks it has access to a Resource, it will just route the traffic to the Gateway. If the access authorization on the Gateway has expired or vanished otherwise, the packets will be black-holed. Starting with #9816, the Gateway sends ICMP errors back to the application whenever it filters a packet. This can happen either because the access authorization is gone or because the traffic wasn't allowed by the specific filter rules on the Resource. With this patch, the Client will attempt to create a new flow (i.e. re-authorize traffic) for this resource whenever it sees such an ICMP error, therefore acting as a way of synchronizing the view of the world between Client and Gateway should they ever get out of sync. Testing turned out to be a bit tricky. If we let the authorization on the Gateway lapse naturally, the portal will also toggle the Resource off and on on the Client, resulting in "flushing" the current authorizations. Additionally, if the Client only had access to one Resource, then the Gateway will gracefully close the connection, also resulting in the Client creating a new flow for the next packet. To actually trigger this new behaviour we need to: - Access at least two resources via the same Gateway - Directly send `reject_access` to the Gateway for this particular resource To achieve this, we dynamically eval some code on the API node and instruct the Gateway channel to send `reject_access`.
The connection stays intact because there is still another active access authorization but packets for the other resource are answered with ICMP errors. To achieve a safe roll-out, the new behaviour is feature-flagged. In order to still test it, we now also allow feature flags to be set via env variables. Resolves: #10074 --------- Co-authored-by: Mariusz Klochowicz <mariusz@klochowicz.com>	2025-09-30 08:23:39 +00:00
Thomas Eizinger	aa68029a33	feat(gateway): use hickory resolver to resolve A/AAAA queries (#10373 ) At present, the Gateway performs DNS resolution for A & AAAA queries via `libc`. The `resolve` system call only provides us with the resolved IPs but not any of the metadata around the query such as TTL. As a result, we can only cache DNS queries for a static amount of time, currently 30s. It would be more correct to cache them for their TTL instead. To do so, we re-introduce `hickory-resolver` to our codebase. Deliberately, we only use it for resolving A and AAAA records on the Gateway for now. DNS resolution for SRV & TXT records happens one layer below and uses the same infrastructure as DNS resolution on the Client. Merging this is difficult however because the Gateway still supports the control protocol of 1.3.x clients. That one requires DNS resolution prior to setting up the connection of DNS resources which means it needs to happen in the event-loop of the Gateway binary and cannot be moved into the `Tunnel` where DNS resolution for Client and SRV/TXT records happens. Once we can drop support for 1.3.x clients, the Gateway's event-loop will simplify drastically which will allow us to refactor this to a more unified approach of DNS resolution. Until then, we can at least fix the hardcoded TTL by using `hickory-resolver` in the event-loop. The functionality is guarded behind a feature-flag which - as usual - is off by default (i.e. for as long as we haven't fetched the flags). The feature flag is already configured to `true` for staging and production so we can test the new behaviour. Resolves: #8232 Related: #10385	2025-09-23 06:00:16 +00:00
Thomas Eizinger	da802323e4	feat(telemetry): pre-resolve PostHog ingest host (#10207 ) In order to effectively share the HTTP client for requests to PostHog, we pre-resolve the IPs of the host and create a lazily initialised `reqwest::Client` that gets shared between all analytics calls.	2025-08-22 13:19:53 +00:00
Thomas Eizinger	1bdc5f0584	feat(telemetry): reuse connections to PostHog server (#10203 )	2025-08-18 00:34:14 +00:00
Thomas Eizinger	ea6f1ce145	chore(telemetry): allow to dynamically change the log filter (#10065 ) In addition to sending true/false for a feature-flag, PostHog also allows us to send a payload with them. We can use this to carry the log-filter we'd like to stream logs for. With this, we can dynamically change which logs we are getting forwarded to Sentry. Unfortunately, this cannot be done on a per-user basis, meaning we will always have the same log filter for all users where the feature-flag is enabled.	2025-08-02 10:23:35 +00:00
Thomas Eizinger	a6ffdd2654	feat(snownet): reduce rekey-attempt-time to 15s (#9891 ) From Sentry reports and user-submitted logs, we know that it is possible for Client and Gateway to de-sync in regards to what each other's public key is. In such a scenario, ICE will succeed in making a connection but `boringtun` will fail to handshake a tunnel. By default, `boringtun` tries for 90s to handshake a session before it gives up and expires it. In Firezone, the ICE agent takes care of establishing connectivity whereas `boringtun` itself just encrypts and decrypts packets. As such, if ICE is working, we know that packets aren't getting lost but instead, there must be some other issue as to why we cannot establish a session. To improve the UX in these error cases, we reduce the rekey-attempt-time to 15s. This roughly matches our ICE timeout. Those 15s count from the moment we send the first handshake which is just after ICE completes. Thus we can be sure that after at most 15s, we either have a working WireGuard session or the connection gets cleaned up. Related: #9890 Related: #9850	2025-07-17 00:50:31 +00:00
Thomas Eizinger	f5425ac8e4	fix(snownet): fail connection on handshake decryption errors (#9850 ) As per the WireGuard paper, `boringtun` tries to handshake with the remote peer for 90s before it gives up. This timeout is important because when a session is discarded due to e.g. missing replies, WireGuard attempts to handshake a new session. Without this timeout, we would then try to handshake a session forever. Unfortunately, `boringtun` does not distinguish a missing handshake response from a bad one. Decryption errors whilst decoding a handshake response are simply passed up to the upper layer, in our case `snownet`. I am not sure how we can actually fail to decrypt a handshake but the pattern we are seeing in customer logs is that this happens over and over again, so there is no point in having `boringtun` retry the handshake. Therefore, we immediately fail the connection when this happens. Failed connections are immediately removed, triggering the Client to send a new connection-intent to the portal. Such a new connection intent will then sync-up the state between Client and Gateway so both of them use the most recent public key. Resolves: #9845	2025-07-14 13:22:23 +00:00
Thomas Eizinger	cecca37073	feat(gateway): allow exporting metrics to an OTEL collector (#9838 ) As a first step in preparation for sending OTEL metrics from Clients and Gateways to a cloud-hosted OTEL collector, we extend the CLI of the Gateway with configuration options to provide a gRPC endpoint to an OTEL collector. If `FIREZONE_METRICS` is set to `otel-collector` and an endpoint is configured via `OTLP_GRPC_ENDPOINT`, we will report our metrics to that collector. The future plan for extending this is such that if `FIREZONE_METRICS` is set to `otel-collector` (which will likely be the default) and no `OTLP_GRPC_ENDPOINT` is set, then we will use our own, hosted OTEL collector and report metrics IF the `export-metrics` feature-flag is set to `true`. This is similar to the integration we built for streaming logs to Sentry. We can therefore enable it on a similar granularity as we do with the logs and e.g. only enable it for the `firezone` account to start with. In the meantime, customers can already make use of those metrics if they'd like by using the current integration. Resolves: #1550 Related: #7419 --------- Co-authored-by: Antoine Labarussias <antoinelabarussias@gmail.com>	2025-07-14 03:54:38 +00:00
Thomas Eizinger	70e4b6572f	chore(rust): log environment when updating feature flags (#9855 ) It is useful to know which environment we've updated the feature-flags for.	2025-07-13 17:27:10 +00:00
Thomas Eizinger	13c8c70750	fix(connlib): treat `ENOBUFS` as `EWOULDBLOCK` (#9798 ) Socket APIs across operating systems vary in how they handle back-pressure. In most cases, a non-blocking socket should return `EWOULDBLOCK` when it cannot send a given datagram and would have to block to wait for resources to free up. It appears that macOS doesn't always behave like that. In particular, we are seeing error logs from a few users where sending a datagram fails with > No buffer space available (os error 55) Digging through `libc`, I've found that this error is known as `ENOBUFS` [0]. There are reports on the Apple developer forum [1] that recommend retrying when this error happens. It is however unclear as to whether it is entirely safe to map this error to `EWOULDBLOCK`. Other non-blocking event-loop implementations [2] appear to do that but we don't know whether it is fully correct. At present, Firezone's behaviour here is to drop the packet. This means the host networking stack has to fall back to running into a timeout and re-sending the packet. This very likely negatively impacts the UX for the users hitting this. In order to validate this assumption, we implement a feature-flag. This allows us to ship this code but switch back to the old behaviour, should it negatively impact how Firezone behaves. In particular, if the assumption that mapping `ENOBUFS` to `EWOULDBLOCK` is safe turns out wrong and `kqueue` does in fact not signal readiness when more buffers are available, then we may have missing wake-ups which would lead to a further delay in datagrams being sent. [0]: `8e6f36c6ba/src/unix/bsd/apple/mod.rs (L2998)` [1]: https://developer.apple.com/forums/thread/42334 [2]: `aac866f399/src/unix/stream.c (L820)`	2025-07-10 17:51:16 +00:00
Thomas Eizinger	a6796fe8b2	fix(telemetry): always use hex-encoded ID as user ID (#9781 ) We are currently in the process of transitioning the Firezone Clients away from always hashing the ID before sending it to the portal. This will make lookups and correlation of data between our systems much easier. The way we are performing this migration is that new installations of Firezone will directly generate a 64 char hex-string as the Firezone ID. If the ID looks like a UUID (which is the old format), we still hash it and send it to the portal, otherwise we send it as-is. Presently, the telemetry integrations with Sentry and PostHog do the opposite. They always set the Firezone ID as-is and include an `external_id` that is the hashed form if they detect that it is a UUID (or in the case of PostHog, create an alias). It is much better to flip this around and always set the hex-string as the user id. That way, we can simply always filter by the `user.id` attribute in Sentry and always refer to the ID that we are seeing in the portal.	2025-07-04 16:55:44 +00:00
Thomas Eizinger	8b001b3e8b	refactor(telemetry): use atomics for feature-flags (#9783 ) Feature flags may be accessed _very_ often such as on every log statement with #9780. To make sure this is as performant as possible, we move from an `RwLock` to atomic booleans with relaxed ordering.	2025-07-04 14:55:45 +00:00
Thomas Eizinger	3b972643b1	feat(rust): stream logs to Sentry when enabled in PostHog (#9635 ) Sentry has a new "Logs" feature where we can stream logs directly to Sentry. Doing this for all Clients and Gateways would be way too much data to collect though. In order to aid debugging from customer installations, we add a PostHog-managed feature flag that - if set to `true` - enables the streaming of logs to Sentry. This feature flag is evaluated every time the telemetry context is initialised: - For all FFI usages of connlib, this happens every time a new session is created. - For the Windows/Linux Tunnel service, this also happens every time we create a new session. - For the Headless Client and Gateway, it happens on startup and afterwards, every minute. The feature-flag context itself is only checked every 5 minutes though so it might take up to 5 minutes before this takes effect. The default value - like all feature flags - is `false`. Therefore, if there is any issue with the PostHog service, we will fall back to the previous behaviour where logs are simply stored locally. Resolves: #9600	2025-06-25 16:14:14 +00:00
Thomas Eizinger	182a560091	fix(telemetry): don't log events for local and CI env (#9492 ) Avoids spamming PostHog with events from our CI or other instances of the docker-compose setup.	2025-06-10 14:34:20 +00:00
Thomas Eizinger	6ef079357c	feat(connlib): add basic analytics about new sessions (#9379 ) This PR adds basic analytics to `connlib` by sending two events to PostHog: 1. `new_session` which is sent every time we establish a new session with a Firezone backend. This could be our production or staging instance but also a session to an on-premise installation of Firezone. We include the API URL in the event payload to further distinguish these. 2. `$identify` to link the client + version as well as the operating system to the user. The user is identified by the Firezone ID. --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-06-04 06:03:29 +00:00
Thomas Eizinger	07a82d2254	chore(relay): remove feature flag for eBPF TURN router (#8681 ) The original idea of this feature flag was that we can easily disable the eBPF router in case it causes issues in production. However, turning it on / off does not seem to work reliably. Without an explicit toggle of the feature-flag, the eBPF program doesn't seem to be loaded correctly. This uncertainty makes me not trust the metrics that we are seeing because we don't know whether all relays are really using the eBPF router to relay TURN traffic. In order to draw truthful conclusions as to how much traffic we are relaying via eBPF, this patch removes the feature flag again. As of #8656, we can disable the eBPF program by not setting the `EBPF_OFFLOADING` env variable. This requires a re-deploy / restart of relays to take effect which isn't quite as fast as toggling a feature flag but is much more reliable and easier to maintain.	2025-04-07 03:31:22 +00:00
Thomas Eizinger	941ef6c668	feat(relay): introduce feature-flag for toggling eBPF program (#8650 ) This PR implements a feature-flag in PostHog that we can use to toggle the use of the eBPF data plane at runtime. At every tick of the event-loop, the relay will compare the (cached) configuration of the eBPF program with the (cached) value of the feature-flag. If they differ, the flag will be updated and upon the next packet, the eBPF program will act accordingly. Feature-flags are re-evaluated every 5 minutes, meaning there is some delay until this gets applied. The default value of all our feature-flags is `false`, meaning if there is some problem with evaluating them, we'd turn the eBPF data plane off. Performing routing in userspace is slower but it is a safer default. Resolves: #8548	2025-04-04 02:51:52 +00:00
Thomas Eizinger	3ce3c03291	fix(telemetry): introduce staging and prod PostHog projects (#8647 ) As per PostHog's recommendation [0], we now use different projects to manage the feature-flags. This allows us to turn feature flags in staging or production on / off without affecting the other. [0]: https://posthog.com/tutorials/multiple-environments	2025-04-04 01:56:28 +00:00
Thomas Eizinger	8ee1cb9e89	feat(telemetry): include environment in decide request (#8616 ) This allows us to toggle feature-flags based on environments.	2025-04-03 11:25:03 +00:00
Thomas Eizinger	84a2c275ca	build(rust): upgrade to Rust 1.85 and Edition 2024 (#8240 ) Updates our codebase to the 2024 Edition. For highlights on what changes, see the following blogpost: https://blog.rust-lang.org/2025/02/20/Rust-1.85.0.html	2025-03-19 02:58:55 +00:00
Thomas Eizinger	e54a7c2d64	feat(connlib): regularly evaluate feature flags (#8467 ) In order to be able to dynamically configure long-running applications such as the Gateway via feature-flags, we need to regularly re-evaluate them by sending another POST request to the `/decide` endpoint. To do this without impacting anything else, we create a separate runtime that is lazily initialised on first access and use that to run the async code for connecting to the PostHog service. In addition to that, we also spawn a task that re-evaluates the feature flags for the currently set user in the Sentry context every 5 minutes. Resolves: #8454 --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-03-17 23:50:54 +00:00
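
Several commits above (#9783, #9635, #8650) describe feature flags that default to `false` and are read very often, which is why they were moved to atomic booleans with relaxed ordering. The following is a minimal sketch of that pattern, not the actual connlib code: the `FeatureFlags` type, its field names and its methods are hypothetical and chosen only for illustration.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical set of feature flags; names are illustrative only.
/// Every flag starts out as `false`, so a failed fetch keeps the old behaviour.
#[derive(Debug, Default)]
pub struct FeatureFlags {
    stream_logs: AtomicBool,
    map_enobufs_to_would_block: AtomicBool,
}

impl FeatureFlags {
    /// Creates a flag set with everything switched off.
    pub const fn new() -> Self {
        Self {
            stream_logs: AtomicBool::new(false),
            map_enobufs_to_would_block: AtomicBool::new(false),
        }
    }

    /// Called by the (assumed) periodic re-evaluation task with fresh values.
    pub fn update(&self, stream_logs: bool, map_enobufs_to_would_block: bool) {
        self.stream_logs.store(stream_logs, Ordering::Relaxed);
        self.map_enobufs_to_would_block
            .store(map_enobufs_to_would_block, Ordering::Relaxed);
    }

    /// Cheap read, suitable for hot paths such as every log statement.
    pub fn stream_logs(&self) -> bool {
        self.stream_logs.load(Ordering::Relaxed)
    }

    /// Cheap read of the second flag.
    pub fn map_enobufs_to_would_block(&self) -> bool {
        self.map_enobufs_to_would_block.load(Ordering::Relaxed)
    }
}
```

Relaxed ordering is enough in this sketch because each flag is an independent boolean: readers only need to see the new value eventually, and no other memory is published through the flag.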

21 Commits
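
The `ENOBUFS` commit (#9798) maps a macOS-specific send error to `EWOULDBLOCK` behind a feature flag. A rough sketch of such a mapping is shown below, under stated assumptions: the function name `normalize_send_error` and the `map_enobufs` parameter are hypothetical, and the constant `55` is taken from the "os error 55" message quoted in the commit rather than from the `libc` crate.

```rust
use std::io;

/// `ENOBUFS` on macOS, per the "No buffer space available (os error 55)"
/// message quoted in the commit. Hard-coded here to keep the sketch free of
/// external crates; real code would more likely use `libc::ENOBUFS`.
const ENOBUFS_MACOS: i32 = 55;

/// Hypothetical helper: rewrites `ENOBUFS` into a `WouldBlock` error when the
/// feature flag is on, so the caller retries instead of dropping the packet.
/// All other errors, and `ENOBUFS` with the flag off, pass through unchanged.
pub fn normalize_send_error(err: io::Error, map_enobufs: bool) -> io::Error {
    if map_enobufs && err.raw_os_error() == Some(ENOBUFS_MACOS) {
        return io::Error::from(io::ErrorKind::WouldBlock);
    }

    err
}
```

Whether a retry after this mapping is ever woken up depends on `kqueue` signalling readiness once buffers free up, which is exactly the open question the commit message raises; the sketch does not address it.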