firezone

mirror of https://github.com/outbackdingo/firezone.git synced 2026-01-27 02:18:47 +00:00

Author	SHA1	Message	Date
Thomas Eizinger	b7dc897eea	refactor(rust): introduce `libs/` directory (#10964 ) The current Rust workspace isn't as consistent as it could be. To make navigation a bit easier, we move a few crates around. Generally, we follow the idea that entry-points should be at the top-level. `rust/` now looks like this (directories only):
```
.
├── cli              # Firezone CLI
├── client-ffi       # Entry point for Apple & Android
├── gateway          # Gateway
├── gui-client       # GUI client
├── headless-client  # Headless client
├── libs             # Library crates
├── relay            # Relay
├── target           # Compile artifacts
├── tests            # Crates for testing
└── tools            # Local tools
```
To further enforce this structure, we also drop the `firezone-` prefix from all crates that are not top-level binary crates.	2025-11-25 10:59:11 +00:00
Thomas Eizinger	bcf4ccf817	fix(rust): introduce dedicated downcast functions for `anyhow` (#10966 ) The downcasting abilities of `anyhow` are pretty powerful. Unfortunately, they can also be a bit tricky to get right. Whilst `is` and `downcast` work fine for any errors that are within the `anyhow` error chain, they don't check the chain of errors prior to that. In other words, if we already have a nested `std::error::Error` with several causes, `anyhow` cannot downcast to these causes directly. In order to avoid this footgun, we create a thin-layer on top of the `anyhow` crate with some downcasting functions that always try to do the right thing.	2025-11-25 04:14:17 +00:00
Thomas Eizinger	6e2be658b0	chore(gateway): log unroutable packets only on DEBUG (#10897 ) Currently, the Gateway logs all kinds of errors during packet processing on WARN. Whilst it is generally good to be aware of warnings / errors, some of these scenarios are particularly noisy. For various reasons, we may not be able to route a packet arriving from the TUN device. In such cases, we now return an `UnroutablePacket` error to the event-loop which is special-cased to only log on DEBUG. It also includes the 5 tuple as variables, which should make log analysis a bit easier if we want to filter on specific parts of the 5 tuple.	2025-11-18 04:23:14 +00:00
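A rough sketch of what #10897 describes: an error that carries the 5-tuple and is special-cased to a lower log level. Only the name `UnroutablePacket` comes from the commit message; the field names, the `PacketError` and `Level` enums and the `level_for` function are assumptions made for illustration.

```rust
use std::fmt;
use std::net::IpAddr;

/// Transport protocol, the fifth element of the 5-tuple.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Protocol {
    Tcp,
    Udp,
}

/// Field names are assumptions; the commit only says the 5-tuple is included.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct UnroutablePacket {
    pub proto: Protocol,
    pub src_ip: IpAddr,
    pub src_port: u16,
    pub dst_ip: IpAddr,
    pub dst_port: u16,
}

impl fmt::Display for UnroutablePacket {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Each part is a separate key so logs can be filtered on it.
        write!(
            f,
            "unroutable packet proto={:?} src_ip={} src_port={} dst_ip={} dst_port={}",
            self.proto, self.src_ip, self.src_port, self.dst_ip, self.dst_port
        )
    }
}

/// Hypothetical error type seen by the event-loop.
#[derive(Debug, PartialEq, Eq)]
pub enum PacketError {
    Unroutable(UnroutablePacket),
    Other(String),
}

#[derive(Debug, PartialEq, Eq)]
pub enum Level {
    Debug,
    Warn,
}

/// Unroutable packets are noisy, so they only warrant DEBUG.
pub fn level_for(err: &PacketError) -> Level {
    match err {
        PacketError::Unroutable(_) => Level::Debug,
        PacketError::Other(_) => Level::Warn,
    }
}
```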
Thomas Eizinger	de7d3bff89	fix(connlib): re-resolve portal host on WS hiccup (#10817 ) Currently, the DNS records for the portal's hostname are only resolved during startup. When the WebSocket connection fails, we try to reconnect but only with the IPs that we have previously resolved. If the local IP stack changed since then or the hostname now points to different IPs, we will run into the reconnect-timeout configured in `phoenix-channel`. To fix this, we re-resolve the portal's hostname every time the WebSocket connection fails. For the Gateway, this is easy as we can simply reuse the already existing `TokioResolver` provided by hickory. For the Client, we need to write our own DNS client on top of our socket factory abstraction to ensure we don't create a routing loop with the resulting DNS queries. To simplify things, we only send DNS queries over UDP. Those are not guaranteed to succeed but given that we do this on every "hiccup", we already have a retry mechanism. We use the currently configured upstream DNS servers for this. Resolves: #10238	2025-11-11 03:24:36 +00:00
Firezone Bot	5ae2707719	chore: publish gateway 1.4.18 (#10823 )	2025-11-10 19:08:17 +11:00
Thomas Eizinger	166b0d1573	feat(linux): compute device ID from `/etc/machine-id` (#10805 ) All of our Linux applications have a soft-dependency on systemd. That is, in the default configuration, we expect systemd to be present on the machine. The only exceptions here are the docker containers for Headless Client and Gateway. For the GUI client in particular, systemd is a hard-dependency in order to control DNS on the system which we do via `systemd-resolved`. To secure the communication between the GUI client and its tunnel process, we automatically create a group called `firezone-client` to which the user gets added. All members of the group are allowed to access the unix socket which is used for IPC between the two processes. Membership in this group is also a prerequisite for accessing any of the configuration files. On the first launch of the GUI client on a Linux system, this presents a problem. For group membership changes to take effect, the user needs to reboot. We say that in the documentation but it is unclear whether all users will read that thoroughly enough. To help the user, the GUI client checks for membership of the current user in the group and alerts the user via a dialog box if that isn't the case. This would all be fine if it actually worked. Unfortunately, that check ends up being too late in the process. If we aren't a member of the group, we cannot read the device ID and bail early, thus never reaching the check and terminating the process without any dialog box or user-visible error. We could attempt to fix this by shuffling around some of the startup init code. That is a sub-optimal solution however because a) it may get broken again in the future and b) it means we have to delay initialisation of telemetry until a much later point. Given that this is only a problem on Linux, a better solution is to simply not rely on the disk-based device ID at all. Instead, we can integrate with systemd and deterministically derive a device ID from the unique machine ID and a randomly chosen "app ID". For backwards-compatibility reasons, the disk-based device ID is still prioritised. For all new installs however, we will use the one based on `/etc/machine-id`.	2025-11-10 02:29:52 +00:00
Thomas Eizinger	8651413a95	chore(gateway): downgrade warning if peer not found (#10814 ) Logging this on WARN appears to be a bit excessive and there is not really anything we can do about it. Resolves: #10813	2025-11-10 01:45:50 +00:00
Thomas Eizinger	f98c4dd428	fix(gateway): declare hard-dependency on systemd (#10803 ) Several aspects of the Gateway's Debian package depend on `systemd` being present. Without it, we don't have the necessary users and files in place for the Gateway to function. With that specified, we can fail the `postinst` script (and therefore the installation) if anything in there goes wrong.	2025-11-07 14:33:30 +00:00
Thomas Eizinger	89f0af3fd7	fix(gateway): remove exclamation mark from sysusers.conf (#10802 )	2025-11-07 12:21:32 +11:00
Thomas Eizinger	352a83bbb0	refactor(connlib): allow creating multiple layer 4 DNS servers (#10763 ) Within Firezone, there are multiple components that deal with DNS queries. Two of those components are the `l4-udp-dns-server` and `l4-tcp-dns-server`. Both of them are responsible for receiving DNS queries on layer 4, i.e. UDP or TCP. In other words, they do _not_ operate on an IP level (which would be layer 3) but instead use `UdpSocket` and `TcpListener` to receive queries and send back responses. Right now, the interfaces of these crates are designed for the use case of receiving forwarded DNS queries from the Client on the Gateway's TUN device. This is a special case of DNS resolution. When receiving a TXT or SRV query for a domain that is covered by a DNS resource, Firezone Clients will forward that query to the corresponding Gateway and resolve it in its network context. SRV and TXT records are commonly used for service discovery and as such, should be resolved in the network context of the service, i.e. the site that is assigned to the resource. For that use case, it made sense to allow each DNS server to listen on 1 IPv4 and 1 IPv6 address. Since then, our event-loop has evolved a bit, being able to handle multiple inputs at once. As such, we can simplify the API of these crates to only listen on a single address and instead create multiple instances of them inside `Io`. Depending on how the design of our DNS implementation for the Clients evolves, this may be used to listen on multiple IPs later (e.g. from the `127.0.0.0/8` subnet). Related: #8263	2025-11-04 03:45:49 +00:00
Thomas Eizinger	3e9ef4772b	feat(gateway): extend flow logs with more client properties (#10717 ) In order to make the flow logs emitted by the Gateway more useful and self-contained, we extend the `authorize_flow` message sent to the Gateway with some more context around the Client and Actor of that flow. In particular, we now also send the following to the Gateway: - `client_version` - `device_os_version` - `device_os_name` - `device_serial` - `device_uuid` - `device_identifier_for_vendor` - `device_firebase_installation_id` - `identity_id` - `identity_name` - `actor_id` - `actor_email` We only extend the `authorize_flow` message with these additional properties. The legacy messages for 1.3.x Clients remain as is. For those Clients, the above properties will be empty in the flow logs. Resolves: #10690 --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io> Co-authored-by: Jamil <jamilbk@users.noreply.github.com>	2025-10-30 02:13:22 +00:00
dependabot[bot]	941f6f3d1c	build(deps): bump secrecy from 0.8.0 to 0.10.3 in /rust (#10631 ) Bumps [secrecy](https://github.com/iqlusioninc/crates) from 0.8.0 to 0.10.3. --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Thomas Eizinger <thomas@eizinger.io>	2025-10-30 01:17:10 +00:00
Thomas Eizinger	0d2ddd8497	feat(gateway): create debian package (#10537 ) With this PR we add `cargo-deb` to our CI pipeline and build a debian package for the Gateway. The debian package comes with several configuration files that make it easy for admins to start and maintain a Gateway installation: - The embedded systemd unit file is essentially the same one as what we currently install with the install script with some minor modifications. - The token is read from `/etc/firezone/gateway-token` and passed as a systemd credential. This allows us to set the permissions for this file to `0400` and have it owned by `root:root`. - The configuration is read from `/etc/firezone/gateway-env`. - Both of these changes basically mean the user should never need to touch the unit file itself. - The `sysusers` configuration file ensures the `firezone` user and group are present on the system. - The `tmpfiles` configuration file ensures the necessary directories are present. All of the above is automatically installed and configured using the post-installation script which is called by `apt` once the package is installed. In addition to the Gateway, we also package a first version of the `firezone-cli`. Right now, `firezone-cli` (installed as `firezone`) has three subcommands: - `gateway authenticate`: Asks for the Gateway's token and installs it at `/etc/firezone/gateway-token`. The user doesn't have to know how we manage this token and can trust that we are using safe defaults. - `gateway enable`: Enables and starts the systemd service. - `gateway disable`: Disables the systemd service. Right now, the `.deb` file is only uploaded to the preview APT repository and not attached to the release. It should therefore not yet be user-visible unless somebody pokes around a lot, meaning we can defer documentation to a later PR and start testing it from the preview repository for our own purposes. Related: #10598 Resolves: #8484 Resolves: #10681	2025-10-24 05:14:58 +00:00
Thomas Eizinger	fbf1a1e322	fix(gateway): trim whitespace from systemd credential (#10695 ) Unix tools often write a newline at the end of a file. When using the file's contents as a token, they need to match byte-for-byte otherwise we cannot authenticate to the portal. To ensure that, we trim the content from the file before creating the `SecretString`.	2025-10-24 04:03:40 +00:00
Thomas Eizinger	ed2bc0bd25	feat(gateway): revise handling of DNS resolution errors (#10623 ) Even prior to #10373, failures in resolving a name on the Gateway for a DNS resource resulted in a failure of setting up the DNS resource NAT. Without the DNS resource NAT, packets for that resource bounced on the Gateway because we didn't have any traffic filters. A non-existent filter is being treated as a "traffic not allowed" error and we respond with an ICMP permission denied error. For domains where both the A and AAAA query result in NXDOMAIN, that isn't necessarily appropriate. Instead, I am proposing that for such cases, we want to return a regular "address/host unreachable" ICMP error instead of the more specific "permission denied" variant. To achieve that, we refactor the Gateway's peer state to be able to hold an `Option<IpAddr>` inside the `TranslationState`. This allows us to always insert an entry for each proxy IP, even if we did not resolve any IPs for it. Then, when receiving traffic for a proxy IP where the resolved IP is `None`, we reply with the appropriate ICMP error. As part of this, we also simplify the assignment of the proxy IPs. With the NAT64 module removed, there is no more reason to cross-assign IPv4 and IPv6 addresses. We can simply leave the mappings for e.g. IPv6 proxy addresses empty if the AAAA query didn't resolve anything. From the Client's perspective, not much changes. The DNS resource NAT setup will now succeed, even for domains that don't resolve to anything. This doesn't change any behaviour though as we are currently already passing packets through for failed DNS resource NAT setups. The main change is that we now send back a different ICMP error. Most importantly, the "address/host unreachable variant" does not trigger #10462.	2025-10-22 19:14:45 +00:00
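The distinction drawn in #10623 can be sketched as a three-way lookup: a proxy IP with a resolved address is forwarded, a proxy IP whose entry is `None` yields an "unreachable" ICMP error, and a proxy IP without any entry is treated as "not allowed". This is a simplified illustration, assuming a plain `HashMap` in place of the real `TranslationState`; `Verdict` and `route` are hypothetical names.

```rust
use std::collections::HashMap;
use std::net::IpAddr;

/// What to do with a packet addressed to a proxy IP (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    /// Translate the destination to the resolved IP and forward.
    Forward(IpAddr),
    /// The name did not resolve for this IP family: address/host unreachable.
    IcmpUnreachable,
    /// No entry at all: treated as traffic that is not allowed.
    IcmpPermissionDenied,
}

/// `table` maps each proxy IP to the resolved IP, or `None` if the
/// corresponding A / AAAA query did not return anything.
pub fn route(table: &HashMap<IpAddr, Option<IpAddr>>, proxy_ip: IpAddr) -> Verdict {
    match table.get(&proxy_ip) {
        Some(Some(resolved)) => Verdict::Forward(*resolved),
        Some(None) => Verdict::IcmpUnreachable,
        None => Verdict::IcmpPermissionDenied,
    }
}
```

Keeping an entry for every proxy IP, even an empty one, is what lets the lookup tell "nothing resolved" apart from "never set up".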
dependabot[bot]	c795e0da72	build(deps): bump futures-bounded from 0.2.4 to 0.3.0 in /rust (#10645 ) Bumps [futures-bounded](https://github.com/thomaseizinger/rust-futures-bounded) from 0.2.4 to 0.3.0. Changelog (sourced from [futures-bounded's changelog](https://github.com/thomaseizinger/rust-futures-bounded/blob/main/CHANGELOG.md)): 0.3.0: Allow for multiple timer implementations. See [PR 5](https://redirect.github.com/thomaseizinger/rust-futures-bounded/pull/5). --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Thomas Eizinger <thomas@eizinger.io>	2025-10-22 03:55:25 +00:00
Thomas Eizinger	80331b4e93	feat(gateway): add option for outputting logs as JSON (#10620 ) To enable customers to ingest flow logs (#8353) into various SIEMs, outputting structured logs is crucial.	2025-10-22 03:09:33 +00:00
Firezone Bot	e3bb2fb931	chore: publish gateway 1.4.17 (#10584 )	2025-10-16 05:38:12 +00:00
Thomas Eizinger	038aa6b590	feat(gateway): support systemd credentials (#10538 ) For more permanent Gateway installations, or ones that are managed through something else other than our install script, it is useful to define the Gateway's token outside the systemd unit file. Systemd provides support for credentials via the `LoadCredential` and `LoadCredentialEncrypted` instructions. We just need a tiny bit of glue code in the Gateway to actually use that if it is set. --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io> Co-authored-by: Jamil <jamilbk@users.noreply.github.com>	2025-10-14 00:07:49 +00:00
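The "glue code" for systemd credentials can be approximated as follows. systemd exposes credentials loaded via `LoadCredential` as files in the directory named by the `CREDENTIALS_DIRECTORY` environment variable. This sketch takes that directory as a parameter so it can be exercised without systemd. `read_credential` and the credential name are hypothetical, and the trimming mirrors the whitespace fix from #10695 rather than reproducing the Gateway's actual code.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Reads the credential `name` from `credentials_dir` and strips
/// surrounding whitespace, e.g. a trailing newline added by an editor.
pub fn read_credential(credentials_dir: &Path, name: &str) -> io::Result<String> {
    let raw = fs::read_to_string(credentials_dir.join(name))?;
    Ok(raw.trim().to_owned())
}
```

Inside a service, the directory would typically come from `std::env::var_os("CREDENTIALS_DIRECTORY")`; if the variable is unset, no credentials were passed and the caller can fall back to other configuration.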
Thomas Eizinger	531a84268f	fix(connlib): always process all errors from tunnel (#10500 ) In #10347, we made sure that we always return all errors that happen during a single tick of the event-loop. What we overlooked is that as part of handling the errors, we need to use `continue` to jump to the next one instead of returning directly from the function. Signed-off-by: Thomas Eizinger <thomas@eizinger.io>	2025-10-06 17:07:53 +00:00
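The bug fixed in #10500 is easy to reproduce in miniature: when iterating over a collection of errors, returning after handling the first one silently skips the rest. A minimal sketch, assuming a hypothetical `TickError` type with a recoverable and a fatal variant:

```rust
/// Hypothetical error type for one event-loop tick.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TickError {
    Recoverable(String),
    Fatal(String),
}

/// Handles every error of a tick. Returns how many recoverable errors
/// were handled, or the message of the first fatal error.
pub fn handle_all(errors: Vec<TickError>) -> Result<usize, String> {
    let mut handled = 0;
    for error in errors {
        let msg = match error {
            TickError::Recoverable(_) => {
                handled += 1;
                // `continue` moves on to the next error. Returning here
                // instead would leave the remaining errors unprocessed.
                continue;
            }
            TickError::Fatal(msg) => msg,
        };
        return Err(msg);
    }
    Ok(handled)
}
```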
Thomas Eizinger	e9e8792512	feat(connlib): tune down logs for recently disconnected clients (#10501 ) When a Client disconnects from a Gateway, we might still be receiving packets that are either in-flight or are still being sent by the resource. For some amount of time after a disconnect, this is expected and not worth logging a warning for. With this PR, we define this time to be 60s. If we cannot look up a connection either by ID, session index or public key but the peer has disconnected within the last 60s, we will now only print a DEBUG log instead of a WARN. Resolves: #10175	2025-10-03 13:08:06 +00:00
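The 60s grace period from #10501 amounts to remembering when each peer disconnected and consulting that record before choosing a log level. A minimal sketch, assuming peers are identified by a plain `u64` and that the caller passes in the current time. `GRACE_PERIOD`, `Level` and `level_for_unknown_peer` are illustrative names.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// How long after a disconnect stray packets are considered expected.
pub const GRACE_PERIOD: Duration = Duration::from_secs(60);

#[derive(Debug, PartialEq, Eq)]
pub enum Level {
    Debug,
    Warn,
}

/// Picks the log level for a packet whose connection could not be found.
/// `disconnected_at` records when each peer was last disconnected.
pub fn level_for_unknown_peer(
    disconnected_at: &HashMap<u64, Instant>,
    peer: u64,
    now: Instant,
) -> Level {
    match disconnected_at.get(&peer) {
        Some(at) if now.duration_since(*at) <= GRACE_PERIOD => Level::Debug,
        _ => Level::Warn,
    }
}
```

Passing `now` explicitly keeps the function deterministic, which makes the boundary behaviour straightforward to test.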
Thomas Eizinger	a297c6dbbd	chore: differentiate between `shutdown` and `shut down` (#10494 ) In a prior code review, Copilot flagged that we were using the noun "shutdown" as a verb in certain places. Resolves: #10425	2025-10-01 02:55:22 +00:00
Thomas Eizinger	685acdac3a	feat: add more specific component type to user-agent header (#10457 ) In order to allow the portal to more easily classify what kind of component is connecting, we extend the `get_user_agent` header to include a component type instead of the generic `connlib/`. --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io> Co-authored-by: Jamil <jamilbk@users.noreply.github.com>	2025-09-26 00:18:36 +00:00
Thomas Eizinger	aa68029a33	feat(gateway): use hickory resolver to resolve A/AAAA queries (#10373 ) At present, the Gateway performs DNS resolution for A & AAAA queries via `libc`. The `resolve` system call only provides us with the resolved IPs but not any of the metadata around the query such as TTL. As a result, we can only cache DNS queries for a static amount of time, currently 30s. It would be more correct to cache them for their TTL instead. To do so, we re-introduce `hickory-resolver` to our codebase. Deliberately, we only use it for resolving A and AAAA records on the Gateway for now. DNS resolution for SRV & TXT records happens one layer below and uses the same infrastructure as DNS resolution on the Client. Merging this is difficult however because the Gateway still supports the control protocol of 1.3.x clients. That one requires DNS resolution prior to setting up the connection of DNS resources which means it needs to happen in the event-loop of the Gateway binary and cannot be moved into the `Tunnel` where DNS resolution for Client and SRV/TXT records happens. Once we can drop support for 1.3.x clients, the Gateway's event-loop will simplify drastically which will allow us to refactor this to a more unified approach of DNS resolution. Until then, we can at least fix the hardcoded TTL by using `hickory-resolver` in the event-loop. The functionality is guarded behind a feature-flag which - as usual - is off by default (i.e. for as long as we haven't fetched the flags). The feature flag is already configured to `true` for staging and production so we can test the new behaviour. Resolves: #8232 Related: #10385	2025-09-23 06:00:16 +00:00
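Caching by TTL rather than for a fixed 30s can be sketched as a map in which each entry carries its own expiry. This example uses only the standard library and does not touch `hickory-resolver`. `DnsCache` and its methods are hypothetical and only illustrate the caching policy.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::time::{Duration, Instant};

/// Cache in which every entry expires according to its own TTL.
#[derive(Debug, Default)]
pub struct DnsCache {
    entries: HashMap<String, (Vec<IpAddr>, Instant)>,
}

impl DnsCache {
    pub fn new() -> Self {
        Self::default()
    }

    /// Stores `ips` for `name` until `now + ttl`.
    pub fn insert(&mut self, name: &str, ips: Vec<IpAddr>, ttl: Duration, now: Instant) {
        self.entries.insert(name.to_owned(), (ips, now + ttl));
    }

    /// Returns the cached IPs if the entry exists and has not expired.
    pub fn get(&self, name: &str, now: Instant) -> Option<&[IpAddr]> {
        match self.entries.get(name) {
            Some((ips, expires_at)) if now < *expires_at => Some(ips.as_slice()),
            _ => None,
        }
    }
}
```

A record with a 5s TTL is served from the cache for 5s and then triggers a fresh lookup, while a record with a 1h TTL is not re-queried every 30s.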
Thomas Eizinger	8e00870942	refactor(gateway): close connections on error (#10401 ) Previously, the Gateway would only proactively close connections to its peers when it was shut down gracefully via a SIGTERM or SIGINT signal. By copying the same design for the event-loop as I've implemented in #10400, we can now also initiate the graceful shutdown in case the event-loop exits with an error.	2025-09-20 20:55:48 +00:00
Thomas Eizinger	88e801ad97	fix(gateway): re-join topic in phoenix-channel on error (#10397 ) For whatever reason, we seem to sometimes lose the association with the "room" we are meant to be in in order to send messages to the portal. Without joining the right room, messages get dropped silently. To fix this, we re-join the room on such errors. Long-term, this will be fixed by ditching phoenix-channel in favor of simple HTTP requests. Related: #9649	2025-09-20 05:14:12 +00:00
Thomas Eizinger	90d10a8634	refactor(connlib): improve fairness of event-loop (#10347 ) The event-loop inside `Tunnel` processes input according to a certain priority. We only take input from lower priority sources when the higher priority sources are not ready. The current priorities are: - Flush all buffers - Read from UDP sockets - Read from TUN device - Read from DNS servers - Process recursive DNS queries - Check timeout The idea of this priority ordering is to keep all kinds of processing bounded and "finish" any kind of work that is on-going before taking on new work. Anything that sits in a buffer is basically done with processing and just needs to be written out to the network / device. Arriving UDP packets have already traversed the network and been encrypted on the other end, meaning they are higher priority than reading from the TUN device. Packets from the TUN device still need to be encrypted and sent to the remote. Whilst there is merit in this design, it also bears the potential of starving input sources further down if the top ones are extremely busy. To prevent this, we refactor `Io` to read from all input sources and present it to the event-loop as a batch, allowing all sources to make progress before looping around. Since this event-loop has first been conceived, we have refactored `Io` to use background threads for the UDP sockets and TUN device, meaning they will make progress by themselves anyway until the channels to the main-thread fill up. As such, there shouldn't be any latency increase in processing packets even though we are performing slightly more work per event-loop tick. This kind of batch-processing highlights a problem: Bailing out with an error midway through processing a batch leaves the remainder of the batch unprocessed, essentially dropping packets. To fix this, we introduce a new `TunnelError` type that presents a collection of errors that we encountered while processing the batch. 
This might actually also be a problem with what is currently in `main` because we are already batch-processing packets there but possibly are bailing out midway through the batch. --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io> Co-authored-by: Mariusz Klochowicz <mariusz@klochowicz.com>	2025-09-17 23:28:36 +00:00
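The batch-processing problem from #10347 and its fix can be illustrated with a loop that collects errors instead of bailing out on the first one. This is a simplified sketch: `BatchErrors` stands in for the `TunnelError` type mentioned in the commit, and `handle_packet` is a made-up function that rejects empty packets.

```rust
/// Stand-in for a collection of errors from one batch (hypothetical).
#[derive(Debug, PartialEq, Eq)]
pub struct BatchErrors {
    /// Number of packets that were processed successfully.
    pub processed: usize,
    /// One message per failed packet.
    pub errors: Vec<String>,
}

/// Made-up packet handler: fails on empty packets.
fn handle_packet(packet: &[u8]) -> Result<(), String> {
    if packet.is_empty() {
        return Err("empty packet".to_owned());
    }
    Ok(())
}

/// Processes the whole batch. A failing packet is recorded, but the
/// remaining packets are still handled rather than dropped.
pub fn process_batch(packets: &[&[u8]]) -> Result<usize, BatchErrors> {
    let mut processed = 0;
    let mut errors = Vec::new();
    for packet in packets.iter().copied() {
        match handle_packet(packet) {
            Ok(()) => processed += 1,
            Err(e) => errors.push(e),
        }
    }
    if errors.is_empty() {
        Ok(processed)
    } else {
        Err(BatchErrors { processed, errors })
    }
}
```

Using `?` inside the loop would be shorter, but it would return at the first failure and drop every packet that follows it in the batch.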
Thomas Eizinger	3e6094af8d	feat(linux): try to set `rmem_max` and `wmem_max` on startup (#10349 ) The default send and receive buffer sizes on Linux are too small (only ~200 KB). Checking `nstat` after an iperf run revealed that the number of dropped packets in the first interval directly correlates with the number of receive buffer errors reported by `nstat`. We already try to increase the send and receive buffer sizes for our UDP socket but unfortunately, we cannot increase them beyond what the system limits them to. To work around this, we try to set `rmem_max` and `wmem_max` during startup of the Linux headless client and Gateway. This behaviour can be disabled by setting `FIREZONE_NO_INC_BUF=true`. This doesn't work in Docker unfortunately, so we set the values manually in the CI perf tests and verify after the test that we didn't encounter any send and receive buffer errors. It is yet to be determined how we should deal with this problem for all the GUI clients. See #10350 as an issue tracking that. Unfortunately, this doesn't fix all packet drops during the first iperf interval. With this PR, we now see packet drops on the interface itself.	2025-09-17 23:05:01 +00:00
Thomas Eizinger	69afe71215	refactor(connlib): remove concept of "ReplyMessages" (#10361 ) In earlier versions of Firezone, the WebSocket protocol with the portal was using the request-response semantics built into Phoenix. This however is quite cumbersome to work with due to the polymorphic nature of the protocol design. We ended up moving away from it and instead only use one-way messages where each event directly corresponds to a message type. However, we have never removed the capability for reply messages from the `phoenix-channel` module; instead, all usages just set it to `()`. We can simplify the code here by always setting this to `()`. Resolves: #7091	2025-09-17 04:10:56 +00:00
Firezone Bot	cacef44b4b	chore: publish gateway 1.4.16 (#10321 )	2025-09-10 04:50:43 +00:00
Thomas Eizinger	e84bdc5566	refactor(connlib): periodically record queue depths (#10242 ) Instead of recording the queue depths on every event-loop tick, we now record them once a second by setting a Gauge. Not only is that a simpler instrument to work with but it is significantly more performant. The current version - when metrics are enabled - takes on quite a bit of CPU time. Resolves: #10237	2025-09-02 02:57:36 +00:00
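Recording a value once per second instead of on every tick needs little more than a timestamp for when the next sample is due. A minimal sketch, assuming the event-loop passes in the current time. `PeriodicSampler` is a hypothetical helper, and the actual metrics instrument (the Gauge) is left out.

```rust
use std::time::{Duration, Instant};

/// Tells the caller when it is time to take the next sample.
#[derive(Debug)]
pub struct PeriodicSampler {
    interval: Duration,
    next_due: Instant,
}

impl PeriodicSampler {
    pub fn new(interval: Duration, now: Instant) -> Self {
        Self {
            interval,
            next_due: now + interval,
        }
    }

    /// Returns `true` at most once per `interval`. Intended to be called
    /// on every event-loop tick; the caller records its gauges on `true`.
    pub fn poll(&mut self, now: Instant) -> bool {
        if now < self.next_due {
            return false;
        }
        self.next_due = now + self.interval;
        true
    }
}
```

Most ticks then cost a single timestamp comparison, and the queue depths are only read when `poll` returns `true`.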
Thomas Eizinger	533f4c319b	feat(connlib): gracefully shutdown connections (#10076 ) Right now, connections cannot be actively closed in Firezone. The WireGuard tunnel and the ICE agent are coupled together, meaning only if either one of them fails will we clean up the connection. One exception here is when the Client roams. In that case, the Client simply clears its local memory completely and then re-establishes all necessary connections by re-requesting access. There are three cases where gracefully closing a connection is useful: 1. If an access authorization is revoked or expires and this was the last resource authorisation for that peer, we don't currently remove the connection on the Gateway. Instead, the Client is still able to send packets but they'll be dropped because we don't have a peer state anymore. 2. If a Gateway gets restarted due to e.g. an upgrade or other maintenance work, it loses all its connections and every Client needs to wait for the ICE timeout (~15 seconds) before it can establish a new one. 3. If a Client has its access revoked for all resources it has access to in a particular site we also don't remove this connection, even though it has become practically useless. All of these cases are fixed with this PR. Here we introduce a way to gracefully shut down a connection without forcing the other side into an ICE timeout. The graceful connection shutdown works by introducing a new "goodbye" p2p control protocol message. Like all our p2p control protocol messages, this is based on IP and therefore delivery is not guaranteed. In other words, this "goodbye" message is sent on a best-effort basis. In the case of shutdown, the Gateway will wait for all UDP packets to be flushed but will not resend them or wait for an ACK. If either end receives such a "goodbye" message, they simply remove the local peer and connection state just as if the connection had failed due to either ICE or WireGuard. For the Client, this means that the next packet for a resource will trigger a new access authorization request.	2025-09-01 06:30:13 +00:00
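The receiving side of the "goodbye" message can be sketched as a state removal: once the peer is gone from local state, the next packet for it has to go through a new access request. The types below are hypothetical and greatly simplified; the real control protocol, peer state and connection state are not modelled.

```rust
use std::collections::HashSet;

/// Hypothetical subset of p2p control messages.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ControlMessage {
    Goodbye,
    Other,
}

/// Tracks which peers currently have local connection state.
#[derive(Debug, Default)]
pub struct Connections {
    connected: HashSet<u64>,
}

impl Connections {
    pub fn connect(&mut self, peer: u64) {
        self.connected.insert(peer);
    }

    /// On "goodbye", drop the local state for `from`, just as if the
    /// connection had failed. No reply is sent: delivery is best-effort.
    pub fn handle_control(&mut self, from: u64, msg: ControlMessage) {
        if msg == ControlMessage::Goodbye {
            self.connected.remove(&from);
        }
    }

    /// Without local state, the next packet needs a new access request.
    pub fn needs_access_request(&self, peer: u64) -> bool {
        !self.connected.contains(&peer)
    }
}
```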
Thomas Eizinger	9cddfe59fa	fix(rust): don't require Internet on startup (#10264 ) With the introduction of the pre-resolved Sentry host, all Firezone clients now require Internet on startup. That is a significant usability hit that we can easily fix by simply falling back to resolving the host on-demand.	2025-09-01 01:31:05 +00:00
Thomas Eizinger	46afa52f78	feat(telemetry): pre-resolve Sentry ingest host (#10206 ) Our Sentry client needs to resolve DNS before being able to send logs or errors to the backend. Currently, this DNS resolution happens on-demand as we don't take any control of the underlying HTTP client. In addition, this will use HTTP/1.1 by default which isn't as efficient as it could be, especially with concurrent requests. Finally, if we ever decide to proxy all Sentry traffic through our own domain, we have to take control of the underlying client anyway. To resolve all of the above, we create a custom `TransportFactory` where we reuse the existing `ReqwestHttpTransport` but provide an already configured `reqwest::Client` that always uses HTTP/2 with a pre-configured set of DNS records for the given ingest host.	2025-08-21 03:28:05 +00:00
Thomas Eizinger	b4cbc4f33b	fix(connlib): exit phoenix-channel event-loop on error (#10229 ) We cannot poll the `PhoenixChannel` after it has returned an error, otherwise it will panic. Therefore, we exit the event-loop then. The outer event-loop also exits as soon as it receives an error from this channel so this is fine. `PhoenixChannel` only returns an error when it has irrecoverably disconnected, e.g. after the retries have been exhausted or we hit a 4xx error on the WebSocket connection. --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-08-21 03:25:46 +00:00
Thomas Eizinger	4e11112d9b	feat(connlib): improve throughput on higher latencies (#10231 ) Turns out the multi-threaded access of the TUN device on the Gateway causes packet reordering which makes the TCP congestion controller throttle the connection. Additionally, the default TX queue length of a TUN device on Linux is only 500 packets. With just a single thread and an increased TX queue length, we get a throughput performance of just over 1 GBit/s for a 20ms link between Client and Gateway with basically no packet drops: ``` Connecting to host 172.20.0.110, port 5201 [ 5] local 100.79.130.70 port 49546 connected to 172.20.0.110 port 5201 [ ID] Interval Transfer Bitrate Retr Cwnd [ 5] 0.00-1.00 sec 116 MBytes 977 Mbits/sec 0 6.40 MBytes [ 5] 1.00-2.00 sec 137 MBytes 1.15 Gbits/sec 0 6.40 MBytes [ 5] 2.00-3.00 sec 134 MBytes 1.13 Gbits/sec 0 6.40 MBytes [ 5] 3.00-4.00 sec 136 MBytes 1.14 Gbits/sec 47 6.40 MBytes [ 5] 4.00-5.00 sec 137 MBytes 1.15 Gbits/sec 0 6.40 MBytes [ 5] 5.00-6.00 sec 138 MBytes 1.16 Gbits/sec 0 6.40 MBytes [ 5] 6.00-7.00 sec 138 MBytes 1.15 Gbits/sec 0 6.40 MBytes [ 5] 7.00-8.00 sec 138 MBytes 1.15 Gbits/sec 0 6.40 MBytes [ 5] 8.00-9.00 sec 138 MBytes 1.16 Gbits/sec 0 6.40 MBytes [ 5] 9.00-10.00 sec 138 MBytes 1.15 Gbits/sec 0 6.40 MBytes [ 5] 10.00-11.00 sec 139 MBytes 1.17 Gbits/sec 0 6.40 MBytes [ 5] 11.00-12.00 sec 139 MBytes 1.17 Gbits/sec 0 6.40 MBytes [ 5] 12.00-13.00 sec 136 MBytes 1.14 Gbits/sec 0 6.40 MBytes [ 5] 13.00-14.00 sec 139 MBytes 1.17 Gbits/sec 0 6.40 MBytes [ 5] 14.00-15.00 sec 140 MBytes 1.17 Gbits/sec 0 6.40 MBytes [ 5] 15.00-16.00 sec 138 MBytes 1.16 Gbits/sec 0 6.40 MBytes [ 5] 16.00-17.00 sec 137 MBytes 1.15 Gbits/sec 0 6.40 MBytes [ 5] 17.00-18.00 sec 139 MBytes 1.17 Gbits/sec 0 6.40 MBytes [ 5] 18.00-19.00 sec 138 MBytes 1.16 Gbits/sec 0 6.40 MBytes [ 5] 19.00-20.00 sec 136 MBytes 1.14 Gbits/sec 0 6.40 MBytes - - - - - - - - - - - - - - - - - - - - - - - - - [ ID] Interval Transfer Bitrate Retr [ 5] 0.00-20.00 sec 
2.67 GBytes 1.15 Gbits/sec 47 sender [ 5] 0.00-20.02 sec 2.67 GBytes 1.15 Gbits/sec receiver iperf Done. ``` For further debugging in the future, we are now recording the send and receive queue depths of both the TUN device and the UDP sockets. Neither of those appeared to be full in my testing, which leads me to conclude that it isn't any buffer inside Firezone that is too small here. Related: #7452 --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io>	2025-08-20 23:08:56 +00:00
Thomas Eizinger	6f4242769a	refactor(connlib): move gw phoenix-channel to separate task (#10211 ) Similar to #10210, we also move the phoenix-channel to a separate task for the Gateway and connect it with channels to the event-loop. Related: #10003 --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-08-18 14:55:02 +00:00
Firezone Bot	3e529ed36c	chore: publish gateway 1.4.15 (#10134 )	2025-08-05 17:17:25 +10:00
Thomas Eizinger	cd177a6448	fix(gateway): don't remove peer state on disconnect (#10040 ) When the connection to a Client disappears, the Gateway currently clears all state related to this peer. Whilst eagerly cleaning up memory can be good, in this case, it may lead to the Client thinking it has access to a resource when in reality it doesn't. Just because the connection to a Client failed doesn't mean their access authorizations are invalid. In case the Client reconnects, it should be able to just continue sending traffic. At the moment, this only works if the connection also failed on the Client and therefore, its view of the world in regards to "which resources do I have access to" was also reset. What we are seeing in Sentry reports though is that Clients are attempting to access these resources, thinking they have access but the Gateway denies it because it has lost the access authorization state.	2025-08-02 08:27:49 +00:00
Thomas Eizinger	69f9a03ee8	refactor(connlib): simplify `IpPacket` struct (#9795 ) With the removal of the NAT64/46 modules, we can now simplify the internals of our `IpPacket` struct. The requirements for our `IpPacket` struct are somewhat delicate. On the one hand, we don't want to be overly restrictive in our parsing / validation code because there is a lot of broken software out there that doesn't necessarily follow RFCs. Hence, we want to be as lenient as possible in what we accept. On the other hand, we do need to verify certain aspects of the packet, like the payload lengths. At the moment, we are somewhat too lenient there which causes errors on the Gateway where we have to NAT or otherwise manipulate the packets. See #9567 or #9552 for example. To fix this, we make the parsing in the `IpPacket` constructor more restrictive. If it is a UDP, TCP or ICMP packet, we attempt to fully parse its headers and validate the payload lengths. This parsing allows us to then rely on the integrity of the packet as part of the implementation. This does create several code paths that can in theory panic but in practice, should be impossible to hit. To ensure that this does in fact not happen, we also tackle an issue that is long overdue: Fuzzing. Resolves: #6667 Resolves: #9567 Resolves: #9552	2025-07-29 04:42:57 +00:00
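As an illustration of the kind of payload-length validation mentioned above, here is a minimal sketch for UDP only. It is not the `IpPacket` implementation: `udp_payload` and `ParseError` are hypothetical names, and the input is assumed to be the raw UDP datagram (header plus payload) without the IP header.

```rust
#[derive(Debug, PartialEq, Eq)]
enum ParseError {
    /// Fewer bytes than a UDP header.
    TooShort,
    /// The header's length field is smaller than the header or larger than the buffer.
    BadLength,
}

/// Returns the UDP payload if the header's length field is consistent with the buffer.
fn udp_payload(datagram: &[u8]) -> Result<&[u8], ParseError> {
    const HEADER_LEN: usize = 8;

    if datagram.len() < HEADER_LEN {
        return Err(ParseError::TooShort);
    }

    // Bytes 4..6 hold the big-endian length of header + payload.
    let declared = u16::from_be_bytes([datagram[4], datagram[5]]) as usize;

    if declared < HEADER_LEN || declared > datagram.len() {
        return Err(ParseError::BadLength);
    }

    Ok(&datagram[HEADER_LEN..declared])
}
```

Once a check like this has passed in the constructor, later slicing by the declared length cannot go out of bounds, which is the sort of invariant that fuzzing can then exercise.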
Thomas Eizinger	5c3b15c1a9	chore(connlib): harmonise naming of IDs (#10038 ) When filtering through logs in Sentry, it is useful to narrow them down by context of a client, gateway or resource. Currently, these fields are sometimes called `client`, `cid`, `client_id` etc and the same for the Gateway and Resources. To make this filtering easier, name all of them `cid` for Client IDs, `gid` for Gateway IDs and `rid` for Resource IDs.	2025-07-29 03:33:09 +00:00
Thomas Eizinger	e9c74b1bfe	chore(connlib): treat `Invalid Argument` as unreachable hosts (#10037 ) These appear to happen on systems that e.g. don't have IPv6 support or where the destination cannot be reached. It is a bit of a catch-all but all the ones I am seeing in Sentry are false-positives. To reduce the noise a bit, we log these on DEBUG now.	2025-07-29 03:04:13 +00:00
Firezone Bot	cf40f4dd96	chore: publish gateway 1.4.14 (#10030 )	2025-07-28 06:14:07 +00:00
Thomas Eizinger	d7b9ecb60b	feat(gateway): update expiry of access authorizations on init (#9975 ) Resolves: #9971	2025-07-24 06:36:56 +00:00
Thomas Eizinger	301d2137e5	refactor(windows): share src IP cache across UDP sockets (#9976 ) When looking through customer logs, we see a lot of "Resolved best route outside of tunnel" messages. Those get logged every time we need to rerun our re-implementation of Windows' weighting algorithm as to which source interface / IP a packet should be sent from. Currently, this gets cached in every socket instance so for the peer-to-peer socket, this is only computed once per destination IP. However, for DNS queries, we make a new socket for every query. Using a new source port for DNS queries is recommended to avoid fingerprinting of DNS queries. Using a new socket also means that we need to re-run this algorithm every time we make a DNS query which is why we see this log so often. To fix this, we need to share this cache across all UDP sockets. Cache invalidation is one of the hardest problems in computer science and this instance is no different. This cache needs to be reset every time we roam as that changes the weighting of which source interface to use. To achieve this, we extend the `SocketFactory` trait with a `reset` method. This method is called whenever we roam and can then reset a shared cache inside the `UdpSocketFactory`. The "source IP resolver" function that is passed to the UDP socket now simply accesses this shared cache and inserts a new entry when it needs to resolve the IP. As an added benefit, this may speed up DNS queries on Windows a bit (although I haven't benchmarked it). It should certainly drastically reduce the number of syscalls we make on Windows.	2025-07-24 01:36:53 +00:00
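The shared cache with a `reset` hook might look roughly like the sketch below. Only the names `SocketFactory`, `reset` and `UdpSocketFactory` come from the commit message; the trait's other items, the `CachedUdpSocket` type, the cache layout and the `src_ip_for` helper are assumptions made for illustration.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::sync::{Arc, Mutex};

/// Destination IP -> source IP to send from (assumed cache layout).
type SrcIpCache = Arc<Mutex<HashMap<IpAddr, IpAddr>>>;

/// Only `reset` is named in the commit; the rest of this trait is hypothetical.
trait SocketFactory {
    type Socket;
    fn make(&self) -> Self::Socket;
    /// Called whenever we roam.
    fn reset(&self);
}

#[derive(Default)]
struct UdpSocketFactory {
    cache: SrcIpCache,
}

/// Hypothetical socket wrapper that only models the source-IP lookup.
struct CachedUdpSocket {
    cache: SrcIpCache,
}

impl SocketFactory for UdpSocketFactory {
    type Socket = CachedUdpSocket;

    fn make(&self) -> CachedUdpSocket {
        // Every socket shares the same cache instead of owning its own.
        CachedUdpSocket { cache: Arc::clone(&self.cache) }
    }

    fn reset(&self) {
        self.cache.lock().unwrap().clear();
    }
}

impl CachedUdpSocket {
    /// Looks up the source IP for `dst`, running `resolve` only on a cache miss.
    fn src_ip_for(&self, dst: IpAddr, resolve: impl FnOnce(IpAddr) -> IpAddr) -> IpAddr {
        let mut cache = self.cache.lock().unwrap();
        *cache.entry(dst).or_insert_with(|| resolve(dst))
    }
}
```

In this sketch the lock is held while `resolve` runs, which keeps the code short; a real implementation may prefer to resolve outside the lock.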
Thomas Eizinger	ecb2bbc86b	feat(gateway): allow updating expiry of access authorization (#9973 ) Resolves: #9966	2025-07-23 07:25:36 +00:00
Firezone Bot	a11983e4b3	chore: publish gateway 1.4.13 (#9969 )	2025-07-22 18:56:40 +00:00
Thomas Eizinger	c4457bf203	feat(gateway): shutdown after 15m of portal disconnect (#9894 )	2025-07-18 05:47:30 +00:00
Thomas Eizinger	3e71a91667	feat(gateway): revoke unlisted authorizations upon `init` (#9896 ) When receiving an `init` message from the portal, we will now revoke all authorizations not listed in the `authorizations` list of the `init` message. We (partly) test this by introducing a new transition in our proptests that de-authorizes a certain resource whilst the Gateway is simulated to be partitioned. It is difficult to test that we cannot make a connection once that has happened because we would have to simulate a malicious client that knows about resources / connections or ignores the "remove resource" message. Testing this is deferred to a dedicated task. We do test that we hit the code path of revoking the resource authorization and because the other resources keep working, we also test that we are at least not revoking the wrong ones. Resolves: #9892	2025-07-17 19:04:54 +00:00
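Revoking everything that is not listed in `init` amounts to a set difference. The following sketch shows the idea under simplifying assumptions: the `Gateway` struct, its methods and the integer ID types are hypothetical, and an authorization is reduced to a (Client ID, Resource ID) pair without expiry.

```rust
use std::collections::HashSet;

type ClientId = u64;
type ResourceId = u64;

/// Hypothetical Gateway state: one entry per (Client, Resource) authorization.
#[derive(Default)]
struct Gateway {
    authorizations: HashSet<(ClientId, ResourceId)>,
}

impl Gateway {
    fn authorize(&mut self, cid: ClientId, rid: ResourceId) {
        self.authorizations.insert((cid, rid));
    }

    fn is_authorized(&self, cid: ClientId, rid: ResourceId) -> bool {
        self.authorizations.contains(&(cid, rid))
    }

    /// Revokes every authorization absent from `listed` and returns the revoked pairs, sorted.
    fn handle_init(
        &mut self,
        listed: &HashSet<(ClientId, ResourceId)>,
    ) -> Vec<(ClientId, ResourceId)> {
        let mut revoked: Vec<_> = self.authorizations.difference(listed).copied().collect();
        revoked.sort();

        self.authorizations.retain(|auth| listed.contains(auth));

        revoked
    }
}
```

This sketch only removes entries; it does not add listed authorizations that are missing locally.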
Thomas Eizinger	2e0ed018ee	chore: document metrics config switches as private API (#9865 )	2025-07-14 13:53:03 +00:00

290 Commits