firezone

mirror of https://github.com/outbackdingo/firezone.git synced 2026-01-27 18:18:55 +00:00

Author	SHA1	Message	Date
Thomas Eizinger	20d0298a8a	chore: fix clippy warnings about HashMap iteration (#10661 ) Not quite sure how these didn't get picked up by CI but they showed in my local IDE. --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-10-21 02:54:20 +00:00
Thomas Eizinger	fc97816d6e	chore: remove redundant clone (#10662 )	2025-10-21 01:11:03 +00:00
Thomas Eizinger	fcda9c3b65	chore(connlib): add unit test for site-name change (#10622 ) Turns out name changes of sites are already ignored as per the `PartialEq` implementation of `Site`. This adds a unit-test to assert that.	2025-10-19 23:57:45 +00:00
Thomas Eizinger	a07dfc9869	test(connlib): workaround DNS cache in proptests (#10602 ) With the introduction of the DNS cache for Clients in #10533, we now enable a behaviour where we don't necessarily need to establish a connection to a Gateway to resolve a DNS query if we still have a valid entry in the DNS cache. In particular, the proptests discovered that: - a DNS query for an upstream resolver - which happens to be a resource - and has a valid entry in the DNS cache - but (no longer) a connection to the corresponding Gateway will now serve the cached DNS records instead of establishing a new connection to the Gateway. As a result, the site status which we assert in the proptests remains in "unknown" instead of the expected "online". Modelling the caching behaviour in the tests is rather tedious. To avoid that, we set the TTL of all simulated upstream DNS responses to 1 which effectively bypasses the cache. Whilst not an ideal solution, it ensures that CI is consistently green without flaky tests. The DNS cache itself is already unit-tested.	2025-10-17 16:17:52 +00:00
Thomas Eizinger	928d8a2512	fix(connlib): handle resources changing site (#10604 ) Similar to how resources can be edited to change their address, IP stack or other properties, they can also be moved between different sites. Currently, `connlib` requires the portal to explicitly remove the resource and then re-add it for this to work. Our system gets more robust if we also detect that the sites of a resource have changed and handle it like other addressability changes. To ensure that this works correctly, we also extend the proptests to simulate addressability changes of resources. Resolves: #9881 Related: #10593	2025-10-17 14:52:14 +00:00
Thomas Eizinger	6b3f2a32ce	feat(gateway): associate packets with resource ID (#10588 ) In order to support flow logs, we need to associate each IP packet that gets routed with its corresponding resource ID. Currently, we only track what is necessary for the actual routing behaviour: The IP addresses and the filters. Therefore, we extend the data structures in `peer` to also track the `ResourceId` now. The entire code within `peer` became a bit hard to manage so I took this opportunity to split it out into two dedicated modules. This PR forms the base for recording flow logs in #10576.	2025-10-16 13:53:53 +00:00
Thomas Eizinger	08f8e886f1	chore(connlib): tune down INFO logs (#10574 ) Several of these INFO logs are actually quite noisy, like exchanging candidates with Gateways or updating the allocation. We barely look at the INFO logs from customers and primarily investigate issues with DEBUG logs streamed to Sentry.	2025-10-15 05:52:43 +00:00
Thomas Eizinger	df601be538	chore(rust): ban `keys` and `values` from `HashMap` (#10569 ) In addition to the `iter` functions, `keys` and `values` also iterate over the contents of a `HashMap` and are thus non-deterministic. This can create problems where our test-suite is non-deterministic.	2025-10-14 22:44:17 +00:00
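The non-determinism described in the commit above can be illustrated with a minimal sketch. It assumes an ordered map is an acceptable substitute; the commit itself does not say which replacement the codebase uses, and the function names here are hypothetical.

```rust
use std::collections::BTreeMap;

/// Returns the keys in iteration order.
/// A `BTreeMap` iterates in sorted key order, so the result is identical on every run.
/// A `HashMap` gives no such guarantee for `iter`, `keys` or `values`.
fn ordered_keys<'a>(map: &BTreeMap<&'a str, u32>) -> Vec<&'a str> {
    map.keys().copied().collect()
}

/// Hypothetical sample data, inserted in a deliberately unsorted order.
fn sample_map() -> BTreeMap<&'static str, u32> {
    let mut map = BTreeMap::new();
    map.insert("gateway", 2);
    map.insert("client", 1);
    map.insert("relay", 3);
    map
}
```

Regardless of insertion order, `ordered_keys(&sample_map())` yields the keys sorted, which keeps a test suite that depends on iteration order reproducible.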
Thomas Eizinger	039d0be7b8	fix(connlib): drop packets with bad source IP on clients (#10552 ) When using the Internet Resource, it can happen that Clients are still receiving packets with a source IP that is different from the TUN IP. Such packets are dropped on the Gateway already today and therefore have never been routed to their destination. The Gateway cannot route these packets because the reply packets would have the original source address set as the destination and that one is not unique across all Firezone Clients. Without a unique destination, the Gateway cannot send the packet to the correct Client. Today, these packets are filtered on the Gateway and thus trigger an ICMP error. With the addition of #10462, we create a new flow for each one of these packets. To prevent this spam, we drop such packets early in the Client and don't even route them to the Gateway.	2025-10-13 22:54:26 +00:00
Thomas Eizinger	8ccf8b90bc	chore(tests): remove comments from regression seeds file (#10534 ) Whilst the regression seeds file itself is useful to have a fixed set of tests that are always run, the comments on what a specific seed samples quickly get outdated as the test suite evolves. Therefore, we remove the comments to not confuse developers.	2025-10-08 05:21:47 +00:00
Thomas Eizinger	1140f6ffa3	feat(clients): cache DNS responses (#10533 ) Firezone Clients set themselves as the system-wide DNS resolver on startup. This is necessary to intercept queries for DNS resources which resolve to proxy IPs whilst Firezone is active. All DNS queries for non-resources are forwarded to either the resolver defined on the system or the ones defined in the portal (if any). These DNS servers can also be CIDR resources, in which case the queries get forwarded through the tunnel to a Gateway. Right now, the responses from these DNS servers are never cached. Most systems rely heavily on DNS, and having DNS fail or be slow usually results in a bad user experience. To improve on this, we embed a small DNS cache into connlib where for each query, we first try to answer it from the cache. Queries otherwise forwarded to the system/upstream resolver or through the tunnel will see a much improved response time with this change. When serving responses from this cache, the TTL is decremented automatically based on how much time has passed since the entry was first added to the cache. Outside of the response time being ~1ms, this makes the cache fully transparent. Resolves: #10508	2025-10-08 03:26:27 +00:00
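The TTL handling described in the commit above could look roughly like the following sketch. The `DnsCache` type and its methods are hypothetical and are not connlib's actual API; only the idea of decrementing the TTL by the time elapsed since insertion is taken from the commit message.

```rust
use std::collections::BTreeMap;
use std::time::{Duration, Instant};

/// A cached answer: the TTL it arrived with and when it was stored.
struct Entry {
    ttl: Duration,
    inserted_at: Instant,
}

/// Hypothetical cache keyed by domain name.
#[derive(Default)]
struct DnsCache {
    entries: BTreeMap<String, Entry>,
}

impl DnsCache {
    fn insert(&mut self, name: &str, ttl_secs: u64, now: Instant) {
        let entry = Entry {
            ttl: Duration::from_secs(ttl_secs),
            inserted_at: now,
        };
        self.entries.insert(name.to_owned(), entry);
    }

    /// Returns the remaining TTL in whole seconds, or `None` if the
    /// name is unknown or the entry has expired.
    fn remaining_ttl(&self, name: &str, now: Instant) -> Option<u64> {
        let entry = self.entries.get(name)?;
        let elapsed = now.saturating_duration_since(entry.inserted_at);
        let remaining = entry.ttl.checked_sub(elapsed)?;
        if remaining.is_zero() {
            return None;
        }
        Some(remaining.as_secs())
    }
}
```

Passing `now` in explicitly, rather than calling `Instant::now()` inside the cache, keeps the sketch deterministic and easy to test.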
Thomas Eizinger	8fc2ef8ad1	fix(clients): set Internet Resource state on startup (#10509 ) Building on top of #10507, setting the initial Internet Resource state is a piece of cake. All we need to do is thread a boolean variable through to all call-sites of `Session::connect`. Without the need for the Internet Resource's ID, we can simply pass in the boolean that is saved in the configuration of each client. Resolves: #10255	2025-10-07 07:13:52 +00:00
Thomas Eizinger	36dfee2c42	refactor(connlib): explicitly enable/disable Internet Resource (#10507 ) Instead of the generic "disable any kind of resource"-functionality that connlib currently exposes, we now provide an API to only enable / disable the Internet Resource. This is a lot simpler to deal with and reason about than the previous system, especially when it comes to the proptests. Those need to model connlib's behaviour correctly across its entire API surface which makes them unnecessarily complex if we only ever use the `set_disabled_resources` API with a single resource. In preparation for #4789, I want to extend the proptests to cover traffic filters (#7126). This will make them a fair bit more complicated, so any prior removal of complexity is appreciated. Simplifying the implementation here is also a good starting point to fix #10255. Not implicitly enabling the Internet Resource when it gets added should be quite simple after this change. Finally, resolving #8885 should also be quite easy. We just need to store the state of the Internet Resource once per API URL instead of globally. Resolves: #8404 --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-10-07 00:26:07 +00:00
Thomas Eizinger	e9e8792512	feat(connlib): tune down logs for recently disconnected clients (#10501 ) When a Client disconnects from a Gateway, we might still be receiving packets that are either in-flight or are still being sent by the resource. For some amount of time after a disconnect, this is expected and not worth logging a warning for. With this PR, we define this time to be 60s. If we cannot look up a connection either by ID, session index or public key but the peer has disconnected within the last 60s, we will now only print a DEBUG log instead of a WARN. Resolves: #10175	2025-10-03 13:08:06 +00:00
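A minimal sketch of the 60s rule from the commit above, assuming the caller knows when (if ever) the peer disconnected. The `Level` enum and function name are hypothetical; only the 60s window and the DEBUG/WARN distinction come from the commit message.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
enum Level {
    Debug,
    Warn,
}

/// Window after a disconnect during which stray packets are expected.
const RECENT_DISCONNECT_WINDOW: Duration = Duration::from_secs(60);

/// Picks the log level for a packet whose connection could not be looked up.
fn level_for_unknown_packet(disconnected_at: Option<Instant>, now: Instant) -> Level {
    match disconnected_at {
        // The peer disconnected recently: in-flight packets are expected.
        Some(at) if now.saturating_duration_since(at) <= RECENT_DISCONNECT_WINDOW => Level::Debug,
        // Never seen, or disconnected too long ago: worth a warning.
        _ => Level::Warn,
    }
}
```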
Thomas Eizinger	2cc13cea24	refactor(connlib): set ECN bits directly on `Transmit` (#10497 ) Instead of mirroring the ECN bits of an IP packet on the resulting UDP packet in the event-loop, we can extend `Transmit` with an `ecn` field and directly set it every time we construct a `Transmit`, mirroring the ECN bits from the inner IP packet if the UDP packet contains an encapsulated IP packet. Extracted from #10485	2025-10-03 13:02:17 +00:00
Thomas Eizinger	881514edfc	fix(connlib): log fragmented IP packets on debug (#10488 ) When an application sends UDP packets that are larger than the MTU of the underlying interface, the kernel fragments the packet at the IP level. Firezone does not support fragmented IP packets because we need to pack each IP packet into a UDP packet. Right now, we don't check for fragmented IP packets which results in packet parsing errors because the slice we are trying to parse the packet from is not long enough. To avoid spamming Sentry in these cases, we explicitly check for fragmented IP packets and only log those on DEBUG. Resolves: #10335	2025-10-02 05:03:12 +00:00
Thomas Eizinger	cfbdc30123	refactor(connlib): move log into state (#10498 ) Instead of logging this inside the event-loop, it is better to move it into the corresponding handler function to free up the event-loop from as much "logic" as possible. It should ideally only be concerned with linking the state machine with the IO components that actually cause the side-effects.	2025-10-01 04:16:41 +00:00
Thomas Eizinger	a297c6dbbd	chore: differentiate between `shutdown` and `shut down` (#10494 ) In a prior code review, CoPilot flagged that we were using the noun "shutdown" as a verb in certain places. Resolves: #10425	2025-10-01 02:55:22 +00:00
Thomas Eizinger	b11adfcfe4	feat(connlib): create flow on ICMP error "prohibited" (#10462 ) In Firezone, a Client requests an "access authorization" for a Resource on the fly when it sees the first packet for said Resource going through the tunnel. If we don't have a connection to the Gateway yet, this is also where we will establish a connection and create the WireGuard tunnel. In order for this to work, the access authorization state between the Client and the Gateway MUST NOT get out of sync. If the Client thinks it has access to a Resource, it will just route the traffic to the Gateway. If the access authorization on the Gateway has expired or vanished otherwise, the packets will be black-holed. Starting with #9816, the Gateway sends ICMP errors back to the application whenever it filters a packet. This can happen either because the access authorization is gone or because the traffic wasn't allowed by the specific filter rules on the Resource. With this patch, the Client will attempt to create a new flow (i.e. re-authorize) traffic for this resource whenever it sees such an ICMP error, therefore acting as a way of synchronizing the view of the world between Client and Gateway should they ever run out of sync. Testing turned out to be a bit tricky. If we let the authorization on the Gateway lapse naturally, the portal will also toggle the Resource off and on on the Client, resulting in "flushing" the current authorizations. Additionally, if the Client only had access to one Resource, then the Gateway will gracefully close the connection, also resulting in the Client creating a new flow for the next packet. To actually trigger this new behaviour we need to: - Access at least two resources via the same Gateway - Directly send `reject_access` to the Gateway for this particular resource To achieve this, we dynamically eval some code on the API node and instruct the Gateway channel to send `reject_access`. 
The connection stays intact because there is still another active access authorization but packets for the other resource are answered with ICMP errors. To achieve a safe roll-out, the new behaviour is feature-flagged. In order to still test it, we now also allow feature flags to be set via env variables. Resolves: #10074 --------- Co-authored-by: Mariusz Klochowicz <mariusz@klochowicz.com>	2025-09-30 08:23:39 +00:00
Thomas Eizinger	685acdac3a	feat: add more specific component type to user-agent header (#10457 ) In order to allow the portal to more easily classify what kind of component is connecting, we extend the `get_user_agent` header to include a component type instead of the generic `connlib/`. --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io> Co-authored-by: Jamil <jamilbk@users.noreply.github.com>	2025-09-26 00:18:36 +00:00
Thomas Eizinger	0310bafbcd	feat(clients): gracefully close connections on shutdown (#10400 ) In #10076, connlib gained the ability to gracefully close connections between peers. The Gateway already uses this when it is being gracefully shut down such as during an upgrade. This allows Clients to immediately fail-over to a different Gateway instead of waiting for an ICE timeout. When a Client signs out, we currently just drop all the state, resulting in an ICE timeout on the Gateway ~15 seconds later. This makes it difficult for us to analyze whether an ICE timeout in the logs presents an actual problem where a network connection got cut or whether the Client simply signed out. Whilst not water-tight, attempting to gracefully close our connections when the Client signs out is better than nothing so we implement this here. All Clients use the `Session` abstraction from `client-shared` which spawns the event-loop into a dedicated task. - For the Linux and Windows GUI client, the already present tokio runtime instance of the tunnel service is used for this. - For Android and Apple, we create a dedicated, single-threaded runtime instance for connlib. - For the headless client, we also reuse the already existing tokio runtime instance of the binary. In case of Android, Apple and the headless client, this means we need to ensure the tokio runtime instance stays alive long enough to actually complete the graceful shutdown task. We achieve this by draining the `EventStream` returned from `Session`. The `EventStream` is a wrapper around a channel connected to the event-loop. This stream only finishes once the event-loop is entirely dropped (and therefore completed the graceful shutdown) as it holds the sender-end of the channel. In case of the Linux and Windows GUI client, the runtime outlives the `Session` because it is scoped to the entire tunnel process. Therefore, no additional measures are necessary there to ensure the graceful shutdown task completes.	2025-09-23 03:40:52 +00:00
Thomas Eizinger	8e00870942	refactor(gateway): close connections on error (#10401 ) Previously, the Gateway would only proactively close connections to its peers when it was shut down gracefully via a SIGTERM or SIGINT signal. By copying the same design for the event-loop as I've implemented in #10400, we can now also initiate the graceful shutdown in case the event-loop exits with an error.	2025-09-20 20:55:48 +00:00
Thomas Eizinger	e20929ad73	build(deps): bump Rust version to 1.90 (#10380 ) One of the more quiet Rust releases with no new clippy lints that would require code updates.	2025-09-20 04:28:03 +00:00
Thomas Eizinger	9c8101a3ee	chore: render contextual information more Sentry-friendly (#10386 ) Sentry can group issues together that have unique identifiers in their message. Unfortunately, it does that only well for integers and UUIDs and not so much for hex-values. To avoid alert fatigue, we render the public key as a u256 which hopefully allows Sentry to group these together.	2025-09-20 12:08:03 +10:00
Thomas Eizinger	90d10a8634	refactor(connlib): improve fairness of event-loop (#10347 ) The event-loop inside `Tunnel` processes input according to a certain priority. We only take input from lower priority sources when the higher priority sources are not ready. The current priorities are: - Flush all buffers - Read from UDP sockets - Read from TUN device - Read from DNS servers - Process recursive DNS queries - Check timeout The idea of this priority ordering is to keep all kinds of processing bounded and "finish" any kind of work that is on-going before taking on new work. Anything that sits in a buffer is basically done with processing and just needs to be written out to the network / device. Arriving UDP packets have already traversed the network and been encrypted on the other end, meaning they are higher priority than reading from the TUN device. Packets from the TUN device still need to be encrypted and sent to the remote. Whilst there is merit in this design, it also bears the potential of starving input sources further down if the top ones are extremely busy. To prevent this, we refactor `Io` to read from all input sources and present it to the event-loop as a batch, allowing all sources to make progress before looping around. Since this event-loop was first conceived, we have refactored `Io` to use background threads for the UDP sockets and TUN device, meaning they will make progress by themselves anyway until the channels to the main-thread fill up. As such, there shouldn't be any latency increase in processing packets even though we are performing slightly more work per event-loop tick. This kind of batch-processing highlights a problem: Bailing out with an error midway through processing a batch leaves the remainder of the batch unprocessed, essentially dropping packets. To fix this, we introduce a new `TunnelError` type that represents a collection of errors that we encountered while processing the batch. 
This might actually also be a problem with what is currently in `main` because we are already batch-processing packets there but possibly are bailing out midway through the batch. --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io> Co-authored-by: Mariusz Klochowicz <mariusz@klochowicz.com>	2025-09-17 23:28:36 +00:00
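The "collect errors instead of bailing out" idea from the commit above can be sketched as follows. `BatchErrors` and `process_batch` are hypothetical stand-ins and are not the actual `TunnelError` type, whose shape the commit message does not spell out.

```rust
use std::fmt;

/// Hypothetical error type holding every error seen while processing one batch.
#[derive(Debug)]
struct BatchErrors(Vec<String>);

impl fmt::Display for BatchErrors {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{} error(s) while processing batch", self.0.len())
    }
}

/// Runs `handle` on every item, even if earlier items failed,
/// and reports all failures together at the end.
fn process_batch<T>(
    items: &[T],
    mut handle: impl FnMut(&T) -> Result<(), String>,
) -> Result<(), BatchErrors> {
    let mut errors = Vec::new();
    for item in items {
        // Record the error and keep going instead of returning early with `?`.
        if let Err(e) = handle(item) {
            errors.push(e);
        }
    }
    if errors.is_empty() {
        Ok(())
    } else {
        Err(BatchErrors(errors))
    }
}
```

With an early return, the items after the first failure would never be handled; here every item in the batch is visited exactly once.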
Thomas Eizinger	3e6094af8d	feat(linux): try to set `rmem_max` and `wmem_max` on startup (#10349 ) The default send and receive buffer sizes on Linux are too small (only ~200 KB). Checking `nstat` after an iperf run revealed that the number of dropped packets in the first interval directly correlates with the number of receive buffer errors reported by `nstat`. We already try to increase the send and receive buffer sizes for our UDP socket but unfortunately, we cannot increase them beyond what the system limits them to. To work around this, we try to set `rmem_max` and `wmem_max` during startup of the Linux headless client and Gateway. This behaviour can be disabled by setting `FIREZONE_NO_INC_BUF=true`. This doesn't work in Docker unfortunately, so we set the values manually in the CI perf tests and verify after the test that we didn't encounter any send and receive buffer errors. It is yet to be determined how we should deal with this problem for all the GUI clients. See #10350 as an issue tracking that. Unfortunately, this doesn't fix all packet drops during the first iperf interval. With this PR, we now see packet drops on the interface itself.	2025-09-17 23:05:01 +00:00
Thomas Eizinger	7222167b13	fix(connlib): limit the number of optimistic candidates (#10367 ) To facilitate direct connections, `connlib` generates "optimistic" candidates that combine the port of the host candidate with the IP of the server-reflexive candidate. This allows sysadmins to port-forward the Firezone port 52625 on the Gateway, allowing for direct connections to happen behind symmetric NAT. This feature is only really useful for IPv4 as IPv6 doesn't need symmetric NAT due to the larger address space. It is also quite common that users have multiple IPv6 addresses on a single interface. The combination of the two can result in CPU spikes on the Gateway if a client connects and sends over e.g. 10 IPv6 host candidates and various IPv6 server-reflexive candidates. The Gateway then ends up in a loop where it creates an NxM matrix of all these candidates. To mitigate this, we disable optimistic candidates for IPv6 altogether and limit the number of IPv4 optimistic candidates to 2.	2025-09-17 19:52:29 +00:00
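A rough sketch of the rule from the commit above: combine the port of each host candidate with the IP of each server-reflexive candidate, skip IPv6 entirely, and stop after two results. The function name, the de-duplication and the iteration order are assumptions made for illustration, not connlib's implementation.

```rust
use std::net::SocketAddr;

/// Upper bound on IPv4 optimistic candidates, as stated in the commit message.
const MAX_OPTIMISTIC_V4: usize = 2;

/// Builds "optimistic" candidates: server-reflexive IP + host candidate port.
/// IPv6 addresses are ignored on both sides.
fn optimistic_candidates(host: &[SocketAddr], server_reflexive: &[SocketAddr]) -> Vec<SocketAddr> {
    let mut out = Vec::new();
    for srflx in server_reflexive.iter().filter(|a| a.is_ipv4()) {
        for h in host.iter().filter(|a| a.is_ipv4()) {
            let candidate = SocketAddr::new(srflx.ip(), h.port());
            if !out.contains(&candidate) {
                out.push(candidate);
            }
            // Bounding the output avoids building the full NxM matrix.
            if out.len() == MAX_OPTIMISTIC_V4 {
                return out;
            }
        }
    }
    out
}
```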
Thomas Eizinger	69afe71215	refactor(connlib): remove concept of "ReplyMessages" (#10361 ) In earlier versions of Firezone, the WebSocket protocol with the portal was using the request-response semantics built into Phoenix. This however is quite cumbersome to work with due to the polymorphic nature of the protocol design. We ended up moving away from it and instead only use one-way messages where each event directly corresponds to a message type. However, we have never removed the capability for reply messages from the `phoenix-channel` module, instead all usages just set it to `()`. We can simplify the code here by always setting this to `()`. Resolves: #7091	2025-09-17 04:10:56 +00:00
Thomas Eizinger	a66a18782e	chore(connlib): add context to IP packet parse errors (#10337 ) We are seeing some very strange IP packet parse errors coming from MacOS devices. To better understand these, we extend the error messages with the src and dst IP as well as the L4 header. Related: #10335	2025-09-12 14:11:12 +00:00
Thomas Eizinger	33a75f6fee	chore(headless-client): don't make failures look like crashes (#10290 ) Returning an error from `main` by default prints a backtrace. This may lead users to believe that the program is crashing when in fact it is exiting in a controlled way but with an error (such as when we don't have Internet during startup). Printing the chain of errors ourselves resolves this.	2025-09-10 01:08:32 +00:00
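Printing the chain of errors by hand, as the commit above describes, can be sketched with the standard library alone. The `Wrapped` type and the `": "` separator are hypothetical; the actual client may format its errors differently.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical error with an optional underlying cause.
#[derive(Debug)]
struct Wrapped {
    msg: &'static str,
    source: Option<Box<dyn Error + 'static>>,
}

impl fmt::Display for Wrapped {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(self.msg)
    }
}

impl Error for Wrapped {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        self.source.as_deref()
    }
}

/// Renders an error and all of its causes on a single line.
fn render_chain(err: &dyn Error) -> String {
    let mut out = err.to_string();
    let mut current = err.source();
    while let Some(cause) = current {
        out.push_str(": ");
        out.push_str(&cause.to_string());
        current = cause.source();
    }
    out
}
```

A `main` that catches its own error, prints `render_chain(&err)` to stderr and exits with a non-zero code reports the failure without the default debug-style output.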
Thomas Eizinger	03ac73ac00	fix(gateway): reset DNS resource NAT if proxy IPs change (#10310 ) In #10040, we decided to persist a peer's routing state on the Gateway across ICE sessions. This routing state also includes the DNS resource NAT. Prior to #10104 (which is not released yet), when a Client signs out and back in, it resets the proxy IP mapping for DNS resources and will start numbering them again from the front, i.e. starting from 100.96.0.1. With the state still being preserved on the Gateway, this represents a problem: We keep existing mappings around if there is still a NAT session for this proxy IP. However, if the proxy IP is actually for a different domain, this NAT session is meaningless. In fact, not replacing the IP is problematic as we will now route packets for the new proxy IP to the wrong destination. The persistent DNS resource mapping from #10104 fixes this. In this PR, we add an additional check to the Gateway where we detect whether the Client has started to re-assign proxy IPs and if so, we completely reset the DNS resource NAT state including all existing NAT sessions. Fixes #10268	2025-09-09 02:08:26 +00:00
Thomas Eizinger	ead1f40101	chore(gateway): only log skipped NAT entry if IP differs (#10285 ) When we resolve a DNS resource domain name on the Gateway, we establish the mapping between proxy IPs and resolved IPs in order to correctly NAT traffic. These domains are re-resolved every time the Client sees a DNS query for it. Thus, established connections could be interrupted if the IPs returned by consecutive DNS queries are different. Many SaaS products (GitHub for example) use DNS to load balance between different IPs. In order to not interrupt those connections, we check whether we have an open NAT session for an existing mapping every time we re-resolve DNS. This log is currently printed too often though because it doesn't take into account whether the IPs actually changed. If the IP is the same, we don't need to print this because the update is a no-op.	2025-09-04 21:12:46 +00:00
Thomas Eizinger	fb7b001cbf	chore(rust): fix unused variable warning (#10283 )	2025-09-03 01:17:11 +00:00
Thomas Eizinger	d718c5de8e	fix(connlib): retry packets on IO error 5 (#10279 ) Unfortunately, it isn't very easy to detect whether a socket supports GSO on Linux. Hence, `quinn-udp` simply probes for its support by trying to send GSO batches and effectively disables GSO by setting the `max-gso-segments` state variable to 1 if it encounters either EINVAL (-22) or EIO (-5). For EINVAL, `quinn-udp` has an internal retry mechanism. For EIO, the `Transmit` which is passed to `quinn-udp` needs to be re-chunked and thus cannot be automatically retried. In order to avoid dropping packets, we therefore add a once-off retry step to sending a datagram whenever we hit EIO on Linux or Android. If the error was due to GSO not being supported, the 2nd attempt should be successful and going forward, even the first one should be until we roam the socket (where this state variable gets reset). These packet drops have been causing flakiness in CI ever since we merged the eBPF tests. Those disable checksum offloading which appears to trigger these errors.	2025-09-02 21:31:57 +00:00
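The once-off retry described in the commit above might be sketched like this. The closure stands in for the actual send call, and the helper name is hypothetical; only the "retry exactly once on EIO (5)" behaviour comes from the commit message.

```rust
use std::io;

/// Linux errno for a generic I/O error.
const EIO: i32 = 5;

/// Calls `send` and retries exactly once if the first attempt fails with EIO.
/// Any other error, or a second failure, is returned to the caller.
fn send_with_retry<F>(mut send: F) -> io::Result<()>
where
    F: FnMut() -> io::Result<()>,
{
    match send() {
        Err(e) if e.raw_os_error() == Some(EIO) => send(),
        other => other,
    }
}
```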
Thomas Eizinger	e84bdc5566	refactor(connlib): periodically record queue depths (#10242 ) Instead of recording the queue depths on every event-loop tick, we now record them once a second by setting a Gauge. Not only is that a simpler instrument to work with but it is significantly more performant. The current version - when metrics are enabled - takes on quite a bit of CPU time. Resolves: #10237	2025-09-02 02:57:36 +00:00
Thomas Eizinger	a9e1b0fbfb	chore(connlib): print full error when failing to read IP packet (#10275 ) The error returned from `IpPacket::new` is an `anyhow::Error` but in order to return it from `async_io`, we need to wrap it in an `io::Error`. Printing an `io::Error` only prints the top-level error. To fix this, we re-wrap the `io::Error` in an `anyhow::Error` again and toggle "alternate" printing mode to see the full error chain.	2025-09-01 13:39:26 +00:00
Thomas Eizinger	0c2e54f54c	feat(connlib): persistent DNS resource records across sessions (#10104 ) When we receive a DNS query for a DNS resource in Firezone, we take the next available 4 IPs from the CG-NAT range and assign them to the domain name. For example, if `example.com` is a DNS resource and it is the first resource being queried in a Firezone session, we will assign the IPs `100.96.0.1` - `100.96.0.4` to it. If the user now restarts Firezone or signs out and back in, this state is lost and we assign those same IPs to the next DNS query coming in. This creates a problem for applications that do not re-query DNS very often or never. They expect these IPs to not change. Restarting software or signing out and back in is a common approach to fixing software problems, yet in this specific case, doing so may create even more problems for the user. To mitigate this, `ClientState` introduces a new event `DnsRecordsChanged` that gets emitted to the event-loop every time we assign new records. The event-loop then caches this in memory and reuses it in case a new session is initiated. The records are only stored in-memory and not on disk. Most likely, the tunnel process will be alive for the entire OS session. To verify this behaviour, we add a new `RestartClient` transition to our proptests. In the proptests, we already keep a mapping of all DNS names we ever resolved, including DNS resources. When generating IP traffic, we sample from this list of IPs and then expect the packet to be routed. By replacing the `ClientState` as part of this transition and re-seeding it with the previously exported DNS records, we can verify that packets to IPs resolved from a previous session still get successfully routed to the resource. Related: #5498	2025-09-01 07:29:28 +00:00
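The allocation and re-seeding behaviour from the commit above can be approximated with the sketch below. `ProxyIpAllocator` is a hypothetical type; the starting address `100.96.0.1` and the block size of 4 come from the commit message, while continuing with the next consecutive block for the second domain is an assumption.

```rust
use std::collections::BTreeMap;
use std::net::Ipv4Addr;

/// First proxy IP handed out in a fresh session (from the commit message).
const FIRST_PROXY_IP: Ipv4Addr = Ipv4Addr::new(100, 96, 0, 1);

/// Hypothetical allocator that hands out 4 consecutive proxy IPs per domain.
struct ProxyIpAllocator {
    next: u32,
    records: BTreeMap<String, [Ipv4Addr; 4]>,
}

impl ProxyIpAllocator {
    fn new() -> Self {
        Self::with_records(BTreeMap::new())
    }

    /// Re-seeds the allocator with records exported from a previous session,
    /// so existing domains keep their IPs and new ones continue after them.
    fn with_records(records: BTreeMap<String, [Ipv4Addr; 4]>) -> Self {
        let next = records
            .values()
            .flatten()
            .map(|ip| u32::from(*ip) + 1)
            .max()
            .unwrap_or(u32::from(FIRST_PROXY_IP));
        Self { next, records }
    }

    /// Returns the proxy IPs for `domain`, assigning a new block if needed.
    fn resolve(&mut self, domain: &str) -> [Ipv4Addr; 4] {
        if let Some(ips) = self.records.get(domain) {
            return *ips;
        }
        let base = self.next;
        let ips = [0u32, 1, 2, 3].map(|offset| Ipv4Addr::from(base + offset));
        self.next += 4;
        self.records.insert(domain.to_owned(), ips);
        ips
    }
}
```

Carrying `records` over into a new allocator plays the role of the in-memory cache that survives a session restart: an application that resolved `example.com` before the restart keeps reaching the same resource afterwards.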
Thomas Eizinger	533f4c319b	feat(connlib): gracefully shutdown connections (#10076 ) Right now, connections cannot be actively closed in Firezone. The WireGuard tunnel and the ICE agent are coupled together, meaning only if either one of them fails will we clean up the connection. One exception here is when the Client roams. In that case, the Client simply clears its local memory completely and then re-establishes all necessary connections by re-requesting access. There are three cases where gracefully closing a connection is useful: 1. If an access authorization is revoked or expires and this was the last resource authorization for that peer, we don't currently remove the connection on the Gateway. Instead, the Client is still able to send packets but they'll be dropped because we don't have a peer state anymore. 2. If a Gateway gets restarted due to e.g. an upgrade or other maintenance work, it loses all its connections and every Client needs to wait for the ICE timeout (~15 seconds) before it can establish a new one. 3. If a Client has its access revoked for all resources it has access to in a particular site we also don't remove this connection, even though it has become practically useless. All of these cases are fixed with this PR. Here we introduce a way to gracefully shut down a connection without forcing the other side into an ICE timeout. The graceful connection shutdown works by introducing a new "goodbye" p2p control protocol message. Like all our p2p control protocol messages, this is based on IP and therefore delivery is not guaranteed. In other words, this "goodbye" message is sent on a best-effort basis. In the case of shutdown, the Gateway will wait for all UDP packets to be flushed but will not resend them or wait for an ACK. If either end receives such a "goodbye" message, they simply remove the local peer and connection state just as if the connection had failed due to either ICE or WireGuard. 
For the Client, this means that the next packet for a resource will trigger a new access authorization request.	2025-09-01 06:30:13 +00:00
Thomas Eizinger	544ba11f21	chore(rust): allow `too_many_arguments` repo-wide (#10236 ) We always end up allowing this lint when it pops up so we can also just allow it for the whole repo in general. Most of the time, the reason for too many arguments are borrow-checker limitations of Rust where mutable references need to be tracked explicitly.	2025-08-22 13:21:07 +00:00
Thomas Eizinger	c70c88c856	build(deps): upgrade to opentelemetry 0.30 (#10239 )	2025-08-21 22:47:39 +00:00
Thomas Eizinger	99155490c5	chore(connlib): make UDP buffer sizes tunable at runtime (#10234 ) For easier benchmarking, we make the UDP socket send and receive buffers runtime-tunable. Related: #7452	2025-08-21 18:18:14 +00:00
Thomas Eizinger	f85ae75ae0	refactor(connlib): increase UDP queues on desktop platforms (#10235 ) On desktop platforms, we can easily afford to have larger queues here despite each item in there being 65k. Benchmarking showed that we do sometimes fill these up. Related: #7452	2025-08-21 08:56:14 +00:00
Thomas Eizinger	a109c1a2ef	feat(connlib): discard intermediate resource and TUN updates (#10223 ) Right now, the Client event-loops have a channel with 1000 items for sending new resource lists and updates to the TUN device to the host app. This is kind of unnecessary as we always only care about the last version of these. Intermediate updates that the host app doesn't process are effectively irrelevant. We've had an issue before where a bug in the portal caused us to receive many updates to resources which ended up crashing Client apps because this channel filled up. To be more resilient on this front, we refactor the Client event loop to use a `watch` channel for this. Watch channels only retain the last value that got sent into them.	2025-08-21 05:42:54 +00:00
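The "only the last value matters" semantics of a `watch` channel, as described in the commit above, can be imitated with a small standard-library sketch. `Latest` is a hypothetical stand-in used only to show the behaviour; it is not the channel type the Client event loop uses.

```rust
use std::sync::Mutex;

/// Hypothetical single-slot mailbox: a newer value overwrites an unread older one.
struct Latest<T> {
    slot: Mutex<Option<T>>,
}

impl<T> Latest<T> {
    fn new() -> Self {
        Self { slot: Mutex::new(None) }
    }

    /// Stores `value`, discarding any intermediate update that was never read.
    fn send(&self, value: T) {
        *self.slot.lock().unwrap() = Some(value);
    }

    /// Takes the most recent value, if any has arrived since the last call.
    fn take(&self) -> Option<T> {
        self.slot.lock().unwrap().take()
    }
}
```

Because the slot never holds more than one item, a burst of updates cannot fill it up the way it could fill a bounded queue of 1000 items.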
Thomas Eizinger	4e11112d9b	feat(connlib): improve throughput on higher latencies (#10231 ) Turns out the multi-threaded access of the TUN device on the Gateway causes packet reordering which makes the TCP congestion controller throttle the connection. Additionally, the default TX queue length of a TUN device on Linux is only 500 packets. With just a single thread and an increased TX queue length, we get a throughput performance of just over 1 GBit/s for a 20ms link between Client and Gateway with basically no packet drops: ``` Connecting to host 172.20.0.110, port 5201 [ 5] local 100.79.130.70 port 49546 connected to 172.20.0.110 port 5201 [ ID] Interval Transfer Bitrate Retr Cwnd [ 5] 0.00-1.00 sec 116 MBytes 977 Mbits/sec 0 6.40 MBytes [ 5] 1.00-2.00 sec 137 MBytes 1.15 Gbits/sec 0 6.40 MBytes [ 5] 2.00-3.00 sec 134 MBytes 1.13 Gbits/sec 0 6.40 MBytes [ 5] 3.00-4.00 sec 136 MBytes 1.14 Gbits/sec 47 6.40 MBytes [ 5] 4.00-5.00 sec 137 MBytes 1.15 Gbits/sec 0 6.40 MBytes [ 5] 5.00-6.00 sec 138 MBytes 1.16 Gbits/sec 0 6.40 MBytes [ 5] 6.00-7.00 sec 138 MBytes 1.15 Gbits/sec 0 6.40 MBytes [ 5] 7.00-8.00 sec 138 MBytes 1.15 Gbits/sec 0 6.40 MBytes [ 5] 8.00-9.00 sec 138 MBytes 1.16 Gbits/sec 0 6.40 MBytes [ 5] 9.00-10.00 sec 138 MBytes 1.15 Gbits/sec 0 6.40 MBytes [ 5] 10.00-11.00 sec 139 MBytes 1.17 Gbits/sec 0 6.40 MBytes [ 5] 11.00-12.00 sec 139 MBytes 1.17 Gbits/sec 0 6.40 MBytes [ 5] 12.00-13.00 sec 136 MBytes 1.14 Gbits/sec 0 6.40 MBytes [ 5] 13.00-14.00 sec 139 MBytes 1.17 Gbits/sec 0 6.40 MBytes [ 5] 14.00-15.00 sec 140 MBytes 1.17 Gbits/sec 0 6.40 MBytes [ 5] 15.00-16.00 sec 138 MBytes 1.16 Gbits/sec 0 6.40 MBytes [ 5] 16.00-17.00 sec 137 MBytes 1.15 Gbits/sec 0 6.40 MBytes [ 5] 17.00-18.00 sec 139 MBytes 1.17 Gbits/sec 0 6.40 MBytes [ 5] 18.00-19.00 sec 138 MBytes 1.16 Gbits/sec 0 6.40 MBytes [ 5] 19.00-20.00 sec 136 MBytes 1.14 Gbits/sec 0 6.40 MBytes - - - - - - - - - - - - - - - - - - - - - - - - - [ ID] Interval Transfer Bitrate Retr [ 5] 0.00-20.00 sec 
2.67 GBytes 1.15 Gbits/sec 47 sender [ 5] 0.00-20.02 sec 2.67 GBytes 1.15 Gbits/sec receiver iperf Done. ``` For further debugging in the future, we are now recording the send and receive queue depths of both the TUN device and the UDP sockets. Neither of those turned out to be full in my testing, which leads me to conclude that it isn't any buffer inside Firezone that is too small here. Related: #7452 --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io>	2025-08-20 23:08:56 +00:00
Thomas Eizinger	da00848549	build(deps): bump to Rust 1.89 (#10208 ) Rust 1.89 comes with a new lint that wants us to explicitly refer to lifetimes, even if they are elided.	2025-08-18 05:04:55 +00:00
Thomas Eizinger	507a8957c2	chore(connlib): only debug-assert non-retransmitted DNS queries (#10136 ) When we receive the same TCP DNS query twice, we currently wrongly hit a debug assert.	2025-08-06 11:26:51 +00:00
Thomas Eizinger	2841fd0017	chore(connlib): spawn dedicated tasks for UDP send/recv (#10147 ) At the moment, `connlib`'s UDP thread spawns a single task for reading and writing to the UDP socket. It will always first try to write data before reading new data. To avoid scheduling issues, we split this into two dedicated tasks and insert ```rust tokio::task::yield_now().await; ``` into each loop. This allows the `tokio` runtime to schedule each of the tasks fairly even if one of them is very busy. For example, if we are very busy writing data (because we are receiving a lot of IP traffic), this ensures that we will occasionally also read from our socket to receive STUN control messages from our peers.	2025-08-06 07:38:01 +00:00
Thomas Eizinger	3e46727362	chore(snownet): improve logging of boringtun session index (#10135 ) Previously, boringtun's sender/receiver index of a session would just be rendered as a full u32. In reality, this u32 contains two pieces of information: The higher 24 bits identify the peer and the lower 8 bits identify the session with that peer. With the update to boringtun in https://github.com/firezone/boringtun/pull/112, we encode this logic in a dedicated type that prints this information separately. Here is what the logs now look like: ``` 2025-08-05T07:38:37.742Z DEBUG boringtun::noise: Received handshake_response local_idx=(3428714|1) remote_idx=(1937676|1) 2025-08-05T07:38:37.743Z DEBUG boringtun::noise: New session idx=(3428714|1) 2025-08-05T07:38:37.743Z DEBUG boringtun::noise: Sending keepalive local_idx=(3428714|1) ```	2025-08-05 13:08:32 +00:00
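The index layout described in #10135 (upper 24 bits for the peer, lower 8 bits for the session) can be sketched as follows. The `Index` type and its methods are hypothetical names chosen for illustration and are not taken from boringtun; only the bit split and the `(peer|session)` rendering come from the commit message.

```rust
use std::fmt;

/// Hypothetical wrapper around a 32-bit session index.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Index(u32);

impl Index {
    /// Pack a peer id (truncated to 24 bits) and a session counter into one u32.
    fn from_parts(peer: u32, session: u8) -> Self {
        Index(((peer & 0x00FF_FFFF) << 8) | u32::from(session))
    }

    /// Upper 24 bits: identifies the peer.
    fn peer(self) -> u32 {
        self.0 >> 8
    }

    /// Lower 8 bits: identifies the session with that peer.
    fn session(self) -> u8 {
        (self.0 & 0xFF) as u8
    }
}

impl fmt::Display for Index {
    /// Render as `(peer|session)` instead of one opaque number.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "({}|{})", self.peer(), self.session())
    }
}

fn main() {
    let idx = Index::from_parts(3428714, 1);
    println!("local_idx={idx}"); // prints "local_idx=(3428714|1)"
}
```

Printing the two parts separately makes it easy to see in logs when a new session is negotiated with the same peer, since only the number after the `|` changes.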
Thomas Eizinger	96579483d8	fix(phoenix-channel): timeout room join after 5s (#10130 ) If we fail to join a given room for longer than 5s, we fail the WebSocket connection and reconnect.	2025-08-05 02:00:26 +00:00
Thomas Eizinger	d1cbf4f76d	chore(snownet): fix relay sampling spam (#10127 ) When we disconnect from a relay, we currently spam `Failed to sample new relay for connection` until we connect to a new one.	2025-08-05 00:16:28 +00:00

1236 Commits