This PR eliminates JSON-based communication across the FFI boundary,
replacing it with proper
uniffi-generated types for improved type safety, performance, and
reliability. We replace JSON string parameters with native uniffi types
for:
- Resources (DNS, CIDR, Internet)
- Device information
- DNS server lists
- Network routes (CIDR representation)
Also, we get rid of JSON serialisation in the Swift client IPC in favour of
PropertyList-based serialisation.
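As a rough sketch (the type and field names below are illustrative, not the actual `client-ffi` definitions), uniffi lets us annotate plain Rust types and generate strongly-typed Kotlin/Swift counterparts for them instead of shuttling JSON strings across the boundary:

```rust
// Illustrative only: uniffi's derive macros generate the FFI glue for these
// types, so the host app receives structured values instead of JSON strings.
// Assumes the usual uniffi proc-macro setup (`uniffi::setup_scaffolding!()`).

#[derive(uniffi::Record)]
pub struct DnsServer {
    pub address: String,
    pub port: u16,
}

#[derive(uniffi::Enum)]
pub enum Resource {
    Dns { id: String, name: String, address: String },
    Cidr { id: String, name: String, address: String },
    Internet { id: String },
}
```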
Fixes: https://github.com/firezone/firezone/issues/9548
---------
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
Building on top of #10507, setting the initial Internet Resource state
is a piece of cake. All we need to do is thread a boolean variable
through to all call-sites of `Session::connect`. Without the need for
the Internet Resource's ID, we can simply pass in the boolean that is
saved in the configuration of each client.
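In sketch form (the real `Session::connect` signature is not shown here; the one below is purely illustrative), the change amounts to one extra flag:

```rust
pub struct Session {
    // handle to the spawned event-loop, details elided
}

impl Session {
    // Hypothetical, trimmed-down signature: the initial Internet Resource
    // state travels as a plain bool read from each client's configuration,
    // instead of requiring the Internet Resource's ID.
    pub fn connect(api_url: String, token: String, internet_resource_enabled: bool) -> Self {
        let _ = (api_url, token, internet_resource_enabled);
        Session {}
    }
}
```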
Resolves: #10255
Instead of the generic "disable any kind of resource"-functionality that
connlib currently exposes, we now provide an API to only enable /
disable the Internet Resource. This is a lot simpler to deal with and
reason about than the previous system, especially when it comes to the
proptests. Those need to model connlib's behaviour correctly across its
entire API surface, which makes them unnecessarily complex if we only
ever use the `set_disabled_resources` API with a single resource.
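A sketch of the narrowed API surface (the method name is an assumption, not necessarily what connlib calls it):

```rust
// Sketch: instead of a generic `set_disabled_resources(BTreeSet<ResourceId>)`,
// connlib only needs a single toggle for the Internet Resource, which is much
// easier to model in the proptests.
pub struct ClientState {
    internet_resource_enabled: bool,
}

impl ClientState {
    pub fn set_internet_resource_enabled(&mut self, enabled: bool) {
        self.internet_resource_enabled = enabled;
    }
}
```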
In preparation for #4789, I want to extend the proptests to cover
traffic filters (#7126). This will make them a fair bit more
complicated, so any prior removal of complexity is appreciated.
Simplifying the implementation here is also a good starting point to fix
#10255. Not implicitly enabling the Internet Resource when it gets added
should be quite simple after this change.
Finally, resolving #8885 should also be quite easy. We just need to
store the state of the Internet Resource once per API URL instead of
globally.
Resolves: #8404
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
In #10347, we made sure that we always return all errors that happen
during a single tick of the event-loop. What we overlooked is that as
part of handling the errors, we need to use `continue` to jump to the
next one instead of returning directly from the function.
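A simplified sketch of the fix (the actual error types and logging differ):

```rust
// Sketch: handle every error of a tick; `continue` moves on to the next error
// instead of returning from the event-loop function early.
fn handle_tick_errors(errors: Vec<std::io::Error>) {
    for error in errors {
        if error.kind() == std::io::ErrorKind::WouldBlock {
            continue; // benign, skip to the next error rather than returning
        }

        tracing::warn!("Error during event-loop tick: {error}");
    }
}
```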
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
In #10076, connlib gained the ability to gracefully close connections
between peers. The Gateway already uses this when it is being gracefully
shut down, such as during an upgrade. This allows Clients to immediately
fail-over to a different Gateway instead of waiting for an ICE timeout.
When a Client signs out, we currently just drop all the state, resulting
in an ICE timeout on the Gateway ~15 seconds later. This makes it
difficult for us to analyze whether an ICE timeout in the logs represents
an actual problem where a network connection got cut or whether the
Client simply signed out.
Whilst not water-tight, attempting to gracefully close our connections
when the Client signs out is better than nothing, so we implement it
here.
All Clients use the `Session` abstraction from `client-shared` which
spawns the event-loop into a dedicated task.
- For the Linux and Windows GUI client, the already present tokio
runtime instance of the tunnel service is used for this.
- For Android and Apple, we create a dedicated, single-threaded runtime
instance for connlib.
- For the headless client, we also reuse the already existing tokio
runtime instance of the binary.
In the case of Android, Apple and the headless client, this means we need to
ensure the tokio runtime instance stays alive long enough to actually
complete the graceful shutdown task. We achieve this by draining the
`EventStream` returned from `Session`. The `EventStream` is a wrapper
around a channel connected to the event-loop. This stream only finishes
once the event-loop is entirely dropped (and therefore completed the
graceful shutdown) as it holds the sender-end of the channel.
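A minimal sketch of that draining step (the `Stream`-based signature is assumed from the description above):

```rust
use futures::{Stream, StreamExt as _};

// Sketch: keep the runtime alive until the event-loop has fully shut down.
// The event-loop holds the sender-end of the channel backing `EventStream`,
// so the stream only ends once the event-loop is dropped, i.e. once the
// graceful shutdown has completed.
async fn drain_events<S: Stream + Unpin>(mut events: S) {
    while events.next().await.is_some() {
        // Discard remaining events; we only wait for the stream to finish.
    }
}
```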
In the case of the Linux and Windows GUI clients, the runtime outlives the
`Session` because it is scoped to the entire tunnel process. Therefore,
no additional measures are necessary there to ensure the graceful
shutdown task completes.
A `Future` in Rust should not be polled once it has been completed as
that may lead to panics or otherwise undesirable behaviour. To avoid
this, a `Future` can be `fuse`d which will make it return
`Poll::Pending` indefinitely after it has returned `Ready`.
We have received several Sentry alerts of poll-after-completion panics
that I believe are all stemming from this particular code.
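A minimal demonstration of the mechanism (not the affected Firezone code itself):

```rust
use futures::future::FutureExt as _;
use futures::pin_mut;
use std::future::Future;
use std::task::{Context, Poll};

fn poll_after_completion_is_safe() {
    let waker = futures::task::noop_waker();
    let mut cx = Context::from_waker(&waker);

    // Polling a plain future again after it returned `Ready` may panic.
    // `fuse()` wraps it so that any further poll simply yields `Pending`.
    let fut = async { 42 }.fuse();
    pin_mut!(fut);

    assert_eq!(fut.as_mut().poll(&mut cx), Poll::Ready(42));
    assert_eq!(fut.as_mut().poll(&mut cx), Poll::Pending);
}
```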
The event-loop inside `Tunnel` processes input according to a certain
priority. We only take input from lower priority sources when the higher
priority sources are not ready. The current priorities are:
- Flush all buffers
- Read from UDP sockets
- Read from TUN device
- Read from DNS servers
- Process recursive DNS queries
- Check timeout
The idea of this priority ordering is to keep all kinds of processing
bounded and "finish" any ongoing work before taking on
new work. Anything that sits in a buffer is basically done with
processing and just needs to be written out to the network / device.
Arriving UDP packets have already traversed the network and been
encrypted on the other end, meaning they are higher priority than
reading from the TUN device. Packets from the TUN device still need to
be encrypted and sent to the remote.
Whilst there is merit in this design, it also risks starving input
sources further down the list if the top ones are extremely busy.
To prevent this, we refactor `Io` to read from all input sources and
present it to the event-loop as a batch, allowing all sources to make
progress before looping around. Since this event-loop was first
conceived, we have refactored `Io` to use background threads for the UDP
sockets and TUN device, meaning they will make progress by themselves
anyway until the channels to the main-thread fill up. As such, there
shouldn't be any latency increase in processing packets even though we
are performing slightly more work per event-loop tick.
This kind of batch-processing highlights a problem: Bailing out with an
error midway through processing a batch leaves the remainder of the
batch unprocessed, essentially dropping packets. To fix this, we
introduce a new `TunnelError` type that presents a collection of errors
that we encountered while processing the batch. This might actually also
be a problem with what is currently in `main`: we already batch-process
packets there but may bail out midway through the batch.
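A sketch of the batch-error idea (the real `TunnelError` will differ in its fields and the error types it wraps):

```rust
// Sketch: collect every error encountered while working through a batch
// instead of bailing out on the first one and dropping the rest of the batch.
#[derive(Debug)]
pub struct TunnelError {
    pub errors: Vec<std::io::Error>,
}

fn process_batch(packets: Vec<Vec<u8>>) -> Result<(), TunnelError> {
    let mut errors = Vec::new();

    for packet in packets {
        if let Err(e) = process_one(&packet) {
            errors.push(e); // remember the error but keep processing the batch
        }
    }

    if errors.is_empty() {
        Ok(())
    } else {
        Err(TunnelError { errors })
    }
}

fn process_one(_packet: &[u8]) -> std::io::Result<()> {
    Ok(()) // placeholder for encrypt-and-send / decrypt-and-forward work
}
```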
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Mariusz Klochowicz <mariusz@klochowicz.com>
In earlier versions of Firezone, the WebSocket protocol with the portal
used the request-response semantics built into Phoenix. This, however,
is quite cumbersome to work with due to the polymorphic nature of the
protocol design.
We ended up moving away from it and instead only use one-way messages
where each event directly corresponds to a message type. However, we
never removed the reply-message capability from the `phoenix-channel`
module; instead, all usages just set the reply type to `()`.
We can simplify the code here by always setting it to `()`.
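In sketch form (the generic parameter shown is an assumption about the old shape), the simplification removes a type parameter that every call site already pinned to `()`:

```rust
// Sketch of the old shape: a reply type that could, in principle, carry
// request-response payloads.
pub struct OldPhoenixChannel<TReply> {
    _marker: std::marker::PhantomData<TReply>,
}

// Sketch of the new shape: replies are always `()`, so the generic disappears
// and every message is a one-way event.
pub struct PhoenixChannel {
    // ...
}
```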
Resolves: #7091
When we receive a DNS query for a DNS resource in Firezone, we take the
next available 4 IPs from the CG-NAT range and assign them to the domain
name. For example, if `example.com` is a DNS resource and it is the
first resource being queried in a Firezone session, we will assign the
IPs `100.96.0.1` - `100.96.0.4` to it. If the user now restarts Firezone
or signs out and back in, this state is lost and we assign those same
IPs to the next DNS query coming in.
This creates a problem for applications that re-query DNS rarely or
never. They expect these IPs to not change. Restarting software
or signing out and back in is a common approach to fixing software
problems, yet in this specific case, doing so may create even more
problems for the user.
To mitigate this, `ClientState` introduces a new event
`DnsRecordsChanged` that gets emitted to the event-loop every time we
assign new records. The event-loop then caches this in memory and reuses
it in case a new session is initiated. The records are only stored
in-memory and not on disk. Most likely, the tunnel process will be alive
for the entire OS session.
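A sketch of that flow (event and field names follow the description above but are otherwise assumptions):

```rust
use std::collections::BTreeMap;
use std::net::IpAddr;

// Sketch: connlib emits the full set of assigned records whenever it changes;
// the event-loop caches them in memory and re-seeds a fresh `ClientState`
// with them when a new session starts.
pub enum ClientEvent {
    DnsRecordsChanged {
        records: BTreeMap<String, Vec<IpAddr>>, // domain -> assigned CG-NAT IPs
    },
    // ... other events elided
}

#[derive(Default)]
pub struct EventLoop {
    cached_records: BTreeMap<String, Vec<IpAddr>>,
}

impl EventLoop {
    pub fn handle(&mut self, event: ClientEvent) {
        match event {
            ClientEvent::DnsRecordsChanged { records } => self.cached_records = records,
        }
    }
}
```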
To verify this behaviour, we add a new `RestartClient` transition to our
proptests. In the proptests, we already keep a mapping of all DNS names
we ever resolved, including DNS resources. When generating IP traffic,
we sample from this list of IPs and then expect the packet to be routed.
By replacing the `ClientState` as part of this transition and re-seeding
it with the previously exported DNS records, we can verify that packets
to IPs resolved from a previous session still get successfully routed to
the resource.
Related: #5498
Right now, connections cannot be actively closed in Firezone. The
WireGuard tunnel and the ICE agent are coupled together, meaning only if
either one of them fails will we clean up the connection. One exception
here is when the Client roams. In that case, the Client simply clears
its local memory completely and then re-establishes all necessary
connections by re-requesting access.
There are three cases where gracefully closing a connection is useful:
1. If an access authorization is revoked or expires and this was the
last resource authorisation for that peer, we don't currently remove the
connection on the Gateway. Instead, the Client is still able to send
packets, but they'll be dropped because we don't have any peer state
anymore.
1. If a Gateway gets restarted due to e.g. an upgrade or other
maintenance work, it loses all its connections and every Client needs to
wait for the ICE timeout (~15 seconds) before it can establish a new
one.
1. If a Client has its access revoked for all resources it has access to
in a particular site, we also don't remove this connection, even though
it has become practically useless.
All of these cases are fixed with this PR. Here we introduce a way to
gracefully shut down a connection without forcing the other side into an
ICE timeout. The graceful connection shutdown works by introducing a new
"goodbye" p2p control protocol message. Like all our p2p control
protocol messages, this is based on IP and therefore delivery is not
guaranteed. In other words, this "goodbye" message is sent on a
best-effort basis.
In the case of shutdown, the Gateway will wait for all UDP packets to be
flushed but will not resend them or wait for an ACK.
If either end receives such a "goodbye" message, they simply remove the
local peer and connection state just as if the connection would have
failed due to either ICE or WireGuard. For the Client, this means that
the next packet for a resource will trigger a new access authorization
request.
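A sketch of the receiving side (message and type names are assumptions; the wire encoding of the p2p control protocol is not shown):

```rust
// Sketch: "goodbye" is just another best-effort p2p control message carried
// over IP, so delivery is not guaranteed.
pub enum P2pControlMessage {
    Goodbye,
    // ... other control messages elided
}

pub struct PeerState;

pub fn handle_control_message(msg: P2pControlMessage, peers: &mut Vec<PeerState>, peer_idx: usize) {
    match msg {
        P2pControlMessage::Goodbye => {
            // Drop local peer and connection state right away, just as if the
            // connection had failed via ICE or WireGuard. On the Client, the
            // next packet for a resource triggers a new access authorization.
            peers.remove(peer_idx);
        }
    }
}
```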
Right now, the Client event-loops have a channel with a capacity of 1000
items for sending new resource lists and TUN device updates to the host
app. This is unnecessary, as we only ever care about the latest version
of these. Intermediate updates that the host app doesn't process are
effectively irrelevant.
We've had an issue before where a bug in the portal caused us to receive
many resource updates, which ended up crashing Client apps because
this channel filled up.
To be more resilient on this front, we refactor the Client event loop to
use a `watch` channel for this. Watch channels only retain the last
value that got sent into them.
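A minimal sketch of the `watch` semantics (the `ResourceList` type here is a stand-in):

```rust
use tokio::sync::watch;

// Stand-in for the real resource list type.
#[derive(Clone, Default)]
struct ResourceList(Vec<String>);

#[tokio::main]
async fn main() {
    // A watch channel keeps only the most recently sent value, so a burst of
    // updates can never fill up a queue and crash the host app.
    let (tx, mut rx) = watch::channel(ResourceList::default());

    tx.send(ResourceList(vec!["resource-1".to_string()])).unwrap();
    tx.send(ResourceList(vec!["resource-2".to_string()])).unwrap(); // replaces the previous value

    rx.changed().await.unwrap();
    assert_eq!(rx.borrow_and_update().0, vec!["resource-2".to_string()]);
}
```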
We cannot poll the `PhoenixChannel` after it has returned an error,
otherwise it will panic. Therefore, we exit the event-loop at that
point. The outer event-loop also exits as soon as it receives an error
from this channel, so this is fine.
`PhoenixChannel` only returns an error when it has irrecoverably
disconnected, e.g. after the retries have been exhausted or we hit a 4xx
error on the WebSocket connection.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Currently, `connlib`'s event-loop for clients uses manual polling to
advance the state of the tunnel and the phoenix-channel. Manual polling
is powerful but also easy to get wrong, resulting in task-wakeup bugs.
Additionally, if the tunnel is very busy with processing packets, the
phoenix-channel may not get enough CPU time, resulting in a loss of the
WebSocket connection.
To fix this, we move the phoenix-channel to a separate task and use
channels to connect it with `connlib`'s main event-loop. The latter is now
primarily focused on advancing the tunnel state, effectively offloading
the problem of fair scheduling to the tokio runtime.
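A sketch of the resulting task layout (channel capacities, message types and names are assumptions):

```rust
use tokio::sync::mpsc;

// Stand-ins for the real message types flowing between the two tasks.
pub struct PortalMessage;
pub struct OutboundMessage;

// Sketch: the phoenix-channel gets its own task, connected to the main
// event-loop via channels. Fair scheduling between the tunnel and the
// WebSocket is now the tokio runtime's problem, not ours.
pub fn spawn_portal_task(
    runtime: &tokio::runtime::Handle,
) -> (mpsc::Sender<OutboundMessage>, mpsc::Receiver<PortalMessage>) {
    let (inbound_tx, inbound_rx) = mpsc::channel::<PortalMessage>(16);
    let (outbound_tx, mut outbound_rx) = mpsc::channel::<OutboundMessage>(16);

    runtime.spawn(async move {
        while let Some(_msg) = outbound_rx.recv().await {
            // Forward `_msg` over the WebSocket here; anything received from
            // the portal would be sent to the main event-loop via `inbound_tx`.
            let _ = inbound_tx.send(PortalMessage).await;
        }
    });

    (outbound_tx, inbound_rx)
}
```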
Related: #10003
When filtering through logs in Sentry, it is useful to narrow them down
by the context of a Client, Gateway or Resource. Currently, these fields
are variously called `client`, `cid`, `client_id` etc., and the same
goes for Gateways and Resources.
To make this filtering easier, we name all of them `cid` for Client IDs,
`gid` for Gateway IDs and `rid` for Resource IDs.
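For example (a sketch, not an actual call site):

```rust
// Sketch: consistent field names make it easy to filter logs in Sentry by
// Client, Gateway or Resource.
fn log_connection_intent(client_id: &str, gateway_id: &str, resource_id: &str) {
    tracing::info!(cid = %client_id, gid = %gateway_id, rid = %resource_id, "Setting up connection");
}
```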
These appear to happen on systems that e.g. don't have IPv6 support or
where the destination cannot be reached. It is a bit of a catch-all but
all the ones I am seeing in Sentry are false-positives. To reduce the
noise a bit, we log these on DEBUG now.
In Docker environments, applying iptables rules to filter
container-to-container traffic on the Docker bridged network is not
reliable, leading to direct connections being established in our relayed
tests. To fix this, we insert the rules directly from the client
container itself.
---------
Co-authored-by: Jamil Bou Kheir <jamilbk@users.noreply.github.com>
Certain UNIX systems such as macOS also use the EHOSTDOWN error to
signal that a packet cannot be sent to a certain IP. There is nothing we
can do about this error, so we downgrade it from a WARN to a DEBUG, like
we do for other kinds of "unreachable" errors.
A customer hit what seems to be a rare race condition where we try to
connect whilst we already have a session. I don't know which state it is
in, so I am replacing it with a WARN log to learn more about this in
Sentry in case it gets hit again.
To make our FFI layer between Android and Rust safer, we adopt the
UniFFI tool from Mozilla. UniFFI allows us to create a dedicated crate
(here `client-ffi`) that contains Rust structs annotated with various
attributes. These macros then generate code at compile time that is
built into the shared object. Using a dedicated CLI from the UniFFI
project, we can then generate Kotlin bindings from this shared object.
The primary motivation for this effort is memory safety across the FFI
boundary. Most importantly, we want to ensure that:
- The session pointer is not used after it has been free'd
- Disconnecting the session frees the pointer
- Freeing the session does not happen as part of a callback as that
triggers a cyclic dependency on the Rust side (callbacks are executed on
a runtime and that runtime is dropped as part of dropping the session)
To achieve all of these goals, we move away from callbacks altogether.
UniFFI has great support for async functions. We leverage this support
to expose a `suspend fun` to Android that returns `Event`s. These events
map to the current callback functions. Internally, these events are read
from a channel with a capacity of 1000 events. It is therefore not very
time-critical that the app reads from this channel. `connlib` will
happily continue even if the channel is full. 1000 events should be more
than sufficient though in case the host app cannot immediately process
them. We don't send events very often after all.
This event-based design has a major advantage: it allows us to make use
of `AutoCloseable` on the Kotlin side, meaning the `session` pointer is
only ever accessed as part of a `use` block and automatically closed
(and therefore free'd) at the end of the block.
To communicate with the session, we introduce a `TunnelCommand` which
represents all actions that the host app can send to `connlib`. These
are passed through a channel to the `suspend fun`, which continuously
listens for events and commands.
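A sketch of the shape of this API (names and payloads are simplified; the real definitions live in `client-ffi` and carry uniffi annotations):

```rust
use tokio::sync::mpsc;

// Sketch only: simplified event and command types. The real ones map to the
// former callback functions and are annotated for uniffi so Kotlin bindings
// can be generated for them.
pub enum Event {
    TunInterfaceUpdated { ipv4: String, ipv6: String },
    ResourcesUpdated { resources: Vec<String> },
    Disconnected { reason: String },
}

pub enum TunnelCommand {
    SetInternetResourceEnabled(bool),
    Disconnect,
}

pub struct Session {
    // Bounded to 1000 events; connlib keeps running even if it fills up.
    events: mpsc::Receiver<Event>,
    commands: mpsc::Sender<TunnelCommand>,
}

impl Session {
    // Surfaced to Kotlin as a `suspend fun` via uniffi's async support.
    pub async fn next_event(&mut self) -> Option<Event> {
        self.events.recv().await
    }

    pub async fn send_command(&self, command: TunnelCommand) {
        let _ = self.commands.send(command).await;
    }
}
```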
Resolves: #9499
Related: #3959
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil Bou Kheir <jamilbk@users.noreply.github.com>
These don't happen very often, so they are safe to log on INFO. That is
the default log level, and it is useful to see why we are re-connecting
to the portal.
As a relic of very early designs of `connlib`, the `Callbacks` trait is
still present and defines how the host app receives events from a
running `Session`. Callbacks are not a great design pattern, however,
because they force the running code, i.e. `connlib`'s event-loop, to
execute unknown code. For example, if that code panics, all of `connlib`
is taken down. Additionally, not all consumers may want to receive
events via callbacks. The GUI and headless clients, for example, already
have their own event-loop in which they process all kinds of things.
Having to deal with the `Callbacks` interface introduces an odd
indirection here.
To fix this, we instead return an `EventStream` when constructing a
`Session`. This essentially aligns the API of `Session` with that of a
channel. You receive two handles, one for sending in commands and one
for receiving events. A `Session` will automatically spawn itself onto
the given runtime, so progress is made even if one does not poll these
channel handles.
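A sketch of the resulting API shape (names other than `Session` and `EventStream` are assumptions):

```rust
use tokio::sync::mpsc;

pub enum Command { Disconnect }
pub enum Event { Disconnected }

// Sketch: connecting yields two channel-like handles. The session spawns its
// event-loop onto the given runtime, so it makes progress even if neither
// handle is being polled.
pub struct Session {
    commands: mpsc::Sender<Command>,
}

pub struct EventStream {
    events: mpsc::Receiver<Event>,
}

impl EventStream {
    pub async fn next(&mut self) -> Option<Event> {
        self.events.recv().await
    }
}

// Host apps fold the events directly into their own event-loop instead of
// implementing a `Callbacks` trait.
pub async fn run(mut events: EventStream) {
    while let Some(event) = events.next().await {
        match event {
            Event::Disconnected => break,
        }
    }
}
```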
This greatly simplifies the code:
- We get to delete the `Callbacks` interface.
- We can delete the threaded callback adapter. This was only necessary
because we didn't want to block `connlib` with the handling of the
event. By using a channel for events, this is automatically guaranteed.
- The GUI and headless client can directly integrate the event handling
in their event-loop, without having to create an indirection with a
channel.
- It is now clear that only the Apple and Android FFI layers actually
use callbacks to communicate these events.
- We net-delete 100 LoC
The name IPC service is not very descriptive. By nature of being
separate processes, we need to use IPC to communicate between them. The
important thing is that the service process has control over the tunnel.
Therefore, we rename everything to "Tunnel service".
The only exception is historic changelog entries, which are left unchanged.
Resolves: #9048
The current `rust/` directory is a bit of a wild-west in terms of how
the crates are organised. Most of them simply sit at the top level when,
in reality, they are all `connlib`-related. The Apple and Android FFI
crates - which are entrypoints into the Rust code - are defined several
layers deep.
To improve the situation, we move around and rename several crates. The
end result is that all top-level crates / directories are:
- Either entrypoints into the Rust code, i.e. applications such as
Gateway, Relay or a Client
- Or crates shared across all those entrypoints, such as `telemetry` or
`logging`