firezone

mirror of https://github.com/outbackdingo/firezone.git synced 2026-01-28 10:18:51 +00:00

Author	SHA1	Message	Date
Thomas Eizinger	0c2e54f54c	feat(connlib): persistent DNS resource records across sessions (#10104 ) When we receive a DNS query for a DNS resource in Firezone, we take the next available 4 IPs from the CG-NAT range and assign them to the domain name. For example, if `example.com` is a DNS resource and it is the first resource being queried in a Firezone session, we will assign the IPs `100.96.0.1` - `100.96.0.4` to it. If the user now restarts Firezone or signs out and back in, this state is lost and we assign those same IPs to the next DNS query coming in. This creates a problem for applications that re-query DNS rarely or never. They expect these IPs to not change. Restarting software or signing out and back in is a common approach to fixing software problems, yet in this specific case, doing so may create even more problems for the user. To mitigate this, `ClientState` introduces a new event `DnsRecordsChanged` that gets emitted to the event-loop every time we assign new records. The event-loop then caches this in memory and reuses it in case a new session is initiated. The records are only stored in-memory and not on disk. Most likely, the tunnel process will be alive for the entire OS session. To verify this behaviour, we add a new `RestartClient` transition to our proptests. In the proptests, we already keep a mapping of all DNS names we ever resolved, including DNS resources. When generating IP traffic, we sample from this list of IPs and then expect the packet to be routed. By replacing the `ClientState` as part of this transition and re-seeding it with the previously exported DNS records, we can verify that packets to IPs resolved from a previous session still get successfully routed to the resource. Related: #5498	2025-09-01 07:29:28 +00:00
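The allocation and re-seeding behaviour described in this commit can be sketched roughly as follows. This is a minimal illustration only: `DnsIpAllocator`, `resolve`, `export` and `with_records` are hypothetical names, not connlib's actual `ClientState` API, and the sketch assumes IPs are handed out sequentially starting at `100.96.0.1`, as in the commit's example.

```rust
use std::collections::BTreeMap;
use std::net::Ipv4Addr;

/// Number of proxy IPs per DNS resource (taken from the commit message).
const IPS_PER_DOMAIN: u32 = 4;

/// Hypothetical allocator, not connlib's real `ClientState`.
#[derive(Debug, Default)]
pub struct DnsIpAllocator {
    records: BTreeMap<String, Vec<Ipv4Addr>>,
    next_offset: u32,
}

impl DnsIpAllocator {
    /// First IP handed out, matching the commit's example (100.96.0.1).
    const BASE: u32 = u32::from_be_bytes([100, 96, 0, 1]);

    /// Re-seed a fresh allocator with records exported by a previous session.
    pub fn with_records(records: BTreeMap<String, Vec<Ipv4Addr>>) -> Self {
        // Continue allocating after the highest IP that is already in use.
        let next_offset = records
            .values()
            .flatten()
            .map(|ip| u32::from(*ip).saturating_sub(Self::BASE) + 1)
            .max()
            .unwrap_or(0);
        Self { records, next_offset }
    }

    /// Returns the IPs for `domain` and whether new records were assigned
    /// (the point at which something like `DnsRecordsChanged` would fire).
    pub fn resolve(&mut self, domain: &str) -> (Vec<Ipv4Addr>, bool) {
        if let Some(ips) = self.records.get(domain) {
            return (ips.clone(), false);
        }
        let ips: Vec<Ipv4Addr> = (0..IPS_PER_DOMAIN)
            .map(|i| Ipv4Addr::from(Self::BASE + self.next_offset + i))
            .collect();
        self.next_offset += IPS_PER_DOMAIN;
        self.records.insert(domain.to_owned(), ips.clone());
        (ips, true)
    }

    /// Snapshot of all records, to be cached in memory by the caller.
    pub fn export(&self) -> BTreeMap<String, Vec<Ipv4Addr>> {
        self.records.clone()
    }
}
```

A caller would keep the result of `export()` in memory and pass it to `with_records` when a new session starts, so a previously resolved domain keeps its IPs and new domains receive the next free block.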
Thomas Eizinger	533f4c319b	feat(connlib): gracefully shutdown connections (#10076 ) Right now, connections cannot be actively closed in Firezone. The WireGuard tunnel and the ICE agent are coupled together, meaning only if either one of them fails will we clean up the connection. One exception here is when the Client roams. In that case, the Client simply clears its local memory completely and then re-establishes all necessary connections by re-requesting access. There are three cases where gracefully closing a connection is useful: 1. If an access authorization is revoked or expires and this was the last resource authorisation for that peer, we don't currently remove the connection on the Gateway. Instead, the Client is still able to send packets but they'll be dropped because we don't have a peer state anymore. 2. If a Gateway gets restarted due to e.g. an upgrade or other maintenance work, it loses all its connections and every Client needs to wait for the ICE timeout (~15 seconds) before it can establish a new one. 3. If a Client has its access revoked for all resources it has access to in a particular site we also don't remove this connection, even though it has become practically useless. All of these cases are fixed with this PR. Here we introduce a way to gracefully shut down a connection without forcing the other side into an ICE timeout. The graceful connection shutdown works by introducing a new "goodbye" p2p control protocol message. Like all our p2p control protocol messages, this is based on IP and therefore delivery is not guaranteed. In other words, this "goodbye" message is sent on a best-effort basis. In the case of shutdown, the Gateway will wait for all UDP packets to be flushed but will not resend them or wait for an ACK. If either end receives such a "goodbye" message, they simply remove the local peer and connection state just as if the connection had failed due to either ICE or WireGuard. 
For the Client, this means that the next packet for a resource will trigger a new access authorization request.	2025-09-01 06:30:13 +00:00
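The Client-side effect of a "goodbye" message can be sketched as below. All names (`Client`, `ControlMessage`, `Action`, the `u64` gateway ID) are hypothetical and chosen for illustration; the real connlib state machine is considerably more involved.

```rust
use std::collections::HashSet;

/// Hypothetical p2p control message; only the "goodbye" variant is modelled.
#[derive(Debug, PartialEq)]
pub enum ControlMessage {
    Goodbye,
}

/// What the client does with an outgoing packet (illustrative only).
#[derive(Debug, PartialEq)]
pub enum Action {
    Forward,
    RequestAccess,
}

/// Hypothetical client state: the set of gateways we have a connection to.
#[derive(Default)]
pub struct Client {
    connections: HashSet<u64>,
}

impl Client {
    pub fn on_access_granted(&mut self, gateway: u64) {
        self.connections.insert(gateway);
    }

    /// On "goodbye", drop local state as if the connection had failed.
    pub fn on_control_message(&mut self, gateway: u64, msg: ControlMessage) {
        match msg {
            ControlMessage::Goodbye => {
                self.connections.remove(&gateway);
            }
        }
    }

    /// Without a connection, the next packet triggers a new access request.
    pub fn on_packet(&self, gateway: u64) -> Action {
        if self.connections.contains(&gateway) {
            Action::Forward
        } else {
            Action::RequestAccess
        }
    }
}
```

Because the message is best-effort, a lost "goodbye" simply means the peer falls back to the existing ICE timeout path.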
Thomas Eizinger	a109c1a2ef	feat(connlib): discard intermediate resource and TUN updates (#10223 ) Right now, the Client event-loops have a channel with 1000 items for sending new resource lists and updates to the TUN device to the host app. This is kind of unnecessary as we always only care about the last version of these. Intermediate updates that the host app doesn't process are effectively irrelevant. We've had an issue before where a bug in the portal caused us to receive many updates to resources which ended up crashing Client apps because this channel filled up. To be more resilient on this front, we refactor the Client event loop to use a `watch` channel for this. Watch channels only retain the last value that got sent into them.	2025-08-21 05:42:54 +00:00
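The "only the last value is retained" semantics can be illustrated with a std-only stand-in. This is not the `watch` channel implementation the commit uses; `latest_channel`, `LatestSender` and `LatestReceiver` are hypothetical names for a sketch that only demonstrates why intermediate updates cannot pile up.

```rust
use std::sync::{Arc, Mutex};

/// Sending half: overwrites whatever value is currently stored.
pub struct LatestSender<T>(Arc<Mutex<Option<T>>>);

/// Receiving half: takes the most recent value, if any.
pub struct LatestReceiver<T>(Arc<Mutex<Option<T>>>);

/// Creates a single-slot channel that keeps only the latest value.
pub fn latest_channel<T>() -> (LatestSender<T>, LatestReceiver<T>) {
    let slot = Arc::new(Mutex::new(None));
    (LatestSender(Arc::clone(&slot)), LatestReceiver(slot))
}

impl<T> LatestSender<T> {
    /// Never blocks on capacity: an unread value is simply replaced.
    pub fn send(&self, value: T) {
        *self.0.lock().unwrap() = Some(value);
    }
}

impl<T> LatestReceiver<T> {
    /// Returns the latest unread value, or `None` if nothing new arrived.
    pub fn take(&self) -> Option<T> {
        self.0.lock().unwrap().take()
    }
}
```

With a bounded queue, a burst of resource-list updates can fill the buffer; with a single slot, a slow host app only ever sees the newest list.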
Thomas Eizinger	b4cbc4f33b	fix(connlib): exit phoenix-channel event-loop on error (#10229 ) We cannot poll the `PhoenixChannel` after it has returned an error, otherwise it will panic. Therefore, we exit the event-loop then. The outer event-loop also exits as soon as it receives an error from this channel so this is fine. `PhoenixChannel` only returns an error when it has irrecoverably disconnected, e.g. after the retries have been exhausted or we hit a 4xx error on the WebSocket connection. --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-08-21 03:25:46 +00:00
Thomas Eizinger	2545c41366	refactor(connlib): move client phoenix-channel to separate task (#10210 ) Currently, `connlib`'s event-loop for clients uses manual polling to advance the state of the tunnel and the phoenix-channel. Manual polling is powerful but also easy to get wrong, resulting in task-wakeup bugs. Additionally, if the tunnel is very busy with processing packets, the phoenix-channel may not get enough CPU time, resulting in a loss of the WebSocket connection. To fix this, we move the phoenix-channel to a separate task and use channels to connect it with `connlib`'s main event-loop. This one is now primarily focused on advancing the tunnel state, effectively offloading the problem of fair scheduling to the tokio runtime. Related: #10003	2025-08-18 07:20:57 +00:00
Thomas Eizinger	5c3b15c1a9	chore(connlib): harmonise naming of IDs (#10038 ) When filtering through logs in Sentry, it is useful to narrow them down by context of a client, gateway or resource. Currently, these fields are sometimes called `client`, `cid`, `client_id` etc and the same for the Gateway and Resources. To make this filtering easier, name all of them `cid` for Client IDs, `gid` for Gateway IDs and `rid` for Resource IDs.	2025-07-29 03:33:09 +00:00
Thomas Eizinger	e9c74b1bfe	chore(connlib): treat `Invalid Argument` as unreachable hosts (#10037 ) These appear to happen on systems that e.g. don't have IPv6 support or where the destination cannot be reached. It is a bit of a catch-all but all the ones I am seeing in Sentry are false-positives. To reduce the noise a bit, we log these on DEBUG now.	2025-07-29 03:04:13 +00:00
Thomas Eizinger	cf2470ba1e	test(iperf): install iptables rule inside of container (#9880 ) In Docker environments, applying iptables rules to filter container-container traffic on the Docker bridged network is not reliable, leading to direct connections being established in our relayed tests. To fix this, we insert the rules directly from the client container itself. --------- Co-authored-by: Jamil Bou Kheir <jamilbk@users.noreply.github.com>	2025-07-16 10:29:33 +00:00
Thomas Eizinger	5141817134	feat(connlib): add `reason` argument to `reset` API (#9878 ) To provide more detailed logs about why `connlib`'s network state is being reset, we add a `reason` parameter that gets logged. Resolves: #9867	2025-07-15 13:48:33 +00:00
Thomas Eizinger	a87485a4e0	chore(connlib): silence EHOSTDOWN errors (#9797 ) Certain UNIX systems such as macOS also use the EHOSTDOWN error to signal that a packet cannot be sent to a certain IP. There is nothing we can do about this error so we downgrade it from a WARN to a DEBUG like we do for other kinds of "unreachable" errors.	2025-07-07 22:52:39 +00:00
Thomas Eizinger	01e3fea0ac	fix(gui-client): don't panic on existing session (#9779 ) Customer hit what seems to be a rare race condition where we try to connect whilst we already have a session. I don't know which state it is in so I am replacing it with a WARN log to learn more about this in Sentry in case it gets hit again.	2025-07-04 07:44:53 +00:00
Thomas Eizinger	faeb958882	refactor: use UniFFI for Android FFI (#9415 ) To make our FFI layer between Android and Rust safer, we adopt the UniFFI tool from Mozilla. UniFFI allows us to create a dedicated crate (here `client-ffi`) that contains Rust structs annotated with various attributes. These macros then generate code at compile time that is built into the shared object. Using a dedicated CLI from the UniFFI project, we can then generate Kotlin bindings from this shared object. The primary motivation for this effort is memory safety across the FFI boundary. Most importantly, we want to ensure that: - The session pointer is not used after it has been free'd - Disconnecting the session frees the pointer - Freeing the session does not happen as part of a callback as that triggers a cyclic dependency on the Rust side (callbacks are executed on a runtime and that runtime is dropped as part of dropping the session) To achieve all of these goals, we move away from callbacks altogether. UniFFI has great support for async functions. We leverage this support to expose a `suspend fn` to Android that returns `Event`s. These events map to the current callback functions. Internally, these events are read from a channel with a capacity of 1000 events. It is therefore not very time-critical that the app reads from this channel. `connlib` will happily continue even if the channel is full. 1000 events should be more than sufficient though in case the host app cannot immediately process them. We don't send events very often after all. This event-based design has major advantages: It allows us to make use of `AutoCloseable` on the Kotlin side, meaning the `session` pointer is only ever accessed as part of a `use` block and automatically closed (and therefore free'd) at the end of the block. To communicate with the session, we introduce a `TunnelCommand` which represents all actions that the host app can send to `connlib`. 
These are passed through a channel to the `suspend fn` which continuously listens for events and commands. Resolves: #9499 Related: #3959 --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io> Co-authored-by: Jamil Bou Kheir <jamilbk@users.noreply.github.com>	2025-06-17 21:48:34 +00:00
Thomas Eizinger	cc50d58d8c	chore(client,gateway): log portal connection hiccups on INFO (#9557 ) These don't happen very often so are safe to log on INFO. That is the default log level and it is useful to see why we are re-connecting to the portal.	2025-06-17 14:01:34 +00:00
Thomas Eizinger	1914ea7076	refactor(rust): remove forced callback indirection (#9362 ) As a relic of very early designs of `connlib`, the `Callbacks` trait is still present and defines how the host app receives events from a running `Session`. Callbacks are not a great design pattern however because they force the running code, i.e. `connlib`'s event-loop, to execute unknown code. For example, if that code panics, all of `connlib` is taken down. Additionally, not all consumers may want to receive events via callbacks. The GUI and headless client for example already have their own event-loop in which they process all kinds of things. Having to deal with the `Callbacks` interface introduces an odd indirection here. To fix this, we instead return an `EventStream` when constructing a `Session`. This essentially aligns the API of `Session` with that of a channel. You receive two handles, one for sending in commands and one for receiving events. A `Session` will automatically spawn itself onto the given runtime so progress is made even if one does not poll on these channel handles. This greatly simplifies the code: - We get to delete the `Callbacks` interface. - We can delete the threaded callback adapter. This was only necessary because we didn't want to block `connlib` with the handling of the event. By using a channel for events, this is automatically guaranteed. - The GUI and headless client can directly integrate the event handling in their event-loop, without having to create an indirection with a channel. - It is now clear that only the Apple and Android FFI layers actually use callbacks to communicate these events. - We net-delete 100 LoC	2025-06-02 11:28:04 +00:00
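The "two handles" shape described in this commit can be sketched with standard-library channels and a thread in place of an async runtime. `connect`, `Command` and `Event` are hypothetical names for illustration and do not reflect the actual `Session` / `EventStream` signatures.

```rust
use std::sync::mpsc::{self, Receiver, Sender};
use std::thread;

/// Hypothetical commands the host app can send in.
#[derive(Debug, PartialEq)]
pub enum Command {
    Reset(String),
    Stop,
}

/// Hypothetical events the host app receives.
#[derive(Debug, PartialEq)]
pub enum Event {
    ResetDone(String),
    Disconnected,
}

/// Spawns the event-loop and returns a command handle plus an event handle.
/// The loop makes progress on its own thread, whether or not events are read.
pub fn connect() -> (Sender<Command>, Receiver<Event>) {
    let (cmd_tx, cmd_rx) = mpsc::channel();
    let (event_tx, event_rx) = mpsc::channel();

    thread::spawn(move || {
        // Runs until `Stop` arrives or every command sender is dropped.
        for cmd in cmd_rx {
            match cmd {
                Command::Reset(reason) => {
                    // A send error only means the host dropped its receiver.
                    let _ = event_tx.send(Event::ResetDone(reason));
                }
                Command::Stop => break,
            }
        }
        let _ = event_tx.send(Event::Disconnected);
    });

    (cmd_tx, event_rx)
}
```

Since the event-loop only pushes values into a channel, it never executes host-app code, so a panic in the consumer cannot take the loop down with it.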
Thomas Eizinger	1bdba3601a	feat(gui-client): rename IPC service to Tunnel service (#9154 ) The name IPC service is not very descriptive. By nature of being separate processes, we need to use IPC to communicate between them. The important thing is that the service process has control over the tunnel. Therefore, we rename everything to "Tunnel service". The only part that is not changed are historic changelog entries. Resolves: #9048	2025-05-19 09:52:06 +00:00
Thomas Eizinger	f965487739	chore(connlib): turn down logs for non-fatal IO errors (#9091 ) Signed-off-by: Thomas Eizinger <thomas@eizinger.io> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-05-12 11:48:40 +00:00
Thomas Eizinger	5566f1847f	refactor(rust): move crates into a more sensical hierarchy (#9066 ) The current `rust/` directory is a bit of a wild-west in terms of how the crates are organised. Most of them are simply at the top-level when in reality, they are all `connlib`-related. The Apple and Android FFI crates - which are entrypoints into the Rust code - are defined several layers deep. To improve the situation, we move around and rename several crates. The end result is that all top-level crates / directories are: - Either entrypoints into the Rust code, i.e. applications such as Gateway, Relay or a Client - Or crates shared across all those entrypoints, such as `telemetry` or `logging`	2025-05-12 01:04:17 +00:00

17 Commits