Commit Graph

14 Commits

Author SHA1 Message Date
Mariusz Klochowicz
e76daaaab3 refactor: remove JSON serialization from FFI boundary (#10575)
This PR eliminates JSON-based communication across the FFI boundary,
replacing it with proper
uniffi-generated types for improved type safety, performance, and
reliability. We replace JSON string parameters with native uniffi types
for:
 - Resources (DNS, CIDR, Internet)
 - Device information
 - DNS server lists
 - Network routes (CIDR representation)
 
Also, get rid of JSON serialisation in Swift client IPC in favour of
PropertyList based serialisation.
 
 Fixes: https://github.com/firezone/firezone/issues/9548

---------

Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
2025-10-16 05:15:31 +00:00
Thomas Eizinger
8fc2ef8ad1 fix(clients): set Internet Resource state on startup (#10509)
Building on top of #10507, setting the initial Internet Resource state
is a piece of cake. All we need to do is thread a boolean variable
through to all call-sites of `Session::connect`. Without the need for
the Internet Resource's ID, we can simply pass in the boolean that is
saved in the configuration of each client.

Resolves: #10255
2025-10-07 07:13:52 +00:00
Thomas Eizinger
36dfee2c42 refactor(connlib): explicitly enable/disable Internet Resource (#10507)
Instead of the generic "disable any kind of resource"-functionality that
connlib currently exposes, we now provide an API to only enable /
disable the Internet Resource. This is a lot simpler to deal with and
reason about than the previous system, especially when it comes to the
proptests. Those need to model connlib's behaviour correctly across its
entire API surface which makes them unnecessarily complex if we only
ever use the `set_disabled_resources` API with a single resource.

In preparation for #4789, I want to extend the proptests to cover
traffic filters (#7126). This will make them a fair bit more
complicated, so any prior removal of complexity is appreciated.

Simplifying the implementation here is also a good starting point to fix
#10255. Not implicitly enabling the Internet Resource when it gets added
should be quite simple after this change.

Finally, resolving #8885 should also be quite easy. We just need to
store the state of the Internet Resource once per API URL instead of
globally.

Resolves: #8404

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-10-07 00:26:07 +00:00
Thomas Eizinger
0310bafbcd feat(clients): gracefully close connections on shutdown (#10400)
In #10076, connlib gained the ability to gracefully close connections
between peers. The Gateway already uses this when it is being gracefully
shutdown such as during an upgrade. This allows Clients to immediately
fail-over to a different Gateway instead of waiting for an ICE timeout.

When a Client signs out, we currently just drop all the state, resulting
in an ICE timeout on the Gateway ~15 seconds later. This makes it
difficult for us to analyze, whether an ICE timeout in the logs presents
an actual problem where a network connection got cut or whether the
Client simply signed out.

Whilst not water-tight, attempting to gracefully close our connections
when the Client signs out is better than nothing so we implement this
here.

All Clients use the `Session` abstraction from `client-shared` which
spawns the event-loop into a dedicated task.

- For the Linux and Windows GUI client, the already present tokio
runtime instance of the tunnel service is used for this.
- For Android and Apple, we create a dedicated, single-threaded runtime
instance for connlib.
- For the headless client, we also reuse the already existing tokio
runtime instance of the binary.

In case of Android, Apple and the headless client, this means we need to
ensure the tokio runtime instances stays alive long enough to actually
complete the graceful shutdown task. We achieve this by draining the
`EventStream` returned from `Session`. The `EventStream` is a wrapper
around a channel connected to the event-loop. This stream only finishes
once the event-loop is entirely dropped (and therefore completed the
graceful shutdown) as it holds the sender-end of the channel.

In case of the Linux and Windows GUI client, the runtime outlives the
`Session` because it is scoped to the entire tunnel process. Therefore,
no additional measures are necessary there to ensure the graceful
shutdown task completes.
2025-09-23 03:40:52 +00:00
Thomas Eizinger
7c326e003e fix(connlib): fuse event-loop future inside client session (#10399)
A `Future` in Rust should not be polled once it has been completed as
that may lead to panics or otherwise undesirable behaviour. To avoid
this, a `Future` can be `fuse`d which will make it return
`Poll::Pending` indefinitely after it has returned `Ready`.

We have received several Sentry alerts of poll-after-completion panics
that I believe are all stemming from this particular code.
2025-09-20 12:35:32 +00:00
Thomas Eizinger
69afe71215 refactor(connlib): remove concept of "ReplyMessages" (#10361)
In earlier versions of Firezone, the WebSocket protocol with the portal
was using the request-response semantics built into Phoenix. This
however is quite cumbersome to work with to due to the polymorphic
nature of the protocol design.

We ended up moving away from it and instead only use one-way messages
where each event directly corresponds to a message type. However, we
have never removed the capability reply messages from the
`phoenix-channel` module, instead all usages just set it to `()`.

We can simplify the code here by always setting this to `()`.

Resolves: #7091
2025-09-17 04:10:56 +00:00
Thomas Eizinger
0c2e54f54c feat(connlib): persistent DNS resource records across sessions (#10104)
When we receive a DNS query for a DNS resource in Firezone, we take the
next available 4 IPs from the CG-NAT range and assign them to the domain
name. For example, if `example.com` is a DNS resource and it is the
first resource being queried in a Firezone session, we will assigned the
IPs `100.96.0.1` - `100.96.0.4` to it. If the user now restarts Firezone
or signs out and back in, this state is lost and we assign those same
IPs to the next DNS query coming in.

This creates a problem for applications that do not re-query DNS very
often or never. They expect these IPs to not change. Restarting software
or signing out and back in is a common approach to fixing software
problems, yet in this specific case, doing so may create even more
problems for the user.

To mitigate this, `ClientState` introduce a new event
`DnsRecordsChanged` that gets emitted to the event-loop every time we
assign new records. The event-loop then caches this in memory and reuses
it in case a new session is initiated. The records are only stored
in-memory and not on disk. Most likely, the tunnel process will be alive
for the entire OS session.

To verify this behaviour, we add a new `RestartClient` transition to our
proptests. In the proptests, we already keep a mapping of all DNS names
we ever resolved, including DNS resources. When generating IP traffic,
we sample from this list of IPs and then expect the packet to be routed.
By replacing the `ClientState` as part of this transition and re-seeding
it with the previously exported DNS records, we can verify that packets
to IPs resolved from a previous session still get successfully routed to
the resource.

Related: #5498
2025-09-01 07:29:28 +00:00
Thomas Eizinger
a109c1a2ef feat(connlib): discard intermediate resource and TUN updates (#10223)
Right now, the Client event-loops have a channel with 1000 items for
sending new resource lists and updates to the TUN device to the host
app. This is kind of unnecessary as we always only care about the last
version of these. Intermediate updates that the host app doesn't process
are effectively irrelevant.

We've had an issue before where a bug in the portal caused us to receive
many updates to resources which ended up crashing Client apps because
this channel filled up.

To be more resilient on this front, we refactor the Client event loop to
use a `watch` channel for this. Watch channels only retain the last
value that got sent into them.
2025-08-21 05:42:54 +00:00
Thomas Eizinger
2545c41366 refactor(connlib): move client phoenix-channel to separate task (#10210)
Currently, `connlib`'s event-loop for clients uses manual polling to
advance the state of the tunnel and the phoenix-channel. Manual polling
is powerful but also easy to get wrong, resulting in task-wakeup bugs.

Additionally, if the tunnel is very busy with processing packets, the
phoenix-channel may not get enough CPU time, resulting in a loss of the
WebSocket connection.

To fix this, we move the phoenix-channel to a separate task and use
channels to connect it with `connlib`'s main event-loop. This one is now
primarily focused on advancing the tunnel state, effectively offloading
the problem of fair scheduling to the tokio runtime.

Related: #10003
2025-08-18 07:20:57 +00:00
Thomas Eizinger
5141817134 feat(connlib): add reason argument to reset API (#9878)
In order to provide more detailed logs, why `connlib`'s network state is
being reset, we add a `reason` parameter that is gets logged.

Resolves: #9867
2025-07-15 13:48:33 +00:00
Thomas Eizinger
01e3fea0ac fix(gui-client): don't panic on existing session (#9779)
Customer hit what seems to be a rare race condition where we try to
connect whilst we already have a session. I don't know which state it is
in so I am replacing it with a WARN log to learn more about this in
Sentry in case it gets hit again.
2025-07-04 07:44:53 +00:00
Thomas Eizinger
faeb958882 refactor: use UniFFI for Android FFI (#9415)
To make our FFI layer between Android and Rust safer, we adopt the
UniFFI tool from Mozilla. UniFFI allows us to create a dedicated crate
(here `client-ffi`) that contains Rust structs annotated with various
attributes. These macros then generate code at compile time that is
built into the shared object. Using a dedicated CLI from the UniFFI
project, we can then generate Kotlin bindings from this shared object.

The primary motivation for this effort is memory safety across the FFI
boundary. Most importantly, we want to ensure that:

- The session pointer is not used after it has been free'd
- Disconnecting the session frees the pointer
- Freeing the session does not happen as part of a callback as that
triggers a cyclic dependency on the Rust side (callbacks are executed on
a runtime and that runtime is dropped as part of dropping the session)

To achieve all of these goals, we move away from callbacks altogether.
UniFFI has great support for async functions. We leverage this support
to expose a `suspend fn` to Android that returns `Event`s. These events
map to the current callback functions. Internally, these events are read
from a channel with a capacity of 1000 events. It is therefore not very
time-critical that the app reads from this channel. `connlib` will
happily continue even if the channel is full. 1000 events should be more
than sufficient though in case the host app cannot immediately process
them. We don't send events very often after all.

This event-based design has major advantages: It allows us to make use
of `AutoCloseable` on the Kotlin side, meaning the `session` pointer is
only ever accessed as part of a `use` block and automatically closed
(and therefore free'd) at the end of the block.

To communicate with the session, we introduce a `TunnelCommand` which
represents all actions that the host app can send to `connlib`. These
are passed through a channel to the `suspend fn` which continuously
listens for events and commands.

Resolves: #9499
Related: #3959

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil Bou Kheir <jamilbk@users.noreply.github.com>
2025-06-17 21:48:34 +00:00
Thomas Eizinger
1914ea7076 refactor(rust): remove forced callback indirection (#9362)
As relict from very early designs of `connlib`, the `Callbacks` trait is
still present and defines how the host app receives events from a
running `Session`. Callbacks are not a great design pattern however
because they force the running code, i.e. `connlib`s event-loop to
execute unknown code. For example, if that code panics, all of `connlib`
is taken down. Additionally, not all consumers may want to receive
events via callbacks. The GUI and headless client for example already
have their own event-loop in which they process all kinds of things.
Having to deal with the `Callbacks` interface introduces an odd
indirection here.

To fix this, we instead return an `EventStream` when constructing a
`Session`. This essentially aligns the API of `Session` with that of a
channel. You receive two handles, one for sending in commands and one
for receiving events. A `Session` will automatically spawn itself onto
the given runtime so progress is made even if one does not poll on these
channel handles.

This greatly simplifies the code:

- We get to delete the `Callbacks` interface.
- We can delete the threaded callback adapter. This was only necessary
because we didn't want to block `connlib` with the handling of the
event. By using a channel for events, this is automatically guaranteed.
- The GUI and headless client can directly integrate the event handling
in their event-loop, without having to create an indirection with a
channel.
- It is now clear that only the Apple and Android FFI layers actually
use callbacks to communicate these events.
- We net-delete 100 LoC
2025-06-02 11:28:04 +00:00
Thomas Eizinger
5566f1847f refactor(rust): move crates into a more sensical hierarchy (#9066)
The current `rust/` directory is a bit of a wild-west in terms of how
the crates are organised. Most of them are simply at the top-level when
in reality, they are all `connlib`-related. The Apple and Android FFI
crates - which are entrypoints in the Rust code are defined several
layers deep.

To improve the situation, we move around and rename several crates. The
end result is that all top-level crates / directories are:

- Either entrypoints into the Rust code, i.e. applications such as
Gateway, Relay or a Client
- Or crates shared across all those entrypoints, such as `telemetry` or
`logging`
2025-05-12 01:04:17 +00:00