When we receive a DNS query for a DNS resource in Firezone, we take the
next available 4 IPs from the CG-NAT range and assign them to the domain
name. For example, if `example.com` is a DNS resource and it is the
first resource being queried in a Firezone session, we will assigned the
IPs `100.96.0.1` - `100.96.0.4` to it. If the user now restarts Firezone
or signs out and back in, this state is lost and we assign those same
IPs to the next DNS query coming in.
This creates a problem for applications that do not re-query DNS very
often or never. They expect these IPs to not change. Restarting software
or signing out and back in is a common approach to fixing software
problems, yet in this specific case, doing so may create even more
problems for the user.
To mitigate this, `ClientState` introduce a new event
`DnsRecordsChanged` that gets emitted to the event-loop every time we
assign new records. The event-loop then caches this in memory and reuses
it in case a new session is initiated. The records are only stored
in-memory and not on disk. Most likely, the tunnel process will be alive
for the entire OS session.
To verify this behaviour, we add a new `RestartClient` transition to our
proptests. In the proptests, we already keep a mapping of all DNS names
we ever resolved, including DNS resources. When generating IP traffic,
we sample from this list of IPs and then expect the packet to be routed.
By replacing the `ClientState` as part of this transition and re-seeding
it with the previously exported DNS records, we can verify that packets
to IPs resolved from a previous session still get successfully routed to
the resource.
Related: #5498
Right now, connections cannot be actively closed in Firezone. The
WireGuard tunnel and the ICE agent are coupled together, meaning only if
either one of them fails will we clean up the connection. One exception
here is when the Client roams. In that case, the Client simply clears
its local memory completely and then re-establishes all necessary
connections by re-requesting access.
There are three cases where gracefully closing a connection is useful:
1. If an access authorization is revoked or expires and this was the
last resource authorisation for that peer, we don't currently remove the
connection on the Gateway. Instead, the Client is still able to send
packets by they'll be dropped because we don't have a peer state
anymore.
1. If a Gateway gets restarted due to e.g. an upgrade or other
maintenance work, it loses all its connections and every Client needs to
wait for the ICE timeout (~15 seconds) before it can establish a new
one.
1. If a Client has its access revoked for all resources it has access to
in a particular site we also don't remove this connection, even though
it has become practically useless.
All of these cases are fixed with this PR. Here we introduce a way to
gracefully shutdown a connection without forcing the other side into an
ICE timeout. The graceful connection shutdown works by introducing a new
"goodbye" p2p control protocol message. Like all our p2p control
protocol messages, this is based on IP and therefore delivery is not
guaranteed. In other words, this "goodbye" message is sent on a
best-effort basis.
In the case of shutdown, the Gateway will wait for all UDP packets to be
flushed but will not resend them or wait for an ACK.
If either end receives such a "goodbye" message, they simply remove the
local peer and connection state just as if the connection would have
failed due to either ICE or WireGuard. For the Client, this means that
the next packet for a resource will trigger a new access authorization
request.
Right now, the Client event-loops have a channel with 1000 items for
sending new resource lists and updates to the TUN device to the host
app. This is kind of unnecessary as we always only care about the last
version of these. Intermediate updates that the host app doesn't process
are effectively irrelevant.
We've had an issue before where a bug in the portal caused us to receive
many updates to resources which ended up crashing Client apps because
this channel filled up.
To be more resilient on this front, we refactor the Client event loop to
use a `watch` channel for this. Watch channels only retain the last
value that got sent into them.
We cannot poll the `PhoenixChannel` after it has returned an error,
otherwise it will panic. Therefore, we exit the event-loop then. The
outer event-loop also exits as soon as it receives an error from this
channel so this is fine.
`PhoenixChannel` only returns an error when it has irrecoverably
disconnected, e.g. after the retries have been exhausted or we hit a 4xx
error on the WebSocket connection.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Currently, `connlib`'s event-loop for clients uses manual polling to
advance the state of the tunnel and the phoenix-channel. Manual polling
is powerful but also easy to get wrong, resulting in task-wakeup bugs.
Additionally, if the tunnel is very busy with processing packets, the
phoenix-channel may not get enough CPU time, resulting in a loss of the
WebSocket connection.
To fix this, we move the phoenix-channel to a separate task and use
channels to connect it with `connlib`'s main event-loop. This one is now
primarily focused on advancing the tunnel state, effectively offloading
the problem of fair scheduling to the tokio runtime.
Related: #10003
When filtering through logs in Sentry, it is useful to narrow them down
by context of a client, gateway or resource. Currently, these fields are
sometimes called `client`, `cid`, `client_id` etc and the same for the
Gateway and Resources.
To make this filtering easier, name all of them `cid` for Client IDs,
`gid` for Gateway IDs and `rid` for Resource IDs.
These appear to happen on systems that e.g. don't have IPv6 support or
where the destination cannot be reached. It is a bit of a catch-all but
all the ones I am seeing in Sentry are false-positives. To reduce the
noise a bit, we log these on DEBUG now.
In Docker environments, applying iptables rules to filter
container-container traffic on the Docker bridged network is not
reliable, leading to direct connections being established in our relayed
tests. To fix this, we insert the rules directly from the client
container itself.
---------
Co-authored-by: Jamil Bou Kheir <jamilbk@users.noreply.github.com>
Certain UNIX systems such as macOS also use the EHOSTDOWN error to
signal that a packet cannot be sent to a certain IP. There is nothing we
can do about this error so we downgrade it from a WARN to a DEBUG like
we do for other kinds of "unreachable" errors.
Customer hit what seems to be a rare race condition where we try to
connect whilst we already have a session. I don't know which state it is
in so I am replacing it with a WARN log to learn more about this in
Sentry in case it gets hit again.
To make our FFI layer between Android and Rust safer, we adopt the
UniFFI tool from Mozilla. UniFFI allows us to create a dedicated crate
(here `client-ffi`) that contains Rust structs annotated with various
attributes. These macros then generate code at compile time that is
built into the shared object. Using a dedicated CLI from the UniFFI
project, we can then generate Kotlin bindings from this shared object.
The primary motivation for this effort is memory safety across the FFI
boundary. Most importantly, we want to ensure that:
- The session pointer is not used after it has been free'd
- Disconnecting the session frees the pointer
- Freeing the session does not happen as part of a callback as that
triggers a cyclic dependency on the Rust side (callbacks are executed on
a runtime and that runtime is dropped as part of dropping the session)
To achieve all of these goals, we move away from callbacks altogether.
UniFFI has great support for async functions. We leverage this support
to expose a `suspend fn` to Android that returns `Event`s. These events
map to the current callback functions. Internally, these events are read
from a channel with a capacity of 1000 events. It is therefore not very
time-critical that the app reads from this channel. `connlib` will
happily continue even if the channel is full. 1000 events should be more
than sufficient though in case the host app cannot immediately process
them. We don't send events very often after all.
This event-based design has major advantages: It allows us to make use
of `AutoCloseable` on the Kotlin side, meaning the `session` pointer is
only ever accessed as part of a `use` block and automatically closed
(and therefore free'd) at the end of the block.
To communicate with the session, we introduce a `TunnelCommand` which
represents all actions that the host app can send to `connlib`. These
are passed through a channel to the `suspend fn` which continuously
listens for events and commands.
Resolves: #9499
Related: #3959
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil Bou Kheir <jamilbk@users.noreply.github.com>
These don't happen very often so are safe to log on INFO. That is the
default log level and it is useful to see, why we are re-connecting to
the portal.
As relict from very early designs of `connlib`, the `Callbacks` trait is
still present and defines how the host app receives events from a
running `Session`. Callbacks are not a great design pattern however
because they force the running code, i.e. `connlib`s event-loop to
execute unknown code. For example, if that code panics, all of `connlib`
is taken down. Additionally, not all consumers may want to receive
events via callbacks. The GUI and headless client for example already
have their own event-loop in which they process all kinds of things.
Having to deal with the `Callbacks` interface introduces an odd
indirection here.
To fix this, we instead return an `EventStream` when constructing a
`Session`. This essentially aligns the API of `Session` with that of a
channel. You receive two handles, one for sending in commands and one
for receiving events. A `Session` will automatically spawn itself onto
the given runtime so progress is made even if one does not poll on these
channel handles.
This greatly simplifies the code:
- We get to delete the `Callbacks` interface.
- We can delete the threaded callback adapter. This was only necessary
because we didn't want to block `connlib` with the handling of the
event. By using a channel for events, this is automatically guaranteed.
- The GUI and headless client can directly integrate the event handling
in their event-loop, without having to create an indirection with a
channel.
- It is now clear that only the Apple and Android FFI layers actually
use callbacks to communicate these events.
- We net-delete 100 LoC
The name IPC service is not very descriptive. By nature of being
separate processes, we need to use IPC to communicate between them. The
important thing is that the service process has control over the tunnel.
Therefore, we rename everything to "Tunnel service".
The only part that is not changed are historic changelog entries.
Resolves: #9048
The current `rust/` directory is a bit of a wild-west in terms of how
the crates are organised. Most of them are simply at the top-level when
in reality, they are all `connlib`-related. The Apple and Android FFI
crates - which are entrypoints in the Rust code are defined several
layers deep.
To improve the situation, we move around and rename several crates. The
end result is that all top-level crates / directories are:
- Either entrypoints into the Rust code, i.e. applications such as
Gateway, Relay or a Client
- Or crates shared across all those entrypoints, such as `telemetry` or
`logging`