When `snownet` originally got developed, its API was designed with the
idea in mind that a packet that doesn't get handled is an error. Whilst
that is technically true, we don't have any other component that
processes packets within Firezone. When a connection is killed by e.g.
an ICE timeout, we may still be receiving packets from the other party.
Those fill the logs until the other party also runs into a timeout. To
prevent this, we no longer return errors for such packets and instead
log them at TRACE level.
For the case where we are given a packet that doesn't match any known
format, we still emit an error.
Within `connlib` - on UNIX platforms - we have dedicated threads that
read from and write to the TUN device. These threads are connected with
`connlib`'s main thread via bounded channels: one in each direction.
When these channels are full, `connlib`'s main thread will suspend and
not read any network packets from the sockets in order to maintain
back-pressure. Reading more packets from the socket would mean most
likely sending more packets out the TUN device.
When debugging #7763, it became apparent that _something_ must be wrong
with these threads and that somehow, we either consider their channels
as full or aren't emptying them and, as a result, we don't read _any_
network packets from our sockets.
To maintain back-pressure here, we currently use our own `AtomicWaker`
construct that is shared with the TUN thread(s). This is unnecessary. We
can also directly convert the `flume::Sender` into a
`flume::async::SendSink` and therefore directly access a `poll`
interface.
Once we've finished ICE and nominated a socket, we ignore future
candidates for the same connection (see #6876). To make this log a bit
more helpful, we now log the candidate that we are ignoring on this
connection.
The batch size affects how many packets we process at a time. It
also affects the worst-case size of a single buffer as all packets may
be of the same size and thus need to be appended to the same buffer.
On mobile, we can't afford to allocate all of these so we reduce the
batch size there.
Within the `GsoQueue` data structure, we keep a hash map indexed by
source, destination and segment length of UDP packets pointing to a
buffer for those payloads. What we intended to do here is to return the
buffer to the pool after we sent the payload. What we failed to realise
is that putting another buffer into the hash map means we have a buffer
allocated for a certain destination address and segment length! This
buffer would only get reused for the exact same address and segment
length, causing memory usage to balloon over time.
To fix this, we wrap the `DatagramBuffer` in an additional `Option`.
This allows us to actually remove it from the hash map and return the
buffer for future use to the buffer pool.
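The idea can be sketched as follows. This is a simplified model, not the
actual `GsoQueue`: the `Key` tuple, the `Vec<u8>` buffers (standing in
for `DatagramBuffer`), the `pool` field and all method names are
assumptions made for illustration.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

/// Hypothetical key: (source, destination, segment length).
type Key = (SocketAddr, SocketAddr, usize);

#[derive(Default)]
struct GsoQueue {
    /// `None` marks an entry whose buffer was already handed out for sending.
    inner: HashMap<Key, Option<Vec<u8>>>,
    /// Stand-in for the real buffer pool.
    pool: Vec<Vec<u8>>,
}

impl GsoQueue {
    fn enqueue(&mut self, key: Key, payload: &[u8]) {
        let pool = &mut self.pool;
        let slot = self.inner.entry(key).or_insert(None);
        // Only pull a buffer from the pool if this key has none right now.
        let buffer = slot.get_or_insert_with(|| pool.pop().unwrap_or_default());
        buffer.extend_from_slice(payload);
    }

    /// Hands out every queued buffer, leaving `None` behind, then prunes
    /// the emptied entries so no buffer stays pinned to a key.
    fn drain_for_sending(&mut self) -> Vec<(Key, Vec<u8>)> {
        let sent: Vec<(Key, Vec<u8>)> = self
            .inner
            .iter_mut()
            .filter_map(|(key, slot)| Some((*key, slot.take()?)))
            .collect();
        self.inner.retain(|_, slot| slot.is_some());
        sent
    }

    /// Returns a sent buffer to the pool so that any key can reuse it.
    fn recycle(&mut self, mut buffer: Vec<u8>) {
        buffer.clear();
        self.pool.push(buffer);
    }
}
```

Because the entry is removed after sending, a buffer that served one
destination and segment length can later serve a different one.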
Resolves: #7866.
Resolves: #7747.
At present, the file logger for all Rust code starts each logfile with
`connlib.`. This is very confusing when exporting the logs from the GUI
client because even the logs from the client itself will start with
`connlib.`. To fix this, we make the base file name of the log file
configurable.
When we queue a new UDP payload for sending, we always immediately
pull a new buffer even though we might already have one allocated for
this particular segment length. This causes an unnecessary spike in
memory when we are under load.
When the Gateway's filter-engine drops a packet, we currently only log
"destination not allowed". This could happen either because we don't
have a filter (i.e. the resource is not allowed) or because the TCP /
UDP port or ICMP traffic is not allowed. To make debugging easier, we
now include that information in the error message.
Resolves: #7875.
STUN binding requests & responses are not authenticated on purpose
because they are so easy to fulfill that having to perform the
computational work to check the authentication is more work than
actually just sending the request. With #7819, we send STUN binding
requests more often because they are used as keep-alives to the relay.
This spams the debug log because we see
> Message does not have a `MessageIntegrity` attribute
for every BINDING response. This information isn't interesting for
BINDING responses because those will never have a `MessageIntegrity`
attribute.
In order to debug connection wake-ups, it is useful to know which
packet is the first one that gets sent on an idle connection. With this
PR, we log exactly that for incoming and outgoing packets through the
tunnel. The resulting log looks something like this:
```
2025-01-24T02:52:51.818Z DEBUG snownet::node: Connection is idle cid=65f149ea-96a4-4eee-ac70-62a1a2590821
2025-01-24T02:52:57.312Z DEBUG firezone_tunnel::client: Cleared DNS resource NAT domain=speed.cloudflare.com
2025-01-24T02:52:57.312Z DEBUG firezone_tunnel::client: Setting up DNS resource NAT gid=65f149ea-96a4-4eee-ac70-62a1a2590821 domain=speed.cloudflare.com
2025-01-24T02:52:57.312Z DEBUG snownet::node: Connection resumed packet=Packet { src: ::, dst: ::, protocol: "Reserved" } cid=65f149ea-96a4-4eee-ac70-62a1a2590821
```
Here, the connection got resumed because we locally received a DNS query
for a DNS resource which triggers a new control protocol message through
the tunnel. For this, we use the unspecified IPv6 address for src and
dst and the protocol identifier 255 (0xFF) which here renders as
"Reserved".
The committed regression seeds trigger a scenario in which a WireGuard
session expires while a packet is in flight: when the Client sends the
packet, the session is still active (179.xx seconds old) but, with the
latency to the Gateway, the 180s mark is reached before it arrives. The
Gateway then clears the session and discards the packet.
In order to fix this, I opted to patch WireGuard by introducing a new
timer that does not allow the initiator to use a session that is almost
expired: https://github.com/firezone/boringtun/pull/68.
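The rule can be sketched roughly like this. The 180s limit is the one
mentioned above; the 5s margin, the constant names and the function name
are assumptions for illustration and not taken from the `boringtun`
patch.

```rust
use std::time::Duration;

/// Sessions at or beyond this age are rejected (the 180s mark).
const REJECT_AFTER_TIME: Duration = Duration::from_secs(180);
/// Hypothetical safety margin that covers the latency to the Gateway.
const INITIATOR_MARGIN: Duration = Duration::from_secs(5);

/// Decides whether a session of the given age may still be used for sending.
fn may_send_on_session(session_age: Duration, is_initiator: bool) -> bool {
    let limit = if is_initiator {
        // The initiator stops early so in-flight packets still arrive in time.
        REJECT_AFTER_TIME - INITIATOR_MARGIN
    } else {
        REJECT_AFTER_TIME
    };
    session_age < limit
}
```

With such a margin, a 179.xx seconds old session is no longer used by
the initiator, which renegotiates instead of sending a packet that is
likely to be discarded.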
Resolves: #7832.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
With #7819, these log messages appear at a ~10x higher rate than before
- a day's worth of these would be over 3,000 messages. For BINDING
requests, these only matter if the candidates change, therefore we can
make the logging conditional on that.
---------
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
Contrary to my prior belief, we don't actually need the WireGuard
_persistent_ keep-alive. The in-built timers from WireGuard will
automatically send keep-alive messages in case no organic reply is sent
for a particular request.
All NAT bindings along the network path are already kept open using the
STUN bindings sent on all candidate pairs. Even on idle connections, we
send those every 60s. Well-behaved NATs are meant to keep confirmed UDP
bindings open for at least 120s. Even if not, the worst-case here is
that a connection which does not send any(!) application traffic is cut.
#7819 triggers this log every 25s which isn't exactly describing the
correct condition any longer. This PR updates the log to only fire when
we're determining which socket to use for communicating with the Relay,
and not at each keepalive interval.
Firezone Clients and Gateways create an allocation with a given set of
Relays as soon as they start up. If no traffic is being secured and thus
no connections are established between them, NAT bindings between
Clients / Gateways and the Relays may expire. Typically, these bindings
last for 120s. Allocations are only refreshed every 5 min (after 50% of
their lifetime has passed).
After a NAT binding is expired, the next UDP message passing through the
NAT may allocate a new port, thus changing the 3-tuple of the sender.
TURN identifies clients by their 3-tuple. Therefore, without a proactive
keepalive, TURN clients lose access to their allocation and need to
create one under the new port.
To fix this, we implement a scheduled STUN binding request every 25s
once we have chosen a socket (IPv4 or IPv6) for a given relay.
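In a sans-IO style, such a schedule could look like the sketch below.
The type and method names are hypothetical; only the 25s interval and
the "once a socket is chosen" condition come from the description above.

```rust
use std::time::{Duration, Instant};

const BINDING_INTERVAL: Duration = Duration::from_secs(25);

struct RelayKeepalive {
    /// `None` until a socket (IPv4 or IPv6) has been chosen for the relay.
    next_binding: Option<Instant>,
}

impl RelayKeepalive {
    fn new() -> Self {
        Self { next_binding: None }
    }

    /// Starts the schedule once we know which socket to use.
    fn on_socket_chosen(&mut self, now: Instant) {
        self.next_binding = Some(now + BINDING_INTERVAL);
    }

    /// Returns `true` if a STUN binding request should be sent now.
    fn handle_timeout(&mut self, now: Instant) -> bool {
        match self.next_binding {
            Some(due) if now >= due => {
                self.next_binding = Some(now + BINDING_INTERVAL);
                true
            }
            _ => false,
        }
    }
}
```

At 25s, the interval stays well below the typical 120s binding lifetime,
so the 3-tuple seen by the relay should remain stable.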
Resolves: #7802.
In order to ensure that the "site status" in the UIs is always
up-to-date, we model the resource status as part of `tunnel_test`. This
should cover even the most bizarre combinations of adding, removing,
disabling and enabling resources interleaved with sending packets,
resetting connections etc.
Fixes: #7761.
We introduced a regression in `connlib` in #7749 whereby queued but
unsent datagrams got dropped in case the socket was not ready to send
more data.
This happens because within `Io`, we pull each datagram one by one from
the iterator:
e60ec7144c/rust/connlib/tunnel/src/io.rs (L178-L188)
This function will send datagrams for as long as the socket is ready and
drop the iterator afterwards. This means the returned iterator MUST BE
lazy and "cancel-safe". This was the case prior to #7749 because the
`datagrams` function used `iter_mut` and only cut off the to-be-sent
bytes when the next item got pulled from the iterator. With #7749, the
entire `HashMap` got drained, thus dropping packets if `Io` didn't
manage to process the iterator in full.
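To illustrate what "lazy and cancel-safe" means here, consider the
following minimal sketch. It uses a `VecDeque` instead of the real
`HashMap`-based structure, and `SendQueue` is a hypothetical name.

```rust
use std::collections::VecDeque;

struct SendQueue {
    pending: VecDeque<Vec<u8>>,
}

impl SendQueue {
    /// Lazy: a datagram only leaves the queue when `next` is called.
    /// Dropping the iterator early therefore loses nothing.
    fn datagrams(&mut self) -> impl Iterator<Item = Vec<u8>> + '_ {
        std::iter::from_fn(move || self.pending.pop_front())
    }
}
```

If the caller pulls a single datagram and then drops the iterator
because the socket is no longer ready, the remaining datagrams are still
queued. An iterator created via `drain` behaves differently: dropping it
removes all remaining elements, which matches the regression described
above.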
In #7758, we fix `connlib`'s event-loop to always provide the current
time to the state machine rather than the one that was requested (which
may be in the past). Even though this is already fairly resilient, we
should never request a time in the past.
This patch adds this as an assertion to our test suite.
On a high level, `connlib` is a state machine that gets driven by a
custom event-loop. For time-related actions, the state machine computes
when it would like to be woken next. The event-loop sets a timer for
that value and emits this value when the timer fires.
There is an edge-case where this may result in the time going backwards
within the state machine. Specifically, if - for whatever reason - the
state machine emits a time value that is in the past, the timer in the
`Io` component will fire right away **but the `deadline` will point to
the time in the past**.
The only thing we are actually interested in is that the timer fires at
all. Instead of passing back the deadline of the timer, we fetch the
_current_ time and pass that back to the state machine as the current
input. This ensures that we never jump back in time because Rust
guarantees for calls to `Instant::now` to be monotonic.
(https://doc.rust-lang.org/std/time/struct.Instant.html#:~:text=a%20measurement%20of%20a%20monotonically%20nondecreasing%20clock.)
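A minimal sketch of the idea, with hypothetical names (`StateMachine`,
`on_timer_fired`) that do not mirror the actual `connlib` types:

```rust
use std::time::Instant;

struct StateMachine {
    last_now: Instant,
}

impl StateMachine {
    fn handle_timeout(&mut self, now: Instant) {
        // Time must never go backwards from the state machine's view.
        debug_assert!(now >= self.last_now, "time went backwards");
        self.last_now = now;
    }
}

/// Called by the event-loop when the timer armed for `_deadline` fires.
/// The deadline itself is ignored: it may lie in the past.
fn on_timer_fired(state: &mut StateMachine, _deadline: Instant) {
    state.handle_timeout(Instant::now());
}
```

Even if the requested deadline is stale, the state machine only ever
observes values obtained from `Instant::now`.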
When `snownet` is tasked to establish a new connection, it first
randomly samples one of its relays that is used as an additional source
of candidates in case a direct connection is not possible. We (try to)
maintain an allocation on each relay throughout the lifetime of a
`connlib` session. In case a relay doesn't respond to the initial
binding message at all (even after several retries), we consider the
relay offline and remove all state associated to it.
It is possible that we sampled a relay for use in a connection and only
then realise that it is offline. In that case, we print a message to the
log:
> Selected relay disconnected during ICE; connection may fail
The condition for when we print this log is: "we are in `Connecting` and
the sampled relay no longer exists". This results in log spam in
case that condition is actually hit because no state is being changed as
part of this check and thus, on the next call to `handle_timeout`, this
condition is still true!
To fix this, we change the `rid` field of `Connecting` to an `Option`.
In case we detect that a relay is no longer present, we print the log
and then clear the option. As a result, the log is only printed once.
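Sketched in isolation, the pattern looks roughly like this. Relay IDs
are plain `u32`s here and the method returns the message instead of
logging it, both of which are simplifications for illustration.

```rust
use std::collections::HashSet;

struct Connecting {
    /// Relay sampled for this connection; cleared once it is known to be gone.
    rid: Option<u32>,
}

impl Connecting {
    /// Returns the message to log, if any.
    fn handle_timeout(&mut self, relays: &HashSet<u32>) -> Option<&'static str> {
        let rid = self.rid?;
        if relays.contains(&rid) {
            return None;
        }
        // Changing state here is what prevents the repeated log.
        self.rid = None;
        Some("Selected relay disconnected during ICE; connection may fail")
    }
}
```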
Within `connlib`, we read batches of IP packets and process them at
once. Each encrypted packet is appended to a buffer shared with other
packets of the same length. Once the batch is successfully processed,
all of these buffers are written out using GSO to the network. This
allows UDP operations to be much more efficient because not every packet
has to traverse the entire syscall hierarchy of the operating system.
Until now, these buffers got re-allocated on every batch. This is pretty
wasteful and leads to a lot of repeated allocations. Measurements show
that most of the time, we only have a handful of packets with different
segment lengths _per batch_. For example, just booting up the
headless-client and running a speedtest showed that only 5 of these
buffers were needed at one time.
By introducing a buffer pool, we can reuse these buffers between batches
and avoid reallocating them.
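A buffer pool in its simplest form might look like the sketch below. It
is not the implementation used in `connlib`; the names, the
`allocations` counter and the initial capacity are made up for
illustration.

```rust
#[derive(Default)]
struct BufferPool {
    free: Vec<Vec<u8>>,
    /// Counts how often we had to allocate a fresh buffer.
    allocations: usize,
}

impl BufferPool {
    /// Reuses a free buffer if one exists, otherwise allocates a new one.
    fn pull(&mut self) -> Vec<u8> {
        self.free.pop().unwrap_or_else(|| {
            self.allocations += 1;
            Vec::with_capacity(1024) // hypothetical initial capacity
        })
    }

    /// Clears the buffer (keeping its capacity) and makes it available again.
    fn give_back(&mut self, mut buffer: Vec<u8>) {
        buffer.clear();
        self.free.push(buffer);
    }
}
```

Processing several batches with one buffer in flight at a time then
costs a single allocation instead of one per batch.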
Related: #7747.
When a Firezone Client roams, the host app sends a "reset" command to
`connlib`. At present, this "reset" command clears the network
connection state and therefore restarts ICE. As part of that, the tunnel
key also gets rotated, yet the information about which resources have
already been authorized is retained.
This isn't a problem per se because the client's identity is determined
by the "Firezone ID" which persists even across restarts of a Client.
For the Gateway however, a roamed Client and a restarted Client are
indistinguishable as in both cases, the tunnel public key and ICE
credentials change.
Instead of only clearing the connection-specific state, we now also
throw away all the ACL state that is associated with connections, i.e.
which Resource already got authorized on the Gateway. As a result - with
this change - Clients will emit another "connection intent" to the
portal upon roaming, triggering a new authorization of this flow with a
Gateway.
There isn't any particular need for doing this except that lingering
state can be a nasty source of bugs. With the now idempotent control
protocol, it is pretty easy to re-request these authorisations. Overall,
this makes `connlib` more resilient and easier to reason about.
Ever since #7289, we no longer issue any DNS queries to `connlib` when
we reconnect to the portal. Thus, the back-then conceived feature of
"known hosts" that allowed us to resolve that DNS query without having
an upstream receiver is no longer needed.
When `connlib` detects that no data is being sent on a connection, it
enters a "low-power" mode within which timers are set to a much longer
interval than usual. For `boringtun` this moves the timer from 1s to
30s.
At present, this timer also guards how often we actually update the
timer state within `boringtun`. Instead of following an "only update
exactly when this timer fires"-policy, we now adopt an "update at least
this often"-policy. The difference here is that while we are executing
the `handle_timeout` function, we might as well call into `boringtun`
and update its timer state too.
Another side-effect of this timer is that `boringtun` may not be woken
in time to initiate a rekey when the session expires. WireGuard sessions
without activity expire after 3 minutes. Only the initiator should then
recreate the session. If this doesn't happen in time, the responder
(Gateway) may trigger a keep-alive timeout. Without an active session,
keep-alives also initiate sessions, resulting in us having two competing
sessions.
This fixes the failing test cases added in this PR: There, we ran into a
situation where a WireGuard tunnel idled for so long that the spec
requires the session to expire. In the test, we then sent a packet using
such an expired session but that packet got discarded by the Gateway
because of the expired session. The timers are what check whether a
session is expired:
- By calling `update_timers_at` more often, we can expire the session in
time and `boringtun` will buffer the to-be-sent packet until the new
session is established.
- By deactivating the keep-alive on the Gateway, we ensure that we only
ever have a single WireGuard session active.
- With https://github.com/firezone/boringtun/pull/53, we ensure the
Gateway doesn't initiate a new session in the beginning.
- With https://github.com/firezone/boringtun/pull/51, we ensure the
Client only ever initiates a single session.
To be entirely reliable, we also had to remove the idle WG timer and
update `boringtun`'s state every second. This is unfortunate but can
long-term be fixed by patching WireGuard to tell us when exactly it
wants to be woken instead of us having to proactively wake it every
second _in case_ it needs to act on a timer.
Related: https://github.com/firezone/boringtun/issues/54.
Xcode doesn't allow wildcards in input file lists, so the rules I set up
in #7488 never took effect.
Upon further investigation, it appears that the `strip` command executed
unconditionally at the end of every Rust build was the culprit. Since
Xcode already does this for us, it's a useless step that adds about 30s
to the build time.
Unfortunately, there isn't a good way to tell Xcode not to build Rust.
But now we don't need to -- `cargo`'s build cache is smart enough to
skip builds and we are back to the ~1-2s range for repeated builds when
only Swift code has changed.
We also add the swift bridge generated code to version control. These
files don't change regularly, and Xcode sometimes complains that they
don't exist _before_ it lets you run the `cargo build` to generate them
🙃.
For a while now, `connlib` has been calling these two callbacks right
after each other because the internal event already bundles all the
information about the TUN device. With this PR, we merge the two
callback functions also in layers above `connlib` itself.
Resolves: #6182.
With #7684, we update our boringtun fork to support deterministic timers
and handshake jitter. Further testing revealed that there was a bug
within the jitter implementation that prevented the jitter from actually
applying (https://github.com/firezone/boringtun/pull/48). In addition,
we were only calling `update_timers_at` with a precision of 1s, making
the internal jittering of 0 to 333ms within `boringtun` useless.
To fix this, we introduced a `next_timer_update` function in `Tunn` in
https://github.com/firezone/boringtun/pull/49 and make use of it in
here.
Finally, https://github.com/firezone/boringtun/pull/50 prioritizes the
sending of these scheduled handshakes to further improve the timer
precision.
With these patches applied, this is what the rekey logs look like:
```
2025-01-08T13:20:09.209Z DEBUG boringtun::noise::timers: HANDSHAKE(REKEY_AFTER_TIME (on send)) cid=b3d34a15-55ab-40df-994b-a838e75d65d7
2025-01-08T13:20:09.209Z DEBUG boringtun::noise::timers: Scheduling new handshake jitter=204.361814ms cid=b3d34a15-55ab-40df-994b-a838e75d65d7
2025-01-08T13:20:09.415Z DEBUG boringtun::noise: Sending handshake_initiation cid=b3d34a15-55ab-40df-994b-a838e75d65d7
2025-01-08T13:20:09.537Z DEBUG boringtun::noise: Received handshake_response local_idx=2898279939 remote_idx=2039394307 cid=b3d34a15-55ab-40df-994b-a838e75d65d7
2025-01-08T13:20:09.540Z DEBUG boringtun::noise: New session session=2898279939 cid=b3d34a15-55ab-40df-994b-a838e75d65d7
```
We can see that the scheduled handshake now does indeed get sent with
the applied jitter of 200ms.
When file descriptors like sockets or the TUN device are opened in
non-blocking mode, performing operations that would block emit the
`WouldBlock` IO error. These errors _should_ be translated into
`Poll::Pending` and have a waker registered that gets called whenever
the operation should be attempted again. Therefore, we should _never_
see these IO errors.
Previously, the implementation of the tunnel's event-loop did not yet
properly handle this backpressure and instead sometimes dropped packets
when it should have suspended. This has since been fixed but the branch
introduced back then, which just ignored `io::ErrorKind::WouldBlock`
errors, had remained.
Changing this to a debug-assert will alert us whenever we accidentally
break this without altering the behaviour of the release binary.
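The intended layering can be sketched as follows. Both function names
are hypothetical, and the sketch leaves out waker registration, which a
real implementation has to perform before returning `Poll::Pending`.

```rust
use std::io;
use std::task::Poll;

/// Lower layer: translate `WouldBlock` into `Poll::Pending`.
fn into_poll<T>(result: io::Result<T>) -> Poll<io::Result<T>> {
    match result {
        Err(e) if e.kind() == io::ErrorKind::WouldBlock => Poll::Pending,
        other => Poll::Ready(other),
    }
}

/// Upper layer: `WouldBlock` should never get this far.
/// The assertion only fires in debug builds; release builds are unchanged.
fn handle_io_error(e: &io::Error) {
    debug_assert_ne!(
        e.kind(),
        io::ErrorKind::WouldBlock,
        "`WouldBlock` must be translated into `Poll::Pending`"
    );
}
```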
At present, the WireGuard implementation within `boringtun` is impure
with regards to time due to calls to `Instant::now` and
`Instant::elapsed`. This makes it impossible to exhaustively test
time-related features because time cannot be advanced arbitrarily. The
rest of `connlib` is implemented in a sans-IO fashion where time is
controlled from the outside via `Instant` parameters on every function
that requires access to the current time.
With this PR, we update to the latest version of our `boringtun` fork at
https://github.com/firezone/boringtun which introduces pure equivalents
of all functions that require access to the current time _and_ also
implements the missing handshake-delay jitter feature (see
https://github.com/firezone/boringtun/issues/19).
This is a pretty safe upgrade as the production code doesn't really
change and time advances at the same rate as before. To ensure this
passes our test-suite, I ran 50_000 iterations locally.
For our test-suite, we need to sample a unique, non-overlapping IP for
each component that is being simulated (client, gateways and relays).
These are sampled from a predefined range.
Currently, we only consider the first 100 IPs of this range and pick one
from an allocated `Vec`. This isn't ideal for performance and increases
the likelihood of two hosts having the same IP. IPv4 and IPv6 addresses
can also just be represented as numbers. Instead of sampling a random IP
from a list, we can simply sample a random number between the first and
last address of the particular IP network to achieve the same effect.
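For IPv4, the mapping from a number to an address can be sketched like
this. The function name is hypothetical and `sample` stands for a value
produced by whatever random source the test suite uses.

```rust
use std::net::Ipv4Addr;

/// Maps an arbitrary `sample` onto an address within `network/prefix_len`.
fn ip_from_sample(network: Ipv4Addr, prefix_len: u8, sample: u32) -> Ipv4Addr {
    assert!(prefix_len <= 32, "invalid IPv4 prefix length");

    let host_bits = 32 - u32::from(prefix_len);
    // Number of addresses in the network; `u64` avoids overflow for a /0.
    let size = 1u64 << host_bits;
    let mask = (u64::from(u32::MAX) << host_bits) as u32;

    let first = u32::from(network) & mask;
    let offset = (u64::from(sample) % size) as u32;

    Ipv4Addr::from(first + offset)
}
```

For example, `ip_from_sample(Ipv4Addr::new(10, 0, 0, 0), 24, 5)` yields
`10.0.0.5` without ever materialising a list of addresses. The same
approach works for IPv6 with `u128`.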
Reading and writing to the TUN device within `connlib` happens in a
separate thread. The task running within these threads is connected to
the rest of `connlib` via channels. When the application shuts down,
these threads also need to exit. Currently, we attempt to detect this
from within the task when these channels close. It appears that there is
a race condition here because we first attempt to read from the TUN
device before reading from the channels. We treat read & write errors on
the TUN device as non-fatal so we loop around and attempt to read from
it again, causing an infinite-loop and log spam.
To fix this, we swap the order in which we evaluate the two concurrent
tasks: The first task to be polled is now the channel for outbound
packets and only if that one is empty, we attempt to read new packets
from the TUN device. This is also better from a backpressure point of
view: We should attempt to flush out our local buffers of already
processed packets before taking on "new work".
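The new ordering can be modelled with a synchronous stand-in. The real
task is asynchronous and uses different channel types; the `Action` enum
and `next_action` function below are assumptions for illustration.

```rust
use std::sync::mpsc::{Receiver, TryRecvError};

#[derive(Debug, PartialEq)]
enum Action {
    WriteToTun(Vec<u8>),
    ReadFromTun,
    Exit,
}

/// Decides what the TUN thread should do next.
/// The outbound channel is always checked before reading from the device.
fn next_action(outbound: &Receiver<Vec<u8>>) -> Action {
    match outbound.try_recv() {
        // Flush already processed packets first.
        Ok(packet) => Action::WriteToTun(packet),
        // Nothing to flush: only now take on new work.
        Err(TryRecvError::Empty) => Action::ReadFromTun,
        // The rest of the application is gone: stop instead of looping.
        Err(TryRecvError::Disconnected) => Action::Exit,
    }
}
```

Because the channel is inspected first, a closed channel is noticed
before a failing TUN read can send the loop around again.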
As a defense-in-depth strategy, we also attempt to detect the particular
error from the tokio runtime when it is being shut down and exit the
task.
Resolves: #7601.
Related: https://github.com/tokio-rs/tokio/issues/7056.
- Refactor Telemetry module to expose firezoneId and accountSlug for
easier access in the Adapter module
- Set accountSlug to WrappedSession.connect for hydrating the Rust
sentry context
Firezone needs to deterministically handle overlapping CIDR routes. The
way we handle this is that more specific routes are preferred over less
specific ones. In case of an exact overlap, the sorting of the resource
ID acts as a tie-breaker: "Smaller" resource IDs are preferred over "larger"
ones. This ensures that regardless of which order the resources are
added / enabled in, Firezone behaves deterministically.
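These two rules can be expressed as a single ordering, sketched here for
IPv4 only. The `Route` type and `pick_route` function are hypothetical,
resource IDs are plain `u32`s, and precedence for existing connections
is not modelled.

```rust
use std::cmp::Reverse;
use std::net::Ipv4Addr;

#[derive(Debug, Clone, Copy, PartialEq)]
struct Route {
    network: Ipv4Addr,
    prefix_len: u8, // assumed to be <= 32
    resource_id: u32,
}

impl Route {
    fn contains(&self, ip: Ipv4Addr) -> bool {
        let host_bits = 32 - u32::from(self.prefix_len);
        let mask = (u64::from(u32::MAX) << host_bits) as u32;
        (u32::from(ip) & mask) == (u32::from(self.network) & mask)
    }
}

/// Picks the most specific matching route; ties go to the smaller resource ID.
fn pick_route(routes: &[Route], dst: Ipv4Addr) -> Option<Route> {
    routes
        .iter()
        .copied()
        .filter(|r| r.contains(dst))
        .min_by_key(|r| (Reverse(r.prefix_len), r.resource_id))
}
```

The result does not depend on the order of `routes`, which is the
property the rules above are meant to guarantee.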
In addition to the above rules, existing connections to Gateways always
have precedence: In other words, if we are connected to resource A via
Gateway 1 and resource B exactly overlaps with A yet needs to be routed
to Gateway 2 and B < A, we still retain resource A in order to not
interrupt existing connections.
When a connection to a Gateway fails, these mappings are cleaned up. The
proptests seeds added in this PR identify a routing mismatch in case a
(relayed) connection is cut, followed by adding a non-CIDR resource:
`connlib` recalculated the CIDR routes as part of adding the new
resource, even though the CIDR resources didn't actually change. This
could potentially result in a connection suddenly being routed to a
different Gateway despite nothing about that resource changing.
To fix this, we add a check for updating the CIDR routes and only
perform it in case CIDR resources get changed.
We are receiving multiple reports of messages, especially error messages
from relays, where the message integrity check fails. To get more
information as to why, this patch extends this error message with the
attributes of the request and response message.
IPv6 treats fragmentation and MTU errors differently than IPv4. Rather
than requiring fragmentation on each hop of a routing path,
fragmentation needs to happen at the packet source and failure to route
a packet triggers an ICMPv6 `PacketTooBig` error.
These need to be translated back through our NAT64 implementation of the
Gateway. Due to the size difference in the headers of IPv4 and IPv6, the
available MTU to the IPv4 packet is 20 bytes _less_ than the MTU
reported by the ICMP error. IPv6 headers are always 40 bytes, meaning if
the MTU is reported as e.g. 1200 on the IPv6 side, we need to only offer
1180 to the IPv4 end of the application. Once the new MTU is then
honored, the packets translated by our NAT64 implementation will still
conform to the required MTU of 1200, despite the overhead introduced by
the translation.
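The arithmetic is small enough to write down directly. The constant and
function names are illustrative, and the 20-byte IPv4 header assumes no
IP options.

```rust
const IPV4_HEADER_LEN: u32 = 20; // without options
const IPV6_HEADER_LEN: u32 = 40;

/// MTU to report to the IPv4 side for an MTU learned on the IPv6 side.
fn ipv4_mtu_for(ipv6_mtu: u32) -> u32 {
    ipv6_mtu.saturating_sub(IPV6_HEADER_LEN - IPV4_HEADER_LEN)
}

/// Size of an IPv4 packet after translation to IPv6.
/// The payload is unchanged; only the header is swapped.
fn translated_len(ipv4_packet_len: u32) -> u32 {
    ipv4_packet_len - IPV4_HEADER_LEN + IPV6_HEADER_LEN
}
```

With an IPv6 MTU of 1200, the IPv4 side is offered 1180, and a full-size
1180-byte IPv4 packet becomes exactly 1200 bytes after translation.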
Resolves: #7515.
In #7477, we introduced a regression in our test suite for DNS queries
that are forwarded through the tunnel.
In order to be deterministic when users configure overlapping CIDR
resources, we use the sort order of all CIDR resource IDs to pick which
one "wins". To make sure existing connections are not interrupted, this
rule does not apply when we already have a connection to a gateway for a
resource. In other words, if a new CIDR resource (e.g. resource `A`) is
added to connlib that has an overlapping route with another resource
(e.g. resource `B`) but we already have a connection to resource `B`, we
will continue routing traffic for this CIDR range to resource `B`,
despite `A` sorting "before" `B`.
The regression that we introduced was that we did not account for
resources being "connected" after forwarding a query through the tunnel
to it. As a result, in the found failure case, the test suite was
expecting to route the packet to resource `A` because it did not know
that we are connected to resource `B` at the time of processing the ICMP
packet.
In case an upstream DNS server responds with a payload that exceeds the
available buffer space of an IP packet, we need to truncate the
response. Currently, this truncation uses the **wrong** constant to
check for the maximum allowed length. Instead of the
`MAX_DATAGRAM_PAYLOAD`, we actually need to check against a limit that
is less than the MTU as the IP layer and the UDP layer both add an
overhead.
To fix this, we introduce such a constant and provide additional
documentation on the remaining ones to hopefully avoid future errors.
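As a rough illustration of why the limit has to be smaller than the MTU:
all names and the MTU value below are assumptions, not the constants
from the codebase, and the sketch uses the larger IPv6 header as the
worst case.

```rust
/// Hypothetical MTU of the tunnel interface.
const MTU: usize = 1280;
/// Fixed IPv6 header; larger than the minimal 20-byte IPv4 header.
const IPV6_HEADER_LEN: usize = 40;
const UDP_HEADER_LEN: usize = 8;

/// Largest UDP payload that still fits into a single IP packet.
const MAX_UDP_PAYLOAD: usize = MTU - IPV6_HEADER_LEN - UDP_HEADER_LEN;

/// Whether a DNS response of this length must be truncated.
fn needs_truncation(dns_response_len: usize) -> bool {
    dns_response_len > MAX_UDP_PAYLOAD
}
```

Under these assumptions, the limit is 1232 bytes rather than 1280.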
When a Firezone Client roams, we reset all network connections and
rebind our local sockets. Doing that enables us to start from a clean
state and establish new connections to Gateways. What we are currently
not clearing are in-flight DNS queries. Those are all very likely to
fail because our network connection is changing. There is no point in us
keeping those around. Additionally, as part of roaming, it may also be
that our upstream DNS server changes and thus, we may suddenly receive a
response from a DNS server that we no longer know about.
Clearing all in-flight DNS queries on reset solves this.
Initially, when we receive a new candidate from a remote peer, we bind a
channel for each remote address on the relay that we sampled. This
ensures that every possible communication path is actually functioning.
In ICE, all candidates are tried against each other, meaning the remote
will attempt to send from each of their candidates to every one of ours,
including our relay candidates. To allow this traffic, a channel needs
to be bound first.
For various reasons, an allocation might become stale or otherwise need
to be invalidated. When that happens, all the channel bindings are lost
but there might still be an active connection that wants to utilise
them. In that case, we will see "No channel" warnings like
https://firezone-inc.sentry.io/issues/6036662614/events/f8375883fd3243a4afbb27c36f253e23/.
To fix this, we use the attempt to encode a message for a channel as an
intent to bind a new one. This is deemed safe because wanting to encode
a message to a peer as a channel data message means we want such a
channel to exist. The first message here is still dropped but that is
better than not establishing the channel at all.
When deciding what to do with a certain DNS query, we check whether the
domain name in question corresponds to any of the (wildcard) DNS
resource addresses. If yes, we resolve it to the resource ID of that
resource. The source of those resource IDs is the `dns_resources` map.
If we have looked up a `ResourceId` in that map, it is impossible for it
to not be "known" which means the branch deleted in this PR is
completely redundant and already covered by the catch-all branch where
`maybe_resource` is `None`.