Files
firezone/rust/connlib
Thomas Eizinger 7e0fa50cae fix(connlib): handle silently rebooted / disconnected relays (#6666)
Our relays are essential for connectivity because they also perform STUN
for us through which we learn our server-reflexive address. Thus, we
must at all times have at least one relay that we can reach in order to
establish a connection.

The portal tracks the connectivity to the relays for us and in case any
of them go down, sends us a `relays_presence` message, meaning we can
stop using that relay and migrate any relayed connections to a new one.
This works well for as long as we are connected to the portal while the
relay is rebooting / going-down. If we are not currently connected to
the portal and a relay we are using reboots, we don't learn about it. At
least if we are actively using it, the connection will fail and further
attempted communication with the relay will time-out and we will stop
using it.

In case we aren't currently using the relay, this gets a bit trickier.
If we aren't using the relay but it rebooted while we were partitioned
from the portal, logging in again might return the same relay to us in
the `init` message, but this time with different credentials.

The first bug that we are fixing in this PR is that we previously
ignored those credentials because we already knew about the relay,
thinking that we can still use our existing credentials. The fix here is
to also compare the credentials and ditch the local state if they
differ.

The second bug identified during fixing the first one is that we need to
pro-actively probe whether all other relays that we know about are
actually still responsive. For that, we issue a `REFRESH` message to
them. If that one times-out or fails otherwise, we will remove that one
from our list of `Allocation`s too.

To fix the 2nd bug, several changes were necessary:

1. We lower the log-level of `Disconnecting from relay` from ERROR to
WARN. Any ERROR emitted during a test-run fails our test-suite which is
what partially motivated this. The test suite builds on the assumption
that ERRORs are fatal and thus should never happen during our tests.
This change surfaces that disconnecting from a relay can indeed happen
during normal operation, which justifies lowering this to WARN. Users
should at the minimum monitor on WARN to be alerted about problems.
2. We reduce the total backoff duration for requests to relays from 60s
to 10s. The current 60s result in total of 8 retries. UDP is unreliable
but it isn't THAT unreliable to justify retrying everything for 60s. We
also use a 10s timeout for ICE, which means these are now aligned to
better match each other. We had to change the max backoff duration
because we only idle-spin for at most 10s in the tests and thus the
current 60s were too long to detect that a relay actually disappeared.
3. We had to shuffle around some function calls to make sure all
intermediary event buffers are emptied at the right point in time to
make the test deterministic.

Fixes: #6648.
2024-10-02 21:14:51 +00:00
..

Connlib

Firezone's connectivity library shared by all clients.

Building Connlib

You shouldn't need to build connlib directly; it's typically built as a dependency of one of the other Firezone components. See READMEs in those directories for relevant instructions.