Files
firezone/rust/phoenix-channel
Thomas Eizinger f7a388345b fix(connlib): reconnect in case we lose all relays (#7164)
During normal operation, we should never lose connectivity to the set of
assigned relays in a client or gateway. In the presence of odd network
conditions and partitions however, it is possible that we disconnect
from a relay that is in fact only temporarily unavailable. Without an
explicit mechanism to retrieve new relays, this means that both clients
and gateways can end up with no relays at all. For clients, this can be
fixed by either roaming or signing out and in again. For gateways, this
can only be fixed by a restart!

Without connected relays, no connections can be established. With #7163,
we will at least be able to still establish direct connections. Yet,
that isn't good enough and we need a mechanism for restoring full
connectivity in such a case.

We creating a new connection, we already sample one of our relays and
assign it to this particular connection. This ensures that we don't
create an excessive amount of candidates for each individual connection.
Currently, this selection is allowed to be silently fallible. With this
PR, we make this a hard-error and bubble up the error that all the way
to the client's and gateway's event-loop. There, we initiate a reconnect
to the portal as a compensating action. Reconnecting to the portal means
we will receive another `init` message that allows us to reconnect the
relays.

Due to the nature of this implementation, this fix may only apply with a
certain delay from when we actually lost connectivity to the last relay.
However, this design has the advantage that we don't have to introduce
an additional state within `snownet`: Connections now simply fail to
establish and the next one soon after _should_ succeed again because we
will have received a new `init` message.

Resolves: #7162.
2024-10-29 01:01:47 +00:00
..