If the websocket connection between a relay and the portal experiences a temporary network split, the portal immediately sends the disconnected relay's id to all connected clients and gateways, and connlib immediately revokes all relayed connections (and current allocations).

This tight coupling is needlessly disruptive. As we've seen in staging and production logs, relay disconnects can happen randomly, and in the vast majority of cases the relay reconnects immediately. Currently we see about 1-2 dozen of these **per day**.

To better account for this, we introduce a debounce mechanism in the portal for `relays_presence` disconnects that works as follows:

- When a relay disconnects, record its `stamp_secret` (this is somewhat tricky because we don't get it at the time of disconnect; we need to cache it by `relay_id` beforehand)
- If the same `relay_id` reconnects with the same `stamp_secret` within `relays_presence_debounce_timeout` -> no-op
- If the same `relay_id` reconnects with a **different** `stamp_secret` -> disconnect immediately
- If it doesn't reconnect, **then** send the `relays_presence` with the disconnected_id after the `relays_presence_debounce_timeout`

There are several ways connlib detects a relay is down:

1. Binding requests time out. These happen every 25s, so on average we don't know a relay is down for 12.5s + backoff timer.
2. `relays_presence`: this is currently the fastest way to detect that relays are down. With this change, the caveat is that we will now detect this with a delay of `relays_presence_debounce_timeout`.

Fixes #8301
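
The debounce rules described above can be sketched as a small state machine. This is an illustrative Python model only, not the portal's actual implementation: the class `RelayDebouncer`, its methods `on_connect`, `on_disconnect` and `flush`, and the explicit `now` timestamps are all hypothetical names chosen for the example. Each method returns the relay ids that would be sent as disconnected in a `relays_presence` message at that moment.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RelayDebouncer:
    """Hypothetical model of debounced relay disconnects (not the portal's code)."""

    # Stands in for relays_presence_debounce_timeout, in seconds.
    debounce_timeout: float
    # relay_id -> stamp_secret, cached while the relay is connected because
    # the secret is not available at the time of disconnect.
    secrets: Dict[str, str] = field(default_factory=dict)
    # relay_id -> time at which a pending disconnect must be reported.
    pending: Dict[str, float] = field(default_factory=dict)

    def on_connect(self, relay_id: str, stamp_secret: str, now: float) -> List[str]:
        previous = self.secrets.get(relay_id)
        deadline = self.pending.pop(relay_id, None)
        self.secrets[relay_id] = stamp_secret
        if deadline is None:
            # No disconnect was being debounced for this relay.
            return []
        if previous != stamp_secret or now >= deadline:
            # Different stamp_secret (or the timeout already elapsed):
            # report the disconnect immediately.
            return [relay_id]
        # Same stamp_secret within the timeout: no-op.
        return []

    def on_disconnect(self, relay_id: str, now: float) -> None:
        # Start the debounce window only for relays we have a cached secret for.
        if relay_id in self.secrets:
            self.pending[relay_id] = now + self.debounce_timeout

    def flush(self, now: float) -> List[str]:
        """Report relays that did not reconnect before their deadline."""
        expired = sorted(r for r, deadline in self.pending.items() if deadline <= now)
        for relay_id in expired:
            del self.pending[relay_id]
            del self.secrets[relay_id]
        return expired
```

In this sketch the caller is assumed to invoke `flush` periodically (for example from a timer); a relay that reconnects with the same `stamp_secret` before its deadline never appears in any returned list, so clients and gateways keep their allocations.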