mirror of
https://github.com/outbackdingo/firezone.git
synced 2026-01-28 02:18:50 +00:00
Our relays are essential for connectivity because they also perform STUN for us, which is how we learn our server-reflexive address. Thus, we must at all times be able to reach at least one relay in order to establish a connection.

The portal tracks connectivity to the relays for us and, in case any of them go down, sends us a `relays_presence` message, meaning we can stop using that relay and migrate any relayed connections to a new one. This works well as long as we are connected to the portal while the relay is rebooting or going down. If we are not currently connected to the portal and a relay we are using reboots, we don't learn about it. If we are actively using the relay, the connection will at least fail, further attempts to communicate with the relay will time out, and we will stop using it.

If we aren't currently using the relay, this gets a bit trickier. If it rebooted while we were partitioned from the portal, logging in again might return the same relay to us in the `init` message, but this time with different credentials.

The first bug that we are fixing in this PR is that we previously ignored those credentials because we already knew about the relay and assumed that our existing credentials were still valid. The fix is to also compare the credentials and discard the local state if they differ.

The second bug, identified while fixing the first one, is that we need to proactively probe whether all other relays that we know about are actually still responsive. For that, we issue a `REFRESH` message to them. If that request times out or otherwise fails, we remove that relay from our list of `Allocation`s too.

To fix the second bug, several changes were necessary:

1. We lower the log level of `Disconnecting from relay` from ERROR to WARN. Any ERROR emitted during a test run fails our test suite, which is what partially motivated this. The test suite builds on the assumption that ERRORs are fatal and thus should never happen during our tests. This change surfaces that disconnecting from a relay can indeed happen during normal operation, which justifies lowering this to WARN. Users should at a minimum monitor WARN-level logs to be alerted about problems.
2. We reduce the total backoff duration for requests to relays from 60s to 10s. The current 60s results in a total of 8 retries. UDP is unreliable, but it isn't THAT unreliable to justify retrying everything for 60s. We also use a 10s timeout for ICE, so the two are now aligned. We had to change the max backoff duration because we only idle-spin for at most 10s in the tests, and thus the current 60s was too long to detect that a relay actually disappeared.
3. We had to shuffle around some function calls to make sure all intermediary event buffers are emptied at the right point in time to make the test deterministic.

Fixes: #6648.
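
The two fixes described above can be illustrated with a minimal sketch. This is not connlib's actual code: `Credentials`, `Allocation` (reduced here to a bare credentials holder), `Relays`, `Upsert`, `upsert`, and `on_refresh_failed` are hypothetical names chosen for illustration, and relays are keyed by a plain string ID. The sketch shows only the decision logic: replace local state when the portal hands out different credentials for a known relay, and drop a relay whose `REFRESH` probe failed.

```rust
use std::collections::HashMap;

/// Hypothetical credentials the portal hands out for one relay.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Credentials {
    username: String,
    password: String,
}

/// Hypothetical stand-in for the local per-relay state.
#[derive(Debug)]
struct Allocation {
    credentials: Credentials,
}

/// What happened when a relay from an `init` message was applied.
#[derive(Debug, PartialEq, Eq)]
enum Upsert {
    Added,
    Replaced,
    Unchanged,
}

#[derive(Default)]
struct Relays {
    allocations: HashMap<String, Allocation>,
}

impl Relays {
    /// Apply a relay received from the portal.
    /// A known relay with different credentials is treated as rebooted:
    /// the old local state is discarded and a fresh one is created.
    fn upsert(&mut self, relay_id: &str, credentials: Credentials) -> Upsert {
        let same = self
            .allocations
            .get(relay_id)
            .map(|existing| existing.credentials == credentials);

        match same {
            Some(true) => Upsert::Unchanged,
            Some(false) => {
                self.allocations
                    .insert(relay_id.to_owned(), Allocation { credentials });
                Upsert::Replaced
            }
            None => {
                self.allocations
                    .insert(relay_id.to_owned(), Allocation { credentials });
                Upsert::Added
            }
        }
    }

    /// Drop a relay whose `REFRESH` probe timed out or failed.
    /// Returns `true` if the relay was known and has been removed.
    fn on_refresh_failed(&mut self, relay_id: &str) -> bool {
        self.allocations.remove(relay_id).is_some()
    }
}

fn main() {
    let mut relays = Relays::default();
    let before_reboot = Credentials {
        username: "user".to_owned(),
        password: "before".to_owned(),
    };
    let after_reboot = Credentials {
        username: "user".to_owned(),
        password: "after".to_owned(),
    };

    println!("{:?}", relays.upsert("relay-1", before_reboot.clone()));
    println!("{:?}", relays.upsert("relay-1", before_reboot));
    println!("{:?}", relays.upsert("relay-1", after_reboot));
    println!("removed: {}", relays.on_refresh_failed("relay-1"));
}
```

Comparing the full credentials rather than only the relay's identity is what distinguishes a rebooted relay from an unchanged one in this sketch. How the real implementation identifies relays and tears down old state may differ.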
Connlib
Firezone's connectivity library shared by all clients.
Building Connlib
You shouldn't need to build connlib directly; it's typically built as a dependency of one of the other Firezone components. See READMEs in those directories for relevant instructions.