mirror of
https://github.com/outbackdingo/firezone.git
synced 2026-01-27 10:18:54 +00:00
We use `libcluster`, a common Elixir library, for node discovery. It's a very lightweight wrapper around the standard `Node.connect` functionality. It supports custom cluster formation strategies, and we've implemented one that fetches the list of nodes from the GCP API and then attempts to connect to them.

Unfortunately, our implementation had two bugs that prevented the cluster from healing in the following two cases:

- Once we successfully connected to nodes, we recorded them in an internal state variable as connected, forever. If we later lost the connection to these nodes (such as during a deploy where the Elixir nodes don't come up in time, causing the instance group manager to reap them), the state was never updated, and we never reconnected to the lost nodes.
- If we failed to fetch the list of nodes more than 10 times (once every 10 seconds, so after 100 seconds), we failed to schedule the timer to load the nodes again.

The first issue is fixed by removing our kept state altogether; this is what libcluster is for. We can simply try to connect to the most recent list of nodes returned from Google's API, and we now log a warning for any nodes that don't connect.

The second issue is fixed by always scheduling the timer, forever, regardless of the state of the Google API.

Fixes #8660
Fixes #8698
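
For illustration, the two fixes described above can be sketched roughly as follows. This is a minimal, hypothetical sketch and not the actual strategy module: the module name `ReconnectSketch`, the `fetch_nodes` callback, and the `connect_all/2` helper are assumptions made for the example, and it calls `Node.connect/1` directly rather than going through libcluster's own helpers.

```elixir
defmodule ReconnectSketch do
  # Hypothetical sketch of a stateless, always-rescheduling discovery loop.
  use GenServer
  require Logger

  # Polling interval taken from the description above (every 10 seconds).
  @poll_interval_ms 10_000

  # `fetch_nodes` is an assumed zero-arity function returning
  # {:ok, [node]} or {:error, reason} (for example, a wrapper around the GCP API).
  def start_link(fetch_nodes), do: GenServer.start_link(__MODULE__, fetch_nodes)

  @impl true
  def init(fetch_nodes) do
    send(self(), :load)
    {:ok, fetch_nodes}
  end

  @impl true
  def handle_info(:load, fetch_nodes) do
    _ = connect_all(fetch_nodes)

    # Fix 2: re-arm the timer unconditionally, whether or not the fetch succeeded.
    Process.send_after(self(), :load, @poll_interval_ms)
    {:noreply, fetch_nodes}
  end

  # Fix 1: keep no "already connected" state. Every pass tries the full,
  # most recent node list and returns the nodes that failed to connect.
  def connect_all(fetch_nodes, connect \\ &Node.connect/1) do
    case fetch_nodes.() do
      {:ok, nodes} ->
        failed = Enum.reject(nodes, fn node -> connect.(node) == true end)

        Enum.each(failed, fn node ->
          Logger.warning("Failed to connect to node #{inspect(node)}")
        end)

        {:ok, failed}

      {:error, reason} ->
        Logger.warning("Failed to fetch node list: #{inspect(reason)}")
        {:error, reason}
    end
  end
end
```

Because `connect_all/2` takes the connect function as an argument, the reconnect logic can be exercised without a real cluster, for example by passing `fn node -> node == :"a@host" end` in place of `Node.connect/1`.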