Jamil 544b6455eb fix(portal): ensure cluster state heals (#9319)
We use `libcluster`, a common Elixir library, for node discovery. It's a
very lightweight wrapper around the standard `Node.connect/1`
functionality that Elixir exposes on top of distributed Erlang.

It supports custom cluster formation strategies, and we've implemented
one based on fetching the list of nodes from the GCP API, and then
attempting to connect to them.

Unfortunately, our implementation had two bugs that prevented the
cluster from healing in the following two cases:

- Once we successfully connected to nodes, we recorded them in an
internal state variable as connected, forever. If we later lost the
connection to these nodes (such as during a deploy where the Elixir
nodes don't come up in time, causing the instance group manager to reap
them), the state was never updated, and we never reconnected to the
lost nodes.
- If we failed to fetch the list of nodes more than 10 times (a fetch is
attempted every 10 seconds, so after roughly 100 seconds), we stopped
scheduling the timer that loads the nodes again.

The first issue is fixed by removing our kept state altogether - this is
what libcluster is for. We can simply try to connect to the most recent
list of nodes returned from Google's API, and we now log a warning for
any nodes that don't connect.
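
As a rough illustration of the stateless approach (not the actual change in
this commit), the sketch below tries every node in the freshly fetched list
and logs a warning for each one that does not connect. The module name
`ConnectSketch`, the function `connect_all/2`, and the injectable
`connect_fun` argument are assumptions made for this example only.

```elixir
defmodule ConnectSketch do
  require Logger

  # Hypothetical helper: attempts a connection to every node in the most
  # recent list and returns the nodes that could not be reached.
  # Nothing is remembered between calls, so a node that dropped out
  # earlier is simply retried the next time the list is fetched.
  def connect_all(nodes, connect_fun \\ &Node.connect/1) do
    # Node.connect/1 returns true, false, or :ignored; only true is success.
    failed = Enum.reject(nodes, fn name -> connect_fun.(name) == true end)

    Enum.each(failed, fn name ->
      Logger.warning("Could not connect to node #{inspect(name)}")
    end)

    failed
  end
end
```

Because no list of "already connected" nodes is kept, there is no state that
can go stale when a node disappears and later comes back.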

The second issue is fixed by always scheduling the timer, forever,
regardless of the state of the Google API.
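
A minimal sketch of the "always reschedule" idea, assuming a plain
`GenServer` rather than the real strategy module. The names `PollSketch`,
`:fetch_fun`, `:on_nodes`, and `:interval_ms` are hypothetical and exist only
for this example. The point is that the timer is re-armed after every poll,
whether or not the fetch succeeded.

```elixir
defmodule PollSketch do
  use GenServer
  require Logger

  # Default poll interval, matching the 10 seconds mentioned above.
  @interval_ms 10_000

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts) do
    state = %{
      # Hypothetical: a zero-arity function returning {:ok, nodes} or {:error, reason}.
      fetch_fun: Keyword.fetch!(opts, :fetch_fun),
      # Hypothetical: called with the node list after a successful fetch.
      on_nodes: Keyword.get(opts, :on_nodes, fn _nodes -> :ok end),
      interval_ms: Keyword.get(opts, :interval_ms, @interval_ms)
    }

    {:ok, state, {:continue, :poll}}
  end

  @impl true
  def handle_continue(:poll, state), do: {:noreply, poll(state)}

  @impl true
  def handle_info(:poll, state), do: {:noreply, poll(state)}

  defp poll(state) do
    case state.fetch_fun.() do
      {:ok, nodes} ->
        state.on_nodes.(nodes)

      {:error, reason} ->
        Logger.warning("Failed to fetch node list: #{inspect(reason)}")
    end

    # Re-arm the timer on both branches; there is no failure counter
    # that could stop polling.
    Process.send_after(self(), :poll, state.interval_ms)
    state
  end
end
```

If `fetch_fun` raises instead of returning an error tuple, the process
crashes before the timer is re-armed, so a sketch like this would rely on
its supervisor to restart it.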

Fixes #8660 
Fixes #8698
2025-05-31 05:01:52 +00:00