When deploying, new Elixir nodes are spun up before the old ones are
brought down. This ensures that cluster state is merged into the new
nodes before the old ones go away.
This also means, however, that the existing WAL consumer is still up
when our new one tries to come online.
Normally, this isn't an issue, because we find the old pid and return it
with `{:ok, existing_pid}`. When that VM goes away, the Supervisor(s) of
the new application notice and restart it.
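In Elixir terms, the lookup can work roughly like this. This is only a
sketch: the `{:global, ...}` registration and the `start_link/1` shape
are assumptions, not necessarily how the process is registered here.

```elixir
# Sketch only: the registration strategy is an assumption.
def start_link(opts) do
  case GenServer.start_link(__MODULE__, opts, name: {:global, __MODULE__}) do
    {:ok, pid} ->
      {:ok, pid}

    # A consumer is already registered somewhere in the cluster:
    # hand back its pid instead of treating this as a failure.
    {:error, {:already_started, existing_pid}} ->
      {:ok, existing_pid}
  end
end
```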
However, if the cluster state diverges or is inconsistent during this
period, we may fail to find the existing pid, and try to start a new
ReplicationConnection. If the old pid is still active, this will fail
because there's a mutex on the connection. The original implementation
was designed to handle this case using the Supervisor with a
`:transient` restart policy.
What the author failed to understand is that the restart policy applies
only to _restarts_, not to initial starts. So if all of the new
application servers fail to find the old pid (which is still connected),
they all fail to come up, and we won't consume the WAL.
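To make that concrete, the original arrangement amounted to something
like this (a sketch; the real child spec may differ):

```elixir
children = [
  %{
    id: ReplicationConnection,
    # If start_link returns an error here, during the supervisor's
    # *initial* start, the whole supervisor fails to start. The
    # :transient policy only governs restarts after a successful start.
    start: {ReplicationConnection, :start_link, [opts]},
    restart: :transient
  }
]

Supervisor.start_link(children, strategy: :one_for_one)
```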
This is fixed with a `ReplicationConnectionManager` that always comes up
fine, and then simply tries to start a `ReplicationConnection` every
30s, giving up after 5 minutes if it can't start one or find an existing
one. Giving up means crashing, which causes the Supervisor to restart us
and surfaces the failure so we're notified.
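A minimal sketch of that manager, with the intervals taken from the
description above (module and message names are assumptions):

```elixir
defmodule ReplicationConnectionManager do
  use GenServer

  @retry_interval :timer.seconds(30)
  @give_up_after :timer.minutes(5)

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(opts) do
    # init/1 cannot fail, so the application always boots; connecting is
    # deferred to the retry loop below.
    send(self(), :try_start)
    {:ok, %{opts: opts, deadline: System.monotonic_time(:millisecond) + @give_up_after}}
  end

  @impl true
  def handle_info(:try_start, state) do
    case ReplicationConnection.start_link(state.opts) do
      # Started a fresh connection, or found the existing pid
      # (see the start_link sketch above).
      {:ok, _pid} ->
        {:noreply, state}

      {:error, _reason} ->
        if System.monotonic_time(:millisecond) > state.deadline do
          # Give up: crash so the Supervisor restarts us and the
          # failure is surfaced.
          {:stop, :replication_connection_unavailable, state}
        else
          Process.send_after(self(), :try_start, @retry_interval)
          {:noreply, state}
        end
    end
  end
end
```

Because the manager links to the connection it starts, a later crash of
the connection takes the manager down with it, and the Supervisor
restart brings the retry loop back.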