While debugging the "Failed to start replication connection" errors we see on deploy, I discovered a bug in the process discovery mechanism that new nodes use to link to the existing replication connection. When restarting an existing `domain` container that isn't handling replication, we see this:
```
{"message":"Elixir.Domain.Events.ReplicationConnection: Publication tables are up to date","time":"2025-07-22T07:18:45.948Z","domain":["elixir"],"application":"domain","severity":"INFO","logging.googleapis.com/sourceLocation":{"function":"Elixir.Domain.Events.ReplicationConnection.handle_publication_tables_diff/2","line":2,"file":"lib/domain/events/replication_connection.ex"},"logging.googleapis.com/operation":{"producer":"#PID<0.764.0>"}}
{"message":"notifier only receiving messages from its own node, functionality may be degraded","time":"2025-07-22T07:18:45.942Z","domain":["elixir"],"application":"oban","source":"oban","severity":"DEBUG","event":"notifier:switch","connectivity_status":"solitary","logging.googleapis.com/sourceLocation":{"function":"Elixir.Oban.Telemetry.log/2","line":624,"file":"lib/oban/telemetry.ex"},"logging.googleapis.com/operation":{"producer":"#PID<0.756.0>"}}
{"message":"Elixir.Domain.ChangeLogs.ReplicationConnection: Publication tables are up to date","time":"2025-07-22T07:18:45.952Z","domain":["elixir"],"application":"domain","severity":"INFO","logging.googleapis.com/sourceLocation":{"function":"Elixir.Domain.ChangeLogs.ReplicationConnection.handle_publication_tables_diff/2","line":2,"file":"lib/domain/change_logs/replication_connection.ex"},"logging.googleapis.com/operation":{"producer":"#PID<0.763.0>"}}
{"message":"Elixir.Domain.ChangeLogs.ReplicationConnection: Starting replication slot change_logs_slot","time":"2025-07-22T07:18:45.966Z","state":"[REDACTED]","domain":["elixir"],"application":"domain","severity":"INFO","logging.googleapis.com/sourceLocation":{"function":"Elixir.Domain.ChangeLogs.ReplicationConnection.handle_result/2","line":2,"file":"lib/domain/change_logs/replication_connection.ex"},"logging.googleapis.com/operation":{"producer":"#PID<0.763.0>"}}
{"message":"Elixir.Domain.Events.ReplicationConnection: Starting replication slot events_slot","time":"2025-07-22T07:18:45.966Z","state":"[REDACTED]","domain":["elixir"],"application":"domain","severity":"INFO","logging.googleapis.com/sourceLocation":{"function":"Elixir.Domain.Events.ReplicationConnection.handle_result/2","line":2,"file":"lib/domain/events/replication_connection.ex"},"logging.googleapis.com/operation":{"producer":"#PID<0.764.0>"}}
{"message":"Elixir.Domain.ChangeLogs.ReplicationConnection: Replication connection disconnected","time":"2025-07-22T07:18:45.977Z","domain":["elixir"],"application":"domain","counter":0,"severity":"INFO","logging.googleapis.com/sourceLocation":{"function":"Elixir.Domain.ChangeLogs.ReplicationConnection.handle_disconnect/1","line":2,"file":"lib/domain/change_logs/replication_connection.ex"},"logging.googleapis.com/operation":{"producer":"#PID<0.763.0>"}}
{"message":"Elixir.Domain.Events.ReplicationConnection: Replication connection disconnected","time":"2025-07-22T07:18:45.977Z","domain":["elixir"],"application":"domain","counter":0,"severity":"INFO","logging.googleapis.com/sourceLocation":{"function":"Elixir.Domain.Events.ReplicationConnection.handle_disconnect/1","line":2,"file":"lib/domain/events/replication_connection.ex"},"logging.googleapis.com/operation":{"producer":"#PID<0.764.0>"}}
{"message":"Failed to start replication connection Elixir.Domain.Events.ReplicationConnection","reason":"%Postgrex.Error{message: nil, postgres: %{code: :object_in_use, line: \"607\", message: \"replication slot \\\"events_slot\\\" is active for PID 135123\", file: \"slot.c\", unknown: \"ERROR\", severity: \"ERROR\", pg_code: \"55006\", routine: \"ReplicationSlotAcquire\"}, connection_id: 136400, query: nil}","time":"2025-07-22T07:18:45.978Z","domain":["elixir"],"application":"domain","max_retries":10,"severity":"INFO","logging.googleapis.com/sourceLocation":{"function":"Elixir.Domain.Replication.Manager.handle_info/2","line":41,"file":"lib/domain/replication/manager.ex"},"logging.googleapis.com/operation":{"producer":"#PID<0.761.0>"},"retries":0}
{"message":"Failed to start replication connection Elixir.Domain.ChangeLogs.ReplicationConnection","reason":"%Postgrex.Error{message: nil, postgres: %{code: :object_in_use, line: \"607\", message: \"replication slot \\\"change_logs_slot\\\" is active for PID 135124\", file: \"slot.c\", unknown: \"ERROR\", severity: \"ERROR\", pg_code: \"55006\", routine: \"ReplicationSlotAcquire\"}, connection_id: 136401, query: nil}","time":"2025-07-22T07:18:45.978Z","domain":["elixir"],"application":"domain","max_retries":10,"severity":"INFO","logging.googleapis.com/sourceLocation":{"function":"Elixir.Domain.Replication.Manager.handle_info/2","line":41,"file":"lib/domain/replication/manager.ex"},"logging.googleapis.com/operation":{"producer":"#PID<0.760.0>"},"retries":0}
```
Previously, we relied on `start_link` to tell us that an existing pid was already running in the cluster. The output above shows that isn't always reliable.
Instead, we now explicitly check where the running process is and, if it's alive, link to it; if it isn't, we try to start the connection ourselves.
Once linked, we also react to that process being torn down, producing a first-one-wins race in which all nodes attempt to start replication, minimizing downtime during deploys.
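The fix, roughly (a minimal sketch only; the `:global` registration, function names, and `start_link` arity are assumptions, not the actual `Domain.Replication.Manager` code):
```
defmodule Replication.ManagerSketch do
  # Hypothetical illustration of the discovery logic described above.
  # Assumes each replication connection registers under a :global name.
  def ensure_connection(module) do
    case :global.whereis_name(module) do
      pid when is_pid(pid) ->
        # Already running somewhere in the cluster: link so we are
        # notified when it exits.
        Process.link(pid)
        {:ok, pid}

      :undefined ->
        # Nothing running: try to start it ourselves.
        case module.start_link([]) do
          {:ok, pid} ->
            {:ok, pid}

          {:error, {:already_started, pid}} ->
            # Another node won the race; link to the winner instead.
            Process.link(pid)
            {:ok, pid}
        end
    end
  end
end
```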
Now that https://github.com/firezone/infra/pull/94 is in place, I verified that the BEAM handles SIGTERM properly, so a deployment now goes like this (a code sketch follows the list):
1. GCP brings up the new nodes; they all find the existing pid and link to it
2. GCP sends SIGTERM to the old nodes
3. The _actual_ pid receives SIGTERM and exits
4. This exit propagates to all other nodes due to the link
5. Some node will "win", and the others will end up linking to it
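Continuing the hypothetical sketch above, a manager that traps exits turns step 4 into a restart race (retry/backoff on the `object_in_use` errors from the logs is omitted here; the real manager retries up to `max_retries` times):
```
defmodule Replication.ManagerSketch.Server do
  use GenServer

  def init(module) do
    # Trap exits so a linked connection dying (e.g. the old node
    # receiving SIGTERM) is delivered as a message, not a crash.
    Process.flag(:trap_exit, true)
    {:ok, module, {:continue, :connect}}
  end

  def handle_continue(:connect, module) do
    {:ok, _pid} = Replication.ManagerSketch.ensure_connection(module)
    {:noreply, module}
  end

  def handle_info({:EXIT, _pid, _reason}, module) do
    # The connection we were linked to went down; every surviving
    # node races to start a replacement, and the losers link to
    # whichever node won.
    {:ok, _pid} = Replication.ManagerSketch.ensure_connection(module)
    {:noreply, module}
  end
end
```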
Fixes #9911
Firezone Elixir Development
Before reading this doc, make sure you've read through our CONTRIBUTING guide.
Getting Started
This is not an in-depth guide for setting up all dependencies, but it should give you a starting point.
Prerequisites:
- All prerequisites in the CONTRIBUTING guide
- Install ASDF and all plugins/tools from `.tool-versions` in the top level of the Firezone repo
- Install pnpm
From the top-level directory of the Firezone repo, start the Postgres container:
```
docker compose up -d postgres
```
Inside the `/elixir` directory, run the following commands:
```
# Install dependencies
# --------------------
> mix deps.get

# Install npm packages and build assets
# -------------------------------------
> cd apps/web/
> mix setup

# Setup and seed the DB
# ---------------------
> cd ../..
> mix ecto.seed

# Start all of the portal Elixir apps:
# ------------------------------------
> iex -S mix
```
The web and api applications should now be running:
- Web -> http://localhost:13000/
- API -> http://localhost:13001/
Stripe integration for local development
Prerequisites:
- Stripe account
- Stripe CLI
Steps:

- Use static seeds to provision the account ID that corresponds to the staging setup on Stripe:

  ```
  STATIC_SEEDS=true mix do ecto.reset, ecto.seed
  ```

- Start the Stripe CLI webhook proxy:

  ```
  stripe listen --forward-to localhost:13001/integrations/stripe/webhooks
  ```

- Start the Phoenix server with billing enabled from the `elixir/` folder, using a test mode token:

  ```
  cd elixir/
  BILLING_ENABLED=true STRIPE_SECRET_KEY="...copy from stripe dashboard..." STRIPE_WEBHOOK_SIGNING_SECRET="...copy from stripe cli tool.." mix phx.server
  ```
When updating the billing plan in Stripe, refer to the Stripe Testing Docs for how to add test payment info.
WorkOS integration for local development
Prerequisites:
- WorkOS account
WorkOS is currently being used for JumpCloud directory sync integration. This allows JumpCloud users to use SCIM on the JumpCloud side, rather than having to give Firezone an admin JumpCloud API token.
Connecting WorkOS in dev mode for manual testing
If you are not planning to use the JumpCloud provider in your local development setup, then no additional setup is needed. However, if you need to use the JumpCloud provider locally, you will need to obtain an API Key and Client ID from the WorkOS Dashboard.
After obtaining WorkOS API credentials, make sure they are set as environment variables when starting your local dev instance of Firezone. As an example:
```
WORKOS_API_KEY="..." WORKOS_CLIENT_ID="..." mix phx.server
```
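For context, environment variables like these are typically read at boot in `config/runtime.exs`. A hypothetical sketch (the application and key names below are invented for illustration and do not match the real Firezone config):
```
# config/runtime.exs -- illustrative only
import Config

config :domain, :workos,
  api_key: System.get_env("WORKOS_API_KEY"),
  client_id: System.get_env("WORKOS_CLIENT_ID")
```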
Acceptance tests
You can disable headless mode for the browser by adding

```
@tag debug: true
feature ....
```
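For example, a complete (hypothetical) Wallaby feature test with the tag applied; the `Web.AcceptanceCase` template name and the page assertions are assumptions:
```
defmodule Web.Acceptance.DebugExampleTest do
  use Web.AcceptanceCase, async: true

  # With debug: true, the browser runs with headless mode disabled
  # so you can watch the test interact with the page.
  @tag debug: true
  feature "renders the sign-in page", %{session: session} do
    session
    |> visit("/")
    |> assert_has(Wallaby.Query.text("Sign in"))
  end
end
```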