While debugging the "Failed to start replication connection" errors we see on deploy, I discovered a bug in the process discovery mechanism that new nodes use to link to the existing replication connection. When restarting an existing `domain` container that isn't handling replication, we see this:
```
{"message":"Elixir.Domain.Events.ReplicationConnection: Publication tables are up to date","time":"2025-07-22T07:18:45.948Z","domain":["elixir"],"application":"domain","severity":"INFO","logging.googleapis.com/sourceLocation":{"function":"Elixir.Domain.Events.ReplicationConnection.handle_publication_tables_diff/2","line":2,"file":"lib/domain/events/replication_connection.ex"},"logging.googleapis.com/operation":{"producer":"#PID<0.764.0>"}}
{"message":"notifier only receiving messages from its own node, functionality may be degraded","time":"2025-07-22T07:18:45.942Z","domain":["elixir"],"application":"oban","source":"oban","severity":"DEBUG","event":"notifier:switch","connectivity_status":"solitary","logging.googleapis.com/sourceLocation":{"function":"Elixir.Oban.Telemetry.log/2","line":624,"file":"lib/oban/telemetry.ex"},"logging.googleapis.com/operation":{"producer":"#PID<0.756.0>"}}
{"message":"Elixir.Domain.ChangeLogs.ReplicationConnection: Publication tables are up to date","time":"2025-07-22T07:18:45.952Z","domain":["elixir"],"application":"domain","severity":"INFO","logging.googleapis.com/sourceLocation":{"function":"Elixir.Domain.ChangeLogs.ReplicationConnection.handle_publication_tables_diff/2","line":2,"file":"lib/domain/change_logs/replication_connection.ex"},"logging.googleapis.com/operation":{"producer":"#PID<0.763.0>"}}
{"message":"Elixir.Domain.ChangeLogs.ReplicationConnection: Starting replication slot change_logs_slot","time":"2025-07-22T07:18:45.966Z","state":"[REDACTED]","domain":["elixir"],"application":"domain","severity":"INFO","logging.googleapis.com/sourceLocation":{"function":"Elixir.Domain.ChangeLogs.ReplicationConnection.handle_result/2","line":2,"file":"lib/domain/change_logs/replication_connection.ex"},"logging.googleapis.com/operation":{"producer":"#PID<0.763.0>"}}
{"message":"Elixir.Domain.Events.ReplicationConnection: Starting replication slot events_slot","time":"2025-07-22T07:18:45.966Z","state":"[REDACTED]","domain":["elixir"],"application":"domain","severity":"INFO","logging.googleapis.com/sourceLocation":{"function":"Elixir.Domain.Events.ReplicationConnection.handle_result/2","line":2,"file":"lib/domain/events/replication_connection.ex"},"logging.googleapis.com/operation":{"producer":"#PID<0.764.0>"}}
{"message":"Elixir.Domain.ChangeLogs.ReplicationConnection: Replication connection disconnected","time":"2025-07-22T07:18:45.977Z","domain":["elixir"],"application":"domain","counter":0,"severity":"INFO","logging.googleapis.com/sourceLocation":{"function":"Elixir.Domain.ChangeLogs.ReplicationConnection.handle_disconnect/1","line":2,"file":"lib/domain/change_logs/replication_connection.ex"},"logging.googleapis.com/operation":{"producer":"#PID<0.763.0>"}}
{"message":"Elixir.Domain.Events.ReplicationConnection: Replication connection disconnected","time":"2025-07-22T07:18:45.977Z","domain":["elixir"],"application":"domain","counter":0,"severity":"INFO","logging.googleapis.com/sourceLocation":{"function":"Elixir.Domain.Events.ReplicationConnection.handle_disconnect/1","line":2,"file":"lib/domain/events/replication_connection.ex"},"logging.googleapis.com/operation":{"producer":"#PID<0.764.0>"}}
{"message":"Failed to start replication connection Elixir.Domain.Events.ReplicationConnection","reason":"%Postgrex.Error{message: nil, postgres: %{code: :object_in_use, line: \"607\", message: \"replication slot \\\"events_slot\\\" is active for PID 135123\", file: \"slot.c\", unknown: \"ERROR\", severity: \"ERROR\", pg_code: \"55006\", routine: \"ReplicationSlotAcquire\"}, connection_id: 136400, query: nil}","time":"2025-07-22T07:18:45.978Z","domain":["elixir"],"application":"domain","max_retries":10,"severity":"INFO","logging.googleapis.com/sourceLocation":{"function":"Elixir.Domain.Replication.Manager.handle_info/2","line":41,"file":"lib/domain/replication/manager.ex"},"logging.googleapis.com/operation":{"producer":"#PID<0.761.0>"},"retries":0}
{"message":"Failed to start replication connection Elixir.Domain.ChangeLogs.ReplicationConnection","reason":"%Postgrex.Error{message: nil, postgres: %{code: :object_in_use, line: \"607\", message: \"replication slot \\\"change_logs_slot\\\" is active for PID 135124\", file: \"slot.c\", unknown: \"ERROR\", severity: \"ERROR\", pg_code: \"55006\", routine: \"ReplicationSlotAcquire\"}, connection_id: 136401, query: nil}","time":"2025-07-22T07:18:45.978Z","domain":["elixir"],"application":"domain","max_retries":10,"severity":"INFO","logging.googleapis.com/sourceLocation":{"function":"Elixir.Domain.Replication.Manager.handle_info/2","line":41,"file":"lib/domain/replication/manager.ex"},"logging.googleapis.com/operation":{"producer":"#PID<0.760.0>"},"retries":0}
```
Previously, we relied on `start_link` to tell us that an existing pid was already running in the cluster. The output above shows that isn't always reliable.
Instead, we now explicitly check where the running process is and, if it's alive, link to it; if it isn't, we try to start the connection ourselves.
Once linked, we also react to that process being torn down, producing a first-one-wins race in which all nodes attempt to start replication, minimizing downtime during deploys.
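The fix, roughly (a minimal sketch only; the `:global` registration, function names, and `start_link` arity are assumptions, not the actual `Domain.Replication.Manager` code):
```
defmodule Replication.ManagerSketch do
  # Hypothetical illustration of the discovery logic described above.
  # Assumes each replication connection registers under a :global name.
  def ensure_connection(module) do
    case :global.whereis_name(module) do
      pid when is_pid(pid) ->
        # Already running somewhere in the cluster: link so we are
        # notified when it exits.
        Process.link(pid)
        {:ok, pid}

      :undefined ->
        # Nothing running: try to start it ourselves.
        case module.start_link([]) do
          {:ok, pid} ->
            {:ok, pid}

          {:error, {:already_started, pid}} ->
            # Another node won the race; link to the winner instead.
            Process.link(pid)
            {:ok, pid}
        end
    end
  end
end
```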
Now that https://github.com/firezone/infra/pull/94 is in place, I verified that the BEAM handles SIGTERM properly, so a deployment now goes like this (a code sketch follows the list):
1. GCP brings up the new nodes; they all find the existing pid and link to it
2. GCP sends SIGTERM to the old nodes
3. The _actual_ pid receives SIGTERM and exits
4. This exit propagates to all other nodes due to the link
5. Some node will "win", and the others will end up linking to it
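Continuing the hypothetical sketch above, a manager that traps exits turns step 4 into a restart race (retry/backoff on the `object_in_use` errors from the logs is omitted here; the real manager retries up to `max_retries` times):
```
defmodule Replication.ManagerSketch.Server do
  use GenServer

  def init(module) do
    # Trap exits so a linked connection dying (e.g. the old node
    # receiving SIGTERM) is delivered as a message, not a crash.
    Process.flag(:trap_exit, true)
    {:ok, module, {:continue, :connect}}
  end

  def handle_continue(:connect, module) do
    {:ok, _pid} = Replication.ManagerSketch.ensure_connection(module)
    {:noreply, module}
  end

  def handle_info({:EXIT, _pid, _reason}, module) do
    # The connection we were linked to went down; every surviving
    # node races to start a replacement, and the losers link to
    # whichever node won.
    {:ok, _pid} = Replication.ManagerSketch.ensure_connection(module)
    {:noreply, module}
  end
end
```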
Fixes #9911
Firezone Elixir Development
Before reading this doc, make sure you've read through our CONTRIBUTING guide.
Getting Started
This is not an in-depth guide for setting up all dependencies, but it should give you a starting point.
Prerequisites:
- All prerequisites in the CONTRIBUTING guide
- Install ASDF and all plugins/tools from `.tool-versions` in the top level of the Firezone repo
- Install pnpm
From the top-level directory of the Firezone repo, start the Postgres container:
```
docker compose up -d postgres
```
Inside the `/elixir` directory, run the following commands:
```
# Install dependencies
# --------------------
> mix deps.get

# Install npm packages and build assets
# -------------------------------------
> cd apps/web/
> mix setup

# Setup and seed the DB
# ---------------------
> cd ../..
> mix ecto.seed

# Start all of the portal Elixir apps:
# ------------------------------------
> iex -S mix
```
The web and api applications should now be running:
- Web -> http://localhost:13000/
- API -> http://localhost:13001/
Stripe integration for local development
Prerequisites:
- Stripe account
- Stripe CLI
Steps:

- Use static seeds to provision the account ID that corresponds to the staging setup on Stripe:

  ```
  STATIC_SEEDS=true mix do ecto.reset, ecto.seed
  ```

- Start the Stripe CLI webhook proxy:

  ```
  stripe listen --forward-to localhost:13001/integrations/stripe/webhooks
  ```

- Start the Phoenix server with billing enabled from the `elixir/` folder, using a test mode token:

  ```
  cd elixir/
  BILLING_ENABLED=true STRIPE_SECRET_KEY="...copy from stripe dashboard..." STRIPE_WEBHOOK_SIGNING_SECRET="...copy from stripe cli tool.." mix phx.server
  ```
When updating the billing plan in Stripe, refer to the Stripe Testing Docs for how to add test payment info.
WorkOS integration for local development
Prerequisites:
- WorkOS account
WorkOS is currently being used for JumpCloud directory sync integration. This allows JumpCloud users to use SCIM on the JumpCloud side, rather than having to give Firezone an admin JumpCloud API token.
Connecting WorkOS in dev mode for manual testing
If you are not planning to use the JumpCloud provider in your local development setup, then no additional setup is needed. However, if you need to use the JumpCloud provider locally, you will need to obtain an API Key and Client ID from the WorkOS Dashboard.
After obtaining WorkOS API credentials, make sure they are set as environment variables when starting your local dev instance of Firezone. As an example:
```
WORKOS_API_KEY="..." WORKOS_CLIENT_ID="..." mix phx.server
```
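For context, environment variables like these are typically read at boot in `config/runtime.exs`. A hypothetical sketch (the application and key names below are invented for illustration and do not match the real Firezone config):
```
# config/runtime.exs -- illustrative only
import Config

config :domain, :workos,
  api_key: System.get_env("WORKOS_API_KEY"),
  client_id: System.get_env("WORKOS_CLIENT_ID")
```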
Acceptance tests
You can disable headless mode for the browser by adding

```
@tag debug: true
feature ....
```
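For example, a complete (hypothetical) Wallaby feature test with the tag applied; the `Web.AcceptanceCase` template name and the page assertions are assumptions:
```
defmodule Web.Acceptance.DebugExampleTest do
  use Web.AcceptanceCase, async: true

  # With debug: true, the browser runs with headless mode disabled
  # so you can watch the test interact with the page.
  @tag debug: true
  feature "renders the sign-in page", %{session: session} do
    session
    |> visit("/")
    |> assert_has(Wallaby.Query.text("Sign in"))
  end
end
```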