firezone

mirror of https://github.com/outbackdingo/firezone.git synced 2026-01-27 10:18:54 +00:00

Author	SHA1	Message	Date
Brian Manifold	80e1c3255f	refactor(portal): refactor billing event handler (#10064 ) Why: * There were intermittent issues with account updates from Stripe events, specifically when an account would update its subscription from Starter to Team. The cause was that Stripe does not guarantee the order of delivery for its webhook events. At times we were seeing and responding to an event that was a few seconds old after processing a newer event. This would have the effect of quickly transitioning an account from Team back to Starter. This commit refactors our event handler and adds a `processed_stripe_events` DB table to make sure we don't process duplicate events, and to prevent processing an event that was created prior to the last event we've processed for a given account. * Along with refactoring the billing event handling, the Stripe mock module has also been refactored to better reflect real Stripe objects. Related: #8668	2025-08-05 16:56:52 +00:00
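The duplicate and out-of-order guard described in 80e1c3255f can be sketched roughly as follows. This is a minimal Python illustration and not the portal's Elixir code: an in-memory object stands in for the `processed_stripe_events` table, and the names `EventLog`, `should_process`, and the event fields are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class EventLog:
    """In-memory stand-in for a processed-events table (hypothetical shape)."""

    seen_ids: set = field(default_factory=set)
    # account_id -> `created` timestamp of the newest event processed so far
    latest_created: dict = field(default_factory=dict)

    def should_process(self, event_id: str, account_id: str, created: int) -> bool:
        # Reject duplicates: the same webhook event may be delivered twice.
        if event_id in self.seen_ids:
            return False
        # Reject stale events: created before the last event processed
        # for this account, so applying them would roll state backwards.
        if created < self.latest_created.get(account_id, float("-inf")):
            return False
        self.seen_ids.add(event_id)
        self.latest_created[account_id] = created
        return True
```

With this guard, a "Starter" event that arrives after a newer "Team" event for the same account is skipped instead of reverting the plan.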
Jamil	cacb44f7bb	test(portal): fix flaky acceptance auth test (#10140 ) Occasionally, this fails because the element is found, but not visible due to a race condition. To fix this, we assert that the element should be visible before clicking on it. Fixes https://github.com/firezone/firezone/actions/runs/16751908154/job/47424125321	2025-08-05 14:53:18 +00:00
Jamil	54d91e2004	fix(portal): don't send reject_access for remaining flows (#10071 ) This fixes a simple logic bug where we were mistakenly reacting to a flow deletion event where flows still existed in the cache by sending `reject_access`. This fixes that bug, and adds more comprehensive logging to help diagnose issues like this more quickly in the future. This PR also fixes the following issues found during the investigation: - We were redundantly reacting to Token deletion in the channel pids. This is unnecessary: we send a global socket disconnect from the Token hook module instead. - We had a bug that would crash the WAL consumer if a "global" token (i.e. relay) was deleted or expired - these have no `account_id`. - We now always use `min(max(all_conforming_polices_expiration), token.expires_at)` when setting expiration on a new flow to minimize the possibility for access churn. - We now check to ensure the token and gateway are still undeleted when re-authorizing a given flow. This prevents us from failing to send `reject_access` when a token or gateway is deleted corresponding to a flow, but the other entities would have granted access. Related: https://firezone.statuspage.io/incidents/xrsm13tml3dh Related: #10068 Related: #9501	2025-08-01 00:03:00 +00:00
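The expiration rule quoted in 54d91e2004, `min(max(all_conforming_polices_expiration), token.expires_at)`, can be illustrated with a small Python sketch. The function name and arguments are hypothetical, and it assumes every conforming policy yields a concrete expiry; how the portal treats policies without one is not stated in the commit message.

```python
from datetime import datetime, timedelta, timezone


def flow_expiration(policy_expirations, token_expires_at):
    """Return min(max(policy expirations), token expiry).

    Taking the max over conforming policies keeps the flow alive as long as
    any policy still grants access; the min caps it at the token's lifetime.
    """
    return min(max(policy_expirations), token_expires_at)


# Example: two policies expire in 1h and 3h, the token in 2h.
now = datetime(2025, 8, 1, tzinfo=timezone.utc)
expires = flow_expiration(
    [now + timedelta(hours=1), now + timedelta(hours=3)],
    now + timedelta(hours=2),
)
```

Here the longest-lived policy would allow three hours, but the token only lives two, so the flow expires with the token.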
Jamil	ef3ee3aba8	fix(portal): relax gateway group perms (#10034 ) This is hit by the client channel when a gateway group needs to be hydrated, which should only require "connect gateways" permissions.	2025-07-28 19:58:11 +00:00
Jamil	44a9691df5	refactor(portal): don't store account assoc on client (#10009 ) The full `account` struct is only used to render the client's interface, and doesn't need to be stored in the `client` struct when the `subject` struct already tracks it.	2025-07-28 16:24:58 +00:00
Jamil	3ff31e3a33	fix(portal): maintain identity preload on client (#10008 ) When updating a client, we need to maintain the preloaded `identity` association to use for the IdP policy condition.	2025-07-26 00:42:19 +00:00
Jamil	f1a5af356d	fix(portal): groom resource list and flows periodically (#10005 ) Time-based policy conditions are tricky. When they authorize a flow, we correctly tell the Gateway to remove access when the time window expires. However, we do nothing on the client to reset the connectivity state. This means that when the access time window was re-entered, the client would essentially never be able to connect to the resource again until the resource was toggled. To fix this, we add a 1-minute check in the client channel that re-checks allowed resources, and updates the client state with the difference. This means that policies that have time-based conditions are only accurate to the minute, but this is how they're presented anyhow. For good measure, we also add a periodic job that runs every minute to delete expired Flows. This will propagate to the Gateway, which will receive `reject_access` if the access for a particular client-resource pair is determined to be actually gone. Zooming out a bit, this PR furthers the theme that: - Client channels react to underlying resource / policy / membership changes directly, while - Gateway channels react primarily to flows being deleted, or the downstream effects of a prior client authorization	2025-07-25 21:04:41 +00:00
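The periodic re-check in f1a5af356d boils down to diffing the previously allowed resources against the currently allowed ones. A minimal Python sketch, assuming resources are identified by hashable ids; the function name and return shape are hypothetical, not the portal's API.

```python
def diff_resources(previous: set, current: set):
    """Return (added, removed) between two sets of allowed resource ids."""
    added = current - previous      # newly allowed: announce to the client
    removed = previous - current    # no longer allowed: withdraw from the client
    return added, removed
```

Running this once a minute means a resource gated by a time window reappears on the client within about a minute of the window opening, which matches the stated accuracy.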
Jamil	2959cca8ce	fix(portal): use consistent wireguard psk (#10004 ) Whenever a client requests a connection to a gateway, we need to generate a preshared key that will be used for the underlying WireGuard tunnel. When the connection setup broke or was otherwise lost _after_ the gateway received the `authorize_flow` call, but _before_ the client could receive the response (and initiate a tunnel), we would have to wait until an ICE timeout occurred in order to reset state on the gateway. This is because the psk was not used to determine if this was a _new_ flow authorization. So the old authorization would be matched, and the client would never be able to connect, since its tunnel was using the new psk, and the gateway the old one. To fix this, we generate a secure random 32-byte `psk_base` on each client and gateway. When a client wishes to connect to a gateway, we compute the WireGuard preshared key as an HMAC over these two inputs. This fixes the issue by ensuring that subsequent flow authorization requests from a particular client to a particular gateway will yield the same psk. Related: #9999 Related: https://github.com/firezone/infra/issues/99	2025-07-25 19:28:47 +00:00
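A deterministic key derivation of the kind described in 2959cca8ce might look like the Python sketch below. The commit message only says the key is "an HMAC over these two inputs"; which value serves as the HMAC key, which as the message, and the choice of SHA-256 are assumptions made here for illustration, and the function names are hypothetical.

```python
import hashlib
import hmac
import secrets


def new_psk_base() -> bytes:
    """Generate a secure random 32-byte base, one per client or gateway."""
    return secrets.token_bytes(32)


def derive_psk(client_psk_base: bytes, gateway_psk_base: bytes) -> bytes:
    # Assumption: client base is the HMAC key, gateway base is the message.
    # HMAC-SHA256 outputs 32 bytes, the size of a WireGuard preshared key.
    return hmac.new(client_psk_base, gateway_psk_base, hashlib.sha256).digest()
```

Because the output depends only on the two stored bases, a retried authorization for the same client and gateway yields the same key, so neither side ends up holding a stale psk.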
Jamil	ccc736e63e	fix(portal): reauthorize new flow when last flow deleted (#9974 ) The `flows` table tracks authorizations we've made for a resource and persists them, so that we can determine which authorizations are still valid across deploys or hiccups in the control plane connections. Before, when the "in-use" authorization for a resource was deleted, we would flap the resource in the client and send `reject_access` to the gateway. However, that would cause issues in the following edge case: - Client is currently connected to Resource A through Policy A - Client websocket goes down - Policy B is created for Resource A (for another actor group), and Policy A is deleted by admin - Client reconnects - Client sees that its resource list is the same - Gateway has since received `reject_access` because no new flows were created for this client-resource combination To prevent this from happening, we now try to "reauthorize" the flow whenever the last cached flow is removed for a particular client-resource pair. This avoids needing to toggle the resource on the client since we won't have sent `reject_access` to the gateway.	2025-07-25 01:53:10 +00:00
Jamil	f41a6f9e0b	fix(portal): don't use process.alive on remote pid (#9964 ) This can be removed, since we handle the ArgumentError in the link operation.	2025-07-22 09:42:51 -07:00
Jamil	2c3692582b	fix(portal): more robust replication pid discovery (#9960 ) When debugging why we're receiving "Failed to start replication connection" errors on deploy, it was discovered that there's a bug in the Process discovery mechanism that new nodes use to attempt to link to the existing replication connection. When restarting an existing `domain` container that's not doing replication, we see this: ``` {"message":"Elixir.Domain.Events.ReplicationConnection: Publication tables are up to date","time":"2025-07-22T07:18:45.948Z","domain":["elixir"],"application":"domain","severity":"INFO","logging.googleapis.com/sourceLocation":{"function":"Elixir.Domain.Events.ReplicationConnection.handle_publication_tables_diff/2","line":2,"file":"lib/domain/events/replication_connection.ex"},"logging.googleapis.com/operation":{"producer":"#PID<0.764.0>"}} {"message":"notifier only receiving messages from its own node, functionality may be degraded","time":"2025-07-22T07:18:45.942Z","domain":["elixir"],"application":"oban","source":"oban","severity":"DEBUG","event":"notifier:switch","connectivity_status":"solitary","logging.googleapis.com/sourceLocation":{"function":"Elixir.Oban.Telemetry.log/2","line":624,"file":"lib/oban/telemetry.ex"},"logging.googleapis.com/operation":{"producer":"#PID<0.756.0>"}} {"message":"Elixir.Domain.ChangeLogs.ReplicationConnection: Publication tables are up to date","time":"2025-07-22T07:18:45.952Z","domain":["elixir"],"application":"domain","severity":"INFO","logging.googleapis.com/sourceLocation":{"function":"Elixir.Domain.ChangeLogs.ReplicationConnection.handle_publication_tables_diff/2","line":2,"file":"lib/domain/change_logs/replication_connection.ex"},"logging.googleapis.com/operation":{"producer":"#PID<0.763.0>"}} {"message":"Elixir.Domain.ChangeLogs.ReplicationConnection: Starting replication slot 
change_logs_slot","time":"2025-07-22T07:18:45.966Z","state":"[REDACTED]","domain":["elixir"],"application":"domain","severity":"INFO","logging.googleapis.com/sourceLocation":{"function":"Elixir.Domain.ChangeLogs.ReplicationConnection.handle_result/2","line":2,"file":"lib/domain/change_logs/replication_connection.ex"},"logging.googleapis.com/operation":{"producer":"#PID<0.763.0>"}} {"message":"Elixir.Domain.Events.ReplicationConnection: Starting replication slot events_slot","time":"2025-07-22T07:18:45.966Z","state":"[REDACTED]","domain":["elixir"],"application":"domain","severity":"INFO","logging.googleapis.com/sourceLocation":{"function":"Elixir.Domain.Events.ReplicationConnection.handle_result/2","line":2,"file":"lib/domain/events/replication_connection.ex"},"logging.googleapis.com/operation":{"producer":"#PID<0.764.0>"}} {"message":"Elixir.Domain.ChangeLogs.ReplicationConnection: Replication connection disconnected","time":"2025-07-22T07:18:45.977Z","domain":["elixir"],"application":"domain","counter":0,"severity":"INFO","logging.googleapis.com/sourceLocation":{"function":"Elixir.Domain.ChangeLogs.ReplicationConnection.handle_disconnect/1","line":2,"file":"lib/domain/change_logs/replication_connection.ex"},"logging.googleapis.com/operation":{"producer":"#PID<0.763.0>"}} {"message":"Elixir.Domain.Events.ReplicationConnection: Replication connection disconnected","time":"2025-07-22T07:18:45.977Z","domain":["elixir"],"application":"domain","counter":0,"severity":"INFO","logging.googleapis.com/sourceLocation":{"function":"Elixir.Domain.Events.ReplicationConnection.handle_disconnect/1","line":2,"file":"lib/domain/events/replication_connection.ex"},"logging.googleapis.com/operation":{"producer":"#PID<0.764.0>"}} {"message":"Failed to start replication connection Elixir.Domain.Events.ReplicationConnection","reason":"%Postgrex.Error{message: nil, postgres: %{code: :object_in_use, line: \"607\", message: \"replication slot \\\"events_slot\\\" is active for PID 135123\", 
file: \"slot.c\", unknown: \"ERROR\", severity: \"ERROR\", pg_code: \"55006\", routine: \"ReplicationSlotAcquire\"}, connection_id: 136400, query: nil}","time":"2025-07-22T07:18:45.978Z","domain":["elixir"],"application":"domain","max_retries":10,"severity":"INFO","logging.googleapis.com/sourceLocation":{"function":"Elixir.Domain.Replication.Manager.handle_info/2","line":41,"file":"lib/domain/replication/manager.ex"},"logging.googleapis.com/operation":{"producer":"#PID<0.761.0>"},"retries":0} {"message":"Failed to start replication connection Elixir.Domain.ChangeLogs.ReplicationConnection","reason":"%Postgrex.Error{message: nil, postgres: %{code: :object_in_use, line: \"607\", message: \"replication slot \\\"change_logs_slot\\\" is active for PID 135124\", file: \"slot.c\", unknown: \"ERROR\", severity: \"ERROR\", pg_code: \"55006\", routine: \"ReplicationSlotAcquire\"}, connection_id: 136401, query: nil}","time":"2025-07-22T07:18:45.978Z","domain":["elixir"],"application":"domain","max_retries":10,"severity":"INFO","logging.googleapis.com/sourceLocation":{"function":"Elixir.Domain.Replication.Manager.handle_info/2","line":41,"file":"lib/domain/replication/manager.ex"},"logging.googleapis.com/operation":{"producer":"#PID<0.760.0>"},"retries":0} ``` Before, we relied on `start_link` telling us that there was an existing pid running in the cluster. However, from the output above, it appears that may not always be reliable. Instead, we first check explicitly where the running process is and, if alive, we try linking to it. If not, we try starting the connection ourselves. Once linked to the process, we react to it being torn down as well, causing a first-one-wins scenario where all nodes will attempt to start replication, minimizing downtime during deploys. Now that https://github.com/firezone/infra/pull/94 is in place, I did verify we are properly handling SIGTERM in the BEAM, so the deployment would now go like this: 1. 
GCP brings up the new nodes, they all find the existing pid and link to it 2. GCP sends SIGTERM to the old nodes 3. The _actual_ pid receives SIGTERM and exits 4. This exit propagates to all other nodes due to the link 5. Some node will "win", and the others will end up linking to it Fixes #9911	2025-07-22 15:13:45 +00:00
dependabot[bot]	0a0ee3c940	build(deps): bump sentry from 10.10.0 to 11.0.2 in /elixir (#9933 ) Bumps [sentry](https://github.com/getsentry/sentry-elixir) from 10.10.0 to 11.0.2. Release notes (sourced from https://github.com/getsentry/sentry-elixir/releases, truncated): 11.0.2: Bug fixes: - Deeply nested spans are handled now when building up traces in `SpanProcessor` (sentry-elixir #924) Various improvements: - Span's attributes no longer include `db.url: "ecto:"` entries as they are now filtered out (sentry-elixir #925) 11.0.1: Various improvements: - `Sentry.OpenTelemetry.Sampler` now works with an empty config (sentry-elixir #915) 11.0.0: This release comes with beta support for Traces using OpenTelemetry. New features: - Beta support for Traces using OpenTelemetry (sentry-elixir #902). To enable Tracing in a Phoenix application, add the following deps to `mix.exs`: `{:sentry, "~> 11.0.0"}`, `{:opentelemetry, "~> 1.5"}`, `{:opentelemetry_api, "~> 1.4"}`, `{:opentelemetry_exporter, "~> 1.0"}`, `{:opentelemetry_semantic_conventions, "~> 1.27"}`, `{:opentelemetry_phoenix, "~> 2.0"}`, `{:opentelemetry_ecto, "~> 1.2"}`, and then configure Tracing in `config.exs`: `config :sentry, traces_sample_rate: 1.0` (any value between 0 and 1.0 enables tracing), `config :opentelemetry, span_processor: {Sentry.OpenTelemetry.SpanProcessor, []}`, `config :opentelemetry, sampler: {Sentry.OpenTelemetry.Sampler, [drop: []]}` - Add installer (based on Igniter) (sentry-elixir #876) Commits: - b142174 release: 11.0.2 - f43055b Update CHANGELOG for 11.0.2 (sentry-elixir #926) - ee512d3 Filter out empty db.url from span's attributes (sentry-elixir #925) - 6809aaa Fix handling of spans at 2+ levels (sentry-elixir #924) - b7e1679 Improve event callback docs (sentry-elixir #922) - 97d0382 Merge branch 'release/11.0.1' - 738fc76 release: 11.0.1 - ab58c0e Update CHANGELOG (sentry-elixir #917) - 028ce18 handle nil drop list (sentry-elixir #915) - 5850c73 Merge branch 'release/11.0.0' - Additional commits viewable at https://github.com/getsentry/sentry-elixir/compare/10.10.0...11.0.2 Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2025-07-21 21:01:25 +00:00
Jamil	1b1bd6401a	fix(portal): gracefully account deletions in changelog (#9955 ) When an account is perma-deleted, we need to handle that with another function clause matching the WAL message coming into the change logs replication connection module.	2025-07-21 20:47:41 +00:00
Jamil	488cb96469	fix(portal): don't prematurely reject access (#9952 ) Before: - When a flow was deleted, we flapped the resource on the client, and naively sent `reject_access` for the flow's `{client_id, resource_id}` pair on the gateway. This resulted in lots of unneeded resource flapping on the client whenever bulk flow deletions happened. After: - When a flow is deleted, we check if this is an active flow for the client. If so, we flap the resource in order to trigger generation of a new flow. If access was truly affected in a way that results in the loss of a resource, we will push `resource_deleted` for the update that triggered the flow deletion (for example the resource/policy removal). On the gateway, we only send `reject_access` if it was the last flow granting access for a particular `client/resource` tuple. Why: - While the access state was still correct in the previous implementation, we ran the risk of pushing far too many resource flaps to the client in an overly eager attempt to remove access that may not actually have been revoked. cc @thomaseizinger Related: https://firezonehq.slack.com/archives/C08FPHECLUF/p1753101115735179	2025-07-21 13:12:05 -07:00
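The gateway-side rule from 488cb96469, sending `reject_access` only when the last flow for a client/resource pair disappears, can be sketched in Python as below. The `FlowCache` class and its methods are hypothetical stand-ins for whatever state the gateway channel process keeps.

```python
from collections import defaultdict


class FlowCache:
    """Tracks flow ids per (client_id, resource_id) pair (hypothetical shape)."""

    def __init__(self):
        self._flows = defaultdict(set)

    def add(self, client_id, resource_id, flow_id):
        self._flows[(client_id, resource_id)].add(flow_id)

    def delete(self, client_id, resource_id, flow_id) -> bool:
        """Remove a flow; return True only if reject_access should be sent."""
        key = (client_id, resource_id)
        if key not in self._flows:
            return False  # nothing was granted, so nothing to reject
        flows = self._flows[key]
        flows.discard(flow_id)
        if flows:
            return False  # another flow still grants access to this pair
        del self._flows[key]
        return True  # that was the last flow for this pair
```

A bulk deletion that removes one of two flows for the same pair therefore produces no `reject_access`; only the deletion that empties the set does.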
dependabot[bot]	272074e8d4	build(deps): bump hammer from 7.0.1 to 7.1.0 in /elixir (#9935 ) Bumps [hammer](https://github.com/ExHammer/hammer) from 7.0.1 to 7.1.0. Changelog (sourced from https://github.com/ExHammer/hammer/blob/master/CHANGELOG.md): 7.1.0 - 2025-07-18: - Fix key type inconsistency in backend implementations - all backends now accept `term()` keys instead of `String.t()` (hammer #143) - Add comprehensive test coverage for various key types (atoms, tuples, integers, lists, maps) - Fix race conditions in atomic backend tests (FixWindow, LeakyBucket, TokenBucket) - Replace timing-dependent tests with polling-based `eventually` helper for better CI reliability - Add documentation warning about Redis backend string key requirement - Fix typo in `inc/3` optional callback documentation (hammer #142) Commits: - a57bdec improve changelog last commit (hammer #145) - bb061c5 Bump version to 7.1.0 (hammer #144) - 7d7967f Fix key type inconsistency in backend implementations (hammer #143) - 94d3952 Fixes typo for inc/3 optional callback `@doc` (hammer #142) - 79ca221 Bump benchee from 1.3.1 to 1.4.0 (hammer #135) - a09bbd0 Bump ex_doc from 0.37.3 to 0.38.2 (hammer #141) - d06a17b Bump credo from 1.7.11 to 1.7.12 (hammer #134) - 26df742 Update bug_report.md (hammer #133) - b8765fe Bump ex_doc from 0.37.2 to 0.37.3 (hammer #131) - Full diff at https://github.com/ExHammer/hammer/compare/7.0.1...7.1.0 Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2025-07-21 13:24:40 +00:00
Jamil	b5af132ae8	feat(portal): allow queue_target and queue_interval via ENV (#9943 ) These parameters should be tuned to how long we expect "normal" queries to take against the SQL instance. For smaller instances, "normal" queries may take longer than 500ms, so we need to be able to configure these via our Terraform configuration. If not specified, the same defaults are used as before. Related: https://github.com/firezone/infra/pull/82	2025-07-20 12:28:04 -07:00
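Reading pool tuning values from the environment with a fallback, as b5af132ae8 describes, could be sketched like this in Python. The variable names `DATABASE_QUEUE_TARGET` and `DATABASE_QUEUE_INTERVAL` are hypothetical, and the fallback values of 50 ms and 1000 ms are DBConnection's documented defaults, which may differ from the defaults the portal actually uses.

```python
import os


def pool_queue_settings(env=os.environ, default_target_ms=50, default_interval_ms=1000):
    """Return queue_target / queue_interval in milliseconds, env overriding defaults."""

    def read(name, default):
        raw = env.get(name, "")
        # Unset or blank values fall back to the default.
        return int(raw) if raw.strip() else default

    return {
        "queue_target": read("DATABASE_QUEUE_TARGET", default_target_ms),        # hypothetical name
        "queue_interval": read("DATABASE_QUEUE_INTERVAL", default_interval_ms),  # hypothetical name
    }
```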
Jamil	f379e85e9b	refactor(portal): cache access state in channel pids (#9773 ) When changes occur in the Firezone DB that trigger side effects, we need some mechanism to broadcast and handle these. Before, the system we used was: - Each process subscribes to a myriad of topics related to data it wants to receive. In some cases it would subscribe to new topics based on received events from existing topics (i.e. flows in the gateway channel), and sometimes in a loop. It would then need to be sure to _unsubscribe_ from these topics - Handle the side effect in the `after_commit` hook of the Ecto function call after it completes - Broadcast only a simple (thin) event message with a DB id - In the receiver, use the id(s) to re-evaluate, or look up one or many records associated with the change - After the lookup completes, `push` the relevant message(s) to the LiveView, `client` pid, or `gateway` pid in their respective channel processes This system had a number of drawbacks ranging from scalability issues to undesirable access bugs: 1. The `after_commit` callback, on each App node, is not globally ordered. Since we broadcast a thin event schema and read from the DB to hydrate each event, this meant we had a `read after write` problem in our event architecture, leading to the potential for lost updates. Case in point: if a policy is updated from `resource_id-1` to `resource_id-2`, and then back to `resource_id-1`, it's possible that, given the right amount of delay, the gateway channel will receive two `reject_access` events for `resource_id-1`, as opposed to one for `resource_id-1` and one for `resource_id-2`, leading to the potential for unauthorized access. 1. It was very difficult to ensure that the correct topics were being subscribed to and unsubscribed from, and the correct number of times, leading to maintenance issues for other engineers. 1. 
We had a nasty N+1 query problem whenever memberships were added or removed, which resulted in essentially all access related to that membership (so all Policies touching its actor group) being re-evaluated and broadcast. This meant that any bulk addition or deletion of memberships would generate so many queries that they'd time out or consume the entire connection pool. 1. We had no durability for side-effect processing. In some places, we were iterating over many returned records to send broadcasts. Broadcasting is not a zero-time operation; each call takes a small amount of CPU time to copy the message into the receiver's mailbox. If we deployed while this was happening, the state update would be lost forever. If this was a `reject_access` for a Gateway, the Gateway would never remove access for that particular flow. 1. On each flow authorization, we needed to hit `us-east1` not only to "authorize" the flow, but to log it as well. This incurs latency especially for users in other parts of the world, which happens on _each_ connection setup to a new resource. 1. Since we read and re-authorize access due to the thin events broadcast from side effects, we risk hitting thundering herd problems (see the N+1 query problem above) where a single DB change could result in all receivers hitting the DB at once to "hydrate" their processing. 1. If an administrator modifies the DB directly, or if we need to run a DB migration that involves side effects, they'll be lost, because the side effect triggers happened in `after_commit` hooks that are only available when querying the DB through Ecto. Manually deleting (or resurrecting) a policy, for example, would not have updated any connected clients or gateways with the new state. 
To fix all of the above, we move to the system introduced in this PR: - All changes are now serialized (for free) by Postgres and broadcast as a single event stream - The number of topics has been reduced to just one, the `account_id` of an account. All receivers subscribe to this one topic for the lifetime of their pid and then only filter the events they want to act upon, ignoring all other messages - The events themselves have been turned into "fat" structs based on the schemas they present. By making them properly typed, we can apply things like the existing Policy authorizer functions to them as if we had just fetched them from the DB. - All flow creation now happens in memory and doesn't need to incur a DB hit in `us-east1` to proceed. - Since clients and gateways now track state in a push-based manner from the DB, this means very few actual DB queries are needed to maintain state in the channel procs, and it also means we can be smarter about when to send `resource_deleted` and `resource_created_or_updated` appropriately, since we can always diff between what the client _had_ access to, and what they _now_ have access to. - All DB operations, whether they happen from the application code, a `psql` prompt, or even via Google SQL Studio in the GCP console, will trigger the _same_ side effects. - We now use a replication consumer based off Postgres logical decoding of the write-ahead log using a _durable slot_. This means that Postgres will retain _all events_ until they are acknowledged, giving us the ability to ensure at-least-once processing semantics for our system. Today, the ACK is simply, "did we broadcast this event successfully". But in the future, we can assert that replies are received before we acknowledge the event as processed back to Postgres. The tests in this PR have been updated to pass given the refactor. However, since we are tracking more state now in the channel procs, it would be a good idea to add more tests for those edge cases. 
That is saved for a later PR because (1) this one is already huge, and (2) we need to get this out to staging to smoke test everything anyhow. Fixes: #9908 Fixes: #9909 Fixes: #9910 Fixes: #9900 Related: #9501	2025-07-18 22:47:18 +00:00
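The single-topic, fat-event design in the entry above can be illustrated with a small sketch. It is written in Python for brevity rather than the portal's Elixir, and every name in it (`Change`, `Receiver`, `handle`) is hypothetical and not taken from the Firezone codebase. It assumes each event carries the full record, and that a receiver keeps the resources it knows about in memory so it can diff what it had against what it now has.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    # Hypothetical "fat" event: carries the whole record, not just a DB id.
    account_id: str
    table: str
    op: str  # "insert", "update" or "delete"
    record: dict


class Receiver:
    """Hypothetical channel process subscribed to one account-wide topic."""

    def __init__(self, account_id, watched_tables):
        self.account_id = account_id
        self.watched_tables = watched_tables
        self.resources = {}  # resource id -> last seen record

    def handle(self, change):
        """Apply one event and return (added_ids, removed_ids)."""
        # Ignore everything on the topic that this receiver does not care about.
        if change.account_id != self.account_id:
            return set(), set()
        if change.table not in self.watched_tables:
            return set(), set()

        before = set(self.resources)
        if change.op == "delete":
            self.resources.pop(change.record["id"], None)
        else:
            self.resources[change.record["id"]] = change.record
        after = set(self.resources)

        # Diff old state against new state without touching the database.
        return after - before, before - after
```

Because the record travels with the event, the receiver never has to read the row back from the database, which is what removes the read-after-write ordering problem described in the commit message.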
Jamil	789a3012d6	fix(portal): only process jsonb strings (#9883 ) As a followup to #9882, we need to ensure that `jsonb` columns that have value data other than strings are not decoded as jsonb. An example of when this happens is when Postgres sends an `:unchanged_toast` to indicate the data hasn't changed.	2025-07-15 18:06:13 -07:00
Jamil	cce21a8dea	fix(portal): handle `jsonb` for embedded schemas (#9882 ) In #9664, we introduced the `Domain.struct_from_params/2` function which converts a set of params containing string keys into a provided struct representing a schema module. This is used to broadcast actual structs pertaining to WAL data as opposed to simple string encodings of the data. The problem is that function was a bit too naive and failed to properly cast embedded schemas, resulting in all embedded schema on the root struct being `nil` or `[]`. To fix this, we need to do two things: 1. We now decode JSON/JSONB fields from binaries (strings) into actual lists and maps in the replication consumer module for downstream processors to use 2. We update our `struct_from_params/2` function to properly cast embedded schemas from these lists and maps using Ecto.Changeset's `apply_changes` function, which uses the same logic to instantiate the schemas as if we were saving a form or API request. Lastly, tests are added to ensure this works under various scenarios, including nested embedded schemas which we use in some places. Fixes #9835 --------- Signed-off-by: Jamil <jamilbk@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-07-15 23:50:27 +00:00
Thomas Eizinger	cb497a7435	fix(portal): use correct password generation algorithm (#9874 ) In #9870, the password generation algorithm was broken. The correct order of the elements in the hash is: expiry, stamp_secret, salt. The relay expects this order when it re-generates the password to validate the message. Due to a different bug in our CI system, we weren't actually checking for warnings / errors in our perf-test suite: https://github.com/firezone/firezone/actions/runs/16285038111/job/45982241021#step:9:66	2025-07-15 13:39:31 +00:00
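The ordering requirement in the entry above (expiry, stamp_secret, salt) can be sketched as follows. This is an illustration in Python, not the portal's implementation: the hash function (SHA-256), the `:` separator, and the unpadded base64 encoding are all assumptions, and `relay_password` is a hypothetical name. The only detail taken from the commit message is the order of the three elements.

```python
import base64
import hashlib


def relay_password(expiry: int, stamp_secret: str, salt: str) -> str:
    # Order matters: expiry, then stamp_secret, then salt.
    # Separator, hash and encoding are assumed for illustration only.
    material = f"{expiry}:{stamp_secret}:{salt}".encode()
    digest = hashlib.sha256(material).digest()
    return base64.b64encode(digest).decode().rstrip("=")
```

Both sides must concatenate the elements in the same order; swapping any two produces a different digest, so a relay that re-generates the password would reject the message.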
Brian Manifold	0d9e865ea8	feat(porat): Update portal telemetry (#9868 ) Why: * Adding more BEAM VM metrics to give us better insight as to how our BEAM cluster is running since we're in the middle of making some moderately large architectural changes to the application.	2025-07-15 02:11:59 +00:00
Jamil	17d7e29b81	fix(portal): use public key for TURN creds (#9870 ) As a followup to #9856, after talking with @bmanifold, we determined using the public_key as the username for TURN credentials is a safer bet because: - It's by definition public and therefore does not need to be obfuscated - It's shorter-lived than the token, especially for the gateway - It essentially represents the data plane connection for client/gateway and naturally rotates along with the key state for those	2025-07-15 01:48:02 +00:00
Jamil	1e577d31b9	fix(portal): use reproducible relay creds (#9857 ) When giving TURN credentials to clients and gateways, it's important that they remain consistent across hiccups in the portal connection so that relayed connections are not interrupted during a deploy, or if the user's internet is flaky, or the GCP load balancer decides to disconnect the client/gateway. Prior to this PR, that was not the case because we essentially tied TURN credentials, required for data plane packet flows, to the WebSocket connection, a control plane element. This happened because we generated random `expires_at` and `salt` elements on _each_ connection to the portal. Instead, what we do now is make these reproducible and tied to the auth token by hashing then base64-encoding it. The expiry is tied to the auth-token's expiry. Fixes #9856	2025-07-14 17:42:11 +00:00
Jamil	e98aa82e8e	fix(portal): respect gateway_group_id filter in REST API (#9840 ) Fixes #9815	2025-07-11 19:12:05 +00:00
Jamil	26cfab3b88	fix(portal): reply to all wal keepalives with ack (#9828 ) The Postgres logical decoding protocol is lacking documentation and unclear about keepalive behavior when `wal_sender_timeout` is set to 0 (disabled). We have it disabled so that Postgres doesn't terminate our connection for falling too far behind. What we failed to take into account is that on some installations, Postgres _never_ requests an immediate reply (keepalive with the reply now bit set) if wal_sender_timeout is disabled. This means we would always reply with the empty message, failing to advance the position of the LSN. In this PR, we fix that to always respond to every keepalive message with a standby status update to advance the LSN position. Relevant documentation: https://www.postgresql.org/docs/current/protocol-replication.html#PROTOCOL-REPLICATION-STANDBY-STATUS-UPDATE	2025-07-11 14:32:56 +00:00
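For reference, the standby status update mentioned in the entry above has a fixed binary layout in the Postgres streaming replication protocol: a byte `r`, three 64-bit LSN positions (written, flushed, applied), a 64-bit client clock in microseconds since 2000-01-01 UTC, and a one-byte "reply requested" flag. The Python sketch below builds such a message; the function name is hypothetical, and reporting the same LSN for all three positions is a simplifying assumption rather than a description of what the portal does.

```python
import struct
from datetime import datetime, timedelta, timezone

# Postgres timestamps in the replication protocol count from this epoch.
PG_EPOCH = datetime(2000, 1, 1, tzinfo=timezone.utc)


def standby_status_update(lsn: int, now: datetime, reply_requested: bool = False) -> bytes:
    # Microseconds since the Postgres epoch, as an integer.
    micros = (now - PG_EPOCH) // timedelta(microseconds=1)
    # 'r' | written LSN | flushed LSN | applied LSN | client clock | reply flag
    return struct.pack(">cQQQqB", b"r", lsn, lsn, lsn, micros, int(reply_requested))
```

Sending this in response to every keepalive, rather than only when the server sets its reply-now bit, is what lets the acknowledged position advance when `wal_sender_timeout` is disabled.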
Thomas Eizinger	8e5ce66810	feat(gateway): don't apply traffic filters to ICMP errors (#9834 ) Firezone uses ICMP errors to signal to client applications that e.g. a certain IP is not reachable. This happens for example if a DNS resource only resolves to IPv4 addresses yet the client application attempted to use an IPv6 proxy address to connect to it. In the presence of traffic filters for such a resource that does _not_ allow ICMP, we currently filter out these ICMP errors because - well - ICMP traffic is not allowed! However, even in the presence of ICMP traffic being allowed, we would fail to evaluate this filter because the ICMP error packet is not an ICMP echo reply and therefore doesn't have an ICMP identifier. We require this in the DNS resource NAT to identify "connections" and NAT them correctly. The same L4 component is used to evaluate the traffic filters. ICMP errors are critical to many usage scenarios and algorithms like happy-eyeballs. Dropping them usually results in weird behaviour as client applications can then only react to timeouts.	2025-07-11 13:20:37 +00:00
Jamil	cfcd5b3b8f	chore(portal): track more WAL monitoring info (#9826 ) When debugging WAL processing, it's helpful to know what the last replied LSN was and when the last keepalive message was received from postgres.	2025-07-10 18:30:34 -07:00
Jamil	080818c466	fix(portal): fix reply for remaining wal message (#9824 ) Missed one reply fix from #9821	2025-07-10 21:46:05 +00:00
Jamil	fb0dd36dbc	chore(portal): ignore expected libcluster issue (#9822 ) Adds another expected error message to the ignore list. We have a different (less noisy) log that will alert us if the cluster is below threshold.	2025-07-10 21:35:18 +00:00
Jamil	704ff9fd7a	fix(portal): send empty reply for incoming wal messages (#9821 ) In #9733, we changed the replies of the handle_data messages which seems to have caused Postgres to not respect our acknowledgements sent in the keepalive. To fix this, we revert to sending an empty message in response to write messages.	2025-07-10 19:50:00 +00:00
Jamil	b20c141759	feat(portal): add batch-insert to change logs (#9733 ) Inserting a change log incurs some minor overhead for sending the query over the network and reacting to its response. In many cases, this makes up the bulk of the actual time it takes to run the change log insert. To reduce this overhead and avoid any kind of processing delay in the WAL consumers, we introduce batch insert functionality with size `500` and timeout `30` seconds. If either of those two are hit, we flush the batch using `insert_all`. `insert_all` does not use `Ecto.Changeset`, so we need to be a bit more careful about the data we insert, and check the inserted LSNs to determine what to update the acknowledged LSN pointer to. The functionality to determine when to call the new `on_flush/1` callback lives in the replication_connection module, but the actual behavior of `on_flush/1` is left to the child modules to implement. The `Events.ReplicationConnection` module does not use flush behavior, and so does not override the defaults, which is not to use a flush mechanism. Related: #949	2025-07-05 19:03:28 +00:00
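The size-or-timeout flush rule from the entry above can be sketched like this. The sketch is in Python rather than Elixir, and `BatchBuffer`, `add` and `tick` are hypothetical names; only the defaults of 500 items and 30 seconds and the idea of an `on_flush` callback come from the commit message. It assumes something external calls `tick` periodically so that a partially filled batch is still flushed once it is old enough.

```python
import time


class BatchBuffer:
    """Collects items and hands them to on_flush in batches."""

    def __init__(self, on_flush, max_size=500, max_age=30.0, clock=time.monotonic):
        self.on_flush = on_flush
        self.max_size = max_size
        self.max_age = max_age
        self.clock = clock
        self.items = []
        self.started_at = None  # time the oldest buffered item arrived

    def add(self, item):
        if not self.items:
            self.started_at = self.clock()
        self.items.append(item)
        self.tick()

    def tick(self):
        """Flush if the batch is full or its oldest item is too old."""
        if not self.items:
            return
        full = len(self.items) >= self.max_size
        stale = self.clock() - self.started_at >= self.max_age
        if full or stale:
            batch, self.items = self.items, []
            self.on_flush(batch)
```

In a consumer like the one described, the flush callback would perform the bulk insert and then report which LSNs were actually written, so the acknowledged position only moves past rows that made it to the table.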
Jamil	c869bcfe13	chore(portal): tag Relay WAL todos (#9767 ) These aren't a priority to clean up right now, but I wanted to tag them so I don't forget to do it later on.	2025-07-04 22:30:06 +00:00
Jamil	2a38c532af	chore(portal): remove gateway masquerade option (#9790 ) AFAIK these are ignored by connlib. Instead, we configure masquerading on the host.	2025-07-04 21:08:11 +00:00
Brian Manifold	83e71f45b8	fix(portal): catch all errors when sending welcome email (#9776 ) Why: * We were previously only catching the `:rate_limited` error when sending welcome emails. This update adds a catch-all case to gracefully handle the error and alert us. --------- Signed-off-by: Brian Manifold <bmanifold@users.noreply.github.com> Co-authored-by: Jamil <jamilbk@users.noreply.github.com>	2025-07-03 21:41:12 +00:00
Jamil	29d8881c54	fix(seeds): remove unused vars (#9731 ) This fixes some warnings introduced by #9692.	2025-06-30 19:33:11 +00:00
Jamil	23c43c12dd	chore(portal): log wal status every 60s (#9729 ) It would be helpful to see these more often in the logs to better understand our current processing position.	2025-06-30 18:25:19 +00:00
Jamil	972ece507d	chore(portal): downgrade expected wal log to info (#9726 ) This is expected during deploys so we downgrade it to info to avoid sending to Sentry.	2025-06-30 15:35:35 +00:00
Jamil	47fe7b388e	chore(portal): ack WAL records more often (#9703 ) - log connection module for replication manager logs - simplify bypass conditionals - ACK write messages to avoid PG resending data	2025-06-28 20:34:26 +00:00
Jamil	a24f582ff5	fix(portal): increase change_log lag warning threshold (#9702 ) This is needlessly short and has already tripped a false alarm once.	2025-06-27 20:58:58 +00:00
Jamil	3760536afd	chore(portal): add unique index to lsn (#9699 )	2025-06-27 20:58:20 +00:00
Jamil	dddd1b57fc	refactor(portal): remove flow_activities (#9693 ) This has been dead code for a long time. The feature this was meant to support, #8353, will require a different domain model, views, and user flows. Related: #8353	2025-06-27 20:40:25 +00:00
Jamil	9655dacc04	fix(portal): restart wal connection from manager proc (#9701 ) When the ReplicationConnection dies, its Manager will die too on all other nodes, and all domain Application supervisors on all nodes will attempt to restart them. This allows the connection to migrate to a healthy node automagically. However, the default Supervisor behavior is to allow 3 restarts in 5 seconds before the whole tree is taken down. To prevent this, we trap the exit in the ReplicationManager and attempt to reconnect right away, beginning the backoff process.	2025-06-27 20:40:04 +00:00
Jamil	6c0a62aa73	fix(tests): wait for visible els before click (#9697 ) We had an old bug in one of our acceptance tests that is just now being hit again due to the faster runners. - We need to wait for the dropdown to become visible before clicking - We fix a minor timer issue that was calculating elapsed time incorrectly when determining when to time out finding an el.	2025-06-27 19:06:59 +00:00
Jamil	3247b7c5d2	fix(portal): don't log soft-deleted deletes (#9698 )	2025-06-27 19:06:45 +00:00
Jamil	0b09d9f2f5	refactor(portal): don't rely on flows.expires_at (#9692 ) The `expires_at` column on the `flows` table was never used outside of the context in which the flow was created in the Client Channel. This ephemeral state, which is created in the `Domain.Flows.authorize_flow/4` function, is never read from the DB in any meaningful capacity, so it can be safely removed. The `expire_flows_for` family of functions now simply reads the needed fields from the flows table in order to broadcast `{:expire_flow, flow_id, client_id, resource_id}` directly to the subscribed entities. This PR is step 1 in removing the reliance on `Flows` to manage ephemeral access state. In a subsequent PR we will actually change the structure of what state is kept in the channel PIDs such that reliance on this Flows table will no longer be necessary. Additionally, in a few places, we were referencing a Flows.Show view that was never available in production, so this dead code has been removed. Lastly, the `flows` table subscription and associated hook processing has been completely removed as it is no longer needed. We've implemented in #9667 logic to remove publications from removed table subscriptions, so we can expect to get a couple ingest warnings when we deploy this as the `Hooks.Flows` processor no longer exists, and the WAL data may have lingering flows records in the queue. These can be safely ignored.	2025-06-27 18:29:12 +00:00
Jamil	fbf48a207a	chore(portal): handle lag up to 30m (#9681 ) Now that we know the bypass system works, it might be a good idea to allow it to lag data up to 30m so that events accrued during deploys are not lost. Also, this PR fixes a small bug where we triggered the threshold _after_ a transaction already committed (`COMMIT`), instead of before the data came through (`BEGIN`). Since the timestamps are identical (see below), it would be more accurate to read the timestamp of the transaction before acting on the data contained within. ``` [(domain 0.1.0+dev) lib/domain/change_logs/replication_connection.ex:4: Domain.ChangeLogs.ReplicationConnection.handle_message/3] "BEGIN #{commit_timestamp}" #=> "BEGIN 2025-06-26 04:22:45.283151Z" [(domain 0.1.0+dev) lib/domain/change_logs/replication_connection.ex:4: Domain.ChangeLogs.ReplicationConnection.handle_message/3] "END #{commit_timestamp}" #=> "END 2025-06-26 04:22:45.283151Z" ``` --------- Signed-off-by: Jamil <jamilbk@users.noreply.github.com> Co-authored-by: Brian Manifold <bmanifold@users.noreply.github.com>	2025-06-26 13:38:40 +00:00
Jamil	59fa7fa4f1	fix(portal): diff = now - past (#9680 ) We were performing the diff backwards, so the bypass never kicked in.	2025-06-25 18:10:28 -07:00
Jamil	e7756a9be5	fix(portal): bypass delayed events past threshold (#9679 ) When attempting to process a WAL that's _very_ far behind, it's helpful to have a time past which we simply `noop` the handlers.	2025-06-25 22:32:15 +00:00
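The bypass check from the two entries above reduces to one subtraction, and the direction matters. A minimal Python sketch, assuming the lag is measured from the transaction's commit timestamp; `should_bypass` and its parameters are hypothetical names, not the portal's API:

```python
from datetime import datetime, timedelta, timezone


def should_bypass(commit_timestamp: datetime, now: datetime, threshold: timedelta) -> bool:
    # diff = now - past. Reversing the operands yields a negative lag,
    # which can never exceed a positive threshold.
    lag = now - commit_timestamp
    return lag > threshold
```

With the operands reversed, the comparison is always false for past events, which matches the symptom described in #9680 where the bypass never kicked in.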
Jamil	eed8343e8f	fix(portal): wait 30s for agm query (#9678 ) These queries are timing out, so we wait longer for them.	2025-06-25 22:22:21 +00:00
Jamil	42e3027c34	fix(portal): use replication config in dev (#9676 )	2025-06-25 21:02:01 +00:00

877 Commits