firezone

mirror of https://github.com/outbackdingo/firezone.git synced 2026-01-27 10:18:54 +00:00

Author	SHA1	Message	Date
Thomas Eizinger	b11adfcfe4	feat(connlib): create flow on ICMP error "prohibited" (#10462 ) In Firezone, a Client requests an "access authorization" for a Resource on the fly when it sees the first packet for said Resource going through the tunnel. If we don't have a connection to the Gateway yet, this is also where we will establish a connection and create the WireGuard tunnel. In order for this to work, the access authorization state between the Client and the Gateway MUST NOT get out of sync. If the Client thinks it has access to a Resource, it will just route the traffic to the Gateway. If the access authorization on the Gateway has expired or vanished otherwise, the packets will be black-holed. Starting with #9816, the Gateway sends ICMP errors back to the application whenever it filters a packet. This can happen either because the access authorization is gone or because the traffic wasn't allowed by the specific filter rules on the Resource. With this patch, the Client will attempt to create a new flow (i.e. re-authorize traffic) for this resource whenever it sees such an ICMP error, therefore acting as a way of synchronizing the view of the world between Client and Gateway should they ever get out of sync. Testing turned out to be a bit tricky. If we let the authorization on the Gateway lapse naturally, the portal will also toggle the Resource off and on on the Client, resulting in "flushing" the current authorizations. Additionally, if the Client had access to only one Resource, then the Gateway will gracefully close the connection, also resulting in the Client creating a new flow for the next packet. To actually trigger this new behaviour we need to: - Access at least two resources via the same Gateway - Directly send `reject_access` to the Gateway for this particular resource To achieve this, we dynamically eval some code on the API node and instruct the Gateway channel to send `reject_access`. 
The connection stays intact because there is still another active access authorization but packets for the other resource are answered with ICMP errors. To achieve a safe roll-out, the new behaviour is feature-flagged. In order to still test it, we now also allow feature flags to be set via env variables. Resolves: #10074 --------- Co-authored-by: Mariusz Klochowicz <mariusz@klochowicz.com>	2025-09-30 08:23:39 +00:00
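
The Client-side reaction described in this commit can be sketched roughly as follows. This is a Python illustration only, not connlib's code. It assumes the Gateway signals a filtered packet with the standard "administratively prohibited" codes (ICMPv4 type 3 / code 13, ICMPv6 type 1 / code 1); the commit itself only says "prohibited". The env variable name `FZ_FLAG_ICMP_RECREATE_FLOW` and all function names are hypothetical.

```python
import os

# Standard "administratively prohibited" codes (RFC 1812 for ICMPv4, RFC 4443 for ICMPv6).
# Which codes the Gateway actually emits is an assumption of this sketch.
PROHIBITED = {
    4: (3, 13),  # ICMPv4: destination unreachable / communication administratively prohibited
    6: (1, 1),   # ICMPv6: destination unreachable / administratively prohibited
}

FLAG_ENV_VAR = "FZ_FLAG_ICMP_RECREATE_FLOW"  # hypothetical flag name


def flag_enabled(env=None):
    """Read the feature flag from an env variable mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return env.get(FLAG_ENV_VAR, "").lower() in {"1", "true"}


def is_prohibited(ip_version, icmp_type, icmp_code):
    """True if the (type, code) pair is the 'prohibited' error for this IP version."""
    return PROHIBITED.get(ip_version) == (icmp_type, icmp_code)


def resources_to_reauthorize(icmp_errors, env=None):
    """Return the resource ids for which a new flow should be requested.

    `icmp_errors` is an iterable of (ip_version, icmp_type, icmp_code, resource_id).
    Each resource is listed once, in first-seen order.
    """
    if not flag_enabled(env):
        return []
    pending = []
    for ip_version, icmp_type, icmp_code, resource_id in icmp_errors:
        if is_prohibited(ip_version, icmp_type, icmp_code) and resource_id not in pending:
            pending.append(resource_id)
    return pending
```

With the flag unset the function returns an empty list, which mirrors the feature-flagged roll-out: the old behaviour stays in place until the flag is switched on.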
Jamil	2e0517ed7b	feat(api): GET /account API (#10302 ) By customer request, it would be helpful to expose an endpoint to retrieve current account / billing details like seats used and other usage-based metrics. --------- Signed-off-by: Jamil <jamilbk@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-09-28 11:59:39 -07:00
Brian Manifold	6bd19ee9b0	refactor(portal): hard delete data (#9694 )	2025-08-29 22:13:44 +00:00
Jamil	6d2ea0b224	fix(portal): adapt resource on resource_updated (#10247 ) When filters are updated for a Resource, we need to first adapt the resource before rendering it down to the Gateway. Otherwise, the gateway may see a Resource that does not match its expected schema.	2025-08-23 17:53:20 +00:00
Jamil	cafe6554ff	refactor(portal): reduce cache memory usage (#10058 ) Napkin math shows that we can save substantial memory (~3x or more) on the API nodes as connected clients/gateways grow if we just store the fields we need in order to keep the client and gateway state maintained in the channel pids. To facilitate this, we create new `Cacheable` structs that represent their `Domain` cousins, which use byte arrays for `id`s and strip out unused fields. Additionally, all business logic involved with maintaining these caches is now contained within two modules: `Domain.Cache.Client` and `Domain.Cache.Gateway`, and type specs have been added to aid in static analysis and code documentation. Comprehensive testing is now added not only for the cache modules, but for their associated channel modules as well to ensure we handle different kinds of edge cases gracefully. The `Events` nomenclature was renamed to `Changes` to better name what we are doing: Change-Data-Capture. Lastly, the following related changes are included in this PR since they were "in the way" so to speak of getting this done: - We save the last received LSN in each channel and drop the `change` with a warning if we receive it twice in a row, or we receive it out of order - The client/gateway version compatibility calculations have been moved to `Domain.Resources` and `Domain.Gateways` and have been simplified to make them easier to understand and maintain going forward. Related: #10174 Fixes: #9392 Fixes: #9965 Fixes: #9501 Fixes: #10227 --------- Signed-off-by: Jamil <jamilbk@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-08-22 21:52:29 +00:00
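
The "last received LSN" check mentioned in this commit could look roughly like the sketch below. It is a Python illustration with hypothetical names (`ChangeGate`, `accept`), assuming LSNs are comparable integers that grow with commit order; the portal's actual channel code is not shown in this log.

```python
import logging

logger = logging.getLogger("changes")


class ChangeGate:
    """Tracks the last LSN seen by one channel and rejects repeats or stale changes."""

    def __init__(self):
        self.last_lsn = None

    def accept(self, lsn):
        """Return True if the change should be processed, False if it is dropped."""
        if self.last_lsn is not None and lsn <= self.last_lsn:
            # Same LSN twice in a row, or an LSN older than one already handled.
            logger.warning("dropping change lsn=%s (last seen lsn=%s)", lsn, self.last_lsn)
            return False
        self.last_lsn = lsn
        return True
```

Keeping only a single integer per channel is enough here because the stream is expected to be ordered; anything at or below the stored value has either been handled already or arrived late.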
Brian Manifold	551ceafb13	fix(portal): REST api updates (#10191 ) * Minor updates to the REST API to more gracefully handle incorrect input data from requests. * Minor updates to the OpenAPI spec.	2025-08-20 21:08:07 +00:00
Jamil	54d91e2004	fix(portal): don't send reject_access for remaining flows (#10071 ) This fixes a simple logic bug where we were mistakenly reacting to a flow deletion event where flows still existed in the cache by sending `reject_access`. This fixes that bug, and adds more comprehensive logging to help diagnose issues like this more quickly in the future. This PR also fixes the following issues found during the investigation: - We were redundantly reacting to Token deletion in the channel pids. This is unnecessary: we send a global socket disconnect from the Token hook module instead. - We had a bug that would crash the WAL consumer if a "global" token (i.e. relay) was deleted or expired - these have no `account_id`. - We now always use `min(max(all_conforming_polices_expiration), token.expires_at)` when setting expiration on a new flow to minimize the possibility for access churn. - We now check to ensure the token and gateway are still undeleted when re-authorizing a given flow. This prevents us from failing to send `reject_access` when a token or gateway is deleted corresponding to a flow, but the other entities would have granted access. Related: https://firezone.statuspage.io/incidents/xrsm13tml3dh Related: #10068 Related: #9501	2025-08-01 00:03:00 +00:00
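
The expiration rule quoted in this commit, `min(max(all_conforming_polices_expiration), token.expires_at)`, can be illustrated with a few lines of Python. The function name and the choice to raise on an empty list are assumptions of this sketch; how the portal treats policies without an expiration is not stated here.

```python
from datetime import datetime, timedelta, timezone


def flow_expires_at(policy_expirations, token_expires_at):
    """Longest-lived conforming policy wins, but never outlive the token."""
    if not policy_expirations:
        # Assumption: a flow is only created when at least one policy conforms.
        raise ValueError("at least one conforming policy is required")
    return min(max(policy_expirations), token_expires_at)
```

Taking the maximum over the policies means the flow is not torn down when a shorter-lived policy lapses while a longer-lived one still grants access, which is how the rule reduces access churn.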
Jamil	44a9691df5	refactor(portal): don't store account assoc on client (#10009 ) The full `account` struct is only used to render the client's interface, and doesn't need to be stored in the `client` struct when the `subject` struct already tracks it.	2025-07-28 16:24:58 +00:00
Jamil	3ff31e3a33	fix(portal): maintain identity preload on client (#10008 ) When updating a client, we need to maintain the preloaded `identity` association to use for the IdP policy condition.	2025-07-26 00:42:19 +00:00
Jamil	f1a5af356d	fix(portal): groom resource list and flows periodically (#10005 ) Time-based policy conditions are tricky. When they authorize a flow, we correctly tell the Gateway to remove access when the time window expires. However, we do nothing on the client to reset the connectivity state. This means that whenever the access time window was re-entered, the client would essentially never be able to connect to the resource again until the resource was toggled. To fix this, we add a 1-minute check in the client channel that re-checks allowed resources, and updates the client state with the difference. This means that policies that have time-based conditions are only accurate to the minute, but this is how they're presented anyhow. For good measure, we also add a periodic job that runs every minute to delete expired Flows. This will propagate to the Gateway, which will receive `reject_access` if the access for a particular client-resource pair is determined to be actually gone. Zooming out a bit, this PR furthers the theme that: - Client channels react to underlying resource / policy / membership changes directly, while - Gateway channels react primarily to flows being deleted, or the downstream effects of a prior client authorization	2025-07-25 21:04:41 +00:00
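
The 1-minute re-check on the client channel boils down to a set difference between what the client was last told and what it is allowed right now. A minimal Python sketch, assuming resources are identified by ids and using the message names `resource_deleted` and `resource_created_or_updated` that appear in other commits of this log; the function name is hypothetical.

```python
def groom_resources(known_ids, allowed_now_ids):
    """Compare the client's known resources with the currently allowed ones.

    Returns (pushes, new_state): the messages to push to the client, and the
    set of resource ids to remember for the next tick.
    """
    known, allowed = set(known_ids), set(allowed_now_ids)
    pushes = [("resource_deleted", rid) for rid in sorted(known - allowed)]
    pushes += [("resource_created_or_updated", rid) for rid in sorted(allowed - known)]
    return pushes, allowed
```

Running this once per minute produces no messages while nothing changes, and exactly one message per resource when a time window opens or closes.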
Jamil	2959cca8ce	fix(portal): use consistent wireguard psk (#10004 ) Whenever a client requests a connection to a gateway, we need to generate a preshared key that will be used for the underlying WireGuard tunnel. When the connection setup broke or otherwise was lost, _after_ the gateway received the authorize_flow call, but _before_ the client could receive the response (and initiate a tunnel), we would have to wait until an ICE timeout occurred in order to reset state on the gateway. This is because the psk was not used to determine if this was a _new_ flow authorization. So the old authorization would be matched, and the client would never be able to connect, since its tunnel was using the new psk, and the gateway the old. To fix this, we generate a secure random 32-byte `psk_base` on each client and gateway. When a client wishes to connect to a gateway, we compute the WireGuard preshared key as an HMAC over these two inputs. This fixes the issue by ensuring that subsequent flow authorization requests from a particular client to a particular gateway will yield the same psk. Related: #9999 Related: https://github.com/firezone/infra/issues/99	2025-07-25 19:28:47 +00:00
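
The key derivation described in this commit can be sketched with Python's standard library. The commit says only that the preshared key is "an HMAC over these two inputs"; which `psk_base` serves as the HMAC key, which as the message, and the use of SHA-256 are assumptions of this sketch, as are the function names.

```python
import hashlib
import hmac
import secrets


def new_psk_base():
    """A secure random 32-byte value, generated once per client or gateway."""
    return secrets.token_bytes(32)


def wireguard_psk(client_psk_base, gateway_psk_base):
    """Derive a 32-byte preshared key for one client/gateway pair.

    Assumption: the gateway's base is the HMAC key and the client's base is the message.
    """
    return hmac.new(gateway_psk_base, client_psk_base, hashlib.sha256).digest()
```

Because the output depends only on the two stored bases, repeating the authorization for the same client and gateway yields the same 32-byte key, which matches the size WireGuard expects for a preshared key.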
Jamil	ccc736e63e	fix(portal): reauthorize new flow when last flow deleted (#9974 ) The `flows` table tracks authorizations we've made for a resource and persists them, so that we can determine which authorizations are still valid across deploys or hiccups in the control plane connections. Before, when the "in-use" authorization for a resource was deleted, we would have flapped the resource in the client, and sent `reject_access` to the gateway. However, that would cause issues in the following edge case: - Client is currently connected to Resource A through Policy A - Client websocket goes down - Policy B is created for Resource A (for another actor group), and Policy A is deleted by admin - Client reconnects - Client sees that its resource list is the same - Gateway has since received `reject_access` because no new flows were created for this client-resource combination To prevent this from happening, we now try to "reauthorize" the flow whenever the last cached flow is removed for a particular client-resource pair. This avoids needing to toggle the resource on the client since we won't have sent `reject_access` to the gateway.	2025-07-25 01:53:10 +00:00
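
A rough Python sketch of the "reauthorize before rejecting" rule from this commit. The class, its methods, and the `reauthorize` callback are hypothetical; the callback stands in for whatever policy evaluation the portal performs and returns a new flow id, or `None` when no policy grants access any more.

```python
class GatewayFlowCache:
    """Per-gateway cache of flows, keyed by (client_id, resource_id)."""

    def __init__(self, reauthorize):
        self._flows = {}  # (client_id, resource_id) -> set of flow ids
        self._reauthorize = reauthorize

    def add_flow(self, client_id, resource_id, flow_id):
        self._flows.setdefault((client_id, resource_id), set()).add(flow_id)

    def delete_flow(self, client_id, resource_id, flow_id):
        """Return "reject_access" only when no access remains, otherwise None."""
        key = (client_id, resource_id)
        flows = self._flows.get(key)
        if flows is None or flow_id not in flows:
            return None  # nothing cached for this flow, nothing to do
        flows.remove(flow_id)
        if flows:
            return None  # another flow still grants access
        # Last flow is gone: try to authorize a replacement before rejecting.
        new_flow_id = self._reauthorize(client_id, resource_id)
        if new_flow_id is not None:
            self._flows[key] = {new_flow_id}
            return None
        del self._flows[key]
        return "reject_access"
```

In the edge case listed above, deleting the flow created through the old policy would trigger the callback, which succeeds through the new policy, so the gateway keeps the tunnel open and the client never needs to toggle the resource.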
dependabot[bot]	0a0ee3c940	build(deps): bump sentry from 10.10.0 to 11.0.2 in /elixir (#9933 ) Bumps [sentry](https://github.com/getsentry/sentry-elixir) from 10.10.0 to 11.0.2. Release notes (sourced from [sentry's releases](https://github.com/getsentry/sentry-elixir/releases); the [changelog](https://github.com/getsentry/sentry-elixir/blob/master/CHANGELOG.md) carries the same entries): 11.0.2, Bug fixes: - Deeply nested spans are handled now when building up traces in `SpanProcessor` (getsentry/sentry-elixir#924). Various improvements: - Span's attributes no longer include `db.url: "ecto:"` entries as they are now filtered out (getsentry/sentry-elixir#925). 11.0.1, Various improvements: - `Sentry.OpenTelemetry.Sampler` now works with an empty config (getsentry/sentry-elixir#915). 11.0.0: This release comes with beta support for Traces using OpenTelemetry - please test it out and report any issues you find. New features: - Beta support for Traces using OpenTelemetry (getsentry/sentry-elixir#902). To enable Tracing in your Phoenix application, you need to add the following to the deps in your `mix.exs`: `{:sentry, "~> 11.0.0"}`, `{:opentelemetry, "~> 1.5"}`, `{:opentelemetry_api, "~> 1.4"}`, `{:opentelemetry_exporter, "~> 1.0"}`, `{:opentelemetry_semantic_conventions, "~> 1.27"}`, `{:opentelemetry_phoenix, "~> 2.0"}`, `{:opentelemetry_ecto, "~> 1.2"}`. And then configure Tracing in Sentry and OpenTelemetry in your `config.exs`: under `config :sentry`, set `traces_sample_rate: 1.0` (any value between 0 and 1.0 enables tracing); `config :opentelemetry, span_processor: {Sentry.OpenTelemetry.SpanProcessor, []}`; `config :opentelemetry, sampler: {Sentry.OpenTelemetry.Sampler, [drop: []]}` (the changelog copy shows `[]` as the sampler options). - Add installer (based on Igniter) (getsentry/sentry-elixir#876). ... (truncated) Commits: - `b142174` release: 11.0.2 - `f43055b` Update CHANGELOG for 11.0.2 (getsentry/sentry-elixir#926) - `ee512d3` Filter out empty db.url from span's attributes (getsentry/sentry-elixir#925) - `6809aaa` Fix handling of spans at 2+ levels (getsentry/sentry-elixir#924) - `b7e1679` Improve event callback docs (getsentry/sentry-elixir#922) - `97d0382` Merge branch 'release/11.0.1' - `738fc76` release: 11.0.1 - `ab58c0e` Update CHANGELOG (getsentry/sentry-elixir#917) - `028ce18` handle nil drop list (getsentry/sentry-elixir#915) - `5850c73` Merge branch 'release/11.0.0' - Additional commits viewable in [compare view](https://github.com/getsentry/sentry-elixir/compare/10.10.0...11.0.2) Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2025-07-21 21:01:25 +00:00
Jamil	488cb96469	fix(portal): don't prematurely reject access (#9952 ) Before: - When a flow was deleted, we flapped the resource on the client, and sent `reject_access` naively for the flow's `{client_id, resource_id}` pair on the gateway. This resulted in lots of unneeded resource flappage on the client whenever bulk flow deletions happened. After: - When a flow is deleted, we check if this is an active flow for the client. If so, we flap the resource then in order to trigger generation of a new flow. If access was truly affected, resulting in the loss of a resource, we will push `resource_deleted` for the update that triggered the flow deletion (for example the resource/policy removal). On the gateway, we only send `reject_access` if it was the last flow granting access for a particular `client/resource` tuple. Why: - While the access state is still correct in the previous implementation, we run the risk of pushing way too many resource flaps to the client in an overly eager attempt to revoke access the client may no longer be entitled to. cc @thomaseizinger Related: https://firezonehq.slack.com/archives/C08FPHECLUF/p1753101115735179	2025-07-21 13:12:05 -07:00
dependabot[bot]	272074e8d4	build(deps): bump hammer from 7.0.1 to 7.1.0 in /elixir (#9935 ) Bumps [hammer](https://github.com/ExHammer/hammer) from 7.0.1 to 7.1.0. Changelog (sourced from [hammer's changelog](https://github.com/ExHammer/hammer/blob/master/CHANGELOG.md)): 7.1.0 - 2025-07-18: - Fix key type inconsistency in backend implementations - all backends now accept `term()` keys instead of `String.t()` (ExHammer/hammer#143) - Add comprehensive test coverage for various key types (atoms, tuples, integers, lists, maps) - Fix race conditions in atomic backend tests (FixWindow, LeakyBucket, TokenBucket) - Replace timing-dependent tests with polling-based `eventually` helper for better CI reliability - Add documentation warning about Redis backend string key requirement - Fix typo in `inc/3` optional callback documentation (ExHammer/hammer#142) Commits: - `a57bdec` improve changelog last commit (ExHammer/hammer#145) - `bb061c5` Bump version to 7.1.0 (ExHammer/hammer#144) - `7d7967f` Fix key type inconsistency in backend implementations (ExHammer/hammer#143) - `94d3952` Fixes typo for inc/3 optional callback `@doc` (ExHammer/hammer#142) - `79ca221` Bump benchee from 1.3.1 to 1.4.0 (ExHammer/hammer#135) - `a09bbd0` Bump ex_doc from 0.37.3 to 0.38.2 (ExHammer/hammer#141) - `d06a17b` Bump credo from 1.7.11 to 1.7.12 (ExHammer/hammer#134) - `26df742` Update bug_report.md (ExHammer/hammer#133) - `b8765fe` Bump ex_doc from 0.37.2 to 0.37.3 (ExHammer/hammer#131) - See full diff in [compare view](https://github.com/ExHammer/hammer/compare/7.0.1...7.1.0) Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2025-07-21 13:24:40 +00:00
Jamil	f379e85e9b	refactor(portal): cache access state in channel pids (#9773 ) When changes occur in the Firezone DB that trigger side effects, we need some mechanism to broadcast and handle these. Before, the system we used was: - Each process subscribes to a myriad of topics related to data it wants to receive. In some cases it would subscribe to new topics based on received events from existing topics (i.e. flows in the gateway channel), and sometimes in a loop. It would then need to be sure to _unsubscribe_ from these topics - Handle the side effect in the `after_commit` hook of the Ecto function call after it completes - Broadcast only a simple (thin) event message with a DB id - In the receiver, use the id(s) to re-evaluate, or lookup one or many records associated with the change - After the lookup completes, `push` the relevant message(s) to the LiveView, `client` pid, or `gateway` pid in their respective channel processes This system had a number of drawbacks ranging from scalability issues to undesirable access bugs: 1. The `after_commit` callback, on each App node, is not globally ordered. Since we broadcast a thin event schema and read from the DB to hydrate each event, this meant we had a `read after write` problem in our event architecture, leading to the potential for lost updates. Case in point: if a policy is updated from `resource_id-1` to `resource_id-2`, and then back to `resource_id-1`, it's possible that, given the right amount of delay, the gateway channel will receive two `reject_access` events for `resource_id-1`, as opposed to one for `resource_id-1` and one for `resource_id-2`, leading to the potential for unauthorized access. 1. It was very difficult to ensure that the correct topics were being subscribed to and unsubscribed from, and the correct number of times, leading to maintenance issues for other engineers. 1. 
We had a nasty N+1 query problem whenever memberships were added or removed that resulted in essentially all access related to that membership (so all Policies touching its actor group) being re-evaluated and broadcast. This meant that any bulk addition or deletion of memberships would generate so many queries that they'd time out or consume the entire connection pool. 1. We had no durability for side-effect processing. In some places, we were iterating over many returned records to send broadcasts. Broadcasting is not a zero-time operation; each call takes a small amount of CPU time to copy the message into the receiver's mailbox. If we deployed while this was happening, the state update would be lost forever. If this was a `reject_access` for a Gateway, the Gateway would never remove access for that particular flow. 1. On each flow authorization, we needed to hit `us-east1` not only to "authorize" the flow, but to log it as well. This incurs latency especially for users in other parts of the world, which happens on _each_ connection setup to a new resource. 1. Since we read and re-authorize access due to the thin events broadcast from side effects, we risk hitting thundering herd problems (see the N+1 query problem above) where a single DB change could result in all receivers hitting the DB at once to "hydrate" their processing. 1. If an administrator modifies the DB directly, or, if we need to run a DB migration that involves side effects, they'll be lost, because the side effect triggers happened in `after_commit` hooks that are only available when querying the DB through Ecto. Manually deleting (or resurrecting) a policy, for example, would not have updated any connected clients or gateways with the new state. 
To fix all of the above, we move to the system introduced in this PR: - All changes are now serialized (for free) by Postgres and broadcast as a single event stream - The number of topics has been reduced to just one, the `account_id` of an account. All receivers subscribe to this one topic for the lifetime of their pid and then only filter the events they want to act upon, ignoring all other messages - The events themselves have been turned into "fat" structs based on the schemas they represent. By making them properly typed, we can apply things like the existing Policy authorizer functions to them as if we had just fetched them from the DB. - All flow creation now happens in memory and doesn't need to incur a DB hit in `us-east1` to proceed. - Since clients and gateways now track state in a push-based manner from the DB, this means very few actual DB queries are needed to maintain state in the channel procs, and it also means we can be smarter about when to send `resource_deleted` and `resource_created_or_updated` appropriately, since we can always diff between what the client _had_ access to, and what they _now_ have access to. - All DB operations, whether they happen from the application code, a `psql` prompt, or even via Google SQL Studio in the GCP console, will trigger the _same_ side effects. - We now use a replication consumer based off Postgres logical decoding of the write-ahead log using a _durable slot_. This means that Postgres will retain _all events_ until they are acknowledged, giving us the ability to ensure at-least-once processing semantics for our system. Today, the ACK is simply, "did we broadcast this event successfully". But in the future, we can assert that replies are received before we acknowledge the event as processed back to Postgres. The tests in this PR have been updated to pass given the refactor. However, since we are tracking more state now in the channel procs, it would be a good idea to add more tests for those edge cases. 
That is saved as a later PR because (1) this one is already huge, and (2) we need to get this out to staging to smoke test everything anyhow. Fixes: #9908 Fixes: #9909 Fixes: #9910 Fixes: #9900 Related: #9501	2025-07-18 22:47:18 +00:00
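
The "retain until acknowledged" behaviour of the durable slot can be illustrated with a toy model. This Python sketch does not talk to Postgres; `DurableSlot` and `consume` are hypothetical names, LSNs are plain integers, and the ACK rule follows the commit's description ("did we broadcast this event successfully").

```python
class DurableSlot:
    """Toy stand-in for a durable replication slot: keeps events until they are acked."""

    def __init__(self):
        self._events = []  # (lsn, event) in commit order
        self._acked_lsn = 0

    def append(self, lsn, event):
        self._events.append((lsn, event))

    def unacked(self):
        return [(lsn, event) for lsn, event in self._events if lsn > self._acked_lsn]

    def ack(self, lsn):
        self._acked_lsn = max(self._acked_lsn, lsn)


def consume(slot, broadcast):
    """Broadcast unacked events in order, acking each only after it was broadcast.

    Stops at the first failure so that ordering is preserved; the failed event
    and everything after it stay in the slot for the next run.
    Returns the number of events acknowledged.
    """
    acked = 0
    for lsn, event in slot.unacked():
        if not broadcast(event):
            break
        slot.ack(lsn)
        acked += 1
    return acked
```

If the consumer dies between a successful broadcast and the ack, the same event is delivered again on restart, which is why the semantics are at-least-once rather than exactly-once and why receivers must tolerate repeats.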
Jamil	17d7e29b81	fix(portal): use public key for TURN creds (#9870 ) As a followup to #9856, after talking with @bmanifold, we determined using the public_key as the username for TURN credentials is a safer bet because: - It's by definition public and therefore does not need to be obfuscated - It's shorter-lived than the token, especially for the gateway - It essentially represents the data plane connection for client/gateway and naturally rotates along with the key state for those	2025-07-15 01:48:02 +00:00
Jamil	1e577d31b9	fix(portal): use reproducible relay creds (#9857 ) When giving TURN credentials to clients and gateways, it's important that they remain consistent across hiccups in the portal connection so that relayed connections are not interrupted during a deploy, or if the user's internet is flaky, or the GCP load balancer decides to disconnect the client/gateway. Prior to this PR, that was not the case because we essentially tied TURN credentials, required for data plane packet flows, to the WebSocket connection, a control plane element. This happened because we generated random `expires_at` and `salt` elements on _each_ connection to the portal. Instead, what we do now is make these reproducible and tied to the auth token by hashing then base64-encoding it. The expiry is tied to the auth-token's expiry. Fixes #9856	2025-07-14 17:42:11 +00:00
Jamil	e98aa82e8e	fix(portal): respect gateway_group_id filter in REST API (#9840 ) Fixes #9815	2025-07-11 19:12:05 +00:00
Jamil	2a38c532af	chore(portal): remove gateway masquerade option (#9790 ) AFAIK these are ignored by connlib. Instead, we configure masquerading on the host.	2025-07-04 21:08:11 +00:00
Jamil	dddd1b57fc	refactor(portal): remove flow_activities (#9693 ) This has been dead code for a long time. The feature this was meant to support, #8353, will require a different domain model, views, and user flows. Related: #8353	2025-06-27 20:40:25 +00:00
Jamil	0b09d9f2f5	refactor(portal): don't rely on flows.expires_at (#9692 ) The `expires_at` column on the `flows` table was never used outside of the context in which the flow was created in the Client Channel. This ephemeral state, which is created in the `Domain.Flows.authorize_flow/4` function, is never read from the DB in any meaningful capacity, so it can be safely removed. The `expire_flows_for` family of functions now simply reads the needed fields from the flows table in order to broadcast `{:expire_flow, flow_id, client_id, resource_id}` directly to the subscribed entities. This PR is step 1 in removing the reliance on `Flows` to manage ephemeral access state. In a subsequent PR we will actually change the structure of what state is kept in the channel PIDs such that reliance on this Flows table will no longer be necessary. Additionally, in a few places, we were referencing a Flows.Show view that was never available in production, so this dead code has been removed. Lastly, the `flows` table subscription and associated hook processing has been completely removed as it is no longer needed. We've implemented in #9667 logic to remove publications from removed table subscriptions, so we can expect to get a couple ingest warnings when we deploy this as the `Hooks.Flows` processor no longer exists, and the WAL data may have lingering flows records in the queue. These can be safely ignored.	2025-06-27 18:29:12 +00:00
Jamil	343717b502	refactor(portal): broadcast client struct when updated (#9664 ) When a client is updated, we may need to re-initialize it if "breaking" fields are updated. If non-breaking fields are changed, such as name, we don't need to re-initialize the client. This PR also adds a helper `struct_from_params/2` which will create a schema struct from WAL data in order to type cast any needed data for convenience. This avoids having to do a DB hit - we _already have the data from the DB_ - we just need to format and send it. Related: #9501	2025-06-25 17:04:41 +00:00
Jamil	933d51e3d0	feat(portal): send account_slug in gateway init (#9653 ) Adds the `account_slug` to the gateway's `init` message. When the account slug is changed, the gateway's socket is disconnected using the same mechanism as gateway deletion, which causes the gateway to reconnect immediately and receive a new `init`. Related: #9545	2025-06-24 18:35:06 +00:00
Jamil	c6545fe853	refactor(portal): consolidate pubsub functions (#9529 ) We issue broadcasts and subscribes in many places throughout the portal. To help keep the cognitive overhead low, this PR consolidates all PubSub functionality to the `Domain.PubSub` module. This allows for: - better maintainability - see all of the topics we use at a glance - consolidate repeated functionality (saved for a future PR) - use the module hierarchy to define function names, which feels more intuitive when reading and sets a convention We also introduce a `Domain.Events.Hooks` behavior to ensure all hooks comply with this simple contract, and we also introduce a convention to standardize on topic names using the module hierarchy defined herein. Lastly, we add convenience functions to the Presence modules to save a bit of duplication and chance for errors. This will make it much easier to maintain PubSub going forward. Related: #9501	2025-06-15 04:30:57 +00:00
Jamil	cbe33cd108	refactor(portal): move policy events to WAL (#9521 ) Moves all of the policy lifecycle events to be broadcasted from the WAL consumer. #### Test - [x] Enable policy - [x] Disable policy - [x] Delete policy - [x] Non-breaking change - [x] Breaking change Related: #6294 --------- Signed-off-by: Jamil <jamilbk@users.noreply.github.com>	2025-06-14 01:10:09 +00:00
Jamil	c31f51d138	refactor(portal): move resource events to WAL (#9406 ) We move the resource events to the WAL system. Notably, we no longer need `fetch_and_update_breakable` for resource updates, so a bit of refactoring is included to update the call sites for those. Additionally, we need to add a `Flow.expire_flows_for_resource_id/1` function to expire flows from the WAL system. This is now being called in the WAL event handler. To prevent this from blocking the WAL consumer/broadcaster, we wrap it with a Task.async. These will be cleaned up when the lookup table for access is implemented next. Another thing to note is that we lose the `subject` when moving from `Flows.expire_flows_for(%Resource{}, subject)` to `Flows.expire_flows_for_resource_id(resource_id)` when a resource is deleted or updated by an actor since we respond to this event in the WAL where that data isn't available. However, we don't actually _use_ the subject when expiring flows (other than to authorize the initial resource update), so this isn't an issue. Related: #9501 --------- Signed-off-by: Jamil <jamilbk@users.noreply.github.com> Co-authored-by: Brian Manifold <bmanifold@users.noreply.github.com>	2025-06-11 00:12:45 +00:00
Jamil	38c1de351c	refactor(portal): move membership events to WAL (#9388 ) Membership events are quite simple to move to the WAL: - Only one topic is used to determine which client(s) receive updates for which Actor(s). - The unsubscribe was removed because it was unused. - Notably, the N+1 query problem regarding re-evaluating all access again after each membership is updated is still present. This will be fixed using a lookup table in the client channel in the last PR to move events to the WAL. Related: https://github.com/firezone/firezone/issues/6294 Related: https://github.com/firezone/firezone/issues/8187	2025-06-06 06:23:33 +00:00
Jamil	9c3f6e7b36	refactor(portal): don't send `ip_stack` for non-DNS resources (#9376 ) We always return the `ip_stack` field when rendering a resource for both WebSocket and REST APIs. If the resource's type is not `:dns` then this will be `nil`. Related: https://github.com/firezone/firezone/pull/9303#discussion_r2119681062	2025-06-02 23:16:49 -07:00
Jamil	6fc7d2e4e0	feat(portal): configurable ip stack for DNS resources (#9303 ) Some poorly-behaved applications (e.g. mongo) will fail to connect if they see both IPv4 and IPv6 addresses for a DNS resource, because they will try to connect to both of them and fail the whole connection setup if either one is not routable. To fix this, we need to introduce a knob to allow admins to restrict DNS resources to only A or AAAA records. <img width="750" alt="Screenshot 2025-06-02 at 10 48 39 AM" src="https://github.com/user-attachments/assets/4dbcb6ae-685f-43ee-b9e8-1502b365a294" /> <img width="1174" alt="Screenshot 2025-06-02 at 11 05 53 AM" src="https://github.com/user-attachments/assets/02d0a4b3-e6e8-4b6d-89fa-d3d999b5811e" /> --- Related: https://firezonehq.slack.com/archives/C08KPQKJZKM/p1746720923535349 Related: #9300 Fixes: #9042	2025-06-03 02:24:41 +00:00
Jamil	73c3e2d87b	refactor(portal): move gateway events to WAL (#9299 ) This PR moves Gateway events to be triggered by the WAL broadcaster. Some things of note that are cleaned up: - The gateway `:update` event was never received anywhere (but in a test) and so has been removed - The account topic has been removed as it was also never acted upon anywhere. Presence yes, but topic no - The group topic has also been removed as it was only used to receive broadcasted disconnects when a group is deleted, but this was already handled by the token deletion and so is redundant.	2025-06-01 16:40:28 +00:00
Brian Manifold	a51b35a6b4	refactor(portal): remove created_by_<identity/actor> columns (#9306 ) Why: * Now that we have started using the `created_by_subject` field on various tables, we no longer need to keep the `created_by_<identity/actor>` fields. This will help remove a foreign key reference and will be one step closer to allowing us to hard delete data rather than soft deleting all data in order to keep foreign key references like these.	2025-05-30 21:06:35 +00:00
Jamil	6cea0cd6ec	refactor(portal): Move client updates to WAL broadcaster (#9288 ) Client updates are next on the path to moving more side effects to the WAL broadcaster. This one has the following notable changes: - ~~The `actor_clients` pubsub topic were only used to broadcast removal of clients belonging to an actor; these are no longer needed since we handle this in the individual removal event~~ EDIT: only the presence is kept - The `account_clients:{account_id}` pubsub and presence topic definition has been moved to `Events.Hooks.Accounts` because these are broadcasted using the account_id field based on account changes, and have nothing to do with the client lifecycle Related: #6294 Related: #8187	2025-05-29 16:56:08 +00:00
Jamil	7c674ea21c	refactor(portal): Move expire_flow to WAL broadcaster (#9286 ) Similar to #9285, we move the `expire_flow` event to be broadcasted from the WAL broadcaster. Unrelated tests needed to be updated to not expect to receive the broadcast, and instead check to ensure the record has been updated. A minor bug is also fixed in the ordering of the `old_data, data` fields. Tested manually on dev. Related: #6294 Related: #8187	2025-05-29 06:35:03 +00:00
Jamil	8d701efe4b	refactor(portal): Move `config_changed` to WAL broadcaster (#9285 ) Now that the WAL consumer has been dry running in production for some time, we can begin moving events over to it. We start with a relatively simple case: the account `config_changed` event. Since side effects now happen decoupled from the actual record updates, testing is updated in this PR: - We don't expect broadcasts to happen in the `accounts_test.exs` - these context modules are now solely responsible for managing updates to records and will no longer need to worry about side effects (in the typical case) like subscribe and broadcast - The Event hooks module now contains all logic related to processing side effects for a particular account update. The net effect is that we now have a dedicated module and tests for side effects, starting with `accounts`. Related: #6294 Related: #8187	2025-05-28 18:23:48 +00:00
Brian Manifold	12b4a12f26	feat(portal): Add created_by_subject (#9176 ) Why: * We have decided to change the way we will do audit logging. Instead of soft deleting data and keeping it in the table it was created in, we will be moving to an audit trail table where various actions will be recorded in a table/DB specifically for auditing purposes. Due to this change we need to make sure that we don't have stale/dangling references. One set of references we keep everywhere is `created_by_identity_id` and `created_by_actor_id`. Those foreign key references won't be able to be used after moving to the new audit system. This commit will allow us to keep that info by pulling the values and storing the data in a created_by_subject field on the record.	2025-05-20 20:03:46 +00:00
Brian Manifold	dd5a53f686	fix(portal): Fix sign_up to properly populate email (#9105 ) Why: * During the account sign up flow, the email of the first admin was not being populated in the `email` column on the auth_identities table. This was due to atoms being passed in the attrs instead of strings to the `create_identity` function. A migration was also created to backfill the missing emails in the `auth_identities` table.	2025-05-13 19:49:25 +00:00
Jamil	649c03e290	chore(portal): Bump LoggerJSON to 7.0.0, fixing config (#8759 ) There was a slight API change in the way LoggerJSON's configuration is generated, so I took the time to do a little fixing and cleanup here. Specifically, we should be using the `new/1` callback to create the Logger config which fixes the below exception due to missing config keys: ``` FORMATTER CRASH: {report,[{formatter_crashed,'Elixir.LoggerJSON.Formatters.GoogleCloud'},{config,[{metadata,{all_except,[socket,conn]}},{redactors,[{'Elixir.LoggerJSON.Redactors.RedactKeys',[<<"password">>,<<"secret">>,<<"nonce">>,<<"fragment">>,<<"state">>,<<"token">>,<<"public_key">>,<<"private_key">>,<<"preshared_key">>,<<"session">>,<<"sessions">>]}]}]},{log_event,#{meta => #{line => 15,pid => <0.308.0>,time => 1744145139650804,file => "lib/logger.ex",gl => <0.281.0>,domain => [elixir],application => libcluster,mfa => {'Elixir.Cluster.Logger',info,2}},msg => {string,<<"[libcluster:default] connected to :\"web@web.cluster.local\"">>},level => info}},{reason,{error,{badmatch,[{metadata,{all_except,[socket,conn]}},{redactors,[{'Elixir.LoggerJSON.Redactors.RedactKeys',[<<"password">>,<<"secret">>,<<"nonce">>,<<"fragment">>,<<"state">>,<<"token">>,<<"public_key">>,<<"private_key">>,<<"preshared_key">>,<<"session">>,<<"sessions">>]}]}]},[{'Elixir.LoggerJSON.Formatters.GoogleCloud',format,2,[{file,"lib/logger_json/formatters/google_cloud.ex"},{line,148}]}]}}]} ``` Supersedes #8714 --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2025-04-11 19:00:06 -07:00
Jamil	d2fd57a3b6	fix(portal): Attach Sentry in each umbrella app (#8749 ) - Attaches the Sentry Logging hook in each of [api, web, domain] - Removes errant Sentry logging configuration in config/config.exs - Fixes the exception logger to default to logging exceptions, use `skip_sentry: true` to skip Tested successfully in dev. Hopefully the cluster behaves the same way. Fixes #8639	2025-04-11 04:17:12 +00:00
Jamil	43d084f97f	refactor(portal): Enforce internet resource site exclusion (#8448 ) Finishes up the Internet Resource migration by enforcing: - No internet resources in non-internet sites - No regular resources in internet sites - Removing the prompt to migrate ~~I've already migrated the existing internet resources in customer's accounts. No one that was using the internet resource hadn't already migrated.~~ Edit: I started to head down that path, then decided doing this here in a data migration was going to be a better approach. Fixes #8212	2025-03-15 18:25:32 -05:00
Brian Manifold	d133ee84b7	feat(portal): Add API rate limiting (#8417 )	2025-03-13 03:21:09 +00:00
Jamil	6d527c1308	feat(portal): Search domain UI and JSON view (#8401 ) - Adds a simple text input to configure search domains ("default DNS suffix") in the Settings -> DNS page. - Sends the `search_domain` field as part of the client's `init` message - Fixes a minor UI alignment inconsistency for the upstream resolvers field so that the total form width and `New resolver` button width are the same. <img width="1137" alt="Screenshot 2025-03-09 at 10 56 56 PM" src="https://github.com/user-attachments/assets/a1d5a570-8eae-4aa9-8a1c-6aaeb9f4c33a" /> Fixes #8365	2025-03-10 17:46:40 +00:00
Jamil	c3a9bac465	feat(portal): Add client endpoints to REST API (#8355 ) Adds the following endpoints: - `PUT /clients/:id` for updating the `name` - `PUT /clients/:client_id/verify` for verifying a client - `PUT /clients/:client_id/unverify` for unverifying a client - `GET /clients` for listing clients in an account - `GET /clients/:id` for getting a single client - `DELETE /clients/:id` for deleting a client Related: #8081	2025-03-05 00:37:01 +00:00
Jamil	e064cf5821	fix(portal): Debounce relays_presence (#8302 ) If the websocket connection between a relay and the portal experiences a temporary network split, the portal will immediately send the disconnected id of the relay to any connected clients and gateways, and all relayed connections (and current allocations) will be immediately revoked by connlib. This tight coupling is needlessly disruptive. As we've seen in staging and production logs, relay disconnects can happen randomly, and in the vast majority of cases immediately reconnect. Currently we see about 1-2 dozen of these per day. To better account for this, we introduce a debounce mechanism in the portal for `relays_presence` disconnects that works as follows: - When a relay disconnects, record its `stamp_secret` (this is somewhat tricky as we don't get this at the time of disconnect - we need to cache it by relay_id beforehand) - If the same `relay_id` reconnects again with the same `stamp_secret` within `relays_presence_debounce_timeout` -> no-op - If the same `relay_id` reconnects again with a different `stamp_secret` -> disconnect immediately - If it doesn't reconnect, then send the `relays_presence` with the disconnected_id after the `relays_presence_debounce_timeout` There are several ways connlib detects a relay is down: 1. Binding requests time out. These happen every 25s, so on average we don't know a Relay is down for 12.5s + backoff timer. 2. `relays_presence` - this is currently the fastest way to detect relays are down. With this change, the caveat is we will now detect this with a delay of `relays_presence_debounce_timeout`. Fixes #8301	2025-03-04 23:56:40 +00:00
Jamil	fee808bc62	chore(portal): Log error for unknown channel messages (#8299 ) Instead of crashing, it would make sense to log these and let the connected entity maintain its WebSocket connection. This should never happen in practice if we maintain our version compatibility matrix properly, but it will help reduce the blast radius of a channel message bug that happens to slip out into the wild. Fixes #4679	2025-03-03 21:21:39 +00:00
Jamil	e5ae00ab99	fix(portal): norely -> noreply in gateway/channel.ex (#8329 ) Fixes a typo that snuck in in #8267	2025-03-03 08:15:46 +00:00
Jamil	cb0bf44815	chore: Remove ability to create GCP log sinks (#8298 ) This has long since been removed in the Clients.	2025-02-28 20:57:21 +00:00
Jamil	e03047d549	feat(portal): Send gateway ipv4 and ipv6 to client (#8291 ) In order to properly handle SRV and TXT records on the clients, we need to be able to pick a Gateway using the initial query itself. After that, we need to know the Gateway Tunnel IPs we're connecting to so we can have the query perform the lookup. Fixes #8281	2025-02-28 03:52:27 +00:00
Brian Manifold	bc150156ce	fix(portal): Update gateway channel to process resource_update (#8280 ) Why: * After merging #8267 it was discovered that there was a race condition that allowed a `resource_create` message to end up at the Gateway Channel process. Previously, this message would not have ever arrived, because we were replacing Resource IDs when a breaking change was made, but since that is no longer the case, it is possible that a connection could be established between the time the `delete_resource` and `create_resource` messages are sent and the `create_resource` would end up at the Gateway Channel process. This commit adds a no-op handler to make sure the message gets processed without throwing an error.	2025-02-27 01:46:13 +00:00
Brian Manifold	d0f0de0f8d	refactor(portal): Allow breaking changes in Resources/Policies (#8267 ) Why: * Rather than using a persistent_id field in Resources/Policies, it was decided that we should allow "breaking changes" to these entities. This means that Resources/Policies will now be able to update all fields on the schema without changing the primary key ID of the entity. * This change will greatly help the API and Terraform provider development. @jamilbk, would you like me to put a migration in this PR to actually get rid of all of the existing soft deleted entities? @thomaseizinger, I tagged you on this, because I wanted to make sure that these changes weren't going to break any expectations in the client and/or gateways. --------- Signed-off-by: Brian Manifold <bmanifold@users.noreply.github.com> Co-authored-by: Jamil <jamilbk@users.noreply.github.com>	2025-02-26 17:05:34 +00:00
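
The debounce rules listed in commit e064cf5821 (#8302) above can be sketched as a small state machine. This is only an illustrative model in Python, not the portal's Elixir implementation: the class `RelayDebouncer`, its method names, and the explicit `now` timestamps are all hypothetical, and `timeout` merely stands in for `relays_presence_debounce_timeout`.

```python
from dataclasses import dataclass, field


@dataclass
class RelayDebouncer:
    """Toy model of the relays_presence debounce rules (hypothetical API)."""

    timeout: float  # stands in for relays_presence_debounce_timeout, in seconds
    secrets: dict = field(default_factory=dict)  # relay_id -> cached stamp_secret
    pending: dict = field(default_factory=dict)  # relay_id -> announce deadline

    def disconnected(self, relay_id, now):
        """A relay's socket dropped: start the debounce window, announce nothing yet."""
        self.pending[relay_id] = now + self.timeout

    def connected(self, relay_id, stamp_secret, now):
        """A relay (re)connected: return the relay ids to announce as disconnected now."""
        announce = self.expire(now)  # flush windows that have already closed
        previous = self.secrets.get(relay_id)
        # Cache the secret up front, because it is not available at disconnect time.
        self.secrets[relay_id] = stamp_secret
        if relay_id in self.pending:
            del self.pending[relay_id]
            if previous != stamp_secret:
                # Different stamp_secret: treat the old relay as gone immediately.
                announce.append(relay_id)
            # Same stamp_secret inside the window: no-op.
        return announce

    def expire(self, now):
        """Return relay ids whose debounce window elapsed without a reconnect."""
        due = [rid for rid, deadline in self.pending.items() if now >= deadline]
        for rid in due:
            del self.pending[rid]
            self.secrets.pop(rid, None)
        return due
```

Passing timestamps in explicitly keeps the sketch deterministic; a real implementation would presumably rely on timers in the process that tracks presence.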


166 Commits