Before:
- When a flow was deleted, we flapped the resource on the client, and
sent `reject_access` naively for the flow's `{client_id, resource_id}`
pair on the gateway. This resulted in lots of unneeded resource flappage
on the client whenever bulk flow deletions happened.
After:
- When a flow is deleted, we check if this is an active flow for the
client. If so, we flap the resource then in order to trigger generation
of a new flow. If access was truly affected, that results in a loss of a
resource, we will push `resource_deleted` for the update that triggered
the flow deletion (for example the resource/policy removal). On the
gateway, we only send `reject_access` if it was the last flow granting
access for a particular `client/resource` tuple.
Why:
- While the access state is still correct in the previous
implementation, we run the possibility of pushing way too many resource
flaps to the client in an overly eager attempt to remove access the
client may not have access to.
cc @thomaseizinger
Related:
https://firezonehq.slack.com/archives/C08FPHECLUF/p1753101115735179
Bumps [hammer](https://github.com/ExHammer/hammer) from 7.0.1 to 7.1.0.
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/ExHammer/hammer/blob/master/CHANGELOG.md">hammer's
changelog</a>.</em></p>
<blockquote>
<h2>7.1.0 - 2025-07-18</h2>
<ul>
<li>Fix key type inconsistency in backend implementations - all backends
now accept <code>term()</code> keys instead of <code>String.t()</code>
(<a
href="https://redirect.github.com/ExHammer/hammer/issues/143">#143</a>)</li>
<li>Add comprehensive test coverage for various key types (atoms,
tuples, integers, lists, maps)</li>
<li>Fix race conditions in atomic backend tests (FixWindow, LeakyBucket,
TokenBucket)</li>
<li>Replace timing-dependent tests with polling-based
<code>eventually</code> helper for better CI reliability</li>
<li>Add documentation warning about Redis backend string key
requirement</li>
<li>Fix typo in <code>inc/3</code> optional callback documentation (<a
href="https://redirect.github.com/ExHammer/hammer/issues/142">#142</a>)</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="a57bdecdc1"><code>a57bdec</code></a>
improve changelog last commit (<a
href="https://redirect.github.com/ExHammer/hammer/issues/145">#145</a>)</li>
<li><a
href="bb061c5334"><code>bb061c5</code></a>
Bump version to 7.1.0 (<a
href="https://redirect.github.com/ExHammer/hammer/issues/144">#144</a>)</li>
<li><a
href="7d7967f898"><code>7d7967f</code></a>
Fix key type inconsistency in backend implementations (<a
href="https://redirect.github.com/ExHammer/hammer/issues/143">#143</a>)</li>
<li><a
href="94d39525e8"><code>94d3952</code></a>
Fixes typo for inc/3 optional callback <code>@doc</code> (<a
href="https://redirect.github.com/ExHammer/hammer/issues/142">#142</a>)</li>
<li><a
href="79ca221876"><code>79ca221</code></a>
Bump benchee from 1.3.1 to 1.4.0 (<a
href="https://redirect.github.com/ExHammer/hammer/issues/135">#135</a>)</li>
<li><a
href="a09bbd0d42"><code>a09bbd0</code></a>
Bump ex_doc from 0.37.3 to 0.38.2 (<a
href="https://redirect.github.com/ExHammer/hammer/issues/141">#141</a>)</li>
<li><a
href="d06a17b6be"><code>d06a17b</code></a>
Bump credo from 1.7.11 to 1.7.12 (<a
href="https://redirect.github.com/ExHammer/hammer/issues/134">#134</a>)</li>
<li><a
href="26df742620"><code>26df742</code></a>
Update bug_report.md (<a
href="https://redirect.github.com/ExHammer/hammer/issues/133">#133</a>)</li>
<li><a
href="b8765fe216"><code>b8765fe</code></a>
Bump ex_doc from 0.37.2 to 0.37.3 (<a
href="https://redirect.github.com/ExHammer/hammer/issues/131">#131</a>)</li>
<li>See full diff in <a
href="https://github.com/ExHammer/hammer/compare/7.0.1...7.1.0">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
These parameters should be tuned to how long we expect "normal" queries
to take against the SQL instance. For smaller instances, "normal"
queries may take longer than 500ms, so we need to be able to configure
these via our Terraform configuration.
If not specified, the same defaults are used as before.
Related: https://github.com/firezone/infra/pull/82
When changes occur in the Firezone DB that trigger side effects, we need
some mechanism to broadcast and handle these.
Before, the system we used was:
- Each process subscribes to a myriad of topics related to data it wants
to receive. In some cases it would subscribe to new topics based on
received events from existing topics (I.e. flows in the gateway
channel), and sometimes in a loop. It would then need to be sure to
_unsubscribe_ from these topics
- Handle the side effect in the `after_commit` hook of the Ecto function
call after it completes
- Broadcast only a simply (thin) event message with a DB id
- In the receiver, use the id(s) to re-evaluate, or lookup one or many
records associated with the change
- After the lookup completes, `push` the relevant message(s) to the
LiveView, `client` pid, or `gateway` pid in their respective channel
processes
This system had a number of drawbacks ranging from scalability issues to
undesirable access bugs:
1. The `after_commit` callback, on each App node, is not globally
ordered. Since we broadcast a thin event schema and read from the DB to
hydrate each event, this meant we had a `read after write` problem in
our event architecture, leading to the potential for lost updates. Case
in point: if a policy is updated from `resource_id-1` to
`resource_id-2`, and then back to `resource_id-1`, it's possible that,
given the right amount of delay, the gateway channel will receive two
`reject_access` events for `resource_id-1`, as opposed to one for
`resource_id-1` and one for `resource_id-2`, leading to the potential
for unauthorized access.
1. It was very difficult to ensure that the correct topics were being
subscribed to and unsubscribed from, and the correct number of times,
leading to maintenance issues for other engineers.
1. We had a nasty N+1 query problem whenever memberships were added or
removed that resolved in essentially all access related to that
membership (so all Policies touching its actor group) to be
re-evaluated, and broadcasted. This meant that any bulk addition or
deletion of memberships would generate so many queries that they'd
timeout or consume the entire connection pool.
1. We had no durability for side-effect processing. In some places, we
were iterating over many returned records to send broadcasts.
Broadcasting is not a zero-time operation, each call takes a small
amount of CPU time to copy the message into the receiver's mailbox. If
we deployed while this was happening, the state update would be lost
forever. If this was a `reject_access` for a Gateway, the Gateway would
never remove access for that particular flow.
1. On each flow authorization, we needed to hit `us-east1` not only to
"authorize" the flow, but to log it as well. This incurs latency
especially for users in other parts of the world, which happens on
_each_ connection setup to a new resource.
1. Since we read and re-authorize access due to the thin events
broadcasted from side effects, we risk hitting thundering herd problems
(see the N+1 query problem above) where a single DB change could result
in all receivers hitting the DB at once to "hydrate" their
processing.ion
1. If an administrator modifies the DB directly, or, if we need to run a
DB migration that involves side effects, they'll be lost, because the
side effect triggers happened in `after_commit` hooks that are only
available when querying the DB through Ecto. Manually deleting (or
resurrecting) a policy, for example, would not have updated any
connected clients or gateways with the new state.
To fix all of the above, we move to the system introduced in this PR:
- All changes are now serialized (for free) by Postgres and broadcasted
as a single event stream
- The number of topics has been reduced to just one, the `account_id` of
an account. All receivers subscribe to this one topic for the lifetime
of their pid and then only filter the events they want to act upon,
ignoring all other messages
- The events themselves have been turned into "fat" structs based on the
schemas they present. By making them properly typed, we can apply things
like the existing Policy authorizer functions to them as if we had just
fetched them from the DB.
- All flow creation now happens in memory and doesn't not need to incur
a DB hit in `us-east1` to proceed.
- Since clients and gateways now track state in a push-based manner from
the DB, this means very few actual DB queries are needed to maintain
state in the channel procs, and it also means we can be smarter about
when to send `resource_deleted` and `resource_created_or_updated`
appropriately, since we can always diff between what the client _had_
access to, and what they _now_ have access to.
- All DB operations, whether they happen from the application code, a
`psql` prompt, or even via Google SQL Studio in the GCP console, will
trigger the _same_ side effects.
- We now use a replication consumer based off Postgres logical decoding
of the write-ahead log using a _durable slot_. This means that Postgres
will retain _all events_ until they are acknowledged, giving us the
ability to ensure at-least-once processing semantics for our system.
Today, the ACK is simply, "did we broadcast this event successfully".
But in the future, we can assert that replies are received before we
acknowledge the event as processed back to Postgres.
The tests in this PR have been updated to pass given the refactor.
However, since we are tracking more state now in the channel procs, it
would be a good idea to add more tests for those edge cases. That is
saved as a later PR because (1) this one is already huge, and (2) we
need to get this out to staging to smoke test everything anyhow.
Fixes: #9908Fixes: #9909Fixes: #9910Fixes: #9900
Related: #9501
As a followup to #9882, we need to ensure that `jsonb` columns that have
value data other than strings are not decoded as jsonb. An example of
when this happens is when Postgres sends an `:unchanged_toast` to
indicate the data hasn't changed.
In #9664, we introduced the `Domain.struct_from_params/2` function which
converts a set of params containing string keys into a provided struct
representing a schema module. This is used to broadcast actual structs
pertaining to WAL data as opposed to simple string encodings of the
data.
The problem is that function was a bit too naive and failed to properly
cast embedded schemas, resulting in all embedded schema on the root
struct being `nil` or `[]`.
To fix this, we need to do two things:
1. We now decode JSON/JSONB fields from binaries (strings) into actual
lists and maps in the replication consumer module for downstream
processors to use
2. We update our `struct_from_params/2` function to properly cast
embedded schemas from these lists and maps using Ecto.Changeset's
`apply_changes` function, which uses the same logic to instantiate the
schemas as if we were saving a form or API request.
Lastly, tests are added to ensure this works under various scenarios,
including nested embedded schemas which we use in some places.
Fixes#9835
---------
Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
In #9870, the password generation algorithm was broken. The correct
order of the elements in the hash is: expiry, stamp_secret, salt. The
relay expects this order when it re-generates the password to validate
the message.
Due to a different bug in our CI system, we weren't actually checking
for warnings / errors in our perf-test suite:
https://github.com/firezone/firezone/actions/runs/16285038111/job/45982241021#step:9:66
Why:
* Adding more BEAM VM metrics to give us better insight as to how our
BEAM cluster is running since we're in the middle of making some
moderately large architectural changes to the application.
As a followup to #9856, after talking with @bmanifold, we determined
using the public_key as the username for TURN credentials is a safer bet
because:
- It's by definition public and therefore does not need to be obfuscated
- It's shorter-lived than the token, especially for the gateway
- It essentially represents the data plane connection for client/gateway
and naturally rotates along with the key state for those
When giving TURN credentials to clients and gateways, it's important
that they remain consistent across hiccups in the portal connection so
that relayed connections are not interrupted during a deploy, or if the
user's internet is flaky, or the GCP load balancer decides to disconnect
the client/gateway.
Prior to this PR, that was not the case because we essentially tied TURN
credentials, required for data plane packet flows, to the WebSocket
connection, a control plane element. This happened because we generated
random `expires_at` and `salt` elements on _each_ connection to the
portal.
Instead, what we do now is make these reproducible and tied to the auth
token by hashing then base64-encoding it. The expiry is tied to the
auth-token's expiry.
Fixes#9856
The Postgres logical decoding protocol is lacking documentation and
unclear about keepalive behavior when `wal_sender_timeout` is set to 0
(disabled). We have it disabled so that Postgres doesn't terminate our
connection for falling too far behind.
What we failed to take into account is that on some installations,
Postgres _never_ requests an immediate reply (keepalive with the reply
now bit set) if wal_sender_timeout is disabled. This means we would
always reply with the empty message, failing to advance the position of
the LSN.
In this PR, we fix that to always respond to every keepalive message
with a standby status update to advance the LSN position.
Relevant documentation:
https://www.postgresql.org/docs/current/protocol-replication.html#PROTOCOL-REPLICATION-STANDBY-STATUS-UPDATE
Firezone uses ICMP errors to signal to client applications that e.g. a
certain IP is not reachable. This happens for example if a DNS resource
only resolves to IPv4 addresses yet the client application attempted to
use an IPv6 proxy address to connect to it.
In the presence of traffic filters for such a resource that does _not_
allow ICMP, we currently filter out these ICMP errors because - well -
ICMP traffic is not allowed! However, even in the presence of ICMP
traffic being allowed, we would fail to evaluate this filter because the
ICMP error packet is not an ICMP echo reply and therefore doesn't have
an ICMP identifier. We require this in the DNS resource NAT to identify
"connections" and NAT them correctly. The same L4 component is used to
evaluate the traffic filters.
ICMP errors are critical to many usage scenarios and algorithms like
happy-eyeballs. Dropping them usually results in weird behaviour as
client applications can then only react to timeouts.
In #9733, we changed the replies of the handle_data messages which seems
to have caused Postgres to not respect our acknowledgements sent in the
keepalive.
To fix this, we revert to sending an empty message in response to write
messages.
Inserting a change log incurs some minor overhead for sending query over
the network and reacting to its response. In many cases, this makes up
the bulk of the actual time it takes to run the change log insert.
To reduce this overhead and avoid any kind of processing delay in the
WAL consumers, we introduce batch insert functionality with size `500`
and timeout `30` seconds. If either of those two are hit, we flush the
batch using `insert_all`.
`insert_all` does not use `Ecto.Changeset`, so we need to be a bit more
careful about the data we insert, and check the inserted LSNs to
determine what to update the acknowledged LSN pointer to.
The functionality to determine when to call the new `on_flush/1`
callback lives in the replication_connection module, but the actual
behavior of `on_flush/1` is left to the child modules to implement. The
`Events.ReplicationConnection` module does not use flush behavior, and
so does not override the defaults, which is not to use a flush
mechanism.
Related: #949
Why:
* We were previously only catching the `:rate_limited` error when
sending welcome emails. This update adds a catch-all case to gracefully
handle the error and alert us.
---------
Signed-off-by: Brian Manifold <bmanifold@users.noreply.github.com>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
This has been dead code for a long time. The feature this was meant to
support, #8353, will require a different domain model, views, and user
flows.
Related: #8353
When the ReplicationConnection dies, its Manager will die too on all
other nodes, and all domain Application supervisors on all nodes will
attempt to restart them. This allows the connection to migrate to a
healthy node automagically.
However, the default Supervisor behavior is to allow 3 restarts in 5
seconds before the whole tree is taken down. To prevent this, we trap
the exit in the ReplicationManager and attempt to reconnect right away,
beginning the backoff process.
We had an old bug in one of our acceptance tests that is just now being
hit again due to the faster runners.
- We need to wait for the dropdown to become visible before clicking
- We fix a minor timer issue that was calculating elapsed time
incorrectly when determining when time out finding an el.
The `expires_at` column on the `flows` table was never used outside of
the context in which the flow was created in the Client Channel. This
ephemeral state, which is created in the `Domain.Flows.authorize_flow/4`
function, is never read from the DB in any meaningful capacity, so it
can be safely removed.
The `expire_flows_for` family of functions now simply reads the needed
fields from the flows table in order to broadcast `{:expire_flow,
flow_id, client_id, resource_id}` directly to the subscribed entities.
This PR is step 1 in removing the reliance on `Flows` to manage
ephemeral access state. In a subsequent PR we will actually change the
structure of what state is kept in the channel PIDs such that reliance
on this Flows table will no longer be necessary.
Additionally, in a few places, we were referencing a Flows.Show view
that was never available in production, so this dead code has been
removed.
Lastly, the `flows` table subscription and associated hook processing
has been completely removed as it is no longer needed. We've implemented
in #9667 logic to remove publications from removed table subscriptions,
so we can expect to get a couple ingest warnings when we deploy this as
the `Hooks.Flows` processor no longer exists, and the WAL data may have
lingering flows records in the queue. These can be safely ignored.
Now that we know the bypass system works, it might be a good idea to
allow it to lag data up to 30m so that events accrued during deploys are
not lost.
Also, this PR fixes a small bug where we triggered the threshold _after_
a transaction already committed (`COMMIT`), instead of before the data
came through (`BEGIN`). Since the timestamps are identical (see below),
it would be more accurate to read the timestamp of the transaction
before acting on the data contained within.
```
[(domain 0.1.0+dev) lib/domain/change_logs/replication_connection.ex:4: Domain.ChangeLogs.ReplicationConnection.handle_message/3]
"BEGIN #{commit_timestamp}" #=> "BEGIN 2025-06-26 04:22:45.283151Z"
[(domain 0.1.0+dev) lib/domain/change_logs/replication_connection.ex:4: Domain.ChangeLogs.ReplicationConnection.handle_message/3]
"END #{commit_timestamp}" #=> "END 2025-06-26 04:22:45.283151Z"
```
---------
Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Brian Manifold <bmanifold@users.noreply.github.com>
When a client is updated, we may need to re-initialize it if "breaking"
fields are updated. If non-breaking fields are changed, such as name, we
don't need to re-initialize the client.
This PR also adds a helper `struct_from_params/2` which will create a
schema struct from WAL data in order to type cast any needed data for
convenience. This avoid having to do a DB hit - we _already have the
data from the DB_ - we just need to format and send it.
Related: #9501
Creating a table publication(s) (and associated replication slot) is
sticky. These will outlive the lifetime of the process that created
them.
We don't want to remove them on shutdown, because this will pause WAL
writing to disk.
However, when starting the _new_ application, it's possible
`table_subscriptions` has changed (such as if we decide we no longer
want events for a certain table). We weren't updating the created
publication(s) with these added/removed tables, so this PR updates the
replication connection setup state machine to pass through a few
conditionals to get these properly updated with the diff of old vs new.
Building on the WAL consumer that's been in development over the past
several weeks, we introduce a new `change_logs` table that stores very
lightly up-fitted data decoded from the WAL:
- `account_id` (indexed): a foreign key reference to an account.
- `inserted_at` (indexed): the timestamp of insert, for truncating rows
later.
- `table`: the table where the op took place.
- `op`: the operation performed (insert/update/delete)
- `old_data`: a nullable map of the old row data (update/delete)
- `data`: a nullable map of the new row data(insert/update)
- `vsn`: an integer version field we can bump to signify schema changes
in the data in case we need to apply operations to only new or only old
data.
Judging from our prod metrics, we're currently average about 1,000 write
operations a minute, which will generate about 1-2 dozen changelogs / s.
Doing the math on this, 30 days at our current volume will yield about
50M / month, which should be ok for some time, since this is an
append-only, rarely (if ever) read from table.
The one aspect of this we may need to handle sooner than later is
batch-inserting these. That raises an issue though - currently, in this
PR, we process each WAL event serially, ending with the final
acknowledgement `:ok` which will signal to Postgres our status in
processing the WAL.
If we do anything async here, this processing "cursor" then becomes
inaccurate, so we may need to think about what to track and what data we
care about.
Related: #7124
It's confusing that we clear this field upon sync failure. Instead, we
let it track the time of the last sync.
Will be cleaned up in #6294 so just applying a minimal fix now.
Fixes#7715
Adds the `account_slug` to the gateway's `init` message. When the
account slug is changed, the gateway's socket is disconnected using the
same mechanism as gateway deletion, which causes the gateway to
reconnect immediately and receive a new `init`.
Related: #9545
Why:
* After updating the Auth Provider changesets to trim all whitespace
from user editable string fields we realized we needed to do the same
for all forms/entities within Firezone. This commit updates all entities
to trim whitespace on string fields.
Fixes: #9579
Unfortunately #9608 did not handle the case where we receive more than
200 compressed metrics in a single call. To fix this, we ensure we
`flush` the metrics buffer inside the `reduce` so that we never grow the
accumulated metrics buffer larger than 200 points.
The log string was updated to roll the issue over in Sentry as well as
the old issue was set to delete and destroy to prevent issue spam.
---------
Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>