Why:
* We were previously only catching the `:rate_limited` error when
sending welcome emails. This update adds a catch-all case to gracefully
handle the error and alert us.
---------
Signed-off-by: Brian Manifold <bmanifold@users.noreply.github.com>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
This has been dead code for a long time. The feature this was meant to
support, #8353, will require a different domain model, views, and user
flows.
Related: #8353
When the ReplicationConnection dies, its Manager will die too on all
other nodes, and all domain Application supervisors on all nodes will
attempt to restart them. This allows the connection to migrate to a
healthy node automagically.
However, the default Supervisor behavior is to allow 3 restarts in 5
seconds before the whole tree is taken down. To prevent this, we trap
the exit in the ReplicationManager and attempt to reconnect right away,
beginning the backoff process.
We had an old bug in one of our acceptance tests that is just now being
hit again due to the faster runners.
- We need to wait for the dropdown to become visible before clicking
- We fix a minor timer issue that was calculating elapsed time
incorrectly when determining when time out finding an el.
The `expires_at` column on the `flows` table was never used outside of
the context in which the flow was created in the Client Channel. This
ephemeral state, which is created in the `Domain.Flows.authorize_flow/4`
function, is never read from the DB in any meaningful capacity, so it
can be safely removed.
The `expire_flows_for` family of functions now simply reads the needed
fields from the flows table in order to broadcast `{:expire_flow,
flow_id, client_id, resource_id}` directly to the subscribed entities.
This PR is step 1 in removing the reliance on `Flows` to manage
ephemeral access state. In a subsequent PR we will actually change the
structure of what state is kept in the channel PIDs such that reliance
on this Flows table will no longer be necessary.
Additionally, in a few places, we were referencing a Flows.Show view
that was never available in production, so this dead code has been
removed.
Lastly, the `flows` table subscription and associated hook processing
has been completely removed as it is no longer needed. We've implemented
in #9667 logic to remove publications from removed table subscriptions,
so we can expect to get a couple ingest warnings when we deploy this as
the `Hooks.Flows` processor no longer exists, and the WAL data may have
lingering flows records in the queue. These can be safely ignored.
Now that we know the bypass system works, it might be a good idea to
allow it to lag data up to 30m so that events accrued during deploys are
not lost.
Also, this PR fixes a small bug where we triggered the threshold _after_
a transaction already committed (`COMMIT`), instead of before the data
came through (`BEGIN`). Since the timestamps are identical (see below),
it would be more accurate to read the timestamp of the transaction
before acting on the data contained within.
```
[(domain 0.1.0+dev) lib/domain/change_logs/replication_connection.ex:4: Domain.ChangeLogs.ReplicationConnection.handle_message/3]
"BEGIN #{commit_timestamp}" #=> "BEGIN 2025-06-26 04:22:45.283151Z"
[(domain 0.1.0+dev) lib/domain/change_logs/replication_connection.ex:4: Domain.ChangeLogs.ReplicationConnection.handle_message/3]
"END #{commit_timestamp}" #=> "END 2025-06-26 04:22:45.283151Z"
```
---------
Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Brian Manifold <bmanifold@users.noreply.github.com>
When a client is updated, we may need to re-initialize it if "breaking"
fields are updated. If non-breaking fields are changed, such as name, we
don't need to re-initialize the client.
This PR also adds a helper `struct_from_params/2` which will create a
schema struct from WAL data in order to type cast any needed data for
convenience. This avoid having to do a DB hit - we _already have the
data from the DB_ - we just need to format and send it.
Related: #9501
Creating a table publication(s) (and associated replication slot) is
sticky. These will outlive the lifetime of the process that created
them.
We don't want to remove them on shutdown, because this will pause WAL
writing to disk.
However, when starting the _new_ application, it's possible
`table_subscriptions` has changed (such as if we decide we no longer
want events for a certain table). We weren't updating the created
publication(s) with these added/removed tables, so this PR updates the
replication connection setup state machine to pass through a few
conditionals to get these properly updated with the diff of old vs new.
Building on the WAL consumer that's been in development over the past
several weeks, we introduce a new `change_logs` table that stores very
lightly up-fitted data decoded from the WAL:
- `account_id` (indexed): a foreign key reference to an account.
- `inserted_at` (indexed): the timestamp of insert, for truncating rows
later.
- `table`: the table where the op took place.
- `op`: the operation performed (insert/update/delete)
- `old_data`: a nullable map of the old row data (update/delete)
- `data`: a nullable map of the new row data(insert/update)
- `vsn`: an integer version field we can bump to signify schema changes
in the data in case we need to apply operations to only new or only old
data.
Judging from our prod metrics, we're currently average about 1,000 write
operations a minute, which will generate about 1-2 dozen changelogs / s.
Doing the math on this, 30 days at our current volume will yield about
50M / month, which should be ok for some time, since this is an
append-only, rarely (if ever) read from table.
The one aspect of this we may need to handle sooner than later is
batch-inserting these. That raises an issue though - currently, in this
PR, we process each WAL event serially, ending with the final
acknowledgement `:ok` which will signal to Postgres our status in
processing the WAL.
If we do anything async here, this processing "cursor" then becomes
inaccurate, so we may need to think about what to track and what data we
care about.
Related: #7124
It's confusing that we clear this field upon sync failure. Instead, we
let it track the time of the last sync.
Will be cleaned up in #6294 so just applying a minimal fix now.
Fixes#7715
Adds the `account_slug` to the gateway's `init` message. When the
account slug is changed, the gateway's socket is disconnected using the
same mechanism as gateway deletion, which causes the gateway to
reconnect immediately and receive a new `init`.
Related: #9545
Why:
* After updating the Auth Provider changesets to trim all whitespace
from user editable string fields we realized we needed to do the same
for all forms/entities within Firezone. This commit updates all entities
to trim whitespace on string fields.
Fixes: #9579
Unfortunately #9608 did not handle the case where we receive more than
200 compressed metrics in a single call. To fix this, we ensure we
`flush` the metrics buffer inside the `reduce` so that we never grow the
accumulated metrics buffer larger than 200 points.
The log string was updated to roll the issue over in Sentry as well as
the old issue was set to delete and destroy to prevent issue spam.
---------
Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Instead of checking for buffer surpass _after_ adding new timeseries to
it, we should check before.
Variables were renamed to be a little more clear on what they represent.
`Repo.aggregate(:count)` which performs a `COUNT(*)` query should be
relatively fast if it's able to do an index-only scan. For that to
happen we need to ensure all of the fields in the WHERE clause are
indexed. Currently, we're missing an index on `actors.type` so a full
row scan is executed per account each time we calculate Billing limits,
every 5 minutes, for all accounts.
If we need to check these limits more often and/or our data grows in
size, it could be worth moving these to a limits counter field on
`accounts` which is maintained via INSERT/DELETE triggers.
Related:
https://firezone-inc.sentry.io/issues/6346235615/events/588a61860e0b4875a5dbe8531dbb806a/?project=4508756715569152&referrer=next-event
When reacting to `ActorGroupMembership` updates, we were issuing a query
to expire Flows given an `actor_id, actor_group_id` combination.
Unfortunately, this query never included an `account_id` to scope it,
causing a table scan of flows and associated join tables to resolve it.
To fix this, we introduce the `account_id` and ensure the expire flows
uses this field to ensure only data for an account is considered in the
query.
Related:
https://firezone-inc.sentry.io/issues/6346235615/events/e225e1c488cb4ea3896649aabd529c50
The `compile_config` macro only works on environment and DB variables.
This caused recent confusion when determining where `database_pool_size`
was coming from.
To fix this issue, we rename `compile_config` to be more clear.
We also remove the technical debt around supporting "legacy keys" and
DB-based configuration.
The configuration compiler now works exclusively on environment
variables only, where it is still useful for:
- Casting environment variables to their expected type
- Alerting us when one is missing that should be set
Why:
* We recently had an issue where a space was entered into a provider
form field and caused our system to not be able to authenticate the
admin when setting up the auth provider and directory sync. To mitigate
this moving forward we are making sure all white space is trimmed in the
form fields. This commit focuses on the form fields for the auth
providers.
related: #9579
Sentry isn't started when this runs, so start it and manually capture a
message to ensure we're reminded about pending conditional migrations.
Verified that this works with the Release script.
In #9562, we introduced a bug where the pending conditional migrations
check was run without the repo being started. Wrapping it with
`with_repo` fixes that.
This table was added to try and gate the request rate to Google's
Metrics API.
However, this was a flawed endeavor as we later discovered that the time
series points need to be spaced apart at least 5s, not the API requests
themselves.
This PR gets rid of the table and therefore the problematic DB query
which is timing out quite often due to the contention involved in 12
elixir nodes trying to grab a lock every 5s.
Related: #9539
Related:
https://firezone-inc.sentry.io/issues/6346235615/?project=4508756715569152&query=is%3Aunresolved&referrer=issue-stream&stream_index=1
Some migrations take a long time to run because they require locks or
modify large amounts of data. To prevent this from causing issues during
deploy, we leverage Ecto's native support for loading migrations from
multiple directories to introduce a `conditional_migrations/` directory
that houses any conditional migrations we want to run.
To run these migrations, you'll need to do one of the following:
- `dev, test`: The `mix ecto.migrate` will run them by default because
we have aliased this to load conditional_migrations for dev
- `prod`: Set the `RUN_CONDITIONAL_MIGRATIONS` env var to `true` before
starting a prod server using the `bin/migrate` script.
- `dev, test, prod`: Run `Domain.Release.migrate(conditional: true)`
from an IEx shell.
If conditional migrations were found that weren't executed during
`Domain.Release.migrate`, a warning is logged to remind us to run them.
---------
Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
The shape of this from libcluster is `[:"NODE_NAME": connected_bool?]`
so we need to extract the first element of each item before using this
var.
This is just for logging and doesn't affect how we actually connect to
nodes.
Why:
* Adding some simple logging around OIDC calls to help with better
debugging.
* Removing the `opentelemetry_liveview` package as it has been pulled in
to the `opentelemetry_phoenix` package that we are already using.
After updating the CSS config to use `main.css` in the portal the root
layout was updated, but there were a small number of one-off templates
that do not use the root layout and those pages were not updated with
the new `main.css` file. This commit updates those non-root templates.
Fixes#9532
We issue broadcasts and subscribes in many places throughout the portal.
To help keep the cognitive overhead low, this PR consolidates all PubSub
functionality to the `Domain.PubSub` module.
This allows for:
- better maintainability
- see all of the topics we use at a glance
- consolidate repeated functionality (saved for a future PR)
- use the module hierarchy to define function names, which feels more
intuitive when reading and sets a convention
We also introduce a `Domain.Events.Hooks` behavior to ensure all hooks
comply with this simple contract, and we also introduce a convention to
standardize on topic names using the module hierarchy defined herein.
Lastly, we add convenience functions to the Presence modules to save a
bit of duplication and chance for errors.
This will make it much easier to maintain PubSub going forward.
Related: #9501
In #9513 a bug was introduced that added all service accounts to the
Everyone group. This fixes that by ensuring the `insert_all` query only
cross joins where actor type is `:account_user, :account_admin_user`.
Staging data will be manually fixed after this goes in.
I briefly considered updating the delete clause of this query to "clean
things up" by removing any found service accounts but that is a bit too
defensive in my opinion - if there's no way a service account should
make it into this group, then we shouldn't have code to expect it. This
will all be going away in #8750 which should be much less brittle.
In many places throughout the portal codebase, we called a function
"update_dynamic_group_memberships/1" which recomputed all of the
dynamic/managed memberships for a particular account, and reapplied them
to each affected group.
Since the `has_many :memberships` relationship used `on_replace:
:delete`, this caused Ecto to delete _all_ the `Everyone` group
memberships, and reinsert them on each sync.
Since each membership change triggers a policy re-evaluation for all
policies to the affected actor
(`Policies.broadcast_access_events_for/3`), this in effect was causing a
massive amount of queries to be triggered upon each sync job as each
membership deletion and insertion triggered a lookup for all resources
available to that particular actor.
To fix this, we introduce the following changes:
- Remove `dynamic` group type. This will never be used as it will create
an immense amount of complexity for any organization trying to manage
groups this way
- Refactor `update_dynamic_group_memberships/1` to use a smarter query
that first gathers all the _needed_ changes and applies them within a
transaction using Ecto.Multi. Previously all memberships would be rolled
over unconditionally due to the `on_replace: :delete` option on the
relationship. Note that the option is still there, but we generally
don't set memberships on groups any longer unless editing the affected
group directly, where the everyone group doesn't apply.
Resolves: #8407Resolves: #8408
Related: #6294
---------
Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
We move the resource events to the WAL system. Notably, we no longer
need `fetch_and_update_breakable` for resource updates, so a bit of
refactoring is included to update the call sites for those.
Additionally, we need to add a `Flow.expire_flows_for_resource_id/1`
function to expire flows from the WAL system. This is now being called
in the WAL event handler. To prevent this from blocking the WAL
consumer/broadcaster, we wrap it with a Task.async. These will be
cleaned up when the lookup table for access is implemented next.
Another thing to note is that we lose the `subject` when moving from
`Flows.expire_flows_for(%Resource{}, subject)` to
`Flows.expire_flows_for_resource_id(resource_id)` when a resource is
deleted or updated by an actor since we respond to this event in the WAL
where that data isn't available. However, we don't actually _use_ the
subject when expiring flows (other than authorize the initial resource
update), so this isn't an issue.
Related: #9501
---------
Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Brian Manifold <bmanifold@users.noreply.github.com>