Commit Graph

794 Commits

Author SHA1 Message Date
Brian Manifold
83e71f45b8 fix(portal): catch all errors when sending welcome email (#9776)
Why:

* We were previously only catching the `:rate_limited` error when
sending welcome emails. This update adds a catch-all case to gracefully
handle the error and alert us.
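
A minimal sketch of the idea, assuming a hypothetical
`Mailer.deliver_welcome_email/1` and simple return shapes; not the actual
portal code:

```
defmodule Sketch.WelcomeEmail do
  require Logger

  def send_welcome_email(actor) do
    case Mailer.deliver_welcome_email(actor) do
      {:ok, _email} ->
        :ok

      {:error, :rate_limited} ->
        # Previously this was the only error clause we handled.
        :ok

      {:error, reason} ->
        # New catch-all: report so we're alerted, but don't crash the caller.
        Logger.error("Failed to send welcome email: #{inspect(reason)}")
        :ok
    end
  end
end
```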

---------

Signed-off-by: Brian Manifold <bmanifold@users.noreply.github.com>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
2025-07-03 21:41:12 +00:00
Jamil
29d8881c54 fix(seeds): remove unused vars (#9731)
This fixes some warnings introduced by #9692.
2025-06-30 19:33:11 +00:00
Jamil
23c43c12dd chore(portal): log wal status every 60s (#9729)
It would be helpful to see these more often in the logs to better
understand our current processing position.
2025-06-30 18:25:19 +00:00
Jamil
972ece507d chore(portal): downgrade expected wal log to info (#9726)
This is expected during deploys so we downgrade it to info to avoid
sending to Sentry.
2025-06-30 15:35:35 +00:00
Jamil
47fe7b388e chore(portal): ack WAL records more often (#9703)
- log connection module for replication manager logs
- simplify bypass conditionals
- ACK write messages to avoid PG resending data
2025-06-28 20:34:26 +00:00
Jamil
a24f582ff5 fix(portal): increase change_log lag warning threshold (#9702)
This is needlessly short and has already tripped a false alarm once.
2025-06-27 20:58:58 +00:00
Jamil
3760536afd chore(portal): add unique index to lsn (#9699) 2025-06-27 20:58:20 +00:00
Jamil
dddd1b57fc refactor(portal): remove flow_activities (#9693)
This has been dead code for a long time. The feature this was meant to
support, #8353, will require a different domain model, views, and user
flows.

Related: #8353
2025-06-27 20:40:25 +00:00
Jamil
9655dacc04 fix(portal): restart wal connection from manager proc (#9701)
When the ReplicationConnection dies, its Manager will die too on all
other nodes, and all domain Application supervisors on all nodes will
attempt to restart them. This allows the connection to migrate to a
healthy node automagically.

However, the default Supervisor behavior is to allow 3 restarts in 5
seconds before the whole tree is taken down. To prevent this, we trap
the exit in the ReplicationManager and attempt to reconnect right away,
beginning the backoff process.
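
A hedged sketch of the trap-exit-and-backoff pattern described above; the
module internals and names are illustrative, not the actual implementation:

```
defmodule Sketch.ReplicationManager do
  use GenServer
  require Logger

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(opts) do
    # Trap exits so a dying replication connection doesn't take this
    # manager (and its supervision tree) down with it.
    Process.flag(:trap_exit, true)
    {:ok, %{opts: opts, attempt: 0, conn: start_connection(opts)}}
  end

  @impl true
  def handle_info({:EXIT, conn, reason}, %{conn: conn} = state) do
    Logger.info("Replication connection exited: #{inspect(reason)}, reconnecting")
    # Back off instead of letting the supervisor hit its default limit
    # of 3 restarts in 5 seconds and take the whole tree down.
    delay = min(:timer.seconds(30), :timer.seconds(1) * 2 ** state.attempt)
    Process.send_after(self(), :reconnect, delay)
    {:noreply, %{state | conn: nil, attempt: state.attempt + 1}}
  end

  def handle_info(:reconnect, state) do
    # A real implementation would reset `attempt` once the new
    # connection proves healthy.
    {:noreply, %{state | conn: start_connection(state.opts)}}
  end

  defp start_connection(_opts) do
    # Placeholder: stands in for starting the real ReplicationConnection.
    spawn_link(fn -> Process.sleep(:infinity) end)
  end
end
```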
2025-06-27 20:40:04 +00:00
Jamil
6c0a62aa73 fix(tests): wait for visible els before click (#9697)
We had an old bug in one of our acceptance tests that is just now being
hit again due to the faster runners.

- We need to wait for the dropdown to become visible before clicking
- We fix a minor timer issue where elapsed time was calculated
incorrectly when determining when to time out while finding an element
(see the sketch below).
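
A rough sketch of the kind of polling helper the second bullet describes,
measuring elapsed time from a monotonic clock; the visibility check itself is
a hypothetical callback:

```
defmodule Sketch.Wait do
  @timeout_ms 5_000
  @interval_ms 50

  # Poll until the check reports a visible element or the timeout is
  # exceeded, computing elapsed time against a fixed starting point.
  def wait_until_visible(check_fun) do
    do_wait(check_fun, System.monotonic_time(:millisecond))
  end

  defp do_wait(check_fun, started_at) do
    elapsed = System.monotonic_time(:millisecond) - started_at

    case check_fun.() do
      {:ok, element} ->
        {:ok, element}

      :not_visible when elapsed < @timeout_ms ->
        Process.sleep(@interval_ms)
        do_wait(check_fun, started_at)

      :not_visible ->
        {:error, :timeout}
    end
  end
end
```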
2025-06-27 19:06:59 +00:00
Jamil
3247b7c5d2 fix(portal): don't log soft-deleted deletes (#9698) 2025-06-27 19:06:45 +00:00
Jamil
0b09d9f2f5 refactor(portal): don't rely on flows.expires_at (#9692)
The `expires_at` column on the `flows` table was never used outside of
the context in which the flow was created in the Client Channel. This
ephemeral state, which is created in the `Domain.Flows.authorize_flow/4`
function, is never read from the DB in any meaningful capacity, so it
can be safely removed.

The `expire_flows_for` family of functions now simply reads the needed
fields from the flows table in order to broadcast `{:expire_flow,
flow_id, client_id, resource_id}` directly to the subscribed entities.
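
As a rough illustration of that broadcast (the topic naming and the PubSub
server name are assumptions):

```
defmodule Sketch.FlowExpiration do
  # `flows` is the list of rows read from the flows table.
  def broadcast_expirations(flows) do
    for %{id: flow_id, client_id: client_id, resource_id: resource_id} <- flows do
      Phoenix.PubSub.broadcast(
        Domain.PubSub,
        "flows:#{flow_id}",
        {:expire_flow, flow_id, client_id, resource_id}
      )
    end

    :ok
  end
end
```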

This PR is step 1 in removing the reliance on `Flows` to manage
ephemeral access state. In a subsequent PR we will actually change the
structure of what state is kept in the channel PIDs such that reliance
on this Flows table will no longer be necessary.

Additionally, in a few places, we were referencing a Flows.Show view
that was never available in production, so this dead code has been
removed.

Lastly, the `flows` table subscription and associated hook processing
have been completely removed as they are no longer needed. In #9667 we
implemented logic to remove publications for removed table
subscriptions, so we can expect a couple of ingest warnings when we
deploy this, since the `Hooks.Flows` processor no longer exists and the
WAL may still have lingering flows records in the queue. These can be
safely ignored.
2025-06-27 18:29:12 +00:00
Jamil
fbf48a207a chore(portal): handle lag up to 30m (#9681)
Now that we know the bypass system works, it might be a good idea to
allow it to lag data up to 30m so that events accrued during deploys are
not lost.

Also, this PR fixes a small bug where we triggered the threshold _after_
a transaction already committed (`COMMIT`), instead of before the data
came through (`BEGIN`). Since the timestamps are identical (see below),
it would be more accurate to read the timestamp of the transaction
before acting on the data contained within.

```
[(domain 0.1.0+dev) lib/domain/change_logs/replication_connection.ex:4: Domain.ChangeLogs.ReplicationConnection.handle_message/3]
"BEGIN #{commit_timestamp}" #=> "BEGIN 2025-06-26 04:22:45.283151Z"

[(domain 0.1.0+dev) lib/domain/change_logs/replication_connection.ex:4: Domain.ChangeLogs.ReplicationConnection.handle_message/3]
"END #{commit_timestamp}" #=> "END 2025-06-26 04:22:45.283151Z"
```

---------

Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Brian Manifold <bmanifold@users.noreply.github.com>
2025-06-26 13:38:40 +00:00
Jamil
59fa7fa4f1 fix(portal): diff = now - past (#9680)
We were performing the diff backwards, so the bypass never kicked in.
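
For reference, the corrected order of operands; `commit_timestamp` and
`max_lag_seconds` stand in for the message timestamp and the configured
threshold:

```
# Lag is "now minus past"; with the operands reversed the difference is
# negative and the bypass threshold is never exceeded.
lag_seconds = DateTime.diff(DateTime.utc_now(), commit_timestamp, :second)
bypass? = lag_seconds > max_lag_seconds
```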
2025-06-25 18:10:28 -07:00
Jamil
e7756a9be5 fix(portal): bypass delayed events past threshold (#9679)
When attempting to process a WAL that's _very_ far behind, it's helpful
to have a time past which we simply `noop` the handlers.
2025-06-25 22:32:15 +00:00
Jamil
eed8343e8f fix(portal): wait 30s for agm query (#9678)
These queries are timing out, so we wait longer for them.
2025-06-25 22:22:21 +00:00
Jamil
42e3027c34 fix(portal): use replication config in dev (#9676) 2025-06-25 21:02:01 +00:00
Jamil
855c427688 chore(portal): don't log found nodes (#9674)
These are better logged elsewhere and this is just noise.
2025-06-25 18:40:48 +00:00
Jamil
bebc69e2bc fix(portal): use distinct slot names (#9672)
These were being configured using the same default `events_` value.
2025-06-25 17:28:17 +00:00
Jamil
343717b502 refactor(portal): broadcast client struct when updated (#9664)
When a client is updated, we may need to re-initialize it if "breaking"
fields are updated. If non-breaking fields are changed, such as name, we
don't need to re-initialize the client.

This PR also adds a helper `struct_from_params/2` which will create a
schema struct from WAL data in order to type cast any needed data for
convenience. This avoids a DB hit - we _already have the data from the
DB_ - we just need to format and send it.
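
A hedged sketch of what such a helper could look like, built on
`Ecto.embedded_load/3`; the real `struct_from_params/2` may differ, and the
schema module in the usage line is assumed:

```
defmodule Sketch.WAL do
  # Cast raw, string-keyed WAL row data into a schema struct without a
  # DB round-trip - we already have the data.
  def struct_from_params(schema_module, params) when is_map(params) do
    Ecto.embedded_load(schema_module, params, :json)
  end
end

# e.g. Sketch.WAL.struct_from_params(Domain.Clients.Client, data)
```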

Related: #9501
2025-06-25 17:04:41 +00:00
Jamil
02dd21018d fix(portal): log error when connected_nodes crossed (#9668)
To avoid log spam, we only log an error when the threshold boundary is
crossed.
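
A small sketch of the edge-triggered logging idea, assuming the previous
over/under state is carried in the caller's process state:

```
defmodule Sketch.NodeMonitor do
  require Logger

  # Log only when the value crosses the boundary, not on every periodic
  # check, to avoid log spam.
  def maybe_log(connected, threshold, previously_below?) do
    now_below? = connected < threshold

    cond do
      now_below? and not previously_below? ->
        Logger.error("Connected nodes (#{connected}) fell below threshold (#{threshold})")

      previously_below? and not now_below? ->
        Logger.info("Connected nodes recovered: #{connected}/#{threshold}")

      true ->
        :ok
    end

    # Return the new state so the caller can store it for the next check.
    now_below?
  end
end
```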
2025-06-24 21:47:17 -07:00
Jamil
95624211cd fix(portal): update publications when config changes (#9667)
Table publications (and the associated replication slot) are sticky:
they will outlive the lifetime of the process that created them.

We don't want to remove them on shutdown, because this will pause WAL
writing to disk.

However, when starting the _new_ application, it's possible
`table_subscriptions` has changed (such as if we decide we no longer
want events for a certain table). We weren't updating the created
publication(s) with these added/removed tables, so this PR updates the
replication connection setup state machine to pass through a few
conditionals to get these properly updated with the diff of old vs new.
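
A rough sketch of the diff step described here; the SQL is standard Postgres,
but the function shape is illustrative rather than the actual setup state
machine:

```
defmodule Sketch.Publications do
  # Compare the configured table_subscriptions against what the existing
  # publication contains, then ALTER PUBLICATION with the diff.
  # Table and publication names come from trusted config, not user input.
  def sync(repo, publication, configured_tables, existing_tables) do
    to_add = configured_tables -- existing_tables
    to_drop = existing_tables -- configured_tables

    for table <- to_add do
      repo.query!("ALTER PUBLICATION #{publication} ADD TABLE #{table}")
    end

    for table <- to_drop do
      repo.query!("ALTER PUBLICATION #{publication} DROP TABLE #{table}")
    end

    :ok
  end
end
```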
2025-06-24 21:31:40 -07:00
Jamil
a9f49629ae feat(portal): add change_logs table and insert data (#9553)
Building on the WAL consumer that's been in development over the past
several weeks, we introduce a new `change_logs` table that stores very
lightly up-fitted data decoded from the WAL:

- `account_id` (indexed): a foreign key reference to an account.
- `inserted_at` (indexed): the timestamp of insert, for truncating rows
later.
- `table`: the table where the op took place.
- `op`: the operation performed (insert/update/delete)
- `old_data`: a nullable map of the old row data (update/delete)
- `data`: a nullable map of the new row data (insert/update)
- `vsn`: an integer version field we can bump to signify schema changes
in the data in case we need to apply operations to only new or only old
data.
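
A hedged sketch of what the corresponding migration might look like, with
column types, defaults, and the module name assumed from the description
above:

```
defmodule Domain.Repo.Migrations.CreateChangeLogs do
  use Ecto.Migration

  def change do
    create table(:change_logs, primary_key: false) do
      add :id, :binary_id, primary_key: true
      add :account_id, references(:accounts, type: :binary_id), null: false
      add :table, :string, null: false
      add :op, :string, null: false
      add :old_data, :map                 # nullable: present for update/delete
      add :data, :map                     # nullable: present for insert/update
      add :vsn, :integer, null: false, default: 1
      add :inserted_at, :utc_datetime_usec, null: false
    end

    create index(:change_logs, [:account_id])
    create index(:change_logs, [:inserted_at])
  end
end
```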

Judging from our prod metrics, we're currently averaging about 1,000
write operations a minute, which generates about 1-2 dozen change logs
per second. Doing the math on this, 30 days at our current volume yields
on the order of 50M rows per month, which should be ok for some time,
since this is an append-only table that is rarely (if ever) read from.

The one aspect of this we may need to handle sooner rather than later is
batch-inserting these. That raises an issue though - currently, in this
PR, we process each WAL event serially, ending with the final
acknowledgement `:ok` which will signal to Postgres our status in
processing the WAL.

If we do anything async here, this processing "cursor" then becomes
inaccurate, so we may need to think about what to track and what data we
care about.

Related: #7124
2025-06-25 02:06:20 +00:00
Jamil
ff5a632d2a fix(portal): only show never synced correctly (#9652)
It's confusing that we clear this field upon sync failure. Instead, we
let it track the time of the last sync.

Will be cleaned up in #6294 so just applying a minimal fix now.

Fixes #7715
2025-06-24 22:54:30 +00:00
Jamil
933d51e3d0 feat(portal): send account_slug in gateway init (#9653)
Adds the `account_slug` to the gateway's `init` message. When the
account slug is changed, the gateway's socket is disconnected using the
same mechanism as gateway deletion, which causes the gateway to
reconnect immediately and receive a new `init`.
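
Roughly, in Phoenix Channel terms (the channel module, topic, and payload
shape here are assumptions):

```
defmodule Sketch.GatewayChannel do
  use Phoenix.Channel

  def join("gateway", _params, socket) do
    send(self(), :after_join)
    {:ok, socket}
  end

  # Push the account slug as part of the init message so the gateway
  # learns it at connect time (and again after any forced reconnect).
  def handle_info(:after_join, socket) do
    push(socket, "init", %{account_slug: socket.assigns.account_slug})
    {:noreply, socket}
  end
end
```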

Related: #9545
2025-06-24 18:35:06 +00:00
Brian Manifold
27f482e061 fix(portal): trim whitespace in all remaining forms (#9654)
Why:

* After updating the Auth Provider changesets to trim all whitespace
from user-editable string fields, we realized we needed to do the same
for all forms/entities within Firezone. This commit updates all entities
to trim whitespace on string fields.
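
A minimal sketch of the trimming approach using
`Ecto.Changeset.update_change/3`; the helper name and field list are
illustrative:

```
defmodule Sketch.Trim do
  import Ecto.Changeset

  # Trim leading/trailing whitespace on every changed string field,
  # leaving nil and non-string values untouched.
  def trim_whitespace(changeset, fields) do
    Enum.reduce(fields, changeset, fn field, cs ->
      update_change(cs, field, fn
        value when is_binary(value) -> String.trim(value)
        value -> value
      end)
    end)
  end
end

# e.g. changeset |> Sketch.Trim.trim_whitespace([:name, :client_id, :issuer])
```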

Fixes: #9579
2025-06-24 14:28:51 +00:00
Jamil
0cd919a5e2 fix(portal): use account_id index in flow expiration (#9623)
There were a couple more instances where we weren't using the
`account_id`, which prevented use of the index and caused a DB
connection queue drop.
2025-06-23 21:51:21 +00:00
Jamil
f55596be4e fix(portal): index auth_providers on adapter (#9625)
The `refresh_tokens` job for each auth provider uses a cross-account
query that unfortunately hits no indexes. This can cause slow queries
each time the job runs for the adapter.

We add a simple sparse index to speed this query up.

Related:
https://firezone-inc.sentry.io/issues/6346235615/?project=4508756715569152&query=is%3Aunresolved&referrer=issue-stream&stream_index=1
2025-06-23 18:50:22 +00:00
Jamil
0af7582ab6 fix(portal): flush metrics as we accumulate (#9622)
Unfortunately #9608 did not handle the case where we receive more than
200 compressed metrics in a single call. To fix this, we ensure we
`flush` the metrics buffer inside the `reduce` so that we never grow the
accumulated metrics buffer larger than 200 points.
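
A simplified sketch of the flush-as-you-accumulate pattern; the buffer limit
and the exporter callback are stand-ins for the real metrics pipeline:

```
defmodule Sketch.MetricsBuffer do
  @max_points 200

  # Accumulate points, flushing whenever the buffer is full, so it never
  # grows beyond @max_points inside the reduce.
  def ingest(points, buffer, export_fun) do
    Enum.reduce(points, buffer, fn point, acc ->
      acc = if length(acc) >= @max_points, do: flush(acc, export_fun), else: acc
      [point | acc]
    end)
  end

  defp flush(buffer, export_fun) do
    export_fun.(Enum.reverse(buffer))
    []
  end
end
```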

The log string was updated to roll the issue over in Sentry, and the old
issue was set to delete and destroy to prevent issue spam.

---------

Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-06-23 14:58:18 +00:00
Jamil
c783b23bae refactor(portal): rename conditional->manual (#9612)
These only have one condition - to run manually. Calling them `manual
migrations` better conveys that these migrations typically _must_ be run
manually.
2025-06-21 21:17:33 +00:00
Jamil
2523bedd19 fix(portal): add if not exists to concurrent index (#9611)
With `@disable_ddl_transaction`, `if not exists` needs to be added so
re-running the migration is safe.

See
https://firezonehq.slack.com/archives/C04HRQTFY0Z/p1750516438992329?thread_ts=1750510766.640919&cid=C04HRQTFY0Z
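
A hedged sketch of the resulting migration shape; the table, column, and
migration name are illustrative:

```
defmodule Domain.Repo.Migrations.AddExampleIndexConcurrently do
  use Ecto.Migration

  # Concurrent index creation can't run inside a transaction (hence
  # @disable_ddl_transaction); `if not exists` keeps a re-run from
  # erroring if the index was already created by a previous attempt.
  @disable_ddl_transaction true
  @disable_migration_lock true

  def change do
    create_if_not_exists index(:flows, [:account_id], concurrently: true)
  end
end
```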
2025-06-21 15:42:51 +00:00
Jamil
e113def903 fix(portal): flush metrics buffer before exceeding limit (#9608)
Instead of checking whether the buffer limit is surpassed _after_ adding
new timeseries to it, we should check before.

Variables were renamed to be a little clearer about what they represent.
2025-06-20 21:44:52 +00:00
Jamil
a1677494b5 chore(portal): drop index concurrently (#9609)
Looks like postgres does support this, so adding for good measure.
2025-06-20 14:55:23 -07:00
Jamil
975057f9b4 fix(portal): add account_id,type index on actors (#9607)
`Repo.aggregate(:count)`, which performs a `COUNT(*)` query, should be
relatively fast if it's able to do an index-only scan. For that to
happen we need to ensure all of the fields in the WHERE clause are
indexed. Currently, we're missing an index on `actors.type` so a full
row scan is executed per account each time we calculate Billing limits,
every 5 minutes, for all accounts.

If we need to check these limits more often and/or our data grows in
size, it could be worth moving these to a limits counter field on
`accounts` which is maintained via INSERT/DELETE triggers.

Related:
https://firezone-inc.sentry.io/issues/6346235615/events/588a61860e0b4875a5dbe8531dbb806a/?project=4508756715569152&referrer=next-event
2025-06-20 20:53:09 +00:00
Jamil
6f87f5ea2c fix(portal): use account_id in index for agm hook (#9606)
When reacting to `ActorGroupMembership` updates, we were issuing a query
to expire Flows given an `actor_id, actor_group_id` combination.

Unfortunately, this query never included an `account_id` to scope it,
causing a table scan of flows and associated join tables to resolve it.

To fix this, we introduce the `account_id` and ensure the
flow-expiration query uses this field so that only data for the account
is considered in the query.

Related:
https://firezone-inc.sentry.io/issues/6346235615/events/e225e1c488cb4ea3896649aabd529c50
2025-06-20 20:40:31 +00:00
Jamil
ddb3dc8ce0 refactor(portal): compile_config macro to env_var_to_config (#9605)
The `compile_config` macro only works on environment and DB variables.
This caused recent confusion when determining where `database_pool_size`
was coming from.

To fix this issue, we rename `compile_config` to be more clear.

We also remove the technical debt around supporting "legacy keys" and
DB-based configuration.

The configuration compiler now works exclusively on environment
variables, where it is still useful for:

- Casting environment variables to their expected type
- Alerting us when one is missing that should be set
2025-06-20 20:39:06 +00:00
Brian Manifold
5bd5a7f6ad fix(portal): trim whitespace in auth provider forms (#9587)
Why:

* We recently had an issue where a space entered into a provider form
field prevented our system from authenticating the admin when setting up
the auth provider and directory sync. To mitigate this moving forward,
we are making sure all whitespace is trimmed in the form fields. This
commit focuses on the form fields for the auth providers.

related: #9579
2025-06-20 18:44:33 +00:00
Jamil
fc3a9d17b9 fix(portal): broadcast before possible query errors out (#9601)
When handling some side effects, the query can fail for whatever reason;
we don't want such a failure to prevent the side effects from being
handled, so we broadcast before the query runs.

Related:
https://firezone-inc.sentry.io/issues/6346235615/events/d30d222f8a3e436d8058a54c0b2a508c/?project=4508756715569152&query=is%3Aunresolved&referrer=previous-event&stream_index=3
2025-06-20 17:03:42 +00:00
Jamil
e5a0bdc3b1 fix(portal): ensure sentry reports conditional migrations (#9582)
Sentry isn't started when this runs, so start it and manually capture a
message to ensure we're reminded about pending conditional migrations.

Verified that this works with the Release script.
2025-06-19 17:28:38 +00:00
Jamil
2d6e478a44 fix(portal): check conditional migrations with repo started (#9577)
In #9562, we introduced a bug where the pending conditional migrations
check was run without the repo being started. Wrapping it with
`with_repo` fixes that.
2025-06-18 22:40:24 +00:00
Jamil
236c21111a refactor(portal): don't rely on db to gate metric reporting (#9565)
This table was added to try and gate the request rate to Google's
Metrics API.

However, this was a flawed endeavor, as we later discovered that it is
the time series points that need to be spaced at least 5s apart, not the
API requests themselves.

This PR gets rid of the table and therefore the problematic DB query
which is timing out quite often due to the contention involved in 12
elixir nodes trying to grab a lock every 5s.

Related: #9539 
Related:
https://firezone-inc.sentry.io/issues/6346235615/?project=4508756715569152&query=is%3Aunresolved&referrer=issue-stream&stream_index=1
2025-06-18 18:40:33 +00:00
Jamil
a20989a819 feat(portal): conditional migrations on prod (#9562)
Some migrations take a long time to run because they require locks or
modify large amounts of data. To prevent this from causing issues during
deploy, we leverage Ecto's native support for loading migrations from
multiple directories to introduce a `conditional_migrations/` directory
that houses any conditional migrations we want to run.

To run these migrations, you'll need to do one of the following:

- `dev, test`: `mix ecto.migrate` will run them by default because we
have aliased it to load `conditional_migrations/` for dev
- `prod`: Set the `RUN_CONDITIONAL_MIGRATIONS` env var to `true` before
starting a prod server using the `bin/migrate` script.
- `dev, test, prod`: Run `Domain.Release.migrate(conditional: true)`
from an IEx shell.

If conditional migrations were found that weren't executed during
`Domain.Release.migrate`, a warning is logged to remind us to run them.
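
A hedged sketch of how loading migrations from multiple directories can be
wired up with `Ecto.Migrator`; the paths, env var handling, and
`Domain.Release` internals here are assumptions:

```
defmodule Sketch.Release do
  @app :domain

  # Run the normal migrations, plus the conditional ones when requested
  # (RUN_CONDITIONAL_MIGRATIONS=true or migrate(conditional: true)).
  def migrate(opts \\ []) do
    conditional? =
      opts[:conditional] || System.get_env("RUN_CONDITIONAL_MIGRATIONS") == "true"

    paths =
      ["priv/repo/migrations"] ++
        if conditional?, do: ["priv/repo/conditional_migrations"], else: []

    dirs = Enum.map(paths, &Application.app_dir(@app, &1))

    {:ok, _, _} =
      Ecto.Migrator.with_repo(Domain.Repo, fn repo ->
        Ecto.Migrator.run(repo, dirs, :up, all: true)
      end)

    :ok
  end
end
```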

---------

Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
2025-06-18 18:08:25 +00:00
Jamil
38471738aa fix(portal): fix problem_nodes removal (#9561)
The shape of this from libcluster is `[:"NODE_NAME": connected_bool?]`
so we need to extract the first element of each item before using this
var.

This is just for logging and doesn't affect how we actually connect to
nodes.
2025-06-17 16:53:42 +00:00
Brian Manifold
e5914af50f fix(portal): Add more logging around OIDC setup (#9555)
Why:

* Adding some simple logging around OIDC calls to help with debugging.
* Removing the `opentelemetry_liveview` package as it has been pulled
into the `opentelemetry_phoenix` package that we are already using.
2025-06-17 16:52:33 +00:00
Brian Manifold
25434c6898 fix(portal): update non-root layout to use main.css (#9533)
After updating the CSS config to use `main.css` in the portal, the root
layout was updated, but there were a small number of one-off templates
that do not use the root layout, and those pages were not updated with
the new `main.css` file. This commit updates those non-root templates.

Fixes #9532
2025-06-15 15:31:45 +00:00
Jamil
c6545fe853 refactor(portal): consolidate pubsub functions (#9529)
We issue broadcasts and subscribes in many places throughout the portal.
To help keep the cognitive overhead low, this PR consolidates all PubSub
functionality to the `Domain.PubSub` module.

This allows for:

- better maintainability
- seeing all of the topics we use at a glance
- consolidating repeated functionality (saved for a future PR)
- using the module hierarchy to define function names, which feels more
intuitive when reading and sets a convention

We also introduce a `Domain.Events.Hooks` behavior to ensure all hooks
comply with this simple contract, along with a convention to standardize
topic names on the module hierarchy defined herein.

Lastly, we add convenience functions to the Presence modules to reduce
duplication and the chance for errors.

This will make it much easier to maintain PubSub going forward.


Related: #9501
2025-06-15 04:30:57 +00:00
Jamil
62c3dd9370 fix(portal): don't add service accounts to everyone group (#9530)
In #9513 a bug was introduced that added all service accounts to the
Everyone group. This fixes that by ensuring the `insert_all` query only
cross joins where actor type is `:account_user, :account_admin_user`.

Staging data will be manually fixed after this goes in.

I briefly considered updating the delete clause of this query to "clean
things up" by removing any found service accounts but that is a bit too
defensive in my opinion - if there's no way a service account should
make it into this group, then we shouldn't have code to expect it. This
will all be going away in #8750 which should be much less brittle.
2025-06-14 15:20:32 +00:00
Jamil
cbe33cd108 refactor(portal): move policy events to WAL (#9521)
Moves all of the policy lifecycle events to be broadcasted from the WAL
consumer.

#### Test

- [x] Enable policy
- [x] Disable policy
- [x] Delete policy
- [x] Non-breaking change
- [x] Breaking change


Related: #6294

---------

Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
2025-06-14 01:10:09 +00:00
Jamil
817eeff19f refactor(portal): simplify managed groups (#9513)
In many places throughout the portal codebase, we called a function
`update_dynamic_group_memberships/1` which recomputed all of the
dynamic/managed memberships for a particular account, and reapplied them
to each affected group.

Since the `has_many :memberships` relationship used `on_replace:
:delete`, this caused Ecto to delete _all_ the `Everyone` group
memberships, and reinsert them on each sync.

Since each membership change triggers a policy re-evaluation for all
policies that apply to the affected actor
(`Policies.broadcast_access_events_for/3`), this in effect was causing a
massive number of queries to be triggered upon each sync job, as each
membership deletion and insertion triggered a lookup of all resources
available to that particular actor.

To fix this, we introduce the following changes:

- Remove `dynamic` group type. This will never be used as it will create
an immense amount of complexity for any organization trying to manage
groups this way
- Refactor `update_dynamic_group_memberships/1` to use a smarter query
that first gathers all the _needed_ changes and applies them within a
transaction using Ecto.Multi (see the sketch below). Previously, all
memberships would be rolled over unconditionally due to the `on_replace:
:delete` option on the relationship. Note that the option is still
there, but we generally no longer set memberships on groups unless
editing the affected group directly, where the everyone group doesn't
apply.
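
A hedged sketch of the diff-then-apply approach with Ecto.Multi; the
`Membership` schema name and query shapes are assumptions, and timestamps are
omitted for brevity:

```
defmodule Sketch.ManagedGroups do
  import Ecto.Query
  alias Ecto.Multi

  # `Membership` stands in for the actual membership schema module.
  alias Domain.Actors.Membership

  # Compute only the memberships that actually need to change, then
  # apply the inserts and deletes in one transaction.
  def sync_everyone_group(repo, group, desired_actor_ids) do
    current_actor_ids =
      repo.all(from m in Membership, where: m.group_id == ^group.id, select: m.actor_id)

    to_insert = desired_actor_ids -- current_actor_ids
    to_delete = current_actor_ids -- desired_actor_ids

    rows = Enum.map(to_insert, &%{group_id: group.id, actor_id: &1})

    Multi.new()
    |> Multi.insert_all(:insert_memberships, Membership, rows)
    |> Multi.delete_all(
      :delete_memberships,
      from(m in Membership, where: m.group_id == ^group.id and m.actor_id in ^to_delete)
    )
    |> repo.transaction()
  end
end
```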

Resolves: #8407 
Resolves: #8408
Related: #6294

---------

Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-06-13 18:55:37 +00:00
Jamil
c31f51d138 refactor(portal): move resource events to WAL (#9406)
We move the resource events to the WAL system. Notably, we no longer
need `fetch_and_update_breakable` for resource updates, so a bit of
refactoring is included to update the call sites for those.

Additionally, we need to add a `Flow.expire_flows_for_resource_id/1`
function to expire flows from the WAL system. This is now being called
in the WAL event handler. To prevent this from blocking the WAL
consumer/broadcaster, we wrap it with a Task.async. These will be
cleaned up when the lookup table for access is implemented next.
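
Roughly the shape of that non-blocking call described above; the alias is
assumed to resolve to `Domain.Flows`:

```
# Expire flows without blocking the WAL consumer/broadcaster. Wrapped in
# Task.async per the description above; to be cleaned up once the access
# lookup table lands.
Task.async(fn ->
  Flows.expire_flows_for_resource_id(resource_id)
end)
```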

Another thing to note is that we lose the `subject` when moving from
`Flows.expire_flows_for(%Resource{}, subject)` to
`Flows.expire_flows_for_resource_id(resource_id)` when a resource is
deleted or updated by an actor since we respond to this event in the WAL
where that data isn't available. However, we don't actually _use_ the
subject when expiring flows (other than to authorize the initial
resource update), so this isn't an issue.

Related: #9501

---------

Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Brian Manifold <bmanifold@users.noreply.github.com>
2025-06-11 00:12:45 +00:00