firezone

mirror of https://github.com/outbackdingo/firezone.git synced 2026-01-27 18:18:55 +00:00

Author	SHA1	Message	Date
Thomas Eizinger	bce2aa30b5	feat(portal): extend DNS settings to allow for DoH providers (#10882 ) In order to allow customers to make use of connlib's DoH functionality, we need a configuration UI for it. We take inspiration from the "New Resource" page and implement a 3-choice UI component for configuring how Clients should resolve DNS queries: - System - Secure DNS - Custom The secure and custom DNS options show an additional form when selected for either picking a DoH provider or the addresses of the custom DNS servers. Right now, the "Secure DNS" part is disabled if the `DISABLE_DOH_PROVIDER` env variable is set. We render a "Coming soon" tooltip on hover: <img width="1534" height="1100" alt="image" src="https://github.com/user-attachments/assets/a12a6ba4-806f-4d19-8aea-5c1cd981d609" /> This allows us to test this in staging and still ship to production if needed prior to enabling it. Resolves: #10792 Resolves: #10786 --------- Co-authored-by: Jamil Bou Kheir <jamilbk@users.noreply.github.com>	2025-11-22 06:48:07 +00:00
Brian Manifold	28e35b6500	refactor(portal): remove all soft delete columns (#10685 ) Why: * Now that soft delete fields have been removed being referenced in the codebase and soft deleted rows have been removed from the DB we can now remove any remaining traces of soft delete functionality. Resolves #8187	2025-10-22 21:13:59 +00:00
Jamil	8cd469a9a9	refactor(portal): recreate remaining indexes without deleted_at (#10615 ) Followup to #10607	2025-10-18 11:09:44 -07:00
Brian Manifold	27565ea5c8	refactor(portal): remove soft delete elements from portal code (#10607 ) Why: * In previous commits, the portal code had been updated to use hard deletion rather than soft deletion of data. The fields used in the soft deletion were still kept in the DB and the code to allow for zero downtime rollout and an easy rollback if necessary. To continue with that work the portal code has now been updated to remove any reference to the soft deleted fields (e.g. deleted_at, persistent_id, etc...). While the code has been updated the actual data in the DB will need to remain for now, to once again allow for a zero downtime rollout. Once this commit has been deployed to production another PR can follow to remove the columns from the necessary tables in the DB. Related: #8187	2025-10-18 17:02:26 +00:00
Mariusz Klochowicz	a27676a903	fix(portal): support dark mode in outbound emails (#10493 ) Ensure that users with dark mode enabled system-wide get nice experience whilst reading the emails. Add a `mix test_emails` task to send all the emails and quickly inspect them locally. Before: <img width="767" height="924" alt="image" src="https://github.com/user-attachments/assets/aaac75bd-67ad-4fd8-82e8-6726ffea6bae" /> After (viewed via `mix test_emails`): <img width="1063" height="928" alt="image" src="https://github.com/user-attachments/assets/57d3a4d9-5b8f-4a45-8546-7615e15422d8" /> --------- Signed-off-by: Mariusz Klochowicz <mariusz@klochowicz.com> Co-authored-by: Brian Manifold <bmanifold@users.noreply.github.com>	2025-10-16 19:35:36 +00:00
Jamil	7ab5fee43a	chore(portal): add remaining simple indexes (#10403 ) - recreates the flows actor_group_membership index that didn't get created due to name collision with an existing index - adds missing resource_id, actor_group_id indexes on policies - removes redundant `resource_id` index on resource_connections since there's a composite index that matches already Related: #10396	2025-09-20 10:06:39 -07:00
Brian Manifold	c3e1bc8a5b	chore(portal): add non-composite indexes (#10396 ) Why: * Now that hard-delete has been rolled out, we need to make sure that all cascade deletes are efficient. Some of the foreign key references didn't have indexes but needed them. Fixes #10393	2025-09-19 22:13:35 -07:00
Jamil	15283f1af5	feat(portal): batch_upsert and delete_unsynced functions (#10369 ) In order to support the new, upcoming directory sync implementations, we need the ability to batch upsert auth_identities, actors, actor_groups, and actor_group_memberships. We also need the ability to delete entities that were not upserted at the tail end of a sync job iteration in order to remove entities that are no longer in the directory. To support this, we add these functions and related tests here. Related: #6294 --------- Signed-off-by: Jamil <jamilbk@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-09-19 20:48:26 +00:00
Jamil	5f4d2c14ea	fix(ci): use correct index module name (#10383 ) This gets redefined twice which could lead to indexes failing to run properly.	2025-09-19 05:17:49 +00:00
Brian Manifold	56a3ce9041	fix(portal): move hard delete migrations (#10316 ) Move some of the hard-delete migrations from manual to inline to allow us to deploy the `HEAD` of main	2025-09-09 23:41:45 +00:00
Brian Manifold	826a304071	feat(portal): enable outdated gateway email (#10281 ) Enables 'outdated gateway' notifications for all accounts. Closes #8361	2025-09-04 03:56:01 +00:00
Brian Manifold	5797511ebd	fix(portal): update directory sync job w/ new hard delete (#10267 ) Why: * During the refactor to move to hard delete data in the portal there were a couple places of inconsistency in the directory sync job where deletion was concerned.	2025-08-31 01:19:59 +00:00
Brian Manifold	6bd19ee9b0	refactor(portal): hard delete data (#9694 )	2025-08-29 22:13:44 +00:00
Brian Manifold	3673ce01a0	refactor(portal): add migrations needed for hard delete (#10257 ) In preparation for transitioning from the portal from soft-delete to hard-delete some updates to the foreign key constraints were required. Most of these updates are adding `ON DELETE CASCADE` constraints and relevant indexes. These new constraints should have no effect on the current portal code as soft-deletes are still being used. One other update in this PR is changing the FK constraint names on the clients table. The names were `devices_` due to the table originally being called `devices`. This PR updates them to `clients_`. Related: #8187	2025-08-26 19:10:01 +00:00
Jamil	cafe6554ff	refactor(portal): reduce cache memory usage (#10058 ) Napkin math shows that we can save substantial memory (~3x or more) on the API nodes as connected clients/gateways grow if we just store the fields we need in order to keep the client and gateway state maintained in the channel pids. To facilitate this, we create new `Cacheable` structs that represent their `Domain` cousins, which use byte arrays for `id`s and strip out unused fields. Additionally, all business logic involved with maintaining these caches is now contained within two modules: `Domain.Cache.Client` and `Domain.Cache.Gateway`, and type specs have been added to aid in static analysis and code documentation. Comprehensive testing is now added not only for the cache modules, but for their associated channel modules as well to ensure we handle different kinds of edge cases gracefully. The `Events` nomenclature was renamed to `Changes` to better name what we are doing: Change-Data-Capture. Lastly, the following related changes are included in this PR since they were "in the way" so to speak of getting this done: - We save the last received LSN in each channel and drop the `change` with a warning if we receive it twice in a row, or we receive it out of order - The client/gateway version compatibility calculations have been moved to `Domain.Resources` and `Domain.Gateways` and have been simplified to make them easier to understand and maintain going forward. Related: #10174 Fixes: #9392 Fixes: #9965 Fixes: #9501 Fixes: #10227 --------- Signed-off-by: Jamil <jamilbk@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-08-22 21:52:29 +00:00
Jamil	0698e0d35f	ci: test IPv6 for CIDR resources (#10168 ) Docker for Mac finally supports IPv6 in general availability. It's time to add IPv6 to our suite of integration tests. The thinking behind this PR is try and not slow down CI much, if at all, by testing IPv6 side-by-side with the existing IPv4 tests. More comprehensive testing is being developed in #10131 that will test things like IPv4-in-6 relaying, client / gateway IP stack mismatches, and so forth.	2025-08-18 20:59:40 +00:00
Jamil	25e15bbd14	chore(portal): drop id in favor of lsn pkey (#10152 ) On the `change_logs` table, we want to minimize write overhead as much as possible. One major way to do this is the minimize the number of indexes maintained. Because `lsn` is guaranteed to be unique, we can use it as the primary, saving us an index (and column). NOTE: This migration will need to acquire a lock on the table, so it's added as a manual migration to execute out of band. Since we don't read ChangeLogs anywhere, it should be fine for the app servers to come up without this migration applied.	2025-08-06 15:04:34 +00:00
Brian Manifold	80e1c3255f	refactor(portal): refactor billing event handler (#10064 ) Why: * There were intermittent issues with accounts updates from Stripe events. Specifically, when an account would update it's subscription from Starter to Team. The reason was due to the fact that Stripe does not guarantee order of delivery for it's webhook events. At times we were seeing and responding to an event that was a few seconds old after processing a newer event. This would have the effect of quickly transitioning an account from Team back to Starter. This commit refactors our event handler and adds a `processed_stripe_events` DB table to make sure we don't process duplicate events as well as prevent processing an event that was created prior to the last event we've processed for a given account. * Along with refactoring the billing event handling, the Stripe mock module has also been refactored to better reflect real Stripe objects. Related: #8668	2025-08-05 16:56:52 +00:00
Jamil	f1a5af356d	fix(portal): groom resource list and flows periodically (#10005 ) Time-based policy conditions are tricky. When they authorize a flow, we correctly tell the Gateway to remove access when the time window expires. However, we do nothing on the client to reset the connectivity state. This means that whenever the window of time of access was re-entered, the client would essentially never be able to connect to it again until the resource was toggled. To fix this, we add a 1-minute check in the client channel that re-checks allowed resources, and updates the client state with the difference. This means that policies that have time-based conditions are only accurate to the minute, but this is how they're presented anyhow. For good measure, we also add a periodic job that runs every minute to delete expired Flows. This will propagate to the Gateway where, if the access for a particular client-resource is determined to be actually gone, will receive `reject_access`. Zooming out a bit, this PR furthers the theme that: - Client channels react to underlying resource / policy / membership changes directly, while - Gateway channels react primarily to flows being deleted, or the downstream effects of a prior client authorization	2025-07-25 21:04:41 +00:00
Jamil	2959cca8ce	fix(portal): use consistent wireguard psk (#10004 ) Whenever a client requests a connection to gateway, we need to generate a preshared key that will be used for the underlying WireGuard tunnel. When the connection setup broke or otherwise was lost, _after_ the gateway the received the authorize_flow call, but _before_ the client could receive the response (and initiate a tunnel), we would have to wait until an ICE timeout occurred in order to reset state on the gateway. This is because the psk was not used to determine if this was a _new_ flow authorization. So the old authorization would be matched, and the client would never be able to connect, since its tunnel was using the new psk, and the gateway the old. To fix this, we generate a secure random 32-byte `psk_base` on each client and gateway. When a client wishes to connect to a gateway, we compute the WireGuard preshared key as an HMAC over these two inputs. This fixes the issue by ensuring that subsequent flow authorization requests from a particular client to a particular gateway will yield the same psk. Related: #9999 Related: https://github.com/firezone/infra/issues/99	2025-07-25 19:28:47 +00:00
Jamil	ccc736e63e	fix(portal): reauthorize new flow when last flow deleted (#9974 ) The `flows` table tracks authorizations we've made for a resource and persists them, so that we can determine which authorizations are still valid across deploys or hiccups in the control plane connections. Before, when the "in-use" authorization for a resource was deleted, we would have flapped the resource in the client, and sent `reject_access` to the gateway. However, that would cause issues in the following edge case: - Client is currently connected to Resource A through Policy B - Client websocket goes down - Policy B is created for Resource A (for another actor group), and Policy A is deleted by admin - Client reconnects - Client sees that its resource list is the same - Gateway has since received `reject_access` because no new flows were created for this client-resource combination To prevent this from happening, we now try to "reauthorize" the flow whenever the last cached flow is removed for a particular client-resource pair. This avoids needing to toggle the resource on the client since we won't have sent `reject_access` to the gateway.	2025-07-25 01:53:10 +00:00
Jamil	f379e85e9b	refactor(portal): cache access state in channel pids (#9773 ) When changes occur in the Firezone DB that trigger side effects, we need some mechanism to broadcast and handle these. Before, the system we used was: - Each process subscribes to a myriad of topics related to data it wants to receive. In some cases it would subscribe to new topics based on received events from existing topics (I.e. flows in the gateway channel), and sometimes in a loop. It would then need to be sure to _unsubscribe_ from these topics - Handle the side effect in the `after_commit` hook of the Ecto function call after it completes - Broadcast only a simply (thin) event message with a DB id - In the receiver, use the id(s) to re-evaluate, or lookup one or many records associated with the change - After the lookup completes, `push` the relevant message(s) to the LiveView, `client` pid, or `gateway` pid in their respective channel processes This system had a number of drawbacks ranging from scalability issues to undesirable access bugs: 1. The `after_commit` callback, on each App node, is not globally ordered. Since we broadcast a thin event schema and read from the DB to hydrate each event, this meant we had a `read after write` problem in our event architecture, leading to the potential for lost updates. Case in point: if a policy is updated from `resource_id-1` to `resource_id-2`, and then back to `resource_id-1`, it's possible that, given the right amount of delay, the gateway channel will receive two `reject_access` events for `resource_id-1`, as opposed to one for `resource_id-1` and one for `resource_id-2`, leading to the potential for unauthorized access. 1. It was very difficult to ensure that the correct topics were being subscribed to and unsubscribed from, and the correct number of times, leading to maintenance issues for other engineers. 1. We had a nasty N+1 query problem whenever memberships were added or removed that resolved in essentially all access related to that membership (so all Policies touching its actor group) to be re-evaluated, and broadcasted. This meant that any bulk addition or deletion of memberships would generate so many queries that they'd timeout or consume the entire connection pool. 1. We had no durability for side-effect processing. In some places, we were iterating over many returned records to send broadcasts. Broadcasting is not a zero-time operation, each call takes a small amount of CPU time to copy the message into the receiver's mailbox. If we deployed while this was happening, the state update would be lost forever. If this was a `reject_access` for a Gateway, the Gateway would never remove access for that particular flow. 1. On each flow authorization, we needed to hit `us-east1` not only to "authorize" the flow, but to log it as well. This incurs latency especially for users in other parts of the world, which happens on _each_ connection setup to a new resource. 1. Since we read and re-authorize access due to the thin events broadcasted from side effects, we risk hitting thundering herd problems (see the N+1 query problem above) where a single DB change could result in all receivers hitting the DB at once to "hydrate" their processing.ion 1. If an administrator modifies the DB directly, or, if we need to run a DB migration that involves side effects, they'll be lost, because the side effect triggers happened in `after_commit` hooks that are only available when querying the DB through Ecto. Manually deleting (or resurrecting) a policy, for example, would not have updated any connected clients or gateways with the new state. To fix all of the above, we move to the system introduced in this PR: - All changes are now serialized (for free) by Postgres and broadcasted as a single event stream - The number of topics has been reduced to just one, the `account_id` of an account. All receivers subscribe to this one topic for the lifetime of their pid and then only filter the events they want to act upon, ignoring all other messages - The events themselves have been turned into "fat" structs based on the schemas they present. By making them properly typed, we can apply things like the existing Policy authorizer functions to them as if we had just fetched them from the DB. - All flow creation now happens in memory and doesn't not need to incur a DB hit in `us-east1` to proceed. - Since clients and gateways now track state in a push-based manner from the DB, this means very few actual DB queries are needed to maintain state in the channel procs, and it also means we can be smarter about when to send `resource_deleted` and `resource_created_or_updated` appropriately, since we can always diff between what the client _had_ access to, and what they _now_ have access to. - All DB operations, whether they happen from the application code, a `psql` prompt, or even via Google SQL Studio in the GCP console, will trigger the _same_ side effects. - We now use a replication consumer based off Postgres logical decoding of the write-ahead log using a _durable slot_. This means that Postgres will retain _all events_ until they are acknowledged, giving us the ability to ensure at-least-once processing semantics for our system. Today, the ACK is simply, "did we broadcast this event successfully". But in the future, we can assert that replies are received before we acknowledge the event as processed back to Postgres. The tests in this PR have been updated to pass given the refactor. However, since we are tracking more state now in the channel procs, it would be a good idea to add more tests for those edge cases. That is saved as a later PR because (1) this one is already huge, and (2) we need to get this out to staging to smoke test everything anyhow. Fixes: #9908 Fixes: #9909 Fixes: #9910 Fixes: #9900 Related: #9501	2025-07-18 22:47:18 +00:00
Jamil	b20c141759	feat(portal): add batch-insert to change logs (#9733 ) Inserting a change log incurs some minor overhead for sending query over the network and reacting to its response. In many cases, this makes up the bulk of the actual time it takes to run the change log insert. To reduce this overhead and avoid any kind of processing delay in the WAL consumers, we introduce batch insert functionality with size `500` and timeout `30` seconds. If either of those two are hit, we flush the batch using `insert_all`. `insert_all` does not use `Ecto.Changeset`, so we need to be a bit more careful about the data we insert, and check the inserted LSNs to determine what to update the acknowledged LSN pointer to. The functionality to determine when to call the new `on_flush/1` callback lives in the replication_connection module, but the actual behavior of `on_flush/1` is left to the child modules to implement. The `Events.ReplicationConnection` module does not use flush behavior, and so does not override the defaults, which is not to use a flush mechanism. Related: #949	2025-07-05 19:03:28 +00:00
Jamil	29d8881c54	fix(seeds): remove unused vars (#9731 ) This fixes some warnings introduced by #9692.	2025-06-30 19:33:11 +00:00
Jamil	3760536afd	chore(portal): add unique index to lsn (#9699 )	2025-06-27 20:58:20 +00:00
Jamil	dddd1b57fc	refactor(portal): remove flow_activities (#9693 ) This has been dead code for a long time. The feature this was meant to support, #8353, will require a different domain model, views, and user flows. Related: #8353	2025-06-27 20:40:25 +00:00
Jamil	0b09d9f2f5	refactor(portal): don't rely on flows.expires_at (#9692 ) The `expires_at` column on the `flows` table was never used outside of the context in which the flow was created in the Client Channel. This ephemeral state, which is created in the `Domain.Flows.authorize_flow/4` function, is never read from the DB in any meaningful capacity, so it can be safely removed. The `expire_flows_for` family of functions now simply reads the needed fields from the flows table in order to broadcast `{:expire_flow, flow_id, client_id, resource_id}` directly to the subscribed entities. This PR is step 1 in removing the reliance on `Flows` to manage ephemeral access state. In a subsequent PR we will actually change the structure of what state is kept in the channel PIDs such that reliance on this Flows table will no longer be necessary. Additionally, in a few places, we were referencing a Flows.Show view that was never available in production, so this dead code has been removed. Lastly, the `flows` table subscription and associated hook processing has been completely removed as it is no longer needed. We've implemented in #9667 logic to remove publications from removed table subscriptions, so we can expect to get a couple ingest warnings when we deploy this as the `Hooks.Flows` processor no longer exists, and the WAL data may have lingering flows records in the queue. These can be safely ignored.	2025-06-27 18:29:12 +00:00
Jamil	a9f49629ae	feat(portal): add change_logs table and insert data (#9553 ) Building on the WAL consumer that's been in development over the past several weeks, we introduce a new `change_logs` table that stores very lightly up-fitted data decoded from the WAL: - `account_id` (indexed): a foreign key reference to an account. - `inserted_at` (indexed): the timestamp of insert, for truncating rows later. - `table`: the table where the op took place. - `op`: the operation performed (insert/update/delete) - `old_data`: a nullable map of the old row data (update/delete) - `data`: a nullable map of the new row data(insert/update) - `vsn`: an integer version field we can bump to signify schema changes in the data in case we need to apply operations to only new or only old data. Judging from our prod metrics, we're currently average about 1,000 write operations a minute, which will generate about 1-2 dozen changelogs / s. Doing the math on this, 30 days at our current volume will yield about 50M / month, which should be ok for some time, since this is an append-only, rarely (if ever) read from table. The one aspect of this we may need to handle sooner than later is batch-inserting these. That raises an issue though - currently, in this PR, we process each WAL event serially, ending with the final acknowledgement `:ok` which will signal to Postgres our status in processing the WAL. If we do anything async here, this processing "cursor" then becomes inaccurate, so we may need to think about what to track and what data we care about. Related: #7124	2025-06-25 02:06:20 +00:00
Jamil	f55596be4e	fix(portal): index auth_providers on adapter (#9625 ) The `refresh_tokens` job for each auth provider uses a cross-account query that unfortunately hits no indexes. This can cause slow queries each time the job runs for the adapter. We add a simple sparse index to speed this query up. Related: https://firezone-inc.sentry.io/issues/6346235615/?project=4508756715569152&query=is%3Aunresolved&referrer=issue-stream&stream_index=1	2025-06-23 18:50:22 +00:00
Jamil	c783b23bae	refactor(portal): rename conditional->manual (#9612 ) These only have one condition - to run manually. `manual migrations` better implies that these migrations _must_ typically be run manually.	2025-06-21 21:17:33 +00:00
Jamil	2523bedd19	fix(portal): add if not exists to concurrent index (#9611 ) With `@disable_ddl_transaction` this needs to be added. See https://firezonehq.slack.com/archives/C04HRQTFY0Z/p1750516438992329?thread_ts=1750510766.640919&cid=C04HRQTFY0Z	2025-06-21 15:42:51 +00:00
Jamil	a1677494b5	chore(portal): drop index concurrently (#9609 ) Looks like postgres does support this, so adding for good measure.	2025-06-20 14:55:23 -07:00
Jamil	975057f9b4	fix(portal): add account_id,type index on actors (#9607 ) `Repo.aggregate(:count)` which performs a `COUNT(*)` query should be relatively fast if it's able to do an index-only scan. For that to happen we need to ensure all of the fields in the WHERE clause are indexed. Currently, we're missing an index on `actors.type` so a full row scan is executed per account each time we calculate Billing limits, every 5 minutes, for all accounts. If we need to check these limits more often and/or our data grows in size, it could be worth moving these to a limits counter field on `accounts` which is maintained via INSERT/DELETE triggers. Related: https://firezone-inc.sentry.io/issues/6346235615/events/588a61860e0b4875a5dbe8531dbb806a/?project=4508756715569152&referrer=next-event	2025-06-20 20:53:09 +00:00
Jamil	236c21111a	refactor(portal): don't rely on db to gate metric reporting (#9565 ) This table was added to try and gate the request rate to Google's Metrics API. However, this was a flawed endeavor as we later discovered that the time series points need to be spaced apart at least 5s, not the API requests themselves. This PR gets rid of the table and therefore the problematic DB query which is timing out quite often due to the contention involved in 12 elixir nodes trying to grab a lock every 5s. Related: #9539 Related: https://firezone-inc.sentry.io/issues/6346235615/?project=4508756715569152&query=is%3Aunresolved&referrer=issue-stream&stream_index=1	2025-06-18 18:40:33 +00:00
Jamil	a20989a819	feat(portal): conditional migrations on prod (#9562 ) Some migrations take a long time to run because they require locks or modify large amounts of data. To prevent this from causing issues during deploy, we leverage Ecto's native support for loading migrations from multiple directories to introduce a `conditional_migrations/` directory that houses any conditional migrations we want to run. To run these migrations, you'll need to do one of the following: - `dev, test`: The `mix ecto.migrate` will run them by default because we have aliased this to load conditional_migrations for dev - `prod`: Set the `RUN_CONDITIONAL_MIGRATIONS` env var to `true` before starting a prod server using the `bin/migrate` script. - `dev, test, prod`: Run `Domain.Release.migrate(conditional: true)` from an IEx shell. If conditional migrations were found that weren't executed during `Domain.Release.migrate`, a warning is logged to remind us to run them. --------- Signed-off-by: Jamil <jamilbk@users.noreply.github.com>	2025-06-18 18:08:25 +00:00
Jamil	cbe33cd108	refactor(portal): move policy events to WAL (#9521 ) Moves all of the policy lifecycle events to be broadcasted from the WAL consumer. #### Test - [x] Enable policy - [x] Disable policy - [x] Delete policy - [x] Non-breaking change - [x] Breaking change Related: #6294 --------- Signed-off-by: Jamil <jamilbk@users.noreply.github.com>	2025-06-14 01:10:09 +00:00
Jamil	817eeff19f	refactor(portal): simplify managed groups (#9513 ) In many places throughout the portal codebase, we called a function "update_dynamic_group_memberships/1" which recomputed all of the dynamic/managed memberships for a particular account, and reapplied them to each affected group. Since the `has_many :memberships` relationship used `on_replace: :delete`, this caused Ecto to delete _all_ the `Everyone` group memberships, and reinsert them on each sync. Since each membership change triggers a policy re-evaluation for all policies to the affected actor (`Policies.broadcast_access_events_for/3`), this in effect was causing a massive amount of queries to be triggered upon each sync job as each membership deletion and insertion triggered a lookup for all resources available to that particular actor. To fix this, we introduce the following changes: - Remove `dynamic` group type. This will never be used as it will create an immense amount of complexity for any organization trying to manage groups this way - Refactor `update_dynamic_group_memberships/1` to use a smarter query that first gathers all the _needed_ changes and applies them within a transaction using Ecto.Multi. Previously all memberships would be rolled over unconditionally due to the `on_replace: :delete` option on the relationship. Note that the option is still there, but we generally don't set memberships on groups any longer unless editing the affected group directly, where the everyone group doesn't apply. Resolves: #8407 Resolves: #8408 Related: #6294 --------- Signed-off-by: Jamil <jamilbk@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-06-13 18:55:37 +00:00
Jamil	38c1de351c	refactor(portal): move membership events to WAL (#9388 ) Membership events are quite simple to move to the WAL: - Only one topic is used to determine which client(s) receive updates for which Actor(s). - The unsubscribe was removed because it was unused. - Notably, the N+1 query problem regarding re-evaluating all access again after each membership is updated is still present. This will be fixed using a lookup table in the client channel in the last PR to move events to the WAL. Related: https://github.com/firezone/firezone/issues/6294 Related: https://github.com/firezone/firezone/issues/8187	2025-06-06 06:23:33 +00:00
Jamil	6fc7d2e4e0	feat(portal): configurable ip stack for DNS resources (#9303 ) Some poorly-behaved applications (e.g. mongo) will fail to connect if they see both IPv4 and IPv6 addresses for a DNS resource, because they will try to connect to both of them and fail the whole connection setup if either one is not routable. To fix this, we need to introduce a knob to allow admins to restrict DNS resources to only A or AAAA records. <img width="750" alt="Screenshot 2025-06-02 at 10 48 39 AM" src="https://github.com/user-attachments/assets/4dbcb6ae-685f-43ee-b9e8-1502b365a294" /> <img width="1174" alt="Screenshot 2025-06-02 at 11 05 53 AM" src="https://github.com/user-attachments/assets/02d0a4b3-e6e8-4b6d-89fa-d3d999b5811e" /> --- Related: https://firezonehq.slack.com/archives/C08KPQKJZKM/p1746720923535349 Related: #9300 Fixes: #9042	2025-06-03 02:24:41 +00:00
Brian Manifold	870aee3812	refactor(portal): add migration to remove created_by_ columns (#9318 ) Why: * This commit contains only a migration to remove the created_by_identity and created_by_actor columns on multiple tables. This migration will be run manually due to the long running sync jobs that are currently in the system. This migration should be a no-op after the manual DB updates.	2025-06-01 12:08:15 -07:00
Brian Manifold	a51b35a6b4	refactor(portal): remove created_by_<identity/actor> columns (#9306 ) Why: * Now that we have started using the `created_by_subject` field on various tables, we no longer need to keep the `created_by_<identity/actor>` fields. This will help remove a foreign key reference and will be one step closer to allowing us to hard delete data rather than soft deleting all data in order to keep foreign key references like these.	2025-05-30 21:06:35 +00:00
Jamil	068f7dfc8e	chore(deps): Bump codespell to 2.4.1 and fix found issues (#9221 ) The dictionary was updated in and codespell found a few minor issues to fix. Supersedes #8990	2025-05-24 22:02:06 +00:00
Brian Manifold	12b4a12f26	feat(portal): Add created_by_subject (#9176 ) Why: * We have decided to change the way we will do audit logging. Instead of soft deleting data and keeping it in the table it was created in, we will be moving to an audit trail table where various actions will be recorded in a table/DB specifically for auditing purposes. Due to this change we need to make sure that we don't have stale/dangling references. One set of references we keep everywhere is `created_by_identity_id` and `created_by_actor_id`. Those foreign key references won't be able to be used after moving to the new audit system. This commit will allow us to keep that info by pulling the values and storing the data in a created_by_subject field on the record.	2025-05-20 20:03:46 +00:00
Jamil	65c58ee254	feat(portal): Zero-click client authentication (#9144 ) Adds a new field to `settings/identity_providers` that allows an Admin to designate any non-email/otp provider as the `default` for client authentication. Clients will then navigate directly to the provider's `/redirect` endpoint when authenticating, which in many cases will automatically sign them in. No existing providers are updated in this PR. https://github.com/user-attachments/assets/7b962a25-76fd-491f-a194-60ed993821fc	2025-05-16 19:26:08 +00:00
Brian Manifold	dd5a53f686	fix(portal): Fix sign_up to properly populate email (#9105 ) Why: * During the account sign up flow, the email of the first admin was not being populated in the `email` column on the auth_identities table. This was due to atoms being passed in the attrs instead of strings to the `create_identity` function. A migration was also created to backfill the missing emails in the `auth_identities` table.	2025-05-13 19:49:25 +00:00
Brian Manifold	a8bea13591	fix(portal): Remove redundant index on actor_group_memberships (#9063 ) Why: * It was pointed out that the way Postgresql does compound indexes there is no need to have an individual index on the first column of the compound index. This commit removes the redundant index on the `actor_id` for the `actor_group_membership` table.	2025-05-09 05:32:45 +00:00
Brian Manifold	20d8246ce8	fix(portal): Add indexes to actor_group_memberships (#9058 ) Why: * As we move towards hard deleting data one issue we've run into is with cascading deletes on the actor_group_memberships table. In order to solve this problem indexes have been created on the `actor_id` and `group_id` columns of the actor_group_memberships.	2025-05-09 01:47:24 +00:00
Jamil	fdd1105b10	fix(portal): alter db user role with replication (#8952 ) We need the `replication` attribute set on the db user. This is trivially done in a migration, and with the `CURRENT_USER` specifier, we don't need to fetch the Application configuration.	2025-04-30 13:02:34 -07:00
Jamil	968db2ae39	feat(portal): Receive WAL events (#8909 ) Firezone's control plane is a realtime, distributed system that relies on a broadcast/subscribe system to function. In many cases, these events are broadcasted whenever relevant data in the DB changes, such as an actor losing access to a policy, a membership being deleted, and so forth. Today, this is handled in the application layer, typically happening at the place where the relevant DB call is made (i.e. in an `after_commit`). While this approach has worked thus far, it has several issues: 1. We have no guarantee that the DB change will issue a broadcast. If the application is deployed or the process crashes after the DB changes are made but before the broadcast happens, we will have potentially failed to update any connected clients or gateways with the changes. 2. We have no guarantee that the order of DB updates will be maintained in order for broadcasts. In other words, app server A could win its DB operation against app server B, but then proceed to lose being the first to broadcast. 3. If the cluster is in a bad state where broadcasts may return an error (i.e. https://github.com/firezone/firezone/issues/8660), we will never retry the broadcast. To fix the above issues, we introduce a WAL logical decoder that process the event stream one message at a time and performs any needed work. Serializability is guaranteed since we only process the WAL in a single, cluster-global process, `ReplicationConnection`. Durability is also guaranteed since we only ACK WAL segments after we've successfully ingested the event. This means we will only advance the position of our WAL stream after successfully broadcasting the event. This PR only introduces the WAL stream processing system but does not introduce any changes to our current broadcasting behavior - that's saved for another PR.	2025-04-29 23:53:06 -07:00
Jamil	0f300f2484	fix(portal): Prevent dupe sync adapters (#8887 ) Prevents more than one sync-enabled adapter per account in order to prepare for eventually adding a unique constraint on `provider_identifier` for identities and groups per account. Related: #6294 --------- Signed-off-by: Jamil <jamilbk@users.noreply.github.com> Co-authored-by: Brian Manifold <bmanifold@users.noreply.github.com>	2025-04-22 13:58:24 +00:00

1 2 3 4

164 Commits