Commit Graph

14 Commits

Author SHA1 Message Date
Jamil
f379e85e9b refactor(portal): cache access state in channel pids (#9773)
When changes occur in the Firezone DB that trigger side effects, we need
some mechanism to broadcast and handle these.

Before, the system we used worked like this:

- Each process subscribes to a myriad of topics related to the data it
wants to receive. In some cases it would subscribe to new topics based
on events received from existing topics (e.g. flows in the gateway
channel), sometimes in a loop. It would then need to be sure to
_unsubscribe_ from these topics
- Handle the side effect in the `after_commit` hook of the Ecto function
call after it completes
- Broadcast only a simple (thin) event message with a DB id
- In the receiver, use the id(s) to re-evaluate, or look up, one or many
records associated with the change
- After the lookup completes, `push` the relevant message(s) to the
LiveView, `client` pid, or `gateway` pid in their respective channel
processes (the broadcast side of this pattern is sketched below)
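
To make the old pattern concrete, here is a minimal sketch of the
broadcast side. Module, schema, and topic names are hypothetical
stand-ins, not the actual Firezone code:

```elixir
# Old pattern (sketch): fire a thin event after an Ecto write commits.
defmodule Domain.Policies do
  alias Domain.{Repo, Policy}

  def update_policy(%Policy{} = policy, attrs) do
    policy
    |> Policy.changeset(attrs)
    |> Repo.update()
    |> case do
      {:ok, policy} ->
        # In the real code this ran in an `after_commit` hook, which is
        # not globally ordered across app nodes.
        Phoenix.PubSub.broadcast(
          Domain.PubSub,
          "policies:#{policy.id}",
          # Thin event: just an id. Every receiver must re-read the DB
          # to learn what actually changed.
          {:policy_updated, policy.id}
        )

        {:ok, policy}

      error ->
        error
    end
  end
end
```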

This system had a number of drawbacks, ranging from scalability issues
to access-control bugs:

1. The `after_commit` callback, on each App node, is not globally
ordered. Since we broadcast thin events and read from the DB to hydrate
each one, we had a `read after write` problem in our event
architecture, leading to the potential for lost updates. Case in point:
if a policy is updated from `resource_id-1` to `resource_id-2`, and
then back to `resource_id-1`, it's possible, given the right amount of
delay, that the gateway channel receives two `reject_access` events for
`resource_id-1`, as opposed to one for `resource_id-1` and one for
`resource_id-2`, opening the door to unauthorized access. (The
receiving side of this race is sketched after this list.)
1. It was very difficult to ensure that the correct topics were being
subscribed to and unsubscribed from, and the correct number of times,
leading to maintenance issues for other engineers.
1. We had a nasty N+1 query problem whenever memberships were added or
removed: essentially all access related to that membership (so all
Policies touching its actor group) was re-evaluated and broadcast. This
meant that any bulk addition or deletion of memberships would generate
so many queries that they'd time out or consume the entire connection
pool.
1. We had no durability for side-effect processing. In some places, we
were iterating over many returned records to send broadcasts.
Broadcasting is not a zero-time operation: each call takes a small
amount of CPU time to copy the message into the receiver's mailbox. If
we deployed while this was happening, the state update would be lost
forever. If this was a `reject_access` for a Gateway, the Gateway would
never remove access for that particular flow.
1. On each flow authorization, we needed to hit `us-east1` not only to
"authorize" the flow, but to log it as well. This incurs latency,
especially for users in other parts of the world, on _each_ connection
setup to a new resource.
1. Since we re-read and re-authorize access in response to the thin
events broadcast from side effects, we risked thundering herd problems
(see the N+1 query problem above) where a single DB change could result
in all receivers hitting the DB at once to "hydrate" their
processing.
1. If an administrator modified the DB directly, or if we needed to run
a DB migration that involves side effects, those side effects would be
lost, because they were triggered in `after_commit` hooks that only run
when the DB is accessed through Ecto. Manually deleting (or
resurrecting) a policy, for example, would not have updated any
connected clients or gateways with the new state.
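
The read-after-write race in point 1 lives on the receiving side of
that pattern. A minimal sketch, again with hypothetical names:

```elixir
# Old pattern (sketch): the receiver re-hydrates from the DB on every
# thin event, so each read races with later writes to the same row.
def handle_info({:policy_updated, policy_id}, socket) do
  policy = Repo.get!(Policy, policy_id)

  # If the row went resource_id-1 -> resource_id-2 -> resource_id-1
  # before either event was handled, both reads observe resource_id-1:
  # the gateway receives reject_access twice for resource_id-1 and
  # never for resource_id-2.
  push(socket, "reject_access", %{resource_id: policy.resource_id})
  {:noreply, socket}
end
```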


To fix all of the above, we move to the system introduced in this PR:

- All changes are now serialized (for free) by Postgres and broadcast
as a single event stream
- The number of topics has been reduced to just one per account, keyed
on the `account_id`. All receivers subscribe to this one topic for the
lifetime of their pid and simply filter for the events they want to act
upon, ignoring all other messages (see the receiver sketch after this
list)
- The events themselves have been turned into "fat" structs based on the
schemas they represent. By making them properly typed, we can apply
things like the existing Policy authorizer functions to them as if we
had just fetched them from the DB.
- All flow creation now happens in memory and no longer needs to incur
a DB hit in `us-east1` to proceed.
- Since clients and gateways now track state in a push-based manner from
the DB, very few actual DB queries are needed to maintain state in the
channel procs, and we can be smarter about when to send
`resource_deleted` and `resource_created_or_updated`, since we can
always diff what the client _had_ access to against what it _now_ has
access to.
- All DB operations, whether they happen from the application code, a
`psql` prompt, or even via Google SQL Studio in the GCP console, will
trigger the _same_ side effects.
- We now use a replication consumer based on Postgres logical decoding
of the write-ahead log using a _durable slot_. This means that Postgres
will retain _all events_ until they are acknowledged, giving us the
ability to ensure at-least-once processing semantics for our system.
Today, the ACK simply means "did we broadcast this event
successfully?". But in the future, we can assert that replies are
received before we acknowledge the event as processed back to Postgres.
(A consumer skeleton is sketched below.)
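
A minimal sketch of the receiving side under the new model. The event
shape, struct names, and helpers here are assumptions for illustration:

```elixir
# New pattern (sketch): one subscription per account for the lifetime
# of the pid; events arrive as fat, typed structs.
def join("gateway", _payload, socket) do
  :ok = Phoenix.PubSub.subscribe(Domain.PubSub, "account:#{socket.assigns.account_id}")
  {:ok, socket}
end

def handle_info({:updated, %Events.Policy{} = old, %Events.Policy{} = new}, socket) do
  # Fat events carry the full record, so no DB read is needed here.
  # Diff the access the client _had_ against what it has _now_.
  had = socket.assigns.authorized_resource_ids
  now = recompute_access(had, old, new) # hypothetical pure function

  for id <- MapSet.difference(had, now),
      do: push(socket, "resource_deleted", %{resource_id: id})

  for id <- MapSet.difference(now, had),
      do: push(socket, "resource_created_or_updated", %{resource_id: id})

  {:noreply, assign(socket, :authorized_resource_ids, now)}
end

# Filter: ignore every event this channel doesn't care about.
def handle_info(_event, socket), do: {:noreply, socket}
```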



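And a skeleton of the durable-slot consumer, assuming Postgrex's
`Postgrex.ReplicationConnection`; slot and publication names are
illustrative and the WAL decoding is elided. The at-least-once
guarantee comes from acknowledging (advancing the confirmed LSN) only
after the broadcast succeeds:

```elixir
defmodule Domain.ReplicationConsumer do
  # Sketch only. Assumes a durable slot and publication were created
  # out of band, e.g.:
  #   SELECT pg_create_logical_replication_slot('firezone', 'pgoutput');
  #   CREATE PUBLICATION firezone FOR ALL TABLES;
  use Postgrex.ReplicationConnection

  @epoch DateTime.to_unix(~U[2000-01-01 00:00:00Z], :microsecond)

  def start_link(opts),
    do: Postgrex.ReplicationConnection.start_link(__MODULE__, :ok, opts)

  @impl true
  def init(:ok), do: {:ok, %{}}

  @impl true
  def handle_connect(state) do
    # Durable (non-temporary) slot: Postgres retains WAL until we
    # acknowledge it, so unprocessed events survive deploys and crashes.
    query =
      "START_REPLICATION SLOT firezone LOGICAL 0/0 " <>
        "(proto_version '1', publication_names 'firezone')"

    {:stream, query, [], state}
  end

  @impl true
  # ?w frames are XLogData: decoded WAL messages.
  def handle_data(<<?w, _start::64, wal_end::64, _clock::64, data::binary>>, state) do
    # Hypothetical: decode `data` into a fat event struct and broadcast.
    :ok = broadcast_fat_event(data)

    # ACK only after the broadcast succeeded; crash before this line
    # and Postgres replays the event.
    {:noreply, [standby_status(wal_end + 1)], state}
  end

  # ?k frames are keepalives; reply when Postgres requests it.
  def handle_data(<<?k, wal_end::64, _clock::64, reply>>, state) do
    messages = if reply == 1, do: [standby_status(wal_end + 1)], else: []
    {:noreply, messages, state}
  end

  defp standby_status(lsn),
    do: <<?r, lsn::64, lsn::64, lsn::64, current_time()::64, 0>>

  defp current_time, do: System.os_time(:microsecond) - @epoch

  defp broadcast_fat_event(_data), do: :ok # placeholder
end
```
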
The tests in this PR have been updated to pass given the refactor.
However, since we are tracking more state in the channel procs now, it
would be a good idea to add more tests for those edge cases. That is
saved for a later PR because (1) this one is already huge, and (2) we
need to get this out to staging to smoke test everything anyhow.

Fixes: #9908 
Fixes: #9909 
Fixes: #9910
Fixes: #9900 
Related: #9501
2025-07-18 22:47:18 +00:00
Andrew Dryga
835fc4c8eb chore(portal): Bump all deps related to portal (#6445) 2024-08-28 10:40:02 -06:00
Andrew Dryga
a7e54686b0 feat(portal): Track page views and sign ups using Mixpanel and HubSpot on public pages (#5050)
Fixes firezone/gtm#253
Fixes firezone/gtm#278
2024-05-21 10:34:56 -06:00
Andrew Dryga
f3c8c734ab feat(portal): Filtering, Fulltext Search, Pagination, Preloads (#3751)
On the domain side this PR extends `Domain.Repo` with filtering,
pagination, and ordering, along with some convention changes and the
removal of code that is no longer needed now that we have filtering.
This required touching pretty much all contexts and code, but I went
through all public functions and added missing tests to make sure
nothing will break.

On the web side I've introduced a `<.live_table />` which is as close as
possible to being a drop-in replacement for the regular `<.table />`
(but requires structuring the LiveView module differently due to
assigns anyway). I've updated all the listing tables to use it.
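
For flavor, a hypothetical call site; attribute and slot names here are
illustrative, not the component's actual API:

```elixir
# Hypothetical LiveView render excerpt using the new component.
def render(assigns) do
  ~H"""
  <.live_table id="actors" rows={@actors} filters={@filters} ordered_by={@order_by}>
    <:col :let={actor} label="Name"><%= actor.name %></:col>
    <:col :let={actor} label="Created"><%= actor.inserted_at %></:col>
  </.live_table>
  """
end
```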
2024-03-16 13:27:48 -06:00
Jamil
6419b1d096 chore(portal): Fix static files (#3974)
Fixes issues with static files returning 404s
2024-03-05 17:43:14 +00:00
Jamil
127b97e588 fix(portal|website): Fix static paths for website and elixir (#3802)
Phoenix VerifiedRoutes expects directories for `statics`, but we were
passing filenames as well.

These are removed since they're not required -- all of the top-level
files we need to serve at the root don't need VerifiedRoutes.
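
The relevant knob is the `statics` option of `Phoenix.VerifiedRoutes`.
A sketch of the shape of the fix (the actual lists differ):

```elixir
# Before (sketch): filenames were mixed into `statics`.
#   statics: ~w(assets favicon.ico robots.txt)

# After (sketch): directories only; root-level files are served
# without going through VerifiedRoutes.
use Phoenix.VerifiedRoutes,
  endpoint: Web.Endpoint,
  router: Web.Router,
  statics: ~w(assets fonts images)
```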

For the website, the files were named incorrectly.


The above issues were causing 404s on both the website and portal.
2024-02-28 20:03:42 +00:00
Andrew Dryga
9e11ddb1cd Do not crash on disconnect messages in LV (#3795)
This message is sent by some of the broadcasters and was resulting in a
process crash (on a socket that will be disconnected anyway), which
triggered our logging alerts. So we will simply ignore it globally to
suppress the noise.
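
One way to do this globally, assuming a shared `on_mount` hook; the
actual implementation may differ:

```elixir
# Attach a handle_info hook so every LiveView silently drops the
# "disconnect" broadcast instead of crashing on an unmatched message.
def on_mount(:ignore_disconnect, _params, _session, socket) do
  {:cont,
   Phoenix.LiveView.attach_hook(socket, :ignore_disconnect, :handle_info, fn
     %Phoenix.Socket.Broadcast{event: "disconnect"}, socket -> {:halt, socket}
     _message, socket -> {:cont, socket}
   end)}
end
```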
2024-02-28 11:42:07 -06:00
Jamil
17692ecf4d fix(portal|website): Fix favicons for dark mode (#3785) 2024-02-27 18:57:37 +00:00
Andrew Dryga
e290f26298 Complete Actors, Devices and Groups UIs (#1885)
This will be done once the remaining UI code is covered with tests.
2023-09-02 05:35:52 +00:00
Andrew Dryga
fe06d2e42d Actor groups and group sync helpers (#1727) 2023-07-31 16:22:40 -06:00
bmanifold
9a06a9bb14 Refactor Gateway Liveviews to use real data (#1760)
Why:

* The previous Gateway Liveviews had used static views and data as a
starting point for fleshing out the web UI. This commit builds on that
and replaces (most) of the static data with data from the database, as
well as updating the static Liveview templates to use components where
possible.

Note: These changes are only meant to involve the Gateway views
(index/show/edit). More changes to other resources will follow (i.e.
Resources, Users, Devices, etc.)

---------

Signed-off-by: bmanifold <bmanifold@users.noreply.github.com>
Co-authored-by: Andrew Dryga <andrew@dryga.com>
2023-07-18 21:15:59 +00:00
Andrew Dryga
e7d5d0579b Authentication for the live app (#1674)
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
2023-06-27 13:11:36 -06:00
Andrew Dryga
89b7e3b474 Fix assets pipeline, add Elixir deps audit, add Android applink manifest (#1659) 2023-06-14 17:15:38 -06:00
Andrew Dryga
37a2d7b7f5 Move elixir code to a subfolder (#1631) 2023-05-24 15:46:51 -06:00