Commit Graph

14 Commits

Author SHA1 Message Date
Jamil
f379e85e9b refactor(portal): cache access state in channel pids (#9773)
When changes occur in the Firezone DB that trigger side effects, we need
some mechanism to broadcast and handle these.

Before, the system we used worked like this:

- Each process subscribes to a myriad of topics related to the data it
wants to receive. In some cases it would subscribe to new topics based
on events received from existing topics (e.g. flows in the gateway
channel), sometimes in a loop. It would then need to be sure to
_unsubscribe_ from these topics
- Handle the side effect in the `after_commit` hook of the Ecto function
call after it completes
- Broadcast only a simple (thin) event message with a DB id
- In the receiver, use the id(s) to re-evaluate, or look up, one or many
records associated with the change
- After the lookup completes, `push` the relevant message(s) to the
LiveView, `client` pid, or `gateway` pid in their respective channel
processes (the broadcast side of this pattern is sketched below)
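
To make the old pattern concrete, here is a minimal sketch of the
broadcast side. Module, schema, and topic names are hypothetical
stand-ins, not the actual Firezone code:

```elixir
# Old pattern (sketch): fire a thin event after an Ecto write commits.
defmodule Domain.Policies do
  alias Domain.{Repo, Policy}

  def update_policy(%Policy{} = policy, attrs) do
    policy
    |> Policy.changeset(attrs)
    |> Repo.update()
    |> case do
      {:ok, policy} ->
        # In the real code this ran in an `after_commit` hook, which is
        # not globally ordered across app nodes.
        Phoenix.PubSub.broadcast(
          Domain.PubSub,
          "policies:#{policy.id}",
          # Thin event: just an id. Every receiver must re-read the DB
          # to learn what actually changed.
          {:policy_updated, policy.id}
        )

        {:ok, policy}

      error ->
        error
    end
  end
end
```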

This system had a number of drawbacks, ranging from scalability issues
to access-control bugs:

1. The `after_commit` callback, on each App node, is not globally
ordered. Since we broadcast thin events and read from the DB to hydrate
each one, we had a `read after write` problem in our event
architecture, leading to the potential for lost updates. Case in point:
if a policy is updated from `resource_id-1` to `resource_id-2`, and
then back to `resource_id-1`, it's possible, given the right amount of
delay, that the gateway channel receives two `reject_access` events for
`resource_id-1`, as opposed to one for `resource_id-1` and one for
`resource_id-2`, opening the door to unauthorized access. (The
receiving side of this race is sketched after this list.)
1. It was very difficult to ensure that the correct topics were being
subscribed to and unsubscribed from, and the correct number of times,
leading to maintenance issues for other engineers.
1. We had a nasty N+1 query problem whenever memberships were added or
removed: essentially all access related to that membership (so all
Policies touching its actor group) was re-evaluated and broadcast. This
meant that any bulk addition or deletion of memberships would generate
so many queries that they'd time out or consume the entire connection
pool.
1. We had no durability for side-effect processing. In some places, we
were iterating over many returned records to send broadcasts.
Broadcasting is not a zero-time operation: each call takes a small
amount of CPU time to copy the message into the receiver's mailbox. If
we deployed while this was happening, the state update would be lost
forever. If this was a `reject_access` for a Gateway, the Gateway would
never remove access for that particular flow.
1. On each flow authorization, we needed to hit `us-east1` not only to
"authorize" the flow, but to log it as well. This incurs latency,
especially for users in other parts of the world, on _each_ connection
setup to a new resource.
1. Since we re-read and re-authorize access in response to the thin
events broadcast from side effects, we risked thundering herd problems
(see the N+1 query problem above) where a single DB change could result
in all receivers hitting the DB at once to "hydrate" their
processing.
1. If an administrator modified the DB directly, or if we needed to run
a DB migration that involves side effects, those side effects would be
lost, because they were triggered in `after_commit` hooks that only run
when the DB is accessed through Ecto. Manually deleting (or
resurrecting) a policy, for example, would not have updated any
connected clients or gateways with the new state.
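
The read-after-write race in point 1 lives on the receiving side of
that pattern. A minimal sketch, again with hypothetical names:

```elixir
# Old pattern (sketch): the receiver re-hydrates from the DB on every
# thin event, so each read races with later writes to the same row.
def handle_info({:policy_updated, policy_id}, socket) do
  policy = Repo.get!(Policy, policy_id)

  # If the row went resource_id-1 -> resource_id-2 -> resource_id-1
  # before either event was handled, both reads observe resource_id-1:
  # the gateway receives reject_access twice for resource_id-1 and
  # never for resource_id-2.
  push(socket, "reject_access", %{resource_id: policy.resource_id})
  {:noreply, socket}
end
```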


To fix all of the above, we move to the system introduced in this PR:

- All changes are now serialized (for free) by Postgres and broadcast
as a single event stream
- The number of topics has been reduced to just one per account, keyed
on the `account_id`. All receivers subscribe to this one topic for the
lifetime of their pid and simply filter for the events they want to act
upon, ignoring all other messages (see the receiver sketch after this
list)
- The events themselves have been turned into "fat" structs based on the
schemas they represent. By making them properly typed, we can apply
things like the existing Policy authorizer functions to them as if we
had just fetched them from the DB.
- All flow creation now happens in memory and no longer needs to incur
a DB hit in `us-east1` to proceed.
- Since clients and gateways now track state in a push-based manner from
the DB, very few actual DB queries are needed to maintain state in the
channel procs, and we can be smarter about when to send
`resource_deleted` and `resource_created_or_updated`, since we can
always diff what the client _had_ access to against what it _now_ has
access to.
- All DB operations, whether they happen from the application code, a
`psql` prompt, or even via Google SQL Studio in the GCP console, will
trigger the _same_ side effects.
- We now use a replication consumer based on Postgres logical decoding
of the write-ahead log using a _durable slot_. This means that Postgres
will retain _all events_ until they are acknowledged, giving us the
ability to ensure at-least-once processing semantics for our system.
Today, the ACK simply means "did we broadcast this event
successfully?". But in the future, we can assert that replies are
received before we acknowledge the event as processed back to Postgres.
(A consumer skeleton is sketched below.)
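
A minimal sketch of the receiving side under the new model. The event
shape, struct names, and helpers here are assumptions for illustration:

```elixir
# New pattern (sketch): one subscription per account for the lifetime
# of the pid; events arrive as fat, typed structs.
def join("gateway", _payload, socket) do
  :ok = Phoenix.PubSub.subscribe(Domain.PubSub, "account:#{socket.assigns.account_id}")
  {:ok, socket}
end

def handle_info({:updated, %Events.Policy{} = old, %Events.Policy{} = new}, socket) do
  # Fat events carry the full record, so no DB read is needed here.
  # Diff the access the client _had_ against what it has _now_.
  had = socket.assigns.authorized_resource_ids
  now = recompute_access(had, old, new) # hypothetical pure function

  for id <- MapSet.difference(had, now),
      do: push(socket, "resource_deleted", %{resource_id: id})

  for id <- MapSet.difference(now, had),
      do: push(socket, "resource_created_or_updated", %{resource_id: id})

  {:noreply, assign(socket, :authorized_resource_ids, now)}
end

# Filter: ignore every event this channel doesn't care about.
def handle_info(_event, socket), do: {:noreply, socket}
```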



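And a skeleton of the durable-slot consumer, assuming Postgrex's
`Postgrex.ReplicationConnection`; slot and publication names are
illustrative and the WAL decoding is elided. The at-least-once
guarantee comes from acknowledging (advancing the confirmed LSN) only
after the broadcast succeeds:

```elixir
defmodule Domain.ReplicationConsumer do
  # Sketch only. Assumes a durable slot and publication were created
  # out of band, e.g.:
  #   SELECT pg_create_logical_replication_slot('firezone', 'pgoutput');
  #   CREATE PUBLICATION firezone FOR ALL TABLES;
  use Postgrex.ReplicationConnection

  @epoch DateTime.to_unix(~U[2000-01-01 00:00:00Z], :microsecond)

  def start_link(opts),
    do: Postgrex.ReplicationConnection.start_link(__MODULE__, :ok, opts)

  @impl true
  def init(:ok), do: {:ok, %{}}

  @impl true
  def handle_connect(state) do
    # Durable (non-temporary) slot: Postgres retains WAL until we
    # acknowledge it, so unprocessed events survive deploys and crashes.
    query =
      "START_REPLICATION SLOT firezone LOGICAL 0/0 " <>
        "(proto_version '1', publication_names 'firezone')"

    {:stream, query, [], state}
  end

  @impl true
  # ?w frames are XLogData: decoded WAL messages.
  def handle_data(<<?w, _start::64, wal_end::64, _clock::64, data::binary>>, state) do
    # Hypothetical: decode `data` into a fat event struct and broadcast.
    :ok = broadcast_fat_event(data)

    # ACK only after the broadcast succeeded; crash before this line
    # and Postgres replays the event.
    {:noreply, [standby_status(wal_end + 1)], state}
  end

  # ?k frames are keepalives; reply when Postgres requests it.
  def handle_data(<<?k, wal_end::64, _clock::64, reply>>, state) do
    messages = if reply == 1, do: [standby_status(wal_end + 1)], else: []
    {:noreply, messages, state}
  end

  defp standby_status(lsn),
    do: <<?r, lsn::64, lsn::64, lsn::64, current_time()::64, 0>>

  defp current_time, do: System.os_time(:microsecond) - @epoch

  defp broadcast_fat_event(_data), do: :ok # placeholder
end
```
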
The tests in this PR have been updated to pass given the refactor.
However, since we are tracking more state in the channel procs now, it
would be a good idea to add more tests for those edge cases. That is
saved for a later PR because (1) this one is already huge, and (2) we
need to get this out to staging to smoke test everything anyhow.

Fixes: #9908 
Fixes: #9909 
Fixes: #9910
Fixes: #9900 
Related: #9501
2025-07-18 22:47:18 +00:00
Andrew Dryga
835fc4c8eb chore(portal): Bump all deps related to portal (#6445) 2024-08-28 10:40:02 -06:00
Andrew Dryga
a7e54686b0 feat(portal): Track page views and sign ups using Mixpanel and HubSpot on public pages (#5050)
Fixes firezone/gtm#253
Fixes firezone/gtm#278
2024-05-21 10:34:56 -06:00
Andrew Dryga
f3c8c734ab feat(portal): Filtering, Fulltext Search, Pagination, Preloads (#3751)
On the domain side this PR extends `Domain.Repo` with filtering,
pagination, and ordering, along with some convention changes and the
removal of code that is no longer needed now that we have filtering.
This required touching pretty much all contexts and code, but I went
through all public functions and added missing tests to make sure
nothing will break.

On the web side I've introduced a `<.live_table />` which is as close as
possible to being a drop-in replacement for the regular `<.table />`
(but requires structuring the LiveView module differently due to
assigns anyway). I've updated all the listing tables to use it.
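
For flavor, a hypothetical call site; attribute and slot names here are
illustrative, not the component's actual API:

```elixir
# Hypothetical LiveView render excerpt using the new component.
def render(assigns) do
  ~H"""
  <.live_table id="actors" rows={@actors} filters={@filters} ordered_by={@order_by}>
    <:col :let={actor} label="Name"><%= actor.name %></:col>
    <:col :let={actor} label="Created"><%= actor.inserted_at %></:col>
  </.live_table>
  """
end
```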
2024-03-16 13:27:48 -06:00
Jamil
6419b1d096 chore(portal): Fix static files (#3974)
Fixes issues with static files returning 404s
2024-03-05 17:43:14 +00:00
Jamil
127b97e588 fix(portal|website): Fix static paths for website and elixir (#3802)
Phoenix VerifiedRoutes expects directories for `statics`, but we were
passing filenames as well.

These are removed since they're not required -- all of the top-level
files we need to serve at the root don't need VerifiedRoutes.
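
The relevant knob is the `statics` option of `Phoenix.VerifiedRoutes`.
A sketch of the shape of the fix (the actual lists differ):

```elixir
# Before (sketch): filenames were mixed into `statics`.
#   statics: ~w(assets favicon.ico robots.txt)

# After (sketch): directories only; root-level files are served
# without going through VerifiedRoutes.
use Phoenix.VerifiedRoutes,
  endpoint: Web.Endpoint,
  router: Web.Router,
  statics: ~w(assets fonts images)
```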

For the website, the files were named incorrectly.


The above issues were causing 404s on both the website and portal.
2024-02-28 20:03:42 +00:00
Andrew Dryga
9e11ddb1cd Do not crash on disconnect messages in LV (#3795)
This message is sent by some of the broadcasters and was resulting in a
process crash (on a socket that will be disconnected anyway), which
triggered our logging alerts. So we will simply ignore it globally to
suppress the noise.
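
One way to do this globally, assuming a shared `on_mount` hook; the
actual implementation may differ:

```elixir
# Attach a handle_info hook so every LiveView silently drops the
# "disconnect" broadcast instead of crashing on an unmatched message.
def on_mount(:ignore_disconnect, _params, _session, socket) do
  {:cont,
   Phoenix.LiveView.attach_hook(socket, :ignore_disconnect, :handle_info, fn
     %Phoenix.Socket.Broadcast{event: "disconnect"}, socket -> {:halt, socket}
     _message, socket -> {:cont, socket}
   end)}
end
```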
2024-02-28 11:42:07 -06:00
Jamil
17692ecf4d fix(portal|website): Fix favicons for dark mode (#3785) 2024-02-27 18:57:37 +00:00
Andrew Dryga
e290f26298 Complete Actors, Devices and Groups UIs (#1885)
This will be done once the remaining UI code is covered with tests.
2023-09-02 05:35:52 +00:00
Andrew Dryga
fe06d2e42d Actor groups and group sync helpers (#1727) 2023-07-31 16:22:40 -06:00
bmanifold
9a06a9bb14 Refactor Gateway Liveviews to use real data (#1760)
Why:

* The previous Gateway Liveviews had used static views and data as a
starting point for fleshing out the web UI. This commit builds on that
and replaces (most) of the static data with data from the database, as
well as updating the static Liveview templates to use components where
possible.

Note: These changes are only meant to involve the Gateway views
(index/show/edit). More changes to other resources will follow (i.e.
Resources, Users, Devices, etc.)

---------

Signed-off-by: bmanifold <bmanifold@users.noreply.github.com>
Co-authored-by: Andrew Dryga <andrew@dryga.com>
2023-07-18 21:15:59 +00:00
Andrew Dryga
e7d5d0579b Authentication for the live app (#1674)
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
2023-06-27 13:11:36 -06:00
Andrew Dryga
89b7e3b474 Fix assets pipeline, add Elixir deps audit, add Android applink manifest (#1659) 2023-06-14 17:15:38 -06:00
Andrew Dryga
37a2d7b7f5 Move elixir code to a subfolder (#1631) 2023-05-24 15:46:51 -06:00