Jamil f379e85e9b refactor(portal): cache access state in channel pids (#9773)
When changes occur in the Firezone DB that trigger side effects, we need
some mechanism to broadcast and handle these.

Before, the system we used was:

- Each process subscribes to a myriad of topics related to data it wants
to receive. In some cases it would subscribe to new topics based on
received events from existing topics (I.e. flows in the gateway
channel), and sometimes in a loop. It would then need to be sure to
_unsubscribe_ from these topics
- Handle the side effect in the `after_commit` hook of the Ecto function
call after it completes
- Broadcast only a simply (thin) event message with a DB id
- In the receiver, use the id(s) to re-evaluate, or lookup one or many
records associated with the change
- After the lookup completes, `push` the relevant message(s) to the
LiveView, `client` pid, or `gateway` pid in their respective channel
processes

This system had a number of drawbacks ranging from scalability issues to
undesirable access bugs:

1. The `after_commit` callback, on each App node, is not globally
ordered. Since we broadcast a thin event schema and read from the DB to
hydrate each event, this meant we had a `read after write` problem in
our event architecture, leading to the potential for lost updates. Case
in point: if a policy is updated from `resource_id-1` to
`resource_id-2`, and then back to `resource_id-1`, it's possible that,
given the right amount of delay, the gateway channel will receive two
`reject_access` events for `resource_id-1`, as opposed to one for
`resource_id-1` and one for `resource_id-2`, leading to the potential
for unauthorized access.
1. It was very difficult to ensure that the correct topics were being
subscribed to and unsubscribed from, and the correct number of times,
leading to maintenance issues for other engineers.
1. We had a nasty N+1 query problem whenever memberships were added or
removed that resolved in essentially all access related to that
membership (so all Policies touching its actor group) to be
re-evaluated, and broadcasted. This meant that any bulk addition or
deletion of memberships would generate so many queries that they'd
timeout or consume the entire connection pool.
1. We had no durability for side-effect processing. In some places, we
were iterating over many returned records to send broadcasts.
Broadcasting is not a zero-time operation, each call takes a small
amount of CPU time to copy the message into the receiver's mailbox. If
we deployed while this was happening, the state update would be lost
forever. If this was a `reject_access` for a Gateway, the Gateway would
never remove access for that particular flow.
1. On each flow authorization, we needed to hit `us-east1` not only to
"authorize" the flow, but to log it as well. This incurs latency
especially for users in other parts of the world, which happens on
_each_ connection setup to a new resource.
1. Since we read and re-authorize access due to the thin events
broadcasted from side effects, we risk hitting thundering herd problems
(see the N+1 query problem above) where a single DB change could result
in all receivers hitting the DB at once to "hydrate" their
processing.ion
1. If an administrator modifies the DB directly, or, if we need to run a
DB migration that involves side effects, they'll be lost, because the
side effect triggers happened in `after_commit` hooks that are only
available when querying the DB through Ecto. Manually deleting (or
resurrecting) a policy, for example, would not have updated any
connected clients or gateways with the new state.


To fix all of the above, we move to the system introduced in this PR:

- All changes are now serialized (for free) by Postgres and broadcasted
as a single event stream
- The number of topics has been reduced to just one, the `account_id` of
an account. All receivers subscribe to this one topic for the lifetime
of their pid and then only filter the events they want to act upon,
ignoring all other messages
- The events themselves have been turned into "fat" structs based on the
schemas they present. By making them properly typed, we can apply things
like the existing Policy authorizer functions to them as if we had just
fetched them from the DB.
- All flow creation now happens in memory and doesn't not need to incur
a DB hit in `us-east1` to proceed.
- Since clients and gateways now track state in a push-based manner from
the DB, this means very few actual DB queries are needed to maintain
state in the channel procs, and it also means we can be smarter about
when to send `resource_deleted` and `resource_created_or_updated`
appropriately, since we can always diff between what the client _had_
access to, and what they _now_ have access to.
- All DB operations, whether they happen from the application code, a
`psql` prompt, or even via Google SQL Studio in the GCP console, will
trigger the _same_ side effects.
- We now use a replication consumer based off Postgres logical decoding
of the write-ahead log using a _durable slot_. This means that Postgres
will retain _all events_ until they are acknowledged, giving us the
ability to ensure at-least-once processing semantics for our system.
Today, the ACK is simply, "did we broadcast this event successfully".
But in the future, we can assert that replies are received before we
acknowledge the event as processed back to Postgres.



The tests in this PR have been updated to pass given the refactor.
However, since we are tracking more state now in the channel procs, it
would be a good idea to add more tests for those edge cases. That is
saved as a later PR because (1) this one is already huge, and (2) we
need to get this out to staging to smoke test everything anyhow.

Fixes: #9908 
Fixes: #9909 
Fixes: #9910
Fixes: #9900 
Related: #9501
2025-07-18 22:47:18 +00:00
2024-02-27 23:56:46 +00:00

firezone logo

A modern alternative to legacy VPNs.


firezone Discourse firezone Coverage Status GitHub commit activity GitHub closed issues Cloudsmith follow on Twitter


Overview

Firezone is an open source platform to securely manage remote access for any-sized organization. Unlike most VPNs, Firezone takes a granular, least-privileged approach to access management with group-based policies that control access to individual applications, entire subnets, and everything in between.

architecture

Features

Firezone is:

  • Fast: Built on WireGuard® to be 3-4 times faster than OpenVPN.
  • Scalable: Deploy two or more gateways for automatic load balancing and failover.
  • Private: Peer-to-peer, end-to-end encrypted tunnels prevent packets from routing through our infrastructure.
  • Secure: Zero attack surface thanks to Firezone's holepunching tech which establishes tunnels on-the-fly at the time of access.
  • Open: Our entire product is open-source, allowing anyone to audit the codebase.
  • Flexible: Authenticate users via email, Google Workspace, Okta, Entra ID, or OIDC and sync users and groups automatically.
  • Simple: Deploy gateways and configure access in minutes with a snappy admin UI.

Firezone is not:

  • A tool for creating bi-directional mesh networks
  • A full-featured router or firewall
  • An IPSec or OpenVPN server

Contents of this repository

This is a monorepo containing the full Firezone product, marketing website, and product documentation, organized as follows:

Quickstart

The quickest way to get started with Firezone is to sign up for an account at https://app.firezone.dev/sign_up.

Once you've signed up, follow the instructions in the welcome email to get started.

Frequently asked questions (FAQ)

Can I self-host Firezone?

Our license won't stop you from self-hosting the entire Firezone product top to bottom, but our internal APIs are changing rapidly so we can't meaningfully support self-hosting Firezone in production at this time.

If you're feeling especially adventurous and want to self-host Firezone for educational or hobby purposes, follow the instructions to spin up a local development environment in CONTRIBUTING.md.

The latest published clients (on App Stores and on releases) are only guaranteed to work with the managed version of Firezone and may not work with a self-hosted portal built from this repository. This is because Apple and Google can sometimes delay updates to their app stores, and so the latest published version may not be compatible with the tip of main from this repository.

Therefore, if you're experimenting with self-hosting Firezone, you will probably want to use clients you build and distribute yourself as well.

See the READMEs in the following directories for more information on building each client:

How long will 0.7 be supported until?

Firezone 0.7 is currently end-of-life and has stopped receiving updates as of January 31st, 2024. It will continue to be available indefinitely from the legacy branch of this repo under the Apache 2.0 license.

How much does it cost?

We offer flexible per-seat monthly and annual plans for the cloud-managed version of Firezone, with optional invoicing for larger organizations. See our pricing page for more details.

Those experimenting with self-hosting can use Firezone for free without feature or seat limitations, but we can't provide support for self-hosted installations at this time.

Documentation

Additional documentation on general usage, troubleshooting, and configuration can be found at https://www.firezone.dev/kb.

Get Help

If you're looking for help installing, configuring, or using Firezone, check our community support options:

  1. Discussion Forums: Ask questions, report bugs, and suggest features.
  2. Join our Discord Server: Join live discussions, meet other users, and chat with the Firezone team.
  3. Open a PR: Contribute a bugfix or make a contribution to Firezone.

If you need help deploying or maintaining Firezone for your business, consider contacting our sales team to speak with a Firezone expert.

See all support options on our main support page.

Star History

Star History Chart

Developing and Contributing

See CONTRIBUTING.md.

Security

See SECURITY.md.

License

Portions of this software are licensed as follows:

  • All content residing under the "elixir/" directory of this repository, if that directory exists, is licensed under the "Elastic License 2.0" license defined in "elixir/LICENSE".
  • All third party components incorporated into the Firezone Software are licensed under the original license provided by the owner of the applicable component.
  • Content outside of the above mentioned directories or restrictions above is available under the "Apache 2.0 License" license as defined in "LICENSE".

WireGuard® is a registered trademark of Jason A. Donenfeld.

Description
No description provided
Readme Apache-2.0 169 MiB
Languages
Elixir 57.1%
Rust 29.2%
TypeScript 5.9%
Swift 3.3%
Kotlin 1.8%
Other 2.5%