In order to allow customers to make use of connlib's DoH functionality,
we need a configuration UI for it. We take inspiration from the "New
Resource" page and implement a 3-choice UI component for configuring how
Clients should resolve DNS queries:
- System
- Secure DNS
- Custom
The secure and custom DNS options show an additional form when selected
for either picking a DoH provider or the addresses of the custom DNS
servers.
Right now, the "Secure DNS" part is disabled if the
`DISABLE_DOH_PROVIDER` env variable is set. We render a "Coming soon"
tooltip on hover:
<img width="1534" height="1100" alt="image"
src="https://github.com/user-attachments/assets/a12a6ba4-806f-4d19-8aea-5c1cd981d609"
/>
This allows us to test this in staging and still ship to production if
needed prior to enabling it.
Resolves: #10792Resolves: #10786
---------
Co-authored-by: Jamil Bou Kheir <jamilbk@users.noreply.github.com>
DNS servers are standarised to be contacted on port 53. This is also
hard-coded within `connlib` when we contact an upstream server. As such,
we should disallow users inputting any custom port for upstream DNS
servers. Luckily - or perhaps because it doesn't presently work - no
users in production have actually put in a port.
Resolves: #8330
Why:
* Now that soft delete fields have been removed being referenced in the
codebase and soft deleted rows have been removed from the DB we can now
remove any remaining traces of soft delete functionality.
Resolves#8187
Why:
* In previous commits, the portal code had been updated to use hard
deletion rather than soft deletion of data. The fields used in the soft
deletion were still kept in the DB and the code to allow for zero
downtime rollout and an easy rollback if necessary. To continue with
that work the portal code has now been updated to remove any reference
to the soft deleted fields (e.g. deleted_at, persistent_id, etc...).
While the code has been updated the actual data in the DB will need to
remain for now, to once again allow for a zero downtime rollout. Once
this commit has been deployed to production another PR can follow to
remove the columns from the necessary tables in the DB.
Related: #8187
Since Elixir 1.18, json encoding and decoding support is included in the
standard library. This is built on OTP's native json support which is
often faster than other implementations.
It mostly has the same API as the popular Jason library, differing
mainly in the format of the error responses returned when decoding
fails.
To minimize dependence on external libraries, we remove the Jason lib in
favor of this external dependency.
Fixes#8011
Bumps Phoenix to 1.8 and Phoenix LiveView to 1.1. As part of the bump a
number of issues had to be addressed. Comments inline provide more
context.
Supersedes #10475
Supersedes #10448
Clients more than one minor away from a particular won't be able to
connect, so it would be useful to help admins recognize when this might
be the case and encourage them to upgrade.
We accomplish that with two small UX improvements in this PR:
- In the outdated gateways email, we show them a count of clients that
will no longer be able to connect to the new gateway version, linking
them to a sorted view of the clients table
- In the clients table, we add a new sortable `version` column to allow
admins to see which clients are outdated
Fixes#7727Fixes#10385
---------
Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Napkin math shows that we can save substantial memory (~3x or more) on
the API nodes as connected clients/gateways grow if we just store the
fields we need in order to keep the client and gateway state maintained
in the channel pids.
To facilitate this, we create new `Cacheable` structs that represent
their `Domain` cousins, which use byte arrays for `id`s and strip out
unused fields.
Additionally, all business logic involved with maintaining these caches
is now contained within two modules: `Domain.Cache.Client` and
`Domain.Cache.Gateway`, and type specs have been added to aid in static
analysis and code documentation.
Comprehensive testing is now added not only for the cache modules, but
for their associated channel modules as well to ensure we handle
different kinds of edge cases gracefully.
The `Events` nomenclature was renamed to `Changes` to better name what
we are doing: Change-Data-Capture.
Lastly, the following related changes are included in this PR since they
were "in the way" so to speak of getting this done:
- We save the last received LSN in each channel and drop the `change`
with a warning if we receive it twice in a row, or we receive it out of
order
- The client/gateway version compatibility calculations have been moved
to `Domain.Resources` and `Domain.Gateways` and have been simplified to
make them easier to understand and maintain going forward.
Related: #10174Fixes: #9392Fixes: #9965Fixes: #9501Fixes: #10227
---------
Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
When changes occur in the Firezone DB that trigger side effects, we need
some mechanism to broadcast and handle these.
Before, the system we used was:
- Each process subscribes to a myriad of topics related to data it wants
to receive. In some cases it would subscribe to new topics based on
received events from existing topics (I.e. flows in the gateway
channel), and sometimes in a loop. It would then need to be sure to
_unsubscribe_ from these topics
- Handle the side effect in the `after_commit` hook of the Ecto function
call after it completes
- Broadcast only a simply (thin) event message with a DB id
- In the receiver, use the id(s) to re-evaluate, or lookup one or many
records associated with the change
- After the lookup completes, `push` the relevant message(s) to the
LiveView, `client` pid, or `gateway` pid in their respective channel
processes
This system had a number of drawbacks ranging from scalability issues to
undesirable access bugs:
1. The `after_commit` callback, on each App node, is not globally
ordered. Since we broadcast a thin event schema and read from the DB to
hydrate each event, this meant we had a `read after write` problem in
our event architecture, leading to the potential for lost updates. Case
in point: if a policy is updated from `resource_id-1` to
`resource_id-2`, and then back to `resource_id-1`, it's possible that,
given the right amount of delay, the gateway channel will receive two
`reject_access` events for `resource_id-1`, as opposed to one for
`resource_id-1` and one for `resource_id-2`, leading to the potential
for unauthorized access.
1. It was very difficult to ensure that the correct topics were being
subscribed to and unsubscribed from, and the correct number of times,
leading to maintenance issues for other engineers.
1. We had a nasty N+1 query problem whenever memberships were added or
removed that resolved in essentially all access related to that
membership (so all Policies touching its actor group) to be
re-evaluated, and broadcasted. This meant that any bulk addition or
deletion of memberships would generate so many queries that they'd
timeout or consume the entire connection pool.
1. We had no durability for side-effect processing. In some places, we
were iterating over many returned records to send broadcasts.
Broadcasting is not a zero-time operation, each call takes a small
amount of CPU time to copy the message into the receiver's mailbox. If
we deployed while this was happening, the state update would be lost
forever. If this was a `reject_access` for a Gateway, the Gateway would
never remove access for that particular flow.
1. On each flow authorization, we needed to hit `us-east1` not only to
"authorize" the flow, but to log it as well. This incurs latency
especially for users in other parts of the world, which happens on
_each_ connection setup to a new resource.
1. Since we read and re-authorize access due to the thin events
broadcasted from side effects, we risk hitting thundering herd problems
(see the N+1 query problem above) where a single DB change could result
in all receivers hitting the DB at once to "hydrate" their
processing.ion
1. If an administrator modifies the DB directly, or, if we need to run a
DB migration that involves side effects, they'll be lost, because the
side effect triggers happened in `after_commit` hooks that are only
available when querying the DB through Ecto. Manually deleting (or
resurrecting) a policy, for example, would not have updated any
connected clients or gateways with the new state.
To fix all of the above, we move to the system introduced in this PR:
- All changes are now serialized (for free) by Postgres and broadcasted
as a single event stream
- The number of topics has been reduced to just one, the `account_id` of
an account. All receivers subscribe to this one topic for the lifetime
of their pid and then only filter the events they want to act upon,
ignoring all other messages
- The events themselves have been turned into "fat" structs based on the
schemas they present. By making them properly typed, we can apply things
like the existing Policy authorizer functions to them as if we had just
fetched them from the DB.
- All flow creation now happens in memory and doesn't not need to incur
a DB hit in `us-east1` to proceed.
- Since clients and gateways now track state in a push-based manner from
the DB, this means very few actual DB queries are needed to maintain
state in the channel procs, and it also means we can be smarter about
when to send `resource_deleted` and `resource_created_or_updated`
appropriately, since we can always diff between what the client _had_
access to, and what they _now_ have access to.
- All DB operations, whether they happen from the application code, a
`psql` prompt, or even via Google SQL Studio in the GCP console, will
trigger the _same_ side effects.
- We now use a replication consumer based off Postgres logical decoding
of the write-ahead log using a _durable slot_. This means that Postgres
will retain _all events_ until they are acknowledged, giving us the
ability to ensure at-least-once processing semantics for our system.
Today, the ACK is simply, "did we broadcast this event successfully".
But in the future, we can assert that replies are received before we
acknowledge the event as processed back to Postgres.
The tests in this PR have been updated to pass given the refactor.
However, since we are tracking more state now in the channel procs, it
would be a good idea to add more tests for those edge cases. That is
saved as a later PR because (1) this one is already huge, and (2) we
need to get this out to staging to smoke test everything anyhow.
Fixes: #9908Fixes: #9909Fixes: #9910Fixes: #9900
Related: #9501
This has been dead code for a long time. The feature this was meant to
support, #8353, will require a different domain model, views, and user
flows.
Related: #8353
We had an old bug in one of our acceptance tests that is just now being
hit again due to the faster runners.
- We need to wait for the dropdown to become visible before clicking
- We fix a minor timer issue that was calculating elapsed time
incorrectly when determining when time out finding an el.
The `expires_at` column on the `flows` table was never used outside of
the context in which the flow was created in the Client Channel. This
ephemeral state, which is created in the `Domain.Flows.authorize_flow/4`
function, is never read from the DB in any meaningful capacity, so it
can be safely removed.
The `expire_flows_for` family of functions now simply reads the needed
fields from the flows table in order to broadcast `{:expire_flow,
flow_id, client_id, resource_id}` directly to the subscribed entities.
This PR is step 1 in removing the reliance on `Flows` to manage
ephemeral access state. In a subsequent PR we will actually change the
structure of what state is kept in the channel PIDs such that reliance
on this Flows table will no longer be necessary.
Additionally, in a few places, we were referencing a Flows.Show view
that was never available in production, so this dead code has been
removed.
Lastly, the `flows` table subscription and associated hook processing
has been completely removed as it is no longer needed. We've implemented
in #9667 logic to remove publications from removed table subscriptions,
so we can expect to get a couple ingest warnings when we deploy this as
the `Hooks.Flows` processor no longer exists, and the WAL data may have
lingering flows records in the queue. These can be safely ignored.
We issue broadcasts and subscribes in many places throughout the portal.
To help keep the cognitive overhead low, this PR consolidates all PubSub
functionality to the `Domain.PubSub` module.
This allows for:
- better maintainability
- see all of the topics we use at a glance
- consolidate repeated functionality (saved for a future PR)
- use the module hierarchy to define function names, which feels more
intuitive when reading and sets a convention
We also introduce a `Domain.Events.Hooks` behavior to ensure all hooks
comply with this simple contract, and we also introduce a convention to
standardize on topic names using the module hierarchy defined herein.
Lastly, we add convenience functions to the Presence modules to save a
bit of duplication and chance for errors.
This will make it much easier to maintain PubSub going forward.
Related: #9501
In many places throughout the portal codebase, we called a function
"update_dynamic_group_memberships/1" which recomputed all of the
dynamic/managed memberships for a particular account, and reapplied them
to each affected group.
Since the `has_many :memberships` relationship used `on_replace:
:delete`, this caused Ecto to delete _all_ the `Everyone` group
memberships, and reinsert them on each sync.
Since each membership change triggers a policy re-evaluation for all
policies to the affected actor
(`Policies.broadcast_access_events_for/3`), this in effect was causing a
massive amount of queries to be triggered upon each sync job as each
membership deletion and insertion triggered a lookup for all resources
available to that particular actor.
To fix this, we introduce the following changes:
- Remove `dynamic` group type. This will never be used as it will create
an immense amount of complexity for any organization trying to manage
groups this way
- Refactor `update_dynamic_group_memberships/1` to use a smarter query
that first gathers all the _needed_ changes and applies them within a
transaction using Ecto.Multi. Previously all memberships would be rolled
over unconditionally due to the `on_replace: :delete` option on the
relationship. Note that the option is still there, but we generally
don't set memberships on groups any longer unless editing the affected
group directly, where the everyone group doesn't apply.
Resolves: #8407Resolves: #8408
Related: #6294
---------
Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
We move the resource events to the WAL system. Notably, we no longer
need `fetch_and_update_breakable` for resource updates, so a bit of
refactoring is included to update the call sites for those.
Additionally, we need to add a `Flow.expire_flows_for_resource_id/1`
function to expire flows from the WAL system. This is now being called
in the WAL event handler. To prevent this from blocking the WAL
consumer/broadcaster, we wrap it with a Task.async. These will be
cleaned up when the lookup table for access is implemented next.
Another thing to note is that we lose the `subject` when moving from
`Flows.expire_flows_for(%Resource{}, subject)` to
`Flows.expire_flows_for_resource_id(resource_id)` when a resource is
deleted or updated by an actor since we respond to this event in the WAL
where that data isn't available. However, we don't actually _use_ the
subject when expiring flows (other than authorize the initial resource
update), so this isn't an issue.
Related: #9501
---------
Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Brian Manifold <bmanifold@users.noreply.github.com>
Membership events are quite simple to move to the WAL:
- Only one topic is used to determine which client(s) receive updates
for which Actor(s).
- The unsubscribe was removed because it was unused.
- Notably, the N+1 query problem regarding re-evaluating all access
again after each membership is updated is still present. This will be
fixed using a lookup table in the client channel in the last PR to move
events to the WAL.
Related: https://github.com/firezone/firezone/issues/6294
Related: https://github.com/firezone/firezone/issues/8187
This PR moves Gateway events to be triggered by the WAL broadcaster.
Some things of note that are cleaned up:
- The gateway `:update` event was never received anywhere (but in a
test) and so has been removed
- The account topic has been removed as it was also never acted upon
anywhere. Presence yes, but topic no
- The group topic has also been removed as it was only used to receive
broadcasted disconnects when a group is deleted, but this was already
handled by the token deletion and so is redundant.
Why:
* Now that we have started using the `created_by_subject` field on
various tables, we no longer need to keep the
`created_by_<identity/actor>` fields. This will help remove a foreign
key reference and will be one step closer to allowing us to hard delete
data rather than soft deleting all data in order to keep foreign key
references like these.
Moves the broadcasting of `disconnect` messages caused by token
soft-deletions to the WAL broadcaster.
Notably, many tests had to be cleaned up because they were specifically
testing this side effect. Instead, these tests now test (1) the token is
deleted, and then the token deletion handler is tested to ensure the
message is broadcasted.
Client updates are next on the path to moving more side effects to the
WAL broadcaster. This one has the following notable changes:
- ~~The `actor_clients` pubsub topic were only used to broadcast removal
of clients belonging to an actor; these are no longer needed since we
handle this in the individual removal event~~ EDIT: only the presence is
kept
- The `account_clients:{account_id}` pubsub and presence topic
definition has been moved to `Events.Hooks.Accounts` because these are
broadcasted using the account_id field based on account changes, and
have nothing to do with the client lifecycle
Related: #6294
Related: #8187
Why:
* We have decided to change the way we will do audit logging. Instead of
soft deleting data and keeping it in the table it was created in, we
will be moving to an audit trail table where various actions will be
recorded in a table/DB specifically for auditing purposes. Due to this
change we need to make sure that we don't have stale/dangling
references. One set of references we keep everywhere is
`created_by_identity_id` and `created_by_actor_id`. Those foreign key
references won't be able to be used after moving to the new audit
system. This commit will allow us to keep that info by pulling the
values and storing the data in a created_by_subject field on the record.
Adds a new field to `settings/identity_providers` that allows an Admin
to designate any non-email/otp provider as the `default` for client
authentication. Clients will then navigate directly to the provider's
`/redirect` endpoint when authenticating, which in many cases will
automatically sign them in.
No existing providers are updated in this PR.
https://github.com/user-attachments/assets/7b962a25-76fd-491f-a194-60ed993821fc
Why:
* The copy to clipboard button was not working at all on the API new
token page due to the fact that the FlowbiteJS library expects the
presence of the elements in the DOM on first render. This was not true
of the API Token code block. Along with that issue the existing code
blocks copy to clipboard buttons did not give any visual indication that
the copy had been completed. It was also somewhat difficult to see the
copy to clipboard button on those code blocks as well. This commit
updates the buttons to be more visible, as well as adds a phx-hook to
make sure the FlowbiteJS init functions are run on every code block even
if it's inserted after the initial load of the page and adds functions
that are run as a callback to toggle the button text and icon to show
the text has been copied.
After removing some of the functionality for viewing the Internet
Resource, customer was confused where to find it again.
This places an `Internet` section in the Resources index page (similar
to Sites page) with a short help text and an action button to view the
Internet Resource.
This also adds a convenient helper that allows us to route to
`/#{account}/resources/internet` for a nicer-looking URL that users can
bookmark if needed.
<img width="1423" alt="Screenshot 2025-03-19 at 11 52 31 PM"
src="https://github.com/user-attachments/assets/f2da1c31-92b2-429e-832f-73ddd0524155"
/>
Fixes#8479
Why:
* This commit will allow account admins to send a request through the
Firezone portal to schedule a deletion of their account, rather than
having the account admins email their request manually. Doing this
through the portal allows us to verify that the request actually came
from an admin of the account.
Finishes up the Internet Resource migration by enforcing:
- No internet resources in non-internet sites
- No regular resources in internet sites
- Removing the prompt to migrate
~~I've already migrated the existing internet resources in customer's
accounts. No one that was using the internet resource hadn't already
migrated.~~
Edit: I started to head down that path, then decided doing this here in
a data migration was going to be a better approach.
Fixes#8212
- Adds a simple text input to configure search domains ("default DNS
suffix") in the Settings -> DNS page.
- Sends the `search_domain` field as part of the client's `init` message
- Fixes a minor UI alignment inconsistency for the upstream resolvers
field so that the total form width and `New resolver` button width are
the same.
<img width="1137" alt="Screenshot 2025-03-09 at 10 56 56 PM"
src="https://github.com/user-attachments/assets/a1d5a570-8eae-4aa9-8a1c-6aaeb9f4c33a"
/>
Fixes#8365
Rather than the current behavior of raising a 500 when we receive
missing / invalid params in IdP auth callbacks, it would be helpful to
show the user which params were provided, in case the IdP has set
anything useful to aid the user.
For example, we recently received these params from `okta` for a pilot
account (and subsequently rendered them a 500):
```
%{"account_id_or_slug" => "<redacted>", "error" => "access_denied", "error_description" => "User is not assigned to the client application.", "provider_id" => "<redacted>", "state" => "<redacted>"}
```
We had a number of validation issues:
- DNS resources allow address `1.1.1.1` or `1.1.1.1/32`. These are not
valid and will cause issues during resolution.
- IP resources were allowing basically any string character on `edit`
caused by a logic bug in the changeset
- CIDR resources, same as above
- `*.*.*.*.google.com` and similar DNS wildcard resources were not
allowed
This PR beefs all of those up so that we have a higher degree of
certainty that our data is valid. If invalid data reaches connlib, it
will cause a panic.
This PR also introduces a migration to migrate any invalid resources
into the proper format in the DB.
Fixes#8287
Why:
* Rather than using a persistent_id field in Resources/Policies, it was
decided that we should allow "breaking changes" to these entities. This
means that Resources/Policies will now be able to update all fields on
the schema without changing the primary key ID of the entity.
* This change will greatly help the API and Terraform provider
development.
@jamilbk, would you like me to put a migration in this PR to actually
get rid of all of the existing soft deleted entities?
@thomaseizinger, I tagged you on this, because I wanted to make sure
that these changes weren't going to break any expectations in the client
and/or gateways.
---------
Signed-off-by: Brian Manifold <bmanifold@users.noreply.github.com>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
Sentry uncovered a bug in the resources index liveview where it looks
like some code copy-pasted from the policies index view wasn't updated
properly to work in the resources live view, causing the view to crash
if an admin was viewing the table while the resources are changed in
another page.
In debugging that, I realized the best UX when viewing these tables is
usually just to show a `Reload` button and not update the data live
while the admin is viewing it, as this can cause missed clicks and other
annoyances.
This PR adds an optional `stale` component attribute that, if true, will
render a `Reload` button in the live table which upon clicking will
reload the live table.
Not all index views are updated with this - in some views there is
already logic to handle making an intelligent update without breaking
the view if the data is updated - for example for the clients table.
Ideally, we live-update things that don't reflow layout inline (such as
`online/offline` presence) and for things that do cause layout reflow
(create/delete), we show the `Reload` button.
However that work is saved for a future PR as this one fixes the
immediate bug and this is not the highest priority.
<img width="1195" alt="Screenshot 2025-02-16 at 8 44 43 AM"
src="https://github.com/user-attachments/assets/114efffa-85ea-490d-9cea-78c607081ce3"
/>
<img width="401" alt="Screenshot 2025-02-16 at 9 59 53 AM"
src="https://github.com/user-attachments/assets/8a570213-d4ec-4b6c-a489-dcd9ad1c351c"
/>
It's possible for a client or admin to try and load the redirect URL
directly, or a misconfigured IdP may redirect back to us with missing
params. We should redirect with an error flash instead of 500'ing.
This is mostly to stay up to date with current Elixir and benefit from
the new included [JSON parser](https://hexdocs.pm/elixir/JSON.html).
Removing `Jason` in favor of the embedded `JSON` parser is saved for a
[future PR](https://github.com/firezone/firezone/issues/8011).
It found a couple type violations which were simple to fix, and some
formatting changes.
Updates the Resource's pagination cursor such that the default cursor
(with no HTTP params applied) uses `{:resources, :asc, :name}` as the
default, which correctly updates all Resources live tables to sort by
`name`.
The reason this is updated at the Query layer is because I wanted to
achieve this without populating URL params by default, and still
allowing the sort icon to properly reflect the default sort order upon
page load, which it does.
My initial attempt went down the path of updating `assign_live_table/3`
to take a `default_order_by` option. That didn't work because upon page
load we `handle_params` which resets the ordering immediately based on
the URL params.
Rather than update the UI code to track even more state in order to use
`default_order_by` when the `order_by` param is not specified, I opted
to updated the Query module instead which the UI uses.
Fixes#7842