Commit Graph

765 Commits

Author SHA1 Message Date
Jamil
c783b23bae refactor(portal): rename conditional->manual (#9612)
These only have one condition - to run manually. `manual migrations`
better implies that these migrations _must_ typically be run manually.
2025-06-21 21:17:33 +00:00
Jamil
2523bedd19 fix(portal): add if not exists to concurrent index (#9611)
With `@disable_ddl_transaction` this needs to be added.

See
https://firezonehq.slack.com/archives/C04HRQTFY0Z/p1750516438992329?thread_ts=1750510766.640919&cid=C04HRQTFY0Z
2025-06-21 15:42:51 +00:00
Jamil
e113def903 fix(portal): flush metrics buffer before exceeding limit (#9608)
Instead of checking whether the buffer limit has been exceeded _after_
adding new timeseries to it, we should check before.

Variables were renamed to be a little more clear on what they represent.
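
A minimal Python sketch of the check-before-add idea, not the portal's Elixir code. The `MetricsBuffer` class, its fields, and the limit of 3 are all hypothetical and only illustrate the ordering of the check:

```python
from dataclasses import dataclass, field


@dataclass
class MetricsBuffer:
    """Hypothetical buffer that flushes before an add would exceed `limit`."""

    limit: int
    pending: list = field(default_factory=list)
    flushed_batches: list = field(default_factory=list)

    def add(self, timeseries: list) -> None:
        # Check *before* appending, so a flushed batch never exceeds the limit
        # (as long as a single incoming batch fits within it).
        if self.pending and len(self.pending) + len(timeseries) > self.limit:
            self.flush()
        self.pending.extend(timeseries)

    def flush(self) -> None:
        if self.pending:
            self.flushed_batches.append(self.pending)
            self.pending = []


buf = MetricsBuffer(limit=3)
buf.add(["a", "b"])
buf.add(["c", "d"])  # 2 + 2 would exceed 3, so ["a", "b"] is flushed first
```

Checking after the append instead would have let the buffer hold four items before the flush.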
2025-06-20 21:44:52 +00:00
Jamil
a1677494b5 chore(portal): drop index concurrently (#9609)
Looks like postgres does support this, so adding for good measure.
2025-06-20 14:55:23 -07:00
Jamil
975057f9b4 fix(portal): add account_id,type index on actors (#9607)
`Repo.aggregate(:count)` which performs a `COUNT(*)` query should be
relatively fast if it's able to do an index-only scan. For that to
happen we need to ensure all of the fields in the WHERE clause are
indexed. Currently, we're missing an index on `actors.type` so a full
row scan is executed per account each time we calculate Billing limits,
every 5 minutes, for all accounts.

If we need to check these limits more often and/or our data grows in
size, it could be worth moving these to a limits counter field on
`accounts` which is maintained via INSERT/DELETE triggers.
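
The effect of a composite index on a `COUNT(*)` query can be sketched with Python's built-in `sqlite3`. This is an illustration only: the portal uses Postgres, whose planner and index-only scan rules differ, and the table layout and index name here are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE actors (id INTEGER PRIMARY KEY, account_id TEXT, type TEXT, name TEXT)"
)
conn.executemany(
    "INSERT INTO actors (account_id, type, name) VALUES (?, ?, ?)",
    [
        ("acct-1", "account_user", "a"),
        ("acct-1", "service_account", "b"),
        ("acct-2", "account_user", "c"),
    ],
)
# Hypothetical index name; both WHERE columns are covered by the index.
conn.execute("CREATE INDEX actors_account_id_type_index ON actors (account_id, type)")

query = "SELECT COUNT(*) FROM actors WHERE account_id = ? AND type = ?"
params = ("acct-1", "account_user")
count = conn.execute(query, params).fetchone()[0]

# The last column of each EXPLAIN QUERY PLAN row is the human-readable detail.
plan = " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + query, params))
print(count, plan)
```

In SQLite the plan reports a covering index, meaning the count is answered from the index without reading table rows.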

Related:
https://firezone-inc.sentry.io/issues/6346235615/events/588a61860e0b4875a5dbe8531dbb806a/?project=4508756715569152&referrer=next-event
2025-06-20 20:53:09 +00:00
Jamil
6f87f5ea2c fix(portal): use account_id in index for agm hook (#9606)
When reacting to `ActorGroupMembership` updates, we were issuing a query
to expire Flows given an `actor_id, actor_group_id` combination.

Unfortunately, this query never included an `account_id` to scope it,
causing a table scan of flows and associated join tables to resolve it.

To fix this, we introduce the `account_id` and ensure the expire flows
uses this field to ensure only data for an account is considered in the
query.

Related:
https://firezone-inc.sentry.io/issues/6346235615/events/e225e1c488cb4ea3896649aabd529c50
2025-06-20 20:40:31 +00:00
Jamil
ddb3dc8ce0 refactor(portal): compile_config macro to env_var_to_config (#9605)
The `compile_config` macro only works on environment and DB variables.
This caused recent confusion when determining where `database_pool_size`
was coming from.

To fix this issue, we rename `compile_config` to be more clear.

We also remove the technical debt around supporting "legacy keys" and
DB-based configuration.

The configuration compiler now works exclusively on environment
variables, where it is still useful for:

- Casting environment variables to their expected type
- Alerting us when one is missing that should be set
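
The two remaining jobs listed above (casting and alerting on missing values) can be sketched in Python as follows. This is not the Elixir macro; `read_env`, `MissingEnvVar`, and the `environ` parameter are hypothetical names used only for illustration:

```python
import os

_REQUIRED = object()  # sentinel: no default was supplied


class MissingEnvVar(Exception):
    """Raised when a required environment variable is not set."""


def read_env(name, cast=str, default=_REQUIRED, environ=os.environ):
    """Read `name` from the environment and cast it to the expected type."""
    raw = environ.get(name)
    if raw is None:
        if default is _REQUIRED:
            # Alert loudly instead of silently continuing without a value.
            raise MissingEnvVar(f"{name} must be set")
        return default
    return cast(raw)


pool_size = read_env("DATABASE_POOL_SIZE", int, environ={"DATABASE_POOL_SIZE": "10"})
```

Passing `environ` explicitly keeps the sketch testable without touching the real process environment.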
2025-06-20 20:39:06 +00:00
Brian Manifold
5bd5a7f6ad fix(portal): trim whitespace in auth provider forms (#9587)
Why:

* We recently had an issue where a space was entered into a provider
form field and caused our system to not be able to authenticate the
admin when setting up the auth provider and directory sync. To mitigate
this moving forward we are making sure all white space is trimmed in the
form fields. This commit focuses on the form fields for the auth
providers.
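
As a rough Python sketch of the trimming step (the portal does this in its Elixir form handling; `trim_params` and the field names are hypothetical):

```python
def trim_params(params: dict) -> dict:
    """Strip leading and trailing whitespace from every string value."""
    return {
        key: value.strip() if isinstance(value, str) else value
        for key, value in params.items()
    }


# Hypothetical form submission with stray spaces around two fields.
submitted = {"client_id": "abc123 ", "issuer": " https://idp.example.com", "enabled": True}
cleaned = trim_params(submitted)
```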

related: #9579
2025-06-20 18:44:33 +00:00
Jamil
fc3a9d17b9 fix(portal): broadcast before possible query errors out (#9601)
When handling some side effects, a query may fail for whatever reason.
We don't want such failures to prevent the side effects from being
handled.

Related:
https://firezone-inc.sentry.io/issues/6346235615/events/d30d222f8a3e436d8058a54c0b2a508c/?project=4508756715569152&query=is%3Aunresolved&referrer=previous-event&stream_index=3
2025-06-20 17:03:42 +00:00
Jamil
e5a0bdc3b1 fix(portal): ensure sentry reports conditional migrations (#9582)
Sentry isn't started when this runs, so start it and manually capture a
message to ensure we're reminded about pending conditional migrations.

Verified that this works with the Release script.
2025-06-19 17:28:38 +00:00
Jamil
2d6e478a44 fix(portal): check conditional migrations with repo started (#9577)
In #9562, we introduced a bug where the pending conditional migrations
check was run without the repo being started. Wrapping it with
`with_repo` fixes that.
2025-06-18 22:40:24 +00:00
Jamil
236c21111a refactor(portal): don't rely on db to gate metric reporting (#9565)
This table was added to try and gate the request rate to Google's
Metrics API.

However, this was a flawed endeavor as we later discovered that the time
series points need to be spaced apart at least 5s, not the API requests
themselves.

This PR gets rid of the table and therefore the problematic DB query
which is timing out quite often due to the contention involved in 12
elixir nodes trying to grab a lock every 5s.

Related: #9539 
Related:
https://firezone-inc.sentry.io/issues/6346235615/?project=4508756715569152&query=is%3Aunresolved&referrer=issue-stream&stream_index=1
2025-06-18 18:40:33 +00:00
Jamil
a20989a819 feat(portal): conditional migrations on prod (#9562)
Some migrations take a long time to run because they require locks or
modify large amounts of data. To prevent this from causing issues during
deploy, we leverage Ecto's native support for loading migrations from
multiple directories to introduce a `conditional_migrations/` directory
that houses any conditional migrations we want to run.

To run these migrations, you'll need to do one of the following:

- `dev, test`: `mix ecto.migrate` will run them by default because
we have aliased it to load conditional_migrations for dev
- `prod`: Set the `RUN_CONDITIONAL_MIGRATIONS` env var to `true` before
starting a prod server using the `bin/migrate` script.
- `dev, test, prod`: Run `Domain.Release.migrate(conditional: true)`
from an IEx shell.

If conditional migrations were found that weren't executed during
`Domain.Release.migrate`, a warning is logged to remind us to run them.
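
The decision described above can be sketched in Python. This is not Ecto's migrator or the `Domain.Release` code; `plan_migrations` and its integer version numbers are hypothetical, and `run_conditional` stands in for `RUN_CONDITIONAL_MIGRATIONS` or `conditional: true`:

```python
def plan_migrations(regular, conditional, applied, run_conditional=False):
    """Return (versions_to_run, warning_or_None) for one migrate call."""
    pending_regular = sorted(set(regular) - set(applied))
    pending_conditional = sorted(set(conditional) - set(applied))

    if run_conditional:
        # Both directories are loaded, so everything pending runs in order.
        return sorted(pending_regular + pending_conditional), None

    warning = None
    if pending_conditional:
        warning = f"{len(pending_conditional)} pending conditional migration(s): {pending_conditional}"
    return pending_regular, warning


to_run, warning = plan_migrations(regular=[1, 2, 3], conditional=[4], applied=[1, 2])
```

Here `to_run` contains only the regular migration, and `warning` carries the reminder that version 4 has not been run.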

---------

Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
2025-06-18 18:08:25 +00:00
Jamil
38471738aa fix(portal): fix problem_nodes removal (#9561)
The shape of this from libcluster is `[:"NODE_NAME": connected_bool?]`
so we need to extract the first element of each item before using this
var.

This is just for logging and doesn't affect how we actually connect to
nodes.
2025-06-17 16:53:42 +00:00
Brian Manifold
e5914af50f fix(portal): Add more logging around OIDC setup (#9555)
Why:

* Adding some simple logging around OIDC calls to help with better
debugging.
* Removing the `opentelemetry_liveview` package as it has been pulled in
to the `opentelemetry_phoenix` package that we are already using.
2025-06-17 16:52:33 +00:00
Brian Manifold
25434c6898 fix(portal): update non-root layout to use main.css (#9533)
After updating the CSS config to use `main.css` in the portal the root
layout was updated, but there were a small number of one-off templates
that do not use the root layout and those pages were not updated with
the new `main.css` file. This commit updates those non-root templates.

Fixes #9532
2025-06-15 15:31:45 +00:00
Jamil
c6545fe853 refactor(portal): consolidate pubsub functions (#9529)
We issue broadcasts and subscribes in many places throughout the portal.
To help keep the cognitive overhead low, this PR consolidates all PubSub
functionality to the `Domain.PubSub` module.

This allows for:

- better maintainability
- see all of the topics we use at a glance
- consolidate repeated functionality (saved for a future PR)
- use the module hierarchy to define function names, which feels more
intuitive when reading and sets a convention

We also introduce a `Domain.Events.Hooks` behavior to ensure all hooks
comply with this simple contract, and we also introduce a convention to
standardize on topic names using the module hierarchy defined herein.

Lastly, we add convenience functions to the Presence modules to save a
bit of duplication and chance for errors.

This will make it much easier to maintain PubSub going forward.


Related: #9501
2025-06-15 04:30:57 +00:00
Jamil
62c3dd9370 fix(portal): don't add service accounts to everyone group (#9530)
In #9513 a bug was introduced that added all service accounts to the
Everyone group. This fixes that by ensuring the `insert_all` query only
cross joins where actor type is `:account_user, :account_admin_user`.

Staging data will be manually fixed after this goes in.

I briefly considered updating the delete clause of this query to "clean
things up" by removing any found service accounts but that is a bit too
defensive in my opinion - if there's no way a service account should
make it into this group, then we shouldn't have code to expect it. This
will all be going away in #8750 which should be much less brittle.
2025-06-14 15:20:32 +00:00
Jamil
cbe33cd108 refactor(portal): move policy events to WAL (#9521)
Moves all of the policy lifecycle events to be broadcasted from the WAL
consumer.

#### Test

- [x] Enable policy
- [x] Disable policy
- [x] Delete policy
- [x] Non-breaking change
- [x] Breaking change


Related: #6294

---------

Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
2025-06-14 01:10:09 +00:00
Jamil
817eeff19f refactor(portal): simplify managed groups (#9513)
In many places throughout the portal codebase, we called a function
"update_dynamic_group_memberships/1" which recomputed all of the
dynamic/managed memberships for a particular account, and reapplied them
to each affected group.

Since the `has_many :memberships` relationship used `on_replace:
:delete`, this caused Ecto to delete _all_ the `Everyone` group
memberships, and reinsert them on each sync.

Since each membership change triggers a policy re-evaluation for all
policies to the affected actor
(`Policies.broadcast_access_events_for/3`), this in effect was causing a
massive amount of queries to be triggered upon each sync job as each
membership deletion and insertion triggered a lookup for all resources
available to that particular actor.

To fix this, we introduce the following changes:

- Remove `dynamic` group type. This will never be used as it will create
an immense amount of complexity for any organization trying to manage
groups this way
- Refactor `update_dynamic_group_memberships/1` to use a smarter query
that first gathers all the _needed_ changes and applies them within a
transaction using Ecto.Multi. Previously all memberships would be rolled
over unconditionally due to the `on_replace: :delete` option on the
relationship. Note that the option is still there, but we generally
don't set memberships on groups any longer unless editing the affected
group directly, where the everyone group doesn't apply.
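
The "gather only the needed changes" idea can be sketched in Python with set differences. The real change uses an Ecto query and `Ecto.Multi`; `membership_changes` and the `(actor_id, group_id)` tuples below are hypothetical:

```python
def membership_changes(current, desired):
    """Compute the minimal inserts and deletes to go from `current` to `desired`.

    Both arguments are iterables of (actor_id, group_id) pairs.
    """
    current, desired = set(current), set(desired)
    return {
        "insert": sorted(desired - current),
        "delete": sorted(current - desired),
    }


current = [("actor-1", "everyone"), ("actor-2", "everyone")]
desired = [("actor-1", "everyone"), ("actor-3", "everyone")]
changes = membership_changes(current, desired)
```

Only `actor-2` is removed and `actor-3` added; the unchanged membership for `actor-1` produces no row change and therefore no policy re-evaluation.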

Resolves: #8407 
Resolves: #8408
Related: #6294

---------

Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-06-13 18:55:37 +00:00
Jamil
c31f51d138 refactor(portal): move resource events to WAL (#9406)
We move the resource events to the WAL system. Notably, we no longer
need `fetch_and_update_breakable` for resource updates, so a bit of
refactoring is included to update the call sites for those.

Additionally, we need to add a `Flow.expire_flows_for_resource_id/1`
function to expire flows from the WAL system. This is now being called
in the WAL event handler. To prevent this from blocking the WAL
consumer/broadcaster, we wrap it with a Task.async. These will be
cleaned up when the lookup table for access is implemented next.

Another thing to note is that we lose the `subject` when moving from
`Flows.expire_flows_for(%Resource{}, subject)` to
`Flows.expire_flows_for_resource_id(resource_id)` when a resource is
deleted or updated by an actor since we respond to this event in the WAL
where that data isn't available. However, we don't actually _use_ the
subject when expiring flows (other than authorize the initial resource
update), so this isn't an issue.

Related: #9501

---------

Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Brian Manifold <bmanifold@users.noreply.github.com>
2025-06-11 00:12:45 +00:00
Brian Manifold
d4c7b48754 refactor(portal): update asset config in portal (#9504)
Why:

* This commit brings our web app in line with how new Phoenix
applications manage and configure js/css/font assets. Along with that
this commit updates our Tailwind and esbuild tools.
2025-06-10 23:00:44 +00:00
Jamil
f58176a447 chore: remove docs writer (#9494)
This was added in an earlier era and will be just too cumbersome to
maintain going forward. We have OpenAPI docs which are more flexible.
2025-06-10 02:51:46 +00:00
Brian Manifold
6d425d5677 refactor(portal): add retry logic to Stripe API client (#9466)
Why:

* We've seen some Stripe API requests come back with 429 responses,
which likely could be retried and succeed. This commit adds some basic
retry logic to our Stripe API client.
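
A minimal Python sketch of retrying on 429, assuming exponential backoff and three attempts (neither is stated in this commit). `request_with_retry` and the `send` callable are hypothetical and do not reflect the actual Stripe client:

```python
import time


def request_with_retry(send, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call `send()` until it returns a non-429 status or attempts run out.

    `send` returns a (status, body) tuple.
    """
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        if status != 429 or attempt == max_attempts:
            return status, body
        # Back off before the next attempt: 0.5s, 1s, 2s, ...
        sleep(base_delay * 2 ** (attempt - 1))


# Usage with a fake transport that is rate limited twice, then succeeds.
responses = iter([(429, "slow down"), (429, "slow down"), (200, "ok")])
delays = []
result = request_with_retry(lambda: next(responses), sleep=delays.append)
```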
2025-06-09 23:11:33 +00:00
Jamil
38c1de351c refactor(portal): move membership events to WAL (#9388)
Membership events are quite simple to move to the WAL:

- Only one topic is used to determine which client(s) receive updates
for which Actor(s).
- The unsubscribe was removed because it was unused.
- Notably, the N+1 query problem regarding re-evaluating all access
again after each membership is updated is still present. This will be
fixed using a lookup table in the client channel in the last PR to move
events to the WAL.

Related: https://github.com/firezone/firezone/issues/6294
Related: https://github.com/firezone/firezone/issues/8187
2025-06-06 06:23:33 +00:00
Jamil
00a761ba22 feat(portal): add replication config (#9395) (#9404)
> This PR adds two configuration keys for the replication connection.

> Example use case:
> If you run two firezone control planes on the same db cluster (like me
😂 ), you'll need to have two different replication slot names

Related: #9395

Co-authored-by: Antoine <antoinelabarussias@gmail.com>
2025-06-04 22:15:28 +00:00
Jamil
443e6d2891 fix(portal): preserve device name (#9393) (#9401)
When upserting a client, the device name would overwrite any name
changes performed by the admin. To prevent this, we don't allow changing
a device's name on upsert conflicts, only on initial insert and updates.
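
The upsert behavior can be illustrated with Python's built-in `sqlite3`. The portal uses Ecto on Postgres; the `clients` table, its columns, and the values below are assumptions made for this sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE clients (external_id TEXT PRIMARY KEY, name TEXT, last_seen_version TEXT)"
)

# On conflict, `name` is deliberately left out of the SET clause.
UPSERT = """
INSERT INTO clients (external_id, name, last_seen_version)
VALUES (?, ?, ?)
ON CONFLICT (external_id) DO UPDATE SET
    last_seen_version = excluded.last_seen_version
"""

conn.execute(UPSERT, ("device-1", "name-from-device", "1.0.0"))  # initial insert sets the name
conn.execute(
    "UPDATE clients SET name = ? WHERE external_id = ?", ("Admin-chosen name", "device-1")
)
conn.execute(UPSERT, ("device-1", "name-from-device", "1.1.0"))  # reconnect: conflict path

row = conn.execute("SELECT name, last_seen_version FROM clients").fetchone()
```

After the second upsert the version is refreshed while the admin's rename survives.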

Fixes #8536

Co-authored-by: Antoine <antoinelabarussias@gmail.com>
2025-06-04 19:41:48 +00:00
Jamil
dc0867c3ed fix(portal): prevent resource addresses like **test.com (#9384)
These aren't allowed on the clients, causing an error.

Fixes: #9054

---------

Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
2025-06-03 19:40:02 +00:00
dependabot[bot]
665d11b29a build(deps): bump @fontsource/source-sans-3 from 5.2.7 to 5.2.8 in /elixir/apps/web/assets (#9326)
Bumps
[@fontsource/source-sans-3](https://github.com/fontsource/font-files/tree/HEAD/fonts/google/source-sans-3)
from 5.2.7 to 5.2.8.

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-06-03 07:26:44 +00:00
dependabot[bot]
5e9a8e06dd build(deps): bump opentelemetry_logger_metadata from 0.1.0 to 0.2.0 in /elixir (#9338)
Bumps
[opentelemetry_logger_metadata](https://github.com/salemove/opentelemetry_logger_metadata)
from 0.1.0 to 0.2.0.

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-06-03 07:25:24 +00:00
Jamil
0f0f34cd40 fix(portal): sort before asserting on list equality (#9377)
Fixes minor flakiness introduced in #9373
2025-06-03 06:58:31 +00:00
Jamil
9c3f6e7b36 refactor(portal): don't send ip_stack for non-DNS resources (#9376)
We always return the `ip_stack` field when rendering a resource for both
WebSocket and REST APIs. If the resource's type is not `:dns` then this
will be `nil`.

Related:
https://github.com/firezone/firezone/pull/9303#discussion_r2119681062
2025-06-02 23:16:49 -07:00
Brian Manifold
60c90c5c9a fix(portal): Update group sync to ignore soft deleted groups (#9373)
Why:

* When a directory sync occurs, all groups in the DB need to be pulled
in case a synced group needs to be resurrected. Prior to adding the
directory sync deletion circuit breaker the app would "re-delete"
already deleted groups. This was basically a no-op. However, once the
deletion circuit breaker was put in, re-deleting already deleted groups
could throw off the circuit breaker and cause it to fail a directory
sync unnecessarily, because it made it
seem as though too many groups were being deleted. This commit makes
sure we don't add already deleted groups to the list of groups needing
to be deleted for a given sync.
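
A small Python sketch of the filtering described above. The dictionary shape, `deleted_at` field, and `groups_to_delete` function are hypothetical; the portal's sync code is Elixir:

```python
def groups_to_delete(db_groups, synced_ids):
    """Return ids of groups that are absent from the sync and not already deleted."""
    return [
        group["id"]
        for group in db_groups
        if group["id"] not in synced_ids and group["deleted_at"] is None
    ]


db_groups = [
    {"id": "g1", "deleted_at": None},
    {"id": "g2", "deleted_at": "2025-05-01T00:00:00Z"},  # soft-deleted earlier
    {"id": "g3", "deleted_at": None},
]
to_delete = groups_to_delete(db_groups, synced_ids={"g1"})
```

Without the `deleted_at` check, `g2` would be counted as a second deletion and inflate the number the circuit breaker sees.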

Fixes: #9364
2025-06-03 04:29:56 +00:00
Jamil
440eee3086 fix(portal): treat missing organizationUnits as empty list (#9371)
When a Google account has no organization units defined in its
directory, the Google API can return a `200` response without the
`organizationUnits` key. In such cases, we should treat this as an empty
list such that the remainder of the sync will continue.
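
In Python terms the fix amounts to a default on the key lookup. `organization_units` is a hypothetical helper and the response bodies are made-up examples, not captured API output:

```python
def organization_units(response_body: dict) -> list:
    """Return the organization units, or an empty list when the key is absent."""
    return response_body.get("organizationUnits", [])


with_units = organization_units({"organizationUnits": [{"name": "Engineering"}]})
without_units = organization_units({"kind": "hypothetical#orgUnits"})
```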
2025-06-03 03:01:12 +00:00
Jamil
6fc7d2e4e0 feat(portal): configurable ip stack for DNS resources (#9303)
Some poorly-behaved applications (e.g. mongo) will fail to connect if
they see both IPv4 and IPv6 addresses for a DNS resource, because they
will try to connect to both of them and fail the whole connection setup
if either one is not routable.

To fix this, we need to introduce a knob to allow admins to restrict DNS
resources to only A or AAAA records.
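
The effect of such a knob can be sketched in Python with the standard `ipaddress` module. The setting values `ipv4_only`, `ipv6_only`, and `dual` are assumptions for illustration, as is the `filter_by_ip_stack` helper:

```python
import ipaddress

# Hypothetical setting values mapped to the IP versions they allow.
ALLOWED_VERSIONS = {"ipv4_only": {4}, "ipv6_only": {6}, "dual": {4, 6}}


def filter_by_ip_stack(addresses, ip_stack):
    """Keep only the resolved addresses permitted by the resource's IP stack."""
    allowed = ALLOWED_VERSIONS[ip_stack]
    return [a for a in addresses if ipaddress.ip_address(a).version in allowed]


resolved = ["10.0.0.5", "fd00::5"]
a_only = filter_by_ip_stack(resolved, "ipv4_only")
```

An application that fails when either address family is unroutable would then only ever see the A record.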


<img width="750" alt="Screenshot 2025-06-02 at 10 48 39 AM"
src="https://github.com/user-attachments/assets/4dbcb6ae-685f-43ee-b9e8-1502b365a294"
/>

<img width="1174" alt="Screenshot 2025-06-02 at 11 05 53 AM"
src="https://github.com/user-attachments/assets/02d0a4b3-e6e8-4b6d-89fa-d3d999b5811e"
/>

---

Related:
https://firezonehq.slack.com/archives/C08KPQKJZKM/p1746720923535349
Related: #9300
Fixes: #9042
2025-06-03 02:24:41 +00:00
Jamil
6d4d3a34a0 fix(portal): Uniq nodes before counting (#9352)
While we shouldn't have any duplicates in this list, it would be a good
idea to unique the node names before counting just to be sure.
2025-06-01 18:00:30 -07:00
Jamil
37ae1a4e92 fix(portal): fix false-positive cluster errors (#9351)
Fixes the following issues after learning they're still a problem:

- We need to include our own node when checking for connected node count
- Need to match against the `formatted` key inside message when
filtering Sentry events
2025-06-01 17:56:19 -07:00
Jamil
8bbc7e2960 fix(portal): fix threshold calc for connected nodes (#9350)
In #9342 we started logging only if our connected nodes fell below the
threshold. However, on error, we failed to calculate the new list.

On startup, the first few `loads` will be failures, and the connected
list will remain empty, causing this to report a false positive.
2025-06-01 16:38:07 -07:00
Jamil
f65fcffbfc fix(portal): fix sentry before_send when message is nil (#9349)
A regression was introduced in #9242 where it appears that some Sentry
events don't contain messages, so the filtering module is updated to
act only on events with messages.
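
A hedged Python sketch of a filter that tolerates a missing message. The event shape and the ignored substring are assumptions; returning `None` stands in for dropping the event, and the portal's actual filter is an Elixir module:

```python
IGNORED_SUBSTRINGS = ("hypothetical noisy message",)  # assumption, not from the source


def before_send(event: dict):
    """Return the event to keep it, or None to drop it."""
    message = event.get("message")
    if message is None:
        # Some events carry no message at all; pass them through untouched.
        return event
    formatted = message.get("formatted") or ""
    if any(fragment in formatted for fragment in IGNORED_SUBSTRINGS):
        return None
    return event
```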
2025-06-01 12:32:08 -07:00
Brian Manifold
870aee3812 refactor(portal): add migration to remove created_by_ columns (#9318)
Why:

* This commit contains only a migration to remove the
created_by_identity and created_by_actor columns on multiple tables.
This migration will be run manually due to the long running sync jobs
that are currently in the system. This migration should be a no-op after
the manual DB updates.
2025-06-01 12:08:15 -07:00
Jamil
cc6c57125d revert: "fix(portal): Silence cluster challenge reply errors" (#9345)
This was actually an issue due to accidentally deleting the
`RELEASE_COOKIE` var.

Reverts firezone/firezone#9344
2025-06-01 10:49:49 -07:00
Jamil
42bccfd5e5 fix(portal): Silence cluster challenge reply errors (#9344)
Issues with node connections will be reported by the threshold logger.
2025-06-01 17:29:07 +00:00
Jamil
73c3e2d87b refactor(portal): move gateway events to WAL (#9299)
This PR moves Gateway events to be triggered by the WAL broadcaster.
Some things of note that are cleaned up:

- The gateway `:update` event was never received anywhere (but in a
test) and so has been removed
- The account topic has been removed as it was also never acted upon
anywhere. Presence yes, but topic no
- The group topic has also been removed as it was only used to receive
broadcasted disconnects when a group is deleted, but this was already
handled by the token deletion and so is redundant.
2025-06-01 16:40:28 +00:00
Jamil
bfca4e8411 fix(portal): Use threshold-based logging for cluster errors (#9342)
We periodically fetch a list of all `RUNNING` VMs in GCP and then try to
connect to them for clustering. However, during deploys, it's expected
that we won't be able to connect to new VMs until they are fully up. The
fetch doesn't take health checks into account, so we need a
threshold-based error logging.

To address this, we do the following:

- We only log an error when failing to connect to nodes if we are
currently below the threshold for each of the `api`, `domain`, and `web`
node counts
- We silence node timeout errors, as these will happen during deploys
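
One way to read the threshold rule, sketched in Python. Several things are assumptions here: node names of the form `role@host`, the per-role minimums, and treating "below the threshold" as "any role below its minimum". The function names are hypothetical:

```python
def roles_below_threshold(connected_nodes, thresholds):
    """Return the roles whose connected node count is under its minimum."""
    counts = {role: 0 for role in thresholds}
    for node in set(connected_nodes):  # count each node name once
        role = node.split("@", 1)[0]
        if role in counts:
            counts[role] += 1
    return [role for role, minimum in thresholds.items() if counts[role] < minimum]


def should_log_connect_error(connected_nodes, thresholds):
    """Only log a connection failure when some role is short of nodes."""
    return bool(roles_below_threshold(connected_nodes, thresholds))


thresholds = {"api": 1, "domain": 1, "web": 1}
healthy = should_log_connect_error(["api@a", "domain@b", "web@c"], thresholds)
degraded = should_log_connect_error(["api@a", "api@a", "web@c"], thresholds)
```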
2025-06-01 15:53:38 +00:00
Jamil
544b6455eb fix(portal): ensure cluster state heals (#9319)
We use `libcluster`, a common Elixir library, for node discovery. It's a
very lightweight wrapper around Erlang's standard `Node.connect`
functionality.

It supports custom cluster formation strategies, and we've implemented
one based on fetching the list of nodes from the GCP API, and then
attempting to connect to them.

Unfortunately, our implementation had two bugs that prevented the
cluster from healing in the following two cases:

- If we successfully connect to nodes, we tracked an internal state var
as having successfully connected to them, forever. If we lost the
connection to these nodes (such as during a deploy where the elixir
nodes don't come up in time, causing the instance group manager to reap
them), then the state would never be updated, and we would never
reconnect to the lost nodes.
- If we failed to fetch the list of nodes more than 10 times (every 10
seconds, so 100 seconds), then we would fail to schedule the timer to
load the nodes again.

The first issue is fixed by removing our kept state altogether - this is
what libcluster is for. We can simply try to connect to the most recent
list of nodes returned from Google's API, and we now log a warning for
any nodes that don't connect.

The second issue is fixed by always scheduling the timer, forever,
regardless of the state of the Google API.
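
Both fixes can be sketched as a single stateless tick in Python. This is not the libcluster strategy itself; `load_nodes_tick` and its four callables are hypothetical stand-ins:

```python
def load_nodes_tick(fetch_nodes, connect, schedule_next, warn):
    """Run one discovery pass, then always schedule the next one."""
    try:
        try:
            nodes = fetch_nodes()
        except Exception as error:
            warn(f"could not fetch nodes: {error}")
            return
        # No remembered state: try every node in the latest list each time.
        for node in nodes:
            if not connect(node):
                warn(f"could not connect to {node}")
    finally:
        # Runs on success, on fetch failure, and on early return alike.
        schedule_next()
```

Because nothing is cached between ticks, a node that drops out and comes back is simply retried on the next pass.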

Fixes #8660 
Fixes #8698
2025-05-31 05:01:52 +00:00
Jamil
23bae8f878 fix(portal): Use account param for autoredirect (#9304)
When the client is connecting for the first time without any cookies
loaded the `conn.assigns.account` is non-existent, causing a `KeyError`.

Instead, we should be loading this param from the URL and fetching the
account from it.
2025-05-30 21:23:25 +00:00
Brian Manifold
a51b35a6b4 refactor(portal): remove created_by_<identity/actor> columns (#9306)
Why:

* Now that we have started using the `created_by_subject` field on
various tables, we no longer need to keep the
`created_by_<identity/actor>` fields. This will help remove a foreign
key reference and will be one step closer to allowing us to hard delete
data rather than soft deleting all data in order to keep foreign key
references like these.
2025-05-30 21:06:35 +00:00
Jamil
5fb36cf327 fix(portal): Fix sign out acceptance test (#9302)
In #9294, we moved the token deletion side effect to the WAL consumer,
which is not executed for standard tests. As such, we need to call this
callback manually in the sign out acceptance test.

Fixes
https://github.com/firezone/firezone/actions/runs/15337858094/job/43158416750?pr=9295

---------

Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-05-30 07:16:55 +00:00
Jamil
e09c7b42b0 refactor(portal): Move token events to WAL broadcaster (#9294)
Moves the broadcasting of `disconnect` messages caused by token
soft-deletions to the WAL broadcaster.

Notably, many tests had to be cleaned up because they were specifically
testing this side effect. Instead, these tests now verify (1) that the
token is deleted; (2) the token deletion handler is then tested
separately to ensure the message is broadcast.
2025-05-29 17:46:57 +00:00
Jamil
6cea0cd6ec refactor(portal): Move client updates to WAL broadcaster (#9288)
Client updates are next on the path to moving more side effects to the
WAL broadcaster. This one has the following notable changes:

- ~~The `actor_clients` pubsub topic was only used to broadcast removal
of clients belonging to an actor; these are no longer needed since we
handle this in the individual removal event~~ EDIT: only the presence is
kept
- The `account_clients:{account_id}` pubsub and presence topic
definition has been moved to `Events.Hooks.Accounts` because these are
broadcasted using the account_id field based on account changes, and
have nothing to do with the client lifecycle


Related: #6294 
Related: #8187
2025-05-29 16:56:08 +00:00