Why:
* During the account sign-up flow, the email of the first admin was not
being populated in the `email` column of the `auth_identities` table. This
was due to atoms being passed in the attrs instead of strings to the
`create_identity` function. A migration was also created to backfill the
missing emails in the `auth_identities` table.
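A rough sketch of what such a backfill might look like, assuming the missing emails can be recovered from `provider_identifier` when it is an email address (the real migration may source them differently):

```elixir
defmodule Domain.Repo.Migrations.BackfillAuthIdentityEmails do
  use Ecto.Migration

  def up do
    # Hypothetical backfill: copy provider_identifier into email where it
    # looks like an email address and email is still missing.
    execute("""
    UPDATE auth_identities
    SET email = provider_identifier
    WHERE email IS NULL
      AND provider_identifier LIKE '%@%'
    """)
  end

  def down do
    :ok
  end
end
```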
Why:
* It was pointed out that, given the way PostgreSQL handles compound
indexes, there is no need for an individual index on the first column of
a compound index. This commit removes the redundant index on `actor_id`
for the `actor_group_membership` table.
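A minimal sketch of that migration; the table and index names follow Ecto's defaults and are assumptions:

```elixir
defmodule Domain.Repo.Migrations.DropRedundantActorIdIndex do
  use Ecto.Migration

  def change do
    # The compound index whose first column is actor_id already serves
    # lookups on actor_id alone, so the standalone index is redundant.
    drop_if_exists index(:actor_group_memberships, [:actor_id])
  end
end
```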
Why:
* As we move towards hard-deleting data, one issue we've run into is with
cascading deletes on the `actor_group_memberships` table. To solve this,
indexes have been created on the `actor_id` and `group_id` columns of
`actor_group_memberships`.
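A sketch of the corresponding migration (index names are Ecto's defaults and are an assumption):

```elixir
defmodule Domain.Repo.Migrations.AddActorGroupMembershipIndexes do
  use Ecto.Migration

  def change do
    # Cascading deletes look up memberships by either side of the
    # association, so index both foreign keys.
    create_if_not_exists index(:actor_group_memberships, [:actor_id])
    create_if_not_exists index(:actor_group_memberships, [:group_id])
  end
end
```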
These are expected during deploys, so don't log them as errors. If the
Supervisor fails to start us after exhausting all attempts, it will log
an error.
The `background_jobs_enabled` config is an ENV var that needs to be set
for a specific configuration key. It's not set on the top-level
`:domain` config by default.
Instead, it's used to enable/disable specific modules started by the
application's Supervisor.
The `Domain.Events.ReplicationConnection` module is updated in this PR
to follow this convention.
The `web` and `api` applications use `domain` as a dependency in their
`mix.exs` files. This means that, by default, their Supervisors will start
the Domain supervision tree as well.
The author did not realize this at the time of implementation, so we now
leverage the convention already in place for restricting tasks to `domain`
nodes: the `background_jobs_enabled` application configuration
parameter.
We also add an info log when the replication slot is being started so we
can verify the node it's starting on.
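A minimal sketch of how that gating might look in the Domain application's supervisor; the module names and config key shape are assumptions, not the actual implementation:

```elixir
# in Domain.Application (hypothetical shape)
def start(_type, _args) do
  replication_enabled? =
    :domain
    |> Application.fetch_env!(Domain.Events.ReplicationConnection)
    |> Keyword.get(:background_jobs_enabled, false)

  children =
    [
      Domain.Repo
      # ... other children ...
    ] ++
      if replication_enabled? do
        [Domain.Events.ReplicationConnection]
      else
        []
      end

  Supervisor.start_link(children, strategy: :one_for_one, name: Domain.Supervisor)
end
```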
When deploying, the cluster state diverges temporarily, which allows
more than one `ReplicationConnection` process to start on the new nodes.
(One of) the old nodes still has an active slot, and we get an "object
in use" error `(Postgrex.Error) ERROR 55006 (object_in_use) replication
slot "events_slot" is active for PID 603037`.
Rather than use ReplicationConnection's restart behavior (which logs
tons of errors with Logger.error), we can use the Supervisor here
instead and keep trying to start the ReplicationConnection until it
succeeds.
Note that if the process name is registered (globally) and running,
ReplicationConnection.start_link/1 simply returns `{:ok, pid}` instead
of erroring out with `:already_running`, so eventually one of the nodes
will succeed and the remaining ones will return the globally-registered
pid.
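A rough sketch of the shape this takes; the names and options here are assumptions rather than the actual code:

```elixir
# Supervisor child spec: a permanent restart means the Supervisor retries
# whenever start_link fails because the slot is still held elsewhere.
children = [
  # ...
  Supervisor.child_spec({Domain.Events.ReplicationConnection, []}, restart: :permanent)
]

# Inside the ReplicationConnection module: register globally so that once a
# node holds the slot, start_link on the other nodes resolves to its pid.
def start_link(opts) do
  Postgrex.ReplicationConnection.start_link(__MODULE__, opts,
    name: {:global, __MODULE__},
    # fail fast so the Supervisor's restart logic drives the retries
    auto_reconnect: false,
    sync_connect: true
  )
end
```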
We need the `replication` attribute set on the db user. This is
trivially done in a migration, and with the `CURRENT_USER` specifier, we
don't need to fetch the Application configuration.
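A minimal sketch of such a migration (the module name is an assumption):

```elixir
defmodule Domain.Repo.Migrations.GrantReplicationToCurrentUser do
  use Ecto.Migration

  def up do
    # CURRENT_USER resolves to whichever role runs the migration, so there's
    # no need to look up the configured database user.
    execute("ALTER ROLE CURRENT_USER WITH REPLICATION")
  end

  def down do
    execute("ALTER ROLE CURRENT_USER WITH NOREPLICATION")
  end
end
```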
The LoggerJSON Redactor only redacts top-level keys, so we need to
redact the entire `connection_opts` param to redact its contained
password.
We also don't need to carry `connection_opts` in the ReplicationConnection
process state; they are only needed for the initial connection, so we
refactor them out of the `state`.
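A sketch of that refactor, with option and field names as assumptions:

```elixir
def start_link(opts) do
  {connection_opts, init_arg} = Keyword.pop(opts, :connection_opts, [])

  # connection_opts are handed to Postgrex for the initial connection only
  Postgrex.ReplicationConnection.start_link(__MODULE__, init_arg, connection_opts)
end

@impl true
def init(init_arg) do
  # nothing sensitive is kept in the state, so crash logs stay clean
  {:ok, %{step: :disconnected, config: init_arg}}
end
```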
Firezone's control plane is a realtime, distributed system that relies
on a broadcast/subscribe system to function. In many cases, these events
are broadcasted whenever relevant data in the DB changes, such as an
actor losing access to a policy, a membership being deleted, and so
forth.
Today, this is handled in the application layer, typically happening at
the place where the relevant DB call is made (i.e. in an
`after_commit`). While this approach has worked thus far, it has several
issues:
1. We have no guarantee that the DB change will issue a broadcast. If
the application is deployed or the process crashes after the DB changes
are made but before the broadcast happens, we will have potentially
failed to update any connected clients or gateways with the changes.
2. We have no guarantee that the order of DB updates will be preserved
by the broadcasts. In other words, app server A could win its DB
operation against app server B, but then lose the race to broadcast
first.
3. If the cluster is in a bad state where broadcasts may return an error
(i.e. https://github.com/firezone/firezone/issues/8660), we will never
retry the broadcast.
To fix the above issues, we introduce a WAL logical decoder that processes
the event stream one message at a time and performs any needed work.
Serializability is guaranteed since we only process the WAL in a single,
cluster-global process, `ReplicationConnection`. Durability is also
guaranteed since we only ACK WAL segments after we've successfully
ingested the event.
This means we will only advance the position of our WAL stream after
successfully broadcasting the event.
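A rough sketch of how that acknowledgement flow looks in a `Postgrex.ReplicationConnection` callback module, loosely following the Postgrex docs; `handle_event/1` stands in for the decode-and-broadcast step and is hypothetical:

```elixir
@impl true
def handle_data(<<?w, _wal_start::64, wal_end::64, _clock::64, event::binary>>, state) do
  # Decode and broadcast first; only then acknowledge the WAL position, so a
  # crash here means the segment is redelivered instead of being lost.
  :ok = handle_event(event)
  {:noreply, ack(wal_end + 1), state}
end

def handle_data(<<?k, wal_end::64, _clock::64, reply>>, state) do
  # Keepalive from the server; reply with a standby status update if asked.
  messages = if reply == 1, do: ack(wal_end + 1), else: []
  {:noreply, messages, state}
end

defp ack(lsn), do: [<<?r, lsn::64, lsn::64, lsn::64, current_time()::64, 0>>]

@epoch DateTime.to_unix(~U[2000-01-01 00:00:00Z], :microsecond)
defp current_time, do: System.os_time(:microsecond) - @epoch
```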
This PR only introduces the WAL stream processing system but does not
introduce any changes to our current broadcasting behavior - that's
saved for another PR.
Why:
* The copy-to-clipboard button was not working at all on the API new
token page because the FlowbiteJS library expects the elements to be
present in the DOM on first render, which was not true of the API token
code block. In addition, the existing code blocks' copy-to-clipboard
buttons gave no visual indication that the copy had completed, and the
buttons themselves were somewhat hard to see. This commit makes the
buttons more visible, adds a phx-hook to ensure the FlowbiteJS init
functions run on every code block even if it's inserted after the
initial page load, and adds callback functions that toggle the button
text and icon to show the text has been copied.
API clients don't belong to any actor_groups and attempting to deep link
into the `groups` section when viewing an actor raises a 500 error.
This PR fixes that by removing the deep link into `actor_groups` from
the actors index view.
Prevents more than one sync-enabled adapter per account in order to
prepare for eventually adding a unique constraint on
`provider_identifier` for identities and groups per account.
Related: #6294
---------
Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Brian Manifold <bmanifold@users.noreply.github.com>
This was left behind in a large refactor as part of #3642 and was never
cleaned up.
I verified on prod this table in fact has no meaningful data in it and
has not changed since that PR was merged.
Our current bespoke job system, while it's worked out well so far, has
the following shortcomings:
- No retry logic
- No robust way to guarantee job isolation / uniqueness without resorting
to row-level locking
- No support for cron-based scheduling
This PR adds the boilerplate required to get started with
[Oban](https://hexdocs.pm/oban/Oban.html), the job management system for
Elixir.
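A sketch of the kind of boilerplate involved; queue names and plugin choices here are assumptions:

```elixir
# config/config.exs
config :domain, Oban,
  repo: Domain.Repo,
  queues: [default: 10],
  plugins: [
    Oban.Plugins.Pruner,
    {Oban.Plugins.Cron, crontab: []}
  ]

# in the application's supervision tree
children = [
  Domain.Repo,
  {Oban, Application.fetch_env!(:domain, Oban)}
]
```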
There was a slight API change in the way LoggerJSON's configuration is
generated, so I took the time to do a little fixing and cleanup here.
Specifically, we should be using the `new/1` callback to create the
Logger config, which fixes the below exception caused by missing config
keys:
```
FORMATTER CRASH: {report,[{formatter_crashed,'Elixir.LoggerJSON.Formatters.GoogleCloud'},{config,[{metadata,{all_except,[socket,conn]}},{redactors,[{'Elixir.LoggerJSON.Redactors.RedactKeys',[<<"password">>,<<"secret">>,<<"nonce">>,<<"fragment">>,<<"state">>,<<"token">>,<<"public_key">>,<<"private_key">>,<<"preshared_key">>,<<"session">>,<<"sessions">>]}]}]},{log_event,#{meta => #{line => 15,pid => <0.308.0>,time => 1744145139650804,file => "lib/logger.ex",gl => <0.281.0>,domain => [elixir],application => libcluster,mfa => {'Elixir.Cluster.Logger',info,2}},msg => {string,<<"[libcluster:default] connected to :\"web@web.cluster.local\"">>},level => info}},{reason,{error,{badmatch,[{metadata,{all_except,[socket,conn]}},{redactors,[{'Elixir.LoggerJSON.Redactors.RedactKeys',[<<"password">>,<<"secret">>,<<"nonce">>,<<"fragment">>,<<"state">>,<<"token">>,<<"public_key">>,<<"private_key">>,<<"preshared_key">>,<<"session">>,<<"sessions">>]}]}]},[{'Elixir.LoggerJSON.Formatters.GoogleCloud',format,2,[{file,"lib/logger_json/formatters/google_cloud.ex"},{line,148}]}]}}]}
```
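A sketch of the `new/1` style, with option values taken from the crash report above (the exact key list in the real config may differ):

```elixir
# config/config.exs
config :logger, :default_handler,
  formatter:
    LoggerJSON.Formatters.GoogleCloud.new(
      metadata: {:all_except, [:socket, :conn]},
      redactors: [
        {LoggerJSON.Redactors.RedactKeys, ["password", "secret", "token"]}
      ]
    )
```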
Supersedes #8714
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Why:
* The Okta IdP sync job needs to make sure it is always using the latest
access token available. If not, a long-running job could outlive the
access token it started with. This commit updates the Okta API client to
always check that it is using the latest access token for each request
to the Okta API.
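A hypothetical sketch of the idea; none of these function names are the real ones:

```elixir
# Resolve the freshest token at request time instead of capturing one for
# the whole sync job.
defp authorization_headers(provider) do
  {:ok, access_token} = fetch_latest_access_token(provider)
  [{"Authorization", "Bearer " <> access_token}]
end
```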
- Attaches the Sentry Logging hook in each of [api, web, domain] (sketched below)
- Removes errant Sentry logging configuration in config/config.exs
- Fixes the exception logger to default to logging exceptions; pass
`skip_sentry: true` to skip
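One way the hook attachment can look with recent sentry-elixir versions; the handler id and options here are assumptions:

```elixir
# in each app's Application.start/2, before the supervision tree starts
:logger.add_handler(:sentry_handler, Sentry.LoggerHandler, %{
  config: %{level: :error, capture_log_messages: true}
})
```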
Tested successfully in dev. Hopefully the cluster behaves the same way.
Fixes #8639
Bumps [tailwind](https://github.com/phoenixframework/tailwind) from
0.2.4 to 0.3.1.
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/phoenixframework/tailwind/blob/main/CHANGELOG.md">tailwind's
changelog</a>.</em></p>
<blockquote>
<h2>v0.3.1 (2025-02-28)</h2>
<ul>
<li>Support correct target for Linux MUSL with Tailwind v3.</li>
</ul>
<h2>v0.3.0 (2025-02-26)</h2>
<ul>
<li>Support Tailwind v4+. This release assumes Tailwind v4 for new
projects.</li>
</ul>
<p>Note: v0.3.0 dropped target code for handling Linux MUSL with
Tailwind v3. Use v0.3.1+ instead.</p>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="dec852e08d"><code>dec852e</code></a>
release v0.3.1</li>
<li><a
href="2bc2fdff38"><code>2bc2fdf</code></a>
Merge pull request <a
href="https://redirect.github.com/phoenixframework/tailwind/issues/115">#115</a>
from phoenixframework/sd-musl-target-v3v4</li>
<li><a
href="c0006e254b"><code>c0006e2</code></a>
Support Linux MUSL v3 and v4</li>
<li><a
href="08629c84b8"><code>08629c8</code></a>
release v0.3.0</li>
<li><a
href="8b3247daad"><code>8b3247d</code></a>
Merge branch 'next'</li>
<li><a
href="7e1f93b284"><code>7e1f93b</code></a>
use Tailwind 4.0.9 as latest</li>
<li><a
href="44ac9014f0"><code>44ac901</code></a>
don't mention 0.3 or Tailwind v4 in README yet</li>
<li><a
href="8ad425c2da"><code>8ad425c</code></a>
Pass url as a string into fetch_body! as URI.parse would not succeed
with a c...</li>
<li><a
href="6f45cae55d"><code>6f45cae</code></a>
Merge pull request <a
href="https://redirect.github.com/phoenixframework/tailwind/issues/97">#97</a>
from arcanemachine/main</li>
<li><a
href="22788850d2"><code>2278885</code></a>
Merge pull request <a
href="https://redirect.github.com/phoenixframework/tailwind/issues/110">#110</a>
from phoenixframework/sd-tailwind3to4</li>
<li>Additional commits viewable in <a
href="https://github.com/phoenixframework/tailwind/compare/v0.2.4...v0.3.1">compare
view</a></li>
</ul>
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps
[@fontsource/source-sans-3](https://github.com/fontsource/font-files/tree/HEAD/fonts/google/source-sans-3)
from 5.1.1 to 5.2.6.
<details>
<summary>Commits</summary>
<ul>
<li>See full diff in <a
href="https://github.com/fontsource/font-files/commits/HEAD/fonts/google/source-sans-3">compare
view</a></li>
</ul>
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
The `flows` table currently has `ON DELETE SET NULL` behavior for many
of its foreign key constraints. The problem is that if we try to delete
any of the associated entities, setting a null here causes the DB
operation to fail with:
```
ERROR: null value in column "policy_id" of relation "flows" violates not-null constraint
```
I can understand why it was originally architected like this to preserve
connection log data, but we'll be using another approach for that, one
that doesn't require maintaining relational data in perpetuity.
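A hypothetical sketch of what changing this could look like for one of the foreign keys, assuming the fix is to cascade deletes instead of nulling the column; the constraint and column names are guesses and the actual PR may take a different approach:

```elixir
defmodule Domain.Repo.Migrations.ChangeFlowsOnDeleteBehaviour do
  use Ecto.Migration

  def up do
    # the default Postgres constraint name is assumed here
    drop constraint(:flows, "flows_policy_id_fkey")

    alter table(:flows) do
      modify :policy_id, references(:policies, type: :binary_id, on_delete: :delete_all),
        null: false
    end
  end
end
```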
Related: #949
The Google API will often return a missing `members` key alongside a
`200` response from their members API. The documentation here isn't
clear whether this key is expected or not, but since the sync has been
working fine up until #8608, we can only surmise that the missing key in
fact means the group has no members.
This PR updates the Google API client so that a `default_if_missing` can
be passed in which is returned if the API response is missing the JSON
key to fetch.
For the users, groups, and organization units fetches, we consider a
missing key to be an error and we return `{:error, :invalid_response}`
since this most likely indicates an API problem.
For the members endpoint, we consider the missing key to be the empty
set.
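A sketch of the idea; the function shape is an assumption, not the actual client API:

```elixir
# Treat a missing JSON key as an error unless the caller supplies a default.
defp fetch_key(json, key, default_if_missing \\ :error) do
  case Map.fetch(json, key) do
    {:ok, value} -> {:ok, value}
    :error when default_if_missing == :error -> {:error, :invalid_response}
    :error -> {:ok, default_if_missing}
  end
end

# members: a missing "members" key means the group is empty
#   fetch_key(body, "members", [])
# users / groups / org units: a missing key indicates an API problem
#   fetch_key(body, "users")
```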
Additionally, this fixes a bug introduced in #8608 whereby we returned
`{:error, :retry_later}` for newly-accounted-for API responses, which
would have caused a "sync failed" email to be sent to the admins on the
instance.
Instead, we want to return `{:error, :invalid_response}` which will stop
the sync from progressing, and log it internally.
The previous migration only accounted for soft-deleted rows that have an
active counterpart.
The new unique index still fails if multiple soft-deleted rows exist for
the same `account_id, provider_id, provider_identifier` combination.
Instead, to appease the new index, we need to delete all soft-deleted
rows where these fields exist.
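A sketch of the corrected cleanup, assuming the affected table is `actor_groups` (the table name is a guess):

```elixir
# inside the migration's up/0
execute("""
DELETE FROM actor_groups
WHERE deleted_at IS NOT NULL
  AND provider_id IS NOT NULL
""")
```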
Related: #8615
After merging #8608, we discovered that we receive unexpected API
responses on the regular. This adds improved logging to uncover what
exactly these unexpected API responses are.
When syncing identities from an identity provider, we have logic in place
that resurrects any soft-deleted identities in order to maintain their
session history, group memberships, and any other relevant data. Users
can be temporarily suspended in their identity provider and then
resumed.
Groups, however, based on cursory research, can never be temporarily
suspended at the identity provider. That doesn't mean we can't see a
group disappear and reappear at a later point in time, whether due to a
temporary sync issue or the upcoming Group Filters PR: #8381.
This PR adds more robust testing to ensure we can in fact resurrect
identities as expected.
It also updates the group sync logic to similarly resurrect soft-deleted
groups if they are seen again in a subsequent sync.
To achieve this, we need to update the `UNIQUE CONSTRAINT` used in the
upsert clause during the sync. Before, it was possible for two (or more)
groups to exist with the same provider_identifier and provider_id, if
`deleted_at IS NOT NULL`. Now, we need to ensure that only one group
with the same `account_id, provider_id, provider_identifier` can exist,
since we want to resurrect and not recreate these.
To do this, we use a migration (sketched below) that does the following:
1. Ensures any potentially problematic data is permanently deleted
2. Drops the existing unique constraint
3. Recreates it, omitting `WHERE DELETED_AT IS NULL` from the partial
index.
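A rough sketch of steps 2 and 3; the index and table names follow Ecto's defaults and are assumptions, and step 1 (the permanent data cleanup) is omitted here:

```elixir
# inside the migration's up/0
def up do
  # step 1: permanently delete rows that would collide under the stricter
  # index (omitted)

  # step 2: drop the old partial unique index scoped to deleted_at IS NULL
  drop_if_exists index(:actor_groups, [:account_id, :provider_id, :provider_identifier])

  # step 3: recreate it without the WHERE deleted_at IS NULL clause
  create unique_index(:actor_groups, [:account_id, :provider_id, :provider_identifier])
end
```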
Based on exploring the production DB data, this should not cause any
issues, but it would be a good idea to double-check before rolling this
out to prod.
Lastly, the final missing piece to the resurrection story is Policies.
This is saved for a future PR since we need to first define the
difference between a policy that was soft-deleted via a sync job, and a
policy that was "perma" deleted by a user.
Related: #8187
When calling the various directory sync endpoints, we had error cases
that matched a few of the possible error scenarios in an appropriate way
by returning either `{:error, :retry_later}` or the `{:error, ...}`
tuples.
However, as we've recently learned in [this
thread](https://firezonehq.slack.com/archives/C069H865MHP/p1743521884037159),
it's possible for identity provider APIs to return all kinds of bogus
data here, and we need a more defensive approach.
The specific issue this PR addresses is the case where we receive a
`2xx` response, but without the expected JSON key in the response body.
That will result in the `list*` functions returning an empty list, which
the calling code paths then use to soft-delete all existing record types
in the DB.
This is wrong. If the JSON response is missing a key we're expecting, we
instead log a warning and return `{:error, :retry_later}`. It's
currently unknown when exactly this happens and why, but better
monitoring here will give us a much clearer picture.
When reading through these modules, it's helpful to know that the actual
sync data update doesn't occur more often than every 10 minutes due to a
database check.
When a customer signs up for Starter or Team, we don't enable tax
calculation by default. This means customers can upgrade to Team, start
paying invoices, and we won't collect taxes.
This creates a management issue and possible tax liability since I need
to manually reconcile these.
Instead, since we have Stripe Tax configured on our account, we can
enable automatic tax calculation when the subscription is created. Any
products (Starter/Team/Enterprise) therefore in the subscription will
automatically collect tax appropriately.
In most cases in the US, the tax rate is 0. In EU transactions, for B2B
sales, the tax rate for us is also 0 (reverse charge basis). If we sell
a Team subscription to an individual, however, we need to collect VAT.
There doesn't seem to be a way to block consumer EU transactions in
Stripe, so we'll likely need to register for VAT in the EU if we cross
the reporting threshold.
A regression was introduced in d0f0de0f8d
in which we started using the updated policy record for broadcasting
the `delete_policy` and `expire_flows` events. This caused a security
issue because if the actor group changed from `Everyone` to `thomas`,
for example, we'd only expire flows and broadcast policy removal (i.e.
resource removal) events for `thomas`, and `Everyone` would still have
access granted by the old policy.
To fix this, we broadcast the destructive events to the old policy, so
that its `actor_group_id` and `resource_id` are used, and not the new
policy's.
Fixes #8549
After removing some of the functionality for viewing the Internet
Resource, a customer was confused about where to find it again.
This places an `Internet` section in the Resources index page (similar
to Sites page) with a short help text and an action button to view the
Internet Resource.
This also adds a convenient helper that allows us to route to
`/#{account}/resources/internet` for a nicer-looking URL that users can
bookmark if needed.
<img width="1423" alt="Screenshot 2025-03-19 at 11 52 31 PM"
src="https://github.com/user-attachments/assets/f2da1c31-92b2-429e-832f-73ddd0524155"
/>
Fixes #8479
Why:
* This commit will allow account admins to send a request through the
Firezone portal to schedule a deletion of their account, rather than
having the account admins email their request manually. Doing this
through the portal allows us to verify that the request actually came
from an admin of the account.
I was debugging some of this just now and realized our naming / comments
are incorrect here, so thought I'd open a PR to tidy things up for the
next person reading this.
Resource CIDRs actually occupy the `100.96.0.0/11` range (and IPv6
equivalent), but the portal doesn't generate these.
Why:
* Previously, when running a directory sync with the Google Workspace
IdP adapter, if a service account had been configured but there was a
problem getting an access token for the service account, the sync job
would fall back to using a personal access token. We no longer want to
rely on any personal access token once a service account has been
configured. This commit makes sure that if a service account is
configured, there is no way to fall back to any personal access token
(a sketch of the resulting flow follows).
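A hypothetical sketch of the resulting control flow; these function names are illustrative, not the real ones:

```elixir
defp fetch_access_token(provider) do
  if service_account_configured?(provider) do
    # any failure here is surfaced as-is; we never fall back to the admin's
    # personal access token once a service account exists
    fetch_service_account_token(provider)
  else
    fetch_personal_access_token(provider)
  end
end
```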
Fixes #8409
When deploying a Gateway from the admin portal UI, we show various
environment variables required for setup. Until now, we've relied on the
`/var/lib/firezone` persistence method for identifying the Gateway.
However, this can cause issues on some systems that don't have writeable
access to /var/lib/firezone, or old versions of systemd that don't
support sandboxed access to this directory.
This PR updates each deployment method to use `FIREZONE_ID` instead
everywhere. Additionally, since the Docker upgrade script needs to
reinvoke the new container using the same arguments (more or less) as
the install, we need to extract the old `/var/lib/firezone/gateway_id`
file out of the existing container if it exists, and try to insert it
into the upgraded container.
Tested both scripts, including upgrades for the Docker script.
Fixes: #8471
Finishes up the Internet Resource migration by enforcing:
- No internet resources in non-internet sites
- No regular resources in internet sites
- Removing the prompt to migrate
~~I've already migrated the existing internet resources in customer's
accounts. No one that was using the internet resource hadn't already
migrated.~~
Edit: I started to head down that path, then decided doing this here in
a data migration was going to be a better approach.
Fixes #8212
[Step
2](https://cloud.google.com/sql/docs/postgres/pg-audit#set-pgaudit-flag-values)
of the pgaudit setup guide for Google Cloud SQL. It would be good to
have detailed pg audit logs on the master application instance in case
things go wrong.
Notably, this prevents erroring out when the `pgaudit` extension is not
available, which is the case by default. Enabling the `pgaudit` extension
for our dev instance is left as a future endeavor.
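One way such a guard can be written in a migration; this is a sketch under the assumption that the extension is enabled only where it is installed, not necessarily what the commit does:

```elixir
# inside the migration
def up do
  # Enable pgaudit only where the extension is actually installed, so
  # environments without it (e.g. dev) don't fail the migration.
  execute("""
  DO $$
  BEGIN
    IF EXISTS (SELECT 1 FROM pg_available_extensions WHERE name = 'pgaudit') THEN
      CREATE EXTENSION IF NOT EXISTS pgaudit;
    END IF;
  END
  $$;
  """)
end
```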
Supersedes #5442
The submit button on the Settings -> DNS page has a couple of UX issues
with the new search domain section:
- It's ambiguous what the `Save` is actually saving
- The spacing makes it look like it's only saving upstream resolvers
This PR introduces a simple fix that addresses the two issues by:
- Updating the button text to `Save DNS Settings`
- Increasing spacing between submit button and form elements
- Slightly decreasing spacing between the `search domain` and `upstream
resolvers` inputs
<img width="968" alt="Screenshot 2025-03-14 at 12 06 02 AM"
src="https://github.com/user-attachments/assets/651f54c8-3b5f-4747-ad3a-e2ae32eccbf0"
/>
Related: #5248
Why:
* This commit updates the 500 error page in the portal to have the same
look and feel as the 404 error page, in order to be consistent with the
rest of the portal UI.