This table was added to try to gate the request rate to Google's
Metrics API.
However, this was a flawed endeavor: we later discovered that it's the
time series points that need to be spaced at least 5s apart, not the
API requests themselves.
This PR gets rid of the table, and therefore the problematic DB query,
which is timing out quite often due to the contention of 12 Elixir
nodes trying to grab a lock every 5s.
Related: #9539
Related:
https://firezone-inc.sentry.io/issues/6346235615/?project=4508756715569152&query=is%3Aunresolved&referrer=issue-stream&stream_index=1
Some migrations take a long time to run because they require locks or
modify large amounts of data. To prevent this from causing issues during
deploy, we leverage Ecto's native support for loading migrations from
multiple directories to introduce a `conditional_migrations/` directory
that houses any conditional migrations we want to run.
To run these migrations, you'll need to do one of the following:
- `dev, test`: `mix ecto.migrate` will run them by default because we
have aliased it to load `conditional_migrations/` for dev
- `prod`: Set the `RUN_CONDITIONAL_MIGRATIONS` env var to `true` before
starting a prod server using the `bin/migrate` script.
- `dev, test, prod`: Run `Domain.Release.migrate(conditional: true)`
from an IEx shell.
If conditional migrations are found that weren't executed during
`Domain.Release.migrate`, a warning is logged to remind us to run them.
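For illustration, a minimal sketch of what the release entry point might look like; this is a hypothetical reconstruction, assuming the app is named `:domain` and that `Ecto.Migrator` is used directly:

```elixir
# Hypothetical sketch, not the exact module contents.
defmodule Domain.Release do
  @app :domain

  def migrate(opts \\ []) do
    conditional? =
      Keyword.get(opts, :conditional, System.get_env("RUN_CONDITIONAL_MIGRATIONS") == "true")

    for repo <- Application.fetch_env!(@app, :ecto_repos) do
      # Ecto.Migrator.run/4 accepts a list of migration directories.
      {:ok, _, _} =
        Ecto.Migrator.with_repo(repo, fn repo ->
          Ecto.Migrator.run(repo, migration_paths(repo, conditional?), :up, all: true)
        end)
    end

    :ok
  end

  # Only include conditional_migrations/ when explicitly requested.
  defp migration_paths(repo, true) do
    [
      Ecto.Migrator.migrations_path(repo),
      Ecto.Migrator.migrations_path(repo, "conditional_migrations")
    ]
  end

  defp migration_paths(repo, false), do: [Ecto.Migrator.migrations_path(repo)]
end
```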
---------
Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
The shape of this from libcluster is a keyword list like
`["NODE_NAME": connected_bool?]`, so we need to extract the first
element of each tuple before using this var.
This is just for logging and doesn't affect how we actually connect to
nodes.
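For example (a minimal sketch; the `results` binding is illustrative):

```elixir
require Logger

# libcluster hands back a keyword list like ["api@10.0.0.1": true, ...];
# keep only the node names (the first tuple element) for the log line.
node_names = Enum.map(results, fn {node, _connected?} -> node end)
Logger.info("Attempted to connect to nodes: #{inspect(node_names)}")
```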
Why:
* Adding some simple logging around OIDC calls to help with better
debugging.
* Removing the `opentelemetry_liveview` package, as it has been pulled
into the `opentelemetry_phoenix` package that we are already using.
After updating the CSS config to use `main.css` in the portal the root
layout was updated, but there were a small number of one-off templates
that do not use the root layout and those pages were not updated with
the new `main.css` file. This commit updates those non-root templates.
Fixes #9532
We issue broadcasts and subscribes in many places throughout the portal.
To help keep the cognitive overhead low, this PR consolidates all PubSub
functionality to the `Domain.PubSub` module.
This allows us to:
- improve maintainability
- see all of the topics we use at a glance
- consolidate repeated functionality (saved for a future PR)
- use the module hierarchy to define function names, which feels more
intuitive when reading and sets a convention
We also introduce a `Domain.Events.Hooks` behaviour to ensure all hooks
comply with a simple contract, along with a convention that
standardizes topic names on the module hierarchy defined herein.
Lastly, we add convenience functions to the Presence modules to save a
bit of duplication and chance for errors.
This will make it much easier to maintain PubSub going forward.
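To illustrate the convention, a hedged sketch; the exact callbacks and topic strings in the merged code may differ:

```elixir
# Sketch of the hooks contract.
defmodule Domain.Events.Hooks do
  @callback on_insert(data :: map()) :: :ok | {:error, term()}
  @callback on_update(old_data :: map(), data :: map()) :: :ok | {:error, term()}
  @callback on_delete(old_data :: map()) :: :ok | {:error, term()}
end

# Topic names follow the module hierarchy, e.g. a hypothetical
# Domain.PubSub.Account for account-scoped messages.
defmodule Domain.PubSub.Account do
  def topic(account_id), do: "accounts:" <> account_id

  def subscribe(account_id), do: account_id |> topic() |> Domain.PubSub.subscribe()

  def broadcast(account_id, payload), do: account_id |> topic() |> Domain.PubSub.broadcast(payload)
end
```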
Related: #9501
In #9513 a bug was introduced that added all service accounts to the
Everyone group. This fixes that by ensuring the `insert_all` query only
cross joins rows where the actor type is `:account_user` or
`:account_admin_user`.
Staging data will be manually fixed after this goes in.
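For reference, a sketch of the corrected source query; the schema and field names here are assumptions:

```elixir
import Ecto.Query

# Only human actor types are cross joined into the Everyone group.
memberships_source =
  from a in Domain.Actors.Actor,
    cross_join: g in Domain.Actors.Group,
    where: a.type in [:account_user, :account_admin_user],
    where: g.name == "Everyone",
    select: %{actor_id: a.id, group_id: g.id}
```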
I briefly considered updating the delete clause of this query to "clean
things up" by removing any service accounts it finds, but that is a bit
too defensive in my opinion: if there's no way a service account should
make it into this group, then we shouldn't have code that expects one.
This will all be going away in #8750, which should be much less brittle.
In many places throughout the portal codebase, we called a function
`update_dynamic_group_memberships/1` which recomputed all of the
dynamic/managed memberships for a particular account and reapplied them
to each affected group.
Since the `has_many :memberships` relationship used `on_replace:
:delete`, this caused Ecto to delete _all_ the `Everyone` group
memberships, and reinsert them on each sync.
Since each membership change triggers a re-evaluation of all policies
for the affected actor (`Policies.broadcast_access_events_for/3`), this
in effect caused a massive number of queries on each sync job, as each
membership deletion and insertion triggered a lookup of all resources
available to that particular actor.
To fix this, we introduce the following changes:
- Remove the `dynamic` group type. This will never be used, as it would
create an immense amount of complexity for any organization trying to
manage groups this way
- Refactor `update_dynamic_group_memberships/1` to use a smarter query
that first gathers all the _needed_ changes and applies them within a
transaction using `Ecto.Multi` (see the sketch after this list).
Previously, all memberships were rolled over unconditionally due to the
`on_replace: :delete` option on the relationship. Note that the option
is still there, but we generally no longer set memberships on groups
unless editing the affected group directly, where the Everyone group
doesn't apply.
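A minimal sketch of the diff-then-apply approach; the schema, repo, and the bindings that compute the diff are assumptions:

```elixir
import Ecto.Query
alias Ecto.Multi

# to_insert: %{actor_id: ..., group_id: ...} rows missing from the DB;
# to_delete_ids: membership ids that should no longer exist.
# Only the diff is applied, in one transaction, instead of rolling
# every membership over.
Multi.new()
|> Multi.delete_all(:deleted, from(m in Domain.Actors.Membership, where: m.id in ^to_delete_ids))
|> Multi.insert_all(:inserted, Domain.Actors.Membership, to_insert)
|> Domain.Repo.transaction()
```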
Resolves: #8407
Resolves: #8408
Related: #6294
---------
Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
We move the resource events to the WAL system. Notably, we no longer
need `fetch_and_update_breakable` for resource updates, so a bit of
refactoring is included to update its call sites.
Additionally, we need to add a `Flow.expire_flows_for_resource_id/1`
function to expire flows from the WAL system. This is now called in the
WAL event handler. To prevent it from blocking the WAL
consumer/broadcaster, we wrap it with a `Task.async`. These tasks will
be cleaned up when the lookup table for access is implemented next.
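Roughly, as a sketch (the handler module and callback shape are assumed):

```elixir
defmodule Domain.Events.Hooks.Resources do
  # Fire off flow expiry as a task so it never blocks the WAL
  # consumer/broadcaster; these tasks get cleaned up later.
  def on_delete(%{"id" => resource_id}) do
    Task.async(fn -> Domain.Flows.expire_flows_for_resource_id(resource_id) end)
    :ok
  end
end
```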
Another thing to note is that we lose the `subject` when moving from
`Flows.expire_flows_for(%Resource{}, subject)` to
`Flows.expire_flows_for_resource_id(resource_id)` when a resource is
deleted or updated by an actor, since we respond to this event in the
WAL, where that data isn't available. However, we don't actually _use_
the subject when expiring flows (other than to authorize the initial
resource update), so this isn't an issue.
Related: #9501
---------
Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Brian Manifold <bmanifold@users.noreply.github.com>
Why:
* This commit brings our web app in line with how new Phoenix
applications manage and configure js/css/font assets. Along with that,
this commit updates our Tailwind and esbuild tools.
Why:
* We've seen some Stripe API requests come back with 429 responses,
which likely could be retried and succeed. This commit adds some basic
retry logic to our Stripe API client.
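Something along these lines (a sketch; the `send_request/1` helper and the fixed backoff are assumptions):

```elixir
# Retry 429 responses a few times with a small pause before giving up.
defp request_with_retry(request, attempts_left \\ 3)

defp request_with_retry(request, attempts_left) when attempts_left > 0 do
  case send_request(request) do
    {:ok, %{status: 429}} ->
      Process.sleep(500)
      request_with_retry(request, attempts_left - 1)

    other ->
      other
  end
end

defp request_with_retry(request, _attempts_left), do: send_request(request)
```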
Membership events are quite simple to move to the WAL:
- Only one topic is used to determine which client(s) receive updates
for which Actor(s).
- The unsubscribe was removed because it was unused.
- Notably, the N+1 query problem of re-evaluating all access after each
membership update is still present. This will be fixed using a lookup
table in the client channel in the last PR moving events to the WAL.
Related: https://github.com/firezone/firezone/issues/6294
Related: https://github.com/firezone/firezone/issues/8187
> This PR adds two configuration keys for the replication connection.
> Example use case:
> If you run two Firezone control planes on the same DB cluster (like
me 😂), you'll need two different replication slot names.
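Illustratively (the config key and module names here are assumptions based on the description):

```elixir
# Give each control plane its own slot and publication on a shared cluster.
config :domain, Domain.Events.ReplicationConnection,
  replication_slot_name: "firezone_a_slot",
  publication_name: "firezone_a_publication"
```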
Related: #9395
Co-authored-by: Antoine <antoinelabarussias@gmail.com>
When upserting a client, the device name would overwrite any name
changes performed by the admin. To prevent this, we don't allow changing
a device's name on upsert conflicts, only on initial insert and updates.
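The gist, as a sketch (the conflict target and replaced-field list are assumptions):

```elixir
# On conflict, replace everything except the admin-editable name (and
# immutable fields); the name is only set on the initial insert.
Domain.Repo.insert(changeset,
  on_conflict: {:replace_all_except, [:id, :name, :inserted_at]},
  conflict_target: [:account_id, :external_id]
)
```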
Fixes #8536
Co-authored-by: Antoine <antoinelabarussias@gmail.com>
Bumps
[@fontsource/source-sans-3](https://github.com/fontsource/font-files/tree/HEAD/fonts/google/source-sans-3)
from 5.2.7 to 5.2.8.
See the full diff in the [compare
view](https://github.com/fontsource/font-files/commits/HEAD/fonts/google/source-sans-3).
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps
[opentelemetry_logger_metadata](https://github.com/salemove/opentelemetry_logger_metadata)
from 0.1.0 to 0.2.0.
See the full diff in the [compare
view](https://github.com/salemove/opentelemetry_logger_metadata/commits).
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Why:
* When a directory sync occurs, all groups in the DB need to be pulled
in case a synced group needs to be resurrected. Before the directory
sync deletion circuit breaker was added, the app would "re-delete"
already deleted groups. That was basically a no-op; however, once the
circuit breaker was in place, re-deleting already deleted groups could
throw it off and cause it to fail a directory sync when it wasn't
needed, by making it seem as though too many groups were being deleted.
This commit makes sure we don't add already deleted groups to the list
of groups to be deleted for a given sync.
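In essence (a sketch; the bindings are illustrative):

```elixir
# Already-deleted groups must not count toward the circuit breaker, so
# drop anything with a deleted_at timestamp before queueing deletions.
groups_to_delete = Enum.reject(stale_groups, fn group -> group.deleted_at end)
```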
Fixes: #9364
When a Google account has no organization units defined in its
directory, the Google API can return a `200` response without the
`organizationUnits` key. In such cases, we should treat this as an empty
list such that the remainder of the sync will continue.
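That is, a sketch of the fix (the surrounding client code is assumed):

```elixir
# Google may omit "organizationUnits" entirely when none are defined;
# default to an empty list so the rest of the sync proceeds.
org_units = Map.get(response_body, "organizationUnits", [])
```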
Fixes the following issues after learning they're still a problem:
- We need to include our own node when checking the connected node count
- We need to match against the `formatted` key inside `message` when
filtering Sentry events
In #9342 we started logging only if our connected node count fell below
the threshold. However, on error, we failed to calculate the new list.
On startup, the first few `load`s will be failures, and the connected
list will remain empty, causing this to report a false positive.
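For the first fix, the count now includes our own node, e.g.:

```elixir
# Node.list/0 excludes the local node, so add it back before comparing
# against the expected per-type node counts.
connected_count = length([Node.self() | Node.list()])
```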
A regression was introduced in #9242: it appears that some Sentry
events don't contain messages, so the filtering module is updated to
act only on events with messages.
Why:
* This commit contains only a migration to remove the
`created_by_identity` and `created_by_actor` columns on multiple
tables. This migration will be run manually due to the long-running
sync jobs currently in the system, and should be a no-op after the
manual DB updates.
This PR moves Gateway events to be triggered by the WAL broadcaster.
Some things of note that are cleaned up:
- The gateway `:update` event was never received anywhere (except in a
test), so it has been removed
- The account topic has been removed, as it was also never acted upon
anywhere (Presence, yes; the topic itself, no)
- The group topic has also been removed, as it was only used to receive
broadcasted disconnects when a group is deleted; this was already
handled by the token deletion and so was redundant.
We periodically fetch a list of all `RUNNING` VMs in GCP and then try to
connect to them for clustering. However, during deploys, it's expected
that we won't be able to connect to new VMs until they are fully up. The
fetch doesn't take health checks into account, so we need
threshold-based error logging.
To address this, we do the following:
- We only log an error when failing to connect to nodes if we are
currently below the threshold for each of the `api`, `domain`, and `web`
node counts
- We silence node timeout errors, as these will happen during deploys
We use `libcluster`, a common Elixir library, for node discovery. It's a
very lightweight wrapper around Erlang's standard `Node.connect`
functionality.
It supports custom cluster formation strategies, and we've implemented
one based on fetching the list of nodes from the GCP API, and then
attempting to connect to them.
Unfortunately, our implementation had two bugs that prevented the
cluster from healing in the following two cases:
- If we successfully connected to nodes, we tracked an internal state
var as having successfully connected to them, forever. If we then lost
the connection to those nodes (such as during a deploy where the Elixir
nodes don't come up in time, causing the instance group manager to reap
them), the state would never be updated, and we would never reconnect
to the lost nodes.
- If we failed to fetch the list of nodes more than 10 times (every 10
seconds, so 100 seconds), then we would fail to schedule the timer to
load the nodes again.
The first issue is fixed by removing our kept state altogether - this is
what libcluster is for. We can simply try to connect to the most recent
list of nodes returned from Google's API, and we now log a warning for
any nodes that don't connect.
The second issue is fixed by always scheduling the timer, forever,
regardless of the state of the Google API.
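The second fix boils down to scheduling unconditionally (a sketch; names are illustrative):

```elixir
# Always schedule the next :load, no matter how the previous fetch went.
defp schedule_load do
  Process.send_after(self(), :load, :timer.seconds(10))
end
```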
Fixes #8660
Fixes #8698
When the client is connecting for the first time without any cookies
loaded, `conn.assigns.account` does not exist, causing a `KeyError`.
Instead, we should load this param from the URL and fetch the account
from it.
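Roughly (a sketch; the param key and fetch helper are hypothetical):

```elixir
# Don't assume the assign exists; fall back to the URL param.
# fetch_account_by_id_or_slug/1 is an assumed helper name.
account_id_or_slug = conn.params["account_id_or_slug"]
{:ok, account} = Domain.Accounts.fetch_account_by_id_or_slug(account_id_or_slug)
conn = Plug.Conn.assign(conn, :account, account)
```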
Why:
* Now that we have started using the `created_by_subject` field on
various tables, we no longer need to keep the
`created_by_<identity/actor>` fields. This helps remove a foreign key
reference and is one step closer to allowing us to hard delete data,
rather than soft deleting all data in order to keep foreign key
references like these intact.
Moves the broadcasting of `disconnect` messages caused by token
soft-deletions to the WAL broadcaster.
Notably, many tests had to be cleaned up because they were specifically
testing this side effect. Instead, these tests now verify (1) that the
token is deleted, and (2) that the token deletion handler broadcasts
the message.
Client updates are next on the path to moving more side effects to the
WAL broadcaster. This one has the following notable changes:
- ~~The `actor_clients` pubsub topic was only used to broadcast removal
of clients belonging to an actor; it is no longer needed since we
handle this in the individual removal event~~ EDIT: only the presence
is kept
- The `account_clients:{account_id}` pubsub and presence topic
definition has been moved to `Events.Hooks.Accounts`, because these
messages are broadcast using the `account_id` field in response to
account changes and have nothing to do with the client lifecycle
Related: #6294
Related: #8187
Similar to #9285, we move the `expire_flow` event to be broadcasted from
the WAL broadcaster.
Unrelated tests needed to be updated to not expect the broadcast, and
instead check that the record has been updated.
A minor bug in the ordering of the `old_data, data` fields is also
fixed.
Tested manually on dev.
Related: #6294
Related: #8187
Now that the WAL consumer has been dry running in production for some
time, we can begin moving events over to it.
We start with a relatively simple case: the account `config_changed`
event.
Since side effects now happen decoupled from the actual record updates,
testing is updated in this PR:
- We don't expect broadcasts to happen in `accounts_test.exs` - these
context modules are now solely responsible for managing updates to
records and will no longer need to worry about side effects (in the
typical case) like subscribe and broadcast
- The Events hooks module now contains all logic related to processing
side effects for a particular account update.
The net effect is that we now have a dedicated module and tests for
side effects, starting with `accounts`.
Related: #6294
Related: #8187
Bumps
[@fontsource/source-sans-3](https://github.com/fontsource/font-files/tree/HEAD/fonts/google/source-sans-3)
from 5.2.6 to 5.2.7.
See the full diff in the [compare
view](https://github.com/fontsource/font-files/commits/HEAD/fonts/google/source-sans-3).
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Why:
* We have seen issues with the Google Admin SDK API returning bad
information when requesting directory info, such as Groups and
Identities. The requests seem to return successful HTTP codes, but the
data is missing, which our sync system interprets as all
Groups/Identities having been deleted from the Google Workspace. To
prevent this from happening, a deletion circuit breaker function has
been added that stops a sync job if a certain percentage of the
identities would be deleted on the current run. This should prevent the
possibility of mass deleting Groups/Identities if an Identity Provider
hands back incorrect info on any sync.
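Conceptually (a sketch; the threshold value and names are assumptions):

```elixir
# Abort the sync when the provider's response would delete too large a
# fraction of existing identities; the 50% threshold is illustrative.
defp check_deletion_breaker(to_delete_count, total_count, max_ratio \\ 0.5) do
  if total_count > 0 and to_delete_count / total_count > max_ratio do
    {:error, :too_many_deletions}
  else
    :ok
  end
end
```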
Fixes: #9188
Why:
* We have decided to change the way we will do audit logging. Instead
of soft deleting data and keeping it in the table it was created in, we
will move to an audit trail, where various actions are recorded in a
table/DB specifically for auditing purposes. Due to this change, we
need to make sure that we don't have stale/dangling references. One set
of references we keep everywhere is `created_by_identity_id` and
`created_by_actor_id`; those foreign key references won't be usable
after moving to the new audit system. This commit allows us to keep
that info by pulling the values and storing them in a
`created_by_subject` field on the record.
We are currently consuming the WAL in production, and it has shown very
little cost in terms of resource usage.
It would be better to get a more real-world test by sending actual
broadcasts with data.
To do this, we simply send a `Domain.PubSub.broadcast` with all of the
data received in the WAL message, which represents an absolute
worst-case scenario.
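In other words (a sketch; the topic and message shape are illustrative):

```elixir
# Broadcast the entire decoded WAL payload - an intentional worst case
# for measuring PubSub overhead.
Domain.PubSub.broadcast("wal_events", {op, table, old_data, data})
```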