125 Commits

Author SHA1 Message Date
Jamil
d329880ec8 fix(portal): don't use Web functions from Domain (#10546)
Fixes an issue introduced in #10510 where Web functions (like
VerifiedRoutes) cannot be called from Domain because they are not
available in the release.

This happens to work in dev mode because everything is available under
the same dev context.
2025-10-13 20:24:46 +00:00
Jamil
b61fd20de8 chore(portal): remove Jason in favor of JSON (#10550)
Since Elixir 1.18, json encoding and decoding support is included in the
standard library. This is built on OTP's native json support which is
often faster than other implementations.

It mostly has the same API as the popular Jason library, differing
mainly in the format of the error responses returned when decoding
fails.

To minimize dependence on external libraries, we remove the Jason lib in
favor of this external dependency.

Fixes #8011
2025-10-13 17:39:53 +00:00
Jamil
bb089846d7 chore(portal): bump phoenix to 1.8 (#10510)
Bumps Phoenix to 1.8 and Phoenix LiveView to 1.1. As part of the bump a
number of issues had to be addressed. Comments inline provide more
context.

Supersedes #10475 
Supersedes #10448
2025-10-10 15:08:50 +00:00
Mariusz Klochowicz
91cf1e0152 fix(staging): Update Docker registry (#10486)
Sets the current Docker registry, with `/dev` suffix to prefer latest
dev gateway release
2025-09-30 04:28:29 +00:00
Jamil
6c485be45e fix(portal): fix socket based postgres connections (#10189) (#10245)
This adds the option to do socket based postgres connection to the
replication connections. Basically just a copy of the existing config
for the base postgres connection.

---------

Co-authored-by: PatrickDaG <patrickdag@failmail.dev>
2025-08-25 17:23:03 +00:00
Jamil
cafe6554ff refactor(portal): reduce cache memory usage (#10058)
Napkin math shows that we can save substantial memory (~3x or more) on
the API nodes as connected clients/gateways grow if we just store the
fields we need in order to keep the client and gateway state maintained
in the channel pids.

To facilitate this, we create new `Cacheable` structs that represent
their `Domain` cousins, which use byte arrays for `id`s and strip out
unused fields.

Additionally, all business logic involved with maintaining these caches
is now contained within two modules: `Domain.Cache.Client` and
`Domain.Cache.Gateway`, and type specs have been added to aid in static
analysis and code documentation.

Comprehensive testing is now added not only for the cache modules, but
for their associated channel modules as well to ensure we handle
different kinds of edge cases gracefully.

The `Events` nomenclature was renamed to `Changes` to better name what
we are doing: Change-Data-Capture.

Lastly, the following related changes are included in this PR since they
were "in the way" so to speak of getting this done:

- We save the last received LSN in each channel and drop the `change`
with a warning if we receive it twice in a row, or we receive it out of
order
- The client/gateway version compatibility calculations have been moved
to `Domain.Resources` and `Domain.Gateways` and have been simplified to
make them easier to understand and maintain going forward.


Related: #10174 
Fixes: #9392 
Fixes: #9965
Fixes: #9501 
Fixes: #10227

---------

Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-08-22 21:52:29 +00:00
Brian Manifold
80e1c3255f refactor(portal): refactor billing event handler (#10064)
Why:

* There were intermittent issues with accounts updates from Stripe
events. Specifically, when an account would update it's subscription
from Starter to Team. The reason was due to the fact that Stripe does
not guarantee order of delivery for it's webhook events. At times we
were seeing and responding to an event that was a few seconds old after
processing a newer event. This would have the effect of quickly
transitioning an account from Team back to Starter. This commit
refactors our event handler and adds a `processed_stripe_events` DB
table to make sure we don't process duplicate events as well as prevent
processing an event that was created prior to the last event we've
processed for a given account.

* Along with refactoring the billing event handling, the Stripe mock
module has also been refactored to better reflect real Stripe objects.

Related: #8668
2025-08-05 16:56:52 +00:00
Jamil
f169746389 chore(portal): use local website url for versions in dev (#10057)
When starting a local client with a local portal, this URL is hit and
times out, causing noise in the local gateway log.

In order to develop against this API in local dev, it might be better to
use the local website URL as well.
2025-07-30 21:16:24 +00:00
Jamil
4a448e5517 fix(portal): separate dev and runtime Oban configs (#10027)
Oban includes its own configuration validation, which seems to prevent
`runtime.exs` from overriding any compile-time options. This prevents us
from using ENV vars to configure it, such as restricting job execution
to `domain` nodes by setting `queues: []`. To fix that, we make sure to
set Oban configuration in env-specific files `config/dev.exs` and
`config/test.exs`, and at runtime for prod with `config/runtime.exs`.

Fixes #10016
2025-07-28 15:13:52 +00:00
Jamil
f1a5af356d fix(portal): groom resource list and flows periodically (#10005)
Time-based policy conditions are tricky. When they authorize a flow, we
correctly tell the Gateway to remove access when the time window
expires.

However, we do nothing on the client to reset the connectivity state.
This means that whenever the window of time of access was re-entered,
the client would essentially never be able to connect to it again until
the resource was toggled.

To fix this, we add a 1-minute check in the client channel that
re-checks allowed resources, and updates the client state with the
difference. This means that policies that have time-based conditions are
only accurate to the minute, but this is how they're presented anyhow.


For good measure, we also add a periodic job that runs every minute to
delete expired Flows. This will propagate to the Gateway where, if the
access for a particular client-resource is determined to be actually
gone, will receive `reject_access`.

Zooming out a bit, this PR furthers the theme that:

- Client channels react to underlying resource / policy / membership
changes directly, while
- Gateway channels react primarily to flows being deleted, or the
downstream effects of a prior client authorization
2025-07-25 21:04:41 +00:00
Jamil
2392bddacb fix(elixir): handle nil external url config in dev mode (#9958)
This is nil in local dev. Fixing this allows us to run the local dev in
`:prod` mode to hit more codepaths.
2025-07-22 01:05:06 +00:00
Jamil
b5af132ae8 feat(portal): allow queue_target and queue_interval via ENV (#9943)
These parameters should be tuned to how long we expect "normal" queries
to take against the SQL instance. For smaller instances, "normal"
queries may take longer than 500ms, so we need to be able to configure
these via our Terraform configuration.

If not specified, the same defaults are used as before.

Related: https://github.com/firezone/infra/pull/82
2025-07-20 12:28:04 -07:00
Jamil
f379e85e9b refactor(portal): cache access state in channel pids (#9773)
When changes occur in the Firezone DB that trigger side effects, we need
some mechanism to broadcast and handle these.

Before, the system we used was:

- Each process subscribes to a myriad of topics related to data it wants
to receive. In some cases it would subscribe to new topics based on
received events from existing topics (I.e. flows in the gateway
channel), and sometimes in a loop. It would then need to be sure to
_unsubscribe_ from these topics
- Handle the side effect in the `after_commit` hook of the Ecto function
call after it completes
- Broadcast only a simply (thin) event message with a DB id
- In the receiver, use the id(s) to re-evaluate, or lookup one or many
records associated with the change
- After the lookup completes, `push` the relevant message(s) to the
LiveView, `client` pid, or `gateway` pid in their respective channel
processes

This system had a number of drawbacks ranging from scalability issues to
undesirable access bugs:

1. The `after_commit` callback, on each App node, is not globally
ordered. Since we broadcast a thin event schema and read from the DB to
hydrate each event, this meant we had a `read after write` problem in
our event architecture, leading to the potential for lost updates. Case
in point: if a policy is updated from `resource_id-1` to
`resource_id-2`, and then back to `resource_id-1`, it's possible that,
given the right amount of delay, the gateway channel will receive two
`reject_access` events for `resource_id-1`, as opposed to one for
`resource_id-1` and one for `resource_id-2`, leading to the potential
for unauthorized access.
1. It was very difficult to ensure that the correct topics were being
subscribed to and unsubscribed from, and the correct number of times,
leading to maintenance issues for other engineers.
1. We had a nasty N+1 query problem whenever memberships were added or
removed that resolved in essentially all access related to that
membership (so all Policies touching its actor group) to be
re-evaluated, and broadcasted. This meant that any bulk addition or
deletion of memberships would generate so many queries that they'd
timeout or consume the entire connection pool.
1. We had no durability for side-effect processing. In some places, we
were iterating over many returned records to send broadcasts.
Broadcasting is not a zero-time operation, each call takes a small
amount of CPU time to copy the message into the receiver's mailbox. If
we deployed while this was happening, the state update would be lost
forever. If this was a `reject_access` for a Gateway, the Gateway would
never remove access for that particular flow.
1. On each flow authorization, we needed to hit `us-east1` not only to
"authorize" the flow, but to log it as well. This incurs latency
especially for users in other parts of the world, which happens on
_each_ connection setup to a new resource.
1. Since we read and re-authorize access due to the thin events
broadcasted from side effects, we risk hitting thundering herd problems
(see the N+1 query problem above) where a single DB change could result
in all receivers hitting the DB at once to "hydrate" their
processing.ion
1. If an administrator modifies the DB directly, or, if we need to run a
DB migration that involves side effects, they'll be lost, because the
side effect triggers happened in `after_commit` hooks that are only
available when querying the DB through Ecto. Manually deleting (or
resurrecting) a policy, for example, would not have updated any
connected clients or gateways with the new state.


To fix all of the above, we move to the system introduced in this PR:

- All changes are now serialized (for free) by Postgres and broadcasted
as a single event stream
- The number of topics has been reduced to just one, the `account_id` of
an account. All receivers subscribe to this one topic for the lifetime
of their pid and then only filter the events they want to act upon,
ignoring all other messages
- The events themselves have been turned into "fat" structs based on the
schemas they present. By making them properly typed, we can apply things
like the existing Policy authorizer functions to them as if we had just
fetched them from the DB.
- All flow creation now happens in memory and doesn't not need to incur
a DB hit in `us-east1` to proceed.
- Since clients and gateways now track state in a push-based manner from
the DB, this means very few actual DB queries are needed to maintain
state in the channel procs, and it also means we can be smarter about
when to send `resource_deleted` and `resource_created_or_updated`
appropriately, since we can always diff between what the client _had_
access to, and what they _now_ have access to.
- All DB operations, whether they happen from the application code, a
`psql` prompt, or even via Google SQL Studio in the GCP console, will
trigger the _same_ side effects.
- We now use a replication consumer based off Postgres logical decoding
of the write-ahead log using a _durable slot_. This means that Postgres
will retain _all events_ until they are acknowledged, giving us the
ability to ensure at-least-once processing semantics for our system.
Today, the ACK is simply, "did we broadcast this event successfully".
But in the future, we can assert that replies are received before we
acknowledge the event as processed back to Postgres.



The tests in this PR have been updated to pass given the refactor.
However, since we are tracking more state now in the channel procs, it
would be a good idea to add more tests for those edge cases. That is
saved as a later PR because (1) this one is already huge, and (2) we
need to get this out to staging to smoke test everything anyhow.

Fixes: #9908 
Fixes: #9909 
Fixes: #9910
Fixes: #9900 
Related: #9501
2025-07-18 22:47:18 +00:00
Jamil
b20c141759 feat(portal): add batch-insert to change logs (#9733)
Inserting a change log incurs some minor overhead for sending query over
the network and reacting to its response. In many cases, this makes up
the bulk of the actual time it takes to run the change log insert.

To reduce this overhead and avoid any kind of processing delay in the
WAL consumers, we introduce batch insert functionality with size `500`
and timeout `30` seconds. If either of those two are hit, we flush the
batch using `insert_all`.

`insert_all` does not use `Ecto.Changeset`, so we need to be a bit more
careful about the data we insert, and check the inserted LSNs to
determine what to update the acknowledged LSN pointer to.

The functionality to determine when to call the new `on_flush/1`
callback lives in the replication_connection module, but the actual
behavior of `on_flush/1` is left to the child modules to implement. The
`Events.ReplicationConnection` module does not use flush behavior, and
so does not override the defaults, which is not to use a flush
mechanism.

Related: #949
2025-07-05 19:03:28 +00:00
Jamil
2a38c532af chore(portal): remove gateway masquerade option (#9790)
AFAIK these are ignored by connlib. Instead, we configure masquerading
on the host.
2025-07-04 21:08:11 +00:00
Jamil
865442eabe chore(portal): increase queue_target on bg nodes to 5000 (#9732)
The domain nodes that run background jobs aren't user-facing and as
such, don't need short queue_interval times. They can afford to wait a
few seconds for a connection from the pool to become available.

This should be a relatively low-risk change as it does not increase the
resources used on the app server and database, it only relaxes the
length of time we want for a connection to be available for us to use.
2025-06-30 19:33:01 +00:00
Jamil
dddd1b57fc refactor(portal): remove flow_activities (#9693)
This has been dead code for a long time. The feature this was meant to
support, #8353, will require a different domain model, views, and user
flows.

Related: #8353
2025-06-27 20:40:25 +00:00
Jamil
0b09d9f2f5 refactor(portal): don't rely on flows.expires_at (#9692)
The `expires_at` column on the `flows` table was never used outside of
the context in which the flow was created in the Client Channel. This
ephemeral state, which is created in the `Domain.Flows.authorize_flow/4`
function, is never read from the DB in any meaningful capacity, so it
can be safely removed.

The `expire_flows_for` family of functions now simply reads the needed
fields from the flows table in order to broadcast `{:expire_flow,
flow_id, client_id, resource_id}` directly to the subscribed entities.

This PR is step 1 in removing the reliance on `Flows` to manage
ephemeral access state. In a subsequent PR we will actually change the
structure of what state is kept in the channel PIDs such that reliance
on this Flows table will no longer be necessary.

Additionally, in a few places, we were referencing a Flows.Show view
that was never available in production, so this dead code has been
removed.

Lastly, the `flows` table subscription and associated hook processing
has been completely removed as it is no longer needed. We've implemented
in #9667 logic to remove publications from removed table subscriptions,
so we can expect to get a couple ingest warnings when we deploy this as
the `Hooks.Flows` processor no longer exists, and the WAL data may have
lingering flows records in the queue. These can be safely ignored.
2025-06-27 18:29:12 +00:00
Jamil
42e3027c34 fix(portal): use replication config in dev (#9676) 2025-06-25 21:02:01 +00:00
Jamil
bebc69e2bc fix(portal): use distinct slot names (#9672)
These were being configured using the same default `events_` value.
2025-06-25 17:28:17 +00:00
Jamil
95624211cd fix(portal): update publications when config changes (#9667)
Creating a table publication(s) (and associated replication slot) is
sticky. These will outlive the lifetime of the process that created
them.

We don't want to remove them on shutdown, because this will pause WAL
writing to disk.

However, when starting the _new_ application, it's possible
`table_subscriptions` has changed (such as if we decide we no longer
want events for a certain table). We weren't updating the created
publication(s) with these added/removed tables, so this PR updates the
replication connection setup state machine to pass through a few
conditionals to get these properly updated with the diff of old vs new.
2025-06-24 21:31:40 -07:00
Jamil
a9f49629ae feat(portal): add change_logs table and insert data (#9553)
Building on the WAL consumer that's been in development over the past
several weeks, we introduce a new `change_logs` table that stores very
lightly up-fitted data decoded from the WAL:

- `account_id` (indexed): a foreign key reference to an account.
- `inserted_at` (indexed): the timestamp of insert, for truncating rows
later.
- `table`: the table where the op took place.
- `op`: the operation performed (insert/update/delete)
- `old_data`: a nullable map of the old row data (update/delete)
- `data`: a nullable map of the new row data(insert/update)
- `vsn`: an integer version field we can bump to signify schema changes
in the data in case we need to apply operations to only new or only old
data.

Judging from our prod metrics, we're currently average about 1,000 write
operations a minute, which will generate about 1-2 dozen changelogs / s.
Doing the math on this, 30 days at our current volume will yield about
50M / month, which should be ok for some time, since this is an
append-only, rarely (if ever) read from table.

The one aspect of this we may need to handle sooner than later is
batch-inserting these. That raises an issue though - currently, in this
PR, we process each WAL event serially, ending with the final
acknowledgement `:ok` which will signal to Postgres our status in
processing the WAL.

If we do anything async here, this processing "cursor" then becomes
inaccurate, so we may need to think about what to track and what data we
care about.

Related: #7124
2025-06-25 02:06:20 +00:00
Jamil
caa21accf9 feat(portal): add mock sync adapter staging (#9660)
This needs to be enabled here too.
2025-06-24 19:08:58 +00:00
Jamil
c783b23bae refactor(portal): rename conditional->manual (#9612)
These only have one condition - to run manually. `manual migrations`
better implies that these migrations _must_ typically be run manually.
2025-06-21 21:17:33 +00:00
Jamil
ddb3dc8ce0 refactor(portal): compile_config macro to env_var_to_config (#9605)
The `compile_config` macro only works on environment and DB variables.
This caused recent confusion when determining where `database_pool_size`
was coming from.

To fix this issue, we rename `compile_config` to be more clear.

We also remove the technical debt around supporting "legacy keys" and
DB-based configuration.

The configuration compiler now works exclusively on environment
variables only, where it is still useful for:

- Casting environment variables to their expected type
- Alerting us when one is missing that should be set
2025-06-20 20:39:06 +00:00
Jamil
6cd077cd18 feat(portal): populate otel resource attributes (#9574)
If the otel collector runs on a different node from the elixir apps,
these won't be set anymore.

Luckily, all of the info we need is already in the application's env, so
we can simply copy them over to the needed attributes.

Related: https://github.com/firezone/infra/pull/47
Related:
https://github.com/firezone/infra/pull/47#discussion_r2149291472
Related:
https://opentelemetry.io/docs/languages/erlang/resources/#adding-resources-with-os-and-otp-application-environment-variables
Related: https://opentelemetry.io/docs/specs/semconv/resource/#service
2025-06-18 19:26:10 +00:00
Jamil
a20989a819 feat(portal): conditional migrations on prod (#9562)
Some migrations take a long time to run because they require locks or
modify large amounts of data. To prevent this from causing issues during
deploy, we leverage Ecto's native support for loading migrations from
multiple directories to introduce a `conditional_migrations/` directory
that houses any conditional migrations we want to run.

To run these migrations, you'll need to do one of the following:

- `dev, test`: The `mix ecto.migrate` will run them by default because
we have aliased this to load conditional_migrations for dev
- `prod`: Set the `RUN_CONDITIONAL_MIGRATIONS` env var to `true` before
starting a prod server using the `bin/migrate` script.
- `dev, test, prod`: Run `Domain.Release.migrate(conditional: true)`
from an IEx shell.

If conditional migrations were found that weren't executed during
`Domain.Release.migrate`, a warning is logged to remind us to run them.

---------

Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
2025-06-18 18:08:25 +00:00
Brian Manifold
d4c7b48754 refactor(portal): update asset config in portal (#9504)
Why:

* This commit brings our web app inline with how new Phoenix
applications manage and configure js/css/font assets. Along with that
this commit updates our Tailwind and esbuild tools.
2025-06-10 23:00:44 +00:00
Jamil
f58176a447 chore: remove docs writer (#9494)
This was added in an earlier era and will be just too cumbersome to
maintain going forward. We have OpenAPI docs which are more flexible.
2025-06-10 02:51:46 +00:00
Brian Manifold
6d425d5677 refactor(portal): add retry logic to Stripe API client (#9466)
Why:

* We've seen some Stripe API requests come back with 429 responses,
which likely could be retried and succeed. This commit adds some basic
retry logic to our Stripe API client.
2025-06-09 23:11:33 +00:00
Jamil
00a761ba22 feat(portal): add replication config (#9395) (#9404)
> This PR adds two configuration keys for the replication connection.

> Exemple usecase:
> If you run two firezone control planes on the same db cluster (like me
😂 ), you'll need to have two different replication slot names

Related: #9395

Co-authored-by: Antoine <antoinelabarussias@gmail.com>
2025-06-04 22:15:28 +00:00
Jamil
299fbcd096 fix(portal): Properly check background jobs (#8986)
The `background_jobs_enabled` config in an ENV var that needs to be set
for a specific configuration key. It's not set on the top-level
`:domain` config by default.

Instead, it's used to enable / disable specific modules to start by the
application's Supervisor.

The `Domain.Events.ReplicationConnection` module is updated in this PR
to follow this convention.
2025-05-01 16:32:43 +00:00
Jamil
8e054f5c74 fix(portal): Restrict WAL streaming to domain nodes only (#8956)
The `web` and `api` application use `domain` as a dependency in their
`mix.exs`. This means by default their Supervisor will start the
Domain's supervision tree as well.

The author did not realize this at the time of implementation, and so we
now leverage the convention in place for restricting tasks to `domain`
nodes, the `background_jobs_enabled` application configuration
parameter.

We also add an info log when the replication slot is being started so we
can verify the node it's starting on.
2025-05-01 13:28:40 +00:00
Jamil
c0a670d947 fix(portal): Restart ReplicationConnection using Supervisor (#8953)
When deploying, the cluster state diverges temporarily, which allows
more than one `ReplicationConnection` process to start on the new nodes.

(One of) the old nodes still has an active slot, and we get an "object
in use" error `(Postgrex.Error) ERROR 55006 (object_in_use) replication
slot "events_slot" is active for PID 603037`.

Rather than use ReplicationConnection's restart behavior (which logs
tons of errors with Logger.error), we can use the Supervisor here
instead, and continue to try and start the ReplicationConnection until
successful.

Note that if the process name is registered (globally) and running,
ReplicationConnection.start_link/1 simply returns `{:ok, pid}` instead
of erroring out with `:already_running`, so eventually one of the nodes
will succeed and the remaining ones will return the globally-registered
pid.
2025-05-01 03:48:35 +00:00
Jamil
1f8090c60d fix(portal): use existing database user for replication (#8950)
Turns out we are making replication overly complex by creating a
dedicated user for it. The `web` user is already privileged and we can
reuse it since the replication system operates in the same security
context as the remaining app.
2025-04-30 11:19:14 -07:00
Jamil
a98a9867af fix(portal): Redact entire connection_opts param (#8946)
The LoggerJSON Redactor only redacts top-level keys, so we need to
redact the entire `connection_opts` param to redact its contained
password.

We also don't need to pass around `connection_opts` across the entire
ReplicationConnection process state, only for the initial connection, so
we refactor that out of the `state`.
2025-04-30 16:33:21 +00:00
Jamil
968db2ae39 feat(portal): Receive WAL events (#8909)
Firezone's control plane is a realtime, distributed system that relies
on a broadcast/subscribe system to function. In many cases, these events
are broadcasted whenever relevant data in the DB changes, such as an
actor losing access to a policy, a membership being deleted, and so
forth.

Today, this is handled in the application layer, typically happening at
the place where the relevant DB call is made (i.e. in an
`after_commit`). While this approach has worked thus far, it has several
issues:

1. We have no guarantee that the DB change will issue a broadcast. If
the application is deployed or the process crashes after the DB changes
are made but before the broadcast happens, we will have potentially
failed to update any connected clients or gateways with the changes.
2. We have no guarantee that the order of DB updates will be maintained
in order for broadcasts. In other words, app server A could win its DB
operation against app server B, but then proceed to lose being the first
to broadcast.
3. If the cluster is in a bad state where broadcasts may return an error
(i.e. https://github.com/firezone/firezone/issues/8660), we will never
retry the broadcast.

To fix the above issues, we introduce a WAL logical decoder that process
the event stream one message at a time and performs any needed work.
Serializability is guaranteed since we only process the WAL in a single,
cluster-global process, `ReplicationConnection`. Durability is also
guaranteed since we only ACK WAL segments after we've successfully
ingested the event.

This means we will only advance the position of our WAL stream after
successfully broadcasting the event.

This PR only introduces the WAL stream processing system but does not
introduce any changes to our current broadcasting behavior - that's
saved for another PR.
2025-04-29 23:53:06 -07:00
Jamil
1a1c812f66 fix(portal): Set migration_lock to advisory lock (#8902)
The migration that failed today got hung up on a global migration lock.
This PR would alleviate that if we also run the index creation
concurrently, which we should do in many cases.

See
https://hexdocs.pm/ecto_sql/Ecto.Migration.html#index/3-adding-dropping-indexes-concurrently
2025-04-24 20:26:01 +00:00
Jamil
2bbc0abc3a feat(portal): Add Oban (#8786)
Our current bespoke job system, while it's worked out well so far, has
the following shortcomings:

- No retry logic
- No robust to guarantee job isolation / uniqueness without resorting to
row-level locking
- No support for cron-based scheduling

This PR adds the boilerplate required to get started with
[Oban](https://hexdocs.pm/oban/Oban.html), the job management system for
Elixir.
2025-04-15 03:56:49 +00:00
Jamil
649c03e290 chore(portal): Bump LoggerJSON to 7.0.0, fixing config (#8759)
There was slight API change in the way LoggerJSON's configuration is
generation, so I took the time to do a little fixing and cleanup here.

Specifically, we should be using the `new/1` callback to create the
Logger config which fixes the below exception due to missing config
keys:

```
FORMATTER CRASH: {report,[{formatter_crashed,'Elixir.LoggerJSON.Formatters.GoogleCloud'},{config,[{metadata,{all_except,[socket,conn]}},{redactors,[{'Elixir.LoggerJSON.Redactors.RedactKeys',[<<"password">>,<<"secret">>,<<"nonce">>,<<"fragment">>,<<"state">>,<<"token">>,<<"public_key">>,<<"private_key">>,<<"preshared_key">>,<<"session">>,<<"sessions">>]}]}]},{log_event,#{meta => #{line => 15,pid => <0.308.0>,time => 1744145139650804,file => "lib/logger.ex",gl => <0.281.0>,domain => [elixir],application => libcluster,mfa => {'Elixir.Cluster.Logger',info,2}},msg => {string,<<"[libcluster:default] connected to :\"web@web.cluster.local\"">>},level => info}},{reason,{error,{badmatch,[{metadata,{all_except,[socket,conn]}},{redactors,[{'Elixir.LoggerJSON.Redactors.RedactKeys',[<<"password">>,<<"secret">>,<<"nonce">>,<<"fragment">>,<<"state">>,<<"token">>,<<"public_key">>,<<"private_key">>,<<"preshared_key">>,<<"session">>,<<"sessions">>]}]}]},[{'Elixir.LoggerJSON.Formatters.GoogleCloud',format,2,[{file,"lib/logger_json/formatters/google_cloud.ex"},{line,148}]}]}}]}
```

Supersedes #8714

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-04-11 19:00:06 -07:00
Jamil
d2fd57a3b6 fix(portal): Attach Sentry in each umbrella app (#8749)
- Attaches the Sentry Logging hook in each of [api, web, domain]
- Removes errant Sentry logging configuration in config/config.exs
- Fixes the exception logger to default to logging exceptions, use
`skip_sentry: true` to skip

Tested successfully in dev. Hopefully the cluster behaves the same way.

Fixes #8639
2025-04-11 04:17:12 +00:00
Jamil
ce82859cd4 fix(portal): Disable mock sync job in prod (#8606)
The adapter itself isn't enabled in the UI on prod, but the background
job to sync mock data was. This prevents the job from being started and
emitting log noise into production logs.
2025-04-01 18:24:48 +00:00
Brian Manifold
d133ee84b7 feat(portal): Add API rate limiting (#8417) 2025-03-13 03:21:09 +00:00
Jamil
e3897aebd8 feat(portal): Add Mock sync adapter and more seeds (#8370)
- Adds more actor groups to the existing `oidc_provider`
- Configures a rand seed so our seed data is reproducible across
machines
- Formats the seeds file to allow for some refactoring a later PR
- Adds a `Mock` identity provider adapter with sync enabled
2025-03-07 09:37:32 -08:00
Jamil
e064cf5821 fix(portal): Debounce relays_presence (#8302)
If the websocket connection between a relay and the portal experiences a
temporary network split, the portal will immediately send the
disconnected id of the relay to any connected clients and gateways, and
all relayed connections (and current allocations) will be immediately
revoked by connlib.

This tight coupling is needlessly disruptive. As we've seen in staging
and production logs, relay disconnects can happen randomly, and in the
vast majority of cases immediately reconnect. Currently we see about 1-2
dozen of these **per day**.

To better account for this, we introduce a debounce mechanism in the
portal for `relays_presence` disconnects that works as follows:

- When a relay disconnects, record its `stamp_secret` (this is somewhat
tricky as we don't get this at the time of disconnect - we need to cache
it by relay_id beforehand)
- If the same `relay_id` reconnects again with the same `stamp_secret`
within `relays_presence_debounce_timeout` -> no-op
- If the same `relay_id` reconnects again with a **different**
`stamp_secret` -> disconnect immediately
- If it doesn't reconnect, **then** send the `relays_presence` with the
disconnected_id after the `relays_presence_debounce_timeout`

There are several ways connlib detects a relay is down:

1. Binding requests time out. These happen every 25s, so on average we
don't know a Relay is down for 12.5s + backoff timer.
2. `relays_presence` - this is currently the fastest way to detect
relays are down. With this change, the caveat is we will now detect this
with a delay of `relays_presence_debounce_timer`.

Fixes #8301
2025-03-04 23:56:40 +00:00
Jamil
0f4f20bd9c fix(elixir): Fix conditional in sentry clase in runtime.exs (#8188) 2025-02-18 17:50:18 -08:00
Jamil
6232f1a27e fix(elixir): Don't start sentry in unknown environments (#8185)
This ensures Sentry doesn't start in unknown `prod` environments.
2025-02-18 17:24:26 -08:00
Jamil
28559a317f chore(portal): Optionally drop NotFoundError to sentry (#8183)
By specifying the `before_send` hook, we can easily drop events based on
their data, such as `original_exception` which contains the original
exception instance raised.

Leveraging this, we can add a `report_to_sentry` parameter to
`Web.LiveErrors.NotFound` to optionally ignore certain not found errors
from going to Sentry.
2025-02-18 21:55:23 +00:00
Jamil
93a88563f3 feat(portal): allow socket based postgres connections (#8044) (#8097)
This allows connections to the postgresql database via the standard
socket, which - opposed to TCP sockets - allows `peer` authentication
based on local unix users. This removes the need for a password and is
much simpler to deploy when running components locally.

In the current form, `DATABASE_SOCKET_DIR` takes precedence over
hostname, if the environment variable is present. I found that
`compile_config!` somehow enforces a value to be present which is
explicitly not what I want for some of these values (i think). I'd be
glad if anyone with more elixir experience can guide me as to how I can
make this more idiomatic.

---------

Supersedes: #8044

Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: oddlama <oddlama@oddlama.org>
2025-02-11 19:25:00 -08:00
Jamil
5bac3f5ec2 fix(infra): Don't send more/faster metrics than Google accepts (#8028)
We are getting quite a few of these warnings on prod:

```
{400, "{\n  \"error\": {\n    \"code\": 400,\n    \"message\": \"One or more TimeSeries could not be written: timeSeries[0-39]: write for resource=gce_instance{zone:us-east1-d,instance_id:2678918148122610092} failed with: One or more points were written more frequently than the maximum sampling period configured for the metric.\",\n    \"status\": \"INVALID_ARGUMENT\",\n    \"details\": [\n      {\n        \"@type\": \"type.googleapis.com/google.monitoring.v3.CreateTimeSeriesSummary\",\n        \"totalPointCount\": 40,\n        \"successPointCount\": 31,\n        \"errors\": [\n          {\n            \"status\": {\n              \"code\": 9\n            },\n            \"pointCount\": 9\n          }\n        ]\n      }\n    ]\n  }\n}\n"}
```

Since the point count is _much_ less than our flush buffer size of 1000,
we can only surmise the limit we're hitting is the flush interval.

The telemetry metrics reporter is run on each node, so we run the risk
of violating Google's API limit regardless of what a single node's
`@flush_interval` is set to.

To solve this, we use a new table `telemetry_reporter_logs` that stores
the last time a particular `flush` occurred for a reporter module. This
tracks global state as to when the last flush occurred, and if too
recent, the timer-based flush is call is `no-op`ed until the next one.

**Note**: The buffer-based `flush` is left unchanged, this will always
be called when `buffer_size > max_buffer_size`.
2025-02-10 18:21:40 +00:00