31 Commits

Author SHA1 Message Date
Jamil
cafe6554ff refactor(portal): reduce cache memory usage (#10058)
Napkin math shows that we can save substantial memory (~3x or more) on
the API nodes as connected clients/gateways grow if we store only the
fields needed to maintain client and gateway state in the channel pids.

To facilitate this, we create new `Cacheable` structs that represent
their `Domain` cousins, which use byte arrays for `id`s and strip out
unused fields.
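A minimal sketch of the idea; the module and field names below are illustrative
assumptions, not the actual structs added in this PR:

```elixir
# Hypothetical sketch only - field names are assumptions.
defmodule Domain.Cache.Cacheable.Client do
  @moduledoc "Slimmed-down cache representation of a Domain.Clients.Client."

  # ids are stored as raw 16-byte UUIDs rather than strings, and fields the
  # channel never reads are dropped entirely.
  defstruct [:id, :account_id, :last_seen_version, :verified_at]

  @type t :: %__MODULE__{
          id: <<_::128>>,
          account_id: <<_::128>>,
          last_seen_version: String.t(),
          verified_at: DateTime.t() | nil
        }
end
```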

Additionally, all business logic involved with maintaining these caches
is now contained within two modules: `Domain.Cache.Client` and
`Domain.Cache.Gateway`, and type specs have been added to aid in static
analysis and code documentation.

Comprehensive tests have been added not only for the cache modules but
also for their associated channel modules, to ensure we handle
different kinds of edge cases gracefully.

The `Events` nomenclature was renamed to `Changes` to better describe what
we are doing: change data capture.

Lastly, the following related changes are included in this PR because they
were "in the way," so to speak, of getting this done:

- We save the last received LSN in each channel and drop the `change`
with a warning if we receive it twice in a row or out of order
- The client/gateway version compatibility calculations have been moved
to `Domain.Resources` and `Domain.Gateways` and have been simplified to
make them easier to understand and maintain going forward.


Related: #10174 
Fixes: #9392 
Fixes: #9965
Fixes: #9501 
Fixes: #10227

---------

Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-08-22 21:52:29 +00:00
Jamil
4a448e5517 fix(portal): separate dev and runtime Oban configs (#10027)
Oban includes its own configuration validation, which seems to prevent
`runtime.exs` from overriding compile-time options. This prevents us
from using ENV vars to configure it, such as restricting job execution
to `domain` nodes by setting `queues: []`. To fix that, we now set Oban
configuration in the env-specific files `config/dev.exs` and
`config/test.exs`, and at runtime for prod in `config/runtime.exs`.
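A sketch of the runtime-driven prod config; the OTP app name and the
`BACKGROUND_JOBS_ENABLED` env var are invented for illustration:

```elixir
# config/runtime.exs (sketch) - evaluated at boot, so ENV vars can decide
# whether this node runs jobs. App name and env var are assumptions.
import Config

if config_env() == :prod do
  queues =
    if System.get_env("BACKGROUND_JOBS_ENABLED", "true") == "true",
      do: [default: 10],
      else: []

  config :domain, Oban, repo: Domain.Repo, queues: queues
end
```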

Fixes #10016
2025-07-28 15:13:52 +00:00
Jamil
b20c141759 feat(portal): add batch-insert to change logs (#9733)
Inserting a change log incurs some minor overhead for sending the query over
the network and reacting to its response. In many cases, this overhead makes up
the bulk of the time it actually takes to insert the change log.

To reduce this overhead and avoid any kind of processing delay in the
WAL consumers, we introduce batch-insert functionality with a batch size of
`500` and a timeout of `30` seconds. If either limit is hit, we flush the
batch using `insert_all`.

`insert_all` does not use `Ecto.Changeset`, so we need to be a bit more
careful about the data we insert, and check the inserted LSNs to
determine what to update the acknowledged LSN pointer to.

The logic that determines when to call the new `on_flush/1`
callback lives in the replication_connection module, but the actual
behavior of `on_flush/1` is left to the child modules to implement. The
`Events.ReplicationConnection` module does not use flush behavior, so it
does not override the defaults, which disable the flush mechanism.
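A simplified sketch of the flush path; the module name, state shape, and repo
name are assumptions for illustration:

```elixir
defmodule ChangeLogs.Buffer do
  # Illustrative only - not the actual module added in this PR.
  @max_batch_size 500

  # Called for every decoded WAL change; a separate 30-second timer also
  # triggers flush/1 so small batches don't sit around indefinitely.
  def push(state, lsn, attrs) do
    state = %{state | buffer: [{lsn, attrs} | state.buffer]}
    if length(state.buffer) >= @max_batch_size, do: flush(state), else: state
  end

  def flush(%{buffer: []} = state), do: state

  def flush(%{buffer: buffer} = state) do
    # insert_all bypasses Ecto.Changeset, so attrs must already be sanitized
    # maps whose keys match the change_logs columns.
    entries = buffer |> Enum.reverse() |> Enum.map(fn {_lsn, attrs} -> attrs end)
    {_count, _} = Domain.Repo.insert_all("change_logs", entries)

    # Advance the acknowledged LSN pointer to the newest entry we persisted.
    acked_lsn = buffer |> Enum.map(&elem(&1, 0)) |> Enum.max()
    %{state | buffer: [], acked_lsn: acked_lsn}
  end
end
```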

Related: #949
2025-07-05 19:03:28 +00:00
Jamil
bebc69e2bc fix(portal): use distinct slot names (#9672)
These were being configured using the same default `events_` value.
2025-06-25 17:28:17 +00:00
Jamil
a9f49629ae feat(portal): add change_logs table and insert data (#9553)
Building on the WAL consumer that's been in development over the past
several weeks, we introduce a new `change_logs` table that stores lightly
up-fitted data decoded from the WAL (see the migration sketch after this list):

- `account_id` (indexed): a foreign key reference to an account
- `inserted_at` (indexed): the timestamp of the insert, used later to truncate
old rows
- `table`: the table where the op took place
- `op`: the operation performed (insert/update/delete)
- `old_data`: a nullable map of the old row data (update/delete)
- `data`: a nullable map of the new row data (insert/update)
- `vsn`: an integer version field we can bump to signal schema changes in
the data, in case we need to apply operations to only new or only old
data
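A hypothetical migration matching the description above (primary key, column
types, and defaults are assumptions):

```elixir
defmodule Domain.Repo.Migrations.CreateChangeLogs do
  use Ecto.Migration

  def change do
    create table(:change_logs, primary_key: false) do
      add :id, :uuid, primary_key: true
      add :account_id, references(:accounts, type: :uuid), null: false
      add :table, :string, null: false
      add :op, :string, null: false
      add :old_data, :map
      add :data, :map
      add :vsn, :integer, null: false, default: 0
      add :inserted_at, :utc_datetime_usec, null: false
    end

    create index(:change_logs, [:account_id])
    create index(:change_logs, [:inserted_at])
  end
end
```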

Judging from our prod metrics, we're currently averaging about 1,000 write
operations a minute, which will generate about 1-2 dozen change logs / s.
Doing the math on this, 30 days at our current volume will yield about
50M rows / month, which should be OK for some time, since this is an
append-only table that is rarely (if ever) read.

The one aspect of this we may need to handle sooner rather than later is
batch-inserting these. That raises an issue, though: currently, in this
PR, we process each WAL event serially, ending with the final
acknowledgement `:ok`, which signals to Postgres our position in
processing the WAL.

If we do anything async here, this processing "cursor" then becomes
inaccurate, so we may need to think about what to track and what data we
care about.

Related: #7124
2025-06-25 02:06:20 +00:00
Jamil
c783b23bae refactor(portal): rename conditional->manual (#9612)
These only have one condition - to run manually. The name `manual migrations`
better conveys that these migrations typically _must_ be run manually.
2025-06-21 21:17:33 +00:00
Jamil
a20989a819 feat(portal): conditional migrations on prod (#9562)
Some migrations take a long time to run because they require locks or
modify large amounts of data. To prevent this from causing issues during
deploy, we leverage Ecto's native support for loading migrations from
multiple directories to introduce a `conditional_migrations/` directory
that houses any conditional migrations we want to run.

To run these migrations, you'll need to do one of the following:

- `dev, test`: `mix ecto.migrate` will run them by default because we
have aliased it to load `conditional_migrations/` in dev
- `prod`: Set the `RUN_CONDITIONAL_MIGRATIONS` env var to `true` before
starting a prod server using the `bin/migrate` script.
- `dev, test, prod`: Run `Domain.Release.migrate(conditional: true)`
from an IEx shell.

If conditional migrations are found that weren't executed during
`Domain.Release.migrate`, a warning is logged to remind us to run them.
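Roughly how the multi-directory loading could look; paths and option handling
are simplified assumptions (a real release would resolve paths with
`Application.app_dir/2`):

```elixir
# Simplified sketch of Domain.Release.migrate/1 - not the actual release code.
defmodule Domain.Release do
  def migrate(opts \\ []) do
    conditional? =
      Keyword.get(opts, :conditional, false) or
        System.get_env("RUN_CONDITIONAL_MIGRATIONS") == "true"

    paths =
      if conditional?,
        do: ["priv/repo/migrations", "priv/repo/conditional_migrations"],
        else: ["priv/repo/migrations"]

    # Ecto.Migrator accepts a list of directories as the migration source.
    Ecto.Migrator.run(Domain.Repo, paths, :up, all: true)
  end
end
```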

---------

Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
2025-06-18 18:08:25 +00:00
Jamil
f58176a447 chore: remove docs writer (#9494)
This was added in an earlier era and will just be too cumbersome to
maintain going forward. We have OpenAPI docs, which are more flexible.
2025-06-10 02:51:46 +00:00
Brian Manifold
6d425d5677 refactor(portal): add retry logic to Stripe API client (#9466)
Why:

* We've seen some Stripe API requests come back with 429 responses,
which would likely succeed if retried. This commit adds some basic
retry logic to our Stripe API client.
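A minimal sketch of what such retry logic could look like; the module name,
attempt count, and backoff are assumptions rather than the actual
implementation:

```elixir
defmodule Billing.Stripe.Retry do
  @max_attempts 3

  # Wraps a request function and retries rate-limited (429) responses.
  def request(fun, attempt \\ 1) do
    case fun.() do
      {:error, %{status: 429}} when attempt < @max_attempts ->
        # Simple linear backoff before retrying
        Process.sleep(attempt * 250)
        request(fun, attempt + 1)

      other ->
        other
    end
  end
end

# Usage (fetch_customer/1 is hypothetical):
# Billing.Stripe.Retry.request(fn -> fetch_customer(customer_id) end)
```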
2025-06-09 23:11:33 +00:00
Jamil
968db2ae39 feat(portal): Receive WAL events (#8909)
Firezone's control plane is a realtime, distributed system that relies
on a broadcast/subscribe system to function. In many cases, these events
are broadcasted whenever relevant data in the DB changes, such as an
actor losing access to a policy, a membership being deleted, and so
forth.

Today, this is handled in the application layer, typically at
the place where the relevant DB call is made (e.g. in an
`after_commit`). While this approach has worked thus far, it has several
issues:

1. We have no guarantee that the DB change will issue a broadcast. If
the application is deployed or the process crashes after the DB changes
are made but before the broadcast happens, we will have potentially
failed to update any connected clients or gateways with the changes.
2. We have no guarantee that the order of DB updates will be preserved
in the broadcasts. In other words, app server A could win its DB
operation against app server B, but then lose the race to be the first
to broadcast.
3. If the cluster is in a bad state where broadcasts may return an error
(e.g. https://github.com/firezone/firezone/issues/8660), we will never
retry the broadcast.

To fix the above issues, we introduce a WAL logical decoder that processes
the event stream one message at a time and performs any needed work.
Serializability is guaranteed since we only process the WAL in a single,
cluster-global process, `ReplicationConnection`. Durability is also
guaranteed since we only ACK WAL segments after we've successfully
ingested the event.

This means we will only advance the position of our WAL stream after
successfully broadcasting the event.
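For illustration, a sketch of the ack-after-ingest idea, loosely modeled on
`Postgrex.ReplicationConnection`'s `handle_data/2` callback; `decode_wal/1`,
`handle_event/1`, and `standby_status_update/1` are hypothetical helpers:

```elixir
# Sketch of one callback clause only - the real module also handles
# keepalives, relation messages, etc.
def handle_data(<<?w, _wal_start::64, _wal_end::64, _clock::64, payload::binary>>, state) do
  # Decode and handle the event *before* acknowledging, so a crash here means
  # Postgres redelivers from the last confirmed LSN instead of losing the event.
  {:ok, lsn, event} = decode_wal(payload)
  :ok = handle_event(event)

  # Only now advance the confirmed LSN via a standby status update message.
  {:noreply, [standby_status_update(lsn)], %{state | last_lsn: lsn}}
end
```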

This PR only introduces the WAL stream processing system but does not
introduce any changes to our current broadcasting behavior - that's
saved for another PR.
2025-04-29 23:53:06 -07:00
Jamil
2bbc0abc3a feat(portal): Add Oban (#8786)
Our current bespoke job system, while it's worked out well so far, has
the following shortcomings:

- No retry logic
- No robust way to guarantee job isolation / uniqueness without resorting to
row-level locking
- No support for cron-based scheduling

This PR adds the boilerplate required to get started with
[Oban](https://hexdocs.pm/oban/Oban.html), the job management system for
Elixir.
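As an illustration of what that buys us, a hypothetical worker (the module
name and args are made up):

```elixir
defmodule Domain.Jobs.ExampleWorker do
  use Oban.Worker,
    queue: :default,
    max_attempts: 5,       # built-in retries with backoff
    unique: [period: 60]   # uniqueness without hand-rolled row locking

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"account_id" => account_id}}) do
    # Do the work; returning {:error, reason} or raising schedules a retry.
    do_work(account_id)
  end

  defp do_work(_account_id), do: :ok
end

# Cron-based scheduling is configured via Oban.Plugins.Cron, e.g.:
# plugins: [{Oban.Plugins.Cron, crontab: [{"0 * * * *", Domain.Jobs.ExampleWorker}]}]
```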
2025-04-15 03:56:49 +00:00
Jamil
e064cf5821 fix(portal): Debounce relays_presence (#8302)
If the websocket connection between a relay and the portal experiences a
temporary network split, the portal will immediately send the
disconnected id of the relay to any connected clients and gateways, and
all relayed connections (and current allocations) will be immediately
revoked by connlib.

This tight coupling is needlessly disruptive. As we've seen in staging
and production logs, relay disconnects can happen randomly, and in the
vast majority of cases immediately reconnect. Currently we see about 1-2
dozen of these **per day**.

To better account for this, we introduce a debounce mechanism in the
portal for `relays_presence` disconnects that works as follows:

- When a relay disconnects, record its `stamp_secret` (this is somewhat
tricky as we don't get this at the time of disconnect - we need to cache
it by relay_id beforehand)
- If the same `relay_id` reconnects again with the same `stamp_secret`
within `relays_presence_debounce_timeout` -> no-op
- If the same `relay_id` reconnects again with a **different**
`stamp_secret` -> disconnect immediately
- If it doesn't reconnect, **then** send the `relays_presence` with the
disconnected_id after the `relays_presence_debounce_timeout`

There are several ways connlib detects a relay is down:

1. Binding requests time out. These happen every 25s, so on average it
takes 12.5s plus the backoff timer before we know a Relay is down.
2. `relays_presence` - this is currently the fastest way to detect that
relays are down. With this change, the caveat is that we will now detect
this with a delay of `relays_presence_debounce_timeout`.
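A simplified sketch of the debounce, assuming a per-process map of pending
disconnects keyed by relay_id; module, state shape, and PubSub names are
illustrative, not the actual implementation:

```elixir
defmodule RelaysPresenceDebounce do
  # Illustrative sketch; the real timeout comes from config.
  @debounce_timeout :timer.seconds(10)

  def relay_left(relay_id, state) do
    # Don't notify clients/gateways yet - give the relay a chance to reconnect.
    timer = Process.send_after(self(), {:debounce_expired, relay_id}, @debounce_timeout)
    put_in(state.pending[relay_id], {state.stamp_secrets[relay_id], timer})
  end

  def relay_joined(relay_id, stamp_secret, state) do
    case Map.pop(state.pending, relay_id) do
      {{^stamp_secret, timer}, pending} ->
        # Same stamp_secret within the window: clean reconnect, so no-op.
        Process.cancel_timer(timer)
        %{state | pending: pending}

      {{_different_secret, timer}, pending} ->
        # Different stamp_secret: the relay restarted, tell peers to drop it now.
        Process.cancel_timer(timer)
        broadcast_disconnect(relay_id)
        %{state | pending: pending}

      {nil, _pending} ->
        state
    end
  end

  # Invoked when the debounce timer fires without a reconnect.
  def debounce_expired(relay_id, state) do
    broadcast_disconnect(relay_id)
    update_in(state.pending, &Map.delete(&1, relay_id))
  end

  defp broadcast_disconnect(relay_id) do
    # Stand-in for pushing `relays_presence` with the disconnected id;
    # the PubSub name and topic are assumptions.
    Phoenix.PubSub.broadcast(Domain.PubSub, "relays", {:relays_presence, disconnected_ids: [relay_id]})
  end
end
```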

Fixes #8301
2025-03-04 23:56:40 +00:00
Jamil
28559a317f chore(portal): Optionally drop NotFoundError to sentry (#8183)
By specifying the `before_send` hook, we can easily drop events based on
their data, such as `original_exception` which contains the original
exception instance raised.

Leveraging this, we can add a `report_to_sentry` parameter to
`Web.LiveErrors.NotFound` to optionally keep certain not-found errors
from being sent to Sentry.
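A sketch of the hook; the exact config shape and where `report_to_sentry`
lives are assumptions based on the description above:

```elixir
# config/config.exs (sketch)
config :sentry, before_send: {Web.Sentry, :before_send}

# Hypothetical hook module
defmodule Web.Sentry do
  # Returning nil drops the event; anything else is sent as-is.
  def before_send(%{original_exception: %Web.LiveErrors.NotFound{report_to_sentry: false}}),
    do: nil

  def before_send(event), do: event
end
```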
2025-02-18 21:55:23 +00:00
Jamil
5bac3f5ec2 fix(infra): Don't send more/faster metrics than Google accepts (#8028)
We are getting quite a few of these warnings on prod:

```
{400, "{\n  \"error\": {\n    \"code\": 400,\n    \"message\": \"One or more TimeSeries could not be written: timeSeries[0-39]: write for resource=gce_instance{zone:us-east1-d,instance_id:2678918148122610092} failed with: One or more points were written more frequently than the maximum sampling period configured for the metric.\",\n    \"status\": \"INVALID_ARGUMENT\",\n    \"details\": [\n      {\n        \"@type\": \"type.googleapis.com/google.monitoring.v3.CreateTimeSeriesSummary\",\n        \"totalPointCount\": 40,\n        \"successPointCount\": 31,\n        \"errors\": [\n          {\n            \"status\": {\n              \"code\": 9\n            },\n            \"pointCount\": 9\n          }\n        ]\n      }\n    ]\n  }\n}\n"}
```

Since the point count is _much_ less than our flush buffer size of 1000,
we can only surmise the limit we're hitting is the flush interval.

The telemetry metrics reporter is run on each node, so we run the risk
of violating Google's API limit regardless of what a single node's
`@flush_interval` is set to.

To solve this, we use a new table `telemetry_reporter_logs` that stores
the last time a `flush` occurred for a given reporter module. This
tracks global state as to when the last flush occurred, and if it was too
recent, the timer-based flush call is no-op'ed until the next one.

**Note**: The buffer-based `flush` is left unchanged; it will always
be called when `buffer_size > max_buffer_size`.
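A conceptual sketch of the throttle; only the table name and the
skip-if-too-recent behavior come from this change, while the column names,
repo, and interval are assumptions:

```elixir
defmodule TelemetryReporterThrottle do
  import Ecto.Query

  @min_interval_seconds 60

  # Called from the timer-based flush; the buffer-size flush skips this check.
  def maybe_flush(reporter, flush_fun) do
    last =
      Domain.Repo.one(
        from l in "telemetry_reporter_logs",
          where: l.reporter_module == ^to_string(reporter),
          select: l.last_flushed_at
      )

    if is_nil(last) or NaiveDateTime.diff(NaiveDateTime.utc_now(), last) >= @min_interval_seconds do
      record_flush(reporter)
      flush_fun.()
    else
      :noop
    end
  end

  defp record_flush(reporter) do
    Domain.Repo.insert_all(
      "telemetry_reporter_logs",
      [%{reporter_module: to_string(reporter), last_flushed_at: NaiveDateTime.utc_now()}],
      on_conflict: {:replace, [:last_flushed_at]},
      conflict_target: [:reporter_module]
    )
  end
end
```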
2025-02-10 18:21:40 +00:00
Brian Manifold
7fda4c52c4 feat(portal): Add outdated gateway notifications (#6841)
Why:

* Without some type of notification, users do not realize that new
Gateway versions have been released and thus do not seem to be upgrading
their deployed Gateways.
2024-10-11 12:46:00 +00:00
Andrew Dryga
3652839b1a feat(portal): Allow updating policies and resources (#6690)
Now you can "edit" any fields on the policy, when one of fields that
govern the access is changed (resource, actor group or conditions) a new
policy will be created and an old one is deleted. This will be
broadcasted to the clients right away to minimize downtime. New policy
will have it's own flows to prevent confusion while auditing. To make
experience better for external systems we added `persistent_id` that
will be the same across all versions of a given policy.

Resources work in a similar fashion but when they are replaced we will
also replace all corresponding policies.

An additional nice effect of this approach is that we also got
configuration audit log for resources and policies.

Fixes #2504
2024-09-18 13:06:05 -06:00
Brian Manifold
716623a993 feat(portal): Add IDP sync error email notifications (#6483)
This adds a feature that will email all admins in a Firezone Account
when sync errors occur with their Identity Provider.

In order to avoid spamming admins with sync error emails, the error
emails are only sent once every 24 hours. One exception to that: when
there is a successful sync, the `sync_error_emailed_at` field is reset,
which means that, in theory, if an identity provider were flip-flopping
between successful and unsuccessful syncs, the admins could be emailed
more than once in a 24-hour period.
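A rough sketch of the throttle check; the module, helper functions, and exact
time arithmetic are assumptions, only the `sync_error_emailed_at` field and
the reset-on-success behavior come from the description above:

```elixir
defmodule SyncErrorNotifier do
  @throttle_seconds 24 * 60 * 60

  # Hypothetical sketch - called after a failed sync.
  def maybe_email_admins(provider) do
    last = provider.sync_error_emailed_at

    if is_nil(last) or DateTime.diff(DateTime.utc_now(), last) >= @throttle_seconds do
      email_admins(provider)
      # Record the send so we stay quiet for the next 24 hours.
      update_provider(provider, sync_error_emailed_at: DateTime.utc_now())
    end
  end

  # A successful sync clears the timestamp so the next failure emails again.
  def sync_succeeded(provider) do
    update_provider(provider, sync_error_emailed_at: nil)
  end

  defp email_admins(_provider), do: :ok
  defp update_provider(provider, attrs), do: struct!(provider, attrs)
end
```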

### Sample Email Message
<img width="589" alt="idp-sync-error-message"
src="https://github.com/user-attachments/assets/d7128c7c-c10d-4d02-8283-059e2f1f5db5">
2024-09-18 15:29:50 +00:00
Jamil
bfbf570191 ci: Increase default assert_receive timeout to 500ms from 100ms (#5417)
We seem to be hitting `assert_receive`-style timeouts much more frequently
after "upgrading" to Enterprise Cloud (our credits expired; I was able to
renew them).

This updates the global timeout to 500ms for `assert_receive` to reduce
the likelihood `assert_push` and friends will time out on slow GH
runners.

E.g.


https://github.com/firezone/firezone/actions/runs/9556532328/job/26341986456
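Assuming this is set via ExUnit configuration (the exact location is an
assumption), e.g.:

```elixir
# test/test_helper.exs (sketch)
ExUnit.configure(assert_receive_timeout: 500)
ExUnit.start()
```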

---------

Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
2024-06-17 18:35:11 -07:00
Brian Manifold
26d8f7eab3 feat(portal): Add WorkOS/JumpCloud integration (#5269)
Why:

* JumpCloud directory sync was requested by customers. JumpCloud only
offers the ability to use its API with an admin-level access token that
is tied to a specific user within a given JumpCloud account. This would
require Firezone customers to hand over an access token with far more
permissions than needed for our directory sync. To avoid this, we've
decided to use WorkOS to provide SCIM support between JumpCloud and
WorkOS, which allows Firezone to easily and safely retrieve
JumpCloud directory info from WorkOS.

---------

Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
2024-06-12 15:45:33 +00:00
Andrew Dryga
a7e54686b0 feat(portal): Track page views and sign ups using Mixpanel and HubSpot on public pages (#5050)
Fixes firezone/gtm#253
Fixes firezone/gtm#278
2024-05-21 10:34:56 -06:00
Andrew Dryga
b0590fa532 chore(portal): Send metrics to Google Cloud Monitoring (#4564) 2024-04-10 13:04:59 -06:00
Andrew Dryga
114696c0ba chore(infra): Split terraform files into folders and add domain to production app (#4172) 2024-03-16 11:54:06 -06:00
Andrew Dryga
5b1e3ea1d1 feat(portal): Billing system (#3642) 2024-02-20 15:01:17 -06:00
Jamil
dc0119c347 Revert "feat(portal): Add sign-in success page for clients" (#3692)
Merged a bit too soon!
2024-02-19 13:53:47 -08:00
Brian Manifold
db399651f2 feat(portal): Add sign-in success page for clients (#3659)
Why:

* On some clients, the web view that is opened to sign-in to Firezone is
left open and ends up getting stuck on the Sign In page with the
liveview loader on the top of the page also stuck and appearing as
though it is waiting for another response. This commit adds a sign-in
success page that is displayed upon successful sign-in and shows a
message to the user that lets them know they can close the window if
needed. If the client device is able to close the web view that was
opened, then the page will either very briefly be shown or will not be
visible at all due to how quickly the redirect happens.

Closes #3608 

<img width="625" alt="Screenshot 2024-02-15 at 4 30 57 PM"
src="https://github.com/firezone/firezone/assets/2646332/eb6a5df6-4a4c-4e54-b57c-5da239069ea9">

---------

Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
2024-02-19 21:00:49 +00:00
Jamil
0c25ad57cb Add link to status on website (#2974)
Fixes #2953
2023-12-20 22:56:40 +00:00
Andrew Dryga
1ab3fdd3b5 Ephemeral gateways (#2656)
- [x] Fixed docker run command to mount a volume at `/etc/firezone`
- [x] Fixed systemd unit file to properly setcap, create a writable
`/etc/firezone` directory, use a non-root user, etc.
- [x] Removed `FIREZONE_ID` from our terraform scripts

Now on Sites index we only show online gateways:
<img width="1728" alt="Screenshot 2023-11-15 at 18 04 12"
src="https://github.com/firezone/firezone/assets/1877644/b532f200-0420-4427-acff-a3b8623560c5">

On the Site view we also show only online ones with a link to see all:
<img width="1728" alt="Screenshot 2023-11-15 at 18 02 33"
src="https://github.com/firezone/firezone/assets/1877644/9774dfac-4340-41d4-8404-586e081505f5">

All can be seen on a separate page:
<img width="1728" alt="Screenshot 2023-11-15 at 18 02 27"
src="https://github.com/firezone/firezone/assets/1877644/5d135f60-c7af-4e48-9ebb-626ff7461316">

Some of the functions I've added are pretty dirty hacks; we really need
to implement the filters from #2029 to properly implement these and remove
the code duplication.
2023-11-16 11:17:22 -06:00
Andrew Dryga
9281b7fede Allow client logs and messages instrumentation (#2086)
Closes #2019
2023-09-18 15:03:51 -06:00
Andrew Dryga
e290f26298 Complete Actors, Devices and Groups UIs (#1885)
This will be done once the remaining UI code is covered with tests.
2023-09-02 05:35:52 +00:00
Andrew Dryga
e7d5d0579b Authentication for the live app (#1674)
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
2023-06-27 13:11:36 -06:00
Andrew Dryga
37a2d7b7f5 Move elixir code to a subfolder (#1631) 2023-05-24 15:46:51 -06:00