firezone

mirror of https://github.com/outbackdingo/firezone.git synced 2026-01-27 18:18:55 +00:00

Author	SHA1	Message	Date
Jamil	cafe6554ff	refactor(portal): reduce cache memory usage (#10058 ) Napkin math shows that we can save substantial memory (~3x or more) on the API nodes as connected clients/gateways grow if we just store the fields we need in order to keep the client and gateway state maintained in the channel pids. To facilitate this, we create new `Cacheable` structs that represent their `Domain` cousins, which use byte arrays for `id`s and strip out unused fields. Additionally, all business logic involved with maintaining these caches is now contained within two modules: `Domain.Cache.Client` and `Domain.Cache.Gateway`, and type specs have been added to aid in static analysis and code documentation. Comprehensive testing is now added not only for the cache modules, but for their associated channel modules as well to ensure we handle different kinds of edge cases gracefully. The `Events` nomenclature was renamed to `Changes` to better name what we are doing: Change-Data-Capture. Lastly, the following related changes are included in this PR since they were "in the way" so to speak of getting this done: - We save the last received LSN in each channel and drop the `change` with a warning if we receive it twice in a row, or we receive it out of order - The client/gateway version compatibility calculations have been moved to `Domain.Resources` and `Domain.Gateways` and have been simplified to make them easier to understand and maintain going forward. Related: #10174 Fixes: #9392 Fixes: #9965 Fixes: #9501 Fixes: #10227 --------- Signed-off-by: Jamil <jamilbk@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-08-22 21:52:29 +00:00
Jamil	4a448e5517	fix(portal): separate dev and runtime Oban configs (#10027 ) Oban includes its own configuration validation, which seems to prevent `runtime.exs` from overriding any compile-time options. This prevents us from using ENV vars to configure it, such as restricting job execution to `domain` nodes by setting `queues: []`. To fix that, we make sure to set Oban configuration in env-specific files `config/dev.exs` and `config/test.exs`, and at runtime for prod with `config/runtime.exs`. Fixes #10016	2025-07-28 15:13:52 +00:00
Jamil	b20c141759	feat(portal): add batch-insert to change logs (#9733 ) Inserting a change log incurs some minor overhead for sending the query over the network and reacting to its response. In many cases, this makes up the bulk of the actual time it takes to run the change log insert. To reduce this overhead and avoid any kind of processing delay in the WAL consumers, we introduce batch insert functionality with size `500` and timeout `30` seconds. If either limit is hit, we flush the batch using `insert_all`. `insert_all` does not use `Ecto.Changeset`, so we need to be a bit more careful about the data we insert, and check the inserted LSNs to determine what to update the acknowledged LSN pointer to. The functionality to determine when to call the new `on_flush/1` callback lives in the replication_connection module, but the actual behavior of `on_flush/1` is left to the child modules to implement. The `Events.ReplicationConnection` module does not use flush behavior, and so does not override the defaults, which are not to use a flush mechanism. Related: #949	2025-07-05 19:03:28 +00:00
Jamil	bebc69e2bc	fix(portal): use distinct slot names (#9672 ) These were being configured using the same default `events_` value.	2025-06-25 17:28:17 +00:00
Jamil	a9f49629ae	feat(portal): add change_logs table and insert data (#9553 ) Building on the WAL consumer that's been in development over the past several weeks, we introduce a new `change_logs` table that stores very lightly up-fitted data decoded from the WAL: - `account_id` (indexed): a foreign key reference to an account. - `inserted_at` (indexed): the timestamp of insert, for truncating rows later. - `table`: the table where the op took place. - `op`: the operation performed (insert/update/delete) - `old_data`: a nullable map of the old row data (update/delete) - `data`: a nullable map of the new row data (insert/update) - `vsn`: an integer version field we can bump to signify schema changes in the data in case we need to apply operations to only new or only old data. Judging from our prod metrics, we're currently averaging about 1,000 write operations a minute, which will generate about 1-2 dozen changelogs / s. Doing the math on this, 30 days at our current volume will yield about 50M / month, which should be ok for some time, since this is an append-only, rarely (if ever) read from table. The one aspect of this we may need to handle sooner rather than later is batch-inserting these. That raises an issue though - currently, in this PR, we process each WAL event serially, ending with the final acknowledgement `:ok` which will signal to Postgres our status in processing the WAL. If we do anything async here, this processing "cursor" then becomes inaccurate, so we may need to think about what to track and what data we care about. Related: #7124	2025-06-25 02:06:20 +00:00
Jamil	c783b23bae	refactor(portal): rename conditional->manual (#9612 ) These only have one condition - to run manually. `manual migrations` better implies that these migrations _must_ typically be run manually.	2025-06-21 21:17:33 +00:00
Jamil	a20989a819	feat(portal): conditional migrations on prod (#9562 ) Some migrations take a long time to run because they require locks or modify large amounts of data. To prevent this from causing issues during deploy, we leverage Ecto's native support for loading migrations from multiple directories to introduce a `conditional_migrations/` directory that houses any conditional migrations we want to run. To run these migrations, you'll need to do one of the following: - `dev, test`: The `mix ecto.migrate` will run them by default because we have aliased this to load conditional_migrations for dev - `prod`: Set the `RUN_CONDITIONAL_MIGRATIONS` env var to `true` before starting a prod server using the `bin/migrate` script. - `dev, test, prod`: Run `Domain.Release.migrate(conditional: true)` from an IEx shell. If conditional migrations were found that weren't executed during `Domain.Release.migrate`, a warning is logged to remind us to run them. --------- Signed-off-by: Jamil <jamilbk@users.noreply.github.com>	2025-06-18 18:08:25 +00:00
Jamil	f58176a447	chore: remove docs writer (#9494 ) This was added in an earlier era and will be just too cumbersome to maintain going forward. We have OpenAPI docs which are more flexible.	2025-06-10 02:51:46 +00:00
Brian Manifold	6d425d5677	refactor(portal): add retry logic to Stripe API client (#9466 ) Why: * We've seen some Stripe API requests come back with 429 responses, which likely could be retried and succeed. This commit adds some basic retry logic to our Stripe API client.	2025-06-09 23:11:33 +00:00
Jamil	968db2ae39	feat(portal): Receive WAL events (#8909 ) Firezone's control plane is a realtime, distributed system that relies on a broadcast/subscribe system to function. In many cases, these events are broadcasted whenever relevant data in the DB changes, such as an actor losing access to a policy, a membership being deleted, and so forth. Today, this is handled in the application layer, typically happening at the place where the relevant DB call is made (e.g. in an `after_commit`). While this approach has worked thus far, it has several issues: 1. We have no guarantee that the DB change will issue a broadcast. If the application is deployed or the process crashes after the DB changes are made but before the broadcast happens, we will have potentially failed to update any connected clients or gateways with the changes. 2. We have no guarantee that the order of DB updates will be maintained for broadcasts. In other words, app server A could win its DB operation against app server B, but then proceed to lose being the first to broadcast. 3. If the cluster is in a bad state where broadcasts may return an error (e.g. https://github.com/firezone/firezone/issues/8660), we will never retry the broadcast. To fix the above issues, we introduce a WAL logical decoder that processes the event stream one message at a time and performs any needed work. Serializability is guaranteed since we only process the WAL in a single, cluster-global process, `ReplicationConnection`. Durability is also guaranteed since we only ACK WAL segments after we've successfully ingested the event. This means we will only advance the position of our WAL stream after successfully broadcasting the event. This PR only introduces the WAL stream processing system but does not introduce any changes to our current broadcasting behavior - that's saved for another PR.	2025-04-29 23:53:06 -07:00
Jamil	2bbc0abc3a	feat(portal): Add Oban (#8786 ) Our current bespoke job system, while it's worked out well so far, has the following shortcomings: - No retry logic - No robust way to guarantee job isolation / uniqueness without resorting to row-level locking - No support for cron-based scheduling This PR adds the boilerplate required to get started with [Oban](https://hexdocs.pm/oban/Oban.html), the job management system for Elixir.	2025-04-15 03:56:49 +00:00
Jamil	e064cf5821	fix(portal): Debounce relays_presence (#8302 ) If the websocket connection between a relay and the portal experiences a temporary network split, the portal will immediately send the disconnected id of the relay to any connected clients and gateways, and all relayed connections (and current allocations) will be immediately revoked by connlib. This tight coupling is needlessly disruptive. As we've seen in staging and production logs, relay disconnects can happen randomly, and in the vast majority of cases the relay immediately reconnects. Currently we see about 1-2 dozen of these per day. To better account for this, we introduce a debounce mechanism in the portal for `relays_presence` disconnects that works as follows: - When a relay disconnects, record its `stamp_secret` (this is somewhat tricky as we don't get this at the time of disconnect - we need to cache it by relay_id beforehand) - If the same `relay_id` reconnects again with the same `stamp_secret` within `relays_presence_debounce_timeout` -> no-op - If the same `relay_id` reconnects again with a different `stamp_secret` -> disconnect immediately - If it doesn't reconnect, then send the `relays_presence` with the disconnected_id after the `relays_presence_debounce_timeout` There are several ways connlib detects a relay is down: 1. Binding requests time out. These happen every 25s, so on average we don't know a Relay is down for 12.5s + backoff timer. 2. `relays_presence` - this is currently the fastest way to detect relays are down. With this change, the caveat is we will now detect this with a delay of `relays_presence_debounce_timeout`. Fixes #8301	2025-03-04 23:56:40 +00:00
Jamil	28559a317f	chore(portal): Optionally drop `NotFoundError` to sentry (#8183 ) By specifying the `before_send` hook, we can easily drop events based on their data, such as `original_exception` which contains the original exception instance raised. Leveraging this, we can add a `report_to_sentry` parameter to `Web.LiveErrors.NotFound` to optionally ignore certain not found errors from going to Sentry.	2025-02-18 21:55:23 +00:00
Jamil	5bac3f5ec2	fix(infra): Don't send more/faster metrics than Google accepts (#8028 ) We are getting quite a few of these warnings on prod: ``` {400, "{\n \"error\": {\n \"code\": 400,\n \"message\": \"One or more TimeSeries could not be written: timeSeries[0-39]: write for resource=gce_instance{zone:us-east1-d,instance_id:2678918148122610092} failed with: One or more points were written more frequently than the maximum sampling period configured for the metric.\",\n \"status\": \"INVALID_ARGUMENT\",\n \"details\": [\n {\n \"@type\": \"type.googleapis.com/google.monitoring.v3.CreateTimeSeriesSummary\",\n \"totalPointCount\": 40,\n \"successPointCount\": 31,\n \"errors\": [\n {\n \"status\": {\n \"code\": 9\n },\n \"pointCount\": 9\n }\n ]\n }\n ]\n }\n}\n"} ``` Since the point count is _much_ less than our flush buffer size of 1000, we can only surmise the limit we're hitting is the flush interval. The telemetry metrics reporter is run on each node, so we run the risk of violating Google's API limit regardless of what a single node's `@flush_interval` is set to. To solve this, we use a new table `telemetry_reporter_logs` that stores the last time a particular `flush` occurred for a reporter module. This tracks global state as to when the last flush occurred, and if too recent, the timer-based flush call is `no-op`ed until the next one. Note: The buffer-based `flush` is left unchanged; this will always be called when `buffer_size > max_buffer_size`.	2025-02-10 18:21:40 +00:00
Brian Manifold	7fda4c52c4	feat(portal): Add outdated gateway notifications (#6841 ) Why: * Without some type of notification, users do not realize that new Gateway versions have been released and thus do not seem to be upgrading their deployed Gateways.	2024-10-11 12:46:00 +00:00
Andrew Dryga	3652839b1a	feat(portal): Allow updating policies and resources (#6690 ) Now you can "edit" any fields on the policy. When one of the fields that govern access is changed (resource, actor group or conditions), a new policy will be created and the old one is deleted. This will be broadcasted to the clients right away to minimize downtime. The new policy will have its own flows to prevent confusion while auditing. To make the experience better for external systems we added `persistent_id` that will be the same across all versions of a given policy. Resources work in a similar fashion but when they are replaced we will also replace all corresponding policies. An additional nice effect of this approach is that we also got a configuration audit log for resources and policies. Fixes #2504	2024-09-18 13:06:05 -06:00
Brian Manifold	716623a993	feat(portal): Add IDP sync error email notifications (#6483 ) This adds a feature that will email all admins in a Firezone Account when sync errors occur with their Identity Provider. In order to avoid spamming admins with sync error emails, the error emails are only sent once every 24 hours. One exception to that is when there is a successful sync the `sync_error_emailed_at` field is reset, which means in theory if an identity provider was flip flopping between successful and unsuccessful syncs the admins would be emailed more than once in a 24-hour period.	2024-09-18 15:29:50 +00:00
Jamil	bfbf570191	ci: Increase default assert_receive timeout to 500ms from 100ms (#5417 ) We seem to be hitting `assert_receive`-style timeouts much more frequently after "upgrading" to Enterprise Cloud (our credits expired, I was able to renew them). This updates the global timeout to 500ms for `assert_receive` to reduce the likelihood `assert_push` and friends will time out on slow GH runners. E.g. https://github.com/firezone/firezone/actions/runs/9556532328/job/26341986456 --------- Signed-off-by: Jamil <jamilbk@users.noreply.github.com>	2024-06-17 18:35:11 -07:00
Brian Manifold	26d8f7eab3	feat(portal): Add WorkOS/JumpCloud integration (#5269 ) Why: * JumpCloud directory sync was requested by customers. JumpCloud only offers the ability to use its API with an admin level access token that is tied to a specific user within a given JumpCloud account. This would require Firezone customers to give an access token with many more permissions than needed for our directory sync. To avoid this, we've decided to use WorkOS to provide SCIM support between JumpCloud and WorkOS, which will allow Firezone to then easily and safely retrieve JumpCloud directory info from WorkOS. --------- Co-authored-by: Jamil <jamilbk@users.noreply.github.com>	2024-06-12 15:45:33 +00:00
Andrew Dryga	a7e54686b0	feat(portal): Track page views and sign ups using Mixpanel and HubSpot on public pages (#5050 ) Fixes firezone/gtm#253 Fixes firezone/gtm#278	2024-05-21 10:34:56 -06:00
Andrew Dryga	b0590fa532	chore(portal): Send metrics to Google Cloud Monitoring (#4564 )	2024-04-10 13:04:59 -06:00
Andrew Dryga	114696c0ba	chore(infra): Split terraform files into folders and add domain to production app (#4172 )	2024-03-16 11:54:06 -06:00
Andrew Dryga	5b1e3ea1d1	feat(portal): Billing system (#3642 )	2024-02-20 15:01:17 -06:00
Jamil	dc0119c347	Revert "feat(portal): Add sign-in success page for clients" (#3692 ) Merged a bit too soon!	2024-02-19 13:53:47 -08:00
Brian Manifold	db399651f2	feat(portal): Add sign-in success page for clients (#3659 ) Why: * On some clients, the web view that is opened to sign-in to Firezone is left open and ends up getting stuck on the Sign In page with the liveview loader on the top of the page also stuck and appearing as though it is waiting for another response. This commit adds a sign-in success page that is displayed upon successful sign-in and shows a message to the user that lets them know they can close the window if needed. If the client device is able to close the web view that was opened, then the page will either very briefly be shown or will not be visible at all due to how quickly the redirect happens. Closes #3608 --------- Signed-off-by: Jamil <jamilbk@users.noreply.github.com> Co-authored-by: Jamil <jamilbk@users.noreply.github.com>	2024-02-19 21:00:49 +00:00
Jamil	0c25ad57cb	Add link to status on website (#2974 ) Fixes #2953	2023-12-20 22:56:40 +00:00
Andrew Dryga	1ab3fdd3b5	Ephemeral gateways (#2656 ) - [x] Fixed docker run command to mount a volume at `/etc/firezone` - [x] Fixed systemd unit file to properly setcap, create writeable `/etc/firezone` directory, use non-root user, etc - [x] Removed `FIREZONE_ID` from our terraform scripts Now on Sites index we only show online gateways. On the Site view we also show only online ones with a link to see all. All can be seen on a separate page. Some of the functions I've added are pretty dirty hacks, we really need to implement filters from #2029 to properly implement those and remove code duplicates.	2023-11-16 11:17:22 -06:00
Andrew Dryga	9281b7fede	Allow client logs and messages instrumentation (#2086 ) Closes #2019	2023-09-18 15:03:51 -06:00
Andrew Dryga	e290f26298	Complete Actors, Devices and Groups UIs (#1885 ) This will be done once the remaining UI code is covered with tests.	2023-09-02 05:35:52 +00:00
Andrew Dryga	e7d5d0579b	Authentication for the live app (#1674 ) Co-authored-by: Jamil <jamilbk@users.noreply.github.com>	2023-06-27 13:11:36 -06:00
Andrew Dryga	37a2d7b7f5	Move elixir code to a subfolder (#1631 )	2023-05-24 15:46:51 -06:00
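
Note on b20c141759 (batch-insert to change logs): the flush rule described in that commit message, flush when the batch reaches `500` rows or `30` seconds have passed, can be sketched roughly as below. This is an illustrative Python sketch, not the portal's Elixir code. The `ChangeLogBuffer` class, its method names, the `lsn` row key, and the choice to acknowledge the highest LSN in the flushed batch are all assumptions made for the example. In a real consumer a timer would call `maybe_flush` periodically.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class ChangeLogBuffer:
    """Hypothetical in-memory batch buffer (not Firezone's actual module)."""

    on_flush: Callable[[List[Dict]], None]       # e.g. a bulk insert of the batch
    max_size: int = 500                          # batch size from the commit message
    timeout_s: float = 30.0                      # flush timeout from the commit message
    clock: Callable[[], float] = time.monotonic  # injectable for testing
    _rows: List[Dict] = field(default_factory=list)
    _started_at: Optional[float] = None

    def add(self, row: Dict) -> Optional[int]:
        """Buffer one row; return the LSN to acknowledge if this triggered a flush."""
        if not self._rows:
            # The timeout is measured from the first row of the current batch.
            self._started_at = self.clock()
        self._rows.append(row)
        return self.maybe_flush()

    def maybe_flush(self) -> Optional[int]:
        """Flush if either limit is hit; otherwise do nothing and return None."""
        if not self._rows:
            return None
        full = len(self._rows) >= self.max_size
        expired = self.clock() - self._started_at >= self.timeout_s
        if not (full or expired):
            return None
        batch, self._rows = self._rows, []
        self.on_flush(batch)
        # Assumption: acknowledge up to the highest LSN that was written.
        return max(r["lsn"] for r in batch)
```

Returning the LSN only after `on_flush` completes mirrors the concern raised in a9f49629ae: the acknowledged position should not advance past rows that have not been written yet.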

31 Commits