The `background_jobs_enabled` config is an ENV var that needs to be set
for a specific configuration key. It's not set on the top-level
`:domain` config by default.
Instead, it's used to enable / disable specific modules started by the
application's Supervisor.
The `Domain.Events.ReplicationConnection` module is updated in this PR
to follow this convention.
The `web` and `api` applications use `domain` as a dependency in their
`mix.exs`. This means that by default their Supervisors will start the
Domain's supervision tree as well.
The author did not realize this at the time of implementation, so we
now leverage the existing convention for restricting tasks to `domain`
nodes: the `background_jobs_enabled` application configuration
parameter.
We also add an info log when the replication slot is started so we can
verify which node it starts on.
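For illustration, a minimal sketch of how this gate can look in the application's supervision setup (the flat `:background_jobs_enabled` lookup and the child list here are assumptions, not the actual Firezone code):

```elixir
defmodule Domain.Application do
  use Application
  require Logger

  @impl true
  def start(_type, _args) do
    children = [Domain.Repo] ++ background_children()
    Supervisor.start_link(children, strategy: :one_for_one, name: Domain.Supervisor)
  end

  # Only `domain` nodes set background_jobs_enabled, so web/api nodes that
  # pull in the domain dependency skip these children entirely.
  defp background_children do
    if Application.get_env(:domain, :background_jobs_enabled, false) do
      Logger.info("Starting replication connection on #{node()}")
      [Domain.Events.ReplicationConnection]
    else
      []
    end
  end
end
```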
When deploying, the cluster state temporarily diverges, which allows
more than one `ReplicationConnection` process to start on the new nodes.
(One of) the old nodes still holds an active slot, and we get an "object
in use" error: `(Postgrex.Error) ERROR 55006 (object_in_use) replication
slot "events_slot" is active for PID 603037`.
Rather than rely on ReplicationConnection's restart behavior (which logs
a flood of errors via Logger.error), we can use the Supervisor here
instead and keep trying to start the ReplicationConnection until it
succeeds.
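A sketch of that idea with assumed restart-intensity values (a real version would also want a delay between attempts, since failed starts otherwise loop quickly):

```elixir
children = [Domain.Events.ReplicationConnection]

# A generous restart budget lets the Supervisor keep retrying through the
# object_in_use window during a deploy, instead of giving up after the
# default 3 restarts in 5 seconds.
Supervisor.start_link(children,
  strategy: :one_for_one,
  max_restarts: 120,
  max_seconds: 60
)
```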
Note that if the process name is registered (globally) and running,
ReplicationConnection.start_link/1 simply returns `{:ok, pid}` instead
of erroring out with `:already_running`, so eventually one of the nodes
will succeed and the remaining ones will return the globally-registered
pid.
It turns out we were making replication overly complex by creating a
dedicated user for it. The `web` user is already privileged, and we can
reuse it since the replication system operates in the same security
context as the rest of the app.
The LoggerJSON Redactor only redacts top-level keys, so we need to
redact the entire `connection_opts` param in order to redact the
password it contains.
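As a sketch, assuming the stock `RedactKeys` redactor (key list abbreviated):

```elixir
config :logger, :default_handler,
  formatter:
    LoggerJSON.Formatters.GoogleCloud.new(
      metadata: {:all_except, [:socket, :conn]},
      # RedactKeys matches only top-level keys, so we list "connection_opts"
      # itself rather than the "password" nested inside it.
      redactors: [{LoggerJSON.Redactors.RedactKeys, ["password", "connection_opts"]}]
    )
```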
We also don't need to pass `connection_opts` around in the
ReplicationConnection process state; it's only needed for the initial
connection, so we refactor it out of the `state`.
Firezone's control plane is a realtime, distributed system that relies
on a broadcast/subscribe system to function. In many cases, these events
are broadcast whenever relevant data in the DB changes, such as an
actor losing access to a policy, a membership being deleted, and so
forth.
Today, this is handled in the application layer, typically at the place
where the relevant DB call is made (e.g. in an `after_commit`). While
this approach has worked thus far, it has several issues:
1. We have no guarantee that the DB change will issue a broadcast. If
the application is deployed or the process crashes after the DB changes
are made but before the broadcast happens, we may fail to update
connected clients or gateways with the changes.
2. We have no guarantee that the order of DB updates will be preserved
in the broadcasts. In other words, app server A could win its DB
operation against app server B, but then lose the race to broadcast
first.
3. If the cluster is in a bad state where broadcasts may return an error
(i.e. https://github.com/firezone/firezone/issues/8660), we will never
retry the broadcast.
To fix the above issues, we introduce a WAL logical decoder that
processes the event stream one message at a time and performs any needed
work. Serializability is guaranteed since we only process the WAL in a
single, cluster-global process, `ReplicationConnection`. Durability is
also guaranteed since we only ACK WAL segments after we've successfully
ingested the event.
This means we will only advance the position of our WAL stream after
successfully broadcasting the event.
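A minimal sketch of that ordering on top of `Postgrex.ReplicationConnection` (the publication name and the `decode/1` / `broadcast!/1` stubs are placeholders, not the real event handling):

```elixir
defmodule WalEventsSketch do
  use Postgrex.ReplicationConnection

  @impl true
  def init(_opts), do: {:ok, %{}}

  @impl true
  def handle_connect(state) do
    # Assumes the events_slot slot and an "events" publication already exist.
    query =
      "START_REPLICATION SLOT events_slot LOGICAL 0/0 (proto_version '1', publication_names 'events')"

    {:stream, query, [], state}
  end

  @impl true
  def handle_data(<<?w, _start::64, wal_end::64, _clock::64, msg::binary>>, state) do
    # Process first, ACK second: if broadcast!/1 raises, the connection dies
    # without acknowledging and the server resends the segment.
    msg |> decode() |> broadcast!()
    ack = <<?r, wal_end + 1::64, wal_end + 1::64, wal_end + 1::64, now()::64, 0>>
    {:noreply, [ack], state}
  end

  def handle_data(<<?k, wal_end::64, _clock::64, 1>>, state) do
    # Keepalive requesting a reply: re-acknowledge our current position.
    {:noreply, [<<?r, wal_end + 1::64, wal_end + 1::64, wal_end + 1::64, now()::64, 0>>], state}
  end

  def handle_data(_other, state), do: {:noreply, state}

  defp decode(msg), do: msg
  defp broadcast!(_event), do: :ok

  @epoch DateTime.to_unix(~U[2000-01-01 00:00:00Z], :microsecond)
  defp now, do: System.os_time(:microsecond) - @epoch
end
```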
This PR only introduces the WAL stream processing system but does not
introduce any changes to our current broadcasting behavior - that's
saved for another PR.
Our current bespoke job system, while it has worked well so far, has
the following shortcomings:
- No retry logic
- No robust way to guarantee job isolation / uniqueness without
resorting to row-level locking
- No support for cron-based scheduling
This PR adds the boilerplate required to get started with
[Oban](https://hexdocs.pm/oban/Oban.html), the job management system for
Elixir.
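The setup is roughly the standard Oban boilerplate; the queue and cron entries below are hypothetical placeholders:

```elixir
# config/config.exs
config :domain, Oban,
  repo: Domain.Repo,
  queues: [default: 10],
  plugins: [
    # cron-based scheduling, one of the gaps in the bespoke system
    {Oban.Plugins.Cron, crontab: [{"0 * * * *", Domain.Jobs.HourlyWorker}]}
  ]

# ...and in the domain application's supervision tree:
# {Oban, Application.fetch_env!(:domain, Oban)}
```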
There was a slight API change in the way LoggerJSON's configuration is
generated, so I took the time to do a little fixing and cleanup here.
Specifically, we should be using the `new/1` callback to create the
Logger config, which fixes the exception below caused by missing config
keys:
```
FORMATTER CRASH: {report,[{formatter_crashed,'Elixir.LoggerJSON.Formatters.GoogleCloud'},{config,[{metadata,{all_except,[socket,conn]}},{redactors,[{'Elixir.LoggerJSON.Redactors.RedactKeys',[<<"password">>,<<"secret">>,<<"nonce">>,<<"fragment">>,<<"state">>,<<"token">>,<<"public_key">>,<<"private_key">>,<<"preshared_key">>,<<"session">>,<<"sessions">>]}]}]},{log_event,#{meta => #{line => 15,pid => <0.308.0>,time => 1744145139650804,file => "lib/logger.ex",gl => <0.281.0>,domain => [elixir],application => libcluster,mfa => {'Elixir.Cluster.Logger',info,2}},msg => {string,<<"[libcluster:default] connected to :\"web@web.cluster.local\"">>},level => info}},{reason,{error,{badmatch,[{metadata,{all_except,[socket,conn]}},{redactors,[{'Elixir.LoggerJSON.Redactors.RedactKeys',[<<"password">>,<<"secret">>,<<"nonce">>,<<"fragment">>,<<"state">>,<<"token">>,<<"public_key">>,<<"private_key">>,<<"preshared_key">>,<<"session">>,<<"sessions">>]}]}]},[{'Elixir.LoggerJSON.Formatters.GoogleCloud',format,2,[{file,"lib/logger_json/formatters/google_cloud.ex"},{line,148}]}]}}]}
```
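A sketch of the corrected setup with `new/1`, with the key list abbreviated from the one in the crash above:

```elixir
config :logger, :default_handler,
  formatter:
    # new/1 validates the options and fills in the config keys that
    # format/2 crashed on when given a bare keyword list.
    LoggerJSON.Formatters.GoogleCloud.new(
      metadata: {:all_except, [:socket, :conn]},
      redactors: [{LoggerJSON.Redactors.RedactKeys, ["password", "secret", "token"]}]
    )
```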
Supersedes #8714
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
- Attaches the Sentry Logging hook in each of [api, web, domain]
- Removes errant Sentry logging configuration in config/config.exs
- Fixes the exception logger to default to logging exceptions; use
`skip_sentry: true` to skip
Tested successfully in dev. Hopefully the cluster behaves the same way.
Fixes #8639
The adapter itself isn't enabled in the UI on prod, but the background
job to sync mock data was. This change prevents the job from being
started and emitting log noise into production logs.
- Adds more actor groups to the existing `oidc_provider`
- Configures a rand seed so our seed data is reproducible across
machines
- Formats the seeds file to allow for some refactoring in a later PR
- Adds a `Mock` identity provider adapter with sync enabled
If the websocket connection between a relay and the portal experiences a
temporary network split, the portal will immediately send the
disconnected id of the relay to any connected clients and gateways, and
all relayed connections (and current allocations) will be immediately
revoked by connlib.
This tight coupling is needlessly disruptive. As we've seen in staging
and production logs, relay disconnects can happen randomly, and in the
vast majority of cases the relay immediately reconnects. Currently we
see about 1-2 dozen of these **per day**.
To better account for this, we introduce a debounce mechanism in the
portal for `relays_presence` disconnects that works as follows (a sketch
follows the list):
- When a relay disconnects, record its `stamp_secret` (this is somewhat
tricky as we don't get this at the time of disconnect - we need to cache
it by relay_id beforehand)
- If the same `relay_id` reconnects again with the same `stamp_secret`
within `relays_presence_debounce_timeout` -> no-op
- If the same `relay_id` reconnects again with a **different**
`stamp_secret` -> disconnect immediately
- If it doesn't reconnect, **then** send the `relays_presence` with the
disconnected_id after the `relays_presence_debounce_timeout`
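A minimal sketch of that logic as a GenServer (module name, message shapes, and the timeout value are illustrative, not the portal's actual presence code):

```elixir
defmodule RelaysPresenceDebounce do
  use GenServer

  # stand-in for relays_presence_debounce_timeout
  @debounce_timeout :timer.seconds(10)

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts), do: {:ok, %{secrets: %{}, timers: %{}}}

  @impl true
  def handle_cast({:connected, relay_id, stamp_secret}, state) do
    case Map.pop(state.timers, relay_id) do
      {nil, _timers} ->
        # Cache the stamp_secret up front; it isn't available at disconnect time.
        {:noreply, put_in(state.secrets[relay_id], stamp_secret)}

      {timer, timers} ->
        Process.cancel_timer(timer)
        state = %{state | timers: timers}

        if state.secrets[relay_id] == stamp_secret do
          # Same secret within the window: no-op, allocations stay valid.
          {:noreply, state}
        else
          # Different secret: the relay restarted, so disconnect immediately.
          broadcast_disconnect(relay_id)
          {:noreply, put_in(state.secrets[relay_id], stamp_secret)}
        end
    end
  end

  def handle_cast({:disconnected, relay_id}, state) do
    timer = Process.send_after(self(), {:debounce_expired, relay_id}, @debounce_timeout)
    {:noreply, put_in(state.timers[relay_id], timer)}
  end

  @impl true
  def handle_info({:debounce_expired, relay_id}, state) do
    # No reconnect within the window: send relays_presence with the
    # disconnected id now.
    broadcast_disconnect(relay_id)
    {:noreply, update_in(state.timers, &Map.delete(&1, relay_id))}
  end

  defp broadcast_disconnect(relay_id),
    do: IO.puts("relays_presence disconnect: #{relay_id}")
end
```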
There are several ways connlib detects a relay is down:
1. Binding requests time out. These happen every 25s, so on average we
don't know a Relay is down for 12.5s + backoff timer.
2. `relays_presence` - this is currently the fastest way to detect
relays are down. With this change, the caveat is that we will now detect
this with a delay of `relays_presence_debounce_timeout`.
Fixes #8301
By specifying the `before_send` hook, we can easily drop events based on
their data, such as `original_exception`, which contains the original
exception instance raised.
Leveraging this, we add a `report_to_sentry` parameter to
`Web.LiveErrors.NotFound` to optionally keep certain not-found errors
from going to Sentry.
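A sketch of both pieces together (module names and the `report_to_sentry` field placement are assumptions; Sentry drops the event when `before_send` returns a falsy value):

```elixir
# config/config.exs
config :sentry,
  before_send: {Domain.Telemetry.Sentry, :before_send}

defmodule Domain.Telemetry.Sentry do
  # Drop not-found errors that opted out of reporting; pass everything else.
  def before_send(%Sentry.Event{original_exception: %Web.LiveErrors.NotFound{report_to_sentry: false}}),
    do: false

  def before_send(event), do: event
end
```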
This allows connections to the PostgreSQL database via the standard
unix socket, which, as opposed to TCP sockets, allows `peer`
authentication based on local unix users. This removes the need for a
password and is much simpler to deploy when running components locally.
In the current form, `DATABASE_SOCKET_DIR` takes precedence over the
hostname if the environment variable is present. I found that
`compile_config!` somehow enforces a value to be present, which is
explicitly not what I want for some of these values (I think). I'd be
glad if anyone with more Elixir experience can guide me on how to make
this more idiomatic.
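For reference, a sketch of the selection logic in `runtime.exs` (only `DATABASE_SOCKET_DIR` is from this PR; the other option plumbing is illustrative):

```elixir
socket_dir = System.get_env("DATABASE_SOCKET_DIR")

repo_opts =
  if socket_dir do
    # Unix socket with peer authentication: no password required.
    [socket_dir: socket_dir, database: System.get_env("DATABASE_NAME", "firezone")]
  else
    [
      hostname: System.get_env("DATABASE_HOST", "localhost"),
      password: System.fetch_env!("DATABASE_PASSWORD"),
      database: System.get_env("DATABASE_NAME", "firezone")
    ]
  end

config :domain, Domain.Repo, repo_opts
```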
---------
Supersedes: #8044
Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: oddlama <oddlama@oddlama.org>
We are getting quite a few of these warnings on prod:
```
{400, "{\n \"error\": {\n \"code\": 400,\n \"message\": \"One or more TimeSeries could not be written: timeSeries[0-39]: write for resource=gce_instance{zone:us-east1-d,instance_id:2678918148122610092} failed with: One or more points were written more frequently than the maximum sampling period configured for the metric.\",\n \"status\": \"INVALID_ARGUMENT\",\n \"details\": [\n {\n \"@type\": \"type.googleapis.com/google.monitoring.v3.CreateTimeSeriesSummary\",\n \"totalPointCount\": 40,\n \"successPointCount\": 31,\n \"errors\": [\n {\n \"status\": {\n \"code\": 9\n },\n \"pointCount\": 9\n }\n ]\n }\n ]\n }\n}\n"}
```
Since the point count is _much_ less than our flush buffer size of 1000,
we can only surmise that the limit we're hitting is the flush interval.
The telemetry metrics reporter is run on each node, so we run the risk
of violating Google's API limit regardless of what a single node's
`@flush_interval` is set to.
To solve this, we use a new table, `telemetry_reporter_logs`, that
stores the last time a `flush` occurred for a given reporter module.
This tracks global state for when the last flush occurred; if it was too
recent, the timer-based flush call is no-op'ed until the next one.
**Note**: the buffer-based `flush` is left unchanged; it will always be
called when `buffer_size > max_buffer_size`.
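A sketch of the gate as a schemaless query against that table (column names assumed):

```elixir
defmodule FlushGateSketch do
  import Ecto.Query

  # Returns :flush if this node won the current window, :noop otherwise.
  def flush_if_due(repo, reporter_module, flush_interval_ms) do
    now = DateTime.utc_now()
    cutoff = DateTime.add(now, -flush_interval_ms, :millisecond)

    # Atomically claim the window: the UPDATE matches only when the last
    # flush is older than the interval, so exactly one node wins per window.
    {updated, _} =
      repo.update_all(
        from(l in "telemetry_reporter_logs",
          where: l.module == ^to_string(reporter_module),
          where: l.last_flushed_at <= type(^cutoff, :utc_datetime_usec)
        ),
        set: [last_flushed_at: now]
      )

    if updated == 1, do: :flush, else: :noop
  end
end
```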
This is required to run multiple components on a single machine (even if
the processes are sandboxed), since they will share a network namespace
and thus cannot bind to the same port.
Currently, port `4000` is hardcoded; this PR allows it to be configured
via an environment variable.
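A sketch of the knob in `runtime.exs` (the env var name and endpoint module are illustrative):

```elixir
config :web, Web.Endpoint,
  http: [port: String.to_integer(System.get_env("WEB_HTTP_PORT", "4000"))]
```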
---------
Co-authored-by: oddlama <oddlama@oddlama.org>
It appears that something is initializing the Sentry.LoggerHandler
before we try to load it when starting:
```
Invalid logger handler config: {:logger,
{:invalid_handler, {:function_not_exported, {Sentry.LoggerHandler, :log, 2}}}}
```
This doesn't seem to actually inhibit the Sentry logger at all,
presumably because it initializes just fine in the application start
callback.
Instead of defining the config in the `config/` directory, we can pass
it directly to `:logger` on start which solves the above issue.
The applications within our umbrella are all joined into a single Erlang
cluster, and logger configuration is already applied to the entire
umbrella.
As such, registering the Sentry log handler in each application's
startup routine registers duplicate handlers for the cluster, resulting
in warnings like this in GCP:
```
Event dropped due to being a duplicate of a previously-captured event.
```
Instead, we move the log handler configuration to the top-level
`:logger` key, under the `:logger` subkey for configuring a single
handler. We then load this handler config in the `domain` app only, and
it applies to the entire cluster.
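A sketch of that layout (option values illustrative):

```elixir
# config/config.exs -- one handler definition for the whole cluster
config :logger, :logger, [
  {:handler, :sentry_handler, Sentry.LoggerHandler,
   %{config: %{metadata: :all, capture_log_messages: true, level: :warning}}}
]

# apps/domain/lib/domain/application.ex -- loaded once, in domain only
def start(_type, _args) do
  Logger.add_handlers(:logger)
  # ...start the supervision tree as usual
end
```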
This adds https://github.com/getsentry/sentry-elixir to the portal for
automatic process crash and exception trace reporting.
It also configures Logger reporting for the `warning` level and higher,
and sets the data scrubbing rules to allow all Logger metadata keys
(`logger_metadata.*` in the Sentry project settings).
Lastly, it configures automatic HTTP error reporting by tying into the
`api` and `web` endpoint modules with a custom `plug` middleware so we
get automatic reporting of unsuccessful Phoenix responses.
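A sketch using Sentry's off-the-shelf plug integration, which the custom plug described above presumably resembles (`API.Router` is a placeholder):

```elixir
defmodule API.Endpoint do
  use Phoenix.Endpoint, otp_app: :api
  # Wraps the whole plug pipeline so crashes become Sentry events.
  use Sentry.PlugCapture

  plug Plug.Parsers, parsers: [:urlencoded, :multipart, :json], json_decoder: Jason
  # Attaches request data to events captured further down the pipeline.
  plug Sentry.PlugContext
  plug API.Router
end
```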
It is expected this will be noisy when we first deploy and we'll need to
tune it down a bit. This is the same approach used with other Sentry
platforms.
Why:
* The Firezone website hosts the component versions at the moment, and
due to how Vercel works, a request to the /api/versions endpoint will
occasionally time out. This had been throwing an error in the Elixir
logs and triggering an alert, but because there is always a default set
of component version values in the Elixir app, there isn't really a need
for an error/alert. With that in mind, the log level is set to
`warning` rather than `error`.
Closes #7233
Why:
* Without some type of notification, users do not realize that new
Gateway versions have been released and thus do not seem to be upgrading
their deployed Gateways.
Now you can "edit" any field on a policy; when one of the fields that
govern access is changed (resource, actor group, or conditions), a new
policy is created and the old one is deleted. This is broadcast to the
clients right away to minimize downtime. The new policy will have its
own flows to prevent confusion while auditing. To make the experience
better for external systems, we added a `persistent_id` that stays the
same across all versions of a given policy.
Resources work in a similar fashion, but when they are replaced we will
also replace all corresponding policies.
A nice side effect of this approach is that we also get a configuration
audit log for resources and policies.
Fixes #2504
This adds a feature that will email all admins in a Firezone Account
when sync errors occur with their Identity Provider.
In order to avoid spamming admins with sync error emails, the error
emails are only sent once every 24 hours. One exception: when a sync
succeeds, the `sync_error_emailed_at` field is reset, which means that,
in theory, an identity provider flip-flopping between successful and
unsuccessful syncs could cause the admins to be emailed more than once
in a 24-hour period.
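A sketch of the throttle (function and field names around `sync_error_emailed_at` are stand-ins for the real Domain code):

```elixir
defmodule SyncErrorEmailsSketch do
  @day_in_seconds 24 * 60 * 60

  def maybe_email_admins(%{sync_error_emailed_at: last} = provider) do
    if is_nil(last) or DateTime.diff(DateTime.utc_now(), last) >= @day_in_seconds do
      notify_admins(provider)
      # Recording the send keeps us quiet for the next 24 hours; a
      # successful sync resets sync_error_emailed_at to nil elsewhere.
      {:ok, %{provider | sync_error_emailed_at: DateTime.utc_now()}}
    else
      {:ok, provider}
    end
  end

  defp notify_admins(_provider), do: :ok
end
```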
### Sample Email Message
<img width="589" alt="idp-sync-error-message"
src="https://github.com/user-attachments/assets/d7128c7c-c10d-4d02-8283-059e2f1f5db5">
They will be sent in the API for connlib 1.3 and above.
I think in the future we can make a whole menu section called "Internet
Security" with a specialized UI for the new resource type (and not show
it in the Resources list) to improve the user experience around it.
Closes #5852
---------
Signed-off-by: Andrew Dryga <andrew@dryga.com>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
Why:
* The Swagger UI is currently served from the API application. This
means that the Web application does not have access to the external URL
in the API configuration during/after compilation. Without the API
external URL, we cannot generate a proper link to the Swagger UI in the
portal. This commit refactors how the API external URL is set from the
environment variables and gives the Web app access to the value of the
API URL.
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
Why:
* In order to manage a large number of Firezone Sites, Resources,
Policies, etc., a REST API is needed, as clicking through the UI is too
time-consuming and error-prone. By providing a REST API, Firezone
customers will be able to manage things within their Firezone accounts
with code.
We seem to be hitting `assert_receive`-style timeouts much more
frequently after "upgrading" to Enterprise Cloud (our credits expired,
but I was able to renew them).
This updates the global `assert_receive` timeout to 500ms to reduce the
likelihood that `assert_push` and friends will time out on slow GH
runners.
E.g.
https://github.com/firezone/firezone/actions/runs/9556532328/job/26341986456
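For reference, the knob lives in `test_helper.exs`; `assert_push` defaults to ExUnit's `assert_receive_timeout`:

```elixir
ExUnit.start(assert_receive_timeout: 500)
```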
---------
Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Why:
* JumpCloud directory sync was requested by customers. JumpCloud only
offers the ability to use its API with an admin-level access token that
is tied to a specific user within a given JumpCloud account. This would
require Firezone customers to give us an access token with far more
permissions than needed for our directory sync. To avoid this, we've
decided to use WorkOS to provide SCIM support between JumpCloud and
WorkOS, which allows Firezone to easily and safely retrieve JumpCloud
directory info from WorkOS.
---------
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
Why:
* As work on the portal REST API began, there was a need to easily
provision API tokens for testing the new API endpoints being created.
Adding the API Client UI makes this easy and will also be used once the
API is ready to be consumed by customers.
Closes #2368