This adds the option to do socket based postgres connection to the
replication connections. Basically just a copy of the existing config
for the base postgres connection.
---------
Co-authored-by: PatrickDaG <patrickdag@failmail.dev>
Napkin math shows that we can save substantial memory (~3x or more) on
the API nodes as connected clients/gateways grow if we just store the
fields we need in order to keep the client and gateway state maintained
in the channel pids.
To facilitate this, we create new `Cacheable` structs that represent
their `Domain` cousins, which use byte arrays for `id`s and strip out
unused fields.
Additionally, all business logic involved with maintaining these caches
is now contained within two modules: `Domain.Cache.Client` and
`Domain.Cache.Gateway`, and type specs have been added to aid in static
analysis and code documentation.
Comprehensive testing is now added not only for the cache modules, but
for their associated channel modules as well to ensure we handle
different kinds of edge cases gracefully.
The `Events` nomenclature was renamed to `Changes` to better name what
we are doing: Change-Data-Capture.
Lastly, the following related changes are included in this PR since they
were "in the way" so to speak of getting this done:
- We save the last received LSN in each channel and drop the `change`
with a warning if we receive it twice in a row, or we receive it out of
order
- The client/gateway version compatibility calculations have been moved
to `Domain.Resources` and `Domain.Gateways` and have been simplified to
make them easier to understand and maintain going forward.
Related: #10174Fixes: #9392Fixes: #9965Fixes: #9501Fixes: #10227
---------
Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Oban includes its own configuration validation, which seems to prevent
`runtime.exs` from overriding any compile-time options. This prevents us
from using ENV vars to configure it, such as restricting job execution
to `domain` nodes by setting `queues: []`. To fix that, we make sure to
set Oban configuration in env-specific files `config/dev.exs` and
`config/test.exs`, and at runtime for prod with `config/runtime.exs`.
Fixes#10016
Time-based policy conditions are tricky. When they authorize a flow, we
correctly tell the Gateway to remove access when the time window
expires.
However, we do nothing on the client to reset the connectivity state.
This means that whenever the window of time of access was re-entered,
the client would essentially never be able to connect to it again until
the resource was toggled.
To fix this, we add a 1-minute check in the client channel that
re-checks allowed resources, and updates the client state with the
difference. This means that policies that have time-based conditions are
only accurate to the minute, but this is how they're presented anyhow.
For good measure, we also add a periodic job that runs every minute to
delete expired Flows. This will propagate to the Gateway where, if the
access for a particular client-resource is determined to be actually
gone, will receive `reject_access`.
Zooming out a bit, this PR furthers the theme that:
- Client channels react to underlying resource / policy / membership
changes directly, while
- Gateway channels react primarily to flows being deleted, or the
downstream effects of a prior client authorization
These parameters should be tuned to how long we expect "normal" queries
to take against the SQL instance. For smaller instances, "normal"
queries may take longer than 500ms, so we need to be able to configure
these via our Terraform configuration.
If not specified, the same defaults are used as before.
Related: https://github.com/firezone/infra/pull/82
The domain nodes that run background jobs aren't user-facing and as
such, don't need short queue_interval times. They can afford to wait a
few seconds for a connection from the pool to become available.
This should be a relatively low-risk change as it does not increase the
resources used on the app server and database, it only relaxes the
length of time we want for a connection to be available for us to use.
This has been dead code for a long time. The feature this was meant to
support, #8353, will require a different domain model, views, and user
flows.
Related: #8353
Building on the WAL consumer that's been in development over the past
several weeks, we introduce a new `change_logs` table that stores very
lightly up-fitted data decoded from the WAL:
- `account_id` (indexed): a foreign key reference to an account.
- `inserted_at` (indexed): the timestamp of insert, for truncating rows
later.
- `table`: the table where the op took place.
- `op`: the operation performed (insert/update/delete)
- `old_data`: a nullable map of the old row data (update/delete)
- `data`: a nullable map of the new row data(insert/update)
- `vsn`: an integer version field we can bump to signify schema changes
in the data in case we need to apply operations to only new or only old
data.
Judging from our prod metrics, we're currently average about 1,000 write
operations a minute, which will generate about 1-2 dozen changelogs / s.
Doing the math on this, 30 days at our current volume will yield about
50M / month, which should be ok for some time, since this is an
append-only, rarely (if ever) read from table.
The one aspect of this we may need to handle sooner than later is
batch-inserting these. That raises an issue though - currently, in this
PR, we process each WAL event serially, ending with the final
acknowledgement `:ok` which will signal to Postgres our status in
processing the WAL.
If we do anything async here, this processing "cursor" then becomes
inaccurate, so we may need to think about what to track and what data we
care about.
Related: #7124
The `compile_config` macro only works on environment and DB variables.
This caused recent confusion when determining where `database_pool_size`
was coming from.
To fix this issue, we rename `compile_config` to be more clear.
We also remove the technical debt around supporting "legacy keys" and
DB-based configuration.
The configuration compiler now works exclusively on environment
variables only, where it is still useful for:
- Casting environment variables to their expected type
- Alerting us when one is missing that should be set
Some migrations take a long time to run because they require locks or
modify large amounts of data. To prevent this from causing issues during
deploy, we leverage Ecto's native support for loading migrations from
multiple directories to introduce a `conditional_migrations/` directory
that houses any conditional migrations we want to run.
To run these migrations, you'll need to do one of the following:
- `dev, test`: The `mix ecto.migrate` will run them by default because
we have aliased this to load conditional_migrations for dev
- `prod`: Set the `RUN_CONDITIONAL_MIGRATIONS` env var to `true` before
starting a prod server using the `bin/migrate` script.
- `dev, test, prod`: Run `Domain.Release.migrate(conditional: true)`
from an IEx shell.
If conditional migrations were found that weren't executed during
`Domain.Release.migrate`, a warning is logged to remind us to run them.
---------
Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
> This PR adds two configuration keys for the replication connection.
> Exemple usecase:
> If you run two firezone control planes on the same db cluster (like me
😂 ), you'll need to have two different replication slot names
Related: #9395
Co-authored-by: Antoine <antoinelabarussias@gmail.com>
The `background_jobs_enabled` config in an ENV var that needs to be set
for a specific configuration key. It's not set on the top-level
`:domain` config by default.
Instead, it's used to enable / disable specific modules to start by the
application's Supervisor.
The `Domain.Events.ReplicationConnection` module is updated in this PR
to follow this convention.
The `web` and `api` application use `domain` as a dependency in their
`mix.exs`. This means by default their Supervisor will start the
Domain's supervision tree as well.
The author did not realize this at the time of implementation, and so we
now leverage the convention in place for restricting tasks to `domain`
nodes, the `background_jobs_enabled` application configuration
parameter.
We also add an info log when the replication slot is being started so we
can verify the node it's starting on.
Turns out we are making replication overly complex by creating a
dedicated user for it. The `web` user is already privileged and we can
reuse it since the replication system operates in the same security
context as the remaining app.
Firezone's control plane is a realtime, distributed system that relies
on a broadcast/subscribe system to function. In many cases, these events
are broadcasted whenever relevant data in the DB changes, such as an
actor losing access to a policy, a membership being deleted, and so
forth.
Today, this is handled in the application layer, typically happening at
the place where the relevant DB call is made (i.e. in an
`after_commit`). While this approach has worked thus far, it has several
issues:
1. We have no guarantee that the DB change will issue a broadcast. If
the application is deployed or the process crashes after the DB changes
are made but before the broadcast happens, we will have potentially
failed to update any connected clients or gateways with the changes.
2. We have no guarantee that the order of DB updates will be maintained
in order for broadcasts. In other words, app server A could win its DB
operation against app server B, but then proceed to lose being the first
to broadcast.
3. If the cluster is in a bad state where broadcasts may return an error
(i.e. https://github.com/firezone/firezone/issues/8660), we will never
retry the broadcast.
To fix the above issues, we introduce a WAL logical decoder that process
the event stream one message at a time and performs any needed work.
Serializability is guaranteed since we only process the WAL in a single,
cluster-global process, `ReplicationConnection`. Durability is also
guaranteed since we only ACK WAL segments after we've successfully
ingested the event.
This means we will only advance the position of our WAL stream after
successfully broadcasting the event.
This PR only introduces the WAL stream processing system but does not
introduce any changes to our current broadcasting behavior - that's
saved for another PR.
There was slight API change in the way LoggerJSON's configuration is
generation, so I took the time to do a little fixing and cleanup here.
Specifically, we should be using the `new/1` callback to create the
Logger config which fixes the below exception due to missing config
keys:
```
FORMATTER CRASH: {report,[{formatter_crashed,'Elixir.LoggerJSON.Formatters.GoogleCloud'},{config,[{metadata,{all_except,[socket,conn]}},{redactors,[{'Elixir.LoggerJSON.Redactors.RedactKeys',[<<"password">>,<<"secret">>,<<"nonce">>,<<"fragment">>,<<"state">>,<<"token">>,<<"public_key">>,<<"private_key">>,<<"preshared_key">>,<<"session">>,<<"sessions">>]}]}]},{log_event,#{meta => #{line => 15,pid => <0.308.0>,time => 1744145139650804,file => "lib/logger.ex",gl => <0.281.0>,domain => [elixir],application => libcluster,mfa => {'Elixir.Cluster.Logger',info,2}},msg => {string,<<"[libcluster:default] connected to :\"web@web.cluster.local\"">>},level => info}},{reason,{error,{badmatch,[{metadata,{all_except,[socket,conn]}},{redactors,[{'Elixir.LoggerJSON.Redactors.RedactKeys',[<<"password">>,<<"secret">>,<<"nonce">>,<<"fragment">>,<<"state">>,<<"token">>,<<"public_key">>,<<"private_key">>,<<"preshared_key">>,<<"session">>,<<"sessions">>]}]}]},[{'Elixir.LoggerJSON.Formatters.GoogleCloud',format,2,[{file,"lib/logger_json/formatters/google_cloud.ex"},{line,148}]}]}}]}
```
Supersedes #8714
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
The adapter itself isn't enabled in the UI on prod, but the background
job to sync mock data was. This prevents the job from being started and
emitting log noise into production logs.
- Adds more actor groups to the existing `oidc_provider`
- Configures a rand seed so our seed data is reproducible across
machines
- Formats the seeds file to allow for some refactoring a later PR
- Adds a `Mock` identity provider adapter with sync enabled
By specifying the `before_send` hook, we can easily drop events based on
their data, such as `original_exception` which contains the original
exception instance raised.
Leveraging this, we can add a `report_to_sentry` parameter to
`Web.LiveErrors.NotFound` to optionally ignore certain not found errors
from going to Sentry.
This allows connections to the postgresql database via the standard
socket, which - opposed to TCP sockets - allows `peer` authentication
based on local unix users. This removes the need for a password and is
much simpler to deploy when running components locally.
In the current form, `DATABASE_SOCKET_DIR` takes precedence over
hostname, if the environment variable is present. I found that
`compile_config!` somehow enforces a value to be present which is
explicitly not what I want for some of these values (i think). I'd be
glad if anyone with more elixir experience can guide me as to how I can
make this more idiomatic.
---------
Supersedes: #8044
Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: oddlama <oddlama@oddlama.org>
This is required to run multiple components on a single machine (even if
the processes are sandboxed), since they will share a network namespace
and thus cannot bind to the same port.
Currently port `4000` is hardcoded, this PR allows this to be configured
by an environment variable.
---------
Co-authored-by: oddlama <oddlama@oddlama.org>
It appears that something is initializing the Sentry.LoggerHandler
before we try to load it when starting:
```
Invalid logger handler config: {:logger,
{:invalid_handler, {:function_not_exported, {Sentry.LoggerHandler, :log, 2}}}}
```
This doesn't seem to actually inhibit the Sentry logger at all,
presumably because it initializes just fine in the application start
callback.
Instead of defining the config in the `config/` directory, we can pass
it directly to `:logger` on start which solves the above issue.
This adds https://github.com/getsentry/sentry-elixir to the portal for
automatic process crash and exception trace reporting.
It also configures Logger reporting for the `warning` level and higher,
and sets the data scrubbing rules to allow all Logger metadata keys
(`logger_metadata.*` in the Sentry project settings).
Lastly, it configures automatic HTTP error reporting by tying into the
`api` and `web` endpoint modules with a custom `plug` middleware so we
get automatic reporting of unsuccessful Phoenix responses.
It is expected this will be noisy when we first deploy and we'll need to
tune it down a bit. This is the same approach used with other Sentry
platforms.
This adds a feature that will email all admins in a Firezone Account
when sync errors occur with their Identity Provider.
In order to avoid spamming admins with sync error emails, the error
emails are only sent once every 24 hours. One exception to that is when
there is a successful sync the `sync_error_emailed_at` field is reset,
which means in theory if an identity provider was flip flopping between
successful and unsuccessful syncs the admins would be emailed more than
once in a 24 hours period.
### Sample Email Message
<img width="589" alt="idp-sync-error-message"
src="https://github.com/user-attachments/assets/d7128c7c-c10d-4d02-8283-059e2f1f5db5">
They will be sent in the API for connlib 1.3 and above.
I think in future we can make a whole menu section called "Internet
Security" which will be a specialized UI for the new resource type (and
now show it in Resources list) to improve the user experience around it.
Closes#5852
---------
Signed-off-by: Andrew Dryga <andrew@dryga.com>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
Why:
* The Swagger UI is currently served from the API application. This
means that the Web application does not have access to the external URL
in the API configuration during/after compilation. Without the API
external URL, we cannot generate a proper link in the portal to the
Swagger UI. This commit refactors how the API external URL is set from
the environment variables and allows the Web app to have access to the
value of the API URL.
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
Why:
* JumpCloud directory sync was requested from customers. JumpCloud only
offers the ability to use it's API with an admin level access token that
is tied to a specific user within a given JumpCloud account. This would
require Firezone customers to give an access token with much more
permissions that needed for our directory sync. To avoid this, we've
decide to use WorkOS to provide SCIM support between JumpCloud and
WorkOS, which will allow Firezone to then easily and safely retrieve
JumpCloud directory info from WorkOS.
---------
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
Why:
* As work on the portal REST API has begun, there was a need to easily
provision API tokens to allow testing of the new API endpoints being
created. Adding the API Client UI allows for this to be done very easily
and will also be used once the API is ready to be consumed by customers.
Closes#2368
Why:
* In order to allow easy testing of billing / Stripe integration, the
staging environment needs to allow members of the Firezone team access
to create new accounts, while disallowing the general public to create
accounts. The account creation override functionality allows for
multiple domains to be set by ENV variable by passing a comma separated
string of domains.
---------
Co-authored-by: Andrew Dryga <andrew@dryga.com>