Google Cloud Artifact Registry and Cloud Storage are a significant cost.
GitHub, on the other hand, is completely free because we're a public
repository. Hence, it makes sense to ditch GCP for GHCR.
To do this, we move all "staging" artifacts to GHCR. These will then be
used in the infra repo to push to GCP for deploys - we probably still
want pulls from our infra to hit GCP and not GitHub.
One big element of this is that we potentially lose sccache, so I'll be
checking the compile time of this PR and looking for alternatives that
don't involve such a massive cloud bill.
This has been dead code for a long time. The feature this was meant to
support, #8353, will require a different domain model, views, and user
flows.
Related: #8353
Some migrations take a long time to run because they require locks or
modify large amounts of data. To prevent this from causing issues during
deploy, we leverage Ecto's native support for loading migrations from
multiple directories to introduce a `conditional_migrations/` directory
that houses any conditional migrations we want to run.
To run these migrations, you'll need to do one of the following:
- `dev, test`: `mix ecto.migrate` will run them by default because we
have aliased it to load `conditional_migrations/` for dev
- `prod`: Set the `RUN_CONDITIONAL_MIGRATIONS` env var to `true` before
starting a prod server using the `bin/migrate` script.
- `dev, test, prod`: Run `Domain.Release.migrate(conditional: true)`
from an IEx shell.
If conditional migrations were found that weren't executed during
`Domain.Release.migrate`, a warning is logged to remind us to run them.
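For reference, here's a minimal sketch of how `Domain.Release.migrate/1`
could be wired up using Ecto's support for multiple migration paths. Only
the function name and directory name come from this PR; the rest of the
module layout is a simplified assumption, not the actual implementation:

```elixir
defmodule Domain.Release do
  @app :domain

  # Hypothetical sketch: run the regular migrations and, when
  # `conditional: true` is passed, also the ones in
  # priv/repo/conditional_migrations/.
  def migrate(opts \\ []) do
    Application.load(@app)
    conditional? = Keyword.get(opts, :conditional, false)

    for repo <- Application.fetch_env!(@app, :ecto_repos) do
      paths = [Ecto.Migrator.migrations_path(repo)]

      paths =
        if conditional?,
          do: paths ++ [Ecto.Migrator.migrations_path(repo, "conditional_migrations")],
          else: paths

      {:ok, _, _} =
        Ecto.Migrator.with_repo(repo, &Ecto.Migrator.run(&1, paths, :up, all: true))
    end
  end
end
```

In prod, `bin/migrate` would then pass something like
`conditional: System.get_env("RUN_CONDITIONAL_MIGRATIONS") == "true"`
(the exact wiring is an assumption).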
---------
Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Firezone's control plane is a realtime, distributed system that relies
on a broadcast/subscribe system to function. In many cases, events
are broadcast whenever relevant data in the DB changes, such as an
actor losing access to a policy, a membership being deleted, and so
forth.
Today, this is handled in the application layer, typically happening at
the place where the relevant DB call is made (i.e. in an
`after_commit`). While this approach has worked thus far, it has several
issues:
1. We have no guarantee that the DB change will issue a broadcast. If
the application is deployed or the process crashes after the DB changes
are made but before the broadcast happens, we may fail to update
connected clients or gateways with the changes.
2. We have no guarantee that broadcasts happen in the same order as the
DB updates. In other words, app server A could win its DB operation
against app server B but then lose the race to broadcast first.
3. If the cluster is in a bad state where broadcasts may return an error
(e.g. https://github.com/firezone/firezone/issues/8660), we will never
retry the broadcast.
To fix the above issues, we introduce a WAL logical decoder that
processes the event stream one message at a time and performs any
needed work.
Serializability is guaranteed since we only process the WAL in a single,
cluster-global process, `ReplicationConnection`. Durability is also
guaranteed since we only ACK WAL segments after we've successfully
ingested the event.
This means we will only advance the position of our WAL stream after
successfully broadcasting the event.
This PR only introduces the WAL stream processing system but does not
introduce any changes to our current broadcasting behavior - that's
saved for another PR.
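For a rough idea of the shape of this, here's a minimal sketch built on
`Postgrex.ReplicationConnection` (adapted from the Postgrex docs). The
module, slot, and publication names are placeholders and the real
`ReplicationConnection` does more, but the key property is visible at the
bottom: the standby status update that advances our confirmed WAL
position is only sent after the event has been handled.

```elixir
defmodule Domain.Events.ReplicationConnection do
  use Postgrex.ReplicationConnection

  def start_link(opts) do
    Postgrex.ReplicationConnection.start_link(__MODULE__, :ok, [auto_reconnect: true] ++ opts)
  end

  @impl true
  def init(:ok), do: {:ok, %{step: :disconnected}}

  @impl true
  def handle_connect(state) do
    # "events" slot/publication names are placeholders.
    {:query, "CREATE_REPLICATION_SLOT events TEMPORARY LOGICAL pgoutput NOEXPORT_SNAPSHOT",
     %{state | step: :create_slot}}
  end

  @impl true
  def handle_result(_results, %{step: :create_slot} = state) do
    query =
      "START_REPLICATION SLOT events LOGICAL 0/0 (proto_version '1', publication_names 'events')"

    {:stream, query, [], %{state | step: :streaming}}
  end

  @impl true
  # XLogData: decode and handle the pgoutput message, then ACK by replying
  # with a standby status update so Postgres can discard the WAL.
  def handle_data(<<?w, _start::64, wal_end::64, _clock::64, payload::binary>>, state) do
    :ok = handle_event(payload)
    {:noreply, [standby_status_update(wal_end + 1)], state}
  end

  # Keepalive: reply only if the server asked for it.
  def handle_data(<<?k, wal_end::64, _clock::64, reply>>, state) do
    messages = if reply == 1, do: [standby_status_update(wal_end + 1)], else: []
    {:noreply, messages, state}
  end

  # Placeholder for decoding + broadcasting the event.
  defp handle_event(_payload), do: :ok

  @epoch DateTime.to_unix(~U[2000-01-01 00:00:00Z], :microsecond)
  defp standby_status_update(lsn) do
    now = System.os_time(:microsecond) - @epoch
    <<?r, lsn::64, lsn::64, lsn::64, now::64, 0>>
  end
end
```

If `handle_event/1` fails, no status update is sent and the stream
resumes from the last confirmed position on reconnect, which is what
gives us the durability guarantee above. (A real implementation would
also use a permanent slot rather than `TEMPORARY` so that position
survives restarts.)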
It turns out that the standard `pgoutput` plugin shipped with Postgres
will do everything we need it to, and there is good prior art for
decoding its binary output in Elixir in production.
So, to avoid adding a dependency on `wal2json` here, we'll go with that.
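As an illustration of what that decoding looks like, here's a tiny sketch
handling two of the `pgoutput` message types. The binary layouts come
from the Postgres logical replication protocol docs; the module name is
hypothetical and the real decoder covers the full set (Relation, Update,
Delete, Commit, ...):

```elixir
defmodule Domain.Events.Decoder do
  # Begin: LSN of the final commit, commit timestamp (µs since 2000-01-01), xid.
  def decode(<<?B, final_lsn::64, commit_timestamp::64, xid::32>>),
    do: {:begin, final_lsn, commit_timestamp, xid}

  # Insert: OID of the relation, 'N' marker, then the new tuple data.
  def decode(<<?I, relation_oid::32, ?N, tuple_data::binary>>),
    do: {:insert, relation_oid, tuple_data}

  def decode(other), do: {:unknown, other}
end
```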
Enables `wal_level = logical` so we can start implementing CDC.
**WARNING**: This will trigger a restart of our database instance, so it
should be deployed during our standard maintenance window (Saturday
evening).
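Once the flag is applied and the instance has restarted, a quick sanity
check from any IEx session connected to the instance (connection options
below are placeholders) is:

```elixir
{:ok, conn} = Postgrex.start_link(hostname: "localhost", database: "firezone_dev")

# Should return "logical" once the new setting is in effect.
Postgrex.query!(conn, "SELECT current_setting('wal_level')", [])
```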
In order to develop and test WAL replication, we need the wal2json
module installed in our dev postgres image. The module itself builds
very quickly, but I thought it would be better to have this
automatically built and pushed as part of a nightly job so that CI and
developers can make use of it.
We don't ever use the `dev` stage within our Rust Dockerfile that
actually builds the binaries within the container. In CI, we build the
binaries on the host and then copy them in. During local development, I
always do the same because it is much faster to iterate that way.
Long story short: We don't need this stage within our Dockerfile and it
causes confusion when people try to use `docker compose build`. If
somebody really wants to do that, they need to follow the instructions
in the Dockerfile and build the binary first.
Related: #8687
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
EdgeShark is extremely useful if you want to attach Wireshark to a TUN
device within a container. So far, I've just run this ad-hoc next to our
setup whenever I needed to debug something, but I think it is actually
worthwhile adding permanently so it is just there when you need it.
[Step
2](https://cloud.google.com/sql/docs/postgres/pg-audit#set-pgaudit-flag-values)
of the pgaudit setup guide for Google Cloud SQL. It would be good to
have detailed pgaudit logs on the master application instance in case
things go wrong.
Notably, this prevents erroring out when the `pgaudit` extension is not
available, which it is not by default. Enabling the `pgaudit` extension
for our dev instance is left as a future endeavor.
Supersedes #5442
- Adds more actor groups to the existing `oidc_provider`
- Configures a rand seed so our seed data is reproducible across
machines
- Formats the seeds file to allow for some refactoring in a later PR
- Adds a `Mock` identity provider adapter with sync enabled
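On the rand seed point, the idea is simply that seeding the generator up
front makes everything generated afterwards deterministic; the algorithm
and seed values below are illustrative, not the ones used in the seeds
file:

```elixir
# Seed Erlang's PRNG before generating any seed data so every machine
# produces the same sequence.
:rand.seed(:exsss, {100, 101, 102})

# Returns the same three numbers on every run and every machine.
Enum.map(1..3, fn _ -> :rand.uniform(1_000) end)
```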
This ensures that we run prettier across all supported filetypes to
check for any formatting / style inconsistencies. Previously, it was
only run for files in the website/ directory using a deprecated
pre-commit plugin.
The benefit of keeping this in our pre-commit config is that devs can
optionally run these checks locally with `pre-commit run --config
.github/pre-commit-config.yaml`.
---------
Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
They will be sent in the API for connlib 1.3 and above.
I think in the future we can make a whole menu section called "Internet
Security" which will be a specialized UI for the new resource type (and
not show it in the Resources list) to improve the user experience around
it.
Closes #5852
---------
Signed-off-by: Andrew Dryga <andrew@dryga.com>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
Why:
* The Swagger UI is currently served from the API application. This
means that the Web application does not have access to the external URL
in the API configuration during/after compilation. Without the API
external URL, we cannot generate a proper link in the portal to the
Swagger UI. This commit refactors how the API external URL is set from
the environment variables and gives the Web app access to the value of
the API URL.
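A minimal sketch of the idea in `config/runtime.exs`; the env var name,
config keys, and URLs below are assumptions and differ from the actual
Firezone config:

```elixir
# config/runtime.exs (hypothetical env var and config keys)
import Config

api_external_url = System.get_env("API_EXTERNAL_URL", "http://localhost:13001")

# Put the value somewhere both apps can read it at runtime, rather than
# baking it in at compile time where only the API app would see it.
config :domain, :api_external_url, api_external_url
```

The Web app can then read it with
`Application.fetch_env!(:domain, :api_external_url)` when rendering the
link to the Swagger UI.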
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
- Adds `http_test_server_image` to inputs so that it gets set properly
for CI (`debug`) and CD (`perf`)
- Updates `dev` -> `debug` in docker-compose.yml to fix pulls
- Fixes issue with seeds and relevant docs from #6205
Without masquerading, packets sent by the gateway through the TUN
interface use the wrong source address (the TUN device's address)
instead of the address of the gateway's actual network interface.
We set this env variable in all our uses of the gateway, thus we might
as well remove it and always perform masquerading unconditionally.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Reactor Scram <ReactorScram@users.noreply.github.com>
The compose service I defined is called `otel`, not `otlp`. With this
fix in place, the relay successfully connects to the OTLP exporter.
It is worth noting that the connection to the OTLP exporter itself
is not critical for relay operation. Even if it fails, it won't affect
the actual data plane. I do think it makes sense to still have a working
OTLP exporter in the compose definition, as it makes it easier to test
whether the ingestion of metrics and traces works as expected.
~~Relays are still failing to boot~~. By setting OTLP_ENDPOINT for both
relays in CI, we will more closely mimic the staging/prod env.
Edit: They just started working randomly. They had failed with a core
dump error after #6034 was merged, which is a bit concerning.
The dependency update in #6003 introduced a regression: Connecting to
the OTLP exporter was hanging forever and thus the relay failed to start
up.
The hang seems to be related to _dropping_ the `meter_provider`. Looking
at the changelog update, this change was actually called out:
https://github.com/open-telemetry/opentelemetry-rust/blob/main/opentelemetry-otlp/CHANGELOG.md#v0170.
By setting these providers globally, the relay starts up just fine.
To ensure this doesn't regress again, we add an OTEL collector to our
`docker-compose.yml` and configure the `relay-1` to connect to it.
I don't believe we use/need TCP for the Relays. Better to keep the ports
closed if that's the case.
Also, the docker-compose.yml is updated to allow the `relay-1` service
to respond on all its ports, since we typically don't need those mapped.
Currently, `phoenix-channel` calls `flush` manually to ensure we don't
have messages sitting in a buffer somewhere. This is somewhat wasteful
if we haven't actually written any message. We can move the flushing to
directly after sending the message.
To avoid further buffering on the TCP level, we also disable Nagle's
algorithm.
Closes #4907
They're still accepted, but the binary entirely determines the behavior.
This makes the code for CLI parsing and token handling simpler with
fewer branches, so it's easier to be sure it's correct.
Replaces #4942, which no longer does what I intended.
Whenever we receive a `relays_presence` message from the portal, we
invalidate the candidates of all now disconnected relays and make
allocations on the new ones. This triggers signalling of new candidates
to the remote party and migrates the connection to the newly nominated
socket.
This still relies on #4613 until we have #4634.
Resolves: #4548.
---------
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
Unfortunately I had to keep `linux-client` to get the compatibility
tests to pass. #4578 aims to remove that package.
Please add to this list if you think of anything:
```[tasklist]
# Things that may break that CI/CD won't catch
- [ ] GitHub release artifacts
- [ ] Knowledge base
- [ ] Docker images
- [ ] Docker containers
- [ ] Existing `linux-client` users
- [ ] Anything that downloads ghcr artifacts
- [ ] Nix (Not sure if it's built in CI. It had a merge conflict)
```
Refs #4515, #3712, and #3782
I think this is what Thomas and I agreed on in Slack / GitHub
---------
Signed-off-by: Reactor Scram <ReactorScram@users.noreply.github.com>
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
This is part of a yak shave towards CI testing of #3812.
Moving the DNS control method out of `docker-compose.yml` and up to the
integration tests themselves allows us to test these scenarios:
- `systemd-resolved`
- `etc-resolv-conf`
- `systemd-resolved` but we're in a container where that won't work, so
we should gracefully degrade to just allowing IP/CIDR resources