Commit Graph

101 Commits

Author SHA1 Message Date
Brian Manifold
ae5440929b refactor(portal): Update elixir TF module (#9410)
Why:

* In order to reduce the number of traces/spans being sent to GCP, a
  custom otel-collector config is needed for each type of node in our
  portal deployment. This commit allows the elixir TF module to accept
  an otel-collector config at the time of use, rather than having it
  hardcoded into the module itself.
2025-06-05 06:18:34 +00:00
Jamil
b6dedba1d8 fix(infra/portal): Allow full instance surge during deploys (#9307)
During a deploy, we had `max_surge_fixed` set to the target instance
count - 1, which caused only 3 nodes to be spun up at a time instead of
the full 4.

We also had `max_unavailable_fixed = 1`, which allowed the instance group
manager to bring an old, healthy node down before the last remaining
node was spun up.

Since [we are now always
setting](https://github.com/firezone/environments/pull/29) the
reservation_size to 2*replicas, we can fix these values to make sure all
new VMs spin up before old ones are deleted.
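For illustration, a minimal sketch of the resulting update policy,
assuming a regional instance group manager named `portal` with a target
size of 4 (resource names and values here are illustrative, not the
exact module config):

```
resource "google_compute_region_instance_group_manager" "portal" {
  name               = "portal"
  base_instance_name = "portal"
  region             = "us-east1"
  target_size        = 4

  version {
    instance_template = google_compute_instance_template.portal.id
  }

  update_policy {
    type                  = "PROACTIVE"
    minimal_action        = "REPLACE"
    max_surge_fixed       = 4 # full instance surge: all new VMs come up first
    max_unavailable_fixed = 0 # never take an old, healthy node down early
  }
}
```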
2025-05-30 18:02:40 +00:00
Andrew Dryga
18cb7c147b chore(portal): Upgrade Postgres to 17 (#5442)
### Pre-upgrade TODO

- [ ] Update firezone.statuspage.io with planned maintenance status

### Performing the upgrade

- [ ] Upgrade in place using the GCP UI
- [ ] Run `ANALYZE;`
- [ ] Run `REINDEX DATABASE firezone;`
- [ ] When complete, deploy production via Terraform with the new
version selected

### Post-upgrade TODO

- [ ] Test application connectivity
- [ ] Monitor Cloud SQL logs for any issues
- [ ] Unmark the planned maintenance window in firezone.statuspage.io

Co-authored-by: Jamil Bou Kheir <jamilbk@users.noreply.github.com>
2025-05-23 14:02:38 -07:00
Jamil
649c03e290 chore(portal): Bump LoggerJSON to 7.0.0, fixing config (#8759)
There was a slight API change in the way LoggerJSON's configuration is
generated, so I took the time to do a little fixing and cleanup here.

Specifically, we should be using the `new/1` callback to create the
Logger config, which fixes the below exception due to missing config
keys:

```
FORMATTER CRASH: {report,[{formatter_crashed,'Elixir.LoggerJSON.Formatters.GoogleCloud'},{config,[{metadata,{all_except,[socket,conn]}},{redactors,[{'Elixir.LoggerJSON.Redactors.RedactKeys',[<<"password">>,<<"secret">>,<<"nonce">>,<<"fragment">>,<<"state">>,<<"token">>,<<"public_key">>,<<"private_key">>,<<"preshared_key">>,<<"session">>,<<"sessions">>]}]}]},{log_event,#{meta => #{line => 15,pid => <0.308.0>,time => 1744145139650804,file => "lib/logger.ex",gl => <0.281.0>,domain => [elixir],application => libcluster,mfa => {'Elixir.Cluster.Logger',info,2}},msg => {string,<<"[libcluster:default] connected to :\"web@web.cluster.local\"">>},level => info}},{reason,{error,{badmatch,[{metadata,{all_except,[socket,conn]}},{redactors,[{'Elixir.LoggerJSON.Redactors.RedactKeys',[<<"password">>,<<"secret">>,<<"nonce">>,<<"fragment">>,<<"state">>,<<"token">>,<<"public_key">>,<<"private_key">>,<<"preshared_key">>,<<"session">>,<<"sessions">>]}]}]},[{'Elixir.LoggerJSON.Formatters.GoogleCloud',format,2,[{file,"lib/logger_json/formatters/google_cloud.ex"},{line,148}]}]}}]}
```

Supersedes #8714

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-04-11 19:00:06 -07:00
Jamil
8df86c48c8 chore(infra): add reservation_size var to elixir app (#8659)
We need to decouple the target VM size from the reservation size so that
we have wiggle room to spin up new VMs if things go south during a
deploy.
2025-04-04 01:13:22 -07:00
Thomas Eizinger
6fe7e77f76 refactor(relay): fail if eBPF offloading is requested but fails (#8656)
It happened a bunch of times to me during testing that I'd forget to set
the right interface onto which the eBPF kernel should be loaded and was
left wondering why it didn't work. Defaulting to `eth0` wasn't a very
smart decision because it means users cannot disable the eBPF kernel at
all (other than via the feature-flag).

It makes more sense to default to not loading the program at all AND
hard-fail if we are requested to load it but cannot. This allows us to
catch configuration errors early.
2025-04-04 07:00:29 +00:00
Jamil
463e70f3a4 chore(infra): Bump elixir VM image to COS 117 (#8547)
The relay was bumped here for the updated kernel. Would be good to stay
standardized.
2025-03-29 05:33:00 +00:00
Jamil
5d038697d6 feat(infra): Use GVNIC and set queue_count=2 for elixir app (#8546)
This aligns with the relay app and is safe for all machine types.

See https://cloud.google.com/compute/docs/networking/using-gvnic
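A rough sketch of what this looks like on the instance template (machine
type, subnetwork, and image here are placeholders, not the actual module
values):

```
resource "google_compute_instance_template" "elixir" {
  name_prefix  = "elixir-"
  machine_type = "e2-micro"

  disk {
    source_image = "cos-cloud/cos-117-lts"
    boot         = true
  }

  network_interface {
    subnetwork  = google_compute_subnetwork.apps.self_link
    nic_type    = "GVNIC" # gVNIC instead of the default VIRTIO_NET
    queue_count = 2
  }
}
```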
2025-03-29 05:16:44 +00:00
Jamil
1c4d3f44c1 fix(infra): Use 2 for default relay queue_count (#8542)
It seems that this cannot be higher than the number of vCPUs in the
instance.

```
Instance 'relay-7h8s' creation failed: Invalid value for field 'resource.networkInterfaces[0].queueCount': '4'. Networking queue number is invalid: '4'. (when acting as '85623168602@cloudservices.gserviceaccount.com')
```
2025-03-28 17:07:03 -07:00
Jamil
0110bdf7a7 fix(infra/relay): Set active queue count to half of max (#8539)
The `gve` driver defaults to setting the active queue count equal to the
max queue count.

We need this to be half or lower for XDP eBPF programs to load.

Related: #8538
2025-03-28 16:31:33 -07:00
Jamil
b1cdc3b03d feat(relay): Bump RX/TX queue count to 2 (#8538)
By default, GCP VMs have a max RX/TX queue count of `1`. While this is a
fine default, it causes XDP programs to fail to load onto the virtual
NIC with the following error:

```
gve 0000:00:04.0 eth0: XDP load failed: The number of configured RX queues 1 should be equal to the number of configured TX queues 1 and the number of configured RX/TX queues should be less than or equal to half the maximum number of RX/TX queues 1
```

To fix this, we can bump the maximum queue count to `2` (the max
supported by gVNIC is 16), allowing the current queue count of `1` to
satisfy the condition.
2025-03-28 21:57:44 +00:00
Jamil
6edfa7ba7f feat(infra): Use gVNIC for relay network interface driver (#8537)
This is supposed to offer much better performance and networking
features in GCP. I would bet it supports XDP as well, unlike the default
VIRTIO_NET driver.

See: https://cloud.google.com/compute/docs/networking/using-gvnic

Related:
https://github.com/firezone/firezone/issues/7518#issuecomment-2762357354
2025-03-28 13:39:08 -07:00
Thomas Eizinger
34c5b6475f chore(infra): bump COS version for relays to cos-117-lts (#8533)
The 117 version uses Linux 6.6 whereas 113 only uses Linux 6.1. By using
a newer kernel, we can hopefully get eBPF to work on Google Cloud.

https://cloud.google.com/container-optimized-os/docs/release-notes/m113
https://cloud.google.com/container-optimized-os/docs/release-notes/m117
2025-03-28 04:03:51 +00:00
Thomas Eizinger
1066d53d51 fix(infra): move privileged field to security-context (#8530)
Related: #8529
Related: #8496
2025-03-28 01:29:21 +00:00
Jamil
b618eb31e8 feat(infra): Make relay containers privileged (#8529)
This is needed to load eBPF programs.

Related: #8496
2025-03-27 20:36:40 +00:00
Jamil
e0c373ef2b chore(infra): Move google gateway to dedicated module (#8489)
Removes the google gateway module in this repo because:

- We already reference this module from our `environments` repo.
- Customers are already using the dedicated module.
- Anyone still pointing to the module in this repo will have issues
because Terraform [automatically tries to clone
submodules](https://github.com/hashicorp/terraform/issues/34917).
2025-03-20 05:16:28 +00:00
Jamil
73c63c8ea4 chore(infra): Use simplified config for swap space (#8488)
Turns out cloud-init has native support for configuring swapfiles, so we
use that here and make it configurable.

The `environments` submodule will be updated to inject the current value
into here.
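Roughly, the module now renders something like the following into the
instance user-data (the variable name is illustrative; cloud-init's
native `swap` module handles the rest):

```
variable "swap_size_gb" {
  description = "Swapfile size in GB (0 disables swap)"
  type        = number
  default     = 1
}

locals {
  # Rendered into the instance's cloud-init user-data
  swap_cloud_config = <<-EOT
    #cloud-config
    swap:
      filename: /swapfile
      size: ${var.swap_size_gb}G
      maxsize: ${var.swap_size_gb}G
  EOT
}
```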
2025-03-19 19:28:08 +00:00
Jamil
09fb5f9274 chore(infra): Enable pgaudit on master instance (#8434)
This is [step
1](https://cloud.google.com/sql/docs/postgres/pg-audit#set-pgaudit-flag-values)
of enabling `pgaudit` logs. We'll also need to `CREATE EXTENSION` which
will need to happen in a migration. I'll make a separate PR for that.
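For reference, setting the flag boils down to a `database_flags` entry
on the Cloud SQL instance (a sketch; the rest of the instance config is
elided):

```
resource "google_sql_database_instance" "master" {
  # ... existing instance config ...

  settings {
    database_flags {
      name  = "cloudsql.enable_pgaudit"
      value = "on"
    }
  }
}
```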

Supersedes: #5442
2025-03-14 20:14:23 +00:00
Jamil
5fc45b1a7e chore(infra): Increase PG backups to 30 days (#8433)
These are currently retained for only 7 days. It would be good to have more retention here.
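Sketch of the retention change on the Cloud SQL instance (with daily
automated backups, 30 retained backups is roughly 30 days; other
settings elided):

```
resource "google_sql_database_instance" "master" {
  # ... existing instance config ...

  settings {
    backup_configuration {
      enabled = true
      backup_retention_settings {
        retained_backups = 30
        retention_unit   = "COUNT"
      }
    }
  }
}
```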
2025-03-13 19:24:01 +00:00
Jamil
91a92f1773 feat(portal): Enable 1G of swap on portal instances (#8348)
The `e2-micro` instances we'll be rolling out have 1G of memory (which
should be plenty), but it would be helpful to be able to handle small
spikes without getting OOM-killed.

Related #8344
2025-03-04 19:36:33 +00:00
Jamil
0c5fc8fe0a chore(infra): Increase terraform create timeouts to 30m (#8230)
With the additional number of resources we are now managing on each
deploy, these can sometimes time out, even though they would have
succeeded.


https://app.terraform.io/app/firezone/workspaces/production/runs/run-qnyFGhyjX8ZxMWvf
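The change is just a `timeouts` block on the slow-to-create resources,
something like this (the resource shown is illustrative):

```
resource "google_compute_region_instance_group_manager" "portal" {
  # ... existing config ...

  timeouts {
    create = "30m"
    update = "30m"
  }
}
```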
2025-02-21 16:20:08 +00:00
Jamil
9a2f2c0fa6 fix(infra): Add missing naming suffix to lb ingress (#8206)
Adds a naming_suffix I left out on the relays module.
2025-02-19 15:13:04 -08:00
Jamil
762f16bfea fix(infra): create_before_destroy for all Relay resources (#8198)
When making any modification that taints any Relay infrastructure, some
Relay components are destroyed before they're created, and some are
created before they're destroyed.

This results in failures that can lead to downtime, even if we bump
subnet numbering to trigger a rollover of the `naming_suffix`. See
https://app.terraform.io/app/firezone/workspaces/staging/runs

To fix this, we ensure `create_before_destroy` is applied to all Relay
module resources, and we ensure that the `naming_suffix` is properly
used in all resources that require unique names or IDs within the
project.
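As a sketch, each Relay module resource gets the lifecycle setting plus
the suffix in its name (resource and variable names here are
illustrative):

```
resource "google_compute_instance_template" "relay" {
  name_prefix = "relay-${var.naming_suffix}-"
  # ... existing config ...

  lifecycle {
    create_before_destroy = true
  }
}
```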

Thus, we need to remember to bump subnet numbering whenever changing any
Relay infrastructure so that: (1) the subnet numbering doesn't collide,
and (2) the `naming_suffix` changes, which prevents other resource names
from colliding.

Unfortunately there doesn't seem to be a better alternative here. The
only other option I could find as of now is to derive the subnet
numbering dynamically, incrementing it on each deploy. That would taint
all Relay resources upon each and every deploy, which is wasteful and
prone to random timeouts or failures.
2025-02-19 07:10:12 -08:00
Jamil
bb999b73f3 chore(infra): bump tf environments to fast-forward (#8197)
This will perform the final staging test that will ensure the pending
prod deploy will work smoothly.
2025-02-19 00:36:12 -08:00
Jamil
d1de22e7cc fix(infra): Keep reservation names in sync (#8194) 2025-02-18 22:40:48 -08:00
Jamil
39e302f3b7 fix(infra): Add naming suffix to relay compute reservations (#8190)
Unfortunately, relay reservations need the naming suffix as well because
they require unique names in the project and `create_before_destroy` is
in effect.


https://app.terraform.io/app/firezone/workspaces/staging/runs/run-kfD9JtvTZEsXnfzA
2025-02-18 19:48:02 -08:00
Jamil
7eebc04118 fix(infra): Remove unused ingress Relay UDP ports (#8166)
These are redundant since we explicitly allow STUN/TURN traffic a few
lines up.
2025-02-17 22:50:02 +00:00
Jamil
5bac3f5ec2 fix(infra): Don't send more/faster metrics than Google accepts (#8028)
We are getting quite a few of these warnings on prod:

```
{400, "{\n  \"error\": {\n    \"code\": 400,\n    \"message\": \"One or more TimeSeries could not be written: timeSeries[0-39]: write for resource=gce_instance{zone:us-east1-d,instance_id:2678918148122610092} failed with: One or more points were written more frequently than the maximum sampling period configured for the metric.\",\n    \"status\": \"INVALID_ARGUMENT\",\n    \"details\": [\n      {\n        \"@type\": \"type.googleapis.com/google.monitoring.v3.CreateTimeSeriesSummary\",\n        \"totalPointCount\": 40,\n        \"successPointCount\": 31,\n        \"errors\": [\n          {\n            \"status\": {\n              \"code\": 9\n            },\n            \"pointCount\": 9\n          }\n        ]\n      }\n    ]\n  }\n}\n"}
```

Since the point count is _much_ less than our flush buffer size of 1000,
we can only surmise the limit we're hitting is the flush interval.

The telemetry metrics reporter is run on each node, so we run the risk
of violating Google's API limit regardless of what a single node's
`@flush_interval` is set to.

To solve this, we use a new table `telemetry_reporter_logs` that stores
the last time a particular `flush` occurred for a reporter module. This
tracks global state as to when the last flush occurred; if it was too
recent, the timer-based flush call is no-op'ed until the next one.

**Note**: The buffer-based `flush` is left unchanged; it will always be
called when `buffer_size > max_buffer_size`.
2025-02-10 18:21:40 +00:00
Jamil
60ab106b67 chore(infra): Update otel-collector image to 0.119.0 (#8059)
We are quite a few versions behind.

The changelog lists a good amount of [Breaking API
changes](https://github.com/open-telemetry/opentelemetry-collector/releases),
but rather than enumerate all of those, or forever stay on the same
(ancient) version, I thought it would be a good idea to flex the upgrade
muscle here and see where it lands us on staging.
2025-02-10 16:47:05 +00:00
Jamil
ff07f10759 chore(portal): Remove GCP alerting for application errors (#8040)
As discussed with @bmanifold, we're moving forward with the following
monitoring strategy:

For infra alerts, stick with GCP.

For application-level alerts, use Sentry.

Since we already have Sentry configured and working on staging, this PR
removes the "Errors in logs" alert since we will be receiving this in
Sentry going forward.
2025-02-07 15:12:39 +00:00
Brian Manifold
639f5e6a60 chore(ops): Update CPU utilization alert threshold (#7966)
This PR updates the threshold for CPU utilization monitoring alerts. This
is being done to avoid notification fatigue. This is not intended to
ignore any issue that might cause the CPU utilization to be high, but
rather make sure we don't miss other important alerts that might come
through while we try to solve the underlying CPU utilization issues.
2025-01-31 16:36:57 +00:00
Jamil
8e64a01f4a chore(infra): Disable debug log for otel (#7874)
In the relay's `cloud-init.yaml`, we've overridden the `telemetry`
service log filter to be `debug`.

This results in this log being printed to Cloud Logging every 1s, for
_every_ relay:

```
2025-01-26T23:00:35.066Z	debug	memorylimiter/memorylimiter.go:200	Currently used memory.	{"kind": "processor", "name": "memory_limiter", "pipeline": "logs", "cur_mem_mib": 31}
```

These logs are consuming over half of our total log count, which
accounts for over half our Cloud Monitoring cost -- the second highest
cost in our GCP account.

This PR removes the override so that the relay app has the same
`otel-collector` log level as the Elixir app: the default (presumably
`info`).
2025-01-26 18:57:07 -08:00
Jamil
7b40282ebe revert: pre-relay change for prod test (#7873)
Doing another (hopefully final) reversion of staging from the prod setup
to what we're after with respect to relay infra.

Reverts firezone/firezone#7872
2025-01-26 14:50:49 -08:00
Jamil
fe343a9372 chore(infra): revert to pre-relay change for prod test (#7872) 2025-01-26 14:02:53 -08:00
Jamil
d96276e1ac fix(infra): Use naming_suffix in instance_group_manager (#7871)
Google still had lingering Relay instance groups and subnets around from
a previous deployment that had been deleted in the UI and were gone, but
then popped back up.

Theoretically, the instance groups should be deleted because there is no
current Terraform config matching them. This change will ensure that
instance groups also get rolled over based on the naming suffix
introduced in #7870.

Related: #7870
2025-01-26 12:10:34 -08:00
Jamil
0454fb173d refactor(infra): Ensure network names unique (#7870)
Turns out subnets need to have globally unique names as well. This PR
updates the instance-template, VPC, and subnet names to append an
8-character random string.

This random string "depends on" the subnet IP range configuration
specified above, so that if we change that in the future, causing a
network change, the naming will change as well.

Lastly, this random_string is also passed to the `relays` module to be
used in the instance template name prefix. While that name does _not_
need to be globally unique, the `instance_template` **needs** to be
rolled over if the subnets change, because otherwise it will contain a
network interface that is linked to both old and new subnets and GCP
will complain about that.

Reverts: firezone/firezone#7869
2025-01-26 08:16:23 -08:00
Jamil
1826700b89 revert: re-apply Relay region changes (#7869)
Reverts firezone/firezone#7868
2025-01-26 06:46:24 -08:00
Jamil
0805e87016 chore(infra): re-apply Relay region changes (#7868)
Reverts firezone/firezone#7835 in order to test how this will be applied
to prod.

If this goes through fine, we should be ok for a prod rollout.
2025-01-26 06:13:26 -08:00
Jamil
90f445a971 chore(infra): Revert relay regions to test prod-like deploy (#7835)
Since we know we now have the Relay configuration we want (and works),
this PR rolls back staging to how it was pre-Relay region changes, so we
can test that a single `terraform apply` on prod will deploy without any
errors.
2025-01-25 17:05:06 +00:00
Jamil
aaea3bf537 revert(infra): Billing budget (PR #7836) (#7855)
This is causing issues applying because our CI terraform IAM user
doesn't have the `Billing Account Administrator` role.

Rather than granting such a sensitive role to our CI pipeline, I'm
suggesting we create the billing budget outside the scope of the
terraform config tracked in this repo.

If we want it to be tracked as code, I would propose a separate
(private) repository with a separate token / IAM permissions that we can
monitor separately.

For the time being, I'll plan to manually create this budget in the UI.

Reverts: #7836
2025-01-24 06:53:47 +00:00
Jamil
c913086dbe feat(infra): Add billing budget alerts to infra (#7836)
To help prevent surprises with unexpected cloud bills, we add a billing
budget that will trigger an alert when the 50% threshold is hit.

The exact amount is considered secret and is set via variables that are
already added in HCP staging and prod envs.
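Sketch of the budget resource, assuming the amount and billing account
come in as (secret) variables:

```
resource "google_billing_budget" "main" {
  billing_account = var.billing_account_id
  display_name    = "Firezone budget"

  amount {
    specified_amount {
      currency_code = "USD"
      units         = var.budget_amount # kept secret, injected via HCP env vars
    }
  }

  threshold_rules {
    threshold_percent = 0.5 # alert at 50% of the budget
  }
}
```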
2025-01-23 19:19:36 +00:00
Jamil
dca9645adf chore(infra): Remove unused tf vars (#7803)
These were leftover from #7737 and friends.
2025-01-22 05:32:28 +00:00
Jamil
0a1cd92c00 fix(infra): Rotate naming to taint old Relay instances (#7739)
The Relay instance template is sticking around because none of its
inputs have changed, so we bump its name.
2025-01-12 21:34:18 -08:00
Jamil
5dd640daa8 fix(infra): Define Relay subnets outside of Relays module (#7736)
Even after all of the changes made to make the subnets update properly
in the Relays module, it will always fail because of these two facts
combined:

- lifecycle is `create_before_destroy`
- the GCP instance group template binds a network interface on a
per-subnet basis, and it cannot be bound to both the old and new
subnets. The fix for this would be to create a new instance group
manager on each deploy

Rather than needlessly roll over the relay networks on each deploy,
since they're not changing, it would make more sense to define them
outside of the Relays module so that they aren't tainted by code
changes. This will prevent needless resource replacement and allow for
the Relay module to use them as-is.
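Roughly, the root config ends up shaped like this (module path and input
names are illustrative):

```
# Subnets live at the root, outside the Relays module, so code changes
# to the module can no longer taint them.
resource "google_compute_subnetwork" "relays" {
  for_each      = var.relay_subnets # e.g. { "us-east1" = "10.128.0.0/20", ... }
  name          = "relays-${each.key}"
  region        = each.key
  ip_cidr_range = each.value
  network       = google_compute_network.relays.self_link
}

module "relays" {
  source  = "./modules/relays"
  subnets = { for region, s in google_compute_subnetwork.relays : region => s.self_link }
}
```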
2025-01-12 19:04:44 -08:00
Jamil
03d81ed2df fix(infra): Fix subnet numbering across all regions (#7734)
#7733 fixed the randomness generation, but didn't fix the numbering.
According to [GCP docs](https://cloud.google.com/vpc/docs/subnets), we
can use virtually any RFC 1918 space for this.

This PR updates our numbering scheme to use the `10.128.0.0/9` space for
Relay subnets and changes the elixir app to use `10.2.2.0/20` to prevent
collisions.
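For example, per-region Relay subnets can be carved out of that space
with `cidrsubnet` (the variable name is illustrative):

```
locals {
  # One /20 per relay region out of 10.128.0.0/9 (9 + 11 = 20 prefix bits)
  relay_subnet_cidrs = {
    for idx, region in var.relay_regions :
    region => cidrsubnet("10.128.0.0/9", 11, idx)
  }
}
```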
2025-01-12 16:33:03 -08:00
Jamil
e9a120c272 fix(infra): Rotate random vars on each image version (#7733) 2025-01-12 14:22:14 -08:00
Jamil
d6d0d78bda chore(infra): Use numeric instead of number (#7731)
`number` is deprecated for the built-in `random_string` resource.
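A minimal example of the rename:

```
resource "random_string" "suffix" {
  length  = 8
  numeric = true # replaces the deprecated `number` argument
}
```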
2025-01-12 13:09:29 -08:00
Jamil
ba5b8ed3f5 fix(infra): Use computed cidrsubnet for Relays (#7730)
When a Relay's instances are updated / changed, the contained
subnetwork's `name` and `ip_cidr_range` need to be updated to something
else because we are using the `create_before_destroy` lifecycle
configuration for the Relays module.

To fix this, we need to make sure that when recreating Relays, we use a
unique `name` and `ip_cidr_range` for the new instances so as not to
conflict with existing ones.

To handle this, we use a computed, state-tracked value for
`ip_cidr_range` that automatically adjusts to the number of Relay
regions we have and is incremented each time the Relays are recreated.
Then we update the `name` to include this range to ensure we never have
a subnet name that conflicts with an existing one.
2025-01-12 12:22:39 -08:00
Jamil
45bfe0f2a3 chore(infra): Deny connections from US-sanctioned countries with HTTP 403 (#7462)
Implementing the remainder of the legally required block. Will be
applied on Dec 9th, as we notified customers.
2024-12-06 20:26:30 +00:00
Jamil
3a62709c77 docs: Add restricted regions docs (#7395)
This will be referred to when we make our email announcement.
2024-11-24 17:20:06 +00:00