Why:
* In order to reduce the number of traces/spans being sent to GCP, a
custom otel-collector config is needed for each type of node in our
portal deployment. This commit allows the elixir TF module to accept
an otel-collector config at the time of use, rather than having it
hard-coded into the module itself.
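Roughly, the shape is something like this (the variable name, module path, and config path are illustrative, not the actual module interface):
```
# Inside the elixir module: accept the collector config instead of hard-coding it.
variable "otel_collector_config" {
  description = "otel-collector configuration (YAML) for this node type"
  type        = string
}

# At the call site: each node type supplies its own rendered config.
module "web" {
  source = "../modules/google-cloud/apps/elixir" # hypothetical path

  otel_collector_config = file("${path.module}/otel-collector/web.yaml") # hypothetical path
  # ...other module inputs elided
}
```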
During a deploy, we had `max_surge_fixed` set to the target instance
count - 1, which caused only 3 nodes to be spun up at a time instead of
the full 4.
We also had `max_unavailable_fixed = 1`, which allowed the instance group
manager to bring an old, healthy node down before the last remaining
node was spun up.
Since [we are now always
setting](https://github.com/firezone/environments/pull/29) the
`reservation_size` to `2 * replicas`, we can fix these values to make sure all
new VMs spin up before the old ones are deleted.
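The intended rollout settings look roughly like this (a sketch only; the resource and variable names are illustrative, not the exact ones in the module):
```
resource "google_compute_region_instance_group_manager" "web" {
  name               = "web"
  base_instance_name = "web"
  region             = var.region
  target_size        = var.replicas

  version {
    instance_template = var.instance_template # hypothetical input
  }

  update_policy {
    type                  = "PROACTIVE"
    minimal_action        = "REPLACE"
    max_surge_fixed       = var.replicas # bring up a full new set of VMs first
    max_unavailable_fixed = 0            # never take an old, healthy node down early
  }
}
```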
### Pre-upgrade TODO
- [ ] Update firezone.statuspage.io with planned maintenance status
### Performing the upgrade
- [ ] Upgrade in place using the GCP UI
- [ ] Run `ANALYZE;`
- [ ] Run `REINDEX DATABASE firezone;`
- [ ] When complete, deploy production via Terraform with new version
selected
### Post-upgrade TODO
- [ ] Test application connectivity
- [ ] Monitor Cloud SQL logs for any issues
- [ ] Unmark the planned maintenance window in firezone.statuspage.io
Co-authored-by: Jamil Bou Kheir <jamilbk@users.noreply.github.com>
There was a slight API change in the way LoggerJSON's configuration is
generated, so I took the time to do a little fixing and cleanup here.
Specifically, we should be using the `new/1` callback to create the
Logger config, which fixes the exception below caused by missing config
keys:
```
FORMATTER CRASH: {report,[{formatter_crashed,'Elixir.LoggerJSON.Formatters.GoogleCloud'},{config,[{metadata,{all_except,[socket,conn]}},{redactors,[{'Elixir.LoggerJSON.Redactors.RedactKeys',[<<"password">>,<<"secret">>,<<"nonce">>,<<"fragment">>,<<"state">>,<<"token">>,<<"public_key">>,<<"private_key">>,<<"preshared_key">>,<<"session">>,<<"sessions">>]}]}]},{log_event,#{meta => #{line => 15,pid => <0.308.0>,time => 1744145139650804,file => "lib/logger.ex",gl => <0.281.0>,domain => [elixir],application => libcluster,mfa => {'Elixir.Cluster.Logger',info,2}},msg => {string,<<"[libcluster:default] connected to :\"web@web.cluster.local\"">>},level => info}},{reason,{error,{badmatch,[{metadata,{all_except,[socket,conn]}},{redactors,[{'Elixir.LoggerJSON.Redactors.RedactKeys',[<<"password">>,<<"secret">>,<<"nonce">>,<<"fragment">>,<<"state">>,<<"token">>,<<"public_key">>,<<"private_key">>,<<"preshared_key">>,<<"session">>,<<"sessions">>]}]}]},[{'Elixir.LoggerJSON.Formatters.GoogleCloud',format,2,[{file,"lib/logger_json/formatters/google_cloud.ex"},{line,148}]}]}}]}
```
Supersedes #8714
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
It happened a bunch of times during testing that I'd forget to set
the right interface onto which the eBPF kernel should be loaded and was
left wondering why it didn't work. Defaulting to `eth0` wasn't a very smart
decision because it means users cannot disable the eBPF kernel at all
(other than via the feature flag).
It makes more sense to default to not loading the program at all AND
hard-fail if we are requested to load it but cannot. This allows us to
catch configuration errors early.
It seems that the NIC queue count cannot be higher than the number of vCPUs
in the instance.
```
Instance 'relay-7h8s' creation failed: Invalid value for field 'resource.networkInterfaces[0].queueCount': '4'. Networking queue number is invalid: '4'. (when acting as '85623168602@cloudservices.gserviceaccount.com')
```
The `gve` driver defaults to setting the active queue count equal to the
max queue count.
We need the active count to be half of the maximum or lower for XDP eBPF
programs to load.
Related: #8538
By default, GCP VMs have a max RX/TX queue count of `1`. While this is a
fine default, it causes XDP programs to fail to load onto the virtual
NIC with the following error:
```
gve 0000:00:04.0 eth0: XDP load failed: The number of configured RX queues 1 should be equal to the number of configured TX queues 1 and the number of configured RX/TX queues should be less than or equal to half the maximum number of RX/TX queues 1
```
To fix this, we can bump the maximum queue count to `2` (the max supported
by gVNIC is 16), allowing the current queue count of `1` to satisfy the
condition.
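A sketch of the relevant template change, assuming the google provider's `queue_count` attribute on the network interface (the other inputs here are hypothetical placeholders):
```
resource "google_compute_instance_template" "relay" {
  name_prefix  = "relay-"
  machine_type = var.machine_type # hypothetical input

  disk {
    source_image = var.boot_image # hypothetical input
    boot         = true
  }

  network_interface {
    subnetwork  = var.subnetwork # hypothetical input
    nic_type    = "GVNIC"
    queue_count = 2 # max RX/TX queues; must also stay at or below the vCPU count
  }
}
```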
Removes the google gateway module in this repo because:
- We already reference this module from our `environments` repo.
- Customers are already using the dedicated module.
- Anyone actually pointing to the module in this repo will have issues
because Terraform [automatically tries to clone
submodules](https://github.com/hashicorp/terraform/issues/34917).
Turns out cloud-init has native support for configuring swapfiles, so we
use that here and make it configurable.
The `environments` submodule will be updated to inject the current value
here.
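A minimal sketch of the idea, with illustrative names (cloud-init's native `swap` config does the actual work):
```
variable "vm_swap_size_gb" {
  description = "Size of the swapfile configured via cloud-init"
  type        = number
  default     = 1
}

locals {
  vm_swap_size_bytes = var.vm_swap_size_gb * 1024 * 1024 * 1024

  # Merged into the instance's cloud-init user-data.
  cloud_init_swap = <<-EOT
    #cloud-config
    swap:
      filename: /swapfile
      size: ${local.vm_swap_size_bytes}
      maxsize: ${local.vm_swap_size_bytes}
  EOT
}
```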
The `e2-micro` instances we'll be rolling out have 1G of memory (which
should be plenty), but it would be helpful to be able to handle small
spikes without getting OOM-killed.
Related: #8344
When making any modification that taints any Relay infrastructure, some
Relay components are destroyed before they're created, and some are
created before they're destroyed.
This results in failures that can lead to downtime, even if we bump
subnet numbering to trigger a rollover of the `naming_suffix`. See
https://app.terraform.io/app/firezone/workspaces/staging/runs
To fix this, we ensure `create_before_destroy` is applied to all Relay
module resources, and we ensure that the `naming_suffix` is properly
used in all resources that require unique names or IDs within the
project.
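The pattern is roughly the following (resource and variable names are illustrative):
```
resource "google_compute_subnetwork" "relays" {
  name          = "relays-${var.naming_suffix}"
  region        = var.region
  network       = var.network_id
  ip_cidr_range = var.relay_subnet_cidr

  lifecycle {
    create_before_destroy = true
  }
}

# The same naming_suffix + create_before_destroy treatment applies to the
# instance template, instance group manager, and any other Relay resource
# that needs a unique name or ID.
```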
Thus, we need to remember to bump the subnet numbering whenever
changing any Relay infrastructure, so that (1) the subnet numbering
doesn't collide, and (2) the `naming_suffix` changes, which
prevents other resource names from colliding.
Unfortunately, there doesn't seem to be a better alternative here. The
only other option I could find as of now is to derive the subnet
numbering dynamically on each deploy, incrementing it every time,
but that would taint all Relay resources on each and every deploy, which is
wasteful and prone to random timeouts or failures.
We are getting quite a few of these warnings on prod:
```
{400, "{\n \"error\": {\n \"code\": 400,\n \"message\": \"One or more TimeSeries could not be written: timeSeries[0-39]: write for resource=gce_instance{zone:us-east1-d,instance_id:2678918148122610092} failed with: One or more points were written more frequently than the maximum sampling period configured for the metric.\",\n \"status\": \"INVALID_ARGUMENT\",\n \"details\": [\n {\n \"@type\": \"type.googleapis.com/google.monitoring.v3.CreateTimeSeriesSummary\",\n \"totalPointCount\": 40,\n \"successPointCount\": 31,\n \"errors\": [\n {\n \"status\": {\n \"code\": 9\n },\n \"pointCount\": 9\n }\n ]\n }\n ]\n }\n}\n"}
```
Since the point count is _much_ less than our flush buffer size of 1000,
we can only surmise the limit we're hitting is the flush interval.
The telemetry metrics reporter is run on each node, so we run the risk
of violating Google's API limit regardless of what a single node's
`@flush_interval` is set to.
To solve this, we use a new table, `telemetry_reporter_logs`, that stores
the last time a particular `flush` occurred for a reporter module. This
tracks global state as to when the last flush occurred, and if it was too
recent, the timer-based flush call is no-op'ed until the next one.
**Note**: The buffer-based `flush` is left unchanged; it will always
be called when `buffer_size > max_buffer_size`.
We are quite a few versions behind.
The changelog lists a good number of [Breaking API
changes](https://github.com/open-telemetry/opentelemetry-collector/releases),
but rather than enumerate all of those, or stay forever on the same
(ancient) version, I thought it would be a good idea to flex the upgrade
muscle here and see where it lands us on staging.
As discussed with @bmanifold, we're moving forward with the following
monitoring strategy:
- For infra alerts, stick with GCP.
- For application-level alerts, use Sentry.
Because we already have Sentry configured and working on staging, this PR
removes the "Errors in logs" alert; we'll be receiving those errors in
Sentry going forward.
This PR updates the threshold for CPU utilization monitoring alerts to
avoid notification fatigue. The intent is not to ignore issues that
cause CPU utilization to be high, but rather to make sure we don't miss
other important alerts that might come through while we work on the
underlying CPU utilization issues.
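For reference, a rough sketch of what such an alert policy looks like in Terraform; the display names, threshold, and duration below are placeholders rather than the values chosen in this PR, and the notification channel variable is hypothetical:
```
resource "google_monitoring_alert_policy" "high_cpu" {
  display_name = "High CPU utilization" # placeholder
  combiner     = "OR"

  conditions {
    display_name = "CPU utilization above threshold"

    condition_threshold {
      filter          = "resource.type = \"gce_instance\" AND metric.type = \"compute.googleapis.com/instance/cpu/utilization\""
      comparison      = "COMPARISON_GT"
      threshold_value = 0.9    # placeholder
      duration        = "900s" # placeholder

      aggregations {
        alignment_period   = "300s"
        per_series_aligner = "ALIGN_MEAN"
      }
    }
  }

  notification_channels = var.notification_channel_ids # hypothetical input
}
```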
In the relay's `cloud-init.yaml`, we've overridden the `telemetry`
service log filter to be `debug`.
This results in the following log being printed to Cloud Logging every
second, for _every_ relay:
```
2025-01-26T23:00:35.066Z debug memorylimiter/memorylimiter.go:200 Currently used memory. {"kind": "processor", "name": "memory_limiter", "pipeline": "logs", "cur_mem_mib": 31}
```
These logs are consuming over half of our total log count, which
accounts for over half our Cloud Monitoring cost -- the second highest
cost in our GCP account.
This PR removes the override so that the relay uses the same
`otel-collector` log level as the Elixir nodes: the default (presumably
`info`).
Doing another (hopefully final) reversion of staging from the prod setup
to what we're after with respect to relay infra.
Reverts firezone/firezone#7872
Google still had lingering Relay instance groups and subnets around from
a previous deployment; they were deleted in the UI and gone, but then
popped back up.
Theoretically, the instance groups should have been deleted because there
is no current Terraform config matching them. This change will ensure that
instance groups also get rolled over based on the naming suffix
introduced in #7870.
Related: #7870
Turns out subnets need to have globally unique names as well. This PR
updates the instance-template, VPC, and subnet names to append an
8-character random string.
This random string "depends on" the subnet IP range configuration
specified above, so that if we change that in the future, causing a
network change, the naming will change as well.
Lastly, this `random_string` is also passed to the `relays` module to be
used in the instance template name prefix. While that name does _not_
need to be globally unique, the instance template **needs** to be
rolled over when the subnets change; otherwise it would contain a
network interface linked to both the old and new subnets, and GCP
will complain about that.
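A loose sketch of the wiring (resource, variable, and path names are illustrative):
```
resource "random_string" "naming_suffix" {
  length  = 8
  lower   = true
  upper   = false
  numeric = true
  special = false

  # Regenerate the suffix whenever the subnet IP range configuration changes,
  # forcing the dependent names to roll over together.
  keepers = {
    relay_subnet_cidr = var.relay_subnet_cidr
  }
}

module "relays" {
  source = "./modules/relays" # hypothetical path

  naming_suffix = random_string.naming_suffix.result
  # ...other module inputs elided
}
```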
Reverts: firezone/firezone#7869
Since we know we now have the Relay configuration we want (and that it
works), this PR rolls back staging to how it was before the Relay region
changes, so we can test that a single `terraform apply` on prod will deploy
without any errors.
This is causing issues applying because our CI terraform IAM user
doesn't have the `Billing Account Administrator` role.
Rather than granting such a sensitive role to our CI pipeline, I'm
suggesting we create the billing budget outside the scope of the
terraform config tracked in this repo.
If we want it to be tracked as code, I would propose a separate (private)
repository with its own token / IAM permissions that we can monitor
separately.
For the time being, I'll plan to manually create this budget in the UI.
Reverts: #7836
To help prevent surprises from unexpected cloud bills, we add a billing
budget with an alert that triggers when the 50% threshold is hit.
The exact amount is considered secret and is set via variables that are
already added in HCP staging and prod envs.
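For reference, a rough sketch of the budget resource (the layout and input names are illustrative; the real amount stays in the HCP variables):
```
variable "billing_budget_amount" {
  description = "Monthly budget in whole currency units (kept out of the repo)"
  type        = string
  sensitive   = true
}

resource "google_billing_budget" "main" {
  billing_account = var.billing_account_id # hypothetical input
  display_name    = "Monthly spend budget"

  amount {
    specified_amount {
      currency_code = "USD"
      units         = var.billing_budget_amount
    }
  }

  threshold_rules {
    threshold_percent = 0.5 # alert at 50% of the budget
  }
}
```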
Even after all of the changes made to get the subnets updating properly
in the Relays module, the apply will always fail because of these two
facts combined:
- The lifecycle is `create_before_destroy`.
- The GCP instance group template binds a network interface on a per-subnet
basis, and that interface cannot be bound to both the old and new subnet.
The fix for this would be to create a new instance group manager on each
deploy.
Rather than needlessly roll over the relay networks on each deploy when
they're not changing, it makes more sense to define them outside of the
Relays module so that they aren't tainted by code changes. This prevents
needless resource replacement and allows the Relay module to use them
as-is.
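A sketch of the split (the variable names, region keying, and module interface are illustrative):
```
# Defined at the environment level, keyed by region, e.g.
# { "us-east1" = "10.129.0.0/24", ... }
resource "google_compute_subnetwork" "relays" {
  for_each = var.relay_subnet_cidrs

  name          = "relays-${each.key}"
  region        = each.key
  network       = var.network_id
  ip_cidr_range = each.value
}

module "relays" {
  source = "./modules/relays" # hypothetical path

  subnet_self_links = { for region, s in google_compute_subnetwork.relays : region => s.self_link }
  # ...other module inputs elided
}
```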
#7733 fixed the randomness generation, but didn't fix the numbering.
According to [GCP docs](https://cloud.google.com/vpc/docs/subnets), we
can use virtually any RFC 1918 space for this.
This PR updates our numbering scheme to use the `10.128.0.0/9` space for
Relay subnets and changes the elixir app to use `10.2.2.0/20` to prevent
collisions.
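A sketch of the kind of numbering this enables (the region-list variable and the /24 sizing are illustrative): carve per-region relay subnets out of `10.128.0.0/9`, keyed by region index, well away from the elixir app's range.
```
locals {
  relay_base_cidr = "10.128.0.0/9"

  relay_subnet_cidrs = {
    for idx, region in var.relay_regions : # hypothetical list, e.g. ["us-east1", "europe-west1"]
    region => cidrsubnet(local.relay_base_cidr, 15, idx) # /9 + 15 bits = a /24 per region
  }
}
```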
When a Relay's instances are updated, the contained subnetwork's `name`
and `ip_cidr_range` need to be changed to something else because we are
using the `create_before_destroy` lifecycle configuration for the Relays
module.
To fix this, we need to make sure that when recreating Relays, we use a
unique `name` and `ip_cidr_range` for the new instances so as not to
conflict with existing ones.
To handle this, we use a computed, state-tracked value for
`ip_cidr_range` that automatically adjusts to the number of Relay
regions we have and is incremented each time the Relays are
recreated. Then we update the `name` to include this range, to ensure we
never have a subnet name that conflicts with an existing one.
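A loose sketch of the naming half of this, with the incrementing, state-tracked value simplified to a plain index variable (all names and the numbering scheme here are illustrative):
```
locals {
  relay_subnet_cidr = cidrsubnet("10.0.0.0/8", 16, var.relay_subnet_index) # hypothetical index
  relay_subnet_name = "relays-${replace(local.relay_subnet_cidr, "/[./]/", "-")}" # e.g. "relays-10-0-3-0-24"
}
```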