Why:
* In order to reduce the number of traces/spans being sent to GCP, a
custom otel-collector config is needed for each type of node in our
portal deployment. This commit allows the elixir TF module to accept
an otel-collector config at the time of use, rather than having it
hard-coded into the module itself.
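Roughly, the shape is something like this (the variable name, module path, and config path are illustrative, not the actual module interface):
```
# Inside the elixir module: accept the collector config instead of hard-coding it.
variable "otel_collector_config" {
  description = "otel-collector configuration (YAML) for this node type"
  type        = string
}

# At the call site: each node type supplies its own rendered config.
module "web" {
  source = "../modules/google-cloud/apps/elixir" # hypothetical path

  otel_collector_config = file("${path.module}/otel-collector/web.yaml") # hypothetical path
  # ...other module inputs elided
}
```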
During a deploy, we had `max_surge_fixed` set to the target instance
count - 1, which caused only 3 nodes to be spun up at a time instead of
the full 4.
We also had `max_unavailable_fixed = 1`, which allowed the instance group
manager to bring an old, healthy node down before the last remaining
node was spun up.
Since [we are now always
setting](https://github.com/firezone/environments/pull/29) the
`reservation_size` to `2 * replicas`, we can fix these values to make sure all
new VMs spin up before the old ones are deleted.
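The intended rollout settings look roughly like this (a sketch only; the resource and variable names are illustrative, not the exact ones in the module):
```
resource "google_compute_region_instance_group_manager" "web" {
  name               = "web"
  base_instance_name = "web"
  region             = var.region
  target_size        = var.replicas

  version {
    instance_template = var.instance_template # hypothetical input
  }

  update_policy {
    type                  = "PROACTIVE"
    minimal_action        = "REPLACE"
    max_surge_fixed       = var.replicas # bring up a full new set of VMs first
    max_unavailable_fixed = 0            # never take an old, healthy node down early
  }
}
```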
### Pre-upgrade TODO
- [ ] Update firezone.statuspage.io with planned maintenance status
### Performing the upgrade
- [ ] Upgrade in place using the GCP UI
- [ ] Run `ANALYZE;`
- [ ] Run `REINDEX DATABASE firezone;`
- [ ] When complete, deploy production via Terraform with new version
selected
### Post-upgrade TODO
- [ ] Test application connectivity
- [ ] Monitor Cloud SQL logs for any issues
- [ ] Unmark the planned maintenance window in firezone.statuspage.io
Co-authored-by: Jamil Bou Kheir <jamilbk@users.noreply.github.com>
There was a slight API change in the way LoggerJSON's configuration is
generated, so I took the time to do a little fixing and cleanup here.
Specifically, we should be using the `new/1` callback to create the
Logger config, which fixes the exception below caused by missing config
keys:
```
FORMATTER CRASH: {report,[{formatter_crashed,'Elixir.LoggerJSON.Formatters.GoogleCloud'},{config,[{metadata,{all_except,[socket,conn]}},{redactors,[{'Elixir.LoggerJSON.Redactors.RedactKeys',[<<"password">>,<<"secret">>,<<"nonce">>,<<"fragment">>,<<"state">>,<<"token">>,<<"public_key">>,<<"private_key">>,<<"preshared_key">>,<<"session">>,<<"sessions">>]}]}]},{log_event,#{meta => #{line => 15,pid => <0.308.0>,time => 1744145139650804,file => "lib/logger.ex",gl => <0.281.0>,domain => [elixir],application => libcluster,mfa => {'Elixir.Cluster.Logger',info,2}},msg => {string,<<"[libcluster:default] connected to :\"web@web.cluster.local\"">>},level => info}},{reason,{error,{badmatch,[{metadata,{all_except,[socket,conn]}},{redactors,[{'Elixir.LoggerJSON.Redactors.RedactKeys',[<<"password">>,<<"secret">>,<<"nonce">>,<<"fragment">>,<<"state">>,<<"token">>,<<"public_key">>,<<"private_key">>,<<"preshared_key">>,<<"session">>,<<"sessions">>]}]}]},[{'Elixir.LoggerJSON.Formatters.GoogleCloud',format,2,[{file,"lib/logger_json/formatters/google_cloud.ex"},{line,148}]}]}}]}
```
Supersedes #8714
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
It happened a bunch of times during testing that I'd forget to set
the right interface onto which the eBPF kernel should be loaded and was
left wondering why it didn't work. Defaulting to `eth0` wasn't a very smart
decision because it means users cannot disable the eBPF kernel at all
(other than via the feature flag).
It makes more sense to default to not loading the program at all AND
hard-fail if we are requested to load it but cannot. This allows us to
catch configuration errors early.
It seems that the NIC queue count cannot be higher than the number of vCPUs
in the instance.
```
Instance 'relay-7h8s' creation failed: Invalid value for field 'resource.networkInterfaces[0].queueCount': '4'. Networking queue number is invalid: '4'. (when acting as '85623168602@cloudservices.gserviceaccount.com')
```
The `gve` driver defaults to setting the active queue count equal to the
max queue count.
We need the active count to be half of the maximum or lower for XDP eBPF
programs to load.
Related: #8538
By default, GCP VMs have a max RX/TX queue count of `1`. While this is a
fine default, it causes XDP programs to fail to load onto the virtual
NIC with the following error:
```
gve 0000:00:04.0 eth0: XDP load failed: The number of configured RX queues 1 should be equal to the number of configured TX queues 1 and the number of configured RX/TX queues should be less than or equal to half the maximum number of RX/TX queues 1
```
To fix this, we can bump the maximum queue count to `2` (the max supported
by gVNIC is 16), allowing the current queue count of `1` to satisfy the
condition.
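A sketch of the relevant template change, assuming the google provider's `queue_count` attribute on the network interface (the other inputs here are hypothetical placeholders):
```
resource "google_compute_instance_template" "relay" {
  name_prefix  = "relay-"
  machine_type = var.machine_type # hypothetical input

  disk {
    source_image = var.boot_image # hypothetical input
    boot         = true
  }

  network_interface {
    subnetwork  = var.subnetwork # hypothetical input
    nic_type    = "GVNIC"
    queue_count = 2 # max RX/TX queues; must also stay at or below the vCPU count
  }
}
```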
Removes the google gateway module in this repo because:
- We already reference this module from our `environments` repo.
- Customers are already using the dedicated module.
- Anyone actually pointing to the module in this repo will have issues
because Terraform [automatically tries to clone
submodules](https://github.com/hashicorp/terraform/issues/34917).
Turns out cloud-init has native support for configuring swapfiles, so we
use that here and make it configurable.
The `environments` submodule will be updated to inject the current value
here.
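A minimal sketch of the idea, with illustrative names (cloud-init's native `swap` config does the actual work):
```
variable "vm_swap_size_gb" {
  description = "Size of the swapfile configured via cloud-init"
  type        = number
  default     = 1
}

locals {
  vm_swap_size_bytes = var.vm_swap_size_gb * 1024 * 1024 * 1024

  # Merged into the instance's cloud-init user-data.
  cloud_init_swap = <<-EOT
    #cloud-config
    swap:
      filename: /swapfile
      size: ${local.vm_swap_size_bytes}
      maxsize: ${local.vm_swap_size_bytes}
  EOT
}
```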
The `e2-micro` instances we'll be rolling out have 1G of memory (which
should be plenty), but it would be helpful to be able to handle small
spikes without getting OOM-killed.
Related: #8344
When making any modification that taints any Relay infrastructure, some
Relay components are destroyed before they're created, and some are
created before they're destroyed.
This results in failures that can lead to downtime, even if we bump
subnet numbering to trigger a rollover of the `naming_suffix`. See
https://app.terraform.io/app/firezone/workspaces/staging/runs
To fix this, we ensure `create_before_destroy` is applied to all Relay
module resources, and we ensure that the `naming_suffix` is properly
used in all resources that require unique names or IDs within the
project.
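The pattern is roughly the following (resource and variable names are illustrative):
```
resource "google_compute_subnetwork" "relays" {
  name          = "relays-${var.naming_suffix}"
  region        = var.region
  network       = var.network_id
  ip_cidr_range = var.relay_subnet_cidr

  lifecycle {
    create_before_destroy = true
  }
}

# The same naming_suffix + create_before_destroy treatment applies to the
# instance template, instance group manager, and any other Relay resource
# that needs a unique name or ID.
```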
Thus, we need to remember to bump the subnet numbering whenever
changing any Relay infrastructure, so that (1) the subnet numbering
doesn't collide, and (2) the `naming_suffix` changes, which
prevents other resource names from colliding.
Unfortunately, there doesn't seem to be a better alternative here. The
only other option I could find as of now is to derive the subnet
numbering dynamically on each deploy, incrementing it every time,
but that would taint all Relay resources on each and every deploy, which is
wasteful and prone to random timeouts or failures.
We are getting quite a few of these warnings on prod:
```
{400, "{\n \"error\": {\n \"code\": 400,\n \"message\": \"One or more TimeSeries could not be written: timeSeries[0-39]: write for resource=gce_instance{zone:us-east1-d,instance_id:2678918148122610092} failed with: One or more points were written more frequently than the maximum sampling period configured for the metric.\",\n \"status\": \"INVALID_ARGUMENT\",\n \"details\": [\n {\n \"@type\": \"type.googleapis.com/google.monitoring.v3.CreateTimeSeriesSummary\",\n \"totalPointCount\": 40,\n \"successPointCount\": 31,\n \"errors\": [\n {\n \"status\": {\n \"code\": 9\n },\n \"pointCount\": 9\n }\n ]\n }\n ]\n }\n}\n"}
```
Since the point count is _much_ less than our flush buffer size of 1000,
we can only surmise the limit we're hitting is the flush interval.
The telemetry metrics reporter is run on each node, so we run the risk
of violating Google's API limit regardless of what a single node's
`@flush_interval` is set to.
To solve this, we use a new table, `telemetry_reporter_logs`, that stores
the last time a particular `flush` occurred for a reporter module. This
tracks global state as to when the last flush occurred, and if it was too
recent, the timer-based flush call is no-op'ed until the next one.
**Note**: The buffer-based `flush` is left unchanged; it will always
be called when `buffer_size > max_buffer_size`.
We are quite a few versions behind.
The changelog lists a good number of [Breaking API
changes](https://github.com/open-telemetry/opentelemetry-collector/releases),
but rather than enumerate all of those, or stay forever on the same
(ancient) version, I thought it would be a good idea to flex the upgrade
muscle here and see where it lands us on staging.
As discussed with @bmanifold, we're moving forward with the following
monitoring strategy:
- For infra alerts, stick with GCP.
- For application-level alerts, use Sentry.
Because we already have Sentry configured and working on staging, this PR
removes the "Errors in logs" alert; we'll be receiving those errors in
Sentry going forward.
This PR updates the threshold for CPU utilization monitoring alerts to
avoid notification fatigue. The intent is not to ignore issues that
cause CPU utilization to be high, but rather to make sure we don't miss
other important alerts that might come through while we work on the
underlying CPU utilization issues.
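For reference, a rough sketch of what such an alert policy looks like in Terraform; the display names, threshold, and duration below are placeholders rather than the values chosen in this PR, and the notification channel variable is hypothetical:
```
resource "google_monitoring_alert_policy" "high_cpu" {
  display_name = "High CPU utilization" # placeholder
  combiner     = "OR"

  conditions {
    display_name = "CPU utilization above threshold"

    condition_threshold {
      filter          = "resource.type = \"gce_instance\" AND metric.type = \"compute.googleapis.com/instance/cpu/utilization\""
      comparison      = "COMPARISON_GT"
      threshold_value = 0.9    # placeholder
      duration        = "900s" # placeholder

      aggregations {
        alignment_period   = "300s"
        per_series_aligner = "ALIGN_MEAN"
      }
    }
  }

  notification_channels = var.notification_channel_ids # hypothetical input
}
```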
In the relay's `cloud-init.yaml`, we've overridden the `telemetry`
service log filter to be `debug`.
This results in the following log being printed to Cloud Logging every
second, for _every_ relay:
```
2025-01-26T23:00:35.066Z debug memorylimiter/memorylimiter.go:200 Currently used memory. {"kind": "processor", "name": "memory_limiter", "pipeline": "logs", "cur_mem_mib": 31}
```
These logs are consuming over half of our total log count, which
accounts for over half our Cloud Monitoring cost -- the second highest
cost in our GCP account.
This PR removes the override so that the relay uses the same
`otel-collector` log level as the Elixir nodes: the default (presumably
`info`).
Doing another (hopefully final) reversion of staging from the prod setup
to what we're after with respect to relay infra.
Reverts firezone/firezone#7872
Google still had lingering Relay instance groups and subnets around from
a previous deployment; they were deleted in the UI and gone, but then
popped back up.
Theoretically, the instance groups should have been deleted because there
is no current Terraform config matching them. This change will ensure that
instance groups also get rolled over based on the naming suffix
introduced in #7870.
Related: #7870
Turns out subnets need to have globally unique names as well. This PR
updates the instance-template, VPC, and subnet names to append an
8-character random string.
This random string "depends on" the subnet IP range configuration
specified above, so that if we change that in the future, causing a
network change, the naming will change as well.
Lastly, this `random_string` is also passed to the `relays` module to be
used in the instance template name prefix. While that name does _not_
need to be globally unique, the instance template **needs** to be
rolled over when the subnets change; otherwise it would contain a
network interface linked to both the old and new subnets, and GCP
will complain about that.
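A loose sketch of the wiring (resource, variable, and path names are illustrative):
```
resource "random_string" "naming_suffix" {
  length  = 8
  lower   = true
  upper   = false
  numeric = true
  special = false

  # Regenerate the suffix whenever the subnet IP range configuration changes,
  # forcing the dependent names to roll over together.
  keepers = {
    relay_subnet_cidr = var.relay_subnet_cidr
  }
}

module "relays" {
  source = "./modules/relays" # hypothetical path

  naming_suffix = random_string.naming_suffix.result
  # ...other module inputs elided
}
```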
Reverts: firezone/firezone#7869
Since we know we now have the Relay configuration we want (and that it
works), this PR rolls back staging to how it was before the Relay region
changes, so we can test that a single `terraform apply` on prod will deploy
without any errors.
This is causing issues applying because our CI terraform IAM user
doesn't have the `Billing Account Administrator` role.
Rather than granting such a sensitive role to our CI pipeline, I'm
suggesting we create the billing budget outside the scope of the
terraform config tracked in this repo.
If we want it to be tracked as code, I would propose a separate (private)
repository with its own token / IAM permissions that we can monitor
separately.
For the time being, I'll plan to manually create this budget in the UI.
Reverts: #7836
To help prevent surprises from unexpected cloud bills, we add a billing
budget with an alert that triggers when the 50% threshold is hit.
The exact amount is considered secret and is set via variables that are
already added in HCP staging and prod envs.
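For reference, a rough sketch of the budget resource (the layout and input names are illustrative; the real amount stays in the HCP variables):
```
variable "billing_budget_amount" {
  description = "Monthly budget in whole currency units (kept out of the repo)"
  type        = string
  sensitive   = true
}

resource "google_billing_budget" "main" {
  billing_account = var.billing_account_id # hypothetical input
  display_name    = "Monthly spend budget"

  amount {
    specified_amount {
      currency_code = "USD"
      units         = var.billing_budget_amount
    }
  }

  threshold_rules {
    threshold_percent = 0.5 # alert at 50% of the budget
  }
}
```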
Even after all of the changes made to get the subnets updating properly
in the Relays module, the apply will always fail because of these two
facts combined:
- The lifecycle is `create_before_destroy`.
- The GCP instance group template binds a network interface on a per-subnet
basis, and that interface cannot be bound to both the old and new subnet.
The fix for this would be to create a new instance group manager on each
deploy.
Rather than needlessly roll over the relay networks on each deploy when
they're not changing, it makes more sense to define them outside of the
Relays module so that they aren't tainted by code changes. This prevents
needless resource replacement and allows the Relay module to use them
as-is.
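A sketch of the split (the variable names, region keying, and module interface are illustrative):
```
# Defined at the environment level, keyed by region, e.g.
# { "us-east1" = "10.129.0.0/24", ... }
resource "google_compute_subnetwork" "relays" {
  for_each = var.relay_subnet_cidrs

  name          = "relays-${each.key}"
  region        = each.key
  network       = var.network_id
  ip_cidr_range = each.value
}

module "relays" {
  source = "./modules/relays" # hypothetical path

  subnet_self_links = { for region, s in google_compute_subnetwork.relays : region => s.self_link }
  # ...other module inputs elided
}
```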
#7733 fixed the randomness generation, but didn't fix the numbering.
According to [GCP docs](https://cloud.google.com/vpc/docs/subnets), we
can use virtually any RFC 1918 space for this.
This PR updates our numbering scheme to use the `10.128.0.0/9` space for
Relay subnets and changes the elixir app to use `10.2.2.0/20` to prevent
collisions.
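A sketch of the kind of numbering this enables (the region-list variable and the /24 sizing are illustrative): carve per-region relay subnets out of `10.128.0.0/9`, keyed by region index, well away from the elixir app's range.
```
locals {
  relay_base_cidr = "10.128.0.0/9"

  relay_subnet_cidrs = {
    for idx, region in var.relay_regions : # hypothetical list, e.g. ["us-east1", "europe-west1"]
    region => cidrsubnet(local.relay_base_cidr, 15, idx) # /9 + 15 bits = a /24 per region
  }
}
```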
When a Relay's instances are updated, the contained subnetwork's `name`
and `ip_cidr_range` need to be changed to something else because we are
using the `create_before_destroy` lifecycle configuration for the Relays
module.
To fix this, we need to make sure that when recreating Relays, we use a
unique `name` and `ip_cidr_range` for the new instances so as not to
conflict with existing ones.
To handle this, we use a computed, state-tracked value for
`ip_cidr_range` that automatically adjusts to the number of Relay
regions we have and is incremented each time the Relays are
recreated. Then we update the `name` to include this range, to ensure we
never have a subnet name that conflicts with an existing one.
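A loose sketch of the naming half of this, with the incrementing, state-tracked value simplified to a plain index variable (all names and the numbering scheme here are illustrative):
```
locals {
  relay_subnet_cidr = cidrsubnet("10.0.0.0/8", 16, var.relay_subnet_index) # hypothetical index
  relay_subnet_name = "relays-${replace(local.relay_subnet_cidr, "/[./]/", "-")}" # e.g. "relays-10-0-3-0-24"
}
```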