159 Commits

Author SHA1 Message Date
Jamil
8e64a01f4a chore(infra): Disable debug log for otel (#7874)
In the relay's `cloud-init.yaml`, we've overridden the `telemetry`
service log filter to be `debug`.

This results in this log being printed to Cloud Logging every 1s, for
_every_ relay:

```
2025-01-26T23:00:35.066Z	debug	memorylimiter/memorylimiter.go:200	Currently used memory.	{"kind": "processor", "name": "memory_limiter", "pipeline": "logs", "cur_mem_mib": 31}
```

These logs make up over half of our total log count, which accounts
for over half of our Cloud Monitoring cost -- the second-highest cost
in our GCP account.
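
As a rough illustration of the volume involved, the arithmetic can be
sketched as follows. The one-line-per-second rate comes from the
description above; the relay count of 20 is a hypothetical number used
only for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def debug_lines_per_day(relay_count: int, interval_seconds: float = 1.0) -> int:
    """Log lines emitted per day when every relay logs once per interval."""
    return int(relay_count * SECONDS_PER_DAY / interval_seconds)


# One relay logging once per second.
print(debug_lines_per_day(1))
# A hypothetical fleet of 20 relays.
print(debug_lines_per_day(20))
```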

This PR removes the override so that the relay app uses the same
`otel-collector` log level as the Elixir app: the default (presumably
`info`).
2025-01-26 18:57:07 -08:00
Jamil
7b40282ebe revert: pre-relay change for prod test (#7873)
Doing another (hopefully final) reversion of staging from the prod setup
to what we're after with respect to relay infra.

Reverts firezone/firezone#7872
2025-01-26 14:50:49 -08:00
Jamil
fe343a9372 chore(infra): revert to pre-relay change for prod test (#7872) 2025-01-26 14:02:53 -08:00
Jamil
d96276e1ac fix(infra): Use naming_suffix in instance_group_manager (#7871)
Google still had lingering Relay instance groups and subnets around from
a previous deployment that were deleted in the UI and gone, but then
popped back up.

Theoretically, the instance groups should be deleted because there is no
current Terraform config matching them. This change will ensure that
instance groups also get rolled over based on the naming suffix
introduced in #7870.

Related: #7870
2025-01-26 12:10:34 -08:00
Jamil
0454fb173d refactor(infra): Ensure network names unique (#7870)
Turns out subnets need to have globally unique names as well. This PR
updates the instance-template, VPC, and subnet names to append an
8-character random string.

This random string "depends on" the subnet IP range configuration
specified above, so that if we change that in the future, causing a
network change, the naming will change as well.
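
The "depends on" behaviour can be sketched in Python as below. This is
only a model of the idea (regenerate the suffix when a tracked input
changes), not the Terraform configuration from this PR; the `state`
dict, the `keepers` argument, and the CIDR strings are hypothetical.

```python
import secrets
import string

ALPHABET = string.ascii_lowercase + string.digits


def naming_suffix(state: dict, keepers: dict, length: int = 8) -> str:
    """Return the stored suffix, regenerating it only when keepers change."""
    if state.get("keepers") != keepers:
        state["keepers"] = dict(keepers)
        state["value"] = "".join(secrets.choice(ALPHABET) for _ in range(length))
    return state["value"]


state = {}
first = naming_suffix(state, {"subnet_range": "10.128.0.0/9"})
again = naming_suffix(state, {"subnet_range": "10.128.0.0/9"})
rolled = naming_suffix(state, {"subnet_range": "10.64.0.0/10"})

# Same inputs keep the same suffix; a changed range produces a fresh one.
print(first == again)
print(len(rolled))
```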

Lastly, this random_string is also passed to the `relays` module to be
used in the instance template name prefix. While that name does _not_
need to be globally unique, the `instance_template` **needs** to be
rolled over if the subnets change, because otherwise it will contain a
network interface that is linked to both old and new subnets and GCP
will complain about that.

Reverts: firezone/firezone#7869
2025-01-26 08:16:23 -08:00
Jamil
1826700b89 revert: re-apply Relay region changes (#7869)
Reverts firezone/firezone#7868
2025-01-26 06:46:24 -08:00
Jamil
0805e87016 chore(infra): re-apply Relay region changes (#7868)
Reverts firezone/firezone#7835 in order to test how this will be applied
to prod.

If this goes through fine, we should be ok for a prod rollout.
2025-01-26 06:13:26 -08:00
Jamil
90f445a971 chore(infra): Revert relay regions to test prod-like deploy (#7835)
Now that we have the Relay configuration we want (and it works), this
PR rolls back staging to its state before the Relay region changes, so
we can test that a single `terraform apply` on prod will deploy without
any errors.
2025-01-25 17:05:06 +00:00
Jamil
aaea3bf537 revert(infra): Billing budget (PR #7836) (#7855)
This is causing issues applying because our CI terraform IAM user
doesn't have the `Billing Account Administrator` role.

Rather than granting such a sensitive role to our CI pipeline, I'm
suggesting we create the billing budget outside the scope of the
terraform config tracked in this repo.

If we want it to be tracked as code, I would propose maybe we have a
separate (private) repository with a separate token / IAM permissions
that we can monitor separately.

For the time being, I'll plan to manually create this budget in the UI.

Reverts: #7836
2025-01-24 06:53:47 +00:00
Jamil
c913086dbe feat(infra): Add billing budget alerts to infra (#7836)
To help prevent surprises with unexpected cloud bills, we add a billing
budget amount that will trigger when the 50% threshold is hit.

The exact amount is considered secret and is set via variables that are
already added in HCP staging and prod envs.
2025-01-23 19:19:36 +00:00
Jamil
dca9645adf chore(infra): Remove unused tf vars (#7803)
These were leftover from #7737 and friends.
2025-01-22 05:32:28 +00:00
Jamil
0a1cd92c00 fix(infra): Rotate naming to taint old Relay instances (#7739)
The Relay instance template is sticking around because none of its
inputs have changed, so we bump its name.
2025-01-12 21:34:18 -08:00
Jamil
5dd640daa8 fix(infra): Define Relay subnets outside of Relays module (#7736)
Even after all of the changes made to make the subnets update properly
in the Relays module, it will always fail because of these two facts
combined:

- lifecycle is `create_before_destroy`
- GCP instance group template binds a network interface on a per-subnet
basis, and this cannot be bound to both old and new subnets. The fix
for this would be to create a new instance group manager on each deploy.

Rather than needlessly roll over the relay networks on each deploy,
since they're not changing, it would make more sense to define them
outside of the Relays module so that they aren't tainted by code
changes. This will prevent needless resource replacement and allow for
the Relay module to use them as-is.
2025-01-12 19:04:44 -08:00
Jamil
03d81ed2df fix(infra): Fix subnet numbering across all regions (#7734)
#7733 fixed the randomness generation, but didn't fix the numbering.
According to [GCP docs](https://cloud.google.com/vpc/docs/subnets), we
can use virtually any RFC 1918 space for this.

This PR updates our numbering scheme to use the `10.128.0.0/9` space for
Relay subnets and changes the elixir app to use `10.2.2.0/20` to prevent
collisions.
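
A quick way to sanity-check that the two ranges cannot collide is
sketched below using Python's standard `ipaddress` module. Note that
`10.2.2.0/20` has host bits set, so the sketch parses it with
`strict=False`, which normalizes it to `10.2.0.0/20`. The helper name
`is_rfc1918` is an assumption for illustration.

```python
import ipaddress

RFC1918 = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]

RELAY_SPACE = ipaddress.ip_network("10.128.0.0/9")
# strict=False masks off the host bits of 10.2.2.0/20.
APP_RANGE = ipaddress.ip_network("10.2.2.0/20", strict=False)


def is_rfc1918(net: ipaddress.IPv4Network) -> bool:
    """True if the network sits entirely inside one RFC 1918 block."""
    return any(net.subnet_of(block) for block in RFC1918)


print(APP_RANGE)
print(RELAY_SPACE.overlaps(APP_RANGE))
print(is_rfc1918(RELAY_SPACE), is_rfc1918(APP_RANGE))
```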
2025-01-12 16:33:03 -08:00
Jamil
e9a120c272 fix(infra): Rotate random vars on each image version (#7733) 2025-01-12 14:22:14 -08:00
Jamil
d6d0d78bda chore(infra): Use numeric instead of number (#7731)
`number` is deprecated for the built-in `random_string` resource.
2025-01-12 13:09:29 -08:00
Jamil
ba5b8ed3f5 fix(infra): Use computed cidrsubnet for Relays (#7730)
When a Relay's instances are updated / changed, the contained
subnetwork's `name` and `ip_cidr_range` need to be updated to something
else because we are using the `create_before_destroy` lifecycle
configuration for the Relays module.

To fix this, we need to make sure that when recreating Relays, we use a
unique `name` and `ip_cidr_range` for the new instances so as not to
conflict with existing ones.

To handle this, we use a computed state-tracked value for
`ip_cidr_range` that will automatically adjust to the number of Relay
regions we have and it will be incremented each time the Relays are
recreated. Then we update the `name` to include this range to ensure we
never have a subnet name that conflicts with an existing one.
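
The numbering idea can be sketched in Python as below. The `cidrsubnet`
helper mirrors the arithmetic of Terraform's built-in function of the
same name; the `relay_subnet` helper, the `generation` counter, the
`10.200.0.0/16` base range, and the naming pattern are all hypothetical
and are not taken from the actual module.

```python
import ipaddress


def cidrsubnet(prefix: str, newbits: int, netnum: int) -> str:
    """Return the netnum-th subnet obtained by extending prefix by newbits."""
    base = ipaddress.ip_network(prefix)
    new_prefix = base.prefixlen + newbits
    if new_prefix > base.max_prefixlen or not 0 <= netnum < 2 ** newbits:
        raise ValueError("subnet does not fit inside prefix")
    offset = netnum << (base.max_prefixlen - new_prefix)
    subnet = ipaddress.ip_network((int(base.network_address) + offset, new_prefix))
    return str(subnet)


def relay_subnet(generation: int, region_index: int, region_count: int) -> dict:
    """Pick a range that is unique per region and per recreation (hypothetical scheme)."""
    netnum = generation * region_count + region_index
    cidr = cidrsubnet("10.200.0.0/16", 8, netnum)
    # Embedding the range in the name keeps names unique as well.
    name = "relays-" + cidr.replace(".", "-").replace("/", "-")
    return {"name": name, "ip_cidr_range": cidr}


print(relay_subnet(generation=0, region_index=1, region_count=3))
print(relay_subnet(generation=1, region_index=0, region_count=3))
```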
2025-01-12 12:22:39 -08:00
Jamil
45bfe0f2a3 chore(infra): Deny connections from US-sanctioned countries with HTTP 403 (#7462)
Implementing the remainder of the legally required block. Will be
applied on Dec 9th, as we notified customers.
2024-12-06 20:26:30 +00:00
Jamil
3a62709c77 docs: Add restricted regions docs (#7395)
This will be referred to when we make our email announcement.
2024-11-24 17:20:06 +00:00
Jamil
5437c3e2df fix(infra): Block signups if expression matches (#7337) 2024-11-13 21:29:47 +00:00
Jamil
6f7f6a4f34 style: Enforce code style across all supported languages using Prettier (#7322)
This ensures that we run Prettier across all supported file types to
check for any formatting / style inconsistencies. Previously, it was
only run for files in the website/ directory using a deprecated
pre-commit plugin.

The benefit to keeping this in our pre-commit config is that devs can
optionally run these checks locally with `pre-commit run --config
.github/pre-commit-config.yaml`.

---------

Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
2024-11-13 00:19:15 +00:00
Jamil
fa40d6e852 fix(infra): Adjust rule to total_latencies from backend_latencies (#7323)
This is the check that Oneleet is expecting.
2024-11-12 21:30:28 +00:00
Jamil
f40528f8f0 chore(infra): Relax load balancer to app latency alert to 3s (#7317)
1000ms is a little too aggressive here. The latency is measured from
the load balancers, which are global, to our app servers, which are in
us-east1.
2024-11-12 05:44:05 +00:00
Brian Manifold
50ba752d30 fix(infra): Update gcp cloud armor rules (#7293)
The expression for one of the rules was not able to be applied due to
invalid characters (`\n`) and even once the invalid characters were
removed there is a limit of 5 subexpressions, but the previous
expression contained 10.

In addition to the expression change, `deny(451)` is not allowed. The
only `deny` codes allowed are `403`, `404`, and `502`.
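
One way to stay within these limits is to split a long list of region
checks into several rules. The Python sketch below is illustrative
only: the limits are the ones stated above, while the `region_rules`
helper, the rule dictionary shape, and the placeholder region codes
are assumptions and not the actual rule definitions from this PR.

```python
ALLOWED_DENY_STATUSES = {403, 404, 502}  # as stated in the commit message
MAX_SUBEXPRESSIONS = 5                   # as stated in the commit message


def region_rules(region_codes: list, deny_status: int = 403) -> list:
    """Build single-line rule expressions with at most 5 subexpressions each."""
    if deny_status not in ALLOWED_DENY_STATUSES:
        raise ValueError(f"deny({deny_status}) is not an allowed action")
    rules = []
    for start in range(0, len(region_codes), MAX_SUBEXPRESSIONS):
        chunk = region_codes[start:start + MAX_SUBEXPRESSIONS]
        # Join on one line so the expression contains no newline characters.
        expression = " || ".join(f"origin.region_code == '{code}'" for code in chunk)
        rules.append({"action": f"deny({deny_status})", "expression": expression})
    return rules


# Ten placeholder codes, not real region codes.
placeholder_codes = ["A0", "A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8", "A9"]
for rule in region_rules(placeholder_codes):
    print(rule["action"], rule["expression"])
```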
2024-11-09 15:09:16 +00:00
Jamil
83dfd3a98c fix(infra): Don't use macros for Cloud armor (#7285)
Fixes #6807 

Follow up to #7282
2024-11-06 21:06:21 -08:00
Jamil
1bd9a3e134 fix(infra): Use proper common expression language syntax (#7282)
https://github.com/firezone/firezone/actions/runs/11713228570/job/32626046819


Language reference:

https://github.com/google/cel-spec/blob/master/doc/langdef.md#macros
2024-11-06 23:59:34 +00:00
Andrew Dryga
0a79cd5045 chore(portal): Do not allow signing up from legally-restricted jurisdictions (#7088)
Related to #6807

---------

Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
2024-11-06 22:40:20 +00:00
Jamil
2825522844 fix(infra): Filter out WebSocket upgrade from latency alerting (#7242) 2024-11-02 15:43:49 -07:00
Jamil
e9db936c0f feat(infra): Add Google load balancer latency alert (#7231)
Oneleet has a new monitor failing that suggests adding this.


https://app.oneleet.com/tenants/148d888b-6cbe-4198-b4be-359e816927f4/monitors/9ad764bf-147b-4b87-bee8-f825ea9e0adc
2024-11-01 15:57:32 +00:00
James Winegar
733ada2a26 Add RUST_LOG to cloud_init.yaml for google-cloud gateway RIG (#6736)
Signed-off-by: James Winegar <jameswinegar@users.noreply.github.com>
2024-09-17 15:02:34 -06:00
Andrew Dryga
c57a670dbb fix(devops): Create SSL certs before destroy (#6607) 2024-09-05 14:35:16 -07:00
Andrew Dryga
2ae5f921c8 fix(portal): Disable IP check for browser session tokens (#6598)
This PR reverts the commit that moved the IPv6 address out to a
separate subdomain (deploying that would cause prod downtime) and
simply removes the check that causes redirect loops.
2024-09-05 11:07:40 -07:00
Andrew
14e3c379c1 Fix DNS cert replacement 2024-09-05 08:51:01 -07:00
Jamil
c581439ee2 fix(portal): Use app-ipv6.firezone.dev for IPv6 app to prevent websocket / http from using different stacks (#6522)
Based on testing and research it does not appear that Chrome will
reliably choose a consistent protocol stack for loading the initial web
page as it does for connecting the WebSocket when connecting over VPN
tunnels. If one or the other stacks experiences a slight delay or packet
loss causing retransmission, or QUIC simply doesn't play nicely with the
MTU (in our case 1280), it may fall back to IPv4 (which has less
per-packet overhead) or even a TCP connection.

Unfortunately this violates an assumption we have in token validation
logic. Namely, that the remote_ip used to create the token (via sign in)
is the same one used to connect the WebSocket. I can see where this
logic comes from in a security context, but thinking through the attack
vector(s) that would be able to leverage this violation has me left
wondering if this check is worth the breakage we currently face in
#6511.

- Scenario 1: MITM - attacker steals token somehow via MITM (would need
to somehow break TLS) - the attacker is already in our network path and
can rewrite the remote_ip already with his/her own.
- Scenario 2: Malicious browser plugin stealing session token. It will
be harder to spoof the remote IP in this case, but if this is a
possibility, the plugin could presumably directly control the tab where
the user is logged in.
- Scenario 3: IdP is compromised leading to malicious redirect before
arriving at Firezone - if this is the case, the user could likely log
in directly and create his/her own valid session token anyhow.
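
The token assumption discussed above can be illustrated with a small
Python sketch. It is not the portal's validation code; `SessionToken`
and `websocket_allowed` are hypothetical names, and the addresses come
from the documentation ranges.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionToken:
    value: str
    remote_ip: str  # address observed when the token was created at sign-in


def websocket_allowed(token: SessionToken, connecting_ip: str) -> bool:
    """Allow the WebSocket only if it comes from the sign-in address."""
    return ipaddress.ip_address(connecting_ip) == ipaddress.ip_address(token.remote_ip)


# The page loaded over IPv6, so the token records an IPv6 address.
token = SessionToken(value="example", remote_ip="2001:db8::1")

# Same stack for the WebSocket: the check passes.
print(websocket_allowed(token, "2001:db8::1"))
# The browser fell back to IPv4 for the WebSocket: the check fails.
print(websocket_allowed(token, "203.0.113.7"))
```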

Perhaps I'm missing other scenarios, open to feedback. If we want to
ensure the token used by the websocket originated from the same browser
as it was minted from, perhaps we could generate a small random key,
save it in local storage, and send that in a header when connecting the
WebSocket. I think cookies handle that for us already though.

Fixes #6511
2024-09-04 07:28:14 +00:00
Thomas Eizinger
93d678aaea feat(relay): set OTEL metadata for metrics and traces (#6249)
I recently discovered that the metrics reporting to Google Cloud Metrics
for the relays is actually working. Unfortunately, they are all bucketed
together because we don't set the metadata correctly.

This PR aims to fix that by setting some useful default metadata for
traces and metrics and, additionally, discovering the instance ID and
name from GCE metadata.

Related: #2033.
2024-08-10 16:32:01 +00:00
Andrew Dryga
ba71d651d9 chore(infra): Silence alerts from OTEL Finch integration (#6188) 2024-08-07 10:26:51 -06:00
Thomas Eizinger
94527f9fa1 fix(gateway): always masquerade for docker-deployed gateways (#6169)
Without masquerading, packets sent by the gateway through the TUN
interface use the wrong source address (the TUN device's address)
instead of the gateway's actual network interface.

We set this env variable in all our uses of the gateway, thus we might
as well remove it and always masquerade unconditionally.

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Reactor Scram <ReactorScram@users.noreply.github.com>
2024-08-07 03:00:50 +00:00
Reactor Scram
5eb2bba47b feat(headless-client): use systemd-resolved DNS control by default (#6163)
Closes #5063, supersedes #5850 

Other refactors and changes made as part of this:

- Adds the ability to disable DNS control on Windows
- Removes the spooky-action-at-a-distance `from_env` functions that used
to be buried in `tunnel`
- `FIREZONE_DNS_CONTROL` is now a regular `clap` argument again

---------

Signed-off-by: Reactor Scram <ReactorScram@users.noreply.github.com>
2024-08-06 18:16:51 +00:00
Andrew Dryga
823b3cb276 fix(infra): Resolve capacity issues during rollouts (#6007)
I've finally managed to reserve enough e2 instances for our needs and
also used e2 for gateways to work around the quota issues. The `web` app
still uses n2 because the quota doesn't allow additional n4's. Rollouts
are also fixed to not go over the reservations/quotas.
2024-07-23 19:58:29 -06:00
Andrew Dryga
0b6e3564f3 chore(infra): Deploy relay and portal to more zones and use more modern CPU arch (#5921) 2024-07-19 15:15:28 -06:00
Jamil
ffe4d5f950 docs: fix references to AWS and Azure example modules (#5829)
These are now published at
https://www.github.com/firezone/terraform-aws-gateway and
https://www.github.com/firezone/terraform-azurerm-gateway to match the
unclear docs for registry module naming...
2024-07-11 16:10:12 +00:00
Jamil
ae87abacff chore: move AWS firezone-gateway module to dedicated repo (#5816)
Why:

Managing the module from Terraform registry is simpler if our published
module is in its own repo.

See https://github.com/firezone/terraform-firezone-aws
2024-07-09 14:05:14 -07:00
Andrew
4037a7bdd3 Provision and read-only DB replica in Europe 2024-07-04 13:00:55 -06:00
Jamil
60d2a2befd fix(infra): relay listens on UDP only (#5718)
I don't believe we use/need TCP for the Relays. Better to keep the ports
closed if so.

Also, the docker-compose.yml is updated to allow the `relay-1` service
to respond to all its ports, since we don't need those mapped typically.
2024-07-04 16:53:08 +00:00
Jamil
9ac9dedfb9 feat: Azure scalable Gateway module and docs (#5644)
Resolves #2603
2024-07-03 07:16:56 +00:00
Jamil
fc8d89ea73 docs: Add AWS NAT Gateway example (#5543)
- Adds the AWS equivalent of our GCP scalable NAT Gateway.
- Adds a new kb section `/kb/automate` that will contain various
automation / IaC recipes going forward. It's better to have these
guides in the main docs with all the other info.

~~Will update the GCP example in another PR.~~

Portal helper docs in the gateway deploy page will come in another PR
after this is merged.
2024-06-27 21:05:38 -07:00
Jamil
e82a9506ab fix(infra): use sensitive attribute for all secrets (#5562)
Is there a reason not to mark these `sensitive`?


https://developer.hashicorp.com/terraform/tutorials/configuration-language/sensitive-variables
2024-06-27 08:13:35 +00:00
Andrew Dryga
fa15e1568f fix(portal): Use RESTRICTED SSL policy to remove weak cipher suites (#5358) 2024-06-13 11:31:47 -06:00
Andrew Dryga
7fd8e66f7d Enable flow logs and delete default network 2024-05-24 11:04:10 -06:00
Andrew Dryga
7c67c87422 Do not r/./- when deploying gateways 2024-05-13 14:43:25 -06:00