In the relay's `cloud-init.yaml`, we've overridden the `telemetry`
service log filter to be `debug`.
This results in the following log line being printed to Cloud Logging every 1s, for
_every_ relay:
```
2025-01-26T23:00:35.066Z debug memorylimiter/memorylimiter.go:200 Currently used memory. {"kind": "processor", "name": "memory_limiter", "pipeline": "logs", "cur_mem_mib": 31}
```
These logs make up over half of our total log count, which in turn accounts
for over half of our Cloud Monitoring cost, the second-highest cost in our
GCP account.
This PR removes the override so that the relay app uses the same
`otel-collector` log level as the Elixir app: the default (presumably
`info`).
This does another (hopefully final) reversion of staging from the prod setup
to what we're after with respect to the relay infra.
Reverts firezone/firezone#7872
Google still had lingering Relay instance groups and subnets around from
a previous deployment; they had been deleted in the UI and were gone, but then
popped back up.
Theoretically, the instance groups should be deleted because there is no
current Terraform config matching them. This change will ensure that
instance groups also get rolled over based on the naming suffix
introduced in #7870.
Related: #7870
Turns out subnets need to have globally unique names as well. This PR
updates the instance-template, VPC, and subnet names to append an
8-character random string.
This random string "depends on" the subnet IP range configuration
specified above, so that if we change that in the future, causing a
network change, the naming will change as well.
Lastly, this `random_string` is also passed to the `relays` module to be
used in the instance template name prefix. While that name does _not_
need to be globally unique, the `instance_template` **needs** to be
rolled over when the subnets change, because otherwise it would contain a
network interface linked to both the old and new subnets, and GCP
will complain about that.
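A rough sketch of the wiring described above (resource, variable, and module input names here are illustrative, not the exact ones in this repo):
```hcl
# Assumed inputs: the VPC id and a map of region => subnet CIDR.
variable "network_id" {
  type = string
}

variable "relay_subnet_cidrs" {
  type = map(string)
}

# 8-character suffix that is re-generated whenever the subnet ranges change,
# which in turn forces new (globally unique) names for everything below.
resource "random_string" "naming_suffix" {
  length  = 8
  special = false
  upper   = false

  # "keepers": changing the subnet range configuration replaces the
  # random_string, so a network change also rolls the names.
  keepers = {
    ip_cidr_ranges = join(",", values(var.relay_subnet_cidrs))
  }
}

resource "google_compute_subnetwork" "relays" {
  for_each      = var.relay_subnet_cidrs
  name          = "relays-${each.key}-${random_string.naming_suffix.result}"
  region        = each.key
  network       = var.network_id
  ip_cidr_range = each.value
}

module "relays" {
  source = "./modules/relays" # illustrative path

  # The same suffix feeds the instance template name prefix so the template
  # is rolled over together with the subnets.
  naming_suffix = random_string.naming_suffix.result
}
```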
Reverts: firezone/firezone#7869
Since we now have the Relay configuration we want (and know that it works),
this PR rolls staging back to how it was before the Relay region changes, so we
can test that a single `terraform apply` on prod will deploy without any
errors.
This is causing issues applying because our CI terraform IAM user
doesn't have the `Billing Account Administrator` role.
Rather than granting such a sensitive role to our CI pipeline, I'm
suggesting we create the billing budget outside the scope of the
terraform config tracked in this repo.
If we want it tracked as code, I'd propose a separate (private) repository
with its own token / IAM permissions that we can monitor separately.
For the time being, I'll plan to manually create this budget in the UI.
Reverts: #7836
To help prevent surprises with unexpected cloud bills, we add a billing
budget that triggers an alert when the 50% spend threshold is hit.
The exact amount is considered secret and is set via variables that are
already defined in the HCP staging and prod envs.
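For reference, a GCP billing budget with a 50% alert threshold looks roughly like the sketch below (variable names are assumptions, not the exact ones used here):
```hcl
# Sketch: billing budget that notifies once 50% of a secret amount is spent.
resource "google_billing_budget" "monthly" {
  billing_account = var.billing_account_id # assumed variable
  display_name    = "Monthly budget"

  amount {
    specified_amount {
      currency_code = "USD"
      units         = var.billing_budget_amount # secret, set in the HCP envs
    }
  }

  # Trigger a notification when 50% of the budgeted amount is hit.
  threshold_rules {
    threshold_percent = 0.5
  }
}
```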
Even after all of the changes made to make the subnets update properly
in the Relays module, it will always fail because of these two facts
combined:
- the lifecycle is `create_before_destroy`
- the GCP instance template binds a network interface on a per-subnet
basis, and it cannot be bound to both the old and new subnets. The fix for
this would be to create a new instance group manager on each deploy
Rather than needlessly rolling over the relay networks on each deploy,
since they're not changing, it makes more sense to define them
outside of the Relays module so that they aren't tainted by code
changes. This prevents needless resource replacement and allows
the Relays module to use them as-is.
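In Terraform terms, that looks roughly like the following sketch (module path and variable names are placeholders, not the repo's exact ones):
```hcl
# Assumed input: a map of region => subnet CIDR.
variable "relay_regions" {
  type = map(string)
}

# The VPC and subnets live at the top level, outside the Relays module, so
# code changes inside the module never force them to be replaced.
resource "google_compute_network" "relays" {
  name                    = "relays"
  auto_create_subnetworks = false
}

resource "google_compute_subnetwork" "relays" {
  for_each      = var.relay_regions
  name          = "relays-${each.key}"
  region        = each.key
  network       = google_compute_network.relays.id
  ip_cidr_range = each.value
}

module "relays" {
  source = "./modules/relays" # illustrative path

  # The module consumes the existing network and subnets as-is.
  network = google_compute_network.relays.id
  subnets = { for region, subnet in google_compute_subnetwork.relays : region => subnet.id }
}
```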
#7733 fixed the randomness generation, but didn't fix the numbering.
According to [GCP docs](https://cloud.google.com/vpc/docs/subnets), we
can use virtually any RFC 1918 space for this.
This PR updates our numbering scheme to use the `10.128.0.0/9` space for
Relay subnets and changes the Elixir app to use `10.2.2.0/20` to prevent
collisions.
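As a sketch of what that numbering can look like in Terraform (the region list and the /24 slice size are assumptions, not the repo's exact values):
```hcl
# Assumed list of relay regions; each region's index picks its CIDR slice.
variable "relay_regions" {
  type    = list(string)
  default = ["us-east1", "europe-west1", "asia-southeast1"]
}

locals {
  # cidrsubnet("10.128.0.0/9", 15, n) yields the n-th /24 inside 10.128.0.0/9,
  # keeping Relay subnets well clear of the Elixir app's range.
  relay_subnet_cidrs = {
    for idx, region in var.relay_regions :
    region => cidrsubnet("10.128.0.0/9", 15, idx)
  }
}
```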
When a Relay's instances are updated / changed, the contained
subnetwork's `name` and `ip_cidr_range` need to be updated to something
else because we are using the `create_before_destroy` lifecycle
configuration for the Relays module.
To fix this, we need to make sure that when recreating Relays, we use a
unique `name` and `ip_cidr_range` for the new instances so as not to
conflict with existing ones.
To handle this, we use a computed, state-tracked value for
`ip_cidr_range` that automatically adjusts to the number of Relay
regions we have and is incremented each time the Relays are
recreated. We then update the `name` to include this range to ensure we
never have a subnet name that conflicts with an existing one.
This ensures that we run Prettier across all supported file types to check
for any formatting / style inconsistencies. Previously, it was only run
for files in the `website/` directory using a deprecated pre-commit
plugin.
The benefit to keeping this in our pre-commit config is that devs can
optionally run these checks locally with `pre-commit run --config
.github/pre-commit-config.yaml`.
---------
Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
The expression for one of the rules could not be applied due to
invalid characters (`\n`), and even once the invalid characters were
removed, there is a limit of 5 subexpressions per expression; the previous
expression contained 10.
Along with the expression change, `deny(451)` is not allowed. The
only `deny` codes allowed are `403`, `404`, and `502`.
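For context, these Cloud Armor rules are defined roughly like the sketch below (the expression is a placeholder, not our actual rule): the expression must be a single line with at most 5 subexpressions, and only the `deny` codes listed above are accepted.
```hcl
# Sketch of a Cloud Armor rule with a single-line expression and an allowed
# deny code. The expression shown here is a placeholder.
resource "google_compute_security_policy" "default" {
  name = "default-policy"

  rule {
    action   = "deny(403)" # deny(451) is rejected; only 403, 404, 502 work
    priority = 1000

    match {
      expr {
        # Single line (no `\n`), at most 5 subexpressions.
        expression = "request.path.matches('/admin') && origin.region_code == 'CN'"
      }
    }
  }

  # Default catch-all rule.
  rule {
    action   = "allow"
    priority = 2147483647

    match {
      versioned_expr = "SRC_IPS_V1"
      config {
        src_ip_ranges = ["*"]
      }
    }
  }
}
```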
This PR reverts the commit that moves our IPv6 address out to a separate
subdomain (deploying that would cause prod downtime) and simply removes
the check that causes redirect loops.
Based on testing and research, it does not appear that Chrome will
reliably choose a consistent protocol stack for loading the initial web
page the way it does for connecting the WebSocket when connecting over VPN
tunnels. If one or the other stack experiences a slight delay or packet
loss causing retransmission, or QUIC simply doesn't play nicely with the
MTU (in our case 1280), it may fall back to IPv4 (which has less
per-packet overhead) or even a TCP connection.
Unfortunately this violates an assumption we have in the token validation
logic: namely, that the `remote_ip` used to create the token (via sign-in)
is the same one used to connect the WebSocket. I can see where this
logic comes from in a security context, but thinking through the attack
vector(s) that could leverage this violation has left me
wondering whether this check is worth the breakage we currently face in
#6511.
- Scenario 1: MITM - the attacker steals the token somehow via MITM (which
would require somehow breaking TLS). Such an attacker is already in our network
path and can already rewrite the `remote_ip` with their own.
- Scenario 2: Malicious browser plugin stealing session token. It will
be harder to spoof the remote IP in this case, but if this is a
possibility, the plugin could presumably directly control the tab where
the user is logged in.
- Scenario 3: IdP is compromised, leading to a malicious redirect before
arriving at Firezone - if this is the case, the user could likely log in
directly and create their own valid session token anyhow.
Perhaps I'm missing other scenarios, open to feedback. If we want to
ensure the token used by the websocket originated from the same browser
as it was minted from, perhaps we could generate a small random key,
save it in local storage, and send that in a header when connecting the
WebSocket. I think cookies handle that for us already though.
Fixes #6511
I recently discovered that the metrics reporting to Google Cloud Metrics
for the relays is actually working. Unfortunately, they are all bucketed
together because we don't set the metadata correctly.
This PR aims to fix that by setting some useful default metadata for
traces and metrics and, additionally, discovers the instance ID and name
from GCE metadata.
Related: #2033.
Without masquerading, packets sent by the gateway through the TUN
interface use the wrong source address (the TUN device's address)
instead of the gateway's actual network interface.
We set this env variable in all our uses of the gateway, so we might
as well remove it and always perform masquerading unconditionally.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Reactor Scram <ReactorScram@users.noreply.github.com>
Closes #5063, supersedes #5850
Other refactors and changes made as part of this:
- Adds the ability to disable DNS control on Windows
- Removes the spooky-action-at-a-distance `from_env` functions that used
to be buried in `tunnel`
- `FIREZONE_DNS_CONTROL` is now a regular `clap` argument again
---------
Signed-off-by: Reactor Scram <ReactorScram@users.noreply.github.com>
I've finally managed to reserve enough e2 instances for our needs and
also used e2 for the gateways to work around the quota issues. The `web` app
still uses n2 because the quota doesn't allow additional n4s. Rollouts are also
fixed to not exceed the reservations/quotas.
These are now published at
https://www.github.com/firezone/terraform-aws-gateway and
https://www.github.com/firezone/terraform-azurerm-gateway to match the
unclear docs for registry module naming...
I don't believe we use or need TCP for the Relays; better to keep the ports
closed if so.
Also, the docker-compose.yml is updated to allow the `relay-1` service
to respond on all of its ports, since we don't typically need those mapped.
- Adds the AWS equivalent of our GCP scalable NAT Gateway (a rough sketch
of the core AWS resources is included at the end of this description).
- Adds a new kb section `/kb/automate` that will contain various
automation / IaC recipes going forward. It's better to have these
guides in the main docs with all the other info.
~~Will update the GCP example in another PR.~~
Portal helper docs in the gateway deploy page will come in another PR
after this is merged.
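As referenced in the first bullet, here is a rough sketch of the core AWS NAT Gateway resources (variable names are placeholders; the kb guide is the source of truth):
```hcl
# Placeholder inputs; in practice these come from the existing VPC config.
variable "vpc_id" {
  type = string
}

variable "public_subnet_id" {
  type = string
}

variable "private_subnet_id" {
  type = string
}

# An Elastic IP gives the NAT Gateway a stable egress address that can be
# allow-listed upstream.
resource "aws_eip" "nat" {
  domain = "vpc"
}

resource "aws_nat_gateway" "this" {
  allocation_id = aws_eip.nat.id
  subnet_id     = var.public_subnet_id # NAT Gateway lives in a public subnet
}

# Route all outbound traffic from the private subnet (where the Firezone
# Gateway runs) through the NAT Gateway.
resource "aws_route_table" "private" {
  vpc_id = var.vpc_id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.this.id
  }
}

resource "aws_route_table_association" "private" {
  subnet_id      = var.private_subnet_id
  route_table_id = aws_route_table.private.id
}
```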