`rtnetlink` has some breaking changes in their latest version. To avoid
waiting until they actually cut a release, we temporarily depend on
their `main` branch.
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
Under some conditions the call to SCDynamicStoreCreate can fail,
presumably due to some kind of allocation failure. Once it succeeds,
however, we can continue using the dynamic store for the lifetime of the
Adapter instance.
So we memoize the result of this call so as not to allocate each time we
need it.
See
https://developer.apple.com/documentation/systemconfiguration/1437828-scdynamicstorecreate
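Translated to Rust for illustration (the real code is Swift and the names here are invented, not connlib's API), the memoization pattern is roughly:

```rust
// Hypothetical sketch: create the expensive resource at most once, but
// retry on the next call if creation failed, mirroring how the Adapter
// caches the result of `SCDynamicStoreCreate`.
struct Memo<T> {
    cached: Option<T>,
}

impl<T> Memo<T> {
    fn new() -> Self {
        Self { cached: None }
    }

    /// Returns the cached value, invoking `create` only when we don't
    /// have one yet. A failed creation (`None`) is retried next time.
    fn get_or_create(&mut self, create: impl FnOnce() -> Option<T>) -> Option<&T> {
        if self.cached.is_none() {
            self.cached = create();
        }
        self.cached.as_ref()
    }
}
```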
The batch size affects how many packets we process at a time. It also
affects the worst-case size of a single buffer, because all packets may
be of the same size and thus need to be appended to the same buffer.
On mobile, we can't afford to allocate that much, so we reduce the
batch size there.
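A sketch of the sizing argument, with illustrative numbers only (the real MTU and batch sizes live in connlib's configuration):

```rust
// One buffer must be able to hold an entire batch in the worst case,
// so halving the batch size halves the worst-case allocation.
const MTU: usize = 1280; // assumed segment size for illustration

const DESKTOP_BATCH_SIZE: usize = 64; // illustrative values, not
const MOBILE_BATCH_SIZE: usize = 8; // connlib's actual constants

/// Worst case: every packet in the batch has the same segment length
/// and therefore lands in the same buffer.
fn worst_case_buffer_len(batch_size: usize) -> usize {
    batch_size * MTU
}
```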
In the `didReceivePathUpdate` private function, we capture an implicit
strong reference to `self` because we access the `Adapter` instance
properties. This function is called on the workQueue by the
NWPathMonitor API, so we should weakly capture self throughout
to prevent a retain cycle.
To fix this, we use a `lazy var` for the `pathUpdateHandler` closure,
capturing `[weak self]` within it to use throughout the remainder of the
callback.
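The Rust analogue of this fix looks roughly like the following (the types are illustrative stand-ins, not the actual Swift code): the stored handler keeps only a `Weak` reference to its owner, so the callback cannot form a reference cycle that keeps the Adapter alive forever.

```rust
use std::cell::RefCell;
use std::rc::{Rc, Weak};

struct Adapter {
    handler: RefCell<Option<Box<dyn Fn()>>>,
}

fn install_handler(adapter: &Rc<Adapter>) {
    let weak: Weak<Adapter> = Rc::downgrade(adapter); // like `[weak self]`

    *adapter.handler.borrow_mut() = Some(Box::new(move || {
        // Like `guard let self = self else { return }` in Swift.
        let Some(adapter) = weak.upgrade() else {
            return;
        };
        let _ = &adapter; // access instance properties here
    }));
}
```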
Within the `GsoQueue` data structure, we keep a hash map indexed by
source, destination and segment length of UDP packets pointing to a
buffer for those payloads. What we intended to do here is to return the
buffer to the pool after we sent the payload. What we failed to realise
is that putting another buffer into the hash map means we have a buffer
allocated for a certain destination address and segment length! This
buffer would only get reused for the exact same address and segment
length, causing memory usage to balloon over time.
To fix this, we wrap the `DatagramBuffer` in an additional `Option`.
This allows us to actually remove it from the hash map and return the
buffer for future use to the buffer pool.
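A minimal sketch of the fix; the key and types are simplified stand-ins for connlib's real `GsoQueue`. Wrapping the buffer in an `Option` lets us move it out with `take()` while iterating, then drop the empty entries, instead of replacing each sent buffer with a fresh one pinned to that (address, segment length) key:

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

struct GsoQueue {
    buffers: HashMap<(SocketAddr, usize), Option<Vec<u8>>>,
    pool: Vec<Vec<u8>>, // stand-in for the real buffer pool
}

impl GsoQueue {
    fn flush(&mut self) {
        for buffer in self.buffers.values_mut() {
            if let Some(mut sent) = buffer.take() {
                // ... send `sent` here ...
                sent.clear();
                self.pool.push(sent); // reusable by *any* future key
            }
        }

        // Remove the now-empty entries so the map doesn't grow forever.
        self.buffers.retain(|_, buffer| buffer.is_some());
    }
}
```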
Resolves: #7866.
Resolves: #7747.
At present, the file logger for all Rust code starts each logfile with
`connlib.`. This is very confusing when exporting the logs from the GUI
client because even the logs from the client itself will start with
`connlib.`. To fix this, we make the base file name of the log file
configurable.
When queuing a new UDP payload for sending, we always pulled a new
buffer immediately, even though we might already have one allocated for
this particular segment length. This caused an unnecessary spike in
memory usage when under load.
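A hypothetical sketch of the fix (simplified types, not connlib's actual signatures): only pull a buffer from the pool when no buffer exists yet for this segment length, instead of pulling one unconditionally on every enqueue.

```rust
use std::collections::HashMap;

fn enqueue(
    buffers: &mut HashMap<usize, Vec<u8>>,
    pool: &mut Vec<Vec<u8>>,
    segment_len: usize,
    payload: &[u8],
) {
    // `entry` + `or_insert_with` makes the pool pull lazy: it only
    // runs when there is no buffer for this segment length yet.
    let buffer = buffers
        .entry(segment_len)
        .or_insert_with(|| pool.pop().unwrap_or_default());

    buffer.extend_from_slice(payload);
}
```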
When the Gateway's filter-engine drops a packet, we currently only log
"destination not allowed". This could happen either because we don't
have a filter (i.e. the resource is not allowed) or because the TCP /
UDP port or ICMP traffic is not allowed. To make debugging easier, we
now include that information in the error message.
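Illustrative only (the real filter engine's types differ): the point is to log *why* the packet was dropped, not just that it was.

```rust
use std::fmt;

enum DropReason {
    /// The resource has no filter at all, i.e. it is not allowed.
    NoFilter,
    TcpPortNotAllowed(u16),
    UdpPortNotAllowed(u16),
    IcmpNotAllowed,
}

impl fmt::Display for DropReason {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DropReason::NoFilter => write!(f, "no filter for destination"),
            DropReason::TcpPortNotAllowed(port) => write!(f, "TCP port {port} not allowed"),
            DropReason::UdpPortNotAllowed(port) => write!(f, "UDP port {port} not allowed"),
            DropReason::IcmpNotAllowed => write!(f, "ICMP traffic not allowed"),
        }
    }
}
```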
Resolves: #7875.
Nobody looks at these logs, and writing them wastes CPU and storage on
users' devices. Dropping them also saves a background thread because we
need one less non-blocking writer.
Updates the Resource's pagination cursor such that the default cursor
(with no HTTP params applied) uses `{:resources, :asc, :name}` as the
default, which correctly updates all Resources live tables to sort by
`name`.
This is updated at the Query layer because I wanted to achieve it
without populating URL params by default, while still allowing the sort
icon to properly reflect the default sort order upon page load, which
it does.
My initial attempt went down the path of updating `assign_live_table/3`
to take a `default_order_by` option. That didn't work because upon page
load we `handle_params` which resets the ordering immediately based on
the URL params.
Rather than updating the UI code to track even more state in order to
use `default_order_by` when the `order_by` param is not specified, I
opted to update the Query module that the UI uses.
Fixes #7842
There are two places relating to Relays where our production infra has
drifted from staging:
- We have a "Relays are down" alarm on prod that we don't on staging
- We allow overriding the image tag to deploy via an input var on prod
(this can be set from the TF cloud UI)
This PR fixes that, and also updates the production TF config whitespace
to match staging exactly for easier diffs.
In the relay's `cloud-init.yaml`, we've overridden the `telemetry`
service log filter to be `debug`.
This results in this log being printed to Cloud Logging every 1s, for
_every_ relay:
```
2025-01-26T23:00:35.066Z debug memorylimiter/memorylimiter.go:200 Currently used memory. {"kind": "processor", "name": "memory_limiter", "pipeline": "logs", "cur_mem_mib": 31}
```
These logs are consuming over half of our total log count, which
accounts for over half our Cloud Monitoring cost -- the second highest
cost in our GCP account.
This PR removes the override so that the relay has the same
`otel-collector` log level as the Elixir app: the default (presumably
`info`).
Doing another (hopefully final) reversion of staging from the prod setup
to what we're after with respect to relay infra.
Reverts firezone/firezone#7872
Google still had lingering Relay instance groups and subnets from a
previous deployment; they were deleted in the UI and gone, but then
popped back up.
Theoretically, the instance groups should be deleted because there is no
current Terraform config matching them. This change will ensure that
instance groups also get rolled over based on the naming suffix
introduced in #7870.
Related: #7870
Turns out subnets need to have globally unique names as well. This PR
updates the instance-template, VPC, and subnet names to append an
8-character random string.
This random string "depends on" the subnet IP range configuration
specified above, so that if we change that in the future, causing a
network change, the naming will change as well.
Lastly, this random_string is also passed to the `relays` module to be
used in the instance template name prefix. While that name does _not_
need to be globally unique, the `instance_template` **needs** to be
rolled over if the subnets change, because otherwise it will contain a
network interface that is linked to both old and new subnets and GCP
will complain about that.
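An illustrative Terraform fragment (attribute names follow the hashicorp/random provider; the keeper value is an assumption about this repo's config, not the actual variable name): the `keepers` map forces a new random suffix whenever the subnet IP ranges change, which in turn rolls over everything named with it.

```hcl
resource "random_string" "naming_suffix" {
  length  = 8
  special = false
  upper   = false

  # Changing the subnet range replaces this resource, producing a new
  # suffix and thereby new subnet / instance-template names.
  keepers = {
    subnet_cidr = var.subnet_cidr
  }
}
```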
Reverts: firezone/firezone#7869
On iOS, it's possible (though not very likely) for the
notification-granting process to throw errors. This PR updates the
plumbing for how we request user notifications on iOS and respond to
the result, allowing errors to be propagated back to the user in case
something goes wrong.
Note that unlike installing system extensions and VPN configurations,
disallowing notifications _does not_ result in an error being thrown.
Instead, we receive `false`, and we now track this Bool instead of the
entire `UNAuthorizationStatus` for updating the UI with.
By keeping that Bool `nil` until we load the authorization status, we
fix #7617 as a bonus.
Related: #7714
Fixes: #7617
Since we now know we have the Relay configuration we want (and that it
works), this PR rolls staging back to how it was before the Relay
region changes, so we can test that a single `terraform apply` on prod
will deploy without any errors.
When processing items in the Adapter's workQueue, it's possible the
Adapter has stopped by the time a queued closure begins executing.
This can lead to obscure failures, for example if we try to apply
network settings to a disconnected tunnel.
Supersedes: #7858
When installing the System Extension or VPN configuration, certain
errors can occur.
Some of these are due to user error, and we should further advise the
user how to remedy them.
Others aren't really actionable, so we silently ignore or quietly log
them.
Resolves: #7714
`fetchResources` is an IPC call. As such, it could potentially take a
long time to execute since the system may need to launch the XPC process
to handle the call.
Since this is called within a 1-second Timer whenever the user has the
MenuBar open (macOS) or is viewing the ResourcesList (iOS), we need to
run these IPC calls in a `Task.detached`.
The resources themselves must be updated on the main thread because
they're an `ObservableObject`, so a bit of refactoring is included to
clean this up.
We are getting failures from `SCDynamicStoreCreate` and possibly will
also get failures from subsequent SystemConfiguration framework calls.
We can contextualize these errors with an error code retrieved from
`SCError()` which returns the error code from the most recent API call:
https://developer.apple.com/documentation/systemconfiguration/1516922-scerror
These can fail sporadically and we don't need to capture them. However,
for users who may be experiencing consistent failures or otherwise
wondering why their client isn't able to check for updates, we leave
them as `warning`.
This is causing issues applying because our CI terraform IAM user
doesn't have the `Billing Account Administrator` role.
Rather than granting such a sensitive role to our CI pipeline, I'm
suggesting we create the billing budget outside the scope of the
terraform config tracked in this repo.
If we want it to be tracked as code, I would propose maybe we have a
separate (private) repository with a separate token / IAM permissions
that we can monitor separately.
For the time being, I'll plan to manually create this budget in the UI.
Reverts: #7836
On Apple, we will silently fail if the tunnel fails to start. This adds
a simple `NSAlert` modal (macOS) or `Alert` popup (iOS) that must be
dismissed before continuing if the tunnel fails to come up, so that the
user has a chance of understanding why.
The vast majority of the time this fails due to DNS lookup errors while
starting connlib.
Related: #7004
Supersedes: #7814
In #7795, we optimised our CI pipeline to only test the installation of
the GUI client whenever we actually upload to the draft release. This
trigger has been moved to `workflow_dispatch`, meaning CI builds from
neither PRs nor `main` perform these steps.
This makes it difficult to test GUI client binaries from PRs because
they also no longer get uploaded to the artifacts of the CI run on the
PR.
To fix this, we split the testing away from the rename script and
unconditionally run the rename script, which allows us to also always
upload the binaries to the CI artifacts.
Finally, uploading to the draft releases is only done when we explicitly
trigger the workflow from `main`. This is a defense-in-depth measure: we
should never publish code to a release that hasn't been merged to
`main`.
STUN binding requests & responses are deliberately not authenticated:
they are so cheap to fulfill that verifying the authentication would be
more work than simply answering the request. With #7819, we send STUN
binding requests more often because they are used as keep-alives to the
relay.
This spams the debug log because we see
> Message does not have a `MessageIntegrity` attribute
for every BINDING response. This information isn't interesting for
BINDING responses because those will never have a `MessageIntegrity`
attribute.
In order to debug connection wake-ups, it is useful to know which
packet is the first one sent on an idle connection. With this PR, we
log exactly that for incoming and outgoing packets through the tunnel.
The resulting log looks something like this:
```
2025-01-24T02:52:51.818Z DEBUG snownet::node: Connection is idle cid=65f149ea-96a4-4eee-ac70-62a1a2590821
2025-01-24T02:52:57.312Z DEBUG firezone_tunnel::client: Cleared DNS resource NAT domain=speed.cloudflare.com
2025-01-24T02:52:57.312Z DEBUG firezone_tunnel::client: Setting up DNS resource NAT gid=65f149ea-96a4-4eee-ac70-62a1a2590821 domain=speed.cloudflare.com
2025-01-24T02:52:57.312Z DEBUG snownet::node: Connection resumed packet=Packet { src: ::, dst: ::, protocol: "Reserved" } cid=65f149ea-96a4-4eee-ac70-62a1a2590821
```
Here, the connection got resumed because we locally received a DNS query
for a DNS resource, which triggers a new control protocol message
through the tunnel. For this, we use the unspecified IPv6 address for
src and dst and IP protocol number 255, which here renders as "Reserved".
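As a sketch (the struct is illustrative, not connlib's actual type), the marker works because the unspecified IPv6 address plus IP protocol number 255 ("Reserved" in the IANA registry) can never be mistaken for real traffic:

```rust
use std::net::Ipv6Addr;

struct Marker {
    src: Ipv6Addr,
    dst: Ipv6Addr,
    protocol: u8,
}

fn control_protocol_marker() -> Marker {
    Marker {
        src: Ipv6Addr::UNSPECIFIED, // renders as `::`
        dst: Ipv6Addr::UNSPECIFIED,
        protocol: 255, // IANA lists 255 as "Reserved"
    }
}
```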
Currently, the GUI Client exits if `update-desktop-database` cannot be
executed after deep-links are registered. On non-Ubuntu (or more
generally, non-Debian) systems, this will fail because the command does
not exist, preventing the GUI Client from starting.
This PR just ignores any command-not-found error, ensuring the command
still has to succeed on Debian/Ubuntu machines.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: oddlama <oddlama@oddlama.org>
The committed regression seeds trigger a scenario where the peers'
WireGuard sessions expire awkwardly: when the Client sends the packet,
the session is still active (179.xx seconds old), but given the latency
to the Gateway, the 180s mark is reached in flight, so the Gateway
clears the session and discards the packet.
In order to fix this, I opted to patch WireGuard by introducing a new
timer that does not allow the initiator to use a session that is almost
expired: https://github.com/firezone/boringtun/pull/68.
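A sketch of the idea behind the patch. `REJECT_AFTER_TIME` is the WireGuard protocol constant (180s); the grace period value here is an assumption for illustration, not the one chosen in the boringtun PR.

```rust
use std::time::Duration;

const REJECT_AFTER_TIME: Duration = Duration::from_secs(180);
const GRACE_PERIOD: Duration = Duration::from_secs(10); // assumed margin

/// The initiator refuses to encrypt with a session that is about to
/// expire, so in-flight packets can't arrive after the responder has
/// already cleared the session at the 180s mark.
fn session_usable(age: Duration) -> bool {
    age < REJECT_AFTER_TIME.saturating_sub(GRACE_PERIOD)
}
```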
Resolves: #7832.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
With #7819, these log messages appear at a ~10x higher rate than
before: a day's worth would be over 3,000 messages. For BINDING
requests, these only matter if the candidates change, so we can make
the logging conditional on that.
---------
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
Unfortunately, we don't have a formatter for the manifest other than
sorting the dependencies alphabetically so some things need to be taken
care of manually.
Contrary to my prior belief, we don't actually need the WireGuard
_persistent_ keep-alive. WireGuard's built-in timers will automatically
send keep-alive messages if no organic reply is sent for a particular
request.
All NAT bindings along the network path are already kept open using the
STUN bindings sent on all candidate pairs. Even on idle connections, we
send those every 60s. Well-behaved NATs are meant to keep confirmed UDP
bindings open for at least 120s. Even if not, the worst-case here is
that a connection which does not send any(!) application traffic is cut.
Strangely, this is set in `production` but not `staging`. This variable
determines whether to keep the `Default` network or not.
Since we create our own network resources, I don't think we need this to
be `true`.
To help prevent surprises with unexpected cloud bills, we add a billing
budget amount that will trigger when the 50% threshold is hit.
The exact amount is considered secret and is set via variables that are
already added in HCP staging and prod envs.
We've gotten feedback recently that the expiration field causes
confusion among auditors who assume it has actual security relevance.
In reality, this is simply the maximum amount of time a connection
between Client and Gateway will stay alive for, and it has no relation
to "sessions" from a security perspective. As such, it's removed, and
the table renamed "Recent connections" to better name what these are.
The `expiration` column is also removed because this is not actionable
by the admin or end-user. In nearly all cases, the connection will have
been "expired" by some other means naturally, such as toggling Firezone
on/off or a policy or resource change. In other words, we do not rely on
this `expiration` field to enforce any security-related timeout.
Fixes #7712
#7808 introduced a minor bug that prevented the Rust Docker images from
building locally in `debug` builds. Adding `openssl-dev` to the
builder's container fixes the issue.
```
cargo:warning=Could not find directory of OpenSSL installation, and this `-sys` crate cannot proceed without this knowledge. If OpenSSL is installed and this crate had trouble finding it, you can set the `OPENSSL_DIR` environment variable for the compilation process. See stderr section below for further information.
```