244 Commits

Author SHA1 Message Date
Firezone Bot
a11983e4b3 chore: publish gateway 1.4.13 (#9969) 2025-07-22 18:56:40 +00:00
Thomas Eizinger
c4457bf203 feat(gateway): shutdown after 15m of portal disconnect (#9894) 2025-07-18 05:47:30 +00:00
Thomas Eizinger
3e71a91667 feat(gateway): revoke unlisted authorizations upon init (#9896)
When receiving an `init` message from the portal, we will now revoke all
authorizations not listed in the `authorizations` list of the `init`
message.
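
A minimal sketch of the revocation step, assuming authorizations can be
modelled as `(client, resource)` pairs. The `Authorization` alias and the
`revoke_unlisted` function are hypothetical names for illustration and are not
taken from the Gateway's code:

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a (client ID, resource ID) authorization.
type Authorization = (u32, u32);

/// Drops every active authorization that is not in `listed`.
///
/// Returns the revoked authorizations in sorted order so callers can log them.
fn revoke_unlisted(
    active: &mut HashSet<Authorization>,
    listed: &HashSet<Authorization>,
) -> Vec<Authorization> {
    // Everything that is active but not mentioned in `init` gets revoked.
    let mut revoked: Vec<Authorization> = active.difference(listed).copied().collect();
    revoked.sort();

    // Keep only what the portal still lists.
    active.retain(|authorization| listed.contains(authorization));

    revoked
}
```

Authorizations that appear in `listed` but are not yet active are left alone
here; this sketch only covers the revocation half of handling `init`.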

We (partly) test this by introducing a new transition in our proptests
that de-authorizes a certain resource whilst the Gateway is simulated to
be partitioned. It is difficult to test that we cannot make a connection
once that has happened because we would have to simulate a malicious
client that knows about resources / connections or ignores the "remove
resource" message.

Testing this is deferred to a dedicated task. We do test that we hit the
code path of revoking the resource authorization and because the other
resources keep working, we also test that we are at least not revoking
the wrong ones.

Resolves: #9892
2025-07-17 19:04:54 +00:00
Thomas Eizinger
2e0ed018ee chore: document metrics config switches as private API (#9865) 2025-07-14 13:53:03 +00:00
Thomas Eizinger
cecca37073 feat(gateway): allow exporting metrics to an OTEL collector (#9838)
As a first step in preparation for sending OTEL metrics from Clients and
Gateways to a cloud-hosted OTEL collector, we extend the CLI of the
Gateway with configuration options to provide a gRPC endpoint to an OTEL
collector.

If `FIREZONE_METRICS` is set to `otel-collector` and an endpoint is
configured via `OTLP_GRPC_ENDPOINT`, we will report our metrics to that
collector.
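
The decision described above can be sketched as follows. Only the two
environment variable names and the `otel-collector` value come from this
change; the `MetricsExport` enum and `resolve_metrics_export` function are
hypothetical and the endpoint in the usage note is just an example value:

```rust
/// Hypothetical result of evaluating the metrics configuration.
#[derive(Debug, PartialEq)]
enum MetricsExport {
    /// No metrics are exported.
    Disabled,
    /// Metrics are sent to the OTEL collector at `endpoint`.
    Collector { endpoint: String },
}

/// Decides where to export metrics based on the values of
/// `FIREZONE_METRICS` and `OTLP_GRPC_ENDPOINT`.
fn resolve_metrics_export(
    firezone_metrics: Option<&str>,
    otlp_grpc_endpoint: Option<&str>,
) -> MetricsExport {
    match (firezone_metrics, otlp_grpc_endpoint) {
        // Both switches must be present for the collector to be used.
        (Some("otel-collector"), Some(endpoint)) if !endpoint.is_empty() => {
            MetricsExport::Collector {
                endpoint: endpoint.to_owned(),
            }
        }
        _ => MetricsExport::Disabled,
    }
}
```

In a binary, the two arguments would typically come from
`std::env::var("FIREZONE_METRICS").ok()` and
`std::env::var("OTLP_GRPC_ENDPOINT").ok()`, for example with an endpoint such
as `http://localhost:4317`.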

The plan is to extend this so that if `FIREZONE_METRICS` is set to
`otel-collector` (which will likely be the default) and no
`OTLP_GRPC_ENDPOINT` is set, we use our own hosted OTEL collector and
report metrics IF the `export-metrics` feature-flag is set to `true`.

This mirrors the integration we built for streaming logs to Sentry. We
can therefore enable it at a similar granularity as the logs and, for
example, only enable it for the `firezone` account to start with.

In the meantime, customers can already make use of those metrics if
they'd like by using the current integration.

Resolves: #1550
Related: #7419

---------

Co-authored-by: Antoine Labarussias <antoinelabarussias@gmail.com>
2025-07-14 03:54:38 +00:00
Thomas Eizinger
d01701148b fix(rust): remove jemalloc (#9849)
I am no longer able to compile `jemalloc` on my system in a debug build.
It fails with the following error:

```
src/malloc_io.c: In function ‘buferror’:
src/malloc_io.c:107:16: error: returning ‘char *’ from a function with return type ‘int’ makes integer from pointer without a cast [-Wint-conversion]
  107 |         return strerror_r(err, buf, buflen);
      |                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
```

This appears to be a problem with modern versions of clang/gcc. I
believe this started happening when I recently upgraded my system. The
upstream [`jemalloc`](https://github.com/jemalloc/jemalloc) repository
is now archived and thus unmaintained. I am not sure if we ever measured
a significant benefit in using `jemalloc`.

Related: https://github.com/servo/servo/issues/31059
2025-07-12 19:22:06 +00:00
Thomas Eizinger
d6805d7e48 chore(rust): bump to Rust 1.88 (#9714)
Rust 1.88 has been released and brings with it a quite exciting feature:
let-chains! It allows us to mix-and-match `if` and `let` expressions,
therefore often reducing the "right-drift" of the relevant code, making
it easier to read.

Rust 1.88 also comes with a new clippy lint that warns when creating a
mutable reference from an immutable pointer. Attempting to fix this
revealed that this is exactly what we are doing in the eBPF kernel.
Unfortunately, it doesn't seem to be possible to design this in a way
that is both accepted by the borrow-checker AND by the eBPF verifier.
Hence, we simply make the function `unsafe` and document what the
caller needs to uphold.
2025-07-12 06:42:50 +00:00
Thomas Eizinger
04499da11e feat(telemetry): grab env and distinct_id from Sentry session (#9801)
At present, our primary indicator as to whether telemetry is active is
whether we have a Sentry session. For our analytics events however, we
currently require passing in the Firezone ID and API url again. This
makes it difficult to send analytics events from areas of the code that
don't have this information available.

To still allow for that, we integrate the `analytics` module more
tightly with the Sentry session. This allows us to drop two parameters
from the `$identify` event and also means we now respect the
`NO_TELEMETRY` setting for these events except for `new_session`. This
event is sent regardless because it allows us to track how many on-prem
installations of Firezone are out there.
2025-07-10 20:05:08 +00:00
Thomas Eizinger
ec2599d545 chore(rust): simplify stream logs feature (#9780)
Instead of conditionally enabling the `logs` feature in the Sentry
client, we always enable it and use the `tracing` integration to control
which events get forwarded to Sentry. The feature-flag check only
accesses shared memory and is therefore really fast.

We already re-evaluate feature flags on a timer which means this boolean
will flip over automatically and logs will be streamed to Sentry.
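
To illustrate why such a check is cheap, here is a minimal sketch that assumes
the flag is stored in a process-wide `AtomicBool`. The names `STREAM_LOGS`,
`set_stream_logs` and `should_forward_to_sentry` are hypothetical and not
taken from the codebase:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical process-wide copy of the feature flag; defaults to `false`.
static STREAM_LOGS: AtomicBool = AtomicBool::new(false);

/// Called by the timer that re-evaluates feature flags.
fn set_stream_logs(enabled: bool) {
    STREAM_LOGS.store(enabled, Ordering::Relaxed);
}

/// Called for every log event; a single atomic load, no locks or IO.
fn should_forward_to_sentry() -> bool {
    STREAM_LOGS.load(Ordering::Relaxed)
}
```

A filter in the logging layer could call `should_forward_to_sentry()` per
event, so flipping the flag takes effect without re-initialising the client.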
2025-07-04 14:51:53 +00:00
Jamil
a4cf3ead0f ci: publish gateway 1.4.12 (#9736) 2025-07-01 14:04:21 +00:00
Jamil
699739deae fix(docs): use sha256sum over sha256 (#9690)
`sha256` isn't found by default on some machines.
2025-06-27 20:08:41 +00:00
Thomas Eizinger
6fc2ebe576 chore(gateway): log on startup (#9684)
As with some of our other applications, it is useful to know when they
restart and which version is running. Adding a log on INFO on startup
solves this.
2025-06-26 13:59:09 +00:00
Thomas Eizinger
d5be185ae4 chore(rust): remove telemetry spans and events (#9634)
Originally, we introduced these to gather some data from logs / warnings
that we considered to be too spammy. We've since merged a
burst-protection that will at most submit the same event once every 5
minutes.
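
For context, a burst protection of this kind can be sketched as below. This is
an illustration under our own assumptions and not the code that was merged;
`BurstProtection` and `should_submit` are hypothetical names and events are
identified by a plain string:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// The same event is submitted at most once per interval.
const MIN_INTERVAL: Duration = Duration::from_secs(5 * 60);

/// Remembers when each event was last submitted.
#[derive(Default)]
struct BurstProtection {
    last_submitted: HashMap<String, Instant>,
}

impl BurstProtection {
    /// Returns `true` if `event` should be submitted at time `now`.
    fn should_submit(&mut self, event: &str, now: Instant) -> bool {
        let recently_submitted = self
            .last_submitted
            .get(event)
            .map_or(false, |last| now.duration_since(*last) < MIN_INTERVAL);

        if recently_submitted {
            return false;
        }

        // First occurrence, or the previous one is old enough: record and submit.
        self.last_submitted.insert(event.to_owned(), now);
        true
    }
}
```

Passing `now` in explicitly keeps the sketch deterministic in tests.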

The data from the telemetry spans themselves has not been used at all.
2025-06-25 17:15:57 +00:00
Thomas Eizinger
3b972643b1 feat(rust): stream logs to Sentry when enabled in PostHog (#9635)
Sentry has a new "Logs" feature where we can stream logs directly to
Sentry. Doing this for all Clients and Gateways would be way too much
data to collect though.

In order to aid debugging from customer installations, we add a
PostHog-managed feature flag that - if set to `true` - enables the
streaming of logs to Sentry. This feature flag is evaluated every time
the telemetry context is initialised:

- For all FFI usages of connlib, this happens every time a new session
is created.
- For the Windows/Linux Tunnel service, this also happens every time we
create a new session.
- For the Headless Client and Gateway, it happens on startup and
afterwards, every minute. The feature-flag context itself is only
checked every 5 minutes though so it might take up to 5 minutes before
this takes effect.

The default value - like all feature flags - is `false`. Therefore, if
there is any issue with the PostHog service, we will fall back to the
previous behaviour where logs are simply stored locally.

Resolves: #9600
2025-06-25 16:14:14 +00:00
Thomas Eizinger
91edd11a47 feat(gateway): send $identify event with account-slug (#9658)
When we receive the `account_slug` from the portal, the Gateway now
sends a `$identify` event to PostHog. This will allow us to target
Gateways with feature-flags based on the account they are connected to.
2025-06-24 11:31:56 +00:00
Thomas Eizinger
a91dda139f feat(connlib): only conditionally hash firezone ID (#9633)
A bit of legacy that we have inherited around our Firezone ID is that
the ID stored on the user's device is sha'd before being passed to the
portal as the "external ID". This makes it difficult to correlate IDs in
Sentry and PostHog with the data we have in the portal. For Sentry and
PostHog, we submit the raw UUID stored on the user's device.

As a first step in overcoming this, we embed an "external ID" in those
services as well IF the provided Firezone ID is a valid UUID. This will
allow us to immediately correlate those events.

As a second step, we automatically generate all new Firezone IDs for the
Windows and Linux Client as `hex(sha256(uuid))`. These won't parse as
valid UUIDs and therefore will be submitted as is to the portal.

As a third step, we update all documentation around generating Firezone
IDs to use `uuidgen | sha256` instead of just `uuidgen`. This is
effectively the equivalent of (2) but for the Headless Client and
Gateway where the Firezone ID can be configured via environment
variables.

Resolves: #9382

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
2025-06-24 07:05:48 +00:00
Thomas Eizinger
950afd9b2d chore(gateway): set account-slug in telemetry context (#9545)
This PR adds an optional field `account_slug` to the Gateway's init
message. If populated, we will use this field to set the account-slug in
the telemetry context. This will tell us which customers a particular
Sentry issue is related to.
2025-06-23 18:52:39 +00:00
Jamil
081b075f2c chore: bump gui, apple, gateway (#9586)
The new publish automation still [has some
kinks](https://github.com/firezone/firezone/actions/runs/15764891111) so
publishing this manually.
2025-06-19 12:29:46 -07:00
Thomas Eizinger
cc50d58d8c chore(client,gateway): log portal connection hiccups on INFO (#9557)
These don't happen very often so are safe to log on INFO. That is the
default log level and it is useful to see why we are re-connecting to
the portal.
2025-06-17 14:01:34 +00:00
Jamil
b60d77cef4 chore: publish gateway 1.4.10 (#9412) 2025-06-05 08:55:13 +00:00
Thomas Eizinger
e05c98bfca ci: update to new cargo sort release (#9354)
The latest release now also sorts workspace dependencies, as well as
different dependency sections. Keeping these things sorted reduces the
chances of merge conflicts when multiple PRs edit these files.
2025-06-02 02:01:09 +00:00
Thomas Eizinger
cee4be9e24 build(deps): bump Rust dependencies (#9192)
A mass upgrade of our Rust dependencies. Most crucially, these remove
several duplicated dependencies from our tree.

- The Tauri plugins have been stuck on `windows v0.60` for a while. They
are now updated to use `windows v0.61` which is what the rest of our
dependency tree uses.
- By bumping `axum`, we can also bump `reqwest`, which removes a few
more duplicated dependencies.
- By removing `env_logger`, we can get rid of a few dependencies.
2025-05-22 13:15:01 +00:00
Thomas Eizinger
b7451fcdae chore: release Gateway 1.4.9 (#9132) 2025-05-14 06:39:03 +00:00
Thomas Eizinger
ea0ad9d089 chore(gateway): log CLI args we got invoked with (#9089) 2025-05-12 22:10:37 +00:00
Thomas Eizinger
f965487739 chore(connlib): turn down logs for non-fatal IO errors (#9091)
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-05-12 11:48:40 +00:00
Thomas Eizinger
fa790b231a fix(gateway): respond with SERVFAIL for missing nameserver (#9061)
When we implemented #8350, we chose an error handling strategy that
would shut down the Gateway in case we didn't have a nameserver selected
for handling those SRV and TXT queries. At the time, this was deemed to
be sufficiently rare to be an adequate strategy. We have since learned
that this can indeed happen when the Gateway starts without network
connectivity which is quite common when using tools such as terraform to
provision infrastructure.

In #9060, we fix this by re-evaluating the fastest nameserver on a
timer. This however doesn't change the error handling strategy when we
don't have a working nameserver at all. It is practically impossible to
have a working Gateway yet be unable to select a nameserver. We
read them from `/etc/resolv.conf` which is what `libc` uses to also
resolve the domain we connect to for the WebSocket. A working WebSocket
connection is required for us to establish connections to Clients, which
in turn is a precursor to us receiving DNS queries from a Client.

It causes unnecessary complexity to have a code path that can
potentially terminate the Gateway, yet is practically unreachable. To
fix this situation, we remove this code path and instead reply with a
DNS SERVFAIL error.

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-05-09 05:55:48 +00:00
Thomas Eizinger
37529803ce build(rust): bump otel ecosystem crates to 0.29 (#9029) 2025-05-05 12:33:07 +00:00
Jamil
6e0e7343ba chore: release Apple & Gateway with ECN fix (#9013) 2025-05-02 00:16:40 -07:00
Thomas Eizinger
43c4c5f91b feat(gateway): add CLI flag to validate checksums of all packets (#9007)
Validating checksums can be expensive so this is off-by-default. The
intent is to turn it on in our staging environment so we can detect
bugs in our packet handling code during testing.
2025-05-02 03:50:07 +00:00
Thomas Eizinger
497f8a7f8a ci(rust): make compile-packages opt-out from workspace (#8979)
Instead of explicitly listing every package we want to compile, attempt
to compile the entire workspace and exclude the ones we know won't work
on Windows.
2025-05-01 21:20:35 +00:00
Thomas Eizinger
ea5709e8da chore(rust): initialise OTEL with useful metadata (#8945)
Once we start collecting metrics across various Clients and Gateways,
these metrics need to be tagged with the correct `service.name`,
`service.version` as well as an instance ID to differentiate metrics
from different instances.
2025-05-01 05:19:07 +00:00
Thomas Eizinger
ec4cd898ba chore: release Gateway v1.4.7 (#8943) 2025-04-30 13:37:32 +00:00
Thomas Eizinger
6114bb274f chore(rust): make most of the Rust code compile on MacOS (#8924)
When working on the Rust code of Firezone from a MacOS computer, it is
useful to have pretty much all of the code at least compile in order to
detect problems early. Eventually, once we target features like a
headless MacOS client, some of these stubs will actually be filled in and
be functional.
2025-04-29 11:20:09 +00:00
Thomas Eizinger
1af7f4f8c1 fix(rust): don't use jemalloc on ARMv7 (#8859)
`jemalloc` doesn't compile on ARMv7 so we just fall back to the default
allocator there.
2025-04-19 22:20:05 +00:00
Thomas Eizinger
34f28e2ae6 feat(rust): use jemalloc for Gateway and Relay (#8846)
`jemalloc` is a modern allocator that is designed for multi-threaded
systems and can better handle smaller allocations that may otherwise
fragment the heap. Firezone's components, especially Relays and Gateways
are intended to operate with a long uptime and therefore need to handle
memory efficiently.
2025-04-19 12:25:46 +00:00
Jamil
743f5fdfeb ci: bump clients/gateway to ship write improvements (#8792)
Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
2025-04-15 06:21:23 +00:00
Thomas Eizinger
7c2163ddf4 fix(connlib): fail event-loops if UDP threads stop (#8783)
The UDP socket threads added in #7590 are designed to never exit. UDP
sockets are stateless and therefore any error condition on them should
be isolated to sending / receiving a particular datagram. It is however
possible that code panics which will shut down the threads
irrecoverably. In this unlikely event, `connlib`'s event-loop would keep
spinning and spam the log with "UDP socket stopped". There is no good
way to recover from such a situation automatically, so we just quit
`connlib` in that case and shut everything down.
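
The idea can be sketched with a standard-library channel, assuming the UDP
thread hands datagrams to the event-loop through an `mpsc` channel. That
transport, the `UdpPoll` type and the `String` error are assumptions made for
this illustration; the actual change reports the failure through
`DisconnectError`:

```rust
use std::sync::mpsc::{Receiver, TryRecvError};

/// Hypothetical outcome of polling the UDP thread once.
#[derive(Debug, PartialEq)]
enum UdpPoll {
    /// A datagram was received.
    Datagram(Vec<u8>),
    /// Nothing to do right now.
    Idle,
}

/// Polls the channel fed by the UDP thread.
///
/// If the thread has exited (for example due to a panic), its sender is
/// dropped and we return an error instead of spinning forever.
fn poll_udp(datagrams: &Receiver<Vec<u8>>) -> Result<UdpPoll, String> {
    match datagrams.try_recv() {
        Ok(datagram) => Ok(UdpPoll::Datagram(datagram)),
        Err(TryRecvError::Empty) => Ok(UdpPoll::Idle),
        Err(TryRecvError::Disconnected) => Err("UDP thread stopped".to_owned()),
    }
}
```

An event-loop built on this would propagate the `Err` to its caller and shut
down rather than logging the same message on every iteration.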

To model this new error path, we refactor the `DisconnectError` to be
internally backed by `anyhow`.
2025-04-15 02:27:37 +00:00
Thomas Eizinger
be897ed6c5 chore(gateway): require 4 cores to spawn more TUN threads (#8775)
By default, we spawn 1 TUN send and 1 TUN receive thread on the Gateway.
In addition to that, we also have the main processing thread that
encrypts and decrypts packets. With #7590, we will be separating out the
UDP send and receive operations into yet another thread. As a result, we
will have at a minimum 4 threads running that perform IO or important
work.

Thus, in order to benefit from TUN multi-queue, we need more than 4
cores to be able to efficiently parallelise work.

Related: #8769
2025-04-14 01:18:40 +00:00
Thomas Eizinger
859aa3cee0 feat(connlib): add context to event-loop errors (#8773)
This should make it easier to diagnose any error returned from the
event-loop.
2025-04-14 00:07:27 +00:00
Thomas Eizinger
e0f94824df fix(gateway): default to 1 TUN thread on single-core systems (#8765)
On single-core systems, spawning more than one TUN thread results in
contention that hurts performance more than it helps.

Resolves: #8760
2025-04-13 01:54:04 +00:00
Thomas Eizinger
439da65180 chore(connlib): log all tunnel errors on WARN (#8764)
Currently, errors encountered as part of operating the tunnel are
non-fatal and only logged on `TRACE` in order to not flood the logs.
Recent improvements to how the event loop operates mean that we actually
emit far fewer errors and ideally there should be 0. Therefore we can
now employ a much stricter policy and log all errors here on `WARN` in
order to get Sentry alerts.
2025-04-13 01:35:37 +00:00
Thomas Eizinger
289bd35e4c feat(connlib): add packet counter metrics (#8752)
This PR adds opentelemetry-based packet counter metrics to `connlib`. By
default, the collection of these metrics is disabled. Without a
registered metrics-provider, gathering these metrics is effectively a
no-op. It will still incur 1 or 2 function calls per packet but that
should be negligible compared to other operations such as encryption /
decryption.

With this system in place, we can in the future add more metrics to make
debugging easier.
2025-04-12 08:35:26 +00:00
Thomas Eizinger
84a2c275ca build(rust): upgrade to Rust 1.85 and Edition 2024 (#8240)
Updates our codebase to the 2024 Edition. For highlights on what
changes, see the following blogpost:
https://blog.rust-lang.org/2025/02/20/Rust-1.85.0.html
2025-03-19 02:58:55 +00:00
Jamil
df5bbdd240 ci: Ship SRV/TXT for GUI/Headless/Gateway (#8413) 2025-03-10 21:30:23 -07:00
Thomas Eizinger
39e272cfd1 refactor(rust): introduce dns-types crate (#8380)
A sizeable chunk of Firezone's Rust components deal with parsing,
manipulating and emitting DNS queries and responses. The API surface of
DNS is quite large and to make handling of all corner-cases easier, we
depend on the `domain` library to do the heavy-lifting for us.

For better or worse, `domain` follows a lazy-parsing approach. Thus,
creating a new DNS message doesn't actually verify that it is in fact
valid. Within Firezone, we make several assumptions around DNS messages,
such as that they will only ever contain a single question.
Historically, DNS allows for multiple questions per query but in
practise, nobody uses that.

Due to how we handle DNS in Firezone, manipulating these messages
happens in multiple places. That combined with the lazy-parsing approach
from `domain` warrants having our own `dns-types` library that wraps
`domain` and provides us with types that offer the interface we need in
the rest of the codebase.

Resolves: #7019
2025-03-10 04:33:10 +00:00
Thomas Eizinger
eacf67f2bc feat(gateway): forward queries to local nameserver (#8350)
The DNS server added in #8285 was only a dummy DNS server that added
infrastructure to actually receive DNS queries on the IP of the TUN
device at port 53535 and it returns SERVFAIL for all queries. For this
DNS server to be useful, we need to take those queries and replay them
towards a DNS server that is configured locally on the Gateway.

To achieve this, we parse `/etc/resolv.conf` during startup of the
Gateway and pass the contained nameservers into the tunnel. From there,
the Gateway's event-loop can receive the queries, feed them into the
already existing machinery for performing recursive DNS queries that we
use on the Client and resolve the records.

In its current implementation, we only use the first nameserver defined
in `/etc/resolv.conf`. If the lookup fails, we send back a SERVFAIL
error and log a message.
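
A rough sketch of the parsing step, using only the standard library. The
function names are hypothetical, and entries that `IpAddr` cannot parse (for
example IPv6 addresses with a zone suffix) are simply skipped in this sketch:

```rust
use std::net::IpAddr;

/// Extracts all `nameserver` entries from the contents of a resolv.conf file.
fn parse_nameservers(resolv_conf: &str) -> Vec<IpAddr> {
    resolv_conf
        .lines()
        .filter_map(|line| {
            let mut parts = line.split_whitespace();

            // Only lines of the form `nameserver <ip>` are relevant.
            match (parts.next(), parts.next()) {
                (Some("nameserver"), Some(addr)) => addr.parse::<IpAddr>().ok(),
                _ => None,
            }
        })
        .collect()
}

/// Returns the first nameserver, mirroring the "first entry wins" behaviour.
fn first_nameserver(resolv_conf: &str) -> Option<IpAddr> {
    parse_nameservers(resolv_conf).into_iter().next()
}
```

The input would typically come from
`std::fs::read_to_string("/etc/resolv.conf")`.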

Resolves: #8221
2025-03-05 20:23:01 +00:00
Thomas Eizinger
3978661fbc feat(gateway): run a DNS resolver on $tun_ip:53535 (#8285)
To support resolving SRV and TXT records for DNS-resources, we host a
DNS server on UDP/53535 and TCP/53535 on the IPv4 and IPv6 IP of the
Gateway's TUN device. This will later be used by connlib to send DNS
queries of particular types (concretely SRV and TXT) to the Gateway
itself.

With this PR, this DNS server is already functional and reachable but it
will answer all queries with SERVFAIL. Actual handling of these queries
is left to a future PR.
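
For illustration, answering a query with SERVFAIL at the wire level only
requires touching the two flag bytes of the DNS header. The sketch below
operates on raw bytes and is not how the Gateway builds its responses;
`servfail_response` is a hypothetical name, and the rest of the message
(including any EDNS records) is echoed back unchanged:

```rust
/// Builds a SERVFAIL response for a raw DNS query.
///
/// Returns `None` if the input is shorter than the 12-byte DNS header.
fn servfail_response(query: &[u8]) -> Option<Vec<u8>> {
    const HEADER_LEN: usize = 12;
    const QR_RESPONSE: u8 = 0b1000_0000; // marks the message as a response
    const OPCODE_AND_RD: u8 = 0b0111_1001; // bits we copy over from the query
    const RCODE_SERVFAIL: u8 = 2;

    if query.len() < HEADER_LEN {
        return None;
    }

    // Start from the query so the ID and question section are preserved.
    let mut response = query.to_vec();

    // First flags byte: set QR, keep opcode and RD, clear AA and TC.
    response[2] = QR_RESPONSE | (query[2] & OPCODE_AND_RD);
    // Second flags byte: clear RA, Z, AD and CD; set RCODE to SERVFAIL.
    response[3] = RCODE_SERVFAIL;

    Some(response)
}
```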

We listen on port 53535 because:

- Port 53 may be taken by another DNS server running on the customer's
machine where they deploy the Gateway
- Port 5353 is the standard port for mDNS
- I could not find anything on the Internet about it being used by a
specific application

In theory, we could also bind to a random port but then we'd have to
communicate this port somehow to the client. This could be done using a
control protocol message but it just makes things more complicated. For
example, there would be additional buffering needed on the Client side
for the time-period where we've established a connection to the Gateway
already but haven't received the control protocol message yet, at which
port the Gateway is hosting the DNS server.

If one knows the Gateway's IP (and has a connection to it already), this
DNS server will be usable by users with standard DNS tools such as
`dig`:

```sh
dig @100.76.212.99 -p 53535 example.com
```

Related: #8221
2025-03-03 12:26:32 +00:00
Thomas Eizinger
e63f1cb4da feat(connlib): allow and route packets to Gateway TUN IPs (#8294)
At the moment, `connlib` doesn't allow routing packets directly to
Gateways because the subnet we've chosen for the tunnel IPs isn't part
of the routing table. In addition, all traffic within `connlib` is
expected to be targeting a resource _beyond_ a Gateway.

In order to resolve SRV and TXT records within a certain site, we've
opted to host a DNS server on the Gateway's TUN device. See #8285 for
details on that. To actually reach that DNS server, we need to add a few
new control flows to `connlib` where we detect whether a packet is
directly for the tunnel IP of a Gateway or for a resource.

We only know a Gateway's IP once we are connected to it, meaning we
cannot route those packets prior to that. We also cannot establish a
connection when the user attempts to, as every connection intent sent to
the portal needs to reference a Resource. For the use case of resolving
SRV and TXT records, the packets will be associated with the DNS
resource for which we are trying to resolve records.

This patch only establishes the base connectivity and necessary
exceptions to the Client's filter rules in order to route packets to the
Gateway's TUN device. The following commands have been issued against a
staging Gateway, demonstrating connectivity to the Gateway's TUN device
from a Client after establishing a connection to it:

```
❯ ping github.com -c 1
PING github.com (fd00:2021:1111:8000::) 56 data bytes
64 bytes from github.com (fd00:2021:1111:8000::): icmp_seq=1 ttl=50 time=1441 ms

--- github.com ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 1440.614/1440.614/1440.614/0.000 ms

❯ ping 100.72.145.83 -c 1
PING 100.72.145.83 (100.72.145.83) 56(84) bytes of data.
64 bytes from 100.72.145.83: icmp_seq=1 ttl=64 time=213 ms

--- 100.72.145.83 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 212.574/212.574/212.574/0.000 ms
```

Related: #8221
2025-03-03 03:20:06 +00:00
Thomas Eizinger
315d99f723 feat(gateway): allow tunneling packets to and from TUN device (#8283)
At present, Clients are only allowed to send packets to resources
accessible via the Gateway but not to the Gateway itself. Thus, any
application (including Firezone itself) that opens a listening socket on
the TUN device will never receive any traffic.

Allowing this traffic opens up interesting features like hosting additional services
on the machine that the Gateway is running on. Concretely, in order to
implement #8221, we will run a DNS server on port 53 of the TUN device
as part of the Gateway.

The diff for this ended up being a bit larger because we are introducing
an `IpConfig` abstraction so we don't have to track 4 IP addresses as
separate fields within `ClientOnGateway`; the connection-specific state
on a Gateway. This is where we allow / deny traffic from a Client. To
allow traffic for this particular Gateway, we need to know our own TUN
IP configuration within the component.
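
A simplified sketch of what such an abstraction could look like. Only the
names `IpConfig` and `ClientOnGateway` come from this change; the fields and
methods shown here are assumptions made for illustration:

```rust
use std::net::{IpAddr, Ipv4Addr, Ipv6Addr};

/// The IPv4 and IPv6 address of one TUN device.
#[derive(Debug, Clone, Copy, PartialEq)]
struct IpConfig {
    v4: Ipv4Addr,
    v6: Ipv6Addr,
}

impl IpConfig {
    /// Whether `ip` is one of the two addresses of this configuration.
    fn is_ip(&self, ip: IpAddr) -> bool {
        match ip {
            IpAddr::V4(v4) => v4 == self.v4,
            IpAddr::V6(v6) => v6 == self.v6,
        }
    }
}

/// Connection-specific state: two `IpConfig`s instead of four loose fields.
struct ClientOnGateway {
    client_tun: IpConfig,
    gateway_tun: IpConfig,
}

impl ClientOnGateway {
    /// Whether a packet originates from the Client's TUN device.
    fn is_from_client(&self, src: IpAddr) -> bool {
        self.client_tun.is_ip(src)
    }

    /// Whether a packet is addressed to the Gateway's own TUN device.
    fn is_for_gateway(&self, dst: IpAddr) -> bool {
        self.gateway_tun.is_ip(dst)
    }
}
```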
2025-02-27 23:49:05 +00:00
Thomas Eizinger
57ce0ee469 feat(gateway): cache DNS queries for resources (#8225)
With the addition of the Firezone Control Protocol, we are now issuing a
lot more DNS queries on the Gateway. Specifically, every DNS query for a
DNS resource name always triggers a DNS query on the Gateway. This
ensures that changes to DNS entries for resources are picked up without
having to build any sort of "stale detection" in the Gateway itself. As
a result though, a Gateway has to issue a lot of DNS queries to upstream
resolvers which in 99% or more cases will return the same result.

To reduce the load on these upstream resolvers, we cache successful
results of DNS queries for 5 minutes.
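
A minimal sketch of such a cache, assuming entries are keyed by domain name
and expire a fixed 5 minutes after insertion. `DnsCache` and its methods are
hypothetical names and not the types used in the Gateway:

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::time::{Duration, Instant};

/// How long a successful result is served from the cache.
const CACHE_TTL: Duration = Duration::from_secs(5 * 60);

/// Caches resolved addresses together with their insertion time.
#[derive(Default)]
struct DnsCache {
    entries: HashMap<String, (Vec<IpAddr>, Instant)>,
}

impl DnsCache {
    /// Stores the result of a successful query; failures are never inserted.
    fn insert(&mut self, domain: &str, addresses: Vec<IpAddr>, now: Instant) {
        self.entries.insert(domain.to_owned(), (addresses, now));
    }

    /// Returns the cached addresses unless the entry is missing or expired.
    fn get(&self, domain: &str, now: Instant) -> Option<&[IpAddr]> {
        let (addresses, inserted_at) = self.entries.get(domain)?;

        if now.duration_since(*inserted_at) >= CACHE_TTL {
            return None;
        }

        Some(addresses.as_slice())
    }
}
```

On a cache miss, the caller would query the upstream resolver and call
`insert` only if that query succeeded.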

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
2025-02-23 04:27:09 +00:00