Commit Graph

131 Commits

Author SHA1 Message Date
Thomas Eizinger
166b0d1573 feat(linux): compute device ID from /etc/machine-id (#10805)
All of our Linux applications have a soft-dependency on systemd. That
is, in the default configuration, we expect systemd to be present on the
machine. The only exceptions are the Docker containers for Headless
Client and Gateway.

For the GUI client in particular, systemd is a hard-dependency in order
to control DNS on the system which we do via `systemd-resolved`. To
secure the communication between the GUI client and its tunnel process,
we automatically create a group called `firezone-client` to which the
user gets added. All members of the group are allowed to access the unix
socket which is used for IPC between the two processes. Membership in
this group is also a prerequisite for accessing any of the configuration
files.

On the first launch of the GUI client on a Linux system, this presents a
problem. For group membership changes to take effect, the user needs
to reboot. We say that in the documentation but it is unclear whether
all users will read that thoroughly enough. To help the user, the GUI
client checks for membership of the current user in the group and alerts
the user via a dialog box if that isn't the case. This would all be fine
if it actually worked. Unfortunately, that check happens too
late in the process. If we aren't a member of the group, we cannot read
the device ID and bail early, thus never reaching the check and
terminating the process without any dialog box or user-visible error.

We could attempt to fix this by shuffling around some of the startup
init code. That is a sub-optimal solution however because it a) may get
broken again in the future and b) it means we have to delay
initialisation of telemetry until a much later point.

Given that this is only a problem on Linux, a better solution is to
simply not rely on the disk-based device ID at all. Instead, we can
integrate with systemd and deterministically derive a device ID from the
unique machine ID and a randomly chosen "app ID".

For backwards-compatibility reasons, the disk-based device ID is still
prioritised. For all new installs however, we will use the one based on
`/etc/machine-id`.
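
As a rough illustration of the precedence described above, here is a minimal sketch. The function names are hypothetical, and the FNV-1a hash is only a dependency-free placeholder for whatever keyed cryptographic derivation the real integration with systemd performs.

```rust
/// Placeholder 64-bit FNV-1a hash. This is NOT the real derivation; it only
/// keeps the sketch free of external crates. A real implementation would
/// use a keyed cryptographic hash.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for b in bytes {
        hash ^= u64::from(*b);
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Derives an ID from the machine ID and an application-specific ID.
/// The same inputs always produce the same output.
fn derive_from_machine_id(machine_id: &str, app_id: &str) -> String {
    let mut input = Vec::new();
    input.extend_from_slice(app_id.as_bytes());
    input.push(b':');
    // `/etc/machine-id` ends with a newline, so trim before hashing.
    input.extend_from_slice(machine_id.trim().as_bytes());
    format!("{:016x}", fnv1a(&input))
}

/// Prefers an existing disk-based ID (backwards compatibility) and only
/// falls back to the derived one when no such ID exists.
fn device_id(disk_id: Option<&str>, machine_id: &str, app_id: &str) -> String {
    match disk_id {
        Some(id) => id.to_owned(),
        None => derive_from_machine_id(machine_id, app_id),
    }
}
```

Because the fallback needs no file owned by the `firezone-client` group, it can be computed before the group-membership check runs.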
2025-11-10 02:29:52 +00:00
Thomas Eizinger
9016ffc9dc build(rust): bump to Rust 1.91.0 (#10767)
Rust 1.91 has been released and brings with it a few new lints that we
need to tidy up. In addition, it also stabilizes `BTreeMap::extract_if`:
A really nifty std-lib function that allows us to conditionally take
elements from a map. We need that in a bunch of places.
2025-11-03 01:56:12 +00:00
Thomas Eizinger
3308e3c010 fix(linux): introduce tiered routing tables (#10742)
With the fix of taking into account link-scoped routes in #10554 we
introduced a bug: If a customer defines routes in Firezone that conflict
with the link-scope ones, those currently take priority as they are
usually more specific.

To fix this, we introduce tiered routing tables controlled by a set of
rules with different priority.

1. In the first "Firezone" routing table, we add all CIDR/IP routes that
users define in Firezone.
2. In the second "Firezone" routing table, we sync in all link-scope
routes from the system.
3. In the third "Firezone" routing table, we only add the Internet
Resource if it is active.

By evaluating the routing tables in this order, we effectively always
prioritize Firezone-controlled routes over local ones but still allow
access to LAN resources when the Internet Resource is active.
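
The evaluation order can be modelled with a small sketch. It is not the netlink code from this change; the table names and the `lookup` helper are made up for illustration, and only IPv4 is handled.

```rust
use std::net::Ipv4Addr;

/// An IPv4 route: network address plus prefix length (0..=32).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Route {
    network: Ipv4Addr,
    prefix: u8,
}

impl Route {
    fn new(network: [u8; 4], prefix: u8) -> Self {
        Route { network: Ipv4Addr::from(network), prefix }
    }

    fn contains(&self, ip: Ipv4Addr) -> bool {
        // A /0 prefix matches everything; avoid shifting by 32 bits.
        let mask = if self.prefix == 0 {
            0
        } else {
            u32::MAX << (32 - u32::from(self.prefix))
        };
        (u32::from(ip) & mask) == (u32::from(self.network) & mask)
    }
}

/// Tables are consulted in order. Within a table, the most specific
/// (longest-prefix) route wins. The first table with any match decides.
fn lookup(
    tables: &[(&'static str, Vec<Route>)],
    ip: Ipv4Addr,
) -> Option<(&'static str, Route)> {
    for (name, routes) in tables {
        let best = routes
            .iter()
            .filter(|r| r.contains(ip))
            .max_by_key(|r| r.prefix);
        if let Some(route) = best {
            return Some((*name, *route));
        }
    }
    None
}

/// Hypothetical contents of the three tiers.
fn example_tables() -> Vec<(&'static str, Vec<Route>)> {
    vec![
        ("firezone-resources", vec![Route::new([192, 168, 0, 0], 16)]),
        (
            "link-scope",
            vec![Route::new([192, 168, 188, 0], 24), Route::new([10, 0, 0, 0], 24)],
        ),
        ("internet-resource", vec![Route::new([0, 0, 0, 0], 0)]),
    ]
}
```

With these tables, `192.168.188.5` resolves in the first tier even though the link-scope `/24` is more specific, `10.0.0.7` resolves in the second tier, and everything else falls through to the default route in the third.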

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
2025-10-30 06:53:55 +00:00
Thomas Eizinger
21a848a4cb chore(connlib): tune INFO logs (#10677)
The INFO logs of Firezone (specifically `connlib`) should be a good
balance between useful and not noisy. Several of the INFO logs we
currently have are probably a bit too noisy and can be tuned down or
optimised to be easier to read.

Before:

```
2025-10-22T01:48:38.836Z  INFO firezone_headless_client: arch="x86_64" version="1.5.5"
2025-10-22T01:48:38.840Z  INFO socket_factory: Set UDP socket buffer sizes requested_send_buffer_size=16777216 send_buffer_size=425984 requested_recv_buffer_size=134217728 recv_buffer_size=425984 port=52625
2025-10-22T01:48:38.841Z  INFO socket_factory: Set UDP socket buffer sizes requested_send_buffer_size=16777216 send_buffer_size=425984 requested_recv_buffer_size=134217728 recv_buffer_size=425984 port=52625
2025-10-22T01:48:38.851Z  INFO firezone_tunnel::device_channel: Initializing TUN device name=tun-firezone
2025-10-22T01:48:38.852Z  INFO firezone_tunnel::client: Resetting network state (network changed)
2025-10-22T01:48:38.853Z  INFO socket_factory: Set UDP socket buffer sizes requested_send_buffer_size=16777216 send_buffer_size=425984 requested_recv_buffer_size=134217728 recv_buffer_size=425984 port=52625
2025-10-22T01:48:38.854Z  INFO socket_factory: Set UDP socket buffer sizes requested_send_buffer_size=16777216 send_buffer_size=425984 requested_recv_buffer_size=134217728 recv_buffer_size=425984 port=52625
2025-10-22T01:48:39.263Z  INFO phoenix_channel: Connected to portal host=api
2025-10-22T01:48:39.408Z  INFO firezone_tunnel::client: Updating TUN device config=TunConfig { ip: IpConfig { v4: 100.90.205.158, v6: fd00:2021:1111::2:76b2 }, dns_by_sentinel: {}, search_domain: Some(Name(httpbin.search.test.)), ipv4_routes: [100.64.0.0/11, 100.96.0.0/11, 100.100.111.0/24], ipv6_routes: [fd00:2021:1111::/107, fd00:2021:1111:8000::/107, fd00:2021:1111:8000:100:100:111:0/120] }
2025-10-22T01:48:39.408Z  INFO firezone_tunnel::client: Updating TUN device config=TunConfig { ip: IpConfig { v4: 100.90.205.158, v6: fd00:2021:1111::2:76b2 }, dns_by_sentinel: {100.100.111.1 <> 127.0.0.11:53}, search_domain: Some(Name(httpbin.search.test.)), ipv4_routes: [100.64.0.0/11, 100.96.0.0/11, 100.100.111.0/24], ipv6_routes: [fd00:2021:1111::/107, fd00:2021:1111:8000::/107, fd00:2021:1111:8000:100:100:111:0/120] }
2025-10-22T01:48:39.408Z  INFO firezone_tunnel::client: Activating resource name=foobar.com address=foobar.com sites=mycro-aws-gws
2025-10-22T01:48:39.409Z  INFO firezone_tunnel::client: Activating resource name=*.firezone.dev address=*.firezone.dev sites=mycro-aws-gws
2025-10-22T01:48:39.409Z  INFO firezone_tunnel::client: Activating resource name=ip6only address=ip6only.me sites=mycro-aws-gws
2025-10-22T01:48:39.409Z  INFO firezone_tunnel::client: Activating resource name=example.com address=example.com sites=mycro-aws-gws
2025-10-22T01:48:39.409Z  INFO firezone_tunnel::client: Activating resource name=Example address=*.example.com sites=mycro-aws-gws
2025-10-22T01:48:39.409Z  INFO firezone_tunnel::client: Activating resource name=**.httpbin address=**.httpbin sites=mycro-aws-gws
2025-10-22T01:48:39.409Z  INFO firezone_tunnel::client: Activating resource name=MyCorp Network (IPv6) address=172:20::/64 sites=mycro-aws-gws
2025-10-22T01:48:39.409Z  INFO firezone_tunnel::client: Updating TUN device config=TunConfig { ip: IpConfig { v4: 100.90.205.158, v6: fd00:2021:1111::2:76b2 }, dns_by_sentinel: {100.100.111.1 <> 127.0.0.11:53}, search_domain: Some(Name(httpbin.search.test.)), ipv4_routes: [100.64.0.0/11, 100.96.0.0/11, 100.100.111.0/24], ipv6_routes: [172:20::/64, fd00:2021:1111::/107, fd00:2021:1111:8000::/107, fd00:2021:1111:8000:100:100:111:0/120] }
2025-10-22T01:48:39.409Z  INFO firezone_tunnel::client: Activating resource name=**.httpbin.search.test address=**.httpbin.search.test sites=mycro-aws-gws
2025-10-22T01:48:39.409Z  INFO firezone_tunnel::client: Activating resource name=**.firez.one address=**.firez.one sites=mycro-aws-gws
2025-10-22T01:48:39.409Z  INFO firezone_tunnel::client: Activating resource name=MyCorp Network address=172.20.0.0/16 sites=mycro-aws-gws
2025-10-22T01:48:39.409Z  INFO firezone_tunnel::client: Updating TUN device config=TunConfig { ip: IpConfig { v4: 100.90.205.158, v6: fd00:2021:1111::2:76b2 }, dns_by_sentinel: {100.100.111.1 <> 127.0.0.11:53}, search_domain: Some(Name(httpbin.search.test.)), ipv4_routes: [100.64.0.0/11, 100.96.0.0/11, 100.100.111.0/24, 172.20.0.0/16], ipv6_routes: [172:20::/64, fd00:2021:1111::/107, fd00:2021:1111:8000::/107, fd00:2021:1111:8000:100:100:111:0/120] }
2025-10-22T01:48:39.418Z  INFO firezone_bin_shared::tun_device_manager::linux: Setting new routes new_routes={V4(Ipv4Network { network_address: 100.64.0.0, netmask: 11 }), V4(Ipv4Network { network_address: 172.20.0.0, netmask: 16 }), V6(Ipv6Network { network_address: 172:20::, netmask: 64 }), V4(Ipv4Network { network_address: 100.96.0.0, netmask: 11 }), V6(Ipv6Network { network_address: fd00:2021:1111::, netmask: 107 }), V6(Ipv6Network { network_address: fd00:2021:1111:8000::, netmask: 107 }), V6(Ipv6Network { network_address: fd00:2021:1111:8000:100:100:111:0, netmask: 120 }), V4(Ipv4Network { network_address: 100.100.111.0, netmask: 24 })}
2025-10-22T01:48:39.420Z  INFO firezone_headless_client: Tunnel ready elapsed=583.523468ms
2025-10-22T01:48:39.430Z  INFO snownet::node: Added new TURN server rid=2a413094-32d4-4a69-8e92-642d60e885e9 address=Dual { v4: 203.0.113.102:3478, v6: [203:0:113::102]:3478 }
2025-10-22T01:49:44.814Z  INFO snownet::node: Creating new connection local=IceCreds { ufrag: "bly5", pass: "bdjtlfpvfdhhya6om4kssi" } remote=IceCreds { ufrag: "24gy", pass: "5mqlci4n4nmoovovihswvq" } index=(2378720|0) cid=ea82a87c-ca11-4292-a332-940ac386cba1
2025-10-22T01:49:45.634Z  INFO snownet::node: Updating remote socket new=PeerToPeer { source: 172.30.0.100:52625, dest: 203.0.113.3:52625 } duration_since_intent=821.149802ms cid=ea82a87c-ca11-4292-a332-940ac386cba1
2025-10-22T01:49:45.783Z  INFO snownet::node: Updating remote socket old=PeerToPeer { source: 172.30.0.100:52625, dest: 203.0.113.3:52625 } new=PeerToPeer { source: [172:30::100]:52625, dest: [203:0:113::3]:52625 } duration_since_intent=971.112388ms cid=ea82a87c-ca11-4292-a332-940ac386cba1
```

After:

```
2025-10-22T01:58:09.972Z  INFO firezone_headless_client: arch="x86_64" version="1.5.5"
2025-10-22T01:58:09.980Z  INFO firezone_tunnel::client: Resetting network state (network changed)
2025-10-22T01:58:10.271Z  INFO phoenix_channel: Connected to portal host=api
2025-10-22T01:58:10.369Z  INFO firezone_tunnel::client: Activating resource name=foobar.com address=foobar.com sites=mycro-aws-gws
2025-10-22T01:58:10.369Z  INFO firezone_tunnel::client: Activating resource name=*.firezone.dev address=*.firezone.dev sites=mycro-aws-gws
2025-10-22T01:58:10.369Z  INFO firezone_tunnel::client: Activating resource name=ip6only address=ip6only.me sites=mycro-aws-gws
2025-10-22T01:58:10.369Z  INFO firezone_tunnel::client: Activating resource name=example.com address=example.com sites=mycro-aws-gws
2025-10-22T01:58:10.369Z  INFO firezone_tunnel::client: Activating resource name=Example address=*.example.com sites=mycro-aws-gws
2025-10-22T01:58:10.369Z  INFO firezone_tunnel::client: Activating resource name=**.httpbin address=**.httpbin sites=mycro-aws-gws
2025-10-22T01:58:10.370Z  INFO firezone_tunnel::client: Activating resource name=MyCorp Network (IPv6) address=172:20::/64 sites=mycro-aws-gws
2025-10-22T01:58:10.370Z  INFO firezone_tunnel::client: Activating resource name=**.httpbin.search.test address=**.httpbin.search.test sites=mycro-aws-gws
2025-10-22T01:58:10.370Z  INFO firezone_tunnel::client: Activating resource name=**.firez.one address=**.firez.one sites=mycro-aws-gws
2025-10-22T01:58:10.370Z  INFO firezone_tunnel::client: Activating resource name=MyCorp Network address=172.20.0.0/16 sites=mycro-aws-gws
2025-10-22T01:58:10.370Z  INFO snownet::node: Added new TURN server rid=2a413094-32d4-4a69-8e92-642d60e885e9 address=Dual { v4: 203.0.113.102:3478, v6: [203:0:113::102]:3478 }
2025-10-22T01:58:10.370Z  INFO snownet::node: Added new TURN server rid=54f6ba35-1914-48fc-be24-62f6293936eb address=Dual { v4: 203.0.113.101:3478, v6: [203:0:113::101]:3478 }
2025-10-22T01:58:10.370Z  INFO firezone_tunnel::client: Updating TUN device config=TunConfig { ip: IpConfig { v4: 100.90.205.158, v6: fd00:2021:1111::2:76b2 }, dns_by_sentinel: {100.100.111.1 <> 127.0.0.11:53}, search_domain: Some(Name(httpbin.search.test.)), ipv4_routes: [100.64.0.0/11, 100.96.0.0/11, 100.100.111.0/24, 172.20.0.0/16], ipv6_routes: [172:20::/64, fd00:2021:1111::/107, fd00:2021:1111:8000::/107, fd00:2021:1111:8000:100:100:111:0/120] }
2025-10-22T01:58:10.383Z  INFO firezone_bin_shared::tun_device_manager::linux: Setting new routes new_routes=[100.64.0.0/11, 100.96.0.0/11, 100.100.111.0/24, 172.20.0.0/16, 172:20::/64, fd00:2021:1111::/107, fd00:2021:1111:8000::/107, fd00:2021:1111:8000:100:100:111:0/120]
2025-10-22T01:58:10.495Z  INFO snownet::allocation: Invalidating allocation active_socket=Some(203.0.113.101:3478)
2025-10-22T01:58:10.495Z  INFO snownet::allocation: Invalidating allocation active_socket=Some(203.0.113.102:3478)
2025-10-22T02:03:04.410Z  INFO snownet::node: Creating new connection local=IceCreds { ufrag: "uxgc", pass: "xxdgp5ivfhqloedzdmgi3j" } remote=IceCreds { ufrag: "es6w", pass: "doa2s3hmiteid7dtlszsbq" } index=(583098|0) cid=ea82a87c-ca11-4292-a332-940ac386cba1
2025-10-22T02:03:04.960Z  INFO snownet::node: Updating remote socket new=PeerToPeer { source: 172.30.0.100:52625, dest: 203.0.113.3:52625 } duration_since_intent=550.756408ms cid=ea82a87c-ca11-4292-a332-940ac386cba1
2025-10-22T02:03:05.112Z  INFO snownet::node: Updating remote socket old=PeerToPeer { source: 172.30.0.100:52625, dest: 203.0.113.3:52625 } new=PeerToPeer { source: [172:30::100]:52625, dest: [203:0:113::3]:52625 } duration_since_intent=702.23775ms cid=ea82a87c-ca11-4292-a332-940ac386cba1
```
2025-10-22 23:47:55 +00:00
Thomas Eizinger
d35cf445d4 fix(linux): don't sync link-scope routes of offline interfaces (#10583)
In #10554, we added a syncing mechanism that would copy all link-scoped
routes of the `main` routing table over to the Firezone routing table.
Routes for interfaces that are currently offline cannot be added and
cause a netlink error of "Invalid argument".

To prevent unnecessary warnings from being logged to Sentry, we retrieve
the link state of each interface and skip routes for interfaces that are
not online.
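
A minimal sketch of that filtering step, assuming the link state has already been queried and reduced to a set of online interface names. `LinkRoute` and the function name are hypothetical.

```rust
use std::collections::HashSet;

/// A link-scoped route as read from the `main` table (simplified).
struct LinkRoute {
    destination: String,
    interface: String,
}

/// Keeps only routes whose interface is currently online, so that we never
/// attempt to add a route the kernel would reject.
fn routes_for_online_interfaces(
    routes: Vec<LinkRoute>,
    online: &HashSet<String>,
) -> Vec<LinkRoute> {
    routes
        .into_iter()
        .filter(|route| online.contains(&route.interface))
        .collect()
}
```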
2025-10-16 05:34:10 +00:00
Thomas Eizinger
eb75cef467 fix(linux): allow LAN access when Internet Resource is on (#10554)
## Context

On Linux, we create a dedicated routing table for all routes of the
Firezone TUN device, including the `0.0.0.0/0` route. At a minimum, this
routing table contains the following if the Internet Resource is active:

```
> ip route show table 539098368
default dev tun-firezone proto static
100.64.0.0/11 dev tun-firezone proto static
100.96.0.0/11 dev tun-firezone proto static
100.100.111.0/24 dev tun-firezone proto static
```

In addition, we also create a routing rule that bypasses this routing
table for all packets that are tagged with the `0xfd002021` mark:

```
> ip rule list
0:      from all lookup local
32765:  not from all fwmark 0xfd002021 lookup 539098368
32766:  from all lookup main
32767:  from all lookup default
```

Firezone's internal UDP and TCP sockets are tagged with this mark and
thus prevent routing loops where our own packets would otherwise get
redirected back into the tunnel.

Without the Internet Resource active, the rule `from all lookup main`
triggers for local LAN traffic and correctly routes the traffic out via
that interface.

For example, on my computer, the Linux kernel created the following
route with the `link` scope in the main table:

```
192.168.188.0/24 dev wlp192s0 proto kernel scope link src 192.168.188.112 metric 600
```

## The problem

With the Internet Resource active, there is a problem. The default route
matches ALL destinations, including those for local LAN destinations
which should actually be sent out via a different interface. As a
result, local LAN traffic is broken on Linux as soon as the Internet
Resource is active. Instead of being sent out via the local interface,
these packets get sent to `tun-firezone` where they get forwarded to the
Gateway and then dropped because their source IP is not a Firezone
Client IP.

## Solution

Fixing this is unfortunately non-trivial. The best I could come up with
is to create a copy of all link-scoped routes in the Firezone routing
table and keep those in sync with all route changes that happen. For
example, when we roam, the link-scoped routes obviously change because
we join a new subnet.

We therefore listen to change-events from netlink and create a debounced
task that reads the current link-scoped routes from the main routing
table, compares it to the ones in the Firezone table and adds any routes
not present.

We don't need to worry about removing routes as link-scoped routes
automatically disappear once the corresponding interface goes away.
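
The comparison step boils down to a set difference. A sketch, with routes represented as plain strings for brevity (the real code works on netlink messages):

```rust
use std::collections::BTreeSet;

/// Returns the link-scoped routes that exist in the main table but are
/// missing from the Firezone table, i.e. what a sync pass has to add.
/// Nothing is ever removed here.
fn routes_to_add(
    main_link_scoped: &BTreeSet<String>,
    firezone: &BTreeSet<String>,
) -> Vec<String> {
    main_link_scoped.difference(firezone).cloned().collect()
}
```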

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-10-14 20:36:58 +00:00
Thomas Eizinger
d718c5de8e fix(connlib): retry packets on IO error 5 (#10279)
Unfortunately, it isn't very easy to detect whether a socket supports
GSO on Linux. Hence, `quinn-udp` simply probes for its support by trying
to send GSO batches and effectively disables GSO by setting the
`max-gso-segments` state variable to 1 if it encounters either EINVAL
(-22) or EIO (-5).

For EINVAL, `quinn-udp` has an internal retry mechanism. For EIO, the
`Transmit` which is passed to `quinn-udp` needs to be re-chunked and
thus cannot be automatically retried.

In order to avoid dropping packets, we therefore add a once-off retry
step to sending a datagram whenever we hit EIO on Linux or Android. If
the error was due to GSO not being supported, the 2nd attempt should be
successful and going forward, even the first one should be until we roam
the socket (where this state variable gets reset).
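
The retry itself can be sketched as below. The `send` closure is a stand-in for "re-chunk the `Transmit` and hand it to the socket"; the helper name is hypothetical.

```rust
use std::io;

/// EIO on Linux and Android.
const EIO: i32 = 5;

/// Calls `send` and retries exactly once if the first attempt fails with
/// EIO. Any other error, or a second failure, is returned to the caller.
fn send_with_eio_retry<F>(mut send: F) -> io::Result<()>
where
    F: FnMut() -> io::Result<()>,
{
    match send() {
        Err(e) if e.raw_os_error() == Some(EIO) => send(),
        other => other,
    }
}
```

Retrying only once keeps a genuinely broken socket from looping forever.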

These packet drops have been causing flakiness in CI ever since we
merged the eBPF tests. Those disable checksum offloading which appears
to trigger these errors.
2025-09-02 21:31:57 +00:00
Thomas Eizinger
e84bdc5566 refactor(connlib): periodically record queue depths (#10242)
Instead of recording the queue depths on every event-loop tick, we now
record them once a second by setting a Gauge. Not only is that a simpler
instrument to work with but it is significantly more performant. The
current version - when metrics are enabled - takes on quite a bit of CPU
time.

Resolves: #10237
2025-09-02 02:57:36 +00:00
Thomas Eizinger
a109c1a2ef feat(connlib): discard intermediate resource and TUN updates (#10223)
Right now, the Client event-loops have a channel with 1000 items for
sending new resource lists and updates to the TUN device to the host
app. This is kind of unnecessary as we always only care about the last
version of these. Intermediate updates that the host app doesn't process
are effectively irrelevant.

We've had an issue before where a bug in the portal caused us to receive
many updates to resources which ended up crashing Client apps because
this channel filled up.

To be more resilient on this front, we refactor the Client event loop to
use a `watch` channel for this. Watch channels only retain the last
value that got sent into them.
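
To illustrate the "last value wins" semantics without pulling in an async runtime, here is a std-only stand-in. `LatestSlot` is a hypothetical type, not the channel used in the event loop.

```rust
use std::sync::{Arc, Mutex};

/// Holds at most one pending value; newer values overwrite older ones.
#[derive(Clone)]
struct LatestSlot<T> {
    inner: Arc<Mutex<Option<T>>>,
}

impl<T> LatestSlot<T> {
    fn new() -> Self {
        LatestSlot { inner: Arc::new(Mutex::new(None)) }
    }

    /// Replaces any unprocessed value. There is no queue that could fill up.
    fn send(&self, value: T) {
        *self.inner.lock().unwrap() = Some(value);
    }

    /// Takes the most recent value, if one arrived since the last call.
    fn take(&self) -> Option<T> {
        self.inner.lock().unwrap().take()
    }
}
```

Sending a thousand updates before the host app reads once leaves exactly one value to process: the newest.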
2025-08-21 05:42:54 +00:00
Thomas Eizinger
4e11112d9b feat(connlib): improve throughput on higher latencies (#10231)
Turns out the multi-threaded access of the TUN device on the Gateway
causes packet reordering which makes the TCP congestion controller
throttle the connection. Additionally, the default TX queue length of a
TUN device on Linux is only 500 packets.

With just a single thread and an increased TX queue length, we get a
throughput performance of just over 1 GBit/s for a 20ms link between
Client and Gateway with basically no packet drops:

```
Connecting to host 172.20.0.110, port 5201
[  5] local 100.79.130.70 port 49546 connected to 172.20.0.110 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   116 MBytes   977 Mbits/sec    0   6.40 MBytes       
[  5]   1.00-2.00   sec   137 MBytes  1.15 Gbits/sec    0   6.40 MBytes       
[  5]   2.00-3.00   sec   134 MBytes  1.13 Gbits/sec    0   6.40 MBytes       
[  5]   3.00-4.00   sec   136 MBytes  1.14 Gbits/sec   47   6.40 MBytes       
[  5]   4.00-5.00   sec   137 MBytes  1.15 Gbits/sec    0   6.40 MBytes       
[  5]   5.00-6.00   sec   138 MBytes  1.16 Gbits/sec    0   6.40 MBytes       
[  5]   6.00-7.00   sec   138 MBytes  1.15 Gbits/sec    0   6.40 MBytes       
[  5]   7.00-8.00   sec   138 MBytes  1.15 Gbits/sec    0   6.40 MBytes       
[  5]   8.00-9.00   sec   138 MBytes  1.16 Gbits/sec    0   6.40 MBytes       
[  5]   9.00-10.00  sec   138 MBytes  1.15 Gbits/sec    0   6.40 MBytes       
[  5]  10.00-11.00  sec   139 MBytes  1.17 Gbits/sec    0   6.40 MBytes       
[  5]  11.00-12.00  sec   139 MBytes  1.17 Gbits/sec    0   6.40 MBytes       
[  5]  12.00-13.00  sec   136 MBytes  1.14 Gbits/sec    0   6.40 MBytes       
[  5]  13.00-14.00  sec   139 MBytes  1.17 Gbits/sec    0   6.40 MBytes       
[  5]  14.00-15.00  sec   140 MBytes  1.17 Gbits/sec    0   6.40 MBytes       
[  5]  15.00-16.00  sec   138 MBytes  1.16 Gbits/sec    0   6.40 MBytes       
[  5]  16.00-17.00  sec   137 MBytes  1.15 Gbits/sec    0   6.40 MBytes       
[  5]  17.00-18.00  sec   139 MBytes  1.17 Gbits/sec    0   6.40 MBytes       
[  5]  18.00-19.00  sec   138 MBytes  1.16 Gbits/sec    0   6.40 MBytes       
[  5]  19.00-20.00  sec   136 MBytes  1.14 Gbits/sec    0   6.40 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-20.00  sec  2.67 GBytes  1.15 Gbits/sec   47             sender
[  5]   0.00-20.02  sec  2.67 GBytes  1.15 Gbits/sec                  receiver

iperf Done.

```

For further debugging in the future, we are now recording the send and
receive queue depths of both the TUN device and the UDP sockets. Neither
of those appeared to be full in my testing which leads me to conclude that
it isn't any buffer inside Firezone that is too small here.

Related: #7452

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
2025-08-20 23:08:56 +00:00
Thomas Eizinger
301d2137e5 refactor(windows): share src IP cache across UDP sockets (#9976)
When looking through customer logs, we see a lot of "Resolved best route
outside of tunnel" messages. Those get logged every time we need to
rerun our re-implementation of Windows' weighting algorithm as to which
source interface / IP a packet should be sent from.

Currently, this gets cached in every socket instance so for the
peer-to-peer socket, this is only computed once per destination IP.
However, for DNS queries, we make a new socket for every query. Using a
new source port for each DNS query is recommended to avoid fingerprinting of
DNS queries. Using a new socket also means that we need to re-run this
algorithm every time we make a DNS query which is why we see this log so
often.

To fix this, we need to share this cache across all UDP sockets. Cache
invalidation is one of the hardest problems in computer science and this
instance is no different. This cache needs to be reset every time we
roam as that changes the weighting of which source interface to use.

To achieve this, we extend the `SocketFactory` trait with a `reset`
method. This method is called whenever we roam and can then reset a
shared cache inside the `UdpSocketFactory`. The "source IP resolver"
function that is passed to the UDP socket now simply accesses this
shared cache and inserts a new entry when it needs to resolve the IP.

As an added benefit, this may speed up DNS queries on Windows a bit
(although I haven't benchmarked it). It should certainly drastically
reduce the amount of syscalls we make on Windows.
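
A simplified sketch of such a shared cache follows. The type and method names are hypothetical, and the resolver is modelled as infallible whereas the real one can fail with an IO error.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::sync::{Arc, Mutex};

/// Cache from destination IP to resolved source IP. Cloning the handle
/// shares the same underlying map, so every socket sees the same entries.
#[derive(Clone, Default)]
struct SrcIpCache {
    entries: Arc<Mutex<HashMap<IpAddr, IpAddr>>>,
}

impl SrcIpCache {
    /// Returns the cached source IP or runs the expensive resolver once.
    fn get_or_resolve<F>(&self, dst: IpAddr, resolve: F) -> IpAddr
    where
        F: FnOnce(IpAddr) -> IpAddr,
    {
        let mut entries = self.entries.lock().unwrap();
        *entries.entry(dst).or_insert_with(|| resolve(dst))
    }

    /// Called when we roam: every cached decision may now be stale.
    fn reset(&self) {
        self.entries.lock().unwrap().clear();
    }
}
```

A short-lived DNS socket that holds a clone of the cache then reuses the entry computed by an earlier socket instead of re-running the algorithm.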
2025-07-24 01:36:53 +00:00
Thomas Eizinger
eb4c54620c chore(linux): add more error context to TUN device (#9853)
When failing to create the TUN device, the error messages are currently
pretty bare. Add a bit more context so users can more easily
self-diagnose what is wrong.
2025-07-13 05:51:02 +00:00
Thomas Eizinger
d6805d7e48 chore(rust): bump to Rust 1.88 (#9714)
Rust 1.88 has been released and brings with it a quite exciting feature:
let-chains! It allows us to mix-and-match `if` and `let` expressions,
therefore often reducing the "right-drift" of the relevant code, making
it easier to read.

Rust 1.88 also comes with a new clippy lint that warns when creating a
mutable reference from an immutable pointer. Attempting to fix this
revealed that this is exactly what we are doing in the eBPF kernel.
Unfortunately, it doesn't seem to be possible to design this in a way
that is both accepted by the borrow-checker AND by the eBPF verifier.
Hence, we simply make the function `unsafe` and document for the
programmer what needs to be upheld.
2025-07-12 06:42:50 +00:00
Thomas Eizinger
f98fcca542 refactor(connlib): directly implement async fn (#9806)
At present, and as a result of how `connlib` evolved, we still implement
a `Poll`-based function for receiving data on our UDP socket. Ever since
we moved to dedicated threads for the UDP socket, we can directly block
on receiving datagrams and don't have to poll the socket.

This simplifies the implementation a fair bit. Additionally, it made me
realise that we currently don't expose any errors on the UDP socket.
Likely, those will be ephemeral but it is still better than completely
silencing them.
2025-07-10 13:54:44 +00:00
Thomas Eizinger
17a1d36eae fix(gui-client): set IO error type for missing non-tunnel routes (#9777)
On Windows - in order to prevent routing loops - we resolve the best
"non-tunnel" route to a particular host for each IP address. The
resulting source IP is then used as source for packets leaving our
interface. In case the system doesn't have IPv6 connectivity or there are
simply no routes available, we fail this "source IP resolver" with an IO
error.

Presently, this uses the "other" IO error type which causes this to be
logged on a WARN level in the event-loop. The IO error types
`HostUnreachable` and `NetworkUnreachable` are expected during normal
operation of Firezone and are therefore only logged on DEBUG.

By changing this IO error type, we fix the WARN log spam on Windows for
machines without IPv6 connectivity.
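
The effect can be sketched as follows. The `Level` enum and both functions are illustrative only, and the choice of `HostUnreachable` over `NetworkUnreachable` for the missing-route case is an assumption.

```rust
use std::io;

#[derive(Debug, PartialEq)]
enum Level {
    Debug,
    Warn,
}

/// Mirrors the event-loop policy described above: unreachable-style errors
/// are expected during normal operation, everything else is a warning.
fn log_level_for(error: &io::Error) -> Level {
    match error.kind() {
        io::ErrorKind::HostUnreachable | io::ErrorKind::NetworkUnreachable => Level::Debug,
        _ => Level::Warn,
    }
}

/// The error the source IP resolver returns when no non-tunnel route exists.
fn no_route_error() -> io::Error {
    io::Error::new(io::ErrorKind::HostUnreachable, "no non-tunnel route available")
}
```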
2025-07-03 21:45:06 +00:00
Thomas Eizinger
899f5ea5e8 fix(gui-client): ensure GUI client can access firezone-id.json (#9764)
I believe some of the recent changes around how we load the
`firezone-id.json` from the GUI client surfaced that we in fact don't
always have access to it. Previously, this was silenced because we would
only optionally add it as context to the Sentry client.

Now, we need it to initialise telemetry so we know whether or not to
send logs to Sentry.

In order to be able to access the file, we need to change the config's
directory and the file to be owned by the `firezone-client` group.
2025-07-01 14:11:29 +00:00
Thomas Eizinger
daf05b8c79 fix(windows): ignore network changes from irrelevant networks (#9696)
In order to detect network changes on Windows, we implement the
`INetworkEvents` callback interface. This callback notifies us every
time the connectivity of a certain network changes.

Performing a network reset in connlib on any of these changes hurts the
user experience as Firezone is booting because it takes a while for this
to settle. Firezone itself is making changes to the network so several
of these change events happen _because_ Firezone is starting.

The documentation from Microsoft on what possible values the `NameType`
attribute can have is pretty thin but I did manage to find the following
values on the Internet:

- `6`: Wired network
- `71`: Wireless network
- `243`: Broadband network

We assume that the user is connected to the Internet through one of
these so we ignore network changes on all other networks.

An alternative approach to reducing the number of false-positive change
events would be to react to a narrower list of change events. I
discarded this approach because it wasn't clear to me which of the
event types [0] would matter to us and when Windows emits them. I think
in order to effectively react to those, we'd have to do more
fine-grained tracking of which state a network is in and e.g. only trigger a
reset if we move from "Disconnected" to e.g. "Subnet connectivity".
Windows also differentiates between local, subnet and Internet
connectivity, yet in my testing, I've never observed the "Internet"
connectivity being emitted.

Hence, it is deemed more robust to just filter out networks based on
their type. Firezone itself is of type 53 and is therefore automatically
filtered out as well. The risk here is that we don't react to
connectivity changes of a network that a customer is relying on.
Unfortunately, I don't think there is a better way to find this out
other than shipping this change and waiting for reports.
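
Expressed as code, the filter is a simple allow-list. The constant and function names are hypothetical; the numeric values are the ones listed above.

```rust
/// `NameType` values of networks we assume carry the user's Internet access.
const WIRED: u32 = 6;
const WIRELESS: u32 = 71;
const BROADBAND: u32 = 243;

/// Only changes on these network types should trigger a network reset.
/// Everything else, including Firezone's own adapter (type 53), is ignored.
fn is_relevant_network(name_type: u32) -> bool {
    matches!(name_type, WIRED | WIRELESS | BROADBAND)
}
```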

[0]:
https://learn.microsoft.com/en-us/windows/win32/api/netlistmgr/ne-netlistmgr-nlm_connectivity#constants

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
2025-06-30 08:52:00 +00:00
Thomas Eizinger
a91dda139f feat(connlib): only conditionally hash firezone ID (#9633)
A bit of legacy that we have inherited around our Firezone ID is that
the ID stored on the user's device is sha'd before being passed to the
portal as the "external ID". This makes it difficult to correlate IDs in
Sentry and PostHog with the data we have in the portal. For Sentry and
PostHog, we submit the raw UUID stored on the user's device.

As a first step in overcoming this, we embed an "external ID" in those
services as well IF the provided Firezone ID is a valid UUID. This will
allow us to immediately correlate those events.

As a second step, we automatically generate all new Firezone IDs for the
Windows and Linux Client as `hex(sha256(uuid))`. These won't parse as
valid UUIDs and therefore will be submitted as is to the portal.

As a third step, we update all documentation around generating Firezone
IDs to use `uuidgen | sha256` instead of just `uuidgen`. This is
effectively the equivalent of (2) but for the Headless Client and
Gateway where the Firezone ID can be configured via environment
variables.
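
The resulting rule for computing the external ID can be sketched like this. The UUID check is a simplification that only accepts the canonical hyphenated form, and `sha256_hex` is injected as a parameter so the sketch needs no hashing crate; both names are hypothetical.

```rust
/// Simplified check for the canonical 8-4-4-4-12 hyphenated UUID form.
fn looks_like_uuid(s: &str) -> bool {
    s.len() == 36
        && s.bytes().enumerate().all(|(i, b)| match i {
            8 | 13 | 18 | 23 => b == b'-',
            _ => b.is_ascii_hexdigit(),
        })
}

/// Legacy IDs (valid UUIDs) are hashed before being sent to the portal.
/// New-style IDs do not parse as UUIDs and are passed through as is.
fn external_id(firezone_id: &str, sha256_hex: impl Fn(&str) -> String) -> String {
    if looks_like_uuid(firezone_id) {
        sha256_hex(firezone_id)
    } else {
        firezone_id.to_owned()
    }
}
```

Since a 64-character hex digest never looks like a UUID, IDs generated in the new format are not hashed a second time.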

Resolves: #9382

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
2025-06-24 07:05:48 +00:00
Thomas Eizinger
60bdbb39cb refactor(gui-client): move change listeners to tunnel service (#8160)
At present, listening for DNS server change and network change events is
handled in the GUI client. Upon an event, a message is sent to the
tunnel service which then applies the new state to `connlib`.

We can avoid some of this boilerplate by moving these listeners to the
tunnel service as part of the handler. As a result, we get a few
improvements:

- We don't need to ignore these events if we don't have a session
because the lifetime of these listeners is tied to the IPC handler on
the service side.
- We need fewer IPC messages
- We can retry the connection directly from within the tunnel service in
case we have no Internet at the time of startup
- We can more easily model out the state machine of a connlib session in
the tunnel service
- On Linux, this means we no longer shell out to `resolvectl` from the
GUI process, unifying access to the "resolvers" from the tunnel service
- On Windows, we no longer need admin privileges on the GUI client for
optimized network-change detection. This now happens in the Tunnel
process which already runs as admin.

Resolves: #9465
2025-06-11 06:18:14 +00:00
Jamil
e1ac9e4912 fix(rust): relax assertion on cloudflare tcp response (#9506)
In #9498, the Cloudflare response was updated to match what appears to
be a transient change on their end. It looks like this has changed
again, so to prevent this from breaking CI in the future we relax the
assertion.
2025-06-10 19:51:18 -07:00
Thomas Eizinger
1fa345aa5e test(rust): adapt response from Cloudflare proxy (#9498)
It appears that Cloudflare changed the response that it is sending for
the 1.1.1.1 IP so we need to adapt our integration test for packet loops
in order to make CI pass.
2025-06-10 14:18:07 +00:00
Jamil
822832e02b chore(macos): allow tauri to build on macOS (#9391)
When working on UI stuff for the Tauri clients on macOS it's helpful if
the UI is buildable. This is a first stab at getting a stub client to
launch on macOS with the help of our AI overlords. Feel free to close or
heavily critique if there is a better approach.
2025-06-06 09:15:39 +00:00
Thomas Eizinger
e05c98bfca ci: update to new cargo sort release (#9354)
The latest release now also sorts workspace dependencies, as well as
different dependency sections. Keeping these things sorted reduces the
chances of merge conflicts when multiple PRs edit these files.
2025-06-02 02:01:09 +00:00
Thomas Eizinger
d62f82787d build(deps): bump netlink dependency group (#9315)
In
https://github.com/rust-netlink/netlink-packet-route/issues/140#issuecomment-2919539363,
the author claims the issue we've been holding the dependency bump back
for is resolved. We can now update to the latest versions of the
`netlink` dependency group.
2025-05-31 02:34:55 +00:00
Thomas Eizinger
ae872980ae refactor(gui-client): scope telemetry sessions to GUI client (#9179)
For our telemetry sessions with Sentry, we need to know which
environment we are running in, i.e. staging, production or on-prem. The
GUI client's tunnel service doesn't have a concept of an environment
until a GUI connects and sends the `StartTelemetry` message. Therefore,
we should scope a telemetry session to a GUI being connected over IPC.

Any errors around setting up / tearing down the background service are a
catch-22. Until a GUI connects, we can't initialise the telemetry
connection but if we fail to set up the background service, no GUI can
ever connect. Hence, the current setup and tear down of the `Telemetry`
module around the `ipc_listen` calls can safely be removed as they are
effectively no-ops anyway.
2025-05-20 23:18:18 +00:00
Thomas Eizinger
1bdba3601a feat(gui-client): rename IPC service to Tunnel service (#9154)
The name IPC service is not very descriptive. By nature of being
separate processes, we need to use IPC to communicate between them. The
important thing is that the service process has control over the tunnel.
Therefore, we rename everything to "Tunnel service".

The only part that is not changed are historic changelog entries.

Resolves: #9048
2025-05-19 09:52:06 +00:00
Thomas Eizinger
3300c0fe02 chore(rust): fix windows static analysis errors (#9162)
The `static-analysis` job for Windows was not yet part of the rule set
and therefore some clippy errors slipped through when we merged #9159.
2025-05-16 04:23:53 +00:00
Thomas Eizinger
6165555add build(deps): bump Rust to 1.87.0 (#9159) 2025-05-16 01:58:17 +00:00
Thomas Eizinger
b8738448df refactor(connlib): forward error from source IP resolver (#9116)
In order to avoid routing loops on Windows, our UDP and TCP sockets in
`connlib` embed a "source IP resolver" that finds the "next best"
interface after our TUN device according to Windows' routing metrics.
This ensures that packets don't get routed back into our TUN device.

Currently, errors during this process are only logged on TRACE and
therefore not visible in Sentry. We fix this by moving around some of
the function interfaces and forwarding the error from the source IP
resolver together with some context about the destination IP.
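
As a rough illustration of forwarding an error together with the destination IP, here is a minimal, std-only sketch. The names `ResolveSourceError` and `resolve_source_ip` are hypothetical and not taken from `connlib`; the resolver is passed in as a closure purely to keep the example self-contained.

```rust
use std::error::Error;
use std::fmt;
use std::io;
use std::net::IpAddr;

/// Hypothetical error type: wraps the resolver's failure and records
/// which destination we were trying to reach.
#[derive(Debug)]
pub struct ResolveSourceError {
    dst: IpAddr,
    source: io::Error,
}

impl fmt::Display for ResolveSourceError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "failed to resolve source IP for destination {}", self.dst)
    }
}

impl Error for ResolveSourceError {
    // Expose the underlying error so that reporters can walk the chain.
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&self.source)
    }
}

/// Runs the (assumed) resolver and forwards its error to the caller
/// instead of only logging it.
pub fn resolve_source_ip(
    dst: IpAddr,
    resolver: impl Fn(IpAddr) -> io::Result<IpAddr>,
) -> Result<IpAddr, ResolveSourceError> {
    resolver(dst).map_err(|source| ResolveSourceError { dst, source })
}
```

Because the error is returned rather than logged at the call site, the caller decides at which level to report it.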
2025-05-13 13:33:15 +00:00
Thomas Eizinger
b93c28240e chore(rust): fix features in bin-shared (#9094)
When this crate is compiled by itself, these features are required. This
doesn't show up in CI because there we compile the entire workspace and
some crate somewhere already activates these features then.
2025-05-13 03:12:59 +00:00
Thomas Eizinger
4097ee0cdf chore(gui-client): only read is_finished once (#9095)
For at least 1 user, the threads shut down correctly, but we didn't seem
to have exited the loop. In
https://firezone-inc.sentry.io/issues/6335839279/events/c11596de18924ee3a1b64ced89b1fba2/?project=4508008945549312,
we can see that both flags are marked as `true` yet we still emitted the
message.

The only way I can explain this is that the thread shut down in
between the two times we called the `is_finished` function. To ensure
this doesn't happen, we now only read it once.

This, however, also shows that 5s may not be enough time for WinTUN to
shut down. Therefore, we increase the grace period to 10s.
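
The idea of reading `is_finished` only once can be sketched as follows. This is not the actual GUI client code: `Finished` and `wait_for_shutdown` are hypothetical names, and the two-thread setup is an assumption based on the "both flags" mentioned above.

```rust
use std::thread::{self, JoinHandle};
use std::time::{Duration, Instant};

/// Snapshot of whether each worker thread has finished.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Finished {
    pub recv: bool,
    pub send: bool,
}

impl Finished {
    pub fn all(self) -> bool {
        self.recv && self.send
    }
}

/// Polls both threads until they have finished or the grace period elapsed.
pub fn wait_for_shutdown(
    recv: &JoinHandle<()>,
    send: &JoinHandle<()>,
    grace_period: Duration,
) -> Finished {
    let deadline = Instant::now() + grace_period;

    loop {
        // Read each flag exactly once per iteration. The same snapshot is
        // used for the exit condition and for whatever we report afterwards,
        // so a thread finishing in between cannot produce a contradiction.
        let snapshot = Finished {
            recv: recv.is_finished(),
            send: send.is_finished(),
        };

        if snapshot.all() || Instant::now() >= deadline {
            return snapshot;
        }

        thread::sleep(Duration::from_millis(10));
    }
}
```

A caller would log a warning only if the returned snapshot is not `all()`, passing `Duration::from_secs(10)` for the grace period described above.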
2025-05-12 11:47:42 +00:00
Thomas Eizinger
5566f1847f refactor(rust): move crates into a more sensical hierarchy (#9066)
The current `rust/` directory is a bit of a wild-west in terms of how
the crates are organised. Most of them are simply at the top-level when
in reality, they are all `connlib`-related. The Apple and Android FFI
crates - which are entrypoints in the Rust code - are defined several
layers deep.

To improve the situation, we move around and rename several crates. The
end result is that all top-level crates / directories are:

- Either entrypoints into the Rust code, i.e. applications such as
Gateway, Relay or a Client
- Or crates shared across all those entrypoints, such as `telemetry` or
`logging`
2025-05-12 01:04:17 +00:00
Thomas Eizinger
18ec6c6860 refactor(rust): move service implementation to GUI client (#9045)
The module and crate structure around the GUI client and its background
service are currently a mess of circular dependencies. Most of the
service implementation actually sits in `firezone-headless-client`
because the headless-client and the service share certain modules. We
have recently moved most of these to `firezone-bin-shared` which is the
correct place for these modules.

In order to move the background service to `firezone-gui-client`, we
need to untangle a few more things in the GUI client. Those are done
commit-by-commit in this PR. With that out of the way, we can finally
move the service module to the GUI client, where it should actually live
given that it has nothing to do with the headless client.

As a result, the headless-client is - as one would expect - really just
a thin wrapper around connlib itself and is reduced down to 4 files with
this PR.

To make things more consistent in the GUI client, we move the `main.rs`
file also into `bin/`. By convention `bin/` is where you define binaries
if a crate has more than one. cargo will then build all of them.

Eventually, we can optimise the compile-times for `firezone-gui-client`
by splitting it into multiple crates:

- Shared structs like IPC messages
- Background service
- GUI client

This will be useful because it allows re-compiling only the GUI client
if nothing in `connlib` changes and vice versa.

Resolves: #6913
Resolves: #5754
2025-05-08 13:22:09 +00:00
Thomas Eizinger
f2b1fbe718 refactor(rust): move device_id to bin-shared (#9040)
Both `device_id` and `device_info` are used by the headless-client and
the GUI client / IPC service. They should therefore be defined in the
`bin-shared` crate.
2025-05-06 04:52:37 +00:00
Thomas Eizinger
f11a902b3d refactor(rust): move dns-control to bin-shared (#9023)
Currently, the platform-specific code for controlling DNS resolution on
a system sits in `firezone-headless-client`. This code is also used by
the GUI client. This creates a weird compile-time dependency from the
GUI client to the headless client.

For other components that have platform-specific implementations, we use
the `firezone-bin-shared` crate. As a first step of resolving the
compile-time dependency, we move the `dns_control` module to
`firezone-bin-shared`.
2025-05-06 01:29:09 +00:00
Thomas Eizinger
005b6fe863 feat(windows): optimise network change detection (#9021)
Presently, the network change detection on Windows is very naive and
simply emits a change event every time _anything_ changes. We can
optimise this and therefore improve the start-up time of Firezone by:

- Filtering out duplicate events
- Filtering out network change events for our own network adapter

This reduces the number of network change events to 1 during startup. As
far as I can tell from the code comments in this area, we explicitly
send this one to ensure we don't run into a race condition whilst we are
starting up.
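
The two filters can be sketched roughly as below. This is a simplified model, not the Windows implementation: `NetworkEvent`, `ChangeFilter` and the adapter names are hypothetical, and the real listener works with OS notifications rather than plain structs.

```rust
/// Hypothetical, simplified view of a network change notification.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct NetworkEvent {
    pub adapter: String,
    pub connected: bool,
}

/// Decides which events are worth forwarding.
pub struct ChangeFilter {
    own_adapter: String,
    last: Option<NetworkEvent>,
}

impl ChangeFilter {
    pub fn new(own_adapter: impl Into<String>) -> Self {
        Self {
            own_adapter: own_adapter.into(),
            last: None,
        }
    }

    /// Returns `true` if the event should be emitted as a change.
    pub fn should_emit(&mut self, event: &NetworkEvent) -> bool {
        // Changes to our own adapter are caused by us; ignore them.
        if event.adapter == self.own_adapter {
            return false;
        }

        // An event identical to the last emitted one carries no new information.
        if self.last.as_ref() == Some(event) {
            return false;
        }

        self.last = Some(event.clone());
        true
    }
}
```

Note that ignored events for our own adapter do not update `last`, so a repeated event from another adapter is still recognised as a duplicate afterwards.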

Resolves: #8905
2025-05-06 00:23:27 +00:00
Thomas Eizinger
806996c245 refactor(rust): move signals to bin-shared (#9024)
The `signals` module isn't something headless-client specific and should
live in our `bin-shared` crate. Once the `ipc_service` module is
decoupled from the headless-client crate, it will be used by both the
headless client and IPC service (which then will be defined in the GUI
client crate).
2025-05-05 23:34:26 +00:00
Thomas Eizinger
ce51c40d0d refactor(rust): move known_dirs to bin-shared (#9026)
The `known_dirs` module is used across the headless-client and the GUI
client. It should live in `bin-shared` where all the other
cross-platform modules are.

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
2025-05-05 22:45:53 +00:00
Thomas Eizinger
80335676b1 refactor(rust): move uptime to bin-shared (#9027)
The `uptime` module from `firezone-headless-client` is also used in the
GUI client. In order to decouple this dependency, we move the module to
`bin-shared`, next to the other cross-platform modules.
2025-05-05 12:28:26 +00:00
Thomas Eizinger
8a201494d0 ci: remove flaky Windows benchmark (#8941)
This tunnel throughput benchmark isn't a very useful benchmark and it is
very flaky. Remove it entirely until we can replace it with something
more robust and useful.

Resolves: #8172
2025-04-30 07:24:21 -07:00
Thomas Eizinger
e031dfdb4a refactor(connlib): introduce our own bufferpool crate (#8928)
We have been using buffer pools for a while all over `connlib` as a way
to efficiently use heap-allocated memory. This PR harmonizes the usage
of buffer pools across the codebase by introducing a dedicated
`bufferpool` crate. This crate offers a convenient and easy-to-use API
for all the things we (currently) need from buffer pools. As a nice
bonus of having it all in one place, we can now also track metrics of
how many buffers we have currently allocated.

An example output from the local metrics exporter looks like this:

```
Name         : system.buffer.count
Description  : The number of buffers allocated in the pool.
Unit         : {buffers}
Type         : Sum
Sum DataPoints
Monotonic    : false
Temporality  : Cumulative
DataPoint #0
	StartTime    : 2025-04-29 12:41:25.278436
	EndTime      : 2025-04-29 12:42:25.278088
	Value        : 96
	Attributes   :
		 ->  system.buffer.pool.name: udp-socket-v6
		 ->  system.buffer.pool.buffer_size: 65535
DataPoint #1
	StartTime    : 2025-04-29 12:41:25.278436
	EndTime      : 2025-04-29 12:42:25.278088
	Value        : 7
	Attributes   :
		 ->  system.buffer.pool.buffer_size: 131600
		 ->  system.buffer.pool.name: gso-queue
DataPoint #2
	StartTime    : 2025-04-29 12:41:25.278436
	EndTime      : 2025-04-29 12:42:25.278088
	Value        : 128
	Attributes   :
		 ->  system.buffer.pool.name: udp-socket-v4
		 ->  system.buffer.pool.buffer_size: 65535
DataPoint #3
	StartTime    : 2025-04-29 12:41:25.278436
	EndTime      : 2025-04-29 12:42:25.278088
	Value        : 8
	Attributes   :
		 ->  system.buffer.pool.buffer_size: 1336
		 ->  system.buffer.pool.name: ip-packet
DataPoint #4
	StartTime    : 2025-04-29 12:41:25.278436
	EndTime      : 2025-04-29 12:42:25.278088
	Value        : 9
	Attributes   :
		 ->  system.buffer.pool.buffer_size: 1336
		 ->  system.buffer.pool.name: snownet
```
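
To illustrate the general technique, here is a minimal buffer pool sketch that recycles heap allocations and counts how many buffers it has created. It does not reflect the actual API of the `bufferpool` crate; `BufferPool`, `Buffer`, `pull` and `allocated` are assumed names, and the counter is a plain atomic rather than a real metrics instrument.

```rust
use std::ops::{Deref, DerefMut};
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};

struct Inner {
    free: Mutex<Vec<Vec<u8>>>,
    allocated: AtomicUsize,
    buffer_size: usize,
}

/// Hypothetical pool handing out fixed-size buffers.
#[derive(Clone)]
pub struct BufferPool {
    inner: Arc<Inner>,
}

impl BufferPool {
    pub fn new(buffer_size: usize) -> Self {
        let inner = Inner {
            free: Mutex::new(Vec::new()),
            allocated: AtomicUsize::new(0),
            buffer_size,
        };
        Self { inner: Arc::new(inner) }
    }

    /// Reuses a returned buffer if one is available, otherwise allocates.
    /// Recycled buffers are not zeroed and may contain stale bytes.
    pub fn pull(&self) -> Buffer {
        let recycled = self.inner.free.lock().ok().and_then(|mut free| free.pop());
        let storage = recycled.unwrap_or_else(|| {
            self.inner.allocated.fetch_add(1, Ordering::Relaxed);
            vec![0u8; self.inner.buffer_size]
        });
        Buffer { storage, pool: Arc::clone(&self.inner) }
    }

    /// Total number of buffers this pool has ever allocated.
    pub fn allocated(&self) -> usize {
        self.inner.allocated.load(Ordering::Relaxed)
    }
}

/// A buffer that hands its allocation back to the pool when dropped.
pub struct Buffer {
    storage: Vec<u8>,
    pool: Arc<Inner>,
}

impl Deref for Buffer {
    type Target = [u8];
    fn deref(&self) -> &[u8] {
        &self.storage
    }
}

impl DerefMut for Buffer {
    fn deref_mut(&mut self) -> &mut [u8] {
        &mut self.storage
    }
}

impl Drop for Buffer {
    fn drop(&mut self) {
        // Move the allocation out and return it to the free list.
        let storage = std::mem::take(&mut self.storage);
        if let Ok(mut free) = self.pool.free.lock() {
            free.push(storage);
        }
    }
}
```

Pulling two buffers allocates twice; dropping one and pulling again reuses the returned allocation, so `allocated()` stays at 2.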

Resolves: #8385
2025-04-30 08:52:18 +00:00
Thomas Eizinger
6114bb274f chore(rust): make most of the Rust code compile on MacOS (#8924)
When working on the Rust code of Firezone from a MacOS computer, it is
useful to have pretty much all of the code at least compile in order to
detect problems early. Eventually, once we target features like a
headless MacOS client, some of these stubs will actually be filled in
and be functional.
2025-04-29 11:20:09 +00:00
Thomas Eizinger
93036734ae build(rust): move our own windows dependency to 0.61.0 (#8730)
Version `0.61.0` is what most of our dependencies bring in, so depending
on that allows us to unify the dependency tree here.
2025-04-22 02:35:28 +00:00
Thomas Eizinger
b3746b330f refactor(connlib): spawn dedicated threads for UDP sockets (#7590)
Correctly implementing asynchronous IO is notoriously hard. In order to
not drop packets in the process, one has to ensure a given socket is
ready to accept packets, buffer them if that is not the case, suspend
everything else until the socket is ready and then continue.

Until now, we did this because it was the only option to run the UDP
sockets on the same thread as the actual packet processing. That in turn
was motivated by wanting to pass around references of the received
packets for processing. Rust's borrow-checker does not allow passing
references between threads which forced us to have the sockets on the
same thread as the packet processing.

Like we already did in other places in `connlib`, this can be solved
through the use of buffer pools. Using a buffer pool, we can use heap
allocations to store the received packets without having to make a new
allocation every time we read new packets. Instead, we can have a
dedicated thread that is connected to `connlib`'s packet processing
thread via two channels (one for inbound and one for outbound packets).
These channels are bounded, which ensures backpressure is maintained in
case one of the two threads lags behind. These bounds also mean that we
have at most N buffers from the buffer pool in-flight (where N is the
capacity of the channel).

Within those dedicated threads, we can then use `async/await` notation
to suspend the entire task when a socket isn't ready for sending.
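
The thread-plus-bounded-channels layout can be sketched with the standard library alone. This is a simplified stand-in, not `connlib`'s implementation: it uses blocking `std::sync::mpsc` channels instead of `async/await`, plain `Vec<u8>` packets instead of pooled buffers, and the "socket" merely loops outbound packets back as inbound ones. `SocketThread` and `spawn_socket_thread` are hypothetical names.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};
use std::thread::{self, JoinHandle};

pub type Packet = Vec<u8>;

/// Handles held by the packet-processing thread.
pub struct SocketThread {
    /// Packets we want the socket thread to send.
    pub outbound: SyncSender<Packet>,
    /// Packets the socket thread has received.
    pub inbound: Receiver<Packet>,
    pub handle: JoinHandle<()>,
}

/// Spawns a dedicated thread connected via two bounded channels.
/// `capacity` caps the number of packets in flight per direction.
pub fn spawn_socket_thread(capacity: usize) -> SocketThread {
    let (outbound_tx, outbound_rx) = sync_channel::<Packet>(capacity);
    let (inbound_tx, inbound_rx) = sync_channel::<Packet>(capacity);

    let handle = thread::spawn(move || {
        // Stand-in for real socket IO: echo every outbound packet back.
        // `send` blocks while the inbound channel is full, which is how
        // backpressure propagates to the producer.
        for packet in outbound_rx {
            if inbound_tx.send(packet).is_err() {
                break; // The processing side has gone away.
            }
        }
    });

    SocketThread {
        outbound: outbound_tx,
        inbound: inbound_rx,
        handle,
    }
}
```

If the processing thread should not block, it can use `try_send`, which returns `TrySendError::Full` when the channel has reached its capacity. Dropping `outbound` ends the loop and lets the thread exit.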

Resolves: #8000
2025-04-14 06:18:06 +00:00
Jamil
4afcdf1c53 test(windows): Expect 80 Mbps on slow actions runners (#8621)
These are still failing a good portion of the time:


https://github.com/firezone/firezone/actions/runs/14226461996/job/39867070540?pr=8620
2025-04-02 22:22:20 +00:00
Thomas Eizinger
58fe527b0e feat(connlib): mirror ECN bits on TUN device (#8511)
From the perspective of any application, Firezone is a layer-3 network
and will thus use the host's networking stack to form IP packets for
whichever application protocol is in use (UDP, TCP, etc). These packets
then get encapsulated into UDP packets by Firezone and sent to a
Gateway.

As a result of this design, the IP header seen by the networking stacks
of the Client and the receiving service is not visible to any
intermediary along the network path between the Client and Gateway.

In case this network path is congested and middleboxes such as routers
need to drop packets, they will look at the ECN bits in the IP header
(of the UDP packet generated by a Client or Gateway) and flip a bit in
case the previous value indicated support for ECN (`0b01` or `0b10`).
When received by a network stack that supports ECN, seeing `0b11` means
that the network path is congested and that it must reduce its
send/receive windows (or otherwise throttle the connection).

At present, this doesn't work with Firezone because of the
aforementioned encapsulation of IP packets. To support ECN, we need to
therefore:

- Copy ECN bits from a received IP packet to the datagram that
encapsulates it: This ensures that if the Client's network stack supports
ECN, we mirror that support on the wire.
- Copy ECN bits from a received datagram to the IP packet that is sent to
the TUN device: This ensures that if the "Congestion Experienced" bit
gets set along the network path between Client and Gateway, we reflect
that accordingly on the IP packet emitted by the TUN device.
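
Both steps boil down to copying the two ECN bits from one header byte to another. The sketch below operates on the raw IPv4 TOS / IPv6 Traffic Class byte, where ECN occupies the two least-significant bits; `Ecn` and `copy_ecn` are hypothetical names, not `connlib`'s API.

```rust
/// ECN lives in the two least-significant bits of the TOS / Traffic Class byte.
const ECN_MASK: u8 = 0b11;

/// The four ECN codepoints.
#[repr(u8)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ecn {
    /// Sender does not support ECN.
    NonEct = 0b00,
    /// ECN-capable transport, codepoint 1.
    Ect1 = 0b01,
    /// ECN-capable transport, codepoint 0.
    Ect0 = 0b10,
    /// Congestion Experienced.
    Ce = 0b11,
}

impl Ecn {
    /// Extracts the ECN codepoint from a TOS / Traffic Class byte.
    pub fn from_tos(tos: u8) -> Self {
        match tos & ECN_MASK {
            0b00 => Ecn::NonEct,
            0b01 => Ecn::Ect1,
            0b10 => Ecn::Ect0,
            _ => Ecn::Ce,
        }
    }

    /// Writes this codepoint into `tos`, leaving the upper six bits untouched.
    pub fn apply_to_tos(self, tos: u8) -> u8 {
        (tos & !ECN_MASK) | self as u8
    }
}

/// Copies the ECN bits of `from_tos` into `to_tos`.
/// On send: `from` is the inner IP packet, `to` the encapsulating datagram.
/// On receive: `from` is the datagram, `to` the packet written to the TUN device.
pub fn copy_ecn(from_tos: u8, to_tos: u8) -> u8 {
    Ecn::from_tos(from_tos).apply_to_tos(to_tos)
}
```

For example, `copy_ecn(0b0000_0011, 0b1011_1000)` yields `0b1011_1011`: the "Congestion Experienced" mark is carried over while the other six bits of the target byte are preserved.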

Resolves: #3758

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil Bou Kheir <jamilbk@users.noreply.github.com>
2025-03-26 20:55:51 +00:00
Thomas Eizinger
84a2c275ca build(rust): upgrade to Rust 1.85 and Edition 2024 (#8240)
Updates our codebase to the 2024 Edition. For highlights on what
changes, see the following blogpost:
https://blog.rust-lang.org/2025/02/20/Rust-1.85.0.html
2025-03-19 02:58:55 +00:00
Thomas Eizinger
7af4b91ac5 fix(gui-client): call wintun::Session::shutdown on drop (#8464)
The bugfix we attempted in #8156 turned out to be wrong. Reading the
source-code, we have to call `Session::shutdown` in order to actually
cancel the `Session::receive_blocking` call. Not doing so means we run
into the timeout when discarding the `Tun` device because the
recv-thread is stuck in `Session::receive_blocking`.

Fixes: #8395
2025-03-17 12:58:03 +00:00
Thomas Eizinger
152939c7dd build(rust): bump Tauri dependencies (#8459)
Dependabot appears to have a hard time bumping the Tauri dependencies
together as a group. Additionally, our dependency linter `cargo deny`
disallows duplicate dependencies by default. To avoid introducing more
duplicate dependencies, we depend on the upstream `main` branch of two
projects that have already updated their dependencies but did not yet
cut a release.
2025-03-17 12:19:20 +00:00
Thomas Eizinger
2fe5c00c64 fix(windows): break from retry loop if we sent the packet (#8271)
Regression introduced in #8268.
2025-02-26 06:10:02 +00:00