firezone

mirror of https://github.com/outbackdingo/firezone.git synced 2026-03-20 06:41:25 +00:00

Author	SHA1	Message	Date
Thomas Eizinger	166b0d1573	feat(linux): compute device ID from `/etc/machine-id` (#10805 ) All of our Linux applications have a soft dependency on systemd. That is, in the default configuration, we expect systemd to be present on the machine. The only exceptions are the Docker containers for the Headless Client and Gateway. For the GUI client in particular, systemd is a hard dependency because we control DNS on the system via `systemd-resolved`. To secure the communication between the GUI client and its tunnel process, we automatically create a group called `firezone-client` to which the user gets added. All members of the group are allowed to access the unix socket that is used for IPC between the two processes. Membership in this group is also a prerequisite for accessing any of the configuration files. On the first launch of the GUI client on a Linux system, this presents a problem. For group membership changes to take effect, the user needs to reboot. We say that in the documentation, but it is unclear whether all users will read it thoroughly enough. To help the user, the GUI client checks whether the current user is a member of the group and alerts the user via a dialog box if that isn't the case. This would all be fine if it actually worked. Unfortunately, that check happens too late in the process. If we aren't a member of the group, we cannot read the device ID and bail early, thus never reaching the check and terminating the process without any dialog box or user-visible error. We could attempt to fix this by shuffling around some of the startup init code. That is a sub-optimal solution, however, because a) it may break again in the future and b) it would mean delaying the initialisation of telemetry until much later. Given that this is only a problem on Linux, a better solution is to simply not rely on the disk-based device ID at all.
Instead, we can integrate with systemd and deterministically derive a device ID from the unique machine ID and a randomly chosen "app ID". For backwards-compatibility reasons, the disk-based device ID is still prioritised. For all new installs however, we will use the one based on `/etc/machine-id`.	2025-11-10 02:29:52 +00:00
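The sketch below shows the lookup order described above. It assumes `/etc/machine-id` holds the usual 32 hexadecimal characters. `APP_ID`, `parse_machine_id`, `derive_device_id` and `resolve_device_id` are hypothetical names, not the client's actual code.

systemd offers `sd_id128_get_machine_app_specific()` for exactly this kind of derivation, and it is based on HMAC-SHA256 keyed by the machine ID. To stay dependency-free, this sketch swaps in FNV-1a instead. FNV-1a is not a cryptographic hash and must not be used where the machine ID needs to stay secret.

```rust
/// Hypothetical per-application ID; the real client uses its own randomly chosen value.
const APP_ID: [u8; 16] = [
    0x1b, 0x7e, 0x5c, 0x02, 0x9d, 0x44, 0x4f, 0x61, 0xa3, 0x0c, 0x88, 0x35, 0xe2, 0x17, 0x6a, 0xd9,
];

/// Parses the 32 hex characters found in `/etc/machine-id` into 16 bytes.
fn parse_machine_id(contents: &str) -> Option<[u8; 16]> {
    let s = contents.trim();
    if s.len() != 32 || !s.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    let mut out = [0u8; 16];
    for (i, byte) in out.iter_mut().enumerate() {
        *byte = u8::from_str_radix(&s[2 * i..2 * i + 2], 16).ok()?;
    }
    Some(out)
}

/// FNV-1a over a domain byte followed by `a` and `b`.
/// Stand-in for HMAC-SHA256 only; it is NOT a cryptographic hash.
fn fnv1a64(domain: u8, a: &[u8], b: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325;
    for byte in std::iter::once(domain).chain(a.iter().copied()).chain(b.iter().copied()) {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3);
    }
    hash
}

/// Deterministically derives a UUID-formatted device ID from the machine ID and an app ID.
fn derive_device_id(machine_id: &[u8; 16], app_id: &[u8; 16]) -> String {
    let mut bytes = [0u8; 16];
    bytes[..8].copy_from_slice(&fnv1a64(0, machine_id, app_id).to_be_bytes());
    bytes[8..].copy_from_slice(&fnv1a64(1, machine_id, app_id).to_be_bytes());
    bytes[6] = (bytes[6] & 0x0f) | 0x40; // mark as UUID version 4
    bytes[8] = (bytes[8] & 0x3f) | 0x80; // RFC 4122 variant
    let hex: String = bytes.iter().map(|b| format!("{b:02x}")).collect();
    format!("{}-{}-{}-{}-{}", &hex[..8], &hex[8..12], &hex[12..16], &hex[16..20], &hex[20..])
}

/// Prefers an existing on-disk ID (backwards compatibility); otherwise derives one.
fn resolve_device_id(on_disk: Option<String>, machine_id_file: &str) -> Option<String> {
    if let Some(id) = on_disk {
        return Some(id);
    }
    let machine_id = parse_machine_id(machine_id_file)?;
    Some(derive_device_id(&machine_id, &APP_ID))
}
```

Because the derivation is deterministic, reinstalling the client on the same machine yields the same ID. `/etc/machine-id` is typically world-readable, so computing the ID does not require membership in the `firezone-client` group.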
Thomas Eizinger	9016ffc9dc	build(rust): bump to Rust 1.91.0 (#10767 ) Rust 1.91 has been released and brings with it a few new lints that we need to tidy up. In addition, it also stabilizes `BTreeMap::extract_if`: A really nifty std-lib function that allows us to conditionally take elements from a map. We need that in a bunch of places.	2025-11-03 01:56:12 +00:00
Thomas Eizinger	3308e3c010	fix(linux): introduce tiered routing tables (#10742 ) With the fix that takes link-scoped routes into account in #10554, we introduced a bug: if a customer defines routes in Firezone that conflict with the link-scope ones, the link-scope routes currently take priority because they are usually more specific. To fix this, we introduce tiered routing tables controlled by a set of rules with different priorities. 1. In the first "Firezone" routing table, we add all CIDR/IP routes that users define in Firezone. 2. In the second "Firezone" routing table, we sync in all link-scope routes from the system. 3. In the third "Firezone" routing table, we only add the Internet Resource if it is active. By evaluating the routing tables in this order, we effectively always prioritize Firezone-controlled routes over local ones but still allow access to LAN resources when the Internet Resource is active. --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io> Co-authored-by: Jamil <jamilbk@users.noreply.github.com>	2025-10-30 06:53:55 +00:00
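The effect of the rule ordering can be modelled in a few lines. Each table is searched with longest-prefix match. Tables are consulted in rule-priority order, and a lookup that finds nothing falls through to the next table, as Linux policy routing does when a rule's table has no matching route. The `Route` type, the table contents and the device names are illustrative assumptions (IPv4 only).

```rust
use std::net::Ipv4Addr;

#[derive(Clone, Copy, Debug)]
struct Route {
    net: Ipv4Addr,
    prefix: u8,
    dev: &'static str,
}

impl Route {
    fn contains(&self, ip: Ipv4Addr) -> bool {
        // A /0 route matches everything; avoid shifting a u32 by 32.
        let mask = if self.prefix == 0 { 0 } else { u32::MAX << (32 - u32::from(self.prefix)) };
        (u32::from(ip) & mask) == (u32::from(self.net) & mask)
    }
}

/// Longest-prefix match within a single routing table.
fn lookup(table: &[Route], ip: Ipv4Addr) -> Option<Route> {
    table.iter().filter(|r| r.contains(ip)).max_by_key(|r| r.prefix).copied()
}

/// Consults tables in rule-priority order; the first table with any match wins,
/// regardless of how specific routes in later tables are.
fn select_device(tables: &[&[Route]], ip: Ipv4Addr) -> Option<&'static str> {
    tables.iter().find_map(|table| lookup(table, ip)).map(|r| r.dev)
}
```

Putting a Firezone route (`172.20.0.0/16`) in the first tier and a conflicting, more specific LAN route (`172.20.5.0/24`) in the second tier sends `172.20.5.7` into the tunnel. A single merged table would instead pick the LAN route, which is the bug described above.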
Thomas Eizinger	21a848a4cb	chore(connlib): tune INFO logs (#10677 ) The INFO logs of Firezone (specifically `connlib`) should be a good balance between useful and not noisy. Several of the INFO logs we currently have are probably a bit too noisy and can be tuned down or optimised to be easier to read. Before: ``` 2025-10-22T01:48:38.836Z INFO firezone_headless_client: arch="x86_64" version="1.5.5" 2025-10-22T01:48:38.840Z INFO socket_factory: Set UDP socket buffer sizes requested_send_buffer_size=16777216 send_buffer_size=425984 requested_recv_buffer_size=134217728 recv_buffer_size=425984 port=52625 2025-10-22T01:48:38.841Z INFO socket_factory: Set UDP socket buffer sizes requested_send_buffer_size=16777216 send_buffer_size=425984 requested_recv_buffer_size=134217728 recv_buffer_size=425984 port=52625 2025-10-22T01:48:38.851Z INFO firezone_tunnel::device_channel: Initializing TUN device name=tun-firezone 2025-10-22T01:48:38.852Z INFO firezone_tunnel::client: Resetting network state (network changed) 2025-10-22T01:48:38.853Z INFO socket_factory: Set UDP socket buffer sizes requested_send_buffer_size=16777216 send_buffer_size=425984 requested_recv_buffer_size=134217728 recv_buffer_size=425984 port=52625 2025-10-22T01:48:38.854Z INFO socket_factory: Set UDP socket buffer sizes requested_send_buffer_size=16777216 send_buffer_size=425984 requested_recv_buffer_size=134217728 recv_buffer_size=425984 port=52625 2025-10-22T01:48:39.263Z INFO phoenix_channel: Connected to portal host=api 2025-10-22T01:48:39.408Z INFO firezone_tunnel::client: Updating TUN device config=TunConfig { ip: IpConfig { v4: 100.90.205.158, v6: fd00:2021:1111::2:76b2 }, dns_by_sentinel: {}, search_domain: Some(Name(httpbin.search.test.)), ipv4_routes: [100.64.0.0/11, 100.96.0.0/11, 100.100.111.0/24], ipv6_routes: [fd00:2021:1111::/107, fd00:2021:1111:8000::/107, fd00:2021:1111:8000:100:100:111:0/120] } 2025-10-22T01:48:39.408Z INFO firezone_tunnel::client: Updating TUN device config=TunConfig { ip:
IpConfig { v4: 100.90.205.158, v6: fd00:2021:1111::2:76b2 }, dns_by_sentinel: {100.100.111.1 <> 127.0.0.11:53}, search_domain: Some(Name(httpbin.search.test.)), ipv4_routes: [100.64.0.0/11, 100.96.0.0/11, 100.100.111.0/24], ipv6_routes: [fd00:2021:1111::/107, fd00:2021:1111:8000::/107, fd00:2021:1111:8000:100:100:111:0/120] } 2025-10-22T01:48:39.408Z INFO firezone_tunnel::client: Activating resource name=foobar.com address=foobar.com sites=mycro-aws-gws 2025-10-22T01:48:39.409Z INFO firezone_tunnel::client: Activating resource name=.firezone.dev address=.firezone.dev sites=mycro-aws-gws 2025-10-22T01:48:39.409Z INFO firezone_tunnel::client: Activating resource name=ip6only address=ip6only.me sites=mycro-aws-gws 2025-10-22T01:48:39.409Z INFO firezone_tunnel::client: Activating resource name=example.com address=example.com sites=mycro-aws-gws 2025-10-22T01:48:39.409Z INFO firezone_tunnel::client: Activating resource name=Example address=.example.com sites=mycro-aws-gws 2025-10-22T01:48:39.409Z INFO firezone_tunnel::client: Activating resource name=.httpbin address=.httpbin sites=mycro-aws-gws 2025-10-22T01:48:39.409Z INFO firezone_tunnel::client: Activating resource name=MyCorp Network (IPv6) address=172:20::/64 sites=mycro-aws-gws 2025-10-22T01:48:39.409Z INFO firezone_tunnel::client: Updating TUN device config=TunConfig { ip: IpConfig { v4: 100.90.205.158, v6: fd00:2021:1111::2:76b2 }, dns_by_sentinel: {100.100.111.1 <> 127.0.0.11:53}, search_domain: Some(Name(httpbin.search.test.)), ipv4_routes: [100.64.0.0/11, 100.96.0.0/11, 100.100.111.0/24], ipv6_routes: [172:20::/64, fd00:2021:1111::/107, fd00:2021:1111:8000::/107, fd00:2021:1111:8000:100:100:111:0/120] } 2025-10-22T01:48:39.409Z INFO firezone_tunnel::client: Activating resource name=.httpbin.search.test address=.httpbin.search.test sites=mycro-aws-gws 2025-10-22T01:48:39.409Z INFO firezone_tunnel::client: Activating resource name=.firez.one address=.firez.one sites=mycro-aws-gws 2025-10-22T01:48:39.409Z INFO 
firezone_tunnel::client: Activating resource name=MyCorp Network address=172.20.0.0/16 sites=mycro-aws-gws 2025-10-22T01:48:39.409Z INFO firezone_tunnel::client: Updating TUN device config=TunConfig { ip: IpConfig { v4: 100.90.205.158, v6: fd00:2021:1111::2:76b2 }, dns_by_sentinel: {100.100.111.1 <> 127.0.0.11:53}, search_domain: Some(Name(httpbin.search.test.)), ipv4_routes: [100.64.0.0/11, 100.96.0.0/11, 100.100.111.0/24, 172.20.0.0/16], ipv6_routes: [172:20::/64, fd00:2021:1111::/107, fd00:2021:1111:8000::/107, fd00:2021:1111:8000:100:100:111:0/120] } 2025-10-22T01:48:39.418Z INFO firezone_bin_shared::tun_device_manager::linux: Setting new routes new_routes={V4(Ipv4Network { network_address: 100.64.0.0, netmask: 11 }), V4(Ipv4Network { network_address: 172.20.0.0, netmask: 16 }), V6(Ipv6Network { network_address: 172:20::, netmask: 64 }), V4(Ipv4Network { network_address: 100.96.0.0, netmask: 11 }), V6(Ipv6Network { network_address: fd00:2021:1111::, netmask: 107 }), V6(Ipv6Network { network_address: fd00:2021:1111:8000::, netmask: 107 }), V6(Ipv6Network { network_address: fd00:2021:1111:8000:100:100:111:0, netmask: 120 }), V4(Ipv4Network { network_address: 100.100.111.0, netmask: 24 })} 2025-10-22T01:48:39.420Z INFO firezone_headless_client: Tunnel ready elapsed=583.523468ms 2025-10-22T01:48:39.430Z INFO snownet::node: Added new TURN server rid=2a413094-32d4-4a69-8e92-642d60e885e9 address=Dual { v4: 203.0.113.102:3478, v6: [203:0:113::102]:3478 } 2025-10-22T01:49:44.814Z INFO snownet::node: Creating new connection local=IceCreds { ufrag: "bly5", pass: "bdjtlfpvfdhhya6om4kssi" } remote=IceCreds { ufrag: "24gy", pass: "5mqlci4n4nmoovovihswvq" } index=(2378720|0) cid=ea82a87c-ca11-4292-a332-940ac386cba1 2025-10-22T01:49:45.634Z INFO snownet::node: Updating remote socket new=PeerToPeer { source: 172.30.0.100:52625, dest: 203.0.113.3:52625 } duration_since_intent=821.149802ms cid=ea82a87c-ca11-4292-a332-940ac386cba1 2025-10-22T01:49:45.783Z INFO snownet::node:
Updating remote socket old=PeerToPeer { source: 172.30.0.100:52625, dest: 203.0.113.3:52625 } new=PeerToPeer { source: [172:30::100]:52625, dest: [203:0:113::3]:52625 } duration_since_intent=971.112388ms cid=ea82a87c-ca11-4292-a332-940ac386cba1 ``` After: ``` 2025-10-22T01:58:09.972Z INFO firezone_headless_client: arch="x86_64" version="1.5.5" 2025-10-22T01:58:09.980Z INFO firezone_tunnel::client: Resetting network state (network changed) 2025-10-22T01:58:10.271Z INFO phoenix_channel: Connected to portal host=api 2025-10-22T01:58:10.369Z INFO firezone_tunnel::client: Activating resource name=foobar.com address=foobar.com sites=mycro-aws-gws 2025-10-22T01:58:10.369Z INFO firezone_tunnel::client: Activating resource name=.firezone.dev address=.firezone.dev sites=mycro-aws-gws 2025-10-22T01:58:10.369Z INFO firezone_tunnel::client: Activating resource name=ip6only address=ip6only.me sites=mycro-aws-gws 2025-10-22T01:58:10.369Z INFO firezone_tunnel::client: Activating resource name=example.com address=example.com sites=mycro-aws-gws 2025-10-22T01:58:10.369Z INFO firezone_tunnel::client: Activating resource name=Example address=.example.com sites=mycro-aws-gws 2025-10-22T01:58:10.369Z INFO firezone_tunnel::client: Activating resource name=.httpbin address=.httpbin sites=mycro-aws-gws 2025-10-22T01:58:10.370Z INFO firezone_tunnel::client: Activating resource name=MyCorp Network (IPv6) address=172:20::/64 sites=mycro-aws-gws 2025-10-22T01:58:10.370Z INFO firezone_tunnel::client: Activating resource name=.httpbin.search.test address=.httpbin.search.test sites=mycro-aws-gws 2025-10-22T01:58:10.370Z INFO firezone_tunnel::client: Activating resource name=.firez.one address=.firez.one sites=mycro-aws-gws 2025-10-22T01:58:10.370Z INFO firezone_tunnel::client: Activating resource name=MyCorp Network address=172.20.0.0/16 sites=mycro-aws-gws 2025-10-22T01:58:10.370Z INFO snownet::node: Added new TURN server rid=2a413094-32d4-4a69-8e92-642d60e885e9 address=Dual { v4: 
203.0.113.102:3478, v6: [203:0:113::102]:3478 } 2025-10-22T01:58:10.370Z INFO snownet::node: Added new TURN server rid=54f6ba35-1914-48fc-be24-62f6293936eb address=Dual { v4: 203.0.113.101:3478, v6: [203:0:113::101]:3478 } 2025-10-22T01:58:10.370Z INFO firezone_tunnel::client: Updating TUN device config=TunConfig { ip: IpConfig { v4: 100.90.205.158, v6: fd00:2021:1111::2:76b2 }, dns_by_sentinel: {100.100.111.1 <> 127.0.0.11:53}, search_domain: Some(Name(httpbin.search.test.)), ipv4_routes: [100.64.0.0/11, 100.96.0.0/11, 100.100.111.0/24, 172.20.0.0/16], ipv6_routes: [172:20::/64, fd00:2021:1111::/107, fd00:2021:1111:8000::/107, fd00:2021:1111:8000:100:100:111:0/120] } 2025-10-22T01:58:10.383Z INFO firezone_bin_shared::tun_device_manager::linux: Setting new routes new_routes=[100.64.0.0/11, 100.96.0.0/11, 100.100.111.0/24, 172.20.0.0/16, 172:20::/64, fd00:2021:1111::/107, fd00:2021:1111:8000::/107, fd00:2021:1111:8000:100:100:111:0/120] 2025-10-22T01:58:10.495Z INFO snownet::allocation: Invalidating allocation active_socket=Some(203.0.113.101:3478) 2025-10-22T01:58:10.495Z INFO snownet::allocation: Invalidating allocation active_socket=Some(203.0.113.102:3478) 2025-10-22T02:03:04.410Z INFO snownet::node: Creating new connection local=IceCreds { ufrag: "uxgc", pass: "xxdgp5ivfhqloedzdmgi3j" } remote=IceCreds { ufrag: "es6w", pass: "doa2s3hmiteid7dtlszsbq" } index=(583098|0) cid=ea82a87c-ca11-4292-a332-940ac386cba1 2025-10-22T02:03:04.960Z INFO snownet::node: Updating remote socket new=PeerToPeer { source: 172.30.0.100:52625, dest: 203.0.113.3:52625 } duration_since_intent=550.756408ms cid=ea82a87c-ca11-4292-a332-940ac386cba1 2025-10-22T02:03:05.112Z INFO snownet::node: Updating remote socket old=PeerToPeer { source: 172.30.0.100:52625, dest: 203.0.113.3:52625 } new=PeerToPeer { source: [172:30::100]:52625, dest: [203:0:113::3]:52625 } duration_since_intent=702.23775ms cid=ea82a87c-ca11-4292-a332-940ac386cba1 ```	2025-10-22 23:47:55 +00:00
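The shorter "Setting new routes" line in the After log of this commit can be produced by logging a sorted list of plain `address/prefix` strings instead of the `Debug` output of the network types. This is a minimal sketch with a hypothetical `Cidr` type.

Collecting into a `BTreeSet` gives the IPv4-before-IPv6 order seen in that log. This works because the standard library's `IpAddr` orders its `V4` variant before `V6`.

```rust
use std::collections::BTreeSet;
use std::fmt;
use std::net::IpAddr;

/// Hypothetical route type: an address plus a prefix length.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Cidr(IpAddr, u8);

impl fmt::Display for Cidr {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}/{}", self.0, self.1)
    }
}

/// Renders routes as `[a/n, b/m, ...]` in sorted order for a single compact log line.
fn display_routes(routes: &BTreeSet<Cidr>) -> String {
    let items: Vec<String> = routes.iter().map(Cidr::to_string).collect();
    format!("[{}]", items.join(", "))
}
```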
Thomas Eizinger	d35cf445d4	fix(linux): don't sync link-scope routes of offline interfaces (#10583 ) In #10554, we added a syncing mechanism that copies all link-scoped routes of the `main` routing table over to the Firezone routing table. Routes for interfaces that are currently offline cannot be added and cause a netlink error of "Invalid argument". To prevent unnecessary warnings from being logged to Sentry, we retrieve the link state of each interface and skip routes for interfaces that are not online.	2025-10-16 05:34:10 +00:00
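This is a sketch of the filtering step. It assumes the routes and each interface's link state have already been read via netlink. `LinkRoute` and the `online` map are hypothetical stand-ins for the netlink messages, and interfaces missing from the map are treated as offline.

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
struct LinkRoute {
    destination: &'static str,
    ifindex: u32,
}

/// Keeps only routes whose outgoing interface is known to be online, because
/// adding a route for an offline interface fails with "Invalid argument".
fn routes_to_sync(routes: &[LinkRoute], online: &HashMap<u32, bool>) -> Vec<LinkRoute> {
    routes
        .iter()
        .filter(|route| online.get(&route.ifindex).copied().unwrap_or(false))
        .cloned()
        .collect()
}
```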
Thomas Eizinger	eb75cef467	fix(linux): allow LAN access when Internet Resource is on (#10554 ) ## Context On Linux, we create a dedicated routing table for all routes of the Firezone TUN device, including the `0.0.0.0/0` route. At a minimum, this routing table contains the following if the Internet Resource is active: ``` > ip route show table 539098368 default dev tun-firezone proto static 100.64.0.0/11 dev tun-firezone proto static 100.96.0.0/11 dev tun-firezone proto static 100.100.111.0/24 dev tun-firezone proto static ``` In addition, we also create a routing rule that bypasses this routing table for all packets that are tagged with the `0xfd002021` mark: ``` > ip rule list 0: from all lookup local 32765: not from all fwmark 0xfd002021 lookup 539098368 32766: from all lookup main 32767: from all lookup default ``` Firezone's internal UDP and TCP sockets are tagged with this mark and thus prevent routing loops where our own packets would otherwise get redirected back into the tunnel. Without the Internet Resource active, the rule `from all lookup main` matches local LAN traffic and correctly routes it out via the corresponding interface. For example, on my computer, the Linux kernel created the following route with the `link` scope in the main table: ``` 192.168.188.0/24 dev wlp192s0 proto kernel scope link src 192.168.188.112 metric 600 ``` ## The problem With the Internet Resource active, there is a problem. The default route matches ALL destinations, including local LAN destinations, which should actually be sent out via a different interface. As a result, local LAN traffic is broken on Linux as soon as the Internet Resource is active. Instead of being sent out via the local interface, these packets get sent to `tun-firezone` where they get forwarded to the Gateway and then dropped because their source IP is not a Firezone Client IP. ## Solution Fixing this is unfortunately non-trivial.
The best I could come up with is to create a copy of all link-scoped routes in the Firezone routing table and keep those in sync with all route changes that happen. For example, when we roam, the link-scoped routes obviously change because we join a new subnet. We therefore listen to change events from netlink and create a debounced task that reads the current link-scoped routes from the main routing table, compares them to those in the Firezone table and adds any routes that are not present. We don't need to worry about removing routes, as link-scoped routes automatically disappear once the corresponding interface goes away. --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-10-14 20:36:58 +00:00
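The sync loop described above can be sketched with the standard library alone, in three steps:

1. Block until a change event arrives.
2. Keep draining events until the channel has been quiet for a short period.
3. Compute which link-scope routes from `main` are missing from the Firezone table.

The channel, the `quiet` period and the string representation of routes are assumptions for illustration. Per the commit message, the real implementation reacts to netlink events via a debounced task.

```rust
use std::collections::BTreeSet;
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Waits for at least one event, then coalesces further events until none
/// arrives for `quiet`. Returns how many events were coalesced, or `None`
/// once all senders are gone and no event is pending.
fn next_debounced<T>(events: &Receiver<T>, quiet: Duration) -> Option<usize> {
    events.recv().ok()?;
    let mut coalesced = 1;
    loop {
        match events.recv_timeout(quiet) {
            Ok(_) => coalesced += 1,
            Err(RecvTimeoutError::Timeout | RecvTimeoutError::Disconnected) => return Some(coalesced),
        }
    }
}

/// Link-scope routes present in `main` but not yet in the Firezone table.
/// Nothing is removed: link-scope routes vanish together with their interface.
fn routes_to_add(main: &BTreeSet<String>, firezone: &BTreeSet<String>) -> Vec<String> {
    main.difference(firezone).cloned().collect()
}
```

A burst of five change events thus results in a single re-sync rather than five.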
Thomas Eizinger	e84bdc5566	refactor(connlib): periodically record queue depths (#10242 ) Instead of recording the queue depths on every event-loop tick, we now record them once a second by setting a Gauge. Not only is that a simpler instrument to work with, it is also significantly more performant. The previous approach takes up quite a bit of CPU time when metrics are enabled. Resolves: #10237	2025-09-02 02:57:36 +00:00
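One way to express "record once a second instead of on every tick" is to reduce the per-tick cost to a timestamp comparison. This sketch uses a hypothetical `QueueDepthSampler`. The `gauge` closure stands in for whatever metrics instrument the application actually uses.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper that forwards a queue depth to a gauge at most once per `interval`.
struct QueueDepthSampler {
    interval: Duration,
    last_recorded: Option<Instant>,
}

impl QueueDepthSampler {
    fn new(interval: Duration) -> Self {
        Self { interval, last_recorded: None }
    }

    /// Called on every event-loop tick; only records once the interval has elapsed.
    fn maybe_record(&mut self, now: Instant, depth: usize, gauge: &mut impl FnMut(usize)) {
        let due = match self.last_recorded {
            None => true,
            Some(last) => now.duration_since(last) >= self.interval,
        };
        if due {
            gauge(depth);
            self.last_recorded = Some(now);
        }
    }
}
```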
Thomas Eizinger	a109c1a2ef	feat(connlib): discard intermediate resource and TUN updates (#10223 ) Right now, the Client event-loops use a channel with a capacity of 1000 items for sending new resource lists and TUN device updates to the host app. This is unnecessary, as we only ever care about the latest version of these. Intermediate updates that the host app doesn't process are effectively irrelevant. We've had an issue before where a bug in the portal caused us to receive many resource updates, which ended up crashing Client apps because this channel filled up. To be more resilient on this front, we refactor the Client event loop to use a `watch` channel for this. Watch channels only retain the last value that got sent into them.	2025-08-21 05:42:54 +00:00
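The property that matters here is that a slow receiver only ever sees the most recent value. The commit uses a `watch` channel, as provided by Tokio. The following std-only sketch illustrates the same "latest value wins" semantics without an async runtime, using a hypothetical `Latest<T>` type. It is not Tokio's API.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical single-slot channel: each `send` overwrites the previous value.
#[derive(Clone)]
struct Latest<T> {
    /// (version, latest value); the version lets receivers detect changes.
    slot: Arc<Mutex<(u64, Option<T>)>>,
}

impl<T: Clone> Latest<T> {
    fn new() -> Self {
        Self { slot: Arc::new(Mutex::new((0, None))) }
    }

    /// Replaces any value the receiver has not looked at yet; never fills up.
    fn send(&self, value: T) {
        let mut slot = self.slot.lock().unwrap();
        slot.0 += 1;
        slot.1 = Some(value);
    }

    /// Returns the latest value if it changed since `seen_version`, updating the version.
    fn changed(&self, seen_version: &mut u64) -> Option<T> {
        let slot = self.slot.lock().unwrap();
        if slot.0 == *seen_version {
            return None;
        }
        *seen_version = slot.0;
        slot.1.clone()
    }
}
```

Unlike a bounded queue of 1000 items, a burst of updates cannot fill this slot; intermediate values are simply overwritten.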
Thomas Eizinger	4e11112d9b	feat(connlib): improve throughput on higher latencies (#10231 ) It turns out that multi-threaded access to the TUN device on the Gateway causes packet reordering, which makes the TCP congestion controller throttle the connection. Additionally, the default TX queue length of a TUN device on Linux is only 500 packets. With just a single thread and an increased TX queue length, we get a throughput of just over 1 Gbit/s for a 20ms link between Client and Gateway with basically no packet drops: ``` Connecting to host 172.20.0.110, port 5201 [ 5] local 100.79.130.70 port 49546 connected to 172.20.0.110 port 5201 [ ID] Interval Transfer Bitrate Retr Cwnd [ 5] 0.00-1.00 sec 116 MBytes 977 Mbits/sec 0 6.40 MBytes [ 5] 1.00-2.00 sec 137 MBytes 1.15 Gbits/sec 0 6.40 MBytes [ 5] 2.00-3.00 sec 134 MBytes 1.13 Gbits/sec 0 6.40 MBytes [ 5] 3.00-4.00 sec 136 MBytes 1.14 Gbits/sec 47 6.40 MBytes [ 5] 4.00-5.00 sec 137 MBytes 1.15 Gbits/sec 0 6.40 MBytes [ 5] 5.00-6.00 sec 138 MBytes 1.16 Gbits/sec 0 6.40 MBytes [ 5] 6.00-7.00 sec 138 MBytes 1.15 Gbits/sec 0 6.40 MBytes [ 5] 7.00-8.00 sec 138 MBytes 1.15 Gbits/sec 0 6.40 MBytes [ 5] 8.00-9.00 sec 138 MBytes 1.16 Gbits/sec 0 6.40 MBytes [ 5] 9.00-10.00 sec 138 MBytes 1.15 Gbits/sec 0 6.40 MBytes [ 5] 10.00-11.00 sec 139 MBytes 1.17 Gbits/sec 0 6.40 MBytes [ 5] 11.00-12.00 sec 139 MBytes 1.17 Gbits/sec 0 6.40 MBytes [ 5] 12.00-13.00 sec 136 MBytes 1.14 Gbits/sec 0 6.40 MBytes [ 5] 13.00-14.00 sec 139 MBytes 1.17 Gbits/sec 0 6.40 MBytes [ 5] 14.00-15.00 sec 140 MBytes 1.17 Gbits/sec 0 6.40 MBytes [ 5] 15.00-16.00 sec 138 MBytes 1.16 Gbits/sec 0 6.40 MBytes [ 5] 16.00-17.00 sec 137 MBytes 1.15 Gbits/sec 0 6.40 MBytes [ 5] 17.00-18.00 sec 139 MBytes 1.17 Gbits/sec 0 6.40 MBytes [ 5] 18.00-19.00 sec 138 MBytes 1.16 Gbits/sec 0 6.40 MBytes [ 5] 19.00-20.00 sec 136 MBytes 1.14 Gbits/sec 0 6.40 MBytes - - - - - - - - - - - - - - - - - - - - - - - - - [ ID] Interval Transfer Bitrate Retr [ 5] 0.00-20.00 sec
2.67 GBytes 1.15 Gbits/sec 47 sender [ 5] 0.00-20.02 sec 2.67 GBytes 1.15 Gbits/sec receiver iperf Done. ``` For further debugging in the future, we now record the send and receive queue depths of both the TUN device and the UDP sockets. None of these queues filled up in my testing, which leads me to conclude that no buffer inside Firezone is too small here. Related: #7452 --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io>	2025-08-20 23:08:56 +00:00
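Below is a toy model of why handing packets to several worker threads can reorder them. Packets are assigned round-robin, and each worker finishes its packets at its own pace, so the output order follows completion time rather than sequence number. The per-worker costs are made-up numbers, and real TUN writes are far less regular. TCP receivers answer out-of-order segments with duplicate ACKs, which senders may interpret as loss.

```rust
/// Simulates `costs.len()` workers (must be non-empty). Packet `seq` goes to
/// worker `seq % costs.len()`, and each worker needs `costs[w]` time units per
/// packet. Returns the sequence numbers in the order they leave the workers.
fn delivery_order(packets: u32, costs: &[u64]) -> Vec<u32> {
    let workers = costs.len() as u32;
    let mut finished: Vec<(u64, u32)> = (0..packets)
        .map(|seq| {
            let worker = (seq % workers) as usize;
            let nth = u64::from(seq / workers) + 1; // 1-based position in this worker's queue
            (nth * costs[worker], seq)
        })
        .collect();
    finished.sort(); // by completion time, ties broken by sequence number
    finished.into_iter().map(|(_, seq)| seq).collect()
}

/// Counts packets that arrive after a higher sequence number was already delivered.
fn reordered(order: &[u32]) -> usize {
    let mut highest: Option<u32> = None;
    let mut count = 0;
    for &seq in order {
        match highest {
            Some(h) if seq < h => count += 1,
            _ => highest = Some(seq),
        }
    }
    count
}
```

With a single worker, the output is always in order. With two workers of unequal speed (costs `[1, 3]`), eight packets leave as `0, 2, 1, 4, 6, 3, 5, 7`, and three of them arrive out of order.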
Thomas Eizinger	301d2137e5	refactor(windows): share src IP cache across UDP sockets (#9976 ) When looking through customer logs, we see a lot of "Resolved best route outside of tunnel" messages. Those get logged every time we need to rerun our re-implementation of Windows' weighting algorithm that decides which source interface / IP a packet should be sent from. Currently, this gets cached in every socket instance, so for the peer-to-peer socket, this is only computed once per destination IP. However, for DNS queries, we make a new socket for every query. Using a new source port for each DNS query is recommended to avoid fingerprinting. Using a new socket also means that we need to re-run this algorithm every time we make a DNS query, which is why we see this log so often. To fix this, we need to share this cache across all UDP sockets. Cache invalidation is one of the hardest problems in computer science and this instance is no different. This cache needs to be reset every time we roam, as that changes the weighting of which source interface to use. To achieve this, we extend the `SocketFactory` trait with a `reset` method. This method is called whenever we roam and can then reset a shared cache inside the `UdpSocketFactory`. The "source IP resolver" function that is passed to the UDP socket now simply accesses this shared cache and inserts a new entry when it needs to resolve the IP. As an added benefit, this may speed up DNS queries on Windows a bit (although I haven't benchmarked it). It should certainly drastically reduce the number of syscalls we make on Windows.	2025-07-24 01:36:53 +00:00
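This is a simplified sketch of the shared cache. Every socket created by the factory gets a resolver closure that consults one `HashMap` shared behind an `Arc<Mutex<HashMap<IpAddr, IpAddr>>>`, and `reset` clears that map after roaming.

`SocketFactory`, `reset` and `UdpSocketFactory` are names taken from the commit message. The trait shape, `source_ip_resolver` and the injected `resolve_best_route` function are assumptions, not connlib's actual API.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::sync::{Arc, Mutex};

type SrcIpCache = Arc<Mutex<HashMap<IpAddr, IpAddr>>>;

trait SocketFactory {
    /// Called whenever we roam, since routing weights may have changed.
    fn reset(&self);
}

struct UdpSocketFactory {
    cache: SrcIpCache,
    /// Stand-in for the expensive re-implementation of Windows' route weighting.
    resolve_best_route: fn(IpAddr) -> IpAddr,
}

impl UdpSocketFactory {
    /// Returns the resolver a new socket would use; all sockets share `cache`.
    fn source_ip_resolver(&self) -> impl Fn(IpAddr) -> IpAddr {
        let cache = Arc::clone(&self.cache);
        let resolve = self.resolve_best_route;
        move |dst| *cache.lock().unwrap().entry(dst).or_insert_with(|| resolve(dst))
    }
}

impl SocketFactory for UdpSocketFactory {
    fn reset(&self) {
        self.cache.lock().unwrap().clear();
    }
}
```

Two short-lived DNS sockets resolving the same destination now trigger only one expensive lookup until the next `reset`.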
Thomas Eizinger	eb4c54620c	chore(linux): add more error context to TUN device (#9853 ) When creating the TUN device fails, the error messages are currently pretty bare. Add a bit more context so users can more easily diagnose what is wrong themselves.	2025-07-13 05:51:02 +00:00
Thomas Eizinger	d6805d7e48	chore(rust): bump to Rust 1.88 (#9714 ) Rust 1.88 has been released and brings with it quite an exciting feature: let-chains! They allow us to mix and match `if` and `let` expressions, often reducing the "right-drift" of the relevant code and making it easier to read. Rust 1.88 also comes with a new clippy lint that warns when creating a mutable reference from an immutable pointer. Attempting to fix this revealed that this is exactly what we are doing in the eBPF kernel. Unfortunately, it doesn't seem to be possible to design this in a way that is accepted both by the borrow-checker AND by the eBPF verifier. Hence, we simply make the function `unsafe` and document for the programmer what needs to be upheld.	2025-07-12 06:42:50 +00:00
Thomas Eizinger	17a1d36eae	fix(gui-client): set IO error type for missing non-tunnel routes (#9777 ) On Windows - in order to prevent routing loops - we resolve the best "non-tunnel" route to a particular host for each IP address. The resulting source IP is then used as the source for packets leaving our interface. If the system doesn't have IPv6 connectivity or there are simply no routes available, we fail this "source IP resolver" with an IO error. Presently, this uses the "other" IO error type, which causes this to be logged at WARN level in the event-loop. The IO error types `HostUnreachable` and `NetworkUnreachable` are expected during normal operation of Firezone and are therefore only logged at DEBUG level. By changing this IO error type, we fix the WARN log spam on Windows for machines without IPv6 connectivity.	2025-07-03 21:45:06 +00:00
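This is a sketch of the severity mapping described above. `io::ErrorKind::HostUnreachable` and `io::ErrorKind::NetworkUnreachable` are standard-library error kinds available on recent stable Rust. The `Level` enum, `log_level_for` and `no_route_error` are hypothetical. The commit does not say which of the two kinds the resolver now returns, so `NetworkUnreachable` is an assumption here.

```rust
use std::io;
use std::net::IpAddr;

#[derive(Debug, PartialEq)]
enum Level {
    Debug,
    Warn,
}

/// Unreachable hosts/networks are expected (e.g. no IPv6 connectivity), so they
/// are logged quietly; anything else is surfaced as a warning.
fn log_level_for(err: &io::Error) -> Level {
    match err.kind() {
        io::ErrorKind::HostUnreachable | io::ErrorKind::NetworkUnreachable => Level::Debug,
        _ => Level::Warn,
    }
}

/// A missing non-tunnel route reported with a specific kind instead of `Other`.
fn no_route_error(dst: IpAddr) -> io::Error {
    io::Error::new(io::ErrorKind::NetworkUnreachable, format!("No non-tunnel route to {dst}"))
}
```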
Thomas Eizinger	899f5ea5e8	fix(gui-client): ensure GUI client can access `firezone-id.json` (#9764 ) I believe some of the recent changes to how we load `firezone-id.json` from the GUI client surfaced that we in fact don't always have access to it. Previously, this was silenced because we would only optionally add it as context to the Sentry client. Now, we need it to initialise telemetry so we know whether or not to send logs to Sentry. To be able to access the file, we need to change the config directory and the file so that they are owned by the `firezone-client` group.	2025-07-01 14:11:29 +00:00
Thomas Eizinger	daf05b8c79	fix(windows): ignore network changes from irrelevant networks (#9696 ) In order to detect network changes on Windows, we implement the `INetworkEvents` callback interface. This callback notifies us every time the connectivity of a certain network changes. Performing a network reset in connlib on every one of these changes hurts the user experience while Firezone is starting up, because it takes a while for the network to settle. Firezone itself makes changes to the network, so several of these change events happen _because_ Firezone is starting. The documentation from Microsoft on the possible values of the `NameType` attribute is pretty thin, but I did manage to find the following values on the Internet: - `6`: Wired network - `71`: Wireless network - `243`: Broadband network We assume that the user is connected to the Internet through one of these, so we ignore network changes on all other networks. An alternative approach to reducing the number of false-positive change events would be to react to a narrower list of change events. I discarded this approach because it wasn't clear to me which of the event types [0] would matter to us and when Windows emits them. I think that in order to effectively react to those, we'd have to do more fine-grained tracking of which state a network is in and, e.g., only trigger a reset if we move from "Disconnected" to "Subnet connectivity". Windows also differentiates between local, subnet and Internet connectivity, yet in my testing, I've never observed the "Internet" connectivity being emitted. Hence, it is deemed more robust to just filter out networks based on their type. Firezone itself is of type 53 and is therefore automatically filtered out as well. The risk here is that we don't react to connectivity changes of a network that a customer is relying on. Unfortunately, I don't think there is a better way to find this out other than shipping this change and waiting for reports.
[0]: https://learn.microsoft.com/en-us/windows/win32/api/netlistmgr/ne-netlistmgr-nlm_connectivity#constants --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io>	2025-06-30 08:52:00 +00:00
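The type-based filter amounts to an allow-list. This sketch uses the `NameType` values quoted above. `RELEVANT_TYPES`, `NetworkChange` and `should_reset` are hypothetical names, and the sketch leaves out how the value is obtained from Windows.

```rust
/// `NameType` values (as quoted in the commit) whose changes should trigger a reset.
const RELEVANT_TYPES: [u32; 3] = [6, 71, 243]; // wired, wireless, broadband

/// Hypothetical shape of a network-change notification.
struct NetworkChange {
    name_type: u32,
}

/// Returns true if connlib should reset its network state for this change.
/// Firezone's own adapter (type 53) is not on the list and is therefore ignored.
fn should_reset(change: &NetworkChange) -> bool {
    RELEVANT_TYPES.contains(&change.name_type)
}
```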
Thomas Eizinger	a91dda139f	feat(connlib): only conditionally hash firezone ID (#9633 ) A bit of legacy that we have inherited around our Firezone ID is that the ID stored on the user's device is SHA-hashed before being passed to the portal as the "external ID". This makes it difficult to correlate IDs in Sentry and PostHog with the data we have in the portal. For Sentry and PostHog, we submit the raw UUID stored on the user's device. As a first step in overcoming this, we embed an "external ID" in those services as well IF the provided Firezone ID is a valid UUID. This will allow us to immediately correlate those events. As a second step, we automatically generate all new Firezone IDs for the Windows and Linux Client as `hex(sha256(uuid))`. These won't parse as valid UUIDs and will therefore be submitted as-is to the portal. As a third step, we update all documentation around generating Firezone IDs to use `uuidgen | sha256` instead of just `uuidgen`. This is effectively the equivalent of (2) but for the Headless Client and Gateway, where the Firezone ID can be configured via environment variables. Resolves: #9382 --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io> Co-authored-by: Jamil <jamilbk@users.noreply.github.com>	2025-06-24 07:05:48 +00:00
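This is a sketch of the conditional hashing: a stored ID that parses as a UUID is hashed to form the external ID, and any other ID is passed through unchanged. The hash function is injected so the example needs no crates. Per the commit, the real value is a SHA-256 digest. `is_uuid` and `external_id` are hypothetical, and the UUID check only accepts the canonical hyphenated form.

```rust
/// Checks the canonical 8-4-4-4-12 hyphenated UUID form.
fn is_uuid(id: &str) -> bool {
    let bytes = id.as_bytes();
    bytes.len() == 36
        && bytes.iter().enumerate().all(|(i, b)| match i {
            8 | 13 | 18 | 23 => *b == b'-',
            _ => b.is_ascii_hexdigit(),
        })
}

/// Legacy UUID IDs are hashed; new IDs (already `hex(sha256(uuid))`) pass through as-is.
fn external_id(firezone_id: &str, hash_hex: impl Fn(&str) -> String) -> String {
    if is_uuid(firezone_id) {
        hash_hex(firezone_id)
    } else {
        firezone_id.to_owned()
    }
}
```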
Thomas Eizinger	60bdbb39cb	refactor(gui-client): move change listeners to tunnel service (#8160 ) At present, listening for DNS server change and network change events is handled in the GUI client. Upon an event, a message is sent to the tunnel service which then applies the new state to `connlib`. We can avoid some of this boilerplate by moving these listeners to the tunnel service as part of the handler. As a result, we get a few improvements: - We don't need to ignore these events if we don't have a session because the lifetime of these listeners is tied to the IPC handler on the service side. - We need fewer IPC messages - We can retry the connection directly from within the tunnel service in case we have no Internet at the time of startup - We can more easily model out the state machine of a connlib session in the tunnel service - On Linux, this means we no longer shell out to `resolvectl` from the GUI process, unifying access to the "resolvers" from the tunnel service - On Windows, we no longer need admin privileges on the GUI client for optimized network-change detection. This now happens in the Tunnel process which already runs as admin. Resolves: #9465	2025-06-11 06:18:14 +00:00
Jamil	822832e02b	chore(macos): allow tauri to build on macOS (#9391 ) When working on UI stuff for the Tauri clients on macOS it's helpful if the UI is buildable. This is a first stab at getting a stub client to launch on macOS with the help of our AI overlords. Feel free to close or heavily critique if there is a better approach.	2025-06-06 09:15:39 +00:00
Thomas Eizinger	d62f82787d	build(deps): bump `netlink` dependency group (#9315 ) In https://github.com/rust-netlink/netlink-packet-route/issues/140#issuecomment-2919539363, the author claims the issue we've been holding the dependency bump back for is resolved. We can now update to the latest versions of the `netlink` dependency group.	2025-05-31 02:34:55 +00:00
Thomas Eizinger	ae872980ae	refactor(gui-client): scope telemetry sessions to GUI client (#9179 ) For our telemetry sessions with Sentry, we need to know which environment we are running in, i.e. staging, production or on-prem. The GUI client's tunnel service doesn't have a concept of an environment until a GUI connects and sends the `StartTelemetry` message. Therefore, we should scope a telemetry session to a GUI being connected over IPC. Any errors around setting up / tearing down the background service are a catch-22. Until a GUI connects, we can't initialise the telemetry connection but if we fail to set up the background service, no GUI can ever connect. Hence, the current setup and tear down of the `Telemetry` module around the `ipc_listen` calls can safely be removed as they are effectively no-ops anyway.	2025-05-20 23:18:18 +00:00
Thomas Eizinger	1bdba3601a	feat(gui-client): rename IPC service to Tunnel service (#9154 ) The name IPC service is not very descriptive. By nature of being separate processes, we need to use IPC to communicate between them. The important thing is that the service process has control over the tunnel. Therefore, we rename everything to "Tunnel service". The only part that is not changed are historic changelog entries. Resolves: #9048	2025-05-19 09:52:06 +00:00
Thomas Eizinger	3300c0fe02	chore(rust): fix windows static analysis errors (#9162 ) The `static-analysis` job for Windows was not yet part of the rule set and therefore some clippy errors slipped through when we merged #9159.	2025-05-16 04:23:53 +00:00
Thomas Eizinger	6165555add	build(deps): bump Rust to 1.87.0 (#9159 )	2025-05-16 01:58:17 +00:00
Thomas Eizinger	b8738448df	refactor(connlib): forward error from source IP resolver (#9116 ) In order to avoid routing loops on Windows, our UDP and TCP sockets in `connlib` embed a "source IP resolver" that finds the "next best" interface after our TUN device according to Windows' routing metrics. This ensures that packets don't get routed back into our TUN device. Currently, errors during this process are only logged on TRACE and are therefore not visible in Sentry. We fix this by restructuring some of the function interfaces and forwarding the error from the source IP resolver, together with the destination IP as context.	2025-05-13 13:33:15 +00:00
Thomas Eizinger	4097ee0cdf	chore(gui-client): only read `is_finished` once (#9095 ) For at least 1 user, the threads shut down correctly, but we didn't seem to have exited the loop. In https://firezone-inc.sentry.io/issues/6335839279/events/c11596de18924ee3a1b64ced89b1fba2/?project=4508008945549312, we can see that both flags are marked as `true` yet we still emitted the message. The only way how I can explain this is that the thread shut down in between the two times we've called the `is_finished` function. To ensure this doesn't happen, we now only read it once. This however also shows that 5s may not be enough time for WinTUN to shutdown. Therefore, we increase the grace period to 10s.	2025-05-12 11:47:42 +00:00
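A minimal sketch of the "read once" fix, assuming the worker threads are plain `std::thread` handles. The function name `wait_for_threads` and the 10ms polling interval are hypothetical; only the 10s grace period comes from the commit. Each poll takes a single snapshot of every `is_finished()` flag and uses it both for the decision and for the log line, so a thread that exits between two reads can no longer produce a warning that contradicts the flags it prints.

```rust
use std::thread::{self, JoinHandle};
use std::time::{Duration, Instant};

/// Waits up to `grace` for all worker threads to finish (hypothetical helper).
///
/// Returns `true` if every thread finished in time, `false` otherwise.
fn wait_for_threads(handles: Vec<JoinHandle<()>>, grace: Duration) -> bool {
    let deadline = Instant::now() + grace;

    loop {
        // Read each `is_finished` flag exactly once per iteration; every
        // decision and log line below uses this one snapshot.
        let finished: Vec<bool> = handles.iter().map(|h| h.is_finished()).collect();

        if finished.iter().all(|&done| done) {
            for handle in handles {
                // The thread has already exited, so `join` returns immediately.
                let _ = handle.join();
            }
            return true;
        }

        if Instant::now() >= deadline {
            eprintln!("Worker threads did not exit within {grace:?}, finished = {finished:?}");
            return false;
        }

        thread::sleep(Duration::from_millis(10));
    }
}
```

With the grace period from the commit, a call would look like `wait_for_threads(handles, Duration::from_secs(10))`.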
Thomas Eizinger	5566f1847f	refactor(rust): move crates into a more sensical hierarchy (#9066 ) The current `rust/` directory is a bit of a wild-west in terms of how the crates are organised. Most of them are simply at the top-level when in reality, they are all `connlib`-related. The Apple and Android FFI crates - which are entrypoints in the Rust code are defined several layers deep. To improve the situation, we move around and rename several crates. The end result is that all top-level crates / directories are: - Either entrypoints into the Rust code, i.e. applications such as Gateway, Relay or a Client - Or crates shared across all those entrypoints, such as `telemetry` or `logging`	2025-05-12 01:04:17 +00:00
Thomas Eizinger	f2b1fbe718	refactor(rust): move `device_id` to `bin-shared` (#9040 ) Both `device_id` and `device_info` are used by the headless-client and the GUI client / IPC service. They should therefore be defined in the `bin-shared` crate.	2025-05-06 04:52:37 +00:00
Thomas Eizinger	f11a902b3d	refactor(rust): move `dns-control` to `bin-shared` (#9023 ) Currently, the platform-specific code for controlling DNS resolution on a system sits in `firezone-headless-client`. This code is also used by the GUI client. This creates a weird compile-time dependency from the GUI client to the headless client. For other components that have platform-specific implementations, we use the `firezone-bin-shared` crate. As a first step of resolving the compile-time dependency, we move the `dns_control` module to `firezone-bin-shared`.	2025-05-06 01:29:09 +00:00
Thomas Eizinger	005b6fe863	feat(windows): optimise network change detection (#9021 ) Presently, the network change detection on Windows is very naive and simply emits a change event every time _anything_ changes. We can optimise this and therefore improve the start-up time of Firezone by: - Filtering out duplicate events - Filtering out network change events for our own network adapter This reduces the number of network change events to 1 during startup. As far as I can tell from the code comments in this area, we explicitly send this one to ensure we don't run into a race condition whilst we are starting up. Resolves: #8905	2025-05-06 00:23:27 +00:00
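The two filters could look roughly like this. `AdapterId`, `NetworkChange` and `ChangeFilter` are hypothetical stand-ins, not the Windows notification API; the sketch only drops events for our own adapter and events identical to the last one that was forwarded.

```rust
/// Hypothetical identifier for a network adapter (e.g. an interface LUID).
type AdapterId = u64;

/// Hypothetical, simplified network change event.
#[derive(Debug, Clone, PartialEq, Eq)]
struct NetworkChange {
    adapter: AdapterId,
    /// Simplified snapshot of the state that changed.
    state: String,
}

struct ChangeFilter {
    own_adapter: AdapterId,
    last: Option<NetworkChange>,
}

impl ChangeFilter {
    fn new(own_adapter: AdapterId) -> Self {
        Self { own_adapter, last: None }
    }

    /// Returns `true` if the event should be forwarded to the tunnel.
    fn accept(&mut self, event: NetworkChange) -> bool {
        // Changes on our own TUN adapter are caused by us; ignore them.
        if event.adapter == self.own_adapter {
            return false;
        }
        // Drop an event that is identical to the previously forwarded one.
        if self.last.as_ref() == Some(&event) {
            return false;
        }
        self.last = Some(event);
        true
    }
}
```

Only consecutive duplicates are suppressed here; the commit does not say whether the real implementation compares against the last event or a wider history.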
Thomas Eizinger	806996c245	refactor(rust): move `signals` to `bin-shared` (#9024 ) The `signals` module isn't something headless-client specific and should live in our `bin-shared` crate. Once the `ipc_service` module is decoupled from the headless-client crate, it will be used by both the headless client and IPC service (which then will be defined in the GUI client crate).	2025-05-05 23:34:26 +00:00
Thomas Eizinger	ce51c40d0d	refactor(rust): move `known_dirs` to `bin-shared` (#9026 ) The `known_dirs` module is used across the headless-client and the GUI client. It should live in `bin-shared` where all the other cross-platform modules are. --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io>	2025-05-05 22:45:53 +00:00
Thomas Eizinger	80335676b1	refactor(rust): move `uptime` to `bin-shared` (#9027 ) The `uptime` module from `firezone-headless-client` is also used in the GUI client. In order to decouple this dependency, we move the module to `bin-shared`, next to the other cross-platform modules.	2025-05-05 12:28:26 +00:00
Thomas Eizinger	6114bb274f	chore(rust): make most of the Rust code compile on MacOS (#8924 ) When working on the Rust code of Firezone from a MacOS computer, it is useful to have pretty much all of the code at least compile in order to detect problems early. Eventually, once we target features like a headless MacOS client, some of these stubs will actually be filled in and be functional.	2025-04-29 11:20:09 +00:00
Thomas Eizinger	93036734ae	build(rust): move our own `windows` dependency to `0.61.0` (#8730 ) Version `0.61.0` is what most of our dependencies bring in, so depending on that allows us to unify the dependency tree here.	2025-04-22 02:35:28 +00:00
Thomas Eizinger	84a2c275ca	build(rust): upgrade to Rust 1.85 and Edition 2024 (#8240 ) Updates our codebase to the 2024 Edition. For highlights on what changes, see the following blogpost: https://blog.rust-lang.org/2025/02/20/Rust-1.85.0.html	2025-03-19 02:58:55 +00:00
Thomas Eizinger	7af4b91ac5	fix(gui-client): call `wintun::Session::shutdown` on drop (#8464 ) The bugfix we attempted in #8156 turned out to be wrong. The source code shows that we have to call `Session::shutdown` in order to actually cancel the `Session::receive_blocking` call. Not doing so means we run into the timeout when discarding the `Tun` device because the recv-thread is stuck in `Session::receive_blocking`. Fixes: #8395	2025-03-17 12:58:03 +00:00
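To illustrate why the explicit `shutdown` matters, here is a std-only model of a session whose `receive_blocking` parks on a `Condvar`. `Session`, `Tun` and their methods are hypothetical stand-ins for the `wintun` types, not their real API. The point is that joining the receive thread in `Drop` only terminates because `shutdown` wakes the parked call first.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Condvar, Mutex};
use std::thread::{self, JoinHandle};

/// Hypothetical stand-in for a WinTUN session.
struct Session {
    state: Mutex<(VecDeque<Vec<u8>>, bool)>, // (queued packets, shut down?)
    cv: Condvar,
}

impl Session {
    fn new() -> Self {
        Self { state: Mutex::new((VecDeque::new(), false)), cv: Condvar::new() }
    }

    fn push(&self, packet: Vec<u8>) {
        self.state.lock().unwrap().0.push_back(packet);
        self.cv.notify_one();
    }

    /// Blocks until a packet arrives or the session is shut down.
    fn receive_blocking(&self) -> Option<Vec<u8>> {
        let mut state = self.state.lock().unwrap();
        loop {
            if state.1 {
                return None; // Cancelled by `shutdown`.
            }
            if let Some(packet) = state.0.pop_front() {
                return Some(packet);
            }
            state = self.cv.wait(state).unwrap(); // Loop also covers spurious wake-ups.
        }
    }

    fn shutdown(&self) {
        self.state.lock().unwrap().1 = true;
        self.cv.notify_all();
    }
}

struct Tun {
    session: Arc<Session>,
    recv_thread: Option<JoinHandle<()>>,
}

impl Tun {
    fn new() -> Self {
        let session = Arc::new(Session::new());
        let recv_session = Arc::clone(&session);
        let recv_thread = thread::spawn(move || {
            // Drain packets until the session is shut down.
            while recv_session.receive_blocking().is_some() {}
        });
        Self { session, recv_thread: Some(recv_thread) }
    }
}

impl Drop for Tun {
    fn drop(&mut self) {
        // Without this call, the recv thread stays parked in
        // `receive_blocking` and the `join` below never returns.
        self.session.shutdown();
        if let Some(handle) = self.recv_thread.take() {
            let _ = handle.join();
        }
    }
}
```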
Thomas Eizinger	2fe5c00c64	fix(windows): break from retry loop if we sent the packet (#8271 ) Regression introduced in #8268.	2025-02-26 06:10:02 +00:00
Thomas Eizinger	96170be082	fix(gui-client): mitigate deadlock when shutting down TUN device (#8268 ) In #8159, we introduced a regression that could lead to a deadlock when shutting down the TUN device. Whilst we did close the channel prior to waiting for the thread to exit, we failed to notice that _another_ instance of the sender could be alive as part of an internally stored "sending permit" within the `PollSender` in case another packet is queued for sending. We need to explicitly call `abort_send` to free that. Judging from the comment and a prior bug, this shutdown logic has been buggy before. To further avoid this deadlock, we introduce two changes: - The worker threads only receive a `Weak` reference to the `wintun::Session` - We move all device-related state into a dedicated `TunState` struct that we can drop prior to joining the threads The combination of these changes means that all strong references to channels and the session are definitely dropped without having to wait for anything. To provide a clean and synchronous shutdown, we wait for at most 5s on the worker threads. If they don't exit by then, we log a warning and exit anyway. This should greatly reduce the risk of future bugs here because the session (and thus the WinTUN device) gets shut down in any case, so at worst, we have a few zombie threads around. Resolves: #8265	2025-02-26 00:46:12 +00:00
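The `Weak`-reference pattern can be sketched with `std::sync` alone. `Session`, `TunState`, `worker` and `shutdown` are hypothetical names modelled on the description, and completion is signalled over a channel so the bounded wait can use `recv_timeout`; the real code may join its threads differently. Workers upgrade the `Weak` handle only for the duration of one operation, so dropping `TunState` releases the last strong reference and every worker exits on its next iteration.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Weak};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for `wintun::Session`.
struct Session;

impl Session {
    /// Hypothetical stand-in for one send or receive call on the session.
    fn poll_once(&self) {}
}

/// All device-related state lives here so it can be dropped in one go.
struct TunState {
    session: Arc<Session>,
}

/// Worker loop: holds only a `Weak` handle and exits once the session is gone.
fn worker(session: Weak<Session>, done: mpsc::Sender<()>) {
    loop {
        // Upgrade only for one operation so the worker never keeps the session alive.
        let Some(strong) = session.upgrade() else { break };
        strong.poll_once();
        drop(strong);
        thread::sleep(Duration::from_millis(5));
    }
    let _ = done.send(());
}

/// Drops `state`, then waits at most `grace` for `workers` threads to report exit.
fn shutdown(state: TunState, done_rx: mpsc::Receiver<()>, workers: usize, grace: Duration) -> bool {
    drop(state); // Releases the only strong reference to the session.

    let deadline = Instant::now() + grace;
    for _ in 0..workers {
        let remaining = deadline.saturating_duration_since(Instant::now());
        if done_rx.recv_timeout(remaining).is_err() {
            eprintln!("Worker threads did not exit within {grace:?}, continuing anyway");
            return false;
        }
    }
    true
}
```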
Thomas Eizinger	33c707dbf6	feat(windows): introduce dedicated "TUN send" thread (#8159 ) Same as done for unix-based operating systems in #8117, we introduce a dedicated "TUN send" thread for Windows in this PR. Not only does this move the syscalls and copying of outgoing packets away from `connlib`'s main thread but it also properly establishes backpressure between those threads. WinTUN does not have any ability to signal that it has space in its send buffer. If it fails to allocate a packet for sending, it will return `ERROR_BUFFER_OVERFLOW` [0]. We now handle this case gracefully by suspending the send thread for 10ms and then trying again. This isn't a great way of establishing back-pressure but at least we don't have any packet loss. To test this, I temporarily lowered the ring buffer size and ran a speed test. In that, I could confirm that `ERROR_BUFFER_OVERFLOW` is indeed emitted and handled as intended. [0]: https://git.zx2c4.com/wintun/tree/api/session.c#n267	2025-02-17 20:33:45 +00:00
Thomas Eizinger	af9fc49b18	fix(windows): don't double shutdown session (#8156 ) The `wintun` crate will already shut down the session for us when the last instance of `Session` gets dropped. If we shut down the session ourselves before that, the crate's own shutdown then attempts to close an adapter that is no longer present, causing WinTUN to log (unactionable) errors.	2025-02-17 05:38:11 +00:00
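A sketch of the retry loop, assuming a hypothetical `try_send` callback that reports a full ring buffer as `SendError::BufferFull` (standing in for `ERROR_BUFFER_OVERFLOW`). The 10ms pause comes from the commit; everything else is illustrative. On success the function returns immediately; failing to leave the loop after a successful send is the kind of regression #8271 fixed.

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical result of trying to hand a packet to the driver.
#[derive(Debug)]
enum SendError {
    /// Ring buffer full; WinTUN reports this as `ERROR_BUFFER_OVERFLOW`.
    BufferFull,
    Other(String),
}

/// Sends `packet` via `try_send`, sleeping 10ms whenever the ring buffer is full.
fn send_with_backoff<F>(packet: &[u8], mut try_send: F) -> Result<(), String>
where
    F: FnMut(&[u8]) -> Result<(), SendError>,
{
    loop {
        match try_send(packet) {
            // Leave the loop as soon as the packet is sent.
            Ok(()) => return Ok(()),
            // Crude back-pressure: wait for the driver to drain its buffer.
            Err(SendError::BufferFull) => thread::sleep(Duration::from_millis(10)),
            Err(SendError::Other(e)) => return Err(e),
        }
    }
}
```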
Thomas Eizinger	10ba02e341	fix(connlib): split TUN send & recv into separate threads (#8117 ) We appear to have caused a pretty big performance regression (~40%) in `037a2e64b6` (identified through `git-bisect`). Specifically, the regression appears to have been caused by `aef411a` (#7605). Weirdly enough, undoing just that on top of `main` doesn't fix the regression. My hypothesis is that using the same file descriptor for read AND write interests on the same runtime causes issues because those interests are occasionally cleared (i.e. on false-positive wake-ups). In this PR, we spawn a dedicated thread each for the sending and receiving operations of the TUN device. On unix-based systems, a TUN device is just a file descriptor and can therefore simply be copied and read & written to from different threads. Most importantly, we only construct the `AsyncFd` _within_ the newly spawned thread and runtime because constructing an `AsyncFd` implicitly registers with the runtime active on the current thread. As a nice benefit, this allows us to get rid of a `future::select`. Those are always kind of nasty because they cancel the future that wasn't ready. My original intuition was that we drop packets due to cancelled futures there but that could not be confirmed in experiments.	2025-02-14 05:32:51 +00:00
Thomas Eizinger	7dcda1dc74	fix(windows): silence `0x800706D9` when DNS deactivation fails (#8085 ) The error code we see here means "There are no more endpoints available from the endpoint mapper." This has something to do with Windows' internal RPC communication between components. DNS deactivation is on a best-effort basis and it appears that everything else is working just fine, despite this error. It appears to happen when we shut down our own service, so perhaps it is just a race condition.	2025-02-11 05:38:37 +00:00
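For reference, `0x800706D9` is the HRESULT wrapping Win32 error 1753 (`EPT_S_NOT_REGISTERED`), built the way the Windows SDK's `HRESULT_FROM_WIN32` macro does it for positive codes. The sketch below shows that arithmetic; the helper names are hypothetical and it assumes the error is compared as a raw HRESULT.

```rust
/// `FACILITY_WIN32` from the Windows SDK.
const FACILITY_WIN32: u32 = 7;

/// Win32 error 1753, `EPT_S_NOT_REGISTERED`: "There are no more endpoints
/// available from the endpoint mapper."
const EPT_S_NOT_REGISTERED: u32 = 1753;

/// Mirrors `HRESULT_FROM_WIN32` for positive Win32 error codes.
const fn hresult_from_win32(code: u32) -> u32 {
    (code & 0x0000_FFFF) | (FACILITY_WIN32 << 16) | 0x8000_0000
}

/// DNS deactivation is best-effort; this error is expected during shutdown
/// and not worth reporting (hypothetical helper).
fn is_ignorable_dns_deactivation_error(hresult: u32) -> bool {
    hresult == hresult_from_win32(EPT_S_NOT_REGISTERED)
}
```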
Thomas Eizinger	d7ebd07183	fix(linux): check for correct sign of netlink error code (#8087 ) We've previously tried to handle the "No such process" error from netlink when it tries to remove a route that no longer exists. What we failed to do is use the correct sign for the error code: netlink errors are always negative, yet when printed, they are positive numbers.	2025-02-11 04:47:51 +00:00
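The sign issue in a nutshell, as a hypothetical helper: netlink reports the negated errno, so "No such process" (`ESRCH`, which is 3 on Linux) has to be matched as `-3`.

```rust
/// `ESRCH` ("No such process") on Linux.
const ESRCH: i32 = 3;

/// Netlink error messages carry the errno negated, so "No such process"
/// arrives as `-3` even though it is usually printed as `3`.
fn is_no_such_process(netlink_error_code: i32) -> bool {
    netlink_error_code == -ESRCH
}
```

Comparing against the positive constant, as the earlier code effectively did, never matches.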
Thomas Eizinger	b193dd91f6	fix(windows): don't warn on disabled IP stack (#8086 ) When an IP stack is programmatically disabled, such as with: > reg add "HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip6\Parameters" /v DisabledComponents /t REG_DWORD /d 255 /f Attempting to interact with this IP stack will yield "NOT_FOUND" errors. These aren't worth reporting to Sentry because there isn't much we can do about it.	2025-02-11 04:37:17 +00:00
Thomas Eizinger	436b502eab	fix(windows): handle disabled IPv6 stack gracefully (#8083 ) Fixes: #8049.	2025-02-11 03:21:32 +00:00
Thomas Eizinger	f48df7585c	refactor(windows): de-duplicate Win32 error codes (#8071 ) The errors returned from Win32 API calls are currently duplicated in several places. This makes it error-prone to handle them correctly. With this PR, we de-duplicate them and add proper docs and links for further reading. We also fix a case where we would currently fail to set IP addresses for our tunnel interface if the IP stack is not supported.	2025-02-10 23:33:06 +00:00
Thomas Eizinger	d2e9b09874	refactor(rust): stringify errors early (#8033 ) As it turns out, the effort in #7104 was not a good idea. By logging errors as values, most of our Sentry reports all have the same title and thus cannot be differentiated from within the overview at all. To fix this, we stringify errors with all their sources whenever they get logged. This ensures log messages are unique and all Sentry issues will have a useful title.	2025-02-06 14:18:35 +00:00
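One way to keep such codes in a single place is a documented constants module with small predicates on top of it. The module name `win32_errors` and the `should_report` helper are hypothetical, and treating the "NOT_FOUND" errors from a disabled IP stack as `ERROR_NOT_FOUND` (1168) is an assumption.

```rust
/// Win32 error codes used across the Windows code, defined once (hypothetical module).
mod win32_errors {
    /// `ERROR_BUFFER_OVERFLOW` (111): returned by WinTUN when its ring buffer is full.
    pub const ERROR_BUFFER_OVERFLOW: u32 = 111;
    /// `ERROR_NOT_FOUND` (1168): "Element not found."
    pub const ERROR_NOT_FOUND: u32 = 1168;
}

/// Decides whether an error from configuring an IP stack is worth reporting.
/// Assumption: `ERROR_NOT_FOUND` means the stack was disabled (e.g. via the
/// `DisabledComponents` registry value), which we cannot do anything about.
fn should_report(code: u32) -> bool {
    code != win32_errors::ERROR_NOT_FOUND
}
```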
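A std-only sketch of stringifying an error together with its `source()` chain. `err_with_sources` and the `Context` error type are hypothetical; the commit does not say which helper it uses. For comparison, `anyhow`'s alternate `{:#}` formatting produces the same `outer: inner` shape.

```rust
use std::error::Error;
use std::fmt;

/// Renders `err` and every error in its `source()` chain as "a: b: c", so
/// each log line (and hence each Sentry issue title) is distinct.
fn err_with_sources(err: &dyn Error) -> String {
    let mut out = err.to_string();
    let mut source = err.source();
    while let Some(inner) = source {
        out.push_str(": ");
        out.push_str(&inner.to_string());
        source = inner.source();
    }
    out
}

/// Hypothetical wrapper error that adds context to an I/O error.
#[derive(Debug)]
struct Context {
    msg: &'static str,
    source: std::io::Error,
}

impl fmt::Display for Context {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(self.msg)
    }
}

impl Error for Context {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&self.source)
    }
}
```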
Thomas Eizinger	90fb9b8478	refactor(connlib): use Win32 APIs instead of `netsh` to set IPs (#8003 ) This should be faster and hopefully more reliable. --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io> Co-authored-by: Jamil <jamilbk@users.noreply.github.com>	2025-02-03 06:24:28 +00:00
Thomas Eizinger	8bd8098cab	refactor(connlib): don't re-implement waker for TUN thread (#7944 ) Within `connlib` - on UNIX platforms - we have dedicated threads that read from and write to the TUN device. These threads are connected with `connlib`'s main thread via bounded channels: one in each direction. When these channels are full, `connlib`'s main thread will suspend and not read any network packets from the sockets in order to maintain back-pressure. Reading more packets from the socket would mean most likely sending more packets out the TUN device. When debugging #7763, it became apparent that _something_ must be wrong with these threads and that somehow, we either consider them full or aren't emptying them and as a result, we don't read _any_ network packets from our sockets. To maintain back-pressure here, we currently use our own `AtomicWaker` construct that is shared with the TUN thread(s). This is unnecessary. We can also directly convert the `flume::Sender` into a `flume::async::SendSink` and therefore directly access a `poll` interface.	2025-01-29 15:48:48 +00:00
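The back-pressure signal itself can be illustrated with a bounded `std::sync::mpsc::sync_channel`: `try_send` fails with `Full` when the TUN thread has not caught up, which is the point where the main loop should stop reading from its sockets. This is a synchronous stand-in; the commit converts a `flume::Sender` into a `SendSink` to get an async `poll` interface instead, and the `Offer` type and `offer` function here are hypothetical.

```rust
use std::sync::mpsc::{SyncSender, TrySendError};

/// Outcome of offering a packet to the TUN thread (hypothetical type).
#[derive(Debug, PartialEq)]
enum Offer {
    Queued,
    /// Channel full: stop reading from the sockets and retry this packet later.
    Backpressure(Vec<u8>),
    /// The TUN thread is gone.
    Closed,
}

fn offer(tx: &SyncSender<Vec<u8>>, packet: Vec<u8>) -> Offer {
    match tx.try_send(packet) {
        Ok(()) => Offer::Queued,
        Err(TrySendError::Full(p)) => Offer::Backpressure(p),
        Err(TrySendError::Disconnected(_)) => Offer::Closed,
    }
}
```

Handing the rejected packet back in `Backpressure` lets the caller retry it later instead of dropping it.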
Thomas Eizinger	416e320319	revert: bump `netlink-packet-route` and `rtnetlink` (#7899 ) Reverts: #6694 Related: https://github.com/rust-netlink/netlink-packet-route/issues/140	2025-01-28 06:29:07 +00:00

98 Commits