Commit Graph

92 Commits

Author SHA1 Message Date
Thomas Eizinger
e84bdc5566 refactor(connlib): periodically record queue depths (#10242)
Instead of recording the queue depths on every event-loop tick, we now
record them once a second by setting a Gauge. Not only is that a simpler
instrument to work with but it is significantly more performant. The
current version - when metrics are enabled - takes on quite a bit of CPU
time.
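
As a rough illustration of the idea (not the actual `connlib` code), sampling a queue depth into a gauge at most once per interval could look like the sketch below. `Gauge` and `PeriodicSampler` are hypothetical names; a real setup would use the gauge instrument of whatever metrics library is in place.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a metrics gauge: it only stores the latest value.
#[derive(Default)]
pub struct Gauge(AtomicU64);

impl Gauge {
    pub fn set(&self, value: u64) {
        self.0.store(value, Ordering::Relaxed);
    }

    pub fn get(&self) -> u64 {
        self.0.load(Ordering::Relaxed)
    }
}

/// Records a queue depth at most once per `interval`, however often it is polled.
pub struct PeriodicSampler {
    interval: Duration,
    next_sample: Instant,
}

impl PeriodicSampler {
    pub fn new(now: Instant, interval: Duration) -> Self {
        Self {
            interval,
            next_sample: now,
        }
    }

    /// Called on every event-loop tick; returns `true` if a sample was recorded.
    pub fn poll(&mut self, now: Instant, queue_depth: usize, gauge: &Gauge) -> bool {
        if now < self.next_sample {
            return false; // Cheap early exit on the hot path.
        }

        gauge.set(queue_depth as u64);
        self.next_sample = now + self.interval;
        true
    }
}
```

With a one-second interval, the event loop can call `poll` on every tick and only pays for a timestamp comparison in the common case.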

Resolves: #10237
2025-09-02 02:57:36 +00:00
Thomas Eizinger
a109c1a2ef feat(connlib): discard intermediate resource and TUN updates (#10223)
Right now, the Client event-loops have a channel with 1000 items for
sending new resource lists and updates to the TUN device to the host
app. This is kind of unnecessary as we always only care about the last
version of these. Intermediate updates that the host app doesn't process
are effectively irrelevant.

We've had an issue before where a bug in the portal caused us to receive
many updates to resources which ended up crashing Client apps because
this channel filled up.

To be more resilient on this front, we refactor the Client event loop to
use a `watch` channel for this. Watch channels only retain the last
value that got sent into them.
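
To make the "only the last value is retained" semantics concrete, here is a std-only sketch of a single-slot channel. It is a simplified stand-in for a `watch` channel, not the real API: `latest_channel`, `LatestSender` and `LatestReceiver` are made-up names for illustration.

```rust
use std::sync::{Arc, Mutex};

/// Sending half of a single-slot channel: the slot holds at most one value.
pub struct LatestSender<T>(Arc<Mutex<Option<T>>>);

/// Receiving half of the single-slot channel.
pub struct LatestReceiver<T>(Arc<Mutex<Option<T>>>);

pub fn latest_channel<T>() -> (LatestSender<T>, LatestReceiver<T>) {
    let slot = Arc::new(Mutex::new(None));
    (LatestSender(slot.clone()), LatestReceiver(slot))
}

impl<T> LatestSender<T> {
    /// Overwrites any value the receiver has not picked up yet.
    /// This never blocks and the channel can never fill up.
    pub fn send(&self, value: T) {
        *self.0.lock().unwrap() = Some(value);
    }
}

impl<T> LatestReceiver<T> {
    /// Takes the most recent value, if one was sent since the last read.
    pub fn take_latest(&self) -> Option<T> {
        self.0.lock().unwrap().take()
    }
}
```

However many resource lists are sent while the host app is busy, it only ever observes the newest one.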
2025-08-21 05:42:54 +00:00
Thomas Eizinger
4e11112d9b feat(connlib): improve throughput on higher latencies (#10231)
Turns out the multi-threaded access of the TUN device on the Gateway
causes packet reordering which makes the TCP congestion controller
throttle the connection. Additionally, the default TX queue length of a
TUN device on Linux is only 500 packets.

With just a single thread and an increased TX queue length, we get a
throughput performance of just over 1 GBit/s for a 20ms link between
Client and Gateway with basically no packet drops:

```
Connecting to host 172.20.0.110, port 5201
[  5] local 100.79.130.70 port 49546 connected to 172.20.0.110 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   116 MBytes   977 Mbits/sec    0   6.40 MBytes       
[  5]   1.00-2.00   sec   137 MBytes  1.15 Gbits/sec    0   6.40 MBytes       
[  5]   2.00-3.00   sec   134 MBytes  1.13 Gbits/sec    0   6.40 MBytes       
[  5]   3.00-4.00   sec   136 MBytes  1.14 Gbits/sec   47   6.40 MBytes       
[  5]   4.00-5.00   sec   137 MBytes  1.15 Gbits/sec    0   6.40 MBytes       
[  5]   5.00-6.00   sec   138 MBytes  1.16 Gbits/sec    0   6.40 MBytes       
[  5]   6.00-7.00   sec   138 MBytes  1.15 Gbits/sec    0   6.40 MBytes       
[  5]   7.00-8.00   sec   138 MBytes  1.15 Gbits/sec    0   6.40 MBytes       
[  5]   8.00-9.00   sec   138 MBytes  1.16 Gbits/sec    0   6.40 MBytes       
[  5]   9.00-10.00  sec   138 MBytes  1.15 Gbits/sec    0   6.40 MBytes       
[  5]  10.00-11.00  sec   139 MBytes  1.17 Gbits/sec    0   6.40 MBytes       
[  5]  11.00-12.00  sec   139 MBytes  1.17 Gbits/sec    0   6.40 MBytes       
[  5]  12.00-13.00  sec   136 MBytes  1.14 Gbits/sec    0   6.40 MBytes       
[  5]  13.00-14.00  sec   139 MBytes  1.17 Gbits/sec    0   6.40 MBytes       
[  5]  14.00-15.00  sec   140 MBytes  1.17 Gbits/sec    0   6.40 MBytes       
[  5]  15.00-16.00  sec   138 MBytes  1.16 Gbits/sec    0   6.40 MBytes       
[  5]  16.00-17.00  sec   137 MBytes  1.15 Gbits/sec    0   6.40 MBytes       
[  5]  17.00-18.00  sec   139 MBytes  1.17 Gbits/sec    0   6.40 MBytes       
[  5]  18.00-19.00  sec   138 MBytes  1.16 Gbits/sec    0   6.40 MBytes       
[  5]  19.00-20.00  sec   136 MBytes  1.14 Gbits/sec    0   6.40 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-20.00  sec  2.67 GBytes  1.15 Gbits/sec   47             sender
[  5]   0.00-20.02  sec  2.67 GBytes  1.15 Gbits/sec                  receiver

iperf Done.

```

For further debugging in the future, we are now recording the send and
receive queue depths of both the TUN device and the UDP sockets. Neither
was full in my testing, which leads me to conclude that the bottleneck is
not a buffer inside Firezone being too small.

Related: #7452

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
2025-08-20 23:08:56 +00:00
Thomas Eizinger
301d2137e5 refactor(windows): share src IP cache across UDP sockets (#9976)
When looking through customer logs, we see a lot of "Resolved best route
outside of tunnel" messages. Those get logged every time we need to
rerun our re-implementation of Windows' weighting algorithm as to which
source interface / IP a packet should be sent from.

Currently, this gets cached in every socket instance so for the
peer-to-peer socket, this is only computed once per destination IP.
However, for DNS queries, we make a new socket for every query. Using a
new source port for each DNS query is recommended to avoid fingerprinting of
DNS queries. Using a new socket also means that we need to re-run this
algorithm every time we make a DNS query which is why we see this log so
often.

To fix this, we need to share this cache across all UDP sockets. Cache
invalidation is one of the hardest problems in computer science and this
instance is no different. This cache needs to be reset every time we
roam as that changes the weighting of which source interface to use.

To achieve this, we extend the `SocketFactory` trait with a `reset`
method. This method is called whenever we roam and can then reset a
shared cache inside the `UdpSocketFactory`. The "source IP resolver"
function that is passed to the UDP socket now simply accesses this
shared cache and inserts a new entry when it needs to resolve the IP.
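
A minimal sketch of this design, under simplifying assumptions: the trait below only has the `reset` method (the real `SocketFactory` also creates sockets), the cache maps destination IP to source IP, and the `resolve` closure stands in for the expensive routing lookup.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::sync::{Arc, Mutex};

/// Simplified, hypothetical version of the trait.
pub trait SocketFactory {
    /// Called whenever we roam.
    fn reset(&self);
}

/// Maps a destination IP to the source IP that was resolved for it.
type SrcIpCache = Arc<Mutex<HashMap<IpAddr, IpAddr>>>;

#[derive(Default)]
pub struct UdpSocketFactory {
    src_ip_cache: SrcIpCache,
}

impl UdpSocketFactory {
    /// Builds the "source IP resolver" handed to each new socket.
    /// All resolvers created by this factory share one cache.
    pub fn src_ip_resolver<F>(&self, resolve: F) -> impl Fn(IpAddr) -> IpAddr
    where
        F: Fn(IpAddr) -> IpAddr,
    {
        let cache = self.src_ip_cache.clone();

        move |dst| {
            let mut cache = cache.lock().unwrap();
            // Only run the expensive lookup on a cache miss.
            *cache.entry(dst).or_insert_with(|| resolve(dst))
        }
    }
}

impl SocketFactory for UdpSocketFactory {
    fn reset(&self) {
        self.src_ip_cache.lock().unwrap().clear();
    }
}
```

Two sockets created from the same factory then hit the lookup only once per destination until `reset` is called.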

As an added benefit, this may speed up DNS queries on Windows a bit
(although I haven't benchmarked it). It should certainly drastically
reduce the number of syscalls we make on Windows.
2025-07-24 01:36:53 +00:00
Thomas Eizinger
eb4c54620c chore(linux): add more error context to TUN device (#9853)
When failing to create the TUN device, the error messages are currently
pretty bare. Add a bit more context so users can more easily diagnose
what is wrong.
2025-07-13 05:51:02 +00:00
Thomas Eizinger
d6805d7e48 chore(rust): bump to Rust 1.88 (#9714)
Rust 1.88 has been released and brings with it a quite exciting feature:
let-chains! It allows us to mix-and-match `if` and `let` expressions,
therefore often reducing the "right-drift" of the relevant code, making
it easier to read.

Rust 1.88 also comes with a new clippy lint that warns when creating a
mutable reference from an immutable pointer. Attempting to fix this
revealed that this is exactly what we are doing in the eBPF kernel.
Unfortunately, it doesn't seem to be possible to design this in a way
that is both accepted by the borrow-checker AND by the eBPF verifier.
Hence, we simply make the function `unsafe` and document for the
programmer, what needs to be upheld.
2025-07-12 06:42:50 +00:00
Thomas Eizinger
17a1d36eae fix(gui-client): set IO error type for missing non-tunnel routes (#9777)
On Windows - in order to prevent routing loops - we resolve the best
"non-tunnel" route to a particular host for each IP address. The
resulting source IP is then used as source for packets leaving our
interface. In case the system doesn't have IPv6 connectivity or there are
simply no routes available, we fail this "source IP resolver" with an IO
error.

Presently, this uses the "other" IO error type which causes this to be
logged on a WARN level in the event-loop. The IO error types
`HostUnreachable` and `NetworkUnreachable` are expected during normal
operation of Firezone and are therefore only logged on DEBUG.

By changing this IO error type, we fix the WARN log spam on Windows for
machines without IPv6 connectivity.
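
The mapping from error kind to log level can be sketched as follows. `Level`, `log_level_for` and `no_route_error` are hypothetical names, and the choice of `HostUnreachable` for the missing-route case is an assumption: the message above does not say which of the two kinds is used.

```rust
use std::io;
use std::net::IpAddr;

#[derive(Debug, PartialEq, Eq)]
pub enum Level {
    Debug,
    Warn,
}

/// Errors that are expected during normal operation are only logged on DEBUG.
pub fn log_level_for(error: &io::Error) -> Level {
    match error.kind() {
        io::ErrorKind::HostUnreachable | io::ErrorKind::NetworkUnreachable => Level::Debug,
        _ => Level::Warn,
    }
}

/// Hypothetical constructor for the error returned when no non-tunnel route exists.
/// Assumption: `HostUnreachable` is used here instead of the previous "other" kind.
pub fn no_route_error(dst: IpAddr) -> io::Error {
    io::Error::new(
        io::ErrorKind::HostUnreachable,
        format!("no non-tunnel route to {dst}"),
    )
}
```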
2025-07-03 21:45:06 +00:00
Thomas Eizinger
899f5ea5e8 fix(gui-client): ensure GUI client can access firezone-id.json (#9764)
I believe some of the recent changes around how we load the
`firezone-id.json` from the GUI client surfaced that we in fact don't
always have access to it. Previously, this was silenced because we would
only optionally add it as context to the Sentry client.

Now, we need it to initialise telemetry so we know whether or not to
send logs to Sentry.

In order to be able to access the file, we need to change the config's
directory and the file to be owned by the `firezone-client` group.
2025-07-01 14:11:29 +00:00
Thomas Eizinger
daf05b8c79 fix(windows): ignore network changes from irrelevant networks (#9696)
In order to detect network changes on Windows, we implement the
`INetworkEvents` callback interface. This callback notifies us every
time the connectivity of a certain network changes.

Performing a network reset in connlib on any of these changes hurts the
user experience as Firezone is booting because it takes a while for this
to settle. Firezone itself is making changes to the network so several
of these change events happen _because_ Firezone is starting.

The documentation from Microsoft on what possible values the `NameType`
attribute can have is pretty thin but I did manage to find the following
values on the Internet:

- `6`: Wired network
- `71`: Wireless network
- `243`: Broadband network

We assume that the user is connected to the Internet through one of
these so we ignore network changes on all other networks.
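
Expressed as code, the filter is little more than an allow-list. This is a sketch only: the constant and function names are made up, and the numeric values are the ones listed above.

```rust
/// Network types assumed to carry the user's Internet connection.
const WIRED: u32 = 6;
const WIRELESS: u32 = 71;
const BROADBAND: u32 = 243;

/// Returns `true` if a connectivity change on a network of this type
/// should be treated as a network change.
pub fn is_relevant_network(name_type: u32) -> bool {
    matches!(name_type, WIRED | WIRELESS | BROADBAND)
}
```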

An alternative approach to reducing the number of false-positive change
events would be to react to a narrower list of change events. I
discarded this approach because it wasn't clear to me, which of the
event types [0] would matter to us and when Windows emits them. I think
in order to effectively react to those, we'd have to do more fine-grained
tracking of which state a network is in and e.g. only trigger a
reset if we move from "Disconnected" to e.g. "Subnet connectivity".
Windows also differentiates between local, subnet and Internet
connectivity, yet in my testing, I've never observed the "Internet"
connectivity being emitted.

Hence, it is deemed more robust to just filter out networks based on
their type. Firezone itself is of type 53 and is therefore automatically
filtered out as well. The risk here is that we don't react to
connectivity changes of a network that a customer is relying on.
Unfortunately, I don't think there is a better way to find this out
other than shipping this change and waiting for reports.

[0]:
https://learn.microsoft.com/en-us/windows/win32/api/netlistmgr/ne-netlistmgr-nlm_connectivity#constants

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
2025-06-30 08:52:00 +00:00
Thomas Eizinger
a91dda139f feat(connlib): only conditionally hash firezone ID (#9633)
A bit of legacy that we have inherited around our Firezone ID is that
the ID stored on the user's device is sha'd before being passed to the
portal as the "external ID". This makes it difficult to correlate IDs in
Sentry and PostHog with the data we have in the portal. For Sentry and
PostHog, we submit the raw UUID stored on the user's device.

As a first step in overcoming this, we embed an "external ID" in those
services as well IF the provided Firezone ID is a valid UUID. This will
allow us to immediately correlate those events.

As a second step, we automatically generate all new Firezone IDs for the
Windows and Linux Client as `hex(sha256(uuid))`. These won't parse as
valid UUIDs and therefore will be submitted as is to the portal.

As a third step, we update all documentation around generating Firezone
IDs to use `uuidgen | sha256` instead of just `uuidgen`. This is
effectively the equivalent of (2) but for the Headless Client and
Gateway where the Firezone ID can be configured via environment
variables.

Resolves: #9382

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
2025-06-24 07:05:48 +00:00
Thomas Eizinger
60bdbb39cb refactor(gui-client): move change listeners to tunnel service (#8160)
At present, listening for DNS server change and network change events is
handled in the GUI client. Upon an event, a message is sent to the
tunnel service which then applies the new state to `connlib`.

We can avoid some of this boilerplate by moving these listeners to the
tunnel service as part of the handler. As a result, we get a few
improvements:

- We don't need to ignore these events if we don't have a session
because the lifetime of these listeners is tied to the IPC handler on
the service side.
- We need fewer IPC messages
- We can retry the connection directly from within the tunnel service in
case we have no Internet at the time of startup
- We can more easily model out the state machine of a connlib session in
the tunnel service
- On Linux, this means we no longer shell out to `resolvectl` from the
GUI process, unifying access to the "resolvers" from the tunnel service
- On Windows, we no longer need admin privileges on the GUI client for
optimized network-change detection. This now happens in the Tunnel
process which already runs as admin.

Resolves: #9465
2025-06-11 06:18:14 +00:00
Jamil
822832e02b chore(macos): allow tauri to build on macOS (#9391)
When working on UI stuff for the Tauri clients on macOS it's helpful if
the UI is buildable. This is a first stab at getting a stub client to
launch on macOS with the help of our AI overlords. Feel free to close or
heavily critique if there is a better approach.
2025-06-06 09:15:39 +00:00
Thomas Eizinger
d62f82787d build(deps): bump netlink dependency group (#9315)
In
https://github.com/rust-netlink/netlink-packet-route/issues/140#issuecomment-2919539363,
the author claims the issue we've been holding the dependency bump back
for is resolved. We can now update to the latest versions of the
`netlink` dependency group.
2025-05-31 02:34:55 +00:00
Thomas Eizinger
ae872980ae refactor(gui-client): scope telemetry sessions to GUI client (#9179)
For our telemetry sessions with Sentry, we need to know which
environment we are running in, i.e. staging, production or on-prem. The
GUI client's tunnel service doesn't have a concept of an environment
until a GUI connects and sends the `StartTelemetry` message. Therefore,
we should scope a telemetry session to a GUI being connected over IPC.

Any errors around setting up / tearing down the background service are a
catch-22. Until a GUI connects, we can't initialise the telemetry
connection but if we fail to set up the background service, no GUI can
ever connect. Hence, the current setup and tear down of the `Telemetry`
module around the `ipc_listen` calls can safely be removed as they are
effectively no-ops anyway.
2025-05-20 23:18:18 +00:00
Thomas Eizinger
1bdba3601a feat(gui-client): rename IPC service to Tunnel service (#9154)
The name IPC service is not very descriptive. By nature of being
separate processes, we need to use IPC to communicate between them. The
important thing is that the service process has control over the tunnel.
Therefore, we rename everything to "Tunnel service".

The only part that is not changed are historic changelog entries.

Resolves: #9048
2025-05-19 09:52:06 +00:00
Thomas Eizinger
3300c0fe02 chore(rust): fix windows static analysis errors (#9162)
The `static-analysis` job for Windows was not yet part of the rule set
and therefore some clippy errors slipped through when we merged #9159.
2025-05-16 04:23:53 +00:00
Thomas Eizinger
6165555add build(deps): bump Rust to 1.87.0 (#9159) 2025-05-16 01:58:17 +00:00
Thomas Eizinger
b8738448df refactor(connlib): forward error from source IP resolver (#9116)
In order to avoid routing loops on Windows, our UDP and TCP sockets in
`connlib` embed a "source IP resolver" that finds the "next best"
interface after our TUN device according to Windows' routing metrics.
This ensures that packets don't get routed back into our TUN device.

Currently, errors during this process are only logged on TRACE and
therefore not visible in Sentry. We fix this by moving around some of
the function interfaces and forward the error from the source IP
resolver together with some context of the destination IP.
2025-05-13 13:33:15 +00:00
Thomas Eizinger
4097ee0cdf chore(gui-client): only read is_finished once (#9095)
For at least 1 user, the threads shut down correctly, but we didn't seem
to have exited the loop. In
https://firezone-inc.sentry.io/issues/6335839279/events/c11596de18924ee3a1b64ced89b1fba2/?project=4508008945549312,
we can see that both flags are marked as `true` yet we still emitted the
message.

The only way I can explain this is that the thread shut down in
between the two times we've called the `is_finished` function. To ensure
this doesn't happen, we now only read it once.
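
The pattern can be sketched like this (function and variable names are hypothetical): each flag is read exactly once, and both the decision and the log message use that same snapshot.

```rust
use std::thread::JoinHandle;

/// Returns a warning message if at least one thread has not exited yet.
pub fn stuck_threads(send: &JoinHandle<()>, recv: &JoinHandle<()>) -> Option<String> {
    // Read each flag exactly once so the check and the message always agree.
    let send_finished = send.is_finished();
    let recv_finished = recv.is_finished();

    if send_finished && recv_finished {
        return None;
    }

    Some(format!(
        "threads did not exit in time: send_finished={send_finished}, recv_finished={recv_finished}"
    ))
}
```

Calling `is_finished` a second time while formatting the message would reintroduce the race: the thread may exit between the two calls.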

This however also shows that 5s may not be enough time for WinTUN to
shut down. Therefore, we increase the grace period to 10s.
2025-05-12 11:47:42 +00:00
Thomas Eizinger
5566f1847f refactor(rust): move crates into a more sensical hierarchy (#9066)
The current `rust/` directory is a bit of a wild-west in terms of how
the crates are organised. Most of them are simply at the top-level when
in reality, they are all `connlib`-related. The Apple and Android FFI
crates - which are entrypoints into the Rust code - are defined several
layers deep.

To improve the situation, we move around and rename several crates. The
end result is that all top-level crates / directories are:

- Either entrypoints into the Rust code, i.e. applications such as
Gateway, Relay or a Client
- Or crates shared across all those entrypoints, such as `telemetry` or
`logging`
2025-05-12 01:04:17 +00:00
Thomas Eizinger
f2b1fbe718 refactor(rust): move device_id to bin-shared (#9040)
Both `device_id` and `device_info` are used by the headless-client and
the GUI client / IPC service. They should therefore be defined in the
`bin-shared` crate.
2025-05-06 04:52:37 +00:00
Thomas Eizinger
f11a902b3d refactor(rust): move dns-control to bin-shared (#9023)
Currently, the platform-specific code for controlling DNS resolution on
a system sits in `firezone-headless-client`. This code is also used by
the GUI client. This creates a weird compile-time dependency from the
GUI client to the headless client.

For other components that have platform-specific implementations, we use
the `firezone-bin-shared` crate. As a first step of resolving the
compile-time dependency, we move the `dns_control` module to
`firezone-bin-shared`.
2025-05-06 01:29:09 +00:00
Thomas Eizinger
005b6fe863 feat(windows): optimise network change detection (#9021)
Presently, the network change detection on Windows is very naive and
simply emits a change event every time _anything_ changes. We can
optimise this and therefore improve the start-up time of Firezone by:

- Filtering out duplicate events
- Filtering out network change events for our own network adapter

This reduces the number of network change events to 1 during startup. As
far as I can tell from the code comments in this area, we explicitly
send this one to ensure we don't run into a race condition whilst we are
starting up.

Resolves: #8905
2025-05-06 00:23:27 +00:00
Thomas Eizinger
806996c245 refactor(rust): move signals to bin-shared (#9024)
The `signals` module isn't something headless-client specific and should
live in our `bin-shared` crate. Once the `ipc_service` module is
decoupled from the headless-client crate, it will be used by both the
headless client and IPC service (which then will be defined in the GUI
client crate).
2025-05-05 23:34:26 +00:00
Thomas Eizinger
ce51c40d0d refactor(rust): move known_dirs to bin-shared (#9026)
The `known_dirs` module is used across the headless-client and the GUI
client. It should live in `bin-shared` where all the other
cross-platform modules are.

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
2025-05-05 22:45:53 +00:00
Thomas Eizinger
80335676b1 refactor(rust): move uptime to bin-shared (#9027)
The `uptime` module from `firezone-headless-client` is also used in the
GUI client. In order to decouple this dependency, we move the module to
`bin-shared`, next to the other cross-platform modules.
2025-05-05 12:28:26 +00:00
Thomas Eizinger
6114bb274f chore(rust): make most of the Rust code compile on MacOS (#8924)
When working on the Rust code of Firezone from a MacOS computer, it is
useful to have pretty much all of the code at least compile to
detect problems early. Eventually, once we target features like a
headless MacOS client, some of these stubs will actually be filled in and
be functional.
2025-04-29 11:20:09 +00:00
Thomas Eizinger
93036734ae build(rust): move our own windows dependency to 0.61.0 (#8730)
Version `0.61.0` is what most of our dependencies bring in, so depending
on that allows us to unify the dependency tree here.
2025-04-22 02:35:28 +00:00
Thomas Eizinger
84a2c275ca build(rust): upgrade to Rust 1.85 and Edition 2024 (#8240)
Updates our codebase to the 2024 Edition. For highlights on what
changes, see the following blogpost:
https://blog.rust-lang.org/2025/02/20/Rust-1.85.0.html
2025-03-19 02:58:55 +00:00
Thomas Eizinger
7af4b91ac5 fix(gui-client): call wintun::Session::shutdown on drop (#8464)
The bugfix we attempted in #8156 turned out wrong. Reading the
source-code, we have to call `Session::shutdown` in order to actually
cancel the `Session::receive_blocking` call. Not doing so means we run
into the timeout when discarding the `Tun` device because the
recv-thread is stuck in `Session::receive_blocking`.

Fixes: #8395
2025-03-17 12:58:03 +00:00
Thomas Eizinger
2fe5c00c64 fix(windows): break from retry loop if we sent the packet (#8271)
Regression introduced in #8268.
2025-02-26 06:10:02 +00:00
Thomas Eizinger
96170be082 fix(gui-client): mitigate deadlock when shutting down TUN device (#8268)
In #8159, we introduced a regression that could lead to a deadlock when
shutting down the TUN device. Whilst we did close the channel prior to
awaiting the thread to exit, we failed to notice that _another_ instance
of the sender could be alive as part of an internally stored "sending
permit" with the `PollSender` in case another packet is queued for
sending. We need to explicitly call `abort_send` to free that.

Judging from the comment and a prior bug, this shutdown logic has been
buggy before. To further avoid this deadlock, we introduce two changes:

- The worker threads only receive a `Weak` reference to the
`wintun::Session`
- We move all device-related state into a dedicated `TunState` struct
that we can drop prior to joining the threads

The combination of these features means that all strong references to
channels and the session are definitely dropped without having to wait
for anything. To provide a clean and synchronous shutdown, we wait for
at most 5s on the worker-threads. If they don't exit until then, we log
a warning and exit anyway.
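
A simplified sketch of both ideas, assuming a placeholder `Session` type instead of `wintun::Session` and polling `is_finished` to emulate a join with a timeout (std has no such API):

```rust
use std::sync::{Arc, Weak};
use std::thread::{self, JoinHandle};
use std::time::{Duration, Instant};

/// Hypothetical stand-in for `wintun::Session`.
pub struct Session;

/// The worker only holds a `Weak` reference, so it cannot keep the session
/// alive once the owner has dropped it.
pub fn spawn_worker(session: Weak<Session>) -> JoinHandle<()> {
    thread::spawn(move || {
        while let Some(_session) = session.upgrade() {
            // ... send or receive one packet here ...
            thread::sleep(Duration::from_millis(1));
        }
        // `upgrade` failed: the session is gone, so the thread exits.
    })
}

/// Waits up to `grace_period` for the worker; returns `false` if it is still running.
pub fn join_with_timeout(worker: JoinHandle<()>, grace_period: Duration) -> bool {
    let deadline = Instant::now() + grace_period;

    while !worker.is_finished() {
        if Instant::now() >= deadline {
            return false; // The caller logs a warning and exits anyway.
        }
        thread::sleep(Duration::from_millis(1));
    }

    worker.join().is_ok()
}
```

Dropping the last `Arc<Session>` before calling `join_with_timeout` is what guarantees that the worker sees `upgrade` fail on its next iteration.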

This should greatly reduce the risk of future bugs here because the
session (and thus the WinTUN device) gets shut down in any case and so at
worst, we have a few zombie threads around.

Resolves: #8265
2025-02-26 00:46:12 +00:00
Thomas Eizinger
33c707dbf6 feat(windows): introduce dedicated "TUN send" thread (#8159)
Same as done for unix-based operating systems in #8117, we introduce a
dedicated "TUN send" thread for Windows in this PR. Not only does this
move the syscalls and copying of sending packets away from `connlib`'s
main thread but it also establishes backpressure between those threads
properly.

WinTUN does not have any ability to signal that it has space in its send
buffer. If it fails to allocate a packet for sending, it will return
`ERROR_BUFFER_OVERFLOW` [0]. We now handle this case gracefully by
suspending the send thread for 10ms and then trying again. This isn't a
great way of establishing back-pressure but at least we don't have any
packet loss.
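
The retry logic can be sketched as below. `SendError` and `send_with_backoff` are hypothetical names, and the `try_send` closure stands in for allocating and sending a packet via WinTUN.

```rust
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq, Eq)]
pub enum SendError {
    /// Stand-in for WinTUN's `ERROR_BUFFER_OVERFLOW`: the ring buffer is full.
    BufferOverflow,
    Other(String),
}

/// Retries `try_send` until the packet is accepted.
/// Returns how many times we had to back off.
pub fn send_with_backoff<F>(
    mut try_send: F,
    packet: &[u8],
    backoff: Duration,
) -> Result<u32, SendError>
where
    F: FnMut(&[u8]) -> Result<(), SendError>,
{
    let mut retries = 0;

    loop {
        match try_send(packet) {
            Ok(()) => return Ok(retries), // Sent: leave the retry loop.
            Err(SendError::BufferOverflow) => {
                retries += 1;
                thread::sleep(backoff); // Give the ring buffer time to drain.
            }
            Err(other) => return Err(other),
        }
    }
}
```

Because the send thread blocks in this loop, the bounded channel feeding it fills up, which is what propagates back-pressure to the main thread.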

To test this, I temporarily lowered the ring buffer size and ran a speed
test. In that, I could confirm that `ERROR_BUFFER_OVERFLOW` is indeed
emitted and handled as intended.

[0]: https://git.zx2c4.com/wintun/tree/api/session.c#n267
2025-02-17 20:33:45 +00:00
Thomas Eizinger
af9fc49b18 fix(windows): don't double shutdown session (#8156)
The `wintun` crate will already shut down the session for us when the
last instance of `Session` gets dropped. Shutting down the session prior
to that already results in an attempt to close an adapter that is no
longer present, causing WinTUN to log (unactionable) errors.
2025-02-17 05:38:11 +00:00
Thomas Eizinger
10ba02e341 fix(connlib): split TUN send & recv into separate threads (#8117)
We appear to have caused a pretty big performance regression (~40%) in
037a2e64b6 (identified through `git-bisect`). Specifically, the regression
appears to have been caused by [`aef411a` (#7605)](aef411abf5). Weirdly
enough, undoing just that on top of `main` doesn't fix the regression.

My hypothesis is that using the same file descriptor for read AND write
interests on the same runtime causes issues because those interests are
occasionally cleared (i.e. on false-positive wake-ups).

In this PR, we spawn a dedicated thread each for the sending and
receiving operations of the TUN device. On unix-based systems, a TUN
device is just a file descriptor and can therefore simply be copied and
read & written to from different threads. Most importantly, we only
construct the `AsyncFd` _within_ the newly spawned thread and runtime
because constructing an `AsyncFd` implicitly registers with the runtime
active on the current thread.

As a nice benefit, this allows us to get rid of a `future::select`.
Those are always kind of nasty because they cancel the future that
wasn't ready. My original intuition was that we drop packets due to
cancelled futures there but that could not be confirmed in experiments.
2025-02-14 05:32:51 +00:00
Thomas Eizinger
7dcda1dc74 fix(windows): silence 0x800706D9 when DNS deactivation fails (#8085)
The error code we see here means "There are no more endpoints available
from the endpoint mapper." This has something to do with Windows'
internal RPC communication between components. DNS deactivation is on a
best-effort basis and it appears that everything else is working just
fine, despite this error.

It appears to happen when we shut down our own service, so perhaps it is
just a race condition.
2025-02-11 05:38:37 +00:00
Thomas Eizinger
d7ebd07183 fix(linux): check for correct sign of netlink error code (#8087)
We've previously tried to handle the "No such process" error from
netlink when it tries to remove a route that no longer exists. What we
failed to do is use the correct sign for the error code as netlink
errors are always negative, yet when printed, they are positive numbers.
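
In code, the difference is a single sign. A minimal sketch with a hypothetical helper name; `ESRCH` is 3 on Linux:

```rust
/// `ESRCH` ("No such process") as defined on Linux.
const ESRCH: i32 = 3;

/// Netlink reports errors as negative errno values, so we have to compare
/// against `-ESRCH`, not `ESRCH`.
pub fn is_no_such_process(netlink_error_code: i32) -> bool {
    netlink_error_code == -ESRCH
}
```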
2025-02-11 04:47:51 +00:00
Thomas Eizinger
b193dd91f6 fix(windows): don't warn on disabled IP stack (#8086)
When an IP stack is programmatically disabled, such as with:

> reg add
"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip6\Parameters"
/v DisabledComponents /t REG_DWORD /d 255 /f

Attempting to interact with this IP stack will yield "NOT_FOUND" errors.
These aren't worth reporting to Sentry because there isn't much we can
do about it.
2025-02-11 04:37:17 +00:00
Thomas Eizinger
436b502eab fix(windows): handle disabled IPv6 stack gracefully (#8083)
Fixes: #8049.
2025-02-11 03:21:32 +00:00
Thomas Eizinger
f48df7585c refactor(windows): de-duplicate Win32 error codes (#8071)
The errors returned from Win32 API calls are currently duplicated in
several places. This makes it error-prone to handle them correctly. With
this PR, we de-duplicate this and add proper docs and links for further
reading to them.

We also fix a case where we would currently fail to set IP addresses for
our tunnel interface if the IP stack is not supported.
2025-02-10 23:33:06 +00:00
Thomas Eizinger
d2e9b09874 refactor(rust): stringify errors early (#8033)
As it turns out, the effort in #7104 was not a good idea. By logging
errors as values, most of our Sentry reports all have the same title and
thus cannot be differentiated from within the overview at all. To fix
this, we stringify errors with all their sources whenever they got
logged. This ensures log messages are unique and all Sentry issues will
have a useful title.
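
A minimal sketch of what "stringify with all sources" means, using only the standard `Error::source` chain. `error_chain` and the `Context` type are hypothetical and only serve the illustration.

```rust
use std::error::Error;
use std::fmt;

/// Renders an error and all of its sources as a single string.
pub fn error_chain(error: &dyn Error) -> String {
    let mut message = error.to_string();
    let mut source = error.source();

    while let Some(cause) = source {
        message.push_str(": ");
        message.push_str(&cause.to_string());
        source = cause.source();
    }

    message
}

/// Small example error that wraps an optional source.
#[derive(Debug)]
pub struct Context {
    pub message: &'static str,
    pub source: Option<Box<dyn Error + 'static>>,
}

impl fmt::Display for Context {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(self.message)
    }
}

impl Error for Context {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        self.source.as_deref()
    }
}
```

Logging the result of `error_chain` instead of the error value gives every distinct failure path its own message.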
2025-02-06 14:18:35 +00:00
Thomas Eizinger
90fb9b8478 refactor(connlib): use Win32 APIs instead of netsh to set IPs (#8003)
This should be faster and hopefully more reliable.

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
2025-02-03 06:24:28 +00:00
Thomas Eizinger
8bd8098cab refactor(connlib): don't re-implement waker for TUN thread (#7944)
Within `connlib` - on UNIX platforms - we have dedicated threads that
read from and write to the TUN device. These threads are connected with
`connlib`'s main thread via bounded channels: one in each direction.
When these channels are full, `connlib`'s main thread will suspend and
not read any network packets from the sockets in order to maintain
back-pressure. Reading more packets from the socket would mean most
likely sending more packets out the TUN device.

When debugging #7763, it became apparent that _something_ must be wrong
with these threads and that somehow, we either consider the channels as full or
aren't emptying them and as a result, we don't read _any_ network
packets from our sockets.

To maintain back-pressure here, we currently use our own `AtomicWaker`
construct that is shared with the TUN thread(s). This is unnecessary. We
can also directly convert the `flume::Sender` into a
`flume::async::SendSink` and therefore directly access a `poll`
interface.
2025-01-29 15:48:48 +00:00
Thomas Eizinger
416e320319 revert: bump netlink-packet-route and rtnetlink (#7899)
Reverts: #6694
Related: https://github.com/rust-netlink/netlink-packet-route/issues/140
2025-01-28 06:29:07 +00:00
dependabot[bot]
0779757646 build(deps): netlink-packet-route and rtnetlink (#6694)
`rtnetlink` has some breaking changes in their latest version. To avoid
waiting until they actually cut a release, we temporarily depend on
their `main` branch.

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
2025-01-28 05:21:52 +00:00
Thomas Eizinger
46cdbbcc23 fix(connlib): use a buffer pool for the GSO queue (#7749)
Within `connlib`, we read batches of IP packets and process them at
once. Each encrypted packet is appended to a buffer shared with other
packets of the same length. Once the batch is successfully processed,
all of these buffers are written out using GSO to the network. This
allows UDP operations to be much more efficient because not every packet
has to traverse the entire syscall hierarchy of the operating system.

Until now, these buffers got re-allocated on every batch. This is pretty
wasteful and leads to a lot of repeated allocations. Measurements show
that most of the time, we only have a handful of packets with different
segment lengths _per batch_. For example, just booting up the
headless-client and running a speedtest showed that only 5 of these
buffers were needed at one time.

By introducing a buffer pool, we can reuse these buffers between batches
and avoid reallocating them.
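
The core of such a pool fits in a few lines. This is a sketch with hypothetical names, not the implementation used in `connlib`:

```rust
/// Hands out byte buffers and keeps returned ones around for reuse.
pub struct BufferPool {
    free: Vec<Vec<u8>>,
    capacity: usize,
}

impl BufferPool {
    pub fn new(capacity: usize) -> Self {
        Self {
            free: Vec::new(),
            capacity,
        }
    }

    /// Reuses a returned buffer if one is available, otherwise allocates.
    pub fn pull(&mut self) -> Vec<u8> {
        self.free
            .pop()
            .unwrap_or_else(|| Vec::with_capacity(self.capacity))
    }

    /// Returns a buffer to the pool. `clear` keeps the allocation.
    pub fn give_back(&mut self, mut buffer: Vec<u8>) {
        buffer.clear();
        self.free.push(buffer);
    }
}
```

After the first few batches, the pool holds as many buffers as there are distinct segment lengths in flight and no further allocations are needed.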

Related: #7747.
2025-01-13 19:24:52 +00:00
Thomas Eizinger
037a2e64b6 fix(connlib): attempt to detect runtime shutdown within TUN task (#7605)
Reading and writing to the TUN device within `connlib` happens in a
separate thread. The task running within these threads is connected to
the rest of `connlib` via channels. When the application shuts down,
these threads also need to exit. Currently, we attempt to detect this
from within the task when these channels close. It appears that there is
a race condition here because we first attempt to read from the TUN
device before reading from the channels. We treat read & write errors on
the TUN device as non-fatal so we loop around and attempt to read from
it again, causing an infinite-loop and log spam.

To fix this, we swap the order in which we evaluate the two concurrent
tasks: The first task to be polled is now the channel for outbound
packets and only if that one is empty, we attempt to read new packets
from the TUN device. This is also better from a backpressure point of
view: We should attempt to flush out our local buffers of already
processed packets before taking on "new work".

As a defense-in-depth strategy, we also attempt to detect the particular
error from the tokio runtime when it is being shut down and exit the
task.

Resolves: #7601.
Related: https://github.com/tokio-rs/tokio/issues/7056.
2025-01-05 20:41:24 +00:00
Thomas Eizinger
26824fb3c7 fix(gateway): check if we run with correct permissions (#7565)
The gateway needs either the `CAP_NET_ADMIN` capability or to run as `root`
in order to access the TUN device as well as configure routes via
`netlink`. Running without either leads to "Permission denied" errors at
runtime. It is good to fail early in these kinds of situations.

By checking for this capability early on during startup, these should no
longer surface later. As a bonus, we won't receive (unactionable) Sentry
alerts.
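
One possible way to perform such a check, shown purely as an illustration: parse the `CapEff` bitmask from the contents of `/proc/self/status`. Whether the Gateway does it this way is an assumption; the function name is made up. `CAP_NET_ADMIN` is bit 12 on Linux, and a process running as `root` normally has it set as well.

```rust
/// Bit index of `CAP_NET_ADMIN` in the Linux capability sets.
const CAP_NET_ADMIN: u32 = 12;

/// Inspects the contents of `/proc/self/status`.
/// Returns `None` if the effective capability set cannot be found or parsed.
pub fn has_cap_net_admin(proc_status: &str) -> Option<bool> {
    let hex = proc_status
        .lines()
        .find_map(|line| line.strip_prefix("CapEff:"))?
        .trim();
    let effective = u64::from_str_radix(hex, 16).ok()?;

    Some((effective & (1u64 << CAP_NET_ADMIN)) != 0)
}
```

At startup, the result of `std::fs::read_to_string("/proc/self/status")` could be passed in and the process could exit with a clear error message if the capability is missing.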

Resolves: #7559.

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
2024-12-29 21:45:56 +00:00
Thomas Eizinger
e7cc0e5eef fix(linux): don't fail on unsupported IP version (#7583)
Firezone always attempts to handle IPv4 and IPv6. On Linux systems
without an IPv6 stack, attempts to add an IPv6 route may fail with "Not
supported (os error 95)". We don't need the IPv6 routes on those systems
as we will never receive IPv6 traffic. Therefore, we can safely ignore
these errors and not log them.
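
A sketch of the idea, assuming the failure surfaces as a `std::io::Error` carrying the raw OS error code (the real error type coming out of the netlink layer may differ). The helper names are hypothetical.

```rust
use std::io;

/// "Not supported (os error 95)" on Linux.
const EOPNOTSUPP: i32 = 95;

pub fn is_unsupported_ip_version(error: &io::Error) -> bool {
    error.raw_os_error() == Some(EOPNOTSUPP)
}

/// Treats "not supported" as success; every other error is passed through.
pub fn ignore_unsupported(result: io::Result<()>) -> io::Result<()> {
    match result {
        Err(e) if is_unsupported_ip_version(&e) => Ok(()),
        other => other,
    }
}
```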
2024-12-25 11:09:22 +00:00
Thomas Eizinger
1b04b0eb2b fix(windows): don't warn on deleting non-existing route (#7507)
Similar to Linux (#7502), we don't want to log an error if we cannot
delete a route that doesn't exist.
2024-12-13 21:09:09 +00:00