Commit Graph

64 Commits

Thomas Eizinger
84a2c275ca build(rust): upgrade to Rust 1.85 and Edition 2024 (#8240)
Updates our codebase to the 2024 Edition. For highlights on what
changes, see the following blogpost:
https://blog.rust-lang.org/2025/02/20/Rust-1.85.0.html
2025-03-19 02:58:55 +00:00
Thomas Eizinger
7af4b91ac5 fix(gui-client): call wintun::Session::shutdown on drop (#8464)
The bugfix we attempted in #8156 turned out to be wrong. Reading the
source code, we have to call `Session::shutdown` in order to actually
cancel the `Session::receive_blocking` call. Not doing so means we run
into the timeout when discarding the `Tun` device because the
recv-thread is stuck in `Session::receive_blocking`.
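
A minimal sketch of the resulting drop order (field names are
illustrative and the exact `wintun` signatures depend on the crate
version):

```rust
use std::sync::Arc;

struct Tun {
    session: Arc<wintun::Session>,
    recv_thread: Option<std::thread::JoinHandle<()>>,
}

impl Drop for Tun {
    fn drop(&mut self) {
        // Unblocks `Session::receive_blocking` in the recv-thread; merely
        // dropping our `Arc` would leave that thread stuck until the timeout.
        let _ = self.session.shutdown();

        if let Some(handle) = self.recv_thread.take() {
            let _ = handle.join();
        }
    }
}
```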

Fixes: #8395
2025-03-17 12:58:03 +00:00
Thomas Eizinger
2fe5c00c64 fix(windows): break from retry loop if we sent the packet (#8271)
Regression introduced in #8268.
2025-02-26 06:10:02 +00:00
Thomas Eizinger
96170be082 fix(gui-client): mitigate deadlock when shutting down TUN device (#8268)
In #8159, we introduced a regression that could lead to a deadlock when
shutting down the TUN device. Whilst we did close the channel prior to
awaiting the thread to exit, we failed to notice that _another_ instance
of the sender could be alive as part of an internally stored "sending
permit" within the `PollSender` in case another packet is queued for
sending. We need to explicitly call `abort_send` to free that.

Judging from the comment and a prior bug, this shutdown logic has been
buggy before. To further avoid this deadlock, we introduce two changes:

- The worker threads only receive a `Weak` reference to the
`wintun::Session`
- We move all device-related state into a dedicated `TunState` struct
that we can drop prior to joining the threads

The combination of these changes means that all strong references to
channels and the session are definitely dropped without having to wait
for anything. To provide a clean and synchronous shutdown, we wait at
most 5s for the worker threads. If they haven't exited by then, we log
a warning and exit anyway.

This should greatly reduce the risk of future bugs here because the
session (and thus the WinTUN device) gets shut down in any case, so at
worst, we have a few zombie threads around.
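
A rough sketch of that ordering, assuming `tokio_util`'s `PollSender`;
field names and the shutdown timing are illustrative:

```rust
use std::sync::Arc;
use std::time::{Duration, Instant};

use tokio_util::sync::PollSender;

// Everything the worker threads depend on lives in `TunState` so it can be
// dropped before we join the threads.
struct TunState {
    outbound: PollSender<Vec<u8>>,
    session: Arc<wintun::Session>,
}

struct Tun {
    state: Option<TunState>,
    workers: Vec<std::thread::JoinHandle<()>>,
}

impl Drop for Tun {
    fn drop(&mut self) {
        if let Some(mut state) = self.state.take() {
            // Release the sending permit held internally by `PollSender`;
            // otherwise one sender stays alive and the channel never looks
            // closed to the worker thread.
            state.outbound.abort_send();
            state.outbound.close();

            // Dropping `state` drops the last strong `Arc<wintun::Session>`;
            // the workers only hold `Weak` references and fail to upgrade.
            drop(state);
        }

        // Wait at most 5s in total for the worker threads to exit.
        let deadline = Instant::now() + Duration::from_secs(5);

        for handle in self.workers.drain(..) {
            while !handle.is_finished() && Instant::now() < deadline {
                std::thread::sleep(Duration::from_millis(50));
            }

            if handle.is_finished() {
                let _ = handle.join();
            } else {
                tracing::warn!("TUN worker thread did not exit within 5s");
            }
        }
    }
}
```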

Resolves: #8265
2025-02-26 00:46:12 +00:00
Thomas Eizinger
33c707dbf6 feat(windows): introduce dedicated "TUN send" thread (#8159)
Same as done for unix-based operating systems in #8117, we introduce a
dedicated "TUN send" thread for Windows in this PR. Not only does this
move the syscalls and copying involved in sending packets away from
`connlib`'s main thread, it also establishes backpressure between those
threads properly.

WinTUN does not have any ability to signal that it has space in its send
buffer. If it fails to allocate a packet for sending, it will return
`ERROR_BUFFER_OVERFLOW` [0]. We now handle this case gracefully by
suspending the send thread for 10ms and then trying again. This isn't a
great way of establishing back-pressure but at least we don't have any
packet loss.
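
A sketch of the send loop, assuming a plain `std::sync::mpsc` channel
feeds it IP packets; `is_ring_full` is a hypothetical helper because the
exact error matching depends on the `wintun` crate's error type:

```rust
use std::sync::{mpsc, Arc};
use std::time::Duration;

fn send_loop(session: Arc<wintun::Session>, rx: mpsc::Receiver<Vec<u8>>) {
    for ip_packet in rx {
        loop {
            match session.allocate_send_packet(ip_packet.len() as u16) {
                Ok(mut packet) => {
                    packet.bytes_mut().copy_from_slice(&ip_packet);
                    session.send_packet(packet);
                    break;
                }
                Err(e) if is_ring_full(&e) => {
                    // Crude back-pressure: the ring has no readiness signal,
                    // so sleep briefly and retry instead of dropping the packet.
                    std::thread::sleep(Duration::from_millis(10));
                }
                Err(e) => {
                    tracing::warn!("Failed to allocate send packet: {}", e);
                    break;
                }
            }
        }
    }
}

// Hypothetical: in practice this checks for the wrapped Win32 error
// ERROR_BUFFER_OVERFLOW (0x6F / 111).
fn is_ring_full(e: &wintun::Error) -> bool {
    format!("{e:?}").contains("BUFFER_OVERFLOW")
}
```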

To test this, I temporarily lowered the ring buffer size and ran a speed
test. In that, I could confirm that `ERROR_BUFFER_OVERFLOW` is indeed
emitted and handled as intended.

[0]: https://git.zx2c4.com/wintun/tree/api/session.c#n267
2025-02-17 20:33:45 +00:00
Thomas Eizinger
af9fc49b18 fix(windows): don't double shutdown session (#8156)
The `wintun` crate will already shut down the session for us when the
last instance of `Session` gets dropped. Shutting down the session before
that results in an attempt to close an adapter that is no longer
present, causing WinTUN to log (unactionable) errors.
2025-02-17 05:38:11 +00:00
Thomas Eizinger
10ba02e341 fix(connlib): split TUN send & recv into separate threads (#8117)
We appear to have caused a pretty big performance regression (~40%) in
037a2e64b6 (identified through
`git-bisect`). Specifically, the regression appears to have been caused
by [`aef411a`
(#7605)](aef411abf5).
Weirdly enough, undoing just that on top of `main` doesn't fix the
regression.

My hypothesis is that using the same file descriptor for read AND write
interests on the same runtime causes issues because those interests are
occasionally cleared (i.e. on false-positive wake-ups).

In this PR, we spawn a dedicated thread each for the sending and
receiving operations of the TUN device. On unix-based systems, a TUN
device is just a file descriptor and can therefore simply be copied and
read & written to from different threads. Most importantly, we only
construct the `AsyncFd` _within_ the newly spawned thread and runtime
because constructing an `AsyncFd` implicitly registers with the runtime
active on the current thread.
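
A sketch of the recv side (the send side mirrors it with write
readiness), assuming a `flume` channel and a raw `libc::read`; the key
detail is that `AsyncFd::new` runs inside the spawned thread:

```rust
use std::os::fd::{AsRawFd, OwnedFd};

use tokio::io::unix::AsyncFd;

fn spawn_tun_recv_thread(fd: OwnedFd, tx: flume::Sender<Vec<u8>>) -> std::thread::JoinHandle<()> {
    std::thread::spawn(move || {
        let rt = tokio::runtime::Builder::new_current_thread()
            .enable_io()
            .build()
            .expect("to be able to build a current-thread runtime");

        rt.block_on(async move {
            // Constructing the `AsyncFd` *here* registers the fd with this
            // thread's runtime, not with connlib's main runtime.
            let fd = AsyncFd::new(fd).expect("to be able to register the TUN fd");
            let mut buf = vec![0u8; 1500];

            loop {
                let mut guard = match fd.readable().await {
                    Ok(guard) => guard,
                    Err(_) => return,
                };

                let res = guard.try_io(|inner| {
                    let ret = unsafe {
                        libc::read(inner.get_ref().as_raw_fd(), buf.as_mut_ptr().cast(), buf.len())
                    };

                    if ret < 0 {
                        return Err(std::io::Error::last_os_error());
                    }

                    Ok(ret as usize)
                });

                match res {
                    Ok(Ok(n)) => {
                        if tx.send(buf[..n].to_vec()).is_err() {
                            return; // Channel closed; connlib is shutting down.
                        }
                    }
                    Ok(Err(_)) => continue, // Read error; treated as non-fatal.
                    Err(_) => continue,     // False-positive readiness; wait again.
                }
            }
        });
    })
}
```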

As a nice benefit, this allows us to get rid of a `future::select`.
Those are always kind of nasty because they cancel the future that
wasn't ready. My original intuition was that we drop packets due to
cancelled futures there but that could not be confirmed in experiments.
2025-02-14 05:32:51 +00:00
Thomas Eizinger
7dcda1dc74 fix(windows): silence 0x800706D9 when DNS deactivation fails (#8085)
The error code we see here means "There are no more endpoints available
from the endpoint mapper." This has something to do with Windows'
internal RPC communication between components. DNS deactivation is on a
best-effort basis and it appears that everything else is working just
fine, despite this error.

It appears to happen when we shut down our own service, so perhaps it is
just a race condition.
2025-02-11 05:38:37 +00:00
Thomas Eizinger
d7ebd07183 fix(linux): check for correct sign of netlink error code (#8087)
We've previously tried to handle the "No such process" error from
netlink when it tries to remove a route that no longer exists. What we
failed to do is use the correct sign for the error code: netlink errors
are always negative, yet when printed, they appear as positive numbers.
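
For illustration, the check boils down to comparing against the negated
errno:

```rust
// netlink reports errno values negated, so "No such process" (ESRCH, 3)
// arrives as -3.
fn route_already_gone(netlink_code: i32) -> bool {
    netlink_code == -libc::ESRCH
}
```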
2025-02-11 04:47:51 +00:00
Thomas Eizinger
b193dd91f6 fix(windows): don't warn on disabled IP stack (#8086)
When an IP stack is programmatically disabled, such as with:

> reg add
"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip6\Parameters"
/v DisabledComponents /t REG_DWORD /d 255 /f

Attempting to interact with this IP stack will yield "NOT_FOUND" errors.
These aren't worth reporting to Sentry because there isn't much we can
do about it.
2025-02-11 04:37:17 +00:00
Thomas Eizinger
436b502eab fix(windows): handle disabled IPv6 stack gracefully (#8083)
Fixes: #8049.
2025-02-11 03:21:32 +00:00
Thomas Eizinger
f48df7585c refactor(windows): de-duplicate Win32 error codes (#8071)
The errors returned from Win32 API calls are currently duplicated in
several places. This makes it error-prone to handle them correctly. With
this PR, we de-duplicate them and add proper docs and links for further
reading.

We also fix a case where we would currently fail to set IP addresses for
our tunnel interface if the IP stack is not supported.
2025-02-10 23:33:06 +00:00
Thomas Eizinger
d2e9b09874 refactor(rust): stringify errors early (#8033)
As it turns out, the effort in #7104 was not a good idea. By logging
errors as values, most of our Sentry reports have the same title and
thus cannot be differentiated from within the overview at all. To fix
this, we stringify errors with all their sources whenever they get
logged. This ensures log messages are unique and all Sentry issues will
have a useful title.
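
For illustration, stringifying at the log site can be as simple as using
`anyhow`'s alternate formatting, which renders the error together with
all of its sources:

```rust
fn log_failed_dns_update(e: &anyhow::Error) {
    // `{:#}` flattens the whole source chain into a single, unique message.
    tracing::warn!("Failed to update DNS servers: {:#}", e);
}
```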
2025-02-06 14:18:35 +00:00
Thomas Eizinger
90fb9b8478 refactor(connlib): use Win32 APIs instead of netsh to set IPs (#8003)
This should be faster and hopefully more reliable.

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
2025-02-03 06:24:28 +00:00
Thomas Eizinger
8bd8098cab refactor(connlib): don't re-implement waker for TUN thread (#7944)
Within `connlib` - on UNIX platforms - we have dedicated threads that
read from and write to the TUN device. These threads are connected with
`connlib`'s main thread via bounded channels: one in each direction.
When these channels are full, `connlib`'s main thread will suspend and
not read any network packets from the sockets in order to maintain
back-pressure. Reading more packets from the socket would mean most
likely sending more packets out the TUN device.

When debugging #7763, it became apparent that _something_ must be wrong
with these threads and that somehow, we either consider the channels
full or aren't emptying them, and as a result we don't read _any_ network
packets from our sockets.

To maintain back-pressure here, we currently use our own `AtomicWaker`
construct that is shared with the TUN thread(s). This is unnecessary: we
can instead directly convert the `flume::Sender` into a
`flume::async::SendSink` and thereby access a `poll`-based
interface.
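
A sketch of what that buys us, assuming flume's `async` feature; the
wrapper type and method names are illustrative:

```rust
use std::pin::Pin;
use std::task::{Context, Poll};

use futures::Sink;

struct TunSender {
    sink: Pin<Box<flume::r#async::SendSink<'static, Vec<u8>>>>,
}

impl TunSender {
    fn new(tx: flume::Sender<Vec<u8>>) -> Self {
        Self {
            sink: Box::pin(tx.into_sink()),
        }
    }

    /// Returns `Poll::Pending` while the bounded channel is full, parking the
    /// caller via the channel's own waker instead of a hand-rolled `AtomicWaker`.
    fn poll_send_ready(&mut self, cx: &mut Context<'_>) -> Poll<anyhow::Result<()>> {
        self.sink
            .as_mut()
            .poll_ready(cx)
            .map_err(|_| anyhow::anyhow!("TUN channel closed"))
    }

    fn send(&mut self, packet: Vec<u8>) -> anyhow::Result<()> {
        self.sink
            .as_mut()
            .start_send(packet)
            .map_err(|_| anyhow::anyhow!("TUN channel closed"))
    }
}
```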
2025-01-29 15:48:48 +00:00
Thomas Eizinger
416e320319 revert: bump netlink-packet-route and rtnetlink (#7899)
Reverts: #6694
Related: https://github.com/rust-netlink/netlink-packet-route/issues/140
2025-01-28 06:29:07 +00:00
dependabot[bot]
0779757646 build(deps): netlink-packet-route and rtnetlink (#6694)
`rtnetlink` has some breaking changes in their latest version. To avoid
waiting until they actually cut a release, we temporarily depend on
their `main` branch.

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
2025-01-28 05:21:52 +00:00
Thomas Eizinger
46cdbbcc23 fix(connlib): use a buffer pool for the GSO queue (#7749)
Within `connlib`, we read batches of IP packets and process them at
once. Each encrypted packet is appended to a buffer shared with other
packets of the same length. Once the batch is successfully processed,
all of these buffers are written out using GSO to the network. This
allows UDP operations to be much more efficient because not every packet
has to traverse the entire syscall hierarchy of the operating system.

Until now, these buffers got re-allocated on every batch. This is pretty
wasteful and leads to a lot of repeated allocations. Measurements show
that most of the time, we only have a handful of packets with different
segment lengths _per batch_. For example, just booting up the
headless-client and running a speedtest showed that only 5 of these
buffers were needed at any one time.

By introducing a buffer pool, we can reuse these buffers between batches
and avoid reallocating them.
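
A minimal sketch of such a pool (illustrative types, not the actual
connlib ones): buffers are handed out per batch and returned instead of
dropped.

```rust
use bytes::BytesMut;

#[derive(Default)]
struct BufferPool {
    free: Vec<BytesMut>,
}

impl BufferPool {
    fn acquire(&mut self) -> BytesMut {
        self.free
            .pop()
            .unwrap_or_else(|| BytesMut::with_capacity(64 * 1024))
    }

    fn release(&mut self, mut buf: BytesMut) {
        buf.clear(); // Keep the capacity, drop the contents.
        self.free.push(buf);
    }
}
```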

Related: #7747.
2025-01-13 19:24:52 +00:00
Thomas Eizinger
037a2e64b6 fix(connlib): attempt to detect runtime shutdown within TUN task (#7605)
Reading and writing to the TUN device within `connlib` happens in
dedicated threads. The task running within these threads is connected to
the rest of `connlib` via channels. When the application shuts down,
these threads also need to exit. Currently, we attempt to detect this
from within the task when these channels close. It appears that there is
a race condition here because we first attempt to read from the TUN
device before reading from the channels. We treat read & write errors on
the TUN device as non-fatal, so we loop around and attempt to read from
it again, causing an infinite loop and log spam.

To fix this, we swap the order in which we evaluate the two concurrent
tasks: the first task to be polled is now the channel for outbound
packets, and only if that one is empty do we attempt to read new packets
from the TUN device. This is also better from a backpressure point of
view: we should attempt to flush out our local buffers of already
processed packets before taking on "new work".
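
A sketch of the reordered loop using a biased `tokio::select!`; the
generic `tun` parameter and channel types are illustrative stand-ins for
the actual connlib types:

```rust
use tokio::io::{AsyncRead, AsyncReadExt, AsyncWrite, AsyncWriteExt};
use tokio::sync::mpsc;

async fn tun_task<T>(
    mut tun: T,
    mut outbound_rx: mpsc::Receiver<Vec<u8>>,
    inbound_tx: mpsc::Sender<Vec<u8>>,
) where
    T: AsyncRead + AsyncWrite + Unpin,
{
    let mut buf = vec![0u8; 1500];

    loop {
        tokio::select! {
            biased; // Evaluate branches top-to-bottom: flush outbound packets first.

            maybe_packet = outbound_rx.recv() => {
                let Some(packet) = maybe_packet else {
                    return; // Channel closed: the application is shutting down.
                };

                if let Err(e) = tun.write_all(&packet).await {
                    tracing::debug!("Failed to write to TUN device: {}", e); // Non-fatal.
                }
            }

            result = tun.read(&mut buf) => {
                match result {
                    Ok(n) => {
                        if inbound_tx.send(buf[..n].to_vec()).await.is_err() {
                            return; // Channel closed: the application is shutting down.
                        }
                    }
                    Err(e) => tracing::debug!("Failed to read from TUN device: {}", e), // Non-fatal.
                }
            }
        }
    }
}
```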

As a defense-in-depth strategy, we also attempt to detect the particular
error from the tokio runtime when it is being shut down and exit the
task.

Resolves: #7601.
Related: https://github.com/tokio-rs/tokio/issues/7056.
2025-01-05 20:41:24 +00:00
Thomas Eizinger
26824fb3c7 fix(gateway): check if we run with correct permissions (#7565)
The gateway needs either the `CAP_NET_ADMIN` capability or to run as
`root` in order to access the TUN device as well as configure routes via
`netlink`. Running without either leads to "Permission denied" errors at
runtime. It is good to fail early in these kinds of situations.

By checking for this capability early on during startup, these errors
should no longer surface later. As a bonus, we won't receive
(unactionable) Sentry alerts.
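
A sketch of such an early check, assuming the `caps` and `libc` crates;
the real implementation may query permissions differently:

```rust
fn assert_permissions() -> anyhow::Result<()> {
    let is_root = unsafe { libc::geteuid() } == 0;
    let has_net_admin = caps::has_cap(
        None,
        caps::CapSet::Effective,
        caps::Capability::CAP_NET_ADMIN,
    )
    .unwrap_or(false);

    anyhow::ensure!(
        is_root || has_net_admin,
        "firezone-gateway needs to run as root or with the CAP_NET_ADMIN capability"
    );

    Ok(())
}
```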

Resolves: #7559.

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
2024-12-29 21:45:56 +00:00
Thomas Eizinger
e7cc0e5eef fix(linux): don't fail on unsupported IP version (#7583)
Firezone always attempts to handle IPv4 and IPv6. On Linux systems
without an IPv6 stack, attempts to add an IPv6 route may fail with "Not
supported (os error 95)". We don't need the IPv6 routes on those systems
as we will never receive IPv6 traffic. Therefore, we can safely ignore
these errors and not log them.
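
For illustration, the benign case can be detected from the raw OS error:

```rust
// "Not supported (os error 95)" is EOPNOTSUPP on Linux.
fn is_unsupported_ip_stack(e: &std::io::Error) -> bool {
    e.raw_os_error() == Some(libc::EOPNOTSUPP)
}
```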
2024-12-25 11:09:22 +00:00
Thomas Eizinger
1b04b0eb2b fix(windows): don't warn on deleting non-existing route (#7507)
Similarly to Linux (#7502), we don't want to log an error if we cannot
delete a route that doesn't exist.
2024-12-13 21:09:09 +00:00
Thomas Eizinger
b5d6c27680 fix(linux): don't print error when removing non-existent route (#7502)
We are already handling one case where we are trying to remove a route
that doesn't exist. `ESRCH` is another variant of this error that
manifests as "No such process". According to the Internet, this just
means the route doesn't exist so we can bail out early here.
2024-12-13 04:53:22 +00:00
Thomas Eizinger
90cf191a7c feat(linux): multi-threaded TUN device operations (#7449)
## Context

At present, we only have a single thread that reads and writes to the
TUN device on all platforms. On Linux, it is possible to open the file
descriptor of a TUN device multiple times by setting the
`IFF_MULTI_QUEUE` option using `ioctl`. Using multi-queue, we can then
spawn multiple threads that concurrently read and write to the TUN
device. This is critical for achieving a better throughput.

## Solution

`IFF_MULTI_QUEUE` is a Linux-only feature and therefore only applies to
the headless-client, the GUI client on Linux and the Gateway (it may also
be possible on Android, I haven't tried). As such, we first need to
change our internal abstractions a bit to move the creation of the TUN
thread to the `Tun` abstraction itself. For this, we change the interface
of `Tun` to the following (sketched after the list):

- `poll_recv_many`: An API, inspired by tokio's `mpsc::Receiver` where
multiple items in a channel can be batch-received.
- `poll_send_ready`: Mimics the API of `Sink` to check whether more
items can be written.
- `send`: Mimics the API of `Sink` to actually send an item.
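
A sketch of that interface; the signatures and the `IpPacket` stand-in
are illustrative:

```rust
use std::task::{Context, Poll};

/// Stand-in for connlib's IP packet type.
pub struct IpPacket(pub Vec<u8>);

pub trait Tun: Send + 'static {
    /// Batch-receive up to `max` packets that the TUN worker threads have
    /// already read, mirroring tokio's `mpsc::Receiver::poll_recv_many`.
    fn poll_recv_many(
        &mut self,
        cx: &mut Context<'_>,
        buf: &mut Vec<IpPacket>,
        max: usize,
    ) -> Poll<usize>;

    /// Check whether another packet can be written, mirroring `Sink::poll_ready`.
    fn poll_send_ready(&mut self, cx: &mut Context<'_>) -> Poll<std::io::Result<()>>;

    /// Hand one packet to the device, mirroring `Sink::start_send`.
    fn send(&mut self, packet: IpPacket) -> std::io::Result<()>;
}
```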

With these APIs in place, we can implement various (performance)
improvements for the different platforms.

- On Linux, this allows us to spawn multiple threads to read and write
from the TUN device and send all packets into the same channel. The `Io`
component of `connlib` then uses `poll_recv_many` to read batches of up
to 100 packets at once. This ties in well with #7210 because we can then
use GSO to send the encrypted packets in single syscalls to the OS.
- On Windows, we already have a dedicated recv thread because `WinTun`'s
most-convenient API uses blocking IO. As such, we can now also tie into
that by batch-receiving from this channel.
- In addition to using multiple threads, this API now also uses correct
readiness checks on Linux, Darwin and Android to uphold backpressure in
case we cannot write to the TUN device.

## Configuration

Local testing has shown that 2 threads give the best performance for a
local `iperf3` run. I suspect this is because there is only so much
traffic that a single application (i.e. `iperf3`) can generate. With
more than 2 threads, the throughput actually drops drastically because
`connlib`'s main thread is too busy with lock-contention and triggering
`Waker`s for the TUN threads (which mostly idle around if there are 4+
of them). I've made it configurable on the Gateway though so we can
experiment with this during concurrent speedtests etc.

In addition, switching `connlib` to a single-threaded tokio runtime
further increased the throughput, I suspect due to less task / context
switching.

## Results

Local testing with `iperf3` shows some very promising results. We now
achieve a throughput of 2+ Gbit/s.

```
Connecting to host 172.20.0.110, port 5201
Reverse mode, remote host 172.20.0.110 is sending
[  5] local 100.80.159.34 port 57040 connected to 172.20.0.110 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec   274 MBytes  2.30 Gbits/sec
[  5]   1.00-2.00   sec   279 MBytes  2.34 Gbits/sec
[  5]   2.00-3.00   sec   216 MBytes  1.82 Gbits/sec
[  5]   3.00-4.00   sec   224 MBytes  1.88 Gbits/sec
[  5]   4.00-5.00   sec   234 MBytes  1.96 Gbits/sec
[  5]   5.00-6.00   sec   238 MBytes  2.00 Gbits/sec
[  5]   6.00-7.00   sec   229 MBytes  1.92 Gbits/sec
[  5]   7.00-8.00   sec   222 MBytes  1.86 Gbits/sec
[  5]   8.00-9.00   sec   223 MBytes  1.87 Gbits/sec
[  5]   9.00-10.00  sec   217 MBytes  1.82 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  2.30 GBytes  1.98 Gbits/sec  22247             sender
[  5]   0.00-10.00  sec  2.30 GBytes  1.98 Gbits/sec                  receiver

iperf Done.
```

This is a pretty solid improvement over what is in `main`:

```
Connecting to host 172.20.0.110, port 5201
[  5] local 100.65.159.3 port 56970 connected to 172.20.0.110 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  90.4 MBytes   758 Mbits/sec  1800    106 KBytes
[  5]   1.00-2.00   sec  93.4 MBytes   783 Mbits/sec  1550   51.6 KBytes
[  5]   2.00-3.00   sec  92.6 MBytes   777 Mbits/sec  1350   76.8 KBytes
[  5]   3.00-4.00   sec  92.9 MBytes   779 Mbits/sec  1800   56.4 KBytes
[  5]   4.00-5.00   sec  93.4 MBytes   783 Mbits/sec  1650   69.6 KBytes
[  5]   5.00-6.00   sec  90.6 MBytes   760 Mbits/sec  1500   73.2 KBytes
[  5]   6.00-7.00   sec  87.6 MBytes   735 Mbits/sec  1400   76.8 KBytes
[  5]   7.00-8.00   sec  92.6 MBytes   777 Mbits/sec  1600   82.7 KBytes
[  5]   8.00-9.00   sec  91.1 MBytes   764 Mbits/sec  1500   70.8 KBytes
[  5]   9.00-10.00  sec  92.0 MBytes   771 Mbits/sec  1550   85.1 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   917 MBytes   769 Mbits/sec  15700             sender
[  5]   0.00-10.00  sec   916 MBytes   768 Mbits/sec                  receiver

iperf Done.
```
2024-12-05 00:18:20 +00:00
Thomas Eizinger
4f92a0d7ca refactor(gui-client): tidy up GUI controller code (#7444)
This PR intends to be a pure refactoring, i.e. no behaviour change. It
simplifies a few aspects of the GUI controller event-loop by getting rid
of the `select!` macro. We also remove some indirection of the
`gui_controller::Builder`.
2024-12-02 20:07:44 +00:00
Thomas Eizinger
0a6554122a feat(connlib): utilise GSO for UDP sockets (#7210)
## Context

At present, `connlib` sends UDP packets one at a time. Sending a packet
requires us to make a syscall which is quite expensive. Under load, i.e.
during a speedtest, syscalls account for over 50% of our CPU time [0].
In order to improve this situation, we need to somehow make use of GSO
(generic segmentation offload). With GSO, we can send multiple packets
to the same destination in a single syscall.

The tricky question here is: how can we have multiple UDP
packets ready at once so that we can send them in a single syscall? Our TUN
interface only feeds us packets one at a time and `connlib`'s state
machine is single-threaded. Additionally, we currently only have a
single `EncryptBuffer` in which the to-be-sent datagram sits.

## 1. Stack-allocating encrypted IP packets

As a first step, we get rid of the single `EncryptBuffer` and instead
stack-allocate each encrypted IP packet. Due to our small MTU, these
packets are only around 1300 bytes. Stack-allocating that requires a few
memcpy's but those are in the single-digit % range in the terms of CPU
time performance hit. That is nothing compared to how much time we are
spending on UDP syscalls. With the `EncryptBuffer` out the way, we can
now "freely" move around the `EncryptedPacket` structs and - technically
- we can have multiple of them at the same time.

## 2. Implementing GSO

The GSO interface allows you to pass multiple packets **of the same
length and for the same destination** in a single syscall, meaning we
cannot just batch up arbitrary UDP packets. Counterintuitively, making
use of GSO requires us to do more copying: in particular, we change the
interface of `Io` such that "sending" a packet essentially performs a
lookup of a `BytesMut` buffer by destination and packet length and
appends the payload to that buffer.
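
A sketch of that lookup-and-append, with illustrative types standing in
for the real `GsoQueue`:

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

use bytes::BytesMut;

#[derive(Default)]
pub struct GsoQueue {
    // One buffer per (destination, segment length); everything in one buffer
    // can later be flushed to the socket in a single GSO syscall.
    buffers: HashMap<(SocketAddr, usize), BytesMut>,
}

impl GsoQueue {
    /// "Sending" a packet appends it to the buffer for its (destination,
    /// length) pair instead of writing it to the socket right away.
    pub fn enqueue(&mut self, dst: SocketAddr, encrypted_packet: &[u8]) {
        self.buffers
            .entry((dst, encrypted_packet.len()))
            .or_default()
            .extend_from_slice(encrypted_packet);
    }
}
```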

## 3. Batch-read IP packets

In order to actually perform GSO, we need to process more than a single
IP packet in one event-loop tick. We achieve this by batch-reading up to
50 IP packets from the mpsc-channel that connects `connlib`'s main
event-loop with the dedicated thread that reads and writes to the TUN
device. These reads and writes happen concurrently to `connlib`'s packet
processing. Thus, it is likely that by the time `connlib` is ready to
process another IP packet, multiple have been read from the device and
are sitting in the channel. Batch-processing these IP packets means that
the buffers in our `GsoQueue` are more likely to contain more than a
single datagram.

Imagine you are running a file upload. The OS will send many packets to
the same destination IP, likely at max MTU, to the TUN device. It is
likely that we read 10-20 of these packets in one batch (i.e. within a
single "tick" of the event-loop). All packets will be appended to the
same buffer in the `GsoQueue` and on the next event-loop tick, they will
all be flushed out in a single syscall.

## Results

Overall, this results in a significant reduction of syscalls for sending
UDP messages. In [1], we spend only a total of 16% of our CPU time in
`udpv6_sendmsg` whereas in [0] (main), we spent a total of 34%. Do note
that these numbers are relative to the total CPU time spent per program
run and thus can't be compared directly (i.e. you cannot just do 34 - 16
and say we now spend 18% less time sending UDP packets). Nevertheless,
this appears to be a great improvement.

In terms of throughput, we achieve a ~60% improvement in our benchmark
suite. That one is running on localhost though, so it might not
necessarily be reflected like that in a real network.

[0]: https://share.firefox.dev/4hvoPju
[1]: https://share.firefox.dev/4frhCPv
2024-12-02 01:09:44 +00:00
Thomas Eizinger
c6e7e6192e build(rust): bump Rust to 1.83 (#7409)
Rust 1.83 comes with a bunch of new lints for elidable lifetimes. Those
also trigger in the generated code of `derivative`. That crate is
actually unmaintained, so we replace our usages of it with `derive_more`.
2024-11-29 01:04:06 +00:00
Thomas Eizinger
24f7ba530d refactor(gui-client): add more context to connection failures (#7364)
Adding more context to these errors makes it easier to identify which
of the operations fails. In addition, we remove some usages of the "log
and return" anti-pattern to avoid duplicate reports of the same issue.
2024-11-18 18:16:16 +00:00
Thomas Eizinger
48ba2869a8 chore(rust): ban the use of .unwrap except in tests (#7319)
Using the clippy lint `unwrap_used`, we can automatically lint against
all uses of `.unwrap()` on `Result` and `Option`. This turns up quite a
few results actually. In most cases, they are invariants that can't
actually be hit. For these, we change them to `Option`. In other cases,
they can actually be hit, for example if the user supplies an invalid
log-filter.
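
For illustration (hypothetical example), with the lint denied the
fallible path has to be handled explicitly:

```rust
#![deny(clippy::unwrap_used)] // Workspace-wide this is typically set via the lints config.

fn parse_port(input: &str) -> anyhow::Result<u16> {
    // `input.parse().unwrap()` would now be rejected by clippy; a user-supplied
    // value can genuinely be invalid, so surface a proper error instead.
    input
        .parse()
        .map_err(|_| anyhow::anyhow!("invalid port: {input}"))
}
```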

Activating this lint ensures the compiler will yell at us every time we
use `.unwrap` to double-check whether we do indeed want to panic here.

Resolves: #7292.
2024-11-13 03:59:22 +00:00
Thomas Eizinger
ad4eea29ff chore(rust): don't panic in fallible functions (#7298)
"Just let it crash" is terrible advice for software that is shipped to
end users. Where possible, we should use proper error handling and only
fail the current function / task that is active, e.g. drop a particular
packet instead of failing all of connlib. We more or less already do
that.

Activating the clippy lint `unwrap_in_result` surfaced a few more places
where we panic despite being in a function that is fallible already.
These cases can easily be converted to not panic and return an error
instead.
2024-11-11 23:55:23 +00:00
Thomas Eizinger
e261cb3c27 chore: remove git_version! (#7270)
Reading the Git version requires the entire Git repository to be
present, including all tags. The tags are only created _after_ the
artifact is built, when we publish the release. Therefore, these
tags are never included in the actual released binary.

For Sentry, we use the `CARGO_PKG_VERSION` variable instead. This
doesn't tell us whether somebody built a client from source and then
used it, so there could be some confusion in Sentry events. It is quite
unlikely that this happens though, so for the majority of Sentry alerts,
this will give us the correct version.

For the Android client, we also depend on the `GITHUB_SHA` env variable
at compile-time. We do the same thing for the GUI client here.
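
A sketch of the compile-time wiring (constant names are illustrative):

```rust
pub const VERSION: &str = env!("CARGO_PKG_VERSION");

// `GITHUB_SHA` is set by GitHub Actions at build time; builds from source
// simply get `None`.
pub const RELEASE_COMMIT: Option<&str> = option_env!("GITHUB_SHA");
```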

Resolves: #6925.
2024-11-07 22:56:17 +00:00
Thomas Eizinger
78ebad13ab chore(rust): log more errors as tracing::Values (#7208)
Logging these as structured values gives us a better stacktrace in
Sentry (assuming the errors themselves make proper use of an
error chain).
2024-11-05 14:36:47 +00:00
Thomas Eizinger
7213eb823d fix(rust): fallback to CARGO_PKG_VERSION if git is unavailable (#7188)
When building inside a docker container, like we do for the
headless-client and gateway, the `.git` directory is not available.
Thus, determining our current version fails and it gets reported as
"unknown". We are now also using this version for Sentry, which is not
very helpful if all errors are categorised under the same version.

In case somebody builds a gateway / client from source, we will have the
full version available. Most users will use our docker containers
though, meaning the version will always be that of a full release.

Resolves: #7184.
2024-10-30 17:42:44 +00:00
Thomas Eizinger
73eebd2c4d refactor(rust): consistently record errors as tracing::Value (#7104)
Our logging library, `tracing`, supports structured logging. This is
useful because it preserves more than just the string representation
of a value and thus allows the active logging backend(s) to capture more
information about a particular value.

In the case of errors, this is especially useful because it allows us to
capture the sources of a particular error.

Unfortunately, recording an error as a tracing value is a bit cumbersome
because `tracing::Value` is only implemented for `&dyn
std::error::Error`. Casting an error to this is quite verbose. To make
it easier, we introduce two utility functions in `firezone-logging`
(sketched below):

- `std_dyn_err`
- `anyhow_dyn_err`
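
A sketch of what these two helpers boil down to (the real ones live in
`firezone-logging` and may differ in detail):

```rust
pub fn std_dyn_err<E>(e: &E) -> &(dyn std::error::Error + 'static)
where
    E: std::error::Error + 'static,
{
    e
}

pub fn anyhow_dyn_err(e: &anyhow::Error) -> &(dyn std::error::Error + Send + Sync + 'static) {
    &**e
}

// Usage: `tracing::warn!(error = std_dyn_err(&e), "Failed to update DNS")`
// records the error as a structured value, source chain included.
```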

Tracking errors as correct `tracing::Value`s will be especially helpful
once we enable Sentry's `tracing` integration:
https://docs.rs/sentry-tracing/latest/sentry_tracing/#tracking-errors
2024-10-22 04:46:26 +00:00
Reactor Scram
9b93fc2a2c fix(rust/client/windows): set our DNS resolvers on our interface (#6931)
Closes #6777
2024-10-07 15:03:22 +00:00
Thomas Eizinger
4ae29c604c fix(windows): only consider online adapters (#6810)
When deciding which interface we are going to use for connecting to the
portal API, we need to filter through all adapters on Windows and
exclude our own TUN adapter to avoid routing loops. In addition, we also
need to filter for only online adapters, otherwise we might pick one
that is not actually routable.

Resolves: #6802.
2024-09-25 21:19:15 +00:00
Thomas Eizinger
a9f515a453 chore(rust): use #[expect] instead of #[allow] (#6692)
The `expect` attribute is similar to `allow` in that it will silence a
particular lint. Unlike `allow` however, `expect` will fail as
soon as the lint is no longer emitted. This ensures we don't end up with
stale `allow` attributes in our codebase. Additionally, it provides a
way of adding a `reason` to document why the lint is being suppressed.
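
A hypothetical example of the difference:

```rust
// While `unused_helper` is actually unused, `expect` is as quiet as `allow`
// would be. Once the function gains a caller, the lint is no longer emitted
// and the compiler flags the now-stale expectation instead.
#[expect(dead_code, reason = "will be wired up in a follow-up")]
fn unused_helper() {}
```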
2024-09-16 13:51:12 +00:00
Reactor Scram
54b6222722 fix(client/windows): set MTU even if IPv6 is disabled (#6681)
Refs #6547, this fixes a similar error message but it's not the same
exact issue.

When IPv6 is disabled on a system, our call to set the MTU was failing
with error code 0x80070490. This patch allows some of the MTU-related
syscalls to fail with a warning log.

To replicate the issue, run this command to set a registry value to
disable IPv6, then reboot the system:

`reg add
"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip6\Parameters"
/v DisabledComponents /t REG_DWORD /d 255 /f`

```[tasklist]
- [x] Update changelog
- [x] Apply PR feedback
```

---------

Signed-off-by: Reactor Scram <ReactorScram@users.noreply.github.com>
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
2024-09-13 17:43:21 +00:00
Reactor Scram
5a44151bba test(bin-shared): improve network notifier test (#6676)
On Windows, the network notifier always notifies once at startup. We
make the DNS notifier and Linux match this behavior, and we assert it in
the unit test.

Part of a yak shave towards removing Tauri.
2024-09-13 14:53:13 +00:00
Reactor Scram
ece8f7a5b7 test(rust/client): add unit test for DNS and network change notifiers (#6635)
Part of a yak shave towards removing Tauri.

---------

Signed-off-by: Reactor Scram <ReactorScram@users.noreply.github.com>
2024-09-12 15:59:13 +00:00
Reactor Scram
f507a01f9f fix(windows): prevent routing loops for TCP connections (#6584)
In #6032, we attempted to fix routing loops for Windows and did so
successfully for UDP packets. For TCP sockets, we believed that binding
the socket to an interface is enough to prevent routing loops. This
assumption is wrong.

> On Windows, a call to bind() affects card selection only for incoming
traffic, not outgoing traffic.
>
> Thus, on a client running in a multi-homed system (i.e., more than one
interface card), it's the network stack that selects the card to use,
and it makes its selection based solely on the destination IP, which in
turn is based on the routing table. A call to bind() will not affect the
choice of the card in any way. [0]

On most of our testing machines, this problem didn't surface but it
turns out that on some machines, especially ones with WiFi cards, there
is a conflict between the routes added on the system. In particular, with
the Internet resource active, we also add a catch-all route that we _want_
to have the most priority, i.e. Windows SHOULD send all traffic to our
TUN device, except for traffic that we generate, like TCP connections to
the portal or UDP packets sent to gateways, relays or DNS servers.

It appears that on some systems, mostly with Ethernet adapters, Windows
picks the "correct" interface for our socket and sends traffic via that
but on other systems, it doesn't. TCP sockets are only used for the
WebSocket connection to the portal. Without that one, Firezone
completely stops working because we can't send any control messages.

To reliably fix this issue, we need to add a dedicated route for the
target IP of each TCP socket that is more specific than the Internet
resource route (`0.0.0.0/0`) but otherwise identical. We do this as part
of creating a new TCP socket. This route is for the _default_ interface
and thus doesn't get automatically removed when Firezone exits.

We implement a RAII guard that attempts to drop the route on a
best-effort basis. Despite this RAII guard, the route can linger around
in case Firezone is forced to exit or exits in otherwise unclean
ways. To avoid lingering routes, we always delete all routing table
entries matching the IP of the portal just before we are about to add
one.
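
A sketch of that guard; the two route helpers are hypothetical stand-ins
for the Win32 routing-table calls:

```rust
use std::net::IpAddr;

struct PortalRouteGuard {
    destination: IpAddr,
}

impl PortalRouteGuard {
    fn add(destination: IpAddr) -> Self {
        // Clean up anything a previous, unclean exit may have left behind,
        // then add a host route that is more specific than `0.0.0.0/0`.
        remove_all_routes_to(destination);
        add_specific_route(destination);

        Self { destination }
    }
}

impl Drop for PortalRouteGuard {
    fn drop(&mut self) {
        // Best effort: if this never runs (e.g. force-kill), the cleanup in
        // `add` removes the stale entry on the next launch.
        remove_all_routes_to(self.destination);
    }
}

// Hypothetical helpers standing in for the actual route-table syscalls.
fn add_specific_route(_ip: IpAddr) {}
fn remove_all_routes_to(_ip: IpAddr) {}
```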

Fixes: #6591.

[0]:
https://forums.codeguru.com/showthread.php?487139-Socket-binding-with-routing-table&s=a31637836c1bf7f0bc71c1955e47bdf9&p=1891235#post1891235

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Foo Bar <foo@bar.com>
Co-authored-by: conectado <gabrielalejandro7@gmail.com>
2024-09-05 06:17:28 +00:00
Reactor Scram
df3067a8f8 chore(rust/windows): more detailed error log for wintun::Adapter::create (#6596)
Without this we don't log the `std::io::ErrorKind`, which is useful to
know.

Refs #6547
2024-09-05 02:43:25 +00:00
Thomas Eizinger
e3688a475e refactor(connlib): only buffer 1 unsent packet if socket is busy (#6563)
Currently, we buffer UDP packets whenever the socket is busy and try to
flush them out at a later point. This requires allocations and is tricky
to get right.

In order to solve both of these problems, we refactor `snownet` to
return us an `EncryptedPacket` instead of a `Transmit`. An
`EncryptedPacket` is an indirection-abstraction that can be turned into
a `Transmit` given an `EncryptBuffer`. This combination of types allows
us to hold on to the `EncryptedPacket` (which does not contain any
references itself) in the `io` component whilst we are waiting for the
socket to be ready to send again.
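
A sketch of the indirection (field layout is illustrative):

```rust
use std::net::SocketAddr;

/// Owns no references; only records where the encrypted payload sits.
pub struct EncryptedPacket {
    dst: SocketAddr,
    offset: usize,
    len: usize,
}

pub struct EncryptBuffer(Vec<u8>);

pub struct Transmit<'a> {
    pub dst: SocketAddr,
    pub payload: &'a [u8],
}

impl EncryptedPacket {
    /// Only borrow the buffer once the socket is actually ready to send.
    pub fn to_transmit<'a>(&self, buf: &'a EncryptBuffer) -> Transmit<'a> {
        Transmit {
            dst: self.dst,
            payload: &buf.0[self.offset..self.offset + self.len],
        }
    }
}
```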

This means we will immediately suspend the event loop in case the socket
is no longer ready for sending and resend the datagram in the
`EncryptBuffer` once we get re-polled.
2024-09-04 16:59:33 +00:00
Reactor Scram
d7810ef9c0 chore(rust/gui-client/windows): update windows to 0.58 (#6565)
Updates `windows` crates to 0.58 without the bug in #6551.

Supersedes #6556.

The bug was calling `try_send()?` on an MPSC channel of capacity 1,
which would bail out of the worker thread if we got 2 DNS change
notifications faster than the controller task / thread could process the
first one.
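
The linked PRs don't spell the fix out here, but one common way to make
such a coalescing, capacity-1 channel robust is to treat `Full` as "a
notification is already pending" (illustrative only):

```rust
use tokio::sync::mpsc::{error::TrySendError, Sender};

fn notify_dns_changed(tx: &Sender<()>) -> anyhow::Result<()> {
    match tx.try_send(()) {
        // A full capacity-1 channel means a notification is already queued,
        // so there is nothing left to do; only a closed channel is fatal.
        Ok(()) | Err(TrySendError::Full(())) => Ok(()),
        Err(TrySendError::Closed(())) => anyhow::bail!("notification channel closed"),
    }
}
```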
2024-09-03 04:18:46 +00:00
Reactor Scram
1505b699e5 fix(client/windows): Revert "chore(rust/gui-client/windows): update windows to 0.58 (#6506)" (#6555)
This reverts commit d8f25f9bf8.

#6506 broke the Clients and I guess I didn't do any manual smoke test,
so I didn't catch it.

I have leads for a permanent fix in #6551 but I don't want to leave
`main` broken since it will screw up bisects.
2024-09-02 20:25:10 +00:00
Reactor Scram
d8f25f9bf8 chore(rust/gui-client/windows): update windows to 0.58 (#6506)
Supersedes #5913

This required a big refactor because `HANDLE` is no longer `Send` and
was never supposed to be.

So we add a worker thread for listening to DNS changes, since that
requires us to hold a `HANDLE` across `await` points and I couldn't find
any simpler way to do it.

I could add integration tests for this in a future PR that prove the
notifiers work by poking the registry or setting DNS servers and seeing
if we pick up the changes on time. But setting DNS servers without the
tunnel up may be tricky, so I left it out of scope for this PR.

```[tasklist]
- [x] Fix force-kill bug
```
2024-09-02 18:00:45 +00:00
Thomas Eizinger
4c30d78cda fix: refer to correct tag in git-version (#6334)
The output of `git describe` always refers to the last tag that it can
find. This leads to confusing versions being printed such as:

```
2024-08-19T00:24:08.983891Z  INFO firezone_headless_client: arch="x86_64" git_version="gateway-1.1.5-30-gf82fee162-modified"
```

Note that this is code running in the headless-client and it refers to
the gateway tag. Whilst not wrong from git's PoV, it is certainly
confusing.

We can fix this by providing a glob-pattern to `git describe` via
`--match`. This makes git ignore any other tags and print a version
identifier that refers to the current program:

```
2024-08-19T00:39:48.634191Z  INFO firezone_headless_client: arch="x86_64" git_version="headless-client-1.1.7-31-ga08a3411d-modified"
```
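
Illustrative only: the equivalent `git describe` invocation, shown via
`std::process::Command` rather than the build-time macro:

```rust
fn describe(component: &str) -> std::io::Result<String> {
    let output = std::process::Command::new("git")
        .args(["describe", "--tags", "--always", "--dirty=-modified"])
        // Only consider tags of the current component, e.g. "headless-client-*".
        .arg(format!("--match={component}-*"))
        .output()?;

    Ok(String::from_utf8_lossy(&output.stdout).trim().to_owned())
}
```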
2024-08-19 22:42:15 +00:00
Thomas Eizinger
c51cf096ae build(rust): avoid unnecessary rebuilds (#6321)
Parsing the current Git version within `firezone-bin-shared` means this
crate (and all its dependents) need to be rebuilt every time one makes a
commit, even if none of the code actually changes.

To avoid this whilst still allowing `firezone-bin-shared` to export a
useful, shared function, we export a macro instead that can be called
from the respective crates that need the Git version. This means only
those binaries will be marked as dirty and rebuilds of e.g. unit tests
don't need to rebuild these workspace crates.
2024-08-16 15:30:04 +00:00
Thomas Eizinger
0abbf6bba9 refactor(rust): inline http-health-check crate into bin-shared (#6258)
Now that we have the `bin-shared` crate, it is easy to move the
health-check functionality into there. That allows us to get rid of a
crate, which makes navigating the workspace a bit easier.
2024-08-12 16:44:52 +00:00
Thomas Eizinger
bed625a312 chore(rust): make logging more ergonomic (#6237)
Setting up a logger is something that pretty much every entrypoint needs
to do, be it a test, a shared library embedded in another app or a
standalone application. Thus, it makes sense to introduce a dedicated
crate that bundles together everything about how we want to do logging.

This allows us to introduce convenience functions like
`firezone_logging::test` which allow you to construct a logger for a
test as a one-liner.

Crucially though, introducing `firezone-logging` gives us a place to
store a default log directive that silences very noisy crates. When
looking into a problem, it is common to start by simply setting the
log-filter to `debug`. Without further action, this floods the output
with logs from crates like `netlink_proto` on Linux. It is very unlikely
that those are the logs that you want to see. Without a preset filter,
the only alternative here is to explicitly turn off the log filter for
`netlink_proto` by typing something like
`RUST_LOG=netlink_proto=off,debug`. Especially when debugging issues
with customers, this is annoying.

Log filters can be overridden, i.e. a 2nd filter that matches the exact
same scope overrides a previous one. Thus, with this design it is still
possible to activate certain logs at runtime, even if they are silenced
by default.
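
A sketch of layering such a preset under the user's directives, assuming
`tracing-subscriber`'s `EnvFilter`; the concrete directive is just an
example:

```rust
use tracing_subscriber::EnvFilter;

fn parse_filter(user_directives: &str) -> anyhow::Result<EnvFilter> {
    // The preset comes first; a later directive for the same target (e.g. a
    // user-supplied "netlink_proto=trace") overrides it.
    EnvFilter::try_new(format!("netlink_proto=off,{user_directives}"))
        .map_err(|e| anyhow::anyhow!("invalid log filter: {e}"))
}
```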

I'd expect `firezone-logging` to attract more functionality in the
future. For example, we want to support re-loading of log-filters on
other platforms. Additionally, where logs get stored could also be
defined in this crate.

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Reactor Scram <ReactorScram@users.noreply.github.com>
2024-08-10 05:17:03 +00:00