Commit Graph

11 Commits

Author SHA1 Message Date
Thomas Eizinger
90cf191a7c feat(linux): multi-threaded TUN device operations (#7449)
## Context

At present, we only have a single thread that reads and writes to the
TUN device on all platforms. On Linux, it is possible to open the file
descriptor of a TUN device multiple times by setting the
`IFF_MULTI_QUEUE` option using `ioctl`. Using multi-queue, we can then
spawn multiple threads that concurrently read and write to the TUN
device. This is critical for achieving a better throughput.

## Solution

`IFF_MULTI_QUEUE` is a Linux-only thing and therefore only applies to
headless-client, GUI-client on Linux and the Gateway (it may also be
possible on Android, I haven't tried). As such, we need to first change
our internal abstractions a bit to move the creation of the TUN thread
to the `Tun` abstraction itself. For this, we change the interface of
`Tun` to the following:

- `poll_recv_many`: An API, inspired by tokio's `mpsc::Receiver` where
multiple items in a channel can be batch-received.
- `poll_send_ready`: Mimics the API of `Sink` to check whether more
items can be written.
- `send`: Mimics the API of `Sink` to actually send an item.

With these APIs in place, we can implement various (performance)
improvements for the different platforms.

- On Linux, this allows us to spawn multiple threads to read and write
from the TUN device and send all packets into the same channel. The `Io`
component of `connlib` then uses `poll_recv_many` to read batches of up
to 100 packets at once. This ties in well with #7210 because we can then
use GSO to send the encrypted packets in single syscalls to the OS.
- On Windows, we already have a dedicated recv thread because `WinTun`'s
most-convenient API uses blocking IO. As such, we can now also tie into
that by batch-receiving from this channel.
- In addition to using multiple threads, this API now also uses correct
readiness checks on Linux, Darwin and Android to uphold backpressure in
case we cannot write to the TUN device.

## Configuration

Local testing has shown that 2 threads give the best performance for a
local `iperf3` run. I suspect this is because there is only so much
traffic that a single application (i.e. `iperf3`) can generate. With
more than 2 threads, the throughput actually drops drastically because
`connlib`'s main thread is too busy with lock-contention and triggering
`Waker`s for the TUN threads (which mostly idle around if there are 4+
of them). I've made it configurable on the Gateway though so we can
experiment with this during concurrent speedtests etc.

In addition, switching `connlib` to a single-threaded tokio runtime
further increased the throughput. I suspect due to less task / context
switching.

## Results

Local testing with `iperf3` shows some very promising results. We now
achieve a throughput of 2+ Gbit/s.

```
Connecting to host 172.20.0.110, port 5201
Reverse mode, remote host 172.20.0.110 is sending
[  5] local 100.80.159.34 port 57040 connected to 172.20.0.110 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec   274 MBytes  2.30 Gbits/sec
[  5]   1.00-2.00   sec   279 MBytes  2.34 Gbits/sec
[  5]   2.00-3.00   sec   216 MBytes  1.82 Gbits/sec
[  5]   3.00-4.00   sec   224 MBytes  1.88 Gbits/sec
[  5]   4.00-5.00   sec   234 MBytes  1.96 Gbits/sec
[  5]   5.00-6.00   sec   238 MBytes  2.00 Gbits/sec
[  5]   6.00-7.00   sec   229 MBytes  1.92 Gbits/sec
[  5]   7.00-8.00   sec   222 MBytes  1.86 Gbits/sec
[  5]   8.00-9.00   sec   223 MBytes  1.87 Gbits/sec
[  5]   9.00-10.00  sec   217 MBytes  1.82 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  2.30 GBytes  1.98 Gbits/sec  22247             sender
[  5]   0.00-10.00  sec  2.30 GBytes  1.98 Gbits/sec                  receiver

iperf Done.
```

This is a pretty solid improvement over what is in `main`:

```
Connecting to host 172.20.0.110, port 5201
[  5] local 100.65.159.3 port 56970 connected to 172.20.0.110 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  90.4 MBytes   758 Mbits/sec  1800    106 KBytes
[  5]   1.00-2.00   sec  93.4 MBytes   783 Mbits/sec  1550   51.6 KBytes
[  5]   2.00-3.00   sec  92.6 MBytes   777 Mbits/sec  1350   76.8 KBytes
[  5]   3.00-4.00   sec  92.9 MBytes   779 Mbits/sec  1800   56.4 KBytes
[  5]   4.00-5.00   sec  93.4 MBytes   783 Mbits/sec  1650   69.6 KBytes
[  5]   5.00-6.00   sec  90.6 MBytes   760 Mbits/sec  1500   73.2 KBytes
[  5]   6.00-7.00   sec  87.6 MBytes   735 Mbits/sec  1400   76.8 KBytes
[  5]   7.00-8.00   sec  92.6 MBytes   777 Mbits/sec  1600   82.7 KBytes
[  5]   8.00-9.00   sec  91.1 MBytes   764 Mbits/sec  1500   70.8 KBytes
[  5]   9.00-10.00  sec  92.0 MBytes   771 Mbits/sec  1550   85.1 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   917 MBytes   769 Mbits/sec  15700             sender
[  5]   0.00-10.00  sec   916 MBytes   768 Mbits/sec                  receiver

iperf Done.
```
2024-12-05 00:18:20 +00:00
Thomas Eizinger
48ba2869a8 chore(rust): ban the use of .unwrap except in tests (#7319)
Using the clippy lint `unwrap_used`, we can automatically lint against
all uses of `.unwrap()` on `Result` and `Option`. This turns up quite a
few results actually. In most cases, they are invariants that can't
actually be hit. For these, we change them to `Option`. In other cases,
they can actually be hit. For example, if the user supplies an invalid
log-filter.

Activating this lint ensures the compiler will yell at us every time we
use `.unwrap` to double-check whether we do indeed want to panic here.

Resolves: #7292.
2024-11-13 03:59:22 +00:00
Thomas Eizinger
29bc276bf2 refactor(connlib): parallelise TUN operations (#6673)
Currently, `connlib` is entirely single-threaded. This allows us to
reuse a single buffer for processing IP packets and makes reasoning of
the packet processing code very simple. Being single-threaded also means
we can only make use of a single CPU core and all operations have to be
sequential.

Analyzing `connlib` using `perf` shows that we spend 26% of our CPU time
writing packets to the TUN interface [0]. Because we are
single-threaded, `connlib` cannot do anything else during this time. If
we could offload the writing of these packets to a different thread,
`connlib` could already process the next packet while the current one is
writing.

Packets that we send to the TUN interface arrived as an encrypted WG
packet over UDP and get decrypted into a - currently - shared buffer.
Moving the writing to a different thread implies that we have to have
more of these buffer that the next packet(s) can be decrypted into.

To avoid IP fragmentation, we set the maximum IP MTU to 1280 bytes on
the TUN interface. That actually isn't very big and easily fits into a
stackframe. The default stack size for threads is 2MB [1].

Instead of creating more buffers and cycling through them, we can also
simply stack-allocate our IP packets. This incurs some overhead from
copying packets but it is only ~3.5% [2] (This was measured without a
separate thread). With stack-allocated packets, almost all
lifetime-annotations go away which in itself is already a welcome
ergonomics boost. Stack-allocated packets also means we can simply spawn
a new thread for the packet processing. This thread is connected with
two channel to connlib's main thread. The capacity of 1000 packets will
at most consume an additional 3.5 MB of memory which is fine even on our
most-constrained devices such as iOS.

[0]: https://share.firefox.dev/3z78CzD
[1]: https://doc.rust-lang.org/std/thread/#stack-size
[2]: https://share.firefox.dev/3Bf4zla

Resolves: #6653.
Resolves: #5541.
2024-09-26 03:03:35 +00:00
Thomas Eizinger
a9f515a453 chore(rust): use #[expect] instead of #[allow] (#6692)
The `expect` attribute is similar to `allow` in that it will silence a
particular lint. In addition to `allow` however, `expect` will fail as
soon as the lint is no longer emitted. This ensures we don't end up with
stale `allow` attributes in our codebase. Additionally, it provides a
way of adding a `reason` to document, why the lint is being suppressed.
2024-09-16 13:51:12 +00:00
Thomas Eizinger
7adbf9c6af refactor(connlib): remove pnet_packet (#6659)
As the final step in removing `pnet_packet`, we need to introduce `-Mut`
equivalent slices for UDP, TCP and ICMP packets. As a starting point,
introducing `UpdHeaderSliceMut` and `TcpHeaderSliceMut` is fairly
trivial. The ICMP variants are a bit trickier because those are
different for IPv4 and IPv6. Additionally, ICMP for IPv4 is quite
complex because it can have a variable header length. Additionally. for
both variants, the values in byte range 5-8 are semantically different
depending on the ICMP code.

This requires us to design an API that balances ergonomics and
correctness. Technically, an ICMP identifier and sequence can only be
set if the ICMP code is "echo request" or "echo reply". However, adding
an additional parsing step to guarantee this in the type system is quite
verbose.

The trade-off implemented in this PR allows to us to directly write to
the byte 5-8 using the `set_identifier` and `set_sequence` functions. To
catch errors early, this functions have debug-assertions built in that
ensure that the packet is indeed an ICMP echo packet.

Resolves: #6366.
2024-09-11 23:52:48 +00:00
Thomas Eizinger
133c2565b2 refactor(connlib): merge IpPacket and MutableIpPacket (#6652)
Currently, we have two structs for representing IP packets: `IpPacket`
and `MutableIpPacket`. As the name suggests, they mostly differ in
mutability. This design was originally inspired by the `pnet_packet`
crate which we based our `IpPacket` on. With subsequent iterations, we
added more and more functionality onto our `IpPacket`, like NAT64 &
NAT46 translation. As a result of that, the `MutableIpPacket` is no
longer directly based on `pnet_packet` but instead just keeps an
internal buffer.

This duplication can be resolved by merging the two structs into a
single `IpPacket`. We do this by first replacing all usages of
`IpPacket` with `MutableIpPacket`, deleting `IpPacket` and renaming
`MutableIpPacket` to `IpPacket`. The final design now has different
`self`-receivers: Some functions take `&self`, some `&mut self` and some
consume the packet using `self`.

This results in a more ergonomic usage of `IpPacket` across the codebase
and deletes a fair bit of code. It also takes us one step closer towards
using `etherparse` for all our IP packet interaction-needs. Lastly, I am
currently exploring a performance-optimisation idea that stack-allocates
all IP packets and for that, the current split between `IpPacket` and
`MutableIpPacket` does not really work.

Related: #6366.
2024-09-11 22:32:49 +00:00
Thomas Eizinger
bed625a312 chore(rust): make logging more ergonomic (#6237)
Setting up a logger is something that pretty much every entrypoint needs
to do, be it a test, a shared library embedded in another app or a
standalone application. Thus, it makes sense to introduce a dedicated
crate that allows us to bundle all the things together, how we want to
do logging.

This allows us to introduce convenience functions like
`firezone_logging::test` which allow you to construct a logger for a
test as a one-liner.

Crucially though, introducing `firezone-logging` gives us a place to
store a default log directive that silences very noisy crates. When
looking into a problem, it is common to start by simply setting the
log-filter to `debug`. Without further action, this floods the output
with logs from crates like `netlink_proto` on Linux. It is very unlikely
that those are the logs that you want to see. Without a preset filter,
the only alternative here is to explicitly turn off the log filter for
`netlink_proto` by typing something like
`RUST_LOG=netlink_proto=off,debug`. Especially when debugging issues
with customers, this is annoying.

Log filters can be overridden, i.e. a 2nd filter that matches the exact
same scope overrides a previous one. Thus, with this design it is still
possible to activate certain logs at runtime, even if they have silenced
by default.

I'd expect `firezone-logging` to attract more functionality in the
future. For example, we want to support re-loading of log-filters on
other platforms. Additionally, where logs get stored could also be
defined in this crate.

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Reactor Scram <ReactorScram@users.noreply.github.com>
2024-08-10 05:17:03 +00:00
Thomas Eizinger
a87728b791 chore: remove connlib-shared dependency from bin-shared (#6229)
The `firezone-bin-shared` crate is meant to house non-tunnel related
things. That allows it to compile in parallel to everything else. It
currently only depends on `connlib-shared` to access the `DEFAULT_MTU`
constant. We can remove that by requiring the MTU as a ctor parameter of
`TunDeviceManager`.

A longer write-up of the intended dependency structure is in #4470.
2024-08-10 03:58:10 +00:00
Thomas Eizinger
4ae64f0257 fix(connlib): index forwarded DNS queries by ID + socket (#6233)
When forwarding DNS queries, we need to remember the original source
socket in order to send the response back. Previously, this mapping was
indexed by the DNS query ID. As it turns out, at least Windows doesn't
have a global DNS query ID counter and may reuse them across different
DNS servers. If that happens and two of these queries overlap, then we
match the wrong responses together.

In the best case, this produces bad DNS results on the client. In the
worst case, those queries were for DNS servers with different IP
versions in which case we triggered a panic in connlib further down the
stack where we created the IP packet for the response.

To fix this, we first and foremost remove the explicit `panic!` from the
`make::` functions in `ip-packet`. Originally, these functions were only
used in tests but we started to use them in production code too and
unfortunately forgot about this panic. By introducing a `Result`, all
call-sites are made aware that this can fail.

Second, we fix the actual indexing into the data structure for forwarded
DNS queries to also include the DNS server's socket. This ensures we
don't treat the DNS query IDs as globally unique.

Third, we replace the panicking path in
`try_handle_forwarded_dns_response` with a log statement, meaning if the
above assumption turns out wrong for some reason, we still don't panic
and simply don't handle the packet.
2024-08-09 07:01:57 +00:00
Reactor Scram
82b8de4c9c refactor(client/windows): de-dupe wintun.dll (#6020)
Closes #5977

Refactored some other stuff to make this work

Also removed a redundant impl of `ensure_dll` in a benchmark
2024-07-25 14:28:35 +00:00
Thomas Eizinger
50d6b865a1 refactor(connlib): move Tun implementations out of firezone-tunnel (#5903)
The different implementations of `Tun` are the last platform-specific
code within `firezone-tunnel`. By introducing a dedicated crate and a
`Tun` trait, we can move this code into (platform-specific) leaf crates:

- `connlib-client-android`
- `connlib-client-apple`
- `firezone-bin-shared`

Related: #4473.

---------

Co-authored-by: Not Applicable <ReactorScram@users.noreply.github.com>
2024-07-24 01:10:50 +00:00