Commit Graph

922 Commits

Author SHA1 Message Date
Thomas Eizinger
bc2febed99 fix(connlib): use correct constant for truncating DNS responses (#7551)
In case an upstream DNS server responds with a payload that exceeds the
available buffer space of an IP packet, we need to truncate the
response. Currently, this truncation uses the **wrong** constant to
check for the maximum allowed length. Instead of `MAX_DATAGRAM_PAYLOAD`, we
need to check against a limit that is less than the MTU because the IP layer
and the UDP layer both add overhead.

To fix this, we introduce such a constant and provide additional
documentation on the remaining ones to hopefully avoid future errors.
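
Roughly, the arithmetic involved looks like this (the constant names and the
use of the IPv6 header size below are assumptions for illustration, not
connlib's actual definitions):

```
// Illustrative only: derive the largest DNS response that still fits into a
// single IP packet. Names and values are assumptions for this sketch.
const MTU: usize = 1280;
const IPV6_HEADER_SIZE: usize = 40; // IPv4 would be 20 bytes; use the larger one.
const UDP_HEADER_SIZE: usize = 8;

/// Upper bound for a DNS response we can emit without exceeding the MTU.
const MAX_DNS_RESPONSE_SIZE: usize = MTU - IPV6_HEADER_SIZE - UDP_HEADER_SIZE;

fn needs_truncation(response_len: usize) -> bool {
    // Checking against `MAX_DATAGRAM_PAYLOAD` here would be too permissive
    // because it ignores the IP and UDP headers.
    response_len > MAX_DNS_RESPONSE_SIZE
}
```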
2024-12-19 17:15:43 +00:00
Thomas Eizinger
a1cf409af3 fix(connlib): clear all in-flight upstream DNS queries on reset (#7552)
When a Firezone Client roams, we reset all network connections and
rebind our local sockets. Doing that enables us to start from a clean
state and establish new connections to Gateways. What we are currently
not clearing are in-flight DNS queries. Those are all very likely to
fail because our network connection is changing. There is no point in us
keeping those around. Additionally, as part of roaming, it may also be
that our upstream DNS server changes and thus, we may suddenly receive a
response from a DNS server that we no longer know about.

Clearing all in-flight DNS queries on reset solves this.
2024-12-18 20:35:30 +00:00
Thomas Eizinger
992b97e6a9 fix(connlib): bind new channel to peer if needed (#7548)
Initially, when we receive a new candidate from a remote peer, we bind a
channel for each remote address on the relay that we sampled. This
ensures that every possible communication path is actually functioning.
In ICE, all candidates are tried against each other, meaning the remote
will attempt to send from each of their candidates to every one of ours,
including our relay candidates. To allow this traffic, a channel needs
to be bound first.

For various reasons, an allocation might become stale or need to be
otherwise invalidated. In that case, all channel bindings are lost but there
might still be an active connection that wants to utilise them. When that
happens, we will see "No channel" warnings like
https://firezone-inc.sentry.io/issues/6036662614/events/f8375883fd3243a4afbb27c36f253e23/.

To fix this, we use the attempt to encode a message for a channel as an
intent to bind a new one. This is deemed safe because wanting to encode
a message to a peer as a channel data message means we want such a
channel to exist. The first message here is still dropped but that is
better than not establishing the channel at all.
2024-12-18 17:15:17 +00:00
Thomas Eizinger
a80abec4ff refactor(connlib): remove unused branch in match (#7550)
When deciding what to do with a certain DNS query, we check whether the
domain name in question corresponds to any of the (wildcard) DNS
resource addresses. If yes, we resolve it to the resource ID of that
resource. The source of those resource IDs is the `dns_resources` map.

If we have looked up a `ResourceId` in that map, it is impossible for it
to not be "known" which means the branch deleted in this PR is
completely redundant and already covered by the catch-all branch where
`maybe_resource` is `None`.
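
A minimal sketch of the shape of that `match` (the surrounding types are made
up for illustration):

```
// Hypothetical types for illustration only.
struct ResourceId(u64);

enum Action {
    ResolveToResource(ResourceId),
    ForwardUpstream,
}

fn classify(maybe_resource: Option<ResourceId>) -> Action {
    match maybe_resource {
        // `maybe_resource` came out of a lookup in `dns_resources`, so a
        // separate "unknown resource" arm can never match here ...
        Some(id) => Action::ResolveToResource(id),
        // ... and everything else is already covered by this catch-all arm.
        None => Action::ForwardUpstream,
    }
}
```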
2024-12-18 15:47:15 +00:00
Thomas Eizinger
62dfe65679 chore(connlib): improve error messages for failed translations (#7540) 2024-12-18 04:47:26 +00:00
Thomas Eizinger
8a1b6f26b4 fix(connlib): don't log warnings for unreachable errors (#7537)
When a Gateway or Client is running in an environment without IPv4 or
IPv6 connectivity, our initial probes for sending packets to the relays
will fail with network unreachable. That isn't a very big concern and
happens a lot in the wild. There is no need to report these as telemetry
events.

Resolves: #7514.
2024-12-17 17:59:20 +00:00
Thomas Eizinger
aa8c53a20d refactor(rust): use a buffer pool for network packets (#7489)
In order to achieve concurrency within `connlib`, we needed to create a
way for IP packets to own the piece of memory they are sitting in. This
allows us to concurrently read IP packets and then batch-process them
(as opposed to having a dedicated buffer and referencing it). At the moment,
those IP packets are defined on the stack. With a size of ~1300 bytes
that isn't very large but still causes _some_ amount of copying.

We can avoid this copying by relying on a buffer pool:

1. When reading a new IP packet, we request a new buffer from the pool.
2. When the IP packet gets dropped, the buffer gets returned to the
pool.

This allows us to reuse an allocation for a packet once it finished
processing, resulting in less CPU time spent on copying around memory.

This causes us to make more _individual_ heap-allocations in the
beginning: Each packet being processed by `connlib` is allocated on
the heap somewhere. At some point during the lifetime of the tunnel,
this will settle in an ideal state where we have allocated enough slots
to cover new packets whilst also reusing memory from packets that
finished processing already.

The actual `IpPacket` data type is now just a pointer. As a result, the
channels to and from the TUN thread (where we were holding multiple of
these packets) are now significantly smaller, leading to roughly the
same memory usage overall.

In my local testing on Linux, the client still only uses about 15 MB of
RAM even with multiple concurrent speedtests running.
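
A simplified sketch of the buffer-pool pattern (illustrative only; the actual
implementation likely differs in detail):

```
use std::sync::{Arc, Mutex};

// Packets own their memory; dropping a packet's buffer returns the allocation
// to the pool so it can be reused for the next packet.
#[derive(Clone, Default)]
struct BufferPool {
    free: Arc<Mutex<Vec<Vec<u8>>>>,
}

struct PooledBuffer {
    data: Vec<u8>,
    pool: BufferPool,
}

impl BufferPool {
    fn acquire(&self, capacity: usize) -> PooledBuffer {
        // Reuse a previous allocation if one is available, otherwise allocate.
        let mut data = self
            .free
            .lock()
            .unwrap()
            .pop()
            .unwrap_or_else(|| Vec::with_capacity(capacity));
        data.clear();

        PooledBuffer {
            data,
            pool: self.clone(),
        }
    }
}

impl Drop for PooledBuffer {
    fn drop(&mut self) {
        // Hand the allocation back to the pool instead of freeing it.
        let data = std::mem::take(&mut self.data);
        self.pool.free.lock().unwrap().push(data);
    }
}
```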
2024-12-16 01:02:17 +00:00
Thomas Eizinger
0861ccaf06 chore(connlib): improve logging on missing flow (#7508)
Normally, there should always be exactly one pending flow per resource. It
appears though that it can sometimes happen that we first request a flow
for a resource but by the time it is authorised, we've already cleared
its local state.

Regardless, this isn't a concerning error and isn't worth logging at WARN
(which happens one layer up).
2024-12-13 18:03:53 +00:00
Thomas Eizinger
61d6eceb29 chore(connlib): downgrade warning about missing DNS servers (#7509)
There is nothing we can do if the user doesn't have any DNS servers
defined. The default log level is INFO so a user reading the logs will
still come across this message in case they are trying to debug what is
happening.

Long term, problems like these would probably warrant some kind of
notification channel from `connlib` to the GUI where we can display
messages to the user.
2024-12-13 17:53:36 +00:00
Thomas Eizinger
7a33146997 chore(connlib): downgrade warning when disconnecting from relay (#7512)
There are several reasons why we can disconnect from a relay at runtime:

- STUN is blocked
- We have invalid credentials
- The TURN server is not protocol-conformant

The first two are very much possible in production and there is nothing
we can do about them. When relays reboot, their credentials change and
if the Internet connection of a user / gateway gets cut, we may
disconnect from the relay because the messages get lost.

The last one should never happen if we are connected to our own relays.
Firezone can be self-hosted so ultimately, we don't have control over
what we are talking to. That error, however, is more of a safeguard for
`connlib` itself to disconnect from the server as soon as it detects
that it is behaving weirdly.

None of these reasons are worth reporting to Sentry as a problem because
they aren't really fixable as such. It is more important that the user
sees them in the logs if they decide to dig into them, which they still
can at INFO level.
2024-12-13 17:52:59 +00:00
Thomas Eizinger
f30cc3226d fix(gateway): don't return error when client disconnected (#7504)
When a client disconnects, we clear up the connection on the gateway.
There might still be packets arriving from resources that we then cannot
route. This isn't worth returning an error.
2024-12-13 04:54:07 +00:00
Thomas Eizinger
7a478634a8 feat(connlib): buffer packets during connection and NAT setup (#7477)
At present, `connlib` will always drop all IP packets until a connection
is established and the DNS resource NAT is created. This causes an
unnecessary delay until the connection is working because we need to
wait for retransmission timers of the host's network stack to resend
those packets.

With the new idempotent control protocol, it is now much easier to
buffer these packets and send them to the gateway once the connection is
established.

The buffer sizes are chosen somewhat conservatively to ensure we don't
consume a lot of memory. The hypothesis here is that every protocol -
even if the transport layer is unreliable like UDP - will start with a
handshake involving only one or at most a few packets, waiting for a
reply before sending more. Thus, as long as we can set up a connection
quicker than the re-transmit timer in the host's network stack,
buffering those packets should result in no packet loss. Typically,
setting up a new connection takes at most 500ms which should be fast
enough to not trigger any re-transmits.
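
A rough sketch of such a bounded buffer (capacity handling and types are made
up for this example):

```
use std::collections::VecDeque;

// Illustrative: packets destined for a resource are parked here until the
// connection and the DNS resource NAT are ready.
struct PendingPackets<P> {
    packets: VecDeque<P>,
    capacity: usize,
}

impl<P> PendingPackets<P> {
    fn new(capacity: usize) -> Self {
        Self {
            packets: VecDeque::with_capacity(capacity),
            capacity,
        }
    }

    /// Buffer a packet; drop the oldest one if the buffer is full so memory
    /// stays bounded.
    fn push(&mut self, packet: P) {
        if self.packets.len() == self.capacity {
            self.packets.pop_front();
        }
        self.packets.push_back(packet);
    }

    /// Flush everything towards the gateway once the connection is up.
    fn drain(&mut self) -> impl Iterator<Item = P> + '_ {
        self.packets.drain(..)
    }
}
```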

Resolves: #3246.
2024-12-12 11:40:38 +00:00
Thomas Eizinger
7e38d3caee chore(connlib): downgrade warning about failed flow (#7480) 2024-12-11 19:01:37 +00:00
Thomas Eizinger
81f71cba62 fix(telemetry): use package@version notation for releases (#7466)
In order for Sentry to parse our releases as semver, they need to be in
the form of `package@version` [0]. Without this, the feature of "Mark
this issue as resolved in the _next_ version" doesn't work properly
because Sentry compares versions by when it first saw them instead of
parsing the semver string itself. We test versions prior to releasing
them, meaning Sentry learns about a 1.4.0 version before it is actually
released. This causes false-positive "regressions" even though they are
fixed in a later (as per semver) release.

This creates some redundancy with the different DSNs that we are already
using. I think it would make sense to consider merging the two projects
we have for the GUI client for example. That is really just one project
that happens to run as two binaries.

For all other projects, I think the separation still makes sense because
we e.g. may add Sentry to the "host" applications of Android and
MacOS/iOS as well. For those, we would reuse the DSN and thus funnel the
issues into the same Sentry project.

As per Sentry's docs, releases are organisation-wide and therefore need
a package identifier to be grouped correctly.

[0]:
https://docs.sentry.io/platforms/javascript/configuration/releases/#bind-the-version
2024-12-09 05:04:45 +00:00
Thomas Eizinger
ddce9312ea fix(android): apply new log-filter on repeated connect call (#7461)
Related: #7460.
Resolves: #5634.
2024-12-06 04:45:28 +00:00
Thomas Eizinger
6115f662cf fix(apple): only initialise global logger once (#7460)
From within the FFI code, we have no control over the lifecycle of the
host application and `connect` may be called multiple times from within
the same process. Therefore, we cannot rely on the global logger state
to **not** be set when `connect` gets called.

To fix this, we cache the handles for the file logger and a
reload-handle for the log filter in a `static` variable. This allows us
to apply the new log-filter of a repeated `connect` call to the existing
logger, even if `connect` is called multiple times from the same
process.
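
A minimal sketch of this pattern using `tracing-subscriber`'s reload handle
(simplified; the real code also caches the file-logger handle):

```
use std::sync::OnceLock;

use tracing_subscriber::layer::SubscriberExt as _;
use tracing_subscriber::util::SubscriberInitExt as _;
use tracing_subscriber::{reload, EnvFilter, Registry};

// Cache the reload handle so a repeated `connect` only swaps the log filter
// instead of trying to install the global logger a second time.
static FILTER_HANDLE: OnceLock<reload::Handle<EnvFilter, Registry>> = OnceLock::new();

fn init_or_update_logging(directives: &str) {
    let filter = EnvFilter::new(directives);

    if let Some(handle) = FILTER_HANDLE.get() {
        // Logger is already installed: only apply the new filter.
        let _ = handle.reload(filter);
        return;
    }

    let (filter_layer, handle) = reload::Layer::new(filter);

    tracing_subscriber::registry()
        .with(filter_layer)
        .with(tracing_subscriber::fmt::layer())
        .init();

    let _ = FILTER_HANDLE.set(handle);
}
```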
2024-12-06 04:44:41 +00:00
Thomas Eizinger
90cf191a7c feat(linux): multi-threaded TUN device operations (#7449)
## Context

At present, we only have a single thread that reads and writes to the
TUN device on all platforms. On Linux, it is possible to open the file
descriptor of a TUN device multiple times by setting the
`IFF_MULTI_QUEUE` option using `ioctl`. Using multi-queue, we can then
spawn multiple threads that concurrently read and write to the TUN
device. This is critical for achieving a better throughput.

## Solution

`IFF_MULTI_QUEUE` is a Linux-only thing and therefore only applies to
headless-client, GUI-client on Linux and the Gateway (it may also be
possible on Android, I haven't tried). As such, we need to first change
our internal abstractions a bit to move the creation of the TUN thread
to the `Tun` abstraction itself. For this, we change the interface of
`Tun` to the following:

- `poll_recv_many`: An API, inspired by tokio's `mpsc::Receiver` where
multiple items in a channel can be batch-received.
- `poll_send_ready`: Mimics the API of `Sink` to check whether more
items can be written.
- `send`: Mimics the API of `Sink` to actually send an item.
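
A rough sketch of what this interface could look like (signatures simplified
for illustration; the actual trait may differ):

```
use std::io;
use std::task::{Context, Poll};

// Illustrative stand-in for an owned IP packet.
struct IpPacket(Vec<u8>);

trait Tun {
    /// Receive up to `max` packets in one call, appending them to `buf`.
    /// Returns how many packets were received.
    fn poll_recv_many(
        &mut self,
        cx: &mut Context<'_>,
        buf: &mut Vec<IpPacket>,
        max: usize,
    ) -> Poll<usize>;

    /// Wait until the device can accept another packet (backpressure).
    fn poll_send_ready(&mut self, cx: &mut Context<'_>) -> Poll<io::Result<()>>;

    /// Send a single packet; only called after `poll_send_ready` returned
    /// `Ready(Ok(()))`.
    fn send(&mut self, packet: IpPacket) -> io::Result<()>;
}
```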

With these APIs in place, we can implement various (performance)
improvements for the different platforms.

- On Linux, this allows us to spawn multiple threads to read and write
from the TUN device and send all packets into the same channel. The `Io`
component of `connlib` then uses `poll_recv_many` to read batches of up
to 100 packets at once. This ties in well with #7210 because we can then
use GSO to send the encrypted packets in single syscalls to the OS.
- On Windows, we already have a dedicated recv thread because `WinTun`'s
most-convenient API uses blocking IO. As such, we can now also tie into
that by batch-receiving from this channel.
- In addition to using multiple threads, this API now also uses correct
readiness checks on Linux, Darwin and Android to uphold backpressure in
case we cannot write to the TUN device.

## Configuration

Local testing has shown that 2 threads give the best performance for a
local `iperf3` run. I suspect this is because there is only so much
traffic that a single application (i.e. `iperf3`) can generate. With
more than 2 threads, the throughput actually drops drastically because
`connlib`'s main thread is too busy with lock-contention and triggering
`Waker`s for the TUN threads (which mostly idle around if there are 4+
of them). I've made it configurable on the Gateway though so we can
experiment with this during concurrent speedtests etc.

In addition, switching `connlib` to a single-threaded tokio runtime
further increased the throughput, I suspect due to less task / context
switching.

## Results

Local testing with `iperf3` shows some very promising results. We now
achieve a throughput of 2+ Gbit/s.

```
Connecting to host 172.20.0.110, port 5201
Reverse mode, remote host 172.20.0.110 is sending
[  5] local 100.80.159.34 port 57040 connected to 172.20.0.110 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec   274 MBytes  2.30 Gbits/sec
[  5]   1.00-2.00   sec   279 MBytes  2.34 Gbits/sec
[  5]   2.00-3.00   sec   216 MBytes  1.82 Gbits/sec
[  5]   3.00-4.00   sec   224 MBytes  1.88 Gbits/sec
[  5]   4.00-5.00   sec   234 MBytes  1.96 Gbits/sec
[  5]   5.00-6.00   sec   238 MBytes  2.00 Gbits/sec
[  5]   6.00-7.00   sec   229 MBytes  1.92 Gbits/sec
[  5]   7.00-8.00   sec   222 MBytes  1.86 Gbits/sec
[  5]   8.00-9.00   sec   223 MBytes  1.87 Gbits/sec
[  5]   9.00-10.00  sec   217 MBytes  1.82 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  2.30 GBytes  1.98 Gbits/sec  22247             sender
[  5]   0.00-10.00  sec  2.30 GBytes  1.98 Gbits/sec                  receiver

iperf Done.
```

This is a pretty solid improvement over what is in `main`:

```
Connecting to host 172.20.0.110, port 5201
[  5] local 100.65.159.3 port 56970 connected to 172.20.0.110 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  90.4 MBytes   758 Mbits/sec  1800    106 KBytes
[  5]   1.00-2.00   sec  93.4 MBytes   783 Mbits/sec  1550   51.6 KBytes
[  5]   2.00-3.00   sec  92.6 MBytes   777 Mbits/sec  1350   76.8 KBytes
[  5]   3.00-4.00   sec  92.9 MBytes   779 Mbits/sec  1800   56.4 KBytes
[  5]   4.00-5.00   sec  93.4 MBytes   783 Mbits/sec  1650   69.6 KBytes
[  5]   5.00-6.00   sec  90.6 MBytes   760 Mbits/sec  1500   73.2 KBytes
[  5]   6.00-7.00   sec  87.6 MBytes   735 Mbits/sec  1400   76.8 KBytes
[  5]   7.00-8.00   sec  92.6 MBytes   777 Mbits/sec  1600   82.7 KBytes
[  5]   8.00-9.00   sec  91.1 MBytes   764 Mbits/sec  1500   70.8 KBytes
[  5]   9.00-10.00  sec  92.0 MBytes   771 Mbits/sec  1550   85.1 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   917 MBytes   769 Mbits/sec  15700             sender
[  5]   0.00-10.00  sec   916 MBytes   768 Mbits/sec                  receiver

iperf Done.
```
2024-12-05 00:18:20 +00:00
Thomas Eizinger
48bd0f9804 chore: bump client versions to 1.4.0 (#7092)
In order to release the new control protocol to users, we need to bump
the versions of the clients to 1.4.0. The portal has a version gate to
only select gateways with version >= 1.4.0 for clients >= 1.4.0. Thus,
bumping these versions can only happen once testing has completed and
the gateway has actually been released as 1.4.0.

Co-authored-by: Jamil Bou Kheir <jamilbk@users.noreply.github.com>
2024-12-04 19:48:51 +00:00
Thomas Eizinger
b802021cc4 feat(connlib): implement idempotent control protocol for client (#6942)
Building on top of the gateway PR (#6941), this PR transitions the
clients to the new control protocol. Clients are **not**
backwards-compatible with old gateways. As a result, a certain customer
environment MUST have at least one gateway with the above PR running in
order for clients to be able to establish connections.

With this transition, Clients send explicit events to Gateways whenever
they assign IPs to a DNS resource name. The actual assignment only
happens once and the IPs then remain stable for the duration of the
client session.

When the Gateway receives such an event, it will perform a DNS
resolution of the requested domain name and set up the NAT between the
assigned proxy IPs and the IPs the domain actually resolves to. In order
to support self-healing of any problems that happen during this process,
the client will send an "Assigned IPs" event every time it receives a
DNS query for a particular domain. This in turn will trigger another DNS
resolution on the Gateway. Effectively, this means that DNS queries for
DNS resources propagate to the Gateway, triggering a DNS resolution
there. In case the domain resolves to the same set of IPs, no state is
changed to ensure existing connections are not interrupted.
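
A hypothetical sketch of the shape such an event could have (names and fields
are made up; the actual wire format may differ):

```
use std::net::IpAddr;

// Hypothetical illustration of the client -> gateway event described above.
struct DomainIpsAssigned {
    /// The DNS resource this domain was matched against.
    resource_id: u64,
    /// The domain the client's DNS query was for.
    domain: String,
    /// The proxy IPs the client handed out for that domain.
    assigned_ips: Vec<IpAddr>,
}
```

Because the event carries the full desired state (domain plus assigned proxy
IPs), the Gateway can apply it idempotently: it re-resolves the domain and only
touches the NAT when the resolved IPs actually changed.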

With this new functionality in place, we can delete the old logic around
detecting "expired" IPs. This is considered a bugfix as this logic isn't
currently working as intended. It has been observed multiple times that
the Gateway can loop on this behaviour and resolve the same domain
over and over again. The only theoretical "incompatibility" here is that
pre-1.4.0 clients won't have access to this functionality of triggering
DNS refreshes on a 1.4.2+ Gateway. However, as soon as this PR
merges, we expect all admins to have already upgraded to a 1.4.0+
Gateway anyway, which already mandates clients to be on 1.4.0+.

Resolves: #7391.
Resolves: #6828.
2024-12-04 12:05:35 +00:00
Jamil
15e75f80ba fix(apple/ios): Expose IPHONEOS_DEPLOYMENT_TARGET to tell rustc our iOS version (#7453)
Fixes a similar issue to #7443, where we were deleting the
`IPHONEOS_DEPLOYMENT_TARGET` variable in our Rust build script, which
caused lots of warnings about building for a different OS than being
linked against.
2024-12-03 14:12:20 -08:00
Thomas Eizinger
dd6b52b236 chore(rust): share edition key via workspace table (#7451) 2024-12-03 00:28:06 +00:00
Thomas Eizinger
9073bddaef fix(gateway): translate ICMP destination unreachable errors (#7398)
## Context

The Gateway implements a stateful NAT that translates the destination IP
and source protocol of every packet that targets a DNS resource IP. This
is necessary because the IPs for DNS resources are generated on the
client without actually performing a DNS lookup; instead, it always
generates 4 IPv4 and 4 IPv6 addresses. On the Gateway, these IPs are
then assigned in a round-robin fashion to the actual IPs that the domain
resolves to, necessitating a NAT64/46 translation in case a domain only
resolves to IPs of one family.

A domain may resolve to a set of IPs but not all of these IPs may be
routable. Whilst arguably poor practice by the domain administrator,
routing problems can occur for all kinds of reasons and are well handled
on the wider Internet.

When an IP packet cannot be routed further, the current routing node
generates an ICMP error describing the routing failure and sends it back
to the original sender. ICMP is a layer 4 protocol itself, same as TCP
and UDP. As such, sending out a UDP packet may result in receiving an
ICMP response. In order to allow the sender to learn which packet
failed to route, the ICMP error embeds parts of the original packet in
its payload [0] [1].

The Gateway's NAT table uses parts of the layer 4 protocol as part of
its key: the UDP and TCP source port and the ICMP echo request
identifier (further referred to as "source protocol"). An ICMP error
message doesn't have any of these, meaning the lookup in the NAT table
currently fails and the ICMP error is silently dropped.

A lot of software implements a happy-eyeballs approach and probes for
IPv6 and IPv4 connectivity simultaneously. The absence of the ICMP
errors confuses that algorithm as it detects the packet loss and starts
retransmits instead of giving up.

## Solution

Upon receiving an ICMP error on the Gateway, we now extract the
partially embedded packet in the ICMP error payload. We use the
destination IP and source protocol of _that_ packet for the lookup in
the NAT table. This returns us the original (client-assigned)
destination IP and source protocol. In order for the Gateway's NAT to be
transparent, we need to patch the packet embedded in the ICMP error to
use the original destination and source protocol. We also have to
account for the fact that the original packet may have been translated
with NAT64/46 and translate it back. Finally, we generate an ICMP error
with the appropriate code and embed the patched packet in its payload.
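
A simplified sketch of that lookup (the key and table types are illustrative,
not the Gateway's actual NAT implementation):

```
use std::collections::HashMap;
use std::net::IpAddr;

// "Source protocol" = UDP/TCP source port or ICMP echo identifier.
type NatKey = (IpAddr, u16);

struct NatTable {
    // Maps the translated (outside) key back to the original, client-assigned key.
    outside_to_inside: HashMap<NatKey, NatKey>,
}

impl NatTable {
    /// For an inbound ICMP error, the lookup key comes from the packet that is
    /// *embedded* in the error payload, not from the ICMP packet itself.
    fn translate_icmp_error(&self, embedded: NatKey) -> Option<NatKey> {
        self.outside_to_inside.get(&embedded).copied()
    }
}
```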

## Test implementation

To test that this works for all kinds of combinations, we extend
`tunnel_test` to sample a list of unreachable IPs from all IPs sampled
for DNS resources. Upon receiving a packet for one of these IPs, the
Gateway will send an ICMP error back instead of invoking its regular
echo reply logic. On the client-side, upon receiving an ICMP error, we
extract the originally failed packet from the body and treat it as a
successful response.

This may seem a bit hacky at first but is actually how operating systems
would treat ICMP errors as well. For example, a `TcpSocket::connect`
call (triggering a TCP SYN packet) may fail with an IO error if we
receive an ICMP error packet. Thus, in a way, the original packet got
answered, just not with what we expected.

In addition, by treating these ICMP errors as responses to the original
packet, we automatically perform other assertions on them, like ensuring
that they come from the right IP address, that there are no unexpected
packets etc.

## Test alternatives

It is tricky to solve this in other ways in the test suite because at
the time of generating a packet for a DNS resource, we don't know the
actual IP that is being targeted by a certain proxy IP unless we'd start
reimplementing the round-robin algorithm employed by the Gateway. To
"test" the transparency of the NAT, we'd like to avoid knowing about
these implementation details in the test.

## Future work

In this PR, we currently only deal with "Destination Unreachable" ICMP
errors. There are other ICMP messages such as ICMPv6's `PacketTooBig` or
`ParameterProblem`. We should eventually handle these as well. They are
being deferred because translating those between the different IP
versions is only partially implemented and would thus require more work.
The most pressing need is to translate destination unreachable errors to
enable happy-eyeballs algorithms to work correctly.

Resolves: #5614.
Resolves: #6371.

[0]: https://www.rfc-editor.org/rfc/rfc792
[1]: https://www.rfc-editor.org/rfc/rfc4443#section-3.1
2024-12-02 23:07:41 +00:00
Jamil
e1ed497d12 fix(apple): Expose MACOSX_DEPLOYMENT_TARGET in rust apple build script to signal to rustc which macOS to target (#7443)
`MACOSX_DEPLOYMENT_TARGET` is a standard env var read by gcc and rustc
that determines which version of macOS to target binaries for. This
variable was being removed inadvertently in our rust build script.

Exposing this var fixes a slew of warnings about object files being
built for a newer macOS target than the one being linked against, which
had been showing up in Xcode for some time now.

Hasn't caused any issues thus far from what I can tell, but possibly
related to #7442
2024-12-02 17:27:11 +00:00
Thomas Eizinger
0a6554122a feat(connlib): utilise GSO for UDP sockets (#7210)
## Context

At present, `connlib` sends UDP packets one at a time. Sending a packet
requires us to make a syscall which is quite expensive. Under load, i.e.
during a speedtest, syscalls account for over 50% of our CPU time [0].
In order to improve this situation, we need to somehow make use of GSO
(generic segmentation offload). With GSO, we can send multiple packets
to the same destination in a single syscall.

The tricky question here is, how can we achieve having multiple UDP
packets ready at once so we can send them in a single syscall? Our TUN
interface only feeds us packets one at a time and `connlib`'s state
machine is single-threaded. Additionally, we currently only have a
single `EncryptBuffer` in which the to-be-sent datagram sits.

## 1. Stack-allocating encrypted IP packets

As a first step, we get rid of the single `EncryptBuffer` and instead
stack-allocate each encrypted IP packet. Due to our small MTU, these
packets are only around 1300 bytes. Stack-allocating that requires a few
memcpy's but those are in the single-digit % range in the terms of CPU
time performance hit. That is nothing compared to how much time we are
spending on UDP syscalls. With the `EncryptBuffer` out the way, we can
now "freely" move around the `EncryptedPacket` structs and - technically
- we can have multiple of them at the same time.

## 2. Implementing GSO

The GSO interface allows you to pass multiple packets **of the same
length and for the same destination** in a single syscall, meaning we
cannot just batch-up arbitrary UDP packets. Counterintuitively, making
use of GSO requires us to do more copying: In particular, we change the
interface of `Io` such that "sending" a packet performs essentially a
lookup of a `BytesMut` buffer by destination and packet length and
appends the payload to that buffer.
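
A simplified sketch of that per-destination, per-length lookup structure
(illustrative; the real `GsoQueue` likely differs):

```
use std::collections::HashMap;
use std::net::SocketAddr;

use bytes::BytesMut;

// Datagrams for the same destination and of the same length are appended to
// one contiguous buffer, which can later be flushed in a single GSO syscall.
#[derive(Default)]
struct GsoQueue {
    buffers: HashMap<(SocketAddr, usize), BytesMut>,
}

impl GsoQueue {
    fn enqueue(&mut self, dst: SocketAddr, datagram: &[u8]) {
        self.buffers
            .entry((dst, datagram.len()))
            .or_default()
            .extend_from_slice(datagram);
    }

    /// Drain all buffers; each entry becomes one send with a segment size of
    /// the datagram length.
    fn drain(&mut self) -> impl Iterator<Item = ((SocketAddr, usize), BytesMut)> + '_ {
        self.buffers.drain()
    }
}
```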

## 3. Batch-read IP packets

In order to actually perform GSO, we need to process more than a single
IP packet in one event-loop tick. We achieve this by batch-reading up to
50 IP packets from the mpsc-channel that connects `connlib`'s main
event-loop with the dedicated thread that reads and writes to the TUN
device. These reads and writes happen concurrently to `connlib`'s packet
processing. Thus, it is likely that by the time `connlib` is ready to
process another IP packet, multiple have been read from the device and
are sitting in the channel. Batch-processing these IP packets means that
the buffers in our `GsoQueue` are more likely to contain more than a
single datagram.

Imagine you are running a file upload. The OS will send many packets to
the same destination IP, likely at max MTU, to the TUN device. It is
likely that we read 10-20 of these packets in one batch (i.e. within a
single "tick" of the event-loop). All packets will be appended to the
same buffer in the `GsoQueue` and on the next event-loop tick, they will
all be flushed out in a single syscall.

## Results

Overall, this results in a significant reduction of syscalls for sending
UDP messages. In [1], we spend only a total of 16% of our CPU time in
`udpv6_sendmsg` whereas in [0] (main), we spent a total of 34%. Do note
that these numbers are relative to the total CPU time spent per program
run and thus can't be compared directly (i.e. you cannot just do 34 - 16
and say we now spend 18% less time sending UDP packets). Nevertheless,
this appears to be a great improvement.

In terms of throughput, we achieve a ~60% improvement in our benchmark
suite. That one is running on localhost though, so it might not
necessarily be reflected like that in a real network.

[0]: https://share.firefox.dev/4hvoPju
[1]: https://share.firefox.dev/4frhCPv
2024-12-02 01:09:44 +00:00
Thomas Eizinger
5f4816ee46 fix(connlib): don't warn on non-UDP packet to DNS resolver IP (#7418)
Windows appears to sometimes send ICMP (error?) packets to our DNS
resolver IPs. There is nothing we can do with these but the current code
path logs them as a warning because we expect everything that isn't TCP
to be UDP.

With this patch, we slightly change the parsing logic to first attempt
extracting the UDP packet and fail only with a DEBUG log if it isn't one.
2024-12-01 16:01:42 +00:00
Thomas Eizinger
a3e3d4cac5 fix(gateway): filter packets not destined for a client (#7417)
This causes unnecessary logs higher up the stack.
2024-12-01 15:59:56 +00:00
Thomas Eizinger
932f6791fb fix(phoenix-channel): lazily create backoff timer (#7414)
Our `phoenix-channel` component is responsible for maintaining a
WebSocket connection to the portal. In case that connection fails, we
want to reconnect to it using an exponential backoff, eventually giving
up after a certain amount of time.

Unfortunately, the code we have today doesn't quite do that. An
`ExponentialBackoff` has a setting for the `max_elapsed_time`.
Regardless of how many times and how often we retry something, we won't
keep retrying for longer than this total amount of time. For the Relay, this
is set to 15 min. For other components it's indefinite (Gateway,
headless-client), or very long (30 days for Android, 1 day for Apple).

The point in time from which this duration is counted is when the
`ExponentialBackoff` is **constructed** which translates to when we
**first** connected to the portal. As a result, our backoff would
immediately fail on the first error if it has been longer than
`max_elapsed_time` since we first connected. For most components, this
codepath is not relevant because the `max_elapsed_time` is so long. For
the Relay however, that is only 15 minutes so chances are, the Relay
would immediately fail (and get rebooted) on the first connection error
with the portal.

To fix this, we now lazily create the `ExponentialBackoff` on the first
error.
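
A minimal sketch of the lazy construction, assuming the `backoff` crate's
`ExponentialBackoff` (the 15-minute value is the Relay example from above):

```
use std::time::Duration;

use backoff::backoff::Backoff as _;
use backoff::ExponentialBackoff;

struct Reconnect {
    // Only constructed on the first error, so the `max_elapsed_time` window
    // starts when the connection fails, not when it was first established.
    backoff: Option<ExponentialBackoff>,
}

impl Reconnect {
    fn next_delay(&mut self) -> Option<Duration> {
        self.backoff
            .get_or_insert_with(|| ExponentialBackoff {
                max_elapsed_time: Some(Duration::from_secs(15 * 60)),
                ..ExponentialBackoff::default()
            })
            .next_backoff()
    }

    fn on_connected(&mut self) {
        // A successful (re)connect resets the backoff; the next failure starts
        // a fresh `max_elapsed_time` window.
        self.backoff = None;
    }
}
```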

This bug has some interesting consequences: When a relay reboots, it
loses all its state, i.e. allocations, channel bindings, available
nonces, stamp-secret etc. Thus, all credentials and state that got
distributed to Clients and Gateways get invalidated, causing disconnects
from the Relay. We have observed these alerts in Sentry for a while and
couldn't explain them. Most likely, this is the root cause for those
because whilst a Relay is disconnected, the portal also cannot detect its
presence and pro-actively inform Clients and Gateways to no longer use
this Relay.
2024-11-29 20:19:11 +00:00
Thomas Eizinger
c6e7e6192e build(rust): bump Rust to 1.83 (#7409)
Rust 1.83 comes with a bunch of new lints for elidable lifetimes. Those
also trigger in the generated code of `derivative`. That crate is
actually unmaintained so we replace our usages of it with `derive_more`.
2024-11-29 01:04:06 +00:00
Thomas Eizinger
e46cb3f62b chore(snownet): improve log when MessageIntegrity is missing (#7399) 2024-11-29 00:27:53 +00:00
Thomas Eizinger
3ccf795195 test(connlib): don't waste shrinking time & cycles on IDs (#7402)
When `proptest` discovers a test failure, it will attempt to "shrink"
the input to identify, what exactly causes the issue. How this is done
depends on the data type but mostly performs things such as binary
search to be efficient. Not every input within our tests is relevant for
a failure. For example, which ID we have sampled for a client or a
gateway doesn't at all affect whether or not the test will fail.
`proptest` doesn't know that though so it will still happily spend
shrinking time and cycles on figuring out the minimal difference in IDs
(which is 1 because they have to be different). This is a huge waste of
time for no benefit. We are getting much more value out of `proptest` if
it tries to adjust other things such as the transitions involved in a
test, how many gateways and relays there are etc.

By marking the strategies for the IDs and private keys with `no_shrink`,
we can achieve that.
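
For example (the ID strategy below is illustrative):

```
use proptest::prelude::*;

// Illustrative: the concrete ID value never influences whether a test fails,
// so don't waste shrinking iterations on it.
fn client_id() -> impl Strategy<Value = u128> {
    any::<u128>().no_shrink()
}
```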
2024-11-28 20:45:34 +00:00
Thomas Eizinger
fd337dd465 test: reduce number of local rejects for generating IPs (#7401)
When generating random input data in property-based tests, we have to
ensure that the data conforms to certain criteria. For example, IP
addresses must not be multicast or unspecified addresses and they must
not be within our reserved IP ranges.

Currently, we ensure this using "filtering" which is a pretty poor
technique [0]. To improve on this, we refactor the generation of IPs to
automatically exclude all IPs within certain ranges. Previously, on very big
test-runs (i.e. > 30000 test cases), too many local rejections would lead to
the test suite being aborted early.
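
A sketch of the constructive approach (the allowed range below is illustrative,
not the actual reserved ranges):

```
use std::net::Ipv4Addr;

use proptest::prelude::*;

// Instead of filtering out unwanted addresses (which counts as a local
// reject), only construct addresses from an allowed range in the first place.
fn host_ipv4() -> impl Strategy<Value = Ipv4Addr> {
    // 1..=223.255.255.255 excludes 0.0.0.0/8 as well as multicast and above;
    // the real suite also excludes Firezone's reserved ranges.
    (1u32..=0xDFFF_FFFFu32).prop_map(Ipv4Addr::from)
}
```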

[0]:
https://proptest-rs.github.io/proptest/proptest/tutorial/filtering.html
2024-11-28 20:45:02 +00:00
Thomas Eizinger
2c26fc9c0e ci: lint Rust dependencies using cargo deny (#7390)
One of Rust's promises is "if it compiles, it works". However, there are
certain situations in which this isn't true. In particular, when using
dynamic typing patterns where trait objects are downcast to concrete
types, having two versions of the same dependency can silently break
things.

This happened in #7379 where I forgot to patch a certain Sentry
dependency. A similar problem exists with our `tracing-stackdriver`
dependency (see #7241).

Lastly, duplicate dependencies increase the compile-times of a project,
so we should aim for having as few duplicate versions of a particular
dependency as possible in our dependency graph.

This PR introduces `cargo deny`, a linter for Rust dependencies. In
addition to linting for duplicate dependencies, it also enforces that
all dependencies are compatible with an allow-list of licenses and it
warns when a dependency is referred to from multiple crates without
introducing a workspace dependency. Thanks to existing tooling
(https://github.com/mainmatter/cargo-autoinherit), transitioning all
dependencies to workspace dependencies was quite easy.

Resolves: #7241.
2024-11-22 00:17:28 +00:00
Thomas Eizinger
56db250e2c feat(connlib): validate integrity of all relay responses (#7378)
In order to avoid processing of responses of relays that somehow got
altered on the network path, we now use the client's `password` as a
shared secret for the relay to also authenticate its responses. This
means that not all messages can be authenticated. In particular, BINDING
requests will still be unauthenticated.

Performing this validation now requires every component that crafts
input to the `Allocation` to include a valid `MessageIntegrity`
attribute. This is somewhat problematic for the regression tests of the
relay and the unit tests of `Allocation`. In both cases, we implement
workarounds so we don't have to actually compute a valid
`MessageIntegrity`. This is deemed acceptable because:

- Both of these are just tests.
- We do test the validation path using `tunnel_test` because there we
run an actual relay.
2024-11-19 18:32:33 +00:00
Thomas Eizinger
ecec00afed chore(snownet): print attributes for all requests and responses (#7380)
When debugging issues related to our TURN allocation code, we sometimes
only have the logs that the code submitted to Sentry. As part of the event,
we submit the last 500 debug logs as breadcrumbs to give more context to
the error.

Unconditionally printing the attributes of each request-response pair
will help us in more easily diagnosing why certain errors happen.
2024-11-19 14:32:23 +00:00
Thomas Eizinger
e8519cca0c chore(snownet): warn on exceeding number of candidate pairs (#7376)
In the latest version, we added a warning log to str0m when the maximum
number of candidate pairs is exceeded:
https://github.com/algesten/str0m/pull/587.

We only ever add the candidates of a single relay to an agent (2
candidates), plus at most 2 server-reflexive candidates and at most 2
host candidates. Unless there is a bug like what we fixed in #7334,
exceeding the default number of candidate _pairs_ (100) should never
happen.

In case it does, the newly added `warn` log in `str0m` will trigger a
Sentry alert.
2024-11-19 04:34:23 +00:00
Thomas Eizinger
de35bb067e fix(telemetry): don't embed error values in telemetry_event! (#7366)
Due to https://github.com/getsentry/sentry-rust/issues/702, errors which
are embedded as `tracing::Value` unfortunately get silently discarded
when reported as part of Sentry "Event"s and not "Exception"s.

The design idea of these telemetry events is that they aren't fatal
errors so we don't need to treat them with the highest priority. They
may also appear quite often, so to save performance and bandwidth, we
sample them at a rate of 1% at creation time.

In order to not lose the context of these errors, we instead format them
into the message. This makes them completely identical to the `debug!`
logs which we have on every call-site of `telemetry_event!`, which
prompted me to make that implicit as part of creating the
`telemetry_event!`.

Resolves: #7343.
2024-11-18 18:17:08 +00:00
Thomas Eizinger
d9fb9e53c8 chore(snownet): print error code for unhandled message (#7367)
All our logic for handling errors is based on the error code. Even
though there should be a 1:1 mapping between error code and reason
phrase, I am seeing odd reports in Sentry for a case that we should be
handling but aren't.
2024-11-18 18:15:04 +00:00
Thomas Eizinger
9536b8116c fix: don't exit TUN thread on errors (#7354)
I noticed that in case there is an error when reading from the TUN
device, we currently exit that thread and we don't have a mechanism at
the moment to restart it. Discarding the thread also means we can no
longer send new instances of `Tun` into it.

Instead of exiting the thread, we now just log the error and continue.
In case the error was caused by the FD being closed, we discard the
instance of `Tun` and wait for a new one.
2024-11-16 06:19:41 +00:00
Thomas Eizinger
4e423dc51c fix(connlib): send all unwritten packets before reading new ones (#7342)
With the parallelisation of TUN and UDP operations, we lost
backpressure: Packets can now be read quicker from the UDP sockets than
they can be sent out the TUN device, causing packet loss in extremely
high-throughput situations.

To avoid this, we don't directly send packets into the channel to the
TUN device thread. This channel is bounded, meaning sending can fail if
reading UDP packets is faster than writing packets to the TUN device.

Due to GRO, we may read multiple UDP packets in one go, requiring us to
write multiple IP packets to the TUN device as part of a single
iteration in the event-loop. Thus, we cannot know how much space we
need in the channel for outgoing IP packets.

By introducing a dedicated buffer, we can temporarily hold on to all of
these packets and on the next call to `poll`, we flush them out into the
channel. If the channel is full, we will suspend and only continue once
there is space in the channel. This behaviour restores backpressure
because we won't read UDP packets from the socket unless we have space
to write the corresponding packet to the TUN device.

UDP itself actually doesn't have any backpressure; instead, the packets
will simply get dropped once the receive buffer overflows. The UDP
packets however carry encrypted IP packets, meaning whatever protocol
sits inside these packets will detect the packet loss and should
throttle its sending pace accordingly.
2024-11-14 06:25:03 +00:00
Thomas Eizinger
8c5a5fa690 chore(rust): correctly disable ANSI escapes globally (#7336)
I think I finally understood and correctly traced where the use of ANSI
escape codes came from. It turns out, the `with_ansi` switch on
`tracing_subscriber::fmt::Layer` is what you want to toggle. From there,
it trickles down to the `Writer` which we can then test for in our
`Format`.

Resolves: #7284.

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
2024-11-14 05:00:53 +00:00
Thomas Eizinger
efeba55709 chore(snownet): fail TURN connection on unknown attribute (#7341)
A TURN server that doesn't understand certain attributes should return
"Unknown attributes" as part of its response. Whilst we aim to be as
spec-compliant as possible, Firezone doesn't officially support other
TURN servers than our own relay.

If we encounter a TURN server that sends us an "Unknown attribute", we
now immediately fail this allocation and clear it as we cannot make any
more assumptions about what the connected relay actually supports.
2024-11-14 02:43:17 +00:00
Thomas Eizinger
3cf5cbb989 chore(connlib): only send some tunnel errors to Sentry (#7340)
Errors from the tunnel can potentially happen on a per-packet basis. In
order to not flood Sentry, reduce the log-level down to `debug` and only
report 1% of all errors. We did the same thing for the gateway in #7299.
2024-11-14 02:32:37 +00:00
Thomas Eizinger
00c7c42113 fix(snownet): don't allow duplicate server-reflexive candidates (#7334)
In #7163, we introduced a shared cache of server-reflexive candidates
within a `snownet::Node`. What we unfortunately overlooked is that if a
node (i.e. a client or a gateway) is behind symmetric NAT, then we will
repeatedly create "new" server-reflexive candiates, thereby filling up
this cache.

This cache is used to initialise the agents with local candidates, which
manifests in us sending dozens if not hundreds of candidates to the
other party. Whilst not harmful in itself, it does create quite a lot of
spam. To fix this, we introduce a limit of only keeping around 1
server-reflexive candidate per IP version, i.e. only 1 IPv4 and 1 IPv6
address.

At present, `connlib` only supports a single egress interface meaning
for now, we are fine with making this assumption.

In case we encounter a new candidate of the same kind and same IP
version, we evict the old one and replace it with the new one. Thus, for
subsequent connections, only the new candidate is used.
2024-11-14 00:14:29 +00:00
Thomas Eizinger
3dd913f6df fix(snownet): emit correct event on invalidating srflx candidate (#7333)
This one has been lurking in the codebase for a while. Fortunately, it
is not very critical because invalidation of server-reflexive addresses
happens pretty rarely.
2024-11-13 20:12:20 +00:00
Thomas Eizinger
7e0d2ca59c chore: add telemetry event in case we see large datagrams (#7335)
If we see these, something fishy is going on (see #7332), so we should
definitely know about these by recording Sentry events. These can
potentially be per packet so we only send a telemetry event which gets
sampled at a rate of 1%.
2024-11-13 20:09:58 +00:00
Thomas Eizinger
48ba2869a8 chore(rust): ban the use of .unwrap except in tests (#7319)
Using the clippy lint `unwrap_used`, we can automatically lint against
all uses of `.unwrap()` on `Result` and `Option`. This turns up quite a
few results actually. In most cases, they are invariants that can't
actually be hit. For these, we change them to `Option`. In other cases,
they can actually be hit. For example, if the user supplies an invalid
log-filter.

Activating this lint ensures the compiler will yell at us every time we
use `.unwrap` to double-check whether we do indeed want to panic here.

Resolves: #7292.
2024-11-13 03:59:22 +00:00
Thomas Eizinger
0e20f7d086 chore(connlib): better error reporting for invalid IP packets (#7320)
Currently, we don't report very detailed errors when we fail to parse
certain IP packets. With this patch, we use `Result` in more places and
also extend the validation of IP packets to:

a) enforce a length of at most 1280 bytes. This should already be the
case due to our MTU but bad things may happen if that is off for some
reason
b) validate the entire IP packet instead of just its header
2024-11-12 19:46:32 +00:00
Thomas Eizinger
19f51568c2 chore(rust): don't pass errors as values for debug logs (#7318)
Our logging library `tracing` supports structured logging. Structured
logging means we can include values within a `tracing::Event` without
having to immediately format them as strings. Processing these values -
such as errors - as their original type allows the various `tracing`
layers to capture and represent them as they see fit.

One of these layers is responsible for sending ERROR and WARN events to
Sentry, as part of which `std::error::Error` values get automatically
captured as so-called "sentry exceptions".

Unfortunately, there is a caveat: If an `std::error::Error` value is
included in an event that does not get mapped to an exception, the
`error` field is completely lost. See
https://github.com/getsentry/sentry-rust/issues/702 for details.

To work around this, we introduce an `err_with_sources` adapter that
formats an error and all its sources together into a string. For all
`tracing::debug!` statements, we then use this to report these errors.

It is really unfortunate that we have to do this and cannot use the same
mechanism, regardless of the log level. However, until this is fixed
upstream, this will do and gives us better information in the log
submitted to Sentry.
2024-11-12 04:00:02 +00:00
Thomas Eizinger
d38304b21f build(rust): depend on our boringtun fork (#7120)
This switches our dependency on `boringtun` over to our fork at
https://github.com/firezone/boringtun. The idea of the fork is to
carefully only patch selective parts such that upstreaming things later is
still possible. The complete diff can be seen here:
https://github.com/cloudflare/boringtun/compare/master...firezone:boringtun:master

So far, the only patches in the fork are dependency bumps, linter fixes,
adjustments to log levels and the removal of panics when the destination
buffer is too small.
2024-11-12 03:40:36 +00:00
Thomas Eizinger
237865c37b test(connlib): drain all Transmits at the end of advance (#7315)
Within our test suite, we "spin" for several (simulated) seconds after
each state transition to allow for packets being sent between the
different nodes. The test suite simulates different latencies by
delaying the delivery of some of these packets.

`connlib` has several timers for sending packets, i.e. STUN bindings, WG
keep-alives etc. These timers never end so we cannot simply spin "until
we no longer want to send any packets". Currently, we simply hard-stop
after a few seconds and drop the remaining packets and move on to the
next state transition.

At present, this isn't an issue because only our ICE agent adheres to
the simulated time advancement. `boringtun` is still impure and thus we
usually don't get to see any of the WireGuard packets like keep-alives
and session timeouts etc in our tests. The STUN messages are pretty
resilient to retransmissions so the current packet drop doesn't matter.

In the process of adopting our boringtun fork
(https://github.com/firezone/boringtun) where we will eventually fix the
time impurity, dropping some of these packets caused problems.

To fix this, we now drain all remaining packets that are sitting in the
"yet-to-be-delivered" buffer. These packets are delivered to an "inbox"
that is per-host, meaning the host (i.e. client, gateway or relay) will
still perceive the incoming packet with the correct latency.

We extract this functionality from #7120 because it is generally useful.
2024-11-12 03:19:07 +00:00