firezone

mirror of https://github.com/outbackdingo/firezone.git synced 2026-01-28 02:18:50 +00:00

Author	SHA1	Message	Date
Thomas Eizinger	d6a1966a42	refactor(snownet): reduce log noise for unhandled packets (#7952 ) When `snownet` originally got developed, its API was designed with the idea in mind that a packet that doesn't get handled is an error. Whilst that is technically true, we don't have any other component that processes packets within Firezone. When a connection is killed by e.g. an ICE timeout, we may still be receiving packets from the other party. Those fill the logs until the other party also runs into a timeout. To prevent this, we don't return errors for these but instead, log them on TRACE. For the case where we are given a packet that doesn't match any known format, we still emit an error.	2025-01-30 01:49:57 +00:00
Jamil	6a73406194	chore: Bump Apple version to 1.4.1 (#7946 )	2025-01-30 00:04:54 +00:00
Thomas Eizinger	8bd8098cab	refactor(connlib): don't re-implement waker for TUN thread (#7944 ) Within `connlib` - on UNIX platforms - we have dedicated threads that read from and write to the TUN device. These threads are connected with `connlib`'s main thread via bounded channels: one in each direction. When these channels are full, `connlib`'s main thread will suspend and not read any network packets from the sockets in order to maintain back-pressure. Reading more packets from the socket would mean most likely sending more packets out the TUN device. When debugging #7763, it became apparent that _something_ must be wrong with these threads and that somehow, we either consider them as full or aren't emptying them and as a result, we don't read _any_ network packets from our sockets. To maintain back-pressure here, we currently use our own `AtomicWaker` construct that is shared with the TUN thread(s). This is unnecessary. We can also directly convert the `flume::Sender` into a `flume::async::SendSink` and therefore directly access a `poll` interface.	2025-01-29 15:48:48 +00:00
Thomas Eizinger	287ea1e8b2	chore(snownet): log ignored candidate (#7943 ) Once we've finished ICE and nominated a socket, we ignore future candidates for the same connection (see #6876). To make this log a bit more helpful, we now log the candidate that we are ignoring on this connection.	2025-01-29 10:21:48 +00:00
Jamil	7e231c6b10	chore: Release Android 1.4.1 (#7911 )	2025-01-29 00:29:15 +00:00
Thomas Eizinger	3daac8730f	fix(connlib): limit batch size on mobile platforms to 25 (#7889 ) The batch size affects how many packets we process at a time. It also affects the worst-case size of a single buffer as all packets may be of the same size and thus need to be appended to the same buffer. On mobile, we can't afford to allocate all of these so we reduce the batch-size there.	2025-01-28 02:30:54 +00:00
Thomas Eizinger	6789b0b377	fix(connlib): always return buffers to pool after sending (#7891 ) Within the `GsoQueue` data structure, we keep a hash map indexed by source, destination and segment length of UDP packets pointing to a buffer for those payloads. What we intended to do here is to return the buffer to the pool after we sent the payload. What we failed to realise is that putting another buffer into the hash map means we have a buffer allocated for a certain destination address and segment length! This buffer would only get reused for the exact same address and segment length, causing memory usage to balloon over time. To fix this, we wrap the `DatagramBuffer` in an additional `Option`. This allows us to actually remove it from the hash map and return the buffer for future use to the buffer pool. Resolves: #7866. Resolves: #7747.	2025-01-28 01:55:54 +00:00
Thomas Eizinger	c6492d4832	fix(rust): don't start all log files with `connlib.` (#7853 ) At present, the file logger for all Rust code starts each logfile with `connlib.`. This is very confusing when exporting the logs from the GUI client because even the logs from the client itself will start with `connlib.`. To fix this, we make the base file name of the log file configurable.	2025-01-28 01:35:05 +00:00
Thomas Eizinger	3887a7b690	fix(connlib): don't pull new GSO buffer unless we need it (#7888 ) When we are queuing a new UDP payload for sending, we always immediately pulled a new buffer even though we might already have one allocated for this particular segment length. This causes an unnecessary spike in memory when we are under load.	2025-01-28 00:34:22 +00:00
Thomas Eizinger	6188efd1e6	refactor(gateway): improve logging for filtered traffic (#7887 ) When the Gateway's filter-engine drops a packet, we currently only log "destination not allowed". This could happen either because we don't have a filter (i.e. the resource is not allowed) or because the TCP / UDP port or ICMP traffic is not allowed. To make debugging easier, we now include that information in the error message. Resolves: #7875.	2025-01-27 23:49:40 +00:00
Thomas Eizinger	e78ef04e6c	chore(snownet): don't log missing attribute for binding requests (#7852 ) STUN binding requests & responses are not authenticated on purpose because they are so easy to fulfill that having to perform the computational work to check the authentication is more work than actually just sending the request. With #7819, we send STUN binding requests more often because they are used as keep-alives to the relay. This spams the debug log because we see > Message does not have a `MessageIntegrity` attribute for every BINDING response. This information isn't interesting for BINDING responses because those will never have a `MessageIntegrity` attribute.	2025-01-24 03:55:30 +00:00
Thomas Eizinger	88c3e228ba	feat(snownet): log which packets resume a connection (#7850 ) In order to debug connection wake-ups, it is useful to know which packet is the first one that gets sent on an idle connection. With this PR, we do exactly that for incoming and outgoing packets through the tunnel. The resulting log looks something like this: ``` 2025-01-24T02:52:51.818Z DEBUG snownet::node: Connection is idle cid=65f149ea-96a4-4eee-ac70-62a1a2590821 2025-01-24T02:52:57.312Z DEBUG firezone_tunnel::client: Cleared DNS resource NAT domain=speed.cloudflare.com 2025-01-24T02:52:57.312Z DEBUG firezone_tunnel::client: Setting up DNS resource NAT gid=65f149ea-96a4-4eee-ac70-62a1a2590821 domain=speed.cloudflare.com 2025-01-24T02:52:57.312Z DEBUG snownet::node: Connection resumed packet=Packet { src: ::, dst: ::, protocol: "Reserved" } cid=65f149ea-96a4-4eee-ac70-62a1a2590821 ``` Here, the connection got resumed because we locally received a DNS query for a DNS resource which triggers a new control protocol message through the tunnel. For this, we use the unspecified IPv6 address for src and dst and the protocol identifier 255 (0xFF) which here renders as "Reserved".	2025-01-24 03:33:50 +00:00
Thomas Eizinger	71b1edfb70	test(connlib): fix race condition of WireGuard handshakes (#7839 ) The committed regression seeds trigger a scenario where the WireGuard sessions of the peers expire in a way where by the time the Client sends the packet, it is still active (179.xx seconds old) and with the latency to the Gateway, the 180s mark is reached and the Gateway clears the session and discards the packet as a result. In order to fix this, I opted to patch WireGuard by introducing a new timer that does not allow the initiator to use a session that is almost expired: https://github.com/firezone/boringtun/pull/68. Resolves: #7832. --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io>	2025-01-24 02:42:43 +00:00
Jamil	1e5599e5fc	refactor(connlib): only log actual updates to the allocation (#7826 ) With #7819, these log messages appear at a ~10x higher rate than before - a day's worth of these would be over 3,000 messages. For BINDING requests, these only matter if the candidates change, therefore we can make the logging conditional to that. --------- Co-authored-by: Thomas Eizinger <thomas@eizinger.io>	2025-01-24 01:17:43 +00:00
Thomas Eizinger	e2c1ef8f09	chore: remove WireGuard keepalive (#7840 ) Contrary to my prior belief, we don't actually need the WireGuard _persistent_ keep-alive. The in-built timers from WireGuard will automatically send keep-alive messages in case no organic reply is sent for a particular request. All NAT bindings along the network path are already kept open using the STUN bindings sent on all candidate pairs. Even on idle connections, we send those every 60s. Well-behaved NATs are meant to keep confirmed UDP bindings open for at least 120s. Even if not, the worst-case here is that a connection which does not send any(!) application traffic is cut.	2025-01-24 00:26:55 +00:00
Thomas Eizinger	f10f29c03b	refactor(connlib): only log cleared nat status if we do (#7841 )	2025-01-23 22:47:28 +00:00
Jamil	0dcde7ffee	fix(connlib): Filter 'dual socket' log for keepalives (#7827 ) #7819 triggers this log every 25s which isn't exactly describing the correct condition any longer. This PR updates the log to only fire when we're determining which socket to use for communicating with the Relay, and not at each keepalive interval.	2025-01-22 05:24:40 +00:00
Thomas Eizinger	8c2d15b8d7	fix(snownet): implement STUN keepalive with relays (#7819 ) Firezone Clients and Gateways create an allocation with a given set of Relays as soon as they start up. If no traffic is being secured and thus no connections are established between them, NAT bindings between Clients / Gateways and the Relays may expire. Typically, these bindings last for 120s. Allocations are only refreshed every 5 min (after 50% of their lifetime has passed). After a NAT binding is expired, the next UDP message passing through the NAT may allocate a new port, thus changing the 3-tuple of the sender. TURN identifies clients by their 3-tuple. Therefore, without a proactive keepalive, TURN clients lose access to their allocation and need to create one under the new port. To fix this, we implement a scheduled STUN binding request every 25s once we have chosen a socket (IPv4 or IPv6) for a given relay. Resolves: #7802.	2025-01-21 13:52:08 +00:00
Thomas Eizinger	b568592e52	fix: avoid spurious rekey in `boringtun` (#7767 ) For a while now, I've known that `boringtun` may perform spurious rekeys but I didn't fully understand why. After spending some time refactoring the internals of `boringtun` and re-reading the whitepaper, I now understand the reason. https://github.com/firezone/boringtun/pull/66 fixes the problem. The proptests have since also discovered the same issue: https://github.com/firezone/firezone/actions/runs/12790301854/job/35655764072.	2025-01-21 13:45:59 +00:00
Thomas Eizinger	943dbf9712	test(connlib): assert resource status as part of `tunnel_test` (#7772 ) In order to ensure that the "site status" in the UIs is always up-to-date, we model the resource status as part of `tunnel_test`. This should cover even the most bizarre combinations of adding, removing, disabling and enabling resources interleaved with sending packets, resetting connections etc. Fixes: #7761.	2025-01-21 04:35:22 +00:00
Thomas Eizinger	14ed7c40cb	test(windows): increase grace-period for timer `Io` timer (#7821 ) Windows' timer granularity isn't as good as the one from Unix platforms. To ensure this test isn't flaky, we increase the grace-period for Windows runners. See https://github.com/firezone/firezone/actions/runs/12862968520/job/35858749736?pr=7808.	2025-01-21 04:28:03 +00:00
Jamil	6670741dee	chore: Bump apple clients to 1.4.0 (#7785 ) Bumps Apple clients to the 1.4.0 release. They're already live.	2025-01-17 00:07:25 +00:00
Thomas Eizinger	081216a929	fix(connlib): don't drop unsent datagrams (#7768 ) We introduced a regression in `connlib` in #7749 whereby queued but unsent datagrams got dropped in case the socket was not ready to send more data. This happens because within `Io`, we pull each datagram one by one from the iterator: `e60ec7144c/rust/connlib/tunnel/src/io.rs (L178-L188)` This function will send datagrams for as long as the socket is ready and drop the iterator afterwards. This means the returned iterator MUST BE lazy and "cancel-safe". This was the case prior to #7749 because `datagrams` function used `iter_mut` and only cut off the to be sent bytes when the next item got pulled from iterator. With #7749, the entire `HashMap` got drained, thus dropping packets if `Io` didn't manage to process the iterator in full.	2025-01-16 15:26:59 +00:00
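The difference between a lazy, "cancel-safe" iterator and a draining one, as described in #7768 above, can be illustrated with a minimal sketch. The function names and the use of a `VecDeque<u32>` are hypothetical stand-ins chosen for illustration; they are not taken from `connlib`, which keys its queue by a `HashMap`.

```rust
use std::collections::VecDeque;

/// Lazy variant: an item only leaves the queue when the caller pulls it.
/// Dropping the iterator early leaves all remaining items queued.
fn lazy_datagrams(queue: &mut VecDeque<u32>) -> impl Iterator<Item = u32> + '_ {
    std::iter::from_fn(move || queue.pop_front())
}

/// Draining variant: `drain(..)` removes the whole range once the iterator
/// is dropped, so items that were never pulled are lost.
fn draining_datagrams(queue: &mut VecDeque<u32>) -> impl Iterator<Item = u32> + '_ {
    queue.drain(..)
}
```

If a caller stops after two items because the socket is no longer ready, the lazy variant keeps the rest for the next attempt whereas the draining variant silently discards them.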
Thomas Eizinger	01c1e629d2	test(connlib): ensure that we never want a time in the past (#7760 ) In #7758, we fix `connlib`s event-loop to always provide the current time to the state machine rather than the one that was requested (which may be in the past). Even though this is already fairly resilient, we should never request a time in the past. This patch adds this as an assertion to our test suite.	2025-01-15 14:49:15 +00:00
Thomas Eizinger	1ebee00699	fix(connlib): prevent time from going backwards (#7758 ) On a high level, `connlib` is a state machine that gets driven by a custom event-loop. For time-related actions, the state machine computes, when it would like to be woken next. The event-loop sets a timer for that value and emits this value when the timer fires. There is an edge-case where this may result in the time going backwards within the state machine. Specifically, if - for whatever reason - the state machine emits a time value that is in the past, the timer in the `Io` component will fire right away but the `deadline` will point to the time in the past. The only thing we are actually interested in is that the timer fires at all. Instead of passing back the deadline of the timer, we fetch the _current_ time and pass that back to the state machine as the current input. This ensures that we never jump back in time because Rust guarantees for calls to `Instant::now` to be monotonic. (https://doc.rust-lang.org/std/time/struct.Instant.html#:~:text=a%20measurement%20of%20a%20monotonically%20nondecreasing%20clock.)	2025-01-15 14:40:32 +00:00
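The idea behind #7758 above, handing the state machine the current time instead of the timer's deadline, might be sketched as follows. `StateMachine` and `on_timer_fired` are hypothetical names for illustration only. The current time is passed in as a parameter so the sketch stays testable; a real event-loop would obtain it from `Instant::now()`.

```rust
use std::time::Instant;

/// Hypothetical stand-in for a sans-IO state machine that tracks time.
struct StateMachine {
    last_now: Option<Instant>,
}

impl StateMachine {
    fn handle_timeout(&mut self, now: Instant) {
        // Time must never move backwards from the state machine's view.
        if let Some(previous) = self.last_now {
            assert!(now >= previous, "time went backwards");
        }
        self.last_now = Some(now);
    }
}

/// Called when the timer fires. The deadline is deliberately ignored because
/// it may lie in the past; only the fact that the timer fired matters.
fn on_timer_fired(state: &mut StateMachine, _deadline: Instant, now: Instant) {
    state.handle_timeout(now);
}
```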
Thomas Eizinger	b313f2a349	fix(connlib): don't spam if relay disconnects during ICE (#7750 ) When `snownet` is tasked to establish a new connection, it first randomly samples one of its relays that is used as an additional source of candidates in case a direct connection is not possible. We (try to) maintain an allocation on each relay throughout the lifetime of a `connlib` session. In case a relay doesn't respond to the initial binding message at all (even after several retries), we consider the relay offline and remove all state associated to it. It is possible that we sampled a relay for use in a connection and only then realise that it is offline. In that case, we print a message to the log: > Selected relay disconnected during ICE; connection may fail The condition for when we print this log is: "we are in `Connecting` and the sampled relay does no longer exist". This results in log spam in case that condition is actually hit because no state is being changed as part of this check and thus, on the next call to `handle_timeout`, this condition is still true! To fix this, we change the `rid` field of `Connecting` to an `Option`. In case we detect that a relay is no longer present, we print the log and then clear the option. As a result, the log is only printed once.	2025-01-13 22:45:03 +00:00
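The "log once by clearing the `Option`" fix from #7750 above can be sketched like this. Only the `rid` field and the `Connecting` state are named in the commit message; the `u32` relay ID, the `check_relay` method and the slice of known relays are assumptions made for illustration.

```rust
/// Hypothetical, reduced version of a `Connecting` state.
struct Connecting {
    /// The sampled relay, or `None` once we noticed that it is gone.
    rid: Option<u32>,
}

impl Connecting {
    /// Returns `true` if the warning was emitted during this call.
    fn check_relay(&mut self, known_relays: &[u32]) -> bool {
        match self.rid {
            Some(rid) if !known_relays.contains(&rid) => {
                eprintln!("Selected relay disconnected during ICE; connection may fail");
                // Changing state here is what prevents the repeated log.
                self.rid = None;
                true
            }
            _ => false,
        }
    }
}
```

Because the check mutates the state it inspects, a second call under the same conditions no longer matches and stays silent.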
Thomas Eizinger	46cdbbcc23	fix(connlib): use a buffer pool for the GSO queue (#7749 ) Within `connlib`, we read batches of IP packets and process them at once. Each encrypted packet is appended to a buffer shared with other packets of the same length. Once the batch is successfully processed, all of these buffers are written out using GSO to the network. This allows UDP operations to be much more efficient because not every packet has to traverse the entire syscall hierarchy of the operating system. Until now, these buffers got re-allocated on every batch. This is pretty wasteful and leads to a lot of repeated allocations. Measurements show that most of the time, we only have a handful of packets with different segment lengths _per batch_. For example, just booting up the headless-client and running a speedtest showed that only 5 of these buffers were needed at one time. By introducing a buffer pool, we can reuse these buffers between batches and avoid reallocating them. Related: #7747.	2025-01-13 19:24:52 +00:00
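A buffer pool of the kind described in #7749 above could, in its simplest form, look like the sketch below. `BufferPool`, `pull` and `give_back` are hypothetical names; the actual pool used by `connlib` is not shown in this log and may work differently.

```rust
/// Minimal, single-threaded buffer pool (illustrative only).
struct BufferPool {
    free: Vec<Vec<u8>>,
    capacity: usize,
}

impl BufferPool {
    fn new(capacity: usize) -> Self {
        Self { free: Vec::new(), capacity }
    }

    /// Reuses a previously returned buffer if one is available,
    /// otherwise allocates a fresh one.
    fn pull(&mut self) -> Vec<u8> {
        self.free
            .pop()
            .unwrap_or_else(|| Vec::with_capacity(self.capacity))
    }

    /// Clears the contents but keeps the allocation for the next batch.
    fn give_back(&mut self, mut buffer: Vec<u8>) {
        buffer.clear();
        self.free.push(buffer);
    }
}
```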
Thomas Eizinger	f5afea6f0d	refactor(connlib): reset authorized resources on roaming (#7746 ) When a Firezone client roams, the host app sends a "reset" command to `connlib`. At present, this "reset" command clears the network connection state and therefore restarts ICE. As part of that, the tunnel key also gets rotated yet which resources have already been authorized is retained. This isn't a problem per se because the client's identity is determined by the "Firezone ID" which persists even across restarts of a Client. For the Gateway however, a roamed Client and a restarted Client are indistinguishable as in both cases, the tunnel public key and ICE credentials change. Instead of only clearing the connection-specific state, we now also throw away all the ACL state that is associated with connections, i.e. which Resource already got authorized on the Gateway. As a result - with this change - Clients will emit another "connection intent" to the portal upon roaming, triggering a new authorization of this flow with a Gateway. There isn't any particular need for doing this except that lingering state can be a nasty source of bugs. With the now idempotent control protocol, it is pretty easy to re-request these authorisations. Overall, this makes `connlib` more resilient and easier to reason about.	2025-01-13 19:16:50 +00:00
Thomas Eizinger	5f5007edb8	refactor(connlib): remove "known hosts" feature (#7723 ) Ever since #7289, we no longer issue any DNS queries to `connlib` when we reconnect to the portal. Thus, the back-then conceived feature of "known hosts" that allowed us to resolve that DNS query without having an upstream receiver is no longer needed.	2025-01-12 17:32:20 +00:00
Thomas Eizinger	a5e398b843	fix(connlib): avoid competing and expired WireGuard sessions (#7704 ) When `connlib` detects that no data is being sent on a connection, it enters a "low-power" mode within which timers are set to a much longer interval than usual. For `boringtun` this moves the timer from 1s to 30s. At present, this timer also guards how often we actually update the timer state within `boringtun`. Instead of following a "only update exactly when this timer fires"-policy, we now adopt a "update at least this often"-policy. The difference here is that while we are executing the `handle_timeout` function, we might as well call into `boringtun` and update its timer state too. Another side-effect of this timer is that `boringtun` may not be woken in time to initiate a rekey when the session expires. WireGuard sessions without activity expire after 3 minutes. Only the initiator should then recreate the session. If this doesn't happen in time, the responder (Gateway) may trigger a keep-alive timeout. Without an active session, keep-alives also initiate sessions, resulting in us having two competing sessions. This fixes the failing test cases added in this PR: There, we ran into a situation where a WireGuard tunnel idled for so long that the spec requires the session to expire. In the test, we then sent a packet using such an expired session but that packet got discarded by the Gateway because of the expired session. The timers are what check whether a session is expired: - By calling `update_timers_at` more often, we can expire the session in time and `boringtun` will buffer the to-be-sent packet until the new session is established. - By deactivating the keep-alive on the Gateway, we ensure that we only ever have a single WireGuard session active. - With https://github.com/firezone/boringtun/pull/53, we ensure the Gateway doesn't initiate a new session in the beginning. 
- With https://github.com/firezone/boringtun/pull/51, we ensure the Client only ever initiates a single session. To be entirely reliable, we also had to remove the idle WG timer and update `boringtun`'s state every second. This is unfortunate but can long-term be fixed by patching WireGuard to tell us exactly when it wants to be woken instead of us having to proactively wake it every second _in case_ it needs to act on a timer. Related: https://github.com/firezone/boringtun/issues/54.	2025-01-09 15:11:14 +00:00
Jamil	6476109b73	fix(apple): Simplify Xcode rust build steps (#7709 ) Xcode doesn't allow wildcards in input file lists, so the rules I set up in #7488 never took effect. Upon further investigation, it appears that the `strip` command executed unconditionally at the end of every Rust build was the culprit. Since Xcode already does this for us, it's a useless step that adds about 30s to the build time. Unfortunately there isn't a good way to tell Xcode not to build rust. But now we don't need to -- `cargo`'s build cache is smart enough to skip builds and we are back to the ~1-2s range for repeated builds when only Swift code has changed. We also add the swift bridge generated code to version control. These files don't change regularly, and Xcode sometimes complains that the files don't exist _before_ it lets you run the `cargo build` to generate them 🙃 .	2025-01-09 07:54:37 +00:00
Jamil	a21b9fe811	fix(connlib): Don't double-encode DNS addresses (#7708 )	2025-01-08 16:35:05 -08:00
Thomas Eizinger	ed5285268d	refactor: merge `on_update_routes` and `on_set_interface_config` (#7699 ) For a while now, `connlib` has been calling these two callbacks right after each other because the internal event already bundles all the information about the TUN device. With this PR, we merge the two callback functions also in layers above `connlib` itself. Resolves: #6182.	2025-01-08 18:26:40 +00:00
Thomas Eizinger	99d77a84cd	fix(connlib): improve boringtun timer precision (#7698 ) With #7684, we update our boringtun fork to support deterministic timers and handshake jitter. Further testing revealed that there was a bug within the jitter implementation that prevented the jitter from actually applying (https://github.com/firezone/boringtun/pull/48). In addition, we were only calling `update_timers_at` with a precision of 1s, making the internal jittering of 0 to 333ms within `boringtun` useless. To fix this, we introduced a `next_timer_update` function in `Tunn` in https://github.com/firezone/boringtun/pull/49 and make use of it in here. Finally, https://github.com/firezone/boringtun/pull/50 prioritizes the sending of these scheduled handshakes to further improve the timer precision. With these patches applied, this is what the rekey logs look like: ``` 2025-01-08T13:20:09.209Z DEBUG boringtun::noise::timers: HANDSHAKE(REKEY_AFTER_TIME (on send)) cid=b3d34a15-55ab-40df-994b-a838e75d65d7 2025-01-08T13:20:09.209Z DEBUG boringtun::noise::timers: Scheduling new handshake jitter=204.361814ms cid=b3d34a15-55ab-40df-994b-a838e75d65d7 2025-01-08T13:20:09.415Z DEBUG boringtun::noise: Sending handshake_initiation cid=b3d34a15-55ab-40df-994b-a838e75d65d7 2025-01-08T13:20:09.537Z DEBUG boringtun::noise: Received handshake_response local_idx=2898279939 remote_idx=2039394307 cid=b3d34a15-55ab-40df-994b-a838e75d65d7 2025-01-08T13:20:09.540Z DEBUG boringtun::noise: New session session=2898279939 cid=b3d34a15-55ab-40df-994b-a838e75d65d7 ``` We can see that the scheduled handshake now does indeed get sent with the applied jitter of 200ms.	2025-01-08 17:47:17 +00:00
Thomas Eizinger	3a8c6c7182	chore(connlib): assert that we don't emit `WouldBlock` errors (#7696 ) When file descriptors like sockets or the TUN device are opened in non-blocking mode, performing operations that would block emits the `WouldBlock` IO error. These errors _should_ be translated into `Poll::Pending` and have a waker registered that gets called whenever the operation should be attempted again. Therefore, we should _never_ see these IO errors. Previously, the implementation of the tunnel's event-loop did not yet properly handle this backpressure and instead sometimes dropped packets when it should have suspended. This has since been fixed but the branch introduced back then, which just ignored the `io::ErrorKind::WouldBlock` errors, had remained. Changing this to a debug-assert will alert us whenever we accidentally break this without altering the behaviour of the release binary.	2025-01-08 14:11:01 +00:00
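The translation of `WouldBlock` into `Poll::Pending` mentioned in #7696 above can be sketched in isolation. `poll_from_io` is a hypothetical helper and deliberately simplified: a real implementation also has to register a waker before returning `Poll::Pending`, which this sketch omits.

```rust
use std::io;
use std::task::Poll;

/// Maps the result of a non-blocking IO operation to a `Poll`.
/// NOTE: waker registration is intentionally left out of this sketch.
fn poll_from_io<T>(result: io::Result<T>) -> Poll<io::Result<T>> {
    match result {
        // The operation would block: suspend instead of surfacing an error.
        Err(e) if e.kind() == io::ErrorKind::WouldBlock => Poll::Pending,
        // Success and every other error are final results.
        other => Poll::Ready(other),
    }
}
```

With such a translation in place at the IO boundary, code further up should never observe `WouldBlock`, which is what makes a debug assertion there reasonable.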
Thomas Eizinger	4c21193c2e	fix(connlib): make time within `boringtun` deterministic (#7684 ) At present, the WireGuard implementation within `boringtun` is impure with regards to time due to calls to `Instant::now` and `Instant::elapsed`. This makes it impossible to exhaustively test time-related features because time cannot be advanced arbitrarily. The rest of `connlib` is implemented in a sans-IO fashion where time is controlled from the outside via `Instant` parameters on every function that requires access to the current time. With this PR, we update to the latest version of our `boringtun` fork at https://github.com/firezone/boringtun which introduces pure equivalents of all functions that require access to the current time _and_ also implements the missing handshake-delay jitter feature (see https://github.com/firezone/boringtun/issues/19). This is a pretty safe upgrade as the production code doesn't really change and time advances at the same rate as before. To ensure this passes our test-suite, I ran 50_000 iterations locally.	2025-01-07 18:53:51 +00:00
Thomas Eizinger	21fc8efd5e	test(connlib): reduce number of duplicate IPs (#7685 ) For our test-suite, we need to sample a unique, non-overlapping IP for each component that is being simulated (client, gateways and relays). These are sampled from a predefined range. Currently, we only consider the first 100 IPs of this range and pick it from an allocated `Vec`. This isn't ideal for performance and increases the likelihood of two hosts having the same IP. IPv4 and IPv6 addresses can also just be represented as numbers. Instead of sampling a random IP from a list, we can simply sample a random number between the first and last address of the particular IP network to achieve the same effect.	2025-01-07 16:51:53 +00:00
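Treating addresses as numbers, as #7685 above suggests, might look like the following for IPv4. `nth_address` is a hypothetical helper; the number `n` is assumed to come from whatever seeded random source the test-suite uses, which is not shown here.

```rust
use std::net::Ipv4Addr;

/// Maps an arbitrary number `n` onto an address inside `network/prefix_len`.
fn nth_address(network: Ipv4Addr, prefix_len: u8, n: u32) -> Ipv4Addr {
    assert!(prefix_len <= 32, "invalid IPv4 prefix length");

    // Number of addresses in the network, computed in u64 to allow for /0.
    let size: u64 = 1u64 << (32 - prefix_len);
    let mask: u32 = if prefix_len == 0 {
        0
    } else {
        u32::MAX << (32 - prefix_len)
    };

    // Host bits of the base are zero, so adding the offset cannot overflow.
    let base = u32::from(network) & mask;
    let offset = (u64::from(n) % size) as u32;

    Ipv4Addr::from(base + offset)
}
```

No list of candidate addresses has to be allocated, and the whole range of the network is available rather than only its first 100 addresses.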
Thomas Eizinger	49ddcf87f7	fix(snownet): make generation of tunnel index deterministic (#7683 ) In order for our test suite to be entirely deterministic, we must not generate random numbers that aren't backed by a seed that we control.	2025-01-07 15:26:50 +00:00
Thomas Eizinger	037a2e64b6	fix(connlib): attempt to detect runtime shutdown within TUN task (#7605 ) Reading and writing to the TUN device within `connlib` happens in a separate thread. The task running within these threads is connected to the rest of `connlib` via channels. When the application shuts down, these threads also need to exit. Currently, we attempt to detect this from within the task when these channels close. It appears that there is a race condition here because we first attempt to read from the TUN device before reading from the channels. We treat read & write errors on the TUN device as non-fatal so we loop around and attempt to read from it again, causing an infinite-loop and log spam. To fix this, we swap the order in which we evaluate the two concurrent tasks: The first task to be polled is now the channel for outbound packets and only if that one is empty, we attempt to read new packets from the TUN device. This is also better from a backpressure point of view: We should attempt to flush out our local buffers of already processed packets before taking on "new work". As a defense-in-depth strategy, we also attempt to detect the particular error from the tokio runtime when it is being shut down and exit the task. Resolves: #7601. Related: https://github.com/tokio-rs/tokio/issues/7056.	2025-01-05 20:41:24 +00:00
Jamil	b6aed36c2c	feat(apple): Set account slug from Swift -> Rust to hydrate Sentry with (#7662 ) - Refactor Telemetry module to expose firezoneId and accountSlug for easier access in the Adapter module - Set accountSlug to WrappedSession.connect for hydrating the Rust sentry context	2025-01-04 02:48:06 +00:00
Thomas Eizinger	d1eb1961dc	fix(connlib): recalculate overlapping CIDR routes less often (#7592 ) Firezone needs to deterministically handle overlapping CIDR routes. The way we handle this is that more specific routes are preferred over less specific ones. In case of an exact overlap, the sorting of the resource ID acts as a tie-breaker: "Smaller" resource IDs are preferred over "larger" ones. This ensures that regardless of which order the resources are added / enabled in, Firezone behaves deterministically. In addition to the above rules, existing connections to Gateways always have precedence: In other words, if we are connected to resource A via Gateway 1 and resource B exactly overlaps with A yet needs to be routed to Gateway 2 and B < A, we still retain resource A in order to not interrupt existing connections. When a connection to a Gateway fails, these mappings are cleaned up. The proptest seeds added in this PR identify a routing mismatch in case a (relayed) connection is cut, followed by adding a non-CIDR resource: `connlib` recalculated the CIDR routes as part of adding the new resource, even though the CIDR resources didn't actually change. This could potentially result in a connection suddenly being routed to a different Gateway despite nothing about that resource changing. To fix this, we add a check for updating the CIDR routes and only perform it in case CIDR resources get changed.	2025-01-03 14:35:32 +00:00
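The first two rules from #7592 above (most specific route wins, smaller resource ID breaks ties) can be sketched as a single ordering. `Route` and `pick_route` are hypothetical, the resource ID is simplified to a `u32`, all candidates are assumed to already match the destination, and the precedence of existing Gateway connections is not modelled.

```rust
use std::cmp::Reverse;

/// A candidate route that covers the destination (illustrative only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Route {
    prefix_len: u8,
    resource_id: u32,
}

/// Picks the winning route independent of the order of `candidates`.
fn pick_route(candidates: &[Route]) -> Option<Route> {
    candidates
        .iter()
        .copied()
        // Longer prefix first (more specific), then the smaller resource ID.
        .min_by_key(|r| (Reverse(r.prefix_len), r.resource_id))
}
```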
Jamil	309914a45d	chore(android): release version 1.4.0 (#7649 ) Bumps the Android client to the 1.4.0 release. Tested in Android emulator.	2025-01-03 14:45:00 +00:00
Thomas Eizinger	1e2bab4420	chore(snownet): log attributes on message integrity failure (#7577 ) We are receiving multiple reports of message, especially error messages from relays, where the message integrity check fails. To get more information as to why, this patch extends this error message with the attributes of the request and response message.	2024-12-23 19:02:36 +00:00
Thomas Eizinger	956bbbfd91	fix(gateway): translate ICMPv6's `PacketTooBig` error (#7567 ) IPv6 treats fragmentation and MTU errors differently than IPv4. Rather than requiring fragmentation on each hop of a routing path, fragmentation needs to happen at the packet source and failure to route a packet triggers an ICMPv6 `PacketTooBig` error. These need to be translated back through our NAT64 implementation of the Gateway. Due to the size difference in the headers of IPv4 and IPv6, the available MTU to the IPv4 packet is 20 bytes _less_ than the MTU reported by the ICMP error. IPv6 headers are always 40 bytes, meaning if the MTU is reported as e.g. 1200 on the IPv6 side, we need to only offer 1180 to the IPv4 end of the application. Once the new MTU is then honored, the packets translated by our NAT64 implementation will still conform to the required MTU of 1200, despite the overhead introduced by the translation. Resolves: #7515.	2024-12-22 12:09:14 +00:00
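The MTU arithmetic from #7567 above works out as follows in a minimal sketch. The function and constant names are hypothetical; the header sizes are the fixed 40-byte IPv6 header and a 20-byte IPv4 header without options.

```rust
/// Size of the fixed IPv6 header in bytes.
const IPV6_HEADER_LEN: u32 = 40;
/// Size of an IPv4 header without options in bytes.
const IPV4_HEADER_LEN: u32 = 20;

/// Converts the MTU reported by an ICMPv6 `PacketTooBig` error into the MTU
/// to report on the IPv4 side, so that the translated IPv6 packet still fits.
fn ipv4_mtu_from_icmpv6(reported_mtu: u32) -> u32 {
    reported_mtu.saturating_sub(IPV6_HEADER_LEN - IPV4_HEADER_LEN)
}
```

For the example in the commit message, a reported MTU of 1200 yields 1180 for the IPv4 end: a 1180-byte IPv4 packet carries 1160 bytes of payload, which becomes exactly 1200 bytes once the 40-byte IPv6 header is attached.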
Thomas Eizinger	1dc6b46344	fix(connlib): regression seed failure (#7558 ) In #7477, we introduced a regression in our test suite for DNS queries that are forwarded through the tunnel. In order to be deterministic when users configure overlapping CIDR resources, we use the sort order of all CIDR resource IDs to pick, which one "wins". To make sure existing connections are not interrupted, this rule does not apply when we already have a connection to a gateway for a resource. In other words, if a new CIDR resource (e.g. resource `A`) is added to connlib that has an overlapping route with another resource (e.g. resource `B`) but we already have a connection to resource `B`, we will continue routing traffic for this CIDR range to resource `B`, despite `A` sorting "before" `B`. The regression that we introduced was that we did not account for resources being "connected" after forwarding a query through the tunnel to it. As a result, in the found failure case, the test suite was expecting to route the packet to resource `A` because it did not know that we are connected to resource `B` at the time of processing the ICMP packet.	2024-12-20 09:59:58 +00:00
Thomas Eizinger	bc2febed99	fix(connlib): use correct constant for truncating DNS responses (#7551 ) In case an upstream DNS server responds with a payload that exceeds the available buffer space of an IP packet, we need to truncate the response. Currently, this truncation uses the wrong constant to check for the maximum allowed length. Instead of the `MAX_DATAGRAM_PAYLOAD`, we actually need to check against a limit that is less than the MTU as the IP layer and the UDP layer both add an overhead. To fix this, we introduce such a constant and provide additional documentation on the remaining ones to hopefully avoid future errors.	2024-12-19 17:15:43 +00:00
Thomas Eizinger	a1cf409af3	fix(connlib): clear all in-flight upstream DNS queries on reset (#7552 ) When a Firezone Client roams, we reset all network connections and rebind our local sockets. Doing that enables us to start from a clean state and establish new connections to Gateways. What we are currently not clearing are in-flight DNS queries. Those are all very likely to fail because our network connection is changing. There is no point in us keeping those around. Additionally, as part of roaming, it may also be that our upstream DNS server changes and thus, we may suddenly receive a response from a DNS server that we no longer know about. Clearing all in-flight DNS queries on reset solves this.	2024-12-18 20:35:30 +00:00
Thomas Eizinger	992b97e6a9	fix(connlib): bind new channel to peer if needed (#7548 ) Initially, when we receive a new candidate from a remote peer, we bind a channel for each remote address on the relay that we sampled. This ensures that every possible communication path is actually functioning. In ICE, all candidates are tried against each other, meaning the remote will attempt to send from each of their candidates to every one of ours, including our relay candidates. To allow this traffic, a channel needs to be bound first. For various reasons, an allocation might become stale or needs to be otherwise invalidated. In that case, all the channel bindings are lost but there might still be an active connection that wants to utilise them. In that case, we will see "No channel" warnings like https://firezone-inc.sentry.io/issues/6036662614/events/f8375883fd3243a4afbb27c36f253e23/. To fix this, we use the attempt to encode a message for a channel as an intent to bind a new one. This is deemed safe because wanting to encode a message to a peer as a channel data message means we want such a channel to exist. The first message here is still dropped but that is better than not establishing the channel at all.	2024-12-18 17:15:17 +00:00
Thomas Eizinger	a80abec4ff	refactor(connlib): remove unused branch in `match` (#7550 ) When deciding what to do with a certain DNS query, we check whether the domain name in question corresponds to any of the (wildcard) DNS resource addresses. If yes, we resolve it to the resource ID of that resource. The source of those resource IDs is the `dns_resources` map. If we have looked up a `ResourceId` in that map, it is impossible for it to not be "known" which means the branch deleted in this PR is completely redundant and already covered by the catch-all branch where `maybe_resource` is `None`.	2024-12-18 15:47:15 +00:00
Thomas Eizinger	62dfe65679	chore(connlib): improve error messages for failed translations (#7540 )	2024-12-18 04:47:26 +00:00


967 Commits