firezone

mirror of https://github.com/outbackdingo/firezone.git synced 2026-03-21 18:41:38 +00:00

Author	SHA1	Message	Date
Thomas Eizinger	de35bb067e	fix(telemetry): don't embed errors values in `telemetry_event!` (#7366 ) Due to https://github.com/getsentry/sentry-rust/issues/702, errors which are embedded as `tracing::Value` unfortunately get silently discarded when reported as part of Sentry "Event"s and not "Exception"s. The design idea of these telemetry events is that they aren't fatal errors so we don't need to treat them with the highest priority. They may also appear quite often, so to save performance and bandwidth, we sample them at a rate of 1% at creation time. In order to not lose the context of these errors, we instead format them into the message. This makes them completely identical to the `debug!` logs which we have on every call-site of `telemetry_event!` which prompted me to make that implicit as part of creating the `telemetry_event!`. Resolves: #7343.	2024-11-18 18:17:08 +00:00
Thomas Eizinger	d9fb9e53c8	chore(snownet): print error code for unhandled message (#7367 ) All our logic for handling errors is based on the error code. Even though there should be a 1:1 mapping between error code and reason phrase, I am seeing odd reports in Sentry for a case that we should be handling but aren't.	2024-11-18 18:15:04 +00:00
Thomas Eizinger	9536b8116c	fix: don't exit TUN thread on errors (#7354 ) I noticed that in case there is an error when reading from the TUN device, we currently exit that thread and we don't have a mechanism at the moment to restart it. Discarding the thread also means we can no longer send new instances of `Tun` into it. Instead of exiting the thread, we now just log the error and continue. In case the error was caused by the FD being closed, we discard the instance of `Tun` and wait for a new one.	2024-11-16 06:19:41 +00:00
Thomas Eizinger	4e423dc51c	fix(connlib): send all unwritten packets before reading new ones (#7342 ) With the parallelisation of TUN and UDP operations, we lost backpressure: Packets can now be read quicker from the UDP sockets than they can be sent out the TUN device, causing packet loss in extremely high-throughput situations. To avoid this, we don't directly send packets into the channel to the TUN device thread. This channel is bounded, meaning sending can fail if reading UDP packets is faster than writing packets to the TUN device. Due to GRO, we may read multiple UDP packets in one go, requiring us to write multiple IP packets to the TUN device as part of a single iteration in the event-loop. Thus, we cannot know, how much space we need in the channel for outgoing IP packets. By introducing a dedicated buffer, we can temporarily hold on to all of these packets and on the next call to `poll`, we flush them out into the channel. If the channel is full, we will suspend and only continue once there is space in the channel. This behaviour restores backpressue because we won't read UDP packets from the socket unless we have space to write the corresponding packet to the TUN device. UDP itself actually doesn't have any backpressure, instead the packets will simply get dropped once the receive buffer overflows. The UDP packets however carry encrypted IP packets, meaning whatever protocol sits inside these packets will detect the packet loss and should throttle their sending-pace accordingly.	2024-11-14 06:25:03 +00:00
Thomas Eizinger	8c5a5fa690	chore(rust): correctly disable ANSI escapes globally (#7336 ) I think I finally understood and correctly traced, where the use of ANSI escape codes came from. It turns out, the `with_ansi` switch on `tracing_subscriber::fmt::Layer` is what you want to toggle. From there, it trickles down to the `Writer` which we can then test for in our `Format`. Resolves: #7284. --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io>	2024-11-14 05:00:53 +00:00
Thomas Eizinger	efeba55709	chore(snownet): fail TURN connection on unknown attribute (#7341 ) A TURN server that doesn't understand certain attributes should return "Unknown attributes" as part of its response. Whilst we aim to be as spec-compliant as possible, Firezone doesn't officially support other TURN servers than our own relay. If we encounter a TURN server that sends us an "Unknown attribute", we now immediately fail this allocation and clear it as we cannot make any more assumptions about what the connected relay actually supports.	2024-11-14 02:43:17 +00:00
Thomas Eizinger	3cf5cbb989	chore(connlib): only send some tunnel errors to Sentry (#7340 ) Errors from the tunnel can potentially happen on a per-packet basis. In order to not flood Sentry, reduce the log-level down to `debug` and only report 1% of all errors. We did the same thing for the gateway in #7299.	2024-11-14 02:32:37 +00:00
Thomas Eizinger	00c7c42113	fix(snownet): don't allow duplicate server-reflexive candidates (#7334 ) In #7163, we introduced a shared cache of server-reflexive candidates within a `snownet::Node`. What we unfortunately overlooked is that if a node (i.e. a client or a gateway) is behind symmetric NAT, then we will repeatedly create "new" server-reflexive candiates, thereby filling up this cache. This cache is used to initialise the agents with local candidates, which manifests in us sending dozens if not hundreds of candidates to the other party. Whilst not harmful in itself, it does create quite a lot of spam. To fix this, we introduce a limit of only keeping around 1 server-reflexive candidate per IP version, i.e. only 1 IPv4 and IPv6 address. At present, `connlib` only supports a single egress interface meaning for now, we are fine with making this assumption. In case we encounter a new candidate of the same kind and same IP version, we evict the old one and replace it with the new one. Thus, for subsequent connections, only the new candidate is used.	2024-11-14 00:14:29 +00:00
Thomas Eizinger	3dd913f6df	fix(snownet): emit correct event on invalidating srflx candidate (#7333 ) This one has been lurking in the codebase for a while. Fortunately, it is not very critical because invalidation of server-reflexive addresses happens pretty rarely.	2024-11-13 20:12:20 +00:00
Thomas Eizinger	7e0d2ca59c	chore: add telemetry event in case we see large datagrams (#7335 ) If we see these, something fishy is going on (see #7332), so we should definitely know about these by recording Sentry events. These can potentially be per packet so we only send a telemetry event which gets sampled at a rate of 1%.	2024-11-13 20:09:58 +00:00
Thomas Eizinger	48ba2869a8	chore(rust): ban the use of `.unwrap` except in tests (#7319 ) Using the clippy lint `unwrap_used`, we can automatically lint against all uses of `.unwrap()` on `Result` and `Option`. This turns up quite a few results actually. In most cases, they are invariants that can't actually be hit. For these, we change them to `Option`. In other cases, they can actually be hit. For example, if the user supplies an invalid log-filter. Activating this lint ensures the compiler will yell at us every time we use `.unwrap` to double-check whether we do indeed want to panic here. Resolves: #7292.	2024-11-13 03:59:22 +00:00
Thomas Eizinger	0e20f7d086	chore(connlib): better error reporting for invalid IP packets (#7320 ) Currently, we don't report very detailed errors when we fail to parse certain IP packets. With this patch, we use `Result` in more places and also extend the validation of IP packets to: a) enforce a length of at most 1280 bytes. This should already be the case due to our MTU but bad things may happen if that is off for some reason b) validate the entire IP packet instead of just its header	2024-11-12 19:46:32 +00:00
Thomas Eizinger	19f51568c2	chore(rust): don't pass errors as values for debug logs (#7318 ) Our logging library `tracing` supports structured logging. Structured logging means we can include values within a `tracing::Event` without having to immediately format it as a string. Processing these values - such as errors - as their original type allows the various `tracing` layers to capture and represent them as they see fit. One of these layers is responsible for sending ERROR and WARN events to Sentry, as part of which `std::error::Error` values get automatically captured as so-called "sentry exceptions". Unfortunately, there is a caveat: If an `std::error::Error` value is included in an event that does not get mapped to an exception, the `error` field is completely lost. See https://github.com/getsentry/sentry-rust/issues/702 for details. To work around this, we introduce a `err_with_sources` adapter that an error and all its sources together into a string. For all `tracing::debug!` statements, we then use this to report these errors. It is really unfortunate that we have to do this and cannot use the same mechanism, regardless of the log level. However, until this is fixed upstream, this will do and gives us better information in the log submitted to Sentry.	2024-11-12 04:00:02 +00:00
Thomas Eizinger	d38304b21f	build(rust): depend on our `boringtun` fork (#7120 ) This switches our dependency on `boringtun` over to our fork at https://github.com/firezone/boringtun. The idea of the fork is to carefully only patch selective parts such that upstream things later is still possible. The complete diff can be seen here: https://github.com/cloudflare/boringtun/compare/master...firezone:boringtun:master So far, the only patches in the fork are dependency bumps, linter fixes, adjustments to log levels and the removal of panics when the destination buffer is too small.	2024-11-12 03:40:36 +00:00
Thomas Eizinger	237865c37b	test(connlib): drain all `Transmit`s at the end of `advance` (#7315 ) Within our test suite, we "spin" for several (simulated) seconds after each state transition to allow for packets being sent between the different nodes. The test suite simulates different latencies by delaying the delivery of some of these packets. `connlib` has several timers for sending packets, i.e. STUN bindings, WG keep-alives etc. These timers never end so we cannot simply spin "until we no longer want to send any packets". Currently, we simply hard-stop after a few seconds and drop the remaining packets and move on to the next state transition. At present, this isn't an issue because only our ICE agent adheres to the simulated time advancement. `boringtun` is still impure and thus we usually don't get to see any of the WireGuard packets like keep-alives and session timeouts etc in our tests. The STUN messages are pretty resilient to retransmissions so the current packet drop doesn't matter. In the process of adopting our boringtun fork (https://github.com/firezone/boringtun) where we will eventually fix the time impurity, dropping some of these packets caused problems. To fix this, we now drain all remaining packets that are sitting in the "yet-to-be-delivered" buffer. These packets are delivered to an "inbox" that is per-host, meaning the host (i.e. client, gateway or relay) will still perceive the incoming packet with the correct latency. We extract this functionality from #7120 because it is generally useful.	2024-11-12 03:19:07 +00:00
Thomas Eizinger	a83729e439	chore(gateway): be more detailed in error reporting (#7314 ) Instead of collapsing multiple of these errors into one, we emit a dedicated error message for each case. This will allow us to distinguish them within Sentry events.	2024-11-12 03:16:06 +00:00
Thomas Eizinger	ad4eea29ff	chore(rust): don't panic in fallible functions (#7298 ) "Just let it crash" is terrible advice for software that is shipped to end users. Where possible, we should use proper error handling and only fail the current function / task that is active, e.g. drop a particular packet instead of failing all of connlib. We more or less already do that. Activating the clippy lint `unwrap_in_result` surfaced a few more places where we panic despite being in a function that is fallible already. These cases can easily be converted to not panic and return an error instead.	2024-11-11 23:55:23 +00:00
dependabot[bot]	7e4e190cd6	build(deps): Bump test-strategy from 0.3.1 to 0.4.0 in /rust (#7308 ) Bumps [test-strategy](https://github.com/frozenlib/test-strategy) from 0.3.1 to 0.4.0. <details> <summary>Commits</summary> <ul> <li><a href="`c683eb3cf6`"><code>c683eb3</code></a> Version 0.4.0.</li> <li><a href="`17706bcd1c`"><code>17706bc</code></a> Update MSRV to 1.70.0.</li> <li><a href="`90a5efbf00`"><code>90a5efb</code></a> Update dependencies.</li> <li><a href="`cff2ede71f`"><code>cff2ede</code></a> Changed the strategy generated by <code>#[filter(...)]</code> to reduce `Too many local ...</li> <li><a href="`34cc6d2545`"><code>34cc6d2</code></a> Update expected compile error message.</li> <li><a href="`a4427e2d98`"><code>a4427e2</code></a> Update CI settings.</li> <li><a href="`ecb7dbae04`"><code>ecb7dba</code></a> Clippy.</li> <li><a href="`637f29e9c8`"><code>637f29e</code></a> Made it so an error occurs when an unsupported attribute is specified for enu...</li> <li><a href="`6d66057bb0`"><code>6d66057</code></a> Use <code>test</code> instead of <code>check</code> with <code>cargo hack --rust-version</code>.</li> <li><a href="`cee2ebbfe6`"><code>cee2ebb</code></a> Fix CI settings.</li> <li>Additional commits viewable in <a href="https://github.com/frozenlib/test-strategy/compare/v0.3.1...v0.4.0">compare view</a></li> </ul> </details> <br /> [![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=test-strategy&package-manager=cargo&previous-version=0.3.1&new-version=0.4.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- <details> <summary>Dependabot commands and options</summary> <br /> You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) </details> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Thomas Eizinger <thomas@eizinger.io>	2024-11-11 21:26:41 +00:00
Thomas Eizinger	4e67c568da	refactor: downgrade errors when we cannot bind sockets (#7305 ) Most of the time, these errors are a result of a limited IP stack, for example IPv6 not being available. Reporting these as errors to Sentry is unnecessarily noisy. If something else happens further down the line, the last 500 debug and info logs will be sent along with the error report so we will still see these in the breadcrumbs if an actual error happens later. Resolves: #7245.	2024-11-11 19:51:10 +00:00
Thomas Eizinger	488c599d5b	chore(telemetry): capture Firezone ID and account in user ctx (#7310 ) Sentry has a feature called the "User context" which allows us to assign events to individual users. This in turn will give us statistics in Sentry, how many users are affected by a certain issue. Unfortunately, Sentry's user context cannot be built-up step-by-step but has to be set as a whole. To achieve this, we need to slightly refactor `Telemetry` to not be `clone`d and instead passed around by mutable reference. Resolves: #7248. Related: https://github.com/getsentry/sentry-rust/issues/706.	2024-11-11 19:50:14 +00:00
Thomas Eizinger	06a274c4e5	refactor(apple): make panics on decoding errors more descriptive (#7301 ) The communication between the native Apple client and `connlib` uses JSON encoding. The deserialisation of these should never fail because a particular version of `connlib` is always bundled with the native client. Thus, panicking here is justified. In case it does ever happen, we improve the panic message with this patch.	2024-11-11 04:18:07 +00:00
Thomas Eizinger	213dd34ff3	chore(apple): log all `connect` errors on the connlib-side (#7300 ) We don't have Sentry yet in the native Apple client, meaning we don't yet learn about errors returned from the `connect` call. Normally, logging and returning an error is an anti-pattern. We are okay with that in this case until we can report these errors in the native Apple client.	2024-11-11 04:01:12 +00:00
Thomas Eizinger	62cb32b7a3	chore(gateway): report more tunnel errors to event-loop (#7299 ) Currently, the Gateway's state machine functions for processing packets use type-signature that only return `Option`. Any errors while processing packets are logged internally. This makes it difficult to consistently log these errors. We refactor these functions to return `Result<Option<T>>` in most cases, indicating that they may fail for various reasons and also sometimes succeed without producing an output. This allows us to consistently log these errors in the event-loop. Logging them on WARN or ERROR would be too spammy though. In order to still be alerted about some of these, we use the `telemetry_event!` macro which samples them at a rate of 1%. This will alert us about cases that happen often and allows us to handle them explicitly. Once this is deployed to staging, I will monitor the alerts in Sentry to ensure we won't get spammed with events from customers on the next release.	2024-11-11 03:50:27 +00:00
Jamil	1dda915376	ci: Publish new clients (#7291 ) Fixes the roaming bug.	2024-11-08 22:58:06 +00:00
Thomas Eizinger	8653146c18	fix(connlib): discard timer once it fired (#7288 ) Within `connlib`, we have many nested state machines. Many of them have internal timers by means of timestamps with which they indicate, when they'd like to be "woken" to perform time-related processing. For example, the `Allocation` state machine would indicate with a timestamp 5 minutes from the time an allocation is created that it needs to be woken again in order to send the refresh message to the relay. When we reset our network connections, we pretty much discard all state within connlib and together with that, all of these timers. Thus the `poll_timeout` function would return `None`, indicating that our state machines are not waiting for anything. Within the eventloop, the most outer state machine, i.e. `ClientState` is paired with an `Io` component that actually implements the timer by scheduling a wake-up aggregated as the earliest point of all state machines. In order to not fire the same timer multiple times in a row, we already intended to reset the timer once it fired. It turns out that this never worked and the timer still lingered around. When we call `reset`, `poll_timeout` - which feeds this timer - returns `None` and the timer doesn't get updated until it will finally return `Some` with an `Instant`. Because the previous timer didn't get cleared when it fired, this caused `connlib` to busy loop and prevent some(?) other parts of it from progressing, resulting in us never being able to reconnect to the portal. Yet, because the event loop itself was still operating, we could still resolve DNS queries and such. Resolves: #7254. --------- Co-authored-by: Jamil Bou Kheir <jamilbk@users.noreply.github.com>	2024-11-08 12:19:14 +00:00
Thomas Eizinger	4d2dc3dfcb	refactor(connlib): don't rely on DNS when reconnecting to portal (#7289 ) Currently, `connlib` uses the feature of "known hosts" to provide DNS functionality for some domains even without any network connectivity. This is primarily used to ensure that when we reconnect to the portal, we can resolve the domain name which allows us to then create network connections. With recent changes to how our phoenix-channel implementation works, this is actually no longer necessary. Currently, we re-resolve the domain every time we connect, even though we already resolved them when connecting to it for the first time. This step is unnecessary and we can simply directly use the previously resolved IP addresses for the portal domain.	2024-11-08 05:06:42 +00:00
Thomas Eizinger	47e45a3cf3	chore(telemetry): improve telemetry spans and events (#7206 ) DNS resolution is a critical part of `connlib`. If it is slow for whatever reason, users will notice this. To make sure we notice as well, we add `telemetry` spans to the client's and gateway's DNS resolution. For the client, this applies to all DNS queries that we forward to the upstream servers. For the gateway, this applies to all DNS resources. In addition to those IO operations, we also instrument the `match_resource_linear` function. This function operates in `O(n)` of all defined DNS resources. It _should_ be fast enough to not create an impact but it can't hurt to measure this regardless. Lastly, we also instrument `refresh_translations` on the gateway. Refreshing the DNS resolution of a DNS resource should really only happen, when the previous IP addresses become stale yet the user is still trying to send traffic to them. We don't actually have any data on how often that happens. By instrumenting it, we can gather some of this data. To make sure that none of these telemetry events and spans hurt the end-user performance, we introduce macros to `firezone-logging` that sample the creation of these events and spans at a rate of 1%. I ran a flamegraph and none of these even showed up. The most critical one here is probably the `match_resource_linear` span because it happens on every DNS query. Resolves: #7198. --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io>	2024-11-06 01:17:57 +00:00
Thomas Eizinger	a5730b6f3b	chore: release apple client 1.3.8 (#7268 ) To be merged once Apple approves the app review. --------- Co-authored-by: Jamil Bou Kheir <jamilbk@users.noreply.github.com>	2024-11-05 11:15:50 -08:00
Thomas Eizinger	42f00fe01e	chore(snownet): fast-path using `PartialEq` (#7207 ) Counter-intuitively, doing a linear search across all local candidates and checking for equality is faster than hashing the candidate. This is because a `Candidate` actually has quite a few fields and we call this function in the hot-path of packet processing; from `snownet`'s perspective, each packet might come from a different local socket so we have to test for each packet, whether or not we already know about this socket. Using `PartialEq` instead of hashing every candidate saves about 1% in the during a speedtest.	2024-11-05 14:40:06 +00:00
Thomas Eizinger	78ebad13ab	chore(rust): log more errors as `tracing::Value`s (#7208 ) Logging these as structured values gives us a better stacktrace in Sentry (assuming the errors themselves make proper use of defining an error-chain).	2024-11-05 14:36:47 +00:00
Thomas Eizinger	250a31b872	chore(connlib): ignore resolved ip selector in debug output (#7209 )	2024-11-05 14:34:31 +00:00
Thomas Eizinger	271c480357	fix(connlib): don't attempt to encrypt too large packets (#7263 ) When encrypting packets, we need to reserve a buffer within which boringtun will encrypt the IP packet. Unfortunately, `boringtun` panics if that buffer is not big enough which essentially brings all of `connlib` down. Really, we should never see a packet that is too large and ideally, we enforce this at compile-time by creating different variants of `IpPacket` that are sized accordingly. That is a large refactoring so until then, we simply discard them instead of panicking. --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io> Co-authored-by: Jamil <jamilbk@users.noreply.github.com>	2024-11-05 04:17:21 +00:00
dependabot[bot]	a2828a217b	build(deps): Bump thiserror from 1.0.64 to 1.0.68 in /rust (#7260 ) Bumps [thiserror](https://github.com/dtolnay/thiserror) from 1.0.64 to 1.0.68. <details> <summary>Release notes</summary> <p><em>Sourced from <a href="https://github.com/dtolnay/thiserror/releases">thiserror's releases</a>.</em></p> <blockquote> <h2>1.0.68</h2> <ul> <li>Handle incomplete expressions more robustly in format arguments, such as while code is being typed (<a href="https://redirect.github.com/dtolnay/thiserror/issues/341">#341</a>, <a href="https://redirect.github.com/dtolnay/thiserror/issues/344">#344</a>)</li> </ul> <h2>1.0.67</h2> <ul> <li>Improve expression syntax support inside format arguments (<a href="https://redirect.github.com/dtolnay/thiserror/issues/335">#335</a>, <a href="https://redirect.github.com/dtolnay/thiserror/issues/337">#337</a>, <a href="https://redirect.github.com/dtolnay/thiserror/issues/339">#339</a>, <a href="https://redirect.github.com/dtolnay/thiserror/issues/340">#340</a>)</li> </ul> <h2>1.0.66</h2> <ul> <li>Improve compile error on malformed format attribute (<a href="https://redirect.github.com/dtolnay/thiserror/issues/327">#327</a>)</li> </ul> <h2>1.0.65</h2> <ul> <li>Ensure OUT_DIR is left with deterministic contents after build script execution (<a href="https://redirect.github.com/dtolnay/thiserror/issues/325">#325</a>)</li> </ul> </blockquote> </details> <details> <summary>Commits</summary> <ul> <li><a href="`8d06fb5549`"><code>8d06fb5</code></a> Release 1.0.68</li> <li><a href="`372fd8a71a`"><code>372fd8a</code></a> Merge pull request <a href="https://redirect.github.com/dtolnay/thiserror/issues/344">#344</a> from dtolnay/binop</li> <li><a href="`08f89925bf`"><code>08f8992</code></a> Disregard equality binop in fallback parser</li> <li><a href="`d2a823d2ae`"><code>d2a823d</code></a> Merge pull request <a href="https://redirect.github.com/dtolnay/thiserror/issues/343">#343</a> from dtolnay/unnamed</li> <li><a href="`b3bf7a6f69`"><code>b3bf7a6</code></a> Add logic to determine whether unnamed fmt arguments are present</li> <li><a href="`490f9c017b`"><code>490f9c0</code></a> Merge pull request <a href="https://redirect.github.com/dtolnay/thiserror/issues/342">#342</a> from dtolnay/synfull</li> <li><a href="`7daf1b169d`"><code>7daf1b1</code></a> Defer is_syn_full() call until first expression</li> <li><a href="`c92ac9940b`"><code>c92ac99</code></a> Merge pull request <a href="https://redirect.github.com/dtolnay/thiserror/issues/341">#341</a> from dtolnay/parsescan</li> <li><a href="`40a53f7f33`"><code>40a53f7</code></a> Interleave Expr parsing and scanning better</li> <li><a href="`925f2dde77`"><code>925f2dd</code></a> Release 1.0.67</li> <li>Additional commits viewable in <a href="https://github.com/dtolnay/thiserror/compare/1.0.64...1.0.68">compare view</a></li> </ul> </details> <br /> [![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=thiserror&package-manager=cargo&previous-version=1.0.64&new-version=1.0.68)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- <details> <summary>Dependabot commands and options</summary> <br /> You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) </details> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2024-11-04 19:06:15 +00:00
Thomas Eizinger	5564e578fe	fix(telemetry): flush sentry.io events in dedicated task (#7205 ) `sentry`'s transport layer appears to be using blocking IO for flushing events. Performing blocking IO within a future that is running on a worker-thread of tokio causes this operation to hang and eventually time-out after 5 seconds. As a result, many events - especially traces - don't get flushed to sentry when an app is being shut down. To fix this, we make `Telemetry::stop` an `async fn` and offload the flushing to a task on tokio's thread-pool for blocking IO.	2024-11-01 15:52:09 +00:00
Thomas Eizinger	59412223cb	chore: bump Android and Apple apps to next version (#7192 ) We are in the process of releasing these so we need to bump their version to the next one.	2024-10-31 14:24:33 +00:00
Gabi	dc97b9040d	fix(connlib): large upstream dns message (#7183 ) If edns0 doesn't work correctly DNS servers might respond with messages bigger than our maximum udp size. In that case we need to truncate those messages when forwarding the respond back to the interface and expect the OS to retry with TCP. Otherwise we aren't able to allocate a packet big enough for this. Fixes #7121 --------- Co-authored-by: Thomas Eizinger <thomas@eizinger.io>	2024-10-30 04:02:14 +00:00
Thomas Eizinger	1d9802f2e4	fix(connlib): don't add host candidates multiple times (#7172 ) We introduced a boolean bug in #7163 that causes us to attempt to add host candidates much more often than necessary. This spams the logs on DEBUG level but was otherwise not harmful.	2024-10-29 15:13:40 +00:00
Thomas Eizinger	f7a388345b	fix(connlib): reconnect in case we lose all relays (#7164 ) During normal operation, we should never lose connectivity to the set of assigned relays in a client or gateway. In the presence of odd network conditions and partitions however, it is possible that we disconnect from a relay that is in fact only temporarily unavailable. Without an explicit mechanism to retrieve new relays, this means that both clients and gateways can end up with no relays at all. For clients, this can be fixed by either roaming or signing out and in again. For gateways, this can only be fixed by a restart! Without connected relays, no connections can be established. With #7163, we will at least be able to still establish direct connections. Yet, that isn't good enough and we need a mechanism for restoring full connectivity in such a case. We creating a new connection, we already sample one of our relays and assign it to this particular connection. This ensures that we don't create an excessive amount of candidates for each individual connection. Currently, this selection is allowed to be silently fallible. With this PR, we make this a hard-error and bubble up the error that all the way to the client's and gateway's event-loop. There, we initiate a reconnect to the portal as a compensating action. Reconnecting to the portal means we will receive another `init` message that allows us to reconnect the relays. Due to the nature of this implementation, this fix may only apply with a certain delay from when we actually lost connectivity to the last relay. However, this design has the advantage that we don't have to introduce an additional state within `snownet`: Connections now simply fail to establish and the next one soon after _should_ succeed again because we will have received a new `init` message. Resolves: #7162.	2024-10-29 01:01:47 +00:00
Thomas Eizinger	046b9e0cd4	refactor(connlib): track srvflx candidates separately (#7163 ) As part of maintaining an allocation, we also perform STUN with our relays to discover our server-reflexive address. At the moment, these candidates are scoped to an `Allocation`. This is unnecessarily restrictive. Similar to host candidates, server-reflexive candidate entirely depend on the socket you send data from and are thus independent of the allocation's state. During normal operation, this doesn't really matter because all relay traffic is sent through the same sockets so all `Allocation`s end up with the same server-reflexive candidates. Where this does matter is when we disconnect from relay's for one reason or another (for example: #7162). The fact that all but host-candidates are scoped to `Allocation`s means that without `Allocation`s, we cannot make any new connections, not even direct ones. This is unnecessarily restrictive and causes bugs within `Allocation` to have a bigger blast radius than necessary. With this PR, we keep server-reflexive candidates in the same set as host candidates. This allows us to at least establish direct connections in case something is wrong with the relays or our state tracking of relays on the client side.	2024-10-28 16:57:41 +00:00
Thomas Eizinger	8e107b0d65	chore: don't wake `Node` on idle connections (#7149 ) In order to make Firezone more mobile-friendly, waking up the CPU less often is key. In #6845, we introduced a low-power mode into `snownet` that sends STUN messages on a longer interval if the connection is idle. Whilst poking around `boringtun` as part integrating our fork into the main codebase, I noticed that we are updating `boringtun`'s timers every second - even on idle connections. This PR applies the same idea of #6845 to the timers within `Node`: Idle connections get "woken" less and if all connections are idle, we avoid waking the `Node` altogether (unless we need to refresh allocations / channels). Calling `handle_timeout` less often revealed an issue in the tests where we didn't fully process the state changes after invalidating a candidate from the remote. To fix this, we now call `handle_timeout` directly after `{add,remove}_remote_candidate`. This isn't super clean because at first glance, it looks like `handle_timeout` should just be part of the add/remove candidate function. It is quite common for sans-IO designs to require calling `handle_timeout` after state has been changed. In `{Client,Server}Node`, we do it implicitely so that we don't have to do it in the tests and the event-loop. It would be great to test this in some automated fashion but I haven't figured out how yet. I did temporarily add an `info!` log to the event-loop of the client and with this patch applied, the entire event-loop goes to sleep on idle connections. It still does get woken every now and then but no longer every second!	2024-10-25 00:21:06 +00:00
Thomas Eizinger	5cf105f073	chore(android): start telemetry together with connlib session (#7151 ) As a first step for integration Sentry into the Android app, we launch the Sentry Rust agent as soon as a `connlib` session starts up. At a later point, we can also integrate Sentry into the Android app itself using the Java / Kotlin SDK. --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io>	2024-10-24 20:03:06 +00:00
Thomas Eizinger	c12a02e348	chore(apple): start telemetry together with connlib session (#7152 ) This starts up telemetry together with each `connlib` session. At a later point, we can also integrate the native Swift SDK into the MacOS / iOS app to catch non-connlib specific problems. --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io>	2024-10-24 19:59:52 +00:00
Thomas Eizinger	12ca4f1cc7	chore(connlib): introduce `telemetry` log target (#7147 ) With #7105, all ERROR events from `tracing` get logged as exceptions in Sentry and all WARN events get logged as "messages". We don't want to fill up the user's harddrive with logs which means we have to be somewhat conservative, what gets logged on INFO and above (with INFO being the default log level). There are certain events though where it would be useful to know, how often they happen because too many of them can indicate a problem. To solve this problem, we introduce a dedicated `telemetry` log target that the tracing-sentry integration layer watches for. Events for the `telemetry` log target that gets logged on TRACE will be sampled at a rate of 1% and submitted as messages to Sentry.	2024-10-24 01:24:45 +00:00
Thomas Eizinger	58a770a857	chore(connlib): bump severity of several logs (#7140 ) Now that we have Sentry integrated with `tracing`, using `warn!` logs a bit more liberally allows us to detect edge-cases that customers might run into. All the logs touched in this PR represent some kind of problem that it would be good to know about.	2024-10-24 01:01:30 +00:00
Thomas Eizinger	80c5b0df71	refactor(connlib): replace `LogUnwrap` with macros (#7138 ) Using a trait means the call-site of the log message will always be the `log_unwrap` module, despite the `#[track_caller]` annotation. That one only works for `std::panic::Location` unfortunately which `tracing` isn't using. Macros will be evaluated earlier and thus the messages will show up with the correct module name.	2024-10-23 23:44:23 +00:00
Thomas Eizinger	8b62f63065	chore(connlib): respond with empty records to HTTPS query (#7141 ) Applications may query domains for HTTPS RR using the HTTPS record type. `connlib` operates on OSI layer 3 and thus can only hand out IPs for the particular domains. The correct way to signal this to applications is to answer the HTTPS query with NOERROR and return an empty set of records. [RFC9460](https://www.rfc-editor.org/rfc/rfc9460.html#name-client-behavior) says the following: > 4. If one or more "compatible" ([Section 8](https://www.rfc-editor.org/rfc/rfc9460.html#mandatory)) ServiceMode records are returned, these represent the alternative endpoints. Sort the records by ascending SvcPriority. > 5. Otherwise, SVCB resolution has failed, and the list of available endpoints is empty. This implies that returning no records is valid behaviour and forces the client to consider the HTTPS DNS query as failed and query for A / AAAA records instead (if it didn't do so via happy-eyeballs already).	2024-10-23 23:28:20 +00:00
Thomas Eizinger	2ca91a3b1a	chore(connlib): remove old `mock` feature (#7142 ) This is so stale, it definitely needs to go in the bin.	2024-10-23 16:30:15 +00:00
Thomas Eizinger	6eecfc0cfb	fix: replace panics with `Result` for IP packets (#7135 ) My theory for this issue is that we receive a UDP DNS response from an upstream server that is bigger than our MTU and thus forwarding it fails. This PR doesn't fix that issue by itself but only mitigates the actual panic. To properly fix the underlying issue, we need to parse the DNS message. Truncate it and set the TC bit. Related: #7121.	2024-10-23 16:25:12 +00:00
Thomas Eizinger	ee30368970	refactor(connlib): simplify error handling on crash (#7134 ) The `fmt::Display` implementation of `tokio::task::JoinError` already does exactly what we do here: Extracting the panic message if there is one. Thus, we can simplify this code why just moving the `JoinError` into the `DisconnectError` as its source.	2024-10-23 16:13:39 +00:00
Thomas Eizinger	582e919929	test(connlib): don't generate IPs in sentinel range (#7139 ) Fixes: #7137.	2024-10-23 06:42:40 +00:00

1 2 3 4 5 ...

887 Commits