If EDNS0 doesn't work correctly, DNS servers might respond with
messages bigger than our maximum UDP size.
In that case we need to truncate those messages when forwarding the
response back to the interface and expect the OS to retry with TCP.
Otherwise we aren't able to allocate a packet big enough for the
response.
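As a rough sketch (not the literal code in this PR), truncating an
oversized response and flagging it could look like this; it assumes the
response is a well-formed DNS message and that `max_udp_size` is at
least the 12-byte DNS header:

```rust
/// Sketch: truncate an oversized UDP DNS response so the OS stub resolver
/// retries over TCP. A real implementation would re-encode the message and
/// drop whole records instead of cutting mid-record.
fn truncate_dns_response(response: &[u8], max_udp_size: usize) -> Vec<u8> {
    if response.len() <= max_udp_size {
        return response.to_vec();
    }

    let mut truncated = response[..max_udp_size].to_vec();

    // The TC (truncation) flag is bit 0x02 of the third header byte.
    truncated[2] |= 0x02;

    truncated
}
```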
Fixes #7121
---------
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
We introduced a boolean bug in #7163 that caused us to attempt to add
host candidates much more often than necessary. This spammed the logs at
DEBUG level but was otherwise harmless.
During normal operation, we should never lose connectivity to the set of
assigned relays in a client or gateway. In the presence of odd network
conditions and partitions however, it is possible that we disconnect
from a relay that is in fact only temporarily unavailable. Without an
explicit mechanism to retrieve new relays, this means that both clients
and gateways can end up with no relays at all. For clients, this can be
fixed by either roaming or signing out and in again. For gateways, this
can only be fixed by a restart!
Without connected relays, no connections can be established. With #7163,
we will at least be able to still establish direct connections. Yet,
that isn't good enough and we need a mechanism for restoring full
connectivity in such a case.
When creating a new connection, we already sample one of our relays and
assign it to this particular connection. This ensures that we don't
create an excessive amount of candidates for each individual connection.
Currently, this selection is allowed to be silently fallible. With this
PR, we make this a hard error and bubble it all the way up to the
client's and gateway's event-loop. There, we initiate a reconnect
to the portal as a compensating action. Reconnecting to the portal means
we will receive another `init` message that allows us to reconnect the
relays.
Due to the nature of this implementation, this fix may only apply with a
certain delay from when we actually lost connectivity to the last relay.
However, this design has the advantage that we don't have to introduce
an additional state within `snownet`: Connections now simply fail to
establish and the next one soon after _should_ succeed again because we
will have received a new `init` message.
Resolves: #7162.
As part of maintaining an allocation, we also perform STUN with our
relays to discover our server-reflexive address. At the moment, these
candidates are scoped to an `Allocation`. This is unnecessarily
restrictive. Similar to host candidates, server-reflexive candidates
entirely depend on the socket you send data from and are thus
independent of the allocation's state.
During normal operation, this doesn't really matter because all relay
traffic is sent through the same sockets so all `Allocation`s end up
with the same server-reflexive candidates. Where this does matter is
when we disconnect from relays for one reason or another (for example:
#7162). The fact that all but host-candidates are scoped to
`Allocation`s means that without `Allocation`s, we cannot make any new
connections, not even direct ones. This is unnecessarily restrictive and
causes bugs within `Allocation` to have a bigger blast radius than
necessary.
With this PR, we keep server-reflexive candidates in the same set as
host candidates. This allows us to at least establish direct connections
in case something is wrong with the relays or our state tracking of
relays on the client side.
In order to make Firezone more mobile-friendly, waking up the CPU less
often is key. In #6845, we introduced a low-power mode into `snownet`
that sends STUN messages on a longer interval if the connection is idle.
Whilst poking around `boringtun` as part of integrating our fork into the
main codebase, I noticed that we are updating `boringtun`'s timers every
second - even on idle connections.
This PR applies the same idea of #6845 to the timers within `Node`: Idle
connections get "woken" less and if all connections are idle, we avoid
waking the `Node` altogether (unless we need to refresh allocations /
channels).
Calling `handle_timeout` less often revealed an issue in the tests where
we didn't fully process the state changes after invalidating a candidate
from the remote. To fix this, we now call `handle_timeout` directly
after `{add,remove}_remote_candidate`. This isn't super clean because at
first glance, it looks like `handle_timeout` should just be part of the
add/remove candidate function. It is quite common for sans-IO designs to
require calling `handle_timeout` after state has been changed. In
`{Client,Server}Node`, we do it implicitly so that we don't have to do
it in the tests and the event-loop.
It would be great to test this in some automated fashion but I haven't
figured out how yet. I did temporarily add an `info!` log to the
event-loop of the client and with this patch applied, the entire
event-loop goes to sleep on idle connections. It still does get woken
every now and then but no longer every second!
As a first step towards integrating Sentry into the Android app, we launch
the Sentry Rust agent as soon as a `connlib` session starts up. At a
later point, we can also integrate Sentry into the Android app itself
using the Java / Kotlin SDK.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
This starts up telemetry together with each `connlib` session. At a
later point, we can also integrate the native Swift SDK into the MacOS /
iOS app to catch non-connlib specific problems.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
With #7105, all ERROR events from `tracing` get logged as exceptions in
Sentry and all WARN events get logged as "messages". We don't want to
fill up the user's hard drive with logs, which means we have to be
somewhat conservative about what gets logged at INFO and above (with
INFO being the default log level). There are certain events though where
it would be useful to know how often they happen because too many of
them can indicate a problem.
To solve this problem, we introduce a dedicated `telemetry` log target
that the tracing-sentry integration layer watches for. Events for the
`telemetry` log target that get logged at TRACE will be sampled at a
rate of 1% and submitted as messages to Sentry.
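Hypothetical usage at a call-site (the event message here is made up;
the sampling itself happens in the integration layer, not at the
call-site):

```rust
// Events logged at TRACE under the dedicated `telemetry` target are picked up
// by the tracing-sentry integration layer, sampled and submitted as Sentry
// messages. Regular TRACE events are unaffected.
fn on_unexpected_stun_response() {
    tracing::trace!(target: "telemetry", "Received STUN response without a matching request");
}
```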
Now that we have Sentry integrated with `tracing`, using `warn!` logs a
bit more liberally allows us to detect edge-cases that customers might
run into.
All the logs touched in this PR represent some kind of problem that it
would be good to know about.
Using a trait means the call-site of the log message will always be the
`log_unwrap` module, despite the `#[track_caller]` annotation. That
attribute unfortunately only works with `std::panic::Location`, which
`tracing` isn't using.
Macros will be evaluated earlier and thus the messages will show up with
the correct module name.
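A minimal sketch of the macro approach (the real `log_unwrap!` may look
different):

```rust
// Because `tracing`'s macros capture `module_path!()` at the expansion site,
// expanding `log_unwrap!` inside the caller's code attributes the event to the
// caller's module, which a trait method cannot do even with `#[track_caller]`.
macro_rules! log_unwrap {
    ($result:expr) => {
        match $result {
            Ok(value) => Some(value),
            Err(error) => {
                // Expands in the caller's module, so the event carries the correct module path.
                tracing::warn!("Operation failed: {error}");
                None
            }
        }
    };
}

fn caller() -> Option<u16> {
    log_unwrap!("not-a-port".parse::<u16>())
}
```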
Applications may query domains for HTTPS RR using the HTTPS record type.
`connlib` operates on OSI layer 3 and thus can only hand out IPs for the
particular domains. The correct way to signal this to applications is to
answer the HTTPS query with NOERROR and return an empty set of records.
[RFC9460](https://www.rfc-editor.org/rfc/rfc9460.html#name-client-behavior)
says the following:
> 4. If one or more "compatible" ([Section
8](https://www.rfc-editor.org/rfc/rfc9460.html#mandatory)) ServiceMode
records are returned, these represent the alternative endpoints. Sort
the records by ascending SvcPriority.
> 5. Otherwise, SVCB resolution has failed, and the list of available
endpoints is empty.
This implies that returning no records is valid behaviour and forces the
client to consider the HTTPS DNS query as failed and query for A / AAAA
records instead (if it didn't do so via happy-eyeballs already).
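A minimal sketch of such an empty NOERROR reply, built directly from the
raw query bytes for illustration (connlib's actual implementation goes
through its DNS message types; assumes a well-formed query of at least
12 bytes):

```rust
/// Turn a raw HTTPS/SVCB query into a NOERROR response with an empty answer
/// section, signalling the client to fall back to A / AAAA lookups.
fn empty_noerror_response(query: &[u8]) -> Vec<u8> {
    let mut response = query.to_vec();

    // Header byte 2: set the QR bit to mark the message as a response.
    response[2] |= 0x80;
    // Header byte 3: clear the RCODE nibble, i.e. NOERROR.
    response[3] &= 0xF0;
    // ANCOUNT (bytes 6-7) and NSCOUNT (bytes 8-9) stay zero: no answer records.
    response[6..10].fill(0);

    response
}
```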
My theory for this issue is that we receive a UDP DNS response from an
upstream server that is bigger than our MTU and thus forwarding it
fails.
This PR doesn't fix that issue by itself but only mitigates the actual
panic. To properly fix the underlying issue, we need to parse the DNS
message, truncate it and set the TC bit.
Related: #7121.
The `fmt::Display` implementation of `tokio::task::JoinError` already
does exactly what we do here: Extracting the panic message if there is
one. Thus, we can simplify this code by just moving the `JoinError`
into the `DisconnectError` as its source.
For simplicity's sake I assumed that any packet using port 53 would be a
DNS packet, but I forgot to exclude it from the range of possible ports
used.
This can cause spurious CI failures.
Using the `sentry-tracing` integration, we can automatically capture
events based on what we log via `tracing`. The mapping is defined as
follows:
- ERROR: Gets captured as a fatal error
- WARN: Gets captured as a message
- INFO: Gets captured as a breadcrumb
- `_`: Does not get captured at all
If telemetry isn't active / configured, this integration does nothing.
It is therefore safe to just always enable it.
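Assuming `sentry-tracing`'s `layer().event_filter(..)` API, the mapping
above roughly corresponds to a filter like this (sketch, not the literal
integration code):

```rust
use sentry_tracing::EventFilter;
use tracing::Level;

fn event_filter(metadata: &tracing::Metadata<'_>) -> EventFilter {
    match *metadata.level() {
        Level::ERROR => EventFilter::Exception, // captured as a fatal error
        Level::WARN => EventFilter::Event,      // captured as a message
        Level::INFO => EventFilter::Breadcrumb, // captured as a breadcrumb
        _ => EventFilter::Ignore,               // not captured at all
    }
}

// Plugged in via `sentry_tracing::layer().event_filter(event_filter)`.
```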
Our logging library, `tracing`, supports structured logging. This is
useful because it preserves more than just the string representation
of a value and thus allows the active logging backend(s) to capture more
information for a particular value.
In the case of errors, this is especially useful because it allows us to
capture the sources of a particular error.
Unfortunately, recording an error as a tracing value is a bit cumbersome
because `tracing::Value` is only implemented for `&dyn
std::error::Error`. Casting an error to this is quite verbose. To make
it easier, we introduce two utility functions in `firezone-logging`:
- `std_dyn_err`
- `anyhow_dyn_err`
Tracking errors as correct `tracing::Value`s will be especially helpful
once we enable Sentry's `tracing` integration:
https://docs.rs/sentry-tracing/latest/sentry_tracing/#tracking-errors
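In essence, the helpers are just coercions to the error trait object
(exact signatures in `firezone-logging` may differ slightly; the usage
below is illustrative):

```rust
pub fn std_dyn_err(e: &(impl std::error::Error + 'static)) -> &(dyn std::error::Error + 'static) {
    e
}

pub fn anyhow_dyn_err(e: &anyhow::Error) -> &(dyn std::error::Error + 'static) {
    e.as_ref()
}

// Usage: the error (and its source chain) is recorded as a structured value.
fn log_bind_error(e: &std::io::Error) {
    tracing::warn!(error = std_dyn_err(e), "Failed to bind UDP socket");
}
```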
Currently, tests only send ICMP packets back and forth. To expand our
coverage and later allow us to cover filters and resource picking, this
PR implements sending UDP and TCP packets as part of that logic too.
To keep this PR simpler at this stage, TCP packets don't track an actual
TCP connection, only that they are forwarded back and forth; this will
be fixed in a future PR by emulating TCP sockets.
We also unify how we handle CIDR/DNS/Non Resources to reduce the number
of transitions.
Fixes #7003
---------
Signed-off-by: Gabi <gabrielalejandro7@gmail.com>
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
When we query upstream DNS servers through the tunnel via TCP DNS, we
will always be successful in establishing a tunnel, regardless of how
many concurrent queries we send because the TCP stack will keep
re-trying. Thus, tracking which resources we are connected to after
sending a bunch of DNS queries needs to be split by UDP and TCP.
For UDP, only the "first" resource will be connected. However, with
concurrent TCP and UDP DNS queries, "first" isn't necessarily the order
in which we send the queries because with TCP DNS, one packet doesn't
equate to one query anymore.
This is quite hacky but it will get completely deleted once we buffer
packets during the connection setup.
This PR implements the new idempotent control protocol for the gateway.
We retain backwards-compatibility with old clients to allow admins to
perform a disruption-free update to the latest version.
With this new control protocol, we are moving the responsibility of
exchanging the proxy IPs we assigned to DNS resources to a p2p protocol
between client and gateway. As a result, wildcard DNS resources only get
authorized on the first access. Accessing a new domain within the same
resource will thus no longer require a roundtrip to the portal.
Overall, users will see a greatly decreased connection setup latency. On
top of that, the new protocol will allow us to more easily implement
packet buffering which will be another UX boost for Firezone.
At present, `connlib` only supports DNS over UDP on port 53. Responses
over UDP are size-constrained by the IP MTU and thus not all DNS
responses fit into a UDP packet. RFC9210 therefore mandates that all DNS
resolvers must also support DNS over TCP to overcome this limitation
[0].
Handling UDP packets is easy, handling TCP streams is more difficult
because we need to effectively implement a valid TCP state machine.
Building on top of a lot of earlier work (linked in issue), this is
relatively easy because we can now simply import
`dns_over_tcp::{Client,Server}` which do the heavy lifting of sending
and receiving the correct packets for us.
The main aspects of the integration that are worth pointing out are:
- We can handle at most 10 concurrent DNS TCP connections _per defined
resolver_. The assumption here is that most applications will first
query for DNS records over UDP and only fall back to TCP if the response
is truncated. Additionally, we assume that clients will close the TCP
connections once they no longer need them.
- Errors on the TCP stream to an upstream resolver result in `SERVFAIL`
responses to the client.
- All TCP connections to upstream resolvers get reset when we roam; all
currently ongoing queries will be answered with `SERVFAIL`.
- Upon network reset (i.e. roaming), we also re-allocate new local ports
for all TCP sockets, similar to our UDP sockets.
Resolves: #6140.
[0]: https://www.ietf.org/rfc/rfc9210.html#section-3-5
Currently, any failure in the `StubResolver` while processing a query
results only in a log but no response. For UDP DNS queries, that isn't
too bad because the client will simply retry. With the upcoming support
for TCP DNS queries in #6944, we should really always reply with a
message.
This PR refactors the handling of DNS messages to always generate a
reply. In the case of an error, we reply with SERVFAIL.
This opens up a few more refactorings where we can now collapse the
handling of some branches into the same code path. As part of that, I
noticed the
recurring need for "unwrapping" a `Result<(), E>` and logging the error.
To make that easier, I introduced an extension trait that does exactly
that.
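A sketch of what such an extension trait could look like (name and exact
behaviour in the real code may differ):

```rust
trait ResultExt {
    /// Logs the error (if any) and discards the result.
    fn log_err(self, context: &str);
}

impl<E: std::fmt::Display> ResultExt for Result<(), E> {
    fn log_err(self, context: &str) {
        if let Err(e) = self {
            tracing::warn!("{context}: {e}");
        }
    }
}

// Usage: instead of `if let Err(e) = send_reply(..) { tracing::warn!(..) }`:
// send_reply(query, response).log_err("Failed to send DNS reply");
```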
In #6960, we split up the filter handling by resource type. This
uncovered an issue that was already known and fixed in the portal in
#6962:
Upstream DNS servers must not be in any of the reserved IP ranges.
Related: #7060.
When forwarding UDP DNS queries through the tunnel, `connlib` needs to
mangle the IP header to set upstream as the correct destination of the
packet. In order to know which ones need to be mangled back, we keep a
map of query ID to timestamp. This isn't good enough as the added
proptest failure case shows. If there are two concurrent queries with
the same query ID to different resolvers, we only handle one of those
and don't mangle the 2nd one.
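The implied fix is to key the in-flight queries by upstream resolver and
query ID together; a simplified sketch:

```rust
use std::collections::HashMap;
use std::net::SocketAddr;
use std::time::Instant;

/// Tracks forwarded UDP DNS queries so responses can be mangled back.
/// Keying by (upstream, query ID) keeps two concurrent queries with the same
/// ID but different resolvers apart.
#[derive(Default)]
struct ForwardedQueries {
    inner: HashMap<(SocketAddr, u16), Instant>,
}

impl ForwardedQueries {
    fn on_query(&mut self, upstream: SocketAddr, query_id: u16, now: Instant) {
        self.inner.insert((upstream, query_id), now);
    }

    fn on_response(&mut self, upstream: SocketAddr, query_id: u16) -> Option<Instant> {
        self.inner.remove(&(upstream, query_id))
    }
}
```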
This buffer is effectively limited by the maximum size of our IP packets
(which is guided by our interface MTU). Passing a length is
unnecessarily abstract.
For implementing DNS over TCP, we will need to encapsulate packets that
are emitted by the `dns_over_tcp::Client` which requires creating such a
buffer on the fly.
In the future, we should probably consider also stack-allocating all our
`Transmit`s so we can get rid of passing around this buffer altogether.
Currently, in our tests, traffic that is targeted at a resource is
handled "in-line" on the gateway. This doesn't really represent how the
real world works. In the real world, the gateway uses the IP forwarding
functionality of the Linux kernel and the corresponding NAT to send the
IP packet to the actual resource.
We don't want to implement this forwarding and NAT in the tests.
However, our testing harness is about to get more sophisticated. We will
be sending TCP DNS queries with #6944 and we want to test TCP and its
traffic filters with #7003.
The state of those TCP sockets needs to live _somewhere_.
If we "correctly" model this and introduce some kind of `HashMap` with
`dyn Resource` in `TunnelTest`, then we will have to actually implement
NAT for those packets to ensure that e.g. the SYN-ACK of a TCP handshake
makes it back to the correct(!) gateway.
That is rather cumbersome.
This PR suggests taking a shortcut there by associating the resources
with each gateway individually. At present, all we have are UDP DNS
servers. Those don't actually have any connection state themselves but
putting them in place gives us a framework for where we can put
connection-specific state. Most importantly, these resources MUST NOT
hold application-specific state. Instead, that state needs to be kept in
`ReferenceState` or `TunnelState` and passed in by reference, as we do
here for the DNS records.
This has effectively the same behaviour as correctly translating IP
packets back and forth between resources and gateways. The packets
"emitted" by a particular `UdpDnsServerResource` will always go back to
the correct gateway.
Previously, when a gateway checked if a packet was allowed through, it
used the real IP of the DNS resource that the client was trying to
communicate with.
The problem with this was that if there's an overlapping CIDR resource
with the real IP this would allow a user to send packets using the
filters for that resource instead of the DNS resource filters.
This can be confusing for users as packets can flow unexpectedly to the
resources even if the filter doesn't permit it, so we use the IP of the
packet before we translate it to the real IP to match the filters.
This doesn't change the security of this feature as a user can just
change the IP of the packet to the dst of the DNS or the CIDR resource
according to what they need.
Fixes #6806
Bumps [lru](https://github.com/jeromefroe/lru-rs) from 0.12.4 to 0.12.5.
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/jeromefroe/lru-rs/blob/master/CHANGELOG.md">lru's
changelog</a>.</em></p>
<blockquote>
<h2><a
href="https://github.com/jeromefroe/lru-rs/tree/0.12.5">v0.12.5</a> -
2024-10-30</h2>
<ul>
<li>Upgrade hashbrown dependency to 0.15.</li>
</ul>
</blockquote>
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Currently, we have a lot of stupid code to forward data from the
`{Client,Gateway}Tunnel` interface to `{Client,Gateway}State`. Recent
refactorings such as #6919 made it possible to get rid of this
forwarding layer by directly exposing `&mut TRoleState`.
To maintain some type-privacy, several functions are made generic to
accept `impl Into` or `impl TryInto`.
UDP DNS queries for upstream resolvers that happen to be resources need
to be sent through the tunnel. For that to work correctly, `connlib`
needs to rewrite the IP header such that the destination IP points to
the actual address of the DNS server.
Currently, this happens rather "late" in the processing of the packets,
i.e. after `try_handle_dns` has returned (where that decision is
actually made). This is rather confusing and also forces us to re-parse
the packet as a DNS packet at a later stage.
To avoid this, we move the main functionality of
`maybe_mangle_dns_query_to_cidr_resource` into the branch where
`connlib`'s stub DNS resolver tells us that the query needs to be
forwarded via the tunnel.
With the upcoming support of TCP DNS queries, we will have a 2nd source
of IP packets that need to go through the tunnel: Packets emitted from
our internal TCP stack. Attempting to perform the same post-processing
on these TCP packets as we do with UDP is rather confusing, which is why
we want to remove this step from the `encapsulate` function.
Resolves: #5391.
Within `connlib`, the `encapsulate` and `decapsulate` functions on
`ClientState` and `GatewayState` are the entrypoint for sending and
receiving network traffic. For example, IP packets read from the TUN
device are processed using these functions.
Not all packets / traffic passed to these functions is meant to be
encrypted. Some of it is TURN traffic with relays, some of it is DNS
traffic that we intercept.
To clarify this, we rename these functions to `handle_tun_input` and
`handle_network_input`.
As part of this clarification, we also call `handle_timeout` in case we
don't emit a decrypted IP packet when handling network input. Once we
support DNS over TCP (#6944), some IP packets sent through the tunnel
will originate from DNS servers that we forwarded queries to. In that
case, those responses will be handled by `connlib`'s internal TCP stack
and thus not produce a decrypted IP packet. To correctly advance the
state in this case, we mirror what we already do for `handle_tun_input`
and call `handle_timeout` if `handle_network_input` yields `None`.
When handling DNS queries, `connlib` tries to be as transparent as
possible. For this reason, we byte-for-byte forward the DNS response
from the upstream resolver to the original source socket. In #6999, we
started modelling these DNS queries as explicit tasks in preparation for
DNS over TCP and DNS over HTTPS.
As part of that, we create a DNS response for _every_ IO error we
encounter as part of the recursive query. This includes timeouts, i.e.
when we don't receive a response at all. That actually breaks the rule
of "be a transparent DNS proxy".
In this PR, we slightly refactor the handling of the DNS response to
explicitly match on `io::ErrorKind::TimedOut` to not send a packet back,
thus mirroring the behaviour the DNS client would encounter without
Firezone being active.
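Conceptually, the error handling now distinguishes timeouts from other
IO errors (sketch; `servfail_response` is a hypothetical helper):

```rust
use std::io;

fn handle_recursive_result(result: io::Result<Vec<u8>>) -> Option<Vec<u8>> {
    match result {
        Ok(response) => Some(response),
        // No reply on timeout: the client sees the same silence it would without Firezone.
        Err(e) if e.kind() == io::ErrorKind::TimedOut => None,
        // Any other IO error is surfaced to the client as SERVFAIL.
        Err(_) => Some(servfail_response()),
    }
}

fn servfail_response() -> Vec<u8> {
    todo!("build a DNS response with RCODE=SERVFAIL for the original query")
}
```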
When performing recursive DNS queries over UDP, `connlib` needs to
remember the original source socket a particular query came from in
order to send the response back to the correct socket. Until now, this
was tracked in a separate `HashMap`, indexed by upstream server and
query ID.
When DNS queries are being retried, they may be resent using the same
query ID, causing "Unknown query" logs if the retry happens on a shorter
interval than the timeout of our recursive query.
We are already tracking a bunch of metadata alongside the actual
query, meaning we can just as easily add the original source socket to
that as well.
Once we add TCP DNS queries, we will need to track the handle of the TCP
socket in a similar manner.
This PR introduces a custom logging format for all Rust-components. It
is more or less a copy of `tracing_subscriber::fmt::format::Compact`
with the main difference that span-names don't get logged.
Spans are super useful because they allow us to record contextual
values, like the current connection ID, for a certain scope. What is IMO
less useful about them is that in the default formatter configuration,
active spans cause a right-drift of the actual log message.
The actual log message is still what most accurately describes what
`connlib` is currently doing. Spans only add contextual information that
the reader may use to further understand what is happening. This
optional nature of the utility of spans IMO means that they should come
_after_ the actual log message.
Resolves: #7014.
This extracts the initial refactoring required for #6944. Currently,
`connlib` sends all DNS queries over the same UDP socket as all the p2p
traffic for gateways and relays. In an earlier design of `connlib`, we
already did something similar to what we are doing here but using
`hickory_resolver` for the actual DNS resolution.
Instead of depending on hickory, we implement DNS resolution ourselves
by sending a UDP DNS query to the mapped upstream DNS server. There are
no retries, instead, we rely on the original DNS client to retry in case
a packet gets lost on the way.
Modelling recursive DNS queries as explicit events from the
`ClientState` is necessary for implementing DNS over TCP and DNS over
HTTPS. In both cases, the query to the upstream server isn't as simple
as emitting a `Transmit`. By modelling the query as an `async fn` within
`Io`, it will be possible to perform them all in one place.
Resolves: #6297.
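A simplified sketch of such an `async fn` (socket setup, buffer size and
timeout value are assumptions, not connlib's actual values):

```rust
use std::io;
use std::net::SocketAddr;
use tokio::net::UdpSocket;
use tokio::time::{timeout, Duration};

/// Forwards a single raw DNS query to the mapped upstream server and awaits one
/// response. There are no retries; the original DNS client is expected to retry.
async fn forward_udp_dns_query(upstream: SocketAddr, query: &[u8]) -> io::Result<Vec<u8>> {
    let socket = UdpSocket::bind("0.0.0.0:0").await?;
    socket.connect(upstream).await?;
    socket.send(query).await?;

    let mut buffer = vec![0u8; 2048];
    let len = timeout(Duration::from_secs(5), socket.recv(&mut buffer))
        .await
        .map_err(|_| io::Error::new(io::ErrorKind::TimedOut, "upstream DNS query timed out"))??;

    buffer.truncate(len);

    Ok(buffer)
}
```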
Currently, we emit a single TRACE log for each DNS resource entry that
doesn't match. This is quite spammy and often not needed.
When debugging DNS resources, it is useful to know which resources we
are matching against. To balance this, we now build a list of all DNS
resource domain patterns that we have and log a single "No resources
matched" log with that list in case none match.
With the latest version of `proptest-state-machine`, we no longer need
to use their traits because `Sequential::new` is now exposed. This makes
the overall setup less magical because there is less indirection.
This incorporates the feedback from #6939 after a discussion with
@conectado. We agreed that the protocol should be more event-based,
where each message has its own event type. Events MAY appear in pairs or
other cardinality combinations, meaning semantically they could be seen
as requests and responses. In general though, due to the unreliable
nature of IP, it is better to view them as events. Events are typically
designed to be idempotent which is important to make this protocol work.
Using events also means it is not as easy to fall into the "trap" of
modelling requests / responses on the control protocol level.
To correctly handle overlapping CIDR resources, we need to recompute
which ones are active. The `RoamClient` transition was missing that
despite the system-under-test actually doing that bit.
Adding the necessary function call fixes the two regression seeds
detected in CI.
Resolves: #6924.
Do we want to track 401s in sentry? If we see a lot of them, something
is likely wrong but I guess there is some level of 401s that users will
just run into.
Is there a way of marking these as "might not be a really bad error"?
---------
Co-authored-by: Not Applicable <ReactorScram@users.noreply.github.com>
At present, `connlib` utilises the portal as a signalling layer for any
kind of control message that needs to be exchanged between clients and
gateways. For anything regarding connectivity, this is crucial: Before
we have a direct connection to the gateway, we don't really have a
choice other than using the portal as a "relay" to e.g. exchange address
candidates for ICE.
However, once a direct connection has been established, exchanging
information directly with the gateway is faster and removes the portal
as a potential point of failure for the data plane.
For DNS resources, `connlib` intercepts all DNS requests on the client
and assigns its own IPs within the CG-NAT range to all domains that are
configured as resources. Thus, all packets targeting DNS resources will
have one of these IPs set as their destination. The gateway needs to
learn about all the IPs that have been assigned to a certain domain by
the client and perform NAT. We call this concept "DNS resource NAT".
Currently, the domain + the assigned IPs are sent together with the
`allow_access` or `request_connection` message via the portal. The new
control protocol defined in #6732 purposely excludes this information
and only authorises traffic to the entire resource which could also be a
wildcard-DNS resource.
To exchange the assigned IPs for a certain domain with the gateway, we
introduce our own p2p control protocol built on top of IP. All control
protocol messages are sent through the tunnel and thus encrypted at all
times. They are differentiated from regular application traffic as
follows:
- IP src is set to the unspecified IPv6 address (`::`)
- IP dst is set to the unspecified IPv6 address (`::`)
- IP protocol is set to reserved (`0xFF`)
The combination of all three should never appear as regular traffic.
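In other words, classifying an incoming packet as p2p control traffic
boils down to a check like this (sketch over already-parsed IPv6 header
fields):

```rust
use std::net::Ipv6Addr;

/// IP protocol number "Reserved" (0xFF), used to mark p2p control messages.
const CONTROL_PROTOCOL: u8 = 0xFF;

fn is_p2p_control_packet(src: Ipv6Addr, dst: Ipv6Addr, next_header: u8) -> bool {
    src == Ipv6Addr::UNSPECIFIED
        && dst == Ipv6Addr::UNSPECIFIED
        && next_header == CONTROL_PROTOCOL
}
```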
To ensure forwards-compatibility, the control protocol utilises a fixed
8-byte header where the first byte denotes the message kind. In this
current design, there is no concept of a request or response in the
wire-format. Each message is unidirectional and the fact that the two
messages we define in here appear in tandem is purely by convention. We
use the IPv6 payload length to determine the total length of the packet.
The payloads are JSON-encoded. Message types are free to choose whichever
encoding they'd like.
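A sketch of the framing (treating the seven non-kind header bytes as
reserved is an assumption here):

```rust
const HEADER_LEN: usize = 8;

/// Prepends the fixed 8-byte header: one byte for the message kind, the rest reserved.
fn encode_message(kind: u8, json_payload: &[u8]) -> Vec<u8> {
    let mut message = Vec::with_capacity(HEADER_LEN + json_payload.len());
    message.push(kind);
    message.extend_from_slice(&[0u8; 7]);
    message.extend_from_slice(json_payload);
    message
}

/// Splits an IPv6 payload into message kind and JSON payload.
/// The total length comes from the IPv6 payload length field.
fn decode_message(ipv6_payload: &[u8]) -> Option<(u8, &[u8])> {
    if ipv6_payload.len() < HEADER_LEN {
        return None;
    }

    Some((ipv6_payload[0], &ipv6_payload[HEADER_LEN..]))
}
```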
This protocol is sent through the WireGuard tunnel, meaning we are
effectively limited by our device MTU of 1280; otherwise we'd have to
implement fragmentation. For the messages of setting up the DNS resource
NAT, we are below this limit:
- UUIDs are 16 bytes
- Domain names are at most 255 bytes
- IPv6 addresses are 16 bytes * 4
- IPv4 addresses are 4 bytes * 4
Including the JSON serialisation overhead, this results in a total
maximum payload size of 402 bytes, which is well below our MTU.
Finally, another thing to consider here is that IP is unreliable,
meaning each use of this protocol needs to make sure that:
- It is resilient against message re-ordering
- It is resilient against packet loss
The details of how this is ensured for setting up the DNS resource NAT
are left to #6732.
One of the key differences of the new control protocol designed in #6461
is that creating new connections is idempotent. We achieve this by
having the portal generate the ICE credentials and the preshared-key for
the WireGuard tunnel. As long as the ICE credentials don't change, we
don't need to make a new connection.
For `snownet`, this means we are deprecating the previous APIs for
making connections. The client-side APIs will have to stay around until
we merge the client-part of the new control protocol. The server-side
APIs will have to stay around until we remove backwards-compatibility
from the gateway.
As part of #6732, we will be using the tunnel for the p2p control
protocol to setup the DNS resource NAT on the gateway. These messages
will be immediately queued after creating the connection, before ICE is
finished. In order to avoid additional retries within `firezone_tunnel`,
we directly attempt to encapsulate these packets.
`boringtun` internally has a buffer for these because prior to having a
WireGuard session, we don't actually have the necessary keys to encrypt
a packet. Thus, the packet passed here won't actually end up in the
referenced buffer of the `Connecting` state. Instead, if we haven't
already initiated a WireGuard handshake, attempting to encapsulate a
packet will trigger a WireGuard handshake initiation and _that_ is the
packet we need to buffer.
Dropping this packet would require us to wait until `boringtun`
retransmits it as part of the handshake timeout, causing an unnecessary
delay in the connection setup.
We call the `add_ips_with_resource` function with a list of `IpAddr` or
`IpNetwork`s. To make this more ergonomic for the caller, we can accept
an iterator that converts the items on the fly.
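A sketch of the signature change, using a hypothetical stand-in for the
real `IpNetwork` type:

```rust
use std::net::IpAddr;

/// Hypothetical stand-in for the real `IpNetwork` type.
#[derive(Debug, Clone, Copy)]
struct IpNetwork {
    addr: IpAddr,
    prefix: u8,
}

impl From<IpAddr> for IpNetwork {
    fn from(addr: IpAddr) -> Self {
        let prefix = match addr {
            IpAddr::V4(_) => 32,
            IpAddr::V6(_) => 128,
        };

        Self { addr, prefix }
    }
}

/// Accepting `impl Into<IpNetwork>` items lets callers pass `IpAddr`s or
/// `IpNetwork`s directly, converting on the fly instead of collecting first.
fn add_ips_with_resource<T>(ips: impl IntoIterator<Item = T>)
where
    T: Into<IpNetwork>,
{
    for ip in ips {
        let network: IpNetwork = ip.into();
        let _ = network; // Insert into the routing table in the real implementation.
    }
}
```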
To reduce the TTFB, we immediately force a WireGuard handshake as soon
as ICE completes. Currently, both the client and the gateway do this
which is somewhat counter-productive as there can only be one active
session.
With this patch, only the client will force a new WireGuard handshake as
soon as ICE completes.
With the new control protocol specified in #6461, the client will no
longer initiate new connections. Instead, the credentials are generated
deterministically by the portal based on the gateway's and the client's
public key. For as long as they use the same public key, they also have
the same in-memory state which makes creating connections idempotent.
What we didn't consider in the new design at first is that when clients
roam, they discard all connections but keep the same private key. As a
result, the portal would generate the same ICE credentials which means
the gateway thinks it can reuse the existing connection when new flows
get authorized. The client however discarded all connections (and
rotated its ports and maybe IPs), meaning the previous candidates sent
to the gateway are no longer valid and connectivity fails.
We fix this by also rotating the private keys upon reset. Rotating the
keys itself isn't enough, we also need to propagate the new public key
all the way "over" to the phoenix channel component which lives
separately from connlib's data plane.
To achieve this, we change `PhoenixChannel` to now start in the
"disconnected" state and require an explicit `connect` call. In
addition, the `LoginUrl` constructed by various components now acts
merely as a "prototype", which may require additional data to construct
a fully valid URL. In the case of client and gateway, this is the public
key of the `Node`. This additional parameter needs to be passed to
`PhoenixChannel` in the `connect` call, thus forming a type-safe
contract that ensures we never attempt to connect without providing a
public key.
For the relay, this doesn't apply.
Lastly, this allows us to tidy up the code a bit by:
a) generating the `Node`'s private key from the existing RNG
b) removing `ConnectArgs` which only had two members left
Related: #6461.
Related: #6732.