With the latest version of `proptest-state-machine`, we no longer need
to use their traits because `Sequential::new` is now exposed. This makes
things less magical overall because there is less indirection.
This incorporates the feedback from #6939 after a discussion with
@conectado. We agreed that the protocol should be more event-based,
where each message has its own event type. Events MAY appear in pairs or
other cardinality combinations, meaning semantically they could be seen
as requests and responses. In general though, due to the unreliable
nature of IP, it is better to view them as events. Events are typically
designed to be idempotent, which is important for making this protocol work.
Using events also means it is not as easy to fall into the "trap" of
modelling requests / responses on the control protocol level.
To correctly handle overlapping CIDR resources, we need to recompute
which ones are active. The `RoamClient` transition was missing this
recomputation, even though the system under test does perform it.
Adding the necessary function call fixes the two regression seeds
detected in CI.
Resolves: #6924.
Do we want to track 401s in sentry? If we see a lot of them, something
is likely wrong but I guess there is some level of 401s that users will
just run into.
Is there a way of marking these as "might not be a really bad error"?
---------
Co-authored-by: Not Applicable <ReactorScram@users.noreply.github.com>
At present, `connlib` utilises the portal as a signalling layer for any
kind of control message that needs to be exchanged between clients and
gateways. For anything related to connectivity, this is crucial: Before
we have a direct connection to the gateway, we don't really have a
choice other than using the portal as a "relay" to e.g. exchange address
candidates for ICE.
However, once a direct connection has been established, exchanging
information directly with the gateway is faster and removes the portal
as a potential point of failure for the data plane.
For DNS resources, `connlib` intercepts all DNS requests on the client
and assigns its own IPs within the CG-NAT range to all domains that are
configured as resources. Thus, all packets targeting DNS resources will
have one of these IPs set as their destination. The gateway needs to
learn about all the IPs that have been assigned to a certain domain by
the client and perform NAT. We call this concept "DNS resource NAT".
Currently, the domain + the assigned IPs are sent together with the
`allow_access` or `request_connection` message via the portal. The new
control protocol defined in #6732 purposely excludes this information
and only authorises traffic to the entire resource which could also be a
wildcard-DNS resource.
To exchange the assigned IPs for a certain domain with the gateway, we
introduce our own p2p control protocol built on top of IP. All control
protocol messages are sent through the tunnel and thus encrypted at all
times. They are differentiated from regular application traffic as
follows:
- IP src is set to the unspecified IPv6 address (`::`)
- IP dst is set to the unspecified IPv6 address (`::`)
- IP protocol is set to reserved (`0xFF`)
The combination of all three should never appear as regular traffic.
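For illustration, here is a minimal sketch (not connlib's actual code) of classifying a parsed IPv6 header against these three markers; the header struct is a stand-in for whatever parser is used:

```rust
use std::net::Ipv6Addr;

/// Stand-in for a parsed IPv6 header; only the fields relevant to the
/// control-protocol markers are shown.
struct Ipv6HeaderFields {
    source: Ipv6Addr,
    destination: Ipv6Addr,
    next_header: u8, // IP protocol number
}

const CONTROL_PROTOCOL: u8 = 0xFF; // "reserved" protocol number

fn is_p2p_control_packet(header: &Ipv6HeaderFields) -> bool {
    header.source == Ipv6Addr::UNSPECIFIED
        && header.destination == Ipv6Addr::UNSPECIFIED
        && header.next_header == CONTROL_PROTOCOL
}
```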
To ensure forwards-compatibility, the control protocol utilises a fixed
8-byte header where the first byte denotes the message kind. In this
current design, there is no concept of a request or response in the
wire-format. Each message is unidirectional and the fact that the two
messages we define in here appear in tandem is purely by convention. We
use the IPv6 payload length to determine the total length of the packet.
The payloads of the messages defined here are JSON-encoded; in general,
message types are free to choose whichever encoding they'd like.
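As an illustration, a minimal sketch of encoding and decoding such a message. The description above only specifies the first header byte; the sketch assumes the remaining 7 header bytes are reserved and zeroed:

```rust
/// Sketch only: first byte is the message kind, the other 7 header bytes are
/// assumed to be reserved, the rest of the IPv6 payload is the body.
fn encode_control_message(kind: u8, json_payload: &[u8]) -> Vec<u8> {
    let mut message = Vec::with_capacity(8 + json_payload.len());
    message.push(kind); // first byte: message kind
    message.extend_from_slice(&[0u8; 7]); // assumed reserved header bytes
    message.extend_from_slice(json_payload); // e.g. JSON-encoded body
    message
}

fn decode_control_message(packet_payload: &[u8]) -> Option<(u8, &[u8])> {
    // The IPv6 payload length already tells us the total message length, so
    // the body is simply everything after the fixed 8-byte header.
    if packet_payload.len() < 8 {
        return None;
    }
    Some((packet_payload[0], &packet_payload[8..]))
}
```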
This protocol is sent through the WireGuard tunnel, meaning we are
effectively limited by our device MTU of 1280 bytes; otherwise we'd have
to implement fragmentation. For the messages that set up the DNS
resource NAT, we are below this limit:
- UUIDs are 16 bytes
- Domain names are at most 255 bytes
- IPv6 addresses are 16 bytes * 4
- IPv4 addresses are 4 bytes * 4
Including the JSON serialisation overhead, this results in a total
maximum payload size of 402 bytes, which is well below our MTU.
Finally, another thing to consider here is that IP is unreliable,
meaning each use of this protocol needs to make sure that:
- It is resilient against message re-ordering
- It is resilient against packet loss
The details of how this is ensured for setting up the DNS resource NAT
are left to #6732.
One of the key differences of the new control protocol designed in #6461
is that creating new connections is idempotent. We achieve this by
having the portal generate the ICE credentials and the preshared-key for
the WireGuard tunnel. As long as the ICE credentials don't change, we
don't need to make a new connection.
For `snownet`, this means we are deprecating the previous APIs for
making connections. The client-side APIs will have to stay around until
we merge the client-part of the new control protocol. The server-side
APIs will have to stay around until we remove backwards-compatibility
from the gateway.
As part of #6732, we will be using the tunnel for the p2p control
protocol to setup the DNS resource NAT on the gateway. These messages
will be immediately queued after creating the connection, before ICE is
finished. In order to avoid additional retries within `firezone_tunnel`,
we directly attempt to encapsulate these packets.
`boringtun` internally has a buffer for these because prior to having a
WireGuard session, we don't actually have the necessary keys to encrypt
a packet. Thus, the packet passed here won't actually end up in the
referenced buffer of the `Connecting` state. Instead, if we haven't
already initiated a WireGuard handshake, attempting to encapsulate a
packet will trigger a WireGuard handshake initiation and _that_ is the
packet we need to buffer.
Dropping this packet would require us to wait until `boringtun`
retransmits it as part of the handshake timeout, causing an unnecessary
delay in the connection setup.
We call the `add_ips_with_resource` function with a list of `IpAddr`s or
`IpNetwork`s. To make this more ergonomic for the caller, we can accept
an iterator that converts the items on the fly.
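A minimal sketch of the signature idea, using placeholder types (`IpNetwork` and `Resource` here are stand-ins, not the real connlib types):

```rust
use std::net::IpAddr;

// Placeholders for illustration only.
struct IpNetwork;
struct Resource;

impl From<IpAddr> for IpNetwork {
    fn from(_addr: IpAddr) -> Self {
        // A single address becomes a /32 (IPv4) or /128 (IPv6) network.
        IpNetwork
    }
}

// Accepting `impl IntoIterator<Item: Into<IpNetwork>>` lets callers pass
// either `IpAddr`s or `IpNetwork`s without collecting into a `Vec` first.
fn add_ips_with_resource<I>(resource: &Resource, ips: I)
where
    I: IntoIterator,
    I::Item: Into<IpNetwork>,
{
    for ip in ips {
        let network: IpNetwork = ip.into(); // converted on the fly
        let _ = (resource, &network);
    }
}
```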
To reduce the TTFB, we immediately force a WireGuard handshake as soon
as ICE completes. Currently, both the client and the gateway do this
which is somewhat counter-productive as there can only be one active
session.
With this patch, only the client will force a new WireGuard handshake as
soon as ICE completes.
With the new control protocol specified in #6461, the client will no
longer initiate new connections. Instead, the credentials are generated
deterministically by the portal based on the gateway's and the client's
public key. For as long as they use the same public key, they also have
the same in-memory state which makes creating connections idempotent.
What we didn't consider in the new design at first is that when clients
roam, they discard all connections but keep the same private key. As a
result, the portal would generate the same ICE credentials which means
the gateway thinks it can reuse the existing connection when new flows
get authorized. The client however discarded all connections (and
rotated its ports and maybe IPs), meaning the previous candidates sent
to the gateway are no longer valid and connectivity fails.
We fix this by also rotating the private keys upon reset. Rotating the
keys by itself isn't enough; we also need to propagate the new public key
all the way "over" to the phoenix channel component which lives
separately from connlib's data plane.
To achieve this, we change `PhoenixChannel` to now start in the
"disconnected" state and require an explicit `connect` call. In
addition, the `LoginUrl` constructed by various components now acts
merely as a "prototype", which may require additional data to construct
a fully valid URL. In the case of client and gateway, this is the public
key of the `Node`. This additional parameter needs to be passed to
`PhoenixChannel` in the `connect` call, thus forming a type-safe
contract that ensures we never attempt to connect without providing a
public key.
For the relay, this doesn't apply.
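To make the "prototype" idea concrete, here is a sketch with made-up types and fields; the real `LoginUrl` and `PhoenixChannel` APIs differ in detail:

```rust
// Illustrative types only.
struct PublicKey(String);

/// Everything except the public key is known when the prototype is created.
struct LoginUrlPrototype {
    base: String,
}

impl LoginUrlPrototype {
    /// Only combining the prototype with a public key yields a complete URL.
    fn to_url(&self, public_key: &PublicKey) -> String {
        format!("{}&public_key={}", self.base, public_key.0)
    }
}

struct PhoenixChannel {
    prototype: LoginUrlPrototype,
    connected_to: Option<String>, // `None` while disconnected
}

impl PhoenixChannel {
    fn new(prototype: LoginUrlPrototype) -> Self {
        // Starts in the "disconnected" state.
        Self { prototype, connected_to: None }
    }

    /// Connecting requires a public key, so we can never attempt to connect
    /// without providing one.
    fn connect(&mut self, public_key: &PublicKey) {
        let url = self.prototype.to_url(public_key);
        // ... open the websocket to `url` here ...
        self.connected_to = Some(url);
    }
}
```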
Lastly, this allows us to tidy up the code a bit by:
a) generating the `Node`'s private key from the existing RNG
b) removing `ConnectArgs` which only had two members left
Related: #6461.
Related: #6732.
Following up from #6919, this PR introduces a dedicated, internal model
for resources, reflecting how the client uses them. This separation
serves several purposes:
1. It allows us to introduce an `Unknown` resource type, ensuring
forwards-compatibility with future resource types.
2. It allows us to remove trait implementations like `PartialEq` or
`PartialOrd` from the message types. With #6732, the messages will
include types like `SecretKey`s which cannot be compared.
3. A decoupling of serialisation and domain models is good practice in
general and has long been overdue.
The `connlib-shared` crate has become a bit of a dependency magnet
without a clear purpose. It hosts utilities like `get_user_agent`,
messages for the client and gateway to communicate with the portal and
domain types like `ResourceId`.
To create a better dependency structure in our workspace, we repurpose
`connlib-shared` as a `connlib-model` crate. Its purpose is to host
domain-specific model types that multiple crates may want to use. For
that purpose, we rename the `callbacks::ResourceDescription` type to
`ResourceView`, designating that this is a _view_ onto a resource as
seen by `connlib`. The message types which currently double up as
connlib-internal model thus become an implementation detail of
`firezone-tunnel` and shouldn't be used for anything else.
---------
Signed-off-by: Reactor Scram <ReactorScram@users.noreply.github.com>
Co-authored-by: Reactor Scram <ReactorScram@users.noreply.github.com>
When debugging DNS-related issues, it is useful to see all DNS queries
that go into `connlib` and the responses that we generate. Analogous to
the `wire::net` and `wire::dev` TRACE logs, we introduce `wire::dns`
which logs incoming queries and the responses on TRACE. The output looks
like this:
```
2024-10-02T00:16:47.522847Z TRACE wire::dns::qry: A caldav.fastmail.com qid=55845
2024-10-02T00:16:47.522926Z TRACE wire::dns::qry: AAAA caldav.fastmail.com qid=56277
2024-10-02T00:16:47.531347Z TRACE wire::dns::res: AAAA caldav.fastmail.com => [] qid=56277 rcode=NOERROR
2024-10-02T00:16:47.538984Z TRACE wire::dns::res: A caldav.fastmail.com => [103.168.172.46 | 103.168.172.61] qid=55845 rcode=NOERROR
2024-10-02T00:16:47.580237Z TRACE wire::dns::qry: HTTPS cloudflare-dns.com qid=21518
2024-10-02T00:16:47.580338Z TRACE wire::dns::qry: A example.org qid=35459
2024-10-02T00:16:47.580364Z TRACE wire::dns::qry: AAAA example.org qid=60073
2024-10-02T00:16:47.580699Z TRACE wire::dns::qry: AAAA ipv4only.arpa qid=17280
2024-10-02T00:16:47.580782Z TRACE wire::dns::qry: A ipv4only.arpa qid=47215
2024-10-02T00:16:47.581134Z TRACE wire::dns::qry: A detectportal.firefox.com qid=34970
2024-10-02T00:16:47.581261Z TRACE wire::dns::qry: AAAA detectportal.firefox.com qid=39505
2024-10-02T00:16:47.609502Z TRACE wire::dns::res: AAAA example.org => [2606:2800:21f:cb07:6820:80da:af6b:8b2c] qid=60073 rcode=NOERROR
2024-10-02T00:16:47.609640Z TRACE wire::dns::res: AAAA ipv4only.arpa => [] qid=17280 rcode=NOERROR
2024-10-02T00:16:47.610407Z TRACE wire::dns::res: A ipv4only.arpa => [192.0.0.170 | 192.0.0.171] qid=47215 rcode=NOERROR
2024-10-02T00:16:47.617952Z TRACE wire::dns::res: HTTPS cloudflare-dns.com => [1 alpn=h3,h2 ipv4hint=104.16.248.249,104.16.249.249 ipv6hint=2606:4700::6810:f8f9,2606:4700::6810:f9f9] qid=21518 rcode=NOERROR
2024-10-02T00:16:47.631124Z TRACE wire::dns::res: A example.org => [93.184.215.14] qid=35459 rcode=NOERROR
2024-10-02T00:16:47.640286Z TRACE wire::dns::res: AAAA detectportal.firefox.com => [detectportal.prod.mozaws.net. | prod.detectportal.prod.cloudops.mozgcp.net. | 2600:1901:0:38d7::] qid=39505 rcode=NOERROR
2024-10-02T00:16:47.641332Z TRACE wire::dns::res: A detectportal.firefox.com => [detectportal.prod.mozaws.net. | prod.detectportal.prod.cloudops.mozgcp.net. | 34.107.221.82] qid=34970 rcode=NOERROR
2024-10-02T00:16:48.737608Z TRACE wire::dns::qry: AAAA myfiles.fastmail.com qid=52965
2024-10-02T00:16:48.737710Z TRACE wire::dns::qry: A myfiles.fastmail.com qid=5114
2024-10-02T00:16:48.745282Z TRACE wire::dns::res: AAAA myfiles.fastmail.com => [] qid=52965 rcode=NOERROR
2024-10-02T00:16:49.027932Z TRACE wire::dns::res: A myfiles.fastmail.com => [103.168.172.46 | 103.168.172.61] qid=5114 rcode=NOERROR
2024-10-02T00:16:49.190054Z TRACE wire::dns::qry: HTTPS github.com qid=64696
2024-10-02T00:16:49.190171Z TRACE wire::dns::qry: A github.com qid=11912
2024-10-02T00:16:49.190502Z TRACE wire::dns::res: A github.com => [100.96.0.1 | 100.96.0.2 | 100.96.0.3 | 100.96.0.4] qid=11912 rcode=NOERROR
2024-10-02T00:16:49.190619Z TRACE wire::dns::qry: A github.com qid=63366
2024-10-02T00:16:49.190730Z TRACE wire::dns::res: A github.com => [100.96.0.1 | 100.96.0.2 | 100.96.0.3 | 100.96.0.4] qid=63366 rcode=NOERROR
```
As with the other filters, seeing both queries and responses can be
achieved with `RUST_LOG=wire::dns=trace`. If you are only interested in
the responses, you can activate a more specific log filter using
`RUST_LOG=wire::dns::res=trace`. All responses also print the original
query that they are answering.
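For context, a minimal sketch of how log lines with these targets could be emitted using the `tracing` crate; the field names are illustrative, not connlib's actual logging code:

```rust
fn log_query_and_response() {
    // Query and response use separate targets so they can be filtered
    // independently via `RUST_LOG`.
    tracing::trace!(target: "wire::dns::qry", qid = 55845, "A caldav.fastmail.com");
    tracing::trace!(
        target: "wire::dns::res",
        qid = 55845,
        rcode = "NOERROR",
        "A caldav.fastmail.com => [103.168.172.46 | 103.168.172.61]"
    );
}
```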
Resolves: #6862.
Our relays are essential for connectivity because they also perform STUN
for us through which we learn our server-reflexive address. Thus, we
must at all times have at least one relay that we can reach in order to
establish a connection.
The portal tracks the connectivity to the relays for us and in case any
of them go down, sends us a `relays_presence` message, meaning we can
stop using that relay and migrate any relayed connections to a new one.
This works well as long as we are connected to the portal while the
relay is rebooting or going down. If we are not currently connected to
the portal and a relay we are using reboots, we don't learn about it. If
we are actively using it, at least, the connection will fail, further
attempted communication with the relay will time out, and we will stop
using it.
In case we aren't currently using the relay, this gets a bit trickier.
If we aren't using the relay but it rebooted while we were partitioned
from the portal, logging in again might return the same relay to us in
the `init` message, but this time with different credentials.
The first bug that we are fixing in this PR is that we previously
ignored those credentials because we already knew about the relay,
thinking that we can still use our existing credentials. The fix here is
to also compare the credentials and ditch the local state if they
differ.
The second bug, identified while fixing the first, is that we need to
proactively probe whether all other relays we know about are actually
still responsive. For that, we issue a `REFRESH` message to each of
them. If it times out or otherwise fails, we remove that relay from our
list of `Allocation`s too.
To fix the second bug, several changes were necessary:
1. We lower the log-level of `Disconnecting from relay` from ERROR to
WARN. Any ERROR emitted during a test-run fails our test-suite which is
what partially motivated this. The test suite builds on the assumption
that ERRORs are fatal and thus should never happen during our tests.
This change surfaces that disconnecting from a relay can indeed happen
during normal operation, which justifies lowering this to WARN. Users
should at the minimum monitor on WARN to be alerted about problems.
2. We reduce the total backoff duration for requests to relays from 60s
to 10s. The current 60s results in a total of 8 retries. UDP is
unreliable, but it isn't THAT unreliable to justify retrying everything
for 60s. We also use a 10s timeout for ICE, meaning the two are now
aligned to better match each other. We had to change the max backoff
duration because we only idle-spin for at most 10s in the tests, and
thus the current 60s was too long to detect that a relay actually
disappeared.
3. We had to shuffle around some function calls to make sure all
intermediary event buffers are emptied at the right point in time to
make the test deterministic.
Fixes: #6648.
We want to associate additional device information with device
verification; these are all parameters that try to uniquely identify
the hardware.
For that reason, we read system information and send it as part of the
query params to the portal. That way, the portal can store it when the
device is verified and match against it later on.
These are the parameters for each platform:
|Platform|Query Field|Field Meaning|
|-----|----|-----|
|macOS|`device_serial`|Hardware's serial|
|macOS|`device_uuid`|Hardware's UUID|
|iOS|`identifier_for_vendor`|Identifier for vendor, resets only on uninstall/install|
|Android|`firebase_installation_id`|Firebase installation ID, resets only on uninstall/install|
|Windows|`device_serial`|Motherboard's serial|
|Linux|`device_serial`|Motherboard's serial|
Fixes #6837
When relays reboot or get redeployed, the portal sends us new relays to
use and/or relays we should discontinue using. To be more efficient with
battery and network usage, `connlib` only ever samples a single relay
out of all existing ones for a particular connection.
In case of a network topology where we need to use relays, there are two
situations we can end up in:
- The client connects to the gateway's relay, i.e. to the port the
gateway allocated on the relay.
- The gateway connects to the client's relay, i.e to the port the client
allocated on the relay.
When we detect that a relay is down, the party that allocated the port
will notice this immediately (once #6666 is merged). The other party
needs to wait until it receives the invalidated candidates from its
peer. Invalidating that candidate will also invalidate the currently
nominated socket and fail the connection. In theory, at least: that only
works if there are no other candidates available to try.
This is where this patch becomes important. Say we have the following
setup:
- Client samples relay A.
- Gateway samples relay B.
- The nominated candidate pair is "client server-reflexive <=> relay B",
i.e. the client talks to the allocated port on the gateway.
Next:
1. Client and portal get network-partitioned.
2. Relay B disappears.
3. Relay C appears.
4. Relay A reboots.
5. Client reconnects.
At this point, the client is told by the portal to use relays A & C.
Note that relay A rebooted and thus the allocation previously present on
the client is no longer valid. With #6666, we will detect this by
comparing credentials & IPs. The gateway is being told about the same
relays and as part of that, tests that relay B is still there. It learns
that it isn't, invalidates the candidates which fails the connection to
the client (but only locally!).
Meanwhile, as part of the regular `init` procedure, the client made a
new allocation with relays A & C. Because it had previously selected
relay A for the connection with the gateway, the new candidates are
added to the agent, forming new pairs. The gateway has already given up
on this connection however so it won't ever answer these STUN requests.
Concurrently, the gateway's invalidated candidates arrive at the client.
They however don't fail the connection because the client is probing the
newly added candidates. This creates a state mismatch between the client
and gateway that is only resolved after the candidates start timing out,
adding an additional delay during which the connection isn't working.
With this PR, we prevent this from happening by only ever adding new
candidates while we are still in the nomination process for a socket. In
theory, there exists a race condition in which we nominate a relay
candidate first and then miss out on a server-reflexive candidate that
is no longer added. In practice, this won't happen because:
- Our host candidates are always available first.
- We learn server-reflexive candidates already as part of the initial
BINDING, before creating the allocation.
- We learn server-reflexive candidates from all relays, not just the one
that has been assigned.
Related: #6666.
When constructing a span, any currently set span will automatically be
set as the parent. In the case of the `connection` span, this was the
`accept_answer` or `new_connection` span from the client / gateway.
Those are not meant to be re-activated every time we enter the
`connection` span.
By setting an explicit parent, we avoid that.
Unfortunately, this means that this span will never have a parent, even
if other spans are active whilst we enter this one. We enter this one in
the hot-path, which is why it is being constructed ahead of time.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Within `snownet` - `connlib`'s connectivity library - we use ICE to set
up a UDP "connection" between a client and a gateway. UDP is an
unreliable transport, meaning the only way we can detect that the
connection is broken is for both parties to constantly send messages and
acknowledgements back and forth. ICE uses STUN binding requests for
this.
In the default configuration of `str0m`, a STUN binding is sent every
3s, and we tolerate at most 9 missing responses before we consider the
connection broken. As these responses go missing, `str0m` halves this
interval, which results in a total ICE timeout of around 17 seconds. We
already tweak these values by reducing the number of requests to 8 and
setting the interval to 1.5s. This results in a total ICE timeout of
~10s which effectively means that there is at most a 10s lag between the
connection breaking and us considering it broken at which point new
packets arriving at the TUN interface can trigger the setup of a new
connection with the gateway.
Lowering these timeouts improves the user experience in case of a broken
connection because the user doesn't have to wait as long before they can
access their resources again. The downside of lowering these timeouts is
that we generate a lot of background noise. Especially on mobile
devices, this is bad because it prevents the CPU from going to sleep and
thus simply being signed into Firezone will drain your battery, even if
you don't use it.
Note that this doesn't apply at all if the client application on top
detects a network change. In that case, we hard-reset all connections
and instantly create new ones.
We attempted to fix this in #5576 by closing idle connections after 5
minutes. This however created new problems such as #6778.
The original problem here is that we send too many STUN messages as soon
as a connection is established. Simply increasing the timeout is not an
option because it would make the user experience really bad in case the
connection actually drops for reasons that the client app can't detect.
In this patch, we attempt to solve this in a different way: Detecting a
broken connection is only critical if the user is actively using the
tunnel (i.e. sending traffic). If there is no traffic, it doesn't matter
if we need longer to detect a broken connection. The user won't notice
because their phone is probably in their pocket or something.
With this patch, we now implement the following behaviour:
- A connection is considered idle after 10s of no application traffic.
- On idle connections, we send a STUN request every 60s.
- On idle connections, we wait for at most 4 missing responses before
considering the connection broken.
- Every connection will perform a client-initiated WireGuard keep-alive
every 25s, unless there is application traffic.
These values have been chosen while considering the following sources:
1. [RFC4787,
REQ-5](https://www.rfc-editor.org/rfc/rfc4787.html#section-12) requires
NATs to keep UDP NAT mappings alive for at least 2 minutes.
2. [`conntrack`](https://www.kernel.org/doc/Documentation/networking/nf_conntrack-sysctl.rst)
adopts this requirement via the `nf_conntrack_udp_timeout_stream`
configuration.
3. 25s is the default keep-alive of the WireGuard kernel module.
In theory the WireGuard keep-alive itself should be good enough to keep
all NAT bindings alive. In practice, missed keep-alives are not exposed
by boringtun (the WireGuard implementation we rely on) and thus we need
the additional STUN keep-alives to detect broken connections. We set
those somewhat conservatively to 60s.
As soon as the user triggers new application traffic, these values are
reverted back to their defaults, meaning even if the connection died
just before the user is starting to use it again, we will know within
the usual 10s because we are triggering new STUN requests more often.
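As an illustration, here is a minimal sketch of that parameter switch. The constants mirror the values described in this PR; the `Connection` type and function names are made up:

```rust
use std::time::{Duration, Instant};

const IDLE_AFTER: Duration = Duration::from_secs(10); // no app traffic for 10s
const ACTIVE_STUN_INTERVAL: Duration = Duration::from_millis(1500);
const IDLE_STUN_INTERVAL: Duration = Duration::from_secs(60);
const ACTIVE_MAX_MISSED: usize = 8; // ~10s until the connection counts as broken
const IDLE_MAX_MISSED: usize = 4;

struct Connection {
    last_application_traffic: Instant,
}

impl Connection {
    fn is_idle(&self, now: Instant) -> bool {
        now.duration_since(self.last_application_traffic) >= IDLE_AFTER
    }

    /// How often we send STUN binding requests and how many missed responses
    /// we tolerate before declaring the connection broken.
    fn keepalive_params(&self, now: Instant) -> (Duration, usize) {
        if self.is_idle(now) {
            (IDLE_STUN_INTERVAL, IDLE_MAX_MISSED)
        } else {
            (ACTIVE_STUN_INTERVAL, ACTIVE_MAX_MISSED)
        }
    }
}
```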
Note that existing gateways still implement the "close idle connections
after 5 minutes" behaviour. Customers will need to upgrade to a new
gateway version to fully benefit from these new always-on, low-power
connections.
Resolves: #6778.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Refs #6138
Sentry is always enabled for now. In the near future we'll make it
opt-out per device and opt-in per org (see #6138 for details)
- Replaces the `crash_handling` module
- Catches panics in GUI process, tunnel daemon, and Headless Client
- Added a couple "breadcrumbs" to play with that feature
- User ID is not set yet
- Environment is set to the API URL, e.g. `wss://api.firezone.dev`
- Reports panics from the connlib async task
- Release should be automatically pulled from the Cargo version which we
automatically set in the version Makefile
Example screenshot of sentry.io with a caught panic:
<img width="861" alt="image"
src="https://github.com/user-attachments/assets/c5188d86-10d0-4d94-b503-3fba51a21a90">
In the past, we struggled a lot with the reproducibility of
`tunnel_test` failures because our input state and transition strategies
were not deterministic. In the end, we found out that it was due to the
iteration order of `HashMap`s.
To make sure this doesn't regress, we added a check to CI at the time
that compares the debug output of all regression seeds against a second
run and ensures they are the same. That is overall a bit wonky.
We can do better by simply sampling a value from the strategy twice,
each time from a test runner with the same seed. If the strategy is deterministic,
those need to be the same. We still rely on the debug output being
identical because:
a. Deriving `PartialEq` on everything is somewhat cumbersome
b. We actually care about the iteration order which a fancy `PartialEq`
implementation might ignore
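A sketch of such a check (not the actual test code), using proptest's deterministic test runner and comparing `Debug` output:

```rust
use proptest::strategy::{Strategy, ValueTree};
use proptest::test_runner::TestRunner;

/// Sample the same strategy from two runners that share a seed; if the
/// strategy is deterministic, both samples must be identical.
fn assert_deterministic<S: Strategy>(strategy: S) {
    let mut runner_a = TestRunner::deterministic();
    let mut runner_b = TestRunner::deterministic();

    let value_a = strategy.new_tree(&mut runner_a).unwrap().current();
    let value_b = strategy.new_tree(&mut runner_b).unwrap().current();

    // Comparing `Debug` output instead of requiring `PartialEq` also catches
    // differences in iteration order (e.g. of `HashMap`s).
    assert_eq!(format!("{value_a:?}"), format!("{value_b:?}"));
}
```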
At the moment, the mapping of proxy IPs to the resolved IPs of a DNS
resource happens at the same time as the "authorisation" that the client
is allowed to talk to that resource. This is somewhat convoluted
because:
- Mapping proxy IPs to resolved IPs only needs to happen for DNS
resources, yet it is called for all resources (and internally skipped).
- Wildcard DNS resources only need to be authorised once, after which
the client is allowed to communicate with any domain matching the
wildcard address.
- The code that models resources within `ClientOnGateway` doesn't
differentiate between resource types at all.
With #6461, the authorisation of a resource will be completely decoupled
from the domain resolution for a particular domain of a DNS resource. To
make that easier to implement, we re-model the internals of
`ClientOnGateway` to differentiate the various resource types. Instead
of holding a single vec of addresses, the IPs are now indexed by the
respective domain. For CIDR resources, we only hold a single address
anyway and for the Internet Resource, the IP networks are static.
Under this new model, allowing a resource that has already been allowed
essentially constitutes an update, and the filters get re-calculated.
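A sketch of the re-modelled per-resource state, using stand-in types (`Domain` and `IpNetwork` are placeholders; the actual types inside `ClientOnGateway` differ):

```rust
use std::collections::{BTreeMap, BTreeSet};
use std::net::IpAddr;

type Domain = String; // placeholder
type IpNetwork = String; // placeholder

enum ResourceOnGateway {
    Dns {
        // Resolved/proxy IPs are indexed by the concrete domain rather than
        // kept in a single flat list.
        ips_by_domain: BTreeMap<Domain, BTreeSet<IpAddr>>,
    },
    Cidr {
        // CIDR resources only ever have a single network.
        network: IpNetwork,
    },
    // The IP networks for the Internet Resource are static.
    Internet,
}
```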
Firefox uses the so-called canary domain `use-application-dns.net` to
detect whether it should use DoH for its DNS queries. If the query is
answered with a server error or without records, Firefox disables DoH as
long as only its "Default protection" is enabled. If a user forces DoH,
this hint from the network is ignored.
See
https://support.mozilla.org/en-US/kb/canary-domain-use-application-dnsnet
for details.
I tested this on macOS and Firefox does indeed instantly disable DoH. A
default installation of Chrome doesn't use DoH for me.
Related: #6375.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
Currently, `connlib` is entirely single-threaded. This allows us to
reuse a single buffer for processing IP packets and makes reasoning
about the packet-processing code very simple. Being single-threaded also
means we can only make use of a single CPU core and all operations have
to be sequential.
Analyzing `connlib` using `perf` shows that we spend 26% of our CPU time
writing packets to the TUN interface [0]. Because we are
single-threaded, `connlib` cannot do anything else during this time. If
we could offload the writing of these packets to a different thread,
`connlib` could already process the next packet while the current one is
writing.
Packets that we send to the TUN interface arrive as encrypted WireGuard
packets over UDP and get decrypted into a (currently) shared buffer.
Moving the writing to a different thread implies that we need more of
these buffers for the next packet(s) to be decrypted into.
To avoid IP fragmentation, we set the maximum IP MTU to 1280 bytes on
the TUN interface. That actually isn't very big and easily fits into a
stack frame. The default stack size for threads is 2 MB [1].
Instead of creating more buffers and cycling through them, we can also
simply stack-allocate our IP packets. This incurs some overhead from
copying packets, but it is only ~3.5% [2] (measured without a separate
thread). With stack-allocated packets, almost all lifetime annotations
go away, which in itself is already a welcome ergonomics boost.
Stack-allocated packets also mean we can simply spawn a new thread for
the packet processing. This thread is connected with two channels to
connlib's main thread. The capacity of 1000 packets will at most consume
an additional 3.5 MB of memory, which is fine even on our
most-constrained devices such as iOS.
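A minimal sketch of the idea with a hypothetical `Packet` type and a bounded channel; the real connlib wiring differs:

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical stack-allocated packet: the 1280-byte device MTU fits
// comfortably into a fixed-size array, so no heap allocation or lifetime
// juggling is needed.
struct Packet {
    buf: [u8; 1280],
    len: usize,
}

fn spawn_tun_writer() -> mpsc::SyncSender<Packet> {
    // Bounded channel: at 1000 packets of ~1.3 kB each, the queue is capped
    // at a few megabytes of memory.
    let (tx, rx) = mpsc::sync_channel::<Packet>(1000);

    thread::spawn(move || {
        while let Ok(packet) = rx.recv() {
            // Write `packet.buf[..packet.len]` to the TUN fd here.
            let _ = &packet.buf[..packet.len];
        }
    });

    tx
}
```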
[0]: https://share.firefox.dev/3z78CzD
[1]: https://doc.rust-lang.org/std/thread/#stack-size
[2]: https://share.firefox.dev/3Bf4zla
Resolves: #6653.
Resolves: #5541.
Currently, the order in which connlib matches against the patterns of
DNS resources is not specified. We simply iterate over all patterns and
take the first one that matches. Due to the iteration order of
`HashMap`s, this also isn't deterministic.
With this patch, we introduce a defined order in which we attempt to
match a particular domain against the defined DNS resources:
- Resources without wildcards are always prioritised over wildcard
domains
- Single-char wildcards (`?`) take priority over label wildcards (`*`)
- Label wildcards (`*`) take priority over catch-all wildcards (`**`)
By matching against the DNS resources in a defined order, we ensure that
DNS resources that overlap always resolve to the most specific resource.
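A minimal sketch of this ordering with illustrative names (not connlib's actual matching code):

```rust
/// Derived `Ord` follows declaration order: exact patterns sort before `?`
/// wildcards, which sort before `*`, which sort before `**`.
#[derive(PartialEq, Eq, PartialOrd, Ord)]
enum PatternPriority {
    Exact,
    SingleCharWildcard, // `?`
    LabelWildcard,      // `*`
    CatchAllWildcard,   // `**`
}

fn priority(pattern: &str) -> PatternPriority {
    if pattern.contains("**") {
        PatternPriority::CatchAllWildcard
    } else if pattern.contains('*') {
        PatternPriority::LabelWildcard
    } else if pattern.contains('?') {
        PatternPriority::SingleCharWildcard
    } else {
        PatternPriority::Exact
    }
}

fn sort_most_specific_first(patterns: &mut Vec<String>) {
    // Matching then tries candidates in this order, so overlapping resources
    // resolve to the most specific one.
    patterns.sort_by_key(|p| priority(p));
}
```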
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Reactor Scram <ReactorScram@users.noreply.github.com>
This log is currently printed after we receive the `init` message from
the client. It is a left-over from early days of connlib where receiving
`init` itself already triggered all kinds of actions.
These days, we are mostly just updating state. In addition, `init` can
be received multiple times during a client's session which is somewhat
confusing when you see multiple "Firezone started" logs.
Fix #6781. Fix #6375.
The problem was that browsers on iOS (and possibly other OSes) query
for A, AAAA and HTTPS records, and we correctly intercept A and AAAA.
Correctly intercepting HTTPS queries is trickier since we need the
server's ALPN. Before this PR, we were just forwarding those queries and
the responses back, but the problem with that is that it'd return the
real IP for the service instead of our proxy IP.
So, as a quick fix, we simply blackhole the query so the browser never
uses that response.
In the future, an improvement over this would be to intercept the
response instead of the query and mangle the IPs there.
---------
Signed-off-by: Gabi <gabrielalejandro7@gmail.com>
Co-authored-by: Reactor Scram <ReactorScram@users.noreply.github.com>
Previously, the type signatures around parsing DNS queries were kind of
weird and hard to follow. Instead of nesting `Option` and `Result`, we
now use the `ControlFlow` type which is more expressive in what we
expect the caller to do with the return value.
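A sketch of the shape (the real types and signatures in connlib differ): `Continue` means "not DNS, keep routing the packet", `Break` means "this was DNS, here is what to do with it".

```rust
use std::ops::ControlFlow;

// Illustrative only.
enum DnsAction {
    AnswerLocally,
    ForwardToUpstream,
}

fn handle_maybe_dns(packet: &[u8]) -> ControlFlow<DnsAction, ()> {
    if !looks_like_dns(packet) {
        // Not a DNS query: the caller continues with normal routing.
        return ControlFlow::Continue(());
    }

    // A DNS query: the caller stops routing and acts on the result.
    ControlFlow::Break(DnsAction::ForwardToUpstream)
}

fn looks_like_dns(packet: &[u8]) -> bool {
    // Placeholder check for illustration: real parsing inspects the UDP
    // destination port and the DNS header.
    packet.len() > 12
}
```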
We reserve an IP range _within_ the CG-NAT range for the sentinel DNS
servers. It is unnecessary to explicitly add that one as a route because
it is already covered by the routing entry of the entire CG-NAT range.
When encountering a PTR query, `connlib` checks if the query is for a
Firezone-managed resource and resolves it to the correct IP. If it isn't
for a DNS resource, we should forward the query to the upstream
resolver.
This isn't what is currently happening though. Instead of forwarding the
query, we bail early from `StubResolver::handle` and thus attempt to
route the packet through the tunnel. This however fails because the DNS
query was targeted at `connlib`'s stub resolver address which never
corresponds to a resource IP.
When TRACE logs were activated, this resulted in several entries such as
> Unknown resource dst=100.100.111.1
To ensure this doesn't regress, we now generate PTR and MX record
queries in `tunnel_test`. We don't assert the response of those but we
do assert that we always get a response. The inclusion of MX records
asserts that unknown query types get correctly forwarded.
Resolves: #6749.
There are currently three problems on main: one in the tests and two
actual bugs.
## Test problem
The test kept routes in a `BTreeSet`: when a new route was added it was
`insert`ed, and when a route was removed it was `remove`d using the
address of the route.
The problem is that if an overlapping route was added twice in a row and
then only one of those resources was removed, the test would believe the
route no longer exists.
## Test solution
Keep the routes in a `BTreeMap` which maps the ID to the IP. We then
calculate the routes based on that, combined with the default routes.
That way, we just remove the ID and the routes are kept in the correct,
expected state.
## Real bug
Fixing this revealed a similar bug in connlib, since we keep things in a
similar structure: `active_cidr_resources`, which uses an
`IpNetworkTable`.
To fix this, I re-calculate the whole table each time we add/remove a
resource.
Note that this doesn't properly fix overlapping routes; it is just
enough to fix the test. To fix them properly, we need #4789.
Furthermore, fixing these issues revealed an additional problem:
whenever we add an overlapping CIDR resource, the old resource might be
overridden, causing the connection to be lost. This happened in a
non-deterministic (it's deterministic really, but not explicit) way,
causing the tests to fail.
To fix this, we always sort resources by ID (an arbitrary order chosen
to keep consistency with the proptests) and then don't replace the
routing for resources that already had a connection.
Sadly, to model this in the test, I had to copy almost exactly how we
calculate resources in connlib.
Fixes #6721
---------
Signed-off-by: Gabi <gabrielalejandro7@gmail.com>
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
Merging #6708 had an unintended side-effect: we are seeing a lot of
WARN logs from phoenix-channel because we can no longer parse the
responses from gateways. We didn't do anything with these responses, but
gateways are sending them for backwards-compatibility reasons.
To not confuse ourselves while debugging, we revert the client-side bit
of #6708 to remove these warnings.
Currently, checking whether a packet is a DNS query has multiple silent
exit paths. This makes DNS problems difficult to debug because the
packets will be treated as if they have to get routed through the
tunnel.
This is also something we should fix but that isn't done in this PR: If
we know that a packet is for connlib's DNS stub resolver, we should
never route it through the tunnel. Currently, this isn't possible to
express with the type signature of our DNS module and requires more
refactoring.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
DNS queries for non-resources get forwarded by connlib byte-for-byte;
only the IP & UDP headers get rewritten. The responses for some DNS
queries don't fit into UDP packets. To signal this, DNS servers set the
TC (truncated) bit on the response.
If set, most clients will discard the response and requery over TCP. DNS
over TCP is not yet supported in connlib, meaning any query response
that has the TC bit set will most likely lead to DNS issues downstream.
To make these cases easier to debug, we log a warning whenever we
encounter a DNS response from an upstream server that has the TC bit
set.
Prior to version 1.1.0, clients did not have an embedded DNS resolver
and relied on the gateway for DNS resolution. In that design, the
gateway responded with the IPs that the domain resolved to.
Our next iteration of the control protocol (#6461) will decouple the
details of how DNS works from the flow-authorization. As a result, we
will need to be able to establish a flow for a DNS resource without
knowing which concrete domain the client is going to access.
Without a concrete domain, we cannot send anything back to these old
clients, meaning we unfortunately have to break compatibility with <
1.1.0 clients as part of implementing the new control protocol.
This only happens once for each domain, so logging this on DEBUG should
be fine. It is disabled by default and shouldn't be too spammy during
development.
The `expect` attribute is similar to `allow` in that it silences a
particular lint. Unlike `allow`, however, `expect` will fail as soon as
the lint is no longer emitted. This ensures we don't end up with stale
`allow` attributes in our codebase. Additionally, it provides a way of
adding a `reason` that documents why the lint is being suppressed.
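A minimal illustration of the difference:

```rust
// `expect` fails the build once the lint it suppresses stops firing, while
// `allow` would silently go stale.
#[expect(dead_code, reason = "kept for an upcoming feature")]
fn unused_helper() {}

fn main() {
    // If `unused_helper` were ever called, the `dead_code` lint would no
    // longer fire and the `expect` attribute itself would become an error.
}
```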
In #6592, we started tracking our connected gateways to correctly model
which packets get dropped as part of establishing new connections. We
forgot to clear this when connections are being reset, causing some test
failures.
As the final step in removing `pnet_packet`, we need to introduce
`-Mut` equivalent slices for UDP, TCP and ICMP packets. As a starting
point, introducing `UdpHeaderSliceMut` and `TcpHeaderSliceMut` is fairly
trivial. The ICMP variants are a bit trickier because they are different
for IPv4 and IPv6. Additionally, ICMP for IPv4 is quite complex because
it can have a variable header length. Moreover, for both variants, the
values in byte range 5-8 are semantically different depending on the
ICMP code.
This requires us to design an API that balances ergonomics and
correctness. Technically, an ICMP identifier and sequence can only be
set if the ICMP code is "echo request" or "echo reply". However, adding
an additional parsing step to guarantee this in the type system is quite
verbose.
The trade-off implemented in this PR allows us to directly write to
bytes 5-8 using the `set_identifier` and `set_sequence` functions. To
catch errors early, these functions have debug assertions built in that
ensure the packet is indeed an ICMP echo packet.
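A sketch of that trade-off with a made-up mutable ICMPv4 slice type (the real types in the codebase differ):

```rust
struct IcmpV4SliceMut<'a> {
    buf: &'a mut [u8],
}

impl<'a> IcmpV4SliceMut<'a> {
    fn is_echo(&self) -> bool {
        // ICMPv4 type 8 is "echo request", type 0 is "echo reply".
        matches!(self.buf[0], 0 | 8)
    }

    /// Writes bytes 5-6 (offsets 4..6) directly, guarded by a debug assertion.
    fn set_identifier(&mut self, id: u16) {
        debug_assert!(self.is_echo(), "identifier only exists on echo packets");
        self.buf[4..6].copy_from_slice(&id.to_be_bytes());
    }

    /// Writes bytes 7-8 (offsets 6..8) directly, guarded by a debug assertion.
    fn set_sequence(&mut self, seq: u16) {
        debug_assert!(self.is_echo(), "sequence only exists on echo packets");
        self.buf[6..8].copy_from_slice(&seq.to_be_bytes());
    }
}
```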
Resolves: #6366.
Currently, we have two structs for representing IP packets: `IpPacket`
and `MutableIpPacket`. As the name suggests, they mostly differ in
mutability. This design was originally inspired by the `pnet_packet`
crate which we based our `IpPacket` on. With subsequent iterations, we
added more and more functionality onto our `IpPacket`, like NAT64 &
NAT46 translation. As a result of that, the `MutableIpPacket` is no
longer directly based on `pnet_packet` but instead just keeps an
internal buffer.
This duplication can be resolved by merging the two structs into a
single `IpPacket`. We do this by first replacing all usages of
`IpPacket` with `MutableIpPacket`, then deleting `IpPacket` and renaming
`MutableIpPacket` to `IpPacket`. The final design now has different
`self`-receivers: Some functions take `&self`, some `&mut self` and some
consume the packet using `self`.
This results in a more ergonomic usage of `IpPacket` across the codebase
and deletes a fair bit of code. It also takes us one step closer towards
using `etherparse` for all our IP packet interaction-needs. Lastly, I am
currently exploring a performance-optimisation idea that stack-allocates
all IP packets and for that, the current split between `IpPacket` and
`MutableIpPacket` does not really work.
Related: #6366.
The lifetime of the returned packet is actually already `'static`,
meaning we don't need to call `to_owned`.
Related: #6366.
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: firezone <firezone@firezones-MacBook-Air.local>