firezone

mirror of https://github.com/outbackdingo/firezone.git synced 2026-03-20 22:41:50 +00:00

Author	SHA1	Message	Date
Thomas Eizinger	9caca475dc	test(connlib): introduce routing table to `tunnel_test` (#5786 ) Currently, `tunnel_test` uses a rather naive approach when dispatching `Transmit`s. In particular, it asks client, gateway and relay separately whether they "want" a certain packet. In a real network, these packets are routed based on their IP. To mimic something similar, we introduce a `Host` abstraction that wraps each component: client, gateway and relay. Additionally, we introduce a `RoutingTable` where we can add and remove hosts. With these things in place, routing a `Transmit` is as easy as looking up the destination IP in the routing table and dispatching to the corresponding host. Our hosts are type-safe: client, gateway and relay have different types. Thus, we abstract over them using a `HostId` in order to know which host a certain message is for. Following these patches, we can easily introduce multiple gateways and relays to this test by simply making more entries in this routing table. This will increase the test coverage of connlib. Lastly, this patch massively increases the performance of `tunnel_test`. It turns out that previously, we spent a lot of CPU cycles accessing "random" IPs from very large iterators. With this patch, we sample from a limited range of 100 IPs, thus drastically increasing the performance of this test. The configured 1000 testcases execute in 3s on my machine now (with opt-level 1, which is what we use in CI). --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io>	2024-07-09 01:48:54 +00:00
Jamil	aa7977c9b5	chore: bump android 1.1.3 (#5784 )	2024-07-06 16:54:14 -07:00
Jamil	7820e3f3c7	fix(android): Strip scope id off IPv6 addresses Android (#5783 ) Fixes #5781 --------- Signed-off-by: Jamil <jamilbk@users.noreply.github.com>	2024-07-06 16:50:30 -07:00
Reactor Scram	663367b605	chore(gui-client): timestamp crash dump file names (#5452 ) Closes #5449 The smoke tests expect `last_crash.dmp` at a fixed path, so in this case we write the file with a timestamped name, then copy it over `last_crash.dmp`.	2024-07-05 15:21:25 +00:00
Thomas Eizinger	28d5b8574c	chore(connlib): minor logging tweaks (#5746 ) Noticed a few things that caused unnecessary verbosity in the logs.	2024-07-05 14:45:32 +00:00
Thomas Eizinger	2a2877a4d9	test(snownet): add debug assert (#5750 ) Within `snownet`'s test harness, packets are dispatched in a particular order and if none of them match, they are assumed to be for the node directly. We add a debug assert to ensure that the given address is in fact part of the "local" interfaces that we have configured in the tests.	2024-07-05 07:00:24 +00:00
Thomas Eizinger	a57c64e62b	chore(snownet): add some debug logs around channel bindings (#5749 )	2024-07-05 07:00:03 +00:00
Jamil	086c730aaf	chore: Bump clients to 1.1.2 for DNS record type forward (#5703 ) Apps are already in review with App Stores	2024-07-04 01:31:26 +00:00
Reactor Scram	f6e99752ec	fix(client): flush the OS' DNS cache whenever resources change (#5700 ) Closes #5052 On my dev VMs: - systemd-resolved = 15 ms to flush - Windows = 600 ms to flush I tested with the headless Clients on Linux and Windows and it fixes the issue. On Windows I didn't replicate the issue with the GUI Client, on Linux this patch also fixes it for the GUI Client.	2024-07-03 21:14:43 +00:00
Gabi	5fd321c4bb	chore(connlib): forward non-address record queries (#5674 ) Since we only handle `A`, `AAAA` and `PTR` records of names we handle, this can lead to unexpected behavior with other record types, where using Firezone breaks `TXT`, `MX` or other record types for the resources we handle. So this is a bit of a refactor: we now look up a resource and explicitly return `Some` when there is a record we should be returning (even if it's empty due to IP exhaustion) or `None` when we should just forward the query. This has the added benefit of no longer breaking bonjour or other non-standard `PTR` queries. Fixes: #5673. --------- Co-authored-by: Thomas Eizinger <thomas@eizinger.io>	2024-07-03 05:15:23 +00:00
Gabi	79fd8f6063	chore(connlib): add message type to the no records found logs (#5641 ) Added for clarity when debugging, it used to look like: ``` 2024-06-30T00:16:05.718337Z DEBUG firezone_tunnel::dns: No records for github.com, returning NXDOMAIN ``` And now looks like: ``` 2024-06-30T00:16:05.718337Z DEBUG firezone_tunnel::dns: No MX records for github.com, returning NXDOMAIN ```	2024-07-01 23:15:44 +00:00
Thomas Eizinger	02f5c67974	chore(windows): reduce nesting in wintun recv-thread (#5573 ) Related: #5571.	2024-07-01 16:33:59 +00:00
Jamil	25b6528942	chore: Bump versions and update changelog (#5636 ) Signed-off-by: Jamil <jamilbk@users.noreply.github.com>	2024-06-29 09:06:10 -07:00
Thomas Eizinger	96536a23cf	refactor(connlib): ignore relays per connection (#5631 ) In a previous design of firezone, relays used to be scoped to a certain connection. For a while now, this constraint has been lifted and all connections can use all relays. A related, outdated concern is the idea of STUN-only servers. Those also used to be assigned on a per-connection basis. By removing any use of per-connection relays and STUN-only servers, the entire `StunBinding` concept is unused code and can thus be deleted. To push this over the finish line, the `snownet-tests` which test the hole-punching functionality needed to be slightly adapted to make use of the more recently introduced API `Node::update_relays`. Resolves: #4749.	2024-06-29 02:36:17 +00:00
Thomas Eizinger	f2b6c205c2	refactor(snownet): change `reconnect` to `reset` (#5630 ) Currently, `snownet` still supports this notion of "reconnecting", which is a mix between resetting some state and keeping other state. In particular, we currently retain the `StunBinding` and `Allocation` state. This used to be important because allocations are bound to the 3-tuple of the client and thus needed to be kept around in case we weren't actually roaming. However, we always rebind the local UDP sockets upon reconnecting and thus the 3-tuple always changes anyway. In addition, we always reconnect to the portal, meaning we receive another `init` message and thus can actually completely clear the `Node`'s state. This PR does that and, in the process, rebrands `reconnect` as `reset`, which now makes more sense. Related: #5619.	2024-06-29 02:07:10 +00:00
Thomas Eizinger	8973cc5785	refactor(android): use fmt::Layer with custom writer (#5558 ) Currently, the logs that go to logcat on Android are pretty badly formatted because we use `tracing-android` and it formats the span fields and message fields itself. There is actually no reason for doing the formatting ourselves. Instead, we can use the `MakeWriter` abstraction from `tracing_subscriber` to plug in a custom writer that writes to Android's logcat. This results in logs like this: ``` [nix-shell:~/src/github.com/firezone/firezone/rust]$ adb logcat -s connlib --------- beginning of main 06-28 19:41:20.057 19955 20213 D connlib : phoenix_channel: Connecting to portal host=api.firez.one user_agent=Android/14 5.15.137-android14-11-gbf4f9bc41c3b-ab11664771 connlib/1.1.1 06-28 19:41:20.058 19955 20213 I connlib : firezone_tunnel::client: Network change detected 06-28 19:41:20.061 19955 20213 D connlib : snownet::node: Closed all connections as part of reconnecting num_connections=0 06-28 19:41:20.365 19955 20213 I connlib : phoenix_channel: Connected to portal host=api.firez.one 06-28 19:41:20.601 19955 20213 I connlib : firezone_tunnel::io: Setting new DNS resolvers 06-28 19:41:21.031 19955 20213 D connlib : firezone_tunnel::client: TUN device initialized ip4=100.66.86.233 ip6=fd00:2021:1111::f:d9c1 name=tun1 06-28 19:41:21.031 19955 20213 I connlib : connlib_client_shared::eventloop: Firezone Started! 
06-28 19:41:21.031 19955 20213 I connlib : firezone_tunnel::dns: Activating DNS resource address=.slackb.com 06-28 19:41:21.031 19955 20213 I connlib : firezone_tunnel::dns: Activating DNS resource address=.test-ipv6.com 06-28 19:41:21.032 19955 20213 I connlib : firezone_tunnel::client: Activating CIDR resource address=5.4.6.7/32 name=5.4.6.7 06-28 19:41:21.032 19955 20213 I connlib : firezone_tunnel::client: Activating CIDR resource address=10.0.32.101/32 name=IPerf3 06-28 19:41:21.032 19955 20213 I connlib : firezone_tunnel::dns: Activating DNS resource address=ifconfig.net 06-28 19:41:21.032 19955 20213 I connlib : firezone_tunnel::dns: Activating DNS resource address=.slack-imgs.com 06-28 19:41:21.032 19955 20213 I connlib : firezone_tunnel::dns: Activating DNS resource address=.google.com 06-28 19:41:21.032 19955 20213 I connlib : firezone_tunnel::client: Activating CIDR resource address=10.0.0.5/32 name=10.0.0.5 06-28 19:41:21.032 19955 20213 I connlib : firezone_tunnel::dns: Activating DNS resource address=.githubassets.com 06-28 19:41:21.032 19955 20213 I connlib : firezone_tunnel::dns: Activating DNS resource address=dnsleaktest.com 06-28 19:41:21.033 19955 20213 I connlib : firezone_tunnel::dns: Activating DNS resource address=.slack-edge.com 06-28 19:41:21.033 19955 20213 I connlib : firezone_tunnel::dns: Activating DNS resource address=.github.com 06-28 19:41:21.033 19955 20213 I connlib : firezone_tunnel::dns: Activating DNS resource address=speed.cloudflare.com 06-28 19:41:21.033 19955 20213 I connlib : firezone_tunnel::dns: Activating DNS resource address=.githubusercontent.com 06-28 19:41:21.033 19955 20213 I connlib : firezone_tunnel::client: Activating CIDR resource address=10.0.14.11/32 name=Staging resource performance 06-28 19:41:21.033 19955 20213 I connlib : firezone_tunnel::dns: Activating DNS resource address=*.whatismyip.com 06-28 19:41:21.033 19955 20213 I connlib : firezone_tunnel::client: Activating CIDR resource address=10.0.0.8/32 
name=10.0.0.8 06-28 19:41:21.033 19955 20213 I connlib : firezone_tunnel::client: Activating CIDR resource address=9.9.9.9/32 name=Quad9 DNS 06-28 19:41:21.034 19955 20213 I connlib : firezone_tunnel::client: Activating CIDR resource address=10.0.32.10/32 name=CoreDNS 06-28 19:41:21.216 19955 20213 I connlib : snownet::node: Added new TURN server id=bd6e9d1a-4696-4f8b-8337-aab5d5cea810 address=Dual { v4: 35.197.171.113:3478, v6: [2600:1900:40b0:1504:0:27::]:3478 } ``` --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io>	2024-06-28 22:15:10 +00:00
Jamil	8655b711db	fix(connlib): Don't use `operatingSystemVersionString` on Apple OSes (#5628 ) The [HTTP 1.1 RFC](https://datatracker.ietf.org/doc/html/rfc2616) states that HTTP headers should be US-ASCII. This is not the case when the macOS Client is run from a host that has a non-English language selected as its system default due to the way we build the user agent. This PR fixes that by normalizing how we build the user agent by more granularly selecting which fields compose it, and not just relying on OS-provided version strings that may contain non-ASCII characters. fixes https://github.com/firezone/firezone/issues/5467 --------- Signed-off-by: Jamil <jamilbk@users.noreply.github.com>	2024-06-28 21:59:02 +00:00
Thomas Eizinger	e5cba1caf4	refactor(apple): use `fmt::Layer` with custom writer (#5623 ) Currently, we use the `tracing-oslog` crate to ingest logs on MacOS and iOS. This crate has a "feature" where it creates so called "Activities" for spans. Whilst that may initially sound useful, Apple's UI for viewing these activities is absolutely useless. Instead of tinkering around with that, we remove the `tracing-oslog` crate and let `tracing-subscriber` format our logs first and then only send a single string to the oslog backend. Related: #5619.	2024-06-28 21:22:54 +00:00
Reactor Scram	a315c49b3c	chore(firezone-tunnel/windows): reduce ring buffer from 64 MiB to 1 MiB (#5609 ) Oops. It runs the same either way so we definitely don't need all that RAM to be tied up. The Linux and macOS Clients probably have similar buffer sizes already. I tested before and after with CloudFlare's speed test and got roughly 140/12 with latency 50 ms both times. The error bars on speed tests are pretty wide, but we definitely aren't falling 60 MiB behind on processing and then catching up. ```[tasklist] ### Tasks - [x] (failed, can't do it right now) ~~Log if we knowingly drop a lot of packets~~ - [x] Extract constant - [x] Add comment about not knowing if we drop packets - [x] Merge - [ ] (skipped) Test while the CPU is loaded ```	2024-06-28 21:03:18 +00:00
Thomas Eizinger	ed34ca096b	chore(gateway): remove dead IP detection (#5618 ) This does not work as well as intended and spams the logs. We may need #5542 before we can implement this properly. Fixes: #5593.	2024-06-28 04:47:00 +00:00
Thomas Eizinger	66cb565915	fix(snownet): use unused channels before reused expired ones (#5613 ) Within each allocation, a client has 4095 channels that it can bind to different peers. Each channel binding is valid for 10 minutes unless rebound. Additionally, there is a 5min cool-down period after a channel binding expires before it can be rebound to a different peer. This patch fixes a bug in snownet where we would have first attempted to rebind the last bound channel instead of just picking the next unused one. In the case of a clock drift between client and relay, this caused unnecessary errors when attempting to rebind channels. Fixes: #5603. --------- Co-authored-by: conectado <gabrielalejandro7@gmail.com>	2024-06-28 03:14:16 +00:00
Thomas Eizinger	aadb045b27	chore(connlib): batch together sending of ICE candidates (#5616 ) Currently, we are sending each ICE candidate individually from the client to the gateway and vice versa. This causes a slight delay as to when each ICE candidate gets added on the remote ICE agent. As a result, they all start being tested with a slight offset which causes "endpoint hopping" whenever a connection expires as they expire just after each other. In addition, sending multiple messages to the portal causes unnecessary load when establishing connections. Finally, with #5283 we started not adding the server-reflexive candidate to the local ICE agent. Because we talk to multiple relays, we detect the same server-reflexive candidate multiple times if we are behind a non-symmetric NAT. Not adding the server-reflexive candidate to the ICE agent bypassed our de-duplication strategy here, which means we currently send the same candidate multiple times to a peer, causing additional, unnecessary load. All of this can be mitigated by batching all our ICE candidates together into one message. Resolves: #3978.	2024-06-28 02:04:31 +00:00
Thomas Eizinger	79ff3f830b	chore(gateway): downgrade `warn` logs (#5612 ) Whilst it has been helpful to find issues such as #5611, having these logs on `warn` spams the end user too much and creates a false sense that things might not be working as there can be a variety of reasons why packets might not be able to be routed.	2024-06-28 01:13:29 +00:00
Gabi	375a1b5586	fix(connlib): allow 1s ACK for packet before refreshing DNS (#5560 ) Currently, we refresh DNS mappings when: * We translate a packet for the first time * There are no more incoming packets for 120 seconds * There is at least 1 outgoing packet in the last 10 seconds The idea was to coordinate with conntrack somehow, to expire DNS translation at the point where the NAT session of the OS stops being valid. That way, if the triggered DNS refresh changes the resolved IPs it would never kill the underlying connection. However, TCP sessions by default can last for up to 5 days! And I have no idea how long for ICMP. To prevent killing these connections, we assume that TCP and ICMP packets will elicit a response within 1s. The DNS refresh for a translation mapping that hasn't seen any responses is thus delayed by 1s after the last packet has been sent out. To get an idea of how this works you can imagine it like this \|last incoming packet\|------ 120 seconds + x seconds ----\|outgoing packet\|----1 second ----\|dns refresh\| However, there is another case where DNS refresh is triggered. In this case, the same packet triggers the refresh period and falls within the period where it was used in the last 10 seconds \|last incoming packet\|------ 111 seconds ----\|outgoing packet\|---- 9 seconds ----\|dns refresh\| The unit tests should also make clear when we want to trigger DNS refresh and when we don't. --------- Co-authored-by: Thomas Eizinger <thomas@eizinger.io>	2024-06-28 00:25:26 +00:00
Reactor Scram	76e55e6138	fix(client/windows): fix upload speed by letting Wintun queue packets again (#5598 ) Closes #5589. Refs #5571 Improves upload speeds on my Windows 11 VM from 2 Mbps to 10.5 Mbps. On the resource-constrained VM it improved from 3 to 7 Mbps. ```[tasklist] ### Tasks - [x] Open for review - [x] Manual test on resource-constrained VM - [x] Run 5x replication steps from #5571 and make sure it doesn't deadlock again - [x] Merge - [ ] https://github.com/firezone/firezone/issues/5601 ``` Sorted by decreasing speed, M = macOS host, W = Windows guest in Parallels, RC = Resource-constrained Windows guest in VirtualBox: - M, Internet - 16 Mbps - W, Internet - 13 Mbps - M, Firezone - 12 Mbps - RC, Internet - 12 Mbps - W, Firezone, after this PR - 10.5 Mbps - RC, Firezone, after this PR - 8.5 Mbps - RC, Firezone, before this PR - 4 Mbps - W, Firezone, before this PR - 2 Mbps So it's not perfect but the worst part is fixed. The slow upload speeds were probably a regression from #5571. The MPSC channel only has a few spots in it, so if connlib doesn't pick up every packet immediately (which would be impossible under load), we drop packets. I measured 25% packet drops in an earlier commit. I first tried increasing the channel size from 5 to 64, and that worked. But this solution is simpler. I switch back to `blocking_send` so if connlib isn't clearing the MPSC channel, Wintun will just queue up packets in its internal ring buffers, and we aren't responsible for buffering. Getting rid of `blocking_send` was a defense-in-depth thing to fix the deadlock yesterday, but we still close the MPSC channel inside `Tun::drop`, and I confirmed in a manual test that this will kick the worker thread out of `blocking_send`, so the deadlock won't come back.	2024-06-27 17:59:22 +00:00
Jamil	b5de55ac26	chore: Bump clients to 1.1.0, Gateway to 1.1.1 (#5591 )	2024-06-27 02:43:48 -07:00
Thomas Eizinger	b6420eaa3e	feat(snownet): close idle connections after 5min (#5576 ) We define a connection as idle if we haven't sent or received any packets in the last 5 minutes. From `snownet`'s perspective, keep-alives sent by upper layers (like TCP keep-alives) must be honored and thus outgoing as well as incoming packets are accounted for. If the underlying connection breaks, we will hit an ICE timeout which is an implementation detail of `snownet`. The packets tracked here are IP packets that the user wants to send / receive via the tunnel. Similarly, wireguard's keep-alives do not update these timestamps and thus don't mark a connection as non-idle. --------- Co-authored-by: Jamil Bou Kheir <jamilbk@users.noreply.github.com>	2024-06-27 08:28:38 +00:00
Thomas Eizinger	58fad7cb2d	refactor(connlib): batch resource change updates (#5575 ) Currently, upon reconnecting, `snownet` returns a list of connection IDs that have been closed. This was done to avoid emitting many identical `ResourcesChanged` events. In all other events, `snownet` always only references a single connection. To align this whilst not duplicating `ResourcesChanged` events, we use a dedicated `bool` to check whether any of the events emitted by `snownet` require updating the clients about our active resources.	2024-06-27 07:48:41 +00:00
Thomas Eizinger	18b9c35316	chore(connlib): explicitly handle `invalid_version` error (#5577 ) Ensures we correctly deserialize `invalid_version` and don't fall-back to `Other`. Related: #5525.	2024-06-27 07:41:41 +00:00
Gabi	ad8c92ca35	fix(connlib): dont panic in invalid PTR records (#5588 )	2024-06-27 07:24:06 +00:00
Thomas Eizinger	9ddee774b4	chore(connlib): allow filtering of `wire` log target (#5578 ) Currently, enabling the `wire` log is an all or nothing approach, logging incoming and outgoing messages from the TUN device, network and the portal. Often, only one or more of these is desired but enabling all of `wire` spams the logs to the point where one cannot see the information they'd like. With this PR, we move some of the fields of the `wire` log statements to the log target instead. This allows controlling the logs via the `RUST_LOG` env variable. For example, to only see messages sent and received to the API, one can set `RUST_LOG=wire::api=trace` which will output something like: ``` 2024-06-27T02:12:41.821374Z TRACE wire::api::send: {"topic":"client","event":"phx_join","payload":null,"ref":0} 2024-06-27T02:12:42.030573Z TRACE wire::api::recv: {"event":"phx_reply","ref":0,"topic":"client","payload":{"status":"ok","response":{}}} ``` Similarly, enabling `wire::net=trace` will give you logs for packets sent over the network: ``` 2024-06-27T02:12:50.487503Z TRACE wire::net::send: src=None dst=34.80.2.250:3478 num_bytes=20 2024-06-27T02:12:50.487589Z TRACE wire::net::send: src=None dst=[2600:1900:4030:b0d9:0:5::]:3478 num_bytes=20 2024-06-27T02:12:50.487622Z TRACE wire::net::send: src=None dst=34.87.210.10:3478 num_bytes=20 2024-06-27T02:12:50.487652Z TRACE wire::net::send: src=None dst=[2600:1900:40b0:1504:0:17::]:3478 num_bytes=20 2024-06-27T02:12:50.510049Z TRACE wire::net::recv: src=34.87.210.10:3478 dst=192.168.188.71:39207 num_bytes=32 2024-06-27T02:12:50.510382Z TRACE wire::net::send: src=None dst=34.87.210.10:3478 num_bytes=112 2024-06-27T02:12:50.526947Z TRACE wire::net::recv: src=34.87.210.10:3478 dst=192.168.188.71:39207 num_bytes=92 2024-06-27T02:12:50.527295Z TRACE wire::net::send: src=None dst=34.87.210.10:3478 num_bytes=152 ``` These targets have been designed to take up equal amounts of space. 
All three types (`dev`, `net`, `api`) have 3 letters and `send` and `recv` have 4. That way, these logs are always aligned which makes them easier to scan.	2024-06-27 06:36:49 +00:00
Gabi	e0e9e078a0	fix(connlib): statically resolve API domain (#5563 ) In order to handle DNS resources, connlib intercepts all DNS requests on the system once it has started up. The DNS queries are then forwarded to the original DNS resolver in case the query isn't for one of the configured DNS resources _except_ if the configured DNS resolver is also a CIDR resource. In that case, the DNS query will be tunneled to a gateway and forwarded to the DNS resolver from there. Exactly this configuration results in a deadlock when roaming networks. To make roaming more reliable, we now drop all connections when detecting a network change (see #5308). As a result, DNS queries cannot be tunneled right away. This isn't usually a problem: We just send a connection intent to the portal to connect to the gateway. Upon a network change, we also reconnect the websocket to the portal, which also requires resolving the domain name. Connlib's DNS resolver is still active at that point and thus, we end up deadlocking ourselves because the DNS query to resolve the portal's domain is waiting for a connection to a gateway that can only be established once we are connected to the portal. To prevent this, we extend connlib with a "known hosts" feature. These are DNS records that are defined statically for the lifetime of a connlib session and can thus always be resolved, regardless of the connection state with the portal or the gateways. We populate these records with the portal's API, allowing the reconnect to work without having connected gateways. --------- Co-authored-by: Thomas Eizinger <thomas@eizinger.io>	2024-06-27 06:00:56 +00:00
Thomas Eizinger	c2b5379fba	chore(connlib): demote log for unknown incoming packets to debug (#5584 ) There are several reasons why we would legitimately receive a packet that we can't handle, i.e. when a connection got cleared locally but the gateway is still trying to send us packets for that socket. Not handling these packets can be a bug but more often than not, it is not an issue. Additionally, all our unit-tests actually `.unwrap` the `Node::encapsulate` function so any unhandled packets in the tests will be caught.	2024-06-27 05:58:04 +00:00
Reactor Scram	990f98e60f	fix(windows): prevent deadlock when closing wintun (#5571 ) Refs #5441, but without a reliable way to replicate that issue, I'm not sure if this will completely fix it. Before this PR, a deadlock can happen between 2 threads, call them "main thread" and "worker thread". The deadlock is more likely if more traffic is flowing through the tunnel. # Test results I ran a build from this PR inside the resource-constrained VM and it's likely the deadlock could have triggered there, since the packet channel had 0 capacity (it was full) when we reached `Tun::drop`: ```jsonl {"time":"2024-06-26T22:43:33.2398441Z","target":"firezone_headless_client::ipc_service","logging.googleapis.com/sourceLocation":{"file":"headless-client\\src\\ipc_service.rs","line":"304"},"severity":"INFO","gitVersion":"e591bb9","logFilter":"\"str0m=warn,info\""} .. {"time":"2024-06-26T22:45:42.9035226Z","target":"firezone_tunnel::device_channel::tun_windows","logging.googleapis.com/sourceLocation":{"file":"connlib\\tunnel\\src\\device_channel\\tun_windows.rs","line":"45"},"severity":"INFO","channelCapacity":0,"message":"Shutting down packet channel..."} {"time":"2024-06-26T22:45:42.9035467Z","target":"firezone_tunnel::device_channel::tun_windows","logging.googleapis.com/sourceLocation":{"file":"connlib\\tunnel\\src\\device_channel\\tun_windows.rs","line":"274"},"severity":"INFO","message":"recv_task exiting gracefully"} {"time":"2024-06-26T22:45:43.4978015Z","target":"connlib_client_shared","logging.googleapis.com/sourceLocation":{"file":"connlib\\clients\\shared\\src\\lib.rs","line":"150"},"severity":"INFO","message":"connlib exited gracefully"} ``` I followed these steps: - Run Firezone and sign in - Start a speed test using Cloudflare - During the download phase, quit the GUI I did the same test with `0fac698` (`main`) and got the "All pipe instances are busy" error dialog 3 out of 5 times. 
# Details The deadlock will happen in this scenario: - The main thread enters `Tun::drop` here `0fac698dfc/rust/connlib/tunnel/src/device_channel/tun_windows.rs (L44)` - The worker thread is waiting for space in the packet channel (`packet_tx` and `packet_rx`) here `0fac698dfc/rust/connlib/tunnel/src/device_channel/tun_windows.rs (L249)` - The main thread tells wintun to shut down. If the worker was on line 247 waiting on wintun, this would unblock it, but the worker is not on line 247. `0fac698dfc/rust/connlib/tunnel/src/device_channel/tun_windows.rs (L45)` - The main thread waits to join the worker thread `0fac698dfc/rust/connlib/tunnel/src/device_channel/tun_windows.rs (L52)` The threads are now deadlocked. The main thread is waiting for the worker thread to exit, and the worker thread is waiting for the main thread to either call `poll_recv`, which would cause `blocking_send` to return, or for the main thread to complete `Tun::drop`, which would cause Rust to drop `packet_rx`, which would cause `blocking_send` to return an error. This PR makes 2 changes to prevent this deadlock. Each change alone should work, but for defense-in-depth we make both changes: 1. When the main thread starts `Tun::drop`, we `close` the packet channel, which would unblock any thread waiting on `Sender::blocking_send` 2. We use `Sender::try_send` instead of `Sender::blocking_send`. If the main thread can't consume packets fast enough, we're going to drop them anyway, because the ring buffer in wintun will eventually fill up. So dropping them here isn't much different from dropping them anywhere else, and this keeps the worker thread from locking up.	2024-06-26 23:52:20 +00:00
Gabi	0fac698dfc	chore(connlib): set connection expiration to 120seconds to respect the conntrack udp timeout (#5559 )	2024-06-26 21:25:00 +00:00
Gabi	2d312ddc71	chore(connlib): reduce log level for unallowed packets in client (#5569 ) Workaround for too many `unallowed packets` logs. Long-term fix in #5568 and #5560	2024-06-26 21:24:13 +00:00
Jamil	89bb7c2c5d	fix(android): Fix crash in `setDns` on 32-bit Android by using jlong consistently for the SessionWrapper pointer (#5564 ) `connlibSessionPtr` is a `Long`, which is 64-bits. On 32-bit Android architectures, this overwrites part of the `dns_list` for the `setDns` native function call because Rust uses a `32-bit` sized pointer for `SessionWrapper` in the function definition. This causes a JNI crash, detailed below. To fix this, we make sure `jlong` is received in Rust, and do the pointer conversion in the body of the functions that need to use it. Adding @ReactorScram to review for visibility. ``` runtime.cc:655] Runtime aborting... runtime.cc:655] Dumping all threads without mutator lock held runtime.cc:655] All threads: runtime.cc:655] DALVIK THREADS (35): runtime.cc:655] "ConnectivityThread" prio=5 tid=35 Runnable runtime.cc:655] \| group="" sCount=0 dsCount=0 flags=0 obj=0x131809a8 self=0xa42dea10 runtime.cc:655] \| sysTid=8854 nice=0 cgrp=default sched=0/0 handle=0x7fbb71c0 runtime.cc:655] \| state=R schedstat=( 0 0 0 ) utm=8 stm=0 core=2 HZ=100 runtime.cc:655] \| stack=0x7fab4000-0x7fab6000 stackSize=1040KB runtime.cc:655] \| held mutexes= "abort lock" "mutator lock"(shared held) runtime.cc:655] native: #00 pc 0037b1dd /apex/com.android.art/lib/libart.so (art::DumpNativeStack(std::__1::basic_ostream<char, std::__1::char_traits<char> >&, int, BacktraceMap, char const, art::ArtMethod, void, bool)+76) runtime.cc:655] native: #01 pc 0044cd01 /apex/com.android.art/lib/libart.so (art::Thread::DumpStack(std::__1::basic_ostream<char, std::__1::char_traits<char> >&, bool, BacktraceMap, bool) const+388) runtime.cc:655] native: #02 pc 00448447 /apex/com.android.art/lib/libart.so (art::Thread::Dump(std::__1::basic_ostream<char, std::__1::char_traits<char> >&, bool, BacktraceMap, bool) const+34) runtime.cc:655] native: #03 pc 00465995 /apex/com.android.art/lib/libart.so (art::DumpCheckpoint::Run(art::Thread)+688) runtime.cc:655] native: #04 pc 00460e57 
/apex/com.android.art/lib/libart.so (art::ThreadList::RunCheckpoint(art::Closure, art::Closure)+354) runtime.cc:655] native: #05 pc 0046034f /apex/com.android.art/lib/libart.so (art::ThreadList::Dump(std::__1::basic_ostream<char, std::__1::char_traits<char> >&, bool)+1514) runtime.cc:655] native: #06 pc 0040a3af /apex/com.android.art/lib/libart.so (art::Runtime::Abort(char const)+1510) runtime.cc:655] native: #07 pc 0000d989 /system/lib/libbase.so (android::base::SetAborter(std::__1::function<void (char const)>&&)::$_3::__invoke(char const)+48) runtime.cc:655] native: #08 pc 0000d295 /system/lib/libbase.so (android::base::LogMessage::~LogMessage()+224) runtime.cc:655] native: #09 pc 002965db /apex/com.android.art/lib/libart.so (art::JavaVMExt::JniAbort(char const, char const)+1962) runtime.cc:655] native: #10 pc 002966a5 /apex/com.android.art/lib/libart.so (art::JavaVMExt::JniAbortF(char const, char const, ...)+64) runtime.cc:655] native: #11 pc 004521c1 /apex/com.android.art/lib/libart.so (art::Thread::DecodeJObject(_jobject) const+544) runtime.cc:655] native: #12 pc 0028a6e7 /apex/com.android.art/lib/libart.so (art::(anonymous namespace)::ScopedCheck::CheckInstance(art::ScopedObjectAccess&, art::(anonymous namespace)::ScopedCheck::InstanceKind, _jobject, bool)+82) runtime.cc:655] native: #13 pc 00289779 /apex/com.android.art/lib/libart.so (art::(anonymous namespace)::ScopedCheck::CheckPossibleHeapValue(art::ScopedObjectAccess&, char, art::(anonymous namespace)::JniValueType)+552) runtime.cc:655] native: #14 pc 00288f55 /apex/com.android.art/lib/libart.so (art::(anonymous namespace)::ScopedCheck::Check(art::ScopedObjectAccess&, bool, char const, art::(anonymous namespace)::JniValueType)+592) runtime.cc:655] native: #15 pc 0027cbe7 /apex/com.android.art/lib/libart.so (art::(anonymous namespace)::CheckJNI::GetObjectClass(_JNIEnv, _jobject)+586) runtime.cc:655] native: #16 pc 003412db 
/data/app/~~X6p_4xQWTraApNXlo4SIHA==/dev.firezone.android-zJrN9FN3yhs12tvUNeoOmw==/base.apk!libconnlib.so (offset ec000) (???) runtime.cc:655] at dev.firezone.android.tunnel.ConnlibSession.setDns(Native method) runtime.cc:655] at NetworkMonitor.onLinkPropertiesChanged(NetworkMonitor.kt:28) runtime.cc:655] at android.net.ConnectivityManager$NetworkCallback.onAvailable(ConnectivityManager.java:3328) runtime.cc:655] at android.net.ConnectivityManager$CallbackHandler.handleMessage(ConnectivityManager.java:3607) runtime.cc:655] at android.os.Handler.dispatchMessage(Handler.java:106) runtime.cc:655] at android.os.Looper.loop(Looper.java:223) runtime.cc:655] at android.os.HandlerThread.run(HandlerThread.java:67) ``` --------- Co-authored-by: conectado <gabrielalejandro7@gmail.com>	2024-06-26 19:24:44 +00:00
Gabi	3fa8f04831	chore(connlib): fix test compilation without proptest flag (#5561 ) Fixes plain `cargo test`	2024-06-26 11:29:55 +00:00
Gabi	98aa902374	chore(connlib): only refresh DNS for connections that are in use (#5555 ) With the current behavior, once a connection stops being used it triggers a DNS refresh every 30 seconds, forever. This can be bad for a gateway that could be handling thousands of domain names. Previously this was prevented by only setting `slated_for_refresh` when we saw the first packet; that approach was dropped in favor of checking times in `handle_timeout`. So the solution now is to check that the connection is currently in use before triggering any DNS refresh.	2024-06-26 01:10:58 +00:00
Thomas Eizinger	6c842de83c	refactor(connlib): don't re-initialise `Tun` on config updates (#5392 ) Currently, connlib re-initialises the TUN device on Linux every time its configuration gets updated such as when roaming from one network to another. This is unnecessary. Instead, we can adopt the same approach as already used on MacOS, iOS and Windows and only initialise it if it doesn't exist yet. Doing so surfaces an interesting bug. Currently, attempting to re-initialise the TUN device fails with a warning: > connlib_client_shared::eventloop: Failed to set interface on tunnel: Resource busy (os error 16) See https://github.com/firezone/firezone/actions/runs/9656570163/job/26634409346#step:7:103 for an example. As a consequence, we never actually trigger the `on_set_interface_config` callback and thus never actually set the new IPs on the TUN device. Now that we _are_ calling this callback, we execute `TunDeviceManager::set_ips` which first clears all IPs from the device and then attaches the new ones. A consequence of this is that the Linux kernel will clear all routes associated with the device. This clashes with an optimisation we have in `TunDeviceManager` where we remember the previously set routes and don't set new ones if they are the same. This `HashSet` needs to be cleared upon setting new IPs in order to actually set the new routes correctly afterwards. Without that, we stop receiving traffic on the TUN device.	2024-06-25 22:30:31 +00:00
Thomas Eizinger	4ffc49eef9	fix(snownet): ensure failed refresh requests invalidate allocation (#5538 ) Whilst we had a unit-test for this behaviour, it was written poorly and didn't assert on the correct thing. Instead, it happened to pass because we advanced time far enough to trigger the actual expiry of the allocation instead of directly expiring it upon the last failed retry of the refresh request. Re-writing this test then surfaced that we were in fact not invalidating the allocation correctly. In real-time, this represents a difference of 5 minutes within which a client may try to use a relay candidate that is in fact no longer working. Related: #5519.	2024-06-25 20:19:56 +00:00
Thomas Eizinger	409039afde	chore(connlib): improve error messages in `TunDeviceManager` (#5530 )	2024-06-25 14:09:48 +00:00
Thomas Eizinger	bd989d4416	chore(connlib): improve logging for `set_routes` on Linux (#5529 ) Logging the routes in the span and in an event creates duplicate information so we remove the former. Additionally, we add a debug log in case we short-circuit the function.	2024-06-25 14:09:06 +00:00
Thomas Eizinger	9e47fa11fb	chore(snownet): log upon attempt to delete unknown relay (#5532 )	2024-06-25 04:27:52 +00:00
Thomas Eizinger	eec615eddb	refactor(connlib): drop all connections when roaming (#5308 ) Currently, `snownet` tries to be very clever in how it roams connections. This is/was necessary because we associated DNS-specific state with a connection. More specifically, the assigned proxy IPs for a DNS resource are stored as part of a connection with the gateway. As a result, DNS resources would always break if the underlying connection in `snownet` failed. This is quite error-prone and means `snownet` must be very careful to never fail a connection erroneously. With #5049, we no longer store any important state with a connection and thus can implement roaming in a much simpler way: Drop all connections and let the incoming packets create new ones. This is much more robust as we don't have to "patch" existing state in `snownet` as part of roaming. We test this new functionality by adding a `RoamClient` transition to `tunnel_test`. This ensures roaming works in a lot of scenarios, including relayed and non-relayed situations as well as roaming between either of them. As a result, we can delete several of the more specific test cases of `snownet`. Depends-On: #5049. Replaces: #5060. Resolves: #5080.	2024-06-25 03:53:00 +00:00
Thomas Eizinger	6abf5be58a	chore(connlib): set mangled DNS query log to trace (#5526 ) Anything that happens on a per-packet level should be logged at `trace` level to avoid spamming the logs. Whilst queries to DNS servers that are CIDR resources aren't necessarily _every_ packet, in certain configurations it is still common enough that logging it on debug is too much noise.	2024-06-25 03:52:36 +00:00
Thomas Eizinger	dfe52766d2	chore(snownet): add INFO log for removing relay (#5528 )	2024-06-25 03:36:06 +00:00
Thomas Eizinger	eec0652abe	chore(connlib): shrink "packet not allowed" log (#5476 ) All allowed IPs can be a fair few which clutters the log. Remove the `HashSet` from the error and also remove the stuttering; the error already says "Packet not allowed".	2024-06-25 01:16:29 +00:00
Thomas Eizinger	96b32481db	chore(gateway): emit warn on dead but used IPs (#5482 ) As part of our NAT table, we keep track of the last time a resolved IP sent us traffic. This is primarily used to detect and correct changes in the DNS record. If we keep getting traffic for a proxy IP but the resolved IP doesn't respond for more than 30s, we re-query the corresponding domain name. We can also use this to detect and warn the administrator of entirely dead but used IPs. A dead-but-used IP is one that has never sent us any traffic, yet we are actively trying to contact it. For example, if the environment uses DNS64 but is missing a NAT64 gateway, DNS queries for IPv4-only resources will give us synthesized IPv6 addresses from the `64:ff9b::/96` subnet, but without a NAT64 gateway those will never work. Whilst this log isn't specific to issues around DNS64 and NAT64, emitting a warning that a resolved IP does not work at all should point the administrator in the right direction whilst debugging this issue.	2024-06-25 00:46:59 +00:00
Thomas Eizinger	72e726f9bd	chore(connlib): emit INFO logs for resource changes (#5473 ) When operating just the headless client, it is currently impossible to know when resources become active / inactive. To fix this, we add INFO logs every time we activate or deactivate a resource. This should also prove useful when debugging issues with customers because we now have a timestamped record of what resources were active at that time. --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io> Co-authored-by: Jamil <jamilbk@users.noreply.github.com>	2024-06-25 00:44:47 +00:00
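
Note on 89bb7c2c5d (`setDns` crash on 32-bit Android): the fix described there, receiving the handle as `jlong` and converting it to a pointer only inside the function body, can be illustrated with a minimal sketch. This is not the actual connlib code. `JLong` is a local stand-in for `jni::sys::jlong`, and `SessionWrapper`, `into_handle`, `set_dns` and `release` are hypothetical, simplified names chosen for illustration.

```rust
/// Stand-in for `jni::sys::jlong` (a 64-bit signed integer on every ABI).
pub type JLong = i64;

/// Hypothetical, simplified stand-in for the session object behind the handle.
pub struct SessionWrapper {
    pub dns_servers: Vec<String>,
}

/// Move the session to the heap and expose it to the JVM side as a 64-bit handle.
pub fn into_handle(session: SessionWrapper) -> JLong {
    Box::into_raw(Box::new(session)) as usize as JLong
}

/// Take the handle as a 64-bit integer, matching the `Long` passed from Kotlin,
/// and only turn it into a pointer inside the body. Declaring the parameter as a
/// pointer instead would make it 32 bits wide on 32-bit targets and misalign the
/// arguments that follow it.
///
/// # Safety
/// `handle` must come from `into_handle` and must not have been released.
pub unsafe fn set_dns(handle: JLong, dns_list: Vec<String>) -> usize {
    let session = unsafe { &mut *(handle as usize as *mut SessionWrapper) };
    session.dns_servers = dns_list;
    session.dns_servers.len()
}

/// Reclaim ownership of the session so that it is dropped normally.
///
/// # Safety
/// `handle` must come from `into_handle` and must be released only once.
pub unsafe fn release(handle: JLong) -> SessionWrapper {
    unsafe { *Box::from_raw(handle as usize as *mut SessionWrapper) }
}
```

The round trip through `usize` keeps the conversion well-defined on both 32-bit and 64-bit targets: the handle is always 64 bits wide at the function boundary, whatever the native pointer width.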

611 Commits