With the addition of the Firezone Control Protocol, we are now issuing a
lot more DNS queries on the Gateway. Specifically, every DNS query for a
DNS resource name triggers a corresponding DNS query on the Gateway.
This ensures that changes to DNS entries for resources are picked up
without having to build any sort of "stale detection" into the Gateway
itself. As a result though, a Gateway has to issue a lot of DNS queries
to upstream resolvers which, in 99% or more of cases, will return the
same result.
To reduce the load on these upstream resolvers, we cache successful
results of DNS queries for 5 minutes.
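A minimal sketch of such a cache, assuming a plain `HashMap` keyed by
name with an insertion timestamp (the Gateway's actual data structures
may differ):

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::time::{Duration, Instant};

const TTL: Duration = Duration::from_secs(5 * 60);

struct DnsCache {
    entries: HashMap<String, (Instant, Vec<IpAddr>)>,
}

impl DnsCache {
    /// Returns the cached IPs if the entry is younger than 5 minutes.
    fn get(&self, name: &str, now: Instant) -> Option<&[IpAddr]> {
        let (inserted_at, ips) = self.entries.get(name)?;

        (now.duration_since(*inserted_at) < TTL).then_some(ips.as_slice())
    }

    /// Only successful results are cached; errors are always retried.
    fn insert(&mut self, name: String, ips: Vec<IpAddr>, now: Instant) {
        self.entries.insert(name, (now, ips));
    }
}
```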
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
On Linux, logs sent to stdout from a systemd service are automatically
captured by `journald`. This is where most admins expect logs to be and
frankly, doing any kind of debugging of Firezone is much easier if you
can run `journalctl -efu firezone-client-ipc.service` in a terminal and
check what the IPC service is doing.
On Windows, stdout from a service is (unfortunately) ignored.
To get logs where admins expect them on each platform and to also allow
dynamically changing the log-filter, I had to introduce a (long-overdue)
abstraction over tracing's "reload" layer that allows us to combine
multiple reload-handles into one. Unfortunately, neither the
`reload::Layer` nor the `reload::Handle` implement `Clone`, which makes
this unnecessarily difficult.
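A minimal sketch of one way to combine handles despite the missing
`Clone` impls: move each concrete handle into a boxed closure (names
here are hypothetical, not the actual abstraction):

```rust
use tracing_subscriber::{reload, EnvFilter};

type ReloadFn = Box<dyn Fn(&str) -> Result<(), reload::Error> + Send + Sync>;

struct CombinedReloadHandle {
    handles: Vec<ReloadFn>,
}

impl CombinedReloadHandle {
    /// Applies the new log-filter directives to every registered layer.
    fn reload(&self, directives: &str) -> Result<(), reload::Error> {
        for handle in &self.handles {
            handle(directives)?;
        }

        Ok(())
    }
}

// Registering a concrete handle moves it into the closure, which
// side-steps the missing `Clone` impl:
//
// let (layer, handle) = reload::Layer::new(EnvFilter::new("info"));
// combined.handles.push(Box::new(move |d| handle.reload(EnvFilter::new(d))));
```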
Related: #8173
The Gateway keeps some state for each client connection. Part of this
state is a set of filters which can be controlled via the Firezone
portal. Even if no filters are set in the portal, the Gateway uses this
data structure to ensure only packets to allowed resources are
forwarded. If a resource is not allowed, its IP won't exist in the
`IpNetworkTable` of filters and packets to it thus won't be allowed.
When a Client disconnects, the Gateway cleans up this data structure
and with it, all filters are gone. As soon as a Client reconnects,
default filters are installed (which don't allow anything) under the
same IP (the portal always assigns the same IP to Clients).
These filters are only applied on _outbound_ traffic (i.e. from the
Client towards Resources). As a result, packets arriving from Resources
to a Client will still be routed back, causing "Source not allowed"
errors on the client (which has lost all of its state when restarting).
To fix this, we apply the Gateway's filters also on the reverse path of
packets from Resources to Clients.
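A sketch of the direction of that lookup, assuming the
`ip_network_table` crate that the `IpNetworkTable` type comes from (the
Gateway's actual filter structures are richer than this):

```rust
use std::net::IpAddr;

use ip_network_table::IpNetworkTable;

struct FilterRules; // per-resource rules, elided here

struct Filters {
    allowed: IpNetworkTable<FilterRules>,
}

impl Filters {
    /// Outbound packets are checked against their *destination*;
    /// with this change, packets on the reverse path are also checked
    /// against their *source* (the Resource's IP).
    fn is_allowed(&self, resource_ip: IpAddr) -> bool {
        self.allowed.longest_match(resource_ip).is_some()
    }
}
```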
Resolves: #5568
Resolves: #7521
Resolves: #6091
This is pretty confusing when reading logs. For inbound packets, we
assume that if we don't have a NAT session, they belong to the Internet
Resource or a CIDR resource, meaning this log shows up for all packets
for those resources and even for packets that don't belong to any
resource at all.
The WebSocket connection to the portal from within the Clients, Gateways
and Relays may be temporarily interrupted by IO errors. In such cases we
simply reconnect to it. This isn't as much of a problem for Clients and
Gateways. For Relays however, a disconnect can be disruptive for
customers because the portal will send `relays_presence` events to all
Clients and Gateways. Any relayed connection will therefore be
interrupted. See #8177.
Relays run on our own infrastructure and we want to be notified if their
connection flaps.
In order to differentiate between these scenarios, we remove the logging
from within `phoenix-channel` and report these connection hiccups one
layer up. This allows Clients and Gateways to log them on DEBUG whereas
the Relay can log them on WARN.
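Sketched, the layering could look like this (the event shape is an
assumption, not `phoenix-channel`'s actual API):

```rust
use std::time::Duration;

/// Surfaced by `phoenix-channel` instead of being logged internally.
enum Event {
    Hiccup {
        backoff: Duration,
        error: std::io::Error,
    },
}

/// Each component picks its own severity for the same event.
fn on_event_in_relay(event: &Event) {
    match event {
        Event::Hiccup { backoff, error } => {
            tracing::warn!(?backoff, %error, "Connection to portal interrupted")
        }
    }
}

fn on_event_in_client(event: &Event) {
    match event {
        Event::Hiccup { backoff, error } => {
            tracing::debug!(?backoff, %error, "Connection to portal interrupted")
        }
    }
}
```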
Related: #8177
Related: #7004
Now that we have error reporting via Sentry in Swift-land as well, we
can handle errors in the FFI layer more gracefully and return them to
Swift.
---------
Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
Whilst ICE for a connection is in progress, it might happen that packets
for a particular client are arriving at the Gateway's TUN device. I
assume that these might be from a previous session?
We can only negotiate a WireGuard session once we have a nominated
socket. Thus, the very first packet sent on a session will always
trigger a new handshake. We don't want Gateways to start handshakes
though; those should always be initiated by the Clients.
To avoid this, we add a conditional to `snownet::Node` that drops
packets iff the current node is a `ServerNode` and we haven't nominated
a socket yet.
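Distilled, the added guard looks roughly like this (field and method
names are hypothetical):

```rust
use std::net::SocketAddr;

struct Node {
    /// `true` for a `ServerNode` (Gateway), `false` for a `ClientNode`.
    is_server: bool,
}

struct Connection {
    /// Only set once ICE has nominated a candidate pair.
    remote_socket: Option<SocketAddr>,
}

impl Node {
    /// Packets that would trigger a handshake are dropped on the
    /// server side until the Client's ICE nomination completes.
    fn should_drop_pre_nomination(&self, conn: &Connection) -> bool {
        self.is_server && conn.remote_socket.is_none()
    }
}
```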
The following log output from a Gateway motivated this change:
```
2025-02-17T15:36:45.372Z INFO snownet::node: Connection failed (ICE timeout) cid=8b106344-ba59-4050-8f9a-e2f0bab6e9e5
// Here the previous connection failed.
2025-02-17T15:36:45.989Z DEBUG firezone_tunnel::gateway: Unknown client, perhaps already disconnected? dst=100.64.69.110
2025-02-17T15:36:45.989Z DEBUG firezone_tunnel::gateway: Unknown client, perhaps already disconnected? dst=100.64.69.110
2025-02-17T15:36:45.989Z DEBUG firezone_tunnel::gateway: Unknown client, perhaps already disconnected? dst=100.64.69.110
2025-02-17T15:36:46.213Z DEBUG firezone_tunnel::gateway: Unknown client, perhaps already disconnected? dst=100.64.69.110
// Until here, packets for this client got dropped but now a new connection (for the same IP!) is being created.
2025-02-17T15:36:46.474Z DEBUG snownet::node: Sampled relay rid=b7198983-0cf6-48ba-a459-e7d27ef7d6c9 client_id=8b106344-ba59-4050-8f9a-e2f0bab6e9e5 cid=8b106344-ba59-4050-8f9a-e2f0bab6e9e5
2025-02-17T15:36:46.474Z INFO str0m::ice_::agent: Set local credentials: IceCreds { ufrag: "ipcg", pass: "eyy6s27emu2joisw7aqc7q" } client_id=8b106344-ba59-4050-8f9a-e2f0bab6e9e5 cid=8b106344-ba59-4050-8f9a-e2f0bab6e9e5
2025-02-17T15:36:46.474Z INFO str0m::ice_::agent: Set remote credentials: IceCreds { ufrag: "up5k", pass: "4q6uvhawhcbnhbqrddvy5x" } client_id=8b106344-ba59-4050-8f9a-e2f0bab6e9e5 cid=8b106344-ba59-4050-8f9a-e2f0bab6e9e5
2025-02-17T15:36:46.474Z INFO str0m::ice_::agent: Add local candidate: Candidate(host=10.0.0.4:38621/udp prio=2130706175) client_id=8b106344-ba59-4050-8f9a-e2f0bab6e9e5 cid=8b106344-ba59-4050-8f9a-e2f0bab6e9e5
2025-02-17T15:36:46.474Z INFO str0m::ice_::agent: Add local candidate: Candidate(relay=34.16.221.134:62250/udp prio=37748479) client_id=8b106344-ba59-4050-8f9a-e2f0bab6e9e5 cid=8b106344-ba59-4050-8f9a-e2f0bab6e9e5
2025-02-17T15:36:46.474Z INFO str0m::ice_::agent: Add local candidate: Candidate(relay=[2600:1900:4180:ee3:0:78::]:62250/udp prio=37748735) client_id=8b106344-ba59-4050-8f9a-e2f0bab6e9e5 cid=8b106344-ba59-4050-8f9a-e2f0bab6e9e5
2025-02-17T15:36:46.474Z INFO str0m::ice_::agent: State change (new connection): New -> Checking client_id=8b106344-ba59-4050-8f9a-e2f0bab6e9e5 cid=8b106344-ba59-4050-8f9a-e2f0bab6e9e5
2025-02-17T15:36:46.474Z INFO snownet::node: Created new connection client_id=8b106344-ba59-4050-8f9a-e2f0bab6e9e5 cid=8b106344-ba59-4050-8f9a-e2f0bab6e9e5
2025-02-17T15:36:46.475Z INFO firezone_tunnel::peer: Allowing access to resource client=8b106344-ba59-4050-8f9a-e2f0bab6e9e5 resource=dca3fcc6-b5e0-470a-bc7b-6446cdd03bb3 expires=Some("2025-02-24T15:09:11+00:00") client_id=8b106344-ba59-4050-8f9a-e2f0bab6e9e5
// The connection has been created and very likely another packet has arrived at the TUN interface. This time though, we have an entry in our connection map for this IP and try to route it.
2025-02-17T15:36:46.546Z DEBUG boringtun::noise: Sending handshake_initiation cid=8b106344-ba59-4050-8f9a-e2f0bab6e9e5
2025-02-17T15:36:46.546Z DEBUG snownet::node: ICE is still in progress, buffering WG handshake num_buffered=1 cid=8b106344-ba59-4050-8f9a-e2f0bab6e9e5
// We buffered the handshake packet. This is only meant to be done by clients.
2025-02-17T15:36:46.572Z INFO str0m::ice_::agent: Created peer reflexive remote candidate from STUN request: Candidate(prflx=107.197.104.68:49376/udp prio=1862270719) cid=8b106344-ba59-4050-8f9a-e2f0bab6e9e5
2025-02-17T15:36:46.572Z DEBUG str0m::ice_::agent: Created new pair for STUN request: CandidatePair(1-0 prio=162128486503284223 state=Waiting attempts=0 unanswered=0 remote=0 last=None nom=None) cid=8b106344-ba59-4050-8f9a-e2f0bab6e9e5
2025-02-17T15:36:46.574Z INFO str0m::ice_::agent: Created peer reflexive remote candidate from STUN request: Candidate(prflx=[2600:1700:3ecb:2410:7499:175a:5c9:9bc5]:57622/udp prio=1862270975) cid=8b106344-ba59-4050-8f9a-e2f0bab6e9e5
2025-02-17T15:36:46.574Z DEBUG str0m::ice_::agent: Created new pair for STUN request: CandidatePair(2-1 prio=162129586014912511 state=Waiting attempts=0 unanswered=0 remote=0 last=None nom=None) cid=8b106344-ba59-4050-8f9a-e2f0bab6e9e5
2025-02-17T15:36:46.611Z DEBUG str0m::ice_::pair: Nominated pair: CandidatePair(2-1 prio=162129586014912511 state=Succeeded attempts=1 unanswered=0 remote=2 last=Some(Instant { tv_sec: 286264, tv_nsec: 840170135 }) nom=Nominated) cid=8b106344-ba59-4050-8f9a-e2f0bab6e9e5
2025-02-17T15:36:46.612Z INFO str0m::ice_::agent: State change (got nomination, still trying others): Checking -> Connected cid=8b106344-ba59-4050-8f9a-e2f0bab6e9e5
2025-02-17T15:36:46.612Z DEBUG snownet::node: Flushing packets buffered during ICE num_buffered=1 cid=8b106344-ba59-4050-8f9a-e2f0bab6e9e5
2025-02-17T15:36:46.612Z INFO snownet::node: Updating remote socket old=None new=Relay { relay: b7198983-0cf6-48ba-a459-e7d27ef7d6c9, dest: [2600:1700:3ecb:2410:7499:175a:5c9:9bc5]:57622 } duration_since_intent=137.48517ms cid=8b106344-ba59-4050-8f9a-e2f0bab6e9e5
// The connection has been established and we receive the (forced) handshake initiation by the client. However, we also flushed a handshake initiation.
2025-02-17T15:36:46.612Z DEBUG boringtun::noise: Received handshake_initiation remote_idx=731337473 cid=8b106344-ba59-4050-8f9a-e2f0bab6e9e5
2025-02-17T15:36:46.613Z DEBUG boringtun::noise: Sending handshake_response local_idx=185230594 cid=8b106344-ba59-4050-8f9a-e2f0bab6e9e5
2025-02-17T15:36:46.613Z DEBUG boringtun::noise: Sending handshake_initiation cid=8b106344-ba59-4050-8f9a-e2f0bab6e9e5
2025-02-17T15:36:46.629Z DEBUG snownet::node: Unknown connection or socket has already been nominated ignored_candidate=candidate:fffeff021b36b51d6f7abdc3 1 udp 50331391 34.94.63.38 55487 typ relay cid=8b106344-ba59-4050-8f9a-e2f0bab6e9e5
2025-02-17T15:36:46.629Z DEBUG snownet::node: Unknown connection or socket has already been nominated ignored_candidate=candidate:fffeff64a52b02479dab9c4 1 udp 1694498559 107.197.104.68 49376 typ srflx cid=8b106344-ba59-4050-8f9a-e2f0bab6e9e5
2025-02-17T15:36:46.629Z DEBUG snownet::node: Unknown connection or socket has already been nominated ignored_candidate=candidate:fffeff7ec9b7a7db40ec1c44 1 udp 2130706175 192.168.1.150 49376 typ host cid=8b106344-ba59-4050-8f9a-e2f0bab6e9e5
2025-02-17T15:36:46.630Z DEBUG snownet::node: Unknown connection or socket has already been nominated ignored_candidate=candidate:ffffff026d81f5c8a4d5600e 1 udp 50331647 2600:1900:4120:521c:0:78:: 55487 typ relay cid=8b106344-ba59-4050-8f9a-e2f0bab6e9e5
2025-02-17T15:36:46.630Z DEBUG snownet::node: Unknown connection or socket has already been nominated ignored_candidate=candidate:ffffff64e2c91c4ff6f343f5 1 udp 1694498815 2600:1700:3ecb:2410:7499:175a:5c9:9bc5 57622 typ srflx cid=8b106344-ba59-4050-8f9a-e2f0bab6e9e5
2025-02-17T15:36:46.630Z DEBUG snownet::node: Unknown connection or socket has already been nominated ignored_candidate=candidate:ffffff7ed64262b110d1f279 1 udp 2130706431 2600:1700:3ecb:2410:7499:175a:5c9:9bc5 57622 typ host cid=8b106344-ba59-4050-8f9a-e2f0bab6e9e5
// We are receiving a response for our handshake initiation. Let the fight begin!
2025-02-17T15:36:46.651Z DEBUG boringtun::noise: Received handshake_response local_idx=185230593 remote_idx=731337474 cid=8b106344-ba59-4050-8f9a-e2f0bab6e9e5
2025-02-17T15:36:46.651Z DEBUG firezone_gateway::eventloop: Tunnel error: Failed to decapsulate: Failed to decapsulate: UnexpectedPacket
2025-02-17T15:36:46.651Z DEBUG boringtun::noise: Received handshake_initiation remote_idx=731337475 cid=8b106344-ba59-4050-8f9a-e2f0bab6e9e5
2025-02-17T15:36:46.652Z DEBUG boringtun::noise: Sending handshake_response local_idx=185230596 cid=8b106344-ba59-4050-8f9a-e2f0bab6e9e5
2025-02-17T15:36:46.652Z DEBUG boringtun::noise: Sending handshake_initiation cid=8b106344-ba59-4050-8f9a-e2f0bab6e9e5
2025-02-17T15:36:46.652Z DEBUG boringtun::noise: Received handshake_response local_idx=185230595 remote_idx=731337476 cid=8b106344-ba59-4050-8f9a-e2f0bab6e9e5
2025-02-17T15:36:46.652Z DEBUG firezone_gateway::eventloop: Tunnel error: Failed to decapsulate: Failed to decapsulate: UnexpectedPacket
2025-02-17T15:36:46.652Z DEBUG boringtun::noise: Received handshake_initiation remote_idx=731337477 cid=8b106344-ba59-4050-8f9a-e2f0bab6e9e5
2025-02-17T15:36:46.653Z DEBUG boringtun::noise: Sending handshake_response local_idx=185230598 cid=8b106344-ba59-4050-8f9a-e2f0bab6e9e5
2025-02-17T15:36:46.653Z DEBUG boringtun::noise: Sending handshake_initiation cid=8b106344-ba59-4050-8f9a-e2f0bab6e9e5
2025-02-17T15:36:46.691Z DEBUG boringtun::noise: Received handshake_response local_idx=185230597 remote_idx=731337478 cid=8b106344-ba59-4050-8f9a-e2f0bab6e9e5
2025-02-17T15:36:46.691Z DEBUG firezone_gateway::eventloop: Tunnel error: Failed to decapsulate: Failed to decapsulate: UnexpectedPacket
2025-02-17T15:36:46.691Z DEBUG boringtun::noise: Received handshake_initiation remote_idx=731337479 cid=8b106344-ba59-4050-8f9a-e2f0bab6e9e5
2025-02-17T15:36:46.692Z DEBUG boringtun::noise: Sending handshake_response local_idx=185230600 cid=8b106344-ba59-4050-8f9a-e2f0bab6e9e5
2025-02-17T15:36:46.692Z INFO snownet::node: Completed wireguard handshake cid=8b106344-ba59-4050-8f9a-e2f0bab6e9e5 duration_since_intent=217.247362ms
2025-02-17T15:36:46.692Z DEBUG firezone_gateway::eventloop: Tunnel error: Failed to decapsulate: Failed to decapsulate: NoCurrentSession
2025-02-17T15:36:46.692Z DEBUG firezone_gateway::eventloop: Tunnel error: Failed to decapsulate: Failed to decapsulate: NoCurrentSession
2025-02-17T15:36:46.692Z DEBUG firezone_gateway::eventloop: Tunnel error: Failed to decapsulate: Failed to decapsulate: NoCurrentSession
2025-02-17T15:36:46.692Z DEBUG firezone_gateway::eventloop: Tunnel error: Failed to decapsulate: Failed to decapsulate: NoCurrentSession
2025-02-17T15:36:46.708Z DEBUG firezone_gateway::eventloop: Tunnel error: Failed to decapsulate: Failed to decapsulate: NoCurrentSession
2025-02-17T15:36:46.731Z DEBUG boringtun::noise: New session session=185230600 cid=8b106344-ba59-4050-8f9a-e2f0bab6e9e5
```
As you can see, with both parties initiating handshakes, they end up
fighting over who should initiate the session.
When `snownet` establishes a connection to another peer, we may end up
in one of four different connection types:
- `PeerToPeer`
- `PeerToRelay`
- `RelayToPeer`
- `RelayToRelay`
From the perspective of the local node, it only matters whether or not
we are sending data from our local socket or a relay's socket because in
the latter case, we have to encapsulate it in a channel data message.
Hence, at present, we often see logs that say "Direct" but really, we
are talking to a port allocated by the remote on a relay.
We know whether or not the remote candidate is a relay by looking at the
candidates they sent us.
To make our logs more descriptive, we now model out all 4 possibilities
here.
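To illustrate, the classification boils down to a two-bit decision
(enum and function names are a sketch, not necessarily the shipped
code):

```rust
#[derive(Debug)]
enum ConnectionType {
    PeerToPeer,
    PeerToRelay,
    RelayToPeer,
    RelayToRelay,
}

/// `local_is_relay`: are we sending from a relay allocation?
/// `remote_is_relay`: is the nominated remote candidate a relay candidate?
fn classify(local_is_relay: bool, remote_is_relay: bool) -> ConnectionType {
    match (local_is_relay, remote_is_relay) {
        (false, false) => ConnectionType::PeerToPeer,
        (false, true) => ConnectionType::PeerToRelay,
        (true, false) => ConnectionType::RelayToPeer,
        (true, true) => ConnectionType::RelayToRelay,
    }
}
```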
When we detect a timed-out request to a relay, we print the duration we
were waiting for. Currently, this is offset by one "backoff tick"
because we advance the backoff too early.
Here is a log-output of a test prior to the change:
```
snownet::allocation: Sending BINDING requests to pick active socket relay_socket=V4(127.0.0.1:3478)
handle_timeout{active_socket=None}: snownet::allocation: Request timed out after 1.5s, re-sending id=TransactionId(0x0BFA13E983FEF36EE4877719) method=binding dst=127.0.0.1:3478
handle_timeout{active_socket=None}: snownet::allocation: Request timed out after 2.25s, re-sending id=TransactionId(0x0BFA13E983FEF36EE4877719) method=binding dst=127.0.0.1:3478
handle_timeout{active_socket=None}: snownet::allocation: Request timed out after 3.375s, re-sending id=TransactionId(0x0BFA13E983FEF36EE4877719) method=binding dst=127.0.0.1:3478
handle_timeout{active_socket=None}: snownet::allocation: Request timed out after 3.375s, re-sending id=TransactionId(0x0BFA13E983FEF36EE4877719) method=binding dst=127.0.0.1:3478
handle_timeout{active_socket=None}: snownet::allocation: Backoff expired, giving up id=TransactionId(0x0BFA13E983FEF36EE4877719) method=binding dst=127.0.0.1:3478
```
and with this change:
```
snownet::allocation: Sending BINDING requests to pick active socket relay_socket=V4(127.0.0.1:3478)
handle_timeout{active_socket=None}: snownet::allocation: Request timed out after 1s, re-sending id=TransactionId(0x6C79DD3607DF96806C4A7D8C) method=binding dst=127.0.0.1:3478
handle_timeout{active_socket=None}: snownet::allocation: Request timed out after 1.5s, re-sending id=TransactionId(0x6C79DD3607DF96806C4A7D8C) method=binding dst=127.0.0.1:3478
handle_timeout{active_socket=None}: snownet::allocation: Request timed out after 2.25s, re-sending id=TransactionId(0x6C79DD3607DF96806C4A7D8C) method=binding dst=127.0.0.1:3478
handle_timeout{active_socket=None}: snownet::allocation: Request timed out after 3.375s, re-sending id=TransactionId(0x6C79DD3607DF96806C4A7D8C) method=binding dst=127.0.0.1:3478
handle_timeout{active_socket=None}: snownet::allocation: Backoff expired, giving up id=TransactionId(0x6C79DD3607DF96806C4A7D8C) method=binding dst=127.0.0.1:3478
```
There is no functional difference; we were just logging the wrong
duration.
To ensure that our test suite represents production as closely as
possible, we introduce a dedicated "Internet" site into the
`StubPortal` that only hosts the Internet resource. All other created
resources are assigned to other sites.
Every time we start a new session, our telemetry context potentially
changes, i.e. the user may sign into a new account. This should ensure
that both the IPC service and the GUI always use the most up-to-date
`account_slug` as part of Sentry events. In addition, this will also set
the `account_slug` for clients that just signed in. Previously, the
`account_slug` would only get populated on the next start of the client.
When we receive an inbound packet from the TUN device on the Gateway,
we perform a lookup in the NAT table to see if it needs to be
translated back to a DNS proxy IP.
At present, the non-existence of such a NAT entry results in the packet
being sent on entirely unmodified because that is what needs to happen
for CIDR resources. Whilst that is important, the same code path is
currently being executed for DNS resources whose NAT session has
expired! Those packets should be dropped instead, which is what we do
with this PR.
To differentiate between not having a NAT session at all and a previous
one having expired, we keep around all previous "outside" tuples of NAT
sessions. Their memory footprint is very small. The entire NAT table is
scoped to the connection with the given peer and will thus eventually
be freed once the peer disconnects. This allows us to reliably and
cheaply detect whether a packet is using an expired NAT session. The
check must be cheap because all traffic for CIDR resources and the
Internet resource performs it in order to establish that it doesn't
have to be translated.
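A sketch of that bookkeeping (shapes and names are assumptions):

```rust
use std::collections::{HashMap, HashSet};
use std::net::SocketAddr;

struct Session; // translation state, elided

enum Verdict {
    /// An active session exists: translate back to the DNS proxy IP.
    Translate,
    /// No session ever existed: CIDR / Internet resource traffic,
    /// forward unmodified.
    PassThrough,
    /// A session existed but expired: drop instead of mis-routing.
    Drop,
}

/// Scoped to a single peer connection, freed on disconnect.
struct NatTable {
    active: HashMap<(SocketAddr, SocketAddr), Session>,
    expired_outside: HashSet<(SocketAddr, SocketAddr)>, // tuples only, cheap
}

impl NatTable {
    fn classify_inbound(&self, outside: (SocketAddr, SocketAddr)) -> Verdict {
        if self.active.contains_key(&outside) {
            Verdict::Translate
        } else if self.expired_outside.contains(&outside) {
            Verdict::Drop
        } else {
            Verdict::PassThrough
        }
    }
}
```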
This might be the source of some of the "Source not allowed" errors we
have been seeing in client logs.
At present, `connlib` communicates with its host app via callbacks.
These callbacks are executed synchronously as part of `connlib`'s
event-loop, meaning `connlib` cannot do anything else whilst the
callback is executing in the host app. Additionally, this callback runs
within the `Future` that represents `connlib` and thus runs on a `tokio`
worker thread.
Attempting to interact with the session from within the callback can
lead to panics, for example when `Session::disconnect` is called which
uses `Runtime::block_on`. This isn't allowed by `tokio`: You cannot
block on the execution of an async task from within one of the worker
threads.
To solve both of these problems, we introduce a thread-pool of size 1
that is responsible for executing `connlib` callbacks. Not only does
this allow `connlib` to perform more work, such as routing packets or
processing portal messages, it also means that it is not possible for the
host app to cause these panics within the `tokio` runtime because the
callbacks run on a different thread.
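In essence (using only std primitives; the actual implementation may
differ):

```rust
use std::sync::mpsc;
use std::thread;

type Callback = Box<dyn FnOnce() + Send>;

/// A "thread-pool" of size 1 that runs host-app callbacks off
/// `tokio`'s worker threads.
fn spawn_callback_thread() -> mpsc::Sender<Callback> {
    let (tx, rx) = mpsc::channel::<Callback>();

    thread::spawn(move || {
        for callback in rx {
            // Blocking here (e.g. `Session::disconnect` calling
            // `Runtime::block_on`) is fine: this is a plain thread,
            // not a `tokio` worker.
            callback();
        }
    });

    tx
}
```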
We appear to have caused a pretty big performance regression (~40%) in
037a2e64b6 (identified through
`git-bisect`). Specifically, the regression appears to have been caused
by [`aef411a`
(#7605)](aef411abf5).
Weirdly enough, undoing just that on top of `main` doesn't fix the
regression.
My hypothesis is that using the same file descriptor for read AND write
interests on the same runtime causes issues because those interests are
occasionally cleared (i.e. on false-positive wake-ups).
In this PR, we spawn a dedicated thread each for the sending and
receiving operations of the TUN device. On Unix-based systems, a TUN
device is just a file descriptor and can therefore simply be copied and
read & written to from different threads. Most importantly, we only
construct the `AsyncFd` _within_ the newly spawned thread and runtime
because constructing an `AsyncFd` implicitly registers with the runtime
active on the current thread.
As a nice benefit, this allows us to get rid of a `future::select`.
Those are always kind of nasty because they cancel the future that
wasn't ready. My original intuition was that we drop packets due to
cancelled futures there but that could not be confirmed in experiments.
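A sketch of the receive side under these assumptions (one thread, one
current-thread runtime, `AsyncFd` constructed inside it):

```rust
use std::os::fd::OwnedFd;

fn spawn_tun_recv_thread(fd: OwnedFd) -> std::io::Result<()> {
    std::thread::Builder::new()
        .name("tun-recv".to_owned())
        .spawn(move || {
            let rt = tokio::runtime::Builder::new_current_thread()
                .enable_io()
                .build()
                .expect("failed to build runtime");

            rt.block_on(async move {
                // Constructed *inside* the thread: `AsyncFd::new`
                // registers with the runtime active right here.
                let fd = tokio::io::unix::AsyncFd::new(fd)
                    .expect("failed to register TUN fd");

                loop {
                    let mut guard =
                        fd.readable().await.expect("failed to poll TUN fd");

                    // ... read packets, forward them over a channel ...

                    guard.clear_ready();
                }
            });
        })?;

    Ok(())
}
```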
Developing on Windows is much easier if all Rust code compiles without
errors or warnings because you can "trust" your IDE that your code is
error-free if it says "0 errors; 0 warnings". We are not far off from
achieving this!
Apart from the "graceful termination" feature in the relay, both the
relay and gateway should actually also work on Windows just fine, thanks
to the platform-agnostic abstractions we have been building up for the
GUI and headless client.
As it turns out, the effort in #7104 was not a good idea. By logging
errors as values, most of our Sentry reports have the same title and
thus cannot be differentiated from within the overview at all. To fix
this, we stringify errors with all their sources whenever they get
logged. This ensures log messages are unique and all Sentry issues will
have a useful title.
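The flattening itself is straightforward; something along these lines
(a sketch, not necessarily the exact helper):

```rust
/// Flattens an error and all its sources into a single string so
/// every Sentry issue gets a distinguishable title.
fn stringify_with_sources(error: &dyn std::error::Error) -> String {
    let mut message = error.to_string();
    let mut source = error.source();

    while let Some(cause) = source {
        message.push_str(": ");
        message.push_str(&cause.to_string());
        source = cause.source();
    }

    message
}
```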
For all STUN and TURN messages that are being sent from `connlib`, we
implement a retransmit strategy with an exponential backoff if we don't
hear from the relay within a given amount of time. For this, we are
currently using the `backoff` crate.
For our purposes, this crate is a bit unergonomic. In particular, it
has a mutable `next_backoff` function as well as an internal dependency
on a "clock". As a consequence, we need to
a) always make sure the clock of an `ExponentialBackoff` is pointing to
the current time
b) only call `next_backoff` when we want to resend a message
Within the sans-IO design of `connlib`, time-related functions are
handled within `handle_timeout` which is being passed a `now: Instant`
parameter. Instead of ticking over to the next backoff, what we need
from our backoff module are answers to the questions:
- Is the backoff expired?
- When should the next retry happen?
- What is the current waiting interval?
In addition, we want the backoff module to "tick over" to the next
trigger when the time passes the current one, i.e. we want to issue the
command: "This is the current time, update your internal state."
By re-implementing this ourselves, we can avoid this additional state
tracking of `last_now`, thus simplifying the implementation.
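A sketch of such a sans-IO backoff (names and the growth factor are
assumptions):

```rust
use std::time::{Duration, Instant};

struct ExponentialBackoff {
    started_at: Instant,
    interval: Duration,
    next_trigger: Instant,
    max_elapsed: Duration,
}

impl ExponentialBackoff {
    fn new(now: Instant, initial: Duration, max_elapsed: Duration) -> Self {
        Self {
            started_at: now,
            interval: initial,
            next_trigger: now + initial,
            max_elapsed,
        }
    }

    /// "This is the current time, update your internal state."
    fn handle_timeout(&mut self, now: Instant) {
        while now >= self.next_trigger {
            self.interval = self.interval.mul_f64(1.5);
            self.next_trigger += self.interval;
        }
    }

    /// Is the backoff expired?
    fn is_expired(&self, now: Instant) -> bool {
        now.duration_since(self.started_at) >= self.max_elapsed
    }

    /// When should the next retry happen?
    fn next_retry(&self) -> Instant {
        self.next_trigger
    }

    /// What is the current waiting interval?
    fn interval(&self) -> Duration {
        self.interval
    }
}
```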
Filtering packets is part of a Gateway's day-to-day business. We don't
want to treat those as errors further up the stack because they might
get reported to Sentry, but they are still worth logging on DEBUG.
When `snownet` was originally developed, its API was designed with the
idea in mind that a packet that doesn't get handled is an error. Whilst
that is technically true, we don't have any other component that
processes packets within Firezone. When a connection is killed by e.g.
an ICE timeout, we may still be receiving packets from the other party.
Those fill the logs until the other party also runs into a timeout. To
prevent this, we don't return errors for these but instead log them on
TRACE.
For the case where we are given a packet that doesn't match any known
format, we still emit an error.
Within `connlib` - on UNIX platforms - we have dedicated threads that
read from and write to the TUN device. These threads are connected with
`connlib`'s main thread via bounded channels: one in each direction.
When these channels are full, `connlib`'s main thread will suspend and
not read any network packets from the sockets in order to maintain
back-pressure. Reading more packets from the sockets would most likely
mean sending more packets out of the TUN device.
When debugging #7763, it became apparent that _something_ must be wrong
with these threads: somehow, we either consider the channels full or
aren't emptying them and, as a result, we don't read _any_ network
packets from our sockets.
To maintain back-pressure here, we currently use our own `AtomicWaker`
construct that is shared with the TUN thread(s). This is unnecessary. We
can also directly convert the `flume::Sender` into a
`flume::async::SendSink` and therefore directly access a `poll`
interface.
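With flume's `async` feature, that looks roughly like this (a sketch;
the real event loop drives the `Sink` manually via `poll_ready`):

```rust
use futures::SinkExt;

async fn forward_to_tun(packets: Vec<Vec<u8>>, tx: flume::Sender<Vec<u8>>) {
    let mut sink = tx.into_sink();

    for packet in packets {
        // Suspends while the bounded channel is full, which in turn
        // stops us from reading more packets off the sockets.
        if sink.feed(packet).await.is_err() {
            return; // TUN thread is gone
        }
    }

    let _ = sink.flush().await;
}
```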
Once we've finished ICE and nominated a socket, we ignore future
candidates for the same connection (see #6876). To make this log a bit
more helpful, we now log the candidate that we are ignoring on this
connection.
The batch size affects how many packets we process at a time. It also
affects the worst-case size of a single buffer, as all packets may be
of the same size and thus need to be appended to the same buffer.
On mobile, we can't afford to allocate buffers that large, so we reduce
the batch-size there.
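For example (the numbers here are hypothetical, not the shipped
values):

```rust
/// A smaller batch on mobile caps worst-case buffer growth.
#[cfg(any(target_os = "android", target_os = "ios"))]
const MAX_BATCH_SIZE: usize = 8;

#[cfg(not(any(target_os = "android", target_os = "ios")))]
const MAX_BATCH_SIZE: usize = 64;

/// Worst case: every packet in the batch has the same (maximum) size
/// and must be appended to the same buffer.
const WORST_CASE_BUFFER: usize = MAX_BATCH_SIZE * 1500; // hypothetical MTU
```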
Within the `GsoQueue` data structure, we keep a hash map indexed by
source, destination and segment length of UDP packets pointing to a
buffer for those payloads. What we intended to do here was to return
the buffer to the pool after we sent the payload. What we failed to
realise is that putting another buffer into the hash map means we keep
a buffer allocated for a certain destination address and segment
length! This buffer would only get reused for the exact same address
and segment length, causing memory usage to balloon over time.
To fix this, we wrap the `DatagramBuffer` in an additional `Option`.
This allows us to actually remove it from the hash map and return the
buffer for future use to the buffer pool.
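Sketched (the real `GsoQueue` and buffer pool look different in
detail):

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

struct DatagramBuffer(Vec<u8>);

struct GsoQueue {
    /// Keyed by source, destination and segment length.
    buffers: HashMap<(SocketAddr, SocketAddr, usize), Option<DatagramBuffer>>,
}

impl GsoQueue {
    fn flush(&mut self, pool: &mut Vec<DatagramBuffer>) {
        for slot in self.buffers.values_mut() {
            if let Some(buffer) = slot.take() {
                // ... send `buffer`'s payload ...

                // Returned to the shared pool instead of staying
                // allocated for this exact (src, dst, len) key.
                pool.push(buffer);
            }
        }
    }
}
```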
Resolves: #7866.
Resolves: #7747.
At present, the file logger for all Rust code starts each logfile with
`connlib.`. This is very confusing when exporting the logs from the GUI
client because even the logs from the client itself will start with
`connlib.`. To fix this, we make the base file name of the log file
configurable.
When queuing a new UDP payload for sending, we always immediately
pulled a new buffer even though we might already have one allocated for
this particular segment length. This caused an unnecessary spike in
memory usage under load.
When the Gateway's filter-engine drops a packet, we currently only log
"destination not allowed". This could happen either because we don't
have a filter (i.e. the resource is not allowed) or because the TCP /
UDP port or ICMP traffic is not allowed. To make debugging easier, we
now include that information in the error message.
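For instance, an error type along these lines (a sketch):

```rust
use std::net::IpAddr;

#[derive(Debug)]
enum NotAllowed {
    /// No filters exist for this IP, i.e. the resource itself is not
    /// allowed.
    Resource { dst: IpAddr },
    /// The resource is allowed, but not this protocol / port.
    Protocol {
        dst: IpAddr,
        protocol: &'static str,
        port: Option<u16>,
    },
}
```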
Resolves: #7875.
STUN binding requests & responses are not authenticated on purpose:
they are so cheap to fulfill that performing the computational work to
check authentication would cost more than simply answering the request.
With #7819, we send STUN binding
requests more often because they are used as keep-alives to the relay.
This spams the debug log because we see
> Message does not have a `MessageIntegrity` attribute
for every BINDING response. This information isn't interesting for
BINDING responses because those will never have a `MessageIntegrity`
attribute.
In order to debug connection wake-ups, it is useful to know which
packet is the first one that gets sent on an idle connection. With this
PR, we do exactly that for incoming and outgoing packets through the
tunnel. The resulting log looks something like this:
```
2025-01-24T02:52:51.818Z DEBUG snownet::node: Connection is idle cid=65f149ea-96a4-4eee-ac70-62a1a2590821
2025-01-24T02:52:57.312Z DEBUG firezone_tunnel::client: Cleared DNS resource NAT domain=speed.cloudflare.com
2025-01-24T02:52:57.312Z DEBUG firezone_tunnel::client: Setting up DNS resource NAT gid=65f149ea-96a4-4eee-ac70-62a1a2590821 domain=speed.cloudflare.com
2025-01-24T02:52:57.312Z DEBUG snownet::node: Connection resumed packet=Packet { src: ::, dst: ::, protocol: "Reserved" } cid=65f149ea-96a4-4eee-ac70-62a1a2590821
```
Here, the connection got resumed because we locally received a DNS query
for a DNS resource which triggers a new control protocol message through
the tunnel. For this, we use the unspecified IPv6 address for src and
dst, and IP protocol number 255, which here renders as "Reserved".
The committed regression seeds trigger a scenario where the WireGuard
sessions of the peers expire at an unfortunate moment: by the time the
Client sends the packet, the session is still active (179.xx seconds
old), but with the latency to the Gateway, the 180s mark is reached in
transit, so the Gateway clears the session and discards the packet.
In order to fix this, I opted to patch WireGuard by introducing a new
timer that does not allow the initiator to use a session that is almost
expired: https://github.com/firezone/boringtun/pull/68.
Resolves: #7832.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
With #7819, these log messages appear at a ~10x higher rate than before
- a day's worth of these would be over 3,000 messages. For BINDING
requests, these only matter if the candidates change, therefore we can
make the logging conditional on that.
---------
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
Contrary to my prior belief, we don't actually need the WireGuard
_persistent_ keep-alive. The in-built timers from WireGuard will
automatically send keep-alive messages in case no organic reply is sent
for a particular request.
All NAT bindings along the network path are already kept open using the
STUN bindings sent on all candidate pairs. Even on idle connections, we
send those every 60s. Well-behaved NATs are meant to keep confirmed UDP
bindings open for at least 120s. Even if not, the worst-case here is
that a connection which does not send any(!) application traffic is cut.
#7819 triggers this log every 25s, which no longer describes the
correct condition. This PR updates the log to only fire when we're
determining which socket to use for communicating with the Relay, and
not at each keepalive interval.
Firezone Clients and Gateways create an allocation with a given set of
Relays as soon as they start up. If no traffic is being secured and thus
no connections are established between them, NAT bindings between
Clients / Gateways and the Relays may expire. Typically, these bindings
last for 120s. Allocations are only refreshed every 5 min (after 50% of
their lifetime has passed).
After a NAT binding has expired, the next UDP message passing through
the NAT may be assigned a new port, thus changing the sender's 3-tuple.
TURN identifies clients by their 3-tuple. Therefore, without a
proactive keepalive, TURN clients lose access to their allocation and
need to create a new one from the new port.
To fix this, we implement a scheduled STUN binding request every 25s
once we have chosen a socket (IPv4 or IPv6) for a given relay.
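In sans-IO style, the schedule is a small amount of state inside the
allocation (names assumed):

```rust
use std::net::SocketAddr;
use std::time::{Duration, Instant};

const STUN_KEEPALIVE: Duration = Duration::from_secs(25);

struct Allocation {
    /// Set once we have picked IPv4 or IPv6 for this relay.
    active_socket: Option<SocketAddr>,
    next_binding_at: Instant,
}

impl Allocation {
    fn handle_timeout(&mut self, now: Instant) {
        let Some(socket) = self.active_socket else {
            return; // No socket chosen yet, nothing to keep alive.
        };

        if now >= self.next_binding_at {
            self.send_binding_request(socket);
            self.next_binding_at = now + STUN_KEEPALIVE;
        }
    }

    fn send_binding_request(&mut self, _dst: SocketAddr) {
        // ... queue a STUN BINDING request for transmission ...
    }
}
```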
Resolves: #7802.
In order to ensure that the "site status" in the UIs is always
up-to-date, we model the resource status as part of `tunnel_test`. This
should cover even the most bizarre combinations of adding, removing,
disabling and enabling resources interleaved with sending packets,
resetting connections etc.
Fixes: #7761.