Commit Graph

634 Commits

Author SHA1 Message Date
Gabi
5b0aaa6f81 fix(connlib): protect all sockets from routing loops (#5797)
Currently, only connlib's UDP sockets for sending and receiving STUN &
WireGuard traffic are protected from routing loops. This is was done via
the `Sockets::with_protect` function. Connlib has additional sockets
though:

- A TCP socket to the portal.
- UDP & TCP sockets for DNS resolution via hickory.

Both of these can incur routing loops on certain platforms which becomes
evident as we try to implement #2667.

To fix this, we generalise the idea of "protecting" a socket via a
`SocketFactory` abstraction. By allowing the different platforms to
provide a specialised `SocketFactory`, anything Linux-based can give
special treatment to the socket before handing it to connlib.

As an additional benefit, this allows us to remove the `Sockets`
abstraction from connlib's API again because we can now initialise it
internally via the provided `SocketFactory` for UDP sockets.

---------

Signed-off-by: Gabi <gabrielalejandro7@gmail.com>
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
2024-07-16 00:40:05 +00:00
Thomas Eizinger
14abda01fd refactor(connlib): polish DNS resource matching (#5866)
In preparation for implementing #5056, I familiarized myself with the
current code and ended up implementing a couple of refactorings.
2024-07-15 23:56:48 +00:00
Gabi
7436f86332 chore(connlib): remove warnings for non-proptest tests (#5883)
Extracted from #5797
2024-07-15 21:52:22 +00:00
Thomas Eizinger
847e7801f6 test(connlib): remove Tick transition (#5867)
When the property-based state machine test was first created, I
envisioned that we could also easily test advancing time. Unfortunately,
the tricky part of advancing time is to correctly encode the _expected_
behaviour as it requires knowledge of all timeouts etc.

Thus, the `Tick` transition has been left lingering and doesn't actually
test much. It is obviously still sampled by the test runner and thus
"wastes" test cases that don't end up exercising anything meaningful
because the time advancements are < 1000ms.

There are plans to more roughly test time-related things by implementing
delays between applying `Transmit`s. Until then, we can remove the
`Tick` transition.
2024-07-15 21:08:22 +00:00
Thomas Eizinger
a4a8221b8b refactor(connlib): explicitly initialise Tun (#5839)
Connlib's routing logic and networking code is entirely platform
agnostic. The only platform-specific bit is how we interact with the TUN
device. From connlib's perspective though, all it needs is an interface
for reading and writing. How the device gets initialised and updated is
client-business.

For the most part, this is the same on all platforms: We call callbacks
and the client updates the state accordingly. The only annoying bit here
is that Android recreates the TUN interface on every update and thus our
old file descriptor is invalid. The current design works around this by
returning the new file descriptor on Android. This is a problematic
design for several reasons:

- It forces the callback handler to finish synchronously, and halting
connlib until this is complete.
- The synchronous nature also means we cannot replace the callbacks with
events as events don't have a return value.

To fix this, we introduce a new `set_tun` method on `Tunnel`. This moves
the business of how the `Tun` device is created up to the client. The
clients are already platform-specific so this makes sense. In a future
iteration, we can move all the various `Tun` implementations all the way
up to the client-specific crates, thus co-locating the platform-specific
code.

Initialising `Tun` from the outside surfaces another issue: The routes
are still set via the `Tun` handle on Windows. To fix this, we introduce
a `make_tun` function on `TunDeviceManager` in order for it to remember
the interface index on Windows and being able to move the setting of
routes to `TunDeviceManager`.

This simplifies several of connlib's APIs which are now infallible.

Resolves: #4473.

---------

Co-authored-by: Reactor Scram <ReactorScram@users.noreply.github.com>
Co-authored-by: conectado <gabrielalejandro7@gmail.com>
2024-07-12 23:54:15 +00:00
Thomas Eizinger
a4714d6de3 chore(connlib): print error after panicking (#5854) 2024-07-12 14:30:11 +00:00
Thomas Eizinger
c92dd559f7 chore(rust): format Cargo.toml using cargo-sort (#5851) 2024-07-12 04:57:22 +00:00
Thomas Eizinger
71f8b86b78 test(connlib): don't update resources as part of adding new ones (#5834)
Currently, `tunnel_test` has some old code that attempted to handle
resource _updates_ as part of adding new ones. That is outdated and
wrong. The test is easier to reason about if we disallow updates to
resources as part of _adding_ a new one.

In production, resources IDs are unique so this shouldn't actually
happen. At a later point, we can add explicit transitions for updating
an existing resource.
2024-07-12 00:30:18 +00:00
Thomas Eizinger
d95193be7d test(connlib): introduce dynamic number of gateways to tunnel_test (#5823)
Currently, `tunnel_test` exercises a lot of code paths within connlib
already by adding & removing resources, roaming the client and sending
ICMP packets. Yet, it does all of this with just a single gateway
whereas in production, we are very likely using more than one gateway.

To capture these other code-paths, we now sample between 1 and 3
gateways and randomly assign the added resources to one of them, which
makes us hit the codepaths that select between different gateways.

Most importantly, the reference implementation has barely any knowledge
about those individual connections. Instead, it is implementation in
terms of connectivity to resources.
2024-07-11 23:42:46 +00:00
Thomas Eizinger
960ce80680 refactor(connlib): move TunDeviceManager into firezone-bin-shared (#5843)
The `TunDeviceManager` is a component that the leaf-nodes of our
dependency tree need: the binaries. Thus, it is misplaced in the
`connlib-shared` crate which is at the very bottom of the dependency
tree.

This is necessary to allow the `TunDeviceManager` to actually construct
a `Tun` (which currently lives in `firezone-tunnel`).

Related: #5839.

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Reactor Scram <ReactorScram@users.noreply.github.com>
2024-07-11 23:42:33 +00:00
Thomas Eizinger
2013d6a2bf chore(connlib): improve logging (#5836)
Currently, the logging of fields in spans for encapsulate and
decapsulate operations is a bit inconsistent between client and gateway.
Logging the `from` field for every message is actually quite redundant
because most of these logs are emitted within `snownet`'s `Allocation`
which can add its own span to indicate, which relay we are talking to.

For most other operations, it is much more useful to log the connection
ID instead of IPs.

This should make the logs a bit more succinct.
2024-07-11 23:38:19 +00:00
Thomas Eizinger
08182913a5 refactor(connlib): remove CidrV4 and CidrV6 types from callbacks (#5842)
These are only necessary for the Android and Apple client. Other clients
should not need to bother with these custom types.

Required-for: #5843.
2024-07-11 14:25:26 +00:00
Thomas Eizinger
f39a57fa50 refactor(connlib): remove cyclic From impls (#5837)
We have several representations of `ResourceDescription` within connlib.
The ones within the `callbacks` module are meant for _presentation_ to
the clients and thus contain additional information like the site
status.

The `From` impls deleted within the PR are only used within tests. We
can rewrite those tests by asserting on the presented data instead.

This is better because it means information about resources only flows
in one direction: From connlib to the clients.
2024-07-11 14:21:33 +00:00
Thomas Eizinger
03c0da8995 chore(connlib): ensure span is activate during test init (#5835)
Applying the initial `init` closure may also print logs that are
currently not captured within the corresponding span. By using
`in_scope`, we ensure those logs are also correctly captured in the
corresponding span.
2024-07-11 14:20:15 +00:00
Thomas Eizinger
8ec6a809a1 refactor(relay): use RangeInclusive to specify available ports (#5820) 2024-07-11 06:26:21 +00:00
Thomas Eizinger
00a3940717 chore(rust): introduce tokio workspace dependency (#5821)
We are referencing the `tokio` dependency a lot and it makes sense to
ensure that version is tracked only once across the whole workspace.

Extracted out of #5797.

---------

Co-authored-by: Not Applicable <ReactorScram@users.noreply.github.com>
2024-07-10 23:40:34 +00:00
Thomas Eizinger
0c2648dae2 test(connlib): correctly scope state within tunnel_test (#5809)
Currently, the type hierarchy within `tunnel_test` is already quite
nested: We have a `Host` that wraps a `SimNode` which wraps a
`ClientState` or `GatewayState`. Additionally, a lot of state that is
actually _per_ client or _per_ gateway is tracked in the root of
`ReferenceState` and `TunnelTest`. That makes it difficult to introduce
multiple gateways / clients to this test.

To fix this, we introduce dedicated `RefClient` and `RefGateway` states.
Those track the expected state of a particular client / gateway.
Similarly, we introduce dedicated `SimClient` and `SimGateway` structs
that track the simulation state by wrapping the corresponding
system-under-test: `ClientState` a `GatewayState`.

This ends up moving a lot of code around but has the great benefit that
all the state is now scoped to a particular instance of a client or a
gateway, paving the way for creating multiple clients & gateways in a
single test.
2024-07-10 23:22:19 +00:00
Reactor Scram
78f1c7c519 test(firezone-tunnel/windows): Test Windows upload speed in CI (#5607)
Closes #5601
It looks like we can hit 100+ Mbps in theory. This covers Wintun, Tokio,
and Windows OS overhead. It doesn't cover the cryptography or anything
in connlib itself.

The code is kinda messy but I'm not sure how to clean it up so I'll just
leave it for review.

This test should fail if there's any regressions in #5598.

It fails if any packet is dropped or if the speed is under 100 Mbps

```[tasklist]
### Tasks
- [x] Use `ip_packet::make`
- [x] Switch to `cargo bench`
- [x] Extract windows ARM PR
- [x] Clean up wintun.dll install code
- [x] Re-request review
```
2024-07-10 19:09:45 +00:00
Thomas Eizinger
0e6ac2040c test(connlib): use two relays in tunnel_test (#5804)
With the introduction of a routing table in #5786, we can very easily
introduce an additional relay to `tunnel_test`. In production, we are
always given two relays and thus, this mimics the production setup more
closely.
2024-07-09 23:47:35 +00:00
Thomas Eizinger
f3fa0c7e5f test(connlib): reduce cycles of resource_management test (#5807)
With the performance improvements of `tunnel_test` in #5786, the
`resource_management` test is now in the hot-path of CI runtime. We
reduce the cycles to 50 should cut down overall CI time by ~ 1 minute as
the Windows builds are among the slowest.

Signed-off-by: Reactor Scram <ReactorScram@users.noreply.github.com>
Co-authored-by: Reactor Scram <ReactorScram@users.noreply.github.com>
2024-07-09 14:50:12 +00:00
Thomas Eizinger
d15c43b6f2 test(connlib): render IDs as hex u128 (#5803)
This is a bit of a hack because features should never change behaviour.
Unfortunately, we can't use `cfg(test)` here because the proptests live
in a different crate and thus for the tests, we import the crate using
`cfg(not(test))`.

Our `proptest` feature is really only meant to be activated during
testing so I think this is fine for now.

The benefit is that the test logs are much more terse because proptest
will shrink the IDs to `0`, `1` etc. With the upcoming addition of
multiple gateways and multiple relays, we will have a lot more IDs in
the logs. Thus, it is important that they stay legible.
2024-07-09 14:23:37 +00:00
Thomas Eizinger
a3c9617faa test(connlib): ensure Windows test module follows conventions (#5806)
By convention, `tests` modules are usually feature-flagged to not end up
in production code. Additionally, a `use super::*;` import line ensures
we have access to the parent module which is usually the one you want to
test.
2024-07-09 14:12:44 +00:00
Thomas Eizinger
f8468813c3 test(tunnel): use hex notation for IPv6 network (#5808) 2024-07-09 14:11:46 +00:00
Thomas Eizinger
9caca475dc test(connlib): introduce routing table to tunnel_test (#5786)
Currently, `tunnel_test` uses a rather naive approach when dispatching
`Transmit`s. In particular, it checks client, gateway and relay
separately whether they "want" a certain packet. In a real network,
these packets are routed based on their IP.

To mimic something similar, we introduce a `Host` abstraction that wraps
each component: client, gateway and relay. Additionally, we introduce a
`RoutingTable` where we can add and remove hosts. With these things in
place, routing a `Transmit` is as easy as looking up the destination IP
in the routing table and dispatching to the corresponding host.

Our hosts are type-safe: client, gateway and relay have different types.
Thus, we abstract over them using a `HostId` in order to know, which
host a certain message is for. Following these patches, we can easily
introduce multiple gateways and relays to this test by simply making
more entries in this routing table. This will increase the test coverage
of connlib.

Lastly, this patch massively increases the performance of `tunnel_test`.
It turns out that previously, we spent a lot of CPU cycles accessing
"random" IPs from very large iterators. With this patch, we take a
limited range of 100 IPs that we sample from, thus drastically
increasing performance of this test. The configured 1000 testcases
execute in 3s on my machine now (with opt-level 1 which is what we use
in CI).

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
2024-07-09 01:48:54 +00:00
Jamil
aa7977c9b5 chore: bump android 1.1.3 (#5784) 2024-07-06 16:54:14 -07:00
Jamil
7820e3f3c7 fix(android): Strip scope id off IPv6 addresses Android (#5783)
Fixes #5781

---------

Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
2024-07-06 16:50:30 -07:00
Reactor Scram
663367b605 chore(gui-client): timestamp crash dump file names (#5452)
Closes #5449

The smoke tests expect `last_crash.dmp` at a fixed path, so in this case
we write the file with a timestamped name, then copy it over
`last_crash.dmp`.
2024-07-05 15:21:25 +00:00
Thomas Eizinger
28d5b8574c chore(connlib): minor logging tweaks (#5746)
Noticed a few things that caused unnecessary verbosity in the logs.
2024-07-05 14:45:32 +00:00
Thomas Eizinger
2a2877a4d9 test(snownet): add debug assert (#5750)
Within `snownet`'s test harness, packets are dispatched in a particular
order and of none of them match. They are assumed to be for the node
directly. We add a debug assert to ensure that the given address is in
fact part of the "local" interfaces that we have configured in the
tests.
2024-07-05 07:00:24 +00:00
Thomas Eizinger
a57c64e62b chore(snownet): add some debug logs around channel bindings (#5749) 2024-07-05 07:00:03 +00:00
Jamil
086c730aaf chore: Bump clients to 1.1.2 for DNS record type forward (#5703)
Apps are already in review with App Stores
2024-07-04 01:31:26 +00:00
Reactor Scram
f6e99752ec fix(client): flush the OS' DNS cache whenever resources change (#5700)
Closes #5052

On my dev VMs:
- systemd-resolved = 15 ms to flush
- Windows = 600 ms to flush

I tested with the headless Clients on Linux and Windows and it fixes the
issue. On Windows I didn't replicate the issue with the GUI Client, on
Linux this patch also fixes it for the GUI Client.
2024-07-03 21:14:43 +00:00
Gabi
5fd321c4bb chore(connlib): forward non-address record queries (#5674)
Since we only handle `A`, `AAAA` and `PTR` records of names we handle,
this can lead to unexpected behavior with other record types, where
using Firezone breaks `TXT`, `MX` or other record types for the
resources we handle.

So this is a bit of a refactor, now we lookup a resource and explicitly
return `Some` when there is a record we should be returning (even if
it's empty due to IP exhaustion) or `None` when we should just forward
the query.

This has the added benefit of no longer breaking bonjour or other
non-standard `PTR` queries.

Fixes: #5673.

---------

Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
2024-07-03 05:15:23 +00:00
Gabi
79fd8f6063 chore(connlib): add message type to the no records found logs (#5641)
Added for clarity when debugging, it used to look like:

```
2024-06-30T00:16:05.718337Z DEBUG firezone_tunnel::dns: No records for github.com, returning NXDOMAIN
```

And now looks like:

```
2024-06-30T00:16:05.718337Z DEBUG firezone_tunnel::dns: No MX records for github.com, returning NXDOMAIN
```
2024-07-01 23:15:44 +00:00
Thomas Eizinger
02f5c67974 chore(windows): reduce nesting in wintun recv-thread (#5573)
Related: #5571.
2024-07-01 16:33:59 +00:00
Jamil
25b6528942 chore: Bump versions and update changelog (#5636)
Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
2024-06-29 09:06:10 -07:00
Thomas Eizinger
96536a23cf refactor(connlib): ignore relays per connection (#5631)
In a previous design of firezone, relays used to be scoped to a certain
connection. For a while now, this constraint has been lifted and all
connections can use all relays. A related, outdated concern is the idea
of STUN-only servers. Those also used to be assigned on a per-connection
basis.

By removing any use of per-connection relays and STUN-only servers, the
entire `StunBinding` concept is unused code and can thus be deleted.

To push this over the finish line, the `snownet-tests` which test the
hole-punching functionality needed to be slightly adapted to make use of
the more recently introduced API `Node::update_relays`.

Resolves: #4749.
2024-06-29 02:36:17 +00:00
Thomas Eizinger
f2b6c205c2 refactor(snownet): change reconnect to reset (#5630)
Currently, `snownet` still supports this notion of "reconnecting" which
is a mix between resetting some state but keeping other. In particular,
we currently retain the `StunBinding` and `Allocation` state. This used
to be important because allocations are bound to the 3-tuple of the
client and thus needed to be kept around in case we weren't actually
roaming.

We always rebind the the local UDP sockets upon reconnecting and thus
the 3-tuple always changes anyway. In addition, we always reconnect to
the portal, meaning we receive another `init` message and thus can
actually completely clear the `Node`'s state.

This PR does that an in the process, rebrands `reconnect` as `reset`
which now makes more sense.

Related: #5619.
2024-06-29 02:07:10 +00:00
Thomas Eizinger
8973cc5785 refactor(android): use fmt::Layer with custom writer (#5558)
Currently, the logs that go to logcat on Android are pretty badly
formatted because we use `tracing-android` and it formats the span
fields and message fields itself. There is actually no reason for doing
the formatting ourselves. Instead, we can use the `MakeWriter`
abstraction from `tracing_subscriber` to plug in a custom writer that
writes to Android's logcat.

This results in logs like this:

```
[nix-shell:~/src/github.com/firezone/firezone/rust]$ adb logcat -s connlib
--------- beginning of main
06-28 19:41:20.057 19955 20213 D connlib : phoenix_channel: Connecting to portal host=api.firez.one user_agent=Android/14 5.15.137-android14-11-gbf4f9bc41c3b-ab11664771 connlib/1.1.1
06-28 19:41:20.058 19955 20213 I connlib : firezone_tunnel::client: Network change detected
06-28 19:41:20.061 19955 20213 D connlib : snownet::node: Closed all connections as part of reconnecting num_connections=0
06-28 19:41:20.365 19955 20213 I connlib : phoenix_channel: Connected to portal host=api.firez.one
06-28 19:41:20.601 19955 20213 I connlib : firezone_tunnel::io: Setting new DNS resolvers
06-28 19:41:21.031 19955 20213 D connlib : firezone_tunnel::client: TUN device initialized ip4=100.66.86.233 ip6=fd00:2021:1111::f:d9c1 name=tun1
06-28 19:41:21.031 19955 20213 I connlib : connlib_client_shared::eventloop: Firezone Started!
06-28 19:41:21.031 19955 20213 I connlib : firezone_tunnel::dns: Activating DNS resource address=*.slackb.com
06-28 19:41:21.031 19955 20213 I connlib : firezone_tunnel::dns: Activating DNS resource address=*.test-ipv6.com
06-28 19:41:21.032 19955 20213 I connlib : firezone_tunnel::client: Activating CIDR resource address=5.4.6.7/32 name=5.4.6.7
06-28 19:41:21.032 19955 20213 I connlib : firezone_tunnel::client: Activating CIDR resource address=10.0.32.101/32 name=IPerf3
06-28 19:41:21.032 19955 20213 I connlib : firezone_tunnel::dns: Activating DNS resource address=ifconfig.net
06-28 19:41:21.032 19955 20213 I connlib : firezone_tunnel::dns: Activating DNS resource address=*.slack-imgs.com
06-28 19:41:21.032 19955 20213 I connlib : firezone_tunnel::dns: Activating DNS resource address=*.google.com
06-28 19:41:21.032 19955 20213 I connlib : firezone_tunnel::client: Activating CIDR resource address=10.0.0.5/32 name=10.0.0.5
06-28 19:41:21.032 19955 20213 I connlib : firezone_tunnel::dns: Activating DNS resource address=*.githubassets.com
06-28 19:41:21.032 19955 20213 I connlib : firezone_tunnel::dns: Activating DNS resource address=dnsleaktest.com
06-28 19:41:21.033 19955 20213 I connlib : firezone_tunnel::dns: Activating DNS resource address=*.slack-edge.com
06-28 19:41:21.033 19955 20213 I connlib : firezone_tunnel::dns: Activating DNS resource address=*.github.com
06-28 19:41:21.033 19955 20213 I connlib : firezone_tunnel::dns: Activating DNS resource address=speed.cloudflare.com
06-28 19:41:21.033 19955 20213 I connlib : firezone_tunnel::dns: Activating DNS resource address=*.githubusercontent.com
06-28 19:41:21.033 19955 20213 I connlib : firezone_tunnel::client: Activating CIDR resource address=10.0.14.11/32 name=Staging resource performance
06-28 19:41:21.033 19955 20213 I connlib : firezone_tunnel::dns: Activating DNS resource address=*.whatismyip.com
06-28 19:41:21.033 19955 20213 I connlib : firezone_tunnel::client: Activating CIDR resource address=10.0.0.8/32 name=10.0.0.8
06-28 19:41:21.033 19955 20213 I connlib : firezone_tunnel::client: Activating CIDR resource address=9.9.9.9/32 name=Quad9 DNS
06-28 19:41:21.034 19955 20213 I connlib : firezone_tunnel::client: Activating CIDR resource address=10.0.32.10/32 name=CoreDNS
06-28 19:41:21.216 19955 20213 I connlib : snownet::node: Added new TURN server id=bd6e9d1a-4696-4f8b-8337-aab5d5cea810 address=Dual { v4: 35.197.171.113:3478, v6: [2600:1900:40b0:1504:0:27::]:3478 }
```

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
2024-06-28 22:15:10 +00:00
Jamil
8655b711db fix(connlib): Don't use operatingSystemVersionString on Apple OSes (#5628)
The [HTTP 1.1 RFC](https://datatracker.ietf.org/doc/html/rfc2616) states
that HTTP headers should be US-ASCII. This is not the case when the
macOS Client is run from a host that has a non-English language selected
as its system default due to the way we build the user agent.

This PR fixes that by normalizing how we build the user agent by more
granularly selecting which fields compose it, and not just relying on
OS-provided version strings that may contain non-ASCII characters.

fixes https://github.com/firezone/firezone/issues/5467

---------

Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
2024-06-28 21:59:02 +00:00
Thomas Eizinger
e5cba1caf4 refactor(apple): use fmt::Layer with custom writer (#5623)
Currently, we use the `tracing-oslog` crate to ingest logs on MacOS and
iOS. This crate has a "feature" where it creates so called "Activities"
for spans. Whilst that may initially sound useful, Apple's UI for
viewing these activities is absolutely useless.

Instead of tinkering around with that, we remove the `tracing-oslog`
crate and let `tracing-subscriber` format our logs first and then only
send a single string to the oslog backend.

Related: #5619.
2024-06-28 21:22:54 +00:00
Reactor Scram
a315c49b3c chore(firezone-tunnel/windows): reduce ring buffer from 64 MiB to 1 MiB (#5609)
Oops. It runs the same either way so we definitely don't need all that
RAM to be tied up. The Linux and macOS Clients probably have similar
buffer sizes already.

I tested before and after with CloudFlare's speed test and got roughly
140/12 with latency 50 ms both times. The error bars on speed tests are
pretty wide, but we definitely aren't falling 60 MiB behind on
processing and then catching up.

```[tasklist]
### Tasks
- [x] (failed, can't do it right now) ~~Log if we knowingly drop a lot of packets~~
- [x] Extract constant
- [x] Add comment about not knowing if we drop packets
- [x] Merge
- [ ] (skipped) Test while the CPU is loaded
```
2024-06-28 21:03:18 +00:00
Thomas Eizinger
ed34ca096b chore(gateway): remove dead IP detection (#5618)
This does not work as well as intended and spams the logs. We may need
#5542 before we can implement this properly.

Fixes: #5593.
2024-06-28 04:47:00 +00:00
Thomas Eizinger
66cb565915 fix(snownet): use unused channels before reused expired ones (#5613)
Within each allocation, a client has 4095 channels that it can bind to a
different peers. Each channel bindings is valid for 10 minutes unless
rebound. Additionally, there is a 5min cool-down period after a channel
binding expires before it can be rebound to a different peer.

This patch fixes a bug in snownet where we would have first attempted to
rebind the last bound channel instead of just picking the next unused
one. In the case of a clock drift between client and relay, this caused
unnecessary errors when attempting to rebind channels.

Fixes: #5603.

---------

Co-authored-by: conectado <gabrielalejandro7@gmail.com>
2024-06-28 03:14:16 +00:00
Thomas Eizinger
aadb045b27 chore(connlib): batch together sending of ICE candidates (#5616)
Currently, we are sending each ICE candidate individually from the
client to the gateway and vice versa. This causes a slight delay as to
when each ICE candidate gets added on the remote ICE agent. As a result,
they all start being tested with a slight offset which causes "endpoint
hopping" whenever a connection expires as they expire just after each
other.

In addition, sending multiple messages to the portal causes unnecessary
load when establishing connections.

Finally, with #5283 we started **not** adding the server-reflexive
candidate to the local ICE agent. Because we talk to multiple relays, we
detect the same server-reflexive candidate multiple times if we are
behind a non-symmetric NAT. Not adding the server-reflexive candidate to
the ICE agent mitigated our de-duplication strategy here which means we
currently send the same candidate multiple times to a peer, causing
additional, unnecessary load.

All of this can be mitigated by batching together all our ICE candidates
together into one message.

Resolves: #3978.
2024-06-28 02:04:31 +00:00
Thomas Eizinger
79ff3f830b chore(gateway): downgrade warn logs (#5612)
Whilst it has been helpful to find issues such as #5611, having these
logs on `warn` spams the end user too much and creates a false sense
that things might not be working as there can be a variety of reasons
why packets might not be able to be routed.
2024-06-28 01:13:29 +00:00
Gabi
375a1b5586 fix(connlib): allow 1s ACK for packet before refreshing DNS (#5560)
Currently, we refresh DNS mappings when:
* We translate a packet for the first time
* There are no more incoming packets for 120 seconds
* There is at least 1 outoing packet in the last 10 seconds

The idea was to coordinate with conntrack somehow, to expire DNS
translation at the point where the NAT session of the OS stops being
valid. That way, if the triggered DNS refresh changes the resolved IPs
it would never kill the underlying connection.

However, TCP sessions by default can last for up to 5 days! And I have
no idea how long for ICMP. To prevent killing these connections, we
assume that for TCP and ICMP packets will elicit a response within 1s.
The DNS refresh for a translation mapping that hasn't seen any responses
is thus delayed by 1s after the last packet has been sent out.

To get an idea of how this works you can imagine it like this

|last incoming packet|------ 120 seconds + x seconds ----|out going
packet|----1 second ----|dns refresh|

However this another case where dns refresh is triggered, in this case
the same packet triggers the refresh period and the period where it was
used in the last 10 seconds

|last incoming packet|------ 111 seconds ----|out going packet|---- 9
seconds ----|dns refresh|

The unit tests should also make clear of when we want to trigger dns
refresh and when we don't.

---------

Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
2024-06-28 00:25:26 +00:00
Reactor Scram
76e55e6138 fix(client/windows): fix upload speed by letting Wintun queue packets again (#5598)
Closes #5589. Refs #5571

Improves upload speeds on my Windows 11 VM from 2 Mbps to 10.5 Mbps.

On the resource-constrained VM it improved from 3 to 7 Mbps.

```[tasklist]
### Tasks
- [x] Open for review
- [x] Manual test on resource-constrained VM
- [x] Run 5x replication steps from #5571 and make sure it doesn't deadlock again
- [x] Merge
- [ ] https://github.com/firezone/firezone/issues/5601
```

Sorted by decreasing speed, M = macOS host, W = Windows guest in
Parallels, RC = Resource-constrained Windows guest in VirtualBox:

- M, Internet - 16 Mbps
- W, Internet - 13 Mbps
- M, Firezone - 12 Mbps
- RC, Internet - 12 Mbps
- W, Firezone, after this PR - 10.5 Mbps
- RC, Firezone, after this PR - 8.5 Mbps
- RC, Firezone, before this PR - 4 Mbps
- W, Firezone, before this PR - 2 Mbps

So it's not perfect but the worst part is fixed.

The slow upload speeds were probably a regression from #5571. The MPSC
channel only has a few spots in it, so if connlib doesn't pick up every
packet immediately (which would be impossible under load), we drop
packets. I measured 25% packet drops in an earlier commit.

I first tried increasing the channel size from 5 to 64, and that worked.

But this solution is simpler. I switch back to `blocking_send` so if
connlib isn't clearing the MPSC channel, Wintun will just queue up
packets in its internal ring buffers, and we aren't responsible for
buffering.

Getting rid of `blocking_send` was a defense-in-depth thing to fix the
deadlock yesterday, but we still close the MPSC channel inside
`Tun::drop`, and I confirmed in a manual test that this will kick the
worker thread out of `blocking_send`, so the deadlock won't come back.
2024-06-27 17:59:22 +00:00
Jamil
b5de55ac26 chore: Bump clients to 1.1.0, Gateway to 1.1.1 (#5591) 2024-06-27 02:43:48 -07:00
Thomas Eizinger
b6420eaa3e feat(snownet): close idle connections after 5min (#5576)
We define a connection as idle if we haven't sent or received any
packets in the last 5 minutes. From `snownet`'s perspective, keep-alives
sent by upper layers (like TCP keep-alives) must be honored and thus
outgoing as well as incoming packets are accounted for.

If the underlying connection breaks, we will hit an ICE timeout which is
an implementation detail of `snownet`. The packets tracked here are IP
packets that the user wants to send / receive via the tunnel. Similarly,
wireguard's keep-alives do not update these timestamps and thus don't
mark a connection as non-idle.

---------

Co-authored-by: Jamil Bou Kheir <jamilbk@users.noreply.github.com>
2024-06-27 08:28:38 +00:00