Extracted out of #5797.
This is a problem that becomes evident as
https://github.com/firezone/firezone/issues/2667 is implemented:
Whenever connlib sees a DNS packet whose sentinel maps to a server that
is itself a resource, the query is forwarded to that resource instead of
being resolved locally. This doesn't work well with the system's DNS
servers: they are often handed out by DHCP and point to a local resolver
that can't be reached from a gateway. With a full-route resource, such a
query is simply dropped, preventing all Internet connections outside of
Firezone.
In most cases, when an administrator actually wants to forward all DNS
requests, they will explicitly add an upstream DNS server. That makes
sense: relying on whatever the local DHCP configures is not a good idea
if you want to tunnel DNS requests.
This PR makes that behavior explicit; docs and UI should be updated
accordingly.
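A minimal sketch of that rule, with hypothetical names (connlib's actual types differ):

```rust
use std::net::IpAddr;

/// Hypothetical sketch: only queries destined for an explicitly
/// configured upstream resolver are forwarded into the tunnel; resolvers
/// learned from DHCP are queried locally, even if their IP happens to
/// fall within a resource.
fn forward_through_tunnel(dns_server: IpAddr, upstream_dns: &[IpAddr]) -> bool {
    upstream_dns.contains(&dns_server)
}
```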
---------
Co-authored-by: Gabi <gabrielalejandro7@gmail.com>
Currently, the relay path in `tunnel_test` is only hit accidentally:
because we don't run the gateways in dual-stack mode, some testcases
have a client and gateways that can't talk to each other directly (and
thus fall back to the relay).
This requires us to filter out certain resources because we can't route
to an IPv6 CIDR resource from an IPv4-only gateway. That causes quite a
lot of rejections, which becomes a problem when one bumps up the number
of test cases (e.g. to 10_000).
To fix this, we always run the gateways in dual-stack mode and introduce
a dedicated flag that sometimes drops all direct traffic between the
client and the gateways.
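A rough sketch of what such a flag could look like in the simulated network; the names here are assumptions, not the actual test code:

```rust
use std::collections::HashSet;
use std::net::IpAddr;

/// When `drop_direct_traffic` is sampled as `true`, the simulated
/// network discards every packet that doesn't involve a relay, forcing
/// the client and gateway onto the relayed path.
fn should_drop(
    src: IpAddr,
    dst: IpAddr,
    drop_direct_traffic: bool,
    relays: &HashSet<IpAddr>,
) -> bool {
    drop_direct_traffic && !relays.contains(&src) && !relays.contains(&dst)
}
```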
Why:
* In order to manage a large number of Firezone Sites, Resources,
Policies, etc., a REST API is needed: clicking through the UI is too
time-consuming as well as error-prone. With a REST API, Firezone
customers will be able to manage things within their Firezone accounts
with code.
To determine whether we send proxy IPs, we rely on `allowed_ips`, since
that's where we track which resources we have sent to a given gateway.
However, we were checking whether a given resource destination had been
sent using `longest_match`, and with overlapping DNS resources this no
longer works: it matches the Internet resource even if the proxy IP
wasn't sent.
So we check whether it's a DNS resource, and if it is, we match exactly
against the allowed-IPs table.
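To illustrate the difference, here is a sketch assuming the `ip_network_table` crate's API (connlib's actual bookkeeping is more involved):

```rust
use ip_network::IpNetwork;
use ip_network_table::IpNetworkTable;
use std::net::{IpAddr, Ipv4Addr};

fn main() {
    let mut allowed_ips = IpNetworkTable::new();

    // An Internet resource overlaps with everything.
    allowed_ips
        .insert(IpNetwork::new(IpAddr::from(Ipv4Addr::UNSPECIFIED), 0).unwrap(), "internet");

    // A proxy IP that was never sent to the gateway.
    let proxy_ip = IpAddr::from(Ipv4Addr::new(100, 96, 0, 1));

    // `longest_match` still reports a hit (via the Internet resource) ...
    assert!(allowed_ips.longest_match(proxy_ip).is_some());

    // ... whereas an exact match on the host route correctly misses.
    assert!(allowed_ips
        .exact_match(IpNetwork::new(proxy_ip, 32).unwrap())
        .is_none());
}
```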
Alternatively, we could keep track of `sent_ips` per gateway. That's
somewhat redundant state we'd need to keep in sync, but it has the
benefit of being more explicit, so I'm open to doing that in a follow-up
PR. For now, I'd like to merge this to get ready for Internet resources.
Currently, `tunnel_test` is broken as a result of #5871. In particular,
adding a resource requires that the resource is assigned to a gateway,
which can only happen after it has been added. As a result, no resources
are ever added in the test.
With this patch, we align the test even more closely with how Firezone
works in production: we generate all resources ahead of time and
selectively activate / deactivate them on the client. Unfortunately,
this requires quite a few changes but overall it is a net-positive
change.
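Sketched as state-machine transitions, with assumed names:

```rust
/// Stand-in for the real resource ID type.
type ResourceId = u64;

/// Hypothetical sketch: the pool of resources is generated up front;
/// transitions only toggle entries from that pool on the client, just
/// like the portal activates resources for a client in production.
#[derive(Debug, Clone)]
enum Transition {
    ActivateResource(ResourceId),
    DeactivateResource(ResourceId),
}
```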
Replaces: #5914.
This version was a few months old and started throwing errors about
features that stabilized since then.
e.g.
https://github.com/firezone/firezone/actions/runs/10011089436/job/27673759249
```
error[E0658]: use of unstable library feature 'proc_macro_byte_character'
--> /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/proc-macro2-1.0.86/src/wrapper.rs:871:21
|
871 | proc_macro::Literal::byte_character(byte)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
= note: see issue #115268 <https://github.com/rust-lang/rust/issues/115268> for more information
= help: add `#![feature(proc_macro_byte_character)]` to the crate attributes to enable
= note: this compiler was built on 2024-03-25; consider upgrading it if it is out of date
error[E0658]: use of unstable library feature 'proc_macro_c_str_literals'
--> /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/proc-macro2-1.0.86/src/wrapper.rs:898:21
|
898 | proc_macro::Literal::c_string(string)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
= note: see issue #119750 <https://github.com/rust-lang/rust/issues/119750> for more information
= help: add `#![feature(proc_macro_c_str_literals)]` to the crate attributes to enable
= note: this compiler was built on 2024-03-25; consider upgrading it if it is out of date
For more information about this error, try `rustc --explain E0658`.
error: could not compile `proc-macro2` (lib) due to 2 previous errors
```
The connection to the portal could be interrupted at any point, most
notably when it is being re-deployed. Doing so results in a new `init`
message being pushed to all clients and gateways. This must not
interrupt the data plane.
To ensure this, we add a new `ReconnectPortal` transition to
`tunnel_test` where we simulate receiving a new `init` message with the
same values as we already have locally, i.e. same set of relays and
resources.
This resolves an existing TODO where the logic of performing
non-destructive updates to resources in `set_resources` wasn't tested.
The two primary users of `add_resources` and `remove_resources` are the
client's eventloop and `tunnel_test`. Both of them only ever pass a
single resource at a time.
It is thus simpler to remove the inner loop from within `ClientState`
and simply process a single resource at a time.
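A sketch of the simplification, using stand-in types rather than connlib's actual ones:

```rust
/// Stand-in for connlib's actual resource type.
#[derive(Debug, Clone)]
struct ResourceDescription {
    id: u64,
}

struct ClientState {
    resources: Vec<ResourceDescription>,
}

impl ClientState {
    /// One resource per call; no inner loop needed because callsites
    /// never pass more than one resource anyway.
    fn add_resource(&mut self, resource: ResourceDescription) {
        self.resources.push(resource);
    }

    fn remove_resource(&mut self, id: u64) {
        self.resources.retain(|r| r.id != id);
    }
}
```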
In preparation for #2667, we add an `internet` variant to our list of
possible resource types. This is backwards-compatible with existing
clients and ensures that, once the portal starts sending Internet
resources to clients, they won't fail to deserialise these messages.
The portal will have a version check to not send this to older clients
anyway but the sooner we can land this, the better. It simplifies the
initial development as we start preparing for the next client release.
Adding new fields to a JSON message is always backwards-compatible so we
can extend this later with whatever we need.
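A sketch of why this is safe, using serde (connlib's actual type definitions differ):

```rust
use serde::Deserialize;

#[derive(Debug, Deserialize)]
#[serde(tag = "type", rename_all = "snake_case")]
enum ResourceDescription {
    Dns { address: String },
    Cidr { address: String },
    Internet {},
}

fn main() {
    // Unknown fields are ignored by default, so the portal can attach
    // more data to the `internet` variant later without breaking
    // already-deployed clients.
    let json = r#"{ "type": "internet", "some_future_field": 42 }"#;
    let resource: ResourceDescription = serde_json::from_str(json).unwrap();
    println!("{resource:?}"); // Internet
}
```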
Currently, `tunnel_test` aborts a `Transition` as soon as one assertion
fails. This often makes it hard to debug a problem: seeing which
assertions pass and which fail is useful for figuring out what went
wrong.
To resolve this, we replace all `assert` macros with either `info!` or
`error!` trace events. All "failed assertions" must be logged as
`error!`.
Before running these assertions, we temporarily install a custom tracing
layer that keeps track of how many `error!` events are emitted. If we
emit at least one `error!` event, the layer panics upon `Drop`, which
happens at the end of the `check_invariants` function.
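A minimal sketch of such a layer (not the actual implementation):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use tracing::Level;
use tracing_subscriber::layer::{Context, Layer};

/// Counts `error!` events and panics on `Drop` if any were emitted,
/// after all assertions have had a chance to log.
#[derive(Default)]
struct PanicOnError {
    errors: AtomicUsize,
}

impl<S: tracing::Subscriber> Layer<S> for PanicOnError {
    fn on_event(&self, event: &tracing::Event<'_>, _: Context<'_, S>) {
        if event.metadata().level() == &Level::ERROR {
            self.errors.fetch_add(1, Ordering::Relaxed);
        }
    }
}

impl Drop for PanicOnError {
    fn drop(&mut self) {
        let errors = *self.errors.get_mut();

        // Don't panic while already unwinding; that would abort.
        if errors > 0 && !std::thread::panicking() {
            panic!("{errors} failed assertion(s); see `error!` events above");
        }
    }
}
```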
This represents a step towards #3837. Eventually, we'd like the
abstractions of `Session` and `Eventloop` to go away entirely. For that,
we need to thin them out.
The introduction of `ConnectArgs` was already a hint that we are passing
a lot of data across layers that we shouldn't. To avoid that, we can
simply initialise `PhoenixChannel` earlier and thus each callsite can
specify the desired configuration directly.
I've left `ConnectArgs` intact to keep the diff small.
For full route, this always happens, and if we don't prioritize DNS
resources, any packet destined for DNS resource IPs will get routed to
the full-route gateway, which might not have the correct resource.
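Sketched with assumed names:

```rust
use std::collections::HashMap;
use std::net::IpAddr;

type ResourceId = u64; // stand-in for the real ID type

/// Hypothetical sketch: look up DNS resources (keyed by their proxy IPs)
/// first; only fall back to longest-prefix matching, where a full-route
/// resource would otherwise swallow the packet, on a miss.
fn resource_for(
    dst: IpAddr,
    dns_resources_by_ip: &HashMap<IpAddr, ResourceId>,
    cidr_longest_match: impl Fn(IpAddr) -> Option<ResourceId>,
) -> Option<ResourceId> {
    dns_resources_by_ip
        .get(&dst)
        .copied()
        .or_else(|| cidr_longest_match(dst))
}
```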
TODO: this still needs unit tests
TODO: Waiting on #5891
Currently, the relationship between gateways, sites and resources is
modeled in an ad-hoc fashion within `tunnel_test`. The correct
relationship is:
- The portal knows about all sites.
- A resource can only be added for an existing site.
- One or more gateways belong to a single site.
To express this relationship in `tunnel_test`, we first sample between 1
and 3 sites. Then we sample between 1 and 3 gateways and assign them a
site each. When adding new resources, we sample a site that the resource
belongs to. Upon a connection intent, we sample a gateway from all
gateways that belong to the site that the resource is defined in.
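In proptest terms, the sampling could look roughly like this (names are assumptions):

```rust
use proptest::prelude::*;
use proptest::sample::select;

type SiteId = u8; // stand-in for the real ID type

/// Between 1 and 3 sites.
fn sites() -> impl Strategy<Value = Vec<SiteId>> {
    prop::collection::vec(any::<SiteId>(), 1..=3)
}

/// Between 1 and 3 gateways, each assigned one of the sampled sites.
fn gateways(sites: Vec<SiteId>) -> impl Strategy<Value = Vec<SiteId>> {
    prop::collection::vec(select(sites), 1..=3)
}
```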
In addition, this patch-set removes multi-site resources from the
`tunnel_test`. As far as connlib's routing logic is concerned, we route
packets to a resource via a selected gateway. How the portal selected
that gateway's site doesn't matter to connlib and thus doesn't need to
be covered in these tests.
Our Rust CI runs various jobs in different configurations of packages
and / or features. Currently, only the clippy job denies warnings which
makes it possible that some code still generates warnings under
particular configurations.
To ensure we always fail on warnings, we set a global env var to deny
warnings for all Rust CI jobs.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Reactor Scram <ReactorScram@users.noreply.github.com>
I think this is because there was a force-push in the `proptest` repo,
which caused the locked revision to no longer belong to the branch
specified by `Cargo.toml`.
It happened to affect macOS and not Linux or Windows, nor local builds,
maybe because they have different caching setups.
Extracted from #5840
Some cleanup around generating IPs, plus a performance improvement:
picking a host within an IP range now does some math instead of
iterating through the range.
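The arithmetic itself is simple; a sketch for IPv4:

```rust
use std::net::Ipv4Addr;

/// Pick the `offset`-th host in a range by converting to an integer and
/// adding, instead of stepping through the range one address at a time.
fn nth_host(network: Ipv4Addr, offset: u32) -> Ipv4Addr {
    Ipv4Addr::from(u32::from(network).wrapping_add(offset))
}

fn main() {
    assert_eq!(nth_host(Ipv4Addr::new(10, 0, 0, 0), 260), Ipv4Addr::new(10, 0, 1, 4));
}
```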
In the new version, `quinn-udp` no longer supports sending multiple
`Transmit`s at once via `sendmmsg`. We made use of that to send all
buffered packets in one go.
In reality, these buffered packets can only be control messages like
STUN requests to relays. For the hot path of routing packets, we only
ever read a single IP packet from the TUN device and attempt to send it
out right away. At most, we may buffer one packet at a time here in case
the socket is busy.
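A sketch of the at-most-one-packet buffer; the names are assumptions and it is simplified over `std`'s socket rather than `quinn-udp`:

```rust
use std::io;
use std::net::UdpSocket;

struct Outbound {
    pending: Option<Vec<u8>>, // at most one buffered packet
}

impl Outbound {
    /// Try to send right away; if the socket is busy, stash the packet
    /// and retry it on the next wake-up before reading more from TUN.
    fn send_or_buffer(&mut self, socket: &UdpSocket, packet: Vec<u8>) -> io::Result<()> {
        debug_assert!(self.pending.is_none(), "must flush before sending more");

        match socket.send(&packet) {
            Ok(_) => Ok(()),
            Err(e) if e.kind() == io::ErrorKind::WouldBlock => {
                self.pending = Some(packet);
                Ok(())
            }
            Err(e) => Err(e),
        }
    }
}
```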
Getting these wake-ups right is quite tricky. I think we should
prioritise #3837 soon. Once that is integrated, we can use `async/await`
for the high-level integration between `Io` and the state which allows
us to simply suspend until we can send the message, avoiding the need
for a dedicated buffer.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Reactor Scram <ReactorScram@users.noreply.github.com>
Closes #5026. Closes #5879.
On the resource-constrained Windows Server 2022 test VM, the median
sign-in time dropped from 5.0 seconds to 2.2 seconds.
# Changes
- Measure end-to-end connection time in the GUI process
- Use `ipconfig` instead of PowerShell to flush DNS faster
- Activate DNS control by manipulating the Windows Registry directly
instead of calling PowerShell (see the sketch after this list)
- Remove the deactivate step when changing DNS servers (seals a DNS leak
when roaming between networks)
- Remove the completely redundant `Set-DnsClientServerAddress` step from
activating DNS control
- Remove the `Remove-NetRoute` PowerShell cmdlet that seems to do nothing
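For illustration, setting the resolver for one interface through the registry might look like the sketch below. The key path is the standard TCP/IP per-interface path; the function name and GUID handling are placeholders, not the code from this PR.

```rust
use winreg::enums::{HKEY_LOCAL_MACHINE, KEY_SET_VALUE};
use winreg::RegKey;

/// Hypothetical sketch: write the `NameServer` value for one interface
/// directly instead of spawning a PowerShell subprocess.
fn set_dns_server(interface_guid: &str, server: &str) -> std::io::Result<()> {
    let path = format!(
        r"SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces\{interface_guid}"
    );

    RegKey::predef(HKEY_LOCAL_MACHINE)
        .open_subkey_with_flags(path, KEY_SET_VALUE)?
        .set_value("NameServer", &server)?;

    Ok(())
}
```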
# Benchmark 7
- Optimized release builds
- x86-64 constrained VM (1 CPU thread, 2 GB RAM)
Main with measurement added, `c1c99197e` from #5864
- 6.0 s
- 5.5 s
- 4.1 s
- 5.0 s
- 4.1 s
- (Median = 5.0 s)
Main with speedups added, `2128329f9` from #5375, this PR
- 3.7 s
- 2.2 s
- 1.9 s
- 2.3 s
- 2.0 s
- (Median = 2.2 s)
```[tasklist]
### Next steps
- [x] Benchmark on the resource-constrained VM
- [x] Move raw benchmark data to a comment and summarize in the description
- [x] Clean up tasks that don't need to be in the commit
- [x] Merge
```
# Hypothetical further optimizations
- Ditch the `netsh` subprocess in `set_ips`
---------
Signed-off-by: Reactor Scram <ReactorScram@users.noreply.github.com>
Following the removal of the return type from the callback functions in
#5839, we can now move the use of the `Callbacks` one layer up the stack
and decouple them entirely from the `Tunnel`.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Gabi <gabrielalejandro7@gmail.com>
One of the lines in the `sysctls` section of the example
`docker-compose.yml` file was duplicated:
- net.ipv4.conf.all.src_valid_mark=1
So I deleted the duplicate to make it clearer.
Signed-off-by: Adrián Baena García <adrianbaenagarcia@gmail.com>
Currently, only connlib's UDP sockets for sending and receiving STUN &
WireGuard traffic are protected from routing loops. This was done via
the `Sockets::with_protect` function. Connlib has additional sockets
though:
though:
- A TCP socket to the portal.
- UDP & TCP sockets for DNS resolution via hickory.
Both of these can incur routing loops on certain platforms which becomes
evident as we try to implement #2667.
To fix this, we generalise the idea of "protecting" a socket via a
`SocketFactory` abstraction. By allowing the different platforms to
provide a specialised `SocketFactory`, anything Linux-based can give
special treatment to the socket before handing it to connlib.
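Conceptually, the abstraction boils down to something like this sketch; the real trait in connlib differs in detail:

```rust
use std::io;
use std::net::SocketAddr;

/// Each platform supplies a factory; on Linux-based platforms it can
/// e.g. mark or bind the socket before connlib ever uses it, so packets
/// to the portal, relays, or DNS servers don't loop back into the TUN
/// device.
trait SocketFactory<S>: Send + Sync {
    fn bind(&self, addr: SocketAddr) -> io::Result<S>;
}

/// A plain factory for platforms that need no special treatment.
struct StdUdpFactory;

impl SocketFactory<std::net::UdpSocket> for StdUdpFactory {
    fn bind(&self, addr: SocketAddr) -> io::Result<std::net::UdpSocket> {
        std::net::UdpSocket::bind(addr)
    }
}
```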
As an additional benefit, this allows us to remove the `Sockets`
abstraction from connlib's API again because we can now initialise it
internally via the provided `SocketFactory` for UDP sockets.
---------
Signed-off-by: Gabi <gabrielalejandro7@gmail.com>
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
Additional verbosity doesn't give us a lot more useful information but
spams the log a lot. We don't compile with `cargo --verbose` anywhere
else either.
When the property-based state machine test was first created, I
envisioned that we could also easily test advancing time. Unfortunately,
the tricky part of advancing time is to correctly encode the _expected_
behaviour as it requires knowledge of all timeouts etc.
Thus, the `Tick` transition has been left lingering and doesn't actually
test much. It is obviously still sampled by the test runner and thus
"wastes" test cases that don't end up exercising anything meaningful
because the time advancements are < 1000ms.
There are plans to test time-related things more thoroughly by
implementing delays between applying `Transmit`s. Until then, we can
remove the `Tick` transition.
Connlib's routing logic and networking code is entirely platform
agnostic. The only platform-specific bit is how we interact with the TUN
device. From connlib's perspective though, all it needs is an interface
for reading and writing. How the device gets initialised and updated is
client-business.
For the most part, this is the same on all platforms: We call callbacks
and the client updates the state accordingly. The only annoying bit here
is that Android recreates the TUN interface on every update and thus our
old file descriptor is invalid. The current design works around this by
returning the new file descriptor on Android. This is a problematic
design for several reasons:
- It forces the callback handler to finish synchronously, halting
connlib until it completes.
- The synchronous nature also means we cannot replace the callbacks with
events as events don't have a return value.
To fix this, we introduce a new `set_tun` method on `Tunnel`. This moves
the business of how the `Tun` device is created up to the client. The
clients are already platform-specific so this makes sense. In a future
iteration, we can move all the various `Tun` implementations all the way
up to the client-specific crates, thus co-locating the platform-specific
code.
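A sketch of the shape of this API, using stand-in types:

```rust
use std::io;

/// From connlib's perspective, the TUN device is just something to read
/// from and write to; how it is created is the client's business.
trait Tun: Send {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize>;
    fn write(&mut self, packet: &[u8]) -> io::Result<usize>;
}

struct Tunnel {
    tun: Option<Box<dyn Tun>>,
}

impl Tunnel {
    /// Swapping the device in place is infallible; on Android, the old
    /// (now invalid) file descriptor is simply dropped.
    fn set_tun(&mut self, tun: Box<dyn Tun>) {
        self.tun = Some(tun);
    }
}
```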
Initialising `Tun` from the outside surfaces another issue: on Windows,
the routes are still set via the `Tun` handle. To fix this, we introduce
a `make_tun` function on `TunDeviceManager` so that it can remember the
interface index on Windows, allowing the setting of routes to move into
`TunDeviceManager`.
This simplifies several of connlib's APIs which are now infallible.
Resolves: #4473.
---------
Co-authored-by: Reactor Scram <ReactorScram@users.noreply.github.com>
Co-authored-by: conectado <gabrielalejandro7@gmail.com>
I started a playbook for publishing GUI releases; I didn't see any other
one around.
I think there's a middle step I'm not clear on:
1. Open this PR and get it approved
2. Do something? Publish the draft release maybe? Run a special CI
workflow?
3. Merge this PR to update the changelog and bump the versions in Git