Currently, `tunnel_test` uses a rather naive approach when dispatching
`Transmit`s. In particular, it checks client, gateway and relay
separately whether they "want" a certain packet. In a real network,
these packets are routed based on their IP.
To mimic something similar, we introduce a `Host` abstraction that wraps
each component: client, gateway and relay. Additionally, we introduce a
`RoutingTable` where we can add and remove hosts. With these things in
place, routing a `Transmit` is as easy as looking up the destination IP
in the routing table and dispatching to the corresponding host.
Our hosts are type-safe: client, gateway and relay have different types.
Thus, we abstract over them using a `HostId` in order to know, which
host a certain message is for. Following these patches, we can easily
introduce multiple gateways and relays to this test by simply making
more entries in this routing table. This will increase the test coverage
of connlib.
Lastly, this patch massively increases the performance of `tunnel_test`.
It turns out that previously, we spent a lot of CPU cycles accessing
"random" IPs from very large iterators. With this patch, we take a
limited range of 100 IPs that we sample from, thus drastically
increasing performance of this test. The configured 1000 testcases
execute in 3s on my machine now (with opt-level 1 which is what we use
in CI).
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Closes#5052
On my dev VMs:
- systemd-resolved = 15 ms to flush
- Windows = 600 ms to flush
I tested with the headless Clients on Linux and Windows and it fixes the
issue. On Windows I didn't replicate the issue with the GUI Client, on
Linux this patch also fixes it for the GUI Client.
The [HTTP 1.1 RFC](https://datatracker.ietf.org/doc/html/rfc2616) states
that HTTP headers should be US-ASCII. This is not the case when the
macOS Client is run from a host that has a non-English language selected
as its system default due to the way we build the user agent.
This PR fixes that by normalizing how we build the user agent by more
granularly selecting which fields compose it, and not just relying on
OS-provided version strings that may contain non-ASCII characters.
fixes https://github.com/firezone/firezone/issues/5467
---------
Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Currently, connlib re-initialises the TUN device on Linux every time its
configuration gets updated such as when roaming from one network to
another. This is unnecessary. Instead, we can adopt the same approach as
already used on MacOS, iOS and Windows and only initialise it if it
doesn't exist yet.
Doing so surfaces an interesting bug. Currently, attempting to
re-initialise the TUN device fails with a warning:
> connlib_client_shared::eventloop: Failed to set interface on tunnel:
Resource busy (os error 16)
See
https://github.com/firezone/firezone/actions/runs/9656570163/job/26634409346#step:7:103
for an example. As a consequence, we never actually trigger the
`on_set_interface_config` callback and thus never actually set the new
IPs on the TUN device.
Now that we _are_ calling this callback, we execute
`TunDeviceManager::set_ips` which first clears all IPs from the device
and then attaches the new ones. A consequence of this is that the Linux
kernel will clear all routes associated with the device. This clashes
with an optimisation we have in `TunDeviceManager` where we remember the
previously set routes and don't set new ones if they are the same.
This `HashSet` needs to be cleared upon setting new IPs in order to
actually set the new routes correctly afterwards. Without that, we stop
receiving traffic on the TUN device.
Logging the routes in the span and in an event creates duplicate
information so we remove the former. Additionally, we add a debug log in
case we short-circuit the function.
All allowed IPs can be a fair few which clutters the log. Remove the
`HashSet` from the error and also remove the stuttering; the error
already says "Packet not allowed".
This PR is the "client-side" of things for #4994. Up until now, when a
user wanted to connect to a DNS resource, we would establish a
connection to the gateway and pass along the domain we are trying to
access. The gateway would resolve that domain and send the response back
to the client, allowing them to finally send a DNS response.
Now, we instantly assign and respond with 4x A and 4x AAAA records to
any query for one of our DNS resources. Upon the first IP packet for one
of these "proxy IPs", we select a gateway, establish a connection and
send our proxy IPs along. The gateway then performs the necessary
mangling and NATing of all packets. See #5354 for details.
Resolves: #4994.
Resolves: #5491.
---------
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
Closes#5481
With this, I can connect to the staging portal without a build.rs or any
extra env var setup
<img width="387" alt="image"
src="https://github.com/firezone/firezone/assets/13400041/9c080b36-3a76-49c7-b706-20723697edc7">
```[tasklist]
### Next steps
- [x] Split out a refactor PR for `ConnectArgs` (#5488)
- [x] Try doing this for other Clients
- [x] Check Gateway
- [x] Check Tauri Client
- [x] Change to `app_version`
- [x] Open for review
- [ ] Use `option_env` so that `FIREZONE_PACKAGE_VERSION` can still override the Cargo.toml version for local testing
- [ ] Check Android Client
- [ ] Check Apple Client
```
---------
Signed-off-by: Reactor Scram <ReactorScram@users.noreply.github.com>
When a user sends the first packet to a resource, we generate a
"connection intent" and consult the portal, which gateway to use for
this resource. This process is throttled to only generate a new intent
every 2s.
Once we know, which gateway to use for a certain resource, we initiate a
connection via snownet. This involves an OFFER-ANSWER handshake with the
gateway. A connection for which we have sent an offer and have not yet
received an answer is what we call a "pending connection".
In case the connection setup takes longer than 2s, we will generate
another connection intent which can point to the same gateway that we
are currently setting up a connection with.
Currently, encountering a "pending connection" during another connection
setup is treated as an error which results in some state being
cleaned-up / removed. This is where the bug surfaces: If we remove the
state for a resource as a result of a 2nd connection intent and then
receive the response of the first one, we will be left with no state
that knows about this resource.
We fix this by refactoring `create_or_reuse_connection` to be atomic in
regards to its state changes: All checks that fail the function are
moved to the top which means there is no state to clean up in case of an
error. Additionally, we model the case of a "pending connection" using
an `Option` to not flood the logs with "pending connection" warnings as
those are expected during normal operation.
Fixes: #5385
As part of #4994, the IP translation and mangling of packets to and from
DNS resources is moved to the gateway. This PR represents the
"gateway-half" of the required changes.
Eventually, the client will send a list of proxy IPs that it assigned
for a certain DNS resource. The gateway assigns each proxy IP to a real
IP and mangles outgoing and incoming traffic accordingly. There are a
number of things that we need to take care of as part of that:
- We need to implement NAT to correctly route traffic. Our NAT table
maps from source port* and destination IP to an assigned port* and real
IP. We say port* because that is only true for UDP and TCP. For ICMP, we
use the identifier.
- We need to translate between IPv4 and IPv6 in case a DNS resource e.g.
only resolves to IPv6 addresses but the client gave out an IPv4 proxy
address to the application. This translation is was added in #5364 and
is now being used here.
This PR is backwards-compatible because currently, clients don't send
any IPs to the gateway. No proxy IPs means we cannot do any translation
and thus, packets are simply routed through as is which is what the
current clients expect.
---------
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
Currently, we use `sample::Index` and `sample::Selector` to
deterministically select parts of our state. Originally, this was done
because I did not yet fully understand, how `proptest-state-machine`
works.
The available transitions are always sampled from the current state,
meaning we can directly use `sample::select` to pick an element like an
IP address from a list. This has several advantages:
- The transitions are more readable when debug-printed because they now
contain the actual data that is being used.
- I _think_ this results in better shrinking because `sample::select`
will perform a binary search for the problematic value.
- We can more easily implement transitions that _remove_ state.
Currently, we cannot remove things from the `ReferenceState` because the
system-under-test would also have to index into the `ReferenceState` as
part of executing its transition. By directly embedding all necessary
information in the transition, this is much simpler.
- Removes version numbers from infra components (elixir/relay)
- Removes version bumping from Rust workspace members that don't get
published
- Splits release publishing into `gateway-`, `headless-client-`, and
`gui-client-`
- Removes auto-deploying new infrastructure when a release is published.
Use the Deploy Production workflow instead.
Fixes#4397
To make proptests efficient, it is important to generate the set of
possible test cases algorithmically instead of filtering through
randomly generated values.
This PR makes the strategies for upstream DNS servers and IP networks
more efficient by removing the filtering.
Currently, `tunnel_test` only tests DNS resources with fully-qualified
domain names. Firezone also supports wildcard domains in the forms of
`*.example.com` and `?.example.com`.
To include these in the tests, we generate a bunch of DNS records that
include various subdomains for such wildcard DNS resources.
When sampling DNS queries, we already take them from the pool of global
DNS records which now also includes these subdomains, thus nothing else
needed to be changed to support testing these resources.
Refs #3636 (This pays down some of the technical debt from Linux DNS)
Refs #4473 (This partially fulfills it)
Refs #5068 (This is needed to make `FIREZONE_DNS_CONTROL` mandatory)
As of dd6421:
- On both Linux and Windows, DNS control and IP setting (i.e.
`on_set_interface_config`) both move to the Client
- On Windows, route setting stays in `tun_windows.rs`. Route setting in
Windows requires us to know the interface index, which we don't know in
the Client code. If we could pass opaque platform-specific data between
the tunnel and the Client it would be easy.
- On Linux, route setting moves to the Client and Gateway, which
completely removes the `worker` task in `tun_linux.rs`
- Notifying systemd that we're ready moves up to the headless Client /
IPC service
```[tasklist]
### Before merging / notes
- [x] Does DNS roaming work on Linux on `main`? I don't see where it hooks up. I think I only set up DNS in `Tun::new` (Yes, the `Tun` gets recreated every time we reconfigure the device)
- [x] Fix Windows Clients
- [x] Fix Gateway
- [x] Make sure connlib doesn't get the DNS control method from the env var (will be fixed in #5068)
- [x] De-dupe consts
- [ ] ~~Add DNS control test~~ (failed)
- [ ] Smoke test Linux
- [ ] Smoke test Windows
```
Currently, `tunnel_test` only sends ICMPs to CIDR resources. We also
want to test certain properties in regards to DNS resources. In
particular, we want to test:
- Given a DNS resource, can we query it for an IP?
- Can we send an ICMP packet to the resolved IP?
- Is the mapping of proxy IP to upstream IP stable?
To achieve this, we sample a list of `IpAddr` whenever we add a DNS
resource to the state. We also add the transition
`SendQueryToDnsResource`. As the name suggests, this one simulates a DNS
query coming from the system for one of our resources. We simulate A and
AAAA queries and take note of the addresses that connlib returns to us
for the queries.
Lastly, as part of `SendICMPPacketToResource`, we now may also sample
from a list of IPs that connlib gave us for a domain and send an ICMP
packet to that one.
There is one caveat in this test that I'd like to point out: At the
moment, the exact mapping of proxy IP to real IP is an implementation
detail of connlib. As a result, I don't know which proxy IP I need to
use in order to ping a particular "real" IP. This presents an issue in
the assertions: Upon the first ICMP packet, I cannot assert what the
expected destination is. Instead, I need to "remember" it. In case we
send another ICMP packet to the same resource and happen to sample the
same proxy IP, we can then assert that the mapping did not change.
This is similar to #4097 and #4585 but for the entire `ClientState` and
`GatewayState`. We also do it in the context of a property-based test
with the vision that we can deterministically explore a large space of
state transitions and see where our main property breaks: Being able to
send an ICMP packet from the client to the gateway.
In other words, we now correctly pass all the `Transmit`s back and forth
between the components as if they would receive it from the network. Due
to the nature of property-based tests, this already exercises a very
large input space. For example, if the client does not have an IPv6
socket and the gateway doesn't have an IPv4 socket, this test already
checks whether we then correctly fall back to using a relay (because the
allocation we make on the relay is the only network path where the STUN
requests pass through).
What this does not (yet) do is set up a proper network topology. The
`dispatch_transmit` function will happily "route" a `Transmit` from e.g.
the client to the gateway even if they are in different subnets. In
other words, these tests assume that the actual network itself works and
we can exchange UDP packets between the components.
For now, we only send ICMPs to CIDR resources. As a next step, we can
extend this to DNS resources by sending DNS queries for our DNS
resources and then sending an ICMP to the resolved IP.
This PR introduces site's `Status`. That's used to report to the client
the status, either, unknown, online or offline, mostly as a hint to
users as what's wrong with a connection.
This are the criteria for an online or offline resource
* If all sites related to a resource are offline the resource is
considered offline, since there's no gateway that can respond to that
resource's connection
* If any site is online the resource is online, since that same peer can
be used to reach that resource
* Any other case is unknown
Right now resources are single site so it doesn't matter too much but
tracking online/offline per-site instead of per-gateway or resource
seems like the better long-term solution.
The way to "find out" the site's status is:
* If a response to a connection details is offline, all sites related to
that resource must be offline otherwise there would've been a gateway in
the response
* At the point we connect to a gateway, the site that corresponds to
that gateway must be online
* When a connection to a peer stops it's considered unknown again
Fixes#4738
There was an error on how resource filters were deserialized in the
gateway:
* we always assumed that there would be the ports included but the
portal sends no port down when the "all" range is allowed
* also we didn't support the resource_updated message, this fixes it,
and resources allow-list can be changes in-flight
This implements traffic filtering on the gateway. Filters are set on the
portal, per-resource, in an allow-list manner.
If no filters exist for a given resource all packets are allowed,
otherwise only packets that matches port/protocol for the filters are
allowed, otherwise they are dropped.
Filters can be either TCP, UDP or ICMP. For the first 2 multiple ports
can be given. Furthermore, multiple filters can exists for the same
resource.
To be able to add and remove filters with the same IP/CIDR we keep
around the whole list of filters for any given peer using an ID map and
recalculate the IP each time something is added is removed.
This allows us to remove filters and simply recalculate the allowlist
for each IP.
Furthermore, for any IP, all rules apply, meaning if there are multiple
IPs that apply for a resource all port/protocol combinations for that IP
will apply.
This works well right now for DNS resources, since access is requested
by DNS name, then the resource for that DNS name will arrive at the
gateway, and the port filtering will apply given that resource(and any
other resource with the same IP).
However, since the client has no idea of the filters, it can't request
the resource access based on the port/protocol combination and we are
still using the most specific("longest match") IP. This will mean that
for overlapping CIDR resources, only the rules for the most specific
will be used, even if the gateway supports applying them all, since it
will not have the other resources. This will be solved in #4789.
It can also lead to some weirdness, let's say that you have 10.0.0.0/24
-> TCP/80 and 10.0.0.0/16 -> TCP/443 for your user.
The user tries to access 10.0.0.1, and will then only be allowed port
80. At some point the user might access 10.1.0.1 and it will be allowed
port 443. But from that point on, the user will be allowed to access 80
and 443 in 10.0.0.1 because the rules correctly work on the gateway, the
problem is the client side. Again, #4789 will fix this.
Left for next PRs (in tentative order!):
- #4792
- #4789
Depends on: #4773.
Resolves#2030.
Resolves#4791.
---------
Co-authored-by: Jamil Bou Kheir <jamilbk@users.noreply.github.com>
Bumps [anyhow](https://github.com/dtolnay/anyhow) from 1.0.81 to 1.0.82.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/dtolnay/anyhow/releases">anyhow's
releases</a>.</em></p>
<blockquote>
<h2>1.0.82</h2>
<ul>
<li>Documentation improvements</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="074bdea1c7"><code>074bdea</code></a>
Release 1.0.82</li>
<li><a
href="47a4fbfa36"><code>47a4fbf</code></a>
Merge pull request <a
href="https://redirect.github.com/dtolnay/anyhow/issues/360">#360</a>
from dtolnay/docensure</li>
<li><a
href="c5af1db020"><code>c5af1db</code></a>
Make ensure's doc comment apply to the cfg(not(doc)) macro too</li>
<li><a
href="bebc7a2fe4"><code>bebc7a2</code></a>
Revert "Temporarily disable miri on doctests"</li>
<li><a
href="f2c4db9b47"><code>f2c4db9</code></a>
Update ui test suite to nightly-2024-03-31</li>
<li><a
href="028cbeedf5"><code>028cbee</code></a>
Explicitly install a Rust toolchain for cargo-outdated job</li>
<li><a
href="7a4cac5192"><code>7a4cac5</code></a>
Merge pull request <a
href="https://redirect.github.com/dtolnay/anyhow/issues/358">#358</a>
from dtolnay/workspacewrapper</li>
<li><a
href="939db012c2"><code>939db01</code></a>
Apply RUSTC_WORKSPACE_WRAPPER</li>
<li><a
href="9f84a37551"><code>9f84a37</code></a>
Temporarily disable miri on doctests</li>
<li><a
href="45e5a589e9"><code>45e5a58</code></a>
Ignore dead code lint in test</li>
<li>Additional commits viewable in <a
href="https://github.com/dtolnay/anyhow/compare/1.0.81...1.0.82">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Gabi <gabrielalejandro7@gmail.com>
Calling `std::process::exit` won't let the DNS deactivation code runs.
For some control methods (systemd-resolved) this doesn't matter. For
etc-resolvconf and Windows, we are responsible for cleaning up DNS.
```[tasklist]
- [x] Replicate the issue
- [x] Fix it
- [x] Remove the fault injection code
```
Closes#4784
Whenever we receive a `relays_presence` message from the portal, we
invalidate the candidates of all now disconnected relays and make
allocations on the new ones. This triggers signalling of new candidates
to the remote party and migrates the connection to the newly nominated
socket.
This still relies on #4613 until we have #4634.
Resolves: #4548.
---------
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
Opening this in a basic version that asserts sending of connection
intents to resource IPs. To do this, we add some boilerplate that sets
up the state machine test in general. Together with the
[work](d575dc3866/rust/connlib/snownet/tests/lib.rs (L296-L824))
that I've done on the `snownet` tests, this can then be extended to
describe the entire state machine of connlib and letting `proptest`
search for inputs & combinations that break stuff.
Some more `Transition`s that I'd expect we can implement:
- Add DNS resource
- Reconnect (i.e. roam networks)
- Remove resource
The public API of `Tunnel` isn't actually very large: We add and remove
resources, set upstream DNS servers and call `reconnect`. I think the
bet here is that we can implement the reference state machine in a very
simple way. For example, once we have added a resource and handled the
connection-intent, we should be able to send an ICMP packet through the
tunnel. I've already worked out how to pass `Transmit`s back and forth
between relay, client and gateway (see linked `snownet` tests above). If
we port that to this state machine test, we can actually exercise all
the code paths that are required to encapsulate / decapsulate those
packets whilst asserting against something simple like "packet pops out
at the other end".
Because the setup of the test is also a proptest-strategy, we can even
add the network topology as a variable by configuring the `Firewall`
(see `snownet` tests) dynamically with or without blocking rules and
thus force the entire tunnel through an (in-memory) relay.
Related: #4589.
This is another step towards #4548. The portal now includes a list of
relays as part of the "init" message. Any time we receive an "init", we
will now upsert those relays based on their ID. This requires us to
change our internal bookkeeping of relays from indexing them by address
to indexing by ID.
To ensure that this works correctly, the unit tests are rewritten to use
the new `upsert_relays` API.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
Fixes#4488
```[tasklist]
# Before merging
- [x] There's one call site that won't compile on Linux. Make this cross-platform.
- [x] Does the rule get removed every time when you quit gracefully?
- [x] Will this NRPT rule prevent connlib from re-resolving the portal IP if it needs to?
- [x] Test network switching. Does this work worse, better, or the same?
- [ ] Is the Windows DNS cache flushed exactly when it needs to be?
```
- After connlib connects to the portal, we add an NRPT rule asking
Windows to send **all** DNS queries to our sentinels. This should also
be called whenever the interface is re-configured, which might change
the sentinel IPs
- When exiting gracefully, we delete the rule to restore normal DNS
behavior without having to back up and restore the other IPs
- We also delete the rule at startup so that if Firezone crashes or
misbehaves, restarting it should restore normal DNS
- We also flush the system-wide DNS cache whenever we claim different
routes. This may flush too often, and it may also miss some flushes that
we should do. It needs double-checking.
- There is still a gap when changing networks, DNS can leak there, but I
don't think it's worse than before.