Currently, `tunnel_test` only sends ICMPs to CIDR resources. We also
want to test certain properties in regards to DNS resources. In
particular, we want to test:
- Given a DNS resource, can we query it for an IP?
- Can we send an ICMP packet to the resolved IP?
- Is the mapping of proxy IP to upstream IP stable?
To achieve this, we sample a list of `IpAddr` whenever we add a DNS
resource to the state. We also add the transition
`SendQueryToDnsResource`. As the name suggests, this one simulates a DNS
query coming from the system for one of our resources. We simulate A and
AAAA queries and take note of the addresses that connlib returns to us
for the queries.
Lastly, as part of `SendICMPPacketToResource`, we now may also sample
from a list of IPs that connlib gave us for a domain and send an ICMP
packet to that one.
There is one caveat in this test that I'd like to point out: At the
moment, the exact mapping of proxy IP to real IP is an implementation
detail of connlib. As a result, I don't know which proxy IP I need to
use in order to ping a particular "real" IP. This presents an issue in
the assertions: Upon the first ICMP packet, I cannot assert what the
expected destination is. Instead, I need to "remember" it. In case we
send another ICMP packet to the same resource and happen to sample the
same proxy IP, we can then assert that the mapping did not change.
This is similar to #4097 and #4585 but for the entire `ClientState` and
`GatewayState`. We also do it in the context of a property-based test
with the vision that we can deterministically explore a large space of
state transitions and see where our main property breaks: Being able to
send an ICMP packet from the client to the gateway.
In other words, we now correctly pass all the `Transmit`s back and forth
between the components as if they would receive it from the network. Due
to the nature of property-based tests, this already exercises a very
large input space. For example, if the client does not have an IPv6
socket and the gateway doesn't have an IPv4 socket, this test already
checks whether we then correctly fall back to using a relay (because the
allocation we make on the relay is the only network path where the STUN
requests pass through).
What this does not (yet) do is set up a proper network topology. The
`dispatch_transmit` function will happily "route" a `Transmit` from e.g.
the client to the gateway even if they are in different subnets. In
other words, these tests assume that the actual network itself works and
we can exchange UDP packets between the components.
For now, we only send ICMPs to CIDR resources. As a next step, we can
extend this to DNS resources by sending DNS queries for our DNS
resources and then sending an ICMP to the resolved IP.
This PR introduces site's `Status`. That's used to report to the client
the status, either, unknown, online or offline, mostly as a hint to
users as what's wrong with a connection.
This are the criteria for an online or offline resource
* If all sites related to a resource are offline the resource is
considered offline, since there's no gateway that can respond to that
resource's connection
* If any site is online the resource is online, since that same peer can
be used to reach that resource
* Any other case is unknown
Right now resources are single site so it doesn't matter too much but
tracking online/offline per-site instead of per-gateway or resource
seems like the better long-term solution.
The way to "find out" the site's status is:
* If a response to a connection details is offline, all sites related to
that resource must be offline otherwise there would've been a gateway in
the response
* At the point we connect to a gateway, the site that corresponds to
that gateway must be online
* When a connection to a peer stops it's considered unknown again
Fixes#4738
There was an error on how resource filters were deserialized in the
gateway:
* we always assumed that there would be the ports included but the
portal sends no port down when the "all" range is allowed
* also we didn't support the resource_updated message, this fixes it,
and resources allow-list can be changes in-flight
This implements traffic filtering on the gateway. Filters are set on the
portal, per-resource, in an allow-list manner.
If no filters exist for a given resource all packets are allowed,
otherwise only packets that matches port/protocol for the filters are
allowed, otherwise they are dropped.
Filters can be either TCP, UDP or ICMP. For the first 2 multiple ports
can be given. Furthermore, multiple filters can exists for the same
resource.
To be able to add and remove filters with the same IP/CIDR we keep
around the whole list of filters for any given peer using an ID map and
recalculate the IP each time something is added is removed.
This allows us to remove filters and simply recalculate the allowlist
for each IP.
Furthermore, for any IP, all rules apply, meaning if there are multiple
IPs that apply for a resource all port/protocol combinations for that IP
will apply.
This works well right now for DNS resources, since access is requested
by DNS name, then the resource for that DNS name will arrive at the
gateway, and the port filtering will apply given that resource(and any
other resource with the same IP).
However, since the client has no idea of the filters, it can't request
the resource access based on the port/protocol combination and we are
still using the most specific("longest match") IP. This will mean that
for overlapping CIDR resources, only the rules for the most specific
will be used, even if the gateway supports applying them all, since it
will not have the other resources. This will be solved in #4789.
It can also lead to some weirdness, let's say that you have 10.0.0.0/24
-> TCP/80 and 10.0.0.0/16 -> TCP/443 for your user.
The user tries to access 10.0.0.1, and will then only be allowed port
80. At some point the user might access 10.1.0.1 and it will be allowed
port 443. But from that point on, the user will be allowed to access 80
and 443 in 10.0.0.1 because the rules correctly work on the gateway, the
problem is the client side. Again, #4789 will fix this.
Left for next PRs (in tentative order!):
- #4792
- #4789
Depends on: #4773.
Resolves#2030.
Resolves#4791.
---------
Co-authored-by: Jamil Bou Kheir <jamilbk@users.noreply.github.com>
Calling `std::process::exit` won't let the DNS deactivation code runs.
For some control methods (systemd-resolved) this doesn't matter. For
etc-resolvconf and Windows, we are responsible for cleaning up DNS.
```[tasklist]
- [x] Replicate the issue
- [x] Fix it
- [x] Remove the fault injection code
```
Closes#4784
Whenever we receive a `relays_presence` message from the portal, we
invalidate the candidates of all now disconnected relays and make
allocations on the new ones. This triggers signalling of new candidates
to the remote party and migrates the connection to the newly nominated
socket.
This still relies on #4613 until we have #4634.
Resolves: #4548.
---------
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
Opening this in a basic version that asserts sending of connection
intents to resource IPs. To do this, we add some boilerplate that sets
up the state machine test in general. Together with the
[work](d575dc3866/rust/connlib/snownet/tests/lib.rs (L296-L824))
that I've done on the `snownet` tests, this can then be extended to
describe the entire state machine of connlib and letting `proptest`
search for inputs & combinations that break stuff.
Some more `Transition`s that I'd expect we can implement:
- Add DNS resource
- Reconnect (i.e. roam networks)
- Remove resource
The public API of `Tunnel` isn't actually very large: We add and remove
resources, set upstream DNS servers and call `reconnect`. I think the
bet here is that we can implement the reference state machine in a very
simple way. For example, once we have added a resource and handled the
connection-intent, we should be able to send an ICMP packet through the
tunnel. I've already worked out how to pass `Transmit`s back and forth
between relay, client and gateway (see linked `snownet` tests above). If
we port that to this state machine test, we can actually exercise all
the code paths that are required to encapsulate / decapsulate those
packets whilst asserting against something simple like "packet pops out
at the other end".
Because the setup of the test is also a proptest-strategy, we can even
add the network topology as a variable by configuring the `Firewall`
(see `snownet` tests) dynamically with or without blocking rules and
thus force the entire tunnel through an (in-memory) relay.
Related: #4589.
This is another step towards #4548. The portal now includes a list of
relays as part of the "init" message. Any time we receive an "init", we
will now upsert those relays based on their ID. This requires us to
change our internal bookkeeping of relays from indexing them by address
to indexing by ID.
To ensure that this works correctly, the unit tests are rewritten to use
the new `upsert_relays` API.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
Fixes#4488
```[tasklist]
# Before merging
- [x] There's one call site that won't compile on Linux. Make this cross-platform.
- [x] Does the rule get removed every time when you quit gracefully?
- [x] Will this NRPT rule prevent connlib from re-resolving the portal IP if it needs to?
- [x] Test network switching. Does this work worse, better, or the same?
- [ ] Is the Windows DNS cache flushed exactly when it needs to be?
```
- After connlib connects to the portal, we add an NRPT rule asking
Windows to send **all** DNS queries to our sentinels. This should also
be called whenever the interface is re-configured, which might change
the sentinel IPs
- When exiting gracefully, we delete the rule to restore normal DNS
behavior without having to back up and restore the other IPs
- We also delete the rule at startup so that if Firezone crashes or
misbehaves, restarting it should restore normal DNS
- We also flush the system-wide DNS cache whenever we claim different
routes. This may flush too often, and it may also miss some flushes that
we should do. It needs double-checking.
- There is still a gap when changing networks, DNS can leak there, but I
don't think it's worse than before.
Reducing the number of crates as outlined in #4470 would help with
detecting this sort of unused code because we could make more things
`pub(crate)` which allows the compiler to check whether code is actually
used.
Public API items are never subject to the dead-code analysis of the
compiler because they could be used by other crates.
A wildcard match was the underlying bug fixed in #4486. Despite being a
bit annoying in some cases, I think it is worth having this lint turned
on to ensure we don't wildcard match in situations where it can have bad
consequences, like `poll` functions.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Motivated by: #4340.
I also activated
[`clippy::unnnecessary_wraps`](https://rust-lang.github.io/rust-clippy/master/#/unnecessary_wraps)
which does create some false-positives for the platform-specific code
but is IMO overall a net-positive. With the amount of Rust code and
crates increasing, it is good to have tools point out simplifications
like these as they are otherwise hard to spot, especially across crate
boundaries.
* Move the resource changes to `ClientState` to unit test easier
* Add unit tests
* Set new config on update from portal
* Set parameters as told by portal on re-init
Fixes: #2728
Supersedes #4320, closes#4318
Updates the interface if effective dns have changed.
Fixes a bug where we could set upstream_dns to have sentinel dns
Adds corresponding unit tests.
---------
Signed-off-by: Gabi <gabrielalejandro7@gmail.com>
Co-authored-by: Reactor Scram <ReactorScram@users.noreply.github.com>
- [x] Updated log level string for client and gateways to info or higher
- [x] Update logs to hide DNS information
I also removed `hickory_resolve` errors which could contain sensitive
info from our general error and hide the logs that specifically relates
to them.
@bmanifold double checking that the log levels in the gateway's `*.tf`
files are just used for our own gateways.
Also, the relays still have `debug`, since only we see that I think that
makes sense but double checking with @jamilbk
Fixes: #3618.
---------
Signed-off-by: Gabi <gabrielalejandro7@gmail.com>
Co-authored-by: Reactor Scram <ReactorScram@users.noreply.github.com>
This updates connlib to follow the new guidelines described in #4262. I
only made the bare-minimum changes to the clients. With these changes
`reconnect` should only be called when the network interface actually
changed, meaning clients have to be updated to reflect that.
Currently, a failure during DNS resolution results in the client hanging
during the connection setup. Instead, we fall back to an empty list
which results in an empty DNS query result for the client.
That in turn will make most application consider the DNS request failed.
As far as I know, we don't currently retry these DNS requests, meaning a
user would have to sign-in and out again to fix this state.
Whilst not ideal, I think this is a better behaviour and what we
currently have where the initial connection just hangs.
Running as sudo / root causes a lot of problems for GUI programs, so
we're unwinding that. In this case we can go back to using Tauri's "open
URL" function, which is great.
Closes#4103
Refs #3713
Affects #3972 - I was finally able to debug it because it came up
constantly during this PR
Currently, an error returned by `Tunnel::poll_next_event` is only
logged. In other words, they are never fatal. This creates a tricky to
understand relationship on what kind of errors should be returned from
callbacks. Because connlib is used on multiple operating systems, it has
no idea how fatal a particular error is.
This PR removes all of these `Result` return values with the following
consequences:
- For Android, we now panic when a callback fails. This is a slight
change in behaviour. I believe that previously, any exception thrown by
a callback into Android was caught and returned as an error. Now, we
panic because in the FFI layer, we don't have any information on how
fatal the error is. For non-fatal errors, the Android app should simply
not throw an exception. The panics will cause the connlib task to be
shut down which triggers an `on_disconnect`.
- For Swift, there is no behaviour change. The FFI layer already did not
support `Result`s for those callbacks. I don't know how exceptions from
Swift are translated across the FFI layer but there is no change to what
we had before.
- For the Tauri client:
- I chose to log errors on ERROR level and continue gracefully for the
DNS resolvers.
- We panic in case the controller channel is full / closed. That should
really never happen in practice though unless we are currently shutting
down the app.
Resolves: #4064.
In the future we will want to refactor this to a builder pattern to
prevent the number of parameters from growing and have them clearer but
this works simply for now.
Found while discussing #4174
---------
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
This isn't really user-facing, so I marked it down from `feat` to
`chore`. Closes#3817
- If we exit gracefully, `/etc/resolv.conf` is reverted
- We always keep the `.before-firezone` backup in case we lose power and
the revert transaction is corrupted or rolled back
- We use a magic header to detect whether the last run was a crash or
not. If Firezone crashes and the user wants to modify their default DNS,
they need to delete that header so that Firezone won't accidentally
revert its backup and trash their change.
- All error variants for this module replaced with `anyhow::Error` since
they were never matched by callers.
I ran `cargo mutants` locally and it helped me validate the unit tests
and it picked up a `match` branch that I forgot to delete.
```[tasklist]
- [x] (Failed: Integration tests didn't like it) ~~Add the system default resolvers below Firezone's sentinels~~
- [x] `tracing::info` "Last run crashed" if we have to revert the file at startup
```
---------
Signed-off-by: Reactor Scram <ReactorScram@users.noreply.github.com>
This refactors `Session` to allow for commands to be sent to the
`Eventloop`. Currently, we only send a `Stop` command. With #3429, we
will add more commands like refreshing and updating the DNS servers.
Currently, we are passing a lot of data into `Session::connect`. Half of
this data is only needed to construct the URL we will use to connect to
the portal. We can simplify this by extracting a dedicated `LoginUrl`
component that captures and validates this data early.
Not only does this reduce the number of parameters we pass to
`Session::connect`, it also reduces the number of failure cases we have
to deal with in `Session::connect`. Any time the session fails, we have
to call `onDisconnected` to inform the client. Thus, we should perform
as much validation as we can early on. In other words, once
`Session::connect` returns, the client should be able to expect that the
tunnel is starting.
Closes#3815
Changes that are breaking (but these aren't in production so it should
be okay)
- Windows, renaming `device_id.json` to `firezone-id.json` to match the
rest of the code
- Linux GUI, storing the firezone-id under `/var/lib` instead of under
`$HOME`
- Linux GUI, bails out if not run with `sudo --preserve-env` by
detecting `$HOME == root` or `$USER != root`
---------
Signed-off-by: Reactor Scram <ReactorScram@users.noreply.github.com>
Previously, we called `onDisconnect` in two kinds of situations:
- With an error when we wanted the clients to clear the token
- Without an error when the token was still valid (i.e. after a call to
`disconnect` from the clients)
This is unnecessarily redundant. Firezone is designed to **not** have a
state of "signed in but disconnected". Thus, every time connlib calls
`disconnect`, we should clear the token and sign the user out.
At present, we only do this for errors with the control plane. Errors in
the actual tunnel are only logged and we continue trying to use the
tunnel. There are errors in the tunnel where we should also give up
(i.e. TUN device gone, fatal IO error, etc). At present, those are not
yet bubbled up but we will at some point. Once we have
https://github.com/firezone/firezone/pull/3682, it will be much easier
to create a type-safe contract that ensures we only disconnect on fatal
errors.
---------
Co-authored-by: Jamil Bou Kheir <jamilbk@users.noreply.github.com>
Co-authored-by: ReactorScram <ReactorScram@users.noreply.github.com>
Bumps [base64](https://github.com/marshallpierce/rust-base64) from
0.21.7 to 0.22.0.
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/marshallpierce/rust-base64/blob/master/RELEASE-NOTES.md">base64's
changelog</a>.</em></p>
<blockquote>
<h1>0.22.0</h1>
<ul>
<li><code>DecodeSliceError::OutputSliceTooSmall</code> is now
conservative rather than precise. That is, the error will only occur if
the decoded output <em>cannot</em> fit, meaning that
<code>Engine::decode_slice</code> can now be used with exactly-sized
output slices. As part of this, <code>Engine::internal_decode</code> now
returns <code>DecodeSliceError</code> instead of
<code>DecodeError</code>, but that is not expected to affect any
external callers.</li>
<li><code>DecodeError::InvalidLength</code> now refers specifically to
the <em>number of valid symbols</em> being invalid (i.e. <code>len % 4
== 1</code>), rather than just the number of input bytes. This avoids
confusing scenarios when based on interpretation you could make a case
for either <code>InvalidLength</code> or <code>InvalidByte</code> being
appropriate.</li>
<li>Decoding is somewhat faster (5-10%)</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="5d70ba7576"><code>5d70ba7</code></a>
Merge pull request <a
href="https://redirect.github.com/marshallpierce/rust-base64/issues/269">#269</a>
from marshallpierce/mp/decode-precisely</li>
<li><a
href="efb6c006c7"><code>efb6c00</code></a>
Release notes</li>
<li><a
href="2b91084a31"><code>2b91084</code></a>
Add some tests to boost coverage</li>
<li><a
href="9e9c7abe65"><code>9e9c7ab</code></a>
Engine::internal_decode now returns DecodeSliceError</li>
<li><a
href="a8a60f43c5"><code>a8a60f4</code></a>
Decode main loop improvements</li>
<li><a
href="a25be0667c"><code>a25be06</code></a>
Simplify leftover output writes</li>
<li><a
href="9979cc33bb"><code>9979cc3</code></a>
Keep morsels as separate bytes</li>
<li><a
href="37670c5ec2"><code>37670c5</code></a>
Bump dev toolchain version (<a
href="https://redirect.github.com/marshallpierce/rust-base64/issues/268">#268</a>)</li>
<li>See full diff in <a
href="https://github.com/marshallpierce/rust-base64/compare/v0.21.7...v0.22.0">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
Part of a small yak shave for #3817
The `atomicwrites` lib uses the atomic rename trick and does correct
fsyncs for us, so if we lose power while rewriting or reverting
`/etc/resolv.conf`, it should always be in a recoverable state. (Unless
the hard drive lies about fsync, but then it's beyond our control.)
---------
Signed-off-by: Reactor Scram <ReactorScram@users.noreply.github.com>