The current Rust workspace isn't as consistent as it could be. To make
navigation a bit easier, we move a few crates around. Generally, we
follow the idea that entry-points should be at the top-level. `rust/`
now looks like this (directories only):
```
.
├── cli # Firezone CLI
├── client-ffi # Entry point for Apple & Android
├── gateway # Gateway
├── gui-client # GUI client
├── headless-client # Headless client
├── libs # Library crates
├── relay # Relay
├── target # Compile artifacts
├── tests # Crates for testing
└── tools # Local tools
```
To further enforce this structure, we also drop the `firezone-` prefix
from all crates that are not top-level binary crates.
The downcasting abilities of `anyhow` are pretty powerful.
Unfortunately, they can also be a bit tricky to get right. Whilst `is`
and `downcast` work fine for any errors that are within the `anyhow`
error chain, they don't check the chain of errors prior to that. In
other words, if we already have a nested `std::error::Error` with
several causes, `anyhow` cannot downcast to these causes directly.
In order to avoid this footgun, we create a thin-layer on top of the
`anyhow` crate with some downcasting functions that always try to do the
right thing.
Currently, the Gateway logs all kinds of errors during packet processing
on WARN. Whilst it is generally good to be aware of warnings / errors,
some of these scenarios are particularly noisy. For various reasons, we
may not be able to route a packet arriving from the TUN device.
In such cases, we now return an `UnroutablePacket` error to the
event-loop which is special-cased to only log on DEBUG. It also includes
the 5 tuple as variables, which should make log analysis a bit easier if
we want to filter on specific parts of the 5 tuple.
Currently, the DNS records for the portal's hostname are only resolved
during startup. When the WebSocket connection fails, we try to reconnect
but only with the IPs that we have previously resolved. If the local IP
stack changed since then or the hostname now points to different IPs, we
will run into the reconnect-timeout configured in `phoenix-channel`.
To fix this, we re-resolve the portal's hostname every time the
WebSocket connection fails. For the Gateway, this is easy as we can
simply reuse the already existing `TokioResolver` provided by hickory.
For the Client, we need to write our own DNS client on top of our socket
factory abstraction to ensure we don't create a routing loop with the
resulting DNS queries. To simplify things, we only send DNS queries over
UDP. Those are not guaranteed to succeed but given that we do this on
every "hiccup", we already have a retry mechanism. We use the currently
configured upstream DNS servers for this.
Resolves: #10238
All of our Linux applications have a soft-dependency on systemd. That
is, in the default configuration, we expect systemd to be present on the
machine. The only exception here are the docker containers for Headless
Client and Gateway.
For the GUI client in particular, systemd is a hard-dependency in order
to control DNS on the system which we do via `systemd-resolved`. To
secure the communication between the GUI client and its tunnel process,
we automatically create a group called `firezone-client` to which the
user gets added. All members of the group are allowed to access the unix
socket which is used for IPC between the two processes. Membership in
this group is also a prerequisite for accessing any of the configuration
files.
On the first launch of the GUI client on a Linux system, this presents a
problem. For group membership changes to take the effect, the user needs
to reboot. We say that in the documentation but it is unclear whether
all users will read that thoroughly enough. To help the user, the GUI
client checks for membership of the current user in the group and alerts
the user via a dialog box if that isn't the case. This would all be fine
if it would actually work. Unfortunately, that check ends up being too
late in the process. If we aren't a member of the group, we cannot read
the device ID and bail early, thus never reaching the check and
terminating the process without any dialog box or user-visible error.
We could attempt to fix this by shuffling around some of the startup
init code. That is a sub-optimal solution however because it a) may get
broken again in the future and b) it means we have to delay
initialisation of telemetry until a much later point.
Given that this is only a problem on Linux, a better solution is to
simply not rely on the disk-based device ID at all. Instead, we can
integrate with systemd and deterministically derive a device ID from the
unique machine ID and a randomly chosen "app ID".
For backwards-compatibility reasons, the disk-based device ID is still
prioritised. For all new installs however, we will use the one based on
`/etc/machine-id`.
Several aspects of the Gateway's Debian package depend on `systemd`
being present. Without it, we don't have the necessary users and files
in place for the Gateway to function. With that specified, we can fail
the `postinst` script (and therefore the installation) if anything in
there goes wrong.
Within Firezone, there are multiple components that deal with DNS
queries. Two of those components are the `l4-udp-dns-server` and
`l4-tcp-dns-server`. Both of them are responsible for receiving DNS
queries on layer 4, i.e. UDP or TCP. In other words, they do _not_
operate on an IP level (which would be layer 3) but instead use
`UdpSocket` and `TcpListener` to receive queries and sent back
responses.
Right now, the interfaces of these crates are designed for the usecase
of receiving forwarded DNS queries from the CLient on the Gateway's TUN
device. This is a special-case of DNS resolution. When receiving a TXT
or SRV query for a domain that is covered by a DNS resources, Firezone
Client's will forward that query to the corresponding Gateway and
resolve it in its network context. SRV and TXT records are commonly used
for service discovery and as such, should be resolved in the network
context of the service, i.e. the site that assigned to the resource.
For that usecase, it made sense to allow each DNS server to listen on 1
IPv4 and 1 IPv6 address. Since then, our event-loop has evolved a bit,
being able to handle multiple inputs at once. As such, we can simplify
the API of these crates to only listen on a single address and instead
create multiple instances of them inside `Io`. Depending on how the
design of our DNS implementation for the Clients evolves, this may be
used to listen on multiple IPs later (e.g. from the `127.0.0.0/8`
subnet).
Related: #8263
In order to make the flow logs emitted by the Gateway more useful and
self-contained, we extend the `authorize_flow` message sent to the
Gateway with some more context around the Client and Actor of that flow.
In particular, we now also send the following to the Gateway:
- `client_version`
- `device_os_version`
- `device_os_name`
- `device_serial`
- `device_uuid`
- `device_identifier_for_vendor`
- `device_firebase_installation_id`
- `identity_id`
- `identity_name`
- `actor_id`
- `actor_email`
We only extend the `authorize_flow` message with these additional
properties. The legacy messages for 1.3.x Clients remain as is. For
those Clients, the above properties will be empty in the flow logs.
Resolves: #10690
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
Bumps [secrecy](https://github.com/iqlusioninc/crates) from 0.8.0 to
0.10.3.
<details>
<summary>Commits</summary>
<ul>
<li>See full diff in <a
href="https://github.com/iqlusioninc/crates/commits">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
With this PR we add `cargo-deb` to our CI pipeline and build a debian
package for the Gateway. The debian package comes with several
configuration files that make it easy for admins to start and maintain a
Gateway installation:
- The embedded systemd unit file is essentially the same one as what we
currently install with the install script with some minor modifications.
- The token is read from `/etc/firezone/gateway-token` and passed as a
systemd credential. This allows us to set the permissions for this file
to `0400` and have it owned by `root:root`.
- The configuration is read from `/etc/firezone/gateway-env`.
- Both of these changes basically mean the user should never need to
touch the unit file itself.
- The `sysusers` configuration file ensures the `firezone` user and
group are present on the system.
- The `tmpfiles` configuration file ensures the necessary directories
are present.
All of the above is automatically installed and configured using the
post-installation script which is called by `apt` once the package is
installed.
In addition to the Gateway, we also package a first version of the
`firezone-cli`. Right now, `firezone-cli` (installed as `firezone`) has
three subcommands:
- `gateway authenticate`: Asks for the Gateway's token and installs it
at `/etc/firezone/gateway-token`. The user doesn't have to know how we
manage this token and can trust that we are using safe defaults.
- `gateway enable`: Enables and starts the systemd service.
- `gateway disable`: Disables the systemd service.
Right now, the `.deb` file is only uploaded to the preview APT
repository and not attached to the release. It should therefore not yet
be user-visible unless somebody pokes around a lot, meaning we can defer
documentation to a later PR and start testing it from the preview
repository for our own purposes.
Related: #10598Resolves: #8484Resolves: #10681
Unix tools often write a newline at the end of a file. When using the
file's contents as a token, they need to match byte-for-byte otherwise
we cannot authenticate to the portal. To ensure that, we trim the
content from the file before creating the `SecretString`.
Even prior to #10373, failures in resolving a name on the Gateway for a
DNS resource resulted in a failure of setting up the DNS resource NAT.
Without the DNS resource NAT, packets for that resource bounced on the
Gateway because we didn't have any traffic filters.
A non-existent filter is being treated as a "traffic not allowed" error
and we respond with an ICMP permission denied error. For domains where
both the A and AAAA query result in NXDOMAIN, that isn't necessarily
appropriate. Instead, I am proposing that for such cases, we want to
return a regular "address/host unreachable" ICMP error instead of the
more specific "permission denied" variant.
To achieve that, we refactor the Gateway's peer state to be able to hold
an `Option<IpAddr>` inside the `TranslationState`. This allows us to
always insert an entry for each proxy IP, even if we did not resolve any
IPs for it. Then, when receiving traffic for a proxy IP where the
resolved IP is `None`, we reply with the appropriate ICMP error.
As part of this, we also simplify the assignment of the proxy IPs. With
the NAT64 module removed, there is no more reason to cross-assign IPv4
and IPv6 addresses. We can simply leave the mappings for e.g. IPv6 proxy
addresses empty if the AAAA query didn't resolve anything.
From the Client's perspective, not much changes. The DNS resource NAT
setup will now succeed, even for domains that don't resolve to anything.
This doesn't change any behaviour though as we are currently already
passing packets through for failed DNS resource NAT setups. The main
change is that we now send back a different ICMP error. Most
importantly, the "address/host unreachable variant" does not trigger
#10462.
Bumps
[futures-bounded](https://github.com/thomaseizinger/rust-futures-bounded)
from 0.2.4 to 0.3.0.
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/thomaseizinger/rust-futures-bounded/blob/main/CHANGELOG.md">futures-bounded's
changelog</a>.</em></p>
<blockquote>
<h2>0.3.0</h2>
<ul>
<li>Allow for multiple timer implementations.
See <a
href="https://redirect.github.com/thomaseizinger/rust-futures-bounded/pull/5">PR
5</a>.</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li>See full diff in <a
href="https://github.com/thomaseizinger/rust-futures-bounded/commits">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
For more permanent Gateway installations, or ones that are managed
through something else other than our install script, it is useful to
define the Gateway's token outside the systemd unit file.
Systemd provides support for credentials via the `LoadCredential` and
`LoadCredentialEncrypted` instructions. We just need a tiny bit of glue
code in the Gateway to actually use that if it is set.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
In #10347, we made sure that we always return all errors that happen
during a single tick of the event-loop. What we overlooked is that as
part of handling the errors, we need to use `continue` to jump to the
next one instead of returning directly from the function.
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
When a Client disconnects from a Gateway, we might still be receiving
packets that are either in-flight or are still being sent by the
resource. For some amount of time after a disconnect, this is expected
and not worth logging a warning for.
With this PR, we define this time to be 60s. If we cannot look up a
connection either by ID, session index or public key but the peer has
disconnected within the last 60s, we will now only print a DEBUG log
instead of a WARN.
Resolves: #10175
In order to allow the portal to more easily classify, what kind of
component is connecting, we extend the `get_user_agent` header to
include a component type instead of the generic `connlib/`.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
At present, the Gateway performs DNS resolution for A & AAAA queries via
`libc`. The `resolve` system call only provides us with the resolved IPs
but not any of the metadata around the query such as TTL. As a result,
we can only cache DNS queries for a static amount of time, currently
30s. It would be more correct to cache them for their TTL instead.
To do so, we re-introduce `hickory-resolver` to our codebase.
Deliberately, we only use it for resolving A and AAAA records on the
Gateway for now. DNS resolution for SRV & TXT records happens one layer
below and uses the same infrastructure as DNS resolution on the Client.
Merging this is difficult however because the Gateway still supports the
control protocol of 1.3.x clients. That one requires DNS resolution
prior to setting up the connection of DNS resources which means it needs
to happen in the event-loop of the Gateway binary and cannot be moved
into the `Tunnel` where DNS resolution for Client and SRV/TXT records
happen.
Once we can drop support for 1.3.x clients, this Gateway's event-loop
will simplify drastically which will allow us to refactor this to a more
unified approach of DNS resolution. Until then, we can at least fix the
hardcoded TTL by using `hickory-resolver` in the event-loop.
The functionality is guarded behind a feature-flag which - as usual - is
off by default (i.e. for as long as we haven't fetched the flags). The
feature flag is already configured to `true` for staging and production
so we can test the new behaviour.
Resolves: #8232
Related: #10385
Previously, the Gateway would only proactively close connections to its
peers when it was shutdown gracefully via a SIGTERM or SIGINT signal. By
copying the same design for the event-loop as I've implemented in
#10400, we can now also initiate the graceful shutdown in case the
event-loop exits with an error.
For whatever reason, we seem to sometimes lose the association with the
"room" we are meant to be in in order to send messages to the portal.
Without joining the right room, messages get dropped silently.
To fix this, we re-join the room on such errors. Long-term, this will be
fixed by ditching phoenix-channel in favor of simple HTTP requests.
Related: #9649
The event-loop inside `Tunnel` processes input according to a certain
priority. We only take input from lower priority sources when the higher
priority sources are not ready. The current priorities are:
- Flush all buffers
- Read from UDP sockets
- Read from TUN device
- Read from DNS servers
- Process recursive DNS queries
- Check timeout
The idea of this priority ordering is to keep all kinds of processing
bounded and "finish" any kind of work that is on-going before taking on
new work. Anything that sits in a buffer is basically done with
processing and just needs to be written out to the network / device.
Arriving UDP packets have already traversed the network and been
encrypted on the other end, meaning they are higher priority than
reading from the TUN device. Packets from the TUN device still need to
be encrypted and sent to the remote.
Whilst there is merit in this design, it also bears the potential of
starving input sources further down if the top ones are extremely busy.
To prevent this, we refactor `Io` to read from all input sources and
present it to the event-loop as a batch, allowing all sources to make
progress before looping around. Since this event-loop has first been
conceived, we have refactored `Io` to use background threads for the UDP
sockets and TUN device, meaning they will make progress by themselves
anyway until the channels to the main-thread fill up. As such, there
shouldn't be any latency increase in processing packets even though we
are performing slightly more work per event-loop tick.
This kind of batch-processing highlights a problem: Bailing out with an
error midway through processing a batch leaves the remainder of the
batch unprocessed, essentially dropping packets. To fix this, we
introduce a new `TunnelError` type that presents a collection of errors
that we encountered while processing the batch. This might actually also
be a problem with what is currently in `main` because we are already
batch-processing packets there but possibly are bailing out midway
through the batch.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Mariusz Klochowicz <mariusz@klochowicz.com>
The default send and receive buffer sizes on Linux are too small (only
~200 KB). Checking `nstat` after an iperf run revealed that the number
of dropped packets in the first interval directly correlates with the
number of receive buffer errors reported by `nstat`.
We already try to increase the send and receive buffer sizes for our UDP
socket but unfortunately, we cannot increase them beyond what the system
limits them to. To workaround this, we try to set `rmem_max` and
`wmem_max` during startup of the Linux headless client and Gateway. This
behaviour can be disabled by setting `FIREZONE_NO_INC_BUF=true`.
This doesn't work in Docker unfortunately, so we set the values manually
in the CI perf tests and verify after the test that we didn't encounter
any send and receive buffer errors.
It is yet to be determined how we should deal with this problem for all
the GUI clients. See #10350 as an issue tracking that.
Unfortunately, this doesn't fix all packet drops during the first iperf
interval. With this PR, we now see packet drops on the interface itself.
In earlier versions of Firezone, the WebSocket protocol with the portal
was using the request-response semantics built into Phoenix. This
however is quite cumbersome to work with to due to the polymorphic
nature of the protocol design.
We ended up moving away from it and instead only use one-way messages
where each event directly corresponds to a message type. However, we
have never removed the capability reply messages from the
`phoenix-channel` module, instead all usages just set it to `()`.
We can simplify the code here by always setting this to `()`.
Resolves: #7091
Instead of recording the queue depths on every event-loop tick, we now
record them once a second by setting a Gauge. Not only is that a simpler
instrument to work with but it is significantly more performant. The
current version - when metrics are enabled - takes on quite a bit of CPU
time.
Resolves: #10237
Right now, connections cannot be actively closed in Firezone. The
WireGuard tunnel and the ICE agent are coupled together, meaning only if
either one of them fails will we clean up the connection. One exception
here is when the Client roams. In that case, the Client simply clears
its local memory completely and then re-establishes all necessary
connections by re-requesting access.
There are three cases where gracefully closing a connection is useful:
1. If an access authorization is revoked or expires and this was the
last resource authorisation for that peer, we don't currently remove the
connection on the Gateway. Instead, the Client is still able to send
packets by they'll be dropped because we don't have a peer state
anymore.
1. If a Gateway gets restarted due to e.g. an upgrade or other
maintenance work, it loses all its connections and every Client needs to
wait for the ICE timeout (~15 seconds) before it can establish a new
one.
1. If a Client has its access revoked for all resources it has access to
in a particular site we also don't remove this connection, even though
it has become practically useless.
All of these cases are fixed with this PR. Here we introduce a way to
gracefully shutdown a connection without forcing the other side into an
ICE timeout. The graceful connection shutdown works by introducing a new
"goodbye" p2p control protocol message. Like all our p2p control
protocol messages, this is based on IP and therefore delivery is not
guaranteed. In other words, this "goodbye" message is sent on a
best-effort basis.
In the case of shutdown, the Gateway will wait for all UDP packets to be
flushed but will not resend them or wait for an ACK.
If either end receives such a "goodbye" message, they simply remove the
local peer and connection state just as if the connection would have
failed due to either ICE or WireGuard. For the Client, this means that
the next packet for a resource will trigger a new access authorization
request.
With the introduction of the pre-resolved Sentry host, all Firezone
clients now require Internet on startup. That is a signficant usability
hit that we can easily fix by simply falling back to resolving the host
on-demand.
Our Sentry client needs to resolve DNS before being able to send logs or
errors to the backend. Currently, this DNS resolution happens on-demand
as we don't take any control of the underlying HTTP client.
In addition, this will use HTTP/1.1 by default which isn't as efficient
as it could be, especially with concurrent requests.
Finally, if we decide to ever proxy all Sentry for traffic through our
own domain, we have to take control of the underlying client anyway.
To resolve all of the above, we create a custom `TransportFactory` where
we reuse the existing `ReqwestHttpTransport` but provide an already
configured `reqwest::Client` that always uses HTTP/2 with a
pre-configured set of DNS records for the given ingest host.
We cannot poll the `PhoenixChannel` after it has returned an error,
otherwise it will panic. Therefore, we exit the event-loop then. The
outer event-loop also exits as soon as it receives an error from this
channel so this is fine.
`PhoenixChannel` only returns an error when it has irrecoverably
disconnected, e.g. after the retries have been exhausted or we hit a 4xx
error on the WebSocket connection.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
When the connection to a Client disappears, the Gateway currently clears
all state related to this peer. Whilst eagerly cleaning up memory can be
good, in this case, it may lead to the Client thinking it has access to
a resource when in reality it doesn't.
Just because the connection to a Client failed doesn't mean their access
authorizations are invalid. In case the Client reconnects, it should be
able to just continue sending traffic.
At the moment, this only works if the connection also failed on the
Client and therefore, its view of the world in regards to "which
resources do I have access to" was also reset.
What we are seeing in Sentry reports though is that Clients are
attempting to access these resources, thinking they have access but the
Gateway denies it because it has lost the access authorization state.
With the removal of the NAT64/46 modules, we can now simplify the
internals of our `IpPacket` struct. The requirements for our `IpPacket`
struct are somewhat delicate.
On the one hand, we don't want to be overly restrictive in our parsing /
validation code because there is a lot of broken software out there that
doesn't necessarily follow RFCs. Hence, we want to be as lenient as
possible in what we accept.
On the other hand, we do need to verify certain aspects of the packet,
like the payload lengths. At the moment, we are somewhat too lenient
there which causes errors on the Gateway where we have to NAT or
otherwise manipulate the packets. See #9567 or #9552 for example.
To fix this, we make the parsing in the `IpPacket` constructor more
restrictive. If it is a UDP, TCP or ICMP packet, we attempt to fully
parse its headers and validate the payload lengths.
This parsing allows us to then rely on the integrity of the packet as
part of the implementation. This does create several code paths that can
in theory panic but in practice, should be impossible to hit. To ensure
that this does in fact not happen, we also tackle an issue that is long
overdue: Fuzzing.
Resolves: #6667Resolves: #9567Resolves: #9552
When filtering through logs in Sentry, it is useful to narrow them down
by context of a client, gateway or resource. Currently, these fields are
sometimes called `client`, `cid`, `client_id` etc and the same for the
Gateway and Resources.
To make this filtering easier, name all of them `cid` for Client IDs,
`gid` for Gateway IDs and `rid` for Resource IDs.
These appear to happen on systems that e.g. don't have IPv6 support or
where the destination cannot be reached. It is a bit of a catch-all but
all the ones I am seeing in Sentry are false-positives. To reduce the
noise a bit, we log these on DEBUG now.
When looking through customer logs, we see a lot of "Resolved best route
outside of tunnel" messages. Those get logged every time we need to
rerun our re-implementation of Windows' weighting algorithm as to which
source interface / IP a packet should be sent from.
Currently, this gets cached in every socket instance so for the
peer-to-peer socket, this is only computed once per destination IP.
However, for DNS queries, we make a new socket for every query. Using a
new source port DNS queries is recommended to avoid fingerprinting of
DNS queries. Using a new socket also means that we need to re-run this
algorithm every time we make a DNS query which is why we see this log so
often.
To fix this, we need to share this cache across all UDP sockets. Cache
invalidation is one of the hardest problems in computer science and this
instance is no different. This cache needs to be reset every time we
roam as that changes the weighting of which source interface to use.
To achieve this, we extend the `SocketFactory` trait with a `reset`
method. This method is called whenever we roam and can then reset a
shared cache inside the `UdpSocketFactory`. The "source IP resolver"
function that is passed to the UDP socket now simply accesses this
shared cache and inserts a new entry when it needs to resolve the IP.
As an added benefit, this may speed up DNS queries on Windows a bit
(although I haven't benchmarked it). It should certainly drastically
reduce the amount of syscalls we make on Windows.
When receiving an `init` message from the portal, we will now revoke all
authorizations not listed in the `authorizations` list of the `init`
message.
We (partly) test this by introducing a new transition in our proptests
that de-authorizes a certain resource whilst the Gateway is simulated to
be partitioned. It is difficult to test that we cannot make a connection
once that has happened because we would have to simulate a malicious
client that knows about resources / connections or ignores the "remove
resource" message.
Testing this is deferred to a dedicated task. We do test that we hit the
code path of revoking the resource authorization and because the other
resources keep working, we also test that we are at least not revoking
the wrong ones.
Resolves: #9892