Currently, `snownet` tries to be very clever in how it roams
connections. This was necessary because we associated DNS-specific
state with a connection. More specifically, the assigned proxy IPs for a
DNS resource are stored as part of a connection with the gateway.
As a result, DNS resources would always break if the underlying
connection in `snownet` failed. This is quite error-prone and means
`snownet` must be very careful to never erroneously fail a connection.
With #5049, we no longer store any important state with a connection and
can thus implement roaming in a much simpler way: drop all
connections and let the incoming packets create new ones. This is much
more robust as we don't have to "patch" existing state in `snownet` as
part of roaming.
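The new strategy is simple enough to sketch in a few lines (the types here are illustrative stand-ins, not snownet's actual ones):

```rust
use std::collections::HashMap;

/// Illustrative stand-in for snownet's connection state.
struct Node {
    connections: HashMap<u64, String>, // connection id -> per-connection state
}

impl Node {
    /// Roaming no longer patches existing state: drop every connection
    /// and let the next incoming packets create fresh ones.
    fn roam(&mut self) {
        self.connections.clear();
    }
}
```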
We test this new functionality by adding a `RoamClient` transition to
`tunnel_test`. This ensures roaming works in a lot of scenarios,
including relayed and non-relayed situations as well as roaming between
either of them. As a result, we can delete several of the more specific
test cases of `snownet`.
Depends-On: #5049.
Replaces: #5060.
Resolves: #5080.
Anything that happens on a per-packet level should be logged at `trace`
level to avoid spamming the logs. Whilst queries to DNS servers that
are CIDR resources aren't necessarily _every_ packet, in certain
configurations they are still common enough that logging them at `debug`
is too much noise.
The list of allowed IPs can be fairly long, which clutters the log.
Remove the `HashSet` from the error and also remove the stuttering; the error
already says "Packet not allowed".
As part of our NAT table, we keep track of the last time a resolved IP
sent us traffic. This is primarily used to detect and correct changes in
the DNS record. If we keep getting traffic for a proxy IP but the
resolved IP doesn't respond for more than 30s, we re-query the
corresponding domain name.
We can also use this to detect and warn the administrator of entirely
dead but used IPs. A dead-but-used IP is one that has never sent us any
traffic, yet we are actively trying to contact it. For example, if the
environment uses DNS64 but is missing a NAT64 gateway, DNS queries for
IPv4-only resources will give us synthesized IPv6 addresses from the
`64:ff9b::/96` subnet, but without a NAT64 gateway, those will never
work.
Whilst this log isn't specific to issues around DNS64 and NAT64,
emitting a warning that a resolved IP does not work at all should point
the administrator in the right direction whilst debugging this issue.
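The two checks described above can be sketched like this (the names and the placement of the 30s constant are our own illustrative choices):

```rust
use std::time::{Duration, Instant};

const STALE_AFTER: Duration = Duration::from_secs(30);

/// Outcome of inspecting a NAT entry for a resolved IP.
enum Verdict {
    Healthy,     // responded recently
    Requery,     // was responsive but silent for > 30s: re-query DNS
    DeadButUsed, // never sent us any traffic: warn the administrator
}

/// `last_inbound` is the last time the resolved IP sent us traffic.
fn check(last_inbound: Option<Instant>, now: Instant) -> Verdict {
    match last_inbound {
        Some(seen) if now.duration_since(seen) > STALE_AFTER => Verdict::Requery,
        Some(_) => Verdict::Healthy,
        None => Verdict::DeadButUsed,
    }
}
```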
When operating just the headless client, it is currently impossible to
know when resources become active / inactive. To fix this, we add
INFO logs every time we activate or deactivate a resource. This should
also prove useful when debugging issues with customers because we now
have a timestamped record of what resources were active at that time.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
Currently, the clients only send JSON formatted logs to the configured
log directory. These are very hard to read as a human because one has to
re-assemble the spans and fields that we use extensively in connlib's
logs.
With this patch, the logs are sent to two files: a `.jsonl` file with
JSON formatting and a `.log` file in syslog format.
So far, our packet translation only implemented the bare-minimum for
ICMP to work. There are a few things left that haven't been dealt with.
This PR adds additional conversions where it was easy.
There are still some left that require more elaborate mangling of the
packet, like updating pointer fields.
This PR is the "client-side" of things for #4994. Up until now, when a
user wanted to connect to a DNS resource, we would establish a
connection to the gateway and pass along the domain we are trying to
access. The gateway would resolve that domain and send the response back
to the client, allowing it to finally send a DNS response.
Now, we instantly assign and respond with 4x A and 4x AAAA records to
any query for one of our DNS resources. Upon the first IP packet for one
of these "proxy IPs", we select a gateway, establish a connection and
send our proxy IPs along. The gateway then performs the necessary
mangling and NATing of all packets. See #5354 for details.
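A minimal sketch of handing out those proxy records (the specific CGNAT-style IPv4 and ULA IPv6 ranges below are our own assumptions for illustration, not necessarily the ones connlib uses):

```rust
use std::net::{Ipv4Addr, Ipv6Addr};

/// Assign 4 IPv4 and 4 IPv6 proxy addresses for the `n`-th DNS resource.
fn assign_proxy_ips(n: u8) -> (Vec<Ipv4Addr>, Vec<Ipv6Addr>) {
    let a = (0..4u8).map(|i| Ipv4Addr::new(100, 96, n, i)).collect();
    let aaaa = (0..4u16)
        .map(|i| Ipv6Addr::new(0xfd00, 0x2021, 0x1111, 0, 0, 0, n.into(), i))
        .collect();
    (a, aaaa)
}
```

The DNS query is answered instantly from these records; only the first IP packet towards one of them triggers gateway selection and the connection handshake.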
Resolves: #4994.
Resolves: #5491.
---------
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
Many name servers apply a limit as to how big a DNS response is allowed
to be to protect themselves against DoS attacks. Querying a domain with
large records can thus fail if all we have available is UDP. To mitigate
this, we configure every upstream / system DNS server to use UDP and TCP
and let hickory decide when to use which.
In addition, we enable EDNS(0), an extension to the original DNS spec
that lifts several limits in terms of record sizes.
Closes #5450
Now the entire `Handler::run` function is allowed to fail, similar to a
web request handler failing in a web server.
Previously we only allowed the Handler to fail if it was idle, waiting
on incoming IPC requests. Now it can fail even if it's working with
connlib and about to send over IPC.
I replicated this on my Windows 11 VM in Parallels and the fix works
fine there. Should be the same bug and same fix in Linux.
Bumps [derive_more](https://github.com/JelteF/derive_more) from 0.99.17
to 0.99.18.
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/JelteF/derive_more/blob/v0.99.18/CHANGELOG.md">derive_more's
changelog</a>.</em></p>
<blockquote>
<h2>0.99.18 - 2024-06-15</h2>
<ul>
<li>Update syn to version 2.x</li>
<li>Bump minimum supported rust version to 1.65</li>
</ul>
<h2>0.99.10 - 2020-09-11</h2>
<h3>Improvements</h3>
<ul>
<li><code>From</code> supports additional types for conversion:
<code>#[from(types(u8, u16))]</code>.</li>
</ul>
<h2>0.99.7 - 2020-05-16</h2>
<h3>Fixes</h3>
<ul>
<li>Fix generic derives for <code>MulAssign</code></li>
</ul>
<h3>Improvements</h3>
<ul>
<li>When specifying specific features of the crate to only enable
specific
derives, the <code>extra-traits</code> feature of <code>syn</code> is
not always enabled
when those the specified features do not require it. This should speed
up
compile time of <code>syn</code> when this feature is not needed.</li>
</ul>
<h2>0.99.6 - 2020-05-13</h2>
<h3>Improvements</h3>
<ul>
<li>Make sure output of derives is deterministic, for better support in
rust-analyzer</li>
</ul>
<h2>0.99.5 - 2020-03-28</h2>
<h3>New features</h3>
<ul>
<li>Support for deriving <code>Error</code>!!! (many thanks to <a
href="https://github.com/ffuugoo"><code>@ffuugoo</code></a> and <a
href="https://github.com/tyranron"><code>@tyranron</code></a>)</li>
</ul>
<h3>Fixes</h3>
<ul>
<li>
<p>Fix generic bounds for <code>Deref</code> and <code>DerefMut</code>
with <code>forward</code>, i.e. put <code>Deref</code>
bound on whole type, so on <code>where Box<T>: Deref</code>
instead of on <code>T: Deref</code>.
(<a
href="https://redirect.github.com/JelteF/derive_more/issues/114">#107</a>)</p>
</li>
<li>
<p>The <code>tests</code> directory is now correctly included in the
crate (requested by
Debian package maintainers)</p>
</li>
</ul>
<h2>0.99.4 - 2020-03-28</h2>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="678a4735bc"><code>678a473</code></a>
chore: Release derive_more version 0.99.18</li>
<li><a
href="fcde5568cb"><code>fcde556</code></a>
Include example published package</li>
<li><a
href="89cbd82959"><code>89cbd82</code></a>
Remove track_caller feature detection because msrv was bumped</li>
<li><a
href="db36f6dade"><code>db36f6d</code></a>
Fix question marks</li>
<li><a
href="f0c2530255"><code>f0c2530</code></a>
fmt</li>
<li><a
href="461db95716"><code>461db95</code></a>
Fix issue when compiling on 1.65</li>
<li><a
href="39ad36fd71"><code>39ad36f</code></a>
Update changelog for v0.99.18</li>
<li><a
href="57b6e1746e"><code>57b6e17</code></a>
Update to syn 2</li>
<li><a
href="ea4fa94003"><code>ea4fa94</code></a>
Fix tests</li>
<li><a
href="ab82aef0bf"><code>ab82aef</code></a>
Ignore error doctests as it still contains old backtrace logic</li>
<li>Additional commits viewable in <a
href="https://github.com/JelteF/derive_more/compare/v0.99.17...v0.99.18">compare
view</a></li>
</ul>
</details>
<br />
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Closes #5481
With this, I can connect to the staging portal without a build.rs or any
extra env var setup
<img width="387" alt="image"
src="https://github.com/firezone/firezone/assets/13400041/9c080b36-3a76-49c7-b706-20723697edc7">
```[tasklist]
### Next steps
- [x] Split out a refactor PR for `ConnectArgs` (#5488)
- [x] Try doing this for other Clients
- [x] Check Gateway
- [x] Check Tauri Client
- [x] Change to `app_version`
- [x] Open for review
- [ ] Use `option_env` so that `FIREZONE_PACKAGE_VERSION` can still override the Cargo.toml version for local testing
- [ ] Check Android Client
- [ ] Check Apple Client
```
---------
Signed-off-by: Reactor Scram <ReactorScram@users.noreply.github.com>
Joining the "login" topic on the portal (i.e. as `client`, `gateway` or
`relay`) can fail. Usually, that is only due to a bug, yet we cannot and
should not operate if we haven't joined the login topic successfully.
Currently, we just hang in this scenario without a useful error
message. With this PR, we fail the entire connlib session. For the
headless client, it looks like this:
```
2024-06-21T08:44:47.792921Z INFO firezone_headless_client: git_version="gateway-1.1.0-8-ge16dcb8e5-modified"
2024-06-21T08:44:47.793138Z INFO firezone_headless_client: Running in headless / standalone mode
2024-06-21T08:44:47.801781Z INFO firezone_headless_client::dns_control::linux: dns_control_method=Some(Systemd)
2024-06-21T08:44:48.110502Z INFO phoenix_channel: Connected to portal host=api.firez.one
2024-06-21T08:44:48.372602Z ERROR connlib_client_shared: connlib failed: connection to the portal failed: login failed
2024-06-21T08:44:48.372661Z ERROR firezone_headless_client: Got `on_disconnect` from connlib error=PortalConnectionFailed(LoginFailed)
Error: Firezone disconnected
Caused by:
connection to the portal failed: login failed
```
This is extracted from #5487 since I needed to add an 8th parameter and
Clippy said 8 is too many.
Refs #2986
Stepping stone towards using the Builder pattern. There are only a few
Clients, so this has 80% of the advantage for 20% of the effort.
Refs #5453
I haven't solved the permissions problem fully, but this solves 2 other
issues:
- Even if we can't delete all the logs, we still delete the GUI logs
- Errors are logged to terminal
Tested on the Windows 11 aarch64 VM in Parallels
Closes #5464
These were silently broken: the test was exporting an empty zip and
passing anyway. So this PR makes the test fail if the zip wasn't fully
exported, and then fixes the export.
In our NAT table on the gateway, we try to first pick the external port
as the one on the packet that we want to translate. This makes that port
mapping consistent between NAT sessions in the majority of cases. In
case the port is taken, we iterate through two chained `Range`s that end
up cycling the entire port range.
[`RangeFrom`](https://doc.rust-lang.org/std/ops/struct.RangeFrom.html)
has a somewhat unexpected behaviour with regard to exhausted ranges:
They panic when trying to access the next element. To avoid this, we
explicitly end the first range at `u16::MAX` which makes it an empty
range in case the source port is `u16::MAX`.
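The iteration can be sketched as follows (a simplified version; the real code presumably also covers `u16::MAX` itself and skips ports that are already taken):

```rust
/// Candidate external ports, starting at the packet's own source port.
/// Ending the first range explicitly at `u16::MAX` keeps it an ordinary
/// `Range`, which is simply empty when `preferred == u16::MAX`, instead
/// of a `RangeFrom` that panics on `next()` once exhausted.
fn candidate_ports(preferred: u16) -> impl Iterator<Item = u16> {
    (preferred..u16::MAX).chain(0..preferred)
}
```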
Without this, a < 1.1.0 client connecting to a > 1.1.0 gateway (i.e.
current main) causes lots of very strange logs that say:
> Assigned translation proxy_ip=X.X.X.X real_ip=X.X.X.X
Where X.X.X.X are the same IP.
Currently, we always emit a connection intent whenever we see a DNS
query for a domain of one of our DNS resources. However, especially for
wildcard DNS resources, we are very likely already connected to the
corresponding gateway. In that case, sending a connection intent
triggers another handshake with the portal only to learn that - surprise
- we should reuse a connection that we already have to that gateway.
We can short-circuit this by checking whether we are already connected
to the gateway for this resource and, if so, directly requesting access
for the domain name in question. We reuse the same event here as we do
for refreshing
DNS resources. At a later stage, we should rename this to something else
to make this clearer.
Co-authored-by: Gabi <gabrielalejandro7@gmail.com>
This turns out to break things because we can no longer associate a
working but outdated IP with the DNS resource. Putting this up here in
case we want to merge a fix before we decide on a different one.
Reverts: #5435.
Extracted from https://github.com/firezone/firezone/pull/5426
- Replace `new` and `new_for_test` for IPC servers with `enum ServiceId`
- Rename `debug_command_setup` to `setup_stdout_logging`
It turned out there is no clever way to hide other platforms from
`cargo-mutants`; I thought I had such a way.
Whenever we resolve a domain name to real IPs, we assign one proxy IP
per resolved IP. In case the DNS records for that domain actually
changed, we only appended the new proxy IPs to the list we assigned to
that domain.
If a domain no longer resolves to a certain IP, we should clear the
assigned proxy IP and stop returning it in DNS responses. To achieve this,
we first remove all proxy IPs from our mapping of IP -> domain and then
add all _current_ proxy IPs back to the map.
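In code, the refresh step could look roughly like this (names are hypothetical):

```rust
use std::collections::HashMap;
use std::net::IpAddr;

/// Replace the proxy IPs assigned to `domain` with the current set.
fn refresh_proxy_ips(
    ip_to_domain: &mut HashMap<IpAddr, String>,
    domain: &str,
    current_proxy_ips: &[IpAddr],
) {
    // First remove every proxy IP previously assigned to this domain ...
    ip_to_domain.retain(|_, d| d != domain);
    // ... then add back only the _current_ ones.
    for ip in current_proxy_ips {
        ip_to_domain.insert(*ip, domain.to_owned());
    }
}
```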
When a user sends the first packet to a resource, we generate a
"connection intent" and ask the portal which gateway to use for
this resource. This process is throttled to only generate a new intent
every 2s.
Once we know which gateway to use for a certain resource, we initiate a
connection via snownet. This involves an OFFER-ANSWER handshake with the
gateway. A connection for which we have sent an offer and have not yet
received an answer is what we call a "pending connection".
In case the connection setup takes longer than 2s, we will generate
another connection intent which can point to the same gateway that we
are currently setting up a connection with.
Currently, encountering a "pending connection" during another connection
setup is treated as an error which results in some state being
cleaned-up / removed. This is where the bug surfaces: If we remove the
state for a resource as a result of a 2nd connection intent and then
receive the response of the first one, we will be left with no state
that knows about this resource.
We fix this by refactoring `create_or_reuse_connection` to be atomic
with regard to its state changes: All checks that fail the function are
moved to the top which means there is no state to clean up in case of an
error. Additionally, we model the case of a "pending connection" using
an `Option` to not flood the logs with "pending connection" warnings as
those are expected during normal operation.
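The refactored shape can be sketched as follows (names and types are ours, not connlib's):

```rust
use std::collections::HashMap;

/// All fallible checks precede any mutation, so an early return leaves
/// no half-initialised state behind to clean up.
fn create_or_reuse_connection(
    connections: &mut HashMap<u32, &'static str>,
    gateway_id: Option<u32>,
) -> Result<&'static str, &'static str> {
    // Check phase: fail before touching any state.
    let gateway_id = gateway_id.ok_or("no gateway known for resource")?;

    // Mutation phase: reuse the existing connection or record a new,
    // pending one. A pending connection is expected, not an error.
    Ok(*connections.entry(gateway_id).or_insert("pending"))
}
```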
Fixes: #5385
As part of #4994, the IP translation and mangling of packets to and from
DNS resources is moved to the gateway. This PR represents the
"gateway-half" of the required changes.
Eventually, the client will send a list of proxy IPs that it assigned
for a certain DNS resource. The gateway assigns each proxy IP to a real
IP and mangles outgoing and incoming traffic accordingly. There are a
number of things that we need to take care of as part of that:
- We need to implement NAT to correctly route traffic. Our NAT table
maps from source port* and destination IP to an assigned port* and real
IP. We say port* because that is only true for UDP and TCP. For ICMP, we
use the identifier.
- We need to translate between IPv4 and IPv6 in case a DNS resource e.g.
only resolves to IPv6 addresses but the client gave out an IPv4 proxy
address to the application. This translation was added in #5364 and
is now being used here.
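The NAT table shape described in the first bullet can be sketched like this (names are ours; a real implementation also falls back to another free port when the preferred one is taken):

```rust
use std::collections::HashMap;
use std::net::IpAddr;

// The u16 is the source port for UDP/TCP and the echo identifier for ICMP.
type NatKey = (u16, IpAddr);   // (source port*, destination proxy IP)
type NatValue = (u16, IpAddr); // (assigned port*, real IP)

/// Look up or create the mapping for an outgoing packet, preferring the
/// packet's own source port so mappings stay consistent across sessions.
fn translate_outgoing(
    table: &mut HashMap<NatKey, NatValue>,
    src_port: u16,
    proxy_ip: IpAddr,
    real_ip: IpAddr,
) -> NatValue {
    *table.entry((src_port, proxy_ip)).or_insert((src_port, real_ip))
}
```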
This PR is backwards-compatible because currently, clients don't send
any IPs to the gateway. No proxy IPs means we cannot do any translation
and thus packets are simply routed through as-is, which is what the
current clients expect.
---------
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
When we attempt to establish a connection to a gateway for a DNS
resource, the gateway must resolve the requested domain name before it
can accept the connection. Currently, this timeout is set to 60s which
is much longer than the client's connection timeout.
DNS resolution is typically a very fast protocol so reducing this
timeout to 5s should be safe. In addition, we add a compile-time
assertion that this timeout must be less than the client's connection
timeout.
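Such a compile-time assertion can be written as a `const` item; the constant names and the 60s client timeout below are illustrative assumptions:

```rust
use std::time::Duration;

const DNS_RESOLUTION_TIMEOUT: Duration = Duration::from_secs(5);
const CLIENT_CONNECTION_TIMEOUT: Duration = Duration::from_secs(60);

// Evaluated at compile time: the build fails if the gateway would still
// be resolving after the client has already given up on the connection.
const _: () = assert!(
    DNS_RESOLUTION_TIMEOUT.as_secs() < CLIENT_CONNECTION_TIMEOUT.as_secs(),
    "DNS resolution must time out before the client's connection timeout",
);
```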
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
This is a funny one. `cargo test -p firezone-headless-client -p
firezone-gui-client` actually passes, because the GUI client uses the
pipes feature, and Cargo apparently just does one build for both
packages. But if you build the headless Client by itself, it fails to
build.
I think this caused `cargo-mutants` to consider all its headless Client
mutants to be unviable, and so it didn't show coverage for that package.
Part of a yak shave to profile startup time for reducing it on Windows
#5026
Median of 3 runs:
- Windows 11 aarch64 Parallels VM - 4.8 s
- Windows 11 x86_64 laptop - 3.1 s (I thought it used to be slower)
- Windows Server 2022 VM - 22.2 s
---------
Signed-off-by: Reactor Scram <ReactorScram@users.noreply.github.com>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
## The problem
To find the correct peer for a given resource we keep a map of
`resource_id -> gateway_id` in the client state called
`resources_gateways`.
For a CIDR resource, when connlib sees a packet, it does the following
steps:
1. Find the packet's corresponding resource
2. Find the resource corresponding gateway
3. Find the peer corresponding to the gateway, if none, request
access/connection
The problem was that when roaming, we didn't clean up the
`resource_id -> gateway_id` map, so if after disconnecting from a
gateway we created a new connection due to another resource, then in
step 3, connlib would find a connected gateway and not request access.
This would cause the client to send unallowed packets to the gateway.
## Steps to reproduce
1. Open the client
2. Ping a CIDR resource on a gateway
3. Roam and wait until disconnection
4. Ping a different resource on the same gateway
5. Ping the same CIDR resource as in step 2
This will result in no reply for step 5
## The fix
Clean up the `resource -> gateway` map after disconnecting from a
gateway.
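In code, the fix amounts to something like this (identifiers simplified from connlib's actual types):

```rust
use std::collections::HashMap;

/// Forget every `resource_id -> gateway_id` mapping that points at the
/// disconnected gateway, so the next packet for such a resource
/// requests access again instead of assuming a live connection.
fn on_gateway_disconnected(
    resources_gateways: &mut HashMap<u32, u32>,
    gateway_id: u32,
) {
    resources_gateways.retain(|_, g| *g != gateway_id);
}
```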
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
Closes #5042
Smoke test plan:
- Install on a before-Firezone VM
- Confirm logs default to `str0m=warn,info`
- Set log filter to `debug` in GUI
- Restart IPC service
- Confirm logs are `debug`
- Clear settings back to default
- Restart IPC service
- Confirm logs are `str0m=warn,info`
Directions to apply new log level:
1. Put the new log filter in
2. Click "Apply"
3. Quit Firezone Client
4. Right-click on the Start Menu and click "Terminal (Admin)" to open a
Powershell prompt
5. Run `Restart-Service -Name FirezoneClientIpcService` (on Linux, `sudo
systemctl restart firezone-client-ipc.service`)
6. Re-open Firezone Client
```[tasklist]
- [x] Log the log filter maybe
- [x] Use `atomicwrites` to write the file
- [x] (cancelled) ~~Make the GUI write the file on boot if it's not there (saves a step when upgrading from older versions)~~
- [x] Windows smoke test
- [x] Fix permissions on `/var/lib/dev.firezone.client/config`
- [x] Fix Linux IPC service not loading the log filter file
- [x] Linux smoke test
- [ ] Make sure it's okay that users in `firezone-client` can change the device ID
- [ ] Update user guides to include restarting the computer or IPC service after updating the log level?
```
---------
Signed-off-by: Reactor Scram <ReactorScram@users.noreply.github.com>