Applying a filter globally to the entire subscriber means it filters
events for all layers. This prevents the Sentry layer from uploading
DEBUG logs even when it is configured to do so.
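A minimal sketch of the per-layer alternative with `tracing-subscriber`; the concrete layers and levels are illustrative, not our actual setup:
```rust
use tracing_subscriber::filter::LevelFilter;
use tracing_subscriber::layer::SubscriberExt as _;
use tracing_subscriber::util::SubscriberInitExt as _;
use tracing_subscriber::Layer as _;

fn init_logging() {
    // Filter each layer individually instead of the whole subscriber:
    // stdout stays at INFO whilst Sentry still receives DEBUG events.
    let fmt_layer = tracing_subscriber::fmt::layer().with_filter(LevelFilter::INFO);
    let sentry_layer = sentry_tracing::layer().with_filter(LevelFilter::DEBUG);

    tracing_subscriber::registry()
        .with(fmt_layer)
        .with(sentry_layer)
        .init();
}
```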
In #9870, the password generation algorithm was broken. The correct
order of the elements in the hash is: expiry, stamp_secret, salt. The
relay expects this order when it re-generates the password to validate
the message.
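A minimal sketch of the ordering; the hash function and encoding here are assumptions, only the element order (expiry, stamp_secret, salt) is what the relay prescribes:
```rust
use base64::Engine as _;
use sha2::{Digest as _, Sha256};

// The relay re-generates this password to validate the message, so both
// sides must feed the elements to the hash in the same order.
fn generate_password(expiry: u64, stamp_secret: &str, salt: &str) -> String {
    let mut hasher = Sha256::new();
    hasher.update(expiry.to_string().as_bytes()); // 1. expiry
    hasher.update(stamp_secret.as_bytes()); // 2. stamp_secret
    hasher.update(salt.as_bytes()); // 3. salt

    base64::engine::general_purpose::STANDARD.encode(hasher.finalize())
}
```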
Due to a different bug in our CI system, we weren't actually checking
for warnings / errors in our perf-test suite:
https://github.com/firezone/firezone/actions/runs/16285038111/job/45982241021#step:9:66
The current Git tag for releases of the Apple client is out of line with
the naming of the rest of the repository. Ideally, the tag would be renamed
to `apple-client-X.Y.Z` as it represents the version for both the macOS
and iOS client.
I am not familiar enough with the redirect system on our website to do
this confidently without breaking anything, so the easiest fix here is
to employ the same hack we already use for Sentry and special-case the
`macos-client` tag.
Resolves: #9871
Bumps [rustls](https://github.com/rustls/rustls) from 0.23.28 to
0.23.29.
<details>
<summary>Commits</summary>
<ul>
<li><a
href="4e0b5fed17"><code>4e0b5fe</code></a>
Bump version to 0.23.29</li>
<li><a
href="b8540790dc"><code>b854079</code></a>
Propagate context for webpki signature algorithm errors</li>
<li><a
href="c84675e34b"><code>c84675e</code></a>
key_schedule: minimise lifetime of resumption secret</li>
<li><a
href="788b0df122"><code>788b0df</code></a>
key_schedule: erase master secret in traffic state</li>
<li><a
href="d2c64f0416"><code>d2c64f0</code></a>
key_schedule: separate ops not using current secret</li>
<li><a
href="e5998cd100"><code>e5998cd</code></a>
key_schedule: add state for derivations before finish</li>
<li><a
href="9620bec130"><code>9620bec</code></a>
tls13::key_schedule: move <code>KeySchedule</code> struct down</li>
<li><a
href="373ad888e2"><code>373ad88</code></a>
tls13::key_schedule: move <code>SecretKind</code> down</li>
<li><a
href="efa2066469"><code>efa2066</code></a>
Improve compactness of Debug impl for extensions</li>
<li><a
href="a5433a154b"><code>a5433a1</code></a>
Correct calculation of ServerHello ECH confirmation</li>
<li>Additional commits viewable in <a
href="https://github.com/rustls/rustls/compare/v/0.23.28...v/0.23.29">compare
view</a></li>
</ul>
</details>
<br />
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [clap](https://github.com/clap-rs/clap) from 4.5.40 to 4.5.41.
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/clap-rs/clap/blob/master/CHANGELOG.md">clap's
changelog</a>.</em></p>
<blockquote>
<h2>[4.5.41] - 2025-07-09</h2>
<h3>Features</h3>
<ul>
<li>Add <code>Styles::context</code> and
<code>Styles::context_value</code> to customize the styling of
<code>[default: value]</code> like notes in the <code>--help</code></li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="92fcd83b76"><code>92fcd83</code></a>
chore: Release</li>
<li><a
href="aca91b99c1"><code>aca91b9</code></a>
docs: Update changelog</li>
<li><a
href="8434510cee"><code>8434510</code></a>
Merge pull request <a
href="https://redirect.github.com/clap-rs/clap/issues/5869">#5869</a>
from tw4452852/patch-1</li>
<li><a
href="33b1fc304e"><code>33b1fc3</code></a>
fix(complete): Fix env leakage in elvish dynamic completion</li>
<li><a
href="e5f1f4884c"><code>e5f1f48</code></a>
chore: Release</li>
<li><a
href="9466a552fb"><code>9466a55</code></a>
docs: Update changelog</li>
<li><a
href="d74b793512"><code>d74b793</code></a>
Merge pull request <a
href="https://redirect.github.com/clap-rs/clap/issues/5865">#5865</a>
from gifnksm/nushell-completion-value-types</li>
<li><a
href="ecbc775d3b"><code>ecbc775</code></a>
fix(nu): Set argument type based on <code>ValueHint</code></li>
<li><a
href="6784054536"><code>6784054</code></a>
Merge pull request <a
href="https://redirect.github.com/clap-rs/clap/issues/5857">#5857</a>
from epage/empty</li>
<li><a
href="cca5f32b3a"><code>cca5f32</code></a>
test(complete): Show empty option-value behavior</li>
<li>Additional commits viewable in <a
href="https://github.com/clap-rs/clap/compare/clap_complete-v4.5.40...clap_complete-v4.5.41">compare
view</a></li>
</ul>
</details>
<br />
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [zbus](https://github.com/dbus2/zbus) from 5.7.1 to 5.8.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/dbus2/zbus/releases">zbus's
releases</a>.</em></p>
<blockquote>
<h2>🔖 zbus 5.8.0</h2>
<ul>
<li>✨ <code>interface</code> macro now supports write-only
properties.</li>
<li>✨ Copy attributes over to <code>receive_*_changed</code> and
<code>cached_*</code> methods in <code>proxy</code>.</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="7d8e935927"><code>7d8e935</code></a>
Merge pull request <a
href="https://redirect.github.com/dbus2/zbus/issues/1425">#1425</a> from
zeenix/zb-release</li>
<li><a
href="da0ca55c28"><code>da0ca55</code></a>
🔖 zb,zm: Release 5.8.0</li>
<li><a
href="be41117c4b"><code>be41117</code></a>
Merge pull request <a
href="https://redirect.github.com/dbus2/zbus/issues/1424">#1424</a> from
zeenix/zv-release</li>
<li><a
href="dda4f376e4"><code>dda4f37</code></a>
🔖 zv,zd: Release 5.6.0</li>
<li><a
href="747c64505c"><code>747c645</code></a>
⬆️ micro: Update blocking to v1.6.2 (<a
href="https://redirect.github.com/dbus2/zbus/issues/1423">#1423</a>)</li>
<li><a
href="d01e893a8b"><code>d01e893</code></a>
⬆️ micro: Update tokio to v1.46.1 (<a
href="https://redirect.github.com/dbus2/zbus/issues/1422">#1422</a>)</li>
<li><a
href="8250c5357e"><code>8250c53</code></a>
⬆️ micro: Update libfuzzer-sys to v0.4.10 (<a
href="https://redirect.github.com/dbus2/zbus/issues/1421">#1421</a>)</li>
<li><a
href="7ab8fa67ee"><code>7ab8fa6</code></a>
Merge pull request <a
href="https://redirect.github.com/dbus2/zbus/issues/1420">#1420</a> from
dbus2/renovate/tokio-1.x-lockfile</li>
<li><a
href="36fde484aa"><code>36fde48</code></a>
⬆️ Update tokio to v1.46.0</li>
<li><a
href="f9870cde4a"><code>f9870cd</code></a>
Merge pull request <a
href="https://redirect.github.com/dbus2/zbus/issues/1419">#1419</a> from
zeenix/fix-zv-regression</li>
<li>Additional commits viewable in <a
href="https://github.com/dbus2/zbus/compare/zbus-5.7.1...zbus-5.8.0">compare
view</a></li>
</ul>
</details>
<br />
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Why:
* Adding more BEAM VM metrics gives us better insight into how our BEAM
cluster is running, since we're in the middle of making some moderately
large architectural changes to the application.
As a followup to #9856, after talking with @bmanifold, we determined
that using the public_key as the username for TURN credentials is a
safer bet because:
- It's by definition public and therefore does not need to be obfuscated
- It's shorter-lived than the token, especially for the gateway
- It essentially represents the data plane connection for client/gateway
and naturally rotates along with the key state for those
When giving TURN credentials to clients and gateways, it's important
that they remain consistent across hiccups in the portal connection so
that relayed connections are not interrupted during a deploy, or if the
user's internet is flaky, or the GCP load balancer decides to disconnect
the client/gateway.
Prior to this PR, that was not the case because we essentially tied TURN
credentials, required for data plane packet flows, to the WebSocket
connection, a control plane element. This happened because we generated
random `expires_at` and `salt` elements on _each_ connection to the
portal.
Instead, we now make these reproducible and tied to the auth token by
hashing and then base64-encoding it. The expiry is tied to the
auth-token's expiry.
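A sketch of the derivation under these assumptions (the PR only states "hashing then base64-encoding"; SHA-256 and the function name are illustrative):
```rust
use base64::Engine as _;
use sha2::{Digest as _, Sha256};

// Deriving the salt deterministically from the auth token means every
// reconnect to the portal yields the same TURN credentials.
fn salt_from_token(auth_token: &str) -> String {
    let digest = Sha256::digest(auth_token.as_bytes());

    base64::engine::general_purpose::STANDARD.encode(digest)
}
```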
Fixes: #9856
As per the WireGuard paper, `boringtun` tries to handshake with the
remote peer for 90s before it gives up. This timeout is important
because when a session is discarded due to e.g. missing replies,
WireGuard attempts to handshake a new session. Without this timeout, we
would then try to handshake a session forever.
Unfortunately, `boringtun` does not distinguish a missing handshake
response from a bad one. Decryption errors whilst decoding a handshake
response are simply passed up to the upper layer, in our case `snownet`.
I am not sure how we can actually fail to decrypt a handshake but the
pattern we are seeing in customer logs is that this happens over and
over again, so there is no point in having `boringtun` retry the
handshake. Therefore, we immediately fail the connection when this
happens.
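A sketch of the decision with illustrative types (`boringtun`'s real API differs; it surfaces decryption errors as opaque errors to `snownet`):
```rust
// Illustrative types, not boringtun's actual API.
enum HandshakeOutcome {
    Response(Vec<u8>),
    DecryptError,
    NoResponseYet,
}

enum Connection {
    Connecting,
    Failed,
}

fn on_handshake_outcome(outcome: HandshakeOutcome, conn: &mut Connection) {
    match outcome {
        HandshakeOutcome::Response(_) => { /* continue with the session */ }
        // A response that fails to decrypt won't get better on retry:
        // fail fast so a fresh connection-intent re-syncs the public keys.
        HandshakeOutcome::DecryptError => *conn = Connection::Failed,
        // A missing response may just be packet loss; boringtun keeps
        // retrying the handshake for up to 90s.
        HandshakeOutcome::NoResponseYet => {}
    }
}
```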
Failed connections are immediately removed, triggering the client to
send a new connection-intent to the portal. Such a new connection intent
will then sync up the state between Client and Gateway so both of them
use the most recent public key.
Resolves: #9845
As a first step in preparation for sending OTEL metrics from Clients and
Gateways to a cloud-hosted OTEL collector, we extend the CLI of the
Gateway with configuration options to provide a gRPC endpoint to an OTEL
collector.
If `FIREZONE_METRICS` is set to `otel-collector` and an endpoint is
configured via `OTLP_GRPC_ENDPOINT`, we will report our metrics to that
collector.
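A sketch of the CLI surface with `clap`; the flag and variant names are assumptions derived from the env vars above:
```rust
use clap::{Parser, ValueEnum};

#[derive(Parser)]
struct Cli {
    /// How to export metrics.
    #[arg(long, env = "FIREZONE_METRICS", value_enum)]
    metrics: Option<MetricsExporter>,

    /// gRPC endpoint of an OTEL collector, e.g. `http://localhost:4317`.
    #[arg(long, env = "OTLP_GRPC_ENDPOINT")]
    otlp_grpc_endpoint: Option<String>,
}

#[derive(Clone, Copy, ValueEnum)]
enum MetricsExporter {
    OtelCollector, // matches `otel-collector` on the command line
}

fn main() {
    let cli = Cli::parse();

    if let (Some(MetricsExporter::OtelCollector), Some(endpoint)) =
        (cli.metrics, cli.otlp_grpc_endpoint.as_deref())
    {
        // Initialise the OTLP gRPC exporter against `endpoint` here.
        println!("Reporting metrics to {endpoint}");
    }
}
```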
The future plan for extending this is such that if `FIREZONE_METRICS` is
set to `otel-collector` (which will likely be the default) and no
`OTLP_GRPC_ENDPOINT` is set, then we will use our own, hosted OTEL
collector and report metrics IF the `export-metrics` feature-flag is set
to `true`.
This is a similar integration to the one we did for streaming logs to
Sentry. We can therefore enable it at a similar granularity as we do
with the logs and e.g. only enable it for the `firezone` account to
start with.
In the meantime, customers can already make use of those metrics if
they'd like by using the current integration.
Resolves: #1550
Related: #7419
---------
Co-authored-by: Antoine Labarussias <antoinelabarussias@gmail.com>
When failing to create the TUN device, the error messages are currently
pretty bare. Add a bit more context so users can more easily
self-diagnose what is wrong.
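A sketch of the kind of context being added, using `anyhow`; the device-opening call and the wording are illustrative:
```rust
use anyhow::{Context as _, Result};
use std::fs::File;

fn create_tun_device(name: &str) -> Result<File> {
    File::open("/dev/net/tun").with_context(|| {
        format!("Failed to open /dev/net/tun for device `{name}`; is the `tun` kernel module loaded and do we have the required permissions?")
    })
}
```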
In the DNS resource NAT table, we track parts of the layer 4 protocol of
the connection in order to map packets back to the correct proxy IP in
case multiple DNS names resolve to the same real IP. The involvement of
layer 4 means we need to perform some packet inspection in case we
receive ICMP errors from an upstream router.
Presently, the only ICMP error we handle here is destination
unreachable. Those are generated e.g. when we are trying to contact an
IPv6 address but we don't have an IPv6 egress interface. An additional
error that we want to handle here is "time exceeded":
Time exceeded is sent when the TTL of a packet reaches 0. Typically,
TTLs are set high enough such that the packet makes it to its
destination. When using tools such as `tracepath` however, the TTL is
specifically only incremented one-by-one in order to resolve the exact
hops a packet is taking to a destination. Without handling the time
exceeded ICMP error, using `tracepath` through Firezone is broken
because the packets get dropped at the DNS resource NAT.
With this PR, we generalise the functionality of detecting destination
unreachable ICMP errors to also handle time-exceeded errors, allowing
tools such as `tracepath` to somewhat work:
```
❯ sudo docker compose exec --env RUST_LOG=info -it client /bin/sh -c 'tracepath -b example.com'
1?: [LOCALHOST] pmtu 1280
1: 100.82.110.64 (100.82.110.64) 0.795ms
1: 100.82.110.64 (100.82.110.64) 0.593ms
2: example.com (100.96.0.1) 0.696ms asymm 45
3: example.com (100.96.0.1) 5.788ms asymm 45
4: example.com (100.96.0.1) 7.787ms asymm 45
5: example.com (100.96.0.1) 8.412ms asymm 45
6: example.com (100.96.0.1) 9.545ms asymm 45
7: example.com (100.96.0.1) 7.312ms asymm 45
8: example.com (100.96.0.1) 8.779ms asymm 45
9: example.com (100.96.0.1) 9.455ms asymm 45
10: example.com (100.96.0.1) 14.410ms asymm 45
11: example.com (100.96.0.1) 24.244ms asymm 45
12: example.com (100.96.0.1) 31.286ms asymm 45
13: no reply
14: example.com (100.96.0.1) 303.860ms asymm 45
15: no reply
16: example.com (100.96.0.1) 135.616ms (This broken router returned corrupted payload) asymm 45
17: no reply
18: example.com (100.96.0.1) 161.647ms asymm 45
19: no reply
20: no reply
21: no reply
22: example.com (100.96.0.1) 238.066ms reached
Resume: pmtu 1280 hops 22 back 45
```
We say "somewhat work" because due to the NAT that is in place for DNS
resources, the output does not disclose the intermediary hops beyond the
Gateway.
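A sketch of the generalisation with illustrative types: both error kinds embed the original packet, and its L4 header is what the DNS resource NAT needs to map the error back to the right proxy IP:
```rust
// Illustrative types; the real implementation parses these out of raw packets.
enum IcmpError<'a> {
    DestinationUnreachable { original_packet: &'a [u8] },
    TimeExceeded { original_packet: &'a [u8] },
}

impl<'a> IcmpError<'a> {
    /// The packet that triggered the error. ICMP errors embed the original
    /// IP header plus at least 8 bytes of its payload, which covers the L4
    /// ports we track in the NAT table.
    fn original_packet(&self) -> &'a [u8] {
        match self {
            IcmpError::DestinationUnreachable { original_packet }
            | IcmpError::TimeExceeded { original_packet } => original_packet,
        }
    }
}
```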
Co-authored-by: Antoine Labarussias <antoinelabarussias@gmail.com>
---------
Co-authored-by: Antoine Labarussias <antoinelabarussias@gmail.com>
The latest version of str0m includes a fix that would result in an
immediate ICE timeout if a remote candidate was added prior to a local
candidate. We mitigated this in #9793 to make Firezone overall more
resilient towards sudden changes in the ICE connection state.
As a defense-in-depth measure, we also fixed this issue in str0m by not
transitioning to `Disconnected` if we haven't even formed any candidate
pairs yet.
Diff:
2153bf0385...3d6e3d2f27
I am no longer able to compile `jemalloc` on my system in a debug build.
It fails with the following error:
```
src/malloc_io.c: In function ‘buferror’:
src/malloc_io.c:107:16: error: returning ‘char *’ from a function with return type ‘int’ makes integer from pointer without a cast [-Wint-conversion]
107 | return strerror_r(err, buf, buflen);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
This appears to be a problem with modern versions of clang/gcc. I
believe this started happening when I recently upgraded my system. The
upstream [`jemalloc`](https://github.com/jemalloc/jemalloc) repository
is now archived and thus unmaintained. I am not sure if we ever measured
a significant benefit in using `jemalloc`.
Related: https://github.com/servo/servo/issues/31059
Rust 1.88 has been released and brings with it a quite exciting feature:
let-chains! It allows us to mix-and-match `if` and `let` expressions,
therefore often reducing the "right-drift" of the relevant code, making
it easier to read.
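A small illustration of the reduced right-drift (a made-up example, not code from this repository):
```rust
struct Packet;

fn parse(payload: &[u8]) -> Option<Packet> {
    (!payload.is_empty()).then_some(Packet)
}

// Before Rust 1.88: every `let` needs its own `if`, nesting the code.
fn handle_nested(datagram: Option<&[u8]>) -> Option<Packet> {
    if let Some(payload) = datagram {
        if payload.len() >= 20 {
            if let Some(packet) = parse(payload) {
                return Some(packet);
            }
        }
    }
    None
}

// With let-chains: one flat condition.
fn handle_flat(datagram: Option<&[u8]>) -> Option<Packet> {
    if let Some(payload) = datagram
        && payload.len() >= 20
        && let Some(packet) = parse(payload)
    {
        return Some(packet);
    }
    None
}
```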
Rust 1.88 also comes with a new clippy lint that warns when creating a
mutable reference from an immutable pointer. Attempting to fix this
revealed that this is exactly what we are doing in the eBPF kernel.
Unfortunately, it doesn't seem to be possible to design this in a way
that is both accepted by the borrow-checker AND by the eBPF verifier.
Hence, we simply make the function `unsafe` and document for the
programmer what needs to be upheld.
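A sketch of the resulting pattern; the function name and signature are illustrative, not the actual eBPF code:
```rust
/// # Safety
///
/// The caller must guarantee that `ptr` is valid for reads and writes of
/// `len` bytes and that no other reference to this memory exists for the
/// lifetime of the returned slice.
unsafe fn bytes_mut<'a>(ptr: *const u8, len: usize) -> &'a mut [u8] {
    // Casting away `const` is exactly what the lint flags; the eBPF
    // verifier forces this shape, so we document the invariants instead.
    unsafe { core::slice::from_raw_parts_mut(ptr.cast_mut(), len) }
}
```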
With this patch, we sample a list of DNS resources on each test run and
create a "TCP service" for each of their addresses. Using this list of
resources, we then change the `SendTcpPayload` transition to
`ConnectTcp` and establish TCP connections using `smoltcp` to these
services.
For now, we don't send any data on these connections but we do set the
keep-alive interval to 5s, meaning `smoltcp` itself will keep these
connections alive. We also set the timeout to 30s and after each
transition in a test-run, we assert that all TCP sockets are still in
their expected state:
- `ESTABLISHED` for most of them.
- `CLOSED` for all sockets where we ended up sampling an IPv4 address
but the DNS resource only supports IPv6 addresses (or vice-versa). In
these cases, we use the ICMP error sent by the Gateway to assert that
the socket is `CLOSED`. Unfortunately, `smoltcp` currently does not
handle ICMP messages for its sockets, so we have to call `abort`
ourselves (see the sketch after this list).
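A sketch of the socket setup and the per-transition assertion with `smoltcp`; the surrounding test plumbing is elided:
```rust
use smoltcp::socket::tcp;
use smoltcp::time::Duration;

fn configure(socket: &mut tcp::Socket<'_>) {
    // smoltcp itself keeps the connection alive ...
    socket.set_keep_alive(Some(Duration::from_secs(5)));
    // ... and tears it down if the peer stays silent for too long.
    socket.set_timeout(Some(Duration::from_secs(30)));
}

fn assert_expected_state(socket: &tcp::Socket<'_>, expect_closed: bool) {
    if expect_closed {
        // e.g. we sampled an IPv4 address but the resource is IPv6-only;
        // we `abort` the socket ourselves upon the Gateway's ICMP error.
        assert_eq!(socket.state(), tcp::State::Closed);
    } else {
        assert_eq!(socket.state(), tcp::State::Established);
    }
}
```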
Overall, this should assert that regardless of whether we roam networks,
switch relays, or otherwise disturb the underlying connection, the
tunneled TCP connection stays alive.
In order to make this work, I had to tweak the timeouts when we are
on-demand refreshing allocations. This only happens in one particular
case: When we are being given new relays by the portal, we refresh all
_other_ relays to make sure they are still present. In other words, all
relays that we didn't remove and didn't just add but still had in-memory
are refreshed. This is important for cases where we are
network-partitioned from the portal whilst relays are deployed or reset
their state otherwise. Instead of the previous 8s max elapsed time of
the exponential backoff like we have it for other requests, we now only
use a single message with a 1s timeout there. With the increased ICE
timeout of 15s, a TCP connection with a 30s timeout would otherwise not
survive such an event. This is because it takes the above-mentioned 8s
for us to remove a non-functioning relay, all whilst trying to establish
a new connection (which then also incurs its own ICE timeout).
With the reduced timeout on the on-demand refresh of 1s, we detect the
disappeared relay much quicker and can immediately establish a new
connection via one of the new ones. As always with reduced timeouts,
this can create false-positives if the relay doesn't reply within 1s for
some reason.
Resolves: #9531
The Postgres logical decoding protocol lacks documentation and is
unclear about keepalive behavior when `wal_sender_timeout` is set to 0
(disabled). We have it disabled so that Postgres doesn't terminate our
connection for falling too far behind.
What we failed to take into account is that on some installations,
Postgres _never_ requests an immediate reply (a keepalive with the
reply-now bit set) if `wal_sender_timeout` is disabled. This means we
would always reply with the empty message, failing to advance the
LSN position.
In this PR, we fix that to always respond to every keepalive message
with a standby status update to advance the LSN position.
Relevant documentation:
https://www.postgresql.org/docs/current/protocol-replication.html#PROTOCOL-REPLICATION-STANDBY-STATUS-UPDATE
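A sketch of the reply being sent, following the message layout from the docs above; the LSN bookkeeping around it is elided:
```rust
/// Encodes a Standby Status Update message ('r'), acknowledging `lsn`.
fn standby_status_update(lsn: u64, now_micros: i64) -> Vec<u8> {
    let mut msg = Vec::with_capacity(34);
    msg.push(b'r');
    msg.extend_from_slice(&lsn.to_be_bytes()); // last WAL byte + 1 written
    msg.extend_from_slice(&lsn.to_be_bytes()); // last WAL byte + 1 flushed
    msg.extend_from_slice(&lsn.to_be_bytes()); // last WAL byte + 1 applied
    msg.extend_from_slice(&now_micros.to_be_bytes()); // µs since 2000-01-01
    msg.push(0); // 1 would ask the server for an immediate reply

    msg
}
```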
When defining a resource, a Firezone admin can define traffic filters to
only allow traffic on certain TCP and/or UDP ports and/or restrict
traffic on the ICMP protocol.
Presently, when a packet is filtered out on the Gateway, we simply drop
it. Dropping packets means the sending application can only react to
timeouts and has no other means of error handling. ICMP was conceived to
deal with these kinds of situations. In particular, the "destination
unreachable" type has a dedicated code for filtered packets:
"Communication administratively prohibited".
Instead of just dropping the not-allowed packet, we now send back an
ICMP error with this particular code set, thus informing the sending
application that the packet did not get lost but was in fact not routed
for policy reasons.
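For reference, the type/code pairs being sent back (values from RFC 792 and RFC 4443); the helper is a sketch:
```rust
// ICMPv4: type 3 (Destination Unreachable), code 13
// "Communication administratively prohibited"
const ICMPV4_DEST_UNREACHABLE: u8 = 3;
const ICMPV4_ADMIN_PROHIBITED: u8 = 13;

// ICMPv6: type 1 (Destination Unreachable), code 1
// "Communication with destination administratively prohibited"
const ICMPV6_DEST_UNREACHABLE: u8 = 1;
const ICMPV6_ADMIN_PROHIBITED: u8 = 1;

/// Picks the type/code for the ICMP error we send back instead of
/// silently dropping the filtered packet.
fn admin_prohibited(is_ipv6: bool) -> (u8, u8) {
    if is_ipv6 {
        (ICMPV6_DEST_UNREACHABLE, ICMPV6_ADMIN_PROHIBITED)
    } else {
        (ICMPV4_DEST_UNREACHABLE, ICMPV4_ADMIN_PROHIBITED)
    }
}
```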
When setting a traffic filter that does not allow TCP traffic,
attempting to `curl` such a resource now results in the following:
```
❯ sudo docker compose exec --env RUST_LOG=info -it client /bin/sh -c 'curl -v example.com'
* Host example.com:80 was resolved.
* IPv6: fd00:2021:1111:8000::, fd00:2021:1111:8000::1, fd00:2021:1111:8000::2, fd00:2021:1111:8000::3
* IPv4: 100.96.0.1, 100.96.0.2, 100.96.0.3, 100.96.0.4
* Trying [fd00:2021:1111:8000::]:80...
* connect to fd00:2021:1111:8000:: port 80 from fd00:2021:1111::1e:7658 port 34560 failed: Permission denied
* Trying [fd00:2021:1111:8000::1]:80...
* connect to fd00:2021:1111:8000::1 port 80 from fd00:2021:1111::1e:7658 port 34828 failed: Permission denied
* Trying [fd00:2021:1111:8000::2]:80...
* connect to fd00:2021:1111:8000::2 port 80 from fd00:2021:1111::1e:7658 port 44314 failed: Permission denied
* Trying [fd00:2021:1111:8000::3]:80...
* connect to fd00:2021:1111:8000::3 port 80 from fd00:2021:1111::1e:7658 port 37628 failed: Permission denied
* Trying 100.96.0.1:80...
* connect to 100.96.0.1 port 80 from 100.66.110.26 port 53780 failed: Host is unreachable
* Trying 100.96.0.2:80...
* connect to 100.96.0.2 port 80 from 100.66.110.26 port 60748 failed: Host is unreachable
* Trying 100.96.0.3:80...
* connect to 100.96.0.3 port 80 from 100.66.110.26 port 38378 failed: Host is unreachable
* Trying 100.96.0.4:80...
* connect to 100.96.0.4 port 80 from 100.66.110.26 port 49866 failed: Host is unreachable
* Failed to connect to example.com port 80 after 9 ms: Could not connect to server
* closing connection #0
curl: (7) Failed to connect to example.com port 80 after 9 ms: Could not connect to server
```
In order to better track how well our `ENOBUFS` mitigation is working,
we should log the use of our feature flag to PostHog. This will give us
some stats on how often this is happening. That, combined with the lack
of error reports, should give us good confidence in permanently enabling
this behaviour.
When a packet gets filtered because we are unable to evaluate the source
protocol (i.e. TCP/UDP/ICMP), the error message currently misleadingly
says that the packet got filtered because the protocol is not
supported.
The truth however is that we were never able to apply the filter in the
first place. This is a subtle difference that is quite important when
debugging filtered packets. To improve this, we add an error message to
the stack here.
Firezone uses ICMP errors to signal to client applications that e.g. a
certain IP is not reachable. This happens for example if a DNS resource
only resolves to IPv4 addresses yet the client application attempted to
use an IPv6 proxy address to connect to it.
In the presence of traffic filters for such a resource that do _not_
allow ICMP, we currently filter out these ICMP errors because - well -
ICMP traffic is not allowed! However, even when ICMP traffic is allowed,
we would fail to evaluate this filter because the ICMP error packet is
not an ICMP echo reply and therefore doesn't have an ICMP identifier. We
require this identifier in the DNS resource NAT to identify
"connections" and NAT them correctly. The same L4 component is used to
evaluate the traffic filters.
ICMP errors are critical to many usage scenarios and algorithms like
happy-eyeballs. Dropping them usually results in weird behaviour as
client applications can then only react to timeouts.
We aren't sending the OTEL metrics anywhere yet but it still makes sense
to also use the "newer" hex-representation of the Firezone ID here as
the service ID.
At present, our primary indicator as to whether telemetry is active is
whether we have a Sentry session. For our analytics events however, we
currently require passing in the Firezone ID and API url again. This
makes it difficult to send analytics events from areas of the code that
don't have this information available.
To still allow for that, we integrate the `analytics` module more
tightly with the Sentry session. This allows us to drop two parameters
from the `$identify` event and also means we now respect the
`NO_TELEMETRY` setting for these events, except for `new_session`. This
event is sent regardless because it allows us to track how many on-prem
installations of Firezone are out there.
In #9733, we changed the replies to the handle_data messages, which
seems to have caused Postgres to not respect our acknowledgements sent
in the keepalive.
To fix this, we revert to sending an empty message in response to write
messages.
Socket APIs across operating systems vary in how they handle
back-pressure. In most cases, a non-blocking socket should return
`EWOULDBLOCK` when it cannot send a given datagram and would have to
block to wait for resources to free up.
It appears that macOS doesn't always behave like that. In particular, we
are seeing error logs from a few users where sending a datagram fails
with
> No buffer space available (os error 55)
Digging through `libc`, I've found that this error is known as `ENOBUFS`
[0].
There are reports on the Apple developer forum [1] that recommend
retrying when this error happens. It is however unclear as to whether it
is entirely safe to map this error to `EWOULDBLOCK`. Other non-blocking
event-loop implementations [2] appear to do that but we don't know
whether it is fully correct.
At present, Firezone's behaviour here is to drop the packet. This means
the host networking stack has to fall back to running into a timeout and
re-sending the packet. This very likely negatively impacts the UX for
the users hitting this.
In order to validate this assumption, we implement a feature-flag. This
allows us to ship this code but switch back to the old behaviour, should
it negatively impact how Firezone behaves. In particular, if the
assumption that mapping `ENOBUFS` to `EWOULDBLOCK` is safe turns out
wrong and `kqueue` does in fact not signal readiness when more buffers
are available, then we may have missing wake-ups which would lead a
further delay in datagrams being sent.
[0]:
8e6f36c6ba/src/unix/bsd/apple/mod.rs (L2998)
[1]: https://developer.apple.com/forums/thread/42334
[2]:
aac866f399/src/unix/stream.c (L820)
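A sketch of the flag-gated mapping described above; `flag_enabled` stands in for the actual feature-flag lookup:
```rust
use std::io;

fn map_send_error(e: io::Error, flag_enabled: bool) -> io::Error {
    // "No buffer space available" (os error 55 on macOS).
    if flag_enabled && e.raw_os_error() == Some(libc::ENOBUFS) {
        // Treat it like back-pressure: the caller will retry the datagram
        // once the socket signals readiness again.
        return io::ErrorKind::WouldBlock.into();
    }

    e
}
```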
When receiving UDP packets that we cannot decode, we log an error. In
order to identify whether we might have bugs in our decoding logic, we
now also print the hex-encoding of the packet on DEBUG for further
analysis.
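A sketch of the log line, assuming the `hex` crate; the field name and message are illustrative:
```rust
fn log_undecodable(datagram: &[u8]) {
    // `hex::encode` keeps the log line printable regardless of the payload.
    tracing::debug!(packet = %hex::encode(datagram), "Failed to decode UDP packet");
}
```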
A recent release of `tslink` now supports configuration via the
`package.metadata` table, which resolves a warning about an "unknown
key" that we had been seeing for a while.
At present, and as a result of how `connlib` evolved, we still implement
a `Poll`-based function for receiving data on our UDP socket. Ever since
we moved to dedicated threads for the UDP socket, we can block directly
on receiving datagrams and don't have to poll the socket.
This simplifies the implementation a fair bit. Additionally, it made me
realise that we currently don't expose any errors on the UDP socket.
Likely, those will be ephemeral, but it is still better than completely
silencing them.
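A sketch of the blocking receive loop on the dedicated thread; the hand-off into `connlib` is elided:
```rust
use std::net::UdpSocket;

fn udp_recv_loop(socket: UdpSocket) {
    let mut buf = [0u8; 65535];

    loop {
        // Block directly on the socket; no polling needed on a dedicated thread.
        match socket.recv_from(&mut buf) {
            Ok((len, from)) => {
                // Hand `&buf[..len]` received from `from` off for processing.
                let _ = (len, from);
            }
            // Surface socket errors instead of silencing them; they are
            // likely ephemeral, so log and keep receiving.
            Err(e) => tracing::warn!("Failed to receive UDP datagram: {e}"),
        }
    }
}
```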
With a real AD (and not Intune), it seems the `valueName` attribute is
required for text elements.
Supersedes: #9419
Co-authored-by: Antoine Labarussias <antoinelabarussias@gmail.com>
This log is misplaced within the current `try_connect` function because
that function will also be called when we didn't have Internet while we
first tried to connect and then witnessed a network change (in which
case the token is cached in the `WaitingForNetwork` session state).
Thus, we move the log to the `Connect` msg handler, where we shouldn't
have any existing session.
Bumps [tslink](https://github.com/icsmw/tslink) from 0.3.0 to 0.4.2.
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/icsmw/tslink/blob/master/changelog.md">tslink's
changelog</a>.</em></p>
<blockquote>
<h1>0.4.2 (08.06.2025)</h1>
<h2>Changes</h2>
<ul>
<li>Migrate settings to <code>package.metadata.tslink</code></li>
</ul>
<h1>0.4.1 (08.06.2025)</h1>
<h2>Changes</h2>
<ul>
<li>Add support arrays in the context of <code>const</code></li>
</ul>
<h1>0.4.0 (08.06.2025)</h1>
<h2>Features</h2>
<ul>
<li>Add support of <code>const</code> for primitive types</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li>See full diff in <a
href="https://github.com/icsmw/tslink/commits">compare view</a></li>
</ul>
</details>
<br />
You can trigger a rebase of this PR by commenting `@dependabot rebase`.
> **Note**
> Automatic rebases have been disabled on this pull request as it has
been open for over 30 days.
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
We've not added changelog entries for several of the PRs that went in
recently. This adds some for the more user-facing changes and fixes that
we are shipping.
Certain UNIX systems such as macOS also use the EHOSTDOWN error to
signal that a packet cannot be sent to a certain IP. There is nothing we
can do about this error so we downgrade it from a WARN to a DEBUG like
we do for other kinds of "unreachable" errors.
When we shipped the feature of optimistic server-reflexive candidates,
we failed to add a check to only combine address and base when they are
the same IP version. This is not harmful but creates unnecessary noise.
When we create a new connection, we seed the local ICE agent with all
known local candidates, i.e. host addresses and allocations on relays.
Server-reflexive candidates are never added to the local agent because
you cannot send directly from a server-reflexive addresses. Instead, an
agent sends from the _base_ of a server-reflexive candidate which in
turn is known as a host candidate.
The server-reflexive candidate is however signaled to the remote so it
can try and send packets to it. Those will then be mapped by the NAT to
our host candidate.
In case we have just performed a network reset, our own server-reflexive
candidate may not be known yet and therefore the seeding doesn't add any
candidates. With no candidates being seeded, we also can't signal them
to the remote.
For candidates discovered later in this process, the signalling happens
as part of adding them to the local agent. Because server-reflexive
candidates are not added to the local agent, we currently miss out on
signaling those to the remote IF they weren't already present when the
ICE agent got created.
This scenario can happen right after a network reset. In practice, it
shouldn't be much of an issue though. As soon as we start sending from
our host candidate, the remote will create a peer-reflexive candidate
for it. It is however cleaner to directly send the server-reflexive
candidate once we discover it.
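A sketch of the cleaner behaviour with illustrative types (str0m's actual API differs): server-reflexive candidates get signalled directly, everything else goes through the agent, which signals as a side effect:
```rust
// Illustrative types; not str0m's actual API.
enum CandidateKind {
    Host,
    ServerReflexive,
    Relayed,
}

fn on_candidate_discovered(
    kind: CandidateKind,
    signal_to_remote: &mut dyn FnMut(),
    add_to_agent: &mut dyn FnMut(),
) {
    match kind {
        // We can never send *from* a srflx address (its base does that),
        // so it is only signalled, never added to the local agent.
        CandidateKind::ServerReflexive => signal_to_remote(),
        // Adding host/relay candidates to the agent also signals them.
        CandidateKind::Host | CandidateKind::Relayed => add_to_agent(),
    }
}
```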