Mirror of https://github.com/outbackdingo/firezone.git
Synced 2026-01-27 18:18:55 +00:00
dacc40272171d227d8beddf445eb9908652aee8e
2592 commits
dacc402721 | chore(connlib): only log span field name into message (#9981)

When looking at logs, reducing noise is critical to make it easier to spot important information. When sending logs to Sentry, we currently append the fields of certain spans to the message to make the output similar to that of `tracing_subscriber::fmt`. The actual name of a field inside a span is separated from the span name by a colon. For example, here is a log message as we see it in Sentry today:

> handle_input:class=success response handle_input:from=C1A0479AA153FACA0722A5DF76343CF2BEECB10E:3478 handle_input:method=binding handle_input:rtt=34.7479ms handle_input:tid=BB30E859ED88FFDF0786B634 request=["Software(snownet; session=BCA42EF159C794F41AE45BF5099E54D3A193A7184C4D2C3560C2FE49C4C6CFB7)"] response=["Software(firezone-relay; rev=e4ba5a69)", "XorMappedAddress(B824B4035A78A6B188EF38BE13AA3C1B1B1196D6:52625)"]

Really, what we would like to see is only this:

> class=success response from=C1A0479AA153FACA0722A5DF76343CF2BEECB10E:3478 method=binding rtt=34.7479ms tid=BB30E859ED88FFDF0786B634 request=["Software(snownet; session=BCA42EF159C794F41AE45BF5099E54D3A193A7184C4D2C3560C2FE49C4C6CFB7)"] response=["Software(firezone-relay; rev=e4ba5a69)", "XorMappedAddress(B824B4035A78A6B188EF38BE13AA3C1B1B1196D6:52625)"]

The duplication of `handle_input:` is just noise. In our local log output, we already strip the name of the span to make it easier to read. Here we now also do the same for the logs reported to Sentry.
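The stripping that #9981 describes amounts to dropping everything up to and including the colon when a field carries a span prefix. A minimal sketch, assuming a string-based helper of our own naming (connlib's actual formatter operates on structured `tracing` span fields, not on rendered strings):

```rust
/// Drop the `span_name:` prefix from a `key=value` log field.
///
/// Hypothetical helper for illustration only; the heuristic treats the part
/// before the first colon as a span name if what remains still looks like a
/// `key=value` pair.
fn strip_span_prefix(field: &str) -> &str {
    match field.split_once(':') {
        Some((_span, rest)) if rest.contains('=') => rest,
        _ => field,
    }
}

fn main() {
    assert_eq!(strip_span_prefix("handle_input:method=binding"), "method=binding");

    // Colons inside the *value* are preserved; only the span prefix goes.
    assert_eq!(
        strip_span_prefix("handle_input:from=C1A0479AA153FACA0722A5DF76343CF2BEECB10E:3478"),
        "from=C1A0479AA153FACA0722A5DF76343CF2BEECB10E:3478"
    );

    // Fields without a span prefix pass through unchanged.
    assert_eq!(strip_span_prefix("rtt=34.7479ms"), "rtt=34.7479ms");
}
```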
301d2137e5 | refactor(windows): share src IP cache across UDP sockets (#9976)

When looking through customer logs, we see a lot of "Resolved best route outside of tunnel" messages. Those get logged every time we need to rerun our re-implementation of Windows' weighting algorithm for deciding which source interface / IP a packet should be sent from.

Currently, this gets cached in every socket instance, so for the peer-to-peer socket it is only computed once per destination IP. For DNS queries, however, we make a new socket for every query. Using a new source port for each DNS query is recommended to avoid fingerprinting of DNS queries, but a new socket also means that we need to re-run this algorithm every time we make a DNS query, which is why we see this log so often.

To fix this, we need to share this cache across all UDP sockets. Cache invalidation is one of the hardest problems in computer science, and this instance is no different: the cache needs to be reset every time we roam, as roaming changes the weighting of which source interface to use. To achieve this, we extend the `SocketFactory` trait with a `reset` method. This method is called whenever we roam and can then reset a shared cache inside the `UdpSocketFactory`. The "source IP resolver" function that is passed to the UDP socket now simply accesses this shared cache and inserts a new entry when it needs to resolve the IP.

As an added benefit, this may speed up DNS queries on Windows a bit (although I haven't benchmarked it). It should certainly drastically reduce the number of syscalls we make on Windows.
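A minimal sketch of such a shared, resettable cache; the type and method names here are ours, and the real `SocketFactory` trait and source-IP resolver in connlib have different signatures:

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::sync::{Arc, Mutex};

/// Shared cache mapping destination IP -> best source IP.
/// Cloning shares the underlying map, so all sockets see the same entries.
#[derive(Default, Clone)]
struct SrcIpCache {
    inner: Arc<Mutex<HashMap<IpAddr, IpAddr>>>,
}

impl SrcIpCache {
    /// Return the cached source IP for `dst`, computing it on a cache miss.
    fn resolve(&self, dst: IpAddr, compute: impl FnOnce(IpAddr) -> IpAddr) -> IpAddr {
        *self
            .inner
            .lock()
            .unwrap()
            .entry(dst)
            .or_insert_with(|| compute(dst))
    }

    /// Called on roaming: the interface weighting may have changed,
    /// so every cached answer is potentially stale.
    fn reset(&self) {
        self.inner.lock().unwrap().clear();
    }
}

fn main() {
    let cache = SrcIpCache::default();
    let dst: IpAddr = "93.184.216.34".parse().unwrap();
    let src: IpAddr = "10.0.0.2".parse().unwrap();

    // First query computes; subsequent queries hit the cache.
    assert_eq!(cache.resolve(dst, |_| src), src);
    assert_eq!(cache.resolve(dst, |_| unreachable!()), src);

    cache.reset(); // e.g. after roaming to a different network
}
```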
409459f11c | chore(rust): bump boringtun (#9982)

Bumping the version to include https://github.com/firezone/boringtun/pull/105.
d244a99c58 | feat(connlib): always use all candidates (#9979)

In #6876, we added functionality that would only make use of new remote candidates whilst we haven't yet nominated a socket with the remote. The reason was the described edge-case where relays reboot or get replaced whilst the client is partitioned from the portal (or we experience a connection hiccup): only one of the two peers, i.e. Client or Gateway, would migrate to the new relay, leaving the other one in an inconsistent state.

Looking at recent customer logs, I've been seeing a lot of these messages:

> Unknown connection or socket has already been nominated

For this particular customer, these are then very quickly followed by ICE timeouts, leaving the connection unusable. Considering that, I no longer think that the above change was a good idea and we should instead always make use of all candidates that we are given.

What we are seeing is that in deployment scenarios where the latency link between Client and Gateway is very short (5-10ms) yet the latency to the portal is longer (~30-50ms), we trigger a race condition where we temporarily nominate a _peer-reflexive_ candidate pair instead of a regular one. This happens because, with such a short latency link, Client and Gateway are _faster_ in sending back and forth several STUN bindings than the control plane is in delivering all the candidates. Due to the functionality added in #6876, this then results in us not accepting the candidates. It further appears that a nominated peer-reflexive candidate does not provide a stable connection, which is why we then run into an ICE timeout, requiring Firezone to establish a new connection only to have the same thing happen again. This is very disruptive for the user experience as the connection only works for a few moments at a time.

With #9793, we have actually added a feature that is also at play here. Now that we don't immediately act on an ICE timeout, it is actually possible for both Client and Gateway to migrate a connection to a different relay, should the one that they are using get disconnected. In #9793, we added a timeout of 2s for this. To make this fully work, we need to patch str0m to transition to `Checking` early. Presently, str0m would directly transition from `Disconnected` to `Connected` in this case, which in some of the high-latency scenarios that we are testing in CI is not enough to recover the connection within 2s. By transitioning to `Checking` early, we abort this timer.

Related: https://github.com/algesten/str0m/pull/676
ecb2bbc86b | feat(gateway): allow updating expiry of access authorization (#9973)

Resolves: #9966
fafe2c43ea | fix(connlib): update the current socket when in idle mode (#9977)

In case we received a newly nominated socket from `str0m` whilst our connection was in idle mode, we mistakenly did not apply it and kept using the old one. ICE would still be functioning in this case because `str0m` would have updated its internal state, but we would be sending packets into Nirvana. I don't think that this is likely to be hit in production though, as it would be quite unusual to receive a new nomination whilst the connection was completely idle.
091d5b56e0 | refactor(snownet): don't memmove every packet (#9907)

When encrypting IP packets, `snownet` needs to prepare a buffer where the encrypted packet is going to end up. Depending on whether we are sending data via a relayed connection or directly, this buffer needs to be offset by 4 bytes to allow for the 4-byte channel-data header of the TURN protocol. At present, we always encrypt the packet first and then move it 4 bytes to the left on demand if we **don't** need to send it via a relay. Internally, this translates to a `memmove` instruction, which actually turns out to be very cheap (I couldn't measure a speed difference between this and `main`). All of this code has grown historically though, so I figured it is better to clean it up a bit: first evaluate whether we have a direct or relayed connection and, based on that, write the encrypted packet directly to the front of the buffer or offset by 4 bytes.
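The refactor boils down to choosing the write offset before encrypting, so the ciphertext lands in its final position and no move is needed afterwards. A hedged sketch in which a plain copy stands in for the real boringtun encapsulation and all names are ours:

```rust
use std::ops::Range;

/// TURN channel-data messages carry a 4-byte header in front of the payload.
const CHANNEL_DATA_HEADER_LEN: usize = 4;

/// Write the "encrypted" packet at its final position: offset 4 when the
/// packet will be wrapped in a channel-data message, offset 0 when sent
/// directly. Returns the range the ciphertext occupies in `buf`.
fn encapsulate(buf: &mut [u8], plaintext: &[u8], via_relay: bool) -> Range<usize> {
    let offset = if via_relay { CHANNEL_DATA_HEADER_LEN } else { 0 };
    let end = offset + plaintext.len();

    // Stand-in for the actual cipher: the point is only *where* we write.
    buf[offset..end].copy_from_slice(plaintext);

    offset..end
}

fn main() {
    let mut buf = [0u8; 32];

    // Direct connection: ciphertext starts at the front, no memmove needed.
    assert_eq!(encapsulate(&mut buf, b"packet", false), 0..6);

    // Relayed connection: room is left for the 4-byte channel-data header.
    assert_eq!(encapsulate(&mut buf, b"packet", true), 4..10);
}
```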
3e6fc8fda7 | refactor(rust): use spinlock-based buffer pool (#9951)

Profiling has shown that using a spinlock-based buffer pool is marginally (~1%) faster than the mutex-based one because it resolves contention quicker.
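For illustration, a minimal test-and-set spinlock of the kind such a pool can be built on; this is a generic sketch, not the actual implementation from the PR:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// A minimal test-and-set spinlock. For very short critical sections, such
/// as popping a buffer from a pool, spinning can resolve contention faster
/// than a mutex, which may park and later wake the losing thread.
struct SpinLock {
    locked: AtomicBool,
}

impl SpinLock {
    const fn new() -> Self {
        Self { locked: AtomicBool::new(false) }
    }

    fn lock(&self) {
        // Spin until we observe `false` and atomically swap in `true`.
        while self
            .locked
            .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            std::hint::spin_loop();
        }
    }

    fn unlock(&self) {
        self.locked.store(false, Ordering::Release);
    }
}

fn main() {
    let lock = SpinLock::new();
    lock.lock();
    // ... pop a buffer from the pool ...
    lock.unlock();
}
```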
a11983e4b3 | chore: publish gateway 1.4.13 (#9969)
6ae074005f | refactor(connlib): don't check for enabled event (#9950)

Profiling has shown that checking whether the log level is enabled is actually more expensive than checking whether the packet is a DNS packet. This improves performance by about 3%.
71e6b56654 | feat(snownet): remove "connection ID" span (#9949)

At present, `snownet` uses a `tracing::Span` to attach the connection ID to various log messages. This requires the span to be entered and exited on every packet. Whilst profiling Firezone, I noticed that it takes between 10% and 20% of CPU time on the main thread. Previously, this wasn't a bottleneck as other parts of Firezone were not yet as optimised. With some changes earlier this year of a dedicated UDP thread and better GSO, this does appear to be a bottleneck now.

On `main`, I am currently getting the following numbers on my local machine:

```
Connecting to host 172.20.0.110, port 5201
[  5] local 100.85.16.226 port 42012 connected to 172.20.0.110 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   251 MBytes  2.11 Gbits/sec   16    558 KBytes
[  5]   1.00-2.00   sec   287 MBytes  2.41 Gbits/sec    6    800 KBytes
[  5]   2.00-3.00   sec   284 MBytes  2.38 Gbits/sec    2    992 KBytes
[  5]   3.00-4.00   sec   287 MBytes  2.41 Gbits/sec    3   1.12 MBytes
[  5]   4.00-5.00   sec   290 MBytes  2.44 Gbits/sec    0   1.27 MBytes
[  5]   5.00-6.00   sec   300 MBytes  2.52 Gbits/sec    2   1.40 MBytes
[  5]   6.00-7.00   sec   295 MBytes  2.47 Gbits/sec    2   1.52 MBytes
[  5]   7.00-8.00   sec   304 MBytes  2.55 Gbits/sec    3   1.63 MBytes
[  5]   8.00-9.00   sec   290 MBytes  2.44 Gbits/sec   49   1.21 MBytes
[  5]   9.00-10.00  sec   288 MBytes  2.41 Gbits/sec   24   1023 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  2.81 GBytes  2.41 Gbits/sec  107            sender
[  5]   0.00-10.00  sec  2.81 GBytes  2.41 Gbits/sec                 receiver
```

With this patch applied, the throughput goes up significantly:

```
Connecting to host 172.20.0.110, port 5201
[  5] local 100.85.16.226 port 41402 connected to 172.20.0.110 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   315 MBytes  2.64 Gbits/sec    7    619 KBytes
[  5]   1.00-2.00   sec   363 MBytes  3.05 Gbits/sec   11    847 KBytes
[  5]   2.00-3.00   sec   379 MBytes  3.18 Gbits/sec    1   1.07 MBytes
[  5]   3.00-4.00   sec   384 MBytes  3.22 Gbits/sec   44    981 KBytes
[  5]   4.00-5.00   sec   377 MBytes  3.16 Gbits/sec  116    911 KBytes
[  5]   5.00-6.00   sec   378 MBytes  3.17 Gbits/sec    3   1.10 MBytes
[  5]   6.00-7.00   sec   377 MBytes  3.16 Gbits/sec   48    929 KBytes
[  5]   7.00-8.00   sec   374 MBytes  3.14 Gbits/sec  151    947 KBytes
[  5]   8.00-9.00   sec   382 MBytes  3.21 Gbits/sec   36    833 KBytes
[  5]   9.00-10.00  sec   375 MBytes  3.14 Gbits/sec    1   1.06 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  3.62 GBytes  3.11 Gbits/sec  418            sender
[  5]   0.00-10.00  sec  3.61 GBytes  3.10 Gbits/sec                 receiver
```

Resolves: #9948
4292ca7ae8 | test(connlib): fix failing proptest (#9864)

This essentially just bumps the boringtun dependency to include https://github.com/firezone/boringtun/pull/104.
fbf96c261e | chore(relay): remove spans (#9962)

These are flooding our monitoring infra and don't really add that much value. Pretty much all of the processing the relay does is requests in and out, and none of the spans are nested. We can therefore almost 1-to-1 replicate the logging we do with spans by adding the fields to each log message.

Resolves: #9954
f668202c83 | build(deps): bump the sentry group in /rust/gui-client with 2 updates (#9929)

Bumps the sentry group in /rust/gui-client with 2 updates: [@sentry/core](https://github.com/getsentry/sentry-javascript) and [@sentry/react](https://github.com/getsentry/sentry-javascript). Updates `@sentry/core` from 9.34.0 to 9.40.0.

Notable changes in 9.40.0, per the upstream release notes: new `webWorkerIntegration({ worker })` and `registerWebWorker({ self })` APIs to sync debugIds between web workers and the main thread; deprecation of the internal `logger` export from `@sentry/core` in favour of `debug` (the `logger` export from packages like `@sentry/browser` or `@sentry/node`, used for Sentry Logging, is unaffected); and a new OpenAI integration for Node, instrumenting `client.chat.completions.create()` and `client.responses.create()` following the OpenTelemetry semantic conventions for Generative AI.
bc1a3df82b | build(deps): bump react-router from 7.6.3 to 7.7.0 in /rust/gui-client in the react group (#9934)

Bumps the react group in /rust/gui-client with 1 update: [react-router](https://github.com/remix-run/react-router/tree/HEAD/packages/react-router). Updates `react-router` from 7.6.3 to 7.7.0.

Notable changes in 7.7.0, per the upstream changelog: unstable RSC support, plus patch fixes such as handling `InvalidCharacterError` when validating cookie signatures, passing a copy of `searchParams` to the `setSearchParams` callback to avoid mutating the internal instance, stripping search parameters from the `patchRoutesOnNavigation` `path` param for fetcher calls, a re-implemented `createRoutesStub` that works both with and without the unstable middleware feature, and removal of the `Content-Length` header from Single Fetch responses.
0cd4b94691 | build(deps): bump zbus from 5.8.0 to 5.9.0 in /rust (#9939)

Bumps [zbus](https://github.com/dbus2/zbus) from 5.8.0 to 5.9.0. From the upstream release notes:

- Remove deadlocks in Connection name request tasks, resulting in leaks under certain circumstances.
- When registering names, allow name replacement by default.
- Allow setting request name flags in `connection::Builder`.
- Proper `Default` impl for `RequestNameFlags`. This change is theoretically an API break for users who assumed the default value to be empty.
- Add `fdo::StartServiceReply` type. In 6.0 this will be the return type of `fdo::DBusProxy::start_service_by_name`. For now, just provide a `TryFrom<u32>`.
0df8c45f6c | build(deps): bump serde_json from 1.0.140 to 1.0.141 in /rust (#9938)

Bumps [serde_json](https://github.com/serde-rs/json) from 1.0.140 to 1.0.141. The release optimizes string escaping during serialization (serde-rs/json#1273, thanks @conradludgate).
bba4ebe0da | build(deps): bump eslint from 9.29.0 to 9.31.0 in /rust/gui-client (#9936)

Bumps [eslint](https://github.com/eslint/eslint) from 9.29.0 to 9.31.0.
35cd96b481 | fix(phoenix-channel): fail connection in invalid peer cert (#9946)

When being presented an invalid peer certificate, there is no reason why we should retry the connection; it is unlikely to fix itself. Plus, the certificate may get / be cached, and a restart of the application is necessary.

Resolves: #9944
318ce24403 | fix(connlib): resend AssignedIps on traffic for DNS resource (#9904)

This was exposed by #9846. It is being added here as a dedicated PR because the compatibility tests would fail or at least be flaky for the latest client release, so we cannot add the integration test right away.
82c4c39436 | chore(telemetry): don't start in local environment (#9905)

93ca701896 | chore(snownet): check remote key and creds on connection upsert (#9902)

c8760d87ae | chore(connlib): log remote address on decapsulation error (#9903)

c4457bf203 | feat(gateway): shutdown after 15m of portal disconnect (#9894)
3e71a91667 | feat(gateway): revoke unlisted authorizations upon init (#9896)

When receiving an `init` message from the portal, we will now revoke all authorizations not listed in the `authorizations` list of the `init` message.

We (partly) test this by introducing a new transition in our proptests that de-authorizes a certain resource whilst the Gateway is simulated to be partitioned. It is difficult to test that we cannot make a connection once that has happened, because we would have to simulate a malicious client that knows about resources / connections or ignores the "remove resource" message. Testing this is deferred to a dedicated task. We do test that we hit the code path of revoking the resource authorization, and because the other resources keep working, we also test that we are at least not revoking the wrong ones.

Resolves: #9892
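Conceptually, the revocation on `init` is a set difference between the authorizations the Gateway currently holds and those the portal lists. A sketch using plain string IDs (the Gateway's real authorization type differs):

```rust
use std::collections::HashSet;

/// Revoke every active authorization that the portal's `init` message
/// did not list, returning the revoked IDs (e.g. for logging).
fn revoke_unlisted(active: &mut HashSet<String>, listed: &HashSet<String>) -> Vec<String> {
    let revoked: Vec<String> = active.difference(listed).cloned().collect();

    for id in &revoked {
        active.remove(id);
    }

    revoked
}

fn main() {
    let mut active: HashSet<String> = ["res-1", "res-2", "res-3"].map(String::from).into();
    let listed: HashSet<String> = ["res-1", "res-3"].map(String::from).into();

    let revoked = revoke_unlisted(&mut active, &listed);

    // "res-2" was not listed in `init`, so its authorization is revoked;
    // the listed authorizations survive untouched.
    assert_eq!(revoked, vec!["res-2".to_string()]);
    assert_eq!(active, listed);
}
```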
a6ffdd2654 | feat(snownet): reduce rekey-attempt-time to 15s (#9891)

From Sentry reports and user-submitted logs, we know that it is possible for Client and Gateway to de-sync in regards to what each other's public key is. In such a scenario, ICE will succeed in making a connection but `boringtun` will fail to handshake a tunnel. By default, `boringtun` tries for 90s to handshake a session before it gives up and expires it.

In Firezone, the ICE agent takes care of establishing connectivity whereas `boringtun` itself just encrypts and decrypts packets. As such, if ICE is working, we know that packets aren't getting lost; instead, there must be some other issue as to why we cannot establish a session. To improve the UX in these error cases, we reduce the rekey-attempt-time to 15s, which roughly matches our ICE timeout. Those 15s count from the moment we send the first handshake, which is just after ICE completes. Thus we can be sure that after at most 15s, we either have a working WireGuard session or the connection gets cleaned up.

Related: #9890
Related: #9850
cf2470ba1e | test(iperf): install iptables rule inside of container (#9880)

In Docker environments, applying iptables rules to filter container-to-container traffic on the Docker bridged network is not reliable, leading to direct connections being established in our relayed tests. To fix this, we insert the rules directly from the client container itself.

Co-authored-by: Jamil Bou Kheir <jamilbk@users.noreply.github.com>
116b518700 | fix(snownet): discard channel-data messages from old allocations (#9885)

When we invalidate or discard an allocation, it may happen that a relay still sends channel-data messages to us. We don't recognize those and will therefore attempt to parse them as WireGuard packets, ultimately ending in a "Packet has unknown format" error. To avoid this, we check if the packet is a valid channel-data message even if we presently don't have an allocation on the relay that is sending us the packet. In those cases, we can stop processing the packet, thus avoiding these errors from being logged.
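Telling the two framings apart is cheap: TURN channel numbers are allocated from the range 0x4000..=0x7FFF (RFC 5766), so the first byte of a channel-data message always falls in 0x40..=0x7F, while a WireGuard packet starts with its message type (1 through 4). A sketch with a function name of our own:

```rust
/// Returns whether `packet` looks like a TURN channel-data message.
///
/// Channel numbers live in 0x4000..=0x7FFF (RFC 5766), so the first byte of
/// the 4-byte channel-data header is always 0x40..=0x7F. WireGuard messages
/// start with their type (1..=4) instead, so the framings never overlap in
/// the first byte.
fn is_channel_data(packet: &[u8]) -> bool {
    matches!(packet.first(), Some(b) if (0x40..=0x7F).contains(b))
}

fn main() {
    // Channel number 0x4000, length 4: a plausible channel-data header.
    assert!(is_channel_data(&[0x40, 0x00, 0x00, 0x04]));

    // WireGuard handshake initiation starts with message type 1.
    assert!(!is_channel_data(&[0x01, 0x00, 0x00, 0x00]));

    assert!(!is_channel_data(&[]));
}
```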
29f81c64ff | fix(snownet): wake idle connection on upsert (#9879)

When a connection is in idle-mode, it only sends a STUN request every 25 seconds. If the Client disconnects, e.g. due to a network partition, it may send a new connection intent later. If the Gateway's connection is still around then because it was in idle mode, it won't send any candidates to the remote, making the Client's connection fail with "no candidates received".

To alleviate this, we wake a connection out of idle mode every time it is being upserted. This ensures that the connection will fail within 15s IF the above scenario happens, allowing the Client to reconnect within a much shorter time-frame. Note that attempting to repair such a connection is likely pointless. It is much safer to discard it and let both peers establish a new connection.

Related: #9862
0f1c5f2818 | refactor(relay): simplify auth module (#9873)

Whilst looking through the auth module of the relay, I noticed that we unnecessarily convert back and forth between expiry timestamps and username formats when we could just be using the already-parsed version.
ffcb269c8b | chore(connlib): add "wake reason" to poll_timeout (#9876)

In order to debug timer interactions, it is useful to know when and why connlib wants to be woken to perform tasks.
5141817134 | feat(connlib): add reason argument to reset API (#9878)

To provide more detailed logs on why `connlib`'s network state is being reset, we add a `reason` parameter that gets logged.

Resolves: #9867
2b70596636 | fix(rust): only apply filter to select tracing layers (#9872)

Applying a filter globally to the entire subscriber means it filters events for all layers. This prevents the Sentry layer from uploading DEBUG logs if configured.
b9302cdc2a | build(deps): bump rustls from 0.23.28 to 0.23.29 in /rust (#9860)

Bumps [rustls](https://github.com/rustls/rustls) from 0.23.28 to 0.23.29.
9ed7220520 | build(deps): bump clap from 4.5.40 to 4.5.41 in /rust (#9861)

Bumps [clap](https://github.com/clap-rs/clap) from 4.5.40 to 4.5.41. From the upstream changelog for 4.5.41: adds `Styles::context` and `Styles::context_value` to customize the styling of `[default: value]`-like notes in `--help`.
8dbb02e549 | build(deps): bump zbus from 5.7.1 to 5.8.0 in /rust (#9863)

Bumps [zbus](https://github.com/dbus2/zbus) from 5.7.1 to 5.8.0. From the upstream release notes: the `interface` macro now supports write-only properties, and attributes are copied over to `receive_*_changed` and `cached_*` methods in `proxy`.
2e0ed018ee | chore: document metrics config switches as private API (#9865)
f5425ac8e4 | fix(snownet): fail connection on handshake decryption errors (#9850)

As per the WireGuard paper, `boringtun` tries to handshake with the remote peer for 90s before it gives up. This timeout is important because, when a session is discarded due to e.g. missing replies, WireGuard attempts to handshake a new session. Without this timeout, we would try to handshake a session forever.

Unfortunately, `boringtun` does not distinguish a missing handshake response from a bad one. Decryption errors whilst decoding a handshake response are simply passed up to the upper layer, in our case `snownet`. I am not sure how we can actually fail to decrypt a handshake, but the pattern we are seeing in customer logs is that this happens over and over again, so there is no point in having `boringtun` retry the handshake. Therefore, we immediately fail the connection when this happens. Failed connections are immediately removed, triggering the Client to send a new connection-intent to the portal. Such a new connection intent will then sync up the state between Client and Gateway so both of them use the most recent public key.

Resolves: #9845
|
|
cecca37073 |
feat(gateway): allow exporting metrics to an OTEL collector (#9838)
As a first step in preparation for sending OTEL metrics from Clients and Gateways to a cloud-hosted OTEL collector, we extend the CLI of the Gateway with configuration options to provide a gRPC endpoint to an OTEL collector. If `FIREZONE_METRICS` is set to `otel-collector` and an endpoint is configured via `OTLP_GRPC_ENDPOINT`, we will report our metrics to that collector. The future plan for extending this is such that if `FIREZONE_METRICS` is set to `otel-collector` (which will likely be the default) and no `OTLP_GRPC_ENDPOINT` is set, then we will use our own, hosted OTEL collector and report metrics IF the `export-metrics` feature-flag is set to `true`. This is a similar integration to the one we did for streaming logs to Sentry. We can therefore enable it at a similar granularity as we do with the logs, e.g. only enable it for the `firezone` account to start with. In the meantime, customers can already make use of those metrics if they'd like by using the current integration. Resolves: #1550 Related: #7419 --------- Co-authored-by: Antoine Labarussias <antoinelabarussias@gmail.com> |
||
|
|
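Based on the description in #9838, enabling the export might look like the following. Only the two variable names are from the commit message; the endpoint value and invocation are illustrative (4317 is the conventional OTLP/gRPC port):

```shell
# Report Gateway metrics to a self-hosted OTEL collector (example values).
FIREZONE_METRICS=otel-collector \
OTLP_GRPC_ENDPOINT=127.0.0.1:4317 \
firezone-gateway
```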
70e4b6572f |
chore(rust): log environment when updating feature flags (#9855)
It is useful to know which environment we've updated the feature-flags for. |
||
|
|
eb4c54620c |
chore(linux): add more error context to TUN device (#9853)
When we fail to create the TUN device, the error messages are currently pretty bare. Add a bit more context so users can more easily self-diagnose what is wrong. |
||
|
|
8dedc44735 |
chore(rust): bump boringtun (#9854)
The latest commits to our `boringtun` fork bring improved logs.
Diff:
|
||
|
|
66455ab0ef |
feat(gateway): translate TimeExceeded ICMP messages (#9812)
In the DNS resource NAT table, we track parts of the layer 4 protocol of
the connection in order to map packets back to the correct proxy IP in
case multiple DNS names resolve to the same real IP. The involvement of
layer 4 means we need to perform some packet inspection in case we
receive ICMP errors from an upstream router.
Presently, the only ICMP error we handle here is destination
unreachable. Those are generated e.g. when we are trying to contact an
IPv6 address but we don't have an IPv6 egress interface. An additional
error that we want to handle here is "time exceeded":
Time exceeded is sent when the TTL of a packet reaches 0. Typically,
TTLs are set high enough such that the packet makes it to its
destination. When using tools such as `tracepath` however, the TTL is
specifically only incremented one-by-one in order to resolve the exact
hops a packet is taking to a destination. Without handling the time
exceeded ICMP error, using `tracepath` through Firezone is broken
because the packets get dropped at the DNS resource NAT.
With this PR, we generalise the functionality of detecting destination
unreachable ICMP errors to also handle time-exceeded errors, allowing
tools such as `tracepath` to somewhat work:
```
❯ sudo docker compose exec --env RUST_LOG=info -it client /bin/sh -c 'tracepath -b example.com'
1?: [LOCALHOST] pmtu 1280
1: 100.82.110.64 (100.82.110.64) 0.795ms
1: 100.82.110.64 (100.82.110.64) 0.593ms
2: example.com (100.96.0.1) 0.696ms asymm 45
3: example.com (100.96.0.1) 5.788ms asymm 45
4: example.com (100.96.0.1) 7.787ms asymm 45
5: example.com (100.96.0.1) 8.412ms asymm 45
6: example.com (100.96.0.1) 9.545ms asymm 45
7: example.com (100.96.0.1) 7.312ms asymm 45
8: example.com (100.96.0.1) 8.779ms asymm 45
9: example.com (100.96.0.1) 9.455ms asymm 45
10: example.com (100.96.0.1) 14.410ms asymm 45
11: example.com (100.96.0.1) 24.244ms asymm 45
12: example.com (100.96.0.1) 31.286ms asymm 45
13: no reply
14: example.com (100.96.0.1) 303.860ms asymm 45
15: no reply
16: example.com (100.96.0.1) 135.616ms (This broken router returned corrupted payload) asymm 45
17: no reply
18: example.com (100.96.0.1) 161.647ms asymm 45
19: no reply
20: no reply
21: no reply
22: example.com (100.96.0.1) 238.066ms reached
Resume: pmtu 1280 hops 22 back 45
```
We say "somewhat work" because, due to the NAT in place for DNS
resources, the output does not disclose the intermediary hops beyond the
Gateway.
---------
Co-authored-by: Antoine Labarussias <antoinelabarussias@gmail.com>
|
||
|
|
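For reference, the two ICMPv4 error types involved can be told apart by their type field (values per RFC 792). This is an illustrative check, not connlib's actual parsing code:

```rust
// ICMPv4 type values from RFC 792. Both error types embed the offending
// packet's IP header plus the first bytes of its payload, which is what
// lets the Gateway map the error back to the right proxy IP.
const DEST_UNREACHABLE: u8 = 3;
const TIME_EXCEEDED: u8 = 11;

/// Returns true for the ICMPv4 errors the Gateway translates
/// for the DNS resource NAT.
fn is_translated_icmpv4_error(icmp_type: u8) -> bool {
    matches!(icmp_type, DEST_UNREACHABLE | TIME_EXCEEDED)
}
```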
16facd394e |
chore(rust): bump str0m (#9852)
The latest version of str0m includes a fix for a bug that would result
in an immediate ICE timeout if a remote candidate was added prior to a
local candidate. We mitigated this in #9793 to make Firezone overall
more resilient towards sudden changes in the ICE connection state.
As a defense-in-depth measure, we also fixed this issue in str0m by not
transitioning to `Disconnected` if we haven't even formed any candidate
pairs yet.
Diff:
|
||
|
|
d01701148b |
fix(rust): remove jemalloc (#9849)
I am no longer able to compile `jemalloc` on my system in a debug build.
It fails with the following error:
```
src/malloc_io.c: In function ‘buferror’:
src/malloc_io.c:107:16: error: returning ‘char *’ from a function with return type ‘int’ makes integer from pointer without a cast [-Wint-conversion]
107 | return strerror_r(err, buf, buflen);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
This appears to be a problem with modern versions of clang/gcc. I
believe this started happening when I recently upgraded my system. The
upstream [`jemalloc`](https://github.com/jemalloc/jemalloc) repository
is now archived and thus unmaintained. I am not sure if we ever measured
a significant benefit in using `jemalloc`.
Related: https://github.com/servo/servo/issues/31059
|
||
|
|
47c9922131 | test(connlib): don't attempt to listen on port 0 for TCP socket (#9851) | ||
|
|
d6805d7e48 |
chore(rust): bump to Rust 1.88 (#9714)
Rust 1.88 has been released and brings with it a quite exciting feature: let-chains! They allow us to mix and match `if` and `let` expressions, often reducing the "rightward drift" of the relevant code, making it easier to read. Rust 1.88 also comes with a new clippy lint that warns when creating a mutable reference from an immutable pointer. Attempting to fix this revealed that this is exactly what we are doing in the eBPF kernel. Unfortunately, it doesn't seem to be possible to design this in a way that is both accepted by the borrow-checker AND by the eBPF verifier. Hence, we simply make the function `unsafe` and document for the programmer what needs to be upheld. |
||
|
|
12351e5985 | ci: publish apple 1.5.4 clients (#9842) | ||
|
|
55eaa7cdc7 |
test(connlib): establish real TCP connections in proptests (#9814)
With this patch, we sample a list of DNS resources on each test run and create a "TCP service" for each of their addresses. Using this list of resources, we then change the `SendTcpPayload` transition to `ConnectTcp` and establish TCP connections using `smoltcp` to these services. For now, we don't send any data on these connections but we do set the keep-alive interval to 5s, meaning `smoltcp` itself will keep these connections alive. We also set the timeout to 30s and after each transition in a test-run, we assert that all TCP sockets are still in their expected state: - `ESTABLISHED` for most of them. - `CLOSED` for all sockets where we ended up sampling an IPv4 address but the DNS resource only supports IPv6 addresses (or vice-versa). In these cases, we use the ICMP error sent by the Gateway to assert that the socket is `CLOSED`. Unfortunately, `smoltcp` currently does not handle ICMP messages for its sockets, so we have to call `abort` ourselves. Overall, this should assert that regardless of whether we roam networks, switch relays or do other kinds of things with the underlying connection, the tunneled TCP connection stays alive. In order to make this work, I had to tweak the timeouts when we are on-demand refreshing allocations. This only happens in one particular case: When we are being given new relays by the portal, we refresh all _other_ relays to make sure they are still present. In other words, all relays that we didn't remove and didn't just add but still had in-memory are refreshed. This is important for cases where we are network-partitioned from the portal whilst relays are deployed or reset their state otherwise. Instead of the previous 8s max elapsed time of the exponential backoff that we use for other requests, we now only use a single message with a 1s timeout there. With the increased ICE timeout of 15s, a TCP connection with a 30s timeout would otherwise not survive such an event. 
This is because it takes the above mentioned 8s for us to remove a non-functioning relay, all whilst trying to establish a new connection (which also incurs its own ICE timeout then). With the reduced timeout on the on-demand refresh of 1s, we detect the disappeared relay much quicker and can immediately establish a new connection via one of the new ones. As always with reduced timeouts, this can create false-positives if the relay doesn't reply within 1s for some reason. Resolves: #9531 |
||
|
|
520dd0aa31 |
feat(gateway): respond with ICMP error for filtered packets (#9816)
When defining a resource, a Firezone admin can define traffic filters to only allow traffic on certain TCP and/or UDP ports and/or restrict traffic on the ICMP protocol. Presently, when a packet is filtered out on the Gateway, we simply drop it. Dropping packets means the sending application can only react to timeouts and has no other means of error handling. ICMP was conceived to deal with these kinds of situations. In particular, the "destination unreachable" type has a dedicated code for filtered packets: "Communication administratively prohibited". Instead of just dropping the disallowed packet, we now send back an ICMP error with this particular code set, thus informing the sending application that the packet did not get lost but was in fact not routed for policy reasons. When setting a traffic filter that does not allow TCP traffic, attempting to `curl` such a resource now results in the following: ``` ❯ sudo docker compose exec --env RUST_LOG=info -it client /bin/sh -c 'curl -v example.com' * Host example.com:80 was resolved. * IPv6: fd00:2021:1111:8000::, fd00:2021:1111:8000::1, fd00:2021:1111:8000::2, fd00:2021:1111:8000::3 * IPv4: 100.96.0.1, 100.96.0.2, 100.96.0.3, 100.96.0.4 * Trying [fd00:2021:1111:8000::]:80... * connect to fd00:2021:1111:8000:: port 80 from fd00:2021:1111::1e:7658 port 34560 failed: Permission denied * Trying [fd00:2021:1111:8000::1]:80... * connect to fd00:2021:1111:8000::1 port 80 from fd00:2021:1111::1e:7658 port 34828 failed: Permission denied * Trying [fd00:2021:1111:8000::2]:80... * connect to fd00:2021:1111:8000::2 port 80 from fd00:2021:1111::1e:7658 port 44314 failed: Permission denied * Trying [fd00:2021:1111:8000::3]:80... * connect to fd00:2021:1111:8000::3 port 80 from fd00:2021:1111::1e:7658 port 37628 failed: Permission denied * Trying 100.96.0.1:80... * connect to 100.96.0.1 port 80 from 100.66.110.26 port 53780 failed: Host is unreachable * Trying 100.96.0.2:80... 
* connect to 100.96.0.2 port 80 from 100.66.110.26 port 60748 failed: Host is unreachable * Trying 100.96.0.3:80... * connect to 100.96.0.3 port 80 from 100.66.110.26 port 38378 failed: Host is unreachable * Trying 100.96.0.4:80... * connect to 100.96.0.4 port 80 from 100.66.110.26 port 49866 failed: Host is unreachable * Failed to connect to example.com port 80 after 9 ms: Could not connect to server * closing connection #0 curl: (7) Failed to connect to example.com port 80 after 9 ms: Could not connect to server ``` |
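For reference, the "administratively prohibited" codes involved are defined in RFC 1812 (ICMPv4) and RFC 4443 (ICMPv6); Linux surfaces them to applications as `EHOSTUNREACH` and `EACCES` respectively, which matches the "Host is unreachable" and "Permission denied" lines in the `curl` output above. A small sketch with illustrative names (not the Gateway's actual packet-building code):

```rust
// ICMP "destination unreachable" type/code pairs for administratively
// prohibited traffic (RFC 1812 for ICMPv4, RFC 4443 for ICMPv6).
const ICMPV4_DEST_UNREACHABLE: u8 = 3;
const ICMPV4_ADMIN_PROHIBITED: u8 = 13;
const ICMPV6_DEST_UNREACHABLE: u8 = 1;
const ICMPV6_ADMIN_PROHIBITED: u8 = 1;

/// Returns the (type, code) pair to send back for a filtered packet.
fn prohibited_type_code(is_ipv6: bool) -> (u8, u8) {
    if is_ipv6 {
        (ICMPV6_DEST_UNREACHABLE, ICMPV6_ADMIN_PROHIBITED)
    } else {
        (ICMPV4_DEST_UNREACHABLE, ICMPV4_ADMIN_PROHIBITED)
    }
}
```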