Once we start collecting metrics across various Clients and Gateways,
these metrics need to be tagged with the correct `service.name` and
`service.version`, as well as an instance ID to differentiate metrics
from different instances.
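As a rough sketch (the values and the exact SDK wiring are assumptions, not the actual implementation), the resource attributes could be assembled like this:

```rust
use opentelemetry::KeyValue;

/// Sketch only: the resource attributes we would attach to every exported metric.
/// How these are wired into the exporter depends on the opentelemetry SDK version.
fn resource_attributes(instance_id: String) -> Vec<KeyValue> {
    vec![
        KeyValue::new("service.name", "gateway"), // or "client", depending on the binary
        KeyValue::new("service.version", env!("CARGO_PKG_VERSION")),
        KeyValue::new("service.instance.id", instance_id),
    ]
}
```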
When deploying, the cluster state diverges temporarily, which allows
more than one `ReplicationConnection` process to start on the new nodes.
(One of) the old nodes still has an active slot, and we get an "object
in use" error `(Postgrex.Error) ERROR 55006 (object_in_use) replication
slot "events_slot" is active for PID 603037`.
Rather than use ReplicationConnection's restart behavior (which logs
tons of errors with Logger.error), we can use the Supervisor here
instead, and continue to try and start the ReplicationConnection until
successful.
Note that if the process name is registered (globally) and running,
ReplicationConnection.start_link/1 simply returns `{:ok, pid}` instead
of erroring out with `:already_running`, so eventually one of the nodes
will succeed and the remaining ones will return the globally-registered
pid.
The latest release includes our upstreamed fix for handling
`segment_size` > `contents.len()` and therefore our local workaround is
no longer necessary.
We need the `replication` attribute set on the db user. This is
trivially done in a migration, and with the `CURRENT_USER` specifier, we
don't need to fetch the Application configuration.
When a NAT session expires or other disallowed traffic is routed to the
Gateway, we drop these packets. It will be useful to learn how often
that actually happens and why the packets were dropped.
To do so, we add a counter metric for these packets.
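As a hedged sketch using the `opentelemetry` crate (counter name, attribute, and wiring are assumptions):

```rust
use opentelemetry::{metrics::Counter, KeyValue};

/// Sketch: record one dropped packet together with the reason it was dropped.
/// The `Counter` itself would be created once from the meter during Gateway startup.
fn record_dropped_packet(dropped_packets: &Counter<u64>, reason: &'static str) {
    dropped_packets.add(1, &[KeyValue::new("reason", reason)]);
}
```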
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Turns out we are making replication overly complex by creating a
dedicated user for it. The `web` user is already privileged and we can
reuse it, since the replication system operates in the same security
context as the rest of the app.
The LoggerJSON Redactor only redacts top-level keys, so we need to
redact the entire `connection_opts` param in order to hide the password
it contains.
We also don't need to carry `connection_opts` in the entire
ReplicationConnection process state; it is only needed for the initial
connection, so we refactor it out of the `state`.
This tunnel throughput benchmark isn't very useful and it is very
flaky. Remove it entirely until we can replace it with something more
robust and useful.
Resolves: #8172
I don't understand why, but in its current location, this log simply
doesn't show up for anything other than UDP packets. If we move it up,
it will actually log all packets.
We have been using buffer pools for a while all over `connlib` as a way
to efficiently use heap-allocated memory. This PR harmonizes the usage
of buffer pools across the codebase by introducing a dedicated
`bufferpool` crate. This crate offers a convenient and easy-to-use API
for all the things we (currently) need from buffer pools. As a nice
bonus of having it all in one place, we can now also track metrics of
how many buffers we have currently allocated.
An example output from the local metrics exporter looks like this:
```
Name : system.buffer.count
Description : The number of buffers allocated in the pool.
Unit : {buffers}
Type : Sum
Sum DataPoints
Monotonic : false
Temporality : Cumulative
DataPoint #0
StartTime : 2025-04-29 12:41:25.278436
EndTime : 2025-04-29 12:42:25.278088
Value : 96
Attributes :
-> system.buffer.pool.name: udp-socket-v6
-> system.buffer.pool.buffer_size: 65535
DataPoint #1
StartTime : 2025-04-29 12:41:25.278436
EndTime : 2025-04-29 12:42:25.278088
Value : 7
Attributes :
-> system.buffer.pool.buffer_size: 131600
-> system.buffer.pool.name: gso-queue
DataPoint #2
StartTime : 2025-04-29 12:41:25.278436
EndTime : 2025-04-29 12:42:25.278088
Value : 128
Attributes :
-> system.buffer.pool.name: udp-socket-v4
-> system.buffer.pool.buffer_size: 65535
DataPoint #3
StartTime : 2025-04-29 12:41:25.278436
EndTime : 2025-04-29 12:42:25.278088
Value : 8
Attributes :
-> system.buffer.pool.buffer_size: 1336
-> system.buffer.pool.name: ip-packet
DataPoint #4
StartTime : 2025-04-29 12:41:25.278436
EndTime : 2025-04-29 12:42:25.278088
Value : 9
Attributes :
-> system.buffer.pool.buffer_size: 1336
-> system.buffer.pool.name: snownet
```
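For illustration, a minimal sketch of the general shape of such a pooled-buffer API (this is not the actual `bufferpool` crate, just the idea):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

/// Illustrative sketch, not the real `bufferpool` crate: buffers of a fixed size are
/// handed out on demand and reused once returned, and we track how many were allocated.
pub struct BufferPool {
    free: Mutex<Vec<Vec<u8>>>,
    buffer_size: usize,
    allocated: AtomicUsize, // roughly what a `system.buffer.count`-style metric could report
}

impl BufferPool {
    pub fn new(buffer_size: usize) -> Self {
        Self {
            free: Mutex::new(Vec::new()),
            buffer_size,
            allocated: AtomicUsize::new(0),
        }
    }

    /// Hand out a buffer, reusing a previously returned one if possible.
    pub fn pull(&self) -> Vec<u8> {
        self.free.lock().unwrap().pop().unwrap_or_else(|| {
            self.allocated.fetch_add(1, Ordering::Relaxed);
            vec![0; self.buffer_size]
        })
    }

    /// Return a buffer to the pool so the next `pull` can reuse it.
    pub fn put_back(&self, buf: Vec<u8>) {
        self.free.lock().unwrap().push(buf);
    }

    /// Number of buffers this pool has allocated so far.
    pub fn num_allocated(&self) -> usize {
        self.allocated.load(Ordering::Relaxed)
    }
}
```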
Resolves: #8385
Whenever the Gateway is instructed to (re)create the NAT for a DNS
resource, it performs a DNS query and then overwrites the existing
entries in the NAT table. Depending on how the DNS records are defined,
this may lead to a very bad user experience where connections are cut
regularly.
In particular, if a service utilises round-robin DNS where a DNS query
only ever returns a single entry yet that entry may change as soon as
the TTL expires, all connections for this particular DNS resource for a
Client get cut.
To fix this, we now first check for active NAT sessions for a given
proxy IP and only replace the entry if there is no open NAT session.
NAT sessions have a TTL of 1 minute, meaning there needs to be at least
one outgoing packet from the Client every minute to keep a session open.
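Illustratively (types and field names are not the actual Gateway state):

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::time::{Duration, Instant};

const NAT_SESSION_TTL: Duration = Duration::from_secs(60);

/// Sketch of the decision described above: a NAT entry for a proxy IP is only replaced
/// if no session has seen an outgoing packet within the TTL.
struct NatSession {
    last_outgoing_packet: Instant,
}

fn should_replace_nat_entry(
    sessions: &HashMap<IpAddr, NatSession>,
    proxy_ip: IpAddr,
    now: Instant,
) -> bool {
    match sessions.get(&proxy_ip) {
        Some(session) => now.duration_since(session.last_outgoing_packet) >= NAT_SESSION_TTL,
        None => true,
    }
}
```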
Firezone's control plane is a realtime, distributed system that relies
on a broadcast/subscribe system to function. In many cases, these events
are broadcasted whenever relevant data in the DB changes, such as an
actor losing access to a policy, a membership being deleted, and so
forth.
Today, this is handled in the application layer, typically happening at
the place where the relevant DB call is made (i.e. in an
`after_commit`). While this approach has worked thus far, it has several
issues:
1. We have no guarantee that the DB change will issue a broadcast. If
the application is deployed or the process crashes after the DB changes
are made but before the broadcast happens, we will have potentially
failed to update any connected clients or gateways with the changes.
2. We have no guarantee that broadcasts happen in the same order as the
corresponding DB updates. In other words, app server A could win its DB
operation against app server B, but then lose the race to broadcast
first.
3. If the cluster is in a bad state where broadcasts may return an error
(i.e. https://github.com/firezone/firezone/issues/8660), we will never
retry the broadcast.
To fix the above issues, we introduce a WAL logical decoder that
processes the event stream one message at a time and performs any needed work.
Serializability is guaranteed since we only process the WAL in a single,
cluster-global process, `ReplicationConnection`. Durability is also
guaranteed since we only ACK WAL segments after we've successfully
ingested the event.
This means we will only advance the position of our WAL stream after
successfully broadcasting the event.
This PR only introduces the WAL stream processing system but does not
introduce any changes to our current broadcasting behavior - that's
saved for another PR.
It will be interesting to learn, for example, how many installations have
no IPv6 connectivity, as those will encounter `NetworkUnreachable`
errors. We categorise the errors by IO direction and IP stack, which will
allow us to deduce this information.
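A hedged sketch of the two dimensions (attribute names are illustrative):

```rust
use opentelemetry::KeyValue;

/// Sketch: the two dimensions we attach to the UDP error counter. For example, a
/// `NetworkUnreachable` error on an IPv6 send would show up with
/// `{io.direction = "send", network.type = "ipv6"}`.
enum Direction {
    Send,
    Recv,
}

enum IpStack {
    V4,
    V6,
}

fn error_attributes(direction: Direction, stack: IpStack) -> [KeyValue; 2] {
    [
        KeyValue::new(
            "io.direction",
            match direction {
                Direction::Send => "send",
                Direction::Recv => "recv",
            },
        ),
        KeyValue::new(
            "network.type",
            match stack {
                IpStack::V4 => "ipv4",
                IpStack::V6 => "ipv6",
            },
        ),
    ]
}
```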
In order to detect changes to DNS records of DNS resources, `connlib`
will recreate the DNS resource NAT whenever it receives a query for a
DNS resource. The way we implemented this was by clearing the local
state of the DNS resource NAT, which triggered us to perform the
handshake with the Gateway again upon the next packet for this resource.
The Gateway would then perform the DNS query and respond back when this
was finished.
In order to not drop any packets, `connlib` has a buffer where it keeps
the packets that are arriving in the meantime. This works reasonably
well when the connection is first set up because we are only buffering a
TCP SYN or equivalent handshake packet. Yet, when the connection is in
full use and the application just so happens to make another DNS query,
we halt the entire flow of packets until the Gateway has confirmed the
NAT again. To prevent high memory use, the buffer for these packets is
capped at 32 packets, which is nowhere near enough when a connection is
actively transferring data (like a file upload).
In most cases, the DNS query on the Gateway will yield the exact same
results because the records haven't changed. Thus, there is no reason
for us to halt the flow of these packets when we are merely
_recreating_ the DNS resource NAT. With this change, the handshake
happens in parallel to the actual packet flow and does not interrupt
anything in the happy-path case.
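In pseudo-Rust, the new behaviour boils down to something like this (types and names are illustrative, not connlib's actual ones):

```rust
use std::collections::HashSet;

/// Illustrative sketch, not connlib's actual types.
struct DnsResourceNat {
    /// Resources for which proxy-IP mappings are already established.
    established: HashSet<u64>,
}

enum Action {
    /// Keep forwarding through the existing mappings and refresh the NAT in parallel.
    RefreshInBackground,
    /// No mapping yet: perform the handshake and buffer packets until it completes.
    SetupAndBuffer,
}

/// Decide what to do when the Client observes a DNS query for a DNS resource.
fn on_dns_query(nat: &DnsResourceNat, resource_id: u64) -> Action {
    if nat.established.contains(&resource_id) {
        Action::RefreshInBackground
    } else {
        Action::SetupAndBuffer
    }
}
```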
With #7590, we've moved all UDP IO operations to a separate thread. As a
result, some of the handling of IO errors within the Client's and
Gateway's event-loop no longer applied, as those errors are now captured
within the respective thread. To fix this, we extend the type signature
of the receive channel to also allow for errors and use it to send back
errors from sending AND receiving UDP datagrams.
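A sketch of the extended channel (types and names are illustrative, using a plain std channel):

```rust
use std::io;
use std::net::{SocketAddr, UdpSocket};
use std::sync::mpsc;

/// Sketch: the UDP thread now sends either a received datagram or the IO error it hit,
/// so the event-loop can handle both.
struct Datagram {
    from: SocketAddr,
    payload: Vec<u8>,
}

type UdpEvent = io::Result<Datagram>;

fn udp_recv_thread(socket: UdpSocket, tx: mpsc::Sender<UdpEvent>) {
    let mut buf = [0u8; 65535];

    loop {
        let event = socket
            .recv_from(&mut buf)
            .map(|(len, from)| Datagram { from, payload: buf[..len].to_vec() });

        if tx.send(event).is_err() {
            return; // The event-loop has gone away; stop the thread.
        }
    }
}
```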
When a platform's network driver does not support GSO, `quinn-udp`
detects that and disables segmentation offloading:
> 04-30 11:32:49.161 19612 19836 I connlib : quinn_udp:👿
`libc::sendmsg` failed with I/O error (os error 5); halting segmentation
offload
What this means is that it sets an internal field that caps the GSO
batch size at 1 (instead of the default 32). We then use this batch size
to compute how we are meant to chunk up the already batched datagrams.
As a consequence of #8920, we are now also using a "feature" of GSO
where the last datagram in a GSO batch is allowed to be less than the
segment size.
The combination of these two features now makes it possible that we are
passing a datagram to the kernel where the `segment_size` is greater
than the actual length. Android's Linux kernel doesn't seem to like that
and barfs with IO error 5 when passed such a datagram.
The long-term fix is to sanitise this within `quinn-udp`, but in the
short term, we can do this ourselves as part of the loop where we
segment the datagrams.
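A minimal sketch of that clamp (names are illustrative; the real code lives in our segmentation loop):

```rust
/// Sketch: when chunking an already-batched buffer into GSO sends, never report a
/// `segment_size` that is larger than the chunk we actually pass down. `max_segments`
/// is 1 when the platform has disabled segmentation offload; both arguments are
/// assumed to be non-zero.
fn gso_chunks<'a>(
    contents: &'a [u8],
    segment_size: usize,
    max_segments: usize,
) -> impl Iterator<Item = (&'a [u8], usize)> + 'a {
    contents
        .chunks(segment_size * max_segments)
        .map(move |chunk| (chunk, segment_size.min(chunk.len())))
}
```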
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This isn't plugged into anything yet on the Swift side but lays the
foundation for changing the log-level at runtime without having to sign
the user out.
Currently, when `connlib`'s log file gets deleted, we write logs into
nirvana until the corresponding process gets restarted. This is painful
for users because they need to restart the IPC service or Network
Extension. Instead, we can simply check whether the log file exists prior
to writing to it and re-create it if it doesn't.
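Conceptually (a std-only sketch; the real change sits inside the log writer, which keeps the file handle around):

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

/// Sketch: before writing, re-create the log file if it was deleted underneath us.
fn append_log_line(path: &Path, line: &str) -> io::Result<()> {
    if !path.exists() {
        File::create(path)?; // The file is gone: re-create it instead of writing into nirvana.
    }

    let mut file = OpenOptions::new().append(true).open(path)?;
    writeln!(file, "{line}")
}
```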
Resolves: #6850
Related: #7569
When working on the Rust code of Firezone from a MacOS computer, it is
useful to have pretty much all of the code at least compile in order to
detect problems early. Eventually, once we target features like a
headless MacOS client, some of these stubs will actually be filled in and
become functional.
This shouldn't matter because we are only using the `UniquePacketBuffer`
on the client and not on the Gateway where SYN-ACK packets would be sent
from. To be fully correct though, we need to also compare the ACK flag
of the two packets.
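For reference, a hedged sketch of what comparing the flags could look like on raw TCP headers (byte 13 of the TCP header holds the flags; this is not the actual `UniquePacketBuffer` code):

```rust
/// Sketch: two TCP packets should only be considered duplicates if both the SYN and the
/// ACK flag match, so a SYN is never deduplicated against a SYN-ACK. `a` and `b` are
/// assumed to be raw TCP headers of at least 14 bytes.
fn same_syn_ack_flags(a: &[u8], b: &[u8]) -> bool {
    const SYN: u8 = 0x02;
    const ACK: u8 = 0x10;

    let flags = |hdr: &[u8]| hdr[13] & (SYN | ACK);

    flags(a) == flags(b)
}
```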
Despite our efforts in #8912, the current implementation still does not
do enough to maintain packet ordering across GSO batches.
At present, we very aggressively batch packets of the same length
together. This however is too eager when we consider packet flows such
as the following:
```
9:03:49.585143 IP 10.128.15.241.3000 > 100.69.109.138.53474: Flags [.], seq 1:1229, ack 524, win 249, options [nop,nop,TS val 3862031964 ecr 1928356896], length 1228
09:03:49.585151 IP 10.128.15.241.3000 > 100.69.109.138.53474: Flags [P.], seq 1229:2063, ack 524, win 249, options [nop,nop,TS val 3862031964 ecr 1928356896], length 834
09:03:49.585157 IP 10.128.15.241.3000 > 100.69.109.138.53474: Flags [P.], seq 2063:3094, ack 524, win 249, options [nop,nop,TS val 3862031964 ecr 1928356896], length 1031
09:03:49.585187 IP 10.128.15.241.3000 > 100.69.109.138.53474: Flags [.], seq 3094:4322, ack 524, win 249, options [nop,nop,TS val 3862031964 ecr 1928356896], length 1228
09:03:49.585188 IP 10.128.15.241.3000 > 100.69.109.138.53474: Flags [P.], seq 4322:5156, ack 524, win 249, options [nop,nop,TS val 3862031964 ecr 1928356896], length 834
09:03:49.585227 IP 10.128.15.241.3000 > 100.69.109.138.53474: Flags [.], seq 5156:6384, ack 524, win 249, options [nop,nop,TS val 3862031964 ecr 1928356896], length 1228
09:03:49.585228 IP 10.128.15.241.3000 > 100.69.109.138.53474: Flags [P.], seq 6384:7612, ack 524, win 249, options [nop,nop,TS val 3862031964 ecr 1928356896], length 1228
09:03:49.585230 IP 10.128.15.241.3000 > 100.69.109.138.53474: Flags [P.], seq 7612:8249, ack 524, win 249, options [nop,nop,TS val 3862031964 ecr 1928356896], length 637
09:03:49.585846 IP 10.128.15.241.3000 > 100.69.109.138.53474: Flags [.], seq 8249:9477, ack 524, win 249, options [nop,nop,TS val 3862031964 ecr 1928356896], length 1228
09:03:49.585851 IP 10.128.15.241.3000 > 100.69.109.138.53474: Flags [P.], seq 9477:10705, ack 524, win 249, options [nop,nop,TS val 3862031964 ecr 1928356896], length 1228
```
As we can see here, the remote sends us packet batches of varying
lengths:
- 1228, 834
- 1031
- 1228, 834
- 1228, 1228, 637
- 1228, 1228
1228 represents a "full" TCP packet, so any packet following a
full packet SHOULD be grouped together into a GSO batch.
Currently, we batch all the 1228 packets together and ignore the fact
that there were actually smaller-sized packets in between them that
belong together.
To mitigate this, we refactor the `GsoQueue` to remove the
`segment_size` from the binning key of our map and instead only group
batches by their source, destination and ECN information. Within such a
connection, we then create an ordered list of batches. A new batch is
started if the packet length differs or if we have previously pushed a
packet that isn't of the batch's length, which signals the end of that
batch.
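A sketch of the new binning (illustrative types, not the actual `GsoQueue`):

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

/// Illustrative sketch: batches are keyed by connection only, and within a connection we
/// keep an ordered list of batches so packet order is preserved.
#[derive(PartialEq, Eq, Hash, Clone, Copy)]
struct Connection {
    src: SocketAddr,
    dst: SocketAddr,
    ecn: u8,
}

struct Batch {
    segment_size: usize,
    payload: Vec<u8>,
    /// Set once we pushed a packet shorter than `segment_size`: GSO only allows the
    /// *last* segment of a batch to be shorter, so the batch is closed afterwards.
    closed: bool,
}

#[derive(Default)]
struct GsoQueue {
    batches: HashMap<Connection, Vec<Batch>>,
}

impl GsoQueue {
    fn push(&mut self, conn: Connection, packet: &[u8]) {
        let batches = self.batches.entry(conn).or_default();

        let appended = match batches.last_mut() {
            // Append to the current batch if it is still open and the packet fits its segment size.
            Some(batch) if !batch.closed && packet.len() <= batch.segment_size => {
                if packet.len() < batch.segment_size {
                    batch.closed = true; // A shorter packet must be the last segment.
                }
                batch.payload.extend_from_slice(packet);
                true
            }
            _ => false,
        };

        if !appended {
            // Otherwise start a new batch, preserving the order in which packets arrived.
            batches.push(Batch {
                segment_size: packet.len(),
                payload: packet.to_vec(),
                closed: false,
            });
        }
    }
}
```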
The result here looks very promising (this is loading
`blog.firezone.dev` via the `lynx` browser from within the
headless-client docker container, so going through a Gateway running
this PR):
(Screenshots comparing `main` with this PR omitted.)
Related: #8899
Having multiple threads for reading and writing the TUN device can cause
packet re-orderings on the client. All other clients only use a single
TUN thread, so aligning this value means a more consistent behaviour of
Firezone across all platforms.
Generic Segmentation Offload (GSO) is a clever way of reducing the
number of syscalls made when you want to send a lot of packets with
the same length to the same recipient. The way this works is that the
packets are concatenated and passed to the kernel as a single packet
together with the `segment_size` as an out-of-band argument.
The component managing this batching in `connlib` is called `GsoQueue`.
In #8772, we made the order in which these batches are sent to the
kernel explicit by prioritising batches with smaller segments. What we
overlooked with that strategy is that in a particular GSO batch, the
last packet is actually allowed to be of a different length.
For example, say the user is downloading an image of 4500 bytes. With
our MTU of 1280, we have a payload size of 1252. This results in three
fully-filled packets and one packet of 744 bytes. With the change in
#8772, the small packet of 744 bytes will be transferred first, followed
by the "train" of fully filled packets.
To fix this, we flip the order here and transfer batches of larger sizes
first. The original problem we attempted to mitigate in #8772 no longer
exists now that we merged #7590: we now simply suspend if the UDP
socket isn't ready instead of dropping the next batch.
By flipping the order here, we guarantee that batches with a larger size
are sent before batches with a smaller size. This should also imply that
the encapsulated IP packets of e.g. an image arrive in the correct order
(with the smallest packet last as it is part of a smaller batch). What
we don't guarantee with this is that there won't be any other IP packets
sent "in the middle" of such a batch. This shouldn't be a problem though
as we are simply interleaving packets of different TCP / UDP connections
with each other which already happens on the regular Internet anyway.
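Conceptually, the flipped ordering is just a descending sort by segment size (the tuple below stands in for the actual batch type):

```rust
/// Sketch: flush batches with the larger segment size first so the short tail segment of
/// e.g. a file download leaves after the full-sized packets it belongs to.
fn order_for_sending(batches: &mut [(usize, Vec<u8>)]) {
    batches.sort_by(|a, b| b.0.cmp(&a.0)); // descending by segment size
}
```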
Turns out that the standard `pgoutput` plugin shipped with Postgres will
do everything we need it to, and there are good examples of prior art
decoding its binary output in Elixir (in production).
So to avoid adding a dependency on `wal2json` here, we'll go with that.
Why:
* The copy-to-clipboard button on the new API token page was not working
at all because the FlowbiteJS library expects the elements to be present
in the DOM on first render, which was not the case for the API token
code block.
* The existing code blocks' copy-to-clipboard buttons gave no visual
indication that the copy had completed and were also somewhat difficult
to see.
* This commit updates the buttons to be more visible, adds a phx-hook to
make sure the FlowbiteJS init functions are run on every code block even
if it's inserted after the initial load of the page, and adds callback
functions that toggle the button text and icon to show the text has been
copied.
Enables `wal_level = logical` so we can start implementing CDC.
**WARNING**: This will trigger a restart of our database instance, so it
should be deployed during our standard maintenance window (Saturday
evening).
In order to develop and test WAL replication, we need the wal2json
module installed in our dev postgres image. The module itself builds
very quickly, but I thought it would be better to have this
automatically built and pushed as part of a nightly job so that CI and
developers can make use of it.
API clients don't belong to any actor_groups and attempting to deep link
into the `groups` section when viewing an actor raises a 500 error.
This PR fixes that by removing the deep link into `actor_groups` from
the actors index view.
Prevents more than one sync-enabled adapter per account in order to
prepare for eventually adding a unique constraint on
`provider_identifier` for identities and groups per account.
Related: #6294
---------
Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Brian Manifold <bmanifold@users.noreply.github.com>
Sufficiently large receive buffers are important to sustain
high throughput as latency increases. If the receive buffer in the
kernel is too small, packets need to be dropped on arrival.
Firefox uses 1MB in its QUIC stack [0]. `quic-go` recommends setting send
and receive buffers to 7.5 MB [1]. Power users of Firezone are likely
receiving a lot more traffic than the average Firefox user (especially
with the Internet Resource activated), so setting it to 10 MB seems
reasonable. Sending packets is likely not as critical because we have
back-pressure through our system such that we will stop reading IP
packets when we cannot write to our UDP socket. The UDP socket is
sitting in a separate thread and those threads are connected with
dedicated queues, which act as another buffer. However, as the data below
shows, some systems have really small send buffers, which are likely a
throughput bottleneck at the moment because we need to suspend writing so
frequently.
Assuming a 50ms latency, the bandwidth-delay product tells us that we
can (in theory) saturate a 1.6 Gbps link with a 10MB receive buffer
(assuming the OS also has large enough buffer sizes in its TCP or QUIC
stack):
```
10 MB = 80 Mb
80 Mb / 0.05 s = 1600 Mbps ≈ 1.6 Gbps
```
Experiments and research [2] show the following:
|OS|Receive buffer (default)|Receive buffer (this PR)|Send buffer (default)|Send buffer (this PR)|
|---|---|---|---|---|
|Windows|65KB|10MB|65KB|1MB|
|MacOS|786KB|8MB|9KB|1MB|
|Linux|212KB|212KB|212KB|212KB|
With the exception of Linux, the OSes appear to be quite generous with
how big they allow receive buffers to be. On Linux, these limits can be
changed by setting the `net.core.rmem_max` and `net.core.wmem_max`
parameters using `sysctl`.
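For reference, a sketch of how such buffer sizes could be requested using the `socket2` crate (the OS may silently clamp the values to its configured maximums):

```rust
use socket2::{Domain, Protocol, Socket, Type};
use std::io;

/// Sketch: request a 10 MB receive buffer and a 1 MB send buffer on the UDP socket.
fn make_udp_socket() -> io::Result<Socket> {
    let socket = Socket::new(Domain::IPV6, Type::DGRAM, Some(Protocol::UDP))?;

    socket.set_recv_buffer_size(10 * 1024 * 1024)?;
    socket.set_send_buffer_size(1024 * 1024)?;

    Ok(socket)
}
```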
Most of our users are on Windows and MacOS, meaning they immediately
benefit from this without having to change any system settings. Larger
client-side UDP receive buffers are critical for any "download" scenario,
which is likely the majority of use cases Firezone is used for.
On Windows, increasing this receive buffer almost doubles the throughput
in an iperf3 download test.
[0]: https://github.com/mozilla/neqo/pull/2470
[1]: https://github.com/quic-go/quic-go/wiki/UDP-Buffer-Sizes
[2]: https://unix.stackexchange.com/a/424381
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
When reading from our UDP socket, we utilise GRO to read multiple
packets originating from the same IP + port and with the same length in
a single syscall. Currently, we can read up to 10 different combinations
here in a single syscall. `quinn_udp` actually exposes a constant for
how many batches it can handle at a time. Instead of hard-coding the
value 10, we now follow this constant.
On Linux and MacOS (with `apple-fast-datapath`), this constant has the
value 32. On Windows, it is 1.
Even on my not-so-fast Internet connection of 100 Mbit, I can see an
increase in batch count of up to 29, so increasing this value seems to be
definitely worth it.
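Conceptually, the change amounts to something like this (function name and buffer size are made up; only the `quinn_udp::BATCH_SIZE` constant is taken from the crate):

```rust
/// Sketch: size the number of per-recv buffers by quinn-udp's platform-specific
/// constant instead of the hard-coded 10 we used before.
fn allocate_recv_buffers(buffer_size: usize) -> Vec<Vec<u8>> {
    (0..quinn_udp::BATCH_SIZE)
        .map(|_| vec![0u8; buffer_size])
        .collect()
}
```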
When the `recv` syscall completes, `quinn-udp` tells us how many batches
we have read. On Windows, this is always 1 because Windows doesn't have
an API to read more than a single GRO batch. The `DatagramSegmentIter`
already has a way of detecting this; however, it currently needs to
iterate through all batches (10) and check that their `meta.length ==
0` before realising this.
We can short-circuit the iterator early, which might improve download
performance on Windows.
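In essence, the shortcut looks like this (a sketch over just the lengths; the real iterator carries full per-datagram metadata):

```rust
/// Sketch of the early exit: quinn-udp fills the metadata entries front to back, so we
/// can stop at the first entry with a length of 0 instead of scanning every slot.
fn filled_lengths(meta_lens: &[usize]) -> impl Iterator<Item = usize> + '_ {
    meta_lens.iter().copied().take_while(|len| *len > 0)
}
```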
I can't measure a direct improvement here but I believe that is because
we are currently limited by the buffer size on Windows. Regardless, this
feels like the right thing to do.