Once we start collecting metrics across various Clients and Gateways,
these metrics need to be tagged with the correct `service.name` and
`service.version`, as well as an instance ID to differentiate metrics
from different instances.
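As a rough sketch (the values and the exact SDK wiring are assumptions, not the actual implementation), the resource attributes could be assembled like this:

```rust
use opentelemetry::KeyValue;

/// Sketch only: the resource attributes we would attach to every exported metric.
/// How these are wired into the exporter depends on the opentelemetry SDK version.
fn resource_attributes(instance_id: String) -> Vec<KeyValue> {
    vec![
        KeyValue::new("service.name", "gateway"), // or "client", depending on the binary
        KeyValue::new("service.version", env!("CARGO_PKG_VERSION")),
        KeyValue::new("service.instance.id", instance_id),
    ]
}
```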
When deploying, the cluster state diverges temporarily, which allows
more than one `ReplicationConnection` process to start on the new nodes.
(One of) the old nodes still has an active slot, and we get an "object
in use" error `(Postgrex.Error) ERROR 55006 (object_in_use) replication
slot "events_slot" is active for PID 603037`.
Rather than use ReplicationConnection's restart behavior (which logs
tons of errors with Logger.error), we can use the Supervisor here
instead, and continue to try and start the ReplicationConnection until
successful.
Note that if the process name is registered (globally) and running,
ReplicationConnection.start_link/1 simply returns `{:ok, pid}` instead
of erroring out with `:already_running`, so eventually one of the nodes
will succeed and the remaining ones will return the globally-registered
pid.
The latest release includes our upstreamed fix for handling
`segment_size` > `contents.len()` and therefore our local workaround is
no longer necessary.
We need the `replication` attribute set on the db user. This is
trivially done in a migration, and with the `CURRENT_USER` specifier, we
don't need to fetch the Application configuration.
When a NAT session expires or other disallowed traffic is routed to the
Gateway, we drop these packets. It will be useful to learn how often
that actually happens and why the packets were dropped.
To do so, we add a counter metric for these packets.
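As a hedged sketch using the `opentelemetry` crate (counter name, attribute, and wiring are assumptions):

```rust
use opentelemetry::{metrics::Counter, KeyValue};

/// Sketch: record one dropped packet together with the reason it was dropped.
/// The `Counter` itself would be created once from the meter during Gateway startup.
fn record_dropped_packet(dropped_packets: &Counter<u64>, reason: &'static str) {
    dropped_packets.add(1, &[KeyValue::new("reason", reason)]);
}
```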
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Turns out we are making replication overly complex by creating a
dedicated user for it. The `web` user is already privileged and we can
reuse it, since the replication system operates in the same security
context as the rest of the app.
The LoggerJSON Redactor only redacts top-level keys, so we need to
redact the entire `connection_opts` param in order to hide the password
it contains.
We also don't need to carry `connection_opts` in the entire
ReplicationConnection process state; it is only needed for the initial
connection, so we refactor it out of the `state`.
This tunnel throughput benchmark isn't very useful and it is very
flaky. Remove it entirely until we can replace it with something more
robust and useful.
Resolves: #8172
I don't understand why, but in its current location, this log simply
doesn't show up for anything other than UDP packets. If we move it up,
it will actually log all packets.
We have been using buffer pools for a while all over `connlib` as a way
to efficiently use heap-allocated memory. This PR harmonizes the usage
of buffer pools across the codebase by introducing a dedicated
`bufferpool` crate. This crate offers a convenient and easy-to-use API
for all the things we (currently) need from buffer pools. As a nice
bonus of having it all in one place, we can now also track metrics of
how many buffers we have currently allocated.
An example output from the local metrics exporter looks like this:
```
Name : system.buffer.count
Description : The number of buffers allocated in the pool.
Unit : {buffers}
Type : Sum
Sum DataPoints
Monotonic : false
Temporality : Cumulative
DataPoint #0
StartTime : 2025-04-29 12:41:25.278436
EndTime : 2025-04-29 12:42:25.278088
Value : 96
Attributes :
-> system.buffer.pool.name: udp-socket-v6
-> system.buffer.pool.buffer_size: 65535
DataPoint #1
StartTime : 2025-04-29 12:41:25.278436
EndTime : 2025-04-29 12:42:25.278088
Value : 7
Attributes :
-> system.buffer.pool.buffer_size: 131600
-> system.buffer.pool.name: gso-queue
DataPoint #2
StartTime : 2025-04-29 12:41:25.278436
EndTime : 2025-04-29 12:42:25.278088
Value : 128
Attributes :
-> system.buffer.pool.name: udp-socket-v4
-> system.buffer.pool.buffer_size: 65535
DataPoint #3
StartTime : 2025-04-29 12:41:25.278436
EndTime : 2025-04-29 12:42:25.278088
Value : 8
Attributes :
-> system.buffer.pool.buffer_size: 1336
-> system.buffer.pool.name: ip-packet
DataPoint #4
StartTime : 2025-04-29 12:41:25.278436
EndTime : 2025-04-29 12:42:25.278088
Value : 9
Attributes :
-> system.buffer.pool.buffer_size: 1336
-> system.buffer.pool.name: snownet
```
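For illustration, a minimal sketch of the general shape of such a pooled-buffer API (this is not the actual `bufferpool` crate, just the idea):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

/// Illustrative sketch, not the real `bufferpool` crate: buffers of a fixed size are
/// handed out on demand and reused once returned, and we track how many were allocated.
pub struct BufferPool {
    free: Mutex<Vec<Vec<u8>>>,
    buffer_size: usize,
    allocated: AtomicUsize, // roughly what a `system.buffer.count`-style metric could report
}

impl BufferPool {
    pub fn new(buffer_size: usize) -> Self {
        Self {
            free: Mutex::new(Vec::new()),
            buffer_size,
            allocated: AtomicUsize::new(0),
        }
    }

    /// Hand out a buffer, reusing a previously returned one if possible.
    pub fn pull(&self) -> Vec<u8> {
        self.free.lock().unwrap().pop().unwrap_or_else(|| {
            self.allocated.fetch_add(1, Ordering::Relaxed);
            vec![0; self.buffer_size]
        })
    }

    /// Return a buffer to the pool so the next `pull` can reuse it.
    pub fn put_back(&self, buf: Vec<u8>) {
        self.free.lock().unwrap().push(buf);
    }

    /// Number of buffers this pool has allocated so far.
    pub fn num_allocated(&self) -> usize {
        self.allocated.load(Ordering::Relaxed)
    }
}
```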
Resolves: #8385
Whenever the Gateway is instructed to (re)create the NAT for a DNS
resource, it performs a DNS query and then overwrites the existing
entries in the NAT table. Depending on how the DNS records are defined,
this may lead to a very bad user experience where connections are cut
regularly.
In particular, if a service utilises round-robin DNS where a DNS query
only ever returns a single entry yet that entry may change as soon as
the TTL expires, all connections for this particular DNS resource for a
Client get cut.
To fix this, we now first check for active NAT sessions for a given
proxy IP and only replace the entry if there is no open NAT session.
NAT sessions have a TTL of 1 minute, meaning there needs to be at least
one outgoing packet from the Client every minute to keep a session open.
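Illustratively (types and field names are not the actual Gateway state):

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::time::{Duration, Instant};

const NAT_SESSION_TTL: Duration = Duration::from_secs(60);

/// Sketch of the decision described above: a NAT entry for a proxy IP is only replaced
/// if no session has seen an outgoing packet within the TTL.
struct NatSession {
    last_outgoing_packet: Instant,
}

fn should_replace_nat_entry(
    sessions: &HashMap<IpAddr, NatSession>,
    proxy_ip: IpAddr,
    now: Instant,
) -> bool {
    match sessions.get(&proxy_ip) {
        Some(session) => now.duration_since(session.last_outgoing_packet) >= NAT_SESSION_TTL,
        None => true,
    }
}
```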
Firezone's control plane is a realtime, distributed system that relies
on a broadcast/subscribe system to function. In many cases, these events
are broadcasted whenever relevant data in the DB changes, such as an
actor losing access to a policy, a membership being deleted, and so
forth.
Today, this is handled in the application layer, typically happening at
the place where the relevant DB call is made (i.e. in an
`after_commit`). While this approach has worked thus far, it has several
issues:
1. We have no guarantee that the DB change will issue a broadcast. If
the application is deployed or the process crashes after the DB changes
are made but before the broadcast happens, we will have potentially
failed to update any connected clients or gateways with the changes.
2. We have no guarantee that broadcasts happen in the same order as the
corresponding DB updates. In other words, app server A could win its DB
operation against app server B, but then lose the race to broadcast
first.
3. If the cluster is in a bad state where broadcasts may return an error
(i.e. https://github.com/firezone/firezone/issues/8660), we will never
retry the broadcast.
To fix the above issues, we introduce a WAL logical decoder that
processes the event stream one message at a time and performs any needed work.
Serializability is guaranteed since we only process the WAL in a single,
cluster-global process, `ReplicationConnection`. Durability is also
guaranteed since we only ACK WAL segments after we've successfully
ingested the event.
This means we will only advance the position of our WAL stream after
successfully broadcasting the event.
This PR only introduces the WAL stream processing system but does not
introduce any changes to our current broadcasting behavior - that's
saved for another PR.
It will be interesting to learn, for example, how many installations have
no IPv6 connectivity, as those will encounter `NetworkUnreachable`
errors. We categorise the errors by IO direction and IP stack, which will
allow us to deduce this information.
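A hedged sketch of the two dimensions (attribute names are illustrative):

```rust
use opentelemetry::KeyValue;

/// Sketch: the two dimensions we attach to the UDP error counter. For example, a
/// `NetworkUnreachable` error on an IPv6 send would show up with
/// `{io.direction = "send", network.type = "ipv6"}`.
enum Direction {
    Send,
    Recv,
}

enum IpStack {
    V4,
    V6,
}

fn error_attributes(direction: Direction, stack: IpStack) -> [KeyValue; 2] {
    [
        KeyValue::new(
            "io.direction",
            match direction {
                Direction::Send => "send",
                Direction::Recv => "recv",
            },
        ),
        KeyValue::new(
            "network.type",
            match stack {
                IpStack::V4 => "ipv4",
                IpStack::V6 => "ipv6",
            },
        ),
    ]
}
```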
In order to detect changes to DNS records of DNS resources, `connlib`
will recreate the DNS resource NAT whenever it receives a query for a
DNS resource. The way we implemented this was by clearing the local
state of the DNS resource NAT, which triggered us to perform the
handshake with the Gateway again upon the next packet for this resource.
The Gateway would then perform the DNS query and respond back when this
was finished.
In order to not drop any packets, `connlib` has a buffer where it keeps
the packets that are arriving in the meantime. This works reasonably
well when the connection is first set up because we are only buffering a
TCP SYN or equivalent handshake packet. Yet, when the connection is in
full use and the application just so happens to make another DNS query,
we halt the entire flow of packets until the Gateway has confirmed the
NAT again. To prevent high memory use, the buffer for these packets is
capped at 32 packets, which is nowhere near enough when a connection is
actively transferring data (like a file upload).
In most cases, the DNS query on the Gateway will yield the exact same
results because the records haven't changed. Thus, there is no reason
for us to halt the flow of these packets when we are merely
_recreating_ the DNS resource NAT. With this change, the handshake
happens in parallel to the actual packet flow and does not interrupt
anything in the happy-path case.
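In pseudo-Rust, the new behaviour boils down to something like this (types and names are illustrative, not connlib's actual ones):

```rust
use std::collections::HashSet;

/// Illustrative sketch, not connlib's actual types.
struct DnsResourceNat {
    /// Resources for which proxy-IP mappings are already established.
    established: HashSet<u64>,
}

enum Action {
    /// Keep forwarding through the existing mappings and refresh the NAT in parallel.
    RefreshInBackground,
    /// No mapping yet: perform the handshake and buffer packets until it completes.
    SetupAndBuffer,
}

/// Decide what to do when the Client observes a DNS query for a DNS resource.
fn on_dns_query(nat: &DnsResourceNat, resource_id: u64) -> Action {
    if nat.established.contains(&resource_id) {
        Action::RefreshInBackground
    } else {
        Action::SetupAndBuffer
    }
}
```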
With #7590, we've moved all UDP IO operations to a separate thread. As a
result, some of the handling of IO errors within the Client's and
Gateway's event-loop no longer applied, as those errors are now captured
within the respective thread. To fix this, we extend the type signature
of the receive channel to also allow for errors and use it to send back
errors from sending AND receiving UDP datagrams.
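A sketch of the extended channel (types and names are illustrative, using a plain std channel):

```rust
use std::io;
use std::net::{SocketAddr, UdpSocket};
use std::sync::mpsc;

/// Sketch: the UDP thread now sends either a received datagram or the IO error it hit,
/// so the event-loop can handle both.
struct Datagram {
    from: SocketAddr,
    payload: Vec<u8>,
}

type UdpEvent = io::Result<Datagram>;

fn udp_recv_thread(socket: UdpSocket, tx: mpsc::Sender<UdpEvent>) {
    let mut buf = [0u8; 65535];

    loop {
        let event = socket
            .recv_from(&mut buf)
            .map(|(len, from)| Datagram { from, payload: buf[..len].to_vec() });

        if tx.send(event).is_err() {
            return; // The event-loop has gone away; stop the thread.
        }
    }
}
```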
When a platform's network driver does not support GSO, `quinn-udp`
detects that and disables segmentation offloading:
> 04-30 11:32:49.161 19612 19836 I connlib : quinn_udp:👿
`libc::sendmsg` failed with I/O error (os error 5); halting segmentation
offload
What this means is that it sets an internal field that caps the GSO
batch size at 1 (instead of the default 32). We then use this batch size
to compute how we are meant to chunk up the already batched datagrams.
As a consequence of #8920, we are now also using a "feature" of GSO
where the last datagram in a GSO batch is allowed to be less than the
segment size.
The combination of these two features now makes it possible that we are
passing a datagram to the kernel where the `segment_size` is greater
than the actual length. Android's Linux kernel doesn't seem to like that
and barfs with IO error 5 when passed such a datagram.
The long-term fix is to sanitise this within `quinn-udp`, but in the
short term, we can do this ourselves as part of the loop where we
segment the datagrams.
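A minimal sketch of that clamp (names are illustrative; the real code lives in our segmentation loop):

```rust
/// Sketch: when chunking an already-batched buffer into GSO sends, never report a
/// `segment_size` that is larger than the chunk we actually pass down. `max_segments`
/// is 1 when the platform has disabled segmentation offload; both arguments are
/// assumed to be non-zero.
fn gso_chunks<'a>(
    contents: &'a [u8],
    segment_size: usize,
    max_segments: usize,
) -> impl Iterator<Item = (&'a [u8], usize)> + 'a {
    contents
        .chunks(segment_size * max_segments)
        .map(move |chunk| (chunk, segment_size.min(chunk.len())))
}
```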
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
This isn't plugged into anything yet on the Swift side but lays the
foundation for changing the log-level at runtime without having to sign
the user out.
Currently, when `connlib`'s log file gets deleted, we write logs into
nirvana until the corresponding process gets restarted. This is painful
for users because they need to restart the IPC service or Network
Extension. Instead, we can simply check whether the log file exists prior
to writing to it and re-create it if it doesn't.
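Conceptually (a std-only sketch; the real change sits inside the log writer, which keeps the file handle around):

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

/// Sketch: before writing, re-create the log file if it was deleted underneath us.
fn append_log_line(path: &Path, line: &str) -> io::Result<()> {
    if !path.exists() {
        File::create(path)?; // The file is gone: re-create it instead of writing into nirvana.
    }

    let mut file = OpenOptions::new().append(true).open(path)?;
    writeln!(file, "{line}")
}
```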
Resolves: #6850
Related: #7569
When working on the Rust code of Firezone from a MacOS computer, it is
useful to have pretty much all of the code at least compile in order to
detect problems early. Eventually, once we target features like a
headless MacOS client, some of these stubs will actually be filled in and
become functional.
This shouldn't matter because we are only using the `UniquePacketBuffer`
on the client and not on the Gateway where SYN-ACK packets would be sent
from. To be fully correct though, we need to also compare the ACK flag
of the two packets.
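For reference, a hedged sketch of what comparing the flags could look like on raw TCP headers (byte 13 of the TCP header holds the flags; this is not the actual `UniquePacketBuffer` code):

```rust
/// Sketch: two TCP packets should only be considered duplicates if both the SYN and the
/// ACK flag match, so a SYN is never deduplicated against a SYN-ACK. `a` and `b` are
/// assumed to be raw TCP headers of at least 14 bytes.
fn same_syn_ack_flags(a: &[u8], b: &[u8]) -> bool {
    const SYN: u8 = 0x02;
    const ACK: u8 = 0x10;

    let flags = |hdr: &[u8]| hdr[13] & (SYN | ACK);

    flags(a) == flags(b)
}
```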
Despite our efforts in #8912, the current implementation still does not
do enough to maintain packet ordering across GSO batches.
At present, we very aggressively batch packets of the same length
together. This however is too eager when we consider packet flows such
as the following:
```
9:03:49.585143 IP 10.128.15.241.3000 > 100.69.109.138.53474: Flags [.], seq 1:1229, ack 524, win 249, options [nop,nop,TS val 3862031964 ecr 1928356896], length 1228
09:03:49.585151 IP 10.128.15.241.3000 > 100.69.109.138.53474: Flags [P.], seq 1229:2063, ack 524, win 249, options [nop,nop,TS val 3862031964 ecr 1928356896], length 834
09:03:49.585157 IP 10.128.15.241.3000 > 100.69.109.138.53474: Flags [P.], seq 2063:3094, ack 524, win 249, options [nop,nop,TS val 3862031964 ecr 1928356896], length 1031
09:03:49.585187 IP 10.128.15.241.3000 > 100.69.109.138.53474: Flags [.], seq 3094:4322, ack 524, win 249, options [nop,nop,TS val 3862031964 ecr 1928356896], length 1228
09:03:49.585188 IP 10.128.15.241.3000 > 100.69.109.138.53474: Flags [P.], seq 4322:5156, ack 524, win 249, options [nop,nop,TS val 3862031964 ecr 1928356896], length 834
09:03:49.585227 IP 10.128.15.241.3000 > 100.69.109.138.53474: Flags [.], seq 5156:6384, ack 524, win 249, options [nop,nop,TS val 3862031964 ecr 1928356896], length 1228
09:03:49.585228 IP 10.128.15.241.3000 > 100.69.109.138.53474: Flags [P.], seq 6384:7612, ack 524, win 249, options [nop,nop,TS val 3862031964 ecr 1928356896], length 1228
09:03:49.585230 IP 10.128.15.241.3000 > 100.69.109.138.53474: Flags [P.], seq 7612:8249, ack 524, win 249, options [nop,nop,TS val 3862031964 ecr 1928356896], length 637
09:03:49.585846 IP 10.128.15.241.3000 > 100.69.109.138.53474: Flags [.], seq 8249:9477, ack 524, win 249, options [nop,nop,TS val 3862031964 ecr 1928356896], length 1228
09:03:49.585851 IP 10.128.15.241.3000 > 100.69.109.138.53474: Flags [P.], seq 9477:10705, ack 524, win 249, options [nop,nop,TS val 3862031964 ecr 1928356896], length 1228
```
As we can see here, the remote sends us packet batches of varying
lengths:
- 1228, 834
- 1031
- 1228, 834
- 1228, 1228, 637
- 1228, 1228
1228 represents a "full" TCP packet, so any packet following a
full packet SHOULD be grouped together into a GSO batch.
Currently, we batch all the 1228 packets together and ignore the fact
that there were actually smaller-sized packets in between them that
belong together.
To mitigate this, we refactor the `GsoQueue` to remove the
`segment_size` from the binning key of our map and instead only group
batches by their source, destination and ECN information. Within such a
connection, we then create an ordered list of batches. A new batch is
started if the packet length differs or if we have previously pushed a
packet that isn't of the batch's length, which signals the end of that
batch.
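A sketch of the new binning (illustrative types, not the actual `GsoQueue`):

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

/// Illustrative sketch: batches are keyed by connection only, and within a connection we
/// keep an ordered list of batches so packet order is preserved.
#[derive(PartialEq, Eq, Hash, Clone, Copy)]
struct Connection {
    src: SocketAddr,
    dst: SocketAddr,
    ecn: u8,
}

struct Batch {
    segment_size: usize,
    payload: Vec<u8>,
    /// Set once we pushed a packet shorter than `segment_size`: GSO only allows the
    /// *last* segment of a batch to be shorter, so the batch is closed afterwards.
    closed: bool,
}

#[derive(Default)]
struct GsoQueue {
    batches: HashMap<Connection, Vec<Batch>>,
}

impl GsoQueue {
    fn push(&mut self, conn: Connection, packet: &[u8]) {
        let batches = self.batches.entry(conn).or_default();

        let appended = match batches.last_mut() {
            // Append to the current batch if it is still open and the packet fits its segment size.
            Some(batch) if !batch.closed && packet.len() <= batch.segment_size => {
                if packet.len() < batch.segment_size {
                    batch.closed = true; // A shorter packet must be the last segment.
                }
                batch.payload.extend_from_slice(packet);
                true
            }
            _ => false,
        };

        if !appended {
            // Otherwise start a new batch, preserving the order in which packets arrived.
            batches.push(Batch {
                segment_size: packet.len(),
                payload: packet.to_vec(),
                closed: false,
            });
        }
    }
}
```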
The result here looks very promising (this is loading
`blog.firezone.dev` via the `lynx` browser from within the
headless-client docker container, so going through a Gateway running
this PR):
(Screenshots comparing `main` with this PR omitted.)
Related: #8899
Having multiple threads for reading and writing the TUN device can cause
packet re-orderings on the client. All other clients only use a single
TUN thread, so aligning this value means a more consistent behaviour of
Firezone across all platforms.
Generic Segmentation Offload (GSO) is a clever way of reducing the
number of syscalls made when you want to send a lot of packets with
the same length to the same recipient. The way this works is that the
packets are concatenated and passed to the kernel as a single packet
together with the `segment_size` as an out-of-band argument.
The component managing this batching in `connlib` is called `GsoQueue`.
In #8772, we made the order in which these batches are sent to the
kernel explicit by prioritising batches with smaller segments. What we
overlooked with that strategy is that in a particular GSO batch, the
last packet is actually allowed to be of a different length.
For example, say the user is downloading an image of 4500 bytes. With
our MTU of 1280, we have a payload size of 1252. This results in three
fully-filled packets and one packet of 744 bytes. With the change in
#8772, the small packet of 744 bytes will be transferred first, followed
by the "train" of fully filled packets.
To fix this, we flip the order here and transfer batches of larger sizes
first. The original problem we attempted to mitigate in #8772 no longer
exists now that we merged #7590: we now simply suspend if the UDP
socket isn't ready instead of dropping the next batch.
By flipping the order here, we guarantee that batches with a larger size
are sent before batches with a smaller size. This should also imply that
the encapsulated IP packets of e.g. an image arrive in the correct order
(with the smallest packet last as it is part of a smaller batch). What
we don't guarantee with this is that there won't be any other IP packets
sent "in the middle" of such a batch. This shouldn't be a problem though
as we are simply interleaving packets of different TCP / UDP connections
with each other which already happens on the regular Internet anyway.
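Conceptually, the flipped ordering is just a descending sort by segment size (the tuple below stands in for the actual batch type):

```rust
/// Sketch: flush batches with the larger segment size first so the short tail segment of
/// e.g. a file download leaves after the full-sized packets it belongs to.
fn order_for_sending(batches: &mut [(usize, Vec<u8>)]) {
    batches.sort_by(|a, b| b.0.cmp(&a.0)); // descending by segment size
}
```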
Turns out that the standard `pgoutput` plugin shipped with Postgres will
do everything we need it to, and there are good examples of prior art
decoding its binary output in Elixir (in production).
So to avoid adding a dependency on `wal2json` here, we'll go with that.
Why:
* The copy-to-clipboard button on the new API token page was not working
at all because the FlowbiteJS library expects the elements to be present
in the DOM on first render, which was not the case for the API token
code block.
* The existing code blocks' copy-to-clipboard buttons gave no visual
indication that the copy had completed and were also somewhat difficult
to see.
* This commit updates the buttons to be more visible, adds a phx-hook to
make sure the FlowbiteJS init functions are run on every code block even
if it's inserted after the initial load of the page, and adds callback
functions that toggle the button text and icon to show the text has been
copied.
Enables `wal_level = logical` so we can start implementing CDC.
**WARNING**: This will trigger a restart of our database instance, so it
should be deployed during our standard maintenance window (Saturday
evening).
In order to develop and test WAL replication, we need the wal2json
module installed in our dev postgres image. The module itself builds
very quickly, but I thought it would be better to have this
automatically built and pushed as part of a nightly job so that CI and
developers can make use of it.
API clients don't belong to any actor_groups and attempting to deep link
into the `groups` section when viewing an actor raises a 500 error.
This PR fixes that by removing the deep link into `actor_groups` from
the actors index view.
Prevents more than one sync-enabled adapter per account in order to
prepare for eventually adding a unique constraint on
`provider_identifier` for identities and groups per account.
Related: #6294
---------
Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Brian Manifold <bmanifold@users.noreply.github.com>
Sufficiently large receive buffers are important to sustain
high throughput as latency increases. If the receive buffer in the
kernel is too small, packets need to be dropped on arrival.
Firefox uses 1MB in its QUIC stack [0]. `quic-go` recommends setting send
and receive buffers to 7.5 MB [1]. Power users of Firezone are likely
receiving a lot more traffic than the average Firefox user (especially
with the Internet Resource activated), so setting it to 10 MB seems
reasonable. Sending packets is likely not as critical because we have
back-pressure through our system such that we will stop reading IP
packets when we cannot write to our UDP socket. The UDP socket is
sitting in a separate thread and those threads are connected with
dedicated queues, which act as another buffer. However, as the data below
shows, some systems have really small send buffers, which are likely a
throughput bottleneck at the moment because we need to suspend writing so
frequently.
Assuming a 50ms latency, the bandwidth-delay product tells us that we
can (in theory) saturate a 1.6 Gbps link with a 10MB receive buffer
(assuming the OS also has large enough buffer sizes in its TCP or QUIC
stack):
```
10 MB = 80 Mb
80 Mb / 0.05 s = 1600 Mbps ≈ 1.6 Gbps
```
Experiments and research [2] show the following:
|OS|Receive buffer (default)|Receive buffer (this PR)|Send buffer (default)|Send buffer (this PR)|
|---|---|---|---|---|
|Windows|65KB|10MB|65KB|1MB|
|MacOS|786KB|8MB|9KB|1MB|
|Linux|212KB|212KB|212KB|212KB|
With the exception of Linux, the OSes appear to be quite generous with
how big they allow receive buffers to be. On Linux, these limits can be
changed by setting the `net.core.rmem_max` and `net.core.wmem_max`
parameters using `sysctl`.
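For reference, a sketch of how such buffer sizes could be requested using the `socket2` crate (the OS may silently clamp the values to its configured maximums):

```rust
use socket2::{Domain, Protocol, Socket, Type};
use std::io;

/// Sketch: request a 10 MB receive buffer and a 1 MB send buffer on the UDP socket.
fn make_udp_socket() -> io::Result<Socket> {
    let socket = Socket::new(Domain::IPV6, Type::DGRAM, Some(Protocol::UDP))?;

    socket.set_recv_buffer_size(10 * 1024 * 1024)?;
    socket.set_send_buffer_size(1024 * 1024)?;

    Ok(socket)
}
```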
Most of our users are on Windows and MacOS, meaning they immediately
benefit from this without having to change any system settings. Larger
client-side UDP receive buffers are critical for any "download" scenario,
which is likely the majority of use cases Firezone is used for.
On Windows, increasing this receive buffer almost doubles the throughput
in an iperf3 download test.
[0]: https://github.com/mozilla/neqo/pull/2470
[1]: https://github.com/quic-go/quic-go/wiki/UDP-Buffer-Sizes
[2]: https://unix.stackexchange.com/a/424381
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
When reading from our UDP socket, we utilise GRO to read multiple
packets originating from the same IP + port and with the same length in
a single syscall. Currently, we can read up to 10 different combinations
here in a single syscall. `quinn_udp` actually exposes a constant for
how many batches it can handle at a time. Instead of hard-coding the
value 10, we now follow this constant.
On Linux and MacOS (with `apple-fast-datapath`), this constant has the
value 32. On Windows, it is 1.
Even on my not-so-fast Internet connection of 100 Mbit, I can see an
increase in batch count of up to 29, so increasing this value seems to be
definitely worth it.
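Conceptually, the change amounts to something like this (function name and buffer size are made up; only the `quinn_udp::BATCH_SIZE` constant is taken from the crate):

```rust
/// Sketch: size the number of per-recv buffers by quinn-udp's platform-specific
/// constant instead of the hard-coded 10 we used before.
fn allocate_recv_buffers(buffer_size: usize) -> Vec<Vec<u8>> {
    (0..quinn_udp::BATCH_SIZE)
        .map(|_| vec![0u8; buffer_size])
        .collect()
}
```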
When the `recv` syscall completes, `quinn-udp` tells us how many batches
we have read. On Windows, this is always 1 because Windows doesn't have
an API to read more than a single GRO batch. The `DatagramSegmentIter`
already has a way of detecting this; however, it currently needs to
iterate through all batches (10) and check that their `meta.length ==
0` before realising this.
We can short-circuit the iterator early, which might improve download
performance on Windows.
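In essence, the shortcut looks like this (a sketch over just the lengths; the real iterator carries full per-datagram metadata):

```rust
/// Sketch of the early exit: quinn-udp fills the metadata entries front to back, so we
/// can stop at the first entry with a length of 0 instead of scanning every slot.
fn filled_lengths(meta_lens: &[usize]) -> impl Iterator<Item = usize> + '_ {
    meta_lens.iter().copied().take_while(|len| *len > 0)
}
```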
I can't measure a direct improvement here but I believe that is because
we are currently limited by the buffer size on Windows. Regardless, this
feels like the right thing to do.