Commit Graph

608 Commits

Author SHA1 Message Date
Thomas Eizinger
ac339ff63b fix(gateway): evaluate fastest nameserver every 60s (#9060)
Currently, the Gateway reads all nameservers from `/etc/resolv.conf` on
startup and evaluates the fastest one to use for SRV and TXT DNS queries
that are forwarded by the Client. If the machine just booted and we do
not have Internet connectivity just yet, this fails which leaves the
Gateway in a state where it cannot fulfill those queries.

In order to ensure we always use the fastest one and to self-heal from
such situations, we add a 60s timer that refreshes this state.
Currently, this will **not** re-read the nameservers from
`/etc/resolv.conf` but still use the same IPs read on startup.
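
As a rough illustration of the selection step, here is a minimal sketch that assumes each nameserver has already been probed and the round-trip time recorded. `fastest_nameserver` and `is_refresh_due` are hypothetical names and not taken from the Gateway's code.

```rust
use std::net::IpAddr;
use std::time::Duration;

/// Picks the nameserver with the lowest measured round-trip time.
///
/// A latency of `None` means the probe failed (e.g. no connectivity yet).
/// Hypothetical helper; the actual Gateway code may look different.
pub fn fastest_nameserver(probes: &[(IpAddr, Option<Duration>)]) -> Option<IpAddr> {
    probes
        .iter()
        .filter_map(|(ip, rtt)| (*rtt).map(|rtt| (*ip, rtt)))
        .min_by_key(|(_, rtt)| *rtt)
        .map(|(ip, _)| ip)
}

/// Returns `true` once the 60s refresh interval has elapsed.
pub fn is_refresh_due(since_last_refresh: Duration) -> bool {
    since_last_refresh >= Duration::from_secs(60)
}
```

If every probe fails, the function returns `None` and the next refresh gets another chance, which is the self-healing behaviour described above.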
2025-05-09 03:38:35 +00:00
Thomas Eizinger
33d5c32f35 fix(gateway): truncate payload of ICMP errors (#9059)
When the Gateway is handed an IP packet for a DNS resource that it
cannot route, it sends back an ICMP unreachable error. According to RFC
792 [0] (for ICMPv4) and RFC 4443 [1] (for ICMPv6), parts of the
original packet should be included in the ICMP error payload to allow
the sending party to correlate what could not be sent.

For ICMPv4, the RFC says:

```
Internet Header + 64 bits of Data Datagram

The internet header plus the first 64 bits of the original
datagram's data.  This data is used by the host to match the
message to the appropriate process.  If a higher level protocol
uses port numbers, they are assumed to be in the first 64 data
bits of the original datagram's data.
```

For ICMPv6, the RFC says:

```
As much of invoking packet as possible without the ICMPv6 packet exceeding the minimum IPv6 MTU
```

[0]: https://datatracker.ietf.org/doc/html/rfc792
[1]: https://datatracker.ietf.org/doc/html/rfc4443#section-3.1
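
The two truncation rules quoted above could be sketched as follows. This is an illustration only: the function names are hypothetical, and the ICMPv6 limit assumes a plain 40-byte IPv6 header and an 8-byte ICMPv6 header without extension headers.

```rust
/// Minimum IPv6 MTU in bytes.
const MIN_IPV6_MTU: usize = 1280;
/// Fixed IPv6 header length (assumes no extension headers).
const IPV6_HEADER_LEN: usize = 40;
/// ICMPv6 error header: type, code, checksum and 4 unused bytes.
const ICMPV6_HEADER_LEN: usize = 8;

/// ICMPv4: the original IP header plus the first 64 bits (8 bytes) of its data.
pub fn icmpv4_error_payload(original: &[u8]) -> &[u8] {
    // The IHL field is the low nibble of the first byte, counted in 32-bit words.
    let header_len = original.first().map_or(0, |b| usize::from(*b & 0x0f) * 4);
    let len = (header_len + 8).min(original.len());
    &original[..len]
}

/// ICMPv6: as much of the original packet as fits into the minimum IPv6 MTU.
pub fn icmpv6_error_payload(original: &[u8]) -> &[u8] {
    let max = MIN_IPV6_MTU - IPV6_HEADER_LEN - ICMPV6_HEADER_LEN;
    &original[..original.len().min(max)]
}
```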
2025-05-09 01:38:31 +00:00
Thomas Eizinger
005b6fe863 feat(windows): optimise network change detection (#9021)
Presently, the network change detection on Windows is very naive and
simply emits a change event every time _anything_ changes. We can
optimise this and therefore improve the start-up time of Firezone by:

- Filtering out duplicate events
- Filtering out network change events for our own network adapter

This reduces the number of network change events to 1 during startup. As
far as I can tell from the code comments in this area, we explicitly
send this one to ensure we don't run into a race condition whilst we are
starting up.
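
The two filters listed above can be sketched like this. `NetworkChange` and `ChangeFilter` are hypothetical types for illustration; the real Windows listener works with OS notifications rather than strings.

```rust
/// A simplified network change event (hypothetical shape).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct NetworkChange {
    pub adapter: String,
    pub state: String,
}

/// Drops duplicate events and events caused by our own adapter.
pub struct ChangeFilter {
    own_adapter: String,
    last: Option<NetworkChange>,
}

impl ChangeFilter {
    pub fn new(own_adapter: &str) -> Self {
        Self {
            own_adapter: own_adapter.to_owned(),
            last: None,
        }
    }

    /// Returns `true` if the event should be forwarded to listeners.
    pub fn should_emit(&mut self, event: &NetworkChange) -> bool {
        if event.adapter == self.own_adapter {
            return false; // Change on our own network adapter.
        }
        if self.last.as_ref() == Some(event) {
            return false; // Same as the previously emitted event.
        }
        self.last = Some(event.clone());
        true
    }
}
```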

Resolves: #8905
2025-05-06 00:23:27 +00:00
Thomas Eizinger
ea475c721a docs(website): update changelog for latest releases (#9015)
In #9013, we forgot to update the changelogs for Apple Clients and the
Gateway.
2025-05-02 13:16:28 +00:00
Jamil
6e0e7343ba chore: release Apple & Gateway with ECN fix (#9013) 2025-05-02 00:16:40 -07:00
Thomas Eizinger
513e0a400c docs(website): update Apple changelog (#9011) 2025-05-02 05:55:25 +00:00
Thomas Eizinger
0aab954fa9 fix(connlib): never clear ECT from IP packets (#9009)
ECN information is helpful to allow the congestion controllers to more
easily fine-tune their send and receive windows. When a Firezone Client
receives an IP packet where the ECN bits signal an ECN capable
transport, we mirror this bit on the UDP datagram that carries the
encrypted IP packet.

When receiving a datagram with ECN bits set, the Gateway will then apply
these bits to the decrypted IP packet and pass it along towards its
destination.

This implementation is unfortunately a bit too naive. Not all devices on
the Internet support ECN and therefore, we may receive a datagram that
has its ECN bits cleared when the ECN bits on the inner IP packet still
signal an ECN capable transport. In this case, we should _not_ override
the ECN bits and instead pass the IP packet along as is. Network devices
along the path between Gateway and Resource may still use these ECN bits
to signal congestion.

We fix this by making the `with_ecn` function on `IpPacket` private. It
is not meant to be used outside of the module. We supersede it with a
`with_ecn_from_transport` function that implements the above logic.
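
A minimal sketch of the described logic, using a standalone `Ecn` enum and a free function instead of the real `IpPacket` type (both are assumptions made for illustration):

```rust
/// The four ECN codepoints of the IP header.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Ecn {
    NotEct,
    Ect1,
    Ect0,
    Ce,
}

/// Computes the ECN bits for the decrypted IP packet.
///
/// Hypothetical stand-in for `with_ecn_from_transport`: if the transport
/// (UDP datagram) carries no ECN information, the inner bits are left as is.
pub fn ecn_from_transport(inner: Ecn, transport: Ecn) -> Ecn {
    match transport {
        // A device on the path may have cleared the bits; never clear ECT.
        Ecn::NotEct => inner,
        // Otherwise, apply what the transport signalled (e.g. congestion).
        other => other,
    }
}
```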

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
2025-05-02 05:28:19 +00:00
Thomas Eizinger
ec4cd898ba chore: release Gateway v1.4.7 (#8943) 2025-04-30 13:37:32 +00:00
Thomas Eizinger
96998a43ae docs(website): add missing changelog entry for Apple Clients (#8938) 2025-04-30 07:14:33 +00:00
Thomas Eizinger
f7df445924 fix(gateway): don't invalidate active NAT sessions (#8937)
Whenever the Gateway is instructed to (re)create the NAT for a DNS
resource, it performs a DNS query and then overwrites the existing
entries in the NAT table. Depending on how the DNS records are defined,
this may lead to a very bad user experience where connections are cut
regularly.

In particular, if a service utilises round-robin DNS where a DNS query
only ever returns a single entry yet that entry may change as soon as
the TTL expires, all connections for this particular DNS resource for a
Client get cut.

To fix this, we now first check for active NAT sessions for a given
proxy IP and only replace it if we don't have an open NAT session. The
NAT sessions have a TTL of 1 minute, meaning there needs to be at least
1 outgoing packet from the Client every minute to keep it open.
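
The check could look roughly like the sketch below. `NatTable` and its methods are hypothetical and much simpler than the Gateway's actual NAT table; only the 1 minute TTL is taken from the description above.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::time::{Duration, Instant};

/// NAT sessions expire 1 minute after the last outgoing packet.
const NAT_SESSION_TTL: Duration = Duration::from_secs(60);

#[derive(Default)]
pub struct NatTable {
    /// Proxy IP -> resolved IP of the DNS resource.
    mapping: HashMap<IpAddr, IpAddr>,
    /// Proxy IP -> time of the last outgoing packet.
    last_outgoing: HashMap<IpAddr, Instant>,
}

impl NatTable {
    pub fn record_outgoing(&mut self, proxy: IpAddr, now: Instant) {
        self.last_outgoing.insert(proxy, now);
    }

    fn has_active_session(&self, proxy: IpAddr, now: Instant) -> bool {
        self.last_outgoing
            .get(&proxy)
            .is_some_and(|last| now.duration_since(*last) < NAT_SESSION_TTL)
    }

    /// Replaces the entry for `proxy` unless it has an active NAT session.
    /// Returns `true` if the entry was written.
    pub fn update(&mut self, proxy: IpAddr, resolved: IpAddr, now: Instant) -> bool {
        if self.mapping.contains_key(&proxy) && self.has_active_session(proxy, now) {
            return false;
        }
        self.mapping.insert(proxy, resolved);
        true
    }

    pub fn resolve(&self, proxy: IpAddr) -> Option<IpAddr> {
        self.mapping.get(&proxy).copied()
    }
}
```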
2025-04-30 06:58:58 +00:00
Jamil
2650d81444 chore: release clients with GSO fix (#8936) 2025-04-29 23:52:43 -07:00
Thomas Eizinger
6dc5f85cc5 fix(connlib): don't buffer when recreating DNS resource NAT (#8935)
In order to detect changes to DNS records of DNS resources, `connlib`
will recreate the DNS resource NAT whenever it receives a query for a
DNS resource. The way we implemented this was by clearing the local
state of the DNS resource NAT, which triggered us to perform the
handshake with the Gateway again upon the next packet for this resource.
The Gateway would then perform the DNS query and respond back when this
was finished.

In order to not drop any packets, `connlib` has a buffer where it keeps
the packets that are arriving in the meantime. This works reasonably
well when the connection is first set up because we are only buffering a
TCP SYN or equivalent handshake packet. Yet, when the connection is in
full use, and the application just so happens to make another DNS query,
we halt the entire flow of packets until this is confirmed again. To
prevent high memory use, the buffer for these packets is constrained to
32 packets which is nowhere near enough when a connection is actively
transferring data (like a file upload).

In most cases, the DNS query on the Gateway will yield the exact same
results because the records haven't changed. Thus, there is no reason
for us to actually halt the flow of these packets when we are
_recreating_ the DNS resource NAT. That way, this handshake happens in
parallel to the actual packet flow and does not interrupt anything in
the happy path case.
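
Expressed as a tiny state machine, the idea is roughly the following. The state and action names are hypothetical and only illustrate when packets are buffered versus forwarded.

```rust
/// State of the DNS resource NAT as seen by the Client (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DnsResourceNat {
    /// First handshake with the Gateway is still in flight.
    Pending,
    /// A previously confirmed NAT is being refreshed.
    Recreating,
    /// The Gateway has confirmed the NAT.
    Confirmed,
}

#[derive(Debug, PartialEq, Eq)]
pub enum PacketAction {
    Buffer,
    Forward,
}

/// Decides what to do with a packet for a DNS resource.
pub fn on_packet(state: DnsResourceNat) -> PacketAction {
    match state {
        // Nothing is set up yet, so packets must wait for the handshake.
        DnsResourceNat::Pending => PacketAction::Buffer,
        // The old NAT entries most likely still apply; keep traffic flowing.
        DnsResourceNat::Recreating | DnsResourceNat::Confirmed => PacketAction::Forward,
    }
}
```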
2025-04-30 04:26:49 +00:00
Thomas Eizinger
122d84cfa2 fix(connlib): recreate log file if it got deleted (#8926)
Currently, when `connlib`'s log file gets deleted, we write logs into
nirvana until the corresponding process gets restarted. This is painful
for users to do because they need to restart the IPC service or Network
Extension. Instead, we can simply check if the log file exists prior to
writing to it and re-create it if it doesn't.
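
A minimal sketch of the check-before-write idea using only the standard library. `LogWriter` is a hypothetical type; `connlib`'s real log appender is more involved.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Appends lines to a log file and re-creates the file if it was deleted.
pub struct LogWriter {
    path: PathBuf,
    file: File,
}

fn open_append(path: &Path) -> io::Result<File> {
    OpenOptions::new().create(true).append(true).open(path)
}

impl LogWriter {
    pub fn new(path: impl Into<PathBuf>) -> io::Result<Self> {
        let path = path.into();
        let file = open_append(&path)?;
        Ok(Self { path, file })
    }

    pub fn write_line(&mut self, line: &str) -> io::Result<()> {
        // If the file vanished, the old handle would write "into nirvana".
        if !self.path.exists() {
            self.file = open_append(&self.path)?;
        }
        writeln!(self.file, "{line}")
    }
}
```

The extra `exists` call costs one file system lookup per write, which a real implementation might want to amortise.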

Resolves: #6850
Related: #7569
2025-04-29 13:05:02 +00:00
Thomas Eizinger
bbc9c29d5d docs(website): add changelog for #8920 (#8923) 2025-04-29 10:23:48 +00:00
Thomas Eizinger
ad9a453aa1 feat(linux-client): reduce number of TUN threads to 1 (#8914)
Having multiple threads for reading and writing the TUN device can cause
packet re-orderings on the client. All other clients only use a single
TUN thread, so aligning this value means a more consistent behaviour of
Firezone across all platforms.
2025-04-28 12:25:27 +00:00
Jamil
f181a3245b chore(website): Remove old docs (#8895)
These confuse users and are horribly outdated.

Fixes #8528
2025-04-23 15:24:09 +00:00
Thomas Eizinger
ac5e44d5d0 feat(connlib): request larger buffers for UDP sockets (#8731)
Sufficiently large receive buffers are important to sustain
high-throughput as latency increases. If the receive buffer in the
kernel is too small, packets need to be dropped on arrival.

Firefox uses 1MB in its QUIC stack [0]. `quic-go` recommends setting send
and receive buffers to 7.5 MB [1]. Power users of Firezone are likely
receiving a lot more traffic than the average Firefox user (especially
with Internet Resource activated) so setting it to 10 MB seems
reasonable. Sending packets is likely not as critical because we have
back-pressure through our system such that we will stop reading IP
packets when we cannot write to our UDP socket. The UDP socket is
sitting in a separate thread and those threads are connected with
dedicated queues which act as another buffer. However, as the data below
shows, some systems have really small send buffers which are currently
likely a speed bottleneck because we need to suspend writing so
frequently.

Assuming a 50ms latency, the bandwidth-delay product tells us that we
can (in theory) saturate a 1.6 Gbps link with a 10MB receive buffer
(assuming the OS also has large enough buffer sizes in its TCP or QUIC
stack):

```
80 Mb / 0.05s = 1600Mbps
```
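
The same calculation as a small helper, in case you want to plug in other buffer sizes or latencies. `max_throughput_mbps` is a hypothetical name and the function uses decimal units (1 MB = 1,000,000 bytes), as the calculation above does.

```rust
/// Theoretical maximum throughput in Mbps that a receive buffer of
/// `buffer_bytes` can sustain at a latency of `latency_ms`.
///
/// `latency_ms` must be non-zero.
pub fn max_throughput_mbps(buffer_bytes: u64, latency_ms: u64) -> u64 {
    let bits = buffer_bytes * 8;
    // Bits per millisecond equals kilobits per second; divide by 1000 for Mbps.
    bits / latency_ms / 1_000
}
```

At the same 50ms latency, a 65KB buffer yields `max_throughput_mbps(65_000, 50) == 10`, i.e. roughly 10 Mbps.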

Experiments and research [2] show the following:

|OS|Receive buffer (default)|Receive buffer (this PR)|Send buffer (default)|Send buffer (this PR)|
|---|---|---|---|---|
|Windows|65KB|10MB|65KB|1MB|
|MacOS|786KB|8MB|9KB|1MB|
|Linux|212KB|212KB|212KB|212KB|

With the exception of Linux, the OSes appear to be quite generous with
how big they allow receive buffers to be. On Linux, these limits can be
changed by setting the `net.core.rmem_max` and `net.core.wmem_max`
parameters using `sysctl`.

Most of our users are on Windows and MacOS, meaning they immediately
benefit from this without having to change any system settings. Larger
client-side UDP receive buffers are critical for any "download" scenario
which is likely the majority of use cases that Firezone is used for.

On Windows, increasing this receive buffer almost doubles the throughput
in an iperf3 download test.

[0]: https://github.com/mozilla/neqo/pull/2470
[1]: https://github.com/quic-go/quic-go/wiki/UDP-Buffer-Sizes
[2]: https://unix.stackexchange.com/a/424381

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
2025-04-22 06:52:33 +00:00
Jamil
5db8e20f3b chore: release Apple and GUI clients (#8882)
- Apple clients 1.4.12
- GUI clients 1.4.11
2025-04-21 21:45:16 +00:00
Jamil
368ace2c6e ci: Release Android 1.4.7 (#8878)
App is live on Play store.
2025-04-21 21:12:27 +00:00
Thomas Eizinger
4c5fd9b256 feat(connlib): prefer relay candidates of same IP version (#8798)
When calculating preferences for candidates, `str0m` currently always
prefers IPv6 over IPv4. This is as per the ICE spec. However, this can
lead to sub-optimal situations when a connection ends up using a TURN
server.

TURN allows a client to allocate an IPv4 and an IPv6 address in the same
allocation. This makes it possible for e.g. an IPv4-only client to
connect to an IPv6-only peer as long as the TURN server runs in
dual-stack AND the client requests an IPv6 address in addition to an
IPv4 address with the `ADDITIONAL-ADDRESS-FAMILY` attribute.

Assume that a client sits behind symmetric NAT and therefore needs to
rely on a TURN server to communicate with its peers. The TURN server as
well as all the peers operate in dual-stack mode.

The current priority calculation will yield a communication path that
uses IPv4 to talk to the TURN server (as that is the only one available)
but due to the preference ordering of IPv6 over IPv4, will use an IPv6
path to the peer, despite the peer also supporting IPv4.

This isn't a problem per-se but makes our life unnecessarily difficult.
Our TURN servers use eBPF to efficiently deal with TURN's channel-data
messages. This however is at present only implemented for the IPv4 <>
IPv4 and IPv6 <> IPv6 path. Implementing the other paths is possible but
complicates the eBPF code because we need to also translate IP headers
between versions and not just update the source and destination IPs.

We have since patched `str0m` to extend the `Candidate::relayed`
constructor to also take a `base` address which is - similar to the
other candidate types - the address the client is sending from in order
to use this candidate. In the context of relayed candidates, this is the
address the client is using to talk to the TURN server. We can use this
information in the candidate's priority calculation to prefer candidates
that allow traffic to remain within one IP version, i.e. if the client
talks to the TURN server over IPv4, the candidate with an allocated IPv4
address will have a higher priority than the one with the IPv6 address
because we are applying a "punishment" factor as part of the
local-preference component in the priority formula.
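
To illustrate the idea, here is a sketch based on the ICE priority formula (`2^24 * type preference + 2^8 * local preference + (256 - component ID)`). The concrete preference values and the size of the "punishment" are made up for this example and do not reflect the numbers used in `str0m`.

```rust
use std::net::IpAddr;

/// Type preference for relayed candidates (the value ICE recommends).
const RELAY_TYPE_PREFERENCE: u32 = 0;
/// Hypothetical penalty for candidates that cross IP versions.
const CROSS_VERSION_PUNISHMENT: u32 = 1_000;

/// Priority of a relayed candidate.
///
/// `base` is the address used to talk to the TURN server, `allocated` is
/// the address the TURN server allocated for us.
pub fn relayed_priority(base: IpAddr, allocated: IpAddr, component_id: u32) -> u32 {
    // By default, IPv6 is preferred over IPv4 (illustrative values).
    let mut local_preference: u32 = if allocated.is_ipv6() { 65_535 } else { 65_534 };

    // Punish candidates that require translating between IP versions.
    if base.is_ipv4() != allocated.is_ipv4() {
        local_preference -= CROSS_VERSION_PUNISHMENT;
    }

    (RELAY_TYPE_PREFERENCE << 24) + (local_preference << 8) + 256u32.saturating_sub(component_id)
}
```

With an IPv4 base, the IPv4 allocation now outranks the IPv6 one, while an IPv6 base keeps preferring IPv6.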

Staying within the same IP version whilst relaying traffic allows our
TURN servers to use their eBPF kernel which results in a better UX due
to lower latency and higher throughput.

The final candidate ordering is ultimately decided by the controlling
ICE agent which in our case is the Firezone Client. Thus, we don't
necessarily need to update Gateways in order to test / benefit from
this. Building a Client with this patch included should be enough to
benefit from this change.

Related: https://github.com/algesten/str0m/pull/640
Related: https://github.com/algesten/str0m/pull/644
2025-04-20 22:41:56 +00:00
Thomas Eizinger
f7f6e3885d docs(website): remove duplicate init (#8860)
Resolves: #8858
2025-04-19 22:09:06 +00:00
Jamil
5669c83835 ci: Bump Apple clients to 1.4.11 (#8848)
Includes a fix for auto-starting on launch when other VPN clients have
been connected previously.
2025-04-19 11:45:42 +00:00
Jamil
4c1379a6bf fix(apple): Force enable VPN configuration on autoStart (#8814)
If another VPN has been activated on the system while Firezone is
active, Apple OSes will deactivate our configuration, and never
reactivate it.

We knew this already, and always activated the configuration when
starting during the sign in flow, but failed to also do this when
autoStarting on launch.

This PR ensures that during autoStart, we re-enable the
configuration as well.

Fixes #8813
2025-04-18 18:00:44 +00:00
Jamil
a2e32a4918 ci: Bump apple to 1.4.10 to ship PKG (#8797)
This publishes the 1.4.10 permalinks for the PKG download.
2025-04-17 15:13:44 +00:00
Jamil
fc7b6e3fb0 feat(ci): Publish installer PKG for macOS standalone (#8795)
Microsoft Intune's DMG provisioner currently fails unexpectedly when
trying to provision our published DMG file with the error:

> The DMG file couldn't be mounted for installation. Check the DMG file
if the error persists. (0x87D30139)

I ran the following verification commands locally, which all passed:

```
hdiutil verify -verbose <dmg>
hdiutil imageinfo -verbose <dmg>
hdiutil hfsanalyze -verbose <dmg>
hdiutil checksum -type SHA256 -verbose <dmg>
hdiutil info -verbose
hdiutil pmap -verbose <dmg>
```

So the issue appears to be most likely that Intune doesn't like the
`/Applications` shortcut in the DMG. This is a UX feature to make it
easy to drag the application to the /Applications folder upon opening the
DMG.

So we're publishing a PKG in addition to the DMG, which should be a
more reliable artifact for MDMs to use.

---------

Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
2025-04-16 16:21:40 +00:00
Thomas Eizinger
4cf36cd8bd docs(kb): update path to Gateway to new location (#8794)
In #8480, we changed the location that `firezone-gateway` gets
downloaded to but forgot to update the knowledgebase with the new path.

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
2025-04-16 13:20:28 +00:00
Jamil
aab691a67f ci: Release Apple clients 1.4.9 (#8793)
These contain the recent UDP thread enhancements.
2025-04-15 20:14:43 +00:00
Jamil
743f5fdfeb ci: bump clients/gateway to ship write improvements (#8792)
Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
2025-04-15 06:21:23 +00:00
Thomas Eizinger
282fdb96ea chore: fixup changelog for latest releases (#8788)
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-04-14 20:41:47 -07:00
Thomas Eizinger
b3746b330f refactor(connlib): spawn dedicated threads for UDP sockets (#7590)
Correctly implementing asynchronous IO is notoriously hard. In order to
not drop packets in the process, one has to ensure a given socket is
ready to accept packets, buffer them if that is not the case, suspend
everything else until the socket is ready and then continue.

Until now, we did this because it was the only option to run the UDP
sockets on the same thread as the actual packet processing. That in turn
was motivated by wanting to pass around references of the received
packets for processing. Rust's borrow-checker does not allow passing
references between threads which forced us to have the sockets on the
same thread as the packet processing.

Like we already did in other places in `connlib`, this can be solved
through the use of buffer pools. Using a buffer pool, we can use heap
allocations to store the received packets without having to make a new
allocation every time we read new packets. Instead, we can have a
dedicated thread that is connected to `connlib`'s packet processing
thread via two channels (one for inbound and one for outbound packets).
These channels are bounded, which ensures backpressure is maintained in
case one of the two threads lags behind. These bounds also mean that we
have at most N buffers from the buffer pool in-flight (where N is the
capacity of the channel).
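
The threading model can be sketched with the standard library's bounded channels. This is a simplification under stated assumptions: the real implementation uses a buffer pool and an actual UDP socket, whereas this hypothetical `spawn_socket_thread` uses plain `Vec<u8>` buffers and simply echoes packets back.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};
use std::thread::{self, JoinHandle};

/// Spawns a dedicated "socket" thread connected via two bounded channels.
///
/// Returns the sender for outbound packets, the receiver for inbound
/// packets and the thread handle. At most `capacity` packets are in flight
/// per direction; senders block once a channel is full (backpressure).
pub fn spawn_socket_thread(
    capacity: usize,
) -> (SyncSender<Vec<u8>>, Receiver<Vec<u8>>, JoinHandle<()>) {
    let (outbound_tx, outbound_rx) = sync_channel::<Vec<u8>>(capacity);
    let (inbound_tx, inbound_rx) = sync_channel::<Vec<u8>>(capacity);

    let handle = thread::spawn(move || {
        // Stand-in for the socket: echo every outbound packet back as inbound.
        for packet in outbound_rx {
            if inbound_tx.send(packet).is_err() {
                break; // The processing thread has gone away.
            }
        }
    });

    (outbound_tx, inbound_rx, handle)
}
```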

Within those dedicated threads, we can then use `async/await` notation
to suspend the entire task when a socket isn't ready for sending.

Resolves: #8000
2025-04-14 06:18:06 +00:00
Thomas Eizinger
e0f94824df fix(gateway): default to 1 TUN thread on single-core systems (#8765)
On single-core systems, spawning more than one TUN thread results in
contention that hurts performance more than it helps.
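
A sketch of how such a default could be computed. The function names and the `multi_core_default` parameter are hypothetical; only `std::thread::available_parallelism` is a real standard library API.

```rust
use std::num::NonZeroUsize;
use std::thread;

/// Number of cores available to this process, falling back to 1.
pub fn detect_cores() -> usize {
    thread::available_parallelism()
        .map(NonZeroUsize::get)
        .unwrap_or(1)
}

/// Default number of TUN threads: 1 on single-core systems, otherwise
/// whatever default the caller uses for multi-core systems.
pub fn default_tun_threads(available_cores: usize, multi_core_default: usize) -> usize {
    if available_cores <= 1 {
        1
    } else {
        multi_core_default
    }
}
```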

Resolves: #8760
2025-04-13 01:54:04 +00:00
Thomas Eizinger
132487c29e fix(connlib): correctly compute the GSO batch size (#8754)
We are currently naively chunking our buffer into `segment_size *
max_gso_segments()`. `max_gso_segments` is by default 64. Assuming we
processed several IP packets, this would quickly balloon to a size that
the kernel cannot handle. For example, during an `iperf3` run, we
receive _a lot_ of packets at maximum MTU size (1280). With the overhead
that we are adding to the packet, this results in a UDP payload size of
1320.

```
1320 x 64 = 84480
```

That is way too large for the kernel to handle and it will fail the
`sendmsg` call with `EMSGSIZE`. Unfortunately, this error wasn't
surfaced because `quinn_udp` handles it internally because it can also
happen as a result of MTU probes.

We've already patched `quinn_udp` in the past to move the handling of
more quinn-specific errors to the infallible `send` function. The same
is being done for this error in
https://github.com/quinn-rs/quinn/pull/2199.
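
One way to bound the batch is sketched below. It assumes the limit that matters is the maximum UDP payload of an IPv4 datagram (65,507 bytes); the function name is hypothetical and the actual fix in `connlib` may compute the bound differently.

```rust
/// Largest UDP payload over IPv4: 65,535 minus 20 (IP) and 8 (UDP) header bytes.
/// Assumption for this sketch; the kernel's exact limit may differ.
const MAX_UDP_PAYLOAD: usize = 65_507;

/// Number of segments of `segment_size` bytes to put into a single GSO batch.
pub fn gso_batch_segments(segment_size: usize, max_gso_segments: usize) -> usize {
    if segment_size == 0 {
        return 0;
    }
    (MAX_UDP_PAYLOAD / segment_size).min(max_gso_segments)
}
```

For the `iperf3` example above, `gso_batch_segments(1320, 64)` yields 49 segments (64,680 bytes) instead of 64 (84,480 bytes).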

Resolves: #8699
2025-04-12 13:10:43 +00:00
Jamil
7f4bfc938c docs: Update outdated docs regarding record types (#8532) 2025-03-28 03:22:42 +00:00
Thomas Eizinger
19c5bc530a feat(gateway): deprecate the NAT64 module (#8383)
At present, the Gateway implements a NAT64 conversion that can convert
IPv4 packets to IPv6 and vice versa. Doing this efficiently creates a
fair amount of complexity within our `ip-packet` crate. In addition,
routing ICMP errors back through our NAT is also complicated by this
because we may have to translate the packet embedded in the ICMP error
as well.

The NAT64 module was originally conceived as a result of the new stub
resolver-based DNS architecture. When the Client resolves IPs for a
domain, it doesn't know whether the domain will actually resolve to IPv4
AND IPv6 addresses so it simply assigns 4 of each to every domain. Thus,
when receiving an IPv6 packet for such a DNS resource, the Gateway may
only have IPv4 addresses available and can therefore not route the
packet (unless it translates it).

This problem is not novel. In fact, an IP being unroutable or a
particular route disappearing happens all the time on the Internet. ICMP
was conceived to handle this problem and it is doing a pretty good job
at it. We can make use of that and simply return an ICMP unreachable
error back to the client whenever it picks an IP that we cannot map to
one that we resolved.

In this PR, we leave all of the NAT64 code intact and only add a
feature-flag that - when active - sends aforementioned ICMP error. While
offline (and thus also for our tests), the feature-flag evaluates to
false. It is however set to `true` in the backend, meaning on staging
and later in production, we will send these ICMP errors.

Once this is rolled out and indeed proving to be working as intended, we
can simplify our codebase and rip out the NAT64 module. At that point,
we will also have to adapt the test-suite.
2025-03-27 01:01:37 +00:00
Jamil
cbea27cb57 fix(website): Update broken website links (#8518)
Updates broken links found as a result of
https://github.com/firezone/firezone/pull/8516
2025-03-25 21:12:31 +00:00
Thomas Eizinger
58086bf1e4 docs(website): fix broken links to terraform modules (#8515) 2025-03-25 13:26:35 +00:00
Jamil
effe169414 chore: release apple 1.4.8 (#8499)
Introduces the autoconnect and session end fixes.
2025-03-21 11:43:00 +00:00
Jamil
4701306835 docs: Update terraform gcp module docs for new published module (#8485)
Updates our Google terraform module guide to suit the new published
module in the Terraform registry.
2025-03-19 05:07:11 +00:00
Jamil
a8b9e34c33 fix(apple): Try to connect on launch (#8477)
This is a regression introduced in c9f085c102. The `status` at this
point is still `nil` because we have not yet fully subscribed to VPN
status change updates from the system.

That actually shouldn't prevent us from trying to start the tunnel
anyway. If the `token` is missing from the Keychain, the tunnel process
will no-op. So we simply try to start a session on launch always.

Fixes #8456
2025-03-18 03:06:57 +00:00
Jamil
e642eefb35 chore: Cut all clients to ship search domains (#8442)
Waiting on app reviews to be approved, then this PR will be ready to
merge.
2025-03-17 17:25:11 +00:00
Jamil
a47b96bcad chore: Release android 1.4.4 (#8449)
This was already published on Google Play, but the other clients will
follow suit in #8442.
2025-03-15 17:13:17 -05:00
Jamil
0809d992d6 docs: Search domains (#8437)
- Adds search domains section to Deploy -> DNS docs
- Mentions known issue: #8430
2025-03-14 10:49:48 +00:00
Jamil
eb195861c2 chore(website): Remove redundant no-changes block (#8424)
https://github.com/firezone/firezone/pull/8413#pullrequestreview-2672919083
2025-03-14 02:35:22 +00:00
Jamil
25c708fb43 ci: Bump apple clients to 1.4.6 (#8418) 2025-03-12 04:09:49 +00:00
Jamil
f3e36a2253 ci: bump android to 1.4.3 (#8416) 2025-03-11 05:52:26 +00:00
Jamil
df5bbdd240 ci: Ship SRV/TXT for GUI/Headless/Gateway (#8413) 2025-03-10 21:30:23 -07:00
Jamil
cb0283f00c fix(android): Ensure Android layouts fitsSystemWindows (#8376)
- Sets the `fitsSystemWindows` var to avoid overlapping any system
controls
- Makes all margin padding consistent at `@dimen/spacing_medium` so that
no controls are right on the edge of the view

Fixes:
https://firezonehq.slack.com/archives/C08FPHECLUF/p1741266356394749
Fixes: #7094
2025-03-06 20:28:08 +00:00
Jamil
ab7e805fdd fix(apple): actually show user-friendly alert messages (#8282)
Before, we would receive an `NSError` object and the type-matching
wouldn't take effect at all, causing the default alert to show every
time. This solves that by introducing a `UserFriendlyError` protocol
which is more robust against the two main `Error` and `NSError`
variants.
2025-02-28 14:12:24 +00:00
Jamil
1bd8051aae fix(connlib): Emit resources updated when display fields change (#8286)
Whenever a Resource's name, address_description, or assigned sites
change, this is not currently reflected in clients. For that to happen,
the address has to change.

This PR updates that behavior so that if any display fields are changed,
the `on_update_resources` callback is called which properly updates the
resource list views in clients.
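
A minimal sketch of such a comparison, using a hypothetical and heavily simplified `Resource` struct:

```rust
/// Simplified resource as shown in the clients (hypothetical shape).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Resource {
    pub id: String,
    pub address: String,
    pub name: String,
    pub address_description: Option<String>,
    pub sites: Vec<String>,
}

/// Returns `true` if any field that is only used for display differs.
pub fn display_fields_changed(old: &Resource, new: &Resource) -> bool {
    old.name != new.name
        || old.address_description != new.address_description
        || old.sites != new.sites
}
```

A caller would invoke `on_update_resources` whenever this returns `true`, in addition to the existing address check.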

Fixes #8284
2025-02-28 04:32:10 +00:00
Jamil
14436908d2 chore: Release GUI client 1.4.7 (#8275) 2025-02-25 23:30:44 -08:00