Commit Graph

900 Commits

Author SHA1 Message Date
Jamil
e1ed497d12 fix(apple): Expose MACOSX_DEPLOYMENT_TARGET in rust apple build script to signal to rustc which macOS to target (#7443)
`MACOSX_DEPLOYMENT_TARGET` is a standard env var read by gcc and rustc
that determines which version of macOS to target binaries for. This
variable was being removed inadvertently in our rust build script.

Exposing this var fixes a slew of warnings about object files being
built for a newer macOS target than being linked that were showing up in
Xcode for some time now.

Hasn't caused any issues thus far from what I can tell, but possibly
related to #7442
2024-12-02 17:27:11 +00:00
Thomas Eizinger
0a6554122a feat(connlib): utilise GSO for UDP sockets (#7210)
## Context

At present, `connlib` sends UDP packets one at a time. Sending a packet
requires us to make a syscall which is quite expensive. Under load, i.e.
during a speedtest, syscalls account for over 50% of our CPU time [0].
In order to improve this situation, we need to somehow make use of GSO
(generic segmentation offload). With GSO, we can send multiple packets
to the same destination in a single syscall.

The tricky question here is, how can we achieve having multiple UDP
packets ready at once so we can send them in a single syscall? Our TUN
interface only feeds us packets one at a time and `connlib`'s state
machine is single-threaded. Additionally, we currently only have a
single `EncryptBuffer` in which the to-be-sent datagram sits.

## 1. Stack-allocating encrypted IP packets

As a first step, we get rid of the single `EncryptBuffer` and instead
stack-allocate each encrypted IP packet. Due to our small MTU, these
packets are only around 1300 bytes. Stack-allocating that requires a few
memcpy's but those are in the single-digit % range in the terms of CPU
time performance hit. That is nothing compared to how much time we are
spending on UDP syscalls. With the `EncryptBuffer` out the way, we can
now "freely" move around the `EncryptedPacket` structs and - technically
- we can have multiple of them at the same time.

## 2. Implementing GSO

The GSO interface allows you to pass multiple packets **of the same
length and for the same destination** in a single syscall, meaning we
cannot just batch-up arbitrary UDP packets. Counterintuitively, making
use of GSO requires us to do more copying: In particular, we change the
interface of `Io` such that "sending" a packet performs essentially a
lookup of a `BytesMut`-buffer by destination and packet length and
appends the payload to that packet.

## 3. Batch-read IP packets

In order to actually perform GSO, we need to process more than a single
IP packet in one event-loop tick. We achieve this by batch-reading up to
50 IP packets from the mpsc-channel that connects `connlib`'s main
event-loop with the dedicated thread that reads and writes to the TUN
device. These reads and writes happen concurrently to `connlib`'s packet
processing. Thus, it is likely that by the time `connlib` is ready to
process another IP packet, multiple have been read from the device and
are sitting in the channel. Batch-processing these IP packets means that
the buffers in our `GsoQueue` are more likely to contain more than a
single datagram.

Imagine you are running a file upload. The OS will send many packets to
the same destination IP and likely max MTU to the TUN device. It is
likely, that we read 10-20 of these packets in one batch (i.e. within a
single "tick" of the event-loop). All packets will be appended to the
same buffer in the `GsoQueue` and on the next event-loop tick, they will
all be flushed out in a single syscall.

## Results

Overall, this results in a significant reduction of syscalls for sending
UDP message. In [1], we spend only a total of 16% of our CPU time in
`udpv6_sendmsg` whereas in [0] (main), we spent a total of 34%. Do note
that these numbers are relative to the total CPU time spent per program
run and thus can't be compared directly (i.e. you cannot just do 34 - 16
and say we now spend 18% less time sending UDP packets). Nevertheless,
this appears to be a great improvement.

In terms of throughput, we achieve a ~60% improvement in our benchmark
suite. That one is running on localhost though so it might not
necessarily be reflect like that in a real network.

[0]: https://share.firefox.dev/4hvoPju
[1]: https://share.firefox.dev/4frhCPv
2024-12-02 01:09:44 +00:00
Thomas Eizinger
5f4816ee46 fix(connlib): don't warn on non-UDP packet to DNS resolver IP (#7418)
Windows appears to sometimes send ICMP (error?) packets to our DNS
resolver IPs. There is nothing we can do with these but the current code
path logs them as a warning because we expect everything that isn't TCP
to be UDP.

With this patch, we slightly change the parsing logic to first attempt
extracting the UDP packet and fail only with a DEBUG log if it isn't.
2024-12-01 16:01:42 +00:00
Thomas Eizinger
a3e3d4cac5 fix(gateway): filter packets not destined for a client (#7417)
This causes unnecessary logs higher up the stack.
2024-12-01 15:59:56 +00:00
Thomas Eizinger
932f6791fb fix(phoenix-channel): lazily create backoff timer (#7414)
Our `phoenix-channel` component is responsible for maintaining a
WebSocket connection to the portal. In case that connection fails, we
want to reconnect to it using an exponential backoff, eventually giving
up after a certain amount of time.

Unfortunately, the code we have today doesn't quite do that. An
`ExponentialBackoff` has a setting for the `max_elapsed_time`.
Regardless of how many and how often we retry something, we won't ever
wait longer than this amount of time. For the Relay, this is set to
15min. For other components its indefinite (Gateway, headless-client),
or very long (30 days for Android, 1 day for Apple).

The point in time from which this duration is counted is when the
`ExponentialBackoff` is **constructed** which translates to when we
**first** connected to the portal. As a result, our backoff would
immediately fail on the first error if it has been longer than
`max_elapsed_time` since we first connected. For most components, this
codepath is not relevant because the `max_elapsed_time` is so long. For
the Relay however, that is only 15 minutes so chances are, the Relay
would immediately fail (and get rebooted) on the first connection error
with the portal.

To fix this, we now lazily create the `ExponentialBackoff` on the first
error.

This bug has some interesting consequences: When a relay reboots, it
looses all its state, i.e. allocations, channel bindings, available
nonces etc, stamp-secret. Thus, all credentials and state that got
distributed to Clients and Gateways get invalidated, causing disconnects
from the Relay. We have observed these alerts in Sentry for a while and
couldn't explain them. Most likely, this is the root cause for those
because whilst a Relay disconnects, the portal also cannot detect its
presence and pro-actively inform Clients and Gateways to no longer use
this Relay.
2024-11-29 20:19:11 +00:00
Thomas Eizinger
c6e7e6192e build(rust): bump Rust to 1.83 (#7409)
Rust 1.83 comes with a bunch of new lints for elidible lifetimes. Those
also trigger in the generated code of `derivative`. That crate is
actually unmaintained so we replace our usages of it with `derive_more`.
2024-11-29 01:04:06 +00:00
Thomas Eizinger
e46cb3f62b chore(snownet): improve log when MessageIntegrity is missing (#7399) 2024-11-29 00:27:53 +00:00
Thomas Eizinger
3ccf795195 test(connlib): don't waste shrinking time & cycles on IDs (#7402)
When `proptest` discovers a test failure, it will attempt to "shrink"
the input to identify, what exactly causes the issue. How this is done
depends on the data type but mostly performs things such as binary
search to be efficient. Not every input within our tests is relevant for
a failure. For example, which ID we have sampled for a client or a
gateway doesn't at all affect whether or not the test will fail.
`proptest` doesn't know that though so it will still happily spend
shrinking time and cycles on figuring out the minimal difference in IDs
(which is 1 because they have to be different). This is a huge waste of
time for no benefit. We are getting much more value out of `proptest` if
it tries to adjust other things such as the transitions involved in a
test, how many gateways and relays there are etc.

By marking the strategies for the IDs and private keys with `no_shrink`,
we can achieve that.
2024-11-28 20:45:34 +00:00
Thomas Eizinger
fd337dd465 test: reduce number of local rejects for generating IPs (#7401)
When generating random input data in property-based tests, we have to
ensure that the data conforms to certain criteria. For example, IP
addresses must not be multicast or unspecified addresses and they must
not be within our reserved IP ranges.

Currently, we ensure this using "filtering" which is a pretty poor
technique [0]. To improve on this, we refactor the generation of IPs to
automatically exclude all IPs within certain ranges. On very big
test-runs (i.e. > 30000 test cases), too many local rejections lead to
the test suite being aborted early.

[0]:
https://proptest-rs.github.io/proptest/proptest/tutorial/filtering.html
2024-11-28 20:45:02 +00:00
Thomas Eizinger
2c26fc9c0e ci: lint Rust dependencies using cargo deny (#7390)
One of Rust's promises is "if it compiles, it works". However, there are
certain situations in which this isn't true. In particular, when using
dynamic typing patterns where trait objects are downcast to concrete
types, having two versions of the same dependency can silently break
things.

This happened in #7379 where I forgot to patch a certain Sentry
dependency. A similar problem exists with our `tracing-stackdriver`
dependency (see #7241).

Lastly, duplicate dependencies increase the compile-times of a project,
so we should aim for having as few duplicate versions of a particular
dependency as possible in our dependency graph.

This PR introduces `cargo deny`, a linter for Rust dependencies. In
addition to linting for duplicate dependencies, it also enforces that
all dependencies are compatible with an allow-list of licenses and it
warns when a dependency is referred to from multiple crates without
introducing a workspace dependency. Thanks to existing tooling
(https://github.com/mainmatter/cargo-autoinherit), transitioning all
dependencies to workspace dependencies was quite easy.

Resolves: #7241.
2024-11-22 00:17:28 +00:00
Thomas Eizinger
56db250e2c feat(connlib): validate integrity of all relay responses (#7378)
In order to avoid processing of responses of relays that somehow got
altered on the network path, we now use the client's `password` as a
shared secret for the relay to also authenticate its responses. This
means that not all message can be authenticated. In particular, BINDING
requests will still be unauthenticated.

Performing this validation now requires every component that crafts
input to the `Allocation` to include a valid `MessageIntegrity`
attribute. This is somewhat problematic for the regression tests of the
relay and the unit tests of `Allocation`. In both cases, we implement
workarounds so we don't have to actually compute a valid
`MessageIntegrity`. This is deemed acceptable because:

- Both of these are just tests.
- We do test the validation path using `tunnel_test` because there we
run an actual relay.
2024-11-19 18:32:33 +00:00
Thomas Eizinger
ecec00afed chore(snownet): print attributes for all requests and responses (#7380)
When debugging issues related to our TURN allocation code, we sometimes
only have the logs that code submitted to Sentry. As part of the event,
we submit the last 500 debug logs as breadcrumbs to give more context to
the error.

Unconditionally printing the attributes of each request-response pair
will help us in more easily diagnosing, why certain errors happen.
2024-11-19 14:32:23 +00:00
Thomas Eizinger
e8519cca0c chore(snownet): warn on exceeding number of candidate pairs (#7376)
In the latest version, we added a warning log to str0m when the maximum
number of candidate pairs is exceeded:
https://github.com/algesten/str0m/pull/587.

We only ever add the candidates of a single relay to an agent (2
candidates), plus at most 2 server-reflexive candidates and at most 2
host candidates. Unless there is a bug like what we fixed in #7334,
exceeding the default number of candidate _pairs_ (100) should never
happen.

In case it does, the newly added `warn` log in `str0m` will trigger a
Sentry alert.
2024-11-19 04:34:23 +00:00
Thomas Eizinger
de35bb067e fix(telemetry): don't embed errors values in telemetry_event! (#7366)
Due to https://github.com/getsentry/sentry-rust/issues/702, errors which
are embedded as `tracing::Value` unfortunately get silently discarded
when reported as part of Sentry "Event"s and not "Exception"s.

The design idea of these telemetry events is that they aren't fatal
errors so we don't need to treat them with the highest priority. They
may also appear quite often, so to save performance and bandwidth, we
sample them at a rate of 1% at creation time.

In order to not lose the context of these errors, we instead format them
into the message. This makes them completely identical to the `debug!`
logs which we have on every call-site of `telemetry_event!` which
prompted me to make that implicit as part of creating the
`telemetry_event!`.

Resolves: #7343.
2024-11-18 18:17:08 +00:00
Thomas Eizinger
d9fb9e53c8 chore(snownet): print error code for unhandled message (#7367)
All our logic for handling errors is based on the error code. Even
though there should be a 1:1 mapping between error code and reason
phrase, I am seeing odd reports in Sentry for a case that we should be
handling but aren't.
2024-11-18 18:15:04 +00:00
Thomas Eizinger
9536b8116c fix: don't exit TUN thread on errors (#7354)
I noticed that in case there is an error when reading from the TUN
device, we currently exit that thread and we don't have a mechanism at
the moment to restart it. Discarding the thread also means we can no
longer send new instances of `Tun` into it.

Instead of exiting the thread, we now just log the error and continue.
In case the error was caused by the FD being closed, we discard the
instance of `Tun` and wait for a new one.
2024-11-16 06:19:41 +00:00
Thomas Eizinger
4e423dc51c fix(connlib): send all unwritten packets before reading new ones (#7342)
With the parallelisation of TUN and UDP operations, we lost
backpressure: Packets can now be read quicker from the UDP sockets than
they can be sent out the TUN device, causing packet loss in extremely
high-throughput situations.

To avoid this, we don't directly send packets into the channel to the
TUN device thread. This channel is bounded, meaning sending can fail if
reading UDP packets is faster than writing packets to the TUN device.

Due to GRO, we may read multiple UDP packets in one go, requiring us to
write multiple IP packets to the TUN device as part of a single
iteration in the event-loop. Thus, we cannot know, how much space we
need in the channel for outgoing IP packets.

By introducing a dedicated buffer, we can temporarily hold on to all of
these packets and on the next call to `poll`, we flush them out into the
channel. If the channel is full, we will suspend and only continue once
there is space in the channel. This behaviour restores backpressue
because we won't read UDP packets from the socket unless we have space
to write the corresponding packet to the TUN device.

UDP itself actually doesn't have any backpressure, instead the packets
will simply get dropped once the receive buffer overflows. The UDP
packets however carry encrypted IP packets, meaning whatever protocol
sits inside these packets will detect the packet loss and should
throttle their sending-pace accordingly.
2024-11-14 06:25:03 +00:00
Thomas Eizinger
8c5a5fa690 chore(rust): correctly disable ANSI escapes globally (#7336)
I think I finally understood and correctly traced, where the use of ANSI
escape codes came from. It turns out, the `with_ansi` switch on
`tracing_subscriber::fmt::Layer` is what you want to toggle. From there,
it trickles down to the `Writer` which we can then test for in our
`Format`.

Resolves: #7284.

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
2024-11-14 05:00:53 +00:00
Thomas Eizinger
efeba55709 chore(snownet): fail TURN connection on unknown attribute (#7341)
A TURN server that doesn't understand certain attributes should return
"Unknown attributes" as part of its response. Whilst we aim to be as
spec-compliant as possible, Firezone doesn't officially support other
TURN servers than our own relay.

If we encounter a TURN server that sends us an "Unknown attribute", we
now immediately fail this allocation and clear it as we cannot make any
more assumptions about what the connected relay actually supports.
2024-11-14 02:43:17 +00:00
Thomas Eizinger
3cf5cbb989 chore(connlib): only send some tunnel errors to Sentry (#7340)
Errors from the tunnel can potentially happen on a per-packet basis. In
order to not flood Sentry, reduce the log-level down to `debug` and only
report 1% of all errors. We did the same thing for the gateway in #7299.
2024-11-14 02:32:37 +00:00
Thomas Eizinger
00c7c42113 fix(snownet): don't allow duplicate server-reflexive candidates (#7334)
In #7163, we introduced a shared cache of server-reflexive candidates
within a `snownet::Node`. What we unfortunately overlooked is that if a
node (i.e. a client or a gateway) is behind symmetric NAT, then we will
repeatedly create "new" server-reflexive candiates, thereby filling up
this cache.

This cache is used to initialise the agents with local candidates, which
manifests in us sending dozens if not hundreds of candidates to the
other party. Whilst not harmful in itself, it does create quite a lot of
spam. To fix this, we introduce a limit of only keeping around 1
server-reflexive candidate per IP version, i.e. only 1 IPv4 and IPv6
address.

At present, `connlib` only supports a single egress interface meaning
for now, we are fine with making this assumption.

In case we encounter a new candidate of the same kind and same IP
version, we evict the old one and replace it with the new one. Thus, for
subsequent connections, only the new candidate is used.
2024-11-14 00:14:29 +00:00
Thomas Eizinger
3dd913f6df fix(snownet): emit correct event on invalidating srflx candidate (#7333)
This one has been lurking in the codebase for a while. Fortunately, it
is not very critical because invalidation of server-reflexive addresses
happens pretty rarely.
2024-11-13 20:12:20 +00:00
Thomas Eizinger
7e0d2ca59c chore: add telemetry event in case we see large datagrams (#7335)
If we see these, something fishy is going on (see #7332), so we should
definitely know about these by recording Sentry events. These can
potentially be per packet so we only send a telemetry event which gets
sampled at a rate of 1%.
2024-11-13 20:09:58 +00:00
Thomas Eizinger
48ba2869a8 chore(rust): ban the use of .unwrap except in tests (#7319)
Using the clippy lint `unwrap_used`, we can automatically lint against
all uses of `.unwrap()` on `Result` and `Option`. This turns up quite a
few results actually. In most cases, they are invariants that can't
actually be hit. For these, we change them to `Option`. In other cases,
they can actually be hit. For example, if the user supplies an invalid
log-filter.

Activating this lint ensures the compiler will yell at us every time we
use `.unwrap` to double-check whether we do indeed want to panic here.

Resolves: #7292.
2024-11-13 03:59:22 +00:00
Thomas Eizinger
0e20f7d086 chore(connlib): better error reporting for invalid IP packets (#7320)
Currently, we don't report very detailed errors when we fail to parse
certain IP packets. With this patch, we use `Result` in more places and
also extend the validation of IP packets to:

a) enforce a length of at most 1280 bytes. This should already be the
case due to our MTU but bad things may happen if that is off for some
reason
b) validate the entire IP packet instead of just its header
2024-11-12 19:46:32 +00:00
Thomas Eizinger
19f51568c2 chore(rust): don't pass errors as values for debug logs (#7318)
Our logging library `tracing` supports structured logging. Structured
logging means we can include values within a `tracing::Event` without
having to immediately format it as a string. Processing these values -
such as errors - as their original type allows the various `tracing`
layers to capture and represent them as they see fit.

One of these layers is responsible for sending ERROR and WARN events to
Sentry, as part of which `std::error::Error` values get automatically
captured as so-called "sentry exceptions".

Unfortunately, there is a caveat: If an `std::error::Error` value is
included in an event that does not get mapped to an exception, the
`error` field is completely lost. See
https://github.com/getsentry/sentry-rust/issues/702 for details.

To work around this, we introduce a `err_with_sources` adapter that an
error and all its sources together into a string. For all
`tracing::debug!` statements, we then use this to report these errors.

It is really unfortunate that we have to do this and cannot use the same
mechanism, regardless of the log level. However, until this is fixed
upstream, this will do and gives us better information in the log
submitted to Sentry.
2024-11-12 04:00:02 +00:00
Thomas Eizinger
d38304b21f build(rust): depend on our boringtun fork (#7120)
This switches our dependency on `boringtun` over to our fork at
https://github.com/firezone/boringtun. The idea of the fork is to
carefully only patch selective parts such that upstream things later is
still possible. The complete diff can be seen here:
https://github.com/cloudflare/boringtun/compare/master...firezone:boringtun:master

So far, the only patches in the fork are dependency bumps, linter fixes,
adjustments to log levels and the removal of panics when the destination
buffer is too small.
2024-11-12 03:40:36 +00:00
Thomas Eizinger
237865c37b test(connlib): drain all Transmits at the end of advance (#7315)
Within our test suite, we "spin" for several (simulated) seconds after
each state transition to allow for packets being sent between the
different nodes. The test suite simulates different latencies by
delaying the delivery of some of these packets.

`connlib` has several timers for sending packets, i.e. STUN bindings, WG
keep-alives etc. These timers never end so we cannot simply spin "until
we no longer want to send any packets". Currently, we simply hard-stop
after a few seconds and drop the remaining packets and move on to the
next state transition.

At present, this isn't an issue because only our ICE agent adheres to
the simulated time advancement. `boringtun` is still impure and thus we
usually don't get to see any of the WireGuard packets like keep-alives
and session timeouts etc in our tests. The STUN messages are pretty
resilient to retransmissions so the current packet drop doesn't matter.

In the process of adopting our boringtun fork
(https://github.com/firezone/boringtun) where we will eventually fix the
time impurity, dropping some of these packets caused problems.

To fix this, we now drain all remaining packets that are sitting in the
"yet-to-be-delivered" buffer. These packets are delivered to an "inbox"
that is per-host, meaning the host (i.e. client, gateway or relay) will
still perceive the incoming packet with the correct latency.

We extract this functionality from #7120 because it is generally useful.
2024-11-12 03:19:07 +00:00
Thomas Eizinger
a83729e439 chore(gateway): be more detailed in error reporting (#7314)
Instead of collapsing multiple of these errors into one, we emit a
dedicated error message for each case. This will allow us to distinguish
them within Sentry events.
2024-11-12 03:16:06 +00:00
Thomas Eizinger
ad4eea29ff chore(rust): don't panic in fallible functions (#7298)
"Just let it crash" is terrible advice for software that is shipped to
end users. Where possible, we should use proper error handling and only
fail the current function / task that is active, e.g. drop a particular
packet instead of failing all of connlib. We more or less already do
that.

Activating the clippy lint `unwrap_in_result` surfaced a few more places
where we panic despite being in a function that is fallible already.
These cases can easily be converted to not panic and return an error
instead.
2024-11-11 23:55:23 +00:00
dependabot[bot]
7e4e190cd6 build(deps): Bump test-strategy from 0.3.1 to 0.4.0 in /rust (#7308)
Bumps [test-strategy](https://github.com/frozenlib/test-strategy) from
0.3.1 to 0.4.0.
<details>
<summary>Commits</summary>
<ul>
<li><a
href="c683eb3cf6"><code>c683eb3</code></a>
Version 0.4.0.</li>
<li><a
href="17706bcd1c"><code>17706bc</code></a>
Update MSRV to 1.70.0.</li>
<li><a
href="90a5efbf00"><code>90a5efb</code></a>
Update dependencies.</li>
<li><a
href="cff2ede71f"><code>cff2ede</code></a>
Changed the strategy generated by <code>#[filter(...)]</code> to reduce
`Too many local ...</li>
<li><a
href="34cc6d2545"><code>34cc6d2</code></a>
Update expected compile error message.</li>
<li><a
href="a4427e2d98"><code>a4427e2</code></a>
Update CI settings.</li>
<li><a
href="ecb7dbae04"><code>ecb7dba</code></a>
Clippy.</li>
<li><a
href="637f29e9c8"><code>637f29e</code></a>
Made it so an error occurs when an unsupported attribute is specified
for enu...</li>
<li><a
href="6d66057bb0"><code>6d66057</code></a>
Use <code>test</code> instead of <code>check</code> with <code>cargo
hack --rust-version</code>.</li>
<li><a
href="cee2ebbfe6"><code>cee2ebb</code></a>
Fix CI settings.</li>
<li>Additional commits viewable in <a
href="https://github.com/frozenlib/test-strategy/compare/v0.3.1...v0.4.0">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=test-strategy&package-manager=cargo&previous-version=0.3.1&new-version=0.4.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)


</details>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
2024-11-11 21:26:41 +00:00
Thomas Eizinger
4e67c568da refactor: downgrade errors when we cannot bind sockets (#7305)
Most of the time, these errors are a result of a limited IP stack, for
example IPv6 not being available. Reporting these as errors to Sentry is
unnecessarily noisy.

If something else happens further down the line, the last 500 debug and
info logs will be sent along with the error report so we will still see
these in the breadcrumbs if an actual error happens later.

Resolves: #7245.
2024-11-11 19:51:10 +00:00
Thomas Eizinger
488c599d5b chore(telemetry): capture Firezone ID and account in user ctx (#7310)
Sentry has a feature called the "User context" which allows us to assign
events to individual users. This in turn will give us statistics in
Sentry, how many users are affected by a certain issue.

Unfortunately, Sentry's user context cannot be built-up step-by-step but
has to be set as a whole. To achieve this, we need to slightly refactor
`Telemetry` to not be `clone`d and instead passed around by mutable
reference.

Resolves: #7248.
Related: https://github.com/getsentry/sentry-rust/issues/706.
2024-11-11 19:50:14 +00:00
Thomas Eizinger
06a274c4e5 refactor(apple): make panics on decoding errors more descriptive (#7301)
The communication between the native Apple client and `connlib` uses
JSON encoding. The deserialisation of these should never fail because a
particular version of `connlib` is always bundled with the native
client. Thus, panicking here is justified.

In case it does ever happen, we improve the panic message with this
patch.
2024-11-11 04:18:07 +00:00
Thomas Eizinger
213dd34ff3 chore(apple): log all connect errors on the connlib-side (#7300)
We don't have Sentry yet in the native Apple client, meaning we don't
yet learn about errors returned from the `connect` call. Normally,
logging and returning an error is an anti-pattern. We are okay with that
in this case until we can report these errors in the native Apple
client.
2024-11-11 04:01:12 +00:00
Thomas Eizinger
62cb32b7a3 chore(gateway): report more tunnel errors to event-loop (#7299)
Currently, the Gateway's state machine functions for processing packets
use type-signature that only return `Option`. Any errors while
processing packets are logged internally. This makes it difficult to
consistently log these errors.

We refactor these functions to return `Result<Option<T>>` in most cases,
indicating that they may fail for various reasons and also sometimes
succeed without producing an output.

This allows us to consistently log these errors in the event-loop.
Logging them on WARN or ERROR would be too spammy though. In order to
still be alerted about some of these, we use the `telemetry_event!`
macro which samples them at a rate of 1%. This will alert us about cases
that happen often and allows us to handle them explicitly.

Once this is deployed to staging, I will monitor the alerts in Sentry to
ensure we won't get spammed with events from customers on the next
release.
2024-11-11 03:50:27 +00:00
Jamil
1dda915376 ci: Publish new clients (#7291)
Fixes the roaming bug.
2024-11-08 22:58:06 +00:00
Thomas Eizinger
8653146c18 fix(connlib): discard timer once it fired (#7288)
Within `connlib`, we have many nested state machines. Many of them have
internal timers by means of timestamps with which they indicate, when
they'd like to be "woken" to perform time-related processing. For
example, the `Allocation` state machine would indicate with a timestamp
5 minutes from the time an allocation is created that it needs to be
woken again in order to send the refresh message to the relay.

When we reset our network connections, we pretty much discard all state
within connlib and together with that, all of these timers. Thus the
`poll_timeout` function would return `None`, indicating that our state
machines are not waiting for anything.

Within the eventloop, the most outer state machine, i.e. `ClientState`
is paired with an `Io` component that actually implements the timer by
scheduling a wake-up aggregated as the earliest point of all state
machines.

In order to not fire the same timer multiple times in a row, we already
intended to reset the timer once it fired. It turns out that this never
worked and the timer still lingered around.

When we call `reset`, `poll_timeout` - which feeds this timer - returns
`None` and the timer doesn't get updated until it will finally return
`Some` with an `Instant`. Because the previous timer didn't get cleared
when it fired, this caused `connlib` to busy loop and prevent some(?)
other parts of it from progressing, resulting in us never being able to
reconnect to the portal. Yet, because the event loop itself was still
operating, we could still resolve DNS queries and such.

Resolves: #7254.

---------

Co-authored-by: Jamil Bou Kheir <jamilbk@users.noreply.github.com>
2024-11-08 12:19:14 +00:00
Thomas Eizinger
4d2dc3dfcb refactor(connlib): don't rely on DNS when reconnecting to portal (#7289)
Currently, `connlib` uses the feature of "known hosts" to provide DNS
functionality for some domains even without any network connectivity.
This is primarily used to ensure that when we reconnect to the portal,
we can resolve the domain name which allows us to then create network
connections.

With recent changes to how our phoenix-channel implementation works,
this is actually no longer necessary. Currently, we re-resolve the
domain every time we connect, even though we already resolved them when
connecting to it for the first time. This step is unnecessary and we can
simply directly use the previously resolved IP addresses for the portal
domain.
2024-11-08 05:06:42 +00:00
Thomas Eizinger
47e45a3cf3 chore(telemetry): improve telemetry spans and events (#7206)
DNS resolution is a critical part of `connlib`. If it is slow for
whatever reason, users will notice this. To make sure we notice as well,
we add `telemetry` spans to the client's and gateway's DNS resolution.
For the client, this applies to all DNS queries that we forward to the
upstream servers. For the gateway, this applies to all DNS resources.

In addition to those IO operations, we also instrument the
`match_resource_linear` function. This function operates in `O(n)` of
all defined DNS resources. It _should_ be fast enough to not create an
impact but it can't hurt to measure this regardless.

Lastly, we also instrument `refresh_translations` on the gateway.
Refreshing the DNS resolution of a DNS resource should really only
happen, when the previous IP addresses become stale yet the user is
still trying to send traffic to them. We don't actually have any data on
how often that happens. By instrumenting it, we can gather some of this
data.

To make sure that none of these telemetry events and spans hurt the
end-user performance, we introduce macros to `firezone-logging` that
sample the creation of these events and spans at a rate of 1%. I ran a
flamegraph and none of these even showed up. The most critical one here
is probably the `match_resource_linear` span because it happens on every
DNS query.

Resolves: #7198.

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
2024-11-06 01:17:57 +00:00
Thomas Eizinger
a5730b6f3b chore: release apple client 1.3.8 (#7268)
To be merged once Apple approves the app review.

---------

Co-authored-by: Jamil Bou Kheir <jamilbk@users.noreply.github.com>
2024-11-05 11:15:50 -08:00
Thomas Eizinger
42f00fe01e chore(snownet): fast-path using PartialEq (#7207)
Counter-intuitively, doing a linear search across all local candidates
and checking for equality is faster than hashing the candidate. This is
because a `Candidate` actually has quite a few fields and we call this
function in the hot-path of packet processing; from `snownet`'s
perspective, each packet might come from a different local socket so we
have to test for each packet, whether or not we already know about this
socket.

Using `PartialEq` instead of hashing every candidate saves about 1% in
the during a speedtest.
2024-11-05 14:40:06 +00:00
Thomas Eizinger
78ebad13ab chore(rust): log more errors as tracing::Values (#7208)
Logging these as structured values gives us a better stacktrace in
Sentry (assuming the errors themselves make proper use of defining an
error-chain).
2024-11-05 14:36:47 +00:00
Thomas Eizinger
250a31b872 chore(connlib): ignore resolved ip selector in debug output (#7209) 2024-11-05 14:34:31 +00:00
Thomas Eizinger
271c480357 fix(connlib): don't attempt to encrypt too large packets (#7263)
When encrypting packets, we need to reserve a buffer within which
boringtun will encrypt the IP packet. Unfortunately, `boringtun` panics
if that buffer is not big enough which essentially brings all of
`connlib` down.

Really, we should never see a packet that is too large and ideally, we
enforce this at compile-time by creating different variants of
`IpPacket` that are sized accordingly. That is a large refactoring so
until then, we simply discard them instead of panicking.

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
2024-11-05 04:17:21 +00:00
dependabot[bot]
a2828a217b build(deps): Bump thiserror from 1.0.64 to 1.0.68 in /rust (#7260)
Bumps [thiserror](https://github.com/dtolnay/thiserror) from 1.0.64 to
1.0.68.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/dtolnay/thiserror/releases">thiserror's
releases</a>.</em></p>
<blockquote>
<h2>1.0.68</h2>
<ul>
<li>Handle incomplete expressions more robustly in format arguments,
such as while code is being typed (<a
href="https://redirect.github.com/dtolnay/thiserror/issues/341">#341</a>,
<a
href="https://redirect.github.com/dtolnay/thiserror/issues/344">#344</a>)</li>
</ul>
<h2>1.0.67</h2>
<ul>
<li>Improve expression syntax support inside format arguments (<a
href="https://redirect.github.com/dtolnay/thiserror/issues/335">#335</a>,
<a
href="https://redirect.github.com/dtolnay/thiserror/issues/337">#337</a>,
<a
href="https://redirect.github.com/dtolnay/thiserror/issues/339">#339</a>,
<a
href="https://redirect.github.com/dtolnay/thiserror/issues/340">#340</a>)</li>
</ul>
<h2>1.0.66</h2>
<ul>
<li>Improve compile error on malformed format attribute (<a
href="https://redirect.github.com/dtolnay/thiserror/issues/327">#327</a>)</li>
</ul>
<h2>1.0.65</h2>
<ul>
<li>Ensure OUT_DIR is left with deterministic contents after build
script execution (<a
href="https://redirect.github.com/dtolnay/thiserror/issues/325">#325</a>)</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="8d06fb5549"><code>8d06fb5</code></a>
Release 1.0.68</li>
<li><a
href="372fd8a71a"><code>372fd8a</code></a>
Merge pull request <a
href="https://redirect.github.com/dtolnay/thiserror/issues/344">#344</a>
from dtolnay/binop</li>
<li><a
href="08f89925bf"><code>08f8992</code></a>
Disregard equality binop in fallback parser</li>
<li><a
href="d2a823d2ae"><code>d2a823d</code></a>
Merge pull request <a
href="https://redirect.github.com/dtolnay/thiserror/issues/343">#343</a>
from dtolnay/unnamed</li>
<li><a
href="b3bf7a6f69"><code>b3bf7a6</code></a>
Add logic to determine whether unnamed fmt arguments are present</li>
<li><a
href="490f9c017b"><code>490f9c0</code></a>
Merge pull request <a
href="https://redirect.github.com/dtolnay/thiserror/issues/342">#342</a>
from dtolnay/synfull</li>
<li><a
href="7daf1b169d"><code>7daf1b1</code></a>
Defer is_syn_full() call until first expression</li>
<li><a
href="c92ac9940b"><code>c92ac99</code></a>
Merge pull request <a
href="https://redirect.github.com/dtolnay/thiserror/issues/341">#341</a>
from dtolnay/parsescan</li>
<li><a
href="40a53f7f33"><code>40a53f7</code></a>
Interleave Expr parsing and scanning better</li>
<li><a
href="925f2dde77"><code>925f2dd</code></a>
Release 1.0.67</li>
<li>Additional commits viewable in <a
href="https://github.com/dtolnay/thiserror/compare/1.0.64...1.0.68">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=thiserror&package-manager=cargo&previous-version=1.0.64&new-version=1.0.68)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)


</details>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-04 19:06:15 +00:00
Thomas Eizinger
5564e578fe fix(telemetry): flush sentry.io events in dedicated task (#7205)
`sentry`'s transport layer appears to be using blocking IO for flushing
events. Performing blocking IO within a future that is running on a
worker-thread of tokio causes this operation to hang and eventually
time-out after 5 seconds. As a result, many events - especially traces -
don't get flushed to sentry when an app is being shut down.

To fix this, we make `Telemetry::stop` an `async fn` and offload the
flushing to a task on tokio's thread-pool for blocking IO.
2024-11-01 15:52:09 +00:00
Thomas Eizinger
59412223cb chore: bump Android and Apple apps to next version (#7192)
We are in the process of releasing these so we need to bump their
version to the next one.
2024-10-31 14:24:33 +00:00
Gabi
dc97b9040d fix(connlib): large upstream dns message (#7183)
If edns0 doesn't work correctly DNS servers might respond with messages
bigger than our maximum udp size.

In that case we need to truncate those messages when forwarding the
respond back to the interface and expect the OS to retry with TCP.

Otherwise we aren't able to allocate a packet big enough for this.

Fixes #7121

---------

Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
2024-10-30 04:02:14 +00:00
Thomas Eizinger
1d9802f2e4 fix(connlib): don't add host candidates multiple times (#7172)
We introduced a boolean bug in #7163 that causes us to attempt to add
host candidates much more often than necessary. This spams the logs on
DEBUG level but was otherwise not harmful.
2024-10-29 15:13:40 +00:00