Commit Graph

338 Commits

Author SHA1 Message Date
Jamil
6a73406194 chore: Bump Apple version to 1.4.1 (#7946) 2025-01-30 00:04:54 +00:00
Thomas Eizinger
8bd8098cab refactor(connlib): don't re-implement waker for TUN thread (#7944)
Within `connlib` - on UNIX platforms - we have dedicated threads that
read from and write to the TUN device. These threads are connected with
`connlib`'s main thread via bounded channels: one in each direction.
When these channels are full, `connlib`'s main thread will suspend and
not read any network packets from the sockets in order to maintain
back-pressure. Reading more packets from the socket would mean most
likely sending more packets out the TUN device.

When debugging #7763, it became apparent that _something_ must be wrong
with these threads and that somehow, we either consider them as full or
aren't emptying them and as a result, we don't read _any_ network
packets from our sockets.

To maintain back-pressure here, we currently use our own `AtomicWaker`
construct that is shared with the TUN thread(s). This is unnecessary. We
can also directly convert the `flume::Sender` into a
`flume::async::SendSink` and therefore directly access a `poll`
interface.
2025-01-29 15:48:48 +00:00
Jamil
7e231c6b10 chore: Release Android 1.4.1 (#7911) 2025-01-29 00:29:15 +00:00
Thomas Eizinger
c6492d4832 fix(rust): don't start all log files with connlib. (#7853)
At present, the file logger for all Rust code starts each logfile with
`connlib.`. This is very confusing when exporting the logs from the GUI
client because even the logs from the client itself will start with
`connlib.`. To fix this, we make the base file name of the log file
configurable.
2025-01-28 01:35:05 +00:00
Jamil
6670741dee chore: Bump apple clients to 1.4.0 (#7785)
Bumps Apple clients to the 1.4.0 release. They're already live.
2025-01-17 00:07:25 +00:00
Jamil
6476109b73 fix(apple): Simplify Xcode rust build steps (#7709)
Xcode doesn't allow wildcards in input file lists, so the rules I set up
in #7488 never took effect.

Upon further investigation, it appears that the `strip` command executed
unconditionally at the end of every Rust build was the culprit. Since
Xcode already does this for us, it's a useless step that adds about 30s
to the build time.

Unfortunately there isn't a good way to tell Xcode not to build rust.
But now we don't need to -- `cargo`'s build cache is smart enough to
skip builds and we are back to the ~1-2s range for repeated builds when
only Swift code has changed.

We also add the swift bridge generated code to version control. These
doesn't change regularly, and Xcode sometimes complains that the files
don't exist _before_ it lets you run the `cargo build` to generate them
🙃 .
2025-01-09 07:54:37 +00:00
Jamil
a21b9fe811 fix(connlib): Don't double-encode DNS addresses (#7708) 2025-01-08 16:35:05 -08:00
Thomas Eizinger
ed5285268d refactor: merge on_update_routes and on_set_interface_config (#7699)
For a while now, `connlib` has been calling these two callbacks right
after each other because the internal event already bundles all the
information about the TUN device. With this PR, we merge the two
callback functions also in layers above `connlib` itself.

Resolves: #6182.
2025-01-08 18:26:40 +00:00
Thomas Eizinger
3a8c6c7182 chore(connlib): assert that we don't emit WouldBlock errors (#7696)
When file descriptors like sockets or the TUN device are opened in
non-blocking mode, performing operations that would block emit the
`WouldBlock` IO error. These errors _should_ be translated into
`Poll::Pending` and have a waker registered that gets called whenever
the operation should be attempted again. Therefore, we should _never_
see these IO errors.

Previously, the implementation of the tunnel's event-loop did not yet
properly handle this backpressure and instead sometimes dropped packets
when it should have suspended. This has since been fixed but the then
introduced branch of just ignored the `io::ErrorKind::WouldBlock` errors
had remained.

Changing this to a debug-assert will alert us whenever we accidentally
break this without altering the behaviour of the release binary.
2025-01-08 14:11:01 +00:00
Thomas Eizinger
037a2e64b6 fix(connlib): attempt to detect runtime shutdown within TUN task (#7605)
Reading and writing to the TUN device within `connlib` happens in a
separate thread. The task running within these threads is connected to
the rest of `connlib` via channels. When the application shuts down,
these threads also need to exit. Currently, we attempt to detect this
from within the task when these channels close. It appears that there is
a race condition here because we first attempt to read from the TUN
device before reading from the channels. We treat read & write errors on
the TUN device as non-fatal so we loop around and attempt to read from
it again, causing an infinite-loop and log spam.

To fix this, we swap the order in which we evaluate the two concurrent
tasks: The first task to be polled is now the channel for outbound
packets and only if that one is empty, we attempt to read new packets
from the TUN device. This is also better from a backpressure point of
view: We should attempt to flush out our local buffers of already
processed packets before taking on "new work".

As a defense-in-depth strategy, we also attempt to detect the particular
error from the tokio runtime when it is being shut down and exit the
task.

Resolves: #7601.
Related: https://github.com/tokio-rs/tokio/issues/7056.
2025-01-05 20:41:24 +00:00
Jamil
b6aed36c2c feat(apple): Set account slug from Swift -> Rust to hydrate Sentry with (#7662)
- Refactor Telemetry module to expose firezoneId and accountSlug for
easier access in the Adapter module
- Set accountSlug to WrappedSession.connect for hydrating the Rust
sentry context
2025-01-04 02:48:06 +00:00
Jamil
309914a45d chore(android): release version 1.4.0 (#7649)
Bumps the Android client to the 1.4.0 release.

Tested in Android emulator.
2025-01-03 14:45:00 +00:00
Thomas Eizinger
8a1b6f26b4 fix(connlib): don't log warnings for unreachable errors (#7537)
When a Gateway or Client is running in an environment without IPv4 or
IPv6 connectivity, our initial probes for sending packets to the relays
will fail with network unreachable. That isn't a very big concern and
happens a lot in the wild. There is no need to report these as telemetry
events.

Resolves: #7514.
2024-12-17 17:59:20 +00:00
Thomas Eizinger
7e38d3caee chore(connlib): downgrade warning about failed flow (#7480) 2024-12-11 19:01:37 +00:00
Thomas Eizinger
81f71cba62 fix(telemetry): use package@version notation for releases (#7466)
In order for Sentry to parse our releases as semver, they need to be in
the form of `package@version` [0]. Without this, the feature of "Mark
this issue as resolved in the _next_ version" doesn't work properly
because Sentry compares the versions as to when it first saw them vs
parsing the semver string itself. We test versions prior to releasing
them, meaning Sentry learns about a 1.4.0 version before it is actually
released. This causes false-positive "regressions" even though they are
fixed in a later (as per semver) release.

This create some redundancy with the different DSNs that we are already
using. I think it would make sense to consider merging the two projects
we have for the GUI client for example. That is really just one project
that happens to run as two binaries.

For all other projects, I think the separation still makes sense because
we e.g. may add Sentry to the "host" applications of Android and
MacOS/iOS as well. For those, we would reuse the DSN and thus funnel the
issues into the same Sentry project.

As per Sentry's docs, releases are organisation-wide and therefore need
a package identifier to be grouped correctly.

[0]:
https://docs.sentry.io/platforms/javascript/configuration/releases/#bind-the-version
2024-12-09 05:04:45 +00:00
Thomas Eizinger
ddce9312ea fix(android): apply new log-filter on repeated connect call (#7461)
Related: #7460.
Resolves: #5634.
2024-12-06 04:45:28 +00:00
Thomas Eizinger
6115f662cf fix(apple): only initialise global logger once (#7460)
From within the FFI code, we have no control over the lifecycle of the
host application and `connect` may be called multiple times from within
the same process. Therefore, we cannot rely on the global logger state
to **not** be set when `connect` gets called.

To fix this, we cache the handles for the file logger and a
reload-handle for the log filter in a `static` variable. This allows us
to apply the new log-filter of a repeated `connect` call to the existing
logger, even if `connect` is called multiple times from the same
process.
2024-12-06 04:44:41 +00:00
Thomas Eizinger
90cf191a7c feat(linux): multi-threaded TUN device operations (#7449)
## Context

At present, we only have a single thread that reads and writes to the
TUN device on all platforms. On Linux, it is possible to open the file
descriptor of a TUN device multiple times by setting the
`IFF_MULTI_QUEUE` option using `ioctl`. Using multi-queue, we can then
spawn multiple threads that concurrently read and write to the TUN
device. This is critical for achieving a better throughput.

## Solution

`IFF_MULTI_QUEUE` is a Linux-only thing and therefore only applies to
headless-client, GUI-client on Linux and the Gateway (it may also be
possible on Android, I haven't tried). As such, we need to first change
our internal abstractions a bit to move the creation of the TUN thread
to the `Tun` abstraction itself. For this, we change the interface of
`Tun` to the following:

- `poll_recv_many`: An API, inspired by tokio's `mpsc::Receiver` where
multiple items in a channel can be batch-received.
- `poll_send_ready`: Mimics the API of `Sink` to check whether more
items can be written.
- `send`: Mimics the API of `Sink` to actually send an item.

With these APIs in place, we can implement various (performance)
improvements for the different platforms.

- On Linux, this allows us to spawn multiple threads to read and write
from the TUN device and send all packets into the same channel. The `Io`
component of `connlib` then uses `poll_recv_many` to read batches of up
to 100 packets at once. This ties in well with #7210 because we can then
use GSO to send the encrypted packets in single syscalls to the OS.
- On Windows, we already have a dedicated recv thread because `WinTun`'s
most-convenient API uses blocking IO. As such, we can now also tie into
that by batch-receiving from this channel.
- In addition to using multiple threads, this API now also uses correct
readiness checks on Linux, Darwin and Android to uphold backpressure in
case we cannot write to the TUN device.

## Configuration

Local testing has shown that 2 threads give the best performance for a
local `iperf3` run. I suspect this is because there is only so much
traffic that a single application (i.e. `iperf3`) can generate. With
more than 2 threads, the throughput actually drops drastically because
`connlib`'s main thread is too busy with lock-contention and triggering
`Waker`s for the TUN threads (which mostly idle around if there are 4+
of them). I've made it configurable on the Gateway though so we can
experiment with this during concurrent speedtests etc.

In addition, switching `connlib` to a single-threaded tokio runtime
further increased the throughput. I suspect due to less task / context
switching.

## Results

Local testing with `iperf3` shows some very promising results. We now
achieve a throughput of 2+ Gbit/s.

```
Connecting to host 172.20.0.110, port 5201
Reverse mode, remote host 172.20.0.110 is sending
[  5] local 100.80.159.34 port 57040 connected to 172.20.0.110 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec   274 MBytes  2.30 Gbits/sec
[  5]   1.00-2.00   sec   279 MBytes  2.34 Gbits/sec
[  5]   2.00-3.00   sec   216 MBytes  1.82 Gbits/sec
[  5]   3.00-4.00   sec   224 MBytes  1.88 Gbits/sec
[  5]   4.00-5.00   sec   234 MBytes  1.96 Gbits/sec
[  5]   5.00-6.00   sec   238 MBytes  2.00 Gbits/sec
[  5]   6.00-7.00   sec   229 MBytes  1.92 Gbits/sec
[  5]   7.00-8.00   sec   222 MBytes  1.86 Gbits/sec
[  5]   8.00-9.00   sec   223 MBytes  1.87 Gbits/sec
[  5]   9.00-10.00  sec   217 MBytes  1.82 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  2.30 GBytes  1.98 Gbits/sec  22247             sender
[  5]   0.00-10.00  sec  2.30 GBytes  1.98 Gbits/sec                  receiver

iperf Done.
```

This is a pretty solid improvement over what is in `main`:

```
Connecting to host 172.20.0.110, port 5201
[  5] local 100.65.159.3 port 56970 connected to 172.20.0.110 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  90.4 MBytes   758 Mbits/sec  1800    106 KBytes
[  5]   1.00-2.00   sec  93.4 MBytes   783 Mbits/sec  1550   51.6 KBytes
[  5]   2.00-3.00   sec  92.6 MBytes   777 Mbits/sec  1350   76.8 KBytes
[  5]   3.00-4.00   sec  92.9 MBytes   779 Mbits/sec  1800   56.4 KBytes
[  5]   4.00-5.00   sec  93.4 MBytes   783 Mbits/sec  1650   69.6 KBytes
[  5]   5.00-6.00   sec  90.6 MBytes   760 Mbits/sec  1500   73.2 KBytes
[  5]   6.00-7.00   sec  87.6 MBytes   735 Mbits/sec  1400   76.8 KBytes
[  5]   7.00-8.00   sec  92.6 MBytes   777 Mbits/sec  1600   82.7 KBytes
[  5]   8.00-9.00   sec  91.1 MBytes   764 Mbits/sec  1500   70.8 KBytes
[  5]   9.00-10.00  sec  92.0 MBytes   771 Mbits/sec  1550   85.1 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   917 MBytes   769 Mbits/sec  15700             sender
[  5]   0.00-10.00  sec   916 MBytes   768 Mbits/sec                  receiver

iperf Done.
```
2024-12-05 00:18:20 +00:00
Thomas Eizinger
48bd0f9804 chore: bump client versions to 1.4.0 (#7092)
In order to release the new control protocol to users, we need to bump
the versions of the clients to 1.4.0. The portal has a version gate to
only select gateways with version >= 1.4.0 for clients >= 1.4.0. Thus,
bumping these versions can only happen once testing has completed and
the gateway has actually been released as 1.4.0.

Co-authored-by: Jamil Bou Kheir <jamilbk@users.noreply.github.com>
2024-12-04 19:48:51 +00:00
Thomas Eizinger
b802021cc4 feat(connlib): implement idempotent control protocol for client (#6942)
Building on top of the gateway PR (#6941), this PR transitions the
clients to the new control protocol. Clients are **not**
backwards-compatible with old gateways. As a result, a certain customer
environment MUST have at least one gateway with the above PR running in
order for clients to be able to establish connections.

With this transition, Clients send explicit events to Gateways whenever
they assign IPs to a DNS resource name. The actual assignment only
happens once and the IPs then remain stable for the duration of the
client session.

When the Gateway receives such an event, it will perform a DNS
resolution of the requested domain name and set up the NAT between the
assigned proxy IPs and the IPs the domain actually resolves to. In order
to support self-healing of any problems that happen during this process,
the client will send an "Assigned IPs" event every time it receives a
DNS query for a particular domain. This in turn will trigger another DNS
resolution on the Gateway. Effectively, this means that DNS queries for
DNS resources propagate to the Gateway, triggering a DNS resolution
there. In case the domain resolves to the same set of IPs, no state is
changed to ensure existing connections are not interrupted.

With this new functionality in place, we can delete the old logic around
detecting "expired" IPs. This is considered a bugfix as this logic isn't
currently working as intended. It has been observed multiple times that
the Gateway can loop on this behaviour and resolving the same domain
over and over again. The only theoretical "incompatibility" here is that
pre-1.4.0 clients won't have access to this functionality of triggering
DNS refreshes on a Gateway 1.4.2+ Gateway. However, as soon as this PR
merges, we expect all admins to have already upgraded to a 1.4.0+
Gateway anyway which already mandates clients to be on 1.4.0+.

Resolves: #7391.
Resolves: #6828.
2024-12-04 12:05:35 +00:00
Jamil
15e75f80ba fix(apple/ios): Expose IPHONEOS_DEPLOYMENT_TARGET to tell rustc our iOS version (#7453)
Fixes a similar issue as #7443 where we were deleting the
`IPHONEOS_DEPLOYMENT_TARGET` variable in our Rust build script, which
caused lots of warnings about building for a different OS than being
linked against.
2024-12-03 14:12:20 -08:00
Thomas Eizinger
dd6b52b236 chore(rust): share edition key via workspace table (#7451) 2024-12-03 00:28:06 +00:00
Jamil
e1ed497d12 fix(apple): Expose MACOSX_DEPLOYMENT_TARGET in rust apple build script to signal to rustc which macOS to target (#7443)
`MACOSX_DEPLOYMENT_TARGET` is a standard env var read by gcc and rustc
that determines which version of macOS to target binaries for. This
variable was being removed inadvertently in our rust build script.

Exposing this var fixes a slew of warnings about object files being
built for a newer macOS target than being linked that were showing up in
Xcode for some time now.

Hasn't caused any issues thus far from what I can tell, but possibly
related to #7442
2024-12-02 17:27:11 +00:00
Thomas Eizinger
932f6791fb fix(phoenix-channel): lazily create backoff timer (#7414)
Our `phoenix-channel` component is responsible for maintaining a
WebSocket connection to the portal. In case that connection fails, we
want to reconnect to it using an exponential backoff, eventually giving
up after a certain amount of time.

Unfortunately, the code we have today doesn't quite do that. An
`ExponentialBackoff` has a setting for the `max_elapsed_time`.
Regardless of how many and how often we retry something, we won't ever
wait longer than this amount of time. For the Relay, this is set to
15min. For other components its indefinite (Gateway, headless-client),
or very long (30 days for Android, 1 day for Apple).

The point in time from which this duration is counted is when the
`ExponentialBackoff` is **constructed** which translates to when we
**first** connected to the portal. As a result, our backoff would
immediately fail on the first error if it has been longer than
`max_elapsed_time` since we first connected. For most components, this
codepath is not relevant because the `max_elapsed_time` is so long. For
the Relay however, that is only 15 minutes so chances are, the Relay
would immediately fail (and get rebooted) on the first connection error
with the portal.

To fix this, we now lazily create the `ExponentialBackoff` on the first
error.

This bug has some interesting consequences: When a relay reboots, it
looses all its state, i.e. allocations, channel bindings, available
nonces etc, stamp-secret. Thus, all credentials and state that got
distributed to Clients and Gateways get invalidated, causing disconnects
from the Relay. We have observed these alerts in Sentry for a while and
couldn't explain them. Most likely, this is the root cause for those
because whilst a Relay disconnects, the portal also cannot detect its
presence and pro-actively inform Clients and Gateways to no longer use
this Relay.
2024-11-29 20:19:11 +00:00
Thomas Eizinger
c6e7e6192e build(rust): bump Rust to 1.83 (#7409)
Rust 1.83 comes with a bunch of new lints for elidible lifetimes. Those
also trigger in the generated code of `derivative`. That crate is
actually unmaintained so we replace our usages of it with `derive_more`.
2024-11-29 01:04:06 +00:00
Thomas Eizinger
2c26fc9c0e ci: lint Rust dependencies using cargo deny (#7390)
One of Rust's promises is "if it compiles, it works". However, there are
certain situations in which this isn't true. In particular, when using
dynamic typing patterns where trait objects are downcast to concrete
types, having two versions of the same dependency can silently break
things.

This happened in #7379 where I forgot to patch a certain Sentry
dependency. A similar problem exists with our `tracing-stackdriver`
dependency (see #7241).

Lastly, duplicate dependencies increase the compile-times of a project,
so we should aim for having as few duplicate versions of a particular
dependency as possible in our dependency graph.

This PR introduces `cargo deny`, a linter for Rust dependencies. In
addition to linting for duplicate dependencies, it also enforces that
all dependencies are compatible with an allow-list of licenses and it
warns when a dependency is referred to from multiple crates without
introducing a workspace dependency. Thanks to existing tooling
(https://github.com/mainmatter/cargo-autoinherit), transitioning all
dependencies to workspace dependencies was quite easy.

Resolves: #7241.
2024-11-22 00:17:28 +00:00
Thomas Eizinger
de35bb067e fix(telemetry): don't embed errors values in telemetry_event! (#7366)
Due to https://github.com/getsentry/sentry-rust/issues/702, errors which
are embedded as `tracing::Value` unfortunately get silently discarded
when reported as part of Sentry "Event"s and not "Exception"s.

The design idea of these telemetry events is that they aren't fatal
errors so we don't need to treat them with the highest priority. They
may also appear quite often, so to save performance and bandwidth, we
sample them at a rate of 1% at creation time.

In order to not lose the context of these errors, we instead format them
into the message. This makes them completely identical to the `debug!`
logs which we have on every call-site of `telemetry_event!` which
prompted me to make that implicit as part of creating the
`telemetry_event!`.

Resolves: #7343.
2024-11-18 18:17:08 +00:00
Thomas Eizinger
8c5a5fa690 chore(rust): correctly disable ANSI escapes globally (#7336)
I think I finally understood and correctly traced, where the use of ANSI
escape codes came from. It turns out, the `with_ansi` switch on
`tracing_subscriber::fmt::Layer` is what you want to toggle. From there,
it trickles down to the `Writer` which we can then test for in our
`Format`.

Resolves: #7284.

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
2024-11-14 05:00:53 +00:00
Thomas Eizinger
3cf5cbb989 chore(connlib): only send some tunnel errors to Sentry (#7340)
Errors from the tunnel can potentially happen on a per-packet basis. In
order to not flood Sentry, reduce the log-level down to `debug` and only
report 1% of all errors. We did the same thing for the gateway in #7299.
2024-11-14 02:32:37 +00:00
Thomas Eizinger
48ba2869a8 chore(rust): ban the use of .unwrap except in tests (#7319)
Using the clippy lint `unwrap_used`, we can automatically lint against
all uses of `.unwrap()` on `Result` and `Option`. This turns up quite a
few results actually. In most cases, they are invariants that can't
actually be hit. For these, we change them to `Option`. In other cases,
they can actually be hit. For example, if the user supplies an invalid
log-filter.

Activating this lint ensures the compiler will yell at us every time we
use `.unwrap` to double-check whether we do indeed want to panic here.

Resolves: #7292.
2024-11-13 03:59:22 +00:00
Thomas Eizinger
ad4eea29ff chore(rust): don't panic in fallible functions (#7298)
"Just let it crash" is terrible advice for software that is shipped to
end users. Where possible, we should use proper error handling and only
fail the current function / task that is active, e.g. drop a particular
packet instead of failing all of connlib. We more or less already do
that.

Activating the clippy lint `unwrap_in_result` surfaced a few more places
where we panic despite being in a function that is fallible already.
These cases can easily be converted to not panic and return an error
instead.
2024-11-11 23:55:23 +00:00
Thomas Eizinger
488c599d5b chore(telemetry): capture Firezone ID and account in user ctx (#7310)
Sentry has a feature called the "User context" which allows us to assign
events to individual users. This in turn will give us statistics in
Sentry, how many users are affected by a certain issue.

Unfortunately, Sentry's user context cannot be built-up step-by-step but
has to be set as a whole. To achieve this, we need to slightly refactor
`Telemetry` to not be `clone`d and instead passed around by mutable
reference.

Resolves: #7248.
Related: https://github.com/getsentry/sentry-rust/issues/706.
2024-11-11 19:50:14 +00:00
Thomas Eizinger
06a274c4e5 refactor(apple): make panics on decoding errors more descriptive (#7301)
The communication between the native Apple client and `connlib` uses
JSON encoding. The deserialisation of these should never fail because a
particular version of `connlib` is always bundled with the native
client. Thus, panicking here is justified.

In case it does ever happen, we improve the panic message with this
patch.
2024-11-11 04:18:07 +00:00
Thomas Eizinger
213dd34ff3 chore(apple): log all connect errors on the connlib-side (#7300)
We don't have Sentry yet in the native Apple client, meaning we don't
yet learn about errors returned from the `connect` call. Normally,
logging and returning an error is an anti-pattern. We are okay with that
in this case until we can report these errors in the native Apple
client.
2024-11-11 04:01:12 +00:00
Jamil
1dda915376 ci: Publish new clients (#7291)
Fixes the roaming bug.
2024-11-08 22:58:06 +00:00
Thomas Eizinger
4d2dc3dfcb refactor(connlib): don't rely on DNS when reconnecting to portal (#7289)
Currently, `connlib` uses the feature of "known hosts" to provide DNS
functionality for some domains even without any network connectivity.
This is primarily used to ensure that when we reconnect to the portal,
we can resolve the domain name which allows us to then create network
connections.

With recent changes to how our phoenix-channel implementation works,
this is actually no longer necessary. Currently, we re-resolve the
domain every time we connect, even though we already resolved them when
connecting to it for the first time. This step is unnecessary and we can
simply directly use the previously resolved IP addresses for the portal
domain.
2024-11-08 05:06:42 +00:00
Thomas Eizinger
a5730b6f3b chore: release apple client 1.3.8 (#7268)
To be merged once Apple approves the app review.

---------

Co-authored-by: Jamil Bou Kheir <jamilbk@users.noreply.github.com>
2024-11-05 11:15:50 -08:00
Thomas Eizinger
78ebad13ab chore(rust): log more errors as tracing::Values (#7208)
Logging these as structured values gives us a better stacktrace in
Sentry (assuming the errors themselves make proper use of defining an
error-chain).
2024-11-05 14:36:47 +00:00
dependabot[bot]
a2828a217b build(deps): Bump thiserror from 1.0.64 to 1.0.68 in /rust (#7260)
Bumps [thiserror](https://github.com/dtolnay/thiserror) from 1.0.64 to
1.0.68.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/dtolnay/thiserror/releases">thiserror's
releases</a>.</em></p>
<blockquote>
<h2>1.0.68</h2>
<ul>
<li>Handle incomplete expressions more robustly in format arguments,
such as while code is being typed (<a
href="https://redirect.github.com/dtolnay/thiserror/issues/341">#341</a>,
<a
href="https://redirect.github.com/dtolnay/thiserror/issues/344">#344</a>)</li>
</ul>
<h2>1.0.67</h2>
<ul>
<li>Improve expression syntax support inside format arguments (<a
href="https://redirect.github.com/dtolnay/thiserror/issues/335">#335</a>,
<a
href="https://redirect.github.com/dtolnay/thiserror/issues/337">#337</a>,
<a
href="https://redirect.github.com/dtolnay/thiserror/issues/339">#339</a>,
<a
href="https://redirect.github.com/dtolnay/thiserror/issues/340">#340</a>)</li>
</ul>
<h2>1.0.66</h2>
<ul>
<li>Improve compile error on malformed format attribute (<a
href="https://redirect.github.com/dtolnay/thiserror/issues/327">#327</a>)</li>
</ul>
<h2>1.0.65</h2>
<ul>
<li>Ensure OUT_DIR is left with deterministic contents after build
script execution (<a
href="https://redirect.github.com/dtolnay/thiserror/issues/325">#325</a>)</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="8d06fb5549"><code>8d06fb5</code></a>
Release 1.0.68</li>
<li><a
href="372fd8a71a"><code>372fd8a</code></a>
Merge pull request <a
href="https://redirect.github.com/dtolnay/thiserror/issues/344">#344</a>
from dtolnay/binop</li>
<li><a
href="08f89925bf"><code>08f8992</code></a>
Disregard equality binop in fallback parser</li>
<li><a
href="d2a823d2ae"><code>d2a823d</code></a>
Merge pull request <a
href="https://redirect.github.com/dtolnay/thiserror/issues/343">#343</a>
from dtolnay/unnamed</li>
<li><a
href="b3bf7a6f69"><code>b3bf7a6</code></a>
Add logic to determine whether unnamed fmt arguments are present</li>
<li><a
href="490f9c017b"><code>490f9c0</code></a>
Merge pull request <a
href="https://redirect.github.com/dtolnay/thiserror/issues/342">#342</a>
from dtolnay/synfull</li>
<li><a
href="7daf1b169d"><code>7daf1b1</code></a>
Defer is_syn_full() call until first expression</li>
<li><a
href="c92ac9940b"><code>c92ac99</code></a>
Merge pull request <a
href="https://redirect.github.com/dtolnay/thiserror/issues/341">#341</a>
from dtolnay/parsescan</li>
<li><a
href="40a53f7f33"><code>40a53f7</code></a>
Interleave Expr parsing and scanning better</li>
<li><a
href="925f2dde77"><code>925f2dd</code></a>
Release 1.0.67</li>
<li>Additional commits viewable in <a
href="https://github.com/dtolnay/thiserror/compare/1.0.64...1.0.68">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=thiserror&package-manager=cargo&previous-version=1.0.64&new-version=1.0.68)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)


</details>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-11-04 19:06:15 +00:00
Thomas Eizinger
5564e578fe fix(telemetry): flush sentry.io events in dedicated task (#7205)
`sentry`'s transport layer appears to be using blocking IO for flushing
events. Performing blocking IO within a future that is running on a
worker-thread of tokio causes this operation to hang and eventually
time-out after 5 seconds. As a result, many events - especially traces -
don't get flushed to sentry when an app is being shut down.

To fix this, we make `Telemetry::stop` an `async fn` and offload the
flushing to a task on tokio's thread-pool for blocking IO.
2024-11-01 15:52:09 +00:00
Thomas Eizinger
59412223cb chore: bump Android and Apple apps to next version (#7192)
We are in the process of releasing these so we need to bump their
version to the next one.
2024-10-31 14:24:33 +00:00
Thomas Eizinger
f7a388345b fix(connlib): reconnect in case we lose all relays (#7164)
During normal operation, we should never lose connectivity to the set of
assigned relays in a client or gateway. In the presence of odd network
conditions and partitions however, it is possible that we disconnect
from a relay that is in fact only temporarily unavailable. Without an
explicit mechanism to retrieve new relays, this means that both clients
and gateways can end up with no relays at all. For clients, this can be
fixed by either roaming or signing out and in again. For gateways, this
can only be fixed by a restart!

Without connected relays, no connections can be established. With #7163,
we will at least be able to still establish direct connections. Yet,
that isn't good enough and we need a mechanism for restoring full
connectivity in such a case.

We creating a new connection, we already sample one of our relays and
assign it to this particular connection. This ensures that we don't
create an excessive amount of candidates for each individual connection.
Currently, this selection is allowed to be silently fallible. With this
PR, we make this a hard-error and bubble up the error that all the way
to the client's and gateway's event-loop. There, we initiate a reconnect
to the portal as a compensating action. Reconnecting to the portal means
we will receive another `init` message that allows us to reconnect the
relays.

Due to the nature of this implementation, this fix may only apply with a
certain delay from when we actually lost connectivity to the last relay.
However, this design has the advantage that we don't have to introduce
an additional state within `snownet`: Connections now simply fail to
establish and the next one soon after _should_ succeed again because we
will have received a new `init` message.

Resolves: #7162.
2024-10-29 01:01:47 +00:00
Thomas Eizinger
5cf105f073 chore(android): start telemetry together with connlib session (#7151)
As a first step for integration Sentry into the Android app, we launch
the Sentry Rust agent as soon as a `connlib` session starts up. At a
later point, we can also integrate Sentry into the Android app itself
using the Java / Kotlin SDK.

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
2024-10-24 20:03:06 +00:00
Thomas Eizinger
c12a02e348 chore(apple): start telemetry together with connlib session (#7152)
This starts up telemetry together with each `connlib` session. At a
later point, we can also integrate the native Swift SDK into the MacOS /
iOS app to catch non-connlib specific problems.

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
2024-10-24 19:59:52 +00:00
Thomas Eizinger
2ca91a3b1a chore(connlib): remove old mock feature (#7142)
This is so stale, it definitely needs to go in the bin.
2024-10-23 16:30:15 +00:00
Thomas Eizinger
ee30368970 refactor(connlib): simplify error handling on crash (#7134)
The `fmt::Display` implementation of `tokio::task::JoinError` already
does exactly what we do here: Extracting the panic message if there is
one. Thus, we can simplify this code why just moving the `JoinError`
into the `DisconnectError` as its source.
2024-10-23 16:13:39 +00:00
Thomas Eizinger
990324b2ec chore(rust): enable sentry-tracing integration (#7105)
Using the `sentry-tracing` integration, we can automatically capture
events based on what we log via `tracing`. The mapping is defined as
follows:

- ERROR: Gets captured as a fatal error
- WARN: Gets captured as a message
- INFO: Gets captured as a breadcrumb
- `_`: Does not get captured at all

If telemetry isn't active / configured, this integration does nothing.
It is therefore safe to just always enable it.
2024-10-22 23:23:49 +00:00
Thomas Eizinger
73eebd2c4d refactor(rust): consistently record errors as tracing::Value (#7104)
Our logging library, `tracing` supports structured logging. This is
useful because it preserves the more than just the string representation
of a value and thus allows the active logging backend(s) to capture more
information for a particular value.

In the case of errors, this is especially useful because it allows us to
capture the sources of a particular error.

Unfortunately, recording an error as a tracing value is a bit cumbersome
because `tracing::Value` is only implemented for `&dyn
std::error::Error`. Casting an error to this is quite verbose. To make
it easier, we introduce two utility functions in `firezone-logging`:

- `std_dyn_err`
- `anyhow_dyn_err`

Tracking errors as correct `tracing::Value`s will be especially helpful
once we enable Sentry's `tracing` integration:
https://docs.rs/sentry-tracing/latest/sentry_tracing/#tracking-errors
2024-10-22 04:46:26 +00:00
Thomas Eizinger
dbe618c080 refactor(connlib): expose &mut TRoleState for direct access (#7026)
Currently, we have a lot of stupid code to forward data from the
`{Client,Gateway}Tunnel` interface to `{Client,Gateway}State`. Recent
refactorings such as #6919 made it possible to get rid of this
forwarding layer by directly exposing `&mut TRoleState`.

To maintain some type-privacy, several functions are made generic to
accept `impl Into` or `impl TryInto`.
2024-10-15 01:05:35 +00:00
Thomas Eizinger
857bbf5d98 chore(connlib): introduce custom logging format (#7024)
This PR introduces a custom logging format for all Rust-components. It
is more or less a copy of `tracing_subscriber::fmt::format::Compact`
with the main difference that span-names don't get logged.

Spans are super useful because they allow us to record contextual
values, like the current connection ID, for a certain scope. What is IMO
less useful about them is that in the default formatter configuration,
active spans cause a right-drift of the actual log message.

The actual log message is still what most accurately describes, what
`connlib` is currently doing. Spans only add contextual information that
the reader may use for further understand what is happening. This
optional nature of the utility of spans IMO means that they should come
_after_ the actual log message.

Resolves: #7014.
2024-10-14 18:09:38 +00:00