Commit Graph

160 Commits

Author SHA1 Message Date
Thomas Eizinger
81da120c17 fix(phoenix-channel): report connection hiccups to upper layer (#8203)
The WebSocket connection to the portal from within the Clients, Gateways
and Relays may be temporarily interrupted by IO errors. In such cases we
simply reconnect to it. This isn't as much of a problem for Clients and
Gateways. For Relays however, a disconnect can be disruptive for
customers because the portal will send `relays_presence` events to all
Clients and Gateways. Any relayed connection will therefore be
interrupted. See #8177.

Relays run on our own infrastructure and we want to be notified if their
connection flaps.

In order to differentiate between these scenarios, we remove the logging
from within `phoenix-channel` and report these connection hiccups one
layer up. This allows Clients and Gateways to log them on DEBUG whereas
the Relay can log them on WARN.

Related: #8177 
Related: #7004
2025-02-20 00:54:43 +00:00
Thomas Eizinger
3e4976e4ab fix(relay): don't starve items further down in the event-loop (#8177)
At present, the relay uses a priority in the event-loop that favors
routing traffic. Whenever a task further up in the loop is
`Poll::Ready`, we loop back to the top to continue processing. The issue
with that is that in very busy times, this can lead to starvation in
processing timers and messages from the portal. If we then finally get
to process portal messages, we think that the portal hasn't replied in
some time and proactively cut the connection and reconnect.

As a result, the portal will send `relays_presence` messages to the
clients and gateways which in turn will locally remove the relay. This
breaks relayed connections.

To fix this, instead of immediately traversing to the top of the
event-loop with `continue`, we only set a boolean. This gives each
element of the event-loop a chance to execute, even when a certain
component is very busy.
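
A minimal sketch of this pattern, not the relay's actual code: `Source`
and `run` are hypothetical names, and each source simply yields queued
numbers. Instead of jumping back to the top as soon as one source is
ready, the loop only records that progress was made and keeps going.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for one element of the event-loop
/// (e.g. traffic, timers or portal messages).
pub struct Source {
    pub name: &'static str,
    pub pending: VecDeque<u32>,
}

impl Source {
    /// Returns `Some(item)` if the source is ready, `None` if it is pending.
    fn poll(&mut self) -> Option<u32> {
        self.pending.pop_front()
    }
}

/// Polls every source once per iteration and returns the order in which
/// items were handled.
pub fn run(sources: &mut [Source]) -> Vec<(&'static str, u32)> {
    let mut handled = Vec::new();

    loop {
        // Only remember that something was ready; don't `continue` early.
        let mut ready = false;

        for source in sources.iter_mut() {
            if let Some(item) = source.poll() {
                handled.push((source.name, item));
                ready = true;
            }
        }

        if !ready {
            // Everything is pending; a real event-loop would return
            // `Poll::Pending` here.
            return handled;
        }
    }
}
```

With a busy "traffic" source and a single queued "portal" item, the
portal item is handled in the first pass rather than after all traffic.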

Related: #8165
Related: #8176
2025-02-18 12:00:32 +00:00
Thomas Eizinger
c9b9fb0e6c feat(relay): add SOFTWARE attribute (#8076)
Adding a `SOFTWARE` attribute is recommended by the spec and will allow
us to identify from client logs which version of the relay we are
talking to.
2025-02-11 03:34:38 +00:00
Thomas Eizinger
5b236408b8 chore(relay): log warn if we can't authenticate error response (#8073)
There should be a `Username` attribute in every request that is worth
sending an error back for; if there isn't, we have a bug somewhere.

Related: https://firezone-inc.sentry.io/issues/6275631126/.
2025-02-10 22:00:23 +00:00
Thomas Eizinger
e3e6634790 chore: make all Rust code compile on Windows (#8036)
Developing on Windows is much easier if all Rust code compiles without
errors or warnings because you can "trust" your IDE that your code is
error free if it says "0 errors; 0 warnings". We are not far off from
achieving this!

Apart from the "graceful termination" feature in the relay, both the
relay and gateway should actually also work on Windows just fine, thanks
to the platform-agnostic abstractions we have been building up for the
GUI and headless client.
2025-02-06 14:25:10 +00:00
Thomas Eizinger
d2e9b09874 refactor(rust): stringify errors early (#8033)
As it turns out, the effort in #7104 was not a good idea. By logging
errors as values, most of our Sentry reports all have the same title and
thus cannot be differentiated from within the overview at all. To fix
this, we stringify errors with all their sources whenever they get
logged. This ensures log messages are unique and all Sentry issues will
have a useful title.
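
As a rough illustration of stringifying an error with all its sources,
the sketch below walks the `std::error::Error::source` chain. The names
`err_with_sources` and `PortalError` are made up for this example and
are not taken from the codebase.

```rust
use std::error::Error;
use std::fmt;

/// Formats an error together with all of its sources, separated by ": ".
pub fn err_with_sources(e: &(dyn Error + 'static)) -> String {
    let mut out = e.to_string();
    let mut source = e.source();

    while let Some(s) = source {
        out.push_str(": ");
        out.push_str(&s.to_string());
        source = s.source();
    }

    out
}

/// Hypothetical error type, used only for this illustration.
#[derive(Debug)]
pub struct PortalError {
    pub source: std::io::Error,
}

impl fmt::Display for PortalError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "portal connection failed")
    }
}

impl Error for PortalError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&self.source)
    }
}
```

Logging the returned string instead of the error value means two
failures with different root causes produce different messages.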
2025-02-06 14:18:35 +00:00
Thomas Eizinger
b34af41eb0 feat(relay): remove standalone mode (#7701)
Previously, it was possible to use the Firezone relay in "standalone"
mode where it would not attempt to connect to a portal. A long time ago,
this mode was introduced in order for us to test the TURN compatibility
of the relay with non-Firezone TURN clients. These tests have long been
removed and thus the mode is no longer required.

The positive side-effect of this is that we can make the
`FIREZONE_API_URL` a mandatory parameter and thus direct self-hosted
users towards setting this to the endpoint of their self-hosted portal.
2025-01-08 19:26:19 +00:00
Thomas Eizinger
e499d3e856 feat(relay): make telemetry opt-in (#7697)
Currently, telemetry via Sentry in our relay code is opt-out but won't
actually activate for a portal instance that isn't our staging or
production environment. However, this isn't enough to prevent alerts
from relay instances that aren't ours. It turns out that some
self-hosted customers don't realise that they have to change the portal
URL to their self-hosted portal. Without changing that, the relay will
attempt to authenticate to our production portal with an unknown token
and error out with a 401, logging a false-positive to Sentry.
2025-01-08 15:12:52 +00:00
Thomas Eizinger
5b2d7f1adf fix(relay): don't warn when running in standalone mode (#7573)
2024-12-23 13:17:01 +00:00
Thomas Eizinger
7df4389fa6 refactor(relay): avoid stringifying error early (#7553)
When the portal connection in a relay fails, we currently stringify the
error early. This is unnecessary and we should instead retain the full
error chain for as long as possible.
2024-12-18 18:13:55 +00:00
Thomas Eizinger
8e0f00a3a6 fix(relay): buffer packets in case IO is busy (#7536)
At present, the relay's event-loop simply drops a UDP packet in case the
socket is not ready for writing. This is terrible for throughput because
it means the encapsulated packet within the WG payload needs to be
retransmitted by the source after a timeout. To avoid this, we instead
buffer the packet and suspend the event loop until it has been correctly
flushed out. This may still cause packet loss because the receive buffer
may overflow in the meantime. However, there is nothing we can do about
that because UDP itself doesn't have any backpressure.

The relay listens on many sockets at once via a separate worker thread
and an `mio` event-loop. In addition to the current subscription to
readable events, we now also subscribe to writable events.

At the very top of the relay's event-loop, we insert a `flush` function
that ensures all buffered packets have been written out and - in case
writing a packet fails - suspends the event-loop with a waker. If we
receive a new event for write-readiness, we wake the waker which will
trigger a new call to `Eventloop::poll` where we again try to flush the
pending packet. We don't bother tracking exactly which socket reported
write-readiness and which sockets still have pending packets.
Instead, we suspend the entire event-loop until all pending packets have
been flushed.
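
The buffering idea can be sketched without `mio` as follows. This is a
simplified model under stated assumptions: the `Socket` trait, the
`Eventloop` shape and the `FakeSocket` test double are all hypothetical,
and waker handling is reduced to a boolean return value.

```rust
use std::collections::VecDeque;
use std::io;

/// Hypothetical abstraction over a non-blocking UDP socket.
pub trait Socket {
    fn try_send(&mut self, packet: &[u8]) -> io::Result<()>;
}

pub struct Eventloop<S> {
    pub socket: S,
    pending: VecDeque<Vec<u8>>,
}

impl<S: Socket> Eventloop<S> {
    pub fn new(socket: S) -> Self {
        Self { socket, pending: VecDeque::new() }
    }

    /// Returns `Ok(true)` once all buffered packets are written and
    /// `Ok(false)` if the socket is busy (the caller should suspend).
    pub fn flush(&mut self) -> io::Result<bool> {
        while let Some(packet) = self.pending.front() {
            match self.socket.try_send(packet) {
                Ok(()) => {
                    self.pending.pop_front();
                }
                Err(e) if e.kind() == io::ErrorKind::WouldBlock => return Ok(false),
                Err(e) => return Err(e),
            }
        }

        Ok(true)
    }

    /// Buffers the packet instead of dropping it when the socket is busy.
    pub fn send(&mut self, packet: Vec<u8>) -> io::Result<bool> {
        self.pending.push_back(packet);
        self.flush()
    }
}

/// Test double: accepts `capacity` packets in total, then reports `WouldBlock`.
pub struct FakeSocket {
    pub capacity: usize,
    pub sent: Vec<Vec<u8>>,
}

impl Socket for FakeSocket {
    fn try_send(&mut self, packet: &[u8]) -> io::Result<()> {
        if self.sent.len() >= self.capacity {
            return Err(io::ErrorKind::WouldBlock.into());
        }
        self.sent.push(packet.to_vec());
        Ok(())
    }
}
```

In this model, a write-readiness event corresponds to calling `flush`
again; nothing else runs until it returns `Ok(true)`.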

Resolves: #7519.
2024-12-18 17:01:24 +00:00
Thomas Eizinger
48857d3bc8 chore(relay): downgrade allocation mismatch warn on CHANNEL_BIND (#7505)
This code-path is handled gracefully in `connlib`, no need to issue a
warning here.
2024-12-13 05:41:28 +00:00
Thomas Eizinger
73625e4669 chore(relay): don't log all AUTH errors on WARN (#7506)
Not all authentication errors are warnings that we need to be alerted
about.
2024-12-13 05:37:15 +00:00
Thomas Eizinger
da04924da1 chore(relay): downgrade log on missing allocation for REFRESH (#7490)
Attempting to refresh an allocation is the only idempotent way in TURN
to test whether one has an active allocation. As such, logging this on
WARN is too aggressive.

Resolves: #7481.
2024-12-12 16:48:02 +00:00
Thomas Eizinger
d06bdaac91 chore(relay): don't warn on existing allocation (#7415)
A client may have lost its state and therefore "probe" the relay whether
or not it still has an allocation. If it does, it will react to the
error, delete it and make a new one. This is no reason to print a
warning on the relay side.
2024-12-02 01:08:58 +00:00
Thomas Eizinger
932f6791fb fix(phoenix-channel): lazily create backoff timer (#7414)
Our `phoenix-channel` component is responsible for maintaining a
WebSocket connection to the portal. In case that connection fails, we
want to reconnect to it using an exponential backoff, eventually giving
up after a certain amount of time.

Unfortunately, the code we have today doesn't quite do that. An
`ExponentialBackoff` has a setting for the `max_elapsed_time`.
Regardless of how many and how often we retry something, we won't ever
wait longer than this amount of time. For the Relay, this is set to
15 min. For other components it's indefinite (Gateway, headless-client),
or very long (30 days for Android, 1 day for Apple).

The point in time from which this duration is counted is when the
`ExponentialBackoff` is **constructed** which translates to when we
**first** connected to the portal. As a result, our backoff would
immediately fail on the first error if it has been longer than
`max_elapsed_time` since we first connected. For most components, this
codepath is not relevant because the `max_elapsed_time` is so long. For
the Relay however, that is only 15 minutes so chances are, the Relay
would immediately fail (and get rebooted) on the first connection error
with the portal.

To fix this, we now lazily create the `ExponentialBackoff` on the first
error.
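
To make the difference concrete, here is a small model of the eager and
lazy variants. `Backoff` and `Connection` are hypothetical types written
for this example; they are not the API of the backoff library used in
the codebase, and time is passed in explicitly to keep the sketch
deterministic.

```rust
use std::time::{Duration, Instant};

/// Hypothetical, simplified exponential backoff.
pub struct Backoff {
    started_at: Instant,
    max_elapsed_time: Duration,
    next_delay: Duration,
}

impl Backoff {
    pub fn new(now: Instant, max_elapsed_time: Duration) -> Self {
        Self {
            started_at: now,
            max_elapsed_time,
            next_delay: Duration::from_secs(1),
        }
    }

    /// Returns the next delay, or `None` once `max_elapsed_time` has
    /// passed since *construction*.
    pub fn next_backoff(&mut self, now: Instant) -> Option<Duration> {
        if now.duration_since(self.started_at) > self.max_elapsed_time {
            return None;
        }

        let delay = self.next_delay;
        self.next_delay *= 2;
        Some(delay)
    }
}

/// Hypothetical connection state that creates the backoff lazily.
pub struct Connection {
    max_elapsed_time: Duration,
    backoff: Option<Backoff>, // `None` while connected
}

impl Connection {
    pub fn new(max_elapsed_time: Duration) -> Self {
        Self { max_elapsed_time, backoff: None }
    }

    /// The clock for `max_elapsed_time` starts at the first error.
    pub fn on_error(&mut self, now: Instant) -> Option<Duration> {
        let max = self.max_elapsed_time;
        self.backoff
            .get_or_insert_with(|| Backoff::new(now, max))
            .next_backoff(now)
    }

    pub fn on_connected(&mut self) {
        self.backoff = None;
    }
}
```

A `Backoff` constructed at connect time gives up on the first error an
hour later, whereas `Connection::on_error` still yields a delay.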

This bug has some interesting consequences: When a relay reboots, it
loses all its state, i.e. allocations, channel bindings, available
nonces, stamp-secret etc. Thus, all credentials and state that got
distributed to Clients and Gateways get invalidated, causing disconnects
from the Relay. We have observed these alerts in Sentry for a while and
couldn't explain them. Most likely, this is the root cause for those
because whilst a Relay disconnects, the portal also cannot detect its
presence and pro-actively inform Clients and Gateways to no longer use
this Relay.
2024-11-29 20:19:11 +00:00
Thomas Eizinger
bea8393248 fix(relay): reduce number of warnings (#7411)
With this PR, we reduce some of the warnings emitted by the relay. If we
can only partially fulfill an allocation, we now only emit a warning.

Similarly, if we receive a repeated SIGTERM signal, we shut down
successfully (i.e. exit with code 0) instead of failing the event-loop.
During normal operation, we wait for all allocations to expire before we
shut down. On CI however, the relay gets shutdown much earlier so this
would generate unnecessary errors.

Receiving another SIGTERM is a user-initiated action so we shouldn't
fail as a result but instead just comply with it.
2024-11-28 23:20:10 +00:00
Thomas Eizinger
e91a076307 refactor(relay): improve error messages on failed requests (#7405)
Some house-keeping that should make debugging issues around
relay-disconnects easier.
2024-11-28 22:12:27 +00:00
Thomas Eizinger
973a806707 feat(relay): add Sentry crash reporting (#7406)
In addition to monitoring clients and gateways, it is also useful to
monitor relays in the same way. This gives us alerts on ERROR and WARN
messages logged by the relay as well as panics.
2024-11-28 21:53:21 +00:00
Thomas Eizinger
44c1b453f7 chore(relay): document authentication scheme (#7388)
Follow-up from #7378 to answer some of the questions.

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
2024-11-21 20:12:31 +00:00
Thomas Eizinger
56db250e2c feat(connlib): validate integrity of all relay responses (#7378)
In order to avoid processing of responses of relays that somehow got
altered on the network path, we now use the client's `password` as a
shared secret for the relay to also authenticate its responses. This
means that not all messages can be authenticated. In particular, BINDING
requests will still be unauthenticated.

Performing this validation now requires every component that crafts
input to the `Allocation` to include a valid `MessageIntegrity`
attribute. This is somewhat problematic for the regression tests of the
relay and the unit tests of `Allocation`. In both cases, we implement
workarounds so we don't have to actually compute a valid
`MessageIntegrity`. This is deemed acceptable because:

- Both of these are just tests.
- We do test the validation path using `tunnel_test` because there we
run an actual relay.
2024-11-19 18:32:33 +00:00
Thomas Eizinger
48ba2869a8 chore(rust): ban the use of .unwrap except in tests (#7319)
Using the clippy lint `unwrap_used`, we can automatically lint against
all uses of `.unwrap()` on `Result` and `Option`. This turns up quite a
few results actually. In most cases, they are invariants that can't
actually be hit. For these, we change them to `Option`. In other cases,
they can actually be hit. For example, if the user supplies an invalid
log-filter.

Activating this lint ensures the compiler will yell at us every time we
use `.unwrap` to double-check whether we do indeed want to panic here.

Resolves: #7292.
2024-11-13 03:59:22 +00:00
Thomas Eizinger
73eebd2c4d refactor(rust): consistently record errors as tracing::Value (#7104)
Our logging library, `tracing`, supports structured logging. This is
useful because it preserves more than just the string representation
of a value and thus allows the active logging backend(s) to capture more
information for a particular value.

In the case of errors, this is especially useful because it allows us to
capture the sources of a particular error.

Unfortunately, recording an error as a tracing value is a bit cumbersome
because `tracing::Value` is only implemented for `&dyn
std::error::Error`. Casting an error to this is quite verbose. To make
it easier, we introduce two utility functions in `firezone-logging`:

- `std_dyn_err`
- `anyhow_dyn_err`

Tracking errors as correct `tracing::Value`s will be especially helpful
once we enable Sentry's `tracing` integration:
https://docs.rs/sentry-tracing/latest/sentry_tracing/#tracking-errors
2024-10-22 04:46:26 +00:00
Thomas Eizinger
2d4818e007 refactor(connlib): rotate tunnel private key on reset (#6909)
With the new control protocol specified in #6461, the client will no
longer initiate new connections. Instead, the credentials are generated
deterministically by the portal based on the gateway's and the client's
public key. For as long as they use the same public key, they also have
the same in-memory state which makes creating connections idempotent.

What we didn't consider in the new design at first is that when clients
roam, they discard all connections but keep the same private key. As a
result, the portal would generate the same ICE credentials which means
the gateway thinks it can reuse the existing connection when new flows
get authorized. The client however discarded all connections (and
rotated its ports and maybe IPs), meaning the previous candidates sent
to the gateway are no longer valid and connectivity fails.

We fix this by also rotating the private keys upon reset. Rotating the
keys itself isn't enough, we also need to propagate the new public key
all the way "over" to the phoenix channel component which lives
separately from connlib's data plane.

To achieve this, we change `PhoenixChannel` to now start in the
"disconnected" state and require an explicit `connect` call. In
addition, the `LoginUrl` constructed by various components now acts
merely as a "prototype", which may require additional data to construct
a fully valid URL. In the case of client and gateway, this is the public
key of the `Node`. This additional parameter needs to be passed to
`PhoenixChannel` in the `connect` call, thus forming a type-safe
contract that ensures we never attempt to connect without providing a
public key.

For the relay, this doesn't apply.

Lastly, this allows us to tidy up the code a bit by:

a) generating the `Node`'s private key from the existing RNG
b) removing `ConnectArgs` which only had two members left

Related: #6461.
Related: #6732.
2024-10-07 22:28:51 +00:00
Thomas Eizinger
896fe49f1f fix(relay): set better OTEL metadata (#6322)
Previously, the `service.name` attribute got overridden with "unknown
service" from the detector used in `Resource::default`. To avoid this,
we are now manually composing the two other detectors.

This gives us a useful set of default labels from within the code yet it
allows overriding all of them using `OTEL_RESOURCE_ATTRIBUTES`.
2024-08-16 23:17:10 +00:00
Thomas Eizinger
3b56664e02 test(rust): ensure deterministic proptests (#6319)
For quite a while now, we have been making extensive use of
property-based testing to ensure `connlib` works as intended. The idea
of proptests is that - given a certain seed - we deterministically
sample test inputs and assert properties on a given function.

If the test fails, `proptest` prints the seed which can then be added to
a regressions file to iterate on the test case and fix it. It is quite
obvious that non-determinism in how the test input gets generated is no
bueno and reduces the value we get out of these tests a fair bit.

The `HashMap` and `HashSet` data structures are known to be
non-deterministic in their iteration order. This causes non-determinism
during the input generation because we make use of a lot of maps and
sets to gradually build up the test input. We fix all uses of `HashMap`
and `HashSet` by replacing them with `BTreeMap` and `BTreeSet`.
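
A tiny example of why the swap helps; `sampled_order` is a hypothetical
helper, not a function from `tunnel_test`. Iterating a `BTreeMap` visits
entries in key order, so the result depends only on the contents and not
on insertion order or a per-process hash seed.

```rust
use std::collections::BTreeMap;

/// Collects the values of a map in iteration order. With `BTreeMap`,
/// that order is the sorted order of the keys.
pub fn sampled_order(entries: &[(u32, &'static str)]) -> Vec<&'static str> {
    let map: BTreeMap<u32, &'static str> = entries.iter().copied().collect();

    map.into_values().collect()
}
```

Feeding the same entries in a different order yields the same output,
which is the property the generated test input relies on.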

To ensure this doesn't regress, we refactor `tunnel_test` to not make
use of proptest's macros and instead, we initialise and run the test
ourselves. This allows us to dump the sampled state and transitions into
a file per test run. In CI, we then run a 2nd iteration of all
regression tests and compare the sampled state and transitions with the
previous run. They must match byte-for-byte.

Finally, to discourage use of non-deterministic iteration, we ban the
use of the iteration functions on `HashMap` and `HashSet` across the
codebase. This doesn't catch iteration in a `for`-loop but it is better
than not linting against it at all.

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Reactor Scram <ReactorScram@users.noreply.github.com>
2024-08-16 23:15:58 +00:00
Thomas Eizinger
4750d76fce fix(relay): re-insert channel into fast-path map (#6332)
This is a test-failure detected in
https://github.com/firezone/firezone/actions/runs/10426492110/job/28879531621.

In the relay, we have fast-path lookup maps for incoming traffic from
peers. This improves throughput as any incoming packet only needs to
look-up a single routing entry. Unfortunately, this creates duplication
in how the data must be stored.

In #6276, we correctly identified that channels must be re-bound on the
relay when a client sends a `CHANNEL_BIND` message whilst the channel is
cooling down. What we failed to identify (and what was now caught by the
tests) is that we also need to re-insert the entry into the fast-path
lookup map to actually allow data to flow through the channel.
2024-08-16 23:14:00 +00:00
Thomas Eizinger
d399e65246 build(deps): bump tokio-tungstenite to 0.23 (#5509)
With the upgrade to 0.23, `tokio-tungstenite` pulls in `rustls` 0.23
which supports multiple crypto providers. By default, this uses the
`aws-lc-rs` provider. The previous default was `ring`.

This PR bumps the necessary versions and installs the `ring` crypto
provider at the beginning of each application, before connlib starts. We
try and do this as early as possible to make it obvious that it only
needs to happen once per process.

Resolves: #5380.
2024-08-15 06:02:17 +00:00
Thomas Eizinger
272e4b2bcd feat(snownet,relay): include sticky session ID in STUN requests (#6278)
For most cases, TURN identifies clients by their 3-tuple. This can make
it hard to correlate logs in case the client roams or its NAT session
gets reset, both of which cause the port to change.

To make problem analysis easier, we include the RFC-recommended
`SOFTWARE` attribute in all STUN requests created by `snownet`.
Typically, this includes a textual description of who sent the request
and a version number. See [0] for details. We don't track the version of
`snownet` individually and passing the actual client-version across this
many layers is deemed too complicated for now.

What we can add though is a parameter that includes a sticky session ID.
This session ID is computed based on the `Node`'s public key, meaning it
doesn't change until the user logs out and in again.

On the relay, we now look for a `SOFTWARE` attribute in all STUN
requests and optionally include it in all spans if it is present.

[0]: https://datatracker.ietf.org/doc/html/rfc5389#section-15.10
2024-08-15 03:10:56 +00:00
Thomas Eizinger
55c97acfc3 feat(relay): record error code as label in response counter metric (#6274)
This will allow us to write queries and thus alerts for increased number
of error responses such as `Allocation Mismatch`.

When attaching labels to metrics, it is important to avoid cardinality
explosions. Thus, the possible label values should always be a fixed,
bounded set of values. The possible error codes could be quite a few but
in practice, we only use a handful and clients cannot influence which
error codes we send. Thus, it is safe to create labels for these codes.

The same would not be true for IP addresses or ports for example.
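
One way to picture a bounded label set is to key the counter by an enum
rather than by free-form data. The sketch below is an assumption-laden
model: `ErrorCode` and `ResponseCounter` are hypothetical, the selection
of codes is illustrative, and no real metrics library is involved.

```rust
use std::collections::BTreeMap;

/// A fixed set of STUN/TURN error codes (illustrative selection).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum ErrorCode {
    BadRequest = 400,
    Unauthorized = 401,
    AllocationMismatch = 437,
    StaleNonce = 438,
}

/// Hypothetical counter keyed by a bounded label; `None` stands for a
/// success response.
#[derive(Default)]
pub struct ResponseCounter {
    counts: BTreeMap<Option<ErrorCode>, u64>,
}

impl ResponseCounter {
    pub fn record(&mut self, error: Option<ErrorCode>) {
        *self.counts.entry(error).or_insert(0) += 1;
    }

    pub fn get(&self, error: Option<ErrorCode>) -> u64 {
        self.counts.get(&error).copied().unwrap_or(0)
    }

    /// Can never exceed the number of enum variants plus one.
    pub fn label_count(&self) -> usize {
        self.counts.len()
    }
}
```

Keying the same counter by `SocketAddr` would let every new client add
a label value, which is the cardinality explosion to avoid.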
2024-08-13 22:17:21 +00:00
Thomas Eizinger
6e86a4dcba fix(snownet,relay): re-use channels to peers in cooldown period (#6276)
For efficiency reasons, TURN's data channels don't have any
authentication or integrity metadata. Instead, they operate using a short
2-byte channel number to identify the target peer of the data.

To avoid abuse, channel bindings are at most valid for 10 minutes before
they need to be refreshed. In case they expire, there is a 5 minute
cooldown period, before the same channel number can be bound to a
different peer and before the same peer can be bound to a different
channel.

We had a similar issue in the past (#5613) where channels got rebound
early. Whilst that was fixed and is no longer happening, a case that we
didn't consider is what happens if we want to bind a channel to a peer
that still has a channel bound but is currently cooling down (i.e. in
the 5 minute period after its expiry).

In that case, `snownet` would wrongly assume that there is no channel to
this peer and try to bind a new one. That would get rejected by the
relay with a bad request.

To fix this, we simply need to check whether we still have a channel to
this peer and if yes, return the same channel number. On the relay, we
need to ensure that we consider a channel as `bound` again when it is
being refreshed.
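
The client-side half of the fix can be modelled roughly like this. The
struct below only borrows the name `ChannelBindings`; its fields, the
`channel_for` method and the constants are assumptions made for this
sketch, and it never reclaims channel numbers.

```rust
use std::collections::BTreeMap;
use std::net::SocketAddr;
use std::time::{Duration, Instant};

const CHANNEL_LIFETIME: Duration = Duration::from_secs(10 * 60);
const CHANNEL_COOLDOWN: Duration = Duration::from_secs(5 * 60);

/// Simplified model of the peer-to-channel mapping.
pub struct ChannelBindings {
    next_channel: u16,
    /// Channel number and time of the last (re)bind per peer.
    by_peer: BTreeMap<SocketAddr, (u16, Instant)>,
}

impl ChannelBindings {
    pub fn new() -> Self {
        // TURN channel numbers start at 0x4000.
        Self { next_channel: 0x4000, by_peer: BTreeMap::new() }
    }

    /// Returns the channel number to use when binding a channel to `peer`.
    pub fn channel_for(&mut self, peer: SocketAddr, now: Instant) -> u16 {
        // Forget bindings whose lifetime *and* cooldown have passed.
        self.by_peer.retain(|_, (_, bound_at)| {
            now.duration_since(*bound_at) < CHANNEL_LIFETIME + CHANNEL_COOLDOWN
        });

        // Still bound or cooling down: re-use the same channel number.
        if let Some((channel, bound_at)) = self.by_peer.get_mut(&peer) {
            *bound_at = now;
            return *channel;
        }

        let channel = self.next_channel;
        self.next_channel += 1;
        self.by_peer.insert(peer, (channel, now));
        channel
    }
}
```

Twelve minutes after the first bind, the peer's channel has expired but
is still cooling down, so the same number is returned instead of a new
one that the relay would reject.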

We ensure that this doesn't regress in two ways:

- We add a unit-test for the `ChannelBindings` struct
- We modify the `Idle` transition to idle for 6 instead of 5 minutes.
This ensures that a combination of 2 idle transitions puts the channel
bindings into the 10-15 minute time window where rebinding the peer to a
different channel fails.

Related: #6265.
2024-08-13 17:01:13 +00:00
Thomas Eizinger
0abbf6bba9 refactor(rust): inline http-health-check crate into bin-shared (#6258)
Now that we have the `bin-shared` crate, it is easy to move the
health-check functionality into there. That allows us to get rid of a
crate which makes navigating the workspace a bit easier.
2024-08-12 16:44:52 +00:00
Thomas Eizinger
93d678aaea feat(relay): set OTEL metadata for metrics and traces (#6249)
I recently discovered that the metrics reporting to Google Cloud Metrics
for the relays is actually working. Unfortunately, they are all bucketed
together because we don't set the metadata correctly.

This PR aims to fix that by setting some useful default metadata for
traces and metrics and additionally, discovers instance ID and name
from GCE metadata.

Related: #2033.
2024-08-10 16:32:01 +00:00
Thomas Eizinger
bed625a312 chore(rust): make logging more ergonomic (#6237)
Setting up a logger is something that pretty much every entrypoint needs
to do, be it a test, a shared library embedded in another app or a
standalone application. Thus, it makes sense to introduce a dedicated
crate that bundles everything about how we want to do logging in one
place.

This allows us to introduce convenience functions like
`firezone_logging::test` which allow you to construct a logger for a
test as a one-liner.

Crucially though, introducing `firezone-logging` gives us a place to
store a default log directive that silences very noisy crates. When
looking into a problem, it is common to start by simply setting the
log-filter to `debug`. Without further action, this floods the output
with logs from crates like `netlink_proto` on Linux. It is very unlikely
that those are the logs that you want to see. Without a preset filter,
the only alternative here is to explicitly turn off the log filter for
`netlink_proto` by typing something like
`RUST_LOG=netlink_proto=off,debug`. Especially when debugging issues
with customers, this is annoying.

Log filters can be overridden, i.e. a 2nd filter that matches the exact
same scope overrides a previous one. Thus, with this design it is still
possible to activate certain logs at runtime, even if they have been silenced
by default.

I'd expect `firezone-logging` to attract more functionality in the
future. For example, we want to support re-loading of log-filters on
other platforms. Additionally, where logs get stored could also be
defined in this crate.

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Reactor Scram <ReactorScram@users.noreply.github.com>
2024-08-10 05:17:03 +00:00
Thomas Eizinger
f800875aff fix(relay): don't hang when connecting to OTLP exporter (#6034)
The dependency update in #6003 introduced a regression: Connecting to
the OTLP exporter was hanging forever and thus the relay failed to start
up.

The hang seems to be related to _dropping_ the `meter_provider`. Looking
at the changelog update, this change was actually called out:
https://github.com/open-telemetry/opentelemetry-rust/blob/main/opentelemetry-otlp/CHANGELOG.md#v0170.

By setting these providers globally, the relay starts up just fine.

To ensure this doesn't regress again, we add an OTEL collector to our
`docker-compose.yml` and configure the `relay-1` to connect to it.
2024-07-25 10:36:42 -06:00
Thomas Eizinger
782b171cc1 chore(relay): always log setup on trace (#6031)
In staging and production, setting up the logger for the relay is a
fairly complicated setup. To make debugging easier, we always log these
initial steps on `TRACE` level until the real logger is initialised.
2024-07-25 03:48:52 +00:00
dependabot[bot]
dae90d81e1 build(deps): bump opentelemetry dependencies (#6003)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
2024-07-24 17:45:42 +00:00
Thomas Eizinger
da52c66023 refactor(clients): init PhoenixChannel in upper layers (#5884)
This represents a step towards #3837. Eventually, we'd like the
abstractions of `Session` and `Eventloop` to go away entirely. For that,
we need to thin them out.

The introduction of `ConnectArgs` was already a hint that we are passing
a lot of data across layers that we shouldn't. To avoid that, we can
simply initialise `PhoenixChannel` earlier and thus each callsite can
specify the desired configuration directly.

I've left `ConnectArgs` intact to keep the diff small.
2024-07-18 02:08:38 +00:00
Thomas Eizinger
aa279d7731 ci: never tolerate warnings in Rust code (#5893)
Our Rust CI runs various jobs in different configurations of packages
and / or features. Currently, only the clippy job denies warnings which
makes it possible that some code still generates warnings under
particular configurations.

To ensure we always fail on warnings, we set a global env var to deny
warnings for all Rust CI jobs.

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Reactor Scram <ReactorScram@users.noreply.github.com>
2024-07-17 22:22:12 +00:00
Gabi
5b0aaa6f81 fix(connlib): protect all sockets from routing loops (#5797)
Currently, only connlib's UDP sockets for sending and receiving STUN &
WireGuard traffic are protected from routing loops. This was done via
the `Sockets::with_protect` function. Connlib has additional sockets
though:

- A TCP socket to the portal.
- UDP & TCP sockets for DNS resolution via hickory.

Both of these can incur routing loops on certain platforms which becomes
evident as we try to implement #2667.

To fix this, we generalise the idea of "protecting" a socket via a
`SocketFactory` abstraction. By allowing the different platforms to
provide a specialised `SocketFactory`, anything Linux-based can give
special treatment to the socket before handing it to connlib.

As an additional benefit, this allows us to remove the `Sockets`
abstraction from connlib's API again because we can now initialise it
internally via the provided `SocketFactory` for UDP sockets.

---------

Signed-off-by: Gabi <gabrielalejandro7@gmail.com>
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
2024-07-16 00:40:05 +00:00
Thomas Eizinger
8ec6a809a1 refactor(relay): use RangeInclusive to specify available ports (#5820)
2024-07-11 06:26:21 +00:00
Thomas Eizinger
0c2648dae2 test(connlib): correctly scope state within tunnel_test (#5809)
Currently, the type hierarchy within `tunnel_test` is already quite
nested: We have a `Host` that wraps a `SimNode` which wraps a
`ClientState` or `GatewayState`. Additionally, a lot of state that is
actually _per_ client or _per_ gateway is tracked in the root of
`ReferenceState` and `TunnelTest`. That makes it difficult to introduce
multiple gateways / clients to this test.

To fix this, we introduce dedicated `RefClient` and `RefGateway` states.
Those track the expected state of a particular client / gateway.
Similarly, we introduce dedicated `SimClient` and `SimGateway` structs
that track the simulation state by wrapping the corresponding
system-under-test: `ClientState` and `GatewayState`.

This ends up moving a lot of code around but has the great benefit that
all the state is now scoped to a particular instance of a client or a
gateway, paving the way for creating multiple clients & gateways in a
single test.
2024-07-10 23:22:19 +00:00
Thomas Eizinger
9caca475dc test(connlib): introduce routing table to tunnel_test (#5786)
Currently, `tunnel_test` uses a rather naive approach when dispatching
`Transmit`s. In particular, it checks client, gateway and relay
separately whether they "want" a certain packet. In a real network,
these packets are routed based on their IP.

To mimic something similar, we introduce a `Host` abstraction that wraps
each component: client, gateway and relay. Additionally, we introduce a
`RoutingTable` where we can add and remove hosts. With these things in
place, routing a `Transmit` is as easy as looking up the destination IP
in the routing table and dispatching to the corresponding host.

Our hosts are type-safe: client, gateway and relay have different types.
Thus, we abstract over them using a `HostId` in order to know which
host a certain message is for. Following these patches, we can easily
introduce multiple gateways and relays to this test by simply making
more entries in this routing table. This will increase the test coverage
of connlib.
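
A stripped-down version of such a table might look as follows. Only the
names `RoutingTable` and `HostId` come from the description above; the
enum variants, the method names and the use of a `BTreeMap` are
assumptions for the sake of a self-contained example.

```rust
use std::collections::BTreeMap;
use std::net::IpAddr;

/// Identifies which kind of host (and which instance) an IP belongs to.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HostId {
    Client(u64),
    Gateway(u64),
    Relay(u64),
}

#[derive(Default)]
pub struct RoutingTable {
    routes: BTreeMap<IpAddr, HostId>,
}

impl RoutingTable {
    /// Returns `false` without changing the table if the IP is taken.
    pub fn add_host(&mut self, ip: IpAddr, host: HostId) -> bool {
        if self.routes.contains_key(&ip) {
            return false;
        }

        self.routes.insert(ip, host);
        true
    }

    pub fn remove_host(&mut self, ip: IpAddr) -> Option<HostId> {
        self.routes.remove(&ip)
    }

    /// Routing a packet is a lookup of its destination IP.
    pub fn host_by_ip(&self, dst: IpAddr) -> Option<HostId> {
        self.routes.get(&dst).copied()
    }
}
```

Adding a second gateway or relay to a test then amounts to one more
`add_host` call with a fresh IP.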

Lastly, this patch massively increases the performance of `tunnel_test`.
It turns out that previously, we spent a lot of CPU cycles accessing
"random" IPs from very large iterators. With this patch, we take a
limited range of 100 IPs that we sample from, thus drastically
increasing performance of this test. The configured 1000 testcases
execute in 3s on my machine now (with opt-level 1 which is what we use
in CI).

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
2024-07-09 01:48:54 +00:00
Thomas Eizinger
28d5b8574c chore(connlib): minor logging tweaks (#5746)
Noticed a few things that caused unnecessary verbosity in the logs.
2024-07-05 14:45:32 +00:00
Thomas Eizinger
b5fd980fb2 fix(relay): don't log all request failures on the same level (#5622)
Currently, the relay logs all failed requests on WARN. This is a bit
excessive because during normal operation, clients are expected to hit
several 401s due to stale or missing nonces.

In order to not flood the logs with these, we introduce a new type,
`ResponseErrorLevel` that represents the subset of `tracing::Level` that
`make_error_response` can log:

- `Warn`
- `Debug`

Both variants map to the variants in `tracing::Level` with the same
name, and the function will log accordingly.

So now the caller can pick what level of error is meant to be used and
reduce the noise on the logs when it's meant to be part of normal
operation.
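
A dependency-free sketch of the idea: the enum restricts callers to the
two permitted levels. To avoid pulling in `tracing`, the level is
rendered as a string here, and `log_line` with its signature is a
hypothetical stand-in for the logging done by `make_error_response`.

```rust
/// The subset of log levels a failed request may be reported on.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ResponseErrorLevel {
    Warn,
    Debug,
}

impl ResponseErrorLevel {
    // Plain-text stand-in for the `tracing::Level` of the same name.
    pub fn as_str(self) -> &'static str {
        match self {
            Self::Warn => "WARN",
            Self::Debug => "DEBUG",
        }
    }
}

/// Hypothetical helper: formats the log line for a failed request on
/// the level chosen by the caller.
pub fn log_line(level: ResponseErrorLevel, code: u16, reason: &str) -> String {
    format!("{} request failed: {} {}", level.as_str(), code, reason)
}
```

A caller handling an expected 401 would pass
`ResponseErrorLevel::Debug`, while an unexpected failure would use
`ResponseErrorLevel::Warn`.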

Fixes: #5490.

---------

Co-authored-by: conectado <gabrielalejandro7@gmail.com>
2024-06-29 02:38:55 +00:00
Thomas Eizinger
cf9f7504ce chore(relay): be more lenient with debug-assertions (#5367)
Some of the debug-assertions in the relay are a bit too strict.
Specifically, if an allocation times out because it is not refreshed, we
also clean-up all channel bindings associated with that allocation. Yet,
if an existing channel binding has already been removed earlier, it will
no longer be present in the respective map.

This isn't an issue at all. We can simply change the debug-assertion to
only compare what used to be present in the map. What really matters is
that the item we just removed does in fact point to the data that we are
expecting.

Related: #5355.
2024-06-14 06:07:15 +00:00
Thomas Eizinger
d27a7a3083 feat(relay): support custom turn port (#5208)
Original PR: #5130.

Co-authored-by: Antoine <antoinelabarussias@gmail.com>
2024-06-05 04:04:17 +00:00
Thomas Eizinger
92676f0f53 test(connlib): simulate IO in state machine tests (#4728)
This is similar to #4097 and #4585 but for the entire `ClientState` and
`GatewayState`. We also do it in the context of a property-based test
with the vision that we can deterministically explore a large space of
state transitions and see where our main property breaks: Being able to
send an ICMP packet from the client to the gateway.

In other words, we now correctly pass all the `Transmit`s back and forth
between the components as if they would receive it from the network. Due
to the nature of property-based tests, this already exercises a very
large input space. For example, if the client does not have an IPv6
socket and the gateway doesn't have an IPv4 socket, this test already
checks whether we then correctly fall back to using a relay (because the
allocation we make on the relay is the only network path where the STUN
requests pass through).

What this does not (yet) do is set up a proper network topology. The
`dispatch_transmit` function will happily "route" a `Transmit` from e.g.
the client to the gateway even if they are in different subnets. In
other words, these tests assume that the actual network itself works and
we can exchange UDP packets between the components.

For now, we only send ICMPs to CIDR resources. As a next step, we can
extend this to DNS resources by sending DNS queries for our DNS
resources and then sending an ICMP to the resolved IP.
2024-05-22 23:10:58 +00:00
Thomas Eizinger
99c600f558 chore(relay): allow domains in --otel-grpc-endpoint (#5059)
Replaces #4932.

---------

Co-authored-by: Antoine Labarussias <antoinelabarussias@gmail.com>
2024-05-22 01:43:17 +00:00
Thomas Eizinger
53c7bd8201 fix(relay): clear channel bindings when allocation is deleted (#4705)
As suspected, there was a bug in the relay where channel bindings were
not cleared if the client freed the allocation early by sending a
REFRESH request with a lifetime of 0.

Resolves: #4588.
2024-04-19 13:25:38 +00:00