firezone

mirror of https://github.com/outbackdingo/firezone.git synced 2026-03-21 19:41:58 +00:00

Author	SHA1	Message	Date
Thomas Eizinger	81da120c17	fix(phoenix-channel): report connection hiccups to upper layer (#8203 ) The WebSocket connection to the portal from within the Clients, Gateways and Relays may be temporarily interrupted by IO errors. In such cases we simply reconnect to it. This isn't as much of a problem for Clients and Gateways. For Relays however, a disconnect can be disruptive for customers because the portal will send `relays_presence` events to all Clients and Gateways. Any relayed connection will therefore be interrupted. See #8177. Relays run on our own infrastructure and we want to be notified if their connection flaps. In order to differentiate between these scenarios, we remove the logging from within `phoenix-channel` and report these connection hiccups one layer up. This allows Clients and Gateways to log them on DEBUG whereas the Relay can log them on WARN. Related: #8177 Related: #7004	2025-02-20 00:54:43 +00:00
Thomas Eizinger	3e4976e4ab	fix(relay): don't starve items further down in the event-loop (#8177 ) At present, the relay uses a priority in the event-loop that favors routing traffic. Whenever a task further up in the loop is `Poll::Ready`, we loop back to the top to continue processing. The issue with that is that in very busy times, this can lead to starvation in processing timers and messages from the portal. If we then finally get to process portal messages, we think that the portal hasn't replied in some time and proactively cut the connection and reconnect. As a result, the portal will send `relays_presence` messages to the clients and gateways which in turn will locally remove the relay. This breaks relayed connections. To fix this, instead of immediately traversing to the top of the event-loop with `continue`, we only set a boolean. This gives each element of the event-loop a chance to execute, even when a certain component is very busy. Related: #8165 Related: #8176	2025-02-18 12:00:32 +00:00
Thomas Eizinger	c9b9fb0e6c	feat(relay): add `SOFTWARE` attribute (#8076 ) Adding a `SOFTWARE` attribute is recommended by the spec and will allow us to identify from client logs, which version of the relay we are talking to.	2025-02-11 03:34:38 +00:00
Thomas Eizinger	5b236408b8	chore(relay): log warn if we can't authenticate error response (#8073 ) There should be a `Username` attribute in every request that is worth sending an error back, if there isn't we have a bug somewhere. Related: https://firezone-inc.sentry.io/issues/6275631126/.	2025-02-10 22:00:23 +00:00
Thomas Eizinger	e3e6634790	chore: make all Rust code compile on Windows (#8036 ) Developing on Windows is much easier if all Rust code compiles without errors or warnings because you can "trust" your IDE that your code is error free if it says "0 errors; 0 warnings". We are not far off from achieving this! Apart from the "graceful termination" feature in the relay, both the relay and gateway should actually also work on Windows just fine, thanks to the platform-agnostic abstractions we have been building up for the GUI and headless client.	2025-02-06 14:25:10 +00:00
Thomas Eizinger	d2e9b09874	refactor(rust): stringify errors early (#8033 ) As it turns out, the effort in #7104 was not a good idea. By logging errors as values, most of our Sentry reports all have the same title and thus cannot be differentiated from within the overview at all. To fix this, we stringify errors with all their sources whenever they got logged. This ensures log messages are unique and all Sentry issues will have a useful title.	2025-02-06 14:18:35 +00:00
Thomas Eizinger	b34af41eb0	feat(relay): remove standalone mode (#7701 ) Previously, it was possible to use the Firezone relay in "standalone" mode where it would not attempt to connect to a portal. A long time ago, this mode was introduced in order for us to test the TURN compatibility of the relay with non-Firezone TURN clients. These tests have long been removed and thus the mode is no longer required. The positive side-effect of this is that we can make the `FIREZONE_API_URL` a mandatory parameter and thus direct self-hosted users towards setting this to the endpoint of their self-hosted portal.	2025-01-08 19:26:19 +00:00
Thomas Eizinger	e499d3e856	feat(relay): make telemetry opt-in (#7697 ) Currently, telemetry via Sentry in our relay code is opt-out but won't actually activate for a portal instance that isn't our staging or production environment. However, this isn't enough to prevent alerts from relay instances that aren't ours. It turns out that some self-hosted customers don't realise that they have to change the portal URL to their self-hosted portal. Without changing that, the relay will attempt to authenticate to our production portal with an unknown token and error out with a 401, logging a false-positive to Sentry.	2025-01-08 15:12:52 +00:00
Thomas Eizinger	5b2d7f1adf	fix(relay): don't warn when running in standalone mode (#7573 )	2024-12-23 13:17:01 +00:00
Thomas Eizinger	7df4389fa6	refactor(relay): avoid stringifying error early (#7553 ) When the portal connection in a relay fails, we currently stringify the error early. This is unnecessary and we should instead retain the full error chain for as long as possible.	2024-12-18 18:13:55 +00:00
Thomas Eizinger	8e0f00a3a6	fix(relay): buffer packets in case IO is busy (#7536) At present, the relay's event-loop simply drops a UDP packet in case the socket is not ready for writing. This is terrible for throughput because it means the encapsulated packet within the WG payload needs to be retransmitted by the source after a timeout. To avoid this, we instead buffer the packet and suspend the event loop until it has been correctly flushed out. This may still cause packet loss because the receive buffer may overflow in the meantime. However, there is nothing we can do about that because UDP itself doesn't have any backpressure. The relay listens on many sockets at once via a separate worker thread and an `mio` event-loop. In addition to the current subscription to readable events, we now also subscribe to writable events. At the very top of the relay's event-loop, we insert a `flush` function that ensures all buffered packets have been written out and - in case writing a packet fails - suspends the event-loop with a waker. If we receive a new event for write-readiness, we wake the waker which will trigger a new call to `Eventloop::poll` where we again try to flush the pending packet. We don't bother with tracking exactly which socket signalled write-readiness and which sockets still have pending packets. Instead, we suspend the entire event-loop until all pending packets have been flushed. Resolves: #7519.	2024-12-18 17:01:24 +00:00
Thomas Eizinger	48857d3bc8	chore(relay): downgrade allocation mismatch warn on CHANNEL_BIND (#7505 ) This code-path is handled gracefully in `connlib`, no need to issue a warning here.	2024-12-13 05:41:28 +00:00
Thomas Eizinger	73625e4669	chore(relay): don't log all AUTH errors on WARN (#7506 ) Not all authentication errors are warnings that we need to be alerted about.	2024-12-13 05:37:15 +00:00
Thomas Eizinger	da04924da1	chore(relay): downgrade log on missing allocation for REFRESH (#7490 ) Attempting to refresh an allocation is the only idempotent way in TURN to test whether one has an active allocation. As such, logging this on WARN is too aggressive. Resolves: #7481.	2024-12-12 16:48:02 +00:00
Thomas Eizinger	d06bdaac91	chore(relay): don't warn on existing allocation (#7415) A client may have lost its state and therefore "probe" the relay whether or not it still has an allocation. If it does, it will react to the error, delete it and make a new one. This is no reason to print a warning on the relay side.	2024-12-02 01:08:58 +00:00
Thomas Eizinger	932f6791fb	fix(phoenix-channel): lazily create backoff timer (#7414) Our `phoenix-channel` component is responsible for maintaining a WebSocket connection to the portal. In case that connection fails, we want to reconnect to it using an exponential backoff, eventually giving up after a certain amount of time. Unfortunately, the code we have today doesn't quite do that. An `ExponentialBackoff` has a setting for the `max_elapsed_time`. Regardless of how many and how often we retry something, we won't ever wait longer than this amount of time. For the Relay, this is set to 15min. For other components it is indefinite (Gateway, headless-client), or very long (30 days for Android, 1 day for Apple). The point in time from which this duration is counted is when the `ExponentialBackoff` is constructed which translates to when we first connected to the portal. As a result, our backoff would immediately fail on the first error if it has been longer than `max_elapsed_time` since we first connected. For most components, this codepath is not relevant because the `max_elapsed_time` is so long. For the Relay however, that is only 15 minutes so chances are, the Relay would immediately fail (and get rebooted) on the first connection error with the portal. To fix this, we now lazily create the `ExponentialBackoff` on the first error. This bug has some interesting consequences: When a relay reboots, it loses all its state, i.e. allocations, channel bindings, available nonces, stamp-secret etc. Thus, all credentials and state that got distributed to Clients and Gateways get invalidated, causing disconnects from the Relay. We have observed these alerts in Sentry for a while and couldn't explain them. Most likely, this is the root cause for those because whilst a Relay disconnects, the portal also cannot detect its presence and pro-actively inform Clients and Gateways to no longer use this Relay.	2024-11-29 20:19:11 +00:00
Thomas Eizinger	bea8393248	fix(relay): reduce number of warnings (#7411) With this PR, we reduce some of the warnings emitted by the relay. If we can only partially fulfill an allocation, we now only emit a warning. Similarly, if we receive a repeated SIGTERM signal, we shut down successfully (i.e. exit with code 0) instead of failing the event-loop. During normal operation, we wait for all allocations to expire before we shut down. On CI however, the relay gets shut down much earlier so this would generate unnecessary errors. Receiving another SIGTERM is a user-initiated action so we shouldn't fail as a result but instead just comply with it.	2024-11-28 23:20:10 +00:00
Thomas Eizinger	e91a076307	refactor(relay): improve error messages on failed requests (#7405 ) Some house-keeping that should make debugging issues around relay-disconnects easier.	2024-11-28 22:12:27 +00:00
Thomas Eizinger	973a806707	feat(relay): add Sentry crash reporting (#7406 ) In addition to monitoring clients and gateways, it is also useful to monitor relays in the same way. This gives us alerts on ERROR and WARN messages logged by the relay as well as panics.	2024-11-28 21:53:21 +00:00
Thomas Eizinger	44c1b453f7	chore(relay): document authentication scheme (#7388 ) Follow-up from #7378 to answer some of the questions. --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io>	2024-11-21 20:12:31 +00:00
Thomas Eizinger	56db250e2c	feat(connlib): validate integrity of all relay responses (#7378) In order to avoid processing of responses of relays that somehow got altered on the network path, we now use the client's `password` as a shared secret for the relay to also authenticate its responses. This means that not all messages can be authenticated. In particular, BINDING requests will still be unauthenticated. Performing this validation now requires every component that crafts input to the `Allocation` to include a valid `MessageIntegrity` attribute. This is somewhat problematic for the regression tests of the relay and the unit tests of `Allocation`. In both cases, we implement workarounds so we don't have to actually compute a valid `MessageIntegrity`. This is deemed acceptable because: - Both of these are just tests. - We do test the validation path using `tunnel_test` because there we run an actual relay.	2024-11-19 18:32:33 +00:00
Thomas Eizinger	48ba2869a8	chore(rust): ban the use of `.unwrap` except in tests (#7319 ) Using the clippy lint `unwrap_used`, we can automatically lint against all uses of `.unwrap()` on `Result` and `Option`. This turns up quite a few results actually. In most cases, they are invariants that can't actually be hit. For these, we change them to `Option`. In other cases, they can actually be hit. For example, if the user supplies an invalid log-filter. Activating this lint ensures the compiler will yell at us every time we use `.unwrap` to double-check whether we do indeed want to panic here. Resolves: #7292.	2024-11-13 03:59:22 +00:00
Thomas Eizinger	73eebd2c4d	refactor(rust): consistently record errors as `tracing::Value` (#7104) Our logging library, `tracing`, supports structured logging. This is useful because it preserves more than just the string representation of a value and thus allows the active logging backend(s) to capture more information for a particular value. In the case of errors, this is especially useful because it allows us to capture the sources of a particular error. Unfortunately, recording an error as a tracing value is a bit cumbersome because `tracing::Value` is only implemented for `&dyn std::error::Error`. Casting an error to this is quite verbose. To make it easier, we introduce two utility functions in `firezone-logging`: - `std_dyn_err` - `anyhow_dyn_err` Tracking errors as correct `tracing::Value`s will be especially helpful once we enable Sentry's `tracing` integration: https://docs.rs/sentry-tracing/latest/sentry_tracing/#tracking-errors	2024-10-22 04:46:26 +00:00
Thomas Eizinger	2d4818e007	refactor(connlib): rotate tunnel private key on `reset` (#6909 ) With the new control protocol specified in #6461, the client will no longer initiate new connections. Instead, the credentials are generated deterministically by the portal based on the gateway's and the client's public key. For as long as they use the same public key, they also have the same in-memory state which makes creating connections idempotent. What we didn't consider in the new design at first is that when clients roam, they discard all connections but keep the same private key. As a result, the portal would generate the same ICE credentials which means the gateway thinks it can reuse the existing connection when new flows get authorized. The client however discarded all connections (and rotated its ports and maybe IPs), meaning the previous candidates sent to the gateway are no longer valid and connectivity fails. We fix this by also rotating the private keys upon reset. Rotating the keys itself isn't enough, we also need to propagate the new public key all the way "over" to the phoenix channel component which lives separately from connlib's data plane. To achieve this, we change `PhoenixChannel` to now start in the "disconnected" state and require an explicit `connect` call. In addition, the `LoginUrl` constructed by various components now acts merely as a "prototype", which may require additional data to construct a fully valid URL. In the case of client and gateway, this is the public key of the `Node`. This additional parameter needs to be passed to `PhoenixChannel` in the `connect` call, thus forming a type-safe contract that ensures we never attempt to connect without providing a public key. For the relay, this doesn't apply. Lastly, this allows us to tidy up the code a bit by: a) generating the `Node`'s private key from the existing RNG b) removing `ConnectArgs` which only had two members left Related: #6461. Related: #6732.	2024-10-07 22:28:51 +00:00
Thomas Eizinger	896fe49f1f	fix(relay): set better OTEL metadata (#6322 ) Previously, the `service.name` attribute got overridden with "unknown service" from the detector used in `Resource::default`. To avoid this, we are now manually composing the two other detectors. This gives us a useful set of default labels from within the code yet it allows overriding all of them using `OTEL_RESOURCE_ATTRIBUTES`.	2024-08-16 23:17:10 +00:00
Thomas Eizinger	3b56664e02	test(rust): ensure deterministic proptests (#6319 ) For quite a while now, we have been making extensive use of property-based testing to ensure `connlib` works as intended. The idea of proptests is that - given a certain seed - we deterministically sample test inputs and assert properties on a given function. If the test fails, `proptest` prints the seed which can then be added to a regressions file to iterate on the test case and fix it. It is quite obvious that non-determinism in how the test input gets generated is no bueno and reduces the value we get out of these tests a fair bit. The `HashMap` and `HashSet` data structures are known to be non-deterministic in their iteration order. This causes non-determinism during the input generation because we make use of a lot of maps and sets to gradually build up the test input. We fix all uses of `HashMap` and `HashSet` by replacing them with `BTreeMap` and `BTreeSet`. To ensure this doesn't regress, we refactor `tunnel_test` to not make use of proptest's macros and instead, we initialise and run the test ourselves. This allows us to dump the sampled state and transitions into a file per test run. In CI, we then run a 2nd iteration of all regression tests and compare the sampled state and transitions with the previous run. They must match byte-for-byte. Finally, to discourage use of non-deterministic iteration, we ban the use of the iteration functions on `HashMap` and `HashSet` across the codebase. This doesn't catch iteration in a `for`-loop but it is better than not linting against it at all. --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io> Co-authored-by: Reactor Scram <ReactorScram@users.noreply.github.com>	2024-08-16 23:15:58 +00:00
Thomas Eizinger	4750d76fce	fix(relay): re-insert channel into fast-path map (#6332) This is a test-failure detected in https://github.com/firezone/firezone/actions/runs/10426492110/job/28879531621. In the relay, we have fast-path lookup maps for incoming traffic from peers. This improves throughput as any incoming packet only needs to look up a single routing entry. Unfortunately, this creates duplication in how the data must be stored. In #6276, we correctly identified that channels must be re-bound on the relay when a client sends a `CHANNEL_BIND` message whilst the channel is cooling down. What we failed to identify (and what was now caught by the tests) is that we also need to re-insert the entry into the fast-path lookup map to actually allow data to flow through the channel.	2024-08-16 23:14:00 +00:00
Thomas Eizinger	d399e65246	build(deps): bump tokio-tungstenite to 0.23 (#5509 ) With the upgrade to 0.23, `tokio-tungstenite` pulls in `rustls` 0.27 which supports multiple crypto providers. By default, this uses the `aws-lc-crypto` provider. The previous default was `ring`. This PR bumps the necessary versions and installs the `ring` crypto provider at the beginning of each application, before connlib starts. We try and do this as early as possible to make it obvious that it only needs to happen once per process. Resolves: #5380.	2024-08-15 06:02:17 +00:00
Thomas Eizinger	272e4b2bcd	feat(snownet,relay): include sticky session ID in STUN requests (#6278) For most cases, TURN identifies clients by their 3-tuple. This can make it hard to correlate logs in case the client roams or its NAT session gets reset, both of which cause the port to change. To make problem analysis easier, we include the RFC-recommended `SOFTWARE` attribute in all STUN requests created by `snownet`. Typically, this includes a textual description of who sent the request and a version number. See [0] for details. We don't track the version of `snownet` individually and passing the actual client-version across this many layers is deemed too complicated for now. What we can add though is a parameter that includes a sticky session ID. This session ID is computed based on the `Node`'s public key, meaning it doesn't change until the user logs out and in again. On the relay, we now look for a `SOFTWARE` attribute in all STUN requests and optionally include it in all spans if it is present. [0]: https://datatracker.ietf.org/doc/html/rfc5389#section-15.10	2024-08-15 03:10:56 +00:00
Thomas Eizinger	55c97acfc3	feat(relay): record error code as label in response counter metric (#6274 ) This will allow us to write queries and thus alerts for increased number of error responses such as `Allocation Mismatch`. When attaching labels to metrics, it is important to avoid cardinality explosions. Thus, the possible label values should always be a fixed, bounded set of values. The possible error codes could be quite a few but in practise, we only use a handful and clients cannot influence, which error codes we send. Thus, it is safe to create labels for these codes. The same would not be true for IP addresses or ports for example.	2024-08-13 22:17:21 +00:00
Thomas Eizinger	6e86a4dcba	fix(snownet,relay): re-use channels to peers in cooldown period (#6276) For efficiency reasons, TURN's data channels don't have any authentication or integrity metadata. Instead, they operate using a short 2-byte channel number to identify the target peer of the data. To avoid abuse, channel bindings are at most valid for 10 minutes before they need to be refreshed. In case they expire, there is a 5 minute cooldown period, before the same channel number can be bound to a different peer and before the same peer can be bound to a different channel. We had a similar issue in the past (#5613) where channels got rebound early. Whilst that was fixed and is no longer happening, a case that we didn't consider is what happens if we want to bind a channel to a peer that still has a channel bound but is currently cooling down (i.e. in the 5 minute period after its expiry). In that case, `snownet` would wrongly assume that there is no channel to this peer and try to bind a new one. That would get rejected by the relay with a bad request. To fix this, we simply need to check whether we still have a channel to this peer and if yes, return the same channel number. On the relay, we need to ensure that we consider a channel as `bound` again when it is being refreshed. We ensure that this doesn't regress in two ways: - We add a unit-test for the `ChannelBindings` struct - We modify the `Idle` transition to idle for 6 instead of 5 minutes. This ensures that a combination of 2 idle transitions puts the channel bindings into the 10-15 minute time window where rebinding the peer to a different channel fails. Related: #6265.	2024-08-13 17:01:13 +00:00
Thomas Eizinger	0abbf6bba9	refactor(rust): inline `http-health-check` crate into `bin-shared` (#6258 ) Now that we have the `bin-shared` crate, it is easy to move the health-check functionality into there. That allows us to get rid of a crate which makes navigating the workspace a bit easier.	2024-08-12 16:44:52 +00:00
Thomas Eizinger	93d678aaea	feat(relay): set OTEL metadata for metrics and traces (#6249) I recently discovered that the metrics reporting to Google Cloud Metrics for the relays is actually working. Unfortunately, they are all bucketed together because we don't set the metadata correctly. This PR aims to fix that by setting some useful default metadata for traces and metrics and additionally, discovers instance ID and name from GCE metadata. Related: #2033.	2024-08-10 16:32:01 +00:00
Thomas Eizinger	bed625a312	chore(rust): make logging more ergonomic (#6237) Setting up a logger is something that pretty much every entrypoint needs to do, be it a test, a shared library embedded in another app or a standalone application. Thus, it makes sense to introduce a dedicated crate that allows us to bundle everything about how we want to do logging. This allows us to introduce convenience functions like `firezone_logging::test` which allow you to construct a logger for a test as a one-liner. Crucially though, introducing `firezone-logging` gives us a place to store a default log directive that silences very noisy crates. When looking into a problem, it is common to start by simply setting the log-filter to `debug`. Without further action, this floods the output with logs from crates like `netlink_proto` on Linux. It is very unlikely that those are the logs that you want to see. Without a preset filter, the only alternative here is to explicitly turn off the log filter for `netlink_proto` by typing something like `RUST_LOG=netlink_proto=off,debug`. Especially when debugging issues with customers, this is annoying. Log filters can be overridden, i.e. a 2nd filter that matches the exact same scope overrides a previous one. Thus, with this design it is still possible to activate certain logs at runtime, even if they have been silenced by default. I'd expect `firezone-logging` to attract more functionality in the future. For example, we want to support re-loading of log-filters on other platforms. Additionally, where logs get stored could also be defined in this crate. --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io> Co-authored-by: Reactor Scram <ReactorScram@users.noreply.github.com>	2024-08-10 05:17:03 +00:00
Thomas Eizinger	f800875aff	fix(relay): don't hang when connecting to OTLP exporter (#6034 ) The dependency update in #6003 introduced a regression: Connecting to the OTLP exporter was hanging forever and thus the relay failed to start up. The hang seems to be related to _dropping_ the `meter_provider`. Looking at the changelog update, this change was actually called out: https://github.com/open-telemetry/opentelemetry-rust/blob/main/opentelemetry-otlp/CHANGELOG.md#v0170. By setting these providers globally, the relay starts up just fine. To ensure this doesn't regress again, we add an OTEL collector to our `docker-compose.yml` and configure the `relay-1` to connect to it.	2024-07-25 10:36:42 -06:00
Thomas Eizinger	782b171cc1	chore(relay): always log setup on trace (#6031 ) In staging and production, setting up the logger for the relay is a fairly complicated setup. To make debugging easier, we always log these initial steps on `TRACE` level until the real logger is initialised.	2024-07-25 03:48:52 +00:00
dependabot[bot]	dae90d81e1	build(deps): bump opentelemetry dependencies (#6003 ) Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Thomas Eizinger <thomas@eizinger.io>	2024-07-24 17:45:42 +00:00
Thomas Eizinger	da52c66023	refactor(clients): init `PhoenixChannel` in upper layers (#5884 ) This represents a step towards #3837. Eventually, we'd like the abstractions of `Session` and `Eventloop` to go away entirely. For that, we need to thin them out. The introduction of `ConnectArgs` was already a hint that we are passing a lot of data across layers that we shouldn't. To avoid that, we can simply initialise `PhoenixChannel` earlier and thus each callsite can specify the desired configuration directly. I've left `ConnectArgs` intact to keep the diff small.	2024-07-18 02:08:38 +00:00
Thomas Eizinger	aa279d7731	ci: never tolerate warnings in Rust code (#5893 ) Our Rust CI runs various jobs in different configurations of packages and / or features. Currently, only the clippy job denies warnings which makes it possible that some code still generates warnings under particular configurations. To ensure we always fail on warnings, we set a global env var to deny warnings for all Rust CI jobs. --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io> Co-authored-by: Reactor Scram <ReactorScram@users.noreply.github.com>	2024-07-17 22:22:12 +00:00
Gabi	5b0aaa6f81	fix(connlib): protect all sockets from routing loops (#5797) Currently, only connlib's UDP sockets for sending and receiving STUN & WireGuard traffic are protected from routing loops. This was done via the `Sockets::with_protect` function. Connlib has additional sockets though: - A TCP socket to the portal. - UDP & TCP sockets for DNS resolution via hickory. Both of these can incur routing loops on certain platforms which becomes evident as we try to implement #2667. To fix this, we generalise the idea of "protecting" a socket via a `SocketFactory` abstraction. By allowing the different platforms to provide a specialised `SocketFactory`, anything Linux-based can give special treatment to the socket before handing it to connlib. As an additional benefit, this allows us to remove the `Sockets` abstraction from connlib's API again because we can now initialise it internally via the provided `SocketFactory` for UDP sockets. --------- Signed-off-by: Gabi <gabrielalejandro7@gmail.com> Co-authored-by: Thomas Eizinger <thomas@eizinger.io>	2024-07-16 00:40:05 +00:00
Thomas Eizinger	8ec6a809a1	refactor(relay): use `RangeInclusive` to specify available ports (#5820 )	2024-07-11 06:26:21 +00:00
Thomas Eizinger	0c2648dae2	test(connlib): correctly scope state within `tunnel_test` (#5809) Currently, the type hierarchy within `tunnel_test` is already quite nested: We have a `Host` that wraps a `SimNode` which wraps a `ClientState` or `GatewayState`. Additionally, a lot of state that is actually _per_ client or _per_ gateway is tracked in the root of `ReferenceState` and `TunnelTest`. That makes it difficult to introduce multiple gateways / clients to this test. To fix this, we introduce dedicated `RefClient` and `RefGateway` states. Those track the expected state of a particular client / gateway. Similarly, we introduce dedicated `SimClient` and `SimGateway` structs that track the simulation state by wrapping the corresponding system-under-test: `ClientState` and `GatewayState`. This ends up moving a lot of code around but has the great benefit that all the state is now scoped to a particular instance of a client or a gateway, paving the way for creating multiple clients & gateways in a single test.	2024-07-10 23:22:19 +00:00
Thomas Eizinger	9caca475dc	test(connlib): introduce routing table to `tunnel_test` (#5786 ) Currently, `tunnel_test` uses a rather naive approach when dispatching `Transmit`s. In particular, it checks client, gateway and relay separately whether they "want" a certain packet. In a real network, these packets are routed based on their IP. To mimic something similar, we introduce a `Host` abstraction that wraps each component: client, gateway and relay. Additionally, we introduce a `RoutingTable` where we can add and remove hosts. With these things in place, routing a `Transmit` is as easy as looking up the destination IP in the routing table and dispatching to the corresponding host. Our hosts are type-safe: client, gateway and relay have different types. Thus, we abstract over them using a `HostId` in order to know, which host a certain message is for. Following these patches, we can easily introduce multiple gateways and relays to this test by simply making more entries in this routing table. This will increase the test coverage of connlib. Lastly, this patch massively increases the performance of `tunnel_test`. It turns out that previously, we spent a lot of CPU cycles accessing "random" IPs from very large iterators. With this patch, we take a limited range of 100 IPs that we sample from, thus drastically increasing performance of this test. The configured 1000 testcases execute in 3s on my machine now (with opt-level 1 which is what we use in CI). --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io>	2024-07-09 01:48:54 +00:00
Thomas Eizinger	28d5b8574c	chore(connlib): minor logging tweaks (#5746 ) Noticed a few things that caused unnecessary verbosity in the logs.	2024-07-05 14:45:32 +00:00
Thomas Eizinger	b5fd980fb2	fix(relay): don't log all request failures on the same level (#5622) Currently, the relay logs all failed requests on WARN. This is a bit excessive because during normal operation, clients are expected to hit several 401s due to stale or missing nonces. In order to not flood the logs with these, we introduce a new type, `ResponseErrorLevel`, that represents the subset of `tracing::Level` that `make_error_response` can log: - `Warn` - `Debug` Both variants map to the variants in `tracing::Level` with the same name, and the function will log accordingly. The caller can now pick which level an error should be logged on, which reduces the noise in the logs when the error is part of normal operation. Fixes: #5490. --------- Co-authored-by: conectado <gabrielalejandro7@gmail.com>	2024-06-29 02:38:55 +00:00
Thomas Eizinger	cf9f7504ce	chore(relay): be more lenient with debug-assertions (#5367 ) Some of the debug-assertions in the relay are a bit too strict. Specifically, if an allocation times out because it is not refreshed, we also clean-up all channel bindings associated with that allocation. Yet, if an existing channel binding has already been removed earlier, it will no longer be present in the respective map. This isn't an issue at all. We can simply change the debug-assertion to only compare what used to be present in the map. What really matters is that the item we just removed does in fact point to the data that we are expecting. Related: #5355.	2024-06-14 06:07:15 +00:00
Thomas Eizinger	d27a7a3083	feat(relay): support custom turn port (#5208 ) Original PR: #5130. Co-authored-by: Antoine <antoinelabarussias@gmail.com>	2024-06-05 04:04:17 +00:00
Thomas Eizinger	92676f0f53	test(connlib): simulate IO in state machine tests (#4728 ) This is similar to #4097 and #4585 but for the entire `ClientState` and `GatewayState`. We also do it in the context of a property-based test with the vision that we can deterministically explore a large space of state transitions and see where our main property breaks: Being able to send an ICMP packet from the client to the gateway. In other words, we now correctly pass all the `Transmit`s back and forth between the components as if they would receive it from the network. Due to the nature of property-based tests, this already exercises a very large input space. For example, if the client does not have an IPv6 socket and the gateway doesn't have an IPv4 socket, this test already checks whether we then correctly fall back to using a relay (because the allocation we make on the relay is the only network path where the STUN requests pass through). What this does not (yet) do is set up a proper network topology. The `dispatch_transmit` function will happily "route" a `Transmit` from e.g. the client to the gateway even if they are in different subnets. In other words, these tests assume that the actual network itself works and we can exchange UDP packets between the components. For now, we only send ICMPs to CIDR resources. As a next step, we can extend this to DNS resources by sending DNS queries for our DNS resources and then sending an ICMP to the resolved IP.	2024-05-22 23:10:58 +00:00
Thomas Eizinger	99c600f558	chore(relay): allow domains in `--otel-grpc-endpoint` (#5059 ) Replaces #4932. --------- Co-authored-by: Antoine Labarussias <antoinelabarussias@gmail.com>	2024-05-22 01:43:17 +00:00
Thomas Eizinger	53c7bd8201	fix(relay): clear channel bindings when allocation is deleted (#4705 ) As suspected, there was a bug in the relay where channel bindings were not cleared if the client freed the allocation early by sending a REFRESH request with a lifetime of 0. Resolves: #4588.	2024-04-19 13:25:38 +00:00
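
Note on 3e4976e4ab (#8177): the starvation described in that commit message can be reproduced with a toy model. The sketch below is not the relay's code. The `Source` enum, the `EventLoop` struct and its queues are hypothetical stand-ins for the components polled by the real event-loop; only the pattern (replacing `continue` with a boolean) is taken from the commit message.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-ins for the components polled by the event-loop.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Source {
    Traffic,
    Timer,
    Portal,
}

/// Toy event-loop: each queue models one component that may have work ready.
pub struct EventLoop {
    pub traffic: VecDeque<u32>,
    pub timers: VecDeque<u32>,
    pub portal: VecDeque<u32>,
}

impl EventLoop {
    /// Old pattern: jump back to the top as soon as a component made progress.
    /// Returns the order in which components were served, up to `budget` items.
    pub fn run_with_continue(&mut self, budget: usize) -> Vec<Source> {
        let mut served = Vec::new();
        while served.len() < budget {
            if self.traffic.pop_front().is_some() {
                served.push(Source::Traffic);
                continue; // later components starve while traffic stays busy
            }
            if self.timers.pop_front().is_some() {
                served.push(Source::Timer);
                continue;
            }
            if self.portal.pop_front().is_some() {
                served.push(Source::Portal);
                continue;
            }
            break; // nothing is ready
        }
        served
    }

    /// New pattern: only record that progress was made, so every component
    /// gets a chance in each pass before we loop again.
    pub fn run_with_flag(&mut self, budget: usize) -> Vec<Source> {
        let mut served = Vec::new();
        while served.len() < budget {
            let mut ready = false;
            if self.traffic.pop_front().is_some() {
                served.push(Source::Traffic);
                ready = true;
            }
            if self.timers.pop_front().is_some() {
                served.push(Source::Timer);
                ready = true;
            }
            if self.portal.pop_front().is_some() {
                served.push(Source::Portal);
                ready = true;
            }
            if !ready {
                break; // a full pass without progress: nothing is ready
            }
        }
        served
    }
}
```

With a long traffic queue and one pending timer and portal message, `run_with_continue` serves only traffic within its budget, whereas `run_with_flag` serves the timer and the portal message in its first pass.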


160 Commits
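
Note on 932f6791fb (#7414): the difference between an eagerly and a lazily created backoff can be sketched with the standard library alone. This is an illustration under assumptions, not the `phoenix-channel` implementation: `Backoff`, `EagerChannel` and `LazyChannel` are hypothetical types, the 1-second initial delay and the doubling are made up, and dropping the backoff again in `on_connected` is an assumption that the commit message does not state.

```rust
use std::time::{Duration, Instant};

/// Hypothetical, simplified backoff: gives up once `max_elapsed` has passed
/// since the backoff was constructed.
pub struct Backoff {
    start: Instant,
    max_elapsed: Duration,
    next_delay: Duration,
}

impl Backoff {
    pub fn new(now: Instant, max_elapsed: Duration) -> Self {
        Self {
            start: now,
            max_elapsed,
            next_delay: Duration::from_secs(1), // assumed initial delay
        }
    }

    /// Returns the delay before the next retry, or `None` once the budget is spent.
    pub fn next_backoff(&mut self, now: Instant) -> Option<Duration> {
        if now.duration_since(self.start) > self.max_elapsed {
            return None;
        }
        let delay = self.next_delay;
        self.next_delay *= 2;
        Some(delay)
    }
}

/// Buggy variant: the clock starts when the channel is created.
pub struct EagerChannel {
    backoff: Backoff,
}

impl EagerChannel {
    pub fn new(now: Instant, max_elapsed: Duration) -> Self {
        Self {
            backoff: Backoff::new(now, max_elapsed),
        }
    }

    pub fn on_error(&mut self, now: Instant) -> Option<Duration> {
        self.backoff.next_backoff(now)
    }
}

/// Fixed variant: the backoff (and its clock) is created on the first error.
pub struct LazyChannel {
    max_elapsed: Duration,
    backoff: Option<Backoff>,
}

impl LazyChannel {
    pub fn new(max_elapsed: Duration) -> Self {
        Self {
            max_elapsed,
            backoff: None,
        }
    }

    pub fn on_error(&mut self, now: Instant) -> Option<Duration> {
        let max_elapsed = self.max_elapsed;
        self.backoff
            .get_or_insert_with(|| Backoff::new(now, max_elapsed))
            .next_backoff(now)
    }

    /// Assumption: a successful reconnect discards the backoff again.
    pub fn on_connected(&mut self) {
        self.backoff = None;
    }
}
```

With a 15-minute `max_elapsed` and a first error one hour after start-up, `EagerChannel::on_error` returns `None` immediately, while `LazyChannel::on_error` still hands out retry delays for the next 15 minutes.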