During normal operation, we should never lose connectivity to the set of
assigned relays in a client or gateway. In the presence of odd network
conditions and partitions however, it is possible that we disconnect
from a relay that is in fact only temporarily unavailable. Without an
explicit mechanism to retrieve new relays, this means that both clients
and gateways can end up with no relays at all. For clients, this can be
fixed by either roaming or signing out and in again. For gateways, this
can only be fixed by a restart!
Without connected relays, no connections can be established. With #7163,
we will at least be able to still establish direct connections. Yet,
that isn't good enough and we need a mechanism for restoring full
connectivity in such a case.
When creating a new connection, we already sample one of our relays and
assign it to this particular connection. This ensures that we don't
create an excessive amount of candidates for each individual connection.
Currently, this selection is allowed to be silently fallible. With this
PR, we make this a hard error and bubble the error all the way up to the
client's and gateway's event-loop. There, we initiate a reconnect
to the portal as a compensating action. Reconnecting to the portal means
we will receive another `init` message that allows us to reconnect the
relays.
Due to the nature of this implementation, this fix may only apply with a
certain delay from when we actually lost connectivity to the last relay.
However, this design has the advantage that we don't have to introduce
an additional state within `snownet`: Connections now simply fail to
establish and the next one soon after _should_ succeed again because we
will have received a new `init` message.
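To illustrate the shape of this change, here is a rough sketch (all names are illustrative and do not match the actual code; a real implementation would sample randomly instead of picking the first relay):

```rust
// Illustrative sketch only; names and types do not match the real codebase.
type RelayId = u64;

#[derive(Debug, thiserror::Error)]
#[error("no relays available to assign to the new connection")]
struct NoRelays;

fn sample_relay(relays: &[RelayId]) -> Result<RelayId, NoRelays> {
    // Previously, an empty set here failed silently; now it is a hard error.
    relays.first().copied().ok_or(NoRelays)
}

fn on_new_connection(relays: &[RelayId]) -> Result<(), NoRelays> {
    let relay = sample_relay(relays)?; // bubbles all the way up to the event loop
    tracing::debug!(%relay, "Assigned relay to connection");
    Ok(())
}

// In the event loop, the compensating action is a portal reconnect:
fn handle_error(e: NoRelays) {
    tracing::warn!("{e}; reconnecting to the portal to receive a fresh `init`");
    // portal.reconnect(); // triggers a new `init` message that restores the relays
}
```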
Resolves: #7162.
The `fmt::Display` implementation of `tokio::task::JoinError` already
does exactly what we do here: Extracting the panic message if there is
one. Thus, we can simplify this code by just moving the `JoinError`
into the `DisconnectError` as its source.
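A minimal sketch of the resulting shape, assuming a `thiserror`-style definition (the actual definition may differ):

```rust
// Sketch: `JoinError`'s `Display` already includes the panic message, so exposing
// it as the source is enough for error reporters that walk the source chain.
#[derive(Debug, thiserror::Error)]
#[error("connlib task terminated unexpectedly")]
struct DisconnectError(#[from] tokio::task::JoinError);
```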
Using the `sentry-tracing` integration, we can automatically capture
events based on what we log via `tracing`. The mapping is defined as
follows:
- ERROR: Gets captured as a fatal error
- WARN: Gets captured as a message
- INFO: Gets captured as a breadcrumb
- `_`: Does not get captured at all
If telemetry isn't active / configured, this integration does nothing.
It is therefore safe to just always enable it.
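For reference, wiring the layer into `tracing-subscriber` looks roughly like this (the exact filter we use may differ):

```rust
use sentry_tracing::EventFilter;
use tracing_subscriber::prelude::*;

fn init_logging() {
    let sentry_layer = sentry_tracing::layer().event_filter(|metadata| {
        match *metadata.level() {
            // ERROR and WARN become Sentry events (the actual mapping
            // distinguishes their severity).
            tracing::Level::ERROR | tracing::Level::WARN => EventFilter::Event,
            // INFO is attached as a breadcrumb to the next captured event.
            tracing::Level::INFO => EventFilter::Breadcrumb,
            // DEBUG and TRACE are not captured at all.
            _ => EventFilter::Ignore,
        }
    });

    tracing_subscriber::registry()
        .with(tracing_subscriber::fmt::layer())
        .with(sentry_layer)
        .init();
}
```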
Our logging library, `tracing`, supports structured logging. This is
useful because it preserves more than just the string representation
of a value and thus allows the active logging backend(s) to capture more
information for a particular value.
In the case of errors, this is especially useful because it allows us to
capture the sources of a particular error.
Unfortunately, recording an error as a tracing value is a bit cumbersome
because `tracing::Value` is only implemented for `&dyn
std::error::Error`. Casting an error to this is quite verbose. To make
it easier, we introduce two utility functions in `firezone-logging`:
- `std_dyn_err`
- `anyhow_dyn_err`
Tracking errors as correct `tracing::Value`s will be especially helpful
once we enable Sentry's `tracing` integration:
https://docs.rs/sentry-tracing/latest/sentry_tracing/#tracking-errors
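A hypothetical shape for these helpers and their usage (the actual signatures in `firezone-logging` may differ slightly):

```rust
pub fn std_dyn_err<'a>(
    e: &'a (impl std::error::Error + 'static),
) -> &'a (dyn std::error::Error + 'static) {
    e
}

pub fn anyhow_dyn_err(e: &anyhow::Error) -> &(dyn std::error::Error + 'static) {
    e.as_ref()
}

// Usage: the error is recorded as a structured value, including its sources.
fn log_failure(e: &anyhow::Error) {
    tracing::warn!(error = anyhow_dyn_err(e), "Failed to update DNS servers");
}
```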
Currently, we have a lot of boilerplate code to forward data from the
`{Client,Gateway}Tunnel` interface to `{Client,Gateway}State`. Recent
refactorings such as #6919 made it possible to get rid of this
forwarding layer by directly exposing `&mut TRoleState`.
To maintain some type-privacy, several functions are made generic to
accept `impl Into` or `impl TryInto`.
Do we want to track 401s in sentry? If we see a lot of them, something
is likely wrong but I guess there is some level of 401s that users will
just run into.
Is there a way of marking these as "might not be a really bad error"?
---------
Co-authored-by: Not Applicable <ReactorScram@users.noreply.github.com>
With the new control protocol specified in #6461, the client will no
longer initiate new connections. Instead, the credentials are generated
deterministically by the portal based on the gateway's and the client's
public key. For as long as they use the same public key, they also have
the same in-memory state which makes creating connections idempotent.
What we didn't consider in the new design at first is that when clients
roam, they discard all connections but keep the same private key. As a
result, the portal would generate the same ICE credentials which means
the gateway thinks it can reuse the existing connection when new flows
get authorized. The client however discarded all connections (and
rotated its ports and maybe IPs), meaning the previous candidates sent
to the gateway are no longer valid and connectivity fails.
We fix this by also rotating the private keys upon reset. Rotating the
keys by itself isn't enough; we also need to propagate the new public key
all the way "over" to the phoenix channel component which lives
separately from connlib's data plane.
To achieve this, we change `PhoenixChannel` to now start in the
"disconnected" state and require an explicit `connect` call. In
addition, the `LoginUrl` constructed by various components now acts
merely as a "prototype", which may require additional data to construct
a fully valid URL. In the case of client and gateway, this is the public
key of the `Node`. This additional parameter needs to be passed to
`PhoenixChannel` in the `connect` call, thus forming a type-safe
contract that ensures we never attempt to connect without providing a
public key.
For the relay, this doesn't apply.
Lastly, this allows us to tidy up the code a bit by:
a) generating the `Node`'s private key from the existing RNG
b) removing `ConnectArgs` which only had two members left
Related: #6461.
Related: #6732.
The `connlib-shared` crate has become a bit of a dependency magnet
without a clear purpose. It hosts utilities like `get_user_agent`,
messages for the client and gateway to communicate with the portal and
domain types like `ResourceId`.
To create a better dependency structure in our workspace, we repurpose
`connlib-shared` as a `connlib-model` crate. Its purpose is to host
domain-specific model types that multiple crates may want to use. For
that purpose, we rename the `callbacks::ResourceDescription` type to
`ResourceView`, designating that this is a _view_ onto a resource as
seen by `connlib`. The message types, which currently double up as the
connlib-internal model, thus become an implementation detail of
`firezone-tunnel` and shouldn't be used for anything else.
---------
Signed-off-by: Reactor Scram <ReactorScram@users.noreply.github.com>
Co-authored-by: Reactor Scram <ReactorScram@users.noreply.github.com>
Refs #6138
Sentry is always enabled for now. In the near future we'll make it
opt-out per device and opt-in per org (see #6138 for details)
- Replaces the `crash_handling` module
- Catches panics in GUI process, tunnel daemon, and Headless Client
- Added a couple "breadcrumbs" to play with that feature
- User ID is not set yet
- Environment is set to the API URL, e.g. `wss://api.firezone.dev`
- Reports panics from the connlib async task
- Release should be automatically pulled from the Cargo version, which we
set in the version Makefile
Example screenshot of sentry.io with a caught panic:
<img width="861" alt="image"
src="https://github.com/user-attachments/assets/c5188d86-10d0-4d94-b503-3fba51a21a90">
This log is currently printed after we receive the `init` message from
the portal. It is a left-over from the early days of connlib where receiving
`init` itself already triggered all kinds of actions.
These days, we are mostly just updating state. In addition, `init` can
be received multiple times during a client's session which is somewhat
confusing when you see multiple "Firezone started" logs.
Merging #6708 had the unintended side-effect that we are now seeing a lot of
WARN logs from phoenix-channel because we can no longer parse the
response from gateways. We didn't do anything with these responses but
gateways are sending them for backwards-compatibility reasons.
To not confuse ourselves while debugging, we revert the client-side bit
of #6708 to remove these warnings.
Prior to version 1.1.0, clients did not have an embedded DNS resolver
and relied on the gateway for DNS resolution. In that design, the
gateway responded with the IPs that the domain resolved to.
Our next iteration of the control protocol (#6461) will decouple the
details of how DNS works from the flow-authorization. As a result, we
will need to be able to establish a flow for a DNS resource without
knowing which concrete domain the client is going to access.
Without a concrete domain, we cannot send anything back to these old
clients, meaning we unfortunately have to break compatibility with <
1.1.0 clients as part of implementing the new control protocol.
The `expect` attribute is similar to `allow` in that it will silence a
particular lint. Unlike `allow` however, `expect` will fail as
soon as the lint is no longer emitted. This ensures we don't end up with
stale `allow` attributes in our codebase. Additionally, it provides a
way of adding a `reason` that documents why the lint is being suppressed.
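For example (using a built-in rustc lint; the same works for Clippy lints):

```rust
// Compiles today because `legacy_endpoint` is in fact unused. As soon as the
// function gains a caller, the `dead_code` lint stops firing and the compiler
// flags this `expect` as unfulfilled, prompting us to remove it.
#[expect(dead_code, reason = "kept until the new control protocol lands")]
fn legacy_endpoint() {}
```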
When CIDR resources get added or removed, we need to update the routing
table on the clients to redirect traffic for these resources to the TUN
device. Currently, this is done in a separate event from the remaining
`TunConfig` tracked in `connlib`. Having this in a separate event means
it is hard to diff whether anything meaningful changed about the TUN
device. Additionally, changes to these routes are currently not tested
in `tunnel_test`.
Not having this code tested already caused bugs previously, such as
#6387.
To fix these things, we:
- Add the IPv4 and IPv6 routes to the `TunConfig` tracked in `connlib`
- Track the expected routes in `RefClient`
- Assert that we don't emit `TunConfigUpdated` events without any actual
changes
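Roughly, the tracked config then looks like the following sketch (field names are illustrative), which turns the "did anything meaningful change?" question into a plain equality check:

```rust
use std::collections::BTreeSet;
use std::net::{Ipv4Addr, Ipv6Addr};

use ipnet::{Ipv4Net, Ipv6Net};

// Illustrative shape; the real struct carries more fields (e.g. DNS servers).
#[derive(Debug, Clone, PartialEq, Eq)]
struct TunConfig {
    ip4: Ipv4Addr,
    ip6: Ipv6Addr,
    ipv4_routes: BTreeSet<Ipv4Net>,
    ipv6_routes: BTreeSet<Ipv6Net>,
}

// Only emit `TunConfigUpdated` when the new config actually differs.
fn update(current: &mut Option<TunConfig>, new: TunConfig) -> Option<TunConfig> {
    if current.as_ref() == Some(&new) {
        return None;
    }
    *current = Some(new.clone());
    Some(new)
}
```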
Fixes: #6423.
Currently, we buffer UDP packets whenever the socket is busy and try to
flush them out at a later point. This requires allocations and is tricky
to get right.
In order to solve both of these problems, we refactor `snownet` to
return an `EncryptedPacket` instead of a `Transmit`. An
`EncryptedPacket` is an indirection that can be turned into
a `Transmit` given an `EncryptBuffer`. This combination of types allows
us to hold on to the `EncryptedPacket` (which does not contain any
references itself) in the `io` component whilst we are waiting for the
socket to be ready to send again.
This means we will immediately suspend the event loop in case the socket
is no longer ready for sending and resend the datagram in the
`EncryptBuffer` once we get re-polled.
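The types involved look roughly like this (an illustrative sketch; the real types carry more metadata):

```rust
use std::net::SocketAddr;

/// Owns the bytes of the encrypted datagram between polls of the event loop.
struct EncryptBuffer(Vec<u8>);

/// Refers to a datagram inside an `EncryptBuffer` by offset instead of by
/// reference, so it can be held across `poll` calls without borrowing issues.
#[derive(Clone, Copy)]
struct EncryptedPacket {
    dst: SocketAddr,
    len: usize,
}

struct Transmit<'a> {
    dst: SocketAddr,
    payload: &'a [u8],
}

impl EncryptedPacket {
    fn to_transmit<'a>(&self, buf: &'a EncryptBuffer) -> Transmit<'a> {
        Transmit {
            dst: self.dst,
            payload: &buf.0[..self.len],
        }
    }
}
```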
Currently, `connlib` tracks the `Interface` as it is given to it by the
portal. This includes the tunnel IP addresses plus the upstream DNS servers.
Upstream servers however only take effect when they are defined.
Without upstream DNS servers, `connlib` uses the system-defined DNS
servers. In that case, the `Interface` no longer accurately represents
what we actually configure on the TUN device.
To fix this, we introduce a dedicated `TunConfig` struct that tracks
what is actually set on the interface. This also allows us to track
whether or not we need to re-emit this configuration after a change.
Related: #6423.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Upon receiving packets for a resource that we are not connected to,
connlib emits a "connection intent" to the portal. In case there are
gateways online for this resource, the portal sends us a "connection
details" event.
Currently, this is handled in a `create_or_reuse_connection` function.
What the current name doesn't capture is that this message is
essentially an update to connlib's "routing table", i.e. which gateway
in which site to use for the given resource. If we move this concern to
the forefront of the design, whether we make a new
connection or reuse an existing one becomes secondary.
Re-framing the way we handle this message makes it more natural to
design it in an asynchronous way, i.e. set its return type to `()` and
schedule events to be emitted. The translation of
`Request::NewConnection` is more or less 1-to-1 with the introduction of
`ClientEvent::RequestConnection`. The translation of
`Request::ReuseConnection` turns into the also renamed
`ClientEvent::RequestAccess`. This captures better what we need to do:
When we have an existing connection, we need to request access for it,
otherwise the gateway will drop all packets we send to this resource.
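Conceptually, the two events look something like this (payloads trimmed, types are placeholders):

```rust
// Placeholder identifiers for the sake of the example.
type GatewayId = u64;
type ResourceId = u64;
type Offer = String;

enum ClientEvent {
    /// No connection to the chosen gateway exists yet: ask the portal to
    /// broker a new one, including our offer.
    RequestConnection {
        gateway: GatewayId,
        resource: ResourceId,
        offer: Offer,
    },
    /// A connection already exists: only request access to the resource,
    /// otherwise the gateway drops the packets we send for it.
    RequestAccess {
        gateway: GatewayId,
        resource: ResourceId,
    },
}
```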
The motivation for this refactoring is #6335. Buffering the initial
packets while establishing a new connection opens up a race condition
where we may send `RequestAccess` before the gateway has processed
`RequestConnection`. In order to avoid this, we need to be able to
locally buffer our `RequestAccess` messages and wait until the gateway
has confirmed our connection.
Previously, failing to bind to any interfaces was a hard-error. In
reality and in `connlib`'s current state, this is quite unlikely because
machines will at least have a loopback interface that we will bind to.
However, with #6382 in the pipeline, it may be more likely that we
actually end up with no functional UDP sockets. Furthermore, we are
considering extending those connectivity checks in the future.
Thus, it is important that the case of "no available UDP sockets" is
gracefully handled.
Instead of failing with a hard error, we now suspend `connlib`'s network
stack. The connectivity to the portal is unaffected by this and we will
still also receive commands from the client application like `reset`.
When we receive a `reset`, we attempt to rebind the sockets and thus
retry connectivity.
Because we are suspending the entire eventloop, this won't send any
messages or trigger any timers whatsoever. For example, if we
hypothetically started up without network interfaces, this is now the
log output:
```
2024-08-22T01:50:42.170101Z INFO firezone_headless_client: arch="x86_64" git_version="headless-client-1.2.0-2-gc8eed5938-modified"
2024-08-22T01:50:42.178777Z DEBUG phoenix_channel: Connecting to portal host=api.firez.one user_agent=NixOS/24.5.0 connlib/1.2.1 (x86_64; 6.8.12)
2024-08-22T01:50:42.178978Z DEBUG firezone_headless_client::dns_control::linux: Deactivating DNS control...
2024-08-22T01:50:42.180691Z ERROR firezone_tunnel::sockets: No available UDP sockets
2024-08-22T01:50:42.197098Z INFO firezone_tunnel::device_channel: Initializing TUN device name=tun-firezone
2024-08-22T01:50:42.197165Z DEBUG firezone_tunnel::client: Unable to update DNS servesr without interface configuration
2024-08-22T01:50:42.453988Z DEBUG tungstenite::handshake::client: Client handshake done.
2024-08-22T01:50:42.454161Z INFO phoenix_channel: Connected to portal host=api.firez.one
2024-08-22T01:50:42.676825Z DEBUG firezone_tunnel::client: Updating DNS servers mapping={fd00:2021:1111:8000:100:100:111:0 <> [2606:4700:4700::1111]:53, 100.100.111.1 <> 1.1.1.1:53}
2024-08-22T01:50:42.677084Z INFO firezone_tunnel::client: Activating resource name=IPerf3 address=10.0.32.101/32 sites=AWS Dev (Gateways track `main`)
2024-08-22T01:50:42.677173Z INFO firezone_tunnel::client: Activating resource name=*.slack.com address=**.slack.com sites=Vultr Stable (Latest Release Gateways)
2024-08-22T01:50:42.677223Z INFO firezone_tunnel::client: Activating resource name=*.slack-edge.com address=**.slack-edge.com sites=Vultr Stable (Latest Release Gateways)
2024-08-22T01:50:42.677283Z INFO firezone_tunnel::client: Activating resource name=*.spotify.com address=**.spotify.com sites=AWS Dev (Gateways track `main`)
2024-08-22T01:50:42.677345Z INFO firezone_tunnel::client: Activating resource name=*.github.com address=**.github.com sites=AWS Dev (Gateways track `main`)
2024-08-22T01:50:42.677418Z INFO firezone_tunnel::client: Activating resource name=whatismyip.com address=**.whatismyip.com sites=AWS Dev (Gateways track `main`)
2024-08-22T01:50:42.677489Z INFO firezone_tunnel::client: Activating resource name=ifconfig.net address=ifconfig.net sites=Vultr Stable (Latest Release Gateways)
2024-08-22T01:50:42.677538Z INFO firezone_tunnel::client: Activating resource name=*.google.com address=**.google.com sites=AWS Dev (Gateways track `main`)
2024-08-22T01:50:42.677632Z INFO firezone_tunnel::client: Activating resource name=*.fastmail.com address=**.fastmail.com sites=AWS Dev (Gateways track `main`)
2024-08-22T01:50:42.677682Z INFO firezone_tunnel::client: Activating resource name=speed.cloudflare.com address=speed.cloudflare.com sites=Vultr Stable (Latest Release Gateways)
2024-08-22T01:50:42.678212Z INFO snownet::node: Added new TURN server rid=b6fc4d73-9c8e-44df-a941-da7d2134cb70 address=Dual { v4: 34.40.133.55:3478, v6: [2600:1900:40b0:1504:0:97::]:3478 }
2024-08-22T01:50:42.678322Z INFO snownet::node: Added new TURN server rid=c818b11a-d0cc-4f2a-bb88-473d8298a885 address=Dual { v4: 34.81.229.132:3478, v6: [2600:1900:4030:b0d9:0:9b::]:3478 }
2024-08-22T01:50:42.678365Z INFO connlib_client_shared::eventloop: Firezone Started!
```
After this, nothing will happen other than receiving messages from
the portal or the client app.
Related: #6382.
Related: #6385.
Previously, `connlib` would only send the currently connected gateways
to the portal upon a new connection intent. With our introduced idle
connection timeout, this could result in the portal choosing a different
gateway upon reconnecting to the resource.
To fix this, we introduce an LRU cache with at most 100 entries.
Iteration over the LRU cache happens in MRU order, meaning a recently
connected gateway will be at the front of the list.
We assume that this list is processed in order, so gateways that we are
still connected to continue to be preferred.
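A small sketch of the behaviour using the `lru` crate (gateway IDs are placeholders):

```rust
use std::num::NonZeroUsize;

use lru::LruCache;

type GatewayId = u64; // placeholder

fn main() {
    // Remember the 100 most recently connected gateways.
    let mut connected = LruCache::<GatewayId, ()>::new(NonZeroUsize::new(100).unwrap());

    connected.put(1, ());
    connected.put(2, ());

    // `iter()` yields entries in most-recently-used order, so the portal -
    // processing the list front to back - keeps preferring gateways we are
    // currently connected to.
    let preferred: Vec<GatewayId> = connected.iter().map(|(id, _)| *id).collect();
    assert_eq!(preferred, vec![2, 1]);
}
```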
Related: #6347.
For quite a while now, we have been making extensive use of
property-based testing to ensure `connlib` works as intended. The idea
of proptests is that - given a certain seed - we deterministically
sample test inputs and assert properties on a given function.
If the test fails, `proptest` prints the seed which can then be added to
a regressions file to iterate on the test case and fix it. It is quite
obvious that non-determinism in how the test input gets generated is no
bueno and reduces the value we get out of these tests a fair bit.
The `HashMap` and `HashSet` data structures are known to be
non-deterministic in their iteration order. This causes non-determinism
during the input generation because we make use of a lot of maps and
sets to gradually build up the test input. We fix all uses of `HashMap`
and `HashSet` by replacing them with `BTreeMap` and `BTreeSet`.
To ensure this doesn't regress, we refactor `tunnel_test` to not make
use of proptest's macros and instead, we initialise and run the test
ourselves. This allows us to dump the sampled state and transitions into
a file per test run. In CI, we then run a 2nd iteration of all
regression tests and compare the sampled state and transitions with the
previous run. They must match byte-for-byte.
Finally, to discourage use of non-deterministic iteration, we ban the
use of the iteration functions on `HashMap` and `HashSet` across the
codebase. This doesn't catch iteration in a `for`-loop but it is better
than not linting against it at all.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Reactor Scram <ReactorScram@users.noreply.github.com>
Most of `connlib-shared` exists only for historical reasons. The
`Tunnel` has since been decoupled from the `Callbacks` and most error
variants on `ConnlibError` are not actually used.
This allows us to move a few things around and trim down `ConnlibError`
to just the variants that actually cause a call to `on_disconnect`.
Moving everything related to `proptest`s to `firezone-tunnel` also
requires us to delete the specialisation for printing IDs in a shorter
format during the tests. That is a bit unfortunate but was always kind
of a hack. I'd rather make progress on getting rid of `connlib-shared`
though and perhaps re-introduce that feature once the messages are fully
moved into the tunnel.
Related: #4470.
Setting up a logger is something that pretty much every entrypoint needs
to do, be it a test, a shared library embedded in another app or a
standalone application. Thus, it makes sense to introduce a dedicated
crate that bundles together everything about how we want to do logging.
This allows us to introduce convenience functions like
`firezone_logging::test` which allow you to construct a logger for a
test as a one-liner.
Crucially though, introducing `firezone-logging` gives us a place to
store a default log directive that silences very noisy crates. When
looking into a problem, it is common to start by simply setting the
log-filter to `debug`. Without further action, this floods the output
with logs from crates like `netlink_proto` on Linux. It is very unlikely
that those are the logs that you want to see. Without a preset filter,
the only alternative here is to explicitly turn off the log filter for
`netlink_proto` by typing something like
`RUST_LOG=netlink_proto=off,debug`. Especially when debugging issues
with customers, this is annoying.
Log filters can be overridden, i.e. a 2nd filter that matches the exact
same scope overrides a previous one. Thus, with this design it is still
possible to activate certain logs at runtime, even if they have been silenced
by default.
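As a sketch of the idea with `tracing_subscriber::EnvFilter` (the directive string is an example, not the exact default we ship):

```rust
use tracing_subscriber::EnvFilter;

/// Prepends our "silence the noisy crates" defaults to whatever the user asked for.
/// Because a later directive for the same target overrides an earlier one,
/// `netlink_proto=debug` in the user filter would still win over our `off`.
fn build_filter(user_directives: &str) -> EnvFilter {
    EnvFilter::new(format!("netlink_proto=off,{user_directives}"))
}

fn example() {
    // A user setting `RUST_LOG=debug` effectively gets `netlink_proto=off,debug`.
    let _filter = build_filter("debug");
}
```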
I'd expect `firezone-logging` to attract more functionality in the
future. For example, we want to support re-loading of log-filters on
other platforms. Additionally, where logs get stored could also be
defined in this crate.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Reactor Scram <ReactorScram@users.noreply.github.com>
Currently, `connlib` depends on `hickory-resolver` to perform DNS
queries for non-resources. This is unnecessary. Instead of buffering the
original UDP DNS query, consulting hickory to resolve the name and
mapping the response back, we can simply take the UDP payload and send
it via our protected socket directly to the original upstream DNS
server.
This ensures `connlib` is as transparent as possible for DNS queries for
non-resources. Additionally, it removes a lot of error handling and
other cruft that we currently have to perform because we are using
hickory. For example, hickory will automatically retry a DNS query after
a certain timeout. However, the OS / client talking to `connlib` will
also retry after a certain timeout because it is making DNS queries over
an unreliable transport (UDP). It is thus unnecessary for us to do that
internally.
To correctly test this change, our test-suite needed some refactoring.
Specifically, DNS servers are now modelled as dedicated `Host`s that can
receive (UDP) traffic.
Lastly, we can remove our dependency on `hickory-proto` and
`hickory-resolver` everywhere and only use `domain` for parsing DNS
messages.
Resolves: #6141.
Related: #6033.
Related: #4800. (Impossible to happen with this design)
Builds on top of #6164
Part of the effort towards
https://github.com/firezone/firezone/issues/6074
Prepares connlib to call `setDisableResource` from Android.
Furthermore, we add a `disablable` parameter for resources which defaults
to false for now. In the future, the portal will set it for the internet
resource, and further down the line it may be used for other resources.
The `disablable` parameter only affects the UI.
Connection roaming within `connlib` has changed a fair bit since we
introduced the `reconnect` function. The new implementation is basically
a hard-reset of all state within `connlib`. Renaming this function
across all layers makes this more obvious.
Resolves: #6038.
As part of debugging full-route tunneling on Windows, we discovered that
we need to always explicitly choose the interface through which we want
to send packets, otherwise Windows may cause a routing loop by routing
our packets back into the TUN device.
We already have a `SocketFactory` abstraction in `connlib` that is used
by each platform to customise the setup of each socket to prevent
routing loops.
So far, this abstraction directly returns tokio sockets which don't
allow us to intercept the actual sending of packets. For some of our
traffic, i.e. the UDP packets exchanged with relays, we don't specify a
source address. To make full-route work on Windows, we need to intercept
these packets and explicitly set the source address.
To achieve that, we introduce dedicated `TcpSocket` and `UdpSocket`
structs within `socket-factory`. With this in place, we will be able to
add Windows-conditional code that looks up and sets the source address of
outgoing UDP packets. For TCP sockets, the lookup will happen prior to
connecting to the address and will be used to bind to the correct interface.
Related: #2667.
Related: #5955.
The different implementations of `Tun` are the last platform-specific
code within `firezone-tunnel`. By introducing a dedicated crate and a
`Tun` trait, we can move this code into (platform-specific) leaf crates:
- `connlib-client-android`
- `connlib-client-apple`
- `firezone-bin-shared`
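A rough sketch of what such a trait can look like (the actual trait differs in its details):

```rust
use std::io;
use std::task::{Context, Poll};

/// Platform-agnostic interface to a TUN device: connlib only needs to read
/// and write IP packets; creating and configuring the device stays with the
/// platform-specific leaf crates.
trait Tun: Send + 'static {
    fn name(&self) -> &str;
    fn poll_read(&mut self, buf: &mut [u8], cx: &mut Context<'_>) -> Poll<io::Result<usize>>;
    fn write4(&self, packet: &[u8]) -> io::Result<usize>;
    fn write6(&self, packet: &[u8]) -> io::Result<usize>;
}
```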
Related: #4473.
---------
Co-authored-by: Not Applicable <ReactorScram@users.noreply.github.com>
For `tunnel_test`, it is very important that each execution of a set of
state transitions is completely deterministic, otherwise the shrinking
behaviour does not work.
Iterating over `HashMap` and `HashSet` is non-deterministic. To fix
this, we convert several maps and sets to `BTreeMap`s and `BTreeSet`s.
The two primary users of the `add_resources` and `remove_resources`
functions are the client's eventloop and the `tunnel_test`. Both only ever
pass a single resource at a time.
It is thus simpler to remove the inner loop from within `ClientState`
and simply process a single resource at a time.
This represents a step towards #3837. Eventually, we'd like the
abstractions of `Session` and `Eventloop` to go away entirely. For that,
we need to thin them out.
The introduction of `ConnectArgs` was already a hint that we are passing
a lot of data across layers that we shouldn't. To avoid that, we can
simply initialise `PhoenixChannel` earlier and thus each callsite can
specify the desired configuration directly.
I've left `ConnectArgs` intact to keep the diff small.
Following the removal of the return type from the callback functions in
#5839, we can now move the use of the `Callbacks` one layer up the stack
and decouple them entirely from the `Tunnel`.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Gabi <gabrielalejandro7@gmail.com>
Currently, only connlib's UDP sockets for sending and receiving STUN &
WireGuard traffic are protected from routing loops. This is done via
the `Sockets::with_protect` function. Connlib has additional sockets
though:
- A TCP socket to the portal.
- UDP & TCP sockets for DNS resolution via hickory.
Both of these can incur routing loops on certain platforms which becomes
evident as we try to implement #2667.
To fix this, we generalise the idea of "protecting" a socket via a
`SocketFactory` abstraction. By allowing the different platforms to
provide a specialised `SocketFactory`, anything Linux-based can give
special treatment to the socket before handing it to connlib.
As an additional benefit, this allows us to remove the `Sockets`
abstraction from connlib's API again because we can now initialise it
internally via the provided `SocketFactory` for UDP sockets.
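In essence, the abstraction boils down to something like this sketch (the real trait and the per-platform "protect" steps differ):

```rust
use std::io;
use std::net::{SocketAddr, UdpSocket};

/// Anything that can produce a ready-to-use socket for a given address.
/// Platforms implement this to "protect" the socket (e.g. bind it to a
/// specific interface or mark it) before connlib ever sees it.
trait SocketFactory<S>: Fn(&SocketAddr) -> io::Result<S> + Send + Sync {}

impl<F, S> SocketFactory<S> for F where F: Fn(&SocketAddr) -> io::Result<S> + Send + Sync {}

/// The trivial factory: no protection, just bind.
fn plain_udp(addr: &SocketAddr) -> io::Result<UdpSocket> {
    UdpSocket::bind(addr)
}
```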
---------
Signed-off-by: Gabi <gabrielalejandro7@gmail.com>
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
Connlib's routing logic and networking code is entirely platform
agnostic. The only platform-specific bit is how we interact with the TUN
device. From connlib's perspective though, all it needs is an interface
for reading and writing. How the device gets initialised and updated is
client-business.
For the most part, this is the same on all platforms: We call callbacks
and the client updates the state accordingly. The only annoying bit here
is that Android recreates the TUN interface on every update and thus our
old file descriptor is invalid. The current design works around this by
returning the new file descriptor on Android. This is a problematic
design for several reasons:
- It forces the callback handler to finish synchronously, halting
connlib until this is complete.
- The synchronous nature also means we cannot replace the callbacks with
events as events don't have a return value.
To fix this, we introduce a new `set_tun` method on `Tunnel`. This moves
the business of how the `Tun` device is created up to the client. The
clients are already platform-specific so this makes sense. In a future
iteration, we can move all the various `Tun` implementations all the way
up to the client-specific crates, thus co-locating the platform-specific
code.
Initialising `Tun` from the outside surfaces another issue: The routes
are still set via the `Tun` handle on Windows. To fix this, we introduce
a `make_tun` function on `TunDeviceManager` so that it can remember
the interface index on Windows, allowing us to move the setting of
routes to `TunDeviceManager`.
This simplifies several of connlib's APIs which are now infallible.
Resolves: #4473.
---------
Co-authored-by: Reactor Scram <ReactorScram@users.noreply.github.com>
Co-authored-by: conectado <gabrielalejandro7@gmail.com>
Closes #5601
It looks like we can hit 100+ Mbps in theory. This covers Wintun, Tokio,
and Windows OS overhead. It doesn't cover the cryptography or anything
in connlib itself.
The code is kinda messy but I'm not sure how to clean it up so I'll just
leave it for review.
This test should fail if there are any regressions in #5598.
It fails if any packet is dropped or if the speed is under 100 Mbps
```[tasklist]
### Tasks
- [x] Use `ip_packet::make`
- [x] Switch to `cargo bench`
- [x] Extract windows ARM PR
- [x] Clean up wintun.dll install code
- [x] Re-request review
```
Closes #5449
The smoke tests expect `last_crash.dmp` at a fixed path, so in this case
we write the file with a timestamped name, then copy it over
`last_crash.dmp`.
In a previous design of firezone, relays used to be scoped to a certain
connection. For a while now, this constraint has been lifted and all
connections can use all relays. A related, outdated concern is the idea
of STUN-only servers. Those also used to be assigned on a per-connection
basis.
By removing any use of per-connection relays and STUN-only servers, the
entire `StunBinding` concept is unused code and can thus be deleted.
To push this over the finish line, the `snownet-tests` which test the
hole-punching functionality needed to be slightly adapted to make use of
the more recently introduced API `Node::update_relays`.
Resolves: #4749.
Currently, `snownet` still supports this notion of "reconnecting" which
is a mix of resetting some state and keeping the rest. In particular,
we currently retain the `StunBinding` and `Allocation` state. This used
to be important because allocations are bound to the 3-tuple of the
client and thus needed to be kept around in case we weren't actually
roaming.
We always rebind the local UDP sockets upon reconnecting and thus
the 3-tuple always changes anyway. In addition, we always reconnect to
the portal, meaning we receive another `init` message and thus can
actually completely clear the `Node`'s state.
This PR does that and, in the process, rebrands `reconnect` as `reset`,
which now makes more sense.
Related: #5619.
Currently, we are sending each ICE candidate individually from the
client to the gateway and vice versa. This causes a slight delay as to
when each ICE candidate gets added on the remote ICE agent. As a result,
they all start being tested with a slight offset which causes "endpoint
hopping" whenever a connection expires as they expire just after each
other.
In addition, sending multiple messages to the portal causes unnecessary
load when establishing connections.
Finally, with #5283 we started **not** adding the server-reflexive
candidate to the local ICE agent. Because we talk to multiple relays, we
detect the same server-reflexive candidate multiple times if we are
behind a non-symmetric NAT. Not adding the server-reflexive candidate to
the ICE agent undermined our de-duplication strategy here, which means we
currently send the same candidate multiple times to a peer, causing
additional, unnecessary load.
All of this can be mitigated by batching all our ICE candidates
together into one message.
Resolves: #3978.
In order to handle DNS resources, connlib intercepts all DNS requests on
the system once it has started up. The DNS queries are then forwarded to
the original DNS resolver in case the query isn't for one of the
configured DNS resources _except_ if the configured DNS resovler is also
a CIDR resource.
In that case, the DNS query will be tunneled to a gateway and forwarded
to the DNS resolver from there.
Exactly this configuration results in a deadlock when roaming networks.
To make roaming more reliable, we now drop all connections when
detecting a network change (see #5308). As a result, DNS queries cannot
be tunneled right away. This isn't usually a problem: We just send a
connection intent to the portal to connect to the gateway. Upon a
network change, we also reconnect the websocket to the portal, which also
requires resolving the domain name. Connlib's DNS resolver is still
active at that point and thus, we end up deadlocking ourselves because
the DNS query to resolve the portal's domain is waiting for a connection
to a gateway that can only be established once we are connected to the
portal.
To prevent this, we extend connlib with a "known hosts" feature. These
are DNS records that are defined statically for the lifetime of a
connlib session and can thus always be resolved, regardless of the
connection state with the portal or the gateways. We populate these
records with the portal's API, allowing the reconnect to work without
having connected gateways.
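Conceptually, the feature is a static lookup table that is consulted before any forwarding happens (the domain and address below are placeholders):

```rust
use std::collections::BTreeMap;
use std::net::IpAddr;

fn known_hosts() -> BTreeMap<String, Vec<IpAddr>> {
    // Populated once at session start, e.g. with the portal's API host.
    // The address here is a placeholder, not the real one.
    BTreeMap::from([(
        "api.firezone.dev".to_owned(),
        vec!["203.0.113.10".parse().unwrap()],
    )])
}

/// Answer locally if the query is for a known host; otherwise fall through to
/// the normal path (DNS resources, upstream forwarding, ...).
fn answer_from_known_hosts(
    query_name: &str,
    known: &BTreeMap<String, Vec<IpAddr>>,
) -> Option<Vec<IpAddr>> {
    known.get(query_name).cloned()
}
```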
---------
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
Currently, the clients only write JSON-formatted logs to the configured
log directory. These are very hard to read as a human because one has to
re-assemble the spans and fields that we use extensively in connlib's
logs.
With this patch, the logs are written to two files: `.jsonl` as JSON
and `.log` in syslog format.
This PR is the "client-side" of things for #4994. Up until now, when a
user wanted to connect to a DNS resource, we would establish a
connection to the gateway and pass along the domain we are trying to
access. The gateway would resolve that domain and send the response back
to the client, allowing it to finally send a DNS response.
Now, we instantly assign and respond with 4x A and 4x AAAA records to
any query for one of our DNS resources. Upon the first IP packet for one
of these "proxy IPs", we select a gateway, establish a connection and
send our proxy IPs along. The gateway then performs the necessary
mangling and NATing of all packets. See #5354 for details.
Resolves: #4994.
Resolves: #5491.
---------
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
Closes #5481
With this, I can connect to the staging portal without a build.rs or any
extra env var setup
<img width="387" alt="image"
src="https://github.com/firezone/firezone/assets/13400041/9c080b36-3a76-49c7-b706-20723697edc7">
```[tasklist]
### Next steps
- [x] Split out a refactor PR for `ConnectArgs` (#5488)
- [x] Try doing this for other Clients
- [x] Check Gateway
- [x] Check Tauri Client
- [x] Change to `app_version`
- [x] Open for review
- [ ] Use `option_env` so that `FIREZONE_PACKAGE_VERSION` can still override the Cargo.toml version for local testing
- [ ] Check Android Client
- [ ] Check Apple Client
```
---------
Signed-off-by: Reactor Scram <ReactorScram@users.noreply.github.com>
This is extracted from #5487 since I needed to add an 8th parameter and
Clippy said 8 is too many.
Refs #2986
Stepping stone towards using the Builder pattern. There are only a few
Clients so this has 80% of the advantage for 20% of the effort.
When a user sends the first packet to a resource, we generate a
"connection intent" and consult the portal, which gateway to use for
this resource. This process is throttled to only generate a new intent
every 2s.
Once we know, which gateway to use for a certain resource, we initiate a
connection via snownet. This involves an OFFER-ANSWER handshake with the
gateway. A connection for which we have sent an offer and have not yet
received an answer is what we call a "pending connection".
In case the connection setup takes longer than 2s, we will generate
another connection intent which can point to the same gateway that we
are currently setting up a connection with.
Currently, encountering a "pending connection" during another connection
setup is treated as an error which results in some state being
cleaned-up / removed. This is where the bug surfaces: If we remove the
state for a resource as a result of a 2nd connection intent and then
receive the response of the first one, we will be left with no state
that knows about this resource.
We fix this by refactoring `create_or_reuse_connection` to be atomic
with regard to its state changes: All checks that fail the function are
moved to the top which means there is no state to clean up in case of an
error. Additionally, we model the case of a "pending connection" using
an `Option` to not flood the logs with "pending connection" warnings as
those are expected during normal operation.
Fixes: #5385
Closes #5042
Smoke test plan:
- Install on a before-Firezone VM
- Confirm logs default to `str0m=warn,info`
- Set log filter to `debug` in GUI
- Restart IPC service
- Confirm logs are `debug`
- Clear settings back to default
- Restart IPC service
- Confirm logs are `str0m=warn,info`
Directions to apply new log level:
1. Put the new log filter in
2. Click "Apply"
3. Quit Firezone Client
4. Right-click on the Start Menu and click "Terminal (Admin)" to open a
Powershell prompt
5. Run `Restart-Service -Name FirezoneClientIpcService` (on Linux, `sudo
systemctl restart firezone-client-ipc.service`)
6. Re-open Firezone Client
```[tasklist]
- [x] Log the log filter maybe
- [x] Use `atomicwrites` to write the file
- [x] (cancelled) ~~Make the GUI write the file on boot if it's not there (saves a step when upgrading from older versions)~~
- [x] Windows smoke test
- [x] Fix permissions on `/var/lib/dev.firezone.client/config`
- [x] Fix Linux IPC service not loading the log filter file
- [x] Linux smoke test
- [ ] Make sure it's okay that users in `firezone-client` can change the device ID
- [ ] Update user guides to include restarting the computer or IPC service after updating the log level?
```
---------
Signed-off-by: Reactor Scram <ReactorScram@users.noreply.github.com>
This may have been needed when the logger rolled files and uploaded, but
now it compiles fine without it.
I tested it once manually on Windows. I don't think the logging is
covered by automated tests.