Upon receiving a SIGTERM, we immediately disconnect the websocket
connection to the portal and set a flag that we are shutting down.
Once we are disconnected from the portal and no longer have any active
allocations, we exit with 0. A repeated SIGTERM signal will interrupt
this process and force the relay to shut down.
Disconnecting from the portal will (eventually) trigger a message to
clients and gateways that this relay should no longer be used. Thus,
depending on the timeout our supervisor has configured after sending
SIGTERM, the relay will continue all TURN operations until the number of
allocations drops to 0.
Currently, we also allow clients to make new allocations and to refresh
existing ones. In the future, it may make sense to implement a
dedicated status code and refuse `ALLOCATE` and `REFRESH` messages
whilst we are shutting down.
Related: #4548.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
In a prior design of the relay and the `phoenix-channel`, connecting to
the portal was a blocking operation, i.e. we weren't meant to start the
relaying operations before the portal connection succeeded.
Since then, `phoenix-channel` got refactored to have an internal
(re)-connection mechanism, meaning we don't actually need to `.await`
anything to obtain a `PhoenixChannel` instance that we can use to
initialize the `Server`. Furthermore, we changed the health-check to
return 200 OK prior to the portal connection being established in #4553.
Taking both of these into account, there is no more need to block on the
portal connection being established, which allows us to remove the use
of `phoenix_channel::init` and connect in the background whilst we
already accept STUN & TURN traffic.
This PR adds a unit-test to `snownet` that exercises all code paths that
are required for a relayed connection to work. This includes:
- Nodes make an allocation with real credentials, nonces etc
- Nodes exchange their ICE candidates
- Nodes bind data channels on the relay
- str0m performs ICE over these data channels
- Nodes handshake a wireguard tunnel on the nominated socket
I consider this a baseline. Once merged, I want to attempt writing a
test in #4568 that asserts migration of a connection to a new relay
without the connection expiring. At some point, we can even go further
and move these tests to `firezone-tunnel` and unit-test even more things
like connection intents etc.
During the latest relay outage, we failed to send heartbeats to the
portal because we were busy-looping and never got to handle messages or
timers for the portal.
To mitigate this or similar bugs, we update an `Instant` every time we
send a heartbeat to the portal. In case we are actually
network-partitioned, this will cause the health-check to fail after 15
minutes. This value is the same as the partition timeout for the portal
connection itself[^1]. Very likely, we will never see a relay being
shut down because of a failing health check in this case, as it would
have already shut itself down.
Exceptions to this are bugs in the eventloop where we fail to interact
with the portal at all.
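Roughly, the mechanism looks like this (the type and field names below are made up for the sketch, not the relay's actual code): stamp an `Instant` on every heartbeat and let the health check compare its age against the 15-minute partition timeout.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper mirroring the idea: the health check turns
/// unhealthy once we haven't sent a heartbeat for too long.
struct HeartbeatMonitor {
    last_heartbeat: Instant,
    partition_timeout: Duration,
}

impl HeartbeatMonitor {
    fn new() -> Self {
        Self {
            last_heartbeat: Instant::now(),
            partition_timeout: Duration::from_secs(15 * 60), // same as the portal connection
        }
    }

    /// Call this every time a heartbeat is actually handed to the portal connection.
    fn record_heartbeat(&mut self) {
        self.last_heartbeat = Instant::now();
    }

    /// The HTTP health check returns 200 only while this is `true`.
    fn is_healthy(&self) -> bool {
        self.last_heartbeat.elapsed() < self.partition_timeout
    }
}
```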
Resolves: #4510.
[^1]: Previously, this was unlimited.
This one is a bit tricky. Our auth scheme requires me to know the
current time as a UNIX timestamp and that I can only get from
`SystemTime` but not `Instant`. The `Server` is meant to be SANS-IO,
including the current time, so technically I would have to pass that in
as a parameter.
I ended up settling on a compromise of making the auth verification
impure and internally calling `SystemTime::now`. That results in a much
nicer API and allows us to use `Instant` for everything else, e.g.
expiry of channel bindings, allocations etc.
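A minimal sketch of that compromise (function name and parameters are placeholders, not the relay's real auth code): the verification reads the wall clock internally, so everything else can keep using `Instant`.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical signature: verifies that a username's expiry timestamp is
/// still in the future. Internally impure: it reads the wall clock itself
/// instead of taking a `SystemTime` parameter.
fn verify_not_expired(expiry_unix_secs: u64) -> bool {
    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock is set before the UNIX epoch")
        .as_secs();

    expiry_unix_secs > now
}
```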
Resolves: #4464.
A wildcard match was the underlying cause of the bug fixed in #4486.
Despite being a bit annoying in some cases, I think it is worth having this lint turned
on to ensure we don't wildcard match in situations where it can have bad
consequences, like `poll` functions.
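To illustrate the failure mode the lint guards against (contrived example, not the actual code from #4486): a wildcard arm silently swallows enum variants added later, whereas an exhaustive match turns the omission into a compile error.

```rust
enum Event {
    PacketReceived,
    TimerFired,
    // Adding a new variant here must force us to revisit the match below.
}

fn handle(event: Event) {
    match event {
        Event::PacketReceived => { /* read from the socket */ }
        Event::TimerFired => { /* expire allocations and channels */ }
        // With `_ => {}` instead, a future variant would be silently
        // dropped; exhaustive matching makes the compiler point at this
        // spot instead.
    }
}
```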
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
This required a mid-sized refactor of the relay's eventloop. The idea is
that we can use [`mio`](https://docs.rs/mio/latest/mio/) to do the
actual IO handling instead of `tokio`. `tokio` depends on `mio`
internally but doesn't expose its primitives. Most importantly, we don't
get access to the API where we can dynamically register file descriptors
to watch for readiness.
In order to avoid allocations on the relaying hotpath, we need to listen
on a dynamic number of sockets:
1. Our client-facing socket on port 3478
2. All sockets allocated by clients
`mio` is the building block of the async tokio runtime, hence it does
not provide any async primitives. Instead, it blocks the thread it is
running on and feeds you events that you need to deal with.
We still need our `tokio` runtime to register timers and for
communication with the portal. To integrate the two, we spawn a
dedicated thread for `mio::Poll` and communicate with it via channels
within the `Sockets` abstraction. Thus, the `Eventloop` itself has no
idea that `mio` is used for all the network communication.
Whenever `mio` sends us an event that a socket is ready, we try to read
from that specific socket. We must read from this socket until it
returns `WouldBlock` at which point we move on to the next event.
We only register for read-readiness. If a socket is not ready for
writing, we just drop the packet.
With this design in place, we can now have a single buffer that we read
incoming packets into and dispatch to `Server`, depending on which
port they were received on. A future refactoring could maybe even unify
these functions and let the `Server` deal with the ports internally.
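A condensed sketch of that pattern (socket setup, the token and the channel type are simplified assumptions, not the real `Sockets` code): a dedicated thread drives `mio::Poll`, drains each ready socket until `WouldBlock` and forwards datagrams to the async side over a channel.

```rust
// Requires `mio` with the "net" and "os-poll" features.
use std::io::ErrorKind;
use std::sync::mpsc;

use mio::net::UdpSocket;
use mio::{Events, Interest, Poll, Token};

const CLIENT_SOCKET: Token = Token(0);

fn spawn_io_thread(tx: mpsc::Sender<(std::net::SocketAddr, Vec<u8>)>) -> std::io::Result<()> {
    let mut poll = Poll::new()?;
    let mut socket = UdpSocket::bind("0.0.0.0:3478".parse().unwrap())?;

    // Only read-readiness: if a socket isn't ready for writing, we drop the packet.
    poll.registry()
        .register(&mut socket, CLIENT_SOCKET, Interest::READABLE)?;

    std::thread::spawn(move || {
        let mut events = Events::with_capacity(128);
        let mut buf = [0u8; 65536]; // single buffer reused for all reads

        loop {
            poll.poll(&mut events, None).expect("poll failed");

            for event in &events {
                if event.token() != CLIENT_SOCKET {
                    continue;
                }

                // Drain the socket until it would block, then move on to the next event.
                loop {
                    match socket.recv_from(&mut buf) {
                        Ok((n, from)) => {
                            let _ = tx.send((from, buf[..n].to_vec()));
                        }
                        Err(e) if e.kind() == ErrorKind::WouldBlock => break,
                        Err(e) => panic!("socket error: {e}"),
                    }
                }
            }
        }
    });

    Ok(())
}
```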
Resolves: #4366.
The value returned from `poll_timeout` only needs to reset the `Sleep`;
it does not need to send us back to the top of the loop. Instead, we
move the polling of `Sleep` to below the point where we reset it. This
will correctly register a waker in case we did change the `Sleep`.
This `continue` causes a busy-loop and stops the relay from dealing with
the `phoenix-channel` which means the portal will eventually consider it
offline.
This was first introduced in #4455.
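Sketched out, the corrected ordering looks roughly like this (the helper function and its signature are illustrative, not the eventloop's actual code):

```rust
use std::future::Future;

/// Returns `true` if the timer fired. Illustrative helper, called from the
/// eventloop's `poll` function with the server's next timeout.
fn drive_timer(
    mut sleep: std::pin::Pin<&mut tokio::time::Sleep>,
    next_timeout: Option<std::time::Instant>,
    cx: &mut std::task::Context<'_>,
) -> bool {
    // 1. Reset the timer if the server asked for a different deadline,
    //    but do NOT `continue` back to the top of the loop here.
    if let Some(deadline) = next_timeout {
        let deadline = tokio::time::Instant::from_std(deadline);
        if deadline != sleep.deadline() {
            sleep.as_mut().reset(deadline);
        }
    }

    // 2. Poll the timer *after* resetting it, so a waker is registered
    //    for the (possibly new) deadline.
    sleep.as_mut().poll(cx).is_ready()
}
```

If the timer is ready, the eventloop can call its timeout handler and loop again; the important part is that resetting the `Sleep` alone no longer short-circuits the rest of the loop.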
This is a similar fix to the one in #4486. I am not sure if this is / was
actively causing problems but using `continue` after _any_ ready event
is definitely more correct.
This is a low-risk change.
This is much more robust than the previous implementation because we now
go through all allocations and channels every time we get a
`handle_timeout` and clean up everything that is expired.
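Conceptually (types and fields here are illustrative, not the relay's real ones), the sweep boils down to a `retain` over allocations and channel bindings on every `handle_timeout`:

```rust
use std::collections::HashMap;
use std::time::Instant;

struct Allocation { expires_at: Instant }
struct ChannelBinding { expires_at: Instant }

struct Server {
    allocations: HashMap<u64, Allocation>,
    channels: HashMap<(u64, u16), ChannelBinding>,
}

impl Server {
    /// Called whenever the eventloop's timer fires: walk everything and
    /// drop whatever has expired, instead of relying on per-entry timers.
    fn handle_timeout(&mut self, now: Instant) {
        self.allocations.retain(|_, a| a.expires_at > now);
        self.channels.retain(|_, c| c.expires_at > now);
    }
}
```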
Resolves: #4095.
Previously, we would allocate each message twice:
1. When receiving the original packet.
2. When forming the resulting channel-data message.
We can optimise this to only one allocation each by:
1. Carrying around the original `ChannelData` message for traffic from
clients to peers.
2. Pre-allocating enough space for the channel-data header for traffic
from peers to clients.
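For the peer-to-client direction, the trick is to reserve the 4-byte channel-data header up front so the payload is read straight into its final position. A sketch with a made-up helper (the real code operates on the relay's own socket abstraction):

```rust
const CHANNEL_DATA_HEADER_LEN: usize = 4; // 2 bytes channel number + 2 bytes length
const MAX_UDP_PAYLOAD: usize = 65536;

/// Illustrative only: build a channel-data message around a payload that was
/// received directly into the buffer, without a second allocation.
fn recv_into_channel_data(
    socket: &std::net::UdpSocket,
    channel_number: u16,
) -> std::io::Result<Vec<u8>> {
    let mut buf = vec![0u8; CHANNEL_DATA_HEADER_LEN + MAX_UDP_PAYLOAD];

    // Read the peer's payload directly behind the reserved header space.
    let (len, _peer) = socket.recv_from(&mut buf[CHANNEL_DATA_HEADER_LEN..])?;

    // Fill in the header in place and truncate to the actual message size.
    buf[0..2].copy_from_slice(&channel_number.to_be_bytes());
    buf[2..4].copy_from_slice(&(len as u16).to_be_bytes());
    buf.truncate(CHANNEL_DATA_HEADER_LEN + len);

    Ok(buf)
}
```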
Local flamegraphing still shows most of user-space activity as
allocations. I did occasionally see a throughput of ~10GBps with these
patches. I'd like to still work towards #4095 to ensure we handle
anything time-sensitive better.
Previously, we were creating a lot of spans because they were all set to
`level = error`. We now reduce those spans to `debug` which should help
with the CPU utilization.
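In `tracing` terms this is a one-word change per span (illustrative function, not the actual relay code):

```rust
// Previously the span was declared with `level = "error"`, which is always
// enabled; demoting it to debug keeps the hot path cheap unless debug
// logging is explicitly turned on.
#[tracing::instrument(level = "debug", skip(payload))]
fn handle_client_input(payload: &[u8]) {
    tracing::trace!(num_bytes = payload.len(), "Handling client input");
}
```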
Related: #4366.
Currently, controlling the RNG seed is gated for debug builds only. This
makes profiling the release build impossible because we cannot generate
credentials upfront.
Additionally, for flamegraphs to be useful, we need to enable debug
symbols for the relay.
Motivated by: #4340.
I also activated
[`clippy::unnecessary_wraps`](https://rust-lang.github.io/rust-clippy/master/#/unnecessary_wraps)
which does create some false-positives for the platform-specific code
but is IMO overall a net-positive. With the amount of Rust code and
crates increasing, it is good to have tools point out simplifications
like these as they are otherwise hard to spot, especially across crate
boundaries.
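For reference, the shape of code the lint flags (contrived example): a function whose `Result` wrapper is never actually needed, forcing every caller to handle an error that cannot occur.

```rust
// Lint fires: this function never returns `Err`, so the `Result` is noise.
fn port_for_allocation(index: u16) -> Result<u16, std::io::Error> {
    Ok(49152 + index)
}

// Preferred: return the value directly and drop the error handling at the call sites.
fn port_for_allocation_simplified(index: u16) -> u16 {
    49152 + index
}
```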
This adds to the gateway the same kind of HTTP health-check that is
already present in the relay. The health-check returns 200 OK for as long as
the gateway is active. The gateway automatically shuts down on fatal
errors (like authentication failures with the portal).
To enable this, I've extracted a crate `http-health-check` that shares
this code between the relay and the gateway.
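The shared piece is tiny. A sketch of what such a crate can expose, using `axum` purely for illustration (the route path, function name and signature are assumptions, not the crate's actual API):

```rust
use std::net::SocketAddr;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

use axum::{http::StatusCode, routing::get, Router};

/// Serve a health-check endpoint that returns 200 OK while `healthy` is true.
pub async fn serve(addr: SocketAddr, healthy: Arc<AtomicBool>) -> std::io::Result<()> {
    let app = Router::new().route(
        "/healthz",
        get(move || async move {
            if healthy.load(Ordering::Relaxed) {
                StatusCode::OK
            } else {
                StatusCode::SERVICE_UNAVAILABLE
            }
        }),
    );

    let listener = tokio::net::TcpListener::bind(addr).await?;
    axum::serve(listener, app).await
}
```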
Resolves: #2465.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Reactor Scram <ReactorScram@users.noreply.github.com>
Previously, the relay neither scheduled a `Wake` command nor registered
a `TimedAction` to expire a channel binding. Such an action was only
scheduled after the first refresh.
This PR fixes this and adds a test that asserts we can re-bind the same
channel to a different peer after 15 minutes.
Resolves: #3979.
Unfortunately, the current logs don't allow us to correlate which
allocation, and thus which client, an expired channel binding relates to.
This PR fixes this by adding more fields to the relevant log messages.
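With `tracing`, that just means attaching the identifying fields to the event (the field names below are examples, not necessarily the ones added):

```rust
use std::net::SocketAddr;

fn log_channel_expiry(allocation_id: u64, client_addr: SocketAddr, channel_number: u16) {
    // Before: `tracing::debug!("Channel binding expired");` was impossible to correlate.
    // After: include the allocation and client so the log line can be tied back to them.
    tracing::debug!(
        allocation = allocation_id,
        client = %client_addr,
        channel = %channel_number,
        "Channel binding expired"
    );
}
```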
Currently, we are passing a lot of data into `Session::connect`. Half of
this data is only needed to construct the URL we will use to connect to
the portal. We can simplify this by extracting a dedicated `LoginUrl`
component that captures and validates this data early.
Not only does this reduce the number of parameters we pass to
`Session::connect`, it also reduces the number of failure cases we have
to deal with in `Session::connect`. Any time the session fails, we have
to call `onDisconnected` to inform the client. Thus, we should perform
as much validation as we can early on. In other words, once
`Session::connect` returns, the client should be able to expect that the
tunnel is starting.
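A rough sketch of the idea (the fields, parameters, query names and error type are illustrative, not the real `LoginUrl` API): all URL-related validation happens when the value is constructed, long before `Session::connect` is called.

```rust
use url::Url;

#[derive(Debug)]
pub struct LoginUrl(Url);

impl LoginUrl {
    /// Validate everything needed to talk to the portal up front, so that
    /// `Session::connect` itself has (almost) no failure cases left.
    pub fn new(api_url: &str, token: &str, device_id: &str) -> Result<Self, url::ParseError> {
        let mut url = Url::parse(api_url)?;

        // Query parameter names are made up for this sketch.
        url.query_pairs_mut()
            .append_pair("token", token)
            .append_pair("external_id", device_id);

        Ok(LoginUrl(url))
    }

    pub fn as_url(&self) -> &Url {
        &self.0
    }
}
```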
As part of doing https://github.com/firezone/firezone/pull/3682, we
noticed that the handling of errors bubbled up to the clients needs to
differentiate between fatal errors that require clearing the token and
those that don't.
Upon closer inspection of `phoenix_channel::Error`, it becomes obvious
that the current design is not good here. In particular, we handle
certain errors with retries internally but still expose those same
errors.
To make this more obvious, we reduce the public `Error` to the variants
that are actually fatal. There are really only three:
- HTTP client errors (those are by definition non-retryable)
- Token expired
- We have reached our max number of retries
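In code, the public surface then shrinks to something along these lines (variant names and payloads are approximate):

```rust
/// The only errors callers of `phoenix-channel` still need to care about;
/// everything retryable is handled internally by the reconnect logic.
#[derive(Debug)]
pub enum Error {
    /// The server rejected the websocket upgrade with a 4xx status.
    ClientError(u16),
    /// The portal told us our token is no longer valid.
    TokenExpired,
    /// We exhausted the configured number of reconnect attempts.
    MaxRetriesReached,
}
```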
We don't have a concept of "inbound requests", at least not natively in
the phoenix channel JSON format. Thus, we don't need to match on `ref`
for incoming messages.
Extracted out of: #3682.
In #3726 this value was increased but the test didn't reflect that.
I've not the slightest idea how this is passing on CI. It isn't locally.
Now I have an idea: relay tests aren't run on CI.
In the relay's authentication scheme, each nonce is only valid for a
certain number of requests. This guards against replay attacks.
Currently, this is set to 10, which means all requests after the tenth
will receive a "stale nonce" error. 10 turns out to be way too low and
greatly delays the setup of channels and allocations, which is always a
burst of messages that end up incurring additional round trips because
they all need to be re-sent with a new nonce.
# Gateways
- [x] When Gateway Group is deleted all gateways should be disconnected
- [x] When Gateway Group is updated (eg. routing) broadcast to all
affected gateways to disconnect all the clients
- [x] When Gateway is deleted it should be disconnected
- [x] When Gateway Token is revoked all gateways that use it should be
disconnected
# Relays
- [x] When Relay Group is deleted all relays should be disconnected
- [x] When Relay is deleted it should be disconnected
- [x] When Relay Token is revoked all relays that use it should be
disconnected
# Clients
- [x] Remove Delete Client button, show clients using the token on the
Actors page (#2669)
- [x] When client is deleted disconnect it
- [ ] ~When Gateway is offline broadcast to the Clients connected to it
it's status~
- [x] Persist `last_used_token_id` in Clients and show it in tokens UI
# Resources
- [x] When Resource is deleted it should be removed from all gateways
and clients
- [x] When Resource connection is removed it should be deleted from
removed gateway groups
- [x] When Resource is updated (eg. traffic filters) all of its
authorizations should be removed
# Authentication
- [x] When Token is deleted related sessions are terminated
- [x] When an Actor is deleted or disabled it should be disconnected
from browser and client
- [x] When Identity is deleted its sessions should be disconnected from
browser and client
- [x] ^ Ensure the same happens for identities during IdP sync
- [x] When IdP is disabled act like all actors for it are disabled?
- [x] When IdP is deleted act like all actors for it are deleted?
# Authorization
- [x] When Policy is created clients that gain access to a resource
should get an update
- [x] When Policy is deleted we need to remove all authorizations it has made
- [x] When Policy is disabled we need to remove all authorizations it has made
- [x] When Actor Group adds or removes a user, related policies should
be re-evaluated
- [x] ^ Ensure the same happens for identities during IdP sync
# Settings
- [x] Re-send init message to Client when DNS settings change
# Code
- [x] Clear way to see all available topics and messages; do not use
binary topics any more
---------
Co-authored-by: conectado <gabrielalejandro7@gmail.com>
Currently, there is a bug in the relay where the channel state of
different peers overlaps because the data isn't indexed correctly by
both peers and clients.
This PR fixes this, introduces more debug assertions (this bug was
caught by one) and also adds some new-type wrappers to avoid conflating
peers with clients.
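The new-type wrappers are as simple as they sound. A sketch of the idea (names approximate): keying channel state by the client socket makes it impossible to accidentally look it up by a peer's address.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

/// Address a client sent a request from.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct ClientSocket(SocketAddr);

/// Address of a peer a client wants to relay to.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct PeerSocket(SocketAddr);

/// Channel bindings are per client: the same channel number can safely be
/// used by different clients because the client socket is part of the key.
struct Channels {
    by_client_and_number: HashMap<(ClientSocket, u16), PeerSocket>,
}
```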
Previously, we still had a hard-coded rule in the relay that would not
allow us to relay to an IPv6 peer. We can remove that and properly check
this based on the allocated addresses.
Resolves: #3405.
In #3400, a discussion started on what the correct log level would be
for the production relay. Currently, the relay logs some stats about
each packet on debug, i.e. where it came from, where it is going to and
how big it is. This isn't very useful in production though and will fill
up our log disk quickly.
This PR introduces a stats timer like the ones we already have in other
components. We print the number of allocations, how many channels we
have and how much data we relayed over all of these channels since we
last printed. The interval is currently set to 10 seconds.
Here is what this output could look like (captured locally using
`relay/run_smoke_test.sh`, although slightly tweaked: printing every 2s,
using release mode and larger packets on the clients):
```
2024-01-26T05:01:02.445555Z INFO relay: Seeding RNG from '0'
2024-01-26T05:01:02.445580Z WARN relay: No portal token supplied, starting standalone mode
2024-01-26T05:01:02.445827Z INFO relay: Listening for incoming traffic on UDP port 3478
2024-01-26T05:01:02.447035Z INFO Eventloop::poll: relay: num_allocations=0 num_channels=0 throughput=0.00 B/s
2024-01-26T05:01:02.649194Z INFO Eventloop::poll:handle_client_input{sender=127.0.0.1:39092 transaction_id="8f20177512495fcb563c60de" allocation=AID-1}: relay: Created new allocation first_relay_address=127.0.0.1 lifetime=600s
2024-01-26T05:01:02.650744Z INFO Eventloop::poll:handle_client_input{sender=127.0.0.1:39092 transaction_id="6445943a353d5e8c262a821f" allocation=AID-1 peer=127.0.0.1:41094 channel=16384}: relay: Successfully bound channel
2024-01-26T05:01:04.446317Z INFO Eventloop::poll: relay: num_allocations=1 num_channels=1 throughput=631.54 MB/s
2024-01-26T05:01:06.446319Z INFO Eventloop::poll: relay: num_allocations=1 num_channels=1 throughput=698.73 MB/s
2024-01-26T05:01:08.446325Z INFO Eventloop::poll: relay: num_allocations=1 num_channels=1 throughput=708.98 MB/s
2024-01-26T05:01:10.446324Z INFO Eventloop::poll: relay: num_allocations=1 num_channels=1 throughput=690.79 MB/s
2024-01-26T05:01:12.446316Z INFO Eventloop::poll: relay: num_allocations=1 num_channels=1 throughput=715.53 MB/s
2024-01-26T05:01:14.446315Z INFO Eventloop::poll: relay: num_allocations=1 num_channels=1 throughput=706.90 MB/s
2024-01-26T05:01:16.446313Z INFO Eventloop::poll: relay: num_allocations=1 num_channels=1 throughput=712.03 MB/s
2024-01-26T05:01:18.446319Z INFO Eventloop::poll: relay: num_allocations=1 num_channels=1 throughput=717.54 MB/s
2024-01-26T05:01:20.446316Z INFO Eventloop::poll: relay: num_allocations=1 num_channels=1 throughput=690.74 MB/s
2024-01-26T05:01:22.446313Z INFO Eventloop::poll: relay: num_allocations=1 num_channels=1 throughput=705.08 MB/s
2024-01-26T05:01:24.446311Z INFO Eventloop::poll: relay: num_allocations=1 num_channels=1 throughput=700.41 MB/s
2024-01-26T05:01:26.446319Z INFO Eventloop::poll: relay: num_allocations=1 num_channels=1 throughput=717.57 MB/s
2024-01-26T05:01:28.446320Z INFO Eventloop::poll: relay: num_allocations=1 num_channels=1 throughput=688.82 MB/s
2024-01-26T05:01:30.446329Z INFO Eventloop::poll: relay: num_allocations=1 num_channels=1 throughput=696.35 MB/s
2024-01-26T05:01:32.446317Z INFO Eventloop::poll: relay: num_allocations=1 num_channels=1 throughput=724.03 MB/s
2024-01-26T05:01:34.446320Z INFO Eventloop::poll: relay: num_allocations=1 num_channels=1 throughput=713.46 MB/s
2024-01-26T05:01:36.446314Z INFO Eventloop::poll: relay: num_allocations=1 num_channels=1 throughput=716.13 MB/s
2024-01-26T05:01:38.446327Z INFO Eventloop::poll: relay: num_allocations=1 num_channels=1 throughput=687.16 MB/s
2024-01-26T05:01:40.446315Z INFO Eventloop::poll: relay: num_allocations=1 num_channels=1 throughput=708.20 MB/s
2024-01-26T05:01:42.446314Z INFO Eventloop::poll: relay: num_allocations=1 num_channels=1 throughput=689.36 MB/s
2024-01-26T05:01:44.446314Z INFO Eventloop::poll: relay: num_allocations=1 num_channels=1 throughput=698.62 MB/s
2024-01-26T05:01:46.446315Z INFO Eventloop::poll: relay: num_allocations=1 num_channels=1 throughput=696.21 MB/s
2024-01-26T05:01:48.446378Z INFO Eventloop::poll: relay: num_allocations=1 num_channels=1 throughput=696.36 MB/s
2024-01-26T05:01:50.446314Z INFO Eventloop::poll: relay: num_allocations=1 num_channels=1 throughput=709.47 MB/s
2024-01-26T05:01:52.446319Z INFO Eventloop::poll: relay: num_allocations=1 num_channels=1 throughput=714.48 MB/s
2024-01-26T05:01:54.446323Z INFO Eventloop::poll: relay: num_allocations=1 num_channels=1 throughput=690.71 MB/s
2024-01-26T05:01:56.446313Z INFO Eventloop::poll: relay: num_allocations=1 num_channels=1 throughput=692.70 MB/s
2024-01-26T05:01:58.446321Z INFO Eventloop::poll: relay: num_allocations=1 num_channels=1 throughput=687.87 MB/s
2024-01-26T05:02:00.446316Z INFO Eventloop::poll: relay: num_allocations=1 num_channels=1 throughput=682.11 MB/s
2024-01-26T05:02:02.446312Z INFO Eventloop::poll: relay: num_allocations=1 num_channels=1 throughput=700.07 MB/s
```
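The throughput number itself is nothing fancy: bytes relayed divided by the wall-clock time since the last print. A sketch (not the relay's exact fields or formatting):

```rust
use std::time::Instant;

struct Stats {
    bytes_relayed: u64,
    last_print: Instant,
}

impl Stats {
    /// Called from the stats timer: report and reset the interval counters.
    fn print_and_reset(&mut self, num_allocations: usize, num_channels: usize) {
        let elapsed = self.last_print.elapsed().as_secs_f64();
        let throughput = self.bytes_relayed as f64 / elapsed; // bytes per second

        tracing::info!(num_allocations, num_channels, throughput_bytes_per_sec = throughput);

        self.bytes_relayed = 0;
        self.last_print = Instant::now();
    }
}
```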
Currently, only the gateway has a reconnect logic for (transient) errors
when connecting to the portal. Instead of duplicating this for the
relay, I moved the reconnect state machine to `phoenix-channel`. This
means the relay now automatically gets it too and in the future, the
clients will also benefit from it.
As a nice benefit, this also greatly simplifies the gateway's
`Eventloop` and removes a bunch of cruft with channels.
Resolves: #2915.
Previously, there was a misinterpretation of the spec that didn't allow
_different_ clients to use the same channel number. This is wrong
though. Because channel numbers are managed by clients, they must be
unique _per client_. This patch addresses this shortcoming.
I didn't include any dedicated tests for this. The fact that the
existing ones still work means the feature is overall working and the
data structure shows that the channels are now indeed unique per client.
There is another channel which we didn't yet increase in size, the one
between the allocation and the main task loop. Increasing to 1000 means
each allocation can potentially buffer 65MB of data. With the biggest
port range (16383 allocations), that would be a theoretical memory
consumption of ~1TB. But this would imply that we have 16383 connected
clients that all send data at max speed, saturating our downlink, while
our uplink is somehow ridiculously small. As long as up- and downlink
are roughly within the same ballpark, it should be impossible to
actually fill up these buffers.
I suspect that the current packet drops in the iperf test are happening
because on localhost, sending 10 UDP packets is so quick that tokio is
unable to wake up the task in time to empty the queue.
In addition to the increased channel size, I've also added a check for
the other channels to avoid writing to them in case they are not ready
for some reason.
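With `tokio::sync::mpsc`, "avoid writing when not ready" boils down to `try_send` and dropping the packet when the channel is full (sketch; the channel's payload type is illustrative):

```rust
use tokio::sync::mpsc::{error::TrySendError, Sender};

/// Forward a relayed packet to the main task loop without ever blocking the
/// hot path; if the receiver is lagging, dropping is preferable to stalling.
fn forward(tx: &Sender<Vec<u8>>, packet: Vec<u8>) {
    match tx.try_send(packet) {
        Ok(()) => {}
        Err(TrySendError::Full(_)) => {
            tracing::debug!("Dropping packet: channel to main task loop is full");
        }
        Err(TrySendError::Closed(_)) => {
            tracing::warn!("Main task loop is gone, dropping packet");
        }
    }
}
```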
---------
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
## Changelog
- Updates connlib parameter API_URL (formerly known under different
names as `CONTROL_PLANE_URL`, `PORTAL_URL`, `PORTAL_WS_URL`, and
friends) to be configured as an "advanced" or "hidden" feature at
runtime so that we can test production builds on both staging and
production.
- Makes `AUTH_BASE_URL` configurable at runtime too
- Moves `CONNLIB_LOG_FILTER_STRING` to be configured like this as well
and simplifies its naming
- Fixes a timing attack bug on Android when comparing the `csrf` token
- Adds proper account ID validation to Android to prevent invalid URL
parameter strings from being saved and used
- Cleans up a number of UI / view issues on Android regarding typos,
consistency, etc
- Hides vars from the `relay` CLI that we may not want to expose just
yet
- `get_device_id()` is flawed for connlib components -- SMBios is rarely
available. Data plane components now require a `FIREZONE_ID` instead, to
be used for upserting.
Fixes #2482. Fixes #2471.
---------
Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Gabi <gabrielalejandro7@gmail.com>
Fixes #2363
* Rename `relay` package to `firezone-relay` so that the binaries
produced match the `firezone-*` CLI naming scheme
* Rename `firezone-headless-client` package to `firezone-linux-client`
for consistency
* Add READMEs for user-facing CLI components (there will also be docs
later)