To avoid processing relay responses that were somehow altered on the
network path, we now use the client's `password` as a shared secret
with which the relay also authenticates its responses. This means that
not all messages can be authenticated. In particular, BINDING requests
will still be unauthenticated.
Performing this validation now requires every component that crafts
input to the `Allocation` to include a valid `MessageIntegrity`
attribute. This is somewhat problematic for the regression tests of the
relay and the unit tests of `Allocation`. In both cases, we implement
workarounds so we don't have to actually compute a valid
`MessageIntegrity`. This is deemed acceptable because:
- Both of these are just tests.
- We do test the validation path using `tunnel_test` because there we
run an actual relay.
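For illustration, STUN message integrity is an HMAC-SHA1 over the
message up to the `MESSAGE-INTEGRITY` attribute. A minimal sketch of
verifying it with the `hmac` and `sha1` crates, assuming the password
is used directly as the HMAC key (long-term credentials would first
derive a key from it):

```rust
use hmac::{Hmac, Mac};
use sha1::Sha1;

/// Verifies the HMAC carried in a `MESSAGE-INTEGRITY` attribute.
///
/// `message_prefix` is the encoded message up to (but excluding) the
/// attribute, with the length field adjusted as RFC 5389 requires.
fn verify_message_integrity(message_prefix: &[u8], received_mac: &[u8], password: &str) -> bool {
    let mut mac = Hmac::<Sha1>::new_from_slice(password.as_bytes())
        .expect("HMAC accepts keys of any length");
    mac.update(message_prefix);

    mac.verify_slice(received_mac).is_ok()
}
```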
Using the clippy lint `unwrap_used`, we can automatically lint against
all uses of `.unwrap()` on `Result` and `Option`. This actually turns
up quite a few results. In most cases, they are invariants that can't
actually be hit. For these, we change them to `Option`. In other cases,
they can actually be hit, for example if the user supplies an invalid
log-filter.
Activating this lint ensures the compiler will yell at us every time we
use `.unwrap()`, making us double-check whether we do indeed want to
panic there.
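For illustration, one way to activate the lint is a crate-level
attribute (whether it is set via an attribute, `Cargo.toml`, or the
command line is a detail not covered here):

```rust
// Crate root (e.g. lib.rs): reject `.unwrap()` everywhere in this crate.
#![deny(clippy::unwrap_used)]

fn parse_port(input: &str) -> u16 {
    // The lint rejects `input.parse().unwrap()`. Where we do want to
    // panic, `.expect()` documents that decision explicitly.
    input.parse().expect("port must be a valid u16")
}
```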
Resolves: #7292.
As suspected, there was a bug in the relay where channel bindings were
not cleared if the client freed the allocation early by sending a
REFRESH request with a lifetime of 0.
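A minimal sketch of the fix, with a hypothetical state layout: freeing
the allocation must also drop its channel bindings:

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

struct Allocation;

struct Server {
    allocations: HashMap<SocketAddr, Allocation>,
    // Channel bindings are keyed by (client, channel number).
    channel_bindings: HashMap<(SocketAddr, u16), SocketAddr>,
}

impl Server {
    /// Handles a REFRESH request with `LIFETIME: 0`.
    fn delete_allocation(&mut self, client: SocketAddr) {
        self.allocations.remove(&client);
        self.channel_bindings.retain(|(c, _), _| *c != client);
    }
}
```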
Resolves: #4588.
This one is a bit tricky. Our auth scheme requires knowing the current
time as a UNIX timestamp, which I can only get from `SystemTime`, not
`Instant`. The `Server` is meant to be sans-IO, including the current
time, so technically I would have to pass that in as a parameter.
I ended up settling on a compromise: the auth verification is impure
and internally calls `SystemTime::now`. That results in a much nicer
API and allows us to use `Instant` for everything else, e.g. expiry of
channel bindings, allocations etc.
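A rough sketch of that compromise (the names and the username format
are assumptions, not the actual scheme):

```rust
use std::time::{Instant, SystemTime, UNIX_EPOCH};

struct Server;

impl Server {
    /// Impure: internally reads the wall clock to check credential expiry.
    fn verify_credentials(&self, username: &str) -> bool {
        let now = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .expect("clock is after the UNIX epoch")
            .as_secs();

        // Assuming usernames of the form "<expiry-unix-timestamp>:<salt>".
        username
            .split(':')
            .next()
            .and_then(|ts| ts.parse::<u64>().ok())
            .is_some_and(|expiry| expiry > now)
    }

    /// Pure: expiry of allocations, channel bindings etc. is driven by
    /// the caller-provided `Instant`.
    fn handle_timeout(&mut self, _now: Instant) {}
}
```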
Resolves: #4464.
This required a mid-sized refactor of the relay's eventloop. The idea is
that we can use [`mio`](https://docs.rs/mio/latest/mio/) to do the
actual IO handling instead of `tokio`. `tokio` depends on `mio`
internally but doesn't expose its primitives. Most importantly, we don't
get access to the API where we can dynamically register file descriptors
to watch for readiness.
In order to avoid allocations on the relaying hotpath, we need to listen
on a dynamic number of sockets:
1. Our client-facing socket on port 3478
2. All sockets allocated by clients
`mio` is the building block underneath the async `tokio` runtime, hence
it does not provide async primitives itself. Instead, it blocks the
current thread it is running on and feeds you events that you need to
deal with.
We still need our `tokio` runtime to register timers and for
communication with the portal. To integrate the two, we spawn a
dedicated thread for `mio::Poll` and communicate with it via channels
within the `Sockets` abstraction. Thus, the `Eventloop` itself has no
idea that `mio` is used for all the network communication.
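A simplified sketch of that integration (the channel plumbing that the
`Sockets` abstraction would own): a dedicated thread blocks on
`mio::Poll` and ships readiness tokens over a channel:

```rust
use std::sync::mpsc;

use mio::{Events, Poll, Token};

/// Runs `mio::Poll` on a dedicated thread; readiness events arrive on
/// the returned channel.
fn spawn_poll_thread(mut poll: Poll) -> mpsc::Receiver<Token> {
    let (tx, rx) = mpsc::channel();

    std::thread::spawn(move || {
        let mut events = Events::with_capacity(1024);

        loop {
            poll.poll(&mut events, None).expect("failed to poll");

            for event in events.iter() {
                if tx.send(event.token()).is_err() {
                    return; // Receiver is gone; shut down the thread.
                }
            }
        }
    });

    rx
}
```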
Whenever `mio` sends us an event that a socket is ready, we try to read
from that specific socket. We must keep reading until it returns
`WouldBlock`, at which point we move on to the next event.
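A minimal sketch of draining a readable `mio` UDP socket until
`WouldBlock` (registration and the surrounding poll loop are omitted):

```rust
use std::io::ErrorKind;

use mio::net::UdpSocket;

fn drain_socket(socket: &UdpSocket, buf: &mut [u8]) -> std::io::Result<()> {
    loop {
        match socket.recv_from(buf) {
            Ok((len, sender)) => {
                // Hand the datagram to the `Server` here.
                let _ = (len, sender);
            }
            // The socket is drained; wait for the next readiness event.
            Err(e) if e.kind() == ErrorKind::WouldBlock => return Ok(()),
            Err(e) => return Err(e),
        }
    }
}
```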
We only register for read-readiness. If a socket is not ready for
writing, we just drop the packet.
With this design in place, we can now have a single buffer that we read
incoming packets into and dispatch to the `Server`, depending on which
port the packet was received on. A future refactoring could maybe even
unify these functions and let the `Server` deal with the ports
internally.
Resolves: #4366.
This is much more robust than the previous implementation because we now
go through all allocations and channels every time we get a
`handle_timeout` and clean up everything that is expired.
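A sketch of what that sweep looks like (types and fields are
hypothetical):

```rust
use std::collections::HashMap;
use std::time::Instant;

struct Allocation {
    expires_at: Instant,
}

struct ChannelBinding {
    expires_at: Instant,
}

struct Server {
    allocations: HashMap<u64, Allocation>,
    channel_bindings: HashMap<u16, ChannelBinding>,
}

impl Server {
    /// On every timeout, walk all state and drop whatever has expired.
    fn handle_timeout(&mut self, now: Instant) {
        self.allocations.retain(|_, a| a.expires_at > now);
        self.channel_bindings.retain(|_, c| c.expires_at > now);
    }
}
```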
Resolves: #4095.
Previously, we would allocate each message twice:
1. When receiving the original packet.
2. When forming the resulting channel-data message.
We can optimise this to only one allocation each by:
1. Carrying around the original `ChannelData` message for traffic from
clients to peers.
2. Pre-allocating enough space for the channel-data header for traffic
from peers to clients.
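For the peer-to-client direction, a sketch of the pre-allocation (the
4-byte header is channel number plus payload length, per RFC 8656; the
function name is made up):

```rust
const CHANNEL_DATA_HEADER_LEN: usize = 4;

/// Wraps `payload` in a channel-data message with a single allocation.
fn to_channel_data(channel: u16, payload: &[u8]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(CHANNEL_DATA_HEADER_LEN + payload.len());

    buf.extend_from_slice(&channel.to_be_bytes());
    buf.extend_from_slice(&(payload.len() as u16).to_be_bytes());
    buf.extend_from_slice(payload);

    buf
}
```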
Local flamegraphing still shows most user-space activity as
allocations. I did occasionally see a throughput of ~10GBps with these
patches. I'd still like to work towards #4095 to ensure we handle
anything time-sensitive better.
Previously, the relay neither scheduled a `Wake` command nor did it
register a `TimedAction` to expire a channel binding. Such an action was
only scheduled after the first refresh.
This PR fixes this and adds a test that asserts we can re-bind the same
channel to a different peer after 15 minutes.
Resolves: #3979.
Currently, there is a bug in the relay where the channel state of
different peers overlaps because the data isn't indexed correctly by
both peers and clients.
This PR fixes this, introduces more debug assertions (this bug was
caught by one) and also adds some new-type wrappers to avoid conflating
peers with clients.
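A sketch of the shape of the fix (names are hypothetical): new-type
wrappers keep client and peer addresses apart at the type level, and
channel state is additionally keyed by the client:

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct ClientSocket(SocketAddr);

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct PeerSocket(SocketAddr);

/// Channel numbers are only unique per client, so lookups in either
/// direction must include the client in the key.
struct ChannelBindings {
    peer_by_channel: HashMap<(ClientSocket, u16), PeerSocket>,
    channel_by_peer: HashMap<(ClientSocket, PeerSocket), u16>,
}
```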
Previously, we still had a hard-coded rule in the relay that would not
allow us to relay to an IPv6 peer. We can remove that and properly check
this based on the allocated addresses.
Resolves: #3405.
Fixes #2363
* Rename the `relay` package to `firezone-relay` so that the built
binaries match the `firezone-*` CLI naming scheme
* Rename the `firezone-headless-client` package to
`firezone-linux-client` for consistency
* Add READMEs for user-facing CLI components (there will also be docs
later)
To better take advantage of the OTEL ecosystem, we change our
Prometheus metrics to OTEL metrics. OTEL metrics are pushed to the
agent via the OTEL pipeline set up in
https://github.com/firezone/firezone/pull/1995 rather than pulled like
Prometheus metrics.
This means our `/metrics` endpoint is now gone which we previously
(ab)used as a health-check. I've added a dedicated `/healthz` endpoint.
This PR allows the port range used for TURN allocations to be
optionally configured via the `TURN_LOWEST_PORT` and
`TURN_HIGHEST_PORT` environment variables.
This will allow client app developers to test their apps against a
fully-working local development cluster in Docker Desktop for
Linux/macOS/Windows, allowing us to remove the PortalMock, Connlib Mock,
and SwiftMock codepaths entirely.
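A sketch of how these could be read (the fallback to TURN's suggested
49152-65535 range is an assumption, not necessarily the relay's actual
default):

```rust
use std::env;

/// Reads the allocation port range, falling back to the dynamic range
/// suggested by RFC 8656.
fn allocation_port_range() -> (u16, u16) {
    let lowest = env::var("TURN_LOWEST_PORT")
        .ok()
        .and_then(|s| s.parse().ok())
        .unwrap_or(49152);
    let highest = env::var("TURN_HIGHEST_PORT")
        .ok()
        .and_then(|s| s.parse().ok())
        .unwrap_or(65535);

    (lowest, highest)
}
```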
cc @roop @pratikvelani
This patch series adds support for IPv6 allocations. If not specified
otherwise in the ALLOCATE request, clients will get an IPv4 allocation.
They can also request an IPv6 allocation instead, or an additional IPv6
address on top of their IPv4 one.
Either of those is only possible if the relay actually has a listening
socket for the requested address family. The CLI is designed such that
the user can specify IPv4, IPv6 or both of them.
The `Server` component handles all of this logic and responds with
either a successful allocation response or an Address Family Not
Supported error (see
https://www.rfc-editor.org/rfc/rfc8656#name-stun-error-response-codes).
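A condensed sketch of that decision (the types are made up):

```rust
/// Error code 440 per RFC 8656's STUN error response codes.
const ADDRESS_FAMILY_NOT_SUPPORTED: u16 = 440;

enum RequestedFamily {
    V4,
    V6,
    DualStack,
}

struct Server {
    has_v4_socket: bool,
    has_v6_socket: bool,
}

impl Server {
    fn can_allocate(&self, requested: RequestedFamily) -> Result<(), u16> {
        let supported = match requested {
            RequestedFamily::V4 => self.has_v4_socket,
            RequestedFamily::V6 => self.has_v6_socket,
            RequestedFamily::DualStack => self.has_v4_socket && self.has_v6_socket,
        };

        if supported {
            Ok(())
        } else {
            Err(ADDRESS_FAMILY_NOT_SUPPORTED)
        }
    }
}
```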
Multiple refactorings were necessary to achieve this design; they are
all extracted into separate PRs:
Depends-On: #1831.
Depends-On: #1832.
Depends-On: #1833.
---------
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
The metrics are available at `http://{listen_addr}:8080/metrics`.
Currently, we collect the following:
- Number of active allocations: We can alert once the number of
allocations passes a certain threshold.
- Outcome (success / error) and message kind (allocation / channel_bind
/ ...) of all responses: Summing these up gives the total number of
requests handled. We might want a Grafana alert for an increased number
of error responses.
- Total number of bytes relayed: Dividing this by time gives us an
average "internal" bandwidth.
This is just a start; we can explore what else is useful as we operate
it.
Depends-On: https://github.com/firezone/firezone/pull/1743
Previously, the relay treated the `stamp_secret` internally as bytes and shared it with the outside world as a hex string. The portal, however, treats it as an opaque string and uses its UTF-8 bytes to create username and password.
This patch aligns the relay's behaviour with the portal and stores the `stamp_secret` internally as a string.
To complete the authentication scheme for the relay, we need to prompt
the client with a nonce when they send an unauthenticated request. The
semantic meaning of a nonce is opaque to the client. As a starting
point, we implement a count-based scheme. Each nonce is valid for 10
requests. After that, a request will be rejected with a 401 and the
client has to authenticate with a new nonce.
This scheme provides a basic form of replay-protection.
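A minimal sketch of the count-based scheme (names are hypothetical, and
representing nonces as UUIDs via the `uuid` crate is an assumption):

```rust
use std::collections::HashMap;

use uuid::Uuid;

const MAX_REQUESTS_PER_NONCE: u64 = 10;

#[derive(Default)]
struct Nonces {
    remaining_uses: HashMap<Uuid, u64>,
}

impl Nonces {
    /// Hands out a fresh nonce, e.g. together with a 401 response.
    fn new_nonce(&mut self) -> Uuid {
        let nonce = Uuid::new_v4();
        self.remaining_uses.insert(nonce, MAX_REQUESTS_PER_NONCE);

        nonce
    }

    /// Returns `false` if the nonce is unknown or used up, in which case
    /// the request is rejected with a 401 and a fresh nonce.
    fn try_use(&mut self, nonce: Uuid) -> bool {
        match self.remaining_uses.get_mut(&nonce) {
            Some(remaining) if *remaining > 0 => {
                *remaining -= 1;
                true
            }
            _ => {
                self.remaining_uses.remove(&nonce);
                false
            }
        }
    }
}
```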
We introduce dedicated types for each message that the `Server` can
handle. This allows us to make the functions public because the type
system now guarantees that these messages are either parsed from bytes
or constructed with the correct data.
The latter will be useful to write tests against a richer API.
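Illustrating the idea with a made-up message type: the only ways to
obtain one are parsing it from bytes or constructing it with valid
data, so public handler functions cannot be fed garbage:

```rust
/// An ALLOCATE request (the field is made up for illustration).
struct Allocate {
    requested_lifetime_secs: u32,
}

impl Allocate {
    /// Obtained by parsing from the wire...
    fn parse(bytes: &[u8]) -> Result<Self, &'static str> {
        let lifetime = bytes
            .get(..4)
            .and_then(|b| b.try_into().ok())
            .map(u32::from_be_bytes)
            .ok_or("message too short")?;

        Ok(Self {
            requested_lifetime_secs: lifetime,
        })
    }

    /// ...or by constructing it with the correct data, e.g. in tests.
    fn new(requested_lifetime_secs: u32) -> Self {
        Self {
            requested_lifetime_secs,
        }
    }
}
```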
With this patch, the relay can parse and respond to allocation
requests. I ran some basic tests against https://icetest.info/ and
implemented a regression test based on the logged data.
In writing this, I also had to slightly change the design of `Server`
(as expected). Event handlers for incoming data no longer return a
message directly. Instead, the caller is responsible for draining
`Command`s from the server.
When creating an allocation, we need to start listening on a new port.
This needs to happen outside the `Server` as I am going for a sans-IO
style. We emit a `Command` that instructs the main event loop to listen
on a new port. Any incoming data on that port will be forwarded to the
`Server`.
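A rough sketch of both pieces together (all names are hypothetical):
handlers queue `Command`s, and the event loop drains and executes them,
performing all actual IO itself:

```rust
use std::collections::VecDeque;
use std::net::{SocketAddr, UdpSocket};

enum Command {
    SendMessage { payload: Vec<u8>, recipient: SocketAddr },
    AllocateAddresses { port: u16 },
}

#[derive(Default)]
struct Server {
    pending_commands: VecDeque<Command>,
}

impl Server {
    fn handle_client_input(&mut self, _bytes: &[u8], _sender: SocketAddr) {
        // Parsing and state transitions happen here; any output or side
        // effect is queued as a `Command` instead of being performed.
    }

    fn next_command(&mut self) -> Option<Command> {
        self.pending_commands.pop_front()
    }
}

fn event_loop(server: &mut Server, socket: &UdpSocket) -> std::io::Result<()> {
    let mut buf = [0u8; 65536];

    loop {
        let (len, sender) = socket.recv_from(&mut buf)?;
        server.handle_client_input(&buf[..len], sender);

        while let Some(command) = server.next_command() {
            match command {
                Command::SendMessage { payload, recipient } => {
                    socket.send_to(&payload, recipient)?;
                }
                Command::AllocateAddresses { port } => {
                    // Bind a new socket for the allocation here.
                    let _ = port;
                }
            }
        }
    }
}
```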
At the moment, incoming data on such a port is just dropped. This is
actually standards-compliant because we cannot handle binding requests
yet, which would allow this data to be forwarded to the client.
In some areas, the code is still a bit rough but I expect to iron those
things out as we go along.
This is an alternative to https://github.com/firezone/firezone/pull/1602
that implements the server using a library I've found called
`stun_codec`.
It already has support for parsing a variety of attributes.
The following is a nice website to test some of the functionality:
https://icetest.info/
The server is still listening on:
`ec2-3-89-112-240.compute-1.amazonaws.com:3478`.