Our relays are essential for connectivity because they also perform STUN for us, through which we learn our server-reflexive address. Thus, we must at all times have at least one reachable relay in order to establish a connection. The portal tracks connectivity to the relays for us and, in case any of them go down, sends us a `relays_presence` message, meaning we can stop using that relay and migrate any relayed connections to a new one.

This works well as long as we are connected to the portal while the relay is rebooting / going down. If we are not currently connected to the portal when a relay we are using reboots, we don't learn about it. If we are actively using it, the connection will at least fail: further attempted communication with the relay will time out and we will stop using it. If we aren't currently using the relay, this gets a bit trickier: if it rebooted while we were partitioned from the portal, logging in again might return the same relay to us in the `init` message, but this time with different credentials.

The first bug that we are fixing in this PR is that we previously ignored those credentials because we already knew about the relay, assuming that we could keep using our existing credentials. The fix here is to also compare the credentials and ditch the local state if they differ.

The second bug, identified while fixing the first one, is that we need to proactively probe whether all other relays that we know about are actually still responsive. For that, we issue a `REFRESH` message to them. If that message times out or otherwise fails, we remove that relay from our list of `Allocation`s too.

Fixing the second bug required several changes:

1. We lower the log level of `Disconnecting from relay` from ERROR to WARN. Any ERROR emitted during a test run fails our test suite, which partially motivated this change: the test suite builds on the assumption that ERRORs are fatal and thus should never happen during our tests. This change surfaces that disconnecting from a relay can indeed happen during normal operation, which justifies lowering it to WARN. Users should monitor at least on WARN to be alerted about problems.
2. We reduce the total backoff duration for requests to relays from 60s to 10s. The current 60s amount to a total of 8 retries. UDP is unreliable, but it isn't THAT unreliable to justify retrying everything for 60s. We also use a 10s timeout for ICE, so the two timeouts are now aligned. We had to change the max backoff duration because we only idle-spin for at most 10s in the tests, and thus the current 60s were too long to detect that a relay actually disappeared.
3. We had to shuffle around some function calls to make sure all intermediary event buffers are emptied at the right point in time, making the test deterministic.

Fixes: #6648.
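To illustrate the first fix, here is a minimal sketch of the credential comparison, assuming a simplified `Allocation` that only carries its credentials. The names `Credentials` and `upsert_relay` are hypothetical; this is not the actual Firezone code, just the shape of the fix:

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

/// Relay credentials as handed out by the portal (simplified).
#[derive(Clone, PartialEq, Eq)]
struct Credentials {
    username: String,
    password: String,
}

/// Simplified stand-in for the per-relay `Allocation` state.
struct Allocation {
    credentials: Credentials,
}

/// Upsert a relay learned from the portal's `init` message.
///
/// Previously, a relay we already knew about was skipped entirely,
/// keeping the old credentials. The fix: if the portal hands us
/// different credentials for a known relay, it must have rebooted
/// while we were partitioned, so the local state is stale.
fn upsert_relay(
    allocations: &mut HashMap<SocketAddr, Allocation>,
    addr: SocketAddr,
    credentials: Credentials,
) {
    match allocations.get(&addr) {
        // Known relay with unchanged credentials: keep existing state.
        Some(existing) if existing.credentials == credentials => {}
        // New relay, or a known relay with different credentials:
        // ditch the local state and start over.
        _ => {
            allocations.insert(addr, Allocation { credentials });
        }
    }
}
```

The sketch only shows the comparison; in the actual fix, discarding the stale state also implies re-establishing anything that was relayed through that allocation.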
# Rust development guide
Firezone uses Rust for all data plane components. This directory contains the Linux and Windows clients, and low-level networking implementations related to STUN/TURN.
We target the latest stable release of Rust using `rust-toolchain.toml`.
If you are using `rustup`, this is handled for you automatically.
Otherwise, ensure you have the latest stable version of Rust installed.
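For reference, a minimal `rust-toolchain.toml` pinning the stable channel could look like this (a sketch; the actual file in this repository may differ, e.g. by pinning a concrete version or extra components):

```toml
# Picked up automatically by rustup when running cargo/rustc in this repo.
[toolchain]
channel = "stable"
```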
## Reading Client logs
The Client logs are written as JSONL for machine readability.
To make them more human-friendly, pipe them through `jq` like this:
```sh
cd path/to/logs # e.g. $HOME/.cache/dev.firezone.client/data/logs on Linux
cat *.log | jq -r '"\(.time) \(.severity) \(.message)"'
```
Resulting in, e.g.:

```
2024-04-01T18:25:47.237661392Z INFO started log
2024-04-01T18:25:47.238193266Z INFO GIT_VERSION = 1.0.0-pre.11-35-gcc0d43531
2024-04-01T18:25:48.295243016Z INFO No token / actor_name on disk, starting in signed-out state
2024-04-01T18:25:48.295360641Z INFO null
```
## Benchmarking on Linux
The recommended way to benchmark any of the Rust components is Linux's `perf` utility.
For example, to attach to a running application, do:
- Ensure the binary you are profiling is compiled with the `bench` profile.
- `sudo perf record -g --freq 10000 --pid $(pgrep <your-binary>)`
- Run the speed test or whatever load-inducing task you want to measure.
- `sudo perf script > profile.perf`
- Open [profiler.firefox.com](https://profiler.firefox.com) and load `profile.perf`.
Instead of attaching to a process with `--pid`, you can also pass the path to the executable directly, as shown below.
That is useful if you want to capture `perf` data for a test or a micro-benchmark.
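For example (a sketch using the same flags as above; substitute the path to your binary):

```sh
# Records a profile for the entire lifetime of the given command;
# perf stops recording when the process exits.
sudo perf record -g --freq 10000 <path-to-your-binary>
```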