This will keep the files from going out of sync.
This PR also checks that the IPC service creates the IPC socket with
`root:firezone` as the owner and group, when running under systemd.
Now that we've worked out the flakiness from the iperf tests, we should
increase the UDP send rate so we have some benchmark of how many packets
we can actually handle before dropping.
For tests it doesn't hurt, but this will be used as a template for the
systemd service we ship to production, and that can't have the ID there.
So I'm also cleaning up a few other problems I noticed:
- I wanted to split the service files as part of #4531, so that the GUI
Client and headless Client can have separate sandbox rules. e.g, the
headless Client won't be allowed to create Unix domain sockets
- I'm punting more things to systemd, which allows us to tighten down
the sandbox further, e.g. creating `/var/lib/dev.firezone.client` and
`/run/dev.firezone.client` for us
- Closes#4461
---------
Signed-off-by: Reactor Scram <ReactorScram@users.noreply.github.com>
Closes#4682Closes#4691
```[tasklist]
# Before merging
- [x] Wait for `linux-group` test to go green on `main` (#4692)
- [ ] Wait for those browsers tests to get fixed
- [ ] *All* compatibility tests must pass on this branch
```
---------
Signed-off-by: Reactor Scram <ReactorScram@users.noreply.github.com>
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
Considered using Elixir and Rust to write the tests.
For Elixir, `wallaby` doesn't seem to have a way to attach to an
existing `chromium` instance, launching it each time, which makes it
hard to coordinate with the relay restart.
For Rust we considered `thirtyfour` which would be very nice since we
could test both firefox and chrome but each time it connects to the
instance it launches a new session making it hard to test the DNS cache
behavior.
We also considered `chrome_headless` for Rust it needs a small patch to
prevent it from closing the browser after `Drop` but it still presents a
problem, since it has no easy way to retrieve if loading a page has
succeeded. There are some workarounds such as retrieving the title that
we could have used but after some testing they are quite finnicky and we
don't want that for CI.
So I ended up settling for TypeScript but I'm open to other options, or
a fix for the previous ones!
There are some modifications still incoming for this PR, around the test
name and that sleep in the middle of the test doesn't look good so I
will probably add some retries, but the gist is here, will keep it in
draft until we expect it to be passing.
So feel free to do some initial reviews.
Note: the number of lines changed is greatly exaggerated by
`package.lock`
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil Bou Kheir <jamilbk@users.noreply.github.com>
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
Whenever we receive a `relays_presence` message from the portal, we
invalidate the candidates of all now disconnected relays and make
allocations on the new ones. This triggers signalling of new candidates
to the remote party and migrates the connection to the newly nominated
socket.
This still relies on #4613 until we have #4634.
Resolves: #4548.
---------
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
```[tasklist]
# Before merging
- [x] Remove file extension `.txt`
- [x] Wait for `linux-group` test to go green on `main` (#4692)
- [x] *all* compatibility tests must be green on this branch
```
Closes#4664Closes#4665
~~The compatibility tests are expected to fail until the next release is
cut, for the same reasons as in #4686~~
The compatibility test must be handled somehow, otherwise it'll turn
main red.
`linux-group` was moved out of integration / compatibility testing, but
the DNS tests do need the whole Docker + portal setup, so that one can't
move.
---------
Signed-off-by: Reactor Scram <ReactorScram@users.noreply.github.com>
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
Closes#4669
This should stop the problem of `linux-group` failing because of trying
to test an older release that doesn't have the right CLI features
---------
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
As part of #4568, we are adding a 2nd relay which showed some
short-comings of the current process state assertions because they were
running outside the docker containers, thus listing all relays as soon
as there are multiple.
```[tasklist]
### Before merging
- [x] Update KB
```
Maybe not a feature since Linux IPC isn't available to users yet?
I think it's okay if the new `linux-group` test fails in compatibility,
since it wasn't implemented at all back then.
Closes#4659Closes#4660
---------
Signed-off-by: Reactor Scram <ReactorScram@users.noreply.github.com>
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
Refs #4513
The next step after this is to use this to test security in the Linux
IPC code, it should reject any IPC commands from users not in the
`firezone` group.
Should drop our `systemd-analyze security` level from 9.7 to about 2.5.
We could go a little further, but it would take a lot more effort, and
this is a good starting point.
```[tasklist]
# Before review
- [x] Remove unused trap function in Bash
- [x] Remove `systemd-analyze` call
```
I wasn't aware of `set x` when I wrote this, and it looks good in the
other test scripts.
I'm not sourcing `lib.sh` yet, because I don't happen to need any
functions from it. I have other draft PRs that will probably end up
using it.
This is needed so that we can auto-update the systemd unit file, either
manually, or with a package manager like `apt`. We don't want users
cut-and-pasting these together on every update, and we don't want
machines doing it. Making the file updatable means we can make security
fixes to it easily.
Upon receiving a SIGTERM, we immediately disconnect from the websocket
connection to the portal and set a flag that we are shutting down.
Once we are disconnected from the portal and no longer have an active
allocations, we exit with 0. A repeated SIGTERM signal will interrupt
this process and force the relay to shutdown.
Disconnecting from the portal will (eventually) trigger a message to
clients and gateways that this relay should no longer be used. Thus,
depending on the timeout our supervisor has configured after sending
SIGTERM, the relay will continue all TURN operations until the number of
allocations drops to 0.
Currently, we also allow clients to make new allocations and refreshing
existing allocations. In the future, it may make sense to implement a
dedicated status code and refuse `ALLOCATE` and `REFRESH` messages
whilst we are shutting down.
Related: #4548.
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
Unfortunately I had to keep `linux-client` to get the compatibility
tests to pass. #4578 aims to remove that package.
Please add to this list if you think of anything:
```[tasklist]
# Things that may break that CI/CD won't catch
- [ ] Github release artifacts
- [ ] Knowledge base
- [ ] Docker images
- [ ] Docker containers
- [ ] Existing `linux-client` users
- [ ] Anything that downloads ghcr artifacts
- [ ] Nix (Not sure if it's built in CI. It had a merge conflict)
```
Refs #4515, and #3712, #3782
I think this is what Thomas and I agreed on in Slack / Github
---------
Signed-off-by: Reactor Scram <ReactorScram@users.noreply.github.com>
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
The clients, gateway and relay all employ an internal design that is
based on an eventloop. This gives us a lot of control in how various IO
components interact with each other. Great control also comes with a
source of bugs, the latest of which made the relay busy-loop once it
started relaying some traffic.
Eventloops are notoriously hard to unit-test because they compose
various IO bits together. Instead of writing unit tests, we can go and
assert the process state after the performance tests. Those generate a
fair bit of load on all our components but after that, they should
suspend.
The most effective tests survive even large refactorings and for that,
they need to be coded against a stable API / property. Asserting that
the process sleeps when it is idle from an application PoV is such a
property.
Related: #4511.
(After GA)
This adds a unit test for the Unix domain sockets that I intend to use
for process splitting on Linux.
The length-prefixed encoding and decoding are copied from `subzone`, but
most of that code will not be re-used since it's Windows-specific and
also specific to a Chromium-like process model, which won't work for
Firezone.
This is part of a yak shave towards CI testing of #3812
Moving the DNS control method out of `docker-compose.yml` and up to the
integration tests themselves allows us to test these scenarios:
- `systemd-resolved`
- `etc-resolv-conf`
- `systemd-resolved` but we're in a container where that won't work, so
we should gracefully degrade to just allowing IP/CIDR resources
- [x] Updated log level string for client and gateways to info or higher
- [x] Update logs to hide DNS information
I also removed `hickory_resolve` errors which could contain sensitive
info from our general error and hide the logs that specifically relates
to them.
@bmanifold double checking that the log levels in the gateway's `*.tf`
files are just used for our own gateways.
Also, the relays still have `debug`, since only we see that I think that
makes sense but double checking with @jamilbk
Fixes: #3618.
---------
Signed-off-by: Gabi <gabrielalejandro7@gmail.com>
Co-authored-by: Reactor Scram <ReactorScram@users.noreply.github.com>
This adds an integration test that downloads a 10MB file from a server
and simulates the client roaming to another network while the download
is active.
We use a DNS resource for this to ensure it also doesn't take too long
in that case. DNS resources are what most users will be using and we
clear some internal DNS caches on connection failures. Hence, using a
DNS resource here is a somewhat roundabout way to test that we aren't
failing and re-establishing the connection but migrate it to a new
network path.
Running as sudo / root causes a lot of problems for GUI programs, so
we're unwinding that. In this case we can go back to using Tauri's "open
URL" function, which is great.
Closes#4103
Refs #3713
Affects #3972 - I was finally able to debug it because it came up
constantly during this PR
I thought this was going to use `cargo-deb` but it was actually easy
with the Tauri deb bundling we already use.
```[tasklist]
### Before merging
- [x] Make sure every file in the Tauri deb is also in our deb (e.g. icons)
```
I think I used `-v` wrong here. I meant it to negate a regular grep, but
it will probably always return true since it negates the matching, not
the exit code.
---------
Signed-off-by: Reactor Scram <ReactorScram@users.noreply.github.com>
Right now it only works on my dev VM, not on my test VMs, due to #4053
and #4103, but it passes tests and should be safe to merge.
There's one doc fix and one script fix which are unrelated and could be
their own PRs, but they'd be tiny, so I left them in here.
Ref #4106 and #3713 for the plan to fix all this by splitting the tunnel
process off so that the GUI runs as a normal user.
Followup from #4100:
- Add `perf/relay` and `debug/relay` etc data plane images in
`firezone-staging`.
- The `perf` images are `debug` stage images and have tooling installed,
but use release binaries.
- The `debug` images are `debug` binaries inside `debug` images
- `firezone-prod` contains only release binaries -- these image names
haven't changed
Fixes some issues encountered after the merge of #4049
- Fix performance tests to only run using base_ref and head_ref to avoid
dependence on `main`
- Fixes some typos
- Prevents a catch-22 condition where breaking compatibility meant we
wouldn't be able to deploy production
Closes#3815
Changes that are breaking (but these aren't in production so it should
be okay)
- Windows, renaming `device_id.json` to `firezone-id.json` to match the
rest of the code
- Linux GUI, storing the firezone-id under `/var/lib` instead of under
`$HOME`
- Linux GUI, bails out if not run with `sudo --preserve-env` by
detecting `$HOME == root` or `$USER != root`
---------
Signed-off-by: Reactor Scram <ReactorScram@users.noreply.github.com>
Closes#3798
- After the crash test, the Windows smoke test runs `minidump-stackwalk`
to print a stack trace:
https://github.com/firezone/firezone/actions/runs/8100801373/job/22139592883#step:11:770
- This acts as runnable documentation for getting stack traces on
Windows, and it should flunk the test if anything in the crash handling
to stack trace pipeline is broken
- I also updated the comment in the code since the minidump PR I was
waiting on was put into their newest release
The CI tests aren't running for Linux just yet.
This organizes the well-known directories used on Linux and Windows for
logs, config, etc., and adds them to the (unused) Linux smoke test
Waiting on #3727
---------
Signed-off-by: Reactor Scram <ReactorScram@users.noreply.github.com>
I will need to set up the same paths for Linux, (#3734) and I want an
automated test to make sure everything gets into the right directories.
---------
Signed-off-by: Reactor Scram <ReactorScram@users.noreply.github.com>
~~Highlights the issue hypothesized in #3666~~
This tests that restarting a Relay won't cause sustained downtime.
Sleeps have been removed as they shouldn't necessary -- removing them
will better catch race conditions.
(Waiting on #3721)
Ubuntu is headless by default and needs `xvfb` to run Tauri in CI, hence
the difference.
---------
Signed-off-by: Reactor Scram <ReactorScram@users.noreply.github.com>