When `snownet` was first being developed, these tests ensured that
hole-punching as well as connectivity via a relay works correctly. We
have since added extensive tests via `tunnel_test` that ensure
connectivity works in many scenarios. `tunnel_test` does not (yet) have
a simulated NAT, so hole-punching itself is not covered by it.
UDP hole-punching is shockingly trivial though, because all you need to
do is send UDP packets to the same socket that the other party is
sending from. This isn't done by our own code but rather by str0m's
implementation of ICE (as long as we add the correct candidates).
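To illustrate the idea, here is a minimal, hypothetical sketch in plain
Rust (the peer address is a placeholder; in our stack str0m's ICE agent
drives this, not hand-written code like the below):

```rust
use std::net::UdpSocket;

fn main() -> std::io::Result<()> {
    // One socket for both sending and receiving: the outbound packet
    // creates a NAT mapping that the peer's packets can then traverse.
    let socket = UdpSocket::bind("0.0.0.0:0")?;

    // The peer's public address, e.g. learned out-of-band via STUN.
    let peer = "203.0.113.10:51820";

    socket.send_to(b"punch", peer)?; // opens the mapping on our NAT

    let mut buf = [0u8; 1500];
    let (len, from) = socket.recv_from(&mut buf)?; // arrives through the mapping
    println!("received {len} bytes from {from}");
    Ok(())
}
```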
The `snownet-tests` themselves are quite fragile because they need to
set up their own event loop and manually construct an IP packet. To my
knowledge, they haven't caught a single bug, so I am proposing to delete
them to ease maintenance.
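For a sense of the manual packet construction involved, here is a
hedged sketch of how such an IP packet could be built with the
`etherparse` crate (addresses, ports and payload are placeholders, not
what the tests actually send):

```rust
use etherparse::PacketBuilder;

fn build_packet() -> Vec<u8> {
    // src, dst, TTL, then UDP src/dst ports; all values are made up.
    let builder = PacketBuilder::ipv4([10, 0, 0, 1], [10, 0, 0, 2], 64).udp(1234, 5678);

    let payload = b"intent";
    let mut packet = Vec::with_capacity(builder.size(payload.len()));
    builder
        .write(&mut packet, payload)
        .expect("writing to a Vec cannot fail");
    packet
}
```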
For example, in
https://github.com/firezone/firezone/actions/runs/10449965474/job/28948590058?pr=6335
the tests fail because we no longer directly force a handshake when the
connection is established. This is unnecessary now because the buffered
intent packet will directly force a handshake from the client to the
gateway. Yet, the `snownet-tests` event loop would need to be adjusted
to do that as well.
Without masquerading, packets sent by the gateway through the TUN
interface carry the wrong source address: the TUN device's address
instead of that of the gateway's actual network interface.
We set this env variable in all our uses of the gateway, so we might as
well remove it and always perform masquerading unconditionally.
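As a purely conceptual sketch of what masquerading does (hypothetical
types and addresses; the real mechanism is a NAT rule on the gateway,
not Rust code):

```rust
use std::net::Ipv4Addr;

// Hypothetical minimal packet representation, for illustration only.
struct Packet {
    source: Ipv4Addr,
    destination: Ipv4Addr,
}

// Masquerading rewrites the source to the egress interface's address so
// that replies route back to the gateway instead of a TUN-only address.
fn masquerade(mut packet: Packet, egress_addr: Ipv4Addr) -> Packet {
    packet.source = egress_addr;
    packet
}
```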
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Reactor Scram <ReactorScram@users.noreply.github.com>
We considered using Elixir and Rust to write the tests.
For Elixir, `wallaby` doesn't seem to have a way to attach to an
existing `chromium` instance; it launches a new one each time, which
makes it hard to coordinate with the relay restart.
For Rust, we considered `thirtyfour`, which would be very nice since we
could test both Firefox and Chrome. However, each time it connects to
the instance it launches a new session, making it hard to test the DNS
cache behavior.
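For reference, a minimal `thirtyfour` sketch (the WebDriver URL and
target page are placeholder assumptions); every `WebDriver::new` call
negotiates a fresh session, which is exactly the limitation described
above:

```rust
use thirtyfour::prelude::*;

#[tokio::main]
async fn main() -> WebDriverResult<()> {
    let caps = DesiredCapabilities::chrome();

    // Connecting creates a brand-new WebDriver session, so browser state
    // such as the DNS cache does not carry over between test runs.
    let driver = WebDriver::new("http://localhost:4444", caps).await?;

    driver.goto("http://example.com").await?;
    println!("loaded page with title: {}", driver.title().await?);

    driver.quit().await?;
    Ok(())
}
```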
We also considered `chrome_headless` for Rust. It needs a small patch
to prevent it from closing the browser on `Drop`, but it still presents
a problem: it has no easy way to tell whether loading a page succeeded.
There are some workarounds, such as retrieving the title, that we could
have used, but after some testing they turned out to be quite finicky,
and we don't want that in CI.
So I ended up settling for TypeScript but I'm open to other options, or
a fix for the previous ones!
There are some modifications still incoming for this PR: the test name
needs changing, and the sleep in the middle of the test doesn't look
good, so I will probably add some retries. But the gist is here; I will
keep it in draft until we expect it to be passing.
So feel free to do some initial reviews.
Note: the number of lines changed is greatly exaggerated by
`package.lock`
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil Bou Kheir <jamilbk@users.noreply.github.com>
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
As part of #4568, we are adding a 2nd relay, which exposed some
shortcomings of the current process state assertions: they were running
outside the Docker containers and thus listed all relays as soon as
there was more than one.
These customizations were from before we used `cargo cross` for all
architectures in CI.
1.77.1 has been tested to work with the following clients:
- [x] Apple
- [x] Android
- [x] Windows
- Runs release asset builds simultaneously with `deploy-staging`. Those
don't depend on each other.
- Prevents running some build workflows in CD because they already run
in the PR and in the merge group, and the risk of a semantic conflict
is negligible
- Run `release` assets in staging
- Adds `compatibility_tests`: **To successfully introduce a breaking
change in the control / data plane APIs, you must now "Merge as
Administrator"**
- Since `CI` is no longer run on `main`, caching needed to be refactored
to make sense again
- Since `CI` is no longer run on `main`, the Elixir
`migrations_and_seeds_test` had to be rewritten. This now tests
migrations using `git checkout` instead of importing `main`'s DB dump.
- Move Tauri builds to their own workflow so we can trigger Linux and
Windows builds manually on an ad-hoc basis, like we do for the Swift
and Kotlin builds
- Add a new `hotfix` workflow that will run `compatibility_tests` with
the latest published images
- Add `workflow_dispatch` to trigger `CD` manually for testing purposes
(cc @ReactorScram)
Refs #3995
I don't know much about `cargo chef` so I gave this its own PR in case
I'm doing something that'll subtly break it
I've run into this problem on some branches and not others, where it's
trying to build all the Tauri / glib stuff even though the Docker image
won't need it:
https://github.com/firezone/firezone/actions/runs/8012206575/job/21887478015#step:7:1175
This splits off the easy parts from #3605.
- Add quotes around `PHOENIX_SECURE_COOKIES` because my local
`docker-compose` considers unquoted 'false' to be a schema error (env
vars are strings or numbers, not bools, it says)
- Create `test.httpbin.docker.local` container in a new subnet so it can
be used as a DNS resource without the existing CIDR resource picking it
up
- Add resources and policies to `seeds.exs` per #3342
- Fix warning about `CONNLIB_LOG_UPLOAD_INTERVAL_SECS` not being set
- Add `resolv-conf` dep and unit tests to `firezone-tunnel` and
`firezone-linux-client`
- Impl `on_disconnect` in the Linux client with `tracing::error!` (see
the sketch after this list)
- Add comments
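A hedged sketch of what the `on_disconnect` impl boils down to; the
trait and error type below are hypothetical stand-ins, as connlib's
actual callback signatures may differ:

```rust
use tracing::error;

// Hypothetical stand-ins for connlib's callback trait and error type.
struct DisconnectError(String);

trait Callbacks {
    fn on_disconnect(&self, error: Option<&DisconnectError>);
}

struct CallbackHandler;

impl Callbacks for CallbackHandler {
    fn on_disconnect(&self, error: Option<&DisconnectError>) {
        // The headless Linux client has no UI to surface this, so log it.
        error!(
            "tunnel disconnected: {}",
            error.map(|e| e.0.as_str()).unwrap_or("unknown reason")
        );
    }
}
```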
```[tasklist]
- [x] (failed) Confirm that the client container actually does stop faster this way
- [x] Wait for tests to pass
- [x] Mark as ready for review
```
Currently, `firezone-connection` can only handle connections on a LAN.
Via the use of a STUN server, we can discover our public IP and attempt
a direct, hole-punched connection across multiple subnets.
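To illustrate just the discovery step (not how `firezone-connection`
implements it internally), here is a minimal, self-contained sketch of
a STUN binding request per RFC 5389; the public STUN server shown is
only an example:

```rust
use std::net::UdpSocket;

fn main() -> std::io::Result<()> {
    let socket = UdpSocket::bind("0.0.0.0:0")?;

    // 20-byte header: type 0x0001 (Binding Request), length 0,
    // magic cookie 0x2112A442 and a 12-byte transaction ID.
    let mut request = [0u8; 20];
    request[0..2].copy_from_slice(&0x0001u16.to_be_bytes());
    request[4..8].copy_from_slice(&0x2112A442u32.to_be_bytes());
    request[8..20].copy_from_slice(b"0123456789ab"); // should be random

    socket.send_to(&request, "stun.l.google.com:19302")?;

    let mut response = [0u8; 1500];
    let (len, _) = socket.recv_from(&mut response)?;

    // Scan attributes for XOR-MAPPED-ADDRESS (0x0020), IPv4 only.
    let mut i = 20;
    while i + 4 <= len {
        let attr_type = u16::from_be_bytes([response[i], response[i + 1]]);
        let attr_len = u16::from_be_bytes([response[i + 2], response[i + 3]]) as usize;
        if attr_type == 0x0020 && attr_len == 8 {
            // Port and address are XOR'd with the magic cookie.
            let port = u16::from_be_bytes([response[i + 6], response[i + 7]]) ^ 0x2112;
            let cookie = 0x2112A442u32.to_be_bytes();
            let ip: Vec<u8> = response[i + 8..i + 12]
                .iter()
                .zip(cookie)
                .map(|(byte, mask)| byte ^ mask)
                .collect();
            println!("public address: {}.{}.{}.{}:{port}", ip[0], ip[1], ip[2], ip[3]);
        }
        i += 4 + attr_len + (4 - attr_len % 4) % 4; // attributes are 32-bit aligned
    }
    Ok(())
}
```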
Turns out #3276 was only part of the problem. After that was fixed, the
issue did turn out to be the statically-linked libc runtime. Staging was
using dynamic linking and so didn't hit the issue.
This reverts back to musl which, as @AndrewDryga noted, has been tested.
When moving from Debian to Alpine we stopped installing `curl`, which
is needed to fetch the relay's public IPv4 and IPv6 addresses in
`docker-init.sh`.
Signed-off-by: Gabi <gabrielalejandro7@gmail.com>
- rebuild and publish gateway and relay binaries to currently drafted
release
- re-tag current relay/gateway images and push to ghcr.io
Stacked on #2341 to prevent conflicts
Fixes #2223 Fixes #2205 Fixes #2202 Fixes #2239
~~Still TODO: `arm64` images and binaries...~~ Edit: added via
`cross-rs`
Copying the `rust-toolchain.toml` file in is one thing, but if we want
to avoid repeatedly installing the toolchain, we should install it in
the same layer too.
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
This patch-set aims to make several improvements to our CI caching:
1. Use of registry as build cache: Pushes a separate image to our docker
registry at GCP that contains the cache layers. This happens for every
PR & main. As a result, we can restore from **both** which should make
repeated runs of CI on an individual PR faster and give us a good
baseline cache for new PRs from `main`. See
https://docs.docker.com/build/ci/github-actions/cache/#registry-cache
for details. As a nice side-effect, this allows us to use the 10 GB we
have on GitHub actions for other jobs.
2. We make better use of `restore-keys` by also attempting to restore
the cache if the fingerprint of our lockfiles doesn't match. This is
useful for CI runs that upgrade dependencies: those will restore a
cache that is still useful even though it doesn't quite match. That is
better[^1] than not hitting the cache at all.
3. There were two tiny bugs in our Swift and Android builds:
a. We used `rustup show` in the wrong directory and thus did not
actually install the toolchain properly.
b. We used `shared-key` instead of `key` for the
https://github.com/Swatinem/rust-cache action and thus did not
differentiate between jobs properly.
4. Our Dockerfile for Rust had a bug where it did not copy in the
`rust-toolchain.toml` file in the `chef` layer and thus also did not use
the correct toolchain.
5. We remove the dedicated Gradle cache because the build action already
comes with a cache configuration:
https://github.com/firezone/firezone/actions/runs/6416847209/job/17421412150#step:10:25
[^1]: Over time, this may mean that our caches grow a bit. In an ideal
world, we automatically remove files from the caches that haven't been
used in a while. The cache action we use for Rust does that
automatically:
https://github.com/Swatinem/rust-cache?tab=readme-ov-file#cache-details.
As a workaround, we can just purge all caches every now and then.
---------
Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
This PR fixes `docker compose up`. It doesn't have the test client ->
resource flow working yet, but it prevents anything from erroring at
startup.
This fixes:
* tokens (use the correct token for the client user agent we are using)
* randomize `name_suffix` at startup for connlib (we will eventually
allow options to set it manually)
* remove port ranges for the relay (see firezone/product#613)
When using `docker compose build` or any other way of building Docker
images in parallel, the way the cache was used in the Rust Dockerfile
made the caches of different images overlap and corrupt each other. We
add the `locked` sharing mode, which prevents multiple writers to the
same cache, to fix this behaviour.
This brings connlib from its own separate repo into Firezone's
monorepo. On top of bringing in connlib, we also add and unify the
Dockerfile for all Rust binaries and add a docker-compose file that can
run a headless client, a relay and a gateway, which will eventually test
the whole flow between a client and a resource. For this to work, we
also incorporated some Elixir scripts to generate portal tokens for
those components.