Commit Graph

28 Commits

Author SHA1 Message Date
Thomas Eizinger
6a538368cb feat(gateway): add flow-logs MVP (#10576)
Network flow logs are a common feature of VPNs. Due to the nature of a
shared exit node, it is of great interest to a network analyst which
TCP connections are getting routed through the tunnel, who is initiating
them, how long they last and how much traffic is sent across them.

With this PR, the Firezone Gateway gains the ability of detecting the
TCP and UDP flows that are being routed through it. The information we
want to attach to these flows is spread out over several layers of the
packet handling code. To simplify the implementation and not complicate
the APIs unnecessarily, we chose to rely on TLS (thread-local storage)
for gathering all the necessary data as a packet gets passed through the
various layers. When using a const initializer, the overhead of a TLS
variable over an actual local variable is basically zero. The entire
routing state of the Gateway is also never sent across any threads,
making TLS variables a particularly good choice for this problem.
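
As a rough illustration of the approach described above, the sketch below uses a `thread_local!` with a `const` initializer to collect data about a packet as it passes through several layers. The `FlowContext` type and all function names are hypothetical and are not taken from the Gateway's code.

```rust
use std::cell::Cell;
use std::net::IpAddr;

/// Hypothetical per-packet context; the real Gateway types may look different.
#[derive(Clone, Copy, Debug, PartialEq)]
struct FlowContext {
    src: Option<IpAddr>,
    dst: Option<IpAddr>,
    payload_bytes: u64,
}

impl FlowContext {
    const EMPTY: FlowContext = FlowContext { src: None, dst: None, payload_bytes: 0 };
}

thread_local! {
    // A `const` initializer avoids the lazy-initialization check on every access.
    static CURRENT: Cell<FlowContext> = const { Cell::new(FlowContext::EMPTY) };
}

/// Called by an (assumed) IP layer once the addresses are known.
fn record_addresses(src: IpAddr, dst: IpAddr) {
    CURRENT.with(|c| {
        let mut ctx = c.get();
        ctx.src = Some(src);
        ctx.dst = Some(dst);
        c.set(ctx);
    });
}

/// Called by an (assumed) transport layer once the payload length is known.
fn record_payload(len: usize) {
    CURRENT.with(|c| {
        let mut ctx = c.get();
        ctx.payload_bytes += len as u64;
        c.set(ctx);
    });
}

/// Takes the collected data and resets the slot for the next packet.
fn finish_packet() -> FlowContext {
    CURRENT.with(|c| c.replace(FlowContext::EMPTY))
}
```

None of the layers needs an extra parameter in its signature; each one writes what it knows, and the outermost caller collects the result with `finish_packet()`.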

In its MVP form, the detected flows are only emitted on stdout, and
only if `flow_logs=trace` is set using `RUST_LOG`. Early adopters
of this feature are encouraged to enable these logs as described and
then ingest the Gateway's logs into the SIEM of their choice for further
analysis.

Related: #8353
2025-10-22 03:10:21 +00:00
Brian Manifold
27565ea5c8 refactor(portal): remove soft delete elements from portal code (#10607)
Why:

* In previous commits, the portal code had been updated to use hard
deletion rather than soft deletion of data. The fields used in the soft
deletion were still kept in the DB and the code to allow for zero
downtime rollout and an easy rollback if necessary. To continue with
that work the portal code has now been updated to remove any reference
to the soft deleted fields (e.g. deleted_at, persistent_id, etc...).
While the code has been updated the actual data in the DB will need to
remain for now, to once again allow for a zero downtime rollout. Once
this commit has been deployed to production another PR can follow to
remove the columns from the necessary tables in the DB.


Related: #8187
2025-10-18 17:02:26 +00:00
Thomas Eizinger
b11adfcfe4 feat(connlib): create flow on ICMP error "prohibited" (#10462)
In Firezone, a Client requests an "access authorization" for a Resource
on the fly when it sees the first packet for said Resource going through
the tunnel. If we don't have a connection to the Gateway yet, this is
also where we will establish a connection and create the WireGuard
tunnel.

In order for this to work, the access authorization state between the
Client and the Gateway MUST NOT get out of sync. If the Client thinks it
has access to a Resource, it will just route the traffic to the Gateway.
If the access authorization on the Gateway has expired or vanished
otherwise, the packets will be black-holed.

Starting with #9816, the Gateway sends ICMP errors back to the
application whenever it filters a packet. This can happen either because
the access authorization is gone or because the traffic wasn't allowed
by the specific filter rules on the Resource.

With this patch, the Client will attempt to create a new flow (i.e.
re-authorize traffic) for this resource whenever it sees such an ICMP
error, thereby synchronizing the view of the world between Client and
Gateway should they ever get out of sync.
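
The reaction to such an ICMP error could be sketched roughly as follows. This is a simplified model, not connlib's actual state machine: `ClientState`, `Action`, the numeric resource IDs and the flag name are all hypothetical.

```rust
use std::collections::HashSet;

/// What the (hypothetical) event loop should do next.
#[derive(Debug, PartialEq)]
enum Action {
    None,
    RequestNewFlow(u32),
}

/// Simplified client-side view of which resources are authorized.
struct ClientState {
    authorized: HashSet<u32>,
    /// Assumed name for the feature flag that gates the new behaviour.
    reauthorize_on_icmp_error: bool,
}

impl ClientState {
    fn new(reauthorize_on_icmp_error: bool) -> Self {
        ClientState { authorized: HashSet::new(), reauthorize_on_icmp_error }
    }

    fn authorize(&mut self, resource: u32) {
        self.authorized.insert(resource);
    }

    /// Handles an ICMP "prohibited" error for traffic to `resource`.
    fn on_icmp_prohibited(&mut self, resource: u32) -> Action {
        if !self.reauthorize_on_icmp_error {
            return Action::None;
        }
        // Forget the stale authorization and ask for a fresh one. A second
        // error for the same resource does not trigger a duplicate request.
        if self.authorized.remove(&resource) {
            Action::RequestNewFlow(resource)
        } else {
            Action::None
        }
    }
}
```

In this sketch the client trusts the Gateway's ICMP error over its own cached state, which is what lets the two sides converge again.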

Testing turned out to be a bit tricky. If we let the authorization on
the Gateway lapse naturally, the portal will also toggle the Resource off
and on on the Client, resulting in "flushing" the current
authorizations. Additionally, if the Client only had access to one
Resource, then the Gateway will gracefully close the connection, also
resulting in the Client creating a new flow for the next packet.

To actually trigger this new behaviour we need to:

- Access at least two resources via the same Gateway
- Directly send `reject_access` to the Gateway for this particular
resource

To achieve this, we dynamically eval some code on the API node and
instruct the Gateway channel to send `reject_access`. The connection
stays intact because there is still another active access authorization
but packets for the other resource are answered with ICMP errors.

To achieve a safe roll-out, the new behaviour is feature-flagged. In
order to still test it, we now also allow feature flags to be set via
env variables.

Resolves: #10074

---------

Co-authored-by: Mariusz Klochowicz <mariusz@klochowicz.com>
2025-09-30 08:23:39 +00:00
Thomas Eizinger
83171d3a2d ci: add integration test for graceful Gateway shutdown (#10077)
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
2025-09-10 23:41:55 +00:00
Thomas Eizinger
d1d46fdfb4 ci: create a more realistic network setup (#10301)
Currently, the setup we have in docker-compose does not reflect
real-world scenarios very well because most components share the same
subnet. In reality, Clients, Gateways, relays and the backend are all in
separate subnets, connected via multiple routers on the Internet.

The current setup makes it hard to properly test relayed connections. To
fix this, we move all components into their own subnet with a dedicated
router container that performs source and destination NAT as well as
acts as a firewall for the client and gateway containers to not allow
inbound traffic.

This setup will allow us to more easily test #10286 which requires port
randomization for outgoing traffic on the Client and Gateway side.
2025-09-10 23:37:16 +00:00
Jamil
0ccd4bbf24 feat(ci): enable relay eBPF offloading (#10160)
In CI, eBPF in driver mode actually functions just fine with no changes
to our existing tests, given we apply a few workarounds and bugfixes:

- The interface learning mechanism had two flaws: (1) it only learned
per-CPU, which meant the risk for a missing entry grew as the core count
of the relay host grew, and (2) it did not filter for unicast IPs, so it
picked up broadcast and link-local addresses, causing cross-relay paths
to fail occasionally
- The `relay-relay` candidate where the two relays are the same relay
causes packet drops / loops in the Docker bridge setup, and possibly in
GCP too. I'm not sure this is a valid path that solves a real
connectivity issue in the wild. I can understand relay-relay paths where
two relays are different hosts, and the client and gateway both talk
over their TURN channel to each other (i.e. WireGuard is blocked in each
of their networks), but I can't think of an advantage for a relay-relay
candidate where the traffic simply hairpins (or is dropped) off the
nearest switch. This is now detected with a new `PacketLoop` error
that triggers whenever source_ip == dest_ip.
- The relays in CI need a common next-hop to talk to for the MAC address
swapping to work. A simple router service is added which functions as a
basic L3 router (no NAT) that allows the MAC swapping to work.
- The `veth` driver has some peculiar requirements to allow it to
function with XDP_TX. If you send a packet out of one interface of a
veth pair with XDP_TX, you need to either make sure both interfaces have
GRO enabled, or you need to attach a dummy XDP program that simply does
XDP_PASS to the other interface so that the sk_buff is allocated before
going up the stack to the Docker bridge. The GRO method was unreliable
and didn't work in our case, causing massive packet delays and
unpredictable bursts that prevented ICE from working, so we use the
XDP_PASS method instead. A simple docker image is built and lives at
https://github.com/firezone/xdp-pass to handle this.
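
The two checks from the first two bullets (unicast filtering and loop detection) can be sketched in plain userspace Rust. The actual relay logic runs in an eBPF program and will look different; the function names and the `RelayError` type here are illustrative assumptions, and only IPv4 is covered.

```rust
use std::net::Ipv4Addr;

#[derive(Debug, PartialEq)]
enum RelayError {
    /// Source and destination are the same address.
    PacketLoop,
}

/// Rejects packets that would hairpin back to their sender.
fn check_packet_loop(src: Ipv4Addr, dst: Ipv4Addr) -> Result<(), RelayError> {
    if src == dst {
        Err(RelayError::PacketLoop)
    } else {
        Ok(())
    }
}

/// Returns true if `ip` is an address worth learning an interface for.
/// Note: this only catches the limited broadcast address 255.255.255.255;
/// subnet-directed broadcasts would need the netmask as well.
fn is_learnable_unicast(ip: Ipv4Addr) -> bool {
    !(ip.is_unspecified() || ip.is_broadcast() || ip.is_multicast() || ip.is_link_local())
}
```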

Related: #10138 
Related: #10260
2025-08-31 23:37:03 +00:00
Jamil
516be7417e fix(ci): remove extraneous caching (#10258)
- Removes the swift DerivedData cache. This was added to attempt to
speed up the Swift builds in CI but in reality, those are already fast
and the cache did not speed them up.
- Removes the runner.os/arch specifier from the Webview installer cache
key. The binary download is hardcoded for a specific windows version /
arch already so the cache key just adds unneeded complexity.

These caches are getting saved on PR runs which consumes excess GHA
cache storage.
2025-08-27 05:01:02 -07:00
Jamil
0698e0d35f ci: test IPv6 for CIDR resources (#10168)
Docker for Mac finally supports IPv6 in general availability. It's time
to add IPv6 to our suite of integration tests.

The thinking behind this PR is to try not to slow down CI much, if at
all, by testing IPv6 side-by-side with the existing IPv4 tests.

More comprehensive testing is being developed in #10131 that will test
things like IPv4-in-6 relaying, client / gateway IP stack mismatches,
and so forth.
2025-08-18 20:59:40 +00:00
Thomas Eizinger
72fbe306b6 test: remove curl retry in favor of keep-alive (#9888)
At present, the `direct-download-roaming-network` integration test is a
bit odd. It uses the `--retry` switch from `curl` to retry the download
once it failed. However, what we want to show with this integration test
is that a TCP connection can survive network roaming. We can show that
successfully but only if we specify the `--keepalive-time` option,
otherwise the download stalls.

From inspecting the network logs, this is because `curl` simply waits
for more data to be downloaded. After a network reset, the connection
however is gone and the _client_ (in this case `curl`) needs to send at
least 1 packet to re-establish the connection. By using the keep-alive
option, we can send such a packet and the download completes
successfully.
2025-07-16 16:17:27 +00:00
Thomas Eizinger
cf2470ba1e test(iperf): install iptables rule inside of container (#9880)
In Docker environments, applying iptables rules to filter
container-container traffic on the Docker bridged network is not
reliable, leading to direct connections being established in our relayed
tests. To fix this, we insert the rules directly from the client
container itself.

---------

Co-authored-by: Jamil Bou Kheir <jamilbk@users.noreply.github.com>
2025-07-16 10:29:33 +00:00
Jamil
84a981f668 refactor(ci): Remove browser-based integration tests (#6435)
Fixes a new issue with puppeteer, chromium 128, and Alpine 3.20 that's
causing failing browser tests.

See more: https://github.com/puppeteer/puppeteer/issues/12189

Failure:

https://github.com/firezone/firezone/actions/runs/10549430305/job/29224528663?pr=6391

Unfortunately, puppeteer's embedded browser doesn't seem to want to run
in Alpine:


https://github.com/firezone/firezone/actions/runs/10563167497/job/29265175731?pr=6435#step:6:56


Fixing this is proving very difficult since we can't seem to use
puppeteer with the latest Alpine images, so I questioned the need to
have these in at all. These tests were added at a time where the DNS
mappings were brittle, so we wanted to verify that relayed and direct
connections held up as we deployed.

This is no longer the case, and we also now have much more unit test
coverage around these things, so given the pain of maintaining these
(and the lack of a current solution to the above), they are removed.

---------

Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
2024-08-26 20:01:00 +00:00
Thomas Eizinger
7159ffb34b ci: timeout curl requests after 30s (#5537)
Currently, we rely on curl's default timeout when connecting to a
resource. This is problematic because the `direct-dns` and `relayed-dns`
integration tests check that a certain resource _isn't_ accessible and
this test currently waits for 5 minutes to assert that.

We can shorten this, and thus every CI run, by passing a `--connect-timeout`
to `curl`.

See
https://github.com/firezone/firezone/actions/runs/9656570163/job/26634409843#step:6:445
for an example CI run on `main`.
2024-06-25 06:07:13 +00:00
Jamil
0b83b12fd2 ci: bootstrap browser test harness if missing (#4767)
Should be a less brittle fix to the problem of testing release images
for `compat-tests` with the browser harness.
2024-04-24 17:02:47 +00:00
Gabi
adc0bb73f7 test(client): add reconnection tests from a client using a headless browser (#4569)
Considered using Elixir and Rust to write the tests.

For Elixir, `wallaby` doesn't seem to have a way to attach to an
existing `chromium` instance, launching it each time, which makes it
hard to coordinate with the relay restart.

For Rust we considered `thirtyfour` which would be very nice since we
could test both firefox and chrome but each time it connects to the
instance it launches a new session making it hard to test the DNS cache
behavior.

We also considered `chrome_headless` for Rust. It needs a small patch to
prevent it from closing the browser after `Drop` but it still presents a
problem, since it has no easy way to check whether loading a page has
succeeded. There are some workarounds such as retrieving the title that
we could have used but after some testing they are quite finicky and we
don't want that for CI.

So I ended up settling for TypeScript but I'm open to other options, or
a fix for the previous ones!

There are some modifications still incoming for this PR, around the test
name and that sleep in the middle of the test doesn't look good so I
will probably add some retries, but the gist is here, will keep it in
draft until we expect it to be passing.

So feel free to do some initial reviews.

Note: the number of lines changed is greatly exaggerated by
`package.lock`

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil Bou Kheir <jamilbk@users.noreply.github.com>
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
2024-04-20 06:57:07 +00:00
Thomas Eizinger
51089b89e7 feat(connlib): smoothly migrate relayed connections (#4568)
Whenever we receive a `relays_presence` message from the portal, we
invalidate the candidates of all now disconnected relays and make
allocations on the new ones. This triggers signalling of new candidates
to the remote party and migrates the connection to the newly nominated
socket.

This still relies on #4613 until we have #4634.

Resolves: #4548.

---------

Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
2024-04-20 06:16:35 +00:00
Reactor Scram
7081c71c10 chore(linux-client): allow custom token path (#4666)
```[tasklist]
# Before merging
- [x] Remove file extension `.txt`
- [x] Wait for `linux-group` test to go green on `main` (#4692)
- [x] *all* compatibility tests must be green on this branch
```

Closes #4664 
Closes #4665 

~~The compatibility tests are expected to fail until the next release is
cut, for the same reasons as in #4686~~

The compatibility test must be handled somehow, otherwise it'll turn
main red.
`linux-group` was moved out of integration / compatibility testing, but
the DNS tests do need the whole Docker + portal setup, so that one can't
move.

---------

Signed-off-by: Reactor Scram <ReactorScram@users.noreply.github.com>
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
2024-04-19 18:50:24 +00:00
Thomas Eizinger
4972e49b34 ci: run assertions inside docker container (#4680)
As part of #4568, we are adding a 2nd relay which showed some
short-comings of the current process state assertions because they were
running outside the docker containers, thus listing all relays as soon
as there are multiple.
2024-04-18 23:48:42 +00:00
Reactor Scram
e7a4a83e3d chore(linux): only allow IPC connections from members of the firezone group (#4628)
```[tasklist]
### Before merging
- [x] Update KB
```

Maybe not a feature since Linux IPC isn't available to users yet?

I think it's okay if the new `linux-group` test fails in compatibility,
since it wasn't implemented at all back then.

Closes #4659
Closes #4660

---------

Signed-off-by: Reactor Scram <ReactorScram@users.noreply.github.com>
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
2024-04-17 21:42:29 +00:00
Thomas Eizinger
be1a719e2c chore(relay): perform graceful shutdown upon receiving SIGTERM (#4552)
Upon receiving a SIGTERM, we immediately disconnect from the websocket
connection to the portal and set a flag that we are shutting down.

Once we are disconnected from the portal and no longer have any active
allocations, we exit with 0. A repeated SIGTERM signal will interrupt
this process and force the relay to shut down.

Disconnecting from the portal will (eventually) trigger a message to
clients and gateways that this relay should no longer be used. Thus,
depending on the timeout our supervisor has configured after sending
SIGTERM, the relay will continue all TURN operations until the number of
allocations drops to 0.

Currently, we also allow clients to make new allocations and refresh
existing allocations. In the future, it may make sense to implement a
dedicated status code and refuse `ALLOCATE` and `REFRESH` messages
whilst we are shutting down.
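
The shutdown sequence described above can be modelled as a small state machine. This is a sketch under simplifying assumptions: the portal disconnect is treated as instantaneous, and the `Relay` and `Outcome` types are hypothetical rather than the relay's real API.

```rust
#[derive(Debug, PartialEq)]
enum Outcome {
    KeepRunning,
    /// Corresponds to exiting with status 0.
    ExitCleanly,
    /// A repeated SIGTERM interrupts the graceful shutdown.
    ForceExit,
}

struct Relay {
    shutting_down: bool,
    portal_connected: bool,
    allocations: usize,
}

impl Relay {
    fn new(allocations: usize) -> Self {
        Relay { shutting_down: false, portal_connected: true, allocations }
    }

    fn on_sigterm(&mut self) -> Outcome {
        if self.shutting_down {
            return Outcome::ForceExit;
        }
        self.shutting_down = true;
        // Simplification: the real disconnect is asynchronous.
        self.portal_connected = false;
        self.poll()
    }

    fn on_allocation_expired(&mut self) -> Outcome {
        self.allocations = self.allocations.saturating_sub(1);
        self.poll()
    }

    /// Exit only once we are shutting down, disconnected and idle.
    fn poll(&self) -> Outcome {
        if self.shutting_down && !self.portal_connected && self.allocations == 0 {
            Outcome::ExitCleanly
        } else {
            Outcome::KeepRunning
        }
    }
}
```

Because new allocations and refreshes are still accepted during shutdown, the allocation count in a real relay can also grow again; the sketch leaves that out.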

Related: #4548.

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
2024-04-12 08:45:08 +00:00
Thomas Eizinger
26494b0e34 ci: reduce duplication in integration tests (#4583)
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
2024-04-11 23:01:12 +00:00
Thomas Eizinger
8d49452668 ci: assert that nothing busy loops after the perf tests (#4546)
The clients, gateway and relay all employ an internal design that is
based on an eventloop. This gives us a lot of control in how various IO
components interact with each other. Great control also comes with a
source of bugs, the latest of which made the relay busy-loop once it
started relaying some traffic.

Eventloops are notoriously hard to unit-test because they compose
various IO bits together. Instead of writing unit tests, we can go and
assert the process state after the performance tests. Those generate a
fair bit of load on all our components but after that, they should
suspend.

The most effective tests survive even large refactorings and for that,
they need to be coded against a stable API / property. Asserting that
the process sleeps when it is idle from an application PoV is such a
property.
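
One way to check this property on Linux is to read the process state from `/proc/<pid>/stat`. The sketch below only shows the parsing step; it is not how the CI scripts implement the assertion, and the function names are made up for illustration.

```rust
/// Extracts the state character (e.g. 'S' or 'R') from the contents of
/// `/proc/<pid>/stat`. The command name is wrapped in parentheses and may
/// itself contain spaces or ')', so we search for the last ')'.
fn process_state(stat_line: &str) -> Option<char> {
    let comm_end = stat_line.rfind(')')?;
    let after_comm = &stat_line[comm_end + 1..];
    after_comm.split_whitespace().next()?.chars().next()
}

/// 'S' is interruptible sleep, which is what a process blocked in its
/// event loop waiting for IO should report.
fn is_sleeping(state: char) -> bool {
    state == 'S'
}
```

A test would read the file with `std::fs::read_to_string` for each component's PID after the performance run and fail if `is_sleeping` returns false. A single sample can be misleading, so sampling a few times is likely more robust.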

Related: #4511.
2024-04-09 07:09:50 +00:00
Jamil
09532ea845 chore(ci): Add portal and relay downtime DNS resource tests (#4517)
Tests that DNS still works in the client with established connections
after the portal and/or relay go down.
2024-04-08 09:43:59 +00:00
Reactor Scram
74a81b2a56 test(gui-client): unit test for Linux IPC (#4277)
(After GA)

This adds a unit test for the Unix domain sockets that I intend to use
for process splitting on Linux.

The length-prefixed encoding and decoding are copied from `subzone`, but
most of that code will not be re-used since it's Windows-specific and
also specific to a Chromium-like process model, which won't work for
Firezone.
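
For readers unfamiliar with length-prefixed framing, here is a minimal sketch. The 4-byte little-endian prefix is an assumption made for the example; the format actually copied from `subzone` is not described in this commit message.

```rust
/// Prepends a 4-byte little-endian length to `payload`.
/// Returns `None` if the payload does not fit into a `u32` length.
fn encode(payload: &[u8]) -> Option<Vec<u8>> {
    if payload.len() > u32::MAX as usize {
        return None;
    }
    let mut out = Vec::with_capacity(4 + payload.len());
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.extend_from_slice(payload);
    Some(out)
}

/// Splits one frame off the front of `buf`.
/// Returns the frame's payload and the remaining bytes, or `None` if `buf`
/// does not yet contain a complete frame.
fn decode(buf: &[u8]) -> Option<(&[u8], &[u8])> {
    if buf.len() < 4 {
        return None;
    }
    let len = u32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]]) as usize;
    let rest = &buf[4..];
    if rest.len() < len {
        return None;
    }
    Some(rest.split_at(len))
}
```

Since a stream socket does not preserve message boundaries, the receiver keeps calling `decode` on its buffer until it returns `None`, then waits for more bytes.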
2024-04-02 19:34:24 +00:00
Thomas Eizinger
62e082d47a refactor(connlib): make {Client,Gateway}State SANS-IO (#4096)
Resolves: #3929.
2024-03-14 23:44:36 +00:00
Jamil
19a7bac4ae chore(ci): enforce shellscript formatting and style (#3679)
Noticed that we all have different styles of writing scripts :-).

This PR adds linting to our shell scripts to standardize on formatting,
catch common issues and/or possible security bugs.

For editor setup:
- Ensure [`shellcheck`](https://github.com/koalaman/shellcheck) and
[`shfmt`](https://github.com/mvdan/sh) are in your `PATH`
- Configure `shfmt` with indentation of `4`, otherwise it uses tabs by
default.
[Here](https://github.com/jamilbk/nvim/blob/master/init.vim#L159) is how
you can do that with Vim and
[here](https://marketplace.visualstudio.com/items?itemName=mkhl.shfmt)
is how for VS Code.

---------

Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Reactor Scram <ReactorScram@users.noreply.github.com>
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Brian Manifold <bmanifold@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Andrew Dryga <andrew@dryga.com>
Co-authored-by: Gabi <gabrielalejandro7@gmail.com>
2024-02-21 01:01:32 +00:00
Jamil
20dc0cf1e9 refactor(ci): Use curl for connectivity tests in CI (#3674)
It would be good to run tests with a TCP protocol like `http` to catch
things like MTU and port issues.
2024-02-16 22:48:13 +00:00
Jamil
9054f70995 refactor(ci): simplify dns resources in ci (#3653)
Attempt at cleaning a couple things I missed in code review.

The old httpbin resource wasn't being used anyhow, so I just deduped
them and updated things in a couple other places that had drifted.

Hopefully this fixes the [flaky
CI](https://github.com/firezone/firezone/actions/runs/7918422653/job/21616835910)
2024-02-15 23:50:12 +00:00
Thomas Eizinger
e47c1766bf ci: move tests to bash scripts (#3648)
This improves maintenance because we can now use a regular matrix for
the integration tests and one can locally use tools like shellcheck or a
`bash-lsp` during development.

---------

Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
2024-02-14 13:55:28 +00:00