firezone

mirror of https://github.com/outbackdingo/firezone.git synced 2026-01-27 10:18:54 +00:00

Author	SHA1	Message	Date
Thomas Eizinger	6a538368cb	feat(gateway): add flow-logs MVP (#10576 ) Network flow logs are a common feature of VPNs. Due to the nature of a shared exit node, it is of great interest to a network analyst which TCP connections are getting routed through the tunnel, who is initiating them, how long they last and how much traffic is sent across them. With this PR, the Firezone Gateway gains the ability of detecting the TCP and UDP flows that are being routed through it. The information we want to attach to these flows is spread out over several layers of the packet handling code. To simplify the implementation and not complicate the APIs unnecessarily, we chose to rely on TLS (thread-local storage) for gathering all the necessary data as a packet gets passed through the various layers. When using a const initializer, the overhead of a TLS variable over an actual local variable is basically zero. The entire routing state of the Gateway is also never sent across any threads, making TLS variables a particularly good choice for this problem. In its MVP form, the detected flows are only emitted on stdout, and only if `flow_logs=trace` is set using `RUST_LOG`. Early adopters of this feature are encouraged to enable these logs as described and then ingest the Gateway's logs into the SIEM of their choice for further analysis. Related: #8353	2025-10-22 03:10:21 +00:00
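Note on 6a538368cb: the thread-local approach described above can be sketched roughly as follows. This is a minimal illustration, not the Gateway's actual code; `FlowContext`, `CURRENT_FLOW`, `record_client`, `record_ports` and `take_flow` are hypothetical names, and the fields are assumptions about what a flow record might carry.

```rust
use std::cell::Cell;
use std::net::IpAddr;

/// Hypothetical per-packet context; the field names are illustrative only.
#[derive(Clone, Copy, Debug, Default, PartialEq)]
pub struct FlowContext {
    pub client: Option<IpAddr>,
    pub src_port: Option<u16>,
    pub dst_port: Option<u16>,
    pub payload_len: usize,
}

thread_local! {
    // The `const { .. }` block is the "const initializer" mentioned in the
    // commit message: the value is known at compile time, so access does not
    // need a lazy-initialisation check.
    static CURRENT_FLOW: Cell<FlowContext> = const {
        Cell::new(FlowContext {
            client: None,
            src_port: None,
            dst_port: None,
            payload_len: 0,
        })
    };
}

/// Called by an (assumed) outer layer that knows which client sent the packet.
pub fn record_client(ip: IpAddr) {
    CURRENT_FLOW.with(|cell| {
        let mut ctx = cell.get();
        ctx.client = Some(ip);
        cell.set(ctx);
    });
}

/// Called by an (assumed) inner layer that has parsed the transport header.
pub fn record_ports(src: u16, dst: u16, payload_len: usize) {
    CURRENT_FLOW.with(|cell| {
        let mut ctx = cell.get();
        ctx.src_port = Some(src);
        ctx.dst_port = Some(dst);
        ctx.payload_len = payload_len;
        cell.set(ctx);
    });
}

/// Called once the packet has passed all layers: returns the gathered data
/// and resets the thread-local to its default for the next packet.
pub fn take_flow() -> FlowContext {
    CURRENT_FLOW.with(|cell| cell.take())
}
```

Each layer writes only the fields it knows about, and no function signature in between has to carry a context parameter. This only works because, as the commit message states, the routing state never moves across threads.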
Brian Manifold	27565ea5c8	refactor(portal): remove soft delete elements from portal code (#10607 ) Why: * In previous commits, the portal code had been updated to use hard deletion rather than soft deletion of data. The fields used in the soft deletion were still kept in the DB and the code to allow for zero downtime rollout and an easy rollback if necessary. To continue with that work the portal code has now been updated to remove any reference to the soft deleted fields (e.g. deleted_at, persistent_id, etc...). While the code has been updated the actual data in the DB will need to remain for now, to once again allow for a zero downtime rollout. Once this commit has been deployed to production another PR can follow to remove the columns from the necessary tables in the DB. Related: #8187	2025-10-18 17:02:26 +00:00
Thomas Eizinger	b11adfcfe4	feat(connlib): create flow on ICMP error "prohibited" (#10462 ) In Firezone, a Client requests an "access authorization" for a Resource on the fly when it sees the first packet for said Resource going through the tunnel. If we don't have a connection to the Gateway yet, this is also where we will establish a connection and create the WireGuard tunnel. In order for this to work, the access authorization state between the Client and the Gateway MUST NOT get out of sync. If the Client thinks it has access to a Resource, it will just route the traffic to the Gateway. If the access authorization on the Gateway has expired or vanished otherwise, the packets will be black-holed. Starting with #9816, the Gateway sends ICMP errors back to the application whenever it filters a packet. This can happen either because the access authorization is gone or because the traffic wasn't allowed by the specific filter rules on the Resource. With this patch, the Client will attempt to create a new flow (i.e. re-authorize traffic) for this Resource whenever it sees such an ICMP error, therefore acting as a way of synchronizing the view of the world between Client and Gateway should they ever run out of sync. Testing turned out to be a bit tricky. If we let the authorization on the Gateway lapse naturally, the portal will also toggle the Resource off and on on the Client, resulting in "flushing" the current authorizations. Additionally, if the Client only had access to one Resource, then the Gateway will gracefully close the connection, also resulting in the Client creating a new flow for the next packet. To actually trigger this new behaviour we need to: - Access at least two resources via the same Gateway - Directly send `reject_access` to the Gateway for this particular resource To achieve this, we dynamically eval some code on the API node and instruct the Gateway channel to send `reject_access`. 
The connection stays intact because there is still another active access authorization but packets for the other resource are answered with ICMP errors. To achieve a safe roll-out, the new behaviour is feature-flagged. In order to still test it, we now also allow feature flags to be set via env variables. Resolves: #10074 --------- Co-authored-by: Mariusz Klochowicz <mariusz@klochowicz.com>	2025-09-30 08:23:39 +00:00
Thomas Eizinger	83171d3a2d	ci: add integration test for graceful Gateway shutdown (#10077 ) Signed-off-by: Thomas Eizinger <thomas@eizinger.io>	2025-09-10 23:41:55 +00:00
Thomas Eizinger	d1d46fdfb4	ci: create a more realistic network setup (#10301 ) Currently, the setup we have in docker-compose does not reflect real-world scenarios very well because most components share the same subnet. In reality, Clients, Gateways, relays and the backend are all in separate subnets, connected via multiple routers on the Internet. The current setup makes it hard to properly test relayed connections. To fix this, we move all components into their own subnet with a dedicated router container that performs source and destination NAT as well as acts as a firewall for the client and gateway containers to not allow inbound traffic. This setup will allow us to more easily test #10286 which requires port randomization for outgoing traffic on the Client and Gateway side.	2025-09-10 23:37:16 +00:00
Jamil	0ccd4bbf24	feat(ci): enable relay eBPF offloading (#10160 ) In CI, eBPF in driver mode actually functions just fine with no changes to our existing tests, given we apply a few workarounds and bugfixes: - The interface learning mechanism had two flaws: (1) it only learned per-CPU, which meant the risk for a missing entry grew as the core count of the relay host grew, and (2) it did not filter for unicast IPs, so it picked up broadcast and link-local addresses, causing cross-relay paths to fail occasionally - The `relay-relay` candidate where the two relays are the same relay causes packet drops / loops in the Docker bridge setup, and possibly in GCP too. I'm not sure this is a valid path that solves a real connectivity issue in the wild. I can understand relay-relay paths where two relays are different hosts, and the client and gateway both talk over their TURN channel to each other (i.e. WireGuard is blocked in each of their networks), but I can't think of an advantage for a relay-relay candidate where the traffic simply hairpins (or is dropped) off the nearest switch. This has been now detected with a new `PacketLoop` error that triggers whenever source_ip == dest_ip. - The relays in CI need a common next-hop to talk to for the MAC address swapping to work. A simple router service is added which functions as a basic L3 router (no NAT) that allows the MAC swapping to work. - The `veth` driver has some peculiar requirements to allow it to function with XDP_TX. If you send a packet out of one interface of a veth pair with XDP_TX, you need to either make sure both interfaces have GRO enabled, or you need to attach a dummy XDP program that simply does XDP_PASS to the other interface so that the sk_buff is allocated before going up the stack to the Docker bridge. The GRO method was unreliable and didn't work in our case, causing massive packet delays and unpredictable bursts that prevented ICE from working, so we use the XDP_PASS method instead. 
A simple docker image is built and lives at https://github.com/firezone/xdp-pass to handle this. Related: #10138 Related: #10260	2025-08-31 23:37:03 +00:00
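Note on 0ccd4bbf24: the two checks described above (filtering learned addresses for unicast, and flagging a packet whose source and destination are identical) can be sketched in plain userspace Rust. This is an assumption-laden illustration only: the real relay performs these checks inside an eBPF program, and `RelayError`, `check_packet_loop` and `is_learnable_unicast` are hypothetical names.

```rust
use std::net::{IpAddr, Ipv4Addr};

/// Illustrative error type; only the `PacketLoop` name comes from the commit message.
#[derive(Debug, PartialEq, Eq)]
pub enum RelayError {
    PacketLoop,
}

/// Rejects a packet that would be sent back to where it came from,
/// i.e. the `source_ip == dest_ip` condition named in the commit message.
pub fn check_packet_loop(src: IpAddr, dst: IpAddr) -> Result<(), RelayError> {
    if src == dst {
        return Err(RelayError::PacketLoop);
    }
    Ok(())
}

/// Returns true only for addresses that make sense as a learned interface
/// address: no broadcast, link-local, multicast or unspecified addresses.
pub fn is_learnable_unicast(ip: Ipv4Addr) -> bool {
    !(ip.is_broadcast() || ip.is_link_local() || ip.is_multicast() || ip.is_unspecified())
}
```

The IPv4-only filter is a simplification; an equivalent IPv6 check would be needed as well.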
Jamil	516be7417e	fix(ci): remove extraneous caching (#10258 ) - Removes the swift DerivedData cache. This was added to attempt to speed up the Swift builds in CI but in reality, those are already fast and the cache did not speed them up. - Removes the runner.os/arch specifier from the Webview installer cache key. The binary download is hardcoded for a specific windows version / arch already so the cache key just adds unneeded complexity. These caches are getting saved on PR runs which consumes excess GHA cache storage.	2025-08-27 05:01:02 -07:00
Jamil	0698e0d35f	ci: test IPv6 for CIDR resources (#10168 ) Docker for Mac finally supports IPv6 in general availability. It's time to add IPv6 to our suite of integration tests. The thinking behind this PR is to try not to slow down CI much, if at all, by testing IPv6 side-by-side with the existing IPv4 tests. More comprehensive testing is being developed in #10131 that will test things like IPv4-in-6 relaying, client / gateway IP stack mismatches, and so forth.	2025-08-18 20:59:40 +00:00
Thomas Eizinger	72fbe306b6	test: remove curl retry in favor of keep-alive (#9888 ) At present, the `direct-download-roaming-network` integration test is a bit odd. It uses the `--retry` switch from `curl` to retry the download once it failed. However, what we want to show with this integration test is that a TCP connection can survive network roaming. We can show that successfully but only if we specify the `--keepalive-time` option, otherwise the download stalls. From inspecting the network logs, this is because `curl` simply waits for more data to be downloaded. After a network reset, the connection however is gone and the _client_ (in this case `curl`) needs to send at least 1 packet to re-establish the connection. By using the keep-alive option, we can send such a packet and the download completes successfully.	2025-07-16 16:17:27 +00:00
Thomas Eizinger	cf2470ba1e	test(iperf): install iptables rule inside of container (#9880 ) In Docker environments, applying iptables rules to filter container-container traffic on the Docker bridged network is not reliable, leading to direct connections being established in our relayed tests. To fix this, we insert the rules directly from the client container itself. --------- Co-authored-by: Jamil Bou Kheir <jamilbk@users.noreply.github.com>	2025-07-16 10:29:33 +00:00
Jamil	84a981f668	refactor(ci): Remove browser-based integration tests (#6435 ) Fixes a new issue with puppeteer, chromium 128, and Alpine 3.20 that's causing failing browser tests. See more: https://github.com/puppeteer/puppeteer/issues/12189 Failure: https://github.com/firezone/firezone/actions/runs/10549430305/job/29224528663?pr=6391 Unfortunately, puppeteer's embedded browser doesn't seem to want to run in Alpine: https://github.com/firezone/firezone/actions/runs/10563167497/job/29265175731?pr=6435#step:6:56 Fixing this is proving very difficult since we can't seem to use puppeteer with the latest Alpine images, so I questioned the need to have these in at all. These tests were added at a time where the DNS mappings were brittle, so we wanted to verify that relayed and direct connections held up as we deployed. This is no longer the case, and we also now have much more unit test coverage around these things, so given the pain of maintaining these (and the lack of a current solution to the above), they are removed. --------- Signed-off-by: Jamil <jamilbk@users.noreply.github.com>	2024-08-26 20:01:00 +00:00
Thomas Eizinger	7159ffb34b	ci: timeout `curl` requests after 30s (#5537 ) Currently, we rely on curl's default timeout when connecting to a resource. This is problematic because the `direct-dns` and `relayed-dns` integration tests check that a certain resource _isn't_ accessible and this test currently waits for 5 minutes to assert that. We can shorten this, and thus every CI run, by passing a `--connect-timeout` to `curl`. See https://github.com/firezone/firezone/actions/runs/9656570163/job/26634409843#step:6:445 for an example CI run on `main`.	2024-06-25 06:07:13 +00:00
Jamil	0b83b12fd2	ci: bootstrap browser test harness if missing (#4767 ) Should be a less brittle fix to the problem of testing release images for `compat-tests` with the browser harness.	2024-04-24 17:02:47 +00:00
Gabi	adc0bb73f7	test(client): add reconnection tests from a client using a headless browser (#4569 ) Considered using Elixir and Rust to write the tests. For Elixir, `wallaby` doesn't seem to have a way to attach to an existing `chromium` instance, launching it each time, which makes it hard to coordinate with the relay restart. For Rust we considered `thirtyfour` which would be very nice since we could test both firefox and chrome but each time it connects to the instance it launches a new session making it hard to test the DNS cache behavior. We also considered `chrome_headless` for Rust; it needs a small patch to prevent it from closing the browser after `Drop` but it still presents a problem, since it has no easy way to retrieve if loading a page has succeeded. There are some workarounds such as retrieving the title that we could have used but after some testing they are quite finicky and we don't want that for CI. So I ended up settling for TypeScript but I'm open to other options, or a fix for the previous ones! There are some modifications still incoming for this PR, around the test name and that sleep in the middle of the test doesn't look good so I will probably add some retries, but the gist is here, will keep it in draft until we expect it to be passing. So feel free to do some initial reviews. Note: the number of lines changed is greatly exaggerated by `package.lock` --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io> Co-authored-by: Jamil Bou Kheir <jamilbk@users.noreply.github.com> Co-authored-by: Thomas Eizinger <thomas@eizinger.io>	2024-04-20 06:57:07 +00:00
Thomas Eizinger	51089b89e7	feat(connlib): smoothly migrate relayed connections (#4568 ) Whenever we receive a `relays_presence` message from the portal, we invalidate the candidates of all now disconnected relays and make allocations on the new ones. This triggers signalling of new candidates to the remote party and migrates the connection to the newly nominated socket. This still relies on #4613 until we have #4634. Resolves: #4548. --------- Co-authored-by: Jamil <jamilbk@users.noreply.github.com>	2024-04-20 06:16:35 +00:00
Reactor Scram	7081c71c10	chore(linux-client): allow custom token path (#4666 ) ```[tasklist] # Before merging - [x] Remove file extension `.txt` - [x] Wait for `linux-group` test to go green on `main` (#4692) - [x] all compatibility tests must be green on this branch ``` Closes #4664 Closes #4665 ~~The compatibility tests are expected to fail until the next release is cut, for the same reasons as in #4686~~ The compatibility test must be handled somehow, otherwise it'll turn main red. `linux-group` was moved out of integration / compatibility testing, but the DNS tests do need the whole Docker + portal setup, so that one can't move. --------- Signed-off-by: Reactor Scram <ReactorScram@users.noreply.github.com> Co-authored-by: Thomas Eizinger <thomas@eizinger.io>	2024-04-19 18:50:24 +00:00
Thomas Eizinger	4972e49b34	ci: run assertions inside docker container (#4680 ) As part of #4568, we are adding a 2nd relay which showed some short-comings of the current process state assertions because they were running outside the docker containers, thus listing all relays as soon as there are multiple.	2024-04-18 23:48:42 +00:00
Reactor Scram	e7a4a83e3d	chore(linux): only allow IPC connections from members of the `firezone` group (#4628 ) ```[tasklist] ### Before merging - [x] Update KB ``` Maybe not a feature since Linux IPC isn't available to users yet? I think it's okay if the new `linux-group` test fails in compatibility, since it wasn't implemented at all back then. Closes #4659 Closes #4660 --------- Signed-off-by: Reactor Scram <ReactorScram@users.noreply.github.com> Co-authored-by: Thomas Eizinger <thomas@eizinger.io>	2024-04-17 21:42:29 +00:00
Thomas Eizinger	be1a719e2c	chore(relay): perform graceful shutdown upon receiving SIGTERM (#4552 ) Upon receiving a SIGTERM, we immediately disconnect from the websocket connection to the portal and set a flag that we are shutting down. Once we are disconnected from the portal and no longer have any active allocations, we exit with 0. A repeated SIGTERM signal will interrupt this process and force the relay to shut down. Disconnecting from the portal will (eventually) trigger a message to clients and gateways that this relay should no longer be used. Thus, depending on the timeout our supervisor has configured after sending SIGTERM, the relay will continue all TURN operations until the number of allocations drops to 0. Currently, we also allow clients to make new allocations and refresh existing allocations. In the future, it may make sense to implement a dedicated status code and refuse `ALLOCATE` and `REFRESH` messages whilst we are shutting down. Related: #4548. --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io> Co-authored-by: Jamil <jamilbk@users.noreply.github.com>	2024-04-12 08:45:08 +00:00
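Note on be1a719e2c: the shutdown sequence described above boils down to a small state machine. The sketch below models it without any real signal handling or networking; `RelayShutdown`, `Next` and the method names are hypothetical, and the portal disconnect is assumed to complete immediately, which the real relay does not guarantee.

```rust
/// What the event loop should do next. Variant names are illustrative.
#[derive(Debug, PartialEq, Eq)]
pub enum Next {
    KeepRunning,
    /// Disconnected from the portal and no allocations left: exit with 0.
    GracefulExit,
    /// A second SIGTERM arrived while draining: stop right away.
    ForcedExit,
}

#[derive(Debug)]
pub struct RelayShutdown {
    shutting_down: bool,
    portal_connected: bool,
    active_allocations: usize,
}

impl RelayShutdown {
    pub fn new(active_allocations: usize) -> Self {
        Self {
            shutting_down: false,
            portal_connected: true,
            active_allocations,
        }
    }

    /// First SIGTERM: set the flag and drop the portal connection.
    /// Repeated SIGTERM: force the shutdown.
    pub fn on_sigterm(&mut self) -> Next {
        if self.shutting_down {
            return Next::ForcedExit;
        }
        self.shutting_down = true;
        self.portal_connected = false; // simplification: disconnect is instant
        self.next()
    }

    /// An allocation expired or was released by its client.
    pub fn on_allocation_freed(&mut self) -> Next {
        self.active_allocations = self.active_allocations.saturating_sub(1);
        self.next()
    }

    fn next(&self) -> Next {
        if self.shutting_down && !self.portal_connected && self.active_allocations == 0 {
            Next::GracefulExit
        } else {
            Next::KeepRunning
        }
    }
}
```

Because new and refreshed allocations are still accepted while draining, a real implementation would also need a path that increments the counter; it is left out here for brevity.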
Thomas Eizinger	26494b0e34	ci: reduce duplication in integration tests (#4583 ) Co-authored-by: Jamil <jamilbk@users.noreply.github.com>	2024-04-11 23:01:12 +00:00
Thomas Eizinger	8d49452668	ci: assert that nothing busy loops after the perf tests (#4546 ) The clients, gateway and relay all employ an internal design that is based on an eventloop. This gives us a lot of control in how various IO components interact with each other. Great control also comes with a source of bugs, the latest of which made the relay busy-loop once it started relaying some traffic. Eventloops are notoriously hard to unit-test because they compose various IO bits together. Instead of writing unit tests, we can go and assert the process state after the performance tests. Those generate a fair bit of load on all our components but after that, they should suspend. The most effective tests survive even large refactorings and for that, they need to be coded against a stable API / property. Asserting that the process sleeps when it is idle from an application PoV is such a property. Related: #4511.	2024-04-09 07:09:50 +00:00
Jamil	09532ea845	chore(ci): Add portal and relay downtime DNS resource tests (#4517 ) Tests that DNS still works in the client with established connections after the portal and/or relay go down.	2024-04-08 09:43:59 +00:00
Reactor Scram	74a81b2a56	test(gui-client): unit test for Linux IPC (#4277 ) (After GA) This adds a unit test for the Unix domain sockets that I intend to use for process splitting on Linux. The length-prefixed encoding and decoding are copied from `subzone`, but most of that code will not be re-used since it's Windows-specific and also specific to a Chromium-like process model, which won't work for Firezone.	2024-04-02 19:34:24 +00:00
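Note on 74a81b2a56: a length-prefixed encoding over a byte stream such as a Unix domain socket might look like the sketch below. The commit message does not state the prefix width or byte order, so the 4-byte little-endian prefix, the `MAX_FRAME_LEN` cap and the function names `write_frame` / `read_frame` are all assumptions made for illustration.

```rust
use std::io::{self, Read, Write};

/// Hypothetical upper bound so a corrupt prefix cannot trigger a huge allocation.
pub const MAX_FRAME_LEN: usize = 1 << 20;

/// Writes one frame: a 4-byte little-endian length followed by the payload.
pub fn write_frame<W: Write>(writer: &mut W, payload: &[u8]) -> io::Result<()> {
    if payload.len() > MAX_FRAME_LEN {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "payload too large"));
    }
    // Safe cast: the length was just checked against MAX_FRAME_LEN.
    let len = payload.len() as u32;
    writer.write_all(&len.to_le_bytes())?;
    writer.write_all(payload)
}

/// Reads one frame written by `write_frame` and returns its payload.
pub fn read_frame<R: Read>(reader: &mut R) -> io::Result<Vec<u8>> {
    let mut len_buf = [0u8; 4];
    reader.read_exact(&mut len_buf)?;
    let len = u32::from_le_bytes(len_buf) as usize;
    if len > MAX_FRAME_LEN {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "frame too large"));
    }
    let mut payload = vec![0u8; len];
    reader.read_exact(&mut payload)?;
    Ok(payload)
}
```

Both functions are generic over `Read` / `Write`, so the same code can be exercised against an in-memory buffer in a unit test and against `std::os::unix::net::UnixStream` in the application.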
Thomas Eizinger	62e082d47a	refactor(connlib): make `{Client,Gateway}State` SANS-IO (#4096 ) Resolves: #3929.	2024-03-14 23:44:36 +00:00
Jamil	19a7bac4ae	chore(ci): enforce shellscript formatting and style (#3679 ) Noticed that we all have different styles of writing scripts :-). This PR adds linting to our shell scripts to standardize on formatting, catch common issues and/or possible security bugs. For editor setup: - Ensure [`shellcheck`](https://github.com/koalaman/shellcheck) and [`shfmt`](https://github.com/mvdan/sh) are in your `PATH` - Configure `shfmt` with indentation of `4`, otherwise it uses tabs by default. [Here](https://github.com/jamilbk/nvim/blob/master/init.vim#L159) is how you can do that with Vim and [here](https://marketplace.visualstudio.com/items?itemName=mkhl.shfmt) is how for VScode. --------- Signed-off-by: Jamil <jamilbk@users.noreply.github.com> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Reactor Scram <ReactorScram@users.noreply.github.com> Co-authored-by: Thomas Eizinger <thomas@eizinger.io> Co-authored-by: Brian Manifold <bmanifold@users.noreply.github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Andrew Dryga <andrew@dryga.com> Co-authored-by: Gabi <gabrielalejandro7@gmail.com>	2024-02-21 01:01:32 +00:00
Jamil	20dc0cf1e9	refactor(ci): Use curl for connectivity tests in CI (#3674 ) It would be good to run tests with a TCP protocol like `http` to catch things like MTU and port issues.	2024-02-16 22:48:13 +00:00
Jamil	9054f70995	refactor(ci): simplify dns resources in ci (#3653 ) Attempt at cleaning a couple things I missed in code review. The old httpbin resource wasn't being used anyhow, so I just deduped them and updated things in a couple other places that had drifted. Hopefully this fixes the [flaky CI](https://github.com/firezone/firezone/actions/runs/7918422653/job/21616835910)	2024-02-15 23:50:12 +00:00
Thomas Eizinger	e47c1766bf	ci: move tests to bash scripts (#3648 ) This improves maintenance because we can now use a regular matrix for the integration tests and one can locally use tools like shellcheck or a `bash-lsp` during development. --------- Signed-off-by: Jamil <jamilbk@users.noreply.github.com> Co-authored-by: Jamil <jamilbk@users.noreply.github.com>	2024-02-14 13:55:28 +00:00

28 Commits