firezone

mirror of https://github.com/outbackdingo/firezone.git synced 2026-01-27 10:18:54 +00:00

Author	SHA1	Message	Date
Jamil	d29b210a63	chore(portal): Log metrics that failed to flush (#8142 ) When flushing metrics to GCP, we sometimes get the following error: ``` {400, "{\n \"error\": {\n \"code\": 400,\n \"message\": \"One or more TimeSeries could not be written: timeSeries[0-51]: write for resource=gce_instance{zone:us-east1-d,instance_id:6130184649770384727} failed with: One or more points were written more frequently than the maximum sampling period configured for the metric.\",\n \"status\": \"INVALID_ARGUMENT\",\n \"details\": [\n {\n \"@type\": \"type.googleapis.com/google.monitoring.v3.CreateTimeSeriesSummary\",\n \"totalPointCount\": 52,\n \"successPointCount\": 48,\n \"errors\": [\n {\n \"status\": {\n \"code\": 9\n },\n \"pointCount\": 4\n }\n ]\n }\n ]\n }\n}\n"} ``` It would be helpful to know exactly which metrics are failing to flush so we can further troubleshoot any issues.	2025-02-15 08:50:29 -08:00
Jamil	85ee37dfb3	Revert "fix(portal): Add node name key to metrics labels" (#8141 ) The node_name label is already in the metrics. Reverts firezone/firezone#8082	2025-02-15 08:47:45 -08:00
Jamil	4685c8edfd	ci: Add write perms to release drafter for kotlin (#8140 ) Needed to be able to create release drafts.	2025-02-15 07:46:13 -08:00
Jamil	5a3e940334	fix(portal): Fix typo in sites index (#8139 ) Fixes a typo introduced in #6905	2025-02-15 07:25:08 -08:00
Jamil	b64a919ac0	fix(android): make task dependencies explicit (#8138 ) Fixes a new issue gradle seems to complain about: https://github.com/firezone/firezone/actions/runs/13339271704	2025-02-15 02:19:05 +00:00
Andrew Dryga	bacb4596b7	feat(portal): Internet Sites (#6905 ) Related #6834 Co-authored-by: Jamil Bou Kheir <jamilbk@users.noreply.github.com>	2025-02-15 00:34:30 +00:00
Jamil	80aa9e76c1	build(phoenix-channel): add cfg to enable system CAs (#8137 ) By setting the `system_certs` cfg at compile-time, any TLS connections from `phoenix-channel` will use the system-provided CA store instead of the embedded one. Resolves: #8065 Co-authored-by: oddlama <oddlama@oddlama.org> Co-authored-by: Thomas Eizinger <thomas@eizinger.io>	2025-02-15 00:23:25 +00:00
Jamil	df8b615d35	fix(apple/macOS): Don't force unwrap for menubar items (#8135 ) We can elegantly handle nil items in places where we currently don't. This PR updates all cases in MenuBar.swift to gracefully handle nil items like the menubar icons which can, in rare circumstances, be `nil` if they haven't yet loaded.	2025-02-14 21:50:35 +00:00
Jamil	5efb4b0fe2	fix(portal): Fix typo :dns -> :ip in seeds (#8134 ) Fixes #8119	2025-02-14 20:32:28 +00:00
Thomas Eizinger	bc37e0140b	fix(gui-client): allow sign-in without saving token to keyring (#8129 ) Alternative to #8128. If the user dismissed the unlock prompt or has their keyring otherwise misconfigured, it is still useful to allow them to sign-in. They just won't stay signed-in across reboots of the device.	2025-02-14 15:17:26 +00:00
Thomas Eizinger	9cce4fd637	fix(gateway): don't route packets from expired NAT sessions (#8124 ) When we receive an inbound packet from the TUN device on the Gateway, we make a lookup in the NAT table to see if it needs to be translated back to a DNS proxy IP. At present, non-existence of such a NAT entry results in the packet being sent entirely unmodified because that is what needs to happen for CIDR resources. Whilst that is important, the same code path is currently being executed for DNS resources whose NAT session expired! Those packets should be dropped instead which is what we do with this PR. To differentiate between not having a NAT session at all and a previous one having existed but now being expired, we keep around all previous "outside" tuples of NAT sessions. Those are only very small in their memory-footprint. The entire NAT table is scoped to a connection to the given peer and will thus eventually be freed once the peer disconnects. This allows us to reliably and cheaply detect whether a packet is using an expired NAT session. This check must be cheap because all traffic of CIDR resources and the Internet resource needs to perform this check such that we know that they don't have to be translated. This might be the source of some of the "Source not allowed" errors we have been seeing in client logs.	2025-02-14 08:21:23 +00:00
Thomas Eizinger	8f0db6ad47	fix(connlib): run all callbacks on a separate thread (#8126 ) At present, `connlib` communicates with its host app via callbacks. These callbacks are executed synchronously as part of `connlib`s event-loop, meaning `connlib` cannot do anything else whilst the callback is executing in the host app. Additionally, this callback runs within the `Future` that represents `connlib` and thus runs on a `tokio` worker thread. Attempting to interact with the session from within the callback can lead to panics, for example when `Session::disconnect` is called which uses `Runtime::block_on`. This isn't allowed by `tokio`: You cannot block on the execution of an async task from within one of the worker threads. To solve both of these problems, we introduce a thread-pool of size 1 that is responsible for executing `connlib` callbacks. Not only does this allow `connlib` to perform more work such as routing packets or process portal messages, it also means that it is not possible for the host app to cause these panics within the `tokio` runtime because the callbacks run on a different thread.	2025-02-14 06:54:35 +00:00
Thomas Eizinger	10ba02e341	fix(connlib): split TUN send & recv into separate threads (#8117 ) We appear to have caused a pretty big performance regression (~40%) in `037a2e64b6` (identified through `git-bisect`). Specifically, the regression appears to have been caused by [`aef411a` (#7605)](`aef411abf5`). Weirdly enough, undoing just that on top of `main` doesn't fix the regression. My hypothesis is that using the same file descriptor for read AND write interests on the same runtime causes issues because those interests are occasionally cleared (i.e. on false-positive wake-ups). In this PR, we spawn a dedicated thread each for the sending and receiving operations of the TUN device. On unix-based systems, a TUN device is just a file descriptor and can therefore simply be copied and read & written to from different threads. Most importantly, we only construct the `AsyncFd` _within_ the newly spawned thread and runtime because constructing an `AsyncFd` implicitly registers with the runtime active on the current thread. As a nice benefit, this allows us to get rid of a `future::select`. Those are always kind of nasty because they cancel the future that wasn't ready. My original intuition was that we drop packets due to cancelled futures there but that could not be confirmed in experiments.	2025-02-14 05:32:51 +00:00
Jamil	e23bd97ea1	fix(apple): Persist last notified version (#8122 ) Notifications on Apple platforms are delivered with best-effort reliability and are not guaranteed. They can also be queued up by the system so that, for example, it's possible to issue a notification, quit the app, and then upon the next launch of the app, receive the notification. In this second case, if the user dismissed the notification, we will crash. This is because we only track the `lastNotifiedVersion` in the `NotificationAdapter` instance object and don't persist it to disk, then we assert the value not to be nil when saving the user's `dismiss` action. To fix this, we persist the `lastNotifiedVersion` to the `UserDefaults` store and attempt to read this when the user is dismissing the notification. If we can't read it for some reason, we still dismiss the notification but won't prevent showing it again on the next update check. A minor bug is also fixed where the original author didn't correctly call the function's `completionHandler`. Also, unused instance vars `lastDismissedVersion` left over from the original author are removed as well.	2025-02-13 23:57:58 +00:00
Jamil	39cbf60ec8	ci: Bump Apple clients to 1.4.2 (#8109 ) Fixes a slew of memory leaks, crashes, and other papercuts.	2025-02-13 22:08:45 +00:00
Jamil	2b1e9ac17f	fix(gateway): Use StateDirectory to create /var/lib/firezone (#8120 ) This is needed on fresh installations.	2025-02-13 05:35:44 -08:00
Jamil	62876028c8	chore(apple): Update Xcode project settings (#8114 ) Xcode keeps pestering about these on each launch. Seems to be maintenance-related project configuration updates.	2025-02-13 02:40:23 +00:00
Jamil	9a3cde89b9	refactor(apple): Don't create variables we don't use (#8115 ) Both warnings-as-errors and the linter don't error on this particular warning unfortunately. 👎	2025-02-13 02:40:12 +00:00
Thomas Eizinger	0e5d91e266	build(nix): use more recent `pnpm` (#8106 ) Updates to `pnpm` 9.	2025-02-13 01:01:23 +00:00
Jamil	5afeb30f6f	ci: Bump GUI clients to 1.4.5 (#8113 )	2025-02-12 20:56:27 +00:00
Jamil	3feffc9f48	fix(android): Call disconnect in onDisconnect (#8110 ) We need to call `disconnect()` in `onDisconnect` to free the memory associated with the connlib session. Related: https://github.com/firezone/firezone/pull/8104	2025-02-12 20:51:05 +00:00
Jamil	316ba6ddc3	ci: Upload Android symbols to Sentry (#8111 ) Related: #8050	2025-02-12 20:49:54 +00:00
Jamil	8952eabe5a	chore(infra): Upgrade terraform modules (#8112 ) Fixes https://github.com/firezone/firezone/actions/runs/13293765777/job/37121384825 ``` ╷ │ Error: Failed to query available provider packages │ │ Could not retrieve the list of available versions for provider │ hashicorp/aws: locked provider registry.terraform.io/hashicorp/aws 5.64.0 │ does not match configured version constraint >= 3.29.0, >= 5.79.0; must use │ terraform init -upgrade to allow selection of new versions │ │ To see which modules are currently depending on hashicorp/aws and what │ versions are specified, run the following command: │ terraform providers ╵ ```	2025-02-12 20:43:00 +00:00
Jamil	1aef65224b	docs: Fix windows headless client note (#8108 )	2025-02-12 19:43:21 +00:00
Jamil	cf1b74cdc1	fix(apple): Only use connlib sessions that are connected (#8104 ) In the window of time between we check `AdapterState == .tunnelStarted` and we call `setDns` in the Apple `pathUpdateHandler`, it's possible that connlib disconnected. This window of time could potentially be non-trivial since we read system resolvers in there, which hits the disk. As such, we should always check the `session` pointer is valid just before use. The `AdapterState` enum tracks two states: `tunnelStopped` and `tunnelStarted`. In the `tunnelStarted` state, we populate a `WrappedSession` object. This is redundant - connlib is either `connected` and we have a `WrappedSession`, or it is not. Therefore we can remove the `AdapterState` abstraction completely (which was leftover from a previous developer) and directly use a `WrappedSession?` object to issue calls to connlib with. We set this to a valid `WrappedSession` upon connecting, and back to `nil` as soon as connlib either `onDisconnect`s us, or the user disconnects the tunnel. Lastly, we avoid early-returning from queued workItems because we now call connlib with `session?` which will no-op if there is no session, allowing whatever IPC call running at the time (such as fetchResources) to complete successfully, even though they'll see a "snapshotted" state of the Adapter/PacketTunnelProvider. In other words, we no longer enforce the session pointer to be valid for things that don't depend on its state. Fixes #7882	2025-02-12 19:31:39 +00:00
Thomas Eizinger	5a12dcb5b3	fix(gui-client): migrate to tailwind v4 (#8105 ) With the dependency bump in #7995, we introduced a visual regression that made all windows lose their styling: ![image](https://github.com/user-attachments/assets/9c9921a7-cab0-4adc-9868-cd7ddec40c64) The changelog to the v4 bump actually mentions some breaking changes and an automated upgrade tool but both the reviewer and the author of the PR missed that.	2025-02-12 19:19:18 +00:00
Jamil	36f06b84ea	fix(gateway): Harden systemd gateway unit file (#8102 ) Tested this with Vultr. No errors or issues reported for either IP or CIDR resources. Fixes: https://firezonehq.slack.com/archives/C06L41XN05T/p1739275605563679?thread_ts=1739267494.554949&cid=C06L41XN05T	2025-02-12 11:09:27 +00:00
Jamil	93a88563f3	feat(portal): allow socket based postgres connections (#8044 ) (#8097 ) This allows connections to the postgresql database via the standard socket, which - opposed to TCP sockets - allows `peer` authentication based on local unix users. This removes the need for a password and is much simpler to deploy when running components locally. In the current form, `DATABASE_SOCKET_DIR` takes precedence over hostname, if the environment variable is present. I found that `compile_config!` somehow enforces a value to be present which is explicitly not what I want for some of these values (i think). I'd be glad if anyone with more elixir experience can guide me as to how I can make this more idiomatic. --------- Supersedes: #8044 Signed-off-by: Jamil <jamilbk@users.noreply.github.com> Co-authored-by: oddlama <oddlama@oddlama.org>	2025-02-11 19:25:00 -08:00
Jamil	638c60649c	fix(portal): silence `hackney` CVE-2025-1211 (#8103 ) To my knowledge we don't rely on this particular functionality from hackney. Unfortunately, we don't control the `hackney` version used by deps, and there is no non-vulnerable version ready yet, so we ignore the advisory for now. A fuse has been set to fire one week from now.	2025-02-11 19:08:47 -08:00
Jamil	7730fdeda9	fix(ci): Fix minor command injection in pr_title check (#8101 ) https://app.oneleet.com/tenants/148d888b-6cbe-4198-b4be-359e816927f4/code-security	2025-02-11 16:26:11 -08:00
Jamil	e32d2b845f	fix(portal): Add node name key to metrics labels (#8082 ) Ok, the reason why we're still getting the error `One or more points were written more frequently than the maximum sampling period configured for the metric.` is because the metric points are identified by the labels in the metric, and so are "aggregated" more frequently than our API calls. By adding the node name to the labels, we scope the metric by that node and prevent inserting the points more often than our API calls.	2025-02-11 17:21:27 +00:00
Jamil	393436a4aa	ci: Release Gateway 1.4.4 (#8096 )	2025-02-11 07:22:27 -08:00
Jamil	9f88cd16f4	fix(apple): Load NSImage in MenuBar asynchronously (#8090 ) After further investigation, it appears that the `NSImage` initializer loads and decodes images synchronously from the disk. In the MenuBar, we are "lazy-loading" these images, but since the menu is constructed as part of app initialization, we are effectively loading these when the app boots, in `FirezoneApp`. After loading, these are cached, but the initial load can hang the UI thread on app launch for slow systems. Unfortunately, `NSImage` does not _formally_ conform to `@Sendable`. However, this may be a nuance that isn't true in most cases, such as when treating `NSImage` instances as read-only from only a single thread. As such, we wrap `NSImage` with our own struct, and mark it `@unchecked Sendable`. This allows us to load the images on a background thread and assign them to their UI thread counterparts in an async manner. See further discussion: - https://forums.swift.org/t/why-cant-i-send-an-nsimage-across-actor-boundaries/76199 - https://developer.apple.com/library/archive/documentation/Cocoa/Conceptual/Multithreading/ThreadSafetySummary/ThreadSafetySummary.html#//apple_ref/doc/uid/10000057i-CH12-126728 Related: #7771	2025-02-11 14:36:40 +00:00
Thomas Eizinger	1847e8407a	chore: release Headless Client `v1.4.3` (#8093 )	2025-02-11 14:10:13 +00:00
Thomas Eizinger	6093199ee3	chore: release GUI Client `v1.4.4` (#8092 )	2025-02-11 14:09:34 +00:00
Thomas Eizinger	fc925af6c8	chore(phoenix-channel): log the portal's IP address on connect (#8088 )	2025-02-11 07:11:08 +00:00
Jamil	41f4ae5e7f	fix(apple/macOS): Move to .idle state after log export (#8091 ) This fixes a bug where we couldn't export logs twice because we never returned to the `.idle` state after export. Fixes #8015	2025-02-11 07:07:27 +00:00
Thomas Eizinger	6c93ce76bf	chore(phoenix-channel): log all errors when connection fails (#8089 ) Currently, we are only logging the last error when we fail to connect to any of the addresses from the portal. This is often not useful because the last one is likely to be an IPv6 address which may not be supported on the system so all we learn is "The requested address is not valid in its context.".	2025-02-11 05:58:32 +00:00
Thomas Eizinger	7dcda1dc74	fix(windows): silence `0x800706D9` when DNS deactivation fails (#8085 ) The error code we see here means "There are no more endpoints available from the endpoint mapper." This has something to do with Windows' internal RPC communication between components. DNS deactivation is on a best-effort basis and it appears that everything else is working just fine, despite this error. It appears to happen when we shut down our own service, so perhaps it is just a race condition.	2025-02-11 05:38:37 +00:00
Jamil	063dc73d01	refactor(apple): Remove useless `Task.detached` (#8063 ) Whether we execute a task on the main thread or a background thread doesn't affect whether the thread is "hung" as reported by Sentry. Instead, our options for fixing these are: - Try to use an async version of the underlying API (the [async version](https://developer.apple.com/documentation/appkit/nsworkspace/open(_:configuration:completionhandler:)) of `open` for example) - If there is none, and the call could potentially block (most likely due to disk IO contention), at least schedule this on a new thread using `Task.detached` but with `.background` priority so that it will avoid blocking any other execution. The main takeaway here is that unfortunately, under some conditions, Sentry will _always_ report an "App Hanging" alert since it's constantly monitoring all threads for paused execution longer than 2000ms. We'll probably end up letting some of these slide (pausing a background or worker thread isn't necessarily a UX issue), but pausing the UI thread is. Luckily, we're able to use async APIs for most things. The remaining things (like working with log files over IPC) we use a `Task.detached` for.	2025-02-11 04:55:55 +00:00
Thomas Eizinger	b04d44a711	fix(website): make changelog more typesafe (#8084 ) We currently have a bug in our changelog where the wrong download links are being rendered for the Windows GUI client because we are incorrectly matching on the title. To fix this, we stop matching on the title and instead pass an `OS` enum in the respective changelog components that need to differentiate between OS-specific entries.	2025-02-11 04:55:42 +00:00
Thomas Eizinger	d7ebd07183	fix(linux): check for correct sign of netlink error code (#8087 ) We've previously tried to handle the "No such process" error from netlink when it tries to remove a route that no longer exists. What we failed to do is use the correct sign for the error code as netlink errors are always negative, yet when printed, they are positive numbers.	2025-02-11 04:47:51 +00:00
Thomas Eizinger	b193dd91f6	fix(windows): don't warn on disabled IP stack (#8086 ) When an IP stack is programmatically disabled, such as with: > reg add "HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip6\Parameters" /v DisabledComponents /t REG_DWORD /d 255 /f Attempting to interact with this IP stack will yield "NOT_FOUND" errors. These aren't worth reporting to Sentry because there isn't much we can do about it.	2025-02-11 04:37:17 +00:00
Thomas Eizinger	c9b9fb0e6c	feat(relay): add `SOFTWARE` attribute (#8076 ) Adding a `SOFTWARE` attribute is recommended by the spec and will allow us to identify from client logs, which version of the relay we are talking to.	2025-02-11 03:34:38 +00:00
Jamil	feb1ec5e17	chore: Update client URLs & redirects for consistency (#8056 ) Whenever changing a URL we care about, we add an entry in `website/redirects.js` to avoid breaking links to the old page. Most search engines reindex these after 1 year, but other websites and places won't, so we should generally keep them indefinitely since they don't cost us much to keep around.	2025-02-11 03:30:41 +00:00
Thomas Eizinger	436b502eab	fix(windows): handle disabled IPv6 stack gracefully (#8083 ) Fixes: #8049.	2025-02-11 03:21:32 +00:00
Thomas Eizinger	c5381b0e54	fix(telemetry): always clear previous Sentry session (#8075 ) We have a bug in our Rust telemetry code where starting a new telemetry session for an unsupported environment doesn't stop the previous one if one already exists. This results in very confusing Sentry issues that cannot be correlated to our infrastructure.	2025-02-11 00:54:35 +00:00
Thomas Eizinger	f48df7585c	refactor(windows): de-duplicate Win32 error codes (#8071 ) The errors returned from Win32 API calls are currently duplicated in several places. This makes it error-prone to handle them correctly. With this PR, we de-duplicate this and add proper docs and links for further reading to them. We also fix a case where we would currently fail to set IP addresses for our tunnel interface if the IP stack is not supported.	2025-02-10 23:33:06 +00:00
Jamil	e59aa0c93f	chore: Hide internal commands/flags in headless clients (#8055 ) These are just noise for the user and only used internally in Firezone.	2025-02-10 22:38:31 +00:00
Jamil	e8384ea5b0	refactor(apple): Make IPC calls async, bubbling errors (#8062 ) `fetchResources` is an IPC call, and we can use `withCheckedThrowingContinuation` like the others to yield while we wait for the provider to respond. The particular sentry issue related to this isn't because we are necessarily blocking the task thread, rather, I suspect it's when applying the fetched Resources to the UI that we're slow. There isn't much we can do about this, but this PR will only help. Because we're using a timer that fires off a closure to do this, we still use a `callback` inside the timer to actually set the Resources on the main `Store`, which updates the UI. Unfortunately refactoring these IPC calls led to somewhat of a ball of yarn, so the best way to summarize the spirit of this PR is: - Ensure IPC calls use `withCheckedThrowingContinuation` where possible - Thusly, marking these functions `async throws` - Bubble these errors up to the view where we can ultimately decide what to do with them - Keep VPN state management and conditional logic based on `NEVPNStatus` in the vpnConfigurationManager	2025-02-10 22:38:05 +00:00
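
The sign mismatch described in d7ebd07183 (#8087) can be sketched in Rust as follows. This is a minimal illustration under stated assumptions, not the code from that PR: the helper names `errno_from_netlink` and `is_no_such_process` are hypothetical, and `ESRCH = 3` is the Linux errno value for "No such process". The only point it demonstrates is that a netlink error code is the negated errno, so a comparison against the positive constant never matches.

```rust
/// Linux errno value for "No such process".
const ESRCH: i32 = 3;

/// Hypothetical helper: netlink reports failures as a negated errno,
/// so a code of -3 corresponds to errno 3. Non-negative codes are not errors.
fn errno_from_netlink(code: i32) -> Option<i32> {
    if code < 0 {
        // `checked_neg` avoids overflow on `i32::MIN`.
        code.checked_neg()
    } else {
        None
    }
}

/// Hypothetical helper: true if the raw netlink code means "No such process".
/// Comparing against `ESRCH` without the sign flip would always be false.
fn is_no_such_process(code: i32) -> bool {
    errno_from_netlink(code) == Some(ESRCH)
}
```

With these helpers, `is_no_such_process(-3)` holds while `is_no_such_process(3)` does not, which is the distinction the commit message says was previously missed.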


6574 Commits