Commit Graph

7634 Commits

Author SHA1 Message Date
Jamil
bebc69e2bc fix(portal): use distinct slot names (#9672)
These were being configured using the same default `events_` value.
2025-06-25 17:28:17 +00:00
Thomas Eizinger
bf03e13cf0 feat(gateway): vary DNS resource NAT TTL by protocol (#9655)
Instead of a 1 minute TTL for all connections, we vary the TTL based on
the protocol being used. For TCP, that is 2 hours. For UDP and ICMP, we
use 2 minutes.
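
As a rough sketch of the mapping described above (the `Protocol` enum and `nat_ttl` function are illustrative names, not taken from the Gateway code), the per-protocol TTL could look like this:

```rust
use std::time::Duration;

/// Protocols the NAT table distinguishes. Hypothetical type for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Protocol {
    Tcp,
    Udp,
    Icmp,
}

/// TTL of a NAT entry, using the values stated in the commit message.
pub fn nat_ttl(protocol: Protocol) -> Duration {
    match protocol {
        // TCP connections are typically long-lived: 2 hours.
        Protocol::Tcp => Duration::from_secs(2 * 60 * 60),
        // UDP and ICMP carry no connection state: 2 minutes.
        Protocol::Udp | Protocol::Icmp => Duration::from_secs(2 * 60),
    }
}
```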

Resolves: #9645
2025-06-25 17:24:40 +00:00
Thomas Eizinger
d5be185ae4 chore(rust): remove telemetry spans and events (#9634)
Originally, we introduced these to gather some data from logs / warnings
that we considered to be too spammy. We've since merged a
burst-protection that will at most submit the same event once every 5
minutes.

The data from the telemetry spans themselves has not been used at all.
2025-06-25 17:15:57 +00:00
Thomas Eizinger
6972d4d62a test(windows): sleep before asserting on keyring (#9670)
I suspect that the new Windows runners are "too fast" and we hit a race
condition in the use of the keyring on Windows which causes failing CI
jobs. The attempt to fix this is to sleep for 1 second before every
assert in the test.
2025-06-25 17:05:30 +00:00
Jamil
343717b502 refactor(portal): broadcast client struct when updated (#9664)
When a client is updated, we may need to re-initialize it if "breaking"
fields are updated. If non-breaking fields are changed, such as name, we
don't need to re-initialize the client.

This PR also adds a helper `struct_from_params/2` which will create a
schema struct from WAL data in order to type cast any needed data for
convenience. This avoids having to do a DB hit - we _already have the
data from the DB_ - we just need to format and send it.

Related: #9501
2025-06-25 17:04:41 +00:00
Thomas Eizinger
3b972643b1 feat(rust): stream logs to Sentry when enabled in PostHog (#9635)
Sentry has a new "Logs" feature where we can stream logs directly to
Sentry. Doing this for all Clients and Gateways would be way too much
data to collect though.

In order to aid debugging from customer installations, we add a
PostHog-managed feature flag that - if set to `true` - enables the
streaming of logs to Sentry. This feature flag is evaluated every time
the telemetry context is initialised:

- For all FFI usages of connlib, this happens every time a new session
is created.
- For the Windows/Linux Tunnel service, this also happens every time we
create a new session.
- For the Headless Client and Gateway, it happens on startup and
afterwards, every minute. The feature-flag context itself is only
checked every 5 minutes though so it might take up to 5 minutes before
this takes effect.

The default value - like all feature flags - is `false`. Therefore, if
there is any issue with the PostHog service, we will fall back to the
previous behaviour where logs are simply stored locally.

Resolves: #9600
2025-06-25 16:14:14 +00:00
Jamil
02dd21018d fix(portal): log error when connected_nodes crossed (#9668)
To avoid log spam, we only log an error when the threshold boundary is
crossed.
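
A minimal sketch of edge-triggered logging, assuming the error should fire when the node count drops below a threshold (the direction, the `ThresholdMonitor` type and its methods are assumptions for illustration; the portal itself is not written in Rust):

```rust
/// Remembers which side of the threshold the last sample was on and reports
/// only the transition, not every sample. Hypothetical type.
pub struct ThresholdMonitor {
    threshold: usize,
    below: bool,
}

impl ThresholdMonitor {
    pub fn new(threshold: usize) -> Self {
        Self { threshold, below: false }
    }

    /// Returns `true` only when `connected_nodes` falls below the threshold
    /// after having been at or above it. The caller logs the error then.
    pub fn should_log(&mut self, connected_nodes: usize) -> bool {
        let now_below = connected_nodes < self.threshold;
        let crossed = now_below && !self.below;
        self.below = now_below;
        crossed
    }
}
```

Repeated samples on the same side of the threshold return `false`, which is what keeps the log quiet.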
2025-06-24 21:47:17 -07:00
Jamil
95624211cd fix(portal): update publications when config changes (#9667)
Creating table publication(s) (and the associated replication slot) is
sticky. These will outlive the lifetime of the process that created
them.

We don't want to remove them on shutdown, because this will pause WAL
writing to disk.

However, when starting the _new_ application, it's possible
`table_subscriptions` has changed (such as if we decide we no longer
want events for a certain table). We weren't updating the created
publication(s) with these added/removed tables, so this PR updates the
replication connection setup state machine to pass through a few
conditionals to get these properly updated with the diff of old vs new.
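
The "diff of old vs new" can be pictured as two set differences. The sketch below is illustrative only: `publication_diff` is a hypothetical helper, and the real logic lives in the portal's replication connection state machine.

```rust
use std::collections::BTreeSet;

/// Returns `(to_add, to_remove)`: tables missing from the existing
/// publication, and tables that should no longer be part of it.
pub fn publication_diff(
    published: &BTreeSet<String>,
    desired: &BTreeSet<String>,
) -> (Vec<String>, Vec<String>) {
    // In `desired` but not yet published: add to the publication.
    let to_add = desired.difference(published).cloned().collect();
    // Published but no longer desired: drop from the publication.
    let to_remove = published.difference(desired).cloned().collect();
    (to_add, to_remove)
}
```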
2025-06-24 21:31:40 -07:00
Jamil
a9f49629ae feat(portal): add change_logs table and insert data (#9553)
Building on the WAL consumer that's been in development over the past
several weeks, we introduce a new `change_logs` table that stores very
lightly up-fitted data decoded from the WAL:

- `account_id` (indexed): a foreign key reference to an account.
- `inserted_at` (indexed): the timestamp of insert, for truncating rows
later.
- `table`: the table where the op took place.
- `op`: the operation performed (insert/update/delete)
- `old_data`: a nullable map of the old row data (update/delete)
- `data`: a nullable map of the new row data (insert/update)
- `vsn`: an integer version field we can bump to signify schema changes
in the data in case we need to apply operations to only new or only old
data.

Judging from our prod metrics, we're currently averaging about 1,000 write
operations a minute, which will generate about 1-2 dozen changelogs / s.
Doing the math on this, 30 days at our current volume will yield about
50M / month, which should be ok for some time, since this is an
append-only, rarely (if ever) read from table.
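
For reference, the arithmetic behind that estimate, assuming exactly one change log row per write operation and a constant rate (both simplifications; the function names are made up for this sketch):

```rust
/// Rows written per second for a given per-minute write rate.
pub fn rows_per_second(writes_per_minute: u64) -> f64 {
    writes_per_minute as f64 / 60.0
}

/// Rows accumulated over `days` at a constant per-minute write rate.
pub fn rows_over_days(writes_per_minute: u64, days: u64) -> u64 {
    // minutes per hour * hours per day * days
    writes_per_minute * 60 * 24 * days
}
```

At 1,000 writes a minute this gives roughly 16.7 rows / s and 43.2M rows over 30 days, in line with the "about 50M / month" figure above.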

The one aspect of this we may need to handle sooner rather than later is
batch-inserting these. That raises an issue though - currently, in this
PR, we process each WAL event serially, ending with the final
acknowledgement `:ok` which will signal to Postgres our status in
processing the WAL.

If we do anything async here, this processing "cursor" then becomes
inaccurate, so we may need to think about what to track and what data we
care about.

Related: #7124
2025-06-25 02:06:20 +00:00
Jamil
2b154d88bf fix(ci): use relaxed naming for ignored checks (#9666)
These jobs have the `ci / ` prefix when run on main, but no prefix when
run on PRs. To fix the ignored checks, we need to use `contains`.
2025-06-24 18:56:34 -07:00
Jamil
75740e4377 fix(ci): check for correct ignored job names (#9665)
These need the `ci / ` prefix.
2025-06-24 16:15:00 -07:00
Jamil
ff5a632d2a fix(portal): only show never synced correctly (#9652)
It's confusing that we clear this field upon sync failure. Instead, we
let it track the time of the last sync.

Will be cleaned up in #6294 so just applying a minimal fix now.

Fixes #7715
2025-06-24 22:54:30 +00:00
Jamil
b68d037ef4 fix(deps): remove unused android-client-ffi dep (#9662)
Fixes
https://github.com/firezone/firezone/actions/runs/15859533881/job/44713030395
2025-06-24 21:13:53 +00:00
Jamil
110d504516 fix(ci): maintain whitespace in sources list (#9663)
Another issue was introduced in #9590 - we need to maintain the
whitespace in the sources list when generating them.

Fixes
https://github.com/firezone/firezone/actions/runs/15859521283/job/44713395755
2025-06-24 21:03:11 +00:00
Jamil
85e67f1925 fix(ci): preserve sources whitespace (#9661)
Fixes a whitespace issue introduced in #9590
2025-06-24 19:13:54 +00:00
Jamil
caa21accf9 feat(portal): add mock sync adapter staging (#9660)
This needs to be enabled here too.
2025-06-24 19:08:58 +00:00
Jamil
933d51e3d0 feat(portal): send account_slug in gateway init (#9653)
Adds the `account_slug` to the gateway's `init` message. When the
account slug is changed, the gateway's socket is disconnected using the
same mechanism as gateway deletion, which causes the gateway to
reconnect immediately and receive a new `init`.

Related: #9545
2025-06-24 18:35:06 +00:00
Brian Manifold
27f482e061 fix(portal): trim whitespace in all remaining forms (#9654)
Why:

* After updating the Auth Provider changesets to trim all whitespace
from user editable string fields we realized we needed to do the same
for all forms/entities within Firezone. This commit updates all entities
to trim whitespace on string fields.

Fixes: #9579
2025-06-24 14:28:51 +00:00
Thomas Eizinger
4be73da21c fix(gateway): reply with cookie when rate limit is hit (#9657)
WireGuard implements a rate-limit mechanism that kicks in when the number
of handshake initiations exceeds a certain limit. This is important because
handshakes involve asymmetric cryptography and are computationally
expensive. To prevent DoS attacks where other peers repeatedly ask for
new handshakes, the rate limiter implements a cookie mechanism where -
when under load - the remote peer needs to include a given cookie in new
handshakes. This cookie is tied to the peer's IP address to prevent it
from being reused by other peers.

Up until now, we have not been passing the sender's IP address to
`boringtun` and therefore, the only option when the rate limit was hit
was to error with `UnderLoad`.

By passing the source IP of the packet, `boringtun` can engage in the
cookie-reply mechanism and therefore avoid the `UnderLoad` error.

Resolves: #9643
2025-06-24 11:33:38 +00:00
Thomas Eizinger
91edd11a47 feat(gateway): send $identify event with account-slug (#9658)
When we receive the `account_slug` from the portal, the Gateway now
sends a `$identify` event to PostHog. This will allow us to target
Gateways with feature-flags based on the account they are connected to.
2025-06-24 11:31:56 +00:00
Thomas Eizinger
d376a122e4 feat(telemetry): send account_slug to PostHog (#9636)
In order to more easily target customers with certain feature flags, we
include the `account_slug` in the `$identify` event to PostHog. This
will allow us to create Cohorts in PostHog and enable / disable feature
flags for all installations of Firezone for a particular customer.
2025-06-24 09:00:24 +00:00
Thomas Eizinger
3c0e866e77 feat(connlib): listen on 52625 by default (#9593)
Presently, `connlib` always just lets the OS pick a random port for our
UDP socket. This works well in many cases but has the downside that IF
network admins would like to aid in the process of establishing direct
connections, they cannot open a specific port because it is always
random.

It doesn't cost us anything to try and bind to a particular port (here
52625) and fall back to a random one if something is listening there.
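
A minimal sketch of that bind-then-fall-back behaviour using only the standard library (`bind_udp` is a hypothetical helper; `connlib`'s actual socket setup is not shown here and may differ):

```rust
use std::net::{Ipv4Addr, SocketAddr, UdpSocket};

/// Preferred port from this commit.
pub const PREFERRED_PORT: u16 = 52625;

/// Tries `preferred` first. If that bind fails for any reason (for example
/// because the port is already in use), binds to port 0 so the OS picks one.
pub fn bind_udp(preferred: u16) -> std::io::Result<UdpSocket> {
    let ip = Ipv4Addr::UNSPECIFIED;
    UdpSocket::bind(SocketAddr::from((ip, preferred)))
        .or_else(|_| UdpSocket::bind(SocketAddr::from((ip, 0))))
}
```

Calling `bind_udp(PREFERRED_PORT)` twice in the same process should give the second socket a different, OS-assigned port.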

The port 52625 was chosen because:

- It is within the ephemeral port range and will therefore never be
registered to anything else.
- It is a palindrome and therefore easy to remember.
- When typing FIRE on a phone keypad, you get the numbers 3473. 52625
is the port at offset 3473 from the start of the ephemeral port range.
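
The derivation can be checked with a few lines of code. This sketch assumes the ephemeral range starts at 49152 (the IANA dynamic port range); the helper names are invented for illustration:

```rust
/// Start of the IANA dynamic / ephemeral port range (49152-65535).
pub const EPHEMERAL_START: u16 = 49152;

/// Maps a letter to its digit on a standard phone keypad.
pub fn keypad_digit(letter: char) -> Option<u16> {
    match letter.to_ascii_uppercase() {
        'A'..='C' => Some(2),
        'D'..='F' => Some(3),
        'G'..='I' => Some(4),
        'J'..='L' => Some(5),
        'M'..='O' => Some(6),
        'P'..='S' => Some(7),
        'T'..='V' => Some(8),
        'W'..='Z' => Some(9),
        _ => None,
    }
}

/// Spells `word` on the keypad and reads the digits as one decimal number.
/// Returns `None` for non-letters or if the number does not fit in a `u16`.
pub fn keypad_offset(word: &str) -> Option<u16> {
    word.chars().try_fold(0u16, |acc, c| {
        acc.checked_mul(10)?.checked_add(keypad_digit(c)?)
    })
}
```

`keypad_offset("FIRE")` yields 3473, and 49152 + 3473 = 52625.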

In order for this port to be useful in establishing direct connections,
we generate optimistic candidates based on existing remote candidates by
combining the IP of all server-reflexive candidates with the port of all
host candidates.
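
Sketched in code, the combination step might look like the following. This is an illustration only: `optimistic_candidates` is a hypothetical function, and skipping duplicates and already-known addresses is an assumption of the sketch rather than something stated in this commit.

```rust
use std::net::SocketAddr;

/// Pairs the IP of every server-reflexive candidate with the port of every
/// host candidate.
pub fn optimistic_candidates(
    server_reflexive: &[SocketAddr],
    host: &[SocketAddr],
) -> Vec<SocketAddr> {
    let mut out = Vec::new();
    for srflx in server_reflexive {
        for h in host {
            let candidate = SocketAddr::new(srflx.ip(), h.port());
            // Assumption: skip duplicates and addresses we already know about.
            if !out.contains(&candidate) && !server_reflexive.contains(&candidate) {
                out.push(candidate);
            }
        }
    }
    out
}
```

With a server-reflexive candidate `203.0.113.7:40000` and a host candidate `192.168.1.5:52625`, the result is the single address `203.0.113.7:52625`.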

This patch deliberately does not publicly announce this feature in the
docs or the changelog so we can first gather experience with it in our
own test environment.

Resolves: #9559
2025-06-24 08:41:08 +00:00
Thomas Eizinger
40f0609d90 ci: lint GitHub workflows with actionlint (#9590)
[`actionlint`](https://github.com/rhysd/actionlint) is a static analysis
tool for GitHub workflows and actions. It detects various issues ahead
of time and runs shellcheck on all `run` blocks. It is worth noting that
this does **not** lint the contents of composite actions so we still
need to be vigilant when working with those.
2025-06-24 08:05:10 +00:00
Jamil
56b70215a7 fix(ci): dont require upload-bencher (#9650)
Bencher is not the most reliable service, so this PR prevents us from
failing CI runs on the `uploader-bencher` job.

---------

Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-06-24 08:03:06 +00:00
Thomas Eizinger
a91dda139f feat(connlib): only conditionally hash firezone ID (#9633)
A bit of legacy that we have inherited around our Firezone ID is that
the ID stored on the user's device is sha'd before being passed to the
portal as the "external ID". This makes it difficult to correlate IDs in
Sentry and PostHog with the data we have in the portal. For Sentry and
PostHog, we submit the raw UUID stored on the user's device.

As a first step in overcoming this, we embed an "external ID" in those
services as well IF the provided Firezone ID is a valid UUID. This will
allow us to immediately correlate those events.

As a second step, we automatically generate all new Firezone IDs for the
Windows and Linux Client as `hex(sha256(uuid))`. These won't parse as
valid UUIDs and therefore will be submitted as is to the portal.

As a third step, we update all documentation around generating Firezone
IDs to use `uuidgen | sha256` instead of just `uuidgen`. This is
effectively the equivalent of (2) but for the Headless Client and
Gateway where the Firezone ID can be configured via environment
variables.

Resolves: #9382

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
2025-06-24 07:05:48 +00:00
Thomas Eizinger
686918f1d1 chore(rust): bump str0m (#9591)
The latest `main` of str0m undoes a breaking change in the constructor
of `Candidate::relayed` by flipping the parameters back. This will make
it easier to upgrade to the latest release once it is out.
2025-06-24 06:57:55 +00:00
Thomas Eizinger
1bd3d2a382 chore(gateway): remove NAT64/46 module (#9626)
This has been disabled for several releases now and is not causing any
problems in production. We can therefore safely remove it.

It is about time we do this because our tests are actually still testing
the variant without the feature flag and therefore deviate from what we
do in production. We therefore have to convert the tests as well. Doing
so uncovered a minor problem in our ICMP error parsing code: We
attempted to parse the payload of an ICMP error as a fully-valid layer 4
header (e.g. TCP header or UDP header). However, per the RFC a node only
needs to embed the first 8 bytes of the original packet in an ICMPv4
error. That is not enough to parse a valid TCP header as those are at
least 20 bytes.
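
To illustrate why 8 bytes are still useful: TCP and UDP both begin with the source and destination port, so those can be read without a full header. The sketch below uses invented names (`Ports`, `ports_from_truncated_l4`) and is not the parsing code from this PR:

```rust
/// Source and destination port from the start of a TCP or UDP header.
#[derive(Debug, PartialEq, Eq)]
pub struct Ports {
    pub src: u16,
    pub dst: u16,
}

/// Reads the ports from the (possibly truncated) layer 4 bytes embedded in
/// an ICMP error. Only the first 4 bytes are needed, in network byte order.
pub fn ports_from_truncated_l4(l4: &[u8]) -> Option<Ports> {
    let src = u16::from_be_bytes([*l4.first()?, *l4.get(1)?]);
    let dst = u16::from_be_bytes([*l4.get(2)?, *l4.get(3)?]);
    Some(Ports { src, dst })
}
```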

I don't expect this to be a huge problem in production right now though.
We only use this code to parse ICMP errors arriving on the Gateway and I
_think_ most devices actually include more than 8 bytes. This only
surfaced because we are very strict with only embedding exactly 8 bytes
when we generate an ICMP error.

Additionally, we change our ICMP errors to be sent from the resource IP
rather than the Gateway's TUN device. Given that we perform NAT on these
IPs anyway, I think this can still be argued to be RFC-conformant. The
_proxy_ IP which we are trying to contact can be reached but it cannot
be routed further. Therefore the destination is unreachable, yet the
source of this error is the proxy IP itself. I think this is actually
more correct than sending the packets from the Gateway's TUN device
because the TUN device itself is not a routing hop per-se: its IP won't
ever show up in the routing path.
2025-06-24 06:48:30 +00:00
Thomas Eizinger
9616296ebc ci: run all jobs if docker-compose.yml changes (#9639)
2025-06-24 06:16:25 +00:00
Jamil
a68d46bd24 chore(ci): remove write perms on winget workflow (#9598)
This wasn't the issue - the issue was that @firezone-bot needed access
to the firezone/winget-pkgs repo.

Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
2025-06-23 22:26:31 +00:00
Thomas Eizinger
f211c9d46a feat(apple): use .zip for logs (#9536)
This PR replaces the use of Apple Archive with an API that allows us to
zip the log file contents. This API doesn't handle symlinks well so we
move the symlink out of the way before making the zip. The symlink is
then moved back after the process is completed. Any errors in this
process are ignored as the symlink itself is not a critical component of
Firezone.

The zip compression is marginally less efficient than the Apple Archive.
Instead of compressing ~2GB of logs to 11.8 MB we now get an archive of
12.4 MB. Considering how much easier zip files are to handle, this seems
like a fine trade-off.

Resolves: #7475

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil Bou Kheir <jamilbk@users.noreply.github.com>
2025-06-23 22:25:57 +00:00
Jamil
0cd919a5e2 fix(portal): use account_id index in flow expiration (#9623)
There were a couple more instances where we weren't using the
`account_id` which prevented use of the index, causing a DB Connection
queue drop.
2025-06-23 21:51:21 +00:00
Jamil
ec5c433f5b feat(ci): use larger runners for all jobs (#9646)
Append `-xlarge` to the previous runner labels to match new larger
runners.
2025-06-23 14:23:22 -07:00
Thomas Eizinger
950afd9b2d chore(gateway): set account-slug in telemetry context (#9545)
This PR adds an optional field `account_slug` to the Gateway's init
message. If populated, we will use this field to set the account-slug in
the telemetry context. This will allow us to know which customers a
particular Sentry issue is related to.
2025-06-23 18:52:39 +00:00
Jamil
f55596be4e fix(portal): index auth_providers on adapter (#9625)
The `refresh_tokens` job for each auth provider uses a cross-account
query that unfortunately hits no indexes. This can cause slow queries
each time the job runs for the adapter.

We add a simple sparse index to speed this query up.

Related:
https://firezone-inc.sentry.io/issues/6346235615/?project=4508756715569152&query=is%3Aunresolved&referrer=issue-stream&stream_index=1
2025-06-23 18:50:22 +00:00
Thomas Eizinger
7a344836a2 fix(rust): use rust-lld linker for MSVC (#9641)
The latest Visual Studio version shipped a bug in the MSVC linker that
cannot handle symbols above a certain size. Switching to the Rust linker
fixes this issue.

Related: https://github.com/rust-lang/rust/issues/141626
2025-06-24 01:55:36 +10:00
Thomas Eizinger
e36efa5d62 ci: set static Firezone ID for docker-compose setup (#9637)
2025-06-23 14:59:53 +00:00
Jamil
0af7582ab6 fix(portal): flush metrics as we accumulate (#9622)
Unfortunately #9608 did not handle the case where we receive more than
200 compressed metrics in a single call. To fix this, we ensure we
`flush` the metrics buffer inside the `reduce` so that we never grow the
accumulated metrics buffer larger than 200 points.
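
The idea of flushing inside the reduce can be sketched as follows. This is a generic illustration with a hypothetical `accumulate` function and a caller-supplied `flush` callback, not the portal's code:

```rust
/// Maximum number of points per flush, per the commit message.
pub const MAX_BUFFER: usize = 200;

/// Folds over incoming points and flushes whenever the buffer is full, so
/// the buffer never grows past `MAX_BUFFER`. Returns the unflushed remainder.
pub fn accumulate<T>(
    incoming: impl IntoIterator<Item = T>,
    mut flush: impl FnMut(Vec<T>),
) -> Vec<T> {
    incoming.into_iter().fold(Vec::new(), |mut buffer, point| {
        buffer.push(point);
        if buffer.len() >= MAX_BUFFER {
            // Hand off the full batch and continue with an empty buffer.
            flush(std::mem::take(&mut buffer));
        }
        buffer
    })
}
```

Feeding 450 points through `accumulate` produces two flushes of 200 and leaves 50 points in the returned buffer.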

The log string was also updated to roll the issue over in Sentry, as
the old issue was set to delete and destroy to prevent issue spam.

---------

Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-06-23 14:58:18 +00:00
Jamil
a5b4ec489f fix(docs): fix spacing due to new prettier (#9630)
Prettier was upgraded and has changed its mind on some spacing rules in
markdown files.
2025-06-23 05:52:57 +00:00
Thomas Eizinger
259b8e2a32 ci: fix Tauri workflow permissions (#9628)
2025-06-23 15:52:35 +10:00
Thomas Eizinger
692b61d159 ci: move GUI smoke tests to tauri workflow (#9627)
2025-06-23 08:37:52 +10:00
Thomas Eizinger
94651093cb chore(rust): remove unused Dockerfile-rpm (#9624)
2025-06-23 05:29:18 +10:00
Jamil
3029e00355 fix(android): fix view state lifecycle around tunnel/auth (#9621)
`onViewCreated()` is called when the view initializes, and then
`onResume()` is called right after, in addition to anytime the view is
shown again.

To prevent showing the VPN permission activity twice, we remove the
`checkTunnelState()` from onViewCreated, allowing only `onResume()` to
call it.

A boolean flag is added to track whether this is the "first" launch of
the app in order to determine whether to `connectOnStart`.

Fixes #9584

---------

Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
2025-06-22 16:20:11 +00:00
Jamil
867f9dfad3 fix(ci): set github token for publish workflow (#9620)
This env var needs to be explicitly set.

Related: #9618
2025-06-21 20:37:38 -07:00
Jamil
e970e3f15a fix(ci): split newline correctly in github workflow file (#9619)
GitHub doesn't like this syntax.

Related: #9618
2025-06-21 20:26:02 -07:00
Jamil
2e065d6719 fix(ci): use publish inputs directly (#9618)
We can't use job outputs in the job specification for a subsequent
workflow.

Related: #9617
2025-06-21 20:22:41 -07:00
Jamil
cb4441eafa fix(ci): publish sha of images from release (#9617)
To retroactively publish artifacts for the gateway and headless client,
we need to pull the sha of the corresponding release tag.

Related: #9615
2025-06-21 20:18:01 -07:00
Jamil
3baefd0fcf fix(ci): remove unused id from step in publish (#9616)
This isn't a valid name and can be removed anyway.

Related: #9615
2025-06-21 19:47:16 -07:00
Jamil
2598df3030 feat(ci): allow publish workflow to be run manually (#9615)
This allows us to retroactively run publish workflows that may have
failed due to workflow bugs.

Needed to publish the 1.4.11 gateway image.
2025-06-21 19:44:34 -07:00
Jamil
c783b23bae refactor(portal): rename conditional->manual (#9612)
These only have one condition - to run manually. `manual migrations`
better implies that these migrations _must_ typically be run manually.
2025-06-21 21:17:33 +00:00
Thomas Eizinger
a2c122a3c0 refactor(apple): use guard for checking valid handle (#9614)
Follow-up to #9597
2025-06-21 21:17:01 +00:00