Instead of a one-minute TTL for all connections, we now vary the TTL
based on the protocol being used: 2 hours for TCP, and 2 minutes for UDP
and ICMP.
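A minimal sketch of the per-protocol TTL selection; the `Protocol` enum is a simplified stand-in for however the connection table actually classifies a flow:

```rust
use std::time::Duration;

// Simplified stand-in for the flow classification.
enum Protocol {
    Tcp,
    Udp,
    Icmp,
}

// Pick the idle TTL for a connection based on its protocol.
fn connection_ttl(protocol: Protocol) -> Duration {
    match protocol {
        Protocol::Tcp => Duration::from_secs(2 * 60 * 60), // 2 hours
        Protocol::Udp | Protocol::Icmp => Duration::from_secs(2 * 60), // 2 minutes
    }
}
```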
Resolves: #9645
Originally, we introduced these telemetry spans to gather data from
logs / warnings that we considered too spammy. We've since merged a
burst-protection that submits the same event at most once every 5
minutes.
The data from the telemetry spans themselves has not been used at all.
I suspect that the new Windows runners are "too fast" and we hit a race
condition in the use of the keyring on Windows, which causes failing CI
jobs. The attempted fix is to sleep for 1 second before every assert in
the test.
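A rough sketch of the pattern, with a hypothetical stand-in for the keyring call under test:

```rust
use std::{thread, time::Duration};

// Hypothetical stand-in for the keyring operation being asserted on.
fn read_secret() -> Option<String> {
    Some("secret".to_owned())
}

#[test]
fn keyring_roundtrip() {
    // Give the Windows keyring a moment to settle; the new, faster
    // runners expose a race here otherwise.
    thread::sleep(Duration::from_secs(1));
    assert!(read_secret().is_some());
}
```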
When a client is updated, we may need to re-initialize it if "breaking"
fields are updated. If non-breaking fields are changed, such as the
name, we don't need to re-initialize the client.
This PR also adds a helper `struct_from_params/2` which creates a
schema struct from WAL data in order to type-cast any needed data for
convenience. This avoids a DB hit - we _already have the
data from the DB_ - we just need to format and send it.
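As a rough illustration of the re-init decision (in Rust for brevity; the actual check lives in the Elixir portal, and the breaking field shown is hypothetical):

```rust
// Which fields changed in this client update.
struct ClientChanges {
    name: bool,       // non-breaking: no re-init required
    public_key: bool, // hypothetical example of a "breaking" field
}

// Only breaking field changes force a re-initialization.
fn needs_reinit(changes: &ClientChanges) -> bool {
    changes.public_key
}
```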
Related: #9501
Sentry has a new "Logs" feature where we can stream logs directly to
Sentry. Doing this for all Clients and Gateways would be way too much
data to collect though.
In order to aid debugging from customer installations, we add a
PostHog-managed feature flag that - if set to `true` - enables the
streaming of logs to Sentry. This feature flag is evaluated every time
the telemetry context is initialised:
- For all FFI usages of connlib, this happens every time a new session
is created.
- For the Windows/Linux Tunnel service, this also happens every time we
create a new session.
- For the Headless Client and Gateway, it happens on startup and every
minute afterwards. The feature-flag context itself is only checked
every 5 minutes, though, so it might take up to 5 minutes before this
takes effect.
The default value - like for all feature flags - is `false`. Therefore,
if there is any issue with the PostHog service, we fall back to the
previous behaviour where logs are simply stored locally.
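A minimal sketch of the gating at telemetry-context initialisation; `feature_flags::stream_logs` and `enable_log_streaming` are hypothetical names for illustration:

```rust
mod feature_flags {
    // Hypothetical: evaluates the PostHog-managed flag. Like all feature
    // flags, it defaults to `false`, e.g. when PostHog is unreachable.
    pub fn stream_logs() -> bool {
        false
    }
}

// Hypothetical: installs the layer that forwards logs to Sentry.
fn enable_log_streaming() {}

fn init_telemetry_context() {
    if feature_flags::stream_logs() {
        enable_log_streaming();
    }
    // Otherwise, logs are simply stored locally, as before.
}
```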
Resolves: #9600
Created table publications (and their associated replication slots) are
sticky: they outlive the lifetime of the process that created them.
We don't want to remove them on shutdown, because this would pause WAL
writing to disk.
However, when starting the _new_ application, it's possible
`table_subscriptions` has changed (such as if we decide we no longer
want events for a certain table). We weren't updating the created
publication(s) with these added/removed tables, so this PR updates the
replication connection setup state machine to pass through a few
conditionals that update the publication with the diff of old vs new
tables.
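The diffing step itself boils down to two set differences; a sketch in Rust for illustration (the actual implementation is part of the Elixir replication setup and applies the result via `ALTER PUBLICATION ... ADD/DROP TABLE`):

```rust
use std::collections::HashSet;

// Compute which tables must be added to and dropped from the existing
// publication so it matches the new `table_subscriptions`.
fn publication_diff<'a>(
    existing: &'a HashSet<String>,
    wanted: &'a HashSet<String>,
) -> (Vec<&'a String>, Vec<&'a String>) {
    let to_add = wanted.difference(existing).collect();
    let to_drop = existing.difference(wanted).collect();
    (to_add, to_drop)
}
```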
Building on the WAL consumer that's been in development over the past
several weeks, we introduce a new `change_logs` table that stores very
lightly up-fitted data decoded from the WAL (sketched as a struct after
this list):
- `account_id` (indexed): a foreign key reference to an account.
- `inserted_at` (indexed): the timestamp of insert, for truncating rows
later.
- `table`: the table where the op took place.
- `op`: the operation performed (insert/update/delete).
- `old_data`: a nullable map of the old row data (update/delete).
- `data`: a nullable map of the new row data (insert/update).
- `vsn`: an integer version field we can bump to signify schema changes
in the data in case we need to apply operations to only new or only old
data.
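Roughly, the row shape, expressed as a Rust struct for illustration (the actual table is defined in an Ecto migration; the map and key types are approximations):

```rust
use std::collections::HashMap;
use std::time::SystemTime;

// Sketch of a `change_logs` row.
struct ChangeLog {
    account_id: String,      // indexed FK to accounts (a UUID in practice)
    inserted_at: SystemTime, // indexed; used to truncate old rows
    table: String,           // table where the op took place
    op: Op,                  // operation performed
    old_data: Option<HashMap<String, String>>, // update/delete
    data: Option<HashMap<String, String>>,     // insert/update
    vsn: i64,                // bumped to signify schema changes
}

enum Op {
    Insert,
    Update,
    Delete,
}
```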
Judging from our prod metrics, we currently average about 1,000 write
operations per minute, which generates about 1-2 dozen change logs per
second. Doing the math, 30 days at our current volume will yield about
50M rows per month, which should be OK for some time, since this is an
append-only table that is rarely (if ever) read from.
The one aspect of this we may need to handle sooner rather than later is
batch-inserting these. That raises an issue though - currently, in this
PR, we process each WAL event serially, ending with the final
acknowledgement `:ok` which signals to Postgres our status in
processing the WAL.
If we do anything async here, this processing "cursor" becomes
inaccurate, so we may need to think about what to track and what data we
care about.
Related: #7124
It's confusing that we clear this field upon sync failure. Instead, we
let it track the time of the last sync.
Will be cleaned up in #6294 so just applying a minimal fix now.
Fixes: #7715
Adds the `account_slug` to the gateway's `init` message. When the
account slug is changed, the gateway's socket is disconnected using the
same mechanism as gateway deletion, which causes the gateway to
reconnect immediately and receive a new `init`.
Related: #9545
Why:
* After updating the Auth Provider changesets to trim all whitespace
from user-editable string fields, we realized we needed to do the same
for all forms/entities within Firezone. This commit updates all entities
to trim whitespace on string fields.
Fixes: #9579
WireGuard implements a rate-limit mechanism for when the number of
handshake initiations exceeds a certain limit. This is important because
handshakes involve asymmetric cryptography and are computationally
expensive. To prevent DoS attacks where other peers repeatedly ask for
new handshakes, the rate limiter implements a cookie mechanism where -
when under load - the remote peer needs to include a given cookie in new
handshakes. This cookie is tied to the peer's IP address to prevent it
from being reused by other peers.
Up until now, we have not been passing the sender's IP address to
`boringtun` and therefore, the only option when the rate limit was hit
was to error with `UnderLoad`.
By passing the source IP of the packet, `boringtun` can engage in the
cookie-reply mechanism and therefore avoid the `UnderLoad` error.
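A sketch of the call site, assuming boringtun's current `Tunn::decapsulate` signature (which takes the datagram's source address as an `Option<IpAddr>`):

```rust
use std::net::SocketAddr;

use boringtun::noise::{Tunn, TunnResult};

fn handle_datagram(tunn: &mut Tunn, from: SocketAddr, datagram: &[u8], buf: &mut [u8]) {
    // Passing `Some(from.ip())` lets boringtun reply with a cookie when
    // rate-limited instead of failing with `UnderLoad`.
    match tunn.decapsulate(Some(from.ip()), datagram, buf) {
        TunnResult::WriteToNetwork(_reply_or_packet) => {
            // Send this back out; under load it is the cookie reply the
            // peer must echo in its next handshake initiation.
        }
        TunnResult::Err(_e) => {
            // With `None` as the source address, a rate-limited
            // handshake could only ever end up here as `UnderLoad`.
        }
        _ => { /* decrypted packets, keep-alives, etc. */ }
    }
}
```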
Resolves: #9643
When we receive the `account_slug` from the portal, the Gateway now
sends a `$identify` event to PostHog. This will allow us to target
Gateways with feature flags based on the account they are connected to.
In order to more easily target customers with certain feature flags, we
include the `account_slug` in the `$identify` event to PostHog. This
will allow us to create Cohorts in PostHog and enable/disable feature
flags for all installations of Firezone for a particular customer.
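Sketched with `serde_json`, the payload shape we would expect per PostHog's capture API (the exact fields our telemetry code sends may differ):

```rust
use serde_json::{json, Value};

// Build a `$identify` event that sets the account slug on the person,
// so PostHog Cohorts can match on it.
fn identify_event(firezone_id: &str, account_slug: &str) -> Value {
    json!({
        "event": "$identify",
        "distinct_id": firezone_id,
        "properties": {
            "$set": { "account_slug": account_slug }
        }
    })
}
```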
Presently, `connlib` always lets the OS pick a random port for our
UDP socket. This works well in many cases but has the downside that
network admins who would like to aid the process of establishing direct
connections cannot open a specific port, because it is always random.
It doesn't cost us anything to try to bind to a particular port (here
52625) and fall back to a random one if something is already listening
there.
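A minimal sketch of the bind-with-fallback using `std::net` (connlib's actual socket setup is more involved):

```rust
use std::net::UdpSocket;

// Prefer the well-known port; fall back to an OS-assigned one if it is
// already taken.
fn bind_udp() -> std::io::Result<UdpSocket> {
    match UdpSocket::bind("0.0.0.0:52625") {
        Ok(socket) => Ok(socket),
        Err(_) => UdpSocket::bind("0.0.0.0:0"), // port 0: let the OS pick
    }
}
```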
The port 52625 was chosen because:
- It is within the ephemeral port range and will therefore never be
registered to anything else.
- It is a palindrome and therefore easy to remember.
- When typing FIRE on a phone keypad, you get the number 3473, and 52625
is the port at offset 3473 from the start of the ephemeral port range
(49152 + 3473 = 52625).
In order for this port to be useful in establishing direct connections,
we generate optimistic candidates based on existing remote candidates by
combining the IP of all server-reflexive candidates with the port of all
host candidates.
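A sketch of that combination step, with simplified types (the real code operates on str0m candidates):

```rust
use std::net::{IpAddr, SocketAddr};

// Pair every server-reflexive IP with every host-candidate port to form
// optimistic candidates.
fn optimistic_candidates(srflx_ips: &[IpAddr], host_ports: &[u16]) -> Vec<SocketAddr> {
    srflx_ips
        .iter()
        .flat_map(|ip| host_ports.iter().map(move |port| SocketAddr::new(*ip, *port)))
        .collect()
}
```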
This patch deliberately does not publicly announce this feature in the
docs or the changelog so we can first gather experience with it in our
own test environment.
Resolves: #9559
[`actionlint`](https://github.com/rhysd/actionlint) is a static analysis
tool for GitHub workflows and actions. It detects various issues ahead
of time and runs shellcheck on all `run` blocks. It is worth noting that
this does **not** lint the contents of composite actions so we still
need to be vigilant when working with those.
A bit of legacy that we have inherited around our Firezone ID is that
the ID stored on the user's device is hashed (SHA-256) before being
passed to the portal as the "external ID". This makes it difficult to
correlate IDs in Sentry and PostHog with the data we have in the portal,
because for Sentry and PostHog, we submit the raw UUID stored on the
user's device.
As a first step in overcoming this, we embed an "external ID" in those
services as well, if the provided Firezone ID is a valid UUID. This will
allow us to immediately correlate those events.
As a second step, we automatically generate all new Firezone IDs for the
Windows and Linux Clients as `hex(sha256(uuid))`. These won't parse as
valid UUIDs and will therefore be submitted as-is to the portal.
As a third step, we update all documentation around generating Firezone
IDs to use `uuidgen | sha256` instead of just `uuidgen`. This is
effectively the equivalent of (2) but for the Headless Client and
Gateway where the Firezone ID can be configured via environment
variables.
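A sketch of the generation scheme, assuming the `uuid`, `sha2`, and `hex` crates (whether the UUID's string form or its raw bytes get hashed is an assumption here):

```rust
use sha2::{Digest, Sha256};
use uuid::Uuid;

// New-style Firezone ID: `hex(sha256(uuid))`. The result no longer
// parses as a UUID, so it is submitted to the portal as-is.
fn new_firezone_id() -> String {
    let uuid = Uuid::new_v4();
    let digest = Sha256::digest(uuid.to_string().as_bytes());
    hex::encode(digest)
}
```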
Resolves: #9382
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
The latest `main` of str0m undoes a breaking change in the constructor
of `Candidate::relayed` by flipping the parameters back. This will make
it easier to upgrade to the latest release once it is out.
This has been disabled for several releases now and is not causing any
problems in production. We can therefore safely remove it.
It is about time we do this because our tests are actually still testing
the variant without the feature flag and thus deviate from what we do
in production, so we have to convert the tests as well. Doing so
uncovered a minor problem in our ICMP error parsing code: we attempted
to parse the payload of an ICMP error as a fully valid layer-4 header
(e.g. a TCP or UDP header). However, per the RFC, a node only needs to
embed the first 8 bytes of the original packet in an ICMPv4 error. That
is not enough to parse a valid TCP header, as those are at least 20
bytes.
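The fix boils down to only reading what is guaranteed to be there; a sketch, assuming `payload` starts at the embedded transport header:

```rust
// The first 8 bytes of both TCP and UDP headers contain the source and
// destination ports, which is all an ICMPv4 error is guaranteed to embed.
fn ports_from_icmp_payload(payload: &[u8]) -> Option<(u16, u16)> {
    let header: &[u8; 8] = payload.get(..8)?.try_into().ok()?;
    let src_port = u16::from_be_bytes([header[0], header[1]]);
    let dst_port = u16::from_be_bytes([header[2], header[3]]);
    Some((src_port, dst_port))
}
```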
I don't expect this to be a huge problem in production right now though.
We only use this code to parse ICMP errors arriving on the Gateway, and
I _think_ most devices actually include more than 8 bytes. This only
surfaced because we are strict about embedding exactly 8 bytes when we
generate an ICMP error.
Additionally, we change our ICMP errors to be sent from the resource IP
rather than the Gateway's TUN device. Given that we perform NAT on these
IPs anyway, I think this can still be argued to be RFC-conformant. The
_proxy_ IP which we are trying to contact can be reached but the packet
cannot be routed further. Therefore the destination is unreachable, yet
the source of this error is the proxy IP itself. I think this is
actually more correct than sending the packets from the Gateway's TUN
device because the TUN device itself is not a routing hop per se: its IP
won't ever show up in the routing path.
This wasn't the issue - the issue was that @firezone-bot needed access
to the firezone/winget-pkgs repo.
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
This PR replaces the use of Apple Archive with an API that allows us to
zip the log file contents. This API doesn't handle symlinks well, so we
move the symlink out of the way before making the zip and move it back
after the process completes. Any errors in this process are ignored, as
the symlink itself is not a critical component of Firezone.
The zip compression is marginally less efficient than Apple Archive:
instead of compressing ~2 GB of logs to 11.8 MB, we now get an archive
of 12.4 MB. Considering how much easier zip files are to handle, this
seems like a fine trade-off.
<img width="774" alt="Screenshot 2025-06-16 at 00 04 52"
src="https://github.com/user-attachments/assets/8fb6bade-5308-40b9-a446-2a2c364cb621"
/>
Resolves: #7475
---------
Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Co-authored-by: Jamil Bou Kheir <jamilbk@users.noreply.github.com>
This PR adds an optional field `account_slug` to the Gateway's `init`
message. If populated, we will use this field to set the account slug in
the telemetry context. This will allow us to know which customer a
particular Sentry issue is related to.
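Roughly, the message shape (other fields elided; the exact layout is an assumption for illustration):

```rust
// The Gateway's `init` message, with the new optional field.
struct Init {
    // ... existing fields ...
    account_slug: Option<String>, // copied into the telemetry context if present
}
```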
The latest Visual Studio version shipped a bug in the MSVC linker where
it cannot handle symbols above a certain size. Switching to the linker
shipped with Rust (`rust-lld`) fixes this issue.
Related: https://github.com/rust-lang/rust/issues/141626
Unfortunately, #9608 did not handle the case where we receive more than
200 compressed metrics in a single call. To fix this, we ensure we
`flush` the metrics buffer inside the `reduce` so that the accumulated
metrics buffer never grows beyond 200 points.
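Conceptually, the fix is to flush mid-fold instead of only at the end; a sketch of the idea with placeholder types (the actual implementation may differ):

```rust
// Placeholder type for a single metric point.
struct Metric;

const MAX_BATCH: usize = 200;

// Fold over incoming metrics, flushing whenever the buffer hits the cap
// so it never grows beyond 200 points.
fn process(metrics: impl Iterator<Item = Metric>, submit: impl Fn(&[Metric])) {
    let buffer = metrics.fold(Vec::new(), |mut buf, metric| {
        buf.push(metric);
        if buf.len() >= MAX_BATCH {
            submit(&buf); // flush inside the reduce
            buf.clear();
        }
        buf
    });
    if !buffer.is_empty() {
        submit(&buffer); // final partial batch
    }
}
```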
The log string was updated to roll the issue over in Sentry, and the old
issue was set to "delete and discard" to prevent issue spam.
---------
Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
`onViewCreated()` is called when the view initializes, and `onResume()`
is called right after, as well as any time the view is shown again.
To prevent showing the VPN permission activity twice, we remove the
`checkTunnelState()` call from `onViewCreated()`, allowing only
`onResume()` to call it.
A boolean flag is added to track whether this is the "first" launch of
the app, in order to determine whether to `connectOnStart`.
Fixes: #9584
---------
Signed-off-by: Jamil <jamilbk@users.noreply.github.com>