Commit Graph

33 Commits

Author SHA1 Message Date
Jamil
f6b2256b9a fix(connlib): Differentiate between 4xx and other portal errors in log message (#3203)
Why?

We print the previous error even when getting a `4xx` and disconnecting
immediately, which doesn't make sense.
2024-01-12 15:30:36 +00:00
Gabi
bdf260a58c connlib: only get system dns servers on session connect (#3198) 2024-01-11 22:41:33 +00:00
Jamil
4f37bfab93 refactor(connlib): Remove unused on_error callback (#3162)
Fixes #3161 
Fixes #2867
2024-01-11 12:42:41 +00:00
Gabi
2af8d6096c fix(connlib): mangle packet for upstream dns as resource (#3134)
Fixes #3027 

Left a few TODO, will solve it when doing #3123 

Draft because we're still testing but it's almost ready
2024-01-09 21:08:07 +00:00
Jamil
1251397651 fix(ios/android): Pass device name and os version as overrides over connect (#3036)
Fixes #3035 
Fixes #3037 

# Before

<img width="738" alt="Screenshot 2023-12-28 at 8 05 31 AM"
src="https://github.com/firezone/firezone/assets/167144/c7ab4d74-672c-4536-97fe-f75d8d158bfb">

<img width="546" alt="Screenshot 2023-12-28 at 6 12 30 PM"
src="https://github.com/firezone/firezone/assets/167144/1bd4ba98-d11d-4277-bd14-b0afcdf78119">

# After

<img width="742" alt="Screenshot 2023-12-28 at 10 48 31 AM"
src="https://github.com/firezone/firezone/assets/167144/96054f82-069f-47f7-862c-986455ef76c0">
<img width="744" alt="Screenshot 2023-12-28 at 6 29 37 PM"
src="https://github.com/firezone/firezone/assets/167144/4ffc19b6-7c87-4ccb-bcfe-cb0e76fe95b7">
2024-01-03 20:08:33 +00:00
Reactor Scram
a6659c36cc fix(connlib): move .log to the end of log filenames (#3008)
This allows GUIs including Windows to associate a text editor with them
2023-12-22 21:43:48 +00:00
Gabi
ecfa919bbc refactor(connlib): refresh dns addresses (#2994)
Fix for #2956 this is achieved by refreshing access to every resource
every 5 minutes.

There's still an open question for this PR:

When the gateway resolves an ip the gateway allows access to a DNS
resource it resolves the address and allow access to that ip for that
client.

Right now, until the access for that resource doesn't expire that access
isn't revoked.

We could change it so that we require the client to refresh such
access(with this PR those refresh queries are already being made every 5
minutes) every x minutes on top of the `expires_at` or we can keep
`expires_at` as to mean "allow access until `expires_at` for whatever
this resource resolves to".
cc @jamilbk @AndrewDryga
2023-12-22 13:12:32 -06:00
Gabi
5edfe80eb0 connlib: tune disconnect parameters (#2977)
Should fix #2946 (still testing, trying to reproduce the error reported
in the issue)
2023-12-21 19:37:07 +00:00
Gabi
1d595fd15c refactor(connlib): log more details about failed queries (#2934)
Signed-off-by: Gabi <gabrielalejandro7@gmail.com>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
2023-12-19 21:56:52 +00:00
Gabi
73823ecba0 Fix/firezone id handling (#2958)
fixes #2651 

Wip because firezone portal doesn't handle names longer than 8
characters yet cc @AndrewDryga
2023-12-19 15:38:27 -06:00
Jamil
bdbfa0dc5b Prevent DNS sentinel from being used as a fallback resolver (#2922)
Prevent the edge case where our DNS sentinel could be used as a fallback
resolver. I didn't observe this in the wild, but we should avoid it in
case.

---------

Co-authored-by: Gabi <gabrielalejandro7@gmail.com>
2023-12-16 01:24:07 +00:00
Jamil
0014172c0a Don't automatically delete log files after successful upload (#2904)
Prevents cases where "Export logs" doesn't contain the full log cache.

Fixes #2886
2023-12-14 19:31:40 +00:00
Gabi
b9cbc1786f connlib: disconnect on token expiration (#2890)
Previously, we just expected the portal to disconnects us and 401 on the
retry, right now we harden that behaviour by also just disconnecting
when token expiration.

This seems to work, there's another part to this which is not only
handling the replies but also handling the message generated by the
portal, I'll implement that when I can easily test expirying tokens, for
now this makes the client much more stable.
2023-12-13 15:10:43 +00:00
Thomas Eizinger
0de16d3676 refactor(connlib): remove async from the Device API (#2815)
At present, the definition of `Device` is heavily nested with
conditional code. I've found this hard to understand and navigate.
Recent refactorings now made it possible to remove a lot of these layers
so we primarily deal with two concepts:

- A `Device` which offers async read and non-blocking write functions
- A `Tun` abstraction which is platform-specific

Instead of dedicated modules, I chose to feature-flag individual
functions on `Device` with `#[cfg(target_family = "unix")]` and
`#[cfg(target_family = "windows")]`. I find this easier to understand
because the code is right next to each other.

In addition, changing the module hierarchy of `Device` allows us to
remove `async` from the public API which is only introduced by the use
of `rtnetlink` in Linux. Instead of making functions across all `Tun`
implementations `async`, we embed a "worker" within the `linux::Tun`
implementation that gets polled before `poll_read`.

---------

Co-authored-by: Gabi <gabrielalejandro7@gmail.com>
2023-12-12 19:47:26 +00:00
Reactor Scram
a339f5b437 feat(windows): generate device ID and persist it on disk (#2840)
Relating to #2697 and #2711

---------

Signed-off-by: Reactor Scram <ReactorScram@users.noreply.github.com>
2023-12-12 17:46:26 +00:00
Gabi
e1fb6c80a0 fix(connlib): attempt to join topic upon unmatched topic error (#2874)
Fixes: #2854.

Note: this is ready for review but reproducing the bug that triggered
the fix takes ~1 hour or so, so I would like to wait to check that's
fixed.

Can be reviewed meanwhile.
2023-12-12 16:57:47 +00:00
Gabi
8e34457340 Add support for DNS sudomains (#2735)
This PR changes the protocol and adds support for DNS subdomains, now
when a DNS resource is added all its subdomains are automatically
tunneled too. Later we will add support for `*.domain` or `?.domain` but
currently there is an Apple split tunnel implementation limitation which
is too labor-intensive to fix right away.

Fixes #2661 

Co-authored-by: Andrew Dryga <andrew@dryga.com>
2023-12-08 00:16:42 -05:00
Gabi
bc8f438a56 feat(connlib): directly send wireguard traffic instead of tunneling it through WebRTC datachannels (#2643)
This PR started as part of a degradation in performance for the
gateways.

The way to test performance in a realistic enviroment is using a GCP vm
as a client and an AWS vm as a gateway with a single iperf server behind
the gateway.

Then the `iperf` results with current main:

```
Connecting to host 172.31.92.238, port 5201
Reverse mode, remote host 172.31.92.238 is sending
[  5] local 100.83.194.77 port 58426 connected to 172.31.92.238 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  1.01 MBytes  8.50 Mbits/sec                  
[  5]   1.00-2.00   sec  1.14 MBytes  9.59 Mbits/sec                  
[  5]   2.00-3.00   sec   699 KBytes  5.73 Mbits/sec                  
[  5]   3.00-4.00   sec  1.11 MBytes  9.31 Mbits/sec                  
[  5]   4.00-5.00   sec   664 KBytes  5.44 Mbits/sec                  
[  5]   5.00-6.00   sec   591 KBytes  4.84 Mbits/sec                  
[  5]   6.00-7.00   sec   722 KBytes  5.91 Mbits/sec                  
[  5]   7.00-8.00   sec   833 KBytes  6.83 Mbits/sec                  
[  5]   8.00-9.00   sec   738 KBytes  6.04 Mbits/sec                  
[  5]   9.00-10.00  sec   836 KBytes  6.85 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.06  sec  8.78 MBytes  7.32 Mbits/sec    3             sender
[  5]   0.00-10.00  sec  8.23 MBytes  6.90 Mbits/sec                  receiver

iperf Done.
```

Most of the performance problems were due to using SCTP and DTLS.

So I created a
[fork](https://github.com/firezone/webrtc/tree/expose-new-endpoint) of
webrtc that let us circumvent those, since we don't need them because we
are depending on wireguard for encryption.

With those changes much better throughput is achieved:

```
gabriel@cloudshell:~ (firezone-personal-instances)$ iperf3 -R -c 172.31.92.238
Connecting to host 172.31.92.238, port 5201
Reverse mode, remote host 172.31.92.238 is sending
[  5] local 100.83.194.77 port 51206 connected to 172.31.92.238 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  5.60 MBytes  47.0 Mbits/sec                  
[  5]   1.00-2.00   sec  17.2 MBytes   144 Mbits/sec                  
[  5]   2.00-3.00   sec  15.8 MBytes   132 Mbits/sec                  
[  5]   3.00-4.00   sec  14.8 MBytes   125 Mbits/sec                  
[  5]   4.00-5.00   sec  15.9 MBytes   133 Mbits/sec                  
[  5]   5.00-6.00   sec  15.8 MBytes   133 Mbits/sec                  
[  5]   6.00-7.00   sec  15.3 MBytes   128 Mbits/sec                  
[  5]   7.00-8.00   sec  15.6 MBytes   131 Mbits/sec                  
[  5]   8.00-9.00   sec  15.6 MBytes   131 Mbits/sec                  
[  5]   9.00-10.00  sec  16.0 MBytes   134 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.05  sec   151 MBytes   126 Mbits/sec   74             sender
[  5]   0.00-10.00  sec   148 MBytes   124 Mbits/sec                  receiver

iperf Done
```

However, this is still worse than it was achieved with a previous
commit(`21afdf0a9a113c996d60a63b2e8c8f32d3aeb87`):
```
gabriel@cloudshell:~ (firezone-personal-instances)$ iperf3 -R -c 172.31.92.238
Connecting to host 172.31.92.238, port 5201
Reverse mode, remote host 172.31.92.238 is sending
[  5] local 100.100.68.41 port 49762 connected to 172.31.92.238 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  6.14 MBytes  51.5 Mbits/sec                  
[  5]   1.00-2.00   sec  17.1 MBytes   144 Mbits/sec                  
[  5]   2.00-3.00   sec  22.8 MBytes   191 Mbits/sec                  
[  5]   3.00-4.00   sec  23.5 MBytes   197 Mbits/sec                  
[  5]   4.00-5.00   sec  23.0 MBytes   193 Mbits/sec                  
[  5]   5.00-6.00   sec  22.1 MBytes   185 Mbits/sec                  
[  5]   6.00-7.00   sec  23.0 MBytes   193 Mbits/sec                  
[  5]   7.00-8.00   sec  22.7 MBytes   190 Mbits/sec                  
[  5]   8.00-9.00   sec  21.0 MBytes   176 Mbits/sec                  
[  5]   9.00-10.00  sec  19.9 MBytes   167 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.05  sec   204 MBytes   170 Mbits/sec  127             sender
[  5]   0.00-10.00  sec   201 MBytes   169 Mbits/sec                  receiver
```

My profiling suggested that this is due to reading/writing packets
happening in its own dedicated tasks. So much so that maybe in the
future we should even consider spawning their own dedicated runtime so
that those loops have a dedicated OS thread.

Also, probably using a multi-queue interface will give us huge gains if
we have a dedicated task for each queue(currently the interface is
started as a multi-queue but a single file descriptor is used) for
handling multiple concurrent clients.

However, the changes proposed in this PR are good enough for now as long
as performance don't degrade.

In that line I will create a CI that reports the throughput using the
local `docker-compose.yml` file that we should always check before
merging, that is not the be all end all of the performance story but for
smaller PRs the correlation to real world throughput should be enough.

For bigger PRs we should manually test before merging for now, until we
have a way in CI to spin up some realistic tests(note that vms should be
in separate cloud enviroments, the same-cloud links are so reliable that
we miss actual performance degradation due to dropped packets). On this
note I'll write a small manual on how to conduct those tests with full
current results that we should use always before merging new PRs that
affect the hot-path. cc @thomaseizinger

Finally, when testing these changes I found some flakiness regarding the
re-connection path. So I changed things so that we cleanup connections
only using wireguard's error(connection expiration). This is quite slow
for now (~120 seconds) but in the future we can issue an ice restart
each time wireguard keepalive expires(rekey timeout) so that we can
restart connection each ~30 seconds and we can reduce the keepalive time
out from the portal to accelerate it even more. And in the future we can
get smarter about it.

---------

Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
2023-11-16 02:59:48 +00:00
Gabi
953ddeace6 connlib: update upstream dns format configuration (#2543)
fixes #2297
2023-11-03 05:16:03 +00:00
Thomas Eizinger
b404f10d87 refactor(connlib): read from device as part of eventloop (#2520)
As a next step in refactoring the tunnel implementation, I am removing
the `device_handler` task and instead use a poll-based function to read
from the device. Removing the task means there is one less component
that accesses the `Tunnel` via shared-memory. The final one after this
PR is the `peer_handler`.

Once all shared-access is gone, we can stop using `Arc<Tunnel>` and with
it, remove all uses of `Mutex` in the tunnel and simply use `&mut self`.

To remove the `device_handler`, we introduce a `Device::poll_read`
function that we call as the very first thing in the `Tunnel`'s
poll-function. At a later point, we want to think about prioritization
within the event loop. I'd suggest deferring that until we have removed
the locks as handling the guards is a bit finicky at this stage.
2023-11-03 00:47:26 +00:00
Jamil
2bca378f17 Allow data plane configuration at runtime (#2477)
## Changelog

- Updates connlib parameter API_URL (formerly known under different
names as `CONTROL_PLANE_URL`, `PORTAL_URL`, `PORTAL_WS_URL`, and
friends) to be configured as an "advanced" or "hidden" feature at
runtime so that we can test production builds on both staging and
production.
- Makes `AUTH_BASE_URL` configurable at runtime too
- Moves `CONNLIB_LOG_FILTER_STRING` to be configured like this as well
and simplifies its naming
- Fixes a timing attack bug on Android when comparing the `csrf` token
- Adds proper account ID validation to Android to prevent invalid URL
parameter strings from being saved and used
- Cleans up a number of UI / view issues on Android regarding typos,
consistency, etc
- Hides vars from from the `relay` CLI we may not want to expose just
yet
- `get_device_id()` is flawed for connlib components -- SMBios is rarely
available. Data plane components now require a `FIREZONE_ID` now instead
to use for upserting.


Fixes #2482 
Fixes #2471

---------

Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Gabi <gabrielalejandro7@gmail.com>
2023-10-30 23:46:53 -07:00
Jamil
23ca75227c log upload interval 5 minutes; delete file after upload (#2463)
- Reduce interval to 5 minutes
- Delete file after successful upload

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Andrew Dryga <andrew@dryga.com>
Co-authored-by: bmanifold <bmanifold@users.noreply.github.com>
2023-10-20 21:35:02 +00:00
Jamil
fa57d66965 Publish Releases (#2344)
- rebuild and publish gateway and relay binaries to currently drafted
release
- re-tag current relay/gateway images and push to ghcr.io

Stacked on #2341 to prevent conflicts

Fixes #2223 
Fixes #2205 
Fixes #2202
Fixes #2239 

~~Still TODO: `arm64` images and binaries...~~ Edit: added via
`cross-rs`
2023-10-20 14:20:43 -07:00
Gabi
29a480789e Fix reuse connections (#2454)
There were 2 bugs:
* `gateway_awaiting_connections` should represent gateways were there is
an on-going connection intent but are not connected yet but currently we
were creating an empty entry when there was no entry, even if there is a
connection established, this would cause the next resource connection
intent to stop early without adding the allowed ip, thus never using the
connection.
* There was a race condition, where if the `ReuseConnection` was sent to
the gateway when the connection wasn't established, the gateway would
just ignore the message, but this connection intent would never be sent
again.

Now that I'm writing this maybe the best solution is, if there is a
pending connection to a gateway, we just do nothing. That way upper
layers would just retry the message and we send `ReuseConnection` once
the connection is established instead of buffering the requests...

Edit: that's exactly the fix I made.
2023-10-20 02:50:29 -03:00
Thomas Eizinger
5a906cb1c4 refactor(connlib): remove ConnId (#2361)
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
2023-10-18 21:38:49 +00:00
Gabi
d626f6dbf6 Connlib/forward dns (#2325)
With this we implement DNS forwarding that's specified in  #2043 

This also solve the DNS story in Android.

For the headless client in Linux we still need to implement split dns,
but we can make do with this, specially, we can read from resolvconf and
use the forward DNS (not ideal but can work if we want a beta headless
client).

For the resolver I used `trusted-proto-resolver`.

The other options were:

* Using `domain`'s resolver but while it could work for now, it's no
ideal for this since it doesn't support DoH or DoT and doesn't provide
us with a DNS cache.
* Using `trusted-proto-client`, it doesn't provide us with a DNS cache,
though we could eventually replace it since it provides a way to access
the underlying buffer which could make our code a bit simpler.
* Writing our own. While we could make the API ideal, this is too much
work for beta.


@pratikvelani I did some refactor in the kotlin side so we can return an
array of bytearrays so that we don't require parsing on connlib side, I
also tried to make the dns server detector a bit simpler please take a
look it's my first time doing kotlin

@thomaseizinger please take a look specially at the first commit, I
tried to integrate with the `poll_events` and the `ClientState`.
2023-10-18 20:39:20 +00:00
Thomas Eizinger
2cfe7befef refactor(connlib): remove ControlSignal (#2321) 2023-10-18 17:28:04 +11:00
Thomas Eizinger
b187ac7850 feat(connlib): gzip encode log files (#2345)
Resolves: #2342.

Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
2023-10-16 13:13:43 -07:00
Thomas Eizinger
82c2bf3574 refactor(connlib): use events to handle ICE candidates (#2279) 2023-10-10 22:26:42 +00:00
Gabi
e516bcc8dd connlib+android: enable fd replacement (#2235)
Should be easier to review commit by commit.

The gist of this commit is:
* `onAddRoute` on Android now takes an address+prefix as to minimize
parsing
* `onAddRoute` recreates the vpn service each time(TODO: is this too bad
for performance?)
* `on_add_route` and `onAddRoute` returns the new fd
* on android after `on_add_route` we recreate `IfaceConfig` and
`DeviceIo` and we store the new values
* `peer_handler` now runs on a loop, where each time we fail a write
with an error code 9(bad descriptor) we try to take the new `DeviceIo`
* we keep an
[`AbortHandle`](https://docs.rs/tokio/latest/tokio/task/struct.AbortHandle.html)
from the `iface_handler` task, since closing the fd doesn't awake the
`read` task for `AsyncFd`(I tried it, right now `close` is only called
after dropping the fd) so we explicitly abort the task and start a new
one with the new `device_io`.
* in android `DeviceIo` has an atomic which tells if it's closed or open
and we change it to closed after `on_add_route`, we use this as to never
double-close the fd, instead we wait until it's dropped. This *might*
affect performance on android since we use non-`Ordering::Relaxed`
atomic operation each read/write but it won't affect perfromance in
other platforms, furthermore I believe the performance gains if we
remove this will be minimal.

Fixes #2227

---------

Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
2023-10-08 23:52:45 -03:00
Gabi
11a2979158 connlib: error out with http 4xx instead of trying to reconnect (#2264)
fixes #2013 

Stops the reconnect loop on a 4xx error.

Right now it seems like android doesn't handle `on_disconnect` properly,
just logging instead of going back to sign-in screen.
2023-10-07 16:52:16 +00:00
Thomas Eizinger
3fcfaa6bfd refactor(connlib): create login_url utility (#2237) 2023-10-05 15:15:51 +11:00
Thomas Eizinger
464efbad56 refactor(connlib): restructure directory for consistency (#2236) 2023-10-05 09:52:35 +11:00