Commit Graph

23 Commits

Author SHA1 Message Date
Gabi
73823ecba0 Fix/firezone id handling (#2958)
fixes #2651 

Wip because firezone portal doesn't handle names longer than 8
characters yet cc @AndrewDryga
2023-12-19 15:38:27 -06:00
Reactor Scram
d1a7211f64 windows: Integrate wintun, run the VPN (#2883)
With this one, ICMP and TCP work, but the client doesn't set up routes
or handle DNS yet, so I've been using `netsh` to fake that.

---------

Signed-off-by: Reactor Scram <ReactorScram@users.noreply.github.com>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
2023-12-13 23:19:36 +00:00
Jamil
8499580388 Remove Apple SplitDNS in favor of unified split DNS approach (#2894)
<img width="1552" alt="Screenshot 2023-12-12 at 11 29 43 PM"
src="https://github.com/firezone/firezone/assets/167144/d517c830-64a8-462d-8cb5-c41835fa2059">

Found a reliable way to return default system DNS resolvers on iOS and
macOS. Even if this method is not perfect, I think it's still worth
pursuing because:

* Many administrators will set an upstream resolver in the portal anyway
(bypassing client system resolvers)
* It unifies our Split DNS approach across platforms (assuming we can
query the default system resolvers on Windows), allowing connlib to
intercept all DNS queries on all platforms. This opens the door for some
interesting feature possibilities in the area of malicious query
blocking. This also makes DNS bugs easier to investigate because there's
only one codepath for packets to take. See
https://github.com/firezone/firezone/issues/2859

Draft because it needs more testing and I need to figure out the
`RustVec<RustString>` type for the Swift -> Rust FFI.

Refs #2713
2023-12-13 22:01:00 +00:00
Gabi
b9cbc1786f connlib: disconnect on token expiration (#2890)
Previously, we just expected the portal to disconnects us and 401 on the
retry, right now we harden that behaviour by also just disconnecting
when token expiration.

This seems to work, there's another part to this which is not only
handling the replies but also handling the message generated by the
portal, I'll implement that when I can easily test expirying tokens, for
now this makes the client much more stable.
2023-12-13 15:10:43 +00:00
Gabi
e1fb6c80a0 fix(connlib): attempt to join topic upon unmatched topic error (#2874)
Fixes: #2854.

Note: this is ready for review but reproducing the bug that triggered
the fix takes ~1 hour or so, so I would like to wait to check that's
fixed.

Can be reviewed meanwhile.
2023-12-12 16:57:47 +00:00
Gabi
8e34457340 Add support for DNS sudomains (#2735)
This PR changes the protocol and adds support for DNS subdomains, now
when a DNS resource is added all its subdomains are automatically
tunneled too. Later we will add support for `*.domain` or `?.domain` but
currently there is an Apple split tunnel implementation limitation which
is too labor-intensive to fix right away.

Fixes #2661 

Co-authored-by: Andrew Dryga <andrew@dryga.com>
2023-12-08 00:16:42 -05:00
Gabi
bc8f438a56 feat(connlib): directly send wireguard traffic instead of tunneling it through WebRTC datachannels (#2643)
This PR started as part of a degradation in performance for the
gateways.

The way to test performance in a realistic enviroment is using a GCP vm
as a client and an AWS vm as a gateway with a single iperf server behind
the gateway.

Then the `iperf` results with current main:

```
Connecting to host 172.31.92.238, port 5201
Reverse mode, remote host 172.31.92.238 is sending
[  5] local 100.83.194.77 port 58426 connected to 172.31.92.238 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  1.01 MBytes  8.50 Mbits/sec                  
[  5]   1.00-2.00   sec  1.14 MBytes  9.59 Mbits/sec                  
[  5]   2.00-3.00   sec   699 KBytes  5.73 Mbits/sec                  
[  5]   3.00-4.00   sec  1.11 MBytes  9.31 Mbits/sec                  
[  5]   4.00-5.00   sec   664 KBytes  5.44 Mbits/sec                  
[  5]   5.00-6.00   sec   591 KBytes  4.84 Mbits/sec                  
[  5]   6.00-7.00   sec   722 KBytes  5.91 Mbits/sec                  
[  5]   7.00-8.00   sec   833 KBytes  6.83 Mbits/sec                  
[  5]   8.00-9.00   sec   738 KBytes  6.04 Mbits/sec                  
[  5]   9.00-10.00  sec   836 KBytes  6.85 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.06  sec  8.78 MBytes  7.32 Mbits/sec    3             sender
[  5]   0.00-10.00  sec  8.23 MBytes  6.90 Mbits/sec                  receiver

iperf Done.
```

Most of the performance problems were due to using SCTP and DTLS.

So I created a
[fork](https://github.com/firezone/webrtc/tree/expose-new-endpoint) of
webrtc that let us circumvent those, since we don't need them because we
are depending on wireguard for encryption.

With those changes much better throughput is achieved:

```
gabriel@cloudshell:~ (firezone-personal-instances)$ iperf3 -R -c 172.31.92.238
Connecting to host 172.31.92.238, port 5201
Reverse mode, remote host 172.31.92.238 is sending
[  5] local 100.83.194.77 port 51206 connected to 172.31.92.238 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  5.60 MBytes  47.0 Mbits/sec                  
[  5]   1.00-2.00   sec  17.2 MBytes   144 Mbits/sec                  
[  5]   2.00-3.00   sec  15.8 MBytes   132 Mbits/sec                  
[  5]   3.00-4.00   sec  14.8 MBytes   125 Mbits/sec                  
[  5]   4.00-5.00   sec  15.9 MBytes   133 Mbits/sec                  
[  5]   5.00-6.00   sec  15.8 MBytes   133 Mbits/sec                  
[  5]   6.00-7.00   sec  15.3 MBytes   128 Mbits/sec                  
[  5]   7.00-8.00   sec  15.6 MBytes   131 Mbits/sec                  
[  5]   8.00-9.00   sec  15.6 MBytes   131 Mbits/sec                  
[  5]   9.00-10.00  sec  16.0 MBytes   134 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.05  sec   151 MBytes   126 Mbits/sec   74             sender
[  5]   0.00-10.00  sec   148 MBytes   124 Mbits/sec                  receiver

iperf Done
```

However, this is still worse than it was achieved with a previous
commit(`21afdf0a9a113c996d60a63b2e8c8f32d3aeb87`):
```
gabriel@cloudshell:~ (firezone-personal-instances)$ iperf3 -R -c 172.31.92.238
Connecting to host 172.31.92.238, port 5201
Reverse mode, remote host 172.31.92.238 is sending
[  5] local 100.100.68.41 port 49762 connected to 172.31.92.238 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  6.14 MBytes  51.5 Mbits/sec                  
[  5]   1.00-2.00   sec  17.1 MBytes   144 Mbits/sec                  
[  5]   2.00-3.00   sec  22.8 MBytes   191 Mbits/sec                  
[  5]   3.00-4.00   sec  23.5 MBytes   197 Mbits/sec                  
[  5]   4.00-5.00   sec  23.0 MBytes   193 Mbits/sec                  
[  5]   5.00-6.00   sec  22.1 MBytes   185 Mbits/sec                  
[  5]   6.00-7.00   sec  23.0 MBytes   193 Mbits/sec                  
[  5]   7.00-8.00   sec  22.7 MBytes   190 Mbits/sec                  
[  5]   8.00-9.00   sec  21.0 MBytes   176 Mbits/sec                  
[  5]   9.00-10.00  sec  19.9 MBytes   167 Mbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.05  sec   204 MBytes   170 Mbits/sec  127             sender
[  5]   0.00-10.00  sec   201 MBytes   169 Mbits/sec                  receiver
```

My profiling suggested that this is due to reading/writing packets
happening in its own dedicated tasks. So much so that maybe in the
future we should even consider spawning their own dedicated runtime so
that those loops have a dedicated OS thread.

Also, probably using a multi-queue interface will give us huge gains if
we have a dedicated task for each queue(currently the interface is
started as a multi-queue but a single file descriptor is used) for
handling multiple concurrent clients.

However, the changes proposed in this PR are good enough for now as long
as performance don't degrade.

In that line I will create a CI that reports the throughput using the
local `docker-compose.yml` file that we should always check before
merging, that is not the be all end all of the performance story but for
smaller PRs the correlation to real world throughput should be enough.

For bigger PRs we should manually test before merging for now, until we
have a way in CI to spin up some realistic tests(note that vms should be
in separate cloud enviroments, the same-cloud links are so reliable that
we miss actual performance degradation due to dropped packets). On this
note I'll write a small manual on how to conduct those tests with full
current results that we should use always before merging new PRs that
affect the hot-path. cc @thomaseizinger

Finally, when testing these changes I found some flakiness regarding the
re-connection path. So I changed things so that we cleanup connections
only using wireguard's error(connection expiration). This is quite slow
for now (~120 seconds) but in the future we can issue an ice restart
each time wireguard keepalive expires(rekey timeout) so that we can
restart connection each ~30 seconds and we can reduce the keepalive time
out from the portal to accelerate it even more. And in the future we can
get smarter about it.

---------

Co-authored-by: Thomas Eizinger <thomas@eizinger.io>
2023-11-16 02:59:48 +00:00
Andrew Dryga
33ab23b636 Cleanup UX and fix a bunch of TODOs (#2641)
This PR cleans up a lot of TODO and some issues I've discovered while
fixing them, there are _a few_ UI changes.

We show `(you)` next to your name on the actor view page, where
`Profile` link goes from the dropdown menu:
<img width="1728" alt="Screenshot 2023-11-13 at 19 05 35"
src="https://github.com/firezone/firezone/assets/1877644/f52b2531-e3be-4d3a-a587-4f9f54ca2c49">

Relays were way behind Gateways in terms of view code, so I changed them
to be exactly the same:
<img width="1728" alt="Screenshot 2023-11-13 at 18 54 39"
src="https://github.com/firezone/firezone/assets/1877644/a9f0905d-80d2-4e91-a744-c4baf7ad4a7c">

We also show authorizations on the Actor page because previously to find
"what this user did" you had to go through all user clients
individually:
<img width="1728" alt="Screenshot 2023-11-13 at 18 54 27"
src="https://github.com/firezone/firezone/assets/1877644/02ada445-e175-427e-99de-f9fa5bdd5aab">

I've noticed there is some confusion around sign-in slugs so I added a
home page where you can use ID or slug to get the in link (not all the
clients will know you need to put that in the URL) and recently used
accounts:
<img width="1728" alt="Screenshot 2023-11-13 at 18 54 06"
src="https://github.com/firezone/firezone/assets/1877644/ccfb9198-ed1f-4b3e-a26f-b76bab24243c">

Buttons to copy the code are more visible now, I've used our accent
color but am open to better ideas:
<img width="1728" alt="Screenshot 2023-11-13 at 19 10 29"
src="https://github.com/firezone/firezone/assets/1877644/a2c0658e-1003-409b-b5ad-d5d3ade60a10">

When code is copied it's also more visible:
<img width="699" alt="Screenshot 2023-11-13 at 19 11 41"
src="https://github.com/firezone/firezone/assets/1877644/62e793d2-d760-4aa7-9a42-92a6bbfcbf52">

We also do not redirect from that page automatically, but the large
button becomes green with the text changed:
<img width="660" alt="Screenshot 2023-11-13 at 19 12 11"
src="https://github.com/firezone/firezone/assets/1877644/780dcde3-8018-4405-91e5-984288431ec1">
2023-11-14 13:02:21 -06:00
Gabi
953ddeace6 connlib: update upstream dns format configuration (#2543)
fixes #2297
2023-11-03 05:16:03 +00:00
Jamil
2bca378f17 Allow data plane configuration at runtime (#2477)
## Changelog

- Updates connlib parameter API_URL (formerly known under different
names as `CONTROL_PLANE_URL`, `PORTAL_URL`, `PORTAL_WS_URL`, and
friends) to be configured as an "advanced" or "hidden" feature at
runtime so that we can test production builds on both staging and
production.
- Makes `AUTH_BASE_URL` configurable at runtime too
- Moves `CONNLIB_LOG_FILTER_STRING` to be configured like this as well
and simplifies its naming
- Fixes a timing attack bug on Android when comparing the `csrf` token
- Adds proper account ID validation to Android to prevent invalid URL
parameter strings from being saved and used
- Cleans up a number of UI / view issues on Android regarding typos,
consistency, etc
- Hides vars from from the `relay` CLI we may not want to expose just
yet
- `get_device_id()` is flawed for connlib components -- SMBios is rarely
available. Data plane components now require a `FIREZONE_ID` now instead
to use for upserting.


Fixes #2482 
Fixes #2471

---------

Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Gabi <gabrielalejandro7@gmail.com>
2023-10-30 23:46:53 -07:00
Gabi
29a480789e Fix reuse connections (#2454)
There were 2 bugs:
* `gateway_awaiting_connections` should represent gateways were there is
an on-going connection intent but are not connected yet but currently we
were creating an empty entry when there was no entry, even if there is a
connection established, this would cause the next resource connection
intent to stop early without adding the allowed ip, thus never using the
connection.
* There was a race condition, where if the `ReuseConnection` was sent to
the gateway when the connection wasn't established, the gateway would
just ignore the message, but this connection intent would never be sent
again.

Now that I'm writing this maybe the best solution is, if there is a
pending connection to a gateway, we just do nothing. That way upper
layers would just retry the message and we send `ReuseConnection` once
the connection is established instead of buffering the requests...

Edit: that's exactly the fix I made.
2023-10-20 02:50:29 -03:00
Thomas Eizinger
d2c0744518 fix(connlib): correctly forward rollover (#2462) 2023-10-20 05:12:55 +00:00
Thomas Eizinger
919b7890e6 refactor(connlib): move more logic to poll_next_event (#2403) 2023-10-19 02:30:04 +00:00
Gabi
d626f6dbf6 Connlib/forward dns (#2325)
With this we implement DNS forwarding that's specified in  #2043 

This also solve the DNS story in Android.

For the headless client in Linux we still need to implement split dns,
but we can make do with this, specially, we can read from resolvconf and
use the forward DNS (not ideal but can work if we want a beta headless
client).

For the resolver I used `trusted-proto-resolver`.

The other options were:

* Using `domain`'s resolver but while it could work for now, it's no
ideal for this since it doesn't support DoH or DoT and doesn't provide
us with a DNS cache.
* Using `trusted-proto-client`, it doesn't provide us with a DNS cache,
though we could eventually replace it since it provides a way to access
the underlying buffer which could make our code a bit simpler.
* Writing our own. While we could make the API ideal, this is too much
work for beta.


@pratikvelani I did some refactor in the kotlin side so we can return an
array of bytearrays so that we don't require parsing on connlib side, I
also tried to make the dns server detector a bit simpler please take a
look it's my first time doing kotlin

@thomaseizinger please take a look specially at the first commit, I
tried to integrate with the `poll_events` and the `ClientState`.
2023-10-18 20:39:20 +00:00
Thomas Eizinger
82c2bf3574 refactor(connlib): use events to handle ICE candidates (#2279) 2023-10-10 22:26:42 +00:00
Thomas Eizinger
dbb1dd4a3a deps(rust): bump to Rust version 1.73 (#2291)
See https://releases.rs/docs/1.73.0/ for details.
2023-10-10 13:03:06 -07:00
Jamil
00e77062b1 Return fd onRemoveRoute as well (#2296)
Implements the function signatures for `onRemoveRoute` as well.

Getting this error still though:

<img width="1633" alt="Screenshot 2023-10-10 at 8 25 17 AM"
src="https://github.com/firezone/firezone/assets/167144/3dc09f1b-10e1-401b-a1ef-64f1a09e35d5">

Android simulator, Pixel, API 34
2023-10-10 11:26:53 -07:00
Jamil
d0d1c095c3 Fix spelling typos (#2289)
Fixes failing checks in #2284
2023-10-09 18:32:24 -07:00
Gabi
e516bcc8dd connlib+android: enable fd replacement (#2235)
Should be easier to review commit by commit.

The gist of this commit is:
* `onAddRoute` on Android now takes an address+prefix as to minimize
parsing
* `onAddRoute` recreates the vpn service each time(TODO: is this too bad
for performance?)
* `on_add_route` and `onAddRoute` returns the new fd
* on android after `on_add_route` we recreate `IfaceConfig` and
`DeviceIo` and we store the new values
* `peer_handler` now runs on a loop, where each time we fail a write
with an error code 9(bad descriptor) we try to take the new `DeviceIo`
* we keep an
[`AbortHandle`](https://docs.rs/tokio/latest/tokio/task/struct.AbortHandle.html)
from the `iface_handler` task, since closing the fd doesn't awake the
`read` task for `AsyncFd`(I tried it, right now `close` is only called
after dropping the fd) so we explicitly abort the task and start a new
one with the new `device_io`.
* in android `DeviceIo` has an atomic which tells if it's closed or open
and we change it to closed after `on_add_route`, we use this as to never
double-close the fd, instead we wait until it's dropped. This *might*
affect performance on android since we use non-`Ordering::Relaxed`
atomic operation each read/write but it won't affect perfromance in
other platforms, furthermore I believe the performance gains if we
remove this will be minimal.

Fixes #2227

---------

Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
2023-10-08 23:52:45 -03:00
Gabi
11a2979158 connlib: error out with http 4xx instead of trying to reconnect (#2264)
fixes #2013 

Stops the reconnect loop on a 4xx error.

Right now it seems like android doesn't handle `on_disconnect` properly,
just logging instead of going back to sign-in screen.
2023-10-07 16:52:16 +00:00
Thomas Eizinger
dde98f1985 refactor(gateway): introduce Eventloop (#2244) 2023-10-06 22:05:52 +00:00
Thomas Eizinger
3fcfaa6bfd refactor(connlib): create login_url utility (#2237) 2023-10-05 15:15:51 +11:00
Thomas Eizinger
464efbad56 refactor(connlib): restructure directory for consistency (#2236) 2023-10-05 09:52:35 +11:00