Previously, there was a misinterpretation of the spec that didn't allow
_different_ clients to use the same channel number. This is wrong though:
because channel numbers are managed by clients, they only need to be
unique _per client_. This patch addresses that shortcoming.
I didn't include any dedicated tests for this. The fact that the
existing ones still pass means the feature works overall, and the
data structure shows that channels are now indeed unique per client.
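For illustration, here is a minimal sketch of what "unique per client" means for the binding data structure; the type names (`ClientId`, `ChannelBindings`) are illustrative, not connlib's actual types.
```rust
use std::collections::HashMap;
use std::net::SocketAddr;

// Sketch only: key channel bindings by (client, channel number) instead of
// by channel number alone, so different clients can reuse the same number.
type ClientId = SocketAddr;
type ChannelNumber = u16;
type PeerAddress = SocketAddr;

#[derive(Default)]
struct ChannelBindings {
    // Before: HashMap<ChannelNumber, PeerAddress> (globally unique numbers).
    // After: unique per client.
    bindings: HashMap<(ClientId, ChannelNumber), PeerAddress>,
}

impl ChannelBindings {
    fn bind(&mut self, client: ClientId, channel: ChannelNumber, peer: PeerAddress) {
        self.bindings.insert((client, channel), peer);
    }

    fn peer(&self, client: ClientId, channel: ChannelNumber) -> Option<PeerAddress> {
        self.bindings.get(&(client, channel)).copied()
    }
}
```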
- Fixes firezone/gtm#220
- Add @jefferenced and @ReactorScram to team page and simplify it
@conectado Interesting use of strong typing to enforce no broken links
in NextJS
Trying to get CI/CD to produce firezone-windows-client.exe. Can't
remember if I need both a PR and a draft release or just the draft
release for that.
---------
Signed-off-by: Reactor Scram <ReactorScram@users.noreply.github.com>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
In some cases we were observing that connections between clients and
gateways couldn't be established.
This happened even when candidates were being found on both ends.
This was usually observed when IPv6 isn't working on the relays but is
still used as one of the viable candidates.
To reproduce this more easily I created an iface with 50 ips using this
script:
```bash
#!/bin/bash
# Add 50 random IPv6 addresses (10 x 5) to the interface
for i in {1..10}
do
  for j in {1..5}
  do
    # Generate a random IPv6 address
    ipv6_address=$(openssl rand -hex 5 | sed 's/\(..\)/\1:/g; s/.$//' | awk '{print "fd00::"$1}')
    # Add the IPv6 address to lo0
    sudo ifconfig lo0 inet6 alias $ipv6_address
    echo "Added IPv6 address $ipv6_address to lo0"
  done
done
```
This behavior was observed almost consistently, as it depended on the
order in which candidates were used.
I tried modifying the timeouts and the limits on channel-binding requests
that are internal to webrtc, but the connections were still not
established consistently; the only thing that worked was limiting the
number of host candidates.
This is okay since even if we can't establish the local connection (no
hairpin NAT), a relayed connection will still be made.
But this is not a good long-term solution. In the future we should be
smarter about how we sort and ping candidates, prioritizing srflx-to-srflx
or srflx-to-relay pairs and leaving host candidates for last. This will be
easier to improve on after refactoring webrtc out.
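As a rough illustration of the workaround, here is a minimal sketch of capping the number of host candidates while leaving srflx/relay candidates untouched; the types and the cap value are illustrative, not the actual webrtc-rs API.
```rust
/// Sketch only: keep at most this many host candidates per gathering run.
const MAX_HOST_CANDIDATES: usize = 10;

#[derive(Clone, Copy, PartialEq)]
enum CandidateKind {
    Host,
    ServerReflexive,
    Relay,
}

struct Candidate {
    kind: CandidateKind,
    // address, priority, ... omitted
}

fn limit_host_candidates(candidates: Vec<Candidate>) -> Vec<Candidate> {
    let mut hosts_kept = 0;
    candidates
        .into_iter()
        .filter(|c| {
            if c.kind == CandidateKind::Host {
                hosts_kept += 1;
                hosts_kept <= MAX_HOST_CANDIDATES
            } else {
                // srflx and relay candidates always pass through.
                true
            }
        })
        .collect()
}
```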
Doing a first pass over documentation and minor UI cleanup. This PR
isn't meant to represent the final state of launch docs, but instead
something that will unblock #2685 and #2675.
Fixes #2729
Why:
* Self-hosted Relays are not going to be a part of the beta release, so
hiding the functionality in the portal keeps users from getting
confused about a feature they aren't able to use.
Closes #2178
There is another channel which we didn't yet increase in size: the one
between the allocation and the main task loop. Increasing it to 1000 means
each allocation can potentially buffer ~65 MB of data (1000 messages of up
to ~65 KB each). With the biggest port range (16383 allocations), that
would be a theoretical memory consumption of ~1 TB. But this would imply
that we have 16383 connected clients that all send data at max speed,
saturating our downlink while our uplink is somehow ridiculously small. As
long as uplink and downlink are roughly in the same ballpark, it should be
impossible to actually fill up these buffers.
I suspect that the current packet drops in the iperf test are happening
because on localhost, sending 10 UDP packets is so quick that tokio is
unable to wake up the task in time to empty the queue.
In addition to the increased channel size, I've also added a check for
the other channels to avoid writing to them in case they are not ready
for some reason.
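For illustration, here is a minimal sketch of that pattern with a bounded tokio mpsc channel, writing via `try_send` so the hot path never blocks; the drop behaviour and helper names are illustrative, not connlib's actual code.
```rust
use tokio::sync::mpsc;

// Sketch only: the channel between the allocation and the main task loop.
fn make_channel() -> (mpsc::Sender<Vec<u8>>, mpsc::Receiver<Vec<u8>>) {
    // 1000 messages of up to ~65 KB each is roughly 65 MB per allocation.
    mpsc::channel(1000)
}

// Sketch only: never block on a channel that isn't ready.
fn forward(data_tx: &mpsc::Sender<Vec<u8>>, packet: Vec<u8>) {
    match data_tx.try_send(packet) {
        Ok(()) => {}
        Err(mpsc::error::TrySendError::Full(_)) => {
            // Receiver is not keeping up; drop the packet rather than block.
        }
        Err(mpsc::error::TrySendError::Closed(_)) => {
            // Receiver is gone; nothing to do.
        }
    }
}
```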
---------
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
Yesterday, during some portion of the day, connections between clients
and resources were impossible.
While I couldn't pinpoint the exact cause, I found some issues with
cleanup. This PR fixes those.
Furthermore, I increased the default log level for tunnels in the
clients so that if this happens again we have better logs to triage.
~~Furthermore, I found out about #2705, so I removed the limit on relays
from connlib since the portal already limits it to 2 (4 if you count
per-IP); that way we make sure that we always use both IPv4 and IPv6.
The connection start-up time seems to slow down due to this, but I think
this is better. We might want to go back to only 2 URLs later on to
speed this up; if the portal can ensure it's a working, load-balanced
relay, there might not be a point in using more than a single
server~~. cc @AndrewDryga
Edit: we always get an IPv4 and an IPv6 address for the same relay as the
first two relays in the relay list, save the case where only one of the
IP types is supported. We should be safe limiting it to 2.
---------
Signed-off-by: Gabi <gabrielalejandro7@gmail.com>
Co-authored-by: Jamil <jamilbk@users.noreply.github.com>
Updates user-facing terminology to `One-Time Password` to more
accurately reflect this sign-in method and match the docs more consistently
Refs #2688
Refs #2021
Fixes #2718
It appears that #2687 missed the case where the tunnel failed to start
for some reason, so the tunnel status never goes to "Connecting" and
then back to "Disconnected" -- we were reacting to the tunnel status
becoming "Disconnected".
In this PR, we also do the retry when startTunnel() throws an error.
Fixes #2565
With this PR, the 'Quit' menu item (in the macOS status item menu)
disconnects the tunnel (if tunnel is active) before quitting. While the
tunnel is connected, the menu item's title is changed to 'Disconnect and
Quit'.
Depends on #2687 (and this PR currently includes those commits), so
opened as draft.
**Edit:** Now made ready for review.
Fixes #2304.
- If the tunnel disconnects because of connlib, or because the tunnel is
incorrectly configured (we don't expect that to happen), the user is
signed out.
- If the tunnel disconnects because the user disconnected it in
Settings, the user is not signed out, and no alert is shown.
- If the tunnel disconnects because the OS brought it down for some
other reason (not sure what it could be), the user is not signed out,
and an alert is shown. The alert is shown only if the app is running at
that time.
Why:
* As sites are created, the default behavior right now is to route
traffic through whichever path is easiest/fastest. This commit adds the
ability for the admin to choose a routing policy for a given site.
This PR started as an investigation into a performance degradation in
the gateways.
The way to test performance in a realistic environment is to use a GCP VM
as a client and an AWS VM as a gateway, with a single iperf server behind
the gateway.
These are the `iperf` results with current main:
```
Connecting to host 172.31.92.238, port 5201
Reverse mode, remote host 172.31.92.238 is sending
[ 5] local 100.83.194.77 port 58426 connected to 172.31.92.238 port 5201
[ ID] Interval Transfer Bitrate
[ 5] 0.00-1.00 sec 1.01 MBytes 8.50 Mbits/sec
[ 5] 1.00-2.00 sec 1.14 MBytes 9.59 Mbits/sec
[ 5] 2.00-3.00 sec 699 KBytes 5.73 Mbits/sec
[ 5] 3.00-4.00 sec 1.11 MBytes 9.31 Mbits/sec
[ 5] 4.00-5.00 sec 664 KBytes 5.44 Mbits/sec
[ 5] 5.00-6.00 sec 591 KBytes 4.84 Mbits/sec
[ 5] 6.00-7.00 sec 722 KBytes 5.91 Mbits/sec
[ 5] 7.00-8.00 sec 833 KBytes 6.83 Mbits/sec
[ 5] 8.00-9.00 sec 738 KBytes 6.04 Mbits/sec
[ 5] 9.00-10.00 sec 836 KBytes 6.85 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.06 sec 8.78 MBytes 7.32 Mbits/sec 3 sender
[ 5] 0.00-10.00 sec 8.23 MBytes 6.90 Mbits/sec receiver
iperf Done.
```
Most of the performance problems were due to using SCTP and DTLS.
So I created a
[fork](https://github.com/firezone/webrtc/tree/expose-new-endpoint) of
webrtc that lets us circumvent those; we don't need them because we
depend on wireguard for encryption.
With those changes, much better throughput is achieved:
```
gabriel@cloudshell:~ (firezone-personal-instances)$ iperf3 -R -c 172.31.92.238
Connecting to host 172.31.92.238, port 5201
Reverse mode, remote host 172.31.92.238 is sending
[ 5] local 100.83.194.77 port 51206 connected to 172.31.92.238 port 5201
[ ID] Interval Transfer Bitrate
[ 5] 0.00-1.00 sec 5.60 MBytes 47.0 Mbits/sec
[ 5] 1.00-2.00 sec 17.2 MBytes 144 Mbits/sec
[ 5] 2.00-3.00 sec 15.8 MBytes 132 Mbits/sec
[ 5] 3.00-4.00 sec 14.8 MBytes 125 Mbits/sec
[ 5] 4.00-5.00 sec 15.9 MBytes 133 Mbits/sec
[ 5] 5.00-6.00 sec 15.8 MBytes 133 Mbits/sec
[ 5] 6.00-7.00 sec 15.3 MBytes 128 Mbits/sec
[ 5] 7.00-8.00 sec 15.6 MBytes 131 Mbits/sec
[ 5] 8.00-9.00 sec 15.6 MBytes 131 Mbits/sec
[ 5] 9.00-10.00 sec 16.0 MBytes 134 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.05 sec 151 MBytes 126 Mbits/sec 74 sender
[ 5] 0.00-10.00 sec 148 MBytes 124 Mbits/sec receiver
iperf Done
```
However, this is still worse than what was achieved with a previous
commit (`21afdf0a9a113c996d60a63b2e8c8f32d3aeb87`):
```
gabriel@cloudshell:~ (firezone-personal-instances)$ iperf3 -R -c 172.31.92.238
Connecting to host 172.31.92.238, port 5201
Reverse mode, remote host 172.31.92.238 is sending
[ 5] local 100.100.68.41 port 49762 connected to 172.31.92.238 port 5201
[ ID] Interval Transfer Bitrate
[ 5] 0.00-1.00 sec 6.14 MBytes 51.5 Mbits/sec
[ 5] 1.00-2.00 sec 17.1 MBytes 144 Mbits/sec
[ 5] 2.00-3.00 sec 22.8 MBytes 191 Mbits/sec
[ 5] 3.00-4.00 sec 23.5 MBytes 197 Mbits/sec
[ 5] 4.00-5.00 sec 23.0 MBytes 193 Mbits/sec
[ 5] 5.00-6.00 sec 22.1 MBytes 185 Mbits/sec
[ 5] 6.00-7.00 sec 23.0 MBytes 193 Mbits/sec
[ 5] 7.00-8.00 sec 22.7 MBytes 190 Mbits/sec
[ 5] 8.00-9.00 sec 21.0 MBytes 176 Mbits/sec
[ 5] 9.00-10.00 sec 19.9 MBytes 167 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.05 sec 204 MBytes 170 Mbits/sec 127 sender
[ 5] 0.00-10.00 sec 201 MBytes 169 Mbits/sec receiver
```
My profiling suggested that this is due to reading/writing packets
happening in their own dedicated tasks. So much so that in the future we
should maybe even consider spawning a dedicated runtime for them so that
those loops get their own OS thread.
Also, using a multi-queue interface will probably give us huge gains for
handling multiple concurrent clients if we have a dedicated task for each
queue (currently the interface is started as multi-queue but a single
file descriptor is used).
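A minimal sketch of the dedicated-runtime idea is below; `read_packet` is a placeholder for the real device read, and none of this is connlib's actual code.
```rust
use tokio::sync::mpsc;

/// Sketch only: run the packet read loop on its own OS thread with a
/// dedicated current-thread runtime, so other tasks can never starve it.
fn spawn_packet_reader(packet_tx: mpsc::Sender<Vec<u8>>) -> std::thread::JoinHandle<()> {
    std::thread::spawn(move || {
        let rt = tokio::runtime::Builder::new_current_thread()
            .enable_all()
            .build()
            .expect("failed to build packet runtime");

        rt.block_on(async move {
            loop {
                let packet = read_packet().await; // placeholder for the real TUN/UDP read
                // Hand the packet to the main task loop; stop if the receiver is gone.
                if packet_tx.send(packet).await.is_err() {
                    break;
                }
            }
        });
    })
}

async fn read_packet() -> Vec<u8> {
    // Placeholder: the real implementation reads from the device.
    Vec::new()
}
```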
However, the changes proposed in this PR are good enough for now, as
long as performance doesn't degrade.
Along those lines, I will create a CI job that reports the throughput
using the local `docker-compose.yml` file, which we should always check
before merging. That is not the be-all and end-all of the performance
story, but for smaller PRs the correlation to real-world throughput should
be enough.
For bigger PRs we should manually test before merging for now, until we
have a way in CI to spin up realistic tests (note that the VMs should be
in separate cloud environments; same-cloud links are so reliable that we
would miss actual performance degradation due to dropped packets). On this
note, I'll write a small manual on how to conduct those tests, with full
current results, that we should always use before merging new PRs that
affect the hot path. cc @thomaseizinger
Finally, when testing these changes I found some flakiness in the
reconnection path. So I changed things so that we clean up connections
only based on wireguard's error (connection expiration). This is quite
slow for now (~120 seconds), but in the future we can issue an ICE restart
each time the wireguard keepalive expires (rekey timeout), so that we can
restart the connection every ~30 seconds, and we can reduce the keepalive
timeout from the portal to accelerate it even more. And further down the
line we can get smarter about it.
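For illustration, a minimal sketch of expiration-based cleanup, assuming we only track the last wireguard activity per peer; the names and timeout are illustrative, not connlib's actual implementation.
```rust
use std::time::{Duration, Instant};

// Sketch only: expire a connection based on wireguard activity instead of
// tracking ICE state separately. ~120 s mirrors the slow expiry mentioned above.
const CONNECTION_EXPIRY: Duration = Duration::from_secs(120);

struct PeerConnection {
    last_wireguard_activity: Instant,
}

impl PeerConnection {
    fn is_expired(&self) -> bool {
        self.last_wireguard_activity.elapsed() > CONNECTION_EXPIRY
    }
}

fn cleanup(connections: &mut Vec<PeerConnection>) {
    // Only remove connections once wireguard itself has given up on them.
    connections.retain(|c| !c.is_expired());
}
```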
---------
Co-authored-by: Thomas Eizinger <thomas@eizinger.io>