feat(blog): Improving reliability for DNS Resources (#5469)

Need to make this post as a reference to link to from other places.

---------

Signed-off-by: Jamil <jamilbk@users.noreply.github.com>
Co-authored-by: Reactor Scram <ReactorScram@users.noreply.github.com>
Jamil
2024-06-20 13:53:52 -07:00
committed by GitHub
parent 04585874cf
commit 2df512717d
16 changed files with 164 additions and 22 deletions

View File

@@ -8,7 +8,6 @@ export default function _Page() {
<Post
authorName="Jamil Bou Kheir"
authorTitle="Founder"
authorEmail="jamil@firezone.dev"
authorAvatarSrc={gravatar("jamil@firezone.dev")}
title="April 2024 Update: GA"
date="2024-04-01"

View File

@@ -14,7 +14,6 @@ export default function Page() {
<Post
authorName="Jeff Spencer"
authorTitle=""
authorEmail="jeff@firezone.dev"
authorAvatarSrc={gravatar("jeff@firezone.dev")}
title="Enterprises choose open source"
date="December 6, 2023"

View File

@@ -8,8 +8,7 @@ export default function Page() {
return (
<Post
authorName="Jamil Bou Kheir"
authorTitle="Founder & CEO"
authorEmail="jamil@firezone.dev"
authorTitle="Founder"
authorAvatarSrc={gravatar("jamil@firezone.dev")}
title="Firezone 1.0"
date="July 14, 2023"

View File

@@ -8,7 +8,6 @@ export default function _Page() {
<Post
authorName="Gabriel Steinberg"
authorTitle="Senior Backend Engineer"
authorEmail="gabriel@firezone.dev"
authorAvatarSrc={gravatar("gabriel@firezone.dev")}
title="How DNS Works in Firezone"
date="2024-05-08"

View File

@@ -0,0 +1,18 @@
"use client";
import Post from "@/components/Blog/Post";
import Content from "./readme.mdx";
import gravatar from "@/lib/gravatar";
export default function _Page() {
return (
<Post
authorName="Jamil Bou Kheir"
authorTitle="Founder"
authorAvatarSrc={gravatar("jamil@firezone.dev")}
title="Improving reliability for DNS Resources"
date="2024-06-20"
>
<Content />
</Post>
);
}

View File

@@ -0,0 +1,12 @@
import _Page from "./_page";
import { Metadata } from "next";
export const metadata: Metadata = {
title: "Improving reliability for DNS Resources • Firezone Blog",
description:
"Client and Gateway versions 1.1 onwards include a more reliable DNS routing system.",
};
export default function Page() {
return <_Page />;
}

View File

@@ -0,0 +1,109 @@
**tl;dr**: [Upgrade your Gateway(s)](#how-to-upgrade) to 1.1.0 soon to improve
reliability for DNS Resources.
In our [How DNS works in Firezone](/blog/how-dns-works-in-firezone) post, we
covered how DNS Resources are resolved and routed reliably even when the IPs
they resolve to collide. The system described there works well for the vast
majority of our users across many kinds of networks.
But, as it turns out, not all networks are well-behaved (surprise!). Certain
networks in particular can cause issues with DNS Resources, causing them to time
out or fail to be resolved after a period of time.
This post describes why that happens, how we're resolving it, and the steps you
can take to upgrade.
## The case of the NAT reset
The issue was first discovered about a month ago during our internal dogfood
testing sessions. We noticed that after some time (typically 30 minutes to a few
hours), DNS Resources would become unresponsive and require the application to
issue another DNS query to perform the hole-punching dance and re-establish
connectivity.
This is odd behavior -- tunnels are designed to be kept alive indefinitely with
a periodic keep-alive sent from Client to Gateway.
### When tunnels drop
There are two obvious reasons why a tunnel might drop and need to be
re-established:
- The Client experienced a change in network connectivity (e.g. switching Wi-Fi
networks), or
- The Gateway experienced a change in network connectivity (e.g. restarted by an
admin)
A third, less obvious reason is when the network in between the Client and
Gateway is misbehaving.
### Google Cloud NAT
We dogfood Firezone internally across a variety of network conditions for both
Client and Gateway. After some investigation, we discovered a curious pattern:
the DNS Resource reliability issue only occurred for our Gateways running in
Google Cloud.
After running an overnight soak test, we discovered that the issue happened at
regular intervals. Precisely **every 30 minutes**, the WireGuard tunnel would
drop, and connectivity to the DNS Resource would be lost. Since new tunnels for
DNS Resources are established only at the time of resolution, the application
(`ping` in our case) would lose connectivity until it was restarted.
Google doesn't publish details on the session lifetimes for their NAT Gateways,
so we can't be sure if the problem is related to GCP or another router close to
GCP's datacenters (if you happen to know, please email us!).
But the goal of this post isn't to pick on Google -- some enterprise routers
behave similarly, under the guise of so-called "security" features, so the issue
could occur in other networks as well.
## The solution
The solution is a simple, yet subtle one: instead of establishing the tunnel for
a DNS Resource at the time of resolution, we now wait until we see the first
packet for the Resource before performing the hole-punching dance to set up the
tunnel.
The stub resolver maintains a list of mapped IPs to DNS Resources, so we know at
the packet level which DNS Resource the packet is for, even long after the query
has been resolved.
If the tunnel fails, the very next packet from the application will establish it
again, avoiding the need for another query (which the application may not make)
and thus avoiding the reliability issues detailed above.
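
To make the idea concrete, here is a minimal sketch of packet-triggered tunnel setup. It is not Firezone's actual code: the class names, the `100.96.0.x` placeholder range, and the `setups` counter (a stand-in for the hole-punching dance) are all assumptions made for illustration.

```typescript
// Hypothetical sketch; names and the IP range are illustrative only.
type ResourceId = string;

class StubResolver {
  // Dummy IP handed to applications -> DNS Resource it stands for.
  private ipToResource = new Map<string, ResourceId>();
  private resourceToIp = new Map<ResourceId, string>();
  private nextHost = 1;

  // Answer a query immediately with a dummy IP. No tunnel is set up here.
  resolve(resource: ResourceId): string {
    const existing = this.resourceToIp.get(resource);
    if (existing !== undefined) return existing;
    // Toy allocator: a real one would manage a proper address pool.
    const ip = "100.96.0." + this.nextHost++;
    this.ipToResource.set(ip, resource);
    this.resourceToIp.set(resource, ip);
    return ip;
  }

  // Packet-level lookup, valid long after the query was answered.
  lookup(ip: string): ResourceId | undefined {
    return this.ipToResource.get(ip);
  }
}

class TunnelManager {
  private active = new Set<ResourceId>();
  // Counts tunnel setups; stands in for the real hole-punching work.
  public setups = 0;

  constructor(private resolver: StubResolver) {}

  // Called for every outbound packet. Returns false if the destination
  // is not a known DNS Resource.
  handlePacket(dstIp: string): boolean {
    const resource = this.resolver.lookup(dstIp);
    if (resource === undefined) return false;
    if (!this.active.has(resource)) {
      this.setups++; // (re-)establish the tunnel on demand
      this.active.add(resource);
    }
    return true;
  }

  // Simulates the tunnel being dropped, e.g. by a NAT session reset.
  dropTunnel(resource: ResourceId): void {
    this.active.delete(resource);
  }
}
```

In this sketch, resolving a name does no tunnel work at all, and a dropped tunnel is rebuilt by the next packet rather than by a new DNS query.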
### NAT64 comes for free
One interesting edge case we hit implementing the above solution is that we
don't know the _actual_ IP of the DNS Resource until the tunnel to the Gateway
is established, at which point the Gateway resolves it.
Since the stub resolver now immediately returns a dummy IP when asked to do so,
it could return an IPv4 address for a Resource that has only `AAAA` records
defined, or vice versa. If the application chooses IPv4 to connect to the
Resource, packets would arrive at the Gateway and suddenly need to be translated
to IPv6.
So we added a NAT64 implementation to Gateways in 1.1.0 that handles this
on-the-fly, with no configuration required. That means your workforce can now
seamlessly connect to IPv6-only Resources even if they're on IPv4-only networks!
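
As a rough illustration of the mismatch, the sketch below decides whether a packet needs translating once the Gateway knows the real addresses. The function names and the "prefer a same-family address" policy are assumptions made for this example, not a description of how the Gateway actually chooses.

```typescript
// Hypothetical sketch; the selection policy is an assumption.
type IpFamily = "v4" | "v6";

// Crude family check: IPv6 literals contain a colon, IPv4 literals do not.
function familyOf(addr: string): IpFamily {
  return addr.indexOf(":") >= 0 ? "v6" : "v4";
}

interface TranslationPlan {
  target: string; // real address the packet will be sent to
  translate: boolean; // true if the IP version must change on the way
}

// Given the dummy IP the application used and the addresses the Gateway
// resolved, pick a target and report whether translation is needed.
function planTranslation(
  dummyIp: string,
  resolved: string[]
): TranslationPlan | undefined {
  if (resolved.length === 0) return undefined;
  const wanted = familyOf(dummyIp);
  for (let i = 0; i < resolved.length; i++) {
    if (familyOf(resolved[i]) === wanted) {
      return { target: resolved[i], translate: false };
    }
  }
  // No address of the same family: fall back to the first one and translate.
  return { target: resolved[0], translate: true };
}
```

For example, an application that connected over IPv4 to a Resource with only `AAAA` records would get a plan with `translate: true`.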
## How to upgrade
We released Gateway version 1.1.0 yesterday that includes the change. This
version is compatible with Client versions 1.0.x **and** 1.1.x. However, Client
versions 1.1.x **will not** be compatible with Gateway versions 1.0.x.
To give admins time to upgrade their Gateways, we are waiting to release the
1.1.0 Clients until **Thursday, June 27th**. We recommend upgrading your
Gateways to 1.1.0 as soon as possible to avoid any service disruptions caused by
end users upgrading their Clients prematurely.
Upgrading Gateway(s) usually takes only a couple minutes --
[read the docs](/kb/administer/upgrading) to see how.
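
The compatibility rules above can be summarized in a few lines. This is only a sketch: `isCompatible` is a hypothetical helper, and treating "Gateway minor version at least the Client's" as the general rule is our extrapolation from the three cases stated in this post (it also assumes 1.0.x Clients work with 1.0.x Gateways).

```typescript
// Hypothetical helper; the general rule is an assumption extrapolated
// from the cases described in the post.
function parseVersion(version: string): { major: number; minor: number } {
  const parts = version.split(".");
  return { major: Number(parts[0]), minor: Number(parts[1]) };
}

// A Client works with a Gateway of the same major version whose minor
// version is equal or newer.
function isCompatible(client: string, gateway: string): boolean {
  const c = parseVersion(client);
  const g = parseVersion(gateway);
  return c.major === g.major && g.minor >= c.minor;
}
```

Under these assumptions, `isCompatible("1.0.3", "1.1.0")` is true while `isCompatible("1.1.0", "1.0.5")` is false, which is why Gateways should be upgraded first.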
### Conclusion
That's all for now. If you have questions or hit issues, contact us via one of
the means [listed here](/support).

View File

@@ -13,7 +13,6 @@ export default function Page() {
<Post
authorName="Jamil Bou Kheir"
authorTitle="Founder"
authorEmail="jamil@firezone.dev"
authorAvatarSrc={gravatar("jamil@firezone.dev")}
title="January 2024 Update"
date="2024-01-01"

View File

@@ -8,7 +8,6 @@ export default function _Page() {
<Post
authorName="Jamil Bou Kheir"
authorTitle="Founder"
authorEmail="jamil@firezone.dev"
authorAvatarSrc={gravatar("jamil@firezone.dev")}
title="March 2024 Update"
date="2024-03-01"

View File

@@ -8,7 +8,6 @@ export default function _Page() {
<Post
authorName="Jamil Bou Kheir"
authorTitle="Founder"
authorEmail="jamil@firezone.dev"
authorAvatarSrc={gravatar("jamil@firezone.dev")}
title="May 2024 Update"
date="2024-05-01"

View File

@@ -21,9 +21,25 @@ export default function Page() {
Announcements, insights, and more from the Firezone team.
</p>
<div className="mt-14 grid divide-y">
<SummaryCard
title="Improving reliability for DNS Resources"
date="June 20, 2024"
href="/blog/improving-reliability-for-dns-resources"
authorName="Jamil Bou Kheir"
authorAvatarSrc={gravatar("jamil@firezone.dev")}
type="Announcement"
>
<p className="mb-2">
We're making some changes to the way DNS Resources are routed in
Firezone. These changes will be coming in Client and Gateway
versions 1.1 and later. Continue reading to understand how these
changes will affect your network and what you need to do to take
advantage of them.
</p>
</SummaryCard>
<SummaryCard
title="Using Tauri to build a cross-platform security app"
date="Jun 11, 2024"
date="June 11, 2024"
href="/blog/using-tauri"
authorName="ReactorScram"
authorAvatarSrc="/images/avatars/reactorscram.png"
@@ -35,7 +51,7 @@ export default function Page() {
</p>
</SummaryCard>
<SummaryCard
title="How DNS Works in Firezone"
title="How DNS works in Firezone"
date="May 8, 2024"
href="/blog/how-dns-works-in-firezone"
authorName="Gabriel Steinberg"
@@ -50,7 +66,7 @@ export default function Page() {
</p>
</SummaryCard>
<SummaryCard
title="May 2024 Update"
title="May 2024 update"
date="May 1, 2024"
href="/blog/may-2024-update"
authorName="Jamil Bou Kheir"
@@ -77,7 +93,7 @@ export default function Page() {
</div>
</SummaryCard>
<SummaryCard
title="April 2024 Update: GA"
title="April 2024 update: GA"
date="April 1, 2024"
href="/blog/apr-2024-update"
authorName="Jamil Bou Kheir"
@@ -112,7 +128,7 @@ export default function Page() {
</ul>
</SummaryCard>
<SummaryCard
title="March 2024 Update"
title="March 2024 update"
date="March 1, 2024"
href="/blog/mar-2024-update"
authorName="Jamil Bou Kheir"
@@ -136,7 +152,7 @@ export default function Page() {
</ul>
</SummaryCard>
<SummaryCard
title="Jaunary 2024 Update"
title="January 2024 update"
date="January 1, 2024"
href="/blog/jan-2024-update"
authorName="Jamil Bou Kheir"

View File

@@ -12,8 +12,7 @@ export default function Page() {
return (
<Post
authorName="Jamil Bou Kheir"
authorTitle="Founder & CEO"
authorEmail="jamil@firezone.dev"
authorTitle="Founder"
authorAvatarSrc={gravatar("jamil@firezone.dev")}
title="Firezone 0.5.0 Released!"
date="July 25, 2022"

View File

@@ -12,8 +12,7 @@ export default function Page() {
return (
<Post
authorName="Jamil Bou Kheir"
authorTitle="Founder & CEO"
authorEmail="jamil@firezone.dev"
authorTitle="Founder"
authorAvatarSrc={gravatar("jamil@firezone.dev")}
title="Firezone 0.6.0 Released!"
date="October 17, 2022"

View File

@@ -13,7 +13,6 @@ export default function Page() {
<Post
authorName="Jeff Spencer"
authorTitle=""
authorEmail="jeff@firezone.dev"
authorAvatarSrc={gravatar("jeff@firezone.dev")}
title="Secure remote access makes remote work a win-win"
date="November 17, 2023"

View File

@@ -7,7 +7,6 @@ export default function _Page() {
<Post
authorName="ReactorScram"
authorTitle="Senior Systems Engineer"
authorEmail="ReactorScram@users.noreply.github.com"
authorAvatarSrc="/images/avatars/reactorscram.png"
title="Using Tauri to build a cross-platform security app"
date="2024-06-11"

View File

@@ -3,7 +3,6 @@ import Image from "next/image";
export default function Post({
authorName,
authorTitle,
authorEmail,
authorAvatarSrc,
title,
date,
@@ -11,7 +10,6 @@ export default function Post({
}: {
authorName: string;
authorTitle: string;
authorEmail: string;
authorAvatarSrc: string;
title: string;
date: string;