VAULT-28478: Updates to autopilot docs (#28331)

* restructure

* update command

* fixes

* fix command flags

* revert makefile change

* remove tick log
This commit is contained in:
miagilepner
2024-09-17 10:53:18 +02:00
committed by GitHub
parent c140470639
commit d00715d129
9 changed files with 289 additions and 149 deletions

View File

@@ -35,54 +35,69 @@ $ curl \
```json
{
"healthy": true,
"failure_tolerance": 1,
"healthy": true,
"leader": "vault_1",
"servers": {
"raft1": {
"id": "raft1",
"name": "raft1",
"vault_1": {
"address": "127.0.0.1:8201",
"node_status": "alive",
"healthy": true,
"id": "vault_1",
"last_contact": "0s",
"last_index": 63,
"last_term": 3,
"last_index": 459,
"healthy": true,
"stable_since": "2021-03-19T20:14:11.831678-04:00",
"name": "vault_1",
"node_status": "alive",
"node_type": "voter",
"stable_since": "2024-08-29T16:02:45.639829+02:00",
"status": "leader",
"meta": null
"version": "1.17.3"
},
"raft2": {
"id": "raft2",
"name": "raft2",
"address": "127.0.0.2:8201",
"node_status": "alive",
"last_contact": "516.49595ms",
"last_term": 3,
"last_index": 459,
"vault_2": {
"address": "127.0.0.1:8203",
"healthy": true,
"stable_since": "2021-03-19T20:14:19.831931-04:00",
"id": "vault_2",
"last_contact": "678.62575ms",
"last_index": 63,
"last_term": 3,
"name": "vault_2",
"node_status": "alive",
"node_type": "voter",
"stable_since": "2024-08-29T16:02:47.640976+02:00",
"status": "voter",
"meta": null
"version": "1.17.3"
},
"raft3": {
"id": "raft3",
"name": "raft3",
"address": "127.0.0.3:8201",
"node_status": "alive",
"last_contact": "196.706591ms",
"last_term": 3,
"last_index": 459,
"vault_3": {
"address": "127.0.0.1:8205",
"healthy": true,
"stable_since": "2021-03-19T20:14:25.83565-04:00",
"id": "vault_3",
"last_contact": "3.969159375s",
"last_index": 63,
"last_term": 3,
"name": "vault_3",
"node_status": "alive",
"node_type": "voter",
"stable_since": "2024-08-29T16:02:49.640905+02:00",
"status": "voter",
"meta": null
"version": "1.17.3"
}
},
"leader": "raft1",
"voters": ["raft1", "raft2", "raft3"],
"non_voters": null
"voters": [
"vault_1",
"vault_2",
"vault_3"
]
}
```
The `failure_tolerance` of a cluster is the number of nodes in the cluster that could
fail gradually without causing an outage.
When verifying the health of your cluster, check the following fields of each server:
- `healthy`: whether Autopilot considers this node healthy or not
- `status`: the voting status of the node. This will be `voter`, `leader`, or [`non-voter`](/vault/docs/concepts/integrated-storage#non-voting-nodes-enterprise-only)
- `last_index`: the index of the last applied Raft log. This should be close to the `last_index` value of the leader.
- `version`: the version of Vault running on the server
- `node_type`: the type of node. On CE, this will always be `voter`. See below for an explanation of Enterprise node types.
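One quick way to spot-check those fields across every server is to read the state endpoint and filter the response. The following is only a sketch: it assumes the `jq` utility is available and that `VAULT_ADDR` and `VAULT_TOKEN` are already set for the CLI.

```shell-session
$ vault read -format=json sys/storage/raft/autopilot/state \
    | jq '.data.servers[] | {name, healthy, status, last_index, version, node_type}'
```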
### Enterprise only
Vault Enterprise will include additional output in its API response to indicate the current state of redundancy zones,
@@ -149,7 +164,7 @@ automated upgrade progress (if any), and optimistic failure tolerance.
}
},
"status": "await-new-voters",
"target_version": "1.12.0",
"target_version": "1.17.5",
"target_version_non_voters": [
"vault_5"
]
@@ -161,6 +176,11 @@ automated upgrade progress (if any), and optimistic failure tolerance.
}
```
`optimistic_failure_tolerance` describes the number of healthy active and
back-up voting servers that can fail gradually without causing an outage.
@include 'autopilot/node-types.mdx'
## Get configuration
This endpoint is used to get the configuration of the autopilot subsystem of Integrated Storage.
@@ -203,31 +223,7 @@ This endpoint is used to modify the configuration of the autopilot subsystem of
### Parameters
@include 'autopilot/config.mdx'
### Sample request

View File

@@ -128,6 +128,13 @@ Usage: vault operator raft list-peers
}
```
Use the output of `list-peers` to ensure that your cluster is in an expected state.
If you've removed a server using `remove-peer`, the server should no longer be
listed in the `list-peers` output. If you've added a server using `add-peer` or
through `retry_join`, check the `list-peers` output to confirm that it has joined
the cluster and, if it was not added as a non-voter, that it has been promoted
to a voter.
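For example, a minimal check after adding a server (using a hypothetical node named `vault_4`) might look like the following; the node name is illustrative only.

```shell-session
# List the current peers and confirm the new node is present and a voter.
$ vault operator raft list-peers

# Or narrow the output to the node you just added (name is hypothetical).
$ vault operator raft list-peers | grep vault_4
```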
## remove-peer
This command is used to remove a node from being a peer to the Raft cluster. In
@@ -229,14 +236,9 @@ Subcommands:
### autopilot state
Displays the state of the raft cluster under integrated storage as seen by
autopilot. It shows whether autopilot thinks the cluster is healthy or not.

State includes a list of all servers by nodeID and IP address.
```text
Usage: vault operator raft autopilot state
@@ -249,34 +251,60 @@ Usage: vault operator raft autopilot state
#### Example output
```text
Healthy: true
Failure Tolerance: 1
Leader: vault_1
Voters:
   vault_1
   vault_2
   vault_3
Servers:
   vault_1
      Name: vault_1
      Address: 127.0.0.1:8201
      Status: leader
      Node Status: alive
      Healthy: true
      Last Contact: 0s
      Last Term: 3
      Last Index: 61
      Version: 1.17.3
      Node Type: voter
   vault_2
      Name: vault_2
      Address: 127.0.0.1:8203
      Status: voter
      Node Status: alive
      Healthy: true
      Last Contact: 564.765375ms
      Last Term: 3
      Last Index: 61
      Version: 1.17.3
      Node Type: voter
   vault_3
      Name: vault_3
      Address: 127.0.0.1:8205
      Status: voter
      Node Status: alive
      Healthy: true
      Last Contact: 3.814017875s
      Last Term: 3
      Last Index: 61
      Version: 1.17.3
      Node Type: voter
```
The "Failure Tolerance" of a cluster is the number of nodes in the cluster that could
fail gradually without causing an outage.
When verifying the health of your cluster, check the following fields of each server:
- Healthy: whether Autopilot considers this node healthy or not
- Status: the voting status of the node. This will be `voter`, `leader`, or [`non-voter`](/vault/docs/concepts/integrated-storage#non-voting-nodes-enterprise-only).
- Last Index: the index of the last applied Raft log. This should be close to the "Last Index" value of the leader.
- Version: the version of Vault running on the server
- Node Type: the type of node. On CE, this will always be `voter`. See below for an explanation of Enterprise node types.
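If you are watching the cluster during maintenance, such as a rolling node replacement, you can re-run this command periodically. A minimal sketch, assuming a Linux host with the `watch` utility:

```shell-session
# Refresh the autopilot state every 10 seconds while nodes are being replaced.
$ watch -n 10 vault operator raft autopilot state
```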
Vault Enterprise will include additional output related to automated upgrades, optimistic failure tolerance, and redundancy zones.
#### Example Vault enterprise output
@@ -292,7 +320,7 @@ Redundancy Zones:
Failure Tolerance: 1
Upgrade Info:
Status: await-new-voters
Target Version: 1.17.5
Target Version Voters:
Target Version Non-Voters: vault_5
Other Version Voters: vault_1, vault_3
@@ -310,6 +338,11 @@ Upgrade Info:
Other Version Non-Voters: vault_4
```
"Optimistic Failure Tolerance" describes the number of healthy active and
back-up voting servers that can fail gradually without causing an outage.
@include 'autopilot/node-types.mdx'
### autopilot get-config
Returns the configuration of the autopilot subsystem under integrated storage.
@@ -337,29 +370,49 @@ Usage: vault operator raft autopilot set-config [options]
Flags applicable to this command are the following:
- `cleanup-dead-servers` `(bool: false)` - Controls whether to remove dead servers from
  the Raft peer list periodically or when a new server joins. This requires that
  `min-quorum` is also set.

- `last-contact-threshold` `(string: "10s")` - Limit on the amount of time a server can
  go without leader contact before being considered unhealthy.

- `dead-server-last-contact-threshold` `(string: "24h")` - Limit on the amount of time
  a server can go without leader contact before being considered failed. This
  takes effect only when `cleanup_dead_servers` is set. When adding new nodes
  to your cluster, the `dead_server_last_contact_threshold` needs to be larger
  than the amount of time that it takes to load a Raft snapshot, otherwise the
  newly added nodes will be removed from your cluster before they have finished
  loading the snapshot and starting up. If you are using an [HSM](/vault/docs/enterprise/hsm), your
  `dead_server_last_contact_threshold` needs to be larger than the response
  time of the HSM.

<Warning>

We strongly recommend keeping `dead_server_last_contact_threshold` at a high
duration, such as a day, as it being too low could result in removal of nodes
that aren't actually dead.

</Warning>

- `max-trailing-logs` `(int: 1000)` - Amount of entries in the Raft Log that a server
  can be behind before being considered unhealthy. If this value is too low,
  it can cause the cluster to lose quorum if a follower falls behind. This
  value only needs to be increased from the default if you have a very high
  write load on Vault and you see that it takes a long time to promote new
  servers to becoming voters. This is an unlikely scenario and most users
  should not modify this value.

- `min-quorum` `(int)` - The minimum number of servers that should always be
  present in a cluster. Autopilot will not prune servers below this number.
  **There is no default for this value** and it should be set to the expected
  number of voters in your cluster when `cleanup_dead_servers` is set as `true`.
  Use the [quorum size guidance](/vault/docs/internals/integrated-storage#quorum-size-and-failure-tolerance)
  to determine the proper minimum quorum size for your cluster.

- `server-stabilization-time` `(string: "10s")` - Minimum amount of time a server must be in a healthy state before it
  can become a voter. Until that happens, it will be visible as a peer in the cluster, but as a non-voter, meaning it
  won't contribute to quorum.

- `disable-upgrade-migration` `(bool: false)` - Controls whether to disable automated
  upgrade migrations, an Enterprise-only feature.
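As a hedged example of combining several of these flags (the values are illustrative, not a recommendation for every cluster), enabling dead server cleanup on a five-voter cluster might look like:

```shell-session
$ vault operator raft autopilot set-config \
    -cleanup-dead-servers=true \
    -dead-server-last-contact-threshold=24h \
    -min-quorum=5 \
    -server-stabilization-time=30s
```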

View File

@@ -17,7 +17,7 @@ These two features were introduced in Vault 1.11.
Server stabilization helps to retain the stability of the Raft cluster by safely
joining new voting nodes to the cluster. When a new voter node is joined to an
existing cluster, autopilot adds it as a non-voter instead, and waits for a
pre-configured amount of time to monitor its health. If the node remains
healthy for the entire duration of stabilization, then that node will be
promoted as a voter. The server stabilization period can be tuned using
`server_stabilization_time` (see below).
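For example, if new nodes in your environment routinely need longer than the 10 second default to settle, you could lengthen the stabilization window; the value below is purely illustrative.

```shell-session
$ vault operator raft autopilot set-config -server-stabilization-time=30s
```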
@@ -31,7 +31,7 @@ and `min_quorum` (see below).
## State API
The [State API](/vault/api-docs/system/storage/raftautopilot#get-cluster-state) provides detailed information about all the nodes in the Raft cluster
in a single call. This API can be used for monitoring cluster health.
### Follower health
@@ -50,40 +50,7 @@ although dead server cleanup is not enabled by default. Upgrade of
Raft clusters deployed with older versions of Vault will also transition to use
Autopilot automatically.
@include 'autopilot/config.mdx'
~> **Note**: Autopilot in Vault does similar things to what autopilot does in
[Consul](https://www.consul.io/). However, the configuration in these 2 systems
@@ -94,7 +61,7 @@ provide the autopilot functionality.
## Automated upgrades
[Automated Upgrades](/vault/docs/enterprise/automated-upgrades) lets you automatically upgrade a cluster of Vault nodes to a new version as
updated server nodes join the cluster. Once the number of nodes on the new version is
equal to or greater than the number of nodes on the old version, Autopilot will promote
the newer versioned nodes to voters, demote the older versioned nodes to non-voters,
@@ -104,7 +71,7 @@ nodes can be removed from the cluster.
## Redundancy zones
[Redundancy Zones](/vault/docs/enterprise/redundancy-zones) provide both scaling and resiliency benefits by deploying non-voting
nodes alongside voting nodes on a per availability zone basis. When using redundancy zones,
each zone will have exactly one voting node and as many additional non-voting nodes as desired.
If the voting node in a zone fails, a non-voting node will be automatically promoted to

View File

@@ -60,6 +60,11 @@ API (both methods described below). When joining a node, the API address of the
recommend setting the [`api_addr`](/vault/docs/concepts/ha#direct-access) configuration
option on all nodes to make joining simpler.
Always join nodes to a cluster one at a time and wait for the node to become
healthy and (if applicable) a voter before continuing to add more nodes. The
status of a node can be verified with the [`list-peers`](/vault/docs/commands/operator/raft#list-peers)
command or by checking the [`autopilot state`](/vault/docs/commands/operator/raft#autopilot-state) output.
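A minimal sketch of that workflow follows; the active node address is a hypothetical example.

```shell-session
# On the new node: join it to the existing cluster.
$ vault operator raft join https://vault-active.example.com:8200

# From an existing node: confirm the new node is listed and, once the
# stabilization period has passed, that it has become a voter.
$ vault operator raft list-peers
$ vault operator raft autopilot state
```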
#### `retry_join` configuration
This method enables setting one, or more, target leader nodes in the config file.
@@ -95,9 +100,10 @@ provided, Vault will use [go-discover](https://github.com/hashicorp/go-discover)
to automatically attempt to discover and resolve potential Raft leader
addresses.
Check the go-discover
[README](https://github.com/hashicorp/go-discover/blob/master/README.md) for
details on the format of the [`auto_join`](/vault/docs/configuration/storage/raft#auto_join)
value per cloud provider.
```hcl
storage "raft" {
@@ -167,6 +173,14 @@ $ vault operator raft remove-peer node1
Peer removed successfully!
```
#### Re-joining after removal
If you have used `remove-peer` to remove a node from the Raft cluster, but later
want that same node to re-join the cluster, you will need to delete any existing
Raft data on the removed node before adding it back. To do so, stop the Vault
process, delete the data directory containing the Raft data, and restart the
Vault process.
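A rough sketch of those steps, assuming a systemd-managed Vault service and a raft storage `path` of `/opt/vault/data` (both are assumptions; substitute your own service manager and the `path` from your raft storage stanza):

```shell-session
# On the removed node: stop Vault and clear the old Raft state.
$ sudo systemctl stop vault
$ sudo rm -rf /opt/vault/data/*

# Restart Vault and re-join it to the cluster (address is illustrative).
$ sudo systemctl start vault
$ vault operator raft join https://vault-active.example.com:8200
```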
### Listing peers
To see the current peer set for the cluster you can issue a

View File

@@ -36,3 +36,7 @@ wait to begin leadership transfer until it can ensure that there will be as much
new Vault version as there was on the old Vault version.
The status of redundancy zones can be monitored by consulting the [Autopilot state API endpoint](/vault/api-docs/system/storage/raftautopilot#get-cluster-state).
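As a sketch, assuming `jq` is installed and the usual `VAULT_ADDR` and `VAULT_TOKEN` environment variables are set, the zone information can be pulled from that endpoint like so (the `redundancy_zones` field is only populated on Enterprise clusters that use zones):

```shell-session
$ vault read -format=json sys/storage/raft/autopilot/state \
    | jq '.data.redundancy_zones'
```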
## Optimistic Failure Tolerance
@include 'autopilot/redundancy-zones.mdx'

View File

@@ -271,6 +271,28 @@ For example, if you start with a 5-node cluster:
You should always maintain quorum to limit the impact on failure tolerance when
changing or scaling your Vault instance.
### Redundancy Zones
If you are using autopilot with [redundancy zones](/vault/docs/enterprise/redundancy-zones),
the total number of servers will be different from the above, and depends on
how many redundancy zones and how many servers per zone you choose.
@include 'autopilot/redundancy-zones.mdx'
<Highlight title="Best practice">
If you choose to use redundancy zones, we **strongly recommend** using at least 3
zones to ensure failure tolerance.
</Highlight>
Redundancy zones | Servers per zone | Quorum size | Failure tolerance | Optimistic failure tolerance
:--------------: | :--------------: | :---------: | :---------------: | :--------------------------:
2 | 2 | 2 | 0 | 2
3 | 2 | 2 | 1 | 3
3 | 3 | 2 | 1 | 5
5 | 2 | 3 | 2 | 6
[consensus protocol]: https://en.wikipedia.org/wiki/Consensus_(computer_science)
[consistency]: https://en.wikipedia.org/wiki/CAP_theorem
["Raft: In search of an Understandable Consensus Algorithm"]: https://raft.github.io/raft.pdf

View File

@@ -0,0 +1,53 @@
Autopilot exposes a [configuration
API](/vault/api-docs/system/storage/raftautopilot#set-configuration) to manage its
behavior. These items cannot be set in Vault server configuration files.
Autopilot is initialized with the following default values. If these defaults
do not match your expected autopilot behavior, set them to your desired values.
- `cleanup_dead_servers` `(bool: false)` - This controls whether to remove dead servers from
the Raft peer list periodically or when a new server joins. This requires that
`min-quorum` is also set.
- `dead_server_last_contact_threshold` `(string: "24h")` - Limit on the amount of time
a server can go without leader contact before being considered failed. This
takes effect only when `cleanup_dead_servers` is set. When adding new nodes
to your cluster, the `dead_server_last_contact_threshold` needs to be larger
than the amount of time that it takes to load a Raft snapshot, otherwise the
newly added nodes will be removed from your cluster before they have finished
loading the snapshot and starting up. If you are using an [HSM](/vault/docs/enterprise/hsm), your
`dead_server_last_contact_threshold` needs to be larger than the response
time of the HSM.
<Warning>
We strongly recommend keeping `dead_server_last_contact_threshold` at a high
duration, such as a day, as it being too low could result in removal of nodes
that aren't actually dead.
</Warning>
- `min_quorum` `(int)` - The minimum number of servers that should always be
present in a cluster. Autopilot will not prune servers below this number.
**There is no default for this value** and it should be set to the expected
number of voters in your cluster when `cleanup_dead_servers` is set as `true`.
Use the [quorum size guidance](/vault/docs/internals/integrated-storage#quorum-size-and-failure-tolerance)
to determine the proper minimum quorum size for your cluster.
- `max_trailing_logs` `(int: 1000)` - Amount of entries in the Raft Log that a
server can be behind before being considered unhealthy. If this value is too low,
it can cause the cluster to lose quorum if a follower falls behind. This
value only needs to be increased from the default if you have a very high
write load on Vault and you see that it takes a long time to promote new
servers to becoming voters. This is an unlikely scenario and most users
should not modify this value.
- `last_contact_threshold` `(string: "10s")` - Limit on the amount of time a
server can go without leader contact before being considered unhealthy.
- `server_stabilization_time` `(string: "10s")` - Minimum amount of time a server
must be in a healthy state before it can become a voter. Until that happens,
it will be visible as a peer in the cluster, but as a non-voter, meaning it
won't contribute to quorum.
- `disable_upgrade_migration` `(bool: false)` - Disables automatically upgrading
Vault using autopilot. (Enterprise-only)
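For reference, a sketch of applying a few of these values through the configuration API; the payload values are illustrative and the example assumes `VAULT_ADDR` and `VAULT_TOKEN` are exported.

```shell-session
$ cat > autopilot-config.json <<'EOF'
{
  "cleanup_dead_servers": true,
  "dead_server_last_contact_threshold": "24h",
  "min_quorum": 5,
  "server_stabilization_time": "30s"
}
EOF

$ curl \
    --header "X-Vault-Token: $VAULT_TOKEN" \
    --request POST \
    --data @autopilot-config.json \
    $VAULT_ADDR/v1/sys/storage/raft/autopilot/configuration
```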

View File

@@ -0,0 +1,6 @@
#### Enterprise Node Types
- `voter`: The server is a Raft voter and contributing to quorum.
- `read-replica`: The server is not a Raft voter, but receives a replica of all data.
- `zone-voter`: The main Raft voter in a redundancy zone.
- `zone-extra-voter`: An additional Raft voter in a redundancy zone.
- `zone-standby`: A non-voter in a redundancy zone that can be promoted to a voter, if needed.
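To see which of these node types autopilot has assigned to each server, one option (a sketch that assumes `jq` and the standard Vault environment variables) is to filter the state endpoint:

```shell-session
$ vault read -format=json sys/storage/raft/autopilot/state \
    | jq '[.data.servers[] | {name, node_type}]'
```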

View File

@@ -0,0 +1,25 @@
The majority of the voting servers in a cluster need to be available to agree on
changes in configuration. If a voting node becomes unavailable and that causes
the cluster to have fewer voting nodes than the quorum size, then Autopilot will not
be able to promote a non-voter to become a voter. This is the **failure tolerance** of
the cluster. Redundancy zones are not able to improve the failure tolerance of a
cluster.
Say that you have a cluster configured to have 2 redundancy zones and each zone
has 2 servers within it (for a total of 4 nodes in the cluster). The quorum size
is 2. If the zone voter in either of the redundancy zones becomes unavailable,
the cluster does not have quorum and is not able to agree on the configuration
change needed to promote the non-voter in the zone into a voter.
Redundancy zones do improve the **optimistic failure tolerance** of a cluster.
The optimistic failure tolerance is the number of healthy active and back-up
voting servers that can fail gradually without causing an outage. If the Vault
cluster is able to maintain a quorum of voting nodes, then the cluster has the
capability to lose nodes gradually and promote the standby redundancy zone nodes
to take the place of voters.
For example, consider a cluster that is configured to have 3 redundancy zones
with 2 nodes in each zone. If a voting node becomes unreachable, the zone standby
in that zone is promoted. The cluster then maintains 3 voting nodes with 2 remaining
standbys. The cluster can handle an additional 2 gradual failures before it loses
quorum.