Documentation for Adaptive Overload Protection (#26690)

* Document enabling config
* Fix nav data JSON after disabling over-zealous prettifier
* Address review feedback
* Add warning about reloading config during overload
* Bad metrics links
* Another bad link
* Add upgrade note about deprecation

---------

Co-authored-by: Mike Palmiotto <mike.palmiotto@hashicorp.com>
2025-11-02 03:27:54 +00:00 · 2024-05-10 17:55:57 +01:00
parent fc4042bd2e
commit 0a06215d1a
17 changed files with 305 additions and 40 deletions
--- a/website/content/docs/concepts/adaptive-overload-protection/index.mdx
+++ b/website/content/docs/concepts/adaptive-overload-protection/index.mdx
@@ -0,0 +1,94 @@
+---
+layout: docs
+page_title: 'Adaptive overload protection'
+description: >-
+  Vault Enterprise provides adaptive overload protection to automatically
+  prevent workloads from overloading different resources of the Vault servers.
+---
+
+# Adaptive overload protection
+
+@include 'alerts/enterprise-only.mdx'
+
+@include 'alerts/beta.mdx'
+
+Adaptive overload protection refers to a set of features in Vault Enterprise
+that prevent client requests from overwhelming server resources and
+degrading availability.
+
+## Preventing overload
+
+Vault currently supports one type of adaptive overload protection that prevents
+Vault servers from being overwhelmed by write requests.
+
+These protection measures are "adaptive" in the sense that they automatically
+and continuously adjust to maintain optimal performance for the current workload
+and hardware resources available without any user tuning.
+
+Load testing and tuning of appropriate limits is time-consuming for users during
+initial setup. Even when clusters are carefully tuned during installation,
+real-world workloads and hardware performance both change over time. A static
+tuning will soon be sub-optimal or even completely ineffective at preventing
+overloads.
+
+For example, an increase in disk latency caused by failing hardware might reduce
+the server's available throughput. A static limit configured while disks were
+performing at their peak would not protect the degraded system from overload. By
+adaptively responding to current load and performance characteristics, Vault
+Enterprise is able to provide long-term protection against overloads.
+
+## Types of overload
+
+There are many potential resources that could become a performance bottleneck in
+a Vault Enterprise cluster. Different forms of adaptive overload protection
+target specific components and workloads. This allows each one to be carefully
+specialized and tuned to the needs of that sub-system. The sections below
+describe specific mechanisms that prevent overload of particular subsystems and
+protect against particular types of overloads.
+
+## Write overload protection
+
+In Vault Enterprise, all writes go through the `WALBackend` to allow for
+replication to other clusters. This is true even if replication is not being
+used. Vault performs batching or "group commit" for these writes to increase
+throughput. Optimal throughput for a given storage backend is obtained when
+there are enough write requests in the queue to fill the next batch. However, if
+there are more requests queued than will fit in a batch, latencies start to grow
+quickly as all writes have to wait behind multiple other batches.
+
+In some cases, a sudden influx of write requests that exceeds Vault's hardware
+capacity can result in the writes queueing for so long that every request times
+out before the write can make it through the queue. This makes Vault effectively
+unavailable to clients even though it is still processing requests and storing
+data as fast as it can. This is illustrated in the test results shown below for
+a workload of 100% logins.
+
+![Login workload telemetry graphs showing difference with and without adaptive overload protection for writes](/img/adaptive-overload-protection-writes.png)
+
+Adaptive Write Overload Protection prevents this scenario. It constantly
+monitors the current state of the write queue and uses a carefully tuned
+algorithm to allow just enough queueing to maximize throughput on the available
+hardware while keeping latencies under control and unnecessary rejections to a
+minimum.
+
+Write overload protection was added in Vault Enterprise 1.17 as a beta feature
+which is disabled by default.
+
+To enable the feature, use the [`adaptive_overload_protection` configuration
+stanza](/vault/docs/configuration/adaptive-overload-protection).
+
+### Metrics
+
+Operators may wish to monitor metrics related to the write overload protection
+controller. The most useful of these is the `reject_fraction` which represents
+the controller's current estimate for the fraction of write requests that need
+to be rejected to maintain optimal throughput and stability.
+
+See the [wal.write_controller.reject_fraction metrics reference](/vault/docs/internals/telemetry/metrics/availability#vault-wal-write_controller-reject_fraction).
+
+## Client handling of overloads
+
+When Vault has reached capacity, new requests will be immediately rejected with
+a retryable `503 - Service Unavailable`. See [Vault Server Temporarily
+Overloaded](/vault/docs/concepts/adaptive-overload-protection/vault-server-temporarily-overloaded)
+for additional considerations around handling this error correctly.
--- a/website/content/docs/concepts/adaptive-overload-protection/vault-server-temporarily-overloaded.mdx
+++ b/website/content/docs/concepts/adaptive-overload-protection/vault-server-temporarily-overloaded.mdx
@@ -0,0 +1,51 @@
+---
+layout: docs
+page_title: Vault server temporarily overloaded
+description: |-
+  How to handle Vault servers rejecting requests due to overload.
+---
+
+Vault Enterprise includes features for [Adaptive Overload
+Protection](/vault/docs/concepts/adaptive-overload-protection). When some server
+resource is at capacity, Vault Enterprise may reject some HTTP client requests
+to preserve the Vault server's ability to remain stable and available. This
+document describes considerations for handling these requests in client code.
+
+# Vault server temporarily overloaded
+
+Vault returns a `503 - Service Unavailable` response to indicate that a request
+was rejected because there was not enough capacity to service it in a timely way:
+
+```
+Error making API request.
+
+URL: PUT https://127.0.0.1:61555/v1/auth/userpass/login/foo
+Code: 503. Errors:
+
+* 1 error occurred:
+	* Vault server temporarily overloaded
+```
+
+`503 - Service Unavailable` is a retryable HTTP error.
+
+Vault clients should retry their request with a suitable backoff strategy.
+When retrying you should:
+ * Wait for an increasing amount of time between retries.
+ * Randomize the wait time between retries to avoid many clients becoming
+   synchronized and all retrying at the same moment. This is often called
+   adding "jitter".
+ * Limit the total number of retries so that request volume doesn't continue to
+   grow for the duration of an outage as more and more clients add on retries.
+
+~> **NOTE**:  `429 - Too Many Requests` is typically used to indicate that a
+specific client is issuing too many requests. A `503 - Service Unavailable`
+instead indicates that the server is under excess load, which is likely to
+be unrelated to the behavior of the specific client being rejected.
+
+For more information on request rejection, refer to the [Adaptive Overload
+Protection Overview](/vault/docs/concepts/adaptive-overload-protection).
+
+## API Package
+
+For clients written in Go that use Vault's API package, retries are handled by
+default with no further work needed.
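
For clients that do not use the Go API package, the retry guidance in this page (increasing waits, jitter, and a bounded number of retries) might be sketched as follows. This is a minimal illustration, not part of any Vault client library: the `Overloaded` exception, the `retry_with_backoff` helper, and all default values are hypothetical, and `send` stands in for whatever function issues the request and raises `Overloaded` on a `503 - Service Unavailable` response.

```python
import random
import time


class Overloaded(Exception):
    """Hypothetical error raised by the caller's request function on a 503."""


def retry_with_backoff(send, max_retries=5, base_wait=0.1, max_wait=5.0,
                       sleep=time.sleep, rng=random.random):
    """Call send() until it succeeds or max_retries retries have been used."""
    attempt = 0
    while True:
        try:
            return send()
        except Overloaded:
            if attempt >= max_retries:
                # Bound the number of retries so request volume does not
                # keep growing for the duration of an outage.
                raise
            # Wait ceiling doubles with each retry, capped at max_wait.
            ceiling = min(max_wait, base_wait * (2 ** attempt))
            # Jitter: wait a random fraction of the ceiling so that many
            # clients do not all retry at the same moment.
            sleep(ceiling * rng())
            attempt += 1
```

The `sleep` and `rng` parameters exist only so that the wait schedule can be observed or replaced in tests; a real client would normally leave them at their defaults.
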
--- a/website/content/docs/concepts/request-limiter/index.mdx
+++ b/website/content/docs/concepts/request-limiter/index.mdx
@@ -10,7 +10,14 @@ description: >-

@include 'alerts/enterprise-only.mdx'

-@include 'alerts/beta.mdx'
+<Warning title="Beta (Deprecated)">
+
+The request limiter was released in Vault 1.16 as a Beta
+feature. During Beta evaluation we found an alternative approach better met
+the needs of our users. This feature will be removed from Vault in a future
+release. It is replaced with [adaptive overload protection](/vault/docs/concepts/adaptive-overload-protection).
+
+</Warning>

 This document contains conceptual information about the **Request Limiter** and
 its user-facing effects.
@@ -71,4 +78,4 @@ needing to retry.

 When Vault has reached capacity, new requests will be immediately rejected with a
 retryable `503 - Service Unavailable`
-[error](/vault/docs/concepts/request-limiter/vault-server-temporarily-overloaded).
+[error](/vault/docs/concepts/adaptive-overload-protection/vault-server-temporarily-overloaded).
--- a/website/content/docs/concepts/request-limiter/vault-server-temporarily-overloaded.mdx
+++ b/website/content/docs/concepts/request-limiter/vault-server-temporarily-overloaded.mdx
@@ -1,33 +0,0 @@
---
-layout: docs
-page_title: Vault server temporarily overloaded
-description: |-
-  Vault Enterprise error when the request limiter is at capacity.
---
-
-# Vault server temporarily overloaded
-
-Vault returns a `503 - Service Unavailable` response to indicate that a request
-was rejected after Vault has reached its in-flight request capacity:
-
-```
-Error making API request.
-
-URL: PUT https://127.0.0.1:61555/v1/auth/userpass/login/foo
-Code: 503. Errors:
-
-* 1 error occurred:
-	* Vault server temporarily overloaded
-```
-
-`503 - Service Unavailable` is a retryable HTTP error, which is handled by the
-Vault API `Client` implementation.
-
-~> **NOTE**:  `429 - Too Many Requests` is typically used to indicate that a
-specific client is issuing too many requests. The choice of `503 - Service
-Unavailable` for request rejection emphasizes that that the server is
-temporarily under excess load, which may not be related to the behavior of a
-specific client.
-
-For more information on request rejection, refer to the [Request
-Limiter](/vault/docs/concepts/request-limiter) documentation.
--- a/website/content/docs/configuration/adaptive-overload-protection.mdx
+++ b/website/content/docs/configuration/adaptive-overload-protection.mdx
@@ -0,0 +1,46 @@
+---
+layout: docs
+page_title: Adaptive overload protection - Configuration
+description: |-
+  Use adaptive overload protection with Vault Enterprise to automatically
+  prevent workloads from overloading different resources of your Vault servers.
+---
+
+# `adaptive_overload_protection`
+
+@include 'alerts/enterprise-only.mdx'
+
+@include 'alerts/beta.mdx'
+
+Configure the `adaptive_overload_protection` stanza to control overload
+protection features for your Vault server.
+
+@include 'config-reload-supported.mdx'
+
+<Warning title="Do not disable during overload">
+
+Do not disable the adaptive overload protection features during an overload.
+This feature is designed to protect your Vault server from overload conditions.
+Disabling it can lead to poor availability.
+
+</Warning>
+
+For more information read [Adaptive Overload
+Protection](/vault/docs/concepts/adaptive-overload-protection).
+
+
+```hcl
+adaptive_overload_protection {
+  disable_write_controller = false
+}
+```
+
+## `adaptive_overload_protection` parameters
+
+These parameters apply to the `adaptive_overload_protection` stanza in the Vault
+configuration file:
+
+- `disable_write_controller` `(bool: <optional>)`: Disables the adaptive write
+  overload controller. Defaults to `true` (controller disabled). Set
+  `disable_write_controller` to `false` to enable the write controller and opt
+  in to the beta functionality.
--- a/website/content/docs/configuration/request-limiter.mdx
+++ b/website/content/docs/configuration/request-limiter.mdx
@@ -10,7 +10,14 @@ description: |-

@include 'alerts/enterprise-only.mdx'

-@include 'alerts/beta.mdx'
+<Warning title="Deprecated beta feature">
+
+Vault 1.16 included the request limiter as a Beta feature. During the beta, we
+found an alternative approach that better meets user needs.  The request limiter
+has been deprecated in favor of [adaptive overload
+protection](/vault/docs/concepts/adaptive-overload-protection).
+
+</Warning>

 The `request_limiter` stanza allows operators to turn on the adaptive
 concurrency limiter, which is off by default. This is a reloadable config.
--- a/website/content/docs/internals/telemetry/metrics/all.mdx
+++ b/website/content/docs/internals/telemetry/metrics/all.mdx
@@ -768,6 +768,14 @@ alphabetic order by name.

@include 'telemetry-metrics/vault/wal/persistwals.mdx'

+@include 'telemetry-metrics/vault/wal/write_controller/d.mdx'
+
+@include 'telemetry-metrics/vault/wal/write_controller/i.mdx'
+
+@include 'telemetry-metrics/vault/wal/write_controller/p.mdx'
+
+@include 'telemetry-metrics/vault/wal/write_controller/reject_fraction.mdx'
+
@include 'telemetry-metrics/vault/zookeeper/delete.mdx'

@include 'telemetry-metrics/vault/zookeeper/get.mdx'
--- a/website/content/docs/internals/telemetry/metrics/availability.mdx
+++ b/website/content/docs/internals/telemetry/metrics/availability.mdx
@@ -49,6 +49,14 @@ your Vault instance. Enterprise installations also include

@include 'telemetry-metrics/vault/wal/persistwals.mdx'

+@include 'telemetry-metrics/vault/wal/write_controller/d.mdx'
+
+@include 'telemetry-metrics/vault/wal/write_controller/i.mdx'
+
+@include 'telemetry-metrics/vault/wal/write_controller/p.mdx'
+
+@include 'telemetry-metrics/vault/wal/write_controller/reject_fraction.mdx'
+
 ## Log shipping metrics

@include 'telemetry-metrics/vault/logshipper/buffer/length.mdx'
--- a/website/content/docs/upgrading/upgrade-to-1.17.x.mdx
+++ b/website/content/docs/upgrading/upgrade-to-1.17.x.mdx
@@ -50,6 +50,18 @@ to control truncation the behavior. Setting the issuer `leaf_not_after_behavior`
 field to `permit` and `enforce_leaf_not_after_behavior` to true restores the
 legacy behavior.

+### Request limiter deprecation
+
+Vault 1.16.0 included an experimental request limiter. The limiter was disabled
+by default. Further testing indicated that an alternative approach improves
+performance and reduces risk for many workloads. Vault 1.17.0 includes a
+new [adaptive overload
+protection](/vault/docs/concepts/adaptive-overload-protection) feature that
+prevents outages when Vault is overwhelmed by write requests. Adaptive overload
+protection is a beta feature in 1.17.0 and is disabled by default.
+
+The beta request limiter will be removed from Vault entirely in a later release.
+
 ## Known issues and workarounds

@include 'known-issues/ocsp-redirect.mdx'
--- a/website/content/partials/config-reload-supported.mdx
+++ b/website/content/partials/config-reload-supported.mdx
@@ -0,0 +1,5 @@
+<Note title="Configuration reload supported">
+
+  Restart or reload your Vault server for configuration updates to take effect.
+
+</Note>
--- a/website/content/partials/telemetry-metrics/request-limiter-intro.mdx
+++ b/website/content/partials/telemetry-metrics/request-limiter-intro.mdx
@@ -1,2 +1,3 @@
-Request Limiter metrics relate to request success signals observed by the 
-request limiter and its current state.
+Request Limiter metrics relate to request success signals observed by the
+request limiter and its current state. Note the [request limiter is deprecated](/vault/docs/upgrading/upgrade-to-1.17.x#request-limiter-deprecation)
+and will be removed in future Vault versions.
--- a/website/content/partials/telemetry-metrics/vault/wal/write_controller/d.mdx
+++ b/website/content/partials/telemetry-metrics/vault/wal/write_controller/d.mdx
@@ -0,0 +1,9 @@
+### vault.wal.write_controller.d ((#vault-wal-write_controller-d))
+
+Metric type | Value   | Description
+----------- | ------- | -----------
+gauge       | number  | Current derivative value computed by the write controller.
+
+The `vault.wal.write_controller.d` metric has limited production use, but Vault
+developers may find `vault.wal.write_controller.d` useful for tuning or
+debugging controller behavior.
--- a/website/content/partials/telemetry-metrics/vault/wal/write_controller/i.mdx
+++ b/website/content/partials/telemetry-metrics/vault/wal/write_controller/i.mdx
@@ -0,0 +1,10 @@
+### vault.wal.write_controller.i ((#vault-wal-write_controller-i))
+
+Metric type | Value   | Description
+----------- | ------- | -----------
+gauge       | number  | Current integral value computed by the write controller.
+
+
+The `vault.wal.write_controller.i` metric has limited production use, but Vault
+developers may find `vault.wal.write_controller.i` useful for tuning or
+debugging controller behavior.
--- a/website/content/partials/telemetry-metrics/vault/wal/write_controller/p.mdx
+++ b/website/content/partials/telemetry-metrics/vault/wal/write_controller/p.mdx
@@ -0,0 +1,9 @@
+### vault.wal.write_controller.p ((#vault-wal-write_controller-p))
+
+Metric type | Value   | Description
+----------- | ------- | -----------
+gauge       | number  | Current proportional error value detected by the write controller.
+
+The `vault.wal.write_controller.p` metric has limited production use, but Vault
+developers may find `vault.wal.write_controller.p` useful for tuning or
+debugging controller behavior.
--- a/website/content/partials/telemetry-metrics/vault/wal/write_controller/reject_fraction.mdx
+++ b/website/content/partials/telemetry-metrics/vault/wal/write_controller/reject_fraction.mdx
@@ -0,0 +1,8 @@
+### vault.wal.write_controller.reject_fraction ((#vault-wal-write_controller-reject_fraction))
+
+Metric type | Value   | Description
+----------- | ------- | -----------
+gauge       | number  | The estimated fraction of write requests that must be rejected to maintain cluster stability.
+
+The [write controller](/vault/docs/concepts/adaptive-overload-protection) reject
+fraction is an estimate between 0 and 1.
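
An operator watching `vault.wal.write_controller.reject_fraction` might want to distinguish a brief spike from sustained write overload. The sketch below shows one possible check over a series of gauge samples. The `sustained_rejection` helper, the `threshold` of 0.1, and the `min_consecutive` count of 3 are illustrative assumptions, not values recommended by Vault, and how the samples are collected is left to the monitoring system in use.

```python
def sustained_rejection(samples, threshold=0.1, min_consecutive=3):
    """Return True if min_consecutive samples in a row exceed threshold.

    samples is an iterable of reject_fraction gauge readings, each of which
    is expected to lie between 0 and 1.
    """
    run = 0
    for value in samples:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"reject_fraction out of range: {value}")
        # Extend the current run of high readings, or reset it to zero.
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False
```

Requiring several consecutive high readings is one simple way to avoid alerting on a single noisy sample; other smoothing approaches would work equally well.
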
--- a/website/data/docs-nav-data.json
+++ b/website/data/docs-nav-data.json
@@ -308,7 +308,7 @@
      {
        "title": "Request Limiter",
        "badge": {
-          "text": "ENTERPRISE",
+          "text": "ENTERPRISE | DEPRECATED",
          "type": "outlined",
          "color": "neutral"
        },
@@ -321,10 +321,29 @@
              "type": "outlined",
              "color": "highlight"
            }
+          }
+        ]
+      },
+      {
+        "title": "Adaptive overload protection",
+        "badge": {
+          "text": "ENTERPRISE | BETA",
+          "type": "outlined",
+          "color": "neutral"
+        },
+        "routes": [
+          {
+            "title": "Overview",
+            "path": "concepts/adaptive-overload-protection",
+            "badge": {
+              "text": "BETA",
+              "type": "outlined",
+              "color": "highlight"
+            }
          },
          {
            "title": "Vault server temporarily overloaded",
-            "path": "concepts/request-limiter/vault-server-temporarily-overloaded"
+            "path": "concepts/adaptive-overload-protection/vault-server-temporarily-overloaded"
          }
        ]
      }
@@ -544,6 +563,10 @@
        "title": "<code>Request Limiter</code>",
        "path": "configuration/request-limiter"
      },
+      {
+        "title": "Adaptive overload protection",
+        "path": "configuration/adaptive-overload-protection"
+      },
      {
        "title": "<code>ui</code>",
        "path": "configuration/ui"
--- a/website/public/img/adaptive-overload-protection-writes.png
+++ b/website/public/img/adaptive-overload-protection-writes.png