diff --git a/website/content/docs/secrets/pki/considerations.mdx b/website/content/docs/secrets/pki/considerations.mdx
index cdb7241527..f5d1e1b2ae 100644
--- a/website/content/docs/secrets/pki/considerations.mdx
+++ b/website/content/docs/secrets/pki/considerations.mdx
@@ -17,10 +17,19 @@ generating the CA to use with this secrets engine.
     - [Managed Keys](#managed-keys)
   - [One CA Certificate, One Secrets Engine](#one-ca-certificate-one-secrets-engine)
   - [Always Configure a Default Issuer](#always-configure-a-default-issuer)
-    - [Key Types Matter](#key-types-matter)
+  - [Key Types Matter](#key-types-matter)
+    - [Cluster Performance and Key Types](#cluster-performance-and-key-types)
   - [Use a CA Hierarchy](#use-a-ca-hierarchy)
     - [Cross-Signed Intermediates](#cross-signed-intermediates)
-  - [Keep certificate lifetimes short, for CRL's sake](#keep-certificate-lifetimes-short-for-crls-sake)
+  - [Cluster URLs are Important](#cluster-urls-are-important)
+  - [Automate Rotation with ACME](#automate-rotation-with-acme)
+    - [ACME Stores Certificates](#acme-stores-certificates)
+    - [ACME Role Restrictions Require EAB](#acme-role-restrictions-require-eab)
+    - [ACME and the Public Internet](#acme-and-the-public-internet)
+    - [ACME Errors are in Server Logs](#acme-errors-are-in-server-logs)
+    - [ACME Security Considerations](#acme-security-considerations)
+    - [ACME and Client Counting](#acme-and-client-counting)
+  - [Keep Certificate Lifetimes Short, For CRL's Sake](#keep-certificate-lifetimes-short-for-crls-sake)
     - [NotAfter Behavior on Leaf Certificates](#notafter-behavior-on-leaf-certificates)
     - [Cluster Performance and Quantity of Leaf Certificates](#cluster-performance-and-quantity-of-leaf-certificates)
   - [You must configure issuing/CRL/OCSP information _in advance_](#you-must-configure-issuingcrlocsp-information-_in-advance_)
@@ -120,7 +129,7 @@ issuer's CRL.
 This means maintaining a default issuer is important for both backwards
 compatibility for issuing certificates and for ensuring revoked certificates land on a CRL.
 
-### Key Types Matter
+## Key Types Matter
 
 Certain key types have impacts on performance. Signing certificates from a RSA
 key will be slower than issuing from an ECDSA or Ed25519 key. Key generation
@@ -135,6 +144,60 @@ also be more expensive.
 Careful consideration of both issuer and issued key types can have meaningful
 impacts on performance of not only Vault, but systems using these certificates.
 
+### Cluster Performance and Key Types
+
+The [vault-benchmark](https://github.com/hashicorp/vault-benchmark) project
+can be used to measure the performance of a Vault PKI instance. In general,
+keep the following considerations in mind:
+
+ - RSA key generation is much slower and far more variable than EC key
+   generation. If performance and throughput are a necessity, consider using
+   EC keys (including NIST P-curves and Ed25519) instead of RSA.
+
+ - Signing requests (via `/pki/sign`) will be faster than issuance requests
+   (via `/pki/issue`), especially for RSA keys: Vault does not need to generate
+   key material and instead issues a certificate for the client's key. The
+   signing step is common to both endpoints, so key generation is pure
+   overhead if the client has a sufficiently secure source of entropy.
+
+ - The CA's key type matters as well: an RSA CA produces RSA signatures, which
+   take longer to create than signatures from an ECDSA or Ed25519 CA.
+
+ - Storage is an important factor: with [BYOC Revocation](/vault/api-docs/secret/pki#revoke-certificate),
+   certificates issued with `no_store=true` can still be revoked, and audit
+   logs can be used to track issuance. Clusters using remote storage (like
+   Consul) over a slow network with `no_store=false` will see additional
+   latency on issuance. Adding leases for every issued certificate compounds
+   the problem.
+
+ - Storing many certificates increases the time taken by `LIST /pki/certs`
+   and by tidy operations. As such, for large-scale deployments (>= 250k
+   active certificates), we recommend using audit logs to track certificates
+   outside of Vault.
+
+As a general comparison, running `vault-benchmark` for `30s` against a local,
+single-node, Raft-backed Vault instance on unspecified hardware:
+
+ - Vault can issue 300k certificates using EC P-256 for both CA and leaf keys,
+   without storage.
+
+ - Storing these leaves drops that number to 65k, and to only 20k with
+   leases.
+
+ - With large, expensive 4096-bit RSA keys, Vault can issue only 160 leaves,
+   regardless of whether storage or leases are used. The 95th percentile key
+   generation time is above 10s.
+
+ - In comparison, with P-521 keys, Vault can issue close to 30k leaves
+   without leases and 18k with leases.
+
+These numbers are illustrative only, showing the impact different key types
+can have on PKI cluster performance.
+
+Using ACME adds further latency to these numbers, both because
+certificates must be stored and because challenge validation must
+be performed.
+
 ## Use a CA Hierarchy
 
 It is generally recommended to use a hierarchical CA setup, with a root
@@ -176,7 +239,160 @@ can be constructed in the following order:
 All requests to this issuer for signing will now present the full
 cross-signed chain.
 
-## Keep certificate lifetimes short, for CRL's sake
+## Cluster URLs are Important
+
+In Vault 1.13, support for [templated AIA
+URLs](/vault/api-docs/secret/pki#enable_aia_url_templating-1)
+was added. With the [per-cluster URL
+configuration](/vault/api-docs/secret/pki#set-cluster-configuration) set on
+each Performance Replication cluster, AIA information automatically points to
+the cluster that issued a given certificate.
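To illustrate how templated AIA URLs resolve per cluster, the following Python sketch expands the `{{cluster_path}}`, `{{cluster_aia_path}}`, and `{{issuer_id}}` placeholders described in the templated AIA URL API docs. The hostnames, issuer ID, URL templates, and the `render_aia_urls` helper are hypothetical; the sketch only models the substitution and is not Vault's implementation. A missing cluster value raises an error instead of producing an unusable URL.

```python
import re

# Hypothetical values standing in for one Performance Replication cluster's
# per-cluster configuration (its `path` and `aia_path` settings).
CLUSTER_CONFIG = {
    "path": "https://pr-east.vault.example.com:8200/v1/pki",
    "aia_path": "http://pr-east.vault.example.com:8200/v1/pki",
}

# Hypothetical templated AIA URLs, as they might be stored in the mount's
# URL configuration once AIA URL templating is enabled.
URL_TEMPLATES = {
    "issuing_certificates": "{{cluster_aia_path}}/issuer/{{issuer_id}}/der",
    "crl_distribution_points": "{{cluster_aia_path}}/issuer/{{issuer_id}}/crl/der",
    "ocsp_servers": "{{cluster_path}}/ocsp",
}

# Matches a {{placeholder}} and captures its name.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render_aia_urls(templates, cluster, issuer_id):
    """Expand each AIA URL template for one cluster and one issuer.

    Raises ValueError if a template references a value that is not
    configured, rather than emitting an unusable URL.
    """
    values = {
        "cluster_path": cluster.get("path"),
        "cluster_aia_path": cluster.get("aia_path"),
        "issuer_id": issuer_id,
    }

    def substitute(match):
        name = match.group(1)
        value = values.get(name)
        if not value:
            raise ValueError("placeholder %r has no configured value" % name)
        return value

    return {field: PLACEHOLDER.sub(substitute, template)
            for field, template in templates.items()}


if __name__ == "__main__":
    urls = render_aia_urls(URL_TEMPLATES, CLUSTER_CONFIG, "example-issuer-id")
    for field, url in urls.items():
        print(field, url)
```

Evaluating the same templates against another cluster's configuration yields URLs pointing at that cluster, which is why each Performance Replication cluster needs its own per-cluster configuration.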
+
+In Vault 1.14, with ACME support, the same configuration is also used to let
+ACME clients discover the URL of this cluster.
+
+~> **Warning**: It is important to ensure that this configuration is
+   up to date and maintained correctly, always pointing to the node's
+   PR cluster address (which may be a load-balanced or DNS round-robin
+   address). If this configuration is not set on every Performance Replication
+   cluster, certificate issuance (via REST and/or via ACME) will fail.
+
+## Automate Rotation with ACME
+
+In Vault 1.14, support for the [Automatic Certificate Management Environment
+(ACME)](https://datatracker.ietf.org/doc/html/rfc8555) protocol has been
+added to the PKI Engine. ACME is a standardized way to handle validation,
+issuance, rotation, and revocation of server certificates.
+
+Many ecosystems, from web servers like Caddy, Nginx, and Apache, to
+orchestration environments like Kubernetes (via cert-manager), natively
+support issuance via the ACME protocol. For deployments without native
+support, stand-alone tools like certbot can fetch and renew certificates
+on behalf of consumers. Vault's PKI Engine only includes server
+support for ACME; no client functionality is included.
+
+~> Note: Vault's PKI ACME server caps certificate validity at a maximum of
+   90 days, regardless of role and/or global limits. Shorter validity
+   durations can be set by limiting the role's TTL to under 90 days.
+   In line with Let's Encrypt, we do not support the optional `notBefore`
+   and `notAfter` order request parameters.
+
+### ACME Stores Certificates
+
+Because ACME requires stored certificates in order to function, the notes
+[below about automating tidy](#automate-crl-building-and-tidying) are
+especially important for the long-term health of the PKI cluster. ACME also
+introduces additional resource types (accounts, orders, authorizations, and
+challenges) that must be tidied via [the `tidy_acme=true`
+option](/vault/api-docs/secret/pki#tidy). Orders, authorizations, and
+challenges are [cleaned up based on the
+`safety_buffer`](/vault/api-docs/secret/pki#safety_buffer)
+parameter, while how long accounts are kept past their last issued
+certificate is controlled by the [`acme_account_safety_buffer`
+parameter](/vault/api-docs/secret/pki#acme_account_safety_buffer).
+
+As a consequence, and as discussed in the [Cluster
+Scalability](#cluster-scalability) section, because ACME roles have
+`no_store=false` set, ACME can only issue certificates on the active nodes
+of PR clusters; standby nodes, if contacted, will transparently forward
+all requests to the active node.
+
+### ACME Role Restrictions Require EAB
+
+Because ACME by default has no external authorization engine and is
+unauthenticated from a Vault perspective, the use of roles with ACME
+in the default configuration is of limited value: any ACME client
+can request certificates under any role by proving possession of the
+requested certificate identifiers.
+
+To solve this issue, there are two possible approaches:
+
+ 1. Use a restrictive [`allowed_roles`, `allowed_issuers`, and
+    `default_directory_policy` ACME
+    configuration](/vault/api-docs/secret/pki#set-acme-configuration)
+    to let only a single role and issuer be used. This removes client
+    choice, allows global restrictions to be placed on issuance, and
+    avoids requiring ACME clients to have (at initial setup) access to a
+    Vault token or other mechanism for acquiring a Vault EAB ACME token.
+ 2. Use a more permissive [configuration with
+    `eab_policy=always-required`](/vault/api-docs/secret/pki#eab_policy)
+    to allow more roles and let users choose among them, while binding ACME
+    clients to a Vault token that can be suitably ACL'd to particular sets
+    of approved ACME directories.
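To make the two approaches concrete, the following Python sketch writes each one down as a hypothetical body for the ACME configuration endpoint linked above, and checks it against the rule of thumb from this section: without `eab_policy=always-required`, role restrictions only hold if exactly one role and one issuer are allowed and the default directory does not sign verbatim. The role and issuer names are hypothetical. The defaults assumed for omitted fields (`["*"]` for `allowed_roles` and `allowed_issuers`, `sign-verbatim` for `default_directory_policy`, `not-required` for `eab_policy`) and the `role:<name>` and `forbid` policy values reflect our reading of the API reference and should be confirmed there. The `acme_config_warnings` helper is illustrative only and is not part of Vault.

```python
import json

# Approach 1 (hypothetical names): pin ACME to a single role and issuer,
# so no EAB token is needed at initial client setup.
SINGLE_ROLE_CONFIG = {
    "enabled": True,
    "allowed_roles": ["acme-servers"],
    "allowed_issuers": ["acme-intermediate"],
    "default_directory_policy": "role:acme-servers",
    "eab_policy": "not-required",
}

# Approach 2 (hypothetical names): allow several roles, but require every
# ACME account to be bound to a Vault-issued EAB token.
EAB_CONFIG = {
    "enabled": True,
    "allowed_roles": ["web-servers", "internal-apis"],
    "allowed_issuers": ["*"],
    "default_directory_policy": "forbid",
    "eab_policy": "always-required",
}

# Assumed defaults for fields omitted from a configuration body.
ASSUMED_DEFAULTS = {
    "allowed_roles": ["*"],
    "allowed_issuers": ["*"],
    "default_directory_policy": "sign-verbatim",
    "eab_policy": "not-required",
}


def _is_single(values):
    """True if exactly one concrete (non-wildcard) entry is allowed."""
    return len(values) == 1 and values[0] != "*"


def acme_config_warnings(config):
    """Return warnings for configurations where role restrictions are weak."""
    cfg = {**ASSUMED_DEFAULTS, **config}
    if cfg["eab_policy"] == "always-required":
        # Every ACME client is bound to a Vault token that ACLs can scope.
        return []
    warnings = []
    if not _is_single(cfg["allowed_roles"]):
        warnings.append("multiple roles usable without EAB")
    if not _is_single(cfg["allowed_issuers"]):
        warnings.append("multiple issuers usable without EAB")
    if cfg["default_directory_policy"] == "sign-verbatim":
        warnings.append("default directory signs verbatim without EAB")
    return warnings


if __name__ == "__main__":
    for name, config in [("single-role", SINGLE_ROLE_CONFIG), ("eab", EAB_CONFIG)]:
        print(name, json.dumps(config), acme_config_warnings(config))
```

An empty configuration body is flagged three times under these assumed defaults, which mirrors the point above: the default setup offers little role-based restriction on its own.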
+
+The choice of approach depends on the policies of the organization wishing
+to use ACME.
+
+### ACME and the Public Internet
+
+Using ACME is possible over the public internet; public CAs like Let's Encrypt
+offer this as a service. Similarly, organizations running internal PKI
+infrastructure might wish to issue server certificates to pieces of
+infrastructure outside of their internal network boundaries, from a publicly
+accessible Vault instance. By default, without a restrictive `eab_policy`,
+this results in a complicated threat model: _any_ external client that can
+prove possession of a domain can obtain a certificate from this CA, which the
+organization may trust more highly than a public CA.
+
+As such, we strongly recommend that publicly facing Vault instances (such as
+HCP Vault) require PKI mount operators to use a [restrictive
+`eab_policy=always-required` configuration](/vault/api-docs/secret/pki#eab_policy).
+System administrators of Vault instances can enforce this by [setting the
+`VAULT_DISABLE_PUBLIC_ACME=true` environment
+variable](/vault/api-docs/secret/pki#acme-external-account-bindings).
+
+### ACME Errors are in Server Logs
+
+Because the ACME client is not necessarily trusted (account registration
+may not be tied to a valid Vault token when EAB is not used), many error
+details are, out of security necessity, written to the Vault server logs
+rather than returned to the client. When troubleshooting clients that fail
+to obtain certificates, first check the client's logs, if any (for example,
+certbot states its log location on errors), and then correlate them with the
+Vault server logs to identify the reason for the failure.
+
+### ACME Security Considerations
+
+ACME allows any client to make Vault perform certain external calls. While
+the design of ACME attempts to minimize this scope and prohibits issuance
+if incorrect servers are contacted, it cannot account for all possible
+remote server implementations. Vault's ACME server makes three
+types of requests:
+
+ 1. DNS requests for `_acme-challenge.`, which should be the least
+    invasive and safest.
+ 2. TLS ALPN requests for the `acme-tls/1` protocol, which should be
+    safely handled by the TLS stack before any application code is invoked.
+ 3. HTTP requests to `http:///.well-known/acme-challenge/`,
+    which could be problematic depending on server design: if all requests,
+    regardless of path, are treated the same and assumed to be trusted,
+    Vault could be used to make (invalid) requests to that server.
+    Ideally, such server implementations should be updated to ignore
+    ACME validation requests, or access from Vault to the service should
+    be blocked.
+
+In all cases, no information about the response presented by the remote
+server is returned to the ACME client.
+
+When running Vault on multiple networks, note that Vault's ACME server
+places no restrictions on validation paths between the requesting client
+and the destination identifier; a client could use an HTTP challenge to force
+Vault to reach a server on a network the client could not otherwise access.
+
+### ACME and Client Counting
+
+In Vault 1.14, ACME contributes differently to usage metrics than other
+interactions with the PKI Secrets Engine. Because it uses unauthenticated
+requests (which do not generate Vault tokens), it is not counted by
+the traditional [activity log APIs](/vault/api-docs/system/internal-counters#activity-export).
+Instead, certificates issued via ACME are counted by their unique
+certificate identifiers (the combination of CN, DNS SANs, and IP SANs).
+This yields a stable identifier that is consistent across renewals,
+ACME clients, mounts, and namespaces; it currently contributes to the
+activity log as a non-entity token attributed to the first mount that
+created the request.
+
+## Keep Certificate Lifetimes Short, For CRL's Sake
 
 This secrets engine aligns with Vault's philosophy of short-lived secrets.
 As such it is not expected that CRLs will grow large; the only place a private key