The actions/upload-artifact action does not support filenames with
special characters as it needs to maintain restore compatibility with
NTFS filesystems. Instead of uploading raw log files, which can inherit
names with special characters and break the upload, we tar them all
together to preserve their names and upload the resulting tarball.
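For illustration, the step boils down to something like the following, where the log directory and tarball name are placeholders rather than the actual paths used in the workflow:

```
# Collect every log file into one tarball so filenames with special
# characters never reach actions/upload-artifact directly.
# "test-results" and "test-logs.tar.gz" are illustrative names.
tar -czf test-logs.tar.gz -C test-results .
# The workflow then uploads test-logs.tar.gz instead of the raw logs.
```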
Signed-off-by: Ryan Cragun <me@ryan.ec>
We're on a quest to reduce our pipeline execution time, both to enhance
developer productivity and to reduce the overall cost of the CI
pipeline. The strategy we use here reduces workflow execution time and
network I/O cost by reducing our module cache size and using binary
external tools when possible. We no longer download modules and build
many of the external tools thousands of times a day.
Our previous process of installing internal and external developer tools
was scattered and inconsistent. Some tools were installed via `go
generate -tags tools ./tools/...`,
others via various `make` targets, and some only in GitHub Actions
workflows. This process led to some undesirable side effects:
* The modules of some dev and test tools were included with those
of the Vault project, which meant managing the modules of external
tools alongside our own. Prior to Go 1.16 this was the recommended
way to handle external tools, but now `go install tool@version` is
the recommended way to install external tools that need to be built
from source, as it supports specific versions but does not modify
the go.mod (see the sketch after this list).
* Due to GitHub cache constraints we combine our build and test Go
module caches together, but having our developer tools as deps in
our module results in a larger cache which is downloaded on every
build and test workflow runner. Removing the external tools that were
included in our go.mod reduced the expanded module cache size
by ~300MB, thus saving time and network I/O costs when downloading
the module cache.
* Not all of our developer tools were included in our modules. Some were
being installed with `go install` or `go run`, so they didn't take
advantage of a single module cache. This resulted in us downloading
Go modules on every CI and Build runner in order to build our
external tools.
* Building our developer tools from source in CI is slow. Where possible
we now prefer pre-built binaries in CI workflows. No more
module downloads or tool compiles if we can avoid them.
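As a concrete sketch of the `go install tool@version` approach referenced above, with an illustrative tool and version rather than one of our actual dependencies:

```
# Installs a pinned version of an external tool into GOBIN without
# touching the project's go.mod (supported since Go 1.16).
go install mvdan.cc/gofumpt@v0.5.0
```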
I've refactored how we define internal and external build tools
in our Makefile and added several new targets to handle both building
the developer tools locally for development and verifying that they are
available. This allows for an easy developer bootstrap while also
supporting installation of many of the external developer tools from
pre-built binaries in CI. This reduces our network I/O and run time
across nearly all of our actions runners.
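The CI-facing half of that bootstrap roughly follows the pattern below; the tool name, version, and download URL are hypothetical and only illustrate the "pre-built binary first, build from source as a fallback" shape:

```
#!/usr/bin/env bash
set -euo pipefail

# Prefer a pre-built release binary when running in CI; fall back to
# building from source locally. Tool, version, and URL are hypothetical.
if [ -n "${CI:-}" ]; then
  curl -sSfL \
    "https://github.com/example/linter/releases/download/v1.2.3/linter-linux-amd64" \
    -o "$(go env GOPATH)/bin/linter"
  chmod +x "$(go env GOPATH)/bin/linter"
else
  go install github.com/example/linter/cmd/linter@v1.2.3
fi
```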
While working on this I caught and resolved a few unrelated issues:
* Both our Go and Proto format checks were being run incorrectly. In
CI they were writing changes but not failing if changes were
detected. The Go check was less of a problem as we have git hooks that
are intended to enforce formatting, however we had drifted over time.
* Our Git hooks couldn't handle removing a Go file without failing. I
moved the diff check into the new Go helper and updated it to handle
removed files (see the sketch after this list).
* I combined a few separate scripts into helpers and added a few
new capabilities.
* I refactored how we install Go modules to make it easier to download
and tidy all of the project's go.mod files.
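A minimal sketch of a diff check that tolerates removed files, in the spirit of the git hook fix above; the formatter shown stands in for whatever scripts/go-helper.sh actually runs:

```
# List staged Go files, excluding deletions (--diff-filter=d), so the
# formatter is never asked to inspect a file that no longer exists.
changed_go_files=$(git diff --cached --name-only --diff-filter=d -- '*.go')
if [ -n "$changed_go_files" ]; then
  # gofumpt -l prints files that are not properly formatted.
  unformatted=$(echo "$changed_go_files" | xargs gofumpt -l)
  if [ -n "$unformatted" ]; then
    echo "the following files need formatting:" >&2
    echo "$unformatted" >&2
    exit 1
  fi
fi
```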
* Refactor our internal and external tool installation and verification
into a tools.sh helper (see the sketch after this list).
* Combined more complex Go verification into `scripts/go-helper.sh` and
utilize it in the `Makefile` and git commit hooks.
* Add `Makefile` targets for executing our various tools.sh helpers.
* Update our existing `make` targets to use new tool targets.
* Normalize the output of our various scripts and targets to a
consistent format.
* In CI, install many of our external dependencies as binaries wherever
possible. When not possible we'll build them from scratch but not mess
with the shared module cache.
* [QT-641] Remove our external build tools from our project Go modules.
* [QT-641] Remove extraneous `go list`'s from our `set-up-to` composite
action.
* Fix formatting and regen our protos
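To make the shape of the tools.sh change concrete, here's a rough sketch of such a helper; the subcommands and the tools listed are illustrative, not the actual contents of scripts/tools.sh:

```
#!/usr/bin/env bash
set -euo pipefail

# Illustrative external tools pinned to specific versions.
external_tools=(
  "mvdan.cc/gofumpt@v0.5.0"
  "google.golang.org/protobuf/cmd/protoc-gen-go@v1.31.0"
)

install_external() {
  for tool in "${external_tools[@]}"; do
    go install "$tool"
  done
}

check_external() {
  for tool in "${external_tools[@]}"; do
    # Strip the version and module path to get the binary name.
    bin=$(basename "${tool%@*}")
    command -v "$bin" >/dev/null || { echo "missing tool: $bin" >&2; exit 1; }
  done
}

"$@"
```

The Makefile targets can then dispatch to functions like these, with local bootstrap installing from source and CI preferring pre-built binaries or only verifying availability.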
Signed-off-by: Ryan Cragun <me@ryan.ec>
* Pulls in github.com/go-secure-stdlib/plugincontainer@v0.3.0 which exposes a new `Config.Rootless` option to opt in to extra container configuration options that allow establishing communication with a non-root plugin within a rootless container runtime.
* Adds a new "rootless" option for plugin runtimes, so Vault needs to be explicitly told whether the container runtime on the machine is rootless or not. It defaults to false as rootless installs are not the default.
* Updates `run_config.go` to use the new option when the plugin runtime is rootless.
* Adds a new `-rootless` flag to `vault plugin runtime register`, and a `rootless` API option to the register API (see the example after this list).
* Adds rootless Docker installation to CI to support tests for the new functionality.
* Minor test refactor to minimise the number of test Vault cores that need to be made for the external plugin container tests.
* Documentation for the new rootless configuration and the new (reduced) set of restrictions for plugin containers.
* As well as adding rootless support, we've decided to drop explicit support for podman for now, but there's no barrier other than support burden to adding it back in the future, so it will depend on demand.
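A hedged example of the new registration flow; aside from `-rootless`, the runtime name and flag values below are illustrative:

```
# Register a container plugin runtime on a machine whose container
# runtime is installed rootless. "runsc" and the -oci_runtime value
# are illustrative; -rootless is the new flag described above.
vault plugin runtime register \
    -type=container \
    -oci_runtime=runsc \
    -rootless=true \
    runsc
```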
Add support for testing Vault Enterprise with HA seal support by adding
a new `seal_ha` scenario that configures more than one seal type for a
Vault cluster. We also extend existing scenarios to support testing
with or without the Seal HA code path enabled.
* Extract starting vault into a separate enos module to allow for better
handling of complex clusters that need to be started more than once.
* Extract seal key creation into a separate module and provide it to
target modules. This allows us to create more than one seal key and
associate it with instances. This also allows us to forego creating
keys when using shamir seals.
* [QT-615] Add support for configuring more than one seal type to the
`vault_cluster` module.
* [QT-616] Add `seal_ha` scenario
* [QT-625] Add `seal_ha_beta` variant to existing scenarios to test with
both code paths.
* Unpin action-setup-terraform
* Add `kms:TagResource` to service user IAM profile
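For reference, the new scenario and variant can be exercised locally with filters along these lines; the variant names and values shown are illustrative and depend on the scenario matrix:

```
# Launch the new dedicated seal HA scenario.
enos scenario launch seal_ha backend:raft

# Run an existing scenario with the new seal_ha_beta variant enabled
# to exercise the Seal HA code path.
enos scenario launch smoke seal_ha_beta:true backend:raft
```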
Signed-off-by: Ryan Cragun <me@ryan.ec>
* VAULT-20487 update build failure slack output
* VAULT-20487 add new needs
* VAULT-20487 make it run on my branch
* VAULT-20487 make it run
* VAULT-20487 finalize?
Fix missing log files: we need to use an absolute path, since go test chdirs into the test package dir before running tests. Move the cleanup-on-success behaviour from NewTestCluster into NewTestLogger so it applies more broadly.
* Stop running fips tests on PRs: we expect fips-specific failures to be rare enough that it's not worth the cost.
* Allow PRs with the label "fips" to run fips tests.
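One way such a label gate could be scripted is sketched below with the gh CLI; the real workflow presumably reads the label from its own event context rather than shelling out like this, and PR_NUMBER is assumed to come from the environment:

```
# Succeeds only when the pull request carries the "fips" label.
if gh pr view "$PR_NUMBER" --json labels --jq '.labels[].name' | grep -qx 'fips'; then
  echo "fips label present: running fips tests"
else
  echo "skipping fips tests"
fi
```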
Terraform 1.6.x seems to have some incompatibility with the current
version of enos and its usage of tfjson. Pin to 1.5.x until it has been
resolved.
```
│ Error: json: cannot unmarshal array into Go struct field rawState.checks of type tfjson.CheckResultStatic
│
```
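For local reproduction, one way to stay on the 1.5.x series is via tfenv; this is just a sketch and not necessarily how the pipeline pins the version:

```
# Pin Terraform locally to the 1.5.x series; 1.5.7 is illustrative.
tfenv install 1.5.7
tfenv use 1.5.7
terraform version
```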
Signed-off-by: Ryan Cragun <me@ryan.ec>
Sometimes destroying resources in AWS will fail because of unexpected
dependency violations or other such nonsense. When this happens the
Vault behavior we wanted to verify has already been verified
successfully, yet the required workflow still fails. This change
allows us to succeed if `enos scenario launch` completes but allows
`enos scenario destroy` to fail. We still notify our Slack channel on
destroy failures so that we can investigate issues, but it won't
require a PR author to retry.
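Roughly, the job now has this shape, with the notification reduced to an echo; the actual wiring lives in the GitHub Actions workflow:

```
# Launch must succeed for the job to pass; destroy is allowed to fail,
# but failures are still surfaced so they can be investigated.
enos scenario launch "$SCENARIO" || exit 1

if ! enos scenario destroy "$SCENARIO"; then
  echo "destroy failed for $SCENARIO; notifying Slack instead of failing"
fi
```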
* Execute `enos scenario launch` instead of `enos scenario run` to allow
for very occasional issues when tearing down test infrastructure.
* Improve an error message when getting secondary cluster IP addresses.
* Don't race to get secondary cluster IP addresses.
* Add secondary token to replication scenario outputs.
Signed-off-by: Ryan Cragun <me@ryan.ec>
* Remove old initial versions from the upgrade scenario as they're
unreliable.
* Ensure that shellcheck is available on runners for linting job.
Signed-off-by: Ryan Cragun <me@ryan.ec>
Update our `proxy` and `agent` scenarios to support new variants and
perform baseline verification in addition to their scenario-specific
verification. We integrate these updated scenarios into the pipeline by
adding them to artifact samples.
We've also improved the reliability of the `autopilot` and `replication`
scenarios by refactoring our IP address gathering. Previously, we'd ask
vault for the primary IP address and use some Terraform logic to determine
followers. The leader IP address gathering script was also implicitly
responsible for ensuring that a found leader was within a given group of
hosts, and thus waiting for a given cluster to have a leader, and also for
doing some arithmetic and outputting `replication` specific output data.
We've broken these responsibilities into individual modules, improved their
error messages, and fixed various races and bugs, including:
* Fix a race between creating the file audit device and installing and starting
vault in the `replication` scenario.
* Fix how we determine our leader and follower IP addresses. We now query
vault directly instead of inferring the followers, an approach that
sometimes did not allow all nodes to be an expected leader (see the
sketch after this list).
* Fix a bug where we'd always fail on the first wrong condition
in the `vault_verify_performance_replication` module.
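A minimal sketch of the query-based approach, assuming a raft backend and jq on the host doing the lookup; the real modules add retries and handle other backends:

```
# The leader advertises itself on the unauthenticated sys/leader endpoint.
leader_addr=$(curl -s "$VAULT_ADDR/v1/sys/leader" | jq -r '.leader_address')

# For raft-backed clusters, followers can be read from the raft
# configuration instead of being inferred with Terraform logic.
vault operator raft list-peers -format=json \
  | jq -r '.data.config.servers[].address'
```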
We also performed some maintenance tasks on Enos scenarios by updating our
references from `oss` to `ce` to handle the naming and license changes. We
also enabled `shellcheck` linting for enos module scripts.
* Rename `oss` to `ce` for license and naming changes.
* Convert template enos scripts to scripts that take environment
variables.
* Add `shellcheck` linting for enos module scripts.
* Add additional `backend` and `seal` support to `proxy` and `agent`
scenarios.
* Update scenarios to include all baseline verification.
* Add `proxy` and `agent` scenarios to artifact samples.
* Remove IP address verification from the `vault_get_cluster_ips`
modules and implement a new `vault_wait_for_leader` module.
* Determine follower IP addresses by querying vault in the
`vault_get_cluster_ips` module.
* Move replication specific behavior out of the `vault_get_cluster_ips`
module and into its own `replication_data` module.
* Extend initial version support for the `upgrade` and `autopilot`
scenarios.
We also discovered an issue with undo_logs that has been described in
VAULT-20259. As such, we've disabled the undo_logs check until
it has been fixed.
Signed-off-by: Ryan Cragun <me@ryan.ec>
This adds edition handling to the test-run-enos-scenario-matrix
workflow. Previously we'd pass the version and edition from the caller,
but that isn't an option in the release testing workflow, which only
passes the metadata version without the edition.
Signed-off-by: Ryan Cragun <me@ryan.ec>