* [QT-602] Run `proxy` and `agent` test scenarios (#23176)
Update our `proxy` and `agent` scenarios to support new variants and
perform baseline verification and their scenario specific verification.
We integrate these updated scenarios into the pipeline by adding them
to artifact samples.
We've also improved the reliability of the `autopilot` and `replication`
scenarios by refactoring our IP address gathering. Previously, we'd ask
vault for the primary IP address and use some Terraform logic to determine
followers. The leader IP address gathering script was also implicitly
responsible for ensuring that a found leader was within a given group of
hosts, and thus waiting for a given cluster to have a leader, and also for
doing some arithmetic and outputting `replication` specific output data.
We've broken these responsibilities into individual modules, improved their
error messages, and fixed various races and bugs, including:
* Fix a race between creating the file audit device and installing and starting
vault in the `replication` scenario.
* Fix how we determine our leader and follower IP addresses. We now query
vault instead of a prior implementation that inferred the followers and sometimes
did not allow all nodes to be an expected leader.
* Fix a bug where we'd always fail on the first wrong condition
in the `vault_verify_performance_replication` module.
We also performed some maintenance tasks on Enos scenarios by updating our
references from `oss` to `ce` to handle the naming and license changes. We
also enabled `shellcheck` linting for enos module scripts.
* Rename `oss` to `ce` for license and naming changes.
* Convert template enos scripts to scripts that take environment
variables.
* Add `shellcheck` linting for enos module scripts.
* Add additional `backend` and `seal` support to `proxy` and `agent`
scenarios.
* Update scenarios to include all baseline verification.
* Add `proxy` and `agent` scenarios to artifact samples.
* Remove IP address verification from the `vault_get_cluster_ips`
modules and implement a new `vault_wait_for_leader` module.
* Determine follower IP addresses by querying vault in the
`vault_get_cluster_ips` module.
* Move replication specific behavior out of the `vault_get_cluster_ips`
module and into its own `replication_data` module.
* Extend initial version support for the `upgrade` and `autopilot`
scenarios.
We also discovered an issue with undo_logs that has been described in
VAULT-20259. As such, we've disabled the undo_logs check until
it has been fixed.
* actions: fix actionlint error and linting logic (#23305)
* enos: don't attempt to use the vault proxy command before 1.14
---------
Signed-off-by: Ryan Cragun <me@ryan.ec>
Replace our prior implementation of Enos test groups with the new Enos
sampling feature. With this feature we're able to describe which
scenarios and variant combinations are valid for a given artifact and
allow enos to create a valid sample field (a matrix of all compatible
scenarios) and take an observation (select some to run) for us. This
ensures that every valid scenario and variant combination will
now be a candidate for testing in the pipeline. See QT-504[0] for further
details on the Enos sampling capabilities.
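
As a rough illustration of the sample field and observation idea, here
is a minimal Python sketch. It is not how Enos implements sampling: the
variant names, the exclusion rule, and the `sample_field` and `observe`
helpers are all hypothetical and exist only to show the shape of the
technique.

```python
import itertools
import random

# Hypothetical variant axes; the real scenarios define their own.
VARIANTS = {
    "scenario": ["smoke", "upgrade"],
    "arch": ["amd64", "arm64"],
    "backend": ["raft", "consul"],
}


def sample_field(variants, is_valid):
    """Build the matrix of every variant combination that passes is_valid."""
    keys = list(variants)
    combos = (
        dict(zip(keys, values))
        for values in itertools.product(*variants.values())
    )
    return [combo for combo in combos if is_valid(combo)]


def observe(field, count, seed=None):
    """Select up to `count` distinct elements of the field to run."""
    rng = random.Random(seed)
    return rng.sample(field, min(count, len(field)))


def not_upgrade_on_arm64(combo):
    # Assumed exclusion rule, for illustration only.
    return not (combo["scenario"] == "upgrade" and combo["arch"] == "arm64")


field = sample_field(VARIANTS, not_upgrade_on_arm64)
observation = observe(field, count=3, seed=1)
```

Passing a seed makes an observation reproducible, which helps when
re-running a failed selection.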
Our prior implementation only tested the amd64 and arm64 zip artifacts,
as well as the Docker container. We now include the following new artifacts
in the test matrix:
* CE Amd64 Debian package
* CE Amd64 RPM package
* CE Arm64 Debian package
* CE Arm64 RPM package
Each artifact includes a sample definition for both pre-merge/post-merge
(build) and release testing.
Changes:
* Remove the hand crafted `enos-run-matrices` ci matrix targets and replace
them with per-artifact samples.
* Use enos sampling to generate different sample groups on all pull
requests.
* Update the enos scenario matrices to handle HSM and FIPS packages.
* Simplify enos scenarios by using shared globals instead of
cargo-culted locals.
Note: This will require coordination with vault-enterprise to ensure a
smooth migration to the new system. Integrating new scenarios or
modifying existing scenarios/variants should be much smoother after this
initial migration.
[0] https://github.com/hashicorp/enos/pull/102
Signed-off-by: Ryan Cragun <me@ryan.ec>
* use verify changes for docs to skip tests
* add verify-changes to the needed jobs
* skip go tests for doc/ui only changes
* fix a job ref
* change names, remove script
* remove ui conditions
* separate flags
* feedback
We further optimize the CI workflow for better costs and speed.
We tested the Go CI workflows across several instance classes
and update our compute choices. We achieve an average execution
speed improvement of 2-2.5 minutes per test workflow while
reducing the infrastructure cost by about 20%. We also save
another ~2 minutes by installing `gotestsum` from the Github release
instead of downloading the Go modules and compiling it every time.
In addition to the speed improvements, we also further reduced our cache
usage by updating the `security-scan` workflow to not cache Go modules.
We also use the `cache/save` and `cache/restore` actions for timing
caches. This results in saving half as many cache results for timing
data.
*UI test results*
results for 2x runs:
* c6a.2xlarge (12m54s, 11m55s)
* c6a.4xlarge (10m47s, 11m6s)
* c6a.8xlarge (11m32s, 10m51s)
* m5.2xlarge (15m23s, 14m16s)
* m5.4xlarge (14m48s, 12m54s)
* m5.8xlarge (12m27s, 12m24s)
* m6a.2xlarge (11m55s, 12m20s)
* m6a.4xlarge (10m54s, 10m43s)
* m6a.8xlarge (10m33s, 10m51s)
Current runner:
m5.2xlarge (15m23s, 14m16s, avg 14m50s) @ 0.448/hr = $0.11
Faster candidates
* c6a.2xlarge (12m54s, 11m55s, avg 12m24s) @ 0.3816/hr = $0.078
* m6a.2xlarge (11m55s, 12m20s, avg 12m8s) @ 0.4032/hr = $0.081
* c6a.4xlarge (10m47s, 11m6s, avg 10m56s) @ 0.7632/hr = $0.139
* m6a.4xlarge (10m54s, 10m43s, avg 10m48s) @ 0.8064/hr = $0.140
Best bang for the buck for test-ui:
m6a.2xlarge, > 25% cost savings from current and we save ~2.5 minutes.
*Go test results*
During testing, the external replication tests, when not broken up, will
always take the longest. Our original analysis focuses on this job.
Most other test groups will finish ~3m faster so we'll subtract
that time when estimating the cost for the whole job.
external replication job results:
* c6a.2xlarge (20m49s, 19m20s, avg 20m5s)
* c6a.4xlarge (19m1s, 19m38s, avg 19m20s)
* c6a.8xlarge (19m51s, 18m54s, avg 19m23s)
* m5.2xlarge (22m12s, 20m29s, avg 21m20s)
* m5.4xlarge (20m7s, 19m3s, avg 19m35s)
* m5.8xlarge (20m24s, 19m42s, avg 20m3s)
* m6a.2xlarge (21m10s, 19m37s, avg 20m23s)
* m6a.4xlarge (18m58s, 19m51s, avg 19m24s)
* m6a.8xlarge (19m27s, 18m47s, avg 19m7s)
There is little separation in time when we increase class size. In the
best case a class size increase yields about a 5% performance increase
and doubles the cost. For test-go our best bang for the buck is
certainly going to be in the 2xlarge class.
Current runner:
m5.2xlarge (22m12s, 20m29s, avg 21m20s) @ 0.448/hr (16@avg-3m + 1@avg) = $2.35
Candidates in the same class
* c6a.2xlarge (20m49s, 19m20s, avg 20m5s) @ 0.3816/hr (16@avg-3m + 1@avg) = $1.86
* m6a.2xlarge (21m10s, 19m37s, avg 20m23s) @ 0.4032/hr (16@avg-3m + 1@avg) = $2.00
Best bang for the buck for test-go:
c6a.2xlarge: 20% cost savings and we save ~2.25 minutes.
We ran the Go tests with data race detection on similar instances and saw
execution times similar to test-go. Therefore we can use the same
recommended instance sizes.
After breaking up test-go's external replication tests, the longest group
was shorter on average. I chose to look at group 3 as it was usually the
longest grouping:
* c6a.2xlarge: (14m51s, 14m48s)
* c6a.4xlarge: (14m14s, 14m15s)
* c6a.8xlarge: (14m0s, 13m54s)
* m5.2xlarge: (15m36s, 15m35s)
* m5.4xlarge: (14m46s, 14m49s)
* m5.8xlarge: (14m25s, 14m25s)
* m6a.2xlarge: (14m51s, 14m53s)
* m6a.4xlarge: (14m16s, 14m16s)
* m6a.8xlarge: (14m2s, 13m57s)
Again, we see ~5% performance gains between the 2x and 8x instance classes
at quadruple the cost. The c6a and m6a families are almost identical, with
the c6a class being cheaper.
*Notes*
* UI and Go Test timing results: https://github.com/hashicorp/vault-enterprise/actions/runs/5556957460/jobs/10150759959
* Go Test with data race detection timing results: https://github.com/hashicorp/vault-enterprise/actions/runs/5558013192
* Go Test with replication broken up: https://github.com/hashicorp/vault-enterprise/actions/runs/5558490899
Signed-off-by: Ryan Cragun <me@ryan.ec>
Co-authored-by: Ryan Cragun <me@ryan.ec>
* backport of commit dc104898f7 (#21853)
* fix multiline
* shellcheck, and success message for builds
* add full path
* cat the summary
* fix and faster
* fix if condition
* base64 in a separate step
* echo
* check against empty string
* add echo
* only use matrix ids
* only id
* echo matrix
* remove wrapping array
* tojson
* try echo again
* use jq to get packages
* don't quote
* only run binary tests once
* only run binary tests once
* test what's wrong with the binary
* separate file
* use matrix file
* failed test
* update comment on success
* correct variable name
* base64 fix
* output to file
* use multiline
* fix
* fix formatting
* fix newline
* fix whitespace
* correct body, remove comma
* small fixes
* shellcheck
* another shellcheck fix
* fix deprecation checker
* only run comments for prs
* Update .github/workflows/test-go.yml
Co-authored-by: Mike Palmiotto <mike.palmiotto@hashicorp.com>
* Update .github/workflows/test-go.yml
Co-authored-by: Mike Palmiotto <mike.palmiotto@hashicorp.com>
* fixes
---------
Co-authored-by: Mike Palmiotto <mike.palmiotto@hashicorp.com>
* backport of commit 3b00dde1ba (#21936)
* limit test comments
* remove unnecessary tee
* fix go test condition
* fix
* fail test
* remove always entirely
* fix columns
* make a bunch of tests fail
* separate line
* include Failures:
* remove test fails
* fix whitespace
* backport of commit 245430215c (#21973)
* only add binary tests if they exist
* shellcheck
---------
Co-authored-by: miagilepner <mia.epner@hashicorp.com>
Co-authored-by: Mike Palmiotto <mike.palmiotto@hashicorp.com>
In order to reliably store Go test times in the Github Actions cache we
need to reduce our cache thrashing by not using more than 10gb over all
of our caches. This change reduces our cache usage significantly by
sharing Go module cache between our Go CI workflows and our build
workflows. We lose our per-builder cache which will result in a bit of a
performance hit, but we'll enable better automatic rebalancing of our CI
workflows. Overall we should see a per branch reduction in cache sizes
from ~17gb to ~850mb.
Some preliminary investigation into this new strategy:
Prior build workflow strategy on a cache miss:
Download modules: ~20s
Build Vault: ~40s
Upload cache: ~30s
Total: ~1m30s
Prior build workflow strategy on a cache hit:
Download and decompress modules and build cache: ~12s
Build Vault: ~15s
Total: ~28s
New build workflow strategy on a cache miss:
Download modules: ~20s
Build Vault: ~40s
Upload cache: ~6s
Total: ~1m6s
New build workflow strategy on a cache hit:
Download and decompress modules: ~3s
Build Vault: ~40s
Total: ~43s
Expected time if we used no Go caching:
Download modules: ~20s
Build Vault: ~40s
Total: ~1m
Signed-off-by: Ryan Cragun <me@ryan.ec>
Co-authored-by: Ryan Cragun <me@ryan.ec>
Improve our build workflow execution time by using custom runners,
improved caching and conditional Web UI builds.
Runners
-------
We improve our build times[0] by using larger custom runners[1] when
building the UI and Vault.
Caching
-------
We improve Vault caching by keeping a cache for each build job. This
strategy has the following properties which should result in faster
build times when `go.sum` hasn't been changed from prior builds, or
when a pull request is retried or updated after a prior successful
build:
* Builds will restore cached Go modules and Go build cache according to
the Go version, platform, architecture, go tags, and hash of `go.sum`
that relates to each individual build workflow. This reduces the
amount of time it will take to download the cache on hits and upload
the cache on misses.
* Parallel build workflows won't clobber each other's build cache. This
results in much faster compile times after cache hits because the Go
compiler can reuse the platform, architecture, and tag specific build
cache that it created on prior runs.
* Older modules and build cache will not be uploaded when creating a new
cache. This should result in lean cache sizes on an ongoing basis.
* On cache misses we will have to upload our compressed module and build
cache. This will slightly extend the build time for pull requests that
modify `go.sum`.
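
To make the per-build cache idea concrete, here is a minimal Python
sketch of a cache key derived from the Go version, platform,
architecture, Go tags, and the hash of `go.sum`. The key layout and the
`go_cache_key` helper are hypothetical and are not the key the workflows
actually use.

```python
import hashlib


def go_cache_key(go_version, goos, goarch, tags, go_sum_bytes):
    """Compose a cache key that is unique per build configuration.

    The layout is an assumption for illustration. Tags are sorted so
    that the same set of tags always yields the same key.
    """
    digest = hashlib.sha256(go_sum_bytes).hexdigest()
    tag_part = "-".join(sorted(tags)) or "notags"
    return f"go-{go_version}-{goos}-{goarch}-{tag_part}-{digest}"


go_sum = b"example.com/module v1.0.0 h1:placeholder\n"
amd64_key = go_cache_key("1.20.5", "linux", "amd64", ["ui"], go_sum)
arm64_key = go_cache_key("1.20.5", "linux", "arm64", ["ui"], go_sum)
```

Because the architecture is part of the key, parallel builds read and
write separate caches, and any change to `go.sum` produces a new key and
therefore a cache miss.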
Web UI
------
We no longer build the web UI in every build workflow. Instead we separate
the UI building into its own workflow and cache the resulting assets.
The same UI assets are restored from cache during build workflows. This
strategy has the following properties:
* If the `ui` directory has not changed from prior builds we'll restore
`http/web_ui` from cache and skip an unnecessary UI build.
* We continue to use the built-in `yarn` caching functionality in
`actions/setup-node`. The default mode saves the `yarn` global cache
to improve UI build times if the cache has not been modified.
Changes
-------
* Add per platform/architecture Go module and build caching
* Move UI building into a separate job and cache the result
* Restore UI cache during build
* Pin workflows
Notes
-----
[0] https://hashicorp.atlassian.net/browse/QT-578
[1] https://github.com/hashicorp/vault/actions/runs/5415830307/jobs/9844829929
Signed-off-by: Ryan Cragun <me@ryan.ec>
Co-authored-by: Ryan Cragun <me@ryan.ec>
* backport all gha migration changes to release/1.13.x
* remove the .circleci directory
* remove references to circleci configuration from pre-commit hook
* remove reference to .circleci in Makefile
* port change to how gofumpt is executed in Makefile
* add gotestsum to tools/tools.go
* remove postgresql/scram package from generate-test-package-lists.sh since it didn't exist in release 1.13 or earlier
* blank out environment variables to allow test to properly function
* use go:embed to load files into test
---------
Co-authored-by: Kuba Wieczorek <kuba.wieczorek@hashicorp.com>
Introducing a new approach to testing Vault artifacts before merge
and after merge/notarization/signing. Rather than run a few static
scenarios across the artifacts, we now have the ability to run a
pseudo random sample of scenarios across many different build artifacts.
We've added 20 possible scenarios for the AMD64 and ARM64 binary
bundles, which we've broken into five test groups. On any given push to
a pull request branch, we will now choose a random test group and
execute its corresponding scenarios against the resulting build
artifacts. This gives us greater test coverage but lets us split the
verification across many different pull requests.
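
The group selection can be sketched in Python as follows. This is only
an illustration: in the real pipeline the groups are numbered by hand in
each matrix file, so the round-robin assignment, the scenario names, and
the helper functions here are assumptions.

```python
import random


def assign_groups(scenarios, group_count):
    """Number the groups 1..group_count and deal scenarios out round-robin."""
    groups = {number: [] for number in range(1, group_count + 1)}
    for index, scenario in enumerate(scenarios):
        groups[index % group_count + 1].append(scenario)
    return groups


def pick_group(groups, seed=None):
    """Choose one test group at random and return its number and scenarios."""
    rng = random.Random(seed)
    number = rng.choice(sorted(groups))
    return number, groups[number]


# Hypothetical scenario names standing in for the 20 real ones.
scenarios = [f"scenario-{n}" for n in range(1, 21)]
groups = assign_groups(scenarios, 5)
number, selected = pick_group(groups)
```

Each push runs only the scenarios in `selected`, so full coverage is
reached over several pushes rather than on every one.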
The post-merge release testing pipeline behaves in a similar fashion,
however, the artifacts that we use for testing have been notarized and
signed prior to testing. We've also reduced the number of groups so that
we run more scenarios after merge to a release branch.
We intend to take what we've learned building this in Github Actions and
roll it into an easier to use feature that is native to Enos. Until then,
we'll have to manually add scenarios to each matrix file and manually
number the test group. It's important to note that Github requires every
matrix to include at least one vector, so every artifact that is being
tested must include a single scenario in order for all workflows to pass
and thus satisfy branch merge requirements.
* Add support for different artifact types to enos-run
* Add support for different runner types to enos-run
* Add arm64 scenarios to build matrix
* Expand build matrices to include different variants
* Update Consul versions in Enos scenarios and matrices
* Refactor enos-run environment
* Add minimum version filtering support to enos-run. This allows us to
automatically exclude scenarios that require a more recent version of
Vault
* Add maximum version filtering support to enos-run. This allows us to
automatically exclude scenarios that require an older version of
Vault
* Fix Node 12 deprecation warnings
* Rename enos-verify-stable to enos-release-testing-oss
* Convert artifactory matrix into enos-release-testing-oss matrices
* Add all Vault editions to Enos scenario matrices
* Fix verify version with complex Vault edition metadata
* Rename the crt-builder to ci-helper
* Add more version helpers to ci-helper and Makefile
* Update CODEOWNERS for quality team
* Add support for filtering matrices by group and version constraints
* Add support for pseudo random test scenario execution
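
As an illustration of the minimum and maximum version filtering listed
above, here is a minimal Python sketch. The `min_version` and
`max_version` field names and the scenario entries are hypothetical, and
pre-release ordering is deliberately ignored.

```python
def parse_version(text):
    """Turn '1.13.2+ent' into (1, 13, 2).

    Anything after '+' or '-' is dropped, so pre-release ordering is
    ignored. Short versions such as '1.14' are padded to three parts.
    """
    core = text.split("+")[0].split("-")[0]
    parts = [int(part) for part in core.split(".")]
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def filter_scenarios(scenarios, vault_version):
    """Keep scenarios whose version constraints allow vault_version."""
    current = parse_version(vault_version)
    kept = []
    for scenario in scenarios:
        minimum = scenario.get("min_version")
        maximum = scenario.get("max_version")
        if minimum and current < parse_version(minimum):
            continue  # scenario requires a more recent Vault
        if maximum and current > parse_version(maximum):
            continue  # scenario requires an older Vault
        kept.append(scenario)
    return kept


# Hypothetical matrix entries.
matrix = [
    {"name": "smoke"},
    {"name": "needs-newer", "min_version": "1.14.0"},
    {"name": "needs-older", "max_version": "1.12.0"},
]
runnable = filter_scenarios(matrix, "1.13.2+ent")
```

Scenarios without constraints always pass through, so only entries that
declare a bound can be excluded.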
Signed-off-by: Ryan Cragun <me@ryan.ec>
Here we make the following major changes:
* Centralize CRT builder logic into a script utility so that we can share the
logic for building artifacts in CI or locally.
* Simplify the build workflow by calling a reusable workflow many times
instead of repeating the contents.
* Create a workflow that validates whether or not the build workflow and all
child workflows have succeeded to allow for merge protection.
Motivation
* We need branch requirements for the build workflow and all subsequent
integration tests (QT-353)
* We need to ensure that the Enos local builder works (QT-558)
* Debugging build failures can be difficult because one has to hand craft the
steps to recreate the build
* Merge conflicts between Vault OSS and Vault ENT build workflows are quite
painful. As the build workflow must be the same file and name we'll reduce
what is contained in each that is unique. Implementations of building
will be unique per edition so we don't have to worry about conflict
resolution.
* Since we're going to be touching the build workflow to do the first two
items we might as well try and improve those other issues at the same time
to reduce the overhead of backports and conflicts.
Considerations
* Build logic for Vault OSS and Vault ENT differs
* The Enos local builder was duplicating a lot of what we did in the CRT build
workflow
* Version and other artifact metadata has been an issue before. Debugging it
has been tedious and error prone.
* The build workflow is full of brittle copy and paste that is hard to
understand, especially for all of the release editions in Vault Enterprise
* Branch check requirements for workflows are incredibly painful to use for
workflows that are dynamic or change often. The required workflows have to be
configured in Github settings by administrators. They would also prevent us
from having simple docs PRs since required integration workflows always have
to run to satisfy branch requirements.
* Doormat credentials requirements that are coming will require us to modify
which event types trigger workflows. This changes those ahead of time since
we're doing so much to the build workflow. The only noticeable impact will be
that the build workflow no longer runs on pushes to non-main or release
branches. In order to test other branches it requires a workflow_dispatch
from the Actions tab or a pull request.
Solutions
* Centralize the logic that determines build metadata and creates releasable
Vault artifacts. Instead of cargo-culting logic multiple times in the build
workflow and the Enos local modules, we now have a crt-builder script which
determines build metadata and also handles building the UI, Vault, and the
package bundle. There are make targets for all of the available sub-commands.
Now what we use in the pipeline is the same thing as the local builder, and
it can be executed locally by developers. The crt-builder script works in OSS
and Enterprise so we will never have to deal with them being divergent or with
special casing things in the build workflow.
* Refactor the bulk of the Vault building into a reusable workflow that we can
call multiple times. This allows us to define Vault builds in a much simpler
manner and makes resolving merge conflicts much easier.
* Rather than trying to maintain a list and manually configure the branch check
requirements for build, we'll trigger a single workflow that uses the github
event system to determine if the build workflow (all of the sub-workflows
included) have passed. We'll then create branch restrictions on that single
workflow down the line.
Signed-off-by: Ryan Cragun <me@ryan.ec>
* [CI-only] Update RedHat registry tag
There are a few changes being made to RedHat's registry on October 20, 2022 that affect the way images need to be tagged prior to being pushed to the registry. This PR changes the tag to conform to the new standard.
We have other work queued up in crt-workflows-common and actions-docker-build to support the other required changes.
This PR should be merged to `main` and all release branches on or after October 20, 2022, and MUST be merged before your next production release. Otherwise, the automation to push to the RedHat registry will not work.
----
A detailed list of changes shared from RedHat (as an FYI):
The following changes will occur for container certification projects that leverage the Red Hat hosted registry [[registry.connect.redhat.com](http://registry.connect.redhat.com/)] for image distribution:
- All currently published images are migrating to a NEW, Red Hat hosted quay registry. Partners do not have to do anything for this migration, and this will not impact customers. The registry will still utilize [registry.connect.redhat.com](http://registry.connect.redhat.com/) as the registry URL.
- The registry URL currently used to push, tag, and certify images, as well as the registry login key, will change. You can see these changes under the “Images” tab of the container certification project. You will now see a [quay.io](http://quay.io/) address and will no longer see [scan.connect.redhat.com](http://scan.connect.redhat.com/).
- Partners will have the opportunity to auto-publish images by selecting “Auto-publish” in the Settings tab of your certification project. This will automatically publish images that pass all certification tests.
- For new container image projects, partners will have the option to host within their own chosen image registry while using [registry.connect.redhat.com](http://registry.connect.redhat.com/) as a proxy address. This means the end user can authenticate to the Red Hat registry to pull a partner image without having to provide additional authentication to the partner’s registry.
* docker: update redhat_tag
Co-authored-by: Sam Salisbury <samsalisbury@gmail.com>
Add our initial Enos integration tests to Vault. The Enos scenario
workflow will automatically be run on branches that are created from the
`hashicorp/vault` repository. See the README.md in ./enos for a full description
of how to compose and execute scenarios locally.
* Simplify the metadata build workflow jobs
* Automatically determine the Go version from go.mod
* Add formatting check for Enos integration scenarios
* Add Enos smoke and upgrade integration scenarios
* Add Consul backend matrix support
* Add Ubuntu and RHEL distro support
* Add Vault edition support
* Add Vault architecture support
* Add Vault builder support
* Add Vault Shamir and awskms auto-unseal support
* Add Raft storage support
* Add Raft auto-join voter verification
* Add Vault version verification
* Add Vault seal verification
* Add in-place upgrade support for all variants
* Add four scenario variants to CI. These test a maximal distribution of
the aforementioned variants with the `linux/amd64` Vault install
bundle.
Signed-off-by: Ryan Cragun <me@ryan.ec>
Co-authored-by: Rebecca Willett <rwillett@hashicorp.com>
Co-authored-by: Jaymala <jaymalasinha@gmail.com>
Update Go to 1.18
From 1.17.12
1.18.5 was just released, but not all packages have been updated, so I
went with 1.18.4
Co-authored-by: Steven Clark <steven.clark@hashicorp.com>
* Copy UBI Dockerfile into Vault
This Dockerfile was modeled off of the existing Alpine Dockerfile (in
this repo) and the external Dockerfile from the docker-vault repo:
> https://github.com/hashicorp/docker-vault/blob/master/ubi/Dockerfile
We also import the UBI-specific docker-entrypoint.sh, as certain
RHEL/Alpine changes (like interpreter) require a separate entry script.
Signed-off-by: Alexander Scheel <alex.scheel@hashicorp.com>
* Add UBI build to CRT pipeline
Also adds workflow_dispatch to the CRT pipeline, to allow manually
triggering CRT from PRs, when desired.
Signed-off-by: Alexander Scheel <alex.scheel@hashicorp.com>
* Update Dockerfile
Co-authored-by: Sam Salisbury <samsalisbury@gmail.com>
* Update Dockerfile
Co-authored-by: Sam Salisbury <samsalisbury@gmail.com>
* Update Dockerfile
Co-authored-by: Sam Salisbury <samsalisbury@gmail.com>
* Update Dockerfile
* Update Dockerfile
* Update build.yml
Allow for both push to arbitrary branch plus workflow dispatch, per Newsletter article.
Co-authored-by: Sam Salisbury <samsalisbury@gmail.com>
* add BuildDate to version base
* populate BuildDate with ldflags
* include BuildDate in FullVersionNumber
* add BuildDate to seal-status and associated status cmd
* extend core/versions entries to include BuildDate
* include BuildDate in version-history API and CLI
* fix version history tests
* fix sys status tests
* fix TestStatusFormat
* remove extraneous LD_FLAGS from build.sh
* add BuildDate to build.bat
* fix TestSysUnseal_Reset
* attempt to add build-date to release builds
* add branch to github build workflow
* add get-build-date to build-* job needs
* fix release build command vars
* add missing quote in release build command
* Revert "add branch to github build workflow"
This reverts commit b835699ecb7c2c632757fa5fe64b3d5f60d2a886.
* add changelog entry
Use the latest version of the actions-packaging-linux@v1 to ensure that
.deb and .rpm artifacts are generated with release.
Signed-off-by: Ryan Cragun <me@ryan.ec>
* achieve parity with ent in core.go
* add VAULT_DISABLE_LOCAL_AUTH_MOUNT_ENTITIES
* parity in build.yml with ent but without adding the +ent
* pass base version to ldflags
Co-authored-by: Kyle Penfound <kpenfound11@gmail.com>
* adding CRT to main branch
* cleanup
* um i dont know how that got removed but heres the fix
* add vault.service
Co-authored-by: Kyle Penfound <kpenfound11@gmail.com>