* Fix the build notification. It appears that during a rebase the JSON
payload was slightly corrupted.
* Don't create a successful CI step summary if the CI workflow is
cancelled.
* Don't create a successful CI comment if the workflow was cancelled.
Signed-off-by: Ryan Cragun <me@ryan.ec>
Don't rely on the pass/fail result of the CI workflow for notifications.
We do this to ensure we notify Slack on failures but still allow for
merging.
Signed-off-by: Ryan Cragun <me@ryan.ec>
Context
-------
Building and testing Vault artifacts on pull requests and merges is
responsible for about 1/3rd of our overall spend on Vault CI. Of the
artifacts that we ship as part of a release, we do Enos testing scenarios
on the `linux/amd64` and `linux/arm64` binaries and their derivative
artifacts. The extended build artifacts for non-Linux platforms or less
common machine architectures are not tested at this time. They are built,
notarized, and signed as part of every pull request update and merge. As
we don't actually test these artifacts, the only gain we get from this
rather expensive behavior is that we won't merge a change that would prevent
Vault from building on one of the extended targets. Extended platform or
architecture changes are quite rare, so performing this work as frequently
as we do is costly in both monetary and developer time for little relative
safety benefit.
Goals
-----
Rethink and implement how and when we build binaries and artifacts of Vault
so that we can spend less money on repetitive work while also reducing
the time it takes for the build and test pipelines to complete.
Solution
--------
Instead of building all release artifacts on every push, we'll opt to build
only our testable (core) artifacts. With this change we are introducing a
bit of risk. We could merge a change that breaks an extended platform and
only find out after the fact when we trigger a complete build for a release.
We'll hedge against that risk by building all of the release targets on a
scheduled cadence to ensure that they are still buildable.
We'll make building all of the targets optional on any pull request by
use of a `build/all` label on the pull request.
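The selection rule above can be sketched as follows. This is a minimal illustration, not the actual workflow logic: the function name, the `EXTENDED_TARGETS` list, and the event strings are assumptions made for the example. Only the `build/all` label and the `linux/amd64` and `linux/arm64` core targets come from the text above.

```python
from __future__ import annotations

# Core targets are the ones we actually test (taken from the text above).
CORE_TARGETS = ["linux/amd64", "linux/arm64"]
# Illustrative placeholders only; the real extended target list is not given here.
EXTENDED_TARGETS = ["darwin/arm64", "windows/amd64"]


def targets_to_build(event: str, labels: list[str]) -> list[str]:
    """Return the targets to build for a given trigger (hypothetical helper).

    Scheduled runs and pull requests labeled `build/all` build everything;
    every other trigger builds only the testable (core) targets.
    """
    build_all = event == "schedule" or (
        event == "pull_request" and "build/all" in labels
    )
    if build_all:
        return CORE_TARGETS + EXTENDED_TARGETS
    return list(CORE_TARGETS)
```

With this shape, a plain push or unlabeled pull request only pays for the core builds, while the scheduled run keeps the extended targets honest.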
Further considerations
----------------------
* We want to reduce the total number of workflows and runners for all of our
pipelines if possible. As each workflow runner has infrastructure cost and
runner time penalties, using a single runner over many is often preferred.
* Many of our job runners have been optimized for cost and performance. We
should simplify the choices of which runners to use.
* CRT requires us to use the same build workflow in both CE and Ent.
Historically that meant that modifying `build.yml` in CE would result in a
merge conflict with `build.yml` in Ent, and break our merge workflows.
* Workflow flow control in both `build.yml` and `ci.yml` can be quite
complicated, as each needs to maintain compatibility whether executed as CE
or Ent, and when triggered with various GitHub events like `pull_request`,
`push`, and `workflow_call`, each with their own requirements.
* Many jobs utilize similar patterns of flow control and metadata but are not
reusable.
* Workflow call depth has a maximum of four, so we need to be quite
considerate when calling other workflows.
* Called workflows can only have 10 inputs.
Implementation
--------------
* Refactor the `build.yml` workflow to be agnostic to whether or not it is
executing in CE or Ent. That makes future updates to the build much easier
as we won't have to worry about merge conflicts when the change is merged
downstream.
* Extract common steps in workflows into composite actions that we can reuse.
* Fix bugs where some but not all workflows would use different Git
references when building and testing a pull request.
* We rewrite the application, docs, and UI change helpers as a composite
action. This allows us to re-use this logic to make consistent behavior
choices across build and CI.
* We combine several `build.yml` and `ci.yml` jobs into our final job.
This reduces the number of workflows required for the same behavior while
saving time overall.
* Update most of our action pins.
Results
-------
| Metric | Before | After | Diff |
|-------------------|----------|---------|-------|
| Duration: | ~14-18m | ~15-18m | ~ = |
| Workflows: | 43 | 18 | - 58% |
| Billable time: | ~1h15m | 16m | - 79% |
| Saved artifacts: | 34 | 12 | - 65% |
Infra costs should map closely to billable time.
Network I/O costs should map closely to the workflow count.
Storage costs should map directly with saved artifacts.
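For reference, the Diff column of the build table above can be reproduced with a simple relative-change calculation. This is only a sketch of the arithmetic; the helper name is made up for the example and billable time is converted to minutes.

```python
def percent_change(before: float, after: float) -> int:
    """Relative change from `before` to `after`, rounded to a whole percent."""
    return round((after - before) / before * 100)


workflows = percent_change(43, 18)   # 43 -> 18 workflows
billable = percent_change(75, 16)    # ~1h15m -> 16m, expressed in minutes
artifacts = percent_change(34, 12)   # 34 -> 12 saved artifacts

print(workflows, billable, artifacts)  # prints: -58 -79 -65
```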
We could probably get parity with duration by getting more clever with
our UBI container build, as that's where we're seeing the increase. I'm
not yet concerned as it takes roughly the same time for this job to
complete as it did before.
While the CI workflow was not the focus of the PR, some shared
refactoring does show some marginal improvements there.
| Metric | Before | After | Diff |
|-------------------|----------|----------|--------|
| Duration: | ~24m | ~12.75m | - 15% |
| Workflows: | 55 | 47 | - 8% |
| Billable time: | ~4h20m | ~3h36m | - 7% |
Further focus on streamlining the CI workflows would likely result in a
few more marginal improvements, but nothing on the order of what we've
seen with the build workflow.
Signed-off-by: Ryan Cragun <me@ryan.ec>
We're on a quest to reduce our pipeline execution time, both to enhance
our developer productivity and to reduce the overall cost of the CI
pipeline. The strategy we use here reduces workflow execution time and
network I/O cost by reducing our module cache size and using binary
external tools when possible. We no longer download modules and build
many of the external tools thousands of times a day.
Our previous process of installing internal and external developer tools
was scattered and inconsistent. Some tools were installed via
`go generate -tags tools ./tools/...`, others via various `make` targets,
and some only in GitHub Actions workflows. This process led to some
undesirable side effects:
* The modules of some dev and test tools were included with those
of the Vault project. This leads to us having to manage our own
Go modules with those of external tools. Prior to Go 1.16 this
was the recommended way to handle external tools, but now
`go install tool@version` is the recommended way to handle
external tools that need to be built from source as it supports
specific versions but does not modify the go.mod.
* Due to GitHub cache constraints we combine our build and test Go
module caches together, but having our developer tools as deps in
our module results in a larger cache which is downloaded on every
build and test workflow runner. Removing the external tools that were
included in our go.mod reduced the expanded module cache size
by ~300MB, thus saving time and network I/O costs when downloading
the module cache.
* Not all of our developer tools were included in our modules. Some were
being installed with `go install` or `go run`, so they didn't take
advantage of a single module cache. This resulted in us downloading
Go modules on every CI and Build runner in order to build our
external tools.
* Building our developer tools from source in CI is slow. Where possible
we can prefer to use pre-built binaries in CI workflows. No more
module download or tool compiles if we can avoid them.
I've refactored how we define internal and external build tools
in our Makefile and added several new targets to handle both building
the developer tools locally for development and verifying that they are
available. This allows for an easy developer bootstrap while also
supporting installation of many of the external developer tools from
pre-built binaries in CI. This reduces our network I/O and run time
across nearly all of our Actions runners.
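The "verify that tools are available" step can be pictured as a simple PATH lookup. The sketch below is an assumption about the general shape of such a check, not the contents of the real helper: the function name and the injectable `which` parameter are invented for the example, and `some-other-tool` is a placeholder name.

```python
import shutil


def missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found on PATH.

    `which` is injectable so the check can be exercised without the tools
    actually being installed.
    """
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    # `gotestsum` is mentioned in these commits; the second name is a placeholder.
    absent = missing_tools(["gotestsum", "some-other-tool"])
    if absent:
        raise SystemExit("missing tools: " + ", ".join(absent))
```

Failing fast with the list of missing tools gives developers one clear message instead of a confusing error later in the build.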
While working on this I caught and resolved a few unrelated issues:
* Both our Go and Proto format checks were being run incorrectly. In
CI they were writing changes but not failing if changes were
detected. The Go check was less of a problem as we have git hooks that
are intended to enforce formatting, however we drifted over time.
* Our Git hooks couldn't handle removing a Go file without failing. I
moved the diff check into the new Go helper and updated it to handle
removing files.
* I combined a few separate scripts into helpers and added a few
new capabilities.
* I refactored how we install Go modules to make it easier to download
and tidy all of the project's go.mod files.
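The two formatting fixes above (fail when changes are detected, and tolerate removed files) can be illustrated with a small check built on `gofmt -l`, which lists files whose formatting differs from gofmt's. This is a sketch under assumptions, not the actual helper script: the function names and the injectable `run` parameter are hypothetical.

```python
import os
import subprocess


def existing_go_files(paths):
    """Keep only Go files that still exist, so deleted files are skipped."""
    return [p for p in paths if p.endswith(".go") and os.path.isfile(p)]


def unformatted(paths, run=subprocess.run):
    """Return the files that `gofmt -l` reports as needing formatting."""
    files = existing_go_files(paths)
    if not files:
        return []
    result = run(["gofmt", "-l", *files], capture_output=True, text=True, check=True)
    return [line for line in result.stdout.splitlines() if line]


def main(paths):
    """Report unformatted files and return a non-zero status if any exist."""
    bad = unformatted(paths)
    for path in bad:
        print(f"needs formatting: {path}")
    return 1 if bad else 0
```

The important property is that the check only reads and reports; it never rewrites files, so a dirty tree cannot silently pass in CI.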
* Refactor our internal and external tool installation and verification
into a tools.sh helper.
* Combined more complex Go verification into `scripts/go-helper.sh` and
utilize it in the `Makefile` and git commit hooks.
* Add `Makefile` targets for executing our various tools.sh helpers.
* Update our existing `make` targets to use new tool targets.
* Normalize our various scripts and targets output to have a consistent
output format.
* In CI, install many of our external dependencies as binaries wherever
possible. When not possible we'll build them from scratch but not mess
with the shared module cache.
* [QT-641] Remove our external build tools from our project Go modules.
* [QT-641] Remove extraneous `go list` calls from our `set-up-go` composite
action.
* Fix formatting and regen our protos
Signed-off-by: Ryan Cragun <me@ryan.ec>
* Stop running fips tests on PRs: we expect fips-specific failures to be rare enough that it's not worth the cost.
* Allow PRs with the label "fips" to run fips tests.
* Attempt to new-line/emojify test output
* Update emoji
* Make it always run, for testing
* Put the emojis first
* Add a space
* OSS -> CE
* Update enterprise tests also
* Test failure
* Test failures but better
* Print it even if not main :)
* Fix the comparison
* Finalize changes
* Remove diff-oss-ci
* Eliminate another inconsistency
* Fix logic: we want to only apply the fork check on the CE repo. On ent we want to always run the job.
---------
Co-authored-by: hc-github-team-secure-vault-core <github-team-secure-vault-core@hashicorp.com>
* adding testonly CI test job
* small instance for testonly tests
* feedback
* shopt
* disable glob expansion
* revert back to a large instance
* fix a mistake
We further optimize the CI workflow for better costs and speed.
We tested the Go CI workflows across several instance classes
and updated our compute choices. We achieve an average execution
speed improvement of 2-2.5 minutes per test workflow while
reducing the infrastructure cost by about 20%. We also save
another ~2 minutes by installing `gotestsum` from the GitHub release
instead of downloading the Go modules and compiling it every time.
In addition to the speed improvements, we also further reduced our cache
usage by updating the `security-scan` workflow to not cache Go modules.
We also use the `cache/save` and `cache/restore` actions for timing
caches. This results in saving half as many cache results for timing
data.
*UI test results*
results for 2x runs:
* c6a.2xlarge (12m54s, 11m55s)
* c6a.4xlarge (10m47s, 11m6s)
* c6a.8xlarge (11m32s, 10m51s)
* m5.2xlarge (15m23s, 14m16s)
* m5.4xlarge (14m48s, 12m54s)
* m5.8xlarge (12m27s, 12m24s)
* m6a.2xlarge (11m55s, 12m20s)
* m6a.4xlarge (10m54s, 10m43s)
* m6a.8xlarge (10m33s, 10m51s)
Current runner:
m5.2xlarge (15m23s, 14m16s, avg 14m50s) @ 0.448/hr = $0.11
Faster candidates
* c6a.2xlarge (12m54s, 11m55s, avg 12m24s) @ 0.3816/hr = $0.078
* m6a.2xlarge (11m55s, 12m20s, avg 12m8s) @ 0.4032/hr = $0.081
* c6a.4xlarge (10m47s, 11m6s, avg 10m56s) @ 0.7632/hr = $0.139
* m6a.4xlarge (10m54s, 10m43s, avg 10m48s) @ 0.8064/hr = $0.140
Best bang for the buck for test-ui:
m6a.2xlarge, > 25% cost savings from current and we save ~2.5 minutes.
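The per-run figures above follow from multiplying the average run time by the hourly rate. A minimal sketch of that arithmetic, using the averages and rates quoted above (the function name is made up for the example):

```python
def cost_per_run(avg_seconds: float, hourly_rate: float) -> float:
    """Cost of one run: average duration in hours times the hourly rate."""
    return avg_seconds / 3600 * hourly_rate


# Averages and hourly rates are the ones quoted in the lists above.
current = cost_per_run(14 * 60 + 50, 0.448)    # m5.2xlarge, avg 14m50s
candidate = cost_per_run(12 * 60 + 8, 0.4032)  # m6a.2xlarge, avg 12m8s
savings = 1 - candidate / current

print(f"current={current:.3f} candidate={candidate:.3f} savings={savings:.0%}")
```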
*Go test results*
During testing, the external replication tests, when not broken up, will
always take the longest. Our original analysis focuses on this job.
Most other test groups will finish ~3m faster, so we'll subtract
that time when estimating the cost for the whole job.
external replication job results:
* c6a.2xlarge (20m49s, 19m20s, avg 20m5s)
* c6a.4xlarge (19m1s, 19m38s, avg 19m20s)
* c6a.8xlarge (19m51s, 18m54s, avg 19m23s)
* m5.2xlarge (22m12s, 20m29s, avg 21m20s)
* m5.4xlarge (20m7s, 19m3s, avg 19m35s)
* m5.8xlarge (20m24s, 19m42s, avg 20m3s)
* m6a.2xlarge (21m10s, 19m37s, avg 20m23s)
* m6a.4xlarge (18m58s, 19m51s, avg 19m24s)
* m6a.8xlarge (19m27s, 18m47s, avg 19m7s)
There is little separation in time when we increase class size. In the
best case a class size increase yields about a 5% performance increase
and doubles the cost. For test-go our best bang for the buck is
certainly going to be in the 2xlarge class.
Current runner:
m5.2xlarge (22m12s, 20m29s, avg 21m20s) @ 0.448/hr (16@avg-3m + 1@avg) = $2.35
Candidates in the same class
* c6a.2xlarge (20m49s, 19m20s, avg 20m5s) @ 0.3816/hr (16@avg-3m + 1@avg) = $1.86
* m6a.2xlarge (21m10s, 19m37s, avg 20m23s) @ 0.4032/hr (16@avg-3m + 1@avg) = $2.00
Best bang for the buck for test-go:
c6a.2xlarge: 20% cost savings and we save about 2.25 minutes.
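The `(16@avg-3m + 1@avg)` shorthand above means one group runs for the average time of the longest job while the other 16 groups finish about three minutes sooner. A sketch of that estimate, with a hypothetical function name and the averages and rates quoted above:

```python
def job_cost(avg_seconds, hourly_rate, groups=17, faster_by=180):
    """Estimated cost of the whole test-go job.

    One group runs for `avg_seconds`; the remaining groups are assumed to
    finish `faster_by` seconds sooner, as described in the text above.
    """
    total_seconds = (groups - 1) * (avg_seconds - faster_by) + avg_seconds
    return total_seconds / 3600 * hourly_rate


m5 = job_cost(21 * 60 + 20, 0.448)    # m5.2xlarge, avg 21m20s
c6a = job_cost(20 * 60 + 5, 0.3816)   # c6a.2xlarge, avg 20m5s
m6a = job_cost(20 * 60 + 23, 0.4032)  # m6a.2xlarge, avg 20m23s
savings = 1 - c6a / m5

print(f"m5={m5:.2f} c6a={c6a:.2f} m6a={m6a:.2f} savings={savings:.0%}")
```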
We ran the Go tests with data race detection on similar instances and saw
execution times similar to test-go. Therefore we can use the same
recommended instance sizes.
After breaking up test-go's external replication tests, the longest group
was shorter on average. I chose to look at group 3 as it was usually the
longest grouping:
* c6a.2xlarge: (14m51s, 14m48s)
* c6a.4xlarge: (14m14s, 14m15s)
* c6a.8xlarge: (14m0s, 13m54s)
* m5.2xlarge: (15m36s, 15m35s)
* m5.4xlarge: (14m46s, 14m49s)
* m5.8xlarge: (14m25s, 14m25s)
* m6a.2xlarge: (14m51s, 14m53s)
* m6a.4xlarge: (14m16s, 14m16s)
* m6a.8xlarge: (14m2s, 13m57s)
Again, we see ~5% performance gains between the 2x and 8x instance classes
at quadruple the cost. The c6a and m6a families are almost identical, with
the c6a class being cheaper.
*Notes*
* UI and Go Test timing results: https://github.com/hashicorp/vault-enterprise/actions/runs/5556957460/jobs/10150759959
* Go Test with data race detection timing results: https://github.com/hashicorp/vault-enterprise/actions/runs/5558013192
* Go Test with replication broken up: https://github.com/hashicorp/vault-enterprise/actions/runs/5558490899
Signed-off-by: Ryan Cragun <me@ryan.ec>
* limit test comments
* remove unnecessary tee
* fix go test condition
* fix
* fail test
* remove always entirely
* fix columns
* make a bunch of tests fail
* separate line
* include Failures:
* remove test fails
* fix whitespace
* fix multiline
* shellcheck, and success message for builds
* add full path
* cat the summary
* fix and faster
* fix if condition
* base64 in a separate step
* echo
* check against empty string
* add echo
* only use matrix ids
* only id
* echo matrix
* remove wrapping array
* tojson
* try echo again
* use jq to get packages
* don't quote
* only run binary tests once
* only run binary tests once
* test what's wrong with the binary
* separate file
* use matrix file
* failed test
* update comment on success
* correct variable name
* base64 fix
* output to file
* use multiline
* fix
* fix formatting
* fix newline
* fix whitespace
* correct body, remove comma
* small fixes
* shellcheck
* another shellcheck fix
* fix deprecation checker
* only run comments for prs
* Update .github/workflows/test-go.yml
Co-authored-by: Mike Palmiotto <mike.palmiotto@hashicorp.com>
* Update .github/workflows/test-go.yml
Co-authored-by: Mike Palmiotto <mike.palmiotto@hashicorp.com>
* fixes
---------
Co-authored-by: Mike Palmiotto <mike.palmiotto@hashicorp.com>
* Make sure that we always download all of the required modules.
* Fix actions/set-up-go path for UI test
* Fix broken go.mod in hcp_link
Signed-off-by: Ryan Cragun <me@ryan.ec>
In order to reliably store Go test times in the GitHub Actions cache we
need to reduce our cache thrashing by not using more than 10 GB across all
of our caches. This change reduces our cache usage significantly by
sharing the Go module cache between our Go CI workflows and our build
workflows. We lose our per-builder cache, which will result in a bit of a
performance hit, but we'll enable better automatic rebalancing of our CI
workflows. Overall we should see a per-branch reduction in cache sizes
from ~17 GB to ~850 MB.
Some preliminary investigation into this new strategy:
Prior build workflow strategy on a cache miss:
  Download modules: ~20s
  Build Vault: ~40s
  Upload cache: ~30s
  Total: ~1m30s
Prior build workflow strategy on a cache hit:
  Download and decompress modules and build cache: ~12s
  Build Vault: ~15s
  Total: ~28s
New build workflow strategy on a cache miss:
  Download modules: ~20s
  Build Vault: ~40s
  Upload cache: ~6s
  Total: ~1m6s
New build workflow strategy on a cache hit:
  Download and decompress modules: ~3s
  Build Vault: ~40s
  Total: ~43s
Expected time if we used no Go caching:
  Download modules: ~20s
  Build Vault: ~40s
  Total: ~1m
Signed-off-by: Ryan Cragun <me@ryan.ec>
* use verify changes for docs to skip tests
* add verify-changes to the needed jobs
* skip go tests for doc/ui only changes
* fix a job ref
* change names, remove script
* remove ui conditions
* separate flags
* feedback
* combine into one checker
* combine and simplify ci checks
* add to test package list
* remove testing test
* only run deprecations check
* only run deprecations check
* remove unneeded repo check
* fix bash options
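The idea behind these commits (skip the Go tests when a change only touches docs or the UI) can be sketched as a path classifier. This is an illustration only: the function names are hypothetical, and the `website/` and `ui/` prefixes are assumptions about where docs and UI code live.

```python
from __future__ import annotations


def classify(changed: list[str]) -> set[str]:
    """Group changed paths into "docs", "ui", or "app" (everything else)."""
    kinds = set()
    for path in changed:
        if path.startswith("website/"):   # assumed docs location
            kinds.add("docs")
        elif path.startswith("ui/"):      # assumed UI location
            kinds.add("ui")
        else:
            kinds.add("app")
    return kinds


def should_run_go_tests(changed: list[str]) -> bool:
    """Run Go tests unless the change is known to be docs/UI only.

    An empty change list means we have no information, so we run the tests.
    """
    return not changed or "app" in classify(changed)
```

Defaulting to running the tests when nothing is known keeps the shortcut from hiding real failures.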
Improve our build workflow execution time by using custom runners,
improved caching and conditional Web UI builds.
Runners
-------
We improve our build times[0] by using larger custom runners[1] when
building the UI and Vault.
Caching
-------
We improve Vault caching by keeping a cache for each build job. This
strategy has the following properties which should result in faster
build times when `go.sum` hasn't been changed from prior builds, or
when a pull request is retried or updated after a prior successful
build:
* Builds will restore cached Go modules and Go build cache according to
the Go version, platform, architecture, go tags, and hash of `go.sum`
that relates to each individual build workflow. This reduces the
amount of time it will take to download the cache on hits and upload
the cache on misses.
* Parallel build workflows won't clobber each other's build cache. This
results in much faster compile times after cache hits because the Go
compiler can reuse the platform, architecture, and tag specific build
cache that it created on prior runs.
* Older modules and build cache will not be uploaded when creating a new
cache. This should result in lean cache sizes on an ongoing basis.
* On cache misses we will have to upload our compressed module and build
cache. This will slightly extend the build time for pull requests that
modify `go.sum`.
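A per-build cache key with the properties listed above can be sketched as follows. The key layout and function name are assumptions for illustration; only the inputs (Go version, platform, architecture, Go tags, and a hash of `go.sum`) come from the text above.

```python
import hashlib


def go_cache_key(go_version, goos, goarch, tags, go_sum: bytes) -> str:
    """Build a cache key that is unique per toolchain, target, tags, and go.sum."""
    digest = hashlib.sha256(go_sum).hexdigest()
    # Sort tags so the same tag set always yields the same key.
    tag_part = "-".join(sorted(tags)) or "notags"
    return f"go-{go_version}-{goos}-{goarch}-{tag_part}-{digest}"
```

Because the target and tags are part of the key, parallel builds never share an entry, and any change to `go.sum` produces a new key and therefore a cache miss.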
Web UI
------
We no longer build the web UI in every build workflow. Instead we separate
the UI building into its own workflow and cache the resulting assets.
The same UI assets are restored from cache during build workflows. This
strategy has the following properties:
* If the `ui` directory has not changed from prior builds we'll restore
`http/web_ui` from cache and skip rebuilding the UI unnecessarily.
* We continue to use the built-in `yarn` caching functionality in
`actions/setup-node`. The default mode saves the `yarn` global cache
to improve UI build times if the cache has not been modified.
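The "has the `ui` directory changed?" decision can be illustrated by hashing the directory contents and comparing against digests that already have cached assets. This is a sketch of the idea rather than the mechanism the workflow actually uses; both function names are hypothetical.

```python
import hashlib
import os


def tree_digest(root: str) -> str:
    """Hash every file under `root` (relative path plus contents) in a stable order."""
    h = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # walk subdirectories in a deterministic order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            h.update(os.path.relpath(path, root).encode())
            h.update(b"\0")
            with open(path, "rb") as f:
                h.update(f.read())
            h.update(b"\0")
    return h.hexdigest()


def need_ui_build(ui_dir: str, cached_digests: set) -> bool:
    """Build the UI only when no cached assets exist for the current digest."""
    return tree_digest(ui_dir) not in cached_digests
```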
Changes
-------
* Add per platform/architecture Go module and build caching
* Move UI building into a separate job and cache the result
* Restore UI cache during build
* Pin workflows
Notes
-----
[0] https://hashicorp.atlassian.net/browse/QT-578
[1] https://github.com/hashicorp/vault/actions/runs/5415830307/jobs/9844829929
Signed-off-by: Ryan Cragun <me@ryan.ec>