The [Streaming Standby][standby] architecture requires custom TLS certs
for two clusters in two regions to connect to each other.
This patch manages the custom certs following the configuration
described in the article [Using Cert Manager to Deploy TLS for Postgres
on Kubernetes][article].
NOTE: One thing not mentioned anywhere in the Crunchy documentation is
how custom TLS certs work with pgbouncer. The pgbouncer service uses a
TLS certificate issued by the PGO root CA, not by the custom
certificate authority.
For this reason, we use kustomize to patch the zitadel Deployment and
the zitadel-init and zitadel-setup Jobs. The patch projects the CA
bundle from the `zitadel-pgbouncer` secret into the zitadel pods at
`/pgbouncer/ca.crt`.
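A minimal sketch of that kustomize patch, assuming the secret key and
container name (the actual key in the `zitadel-pgbouncer` secret may
differ):

```yaml
# Hypothetical strategic-merge patch; the secret key and container
# name are assumptions, not taken from the actual component.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zitadel
spec:
  template:
    spec:
      volumes:
        - name: pgbouncer-ca
          secret:
            secretName: zitadel-pgbouncer
            items:
              - key: pgbouncer-frontend.ca-roots
                path: ca.crt
      containers:
        - name: zitadel
          volumeMounts:
            - name: pgbouncer-ca
              mountPath: /pgbouncer
              readOnly: true
```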
[standby]: https://access.crunchydata.com/documentation/postgres-operator/latest/architecture/disaster-recovery#streaming-standby-with-an-external-repo
[article]: https://www.crunchydata.com/blog/using-cert-manager-to-deploy-tls-for-postgres-on-kubernetes
A full backup was taken using:
```
kubectl annotate postgrescluster zitadel postgres-operator.crunchydata.com/pgbackrest-backup="$(date)"
```
And completed with:
```
❯ k logs -f zitadel-backup-5r6v-v5jnm
time="2024-03-10T21:52:15Z" level=info msg="crunchy-pgbackrest starts"
time="2024-03-10T21:52:15Z" level=info msg="debug flag set to false"
time="2024-03-10T21:52:15Z" level=info msg="backrest backup command requested"
time="2024-03-10T21:52:15Z" level=info msg="command to execute is [pgbackrest backup --stanza=db --repo=2 --type=full]"
time="2024-03-10T21:55:18Z" level=info msg="crunchy-pgbackrest ends"
```
This patch verifies the point-in-time backup is robust in the face of
the following operations:
1. pg cluster zitadel was deleted (whole namespace emptied)
2. pg cluster zitadel was re-created _without_ a `dataSource`
3. pgo initialized a new database and backed up the blank database to
S3.
4. pg cluster zitadel was deleted again.
5. pg cluster zitadel was re-created with `dataSource` `options: ["--type=time", "--target=\"2024-03-10 21:56:00+00\""]` (Just after the full backup completed)
6. Restore completed successfully.
7. Applied the holos zitadel component.
8. Zitadel came up successfully and user login worked as expected.
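The `dataSource` from step 5 can be sketched roughly as follows; the
stanza and repo names come from the backup log above, and the rest of
the spec (S3 configuration and so on) is trimmed:

```yaml
# Trimmed sketch of the point-in-time restore dataSource; only the
# fields discussed above are shown.
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: zitadel
spec:
  dataSource:
    pgbackrest:
      stanza: db
      repo:
        name: repo2
      options:
        - --type=time
        - --target="2024-03-10 21:56:00+00"
```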
- [x] Perform an in place [restore][restore] from [s3][bucket].
- [x] Set repo1-retention-full to clear warning
[restore]: https://access.crunchydata.com/documentation/postgres-operator/latest/tutorials/backups-disaster-recovery/disaster-recovery#restore-properties
[bucket]: https://access.crunchydata.com/documentation/postgres-operator/latest/tutorials/backups-disaster-recovery/disaster-recovery#cloud-based-data-source
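The retention setting from the second checklist item looks roughly like
this; the value `14` is an assumed example, not what the cluster
actually uses:

```yaml
spec:
  backups:
    pgbackrest:
      global:
        repo1-retention-full: "14"
        repo1-retention-full-type: count
```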
To establish the canonical https://login.ois.run identity issuer on the
core cluster pair.
Custom resources for PGO have been imported with:
```
timoni mod vendor crds -f deploy/clusters/core2/components/prod-pgo-crds/prod-pgo-crds.gen.yaml
```
Note, the zitadel TLS connection took considerable effort to get
working. We intentionally use PGO-issued certs to reduce the toil of
managing certs issued by cert manager.
The default TLS configuration of PGO is pretty good, with `verify-full`
enabled.
The core2 cluster cannot provision PVCs because it's using the k8s-dev
pool when it has credentials valid only for the k8s-prod pool.
This patch adds an entry to the platform cluster map to configure the
pool for each cluster, with a default of k8s-dev.
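A sketch of what the cluster map entry might look like; the definition
and cluster names here are assumptions for illustration:

```cue
// Hypothetical cluster map; field names are assumptions.
#Cluster: pool: string | *"k8s-dev"

clusters: [string]: #Cluster
clusters: {
	core1: pool: "k8s-prod"
	core2: pool: "k8s-prod"
	workload1: {} // falls back to the default k8s-dev pool
}
```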
PGO uses plain yaml and kustomize as the recommended installation
method. Holos supports upstream by adding a new PlainFiles component
kind, which simply copies files into place and lets kustomize handle the
generation of the api objects.
Cue is responsible for very little in this kind of component, basically
allowing overlay resources if needed and deferring everything else to
the holos cli.
The holos cli in turn is responsible for executing `kubectl kustomize`
against the input directory to produce the rendered output, then writes
the rendered output into place.
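The component shape might be sketched like this; the kind and field
names are assumptions for illustration, not the actual holos API:

```cue
// Hypothetical PlainFiles component; kustomize generates the api
// objects from the copied files.
#PlainFiles: {
	kind: "PlainFiles"
	// files copied into place verbatim for kustomize to consume
	files: [...string]
}

component: #PlainFiles & {
	files: ["kustomization.yaml", "postgres-operator.yaml"]
}
```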
Without this patch the arc controller fails to create a listener. The
template for the listener doesn't appear to be configurable from the
chart.
We could patch the listener pod template with kustomize; do this as a
follow-up feature.
With this patch we get the expected two pods in the runner system
namespace:
```
❯ k get pods
NAME READY STATUS RESTARTS AGE
gha-rs-7db9c9f7-listener 1/1 Running 0 43s
gha-rs-controller-56bb9c77d9-6tjch 1/1 Running 0 8s
```
The resource names for the arc controller are too long:
```
❯ k get pods -n arc-systems
NAME READY STATUS RESTARTS AGE
gha-runner-scale-set-controller-gha-rs-controller-6bdf45bd6jx5n 1/1 Running 0 59m
```
Solve the problem by allowing components to set the release name to
`gha-rs-controller` which requires an additional field from the cue code
to differentiate from the chart name.
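A sketch of the extra field, assuming hypothetical definition and field
names:

```cue
// The release defaults to the chart name unless a component
// overrides it; names here are assumptions.
#HelmChart: {
	chart:   string
	release: string | *chart
}

controller: #HelmChart & {
	chart:   "gha-runner-scale-set-controller"
	release: "gha-rs-controller"
}
```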
Separate the SecretStore resources from the namespaces component
because it creates a deadlock: the SecretStore CRDs don't get applied
until the ESO component is managed.
The namespaces component should have nothing but core api objects, no
custom resources.
This patch switches CockroachDB to use certs provided by ExternalSecrets
instead of managing Certificate resources in-cluster from the upstream
helm chart.
This paves the way for multi-cluster replication by moving certificates
outside of the lifecycle of the workload cluster cockroach db operates
within.
Closes: #36
Issuing mTLS certs for CockroachDB moves to the provisioner cluster so
we can more easily support cross-cluster replication in the future.
CRDB certs will be synced the same as public TLS certs, using
ExternalSecret resources.
This patch uses cert manager in the provisioner cluster to provision
TLS certs for https://login.example.com and https://httpbin.k2.example.com
The certs are not yet synced to the clusters. Next step is to replace
the Certificate resources with ExternalSecret resources, then remove
cert manager from the workload clusters.
This patch moves certificate management to the provisioner cluster to
centralize all secrets into the highly secured cluster. This change
also simplifies the architecture in a number of ways:
1. Certificate lifetimes are now completely independent of the cluster
lifecycle.
2. Remove the need for bi-directional sync to save cert secrets.
3. Workload clusters no longer need access to DNS.
Multiple holos components rely on kustomize to modify the output of the
upstream helm chart, for example patching a Deployment to inject the
istio sidecar.
The new holos CUE based component system did not support running
kustomize after helm template. This patch adds the kustomize execution
when two fields are defined in the helm chart kind of CUE output.
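A hypothetical example of such output; the field names are assumptions
rather than the final API, which the next paragraph notes is still
loose:

```cue
kind:  "HelmChart"
chart: name: "example"
kustomize: {
	// when both fields are present, holos runs kustomize on the
	// output of helm template
	kustomizeFiles: ["kustomization.yaml"]
	resourcesFile: "resources.yaml"
}
```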
The API spec is pretty loose in this patch but I'm proceeding for
expedience and to inform the final API with more use cases as more
components are migrated to cue.
CockroachDB uses TLS certs for client authentication. Issue one for
Zitadel.
With this patch Zitadel starts up but is not yet exposed with a
VirtualService.
Refer to https://zitadel.com/docs/self-hosting/manage/configure
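The client certificate might be issued like this, assuming cert-manager
and hypothetical issuer and namespace names; CockroachDB authenticates
the SQL user named by the certificate's CN:

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: zitadel-crdb-client
  namespace: zitadel
spec:
  secretName: zitadel-crdb-client
  commonName: zitadel      # CockroachDB maps the CN to the SQL user
  usages:
    - client auth
  issuerRef:
    kind: Issuer
    name: cockroachdb-ca
```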
The istio default Gateway is the basis for what will become a dynamic
set of server entries specified from cue project data integrated with
extauthz.
For now we simply need to get the identity provider up and running as
the first step toward identity and access management.
This patch migrates the https redirect and the
istio-ingressgateway-loopback Service from
`holos-infra/components/core/istio/ingress/templates/deployment`
This patch adds the standard istiod controller, which depends on
istio-base.
The holos reference platform heavily customizes the meshconfig, so the
upstream istio ConfigMap is disabled in the helm chart values. The mesh
config is generated from cue data defined in the controller holos
component.
Note: This patch adds a static configuration for the istio meshconfig in
the meshconfig.cue file. The extauthz providers are a core piece of
functionality in the holos reference platform and a key motivation of
moving to CUE from Helm is the need to dynamically generate the
meshconfig from a platform scoped set of projects and services across
multiple clusters.
For expedience this dynamic generation is not part of this patch but is
expected to replace the static meshconfig once the cluster is more fully
configured with the new cue based holos command line interface.
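The static extauthz entry presumably uses Istio's standard
`envoyExtAuthzGrpc` extension provider; the provider name, service, and
port here are assumptions:

```yaml
meshConfig:
  extensionProviders:
    - name: ext-authz
      envoyExtAuthzGrpc:
        service: ext-authz.istio-system.svc.cluster.local
        port: 9191
```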
Using a list to merge dependencies through the tree from root to leaf
is challenging. This patch uses a #DependsOn struct instead, then
builds the list of dependencies for flux from the struct field values.
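The idea can be sketched as follows, with assumed field names: structs
unify cleanly across the tree where lists do not, and the list is
derived at the end.

```cue
// Each dependency is a struct field keyed by name.
#DependsOn: [Name=string]: name: Name

dependsOn: #DependsOn
dependsOn: "istio-base": _
dependsOn: namespaces:   _

// the list form handed to the flux Kustomization
spec: dependsOn: [for d in dependsOn {name: d.name}]
```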
This enables the DNS-01 Let's Encrypt ACME solver, which is heavily
used in the reference platform.
Secret migrated from Vault using:
```bash
vault kv get -format=json -field data kv/k8s/ns/cert-manager/cloudflare-api-token-secret \
| holos create secret --namespace cert-manager cloudflare-api-token-secret --data-stdin --append-hash=false
```
It makes sense to manage the SecretStore along with the Namespace in the
platform namespaces holos component. Otherwise, the first component
that needs an ExternalSecret also needs to manage a SecretStore, which
creates an artificial dependency for subsequent components that also
need a SecretStore in the same namespace.
Best to just have all components depend on the namespaces component.
This patch partially adds the Let's Encrypt issuers. The platform data
expands to take a contact email and a cloudflare login email.
The external secret needs to be added next.
Straightforward helm install with no customization.
This patch also adds a "Skip" output kind which allows intermediate cue
files in the tree to signal holos to skip over the instance. This
enables constraints to be added at intermediate layers without build
errors.
This patch changes the interface between CUE and Holos to remove the
content field and replace it with an api object map. The map is a
`map[string]map[string]string` with the rendered yaml as the value of a
kind/name nesting.
This structure enables better error messages: CUE disjunction errors
indicate the type and name of the resource instead of just the list
index number.
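The kind/name nesting can be illustrated with a small Go sketch; the
`objectRefs` helper and the yaml values are assumptions for
illustration, not the actual holos code:

```go
package main

import (
	"fmt"
	"sort"
)

// objectRefs flattens the kind/name nesting into "Kind/name" strings,
// so an error can identify a resource by type and name rather than by
// a list index.
func objectRefs(m map[string]map[string]string) []string {
	var refs []string
	for kind, byName := range m {
		for name := range byName {
			refs = append(refs, kind+"/"+name)
		}
	}
	sort.Strings(refs)
	return refs
}

func main() {
	// rendered yaml keyed by kind, then name
	apiObjectMap := map[string]map[string]string{
		"Deployment": {"zitadel": "apiVersion: apps/v1\nkind: Deployment\n# ..."},
		"Service":    {"zitadel": "apiVersion: v1\nkind: Service\n# ..."},
	}
	fmt.Println(objectRefs(apiObjectMap))
}
```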