This patch refactors the API following the [API Best Practices][api]
documentation. The UpdatePlatform method is modeled after a mutating
operation described [by Netflix][nflx] instead of using a REST resource
representation. This makes it much easier to iterate over the fields
that need to be updated as the PlatformUpdateOperation is a flat data
structure while a Platform resource may have nested fields. Nested
fields are more complicated and less clear to handle with a FieldMask.
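To illustrate why the flat shape is easier to apply, here is a hypothetical Python sketch (the field names are invented for illustration, not the actual proto schema): with a flat operation, each FieldMask path is a plain key, so the server can iterate paths directly.

```python
def apply_update(platform: dict, op: dict, paths: list[str]) -> dict:
    """Apply only the masked fields of a flat update operation.

    Because PlatformUpdateOperation is flat, every FieldMask path is a
    top-level key; no nested-field traversal is required.
    """
    for path in paths:
        platform[path] = op[path]
    return platform

# Hypothetical data; only display_name is masked, so name is untouched.
platform = {"id": "example-id", "name": "bare", "display_name": "Bare"}
op = {"name": "ignored", "display_name": "Bare Metal"}
apply_update(platform, op, ["display_name"])
```

A nested Platform resource would instead require splitting each path on `.` and walking the message tree, which is what the flat operation avoids.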
This patch also adds a snackbar message on save. Previously, the save
button gave no indication of success or failure. This patch fixes the
problem by adding a snackbar message that pops up at the bottom of the
screen.
When the snackbar message is dismissed or times out the save button is
re-enabled.
[api]: https://protobuf.dev/programming-guides/api/
[nflx]: https://netflixtechblog.com/practical-api-design-at-netflix-part-2-protobuf-fieldmask-for-mutation-operations-2e75e1d230e4
Examples:
FieldMask for ListPlatforms
```
grpcurl -H "x-oidc-id-token: $(holos token)" -d @ ${HOLOS_SERVER##*/} holos.platform.v1alpha1.PlatformService.ListPlatforms <<EOF
{
"org_id": "018f36fb-e3f7-7f7f-a1c5-c85fb735d215",
"field_mask": { "paths": ["id","name"] }
}
EOF
```
```json
{
"platforms": [
{
"id": "018f36fb-e3ff-7f7f-a5d1-7ca2bf499e94",
"name": "bare"
},
{
"id": "018f6b06-9e57-7223-91a9-784e145d998c",
"name": "gary"
},
{
"id": "018f6b06-9e53-7223-8ae1-1ad53d46b158",
"name": "jeff"
},
{
"id": "018f6b06-9e5b-7223-8b8b-ea62618e8200",
"name": "nate"
}
]
}
```
Closes: #171
Problem:
When the ingress default Gateway AuthorizationPolicy/authpolicy-custom
rule is in place the choria machine room holos controller fails to
connect to the provisioner broker with the following error:
```
❯ holos controller run --config=agent.cfg
WARN[0000] Starting controller version 0.68.1 with config file /home/jeff/workspace/holos-run/holos/hack/choria/agent/agent.cfg leader=false
WARN[0000] Switching to provisioning configuration due to build defaults and missing /home/jeff/workspace/holos-run/holos/hack/choria/agent/agent.cfg
WARN[0000] Setting anonymous TLS mode during provisioning component=server connection=coffee.home identity=coffee.home
WARN[0000] Initial connection to the Broker failed on try 1: invalid websocket connection component=server connection=coffee.home identity=coffee.home
WARN[0000] Initial connection to the Broker failed on try 2: invalid websocket connection component=server connection=coffee.home identity=coffee.home
WARN[0002] Initial connection to the Broker failed on try 3: invalid websocket connection component=server connection=coffee.home identity=coffee.home
```
This problem occurs because the provisioning token URL is set to
`wss://jeff.provision.dev.k2.holos.run:443`, which includes an explicit
port number.
Solution:
Follow the upstream Istio guidance in [Writing Host Match Policies][1]
to match host headers with or without the port specified.
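Following that guidance, the policy matches both forms of the host header. A hedged sketch (the policy name and host values are illustrative):

```
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: authpolicy-custom
  namespace: istio-ingress
spec:
  rules:
    - to:
        - operation:
            # Match the host header with or without an explicit port.
            hosts:
              - "jeff.provision.dev.k2.holos.run"
              - "jeff.provision.dev.k2.holos.run:*"
```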
Result:
The controller is able to connect to the provisioner broker:
[1]: https://istio.io/latest/docs/ops/best-practices/security/#writing-host-match-policies
This patch fixes an error where the Istio ingress gateway proxy failed
to verify the TLS certificate presented by the Choria broker upstream
server.
Choria broker logs:

```
kubectl logs choria-broker-0
level=error msg="websocket: TLS handshake error from 10.244.1.190:36142: remote error: tls: unknown certificate\n"
```

Istio ingress logs:

```
kubectl -n istio-ingress logs -l app=istio-ingressgateway -f | grep --line-buffered '^{' | jq .
"upstream_transport_failure_reason": "TLS_error:|268435581:SSL_routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED:TLS_error_end:TLS_error_end"
```

Client curl output:

```
curl https://jeff.provision.dev.k2.holos.run
upstream connect error or disconnect/reset before headers. retried and the latest reset reason: remote connection failure, transport failure reason: TLS_error:|268435581:SSL routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED:TLS_error_end:TLS_error_end
```
Explanation of error:
Istio defaults to expecting a TLS certificate matching the downstream
host/authority, which isn't how we've configured Choria.
Refer to [ClientTLSSettings][1]
> A list of alternate names to verify the subject identity in the
> certificate. If specified, the proxy will verify that the server
> certificate’s subject alt name matches one of the specified values. If
> specified, this list overrides the value of subject_alt_names from the
> ServiceEntry. If unspecified, automatic validation of upstream presented
> certificate for new upstream connections will be done based on the
> downstream HTTP host/authority header, provided
> VERIFY_CERTIFICATE_AT_CLIENT and ENABLE_AUTO_SNI environmental variables
> are set to true.
[1]: https://istio.io/latest/docs/reference/config/networking/destination-rule/#ClientTLSSettings
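Given that behavior, one fix is to pin the expected SANs in a DestinationRule. A hedged sketch (the resource name, host, and SAN values are illustrative, not our actual config):

```
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: choria-broker
spec:
  host: choria-broker.choria.svc.cluster.local
  trafficPolicy:
    tls:
      mode: SIMPLE
      # Verify against the SANs actually present in the broker cert
      # instead of the downstream host/authority header.
      subjectAltNames:
        - choria-broker.choria.svc.cluster.local
```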
This patch configures ArgoCD to log in via PKCE.
Note the changes are primarily in platform.site.cue and ensuring the
emailDomain is set properly. Note too the redirect URL needs to be
`/pkce/verify` when PKCE is enabled. Finally, if the setting is
reconfigured make sure to clear cookies otherwise the incorrect
`/auth/callback` path may be used.
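A sketch of the relevant `argocd-cm` settings (the URL, issuer, and client ID are illustrative; check the ArgoCD OIDC documentation for your version):

```
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  url: https://argocd.ois.run
  oidc.config: |
    name: ZITADEL
    issuer: https://login.ois.run
    clientID: argocd
    enablePKCEAuthentication: true
    # With PKCE enabled, the redirect URL registered with the IdP must
    # end in /pkce/verify, not /auth/callback.
```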
Problem:
Port names in the default Gateway.spec.servers.port field must be unique
across all servers associated with the workload.
Solution:
Append the fully qualified domain name with dots replaced with hyphens.
Result:
Port name is unique.
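The naming scheme can be sketched as follows (a hypothetical helper; the exact prefix convention is assumed):

```python
def port_name(protocol: str, fqdn: str) -> str:
    """Build a unique Gateway server port name from the host FQDN.

    Replacing dots with hyphens keeps the name a valid Kubernetes port
    name while making it unique per host across all servers.
    """
    return f"{protocol}-{fqdn.replace('.', '-')}"

name = port_name("https", "login.ois.run")
```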
Problem:
The default gateway in one cluster gets server entries for all hosts in
the platform. This makes the list unnecessarily large, with entries for
clusters that should not be handled by the current cluster.
For example, the k2 cluster has gateway entries to route hosts for k1,
k3, k4, k5, etc...
Solution:
Add a field to the CertInfo definition representing which clusters the
host is valid on.
Result:
Hosts which are valid on all clusters, e.g. login.ois.run, have all
project clusters added to the clusters field of the CertInfo. Hosts
which are valid on a single cluster have the corresponding single entry
added.
When building resources, holos components should check if `#ClusterName`
is a valid field of the CertInfo.clusters field. If so, the host is
valid for the current cluster. If not, the host should be omitted from
the current cluster.
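A hedged CUE sketch of that check (field names follow the description above; `Hosts` and `#ClusterName` are assumed to be in scope):

```
// Include a Gateway server entry only for hosts whose CertInfo lists
// the current cluster in its clusters field.
servers: {
	for fqdn, info in Hosts
	for cluster, _ in info.clusters
	if cluster == #ClusterName {
		"\(fqdn)": hosts: [fqdn]
	}
}
```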
Doing so forces unnecessary hosts onto some projects. For example,
iam.ois.run is useless for the iam project; its primary host is login,
which builds login.ois.run. Some projects may not need any hosts at
all.
Better to let the user specify `project: foo: hosts: foo: _` if they
want it.
This patch loops over every Gateway.spec.servers entry in the default
gateway and manages an ExternalSecret to sync the credential from the
provisioner cluster.
Problem:
A Holos Component is created for each project stage, but all hosts for
all stages in the project are added. This creates duplicates.
Solution:
Sort project hosts by their stage and map the holos component for a
stage to the hosts for that stage.
Result:
Duplicates are eliminated, the prod certs are not in the dev holos
component and vice-versa.
This patch provisions wildcard certs in the provisioning cluster. The
CN matches the project stage host's global hostname without any cluster
qualifiers.
The use of a wildcard in place of the environment name dns segment at
the leftmost position of the fully qualified dns name enables additional
environments to be configured without reissuing certificates.
This is to avoid the 100 name per cert limit in LetsEncrypt.
Mapping each project host fqdn to the stage is unnecessary. The list of
gateway servers is constructed from each FQDN in the project.
This patch removes the unnecessary struct mappings.
Problem:
It's difficult to map and reduce the collection of project hosts when
configuring related Certificate, Gateway.spec.servers, VirtualService,
and auth proxy cookie domain settings.
Solution:
Define #ProjectHosts, which takes a project and provides Hosts, a
struct with a fqdn key and a #CertInfo value. The #CertInfo definition
is intended to provide everything needed to reduce the Hosts property to
structs useful for the problematic resources mentioned previously.
Result:
Gateway.spec.servers are mapped using #ProjectHosts.
Next step is to map the Certificate resources on the provisioner
cluster.
Problem:
Adding environments to a project causes certs to be re-issued.
Solution:
Enable wildcard certs for per-environment namespaces like jeff, gary,
nate, etc...
Result:
Environments can be added to a project stage without needing the cert to
be re-issued.
This patch avoids LetsEncrypt rate limits by consolidating multiple dns
names into one certificate.
For each project host, create a certificate for each stage in the
project. The certificate contains the dns names for all clusters and
environments associated with that stage and host.
This can become quite a list; the limit is 100 dnsNames.
For the Holos project which has 7 clusters and 4 dev environments, the
number of dns names is 32 (4 envs + 4 envs * 7 clusters = 32 dns names).
Still, a much needed improvement because we're limited to 50 certs per
week.
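The count above can be checked with a quick calculation:

```python
envs = 4      # dev environments in the Holos project
clusters = 7  # clusters in the Holos project

# One global name per environment plus one per environment per cluster,
# matching the figure quoted above and well under the 100-name limit.
dns_names = envs + envs * clusters
```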
It may be worth considering wildcards for the per-developer
environments, which are the ones we'll likely spin up the most
frequently.
This patch is a partial step toward getting the choria broker up
and running in my own namespace. The choria broker is necessary for
provisioning machine room agents such as the holos controller.
This patch adds an initial holos controller subcommand. The machine
room agent starts, but doesn't yet provision because we haven't deployed
the provisioning infrastructure yet.
Configure NATS in a 3 Node deployment with resolver authentication using
an Operator JWT.
The operator secret nkeys are stored in the provisioner cluster. Get
them with:
```
holos get secret -n jeff-holos nats-nsc --print-key nsc.tgz | tar -tvzf-
```
The upstream nats charts don't specify namespaces for each attribute.
This works with helm update, but not helm template which holos uses to
render the yaml.
The missing namespace causes flux to fail.
This patch uses the flux kustomization to add the target namespace to
all resources.
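A sketch of the Flux Kustomization (the name, namespace, path, and source are illustrative):

```
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: nats
  namespace: flux-system
spec:
  interval: 5m
  path: ./deploy/nats
  prune: true
  sourceRef:
    kind: GitRepository
    name: holos
  # Set metadata.namespace on every rendered resource that lacks one so
  # the helm template output applies cleanly.
  targetNamespace: jeff-holos
```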
When rendering a holos component which contains more than one helm chart, rendering fails. It should succeed.
```
holos render --cluster-name=k2 /home/jeff/workspace/holos-run/holos/docs/examples/platforms/reference/clusters/holos/... --log-level debug
```
```
9:03PM ERR could not execute version=0.64.2 err="could not rename: rename /home/jeff/workspace/holos-run/holos/docs/examples/platforms/reference/clusters/holos/nats/envs/vendor553679311 /home/jeff/workspace/holos-run/holos/docs/examples/platforms/reference/clusters/holos/nats/envs/vendor: file exists" loc=helm.go:145
```
This patch fixes the problem by moving each child item of the temporary
directory that charts are installed into, instead of renaming the parent
directory, which fails when the target already exists.
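The approach can be sketched in Python (the actual fix is in helm.go; this is an illustrative equivalent):

```python
import os
import shutil


def move_children(src: str, dst: str) -> None:
    """Move each child of src into dst instead of renaming src itself.

    Renaming the parent fails with "file exists" when dst is already a
    directory; moving children one at a time merges into it instead.
    """
    os.makedirs(dst, exist_ok=True)
    for name in os.listdir(src):
        shutil.move(os.path.join(src, name), os.path.join(dst, name))
    os.rmdir(src)
```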
Without this patch users encounter an error from istio because it does
not have a valid Jwks from ZITADEL to verify the request when processing
a `RequestAuthentication` policy.
Fixes error `AuthProxy JWKS Error - Jwks doesn't have key to match kid or alg from Jwt`.
Occurs when accessing a protected URL for the first time after tokens have expired.
Grafana does not yet have the istio sidecar. Prometheus is accessible
through the auth proxy. Cert manager is added to the workload clusters
so tls certs can be issued for webhooks, the kube-prom-stack helm chart
uses cert manager for this purpose.
With this patch Grafana is integrated with OIDC and I'm able to log in
as an Administrator.
Problem:
The VirtualService that catches auth routes for paths, e.g.
`/holos/authproxy/istio-ingress` is bound to the default gateway which
no longer exists because it has no hosts.
Solution:
It's unnecessary and complicated to create a Gateway for every project.
Instead, put all server entries into one `default` gateway and
consolidate the list using CUE.
Result:
It's easier to reason about this system. There is only one ingress
gateway, `default` and everything gets added to it. VirtualServices
need only bind to this gateway, which has a hosts entry appropriately
namespaced for the project.
Problem:
The ZITADEL database isn't restoring into the prod-iam namespace after
moving from prod-iam-zitadel because no backup exists at the bucket
path.
Solution:
Hard-code the path to the old namespace to restore the database. We'll
figure out how to move the backups to the new location in a follow up
change.
The `prod-platform-gateway` kustomization is reconciling early:

```
ExternalSecret/istio-ingress/argocd.ois.run dry-run failed: failed to
get API group resources: unable to retrieve the complete list of server
APIs: external-secrets.io/v1beta1: the server could not find the
requested resource
```