Commit Graph

189 Commits

Author SHA1 Message Date
Jeff McCune
bba3895f35 (#175) Add holos generate component helm command
This patch adds a schematic to generate a holos component that wraps a
helm chart.  The cert-manager chart is the current example.

Usage:

```bash
set -euo pipefail

rm -rf ~/holos/dev/bare
mkdir ~/holos/dev/bare
cd ~/holos/dev/bare

holos generate platform bare
holos pull platform config .
holos render platform ./platform/
(cd components && holos generate component helm cert-manager)
```

The chart builds:

```bash
holos build ./components/cert-manager | yq .
```

And renders:

```bash
holos render component ./components/cert-manager --cluster-name k2
find deploy -type f
```

```txt
9:41PM INF render.go:83 rendered cert-manager version=0.81.1 cluster=k2 status=ok action=rendered name=cert-manager
deploy/clusters/k2/holos/components/cert-manager-kustomization.gen.yaml
deploy/clusters/k2/components/cert-manager/cert-manager.gen.yaml
```
2024-05-21 11:05:53 -07:00
Jeff McCune
24346b9a38 (#172) Deploy v0.79.0 to dev 2024-05-17 10:15:05 -07:00
Jeff McCune
d3c2d55706 (#172) Deploy v0.76.0 to dev 2024-05-14 13:28:19 -07:00
Jeff McCune
9a2773c618 (#171) Refactor API to use FieldMasks
This patch refactors the API following the [API Best Practices][api]
documentation.  The UpdatePlatform method is modeled after a mutating
operation described [by Netflix][nflx] instead of using a REST resource
representation.  This makes it much easier to iterate over the fields
that need to be updated as the PlatformUpdateOperation is a flat data
structure while a Platform resource may have nested fields.  Nested
fields are more complicated and less clear to handle with a FieldMask.

This patch also adds a snapckbar message on save.  Previously, the save
button didn't give any indication of success or failure.  This patch
fixes the problem by adding a snackbar message that pop up at the bottom
of the screen nicely.

When the snackbar message is dismissed or times out the save button is
re-enabled.

[api]: https://protobuf.dev/programming-guides/api/
[nflx]: https://netflixtechblog.com/practical-api-design-at-netflix-part-2-protobuf-fieldmask-for-mutation-operations-2e75e1d230e4

Examples:

FieldMask for ListPlatforms

```
grpcurl -H "x-oidc-id-token: $(holos token)" -d @ ${HOLOS_SERVER##*/} holos.platform.v1alpha1.PlatformService.ListPlatforms <<EOF
{
  "org_id": "018f36fb-e3f7-7f7f-a1c5-c85fb735d215",
  "field_mask": { "paths": ["id","name"] }
}
EOF
```

```json
{
 "platforms": [
   {
     "id": "018f36fb-e3ff-7f7f-a5d1-7ca2bf499e94",
     "name": "bare"
   },
   {
     "id": "018f6b06-9e57-7223-91a9-784e145d998c",
     "name": "gary"
   },
   {
     "id": "018f6b06-9e53-7223-8ae1-1ad53d46b158",
     "name": "jeff"
   },
   {
     "id": "018f6b06-9e5b-7223-8b8b-ea62618e8200",
     "name": "nate"
   }
 ]
}
```

Closes: #171
2024-05-13 16:20:20 -07:00
Jeff McCune
19df2ec0fb (#167) Bump dev deployment to 0.74.0 2024-05-07 16:58:03 -07:00
Jeff McCune
42f916af41 (#164) Use quay.io/holos/oauth2-proxy:v7.6.0-1-g77a03ae2
Custom build to set samesite=none on the csrf cookie.
2024-05-06 16:18:32 -07:00
Jeff McCune
4e8fa5abda (#165) Bump dev deployment to 0.73.1 2024-05-06 11:22:24 -07:00
Jeff McCune
6894f45b6c (#165) Deploy Holos to Dev
This patch deploys holos to the dev environment on the k2 cluster.  It's
accessible at https://app.dev.k2.holos.run/ behind the auth proxy by
default.
2024-05-06 11:10:29 -07:00
Jeff McCune
62735b99e7 (#126) Update Tiltfile to use holos.run for dev
This patch updates the Tiltfile to use the holos.run domain which is
integrated with the default Gateway.
2024-04-22 13:42:18 -07:00
Jeff McCune
29ab9c6300 (#141) Install provisioner helper.rb from entrypoint
And add a script to reset the choria provisioner credentials and config.
2024-04-22 13:20:38 -07:00
Jeff McCune
c07f35ecd6 (#141) Fix holos controller invalid websocket connection error
Problem:
When the ingress default Gateway AuthorizationPolicy/authpolicy-custom
rule is in place the choria machine room holos controller fails to
connect to the provisioner broker with the following error:

```
❯ holos controller run --config=agent.cfg
WARN[0000] Starting controller version 0.68.1 with config file /home/jeff/workspace/holos-run/holos/hack/choria/agent/agent.cfg  leader=false
WARN[0000] Switching to provisioning configuration due to build defaults and missing /home/jeff/workspace/holos-run/holos/hack/choria/agent/agent.cfg
WARN[0000] Setting anonymous TLS mode during provisioning  component=server connection=coffee.home identity=coffee.home
WARN[0000] Initial connection to the Broker failed on try 1: invalid websocket connection  component=server connection=coffee.home identity=coffee.home
WARN[0000] Initial connection to the Broker failed on try 2: invalid websocket connection  component=server connection=coffee.home identity=coffee.home
WARN[0002] Initial connection to the Broker failed on try 3: invalid websocket connection  component=server connection=coffee.home identity=coffee.home
```

This problem is caused because the provisioning token url is set to
`wss://jeff.provision.dev.k2.holos.run:443` which has the port number
specified.

Solution:
Follow the upstream istio guidance of [Writing Host Match Policies][1]
to match host headers with or without the port specified.

Result:
The controller is able to connect to the provisioner broker:

[1]: https://istio.io/latest/docs/ops/best-practices/security/#writing-host-match-policies
2024-04-22 12:31:10 -07:00
Jeff McCune
74a181db21 (#133) Add missing Choria Provisioner deployment 2024-04-19 15:14:31 -07:00
Jeff McCune
ba10113342 (#133) Fix tls error when connecting to provisioner websocket
This problem fixes an error where the istio ingress gateway proxy failed
to verify the TLS certificate presented by the choria broker upstream
server.

    kubectl logs choria-broker-0

    level=error msg="websocket: TLS handshake error from 10.244.1.190:36142: remote error: tls: unknown certificate\n"

Istio ingress logs:

    kubectl -n istio-ingress logs -l app=istio-ingressgateway -f | grep --line-buffered '^{' | jq .

    "upstream_transport_failure_reason": "TLS_error:|268435581:SSL_routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED:TLS_error_end:TLS_error_end"

Client curl output:

    curl https://jeff.provision.dev.k2.holos.run

    upstream connect error or disconnect/reset before headers. retried and the latest reset reason: remote connection failure, transport failure reason: TLS_error:|268435581:SSL routines:OPENSSL_i
nternal:CERTIFICATE_VERIFY_FAILED:TLS_error_end:TLS_error_end

Explanation of error:

Istio defaults to expecting a tls certificate matching the downstream
host/authority which isn't how we've configured Choria.

Refer to [ClientTLSSettings][1]

> A list of alternate names to verify the subject identity in the
> certificate. If specified, the proxy will verify that the server
> certificate’s subject alt name matches one of the specified values. If
> specified, this list overrides the value of subject_alt_names from the
> ServiceEntry. If unspecified, automatic validation of upstream presented
> certificate for new upstream connections will be done based on the
> downstream HTTP host/authority header, provided
> VERIFY_CERTIFICATE_AT_CLIENT and ENABLE_AUTO_SNI environmental variables
> are set to true.

[1]: https://istio.io/latest/docs/reference/config/networking/destination-rule/#ClientTLSSettings
2024-04-19 13:13:09 -07:00
Jeff McCune
eb0207c92e (#133) Choria Provisioner
This patch is a work in progress to configure the provisioner to connect
to the broker.  Services and deployments are prefixed with choria for
clarity.
2024-04-19 13:13:08 -07:00
Jeff McCune
0fbcee8119 (#133) Extend the life of the Platform Issuer CA
The platform issuer root CA was set to expire after 90 days, the default
value.  This is too short.

Extend the life of the root CA beyond 100 years.
2024-04-19 11:29:17 -07:00
Jeff McCune
ce8bc798f6 (#133) Exclude nats and provision hosts from the auth proxy
Problem:
The identity aware auth proxy attached to the default gateway is
blocking access to NATS and the Choria Provisioner cluster.

Solution:
Add configuration that causes the project hosts to get added to the
exclusion list of the AuthorizationPolicy/authproxy-custom rule.

Result:
Requests bypass the auth proxy and go straight to the backend.  The
rules look like:

    kubectl get authorizationpolicy authproxy-custom -o yaml

```yaml
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: authproxy-custom
  namespace: istio-ingress
  labels:
    app.kubernetes.io/name: authproxy-custom
    app.kubernetes.io/part-of: istio-ingressgateway
spec:
  action: CUSTOM
  provider:
    name: ingressauth
  rules:
  - to:
    - operation:
        notHosts:
        - login.ois.run
        - vault.core.ois.run
        - provision.holos.run
        - nats.holos.run
        - provision.dev.holos.run
        - nats.dev.holos.run
        - jeff.provision.dev.holos.run
        - jeff.nats.dev.holos.run
        - gary.provision.dev.holos.run
        - gary.nats.dev.holos.run
        - nate.provision.dev.holos.run
        - nate.nats.dev.holos.run
        - provision.k2.holos.run
        - nats.k2.holos.run
        - provision.dev.k2.holos.run
        - nats.dev.k2.holos.run
        - jeff.provision.dev.k2.holos.run
        - jeff.nats.dev.k2.holos.run
        - gary.provision.dev.k2.holos.run
        - gary.nats.dev.k2.holos.run
        - nate.provision.dev.k2.holos.run
        - nate.nats.dev.k2.holos.run
    when:
    - key: request.headers[x-oidc-id-token]
      notValues:
      - '*'
  selector:
    matchLabels:
      istio: ingressgateway
```
2024-04-19 06:32:11 -07:00
Jeff McCune
996195d651 (#137) Fix ArgoCD PKCE login small comment 2024-04-18 14:39:13 -07:00
Jeff McCune
f00b29d3a3 (#137) Fix ArgoCD PKCE login
This patch configures ArgoCD to log in via PKCE.

Note the changes are primarily in platform.site.cue and ensuring the
emailDomain is set properly.  Note too the redirect URL needs to be
`/pkce/verify` when PKCE is enabled.  Finally, if the setting is
reconfigured make sure to clear cookies otherwise the incorrect
`/auth/callback` path may be used.
2024-04-18 14:37:06 -07:00
Jeff McCune
1642787825 (#101) Fix port names in servers must be unique: duplicate name https
Problem:
Port names in the default Gateway.spec.servers.port field must be unique
across all servers associated with the workload.

Solution:
Append the fully qualified domain name with dots replaced with hyphens.

Result:
Port name is unique.
2024-04-18 11:21:05 -07:00
Jeff McCune
f83781480f (#101) Do not add Gateway.spec.servers for other clusters
Problem:
The default gateway in one cluster gets server entries for all hosts in
the problem.  This makes the list unnecessarily large with entries for
clusters that should not be handled on the current cluster.

For example, the k2 cluster has gateway entries to route hosts for k1,
k3, k4, k5, etc...

Solution:
Add a field to the CertInfo definition representing which clusters the
host is valid on.

Result:
Hosts which are valid on all clusters, e.g. login.ois.run, have all
project clusters added to the clusters field of the CertInfo.  Hosts
which are valid on a single cluster have the coresponding single entry
added.

When building resources, holos components should check if `#ClusterName`
is a valid field of the CertInfo.clusters field.  If so, the host is
valid for the current cluster.  If not, the host should be omitted from
the current cluster.
2024-04-18 11:10:16 -07:00
Jeff McCune
9b70205855 (#101) Use letsencrypt production instead of staging
Certificates issue OK from staging, switching to production.
2024-04-18 10:36:28 -07:00
Jeff McCune
0e4bf3c144 (#101) Manage certs on all clusters
We're going to do a big re-issue so might as well do it once so we don't
have to re-issue again to add more clusters to existing projects.
2024-04-18 10:16:55 -07:00
Jeff McCune
1241c74b41 (#101) Do not add the project name as a project host
Doing so forces unnecessary hosts for some projects.  For example,
iam.ois.run is useless for the iam project, the primary project host is
login to build login.ois.run.

Some projects may not need any hosts as well.

Better to let the user specify `project: foo: hosts: foo: _` if they
want it.
2024-04-18 09:59:21 -07:00
Jeff McCune
44fea098de (#101) Manage an ExternalSecret for every Server in the default Gateway
This patch loops over every Gateway.spec.servers entry in the default
gateway and manages an ExternalSecret to sync the credential from the
provisioner cluster.
2024-04-18 09:53:39 -07:00
Jeff McCune
52286efa25 (#101) Fix duplicate certs in holos components
Problem:
A Holos Component is created for each project stage, but all hosts for
all stages in the project are added.  This creates duplicates.

Solution:
Sort project hosts by their stage and map the holos component for a
stage to the hosts for that stage.

Result:
Duplicates are eliminated, the prod certs are not in the dev holos
component and vice-versa.
2024-04-18 09:17:49 -07:00
Jeff McCune
a1b2179442 (#101) Remove holos-saas-certs holos component
No longer needed now that project host certs are using wildcards and
organized nicely.
2024-04-18 06:32:06 -07:00
Jeff McCune
cffc430738 (#101) Provision wildcard certs for all Gateway servers
This patch provisions wildcard certs in the provisioning cluster.  The
CN matches the project stage host global hostname without any cluster
qualifiers.

The use of a wildcard in place of the environment name dns segment at
the leftmost position of the fully qualified dns name enables additional
environments to be configured without reissuing certificates.

This is to avoid the 100 name per cert limit in LetsEncrypt.
2024-04-18 06:26:29 -07:00
Jeff McCune
d76454272b (#101) Simplify the GatewayServers struct
Mapping each project host fqdn to the stage is unnecessary.  The list of
gateway servers is constructed from each FQDN in the project.

This patch removes the unnecessary struct mappings.
2024-04-18 05:32:19 -07:00
Jeff McCune
9d1e77c00f (#101) Define #ProjectHosts to manage project hosts
Problem:
It's difficult to map and reduce the collection of project hosts when
configuring related Certificate, Gateway.spec.servers, VirtualService,
and auth proxy cookie domain settings.

Solution:
Define #ProjectHosts which takes a project and provides Hosts which is a
struct with a fqdn key and a #CertInfo value.  The #CertInfo definition
is intended to provide everything need to reduce the Hosts property to
structs usful for the problematic resources mentioned previously.

Result:
Gateway.spec.servers are mapped using #ProjectHosts

Next step is to map the Certificate resources on the provisioner
cluster.
2024-04-17 21:59:04 -07:00
Jeff McCune
2050abdc6c (#101) Add wildcard support to project certs
Problem:
Adding environments to a project causes certs to be re-issued.

Solution:
Enable wildcard certs for per-environment namespaces like jeff, gary,
nate, etc...

Result:
Environments can be added to a project stage without needing the cert to
be re-issued.
2024-04-17 12:32:44 -07:00
Jeff McCune
3ea013c503 (#101) Consolidate certificates by project stage
This patch avoids LetsEncrypt rate limits by consolidating multiple dns
names into one certificate.

For each project host, create a certificate for each stage in the
project.  The certificate contains the dns names for all clusters and
environments associated with that stage and host.

This can become quite a list, the limit is 100 dnsNames.

For the Holos project which has 7 clusters and 4 dev environments, the
number of dns names is 32 (4 envs + 4 envs * 7 clusters = 32 dns names).

Still, a much needed improvement because we're limited to 50 certs per
week.

It may be worth considering wildcards for the per-developer
environments, which are the ones we'll likely spin up the most
frequently.
2024-04-17 11:58:46 -07:00
Jeff McCune
309db96138 (#133) Choria Broker for Holos Controller provisioning
This patch is a partial step toward getting the choria broker up
and running in my own namespace.  The choria broker is necessary for
provisioning machine room agents such as the holos controller.
2024-04-17 08:48:31 -07:00
Jeff McCune
ab9bca0750 (#132) Controller Subcommand
This patch adds an initial holos controller subcommand.  The machine
room agent starts, but doesn't yet provision because we haven't deployed
the provisioning infrastructure yet.
2024-04-16 15:40:25 -07:00
Jeff McCune
ac2be67c3c (#130) NATS deployment with operator jwt
Configure NATS in a 3 Node deployment with resolver authentication using
an Operator JWT.

The operator secret nkeys are stored in the provisioner cluster.  Get
them with:

    holos get secret -n jeff-holos nats-nsc --print-key nsc.tgz | tar -tvzf-
2024-04-15 17:02:18 -07:00
Jeff McCune
2954a57872 (#120) Fix NATS target namespace
The upstream nats charts don't specify namespaces for each attribute.
This works with helm update, but not helm template which holos uses to
render the yaml.

The missing namespace causes flux to fail.

This patch uses the flux kustomization to add the target namespace to
all resources.
2024-04-10 21:54:58 -07:00
Jeff McCune
df705bd79f (#121) Fix Multiple Charts cause holos render to fail
When rendering a holos component which contains more than one helm chart, rendering fails.  It should succeed.

```
holos render --cluster-name=k2 /home/jeff/workspace/holos-run/holos/docs/examples/platforms/reference/clusters/holos/... --log-level debug
```

```
9:03PM ERR could not execute version=0.64.2 err="could not rename: rename /home/jeff/workspace/holos-run/holos/docs/examples/platforms/reference/clusters/holos/nats/envs/vendor553679311 /home/jeff/workspace/holos-run/holos/docs/examples/platforms/reference/clusters/holos/nats/envs/vendor: file exists" loc=helm.go:145
```

This patch fixes the problem by moving each child item of the temporary
directory charts are installed into.  This avoids the problem of moving
the parent when the parent target already exists.
2024-04-10 21:27:39 -07:00
Jeff McCune
4e8ce3585d (#115) Minor clean up of cue code 2024-04-10 21:21:16 -07:00
Jeff McCune
bf5765c9cb (#110) Update ZITADEL to v2.49.1 from v2.46.0
Attempt to resolve issue where `/oauth/v2/keys` returns `{"keys": []}`
causing id token verification failures.

Closes: #110
2024-04-07 17:20:10 -07:00
Jeff McCune
04158485c7 (#96) Do not expire ZITADEL signing public key
The public key needs to be configured along with the signing key.
2024-04-05 10:52:36 -07:00
Jeff McCune
cf83c77280 (#96) Do not expire ZITADEL signing private key
Without this patch users encounter an error from istio because it does
not have a valid Jwks from ZITADEL to verify the request when processing
a `RequestAuthentication` policy.

Fixes error `AuthProxy JWKS Error - Jwks doesn't have key to match kid or alg from Jwt`.

Occurs when accessing a protected URL for the first time after tokens have expired.
2024-04-04 15:56:00 -07:00
Jeff McCune
6e545b13dd (#104) Deploy crunchy monitoring stack for ZITADEL
Not exposed via the ingress gateway, but accessible via

    kubectl port-forward svc/crunchy-grafana 3000

Refer to [day two monitoring][1].  This is pretty much a straight copy
of the upstream kustomize configs at [2].

[1]: https://access.crunchydata.com/documentation/postgres-operator/5.5/tutorials/day-two/monitoring
[2]: https://github.com/CrunchyData/postgres-operator-examples/tree/main/kustomize/monitoring
2024-04-04 15:40:07 -07:00
Jeff McCune
bf258a1f41 (#104) Enable monitoring for ZITADEL postgres
This patch enables the monitoring configuration for the ZITADEL postgres
cluster.

Refer to: https://access.crunchydata.com/documentation/postgres-operator/5.5/tutorials/day-two/monitoring

Integrating with:
https://github.com/CrunchyData/postgres-operator-examples/tree/main/kustomize/monitoring
which will become a separate holos component instance.
2024-04-03 22:26:38 -07:00
Jeff McCune
6f06c73d6f (#85) Initial addition of kube-prometheus-stack
Grafana does not yet have the istio sidecar.  Prometheus is accessible
through the auth proxy.  Cert manager is added to the workload clusters
so tls certs can be issued for webhooks, the kube-prom-stack helm chart
uses cert manager for this purpose.

With this patch Grafana is integrated with OIDC and I'm able to log in
as an Administrator.
2024-04-03 21:29:26 -07:00
Jeff McCune
bcb02b5c5c (#47) Remove the prod-iam-zitadel namespace
No longer needed, cluster has moved to prod-iam namespace.
2024-04-03 15:10:30 -07:00
Jeff McCune
0736c7de1a (#47) Bind ALL VirtualServices to the default gateway
Problem:
The VirtualService that catches auth routes for paths, e.g.
`/holos/authproxy/istio-ingress` is bound to the default gateway which
no longer exists because it has no hosts.

Solution:
It's unnecessary and complicated to create a Gateway for every project.
Instead, put all server entries into one `default` gateway and
consolidate the list using CUE.

Result:
It's easier to reason about this system.  There is only one ingress
gateway, `default` and everything gets added to it.  VirtualServices
need only bind to this gateway, which has a hosts entry appropriately
namespaced for the project.
2024-04-03 14:56:40 -07:00
Jeff McCune
28be9f9fbb (#47) Use the project specific Gateway
The login service is unavailable because the wrong gateway is used.
When using projects the VS needs to attach to the correct Gateway.
2024-04-03 12:59:48 -07:00
Jeff McCune
647681de38 (#99) Restore backups from prod-iam namespace
This patch configures the standby cluster to restore backups from the
prod-iam namespace instead of the prod-iam-zitadel namespace.
2024-04-03 12:30:12 -07:00
Jeff McCune
81beb5c539 (#47) Restore ZITADEL from existing backups
Problem:
The ZITADEL database isn't restoring into the prod-iam namespace after
moving from prod-iam-zitadel because no backup exists at the bucket
path.

Solution:
Hard-code the path to the old namespace to restore the database.  We'll
figure out how to move the backups to the new location in a follow up
change.
2024-04-03 11:44:16 -07:00
Jeff McCune
5c1e0a29c8 (#47) Have Ceph depend on secret stores
Another kustomization reconciling too early.
2024-04-03 11:22:15 -07:00
Jeff McCune
01ac5276a9 (#47) Have Gateway depend on secret stores
The `prod-platform-gateway` kustomization is reconciling early:

ExternalSecret/istio-ingress/argocd.ois.run dry-run failed: failed to
get API group resources: unable to retrieve the complete list of server
APIs: external-secrets.io/v1beta1: the server could not find the
requested resource
2024-04-03 11:20:15 -07:00