* Add a `coredns` variable to configure the CoreDNS manifests,
with an `enable` field to determine whether CoreDNS manifests
are applied to the cluster during provisioning (default true)
* Add a `kube-proxy` variable to configure kube-proxy manifests,
with an `enable` field to determine whether the kube-proxy
Daemonset is applied to the cluster during provisioning (default
true)
* These optional allow for provisioning clusters without CoreDNS
or kube-proxy, so these components can be customized or managed
through separate plan/apply processes or automation
* Cilium v1.14 seems to have problems reliably accessing the
apiserver via default in-cluster service discovery (relies on
kube-proxy instead of DNS) after some time
* Configure Cilium agents to use the DNS name resolving to the
cluster's load balanced apiserver and port. Regrettably, this
relies on external DNS rather than being self-contained, but its
what Cilium pushes users towards
* Allow the Kubelet kubeconfig to get/list workloads and
evict pods to perform drain operations, via the kubelet-delete
ClusterRole bound to the system:nodes group
* Previously, the ClusterRole only allowed node deletion
* Mount both /etc/ssl/certs and /etc/pki into control plane static
pods and kube-proxy, rather than choosing one based a variable
(set based on Flatcar Linux or Fedora CoreOS)
* Remove `trusted_certs_dir` variable
* Remove deprecated `--port` from `kube-scheduler` static Pod
* seccomp graduated to GA in Kubernetes v1.19. Support
for seccomp alpha annotations will be removed in v1.22
* Replace seccomp annotations with the GA seccompProfile
field in the PodTemplate securityContext
* Switch profile from `docker/default` to `runtime/default`
(no effective change, since docker is the runtime)
* Verify with docker inspect SecurityOpt. Without the
profile, you'd see `seccomp=unconfined`
Related:
* https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.19.md#seccomp-graduates-to-general-availability
* Use node label `node.kubernetes.io/controller` to select
controller nodes (action required)
* Tolerate node taint `node-role.kubernetes.io/controller`
for workloads that should run on controller nodes. Don't
tolerate `node-role.kubernetes.io/master` (action required)
* Add nodeAffinity to CoreDNS deployment PodSpec to
prefer running CoreDNS pods on controllers, while
relying on podAntiAffinity for spreading.
* For single master clusters, running two CoreDNS pods
on the master or running one pod on a worker is
permissible.
* Note: Its still _possible_ to end up with CoreDNS pods
all running on workers since we only express scheduling
preference ("soft"), but unlikely. Plus the motivating
scenario (below) is also rare.
Background:
* CoreDNS replicas are set to the higher of 2 or the
number of control plane nodes to (at a minimum) support
Deployment updates or pod restarts and match the cluster
size (e.g. 5 master/controller nodes likely means a
larger cluster, so run 5 CoreDNS replicas)
* In the past (before v1.14), we required kube-dns (CoreOS
predecessor) to run CoreDNS pods on master nodes. With
CoreDNS this node selection was relaxed. We'd like a
gentler form of it now.
Motivation:
* On clusters using 100% preemptible/spot workers, it is
possible that CoreDNS pods schedule to workers that are all
preempted at the same time, causing a loss of cluster internal
DNS service until a CoreDNS pod reschedules (1 min). We'd like
CoreDNS to prefer controller/master nodes (which aren't preempted)
to reduce the possibility of control plane disruption
* Enable bootstrap token authentication on kube-apiserver
* Generate the bootstrap.kubernetes.io/token Secret that
may be used as a bootstrap token
* Generate a bootstrap kubeconfig (with a bootstrap token)
to be securely distributed to nodes. Each Kubelet will use
the bootstrap kubeconfig to authenticate to kube-apiserver
as `system:bootstrappers` and send a node-unique CSR for
kube-controller-manager to automatically approve to issue
a Kubelet certificate and kubeconfig (expires in 72 hours)
* Add ClusterRoleBinding for bootstrap token subjects
(`system:bootstrappers`) to have the `system:node-bootstrapper`
ClusterRole
* Add ClusterRoleBinding for bootstrap token subjects
(`system:bootstrappers`) to have the csr nodeclient ClusterRole
* Add ClusterRoleBinding for bootstrap token subjects
(`system:bootstrappers`) to have the csr selfnodeclient ClusterRole
* Enable NodeRestriction admission controller to limit the
scope of Node or Pod objects a Kubelet can modify to those of
the node itself
* Ability for a Kubelet to delete its Node object is retained
as preemptible nodes or those in auto-scaling instance groups
need to be able to remove themselves on shutdown. This need
continues to have precedence over any risk of a node deleting
itself maliciously
Security notes:
1. Issued Kubelet certificates authenticate as user `system:node:NAME`
and group `system:nodes` and are limited in their authorization
to perform API operations by Node authorization and NodeRestriction
admission. Previously, a Kubelet's authorization was broader. This
is the primary security motivation.
2. The bootstrap kubeconfig credential has the same sensitivity
as the previous generated TLS client-certificate kubeconfig.
It must be distributed securely to nodes. Its compromise still
allows an attacker to obtain a Kubelet kubeconfig
3. Bootstrapping Kubelet kubeconfig's with a limited lifetime offers
a slight security improvement.
* An attacker who obtains the kubeconfig can likely obtain the
bootstrap kubeconfig as well, to obtain the ability to renew
their access
* A compromised bootstrap kubeconfig could plausibly be handled
by replacing the bootstrap token Secret, distributing the token
to new nodes, and expiration. Whereas a compromised TLS-client
certificate kubeconfig can't be revoked (no CRL). However,
replacing a bootstrap token can be impractical in real cluster
environments, so the limited lifetime is mostly a theoretical
benefit.
* Cluster CSR objects are visible via kubectl which is nice
4. Bootstrapping node-unique Kubelet kubeconfigs means Kubelet
clients have more identity information, which can improve the
utility of audits and future features
Rel: https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet-tls-bootstrapping/
* Change kube-proxy, flannel, and calico-node DaemonSet
tolerations to tolerate `node.kubernetes.io/not-ready`
and `node-role.kubernetes.io/master` (i.e. controllers)
explicitly, rather than tolerating all taints
* kube-system DaemonSets will no longer tolerate custom
node taints by default. Instead, custom node taints must
be enumerated to opt-in to scheduling/executing the
kube-system DaemonSets.
Background: Tolerating all taints ruled out use-cases
where certain nodes might legitimately need to keep
kube-proxy or CNI networking disabled
* Kubernetes plans to stop releasing the hyperkube image in
the future.
* Upstream will continue releasing container images for
`kube-apiserver`, `kube-controller-manager`, `kube-proxy`,
and `kube-scheduler`. Typhoon will use these images
* Upstream will release the kubelet as a binary for distros
to package, either as a traditional DEB/RPP or as a container
image for container-optimized operating systems. Typhoon will
take on the packaging of Kubelet and its dependencies as a new
container image (alongside kubectl)
Rel: https://github.com/kubernetes/kubernetes/pull/88676
See: https://github.com/poseidon/kubelet
* Set kube-proxy --metrics-bind-address to 0.0.0.0 (default
127.0.0.1) so Prometheus metrics can be scraped
* Add pod port list (informational only)
* Require node firewall rules to be updated before scrapes
can succeed
* Kubernetes v1.11 considered kube-proxy IPVS mode GA
* Many problems were found https://github.com/poseidon/typhoon/pull/321
* Since then, major blockers seem to have been addressed
* Run kube-apiserver, kube-scheduler, and kube-controller-manager
as static pods on each controller node
* Boostrap a minimal control plane by copying `static-manifests`
to the Kubelet `--pod-manifest-path` and tls/auth secrets to
`/etc/kubernetes/bootstrap-secrets`. Then, kubectl apply Kubernetes
manifests.
* Discontinue using bootkube to bootstrap and pivot to a self-hosted
control plane.
* Remove bootkube self-hosted kube-apiserver DaemonSet and
kube-scheduler and kube-controller-manager Deployments
* Remove pod-checkpointer manifests (no longer needed)
Advantages:
* Reduce control plane bootstrapping complexity. Self-hosted pivot and
pod checkpointing worked well, but in-place edits to kube-apiserver,
kube-controller-manager, or kube-scheduler is infrequently used. The
concept was originally geared toward continuously in-place upgrading
clusters, a goal Typhoon doesn't take on (rec. blue/green clusters).
As such, the value-add isn't justifying the extra components for this
particular project.
* Static pods still provide kubectl visibility and log access
Drawbacks:
* In-place edits to kube-apiserver, kube-controller-manager, and
kube-scheduler are not possible via kubectl (non-goal)
* Assets must be copied to each controller (not just one)
* Static pod must load credentials via hostPath, which is less clean
compared with the former Kubernetes secrets and service accounts
* Require bootstrap-kube-apiserver and kube-apiserver components
listen on port 6443 (internally) to allow kube-apiserver pods to
run with lower user privilege
* Remove variable `apiserver_port`. The kube-apiserver listen
port is no longer customizable.
* Add variable `external_apiserver_port` to allow architectures
where a load balancer fronts kube-apiserver 6443 backends, but
listens on a different port externally. For example, Google Cloud
TCP Proxy load balancers cannot listen on 6443
* Add `ready` plugin and change the readinessProbe to check
default port 8181 to ensure all plugins are ready
* `upstream [ADDRESS]` defines upstream resolvers for external
services. If no address is given, resolution is against CoreDNS
itself, which is the default. So `upstream` can be removed
* Add an `enable_aggregation` variable to enable the kube-apiserver
aggregation layer for adding extension apiservers to clusters
* Aggregation is **disabled** by default. Typhoon recommends you not
enable aggregation. Consider whether less invasive ways to achieve
your goals are possible and whether those goals are well-founded
* Enabling aggregation and extension apiservers increases the attack
surface of a cluster and makes extensions a part of the control plane.
Admins must scrutinize and trust any extension apiserver used.
* Passing a v1.14 CNCF conformance test requires aggregation be enabled.
Having an option for aggregation keeps compliance, but retains the stricter
security posture on default clusters
* Resolve in-addr.arpa and ip6.arpa DNS PTR requests for Kubernetes
service IPs and pod IPs
* Previously, CoreDNS was configured to resolve in-addr.arpa PTR
records for service IPs (but not pod IPs)
* Priority Admission Controller has been enabled since Typhoon
v1.11.1
* Assign cluster and node components a builtin priorityClassName
(higher is higher priority) to inform scheduler prepemption,
scheduling order, and node out-of-resource eviction order
* Fix a regression caused by lowering the Kubelet TLS client
certificate to system:nodes group (#100) since dropping
cluster-admin dropped the Kubelet's ability to delete nodes.
* On clouds where workers can scale down (manual terraform apply,
AWS spot termination, Azure low priority deletion), worker shutdown
runs the delete-node.service to remove a node to prevent NotReady
nodes from accumulating
* Allow Kubelets to delete cluster nodes via system:nodes group. Kubelets
acting with system:node and kubelet-delete ClusterRoles is still an
improvement over acting as cluster-admin
* Allow kube-controller-manager to sign Approved CSR's using the
cluster CA private key to issue cluster certificates
* System components that need to use certificates signed by the
cluster CA can submit a CSR to the apiserver, have an admin
inspect and manually approve it, and be issued a certificate
* Admins should inspect CSRs very carefully to ensure their
origin and authorization level are appropriate
* https://kubernetes.io/docs/tasks/tls/managing-tls-in-a-cluster/#approving-certificate-signing-requests
* Change Kubelet TLS client certificate to belong to the system:nodes
group instead of the system:masters group (more limited)
* Bind the system:node ClusterRole to the system:nodes group (yes,
the ClusterRole is singular)
* Generate separate admin.crt and admin.key files (which do still use
system:masters). Output kubeconfig-kubelet and kubeconfig-admin values
from the module
* Remove the kubeconfig output to force users to pick the correct
kubeconfig, depending on how the output is used (action required!)
Related:
* https://kubernetes.io/docs/reference/access-authn-authz/rbac/#core-component-roles
Note, NodeAuthorizer/NodeRestriction would be an enhancement, but to
work across platforms it effectively requires TLS bootstraping which
doesn't have a viable attestation strategy and clashes with CCM. This
change improves Kubelet limitations, but intentionally doesn't aim to
steer toward NodeAuthorizer/NodeRestriction
* Switch kube-apiserver from using the kube-system default ServicAccount
(with cluster-admin) to using a kube-apiserver ServiceAccount bound to
cluster-admin (as before)
* Remove the default-sa ClusterRoleBinding that allowed kube-apiserver
and kube-scheduler (or other 3rd-party components added to kube-system)
to use the kube-system default ServiceAccount for cluster-admin
* Require all future components in kube-system define their own
ServiceAccount
* Switch kube-scheduler from using the kube-system default ServiceAccount
(with cluster-admin) to using a kube-scheduler ServiceAccount bound to
the builtin system:kube-scheduler and system:volume-scheduler
(required for StorageClass) ClusterRoles
* https://kubernetes.io/docs/reference/access-authn-authz/rbac/#core-component-roles
* loop sends an initial query to detect infinite forwarding
loops in configured upstream DNS servers and fast exit with
an error (its a fatal misconfiguration on the network that
will otherwise cause resolvers to consume memory/CPU until
crashing, masking the problem)
* https://github.com/coredns/coredns/tree/master/plugin/loop
* loadbalance randomizes the ordering of A, AAAA, and MX records
in responses to provide round-robin load balancing (as usual,
clients may still cache responses though)
* https://github.com/coredns/coredns/tree/master/plugin/loadbalance
* Prefer InternalIP and ExternalIP over the node's hostname,
to match upstream behavior and kubeadm
* Previously, hostname-override was used to set node names
to internal IP's to work around some cloud providers not
resolving hostnames for instances (e.g. DO droplets)
* Run at least two replicas of CoreDNS to better support
rolling updates (previously, kube-dns had a pod nanny)
* On multi-master clusters, set the CoreDNS replica count
to match the number of masters (e.g. a 3-master cluster
previously used replicas:1, now replicas:3)
* Add AntiAffinity preferred rule to favor distributing
CoreDNS pods across nodes
* Continue to ensure scheduler and controller-manager run
at least two replicas to support performing kubectl edits
on single-master clusters (no change)
* For multi-master clusters, set scheduler / controller-manager
replica count to the number of masters (e.g. a 3-master cluster
previously used replicas:2, now replicas:3)