* Migration from a self-hosted to a static pod control plane dropped
a few kube-controller-manager customizations
* Reduce kube-controller-manager --pod-eviction-timeout from 5m to 1m
to move pods more quickly when nodes are preempted
* Fix flex-volume-plugin-dir since the Kubernetes default points to
a read-only filesystem on Container Linux / Fedora CoreOS
Related:
* https://github.com/poseidon/terraform-render-bootstrap/pull/148
* 7b06557b7a
* Reduce kube-apiserver and kube-controller-manager CPU
requests from 200m to 150m. Prefer slightly lower commitment
after running with the requests chosen in #161 for a while
* Reduce calico-node CPU request from 150m to 100m to match
CoreDNS and flannel
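As a rough illustration of what these requests add up to, the sketch below converts Kubernetes CPU quantities to millicores and sums the values named above. The `parse_cpu` helper and the `requests` mapping are hypothetical names for this example only; kube-scheduler is left out because its request is not stated here.

```python
def parse_cpu(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity ("150m", "1", "0.5") to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(round(float(quantity) * 1000))


# Requests taken from the notes above (illustrative grouping, not a manifest).
requests = {
    "kube-apiserver": "150m",
    "kube-controller-manager": "150m",
    "calico-node": "100m",
}

total_millicores = sum(parse_cpu(q) for q in requests.values())
print(f"requested: {total_millicores}m")
```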
* Set small CPU requests on static pods kube-apiserver,
kube-controller-manager, and kube-scheduler to align with
upstream tooling and for edge cases
* Control plane nodes are tainted to isolate them from
ordinary workloads. Even dense workloads can only compress
CPU resources on worker nodes.
* Control plane static pods use the highest priority class, so
contention favors control plane pods (over, say, node-exporter)
and CPU is compressible too.
* Effectively, a practical case for these requests hasn't been
observed. However, a small static pod CPU request may offer
a slight benefit if a controller became overloaded and the
above mechanisms were insufficient for some reason (bit of a
stretch, due to CPU compressibility)
* Continue to avoid setting a memory request for static pods.
It would impose a hard size requirement on controller nodes,
which isn't warranted and is handled more gently by Typhoon
default instance types across clouds and via docs
* Change calico-node livenessProbe from httpGet to an exec of
`calico-node -felix-ready`, as recommended by Calico
* Allow advertising Kubernetes service ClusterIPs
* Kubernetes v1.11 considered kube-proxy IPVS mode GA
* Many problems were found: https://github.com/poseidon/typhoon/pull/321
* Since then, major blockers seem to have been addressed
* Run kube-apiserver, kube-scheduler, and kube-controller-manager
as static pods on each controller node
* Bootstrap a minimal control plane by copying `static-manifests`
to the Kubelet `--pod-manifest-path` and tls/auth secrets to
`/etc/kubernetes/bootstrap-secrets`. Then, kubectl apply Kubernetes
manifests.
* Discontinue using bootkube to bootstrap and pivot to a self-hosted
control plane.
* Remove bootkube self-hosted kube-apiserver DaemonSet and
kube-scheduler and kube-controller-manager Deployments
* Remove pod-checkpointer manifests (no longer needed)
Advantages:
* Reduce control plane bootstrapping complexity. Self-hosted pivot and
pod checkpointing worked well, but in-place edits to kube-apiserver,
kube-controller-manager, or kube-scheduler are infrequently used. The
concept was originally geared toward continuously upgrading clusters
in place, a goal Typhoon doesn't take on (it recommends blue/green
clusters instead). As such, the added value doesn't justify the extra
components for this particular project.
* Static pods still provide kubectl visibility and log access
Drawbacks:
* In-place edits to kube-apiserver, kube-controller-manager, and
kube-scheduler are not possible via kubectl (non-goal)
* Assets must be copied to each controller (not just one)
* Static pods must load credentials via hostPath, which is less clean
compared with the former Kubernetes secrets and service accounts
* Require bootstrap-kube-apiserver and kube-apiserver components
listen on port 6443 (internally) to allow kube-apiserver pods to
run with lower user privilege
* Remove variable `apiserver_port`. The kube-apiserver listen
port is no longer customizable.
* Add variable `external_apiserver_port` to allow architectures
where a load balancer fronts kube-apiserver 6443 backends, but
listens on a different port externally. For example, Google Cloud
TCP Proxy load balancers cannot listen on 6443
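A minimal sketch of the split between the fixed internal port and the external port described above. The function name `kubeconfig_server`, the default value, and the example hostname are assumptions made for illustration; only `external_apiserver_port` and port 6443 come from the notes.

```python
INTERNAL_APISERVER_PORT = 6443  # fixed listen port for kube-apiserver backends


def kubeconfig_server(api_server_name: str, external_apiserver_port: int = 6443) -> str:
    """Build the URL clients dial, which may differ from the backend port."""
    return f"https://{api_server_name}:{external_apiserver_port}"


# Hypothetical load balancer that listens on 443 and forwards to 6443 backends.
print(kubeconfig_server("cluster.example.com", external_apiserver_port=443))
print(f"backends listen on {INTERNAL_APISERVER_PORT}")
```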
* Add `ready` plugin and change the readinessProbe to check
default port 8181 to ensure all plugins are ready
* `upstream [ADDRESS]` defines upstream resolvers for external
services. If no address is given, resolution is against CoreDNS
itself, which is the default. So `upstream` can be removed
* Change flannel port from the kernel default 8472 to the
IANA assigned VXLAN port 4789
* Requires a change to firewall rules or security groups
depending on the platform (**action required!**)
* Why now? Calico now offers its own VXLAN backend so
standardizing on the IANA port simplifies configuration
* https://github.com/coreos/flannel/blob/master/Documentation/backends.md#vxlan
* Accept a `network_encapsulation` variable to choose whether the
default IPPool should use ipip (default) or vxlan encapsulation
* Use `network_mtu` as the MTU for workload interfaces for ipip
or vxlan (although Calico can have IPPools with a mix, we're
picking ipip xor vxlan)
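To show how the two variables interact, here is a small sketch that derives a workload MTU from the underlying network MTU. It assumes the usual IPv4 header overheads (20 bytes for IP-in-IP, 50 bytes for VXLAN); the helper name `workload_mtu` is hypothetical, and in practice `network_mtu` is a value you set yourself.

```python
# Per-packet encapsulation overhead in bytes over an IPv4 underlay.
ENCAPSULATION_OVERHEAD = {
    "ipip": 20,   # one extra outer IPv4 header
    "vxlan": 50,  # outer IPv4 + UDP + VXLAN header + inner Ethernet header
}


def workload_mtu(underlay_mtu: int, network_encapsulation: str = "ipip") -> int:
    """Return the largest MTU workload interfaces can use without fragmentation."""
    try:
        overhead = ENCAPSULATION_OVERHEAD[network_encapsulation]
    except KeyError:
        raise ValueError(f"unsupported encapsulation: {network_encapsulation!r}")
    return underlay_mtu - overhead


print(workload_mtu(1500, "ipip"))
print(workload_mtu(1500, "vxlan"))
```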
* Add an `enable_aggregation` variable to enable the kube-apiserver
aggregation layer for adding extension apiservers to clusters
* Aggregation is **disabled** by default. Typhoon recommends you not
enable aggregation. Consider whether less invasive ways to achieve
your goals are possible and whether those goals are well-founded
* Enabling aggregation and extension apiservers increases the attack
surface of a cluster and makes extensions a part of the control plane.
Admins must scrutinize and trust any extension apiserver used.
* Passing a v1.14 CNCF conformance test requires aggregation be enabled.
Having an option for aggregation keeps compliance, but retains the stricter
security posture on default clusters
* calico-node uses only a small fraction of its CPU request
(i.e. reservation) even under stress. The unbounded limit
already allows usage to scale favorably in bursty cases
* Motivation: On instance types that skew memory-optimized
(e.g. GCP n1), over-requesting can push the system toward
overcommitment (alerts can be tuned)
* Overcommitment is not necessarily bad, but 250m seems too
generous a minimum given the actual usage
* Add calico-ipam CRDs and RBAC permissions
* Switch IPAM from host-local to calico-ipam!
* `calico-ipam` subnets `ippools` (defaults to pod CIDR) into
`ipamblocks` (defaults to /26, but set to /24 in Typhoon)
* `host-local` subnets the pod CIDR based on the node PodCIDR
field (set via kube-controller-manager as /24s)
* Create a custom default IPv4 IPPool to ensure the block size
is kept at /24 to allow 110 pods per node (Kubernetes default)
* Retaining host-local was slightly preferred, but Calico v3.6
is migrating all usage to calico-ipam. The codepath that skipped
calico-ipam for KDD was removed
* https://docs.projectcalico.org/v3.6/release-notes/
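The block size arithmetic behind the /24 choice can be checked with a short sketch. The pod CIDR `10.2.0.0/16` is an assumed example value, and the helper names are hypothetical; the sketch only counts addresses per block and does not model how calico-ipam assigns blocks to nodes.

```python
import ipaddress

MAX_PODS_PER_NODE = 110  # Kubernetes default mentioned above
POD_CIDR = ipaddress.ip_network("10.2.0.0/16")  # assumed example pod CIDR


def block_capacity(prefix_len: int) -> int:
    """Number of IPv4 addresses in a single IPAM block of this prefix length."""
    first_block = next(POD_CIDR.subnets(new_prefix=prefix_len))
    return first_block.num_addresses


def blocks_in_pool(prefix_len: int) -> int:
    """Number of blocks of this size that fit in the pod CIDR."""
    return 2 ** (prefix_len - POD_CIDR.prefixlen)


for prefix in (26, 24):
    capacity = block_capacity(prefix)
    fits = capacity >= MAX_PODS_PER_NODE
    print(f"/{prefix}: {capacity} addresses per block, "
          f"{blocks_in_pool(prefix)} blocks, fits 110 pods in one block: {fits}")
```

A single /26 block holds 64 addresses, fewer than 110, while a /24 block holds 256.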
* Resolve in-addr.arpa and ip6.arpa DNS PTR requests for Kubernetes
service IPs and pod IPs
* Previously, CoreDNS was configured to resolve in-addr.arpa PTR
records for service IPs (but not pod IPs)
* Priority Admission Controller has been enabled since Typhoon
v1.11.1
* Assign cluster and node components a builtin priorityClassName
(higher is higher priority) to inform scheduler preemption,
scheduling order, and node out-of-resource eviction order
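A simplified sketch of how the builtin classes order pods. The two priority values are the ones Kubernetes defines for `system-node-critical` and `system-cluster-critical`; the pod-to-class assignments in the example are hypothetical, and real eviction also weighs resource usage against requests, which this sketch ignores.

```python
# Values of the two builtin Kubernetes priority classes.
BUILTIN_PRIORITY = {
    "system-node-critical": 2000001000,
    "system-cluster-critical": 2000000000,
}


def scheduling_order(pods):
    """Sort (name, priorityClassName) pairs from highest to lowest priority.

    Pods without a class get priority 0, the default when no global default
    PriorityClass exists. The sort is stable, so ties keep their input order.
    """
    return sorted(pods, key=lambda pod: BUILTIN_PRIORITY.get(pod[1], 0), reverse=True)


# Hypothetical assignments, for illustration only.
pods = [
    ("my-app", None),
    ("coredns", "system-cluster-critical"),
    ("kube-proxy", "system-node-critical"),
]

for name, class_name in scheduling_order(pods):
    print(name, class_name)
```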
* Fix a regression caused by lowering the Kubelet TLS client
certificate to system:nodes group (#100) since dropping
cluster-admin dropped the Kubelet's ability to delete nodes.
* On clouds where workers can scale down (manual terraform apply,
AWS spot termination, Azure low priority deletion), worker shutdown
runs the delete-node.service to remove a node to prevent NotReady
nodes from accumulating
* Allow Kubelets to delete cluster nodes via system:nodes group. Kubelets
acting with system:node and kubelet-delete ClusterRoles is still an
improvement over acting as cluster-admin
* Allow kube-controller-manager to sign Approved CSRs using the
cluster CA private key to issue cluster certificates
* System components that need to use certificates signed by the
cluster CA can submit a CSR to the apiserver, have an admin
inspect and manually approve it, and be issued a certificate
* Admins should inspect CSRs very carefully to ensure their
origin and authorization level are appropriate
* https://kubernetes.io/docs/tasks/tls/managing-tls-in-a-cluster/#approving-certificate-signing-requests
* Provide an admin kubeconfig which includes a named context
and also sets that context as the current-context
* Retains support for both the KUBECONFIG=path style of usage and
adding many kubeconfigs to a ~/.kube/configs folder and using
`kubectl config use-context CLUSTER-context`
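The shape of such a kubeconfig can be sketched as follows. The function name, the cluster and user naming, and the placeholder credential strings are assumptions for illustration; only the `CLUSTER-context` naming and the use of `current-context` come from the notes above. The output is JSON, which is also valid YAML.

```python
import json


def admin_kubeconfig(cluster_name, server, ca_data, cert_data, key_data):
    """Build a kubeconfig with a named context that is also the current-context."""
    context_name = f"{cluster_name}-context"
    return {
        "apiVersion": "v1",
        "kind": "Config",
        "clusters": [{
            "name": f"{cluster_name}-cluster",
            "cluster": {"server": server, "certificate-authority-data": ca_data},
        }],
        "users": [{
            "name": f"{cluster_name}-user",
            "user": {"client-certificate-data": cert_data, "client-key-data": key_data},
        }],
        "contexts": [{
            "name": context_name,
            "context": {"cluster": f"{cluster_name}-cluster", "user": f"{cluster_name}-user"},
        }],
        "current-context": context_name,
    }


# Placeholder values; real files carry base64-encoded PEM data.
config = admin_kubeconfig("mercury", "https://mercury.example.com:6443",
                          "CA_DATA", "CERT_DATA", "KEY_DATA")
print(json.dumps(config, indent=2))
```

Because the context is named after the cluster, several such files can be merged without their contexts colliding, while a single file still works on its own via `current-context`.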
* Change Kubelet TLS client certificate to belong to the system:nodes
group instead of the system:masters group (more limited)
* Bind the system:node ClusterRole to the system:nodes group (yes,
the ClusterRole is singular)
* Generate separate admin.crt and admin.key files (which do still use
system:masters). Output kubeconfig-kubelet and kubeconfig-admin values
from the module
* Remove the kubeconfig output to force users to pick the correct
kubeconfig, depending on how the output is used (action required!)
Related:
* https://kubernetes.io/docs/reference/access-authn-authz/rbac/#core-component-roles
Note, NodeAuthorizer/NodeRestriction would be an enhancement, but to
work across platforms it effectively requires TLS bootstrapping which
doesn't have a viable attestation strategy and clashes with CCM. This
change improves Kubelet limitations, but intentionally doesn't aim to
steer toward NodeAuthorizer/NodeRestriction
* Switch kube-apiserver from using the kube-system default ServiceAccount
(with cluster-admin) to using a kube-apiserver ServiceAccount bound to
cluster-admin (as before)
* Remove the default-sa ClusterRoleBinding that allowed kube-apiserver
and kube-scheduler (or other 3rd-party components added to kube-system)
to use the kube-system default ServiceAccount for cluster-admin
* Require all future components in kube-system define their own
ServiceAccount
* Switch kube-scheduler from using the kube-system default ServiceAccount
(with cluster-admin) to using a kube-scheduler ServiceAccount bound to
the builtin system:kube-scheduler and system:volume-scheduler
(required for StorageClass) ClusterRoles
* https://kubernetes.io/docs/reference/access-authn-authz/rbac/#core-component-roles
* loop sends an initial query to detect infinite forwarding
loops in configured upstream DNS servers and exits fast with
an error (it's a fatal misconfiguration on the network that
will otherwise cause resolvers to consume memory/CPU until
crashing, masking the problem)
* https://github.com/coredns/coredns/tree/master/plugin/loop
* loadbalance randomizes the ordering of A, AAAA, and MX records
in responses to provide round-robin load balancing (as usual,
clients may still cache responses though)
* https://github.com/coredns/coredns/tree/master/plugin/loadbalance
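The idea behind the loop check can be illustrated with a toy model. This is not how the CoreDNS plugin is implemented (it sends a real probe query and watches for it to come back); the sketch just follows a hypothetical map of "resolver forwards to resolver" and reports whether a query starting at one resolver would circle forever.

```python
def has_forwarding_loop(start: str, forwards_to: dict) -> bool:
    """Return True if following forwarders from `start` revisits a resolver."""
    seen = set()
    current = start
    # Walk the forwarding chain until it ends at a resolver with no upstream.
    while current in forwards_to:
        if current in seen:
            return True
        seen.add(current)
        current = forwards_to[current]
    return False


# Hypothetical topologies: a healthy chain and a misconfigured one where the
# node's local stub resolver points back at the cluster DNS.
healthy = {"coredns": "node-stub", "node-stub": "upstream-dns"}
looped = {"coredns": "node-stub", "node-stub": "coredns"}

print(has_forwarding_loop("coredns", healthy))
print(has_forwarding_loop("coredns", looped))
```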
* Organize flannel and Calico manifests to use consistent
naming, structure, and ordering so the two align
* Downside: Makes direct diff'ing with upstream harder, but
that's become difficult lately anyway, since Calico uses a
templating engine
* Prefer InternalIP and ExternalIP over the node's hostname,
to match upstream behavior and kubeadm
* Previously, hostname-override was used to set node names
to internal IP's to work around some cloud providers not
resolving hostnames for instances (e.g. DO droplets)