* flannel and Cilium default to UDP 8472 for VXLAN traffic to
avoid conflicts with other VXLAN usage (e.g. Open vSwith)
* Aligning flannel and Cilium to use the same vxlan port makes
firewall rules or security policies simpler across clouds
* Change the default Pod CIDR from 10.2.0.0/16 to 10.20.0.0/14
(10.20.0.0 - 10.23.255.255) to support 1024 nodes by default
* Most CNI providers divide the Pod CIDR so that each node has
a /24 to allocate to local pods (256). The previous `10.2.0.0/16`
default only fits 256 /24's so 256 nodes were supported without
customizing the pod_cidr
* By default the `networking` CNI provider is pre-installed,
but the component variable provides an extensible mechanism
to skip installing these components
* Validate that networking can only be set to one of flannel,
calico, or cilium
* Add a `coredns` variable to configure the CoreDNS manifests,
with an `enable` field to determine whether CoreDNS manifests
are applied to the cluster during provisioning (default true)
* Add a `kube-proxy` variable to configure kube-proxy manifests,
with an `enable` field to determine whether the kube-proxy
Daemonset is applied to the cluster during provisioning (default
true)
* These optional allow for provisioning clusters without CoreDNS
or kube-proxy, so these components can be customized or managed
through separate plan/apply processes or automation
* Cilium v1.14 seems to have problems reliably accessing the
apiserver via default in-cluster service discovery (relies on
kube-proxy instead of DNS) after some time
* Configure Cilium agents to use the DNS name resolving to the
cluster's load balanced apiserver and port. Regrettably, this
relies on external DNS rather than being self-contained, but its
what Cilium pushes users towards
* Cilium KubeProxyReplacement mode used to support a partial
option, but in v1.14 it became true or false
* Emulate the old partial mode by disabling KubeProxyReplacement
but turning on the individual features
* The alternative of enabling KubeProxyReplacement has ramifications
because Cilium then needs to be configured with the apiserver server
address, which creates a dependency on the cloud provider's DNS,
clashes with kube-proxy, and removing kube-proxy creates complications
for how node health is assessed. Removing kube-proxy is further
complicated by the fact its still used by other supported CNIs which
creates a tricky support matrix
Docs: https://docs.cilium.io/en/latest/network/kubernetes/kubeproxy-free/#kube-proxy-hybrid-modes
* In Cilium v1.14, kube-proxy-replacement mode again changed
its valid values, this time from partial to true/false. The
value should be true for Cilium to support HostPort features
as expected
```
cilium status --verbose
Services:
- ClusterIP: Enabled
- NodePort: Enabled (Range: 30000-32767)
- LoadBalancer: Enabled
- externalIPs: Enabled
- HostPort: Enabled
```
14ced84f7e
introduced `enable-ipv6-masquerade` and `enable-ipv4-masquerade`. This
updates the ConfigMap of cilium to align with the expected flag.
enable-ipv4-masquerade is enabled and enable-ipv6-masquerade is
disabled.
* https://github.com/cilium/cilium/releases/tag/v1.13.0
* Change kube-proxy-replacement from probe (deprecated) to
partial and disable nodeport health checks as recommended
* Add ciliumloadbalanacerippools to ClusterRole
* Enable BPF socket load balancing from host namespace
* Allow the Kubelet kubeconfig to get/list workloads and
evict pods to perform drain operations, via the kubelet-delete
ClusterRole bound to the system:nodes group
* Previously, the ClusterRole only allowed node deletion
* Remove uses of `key_algorithm` on `tls_self_signed_cert` and
`tls_cert_request` resources. The field is deprecated and inferred
from the `private_key_pem`
* Remove uses of `ca_key_algorithm` on `tls_locally_signed_cert`
resources. The field is deprecated and inferred from the
`ca_private_key_pem`
* Require at least hashicorp/tls provider v3.2
Rel: https://github.com/hashicorp/terraform-provider-tls/blob/main/CHANGELOG.md#320-april-04-2022
* Mount both /etc/ssl/certs and /etc/pki into control plane static
pods and kube-proxy, rather than choosing one based a variable
(set based on Flatcar Linux or Fedora CoreOS)
* Remove `trusted_certs_dir` variable
* Remove deprecated `--port` from `kube-scheduler` static Pod
* Update minimum required versions for `tls`, `template`,
and `random`. Older versions have some differing behaviors
(e.g. `random` may be missing sensitive fields) and we'd
prefer to shake loose any setups still using very old
providers than continue allowing them
* Remove the upper bound version constraint since providers
are more regularly updated these days and require less
manual vetting to allow use
* Change `enable_aggregation` default from false to true
* These days, Kubernetes control plane components emit annoying
messages related to assumptions baked into the Kubernetes API
Aggregation Layer if you don't enable it. Further the conformance
tests force you to remember to enable it if you care about passing
those
* This change is motivated by eliminating annoyances, rather than
any enthusiasm for Kubernetes' aggregation features
* https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/apiserver-aggregation/
* Kubernetes v1.22.0 disables kube-controller-manager insecure
port which was used internally for Prometheus metrics scraping
In Typhoon, we'll switch to using the https port which requires
Prometheus present a bearer token
* Go ahead and disable the insecure port for kube-scheduler too,
we'll configure Prometheus to scrape it with a bearer token as
well
* Remove unused kube-apiserver `--port` flag
Rel:
* https://github.com/kubernetes/kubernetes/pull/96216
* Add init container to auto-mount /sys/fs/cgroup cgroup2
at /run/cilium/cgroupv2 for the Cilium agent
* Enable CNI exclusive mode, to disable other configs
found in /etc/cni/net.d/
* https://github.com/cilium/cilium/pull/16259
* Set `kube-apiserver` `service-account-jwks-uri` because conformance
ServiceAccountIssuerDiscovery OIDC discovery will access a JWT endpoint
using the kube-apiserver's advertise address by default, instead of
using the intended in-cluster service (10.3.0.1) resolved by cluster DNS
`kubernetes.default.svc.cluster.local`, which causes a cert SAN error
* Set the authentication and authorization kubeconfig for kube-scheduler
and kube-controller-manager. Here, authn/z refer to aggregated API
use cases only, so its not strictly neccessary and warnings about
missing `extension-apiserver-authentication` when enable_aggregation
is false can be ignored
* Mount `/var/lib/kubelet/volumeplugins` to to the default location
expected within kube-controller-manager to remove the need for a flag
* Enable `tokencleaner` controller to automatically delete expired
bootstrap tokens (default node token is good 1 year, so cleanup won't
really matter at that point, but enable regardless)
* Remove unused `cloud-provider` flag, we never intend to use in-tree
cloud providers or support custom providers
* Terraform `random_password` is identical to `random_string` except
its value is marked sensitive so it isn't displayed in terraform
plan and other outputs
* Prefer marking the bootstrap token as sensitive for cases where
terraform is run in an automated CI/CD system
* Change control plane static pods to mount `/etc/kubernetes/pki`,
instead of `/etc/kubernetes/bootstrap-secrets` to better reflect
their purpose and match some loose conventions upstream
* Require TLS assets to be placed at `/etc/kubernetes/pki`, instead
of `/etc/kubernetes/bootstrap-secrets` on hosts (breaking)
* Mount to `/etc/kubernetes/pki` to match the host (less surprise)
* https://kubernetes.io/docs/setup/best-practices/certificates/
* Generate TLS client certificates for kube-scheduler and
kube-controller-manager with `system:kube-scheduler` and
`system:kube-controller-manager` CNs
* Template separate kubeconfigs for kube-scheduler and
kube-controller manager (`scheduler.conf` and
`controller-manager.conf`). Rename admin for clarity
* Before v1.16.0, Typhoon scheduled a self-hosted control
plane, which allowed the steady-state kube-scheduler and
kube-controller-manager to use a scoped ServiceAccount.
With a static pod control plane, separate CN TLS client
certificates are the nearest equiv.
* https://kubernetes.io/docs/setup/best-practices/certificates/
* Remove unused Kubelet certificate, TLS bootstrap is used
instead
* Originally, generated TLS certificates, manifests, and
cluster "assets" written to local disk (`asset_dir`) during
terraform apply cluster bootstrap
* Typhoon v1.17.0 introduced bootstrapping using only Terraform
state to store cluster assets, to avoid ever writing sensitive
materials to disk and improve automated use-cases. `asset_dir`
was changed to optional and defaulted to "" (no writes)
* Typhoon v1.18.0 deprecated the `asset_dir` variable, removed
docs, and announced it would be deleted in future.
* Remove the `asset_dir` variable
Cluster assets are now stored in Terraform state only. For those
who wish to write those assets to local files, this is possible
doing so explicitly.
```
resource local_file "assets" {
for_each = module.bootstrap.assets_dist
filename = "some-assets/${each.key}"
content = each.value
}
```
Related:
* https://github.com/poseidon/typhoon/pull/595
* https://github.com/poseidon/typhoon/pull/678
* seccomp graduated to GA in Kubernetes v1.19. Support
for seccomp alpha annotations will be removed in v1.22
* Replace seccomp annotations with the GA seccompProfile
field in the PodTemplate securityContext
* Switch profile from `docker/default` to `runtime/default`
(no effective change, since docker is the runtime)
* Verify with docker inspect SecurityOpt. Without the
profile, you'd see `seccomp=unconfined`
Related:
* https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.19.md#seccomp-graduates-to-general-availability
* Allow Cilium operator Pods to leader elect when Deployment
has more than one replica
* Use topology spread constraint to keep multiple operators
from running on the same node (pods bind hostNetwork ports)
* Update CNI plugins from v0.6.0 to v0.8.6 to fix several CVEs
* Update the base image to alpine:3.12
* Use `flannel-cni` as an init container and remove sleep
* Add Linux ARM64 and multi-arch container images
* https://github.com/poseidon/flannel-cni
* https://quay.io/repository/poseidon/flannel-cni
Background
* Switch from github.com/coreos/flannel-cni v0.3.0 which was last
published by me in 2017 and which is no longer accessible to me
to maintain or patch
* Port to the poseidon/flannel-cni rewrite, which releases v0.4.0
to continue the prior release numbering
* Accept experimental CNI `networking` mode "cilium"
* Run Cilium v1.8.0 with overlay vxlan tunnels and a
minimal set of features. We're interested in:
* IPAM: Divide pod_cidr into /24 subnets per node
* CNI networking pod-to-pod, pod-to-external
* BPF masquerade
* NetworkPolicy as defined by Kubernetes (no L7)
* Continue using kube-proxy with Cilium probe mode
* Firewall changes:
* Require UDP 8472 for vxlan (Linux kernel default) between nodes
* Optional ICMP echo(8) between nodes for host reachability (health)
* Optional TCP 4240 between nodes for host reachability (health)
* Use node label `node.kubernetes.io/controller` to select
controller nodes (action required)
* Tolerate node taint `node-role.kubernetes.io/controller`
for workloads that should run on controller nodes. Don't
tolerate `node-role.kubernetes.io/master` (action required)
* Add nodeAffinity to CoreDNS deployment PodSpec to
prefer running CoreDNS pods on controllers, while
relying on podAntiAffinity for spreading.
* For single master clusters, running two CoreDNS pods
on the master or running one pod on a worker is
permissible.
* Note: Its still _possible_ to end up with CoreDNS pods
all running on workers since we only express scheduling
preference ("soft"), but unlikely. Plus the motivating
scenario (below) is also rare.
Background:
* CoreDNS replicas are set to the higher of 2 or the
number of control plane nodes to (at a minimum) support
Deployment updates or pod restarts and match the cluster
size (e.g. 5 master/controller nodes likely means a
larger cluster, so run 5 CoreDNS replicas)
* In the past (before v1.14), we required kube-dns (CoreOS
predecessor) to run CoreDNS pods on master nodes. With
CoreDNS this node selection was relaxed. We'd like a
gentler form of it now.
Motivation:
* On clusters using 100% preemptible/spot workers, it is
possible that CoreDNS pods schedule to workers that are all
preempted at the same time, causing a loss of cluster internal
DNS service until a CoreDNS pod reschedules (1 min). We'd like
CoreDNS to prefer controller/master nodes (which aren't preempted)
to reduce the possibility of control plane disruption
* Set a consistent MCS level/range for Calico install-cni
* Note: Rebooting a node was a workaround, because Kubelet
relabels /etc/kubernetes(/cni/net.d)
Background:
* On SELinux enforcing systems, the Calico CNI install-cni
container ran with default SELinux context and a random MCS
pair. install-cni places CNI configs by first creating a
temporary file and then moving them into place, which means
the file MCS categories depend on the containers SELinux
context.
* calico-node Pod restarts creates a new install-cni container
with a different MCS pair that cannot access the earlier
written file (it places configs every time), causing the
init container to error and calico-node to crash loop
* https://github.com/projectcalico/cni-plugin/issues/874
```
mv: inter-device move failed: '/calico.conf.tmp' to
'/host/etc/cni/net.d/10-calico.conflist'; unable to remove target:
Permission denied
Failed to mv files. This may be caused by selinux configuration on the
host, or something else.
```
Note, this isn't a host SELinux configuration issue.
* Enable bootstrap token authentication on kube-apiserver
* Generate the bootstrap.kubernetes.io/token Secret that
may be used as a bootstrap token
* Generate a bootstrap kubeconfig (with a bootstrap token)
to be securely distributed to nodes. Each Kubelet will use
the bootstrap kubeconfig to authenticate to kube-apiserver
as `system:bootstrappers` and send a node-unique CSR for
kube-controller-manager to automatically approve to issue
a Kubelet certificate and kubeconfig (expires in 72 hours)
* Add ClusterRoleBinding for bootstrap token subjects
(`system:bootstrappers`) to have the `system:node-bootstrapper`
ClusterRole
* Add ClusterRoleBinding for bootstrap token subjects
(`system:bootstrappers`) to have the csr nodeclient ClusterRole
* Add ClusterRoleBinding for bootstrap token subjects
(`system:bootstrappers`) to have the csr selfnodeclient ClusterRole
* Enable NodeRestriction admission controller to limit the
scope of Node or Pod objects a Kubelet can modify to those of
the node itself
* Ability for a Kubelet to delete its Node object is retained
as preemptible nodes or those in auto-scaling instance groups
need to be able to remove themselves on shutdown. This need
continues to have precedence over any risk of a node deleting
itself maliciously
Security notes:
1. Issued Kubelet certificates authenticate as user `system:node:NAME`
and group `system:nodes` and are limited in their authorization
to perform API operations by Node authorization and NodeRestriction
admission. Previously, a Kubelet's authorization was broader. This
is the primary security motivation.
2. The bootstrap kubeconfig credential has the same sensitivity
as the previous generated TLS client-certificate kubeconfig.
It must be distributed securely to nodes. Its compromise still
allows an attacker to obtain a Kubelet kubeconfig
3. Bootstrapping Kubelet kubeconfig's with a limited lifetime offers
a slight security improvement.
* An attacker who obtains the kubeconfig can likely obtain the
bootstrap kubeconfig as well, to obtain the ability to renew
their access
* A compromised bootstrap kubeconfig could plausibly be handled
by replacing the bootstrap token Secret, distributing the token
to new nodes, and expiration. Whereas a compromised TLS-client
certificate kubeconfig can't be revoked (no CRL). However,
replacing a bootstrap token can be impractical in real cluster
environments, so the limited lifetime is mostly a theoretical
benefit.
* Cluster CSR objects are visible via kubectl which is nice
4. Bootstrapping node-unique Kubelet kubeconfigs means Kubelet
clients have more identity information, which can improve the
utility of audits and future features
Rel: https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet-tls-bootstrapping/
* Change kube-proxy, flannel, and calico-node DaemonSet
tolerations to tolerate `node.kubernetes.io/not-ready`
and `node-role.kubernetes.io/master` (i.e. controllers)
explicitly, rather than tolerating all taints
* kube-system DaemonSets will no longer tolerate custom
node taints by default. Instead, custom node taints must
be enumerated to opt-in to scheduling/executing the
kube-system DaemonSets.
Background: Tolerating all taints ruled out use-cases
where certain nodes might legitimately need to keep
kube-proxy or CNI networking disabled
* Kubernetes plans to stop releasing the hyperkube image in
the future.
* Upstream will continue releasing container images for
`kube-apiserver`, `kube-controller-manager`, `kube-proxy`,
and `kube-scheduler`. Typhoon will use these images
* Upstream will release the kubelet as a binary for distros
to package, either as a traditional DEB/RPP or as a container
image for container-optimized operating systems. Typhoon will
take on the packaging of Kubelet and its dependencies as a new
container image (alongside kubectl)
Rel: https://github.com/kubernetes/kubernetes/pull/88676
See: https://github.com/poseidon/kubelet
* Set kube-proxy --metrics-bind-address to 0.0.0.0 (default
127.0.0.1) so Prometheus metrics can be scraped
* Add pod port list (informational only)
* Require node firewall rules to be updated before scrapes
can succeed
* Migration from a self-hosted to a static pod control plane dropped
a few kube-controller-manager customizations
* Reduce kube-controller-manager --pod-eviction-timeout from 5m to 1m
to move pods more quickly when nodes are preempted
* Fix flex-volume-plugin-dir since the Kubernetes default points to
a read-only filesystem on Container Linux / Fedora CoreOS
Related:
* https://github.com/poseidon/terraform-render-bootstrap/pull/148
* 7b06557b7a
* Reduce kube-apiserver and kube-controller-manager CPU
requests from 200m to 150m. Prefer slightly lower commitment
after running with the requests chosen in #161 for a while
* Reduce calico-node CPU request from 150m to 100m to match
CoreDNS and flannel
* Annotate terraform output variables containing generated TLS
credentials and kubeconfigs as sensitive to suppress / mask
them in terraform CLI display.
* Allow for easier use in automation systems and logged environments
* `asset_dir` is an absolute path to a directory where generated
assets from terraform-render-bootstrap are written (sensitive)
* Change `asset_dir` to default to "" so no assets are written
(favor Terraform output mechanisms). Previously, asset_dir was
required so all users set some path. To take advantage of the
new optionality, remove asset_dir or set it to ""
* Introduce a new `assets_dist` output variable that provides
a mapping from suggested asset paths to asset contents (for
assets that should be distributed to controller nodes). This
new output format is intended to align with a modified asset
distribution style in Typhoon.
* Lay the groundwork for `assets_dir` to become optional. The
output map provides output variable access to the minimal assets
that are required for bootstrap
* Assets that aren't required for bootstrap itself (e.g.
the etcd CA key) but can be used by admins may later be added
as specific output variables to further reduce asset_dir use
Background:
* `terraform-render-bootstrap` rendered assets were previously
only provided by rendering files to an `asset_dir`. This was
neccessary, but created a responsibility to maintain those
assets on the machine where terraform apply was run
* Remove unused `ca_cert`, `kubelet_cert`, `kubelet_key`,
and `server` outputs
* These outputs were once needed to support clusters with
managed instance groups, but that hasn't been the case for
quite some time
* Set small CPU requests on static pods kube-apiserver,
kube-controller-manager, and kube-scheduler to align with
upstream tooling and for edge cases
* Control plane nodes are tainted to isolate them from
ordinary workloads. Even dense workloads can only compress
CPU resources on worker nodes.
* Control plane static pods use the highest priority class, so
contention favors control plane pods (over say node-exporter)
and CPU is compressible too.
* Effectively, a practical case for these requests hasn't been
observed. However, a small static pod CPU request may offer
a slight benefit if a controller became overloaded and the
above mechanisms were insufficient for some reason (bit of a
stretch, due to CPU compressibility)
* Continue to avoid setting a memory request for static pods.
It would impose a hard size requirement on controller nodes,
which isn't warranted and is handled more gently by Typhoon
default instance types across clouds and via docs
* Adopt Terrform v0.12 type and templatefile function
features to replace the use of terraform-provider-template's
`template_dir`
* Use of `for_each` to write local assets requires
that consumers use Terraform v0.12.6+ (action required)
* Continue use of `template_file` as its quite common. In
future, we may replace it as well.
* Remove outputs `id` and `content_hash` (no longer used)
Background:
* `template_dir` was added to `terraform-provider-template`
to add support for template directory rendering in CoreOS
Tectonic Kubernetes distribution (~2017)
* Terraform v0.12 introduced a native `templatefile` function
and v0.12.6 introduced native `for_each` support (July 2019)
that makes it possible to replace `template_dir` usage
* Change calico-node livenessProve from httpGet to exec
a calico-node -felix-ready, as recommended by Calico
* Allow advertising Kubernetes service ClusterIPs
* Kubernetes v1.11 considered kube-proxy IPVS mode GA
* Many problems were found https://github.com/poseidon/typhoon/pull/321
* Since then, major blockers seem to have been addressed
* Rename from terraform-render-bootkube to terraform-render-bootstrap
* Generated manifest and certificate assets are no longer geared
specifically for bootkube (no longer used)
* Rename the organization in generated CA certificates for
clusters from bootkube to typhoon
* Mainly helpful to avoid confusion with bootkube CA certificates
if users inspect their CA, especially now that bootkube isn't used
(better their searches lead to Typhoon)
* Run kube-apiserver, kube-scheduler, and kube-controller-manager
as static pods on each controller node
* Boostrap a minimal control plane by copying `static-manifests`
to the Kubelet `--pod-manifest-path` and tls/auth secrets to
`/etc/kubernetes/bootstrap-secrets`. Then, kubectl apply Kubernetes
manifests.
* Discontinue using bootkube to bootstrap and pivot to a self-hosted
control plane.
* Remove bootkube self-hosted kube-apiserver DaemonSet and
kube-scheduler and kube-controller-manager Deployments
* Remove pod-checkpointer manifests (no longer needed)
Advantages:
* Reduce control plane bootstrapping complexity. Self-hosted pivot and
pod checkpointing worked well, but in-place edits to kube-apiserver,
kube-controller-manager, or kube-scheduler is infrequently used. The
concept was originally geared toward continuously in-place upgrading
clusters, a goal Typhoon doesn't take on (rec. blue/green clusters).
As such, the value-add isn't justifying the extra components for this
particular project.
* Static pods still provide kubectl visibility and log access
Drawbacks:
* In-place edits to kube-apiserver, kube-controller-manager, and
kube-scheduler are not possible via kubectl (non-goal)
* Assets must be copied to each controller (not just one)
* Static pod must load credentials via hostPath, which is less clean
compared with the former Kubernetes secrets and service accounts
* Require bootstrap-kube-apiserver and kube-apiserver components
listen on port 6443 (internally) to allow kube-apiserver pods to
run with lower user privilege
* Remove variable `apiserver_port`. The kube-apiserver listen
port is no longer customizable.
* Add variable `external_apiserver_port` to allow architectures
where a load balancer fronts kube-apiserver 6443 backends, but
listens on a different port externally. For example, Google Cloud
TCP Proxy load balancers cannot listen on 6443
* Terraform v0.12 is a major Terraform release with breaking changes
to the HCL language. In v0.11, it was required to use redundant brackets
as interpreter type hints to pass lists or concat and flatten lists and
strings. In v0.12, that work-around is no longer supported. Lists are
represented as first-class objects and the redundant brackets create
nested lists. Consequently, its not possible to pass lists in a way that
works with both v0.11 and v0.12 at the same time. We've made the
difficult choice to pursue a hard cutover to Terraform v0.12.x
* https://www.terraform.io/upgrade-guides/0-12.html#referring-to-list-variables
* Use expression syntax instead of interpolated strings, where suggested
* Define Terraform required_version ~> v0.12.0 (> v0.12, < v0.13)
* Add `ready` plugin and change the readinessProbe to check
default port 8181 to ensure all plugins are ready
* `upstream [ADDRESS]` defines upstream resolvers for external
services. If no address is given, resolution is against CoreDNS
itself, which is the default. So `upstream` can be removed
* Change flannel port from the kernel default 8472 to the
IANA assigned VXLAN port 4789
* Requires a change to firewall rules or security groups
depending on the platform (**action required!**)
* Why now? Calico now offers its own VXLAN backend so
standardizing on the IANA port simplifies configuration
* https://github.com/coreos/flannel/blob/master/Documentation/backends.md#vxlan
* Accept a `network_encapsulation` variable to choose whether the
default IPPool should use ipip (default) or vxlan encapsulation
* Use `network_mtu` as the MTU for workload interfaces for ipip
or vxlan (although Calico can have a IPPools with a mix, we're
picking ipip xor vxlan)
* Remove the `ca_certificate`, `ca_key_alg`, and `ca_private_key`
variables
* Typhoon does not plan to expose custom CA support. Continuing
to support it clutters the implementation and security auditing
* Using an existing CA certificate and private key has been
supported in terraform-render-bootkube only to match bootkube
* Add an `enable_aggregation` variable to enable the kube-apiserver
aggregation layer for adding extension apiservers to clusters
* Aggregation is **disabled** by default. Typhoon recommends you not
enable aggregation. Consider whether less invasive ways to achieve
your goals are possible and whether those goals are well-founded
* Enabling aggregation and extension apiservers increases the attack
surface of a cluster and makes extensions a part of the control plane.
Admins must scrutinize and trust any extension apiserver used.
* Passing a v1.14 CNCF conformance test requires aggregation be enabled.
Having an option for aggregation keeps compliance, but retains the stricter
security posture on default clusters
* calico-node uses only a small fraction of its CPU request
(i.e. reservation) even under stress. The unbounded limit
already allows usage to scale favorably in bursty cases
* Motivation: On instance types that skew memory-optimized
(e.g. GCP n1), over-requesting can push the system toward
overcommitment (alerts can be tuned)
* Overcommitment is not necessarily bad, but 250m seems too
generous a minimum given the actual usage
* Add calico-ipam CRDs and RBAC permissions
* Switch IPAM from host-local to calico-ipam!
* `calico-ipam` subnets `ippools` (defaults to pod CIDR) into
`ipamblocks` (defaults to /26, but set to /24 in Typhoon)
* `host-local` subnets the pod CIDR based on the node PodCIDR
field (set via kube-controller-manager as /24's)
* Create a custom default IPv4 IPPool to ensure the block size
is kept at /24 to allow 110 pods per node (Kubernetes default)
* Retaining host-local was slightly preferred, but Calico v3.6
is migrating all usage to calico-ipam. The codepath that skipped
calico-ipam for KDD was removed
* https://docs.projectcalico.org/v3.6/release-notes/
* Resolve in-addr.arpa and ip6.arpa DNS PTR requests for Kubernetes
service IPs and pod IPs
* Previously, CoreDNS was configured to resolve in-addr.arpa PTR
records for service IPs (but not pod IPs)
* Priority Admission Controller has been enabled since Typhoon
v1.11.1
* Assign cluster and node components a builtin priorityClassName
(higher is higher priority) to inform scheduler prepemption,
scheduling order, and node out-of-resource eviction order
* Fix a regression caused by lowering the Kubelet TLS client
certificate to system:nodes group (#100) since dropping
cluster-admin dropped the Kubelet's ability to delete nodes.
* On clouds where workers can scale down (manual terraform apply,
AWS spot termination, Azure low priority deletion), worker shutdown
runs the delete-node.service to remove a node to prevent NotReady
nodes from accumulating
* Allow Kubelets to delete cluster nodes via system:nodes group. Kubelets
acting with system:node and kubelet-delete ClusterRoles is still an
improvement over acting as cluster-admin
* Allow kube-controller-manager to sign Approved CSR's using the
cluster CA private key to issue cluster certificates
* System components that need to use certificates signed by the
cluster CA can submit a CSR to the apiserver, have an admin
inspect and manually approve it, and be issued a certificate
* Admins should inspect CSRs very carefully to ensure their
origin and authorization level are appropriate
* https://kubernetes.io/docs/tasks/tls/managing-tls-in-a-cluster/#approving-certificate-signing-requests
* Provide an admin kubeconfig which includes a named context
and also sets that context as the current-context
* Retains support for both the KUBECONFIG=path style of usage
or adding many kubeconfig's to a ~/.kube/configs folder and
using `kubectl use-context CLUSTER-context`
* Change Kubelet TLS client certificate to belong to the system:nodes
group instead of the system:masters group (more limited)
* Bind the system:node ClusterRole to the system:nodes group (yes,
the ClusterRole is singular)
* Generate separate admin.crt and admin.key files (which do still use
system:masters). Output kubeconfig-kubelet and kubeconfig-admin values
from the module
* Remove the kubeconfig output to force users to pick the correct
kubeconfig, depending on how the output is used (action required!)
Related:
* https://kubernetes.io/docs/reference/access-authn-authz/rbac/#core-component-roles
Note, NodeAuthorizer/NodeRestriction would be an enhancement, but to
work across platforms it effectively requires TLS bootstraping which
doesn't have a viable attestation strategy and clashes with CCM. This
change improves Kubelet limitations, but intentionally doesn't aim to
steer toward NodeAuthorizer/NodeRestriction
* Switch kube-apiserver from using the kube-system default ServicAccount
(with cluster-admin) to using a kube-apiserver ServiceAccount bound to
cluster-admin (as before)
* Remove the default-sa ClusterRoleBinding that allowed kube-apiserver
and kube-scheduler (or other 3rd-party components added to kube-system)
to use the kube-system default ServiceAccount for cluster-admin
* Require all future components in kube-system define their own
ServiceAccount
* Switch kube-scheduler from using the kube-system default ServiceAccount
(with cluster-admin) to using a kube-scheduler ServiceAccount bound to
the builtin system:kube-scheduler and system:volume-scheduler
(required for StorageClass) ClusterRoles
* https://kubernetes.io/docs/reference/access-authn-authz/rbac/#core-component-roles
* loop sends an initial query to detect infinite forwarding
loops in configured upstream DNS servers and fast exit with
an error (its a fatal misconfiguration on the network that
will otherwise cause resolvers to consume memory/CPU until
crashing, masking the problem)
* https://github.com/coredns/coredns/tree/master/plugin/loop
* loadbalance randomizes the ordering of A, AAAA, and MX records
in responses to provide round-robin load balancing (as usual,
clients may still cache responses though)
* https://github.com/coredns/coredns/tree/master/plugin/loadbalance
* Organize flannel and Calico manifests to use consistent
naming, structure, and ordering to align
* Downside: Makes direct diff'ing with upstream harder, but
that's become difficult lately anyway, since Calico uses a
templating engine
* Prefer InternalIP and ExternalIP over the node's hostname,
to match upstream behavior and kubeadm
* Previously, hostname-override was used to set node names
to internal IP's to work around some cloud providers not
resolving hostnames for instances (e.g. DO droplets)
* Run at least two replicas of CoreDNS to better support
rolling updates (previously, kube-dns had a pod nanny)
* On multi-master clusters, set the CoreDNS replica count
to match the number of masters (e.g. a 3-master cluster
previously used replicas:1, now replicas:3)
* Add AntiAffinity preferred rule to favor distributing
CoreDNS pods across nodes
* Continue to ensure scheduler and controller-manager run
at least two replicas to support performing kubectl edits
on single-master clusters (no change)
* For multi-master clusters, set scheduler / controller-manager
replica count to the number of masters (e.g. a 3-master cluster
previously used replicas:2, now replicas:3)
* A kubernetes apiserver should be authorized to make requests
to kubelets using an admin role associated with system:masters
* Kubelet defaults to AlwaysAllow so an apiserver that presented
a valid certificate had all access to the Kubelet. With Webhook
authorization, we're making that admin access explicit
* Its important the apiserver be able to perform or proxy to
kubelets for kubectl log, exec, port-forward, etc.
* https://github.com/poseidon/typhoon/issues/215
* Calico's default method "first-found" is appropriate for
single-NIC or bonded-NIC bare-metal and for clouds
* On bare-metal machines with multiple NICs, first-found
may result in Calico pods picking an unintended IP address
(perhaps an admin has dedicated certain NICs to certain
purposes). It mat be helpful to use `can-reach=DEST` or
`interface=REGEX` to select the host's address
* Caveat: autodetection method is set for the Calico
DaemonSet so the choice must be appropriate for all
machines in the cluster.
* https://docs.projectcalico.org/v3.1/reference/node/configuration#ip-autodetection-methods
* controller-manager can handle overlapping pod and service CIDRs
to avoid address collisions, if its informed of both ranges
* Still favor non-overlapping pod and service ranges of course
* https://github.com/kubernetes-incubator/bootkube/pull/797
`terraform-render-bootkube` is a Terraform module that renders [kubernetes-incubator/bootkube](https://github.com/kubernetes-incubator/bootkube) assets for bootstrapping a Kubernetes cluster.
`terraform-render-bootstrap` is a Terraform module that renders TLS certificates, static pods, and manifests for bootstrapping a Kubernetes cluster.
## Audience
`terraform-render-bootkube` is a low-level component of the [Typhoon](https://github.com/poseidon/typhoon) Kubernetes distribution. Use Typhoon modules to create and manage Kubernetes clusters across supported platforms. Use the bootkube module if you'd like to customize a Kubernetes control plane or build your own distribution.
`terraform-render-bootstrap` is a low-level component of the [Typhoon](https://github.com/poseidon/typhoon) Kubernetes distribution. Use Typhoon modules to create and manage Kubernetes clusters across supported platforms. Use the bootstrap module if you'd like to customize a Kubernetes control plane or build your own distribution.
## Usage
Use the module to declare bootkube assets. Check [variables.tf](variables.tf) for options and [terraform.tfvars.example](terraform.tfvars.example) for examples.
Use the module to declare bootstrap assets. Check [variables.tf](variables.tf) for options and [terraform.tfvars.example](terraform.tfvars.example) for examples.
Compare assets. Note that experimental must be generated to a separate directory for terraform applies to sync. Move the experimental `bootstrap-manifests` and `manifests` files during deployment.
description="Path to a directory where generated assets should be placed (contains secrets)"
type="string"
}
variable"cloud_provider"{
description="The provider for cloud services (empty string for no provider)"
type="string"
default=""
}
# optional
variable"networking"{
description="Choice of networking provider (flannel or calico)"
type="string"
default="flannel"
}
variable"network_mtu"{
description="CNI interface MTU (applies to calico only)"
type="string"
default="1500"
}
variable"pod_cidr"{
description="CIDR IP range to assign Kubernetes pods"
type="string"
default="10.2.0.0/16"
}
variable"service_cidr"{
description=<<EOD
CIDR IP range to assign Kubernetes services.
The 1st IP will be reserved for kube_apiserver, the 10th IP will be reserved for kube-dns, the 15th IP will be reserved for self-hosted etcd, and the 20th IP will be reserved for bootstrap self-hosted etcd.
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.