* With spot instances, deleting an instance actually lowers the
desired number of nodes in the VMSS, so the node is not replaced
* Restore the auto-scale setting needed to maintain a consistent
desired number of workers while spot instances come and go. This
was mistakenly removed during refactoring
* Azure Load Balancers include 5 rules (3 LB rules, 2 outbound) whether used or not
* [#1468](https://github.com/poseidon/typhoon/pull/1468) added 3 LB rules to support IPv6 load balancing,
raising the rule count from 5 to 8 and adding ~$21/mo to the cost of the load balancer. If you use an edge
(e.g. Cloudflare), a cluster does not need to load balance IPv6, so this additional cost can be avoided
* I noticed this because my load balancing costs were up for the last
few months. The gotcha is that outbound rules count toward the 5 rules
included with the base cost of the LB (~$18/mo)
Docs: https://azure.microsoft.com/en-us/pricing/details/load-balancer/
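The rule arithmetic above can be sketched in a few lines of Python. This is only a rough estimate built from the approximate figures in these notes: the per-rule price is derived from "~$21/mo for 3 extra rules" rather than taken from the Azure price list, and the function name `monthly_lb_cost` is made up for illustration.

```python
# Approximate figures from the notes above (USD per month).
BASE_COST = 18.0             # base LB price, covers the first 5 rules
INCLUDED_RULES = 5
EXTRA_RULE_COST = 21.0 / 3   # derived: ~$21/mo for 3 extra rules -> ~$7 per rule


def monthly_lb_cost(lb_rules: int, outbound_rules: int) -> float:
    """Estimate monthly cost; outbound rules count toward the included 5."""
    total_rules = lb_rules + outbound_rules
    extra_rules = max(0, total_rules - INCLUDED_RULES)
    return BASE_COST + extra_rules * EXTRA_RULE_COST


before = monthly_lb_cost(lb_rules=3, outbound_rules=2)  # 5 rules, IPv4 only
after = monthly_lb_cost(lb_rules=6, outbound_rules=2)   # 8 rules, with IPv6
print(f"before: ~${before:.0f}/mo, after: ~${after:.0f}/mo")
```

With 2 outbound rules already counted, every LB rule beyond the third is billed as an extra, which is why the 3 IPv6 rules showed up directly on the bill.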
* flannel and Cilium default to UDP 8472 for VXLAN traffic to
avoid conflicts with other VXLAN usage (e.g. Open vSwitch)
* Aligning flannel and Cilium to use the same VXLAN port makes
firewall rules or security policies simpler across clouds
Rel: https://github.com/poseidon/terraform-render-bootstrap/pull/403
* Explicitly load the `nf_conntrack` and `br_netfilter` kernel
modules that are needed for flannel CNI setups
* Specifically, flannel needs `br_netfilter` and kube-proxy (used
in flannel setups) needs `nf_conntrack`. Previously these kernel
modules were loaded by default but no longer seem to be
* Cilium has been the default for about 3 years and is the de facto
standard CNI choice. flannel is supported as a simple alternative
* Remove various historical options that were only needed for
Calico
* By default, Kubelet will pull container images one by one
(in series), a default that mostly stems from Docker-era bugs in
parallel image pulls. These days we use containerd, so parallel
pulls should be fine
* Serial image pulls are undesirable because one slow registry
or image can cause other image pulls to wait. Parallel image
pulls ensure only large images / slow registries see that impact
Docs: https://kubernetes.io/docs/reference/config-api/kubelet-config.v1beta1/
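A toy model of the waiting effect described above, in Python. The image names and durations are hypothetical, and the parallel case assumes pulls do not compete for bandwidth, which real nodes will not fully satisfy.

```python
from itertools import accumulate

# Hypothetical pull durations in seconds, in queue order; the first image is slow.
pull_seconds = {"big-image": 120, "app-a": 5, "app-b": 8}

# Serial: each pull starts only when the previous one finishes,
# so completion times are the running sum of durations.
serial_done = dict(zip(pull_seconds, accumulate(pull_seconds.values())))

# Parallel: every pull starts at t=0 (assumes no bandwidth contention),
# so each image is ready after its own duration.
parallel_done = dict(pull_seconds)

for image in pull_seconds:
    print(f"{image}: serial {serial_done[image]}s, parallel {parallel_done[image]}s")
```

In the serial case the two small images wait behind the slow one; in the parallel case only the slow image takes long.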
* Change the default Pod CIDR from 10.2.0.0/16 to 10.20.0.0/14
(10.20.0.0 - 10.23.255.255) to support 1024 nodes by default
* Most CNI providers divide the Pod CIDR so that each node has
a /24 to allocate to local pods (256). The previous `10.2.0.0/16`
default only fits 256 /24s, so only 256 nodes were supported without
customizing the pod_cidr
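The node counts above can be checked with Python's standard `ipaddress` module. This is a minimal sketch that assumes each node receives exactly one /24, as described; `node_capacity` is an illustrative helper name, not part of any tool.

```python
import ipaddress


def node_capacity(pod_cidr: str, node_prefix: int = 24) -> int:
    """Number of per-node subnets (default /24) that fit in the Pod CIDR."""
    network = ipaddress.ip_network(pod_cidr)
    return 2 ** (node_prefix - network.prefixlen)


for cidr in ("10.2.0.0/16", "10.20.0.0/14"):
    network = ipaddress.ip_network(cidr)
    # network[0] and network[-1] are the first and last addresses in the range
    print(f"{cidr}: {network[0]} - {network[-1]}, {node_capacity(cidr)} nodes")
```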