The idempotency.go (perhaps not so accurately named) contains
API calls that kubeadm does against an API server using client-go.
Some users seem to have unstable setups where for unknown reasons
the API server can be unavailable or refuse to respond as expected.
Use PollUntilContextTimeout in all exported functions to ensure
such API calls are all retry-able.
NOTE: The context passed to PollUntilContextTimeout is not propagated
in the polled function. Instead the poll function creates it's own
context 'ctx := context.Background()', this is to avoid
breaking expectations on the side of the callers, that expect
a certain type of error and not "context timeout" errors.
Additional changes:
- Make all context.TODO() -> context.Background()
- Update all unit tests and make sure during testing the retry
interval and timeout are short. Test coverage of idempotency.go
is at ~97%.
- Remove the TestMutateConfigMapWithConflict test. It does not
contribute much, because conflict handling is done at the API,
server side, not on the side of kubeadm. This simulating this is not
needed.
kubeadm upgrade checks the migration path for the existing CoreDNS
deployment pre-flight. Migration paths are defined for CoreDNS
versions, which are derived from the image tag used in the existing
deployment.
The kubeadm ClusterConfiguration.DNS.ImageMeta supports suffixing the
tag with a digest, but at upgrade time does not derive the version
correctly from an image with digest suffix, because DeployedDNSAddon
does not deal with digests correctly. This commit makes DeployedDNSAddon
digest-aware.
Signed-off-by: Markus Rudy <mr@edgeless.systems>
Previous v1beta4 work added support for
ClusterConfiguration.EncryptionAlgorithm, however the possible
values were limited to just "RSA" (2048 key size) and "ECDSA" (P256).
Allow more arbitrary algorithm types, that can also include key size
or curve type encoded in the name:
"RSA-2048" (default), "RSA-3072", "RSA-4096" or "ECDSA-P256".
Update the deprecation notice of the PublicKeysECDSA FeatureGate
as ideally it should be removed only after v1beta3 is removed.
The previous fix changed the behavior of
EnsureAdminClusterRoleBindingImpl under the assumption that the unit
test was correct and the real-world behavior was wrong, but in fact,
the real-world behavior was already correct, and the unit test was
expecting the wrong result because of the difference in behavior
between real and fake clients.
The function TryRunCommand() uses an exponential backoff,
which is good, but it's inconsistent and only used in a couple
of places.
Remove its usage in the token.go#UpdateOrCreateTokens()
and switch to using the standard function used in other places -
PollUntilContextTimeout().
Remove wait.go#TryRunCommand(), as there are no other usages.
The function wait.go#WaitForKubeletAndFunc() has been used in
a number of places in kubeadm. It starts a go routine to wait for
the kubelet /healthz and in parallel starts another go routine
to wait for an custom function.
This logic is problematic. If kubeadm is waiting for the kubelet
in parallel with something that requires the kubelet, the right
solution would be to first wait for the kubelet in serial and only
then proceed with the other action. The parallelism here particularly
during "init" required a unwanted "initial timeout" of 40s, before
the kubelet waiting even starts. In most cases, this makes the kubelet
waiter to not even start, while the main point of waiting becomes
the "other action".
- Remove the function WaitForKubeletAndFunc() from the Waiter interface.
- Rename the function WaitForHealthyKubelet() to just WaitForKubelet()
to be consistent with the naming WaitForAPI().
- Update WaitForKubelet() to not use TryRunCommand() and instead
use PollUntilContextTimeout().
- Remove the "initial timeout" of 40s in WaitForKubelet().
- Make both WaitForKubelet() and WaitForAPI() use similar error
handling and output.
- Update all usage of WaitForKubelet() to be a serial call before
any other action, such as another wait* call.
- Make the default wait timeout for the kubelet
/healthz to be 1 minute (kubeadmconstants.DefaultKubeletTimeout).
- Apply updates to all implementations of the Waiter interface.
The component connection between kube-apiserver and kubelet does not
require the "O" field on the Subject to be set to the
"system:masters" privileged group. It can be a less
privileged group like "kubeadm:cluster-admins".
Change the group in the apiserve-kubelet-client
certificate specification. This cert is passed to
--kubelet-client-certificate.
In EnsureAdminClusterRoleBindingImpl() there are a couple of
polls around CRB create calls. When testing the function
a short retry and a timeout are used. These introduce around
2x20 fake client "connections" / poll iterations under a couple
of test cases with 2 seconds overall test increase.
Given the polls in EnsureAdminClusterRoleBindingImpl()
are of type PollUntilContextTimeout() with "immediate" set to "true",
the short retry / time out can be removed when testing,
because one poll iteration is guaranteed and the tested function
is at 100% coverage with reactors and test cases.
Poll CRB create calls for kubeadm:cluster-admins when using the
super-admin.conf credential. The prior create call that uses the
credential admin.conf was already polled. Polling this subsequent
call seems advisable to ensure that momentary errors in between
cannot trip EnsureAdminClusterRoleBindingImpl().
- Register the new file in /certs/renewal, so that the
file is renewed if present. If not present the common message "MISSING"
is shown. Same for other certs/kubeconfig files.
- In /kubeconfig, update the spec for admin.conf to use
the "kubeadm:cluster-admins" Group. A new spec is added for
the "super-admin.conf" file that uses the "system:masters" Group.
- Add a new function EnsureAdminClusterRoleBinding() that includes
logic to ensure that admin.conf contains a User that is properly
bound on the "cluster-admin" built-in ClusterRole. This requires
bootstrapping using the "system:masters" containing "super-admin.conf".
Add detailed unit tests for this new logic.
- In /upgrade#PerformPostUpgradeTasks() add logic to create the
"admin.conf" and "super-admin.conf" with the new, updated specs.
Add detailed unit tests for this new logic.
- In /upgrade#StaticPodControlPlane() ensure that renewal of
"super-admin.conf" is performed if the file exists.
Update unit tests.