This change introduces the ability for the Kubelet to monitor and report
the health of devices allocated via Dynamic Resource Allocation (DRA).
This addresses a key part of KEP-4680 by providing visibility into
device failures, which helps users and controllers diagnose pod failures.
The implementation includes:
- A new `v1alpha1.NodeHealth` gRPC service with a `WatchResources`
stream that DRA plugins can optionally implement.
- A health information cache within the Kubelet's DRA manager to track
the last known health of each device and handle plugin disconnections
(a minimal sketch follows this list).
- An asynchronous update mechanism that triggers a pod sync when a
device's health changes.
- A new `allocatedResourcesStatus` field in `v1.ContainerStatus` to
expose the device health information to users via the Pod API.
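A minimal sketch of what such a health cache could look like; the type
and method names (`healthInfoCache`, `updateHealth`,
`markDriverDisconnected`) and the key structure are illustrative
assumptions, not the actual Kubelet code:
```go
package dra

import (
	"sync"
	"time"
)

// deviceKey identifies a device by driver, pool, and device name.
// The key structure here is an assumption for illustration.
type deviceKey struct {
	Driver string
	Pool   string
	Device string
}

// deviceHealth records the last known health of a device and when it was
// last reported.
type deviceHealth struct {
	Health      string // e.g. "Healthy", "Unhealthy", "Unknown"
	LastUpdated time.Time
}

// healthInfoCache is a hypothetical per-device health cache, guarded by a
// mutex because plugin health streams report concurrently.
type healthInfoCache struct {
	mu      sync.RWMutex
	devices map[deviceKey]deviceHealth
}

func newHealthInfoCache() *healthInfoCache {
	return &healthInfoCache{devices: map[deviceKey]deviceHealth{}}
}

// updateHealth stores the latest report for a device and returns true if
// the health state changed, so the caller can trigger a pod sync.
func (c *healthInfoCache) updateHealth(key deviceKey, health string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	prev, ok := c.devices[key]
	c.devices[key] = deviceHealth{Health: health, LastUpdated: time.Now()}
	return !ok || prev.Health != health
}

// markDriverDisconnected flips every device of a driver to "Unknown" when
// the plugin's health stream goes away.
func (c *healthInfoCache) markDriverDisconnected(driver string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for key, dh := range c.devices {
		if key.Driver == driver {
			dh.Health = "Unknown"
			dh.LastUpdated = time.Now()
			c.devices[key] = dh
		}
	}
}
```
The boolean returned by `updateHealth` is what would let the caller
decide whether to trigger the asynchronous pod sync mentioned above.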
Update vendor
KEP-4680: Fix lint, boilerplate, and codegen issues
Add another e2e test, add TODO for KEP-4680, and update test infra helpers
Add Feature Gate e2e test
Fix presubmits
Fix var names, feature gating, and nits
Fix DRA Health gRPC API according to review feedback
Pod-level hugepage resources are not propagated to the containers; only the pod-level cgroup values are propagated to the containers when they do not specify hugepage resources of their own.
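A hypothetical helper restating that rule (function and parameter names
are made up for illustration, not the actual propagation code):
```go
package main

import "fmt"

// containerHugepageCgroupLimit restates the rule above with hypothetical
// names: the pod-level hugepage value is never copied into the container
// spec, but when a container declares no hugepage resources of its own,
// its effective cgroup limit falls back to the pod-level cgroup value.
func containerHugepageCgroupLimit(podCgroupLimit int64, containerSpecLimit *int64) int64 {
	if containerSpecLimit != nil {
		return *containerSpecLimit // the container's own hugepage limit wins
	}
	return podCgroupLimit // otherwise only the pod-level cgroup value applies
}

func main() {
	twoMi := int64(2 << 20)
	fourMi := int64(4 << 20)
	fmt.Println(containerHugepageCgroupLimit(twoMi, nil))     // 2097152, from the pod cgroup
	fmt.Println(containerHugepageCgroupLimit(twoMi, &fourMi)) // 4194304, the container's own value
}
```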
As part of PR 132028 we added more e2e test coverage to validate
the fix and to check, as much as possible, that there are no regressions.
The issue and the fix become most evident when inspecting
memory allocation with the Memory Manager static policy enabled.
Quoting the commit message of bc56d0e45a:
```
The podresources API List implementation uses the internal data of the
resource managers as source of truth.
Looking at the implementation here:
https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/pkg/kubelet/apis/podresources/server_v1.go#L60
we take care of syncing the device allocation data before querying the
device manager to return its pod->devices assignment.
This is needed because otherwise the device manager (and all the other
resource managers) would do the cleanup asynchronously, so the `List` call
will return incorrect data.
But we don't do this syncing for either CPUs or memory,
so when we report these we will get stale data, as issue #132020 demonstrates.
For the CPU manager, however, we have the reconcile loop, which cleans up the stale data periodically.
It turns out this timing interplay was actually the reason the existing issue #119423 seemed fixed
(see: #119423 (comment)).
But it's only timing: if in the reproducer we set the `cpuManagerReconcilePeriod` to a
very high value (>= 5 minutes), then the issue still reproduces against the current master branch
(https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/test/e2e_node/podresources_test.go#L983).
```
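A simplified sketch of the sync-before-query pattern the quoted commit
describes, using illustrative interfaces rather than the real kubelet
types:
```go
package podresources

import "context"

// podsProvider and resourceManager are illustrative stand-ins; the real
// kubelet interfaces differ.
type podsProvider interface {
	GetActivePods() []string // pod UIDs
}

type resourceManager interface {
	// UpdateAllocatedResources drops allocations belonging to pods that
	// are no longer active, so subsequent queries return fresh data.
	UpdateAllocatedResources(activePods []string)
	// Assignments returns the current pod -> resources assignment.
	Assignments() map[string][]string
}

type server struct {
	pods    podsProvider
	devices resourceManager
	cpus    resourceManager
	memory  resourceManager
}

// List syncs every manager against the set of active pods before reading
// its data, not just the device manager.
func (s *server) List(ctx context.Context) map[string]map[string][]string {
	active := s.pods.GetActivePods()
	out := map[string]map[string][]string{}
	for name, mgr := range map[string]resourceManager{
		"devices": s.devices,
		"cpus":    s.cpus,
		"memory":  s.memory,
	} {
		mgr.UpdateAllocatedResources(active) // sync before querying
		for pod, res := range mgr.Assignments() {
			if out[pod] == nil {
				out[pod] = map[string][]string{}
			}
			out[pod][name] = res
		}
	}
	return out
}
```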
The missing actor here is the memory manager. The memory manager has no
reconcile loop (which would implicitly fix the stale data problem) and no
explicit synchronization, so it is the unlucky one that reported stale data,
leading to the eventual understanding of the problem.
For this reason it was (and still is) important to exercise it during
the test.
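For context, the periodic cleanup that masks the stale data for the CPU
manager is conceptually just a ticker-driven loop; a deliberately
simplified illustration, not the actual cpumanager code:
```go
package cpumanager

import "time"

// reconcileLoop is an illustrative stand-in for the CPU manager's
// periodic reconciliation: every reconcilePeriod it drops state for pods
// that are gone, which is why a short cpuManagerReconcilePeriod hides
// the stale data that a manager without such a loop keeps reporting.
func reconcileLoop(reconcilePeriod time.Duration, removeStaleState func(), stop <-chan struct{}) {
	ticker := time.NewTicker(reconcilePeriod)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			removeStaleState()
		case <-stop:
			return
		}
	}
}
```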
It turns out, however, that the test is wrong, likely because of a hidden
dependency between the test expectations and the lane configuration (notably
the machine specs), so we disable the memory manager activation for the time
being, until we figure out a safe way to enable it.
Note this significantly weakens the signal for this specific test.
Signed-off-by: Francesco Romani <fromani@redhat.com>
As before when adding v1beta2, DRA drivers built using the
k8s.io/dynamic-resource-allocation helper packages remain compatible with all
Kubernetes releases >= 1.32. The helper code picks whichever API version is
enabled among v1beta1/v1beta2/v1.
However, the control plane now depends on v1, so a cluster configuration where
only v1beta1 or v1beta2 is enabled without v1 won't work.
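Conceptually the helper's choice boils down to a preference list; an
illustrative sketch, not the actual k8s.io/dynamic-resource-allocation
code:
```go
package main

import "fmt"

// pickAPIVersion illustrates the negotiation described above: the driver
// helper uses the newest resource.k8s.io version that the cluster has
// enabled. Names and mechanism are simplified for illustration.
func pickAPIVersion(enabled map[string]bool) (string, error) {
	for _, v := range []string{"v1", "v1beta2", "v1beta1"} {
		if enabled[v] {
			return v, nil
		}
	}
	return "", fmt.Errorf("no supported resource.k8s.io API version is enabled")
}

func main() {
	// A cluster with only v1beta1 enabled still works for the driver
	// helper, but the control plane itself now requires v1.
	fmt.Println(pickAPIVersion(map[string]bool{"v1beta1": true}))
}
```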
Fix the utilities to enable multi-app-container tests,
which were previously quite hard to implement.
Add a consumer of the new utility to demonstrate its usage
and to establish basic coverage.
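An illustrative example of the kind of helper this enables;
`makeMultiContainerPod` is a hypothetical name, not necessarily the
utility added here:
```go
package e2enode

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// makeMultiContainerPod builds a pod with several app containers sharing
// the same image and command, so tests can exercise per-container
// accounting without hand-writing each container.
func makeMultiContainerPod(name string, containerCount int) *v1.Pod {
	containers := make([]v1.Container, 0, containerCount)
	for i := 0; i < containerCount; i++ {
		containers = append(containers, v1.Container{
			Name:    fmt.Sprintf("%s-cnt-%d", name, i),
			Image:   "registry.k8s.io/pause:3.10",
			Command: []string{"/pause"},
		})
	}
	return &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: v1.PodSpec{
			RestartPolicy: v1.RestartPolicyNever,
			Containers:    containers,
		},
	}
}
```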
Signed-off-by: Francesco Romani <fromani@redhat.com>
Add an e2e test to ensure that if the Get endpoint is asked
about a non-existent pod, it returns an error.
Likewise, add an e2e test for terminated pods, which should
not be returned because they neither consume nor hold resources,
much like `List` does.
The expected usage pattern is to iterate over the list of
pods returned by `List`, but the endpoint must nevertheless
handle this case.
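A rough sketch of the new negative check, assuming a podresources v1
client (client construction elided; request fields follow
k8s.io/kubelet/pkg/apis/podresources/v1):
```go
package e2enode

import (
	"context"
	"fmt"

	kubeletpodresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
)

// expectGetFailsForMissingPod checks that asking the Get endpoint about
// a pod that does not exist (or has terminated) returns an error rather
// than an empty response.
func expectGetFailsForMissingPod(ctx context.Context, cli kubeletpodresourcesv1.PodResourcesListerClient) error {
	_, err := cli.Get(ctx, &kubeletpodresourcesv1.GetPodResourcesRequest{
		PodName:      "does-not-exist",
		PodNamespace: "default",
	})
	if err == nil {
		return fmt.Errorf("expected Get to fail for a non-existent pod, but it succeeded")
	}
	return nil
}
```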
Signed-off-by: Francesco Romani <fromani@redhat.com>
As long as we support cgroup v1, we want some test coverage.
This patch enables v1 coverage for most of the test cases.
We intentionally rule out the CFS quota tests because we
want to support this change only on cgroup v2.
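The usual way such tests branch on the cgroup version is to probe for
the unified hierarchy; a small sketch (the real test helpers may
differ):
```go
package e2enode

import "os"

// isCgroupV2 reports whether the node runs the unified cgroup v2
// hierarchy, by probing for cgroup.controllers at the cgroup root.
func isCgroupV2() bool {
	_, err := os.Stat("/sys/fs/cgroup/cgroup.controllers")
	return err == nil
}
```
Tests that depend on CFS quota behavior can then be skipped when
isCgroupV2() returns false.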
Signed-off-by: Francesco Romani <fromani@redhat.com>