This change introduces the ability for the Kubelet to monitor and report
the health of devices allocated via Dynamic Resource Allocation (DRA).
This addresses a key part of KEP-4680 by providing visibility into
device failures, which helps users and controllers diagnose pod failures.
The implementation includes:
- A new `v1alpha1.NodeHealth` gRPC service with a `WatchResources`
stream that DRA plugins can optionally implement.
- A health information cache within the Kubelet's DRA manager to track
the last known health of each device and handle plugin disconnections.
- An asynchronous update mechanism that triggers a pod sync when a
device's health changes.
- A new `allocatedResourcesStatus` field in `v1.ContainerStatus` to
expose the device health information to users via the Pod API.
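For illustration, here is a minimal, hedged sketch of how a controller or CLI tool might consume the new field, assuming the allocatedResourcesStatus layout described above (the unhealthyDevices helper is hypothetical and not part of this change):

    import corev1 "k8s.io/api/core/v1"

    // unhealthyDevices walks the container statuses of a pod and collects the
    // IDs of allocated devices whose reported health is not "Healthy".
    // Only the field layout is assumed here; named constants are avoided.
    func unhealthyDevices(pod *corev1.Pod) []string {
        var ids []string
        for _, cs := range pod.Status.ContainerStatuses {
            for _, rs := range cs.AllocatedResourcesStatus {
                for _, res := range rs.Resources {
                    if string(res.Health) != "Healthy" {
                        ids = append(ids, string(res.ResourceID))
                    }
                }
            }
        }
        return ids
    }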
Update vendor
KEP-4680: Fix lint, boilerplate, and codegen issues
Add another e2e test, add TODO for KEP4680 & update test infra helpers
Add Feature Gate e2e test
Fixing presubmits
Fix var names, feature gating, and nits
Fix DRA Health gRPC API according to review feedback
As before when adding v1beta2, DRA drivers built using the
k8s.io/dynamic-resource-allocation helper packages remain compatible with all
Kubernetes releases >= 1.32. The helper code picks whichever API version is
enabled from v1beta1/v1beta2/v1.
However, the control plane now depends on v1, so a cluster configuration where
only v1beta1 or v1beta2 are enabled without v1 won't work.
Introduce a mock net.Listener for tests that triggers a controlled
error on Close, enabling reliable simulation of gRPC server failures
in test scenarios.
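A minimal sketch of such a wrapper, with illustrative names (the actual test helper may differ):

    import "net"

    // fakeListener wraps a real net.Listener and returns a predefined error
    // from Close, so tests can reliably simulate a gRPC server shutdown failure.
    type fakeListener struct {
        net.Listener
        closeErr error
    }

    func (l *fakeListener) Close() error {
        if l.closeErr != nil {
            return l.closeErr
        }
        return l.Listener.Close()
    }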
Refactor StartPlugin and related test helpers to accept a variadic
list of options of any type, allowing both public and test-specific
options to be passed.
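A rough sketch of that pattern, with hypothetical names for the test-only pieces; only kubeletplugin.Option is a real type here:

    import "k8s.io/dynamic-resource-allocation/kubeletplugin"

    // startOption is a hypothetical test-only option type.
    type startOption func(cfg *testConfig)

    type testConfig struct {
        pluginOpts []kubeletplugin.Option
    }

    // applyOptions sorts a mixed, variadic option list into public plugin
    // options and test-specific options.
    func applyOptions(cfg *testConfig, opts ...any) {
        for _, opt := range opts {
            switch o := opt.(type) {
            case kubeletplugin.Option:
                cfg.pluginOpts = append(cfg.pluginOpts, o)
            case startOption:
                o(cfg)
            }
        }
    }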
Refactor the DRA e2e_node test helpers and test cases to accept
variadic kubeletplugin.Option arguments.
This change improves test flexibility and maintainability, allowing
new options to be passed in the future without requiring widespread
code changes.
There are no functional changes to test coverage or behavior.
The new metric informs admins whether DRA in general (special "driver_name: <any>"
label) and/or specific DRA drivers (other label values) are in use on nodes.
This is useful to know because removing a driver is only safe if it is not in
use. If a driver gets removed while it has prepared a ResourceClaim,
unpreparing that ResourceClaim and stopping its pods are blocked.
The implementation of the metric uses read locking of the claim
info cache. It retrieves "claims in use" and turns those into the metric.
The same code is also used to log changes in the claim info cache with
a diff. This hooks into a write update of the claim info cache and uses
contextual logging.
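For illustration, such a gauge vector could be declared with the component-base metrics helpers roughly like this; the metric name, help text, and the handling of the special "<any>" value are illustrative, not necessarily the exact ones added here:

    import "k8s.io/component-base/metrics"

    // Counts prepared ResourceClaims per DRA driver. A special driver_name
    // value (such as "<any>") can be derived from the sum over all drivers so
    // that admins can check whether DRA is in use at all.
    var draResourceClaimsInUse = metrics.NewGaugeVec(
        &metrics.GaugeOpts{
            Subsystem: "dra",
            Name:      "resource_claims_in_use",
            Help:      "Number of ResourceClaims currently prepared on the node, per DRA driver.",
        },
        []string{"driver_name"},
    )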
The unit tests check that metrics get calculated. The e2e_node test checks that
kubelet really exports the metrics data.
While at it, some bugs in claiminfo_test.go get fixed: the way the cache got
populated in the test no longer matched the code.
Added tests to verify DRA functionality with 2 different socket
configurations:
- the same socket is used for the registration and the DRA service
- 2 separate sockets are used for the registration and the DRA service
Used table-driven ginkgo specs to avoid code duplication:
https://onsi.github.io/ginkgo/#table-driven-tests
This change enhances the robustness of the DRA e2e tests by
validating behavior with different socket setups.
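A minimal sketch of that table-driven pattern; the entry descriptions and the runSocketTest helper are illustrative:

    import "github.com/onsi/ginkgo/v2"

    var _ = ginkgo.DescribeTable("DRA registration and plugin sockets",
        func(separateSockets bool) {
            // runSocketTest is a hypothetical stand-in for the shared test body.
            runSocketTest(separateSockets)
        },
        ginkgo.Entry("same socket for registration and DRA service", false),
        ginkgo.Entry("separate sockets for registration and DRA service", true),
    )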
Added an ability to specify the socket path for the DRA gRPC
service in the e2e node tests.
The PluginSocket option is added to allow setting the name
of the socket inside the directory where the DRA driver
creates the socket for the DRA gRPC calls. This is used by
the kubelet to connect to the DRA plugin.
The newDRAService and newRegistrar functions are updated to
accept a socketPath parameter, which is used to configure
the PluginDataDirectoryPath and PluginSocket options for the
DRA plugin.
This change enables more flexible configuration of the DRA
plugin in e2e tests, allowing for testing with different
socket paths.
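For illustration, the helper might translate the socketPath parameter into those options roughly like this (the filepath splitting is an assumption about the helper's internals):

    import (
        "path/filepath"

        "k8s.io/dynamic-resource-allocation/kubeletplugin"
    )

    // Split the requested socket path into the data directory and the socket
    // file name expected by the kubeletplugin helper.
    opts := []kubeletplugin.Option{
        kubeletplugin.PluginDataDirectoryPath(filepath.Dir(socketPath)),
        kubeletplugin.PluginSocket(filepath.Base(socketPath)),
    }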
Fixed the following warnings:
dra_test.go:884:2: singleCaseSwitch: should rewrite switch statement to if statement (gocritic)
switch podName {
^
dra_test.go:686:4: SA4006: this value of kubeletPlugin is never used (staticcheck)
kubeletPlugin = newDRAService(ctx, f.ClientSet, nodeName, driverName)
^
This ensures that ResourceSlices get removed also when a plugin becomes
unresponsive without removing the registration socket.
Tests are from https://github.com/kubernetes/kubernetes/pull/131073 by Ed,
with some modifications; the implementation is new.
The rest of the system logs information using "driverName" as key in structured
logging. The kubelet should do the same.
This also gets clarified in the code, together with using a consistent name
for a Plugin pointer: "plugin" instead of "client" or "instance".
The "New" in NewDRAPluginClient made no sense because it's not constructing
anything, and it returns a plugin, not a client, so it gets renamed to GetDRAPlugin.
Adds a DRA e2e_node test to verify that the kubelet plugin manager
retries plugin registration when the GetInfo call fails, and
successfully registers the plugin once GetInfo succeeds.
This ensures correct recovery and registration behavior for
DRA plugins in failure scenarios.
We only need one special "DynamicResourceAllocation" feature for the optional
node support of DRA (plugin registration, CDI support in the container
runtime). For individual features, the automatic labeling through
WithFeatureGate is sufficient.
To make DRA-related tests selectable in a label filter, a "DRA" label now gets
added instead of relying on the plain-text "DRA" in test names.
This change depends on an update of the DRA jobs.
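A rough sketch of how a spec can carry the new label together with the feature gate, assuming the e2e framework helpers named above; the describe text is illustrative:

    import (
        "github.com/onsi/ginkgo/v2"

        "k8s.io/kubernetes/pkg/features"
        "k8s.io/kubernetes/test/e2e/framework"
    )

    var _ = ginkgo.Describe("kubelet", framework.WithLabel("DRA"),
        framework.WithFeatureGate(features.DynamicResourceAllocation), func() {
            // specs ...
        })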
Passing a constant value to gomega.Consistently means that it will not re-check
while running.
Found by the linter after removing the suppression rule for the check. It was
disabled earlier because of a bug in the linter.
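A minimal before/after sketch of the fix; getValue stands in for whatever expression was being polled:

    import "github.com/onsi/gomega"

    // Broken: the argument is evaluated once, so Consistently keeps comparing
    // the same constant value instead of re-checking.
    gomega.Consistently(getValue()).Should(gomega.Equal(expected))

    // Fixed: pass the function itself so it is re-evaluated on every poll.
    gomega.Consistently(getValue).Should(gomega.Equal(expected))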
When supporting rolling updates, we cannot use the same fixed socket paths for
old and new pod. With the revised API, the caller no longer specifies the full
socket paths, only directories. The logic for naming the sockets can then live
in the helper.
While at it, avoid passing a context to the gRPC helper code when all that it
needs is a logger; accepting a context would cause confusion about whether
cancellation has an effect.
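A minimal sketch of the resulting signature change; newGRPCServer is a hypothetical name standing in for the helper:

    import "k8s.io/klog/v2"

    // Before: func newGRPCServer(ctx context.Context, ...) - a context wrongly
    // suggests that cancellation is honored.
    //
    // After: the helper only needs a logger.
    func newGRPCServer(logger klog.Logger /*, ... */) error {
        logger.V(5).Info("setting up gRPC server")
        return nil
    }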
Drivers need to know that because admin access may also grant additional
permissions. The allocator needs to ignore such results when determining which
devices are considered allocated.
In both cases it is conceptually cleaner to not rely on the content of the
ClaimSpec.
Some of the E2E node tests were flaky. Their timeout apparently was chosen
under the assumption that kubelet would retry immediately after a failed gRPC
call, with a factor of 2 as safety margin. But according to
0449cef8fd,
kubelet has a different, higher retry period of 90 seconds, which was exactly
the test timeout. The test timeout has to be higher than that.
As the tests don't use the gRPC call timeout anymore, it can be made
private. While at it, its name and documentation get updated.
This is in preparation for revamping the resource.k8s.io API completely. Because
there will be no support for transitioning from v1alpha2 to v1alpha3, the
roundtrip test data for that API in 1.29 and 1.30 gets removed.
Repeating the version in the import name of the API packages is not really
required. It was done for a while to support simpler grepping for usage of
alpha APIs, but there are better ways for that now. So during this transition,
"resourceapi" gets used instead of "resourcev1alpha3" and the version gets
dropped from informer and lister imports. The advantage is that the next bump
to v1beta1 will affect fewer source code lines.
Only source code where the version really matters (like API registration)
retains the versioned import.
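For illustration, the convention looks like this (shown with v1alpha3, since that is the version being introduced here):

    import (
        resourceapi "k8s.io/api/resource/v1alpha3"
    )

    // Code then refers to resourceapi.ResourceClaim etc., so a later bump to
    // v1beta1 only needs to touch the import lines.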
This is a first step towards making kubelet independent of the resource.k8s.io
API versioning because it now doesn't need to copy structs defined by that API
from the driver to the API server. The next step is removing the other
direction (reading ResourceClaim status and passing the resource handle to
drivers).
The drivers must get deployed so that they have their own connection to the API
server. Securing at least the writes via a validating admission policy should
be possible.
As before, the kubelet removes all ResourceSlices for its node at startup, then
DRA drivers recreate them if (and only if) they start up again. This ensures
that there are no orphaned ResourceSlices when a driver gets removed while the
kubelet was down.
While at it, logging gets cleaned up and updated to use structured, contextual
logging as much as possible. gRPC requests and streams now use a shared,
per-process request ID and streams also get logged.
This makes the API nicer:
  resourceClaims:
  - name: with-template
    resourceClaimTemplateName: test-inline-claim-template
  - name: with-claim
    resourceClaimName: test-shared-claim
Previously, this was:
  resourceClaims:
  - name: with-template
    source:
      resourceClaimTemplateName: test-inline-claim-template
  - name: with-claim
    source:
      resourceClaimName: test-shared-claim
A longer-term benefit is that other, future alternatives
might not make sense under the "source" umbrella.
This is a breaking change. It's justified because DRA is still
alpha and will have several other API breaks in 1.31.
The information is received from the DRA driver plugin through a new gRPC
streaming interface. This is backwards compatible with old DRA driver kubelet
plugins: their gRPC server will return "not implemented", and kubelet can
handle that. Therefore no API break is needed.
However, DRA drivers need to be updated because the Go API changed. They can
return
status.New(codes.Unimplemented, "no node resource support").Err()
if they don't support the new ListAndWatchResources method and
structured parameters.
The controller in kubelet then synchronizes this information from the driver
with NodeResourceSlice objects, creating, updating and deleting them as needed.
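A rough sketch of the kubelet-side handling, using hypothetical names for the generated gRPC types (drapb, the request struct) and for the syncNodeResourceSlices helper; only the Unimplemented handling reflects the behavior described above:

    import (
        "google.golang.org/grpc/codes"
        "google.golang.org/grpc/status"
    )

    stream, err := client.ListAndWatchResources(ctx, &drapb.ListAndWatchResourcesRequest{})
    if err != nil {
        return err
    }
    for {
        resp, err := stream.Recv()
        if status.Code(err) == codes.Unimplemented {
            // Old plugin without node resources support: nothing to publish,
            // not an error.
            return nil
        }
        if err != nil {
            return err
        }
        syncNodeResourceSlices(ctx, resp) // hypothetical helper
    }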
This changes the test registration so that, for tags where the framework has a
dedicated API (features, feature gates, slow, serial, etc.), those APIs are
used.
Arbitrary, custom tags are still left in place for now.
Combining all prepare/unprepare operations for a pod enables plugins to
optimize the execution. Plugins can continue to use the v1beta2 API for now,
but should switch. The new API is designed so that plugins which want to work
on each claim one-by-one can do so and then report errors for each claim
separately, i.e. partial success is supported.
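A rough sketch of the per-claim error reporting this enables on the plugin side; the message and field names are illustrative, not the exact generated API:

    // Each claim gets its own entry in the response, so one failing claim does
    // not force the plugin to fail the whole NodePrepareResources call.
    resp := &drapb.NodePrepareResourcesResponse{
        Claims: map[string]*drapb.NodePrepareResourceResponse{},
    }
    for _, claim := range req.Claims {
        devices, err := prepareClaim(ctx, claim) // hypothetical per-claim helper
        if err != nil {
            resp.Claims[claim.UID] = &drapb.NodePrepareResourceResponse{Error: err.Error()}
            continue
        }
        resp.Claims[claim.UID] = &drapb.NodePrepareResourceResponse{Devices: devices}
    }
    return resp, nil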
Added NodeAlphaFeature:DynamicResourceAllocation to the Node DRA test
to fix failing containerd serial jobs. Those jobs skip tests labeled
with NodeAlphaFeature.
Removed NodeFeature:DynamicResourceAllocation label from the
tests to fix cos-cgroupv1/v2-containerd-node-e2e-serial CI jobs.
It turned out that labeling DRA Node tests as NodeFeature was
a mistake. Re-labeling with NodeAlphaFeature would not work either.
It would fail certain containerd jobs, as DRA requires containerd >= 1.7.