John-Paul Sassine b7de71f9ce feat(kubelet): Add ResourceHealthStatus for DRA pods
This change introduces the ability for the Kubelet to monitor and report
the health of devices allocated via Dynamic Resource Allocation (DRA).
This addresses a key part of KEP-4680 by providing visibility into
device failures, which helps users and controllers diagnose pod failures.

The implementation includes:
- A new `v1alpha1.NodeHealth` gRPC service with a `WatchResources`
  stream that DRA plugins can optionally implement.
- A health information cache within the Kubelet's DRA manager to track
  the last known health of each device and handle plugin disconnections.
- An asynchronous update mechanism that triggers a pod sync when a
  device's health changes.
- A new `allocatedResourcesStatus` field in `v1.ContainerStatus` to
  expose the device health information to users via the Pod API.
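
As a rough illustration of the plugin-side contract, a DRA plugin opting into the health service might implement something along these lines. This is only a sketch: the service and method names (`v1alpha1.NodeHealth`, `WatchResources`) come from the description above, while the message shape and the plain-Go stream stand-in are hypothetical, not the generated gRPC API.

package sketch

import "context"

// DeviceHealth is a hypothetical stand-in for the wire message: which device
// the update is about and whether it is currently healthy.
type DeviceHealth struct {
	PoolName   string
	DeviceName string
	Healthy    bool
}

// healthServer sketches the plugin side of the NodeHealth service. The
// updates channel would be fed by whatever mechanism the driver uses to
// detect device failures.
type healthServer struct {
	updates <-chan DeviceHealth
}

// WatchResources streams health updates to the kubelet until the stream
// context is cancelled. On the kubelet side, each update lands in the DRA
// manager's health cache and, on a change, triggers a pod sync.
func (s *healthServer) WatchResources(ctx context.Context, send func(DeviceHealth) error) error {
	for {
		select {
		case <-ctx.Done():
			// Kubelet disconnected; its cache keeps the last known state.
			return ctx.Err()
		case update := <-s.updates:
			if err := send(update); err != nil {
				return err
			}
		}
	}
}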


dra-test-driver

This driver implements the controller and a resource kubelet plugin for dynamic resource allocation. This is done in a single binary to minimize the amount of boilerplate code. "Real" drivers could also implement both in different binaries.

Usage

The controller could be deployed as a Deployment with leader election, and the kubelet plugin as a DaemonSet. The controller can also run as a Kubernetes client outside of a cluster; the same works for the kubelet plugin when using port forwarding. This is how it is used during testing.

Valid parameters are key/value string pairs stored in a ConfigMap. They get copied into the ResourceClaimStatus with a "user_" or "admin_" prefix, depending on whether they came from the ResourceClaim or the DeviceClass. The controller stores them as a JSON map in the ResourceHandle field, and the kubelet plugin then sets these attributes as environment variables in each container that uses the resource.
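
For illustration, the prefixing could look roughly like this. Only the "user_"/"admin_" prefixes come from the text above; the function and variable names are invented for this sketch.

package main

import "fmt"

// paramsToEnv merges claim parameters (from the ResourceClaim's ConfigMap)
// and class parameters (from the DeviceClass), applying the documented
// prefixes so a container can tell the two sources apart.
func paramsToEnv(claimParams, classParams map[string]string) []string {
	var env []string
	for k, v := range claimParams {
		env = append(env, fmt.Sprintf("user_%s=%s", k, v))
	}
	for k, v := range classParams {
		env = append(env, fmt.Sprintf("admin_%s=%s", k, v))
	}
	return env
}

func main() {
	fmt.Println(paramsToEnv(
		map[string]string{"a": "b"},
		map[string]string{"x": "y"},
	))
	// Output (map order may vary): [user_a=b admin_x=y]
}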

Resource availability is configurable and can simulate different scenarios (see the sketch after this list):

  • Network-attached resources, available on all nodes where the node driver runs, or host-local resources, available only on the node where they were allocated.
  • Shared or unshared allocations.
  • Unlimited or limited resources. The limit is a simple number of allocations per cluster or node.
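
Taken together, these scenarios boil down to a small set of knobs. A hypothetical configuration struct (field names invented for this sketch; the real driver's options may differ) might look like:

package sketch

// Resources captures the availability scenarios listed above.
type Resources struct {
	NodeLocal      bool     // true: host-local, usable only on the allocating node
	Nodes          []string // nodes on which the node driver runs
	Shareable      bool     // whether one allocation may be shared by several pods
	MaxAllocations int      // 0 = unlimited, otherwise a per-node or per-cluster cap
}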

While the functionality itself is very limited, the code strives to showcase best practices and supports metrics, leader election, and the same logging options as Kubernetes.

Design

The binary itself is a Cobra command with two operations, controller and kubelet-plugin. Logging is done with contextual logging.
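
A minimal sketch of that command layout, with placeholder run functions instead of the real driver logic:

package main

import (
	"os"

	"github.com/spf13/cobra"
)

func main() {
	root := &cobra.Command{Use: "dra-test-driver"}

	root.AddCommand(&cobra.Command{
		Use: "controller",
		RunE: func(cmd *cobra.Command, args []string) error {
			return runController() // start the claim controller, with leader election
		},
	})
	root.AddCommand(&cobra.Command{
		Use: "kubelet-plugin",
		RunE: func(cmd *cobra.Command, args []string) error {
			return runKubeletPlugin() // register with the kubelet and serve gRPC
		},
	})

	if err := root.Execute(); err != nil {
		os.Exit(1)
	}
}

// Placeholders for the actual driver logic.
func runController() error    { return nil }
func runKubeletPlugin() error { return nil }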

The k8s.io/dynamic-resource-allocation/controller package implements the interaction with ResourceClaims. It is generic and relies on an interface to implement the actual driver logic. Long-term that part could be split out into a reusable utility package.
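
The driver-facing interface might look roughly like the following. This is a hypothetical shape meant to show the split between generic controller machinery and driver logic, not the package's actual API:

package sketch

import "context"

// ResourceClaim stands in for the real API object; only what the sketch
// needs is shown.
type ResourceClaim struct {
	Namespace, Name string
	Parameters      map[string]string
}

// Driver is the callback interface a driver would implement. The generic
// controller handles informers, work queues, and status updates, and defers
// the actual decisions to these methods.
type Driver interface {
	// Allocate reserves capacity for the claim, optionally for a specific
	// node, and returns opaque data for the kubelet plugin to consume later.
	Allocate(ctx context.Context, claim *ResourceClaim, selectedNode string) ([]byte, error)
	// Deallocate releases whatever Allocate reserved.
	Deallocate(ctx context.Context, claim *ResourceClaim) error
}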

The k8s.io/dynamic-resource-allocation/kubelet-plugin package implements the interaction with the kubelet, again relying only on the interface defined for the kubelet<->DRA plugin API.
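
On the node side the contract is similar in spirit. The method names below follow the DRA gRPC calls (NodePrepareResource/NodeUnprepareResource), but the signatures are simplified stand-ins rather than the generated API:

package sketch

import "context"

// NodeServer sketches what the kubelet asks of the plugin: prepare an
// allocated resource before the pod's containers start, and undo that work
// when the pod goes away.
type NodeServer interface {
	// NodePrepareResource makes the resource usable on this node (for the
	// test driver: writing a CDI spec with the environment attributes) and
	// returns CDI device IDs to inject into the containers.
	NodePrepareResource(ctx context.Context, claimUID string, allocation []byte) (cdiDevices []string, err error)
	// NodeUnprepareResource removes whatever NodePrepareResource created.
	NodeUnprepareResource(ctx context.Context, claimUID string) error
}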

app is the driver itself with a very simple implementation of the interfaces.

Deployment

local-up-cluster.sh

To try out the feature, build Kubernetes, then in one console run:

RUNTIME_CONFIG="resource.k8s.io/v1alpha3" FEATURE_GATES=DynamicResourceAllocation=true ALLOW_PRIVILEGED=1 ./hack/local-up-cluster.sh -O

In another:

sudo mkdir -p /var/run/cdi
sudo mkdir -p /var/lib/kubelet/plugins/test-driver.cdi.k8s.io
sudo mkdir -p /var/lib/kubelet/plugins_registry
sudo chmod a+rx /var/lib/kubelet /var/lib/kubelet/plugins
sudo chmod a+rwx /var/run/cdi /var/lib/kubelet/plugins_registry /var/lib/kubelet/plugins/test-driver.cdi.k8s.io
KUBECONFIG=/var/run/kubernetes/admin.kubeconfig go run ./test/e2e/dra/test-driver -v=5 kubelet-plugin --node-name=127.0.0.1

And finally:

$ export KUBECONFIG=/var/run/kubernetes/admin.kubeconfig
$ kubectl create -f test/e2e/dra/test-driver/deploy/example/deviceclass.yaml
deviceclass.resource.k8s.io/example created
$ kubectl create -f test/e2e/dra/test-driver/deploy/example/pod-inline.yaml
configmap/pause-claim-parameters created
pod/pause created

$ kubectl get resourceclaims
NAME             CLASSNAME   STATE                AGE
pause-resource   example     allocated,reserved   19s

$ kubectl get pods
NAME    READY   STATUS    RESTARTS   AGE
pause   1/1     Running   0          23s

There are also examples for other scenarios (multiple pods, multiple claims).

multi-node cluster

At this point there are no container images that contain the test driver and therefore it cannot be deployed on "normal" clusters.

Prior art

Some of this code was derived from the external-resizer. The controller package corresponds to its controller logic, which in turn is similar to the sig-storage-lib-external-provisioner.