This change introduces the ability for the Kubelet to monitor and report
the health of devices allocated via Dynamic Resource Allocation (DRA).
This addresses a key part of KEP-4680 by providing visibility into
device failures, which helps users and controllers diagnose pod failures.
The implementation includes:
- A new `v1alpha1.NodeHealth` gRPC service with a `WatchResources`
stream that DRA plugins can optionally implement.
- A health information cache within the Kubelet's DRA manager to track
the last known health of each device and handle plugin disconnections.
- An asynchronous update mechanism that triggers a pod sync when a
device's health changes.
- A new `allocatedResourcesStatus` field in `v1.ContainerStatus` to
expose the device health information to users via the Pod API.
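For illustration, here is a minimal, hedged sketch of how a controller or CLI tool might consume the new field, assuming the allocatedResourcesStatus layout described above (the unhealthyDevices helper is hypothetical and not part of this change):

    import corev1 "k8s.io/api/core/v1"

    // unhealthyDevices walks the container statuses of a pod and collects the
    // IDs of allocated devices whose reported health is not "Healthy".
    // Only the field layout is assumed here; named constants are avoided.
    func unhealthyDevices(pod *corev1.Pod) []string {
        var ids []string
        for _, cs := range pod.Status.ContainerStatuses {
            for _, rs := range cs.AllocatedResourcesStatus {
                for _, res := range rs.Resources {
                    if string(res.Health) != "Healthy" {
                        ids = append(ids, string(res.ResourceID))
                    }
                }
            }
        }
        return ids
    }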
Update vendor
KEP-4680: Fix lint, boilerplate, and codegen issues
Add another e2e test, add TODO for KEP4680 & update test infra helpers
Add Feature Gate e2e test
Fixing presubmits
Fix var names, feature gating, and nits
Fix DRA Health gRPC API according to review feedback
As before when adding v1beta2, DRA drivers built using the
k8s.io/dynamic-resource-allocation helper packages remain compatible with all
Kubernetes releases >= 1.32. The helper code picks whichever API version is
enabled from v1beta1/v1beta2/v1.
However, the control plane now depends on v1, so a cluster configuration where
only v1beta1 or v1beta2 are enabled without v1 won't work.
Introduce a mock net.Listener for tests that triggers a controlled
error on Close, enabling reliable simulation of gRPC server failures
in test scenarios.
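A minimal sketch of such a wrapper, with illustrative names (the actual test helper may differ):

    import "net"

    // fakeListener wraps a real net.Listener and returns a predefined error
    // from Close, so tests can reliably simulate a gRPC server shutdown failure.
    type fakeListener struct {
        net.Listener
        closeErr error
    }

    func (l *fakeListener) Close() error {
        if l.closeErr != nil {
            return l.closeErr
        }
        return l.Listener.Close()
    }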
Refactor StartPlugin and related test helpers to accept a variadic
list of options of any type, allowing both public and test-specific
options to be passed.
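A rough sketch of that pattern, with hypothetical names for the test-only pieces; only kubeletplugin.Option is a real type here:

    import "k8s.io/dynamic-resource-allocation/kubeletplugin"

    // startOption is a hypothetical test-only option type.
    type startOption func(cfg *testConfig)

    type testConfig struct {
        pluginOpts []kubeletplugin.Option
    }

    // applyOptions sorts a mixed, variadic option list into public plugin
    // options and test-specific options.
    func applyOptions(cfg *testConfig, opts ...any) {
        for _, opt := range opts {
            switch o := opt.(type) {
            case kubeletplugin.Option:
                cfg.pluginOpts = append(cfg.pluginOpts, o)
            case startOption:
                o(cfg)
            }
        }
    }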
Refactor the DRA e2e_node test helpers and test cases to accept
variadic kubeletplugin.Option arguments.
This change improves test flexibility and maintainability, allowing
new options to be passed in the future without requiring widespread
code changes.
There are no functional changes to test coverage or behavior.
The new metric informs admins whether DRA in general (special "driver_name: <any>"
label) and/or specific DRA drivers (other label values) are in use on nodes.
This is useful to know because removing a driver is only safe if it is not in
use. If a driver gets removed while it has prepared a ResourceClaim,
unpreparing that ResourceClaim and stopping its pods are blocked.
The implementation of the metric uses read locking of the claim
info cache. It retrieves "claims in use" and turns those into the metric.
The same code is also used to log changes in the claim info cache with
a diff. This hooks into a write update of the claim info cache and uses
contextual logging.
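For illustration, such a gauge vector could be declared with the component-base metrics helpers roughly like this; the metric name, help text, and the handling of the special "<any>" value are illustrative, not necessarily the exact ones added here:

    import "k8s.io/component-base/metrics"

    // Counts prepared ResourceClaims per DRA driver. A special driver_name
    // value (such as "<any>") can be derived from the sum over all drivers so
    // that admins can check whether DRA is in use at all.
    var draResourceClaimsInUse = metrics.NewGaugeVec(
        &metrics.GaugeOpts{
            Subsystem: "dra",
            Name:      "resource_claims_in_use",
            Help:      "Number of ResourceClaims currently prepared on the node, per DRA driver.",
        },
        []string{"driver_name"},
    )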
The unit tests check that metrics get calculated. The e2e_node test checks that
kubelet really exports the metrics data.
While at it, some bugs in claiminfo_test.go get fixed: the way the cache got
populated in the test no longer matched the code.
Added tests to verify DRA functionality with 2 different socket
configurations:
- the same socket is used for the registration and the DRA service
- 2 separate sockets are used for the registration and the DRA service
Used table-driven ginkgo specs to avoid code duplication:
https://onsi.github.io/ginkgo/#table-driven-tests
This change enhances the robustness of the DRA e2e tests by
validating behavior with different socket setups.
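A minimal sketch of that table-driven pattern; the entry descriptions and the runSocketTest helper are illustrative:

    import "github.com/onsi/ginkgo/v2"

    var _ = ginkgo.DescribeTable("DRA registration and plugin sockets",
        func(separateSockets bool) {
            // runSocketTest is a hypothetical stand-in for the shared test body.
            runSocketTest(separateSockets)
        },
        ginkgo.Entry("same socket for registration and DRA service", false),
        ginkgo.Entry("separate sockets for registration and DRA service", true),
    )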
Added an ability to specify the socket path for the DRA gRPC
service in the e2e node tests.
The PluginSocket option is added to allow setting the name
of the socket inside the directory where the DRA driver
creates the socket for the DRA gRPC calls. This is used by
the kubelet to connect to the DRA plugin.
The newDRAService and newRegistrar functions are updated to
accept a socketPath parameter, which is used to configure
the PluginDataDirectoryPath and PluginSocket options for the
DRA plugin.
This change enables more flexible configuration of the DRA
plugin in e2e tests, allowing for testing with different
socket paths.
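For illustration, the helper might translate the socketPath parameter into those options roughly like this (the filepath splitting is an assumption about the helper's internals):

    import (
        "path/filepath"

        "k8s.io/dynamic-resource-allocation/kubeletplugin"
    )

    // Split the requested socket path into the data directory and the socket
    // file name expected by the kubeletplugin helper.
    opts := []kubeletplugin.Option{
        kubeletplugin.PluginDataDirectoryPath(filepath.Dir(socketPath)),
        kubeletplugin.PluginSocket(filepath.Base(socketPath)),
    }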
Fixed the following warnings:
dra_test.go:884:2: singleCaseSwitch: should rewrite switch statement to if statement (gocritic)
switch podName {
^
dra_test.go:686:4: SA4006: this value of kubeletPlugin is never used (staticcheck)
kubeletPlugin = newDRAService(ctx, f.ClientSet, nodeName, driverName)
^
This ensures that ResourceSlices get removed also when a plugin becomes
unresponsive without removing the registration socket.
Tests are from https://github.com/kubernetes/kubernetes/pull/131073 by Ed,
with some modifications; the implementation is new.
The rest of the system logs information using "driverName" as key in structured
logging. The kubelet should do the same.
This also gets clarified in the code, together with using a consistent name
for a Plugin pointer: "plugin" instead of "client" or "instance".
The "New" in NewDRAPluginClient made no sense because it's not constructing
anything, and it returns a plugin, not a client, so it gets renamed to GetDRAPlugin.
Adds a DRA e2e_node test to verify that the kubelet plugin manager
retries plugin registration when the GetInfo call fails, and
successfully registers the plugin once GetInfo succeeds.
This ensures correct recovery and registration behavior for
DRA plugins in failure scenarios.
We only need one special "DynamicResourceAllocation" feature for the optional
node support of DRA (plugin registration, CDI support in the container
runtime). For individual features, the automatic labeling through
WithFeatureGate is sufficient.
To make DRA-related tests selectable in a label filter, a "DRA" label now gets
added instead of relying on the plain-text "DRA" in test names.
This change depends on an update of the DRA jobs.
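A rough sketch of how a spec can carry the new label together with the feature gate, assuming the e2e framework helpers named above; the describe text is illustrative:

    import (
        "github.com/onsi/ginkgo/v2"

        "k8s.io/kubernetes/pkg/features"
        "k8s.io/kubernetes/test/e2e/framework"
    )

    var _ = ginkgo.Describe("kubelet", framework.WithLabel("DRA"),
        framework.WithFeatureGate(features.DynamicResourceAllocation), func() {
            // specs ...
        })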
Passing a constant value to gomega.Consistently means that it will not re-check
while running.
Found by the linter after removing the suppression rule for the check. It was
disabled earlier because of a bug in the linter.
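A minimal before/after sketch of the fix; getValue stands in for whatever expression was being polled:

    import "github.com/onsi/gomega"

    // Broken: the argument is evaluated once, so Consistently keeps comparing
    // the same constant value instead of re-checking.
    gomega.Consistently(getValue()).Should(gomega.Equal(expected))

    // Fixed: pass the function itself so it is re-evaluated on every poll.
    gomega.Consistently(getValue).Should(gomega.Equal(expected))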
When supporting rolling updates, we cannot use the same fixed socket paths for
old and new pod. With the revised API, the caller no longer specifies the full
socket paths, only directories. The logic for naming the sockets can then live
in the helper.
While at it, avoid passing a context to the gRPC helper code when all that it
needs is a logger; accepting a context would cause confusion about whether
cancellation has an effect.
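A minimal sketch of the resulting signature change; newGRPCServer is a hypothetical name standing in for the helper:

    import "k8s.io/klog/v2"

    // Before: func newGRPCServer(ctx context.Context, ...) - a context wrongly
    // suggests that cancellation is honored.
    //
    // After: the helper only needs a logger.
    func newGRPCServer(logger klog.Logger /*, ... */) error {
        logger.V(5).Info("setting up gRPC server")
        return nil
    }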
Drivers need to know that because admin access may also grant additional
permissions. The allocator needs to ignore such results when determining which
devices are considered allocated.
In both cases it is conceptually cleaner to not rely on the content of the
ClaimSpec.
Some of the E2E node tests were flaky. Their timeout apparently was chosen
under the assumption that kubelet would retry immediately after a failed gRPC
call, with a factor of 2 as safety margin. But according to
0449cef8fd,
kubelet has a different, higher retry period of 90 seconds, which was exactly
the test timeout. The test timeout has to be higher than that.
As the tests don't use the gRPC call timeout anymore, it can be made
private. While at it, its name and documentation get updated.
This is in preparation for revamping the resource.k8s.io API completely. Because
there will be no support for transitioning from v1alpha2 to v1alpha3, the
roundtrip test data for that API in 1.29 and 1.30 gets removed.
Repeating the version in the import name of the API packages is not really
required. It was done for a while to support simpler grepping for usage of
alpha APIs, but there are better ways for that now. So during this transition,
"resourceapi" gets used instead of "resourcev1alpha3" and the version gets
dropped from informer and lister imports. The advantage is that the next bump
to v1beta1 will affect fewer source code lines.
Only source code where the version really matters (like API registration)
retains the versioned import.
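For illustration, the convention looks like this (shown with v1alpha3, since that is the version being introduced here):

    import (
        resourceapi "k8s.io/api/resource/v1alpha3"
    )

    // Code then refers to resourceapi.ResourceClaim etc., so a later bump to
    // v1beta1 only needs to touch the import lines.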
This is a first step towards making kubelet independent of the resource.k8s.io
API versioning because it now doesn't need to copy structs defined by that API
from the driver to the API server. The next step is removing the other
direction (reading ResourceClaim status and passing the resource handle to
drivers).
The drivers must get deployed so that they have their own connection to the API
server. Securing at least the writes via a validating admission policy should
be possible.
As before, the kubelet removes all ResourceSlices for its node at startup, then
DRA drivers recreate them if (and only if) they start up again. This ensures
that there are no orphaned ResourceSlices when a driver gets removed while the
kubelet was down.
While at it, logging gets cleaned up and updated to use structured, contextual
logging as much as possible. gRPC requests and streams now use a shared,
per-process request ID and streams also get logged.
This makes the API nicer:
  resourceClaims:
  - name: with-template
    resourceClaimTemplateName: test-inline-claim-template
  - name: with-claim
    resourceClaimName: test-shared-claim
Previously, this was:
  resourceClaims:
  - name: with-template
    source:
      resourceClaimTemplateName: test-inline-claim-template
  - name: with-claim
    source:
      resourceClaimName: test-shared-claim
A longer-term benefit is that other, future alternatives
might not make sense under the "source" umbrella.
This is a breaking change. It's justified because DRA is still
alpha and will have several other API breaks in 1.31.
The information is received from the DRA driver plugin through a new gRPC
streaming interface. This is backwards compatible with old DRA driver kubelet
plugins: their gRPC server will return "not implemented", and kubelet can
handle that. Therefore no API break is needed.
However, DRA drivers need to be updated because the Go API changed. They can
return
status.New(codes.Unimplemented, "no node resource support").Err()
if they don't support the new ListAndWatchResources method and
structured parameters.
The controller in kubelet then synchronizes this information from the driver
with NodeResourceSlice objects, creating, updating and deleting them as needed.
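A rough sketch of the kubelet-side handling, using hypothetical names for the generated gRPC types (drapb, the request struct) and for the syncNodeResourceSlices helper; only the Unimplemented handling reflects the behavior described above:

    import (
        "google.golang.org/grpc/codes"
        "google.golang.org/grpc/status"
    )

    stream, err := client.ListAndWatchResources(ctx, &drapb.ListAndWatchResourcesRequest{})
    if err != nil {
        return err
    }
    for {
        resp, err := stream.Recv()
        if status.Code(err) == codes.Unimplemented {
            // Old plugin without node resources support: nothing to publish,
            // not an error.
            return nil
        }
        if err != nil {
            return err
        }
        syncNodeResourceSlices(ctx, resp) // hypothetical helper
    }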
This changes the test registration so that, for tags where the framework has a
dedicated API (features, feature gates, slow, serial, etc.), those APIs are
used.
Arbitrary, custom tags are still left in place for now.
Combining all prepare/unprepare operations for a pod enables plugins to
optimize the execution. Plugins can continue to use the v1beta2 API for now,
but should switch. The new API is designed so that plugins which want to work
on each claim one-by-one can do so and then report errors for each claim
separately, i.e. partial success is supported.
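A rough sketch of the per-claim error reporting this enables on the plugin side; the message and field names are illustrative, not the exact generated API:

    // Each claim gets its own entry in the response, so one failing claim does
    // not force the plugin to fail the whole NodePrepareResources call.
    resp := &drapb.NodePrepareResourcesResponse{
        Claims: map[string]*drapb.NodePrepareResourceResponse{},
    }
    for _, claim := range req.Claims {
        devices, err := prepareClaim(ctx, claim) // hypothetical per-claim helper
        if err != nil {
            resp.Claims[claim.UID] = &drapb.NodePrepareResourceResponse{Error: err.Error()}
            continue
        }
        resp.Claims[claim.UID] = &drapb.NodePrepareResourceResponse{Devices: devices}
    }
    return resp, nil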
Added NodeAlphaFeature:DynamicResourceAllocation to the Node DRA test
to fix failing containerd serial jobs. Those jobs skip tests labeled
with NodeAlphaFeature.
Removed NodeFeature:DynamicResourceAllocation label from the
tests to fix cos-cgroupv1/v2-containerd-node-e2e-serial CI jobs.
It turned out that labeling DRA Node tests as NodeFeature was
a mistake. Re-labeling with NodeAlphaFeature would not work either.
It would fail certain containerd jobs, as DRA requires containerd >= 1.7.