An added error assertion for the NodePrepareResources call revealed a
"rpc error: code = DeadlineExceeded desc = context deadline exceeded"
failure in the TestGRPCConnIsReused test.
Setting the clientCallTimeout field when creating the plugin should fix it.
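For illustration, a minimal sketch of how such a per-call timeout shows up in
a test assertion; the helper and its arguments are hypothetical, not the
actual test code:

    package dratest

    import (
        "context"
        "testing"
        "time"

        "google.golang.org/grpc/codes"
        "google.golang.org/grpc/status"
    )

    // Sketch only: with a per-call timeout, a hanging NodePrepareResources
    // call surfaces as codes.DeadlineExceeded, which the test can assert on.
    func assertDeadlineExceeded(t *testing.T, clientCallTimeout time.Duration, call func(context.Context) error) {
        t.Helper()
        ctx, cancel := context.WithTimeout(context.Background(), clientCallTimeout)
        defer cancel()
        if err := call(ctx); status.Code(err) != codes.DeadlineExceeded {
            t.Fatalf("expected DeadlineExceeded, got %v", err)
        }
    }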
Some of the E2E node tests were flaky. Their timeout had apparently been chosen
under the assumption that kubelet would retry immediately after a failed gRPC
call, with a factor of two as a safety margin. But according to
0449cef8fd,
kubelet has a different, higher retry period of 90 seconds, which was exactly
the test timeout. The test timeout has to be higher than that.
As the tests no longer use the gRPC call timeout, it can be made
private. While at it, its name and documentation get updated.
This adds the ability to select specific requests inside a claim for a
container.
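As a hedged illustration, the per-container selection could look roughly like
this in the pod API Go types (the Request field name is an assumption here):

    package example

    import corev1 "k8s.io/api/core/v1"

    // Illustrative sketch: the container consumes only the "gpu" request from
    // the pod-level claim "accelerators" instead of the whole claim.
    // The Request field name is assumed, not authoritative.
    var container = corev1.Container{
        Name: "workload",
        Resources: corev1.ResourceRequirements{
            Claims: []corev1.ResourceClaim{
                {Name: "accelerators", Request: "gpu"},
            },
        },
    }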
NodePrepareResources is always called, even if the claim is not used by any
container. This could be useful for drivers where that call has some effect
other than injecting CDI device IDs into containers. It also ensures that
drivers can validate configs.
The pod resource API can no longer report a class for each claim because there
is no such 1:1 relationship anymore. Instead, that API reports the claim, the
API devices (with driver/pool/device as their ID), and the CDI device IDs. The kubelet
itself doesn't extract that information from the claim. Instead, it relies on
drivers to report this information when the claim gets prepared. This isolates
the kubelet from API changes.
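A rough sketch of the shape of what a driver reports per prepared claim; the
types below are illustrative only and do not reproduce the actual kubelet
plugin gRPC messages:

    package example

    // Illustrative only; real message and field names may differ.
    type PreparedDevice struct {
        // Identity of the allocated device in the resource.k8s.io API.
        Driver string
        Pool   string
        Device string
        // CDI device IDs to inject into containers using this claim.
        CDIDeviceIDs []string
    }

    // What the kubelet records in its claim info cache and later exposes via
    // the pod resource API.
    type PreparedClaim struct {
        ClaimUID string
        Devices  []PreparedDevice
    }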
Because of a faulty E2E test, kubelet was told to contact the wrong driver for
a claim. This was not visible in the kubelet log output. Changes to the
claim info cache are now logged. While at it, the naming of variables and
some existing log output get harmonized.
Co-authored-by: Oksana Baranova <oksana.baranova@intel.com>
Co-authored-by: Ed Bartosh <eduard.bartosh@intel.com>
This is a complete revamp of the original API. Some of the key
differences (see the sketch after this list):
- refocused on structured parameters and allocating devices
- support for constraints across devices
- support for allocating "all" or a fixed amount
of similar devices in a single request
- no class for ResourceClaims, instead individual
device requests are associated with a mandatory
DeviceClass
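A sketch of a claim under the revamped API; type and field names follow the
v1alpha3 draft and are best-effort assumptions, not a definitive reference:

    package example

    import (
        resourceapi "k8s.io/api/resource/v1alpha3"
    )

    // Sketch only: one request for two devices of a mandatory DeviceClass.
    // AllocationModeAll would request "all matching devices" instead, and
    // constraints can tie devices from several requests together.
    var claim = resourceapi.ResourceClaim{
        Spec: resourceapi.ResourceClaimSpec{
            Devices: resourceapi.DeviceClaim{
                Requests: []resourceapi.DeviceRequest{
                    {
                        Name:            "gpus",
                        DeviceClassName: "example.com-gpu",
                        AllocationMode:  resourceapi.DeviceAllocationModeExactCount,
                        Count:           2,
                    },
                },
            },
        },
    }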
For the sake of simplicity, optional basic types (ints, strings) where the zero
value is the default are represented as values in the API types. This makes Go
code simpler because it doesn't have to check for nil (consumers) and values
can be set directly (producers). The effect is that in protobuf, these fields
always get encoded because `opt` only has an effect for pointers.
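A small, hypothetical illustration of that trade-off:

    package example

    // Hypothetical field, not an actual API type.
    type ExampleSpec struct {
        // As a value: consumers read it without a nil check and producers set
        // it directly, but protobuf encodes the field even when it is zero.
        Count int64

        // As a pointer: protobuf can omit the field when nil, but consumers
        // must nil-check and producers need to take an address to set it.
        CountPtr *int64
    }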
The roundtrip test data for v1.29.0 and v1.30.0 changes because of the new
"request" field. This is considered acceptable because the entire `claims`
field in the pod spec is still alpha.
The implementation is complete enough to bring up the apiserver.
Adapting other components follows.
This is in preparation for completely revamping the resource.k8s.io API. Because
there will be no support for transitioning from v1alpha2 to v1alpha3, the
roundtrip test data for that API in 1.29 and 1.30 gets removed.
Repeating the version in the import name of the API packages is not really
required. It was done for a while to support simpler grepping for usage of
alpha APIs, but there are better ways for that now. So during this transition,
"resourceapi" gets used instead of "resourcev1alpha3" and the version gets
dropped from informer and lister imports. The advantage is that the next bump
to v1beta1 will affect fewer source code lines.
Only source code where the version really matters (like API registration)
retains the versioned import.
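For example (the informer and lister aliases shown here are illustrative; the
API alias is the one mentioned above):

    import (
        resourceapi "k8s.io/api/resource/v1alpha3"

        resourceinformers "k8s.io/client-go/informers/resource/v1alpha3"
        resourcelisters "k8s.io/client-go/listers/resource/v1alpha3"
    )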
This is the second and final step towards making kubelet independent of the
resource.k8s.io API versioning because it now doesn't need to copy structs
defined by that API from the driver to the API server.
This is a first step towards making kubelet independent of the resource.k8s.io
API versioning because it now doesn't need to copy structs defined by that API
from the driver to the API server. The next step is removing the other
direction (reading ResourceClaim status and passing the resource handle to
drivers).
The drivers must get deployed so that they have their own connection to the API
server. Securing at least the writes via a validating admission policy should
be possible.
As before, the kubelet removes all ResourceSlices for its node at startup, then
DRA drivers recreate them if (and only if) they start up again. This ensures
that there are no orphaned ResourceSlices when a driver gets removed while the
kubelet was down.
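A rough sketch of that startup cleanup, assuming a DeleteCollection call
filtered by node name (the group version and the field selector key are
assumptions):

    package example

    import (
        "context"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // Sketch only: remove all ResourceSlices of this node at kubelet startup;
    // DRA drivers recreate them if (and only if) they register again.
    func wipeResourceSlices(ctx context.Context, client kubernetes.Interface, nodeName string) error {
        return client.ResourceV1alpha3().ResourceSlices().DeleteCollection(ctx,
            metav1.DeleteOptions{},
            metav1.ListOptions{
                // Assumed field selector key for "slices owned by this node".
                FieldSelector: "nodeName=" + nodeName,
            },
        )
    }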
While at it, logging gets cleaned up and updated to use structured, contextual
logging as much as possible. gRPC requests and streams now use a shared,
per-process request ID and streams also get logged.
Recording the expected and actual checksum in the error makes it possible to
provide that information, for example in a failed test like the ones for DRA.
Otherwise developers have to manually step through the test with a debugger to
figure out what the new checksum is.
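A minimal sketch of the idea (hypothetical helper, not the actual checkpoint
code):

    package example

    import "fmt"

    // Hypothetical sketch: report both checksums so a failing test prints the
    // value the test data has to be updated to.
    func verifyChecksum(expected, actual uint64) error {
        if expected != actual {
            return fmt.Errorf("checkpoint checksum mismatch: expected %d, actual %d", expected, actual)
        }
        return nil
    }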
This makes the API nicer:
    resourceClaims:
    - name: with-template
      resourceClaimTemplateName: test-inline-claim-template
    - name: with-claim
      resourceClaimName: test-shared-claim

Previously, this was:

    resourceClaims:
    - name: with-template
      source:
        resourceClaimTemplateName: test-inline-claim-template
    - name: with-claim
      source:
        resourceClaimName: test-shared-claim
A more long-term benefit is that other, future alternatives
might not make sense under the "source" umbrella.
This is a breaking change. It's justified because DRA is still
alpha and will have several other API breaks in 1.31.
It has always been validated that a ResourceHandle MUST have DriverName set, so
this check is unnecessary.
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Previously we were returning the error string from 'err' (which is nil), when
we should have been returning it from result.Error. Without this it is hard to
debug issues with NodeUnprepareResources.
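Roughly, the change looks like this (simplified, with hypothetical variable
names):

    // Before (simplified): err is the gRPC call error and is nil at this
    // point, so the message carried no useful detail.
    //   return fmt.Errorf("NodeUnprepareResources failed: %v", err)

    // After (simplified): report the per-claim error from the response.
    if result.GetError() != "" {
        return fmt.Errorf("NodeUnprepareResources failed for claim %s: %s", claimUID, result.GetError())
    }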
Signed-off-by: Kevin Klues <kklues@nvidia.com>
While currently those objects only get published by the kubelet for node-local
resources, this could change once we also support network-attached
resources. Dropping the "Node" prefix enables such a future extension.
The NodeName in ResourceSlice and StructuredResourceHandle then becomes
optional. The kubelet still needs to provide one and it must match its own node
name, otherwise it doesn't have permission to access ResourceSlice objects.
The information is received from the DRA driver plugin through a new gRPC
streaming interface. This is backwards compatible with old DRA driver kubelet
plugins: their gRPC server will return "not implemented", and kubelet can
handle that. Therefore no API break is needed.
However, DRA drivers need to be updated because the Go API changed. They can
return
    status.New(codes.Unimplemented, "no node resource support").Err()
if they don't support the new ListAndWatchResources method and
structured parameters.
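For a driver without node resource support, the stub can be as small as this;
the generated package alias and method signature are assumptions about the
kubelet plugin API of that release:

    // Sketch: opt out of the new streaming call.
    func (d *driver) ListAndWatchResources(req *drapb.ListAndWatchResourcesRequest, stream drapb.Node_ListAndWatchResourcesServer) error {
        return status.New(codes.Unimplemented, "no node resource support").Err()
    }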
The controller in kubelet then synchronizes this information from the driver
with NodeResourceSlice objects, creating, updating and deleting them as needed.
If the resource handle has data from a structured parameter model, then we need
to pass that to the DRA driver kubelet plugin. Because Kubernetes uses
gogo/protobuf, we cannot use "optional" for that new optional field and have to
resort to "repeated" with a single repetition if present.
This is a new, backwards-compatible field.
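On the receiving side, the convention then amounts to a length check (sketch,
hypothetical field name):

    // Sketch: "repeated" with at most one entry stands in for an optional
    // message field; presence means exactly one element.
    if len(handle.StructuredData) > 0 {
        structured := handle.StructuredData[0]
        // ... hand the structured parameter data to the DRA driver plugin ...
        _ = structured
    }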
That extending the resource.k8s.io API changes the checksum of a kubelet
checkpoint is unfortunate. Updating the test cases is a stop-gap measure;
the actual solution will have to be something else before beta.
Today, the DRA manager does not call the plugin's NodePrepareResources
for claims that it previously handled successfully, that is, claims that
are present in the cache (checkpoint), even if the node rebooted.
After a node reboot, the DRA plugins must be called again for resource
claims so that they can prepare them again, in case the resources don't
persist across a reboot.
To achieve that, once kubelet has started, we call the DRA plugins for
the claims once if a pod sandbox needs to be created during PodSync.
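A rough sketch of the idea, with hypothetical names (the real change lives in
the DRA manager's pod-sync path):

    package example

    import (
        "context"

        "k8s.io/apimachinery/pkg/types"
    )

    // Sketch only: claims restored from the checkpoint are not in this set,
    // so after a reboot they get prepared again, once, the first time a pod
    // sandbox has to be created for them.
    var preparedSinceBoot = map[types.UID]bool{}

    func prepareIfNeeded(ctx context.Context, claimUID types.UID, prepare func(context.Context) error) error {
        if preparedSinceBoot[claimUID] {
            return nil
        }
        if err := prepare(ctx); err != nil {
            return err
        }
        preparedSinceBoot[claimUID] = true
        return nil
    }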
Signed-off-by: adrianc <adrianc@nvidia.com>