26 Commits

Author SHA1 Message Date
Sunyanan Choochotkaew
7f052afaef KEP 5075: implement scheduler
Signed-off-by: Sunyanan Choochotkaew <sunyanan.choochotkaew1@ibm.com>
2025-07-30 09:52:49 +09:00
yliao
23d6f73e72 extended resource backed by DRA: test 2025-07-29 18:55:28 +00:00
Kobayashi,Daisuke
6653ef652b KEP-5007 DRA Device Binding Conditions: Add dra integration test 2025-07-29 11:36:07 +00:00
Rita Zhang
c15a54f8c0 draadminaccess: move metrics test from e2e to integration
Signed-off-by: Rita Zhang <rita.z.zhang@gmail.com>
2025-07-24 14:08:14 -07:00
Patrick Ohly
24de875ceb DRA: graduate DynamicResourceAllocation feature to GA
It hasn't been on-by-default before, therefore it does not get locked to the
new default yet. This has some impact on the scheduler configuration
because the plugin is now enabled by default.

Because the feature is now GA, it doesn't need to be a label on E2E tests,
which wouldn't be possible anyway once it gets removed entirely.
2025-07-24 08:33:56 +02:00
Patrick Ohly
5c4f81743c DRA: use v1 API
As before when adding v1beta2, DRA drivers built using the
k8s.io/dynamic-resource-allocation helper packages remain compatible with all
Kubernetes releases >= 1.32. The helper code picks whatever API version is
enabled from v1beta1/v1beta2/v1.

However, the control plane now depends on v1, so a cluster configuration where
only v1beta1 or v1beta2 is enabled, without v1, won't work.
2025-07-24 08:33:45 +02:00
Kobayashi,Daisuke
61bd5789be Updated to not directly change the global variable claim 2025-07-23 03:44:48 +00:00
Patrick Ohly
729cd583ad scheduler integration: fail test instead of exiting
Calling klog.FlushAndExit causes the `go test` binary to quit
without properly recording which test failed. Both callers of
StartScheduler already have a ktesting.TContext, so switching
to that is easy and also reduces the number of parameters.
2025-07-18 09:43:04 +02:00
Patrick Ohly
5cea72d564 DRA integration: add test case for FilterTimeout
This covers disabling the feature via the configuration, failing to schedule
because of timeouts for all nodes, and retrying after ResourceSlice changes with
partial success (timeout for one node, success for the other).

While at it, some helper code gets improved.
2025-07-17 21:18:28 +02:00
Patrick Ohly
241ac018e2 DRA integration: remove unnecessary anonymous import
It's unclear why k8s.io/kubernetes/pkg/apis/resource/install needs
to be imported explicitly. Having the apiserver and scheduler ready
to be started ensures that all APIs are available.
2025-07-17 21:18:28 +02:00
Patrick Ohly
2e966244ed DRA resourceslice controller: fix recreation after quick delete
If a ResourceSlice got published by the ResourceSlice controller in a DRA
driver and then that ResourceSlice got deleted quickly (within one minute, the
mutation cache TTL) by someone (for example, the kubelet because of a restart),
then the controller did not react properly to the deletion unless some other
event triggered the syncing of the pool.

Found while adding upgrade/downgrade tests with a driver which keeps running
across the upgrade/downgrade.

The exact sequence leading to this was:
- controller adds ResourceSlice, schedules a sync for one minute in the future (the TTL)
- someone else deletes the ResourceSlice
- add and delete events schedule another sync 30 seconds in the future (the delay),
  *overwriting* the other scheduled sync
- sync runs once, finds deleted slices in the mutation cache,
  does not re-create them, and also does not run again

One possible fix would be to set a resync period. But then work is done
periodically, whether it's necessary or not.

Another fix is to ensure that the TTL is shorter than the delay. Then when a
sync occurs, all locally stored additional slices are expired. But that renders
the whole storing of recently created slices in the cache pointless.

So the fix used here is to keep track of when another sync has to run because
of added slices. At the end of each sync, the next sync gets scheduled if (and
only if) needed, until eventually syncing can stop.
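
The bookkeeping described above can be sketched roughly as follows. This is
only an illustration under stated assumptions, not the controller's actual
code: syncTracker, sliceAdded, nextSync, cacheTTL and epoch are hypothetical
names, and the real controller uses a work queue rather than plain timestamps.

```go
package main

import (
	"fmt"
	"time"
)

// cacheTTL mirrors the one-minute mutation cache TTL from the commit message.
const cacheTTL = time.Minute

// epoch is an arbitrary fixed start time so that the example is deterministic.
var epoch = time.Unix(0, 0)

// syncTracker (hypothetical) remembers when each locally added slice
// expires from the mutation cache.
type syncTracker struct {
	ttl    time.Duration
	expiry map[string]time.Time
}

func newSyncTracker(ttl time.Duration) *syncTracker {
	return &syncTracker{ttl: ttl, expiry: make(map[string]time.Time)}
}

// sliceAdded records that a slice was created at time now.
func (s *syncTracker) sliceAdded(name string, now time.Time) {
	s.expiry[name] = now.Add(s.ttl)
}

// nextSync is meant to be called at the end of each sync. It forgets
// entries whose TTL has passed and reports how long to wait until the
// earliest remaining one expires. ok is false when nothing is pending,
// which means that syncing can stop.
func (s *syncTracker) nextSync(now time.Time) (delay time.Duration, ok bool) {
	var earliest time.Time
	for name, exp := range s.expiry {
		if !exp.After(now) {
			delete(s.expiry, name)
			continue
		}
		if !ok || exp.Before(earliest) {
			earliest, ok = exp, true
		}
	}
	if !ok {
		return 0, false
	}
	return earliest.Sub(now), true
}

func main() {
	tr := newSyncTracker(cacheTTL)
	tr.sliceAdded("slice-a", epoch)

	// Sync triggered 30 seconds later by the add and delete events.
	if d, ok := tr.nextSync(epoch.Add(30 * time.Second)); ok {
		fmt.Println("schedule another sync in", d)
	}
	// The follow-up sync runs once the TTL has passed.
	if _, ok := tr.nextSync(epoch.Add(cacheTTL)); !ok {
		fmt.Println("no further sync needed")
	}
}
```

Because the next sync is derived from the pending entries at the end of every
sync, a later event cannot silently replace the sync that the added slice
still needs.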
2025-07-03 08:20:39 +02:00
Patrick Ohly
10de6780cf DRA API: remove obsolete types from v1alpha3
The v1alpha3 version is still needed for DeviceTaintRule, but the rest of the
types and most structs became obsolete in v1.32 when we introduced v1beta1 and
bumped the storage version to v1beta1.

Removing them now simplifies adding new features because new fields don't need
to be added to these obsolete types. This could have been done already in 1.33,
but wasn't, to avoid disrupting ongoing work.
2025-06-06 12:06:28 +02:00
Kubernetes Prow Robot
0731167a99 Merge pull request #131996 from ritazh/dra-adminaccess-updatelabelkey
DRAAdminAccess: update label key
2025-06-04 12:16:45 -07:00
Patrick Ohly
4f91a69f2b DRA integration: move and extend device status test
This moves the enabled/disabled test into the common test/integration/dra which
simplifies the code a bit and amortizes the cost of starting the apiserver
because several different tests can use the same instance, running in parallel.

While at it, setting the status via SSA also gets tested.
2025-05-30 10:29:18 +02:00
Rita Zhang
5058e385b0 DRAAdminAccess: update label key
Signed-off-by: Rita Zhang <rita.z.zhang@gmail.com>
2025-05-27 21:19:25 -07:00
Patrick Ohly
e63019a870 DRA integration: refactor code to support other tests
Creating class, claim and pod is expected to be fairly common.
2025-05-23 17:52:26 +02:00
Patrick Ohly
50f152440b DRA integration: start scheduler on demand
As soon as we have more than one test using the scheduler, we need some
coordination between tests. This is handled by a singleton which starts the
scheduler for the first user and stops it after the last one is gone.

To avoid having to pass around an additional parameter, the context is used to
access the singleton under the hood.
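
A minimal sketch of such a reference-counted singleton, assuming hypothetical
names (sharedScheduler, acquire, withScheduler, schedulerFrom) and plain
callbacks in place of the real scheduler startup code:

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// sharedScheduler (hypothetical) starts a resource for its first user
// and stops it after the last user has released it.
type sharedScheduler struct {
	mu    sync.Mutex
	users int
	start func()
	stop  func()
}

// acquire registers one user and returns a release function.
// Calling release more than once has no additional effect.
func (s *sharedScheduler) acquire() (release func()) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.users == 0 {
		s.start()
	}
	s.users++

	var once sync.Once
	return func() {
		once.Do(func() {
			s.mu.Lock()
			defer s.mu.Unlock()
			s.users--
			if s.users == 0 {
				s.stop()
			}
		})
	}
}

// schedulerKey is the private context key under which the singleton is stored.
type schedulerKey struct{}

// withScheduler attaches the singleton to the context so that tests
// do not need an additional parameter.
func withScheduler(ctx context.Context, s *sharedScheduler) context.Context {
	return context.WithValue(ctx, schedulerKey{}, s)
}

// schedulerFrom returns the singleton stored in ctx, or nil if there is none.
func schedulerFrom(ctx context.Context) *sharedScheduler {
	s, _ := ctx.Value(schedulerKey{}).(*sharedScheduler)
	return s
}

func main() {
	s := &sharedScheduler{
		start: func() { fmt.Println("scheduler started") },
		stop:  func() { fmt.Println("scheduler stopped") },
	}
	ctx := withScheduler(context.Background(), s)

	releaseA := schedulerFrom(ctx).acquire() // first user starts the scheduler
	releaseB := schedulerFrom(ctx).acquire() // second user reuses it
	releaseA()
	releaseB() // last user stops it
}
```

In a test, the release function would typically be registered as a cleanup
so that the scheduler stops even when the test fails.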
2025-05-23 15:04:00 +02:00
Patrick Ohly
60c36432f2 DRA integration: set up nodes for scheduling
This enables proper scheduling tests. Most of them are probably better done in
scheduler_perf where the same test then can also be used for benchmarking and
creating objects is a bit better supported (from YAML, for example), but some
special cases (in particular, anything involving error injection) are better
done here.
2025-05-20 17:43:30 +02:00
Patrick Ohly
3b5cfeaf20 DRA: use v1beta2
DRA drivers must provide ResourceSlices using the v1beta2 API types.
The controller then converts under the hood to v1beta1 if needed, i.e.
drivers are compatible with Kubernetes 1.32 and Kubernetes 1.33, as
long as at least one beta API group is enabled.

Testing pivots from using v1beta1 as the main API to v1beta2, with only one
test case exercising v1beta1.
2025-05-05 08:49:09 +02:00
Patrick Ohly
a171795e31 DRA resourceslices: better error reporting
A user of the controller can register an error handler via the controller
options. For a kubelet plugin, the error handler is a method in the interface
which must be implemented. This is a conscious choice to make DRA driver
developers aware that they should react intelligently to errors.

The controller will invoke that handler with all errors that it encounters
while syncing the desired set of slices. This includes validation errors from
the apiserver if the driver's slices are invalid. Dropped fields get reported
with a special DroppedFieldsError.
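
How a driver might tell these error kinds apart can be sketched as follows.
The types and names here (droppedFieldsError, errInvalidSlice, classify,
wrap) are local, hypothetical stand-ins for illustration and do not mirror
the real helper package's API.

```go
package main

import (
	"errors"
	"fmt"
)

// droppedFieldsError is a local stand-in for the special error which
// reports that the apiserver dropped fields from a slice.
type droppedFieldsError struct {
	slice string
}

func (e *droppedFieldsError) Error() string {
	return "fields dropped in ResourceSlice " + e.slice
}

// errInvalidSlice stands in for a validation error from the apiserver.
var errInvalidSlice = errors.New("ResourceSlice is invalid")

// wrap adds context in the way a controller might before reporting an error.
func wrap(err error) error {
	return fmt.Errorf("sync pool: %w", err)
}

// classify is a hypothetical handler body: dropped fields are treated as
// a degraded but working state, everything else as a failure.
func classify(err error) string {
	var dropped *droppedFieldsError
	switch {
	case err == nil:
		return "ok"
	case errors.As(err, &dropped):
		return "degraded"
	default:
		return "failed"
	}
}

func main() {
	fmt.Println(classify(wrap(&droppedFieldsError{slice: "pool-a"})))
	fmt.Println(classify(wrap(errInvalidSlice)))
}
```

Using errors.As instead of a type assertion keeps the check working when the
error has been wrapped with additional context.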
2025-05-05 08:40:52 +02:00
Morten Torkildsen
39507d911f Add resource v1beta2 API 2025-03-26 14:41:09 +00:00
Kubernetes Prow Robot
ab3cec0701 Merge pull request #130447 from pohly/dra-device-taints
device taints and tolerations (KEP 5055)
2025-03-19 13:00:32 -07:00
Patrick Ohly
37b47f4724 DRA helper: support dropped fields and TimeAdded defaults
Both the new DeviceTaint.TimeAdded and dropped fields when
the DRADeviceTaints feature is disabled confused the ResourceSlice
controller because what is stored and sent back can be different
from what the controller wants to store.

It's now more lenient regarding TimeAdded (doesn't need to be exact because of
rounding during serialization, only having a value on the server is okay)
and dropped fields (doesn't try to store them again). It also preserves
a server-side TimeAdded when updating slices.
2025-03-19 09:18:38 +01:00
Rita Zhang
0301e5a9f8 DRA: AdminAccess validate based on namespace label
Signed-off-by: Rita Zhang <rita.z.zhang@gmail.com>
2025-03-18 22:56:54 -07:00
Patrick Ohly
89440b1239 DRA: integration tests for prioritized list
This adds dedicated integration tests for the feature to the general
test/integration/dra for the API and some minimal testing with the scheduler.

It also adds non-performance test cases for scheduler_perf because that is a
better place for running through the complete flow (for example, can reuse
infrastructure for setting up nodes).
2025-03-10 11:38:06 +01:00
Patrick Ohly
9492a2ca9b DRA: add dedicated integration tests
DRA had integration tests as part of test/integration/scheduler_perf (for the
scheduler plugin) and some others scattered in different
places (e.g. test/integration/resourceclaim for device status).

The new test/integration/dra is meant to become the common location for all
DRA-related integration tests. This makes it simpler to share common setup
code.
2025-02-21 20:48:04 +01:00