This was previously caught during Filter by the allocator check. Doing it
sooner avoids wasting resources on a pod which ultimately cannot get scheduled.
While at it, be more explicit about which feature is disabled, since the user
might not know that.
All logic related to obtaining DRA objects and tracking modifications
to ResourceClaims in-memory is extracted to DefaultDRAManager, which
implements framework.SharedDRAManager.
This is intended to be a no-op in terms of the DRA plugin behavior.
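A rough sketch of the extracted interface, assuming a method set along these
lines (an approximation for illustration, not a verbatim copy of
framework.SharedDRAManager), with resourceapi being k8s.io/api/resource/v1alpha3:

    // ResourceClaimTracker reads claims and layers in-memory modifications
    // (e.g. claims whose allocation is in flight) on top of the informer data.
    type ResourceClaimTracker interface {
        Get(namespace, name string) (*resourceapi.ResourceClaim, error)
        List() ([]*resourceapi.ResourceClaim, error)
        // AssumeClaimAfterAPICall records an in-memory modification.
        AssumeClaimAfterAPICall(claim *resourceapi.ResourceClaim) error
    }

    // SharedDRAManager bundles access to all DRA objects that the plugin
    // needs. DefaultDRAManager is the informer-backed implementation.
    type SharedDRAManager interface {
        ResourceClaims() ResourceClaimTracker
        ResourceSlices() ResourceSliceLister
        DeviceClasses() DeviceClassLister
    }

    type ResourceSliceLister interface {
        List() ([]*resourceapi.ResourceSlice, error)
    }

    type DeviceClassLister interface {
        Get(name string) (*resourceapi.DeviceClass, error)
        List() ([]*resourceapi.DeviceClass, error)
    }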
A better place is the cel package because a) the name can become shorter
and b) it is tightly coupled with the compiler there.
Moving the compilation into the cache simplifies the callers.
"Allocated devices" are the ones which can be observed from the informer. "All
allocated devices" also includes those which are in flight and haven't been
written back to the apiserver.
The logic for skipping "admin access" was repeated in three different places. A
single foreachAllocatedDevices with a callback puts it into one function.
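A minimal sketch of such a helper, assuming v1alpha3 field names and a
simplified DeviceID type:

    // DeviceID is a simplified stand-in for the scheduler's internal device identifier.
    type DeviceID struct {
        Driver, Pool, Device string
    }

    // foreachAllocatedDevices invokes cb for each device allocated to the claim,
    // skipping devices that were allocated with admin access: those are not
    // exclusively reserved and must not count as "in use".
    func foreachAllocatedDevices(claim *resourceapi.ResourceClaim, cb func(id DeviceID)) {
        if claim.Status.Allocation == nil {
            return
        }
        for _, result := range claim.Status.Allocation.Devices.Results {
            if result.AdminAccess != nil && *result.AdminAccess {
                continue
            }
            cb(DeviceID{Driver: result.Driver, Pool: result.Pool, Device: result.Device})
        }
    }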
DeviceClasses and different requests are very likely to contain the same
expression string. We don't need to compile that over and over again.
To avoid hanging onto that cache longer than necessary, it's currently tied to
each PreFilter/Filter combination. It might make sense to move this up into the
scheduler plugin and thus reuse compiled expressions for different pods.
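A sketch of such a cache, shown here directly on top of github.com/google/cel-go
(the in-tree cache wraps the DRA CEL compiler instead, but the idea is the same):

    import (
        "fmt"
        "sync"

        "github.com/google/cel-go/cel"
    )

    type compileCache struct {
        env   *cel.Env
        mutex sync.RWMutex
        cache map[string]cel.Program
    }

    func newCompileCache(env *cel.Env) *compileCache {
        return &compileCache{env: env, cache: map[string]cel.Program{}}
    }

    // GetOrCompile returns the cached program for an expression string if it was
    // seen before, otherwise it compiles the expression and stores the result.
    func (c *compileCache) GetOrCompile(expression string) (cel.Program, error) {
        c.mutex.RLock()
        prg, ok := c.cache[expression]
        c.mutex.RUnlock()
        if ok {
            return prg, nil
        }

        ast, issues := c.env.Compile(expression)
        if issues != nil && issues.Err() != nil {
            return nil, fmt.Errorf("compile %q: %w", expression, issues.Err())
        }
        prg, err := c.env.Program(ast)
        if err != nil {
            return nil, err
        }

        c.mutex.Lock()
        c.cache[expression] = prg
        c.mutex.Unlock()
        return prg, nil
    }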
goos: linux
goarch: amd64
pkg: k8s.io/kubernetes/test/integration/scheduler_perf
cpu: Intel(R) Core(TM) i9-7980XE CPU @ 2.60GHz
│ before │ after │
│ SchedulingThroughput/Average │ SchedulingThroughput/Average vs base │
PerfScheduling/SchedulingWithResourceClaimTemplateStructured/5000pods_500nodes-36 33.95 ± 4% 36.65 ± 2% +7.95% (p=0.002 n=6)
PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/empty_100nodes-36 105.8 ± 2% 106.7 ± 3% ~ (p=0.177 n=6)
PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/empty_500nodes-36 100.7 ± 1% 119.7 ± 3% +18.82% (p=0.002 n=6)
PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/half_100nodes-36 90.78 ± 1% 121.10 ± 4% +33.40% (p=0.002 n=6)
PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/half_500nodes-36 50.51 ± 7% 63.72 ± 3% +26.17% (p=0.002 n=6)
PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/full_100nodes-36 103.7 ± 5% 110.2 ± 2% +6.32% (p=0.002 n=6)
PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/full_500nodes-36 28.50 ± 2% 28.16 ± 5% ~ (p=0.102 n=6)
geomean 64.99 73.15 +12.56%
Using unique strings instead of normal strings speeds up allocation with
structured parameters because maps keyed by those strings no longer need to
hash the full string content. However, care must be taken to call unique.Make
as little as possible because it is costly.
Pre-allocating the map of allocated devices reduces the need to grow the map
when adding devices.
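A sketch of both optimizations with Go's unique package (Go 1.23+); names are
illustrative, and this variant of DeviceID replaces the string-based one in the
earlier sketch:

    import "unique"

    // DeviceID uses unique.Handle[string] instead of string. Map lookups then
    // hash the small handles instead of the full string contents.
    type DeviceID struct {
        Driver, Pool, Device unique.Handle[string]
    }

    // MakeDeviceID calls unique.Make once per component. unique.Make is
    // comparatively expensive, so callers should do this once per device and
    // reuse the resulting DeviceID.
    func MakeDeviceID(driver, pool, device string) DeviceID {
        return DeviceID{
            Driver: unique.Make(driver),
            Pool:   unique.Make(pool),
            Device: unique.Make(device),
        }
    }

    // newAllocatedDevices pre-allocates the map for the expected number of
    // devices so that adding entries does not repeatedly grow it.
    func newAllocatedDevices(expected int) map[DeviceID]bool {
        return make(map[DeviceID]bool, expected)
    }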
goos: linux
goarch: amd64
pkg: k8s.io/kubernetes/test/integration/scheduler_perf
cpu: Intel(R) Core(TM) i9-7980XE CPU @ 2.60GHz
│ before │ after │
│ SchedulingThroughput/Average │ SchedulingThroughput/Average vs base │
PerfScheduling/SchedulingWithResourceClaimTemplateStructured/5000pods_500nodes-36 18.06 ± 2% 33.30 ± 2% +84.31% (p=0.002 n=6)
PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/empty_100nodes-36 104.7 ± 2% 105.3 ± 2% ~ (p=0.818 n=6)
PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/empty_500nodes-36 96.62 ± 1% 100.75 ± 1% +4.28% (p=0.002 n=6)
PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/half_100nodes-36 83.00 ± 2% 90.96 ± 2% +9.59% (p=0.002 n=6)
PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/half_500nodes-36 32.45 ± 7% 49.84 ± 4% +53.60% (p=0.002 n=6)
PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/full_100nodes-36 95.22 ± 7% 103.80 ± 1% +9.00% (p=0.002 n=6)
PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/full_500nodes-36 9.111 ± 10% 27.215 ± 7% +198.69% (p=0.002 n=6)
geomean 45.86 64.26 +40.12%
The Allocate call used to call back into the claim lister for each node. This
was significant work which showed up at the top of the CPU profile. It's
okay to list only once during PreFilter because the Filter call does not change
the claim status between Allocate calls.
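A rough sketch of the pattern: list once in PreFilter, stash the result in the
CycleState, and let the per-node calls read it from there. pl.claimLister and
the helpers reuse the sketches above and are illustrative; only the framework
types are real.

    const allocatedDevicesStateKey framework.StateKey = "dra/allocatedDevices"

    type allocatedDevicesState struct {
        devices map[DeviceID]bool
    }

    // Clone is required by framework.StateData. The data is only read during
    // Filter, so returning the same pointer is good enough here.
    func (s *allocatedDevicesState) Clone() framework.StateData { return s }

    func (pl *dynamicResources) PreFilter(ctx context.Context, state *framework.CycleState, pod *v1.Pod) (*framework.PreFilterResult, *framework.Status) {
        claims, err := pl.claimLister.List()
        if err != nil {
            return nil, framework.AsStatus(err)
        }
        devices := make(map[DeviceID]bool, len(claims))
        for _, claim := range claims {
            foreachAllocatedDevices(claim, func(id DeviceID) { devices[id] = true })
        }
        state.Write(allocatedDevicesStateKey, &allocatedDevicesState{devices: devices})
        return nil, nil
    }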
goos: linux
goarch: amd64
pkg: k8s.io/kubernetes/test/integration/scheduler_perf
cpu: Intel(R) Core(TM) i9-7980XE CPU @ 2.60GHz
│ before │ after │
│ SchedulingThroughput/Average │ SchedulingThroughput/Average vs base │
PerfScheduling/SchedulingWithResourceClaimTemplateStructured/5000pods_500nodes-36 15.04 ± 0% 18.06 ± 2% +20.07% (p=0.002 n=6)
PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/empty_100nodes-36 105.5 ± 1% 104.7 ± 2% ~ (p=0.485 n=6)
PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/empty_500nodes-36 95.83 ± 1% 96.62 ± 1% ~ (p=0.063 n=6)
PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/half_100nodes-36 79.67 ± 3% 83.00 ± 2% +4.18% (p=0.002 n=6)
PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/half_500nodes-36 27.11 ± 5% 32.45 ± 7% +19.68% (p=0.002 n=6)
PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/full_100nodes-36 84.00 ± 3% 95.22 ± 7% +13.36% (p=0.002 n=6)
PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/full_500nodes-36 7.110 ± 6% 9.111 ± 10% +28.15% (p=0.002 n=6)
geomean 41.05 45.86 +11.73%
The new DRAAdminAccess feature gate has the following effects:
- If disabled in the apiserver, the spec.devices.requests[*].adminAccess
field gets cleared, and the same happens in the status. The one exception
is updating a claim or claim template which already has the field set:
in that case, the field is preserved.
Allocating a claim with admin access also remains allowed regardless of
the feature gate, and the field is not cleared then either. In practice,
the scheduler will not do that.
- If disabled in the resource claim controller, creating ResourceClaims
with the field set gets rejected. This prevents running workloads
which depend on admin access.
- If disabled in the scheduler, claims with admin access don't get
allocated. The effect is the same.
The alternative would have been to ignore the fields in the claim controller
and the scheduler. That is worse because a monitoring workload would then run
anyway, blocking resources that were probably meant for production workloads.
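On the apiserver side, the field clearing typically follows the usual "drop
disabled fields" pattern. A sketch, with adminAccessInUse and the function name
being illustrative:

    func dropDisabledDRAAdminAccessFields(newClaim, oldClaim *resourceapi.ResourceClaim) {
        if utilfeature.DefaultFeatureGate.Enabled(features.DRAAdminAccess) {
            return
        }
        // Preserve the field on update if it was already set: clearing it as a
        // side effect of an unrelated update would change the claim's meaning.
        if oldClaim != nil && adminAccessInUse(oldClaim) {
            return
        }
        for i := range newClaim.Spec.Devices.Requests {
            newClaim.Spec.Devices.Requests[i].AdminAccess = nil
        }
    }

    func adminAccessInUse(claim *resourceapi.ResourceClaim) bool {
        for _, request := range claim.Spec.Devices.Requests {
            if request.AdminAccess != nil {
                return true
            }
        }
        return false
    }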
This removes the DRAControlPlaneController feature gate, the fields controlled
by it (claim.spec.controller, claim.status.deallocationRequested,
claim.status.allocation.controller, class.spec.suitableNodes), the
PodSchedulingContext type, and all code related to the feature.
The feature gets removed because there is no path towards beta and GA and DRA
with "structured parameters" should be able to replace it.
A dedicated ActionType which only gets used when the scheduler itself detects
a change in a pod's list of generated ResourceClaims avoids calling the DRA
plugin for unrelated Pod updates.
d66f8f9 added that "plugins have to implement a QueueingHint for Pod/Update
event if the rejection from them could be resolved by updating unscheduled
Pods itself".
This applies to DRA because the name of a generated ResourceClaim must be
recorded in the pod status before the pod can be scheduled.
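A sketch of what such a hint can look like; claimsComplete is a hypothetical
helper standing in for "all generated ResourceClaim names are now recorded in
the pod status":

    func (pl *dynamicResources) isSchedulableAfterPodChange(logger klog.Logger, pod *v1.Pod, oldObj, newObj interface{}) (framework.QueueingHint, error) {
        modifiedPod, ok := newObj.(*v1.Pod)
        if !ok {
            return framework.Queue, fmt.Errorf("unexpected object %T", newObj)
        }
        if modifiedPod.UID != pod.UID {
            // An update of some other pod cannot make this one schedulable.
            return framework.QueueSkip, nil
        }
        if claimsComplete(modifiedPod) {
            // The missing ResourceClaim name is now recorded, retry scheduling.
            return framework.Queue, nil
        }
        return framework.QueueSkip, nil
    }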
Making unschedulable pods schedulable again after ResourceSlice cluster events
was accidentally left out when adding structured parameters to Kubernetes 1.30.
All E2E tests were defined so that a driver starts first. A new test with a
different order (create pod first, wait for unschedulable, start driver)
triggered the bug and now passes.
In the API, the effect of the feature gate is that alpha fields get dropped on
create. They get preserved during updates if already set. The
PodSchedulingContext registration is *not* restricted by the feature gate.
This enables deleting stale PodSchedulingContext objects after disabling
the feature gate.
The scheduler checks the new feature gate before setting up an informer for
PodSchedulingContext objects and when deciding whether it can schedule a
pod. If any claim depends on a control plane controller, the scheduler bails
out, leading to:
Status: Pending
...
Warning FailedScheduling 73s default-scheduler 0/1 nodes are available: resourceclaim depends on disabled DRAControlPlaneController feature. no new claims to deallocate, preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
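The corresponding check is a small fragment in PreFilter; claim.Spec.Controller
is the field named above, the surrounding code is illustrative:

    for _, claim := range claims {
        if claim.Spec.Controller != "" {
            return nil, framework.NewStatus(framework.UnschedulableAndUnresolvable,
                "resourceclaim depends on disabled DRAControlPlaneController feature")
        }
    }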
The rest of the changes prepare for testing the new feature separately from
"structured parameters". The goal is to have base "dra" jobs which just enable
and test those, then "classic-dra" jobs which add DRAControlPlaneController.
The structured parameter allocation logic was written from scratch in
staging/src/k8s.io/dynamic-resource-allocation/structured where it might be
useful for out-of-tree components.
Besides the new features (amount, admin access) and the new API, it now supports
backtracking when the initial device selection doesn't lead to a complete
allocation of all claims.
Co-authored-by: Ed Bartosh <eduard.bartosh@intel.com>
Co-authored-by: John Belamaric <jbelamaric@google.com>
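A highly simplified sketch of the backtracking idea (the real allocator tracks
far more state; all types here are illustrative): if a candidate device for the
current request does not lead to a complete allocation of the remaining
requests, the choice is undone and the next candidate is tried.

    type DeviceID string
    type RequestID string

    type Request struct {
        ID      RequestID
        Matches func(DeviceID) bool
    }

    // allocate tries to satisfy all requests from the devices marked as free.
    // free maps each device to whether it is still available.
    func allocate(requests []Request, free map[DeviceID]bool, result map[RequestID]DeviceID) bool {
        if len(requests) == 0 {
            return true // all requests satisfied
        }
        current, rest := requests[0], requests[1:]
        for device, available := range free {
            if !available || !current.Matches(device) {
                continue
            }
            // Tentatively pick the device ...
            free[device] = false
            result[current.ID] = device
            if allocate(rest, free, result) {
                return true
            }
            // ... and backtrack if the remaining requests cannot be satisfied.
            free[device] = true
            delete(result, current.ID)
        }
        return false
    }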
As agreed in https://github.com/kubernetes/enhancements/pull/4709, immediate
allocation is one of those features which can be removed because it makes no
sense for structured parameters and the justification for classic DRA is weak.
This is in preparation for completely revamping the resource.k8s.io API group. Because
there will be no support for transitioning from v1alpha2 to v1alpha3, the
roundtrip test data for that API in 1.29 and 1.30 gets removed.
Repeating the version in the import name of the API packages is not really
required. It was done for a while to support simpler grepping for usage of
alpha APIs, but there are better ways for that now. So during this transition,
"resourceapi" gets used instead of "resourcev1alpha3" and the version gets
dropped from informer and lister imports. The advantage is that the next bump
to v1beta1 will affect fewer source code lines.
Only source code where the version really matters (like API registration)
retains the versioned import.
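For example (alias names illustrative):

    import (
        // Version-less aliases where the exact version doesn't matter; a future
        // bump to v1beta1 then only changes the import paths.
        resourceapi "k8s.io/api/resource/v1alpha3"
        resourceinformers "k8s.io/client-go/informers/resource/v1alpha3"
        resourcelisters "k8s.io/client-go/listers/resource/v1alpha3"
    )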
The JSON patch approach works, but it is complex. A retry loop is easier to
understand (detect conflict, get new claim, try again). There is one additional
API call (the get), but in practice this scenario is unlikely.
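A sketch of the retry loop using client-go's conflict helper, assuming the
v1alpha3 client; which update the controller actually performs this way is not
spelled out here, removing a reservation is just an example:

    import (
        "context"

        resourceapi "k8s.io/api/resource/v1alpha3"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/types"
        clientset "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/util/retry"
    )

    func removeReservation(ctx context.Context, client clientset.Interface, namespace, name string, podUID types.UID) error {
        return retry.RetryOnConflict(retry.DefaultRetry, func() error {
            // Get a fresh copy, modify it, try to update. On a conflict,
            // RetryOnConflict calls this function again.
            claim, err := client.ResourceV1alpha3().ResourceClaims(namespace).Get(ctx, name, metav1.GetOptions{})
            if err != nil {
                return err
            }
            reserved := make([]resourceapi.ResourceClaimConsumerReference, 0, len(claim.Status.ReservedFor))
            for _, ref := range claim.Status.ReservedFor {
                if ref.UID != podUID {
                    reserved = append(reserved, ref)
                }
            }
            claim.Status.ReservedFor = reserved
            _, err = client.ResourceV1alpha3().ResourceClaims(namespace).UpdateStatus(ctx, claim, metav1.UpdateOptions{})
            return err
        })
    }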
There was a race caused by having to update claim finalizer and status in two
different operations:
- Resource claim controller removes allocation, does not yet
get to remove the finalizer.
- Scheduler prepares an allocation without adding the finalizer
because it is already there.
- Controller removes finalizer.
- Scheduler adds allocation.
This is an invalid state. Automatic checking found this during the execution of
the "with translated parameters on single node.*supports sharing a claim
sequentially" E2E test, but only when run stand-alone. When running in
parallel (as in the CI), the bad outcome of the race did not occur.
The fix is to check that the finalizer is still set when adding the
allocation. The apiserver doesn't check that because it doesn't know which
finalizer goes with the allocation result. It could check for "some finalizer",
but that is not guaranteed to be correct (could be some unrelated one).
Checking the finalizer can only be done with a JSON patch. Despite the
complications, having the ability to add multiple pods concurrently to
ReservedFor seems worth it (avoids expensive rescheduling or a local retry
loop).
The resource claim controller doesn't need this, it can do a normal update
which implicitly checks ResourceVersion.
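A sketch of such a patch; the exact paths are illustrative, and finalizersJSON
and reservedForJSON stand for the pre-marshaled JSON of the expected finalizer
list and the new ReservedFor entry:

    patch := fmt.Sprintf(`[
      {"op": "test", "path": "/metadata/finalizers", "value": %s},
      {"op": "add", "path": "/status/reservedFor/-", "value": %s}
    ]`, finalizersJSON, reservedForJSON)
    _, err := client.ResourceV1alpha3().ResourceClaims(claim.Namespace).Patch(ctx, claim.Name,
        types.JSONPatchType, []byte(patch), metav1.PatchOptions{}, "status")

The "test" operation makes the whole patch fail if the finalizer list changed,
while "add" with the "-" index appends to ReservedFor without overwriting
concurrent additions.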
This finishes the transition to the assume cache as source of truth for the
current set of claims.
The tests have to be adapted. It's not enough anymore to directly put objects
into the informer store because that doesn't change the assume cache
content. Instead, normal Create/Update calls and waiting for the cache update
are needed.
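For example, a test now does something like this (testCtx, tCtx, and draManager
are the test's own helpers; the cache accessor follows the sketch further up and
is illustrative):

    claim, err := testCtx.client.ResourceV1alpha3().ResourceClaims(namespace).Create(tCtx, claim, metav1.CreateOptions{})
    if err != nil {
        t.Fatalf("create claim: %v", err)
    }
    err = wait.PollUntilContextTimeout(tCtx, 10*time.Millisecond, 5*time.Second, true,
        func(ctx context.Context) (bool, error) {
            cached, err := draManager.ResourceClaims().Get(claim.Namespace, claim.Name)
            if err != nil || cached == nil {
                return false, nil
            }
            // Wait until the assume cache has seen the same object version.
            return cached.ResourceVersion == claim.ResourceVersion, nil
        })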
This enables connecting the event handler for ResourceClaim to the assume
cache, which addresses a theoretical race condition.
It may also be useful for implementing the autoscaler support, because now
the autoscaler can modify the content of the cache.
The claim parameter key didn't include the namespace of the claim. In the case
where two namespaces used the exact same parameter reference, the "too many
generated parameters" case got triggered incorrectly and lookup could have
returned an object from the wrong namespace.
Found while running the E2E tests in parallel:
message: 'running PreFilter plugin "DynamicResources": multiple generated claim
parameters for ConfigMap. dra-8794/parameters-3 found: [dra-4729/parameters-4
dra-7328/parameters-4 dra-8794/parameters-4 dra-3402/parameters-4 dra-6156/parameters-4
dra-1839/parameters-4 dra-7434/parameters-4 dra-6504/parameters-4]'
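The fix in a nutshell (names illustrative): the lookup key has to include the
namespace, otherwise identical parameter references in different namespaces
collide.

    // Before: key := generatedFrom.Name
    // After:
    key := claim.Namespace + "/" + generatedFrom.Name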
There's no reason for having the interface because there is only one
implementation. This makes the test functions a bit simpler to implement
(no casting). They are still stand-alone functions instead of methods
because they should not be considered part of the "normal" API.
This is now used by both the volumebinding and dynamicresources plugins, so
promoting it to a common helper package is better.
In terms of functionality, nothing was changed. Documentation got
updated (warns about storing locally modified objects, clarifies what the Get
parameters are). Code coverage should be a bit better than before (tested with
and without indexer, exercises event handlers, more error paths).
Checking for specific errors can now be done via errors.Is.
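For example (the sentinel name here is an assumption, not necessarily what the
package exports):

    if errors.Is(err, assumecache.ErrNotFound) {
        // The object is gone from the cache; nothing to restore.
    }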
Coverage was checked with a cover profile. The biggest remaining gap is
isSchedulableAfterClaimParametersChange and isSchedulableAfterClassParametersChange,
which will get handled when refactoring foreachPodResourceClaim
(https://github.com/kubernetes/kubernetes/issues/123697).
The code was incorrectly checking for a controller, but only the boolean
is set for allocated claims. As a result, deallocation was requested from
a non-existent control plane controller.
While at it, let's also clear the driver name. It's not needed when the
claim is deallocated.