Commit Graph

64 Commits

Author SHA1 Message Date
Kubernetes Prow Robot
ea1143efc7 Merge pull request #126022 from macsko/new_node_to_status_map_structure
Change structure of NodeToStatus map in scheduler
2024-08-13 21:02:55 -07:00
Maciej Skoczeń
98be7dfc5d Change structure of NodeToStatus map in scheduler 2024-07-25 07:48:35 +00:00
Patrick Ohly
9f36c8d718 DRA: add DRAControlPlaneController feature gate for "classic DRA"
In the API, the effect of the feature gate is that alpha fields get dropped on
create. They get preserved during updates if already set. The
PodSchedulingContext registration is *not* restricted by the feature gate.
This enables deleting stale PodSchedulingContext objects after disabling
the feature gate.

The scheduler checks the new feature gate before setting up an informer for
PodSchedulingContext objects and when deciding whether it can schedule a
pod. If any claim depends on a control plane controller, the scheduler bails
out, leading to:

    Status:       Pending
    ...
      Warning  FailedScheduling             73s   default-scheduler  0/1 nodes are available: resourceclaim depends on disabled DRAControlPlaneController feature. no new claims to deallocate, preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.

The rest of the changes prepare for testing the new feature separately from
"structured parameters". The goal is to have base "dra" jobs which just enable
and test those, then "classic-dra" jobs which add DRAControlPlaneController.
2024-07-22 18:09:34 +02:00
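
A minimal sketch of that gating; the featureGates map and the claimNeedsControlPlaneController helper are hypothetical stand-ins for the real feature-gate and claim checks:

    package main

    import "fmt"

    // featureGates stands in for the scheduler's real feature-gate lookup
    // (utilfeature.DefaultFeatureGate in kubernetes/kubernetes).
    var featureGates = map[string]bool{
        "DRAControlPlaneController": false, // "classic DRA" disabled
    }

    // claimNeedsControlPlaneController is hypothetical; the real plugin
    // inspects the ResourceClaim spec for a control plane controller.
    func claimNeedsControlPlaneController(claim string) bool { return true }

    // schedulePod bails out if any claim depends on the disabled feature,
    // leaving the pod Pending with a FailedScheduling event.
    func schedulePod(claims []string) error {
        for _, claim := range claims {
            if claimNeedsControlPlaneController(claim) && !featureGates["DRAControlPlaneController"] {
                return fmt.Errorf("resourceclaim %s depends on disabled DRAControlPlaneController feature", claim)
            }
        }
        return nil
    }

    func main() {
        if err := schedulePod([]string{"my-claim"}); err != nil {
            fmt.Println("0/1 nodes are available:", err)
        }
    }
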
Patrick Ohly
599fe605f9 DRA scheduler: adapt to v1alpha3 API
The structured parameter allocation logic was written from scratch in
staging/src/k8s.io/dynamic-resource-allocation/structured where it might be
useful for out-of-tree components.

Besides the new features (amount, admin access) and the new API, it now supports
backtracking when the initial device selection doesn't lead to a complete
allocation of all claims.

Co-authored-by: Ed Bartosh <eduard.bartosh@intel.com>
Co-authored-by: John Belamaric <jbelamaric@google.com>
2024-07-22 18:09:34 +02:00
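
The backtracking idea in a toy sketch; the real logic in k8s.io/dynamic-resource-allocation/structured is far more involved:

    package main

    import "fmt"

    // allocate tries devices for each claim in order and undoes a choice
    // when a later claim cannot be satisfied (backtracking).
    func allocate(claims, devices []string, used map[string]bool, result map[string]string) bool {
        if len(claims) == 0 {
            return true // every claim got a device
        }
        claim := claims[0]
        for _, dev := range devices {
            if used[dev] {
                continue
            }
            used[dev] = true
            result[claim] = dev
            if allocate(claims[1:], devices, used, result) {
                return true
            }
            // Backtrack: this device choice blocked a later claim.
            used[dev] = false
            delete(result, claim)
        }
        return false
    }

    func main() {
        result := map[string]string{}
        ok := allocate([]string{"claim-a", "claim-b"}, []string{"dev-0", "dev-1"}, map[string]bool{}, result)
        fmt.Println(ok, result)
    }
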
Patrick Ohly
8a629b9f15 DRA: remove "sharable" from claim allocation result
Now all claims are shareable up to the limit imposed by the size of the
"reserverFor" array.

This is one of the agreed simplifications for 1.31.
2024-07-21 17:28:14 +02:00
Patrick Ohly
de5742ae83 DRA: remove immediate allocation
As agreed in https://github.com/kubernetes/enhancements/pull/4709, immediate
allocation is one of those features which can be removed because it makes no
sense for structured parameters and the justification for classic DRA is weak.
2024-07-21 17:28:14 +02:00
Patrick Ohly
b51d68bb87 DRA: bump API v1alpha2 -> v1alpha3
This is in preparation for revamping the resource.k8s.io API group completely. Because
there will be no support for transitioning from v1alpha2 to v1alpha3, the
roundtrip test data for that API in 1.29 and 1.30 gets removed.

Repeating the version in the import name of the API packages is not really
required. It was done for a while to support simpler grepping for usage of
alpha APIs, but there are better ways for that now. So during this transition,
"resourceapi" gets used instead of "resourcev1alpha3" and the version gets
dropped from informer and lister imports. The advantage is that the next bump
to v1beta1 will affect fewer source code lines.

Only source code where the version really matters (like API registration)
retains the versioned import.
2024-07-21 17:28:13 +02:00
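
The convention looks roughly like this, assuming the usual k8s.io/api package layout:

    package main

    import (
        "fmt"

        // Unversioned import name; only code where the version really
        // matters (like API registration) keeps "resourcev1alpha3".
        resourceapi "k8s.io/api/resource/v1alpha3"
    )

    func main() {
        claim := resourceapi.ResourceClaim{}
        fmt.Printf("%T\n", claim)
    }
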
googs1025
a3978e8315 scheduler: Add ctx param and error return to EnqueueExtensions.EventsToRegister() 2024-07-18 12:22:17 +08:00
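
A sketch of the shape of that change, with simplified stand-ins for the framework types:

    package main

    import (
        "context"
        "fmt"
    )

    // ClusterEventWithHint is a simplified stand-in for the framework type.
    type ClusterEventWithHint struct{ Event string }

    // EnqueueExtensions after this commit: EventsToRegister takes a context
    // (for logging/cancellation) and can return an error instead of panicking.
    type EnqueueExtensions interface {
        EventsToRegister(ctx context.Context) ([]ClusterEventWithHint, error)
    }

    type fakePlugin struct{}

    func (fakePlugin) EventsToRegister(ctx context.Context) ([]ClusterEventWithHint, error) {
        return []ClusterEventWithHint{{Event: "Node/Add"}}, nil
    }

    func main() {
        events, err := fakePlugin{}.EventsToRegister(context.Background())
        fmt.Println(events, err)
    }
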
Patrick Ohly
4bddebc48e DRA: fix scheduler/resource claim controller race with retry
The JSON patch approach works, but it is complex. A retry loop is easier to
understand (detect conflict, get new claim, try again). There is one additional
API call (the get), but in practice this scenario is unlikely.
2024-06-27 15:03:56 +02:00
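
A minimal sketch of that retry pattern using the standard client-go helper; getClaim and updateClaim are hypothetical stand-ins for the real clientset calls:

    package main

    import (
        "context"
        "fmt"

        "k8s.io/client-go/util/retry"
    )

    type resourceClaim struct{ reserved bool }

    func getClaim(ctx context.Context) (*resourceClaim, error)    { return &resourceClaim{}, nil }
    func updateClaim(ctx context.Context, c *resourceClaim) error { return nil }

    func main() {
        ctx := context.Background()
        // On a Conflict error: get a fresh copy of the claim and try again.
        err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
            claim, err := getClaim(ctx)
            if err != nil {
                return err
            }
            claim.reserved = true
            return updateClaim(ctx, claim) // fails with Conflict if outdated
        })
        fmt.Println(err)
    }
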
Patrick Ohly
ecbafb8de5 DRA: fix scheduler/resource claim controller race
There was a race caused by having to update claim finalizer and status in two
different operations:
- Resource claim controller removes allocation, does not yet
  get to remove the finalizer.
- Scheduler prepares an allocation without adding the finalizer
  because it is already present.
- Controller removes finalizer.
- Scheduler adds allocation.

This is an invalid state. Automatic checking found this during the execution of
the "with translated parameters on single node.*supports sharing a claim
sequentially" E2E test, but only when run stand-alone. When running in
parallel (as in the CI), the bad outcome of the race did not occur.

The fix is to check that the finalizer is still set when adding the
allocation. The apiserver doesn't check that because it doesn't know which
finalizer goes with the allocation result. It could check for "some finalizer",
but that is not guaranteed to be correct (could be some unrelated one).

Checking the finalizer can only be done with a JSON patch. Despite the
complications, having the ability to add multiple pods concurrently to
ReservedFor seems worth it (avoids expensive rescheduling or a local retry
loop).

The resource claim controller doesn't need this, it can do a normal update
which implicitly checks ResourceVersion.
2024-06-27 15:03:06 +02:00
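
A sketch of what such a patch can look like; the finalizer name and the allocation value below are made up:

    package main

    import (
        "encoding/json"
        "fmt"
    )

    func main() {
        patch := []map[string]any{
            // "test" fails the whole patch if the finalizer was removed
            // in the meantime, turning the race into a clean conflict.
            {"op": "test", "path": "/metadata/finalizers", "value": []string{"example.com/delete-protection"}},
            {"op": "add", "path": "/status/allocation", "value": map[string]any{}},
        }
        body, err := json.Marshal(patch)
        fmt.Println(string(body), err)
        // The body would be sent via client.Patch(types.JSONPatchType, ...).
    }
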
Kubernetes Prow Robot
8c478a06d8 Merge pull request #124595 from pohly/dra-scheduler-assume-cache-eventhandlers
DRA: scheduler event handlers via assume cache
2024-06-25 11:56:28 -07:00
Patrick Ohly
1b63639d31 DRA scheduler: use assume cache to list claims
This finishes the transition to the assume cache as source of truth for the
current set of claims.

The tests have to be adapted. It's not enough anymore to directly put objects
into the informer store because that doesn't change the assume cache
content. Instead, normal Create/Update calls and waiting for the cache update
are needed.
2024-06-25 14:00:25 +02:00
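
A sketch of the resulting test pattern, assuming a hypothetical cacheHasClaim helper that looks the claim up in the assume cache:

    package main

    import (
        "context"
        "fmt"
        "time"

        "k8s.io/apimachinery/pkg/util/wait"
    )

    func cacheHasClaim(name string) bool { return true }

    func main() {
        // After a normal Create call, poll until the assume cache has
        // processed the informer event for the new claim.
        err := wait.PollUntilContextTimeout(context.Background(), 10*time.Millisecond, time.Second, true,
            func(ctx context.Context) (bool, error) {
                return cacheHasClaim("my-claim"), nil
            })
        fmt.Println(err)
    }
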
Patrick Ohly
9a6f3b9388 scheduler: central ResourceClaim assume cache
This enables connecting the event handler for ResourceClaim to the assume
cache, which addresses a theoretic race condition.

It may also be useful for implementing the autoscaler support, because now
the autoscaler can modify the content of the cache.
2024-06-25 14:00:25 +02:00
Patrick Ohly
e0fce54d02 DRA: fix indexing of generated parameters
The claim parameter key didn't include the namespace of the claim. In the case
where two namespaces used the exact same parameter reference, the "too many
generated parameters" case got triggered incorrectly and lookup could have
returned an object from the wrong namespace.

Found while running the E2E tests in parallel:

              message: 'running PreFilter plugin "DynamicResources": multiple generated claim
                parameters for ConfigMap. dra-8794/parameters-3 found: [dra-4729/parameters-4
                dra-7328/parameters-4 dra-8794/parameters-4 dra-3402/parameters-4 dra-6156/parameters-4
                dra-1839/parameters-4 dra-7434/parameters-4 dra-6504/parameters-4]'
2024-06-13 17:27:04 +02:00
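
The essence of the fix in a minimal sketch with made-up names:

    package main

    import "fmt"

    // parameterKey includes the namespace; keying only on the name let two
    // namespaces with identical parameter references collide.
    func parameterKey(namespace, name string) string {
        return namespace + "/" + name
    }

    func main() {
        fmt.Println(parameterKey("dra-8794", "parameters-3")) // "dra-8794/parameters-3"
    }
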
carlory
2794baf4c0 fix dra flaky test on TestPlugin 2024-05-30 23:22:37 +08:00
carlory
3072987fcc DRA: scheduler: index claim and class parameters to simplify lookup 2024-05-27 15:57:10 +08:00
carlory
06d3cd33b2 use slices library instead 2024-04-29 16:50:53 +08:00
Patrick Ohly
7f54c5dfec scheduler: remove AssumeCache interface
There's no reason for having the interface because there is only one
implementation. Makes the implementation of the test functions a bit
simpler (no casting). They are still stand-alone functions instead of methods
because they should not be considered part of the "normal" API.
2024-04-25 11:46:58 +02:00
Patrick Ohly
26e0409c36 scheduler: move assume cache to utils, part 2
This is now used by both the volumebinding and dynamicresources plugin, so
promoting it to a common helper package is better.

In terms of functionality, nothing was changed. Documentation got
updated (warns about storing locally modified objects, clarifies what the Get
parameters are). Code coverage should be a bit better than before (tested with
and without indexer, exercises event handlers, more error paths).

Checking for specific errors can now be done via errors.Is.
2024-04-25 11:45:43 +02:00
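
A sketch of the errors.Is style this enables; the sentinel here only mimics the helper package's exported errors rather than naming them exactly:

    package main

    import (
        "errors"
        "fmt"
    )

    // ErrNotFound mimics a sentinel error exported by the assume cache.
    var ErrNotFound = errors.New("object not found")

    func get(key string) (any, error) {
        return nil, fmt.Errorf("get %q: %w", key, ErrNotFound)
    }

    func main() {
        _, err := get("ns/claim")
        // errors.Is unwraps the %w chain, so callers can match the sentinel.
        fmt.Println(errors.Is(err, ErrNotFound)) // true
    }
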
Kubernetes Prow Robot
d2ce87eb94 Merge pull request #123938 from pohly/dra-structured-parameters-tests
DRA: test for structured parameters
2024-04-18 02:10:08 -07:00
Patrick Ohly
458e227de0 dra scheduler: unit tests
Coverage was checked with a cover profile. The biggest remaining gap is for
isSchedulableAfterClaimParametersChange and
isSchedulableAfterClassParametersChange which will get handled when refactoring
the
foreachPodResourceClaim (https://github.com/kubernetes/kubernetes/issues/123697).
2024-03-22 10:03:22 +01:00
Patrick Ohly
95136db063 dra scheduler: fix re-allocation of claim with structured parameters
The code was incorrectly checking for a controller, but only the boolean
is set for allocated claims. As a result, deallocation was requested from
a non-existent control plane controller.

While at it, let's also clear the driver name. It's not needed when the
claim is deallocated.
2024-03-22 10:03:22 +01:00
Kubernetes Prow Robot
aa73f3163a Merge pull request #122292 from sanposhiho/nodeupdate
register Node/UpdateNodeTaint event to plugins which have Node/Add only and don't have Node/UpdateNodeTaint
2024-03-18 08:33:54 -07:00
Kensei Nakada
2b56de43e5 register Node/UpdateNodeTaint event to plugins which have Node/Add only and don't have Node/UpdateNodeTaint 2024-03-16 14:13:06 +00:00
Kevin Klues
21a0dd1d70 dra scheduler: create default claim/class parameters instead of nil
Without this, the scheduler was crashing in newClaimController() in
pkg/scheduler/framework/plugins/dynamicresources/structuredparameters.go

The code in newClaimController() assumes that the parameters are not nil.
Furthermore it assumes that there is at least one DriverRequest populated in
order to allocate any resources to a claim.

This PR adds logic to define default claim/class parameters that will allow
allocation to proceed even if an end user doesn't provide any class or claim
parameters themselves.

Signed-off-by: Kevin Klues <kklues@nvidia.com>
2024-03-11 13:57:16 +00:00
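
A simplified sketch of that defaulting, with hypothetical stand-ins for the real parameter types:

    package main

    import "fmt"

    // Hypothetical, simplified versions of the real parameter types.
    type DriverRequest struct{ DriverName string }
    type ClaimParameters struct{ Requests []DriverRequest }

    // defaultClaimParameters guarantees what newClaimController assumes:
    // non-nil parameters with at least one DriverRequest.
    func defaultClaimParameters(in *ClaimParameters, driver string) *ClaimParameters {
        if in == nil {
            in = &ClaimParameters{}
        }
        if len(in.Requests) == 0 {
            in.Requests = []DriverRequest{{DriverName: driver}}
        }
        return in
    }

    func main() {
        fmt.Printf("%+v\n", defaultClaimParameters(nil, "gpu.example.com"))
    }
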
Patrick Ohly
251b3859b0 dra scheduler: consider in-flight allocation for resource calculation
Storing a modified claim with allocation and the original resource version in
the assume cache was not reliable: if an update was received, it replaced the
modified claim and the resource that was reserved for the claim might have been
used for some other claim.

To fix this, the in-flight claims are now stored in the map instead of just a
boolean and the status stored there overrides whatever is in the assume cache.

Logging got extended to diagnose this problem better. It started to occur in
E2E tests after splitting the claim update so that first the finalizer is set
and then the status, because setting the finalizer triggered an update.
2024-03-07 22:26:16 +01:00
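
A sketch of the map-instead-of-boolean idea, with simplified stand-in types:

    package main

    import (
        "fmt"
        "sync"
    )

    // claim is a simplified stand-in for a ResourceClaim with its status.
    type claim struct {
        UID       string
        Allocated bool
    }

    type plugin struct {
        inFlight sync.Map          // claim UID -> *claim (previously UID -> bool)
        cache    map[string]*claim // stand-in for the assume cache
    }

    // lookup prefers the in-flight claim over the assume cache content, so
    // a racing informer update cannot hide a pending allocation.
    func (p *plugin) lookup(uid string) *claim {
        if c, ok := p.inFlight.Load(uid); ok {
            return c.(*claim)
        }
        return p.cache[uid]
    }

    func main() {
        p := &plugin{cache: map[string]*claim{"uid-1": {UID: "uid-1"}}}
        p.inFlight.Store("uid-1", &claim{UID: "uid-1", Allocated: true})
        fmt.Println(p.lookup("uid-1").Allocated) // true despite the stale cache entry
    }
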
Patrick Ohly
0b6a0d686a dra api: rename NodeResourceSlice -> ResourceSlice
While currently those objects only get published by the kubelet for node-local
resources, this could change once we also support network-attached
resources. Dropping the "Node" prefix enables such a future extension.

The NodeName in ResourceSlice and StructuredResourceHandle then becomes
optional. The kubelet still needs to provide one and it must match its own node
name, otherwise it doesn't have permission to access ResourceSlice objects.
2024-03-07 22:22:55 +01:00
Patrick Ohly
096e948905 dra scheduler: support structured parameters
When a claim uses structured parameters, as indicated by the resource class
flag, the scheduler is responsible for allocating it. To do this it needs to
gather information about available node resources by watching
NodeResourceSlices and then match the in-tree claim parameters against those
resources.
2024-03-07 22:21:04 +01:00
Kubernetes Prow Robot
c606448922 Merge pull request #122996 from Huang-Wei/cleanup-dra-postfilter
DRA: always returns Unschedulable in PostFilter
2024-01-27 08:19:44 -08:00
Kubernetes Prow Robot
02aaad0de9 Merge pull request #121876 from pohly/dra-reserve-during-pod-binding
dra: reserve + publish during pod binding
2024-01-26 19:58:01 +01:00
Wei Huang
ceabc4aba8 DRA: always returns Unschedulable in PostFilter 2024-01-26 09:44:00 -08:00
Patrick Ohly
6cf4203751 dra scheduler: reformat code
By continuing with the next item in the if clause, the else is no longer needed
and indentation can be reduced.
2024-01-26 10:58:03 +01:00
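
The pattern in miniature:

    package main

    import "fmt"

    func main() {
        for _, item := range []int{1, 2, 3, 4} {
            if item%2 != 0 {
                continue // skip instead of wrapping the rest in an else
            }
            fmt.Println("processing", item)
        }
    }
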
Patrick Ohly
a809a6353b scheduler: publish PodSchedulingContext during PreBind
Blocking API calls during a scheduling cycle, as the DRA plugin was doing, slow
down overall scheduling, also affecting pods which don't use DRA.

It is easy to move the blocking calls into a goroutine while the scheduling
cycle ends with "pod unschedulable". The hard part is handling an error when
those API calls then fail in the background. There is a solution for that
(see https://github.com/kubernetes/kubernetes/pull/120963), but it's complex.

Instead, publishing the modified PodSchedulingContext can also be done
later. In the more common case of a pod which is ready for binding except for
its claims, that'll be in PreBind, which runs in a separate goroutine already.

In the less common case that a pod cannot be scheduled, that'll be in
Unreserve which is still blocking.
2024-01-26 10:58:03 +01:00
Patrick Ohly
5d1509126f dra: patch ReservedFor during PreBind
This moves adding a pod to ReservedFor out of the main scheduling cycle into
PreBind. There it is done concurrently in different goroutines. For claims
which were specifically allocated for a pod (the most common case), that
usually makes no difference because the claim is already reserved.

It starts to matter when that pod then cannot be scheduled for other reasons,
because then the claim gets unreserved to allow deallocating it. It also
matters for claims that are created separately and then get used multiple times
by different pods.

Because multiple pods might get added to the same claim rapidly and independently
of each other, it makes sense to do all claim status updates via patching:
then it is no longer necessary to have an up-to-date copy of the claim because
the patch operation will succeed if (and only if) the patched claim is valid.

Server-side-apply cannot be used for this because a client always has to send
the full list of all entries that it wants to be set, i.e. it cannot add one
entry unless it knows the full list.
2024-01-26 10:58:03 +01:00
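
A sketch of the append-style patch described above: the JSON patch "-" index adds one entry to status.reservedFor without sending the whole list. The pod reference values are made up:

    package main

    import (
        "encoding/json"
        "fmt"
    )

    func main() {
        patch := []map[string]any{
            // Appending with "-" stays valid no matter how many other pods
            // were added concurrently; the server rejects invalid results.
            {"op": "add", "path": "/status/reservedFor/-", "value": map[string]any{
                "resource": "pods", "name": "my-pod", "uid": "5dcb77c2-0000-0000-0000-000000000000",
            }},
        }
        body, err := json.Marshal(patch)
        fmt.Println(string(body), err)
    }
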
Kubernetes Prow Robot
6c493a1ef9 Merge pull request #122969 from kerthcet/fix/claim
[DRA] Fix indexing the error value in unavailableClaim
2024-01-25 17:34:11 +01:00
kerthcet
7801173f6e fix indexing of the error claim in DRA
Signed-off-by: kerthcet <kerthcet@gmail.com>
2024-01-25 23:22:50 +08:00
kerthcet
8371e4cf93 break early when a match is found
Signed-off-by: kerthcet <kerthcet@gmail.com>
2024-01-23 19:40:15 +08:00
Patrick Ohly
b0d4a8cd6d dra scheduler: fix incorrect tracking of claim candidates for reallocation
When dealing with unschedulable pods, the intent was to deallocate only claims
which are allocated and use delayed allocation. That if check wasn't handled
correctly, causing claims with immediate allocation to also be considered as
candidates.

Found during code reading; it has probably never occurred in practice.
2023-12-20 09:04:01 +01:00
AxeZhan
be48c93689 Sched framework: expose NodeInfo in all functions of PluginsRunner interface 2023-12-15 11:30:06 +08:00
Kubernetes Prow Robot
74afd1a06f Merge pull request #119539 from HirazawaUi/remove-not-register-event-code
remove unregistered event code
2023-12-13 21:25:33 +01:00
Kubernetes Prow Robot
5a4e792e06 Merge pull request #120534 from pohly/dra-scheduler-ssa-as-fallback
dra scheduler: fall back to SSA for PodSchedulingContext updates
2023-10-23 21:06:58 +02:00
Kensei Nakada
cb5dc46edf feature(scheduler): simplify QueueingHint by introducing new statuses 2023-10-19 11:02:11 +00:00
HirazawaUi
b20bc79a60 remove unregistered event code 2023-10-07 21:59:01 +08:00
bzsuni
6200eb04af use generic sets in scheduler
Signed-off-by: bzsuni <bingzhe.sun@daocloud.io>
2023-09-28 21:31:33 +08:00
Kubernetes Prow Robot
3ac83f528d Merge pull request #119290 from carlory/add-logger
the scheduling queue logs the error and treats it as QueueAfterBackoff
2023-09-22 08:10:49 -07:00
carlory
0105a002bc when the hint fn returns an error, the scheduling queue logs it and treats it as QueueAfterBackoff.
Co-authored-by: Kensei Nakada <handbomusic@gmail.com>

Co-authored-by: Kante Yin <kerthcet@gmail.com>

Co-authored-by: XsWack <xushiwei5@huawei.com>
2023-09-21 09:40:44 +08:00
Mengjiao Liu
a7466f44e0 Change the scheduler plugins' PluginFactory function to use a context parameter to pass the logger
- Migrated pkg/scheduler/framework/plugins/nodevolumelimits to use contextual logging
- Fix golangci-lint validation failed
- Check for plugins creation err
2023-09-20 17:49:54 +08:00
Patrick Ohly
7cac1dcf67 dra scheduler: fall back to SSA for PodSchedulingContext updates
During scheduler_perf testing, roughly 10% of the PodSchedulingContext update
operations failed with a conflict error. Using SSA would avoid that, but
performance measurements showed that this causes a considerable
slowdown (primarily because of the slower encoding with JSON instead of
protobuf, but also because server-side processing is more expensive).

Therefore a normal update is tried first and SSA only gets used when there has
been a conflict. Using SSA in that case instead of giving up outright is better
because it avoids another scheduling attempt.
2023-09-15 15:05:38 +02:00
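
A minimal sketch of the fallback; tryUpdate and applySSA are hypothetical stand-ins for the real clientset Update and server-side-apply Patch calls:

    package main

    import (
        "context"
        "fmt"

        apierrors "k8s.io/apimachinery/pkg/api/errors"
    )

    func tryUpdate(ctx context.Context) error { return nil }
    func applySSA(ctx context.Context) error  { return nil }

    func publishPodSchedulingContext(ctx context.Context) error {
        err := tryUpdate(ctx)
        if err == nil || !apierrors.IsConflict(err) {
            return err
        }
        // Fall back to SSA only on a conflict: slower (JSON encoding, more
        // server-side work) but avoids another scheduling attempt.
        return applySSA(ctx)
    }

    func main() {
        fmt.Println(publishPodSchedulingContext(context.Background()))
    }
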
Patrick Ohly
6f9140e421 DRA scheduler: stop allocating before deallocation
This fixes a test flake:

    [sig-node] DRA [Feature:DynamicResourceAllocation] multiple nodes reallocation [It] works
    /nvme/gopath/src/k8s.io/kubernetes/test/e2e/dra/dra.go:552

      [FAILED] number of deallocations
      Expected
          <int64>: 2
      to equal
          <int64>: 1
      In [It] at: /nvme/gopath/src/k8s.io/kubernetes/test/e2e/dra/dra.go:651 @ 09/05/23 14:01:54.652

This can be reproduced locally with

    stress -p 10 go test ./test/e2e -args -ginkgo.focus=DynamicResourceAllocation.*reallocation.works  -ginkgo.no-color -v=4 -ginkgo.v

Log output showed that the sequence of events leading to this was:
- claim gets allocated because of selected node
- a different node has to be used, so PostFilter sets
  claim.status.deallocationRequested
- the driver deallocates
- before the scheduler can react and select a different node,
  the driver allocates *again* for the original node
- the scheduler asks for deallocation again
- the driver deallocates again (causing the test failure)
- eventually the pod runs

The fix is to disable allocations first by removing the selected node and then
starting to deallocate.
2023-09-11 10:56:17 +02:00
Kubernetes Prow Robot
a64a3e16ec Merge pull request #120253 from pohly/dra-scheduler-podschedulingcontext-updates
dra scheduler: refactor PodSchedulingContext updates
2023-09-08 02:48:14 -07:00