Commit Graph

1239 Commits

Author SHA1 Message Date
Kensei Nakada
d4d91d4ace fix: use set methods 2024-11-07 14:09:35 +09:00
Kensei Nakada
a95b8b5085 fix: use Activate always 2024-11-07 14:09:35 +09:00
Kensei Nakada
677792663f fix: register Pod/Delete event at the preemption plugin 2024-11-07 14:09:35 +09:00
Kensei Nakada
fe3119fa69 make sure DefaultPreemption implements PreEnqueuePlugin 2024-11-07 14:09:35 +09:00
Kensei Nakada
69a8d0ec0b feature(KEP-4832): asynchronous preemption 2024-11-07 14:09:34 +09:00
Patrick Ohly
33ea278c51 DRA: use v1beta1 API
No code is left which depends on the v1alpha3, except of course the code
implementing that version.
2024-11-06 13:03:19 +01:00
Kubernetes Prow Robot
0fad78930f Merge pull request #127904 from towca/jtuznik/dra-autoscaling
DRA: allow Cluster Autoscaler to integrate with DRA scheduler plugin
2024-11-06 10:01:29 +00:00
Kubernetes Prow Robot
f81a68f488 Merge pull request #128377 from tallclair/allocated-status-2
[FG:InPlacePodVerticalScaling] Implement AllocatedResources status changes for Beta
2024-11-05 23:21:49 +00:00
Kuba Tużnik
8d489425aa scheduler/dynamicresources: extract obtaining and tracking in-memory modifications of DRA objects
All logic related to obtaining DRA objects and tracking modifications
to ResourceClaims in-memory is extracted to DefaultDRAManager, which
implements framework.SharedDRAManager.

This is intended to be a no-op in terms of the DRA plugin behavior.
2024-11-05 14:11:04 +01:00
Patrick Ohly
7863d9a381 DRA scheduler: refactor CEL compilation cache
A better place is the cel package because a) the name can become shorter
and b) it is tightly coupled with the compiler there.

Moving the compilation into the cache simplifies the callers.
2024-11-05 08:34:42 +01:00
Tim Allclair
81df195819 Stop using status.AllocatedResources to aggregate resources 2024-11-01 14:02:58 -07:00
Patrick Ohly
6f07fa3a5e DRA scheduler: update some stale comments 2024-11-01 13:23:42 +01:00
Patrick Ohly
ae6b5522ea DRA scheduler: rename variable
"Allocated devices" are the ones which can be observed from the informer. "All
allocated devices" also includes those which are in flight and haven't been
written back to the apiserver.
2024-11-01 13:23:42 +01:00
Patrick Ohly
0130ebba1d DRA scheduler: refactor "allocated devices" lookup
The logic for skipping "admin access" was repeated in three different places. A
single foreachAllocatedDevices with a callback puts it into one function.
2024-11-01 13:23:28 +01:00
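As an illustration of the callback refactoring described in the commit above, here is a minimal sketch. The `claim` and `deviceResult` types and the `foreachAllocatedDevice` name are simplified, hypothetical stand-ins, not the scheduler's actual types or signature. The point is only that the "skip admin access" check lives in one place and callers pass a callback.

```go
package main

import "fmt"

// deviceResult and claim are simplified, hypothetical stand-ins for the
// allocation results recorded in a ResourceClaim status.
type deviceResult struct {
	Driver, Pool, Device string
	AdminAccess          bool
}

type claim struct {
	Name    string
	Results []deviceResult // empty when the claim is not allocated
}

// foreachAllocatedDevice calls cb for every allocated device in the claim.
// Results granted with admin access are skipped here, so callers do not
// have to repeat that check.
func foreachAllocatedDevice(c claim, cb func(id string)) {
	for _, r := range c.Results {
		if r.AdminAccess {
			continue
		}
		cb(r.Driver + "/" + r.Pool + "/" + r.Device)
	}
}

func main() {
	c := claim{Name: "claim-a", Results: []deviceResult{
		{Driver: "gpu.example.com", Pool: "node-1", Device: "gpu-0"},
		{Driver: "gpu.example.com", Pool: "node-1", Device: "gpu-1", AdminAccess: true},
	}}
	foreachAllocatedDevice(c, func(id string) { fmt.Println("allocated:", id) })
}
```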
Patrick Ohly
bd7ff9c4c7 DRA scheduler: update some log strings 2024-11-01 13:23:11 +01:00
Patrick Ohly
bc55e82621 DRA scheduler: maintain a set of allocated device IDs
Reacting to events from the informer cache (indirectly, through the assume
cache) is more efficient than repeatedly listing its content and then
converting to IDs with unique strings.

    goos: linux
    goarch: amd64
    pkg: k8s.io/kubernetes/test/integration/scheduler_perf
    cpu: Intel(R) Core(TM) i9-7980XE CPU @ 2.60GHz
                                                                                       │            before            │                        after                        │
                                                                                       │ SchedulingThroughput/Average │ SchedulingThroughput/Average  vs base               │
    PerfScheduling/SchedulingWithResourceClaimTemplateStructured/5000pods_500nodes-36                      54.70 ± 6%                     76.81 ± 6%  +40.42% (p=0.002 n=6)
    PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/empty_100nodes-36                     106.4 ± 4%                     105.6 ± 2%        ~ (p=0.413 n=6)
    PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/empty_500nodes-36                     120.0 ± 4%                     118.9 ± 7%        ~ (p=0.117 n=6)
    PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/half_100nodes-36                      112.5 ± 4%                     105.9 ± 4%   -5.87% (p=0.002 n=6)
    PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/half_500nodes-36                      87.13 ± 4%                    123.55 ± 4%  +41.80% (p=0.002 n=6)
    PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/full_100nodes-36                      113.4 ± 2%                     103.3 ± 2%   -8.95% (p=0.002 n=6)
    PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/full_500nodes-36                      65.55 ± 3%                    121.30 ± 3%  +85.05% (p=0.002 n=6)
    geomean                                                                                                90.81                          106.8       +17.57%
2024-11-01 13:23:06 +01:00
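The idea in the commit above, updating a set incrementally from events instead of rebuilding it from a full listing, can be sketched roughly as follows. The `allocatedDevices` type and its handler names are hypothetical, and device IDs are plain strings here for brevity. The real code reacts to assume-cache events and uses its own ID type.

```go
package main

import (
	"fmt"
	"sync"
)

// allocatedDevices is a hypothetical event-maintained set of device IDs.
type allocatedDevices struct {
	mu  sync.RWMutex
	ids map[string]struct{}
}

func newAllocatedDevices() *allocatedDevices {
	return &allocatedDevices{ids: make(map[string]struct{})}
}

// onClaimAllocated would be called from an add/update event handler.
func (a *allocatedDevices) onClaimAllocated(deviceIDs []string) {
	a.mu.Lock()
	defer a.mu.Unlock()
	for _, id := range deviceIDs {
		a.ids[id] = struct{}{}
	}
}

// onClaimDeallocated would be called from an update/delete event handler.
func (a *allocatedDevices) onClaimDeallocated(deviceIDs []string) {
	a.mu.Lock()
	defer a.mu.Unlock()
	for _, id := range deviceIDs {
		delete(a.ids, id)
	}
}

// has answers membership queries without listing any claims.
func (a *allocatedDevices) has(id string) bool {
	a.mu.RLock()
	defer a.mu.RUnlock()
	_, ok := a.ids[id]
	return ok
}

func main() {
	set := newAllocatedDevices()
	set.onClaimAllocated([]string{"drv/pool/gpu-0", "drv/pool/gpu-1"})
	set.onClaimDeallocated([]string{"drv/pool/gpu-0"})
	fmt.Println(set.has("drv/pool/gpu-0"), set.has("drv/pool/gpu-1"))
}
```

Each event costs work proportional to the devices in the changed claim, while a lookup is a single map access.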
Patrick Ohly
814c9428fd DRA scheduler: cache compiled CEL expressions
DeviceClasses and different requests are very likely to contain the same
expression string. We don't need to compile that over and over again.

To avoid hanging onto that cache longer than necessary, it's currently tied to
each PreFilter/Filter combination. It might make sense to move this up into the
scheduler plugin and thus reuse compiled expressions for different pods.

    goos: linux
    goarch: amd64
    pkg: k8s.io/kubernetes/test/integration/scheduler_perf
    cpu: Intel(R) Core(TM) i9-7980XE CPU @ 2.60GHz
                                                                                       │            before            │                        after                        │
                                                                                       │ SchedulingThroughput/Average │ SchedulingThroughput/Average  vs base               │
    PerfScheduling/SchedulingWithResourceClaimTemplateStructured/5000pods_500nodes-36                      33.95 ± 4%                     36.65 ± 2%   +7.95% (p=0.002 n=6)
    PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/empty_100nodes-36                     105.8 ± 2%                     106.7 ± 3%        ~ (p=0.177 n=6)
    PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/empty_500nodes-36                     100.7 ± 1%                     119.7 ± 3%  +18.82% (p=0.002 n=6)
    PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/half_100nodes-36                      90.78 ± 1%                    121.10 ± 4%  +33.40% (p=0.002 n=6)
    PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/half_500nodes-36                      50.51 ± 7%                     63.72 ± 3%  +26.17% (p=0.002 n=6)
    PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/full_100nodes-36                      103.7 ± 5%                     110.2 ± 2%   +6.32% (p=0.002 n=6)
    PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/full_500nodes-36                      28.50 ± 2%                     28.16 ± 5%        ~ (p=0.102 n=6)
    geomean                                                                                                64.99                          73.15       +12.56%
2024-11-01 13:20:06 +01:00
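A minimal sketch of the caching idea from the commit above: compilation results are memoized by expression string, so identical expressions are compiled only once. The `compiled` type and `compileCache` are hypothetical stand-ins. The actual compile step, done by the CEL compiler in the real code, is replaced by a trivial placeholder.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// compiled stands in for a compiled CEL program (hypothetical type).
type compiled struct{ expr string }

// compileCache memoizes compilation results by expression string.
type compileCache struct {
	mu       sync.Mutex
	entries  map[string]*compiled
	compiles int // number of real compilations, kept only for illustration
}

func newCompileCache() *compileCache {
	return &compileCache{entries: make(map[string]*compiled)}
}

// getOrCompile returns the cached result for expr, compiling it on first use.
func (c *compileCache) getOrCompile(expr string) *compiled {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.entries[expr]; ok {
		return e
	}
	c.compiles++
	// Placeholder for the expensive compile step.
	e := &compiled{expr: strings.TrimSpace(expr)}
	c.entries[expr] = e
	return e
}

func main() {
	cache := newCompileCache()
	exprs := []string{
		`device.driver == "a"`,
		`device.driver == "b"`,
		`device.driver == "a"`, // repeated: served from the cache
	}
	for _, expr := range exprs {
		cache.getOrCompile(expr)
	}
	fmt.Println("compilations:", cache.compiles)
}
```

Tying the cache's lifetime to a single scheduling attempt, as the commit message describes, bounds memory use at the cost of recompiling for each pod.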
Patrick Ohly
941d17b3b8 DRA scheduler: code cleanups
Looking up the slice can be avoided by storing it when allocating a device.
The AllocationResult struct is small enough that it can be copied by value.

    goos: linux
    goarch: amd64
    pkg: k8s.io/kubernetes/test/integration/scheduler_perf
    cpu: Intel(R) Core(TM) i9-7980XE CPU @ 2.60GHz
                                                                                       │            before            │                       after                        │
                                                                                       │ SchedulingThroughput/Average │ SchedulingThroughput/Average  vs base              │
    PerfScheduling/SchedulingWithResourceClaimTemplateStructured/5000pods_500nodes-36                      33.30 ± 2%                     33.95 ± 4%       ~ (p=0.288 n=6)
    PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/empty_100nodes-36                     105.3 ± 2%                     105.8 ± 2%       ~ (p=0.524 n=6)
    PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/empty_500nodes-36                     100.8 ± 1%                     100.7 ± 1%       ~ (p=0.738 n=6)
    PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/half_100nodes-36                      90.96 ± 2%                     90.78 ± 1%       ~ (p=0.952 n=6)
    PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/half_500nodes-36                      49.84 ± 4%                     50.51 ± 7%       ~ (p=0.485 n=6)
    PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/full_100nodes-36                      103.8 ± 1%                     103.7 ± 5%       ~ (p=0.582 n=6)
    PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/full_500nodes-36                      27.21 ± 7%                     28.50 ± 2%       ~ (p=0.065 n=6)
    geomean                                                                                                64.26                          64.99       +1.14%
2024-11-01 13:19:51 +01:00
Patrick Ohly
1246898315 DRA scheduler: ResourceSlice with unique strings
Using unique strings instead of normal strings speeds up allocation with
structured parameters because maps that use those strings as key no longer need
to build hashes of the string content. However, care must be taken to call
unique.Make as little as possible because it is costly.

Pre-allocating the map of allocated devices reduces the need to grow the map
when adding devices.

    goos: linux
    goarch: amd64
    pkg: k8s.io/kubernetes/test/integration/scheduler_perf
    cpu: Intel(R) Core(TM) i9-7980XE CPU @ 2.60GHz
                                                                                       │            before            │                        after                         │
                                                                                       │ SchedulingThroughput/Average │ SchedulingThroughput/Average  vs base                │
    PerfScheduling/SchedulingWithResourceClaimTemplateStructured/5000pods_500nodes-36                     18.06 ±  2%                     33.30 ± 2%   +84.31% (p=0.002 n=6)
    PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/empty_100nodes-36                    104.7 ±  2%                     105.3 ± 2%         ~ (p=0.818 n=6)
    PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/empty_500nodes-36                    96.62 ±  1%                    100.75 ± 1%    +4.28% (p=0.002 n=6)
    PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/half_100nodes-36                     83.00 ±  2%                     90.96 ± 2%    +9.59% (p=0.002 n=6)
    PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/half_500nodes-36                     32.45 ±  7%                     49.84 ± 4%   +53.60% (p=0.002 n=6)
    PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/full_100nodes-36                     95.22 ±  7%                    103.80 ± 1%    +9.00% (p=0.002 n=6)
    PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/full_500nodes-36                     9.111 ± 10%                    27.215 ± 7%  +198.69% (p=0.002 n=6)
    geomean                                                                                               45.86                           64.26        +40.12%
2024-11-01 13:19:48 +01:00
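To illustrate the technique named in the commit above, here is a small sketch using Go's standard `unique` package (Go 1.23 or later). The `deviceID` struct and `makeDeviceID` helper are hypothetical and not the scheduler's actual types. Handles are created once, when the ID is built, and then reused as map keys. The map is also pre-sized, as the commit message suggests.

```go
package main

import (
	"fmt"
	"unique"
)

// deviceID is a hypothetical map key built from interned strings.
// Comparing unique.Handle values does not compare string contents.
type deviceID struct {
	driver, pool, device unique.Handle[string]
}

// makeDeviceID interns the three strings. unique.Make is comparatively
// costly, so it is called once here rather than on every lookup.
func makeDeviceID(driver, pool, device string) deviceID {
	return deviceID{
		driver: unique.Make(driver),
		pool:   unique.Make(pool),
		device: unique.Make(device),
	}
}

func main() {
	ids := []deviceID{
		makeDeviceID("gpu.example.com", "node-1", "gpu-0"),
		makeDeviceID("gpu.example.com", "node-1", "gpu-1"),
	}

	// Pre-allocate the map so it does not need to grow while adding devices.
	allocated := make(map[deviceID]struct{}, len(ids))
	for _, id := range ids {
		allocated[id] = struct{}{}
	}

	_, ok := allocated[makeDeviceID("gpu.example.com", "node-1", "gpu-1")]
	fmt.Println("gpu-1 allocated:", ok)
}
```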
Patrick Ohly
7de6d070f2 DRA scheduler: avoid listing claims during Filter
The Allocate call used to call back into the claim lister for each node. This
was significant work which showed up at the top of the CPU profile. It's
okay to list only once during PreFilter because the Filter call does not change
the claim status between Allocate calls.

    goos: linux
    goarch: amd64
    pkg: k8s.io/kubernetes/test/integration/scheduler_perf
    cpu: Intel(R) Core(TM) i9-7980XE CPU @ 2.60GHz
                                                                                       │            before            │                        after                        │
                                                                                       │ SchedulingThroughput/Average │ SchedulingThroughput/Average  vs base               │
    PerfScheduling/SchedulingWithResourceClaimTemplateStructured/5000pods_500nodes-36                      15.04 ± 0%                    18.06 ±  2%  +20.07% (p=0.002 n=6)
    PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/empty_100nodes-36                     105.5 ± 1%                    104.7 ±  2%        ~ (p=0.485 n=6)
    PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/empty_500nodes-36                     95.83 ± 1%                    96.62 ±  1%        ~ (p=0.063 n=6)
    PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/half_100nodes-36                      79.67 ± 3%                    83.00 ±  2%   +4.18% (p=0.002 n=6)
    PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/half_500nodes-36                      27.11 ± 5%                    32.45 ±  7%  +19.68% (p=0.002 n=6)
    PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/full_100nodes-36                      84.00 ± 3%                    95.22 ±  7%  +13.36% (p=0.002 n=6)
    PerfScheduling/SteadyStateClusterResourceClaimTemplateStructured/full_500nodes-36                      7.110 ± 6%                    9.111 ± 10%  +28.15% (p=0.002 n=6)
    geomean                                                                                                41.05                         45.86        +11.73%
2024-11-01 12:43:17 +01:00
Kubernetes Prow Robot
223ac36b50 Merge pull request #128399 from JesseStutler/dra
Refactor the dynamicResources struct to DynamicResources
2024-11-01 00:33:27 +00:00
Kubernetes Prow Robot
daef8c2419 Merge pull request #127266 from pohly/dra-admin-access-in-status
DRA API: AdminAccess in DeviceRequestAllocationResult + DRAAdminAccess feature gate
2024-10-30 03:41:25 +00:00
Kubernetes Prow Robot
988769933e Merge pull request #128307 from NoicFank/bugfix-scheduler-preemption
bugfix(scheduler): preemption picks wrong victim node with higher priority pod on it
2024-10-29 19:05:02 +00:00
NoicFank
68f7a7c682 bugfix(scheduler): preemption picks wrong victim node with higher priority pod on it.
Introducing PDB to preemption had disrupted the ordering of pods in the victims,
which would lead to picking the wrong victim node, one with a higher priority pod on it.
2024-10-29 19:50:55 +08:00
Patrick Ohly
4419568259 DRA: treat AdminAccess as a new feature gated field
Using the "normal" logic for a feature gated field simplifies the
implementation of the feature gate.

There is one (entirely theoretic!) problem with updating from 1.31: if a claim
was allocated in 1.31 with admin access, the status field was not set because
it didn't exist yet. If a driver now follows the current definition of "unset =
off", then it will not grant admin access even though it should. This is
theoretic because drivers are starting to support admin access with 1.32, so
there shouldn't be any claim where this problem could occur.
2024-10-29 10:22:31 +01:00
Patrick Ohly
9a7e4ccab2 DRA admin access: add feature gate
The new DRAAdminAccess feature gate has the following effects:
- If disabled in the apiserver, the spec.devices.requests[*].adminAccess
  field gets cleared. Same in the status. In both cases the scenario
  that it was already set and a claim or claim template get updated
  is special: in those cases, the field is not cleared.

  Also, allocating a claim with admin access is allowed regardless of the
  feature gate and the field is not cleared. In practice, the scheduler
  will not do that.
- If disabled in the resource claim controller, creating ResourceClaims
  with the field set gets rejected. This prevents running workloads
  which depend on admin access.
- If disabled in the scheduler, claims with admin access don't get
  allocated. The effect is the same.

The alternative would have been to ignore the fields in claim controller and
scheduler. This is bad because a monitoring workload then runs, blocking
resources that probably were meant for production workloads.
2024-10-29 09:50:11 +01:00
Patrick Ohly
f3fef01e79 DRA API: AdminAccess in DeviceRequestAllocationResult
Drivers need to know that because admin access may also grant additional
permissions. The allocator needs to ignore such results when determining which
devices are considered as allocated.

In both cases it is conceptually cleaner to not rely on the content of the
ClaimSpec.
2024-10-29 09:50:07 +01:00
jessestutler
f7003c76b4 Refactor the dynamicResources struct to DynamicResources 2024-10-29 11:44:42 +08:00
Patrick Ohly
9d1b0654e0 DRA: add wg/device-management label automatically
This makes PRs show up automatically in the WG's project
board (https://github.com/orgs/kubernetes/projects/95/views/1).
2024-10-28 16:36:04 +01:00
Kubernetes Prow Robot
25d6f76538 Merge pull request #128337 from torredil/fix-gce-cos-master-serial-5123
Add VolumeAttachment event registration to CSI volume limits plugin
2024-10-26 16:00:52 +01:00
torredil
fe1badf635 Add VolumeAttachment event registration to CSI volume limits plugin
Signed-off-by: torredil <torredil@amazon.com>
2024-10-26 13:41:28 +00:00
Kubernetes Prow Robot
aec2ea1877 Merge pull request #124609 from AxeZhan/refac
Move some helper functions from api/v1 to component-helpers
2024-10-25 17:26:52 +01:00
AxeZhan
2ffb568540 rename functions 2024-10-25 12:53:24 +08:00
Kubernetes Prow Robot
352056f09d Merge pull request #127757 from torredil/scheduler-bugfix-5123
scheduler: Improve CSILimits plugin accuracy by using VolumeAttachments
2024-10-23 18:12:52 +01:00
Kubernetes Prow Robot
e39571591d Merge pull request #127478 from googs1025/scheduler/fine-grained
feature(scheduler): more fine-grained QHints for podtopologyspread plugin
2024-10-20 13:29:03 +01:00
googs1025
1edbd0b54f feature(scheduler): more fine-grained QHints for podtopologyspread plugin 2024-10-19 23:45:13 +08:00
torredil
56f2b192cc scheduler: Improve CSILimits plugin accuracy by using VolumeAttachments
Signed-off-by: torredil <torredil@amazon.com>
2024-10-18 19:02:14 +00:00
Kensei Nakada
83f9e4b6df cleanup: remove event list 2024-10-18 11:10:10 +10:00
Patrick Ohly
f84eb5ecf8 DRA: remove "classic DRA"
This removes the DRAControlPlaneController feature gate, the fields controlled
by it (claim.spec.controller, claim.status.deallocationRequested,
claim.status.allocation.controller, class.spec.suitableNodes), the
PodSchedulingContext type, and all code related to the feature.

The feature gets removed because there is no path towards beta and GA and DRA
with "structured parameters" should be able to replace it.
2024-10-16 23:09:50 +02:00
AxeZhan
b1f07bb36c add tests for scheduler 2024-10-10 15:53:19 +08:00
Kubernetes Prow Robot
3de975b732 Merge pull request #125171 from YamasouA/ft/queuehint-csidriver
volumebinding: scheduler queueing hints - CSIDriver
2024-10-04 00:26:27 +01:00
YamasouA
6dbaa5660e fix test 2024-10-02 22:50:39 +09:00
googs1025
24a28766d4 chore(scheduler dra): improve dra queue hint unit test 2024-10-01 17:22:15 +08:00
Kubernetes Prow Robot
67cdc26214 Merge pull request #127497 from pohly/dra-scheduler-queueing-hints-fix
DRA scheduler: fix queuing hint support
2024-09-30 23:21:48 +01:00
Patrick Ohly
aee77bfc84 DRA scheduler: add special ActionType for ResourceClaim changes
Having a dedicated ActionType which only gets used when the scheduler itself
already detects some change in the list of generated ResourceClaims of a pod
avoids calling the DRA plugin for unrelated Pod changes.
2024-09-27 16:53:58 +02:00
Patrick Ohly
d425353c13 DRA scheduler: reduce verbosity of queuing hints
Other hints also only use V(5) or higher.
2024-09-27 08:15:33 +02:00
Patrick Ohly
4a265feb83 DRA scheduler: fix queuing hint support
d66f8f9 added that "plugins have to implement a QueueingHint for Pod/Update
event if the rejection from them could be resolved by updating unscheduled
Pods itself".

This applies to DRA because the name of a generated ResourceClaim must be
recorded in the pod status before the pod can be scheduled.
2024-09-27 08:15:33 +02:00
Kubernetes Prow Robot
960e3984b0 Merge pull request #127444 from dom4ha/fine-grained-qhints
Fine grain QueueHints for NodeAffinity plugin
2024-09-27 01:42:00 +01:00
dom4ha
c7db4bb450 Fine grain QueueHints for nodeaffinity plugin.
Skip queue on unrelated change that keeps pod schedulable when QueueHints are enabled.

Split add from QHints disabled case

Remove case when QHints are disabled

Remove two QHint alternatives in unit tests

more fine-grained Node QHint for NodeResourceFit plugin

Return early when updated Node causes unmatch

Revert "more fine-grained Node QHint for NodeResourceFit plugin"

This reverts commit dfbceb60e0c1c4e47748c12722d9ed6dba1a8366.

Add integration test for requeue of a pod previously rejected by NodeAffinity plugin when a suitable Node is added

Add integration test for a Node update operation that does not trigger requeue in NodeAffinity plugin

Remove inaccurate comment

Apply review comments
2024-09-26 10:21:08 +00:00
dom4ha
903b1f7e28 more fine-grained Node QHint for NodeResourceFit plugin 2024-09-26 09:51:36 +00:00