Commit Graph

25997 Commits

Kubernetes Prow Robot
2b196cff8b Merge pull request #127589 from soltysh/timestamp_e2e
e2e: add test covering cronjob-scheduled-timestamp annotation added by cronjob
2024-09-25 17:46:09 +01:00
Kubernetes Prow Robot
5fc4e71a30 Merge pull request #127499 from pohly/scheduler-perf-updates
scheduler_perf: updates to enhance performance testing of DRA
2024-09-25 13:32:00 +01:00
Maciej Szulik
f11ddad99d e2e: add test covering cronjob-scheduled-timestamp annotation added by cronjob 2024-09-25 12:47:27 +02:00
Kubernetes Prow Robot
75214d11d5 Merge pull request #127428 from googs1025/scheduler/plugin
chore(scheduler): refactor import package ordering in scheduler
2024-09-25 11:40:07 +01:00
Lukasz Szaszkiewicz
ae35048cb0 adds watchListEndpointRestrictions for watchlist requests (#126996)
* endpoints/handlers/get: intro watchListEndpointRestrictions

* consistencydetector/list_data_consistency_detector: expose IsDataConsistencyDetectionForListEnabled

* e2e/watchlist: extract common function for adding unstructured secrets

* e2e/watchlist: new e2e scenarios covering watchListEndpointRestrictions
2024-09-25 10:12:01 +01:00
Patrick Ohly
d100768d94 scheduler_perf: track and visualize progress over time
This is useful to see whether pod scheduling happens in bursts and how it
behaves over time, which is relevant in particular for dynamic resource
allocation, where towards the end it may become harder to find a node which
still has resources available.

Besides "pods scheduled" it's also useful to know how many attempts were
needed, so schedule_attempts_total also gets sampled and stored.

To visualize the result of one or more test runs, use:

     gnuplot.sh *.dat
2024-09-25 11:09:15 +02:00
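A minimal sketch of the sampling idea behind this commit, assuming an
in-process counter per metric and a one-second interval (both assumptions;
the real collector reads schedule_attempts_total from the metrics registry):

    package main

    import (
        "fmt"
        "os"
        "sync/atomic"
        "time"
    )

    var podsScheduled, scheduleAttempts atomic.Int64

    // sample appends one whitespace-separated line per tick: elapsed seconds,
    // pods scheduled, attempts. gnuplot consumes this .dat format directly.
    func sample(f *os.File, stop <-chan struct{}) {
        start := time.Now()
        ticker := time.NewTicker(time.Second)
        defer ticker.Stop()
        for {
            select {
            case <-stop:
                return
            case t := <-ticker.C:
                fmt.Fprintf(f, "%.1f %d %d\n",
                    t.Sub(start).Seconds(),
                    podsScheduled.Load(),
                    scheduleAttempts.Load())
            }
        }
    }

    func main() {
        stop := make(chan struct{})
        go sample(os.Stdout, stop)
        for i := 0; i < 5; i++ { // stand-in for scheduling activity
            time.Sleep(time.Second)
            podsScheduled.Add(1)
            scheduleAttempts.Add(2)
        }
        close(stop)
    }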
Patrick Ohly
ded96042f7 scheduler_perf + DRA: load up cluster by allocating claims
Having to schedule 4999 pods to simulate a "full" cluster is slow. Creating
claims and then allocating them more or less like the scheduler would when
scheduling pods is much faster and in practice has the same effect on the
dynamicresources plugin because it looks at claims, not pods.

This allows defining the "steady state" workloads with a higher number of
devices ("claimsPerNode") again. This was prohibitively slow before.
2024-09-25 09:45:39 +02:00
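A hypothetical sketch of the pre-allocation idea; Claim, Allocation, and the
device naming are stand-ins for the real ResourceClaim API, not the actual
scheduler_perf code:

    package main

    import "fmt"

    // Allocation and Claim mimic the shape of a ResourceClaim with an
    // allocation result; only the idea matters here.
    type Allocation struct {
        NodeName string
        Device   string
    }

    type Claim struct {
        Name       string
        Allocation *Allocation // nil until allocated
    }

    // loadUpCluster consumes claimsPerNode devices on every node by creating
    // claims that are already allocated, the way the scheduler eventually
    // would, but without scheduling thousands of filler pods first.
    func loadUpCluster(nodes []string, claimsPerNode int) []*Claim {
        claims := make([]*Claim, 0, len(nodes)*claimsPerNode)
        for _, node := range nodes {
            for i := 0; i < claimsPerNode; i++ {
                claims = append(claims, &Claim{
                    Name: fmt.Sprintf("%s-claim-%d", node, i),
                    Allocation: &Allocation{
                        NodeName: node,
                        Device:   fmt.Sprintf("device-%d", i),
                    },
                })
            }
        }
        return claims
    }

    func main() {
        claims := loadUpCluster([]string{"node-1", "node-2"}, 3)
        fmt.Println(len(claims), "claims pre-allocated")
    }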
Patrick Ohly
385599f0a8 scheduler_perf + DRA: measure pod scheduling at a steady state
The previous tests were based on scheduling pods until the cluster was
full. This is a valid scenario, but not necessarily realistic.

More realistic is how quickly the scheduler can schedule new pods when some
old pods have finished running, in particular in a cluster that is properly
utilized (= almost full). To test this, pods must get created, scheduled, and
then immediately deleted. This can run for a certain period of time.

Scenarios with an empty and with a full cluster have different scheduling
rates. This was previously visible for DRA because the 50th percentile of the
scheduling throughput was lower than the average, but one had to guess in which scenario
the throughput was lower. Now this can be measured for DRA with the new
SteadyStateClusterResourceClaimTemplateStructured test.

The metrics collector must watch pod events to figure out how many pods got
scheduled. Polling misses pods that already got deleted again. There seems to
be no relevant difference in the collected
metrics (SchedulingWithResourceClaimTemplateStructured/2000pods_200nodes, 6 repetitions):

     │            before            │                     after                     │
     │ SchedulingThroughput/Average │ SchedulingThroughput/Average  vs base         │
                         157.1 ± 0%                     157.1 ± 0%  ~ (p=0.329 n=6)

     │           before            │                    after                     │
     │ SchedulingThroughput/Perc50 │ SchedulingThroughput/Perc50  vs base         │
                        48.99 ± 8%                    47.52 ± 9%  ~ (p=0.937 n=6)

     │           before            │                    after                     │
     │ SchedulingThroughput/Perc90 │ SchedulingThroughput/Perc90  vs base         │
                       463.9 ± 16%                   460.1 ± 13%  ~ (p=0.818 n=6)

     │           before            │                    after                     │
     │ SchedulingThroughput/Perc95 │ SchedulingThroughput/Perc95  vs base         │
                       463.9 ± 16%                   460.1 ± 13%  ~ (p=0.818 n=6)

     │           before            │                    after                     │
     │ SchedulingThroughput/Perc99 │ SchedulingThroughput/Perc99  vs base         │
                       463.9 ± 16%                   460.1 ± 13%  ~ (p=0.818 n=6)
2024-09-25 09:45:39 +02:00
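A hedged sketch of why watching beats polling here, written against plain
client-go rather than the actual scheduler_perf collector: the informer
observes the update that sets spec.nodeName even if the pod is deleted
moments later, whereas a poller can miss that short-lived state entirely.

    package metrics

    import (
        "sync/atomic"

        corev1 "k8s.io/api/core/v1"
        "k8s.io/client-go/informers"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/cache"
    )

    // countScheduledPods counts unscheduled-to-scheduled transitions via a
    // pod informer; deletions shortly afterwards no longer lose samples.
    func countScheduledPods(client kubernetes.Interface, stop <-chan struct{}) *atomic.Int64 {
        var scheduled atomic.Int64
        factory := informers.NewSharedInformerFactory(client, 0)
        factory.Core().V1().Pods().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
            UpdateFunc: func(oldObj, newObj interface{}) {
                oldPod, newPod := oldObj.(*corev1.Pod), newObj.(*corev1.Pod)
                if oldPod.Spec.NodeName == "" && newPod.Spec.NodeName != "" {
                    scheduled.Add(1)
                }
            },
        })
        factory.Start(stop)
        factory.WaitForCacheSync(stop)
        return &scheduled
    }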
Patrick Ohly
51cafb0053 scheduler_perf: more useful errors for configuration mistakes
Before, the first error was reported, which typically was the "invalid opcode"
error from the createAny operation:

    scheduler_perf.go:900: parsing test cases error: error unmarshaling JSON: while decoding JSON: cannot unmarshal {"collectMetrics":true,"count":10,"duration":"30s","namespace":"test","opcode":"createPods","podTemplatePath":"config/dra/pod-with-claim-template.yaml","steadyState":true} into any known op type: invalid opcode "createPods"; expected "createAny"

Now the opcode is determined first, then decoding into exactly the matching
operation type is attempted and validated. Unknown fields are an error.

In the case above, decoding a string into time.Duration failed:

    scheduler_test.go:29: parsing test cases error: error unmarshaling JSON: while decoding JSON: decoding {"collectMetrics":true,"count":10,"duration":"30s","namespace":"test","opcode":"createPods","podTemplatePath":"config/dra/pod-with-claim-template.yaml","steadyState":true} into *benchmark.createPodsOp: json: cannot unmarshal string into Go struct field createPodsOp.Duration of type time.Duration

Some typos:

    scheduler_test.go:29: parsing test cases error: error unmarshaling JSON: while decoding JSON: unknown opcode "sleeep" in {"duration":"5s","opcode":"sleeep"}

    scheduler_test.go:29: parsing test cases error: error unmarshaling JSON: while decoding JSON: decoding {"countParram":"$deletingPods","deletePodsPerSecond":50,"opcode":"createPods"} into *benchmark.createPodsOp: json: unknown field "countParram"
2024-09-25 09:45:39 +02:00
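A minimal sketch of the two-phase decoding described above: peek at the
opcode, then strictly decode into exactly the matching op type so that
unknown fields and type mismatches surface as the precise errors shown. The
op structs are illustrative stand-ins, not the real scheduler_perf types.

    package benchmark

    import (
        "bytes"
        "encoding/json"
        "fmt"
        "time"
    )

    type createPodsOp struct {
        Opcode   string        `json:"opcode"`
        Count    int           `json:"count"`
        Duration time.Duration `json:"duration"` // decoding "30s" fails, as above
    }

    type sleepOp struct {
        Opcode   string        `json:"opcode"`
        Duration time.Duration `json:"duration"`
    }

    func decodeOp(data []byte) (interface{}, error) {
        // Phase 1: determine the opcode, ignoring all other fields.
        var peek struct {
            Opcode string `json:"opcode"`
        }
        if err := json.Unmarshal(data, &peek); err != nil {
            return nil, err
        }
        var op interface{}
        switch peek.Opcode {
        case "createPods":
            op = &createPodsOp{}
        case "sleep":
            op = &sleepOp{}
        default:
            return nil, fmt.Errorf("unknown opcode %q in %s", peek.Opcode, data)
        }
        // Phase 2: strict decode into exactly the matching type; unknown
        // fields like "countParram" become errors.
        dec := json.NewDecoder(bytes.NewReader(data))
        dec.DisallowUnknownFields()
        if err := dec.Decode(op); err != nil {
            return nil, fmt.Errorf("decoding %s into %T: %w", data, op, err)
        }
        return op, nil
    }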
Kubernetes Prow Robot
5dd244ff00 Merge pull request #125796 from haorenfsa/fix-gc-sync-blocked
garbagecollector: controller should not be blocking on failed cache sync
2024-09-25 04:02:00 +01:00
Kubernetes Prow Robot
e9cde03b91 Merge pull request #127598 from aojea/servicecidr_seconday_dualwrite
bugfix: initialize secondary range registry with the right value
2024-09-24 21:08:08 +01:00
Antonio Ojea
7a9bca3888 bugfix: initialize secondary range registry with the right value
For the MultiCIDRServiceAllocator feature we added an additional feature
gate, DisableAllocatorDualWrite, that allows enabling mirror behavior on
the old allocator to deal with problems during cluster upgrades.

During the implementation, the secondary range of the legacy allocator
was initialized with the value of the primary range; hence, when a
Service tried to allocate a new IP on the secondary range, it succeeded
in the new IP allocator but failed when it tried to allocate the same IP
on the legacy allocator, since that has a different range.

Expand the integration test that runs over all the combinations of
Service ClusterIP possibilities to run with all the possible
combinations of the feature gates.

The integration test needs to change the way it starts the apiserver,
otherwise it will time out.
2024-09-24 17:48:13 +00:00
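A self-contained illustration of the bug (allocator names and construction
are hypothetical; the real code lives in the service IP allocators):

    package main

    import (
        "fmt"
        "net"
    )

    // rangeAllocator stands in for the legacy service IP allocator; only the
    // CIDR it covers matters for this sketch.
    type rangeAllocator struct{ cidr *net.IPNet }

    func newRangeAllocator(cidr string) *rangeAllocator {
        _, ipnet, err := net.ParseCIDR(cidr)
        if err != nil {
            panic(err)
        }
        return &rangeAllocator{cidr: ipnet}
    }

    func (r *rangeAllocator) contains(ip string) bool {
        return r.cidr.Contains(net.ParseIP(ip))
    }

    func main() {
        primaryCIDR, secondaryCIDR := "10.0.0.0/16", "fd00::/112"

        // Buggy initialization: the mirror allocator for the secondary range
        // was built from the primary range, so mirroring any secondary-range
        // IP necessarily failed.
        buggy := newRangeAllocator(primaryCIDR)
        fmt.Println(buggy.contains("fd00::10")) // false: dual write fails

        // Fixed: initialize it from the secondary range itself.
        fixed := newRangeAllocator(secondaryCIDR)
        fmt.Println(fixed.contains("fd00::10")) // true
    }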
Patrick Ohly
7bbb3465e5 scheduler_perf: more realistic structured parameters tests
Real devices are likely to have a handful of attributes and (for GPUs) the
memory as capacity. Most keys will be driver-specific; a few may eventually
have a domain (none are standardized right now).
2024-09-24 18:52:45 +02:00
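An illustrative sketch of the kind of device the test now generates; the type
and key names are invented for the example, not the real DRA API:

    package main

    import "fmt"

    // device carries a handful of attributes plus memory as capacity, the
    // shape described above. Most attribute keys would be driver-specific.
    type device struct {
        Attributes map[string]string
        Capacity   map[string]string
    }

    func exampleGPU() device {
        return device{
            Attributes: map[string]string{
                "driverVersion": "1.0.0",       // driver-specific key
                "model":         "example-gpu", // hypothetical attribute
            },
            Capacity: map[string]string{
                "memory": "80Gi", // GPU memory as a resource quantity
            },
        }
    }

    func main() {
        fmt.Printf("%+v\n", exampleGPU())
    }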
Kubernetes Prow Robot
b071443187 Merge pull request #127592 from dims/wait-for-gpus-even-for-aws-kubetest2-ec2-harness
Wait for GPUs even for AWS kubetest2 ec2 harness
2024-09-24 17:26:08 +01:00
Davanum Srinivas
472ca3b279 skip control plane nodes, they may not have GPUs
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2024-09-24 10:09:33 -04:00
Kubernetes Prow Robot
6ded721910 Merge pull request #127496 from macsko/add_metricscollectionop_to_scheduler_perf
Add separate ops for collecting metrics from multiple namespaces in scheduler_perf
2024-09-24 14:34:00 +01:00
Davanum Srinivas
349c7136c9 Wait for GPUs even for AWS kubetest2 ec2 harness
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2024-09-24 09:11:18 -04:00
Maciej Skoczeń
a273e5381a Add separate ops for collecting metrics from multiple namespaces in scheduler_perf 2024-09-24 12:28:53 +00:00
Kubernetes Prow Robot
f0036aac21 Merge pull request #127572 from soltysh/reuse_helper
Reuse CreateTestCRD helper for kubectl e2e
2024-09-24 06:05:59 +01:00
Kubernetes Prow Robot
94df29b8f2 Merge pull request #127464 from sanposhiho/trigger-nodedelete
fix(eventhandler): trigger Node/Delete event
2024-09-24 02:24:00 +01:00
Kubernetes Prow Robot
1137a6a0cc Merge pull request #127093 from jpbetz/retry-generate-name-ga
Promote RetryGenerateName to GA
2024-09-24 00:46:06 +01:00
Kubernetes Prow Robot
d6bb550b10 Merge pull request #122890 from HirazawaUi/fix-pod-grace-period
[kubelet]: Fix the bug where pod grace period will be overwritten
2024-09-24 00:45:59 +01:00
Kubernetes Prow Robot
7ff0580bc8 Merge pull request #127458 from ii/promote-volume-attachment-status-test
Promote e2e test for VolumeAttachmentStatus Endpoints +3 Endpoints
2024-09-23 18:08:00 +01:00
Maciej Szulik
b51d6308a7 Reuse CreateTestCRD helper for kubectl e2e 2024-09-23 18:32:27 +02:00
Kubernetes Prow Robot
ff391cefe2 Merge pull request #127547 from dims/skip-reinstallation-of-gpu-daemonset
Skip re-installation of GPU daemonset
2024-09-23 15:28:00 +01:00
Kubernetes Prow Robot
f187480140 Merge pull request #127558 from pohly/e2e-framework-docs
e2e framework: better documentation of ExpectNoError
2024-09-23 14:12:00 +01:00
Kubernetes Prow Robot
15d08bf7c8 Merge pull request #127323 from vrutkovs/tracing-cacher-get
tracing: add span for get cacher
2024-09-23 10:27:59 +01:00
Patrick Ohly
e5aa609513 e2e framework: better documentation of ExpectNoError
It wasn't clear from the comments what "explain" does, leading to calls like
this:

   framework.ExpectNoError(fmt.Errorf("additional info ....: %v", ..., err))
2024-09-23 10:58:06 +02:00
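A sketch of the usage the clarified docs steer towards, assuming the
framework's variadic explain parameter: pass the error as-is and supply the
context as format arguments instead of wrapping the error in fmt.Errorf.

    package e2e

    import "k8s.io/kubernetes/test/e2e/framework"

    func waitExample(err error, podName string) {
        // The explain arguments describe what was being attempted; the error
        // itself is passed separately rather than wrapped into a new error.
        framework.ExpectNoError(err, "waiting for pod %s to start", podName)
    }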
Kubernetes Prow Robot
89f418f29e Merge pull request #127481 from kannon92/fix-mount-propogation-flake
Use the last kubelet pid in the pidof command
2024-09-23 09:05:59 +01:00
Davanum Srinivas
1abbb00067 Double a couple of other timeouts
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2024-09-22 19:36:39 -04:00
Davanum Srinivas
92683139d7 Skip re-installation of GPU daemonset
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2024-09-22 13:54:12 -04:00
Kensei Nakada
421f87a4e3 feat: add a requeueing integration test for PodTopologySpread with Node/delete event (QHint: disabled) 2024-09-23 00:29:56 +09:00
Kensei Nakada
bf8f7a3ad7 feat: add a requeueing integration test for PodTopologySpread with Node/delete event 2024-09-22 17:34:37 +09:00
Kubernetes Prow Robot
61dbc03563 Merge pull request #127471 from macsko/add_deletepodsop_to_scheduler_perf
Add deletePodsOp to scheduler_perf
2024-09-22 07:00:04 +01:00
Vadim Rutkovsky
dff0075e7c tracing: add span for cacher.Get
Also updates tracing integration tests for cacher.GetList
2024-09-21 09:53:43 +02:00
Kubernetes Prow Robot
221bf19ee0 Merge pull request #127309 from ii/create-csinode-lifecycle-test
Write e2e test for StorageV1CSINode Endpoints +6 Endpoints
2024-09-21 03:59:59 +01:00
Kubernetes Prow Robot
a8fd8f5a41 Merge pull request #127516 from dims/bump-timeout-to-account-for-slow-gpu-operations
Bump timeout to account for slow GPU daemonset activation
2024-09-21 02:55:58 +01:00
Davanum Srinivas
3d7d06e7cd Bump timeout to account for slow GPU operations
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2024-09-20 20:52:51 -04:00
Kubernetes Prow Robot
52095a8b7b Merge pull request #127509 from dims/test-more-gpu-stuff
Test MOAR GPU stuff (add the cuda demo suite!)
2024-09-20 23:53:58 +01:00
Kubernetes Prow Robot
7a58803c84 Merge pull request #127281 from ii/remove-node-endpoints
Remove Node endpoints from pending_eligible_endpoints.yaml
2024-09-20 22:50:04 +01:00
Kubernetes Prow Robot
f9a57ba82d Merge pull request #126760 from ncdc/ncdc/emeritus
Move ncdc to emeritus
2024-09-20 21:01:58 +01:00
Davanum Srinivas
e516e003c5 Test MOAR GPU stuff
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2024-09-20 11:40:33 -04:00
HirazawaUi
9d4e272c16 add e2e test for pod grace period being overridden 2024-09-20 22:25:03 +08:00
HirazawaUi
7c85784b9f fix the bug where pod grace period will be overwritten 2024-09-20 22:25:01 +08:00
Kubernetes Prow Robot
f2700895a4 Merge pull request #127422 from srivastav-abhishek/go-vet-fix
Go vet fixes for gotip
2024-09-20 14:37:58 +01:00
Kevin Hannon
9b6ef250fc always use the last entry in the pidof command as that is the oldest 2024-09-20 09:05:31 -04:00
Maciej Skoczeń
287b61918a Add deletePodsOp to scheduler_perf 2024-09-20 09:46:27 +00:00
Kubernetes Prow Robot
ffabcdc6d1 Merge pull request #127448 from Nordix/esotsal/fix-123852
Potentially deflake "RuntimeClass should reject a Pod requesting a deleted RuntimeClass" test
2024-09-20 08:07:43 +01:00
Abhishek Kr Srivastav
95860cff1c Fix Go vet errors for master golang
Co-authored-by: Rajalakshmi-Girish <rajalakshmi.girish1@ibm.com>
Co-authored-by: Abhishek Kr Srivastav <Abhishek.kr.srivastav@ibm.com>
2024-09-20 12:36:38 +05:30
Kubernetes Prow Robot
08aefc8a92 Merge pull request #119362 from pacoxu/add-new-eviction-pid-test
add new e2e test with PodAndContainerStatsFromCRI enabled for pid eviction order
2024-09-20 05:44:45 +01:00