5145 Commits

Author SHA1 Message Date
Joe Betz
3570feb2fc Cancel informers for shutdown server in peerproxy test 2024-10-08 21:49:09 -04:00
Richa Banker
fe97e41f29 add more logging for peer_proxy_test, also tweak IdentityLeaseGCPeriod and IdentityLeaseRenewIntervalPeriod 2024-10-08 17:18:27 -07:00
Kubernetes Prow Robot
41440c8117 Merge pull request #127389 from macsko/pod_delete_event_handling_scheduler_perf_test_case
Add scheduler_perf test case for AssignedPodDelete event handling
2024-10-08 21:52:28 +01:00
Cici Huang
baeeb66613 Update tests 2024-10-08 17:02:07 +00:00
Kensei Nakada
a2b3a4f4dc chore: ensure the scheduler handles events before checking the pod position 2024-10-06 21:06:45 +09:00
Kubernetes Prow Robot
7478a30fdc Merge pull request #127260 from carlory/fix-124136
Fix TestPersistentVolumeProvisionMultiPVCs
2024-10-04 15:02:50 +01:00
Kubernetes Prow Robot
7dd03c1ee5 Merge pull request #127353 from Gekko0114/integration_test_volumezone
Add integration test for VolumeZone in requeueing scenarios
2024-10-03 05:48:26 +01:00
Maciej Skoczeń
2a08ce5c68 Add scheduler_perf test case for AssignedPodDelete event handling 2024-10-02 09:16:28 +00:00
moriya
3e57d5cf67 fix 2024-10-02 06:54:32 +09:00
Kubernetes Prow Robot
ae617c3d20 Merge pull request #127781 from macsko/use_barrier_not_sleep_where_possible_in_scheduler_perf_test_cases
Use barrier instead of sleep when possible in scheduler_perf test cases
2024-10-01 22:06:10 +01:00
Maciej Skoczeń
bae0eb91d4 Use barrier instead of sleep when possible in scheduler_perf test cases 2024-10-01 13:53:04 +00:00
Maciej Skoczeń
5e2552c2b0 Allow to filter pods using labels on barrier in scheduler_perf 2024-10-01 08:48:37 +00:00
Kubernetes Prow Robot
22a30e7cbb Merge pull request #127700 from macsko/add_option_waitforpodsprocessed
Add option to wait for pods to be attempted in barrierOp in scheduler_perf
2024-10-01 05:17:49 +01:00
Kubernetes Prow Robot
5e65529ca9 Merge pull request #127759 from macsko/allow_to_filter_pods_using_labels_while_collecting_metrics_scheduler_perf
Allow to filter pods using labels while collecting metrics in scheduler_perf
2024-09-30 20:37:35 +01:00
Maciej Skoczeń
fdbf21e03a Allow to filter pods using labels while collecting metrics in scheduler_perf 2024-09-30 13:32:12 +00:00
Lionel Jouin
0bb0e8feaf Fix TestEnableDisableServiceCIDR
The wrong clientset was used to create services and an incorrect amount
of services was created.

Signed-off-by: Lionel Jouin <lionel.jouin@est.tech>
2024-09-30 13:15:00 +02:00
Lionel Jouin
8dafdb2cdd Fix ServiceCIDR integration test enable/disable
feature_enable_disable.go was missing the suffix _test.go to be
considered as a test. Without it, TestEnableDisableServiceCIDR was not
executed.

Signed-off-by: Lionel Jouin <lionel.jouin@est.tech>
2024-09-30 12:00:25 +02:00
Maciej Skoczeń
928670061d Allow to wait for pods to be attempted in barrierOp in scheduler_perf 2024-09-30 08:07:15 +00:00
Kubernetes Prow Robot
80941e3e87 Merge pull request #127643 from Jefftree/set-emulation-integration-test
Allow emulation version to be set in integration test
2024-09-27 21:56:01 +01:00
dom4ha
9bf6ee976b Assert that there are no pods in the active queue while waiting for all pods to get scheduled, instead of asserting it afterwards. 2024-09-27 15:06:04 +00:00
dom4ha
54b0ed45b7 Add one more check to the test case precondition assessment. 2024-09-27 15:06:04 +00:00
dom4ha
151ac846a2 Increase the readability of the test preconditions and double check that all test pods are really unschedulable. 2024-09-27 15:06:04 +00:00
Kubernetes Prow Robot
960e3984b0 Merge pull request #127444 from dom4ha/fine-grained-qhints
Fine grain QueueHints for NodeAffinity plugin
2024-09-27 01:42:00 +01:00
Kubernetes Prow Robot
5ebd0da6cc Merge pull request #127662 from macsko/make_scheduler_perf_sleepop_duration_parametrizable
Make sleepOp duration parametrizable in scheduler_perf
2024-09-26 20:10:01 +01:00
Kubernetes Prow Robot
421436a94c Merge pull request #127473 from dom4ha/fine-grain-qhints-fit
feature(scheduler): more fine-grained Node QHint for NodeResourceFit plugin
2024-09-26 18:34:02 +01:00
Maciej Skoczeń
837d917d91 Make sleepOp duration parametrizable in scheduler_perf 2024-09-26 13:07:22 +00:00
dom4ha
c7db4bb450 Fine grain QueueHints for nodeaffinity plugin.
Skip queue on unrelated change that keeps pod schedulable when QueueHints are enabled.

Split add from QHints disabled case

Remove case when QHints are disabled

Remove two QHint alternatives in unit tests

more fine-grained Node QHint for NodeResourceFit plugin

Return early when updated Node causes unmatch

Revert "more fine-grained Node QHint for NodeResourceFit plugin"

This reverts commit dfbceb60e0c1c4e47748c12722d9ed6dba1a8366.

Add integration test for requeue of a pod previously rejected by NodeAffinity plugin when a suitable Node is added

Add integration test for a Node update operation that does not trigger requeue in NodeAffinity plugin

Remove inaccurate comment

Apply review comments
2024-09-26 10:21:08 +00:00
dom4ha
903b1f7e28 more fine-grained Node QHint for NodeResourceFit plugin 2024-09-26 09:51:36 +00:00
Jefftree
dacc2e1f5d Allow emulation version to be set in integration test 2024-09-25 22:01:15 -04:00
Maciej Skoczeń
40154baab0 Add updateAnyOp to scheduler_perf 2024-09-25 12:42:25 +00:00
Kubernetes Prow Robot
5fc4e71a30 Merge pull request #127499 from pohly/scheduler-perf-updates
scheduler_perf: updates to enhance performance testing of DRA
2024-09-25 13:32:00 +01:00
Kubernetes Prow Robot
75214d11d5 Merge pull request #127428 from googs1025/scheduler/plugin
chore(scheduler): refactor import package ordering in scheduler
2024-09-25 11:40:07 +01:00
Patrick Ohly
d100768d94 scheduler_perf: track and visualize progress over time
This is useful to see whether pod scheduling happens in bursts and how it
behaves over time, which is relevant in particular for dynamic resource
allocation where it may become harder at the end to find the node which still
has resources available.

Besides "pods scheduled" it's also useful to know how many attempts were
needed, so schedule_attempts_total also gets sampled and stored.

To visualize the result of one or more test runs, use:

     gnuplot.sh *.dat
2024-09-25 11:09:15 +02:00
Patrick Ohly
ded96042f7 scheduler_perf + DRA: load up cluster by allocating claims
Having to schedule 4999 pods to simulate a "full" cluster is slow. Creating
claims and then allocating them more or less like the scheduler would when
scheduling pods is much faster and in practice has the same effect on the
dynamicresources plugin because it looks at claims, not pods.

This allows defining the "steady state" workloads with higher number of
devices ("claimsPerNode") again. This was prohibitively slow before.
2024-09-25 09:45:39 +02:00
Patrick Ohly
385599f0a8 scheduler_perf + DRA: measure pod scheduling at a steady state
The previous tests were based on scheduling pods until the cluster was
full. This is a valid scenario, but not necessarily realistic.

More realistic is how quickly the scheduler can schedule new pods when some
old pods finished running, in particular in a cluster that is properly
utilized (= almost full). To test this, pods must get created, scheduled, and
then immediately deleted. This can run for a certain period of time.

Scenarios with empty and full cluster have different scheduling rates. This was
previously visible for DRA because the 50% percentile of the scheduling
throughput was lower than the average, but one had to guess in which scenario
the throughput was lower. Now this can be measured for DRA with the new
SteadyStateClusterResourceClaimTemplateStructured test.

The metrics collector must watch pod events to figure out how many pods got
scheduled. Polling misses pods that already got deleted again. There seems to
be no relevant difference in the collected
metrics (SchedulingWithResourceClaimTemplateStructured/2000pods_200nodes, 6 repetitions):

     │            before            │                     after                     │
     │ SchedulingThroughput/Average │ SchedulingThroughput/Average  vs base         │
                         157.1 ± 0%                     157.1 ± 0%  ~ (p=0.329 n=6)

     │           before            │                    after                     │
     │ SchedulingThroughput/Perc50 │ SchedulingThroughput/Perc50  vs base         │
                        48.99 ± 8%                    47.52 ± 9%  ~ (p=0.937 n=6)

     │           before            │                    after                     │
     │ SchedulingThroughput/Perc90 │ SchedulingThroughput/Perc90  vs base         │
                       463.9 ± 16%                   460.1 ± 13%  ~ (p=0.818 n=6)

     │           before            │                    after                     │
     │ SchedulingThroughput/Perc95 │ SchedulingThroughput/Perc95  vs base         │
                       463.9 ± 16%                   460.1 ± 13%  ~ (p=0.818 n=6)

     │           before            │                    after                     │
     │ SchedulingThroughput/Perc99 │ SchedulingThroughput/Perc99  vs base         │
                       463.9 ± 16%                   460.1 ± 13%  ~ (p=0.818 n=6)
2024-09-25 09:45:39 +02:00
Patrick Ohly
51cafb0053 scheduler_perf: more useful errors for configuration mistakes
Before, the first error was reported, which typically was the "invalid op code"
error from the createAny operation:

    scheduler_perf.go:900: parsing test cases error: error unmarshaling JSON: while decoding JSON: cannot unmarshal {"collectMetrics":true,"count":10,"duration":"30s","namespace":"test","opcode":"createPods","podTemplatePath":"config/dra/pod-with-claim-template.yaml","steadyState":true} into any known op type: invalid opcode "createPods"; expected "createAny"

Now the opcode is determined first, then decoding into exactly the matching operation is
tried and validated. Unknown fields are an error.

In the case above, decoding a string into time.Duration failed:

    scheduler_test.go:29: parsing test cases error: error unmarshaling JSON: while decoding JSON: decoding {"collectMetrics":true,"count":10,"duration":"30s","namespace":"test","opcode":"createPods","podTemplatePath":"config/dra/pod-with-claim-template.yaml","steadyState":true} into *benchmark.createPodsOp: json: cannot unmarshal string into Go struct field createPodsOp.Duration of type time.Duration

Some typos:

    scheduler_test.go:29: parsing test cases error: error unmarshaling JSON: while decoding JSON: unknown opcode "sleeep" in {"duration":"5s","opcode":"sleeep"}

    scheduler_test.go:29: parsing test cases error: error unmarshaling JSON: while decoding JSON: decoding {"countParram":"$deletingPods","deletePodsPerSecond":50,"opcode":"createPods"} into *benchmark.createPodsOp: json: unknown field "countParram"
2024-09-25 09:45:39 +02:00
Kubernetes Prow Robot
5dd244ff00 Merge pull request #125796 from haorenfsa/fix-gc-sync-blocked
garbagecollector: controller should not be blocking on failed cache sync
2024-09-25 04:02:00 +01:00
Kubernetes Prow Robot
e9cde03b91 Merge pull request #127598 from aojea/servicecidr_seconday_dualwrite
bugfix: initialize secondary range registry with the right value
2024-09-24 21:08:08 +01:00
Antonio Ojea
7a9bca3888 bugfix: initialize secondary range registry with the right value
When the MultiCIDRServiceAllocator feature is enabled, we added an
additional feature gate, DisableAllocatorDualWrite, that allows enabling
a mirror behavior on the old allocator to deal with problems during
cluster upgrades.

During the implementation the secondary range of the legacy allocator
was initialized with the value of the primary range. Hence, when a
Service tried to allocate a new IP on the secondary range, it succeeded
in the new IP allocator but failed when it tried to allocate the same IP
on the legacy allocator, since it has a different range.

Expand the integration test that runs over all the combinations of
Service ClusterIP possibilities to run with all the possible
combinations of the feature gates.

The integration test needs to change the way it starts the apiserver,
otherwise it will time out.
2024-09-24 17:48:13 +00:00
Patrick Ohly
7bbb3465e5 scheduler_perf: more realistic structured parameters tests
Real devices are likely to have a handful of attributes and (for GPUs) the
memory as capacity. Most keys will be driver specific, a few may eventually
have a domain (none standardized right now).
2024-09-24 18:52:45 +02:00
Kubernetes Prow Robot
6ded721910 Merge pull request #127496 from macsko/add_metricscollectionop_to_scheduler_perf
Add separate ops for collecting metrics from multiple namespaces in scheduler_perf
2024-09-24 14:34:00 +01:00
Maciej Skoczeń
a273e5381a Add separate ops for collecting metrics from multiple namespaces in scheduler_perf 2024-09-24 12:28:53 +00:00
Kubernetes Prow Robot
94df29b8f2 Merge pull request #127464 from sanposhiho/trigger-nodedelete
fix(eventhandler): trigger Node/Delete event
2024-09-24 02:24:00 +01:00
moriya
cd0e0fc881 add_test 2024-09-23 21:49:09 +09:00
moriya
090145aadf add_non_queued_pod 2024-09-23 21:24:09 +09:00
Kubernetes Prow Robot
15d08bf7c8 Merge pull request #127323 from vrutkovs/tracing-cacher-get
tracing: add span for get cacher
2024-09-23 10:27:59 +01:00
Kensei Nakada
421f87a4e3 feat: add a requeueing integration test for PodTopologySpread with Node/delete event (QHint: disabled) 2024-09-23 00:29:56 +09:00
Kensei Nakada
bf8f7a3ad7 feat: add a requeueing integration test for PodTopologySpread with Node/delete event 2024-09-22 17:34:37 +09:00
Kubernetes Prow Robot
61dbc03563 Merge pull request #127471 from macsko/add_deletepodsop_to_scheduler_perf
Add deletePodsOp to scheduler_perf
2024-09-22 07:00:04 +01:00
Vadim Rutkovsky
dff0075e7c tracing: add span for cacher.Get
Also updates tracing integration tests for cacher.GetList
2024-09-21 09:53:43 +02:00