Commit Graph

3027 Commits

Author SHA1 Message Date
Kubernetes Prow Robot
491a23f079 Merge pull request #129999 from pohly/test-e2e-node-timeout
E2E node: fix --timeout default
2025-02-06 03:59:55 -08:00
Patrick Ohly
46a17f60e4 E2E node: fix --timeout default
For unknown reasons, hack/make-rules/test-e2e-node.sh adds -timeout instead of
--timeout. Therefore the fallback code in test/e2e_node/remote/remote.go didn't
find it and added its own --timeout=60m after it. This effectively limits E2E
node test runs to 60 minutes, regardless of what is specified in the job:

    W0206 09:53:51.425532    7151 remote.go:158] ginkgo flags are missing explicit --timeout (ginkgo defaults to 60 minutes)
    I0206 09:53:51.425565    7151 remote.go:165] updated ginkgo flags: -timeout=24h --label-filter="Feature: containsAny DynamicResourceAllocation && Feature: isSubsetOf { Beta, DynamicResourceAllocation } && !Flaky && !Slow"  --no-color -v --timeout=60m
    ...
    I0206 09:53:57.767096    7151 ssh.go:146] Running the command ssh, with args: ... timeout -k 30s 3600.000000s ./ginkgo -timeout=24h --label-filter="Feature: containsAny DynamicResourceAllocation && Feature: isSubsetOf { Beta, DynamicResourceAllocation } && !Flaky && !Slow"  --no-color -v --timeout=60m ...

Note that the timeout for the test was 60m in this case (hence the "timeout -k
30s 3600.000000s") but it could also be something larger.
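
The failure mode described above can be illustrated with a minimal Go sketch. It is not the code in test/e2e_node/remote/remote.go and not necessarily how this commit fixes the problem; `hasTimeoutFlag` and `withDefaultTimeout` are hypothetical helpers that only show how a fallback could accept both the `-timeout` and the `--timeout` spelling before appending its own default.

```go
package main

import (
	"fmt"
	"strings"
)

// hasTimeoutFlag reports whether the ginkgo flags already set a timeout,
// accepting both the single-dash and the double-dash spelling.
// Hypothetical helper for illustration, not the upstream implementation.
func hasTimeoutFlag(ginkgoFlags string) bool {
	// Splitting on whitespace also splits quoted values such as a
	// --label-filter expression, but none of those pieces can be
	// mistaken for a timeout flag name.
	for _, field := range strings.Fields(ginkgoFlags) {
		name, _, _ := strings.Cut(field, "=")
		if name == "-timeout" || name == "--timeout" {
			return true
		}
	}
	return false
}

// withDefaultTimeout appends --timeout=60m only when no timeout is set.
func withDefaultTimeout(ginkgoFlags string) string {
	if hasTimeoutFlag(ginkgoFlags) {
		return ginkgoFlags
	}
	return strings.TrimSpace(ginkgoFlags + " --timeout=60m")
}

func main() {
	// The single-dash form is recognized, so nothing is appended.
	fmt.Println(withDefaultTimeout("-timeout=24h --no-color -v"))
	// No timeout at all, so the 60m default is appended.
	fmt.Println(withDefaultTimeout("--no-color -v"))
}
```

A check that only looks for the literal `--timeout` misses `-timeout=24h`, which is how the extra `--timeout=60m` ended up in the logged command line.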
2025-02-06 11:45:12 +01:00
Kubernetes Prow Robot
c4434c3161 Merge pull request #129910 from bitoku/fix-129836
Fix flaky test for container life cycle
2025-02-04 16:23:09 -08:00
Kubernetes Prow Robot
f82439f536 Merge pull request #129486 from iholder101/bugfix/swap-container-cri-stats
[KEP-2400] [Bugfix]: Ensure container-level swap metrics are collected
2025-02-04 08:14:59 -08:00
Kubernetes Prow Robot
a376ae5dad Merge pull request #128845 from SergeyKanzhelev/staticPodUpgrade
static pod upgrade test with hostNetwork
2025-02-03 23:30:58 -08:00
Vinayak Goyal
81f09811ca Fix kubelet_authz_test.go 2025-01-31 15:38:18 +00:00
Ayato Tokubi
da5a76bd39 Fix flaky test for container life cycle
Signed-off-by: Ayato Tokubi <atokubi@redhat.com>
2025-01-30 16:23:51 +00:00
Vinayak Goyal
ce7d2130ad Fix kubelet_authz_test.go 2025-01-29 23:06:56 +00:00
Swati Sehgal
82f0303f89 node: e2e: Remove flaky label as device plugin reboot test is deflaked
With the device plugin node reboot test fixed, we can see in testgrid
[node-kubelet-containerd-flaky](https://testgrid.k8s.io/sig-node-containerd#node-kubelet-containerd-flaky)
that the test is passing consistently and we can remove the flaky label.

With the test not flaky anymore, we can validate new PRs against it
and ensure we don't cause regressions.

Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
2025-01-29 11:12:40 +00:00
Kubernetes Prow Robot
48dce2e9b3 Merge pull request #129776 from saschagrunert/cni-plugins-1.6.2
Update CNI plugins to v1.6.2 and avoid using k8s-artifacts-cni bucket
2025-01-28 07:29:26 -08:00
Kubernetes Prow Robot
2bda5dd8c7 Merge pull request #129656 from vinayakankugoyal/kep2862beta
KEP-2862: Graduate to BETA.
2025-01-27 19:05:23 -08:00
Itamar Holder
617c094435 Add an e2e test
Signed-off-by: Itamar Holder <iholder@redhat.com>
2025-01-27 15:44:18 +02:00
Vinayak Goyal
3a780a1c1b KEP-2862: Graduate to BETA. 2025-01-24 21:36:00 +00:00
Kubernetes Prow Robot
29bf17b6cf Merge pull request #129168 from kannon92/drop-node-features
[KEP-3041] - remove nodefeatures from k/k repo
2025-01-23 12:07:29 -08:00
Kubernetes Prow Robot
4f979c9db8 Merge pull request #129010 from ffromani/e2e-fix-device-plugin-reboot-test
node: e2e: fix device plugin reboot test
2025-01-23 12:07:22 -08:00
Sascha Grunert
da999fbc1b Update CNI plugins to v1.6.2 and avoid using k8s-artifacts-cni bucket
Update the CNI plugins to the latest release and switch over to using
GitHub releases instead of the `k8s-artifacts-cni` bucket.

Follow-up on https://github.com/kubernetes/kubernetes/pull/129095

Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
2025-01-23 10:50:58 +01:00
Kubernetes Prow Robot
a271299643 Merge pull request #129717 from esotsal/fix-128837
testing: Fix pod delete timeout failures after InPlacePodVerticalScaling Graduate to Beta commit
2025-01-21 15:50:47 -08:00
Kubernetes Prow Robot
0d988d7209 Merge pull request #129619 from ffromani/sig-node-approvers-ffromani
Self-nominating ffromani as approver for sig-node container and resource managers
2025-01-21 15:50:36 -08:00
Kubernetes Prow Robot
3d2ee2fbb7 Merge pull request #129609 from carlory/cleanup-exec-utils
Move some exec helper functions from framework/volume to framework/pod
2025-01-21 09:00:37 -08:00
Sotiris Salloumis
c5fc4193bb Fix pod delete issues in podresize tests 2025-01-21 07:25:14 +01:00
Kevin Hannon
bae4122f56 deprecate nodefeature for feature labels 2025-01-20 17:02:59 -05:00
carlory
8b4eae24ab Move some exec helper functions from framework/volume to framework/pod 2025-01-18 21:42:42 +08:00
Kubernetes Prow Robot
2d0a4f7556 Merge pull request #129166 from kannon92/move-node-features-to-features
[KEP-3041]: deprecate nodefeature for feature labels
2025-01-14 20:02:33 -08:00
Francesco Romani
8221e28e4d Add ffromani as approver for kubelet resource managers and their tests
Signed-off-by: Francesco Romani <fromani@redhat.com>
2025-01-14 13:18:40 +01:00
Kevin Hannon
ca4529574e remove node special feature typos 2024-12-20 16:33:45 -05:00
Kubernetes Prow Robot
4c466d8f98 Merge pull request #129095 from borg-land/cni-bucket-change
fetch cni plugins from GitHub releases
2024-12-18 13:40:08 +01:00
Kevin Hannon
8495df64b2 deprecate nodefeature for feature labels 2024-12-17 13:58:12 -05:00
Kevin Hannon
6a608c3cdb drop NodeSpecialFeature and NodeAlphaFeature from e2e-node 2024-12-16 09:29:04 -05:00
Kubernetes Prow Robot
5cc6f6633f Merge pull request #129070 from zhifei92/fix-typo
e2e_node: Simplify the code logic
2024-12-13 12:24:25 +01:00
Kubernetes Prow Robot
e8615e2712 Merge pull request #129054 from pohly/remove-import-name
remove import doc comments
2024-12-12 09:58:35 +01:00
Kubernetes Prow Robot
c0862c3184 Merge pull request #129105 from carlory/sig-scheduling
scheduling e2e tests: add feature-gate label when these tests depend feature-gate
2024-12-12 06:40:25 +00:00
carlory
060c653b53 scheduling e2e tests: add feature-gate label when these tests depend feature-gate 2024-12-06 17:22:43 +08:00
upodroid
dce863e5e6 fetch cni plugins from GitHub releases 2024-12-05 13:31:35 +03:00
Francesco Romani
29d26297a1 e2e: node: fix misleading device plugin test
We have an e2e test which tries to ensure device plugin assignments to pods are kept
across node reboots. This test has been permafailing for many weeks at the
time of writing (xref: #128443).

Problem is: closer inspection reveals the test was well intentioned, but
puzzling:
The test runs a pod, then restarts the kubelet, then _expects the pod to
end up in admission failure_ and yet _ensure the device assignment is
kept_! https://github.com/kubernetes/kubernetes/blob/v1.32.0-rc.0/test/e2e_node/device_plugin_test.go#L97

A reader can legitimately wonder if this means the device will be kept busy forever?

This is not the case, luckily. The test, however, embodied the kubelet's
behavior at the time, which was in turn caused by #103979.

Device manager used to record the last admitted pod and forcibly add it
to the list of active pods. The retention logic had space for exactly one
pod, the last one which attempted admission.

This retention prevented the cleanup code
(see: https://github.com/kubernetes/kubernetes/blob/v1.32.0-rc.0/pkg/kubelet/cm/devicemanager/manager.go#L549
compare to: https://github.com/kubernetes/kubernetes/blob/v1.31.0-rc.0/pkg/kubelet/cm/devicemanager/manager.go#L549)
from clearing the registration, so the device was still (mis)reported
as allocated to the failed pod.

This fact was in turn leveraged by the test in question:
the test uses the podresources API to learn about the device assignment,
and because of the chain of events above the pod failed admission yet
was still reported as owning the device.

What happened, however, was that the next pod trying admission would
replace the previous pod in the device manager data, so the previous
pod was no longer forced into the active list and its assignments
were correctly cleared once the cleanup code ran.
The cleanup code runs, among other things, every time device
manager is asked to allocate devices and every time the podresources API
queries the device assignment.

Later, in PR https://github.com/kubernetes/kubernetes/pull/120661
the forced retention logic was removed from all the resource managers,
thus also from device manager, and this is what caused the permafailure.
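
The retention behavior described above can be modeled with a toy Go sketch. Everything in it is hypothetical: `toyManager`, its fields and its methods are invented for illustration and do not mirror the real device manager code. The `retainLast` switch stands for the forced retention logic, so `true` models the old behavior and `false` models the behavior after the retention logic was removed.

```go
package main

import "fmt"

// toyManager is a hypothetical, heavily simplified stand-in for the
// device manager. It is not the kubelet implementation.
type toyManager struct {
	allocations  map[string]string // pod name -> device ID
	lastAdmitted string            // the single pod kept by the retention logic
	retainLast   bool              // true models the old forced retention
}

func newToyManager(retainLast bool) *toyManager {
	return &toyManager{allocations: map[string]string{}, retainLast: retainLast}
}

// assign records that a pod owns a device.
func (m *toyManager) assign(pod, device string) {
	m.allocations[pod] = device
}

// recordAdmissionAttempt remembers the last pod that tried admission,
// whether or not the admission succeeded.
func (m *toyManager) recordAdmissionAttempt(pod string) {
	m.lastAdmitted = pod
}

// cleanup drops allocations of pods that are not active. With retainLast
// set, the last pod that attempted admission is treated as active too.
func (m *toyManager) cleanup(activePods []string) {
	keep := map[string]bool{}
	for _, p := range activePods {
		keep[p] = true
	}
	if m.retainLast && m.lastAdmitted != "" {
		keep[m.lastAdmitted] = true
	}
	for pod := range m.allocations {
		if !keep[pod] {
			delete(m.allocations, pod)
		}
	}
}

// owns reports whether the pod is still recorded as owning a device.
func (m *toyManager) owns(pod string) bool {
	_, ok := m.allocations[pod]
	return ok
}

func main() {
	old := newToyManager(true)
	old.assign("app-1", "dev-0")
	old.recordAdmissionAttempt("app-1") // admission fails, pod is not active
	old.cleanup(nil)
	fmt.Println("old, after failed admission:", old.owns("app-1"))

	old.recordAdmissionAttempt("app-2") // next pod replaces the retained one
	old.cleanup(nil)
	fmt.Println("old, after next admission attempt:", old.owns("app-1"))

	cur := newToyManager(false)
	cur.assign("app-1", "dev-0")
	cur.recordAdmissionAttempt("app-1")
	cur.cleanup(nil)
	fmt.Println("without retention:", cur.owns("app-1"))
}
```

In this model the failed pod keeps its device only until another pod attempts admission, and without the retention switch the assignment is cleared on the first cleanup.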

Because of all of the above, it should be evident that the e2e test was
actually enforcing a very specific and not really work-as-intended
behavior, which was also overall quite puzzling for users.

The best we can do is to fix the test to record and ensure that
pods which did fail admission _do not_ retain device assignment.

Unfortunately, we _cannot_ guarantee the desirable property that
pods which go running retain their device assignment across node reboots.

In the kubelet restart flow, all pods race to be admitted. There's no
order enforced between device plugin pods and application pods.
An application pod fails admission unless it is lucky enough to _lose_
the race with both the device plugin (which needs to go running before
the app pod does) and _also_ with the kubelet (which needs to set
devices healthy before the pod tries admission).

Signed-off-by: Francesco Romani <fromani@redhat.com>
2024-12-04 17:06:27 +01:00
zhifei92
cb74323e07 refactor: Simplify the code logic. 2024-12-03 20:31:09 +08:00
Patrick Ohly
8a908e0c0b remove import doc comments
The "// import <path>" comment has been superseded by Go modules.
We don't have to remove them, but doing so has some advantages:

- They are used inconsistently, which is confusing.
- We can then also remove the (currently broken) hack/update-vanity-imports.sh.
- Last but not least, it would be a first step towards avoiding the k8s.io domain.

This commit was generated with
   sed -i -e 's;^package \(.*\) // import.*;package \1;' $(git grep -l '^package.*// import' | grep -v 'vendor/')

Everything was included, except for
   package labels // import k8s.io/kubernetes/pkg/util/labels
because that package is marked as "read-only".
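
For readers less familiar with sed, the same rewrite can be sketched in Go. This is an illustration only, not the tool used for the commit; `stripImportComment` and the `example.com` import path are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// importComment mirrors the sed pattern '^package \(.*\) // import.*':
// a package clause followed by a "// import ..." comment on the same line.
var importComment = regexp.MustCompile(`(?m)^package (.*) // import.*$`)

// stripImportComment removes the import comment and keeps the package clause.
// Hypothetical helper for illustration.
func stripImportComment(src string) string {
	return importComment.ReplaceAllString(src, "package ${1}")
}

func main() {
	src := "package example // import \"example.com/pkg/example\"\n\nfunc F() {}\n"
	fmt.Print(stripImportComment(src))
}
```

Files whose package clause has no import comment pass through unchanged, which matches the sed expression only touching lines that match the pattern.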
2024-12-02 16:59:34 +01:00
HirazawaUi
53e9f29d29 Fix kubelet e2e tests incorrect message 2024-12-01 22:45:29 +08:00
Paco Xu
59dfb0e779 skip if cri proxy is disabled/undefined 2024-11-19 11:17:07 +08:00
Sergey Kanzhelev
a9c311b96a static pod upgrade test with hostNetwork 2024-11-19 00:27:01 +00:00
Laura Lorenz
9ab0d81d76 Now that sleep is shorter, only expect to reach 3 within 30s
Focused too much on the container restart one in commit that fixed that

Signed-off-by: Laura Lorenz <lauralorenz@google.com>
2024-11-13 01:39:58 +00:00
Laura Lorenz
59f9858086 Move function specific to container restart test inline
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
2024-11-12 23:59:30 +00:00
Laura Lorenz
529d5ba9d3 Don't overly indirect image name
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
2024-11-12 23:34:57 +00:00
Laura Lorenz
8e7b2af712 Use a better util
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
2024-11-12 23:30:03 +00:00
Laura Lorenz
285d433dea Clearer image pull test and utils
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
2024-11-12 23:30:00 +00:00
Laura Lorenz
e03d0f60ef Orient tests to run faster, but tolerate infra slowdowns up to 5 minutes
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
2024-11-12 21:48:28 +00:00
Laura Lorenz
d293c5088f Fix spelling
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
2024-11-12 21:12:20 +00:00
Laura Lorenz
1da8ca816e Extract restart number properly
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
2024-11-12 20:00:11 +00:00
Laura Lorenz
2732d57e33 Missed refactor of container name here
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
2024-11-12 19:50:11 +00:00
Laura Lorenz
e6059d7386 Fix typecheck and verify
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
2024-11-12 19:48:38 +00:00
Laura Lorenz
f032068ef7 Focus on restart numbers instead of timing
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
2024-11-12 07:12:24 +00:00