2771 Commits

Author SHA1 Message Date
RuquanZhao
babac47c6f fix DevicePluginProbe node-e2e: pod and kubelet restarts
The kubelet restarts working pods with an exponential back-off delay,
capped at 5 minutes. The 1-minute wait may fall entirely within a
back-off period.

Signed-off-by: Ruquan Zhao <ruquan.zhao@arm.com>
2023-10-11 10:15:32 +08:00
Kubernetes Prow Robot
bdcb73d6b3 Merge pull request #120460 from tzneal/deflake-oom-tests-on-containerd
skip the reason check for OOM reason test if it will fail
2023-10-11 01:03:17 +02:00
Patrick Ohly
19ecf93ec3 e2e: define features and node features
The list is based on the -list-tests output.
2023-10-10 18:15:49 +02:00
Patrick Ohly
f2d34426f8 e2e: enhance SIGDescribe
framework.SIGDescribe is better because:
- Ginkgo uses the source code location of the test, not of the wrapper,
  when reporting progress.
- Additional annotations can be passed.

To make this a drop-in replacement, framework.SIGDescribe generates a function
that can be used instead of the former SIGDescribe functions.

windows.SIGDescribe contained some additional code to ensure that tests are
skipped when not running with a suitable node OS. This gets moved into a
separate wrapper generator, to allow using framework.SIGDescribe as intended.
To ensure that all callers were modified, the windows.sigDescribe isn't
exported anymore (wasn't necessary in the first place!).
2023-10-10 18:15:49 +02:00
carlory
d5d7fb595e e2e_node: stop using deprecated framework.ExpectEqual 2023-10-09 16:42:42 +08:00
Katarzyna Lach
122ff5a212 Move grpc rate limiter from podresource folder
The rate limiter file is generic: it implements the
grpc Limiter interface and can be reused by other gRPC
APIs, not only by podresource.

Change-Id: I905a46b5b605fbb175eb9ad6c15019ffdc7f2563
2023-10-09 07:22:23 +00:00
charles-chenzz
ccc6458683 e2e_node: add test case to check that the pod ready-to-start condition is set to false after terminating 2023-10-08 20:40:36 +08:00
Gunju Kim
8b5f30ef09 Don't reuse CPU set of a restartable init container 2023-10-06 22:16:15 +09:00
Kubernetes Prow Robot
f19b62fc09 Merge pull request #120959 from pohly/e2e-test-whitespace-cleanup
e2e: remove redundant spaces in test names
2023-10-05 00:41:59 +02:00
Patrick Ohly
0e8a1f1816 e2e: remove redundant spaces in test names
The spaces are redundant because Ginkgo will add them itself when concatenating
the different test name components. Upcoming change in the framework will
enforce that there are no such redundant spaces.
2023-09-29 08:30:57 +02:00
Davanum Srinivas
d900217664 fix missed branch - targets when building using arm64
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2023-09-27 15:52:37 -04:00
Davanum Srinivas
52f5093d77 Build kubelet with CGO for sig-node e2e tests (not ginkgo)
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2023-09-26 08:32:59 -04:00
Kubernetes Prow Robot
884bc96fec Merge pull request #120773 from swatisehgal/tm-metrics-e2e-deflake
topology-mgr: metrics: Deflake Topology Manager metrics e2e tests
2023-09-20 11:26:26 -07:00
Kubernetes Prow Robot
7fb7e2625b Merge pull request #120401 from shijinye/e2eclean-node-notequal
cleanup:e2e:stop using deprecated framework.ExpectNotEqual
2023-09-20 11:26:19 -07:00
Kubernetes Prow Robot
3191493cea Merge pull request #119402 from Tal-or/e2e_podres_terminal_pods
e2e:podresources: verify count for terminal pods
2023-09-20 11:26:11 -07:00
Swati Sehgal
f5d915b594 topology-mgr: metrics: Deflake Topology Manager metrics e2e tests
On local execution of the Topology Manager metrics tests, the pass rate was 100%.
Yet, we can see that the Topology Manager metrics tests are failing in upstream
CI consistently: https://testgrid.k8s.io/sig-node-presubmits#pr-kubelet-serial-gce-e2e-topology-manager.

From the logs, it was identified that these failures are because of timeouts,
so we are increasing the default timeout as well as polling interval frequency
of obtaining KubeletMetrics to deflake this test.

We have noticed a similar flake in case of CPU manager metrics tests as well:
https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-node-kubelet-serial-cpu-manager/1701615009836044288.
Once it is confirmed that the issue is resolved for the Topology Manager test,
we will fix this for CPU Manager as well in a follow-up PR.

Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
2023-09-20 13:37:27 +01:00
Kubernetes Prow Robot
a68093a3ff Merge pull request #120506 from alexzielenski/import-restrictions
Update e2e import restrictions
2023-09-13 21:56:22 -07:00
Kubernetes Prow Robot
160fe010f3 Merge pull request #120464 from gjkim42/deflake-container-lifecycle-e2e-test
e2e_node: Assign enough time to finish the postStart hook
2023-09-12 17:44:44 -07:00
Kubernetes Prow Robot
04e5914079 Merge pull request #120349 from ruquanzhao/fixTopologyManagerJobs
e2e-node: fix TopologyManager test jobs.
2023-09-12 17:44:37 -07:00
Kubernetes Prow Robot
8aeebda818 Merge pull request #120306 from Rei1010/nodeClean
e2e_node:stop using deprecated framework.ExpectError
2023-09-12 17:44:23 -07:00
Todd Neal
af151eeba2 specifically check that the pod was successful 2023-09-12 13:40:20 -05:00
Gunju Kim
1fb4eee94e Use container log instead of termination log
Since the termination log cannot be accessed until the container is
terminated, use the container log.
2023-09-11 22:55:09 +09:00
Sascha Grunert
5e0931336b kubelet: fix metric container_start_time_seconds's timestamp
Adapting the tests and reverting https://github.com/kubernetes/kubernetes/pull/103429

Carry-over from https://github.com/kubernetes/kubernetes/pull/117881

Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
2023-09-08 09:13:37 +02:00
Alexander Zielenski
7a13b11af0 update e2e import restrictions 2023-09-07 12:20:29 -07:00
Kubernetes Prow Robot
b27670dfbd Merge pull request #118740 from saschagrunert/kubelet-label-types
Make kubelet label types public
2023-09-06 23:46:57 -07:00
Francesco Romani
2ea47038b9 podresources: e2e: force eager connection
Add and use more facilities to the *internal* podresources client.
Checking e2e test runs, we see quite a few errors like
```
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /var/lib/kubelet/pod-resources/kubelet.sock: connect: connection refused": rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /var/lib/kubelet/pod-resources/kubelet.sock: connect: connection refused"
```

This is likely caused by kubelet restarts, which we do plenty of in e2e tests,
combined with the fact that gRPC connects lazily AND we don't really
check the errors in client code - we just bubble them up.

While it's arguably bad that we don't properly check error codes, it's also
true that in the main case, e2e tests, the functions should never
fail except in a few well-known cases; we're connecting over a
super-reliable unix domain socket after all.

So, we centralize the fix by adding a function (along with minor
cleanups) that triggers the connection and ensures it happens,
localizing the changes just here. The main advantage is that this approach
is opt-in, composable, and doesn't leak gRPC details into the client
code.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2023-09-07 08:24:49 +02:00
Todd Neal
94afd6e3a4 skip the reason check for OOM tests if it will fail
This is currently flaking badly due to a race between cgroup deletion
and the runtime detecting the OOM kill.
2023-09-06 12:20:02 -05:00
Gunju Kim
b468e4eb1c e2e_node: Assign enough time to finish the postStart hook
This deflakes the "Containers Lifecycle should not launch second
container before PostStart of the first container completed" test by
assigning enough time to finish the postStart hook.
2023-09-07 00:42:54 +09:00
Kubernetes Prow Robot
56cc5e77a1 Merge pull request #120441 from tzneal/revert-npd-update
Revert "bump npd to v0.8.14"
2023-09-06 06:39:04 -07:00
Kubernetes Prow Robot
debe30de70 Merge pull request #120281 from gjkim42/feature-gate-sidecar-containers-in-kuberuntime
Feature-gate SidecarContainers code in pkg/kubelet/kuberuntime
2023-09-05 18:34:54 -07:00
Todd Neal
355ae44a3c Revert "bump npd to v0.8.14"
This reverts commit 7b44d73f73.
2023-09-05 20:28:53 -05:00
jinye
a774887262 cleanup:e2e:stop using deprecated framework.ExpectNotEqual 2023-09-05 18:16:57 +08:00
RuquanZhao
bfc3c2110f e2e-node: fix TopologyManager test jobs.
Signed-off-by: Ruquan Zhao <ruquan.zhao@arm.com>
2023-09-01 17:53:16 +08:00
wen.rui
3d9b5d0577 e2e_node:stop using deprecated framework.ExpectError 2023-09-01 17:42:36 +08:00
Kubernetes Prow Robot
400059d025 Merge pull request #120194 from bzsuni/bz/bump/npd
bump npd to v0.8.14
2023-08-31 20:52:30 -07:00
Gunju Kim
63177db32c Add an e2e test for the pod sandbox changed scenario
This adds an e2e test to ensure that a pod should restart its containers
in right order after the pod sandbox is changed.
2023-09-01 00:13:47 +09:00
Todd Neal
ede524e1a6 fix a pidpressure test flake
With the new busybox, ash has a built-in sleep command. Prior to this
change we were creating half the pids expected since `sleep` wasn't
actually launching a new binary.  Use the full path to /bin/sleep which
avoids the built-in and actually launches a new process.
2023-08-30 22:44:36 -05:00
bzsuni
7b44d73f73 bump npd to v0.8.14
Signed-off-by: bzsuni <bingzhe.sun@daocloud.io>
2023-08-30 19:03:33 +08:00
Fan Shang Xiang
8d9517318a Extend npd e2e timeout to fix npd e2e error 2023-08-29 17:22:28 +08:00
Kubernetes Prow Robot
232d343d58 Merge pull request #119969 from saschagrunert/cni-plugins
Update CNI plugins to v1.3.0
2023-08-23 12:41:57 -07:00
Dixita Narang
d2dbc583a0 Adding coverage for OOM Kill scenario due to node allocatable memory limits, when pod level memory limits are not set 2023-08-22 00:45:17 +00:00
Davanum Srinivas
3e9a4c15a8 Restrict what imports get into code within test/e2e_node
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2023-08-21 15:04:23 -04:00
Kubernetes Prow Robot
4dee8398ae Merge pull request #120078 from tzneal/investigate-test-failure
expect the new resource_scrape_error metric
2023-08-21 04:13:34 -07:00
Todd Neal
b8512cfe24 expect the new resource_scrape_error metric 2023-08-20 14:17:54 -05:00
Todd Neal
905f07f1ac Revert "mark the OOM killer as serial to reduce flakes"
This reverts commit bd6f548746.

Running as serial didn't completely eliminate the flake, so I think
there's something more going on here. Reverting the change to serial
since it's not a solution.
2023-08-20 13:38:07 -05:00
Todd Neal
bd6f548746 mark the OOM killer as serial to reduce flakes
In testing I could only reproduce the flake by running stress-ng to load
the CPU. Running it as serial should reduce and hopefully eliminate the
flakiness.
2023-08-18 13:18:50 -05:00
Todd Neal
577197559a remove the legacy test dependency
This removes the import which added a bunch of apparently
old failing tests.
2023-08-17 12:54:20 -05:00
Sascha Grunert
7933368460 Update CNI plugins to v1.3.0
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
2023-08-17 09:50:53 +02:00
Kubernetes Prow Robot
4d166947cf Merge pull request #119097 from pacoxu/fix-eviction-pid
PIDPressure condition is triggered slowly on CRI-O with large PID pressure/heavy load
2023-08-16 16:36:19 -07:00
Kubernetes Prow Robot
88d14edc26 Merge pull request #119197 from saschagrunert/stop-container-runtime-err
Check dbus error on container runtime start/stop
2023-08-16 15:27:52 -07:00