As part of PR 132028 we added more e2e test coverage to validate
the fix and to check, as much as possible, that there are no regressions.
The issue and the fix become most evident when inspecting
memory allocation with the Memory Manager static policy enabled.
Quoting the commit message of bc56d0e45a
```
The podresources API List implementation uses the internal data of the
resource managers as source of truth.
Looking at the implementation here:
https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/pkg/kubelet/apis/podresources/server_v1.go#L60
we take care of syncing the device allocation data before querying the
device manager to return its pod->devices assignment.
This is needed because otherwise the device manager (and all the other
resource managers) would do the cleanup asynchronously, so the `List` call
will return incorrect data.
But we don't do this syncing for either CPUs or memory,
so when we report these we will get stale data, as issue #132020 demonstrates.
For the CPU manager, however, we have the reconcile loop which cleans up the stale data periodically.
It turns out this timing interplay was actually the reason the existing issue #119423 seemed fixed
(see: #119423 (comment)). But it's really just timing. If in the reproducer we set the `cpuManagerReconcilePeriod` to a
very high value (>= 5 minutes), then the issue still reproduces against the current master branch
(https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/test/e2e_node/podresources_test.go#L983).
```
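For concreteness, a minimal sketch of the reproducer tweak mentioned in the quote, assuming the usual e2e_node pattern of mutating the kubelet configuration before the test runs; only the `CPUManagerReconcilePeriod` and `CPUManagerPolicy` fields come from the kubelet configuration API, the helper shape is illustrative.
```go
import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	kubeletconfig "k8s.io/kubernetes/pkg/kubelet/apis/config"
)

// Sketch (not the committed test code): stretch the CPU manager reconcile
// period so its periodic cleanup cannot mask stale podresources data while
// the reproducer queries the List endpoint.
func stretchCPUManagerReconcile(cfg *kubeletconfig.KubeletConfiguration) {
	cfg.CPUManagerPolicy = "static"
	// >= 5 minutes: far longer than any single test, so no reconcile runs
	// in the window where stale data would otherwise be cleaned up.
	cfg.CPUManagerReconcilePeriod = metav1.Duration{Duration: 5 * time.Minute}
}
```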
The missing actor here is the memory manager. The memory manager has no
reconcile loop (which would implicitly fix the stale data problem) and no explicit
synchronization, so it is the unlucky one that reported stale data,
leading to the eventual understanding of the problem.
For this reason it was (and still is) important to exercise it during
the test.
It turns out, however, that the test is wrong, likely because of a hidden dependency
between the test expectations and the lane configuration (notably
the machine specs), so we disable the memory manager activation for the time
being, until we figure out a safe way to enable it.
Note this significantly weakens the signal for this specific test.
Signed-off-by: Francesco Romani <fromani@redhat.com>
fix the utilities to enable multi-app-container tests,
which were previously quite hard to implement.
Add a consumer of the new utility to demonstrate the usage
and to initiate the basic coverage.
Signed-off-by: Francesco Romani <fromani@redhat.com>
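A sketch of the kind of pod such a utility produces, built from plain core/v1 types; `makeMultiAppContainerPod` and the image are illustrative names, not the helper actually added in this commit.
```go
import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// makeMultiAppContainerPod is a hypothetical helper: one pod, N app
// containers, each one individually addressable by the test assertions.
func makeMultiAppContainerPod(podName string, containerNames ...string) *v1.Pod {
	containers := make([]v1.Container, 0, len(containerNames))
	for _, name := range containerNames {
		containers = append(containers, v1.Container{
			Name:    name,
			Image:   "registry.k8s.io/pause:3.10", // any long-running image works
			Command: []string{"/pause"},
		})
	}
	return &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: podName},
		Spec: v1.PodSpec{
			RestartPolicy: v1.RestartPolicyNever,
			Containers:    containers,
		},
	}
}
```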
add an e2e test to ensure that if the Get endpoint is asked
about a non-existing pod, it returns an error.
Likewise, add an e2e test for terminated pods, which should
not be returned because they neither consume nor hold resources,
much like `List` does.
The expected usage pattern is to iterate over the list of
pods returned by `List`, but nevertheless the endpoint must
handle this case.
Signed-off-by: Francesco Romani <fromani@redhat.com>
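A rough sketch of the negative-path check described above, assuming a connected `podresourcesapi.PodResourcesListerClient`; the pod name and assertion style are illustrative, not the committed test code.
```go
import (
	"context"

	"github.com/onsi/gomega"
	podresourcesapi "k8s.io/kubelet/pkg/apis/podresources/v1"
)

// expectGetFailsForMissingPod is a sketch: asking Get about a pod that was
// never created must surface an error rather than empty or stale data.
func expectGetFailsForMissingPod(ctx context.Context, cli podresourcesapi.PodResourcesListerClient, namespace string) {
	resp, err := cli.Get(ctx, &podresourcesapi.GetPodResourcesRequest{
		PodName:      "no-such-pod", // intentionally never created by the suite
		PodNamespace: namespace,
	})
	gomega.Expect(err).To(gomega.HaveOccurred(), "Get must fail for a non-existing pod")
	gomega.Expect(resp.GetPodResources()).To(gomega.BeNil())
}
```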
We want to fix and enhance the lanes which exercise
the podresources API tests. The first step is to clarify
the label and make it specific to the podresources API,
minimizing the clash and the ambiguity with the "PodLevelResources"
feature.
Note we change the label name, but the new label is backward
compatible (filtering for "Feature:PodResources" will still
match the tests). This turns out not to be a problem because
these tests are no longer called out explicitly in the lane
definitions; we want to change this ASAP.
The new name is more specific and allows us to clearly
call out tests for this feature in the lane definitions.
Signed-off-by: Francesco Romani <fromani@redhat.com>
add more e2e tests to cover the interaction with
core resource managers (cpu, memory) and to ensure
proper reporting.
Signed-off-by: Francesco Romani <fromani@redhat.com>
Add metrics about the sizing of the cpu pools.
Currently the cpumanager maintains 2 cpu pools:
- shared pool: this is where all pods with non-exclusive
cpu allocation run
- exclusive pool: this is the union of the sets of exclusive
cpus allocated to containers, if any (requires the static policy to be in use).
By reporting the size of the pools, users (humans or machines)
can get better insight and more feedback about how resources are
actually allocated to the workload and how the node resources are used.
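A minimal sketch of what such gauges could look like with the component-base metrics machinery the kubelet already uses; the metric names, units, and help strings below are illustrative, not necessarily the ones merged.
```go
import (
	"k8s.io/component-base/metrics"
	"k8s.io/component-base/metrics/legacyregistry"
)

var (
	// Size of the shared pool: where every container without an exclusive
	// CPU allocation runs.
	cpuManagerSharedPoolSize = metrics.NewGauge(&metrics.GaugeOpts{
		Subsystem:      "kubelet",
		Name:           "cpu_manager_shared_pool_size_millicores",
		Help:           "CPU capacity of the shared pool, in millicores.",
		StabilityLevel: metrics.ALPHA,
	})
	// Size of the exclusive pool: the union of all exclusively allocated
	// CPUs (only non-empty with the static policy).
	cpuManagerExclusiveCPUsCount = metrics.NewGauge(&metrics.GaugeOpts{
		Subsystem:      "kubelet",
		Name:           "cpu_manager_exclusive_cpu_allocation_count",
		Help:           "Number of CPUs exclusively allocated to containers.",
		StabilityLevel: metrics.ALPHA,
	})
)

func registerCPUPoolMetrics() {
	legacyregistry.MustRegister(cpuManagerSharedPoolSize)
	legacyregistry.MustRegister(cpuManagerExclusiveCPUsCount)
}
```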
Fixed `The phase of Pod e2e-test-pod is Succeeded which is unexpected`
error. `e2epod.NewPodClient(f).CreateSync` is unable to catch the 'Running'
status of the pod because the pod finishes too fast.
Using the `Create` API should solve the issue as it doesn't query the pod
status.
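The direction of the fix, sketched under the assumption that the test uses the standard e2e pod client; only the two calls shown differ.
```go
// Before (sketch): CreateSync waits for the pod to reach Running, which
// races with a pod that completes almost immediately and is already
// Succeeded by the time the check runs.
//   pod = e2epod.NewPodClient(f).CreateSync(ctx, pod)

// After (sketch): Create only submits the pod and does not assert on the
// phase, so a fast-finishing pod no longer trips the phase check.
pod = e2epod.NewPodClient(f).Create(ctx, pod)
```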
This changes the test registration so that, for tags for which the framework has a
dedicated API (features, feature gates, slow, serial, etc.), those APIs are
used.
Arbitrary, custom tags are still left in place for now.
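A sketch of what the registration looks like after the change, assuming the suite-local `SIGDescribe` wrapper accepts the framework tag arguments; the "PodResources" feature name is a placeholder for whatever label the suite actually uses.
```go
// Before (sketch): everything encoded in the test text.
//   var _ = SIGDescribe("POD Resources [Serial] [Feature:PodResources]", func() { ... })

// After (sketch): tags with a dedicated framework API go through that API;
// only free-form custom tags remain in the text for now.
var _ = SIGDescribe("POD Resources",
	framework.WithSerial(),
	framework.WithFeature(framework.ValidFeatures.Add("PodResources")),
	func() {
		// test bodies unchanged; only the registration changes
	})
```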
The rate limiter.go file is a generic file implementing the
gRPC Limiter interface. This file can be reused by other gRPC
APIs, not only by podresources.
Change-Id: I905a46b5b605fbb175eb9ad6c15019ffdc7f2563
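A sketch of what such a reusable limiter can look like, kept independent of podresources: a tiny `Limit() bool` interface backed by golang.org/x/time/rate. The type and constructor names are illustrative, not the ones in the tree.
```go
package ratelimit

import "golang.org/x/time/rate"

// Limiter is the one decision a gRPC interceptor needs from us:
// report true when the incoming call should be rejected.
type Limiter interface {
	Limit() bool
}

type tokenBucket struct {
	limiter *rate.Limiter
}

// NewTokenBucket allows roughly qps calls per second with the given burst.
func NewTokenBucket(qps float64, burst int) Limiter {
	return &tokenBucket{limiter: rate.NewLimiter(rate.Limit(qps), burst)}
}

func (tb *tokenBucket) Limit() bool {
	// Allow consumes a token when one is available; an empty bucket
	// means "limit this call".
	return !tb.limiter.Allow()
}
```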
Add more facilities to the *internal* podresources client, and use them.
Checking e2e test runs, we see quite a few errors like
```
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /var/lib/kubelet/pod-resources/kubelet.sock: connect: connection refused": rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /var/lib/kubelet/pod-resources/kubelet.sock: connect: connection refused"
```
This is likely caused by kubelet restarts, which we do plenty of in e2e tests,
combined with the fact that gRPC connects lazily AND we don't really
check the errors in client code - we just bubble them up.
While it's arguably bad that we don't properly check error codes, it's also
true that in the main use case, e2e tests, these functions should just never
fail besides a few well-known cases; we're connecting over a
super-reliable unix domain socket after all.
So we centralize the fix by adding a function (alongside minor
cleanups) which triggers the connection and ensures it actually happens,
localizing the changes just here. The main advantage of this approach
is that it is opt-in, composable, and doesn't leak gRPC details into the client
code.
Signed-off-by: Francesco Romani <fromani@redhat.com>
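A sketch of the opt-in "ensure the connection is actually up" helper described above, using only standard gRPC connectivity primitives; the function name and error wording are assumptions.
```go
import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
)

// ensureConnected kicks gRPC out of its lazy-dial behavior and blocks until
// the podresources connection is Ready (or the timeout expires), so later
// calls don't surface spurious "connection refused" errors right after a
// kubelet restart.
func ensureConnected(ctx context.Context, conn *grpc.ClientConn, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	conn.Connect() // trigger the dial now instead of on the first RPC

	for {
		state := conn.GetState()
		if state == connectivity.Ready {
			return nil
		}
		// Wait for the next state transition; false means the context expired.
		if !conn.WaitForStateChange(ctx, state) {
			return fmt.Errorf("connection not ready, last state %v: %w", state, ctx.Err())
		}
	}
}
```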
The PodResources API reports, in the List call, resources of pods in terminal phase.
The internal managers reassign resources assigned to pods in terminal phase, so podresources should ignore them.
Whether this behavior is intended or not (the docs are not unequivocal),
this e2e test demonstrates and verifies the behavior described above.
Signed-off-by: Talor Itzhak <titzhak@redhat.com>
We have an e2e test which wants to hit a rate limit error. To do so, we
send an abnormally high number of calls in a tight loop.
The relevant test per se is reportedly fine, but we need to play nicer
with *other* tests which may run just after and which need to query the API.
If the test suite runs "too fast", it's possible that an innocent test falls in the
same rate limit window which was saturated by the rate limit test,
so the innocent test can still be throttled because the throttling period
is not exhausted yet, yielding false negatives and leading to flakes.
We can't reset the period for the rate limiter, so we just wait "long enough" to
make sure we absorb the burst and other legit queries are not rejected.
Signed-off-by: Francesco Romani <fromani@redhat.com>
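The "long enough" here is simple token-bucket arithmetic: a bucket of burst B refilled at Q tokens per second is full again roughly B/Q seconds after being drained. A sketch with placeholder numbers, not the kubelet's hardcoded values:
```go
import "time"

const (
	podresourcesQPS   = 100.0 // refill rate, tokens per second (placeholder)
	podresourcesBurst = 10    // bucket size (placeholder)
)

// cooldownAfterRateLimitTest idles long enough for a fully drained bucket to
// refill, plus a margin, so the next test starts with fresh quota.
func cooldownAfterRateLimitTest() {
	refill := time.Duration(float64(podresourcesBurst) / podresourcesQPS * float64(time.Second))
	time.Sleep(refill + time.Second)
}
```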
Implement DoS prevention by wiring a global rate limit into the podresources
API. The goal here is not to introduce a general rate-limiting solution
for the kubelet (we need more research and discussion to get there),
but rather to prevent misuse of the API.
Known limitations:
- the rate limit values (QPS, BurstTokens) are hardcoded to
"high enough" values.
Enabling user configuration would require more discussion
and sweeping changes to the other kubelet endpoints, so it
is postponed for now.
- the rate limiting is global. Malicious clients can starve other
clients by consuming the QPS quota.
Add an e2e test to exercise the flow, because the wiring itself
is mostly boilerplate and API adaptation.
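A hand-rolled sketch of that wiring, kept self-contained rather than reproducing the actual middleware used: one shared token bucket guards every unary call and a drained bucket becomes `codes.ResourceExhausted`. The QPS/burst values and names are illustrative.
```go
import (
	"context"

	"golang.org/x/time/rate"
	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// unaryRateLimitInterceptor returns an interceptor enforcing one global
// limit across all podresources calls, matching the "global, hardcoded"
// shape described above.
func unaryRateLimitInterceptor(qps float64, burst int) grpc.UnaryServerInterceptor {
	limiter := rate.NewLimiter(rate.Limit(qps), burst)
	return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
		if !limiter.Allow() {
			return nil, status.Errorf(codes.ResourceExhausted,
				"rate limit exceeded for %s", info.FullMethod)
		}
		return handler(ctx, req)
	}
}

// Usage sketch: plug it in when building the server.
//   srv := grpc.NewServer(grpc.ChainUnaryInterceptor(unaryRateLimitInterceptor(100, 10)))
```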
We have quite a few podresources e2e tests and, as the feature
progresses to GA, we should consider moving them to NodeConformance.
Unfortunately most of them require linux-specific features, not in the
tests themselves but in the test prelude (fixture), to check or create the
node conditions (e.g. presence or absence of devices, online CPUs...) to be
verified in the test proper.
For this reason we promote only a single test for starters.
Signed-off-by: Francesco Romani <fromani@redhat.com>
Fix the waiting logic in the e2e test loop to wait
for resources to be reported again instead of reasoning about the
timestamp. The idea is that waiting for resource availability
is the canonical way clients should observe the desired state,
and it should also be more robust than comparing timestamps,
especially in CI environments.
Signed-off-by: Francesco Romani <fromani@redhat.com>
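A sketch of the waiting style this change moves toward, assuming a connected `podresourcesapi.PodResourcesListerClient` and gomega polling; the pod name, timeouts, and matcher details are illustrative, not the committed code.
```go
import (
	"context"
	"fmt"
	"time"

	"github.com/onsi/gomega"
	podresourcesapi "k8s.io/kubelet/pkg/apis/podresources/v1"
)

// waitForPodResourcesReported polls List until the pod shows up again,
// instead of reasoning about timestamps.
func waitForPodResourcesReported(ctx context.Context, cli podresourcesapi.PodResourcesListerClient, namespace, podName string) {
	gomega.Eventually(ctx, func(ctx context.Context) error {
		resp, err := cli.List(ctx, &podresourcesapi.ListPodResourcesRequest{})
		if err != nil {
			return err
		}
		for _, podRes := range resp.GetPodResources() {
			if podRes.GetNamespace() == namespace && podRes.GetName() == podName {
				return nil // resources are being reported again
			}
		}
		return fmt.Errorf("pod %s/%s not yet reported by List", namespace, podName)
	}).WithTimeout(2*time.Minute).WithPolling(2*time.Second).Should(gomega.Succeed())
}
```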
Start to consolidate the sample device plugin utility
and constants in a central place, because we need
to use it in different e2e tests.
Having a central dependency is better than a maze of
entangled e2e tests depending on each other's helpers.
Signed-off-by: Francesco Romani <fromani@redhat.com>