89 Commits

Author SHA1 Message Date
Kubernetes Prow Robot
d21da29c9e Merge pull request #133170 from ffromani/e2e-node-podres-memmgr
e2e: podresources: disable memory manager integration
2025-07-24 07:56:48 -07:00
Francesco Romani
449763fb11 e2e: podresources: disable memory manager integration
As part of PR 132028 we added more e2e test coverage to validate
the fix and to check, as much as possible, that there are no regressions.

The issue and the fix become evident mostly when inspecting
memory allocation with the Memory Manager static policy enabled.
Quoting the commit message of bc56d0e45a
```
The podresources API List implementation uses the internal data of the
resource managers as source of truth.
Looking at the implementation here:
https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/pkg/kubelet/apis/podresources/server_v1.go#L60
we take care of syncing the device allocation data before querying the
device manager to return its pod->devices assignment.
This is needed because otherwise the device manager (and all the other
resource managers) would do the cleanup asynchronously, so the `List` call
would return incorrect data.

But we don't do this syncing for either CPUs or memory,
so when we report these we get stale data, as issue #132020 demonstrates.

For the CPU manager, however, we have the reconcile loop which cleans up the stale data periodically.
It turns out this timing interplay was actually the reason the existing issue #119423 seemed fixed
(see: #119423 (comment)).
But it's actually just timing. If in the reproducer we set the `cpuManagerReconcilePeriod` to a
very high value (>= 5 minutes), then the issue still reproduces against the current master branch
(https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/test/e2e_node/podresources_test.go#L983).
```

The missing actor here is the memory manager. The memory manager has no
reconcile loop (which would implicitly fix the stale data problem) and no explicit
synchronization, so it is the unlucky one that reported stale data,
leading to the eventual understanding of the problem.
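
To make the sync gap concrete, here is a minimal sketch of the "sync before List" idea; the interface, type, and method names below are illustrative, not the actual kubelet code.

```
// Illustrative sketch only: not the real kubelet types, just the idea that
// every resource manager must reconcile its assignments before List reads them.
package podresources

type assignmentSyncer interface {
	// UpdateAllocatedResources forces the manager to reconcile its internal
	// pod->resource assignments before we read them.
	UpdateAllocatedResources()
}

type v1Server struct {
	devicesProvider assignmentSyncer
	cpusProvider    assignmentSyncer
	memoryProvider  assignmentSyncer
}

// Without also calling the CPU and memory providers, List would serve whatever
// stale assignments those managers still hold for terminated pods.
func (s *v1Server) syncBeforeList() {
	s.devicesProvider.UpdateAllocatedResources() // already done for devices
	s.cpusProvider.UpdateAllocatedResources()    // was missing for CPUs...
	s.memoryProvider.UpdateAllocatedResources()  // ...and for memory
}
```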

For this reason it was (and still is) important to exercise it during
the test.
It turns out, however, that the test is wrong, likely because of a hidden
dependency between the test expectations and the lane configuration (notably
the machine specs), so we disable the memory manager activation for the time
being, until we figure out a safe way to enable it.

Note this significantly weakens the signal for this specific test.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2025-07-24 12:35:45 +02:00
Francesco Romani
303a7056ff e2e: node: podresources: enable multi-container tests
fix the utilities to enable multi-app-container tests,
which were previously quite hard to implement.

Add a consumer of the new utility to demonstrate the usage
and to initiate the basic coverage.
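
For illustration, this is roughly the shape of a multi-app-container test pod the updated utilities make easier to build; the pod name, image, and resource values are made up for the example and are not the real helper's defaults.

```
package e2enode

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// makeMultiContainerPod is illustrative only; the real helpers live in the
// podresources e2e utilities.
func makeMultiContainerPod() *v1.Pod {
	cnt := func(name string) v1.Container {
		return v1.Container{
			Name:    name,
			Image:   "busybox", // placeholder image
			Command: []string{"sh", "-c", "sleep 1d"},
			Resources: v1.ResourceRequirements{
				Requests: v1.ResourceList{
					v1.ResourceCPU:    resource.MustParse("1"),
					v1.ResourceMemory: resource.MustParse("64Mi"),
				},
				Limits: v1.ResourceList{
					v1.ResourceCPU:    resource.MustParse("1"),
					v1.ResourceMemory: resource.MustParse("64Mi"),
				},
			},
		}
	}
	return &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "podres-multi-cnt"},
		Spec: v1.PodSpec{
			RestartPolicy: v1.RestartPolicyNever,
			// Two app containers with guaranteed-class resources, so each one
			// shows up independently in the podresources report.
			Containers: []v1.Container{cnt("cnt-1"), cnt("cnt-2")},
		},
	}
}
```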

Signed-off-by: Francesco Romani <fromani@redhat.com>
2025-07-22 19:58:29 +02:00
Francesco Romani
38a9a8a59d e2e: node: podresources: add tests for missing pod
add an e2e test to ensure that if the Get endpoint is asked
about a non-existing pod, it returns an error.
Likewise, add an e2e test for terminated pods, which should
not be returned because they neither consume nor hold resources,
much like `List` already does.

The expected usage pattern is to iterate over the list of
pods returned by `List`, but the endpoint must nevertheless
handle this case.
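
A rough sketch of the missing-pod check, assuming an already-connected podresources v1 client; this is not the exact test code.

```
package e2enode

import (
	"context"
	"fmt"

	podresourcesapi "k8s.io/kubelet/pkg/apis/podresources/v1"
)

// expectGetFailsForMissingPod sketches the new check: Get must return an
// error when asked about a pod the kubelet does not know about.
func expectGetFailsForMissingPod(ctx context.Context, cli podresourcesapi.PodResourcesListerClient) error {
	_, err := cli.Get(ctx, &podresourcesapi.GetPodResourcesRequest{
		PodName:      "does-not-exist",
		PodNamespace: "default",
	})
	if err == nil {
		return fmt.Errorf("expected an error for a non-existing pod, got none")
	}
	return nil
}
```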

Signed-off-by: Francesco Romani <fromani@redhat.com>
2025-07-22 19:55:09 +02:00
Kubernetes Prow Robot
9212246d78 Merge pull request #132827 from guptaNswati/e2e-podresourcesGet-featuregate
Add feature gate enable test for KubeletPodResourcesGet
2025-07-18 12:12:25 -07:00
Swati Gupta
14a5ef56a3 fix pipeline failure
Signed-off-by: Swati Gupta <swatig@nvidia.com>
2025-07-17 23:21:26 +00:00
Sascha Grunert
532d48fe6a Convert k8s.io/kubelet/pkg/apis/podresources from gogo to protoc
Use standard protoc for the pod resources instead of gogo.

Part of kubernetes#96564

Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
2025-07-17 14:56:44 +02:00
Swati Gupta
8f4a624a59 Fix pipeline errors
Signed-off-by: Swati Gupta <swatig@nvidia.com>
2025-07-16 22:56:59 +00:00
Swati Gupta
d460611e77 Add more checks
Signed-off-by: Swati Gupta <swatig@nvidia.com>
2025-07-15 21:51:36 +00:00
Francesco Romani
05e1c4b489 e2e: node: fix podresources API feature label
We want to fix and enhance the lanes which exercise
the podresources API tests. The first step is to clarify
the label and make it specific to the podresources API,
minimizing the clash and the ambiguity with the "PodLevelResources"
feature.

Note we change the label name, but the new label is backward
compatible (filtering for "Feature:PodResources" will still
get the tests). This turns out not to be a problem because
these tests are no longer called out explicitly in the lane
definitions. We want to change this ASAP.

The new name is more specific and allows us to clearly
call out tests for this feature in the lane definitions.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2025-07-15 14:15:00 +02:00
Swati Gupta
bb6bd52012 Add feature gate enable test for KubeletPodResourcesGet
Signed-off-by: Swati Gupta <swatig@nvidia.com>
2025-07-08 23:49:34 +00:00
Francesco Romani
8f92a81787 node: e2e: podresources: add more e2e tests
add more e2e tests to cover the interaction with
core resource managers (cpu, memory) and to ensure
proper reporting.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2025-07-08 17:18:34 +02:00
Kubernetes Prow Robot
29bf17b6cf Merge pull request #129168 from kannon92/drop-node-features
[KEP-3041] - remove nodefeatures from k/k repo
2025-01-23 12:07:29 -08:00
Sotiris Salloumis
c5fc4193bb Fix pod delete issues in podresize tests 2025-01-21 07:25:14 +01:00
Kevin Hannon
bae4122f56 deprecate nodefeature for feature labels 2025-01-20 17:02:59 -05:00
Kevin Hannon
8495df64b2 deprecate nodefeature for feature labels 2024-12-17 13:58:12 -05:00
Ed Bartosh
3aa95dafea e2e_node: refactor stopping and restarting kubelet
Moved Kubelet health checks from test cases to the stopKubelet API.
This should make the API cleaner and easier to use.
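
A self-contained sketch of the idea under assumed names (the actual e2e_node helpers differ): the health wait moves into the stop helper, so individual test cases no longer carry their own polling loops.

```
package e2enode

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// stopKubeletAndWait is illustrative: stopService is assumed to stop the
// kubelet unit; the helper then waits for the default healthz endpoint to
// stop answering before returning.
func stopKubeletAndWait(ctx context.Context, stopService func(context.Context) error) error {
	if err := stopService(ctx); err != nil {
		return err
	}
	deadline := time.Now().Add(time.Minute)
	for time.Now().Before(deadline) {
		resp, err := http.Get("http://127.0.0.1:10248/healthz")
		if err != nil {
			return nil // endpoint unreachable: the kubelet is down
		}
		resp.Body.Close()
		time.Sleep(time.Second)
	}
	return fmt.Errorf("kubelet still answering healthz after stop")
}
```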
2024-11-06 11:34:48 +02:00
Francesco Romani
14ec0edd10 node: metrics: add metrics about cpu pool sizes
Add metrics about the sizing of the cpu pools.
Currently the cpumanager maintains 2 cpu pools:
- shared pool: this is where all pods with non-exclusive
  cpu allocation run
- exclusive pool: this is the union of the set of exclusive
  cpus allocated to containers, if any (requires static policy in use).

By reporting the size of the pools, users (humans or machines)
can get better insight and more feedback about how resources are
actually allocated to the workload and how the node resources are used.
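
A minimal sketch of the idea using the plain Prometheus client; the real kubelet uses its own metrics wrappers, and the metric name here is illustrative.

```
package cpumanagermetrics

import "github.com/prometheus/client_golang/prometheus"

// cpuPoolSize tracks the size of each cpumanager pool; illustrative name only.
var cpuPoolSize = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "kubelet_cpu_manager_pool_size_cpus",
		Help: "Number of CPUs in each cpumanager pool.",
	},
	[]string{"pool"}, // "shared" or "exclusive"
)

func init() {
	prometheus.MustRegister(cpuPoolSize)
}

// reportPoolSizes would be called whenever exclusive allocations change.
func reportPoolSizes(shared, exclusive int) {
	cpuPoolSize.WithLabelValues("shared").Set(float64(shared))
	cpuPoolSize.WithLabelValues("exclusive").Set(float64(exclusive))
}
```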
2024-10-24 15:35:51 +02:00
Kubernetes Prow Robot
c6ad6fa951 Merge pull request #125477 from my-git9/namespaceformat
Modify some error words
2024-10-17 17:17:17 +01:00
Sujay
223aedcf6b enhance boolean assertions 2024-07-31 15:58:15 +00:00
xin.li
c3832428b6 Modify some error words
Signed-off-by: xin.li <xin.li@daocloud.io>
2024-06-13 13:13:29 +08:00
Kevin Hannon
43e0bd4304 mark flaky jobs as flaky and move them to a different job 2024-04-08 09:27:15 -04:00
Kubernetes Prow Robot
20d0ab7ae8 Merge pull request #124011 from bart0sh/PR138-e2e_node-fix-podresurces-failure
e2e_node: fix podresources test
2024-03-22 08:16:08 -07:00
Ed Bartosh
6f5240b19c e2e_node: fix podresources test
Fixed the `The phase of Pod e2e-test-pod is Succeeded which is unexpected`
error. `e2epod.NewPodClient(f).CreateSync` is unable to catch the 'Running'
status of the pod because the pod finishes too fast.
Using the `Create` API should solve the issue, as it doesn't query the pod
status.
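
A sketch of the change, assuming the in-tree e2e pod client helpers: CreateSync polls for the Running phase and therefore races with a pod that exits almost immediately, while Create only submits the pod.

```
package e2enode

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/test/e2e/framework"
	e2epod "k8s.io/kubernetes/test/e2e/framework/pod"
)

// createFastFinishingPod sketches the fix described above.
func createFastFinishingPod(ctx context.Context, f *framework.Framework, pod *v1.Pod) *v1.Pod {
	// Before (flaky): the pod can already be Succeeded by the time CreateSync
	// polls for the Running phase.
	//   pod = e2epod.NewPodClient(f).CreateSync(ctx, pod)

	// After: just submit the pod, no phase polling.
	return e2epod.NewPodClient(f).Create(ctx, pod)
}
```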
2024-03-21 13:11:03 +02:00
Ed Bartosh
247392271f Fix admission error
Fixed UnexpectedAdmissionError: Allocate failed due to not enough cpus
available to satisfy request: requested=2, available=1, which is unexpected
2024-03-20 18:03:13 +02:00
Gunju Kim
dd890b899f Make PodResources API include restartable init containers 2024-02-21 22:00:09 +09:00
Patrick Ohly
f2cfbf44b1 e2e: use framework labels
This changes the test registration so that, for tags for which the framework has a
dedicated API (features, feature gates, slow, serial, etc.), those APIs are
used.

Arbitrary, custom tags are still left in place for now.
2023-11-01 15:17:34 +01:00
Kubernetes Prow Robot
cfafffa611 Merge pull request #121019 from kl52752/rate-limiting
Move grpc rate limiter from podresource folder
2023-10-19 08:15:26 +02:00
Kevin Hannon
dd9c3358f5 Revert "podresources: e2e: force eager connection" 2023-10-16 09:46:04 -04:00
Kubernetes Prow Robot
38a1ec75f0 Merge pull request #119882 from ffromani/podres-client-wait
podresources: e2e: force eager connection
2023-10-12 15:59:55 +02:00
carlory
d5d7fb595e e2e_node: stop using deprecated framework.ExpectEqual 2023-10-09 16:42:42 +08:00
Katarzyna Lach
122ff5a212 Move grpc rate limiter from podresource folder
The rate limiter Go file is generic, implementing the
gRPC Limiter interface. This file can be reused by other gRPC
APIs, not only by podresource.

Change-Id: I905a46b5b605fbb175eb9ad6c15019ffdc7f2563
2023-10-09 07:22:23 +00:00
Kubernetes Prow Robot
3191493cea Merge pull request #119402 from Tal-or/e2e_podres_terminal_pods
e2e:podresources: verify count for terminal pods
2023-09-20 11:26:11 -07:00
Francesco Romani
2ea47038b9 podresources: e2e: force eager connection
Add and use more facilities to the *internal* podresources client.
Checking e2e test runs, we see quite a few errors like:
```
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /var/lib/kubelet/pod-resources/kubelet.sock: connect: connection refused": rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /var/lib/kubelet/pod-resources/kubelet.sock: connect: connection refused"
```

This is likely caused by kubelet restarts, which we do plenty of in e2e tests,
combined with the fact that gRPC connects lazily AND that we don't really
check the errors in client code - we just bubble them up.

While it's arguably bad that we don't properly check error codes, it's also
true that in the main case, e2e tests, the functions should just never
fail besides a few well-known cases; we're connecting over a
super-reliable unix domain socket after all.

So we centralize the fix by adding a function (along with minor
cleanups) whose job is to trigger the connection and ensure it happens,
localizing the changes just here. The main advantage is that this approach
is opt-in, composable, and doesn't leak gRPC details into the client
code.
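
A sketch of the eager-connection idea (not the exact helper added here): kick the gRPC ClientConn out of its idle state and wait until it reports Ready, so later calls no longer surface lazy-dial errors.

```
package podresources

import (
	"context"
	"fmt"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
)

// waitForReady forces the lazily-dialed connection to be established.
func waitForReady(ctx context.Context, conn *grpc.ClientConn) error {
	conn.Connect() // leave IDLE; by default gRPC only dials on the first RPC
	for {
		state := conn.GetState()
		if state == connectivity.Ready {
			return nil
		}
		// Block until the state changes or the context expires.
		if !conn.WaitForStateChange(ctx, state) {
			return fmt.Errorf("podresources connection not ready: %v", ctx.Err())
		}
	}
}
```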

Signed-off-by: Francesco Romani <fromani@redhat.com>
2023-09-07 08:24:49 +02:00
wen.rui
3d9b5d0577 e2e_node:stop using deprecated framework.ExpectError 2023-09-01 17:42:36 +08:00
Talor Itzhak
3964f71fe0 e2e:podresources: verify count for terminal pods
The PodResources API reports, in the List call, the resources of pods in a terminal phase.
The internal managers reassign resources assigned to pods in a terminal phase, so podresources should ignore them.

Whether this behavior is intended or not (the docs are not unequivocal),
this e2e test demonstrates and verifies the behavior mentioned above.

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
2023-07-23 12:46:41 +03:00
Francesco Romani
01c3a51a78 node: podresources: getallocatable: move to GA
lock the feature gate to GA, and remove the now-redundant code.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2023-07-12 14:11:22 +02:00
Kubernetes Prow Robot
86038ae590 Merge pull request #116846 from moshe010/e2e--node-pod-resources
kubelet pod-resources: add e2e for KubeletPodResourcesGet feature
2023-07-11 04:53:24 -07:00
Francesco Romani
dfc150ca18 e2e: node: podresources: cooldown the rate limit
We have an e2e test which wants to get a rate limit error. To do so, we
send an abnormally high number of calls in a tight loop.

The relevant test per se is reportedly fine, but we need to play nicer
with *other* tests which may run just after it and which need to query the API.
If the test suite runs "too fast", it's possible an innocent test falls within the
same rate limit watch period which was saturated by the rate limit test,
so the innocent test can still be throttled because the throttling period
is not exhausted yet, yielding false negatives and leading to flakes.

We can't reset the rate limit period, so we just wait "long enough" to
make sure we absorb the burst and that other legit queries are not rejected.
Signed-off-by: Francesco Romani <fromani@redhat.com>
2023-06-29 17:40:36 +02:00
Kubernetes Prow Robot
2190775b69 Merge pull request #118280 from stlaz/e2e_psa_labels
Set all PSa labels in tests
2023-06-28 11:14:43 -07:00
Davanum Srinivas
a75b00ea39 Better URL for scraping metrics from kubelet
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2023-06-27 16:14:59 -04:00
Moshe Levi
38222014c6 kubelet pod-resources: add e2e for KubeletPodResourcesGet feature
Signed-off-by: Moshe Levi <moshele@nvidia.com>
2023-06-26 08:10:24 +03:00
Stanislav Laznicka
7f532891c9 e2e tests: set all PSa labels instead of just enforcing 2023-06-21 15:05:13 +02:00
Paco Xu
bbae445d17 fix metrics test with 1.16.0 prometheus client 2023-06-20 16:46:31 +08:00
Ian K. Coolidge
cede96336a Depend on k8s.io/utils cpuset
Steps performed:

$ find . -name '*.go' -exec sed -i 's|k8s.io/kubernetes/pkg/kubelet/cm/cpuset|k8s.io/utils/cpuset|g' {} \;
$ ./hack/update-vendor.sh
$ ./hack/update-gofmt.sh
$ git rm -r pkg/kubelet/cm/cpuset/
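
For call sites, the change amounts to an import path swap (illustrative snippet, assuming the shared module exposes the same constructors):

```
package example

import (
	// cpuset "k8s.io/kubernetes/pkg/kubelet/cm/cpuset" // old in-tree copy, removed
	cpuset "k8s.io/utils/cpuset" // new shared utility module
)

// Call sites keep using the same identifiers, e.g.:
var reserved = cpuset.New(0, 1)
```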
2023-05-03 16:26:09 +00:00
Moshe Levi
9a776cbf21 kubelet pod-resources: e2e node test add failure description ExpectNoError
Signed-off-by: Moshe Levi <moshele@nvidia.com>
2023-03-23 11:05:21 +02:00
Francesco Romani
b837a0c1ff kubelet: podresources: DOS prevention with builtin ratelimit
Implement DoS prevention by wiring a global rate limit into the podresources
API. The goal here is not to introduce a general rate-limiting solution
for the kubelet (we need more research and discussion to get there),
but rather to prevent misuse of the API.

Known limitations:
- the rate limit values (QPS, BurstTokens) are hardcoded to
  "high enough" values.
  Enabling user-configuration would require more discussion
  and sweeping changes to the other kubelet endpoints, so it
  is postponed for now.
- the rate limiting is global. Malicious clients can starve other
  clients consuming the QPS quota.

Add an e2e test to exercise the flow, because the wiring itself
is mostly boilerplate and API adaptation.
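
A self-contained sketch of the approach with a single global token bucket and hardcoded values; the constants, names, and wiring below are illustrative, not the kubelet's actual ones.

```
package podresources

import (
	"context"

	"golang.org/x/time/rate"
	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

const (
	defaultQPS   = 100 // illustrative "high enough" value
	defaultBurst = 10  // illustrative
)

// newRateLimitedServer wires one global limiter into the gRPC server, so a
// misbehaving client exhausts the shared quota instead of the kubelet.
func newRateLimitedServer() *grpc.Server {
	limiter := rate.NewLimiter(rate.Limit(defaultQPS), defaultBurst)
	return grpc.NewServer(grpc.ChainUnaryInterceptor(
		func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
			if !limiter.Allow() {
				return nil, status.Errorf(codes.ResourceExhausted, "rate limit exceeded for %s", info.FullMethod)
			}
			return handler(ctx, req)
		},
	))
}
```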
2023-03-11 08:00:54 +01:00
Francesco Romani
5ca235e0ee e2e: podresources: promote platform-independent test as NodeConformance
We have quite a few podresources e2e tests and, as the feature
progresses to GA, we should consider moving them to NodeConformance.
Unfortunately most of them require linux-specific features, not in the
tests themselves but in the test prelude (fixture), to check or create the
node conditions (e.g. presence or absence of devices, online CPUs...) to be
verified in the test proper.

For this reason we promote only a single test for starters.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2023-03-09 16:26:01 +01:00
Francesco Romani
00b41334bf e2e: node: podresources: fix restart wait
Fix the waiting logic in the e2e test loop to wait
for resources to be reported again instead of basing the logic on the
timestamp. The idea is that waiting for resource availability
is the canonical way clients should observe the desired state,
and it should also be more robust than comparing timestamps,
especially in CI environments.
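
A sketch of the waiting logic under assumed names: poll the podresources API until the pod's resources show up again, rather than comparing timestamps around the kubelet restart.

```
package e2enode

import (
	"context"
	"fmt"
	"time"

	podresourcesapi "k8s.io/kubelet/pkg/apis/podresources/v1"
)

// waitForPodResourcesReported is illustrative, not the exact test helper.
func waitForPodResourcesReported(ctx context.Context, cli podresourcesapi.PodResourcesListerClient, podName string) error {
	deadline := time.Now().Add(2 * time.Minute)
	for time.Now().Before(deadline) {
		resp, err := cli.List(ctx, &podresourcesapi.ListPodResourcesRequest{})
		if err == nil {
			for _, pr := range resp.GetPodResources() {
				if pr.GetName() == podName {
					return nil // desired state observed: resources are reported again
				}
			}
		}
		time.Sleep(time.Second)
	}
	return fmt.Errorf("pod %q not reported by podresources within the timeout", podName)
}
```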

Signed-off-by: Francesco Romani <fromani@redhat.com>
2023-02-22 14:04:55 +01:00
Francesco Romani
92e00203e0 e2e: node: unify sample device plugin utilities
Start to consolidate the sample device plugin utilities
and constants in a central place, because we need
to use them in different e2e tests.

Having a central dependency is better than a maze of
entangled e2e tests depending on each other's helpers.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2023-02-22 14:04:55 +01:00