Commit Graph

3214 Commits

Author SHA1 Message Date
Kubernetes Prow Robot
d21da29c9e Merge pull request #133170 from ffromani/e2e-node-podres-memmgr
e2e: podresources: disable memory manager integration
2025-07-24 07:56:48 -07:00
Francesco Romani
449763fb11 e2e: podresources: disable memory manager integration
As part of PR 132028 we added more e2e test coverage to validate
the fix and to check, as far as possible, that there are no regressions.

The issue and the fix become evident largely when inspecting
memory allocation with the Memory Manager static policy enabled.
Quoting the commit message of bc56d0e45a
```
The podresources API List implementation uses the internal data of the
resource managers as source of truth.
Looking at the implementation here:
https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/pkg/kubelet/apis/podresources/server_v1.go#L60
we take care of syncing the device allocation data before querying the
device manager to return its pod->devices assignment.
This is needed because otherwise the device manager (and all the other
resource managers) would do the cleanup asynchronously, so the `List` call
will return incorrect data.

But we don't do this syncing neither for CPUs or for memory,
so when we report these we will get stale data as the issue #132020 demonstrates.

For CPU manager, we however have the reconcile loop which cleans the stale data periodically.
Turns out this timing interplay was actually the reason the existing issue #119423 seemed fixed
(see: #119423 (comment)).
But it's actually timing. If in the reproducer we set the `cpuManagerReconcilePeriod` to a time
very high (>= 5 minutes), then the issue still reproduces against current master branch
(https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/test/e2e_node/podresources_test.go#L983).
```

The missing actor here is the memory manager. The memory manager has neither
a reconcile loop (which would implicitly fix the stale data problem) nor
explicit synchronization, so it is the unlucky one which reported stale data,
leading to the eventual understanding of the problem.

For this reason it was (and still is) important to exercise it during
the test.
It turns out, however, that the test is wrong, likely because of a hidden
dependency between the test expectations and the lane configuration (notably
machine specs), so we disable the memory manager activation for the time
being, until we figure out a safe way to enable it.

Note this significantly weakens the signal for this specific test.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2025-07-24 12:35:45 +02:00
Patrick Ohly
5c4f81743c DRA: use v1 API
As before when adding v1beta2, DRA drivers built using the
k8s.io/dynamic-resource-allocation helper packages remain compatible with all
Kubernetes releases >= 1.32. The helper code picks whatever API version is
enabled from v1beta1/v1beta2/v1.

However, the control plane now depends on v1, so a cluster configuration where
only v1beta1 or v1beta2 is enabled without v1 won't work.
2025-07-24 08:33:45 +02:00
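
The version selection described in the commit above can be illustrated with a minimal sketch. This is not the helper package's actual code: the function name `pickVersion`, the `enabled` map, and the newest-first preference order are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// preferred lists the candidate API versions, newest first.
// The preference order is an assumption, not taken from the helper code.
var preferred = []string{"v1", "v1beta2", "v1beta1"}

// pickVersion (hypothetical) returns the first preferred version that the
// cluster reports as enabled, or an error if none of them is enabled.
func pickVersion(enabled map[string]bool) (string, error) {
	for _, v := range preferred {
		if enabled[v] {
			return v, nil
		}
	}
	return "", errors.New("no supported API version is enabled")
}

func main() {
	// A cluster that only serves the two beta versions.
	v, err := pickVersion(map[string]bool{"v1beta1": true, "v1beta2": true})
	fmt.Println(v, err)
}
```

A driver built this way keeps working against older clusters, while the control plane limitation mentioned in the commit message (v1 must be enabled) still applies.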
Kubernetes Prow Robot
dd6fa8bafd Merge pull request #133129 from ffromani/podres-get-add-tests
node: podresources: improve test coverage for the `Get` endpoint
2025-07-23 19:56:40 -07:00
Kubernetes Prow Robot
aee92cd6c3 Merge pull request #132968 from wongchar/uncore-e2e-beta
cpumanager: expand test coverage for prefer-align-cpus-by-uncore-cache
2025-07-22 13:40:50 -07:00
Francesco Romani
303a7056ff e2e: node: podresources: enable multi-container tests
fix the utilities to enable multi-app-container tests,
which were previously quite hard to implement.

Add a consumer of the new utility to demonstrate the usage
and to initiate the basic coverage.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2025-07-22 19:58:29 +02:00
Francesco Romani
38a9a8a59d e2e: node: podresources: add tests for missing pod
add an e2e test to ensure that if the Get endpoint is asked
about a non-existing pod, it returns an error.
Likewise, add an e2e test for terminated pods, which should
not be returned because they neither consume nor hold resources,
much like `List` does.

The expected usage pattern is to iterate over the list of
pods returned by `List`, but nevertheless the endpoint must
handle this case.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2025-07-22 19:55:09 +02:00
Charles Wong
545b36ba29 fix uncore e2e check
2025-07-22 09:50:18 -05:00
Kubernetes Prow Robot
4dfb8523fc Merge pull request #128239 from HirazawaUi/fix-e2e-tests
Fix container lifecycle flaking e2e tests
2025-07-21 18:08:25 -07:00
Kubernetes Prow Robot
b8eda18fc9 Merge pull request #132198 from natasha41575/mirror-obs-gen
add generation / observedGeneration test for mirror pods
2025-07-21 16:30:25 -07:00
Natasha Sarkar
c659b41826 e2e test for mirror pod with pod generation
2025-07-21 22:27:13 +00:00
Kubernetes Prow Robot
47d9d86326 Merge pull request #133028 from saschagrunert/deviceplugin-proto
Convert `k8s.io/kubelet/pkg/apis/deviceplugin` from gogo to protoc
2025-07-21 14:14:55 -07:00
Kubernetes Prow Robot
7d758620bc Merge pull request #132083 from carlory/cleanup-GAed-fg-DevicePluginCDIDevices
remove general available feature-gate DevicePluginCDIDevices
2025-07-21 13:06:27 -07:00
Charles Wong
ccc82775f4 expand test coverage for uncore alignment
add feature compatibility

check uncore cpuset alignment

check shared uncores
2025-07-21 11:19:25 -05:00
Sascha Grunert
3026020b44 Convert k8s.io/kubelet/pkg/apis/deviceplugin from gogo to protoc
Use standard protoc for the device plugin API instead of gogo.

Part of kubernetes#96564

Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
2025-07-21 10:04:01 +02:00
Kubernetes Prow Robot
5e83b9c2c2 Merge pull request #129942 from bart0sh/PR171-migrate-some-kubelet-components-to-contextual-logging
Migrate kubelet/{apis,kubeletconfig,nodeshutdown,pod,preemption} to contextual logging
2025-07-18 20:28:25 -07:00
Kubernetes Prow Robot
daee8efa4d Merge pull request #132811 from ffromani/e2e-serial-cpumanager-tests-cleanup
e2e: node: cpumanager: fix cpu quota non-regression tests
2025-07-18 15:24:38 -07:00
Kubernetes Prow Robot
7fa6cdde88 Merge pull request #127630 from dshebib/e2eNode_UpdateToAgnhost
[e2e_node] containers_lifecycle update from busybox to agnhost
2025-07-18 15:24:25 -07:00
Kubernetes Prow Robot
9212246d78 Merge pull request #132827 from guptaNswati/e2e-podresourcesGet-featuregate
Add feature gate enable test for KubeletPodResourcesGet
2025-07-18 12:12:25 -07:00
Swati Gupta
14a5ef56a3 fix pipeline failure
Signed-off-by: Swati Gupta <swatig@nvidia.com>
2025-07-17 23:21:26 +00:00
Sascha Grunert
532d48fe6a Convert k8s.io/kubelet/pkg/apis/podresources from gogo to protoc
Use standard protoc for the pod resources instead of gogo.

Part of kubernetes#96564

Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
2025-07-17 14:56:44 +02:00
Ed Bartosh
75ccd69bab migrate pkg/kubelet/kubeletconfig to contextual logging
2025-07-17 10:16:03 +03:00
Kubernetes Prow Robot
8f312e6fbf Merge pull request #132348 from iholder101/swap/add-container-swap-limit-metric
[KEP-2400] Add a container_swap_limit_bytes metric
2025-07-16 20:02:30 -07:00
Kubernetes Prow Robot
9f545c5b46 Merge pull request #130992 from dshebib/addRegularContainerImageChangeToE2E_reverted
E2E Node Tests: Remove failing test from reverted PR
2025-07-16 20:02:23 -07:00
Swati Gupta
8f4a624a59 Fix pipeline errors
Signed-off-by: Swati Gupta <swatig@nvidia.com>
2025-07-16 22:56:59 +00:00
Ed Bartosh
e4320fe25c e2e_node: DRA: test handling fatal serving failures
Added an e2e_node test to verify that the DRA plugin and
registration services cancel provided context when handling
fatal gRPC serving errors.
2025-07-16 15:49:41 +03:00
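
The behavior this test verifies (a context being cancelled when serving fails fatally) can be sketched as follows. This is only an illustration under stated assumptions: `serveWithCancel` and `cancelsOnFatalError` are hypothetical names, the serving function is a stand-in for a gRPC server's serve loop, and the real DRA plugin code may be wired differently.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// serveWithCancel (hypothetical) runs serve in a goroutine and cancels the
// returned context if serve fails, so callers waiting on the context notice.
func serveWithCancel(parent context.Context, serve func() error) context.Context {
	ctx, cancel := context.WithCancel(parent)
	go func() {
		if err := serve(); err != nil {
			cancel() // fatal serving error: signal everyone holding ctx
		}
	}()
	return ctx
}

// cancelsOnFatalError reports whether the context is cancelled shortly
// after a simulated fatal serving error.
func cancelsOnFatalError() bool {
	ctx := serveWithCancel(context.Background(), func() error {
		return errors.New("simulated fatal serving error")
	})
	select {
	case <-ctx.Done():
		return true
	case <-time.After(2 * time.Second):
		return false
	}
}

func main() {
	fmt.Println("context cancelled:", cancelsOnFatalError())
}
```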
Ed Bartosh
ea05ad8887 e2e_node: DRA: add errorOnCloseListener
Introduce a mock net.Listener for tests that triggers a controlled
error on Close, enabling reliable simulation of gRPC server failures
in test scenarios.
2025-07-16 15:49:41 +03:00
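
A listener of this kind could look roughly like the sketch below. Only the type name `errorOnCloseListener` comes from the commit title; the fields, the constructor `newErrorOnCloseListener`, and the error value `errClose` are assumptions, and the actual test helper may be implemented differently.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// errClose is the injected error (hypothetical name).
var errClose = errors.New("simulated close failure")

// errorOnCloseListener wraps a real net.Listener and overrides Close.
// Accept and Addr are inherited from the embedded listener.
type errorOnCloseListener struct {
	net.Listener
	err error
}

// Close releases the wrapped listener, then returns the injected error
// instead of the real result.
func (l *errorOnCloseListener) Close() error {
	_ = l.Listener.Close()
	return l.err
}

// newErrorOnCloseListener (hypothetical) listens on a loopback TCP port
// chosen by the OS and wraps the result.
func newErrorOnCloseListener(err error) (net.Listener, error) {
	inner, lerr := net.Listen("tcp", "127.0.0.1:0")
	if lerr != nil {
		return nil, lerr
	}
	return &errorOnCloseListener{Listener: inner, err: err}, nil
}

func main() {
	l, err := newErrorOnCloseListener(errClose)
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	fmt.Println("injected error returned:", errors.Is(l.Close(), errClose))
}
```

Embedding the real listener keeps the mock small: only the method under test is replaced, and a server handed this listener behaves normally until it tries to shut down.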
Ed Bartosh
1981c985b1 e2e: DRA: support test and public options
Refactor StartPlugin and related test helpers to accept a variadic
list of options of any type, allowing both public and test-specific
options to be passed.
2025-07-16 15:49:41 +03:00
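
One way to accept a mixed list of public and test-specific options is a type switch over `...any`. The sketch below assumes that approach; `publicOption`, `testOption`, `config`, `withName`, and `startPlugin` are hypothetical stand-ins, not the signatures used in the real helpers.

```go
package main

import "fmt"

// config collects the settings that the options modify (hypothetical).
type config struct {
	name      string
	failClose bool
}

// publicOption stands in for a public functional option type.
type publicOption func(*config)

// testOption stands in for an option that only tests may pass.
type testOption struct{ failClose bool }

// withName is an example public option.
func withName(n string) publicOption {
	return func(c *config) { c.name = n }
}

// startPlugin accepts options of any type and dispatches on the concrete
// type, rejecting anything it does not recognize.
func startPlugin(opts ...any) (config, error) {
	cfg := config{name: "default"}
	for _, o := range opts {
		switch opt := o.(type) {
		case publicOption:
			opt(&cfg)
		case testOption:
			cfg.failClose = opt.failClose
		default:
			return config{}, fmt.Errorf("unsupported option type %T", o)
		}
	}
	return cfg, nil
}

func main() {
	cfg, err := startPlugin(withName("demo"), testOption{failClose: true})
	fmt.Println(cfg, err)
}
```

The trade-off of `...any` is that a wrong option type is only caught at run time, which is why the sketch returns an error for unknown types instead of ignoring them.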
Ed Bartosh
169965350c e2e_node: Refactor DRA tests to use variadic options
Refactor the DRA e2e_node test helpers and test cases to accept
variadic kubeletplugin.Option arguments.

This change improves test flexibility and maintainability, allowing
new options to be passed in the future without requiring widespread
code changes.

There are no functional changes to test coverage or behavior.
2025-07-16 15:42:12 +03:00
Swati Gupta
d460611e77 Add more checks
Signed-off-by: Swati Gupta <swatig@nvidia.com>
2025-07-15 21:51:36 +00:00
Kubernetes Prow Robot
20344f9aba Merge pull request #132345 from ffromani/e2e-podresourcesapi-labels
e2e: node: fix podresources API feature label
2025-07-15 13:16:29 -07:00
Kubernetes Prow Robot
394f412767 Merge pull request #132617 from aramase/aramase/f/kep_4412_pod_cache_key_type
Add ServiceAccountTokenCacheType support to credential provider plugin
2025-07-15 10:56:45 -07:00
Francesco Romani
05e1c4b489 e2e: node: fix podresources API feature label
We want to fix and enhance lanes which exercise
the podresources API tests. The first step is to clarify
the label and make it specific to the podresources API,
minimizing the clash and the ambiguity with the "PodLevelResources"
feature.

Note we change the label name, but the new label name is backward
compatible (filtering for "Feature:PodResources" will still
get the tests). This turns out not to be a problem because
these tests are no longer called out explicitly in the lane
definitions. We want to change this ASAP.

The new name is more specific and allows us to clearly
call out tests for this feature in the lane definitions.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2025-07-15 14:15:00 +02:00
carlory
bd30b0adef remove general available feature-gate DevicePluginCDIDevices
Signed-off-by: carlory <baofa.fan@daocloud.io>
2025-07-15 16:55:12 +08:00
Kubernetes Prow Robot
bf0be9fb56 Merge pull request #132028 from ffromani/podresources-list-active-pods
podresources: list: use active pods
2025-07-14 12:06:24 -07:00
Charles Wong
98c4514eae add e2e_node tests for uncore alignment
2025-07-11 10:32:01 -05:00
Anish Ramasekar
4d2566eb5a credentialprovider: wire in service account mode cache type
Signed-off-by: Anish Ramasekar <anish.ramasekar@gmail.com>
2025-07-10 14:50:54 -05:00
Swati Gupta
bb6bd52012 Add feature gate enable test for KubeletPodResourcesGet
Signed-off-by: Swati Gupta <swatig@nvidia.com>
2025-07-08 23:49:34 +00:00
Francesco Romani
8f92a81787 node: e2e: podresources: add more e2e tests
add more e2e tests to cover the interaction with
core resource managers (cpu, memory) and to ensure
proper reporting.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2025-07-08 17:18:34 +02:00
Francesco Romani
380ed8d9b3 e2e: node: memory manager: build everywhere, run only on linux
Since KEP 4885
(https://github.com/kubernetes/enhancements/blob/master/keps/sig-windows/4885-windows-cpu-and-memory-affinity/README.md)
the memory manager is also supported on Windows.

Plus, we want to add podresources e2e tests which configure
the memory manager. Both these facts suggest it's useful to build
the e2e memory manager tests on all OSes, not just on Linux.

However, since we are not sure we are ready to run these tests
everywhere, we tag them LinuxOnly to preserve most of the
old behavior.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2025-07-08 17:18:34 +02:00
Francesco Romani
006b2a3b52 e2e: node: cpumanager: fix cpu quota non-regression tests
The non-regression tests should check that the quota management
introduced in #127525 can be disabled, so we need to verify
the previous behaviour using the integer quotas.

It seems the problem was just a bad rebase that wrongly duplicated
the tests. We fix this by removing the incorrect duplicates.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2025-07-08 16:10:16 +02:00
Itamar Holder
25d9d8d9ba refactor: use getLocalNode() to avoid code duplication
Signed-off-by: Itamar Holder <iholder@redhat.com>
2025-07-08 15:48:35 +03:00
Itamar Holder
bc9e8e1a91 add a context argument to prePodCreationModificationFunc()
Signed-off-by: Itamar Holder <iholder@redhat.com>
2025-07-08 15:45:42 +03:00
Itamar Holder
1ac60e35e9 e2e test: Add a container_swap_limit_bytes metric
Signed-off-by: Itamar Holder <iholder@redhat.com>
2025-07-08 12:38:18 +03:00
Kubernetes Prow Robot
09d99b7990 Merge pull request #132672 from iholder101/test/swap-delme-mod
Stabilize swap eviction priority test
2025-07-08 00:35:28 -07:00
Kubernetes Prow Robot
ee012e883f Merge pull request #131641 from pohly/dra-kubelet-in-use-metric
DRA kubelet: add dra_resource_claims_in_use gauge vector
2025-07-07 03:11:26 -07:00
PatrickLaabs
0e8424fcf0 chore: depr. pointer pkg replacement for the e2e_node
2025-07-06 11:27:16 +02:00
Sascha Grunert
b464bbeb8f Remove gogo-protobuf from CRI
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
2025-07-04 08:55:57 +02:00
Itamar Holder
90bbce56b9 PriorityMemoryEvictionOrdering: allocate more memory when swap is provisioned
Whenever swap is provisioned on the node,
the kernel might be able to reclaim much more memory,
hence it is harder to get the node to be memory pressured.

This will add another container that allocates
the same amount as the swap capacity to help
bring the node to memory pressure.

Signed-off-by: Itamar Holder <iholder@redhat.com>
2025-07-03 21:14:44 +03:00
Itamar Holder
25498cd34d Eviction tests: small refactor
This small refactor:
- Adds swap log statistics.
- Adds a pre pods modification function.

The latter can be used to perform
changes to pods before creation.

Signed-off-by: Itamar Holder <iholder@redhat.com>
2025-07-03 21:14:43 +03:00