3238 Commits

Author SHA1 Message Date
Kubernetes Prow Robot
9d3fff5048 Merge pull request #133353 from ffromani/e2e-node-serial-unblock
node: unblock e2e serial lanes
2025-08-06 11:37:25 -07:00
Francesco Romani
aca402f25b e2e: node: skip breaking tests
Skip problematic tests to recover signal; we will
reintroduce them gradually.

See: https://github.com/kubernetes/kubernetes/issues/133314
See: https://github.com/kubernetes/kubernetes/pull/133336

Signed-off-by: Francesco Romani <fromani@redhat.com>
2025-08-01 13:59:32 +02:00
Kevin Hannon
e83e5815e5 always pull pause image for eviction tests 2025-08-01 00:55:10 -04:00
Kubernetes Prow Robot
91731d05e2 Merge pull request #133279 from ffromani/pod-level-resource-managers
[PodLevelResources] handle pod-level resource manager alignment
2025-07-29 17:28:33 -07:00
Francesco Romani
a3a767b37e WIP: fix e2e tests
Signed-off-by: Francesco Romani <fromani@redhat.com>
2025-07-29 20:20:08 +02:00
Kubernetes Prow Robot
ba76d238ad Merge pull request #133259 from ndixita/code-refactor
Adding check for nil pod resources in huge pages test
2025-07-28 23:18:30 -07:00
Kubernetes Prow Robot
dd4e4f1dd1 Merge pull request #133262 from BenTheElder/no-authenticated-image-pulling
remove broken test that depends on expired credential, remove hardcoded credential, add TODOs
2025-07-28 17:28:28 -07:00
Benjamin Elder
8ace0fb89f remove failing test that depends on expired credential, remove credential, add TODOs
see: https://github.com/kubernetes/kubernetes/issues/130271
2025-07-28 15:43:43 -07:00
ndixita
4b698656be Returning early if podResources is nil to avoid nil pointer dereferencing
Signed-off-by: ndixita <ndixita@google.com>
2025-07-28 19:31:08 +00:00
Kevin Torres
766d011bba E2E tests for no hints nor alignment of CPU and Memory managers 2025-07-28 18:53:04 +00:00
Kubernetes Prow Robot
6d4ca967f7 Merge pull request #132824 from roycaihw/psi-pressure-test
Extend E2E test coverage for PSI metrics under pressure
2025-07-25 00:32:27 -07:00
Kubernetes Prow Robot
72f9a9260a Merge pull request #130606 from Jpsassine/dra_device_health_status
Expose DRA device health in PodStatus
2025-07-24 20:14:27 -07:00
Kubernetes Prow Robot
3fd1251165 Merge pull request #131089 from KevinTMtz/pod-level-hugepage-cgroups
[PodLevelResources] Propagate Pod level hugepage cgroup to containers
2025-07-24 19:08:26 -07:00
Kubernetes Prow Robot
63011fe547 Merge pull request #132277 from KevinTMtz/pod-level-resources-eviction-manager
[PodLevelResources] Pod Level Resources Eviction Manager
2025-07-24 16:44:34 -07:00
John-Paul Sassine
b7de71f9ce feat(kubelet): Add ResourceHealthStatus for DRA pods
This change introduces the ability for the Kubelet to monitor and report
the health of devices allocated via Dynamic Resource Allocation (DRA).
This addresses a key part of KEP-4680 by providing visibility into
device failures, which helps users and controllers diagnose pod failures.

The implementation includes:
- A new `v1alpha1.NodeHealth` gRPC service with a `WatchResources`
  stream that DRA plugins can optionally implement.
- A health information cache within the Kubelet's DRA manager to track
  the last known health of each device and handle plugin disconnections.
- An asynchronous update mechanism that triggers a pod sync when a
  device's health changes.
- A new `allocatedResourcesStatus` field in `v1.ContainerStatus` to
  expose the device health information to users via the Pod API.

Update vendor

KEP-4680: Fix lint, boilerplate, and codegen issues

Add another e2e test, add TODO for KEP4680 & update test infra helpers

Add Feature Gate e2e test

Fixing presubmits

Fix var names, feature gating, and nits

Fix DRA Health gRPC API according to review feedback
2025-07-24 23:23:18 +00:00
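The health cache described in b7de71f9ce can be pictured with a small sketch. This is not the Kubelet's code: the type `healthCache`, its methods, and the status names are hypothetical, and treating a disconnected plugin's devices as `Unknown` is an assumption made for illustration. The sketch only shows the idea of remembering the last known health per device and reporting whether an update changed anything, which is the condition that would trigger a pod sync.

```go
package main

import (
	"fmt"
	"sync"
)

// HealthStatus is an illustrative status type, not the real API type.
type HealthStatus string

const (
	Healthy   HealthStatus = "Healthy"
	Unhealthy HealthStatus = "Unhealthy"
	Unknown   HealthStatus = "Unknown"
)

// healthCache is a hypothetical cache: plugin name -> device name -> last known status.
type healthCache struct {
	mu      sync.Mutex
	devices map[string]map[string]HealthStatus
}

func newHealthCache() *healthCache {
	return &healthCache{devices: map[string]map[string]HealthStatus{}}
}

// Update stores the status and reports whether it differs from the cached one.
// A caller could use the return value to decide whether a pod sync is needed.
func (c *healthCache) Update(plugin, device string, status HealthStatus) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.devices[plugin] == nil {
		c.devices[plugin] = map[string]HealthStatus{}
	}
	old, ok := c.devices[plugin][device]
	c.devices[plugin][device] = status
	return !ok || old != status
}

// MarkDisconnected downgrades every device of the plugin to Unknown
// (an assumption of this sketch) and returns the devices that changed.
func (c *healthCache) MarkDisconnected(plugin string) []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	var changed []string
	for dev, st := range c.devices[plugin] {
		if st != Unknown {
			c.devices[plugin][dev] = Unknown
			changed = append(changed, dev)
		}
	}
	return changed
}

// Get returns the last known status, or Unknown for devices never seen.
func (c *healthCache) Get(plugin, device string) HealthStatus {
	c.mu.Lock()
	defer c.mu.Unlock()
	if st, ok := c.devices[plugin][device]; ok {
		return st
	}
	return Unknown
}

func main() {
	c := newHealthCache()
	fmt.Println("changed:", c.Update("example-plugin", "gpu-0", Healthy))
	fmt.Println("changed:", c.Update("example-plugin", "gpu-0", Unhealthy))
	fmt.Println("after disconnect:", c.MarkDisconnected("example-plugin"))
}
```

Returning a "changed" flag from `Update` keeps the cache itself free of any pod-sync logic; the caller decides what to do with a change.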
Haowei Cai
252513a1b9 Add WithFeature and WithSerial, also check if cgroup v2 is used in test 2025-07-24 21:40:08 +00:00
Kevin Torres
f925e55548 E2E tests for container hugepage resources immutability
Pod level hugepage resources are not propagated to the containers, only pod level cgroup values are propagated to the containers when they do not specify hugepage resources.
2025-07-24 21:29:04 +00:00
Kubernetes Prow Robot
ebbebe8be6 Merge pull request #133157 from haircommander/cgroup-driver-cri-ga
KEP 4033: Add metric for out of support CRI and bump feature to GA
2025-07-24 13:05:04 -07:00
Kubernetes Prow Robot
e4e13c1e80 Merge pull request #132818 from ffromani/e2e-node-cpumanager-cgroupv1-compat
e2e: node: cpumanager cgroup v1 compatibility
2025-07-24 13:04:41 -07:00
Kevin Torres
add7132a6d E2E tests for pod level resources Kubelet Preemption 2025-07-24 17:08:13 +00:00
Kevin Torres
976a617d05 E2E tests for pod level resources eviction manager 2025-07-24 17:07:09 +00:00
Peter Hunt
83a0d0c660 kubelet: add metric for version CRI implementation will lose support
Signed-off-by: Peter Hunt <pehunt@redhat.com>
2025-07-24 11:42:59 -04:00
Kubernetes Prow Robot
d21da29c9e Merge pull request #133170 from ffromani/e2e-node-podres-memmgr
e2e: podresources: disable memory manager integration
2025-07-24 07:56:48 -07:00
Francesco Romani
449763fb11 e2e: podresources: disable memory manager integration
As part of PR 132028 we added more e2e test coverage to validate
the fix and to check, as far as possible, that there are no regressions.

The issue and the fix become evident largely when inspecting
memory allocation with the Memory Manager static policy enabled.
Quoting the commit message of bc56d0e45a
```
The podresources API List implementation uses the internal data of the
resource managers as source of truth.
Looking at the implementation here:
https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/pkg/kubelet/apis/podresources/server_v1.go#L60
we take care of syncing the device allocation data before querying the
device manager to return its pod->devices assignment.
This is needed because otherwise the device manager (and all the other
resource managers) would do the cleanup asynchronously, so the `List` call
will return incorrect data.

But we don't do this syncing neither for CPUs or for memory,
so when we report these we will get stale data as the issue #132020 demonstrates.

For CPU manager, we however have the reconcile loop which cleans the stale data periodically.
Turns out this timing interplay was actually the reason the existing issue #119423 seemed fixed
(see: #119423 (comment)).
But it's actually timing. If in the reproducer we set the `cpuManagerReconcilePeriod` to a time
very high (>= 5 minutes), then the issue still reproduces against current master branch
(https://github.com/kubernetes/kubernetes/blob/v1.34.0-alpha.0/test/e2e_node/podresources_test.go#L983).
```

The missing actor here is the memory manager. The memory manager has neither
a reconcile loop (which implicitly fixes the stale data problem) nor explicit
synchronization, so it is the unlucky one which reported stale data,
leading to the eventual understanding of the problem.

For this reason it was (and still is) important to exercise it during
the test.
It turns out, however, that the test is wrong, likely because of a hidden
dependency between the test expectations and the lane configuration (notably
machine specs), so we disable the memory manager activation for the time
being, until we figure out a safe way to enable it.

Note this significantly weakens the signal for this specific test.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2025-07-24 12:35:45 +02:00
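The stale-data problem discussed in 449763fb11 can be illustrated with a toy model. Everything here is hypothetical (`resourceManager`, `removeStale`, `list`); it is not the podresources or resource manager code. It only shows why a `List`-style call returns stale entries unless the manager's state is synced against the set of active pods first, or cleaned up by a periodic reconcile.

```go
package main

import (
	"fmt"
	"sort"
)

// resourceManager is a toy stand-in: pod UID -> assigned resource IDs.
type resourceManager struct {
	assignments map[string][]int
}

// removeStale drops assignments of pods that are no longer active
// and returns how many entries were removed.
func (m *resourceManager) removeStale(active map[string]bool) int {
	removed := 0
	for pod := range m.assignments {
		if !active[pod] {
			delete(m.assignments, pod)
			removed++
		}
	}
	return removed
}

// list returns the pods that hold resources, in sorted order.
// With syncFirst=false it reports whatever the manager still remembers.
func (m *resourceManager) list(active map[string]bool, syncFirst bool) []string {
	if syncFirst {
		m.removeStale(active)
	}
	pods := make([]string, 0, len(m.assignments))
	for pod := range m.assignments {
		pods = append(pods, pod)
	}
	sort.Strings(pods)
	return pods
}

func main() {
	m := &resourceManager{assignments: map[string][]int{
		"pod-a": {0, 1},
		"pod-b": {2, 3}, // pod-b has terminated but was never cleaned up
	}}
	active := map[string]bool{"pod-a": true}
	fmt.Println("without sync:", m.list(active, false))
	fmt.Println("with sync:   ", m.list(active, true))
}
```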
Patrick Ohly
5c4f81743c DRA: use v1 API
As before when adding v1beta2, DRA drivers built using the
k8s.io/dynamic-resource-allocation helper packages remain compatible with all
Kubernetes releases >= 1.32. The helper code picks whatever API version is
enabled from v1beta1/v1beta2/v1.

However, the control plane now depends on v1, so a cluster configuration where
only v1beta1 or v1beta2 is enabled without v1 won't work.
2025-07-24 08:33:45 +02:00
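The version selection mentioned in 5c4f81743c could look roughly like the sketch below. The function `pickAPIVersion` and the preference order (newest first) are assumptions for illustration; the commit only states that the helper picks whichever of v1beta1/v1beta2/v1 is enabled.

```go
package main

import "fmt"

// preferredVersions lists the candidates, newest first.
// The ordering is an assumption of this sketch.
var preferredVersions = []string{"v1", "v1beta2", "v1beta1"}

// pickAPIVersion returns the first preferred version that is enabled,
// or false if none of them is.
func pickAPIVersion(enabled map[string]bool) (string, bool) {
	for _, v := range preferredVersions {
		if enabled[v] {
			return v, true
		}
	}
	return "", false
}

func main() {
	// An older cluster that only serves the beta versions.
	v, ok := pickAPIVersion(map[string]bool{"v1beta1": true, "v1beta2": true})
	fmt.Println(v, ok)
}
```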
Kubernetes Prow Robot
dd6fa8bafd Merge pull request #133129 from ffromani/podres-get-add-tests
node: podresources: improve test coverage for the `Get` endpoint
2025-07-23 19:56:40 -07:00
Kubernetes Prow Robot
aee92cd6c3 Merge pull request #132968 from wongchar/uncore-e2e-beta
cpumanager: expand test coverage for prefer-align-cpus-by-uncore-cache
2025-07-22 13:40:50 -07:00
Francesco Romani
303a7056ff e2e: node: podresources: enable multi-container tests
fix the utilities to enable multi-app-container tests,
which were previously quite hard to implement.

Add a consumer of the new utility to demonstrate the usage
and to initiate the basic coverage.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2025-07-22 19:58:29 +02:00
Francesco Romani
38a9a8a59d e2e: node: podresources: add tests for missing pod
add an e2e test to ensure that if the Get endpoint is asked
about a non-existing pod, it returns an error.
Likewise, add an e2e test for terminated pods, which should
not be returned because they neither consume nor hold resources,
much like `List` does.

The expected usage pattern is to iterate over the list of
pods returned by `List`, but nevertheless the endpoint must
handle this case.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2025-07-22 19:55:09 +02:00
Charles Wong
545b36ba29 fix uncore e2e check 2025-07-22 09:50:18 -05:00
Kubernetes Prow Robot
4dfb8523fc Merge pull request #128239 from HirazawaUi/fix-e2e-tests
Fix container lifecycle flaking e2e tests
2025-07-21 18:08:25 -07:00
Kubernetes Prow Robot
b8eda18fc9 Merge pull request #132198 from natasha41575/mirror-obs-gen
add generation / observedGeneration test for mirror pods
2025-07-21 16:30:25 -07:00
Natasha Sarkar
c659b41826 e2e test for mirror pod with pod generation 2025-07-21 22:27:13 +00:00
Kubernetes Prow Robot
47d9d86326 Merge pull request #133028 from saschagrunert/deviceplugin-proto
Convert `k8s.io/kubelet/pkg/apis/deviceplugin` from gogo to protoc
2025-07-21 14:14:55 -07:00
Kubernetes Prow Robot
7d758620bc Merge pull request #132083 from carlory/cleanup-GAed-fg-DevicePluginCDIDevices
remove generally available feature-gate DevicePluginCDIDevices
2025-07-21 13:06:27 -07:00
Charles Wong
ccc82775f4 expand test coverage for uncore alignment
add feature compatibility

check uncore cpuset alignment

check shared uncores
2025-07-21 11:19:25 -05:00
Haowei Cai
cb29414b44 Extend E2E test coverage for PSI metrics under pressure
Validate that PSI metrics are correctly reported under various resource pressure scenarios.
2025-07-21 16:13:32 +00:00
Francesco Romani
ea326373ef e2e: node: cpumanager cgroup v1 compatibility
While we support cgroup v1, we want some test coverage.
This patch enables v1 coverage for most of the testcases.
We intentionally rule out the CFS quota tests because we
want to support this change only on cgroup v2.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2025-07-21 13:57:50 +02:00
Sascha Grunert
3026020b44 Convert k8s.io/kubelet/pkg/apis/deviceplugin from gogo to protoc
Use standard protoc for the device plugin API instead of gogo.

Part of kubernetes#96564

Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
2025-07-21 10:04:01 +02:00
Kubernetes Prow Robot
5e83b9c2c2 Merge pull request #129942 from bart0sh/PR171-migrate-some-kubelet-components-to-contextual-logging
Migrate kubelet/{apis,kubeletconfig,nodeshutdown,pod,preemption} to contextual logging
2025-07-18 20:28:25 -07:00
Kubernetes Prow Robot
daee8efa4d Merge pull request #132811 from ffromani/e2e-serial-cpumanager-tests-cleanup
e2e: node: cpumanager: fix cpu quota non-regression tests
2025-07-18 15:24:38 -07:00
Kubernetes Prow Robot
7fa6cdde88 Merge pull request #127630 from dshebib/e2eNode_UpdateToAgnhost
[e2e_node] containers_lifecycle update from busybox to agnhost
2025-07-18 15:24:25 -07:00
Kubernetes Prow Robot
9212246d78 Merge pull request #132827 from guptaNswati/e2e-podresourcesGet-featuregate
Add feature gate enable test for KubeletPodResourcesGet
2025-07-18 12:12:25 -07:00
Swati Gupta
14a5ef56a3 fix pipeline failure
Signed-off-by: Swati Gupta <swatig@nvidia.com>
2025-07-17 23:21:26 +00:00
Sascha Grunert
532d48fe6a Convert k8s.io/kubelet/pkg/apis/podresources from gogo to protoc
Use standard protoc for the pod resources instead of gogo.

Part of kubernetes#96564

Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
2025-07-17 14:56:44 +02:00
Ed Bartosh
75ccd69bab migrate pkg/kubelet/kubeletconfig to contextual logging 2025-07-17 10:16:03 +03:00
Kubernetes Prow Robot
8f312e6fbf Merge pull request #132348 from iholder101/swap/add-container-swap-limit-metric
[KEP-2400] Add a container_swap_limit_bytes metric
2025-07-16 20:02:30 -07:00
Kubernetes Prow Robot
9f545c5b46 Merge pull request #130992 from dshebib/addRegularContainerImageChangeToE2E_reverted
E2E Node Tests: Remove failing test from reverted PR
2025-07-16 20:02:23 -07:00
Swati Gupta
8f4a624a59 Fix pipeline errors
Signed-off-by: Swati Gupta <swatig@nvidia.com>
2025-07-16 22:56:59 +00:00
Ed Bartosh
e4320fe25c e2e_node: DRA: test handling fatal serving failures
Added an e2e_node test to verify that the DRA plugin and
registration services cancel provided context when handling
fatal gRPC serving errors.
2025-07-16 15:49:41 +03:00
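The behaviour that e4320fe25c tests, cancelling a provided context when serving fails fatally, can be sketched as follows. The helper names `serveInBackground` and `runDemo` are hypothetical and the `serve` callback stands in for a gRPC server's blocking serve call; this is a minimal illustration under those assumptions, not the DRA plugin helper's code.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// serveInBackground runs serve in a goroutine and cancels the context,
// recording the error as the cause, if serving fails.
func serveInBackground(cancel context.CancelCauseFunc, serve func() error) {
	go func() {
		if err := serve(); err != nil {
			cancel(fmt.Errorf("fatal serving error: %w", err))
		}
	}()
}

// runDemo starts a "server" that fails immediately and returns the
// cancellation cause, or nil if the context was not cancelled in time.
func runDemo() error {
	ctx, cancel := context.WithCancelCause(context.Background())
	defer cancel(nil)

	serveInBackground(cancel, func() error {
		return errors.New("listener closed")
	})

	select {
	case <-ctx.Done():
		return context.Cause(ctx)
	case <-time.After(2 * time.Second):
		return nil
	}
}

func main() {
	fmt.Println("cause:", runDemo())
}
```

Using `context.WithCancelCause` lets whoever owns the context see why it was cancelled, which is convenient for a test that must tell a serving failure apart from an ordinary shutdown.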