1568 Commits

Author SHA1 Message Date
Davanum Srinivas
abbc5ad346 Copy limited pieces of code we use from runc's apparmor and utils packages
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2024-10-22 09:56:22 -04:00
Jing Zhang
0365cf4b20 KEP-4540: Add CPUManager policy option strict-cpu-reservation
Signed-off-by: Jing Zhang <jing.c.zhang.ext@nokia.com>
2024-10-21 11:57:17 -04:00
Kubernetes Prow Robot
ded7ad554e Merge pull request #125513 from mauri870/hotfix/grpc-handle-err
kubelet/cm/devicemanager: log grpc Serve error
2024-10-18 02:49:03 +01:00
PiotrProkop
37ac9aa060 topologymanager: promote TopologyManagerPolicyOptions feature to GA
* Promote TopologyManagerPolicyOptions feature to GA
* Promote PreferClosestNUMANodes TopologyManagerPolicyOption to stable

Signed-off-by: PiotrProkop <pprokop@nvidia.com>
2024-10-17 20:58:34 +02:00
Kubernetes Prow Robot
a4c262bc8c Merge pull request #127293 from hshiina/typecheck
kubelet/cm: Unite return value types of helper functions
2024-10-17 07:45:04 +01:00
Peter Hunt
77d03e42cd kubelet/cm: move CPU reading from cm to cm/cpumanager
Authored-by: Francesco Romani <fromani@redhat.com>
Signed-off-by: Peter Hunt <pehunt@redhat.com>
2024-10-11 11:29:16 -04:00
Peter Hunt
c51195dbd0 kubelet/cm: fix bug where kubelet restarts from missing cpuset cgroup
on None cpumanager policy, cgroupv2, and systemd cgroup manager, kubelet
could get into a situation where it believes the cpuset cgroup was created
(by libcontainer in the cgroupfs) but systemd has deleted it, as it wasn't requested
to create it. This causes one unnecessary restart, as kubelet fails with

`failed to initialize top level QOS containers: root container [kubepods] doesn't exist.`

This only causes one restart because the kubelet skips recreating the cgroup
if it already exists, but it's still a bother and is fixed this way

Signed-off-by: Peter Hunt <pehunt@redhat.com>
2024-10-11 10:49:16 -04:00
Kubernetes Prow Robot
3bf17e2340 Merge pull request #127959 from ffromani/fix-smtalign-error-message
node: cpumanager: fix smtalign error message and minor cleanup
2024-10-11 00:32:20 +01:00
Francesco Romani
838f911dea cpumanager: smtalign: fix error message
Fix error message if availablePhysicalCPUs = 0.
Without this change, the logic was mistakenly emitting
the old error message, which is confusing for troubleshooting.

Plus, a tiny quality of life improvement:
cpumanager static policy wants to use `cpuGroupSize` multiple times.
The value represents how many VCPUs per PCPUs the machine has.
So, let's cache (and log!) the value in the policy data.
We don't support dynamic update of the HW topology anyway.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2024-10-10 10:18:44 +02:00
Kubernetes Prow Robot
c923a61ddd Merge pull request #125982 from harche/compressible_reserved
Set only compressible resources on system and kube reserved cgroup slices
2024-10-04 04:08:27 +01:00
Harshal Patil
3bad47e8ed Set only compressible resources on system slice
Signed-off-by: Harshal Patil <harpatil@redhat.com>
2024-10-03 13:23:34 -04:00
Kubernetes Prow Robot
e34f7f4d80 Merge pull request #127671 from mmorel-35/testify/error-contains
fix: use `ErrorContains(t, err` instead of `Contains(t, err.Error()`
2024-09-28 19:18:01 +01:00
Matthieu MOREL
f736cca0e5 fix: enable expected-actual rule from testifylint in module k8s.io/kubernetes
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
2024-09-27 07:56:31 +02:00
Matthieu MOREL
f777addb05 fix: use ErrorContains(t, err instead of Contains(t, err.Error()
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
2024-09-26 22:22:20 +02:00
Matthieu MOREL
27b98be303 fix: enable nil-compare and error-nil rules from testifylint in module k8s.io/kubernetes
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
2024-09-25 06:02:47 +02:00
rongfu.leng
ead64fb8f0 add resourceupdates.Update chan buffer
Signed-off-by: rongfu.leng <lenronfu@gmail.com>
2024-09-24 16:48:32 +00:00
Matthieu MOREL
fa0e38981c fix: enable compares rule from testifylint in module k8s.io/kubernetes
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
2024-09-22 11:20:05 +02:00
Kubernetes Prow Robot
f2700895a4 Merge pull request #127422 from srivastav-abhishek/go-vet-fix
Go vet fixes for gotip
2024-09-20 14:37:58 +01:00
Abhishek Kr Srivastav
95860cff1c Fix Go vet errors for master golang
Co-authored-by: Rajalakshmi-Girish <rajalakshmi.girish1@ibm.com>
Co-authored-by: Abhishek Kr Srivastav <Abhishek.kr.srivastav@ibm.com>
2024-09-20 12:36:38 +05:30
Kubernetes Prow Robot
24a74f887a Merge pull request #126595 from pacoxu/kubelet-cgroup-v2-kernel-version
[1.32]kubelet: add log and event for cgroup v2 running on kernel < 5.8
2024-09-18 18:34:44 +01:00
Paco Xu
259671bd43 check root cpu.stat instead of kernel version for cgroup v2
2024-09-18 11:39:36 +08:00
Kubernetes Prow Robot
f153edf356 Merge pull request #123443 from Tal-or/mm_consistent_memory_numa_alloc
memorymanager: avoid violating NUMA node memory allocation rule
2024-09-17 20:10:43 +01:00
Ed Bartosh
bba786496b Kubelet: DRA: fix golangci-lint findings
2024-09-16 12:35:12 +03:00
rongfu.leng
d04a54c50b optimize code, filter podUID is empty string
Signed-off-by: rongfu.leng <lenronfu@gmail.com>
2024-09-13 01:48:14 +00:00
Kubernetes Prow Robot
11e8169a16 Merge pull request #120569 from ffromani/cpumanager-extra-logs
enhance the cpumanager logs
2024-09-12 00:25:18 +01:00
Hironori Shiina
107bb17538 kubelet/cm: Unite return value types of helper functions
2024-09-11 10:50:48 +02:00
Kubernetes Prow Robot
8e5d7cbef7 Merge pull request #127250 from bart0sh/PR157-Kubelet-DRA-fix-testify-errors
Kubelet: DRA: fix testify errors
2024-09-09 23:24:48 +01:00
Ed Bartosh
e70a2ad828 Kubelet: DRA: fix testify errors
2024-09-09 22:18:07 +03:00
Kubernetes Prow Robot
1c15f718b6 Merge pull request #126717 from bart0sh/PR154-DRA-test-restoring-checkpoint-for-upgraded-structure
DRA: test checkpoint structure upgrade
2024-09-09 11:32:27 +01:00
Kubernetes Prow Robot
14ff551c96 Merge pull request #124246 from SataQiu/fix-20240409
Remove unused code in container manager
2024-09-06 23:47:02 +01:00
Kubernetes Prow Robot
3ac8fc04e1 Merge pull request #126834 from carlory/fix-125924-1
DRA: rename pkg/cm/dra/plugin files
2024-09-05 12:15:58 +01:00
Kubernetes Prow Robot
14f2cab4de Merge pull request #126976 from jsturtevant/socket-file-revert
Revert "fix: handle socket file detection on Windows"
2024-09-03 18:31:16 +01:00
Kubernetes Prow Robot
a4ec0c039a Merge pull request #126435 from bart0sh/PR151-Kubelet-devicemanager-stop-using-CDI-annotations
Kubelet: stop using CDI annotations
2024-08-29 16:49:30 +01:00
Ed Bartosh
d3b5cb6f41 DRA: test checkpoint structure and version upgrades
Co-authored-by: Patrick Ohly <patrick.ohly@intel.com>
2024-08-29 15:25:58 +03:00
James Sturtevant
3ca610757e Revert "fix: handle socket file detection on Windows"
This reverts commit 4060ee60c1.
2024-08-28 10:31:58 -07:00
carlory
3372c056cd fix linter hints
2024-08-27 01:30:58 +08:00
carlory
7b33495d9d DRA: rename pkg/cm/dra/plugin files
2024-08-27 00:54:37 +08:00
Ed Bartosh
e1bc8defac kubelet: Migrate DRA Manager to contextual logging
Co-authored-by: Patrick Ohly <patrick.ohly@intel.com>
2024-08-22 11:12:41 +03:00
Ed Bartosh
9d893c83f0 DRA: fix failing test
Added error assertion for NodePrepareResources call unveiled
"rpc error: code = DeadlineExceeded desc = context deadline exceeded"
failure in the TestGRPCConnIsReused test.

Setting clientCallTimeout field when creating plugin should fix it.
2024-08-20 11:11:43 +03:00
Paco Xu
69a67556c7 kubelet: add warning log and events for cgroup v2 running on kernel < 5.8
2024-08-12 14:06:56 +08:00
Ed Bartosh
ea3c6628b7 Kubelet: stop using CDI annotations
Removing setting CDI annotations by the device manager as CRI field
CDIDevices is mature enough to be used instead.
2024-07-29 18:26:27 +03:00
Paco Xu
78d3830d97 ignore order of containers status allocated resources
2024-07-29 16:48:00 +08:00
Kubernetes Prow Robot
5af1710d90 Merge pull request #126243 from SergeyKanzhelev/devicePluginFailures
Implement resource health in pod status (KEP 4680)
2024-07-23 20:12:24 -07:00
Sergey Kanzhelev
62f96d2748 set AllocatedResourcesStatus in the Pod Status
2024-07-24 00:29:35 +00:00
Ed Bartosh
c0d922e786 DRA: Kubelet code cleanup
2024-07-24 00:27:52 +03:00
Ed Bartosh
59555c6a62 DRA: move dra/checkpont/* to dra/state/*
2024-07-24 00:12:10 +03:00
Ed Bartosh
35fbbc5cfd DRA: use crc32.ChecksumIEEE to calculate checkpoint checksum
2024-07-24 00:10:39 +03:00
Ed Bartosh
59daed75d6 DRA: refactor checkpointing
Co-authored-by: Kevin Klues <klueska@gmail.com>
2024-07-24 00:10:30 +03:00
Patrick Ohly
d11b58efe6 DRA kubelet: refactor gRPC call timeouts
Some of the E2E node tests were flaky. Their timeout apparently was chosen
under the assumption that kubelet would retry immediately after a failed gRPC
call, with a factor of 2 as safety margin. But according to
0449cef8fd,
kubelet has a different, higher retry period of 90 seconds, which was exactly
the test timeout. The test timeout has to be higher than that.

As the tests don't use the gRPC call timeout anymore, it can be made
private. While at it, the name and documentation gets updated.
2024-07-22 18:09:34 +02:00
Patrick Ohly
877829aeaa DRA kubelet: adapt to v1alpha3 API
This adds the ability to select specific requests inside a claim for a
container.

NodePrepareResources is always called, even if the claim is not used by any
container. This could be useful for drivers where that call has some effect
other than injecting CDI device IDs into containers. It also ensures that
drivers can validate configs.

The pod resource API can no longer report a class for each claim because there
is no such 1:1 relationship anymore. Instead, that API reports claim,
API devices (with driver/pool/device as ID) and CDI device IDs. The kubelet
itself doesn't extract that information from the claim. Instead, it relies on
drivers to report this information when the claim gets prepared. This isolates
the kubelet from API changes.

Because of a faulty E2E test, kubelet was told to contact the wrong driver for
a claim. This was not visible in the kubelet log output. Now changes to the
claim info cache are getting logged. While at it, naming of variables and some
existing log output gets harmonized.

Co-authored-by: Oksana Baranova <oksana.baranova@intel.com>
Co-authored-by: Ed Bartosh <eduard.bartosh@intel.com>
2024-07-22 18:09:34 +02:00