Commit Graph

266 Commits

Author SHA1 Message Date
rongfu.leng
d04a54c50b optimize code, filter podUID is empty string
Signed-off-by: rongfu.leng <lenronfu@gmail.com>
2024-09-13 01:48:14 +00:00
Kubernetes Prow Robot
14f2cab4de Merge pull request #126976 from jsturtevant/socket-file-revert
Revert "fix: handle socket file detection on Windows"
2024-09-03 18:31:16 +01:00
Kubernetes Prow Robot
a4ec0c039a Merge pull request #126435 from bart0sh/PR151-Kubelet-devicemanager-stop-using-CDI-annotations
Kubelet: stop using CDI annotations
2024-08-29 16:49:30 +01:00
James Sturtevant
3ca610757e Revert "fix: handle socket file detection on Windows"
This reverts commit 4060ee60c1.
2024-08-28 10:31:58 -07:00
Ed Bartosh
ea3c6628b7 Kubelet: stop using CDI annotations
Stop setting CDI annotations in the device manager, as the CRI field
CDIDevices is mature enough to be used instead.
2024-07-29 18:26:27 +03:00
Paco Xu
78d3830d97 ignore order of containers status allocated resources 2024-07-29 16:48:00 +08:00
Sergey Kanzhelev
62f96d2748 set AllocatedResourcesStatus in the Pod Status 2024-07-24 00:29:35 +00:00
Kubernetes Prow Robot
9196650533 Merge pull request #123819 from fakecore/fc/master
fix: handle socket file detection on Windows
2024-07-18 00:53:16 -07:00
Matthieu MOREL
f014b754fb fix: enable empty and len rules from testifylint on pkg package
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>

Co-authored-by: Patrick Ohly <patrick.ohly@intel.com>
2024-07-06 23:15:43 +00:00
Ed Bartosh
f53991d111 kube_features: DevicePluginCDIDevices: LockToDefault 2024-06-25 16:14:48 +03:00
Kubernetes Prow Robot
a8d51f4f05 Use a generic Set instead of a specified Set in kubelet
Signed-off-by: bzsuni <bingzhe.sun@daocloud.io>
2024-06-04 14:25:43 +08:00
Kubernetes Prow Robot
1fd835ce59 Merge pull request #123398 from ffromani/remove-legacy-checkpoint
node: devicemgr: remove obsolete pre-1.20 checkpoint file support
2024-04-29 14:46:53 -07:00
Marek Siarkowicz
3ee8178768 Cleanup defer from SetFeatureGateDuringTest function call 2024-04-24 20:25:29 +02:00
Francesco Romani
181fb0da51 node: devicemgr: remove obsolete pre-1.20 checkpoint file support
In commit 2f426fdba6 we added
compatibility (and tests) to deal with pre-1.20 checkpoint files.
We are now well past the end of support for pre-1.20 kubelets,
so we can get rid of this code.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2024-04-15 14:01:56 +02:00
HirazawaUi
10b6319e64 fix slow dra unit test 2024-03-16 22:21:15 +08:00
fakecore
4060ee60c1 fix: handle socket file detection on Windows
Update socket file detection logic to use os.Stat as per upstream
Go fix for https://github.com/golang/go/issues/33357. This resolves
the issue where socket files could not be properly identified on
Windows systems.
2024-03-08 18:16:10 +08:00
Kubernetes Prow Robot
70383f3701 Merge pull request #119561 from payall4u/fix-kubelet-panic-when-allocate-device
Fix kubelet panic when allocate resource for pod.
2024-02-29 03:06:54 -08:00
Daniel Hu
1baf7d4586 Corrected some spelling and grammatical errors
Signed-off-by: Daniel Hu <farmer.hutao@outlook.com>
2024-01-27 10:10:25 +08:00
Daniel Hu
d652596e42 Remove redundant string conversions in print statements
Signed-off-by: Daniel Hu <farmer.hutao@outlook.com>
2024-01-15 09:57:35 +08:00
payall4u
d6b8a660b0 Fix kubelet panic when allocate resource for pod.
Signed-off-by: payall4u <payall4u@qq.com>
2023-11-12 10:54:05 +08:00
Kubernetes Prow Robot
a5ff0324a9 Merge pull request #120461 from gjkim42/do-not-reuse-device-of-restartable-init-container
Don't reuse the device of a restartable init container
2023-10-31 19:15:53 +01:00
Antonio Ojea
8e0be64b8f remove data race on the devicemanager client plugin
Change-Id: I45b85440a792e5ed2f75a344ec1f0332854d8d6d
2023-10-24 21:35:13 +00:00
Shiming Zhang
35f4d29d73 Fix unit test 2023-10-24 11:06:35 +08:00
Swati Sehgal
9a354fc9d0 node: sample-dp: Add retry to handle device plugin restart failure
Add retry mechanism to handle cases where after kubelet restarts, the device
plugin unix socket(s) were created but not ready to serve yet.

Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
2023-10-17 12:19:10 +01:00
Swati Sehgal
d0d133298d node: sample-dp: Use fsnotify for kubelet restart detection
Add kubeletSocket file to fsnotify instead of polling and waiting for deletion
of device plugin unix socket as a way of detecting kubelet restart. We need to
ensure that the device plugin re-registers itself after kubelet restart depending
on the configured registration mode (auto-registration or controller registration).

Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
2023-10-17 12:19:10 +01:00
Swati Sehgal
211d8cc80a node: sample-dp: stubRegisterControlFunc for controlling registration
If the user specifies the intent to control registration process, we rely on
registration triggers (deletion of control file) to prompt registration.

This behaviour is expected to be consistent across kubelet restarts, and therefore
across the watch calls where we watch for changes to the unix socket, so we make
this part of the Stub object instead of a parameter.

Co-authored-by: Francesco Romani <fromani@redhat.com>
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
2023-10-17 12:19:10 +01:00
Swati Sehgal
c4c9d61d66 node: sample-dp: Handle re-registration for controlled registrations
In case `REGISTER_CONTROL_FILE` is specified, we want to ensure that the
registration is triggered by deletion of the control file. This is
applicable both when the registration happens for the first time and
subsequent ones because of kubelet restarts.

Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
2023-10-17 12:19:07 +01:00
Swati Sehgal
6714e678d3 node: sample-dp: register by default and re-register on restarts
In issue: 115107 we added an environment variable to control the registration of sample
device plugin to kubelet. The intent of this patch is to ensure that the default
behaviour of the plugin is to register to kubelet (in case no environment
variable is specified).

In addition to that, we want to ensure that the plugin registers itself not just once.
It should re-register itself to kubelet in case of node reboot or kubelet restarts.

Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
2023-10-17 12:14:09 +01:00
Gunju Kim
d2b803246a Don't reuse the device allocated to the restartable init container 2023-10-17 18:28:29 +09:00
Gunju Kim
a0610a97b3 pkg/kubelet/cm: Remove deprecated sets.String and sets.Int
This removes deprecated sets.String and sets.Int
- replace sets.String with sets.Set[string]
- replace sets.Int with sets.Set[int]
- replace sets.NewString with sets.New[string]
- replace sets.NewInt with sets.New[int]
- replace sets.(OLD).List with sets.List(NEW)
2023-09-27 22:02:15 +09:00
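
The shape of the migration listed in a0610a97b3 can be illustrated with a small generic set. The Set type below is a local stand-in written for this example so that it runs without external dependencies; it is not the apimachinery sets package, and the ordered constraint is likewise an assumption of the sketch. It mirrors only the New[T] constructor and the package-level List function named in the commit message.

```go
package main

import (
	"fmt"
	"sort"
)

// ordered is a local constraint for element types that support "<".
type ordered interface {
	~int | ~string
}

// Set is a stand-in generic set, defined here for illustration only.
type Set[T comparable] map[T]struct{}

// New builds a Set from the given items; duplicates collapse.
func New[T comparable](items ...T) Set[T] {
	s := make(Set[T], len(items))
	for _, it := range items {
		s[it] = struct{}{}
	}
	return s
}

// Has reports whether item is a member of the set.
func (s Set[T]) Has(item T) bool {
	_, ok := s[item]
	return ok
}

// List returns the members in ascending order. It is a package-level
// function rather than a method, matching the "sets.List(NEW)" form.
func List[T ordered](s Set[T]) []T {
	out := make([]T, 0, len(s))
	for it := range s {
		out = append(out, it)
	}
	sort.Slice(out, func(i, j int) bool { return out[i] < out[j] })
	return out
}

func main() {
	devices := New[string]("dev2", "dev1", "dev1") // was: a string-specific set
	cpus := New[int](3, 1, 2)                      // was: an int-specific set
	fmt.Println(List(devices), List(cpus), devices.Has("dev1"))
}
```

One generic type replaces the per-element-type variants, so call sites change from a type-specific constructor to an explicit type argument.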
Kubernetes Prow Robot
bdcf812c95 Merge pull request #118254 from elezar/4009/add-cdi-devices-to-device-plugin
Add CDI devices to device plugin API
2023-07-17 05:21:08 -07:00
Evan Lezar
b57c7e2fe4 Add CDI devices to device plugin API
This change adds CDI device IDs to the ContainerAllocateResponse in the
device plugin API. This allows a device plugin to specify CDI devices
by their unique fully-qualified CDI device names using the related field
in the CRI specification.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
2023-07-17 11:53:09 +02:00
Francesco Romani
c635a7e7d8 node: devicemgr: topomgr: add logs
One of the factors that makes issues #118559 and #109595 hard to
debug and fix is that the devicemanager has very few logs in important
flows, so it's unnecessarily hard to reconstruct the state from logs.

We add minimal logs to be able to improve troubleshooting.
We add minimal logs to be backport-friendly, deferring a more
comprehensive review of logging to later PRs.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2023-07-12 13:25:36 +02:00
Francesco Romani
3bcf4220ec kubelet: devices: skip allocation for running pods
When kubelet initializes, it runs admission for pods and possibly
allocates requested resources. We need to distinguish between
node reboot (no containers running) versus kubelet restart (containers
potentially running).

Running pods should always survive kubelet restart.
This means that device allocation on admission should not be attempted,
because if a container requires devices and is still running when kubelet
is restarting, that container already has devices allocated and working.

Thus, we need to properly detect this scenario in the allocation step
and handle it explicitly. We need to inform
the devicemanager about which pods are already running.

Note that if the container runtime is down when kubelet restarts, the
approach implemented here won't work. In this scenario, on kubelet
restart containers will again fail admission, hitting
https://github.com/kubernetes/kubernetes/issues/118559 again.
This scenario should however be pretty rare.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2023-07-12 13:25:36 +02:00
Evan Lezar
cd14e97ea8 Add a builder for ContainerAllocateResponse objects
This change introduces a helper to construct ContainerAllocateResponse instances.
Test cases are updated to use a new constructor accepting functional options
allowing the response contents to be set based on the test requirements.

This can then be extended to also test additional fields in the device plugin API
such as annotations which are not currently covered or new fields.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
2023-07-11 11:48:26 +02:00
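
A constructor that accepts functional options, as described in cd14e97ea8, might look roughly like the sketch below. The struct is a simplified local stand-in for the device plugin API type, and every option name (withEnv, withAnnotation, withCDIDevice) is hypothetical; the actual test helpers in the repository may differ.

```go
package main

import "fmt"

// ContainerAllocateResponse is a simplified stand-in with only the
// fields needed for this illustration.
type ContainerAllocateResponse struct {
	Envs        map[string]string
	Annotations map[string]string
	CDIDevices  []string
}

// responseOption mutates a response under construction.
type responseOption func(*ContainerAllocateResponse)

// withEnv sets one environment variable on the response (hypothetical name).
func withEnv(key, value string) responseOption {
	return func(r *ContainerAllocateResponse) { r.Envs[key] = value }
}

// withAnnotation sets one annotation on the response (hypothetical name).
func withAnnotation(key, value string) responseOption {
	return func(r *ContainerAllocateResponse) { r.Annotations[key] = value }
}

// withCDIDevice appends one fully-qualified CDI device name (hypothetical name).
func withCDIDevice(name string) responseOption {
	return func(r *ContainerAllocateResponse) { r.CDIDevices = append(r.CDIDevices, name) }
}

// newContainerAllocateResponse applies the options in order to an
// empty response, so each test sets only the fields it cares about.
func newContainerAllocateResponse(opts ...responseOption) *ContainerAllocateResponse {
	r := &ContainerAllocateResponse{
		Envs:        map[string]string{},
		Annotations: map[string]string{},
	}
	for _, opt := range opts {
		opt(r)
	}
	return r
}

func main() {
	resp := newContainerAllocateResponse(
		withEnv("key1", "val1"),
		withAnnotation("example.com/a", "b"),
		withCDIDevice("vendor.com/class=dev0"),
	)
	fmt.Printf("%+v\n", *resp)
}
```

Covering a new field in tests then only requires adding one more option function, without touching existing call sites.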
Kubernetes Prow Robot
484645e817 Merge pull request #116659 from claudiubelu/skip-flaky-tests-2
unit tests: Skip flaky tests on Windows (part 2)
2023-05-23 20:04:48 -07:00
Kubernetes Prow Robot
1241ddc567 Merge pull request #116376 from swatisehgal/device-mgr-recovery-wip
node: device-mgr: Handle recovery flow by checking if healthy devices exist- attempt 2
2023-05-01 21:30:11 -07:00
Swati Sehgal
dc1a592632 node: device-mgr: Handle recovery by checking if healthy devices exist
In case of node reboot/kubelet restart, the flow of events involves
obtaining the state from the checkpoint file followed by setting
the `healthDevices`/`unhealthyDevices` to its zero value. This is
done to allow the device plugin to re-register itself so that
capacity can be updated appropriately.

During the allocation phase, we need to check if the resources requested
by the pod have been registered AND healthy devices are present on
the node to be allocated.

Also, we need to move this check above `needed==0`, where needed is
(required - devices allocated to the container), the latter being obtained
from the checkpoint file, because even in cases where no additional devices
have to be allocated (as they were pre-allocated), we still need to
make sure the devices that were previously allocated are healthy.

Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
2023-04-28 14:41:30 +01:00
Claudiu Belu
0979d55443 unit tests: Skip flaky tests on Windows (part 2)
Some of the unit tests are currently flaky on Windows. This commit
skips them until they are resolved.
2023-04-13 12:07:18 +00:00
Kubernetes Prow Robot
d0fc9d16ce Merge pull request #114800 from haoruan/feature-8976-spew-sprintf-refactor
Capture spew.Sprintf() with all our favorite config into a util func
2023-04-11 15:34:57 -07:00
Hao Ruan
f638e2849f replaced spew.Sprintf with a util pretty print function 2023-03-27 09:24:22 +08:00
Todd Neal
4096c9209c dedupe pod resource request calculation 2023-03-09 17:15:53 -06:00
David Porter
9c20cee504 Revert "node: device-mgr: Handle recovery flow by checking if healthy devices exist" 2023-03-07 11:50:52 -08:00
Claudiu Belu
5ba74c81ca unit tests: Skip flaky tests on Windows
Some of the unit tests are currently flaky on Windows. This commit
skips them until they are resolved.
2023-03-06 20:46:05 +00:00
Kubernetes Prow Robot
890d39f976 Merge pull request #114640 from swatisehgal/handle-device-mgr-recovery
node: device-mgr: Handle recovery flow by checking if healthy devices exist
2023-03-06 07:10:28 -08:00
Swati Sehgal
5b2a3dbbdc node: device-mgr: explicitly check if pre-allocated devices are healthy
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
2023-03-06 11:52:23 +00:00
Swati Sehgal
a799ffb571 node: device-mgr: unit-tests: admission failure due to unhealthy devices
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
2023-03-06 11:52:23 +00:00
Swati Sehgal
7ac399c205 node: device-mgr: Handle recovery by checking if healthy devices exist
In case of node reboot/kubelet restart, the flow of events involves
obtaining the state from the checkpoint file followed by setting
the `healthDevices`/`unhealthyDevices` to its zero value. This is
done to allow the device plugin to re-register itself so that
capacity can be updated appropriately.

During the allocation phase, we need to check if the resources requested
by the pod have been registered AND healthy devices are present on
the node to be allocated.

Also, we need to move this check above `needed==0`, where needed is
(required - devices allocated to the container), the latter being obtained
from the checkpoint file, because even in cases where no additional devices
have to be allocated (as they were pre-allocated), we still need to
make sure the devices that were previously allocated are healthy.

Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
2023-03-06 11:52:23 +00:00
huyinhou
88274d96fc update code style
Signed-off-by: huyinhou <huyinhou@bytedance.com>
2023-03-06 14:23:14 +08:00
huyinhou
32495ae3f1 add lock in generate topology hints function 2023-02-20 10:56:53 +08:00