1440 Commits

Author SHA1 Message Date
rongfu.leng
d04a54c50b optimize code, filter podUID is empty string
Signed-off-by: rongfu.leng <lenronfu@gmail.com>
2024-09-13 01:48:14 +00:00
Kubernetes Prow Robot
11e8169a16 Merge pull request #120569 from ffromani/cpumanager-extra-logs
enhance the cpumanager logs
2024-09-12 00:25:18 +01:00
Kubernetes Prow Robot
8e5d7cbef7 Merge pull request #127250 from bart0sh/PR157-Kubelet-DRA-fix-testify-errors
Kubelet: DRA: fix testify errors
2024-09-09 23:24:48 +01:00
Ed Bartosh
e70a2ad828 Kubelet: DRA: fix testify errors 2024-09-09 22:18:07 +03:00
Kubernetes Prow Robot
1c15f718b6 Merge pull request #126717 from bart0sh/PR154-DRA-test-restoring-checkpoint-for-upgraded-structure
DRA: test checkpoint structure upgrade
2024-09-09 11:32:27 +01:00
Kubernetes Prow Robot
14ff551c96 Merge pull request #124246 from SataQiu/fix-20240409
Remove unused code in container manager
2024-09-06 23:47:02 +01:00
Kubernetes Prow Robot
3ac8fc04e1 Merge pull request #126834 from carlory/fix-125924-1
DRA: rename pkg/cm/dra/plugin files
2024-09-05 12:15:58 +01:00
Kubernetes Prow Robot
14f2cab4de Merge pull request #126976 from jsturtevant/socket-file-revert
Revert "fix: handle socket file detection on Windows"
2024-09-03 18:31:16 +01:00
Kubernetes Prow Robot
a4ec0c039a Merge pull request #126435 from bart0sh/PR151-Kubelet-devicemanager-stop-using-CDI-annotations
Kubelet: stop using CDI annotations
2024-08-29 16:49:30 +01:00
Ed Bartosh
d3b5cb6f41 DRA: test checkpoint structure and version upgrades
Co-authored-by: Patrick Ohly <patrick.ohly@intel.com>
2024-08-29 15:25:58 +03:00
James Sturtevant
3ca610757e Revert "fix: handle socket file detection on Windows"
This reverts commit 4060ee60c1.
2024-08-28 10:31:58 -07:00
carlory
3372c056cd fix linter hints 2024-08-27 01:30:58 +08:00
carlory
7b33495d9d DRA: rename pkg/cm/dra/plugin files 2024-08-27 00:54:37 +08:00
Ed Bartosh
e1bc8defac kubelet: Migrate DRA Manager to contextual logging
Co-authored-by: Patrick Ohly <patrick.ohly@intel.com>
2024-08-22 11:12:41 +03:00
Ed Bartosh
9d893c83f0 DRA: fix failing test
Added error assertion for NodePrepareResources call unveiled
"rpc error: code = DeadlineExceeded desc = context deadline exceeded"
failure in the TestGRPCConnIsReused test.

Setting clientCallTimeout field when creating plugin should fix it.
2024-08-20 11:11:43 +03:00
Ed Bartosh
ea3c6628b7 Kubelet: stop using CDI annotations
Removing setting CDI annotations by the device manager as CRI field
CDIDevices is mature enough to be used instead.
2024-07-29 18:26:27 +03:00
Paco Xu
78d3830d97 ignore order of containers status allocated resources 2024-07-29 16:48:00 +08:00
Kubernetes Prow Robot
5af1710d90 Merge pull request #126243 from SergeyKanzhelev/devicePluginFailures
Implement resource health in pod status (KEP 4680)
2024-07-23 20:12:24 -07:00
Sergey Kanzhelev
62f96d2748 set AllocatedResourcesStatus in the Pod Status 2024-07-24 00:29:35 +00:00
Ed Bartosh
c0d922e786 DRA: Kubelet code cleanup 2024-07-24 00:27:52 +03:00
Ed Bartosh
59555c6a62 DRA: move dra/checkpoint/* to dra/state/* 2024-07-24 00:12:10 +03:00
Ed Bartosh
35fbbc5cfd DRA: use crc32.ChecksumIEEE to calculate checkpoint checksum 2024-07-24 00:10:39 +03:00
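
As a rough illustration of the commit above, a checkpoint checksum based on crc32.ChecksumIEEE could be computed as sketched below. Only crc32.ChecksumIEEE comes from the commit message; the checkpointData type, its fields, and the choice of JSON as the serialization are assumptions made for this example and are not taken from the kubelet code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"hash/crc32"
)

// checkpointData is a hypothetical payload used only for this sketch;
// it is not the kubelet's real checkpoint structure.
type checkpointData struct {
	ClaimUID string   `json:"claimUID"`
	Devices  []string `json:"devices"`
}

// checksum serializes the payload to JSON (an assumption of this sketch)
// and returns the CRC-32 (IEEE polynomial) of the resulting bytes.
func checksum(data checkpointData) (uint32, error) {
	raw, err := json.Marshal(data)
	if err != nil {
		return 0, err
	}
	return crc32.ChecksumIEEE(raw), nil
}

func main() {
	sum, err := checksum(checkpointData{ClaimUID: "uid-1", Devices: []string{"gpu-0"}})
	if err != nil {
		panic(err)
	}
	fmt.Printf("checksum: %d\n", sum)
}
```

The same payload always yields the same value, so a stored checksum can later be compared against a freshly computed one to detect a modified or truncated checkpoint.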
Ed Bartosh
59daed75d6 DRA: refactor checkpointing
Co-authored-by: Kevin Klues <klueska@gmail.com>
2024-07-24 00:10:30 +03:00
Patrick Ohly
d11b58efe6 DRA kubelet: refactor gRPC call timeouts
Some of the E2E node tests were flaky. Their timeout apparently was chosen
under the assumption that kubelet would retry immediately after a failed gRPC
call, with a factor of 2 as safety margin. But according to
0449cef8fd,
kubelet has a different, higher retry period of 90 seconds, which was exactly
the test timeout. The test timeout has to be higher than that.

As the tests don't use the gRPC call timeout anymore, it can be made
private. While at it, the name and documentation gets updated.
2024-07-22 18:09:34 +02:00
Patrick Ohly
877829aeaa DRA kubelet: adapt to v1alpha3 API
This adds the ability to select specific requests inside a claim for a
container.

NodePrepareResources is always called, even if the claim is not used by any
container. This could be useful for drivers where that call has some effect
other than injecting CDI device IDs into containers. It also ensures that
drivers can validate configs.

The pod resource API can no longer report a class for each claim because there
is no such 1:1 relationship anymore. Instead, that API reports claim,
API devices (with driver/pool/device as ID) and CDI device IDs. The kubelet
itself doesn't extract that information from the claim. Instead, it relies on
drivers to report this information when the claim gets prepared. This isolates
the kubelet from API changes.

Because of a faulty E2E test, kubelet was told to contact the wrong driver for
a claim. This was not visible in the kubelet log output. Now changes to the
claim info cache are getting logged. While at it, naming of variables and some
existing log output gets harmonized.

Co-authored-by: Oksana Baranova <oksana.baranova@intel.com>
Co-authored-by: Ed Bartosh <eduard.bartosh@intel.com>
2024-07-22 18:09:34 +02:00
Patrick Ohly
91d7882e86 DRA: new API for 1.31
This is a complete revamp of the original API. Some of the key
differences:
- refocused on structured parameters and allocating devices
- support for constraints across devices
- support for allocating "all" or a fixed amount
  of similar devices in a single request
- no class for ResourceClaims, instead individual
  device requests are associated with a mandatory
  DeviceClass

For the sake of simplicity, optional basic types (ints, strings) where the null
value is the default are represented as values in the API types. This makes Go
code simpler because it doesn't have to check for nil (consumers) and values
can be set directly (producers). The effect is that in protobuf, these fields
always get encoded because `opt` only has an effect for pointers.

The roundtrip test data for v1.29.0 and v1.30.0 changes because of the new
"request" field. This is considered acceptable because the entire `claims`
field in the pod spec is still alpha.

The implementation is complete enough to bring up the apiserver.
Adapting other components follows.
2024-07-22 18:09:34 +02:00
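
The trade-off described in the commit above, between pointer-typed and value-typed optional fields, can be sketched in Go as follows. The two request types and the helper functions are hypothetical and exist only for this example; they are not the resource.k8s.io API types.

```go
package main

import "fmt"

// requestWithPointer models an optional field as a pointer (hypothetical type).
// Consumers must check for nil before reading it.
type requestWithPointer struct {
	Count *int64
}

// requestWithValue models the same field as a plain value (hypothetical type).
// The zero value doubles as the default, so no nil check is needed.
type requestWithValue struct {
	Count int64
}

// countFromPointer shows the consumer-side nil check that pointers require.
func countFromPointer(r requestWithPointer) int64 {
	if r.Count == nil {
		return 0 // default when the field is unset
	}
	return *r.Count
}

// countFromValue reads the field directly.
func countFromValue(r requestWithValue) int64 {
	return r.Count
}

func main() {
	n := int64(2)
	// Producers need an addressable variable to set a pointer field...
	fmt.Println(countFromPointer(requestWithPointer{}), countFromPointer(requestWithPointer{Count: &n}))
	// ...but can assign a value field directly.
	fmt.Println(countFromValue(requestWithValue{}), countFromValue(requestWithValue{Count: 2}))
}
```

With value fields, an unset field and a field explicitly set to its zero value are indistinguishable, which is why the commit restricts this approach to fields where the null value is the default.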
Francesco Romani
0a9b17771d node: cpumgr: log: make errors louder
We have a special case which is not supposed to happen.
Make it louder with default log settings to make sure this is visible.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2024-07-22 14:04:05 +02:00
Francesco Romani
2dc5ddd08a node: cpumgr: logs: bump log verbosiness for expected skips
In the reconciliation flow, there are expected skipping
conditions (e.g. for active logs). To reduce noise in the logs,
bump up the verbosiness of these messages, using odd levels.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2024-07-22 14:04:05 +02:00
Francesco Romani
5a0bc1020b node: cpumgr: move flow to left and add logs
Refactor the code to align to the left bailing out
earlier if the code must do nothing.
Add log to trace this occurrence.
Besides extra log, no intended change in behavior.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2024-07-22 14:04:04 +02:00
Francesco Romani
a89c843edd node: cpumgr: ErrorS -> InfoS
Convert uncommon use of ErrorS(nil, ...) into more
regular use of InfoS. Set the verbosiness level to
make sure the message is still emitted in regular
expected configuration.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2024-07-22 14:04:04 +02:00
Patrick Ohly
b51d68bb87 DRA: bump API v1alpha2 -> v1alpha3
This is in preparation for revamping the resource.k8s.io completely. Because
there will be no support for transitioning from v1alpha2 to v1alpha3, the
roundtrip test data for that API in 1.29 and 1.30 gets removed.

Repeating the version in the import name of the API packages is not really
required. It was done for a while to support simpler grepping for usage of
alpha APIs, but there are better ways for that now. So during this transition,
"resourceapi" gets used instead of "resourcev1alpha3" and the version gets
dropped from informer and lister imports. The advantage is that the next bump
to v1beta1 will affect fewer source code lines.

Only source code where the version really matters (like API registration)
retains the versioned import.
2024-07-21 17:28:13 +02:00
Kubernetes Prow Robot
f2428d66cc Merge pull request #125163 from pohly/dra-kubelet-api-version-independent-no-rest-proxy
DRA: make kubelet independent of the resource.k8s.io API version
2024-07-18 17:47:48 -07:00
Kubernetes Prow Robot
5fc7032a0e Merge pull request #126156 from pohly/kubelet-test-enhancements
kubelet test enhancements
2024-07-18 14:50:54 -07:00
Patrick Ohly
7701a48bd6 dra kubelet: bump gRPC API to v1alpha4
The previous changes are an API break, therefore we need a new version.
2024-07-18 23:30:09 +02:00
Kubernetes Prow Robot
9196650533 Merge pull request #123819 from fakecore/fc/master
fix: handle socket file detection on Windows
2024-07-18 00:53:16 -07:00
Patrick Ohly
348f94ab55 DRA: read ResourceClaim in DRA drivers
This is the second and final step towards making kubelet independent of the
resource.k8s.io API versioning because it now doesn't need to copy structs
defined by that API from the driver to the API server.
2024-07-18 09:09:20 +02:00
Patrick Ohly
616a014347 DRA: move ResourceSlice publishing into DRA drivers
This is a first step towards making kubelet independent of the resource.k8s.io
API versioning because it now doesn't need to copy structs defined by that API
from the driver to the API server. The next step is removing the other
direction (reading ResourceClaim status and passing the resource handle to
drivers).

The drivers must get deployed so that they have their own connection to the API
server. Securing at least the writes via a validating admission policy should
be possible.

As before, the kubelet removes all ResourceSlices for its node at startup, then
DRA drivers recreate them if (and only if) they start up again. This ensures
that there are no orphaned ResourceSlices when a driver gets removed while the
kubelet was down.

While at it, logging gets cleaned up and updated to use structured, contextual
logging as much as possible. gRPC requests and streams now use a shared,
per-process request ID and streams also get logged.
2024-07-18 09:09:19 +02:00
Patrick Ohly
b9d00841a6 kubelet: improve checkpoint errors
Recording the expected and actual checksum in the error makes it possible to
provide that information, for example in a failed test like the ones for DRA.
Otherwise developers have to manually step through the test with a debugger to
figure out what the new checksum is.
2024-07-17 16:07:31 +02:00
Kubernetes Prow Robot
2263f2d719 Merge pull request #124148 from cyclinder/add_flag_kubelet
kubelet: Add a TopologyManager policy option: max-allowable-numa-nodes
2024-07-15 19:27:16 -07:00
Kubernetes Prow Robot
3361895612 Merge pull request #123733 from Jeffwan/jiaxin/kep-4176-240305
KEP-4176: Add a new static policy SpreadPhysicalCPUsPreferredOption
2024-07-15 01:41:10 -07:00
Jiaxin Shan
6c85fd4ddd KEP-4176: Add static policy option to distribute cpus across cores 2024-07-12 11:52:51 -07:00
Kubernetes Prow Robot
2d4514e169 Merge pull request #125802 from mmorel-35/testifylint/len+empty
fix: enable empty and len rules from testifylint on pkg and staging package
2024-07-11 23:12:06 -07:00
Harshal Patil
68d317a8d1 Add a warning log, event and metric for cgroup version 1
Signed-off-by: Harshal Patil <harpatil@redhat.com>
2024-07-09 11:34:46 -04:00
cyclinder
87129c350a kubelet: Add a TopologyManager policy options: "max-allowable-numa-nodes"
Signed-off-by: cyclidner <kuocyclinder@gmail.com>
2024-07-09 22:26:24 +08:00
Matthieu MOREL
f014b754fb fix: enable empty and len rules from testifylint on pkg package
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>

Co-authored-by: Patrick Ohly <patrick.ohly@intel.com>
2024-07-06 23:15:43 +00:00
Kubernetes Prow Robot
7e1a5a0ea8 Merge pull request #125687 from bart0sh/PR146-DevicePluginCDIDevices-LockToDefault
kube_features: DevicePluginCDIDevices: LockToDefault
2024-07-01 17:07:41 -07:00
Kubernetes Prow Robot
34b8832edb Merge pull request #125631 from SergeyKanzhelev/logFailedAdmission
improve logging of pod admission denied
2024-06-28 19:36:20 -07:00
Kubernetes Prow Robot
16b7d5310a Merge pull request #125047 from zhanluxianshen/clean-typos-in-kubelet
clean typos logs in kubelet.
2024-06-28 16:48:24 -07:00
Kubernetes Prow Robot
ac9aec9f9b Merge pull request #125116 from pohly/dra-one-of-source
DRA: remove "source" indirection from v1 Pod API
2024-06-28 12:46:45 -07:00
Matthieu MOREL
0cde5f1e28 fix: enable bool-compare rule from testifylint linter (#125135)
* fix: enable bool-compare rule from testifylint linter

Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>

* Update hack/golangci.yaml.in

Co-authored-by: Patrick Ohly <patrick.ohly@intel.com>

* Update golangci.yaml.in

* Update golangci-strict.yaml

* Update golangci.yaml.in

* Update golangci.yaml.in

* Update golangci.yaml.in

* Update golangci.yaml.in

* Update golangci.yaml

* Update golangci-hints.yaml

* Update golangci-strict.yaml

* Update golangci.yaml.in

* Update golangci.yaml

* Update mux_test.go

---------

Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
Co-authored-by: Patrick Ohly <patrick.ohly@intel.com>
2024-06-28 10:58:05 -07:00