Since the GA graduation of memory manager in https://github.com/kubernetes/kubernetes/pull/128517
we are sharing the initial container map across managers.
The intention of this sharing was not to actually share a data
structure, but
1. save the relatively expensive relisting from runtime
2. have all the managers share a consistent view - even though the
chance for misalignement tend to be tiny.
The unwanted side effect though is now all the managers race
to modify a data shared, not thread safe data structure.
The fix is to clone (deepcopy) the computed map when passing it
to each manager. This restores the old semantic of the code.
This issue brings the topic of possibly managers go out of sync
since each of them maintain a private view of the world.
This risk is real, yet this is how the code worked for
most of the lifetime, so the plan is to look at this and evaluate
possible improvements later on.
Signed-off-by: Francesco Romani <fromani@redhat.com>
The code from github.com/opencontainers/runc/libcontainer/userns package
was moved into github.com/moby/sys/user and github.com/moby/sys/userns
(see [1]), and the runc package is now deprecated in favor of moby/sys
(see [2]).
In addition, moby/sys/userns now has a non-Linux implementation, so
pkg/kubelet/user/userns package (introduced in commit 2e999ff to make a
non-Linux implementation) is not really needed anymore.
Let's switch to moby/sys/userns, and remove the package.
[1]: https://github.com/moby/sys/releases/tag/userns%2Fv0.1.0
[2]: https://github.com/opencontainers/runc/pull/4350
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This adds the ability to select specific requests inside a claim for a
container.
NodePrepareResources is always called, even if the claim is not used by any
container. This could be useful for drivers where that call has some effect
other than injecting CDI device IDs into containers. It also ensures that
drivers can validate configs.
The pod resource API can no longer report a class for each claim because there
is no such 1:1 relationship anymore. Instead, that API reports claim,
API devices (with driver/pool/device as ID) and CDI device IDs. The kubelet
itself doesn't extract that information from the claim. Instead, it relies on
drivers to report this information when the claim gets prepared. This isolates
the kubelet from API changes.
Because of a faulty E2E test, kubelet was told to contact the wrong driver for
a claim. This was not visible in the kubelet log output. Now changes to the
claim info cache are getting logged. While at it, naming of variables and some
existing log output gets harmonized.
Co-authored-by: Oksana Baranova <oksana.baranova@intel.com>
Co-authored-by: Ed Bartosh <eduard.bartosh@intel.com>
When --fail-swap-on=false kubelet CLI argument
is provided, but tmpfs noswap is not supported
by the kernel, warn about the risks of memory-backed
volumes being swapped into disk
Signed-off-by: Itamar Holder <iholder@redhat.com>
The information is received from the DRA driver plugin through a new gRPC
streaming interface. This is backwards compatible with old DRA driver kubelet
plugins, their gRPC server will return "not implemented" and that can be
handled by kubelet. Therefore no API break is needed.
However, DRA drivers need to be updated because the Go API changed. They can
return
status.New(codes.Unimplemented, "no node resource support").Err()
if they don't support the new ListAndWatchResources method and
structured parameters.
The controller in kubelet then synchronizes this information from the driver
with NodeResourceSlice objects, creating, updating and deleting them as needed.
This removes deprecated sets.String and sets.Int
- replace sets.String with sets.Set[string]
- replace sets.Int with sets.Set[int]
- replace sets.NewString with sets.New[string]
- replace sets.NewInt with sets.New[int]
- replace sets.(OLD).List with sets.List(NEW)
This change adds CDI device IDs to the ContainerAllocateResponse in the
device plugin API. This allows a device plugin to specify CDI devices
by their unique fully-qualified CDI device names using the related field
in the CRI specification.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
When kubelet initializes, runs admission for pods and possibly
allocated requested resources. We need to distinguish between
node reboot (no containers running) versus kubelet restart (containers
potentially running).
Running pods should always survive kubelet restart.
This means that device allocation on admission should not be attempted,
because if a container requires devices and is still running when kubelet
is restarting, that container already has devices allocated and working.
Thus, we need to properly detect this scenario in the allocation step
and handle it explicitely. We need to inform
the devicemanager about which pods are already running.
Note that if container runtime is down when kubelet restarts, the
approach implemented here won't work. In this scenario, so on kubelet
restart containers will again fail admission, hitting
https://github.com/kubernetes/kubernetes/issues/118559 again.
This scenario should however be pretty rare.
Signed-off-by: Francesco Romani <fromani@redhat.com>
Currently claimInfo CDIDevices and annotations access directly without RLock.
This can lead to concurrent read write error.
To avoid it we added RLock all before getting the CDIDevices and annotations
Signed-off-by: Moshe Levi <moshele@nvidia.com>
The checkpointing mechanism will repopulate DRA Manager in-memory cache on kubelet restart.
This will ensure that the information needed by the PodResources API is available across
a kubelet restart.
The ClaimInfoState struct represent the DRA Manager in-memory cache state in checkpoint.
It is embedd in the ClaimInfo which also include the annotation field. The separation between
the in-memory cache and the cache state in the checkpoint is so we won't be tied to the in-memory
cache struct which may change in the future. In the ClaimInfoState we save the minimal required fields
to restore the in-memory cache.
Signed-off-by: Moshe Levi <moshele@nvidia.com>
With Topology Manager enabled by default, we no longer need
`resourceAllocator` as Topology Manager serves as the main
PodAdmitHandler completely responsible for admission check
based on hints received from the hintProviders and the
subsequent allocation of the corresponding resources to a
pod as can be seen here:
https://github.com/kubernetes/kubernetes/blob/v1.26.0/pkg/kubelet/cm/topologymanager/scope.go#L150
With regard to DRA, the passing of `cm.draManager` into
resourceAllocator seems redundant as no admission checks
(and allocation of resources handled by DRA) is taking place
in `Admit` method of resourceAllocator. DRA has a completely
different model to the rest of the resource managers where
pod is only scheduled on a node once resources are reserved
for it. Because of this, admission checks or waiting for
resources to be provisioned after the pod has been scheduled
on the node is not required.
Before making the above change, it was verified that DRA Manager
is instantiated in `NewContainerManager`:
https://github.com/kubernetes/kubernetes/blob/v1.26.0/pkg/kubelet/cm/container_manager_linux.go#L318
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
Since Topology manager is graduating to GA, we remove
internal configuration variable names with `Experimental`
prefix.
There is no expected change in behavior, only trival
variable renaming.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
In 'set', conversions to slice are done also, but with different names:
ToSliceNoSort() -> UnsortedList()
ToSlice() -> List()
Reimplement List() in terms of UnsortedList to save some duplication.
Dependencies need to be updated to use
github.com/container-orchestrated-devices/container-device-interface.
It's not decided yet whether we will implement Topology support
for DRA or not. Not having any toppology-related code
will help to avoid wrong impression that DRA is used as a hint
provider for the Topology Manager.
CPUManager is going GA, thus it makes little sense
to keep the names of the internal configuration
variables `Experimental*`.
Trivial rename only.
Signed-off-by: Francesco Romani <fromani@redhat.com>
With graduation of device plugins to GA in 1.26, the feature gate is
enabled by default so `devicePluginEnabled` field no longer needs to
be passed at the time of Container Manager creation.
In addition to that, we remove the `ManagerStub` as it is no longer
needed.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>