InPlace Pod Vertical Scaling has been introduced as a feature recently,
and with it new unit tests. The feature does not have Windows support
yet, thus, the unit tests fail on Windows.
Fixes unit test which checks Linux-specific fields on Windows.
This was making my eyes bleed as I read over code.
I used the following in vim. I made them up on the fly, but they seemed
to pass manual inspection.
:g/},\n\s*{$/s//}, {/
:w
:g/{$\n\s*{$/s//{{/
:w
:g/^\(\s*\)},\n\1},$/s//}},/
:w
:g/^\(\s*\)},$\n\1}$/s//}}/
:w
This touches cases where FromInt() is used on numeric constants, or
values which are already int32s, or int variables which are defined
close by and can be changed to int32s with little impact.
Signed-off-by: Stephen Kitt <skitt@redhat.com>
In case of node reboot/kubelet restart, the flow of events involves
obtaining the state from the checkpoint file followed by setting
the `healthDevices`/`unhealthyDevices` to its zero value. This is
done to allow the device plugin to re-register itself so that
capacity can be updated appropriately.
During the allocation phase, we need to check if the resources requested
by the pod have been registered AND healthy devices are present on
the node to be allocated.
Also we need to move this check above `needed==0` where needed is
required - devices allocated to the container (which is obtained from
the checkpoint file) because even in cases where no additional devices
have to be allocated (as they were pre-allocated), we still need to
make sure he devices that were previously allocated are healthy.
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
The current error comparison `imagePullResult.err ==
ErrRegistryUnavailable` will never work with any remote runtime, because
we produce gRPC errors which wrap a code and a description, like:
```
rpc error: code = Unknown desc = This is the error description
```
To be able to check custom error types from `pkg/kubelet/images/types.go`,
we now strip the code if the status is unknown on image pull.
Beside that, we use a string comparison to check against
`ErrRegistryUnavailable.Error()`, because validating them via the
`errors` package is not yet supported by grpc-go:
https://github.com/grpc/grpc-go/issues/3616
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
This allows us to return with a timeout error as soon as the
context is canceled. Previously in cases where the mount will
never succeed pods can get stuck deleting for 2 minutes.
In the Sync*Pod methods that call VolumeManager.WaitFor*, we
must filter out wait.Interrupted errors from being logged as
they are part of control flow, not runtime problems. Any
early interruption should result in exiting the Sync*Pod method
as quickly as possible without logging intermediate errors.
The pod worker may recieve a new pod which is marked as terminal in
the runtime cache. This can occur if a pod is marked as terminal and the
kubelet is restarted.
The kubelet needs to drive these pods through the termination state
machine. If upon restart, the kubelet receives a pod which is terminal
based on runtime cache, it indicates that pod finished
`SyncTerminatingPod`, but it did not complete `SyncTerminatedPod`. The
pod worker needs ensure that `SyncTerminatedPod` will run on these pods.
To accomplish this, set `finished=False`, on the pod sync status, to
drive the pod through the rest of the state machine.
This will ensure that status manager and other kubelet subcomponents
(e.g. volume manager), will be aware of this pod and properly cleanup
all of the resources of the pod after the kubelet is restarted.
While making change, also update the comments to provide a bit more
background around why the kubelet needs to read the runtime pod cache
for newly synced terminal pods.
Signed-off-by: David Porter <david@porter.me>
The Boot option for the node log query is not supported on Windows and
an error is returned if the option is set. Because of this, one of the
unit tests which sets the option fails, as it does not expect any errror.
A pod that cannot be started yet (due to static pod fullname
exclusion when UIDs are reused) must be accounted for in the
pod worker since it is considered to have been admitted and will
eventually start.
Due to a bug we accidentally cleared pendingUpdate for pods that
cannot start yet which means we can't report the right metric to
users in kubelet_working_pods and in theory we might fail to start
the pod in the future (although we currently have not observed
that in tests that should catch such an error). Describe, implement,
and test the invariant that when startPodSync returns in every path
that either activeUpdate OR pendingUpdate is set on the status, but
never both, and is only nil when the pod can never start.
This bug was detected by a "programmer error" assertion we added
on metrics that were not being reported, suggesting that we should
be more aggressive on using log assertions and automating detection
in tests.