After a Node has stopped posting heartbeats for nodeMonitorGracePeriod,
it is considered unreachable: its Ready condition is set to Unknown,
a NoSchedule taint is added, and all Pods on it are set to NotReady.
However, there is always a delay of 5s before the NoExecute taint is
added to the Node, which adds 5s to the recovery time of Pods that are
supposed to be evicted by the taint and recreated on other Nodes sooner.
The delay exists because processTaintBaseEviction() uses the last observed
Ready condition of the Node, instead of the current one, to determine
whether it should add the Node to the taint queue. When a Node is marked
unreachable due to missing heartbeats, the last observed Ready condition
is still True while the current Ready condition is Unknown;
processTaintBaseEviction() should use the latter.
Signed-off-by: Quan Tian <qtian@vmware.com>
TestCancelEviction flaked at a 0.01% rate because it assumed that an event had
already been created once the pod was updated, but that was only true under
some timing conditions.
fake.Clientset suffers from a race condition related to informers:
it does not implement resource version support in its Watch
implementation and instead assumes that watches are set up
before further changes are made.
If a test waits for caches to be synced and then immediately
adds an object, that new object will never be seen by event handlers
if the race goes wrong and the Watch call has not completed yet
(this can be triggered by adding a sleep before b53b9fb557/staging/src/k8s.io/client-go/tools/cache/reflector.go (L431)).
To work around this, we count all watches and only proceed when
all of them are in place. This replaces the normal watch reactor
(b53b9fb557/staging/src/k8s.io/client-go/kubernetes/fake/clientset_generated.go (L161-L173)).
Creating the shared informer factory and starting it can be done in a single
function, which makes it more obvious what happens in which order and avoids
some code duplication.
The timed worker queue can actually have nil entries in its map if the work was
kicked off immediately. This looks like an unnecessary special case (it would
be fine to call AfterFunc with a duration <= 0, and it would do the right
thing), but to avoid more sweeping changes the fix consists of documenting this
special behavior and adding a nil check.
Normally the scheduler shouldn't schedule when there is a taint, but perhaps it
didn't know about the taint yet.
The TestEviction/update test covered this, but only failed under the right
timing conditions. The new event handler test case covers it reliably.
Events get recorded in the apiserver asynchronously, so even if the test knows
that the pod has been evicted because the pod is deleted, it still has to
check separately that the event gets recorded.
This caused a flake in the "Consistently" check of events.
There was one error path that led to a "controller has shut down" log
message. Other errors caused different log entries or were so unlikely (event
handler registration failure!) that they weren't checked at all.
It's clearer to let Run return an error in all cases and then log the
"controller has shut down" error at the call site. This also enables tests to
mark themselves as failed, should that ever happen.
In tests it is sometimes unavoidable to use the Prometheus types directly,
for example when writing a custom gatherer which needs to normalize data
before testing it. device_taint_eviction_test.go does this to strip
out unpredictable data in a histogram.
With type aliases in a package that is explicitly meant for tests, we
can avoid adding exceptions for such tests to the global exception list.
The controller is derived from the node taint eviction controller.
In contrast to that controller, it tracks the UID of pods to prevent
deleting the wrong pod when the original pod got replaced.