ServiceStartTimeout is defined in the e2e core framework, and that
definition is used in many places, while the copy in endpoints/ports.go
is not used anywhere else.
This removes the copy in endpoints/ports.go as a cleanup.
The manifest list is stateful, which means that the same list will get amended
with each successive image published. That's unintended, and can lead to the
wrong image being pulled from the manifest list.
Resets the manifest list before amending new images into it.
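A minimal sketch of the reset-then-amend sequence, shelling out to the docker CLI from Go; the real change lives in the image publishing scripts, and the list and image names below are placeholders:

```go
package main

import (
	"fmt"
	"os/exec"
)

// docker runs `docker <args...>` and wraps any failure with its output.
func docker(args ...string) error {
	out, err := exec.Command("docker", args...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("docker %v failed: %v\n%s", args, err, out)
	}
	return nil
}

func main() {
	// Hypothetical manifest list and per-architecture images.
	list := "example.io/app/image:v1"
	images := []string{list + "-amd64", list + "-arm64"}

	// Reset: drop any locally cached manifest list left over from a previous
	// publish, so stale entries cannot be amended into the new list.
	// The error is ignored because the list may simply not exist yet.
	_ = docker("manifest", "rm", list)

	// Recreate the list and amend each per-arch image into it.
	for _, img := range images {
		if err := docker("manifest", "create", "--amend", list, img); err != nil {
			panic(err)
		}
	}

	// --purge removes the local copy after pushing, keeping the next run clean.
	if err := docker("manifest", "push", "--purge", list); err != nil {
		panic(err)
	}
}
```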
The test was flaky because it required the Job to succeed 3 times, with
a pseudorandom 50% failure chance per run, within 15 minutes, while
there is an exponential back-off delay (10s, 20s, 40s …) capped at
6 minutes before failed Pods are recreated. Since 7 consecutive
failures (a 1/128 chance) could take 20+ minutes, exceeding the
timeout, the test failed intermittently with "timed out waiting for
the condition".
This PR forces the Pods of a Job to be scheduled to a single node and
uses a hostPath volume instead of an emptyDir to persist data across new
Pods.
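A minimal sketch of the shape of that fix using the standard Kubernetes API types; the node name, image, command, and hostPath below are illustrative placeholders, not the actual test code:

```go
package example

import (
	batchv1 "k8s.io/api/batch/v1"
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// exampleJob pins every Pod of the Job to one node and mounts a hostPath
// volume, so state written by a failed Pod is visible to its replacement.
func exampleJob(nodeName string) *batchv1.Job {
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "rerun-until-success"},
		Spec: batchv1.JobSpec{
			Template: v1.PodTemplateSpec{
				Spec: v1.PodSpec{
					NodeName:      nodeName, // all Pods land on the same node
					RestartPolicy: v1.RestartPolicyNever,
					Volumes: []v1.Volume{{
						Name: "data",
						VolumeSource: v1.VolumeSource{
							HostPath: &v1.HostPathVolumeSource{Path: "/tmp/job-e2e-data"},
						},
					}},
					Containers: []v1.Container{{
						Name:  "c",
						Image: "busybox",
						// Fail on the first run, succeed once the marker file
						// left by the previous Pod is found (illustrative only).
						Command: []string{"sh", "-c",
							"if [ -f /data/done ]; then exit 0; else touch /data/done; exit 1; fi"},
						VolumeMounts: []v1.VolumeMount{{Name: "data", MountPath: "/data"}},
					}},
				},
			},
		},
	}
}
```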
The e2e test "Kubectl Port forwarding With a server listening .."
is failed sometimes due to the difference between expected data and
received data. To receive the data, the test does CloseWrite() but
it didn't have the corresponding error handling.
This adds it to investigate the issue more.
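A self-contained sketch of the pattern being added; the real test uses the e2e framework's failure helpers, and the address and payload below are placeholders:

```go
package main

import (
	"io"
	"log"
	"net"
)

func main() {
	// Hypothetical address of the locally forwarded port.
	conn, err := net.Dial("tcp", "127.0.0.1:8080")
	if err != nil {
		log.Fatalf("couldn't connect to the forwarded port: %v", err)
	}
	defer conn.Close()

	if _, err := conn.Write([]byte("request payload")); err != nil {
		log.Fatalf("couldn't send data: %v", err)
	}

	// Half-close the write side so the server sees EOF and replies.
	// This is the call that previously had no error handling in the test.
	if err := conn.(*net.TCPConn).CloseWrite(); err != nil {
		log.Fatalf("couldn't CloseWrite the connection: %v", err)
	}

	reply, err := io.ReadAll(conn)
	if err != nil {
		log.Fatalf("couldn't read the reply: %v", err)
	}
	log.Printf("received %d bytes", len(reply))
}
```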
Not all errors surface synchronously from Instances.Insert(...).Do(),
so it is important to inspect the returned operation object to see why
the insert failed. One example is exceeding the resource quota, e.g.:
could not create instance test-cos-beta-80-12739-29-0: [&{Code:QUOTA_EXCEEDED Location: Message:Quota 'CPUS' exceeded. Limit: 24.0 in region europe-west6. ForceSendFields:[] NullFields:[]}
This fixes the issue where tests fail "silently" when the instance
insert fails.
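For illustration, a minimal sketch of the check-the-operation pattern using the google.golang.org/api/compute/v1 client; the project, zone, and instance values are placeholders, and the actual test-infra code differs:

```go
package main

import (
	"context"
	"log"
	"time"

	compute "google.golang.org/api/compute/v1"
)

func main() {
	ctx := context.Background()
	svc, err := compute.NewService(ctx)
	if err != nil {
		log.Fatalf("could not create compute client: %v", err)
	}

	// Placeholder project/zone/instance values.
	project, zone := "my-project", "europe-west6-a"
	instance := &compute.Instance{Name: "test-instance"}

	// A nil error here only means the request was accepted; the insert
	// itself completes (or fails) asynchronously.
	op, err := svc.Instances.Insert(project, zone, instance).Do()
	if err != nil {
		log.Fatalf("could not create instance %s: %v", instance.Name, err)
	}

	// Poll the zone operation until it is done, then check op.Error, which
	// is where asynchronous failures such as QUOTA_EXCEEDED are reported
	// (each entry carries Code/Location/Message, as in the log line above).
	for op.Status != "DONE" {
		time.Sleep(5 * time.Second)
		polled, err := svc.ZoneOperations.Get(project, zone, op.Name).Do()
		if err != nil {
			log.Fatalf("could not poll operation %s: %v", op.Name, err)
		}
		op = polled
	}
	if op.Error != nil {
		for _, e := range op.Error.Errors {
			log.Printf("insert failed: %+v", e)
		}
		log.Fatalf("could not create instance %s", instance.Name)
	}
}
```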
This is currently the top flake against PRs, so I'm tagging it
as [Flaky]. Flaky tests can't be conformance tests, so I'm
removing it from [Conformance] as well until this is resolved.
It turns out that the e2e test was not using the timeout that holds
the CLOSE_WAIT status, hence the test was flaky depending on how
quickly it checked the conntrack table.
This PR replaces the dependency on SSH with a pod that checks the
conntrack entries on the host in a loop, making the test more robust
and reducing the flakiness caused by race conditions and/or SSH issues.
It also fixes a bug in grepping for the conntrack entry, where the
error was swallowed if no entry was found.
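A sketch of the shape of the polling check; the real test runs the command inside a hostNetwork pod on the node, while here conntrack is run locally to keep the example self-contained, and the port and timeout are placeholders:

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// listConntrack dumps the TCP conntrack table. In the real test the command
// runs inside a hostNetwork pod on the node under test.
func listConntrack() (string, error) {
	out, err := exec.Command("conntrack", "-L", "-p", "tcp").CombinedOutput()
	return string(out), err
}

// waitForCloseWaitEntry polls until a CLOSE_WAIT entry for the given port
// shows up, filtering in Go instead of piping through grep so that "no entry
// yet" is retried while real command failures are surfaced, not swallowed.
func waitForCloseWaitEntry(port int, timeout time.Duration) error {
	return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		out, err := listConntrack()
		if err != nil {
			return false, err // real failure: stop polling and report it
		}
		for _, line := range strings.Split(out, "\n") {
			if strings.Contains(line, "CLOSE_WAIT") &&
				strings.Contains(line, fmt.Sprintf("dport=%d", port)) {
				return true, nil
			}
		}
		return false, nil // not found yet: keep polling until the timeout
	})
}

func main() {
	if err := waitForCloseWaitEntry(9000, 1*time.Minute); err != nil {
		fmt.Println("conntrack entry not found:", err)
	}
}
```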