If the script got killed, it failed to wait properly:
Waiting for Ginkgo with pid 73873...
./hack/ginkgo-e2e.sh: line 236: wait: `{73873}': not a pid or valid job spec
What happens at the moment in e.g. pull-kubernetes-e2e-kind in case of a
timeout is that ginkgo-e2e.sh gets killed with SIGTERM. This is not propagated
to the E2E test suite processes, therefore there is no "Interrupted by User"
report and no JUnit file, depending on timing during the process shutdown.
Running the Ginkgo CLI with job control enabled creates a new process group,
which then can be used to kill the Ginko CLI and the E2E test suite
processes. With these changes, more information is produced. Some of it seems
a bit redundant, but it's better than none:
*** hack/ginkgo-e2e.sh: received termination signal -> asking Ginkgo to stop.
***
*** Beware that a timeout may have been caused by some earlier test,
*** not necessarily the one which gets interrupted now.
*** See the "Spec runtime" for information about how long the
*** interrupted test was running.
------------------------------
Interrupted by User
First interrupt received; Ginkgo will run any cleanup and reporting nodes but will skip all remaining specs. Interrupt again to skip cleanup.
Here's a current progress report:
[sig-node] DRA [Feature:DynamicResourceAllocation] [FeatureGate:DynamicResourceAllocation] [Beta] ResourceSlice Controller creates slices (Spec Runtime: 9
.065s)
k8s.io/kubernetes/test/e2e/dra/dra.go:812
In [It] (Node Runtime: 9.044s)
k8s.io/kubernetes/test/e2e/dra/dra.go:812
At [By Step] Creating slices (Step Runtime: 8.884s)
k8s.io/kubernetes/test/e2e/dra/dra.go:847
...
Begin Additional Progress Reports >>
There is no failure as the matcher passed to Consistently has not yet failed
<< End Additional Progress Reports
------------------------------
• [INTERRUPTED] [11.955 seconds]
[sig-node] DRA [Feature:DynamicResourceAllocation] [FeatureGate:DynamicResourceAllocation] [Beta] ResourceSlice Controller [It] creates slices [sig-node, Feature:DynamicResourceAllocation, FeatureGate:DynamicResourceAllocation, Feature:Beta]
k8s.io/kubernetes/test/e2e/dra/dra.go:812
Timeline >>
STEP: Creating a kubernetes client @ 01/09/25 17:18:59.769
...
[FAILED] in [It] - k8s.io/kubernetes/test/e2e/dra/dra.go:881 @ 01/09/25 17:19:08.835
I0109 17:19:11.703212 302727 helper.go:125] Waiting up to 7m0s for all (but 0) nodes to be ready
STEP: dump namespace information after failure @ 01/09/25 17:19:11.706
STEP: Collecting events from namespace "dra-7998". @ 01/09/25 17:19:11.706
STEP: Found 0 events. @ 01/09/25 17:19:11.708
...
STEP: Destroying namespace "dra-7998" for this suite. @ 01/09/25 17:19:11.72
<< Timeline
[INTERRUPTED] Interrupted by User
In [It] at: k8s.io/kubernetes/test/e2e/dra/dra.go:812 @ 01/09/25 17:19:08.833
This is the Progress Report generated when the interrupt was received:
[sig-node] DRA [Feature:DynamicResourceAllocation] [FeatureGate:DynamicResourceAllocation] [Beta] ResourceSlice Controller creates slices (Spec Runtime: 9
.065s)
...
[FAILED] An interrupt occurred and then the following failure was recorded in the interrupted node before it exited:
Context was cancelled (cause: Interrupted by User) after 0.329s.
There is no failure as the matcher passed to Consistently has not yet failed
In [It] at: k8s.io/kubernetes/test/e2e/dra/dra.go:881 @ 01/09/25 17:19:08.835
------------------------------
Checking for custom logdump instances, if any
----------------------------------------------------------------------------------------------------
k/k version of the log-dump.sh script is deprecated!
Please migrate your test job to use test-infra's repo version of log-dump.sh!
Migration steps can be found in the readme file.
----------------------------------------------------------------------------------------------------
Sourcing kube-util.sh
Detecting project
Skeleton Provider: detect-project not implemented
Dumping logs from master locally to '/tmp/test'
Master SSH not supported for local
Dumping logs from nodes locally to '/tmp/test'
Node SSH not supported for local
Summarizing 1 Failure:
[INTERRUPTED] [sig-node] DRA [Feature:DynamicResourceAllocation] [FeatureGate:DynamicResourceAllocation] [Beta] ResourceSlice Controller [It] creates slices [sig-node, Feature:DynamicResourceAllocation, FeatureGate:DynamicResourceAllocation, Feature:Beta]
k8s.io/kubernetes/test/e2e/dra/dra.go:812
Ran 1 of 6644 Specs in 12.208 seconds
FAIL! - Interrupted by User -- 0 Passed | 1 Failed | 0 Pending | 6643 Skipped
--- FAIL: TestE2E (12.74s)
FAIL
Ginkgo ran 1 suite in 13.379078611s
We have zero flake policy for a long time now (> 1 year) https://github.com/kubernetes/community/pull/7538, however , there are some places that are still tolerating flakes and retrying
Flakes does not help, to the point that when we have to take a hard decision it creates more iuncertainty.
It does not matter how, we should be always able to deal with flakes:
- if the software or algorithm is racy, we need to work to make deterministic
- if is deterministic, the test must be deterministic
- if the test is determinist but it depends on the environment, then we work on making the environment deterministci
We typically have plenty of skipped tests in an E2E run, which causes Ginkgo to
print many S characters (one for each skipped test). This easily fills up an entire
console window. Now the default is to silence that
output. GINKGO_SILENCE_SKIPS=n reverts to the previous behavior.
By default, Ginkgo prints all progress characters (S and o) in a single
line. Buffering in Prow causes that output to occur only after the run is over,
which defeats the purpose of having those characters. Now a newline is added
after each character, so there is visible progress in Prow while Ginkgo runs.
Fixed ginkgo warning
You're using deprecated Ginkgo functionality:
=============================================
--untilItFails is deprecated, use --until-it-fails instead
Used consistent approach with this flag in e2e_node and e2e scripts.
The previous approach was based on the observation that some Prow jobs use the
--report-dir parameter instead of the E2E_REPORT_DIR env variable. Parsing the
command line was necessary to use the --json-report and --junit-report
parameters.
But that is complex and can be avoided by triggering the creation of complete
reports in the E2E test suite. The paths are hard-coded and relative to the
report directory to keep the code simple.
There was a report that k8s-triage started processing more data after
6db4b741dd was merged. It's unclear whether
that was because of the new <report-dir>/ginkgo_report.xml file. To avoid
this potential problem, the reports are now in a "ginkgo" sub-directory.
While at it, error checking gets enhanced:
- Create directories at the start of
the suite and bail out early if that fails.
- *All* e2e suites using the framework do this, not just test/e2e.
- Added missing error checking of truncated JUnit report writing.
From the warning message that ginkgo now emits:
--slow-spec-threshold is deprecated --slow-spec-threshold has been deprecated
and will be removed in a future version of Ginkgo. This feature has proved
to be more noisy than useful. You can use --poll-progress-after, instead, to
get more actionable feedback about potentially slow specs and understand
where they might be getting stuck.
We already use --poll-progress-after.
This is a fix for 104aab81a4: because
the default was not set for E2E_TEST_DEBUG_TOOL, all parameters were always
also passed to the E2E suite.
That wasn't wrong for the parameters so far, but breaks when using something
like --output-dir which is only understood by the CLI.
If the script was called with no arguments, it passed "${@:-}" to the suite,
which expands to one empty argument. That's not right, "${@}" should be used
instead because it expands to nothing when empty.
Most parameters can be passed to both the CLI and the suite, but some
(for example, --ginkgo.slow-spec-threshold) had no effect when only
passed to the suite.
Adding the ability to ignore no schedule flags in testing.
Specifically node.cloudprovider.kubernetes.io/uninitialized:NoSchedule
Fix shellcheck complaint.
The new support in ginkgo for progress reports while a test runs dumps
information about where a test is stuck when it runs too long. This can provide
additional insights into what the test is waiting for.
For the Kubernetes jobs using ginkgo-e2e.sh, such dumps are now enabled after
300 seconds and then get repeated every 20 seconds. The initial delay is
intentionally the same as for warning about a slow test. The rationale is that
such test runtimes are unexpected and may need further information to diagnose
why they are slow.
With -ginkgo.source-root, Ginkgo is able to locate the Kubernetes source code
and display small source code snippets for functions that are related to the
test, determined through a heuristic that assumes that all files under the test
suite are for the tests in it.
Set intercept mode to none will help to reveal more information when the
test hangs,
- https://github.com/onsi/ginkgo/issues/970
or circumvent cases where the code grabbing the stdout/stderr pipe
is not under the framework control and may cause hangs,
- https://github.com/onsi/ginkgo/issues/851
The flag `output-interceptor-mode` is set to `none` as we were trying to
figure out of the rootcase of the test flaky, it's only intended for debugging.
- https://github.com/kubernetes/kubernetes/issues/111086
But this set also has some side effect, since it will turn off stdout/stderr
capture completely, any output to stdout/stderr will be lost.
Now that the root cause is not caused by Ginkgo bump nor how the intercept
mode was set, we'd better to follow the default value.
Signed-off-by: Dave Chen <dave.chen@arm.com>
This applies to all jobs using hack/ginkgo-e2e.sh. This is done because
Spyglass does not render the escape sequences, making test output harder to
read.
It is done here because then we don't need to set GINKGO_NO_COLOR in all the
different Prow job configs.
Default timeout setting has been reduced from `24h` down to `1h` in
Ginkgo V2, but for some long running test this is too short.
How long to abort the test was controlled by the the linux command `timeout`
in V1. e.g. `'timeout -k 30s 150m ...`, and is configured in the file
like `sig-network-misc.yaml`.
Set the timeout manually for Ginkgo V2 to avoid the early aborting.
Signed-off-by: Dave Chen <dave.chen@arm.com>
Some tests have a short timeout for starting the pods (1 minute), but if
those tests happen to be the first ones to run, and the images have to be
pulled, then the test could timeout, especially with larger images. This
commit will allow us to prepull commonly used E2E test images, so this issue
can be avoided.
The image "gcr.io/authenticated-image-pulling/windows-nanoserver:v1" is not a
manifest list, and it is only useful for Windows Server 1809, which means that the
test "should be able to pull from private registry with secret" will fail for
environments with Windows Server 1903, 1909, or any other future version we might
want to test.
This commit adds the the ability to have an alternative private image to pull by
using a configurable docker config file which contains the necessary credentials
needed to pull the image.
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.
Deprecate InfluxDB cluster monitoring
InfluxDB cluster monitoring addon will no longer be supported and will be removed in k8s 1.12.
Default monitoring solution will be changed to `standalone`.
Heapster will still be deployed for backward compatibility of `kubectl top`
```release-note
Stop using InfluxDB as default cluster monitoring
InfluxDB cluster monitoring is deprecated and will be removed in v1.12
```
cc @piosz
In scalability testing influxdb was recently disabled, but we still
trying to execute corresponidng test, as a result it fails all the time.
Skip test if influxdb is disabled.
Transitional part of kubernetes/test-infra#3307, should be eliminated
by kubernetes/test-infra#3330: Allow NODE_INSTANCE_GROUP to be set
before we get here, which eliminates any cluster/gke use if
KUBERNETES_CONFORMANCE_PROVIDER is set to "gke".