What a build failure means
Jenkins reports several distinct end states, and they are not interchangeable. A red FAILURE means a step exited non-zero and threw an exception. A yellow UNSTABLE means the run completed but a quality signal (usually a failing test) was recorded — execution proceeds by default even when the build is unstable. ABORTED means a timeout or manual cancel interrupted the run. Knowing which one you have tells you where to look before you read a single log line.
The useful signal is which stage failed and that stage's log — not the whole console, and not the parent pipeline log when a sub-job is the real culprit.
Diagnose it
Identify the entity type first. A pipeline stage is read from the parent
build; a sub-job (e.g. cd-step-preparation #5819) must be read from that
sub-job's own build. Use the Pipeline REST API to jump straight to the failing
node instead of scrolling the full console:
# Stage-level status for a run (which stage is red / yellow):
curl -s "$JENKINS/job/<job>/<build>/wfapi/describe"
# That node's log only (node id from the describe output above):
curl -s "$JENKINS/job/<job>/<build>/execution/node/<id>/wfapi/log"
# Whole console as a fallback:
curl -s "$JENKINS/job/<job>/<build>/consoleText" | tail -100
# Test results when the run is UNSTABLE rather than FAILURE:
# <job>/<build>/testReport
The wfapi/describe endpoint returns each stage and links to its per-node log
(Pipeline REST API plugin).
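To pull only the non-green stages out of the describe output, a jq filter is one option. This is a minimal sketch, not the plugin's documented workflow: it assumes jq is installed, that the response has a top-level stages array whose entries carry id, name, and status, and that a failed stage is reported as FAILED by this API. The helper name failing_stages is hypothetical.

```shell
# Hypothetical helper: reads a wfapi/describe JSON document on stdin and
# prints "id<TAB>name<TAB>status" for each stage that did not end green.
# Assumption: jq is available and stages expose id, name, and status.
failing_stages() {
  jq -r '.stages[]
         | select(.status == "FAILED" or .status == "UNSTABLE" or .status == "ABORTED")
         | [.id, .name, .status] | @tsv'
}

# Usage (same placeholders as the curl commands above):
#   curl -s "$JENKINS/job/<job>/<build>/wfapi/describe" | failing_stages
```

The id column is what goes into the per-node log URL, so the output can be fed straight into the second curl command.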
Causes — diagnose and fix each
1. Test failure (UNSTABLE, not FAILURE)
A failing test recorded by junit marks the run UNSTABLE (yellow), distinct from FAILURE (red).
- Diagnose: the run is yellow; the shell step that ran the tests still exited zero. Open <build>/testReport for the failed cases.
- Fix: fix the test or the code under test. If later stages must not run once the build is unstable, set options { skipStagesAfterUnstable() }, and act in the unstable post condition, not failure.
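The failed cases can also be listed from the command line instead of the web page. The sketch below assumes jq is installed and that the JUnit plugin's JSON view of the test report (testReport/api/json) nests cases under suites[].cases[], with failing cases marked FAILED or REGRESSION. The helper name failed_tests is hypothetical.

```shell
# Hypothetical helper: reads <build>/testReport/api/json on stdin and prints
# one "className.name" line per failing test case.
# Assumption: jq is available and the report uses suites[].cases[] with a
# status field of FAILED (still failing) or REGRESSION (newly failing).
failed_tests() {
  jq -r '.suites[].cases[]
         | select(.status == "FAILED" or .status == "REGRESSION")
         | "\(.className).\(.name)"'
}

# Usage:
#   curl -s "$JENKINS/job/<job>/<build>/testReport/api/json" | failed_tests
```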
2. Compile / step error (FAILURE)
A script that exits non-zero causes the step to fail with an exception; this is the red, most common case.
- Diagnose: the stage log shows the compiler or command error directly.
- Fix: correct the code or command and re-run.
3. Swallowed failure via returnStatus / catchError
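A sh step with returnStatus: true returns the status code instead of throwing. catchError catches the exception and continues with the next step; by default it marks the build FAILURE while leaving the stage result at SUCCESS, and with buildResult: 'UNSTABLE' it marks the build UNSTABLE instead. Either can hide a real failure.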
- Diagnose: a step that clearly errored did not fail the stage; check the Jenkinsfile for returnStatus, catchError, or a swallowed exit code.
- Fix: branch on the returned status and fail explicitly, or remove the catchError wrapper so the non-zero exit propagates.
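A quick way to run the check described above is to grep the Jenkinsfile for the constructs that can mask a non-zero exit. This is a rough sketch: the pattern covers returnStatus, catchError, and the shell idiom "|| true" as one example of a swallowed exit code, and it will also match comments and strings. The helper name find_swallowed is hypothetical.

```shell
# Hypothetical helper: list the numbered lines of a Jenkinsfile that can hide
# a failing step. Defaults to ./Jenkinsfile when no path is given.
# Assumption: "|| true" stands in for swallowed exit codes; extend the
# pattern for whatever wrappers your pipelines use.
find_swallowed() {
  grep -nE 'returnStatus|catchError|[|]{2}[[:space:]]*true' "${1:-Jenkinsfile}"
}

# Usage:
#   find_swallowed path/to/Jenkinsfile
```

Each hit is a place to confirm that the returned status is actually inspected, not proof of a bug on its own.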
4. Out of memory — exit code 137 (node-level)
If the host runs short on memory, the kernel OOM killer can terminate the process; on Linux you see exit code 137 (128 + 9, where 9 is the signal number of SIGKILL). This is a node/agent resource limit, not a code bug.
- Diagnose: the log ends abruptly with 137 (or Killed), often mid-test or mid-build with no application error.
- Fix: raise the agent/container memory limit or the JVM/tool heap, or reduce parallelism on that agent. Confirm by checking the agent's dmesg/OOM logs.
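The 128 + signal arithmetic applies to any exit status above 128, so it can be decoded mechanically. The sketch below relies on the common shell convention that such a status means "terminated by signal (status - 128)" and on kill -l to translate a signal number into its name. The helper name explain_exit is hypothetical.

```shell
# Hypothetical helper: explain a shell exit status.
# Statuses above 128 conventionally mean the process was terminated by
# signal number (status - 128).
explain_exit() {
  code=$1
  if [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    # kill -l N prints the name of signal number N, without the SIG prefix.
    echo "$code = 128 + $sig (SIG$(kill -l "$sig"))"
  else
    echo "$code = ordinary exit status"
  fi
}

# Usage:
#   explain_exit 137
```

A SIGKILL on its own does not prove an OOM kill. On the agent, something like dmesg -T | grep -iE 'out of memory|killed process' usually surfaces the kernel's record of it; that command may need root, and the exact message wording varies by kernel version.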
5. Agent disconnection (infrastructure, not code)
A controller-to-agent channel drop fails the running step with errors such as
channel closed or Backing channel ... is disconnected. Pipeline builds can
often survive a brief reconnect, but a dropped agent still aborts in-flight steps
(durable-task / nodes-and-processes plugin).
- Diagnose: the failure message is about the connection/agent, not your build command; ephemeral (e.g. Kubernetes) agents may have been evicted or terminated.
- Fix: wrap the flaky stage in retry and timeout options so a fresh agent is allocated, and address the underlying node capacity/eviction.
6. Missing or changed dependency
A version bump or an unavailable artifact breaks a stage that worked before.
- Diagnose: the stage log shows a resolve/download/version error; diff the lockfile or pinned versions against the last green build.
- Fix: pin or restore the working version, or repair the registry/proxy.
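One way to do the lockfile diff is to ask Jenkins which commit the last green build used and diff against it. This sketch assumes jq is installed, that the job checks out from Git, and that the Git plugin records the commit as lastBuiltRevision.SHA1 inside the build's actions array. The helper name last_green_sha and the lockfile name in the usage lines are hypothetical.

```shell
# Hypothetical helper: reads <job>/lastStableBuild/api/json on stdin and
# prints the first Git commit SHA recorded for that build.
# Assumption: jq is available and the Git plugin stores the commit under
# actions[].lastBuiltRevision.SHA1; other SCM plugins use other fields.
last_green_sha() {
  jq -r 'first(.actions[] | objects | .lastBuiltRevision.SHA1 // empty)'
}

# Usage (package-lock.json is only an example; substitute your lockfile):
#   sha=$(curl -s "$JENKINS/job/<job>/lastStableBuild/api/json" | last_green_sha)
#   git diff "$sha" HEAD -- package-lock.json
```

lastStableBuild is used rather than lastSuccessfulBuild because the latter also counts UNSTABLE runs.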
7. Environment / agent drift
A tool version or credential changed on the agent since the last green run.
- Diagnose: the command exists but behaves differently, or auth now fails; the change is on the agent, not in the repo.
- Fix: restore the expected tool version/credential, or label the job to a known-good agent.
8. Flaky infrastructure
A transient network or registry error that a retry resolves.
- Diagnose: the same run passes on re-run with no code change.
- Fix: wrap the unreliable step in retry(n) and stabilize the dependency.
Fix it
- Read the end state first: red = FAILURE, yellow = UNSTABLE, grey = ABORTED. That alone narrows the cause list above.
- Find the failing stage from the stage view (or wfapi/describe); open that stage's per-node log. For a sub-job, read the sub-job's build directly.
- Match the log to a cause above and apply its fix.
- Diff against the last green build: triggering commit, dependency, agent change.
- Re-run. If it only fails intermittently, treat it as flaky and wrap it in retry/timeout rather than re-running by hand.
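The "match the log to a cause" step can be given a rough first pass with grep. The sketch below checks only the two causes this guide gives literal log markers for (agent disconnection and exit code 137 / Killed) and reports everything else as unknown, so it is a triage aid under those assumptions, not a classifier. The helper name guess_cause is hypothetical.

```shell
# Hypothetical helper: reads a stage log on stdin and prints a first guess
# at the cause, using only the log markers named in this guide.
# The first matching pattern wins; "unknown" means read the log by hand.
guess_cause() {
  log=$(cat)
  if printf '%s\n' "$log" | grep -qE 'channel closed|Backing channel .* is disconnected'; then
    echo "agent disconnection"
  elif printf '%s\n' "$log" | grep -qwE 'exit code 137|Killed'; then
    echo "out of memory (137)"
  else
    echo "unknown"
  fi
}

# Usage:
#   curl -s "$JENKINS/job/<job>/<build>/execution/node/<id>/wfapi/log" | guess_cause
```

The disconnection check runs first because a dropped agent can also leave killed processes behind, and the infrastructure cause is the one to rule out before tuning memory.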
How Intellira diagnoses this
Intellira reads the build info, the run's end state, the correct stage/sub-job
logs and test results from the Jenkins MCP server, isolates the failing step,
and ties it to the triggering commit — so you get "stage unit-tests is UNSTABLE
on commit a1f9c2e, three assertions failed" or "stage build exited 137 on the
agent" instead of a wall of console output.
Sources
- Diagnosing Errors — exit code 137 / OOM
- Pipeline: Nodes and Processes (sh, returnStatus, returnStdout)
- Pipeline: Basic Steps (catchError)
- Recording tests and artifacts — UNSTABLE vs FAILED
- Pipeline Syntax — post conditions, options, retry, timeout, skipStagesAfterUnstable
- Pipeline REST API plugin — wfapi describe and node logs
By Intellira Engineering. AI-assisted draft, reviewed by the Intellira engineering team; claims cited inline; last verified 2026-06-02.