The causality chain: from commit to incident

The first question is always "what changed?"

When a healthy service starts failing, the highest-yield question is not "what is the error?" but "what changed, and when?" Google's SRE literature explains why: "In Google's experience, a majority of incidents are triggered by binary or configuration pushes" (Google SRE Workbook, Canarying Releases). The troubleshooting chapter of the original SRE book calls recent change the place to start: "Recent changes to a system can be a productive place to start identifying what's going wrong" (Google SRE, Effective Troubleshooting).

So the fastest path to root cause is usually to reconstruct the delivery chain — commit, build, deploy, runtime — and find the link in it that moved. The catch: the evidence for each link lives in a different system, and a tool that owns only one layer of that chain cannot see the link that broke it. This piece argues that correlating the chain beats deepening any single layer, then draws the honest limit: correlation in time is a strong lead, not proof of cause.

Why "what changed?" dominates an investigation

Working systems tend to keep working. The SRE book frames it as inertia: "a working computer system tends to remain in motion until acted upon by an external force, such as a configuration change or a shift in the type of load served" (Effective Troubleshooting). Most of the time, the external force is a change you shipped.

DORA's research operationalizes this as change failure rate: "the ratio of deployments that require immediate intervention following a deployment," meaning a rollback or a hotfix (dora.dev, DORA metrics). It is one of two stability metrics DORA tracks, precisely because deployments are where stability is won or lost. The 2024 State of DevOps report puts the elite band around a 5% change failure rate, with mid and high clusters landing near 10–20% (2024 DORA report summary, getDX). Read that the other way: even strong teams expect a meaningful share of deploys to cause harm. Change failure rate measures how often a deploy fails, not how often a failure traces back to a deploy, but taken together with Google's observation that most incidents are triggered by pushes, the prior that "a recent change did this" is high enough that checking it first is not a hunch. It is the expected-value play.
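
As a worked example of the definition quoted above, change failure rate is a simple ratio. The sketch below assumes a hypothetical `Deployment` record with a `needed_intervention` flag; neither name comes from DORA, and the data is made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    revision: str               # hypothetical identifier, e.g. a Git SHA
    needed_intervention: bool   # True if a rollback or hotfix followed


def change_failure_rate(deployments: list[Deployment]) -> float:
    """Share of deployments that required immediate intervention."""
    if not deployments:
        raise ValueError("no deployments in the window")
    failed = sum(1 for d in deployments if d.needed_intervention)
    return failed / len(deployments)


# Illustrative data only: 40 deployments, 2 of which needed a rollback.
history = [
    Deployment(f"rev-{i}", needed_intervention=(i in {7, 23}))
    for i in range(40)
]
cfr = change_failure_rate(history)
print(f"change failure rate: {cfr:.1%}")
```

Two failures in forty deployments gives 5%, the figure the report associates with the elite band. How "required immediate intervention" gets recorded is a team decision that this sketch leaves open.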

Be precise about the terms here. A deployment ships a new artifact to an environment; a release exposes it to users; an incident is the user-visible failure. The causal link you are hunting is often a deployment that became a release without anyone deciding it should: a config value flipped, a flag defaulted on, or an ArgoCD sync reconciled the cluster to a bad commit. Conflating these three is how teams look at the wrong timestamp.

The chain has four links, and they live in four systems

"What changed?" is rarely answered by one tool, because a single logical change leaves evidence in four separate places:

Commit — the diff itself: a raised memory request, a changed query, a new dependency. Lives in Bitbucket/GitHub, keyed by SHA and author.
Build — what that commit produced: a new image tag/digest, pinned (or drifted) dependencies, build flags. Lives in Jenkins/CI, keyed by build number.
Deploy — when and where the artifact went live: the rollout, the sync, the Helm values that were actually applied. Lives in ArgoCD/Kubernetes, keyed by revision and timestamp.
Runtime — the symptom: the crash loop, the latency spike, the OOMKill, the 5xx surge. Lives in your metrics/logs/traces.

The causal sentence you want reads across all four: this commit, built into this image, deployed at this time, produced this symptom. Each system holds one fragment and is blind to the rest. That structural blindness is the real problem — not a missing dashboard.
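
A minimal sketch of that join, assuming each system can export records shaped like the hypothetical dataclasses below. The field names, the image digest, the build number, and the revision are illustrative and are not taken from any tool's API.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Sequence


@dataclass(frozen=True)
class Commit:          # from the VCS
    sha: str
    summary: str


@dataclass(frozen=True)
class Build:           # from CI
    number: int
    commit_sha: str
    image_digest: str


@dataclass(frozen=True)
class Deploy:          # from the GitOps controller
    revision: str
    image_digest: str
    synced_at: datetime


@dataclass(frozen=True)
class Symptom:         # from metrics, logs, or traces
    description: str
    started_at: datetime


def causal_sentence(symptom: Symptom, deploys: Sequence[Deploy],
                    builds: Sequence[Build],
                    commits: Sequence[Commit]) -> Optional[str]:
    """Walk back from the symptom to a commit; None if any link is missing."""
    earlier = [d for d in deploys if d.synced_at <= symptom.started_at]
    if not earlier:
        return None
    deploy = max(earlier, key=lambda d: d.synced_at)  # latest deploy before the symptom
    build = next((b for b in builds if b.image_digest == deploy.image_digest), None)
    if build is None:
        return None
    commit = next((c for c in commits if c.sha == build.commit_sha), None)
    if commit is None:
        return None
    return (f"commit {commit.sha} ({commit.summary}), built as #{build.number} "
            f"into {build.image_digest}, synced at {deploy.synced_at:%H:%M}, "
            f"preceded '{symptom.description}' at {symptom.started_at:%H:%M}")


# Illustrative records; every value here is made up.
day = datetime(2024, 1, 15)
commits = [Commit("a1f9c2e", "raise cache size")]
builds = [Build(412, "a1f9c2e", "sha256:demo")]
deploys = [Deploy("rev-88", "sha256:demo", day.replace(hour=14, minute=1))]
symptom = Symptom("5xx surge", day.replace(hour=14, minute=3))
sentence = causal_sentence(symptom, deploys, builds, commits)
print(sentence)
```

The function says "preceded", not "produced". It assembles a candidate chain from identities and timestamps, and returns nothing if any of the four fragments is missing.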

Why single-layer observability can't close the chain

The "three pillars" — metrics, logs, and traces — are the canonical observability model: metrics alert you to a problem, traces show the path of execution, and logs give the context to resolve it (IBM, three pillars of observability). They are necessary. They are also, on their own, confined to the runtime link of the chain. They tell you the service started returning errors at 14:03. They do not, by construction, tell you that a Jenkins build at 13:58 produced an image from commit a1f9c2e that ArgoCD synced at 14:01.

That gap is acknowledged inside the observability world itself. As one vendor's own framing of the three pillars notes, "it can be useful... to contextualize logs, metrics and traces with data from a CI/CD pipeline to help you determine which application update or redeployment correlates with a performance degradation" (Elastic, 3 pillars of observability). Deployment data is a separate source you have to bring alongside the pillars — it is not inside them. A metrics platform that does not ingest your CI build numbers, your VCS commit history, and your GitOps sync events is, by design, looking at one link of a four-link chain and inferring the rest by hand.

This is where cross-system correlation pays off. AIOps tooling that ingests deployment markers exists precisely to close this gap: it "automatically highlights recent code deployments, configuration changes, or infrastructure events that temporally correlate with incidents, providing immediate clues for root cause analysis," and the telemetry foundation it depends on explicitly includes "deployment markers... configuration changes" alongside logs, metrics, and traces (Splunk, AIOps explained). The deployment marker is the join key between the runtime symptom and the change that caused it. Without it, you are alt-tabbing between four consoles and reconstructing a timeline from memory.

A concrete instance of the chain: a pod that gets OOMKilled shows up at runtime as exit code 137 and a crash loop. The metrics tell you memory hit the ceiling. They do not tell you that the ceiling was a limits.memory left unchanged while a commit raised the app's cache size. That fact lives in the diff, at the other end of the chain. Diagnosis means walking from the runtime symptom back to the commit, not staring harder at the saturation graph.
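
The arithmetic behind that example can be sketched as follows. All sizes are hypothetical, the helper names are invented for illustration, and the parser handles only the Ki, Mi, and Gi suffixes, which is a small subset of what Kubernetes accepts.

```python
_BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def to_bytes(quantity: str) -> int:
    """Convert a quantity such as '512Mi' to bytes (binary suffixes only)."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: treat as a plain byte count


def cache_exceeds_limit(cache_size: str, baseline: str, memory_limit: str) -> bool:
    """True if cache plus baseline usage no longer fits under the limit."""
    return to_bytes(cache_size) + to_bytes(baseline) > to_bytes(memory_limit)


# Hypothetical numbers: the commit raised the cache from 256Mi to 768Mi,
# limits.memory stayed at 1Gi, and the app needs about 400Mi without its cache.
before = cache_exceeds_limit("256Mi", "400Mi", "1Gi")
after = cache_exceeds_limit("768Mi", "400Mi", "1Gi")
print(before, after)
```

The saturation graph can only show the result of this sum. The inputs come from two other places: the cache size from the diff and the limit from the applied manifest.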

What "evidence" has to mean

If the goal is a defensible diagnosis, "evidence" is a higher bar than a plausible story. To name a change as the cause, the evidence chain should tie together:

A symptom with a timestamp — the runtime signal and when it started.
A change with a timestamp — the deploy/sync/config event, ideally just before the symptom.
A mechanism — a stated reason the change produces the symptom (the diff raised the cache above the memory limit; the new query dropped an index hint).
An artifact identity — the SHA, build number, and revision that connect the change to the running code, so the claim is checkable, not narrated.

The mechanism is what separates evidence from coincidence. A timeline that shows a deploy 90 seconds before a crash is a strong lead; a one-sentence reason the deploy causes that specific crash is what makes it a root cause. We've argued this distinction at length in why summaries are not root cause — a fluent restatement of the alert is not a diagnosis, and "a deploy happened near the incident" is not yet a mechanism.
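
One way to make that bar checkable is to refuse the label "root cause" until every element is present. This is a sketch under assumptions: the `Evidence` fields and the three labels are hypothetical, not a standard.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Evidence:
    symptom_at: Optional[datetime] = None
    change_at: Optional[datetime] = None
    mechanism: Optional[str] = None        # stated reason the change causes the symptom
    commit_sha: Optional[str] = None
    build_number: Optional[int] = None
    deploy_revision: Optional[str] = None


def classify(e: Evidence) -> str:
    """Return 'insufficient', 'lead', or 'root-cause candidate'."""
    if e.symptom_at is None or e.change_at is None:
        return "insufficient"
    if e.change_at > e.symptom_at:
        return "insufficient"  # a change after the symptom cannot explain it
    identity = (e.commit_sha, e.build_number, e.deploy_revision)
    has_identity = all(part is not None for part in identity)
    if e.mechanism and has_identity:
        return "root-cause candidate"
    return "lead"  # timing lines up, but the claim is not yet checkable


# A deploy 90 seconds before a crash, first without and then with a mechanism.
crash = datetime(2024, 1, 15, 14, 3, 0)
deploy = datetime(2024, 1, 15, 14, 1, 30)
timing_only = Evidence(symptom_at=crash, change_at=deploy)
full = Evidence(
    symptom_at=crash, change_at=deploy,
    mechanism="diff raised the cache above the memory limit",
    commit_sha="a1f9c2e", build_number=412, deploy_revision="rev-88",
)
print(classify(timing_only), "|", classify(full))
```

Even the strongest label here is "candidate". The check verifies that a mechanism was stated, not that it is true.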

The honest limit: correlation in time is a lead, not a verdict

Temporal correlation across the delivery chain is the strongest cheap signal you have. It is not proof. The SRE book is blunt about the trap: "correlation is not causation: some correlated events, say packet loss within a cluster and failed hard drives in the cluster, share common causes — in this case, a power outage," and it lists "hunting down spurious correlations that are actually coincidences or are correlated with shared causes" as a named troubleshooting pitfall (Effective Troubleshooting). Three failure modes to keep in view:

The noisy deploy window. Continuous delivery means several changes can land in the same five-minute window — an app deploy, a config push, a dependency bump, an infra change. "The last thing that shipped" narrows the field; it does not pick the culprit. You still need the mechanism to choose between candidates.
Shared upstream cause. The change you shipped and the incident may both be downstream of a third event (a node-pool autoscale, a dependency's outage). The deploy correlates because it ran into the real problem, not because it is the problem.
The latent change. Not every cause is recent. A change merged days ago can stay dormant until a traffic shift or a cron boundary triggers it, and a "what changed in the last 10 minutes?" view will miss it entirely. Recent-change bias is productive because it is usually right, and that same track record makes it easy to over-trust in the cases where it is wrong.

The operational consequence: correlation across the chain should rank suspects and surface the evidence for each, then leave the causal call — and the mechanism — to a human or to a system that has to show its work. Anything that asserts cause from timing alone will be confidently wrong on the noisy-window and shared-cause cases, which are exactly the incidents that hurt most.
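
A minimal sketch of ranking without asserting cause, assuming a hypothetical `Change` record and arbitrary default windows (seven days of lookback, five minutes for "noisy"). The thresholds and names are illustrative choices, not recommendations.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    kind: str            # e.g. "app deploy", "config push"
    ref: str             # SHA, revision, or ticket; hypothetical values below
    applied_at: datetime


def rank_suspects(changes, symptom_at,
                  lookback=timedelta(days=7),
                  noisy_window=timedelta(minutes=5)):
    """Order changes by closeness to the symptom and attach caveats, not a verdict."""
    in_scope = [c for c in changes
                if symptom_at - lookback <= c.applied_at <= symptom_at]
    ranked = sorted(in_scope, key=lambda c: symptom_at - c.applied_at)
    recent = [c for c in ranked if symptom_at - c.applied_at <= noisy_window]

    caveats = []
    if len(recent) > 1:
        caveats.append(f"{len(recent)} changes landed within {noisy_window}; "
                       "timing alone cannot choose between them")
    if ranked and not recent:
        caveats.append("nothing changed just before the symptom; "
                       "consider a latent change or a shared upstream cause")
    if not ranked:
        caveats.append("no change in the lookback window; "
                       "consider load shifts or upstream dependencies")
    return ranked, caveats


symptom_at = datetime(2024, 1, 15, 14, 3)
changes = [
    Change("dependency bump", "pr-hypothetical", symptom_at - timedelta(days=3)),
    Change("config push", "cfg-17", symptom_at - timedelta(minutes=4)),
    Change("app deploy", "a1f9c2e", symptom_at - timedelta(minutes=2)),
]
ranked, caveats = rank_suspects(changes, symptom_at)
for c in ranked:
    print(c.kind, c.ref, symptom_at - c.applied_at)
for note in caveats:
    print("caveat:", note)
```

The long lookback keeps the three-day-old dependency bump on the list instead of dropping it, and the caveat fires because two changes share the recent window. Choosing between them is left to whoever can supply a mechanism.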

What this means for tooling

If most incidents follow a change, and the change leaves evidence in four systems, then the tool that helps most is the one that reads all four and joins them on the artifact identity and timeline — not the one that goes deepest on a single pillar. Correlating the commit-to-runtime chain is the core idea behind cross-system root cause analysis: pull the VCS diff, the CI build, the GitOps sync, and the runtime signal into one timeline, rank the changes that line up with the symptom, and attach the evidence for each so the on-call engineer is verifying a hypothesis instead of assembling one from scratch.

The honest framing matters as much as the capability. A correlation engine earns trust by surfacing the suspect change and its evidence and by staying silent — or hedging — when the window is noisy or the signal is a shared cause. The value is in collapsing the four-console scavenger hunt into a ranked, cited timeline, not in pretending timing is proof.

Sources

Google SRE Workbook — Canarying Releases (majority of incidents triggered by binary or configuration pushes)
Google SRE Book — Effective Troubleshooting (system inertia / "what touched it last"; correlation is not causation; spurious-correlation pitfall)
DORA — DORA metrics (Four Keys) (change failure rate definition)
2024 DORA State of DevOps report summary — getDX (change failure rate benchmark bands)
IBM — The three pillars of observability (metrics/logs/traces roles)
Elastic — The 3 pillars of observability (need to contextualize pillars with CI/CD pipeline data)
Splunk — AIOps explained (deployment markers and change-event correlation for RCA)

By Intellira Engineering. AI-assisted draft, reviewed by the Intellira engineering team; claims cited inline; last verified 2026-06-02.