Incident commander checklist

What the incident commander does in one line

The incident commander (IC) coordinates the response and owns the decisions — they do not touch the keyboard. Google's SRE practice frames the job as the "three Cs": coordinate the response effort, communicate between responders and stakeholders, and maintain control over the incident (Google SRE Workbook, Incident Response). PagerDuty puts the boundary bluntly: as IC, "you should not be performing any actions or remediations, checking graphs, or investigating logs. Those tasks should be delegated" (PagerDuty IC training). This checklist walks the role end to end: declare and classify, assign roles, run the comms cadence, mitigate before chasing root cause, hand off cleanly, decide when it is resolved, and run the review.

When to declare an incident

Declaring early is cheap; declaring late is what turns a degradation into an outage. Google's managing-incidents guidance says to declare an incident if any of these are true (Google SRE book, Managing Incidents):

Do you need to involve a second team to fix the problem?
Is the outage visible to customers?
Is the issue still unsolved after an hour of concentrated analysis?

If you are unsure, declare. A short incident that turns out minor costs a channel and a few updates. A delayed declaration costs every minute the response ran without anyone coordinating it.
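
The three questions above reduce to an any-of check. The following is a minimal sketch in Python; the `Situation` type and its field names are illustrative assumptions, not part of Google's guidance, and the `unsure` flag encodes this checklist's "when in doubt, declare" rule.

```python
from dataclasses import dataclass


@dataclass
class Situation:
    # Field names are hypothetical; they mirror the three questions above.
    needs_second_team: bool
    customer_visible: bool
    minutes_of_focused_analysis: int
    unsure: bool = False


def should_declare(s: Situation) -> bool:
    """Return True if any declaration criterion is met, or if the caller is unsure."""
    return (
        s.needs_second_team
        or s.customer_visible
        or s.minutes_of_focused_analysis >= 60  # "unsolved after an hour"
        or s.unsure
    )
```

Any single True is enough; there is no scoring or weighting, which keeps the decision fast under pressure.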

Classify severity (and re-check it)

Severity sets the response: who gets paged, how often you update, and whether external comms open. Use a fixed scale so the number means the same thing every time. Atlassian's widely cited model is a useful default (Atlassian: severity levels):

SEV1 — "a critical incident with very high impact": customer data loss, a security breach, or a client-facing service down for all customers.
SEV2 — "a major incident with significant impact": a client-facing service down for a subset of customers, or a critical function not working.
SEV3 — "a minor incident with low impact": a glitch causing slight inconvenience.

Atlassian treats SEV1 and SEV2 as major incidents. Pick your own thresholds, but write them down before the incident, not during it.

Severity is not set once. Re-assess it whenever impact changes — a SEV3 that starts dropping writes is now a SEV2, and the cadence and audience change with it. Say the new severity out loud in the channel and timestamp it.
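
One way to make "re-check and timestamp" concrete is an append-only severity log. This is a sketch under stated assumptions: the `SeverityLog` class and its method names are hypothetical, and the three levels simply mirror the Atlassian scale quoted above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    SEV1 = 1  # critical, very high impact
    SEV2 = 2  # major, significant impact
    SEV3 = 3  # minor, low impact


@dataclass
class SeverityLog:
    """Append-only record of severity changes, each with a UTC timestamp."""
    history: list = field(default_factory=list)

    def record(self, severity: Severity, reason: str,
               at: Optional[datetime] = None) -> None:
        # Never overwrite: the review needs to see when severity changed and why.
        self.history.append((at or datetime.now(timezone.utc), severity, reason))

    @property
    def current(self) -> Severity:
        # Assumes record() was called at declaration; raises IndexError otherwise.
        return self.history[-1][1]

    def is_major(self) -> bool:
        # Atlassian treats SEV1 and SEV2 as major incidents.
        return self.current <= Severity.SEV2
```

For example, recording SEV3 at declaration and then SEV2 once writes start failing leaves two timestamped entries, and `is_major()` flips to True so the cadence and audience can change with it.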

Assign roles in the first five minutes

The IC's first job after declaring is to stop everyone debugging at once. Google's IMAG (Incident Management at Google) model — itself based on the emergency-services Incident Command System — defines a hierarchy where the Communications Lead and Operations Lead report to the IC (Google SRE Workbook, Incident Response):

Incident commander (you). "Holds the high-level state about the incident" and "structures the incident response task force, assigning responsibilities according to need and priority." The IC holds any role not yet delegated and removes roadblocks so Ops can work (Managing Incidents).
Operations lead (OL). "Works with the incident commander to respond to the incident by applying operational tools." Critically: "the operations team should be the only group modifying the system during an incident." One hand on the system prevents two fixes from colliding.
Communications lead (CL). "The public face of the incident response task force" — issues periodic updates to responders and stakeholders and keeps the incident document accurate.
Planning lead (PL) (for longer incidents). Handles longer-term issues: filing bugs, arranging handoffs, and "tracking how the system has diverged from the norm so it can be reverted once the incident is resolved."

On a small incident one person may wear OL and CL. The point is that the roles are named and acknowledged, not that you always staff four people.
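
The role rules can be sketched as a small roster. The `Roster` class below is hypothetical, and it makes one assumption that is stricter than the cited guidance: until an Operations lead is explicitly named, nobody is cleared to modify the system.

```python
from enum import Enum


class Role(Enum):
    IC = "incident commander"
    OL = "operations lead"
    CL = "communications lead"
    PL = "planning lead"


class Roster:
    """Tracks who holds each role; undelegated roles fall back to the IC."""

    def __init__(self, commander: str) -> None:
        self.commander = commander
        self._assigned: dict = {}

    def assign(self, role: Role, person: str) -> None:
        if role is Role.IC:
            raise ValueError("change the IC through an explicit handoff")
        self._assigned[role] = person

    def holder(self, role: Role) -> str:
        # The IC holds any role that has not been delegated yet.
        return self._assigned.get(role, self.commander)

    def undelegated(self) -> list:
        return [r for r in Role if r is not Role.IC and r not in self._assigned]

    def may_modify_system(self, person: str) -> bool:
        # Assumption: only an explicitly named OL may change the system.
        return self._assigned.get(Role.OL) == person
```

One person can hold both OL and CL by being assigned twice, which matches the small-incident case.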

What the IC must not do

The most common failure is the IC who quietly becomes a responder. PagerDuty's training names it directly: "you cannot take on another role at the same time as being an Incident Commander... you must resist the temptation to abandon the role of IC. If you really are the only person able to solve the problem, you should handover to another Incident Commander and then assume the role of SME" (PagerDuty IC training).

Two consequences follow:

The IC "become[s] the highest ranking individual on any major incident call, regardless of their day-to-day rank. Their decisions made as commander are final." Seniority in the org does not override the IC during the incident.
If you start reading logs, nobody is coordinating. Delegate the investigation and hold the structure.

Keep span of control sane

A single IC can only track so many people before coordination breaks down. PagerDuty's rule of thumb: "if you have more than 7 or 8 people directly reporting to the Incident Commander things can quickly get overwhelming" (PagerDuty IC training).

When the response grows past that, spin off a sub-team: assign a team leader, give them a specific, time-boxed task, and re-affirm that they are your single point of contact — "all communication from their team should come via the leader." You now track one leader, not five investigators.
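
A rough sketch of the spin-off step follows. The limit of 7 is an assumption taken from the low end of PagerDuty's "7 or 8", and `SubTeam`, `spin_off`, and `over_span` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field

MAX_DIRECT_REPORTS = 7  # assumption: low end of the "7 or 8" rule of thumb


@dataclass
class SubTeam:
    leader: str
    task: str
    timebox_minutes: int
    members: list = field(default_factory=list)


def over_span(direct_reports: list) -> bool:
    """True when the IC has more direct reports than the assumed limit."""
    return len(direct_reports) > MAX_DIRECT_REPORTS


def spin_off(direct_reports: list, leader: str, members: list,
             task: str, timebox_minutes: int):
    """Move `members` under `leader`; the IC keeps only the leader as a direct report."""
    moved = set(members) | {leader}
    remaining = [p for p in direct_reports if p not in moved]
    remaining.append(leader)  # the leader is the single point of contact
    return remaining, SubTeam(leader, task, timebox_minutes, list(members))
```

With nine direct reports, moving four investigators under one leader leaves the IC tracking five people.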

Set the communication cadence

Pick an interval before you need it and hold to it. "Updates every 15 minutes, even if nothing has changed" is a defensible default for a major incident; "nothing new, next update in 15" is a valid update. The CL owns this.

Match the message to the audience:

Internal responders: current hypothesis, what's being tried, who owns what, time of next update.
Internal stakeholders / leadership: severity, scope of impact, current action, ETA to the next update — not an ETA to resolution.
External / customer-facing: acknowledge, state scope, give a next-update time. Never publish a resolution ETA you cannot back. A missed "fixed by 3pm" erodes trust faster than the outage.

Keep every message factual and tied to evidence. The incident document — which Google calls the IC's "most important responsibility" — is the source of truth, not the chat scrollback.
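
A fixed cadence is easier to hold when the next-update time is computed rather than remembered. The sketch below formats a leadership update; the function names and message layout are assumptions, and the template deliberately has no field for a resolution ETA.

```python
from datetime import datetime, timedelta, timezone


def next_update_due(last_update: datetime, interval_minutes: int = 15) -> datetime:
    """Return when the next update is owed, given the last one sent."""
    return last_update + timedelta(minutes=interval_minutes)


def stakeholder_update(severity: str, scope: str, action: str,
                       sent_at: datetime, interval_minutes: int = 15) -> str:
    """Format a leadership update: next-update time only, no resolution ETA."""
    due = next_update_due(sent_at, interval_minutes)
    return (
        f"[{sent_at:%H:%M} UTC] {severity}. Impact: {scope}. "
        f"Current action: {action}. Next update at {due:%H:%M} UTC."
    )


sent = datetime(2024, 1, 1, 14, 50, tzinfo=timezone.utc)
message = stakeholder_update(
    "SEV2", "checkout failing for EU customers", "rolling back suspect deploy", sent
)
```

The same function covers the "nothing new" case: pass the unchanged action and the message still commits to the next update time.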

Mitigate before you root-cause

Separate the two and do them in order. Google's first listed best practice is: "Prioritize. Stop the bleeding, restore service, and preserve the evidence for root-causing" (Managing Incidents).

Mitigation stops user impact now: roll back the suspect deploy, fail over, scale out, disable the broken feature flag. It does not require knowing why.
Root cause is the causal chain you establish afterward, with evidence. Chasing it while customers are down trades minutes of impact for curiosity.

Most incidents follow a change, so capture the suspected triggering deploy or config change first, then decide whether rolling it back is the fastest mitigation. Preserve evidence as you go (logs, metrics, the diff) so the review is not a reconstruction from memory. For why a one-line "the deploy broke it" is not yet a root cause, see "Summaries are not root cause."
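
Capturing the suspected trigger can start as a simple recency filter over the change log. This is a heuristic sketch, not a root-cause method: the `Change` type, the two-hour lookback, and the identifiers are all assumptions, and the result is only a suspect to evaluate for rollback.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Change:
    kind: str            # e.g. "deploy" or "config"
    ref: str             # hypothetical identifier such as a commit or flag name
    applied_at: datetime


def suspected_trigger(changes: list, impact_start: datetime,
                      lookback: timedelta = timedelta(hours=2)) -> Optional[Change]:
    """Return the most recent change applied within `lookback` before impact began."""
    window_start = impact_start - lookback
    candidates = [c for c in changes if window_start <= c.applied_at <= impact_start]
    # Recency is only a starting hypothesis; confirm it against evidence.
    return max(candidates, key=lambda c: c.applied_at, default=None)
```

Changes applied after impact began are excluded, since they may be part of the response rather than the cause.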

Hand off command cleanly

ICs rotate on long incidents — fatigue degrades judgment. The handoff has to be explicit or you end up with two ICs or none. Google's protocol: the outgoing commander states "You're now the incident commander, okay?" and "should not leave the call until receiving firm acknowledgment of handoff," and the change is announced to everyone working the incident (Managing Incidents).

A clean handoff transfers: current severity, the working hypothesis, what's been tried, who owns what, and the next external update time. The incoming IC reads back the state; the outgoing IC confirms; the channel is told.
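
The read-back and acknowledgment steps can be modeled as preconditions on the transfer. The sketch below uses hypothetical `HandoffState` and `Command` types; the fields are the five items listed above, and command does not move unless both checks pass.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HandoffState:
    severity: str
    hypothesis: str
    tried: tuple                 # actions already attempted
    owners: tuple                # (task, person) pairs
    next_external_update: str    # e.g. "15:30 UTC"


class Command:
    def __init__(self, commander: str) -> None:
        self.commander = commander

    def hand_off(self, incoming: str, briefed: HandoffState,
                 read_back: HandoffState, acknowledged: bool) -> str:
        """Transfer command only if the read-back matches and was acknowledged."""
        if read_back != briefed:
            raise ValueError("read-back does not match the briefing; repeat it")
        if not acknowledged:
            raise ValueError("no firm acknowledgment; outgoing IC stays in command")
        outgoing, self.commander = self.commander, incoming
        # The returned line is what gets announced to the channel.
        return f"{incoming} is now incident commander (taking over from {outgoing})."
```

Raising on a missing acknowledgment means there is never a moment with two commanders or none.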

Decide when it is resolved

"Resolved" is a signal, not a feeling. Confirm with telemetry that impact is gone:

error rate back to baseline (the four golden signals — latency, traffic, errors, saturation — are a good checklist here; see Google SRE: monitoring distributed systems);
the failing units healthy (pods Running and Ready, queue drained);
the triggering alert cleared, not just silenced.

Declare resolved only after the recovery holds for a sustained window: an error rate that touches baseline for thirty seconds and then climbs again is not recovery. Then assign the post-incident review owner before you close the channel.
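
The "sustained window" test can be sketched as a check over the most recent samples. The function names, the baseline value, and the sample count are assumptions chosen for illustration; `alert_cleared` here means the alert condition has stopped firing, so a silenced alert does not count.

```python
def recovery_holds(error_rates: list, baseline: float, sustained_samples: int) -> bool:
    """True if the most recent `sustained_samples` readings are all at or below baseline."""
    if sustained_samples <= 0 or len(error_rates) < sustained_samples:
        return False  # not enough data to claim recovery
    return all(r <= baseline for r in error_rates[-sustained_samples:])


def can_resolve(error_rates: list, baseline: float, sustained_samples: int,
                units_healthy: bool, alert_cleared: bool) -> bool:
    """Combine the three signals: sustained recovery, healthy units, cleared alert."""
    return (
        recovery_holds(error_rates, baseline, sustained_samples)
        and units_healthy
        and alert_cleared
    )


# A brief dip (second sample) followed by a relapse, then a real recovery.
rates = [0.20, 0.01, 0.15, 0.01, 0.01, 0.01]
```

With a baseline of 0.02, the last three samples pass but the last five do not, because the relapse to 0.15 is still inside the longer window.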

Run a blameless post-incident review

The review is where the incident pays for itself. Google defines the postmortem as "a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions to prevent the incident from recurring" (Google SRE book, Postmortem Culture).

Blameless is the load-bearing word: the review must "focus on identifying the contributing causes of the incident without indicting any individual or team," because "if a culture of finger pointing and shaming... prevails, people will not bring issues to light for fear of punishment" (Postmortem Culture). The artifact you want is "what change caused this, and how do we catch it earlier" — never "who."

Trigger a review consistently, not selectively. Google's common criteria include user-visible downtime past a threshold, data loss of any kind, on-call intervention such as a rollback, resolution time above a threshold, or a monitoring failure that meant a human found the problem before the alert did (Postmortem Culture). Capture the full timeline — triggering change → detection → mitigation → resolution, each with evidence — and turn the follow-ups into tracked, owned work, not a document that gets filed and forgotten.
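
Consistent triggering is easier when the criteria are evaluated mechanically at close. The sketch below encodes the five criteria listed above; the threshold values are placeholders (the source leaves them to each team), and `IncidentFacts` and `review_triggers` are hypothetical names.

```python
from dataclasses import dataclass

# Placeholder thresholds; set these from your own written policy.
DOWNTIME_THRESHOLD_MIN = 5.0
RESOLUTION_THRESHOLD_MIN = 60.0


@dataclass
class IncidentFacts:
    user_visible_downtime_min: float
    data_loss: bool
    oncall_intervened: bool        # e.g. a rollback
    resolution_time_min: float
    detected_by_human_first: bool  # a human found it before the alert did


def review_triggers(f: IncidentFacts) -> list:
    """Return the criteria this incident meets; any entry means a review is required."""
    checks = [
        (f.user_visible_downtime_min > DOWNTIME_THRESHOLD_MIN,
         "user-visible downtime over threshold"),
        (f.data_loss, "data loss"),
        (f.oncall_intervened, "on-call intervention"),
        (f.resolution_time_min > RESOLUTION_THRESHOLD_MIN,
         "resolution time over threshold"),
        (f.detected_by_human_first, "monitoring failure"),
    ]
    return [name for hit, name in checks if hit]
```

Returning the matched criteria, rather than a bare boolean, gives the review owner the reason the review was opened.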

The 60-second checklist

Declare early — when in doubt, declare.
Classify severity from a written scale; re-check it as impact changes.
Name IC / OL / CL (and PL if it runs long); only Ops modifies the system.
Do not investigate as IC — delegate and coordinate.
Keep direct reports to 7 or 8 at most; spin off time-boxed sub-teams with one leader each.
Hold a fixed update cadence; never publish a resolution ETA you can't back.
Mitigate first, root-cause second, preserve evidence throughout.
Hand off command explicitly, with read-back and acknowledgment.
Resolve on signal, sustained — then assign the review owner.
Run a blameless review with tracked follow-ups.

Where Intellira fits

Intellira shortens the "what changed?" step that dominates the first ten minutes: it correlates the incident with the commit, build, and deploy behind it and produces an evidence-backed timeline — so the IC starts coordinating from a hypothesis instead of a blank channel, and the post-incident review starts from a recorded causal chain instead of a reconstruction. It is read-only by design: it surfaces the evidence and the suspected triggering change, but the Operations lead still owns every change to the system, as incident-command discipline requires.

Sources

By Intellira Engineering. AI-assisted draft, reviewed by the Intellira engineering team; claims cited inline; last verified 2026-06-02.