What OOMKilled actually means
OOMKilled means the Linux kernel's out-of-memory killer terminated your
container because it exceeded the memory it was allowed to use. In Kubernetes,
the container is recorded with Reason: OOMKilled and exit code 137 (128 + signal 9, SIGKILL)
(Kubernetes: Assign Memory Resources).
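The exit-code arithmetic can be checked mechanically. The following is a minimal Python sketch rather than part of any Kubernetes tooling: `describe_exit_code` is a hypothetical helper, and its small signal table assumes the standard Linux signal numbers.

```python
# Hypothetical helper: decode a container exit code the way a shell reports it.
# Assumes standard Linux signal numbers; only a few common ones are listed.
SIGNAL_NAMES = {2: "SIGINT", 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def describe_exit_code(code: int) -> str:
    """Return a short description of a process exit code."""
    if code > 128:
        signum = code - 128  # codes above 128 mean "terminated by signal N"
        name = SIGNAL_NAMES.get(signum, f"signal {signum}")
        return f"killed by {name}"
    return f"exited with status {code}"


print(describe_exit_code(137))
print(describe_exit_code(1))
```

Exit code 137 on its own only says the process received SIGKILL; it is the OOMKilled reason next to it that ties the kill to memory.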
The official docs are precise about the boundary: "A Container can exceed its
memory request if the Node has memory available. But a Container is not allowed
to use more than its memory limit. If a Container allocates more memory than its
limit, the Container becomes a candidate for termination."
The key distinction: OOMKilled is almost never "the node ran out of memory." It
is usually "this container hit its own limits.memory," which means the fix is
about the workload, not the cluster.
The detail most guides miss: enforcement is reactive
Memory limits are not a hard ceiling the way CPU limits are. The docs are
explicit: "memory limits are enforced by the kernel with out of memory (OOM)
kills … However, terminations only happen when the kernel detects memory
pressure. … A container may use more memory than its memory limit, but if it
does, it may get killed"
(Resource Management for Pods and Containers).
Contrast that with CPU: "cpu limits are enforced by CPU throttling … a cpu
limit is a hard limit the kernel enforces" (same source). So a container over
its CPU limit is slowed down, while a container over its memory limit is killed,
but only reactively, once the kernel detects memory pressure. That's why a slow
leak can run above its limit for a while and then die seemingly at random: the
kill lands when pressure hits, not the instant the limit is crossed.
Confirm it's really OOM
Don't guess from the symptom. Confirm from the pod's last state:
kubectl describe pod <pod> -n <namespace>
# Look under "Last State":
# Last State: Terminated
# Reason: OOMKilled
# Exit Code: 137
Then check how close the workload runs to its limit over time:
kubectl top pod <pod> -n <namespace> --containers
# Compare the MEMORY column against the container's limits.memory
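Comparing the MEMORY column against the limit by eye gets error-prone when the two use different units. As a rough sketch, the comparison could be scripted like this; `parse_memory` and `memory_utilization` are hypothetical names, and the parser deliberately handles only the common quantity suffixes.

```python
# Simplified parser for Kubernetes memory quantities such as "512Mi" or "1G".
# Assumption: only the common suffixes below are handled; exponent forms
# like "1e9" and milli-units are out of scope for this sketch.
BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
DECIMAL = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def parse_memory(quantity: str) -> int:
    """Convert a quantity string to bytes."""
    text = quantity.strip()
    # Binary suffixes are checked first; they end in "i", so they never
    # collide with the one-letter decimal suffixes.
    for suffix, factor in {**BINARY, **DECIMAL}.items():
        if text.endswith(suffix):
            return int(float(text[: -len(suffix)]) * factor)
    return int(text)  # a plain number is already in bytes


def memory_utilization(used: str, limit: str) -> float:
    """Fraction of the limit currently in use (1.0 means at the limit)."""
    return parse_memory(used) / parse_memory(limit)


print(f"{memory_utilization('450Mi', '512Mi'):.0%}")
```

A single reading is only a snapshot; the trend over several readings is what separates a tight limit from a leak.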
Common causes
- Limit set too low for real usage: the most common case. The app needs more than its limits.memory, often after a config or traffic change.
- A memory leak: usage climbs steadily until it crosses the limit and the next memory-pressure event kills it; restart; repeat. The give-away is a sawtooth memory graph.
- A cache or heap enlarged in absolute terms (e.g. a "2GB cache") while the limit was left unchanged: a classic deploy-induced OOM.
- Runtime heap larger than the container limit: a JVM/Node runtime that sizes its heap from the host rather than the cgroup limit.
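The last cause is easiest to see with numbers. The sketch below assumes a runtime that sizes its default heap as a fixed fraction of the memory it can see; the 25% figure and the `default_heap` helper are illustrative assumptions, so check your runtime's documentation for its real default.

```python
GIB = 2**30


def default_heap(visible_memory_bytes: int, heap_fraction: float = 0.25) -> int:
    """Heap a runtime would pick if it took a fixed fraction of visible memory.

    heap_fraction=0.25 is an assumed default, not a value from the article.
    """
    return int(visible_memory_bytes * heap_fraction)


limit = 2 * GIB        # the container's limits.memory
host_memory = 64 * GIB  # what a non-cgroup-aware runtime sees

# Sized from the host: the heap alone is far larger than the limit.
print(default_heap(host_memory) / GIB, default_heap(host_memory) <= limit)

# Sized from the cgroup limit: the heap fits with room for non-heap memory.
print(default_heap(limit) / GIB, default_heap(limit) <= limit)
```

Heap is only part of a process's footprint, so even a cgroup-aware heap needs headroom below the limit for stacks, native buffers, and the runtime itself.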
Terminology: request vs limit (get this right)
These are different knobs and the docs treat them differently:
- requests.memory is a scheduling guarantee: "Pod scheduling is based on requests. A Pod is scheduled to run on a Node only if the Node has enough available memory to satisfy the Pod's memory request" (Assign Memory Resources). Set it too high and the pod sits Pending (see Kubernetes Pod Pending); set it too low and the scheduler overcommits the node.
- limits.memory is the OOM-kill threshold described above.
Node memory pressure vs your own limit — and who gets evicted
Everything above is the container-hit-its-own-limit path. There's a second, distinct path that catches teams out: node memory pressure. When the whole node runs low on memory, the kubelet doesn't wait for the kernel. It starts node-pressure eviction, "the process by which the kubelet proactively terminates pods to reclaim resources on nodes" (Node-pressure Eviction). The pod that dies here may not be the one using the most memory: the kubelet ranks candidates by usage relative to requests and by Pod Priority, and Quality of Service (QoS) class is the quickest way to predict that order.
Every pod is assigned a QoS class from its requests/limits (QoS Classes):
| QoS class | Condition | Eviction risk under node pressure |
|---|---|---|
| Guaranteed | every container has memory and CPU request == limit | Lowest — "least likely to face eviction" |
| Burstable | has some requests/limits but isn't Guaranteed | Middle — evicted after BestEffort, and only "exceeding resource requests" |
| BestEffort | no requests or limits on any container | Highest — killed first |
The docs are explicit on order: "When a Node runs out of resources, Kubernetes
will first evict BestEffort Pods running on that Node, followed by Burstable
and finally Guaranteed Pods," and "only Pods exceeding resource requests are
candidates for eviction" (QoS Classes).
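The QoS conditions in the table can be written down as a small classifier. This is a sketch of the rules as the table states them, not the kubelet's implementation: `qos_class` is a hypothetical function, containers are plain dicts, and quantities are compared as raw strings (so "1Gi" and "1024Mi" would wrongly count as different).

```python
def qos_class(containers: list) -> str:
    """Classify a pod from its containers' CPU and memory requests/limits.

    Each container is a dict with optional "requests" and "limits" dicts.
    Assumption: quantities are compared as strings, not normalized values.
    """
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # Kubernetes defaults a missing request to the limit when set.
            request = requests.get(resource, limit)
            if limit is not None or request is not None:
                any_set = True
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


print(qos_class([{}]))
print(qos_class([{"requests": {"memory": "256Mi"}}]))
print(qos_class([{"limits": {"cpu": "500m", "memory": "1Gi"}}]))
```

On a live cluster the assigned class is recorded in the pod's status.qosClass field, which is the authoritative answer.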
The operational takeaway: a BestEffort pod (no requests/limits) is the first
casualty when any workload pressures the node, so even a well-behaved pod can be
evicted because a noisy neighbour filled the node. Give anything you can't afford
to lose requests == limits (Guaranteed) so it survives longest, and set honest
requests everywhere so the scheduler doesn't overcommit the node in the first
place. This is also why "just raise the limit" can backfire: the scheduler places
pods by requests, so raising limits without raising requests lets pods consume
more than the node was planned for, making node-pressure evictions more likely
elsewhere.
Telling them apart. A container with Last State: Terminated, Reason: OOMKilled (exit 137) hit a memory limit. A pod with status.reason: Evicted
and phase: Failed was evicted by the kubelet under node pressure — "the kubelet
sets the phase for the selected pods to Failed, and terminates the Pod"
(Node-pressure Eviction).
Check the node and its neighbours:
kubectl describe node <node> | grep -iA5 conditions # MemoryPressure: True?
kubectl get events -A --field-selector reason=Evicted
kubectl top nodes
kubectl top pods -A --sort-by=memory # the noisy neighbour
The kubelet watches the memory.available signal (default hard threshold
100Mi) and reclaims node resources before the kernel's OOM killer fires; it
ranks pods for eviction by whether their usage exceeds requests, then by Pod
Priority, then by usage relative to requests (Node-pressure Eviction).
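The ranking described above can be approximated with a sort key. This is a simplified illustration, not kubelet code: `PodUsage` and `eviction_order` are hypothetical names, and the pods and numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class PodUsage:
    name: str
    priority: int       # Pod Priority value; lower is evicted sooner
    request_bytes: int  # memory request (0 for BestEffort)
    usage_bytes: int    # current memory usage


def eviction_order(pods: list) -> list:
    """Approximate the order in which pods would be picked for eviction."""
    def key(pod: PodUsage):
        over_request = pod.usage_bytes - pod.request_bytes
        return (
            0 if over_request > 0 else 1,  # pods above their request go first
            pod.priority,                  # then lower priority first
            -over_request,                 # then the largest overshoot first
        )
    return [pod.name for pod in sorted(pods, key=key)]


MI = 2**20
pods = [
    PodUsage("db-guaranteed", 1000, 500 * MI, 400 * MI),
    PodUsage("critical-over", 1000, 100 * MI, 900 * MI),
    PodUsage("web-burstable", 0, 200 * MI, 300 * MI),
    PodUsage("batch-besteffort", 0, 0, 300 * MI),
]
print(eviction_order(pods))
```

Note that the high-priority pod still lands ahead of the Guaranteed one because it exceeds its request, which is why honest requests matter as much as PriorityClass.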
Fixing node pressure is a different job from fixing a limit — raising the
victim's limit does nothing. Find and right-size the noisy neighbour; set honest
requests so the scheduler stops overcommitting the node; promote critical
workloads to Guaranteed and a higher PriorityClass so they are evicted last;
and add capacity or enable the Cluster Autoscaler so pressure has somewhere to go.
Fix it
- Confirm the kill and the real working-set size (the two commands under "Confirm it's really OOM").
- If usage is legitimately higher than the limit, raise limits.memory (and requests.memory) to match reality plus headroom.
- If usage climbs without bound, you have a leak, and raising the limit only delays the kill. Profile the process and fix the leak.
- If a recent change set a cache/heap size, align the container limit with it (or revert the change).
- For runtimes, make the heap cgroup-aware (e.g. container-aware JVM flags) so it respects limits.memory.
- Roll out and watch kubectl top until the working set settles below the new limit.
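"Reality plus headroom" is simple arithmetic once the peak working set is known. A minimal sketch, assuming a 25% headroom figure that is purely illustrative (the right margin depends on how spiky the workload is); `suggest_limit_mi` is a hypothetical helper.

```python
import math

MI = 2**20


def suggest_limit_mi(peak_bytes: int, headroom: float = 0.25) -> int:
    """Suggest a memory limit in Mi: observed peak plus fractional headroom.

    headroom=0.25 is an assumed starting point, not a recommendation
    from the Kubernetes docs. The result is rounded up to a whole Mi.
    """
    return math.ceil(peak_bytes * (1 + headroom) / MI)


peak = 800 * MI  # peak working set observed over a representative period
print(f"limits.memory: {suggest_limit_mi(peak)}Mi")
```

This only makes sense when usage plateaus; for a leak, any number it produces just moves the kill later.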
Tradeoffs
Raising limits.memory is the fast fix, but it has a cost: higher limits let pods
claim more of the node than the scheduler reserved for them, and they can mask a
real leak that will resurface at scale. Raising requests.memory to match makes
scheduling honest but lowers node density. The blast radius of getting this
wrong is cluster-wide: under-set limits cause OOM kills, over-set requests
strand capacity. Decide deliberately, not by reflexively bumping the number
until the alerts stop.
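One way to make the tradeoff visible is to compare total requests and total limits against a node's allocatable memory. The sketch below uses made-up numbers and a hypothetical `overcommit_report` helper; a limit ratio above 1.0 means the pods are collectively allowed to use more memory than the node has.

```python
GIB = 2**30


def overcommit_report(allocatable_bytes: int, pods) -> dict:
    """Summarize how requests and limits compare to node allocatable memory.

    pods is an iterable of (request_bytes, limit_bytes) pairs.
    """
    pods = list(pods)  # allow generators; we iterate twice
    total_requests = sum(request for request, _ in pods)
    total_limits = sum(limit for _, limit in pods)
    return {
        "request_ratio": total_requests / allocatable_bytes,
        "limit_ratio": total_limits / allocatable_bytes,
    }


node = 8 * GIB
before = overcommit_report(node, [(1 * GIB, 2 * GIB)] * 4)
after = overcommit_report(node, [(1 * GIB, 4 * GIB)] * 4)  # limits doubled
print(before)
print(after)
```

Doubling the limits leaves the request ratio, and therefore scheduling, unchanged while doubling how far the node can be overcommitted.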
Prevent recurrence
- Set requests.memory close to real usage and limits.memory with deliberate headroom.
- Alert on container memory approaching its limit, not just on the kill: because enforcement is reactive, the kill is a lagging signal.
- Treat any change to cache/heap sizing as a change that must be paired with a limits review.
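An "approaching the limit" alert can also project when the limit will be reached instead of using a fixed threshold. The following is a deliberately crude sketch: `seconds_until_limit` is a hypothetical function, and it assumes linear growth between the first and last sample, which real workloads rarely follow exactly.

```python
from typing import Optional


def seconds_until_limit(samples: list, limit_bytes: int) -> Optional[float]:
    """Estimate seconds until usage reaches the limit.

    samples is a list of (timestamp_seconds, usage_bytes), oldest first.
    Returns None when there is too little data or usage is not growing.
    Assumption: growth is linear between the first and last sample.
    """
    if len(samples) < 2:
        return None
    (t_first, u_first), (t_last, u_last) = samples[0], samples[-1]
    if t_last <= t_first:
        return None
    rate = (u_last - u_first) / (t_last - t_first)  # bytes per second
    if rate <= 0:
        return None
    return max(0.0, (limit_bytes - u_last) / rate)


MI = 2**20
growing = [(0, 100 * MI), (60, 160 * MI)]  # +1 Mi per second
print(seconds_until_limit(growing, 400 * MI))
print(seconds_until_limit([(0, 100 * MI), (60, 100 * MI)], 400 * MI))
```

A projection like this gives earlier warning for slow leaks than a static percentage, at the cost of false alarms during normal warm-up.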
How Intellira diagnoses this
Instead of just reporting "OOMKilled," Intellira reads the pod's last state and working set read-only, then walks the causality chain — was there a recent ArgoCD sync, a Jenkins build, or a Bitbucket commit that changed a cache or heap setting while the limit stayed put? The output names the specific change and the file to fix, with the evidence behind it, rather than leaving you to bisect.
Sources
- Kubernetes — Assign Memory Resources to Containers and Pods
- Kubernetes — Resource Management for Pods and Containers
- Kubernetes — Pod Quality of Service Classes
- Kubernetes — Node-pressure Eviction
By Intellira Engineering. AI-assisted draft, reviewed by the Intellira engineering team; claims cited inline to the official Kubernetes docs; last verified 2026-06-02.