Intellira
Kubernetes · high severity

NodeNotReady

A NotReady node has stopped reporting healthy to the control plane. Its pods get evicted and rescheduled. Here is how to find why the kubelet went unhealthy.

Written by Intellira Engineering, Editorial team

What Node NotReady means

A node is NotReady when its kubelet stops posting a healthy Ready status to the control plane. The kubelet keeps a heartbeat alive two ways: it updates the Node .status (on any change, or by default every 5 minutes when nothing changes) and renews a Lease object in the kube-node-lease namespace (default every 10 seconds). If the heartbeats stop or the status turns unhealthy, the node controller taints the node so the scheduler stops placing new pods on it. The NoExecute taint then evicts the existing pods once their toleration for it runs out, and they are rescheduled elsewhere. On a cluster with little spare capacity, one bad node can become a capacity problem for the remaining nodes. See node status and conditions.

First, classify it: False vs Unknown

This single distinction decides where you look. The Ready condition is either False (kubelet is talking and says it is unhealthy) or Unknown (the control plane has not heard from the node within node-monitor-grace-period, default 50s):

kubectl get node <node> -o jsonpath='{.status.conditions[?(@.type=="Ready")]}'
  • Ready=False → the kubelet is reaching the API server. Look on the node: runtime, resource pressure, kubelet config. The control plane adds the node.kubernetes.io/not-ready:NoExecute taint.
  • Ready=Unknown → the node is unreachable (network, crashed kernel, dead kubelet). Look at connectivity and the node lease. The control plane adds node.kubernetes.io/unreachable:NoExecute.

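Both taints are applied by the node controller in kube-controller-manager, and the 50s default is its --node-monitor-grace-period flag (NodeMonitorGracePeriod). Earlier Kubernetes releases defaulted to 40s, so confirm the value for your version.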
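The False versus Unknown split above can be captured in a small helper. This is a minimal bash sketch, not part of any Kubernetes tooling: classify_ready is a hypothetical function name, and the hint text simply restates the two bullets above.

```shell
#!/usr/bin/env bash
# classify_ready STATUS
# STATUS is the Ready condition's .status value: True, False, or Unknown.
# Prints which taint to expect and where to look first.
# (Hypothetical helper; the wording mirrors the article, not an official API.)
classify_ready() {
  case "$1" in
    True)
      echo "ready: nothing to do" ;;
    False)
      echo "not-ready: kubelet is reporting; check runtime, resource pressure, kubelet config (taint: node.kubernetes.io/not-ready)" ;;
    Unknown)
      echo "unreachable: no heartbeat; check network, node lease, kubelet process (taint: node.kubernetes.io/unreachable)" ;;
    *)
      echo "unrecognized Ready status: $1" >&2
      return 2 ;;
  esac
}

# Usage against a live cluster (needs kubectl access; NODE is a placeholder):
#   status=$(kubectl get node "$NODE" \
#     -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
#   classify_ready "$status"
```

Keeping the kubectl call outside the function makes the classification easy to reuse in a script that loops over several nodes.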

Diagnose it

kubectl get nodes
kubectl describe node <node>
# Conditions: Ready, plus MemoryPressure / DiskPressure / PIDPressure
# Look at LastHeartbeatTime and the Taints line
kubectl get lease <node> -n kube-node-lease -o yaml   # stale spec.renewTime => unreachable

On the node itself (if reachable):

systemctl status kubelet
journalctl -u kubelet --no-pager | tail -80
systemctl status containerd        # or crio / docker
df -h ; df -i                       # disk AND inode usage

Causes, each end to end

Kubelet or container runtime down (Ready=False)

The kubelet crashed or hung, or the CRI runtime (containerd / CRI-O) is down, so the kubelet cannot manage pods and reports unhealthy.

  • Diagnose: systemctl status kubelet and systemctl status containerd. A missing runtime socket (ls -la /run/containerd/containerd.sock) or a crash-looping kubelet in journalctl -u kubelet confirms it.
  • Fix: restart the failed service (systemctl restart containerd && systemctl restart kubelet) and read the logs for the crash reason — a bad config flag, an OOM-killed kubelet, or a corrupt runtime state directory. The kubelet re-registers and the node returns to Ready.

MemoryPressure (Ready=False)

Available memory dropped below the kubelet's memory.available eviction threshold (hard default <100Mi), so the kubelet sets MemoryPressure=True and starts evicting pods.

  • Diagnose: kubectl describe node shows MemoryPressure True; events show Evicted pods. The kubelet evicts by QoS class — BestEffort first, then Burstable over their requests, Guaranteed last.
  • Fix: find and cap the offending workload (set memory limits), or add node capacity. For critical pods use Guaranteed QoS (request == limit) so they are evicted last. Thresholds are set with --eviction-hard. See node-pressure eviction.
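To make the threshold concrete, here is a rough bash sketch of the comparison. It assumes, per the node-pressure eviction documentation, that memory.available is node capacity minus the node's working set rather than the figure shown by free. The function names are hypothetical, and both inputs are expected in bytes.

```shell
#!/usr/bin/env bash
# memory_available_bytes CAPACITY_BYTES WORKING_SET_BYTES
# Assumed derivation: memory.available = capacity - working set.
memory_available_bytes() {
  echo $(( $1 - $2 ))
}

# under_memory_threshold CAPACITY_BYTES WORKING_SET_BYTES [THRESHOLD_BYTES]
# Exit 0 when memory.available is below the threshold (default 100Mi,
# the hard default quoted above); exit 1 otherwise.
under_memory_threshold() {
  local threshold="${3:-$((100 * 1024 * 1024))}"
  [ "$(memory_available_bytes "$1" "$2")" -lt "$threshold" ]
}

# One way to obtain the working set (assumes kubectl access and jq;
# NODE is a placeholder):
#   kubectl get --raw "/api/v1/nodes/$NODE/proxy/stats/summary" \
#     | jq '.node.memory.workingSetBytes'
```

For example, an 8Gi node (8589934592 bytes) with a working set of 8500000000 bytes has roughly 86Mi available, which is under the 100Mi default.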

DiskPressure — bytes or inodes (Ready=False)

DiskPressure=True fires when nodefs.available (default under 10%), nodefs.inodesFree (under 5%), or imagefs.available (under 15%) breach their hard thresholds. Inode exhaustion is the trap: df -h can show free space while df -i is at 100%.

  • Diagnose: df -h and df -i on the node; DiskPressure True in kubectl describe node. Check container logs, dead containers, and unused images filling nodefs/imagefs.
  • Fix: free space or grow the disk. The kubelet first reclaims by pruning unused images and dead containers; if a runaway log or emptyDir is the cause, cap it. Per-threshold defaults are in node-pressure eviction.
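Because bytes and inodes breach independently, it can help to compute both percentages side by side. The sketch below is an illustration only: pct_free, below_threshold and report_fs are hypothetical names, report_fs assumes GNU df (for --output) and bash, and which path backs nodefs or imagefs on your node is an assumption you need to confirm.

```shell
#!/usr/bin/env bash
# pct_free FREE TOTAL -> integer percent free, rounded down ("n/a" if TOTAL is 0,
# which some filesystems report for inode counts).
pct_free() {
  if [ "$2" -eq 0 ]; then echo "n/a"; else echo $(( $1 * 100 / $2 )); fi
}

# below_threshold FREE TOTAL THRESHOLD_PCT
# Exit 0 when the free share is strictly below THRESHOLD_PCT percent.
below_threshold() {
  [ $(( $1 * 100 )) -lt $(( $2 * $3 )) ]
}

# report_fs PATH -> print byte and inode headroom for the filesystem holding PATH.
report_fs() {
  local path="$1" avail size iavail itotal
  read -r avail size    < <(df --output=avail,size    "$path" | tail -n 1)
  read -r iavail itotal < <(df --output=iavail,itotal "$path" | tail -n 1)
  echo "bytes free:  $(pct_free "$avail" "$size")%"
  echo "inodes free: $(pct_free "$iavail" "$itotal")%"
}

# Usage on the node (the path is an assumption; use your kubelet root dir):
#   report_fs /var/lib/kubelet
#   below_threshold "$iavail" "$itotal" 5 && echo "nodefs.inodesFree breached"
```

A filesystem with 4 of 100 inodes free is below the 5% inode threshold even if it has plenty of bytes left, which is the trap described above.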

PIDPressure (Ready=False)

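pid.available fell below the kubelet's eviction threshold: a process leak or fork-bomb workload exhausted the node's PIDs, so the kubelet sets PIDPressure=True and the node stops accepting new pods. Unlike the memory and disk signals, pid.available does not appear among the documented default hard eviction thresholds, so check the --eviction-hard value your kubelet actually runs with.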

  • Diagnose: kubectl describe node shows PIDPressure True; on the node, ps -eLf | wc -l against cat /proc/sys/kernel/pid_max.
  • Fix: kill or cap the offending workload and set pod/PID limits. See PIDPressure and node-pressure eviction.
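The manual check in the Diagnose step can be turned into a percentage. This is a rough bash sketch of that same comparison, not the kubelet's internal accounting. The function names are hypothetical, and node_pid_available_pct assumes a Linux node with procps ps.

```shell
#!/usr/bin/env bash
# pid_available_pct USED MAX -> integer percent of PIDs still free (rounded down).
pid_available_pct() {
  echo $(( ($2 - $1) * 100 / $2 ))
}

# node_pid_available_pct -> the same figure from live values on this node.
# Counts every thread (ps -eL) against kernel.pid_max, as in the check above.
node_pid_available_pct() {
  local used max
  used=$(ps -eLf --no-headers | wc -l)
  max=$(cat /proc/sys/kernel/pid_max)
  pid_available_pct "$used" "$max"
}

# Usage on the node; THRESHOLD is whatever your kubelet is configured with:
#   pct=$(node_pid_available_pct)
#   [ "$pct" -lt "$THRESHOLD" ] && echo "PID headroom low: ${pct}%"
```

With 31500 threads against a pid_max of 32768, about 3% of PIDs remain.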

Network partition — node unreachable (Ready=Unknown)

The node cannot reach the API server, so its lease in kube-node-lease goes stale and the control plane flips Ready to Unknown after node-monitor-grace-period. The CNI plugin itself can also report NetworkUnavailable=True.

  • Diagnose: kubectl get lease <node> -n kube-node-lease shows a stale renewTime; NetworkUnavailable may be True. From the node, test reachability to the API server endpoint. Suspect a recent security-group / firewall / route change or a CNI (Calico, Cilium, flannel) failure.
  • Fix: restore connectivity (security group, route, VPN, CNI pod). Once the lease renews, Ready returns to True. See node heartbeats.
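Lease staleness is easier to judge as an age in seconds. The following bash sketch assumes GNU date (for -d) and a renewTime in the RFC 3339 form that kubectl prints. lease_age_seconds and lease_is_stale are hypothetical helper names, and the 50s default is the grace period quoted earlier.

```shell
#!/usr/bin/env bash
# lease_age_seconds RENEW_TIME [NOW_EPOCH]
# RENEW_TIME looks like 2024-05-01T12:00:00.123456Z; NOW_EPOCH defaults to now.
lease_age_seconds() {
  local renew="$1" now="${2:-$(date -u +%s)}" base renew_epoch
  base="${renew%%.*}"     # drop fractional seconds if present
  base="${base%Z}Z"       # keep exactly one trailing Z (UTC)
  renew_epoch=$(date -u -d "$base" +%s) || return 1
  echo $(( now - renew_epoch ))
}

# lease_is_stale RENEW_TIME [GRACE_SECONDS] [NOW_EPOCH]
# Exit 0 when the lease is older than the grace period (default 50).
lease_is_stale() {
  local age
  age=$(lease_age_seconds "$1" "${3:-}") || return 2
  [ "$age" -gt "${2:-50}" ]
}

# Usage against a live cluster (needs kubectl access; NODE is a placeholder):
#   renew=$(kubectl get lease "$NODE" -n kube-node-lease \
#     -o jsonpath='{.spec.renewTime}')
#   lease_is_stale "$renew" && echo "lease stale: node likely unreachable"
```

A healthy node renews roughly every 10 seconds, so an age well past the grace period is a strong hint that heartbeats are not arriving.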

Expired kubelet certificate or clock skew (Ready=False or Unknown)

An expired kubelet client cert makes the API server reject the heartbeat; large clock skew breaks TLS validity windows. Both stop status updates.

  • Diagnose: journalctl -u kubelet shows x509: certificate has expired or Unauthorized. Check skew with timedatectl / chronyc tracking.
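  • Fix: renew the kubelet client certificate. On kubeadm clusters it lives under /var/lib/kubelet/pki and normally rotates automatically; if rotation lapsed, re-run TLS bootstrap to obtain a new one. Then restart the kubelet and fix NTP so the clock stays in sync.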
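Certificate expiry can be checked directly instead of waiting for the log line. This is a small bash sketch using openssl's -checkend option. cert_valid_for is a hypothetical helper name, and the certificate path in the usage comment is the usual kubeadm location, which is an assumption for other distributions.

```shell
#!/usr/bin/env bash
# cert_valid_for FILE SECONDS
# Exit 0 if the first certificate in FILE is still valid SECONDS from now,
# non-zero if it will have expired by then (or cannot be read).
cert_valid_for() {
  openssl x509 -noout -checkend "$2" -in "$1" >/dev/null
}

# Usage on the node (path assumed; adjust for your distribution):
#   cert=/var/lib/kubelet/pki/kubelet-client-current.pem
#   openssl x509 -noout -enddate -in "$cert"          # print the expiry date
#   cert_valid_for "$cert" 0      || echo "kubelet client cert has expired"
#   cert_valid_for "$cert" 604800 || echo "expires within 7 days"
```

Because the check uses the local clock, a badly skewed clock can make a valid certificate look expired, so compare the result with timedatectl before rotating anything.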

What happens to the pods

When Ready stays False/Unknown past node-monitor-grace-period, the node controller adds the not-ready or unreachable NoExecute taint. Each pod is evicted after its tolerationSeconds — Kubernetes injects a default of 300s (5 minutes) for both taints unless the pod sets its own. Set a shorter value on latency-sensitive workloads, or a longer one to ride out brief node restarts. See taint-based eviction.
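One way to set a custom tolerationSeconds is to patch the pod template of a workload. The bash sketch below only builds the patch body. toleration_patch is a hypothetical helper, and the Deployment name in the usage comment is a placeholder.

```shell
#!/usr/bin/env bash
# toleration_patch SECONDS
# Prints a JSON patch body that tolerates both NoExecute taints for SECONDS.
toleration_patch() {
  local secs="$1"
  printf '{"spec":{"template":{"spec":{"tolerations":['
  printf '{"key":"node.kubernetes.io/not-ready","operator":"Exists","effect":"NoExecute","tolerationSeconds":%d},' "$secs"
  printf '{"key":"node.kubernetes.io/unreachable","operator":"Exists","effect":"NoExecute","tolerationSeconds":%d}' "$secs"
  printf ']}}}}\n'
}

# Usage (needs kubectl access; "my-app" is a placeholder Deployment name).
# A merge patch REPLACES the whole tolerations list, so include any
# tolerations the workload already relies on:
#   kubectl patch deployment my-app --type merge -p "$(toleration_patch 30)"
```

With the defaults quoted above, pods on an unreachable node are evicted roughly 50s + 300s after the last heartbeat; a 30s toleration shortens that to about 80s, at the cost of more churn during brief node restarts.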

If a node will not recover, cordon and drain it, then replace it:

kubectl cordon <node>
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data

How Intellira diagnoses this

Intellira reads node conditions, the node lease, and recent events read-only, then correlates the NotReady transition with what changed — a node-pool image update, a CNI change, or a workload that exhausted disk or PIDs. It classifies the node as pressure (Ready=False) versus unreachable (Ready=Unknown) up front and points at the likely trigger with evidence, instead of leaving you to SSH around.

Sources

By Intellira Engineering. AI-assisted draft, reviewed by the Intellira engineering team; claims cited inline; last verified 2026-06-02.

Frequently asked questions

What makes a node go NotReady?
The kubelet stopped reporting a healthy status to the control plane. Common causes are the kubelet or container runtime crashing, the node running out of memory, disk, or process IDs (MemoryPressure / DiskPressure / PIDPressure), or a network partition that stops the node's heartbeat reaching the API server.
What is the difference between NotReady and Unknown?
Ready=False means the kubelet is still reporting but says the node is unhealthy (pressure, runtime down). Ready=Unknown means the control plane has not heard from the node within node-monitor-grace-period (default 50s) — the node is unreachable. The two trigger different taints (not-ready vs unreachable) and you diagnose them differently.
What happens to pods on a NotReady node?
The node controller adds a NoExecute taint, and after each pod's tolerationSeconds (default 300s / 5 minutes) the pod is evicted. Its owning controller (for example a Deployment) then creates a replacement, which the scheduler places on another node. Pods that cannot be confirmed terminated may sit in Terminating until the node returns or is force-deleted.
