What a DNS resolution failure looks like
A pod cannot turn a name into an IP. nslookup inside the pod returns
can't resolve, SERVFAIL, or times out, and your app logs name-lookup
errors. The trap is assuming there is one cause. There are at least six
independent ones, and they need different fixes: CoreDNS pods being down or
crashlooping, the wrong dnsPolicy/dnsConfig, the default ndots:5 slowing
or breaking external lookups, a NetworkPolicy blocking egress to kube-dns, an
upstream/resolv.conf forwarding loop, and querying the wrong Service name
form. Start by deciding whether the failure is cluster-wide (every pod) or
scoped to one workload — that split points you at the right half of this page.
The two authoritative references are Debugging DNS Resolution and DNS for Services and Pods.
First triage: cluster-wide or one workload
Run a clean debug pod so you are not debugging through your app's own quirks.
The official dnsutils pod is the canonical tool:
# In-cluster name (should return the kube-dns clusterIP -> kubernetes svc IP)
kubectl exec -i -t dnsutils -- nslookup kubernetes.default
# What the pod was actually handed
kubectl exec -ti dnsutils -- cat /etc/resolv.conf
A healthy kubernetes.default lookup returns the DNS server and the answer:
Server: 10.0.0.10
Address 1: 10.0.0.10
Name: kubernetes.default
Address 1: 10.0.0.1
A failure prints the server but no answer (Debugging DNS Resolution):
nslookup: can't resolve 'kubernetes.default'
A correct /etc/resolv.conf for a pod in the default namespace looks like
this — note the cluster search domains, the kube-dns Service IP as
nameserver, and options ndots:5
(Debugging DNS Resolution):
search default.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.0.0.10
options ndots:5
If a fresh dnsutils pod also fails, the problem is cluster-wide — work
causes 1, 5, and 6 below. If only your workload fails, work causes 2, 3, and 4.
Cause 1: CoreDNS pods are down or crashlooping
DNS in current clusters is served by the CoreDNS add-on; both CoreDNS and the
older kube-dns use the label k8s-app=kube-dns
(Debugging DNS Resolution).
If the resolver itself is unhealthy, every pod loses DNS.
Diagnose:
# Are the DNS pods Running and Ready?
kubectl get pods --namespace=kube-system -l k8s-app=kube-dns
# Does the Service exist with a clusterIP?
kubectl get svc --namespace=kube-system
# Are there backing endpoints? Zero endpoints = nothing answers on :53
kubectl get endpointslice -l kubernetes.io/service-name=kube-dns --namespace=kube-system
# What is CoreDNS logging?
kubectl logs --namespace=kube-system -l k8s-app=kube-dns
A healthy startup log shows the version banner and a config hash
(Debugging DNS Resolution).
If the pods are CrashLoopBackOff, read the logs for the reason — a common one
is the forwarding loop covered in Cause 5. If the pods are Running but the
EndpointSlice is empty, the Service has no backends and no query is answered;
that is the same shape as a missing-endpoints Service problem.
Fix: restore the pods (resolve the crash cause, fix resource limits, or scale
the Deployment back up). Blast radius note: CoreDNS is a cluster-wide
singleton path — restarting it briefly affects every pod's lookups, so expect
transient SERVFAILs during a roll.
Cause 2: wrong dnsPolicy
A pod's dnsPolicy decides which resolver it even uses. The four values and
their exact behavior
(DNS for Services and Pods):
Default— the pod inherits name resolution from the node. This does NOT use cluster DNS, so in-cluster Service names will not resolve.ClusterFirst— queries that do not match the cluster domain are forwarded upstream by the cluster DNS; in-cluster names resolve via CoreDNS.ClusterFirstWithHostNet— required forhostNetworkpods. The docs are explicit: ahostNetworkpod left onClusterFirst"will fallback to the behavior of theDefaultpolicy."None— ignores cluster DNS entirely; you must supply everything viadnsConfig.
Critically, "Default" is not the default: if dnsPolicy is unset, the value
used is ClusterFirst
(DNS for Services and Pods).
Diagnose:
kubectl get pod <pod> -o jsonpath='{.spec.dnsPolicy}{"\n"}'
kubectl get pod <pod> -o jsonpath='{.spec.hostNetwork}{"\n"}'
If a pod that needs Service discovery shows Default, or a hostNetwork: true
pod shows ClusterFirst, that is your bug. Fix by setting the right policy:
spec:
hostNetwork: true
dnsPolicy: ClusterFirstWithHostNet
Cause 3: ndots:5 breaking or slowing external lookups
This is the most misdiagnosed DNS problem, because in-cluster names work fine
while external names are slow or intermittently fail. kubelet writes
options ndots:5 into every pod's resolv.conf
(Debugging DNS Resolution).
With ndots:5, any name containing fewer than 5 dots is first tried against
every search domain before the name is tried as-is. The docs spell out the
default value of 5 and that "any hostname with fewer than 5 dots will trigger
search domain expansion, causing extra queries"
(DNS for Services and Pods).
So a lookup for api.example.com (two dots, under the threshold of five) first
tries api.example.com.default.svc.cluster.local, then ...svc.cluster.local,
then ...cluster.local, each returning NXDOMAIN, before finally querying the
real name. That is several extra round trips per resolution, and under DNS
packet loss it shows up as slow or flaky external calls.
Diagnose by enabling query logging on CoreDNS and watching the expansion. Add
log to the Corefile
(Debugging DNS Resolution):
kubectl -n kube-system edit configmap coredns
.:53 {
log
errors
health
kubernetes cluster.local in-addr.arpa ip6.arpa {
pods insecure
fallthrough in-addr.arpa ip6.arpa
}
forward . /etc/resolv.conf
cache 30
loop
reload
loadbalance
}
You will see the cluster-suffixed variants resolve NXDOMAIN ahead of the real
query. Two fixes, with tradeoffs:
- Per-pod
dnsConfigloweringndots(smallest blast radius — affects only this workload):
spec:
dnsConfig:
options:
- name: ndots
value: "2"
- Or query a fully qualified name with a trailing dot (
api.example.com.), which skips search expansion for that call. Tradeoff: it only fixes the call sites you edit, not the whole workload.
Do not blanket-lower ndots for everything without thought — short in-cluster
names like mysvc rely on search-domain expansion to resolve, and an
aggressive ndots:1 can break those. ndots:2 is the common compromise.
Cause 4: a NetworkPolicy blocking egress to kube-dns
An egress NetworkPolicy is an allow-list. The docs are explicit: "When a pod is
isolated for egress, the only allowed connections from the pod are those
allowed by the egress list of some NetworkPolicy that applies to the pod for
egress"
(Network Policies).
So the moment any egress policy selects your pod, DNS is dropped unless you also
allow port 53 to kube-dns. The classic symptom is a workload that resolved fine
until a security policy landed, then resolves nothing.
Diagnose:
# Which egress policies select this pod?
kubectl get networkpolicy -n <namespace> -o wide
# Test whether :53 to kube-dns is even reachable
kubectl exec -ti dnsutils -- nslookup kubernetes.default
Fix by explicitly allowing egress to the kube-dns pods on both UDP and TCP (TCP matters for large or truncated answers, and NetworkPolicy supports TCP, UDP, and SCTP Network Policies):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: allow-dns-egress
namespace: <namespace>
spec:
podSelector: {}
policyTypes:
- Egress
egress:
- to:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: kube-system
podSelector:
matchLabels:
k8s-app: kube-dns
ports:
- port: 53
protocol: UDP
- port: 53
protocol: TCP
Blast radius note: confirm your CNI actually enforces NetworkPolicy (some default configs do not), and confirm the kube-dns label in your cluster — a selector that matches nothing silently blocks DNS.
Cause 5: a resolv.conf / upstream forwarding loop
CoreDNS forwards non-cluster queries to whatever is in /etc/resolv.conf via
forward . /etc/resolv.conf. On hosts running systemd-resolved, that file
points at the loopback stub 127.0.0.53, so CoreDNS forwards to itself and the
loop plugin halts the server. The CoreDNS loop plugin "detects simple
forwarding loops and halts the server," logging
(CoreDNS loop plugin):
plugin/loop: Loop (127.0.0.1:55953 -> :1053) detected for zone ".",
see https://coredns.io/plugins/loop#troubleshooting.
The documented cause is "CoreDNS forwarding requests directly to itself. e.g.
via a loopback address such as 127.0.0.1, ::1 or 127.0.0.53," and in
Kubernetes specifically "systemd-resolved will put the loopback address
127.0.0.53 as a nameserver into /etc/resolv.conf"
(CoreDNS loop plugin). The Kubernetes docs
echo that some distros (e.g. Ubuntu) use systemd-resolved, which can replace
/etc/resolv.conf
(Debugging DNS Resolution).
Diagnose: this shows up as CoreDNS CrashLoopBackOff (Cause 1) with the Loop
line in kubectl logs -n kube-system -l k8s-app=kube-dns.
Fix, in order of preference (CoreDNS loop plugin):
- Point kubelet at the real upstream file rather than the systemd stub, via the
kubelet config
resolvConfset to a path to the actual nameserver file (commonly/run/systemd/resolve/resolv.confon systemd-resolved hosts). - Or replace
forward . /etc/resolv.confwith a concrete upstream, e.g.forward . 8.8.8.8. - Or restore the host's original
/etc/resolv.confand disable the local DNS cache on nodes.
Cause 6: querying the wrong Service name form
If resolution "fails" but DNS is healthy, you may be asking for a name that does not exist. The record formats are exact (DNS for Services and Pods):
- A normal Service:
my-svc.my-namespace.svc.cluster-domain.exampleresolves to the Service's cluster IP. - A headless Service (clusterIP: None): the same name resolves to the set of IPs of all selected pods, not a single VIP. Clients consume the set or round-robin over it.
- SRV records use
_port-name._port-protocol.my-svc.my-namespace.svc.cluster-domain.example. For a headless Service these return one answer per pod, of the formhostname.my-svc.my-namespace.svc.cluster-domain.example.
Two failure modes here. First, cross-namespace: "DNS queries that don't specify
a namespace are limited to the Pod's namespace. To access Services in other
namespaces, specify the namespace" — query data.prod, not bare data
(DNS for Services and Pods).
Second, expecting a single IP from a headless Service: an app that picks the
"first" answer and caches it will miss the other pods and behave like a partial
outage.
Diagnose by resolving each form explicitly and comparing:
kubectl exec -ti dnsutils -- nslookup my-svc.my-namespace.svc.cluster.local
kubectl exec -ti dnsutils -- nslookup my-svc.my-namespace
Fix: use the namespace-qualified name (or the full FQDN), and for headless Services design the client to consume all returned addresses.
Recommended approach
Work the split first — cluster-wide versus single-workload — because it halves
the search space before you touch anything. From a fresh dnsutils pod, if
kubernetes.default fails, go straight to Cause 1 (and Cause 5 if CoreDNS is
crashlooping). If only your workload fails, read its dnsPolicy and
/etc/resolv.conf, list the NetworkPolicies selecting it, then compare the name
forms it queries. Change one variable at a time, and prefer per-pod dnsConfig
over cluster-wide CoreDNS edits — the blast radius of a Corefile mistake is
every pod in the cluster.
Validation
# In-cluster resolution
kubectl exec -i -t dnsutils -- nslookup kubernetes.default
# External resolution (and check it is not slow)
kubectl exec -i -t dnsutils -- nslookup kubernetes.io
# Confirm the resolver config the pod actually has
kubectl exec -ti dnsutils -- cat /etc/resolv.conf
# Confirm CoreDNS is healthy and has endpoints
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl get endpointslice -l kubernetes.io/service-name=kube-dns -n kube-system
A fix is confirmed when both in-cluster and external names resolve from a fresh
debug pod, CoreDNS shows no Loop lines, and your workload's resolv.conf and
NetworkPolicy match intent.
Prevention
- Pin
ndotsdeliberately. The defaultndots:5is a latency trap for external-heavy workloads; standardize on a per-workloaddnsConfigwhere it matters rather than discovering it under load. - Treat "allow DNS egress" as a required baseline in every default-deny egress policy template, so security rollouts never silently sever name resolution.
- Alert on CoreDNS
SERVFAILrate and pod restarts. CoreDNS exposes Prometheus metrics on:9153(theprometheus :9153line in the Corefile above), which is the natural signal source. - Keep node
resolv.confoff the systemd-resolved loopback stub for CoreDNS's upstream, so a node rebuild does not reintroduce the forwarding loop.
How Intellira diagnoses this
DNS failures are a correlation problem: the symptom (an app can't reach a
dependency) sits one or two hops from the cause (a NetworkPolicy merged an hour
ago, a CoreDNS roll, a systemd-resolved change on a node). Intellira's
read-only RCA approach is to line up the timeline across those systems — the
recent NetworkPolicy and ConfigMap changes, CoreDNS pod/endpoint state, and the
failing pod's dnsPolicy/resolv.conf — and surface the most likely cause
with the evidence attached, rather than running the six manual checks above by
hand. It does not mutate cluster state; the remediation steps stay with you.
Sources
- Debugging DNS Resolution — Kubernetes
- DNS for Services and Pods — Kubernetes
- Network Policies — Kubernetes
- CoreDNS loop plugin — coredns.io
By Intellira Engineering. AI-assisted draft, reviewed by the Intellira engineering team; claims cited inline; last verified 2026-06-02.