Application troubleshooting is the highest-leverage CKAD topic. Direct troubleshooting questions are 30% of the exam, but troubleshooting skills also help on every other domain — when your deployment doesn’t roll out, when your probe fails, when your service has no endpoints. The candidates who pass aren’t the ones who memorize more YAML; they’re the ones who diagnose faster.
This guide gives you a deterministic playbook for every kind of application failure the CKAD tests. By the end, you’ll have a mental flowchart for each failure mode and a set of kubectl commands that surface the root cause in under 60 seconds.
The Universal Diagnostic Workflow
Every troubleshooting question follows the same shape: something isn’t working, and you need to fix it without scattering changes. Before typing any command, answer three questions:
- What layer is broken? Pod, container, config, network, or controller.
- What’s the smallest signal that confirms the layer? Status, event, log line, or describe output.
- What’s the minimum change to fix it? Edit, patch, or recreate.
Resist the urge to delete and recreate. The grader checks specific properties — labels, ownership, resource references — that recreating from scratch may lose.
The 60-Second Diagnostic for Any Pod
When a pod is failing, this sequence finds the cause every time:
# 1. What's the current state?
kubectl get pod <pod> -n <ns>
# 2. What does Kubernetes itself say?
kubectl describe pod <pod> -n <ns>
# 3. What did the application say?
kubectl logs <pod> -n <ns>
kubectl logs <pod> -n <ns> --previous # for crashed containers
# 4. For multi-container pods, target the failing container
kubectl logs <pod> -c <container-name> --previous
The STATUS from step 1 tells you which path to take.
Status: Pending
The pod has been accepted but isn’t scheduled or its containers haven’t started.
kubectl describe pod <pod> | grep -A 20 Events
Look for these patterns:
- “0/N nodes are available: insufficient cpu” → resource requests don’t fit. Reduce requests or add capacity.
- “didn’t match Pod’s node affinity/selector” → wrong `nodeSelector` or affinity. Compare with `kubectl get nodes --show-labels`.
- “FailedScheduling: persistentvolumeclaim … not found” → PVC missing or not bound. Check `kubectl get pvc -n <ns>`.
- “had untolerated taint” → add a toleration matching the node’s taint.
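For the taint case, the toleration goes in the pod spec. A sketch with placeholder key, value, and effect (copy the real ones from `kubectl describe node <node>`):

```yaml
spec:
  tolerations:
  - key: "dedicated"       # must match the taint key on the node
    operator: "Equal"
    value: "special"       # must match the taint value
    effect: "NoSchedule"   # must match the taint effect
```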
Status: ImagePullBackOff or ErrImagePull
The image can’t be pulled.
kubectl describe pod <pod> | grep -A 5 -E "Failed|Image"
Common causes and fixes:
Wrong Image Name or Tag
kubectl set image deploy/<deploy> <container>=<correct-image>
For standalone pods, `kubectl set image pod/<pod> <container>=<correct-image>` or `kubectl edit pod` also works: the image is one of the few mutable pod fields, and the kubelet restarts the container with the new image.
Private Registry Without imagePullSecret
kubectl create secret docker-registry regcred \
--docker-server=<server> \
--docker-username=<user> \
--docker-password=<pass> \
--docker-email=<email>
Then add to the pod spec:
spec:
  imagePullSecrets:
  - name: regcred
Tag Doesn’t Exist
nginx:1.99 doesn’t exist. Use a tag that does (nginx:1.25, nginx:latest).
Status: CrashLoopBackOff
The container starts and exits repeatedly.
kubectl logs <pod> --previous
kubectl describe pod <pod> | grep -A 5 "Last State"
The Reason field on Last State is the answer:
Reason: Error
The app crashed. Read kubectl logs --previous for the exception. Common causes:
- Missing env var or ConfigMap key
- Wrong DB connection string
- Missing file the app expects to read
Reason: OOMKilled
Container exceeded memory limit.
kubectl describe pod <pod> | grep -A 5 Limits
Raise resources.limits.memory or fix the leak in the app.
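Raising the limit is a small spec edit; the container name and values below are illustrative:

```yaml
spec:
  containers:
  - name: app
    resources:
      requests:
        memory: "256Mi"
      limits:
        memory: "512Mi"   # raise above the value that triggered the OOMKill
```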
Reason: Completed
The container’s command ran to completion, but the pod was expected to be long-running. Most often:
- Wrong `command:` or `args:` in the spec
- Missing `sleep` or wrong entrypoint
Example: a pod intended to run nginx but with command: ['echo', 'hello'] will exit immediately.
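The fix is to restore a long-running process. A sketch of both options (image and names illustrative):

```yaml
spec:
  containers:
  - name: web
    image: nginx:1.25
    # Option 1: delete the command override entirely so the image's
    # own entrypoint (nginx) runs in the foreground.
    # Option 2: if the pod only needs to stay alive, make the
    # command long-running:
    command: ["sleep", "3600"]
```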
Status: Running but 0/1 Ready
Container is up but failing readiness.
kubectl describe pod <pod> | grep -A 5 "Readiness"
The Events show the actual probe failure:
- `Readiness probe failed: HTTP probe failed with statuscode: 404` → wrong path
- `Readiness probe failed: connection refused` → wrong port or app not listening yet
- `Readiness probe failed: timeout` → app too slow; tune `timeoutSeconds`
To test the probe endpoint manually:
kubectl exec -it <pod> -- wget -O- http://localhost:8080/healthz
kubectl exec -it <pod> -- nc -zv localhost 8080
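Once you’ve confirmed the working path and port by hand, align the probe with them. The values here are illustrative:

```yaml
spec:
  containers:
  - name: app
    readinessProbe:
      httpGet:
        path: /healthz    # the path that actually returned 200
        port: 8080        # the port the app actually listens on
      timeoutSeconds: 3   # raise if the app responds slowly
      periodSeconds: 5
```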
For a complete probe troubleshooting playbook, see our CKAD probes guide.
Service Has No Endpoints
A pod is running, but the service returns nothing — one of the most common CKAD scenarios.
kubectl get svc <svc> -n <ns>
kubectl get endpoints <svc> -n <ns>
If ENDPOINTS shows <none>, the service selector doesn’t match any pod label.
kubectl get svc <svc> -o yaml | grep -A 5 selector
kubectl get pods -n <ns> --show-labels
Fix the mismatch by relabeling the pods or updating the selector:
kubectl label pod <pod> tier=frontend --overwrite
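The rule behind the fix: the service’s `spec.selector` must exactly match the pod’s labels. Side by side, with illustrative names:

```yaml
# Service: every key/value in spec.selector must appear in the pod's labels
apiVersion: v1
kind: Service
metadata:
  name: frontend
spec:
  selector:
    tier: frontend      # must equal the pod label below
  ports:
  - port: 80
    targetPort: 8080    # must equal the pod's containerPort
---
# Pod: labels the selector matches against
apiVersion: v1
kind: Pod
metadata:
  name: web
  labels:
    tier: frontend
```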
Other failure modes:
- Endpoints exist but traffic fails → pod’s `containerPort` doesn’t match service `targetPort`. Compare both.
- Endpoints exist, in-cluster works, external fails → check service `type` (ClusterIP vs NodePort vs LoadBalancer).
Test connectivity from inside the cluster:
kubectl run tmp --rm -it --image=busybox --restart=Never -- \
wget -qO- --timeout=3 http://<svc>.<ns>.svc.cluster.local
DNS Failures
If service-name resolution fails inside a pod:
kubectl run tmp --rm -it --image=busybox:1.28 --restart=Never -- \
nslookup my-svc.dev.svc.cluster.local
Use busybox:1.28 — newer tags have a broken nslookup.
If DNS fails:
- CoreDNS pods might be down: `kubectl get pods -n kube-system -l k8s-app=kube-dns`
- A NetworkPolicy might be blocking egress to `kube-system:53`
Wrong ConfigMap or Secret Reference
If a pod fails to start with “couldn’t find key” or “configmap not found”:
kubectl describe pod <pod> | grep -A 5 -E "configmap|secret"
Verify the resource exists:
kubectl get cm <cm-name> -n <ns>
kubectl get secret <secret-name> -n <ns>
Check that the key the pod references actually exists in the ConfigMap:
kubectl get cm <cm-name> -o yaml | grep <key-name>
Two common bugs:
- ConfigMap is in a different namespace than the pod.
- Key name in the pod spec doesn’t match the ConfigMap key.
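The reference in the pod spec has three places for a typo: the ConfigMap name, the key name, and the namespace (implied by the pod’s own). A sketch with illustrative names:

```yaml
spec:
  containers:
  - name: app
    env:
    - name: DB_HOST
      valueFrom:
        configMapKeyRef:
          name: app-config   # must exist in the pod's namespace
          key: db_host       # must be a key inside that ConfigMap
          # optional: true   # uncomment to let the pod start even if missing
```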
Resource-Related Failures
If a pod is Pending and kubectl describe mentions “insufficient resources”:
# What does the pod request?
kubectl get pod <pod> -o jsonpath='{.spec.containers[*].resources}'
# What's available on each node?
kubectl describe nodes | grep -A 5 "Allocated resources"
Reduce the pod’s resources.requests or schedule it on a different node.
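Reducing requests is a spec edit (for deployment-owned pods, edit the deployment; bare pods must be recreated since requests are immutable). Illustrative values:

```yaml
spec:
  containers:
  - name: app
    resources:
      requests:
        cpu: "100m"       # lower until it fits the node's allocatable capacity
        memory: "128Mi"
```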
If a pod is OOMKilled but its limit seems generous, the app might have a memory leak — check kubectl logs --previous.
Wrong Image, Wrong Command, Wrong Env
For pods that start but immediately fail:
kubectl logs <pod> # what did the app say?
kubectl logs <pod> --previous # what did it say before crashing?
kubectl exec -it <pod> -- env | grep <expected-var> # are env vars set correctly?
kubectl exec -it <pod> -- ls /etc/config # is the ConfigMap mount visible?
Debugging Multi-Container Pods
# Pod stuck in PodInitializing
kubectl logs <pod> -c <init-container-name>
# 1/2 Ready (one container failing)
kubectl describe pod <pod> # which container is failing
kubectl logs <pod> -c <container-name> --previous
# Containers can't see shared volume
kubectl describe pod <pod> | grep -A 5 Volumes
kubectl describe pod <pod> | grep -A 5 Mounts
For deeper coverage, see our CKAD multi-container pod patterns guide.
Deployment Rollout Stuck
kubectl rollout status deploy <name> # exits with error message
kubectl describe deploy <name> # check Conditions
kubectl get rs -l app=<name> # multiple ReplicaSets during rollout
kubectl describe rs <newest-rs> # find why new pods aren't healthy
Most stuck rollouts are caused by:
- New pods failing (image pull, crash loop, probe failure) — investigate the new pod, not the deployment.
- `maxUnavailable: 0` combined with a failing probe: no old pods can be killed because no new pods are healthy.
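If `maxUnavailable: 0` is the blocker and brief capacity loss is acceptable, loosening the strategy lets the rollout progress (values illustrative):

```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # allow one old pod to be replaced before a new one is ready
      maxSurge: 1
```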
For the full deployment recovery playbook, see our CKAD Deployments rolling updates guide.
A Cheatsheet of Symptoms → First Command
| Symptom | First command |
|---|---|
| Pod Pending | kubectl describe pod <p> (Events) |
| ImagePullBackOff | kubectl describe pod <p> (Failed Pull) |
| CrashLoopBackOff | kubectl logs <p> --previous |
| 0/1 Ready | kubectl describe pod <p> (Readiness) |
| Service no endpoints | kubectl get endpoints <s> |
| DNS not resolving | nslookup from busybox:1.28 |
| ConfigMap not found | kubectl get cm -n <ns> |
| Multi-container failing | kubectl logs <p> -c <c> --previous |
| Rollout stuck | kubectl describe rs <newest-rs> |
| Forbidden | check RoleBinding/ServiceAccount |
Print this table. Tape it next to your monitor while you practice.
Useful kubectl Diagnostic Tricks
Debug Container in a Running Pod (Ephemeral Containers)
kubectl debug -it <pod> --image=busybox --target=<container-name>
Spawns a new container in the running pod with a shell. Useful when the main container has no shell (distroless, scratch).
Copy Files Out of a Pod
kubectl cp <pod>:/var/log/app.log ./app.log
For inspecting logs that aren’t going to stdout.
Run an Ephemeral Debug Pod
kubectl run debug --rm -it --image=nicolaka/netshoot --restart=Never -- bash
netshoot has every networking tool you might need (curl, dig, nslookup, tcpdump, mtr).
How to Practice Troubleshooting
Reading isn’t enough — troubleshooting is muscle memory. The most effective practice is breaking things on purpose:
- Build a kind or kubeadm cluster (see our Kubernetes lab setup).
- Deploy 5-10 sample apps.
- Break each one in a different way:
- Wrong image tag
- CrashLoopBackOff (wrong command)
- 0/1 Ready (probe pointing at wrong path)
- Service with selector typo
- ConfigMap reference to a non-existent key
- Pending pod (resource requests too high)
- Diagnose and fix each one without recreating from scratch.
- Time yourself. Aim for under 5 minutes per failure.
Validate Your Speed Under Exam Conditions
Drilling on your own cluster is necessary but not sufficient. The CKAD tests speed under pressure with unfamiliar problems. Take a full-length scored simulator with our CKAD Mock Exam Bundle — every simulator includes pre-broken pods and services with the same UI and scoring rubric as the real exam.
Frequently Asked Questions
Q: How much of the CKAD is troubleshooting? A: ~30% in direct troubleshooting, but troubleshooting skills help on another 30-40% of work (deployments, probes, services).
Q: What’s the most common pod failure on the exam?
A: CrashLoopBackOff caused by a wrong command or missing env var. kubectl logs --previous reveals it in seconds.
Q: When should I delete and recreate vs fix in place? A: Fix in place by default — the grader checks resource properties (labels, ownership, annotations) that recreating may lose. Recreate only when the resource is fundamentally wrong (wrong kind, wrong name).
Q: Can I use kubectl debug?
A: Yes. Ephemeral containers and kubectl debug are part of the modern CKAD curriculum.
Q: What’s the fastest way to test in-cluster connectivity?
A: kubectl run tmp --rm -it --image=busybox --restart=Never -- wget -qO- --timeout=3 http://<svc> — the --rm cleans up afterward, and --timeout=3 prevents hangs on blocked traffic.
Q: How fast should I be at troubleshooting? A: Aim for under 5 minutes per troubleshooting question. The 30% domain typically has 4-5 questions; budget 25 minutes total.
Q: What if I can’t find the root cause?
A: Move on. Flag the question, bank points elsewhere, and return at the end. A guess applied with kubectl patch is better than 20 minutes of staring at logs.
Ready to test your troubleshooting under exam conditions? Run a full scored simulator with our CKAD Mock Exam Bundle and find out where you’d score today.