Application troubleshooting is the highest-leverage CKAD topic. Direct troubleshooting questions are 30% of the exam, but troubleshooting skills also help on every other domain — when your deployment doesn’t roll out, when your probe fails, when your service has no endpoints. The candidates who pass aren’t the ones who memorize more YAML; they’re the ones who diagnose faster.
This guide gives you a deterministic playbook for every kind of application failure the CKAD tests. By the end, you’ll have a mental flowchart for each failure mode and a set of kubectl commands that surface the root cause in under 60 seconds.
The Universal Diagnostic Workflow
Every troubleshooting question follows the same shape: something isn’t working, and you need to fix it without scattering changes. Before typing any command, answer three questions:
- What layer is broken? Pod, container, config, network, or controller.
- What’s the smallest signal that confirms the layer? Status, event, log line, or describe output.
- What’s the minimum change to fix it? Edit, patch, or recreate.
Resist the urge to delete and recreate. The grader checks specific properties — labels, ownership, resource references — that recreating from scratch may lose.
The 60-Second Diagnostic for Any Pod
When a pod is failing, this sequence finds the cause every time:
# 1. What's the current state?
kubectl get pod <pod> -n <ns>
# 2. What does Kubernetes itself say?
kubectl describe pod <pod> -n <ns>
# 3. What did the application say?
kubectl logs <pod> -n <ns>
kubectl logs <pod> -n <ns> --previous # for crashed containers
# 4. For multi-container pods, target the failing container
kubectl logs <pod> -c <container-name> --previous
The STATUS from step 1 tells you which path to take.
Status: Pending
The pod has been accepted but isn’t scheduled or its containers haven’t started.
kubectl describe pod <pod> | grep -A 20 Events
Look for these patterns:
- “0/N nodes are available: insufficient cpu” → resource requests don’t fit. Reduce requests or add capacity.
- “didn’t match Pod’s node affinity/selector” → wrong `nodeSelector` or affinity. Compare with `kubectl get nodes --show-labels`.
- “FailedScheduling: persistentvolumeclaim … not found” → PVC missing or not bound. Check `kubectl get pvc -n <ns>`.
- “had untolerated taint” → add a toleration matching the node’s taint.
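For the taint case, the toleration goes in the pod spec. A sketch with placeholder key, value, and effect (copy the real ones from `kubectl describe node <node>`):

```yaml
spec:
  tolerations:
  - key: "dedicated"       # must match the taint key on the node
    operator: "Equal"
    value: "special"       # must match the taint value
    effect: "NoSchedule"   # must match the taint effect
```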
Status: ImagePullBackOff or ErrImagePull
The image can’t be pulled.
kubectl describe pod <pod> | grep -A 5 -E "Failed|Image"
Common causes and fixes:
Wrong Image Name or Tag
kubectl set image deploy/<deploy> <container>=<correct-image>
For standalone pods, `kubectl set image pod/<pod> <container>=<correct-image>` or `kubectl edit pod` also works: the image is one of the few mutable pod fields, and the kubelet restarts the container with the new image.
Private Registry Without imagePullSecret
kubectl create secret docker-registry regcred \
--docker-server=<server> \
--docker-username=<user> \
--docker-password=<pass> \
--docker-email=<email>
Then add to the pod spec:
spec:
  imagePullSecrets:
  - name: regcred
Tag Doesn’t Exist
nginx:1.99 doesn’t exist. Use a tag that does (nginx:1.25, nginx:latest).
Status: CrashLoopBackOff
The container starts and exits repeatedly.
kubectl logs <pod> --previous
kubectl describe pod <pod> | grep -A 5 "Last State"
The Reason field on Last State is the answer:
Reason: Error
The app crashed. Read kubectl logs --previous for the exception. Common causes:
- Missing env var or ConfigMap key
- Wrong DB connection string
- Missing file the app expects to read
Reason: OOMKilled
Container exceeded memory limit.
kubectl describe pod <pod> | grep -A 5 Limits
Raise resources.limits.memory or fix the leak in the app.
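Raising the limit is a small spec edit; the container name and values below are illustrative:

```yaml
spec:
  containers:
  - name: app
    resources:
      requests:
        memory: "256Mi"
      limits:
        memory: "512Mi"   # raise above the value that triggered the OOMKill
```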
Reason: Completed
The container’s command ran to completion, but the pod was expected to be long-running. Most often:
- Wrong `command:` or `args:` in the spec
- Missing `sleep` or wrong entrypoint
Example: a pod intended to run nginx but with command: ['echo', 'hello'] will exit immediately.
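The fix is to restore a long-running process. A sketch of both options (image and names illustrative):

```yaml
spec:
  containers:
  - name: web
    image: nginx:1.25
    # Option 1: delete the command override entirely so the image's
    # own entrypoint (nginx) runs in the foreground.
    # Option 2: if the pod only needs to stay alive, make the
    # command long-running:
    command: ["sleep", "3600"]
```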
Status: Running but 0/1 Ready
Container is up but failing readiness.
kubectl describe pod <pod> | grep -A 5 "Readiness"
The Events show the actual probe failure:
- `Readiness probe failed: HTTP probe failed with statuscode: 404` → wrong path
- `Readiness probe failed: connection refused` → wrong port or app not listening yet
- `Readiness probe failed: timeout` → app too slow; tune `timeoutSeconds`
To test the probe endpoint manually:
kubectl exec -it <pod> -- wget -O- http://localhost:8080/healthz
kubectl exec -it <pod> -- nc -zv localhost 8080
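Once you’ve confirmed the working path and port by hand, align the probe with them. The values here are illustrative:

```yaml
spec:
  containers:
  - name: app
    readinessProbe:
      httpGet:
        path: /healthz    # the path that actually returned 200
        port: 8080        # the port the app actually listens on
      timeoutSeconds: 3   # raise if the app responds slowly
      periodSeconds: 5
```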
For a complete probe troubleshooting playbook, see our CKAD probes guide.
Service Has No Endpoints
A pod is running, but the service returns nothing — one of the most common CKAD scenarios.
kubectl get svc <svc> -n <ns>
kubectl get endpoints <svc> -n <ns>
If ENDPOINTS shows <none>, the service selector doesn’t match any pod label.
kubectl get svc <svc> -o yaml | grep -A 5 selector
kubectl get pods -n <ns> --show-labels
Fix the mismatch by relabeling the pods or updating the selector:
kubectl label pod <pod> tier=frontend --overwrite
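The rule behind the fix: the service’s `spec.selector` must exactly match the pod’s labels. Side by side, with illustrative names:

```yaml
# Service: every key/value in spec.selector must appear in the pod's labels
apiVersion: v1
kind: Service
metadata:
  name: frontend
spec:
  selector:
    tier: frontend      # must equal the pod label below
  ports:
  - port: 80
    targetPort: 8080    # must equal the pod's containerPort
---
# Pod: labels the selector matches against
apiVersion: v1
kind: Pod
metadata:
  name: web
  labels:
    tier: frontend
```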
Other failure modes:
- Endpoints exist but traffic fails → pod’s `containerPort` doesn’t match service `targetPort`. Compare both.
- Endpoints exist, in-cluster works, external fails → check service `type` (ClusterIP vs NodePort vs LoadBalancer).
Test connectivity from inside the cluster:
kubectl run tmp --rm -it --image=busybox --restart=Never -- \
wget -qO- --timeout=3 http://<svc>.<ns>.svc.cluster.local
DNS Failures
If service-name resolution fails inside a pod:
kubectl run tmp --rm -it --image=busybox:1.28 --restart=Never -- \
nslookup my-svc.dev.svc.cluster.local
Use busybox:1.28 — newer tags have a broken nslookup.
If DNS fails:
- CoreDNS pods might be down: `kubectl get pods -n kube-system -l k8s-app=kube-dns`
- A NetworkPolicy might be blocking egress to `kube-system:53`
Wrong ConfigMap or Secret Reference
If a pod fails to start with “couldn’t find key” or “configmap not found”:
kubectl describe pod <pod> | grep -A 5 -E "configmap|secret"
Verify the resource exists:
kubectl get cm <cm-name> -n <ns>
kubectl get secret <secret-name> -n <ns>
Check that the key the pod references actually exists in the ConfigMap:
kubectl get cm <cm-name> -o yaml | grep <key-name>
Two common bugs:
- ConfigMap is in a different namespace than the pod.
- Key name in the pod spec doesn’t match the ConfigMap key.
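The reference in the pod spec has three places for a typo: the ConfigMap name, the key name, and the namespace (implied by the pod’s own). A sketch with illustrative names:

```yaml
spec:
  containers:
  - name: app
    env:
    - name: DB_HOST
      valueFrom:
        configMapKeyRef:
          name: app-config   # must exist in the pod's namespace
          key: db_host       # must be a key inside that ConfigMap
          # optional: true   # uncomment to let the pod start even if missing
```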
Resource-Related Failures
If a pod is Pending and kubectl describe mentions “insufficient resources”:
# What does the pod request?
kubectl get pod <pod> -o jsonpath='{.spec.containers[*].resources}'
# What's available on each node?
kubectl describe nodes | grep -A 5 "Allocated resources"
Reduce the pod’s resources.requests or schedule it on a different node.
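Reducing requests is a spec edit (for deployment-owned pods, edit the deployment; bare pods must be recreated since requests are immutable). Illustrative values:

```yaml
spec:
  containers:
  - name: app
    resources:
      requests:
        cpu: "100m"       # lower until it fits the node's allocatable capacity
        memory: "128Mi"
```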
If a pod is OOMKilled but its limit seems generous, the app might have a memory leak — check kubectl logs --previous.
Wrong Image, Wrong Command, Wrong Env
For pods that start but immediately fail:
kubectl logs <pod> # what did the app say?
kubectl logs <pod> --previous # what did it say before crashing?
kubectl exec -it <pod> -- env | grep <expected-var> # are env vars set correctly?
kubectl exec -it <pod> -- ls /etc/config # is the ConfigMap mount visible?
Debugging Multi-Container Pods
# Pod stuck in PodInitializing
kubectl logs <pod> -c <init-container-name>
# 1/2 Ready (one container failing)
kubectl describe pod <pod> # which container is failing
kubectl logs <pod> -c <container-name> --previous
# Containers can't see shared volume
kubectl describe pod <pod> | grep -A 5 Volumes
kubectl describe pod <pod> | grep -A 5 Mounts
For deeper coverage, see our CKAD multi-container pod patterns guide.
Deployment Rollout Stuck
kubectl rollout status deploy <name> # exits with error message
kubectl describe deploy <name> # check Conditions
kubectl get rs -l app=<name> # multiple ReplicaSets during rollout
kubectl describe rs <newest-rs> # find why new pods aren't healthy
Most stuck rollouts are caused by:
- New pods failing (image pull, crash loop, probe failure) — investigate the new pod, not the deployment.
- `maxUnavailable: 0` combined with a failing probe: no old pods can be killed because no new pods are healthy.
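If `maxUnavailable: 0` is the blocker and brief capacity loss is acceptable, loosening the strategy lets the rollout progress (values illustrative):

```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # allow one old pod to be replaced before a new one is ready
      maxSurge: 1
```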
For the full deployment recovery playbook, see our CKAD Deployments rolling updates guide.
A Cheatsheet of Symptoms → First Command
| Symptom | First command |
|---|---|
| Pod Pending | kubectl describe pod <p> (Events) |
| ImagePullBackOff | kubectl describe pod <p> (Failed Pull) |
| CrashLoopBackOff | kubectl logs <p> --previous |
| 0/1 Ready | kubectl describe pod <p> (Readiness) |
| Service no endpoints | kubectl get endpoints <s> |
| DNS not resolving | nslookup from busybox:1.28 |
| ConfigMap not found | kubectl get cm -n <ns> |
| Multi-container failing | kubectl logs <p> -c <c> --previous |
| Rollout stuck | kubectl describe rs <newest-rs> |
| Forbidden | check RoleBinding/ServiceAccount |
Print this table. Tape it next to your monitor while you practice.
Useful kubectl Diagnostic Tricks
Debug Container in a Running Pod (Ephemeral Containers)
kubectl debug -it <pod> --image=busybox --target=<container-name>
Spawns a new container in the running pod with a shell. Useful when the main container has no shell (distroless, scratch).
Copy Files Out of a Pod
kubectl cp <pod>:/var/log/app.log ./app.log
For inspecting logs that aren’t going to stdout.
Run an Ephemeral Debug Pod
kubectl run debug --rm -it --image=nicolaka/netshoot --restart=Never -- bash
netshoot has every networking tool you might need (curl, dig, nslookup, tcpdump, mtr).
How to Practice Troubleshooting
Reading isn’t enough — troubleshooting is muscle memory. The most effective practice is breaking things on purpose:
- Build a kind or kubeadm cluster (see our Kubernetes lab setup).
- Deploy 5-10 sample apps.
- Break each one in a different way:
- Wrong image tag
- CrashLoopBackOff (wrong command)
- 0/1 Ready (probe pointing at wrong path)
- Service with selector typo
- ConfigMap reference to a non-existent key
- Pending pod (resource requests too high)
- Diagnose and fix each one without recreating from scratch.
- Time yourself. Aim for under 5 minutes per failure.
Validate Your Speed Under Exam Conditions
Drilling on your own cluster is necessary but not sufficient. The CKAD tests speed under pressure with unfamiliar problems. Take a full-length scored simulator with our CKAD Mock Exam Bundle — every simulator includes pre-broken pods and services with the same UI and scoring rubric as the real exam.
Frequently Asked Questions
Q: How much of the CKAD is troubleshooting? A: ~30% in direct troubleshooting, but troubleshooting skills help on another 30-40% of work (deployments, probes, services).
Q: What’s the most common pod failure on the exam?
A: CrashLoopBackOff caused by a wrong command or missing env var. kubectl logs --previous reveals it in seconds.
Q: When should I delete and recreate vs fix in place? A: Fix in place by default — the grader checks resource properties (labels, ownership, annotations) that recreating may lose. Recreate only when the resource is fundamentally wrong (wrong kind, wrong name).
Q: Can I use kubectl debug?
A: Yes. Ephemeral containers and kubectl debug are part of the modern CKAD curriculum.
Q: What’s the fastest way to test in-cluster connectivity?
A: kubectl run tmp --rm -it --image=busybox --restart=Never -- wget -qO- --timeout=3 http://<svc> — the --rm cleans up afterward, and --timeout=3 prevents hangs on blocked traffic.
Q: How fast should I be at troubleshooting? A: Aim for under 5 minutes per troubleshooting question. The 30% domain typically has 4-5 questions; budget 25 minutes total.
Q: What if I can’t find the root cause?
A: Move on. Flag the question, bank points elsewhere, and return at the end. A guess applied with kubectl patch is better than 20 minutes of staring at logs.
Ready to test your troubleshooting under exam conditions? Run a full scored simulator with our CKAD Mock Exam Bundle and find out where you’d score today.