Troubleshooting is worth 30% of your CKA exam score — more than any other domain. If you nail troubleshooting, you can absorb a couple of mistakes elsewhere and still pass. If you struggle with it, even perfect performance on the rest of the exam may not be enough. This guide gives you a deterministic diagnostic workflow for every kind of failure the CKA tests: pods, services, nodes, DNS, and the control plane itself.
By the end, you’ll have a mental flowchart for each failure type and a set of kubectl commands that surface the root cause in under 60 seconds.
The Universal Troubleshooting Mindset
Every CKA troubleshooting question follows the same shape: something isn’t working, and you need to fix it without scattering changes. Before typing any command, answer three questions:
- What layer is broken? Application, pod, service/network, node, or control plane.
- What’s the smallest signal that confirms the layer? Status field, event, log line, or component health.
- What’s the minimum change to fix it? A label update, a config change, an image tag, a manifest fix.
Resist the urge to delete and recreate resources blindly. The exam grades final state — if you delete something the grader is checking, you lose the points even if your replacement looks correct.
The 60-Second Diagnostic: Pod Not Running
When a pod is failing, this exact sequence finds the root cause every time:
# 1. What's the current state?
kubectl get pod <pod> -n <ns>
# 2. What does Kubernetes itself say?
kubectl describe pod <pod> -n <ns>
# 3. What did the application say?
kubectl logs <pod> -n <ns>
kubectl logs <pod> -n <ns> --previous # for crashed containers
The STATUS column from step 1 tells you which path to take:
Pending
The pod has been accepted but isn’t scheduled or its containers haven’t started.
kubectl describe pod <pod> | grep -A 20 Events
Look for these messages:
- “0/N nodes are available: insufficient cpu” → resource requests don’t fit. Reduce requests or add capacity.
- “didn’t match Pod’s node affinity/selector” → wrong `nodeSelector` or affinity rule. Compare with `k get nodes --show-labels`.
- “had untolerated taint” → add a toleration matching the node’s taint, or remove the taint with `k taint nodes <node> key:NoSchedule-`.
- “FailedScheduling: persistentvolumeclaim … not found” → PVC missing or not bound. Check `k get pvc -n <ns>`.
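For the taint case, a minimal toleration sketch, assuming the node is tainted with `key=value:NoSchedule` (the key and value are placeholders):

```yaml
# Pod spec fragment: tolerate a node tainted via
#   kubectl taint nodes <node> key=value:NoSchedule
spec:
  tolerations:
  - key: "key"
    operator: "Equal"
    value: "value"
    effect: "NoSchedule"
```

A toleration only permits scheduling onto the tainted node; it doesn’t force it. Combine with a `nodeSelector` if the pod must land there.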
ImagePullBackOff / ErrImagePull
The image can’t be pulled.
kubectl describe pod <pod> | grep -A 5 -E "Failed|Image"
Common causes (and exact fixes):
- Typo in image name → `k set image pod/<pod> <container>=<image>` works on bare pods (the image field is mutable), or `k edit pod <pod>` and fix it in place.
- Private registry without imagePullSecret → `k create secret docker-registry regcred ...`, then add `imagePullSecrets:` to the pod spec.
- Tag doesn’t exist → use a tag that does (e.g., `nginx:1.25`, not `nginx:1.99`).
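The private-registry fix wires the secret into the pod spec. A sketch, assuming the secret is named `regcred` as in the command above; the registry and image path are hypothetical:

```yaml
# Pod spec fragment: pull from a private registry using the secret
# created with `kubectl create secret docker-registry regcred ...`
spec:
  imagePullSecrets:
  - name: regcred
  containers:
  - name: app
    image: registry.example.com/team/app:1.0   # hypothetical private image
```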
CrashLoopBackOff
The container starts and exits repeatedly.
kubectl logs <pod> --previous # logs from the last crashed container
kubectl describe pod <pod> | grep -A 5 "Last State"
The Reason field on Last State is the answer:
- “Error” → app error. Read the previous logs for the actual exception.
- “OOMKilled” → exceeded memory limit. Raise `resources.limits.memory` or fix the app.
- “Completed” → command finishes too fast for a long-running pod. Check `command:` and `args:`; usually a missing `sleep` or wrong entrypoint.
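For the “Completed” case, a minimal sketch of a pod that stays up (the `sleep` command is illustrative; any long-running entrypoint works):

```yaml
# Pod spec fragment: give the container a long-running command so it
# doesn't exit immediately and trigger CrashLoopBackOff
spec:
  restartPolicy: Always
  containers:
  - name: app
    image: busybox:1.28
    command: ["sh", "-c"]
    args: ["sleep 3600"]
```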
Running but not Ready (0/1)
Container is up but failing readiness probes.
kubectl describe pod <pod> | grep -A 5 "Readiness"
kubectl logs <pod>
Check the readiness probe path, port, and protocol against what the app actually exposes. The exam will sometimes hand you a probe targeting :8080/healthz while the container listens on :80/health.
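Taking that example mismatch, the fix is to align the probe with the container; port 80 and `/health` are the assumed correct values here:

```yaml
# Pod spec fragment: the probe now targets what the app actually serves
spec:
  containers:
  - name: web
    image: nginx:1.25
    ports:
    - containerPort: 80
    readinessProbe:
      httpGet:
        path: /health   # was /healthz
        port: 80        # was 8080
      periodSeconds: 5
```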
When a Service Has No Endpoints
A pod is running, but the service returns nothing. This is one of the most common CKA scenarios.
kubectl get svc <svc> -n <ns>
kubectl get endpoints <svc> -n <ns>
If ENDPOINTS shows <none>, the service selector doesn’t match any pod label. Diagnose:
kubectl get svc <svc> -o yaml | grep -A 5 selector
kubectl get pods -n <ns> --show-labels
Fix the mismatch. Either change the service selector or relabel the pods:
kubectl label pod <pod> tier=frontend --overwrite
Other failure modes:
- Endpoints exist but traffic fails → pod’s `containerPort` doesn’t match the service’s `targetPort`. Compare both.
- Endpoints exist, in-cluster works, external fails → check the service `type` (ClusterIP vs NodePort vs LoadBalancer) and any NetworkPolicy in the namespace.
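For reference, a correctly wired service/pod pair looks like this (names, labels, and ports are placeholders):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: frontend
spec:
  selector:
    tier: frontend      # must equal the pod's label exactly
  ports:
  - port: 80            # port clients hit on the service
    targetPort: 8080    # must equal the pod's containerPort
---
apiVersion: v1
kind: Pod
metadata:
  name: web
  labels:
    tier: frontend      # matched by the service selector above
spec:
  containers:
  - name: web
    image: myapp:1.0    # hypothetical image listening on 8080
    ports:
    - containerPort: 8080
```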
Test connectivity from inside the cluster, not from your laptop:
kubectl run tmp --rm -it --image=busybox --restart=Never -- \
wget -qO- --timeout=5 http://<svc>.<ns>.svc.cluster.local
DNS Failures
If service-name resolution fails, CoreDNS is the suspect.
# 1. Confirm it's DNS
kubectl run tmp --rm -it --image=busybox:1.28 --restart=Never -- \
nslookup kubernetes.default
# 2. Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns
# 3. Check the CoreDNS configmap
kubectl get configmap coredns -n kube-system -o yaml
If CoreDNS pods are healthy but resolution fails inside a specific namespace, suspect a NetworkPolicy blocking egress to kube-system on port 53.
Tip: Use `busybox:1.28` for DNS tests. Newer busybox tags broke `nslookup` against cluster DNS.
Node Troubleshooting
A node showing NotReady is a frequent CKA scenario.
kubectl get nodes
kubectl describe node <node>
In the Conditions section, look at the message:
- “kubelet stopped posting node status” → kubelet isn’t running. SSH in and fix it (see below).
- “runtime is down” → container runtime (containerd or CRI-O) isn’t running.
- “network plugin is not ready” → CNI plugin (Calico, Flannel, etc.) failed.
SSH-Based Diagnosis (Don’t Forget This Step)
CKA gives you sudo access to nodes. Many “fix the node” questions require leaving kubectl entirely:
ssh <node>
# kubelet
sudo systemctl status kubelet
sudo journalctl -u kubelet -n 100 --no-pager
# Container runtime
sudo systemctl status containerd
sudo crictl ps -a
sudo crictl logs <container-id>
# Common kubelet failures
ls /var/lib/kubelet/config.yaml
cat /etc/kubernetes/kubelet.conf
Frequent fixes:
- kubelet service stopped → `sudo systemctl enable --now kubelet`
- swap is on (kubelet refuses to start) → `sudo swapoff -a` and remove swap from `/etc/fstab`
- wrong kubelet config path → check `/var/lib/kubelet/kubeadm-flags.env` and `/etc/kubernetes/kubelet.conf`
- runtime socket mismatch → kubelet flag `--container-runtime-endpoint=unix:///run/containerd/containerd.sock`
Control Plane Troubleshooting
If kubectl get nodes returns “connection refused” or hangs, the API server is down. Control plane components on a kubeadm cluster run as static pods, not regular pods.
# On the control plane node
ssh <control-plane>
ls /etc/kubernetes/manifests/
# You'll see: etcd.yaml, kube-apiserver.yaml, kube-controller-manager.yaml, kube-scheduler.yaml
Static pods are managed by kubelet, not the API server. `kubectl delete` won’t restart one; kubelet simply recreates the mirror pod. To force a restart, move the manifest out of the directory, wait for the pod to stop, then move it back.
# Force a static pod restart
sudo mv /etc/kubernetes/manifests/etcd.yaml /tmp/
# wait for pod to terminate
sudo mv /tmp/etcd.yaml /etc/kubernetes/manifests/
Diagnose with crictl
When the API is unreachable, you can’t use kubectl. Drop down to the runtime:
sudo crictl ps -a | grep -E "etcd|api|scheduler|controller"
sudo crictl logs <container-id>
sudo crictl logs --tail=100 <container-id>
Most control plane failures on the CKA come from a single edit to a static pod manifest — a typo in --etcd-servers, a wrong --service-cluster-ip-range, or a malformed YAML indent. Open the manifest in the /etc/kubernetes/manifests/ directory and fix it.
etcd Failures
If etcd is down, the API server can’t start. Symptoms include API server logs showing “connection refused to 127.0.0.1:2379”.
sudo crictl logs $(sudo crictl ps -a | grep etcd | awk '{print $1}')
# Common causes
ls -la /var/lib/etcd/ # data dir intact?
cat /etc/kubernetes/manifests/etcd.yaml | grep -E "data-dir|cert-file|key-file"
If you’ve restored an etcd backup, the most common bug is forgetting to point etcd.yaml at the new data directory (check both the --data-dir flag and the etcd-data hostPath volume). See our full CKA etcd backup and restore guide for the complete recovery walkthrough.
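A minimal sketch of that edit, assuming the snapshot was restored to `/var/lib/etcd-restore` (the path is illustrative):

```yaml
# /etc/kubernetes/manifests/etcd.yaml fragment after
#   etcdctl snapshot restore ... --data-dir /var/lib/etcd-restore
# One common approach: repoint the etcd-data hostPath volume at the
# restored directory; the in-container mountPath and --data-dir stay as-is.
  volumes:
  - hostPath:
      path: /var/lib/etcd-restore   # was /var/lib/etcd
      type: DirectoryOrCreate
    name: etcd-data
```

Saving the manifest triggers kubelet to restart the etcd static pod automatically.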
NetworkPolicy Blocking Traffic
If a pod can’t reach a service that should be reachable, check NetworkPolicies in the namespace:
kubectl get networkpolicies -n <ns>
kubectl describe networkpolicy <np> -n <ns>
A NetworkPolicy with an empty podSelector: {}, policyTypes including Egress, and no egress rules denies all egress from every pod in the namespace. The fix is usually to add an egress rule allowing the destination, not to delete the policy (the grader checks the policy still exists).
egress:
- to:
  - namespaceSelector:
      matchLabels:
        kubernetes.io/metadata.name: kube-system
  ports:
  - protocol: UDP
    port: 53
For a deeper NetworkPolicy walkthrough, see our CKA networking deep dive.
RBAC Permission Denied
A user or service account gets “forbidden” errors. The diagnosis is one command:
kubectl auth can-i <verb> <resource> --as=<user> -n <ns>
kubectl auth can-i list pods --as=system:serviceaccount:dev:app-sa -n dev
Then trace the role chain:
kubectl get rolebindings,clusterrolebindings -A -o wide | grep <user>
kubectl describe role <role> -n <ns>
If the binding is missing, create it:
kubectl create rolebinding <name> --role=<role> --user=<user> -n <ns>
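The same fix in declarative form, reusing the `dev`/`app-sa` example from above; the role name and verb list are assumptions:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader        # assumed role name
  namespace: dev
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-sa-pod-reader
  namespace: dev
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-reader
subjects:
- kind: ServiceAccount
  name: app-sa
  namespace: dev
```

Verify with `kubectl auth can-i list pods --as=system:serviceaccount:dev:app-sa -n dev`.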
Detailed walkthrough in our CKA RBAC hands-on guide.
A Cheatsheet of Symptoms → First Command
| Symptom | First command |
|---|---|
| Pod Pending | kubectl describe pod <p> (Events) |
| ImagePullBackOff | kubectl describe pod <p> (Failed Pull) |
| CrashLoopBackOff | kubectl logs <p> --previous |
| 0/1 Ready | kubectl describe pod <p> (Readiness) |
| Service no endpoints | kubectl get endpoints <s> |
| DNS not resolving | nslookup from busybox:1.28 |
| Node NotReady | ssh node && systemctl status kubelet |
| API unreachable | crictl ps -a on control plane |
| etcd error | crictl logs <etcd-container> |
| Forbidden | kubectl auth can-i ... --as=... |
Print this table. Tape it next to your monitor while you practice.
How to Practice Troubleshooting
Reading is not enough — troubleshooting is muscle memory. The most effective practice is breaking things on purpose:
- Build a kubeadm cluster with at least one control plane and one worker. (See our Kubernetes lab setup for CKA guide.)
- Break it deliberately: stop kubelet, edit `kube-apiserver.yaml` to have a wrong port, change a service selector, delete a CNI pod.
- Recover without recreating the cluster. Force yourself to find and fix the root cause.
Build a list of “broken cluster” recipes and rotate through them weekly until each one takes under 5 minutes to diagnose and fix.
When Practice Becomes Real: Take a Scored Mock
Drilling on your own cluster is necessary but not sufficient. The CKA tests your ability to do this under time pressure with unfamiliar problems. The only way to validate that is a scored, exam-realistic simulator.
Our CKA Mock Exam Bundle puts you in front of pre-broken clusters with the same UI, time limits, and scoring rubric as the real exam. You’ll find out exactly which troubleshooting categories cost you the most points before you sit the $395 exam.
Frequently Asked Questions
Q: How much of the CKA is pure troubleshooting? A: 30% of the score, but troubleshooting skills also help on cluster admin and networking tasks — closer to 40-50% of practical work involves diagnosis.
Q: What’s the most common pod failure on the exam?
A: Pending due to a wrong nodeSelector, taint, or PVC reference. kubectl describe pod reveals it in seconds.
Q: Do I need to know crictl?
A: Yes. When the API is down, crictl is your only window into running containers. Practice crictl ps, crictl logs, and crictl inspect.
Q: Can I just delete and recreate a broken resource?
A: Sometimes — but the exam often checks specific properties (labels, annotations, ownership) that recreating from scratch will lose. When in doubt, fix in place with kubectl edit or kubectl patch.
Q: Is journalctl allowed during the exam?
A: Yes. SSH access to nodes is part of the exam, and journalctl is the primary tool for kubelet diagnostics. Get comfortable with journalctl -u kubelet -n 100 --no-pager.
Q: How fast should I be at troubleshooting? A: Aim for under 5 minutes per troubleshooting question. The 30% domain typically has 4-5 questions; budget 25 minutes total and don’t dwell.
Q: What if I can’t find the root cause?
A: Move on. Flag the question, bank points elsewhere, and return at the end. A guess applied with kubectl patch is better than 20 minutes of staring.
Ready to test your troubleshooting under exam conditions? Run a full scored simulator with our CKA Mock Exam Bundle and find out where you’d score today.