Exercise 2: Troubleshooting AKS Issues
Task 1 - Common Troubleshooting Commands
Familiarize yourself with key troubleshooting commands:
# Get pod status
kubectl get pods

# Get detailed pod information
kubectl describe pod <pod-name>

# Get pod logs
kubectl logs <pod-name>

# Get logs from the previous instance of a pod (if it restarted)
kubectl logs <pod-name> --previous

# Get events across the cluster
kubectl get events --sort-by='.lastTimestamp'

# Check node status
kubectl get nodes
kubectl describe node <node-name>

# Get resource usage
kubectl top pods
kubectl top nodes
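One variation worth knowing (not part of the list above, but standard kubectl): filtering events to warnings only often surfaces problems faster than the full event stream:

# Show only warning events, sorted oldest to newest
kubectl get events --field-selector type=Warning --sort-by='.lastTimestamp'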
Task 2 - Simulate and Troubleshoot a Crashing Pod
Create and examine the crasher pod manifest:
$crasherPod = @" apiVersion: v1 kind: Pod metadata: name: crasher-pod spec: containers: - name: crasher image: k8sonazureworkshoppublic.azurecr.io/busybox command: ["/bin/sh"] args: ["-c", "echo 'Starting pod...'; echo 'Pod running normally...'; sleep 10; echo 'WARNING: Memory corruption detected!'; echo 'ERROR: Critical system failure!'; echo 'Pod will now terminate with error...'; exit 1"] # This will output logs and crash after 10 seconds resources: requests: memory: "64Mi" cpu: "100m" limits: memory: "128Mi" cpu: "200m" restartPolicy: Always # The pod will continuously restart and crash "@ # Output the manifest to review it $crasherPod
cat << EOF > crasher-pod.yaml apiVersion: v1 kind: Pod metadata: name: crasher-pod spec: containers: - name: crasher image: k8sonazureworkshoppublic.azurecr.io/busybox command: ["/bin/sh"] args: ["-c", "echo 'Starting pod...'; echo 'Pod running normally...'; sleep 10; echo 'WARNING: Memory corruption detected!'; echo 'ERROR: Critical system failure!'; echo 'Pod will now terminate with error...'; exit 1"] # This will output logs and crash after 10 seconds resources: requests: memory: "64Mi" cpu: "100m" limits: memory: "128Mi" cpu: "200m" restartPolicy: Always # The pod will continuously restart and crash EOF # Output the manifest to review it cat crasher-pod.yaml
This pod is designed to output several log messages and then exit with an error after 10 seconds, causing it to crash and restart.
Deploy the crashing pod:
PowerShell:

$crasherPod | kubectl apply -f -

Bash:

kubectl apply -f crasher-pod.yaml
Observe the pod’s status:
kubectl get pods -w
You should see the pod status cycle through Running → Error → CrashLoopBackOff → Running again. Terminate the watch by pressing Ctrl+C.
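If you want a quick numeric confirmation that the pod is crash-looping, the restart count is recorded in the pod status; a sketch using kubectl's JSONPath output (assuming the pod has a single container):

# Print the restart count of the first (only) container
kubectl get pod crasher-pod -o jsonpath='{.status.containerStatuses[0].restartCount}'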
Diagnose the issue:
# Check the pod status
kubectl describe pod crasher-pod
# Look at the restart count and last state

# Check the logs
kubectl logs crasher-pod
You should see error messages in the logs that provide clues about why the pod is crashing, such as WARNING: Memory corruption detected! and ERROR: Critical system failure!
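If the pod has already restarted by the time you check, the current logs may look clean. As a hedge, the previous instance's logs and its recorded exit code can be read as follows (standard kubectl, using the same --previous flag listed in Task 1):

# Logs from the previous (crashed) container instance
kubectl logs crasher-pod --previous

# Exit code recorded for the last termination (should be 1 for this pod)
kubectl get pod crasher-pod -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'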
Fix the issue by creating a new pod that doesn’t crash:
$fixedPod = @" apiVersion: v1 kind: Pod metadata: name: fixed-pod spec: containers: - name: fixed image: k8sonazureworkshoppublic.azurecr.io/busybox command: ["/bin/sh"] args: ["-c", "while true; do echo 'Running stably'; sleep 30; done"] resources: requests: memory: "64Mi" cpu: "100m" limits: memory: "128Mi" cpu: "200m" "@ $fixedPod | kubectl apply -f -
cat << EOF | kubectl apply -f - apiVersion: v1 kind: Pod metadata: name: fixed-pod spec: containers: - name: fixed image: k8sonazureworkshoppublic.azurecr.io/busybox command: ["/bin/sh"] args: ["-c", "while true; do echo 'Running stably'; sleep 30; done"] resources: requests: memory: "64Mi" cpu: "100m" limits: memory: "128Mi" cpu: "200m" EOF
Verify the fixed pod is running stably:
kubectl get pods fixed-pod -w
Wait for a minute or so to confirm it doesn’t restart.
Terminate the watch by pressing Ctrl+C.
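As an alternative to watching, you can block until the pod reports Ready, the same kubectl wait pattern used later in this exercise (the 60-second timeout is an arbitrary choice):

kubectl wait --for=condition=Ready pod/fixed-pod --timeout=60s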
Task 3 - Troubleshoot Application Connectivity with Network Policies
Create a dedicated namespace for network policy testing:
kubectl create namespace netpolicy-test
Create a simple web application and service in the namespace:
# Create a deployment
kubectl create deployment web --image=nginx -n netpolicy-test

# Add the app=web label to the deployment
kubectl label deployment web app=web -n netpolicy-test

# Expose the deployment with a service
kubectl expose deployment web --port=80 --type=ClusterIP -n netpolicy-test
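The network policies created later in this task select pods by label, so it is worth confirming (an optional check, not in the original steps) that the web pods actually carry the app=web label:

# List pods with their labels; the web pod should show app=web
kubectl get pods -n netpolicy-test --show-labels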
Info
If the kubectl label command returns deployment.apps/web not labeled, this is OK. It indicates that the deployment was already labelled.

Create a debugging pod in the same namespace:
kubectl run debug --image=busybox -n netpolicy-test -- sleep 3600
Wait for the pod to be ready:
kubectl wait --for=condition=Ready pod/debug --timeout=60s -n netpolicy-test
Verify connectivity before applying network policies:
kubectl exec -it debug -n netpolicy-test -- wget -O- http://web
You should see the HTML output from the nginx welcome page.
Create a default deny-all policy for the namespace:
Info
We will cover network policies in more detail in the next section. For now, this lab uses them to block all traffic and then allow specific traffic back in.
$denyAllPolicy = @" apiVersion: networking.k8s.io/v1 kind: NetworkPolicy metadata: name: default-deny-ingress namespace: netpolicy-test spec: podSelector: {} # This selects ALL pods in the namespace policyTypes: - Ingress "@ $denyAllPolicy | kubectl apply -f -
cat << EOF | kubectl apply -f - apiVersion: networking.k8s.io/v1 kind: NetworkPolicy metadata: name: default-deny-ingress namespace: netpolicy-test spec: podSelector: {} # This selects ALL pods in the namespace policyTypes: - Ingress EOF
This policy denies all ingress traffic to all pods in the namespace.
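To confirm the policy is in place before testing, you can list and inspect the network policies in the namespace (standard kubectl; describe shows the selector and a rule summary):

# List network policies in the namespace
kubectl get networkpolicy -n netpolicy-test

# Show the selector and rules of the deny-all policy
kubectl describe networkpolicy default-deny-ingress -n netpolicy-test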
Try to access the web service again:
kubectl exec -it debug -n netpolicy-test -- wget -O- --timeout=5 http://web
This should fail because the network policy now blocks all ingress traffic; the attempt times out after five seconds.
Create a policy that allows traffic from the debug pod to the web service:
$allowPolicy = @" apiVersion: networking.k8s.io/v1 kind: NetworkPolicy metadata: name: allow-debug-to-web namespace: netpolicy-test spec: podSelector: matchLabels: app: web policyTypes: - Ingress ingress: - from: - podSelector: matchLabels: run: debug "@ $allowPolicy | kubectl apply -f -
cat << EOF | kubectl apply -f - apiVersion: networking.k8s.io/v1 kind: NetworkPolicy metadata: name: allow-debug-to-web namespace: netpolicy-test spec: podSelector: matchLabels: app: web policyTypes: - Ingress ingress: - from: - podSelector: matchLabels: run: debug EOF
Test connectivity again:
kubectl exec -it debug -n netpolicy-test -- wget -O- http://web
This should succeed now because the network policy specifically allows traffic from the debug pod to the web service.
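If the request still fails, the usual culprit is a label mismatch: the allow policy selects the client pod by run=debug, which kubectl run sets automatically. An optional sanity check:

# The debug pod should carry the run=debug label matched by the policy
kubectl get pod debug -n netpolicy-test --show-labels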
Task 4 - Troubleshoot Resource Constraints
Create a pod with insufficient resources:
$memoryHogPod = @" apiVersion: v1 kind: Pod metadata: name: memory-hog spec: containers: - name: memory-hog image: k8sonazureworkshoppublic.azurecr.io/nginx resources: requests: memory: "100Mi" cpu: "100m" limits: memory: "100Mi" cpu: "100m" command: ["sh", "-c", "apt-get update && apt-get install -y stress && stress --vm 1 --vm-bytes 200M"] "@ $memoryHogPod | kubectl apply -f -
cat << EOF | kubectl apply -f - apiVersion: v1 kind: Pod metadata: name: memory-hog spec: containers: - name: memory-hog image: k8sonazureworkshoppublic.azurecr.io/nginx resources: requests: memory: "100Mi" cpu: "100m" limits: memory: "100Mi" cpu: "100m" command: ["sh", "-c", "apt-get update && apt-get install -y stress && stress --vm 1 --vm-bytes 200M"] EOF
Check the pod status:
kubectl get pods memory-hog -w
The pod should eventually be killed due to OOM (Out of Memory).
Terminate the watch with Ctrl+C.
Diagnose the issue:
kubectl describe pod memory-hog
Look for the termination reason, which should indicate an OOM kill.
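Instead of scanning the full describe output, you can pull the recorded termination reason directly; a sketch using JSONPath, which should print OOMKilled once the container has been killed at least once:

kubectl get pod memory-hog -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'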
Fix the issue by increasing the memory limit:
PowerShell:

kubectl delete pod memory-hog

$memoryFixedPod = @"
apiVersion: v1
kind: Pod
metadata:
  name: memory-fixed
spec:
  containers:
  - name: memory-fixed
    image: k8sonazureworkshoppublic.azurecr.io/nginx
    resources:
      requests:
        memory: "250Mi"
        cpu: "100m"
      limits:
        memory: "300Mi"
        cpu: "100m"
    command: ["sh", "-c", "apt-get update && apt-get install -y stress && stress --vm 1 --vm-bytes 200M"]
"@

$memoryFixedPod | kubectl apply -f -

Bash:

kubectl delete pod memory-hog

cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: memory-fixed
spec:
  containers:
  - name: memory-fixed
    image: k8sonazureworkshoppublic.azurecr.io/nginx
    resources:
      requests:
        memory: "250Mi"
        cpu: "100m"
      limits:
        memory: "300Mi"
        cpu: "100m"
    command: ["sh", "-c", "apt-get update && apt-get install -y stress && stress --vm 1 --vm-bytes 200M"]
EOF
Check that the pod runs correctly:
kubectl get pods memory-fixed -w
Terminate the watch with Ctrl+C after a couple of minutes.
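To see how much memory the stress workload actually uses against the new 300Mi limit, the kubectl top command from Task 1 works per pod as well (metrics may take a minute or two to appear):

kubectl top pod memory-fixed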
Task 5 - Using Kubectl Debug
Create a deployment with multiple replicas:
kubectl create deployment debug-demo --image=nginx --replicas=3
Use kubectl debug to troubleshoot a pod:

PowerShell:

# Get pod names
$POD_NAME = kubectl get pods -l app=debug-demo -o jsonpath='{.items[0].metadata.name}'

# Create a debug container in the pod
kubectl debug $POD_NAME -it --image=busybox --target=nginx

Bash:

# Get pod names
POD_NAME=$(kubectl get pods -l app=debug-demo -o jsonpath='{.items[0].metadata.name}')

# Create a debug container in the pod
kubectl debug $POD_NAME -it --image=busybox --target=nginx
Inside the debug container, run some diagnostics:
# Check processes
ps aux

# Check network
netstat -tulpn

# Check the nginx config in the main container
cat /proc/1/root/etc/nginx/conf.d/default.conf

# Exit when done
exit
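kubectl debug can also troubleshoot at the node level by starting an interactive pod on a node with the node's root filesystem mounted under /host. A sketch, substituting a real node name from kubectl get nodes:

# Start an interactive debugging pod on the node (<node-name> is a placeholder)
kubectl debug node/<node-name> -it --image=busybox

# Inside the session, the node's root filesystem is available under /host
# chroot /host   # uncomment to work directly against the node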
Task 6 - Clean Up
Clean up the resources created in this exercise:
# Delete regular resources
kubectl delete pod crasher-pod fixed-pod memory-fixed
kubectl delete deployment debug-demo

# Delete network policy test resources
kubectl delete namespace netpolicy-test
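A quick way to confirm the cleanup finished (optional; namespace deletion can take a little while, so the namespace may show a Terminating state for a bit before disappearing):

# Should list no pods from this exercise
kubectl get pods

# Should eventually return NotFound
kubectl get namespace netpolicy-test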