Exercise 2: Troubleshooting AKS Issues

Task 1 - Common Troubleshooting Commands

  1. Familiarize yourself with key troubleshooting commands:

    # Get pod status
    kubectl get pods
    
    # Get detailed pod information
    kubectl describe pod <pod-name>
    
    # Get pod logs
    kubectl logs <pod-name>
    
    # Get logs from previous instance of a pod (if it restarted)
    kubectl logs <pod-name> --previous
    
    # Get events across the cluster
    kubectl get events --sort-by='.lastTimestamp'
    
    # Check node status
    kubectl get nodes
    kubectl describe node <node-name>
    
    # Get resource usage
    kubectl top pods
    kubectl top nodes
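
    If the cluster-wide event list is noisy, you can narrow it to a single object with a field selector, and stream logs as they are written (both are standard kubectl options; substitute your own pod name):

    # Show only events for a specific pod
    kubectl get events --field-selector involvedObject.name=<pod-name>

    # Stream logs as they are written
    kubectl logs <pod-name> -f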

Task 2 - Simulate and Troubleshoot a Crashing Pod

  1. Create and examine the crasher pod manifest:

    $crasherPod = @"
    apiVersion: v1
    kind: Pod
    metadata:
      name: crasher-pod
    spec:
      containers:
      - name: crasher
        image: k8sonazureworkshoppublic.azurecr.io/busybox
        command: ["/bin/sh"]
        args: ["-c", "echo 'Starting pod...'; echo 'Pod running normally...'; sleep 10; echo 'WARNING: Memory corruption detected!'; echo 'ERROR: Critical system failure!'; echo 'Pod will now terminate with error...'; exit 1"]  # This will output logs and crash after 10 seconds
        resources:
          requests:
            memory: "64Mi"
            cpu: "100m"
          limits:
            memory: "128Mi"
            cpu: "200m"
      restartPolicy: Always  # The pod will continuously restart and crash
    "@
    
    # Output the manifest to review it
    $crasherPod

    Bash:

    cat << EOF > crasher-pod.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: crasher-pod
    spec:
      containers:
      - name: crasher
        image: k8sonazureworkshoppublic.azurecr.io/busybox
        command: ["/bin/sh"]
        args: ["-c", "echo 'Starting pod...'; echo 'Pod running normally...'; sleep 10; echo 'WARNING: Memory corruption detected!'; echo 'ERROR: Critical system failure!'; echo 'Pod will now terminate with error...'; exit 1"]  # This will output logs and crash after 10 seconds
        resources:
          requests:
            memory: "64Mi"
            cpu: "100m"
          limits:
            memory: "128Mi"
            cpu: "200m"
      restartPolicy: Always  # The pod will continuously restart and crash
    EOF
    
    # Output the manifest to review it
    cat crasher-pod.yaml

    This pod is designed to output several log messages and then exit with an error after 10 seconds, causing it to crash and restart.

  2. Deploy the crashing pod:

    PowerShell:

    $crasherPod | kubectl apply -f -

    Bash:

    kubectl apply -f crasher-pod.yaml
  3. Observe the pod’s status:

    kubectl get pods -w

    You should see the pod status cycle through Running → Error → CrashLoopBackOff → Running again.

  4. Terminate the watch by pressing Ctrl+C.

  5. Diagnose the issue:

    # Check the pod status
    kubectl describe pod crasher-pod
    
    # Look at the restart count and last state
    
    # Check the logs
    kubectl logs crasher-pod

    You should see error messages in the logs that provide clues about why the pod is crashing, such as WARNING: Memory corruption detected! and ERROR: Critical system failure!.
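
    You can also read the restart count directly from the pod status; a quick check using jsonpath (the single quotes shown are Bash-style; adjust quoting for PowerShell):

    # Show how many times the container has restarted
    kubectl get pod crasher-pod -o jsonpath='{.status.containerStatuses[0].restartCount}'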

  6. Fix the issue by creating a new pod that doesn’t crash:

    $fixedPod = @"
    apiVersion: v1
    kind: Pod
    metadata:
      name: fixed-pod
    spec:
      containers:
      - name: fixed
        image: k8sonazureworkshoppublic.azurecr.io/busybox
        command: ["/bin/sh"]
        args: ["-c", "while true; do echo 'Running stably'; sleep 30; done"]
        resources:
          requests:
            memory: "64Mi"
            cpu: "100m"
          limits:
            memory: "128Mi"
            cpu: "200m"
    "@
    
    $fixedPod | kubectl apply -f -

    Bash:

    cat << EOF | kubectl apply -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: fixed-pod
    spec:
      containers:
      - name: fixed
        image: k8sonazureworkshoppublic.azurecr.io/busybox
        command: ["/bin/sh"]
        args: ["-c", "while true; do echo 'Running stably'; sleep 30; done"]
        resources:
          requests:
            memory: "64Mi"
            cpu: "100m"
          limits:
            memory: "128Mi"
            cpu: "200m"
    EOF
  7. Verify the fixed pod is running stably:

    kubectl get pods fixed-pod -w

    Wait for a minute or so to confirm it doesn’t restart.
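
    As an alternative to watching, you can assert readiness directly; this waits up to 60 seconds for the pod to report Ready (though a Ready condition alone does not rule out a later crash, so the watch remains the more thorough check):

    kubectl wait --for=condition=Ready pod/fixed-pod --timeout=60s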

  8. Terminate the watch by pressing Ctrl+C.

Task 3 - Troubleshoot Application Connectivity with Network Policies

  1. Create a dedicated namespace for network policy testing:

    kubectl create namespace netpolicy-test
  2. Create a simple web application and service in the namespace:

    # Create a deployment
    kubectl create deployment web --image=nginx -n netpolicy-test
    
    # Add the app=web label to the deployment
    kubectl label deployment web app=web -n netpolicy-test
    
    # Expose the deployment with a service
    kubectl expose deployment web --port=80 --type=ClusterIP -n netpolicy-test

    Info

    If the kubectl label command returns deployment.apps/web not labeled, this is OK; it indicates that the deployment already carried the label. You can verify the labels as shown below.
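
    To confirm, list the deployment and its pods with their labels shown:

    # Verify the labels on the deployment and its pods
    kubectl get deployment web -n netpolicy-test --show-labels
    kubectl get pods -n netpolicy-test --show-labels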

  3. Create a debugging pod in the same namespace:

    kubectl run debug --image=busybox -n netpolicy-test -- sleep 3600
  4. Wait for the pod to be ready:

    kubectl wait --for=condition=Ready pod/debug --timeout=60s -n netpolicy-test
  5. Verify connectivity before applying network policies:

    kubectl exec -it debug -n netpolicy-test -- wget -O- http://web

    You should see the HTML output from the nginx welcome page.
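
    If this check ever fails before any policies are in place, a common cause is the service not selecting any pods; verifying the service endpoints is a quick way to rule that out:

    # The service should list at least one pod IP under ENDPOINTS
    kubectl get endpoints web -n netpolicy-test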

  6. Create a default deny-all policy for the namespace:

    Info

    We will cover network policies in more detail in the next section, but for now, these policies will be used for our lab to block all traffic and then allow specific traffic.

    $denyAllPolicy = @"
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: default-deny-ingress
      namespace: netpolicy-test
    spec:
      podSelector: {}  # This selects ALL pods in the namespace
      policyTypes:
      - Ingress
    "@
    
    $denyAllPolicy | kubectl apply -f -

    Bash:

    cat << EOF | kubectl apply -f -
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: default-deny-ingress
      namespace: netpolicy-test
    spec:
      podSelector: {}  # This selects ALL pods in the namespace
      policyTypes:
      - Ingress
    EOF

    This policy denies all ingress traffic to all pods in the namespace.
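
    You can inspect the policy like any other resource to see which pods it selects and what it allows:

    kubectl describe networkpolicy default-deny-ingress -n netpolicy-test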

  7. Try to access the web service again:

    kubectl exec -it debug -n netpolicy-test -- wget -O- --timeout=5 http://web

    This should fail due to the network policy blocking all traffic, and you should find that the attempt times out.

  8. Create a policy that allows traffic from the debug pod to the web service:

    $allowPolicy = @"
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: allow-debug-to-web
      namespace: netpolicy-test
    spec:
      podSelector: 
        matchLabels:
          app: web
      policyTypes:
      - Ingress
      ingress:
      - from:
        - podSelector:
            matchLabels:
              run: debug
    "@
    
    $allowPolicy | kubectl apply -f -

    Bash:

    cat << EOF | kubectl apply -f -
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: allow-debug-to-web
      namespace: netpolicy-test
    spec:
      podSelector: 
        matchLabels:
          app: web
      policyTypes:
      - Ingress
      ingress:
      - from:
        - podSelector:
            matchLabels:
              run: debug
    EOF
  9. Test connectivity again:

    kubectl exec -it debug -n netpolicy-test -- wget -O- http://web

    This should succeed now because the network policy specifically allows traffic from the debug pod to the web service.
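
    Note that both policies now coexist in the namespace. Network policies are additive: traffic is allowed as soon as any policy permits it. You can list them to confirm:

    kubectl get networkpolicy -n netpolicy-test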

Task 4 - Troubleshoot Resource Constraints

  1. Create a pod with insufficient resources:

    $memoryHogPod = @"
    apiVersion: v1
    kind: Pod
    metadata:
      name: memory-hog
    spec:
      containers:
      - name: memory-hog
        image: k8sonazureworkshoppublic.azurecr.io/nginx
        resources:
          requests:
            memory: "100Mi"
            cpu: "100m"
          limits:
            memory: "100Mi"
            cpu: "100m"
        command: ["sh", "-c", "apt-get update && apt-get install -y stress && stress --vm 1 --vm-bytes 200M"]
    "@
    
    $memoryHogPod | kubectl apply -f -

    Bash:

    cat << EOF | kubectl apply -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: memory-hog
    spec:
      containers:
      - name: memory-hog
        image: k8sonazureworkshoppublic.azurecr.io/nginx
        resources:
          requests:
            memory: "100Mi"
            cpu: "100m"
          limits:
            memory: "100Mi"
            cpu: "100m"
        command: ["sh", "-c", "apt-get update && apt-get install -y stress && stress --vm 1 --vm-bytes 200M"]
    EOF
  2. Check the pod status:

    kubectl get pods memory-hog -w

    The pod should eventually be killed due to OOM (Out of Memory).

  3. Terminate the watch with Ctrl+C.

  4. Diagnose the issue:

    kubectl describe pod memory-hog

    Look for the termination reason, which should indicate an OOM kill.

    Last State:     Terminated
      Reason:       OOMKilled
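
    The termination reason can also be read directly from the pod status with jsonpath (assuming the container has already been killed at least once; the single quotes shown are Bash-style):

    kubectl get pod memory-hog -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'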

  5. Fix the issue by increasing the memory limit:

    PowerShell:

    kubectl delete pod memory-hog
    
    $memoryFixedPod = @"
    apiVersion: v1
    kind: Pod
    metadata:
      name: memory-fixed
    spec:
      containers:
      - name: memory-fixed
        image: k8sonazureworkshoppublic.azurecr.io/nginx
        resources:
          requests:
            memory: "250Mi"
            cpu: "100m"
          limits:
            memory: "300Mi"
            cpu: "100m"
        command: ["sh", "-c", "apt-get update && apt-get install -y stress && stress --vm 1 --vm-bytes 200M"]
    "@
    
    $memoryFixedPod | kubectl apply -f -

    Bash:

    kubectl delete pod memory-hog
    
    cat << EOF | kubectl apply -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: memory-fixed
    spec:
      containers:
      - name: memory-fixed
        image: k8sonazureworkshoppublic.azurecr.io/nginx
        resources:
          requests:
            memory: "250Mi"
            cpu: "100m"
          limits:
            memory: "300Mi"
            cpu: "100m"
        command: ["sh", "-c", "apt-get update && apt-get install -y stress && stress --vm 1 --vm-bytes 200M"]
    EOF
  6. Check that the pod runs correctly:

    kubectl get pods memory-fixed -w

    Terminate the watch with Ctrl+C after a couple of minutes.
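
    If the metrics server is available (it is enabled by default on AKS), you can also confirm that the pod's actual memory usage stays below the new 300Mi limit:

    kubectl top pod memory-fixed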

Task 5 - Using Kubectl Debug

  1. Create a deployment with multiple replicas:

    kubectl create deployment debug-demo --image=nginx --replicas=3
  2. Use kubectl debug to troubleshoot a pod:

    PowerShell:

    # Get pod names
    $POD_NAME = kubectl get pods -l app=debug-demo -o jsonpath='{.items[0].metadata.name}'
    
    # Create a debug container in the pod
    kubectl debug $POD_NAME -it --image=busybox --target=nginx

    Bash:

    # Get pod names
    POD_NAME=$(kubectl get pods -l app=debug-demo -o jsonpath='{.items[0].metadata.name}')
    
    # Create a debug container in the pod
    kubectl debug $POD_NAME -it --image=busybox --target=nginx
  3. Inside the debug container, run some diagnostics:

    # Check processes
    ps aux
    
    # Check network
    netstat -tulpn
    
    # Check the nginx config in the main container (its filesystem is
    # reachable via /proc/1/root because the debug container shares the
    # target container's process namespace)
    cat /proc/1/root/etc/nginx/conf.d/default.conf
    
    # Exit when done
    exit
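
    kubectl debug can also target a node rather than a pod. This schedules a debugging pod on the node with the host filesystem mounted at /host (replace <node-name> with a name from kubectl get nodes):

    # Start an interactive debugging session on a node
    kubectl debug node/<node-name> -it --image=busybox

    # Note: the generated node-debugger pod is not removed automatically;
    # delete it afterwards, e.g. kubectl delete pod <node-debugger-pod-name>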

Task 6 - Clean Up

  1. Clean up the resources created in this exercise:

    # Delete regular resources
    kubectl delete pod crasher-pod fixed-pod memory-fixed
    kubectl delete deployment debug-demo
    
    # Delete network policy test resources
    kubectl delete namespace netpolicy-test