Moved from: bobsql.com
If you are new to Kubernetes (K8s), like I am, you quickly learn that the servicing node's journal holds a wealth of information.
In the lab I was encountering evictions due to memory, disk space, and other resource limitations. My collection scripts were connecting to the pods, not the nodes, and that approach misses valuable lineage information needed to fully troubleshoot issues.
For example, my master-0 SQL Server pod was getting evicted and restarted. I used SSH to connect to the master-0 pod.
Note: the debug information and logs located on the pod are reset during the restart, which limits troubleshooting.
Using kubectl, running kubectl get events or kubectl describe pod master-0 -n mssql-cluster did not give me what I needed to understand the kubelet interactions related to the evictions.
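For reference, these are the flavors of those commands I typically start with (the mssql-cluster namespace is from my lab and the sort flag is just a convenience; adjust to your cluster):

kubectl get events -n mssql-cluster --sort-by=.lastTimestamp
kubectl describe pod master-0 -n mssql-cluster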
I then executed kubectl get pods -n mssql-cluster -o wide. The output shows that the master-0 pod is serviced by node adlghl-0 on my cluster.
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
mssql-cluster master-0 3/3 Running 0 4h36m 10.244.0.209 adlghl-0 <none> <none>
I then executed kubectl get nodes -o wide and obtained the IP address of the adlghl-0 node. Using the IP address, I established an SSH connection to the node and continued troubleshooting.
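The pattern looks something like the following (the user name and IP address are placeholders; how you gain SSH access to nodes depends on how your cluster was provisioned):

kubectl get nodes -o wide
ssh <user>@<node-internal-ip>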
I executed various flavors of the journalctl command, spanning the time ranges when the pod restarted, and found events highlighting the eviction.
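For example, commands along these lines narrow the journal to the kubelet and to a specific window (the timestamps are placeholders; substitute the window around your pod restart):

journalctl -u kubelet --since "<window start>" --until "<window end>"
journalctl -u kubelet --since "<window start>" --until "<window end>" | grep -i evict

The kubelet entries that surfaced the eviction decisions looked like this: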
kubelet[1242]: I1205 02:54:35.371117 1242 image_gc_manager.go:300] [imageGCManager]: Disk usage on image filesystem is at 85% which is over the high threshold (85%). Trying to free 5214368563 bytes down to the low threshold (80%).
kubelet[1242]: I1205 02:55:10.469831 1242 eviction_manager.go:344] eviction manager: must evict pod(s) to reclaim ephemeral-storage
kubelet[1242]: I1205 02:55:10.471522 1242 eviction_manager.go:362] eviction manager: pods ranked for eviction: sparkhead-0_mssql-cluster(7bb01e5f-2e33-4b72-b8a7-7737010f56b5), app1-vqc87_mssql-cluster(21a3a235-dda4-44b0-8943-f1b4aa1aebf3), logsdb-0_mssql-cluster(90657763-ffcb-4df3-94c5-7791f4f60c58), gateway-0_mssql-cluster(ea099f36-de15-4e2f-aa59-364e21db97cb), master-0_mssql-cluster(9cef954d-0f2b-4eda-9d14-067214eafd87), compute-0-…
In a more complicated scenario, all pods on a given node appeared to have been paused for ~5 hours. The workload was running along and then stalled for ~5 hours: no performance counters, no query cancellations honored, no new connections honored, current connections made no forward progress, and no new log entries for any service of any pod assigned to the node. It was as if someone put the node into sleep mode for ~5 hours.
Using the time range and servicing node information, I reviewed the CPU metrics using Grafana on my cluster: https://mybdcclusterip:30777/grafana (you can use kubectl cluster-info or azdata list endpoints to acquire the cluster IP address).
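A quick way to get that address from the command line (the Grafana path and the 30777 port are the defaults from my deployment; if you use azdata, the bdc endpoint list command shows the external endpoints as well):

kubectl cluster-info
azdata bdc endpoint list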
The metrics collection showed the ~5 hour gap, and the more troubleshooting I did, the clearer it became that every pod on the same node encountered the same paused behavior.
Using kubectl get pods and kubectl get nodes, I determined the target node and studied the time range of events present in the journal. The journal of the node is what told the real story and allowed me to understand the behavior.
The journal shows that the operating system on the node (the host of the pods in question) encountered an issue: a stalled scheduler at the OS level preventing all threads in all pods assigned to the node from making forward progress.
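Kernel messages like the ones below show up with a filter along these lines (the time range is a placeholder for the window you are investigating; add a -b <boot> selector if the node has rebooted since the incident):

journalctl -k --since "<window start>" --until "<window end>" | grep -i "blocked for more than"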
Nov 01 23:47:46 adlghl-3 kernel: INFO: task khugepaged:47 blocked for more than 120 seconds.
Nov 01 23:47:46 adlghl-3 kernel: Not tainted 4.4.0-165-generic #193-Ubuntu
Nov 01 23:47:46 adlghl-3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Messages like the one above repeated for ~5 hours until the OOM killer killed a process. Then the system came back to life as if the resume button had been pushed.
Nov 02 04:32:20 adlghl-3 kernel: monkey: wrk/0 invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=998
Recommendation
For a SQL Server Big Data Cluster (BDC), or any K8s cluster, deploy a solution that collects the journal and other relevant metrics from the servicing nodes.
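As a starting point, here is a minimal sketch of a per-node collection script (the output directory, time window, and retention strategy are placeholders; a production setup would typically forward the journal to a central log store instead):

#!/bin/bash
# Export recent kubelet and kernel journal entries so they survive pod restarts.
set -euo pipefail
OUTDIR=/var/log/journal-export        # placeholder location
mkdir -p "$OUTDIR"
STAMP=$(date +%Y%m%d%H%M)
# kubelet activity (evictions, image garbage collection, etc.)
journalctl -u kubelet --since "1 hour ago" -o json > "$OUTDIR/$(hostname)-kubelet-$STAMP.json"
# kernel messages (hung tasks, oom-killer)
journalctl -k --since "1 hour ago" -o json > "$OUTDIR/$(hostname)-kernel-$STAMP.json"

Run it from cron or a systemd timer on each node, or replace it with a full log-shipping agent if you already have one in place.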