SQL Server BDC Hints and Tips: The node’s Journal can be your best friend
Published Feb 09 2022

Moved from: bobsql.com

 

If you are new to Kubernetes (K8s), like I am, you quickly learn that the servicing node’s journal holds a wealth of information.

In the lab I was encountering evictions due to memory, disk space, and other resource limitations.  My collection scripts were connecting to the pods and not the nodes.  This approach lacks the valuable lineage information needed to fully troubleshoot issues.

 

For example, my master-0 (SQL Server) pod was getting evicted and restarted.  I used SSH to connect to the master-0 pod.

 

Note: the debug information and logs located on the pod are reset during the restart, which limits troubleshooting.

 

Running kubectl get events or kubectl describe pod master-0 -n mssql-cluster did not give me what I needed to understand the kubelet interactions related to the evictions.
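
If you want to start from the same vantage point, commands along these lines are a reasonable first pass (the --sort-by flag is just a convenience I add; the mssql-cluster namespace matches my deployment, adjust for yours):

kubectl get events -n mssql-cluster --sort-by=.lastTimestamp

kubectl describe pod master-0 -n mssql-cluster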

 

I then executed kubectl get pods -n mssql-cluster -o wide.  The output shows that the master-0 pod is serviced by node adlghl-0 on my cluster.

 

NAMESPACE       NAME                                    READY   STATUS    RESTARTS   AGE     IP              NODE       NOMINATED NODE   READINESS GATES

mssql-cluster   master-0                                3/3     Running   0          4h36m   10.244.0.209    adlghl-0   <none>           <none>


I then executed kubectl get nodes -o wide and obtained the IP address of the adlghl-0 node.  Using the IP address, I established an SSH connection to the node and continued troubleshooting.
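
A sketch of that hop, with placeholders for the values you read from the output (the admin account name depends on how your nodes were provisioned):

kubectl get nodes -o wide

ssh <admin-user>@<node-ip>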

 

I executed various flavors of the journalctl command, spanning the time range in which the pod restarted, and found events highlighting the eviction.
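
A couple of the flavors I leaned on, shown here as a sketch with placeholder time ranges (substitute the window around your pod restart):

journalctl -u kubelet --since "<start of window>" --until "<end of window>"

journalctl -u kubelet --no-pager | grep -i evict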

 

kubelet[1242]: I1205 02:54:35.371117    1242 image_gc_manager.go:300] [imageGCManager]: Disk usage on image filesystem is at 85% which is over the high threshold (85%). Trying to free 5214368563 bytes down to the low threshold (80%).

 

kubelet[1242]: I1205 02:55:10.469831    1242 eviction_manager.go:344] eviction manager: must evict pod(s) to reclaim ephemeral-storage

 

kubelet[1242]: I1205 02:55:10.471522    1242 eviction_manager.go:362] eviction manager: pods ranked for eviction: sparkhead-0_mssql-cluster(7bb01e5f-2e33-4b72-b8a7-7737010f56b5), app1-vqc87_mssql-cluster(21a3a235-dda4-44b0-8943-f1b4aa1aebf3), logsdb-0_mssql-cluster(90657763-ffcb-4df3-94c5-7791f4f60c58), gateway-0_mssql-cluster(ea099f36-de15-4e2f-aa59-364e21db97cb), master-0_mssql-cluster(9cef954d-0f2b-4eda-9d14-067214eafd87), compute-0-…

 

In a more complicated scenario, all pods on a given node appeared to have been paused for ~5 hours.  The workload was running along and then stalled: no performance counters, no query cancellations honored, no new connections honored, no forward progress on current connections, and no new log entries for any service of any pod assigned to the node.  It was as if someone had put the node into sleep mode for ~5 hours.

 

Using the time range and servicing node information, I reviewed the CPU metrics using Grafana on my cluster:  https://mybdcclusterip:30777/grafana   (You can use kubectl cluster-info or azdata list endpoints to acquire the cluster IP address.)
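
For example, to locate the cluster IP from kubectl (the service listing is only there to surface the externally exposed endpoints; service names vary by deployment):

kubectl cluster-info

kubectl get svc -n mssql-cluster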

 

The metrics collection showed the ~5 hour gap, and the more troubleshooting I did, the clearer it became that every pod on the same node encountered the same paused behavior.

(Grafana screenshot: CPU metrics showing the ~5 hour collection gap)

 

Using kubectl get pods and kubectl get nodes, I determined the target node and studied the time range of events present in its journal.  The node’s journal is what told the real story and allowed me to understand the behavior.

 

The journal shows the operating system on the node (hosting the pods in question) encountered an issue: a stalled scheduler at the OS level, preventing all threads in all pods assigned to the node from making forward progress.
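
To scope the node’s journal to kernel messages for that window, something along these lines works (placeholder times; use the range from your own incident):

journalctl -k --since "<start of stall>" --until "<after recovery>"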

 

Nov 01 23:47:46 adlghl-3 kernel: INFO: task khugepaged:47 blocked for more than 120 seconds.

Nov 01 23:47:46 adlghl-3 kernel:       Not tainted 4.4.0-165-generic #193-Ubuntu

Nov 01 23:47:46 adlghl-3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

Messages like the one above repeated for ~5 hours until the OOM killer terminated a process.  Then the system came back to life as if the resume button had been pushed.
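
A quick way to locate the OOM kill in the node’s journal (the grep pattern is simply what I search for):

journalctl -k | grep -i "oom-killer"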

 

Nov 02 04:32:20 adlghl-3 kernel: monkey: wrk/0 invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=998

 

Recommendation

For a SQL Server Big Data Cluster (BDC), or any K8s cluster, deploy a solution that collects the journal and other relevant metrics from the servicing nodes.
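
Even without a full monitoring stack, a simple periodic export from each node preserves this lineage information.  As a sketch (the 24-hour window and output location are arbitrary choices, not requirements):

journalctl --since "-24h" -o json > /tmp/$(hostname)-journal-$(date +%Y%m%d).json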
