Moved from: bobsql.com
This blog post focuses on connecting to the SQL Server BDC, some helpful log files and utility outputs. Understanding a few basics about a SQL Server BDC and the Contained Availability Group (containedag) makes managing and troubleshooting easier.
Cluster IP Address and Port
To connect to a SQL Server BDC use the exposed IP address (FQDN for an AD enabled cluster) and the target’s, listening port. To locate the IP address and port, use one of the following commands:
kubectl cluster-info
Kubernetes master is running at https://10.193.5.209:6443
KubeDNS is running at https://10.193.5.209:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
{
"description": "SQL Server Master Instance Front-End",
"endpoint": "mssql.aris.local,31433",
"name": "sql-server-master",
"protocol": "tds"
},
{
"description": "SQL Server Master Readable Secondary Replicas",
"endpoint": "mssql-secondary.aris.local,31436",
"name": "sql-server-master-readonly",
"protocol": "tds"
}
kubectl get svc -n mssql-cluster
- master-svc-external NodePort 10.99.130.239 <none> 1433:31433/TCP
- master-secondary-svc-external NodePort 10.100.49.221 <none> 1433:31436/TCP
The connection port value depends on the cluster’s High Availability enablement and the desired target of primary or read-only secondary.
Port |
Description |
31433 |
When connecting to a SQL Server the default port is commonly 1433. On a SQL Server BDC 1433 is exposed externally using port 31433 (default.) Example: sqlcmd -UMyUser -PMyPassword -SclusterIP,31433 Non-HA – Connects to the master SQL Server instance. HA Enabled – Connects to the primary (listener for read and write operations.) |
31436 |
For an HA enabled BDC port 31436 connects to the read-only, high availability replicas. |
1433 |
Locally available from /var/opt/mssql-tools/bin/sqlcmd on the pod |
1533 |
Locally available from /var/opt/mssql-tools/bin/sqlcmd on a high availability enabled cluster.
Note: The 1533 port connects to the instance level instead of the high availability listener.
Hint: You may want expose 1533 externally so you can connect with SSMS or Azure Data Studio to execute instance specific queries. Here is an example temporarily exposing 1533 from the three master nodes. kubectl expose pod master-0 -n mssql-cluster --port=1533 --type=NodePort --name=sql1533a kubectl expose pod master-1 -n mssql-cluster --port=1533 --type=NodePort --name=sql1533b kubectl expose pod master-2 -n mssql-cluster --port=1533 --type=NodePort --name=sql1533c kubectl get svc -n mssql-cluster
sql1533a NodePort 10.111.117.80 <none> 1533:8115/TCP sql1533b NodePort 10.111.105.5 <none> 1533:31402/TCP sql1533c NodePort 10.101.183.78 <none> 1533:30367/TCP |
Contained AG (containedag_master, containedag_msdb)
When a SQL Server BDC is enabled for high availability the contained availability group (containedag) is created and maintained. A highly available SQL Server BDC installs master-0, master-1 and master-2 replicas and joins them to the containedag. The containedag includes the databases containedag_master and containedag_msdb (which are commonly hidden from external view.)
When connected to the listener port, database create and drop commands are upgraded to automatically join/remove the database from the containedag and replicate the state to secondaries.
Try this: Connect to the listener port and issue a create database. Establish a second connection to the read-only port and query the database on a secondary replica. The database exists on all three replicas, automatically added to the containedag availability group.
When connected to the SQL Server BDC over the listener port (default=31433), references to master and msdb are automatically redirected to the containedag_master and containedag_msdb. For example, if you connect to the listener and create a SQLAgent, TSQL job the job definition and history are physically stored in the containedag_msdb and replicated to secondary replicas. In fact, I wrote a test creating and executing a job on the primary then connecting to the secondary (port=31436) and waiting for the job outcome to arrive in the job history table. From the user’s perspective it appears as if you are using msdb as you normally would for SQLAgent activities.
Try this: Create a job using the listener port and then issue the following query: select * from msdb..sysjobs
- 31433 – Listener connection: You will see the new job (you are connected to the listener and msdb references are redirected to containedag_msdb)
- 1533 – Instance level connection: You will NOT see the job (you are connected to the physical instance and using the physical database msdb)
- 1533 on a Secondary - Instance level connection: You will NOT see the new job. Issue a use containedag_msdb command and run the query and you will see the new job has been replicated to the secondary replica in containedag_msdb.
- 31436 – Read only connection: You will see the new job (you are connected to the read only listener and msdb references are redirected to containedag_msdb)
Note: The target of the connection determines if the job is replicated or the job remains instance specific.
Execute select db_name(database_id) from sys.dm_hadr_database_replica_states
- Listener connection: You see msdb because of the automatic redirection taking place
- Instance level connection: You see containedag_msdb because you are looking at the physical data
You are querying at the same database. The logical referencing of the contained availability group listener redirection makes containedag_msdb appear as msdb.
Who is Primary and Secondary Replicas?
Using the high availability DMVs (sys.availability_groups, sys.dm_hadr_database_replica_states, …) you can query to see which master-* is currently acting as the primary replica. In fact, login to the listener port and select @@SERVERNAME to quickly determine the server name of the primary replica.
You can also query Kubernetes to view the label assignments for the master-* pods. When a SQL Server High Availability failover takes place, the labels are updated to indicate the role of primary or secondary.
Example: kubectl edit pod master-0 -n <<clusterNamespace>
labels:
MSSQL_CLUSTER: mssql-cluster
app: master
controller-revision-hash: master-79fc98b57
mssql.microsoft.com/sql-instance: master
plane: data
role: master-pool
role.ag.mssql.microsoft.com/master-containedag: secondary
statefulset.kubernetes.io/pod-name: master-0
type: sqlservr
Port 31433 Connection Fails on High Availability SQL Server BDC
A connection to port 31433, on a highly available SQL Server BDC, may fail if the availability group is in an unhealthy state. When a master-* pod is elected as the primary replica it transitions the databases to primary state and listens on port 31433. If the containedag becomes unhealthy the listener won’t be established and connection attempts fail. Troubleshoot an unhealthy containedag as you would troubleshoot any SQL Server Availability Group to restore functionality.
The SQL Server DMVs and errorlog are great places to start when troubleshooting an unhealthy availability group.
Controller.log
The controller log is a great source of information to follow the availability group and database state transitions. You can use the following commands to access the controller.log file.
kubectl exec -it -n clusterNamespace control-xxxxx /bin/bash
cd /var/log/controller/##/##/##/controller.log
azdata bdc debug copy-logs --namespace clusterNamespace --target-folder c:\temp
Search for the log entries containing the keyword “HadrRoleManagerStateMachine” to find entries like the following:
2020-01-30 18:04:35.1118 | INFO | [HadrRoleManagerStateMachine: containedag] Transitioning from 'Healthy' to 'StopLeaseRenewal' state.
2020-01-30 18:04:35.4225 | INFO | [HadrRoleManagerStateMachine: containedag] Transitioning from 'StopLeaseRenewal' to 'Offline' state.
2020-01-30 18:04:35.4988 | INFO | [HadrRoleManagerStateMachine: containedag] Transitioning from 'Offline' to 'Resolving' state.
2020-01-30 18:04:35.5689 | INFO | [HadrRoleManagerStateMachine: containedag] Transitioning from 'Resolving' to 'NoPrimary' state.
2020-01-30 18:04:38.0689 | INFO | [HadrRoleManagerStateMachine: containedag] Transitioning from 'NoPrimary' to 'Resolving' state.
2020-01-30 18:05:13.1124 | INFO | [HadrRoleManagerStateMachine: containedag] Transitioning from 'Resolving' to 'OnlinePending' state.
2020-01-30 18:05:13.1452 | INFO | HadrRoleManagerStateMachine:master:containedag:: OnlinePendingTransition: CurPrimary(master-2)
Other information such as a database drop requests are contained in the controller.log as well.
2020-01-30 18:05:34.4335 | INFO | DropManagedDatabase request received. StatefulSet: master, Database: StressTREBORClone
mssql-ha-supervisor
Each mssql-ha-supervisor container helps to manage high availability and the associated logs contain additional information.
kubectl describe pod master-0 -n clusterNamespace
Containers:
mssql-server:
mssql-ha-supervisor:
…
Sample entries from the mssql-ha-supervisor stdout/stderr log files as copied using the azdata copy-logs command.
[containedag-agent] 2020/02/01 16:37:24 lease expired while database disconnected
[supervisor] 2020/02/01 16:37:25 Database operation state for database GiantDB, requestId bea5a9c7-7e2b-43e7-9f3b-98c4dc342ec8
[supervisor] 2020/02/01 16:37:25 Database operation state for database StressTREBOR_830179E1_9451_4E1A_9FB2_75548674A41E, requestId 7b4281f7-820a-4cde-ad2d-33a8a136de0b
[containedag-agent] 2020/02/01 16:37:27 Sent unhealthy heartbeat
[configuration-agent] ERROR: 2020/02/01 16:37:40 Unable to query AG configurations from SQL Server: read tcp 100.64.2.185:57820->100.64.2.185:1533: i/o timeout
operator-#### pod
The operator pod listens to and reacts to Kubernetes events. The operator log is another location for tracking what transpired on a SQL Server BDC.
2020/02/08 15:32:00 Got pod event for master-1
2020/02/08 15:32:00 Got pod event for master-2
2020/02/08 15:32:00 Got pod event for master-0
Use the SQL Server errorlog, controller, mssql-ha-supervisor, operator and DMV information to view the state of your SQL Server BDC.
Mange with notebook: https://docs.microsoft.com/en-us/sql/big-data-cluster/manage-notebooks?view=sql-server-ver15
Stuart is a colleague of mine and is the ultimate champion for SQL Server BDC notebooks. I learn more about Notebooks everyday as I use them to help monitor and troubleshoot the 12, SQL Server BDCs in the longhaul lab. I enjoy the notebooks because they automate many of the data collection and management activities while saving the output. I also get to view the commands used so I learn how to troubleshoot and manage aspects of the SQL Server BDC I may not be familiar with.
I am currently working to convert the SQL Server BDC longhaul test runs to rely solely on notebook collection. This means our development and support staffs use the same notebooks to troubleshoot a SQL Server BDC we provide to you.
Manage with controller dashboard : https://docs.microsoft.com/en-us/sql/big-data-cluster/manage-with-controller-dashboard?view=sql-server-ver15
Another helpful facility is the controller dashboard. The controller dashboard provides you reports, links and drill-ins for your SQL Server BDC.