When we deploy SQL Server on Azure Kubernetes Service (AKS), we may sometimes find that SQL Server high availability (HA) does not work as expected.
For example, when we deploy AKS using our default sample with 2 nodes:
az aks create \
    --resource-group myResourceGroup \
    --name myAKSCluster \
    --node-count 2 \
    --generate-ssh-keys \
    --attach-acr <acrName>
There should be 2 instances deployed in the AKS virtual machine scale set:
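Assuming kubectl is already configured against the cluster (for example via az aks get-credentials), the two nodes can be confirmed like this; the resource names are the sample ones used in this walkthrough:

```shell
# Fetch credentials for the sample cluster (names from the az aks create command above)
az aks get-credentials --resource-group myResourceGroup --name myAKSCluster

# With --node-count 2 we expect two VMSS-backed agent nodes in Ready state
kubectl get nodes -o wide
```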
According to the SQL document:
In the following diagram, the node hosting the mssql-server container has failed. The orchestrator starts the new pod on a different node, and mssql-server reconnects to the same persistent storage. The service connects to the re-created mssql-server.
However, this does not always hold true when we manually stop an AKS node instance from the portal.
Before we stop any node, the SQL pod status is Running.
If we stop node 0, nothing happens, because the SQL pod resides on node 1; the status of the SQL pod remains Running.
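To check which node the SQL pod is actually scheduled on before stopping anything, the pod's label (app=mssql, visible in the pod description later in this post) can be used as a selector:

```shell
# The NODE column reveals which VMSS instance hosts the SQL pod,
# e.g. aks-nodepool1-26283775-vmss000001 for node 1
kubectl get pods -l app=mssql -o wide
```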
However, if we stop node 1 instead of node 0, the issue appears.
We may see the original SQL pod remain in Terminating status while the new SQL pod is stuck in ContainerCreating:
$ kubectl describe pod mssql-deployment-569f96888d-bkgvf
Name:           mssql-deployment-569f96888d-bkgvf
Namespace:      default
Priority:       0
Node:           aks-nodepool1-26283775-vmss000000/10.240.0.4
Start Time:     Thu, 17 Dec 2020 16:29:10 +0800
Labels:         app=mssql
                pod-template-hash=569f96888d
Annotations:    <none>
Status:         Pending
IP:
IPs:            <none>
Controlled By:  ReplicaSet/mssql-deployment-569f96888d
Containers:
  mssql:
    Container ID:
    Image:          mcr.microsoft.com/mssql/server:2017-latest
    Image ID:
    Port:           1433/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:
      MSSQL_PID:    Developer
      ACCEPT_EULA:  Y
      SA_PASSWORD:  <set to the key 'SA_PASSWORD' in secret 'mssql'>  Optional: false
    Mounts:
      /var/opt/mssql from mssqldb (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-jh9rf (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  mssqldb:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  mssql-data
    ReadOnly:   false
  default-token-jh9rf:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-jh9rf
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason              Age                  From                     Message
  ----     ------              ----                 ----                     -------
  Normal   Scheduled           <unknown>            default-scheduler        Successfully assigned default/mssql-deployment-569f96888d-bkgvf to aks-nodepool1-26283775-vmss000000
  Warning  FailedAttachVolume  18m                  attachdetach-controller  Multi-Attach error for volume "pvc-6e3d4aac-6449-4c9d-86d0-c2488583ec5c" Volume is already used by pod(s) mssql-deployment-569f96888d-d8kz7
  Warning  FailedMount         3m16s (x4 over 14m)  kubelet, aks-nodepool1-26283775-vmss000000  Unable to attach or mount volumes: unmounted volumes=[mssqldb], unattached volumes=[mssqldb default-token-jh9rf]: timed out waiting for the condition
  Warning  FailedMount         62s (x4 over 16m)    kubelet, aks-nodepool1-26283775-vmss000000  Unable to attach or mount volumes: unmounted volumes=[mssqldb], unattached volumes=[default-token-jh9rf mssqldb]: timed out waiting for the condition
This Multi-Attach error is expected under the current AKS design: the persistent volume is backed by an Azure managed disk that can only be attached to one node at a time, and because the node was stopped rather than drained, the disk stays attached to the stopped node and cannot be mounted by the new pod.
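The single-attach behavior is reflected in the claim's access mode. A quick check (the claim name mssql-data comes from the pod description above; an Azure-disk-backed claim typically reports ReadWriteOnce):

```shell
# Print the access modes of the SQL Server data claim
kubectl get pvc mssql-data -o jsonpath='{.spec.accessModes}{"\n"}'
```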
If you restart the node instance that was shut down, the disk can be detached and the issue resolves.
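The stopped instance can also be started again from the CLI instead of the portal. This is a sketch using the sample names from this walkthrough; the scale-set name and instance ID will differ in your environment:

```shell
# Find the node resource group that hosts the cluster's scale set
NODE_RG=$(az aks show --resource-group myResourceGroup --name myAKSCluster \
    --query nodeResourceGroup -o tsv)

# Start instance 1 of the scale set (the node that was stopped)
az vmss start --resource-group "$NODE_RG" \
    --name aks-nodepool1-26283775-vmss --instance-ids 1
```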