VirtualMachineScaleSets and ZonalAllocationFailed


Hello,

 

We are using Virtual Machine Scale Sets. Due to an unexpected Azure event, one of our machines was deallocated by a "RepairVM" event and then could not be started again, failing with the error ZonalAllocationFailed.

 

Virtual Machine Scale Sets offer manual and automatic scaling, but to reserve "Capacity" we need to create an "OnDemand Reservation" ourselves, so in the end it feels like DIY rather than a platform. Would you expect customers running Virtual Machine Scale Sets to assume they need no "Reservation of Capacity" for "Manual Scaling" and for "Automatic - Min Instances"?

 

Thanks.

1 Reply
The most common cause of this error is limited compute capacity in the cluster or host where your VM scale set is deployed. Any incident that deallocates your VM exposes it to the risk of not starting again if the host cluster is close to full. Your VM was unlucky: most likely someone else's VM that had been stopped was started while yours was deallocated, and it took the capacity yours had been using.

The fastest way to get your VM back is to resize it to another SKU, either larger or smaller, and then try to start it. It is very unlikely that every SKU is used up. If that does not work, take a snapshot of the VM, create a disk from the snapshot, then create another VM from that disk and add it to the same cluster. I'm not 100% sure, but this might place the new VM on a different host or a different cluster with the same SKU you originally had.
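The resize-and-retry workaround could be scripted roughly like this. This is only a sketch of the retry logic, not an official Azure sample: `AllocationFailed`, `start_with_sku_fallback`, and the `resize` and `start` callables are hypothetical names, and you would need to wire them to whatever Azure CLI or SDK calls you actually use to change the instance size and start it.

```python
from typing import Callable, Iterable, Optional


class AllocationFailed(Exception):
    """Hypothetical stand-in for an allocation error such as ZonalAllocationFailed."""


def start_with_sku_fallback(
    candidate_sizes: Iterable[str],
    resize: Callable[[str], None],
    start: Callable[[], None],
) -> Optional[str]:
    """Try each candidate size in order.

    Returns the first size for which the VM started, or None if every
    candidate failed with AllocationFailed.
    """
    for size in candidate_sizes:
        try:
            resize(size)  # assumed to change the size of the deallocated VM
            start()       # assumed to raise AllocationFailed when capacity is short
            return size
        except AllocationFailed:
            # No capacity for this size right now; move on to the next candidate.
            continue
    return None
```

You would pass the original size first and the alternatives after it, for example `["Standard_D4s_v3", "Standard_D4as_v4"]` (illustrative sizes only), so the VM keeps its SKU whenever capacity has come back. If the function returns None, the snapshot route is the next thing to try.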

My thoughts: This is an unfair situation for a customer with critical workloads on the VM. I think the Microsoft product team should take this feedback and enable capacity monitoring on clusters. The customer should never be the first to find out that the cluster is full. After all, the more VMs that are running, the better Microsoft's bottom line looks :)