Spot instances in Azure Databricks can greatly reduce costs; however, we strongly advise against using them for critical production workloads with strict SLAs. Because Spot capacity depends on availability and Azure can evict Spot VMs at any time, they put workload stability at risk. If you still choose to use Spot instances for such workloads, follow the best practices below to mitigate the risk.
Spot instances provide a cost-efficient way to scale compute clusters, but improper configuration can lead to instability and job failures. This post outlines key best practices for using Spot instances, both with auto-scaling disabled and with it enabled.
When Auto-Scaling Is Disabled with Spot Instances
Without auto-scaling, Spot instance availability is crucial for successful cluster startup. Here’s what you need to consider:
- Cluster Availability
Ensure that at least 80% of the total requested nodes are available at startup. For instance, if you request four Spot worker nodes, the eviction of even a single node leaves three of four (75%), which is below the threshold and can delay the cluster's launch.
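The arithmetic behind this rule can be sketched in a few lines. This is an illustration only: the function name is hypothetical, and treating the rule as an "at least 80% of requested workers" comparison is our reading of the example above, not documented Databricks behavior.

```python
def meets_startup_threshold(available: int, requested: int, percent: int = 80) -> bool:
    """Return True if `available` nodes cover at least `percent`% of `requested`.

    Hypothetical helper; integer math avoids floating-point rounding.
    """
    if requested <= 0:
        raise ValueError("requested must be positive")
    return available * 100 >= requested * percent


# Four Spot workers requested, one evicted: 3 of 4 is 75%, below 80%.
print(meets_startup_threshold(3, 4))   # False
# Ten requested, two evicted: 8 of 10 is exactly 80%.
print(meets_startup_threshold(8, 10))  # True
```

Small clusters are the most exposed: with four workers, a single eviction already drops below the threshold, while a ten-worker cluster can absorb two.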
- Cluster Launch Attributes
Set the Azure attribute availability to SPOT_WITH_FALLBACK_AZURE when launching the cluster. If Spot instances are unavailable, on-demand compute nodes are provisioned instead, preventing cluster launch failures.
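A fixed-size cluster with Spot fallback could be specified roughly as follows. This is a minimal sketch: we assume the fallback is expressed as `availability` set to `SPOT_WITH_FALLBACK_AZURE` under `azure_attributes`, as in the Databricks Clusters API. The cluster name, runtime version, and VM size are placeholder values, so check the payload against the API reference for your workspace.

```python
import json

# Placeholder values: adjust the name, runtime version, and VM size.
fixed_size_spot_cluster = {
    "cluster_name": "spot-fixed-size-example",
    "spark_version": "15.4.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 4,  # fixed size: no "autoscale" block
    "azure_attributes": {
        # Request Spot VMs, falling back to on-demand VMs when Spot is unavailable.
        "availability": "SPOT_WITH_FALLBACK_AZURE",
    },
}

# Print the spec as JSON, for example to paste into a cluster or job definition.
print(json.dumps(fixed_size_spot_cluster, indent=2))
```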
- Avoid Using Pools with Spot Instances
Creating clusters from pools with Spot instances can introduce instability, especially if the driver node is assigned to a Spot instance. To prevent this, we strongly discourage using pools with Spot instances when launching clusters.
When Auto-Scaling Is Enabled with Spot Instances
Auto-scaling allows clusters to dynamically adjust resources, but careful setup is necessary for smooth scaling.
- On-Demand Nodes First
Set the Azure attribute first_on_demand=2 in the job cluster definition. This ensures that the first two nodes (one driver and one worker) are on-demand, stabilizing cluster creation.
- Autoscaling Settings
  - Enable auto-scaling on the cluster.
  - Set min_workers=1; combined with first_on_demand=2, this ensures that at least one worker is always on-demand.
  - Define the maximum cluster size (max_workers) to prevent over-scaling.
This setup ensures reliable cluster startup and reduces the risk of job failures.
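Put together, the on-demand-first and autoscaling settings above might look like the sketch below. The builder function is hypothetical, the runtime version and VM size are placeholders, and the choice of `SPOT_WITH_FALLBACK_AZURE` for the remaining nodes is our assumption rather than something this post prescribes.

```python
def build_autoscaling_spot_cluster(max_workers: int) -> dict:
    """Build an autoscaling cluster spec whose first two nodes are on-demand.

    Hypothetical helper that returns a dict shaped like a job's new_cluster block.
    """
    if max_workers < 1:
        raise ValueError("max_workers must be at least 1")
    return {
        "spark_version": "15.4.x-scala2.12",  # placeholder runtime version
        "node_type_id": "Standard_DS3_v2",    # placeholder VM size
        "autoscale": {"min_workers": 1, "max_workers": max_workers},
        "azure_attributes": {
            # The driver and the first worker are placed on on-demand VMs.
            "first_on_demand": 2,
            # Assumption: remaining workers use Spot with on-demand fallback.
            "availability": "SPOT_WITH_FALLBACK_AZURE",
        },
    }


new_cluster = build_autoscaling_spot_cluster(max_workers=8)
print(new_cluster["autoscale"])
```

With min_workers=1 and first_on_demand=2, the cluster's minimum footprint (driver plus one worker) never depends on Spot capacity; only the workers added during upscaling do.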
- Upscaling Considerations
The cluster should always start with on-demand nodes before scaling up with Spot instances. While this approach improves stability, it may slightly increase overall job duration, because Spot workers are added only as the cluster scales up.
- Avoid Using Pools with Spot Instances
As in the setup without auto-scaling, avoid creating clusters from pools with Spot instances. Doing so can lead to delayed startups and instability.
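The recommendations in this section can be turned into a small pre-flight check on a cluster specification. The sketch below is a hypothetical lint, not a Databricks feature. It assumes the spec uses the Clusters API field names `instance_pool_id`, `driver_instance_pool_id`, `autoscale`, and `azure_attributes`, and it flags any pool usage because it cannot tell from the spec alone whether the pool is backed by Spot instances.

```python
def lint_spot_cluster_spec(spec: dict) -> list[str]:
    """Return warnings for settings that conflict with the practices above."""
    warnings = []

    # Pool-backed clusters: the lint cannot see the pool's own Spot settings.
    if "instance_pool_id" in spec or "driver_instance_pool_id" in spec:
        warnings.append("cluster is created from a pool; avoid pools with Spot instances")

    autoscale = spec.get("autoscale")
    if autoscale is not None:
        azure = spec.get("azure_attributes", {})
        if azure.get("first_on_demand", 0) < 2:
            warnings.append("set first_on_demand=2 so the driver and first worker are on-demand")
        if autoscale.get("min_workers", 0) < 1:
            warnings.append("set min_workers to at least 1")
        if "max_workers" not in autoscale:
            warnings.append("define max_workers to cap cluster size")

    return warnings


risky_spec = {
    "instance_pool_id": "pool-123",  # made-up pool ID for illustration
    "autoscale": {"min_workers": 0, "max_workers": 8},
}
for warning in lint_spot_cluster_spec(risky_spec):
    print("WARNING:", warning)
```

A check like this could run in CI before job definitions are deployed, so risky Spot configurations are caught before they reach a workspace.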
Final Thoughts
By following these best practices, you can maximize the benefits of Spot instances while ensuring cluster stability and efficiency. Whether auto-scaling is enabled or not, prioritizing on-demand instances during startup and carefully managing scaling policies will help mitigate potential risks.