Model Training and Fine-Tuning with Serverless Compute

Microsoft

Nov 15, 2023

We are happy to announce the General Availability of Model Training with Serverless Compute.

Serverless compute is a fully managed, on-demand compute target that simplifies running training jobs in Azure Machine Learning. With serverless compute, machine learning (ML) professionals can focus on building ML models rather than learning about compute infrastructure. It also reduces the management burden on IT admins: Azure Machine Learning manages the compute infrastructure and provides managed network isolation, while still meeting the most stringent enterprise security requirements. All Azure Machine Learning job types are supported, including generative AI scenarios such as fine-tuning, evaluation, and retrieval-augmented generation (RAG) for large language models.

Advantages of serverless compute

Azure Machine Learning manages creating, setting up, scaling, deleting, and patching for compute infrastructure, reducing management overhead on IT admins
No need for enterprises to perform repetitive processes to create compute using the same settings for each workspace
Simplifies the job submission experience by reducing the steps involved to run a job
ML professionals don’t need to learn about compute concepts, various compute types, or related properties and instead can just focus on the job specification
Dynamic defaulting of VM size needed to run the training job
Meets the most stringent enterprise security requirements by supporting compute with no public IP, private link workspaces, customer virtual networks, managed virtual networks, managed identity, and user identity. Admins retain control through quota and Azure policies.
Enterprises can optimize costs by specifying the exact resources each job needs at runtime. Job utilization metrics can be monitored to tune the resources a job requests. Low-priority VMs are also supported.
Elastic training support for quota-limited, low-priority, and fault-tolerance scenarios
Reduced wait times before jobs start executing in some cases
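
To make the points about job specification more concrete, here is a minimal sketch in plain Python of what a serverless job specification might look like. It is an illustration under stated assumptions, not the official SDK: the helper build_serverless_job_spec is hypothetical, the field names (resources, instance_type, instance_count, queue_settings, job_tier) follow our understanding of the Azure Machine Learning command job schema and should be checked against the current documentation, and the environment name and VM size are placeholders. The key idea is that the specification names no compute target, so the job falls back to serverless compute, and that the VM size can be left out to rely on dynamic defaulting.

```python
import json
from typing import Optional


def build_serverless_job_spec(
    command: str,
    environment: str,
    instance_type: Optional[str] = None,
    instance_count: int = 1,
    low_priority: bool = False,
) -> dict:
    """Build a job specification dictionary with no compute target.

    Hypothetical helper for illustration only. Field names are assumed
    to mirror the Azure Machine Learning command job schema.
    """
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")

    # No "compute" key is set: the assumption is that a job submitted
    # without a compute target runs on serverless compute.
    spec = {"command": command, "environment": environment}

    # Only request a VM size when the caller names one; otherwise the
    # service is expected to pick a default size for the job.
    resources = {"instance_count": instance_count}
    if instance_type is not None:
        resources["instance_type"] = instance_type
    spec["resources"] = resources

    # Assumed setting for requesting low-priority (spot) capacity.
    if low_priority:
        spec["queue_settings"] = {"job_tier": "spot"}

    return spec


if __name__ == "__main__":
    # Simplest case: no VM size, no node count, no compute target.
    simple = build_serverless_job_spec(
        command="python train.py",
        environment="azureml:my-training-env@latest",  # placeholder name
    )
    print(json.dumps(simple, indent=2))

    # Explicit resources on low-priority capacity to reduce cost.
    tuned = build_serverless_job_spec(
        command="python train.py",
        environment="azureml:my-training-env@latest",  # placeholder name
        instance_type="Standard_NC24",  # placeholder VM size
        instance_count=4,
        low_priority=True,
    )
    print(json.dumps(tuned, indent=2))
```

The first specification leaves every infrastructure decision to the service, which matches the simplified submission experience described above. The second shows how a team might pin the exact resources per job once utilization metrics suggest a better fit.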