Batch Inference in Azure Machine Learning

Former Employee

May 26, 2020

Today, we are announcing the general availability of Batch Inference in Azure Machine Learning service, a new solution called ParallelRunStep that allows customers to get inferences for terabytes of structured or unstructured data using the power of the cloud. ParallelRunStep provides parallelism out of the box and makes it extremely easy to scale fire-and-forget inference to large clusters of machines, thereby increasing development productivity and decreasing end-to-end cost.

Batch inference is now being widely applied to businesses, whether to segment customers, forecast sales, predict customer behaviors, predict maintenance, or improve cyber security. It is the process of generating predictions on a high volume of instances without the need of instant responses. The predictions are stored and accessible for further usage. Often times, data scientists and engineers want to generate many predictions at once. But it’s challenging to take advantage of scalable compute resource to parallelize the large workload to achieve this. For example, how to partition the large amounts of input data? How to distribute and coordinate workloads across a cluster of machines? How to consolidate the output results? How to manage the machine cluster to avoid unnecessary cost? What if a task fails or machine dies?

ParallelRunStep, the new solution we provide for batch inference, handles all these for you. It simplifies scaling up and out large machine learning workloads so data scientists and engineers can spend less time developing computer programs and focus on business objectives.

ParallelRunStep includes capabilities that enable parallel processing and easy management of your large workloads:

Managed stack with out-of-the-box parallelism
Resiliency with failure tolerance
Fully composable with Azure Machine Learning Pipelines
Flexible design for a variety of workloads

Managed stack with out-of-the-box parallelism

ParallelRunStep is a managed stack with out-of-the-box parallelism. You only need to provide full data inputs, scoring script, and necessary configures, the left will be taken care of.

Partition input data

Let’s start with input data partitioning. You have 10K invoices to be extracted using your form recognizer model. To parallelize the workload, you want to combine 10 invoices as a mini batch and send to one machine for execution. You will have 1000 mini batches in total. In another use case, you have a 5GB csv file containing customer data to predict churn score per each record. You need to split the big file into smaller ones for parallel processing. For example, 500 files with the size of 10MB for each. With ParallelRunStep, you can easily achieve both above partition strategy.

ParallelRunStep accepts data inputs through Azure Machine Learning datasets. Dataset is a resource for exploring, transforming, and managing data. Partition your data by setting the mini batch size and leveraging the two types of dataset.

FileDataset represents single or multiple files with any format. For the invoice extraction case, use FileDataset and define mini batch size as 10 will automatically partition the workload into 1000 mini batches.

TabularDataset represents data in a tabular format by parsing the provided file or list of files. It can be created from csv, tsv, parquet files, SQL query results, etc. For the churn score calculation case, use TabularDataset and specify mini batch size as 10MB to get 500 mini batch workloads created.

Distribute workloads to managed machine cluster

After the entire input data is partitioned into multiple mini batches, ParallelRunStep distributes the mini batch workloads to a managed Azure Machine Learning compute cluster.

The Azure Machine Learning compute cluster is created and managed by Azure Machine Learning. It can be auto scaled each time you run a job. Such autoscaling ensures that machines are shut down when your job is completed to save your cost. It supports for both CPU and GPU resources. You can also choose low priority virtual machines to save resources for other latency sensitive jobs and reduce your costs further.

One mini batch workload will be sent to one compute node for execution. The more compute nodes you use, the more parallelism you will get. Execution efficiency for each mini batch workload is determined by the power of the compute node.

Consolidate the results

Processed results from each mini batch workload are collected, stored, and made available to you for further analysis. You will also get a job run summary and a performance report.

Resiliency with failure tolerance

ParallelRunStep is a resilient and highly available solution. While the system manages the strategy, you also have the control of when to timeout your job, how many times to retry and how many errors to tolerant.

Timeout setting is provided per mini batch workload invocation. You can tune this timeout properly to fail fast and avoid waiting forever in case of failures.

Automatic retry is embedded.The system by default retries three times for each mini batch workload if it fails and you can customize the max retry count. Even with several failed mini batch workloads, you can still get the partial results from other successful mini batch workloads.

You can control how many errors to tolerant through the error threshold configure. It’s defined as the number of files or records processing errors the entire job can tolerant. When the error threshold is reached, your job will be terminated. You can also choose to ignore all errors and allow your job to process all inputs.

Fully composable in Azure Machine Learning Pipelines

ParallelRunStep is available through Azure Machine Learning pipelines. Pipelines are constructed from multiple steps, which are distinct computational units in the pipeline. ParallelRunStep is one of such steps. Existing Azure ML Pipeline customers can easily add or switch to ParallelRunStep to run batch inference.

A pipeline is reusable after it is designed and published. Defining the input dataset or parameters as the type PipelineParameter gives you the ability to use dynamic input for each run, and fine tune your pipeline for better performance.

You can run a pipeline on recurring schedule based on elapsed time, or on data changes. For example, if you want to get a monthly customer churn score, you can create a time-based schedule to kick off your batch inference job every month.

Flexible design for a variety of workloads

ParallelRunStep is flexibly designed for a variety of workloads. It’s not just for batch inference, but also other workloads which necessitate parallel processing, e.g. training many models concurrently, or processing large amount of data.