We have added support for Azure Databricks instance pools in Azure Data Factory. Databricks activities (code-based ETL orchestrating notebooks, JARs, and Python code) can now leverage the pool feature for quicker job start-up.
This is especially helpful if you have chained executions of Databricks activities orchestrated through Azure Data Factory. You can also share a single pool across different pipelines, as long as they use the same Databricks linked service or reference the same pool ID from another linked service (be mindful of Databricks concurrency limits when planning, so that you don't overload a single workspace and cause job failures).
Lower start-up times not only reduce overall pipeline execution time but also reduce the total VM cost incurred during cluster start-up.
Note: The Instance Pools feature is currently in public preview. We see start-up latency come down from 5-7 minutes to around 2 minutes. This way, you can keep using job clusters (where each Databricks activity creates a new job cluster), which are more reliable and cost-effective for running automated jobs, while still cutting down on job-cluster start-up latency.
Prerequisite: You should create a pool in your Databricks workspace before leveraging it in Azure Data Factory. To create a pool, refer to the documentation.
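Besides the workspace UI, a pool can also be created programmatically through the Databricks Instance Pools REST API (`POST /api/2.0/instance-pools/create`). The sketch below is illustrative: the workspace URL, token, and pool settings are placeholder values you would substitute with your own.

```python
# Sketch: create a Databricks instance pool via the Instance Pools REST API.
# The workspace URL, access token, and pool settings are placeholders.
import json
import urllib.request


def build_pool_payload(name, node_type, min_idle=2, max_capacity=10, idle_ttl_min=30):
    """Build the JSON payload for the instance-pools/create endpoint."""
    return {
        "instance_pool_name": name,
        "node_type_id": node_type,
        "min_idle_instances": min_idle,  # VMs kept warm for fast job start-up
        "max_capacity": max_capacity,
        "idle_instance_autotermination_minutes": idle_ttl_min,
    }


def create_pool(workspace_url, token, payload):
    """POST the payload to the workspace; returns the new instance_pool_id."""
    req = urllib.request.Request(
        f"{workspace_url}/api/2.0/instance-pools/create",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["instance_pool_id"]


# Example payload (node type is a placeholder for your region/VM choice):
payload = build_pool_payload("adf-pool", "Standard_DS3_v2")
```

Keeping a couple of `min_idle_instances` warm is what buys the faster start-up; the trade-off is that idle VMs in the pool still incur Azure VM charges (though no Databricks DBUs while idle).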
Getting started in Data Factory:
You can create Databricks activities just as you did before, and reference the above-created Databricks linked service to get started.
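For reference, a Databricks linked service that targets a pool might look like the following. This is a minimal sketch based on the AzureDatabricks linked service JSON schema; the domain, token, pool ID, and cluster settings are placeholders for your own values.

```json
{
    "name": "AzureDatabricksLinkedService",
    "properties": {
        "type": "AzureDatabricks",
        "typeProperties": {
            "domain": "https://<region>.azuredatabricks.net",
            "accessToken": {
                "type": "SecureString",
                "value": "<access token>"
            },
            "instancePoolId": "<instance pool id>",
            "newClusterVersion": "<runtime version>",
            "newClusterNumOfWorker": "2"
        }
    }
}
```

With `instancePoolId` set, each Databricks activity still creates a new job cluster, but its VMs are drawn from the pool's warm instances instead of being provisioned from scratch.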