Forum Discussion

AP
Copper Contributor
Jun 22, 2021

Advice architecture, streaming data

I’m looking for advice, with pros and cons, on the following two architectures:
 
 
data > azure event hub > databricks structured streaming > delta lake (bronze)
 
vs
 
data > azure event hub > event hub capture to Azure Data Lake gen 2 > Databricks Autoloader > delta lake(bronze)
Azure Event Hubs has only 7-day retention.

The need is not real-time; it is more batch-oriented, using trigger-once on a schedule.

1 Reply

  • LukeJMadden
    Brass Contributor
    Good morning AP,

    I am not an SME on the subject but did some digging. Both of the proposed architectures have their own pros and cons, and the choice between them will depend on your specific requirements and constraints. Here's a breakdown of some of the key considerations for each option:

    Option 1: Azure Event Hub > Databricks Structured Streaming > Delta Lake (bronze)

    Pros:

    • Databricks Structured Streaming provides a powerful and flexible way to process and transform streaming data.
    • Delta Lake provides a scalable and reliable storage layer for your streaming data, with built-in features for versioning, schema enforcement, and data retention policies.
    • This architecture is well suited to scenarios where you need to transform or enrich your streaming data before storing it.

    Cons:

    • Azure Event Hubs has a maximum retention period of 7 days (on the Standard tier), which may be a limitation depending on your data retention requirements.
    • Databricks Structured Streaming can be complex to set up and manage, particularly if you're not familiar with Spark.
    • This architecture may require more compute resources to handle the data processing and transformation steps.

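To make Option 1 concrete, here is a minimal sketch (PySpark on Databricks) that reads from Event Hubs through its Kafka-compatible endpoint and writes a bronze Delta table with a run-once trigger. All names (namespace, event hub, paths, connection string) are placeholders, and the shaded Kafka class name in the JAAS config assumes a Databricks runtime.

```python
# Sketch of Option 1: Event Hubs -> Structured Streaming -> Delta (bronze).
# Reads via the Event Hubs Kafka endpoint (port 9093, SASL PLAIN with the
# literal username "$ConnectionString"). All names below are placeholders.

def kafka_options(namespace: str, connection_string: str) -> dict:
    """Build Kafka source options for an Event Hubs namespace."""
    return {
        "kafka.bootstrap.servers": f"{namespace}.servicebus.windows.net:9093",
        "kafka.security.protocol": "SASL_SSL",
        "kafka.sasl.mechanism": "PLAIN",
        "kafka.sasl.jaas.config": (
            # Databricks shades the Kafka client classes; on plain Spark the
            # class is org.apache.kafka.common.security.plain.PlainLoginModule.
            "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule "
            "required username=\"$ConnectionString\" "
            f"password=\"{connection_string}\";"
        ),
        "subscribe": "my-event-hub",      # event hub name maps to Kafka topic
        "startingOffsets": "earliest",
    }

# On a Databricks cluster `spark` is provided; guarded so the sketch imports cleanly.
if __name__ == "__main__" and "spark" in globals():
    opts = kafka_options("my-namespace", "<connection-string-from-secret-scope>")
    raw = spark.readStream.format("kafka").options(**opts).load()

    (raw.selectExpr("CAST(value AS STRING) AS body",
                    "timestamp AS enqueued_time")
        .writeStream
        .format("delta")
        .option("checkpointLocation", "/mnt/bronze/_checkpoints/events")
        .trigger(once=True)               # batch-style: drain available data, then stop
        .start("/mnt/bronze/events"))
```

The `trigger(once=True)` call matches the stated requirement of scheduled, trigger-once batch runs rather than a continuously running stream.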
    Option 2: Azure Event Hub > Event Hub Capture to Azure Data Lake Gen 2 > Databricks Autoloader > Delta Lake (bronze)

    Pros:

    • Event Hub Capture provides a simple and efficient way to store your streaming data in Azure Data Lake Gen 2, with support for retention policies and automatic scaling.
    • Databricks Autoloader simplifies the process of ingesting data from Azure Data Lake Gen 2 into Delta Lake, with support for schema evolution and partitioning.
    • This architecture is well suited to scenarios where you need to store your streaming data for longer periods without complex transformation or enrichment steps.

    Cons:

    • Using Azure Data Lake Gen 2 as an intermediate storage layer adds an extra hop (and some latency) to the overall architecture.
    • Databricks Autoloader may have some limitations in terms of performance and scalability for large datasets.
    • This architecture requires additional storage to hold the captured data in Azure Data Lake Gen 2.

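Option 2 might look like the following sketch: Event Hub Capture lands Avro files in ADLS Gen 2, and Autoloader (the `cloudFiles` source) streams them into a bronze Delta table, again with a run-once trigger. Storage account, container, and path names are placeholders; the glob assumes Capture's default file-name layout.

```python
# Sketch of Option 2: Event Hub Capture (Avro in ADLS Gen 2) -> Autoloader
# -> Delta (bronze). All account/container/path names are placeholders.

def capture_path(container: str, account: str, namespace: str, hub: str) -> str:
    """Glob over Capture's default layout:
    {Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}
    """
    return (f"abfss://{container}@{account}.dfs.core.windows.net/"
            f"{namespace}/{hub}/*/*/*/*/*/*/*")

# On a Databricks cluster `spark` is provided; guarded so the sketch imports cleanly.
if __name__ == "__main__" and "spark" in globals():
    src = capture_path("capture", "mystorageacct", "my-namespace", "my-event-hub")

    stream = (spark.readStream.format("cloudFiles")
              .option("cloudFiles.format", "avro")   # Capture writes Avro files
              .load(src))

    # Capture records carry the payload as a binary `Body` column, alongside
    # metadata such as EnqueuedTimeUtc, Offset, and SequenceNumber.
    (stream.selectExpr("CAST(Body AS STRING) AS body", "EnqueuedTimeUtc")
        .writeStream
        .format("delta")
        .option("checkpointLocation", "/mnt/bronze/_checkpoints/capture")
        .trigger(once=True)                          # scheduled batch-style run
        .start("/mnt/bronze/events_capture"))
```

Because Capture has already persisted the raw events to the lake, the 7-day Event Hubs retention limit no longer constrains how far back a batch job can reprocess.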
    Overall, both of these architectures have their own strengths and weaknesses, and the best option will depend on your specific requirements and constraints. If you require more complex data transformation or enrichment steps, the first option may be a better fit. If you need to store your data for longer periods of time without complex transformations, the second option may be more appropriate.

    Kind regards,

    Luke
