I would like to share some insights regarding the use of Azure Data Explorer (ADX) in data platform architectures. While I appreciate the strengths of ADX, I want to highlight a few considerations and challenges I encountered while setting up a data platform along these lines.
Record Update Limitations: One notable limitation is the challenge of updating records in ADX. Although materialized views and the method proposed by wernerzirkel can be employed, implementing updates is not always straightforward. Thoughtful data modelling and careful design of the overall data flow are essential to accommodate updates and deletions effectively.
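For context, the materialized-view approach usually boils down to an arg_max pattern: every update is appended as a new row, and the view surfaces only the latest version of each record. A minimal sketch (table and column names here are placeholders, not from the original post):

```kusto
// Assumed base table RawEvents: updates arrive as new rows with a newer Timestamp.
// The materialized view exposes only the most recent row per RecordId.
.create materialized-view LatestRecords on table RawEvents
{
    RawEvents
    | summarize arg_max(Timestamp, *) by RecordId
}
```

Deletions still need separate handling, e.g. a soft-delete flag filtered out by consumers, which is exactly the kind of data-flow design work mentioned above.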
Micro-Batch Engine: ADX is primarily built as a micro-batch engine, finely tuned to handle concurrent load from many parallel ingestions and queries. As a result, large batch loads may run into problems such as memory exhaustion or prolonged processing times. Microsoft recommends splitting large batches to prevent resource issues, but this practice adds complexity and the risk of partially loaded data, neither of which I appreciate in daily operations.
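How batches are sealed can at least be tuned per table via the ingestion batching policy; the thresholds below are illustrative values, not recommendations:

```kusto
// Illustrative only: seal a batch after 5 minutes, 500 items, or 1 GB of raw
// data, whichever comes first, instead of relying on the cluster defaults.
.alter table RawEvents policy ingestionbatching
'{"MaximumBatchingTimeSpan": "00:05:00", "MaximumNumberOfItems": 500, "MaximumRawDataSizeMB": 1024}'
```

This helps with latency/throughput trade-offs for streaming-style loads, but it does not remove the need to split genuinely large one-off batch loads on the client side.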
Resource Congestion: ADX clusters come with a limited number of ingestion capacity slots, so ingestions can fail when too many run simultaneously. This congestion is particularly noticeable on development and test clusters during active work periods. The risk can be partially mitigated by running multiple clusters in leader/follower mode, but it raises the question of whether alternatives like Databricks, which inherently supports segregating jobs, users, and teams, would be a more suitable solution.
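To see how close a cluster is to its limits, the capacity slots can be inspected with a standard management command (the `where` filter narrows the output to ingestion slots):

```kusto
// Reports, per resource type, the total number of capacity slots on the
// cluster and how many are currently consumed.
.show capacity
| where Resource == "Ingestions"
```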
Considering these factors, I would recommend an ADX-based Lakehouse architecture only for smaller teams dealing with datasets smaller than a few terabytes. In such scenarios, the ease of use, speed and overall efficiency offered by ADX may outweigh the associated challenges. However, for larger and more complex setups, I lean towards Databricks due to its flexibility and scalability, especially when dealing with advanced requirements.