Author: Amanjeet Singh, Program Manager, Azure Synapse Customer Success Engineering (CSE) team.
As organisations grow, the volume of data and the number of systems generating data grow with them. The size and number of these disparate systems introduce procedural, operational, and technology challenges that negatively impact the agility, scale, and governance of analytics. This is the problem space that data mesh targets. It is an approach to data architecture that addresses these challenges through decentralisation of technology platforms and processes, domain-aligned teams, and governance at scale. It brings product thinking to analytics and calls for a fundamental shift in the assumptions, architecture, technical solutions, and structure of organisations. [1]
Data mesh has four principles [2]: domain-oriented ownership of data, data as a product, a self-serve data platform, and federated computational governance.
Before we go any further, it’s worth noting that:
Azure Synapse is an analytics service that brings together enterprise data warehousing and big data analytics. It combines the best of the SQL technologies used in enterprise data warehousing, Spark technologies used for big data, Data Explorer for log and time-series analytics, and Pipelines for data integration and ETL/ELT, with deep integration with other Azure services such as Power BI, Cosmos DB, and Azure ML that are part of the Microsoft Intelligent Data Platform. A conceptual view of the Azure Synapse ecosystem is shown in figure 1.
Figure 1 – Overview of Azure Synapse Analytics
A data product is one of the core building blocks of data mesh architecture. It encompasses the domain-specific data; pipelines to load and transform the data; and interfaces to share the data with other domains. In simple terms, a data product consists of data and the underlying software and infrastructure required to ingest, store, and share data with other domains.
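To make the composition concrete, here is a toy model of a data product in plain Python: domain data plus input ports for ingestion and output ports for sharing. All class and port names are illustrative only, a conceptual sketch rather than anything Synapse-specific.

```python
from dataclasses import dataclass, field

# Toy model of a data product: domain-specific data plus the interfaces
# ("ports") that move data in and out. Internals stay private to the domain.
@dataclass
class DataProduct:
    domain: str
    input_ports: list = field(default_factory=list)    # sources this product ingests from
    output_ports: list = field(default_factory=list)   # interfaces other domains consume
    records: list = field(default_factory=list)        # internal storage (implementation detail)

    def consume(self, source: str, rows: list) -> None:
        """Ingest rows arriving on a declared input port."""
        if source not in self.input_ports:
            raise ValueError(f"unknown input port: {source}")
        self.records.extend(rows)

    def serve(self, port: str) -> list:
        """Share data through a declared output port; callers get a copy."""
        if port not in self.output_ports:
            raise ValueError(f"unknown output port: {port}")
        return list(self.records)

# Hypothetical "sales" domain product with one input and one output port.
sales = DataProduct("sales", input_ports=["crm"], output_ports=["sql-endpoint"])
sales.consume("crm", [{"order": 1}])
result = sales.serve("sql-endpoint")
```

The point of the sketch is the shape, not the storage: transformation and persistence are internal, while other domains interact only through declared ports.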
There are three main functions [4] of a data product: consuming data, transforming data, and serving data.
Figure 2 shows alignment of Synapse Analytics features to the functions of a data product.
Figure 2 – Data product functions and Azure Synapse Analytics feature alignment
Now that we understand the composition of a data product, let's walk through our perspective on how Azure Synapse Analytics features align to its functions.
| Function | Properties [5] | Synapse Analytics feature(s) |
| --- | --- | --- |
| Consume data | A data product has one or more input data ports, or a mechanism to connect to and source data from various sources, including other data products. | Azure Synapse Analytics offers features that enable consumption of data from various first-party and third-party sources (such as SAP and Oracle) via connectors that ship with Pipelines and Data flows. Additionally, notebooks are extensible by design and can leverage third-party libraries to connect to data sources for which no connector ships. Through integration with the broader Microsoft Intelligent Data Platform, there are also options for ingesting streams directly into a Data Explorer pool. |
| Transform data | All data products perform data transformation; within data mesh architecture, transformation is an internal implementation detail of a data product. | In Azure Synapse Analytics, artifacts such as Pipelines, Data flows, notebooks, and stored procedures enable developers to build complex transformations as part of data products. |
| Serve data | A data product may present its data in multimodal form, such as columnar files, relational tables, graphs, or events. | Through a combination of Synapse capabilities and tight integration with various polyglot persistence options on Azure, data can be served via different formats and mechanisms (file shares, APIs, etc.). File access: data in various formats can be served from storage services such as ADLS Gen2. SQL access: a SQL pool serves data as relations that can plug into a reporting tool such as Power BI, and other domains can use the SQL endpoint as an interface for consuming data downstream. Events: Azure Event Hubs ships with a Capture feature that can write events to ADLS Gen2 in Parquet and Avro formats; events written to ADLS can then be processed and stored, or served to other domains downstream via Synapse Analytics. Additionally, streaming data ingested into a Data Explorer pool can be served via Kusto queries. Organizations have the flexibility to enable in-place reads (to serve other domains) or to physically ship data or files to other domains. |
| | Immutability | With the Pipelines and Data flow features, developers can implement INSERT-only ingestion patterns, thereby keeping historical data unchanged. Immutability can also be implemented using immutable file formats such as Parquet, which can be read via Synapse serverless SQL, Synapse Spark notebooks, etc. To summarise: within Synapse Analytics, immutability can be implemented by using immutable file formats or by writing append-only ingestion jobs. |
| | Bitemporal data | Bitemporal data is a way of modelling data so that every piece of data records two timestamps: "actual" (when the event occurred) and "processed" (when the system recorded it). With the Azure Synapse Pipelines and Data flow features, users can add these datetime attributes to data stored in a SQL pool (relations) or to output written to ADLS Gen2. |
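Outside any particular Synapse feature, the append-only, bitemporal pattern described above can be sketched in plain Python. Field names (`event_time`, `actual_time`, `processed_time`) are hypothetical, a minimal sketch of the idea rather than a pipeline implementation.

```python
from datetime import datetime, timezone

def ingest(store: list, records: list, actual_key: str = "event_time") -> list:
    """Append-only, bitemporal ingestion: never mutate existing rows, and stamp
    each incoming record with both its 'actual' and 'processed' timestamps."""
    processed_at = datetime.now(timezone.utc).isoformat()
    for rec in records:
        row = dict(rec)
        row["actual_time"] = row.pop(actual_key)   # when the event occurred
        row["processed_time"] = processed_at       # when the platform recorded it
        store.append(row)                          # INSERT only; no updates or deletes
    return store

store = []
ingest(store, [{"event_time": "2022-01-01T00:00:00Z", "qty": 5}])
# A correction for the same event arrives later; history is preserved, and the
# two versions are distinguishable by their processed_time.
ingest(store, [{"event_time": "2022-01-01T00:00:00Z", "qty": 7}])
```

Because nothing is ever updated in place, downstream domains can reconstruct what was known at any processing time, which is the property bitemporal modelling exists to provide.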
Let's now look at the primary capabilities of a data product and where Azure Synapse Analytics features fit in. The following table outlines each data product capability and the Azure Synapse Analytics features that align to it.
| Data product capability | Azure Synapse Analytics feature(s) |
| --- | --- |
| Data storage | Synapse Dedicated SQL pools; Synapse Data Explorer |
| Data movement | Pipelines and Data flows |
| Data serving | Notebooks |
| Transformations | Pipelines and Data flows; Synapse Spark; Transact-SQL or CLR stored procedures |
| Governance | Platform: Azure Policy; Data: Microsoft Purview |
To summarize, by combining one or more Azure Synapse Analytics features with its native integrations into the wider Microsoft Intelligent Data Platform, individual domains can build a wide range of rich data products.
It's important to understand the Synapse Analytics hierarchy model, as it has implications for scale and domain access control within a data mesh architecture. Figure 3 below shows the key components of a workspace and their relationships with each other.
Figure 3 – Azure Synapse Analytics hierarchy
The relationship between a subscription, a Synapse workspace, and the underlying Synapse artifacts influences the scale and access control a domain has within each deployment pattern discussed below. Scale boundary refers to how large a single instance of a Synapse artifact or Synapse workspace can scale. Domain autonomy refers to the access control members of a domain have within a subscription, a Synapse workspace, and a single artifact instance when building data products; it influences the agility of a domain and, on Azure, is controlled by Azure RBAC and resource-level permissions (on the control and data planes).
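The inheritance behaviour that drives domain autonomy in the patterns below can be illustrated with a toy model: a role assignment granted at a scope applies to that scope and everything beneath it. This is plain Python with simplified scope paths, not the Azure SDK or real ARM resource IDs.

```python
# Toy model of scope inheritance: an assignment at a scope grants access to
# that scope and to every child scope under it (subscription -> workspace -> artifact).
def has_access(assignments: dict, principal: str, scope: str) -> bool:
    granted = assignments.get(principal, [])
    return any(scope == s or scope.startswith(s + "/") for s in granted)

# Hypothetical assignments: a platform team scoped at the subscription,
# a domain team scoped at a single artifact instance.
assignments = {
    "platform-team": ["/subscriptions/sub-1"],
    "domain-a": ["/subscriptions/sub-1/workspaces/ws-1/sqlpool-a"],
}

# Subscription-scoped access is inherited by child workspaces and artifacts...
assert has_access(assignments, "platform-team", "/subscriptions/sub-1/workspaces/ws-1")
# ...while the domain's artifact-scoped access does not extend to sibling artifacts.
assert has_access(assignments, "domain-a", "/subscriptions/sub-1/workspaces/ws-1/sqlpool-a")
assert not has_access(assignments, "domain-a", "/subscriptions/sub-1/workspaces/ws-1/sqlpool-b")
```

This is why the choice between shared and dedicated subscriptions, workspaces, and artifacts in the table below directly determines how much autonomy a domain can be given without affecting its neighbours.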
The table below summarizes various patterns for deploying Azure subscription(s) and Synapse workspace(s) as a data product. A Synapse artifact refers to an individual feature of Synapse such as a SQL pool, Spark pool, Data Explorer pool, or Pipelines.
| Pattern | Azure Subscription | Synapse Workspace | Synapse Artifact (SQL pool, Spark pool, etc.) | Scale boundary | Domain autonomy |
| --- | --- | --- | --- | --- | --- |
| Pattern 1 (figure 4): domains share a single instance of a Synapse artifact within a single workspace. | Single subscription | Single workspace | Single instance of a Synapse artifact shared by all domains, i.e., a single SQL pool instance services all relational use cases for all domains. | Scale limits for a single Azure subscription apply. | Subscription-scoped RBAC, policy, and Management Group membership apply to all child resources, including Synapse workspaces. Since there is a single workspace, one platform team assumes ownership of managing and operating the workspace and all Synapse artifacts deployed within it. Individual domains have read access to the control plane; however, they may have write access to the data plane of individual Synapse artifacts such as SQL pools and Spark pools. |
| Pattern 2 (figure 5): each domain has a dedicated Synapse artifact within a single workspace. | Single subscription | Single workspace | Domain-aligned artifacts such as SQL pools, Spark pools, Pipelines, etc. | Since each domain has a dedicated instance of a Synapse artifact, the scale limit is dictated by a single instance of that artifact. | A single platform team administers the workspace. Domains may have write access to the control plane of a Synapse artifact, i.e., they can manage attributes of the artifact such as its size. Domains can have full access to the data plane within an artifact, for example full access to an instance of a SQL pool to create, modify, and delete objects (tables, etc.). |
| Pattern 3 (figures 6A and 6B): single subscription with multiple workspaces; domains consolidated across workspaces. | Single subscription | Multiple workspaces | Multiple instances of Synapse artifacts, since there are multiple workspaces. | Scale limits for a single Azure subscription apply. | Subscription-level policies and RBAC are inherited by child resources. Since there are multiple workspaces, one or more platform teams can assume ownership based on agreed criteria. Domains may have full privileges on individual artifact instances. |
| Pattern 4 (figure 7): single subscription with a dedicated Synapse workspace for each domain. | Single subscription | Separate workspace for each domain | Each domain gets a dedicated workspace and associated Synapse artifacts. | Scale limits for a single Azure subscription apply. | Subscription-scoped policies and RBAC apply to all resources. |
| Pattern 5: separate subscription with a separate Synapse workspace for each domain. | Separate subscription for each domain | Separate workspace for each domain | Each domain gets a dedicated workspace and artifacts. | Scale is enabled through separate subscriptions, multiple workspaces, and Synapse artifacts. | Flexibility to apply different subscription-scoped policies and RBAC across domains. Additionally, domain-aligned subscriptions can belong to different Management Groups. |
Let's now discuss considerations for each of these deployment patterns.
In this deployment model, a single Synapse workspace is shared across domains. In this multi-tenancy model, analytics pools and other artifacts belonging to the workspace are shared across domains (as shown in the figure below); essentially, each domain gets a slice of resources and privileges to ship data products.
Figure 4 – Subscription and Synapse Workspace multi-tenancy model. A single workspace hosts all domains, and each domain gets a slice of a Synapse artifact.
From a systems perspective, since there is a single instance of each resource, all domains sharing that artifact are bound by the same scale limits, performance targets, recovery targets, etc. Essentially, all domains are treated the same with respect to scale, performance, recovery targets, and maintenance.
The following considerations apply to this deployment pattern:
Since all domains share a workspace, there is no ability to restrict access to development artifacts such as Synapse Pipelines and notebooks between different domain users.
When may organisations opt for this deployment pattern?
This deployment model is like pattern 1; however, organizations deploy separate instances of Synapse artifacts aligned to the various domains, for example a dedicated SQL pool for domain A, a dedicated Spark pool for domain B, and so on, within a single shared Synapse Analytics workspace.
Figure 5 – Each domain has a separate instance of Synapse artifact but within bounds of a single workspace.
In this model, the following considerations apply:
As in pattern 1, since all domains share a workspace, there is no ability to restrict access to development artifacts such as Synapse Pipelines and notebooks between different domain users.
When may organisations choose this model?
This pattern has similarities to pattern 1, where single-subscription and single-workspace scale and management limits apply; however, it offers a larger scale boundary for a domain to operate within. Common reasons to adopt this model include:
In this model, an Azure subscription houses multiple Synapse workspaces. The difference here is that workspaces can be used to consolidate a function, such as data ingestion, or to consolidate a set of domains based on a criterion such as region of deployment.
Figure 6A – An example of grouping workspaces based on a function such as data ingestion within a single subscription.
Figure 6B – Deployment pattern where domains are grouped under separate workspaces for reasons such as region.
The following considerations apply to this deployment pattern:
When may organisations choose this model?
The key difference between this pattern and the patterns discussed above is that each domain has its own dedicated Synapse workspace. All the considerations for separate workspaces highlighted in patterns 2 and 3 apply here.
Figure 7 shows a logical view of the layout of workspaces within a subscription, in the context of a lakehouse architecture.
Figure 7 – Separate workspaces for each domain along with a dedicated workspace for a function such as lakehouse medallion architecture.
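To give a flavour of the work a dedicated medallion-architecture workspace hosts, here is a minimal bronze-to-silver-to-gold refinement in plain Python. Record layout and field names are hypothetical; a real implementation would run as Spark or Data flow transformations.

```python
# Medallion refinement sketch: bronze (raw, as ingested) -> silver (deduplicated,
# type-checked) -> gold (aggregated, ready to serve to consuming domains).
bronze = [
    {"order_id": "1", "amount": "10.0", "region": "emea"},
    {"order_id": "1", "amount": "10.0", "region": "emea"},          # duplicate raw record
    {"order_id": "2", "amount": "not-a-number", "region": "emea"},  # malformed record
    {"order_id": "3", "amount": "5.5", "region": "apac"},
]

def to_silver(rows):
    """Deduplicate on order_id and drop records that fail type checks."""
    seen, silver = set(), []
    for r in rows:
        try:
            amount = float(r["amount"])
        except ValueError:
            continue  # a real pipeline would quarantine bad records instead
        if r["order_id"] in seen:
            continue
        seen.add(r["order_id"])
        silver.append({**r, "amount": amount})
    return silver

def to_gold(rows):
    """Aggregate cleaned records into per-region revenue."""
    gold = {}
    for r in rows:
        gold[r["region"]] = gold.get(r["region"], 0.0) + r["amount"]
    return gold

silver = to_silver(bronze)
gold = to_gold(silver)  # {"emea": 10.0, "apac": 5.5}
```

Centralising this bronze/silver/gold flow in its own workspace, as in figure 7, lets domain workspaces consume the curated gold layer without each domain re-implementing cleansing logic.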
This model offers the largest scale and the highest degree of autonomy to domains. The considerations discussed previously for separate subscriptions and workspaces apply here.
Common reasons for implementing this model include:
Figure 8 – Each domain with a dedicated subscription and workspace.
In part two of this blog series, we will focus on topics such as ingestion patterns, networking, and the layout of subscriptions in the context of Azure Synapse Analytics and data mesh.
Our team publishes blogs regularly; you can find them all here: https://aka.ms/synapsecseblog
For a deeper understanding of Synapse implementation best practices, please refer to the Success by Design (SBD) site: https://aka.ms/Synapse-Success-By-Design
[1] Zhamak Dehghani (2022), Data Mesh: Delivering Data-Driven Value at Scale (O'Reilly Media, Inc, USA).
[2] See note 1 above.
[3] Dehghani (2022), Data Mesh: Delivering Data-Driven Value at Scale, “Chapter 9. The Logical Architecture”.
[4] Dehghani (2022), Data Mesh: Delivering Data-Driven Value at Scale, “Chapter 12. Design consuming, transforming, and serving data”.
[5] See note 4 above.