Overview
Data is an important asset to all organizations, big and small. As these organizations mature, building an end-to-end data platform to enable BI and AI at scale becomes part of that journey. Some organizations have a requirement to expose modelled data in a data warehouse or data lake (Azure Data Lake Storage Gen2) to downstream consumer applications (mobile or web apps), where access patterns can be unpredictable with respect to frequency of access and/or the type of data requested.
Data warehouse engines and data lakes are not designed for singleton transactional (request/response) interactions. To serve these requests at scale, and to meet the varying SLAs and unpredictable access patterns, data needs to be offloaded to a suitable database engine (i.e., a caching layer) that is built to serve such queries.
The “Design Patterns” section of this article highlights a generalized pattern for implementing a data serving API which meets this requirement – consisting of a Data Platform component and an API component. For implementing the API, two patterns are commonly adopted – a synchronous pattern or an asynchronous pattern. Both are explored in the “API Implementation Patterns” section of this article.
Potential use-cases
Below are some common enterprise use-cases for a data sharing API:
- Making decommissioned legacy application data available for consumption. Legacy and historical application data often already exists in the data platform and is frequently required to remain available for legal and compliance reasons, as well as to mash up and integrate with new application data.
- Consolidation of customer interactions (i.e., consolidating various line-of-business (LOB) applications into a single customer-facing application). Data from the different LOBs may already exist in the warehouse or data lake.
- Emerging compliance requirements, such as Consumer Data Right (CDR) legislation in Australia. Some sectors, such as banking, energy, and telecommunications, have mandates to make data available to the customer or to external third-party brokers with the customer's specific consent.
Design Patterns
For the data sharing API, there are many suitable engines/services in Azure that could be considered (e.g., Azure Cosmos DB, Azure SQL Database, Azure Data Explorer, Azure Database for PostgreSQL, Azure Cache for Redis). For the use case outlined in this article, Azure Cosmos DB was selected. Azure Cosmos DB is a NoSQL cloud database engine built specifically for large-scale queries, with support for multi-regional deployment models.
Azure Cosmos DB is usually chosen for its guaranteed throughput, speed, automatic and instant scalability, compatibility with multiple open-source NoSQL APIs, and strong business continuity guarantees. It offers SLA-backed latency and schema flexibility. These attributes make it a good candidate for serving data through APIs.
Azure Functions is a serverless, on-demand compute service that enables execution of event-triggered code without having to worry about the underlying application infrastructure.
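As an illustration of how an Azure Function can front Azure Cosmos DB as the API layer, the following is a minimal sketch of an HTTP-triggered function (Python v2 programming model) that serves recent data, in line with the synchronous pattern described later. The container, route, field, and setting names are illustrative assumptions, not part of a prescribed implementation.

```python
import os
import json
import azure.functions as func
from azure.cosmos import CosmosClient

app = func.FunctionApp(http_auth_level=func.AuthLevel.FUNCTION)

# Reuse the client across invocations to avoid per-request connection overhead.
cosmos = CosmosClient(os.environ["COSMOS_ENDPOINT"], os.environ["COSMOS_KEY"])
container = cosmos.get_database_client("serving").get_container_client("transactions")

@app.route(route="customers/{customer_id}/transactions", methods=["GET"])
def get_transactions(req: func.HttpRequest) -> func.HttpResponse:
    customer_id = req.route_params["customer_id"]
    from_date = req.params.get("from")  # e.g. 2024-01-01

    # The partition key (customer_id, assumed here) is part of the predicate,
    # so the query is scoped to a single logical partition.
    query = ("SELECT * FROM c WHERE c.customer_id = @cid "
             "AND c.transaction_date >= @from ORDER BY c.transaction_date DESC")
    items = container.query_items(
        query=query,
        parameters=[{"name": "@cid", "value": customer_id},
                    {"name": "@from", "value": from_date}],
        partition_key=customer_id,
    )
    return func.HttpResponse(json.dumps(list(items)), mimetype="application/json")
```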
To load data into the serving engine for the API three patterns can be adopted based on the type of data platform implementation in Azure:
- Azure Databricks or Synapse Spark: use this when the preference is to leverage Spark capabilities for streaming and transformations (a loading sketch follows this list)
- Synapse Pipelines: use this if you have already implemented Azure Synapse Analytics. It provides extract-transform-load (ETL), extract-load-transform (ELT), and data integration workflow capabilities at scale
- Azure Data Factory: similar to Synapse Pipelines, but a stand-alone cloud service. It provides the same ETL, ELT, and data integration workflow capabilities at scale
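As a sketch of the Spark-based option, the following PySpark snippet (Azure Databricks or Synapse Spark) loads a rolling three-month slice of a curated Delta table into the serving container using the Azure Cosmos DB Spark 3 OLTP connector. The account, storage paths, table, container names, and the date filter are illustrative assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

cosmos_config = {
    "spark.cosmos.accountEndpoint": "https://<account>.documents.azure.com:443/",
    "spark.cosmos.accountKey": "<cosmos-key>",  # in practice, read from Key Vault / a secret scope
    "spark.cosmos.database": "serving",
    "spark.cosmos.container": "transactions",
}

# Delta load of the rolling 3-month window from the curated zone (assumed path).
df = (spark.read.format("delta")
      .load("abfss://curated@<storage>.dfs.core.windows.net/transactions")
      .where(F.col("transaction_date") >= F.add_months(F.current_date(), -3)))

# The connector's default write strategy upserts documents by id.
(df.write.format("cosmos.oltp")
   .options(**cosmos_config)
   .mode("append")
   .save())
```

In a scheduled run, the same pattern would typically filter on a change-tracking column so that only the delta since the last run is written.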
The following diagram outlines example patterns, with Azure Cosmos DB at the center of the API serving layer of the architecture:
API Implementation Patterns
Synchronous Pattern (immediate request/response)
For example, the requirement is for a request to return the most recent 3 months of data on demand, including events or transactions that occurred in the last {x} minutes. The assumption/requirement is that data is flowing into the data platform in near real time.
This use case is supported by the following pattern.
- Load/ingest a rolling 3 months of historical data into Azure Cosmos DB using Azure Databricks, Synapse Spark, or Azure Data Factory. This must run on a {x}-minute schedule to ensure the time requirement is met, and should ideally include only delta loads.
- Set the Time to Live (TTL) container property so that any data ingested into Azure Cosmos DB expires after 92 days. Azure Cosmos DB automatically manages the removal of data beyond the TTL value (see the container-setup sketch after this list).
- Note that deleting records via TTL doesn't consume Azure Cosmos DB RUs
- Based on the query pattern, index the Azure Cosmos DB container to support the required query attributes, e.g., customer_id, transaction_date, etc. Include composite indexes to support the different predicate combinations that may be required. Learn more about indexing here
- If there are distinctly different query requirements, consider replicating data across containers using the change feed functionality. Note that in Azure Cosmos DB, storage is cheap compared to the high RUs required for inefficient queries
- Be sure to choose the right partition key to evenly distribute data and requests (i.e., avoid hot partitions and data skew). Learn more about Azure Cosmos DB partitioning here
- The partition key must be included in the query predicate (i.e., WHERE clause)
- A good partition key:
- Evenly distributes requests
- Evenly distributes storage
- Has a high cardinality
- Is included in the filter of your most frequent queries (i.e., for read-heavy workloads)
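As a sketch of the container properties described above, the following Python snippet provisions the serving container with an assumed /customer_id partition key, the 92-day default TTL, and a composite index covering a customer_id and transaction_date predicate. Names, endpoint, and index paths are illustrative assumptions.

```python
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://<account>.documents.azure.com:443/", "<key>")
database = client.get_database_client("serving")  # assumes the database already exists

container = database.create_container_if_not_exists(
    id="transactions",
    partition_key=PartitionKey(path="/customer_id"),
    default_ttl=92 * 24 * 60 * 60,  # documents expire 92 days after ingestion
    indexing_policy={
        "indexingMode": "consistent",
        "includedPaths": [{"path": "/*"}],
        "excludedPaths": [{"path": "/\"_etag\"/?"}],
        "compositeIndexes": [[
            {"path": "/customer_id", "order": "ascending"},
            {"path": "/transaction_date", "order": "descending"},
        ]],
    },
)

# Queries should always include the partition key in the WHERE clause, e.g.:
# SELECT * FROM c WHERE c.customer_id = @cid AND c.transaction_date >= @from
```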
Asynchronous Pattern (delayed request/response)
For example, the requirement is for a request to return the last 36+ months of data on demand, including events or transactions that occurred in the previous {x} minutes.
Given the volume of transactions, returning 36+ months of historical events or transactions will likely require pagination in the client application. This could potentially mean hundreds of scrolling pages (e.g., 1,000 records of calls, energy usage, or incident data). This has limited value for consumption via mobile devices or web apps, so let's review options to make this process more efficient.
Option #1
This option aligns with the synchronous request/response design, i.e., load Azure Cosmos DB with 36+ months of historical data in a container. Then, instead of returning the result set to the API, a file (e.g., PDF or CSV) or a link to the file (i.e., a pointer to the file) is returned.
The processing flow can be described as follows:
- The downstream App makes a read request to Azure Cosmos DB (i.e., cold historical container)
- The result set is extracted from Azure Cosmos DB and converted to PDF or CSV. This is executed via an Azure Function and retained in memory (a sketch follows this list)
- The resulting file is then returned as an attachment to the calling application
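For illustration, a minimal sketch of such a function is shown below, assuming a Python Azure Function (v2 programming model), a hypothetical transactions_history container, and illustrative route and field names.

```python
import csv
import io
import os
import azure.functions as func
from azure.cosmos import CosmosClient

app = func.FunctionApp(http_auth_level=func.AuthLevel.FUNCTION)

cosmos = CosmosClient(os.environ["COSMOS_ENDPOINT"], os.environ["COSMOS_KEY"])
history = cosmos.get_database_client("serving").get_container_client("transactions_history")

@app.route(route="customers/{customer_id}/history", methods=["GET"])
def export_history(req: func.HttpRequest) -> func.HttpResponse:
    customer_id = req.route_params["customer_id"]
    items = history.query_items(
        query="SELECT c.transaction_date, c.amount, c.description FROM c "
              "WHERE c.customer_id = @cid ORDER BY c.transaction_date",
        parameters=[{"name": "@cid", "value": customer_id}],
        partition_key=customer_id,
    )

    # Build the CSV entirely in memory; for very large result sets this is
    # exactly the limitation called out below, where Option #2 is a better fit.
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["transaction_date", "amount", "description"])
    writer.writeheader()
    for item in items:
        writer.writerow({k: item.get(k) for k in writer.fieldnames})

    return func.HttpResponse(
        buffer.getvalue(),
        mimetype="text/csv",
        headers={"Content-Disposition": f"attachment; filename={customer_id}_history.csv"},
    )
```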
This option is not complicated to implement; however, it may not be cost-effective considering the requirement to load 36+ months of historical data (which is queried infrequently), and the result set may be too large and cumbersome to handle in the Azure Function or in the calling client application.
Option #2
In this option, you do not load data into Azure Cosmos DB; instead, you load the historical data into Azure Storage (ADLS Gen2) as Parquet or Delta format files.
The processing flow can be described as follows:
- The downstream consuming App makes a read request via the API for historical data greater than 3 months old
- Save the actual request details in Azure Cosmos DB i.e., in a history_query_request container. This acts as a “job queue”
- A scheduled or streaming Spark job (Synapse Spark or Azure Databricks) reads the history_query_request container's change feed for new entries (see the sketch after this list). The change feed is a persistent record of changes to a container in the order they occur. The persisted changes can be processed asynchronously and incrementally, and the output can be distributed across one or more consumers for parallel processing.
- The Spark job then queries the ADLS storage data (Parquet or Delta) across the time range requested by the client application, creates a PDF/CSV from the query result set, and saves it back into ADLS storage
- Finally, the Azure Function returns a link to the PDF/CSV document stored in Azure Storage, together with a "time-boxed" Shared Access Signature (SAS)
- If using a document management system, that system's link is sent to the client application instead
- Alternatively, the query result document can be encrypted or password-protected and then attached to an email. This method should require at least a One Time Password (OTP) to control opening the document
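A minimal sketch of the Spark side of this flow is shown below, using the Azure Cosmos DB Spark 3 OLTP connector's change feed support. The account, storage paths, container, column names, and the per-request CSV layout are all illustrative assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

change_feed_config = {
    "spark.cosmos.accountEndpoint": "https://<account>.documents.azure.com:443/",
    "spark.cosmos.accountKey": "<key>",  # in practice, read from Key Vault
    "spark.cosmos.database": "serving",
    "spark.cosmos.container": "history_query_request",
    "spark.cosmos.read.inferSchema.enabled": "true",
    "spark.cosmos.changeFeed.startFrom": "Beginning",
    "spark.cosmos.changeFeed.mode": "Incremental",
}

# Historical data kept in ADLS as Delta (assumed path).
history = spark.read.format("delta").load(
    "abfss://curated@<storage>.dfs.core.windows.net/transactions_history")

def process_requests(batch_df, batch_id):
    # Each change feed entry is a "job" describing the customer and date range.
    for row in batch_df.collect():
        extract = history.where(
            (F.col("customer_id") == row["customer_id"]) &
            (F.col("transaction_date").between(row["from_date"], row["to_date"])))
        (extract.coalesce(1).write.mode("overwrite").option("header", "true")
            .csv(f"abfss://exports@<storage>.dfs.core.windows.net/{row['request_id']}"))
        # The API layer can then generate a time-boxed SAS link for this file.

(spark.readStream.format("cosmos.oltp.changeFeed")
      .options(**change_feed_config)
      .load()
      .writeStream
      .foreachBatch(process_requests)
      .option("checkpointLocation",
              "abfss://checkpoints@<storage>.dfs.core.windows.net/history_requests")
      .start())
```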
Performance and Operationalization Tips
There are several recommendations that should be considered to ensure these patterns perform optimally:
- If using Azure Data Factory (ADF) with a self-hosted integration runtime (SHIR) on an Azure VM, be sure to use compute-optimized VMs and configure ADF to maximize parallelism
- Use Azure Cosmos DB manual throughput if the scaling or burst requirements are predictable. The autoscale rate per RU is 1.5 times the standard (manual) rate, so use autoscale only when the workload pattern is unpredictable. Learn how to choose between standard (manual) and autoscale provisioned throughput
- Scale Azure Cosmos DB RUs up when performing data loading processes, and scale back down to the steady-state RU level when complete (see the sketch at the end of this section). Learn the best practices for scaling provisioned throughput (RU/s).
- Adjust the following Azure Cosmos DB (SQL API) sink properties according to your needs:
- writeBatchSize: an integer that represents how many documents are written to the Azure Cosmos DB collection in each batch. Increase the writeBatchSize value to improve performance, and decrease it if your document size is large. Usually, starting with the default batch size is sufficient; then iteratively test with your specific workload.
- Azure Cosmos DB limits a single request's size to 2 MB. The formula is: request size = single document size * batch size. If you hit the error "Request size is too large", reduce the batch size value. The default batch size is 10,000
- The larger the batch size, the better the throughput the service can achieve; however, be sure to allocate enough RUs to drive your data processing workload.
- maxConcurrentConnections: specify a value only when you want to limit concurrent connections. You may want to do this if you have an upper limit on RUs and want to avoid rate-limiting (429) errors. You will find this setting on the sink tab of the Azure Data Factory copy activity
- If extracting data directly from Azure Synapse dedicated SQL pools, be sure to configure workload management to ensure predictable resource usage and utilization of the dedicated pool by other SQL workloads
- If using Azure Functions, consider using the Premium plan and leverage VNet integration, managed identity for authentication, and Mutual SSL Authentication (MSSL) to secure the API. You can opt for other hosting plans (e.g., the Consumption plan); however, be sure to consider any hosting plan limitations. For example, use cases that require VNet integration (i.e., secure networking) and a warm start for each function call/request are only supported by the Premium plan
- An alternative to Azure Functions is using Azure Kubernetes Service (AKS) to host the API as a containerized microservice. This provides more granular control and manageability, and supports scalability requirements
- Consider using an API Gateway in front of the API layer (e.g., Azure API Management) to provide backend abstraction, security/protection, observability, logging, performance, and end-point service discoverability
- To better manage costs for the Azure Cosmos DB layer, it is important to understand how it apportions request units (RUs), the currency in which Azure Cosmos DB throughput is charged. By default, Azure Cosmos DB enforces a minimum throughput of 10 RU/s per GB of storage. To override this default, subscribe to the high storage / low throughput program, i.e., 1 RU/s per GB of storage. This is for scenarios where you need to store large amounts of data but have comparatively low RU throughput requirements. Enabling this will reduce your overall Azure Cosmos DB costs and give you flexibility with RUs.
- Note: to enable this you must raise a support request (SR) ticket, or fill in the request form provided in the Microsoft documentation
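As a sketch of the scale-up / scale-down tip above, the following Python snippet adjusts manual provisioned throughput around a load job using the Azure Cosmos DB SDK. It assumes throughput is provisioned at the container level; the names and RU values are illustrative assumptions.

```python
from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/", "<key>")
container = client.get_database_client("serving").get_container_client("transactions")

LOAD_RUS = 20000        # burst throughput for the scheduled load window
STEADY_STATE_RUS = 4000

def run_load(load_fn):
    """Wrap a load job with a scale-up before and a scale-down after."""
    container.replace_throughput(LOAD_RUS)
    try:
        load_fn()  # e.g. trigger the ADF pipeline or Spark job
    finally:
        container.replace_throughput(STEADY_STATE_RUS)
```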