Decision Guide for Selecting an Analytical Data Store in Microsoft Fabric
Learn how to select an analytical data store in Microsoft Fabric based on your workload's data volumes, data type requirements, compute engine preferences, data ingestion patterns, data transformation needs, query patterns, and other factors.

Securing Azure HDInsight: ESM Support with Ubuntu 18.04, Cluster Updates, and Best Practices
Azure HDInsight, Microsoft's cloud-based big data analytics platform, continues to advance its features to provide users with a secure and efficient environment. In this article, we will explore the latest enhancements, focusing on Expanded Security Maintenance (ESM) support, the importance of regular cluster updates, and the best practices recommended by Microsoft to fortify HDInsight deployments.

The foundation of a secure Azure HDInsight environment lies in its ability to address critical vulnerabilities promptly. Microsoft ensures this by shipping the latest HDInsight images with Expanded Security Maintenance (ESM) support, which provides a framework for ongoing support and stability with minimal changes, specifically targeting critical, high, and some medium-level fixes. This ensures that HDInsight users benefit from a continuously updated and secure environment.

ESM Support in Latest Images: Azure HDInsight 5.0 and 5.1 versions use the Ubuntu 18.04 Pro image. Ubuntu Pro includes security patching for all Ubuntu packages thanks to Expanded Security Maintenance (ESM) for Infrastructure and Applications. Ubuntu Pro 18.04 LTS will remain fully supported until April 2028. For more information on what's new in the latest HDInsight images with ESM support, users can refer to the official release notes in the Azure HDInsight Release Notes Archive.

Periodic Cluster Updates: Maintaining a secure HDInsight environment requires diligence in keeping clusters up to date. Microsoft facilitates this process through the HDInsight OS patching mechanism. Periodically updating clusters using the procedures outlined in the official documentation ensures that users benefit from the latest features, performance improvements, and crucial security patches. Learn more about updating HDInsight clusters through the Azure HDInsight OS Patching documentation.

ESM and HDI Release Integration: Expanded Security Maintenance is seamlessly integrated into HDInsight releases. As part of each HDInsight release, the critical fixes provided by ESM are bundled. This ensures that users benefit from the latest security enhancements with each new release.

Customer Recommendation: Use the Latest Image: To maximize the benefits of the latest features and security updates, customers are strongly recommended to use the most recent HDInsight image number (a sketch for checking deployed cluster versions follows at the end of this article). By doing so, organizations ensure that their HDInsight clusters are fortified against the latest threats and vulnerabilities.

Accessing Fixed CVE Details: For users seeking detailed information about the fixed Common Vulnerabilities and Exposures (CVEs), the Ubuntu CVE site serves as a valuable resource. Here, users can access comprehensive insights into the specific vulnerabilities addressed in each release, empowering them to make informed decisions about their security posture.
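To act on that recommendation, it helps to first inventory which cluster versions are currently deployed. The following is a minimal sketch using the Azure SDK for Python; it assumes the azure-identity and azure-mgmt-hdinsight packages are installed, and the subscription ID placeholder and printed property names should be verified against your SDK version rather than taken as the documented procedure.

from azure.identity import DefaultAzureCredential
from azure.mgmt.hdinsight import HDInsightManagementClient

# Placeholder subscription ID for this sketch
subscription_id = "<your-subscription-id>"

client = HDInsightManagementClient(DefaultAzureCredential(), subscription_id)

# List every HDInsight cluster in the subscription and print its version details,
# so clusters on older images can be identified and scheduled for an update.
for cluster in client.clusters.list():
    props = cluster.properties
    print(cluster.name,
          props.cluster_version,          # cluster/image version string
          props.cluster_definition.kind,  # workload type, e.g. Spark or Kafka
          props.os_type)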
Efficient Log Management with Microsoft Fabric

Introduction

In the era of digital transformation, managing and analyzing log files in real time is essential for maintaining application health, security, and performance. There are many third-party solutions in this area that collect, process, store, analyze, and act upon this data source. But as your systems scale, those solutions can become very costly: their cost model typically grows with the amount of ingested data rather than with actual resource utilization or customer value.

This blog post explores a robust architecture leveraging the Microsoft Fabric SaaS platform, focused on its Real-Time Intelligence capabilities, for efficient log file collection, processing, and analysis. The use cases range from simple application error troubleshooting to more advanced scenarios such as application trend detection (for example, spotting slowly degrading performance, like average user sessions for specific activities lasting longer than expected) and more proactive monitoring based on log-derived KPIs that are watched to generate alerts.

Regarding cost, since Fabric provides a complete separation between compute and storage, you can grow your data without necessarily growing your compute costs, and you still pay only for the resources that are used, in a pay-as-you-go model.

Architecture Overview

The proposed architecture integrates Microsoft Fabric's Real-Time Intelligence (Real-Time Hub) with your source log files to create a seamless, near real-time log collection solution. It is based on Microsoft Fabric: a SaaS solution and a unified suite integrating several best-of-breed Microsoft analytical experiences. Fabric is a modern data/AI platform built on unified and open data formats (Parquet/Delta), allowing both classical data lake experiences using traditional Lakehouse/Warehouse SQL analytics as well as real-time intelligence on semi-structured data, all in one lake-centric SaaS platform. Fabric's open foundation with built-in governance enables you to connect to various clouds and tools while maintaining data trust.

This is a high-level overview of Real-Time Intelligence inside Fabric.

Log events - Fabric based Architecture

Looking in more detail at a solution for log collection, processing, storage, and analysis, we propose the following architecture. Let's discuss it in more detail.

General notes:
- Since Fabric is a SaaS solution, all the components can be used without deploying any infrastructure in advance; with a click of a button and a few simple configurations you can customize the relevant components for this solution.
- The main components used in this solution are Data Pipeline, OneLake, and Eventhouse.
- Our data source for this example is taken from this public git repo: https://github.com/logpai/loghub/tree/master/Spark. The files were stored inside an S3 bucket to demonstrate how easily the data pipeline integrates with external data sources.

A typical log file looks like this:

16/07/26 12:00:30 INFO util.Utils: Successfully started service 'sparkDriverActorSystem' on port 59219.
16/07/26 12:00:30 INFO spark.SparkEnv: Registering MapOutputTracker
16/07/26 12:00:30 INFO spark.SparkEnv: Registering BlockManagerMaster
16/07/26 12:00:30 INFO storage.DiskBlockManager: Created local directory at /opt/hdfs/nodemanager/usercache/curi/appcache/application_1460011102909_0176/blockmgr-5ea750cb-dd00-4593-8b55-4fec98723714
16/07/26 12:00:30 INFO storage.MemoryStore: MemoryStore started with capacity 2.4 GB

Components

Data Pipeline

The first challenge to solve is how to bring the log files from your system into Fabric; this is the log collection phase. Many solutions exist for this phase, each with its pros and cons. In Fabric, the standard approach to bring data in is the Copy Activity in ADF, which in its Fabric SaaS version is now called Data Pipeline. Data Pipeline is a low-code/no-code tool that manages and automates the process of moving and transforming data within Microsoft Fabric; it is a serverless ETL tool with more than 100 connectors enabling integration with a wide variety of data sources, including databases, cloud services, file systems, and more. In addition, it supports an on-premises agent called the self-hosted integration runtime: this agent, which you install on a VM, acts as a bridge that runs your pipeline on a local VM and secures the connection from your on-premises network to the cloud.

Let's describe our solution's data pipeline in more detail. Bear in mind that ADF is very flexible and supports reading at scale from a wide range of data sources and files, with integrations for all major cloud vendors' blob storage (S3, GCS, Oracle Cloud), file systems, FTP/SFTP, and so on, so even if your files are generated outside Azure, this is not an issue at all.

Visualization of Fabric Data Pipeline Log Collection

ADF Copy Activity: Inside the Data Pipeline we create a Copy Activity with the following basic configuration:
- Source: mapped to your data sources. It can be Azure Blob Storage with a container holding the log files, or other cloud object storage like S3 or GCS. Log files are generally retrieved from a specific container/folder and fetched based on some prefix/suffix in the file name. To support an incremental load process, we can configure the activity to delete the source files it reads, so that once the files are successfully transferred to their target they are automatically deleted from their source; on the next iteration the pipeline will not have to process the same files again.
- Sink: a OneLake/Lakehouse folder. We create a Lakehouse ahead of time, which is an abstract data container that holds and manages your data at scale, structured or unstructured, and then select it from the list of connectors (look for OneLake/Lakehouse).

Log Shippers: This is an optional component. Sometimes the ETL is not allowed to access your on-premises VNet; in this case tools like Fluentd, Filebeat, or the OpenTelemetry Collector can be used to forward your application's collected logs to the main entry point of the system: Azure Blob Storage.

Azcopy CLI: If you don't wish to invest in expensive tools and all you need is to copy your data to Azure Storage at scale and securely, you might consider creating your own log shipping solution based on the free AzCopy tool together with some basic scripting around it for scheduling. AzCopy is a command-line utility designed for high-performance uploading, downloading, and copying of data to and from Microsoft Azure Blob and File storage.
Fabric first Activity: Copy from Source Bucket to Lakehouse

Log Preparation

Upon the log files landing in Azure Blob Storage, Eventstream can be used to trigger the Data Pipeline that handles the data preparation and loading phase. So what is the data preparation phase's main purpose? After the log files land in storage and before they are loaded into the real-time logs database (the KQL database), it might be necessary to transform the data with some basic manipulations. The reasons for that can differ; a few examples:
- Bad data formats: sometimes log files contain problematic characters, like new lines inside a row (a stack trace error message with new lines as part of the message field of the record).
- Metadata enrichment: sometimes the log file names carry meaningful data, for example the file name describes the originating process name or server name, and this metadata would be lost once the file content is loaded into the database.
- Regulation restrictions: sometimes logs contain private data like names, credit card numbers, or social security numbers (PII) that must be removed, hashed, or encrypted before the load to the database.

In our case we run a PySpark notebook that reads the files from a OneLake folder, fixes the new-lines-inside-a-row issue, and creates new files in another OneLake folder. We call this notebook with a base parameter called log_path that defines the log files' location on OneLake to read from.

Fabric second Activity: Running the Notebook
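The notebook's code is not part of the original post, but a minimal sketch of the idea could look like the following; it assumes the built-in spark session of a Fabric notebook, that every record starts with the yy/MM/dd HH:mm:ss prefix seen in the sample above, and hypothetical Files/... folder paths supplied through the log_path base parameter.

import re
from pyspark.sql import Row

# Base parameter supplied by the pipeline (hypothetical paths)
log_path = "Files/raw_logs"
output_path = "Files/prepared_logs"

# A new record starts with a "yy/MM/dd HH:mm:ss" prefix; anything else is a continuation
# line (for example a stack trace) that belongs to the previous record.
record_start = re.compile(r"\n(?=\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} )")

def merge_multiline(content):
    # Split only where a new timestamped record begins and flatten inner newlines,
    # so one log event always ends up as exactly one output line.
    return [rec.replace("\n", " ").strip() for rec in record_start.split(content) if rec.strip()]

# wholeTextFiles returns (path, full file content) pairs, letting us fix rows that
# span several physical lines before the data ever reaches the KQL table.
merged = spark.sparkContext.wholeTextFiles(log_path).flatMap(lambda kv: merge_multiline(kv[1]))

merged.map(lambda line: Row(value=line)).toDF().write.mode("overwrite").text(output_path)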
Log Loading

Inside the Data Pipeline, the last step after the transformation phase calls the Copy data activity again, but this time the source and sink are different:
- Source: Lakehouse folder (the previous notebook's output)
- Sink: a specific Eventhouse table (created ahead of time); it is basically an empty table (logsraw)

Visualization of Fabric last Activity: Loading to EventHouse

In summary, for the log collection and preparation stage we broke the work into three data pipeline activities:
- Copy Activity - read the log files from the source: this is the first step of the log ingestion pipeline, running inside our orchestrator, the Data Pipeline.
- Run Notebook Activity - transform the log files: the execution of a single notebook or a chain of notebooks.
- Copy Activity - load the log files into the destination database: the KQL database inside the Eventhouse, into the logs table called logsraw, a specific table created ahead of time inside the Eventhouse database.

Inside The Eventhouse

We needed to create a KQL database with a table to hold the raw ingested log records. A KQL database is a scalable and efficient storage solution for log files, optimized for high-volume data ingestion and retrieval. Eventhouses and KQL databases operate on a fully managed Kusto engine. With an Eventhouse or KQL database, you can expect available compute for your analytics within 5 to 10 seconds, and the compute resources grow with your data analytic needs.

Log Ingestion to KQL Database with Update Policy

We can separate the ETL transformation logic into what happens to the data before it reaches the Eventhouse KQL database and what happens after. Before it reaches the database, the only transformation we applied was calling a notebook during the data pipeline to handle the new-line merge logic. This cannot easily be done as part of the database ingestion logic, simply because when we try to load files that contain new lines inside a field of a record, the convention breaks and the ingestion process creates a separate table record for each new line of the exception stack trace. On the other hand, we might need to define basic transformation rules such as date formatting, type conversion (string to numbers), parsing and extracting an interesting value from a string based on a regular expression, or creating JSON (dynamic type) from a hierarchical string (an XML or JSON string, etc.). For all these transformations we can work with what is called an update policy: we can define simple ETL logic inside the KQL database, as explained here. During this step we create, from the logsraw staging table, a new table called logsparsed that will be the final destination table for log queries.

These are the KQL tables defined to hold the log files:

.create table logsraw ( timestamp:string , log_level:string, module:string, message:string)

.create table logsparsed ( formattedDatetime:datetime , log_level:string, log_module:string, message:string)

This is the update policy that automatically converts data from the staging table logsraw to the destination table logsparsed (the projected column names must match the destination table's schema):

.create-or-alter function parse_lograw() {
    logsraw
    | project formattedDatetime = todatetime(strcat("20", substring(timestamp, 0, 2), "-", substring(timestamp, 3, 2), "-", substring(timestamp, 6, 2), "T", substring(timestamp, 9, 8))),
              log_level,
              log_module = module,
              message
}

.alter table logsparsed policy update @'[{ "IsEnabled": true, "Source": "logsraw", "Query": "parse_lograw()", "IsTransactional": true, "PropagateIngestionProperties": true}]'

Since we don't need to retain the data in the staging table (logsraw), we can define a retention policy of 0 TTL like this:

.alter-merge table logsraw policy retention softdelete = 0sec recoverability = disabled

Query Log files

After the data is ingested and transformed, it lands in a basic, schematized logs table: logsparsed. In general we have some common fields mapped to their own columns, like log level (INFO/ERROR/DEBUG), log category, log timestamp (a datetime-typed column), and the log message, which can be either a simple error string or a complex JSON-formatted string; in the latter case it is usually preferable to convert it to the dynamic type, which brings additional benefits like simplified query logic and reduced data processing (avoiding expensive joins).

Example of Typical Log Queries

- Troubleshooting - looking for an error in a specific datetime range:
  logsparsed
  | where message contains "Exception" and formattedDatetime between ( datetime(2016-07-26T12:10:00) .. datetime(2016-07-26T12:20:00))
- Statistics - basic statistics, such as the min/max timestamp of log events:
  logsparsed
  | summarize minTimestamp=min(formattedDatetime), maxTimestamp=max(formattedDatetime)
- Exceptions stats - check the exceptions distribution:
  logsparsed
  | extend exceptionType = case(message contains "java.io.IOException","IOException", message contains "java.lang.IllegalStateException","IllegalStateException", message contains "org.apache.spark.rpc.RpcTimeoutException", "RpcTimeoutException", message contains "org.apache.spark.SparkException","SparkException", message contains "Exception","Other Exceptions", "No Exception")
  | where exceptionType != "No Exception"
  | summarize count() by exceptionType
- Log module stats - check the modules distribution:
  logsparsed
  | summarize count() by log_module
  | order by count_ desc
  | take 10

Realtime Dashboards

After querying the logs, it is possible to visualize the query results in real-time dashboards. All that's required is to select the query and click Pin to Dashboard. After adding the queries to tiles inside the dashboard, this is a typical dashboard we can easily build. Real-time dashboards can be configured to refresh automatically, in which case the user can easily configure how often the queries and visualizations refresh; at the extreme, the refresh rate can be as low as Continuous. There are many more capabilities implemented in the Real-Time Dashboard, like data exploration, alerting using Data Activator, and conditional formatting (changing item colors based on KPI thresholds), and this framework and its capabilities are heavily invested in and keep growing.

What about AI Integration?

Machine Learning Models: Kusto supports time series analysis out of the box, allowing for example anomaly detection (https://learn.microsoft.com/en-us/fabric/real-time-intelligence/dashboard-explore-data) and clustering. But if that's not enough for you, you can always mirror the data of your KQL tables into OneLake delta parquet format by selecting OneLake availability. This configuration creates another copy of your data in the open delta parquet format, available for any Spark/Python/SparkML/SQL analytics and for whatever machine learning exploration, training, and serving you wish to do. Bear in mind, there is no additional storage cost to turn on OneLake availability.
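Once OneLake availability is enabled, a hedged sketch of picking the mirrored data up from a Fabric notebook might look like this; it assumes a Lakehouse shortcut named logsparsed has been created over the mirrored KQL table and attached as the default lakehouse, and the feature-preparation logic is illustrative rather than part of the original solution.

from pyspark.sql import functions as F

# Read the mirrored table through the Lakehouse shortcut; the data is plain delta parquet,
# so any Spark/SparkML workload can consume it without touching the Eventhouse itself.
logs = spark.read.format("delta").load("Tables/logsparsed")

# Example feature preparation: error counts per module per hour, a typical starting point
# for trend detection or anomaly-detection models.
features = (
    logs.where(F.col("log_level") == "ERROR")
        .groupBy(F.window("formattedDatetime", "1 hour"), "log_module")
        .count()
)
features.show(truncate=False)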
Conclusion

A well-designed real-time intelligence solution for log file management using Microsoft Fabric and Eventhouse can significantly enhance an organization's ability to monitor, analyze, and respond to log events. By leveraging modern technologies and best practices, organizations can gain valuable insights and maintain robust system performance and security.

Azure Stream Analytics Virtual Network Integration Goes GA!

We are thrilled to announce that the highly anticipated capability of running your Azure Stream Analytics (ASA) job in an Azure Virtual Network (VNET) is now generally available (GA)! This feature, which has been in public preview, is set to revolutionize how you secure and manage your ASA jobs by leveraging the power of virtual networks.

What Does This Mean for You?

With VNET integration, you can now lock down access to your ASA jobs within your virtual network infrastructure. This provides enhanced security through network isolation, ensuring that your data remains protected and accessible only within your private network. By deploying a containerized instance of your ASA job inside your VNET, you can privately access your resources using:
- Private Endpoints: These allow you to connect your VNET-injected ASA job to your data sources privately via Azure Private Link. This means that your data traffic remains within the Azure backbone network, reducing exposure to the public internet and enhancing security.
- Service Endpoints: These enable you to connect your data sources directly to your VNET-injected ASA job. This simplifies the network architecture by providing direct connectivity.
- Service Tags: These allow you to manage network security by defining rules that allow or deny traffic to Azure Stream Analytics. This helps in maintaining a secure environment by controlling which services can communicate with your ASA jobs.

Overall, VNET integration enhances the security of your ASA jobs by leveraging Azure's robust networking features.

Expanded Regional Availability

We are also excited to announce that this capability is now available in additional regions! Along with the existing regions (West US, Central Canada, East US, East US 2, Central US, West Europe, and North Europe), you can now enable VNET integration in the following regions: Australia East, France Central, North-Central US, Southeast Asia, Brazil South, Japan East, UK South, and Central India.

These regions were added in response to customer feedback. If you have suggestions for additional regions, please complete this form: https://forms.office.com/r/NFKdb3W6ti?origin=lprLink This expansion ensures that more customers around the globe can benefit from the enhanced security and network isolation provided by VNET integration.

Getting Started

To get started with VNET integration for your ASA jobs, follow these steps:
1. Set Up Your VNET: Create or use an existing Azure Virtual Network.
2. Create a Subnet: Add a dedicated subnet for your ASA job within the VNET.
3. Set Up Azure NAT Gateway or disable outbound connectivity: Enhance security and reliability by setting up an Azure NAT Gateway or disabling default outbound connectivity.
4. Associate a Storage Account: Ensure you have a General Purpose V2 (GPV2) Storage account linked to your ASA job.
5. Configure Your ASA Job:
   - Azure Portal: Go to Networking and select "Run this job in virtual network." Follow the prompts to configure and save.
   - Visual Studio Code: In the 'JobConfig.json' file, set up the 'VirtualNetworkConfiguration' to reference the subnet.
6. Check Permissions: Make sure you have the necessary role-based access control permissions on the subnet or higher.

For detailed instructions and requirements, refer to the official documentation: Run your Stream Analytics in Azure virtual network - Azure Stream Analytics | Microsoft Learn.

Join the Revolution

Stay tuned for more updates and exciting features as we continue to innovate and improve Azure Stream Analytics.
Our other Ignite releases include the Azure Stream Analytics Kafka connectors, which are now generally available. If you have any questions or need assistance, feel free to reach out to us at askasa@microsoft.com. Happy streaming!

Azure Stream Analytics Kafka Connectors is Now Generally Available!
We are excited to announce that Kafka Input and Output with Azure Stream Analytics is now generally available! This marks a major milestone in our commitment to empowering our users with robust and innovative solutions. With the Stream Analytics Kafka connectors, users can natively read and write data to and from Kafka topics. This enables users to fully leverage Stream Analytics' rich capabilities and features even when the data resides outside of Azure. Azure Stream Analytics is a job service, so you do not have to spend time managing clusters, and downtime concerns are alleviated with a 99.99% SLA (Service Level Agreement) at the job level.

Key Benefits:
- A Stream Analytics job can ingest Kafka events from anywhere, process them, and output them to any number of Azure services as well as to other Kafka clusters.
- No need for workarounds such as MirrorMaker or Kafka extensions with Azure Functions to process Kafka data with Azure Stream Analytics.
- The solution is low code and entirely managed by the Azure Stream Analytics team at Microsoft.

Getting Started:

To get started with the Stream Analytics Kafka input and output connectors, please refer to the links below: Stream data from Kafka into Azure Stream Analytics, and Kafka output from Azure Stream Analytics. You can add Kafka input or output to a new or an existing Stream Analytics job in a few simple clicks. To add Kafka input, go to Inputs under job topology, click Add input, and select Kafka. For Kafka output, go to Outputs under job topology, click Add output, and select Kafka. Next, you will be presented with the Kafka connection configuration; once filled in, you will be able to test the connection to the Kafka cluster.

VNET Integration:

You can connect to a Kafka cluster from Azure Stream Analytics whether it is in the cloud or on premises with a public endpoint. You can also securely connect to a Kafka cluster inside a virtual network with Azure Stream Analytics. Visit the Run your Azure Stream Analytics job in an Azure Virtual Network documentation for more information.

Automated deployment with ARM template

ARM templates allow for quick and automated deployment of Stream Analytics jobs. To deploy a Stream Analytics job with Kafka input or output quickly and automatically, users can include the following sample snippet in their Stream Analytics job ARM template:

"type": "Kafka",
"properties": {
    "consumerGroupId": "string",
    "bootstrapServers": "string",
    "topicName": "string",
    "securityProtocol": "string",
    "securityProtocolKeyVaultName": "string",
    "sasl": {
        "mechanism": "string",
        "username": "string",
        "password": "string"
    },
    "tls": {
        "keystoreKey": "string",
        "keystoreCertificateChain": "string",
        "keyPassword": "string",
        "truststoreCertificates": "string"
    }
}
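For context on what sits on the other side of these connection settings, here is a minimal, hedged sketch of a client writing to such a topic with the kafka-python package; the broker address, topic name, and SASL credentials are placeholders, and your cluster's security setup may differ from the SASL_SSL/PLAIN combination assumed here.

import json
from kafka import KafkaProducer

# The same connection details the Stream Analytics Kafka input expects:
# bootstrap servers, security protocol, and SASL mechanism/credentials.
producer = KafkaProducer(
    bootstrap_servers=["my-broker.example.com:9093"],   # placeholder broker
    security_protocol="SASL_SSL",
    sasl_mechanism="PLAIN",
    sasl_plain_username="<username>",
    sasl_plain_password="<password>",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Events published here would be picked up by an ASA job whose Kafka input
# points at the same bootstrap servers and topic.
producer.send("telemetry-events", {"deviceId": "sensor-01", "temperature": 21.7})
producer.flush()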
We can't wait to see what you'll build with the Azure Stream Analytics Kafka input and output connectors. Try it out today and let us know your feedback. Stay tuned for more updates as we continue to innovate and enhance this feature. Call to Action: For direct help with using the Azure Stream Analytics Kafka input, please reach out to askasa@microsoft.com. To learn more about Azure Stream Analytics click here.

Generally Available: Protocol Buffers (Protobuf) with Azure Stream Analytics

We are excited to announce that the Azure Stream Analytics built-in Protobuf deserializer is now generally available! With the built-in deserializer, Azure Stream Analytics supports Protobuf out of the box, in line with the other file formats we support, including JSON and AVRO. Using Protobuf, you can handle data streams in a compact binary format, making it ideal for high-throughput, low-latency applications.

Protobuf

Protocol Buffers (Protobuf) is a language-neutral, platform-neutral mechanism for serializing structured data, developed by Google. It is widely used for communication between services or for saving data in a compact and efficient format.

Configure your Stream Analytics job

Using the Protobuf deserializer is simple to configure. When setting up your Stream Analytics input, select Protobuf as the event serialization format and then upload the Protobuf definition file (.proto file). Complete your configuration by specifying the message type and prefix style.

Steps to configure a Stream Analytics job to deserialize events in Protobuf:
1. After creating your job, go to Inputs.
2. Click Add input and choose the desired input to configure.
3. Under Event serialization format, select Protobuf from the dropdown list.

Configuration settings:
- Protobuf definition file: a file that specifies the structure and data types of your Protobuf events.
- Message type: when you define a .proto schema, each message represents a data structure with specific fields; add the message type that you want to deserialize.
- Prefix style: the setting that determines how long a message is, so that Protobuf events are deserialized correctly.

We can't wait to see what you'll build with the Azure Stream Analytics built-in Protobuf deserializer. Try it out today and let us know your feedback. Stay tuned for more updates as we continue to innovate and enhance this feature. Call to Action: For direct help with using the Azure Stream Analytics Kafka input, please reach out to askasa@microsoft.com. To learn more about Azure Stream Analytics click here. To learn more about the Azure Stream Analytics Protobuf deserializer click here.

Migrating Azure Data Factory's Snowflake Connector from Legacy to latest V2
Recently we announced the new updated V2 of our popular Snowflake connector in Azure Data Factory. The new connector promises better performance and security. The new V2 connector supports Pipelines and Data Flows and comes with the new Go-based Snowflake driver. This blog provides an approach and considerations for upgrading to V2.

Simplifying Migration to Fabric Real-Time Intelligence for Power BI Real Time Reports
Power BI with real-time streaming has been the preferred solution for users to visualize streaming data. Real-time streaming in Power BI is being retired. We recommend users start planning the migration of their data processing pipeline to Fabric Real-Time Intelligence.

Implementing Business Logic using Data Vault 2.0 on Azure Fabric
This article is authored by Michael Olschimke, co-founder and CEO at Scalefree International GmbH, and co-authored by Kilian Grünhagen, Senior BI Consultant from Scalefree. The technical review was done by Ian Clarke and Naveed Hussain, GBBs (Cloud Scale Analytics) for EMEA at Microsoft.

Business logic serves an important role in the data-driven data platform. There is a business expectation of the information to be delivered. This expectation can be defined by two characteristics: the user expects the information in a certain structure (often a dimensional model) and they expect certain content, for example, all currency amounts to be in Euros. But there is a gap between these expectations and the actual data from the data sources. In the Data Vault 2.0 architecture, the Business Vault is used to bridge this gap and focuses on implementing the business logic to transform the data to meet the content expectations of the business users.

Introduction

The previous articles focused on modelling and implementing the Raw Data Vault, where the raw data is captured by integrating the data on shared business keys and their relationships and versioning all changes to descriptive data. In the Raw Data Vault layer, only so-called hard rules are applied. Hard rules don't change the meaning of the content, they only change the structure of the incoming data set. The creative destruction of the data into business keys (stored in hubs), relationships between business keys (stored in links), and descriptive data (stored in satellites) is a prime example. But the data type alignment to match the data types of the (often relational) data model is also a good example: changing the data type of a whole number from a CSV string into an integer does not change the content, only the structure of the data. This is important to ensure auditability when the original data deliveries must be reproduced and to create multiple business perspectives when information users cannot agree on shared business logic or the definition of their concepts ("Who is a customer?"). In both cases, the unmodified raw data is required.

How to Model the Business Vault

The business logic to transform the raw data into useful information, for example by cleansing it, recomputing foreign currency amounts, or standardizing addresses, is implemented in the Business Vault. This sparsely modelled layer sits right between the Raw Data Vault with its unmodified raw data and the information mart where the final information in the expected structure and with the expected content is delivered to the presentation layer. "Sparsely modelled" refers to the fact that, believe it or not, some of your data is good enough for reporting. There is no need to cleanse, standardize, or otherwise transform the data because it is exactly what the information user expects. In this case, the dimensional entities in the information mart are directly derived from the Raw Data Vault entities. However, if business logic needs to be applied, it's done in the Business Vault. The entities are typically modelled in the same way as in the Raw Data Vault, so one can expect hubs, links, and many satellites in the Business Vault, including special entity types such as multi-active satellites, non-historized links, etc. For example, in the above diagram, there are the invoice hub and its two satellites originating from the CRM and ERP systems.
In addition to these Raw Data Vault entities, a computed (or business) satellite invoice_lroc_bsat with one calculated attribute for the invoice amount has been added. But in either case, with or without additional Business Vault entities, the final information is not done yet, as the desired target structure (e.g., a dimensional model) has not been created. This model will be derived in the information mart. To do so, the dimensional modeler can combine (and often has to combine) entities from the Raw Data Vault and the Business Vault.

Implementing Business Logic on Fabric

In many cases, the business logic is relatively simple and can be implemented in SQL. In such cases, an SQL view is the preferred choice and is used to implement the Business Vault entity. If the business logic becomes too complex or the performance of the view is not as desired, an external script might be used as an alternative. For example, a Python script could retrieve data from the Raw Data Vault (but also from the Business Vault) and write the results into a table in the Business Vault (a sketch of this option appears at the end of this article). This external Python script is considered to be part of the Business Vault as long as the data platform team has it under its own version control. Keep in mind that there are other options, such as PIT tables, to improve the performance of virtualized entities in the Data Vault architecture.

Note that there are actually two options to implement business logic in the Data Vault architecture: besides the discussed option in the Business Vault, it is also possible to implement business rules directly in the information marts, for example in dimension or fact entities. However, when doing so, the business logic is not reusable. If the same business logic should be used for dimensions in multiple information marts, the business logic must be replicated. If the business logic is implemented in the Business Vault instead, the entities can be reused by multiple information marts.

The Business Vault is often implemented in a cascading fashion: a Business Vault entity is not limited to a Raw Data Vault entity as its only data source. Instead, a Business Vault entity can source from multiple entities, both from the Raw Data Vault and other Business Vault entities. By doing so, the overall implementation logic cascades across multiple Business Vault entities, which is a typical scenario. In some cases, developers try to avoid this, but end up with a Business Vault entity with complex implementation logic.

From an auditing perspective, there is one more requirement: it should be possible to truncate a materialized Business Vault entity and rebuild it by applying the same, unmodified business logic to the same, unmodified source data. The results in the Business Vault entity must be the same. If this is not the case, either the source data or the business logic has been modified.

Cleansing Dirty Data using Computed Satellites

A typical entity type in the Business Vault is the computed satellite, also known as the business satellite. In short, it's just a satellite in the Business Vault. The only difference to its counterpart in the Raw Data Vault is that it contains computed results, not just raw data. This makes sense, as descriptive data is stored in a satellite in the Raw Data Vault and is subject to the application of business logic, for example to cleanse the data, standardize addresses and phone numbers, and otherwise increase the value of the data.
For example, the Raw Data Vault satellite captures raw data from the data source, which might be erroneous (e.g. in the city name). This data is captured as it is in the Raw Data Vault of the Data Vault architecture, completely unmodified. The goal of the Raw Data Vault is to capture the good, the bad, and the ugly data, and no judgment is made about these categories. As discussed at the beginning of this article, data cleansing is part of the Business Vault, not of the Raw Data Vault. The next diagram shows the relationship between the computed satellite with the computed city attribute, the Raw Data Vault satellite, and their shared hub. The computed satellite is attached to the same hub, as it still describes the same store, just with cleansed data. In this case, the business logic is simple: data is cleansed by joining into a mapping table for the city name based on the raw data. For each city in the Raw Data Vault satellite, there is a mapping in the reference data from the original city name to the cleansed city name:

CREATE VIEW [dv_core].[store_address_crm_lroc_bsat] AS (
    SELECT hk_store_hub
        ,load_datetime
        ,record_source
        ,hd_store_address_crm_lroc_sat
        ,address_street
        ,postal_code
        ,CASE
            WHEN store.city != cities.CityName AND cities.ZipCode IS NOT NULL THEN cities.CityName
            ELSE store.city
         END AS city
        ,country
    FROM [dv_core].[store_address_crm_lroc0_sat] store
    LEFT JOIN [MS_BLOG_DWH].[dv_core].[city_ref_sat] cities
        ON store.postal_code = cities.ZipCode);

It is not uncommon for a lot of business logic to be implemented as simply as the above code. This is achieved by providing a materialized mapping table between dirty data and cleansed data as a reference hub and satellite. Once the computed satellite is deployed, the downstream developer for the information mart can choose between the cleansed address data or the original address data for the dimension entity by joining the appropriate satellite entity.

Dealing with Duplicate Records

Another often-used entity in the Business Vault is the Same-As-Link (SAL). The name of the SAL stems from the sentence, "This business key identifies the same business object as the other business key." So, the link relates two business keys in a duplicate-to-master relationship. One business key identifies the master key to be used for reporting, and the other identifies the duplicate key. If the data contains multiple duplicates, multiple duplicate business keys might be mapped to the same master key. For example, the following table shows a source data set with duplicate records: there are different variations of Michael Olschimke, and because the operational system did not recognize that all of them refer to the same actual customer, it assigned separate business keys to each record. The business key customer_id is captured by the hub customer_hub, while the name is captured by a satellite, not shown in the following diagram. Based on the descriptive data in the satellite, SQL Server's SOUNDEX function (supported in Fabric Warehouse) can be used to calculate the similarity of two strings based on the pronunciation of the text. Matches where the similarity is above a certain threshold are considered duplicates and added to the same-as-link (SAL). That way, duplicates are marked and the mapping can be used later to retrieve a deduplicated dimension.
The following code shows the virtual implementation of the same-as-link:

CREATE VIEW [dv_core].[customer_sal] AS
WITH crm_data AS (
    SELECT crm.hk_customer_hub
        , crm.load_datetime
        , hub.record_source
        , hub.customer_id
        , crm.name
        , SOUNDEX(crm.name) AS soundex_name
        , crm.email
    FROM dv_core.customer_hub hub
    LEFT JOIN dv_core.customer_crm_lroc_sat crm
        ON hub.hk_customer_hub = crm.hk_customer_hub AND crm.is_current = 0
)
, shop_data AS (
    SELECT shop.hk_customer_hub
        , shop.load_datetime
        , hub.record_source
        , hub.customer_id
        , shop.name
        , SOUNDEX(shop.name) AS soundex_name
        , shop.email
    FROM dv_core.customer_hub hub
    LEFT JOIN dv_core.customer_shop_lroc_sat shop
        ON hub.hk_customer_hub = shop.hk_customer_hub AND shop.is_current = 0
)
SELECT hk_customer_hub
    , load_datetime
    , record_source
    , hk_master
    , hk_duplicate
FROM (
    SELECT crm.hk_customer_hub AS hk_customer_hub,
        LEAST(crm.load_datetime, shop.load_datetime) AS load_datetime,
        'https://wiki.scalefree.com/business_rules/unique_customers' AS record_source,
        crm.hk_customer_hub AS hk_master,
        shop.hk_customer_hub AS hk_duplicate,
        DIFFERENCE(crm.name, shop.name) AS similarity_score
    FROM crm_data crm
    LEFT JOIN shop_data shop ON crm.soundex_name = shop.soundex_name
) level1
WHERE level1.similarity_score = 4;

Once the same-as-link is created, it can be joined with the hub to reduce the duplicates to the master record based on the SOUNDEX function. The actual business logic is implemented in the view: the users who query this model don't necessarily need to know how to apply SOUNDEX, they just use the results by joining the link to the customer hub.

Concluding the Value of the Business Vault

By implementing the business logic in Business Vault entities, the business logic can be used by multiple information marts, but also by data scientists and other power users. The Business Vault model presents the result of the business logic, while the source code implements it in many cases. However, in other cases, the business logic could also be implemented in external scripts, such as Python. In this case, the Business Vault entity would be a physical table that is loaded from the external script. There are some cases where it makes more sense to implement the business logic in the dimensional entity, for example in the dimension view. However, in such cases, the code will not be reused by multiple information marts. To use the same logic in multiple marts, the logic must be replicated. We will discuss the information marts and how we derive dimensional models from a Data Vault model in the next article of this series.
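To make the external-script option mentioned above a little more concrete, here is a hedged PySpark sketch of a Business Vault load. It assumes the Raw Data Vault satellite and a currency reference table are available as tables in a Fabric Lakehouse, and the table and column names, the exchange-rate source, and the target satellite are illustrative rather than part of the reference implementation.

from pyspark.sql import functions as F

# Read the Raw Data Vault satellite and a reference table of exchange rates
# (both table names are placeholders for this sketch).
invoices = spark.read.table("invoice_crm_lroc_sat")
rates = spark.read.table("currency_rate_ref_sat")

# Apply the business rule: recompute foreign currency amounts into EUR.
business_satellite = (
    invoices.join(rates, on="currency_code", how="left")
            .withColumn("invoice_amount_eur", F.col("invoice_amount") * F.col("eur_rate"))
            .select("hk_invoice_hub", "load_datetime", "record_source",
                    "invoice_amount_eur", "currency_code")
)

# Materialize the result as a physical Business Vault satellite; rerunning the script
# against the same, unmodified source data must reproduce the same result to keep the
# load auditable, as required by the article.
business_satellite.write.mode("overwrite").saveAsTable("invoice_lroc_bsat")

In Fabric, such a notebook would live under the data platform team's version control alongside the SQL views, consistent with the article's point that external scripts remain part of the Business Vault.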