A Practical Guide to Implement End-to-End Data Warehousing PoC Using Microsoft Fabric - Part 1
Some time back, we had the opportunity to help a customer implement a PoC using Microsoft Fabric. In this series of blogs, we from the SfMC team will demonstrate how we helped our customer implement a Data Warehouse PoC using the ELT (Extract, Load, Transform) approach. For this PoC, we used sample data from SQL Server's well-known sample databases, giving us a straightforward and accessible way to illustrate the capabilities of Microsoft Fabric in a real-world data warehousing scenario.

The customer requirements were:
Build an enterprise-grade Data Warehouse solution in Microsoft Fabric.
Ingest data from 59 diverse sources across 130 different countries.
Source data from ADLS (JSON).
Source data from Azure and on-premises SQL Server databases.
Other data sources.

PoC scope:
ELT approach: ingest data (OLTP DB and ADLS Gen2) directly into the Warehouse.
Transformation using stored procedures to load the production schema.

Solution summary:
Ingest OLTP data from the source SQL Server (full load).
Metadata-driven copy activity to load data into the staging schema of the Data Warehouse.
Stored procedures to process the staging data, along with data from ADLS shortcuts, and load the production dimension tables.
The application's customer data arrives in ADLS storage as JSON files; stored procedures process this data and load the Customer dimension and fact data using incremental loads.
Built 6 dimensions and 1 fact table.
Build reports.

Sample overview of items that were implemented:

Sample report from the PoC:

Prerequisites:
To implement this PoC, you will need to download the required files to a local drive (e.g., C:\temp). We assume that you already have a Fabric tenant set up with a Fabric capacity (you will need an F64 capacity to test the Copilot feature; otherwise, an F2 capacity will be sufficient). If Fabric is not enabled, use the link below to enable Fabric for your organization.
Step by Step Guide to Enable Microsoft Fabric for Microsoft 365 Developer Account

You will also need:
A logical SQL Server (how to deploy one, how to add your local IP address to the firewall exceptions).
An ADLS Gen2 account (how to deploy one, how to create a container).

Once the account is created, navigate to the "Data Storage" section, create a container, and name it levelup (or a name of your choice). Open the container "levelup" and create a folder called "JSON_FILES". Install Azure Storage Explorer, configure it to connect to your Azure subscription, and navigate to the storage account container. Upload the five JSON files from the downloaded "JSON_FILES" folder to ADLS Gen2, into the "levelup" container under the "JSON_FILES" folder. Then upload the following folders by choosing "Upload Folder" in Azure Storage Explorer:
Sales_SalesOrderDetail
Sales_SalesOrderHeader

These two folders contain Delta files that will be used to create shortcuts in the Lakehouse, which in turn will be used to build tables in the Warehouse in subsequent tasks. After the upload, you should have the folders below inside your levelup container.

Next, create a database on the Azure SQL Server using the bacpac files downloaded earlier:
Connect to the Azure SQL logical server using SSMS (if you don't have it, you can download it here).
Right-click on Databases and select "Import Data-Tier Application".
Follow the screen captures below to complete the bacpac import.

Note: Depending on the DTU/vCore tier chosen, the import can take up to 30 minutes.

Continue to the next blog.

A Practical Guide to Implement End-to-End Data Warehousing PoC Using Microsoft Fabric - Part 2
In this continuation (Part 2) of the previous blog, you will create the required Fabric items within a workspace and create the tables in the Lakehouse and Data Warehouse.

Task 1: Create a Microsoft Fabric workspace, Data Warehouse, and Lakehouse

Before you can start building the solution, you first need to create a workspace in which you will create the rest of the items. Follow the steps below:
Sign in to https://app.fabric.microsoft.com/
Select Workspaces > New Workspace.
Fill out the Create a workspace dialog. Name: LevelUpFabric1 (or a name of your choice; the name must be unique).
Expand the Advanced section, and under License Mode choose Trial or Fabric Capacity. Click Apply. The workspace will be created and opened.
In the upper left corner, select New Item, search for and click Warehouse, and name it "DW_Levelup". This creates the Data Warehouse and opens it.
Click the workspace icon in the left navigation to return to the workspace page.
In the upper left corner of the workspace, select New Item, search for and click Lakehouse, and name it "LH_Levelup". This creates the Lakehouse and opens it.

Task 2: Create table shortcuts in the Lakehouse from ADLS Gen2

We will now create shortcuts in the Lakehouse pointing to the two Delta folders (Sales_SalesOrderHeader and Sales_SalesOrderDetail) that you uploaded to the ADLS Gen2 storage account in the prerequisite steps.
In the Lakehouse explorer, hover over the Tables folder (not the Files folder), click the ellipsis, and click "New Shortcut".
In the New Shortcut dialog box, select ADLS Gen2 under External sources.
In the next screen, provide the required details as shown in the snip below. You may refer to your storage account settings for the details. Using an account key for authentication is discouraged, but it is done here for ease of use.
The URL should be similar to: https://xxxxx.dfs.core.windows.net/levelup

Ensure the shortcut names "Sales_SalesOrderDetail" and "Sales_SalesOrderHeader" match the Delta folder names.

Important: Please use Sales_SalesOrderHeader and Sales_SalesOrderDetail as the shortcut names. These names are used in the stored procedures; if you choose different shortcut names, you will need to update the stored procedures that refer to them to avoid errors.

The shortcuts should show up within the Tables section of the Lakehouse as shown below. We will use these shortcuts later in the solution. You may have to refresh the Tables section for the shortcuts to appear (also refresh if the shortcuts appear under "Unidentified"; if they remain unidentified, one of the steps above was likely not followed correctly).

Task 3: Create the required tables and stored procedures in the Fabric Data Warehouse

Now go to the Fabric workspace "LevelUpFabric1", navigate to the Data Warehouse, open a new query window, and run the scripts below.

Note: These scripts are available in the "TSQL_Script" folder inside the "FabricLevelUp_Files" folder that was downloaded during the prerequisite steps.

01_create_AllStgDimTbls.sql: Creates the staging (stg) schema and all the staging tables.
02_TablesToCopy.sql: Creates the metadata tables that drive the dynamic full load activity.
03_del_AllRecordsOfStgDimTbls.sql: Creates a stored procedure to delete all records from the staging dim tables.
04_crt_Prd_Sch_Tbl_Sp.sql: Creates the production (prd) schema, the fact table, and a stored procedure for incremental updates to the fact table.
05_crt_Prd_Dim_Tbls.sql: Creates a stored procedure to create all the production dim tables.
06_upd_prd_dim_Customer.sql: Creates a stored procedure to apply incremental stg.DimCustomer records to the prd.DimCustomer table.

So far, we have created the workspace and all the required items in the Lakehouse and Data Warehouse that will be used to load the Data Warehouse.
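To make the metadata-driven full load pattern concrete: the pipeline's copy activity reads rows from the metadata table created by 02_TablesToCopy.sql and performs a full load per row. The sketch below illustrates the idea in Python; the row shape and the generated T-SQL are illustrative assumptions, not the PoC's actual metadata schema or pipeline definition.

```python
# Illustrative sketch of a metadata-driven full load. The metadata rows and
# generated T-SQL are assumptions for illustration, not the PoC's actual
# TablesToCopy schema or pipeline definition.
tables_to_copy = [
    {"source_schema": "Sales",  "source_table": "Customer", "target_schema": "stg"},
    {"source_schema": "Sales",  "source_table": "Store",    "target_schema": "stg"},
    {"source_schema": "Person", "source_table": "Person",   "target_schema": "stg"},
]

def full_load_statements(rows):
    """Generate a truncate-and-insert statement pair per metadata row."""
    stmts = []
    for r in rows:
        target = f'{r["target_schema"]}.{r["source_table"]}'
        source = f'{r["source_schema"]}.{r["source_table"]}'
        stmts.append(f"TRUNCATE TABLE {target};")
        stmts.append(f"INSERT INTO {target} SELECT * FROM {source};")
    return stmts

for s in full_load_statements(tables_to_copy):
    print(s)
```

In the actual PoC this looping is done by the pipeline's ForEach/copy activity over the metadata table; the point is that adding a new source table only requires adding a metadata row, not editing the pipeline.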
Continue to the next blog...

Create an Organizational Assets Library (including Multi-Geo & Information Barriers guidance)
Overview

This guide walks through a practical approach to setting up SharePoint Online (SPO) Organizational Assets Libraries (OAL). It includes optional guidance for more complex tenants, such as Multi-Geo and Information Barriers (IB), because those scenarios are often under-documented.

What you'll accomplish: Create and register Organizational Assets Libraries so templates, fonts, and brand images are available in Office apps, with notes for Multi-Geo, Information Barriers, Brand Center, and Copilot integration where applicable.

Applies to: Standard (single-geo) tenants, Multi-Geo tenants, tenants with Information Barriers, and environments using Brand Center and/or Copilot features for organizational assets.

Quick start (standard single-geo tenant)
1. Create a SharePoint site to host Organizational Assets Libraries (often the Brand Center site).
2. Create three document libraries (typical): ImageAssets, DocumentAssets (templates), FontAssets.
3. Grant your intended audience Read access (commonly Everyone except external users via the site's Visitors group).
4. Enable the SharePoint Online Public CDN (tenant setting).
5. Add a Public CDN origin for each library path (one origin per library).
6. Upload approved assets (images, templates, fonts) into their respective libraries.
7. Register each library with Add-SPOOrgAssetsLibrary (repeat per library).
8. Validate registration and the end-user experience, then allow up to 24 hours for Office apps to reflect changes.

If you're Multi-Geo or using Information Barriers: follow the same flow, but repeat per geo and complete registration while the site is in Open IB mode (details below).

Key constraints and gotchas
Multi-Geo: plan a repeatable per-geo pattern (typically one Org Assets site + matching libraries per geo) and keep naming consistent.
Information Barriers (IB): Add-SPOOrgAssetsLibrary cannot be run when the target site is segmented; create and register libraries first (site in Open mode), then segment if needed.
The "Everyone except external users" principal may be hidden by default, but it's still commonly used for broad read access.
Brand Center: many orgs host Org Assets Libraries in the Brand Center site; if Brand Center is created after libraries exist, it typically detects and uses them automatically.
A public CDN must be enabled to support Organizational Assets Libraries.

Implementation steps

Prerequisites: SharePoint Online Management Shell access (or equivalent), permission to manage tenant settings, and the ability to create sites and libraries in each geo.

Create a site to host your Organizational Assets Libraries (many orgs use a communication site). For ease of support, keep the site name, library names, and structure consistent over time.

Note: A Communication site is recommended, but a Team site can also work.

Example site URLs: in a standard tenant you'll have one site; in Multi-Geo you'll typically use one per geo.
Primary geo: https://contoso.sharepoint.com/sites/BrandCenter
EUR geo: https://contosoEUR.sharepoint.com/sites/BrandCenter
APC geo: https://contosoAPC.sharepoint.com/sites/BrandCenter

If your tenant uses Information Barriers, keep each site in Open IB mode while creating the Org Assets Libraries. You can segment the site later (if required) after the libraries are created.

Configure a public CDN (required)

To use Brand Center and Organizational Assets Libraries, configure SharePoint Online to use a Public CDN.
Set-SPOTenantCdnEnabled -CdnType Public -Enable $true

Example output:
Public CDN enabled locations:
SITES/BRANDCENTER/FONTS
*/MASTERPAGE (configuration pending)
*/STYLE LIBRARY (configuration pending)
*/CLIENTSIDEASSETS (configuration pending)

Note: The new CDN will be in a pending state until provisioning is complete; this takes some time. Wait for the CDN to finish provisioning, re-running the status/list commands until the "pending" entries clear.

Get-SPOTenantCdnEnabled -CdnType Public
Get-SPOTenantCdnOrigins -CdnType Public

Add CDN origins for each library

Add allowed CDN origins for each asset library path (typically one origin per library). Example:

Add-SPOTenantCdnOrigin -OriginUrl sites/BrandCenter/ImageAssets -CdnType Public
Add-SPOTenantCdnOrigin -OriginUrl sites/BrandCenter/DocumentAssets -CdnType Public
Add-SPOTenantCdnOrigin -OriginUrl sites/BrandCenter/FontAssets -CdnType Public

Set permissions (required for broad consumption)

To ensure most users can consume the assets, grant Everyone except external users (often abbreviated EEEU) Read access, commonly via the site's Visitors group. Example: add Everyone except external users to the Visitors group of the Organizational Assets site.

Connect-SPOService -Url 'https://contoso-admin.sharepoint.com'
$tenant = "9cfc42cb-51da-4055-87e9-b20a170b6ba3"
$site = Get-SPOSite -Identity "https://contoso.sharepoint.com/sites/BrandCenter"
$group = Get-SPOSiteGroup $site -Group "BrandCenter Visitors"
Add-SPOUser -LoginName ("c:0-.f|rolemanager|spo-grid-all-users/" + $tenant) -Site $site -Group $group.Title

Note: Organizational Assets Libraries respect SharePoint security trimming. If you need a narrower audience, grant Read to the appropriate groups instead of tenant-wide access. In many environments, Everyone except external users is required during registration (Add-SPOOrgAssetsLibrary) so Office can enumerate the library; test and confirm in your tenant before removing broad access.
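The login string passed to Add-SPOUser above is the SharePoint claim for "Everyone except external users", built from your tenant GUID. If you script this across tenants or geos, a small helper keeps the format consistent (an illustrative sketch, not an official API):

```python
def eeeu_login(tenant_id: str) -> str:
    """Build the SPO claim string for 'Everyone except external users' (EEEU).

    Format: c:0-.f|rolemanager|spo-grid-all-users/<tenant GUID>, as passed to
    Add-SPOUser -LoginName in the PowerShell example above.
    """
    return f"c:0-.f|rolemanager|spo-grid-all-users/{tenant_id}"

print(eeeu_login("9cfc42cb-51da-4055-87e9-b20a170b6ba3"))
# → c:0-.f|rolemanager|spo-grid-all-users/9cfc42cb-51da-4055-87e9-b20a170b6ba3
```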
Create libraries and upload assets

Create a document library for each asset type you plan to publish (for example: images, Office templates, fonts), and upload your assets into the appropriate libraries.

Register each library using Add-SPOOrgAssetsLibrary. For this to work, Everyone except external users must already have access to the site (for example, via the Visitors group).

Office template library example:
Add-SPOOrgAssetsLibrary -LibraryUrl 'https://contoso.sharepoint.com/sites/BrandCenter/DocumentAssets' -OrgAssetType OfficeTemplateLibrary

Image document library example:
Add-SPOOrgAssetsLibrary -LibraryUrl 'https://contoso.sharepoint.com/sites/BrandCenter/ImageAssets' -OrgAssetType ImageDocumentLibrary

Font document library example:
Add-SPOOrgAssetsLibrary -LibraryUrl 'https://contoso.sharepoint.com/sites/BrandCenter/FontAssets' -OrgAssetType OfficeFontLibrary -CdnType Public

Optional: enable Copilot support for an image library (only applicable to ImageDocumentLibrary):
Set-SPOOrgAssetsLibrary -LibraryUrl 'https://contoso.sharepoint.com/sites/BrandCenter/ImageAssets' -OrgAssetType ImageDocumentLibrary -CopilotSearchable $true

Multi-Geo mini runbook (recommended pattern)

Use this as a simple tracking sheet so each geo ends up with a complete, consistent setup.

Geo: Primary | Site URL: https://<tenant>.sharepoint.com/sites/<BrandCenterOrAssetsSite> | Libraries: ImageAssets / DocumentAssets / FontAssets | CDN origins added: Yes/No | Libraries registered: Yes/No
Geo: EUR | Site URL: https://<tenant>EUR.sharepoint.com/sites/<BrandCenterOrAssetsSite> | Libraries: ImageAssets / DocumentAssets / FontAssets | CDN origins added: Yes/No | Libraries registered: Yes/No
Geo: APC | Site URL: https://<tenant>APC.sharepoint.com/sites/<BrandCenterOrAssetsSite> | Libraries: ImageAssets / DocumentAssets / FontAssets | CDN origins added: Yes/No | Libraries registered: Yes/No

Naming standard (strongly recommended): keep the same site path and the same library names in every geo (for example, always ImageAssets, DocumentAssets, FontAssets). This minimizes per-geo scripting differences and reduces support effort.
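Because the naming standard keeps every geo identical, the registration step can be scripted once and repeated per geo. The Python sketch below generates the Add-SPOOrgAssetsLibrary command lines from a simple runbook list; the tenant host names, site path, and library-to-type mapping are placeholders you would substitute for your tenant.

```python
# Generate per-geo registration commands from the Multi-Geo runbook.
# Host names, site path, and library names are placeholders, not real URLs.
GEO_HOSTS = {
    "Primary": "contoso.sharepoint.com",
    "EUR": "contosoEUR.sharepoint.com",
    "APC": "contosoAPC.sharepoint.com",
}
LIBRARIES = {
    "ImageAssets": "ImageDocumentLibrary",
    "DocumentAssets": "OfficeTemplateLibrary",
    "FontAssets": "OfficeFontLibrary",
}

def registration_commands(site_path="sites/BrandCenter"):
    """Emit one Add-SPOOrgAssetsLibrary command per geo and library."""
    cmds = []
    for geo, host in GEO_HOSTS.items():
        for lib, asset_type in LIBRARIES.items():
            url = f"https://{host}/{site_path}/{lib}"
            cmd = f"Add-SPOOrgAssetsLibrary -LibraryUrl '{url}' -OrgAssetType {asset_type}"
            if asset_type == "OfficeFontLibrary":
                # Font libraries are registered with -CdnType Public, as in
                # the example earlier in this guide.
                cmd += " -CdnType Public"
            cmds.append(cmd)
    return cmds

for c in registration_commands():
    print(c)
```

Paste the emitted lines into a SharePoint Online Management Shell session (connected as an admin for the relevant geo); keeping the generator as the single source of truth avoids per-geo drift.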
Wrap-up

At this point, each geo should have its own site, libraries, CDN origins, and registered Organizational Assets Libraries. From here, focus on governance (who can publish/approve assets), naming standards, and ongoing lifecycle management (retire old templates/fonts and keep branding current).

Validate configuration

Admin checks (PowerShell):
Confirm the Public CDN is enabled.
Confirm CDN origins include one entry per assets library path.
List registered Org Assets Libraries and verify each URL + type is present.

Get-SPOTenantCdnEnabled -CdnType Public
Get-SPOTenantCdnOrigins -CdnType Public
Get-SPOOrgAssetsLibrary

End-user checks (Office apps):
In PowerPoint/Word, confirm organizational templates appear in the template picker (if you registered an OfficeTemplateLibrary).
In Office font lists, confirm your org fonts appear (if you registered an OfficeFontLibrary).
For image libraries, confirm approved brand images appear in supported pickers; if you enabled -CopilotSearchable, confirm images are discoverable as expected.

Timing: New registrations and updates can take up to 24 hours to appear in Office apps. If you updated content, run Set-SPOOrgAssetsLibrary for each changed library, then wait for propagation.

Updating content in existing Org Assets Libraries

If you already have Organizational Assets Libraries registered and need to publish updated templates, fonts, or images, use the process below. The high-level flow is: update content → run Set-SPOOrgAssetsLibrary (per library) → wait for propagation.

1. Replace or update content in each library. Upload the new versions of templates/fonts/images into the appropriate library (and remove/retire older versions if needed).
2. If Multi-Geo applies, repeat per geo. Update the matching libraries in each geo's site so users in each geo get the same (or intentionally regional) set of assets.
3. Run Set-SPOOrgAssetsLibrary for each updated library.
Execute the cmdlet against the library URL to refresh the configuration after content changes (run it once per library you updated). Then wait for Office app propagation: allow up to 24 hours for updates to begin showing in Office apps.

Example:
Set-SPOOrgAssetsLibrary -LibraryUrl 'https://contoso.sharepoint.com/sites/BrandCenter/DocumentAssets' -OrgAssetType OfficeTemplateLibrary

Notes:
If your site is segmented by Information Barriers, confirm the cmdlet behavior in your environment before making changes, and prefer performing registration/updates while the site is in Open mode when possible.
For image libraries, if you are using Copilot integration settings (for example -CopilotSearchable), keep the setting consistent when you run Set-SPOOrgAssetsLibrary.
Make sure the intended audience still has Read access to the site/library; otherwise, users may not see updates due to security trimming.

Please note: After registering (or updating) your assets libraries, it can take up to 24 hours before changes become available in Office apps. Once fully enabled, Office apps will surface your templates and fonts. Below is an example.

Example of interacting with Org Assets from M365 apps:
Org fonts from PowerPoint:
From SharePoint:
From Office apps:

Troubleshooting tips
If Add-SPOOrgAssetsLibrary fails, confirm the site is not segmented by Information Barriers (use Open mode during setup).
If assets don't appear in Office apps, wait for propagation (up to 24 hours) and re-check that the library was registered successfully.
If CDN commands show "pending", allow time for provisioning and re-run the status command.
If users can't see assets, verify the site/library permissions include Everyone except external users (or the intended audience group).

Guidance: Using the SharePoint Online Public CDN

Enabling the SharePoint Online Public CDN is a required and supported configuration for Organizational Assets Libraries, Brand Center, and related Office experiences.
While the word "public" can sound concerning, it's important to understand what is (and is not) exposed. We take great care to protect the data that runs your business: data stored in the Microsoft 365 CDN is encrypted both in transit and at rest, and access to data in the Microsoft 365 SharePoint CDN is secured by Microsoft 365 user permissions and token authorization. Requests for data in the Microsoft 365 SharePoint CDN must be referred (redirected) from your Microsoft 365 tenant, or an authorization token won't be generated. See: Content delivery networks - Microsoft 365 Enterprise | Microsoft Learn

What "Public CDN" actually means

Only explicitly approved library paths are cached. The CDN does not expose your entire tenant; administrators must explicitly register CDN origins (specific library paths). If a library is not registered as a CDN origin, it is not served via the CDN.

No new content types are exposed. The CDN is intended for static, non-sensitive assets such as brand images, Office templates, and fonts. It is not designed for documents containing confidential or regulated data.

Why does Microsoft require a Public CDN for Org Assets?

Performance and reliability: Office clients worldwide retrieve assets faster using geographically distributed edge caching. This avoids repeated downloads from SharePoint origin sites.

Consistent Office app experiences: PowerPoint, Word, Excel, and Copilot rely on CDN-backed delivery to surface templates, fonts, and brand images. Without a public CDN, these features may not function correctly, or at all.

Best practices

Use the practices below to keep Organizational Assets Libraries reliable, secure, and easy for end users to adopt. Where relevant, notes call out additional considerations for Multi-Geo, Information Barriers, Brand Center, and Copilot.

Governance and ownership checklist
Owners/publishers: a named group who can add/change assets (limited membership).
Approvals: a defined review/approval step before publishing new templates/fonts/images.
Versioning/retention: how you retire old assets and prevent outdated branding from appearing in pickers.
Rollback plan: how to revert a bad template/font/image quickly.
Change communication: how you notify users about new/updated assets and the expected timing (up to 24 hours).

Assign clear owners (typically Brand/Comms) and a small admin group (typically IT) for each geo's library and site. Decide what is "approved" vs. "draft" content, and enforce it with a simple publishing process (for example, a review checklist or an approvals flow). Version and retire assets deliberately: keep one "current" template set and archive old assets to prevent users from picking outdated branding.

Information architecture and naming
Keep library names and structures consistent across geos (same library names, same folder conventions) to simplify support and documentation.
Use descriptive filenames users can recognize in pickers (for example, "Contoso_Proposal_Template_v3").
Prefer a small number of clearly defined libraries by asset type (images, templates, fonts) rather than many small libraries.

Permissions and access
Ensure your intended audience has at least Read access to the site and libraries; Organizational Assets still follow SharePoint security trimming.
If you use broad access (for example, Everyone except external users), document it and pair it with tight contributor permissions so only approved publishers can change assets.
Avoid breaking inheritance in ways that make troubleshooting difficult; keep permissions simple and predictable whenever possible.

CDN configuration
Plan CDN changes ahead of time: enabling and provisioning can take time, and changes may not be immediate.
Register only the origins you need (one per assets library path) and keep them consistent across environments.
After changes, allow for propagation time before validating in Office apps.
Multi-Geo and Brand Center
Use a repeatable pattern: one site + matching libraries per geo, with the same structure and operational runbook.
Be aware that Brand Center is created in the primary geo; confirm how your org wants to manage global vs. regional assets.
Document which assets are global (shared everywhere) vs. regional (geo-specific) to avoid confusion for publishers and users.

Information Barriers (IB) sequencing
Create and register Org Assets Libraries before segmenting the site when IB is enabled (create while the site is in Open mode, then segment later if required).
After segmentation, re-validate that the right audience can still read the libraries (and that publishers can still manage content).

Copilot readiness (image libraries)
Use consistent, high-quality metadata for images (titles, descriptions, and tags); Copilot search quality depends heavily on this.
If enabling image tagging integration, standardize on a tagging vocabulary (for example: brand terms, campaigns, departments, regions) so results are predictable.
Only enable Copilot-searchable settings on libraries where content is approved and intended for broad reuse.

Q&A

Q: What is an Organizational Assets Library (OAL)?
A: It's a SharePoint document library (or set of libraries) that you register so Office apps can surface approved templates, fonts, and images to users directly within the app experience.

Q: Do I need SharePoint Brand Center to use OAL?
A: No. You can use Organizational Assets Libraries without Brand Center. Brand Center can make asset management more accessible (for example, allowing SharePoint sites to use organizational branding), but OAL can be configured on its own.

Q: Why is a "Public CDN" required, and is it safe?
A: Office experiences rely on CDN-backed delivery for performance and reliability. "Public CDN" does not mean your whole tenant is exposed; only the specific library paths you register as CDN origins are cached.
Access is still governed by Microsoft 365 authentication, token authorization, and SharePoint permissions.

Q: Can I use this guide in a standard (single-geo) tenant?
A: Yes. In a standard tenant you usually create one site and one set of libraries. The Multi-Geo guidance is only needed if your tenant is Multi-Geo (in which case you'll typically repeat the pattern per geo).

Q: How do Information Barriers (IB) affect setup?
A: If a site is segmented, Add-SPOOrgAssetsLibrary cannot register the library. Create the site and register the libraries while the site is in Open mode, then segment afterward if required.

Q: Why does "Everyone except external users" (EEEU) matter?
A: In many environments, EEEU is required during library registration so Office can enumerate the library. However, OAL still respects SharePoint security trimming. If broad internal availability is the goal, a common pattern is to grant EEEU Read (often via the Visitors group) so Office apps can surface the assets to most internal users. If you need a narrower audience, use a group instead.

Q: How long until assets show up (or update) in Office apps?
A: It can take up to 24 hours for new registrations or updates to propagate. If you replaced content in an existing library, run Set-SPOOrgAssetsLibrary for each updated library, then allow time for Office apps to refresh.

Q: How do I update content in an existing Org Assets Library?
A: Replace the files in the library (and repeat across geos if applicable), then run Set-SPOOrgAssetsLibrary against each library you updated. After that, allow up to 24 hours for the updated assets to start showing in Office apps.

Q: Do I need to run Set-SPOOrgAssetsLibrary every time I replace files?
A: If you want Office apps to reliably pick up changes, run Set-SPOOrgAssetsLibrary after you update content (especially when publishing new/updated templates, fonts, or images). Treat it as the "refresh" step, then wait for propagation.
Q: When should I enable Copilot support (CopilotSearchable) for an image library?
A: Enable it only for libraries that contain approved, broadly reusable images and have strong metadata (title/description/tags). This helps ensure search results are on-brand and reduces the chance of surfacing unreviewed content.

Q: Can I undo this later?
A: Yes. You can unregister an Organizational Assets Library using SharePoint Online PowerShell (for example, Remove-SPOOrgAssetsLibrary) and remove CDN origins if you no longer need them. Plan governance so you can retire assets cleanly without disrupting users.

Q: Users can't see the assets (or updates), so what should I check first?
A: Start with (1) permissions to the site/library (security trimming), (2) successful registration via Add-SPOOrgAssetsLibrary, (3) if you're expecting an update, confirm you ran Set-SPOOrgAssetsLibrary for that library, (4) CDN provisioning status and configured origins, and (5) propagation time (up to 24 hours).

Additional Reading
Create an organization assets library - SharePoint in Microsoft 365 | Microsoft Learn
Connect organizational asset libraries to Copilot for an on-brand experience - SharePoint in Microsoft 365 | Microsoft Learn
Connect organizational asset libraries to PowerPoint for an on-brand experience - SharePoint in Microsoft 365 | Microsoft Learn
Set up and connect organizational asset library (OAL) with image tagging to Copilot search | Microsoft Learn
Add-SPOOrgAssetsLibrary (Microsoft.Online.SharePoint.PowerShell) | Microsoft Learn
SharePoint Brand Center - SharePoint in Microsoft 365 | Microsoft Learn
How to Enable Enterprise Brand Images with PowerPoint Copilot - SharePoint in Microsoft 365 | Microsoft Learn
Office 365 Content Delivery Network (CDN) Quickstart - Microsoft 365 Enterprise | Microsoft Learn
Use Office 365 Content Delivery Network (CDN) with SharePoint Online - Microsoft 365 Enterprise | Microsoft Learn
Content delivery networks - Microsoft 365 Enterprise | Microsoft Learn
Multi-Geo Capabilities in OneDrive and SharePoint - Microsoft 365 Enterprise | Microsoft Learn
Use Information Barriers with SharePoint | Microsoft Learn

A Deep Dive into Spark UI for Job Optimization
Key Insights for Spark Job Optimization

The Spark UI is your X-ray into application execution: it provides real-time and post-mortem insights into every job, stage, task, and resource usage, moving you from guesswork to evidence-driven tuning.
Systematic analysis is crucial: start from the high-level overview in the Jobs tab, drill down into Stages for bottlenecks and shuffle operations, examine Tasks for skew and spills, and review Executors for resource allocation issues.
Targeted optimizations yield significant gains: address issues like data skew, excessive shuffles, memory pressure, and inefficient SQL plans with specific techniques such as repartitioning, broadcast joins, Kryo serialization, and proper resource allocation.

Apache Spark is a powerful distributed computing framework, but extracting its full potential often requires meticulous optimization. The Spark UI (User Interface) stands as an indispensable tool, offering a detailed, web-based dashboard that provides real-time and historical insights into your Spark applications. It is the diagnostic center that helps you pinpoint performance bottlenecks, understand resource consumption, and identify inefficiencies that may be hindering your jobs. This comprehensive guide will walk you through the process of accessing, navigating, and interpreting the Spark UI, empowering you to translate its rich data into concrete strategies for optimizing your Spark jobs. As of July 1, 2025, modern Spark versions such as 4.0.0 place significant emphasis on UI-driven performance tuning, making this a critical skill for any data professional.

Accessing and Navigating the Spark UI: Your Diagnostic Gateway

Before diving into optimization, you need to know how to access the Spark UI. Its accessibility varies depending on your Spark deployment mode:
Local Mode: When running Spark locally, the UI is typically available at http://localhost:4040.
Cluster Mode: In cluster environments like YARN, Mesos, or Kubernetes, the UI is usually accessed via the Spark History Server (often at port 18080) for post-mortem analysis, or through the application master's URL while the job is running.
Cloud Platforms: On cloud services such as AWS Glue, Databricks, or EMR, the Spark UI is typically integrated into their respective consoles or accessible by enabling Spark event logging. Ensure event logs are configured to roll over to prevent metrics truncation for long-running jobs.

Once accessed, the Spark UI is structured into several key tabs, each providing a different lens into your application's behavior:
Jobs Tab: High-level overview of all jobs.
Stages Tab: Detailed breakdown of stages within a job.
Tasks Tab: Granular information about individual task execution.
Storage Tab: Insights into cached RDDs and DataFrames.
Environment Tab: Spark configuration and system properties.
Executors Tab: Resource usage of individual executors.
SQL Tab: Specific details for SQL queries and DataFrame operations (if applicable).

Deciphering the Spark UI: A Tab-by-Tab Analysis

An overview of the Jobs tab in the Apache Spark UI, showing job progress and details.

1. The Jobs Tab: Your Application's Pulse Check

The Jobs tab is your initial point of contact for understanding the overall health and progress of your Spark application. It summarizes all submitted jobs, their status (running, completed, failed), duration, and general progress. This tab helps you quickly identify jobs that are stalling, taking excessively long, or have failed outright.

What to look for:
Overall Duration: Identify jobs that exhibit long durations; these are prime candidates for deeper optimization.
Status and Progress: Observe jobs that are stuck or show a high number of failed tasks, indicating potential underlying issues that need immediate attention.
- Event timeline: this visual representation of the application's lifecycle, including job execution and executor activity, can reveal resource contention or uneven parallel execution.

2. The Stages Tab: Unveiling Bottlenecks

Stages are the backbone of a Spark job's execution: each is a sequence of tasks that can run together without shuffling data. The Stages tab provides granular detail about each stage, making it crucial for pinpointing bottlenecks.

[Figure: the Stages tab in the Spark UI, displaying detailed information for each stage of a job.]

Key metrics and analysis:

- Duration: sort stages by duration to find the longest-running ones; these are where optimization effort will likely yield the greatest impact.
- Input/Output sizes: high I/O metrics suggest the stage is I/O-bound, pointing to opportunities in data formats or storage.
- Shuffle Read/Write: critical metrics. High values indicate significant data movement between nodes, a very expensive operation, and often signal inefficient joins, aggregations, or partitioning.
- Task progress and event timeline: in a stage's detail view, the event timeline shows individual task execution. Look for "straggler" tasks that take far longer than their peers, a strong indicator of data skew where certain partitions hold disproportionately more data or require more computation.
- DAG visualization: the Directed Acyclic Graph of RDDs/DataFrames and the operations applied to them simplifies understanding of complex transformations and dependencies.

For example, if a stage shows 3.2 TB of shuffle read and one task processes 400 GB against a median of 25 GB, that immediately highlights a severe data skew issue.

3. The Tasks Tab: Drilling Down to Individual Performance

The Tasks tab offers the most granular view: execution details for individual tasks within a stage. This is where you confirm observations from the Stages tab and identify specific issues such as out-of-memory errors or high garbage-collection times.

Critical data points:

- Executor Run Time: helps identify slow-running tasks.
- GC Time: high garbage-collection time indicates memory pressure and inefficient memory management, suggesting a need to tune memory configuration or data serialization.
- Shuffle Spill (Memory Bytes Spilled / Disk Bytes Spilled): tasks spilling to disk ran out of memory. This is a severe bottleneck, pointing to insufficient executor memory or inefficient data processing.
- Host: sorting the task table by host can reveal skewed executors, where one executor is burdened with far more work due to data imbalance.

4. The SQL Tab: Optimizing Your DataFrames and SQL Queries

For DataFrame and SQL workloads, the SQL tab is invaluable. It lists executed queries with their durations, associated jobs, and, most importantly, their logical and physical execution plans.

Analyzing SQL queries:

- Physical plan: a textual and graphical representation of how the Spark optimizer decided to execute your query. Look for inefficient join strategies (for example, unintended Cartesian joins, or Sort-Merge Joins where a Broadcast Join would be better), missed filter pushdowns, or unnecessary shuffles (shown as "Exchange" operators).
- Graphical visualization: aggregated row and data counts at each step of the query simplify analysis.

By analyzing the physical plan, you can validate whether your DataFrame transformations or SQL queries are being optimized as expected.
For instance, if you have hinted for a broadcast join but the plan shows a Sort-Merge Join with a huge shuffle, you know there is a problem.

5. The Executors Tab: Resource Utilization Deep Dive

This tab details the resources consumed by each executor in your cluster: CPU cores, allocated and used memory, disk usage, and the number of active tasks. It is essential for understanding resource allocation and spotting bottlenecks related to cluster configuration.

Key checks:

- Memory Used vs. Total Memory: identify under- or over-loaded executors; high memory usage combined with disk spills indicates memory pressure.
- CPU cores: verify that allocated cores are efficiently utilized; low utilization may suggest insufficient parallelism or tasks waiting on resources.
- Disk usage: indicates whether tasks perform large I/O operations or spill excessive data to disk.
- Thread dump: inspect the JVM thread dump on each executor for advanced debugging of performance issues.

6. The Storage Tab: Managing Cached Data

If your application uses caching or persistence (via cache() or persist()), the Storage tab details persisted RDDs and DataFrames: their storage levels (memory, disk, or both), sizes, and partition distribution.

Insights from the Storage tab:

- Memory management: ensure cached data is not consuming excessive memory or being spilled to disk unnecessarily.
- Caching strategy: verify that frequently accessed datasets are cached with suitable storage levels to minimize recomputation without causing memory overflows.

7. The Environment Tab: Configuration Validation

This tab displays all Spark configuration properties, JVM settings, and system environment variables. It is the place to confirm your application is running with the intended configuration.
Key usage:

- Configuration validation: double-check that critical settings such as spark.executor.memory, spark.executor.cores, spark.sql.shuffle.partitions, and spark.serializer are set correctly. Misconfigurations can severely impact performance.

Translating UI Insights into Optimization Strategies

Once you have analyzed the Spark UI and identified specific bottlenecks, you can apply targeted optimization techniques. This shift from guess-and-check to evidence-driven tuning can significantly improve job runtimes and reduce costs.

1. Addressing Data Skew

Detection: long "straggler" tasks in the stage event timeline, uneven partition sizes, or highly skewed Shuffle Read/Write metrics in the Stages tab.

Optimization:

- Repartitioning: use repartition(N) or repartitionByRange(N, column) to distribute data more evenly across partitions, for instance df = df.repartitionByRange(800, "customer_id") for a skewed customer_id key.
- Salting: for highly skewed join keys, add a random prefix (salt) to the key before joining, then remove it afterward.
- Adaptive Query Execution (AQE): enable spark.sql.adaptive.enabled=true and spark.sql.adaptive.skewJoin.enabled=true (both on by default since Spark 3.2) so Spark can dynamically detect and mitigate data skew during shuffle operations.

2. Optimizing Shuffles

Detection: high Shuffle Read and Shuffle Write metrics in the Stages tab, indicating excessive data movement.

Optimization:

- Filter early: push down filters and projections as early as possible to reduce the amount of data processed and shuffled.
- Broadcast joins: for small tables (typically under spark.sql.autoBroadcastJoinThreshold, default 10 MB), use the broadcast(df) hint or raise the threshold to avoid shuffling the smaller table.
- Adjust shuffle partitions: configure spark.sql.shuffle.partitions appropriately.
A common rule of thumb is 2-4 times the total number of executor cores, sized so each partition lands around 100-200 MB to avoid OOM errors and small-file overhead.

- Coalesce: use coalesce() to reduce the number of partitions without triggering a full shuffle, when data size allows.

3. Memory Management and Garbage Collection

Detection: high Shuffle Spill (memory/disk bytes spilled) in the Tasks tab, out-of-memory errors, or significant GC Time in the Executors tab or task details.

Optimization:

- Executor memory: increase spark.executor.memory if tasks are spilling to disk.
- Memory fractions: adjust spark.memory.fraction and spark.memory.storageFraction to allocate more memory for execution or caching.
- Serialization: use Kryo (spark.serializer=org.apache.spark.serializer.KryoSerializer) for faster, more compact serialization, reducing memory footprint and network I/O.
- Caching: cache only DataFrames that are reused multiple times, with appropriate storage levels (e.g. MEMORY_AND_DISK), and unpersist data promptly when no longer needed.
- GC tuning: for large heaps, consider tuning the JVM garbage collector (often G1GC) to minimize pauses. High GC time (e.g. above 15% of task time) suggests too many small objects.

4. Resource Allocation and Parallelism

Detection: underutilized executors (low CPU usage, many idle cores), tasks waiting for resources in the Jobs/Executors tabs, or dynamic allocation adding and removing executors frequently.

Optimization:

- Executor cores/memory: adjust spark.executor.cores and spark.executor.memory to match your cluster's capacity and workload, with enough executors for the desired parallelism.
- Default parallelism: set spark.default.parallelism to provide sufficient concurrent tasks, ideally 2-4 times the total number of CPU cores in the cluster.

5. SQL Query and DataFrame Optimization

Detection: inefficient physical plans in the SQL tab, long-running SQL queries, or unnecessary "Exchange" operations.

Optimization:

- Predicate pushdown: apply filters as early as possible (ideally directly in the data-source read) to reduce the amount of data processed.
- Join order and strategy: reorder joins to place selective filters and smaller tables first, and use join hints (BROADCAST, SHUFFLE_HASH) where appropriate.
- Column pruning: select only the columns you need, avoiding full-width scans.
- Bucketing and partitioning: for frequently joined or filtered columns, consider bucketing and partitioning your data to speed up joins and aggregations.

[Figure: bar chart quantifying common Spark performance bottlenecks on a 0-10 impact scale; higher scores indicate more significant degradation. These high-impact areas help prioritize optimization efforts.]

A Practical Example: Tackling Data Skew with the UI

Imagine a PySpark ETL job that takes 48 minutes to complete. A glance at the Jobs tab shows that Job 3 accounts for 42 of those minutes. Drilling in, the Stages tab reveals Stage 19 as the culprit: 38 minutes and 3.2 TB of shuffle read. The stage's event timeline immediately highlights a straggler task on a specific host (e.g. ip-10-0-4-11) that processed an anomalous 400 GB of data, against a median of 25 GB for the other tasks in the stage. This disparity is a classic symptom of data skew, likely caused by a highly skewed key such as customer_id.

The fix: based on this evidence, repartition by the skewed key, df = df.repartitionByRange(800, "customer_id"), potentially combined with salting if the skew is severe.
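The salting idea mentioned above can be illustrated with a small pure-Python sketch. This is not Spark code: the key names and bucket count are hypothetical, and in a real job you would add the salt as an extra column with withColumn, join on the salted key, and drop the salt afterward.

```python
import random

# Fixed seed so the sketch is deterministic.
_rng = random.Random(42)

def salt_key(key, buckets=8, rng=_rng):
    """Spread a hot key across `buckets` synthetic sub-keys."""
    return f"{key}#{rng.randrange(buckets)}"

# A skewed workload: one customer dominates the key distribution.
keys = ["cust_42"] * 1000 + ["cust_7"] * 10
salted = [salt_key(k) for k in keys]

# The hot key now maps to several sub-keys instead of a single partition.
hot_subkeys = {k for k in salted if k.startswith("cust_42#")}
print(len(hot_subkeys))  # -> 8
```

With the key spread over 8 sub-keys, the 1000 hot rows can be processed by 8 tasks instead of one straggler; the trade-off is that the other side of the join must be replicated once per bucket.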
After redeploying, the Spark UI confirms the success: Stage 19's runtime drops to 6 minutes, the total job to 12 minutes, and, crucially, there is no disk spill and GC time is under 3%. This example underscores how the Spark UI provides the exact evidence needed to diagnose issues and validate the effectiveness of applied optimizations.

Optimizing for the Future: Best Practices and Continuous Improvement

Effective use of the Spark UI is not a one-time activity; it is an ongoing process of continuous optimization.

Table of Common Symptoms and Proven Fixes

| Symptom in UI | Root Cause | What to Change / Fix |
|---|---|---|
| Few very long tasks; wide idle band at end of stage (stragglers) | Too few partitions or severe data skew | repartition(N) or repartitionByRange; for skew: salting, skew join hint, enable AQE skew mitigation |
| Shuffle spill: "Disk Bytes Spilled" > 0 | Insufficient executor memory | Raise spark.executor.memory / spark.memory.fraction, use Kryo serialization, filter earlier |
| Stage uses SortMergeJoin with huge shuffle where BroadcastJoin was expected | Broadcast join not chosen or threshold too low | broadcast(df) hint or configure spark.sql.autoBroadcastJoinThreshold |
| GC Time > 15% of task time | Too many small objects, inefficient memory usage | cache() only necessary data, use Dataset encoders or the vectorized Parquet reader, increase executor heap but watch the GC algorithm |
| Executors idle in timeline; dynamic allocation frequently adds/removes | Slots exceed parallelism; poor partitioning for the workload | Lower spark.sql.shuffle.partitions, coalesce downstream if appropriate, adjust spark.default.parallelism |
| SQL plan shows multiple "Exchange" operators stacking | Unnecessary repartitions (e.g. narrow-wide-narrow pattern) | Use colocated sort-merge join hints, reuse partitioning columns, analyze query logic for redundant shuffles |
| High I/O metrics in Stages tab (e.g. large input size without sufficient processing) | Inefficient data format, full table scans, or lack of predicate pushdown | Optimize data formats (e.g. Parquet with snappy compression), apply filters/projections early, leverage partitioning/bucketing in the source data |
| Application fails with OutOfMemoryError (OOM) on driver or executor | Insufficient driver/executor memory for the data or operations | Increase spark.driver.memory or spark.executor.memory; reduce partition size or count; enable off-heap memory if applicable |

This table summarizes common symptoms observed in the Spark UI, their root causes, and corresponding solutions. It serves as a quick reference for targeted optimization efforts.

Visualization of Spark UI Concepts

[Figure: mindmap organizing the key concepts of Spark UI analysis and job optimization: accessing the UI, its tabs, specific optimization strategies, and best practices for continuous improvement.]

Conclusion

Analyzing the Spark UI is both an art and a science, offering an unparalleled view into the inner workings of your Spark applications. By systematically navigating its tabs (Jobs, Stages, Tasks, SQL, Executors, Storage, and Environment) you can gather crucial evidence to diagnose performance issues such as data skew, excessive shuffles, memory pressure, and inefficient resource allocation. This evidence-driven approach lets you implement targeted optimizations, whether through repartitioning data, adjusting memory configurations, fine-tuning SQL queries, or rebalancing resources. Mastering the Spark UI makes you a more effective Spark developer and keeps your big-data pipelines running efficiently, leading to significant reductions in execution time and operational costs. Continuous monitoring and iterative, UI-driven optimization are the keys to maintaining robust, performant Spark applications in production.

Frequently Asked Questions (FAQ)

What is the primary purpose of the Spark UI?
The Spark UI is a web-based interface for monitoring, debugging, and optimizing Spark applications, providing real-time and historical insight into job execution, resource utilization, and performance bottlenecks.

How can I access the Spark UI in a cluster environment?

In a cluster environment, the Spark UI is typically reached via the Spark History Server (often running on port 18080) for completed jobs, or through the application master's URL while the job is still active. Cloud platforms such as AWS Glue or Databricks usually provide direct links in their consoles.

What does "Shuffle Read/Write" indicate in the Spark UI?

These metrics in the Stages tab measure the amount of data transferred between executors over the network during shuffle operations. High values point to expensive data redistribution, typically caused by wide transformations such as joins or aggregations.

How do "straggler" tasks relate to data skew?

Stragglers are individual tasks within a stage that take significantly longer to complete than their peers. They are a primary indicator of data skew, where certain partitions carry disproportionately more data or computation, producing uneven work distribution across executors.

What are some immediate actions to take if the Spark UI shows high "Shuffle Spill"?

High shuffle spill (data written to disk due to memory limitations) means executors are running out of memory. Immediate actions include increasing spark.executor.memory, optimizing data serialization (e.g. with Kryo), or filtering data earlier to reduce the memory footprint.
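The straggler heuristic used throughout this article (a task far above its stage's median) can be sketched in a few lines of pure Python. This is an illustration of the rule of thumb applied to a list of per-task metrics, not Spark code; the 5x threshold is an assumption you should tune to your workload.

```python
from statistics import median

def find_stragglers(task_metrics, threshold=5.0):
    """Return (index, value) pairs for tasks exceeding threshold x the median.

    In the Spark UI this corresponds to the long bars in a stage's event
    timeline; the unit (seconds, GB) just has to be consistent.
    """
    m = median(task_metrics)
    return [(i, v) for i, v in enumerate(task_metrics) if v > threshold * m]

# Mirroring the article's example: median ~25 GB per task, one task at 400 GB.
per_task_gb = [24, 25, 26, 23, 400, 25, 27, 24]
stragglers = find_stragglers(per_task_gb)
print(stragglers)  # -> [(4, 400)]
```

A hit from this check is exactly the evidence that motivates repartitioning or salting the offending key.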
Diagnose performance issues in Spark jobs through Spark UI
Agenda

- Introduction
- Overview of Spark UI
- Navigating to Spark UI
- Jobs Timeline
- Failing Jobs or Executors
- Scenario: Memory Issues
- Scenario: Long-Running Jobs
- Scenario: Identifying the Longest Stage

Introduction

This guide walks you through how to use the Spark UI to diagnose performance issues in your jobs.

Overview of Spark UI

- A job is composed of multiple stages, and a stage may contain more than one task.
- Tasks are distributed across executors.

Navigating to Spark UI

- Navigate to your cluster's page.
- Click Spark UI.

Jobs Timeline

The jobs timeline is a great starting point for understanding your pipeline or query. It gives you an overview of what was running, how long each step took, and whether there were any failures along the way.

Opening the Jobs Timeline

- In the Spark UI, click the Jobs tab.
- Click Event Timeline (highlighted in red in the screenshot).
- The example timeline shows the driver and executor 0 being added.

Failing Jobs or Executors

- A failed job is indicated by a red status in the event timeline.
- Removed executors are also shown with a red status in the event timeline.

Common reasons for executors being removed:

- Autoscaling: expected behavior, not an error. See "Enable autoscaling" in Compute configuration reference - Azure Databricks | Microsoft Learn.
- Spot instance losses: the cloud provider reclaiming your VMs.
- Executors running out of memory.

Diagnosing Failing Jobs

- Click the failing job to access its page.
- Scroll down to see the failed stage and the failure reason.
- If you get a generic error, click the link in the description to see whether more information is available.
- Scroll down the page to see why each task failed; in this example, a memory issue was the cause.

Scenario: Spot Instances and Autoscaling

Diagnosing Failing Executors

- Check the event log for any explanation of executor failures, such as cluster resizing or spot instance loss (the cloud provider may reclaim spot instances).
- In the Spark UI, click the Executors tab.
- From there you can get the logs from the failed executors.

Scenario: Memory Issues

Memory issues are a common cause of problems and require thorough investigation. Check both the quality of your code (a potential source of memory issues) and data quality, since poorly organized data can drive up memory usage. See Spark memory issues - Azure Databricks | Microsoft Learn.

Scenario: Identifying the Longest Stage

- Scroll to the bottom of the job's page, locate the list of stages, and order them by duration.
- Review the stage's high-level I/O details: Input, Output, Shuffle Read, and Shuffle Write.
- Check the number of tasks in the long stage; this helps pinpoint the issue.
- If the stage has more than one task, click the link in the stage's description to get more information about the longest stage.

Conclusion

- Data skew can impact performance and cause uneven distribution of data.
- Incorrect spelling in data can affect processing; ensure data accuracy for optimal performance.
- Learn more: Skew and spill - Azure Databricks | Microsoft Learn.

Reducing SQL Connection Latency for Apps Using Azure AAD Authentication
Challenge: connection latency and token overhead

Consider a cloud-native application deployed in Azure App Service or Azure Kubernetes Service (AKS) that needs to query an Azure SQL Database for real-time data. The application uses Azure Active Directory (AAD) for secure authentication, but every time it establishes a new connection to the database, it requests a new AAD token. In high-traffic environments processing thousands of requests per second, this repetitive token issuance introduces latency and performance degradation. The delay is particularly problematic for time-sensitive applications where every millisecond counts: each token request impacts response times and creates unnecessary resource consumption.

Solution: token caching and expiration management

To mitigate these delays, we can optimize the authentication process by caching the AAD token and reusing it for the duration of its validity (typically 1 to 24 hours). Instead of requesting a new token for every database connection, a token is fetched only when the existing one is near expiration. This eliminates the repeated authentication overhead and lets the application maintain seamless connectivity without the performance hit of generating a new token per request. It also reduces the number of HTTP calls made to the Azure Active Directory service, resulting in better resource utilization and lower operational costs.

Concrete performance gains: optimized SQL client connection

As part of the mitigation, we provide a custom implementation that uses SqlClient, a supported library, to optimize connection time. The test was conducted against an S0 database: using a single process with connection pooling, we opened a connection, executed SELECT 1, and closed the connection.
During the testing phase, with the connection pooling script running for 96 hours (with the AAD token cache in place), the following results were observed:

- 1,144 connections (99.05% of the total) took less than 1 second.
- 10 connections (0.866%) took 1 second.
- 1 connection (0.0866%) took 4 seconds.
- All executions of SELECT 1 completed in 0 seconds.

These results demonstrate how caching and reusing AAD tokens effectively reduced connection overhead and improved performance. No connection exceeded 5 seconds, whereas with the default behavior connections could take 30 seconds or more, depending on environment complexity.

Step-by-step implementation

Here is a step-by-step guide to implementing this solution using C# and the Microsoft.Data.SqlClient package:

1. Obtain and cache a token. Instead of requesting a new AAD token with every connection, obtain a token once and cache it. Azure Managed Identity authenticates the application, eliminating the need to repeatedly authenticate with Azure Active Directory for every database connection; the token is fetched once and stored securely for reuse.
2. Renew the token only when it is near expiry. The application checks the token's expiration time before using it. If the token is still valid, it is reused; if it is close to expiration, a new one is fetched.
3. Reuse a single token across multiple connections. The cached token serves multiple database connections during its lifetime; the same token is used for all connections until it is about to expire.
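The three steps above can be sketched language-agnostically in a few lines of Python before committing to a full implementation. Here fake_fetch is a hypothetical stand-in for the real Managed Identity token call, and the 5-minute buffer matches the approach described in this article.

```python
from datetime import datetime, timedelta, timezone

class TokenCache:
    """Minimal sketch of the caching logic: fetch once, reuse until near expiry."""

    def __init__(self, fetch, buffer_minutes=5):
        self._fetch = fetch                      # callable returning (token, expires_at)
        self._buffer = timedelta(minutes=buffer_minutes)
        self._token = None
        self._expires_at = datetime.min.replace(tzinfo=timezone.utc)

    def get_token(self):
        # Refresh only when missing or within the buffer of expiry.
        if self._token is None or datetime.now(timezone.utc) + self._buffer >= self._expires_at:
            self._token, self._expires_at = self._fetch()
        return self._token

# Hypothetical fetcher issuing a token valid for one hour.
calls = []
def fake_fetch():
    calls.append(1)
    return "token-" + str(len(calls)), datetime.now(timezone.utc) + timedelta(hours=1)

cache = TokenCache(fake_fetch)
assert cache.get_token() == cache.get_token()  # second call reuses the cached token
print(len(calls))  # -> 1
```

The same shape appears in the C# example that follows, where IsTokenExpired() plays the role of the buffer check.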
Code example: optimized SQL connection management

Here is an example of how you can implement token caching in a C# application using Microsoft.Data.SqlClient.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using Azure.Identity;
using Microsoft.Data.SqlClient;

namespace SqlConnectionOptimization
{
    public class SqlConnectionManager
    {
        private string _accessToken;
        private DateTimeOffset _tokenExpiration;
        private readonly string _connectionString =
            "Server=tcp:servername.database.windows.net,1433;Initial Catalog=DBName;...";
        private readonly Stopwatch _stopwatch = new Stopwatch();

        public SqlConnectionManager()
        {
            _accessToken = string.Empty;
            _tokenExpiration = DateTimeOffset.UtcNow;
        }

        public void Run()
        {
            while (true)
            {
                // Refresh token if necessary
                if (IsTokenExpired())
                {
                    RefreshToken();
                }

                // Establish connection and perform operations
                _stopwatch.Start();
                using (var connection = CreateConnection())
                {
                    LogExecutionTime("Connected");
                    _stopwatch.Start();
                    ExecuteQuery(connection);
                    LogExecutionTime("Query Executed");
                }

                // Simulate some idle time between operations
                Log("Waiting before next operation...");
                Thread.Sleep(1000);
            }
        }

        private bool IsTokenExpired()
        {
            return string.IsNullOrEmpty(_accessToken)
                || DateTimeOffset.UtcNow.AddMinutes(5) >= _tokenExpiration;
        }

        private void RefreshToken()
        {
            _stopwatch.Start();
            try
            {
                var result = FetchAccessToken();
                _accessToken = result.Token;
                _tokenExpiration = result.Expiration;
                LogExecutionTime("Token Refreshed");
                Log($"Token expires at: {_tokenExpiration}");
            }
            catch (Exception ex)
            {
                Log($"Error fetching token: {ex.Message}");
            }
        }

        private (string Token, DateTimeOffset Expiration) FetchAccessToken()
        {
            var managedIdentityCredential = new ManagedIdentityCredential();
            var tokenRequestContext = new Azure.Core.TokenRequestContext(
                new[] { "https://database.windows.net/" });
            var accessToken = managedIdentityCredential.GetTokenAsync(tokenRequestContext).Result;
            return (accessToken.Token, accessToken.ExpiresOn.UtcDateTime);
        }

        private SqlConnection CreateConnection()
        {
            var connection = new SqlConnection(_connectionString)
            {
                AccessToken = _accessToken
            };

            int retries = 0;
            while (true)
            {
                try
                {
                    connection.Open();
                    return connection;
                }
                catch (Exception ex)
                {
                    retries++;
                    if (retries > 5)
                    {
                        Log($"Error connecting after multiple retries: {ex.Message}");
                        throw;
                    }
                    Log($"Connection attempt failed. Retrying in {retries} seconds...");
                    Thread.Sleep(retries * 1000);
                }
            }
        }

        private void ExecuteQuery(SqlConnection connection)
        {
            var query = "SELECT 1"; // Simple query, replace with real logic as needed
            int retries = 0;
            while (true)
            {
                try
                {
                    using (var command = new SqlCommand(query, connection))
                    {
                        command.CommandTimeout = 5; // Adjust timeout for more complex queries
                        command.ExecuteScalar();
                    }
                    return;
                }
                catch (Exception ex)
                {
                    retries++;
                    if (retries > 5)
                    {
                        Log($"Max retries reached for query execution: {ex.Message}");
                        throw;
                    }
                    Log($"Query execution failed. Retrying in {retries} seconds...");
                    Thread.Sleep(retries * 1000);
                }
            }
        }

        private void Log(string message)
        {
            Console.WriteLine($"{DateTime.Now:yyyy-MM-dd HH:mm:ss.fff}: {message}");
        }

        private void LogExecutionTime(string action)
        {
            _stopwatch.Stop();
            var elapsed = _stopwatch.Elapsed;
            Log($"{action} - Elapsed time: {elapsed:hh\\:mm\\:ss\\.fff}");
            _stopwatch.Reset();
        }

        public static void Main(string[] args)
        {
            var manager = new SqlConnectionManager();
            manager.Run();
        }
    }
}
```

Key points in the code

- Token expiration check: the IsTokenExpired() method compares the token's expiry to the current time, with a 5-minute buffer that can be adjusted to your needs.
- Managed Identity authentication: the application uses Azure Managed Identity to authenticate and fetch the token, ensuring secure and scalable access to Azure SQL Database without requiring client secrets.
- Retry logic: in the event of a connection or query execution failure, the system retries a set number of times with an increasing backoff delay, making it resilient to transient network or authentication issues.

Conclusion

By implementing a token caching and expiration management strategy, applications can dramatically improve the performance and scalability of their database interactions, especially in environments with high request volumes. Leveraging Azure Managed Identity for secure, reusable tokens reduces authentication latency and improves the overall efficiency of your SQL database connections. This approach can be adapted to any service using Azure SQL Database with Azure Active Directory authentication.

Next steps

- Benchmarking: test the implementation in your environment to quantify the performance gains.
- Error handling: extend the retry logic and error handling to better cope with transient failures, especially in production environments. Resources: Introducing Configurable Retry Logic in Microsoft.Data.SqlClient v3.0.0-Preview1; Configurable retry logic in SqlClient; Troubleshoot transient connection errors.
- Scaling: consider how this strategy can be applied across multiple services in larger architectures, and apply managed identity best practices. Resources: Managed identity best practice recommendations.

Scaling PostgreSQL Connections in Azure: A Deep Dive into Multi-PgBouncer Architectures
This article dives into building a robust and scalable connection pooling layer for Azure Database for PostgreSQL - Flexible Server. You'll learn how to deploy multiple PgBouncer instances, managed by an Azure Load Balancer, to eliminate single points of failure and dramatically boost throughput for your applications.

A Practical Guide to Implement End-to-End Data Warehousing PoC Using Microsoft Fabric-Part 3
In this continuation (part 3) of the previous blog, you will create a simple metadata-driven pipeline to ingest data into the Data Warehouse and populate the dimension tables.

Task 4: Create a metadata-driven pipeline to ingest data into the Data Warehouse and populate the dimension tables

We will use the table mtd.TablesToCopy, created in Task 3 above, while building a metadata-driven pipeline to ingest data from the Azure SQL Database.

1. Click the Workspace icon in the left navigation, then click "+ New Item" > search "Data pipeline" to create a new pipeline and provide a name (e.g. PL_Dim_FullLoad_Levelup).
2. Once the pipeline editor opens, click the Activities tab, add a Stored Procedure activity, and name it DeleteAllRowsfromStgTbls in the General tab.
3. On the Settings tab, use the snip below to populate the details. This stored procedure deletes all records from the staging dimension tables to facilitate a full load (except stg.DimCustomer). Note: we will use stg.DimCustomer for incremental changes, so we do not delete all rows from that table.
4. Click the whitespace in the pipeline editor/canvas and add a parameter called "LookupTable" as shown below.
5. Add a Lookup activity to the canvas and connect the "On success" port of the Stored Procedure activity to it. In the General tab, rename it TablesToCopyLookup. In the Settings tab, use the snip below to populate the details. Important: remember to uncheck "First row only". Note: the Query field is set dynamically. Click "Add dynamic content" as shown below and paste the following: SELECT * FROM @{pipeline().parameters.LookupTable}. At runtime the Lookup activity receives the table name from the pipeline parameter and runs the SELECT against mtd.TablesToCopy.
6. Add a ForEach activity to the canvas and connect the "On success" port of the Lookup activity to the ForEach activity.
7. The ForEach activity will iterate once for each value returned by the Lookup activity. Rename the activity to ForEach_Row_In_TablesToCopy.
8. Configure the ForEach activity's Settings as shown below. The Items field should be set the same way as described in the note in step 5: paste @activity('TablesToCopyLookup').output.value into the Items field.
9. Double-click (or edit) the ForEach activity and, within it, add a Copy activity; in its General tab, name it Copy_Data_From_SQL_To_Fabric_DW.
10. Set the Source tab settings based on the snip below. In the Connection field, set up a new connection to the Azure SQL DB that you created in the prerequisites section. In the Query field, click Add dynamic content and type SELECT * FROM @{item().tableName}. This passes the name of the table for the current iteration.
11. Set the Destination tab settings based on the snip below. Notice that we copy the source tables from the Azure SQL database into the staging schema and pass table names dynamically based on the source tables. The Schema field is statically set to stg. To create the table names in the data warehouse dynamically, in the Table name field click Add dynamic content and paste @concat('Dim',split(item().tableName,'.')[1]). This expression concatenates the string "Dim" to the second part of the two-part source table name.
12. In the canvas showing the Copy activity, click Main canvas in the top-left corner to exit the ForEach activity. Add a Stored procedure activity to the canvas and connect the Success port of the ForEach activity to it. From the General tab, rename the Stored procedure activity to CTAS_For_PrdDimTbls. In the Settings tab, fill out the details using the snip below.
13. Click Home tab > Save to save the pipeline, then click the Play button to execute it. The pipeline should look like the below.
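To see what the dynamic-content expressions above (the lookup query, the per-iteration source query, and the destination table name) evaluate to at runtime, here is a plain-Python mirror of their logic. This is an illustrative sketch only: Fabric evaluates these as pipeline expressions, not Python, and the SalesLT table names are assumed sample values from the SQL Server sample databases.

```python
# Plain-Python mirror of the pipeline's dynamic-content expressions
# (illustration only; Fabric evaluates these server-side at runtime).

def lookup_query(lookup_table: str) -> str:
    """Mirrors: SELECT * FROM @{pipeline().parameters.LookupTable}"""
    return f"SELECT * FROM {lookup_table}"

def source_query(table_name: str) -> str:
    """Mirrors: SELECT * FROM @{item().tableName}"""
    return f"SELECT * FROM {table_name}"

def target_table(table_name: str) -> str:
    """Mirrors: @concat('Dim', split(item().tableName, '.')[1])
    Takes the second part of the two-part source table name and
    prefixes it with 'Dim'."""
    return "Dim" + table_name.split(".")[1]

# One pass of the ForEach over rows returned by the Lookup activity
# (assumed sample rows from mtd.TablesToCopy):
rows = [{"tableName": "SalesLT.Customer"}, {"tableName": "SalesLT.Product"}]
for row in rows:
    src = source_query(row["tableName"])
    dst = "stg." + target_table(row["tableName"])
    print(f"{src}  ->  {dst}")
```

Running this shows, for example, that SalesLT.Customer is copied into stg.DimCustomer, which is how the destination table names stay consistent with the staging dimension tables created earlier.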
After a successful run, the pipeline will have copied all the tables from the Azure SQL Database into the stg schema of the Data Warehouse and populated all the dimension tables in the production schema.

Azure Databricks - Best Practices for Using Spot Instances in Cluster Scaling
Leveraging Spot instances in Azure Databricks can greatly reduce costs; however, we strongly advise against their use for critical production workloads requiring high SLAs. Since Spot instances are subject to availability and can be interrupted at any time, they pose a risk to workload stability. If you still choose to use Spot instances for such workloads, it is essential to follow best practices to mitigate potential risks.

Spot instances provide a cost-efficient way to scale compute clusters, but improper configurations may lead to instability and job failures. This blog outlines key best practices for using Spot instances, whether auto-scaling is enabled or disabled.

When Auto Scaling is Disabled with Spot Instances

Without auto-scaling, Spot instance availability is crucial for successful cluster startup. Here's what you need to consider:

Cluster Availability
Ensure that 80% of the total requested nodes are available at startup. For instance, if you request four Spot compute worker nodes, the eviction of even a single node can delay the cluster's launch.

Cluster Launch Attributes
Use attributes like FALL_BACK_TO_AZURE during cluster launch. This ensures that if Spot instances are unavailable, an on-demand compute node is provisioned instead, preventing cluster failures.

Avoid Using Pools with Spot Instances
Creating clusters from pools with Spot instances can introduce instability, especially if the driver node is assigned to a Spot instance. To prevent this, we strongly discourage using pools with Spot instances when launching clusters.

When Auto Scaling is Enabled with Spot Instances

Auto-scaling allows clusters to dynamically adjust resources, but careful setup is necessary for smooth scaling.

On-Demand Nodes First
Set the Azure attribute first_on_demand=2 in the job cluster definition. This ensures that the first two nodes (one driver and one worker) are on-demand, stabilizing cluster creation.

Autoscaling Settings
Enable auto-scaling on the cluster.
Set min_workers=1, ensuring that at least one worker is always on-demand. Define the maximum cluster size to prevent over-scaling issues. This setup ensures reliable cluster startup and reduces the risk of job failures.

Upscaling Considerations
The cluster should always start with on-demand nodes before scaling up with Spot instances. While this approach improves stability, it may slightly increase overall job duration due to the upscaling process.

Avoid Using Pools with Spot Instances
Just like in the non-auto-scaling setup, avoid creating clusters from pools with Spot instances. Doing so can lead to delayed startups and instability.

Final Thoughts
By following these best practices, you can maximize the benefits of Spot instances while ensuring cluster stability and efficiency. Whether auto-scaling is enabled or not, prioritizing on-demand instances during startup and carefully managing scaling policies will help mitigate potential risks.

Comparing Open-Source vs Closed LLMs for Enterprise Apps
LLMs are advanced AI models trained on vast amounts of data. They enable tasks such as summarization, translation, content creation, and data analysis. When companies build applications that use AI, one of the most important decisions they face is choosing the right type of Large Language Model (LLM). There are two main choices: open-source LLMs and closed (proprietary) LLMs. Understanding the differences between them helps businesses decide which option fits their needs, goals, and security requirements.