Live AMA: Demystifying Azure pricing (AM session)
⏱️ This live AMA is on January 22nd, 2026 at 9:00 AM PT. The same session is also scheduled at 5:00 PM PT on January 22nd.

SESSION DETAILS
This session breaks down the complexity of Azure pricing to help you make informed decisions. We’ll cover how to estimate costs accurately using tools like the Azure Pricing Calculator, explore strategic pricing offers such as Reservations, Savings Plans, and Azure Hybrid Benefit, and share best practices for optimizing workloads for cost efficiency. Whether you’re planning migrations or managing ongoing cloud spend, you’ll learn actionable strategies to forecast, control, and reduce costs without compromising performance.

This session includes a live chat-based AMA. Submit your questions below as a comment to be answered by the product team!

Live AMA: Demystifying Azure pricing (PM session)
⏱️ This live AMA is on January 22nd, 2026 at 5:00 PM PT. The same session is also scheduled at 9:00 AM PT on January 22nd.

SESSION DETAILS
This session breaks down the complexity of Azure pricing to help you make informed decisions. We’ll cover how to estimate costs accurately using tools like the Azure Pricing Calculator, explore strategic pricing offers such as Reservations, Savings Plans, and Azure Hybrid Benefit, and share best practices for optimizing workloads for cost efficiency. Whether you’re planning migrations or managing ongoing cloud spend, you’ll learn actionable strategies to forecast, control, and reduce costs without compromising performance.

This session includes a live chat-based AMA. Submit your questions below as a comment to be answered by the product team!

Advanced Container Apps Networking: VNet Integration and Centralized Firewall Traffic Logging
Azure community,

I recently documented a networking scenario relevant to Azure Container Apps environments where you need to control and inspect application traffic using a third-party network virtual appliance. The article walks through a practical deployment pattern:

• Integrate your Azure Container Apps environment with a Virtual Network.
• Configure user-defined routes (UDRs) so that traffic from your container workloads is directed toward a firewall appliance before reaching external networks or backend services.
• Verify actual traffic paths using firewall logs to confirm that routing policies are effective.

This pattern is helpful for organizations that must enforce advanced filtering, logging, or compliance checks on container egress/ingress traffic, going beyond what native Azure networking controls provide. It also complements Azure Firewall and NSG controls by introducing a dedicated next-generation firewall within your VNet.

If you’re working with network control, security perimeters, or hybrid network architectures involving containerized workloads on Azure, you might find it useful. Read the full article on my blog.

Cloud Native Identity with Azure Files: Entra-only Secure Access for the Modern Enterprise
Azure Files introduces Entra-only identity authentication for SMB shares, enabling cloud-only identity management without reliance on on-premises Active Directory. This advancement supports secure, seamless access to file shares from anywhere, streamlining cloud migration and modernization while reducing operational complexity and costs.

Announcing seamless integration of Apache Kafka with Azure Cosmos DB in Azure Native Confluent
Integrate Azure Cosmos DB with Kafka applications

Confluent announced general availability (GA) of the fully managed V2 Kafka connector for Azure Cosmos DB, enabling users to seamlessly integrate their Azure Cosmos DB containers with Kafka-powered event streaming applications without worrying about provisioning, scaling, or managing the connector infrastructure. The Confluent Cosmos DB v2 connector offers significant advantages over the v1 connector: higher throughput, enhanced security and observability, and increased reliability.

Seamless Integration with Azure Native Confluent

We are excited to announce a new capability in the Azure Native Confluent service that enables users to create and configure Confluent-managed Cosmos DB Kafka connectors (v2) for Azure Cosmos DB containers through a direct, seamless experience in the Azure portal. Users can also provision and manage environments, Kafka clusters, and Kafka topics from within the Azure Native Confluent service, creating a holistic end-to-end experience for integrating with Azure Cosmos DB. This eliminates the need to switch between the Azure and Confluent Cloud portals.

Key Highlights

- Bi-directional Support: Allows users to create source connectors to stream data from Cosmos DB to Kafka topics, or sink connectors to move data from Kafka into Cosmos DB.
- Secure Authentication: Users can authenticate to the Kafka cluster using service accounts, enabling least-privilege access controls to provision the connectors—aligned with Confluent’s recommended security guidelines.

Create a Confluent Cosmos DB (v2) Kafka Connector from the Azure portal

The following steps summarize how to provision the connector from the Azure Native Confluent service:

1. Navigate to the native Confluent resource in Azure.
2. Navigate to Connectors -> Create new connector. You can also create an environment, cluster, and topic from within the Azure portal.
3. Choose the desired connector type: Source (to stream data from Azure Cosmos DB) or Sink (to move data into Azure Cosmos DB).
4. Select ‘Azure Cosmos DB V2’ as the connector plugin.
5. Enter the connector name, then select the required Kafka topics, Azure Cosmos DB account, and database.
6. Select Service Account authentication and provide a name for the service account. When the connector is created, this creates a new service account on Confluent Cloud. Optionally, you can select user-account based authentication by provisioning an API key on Confluent Cloud.
7. Complete the required connector configuration. Enter the topic-container mapping in the form ‘topic1#container1,topic2#container2…’.
8. Review the configuration summary and click Create. Your connector will appear in the list with real-time status indicators.

Other Resources

- Try out the Azure Native Confluent Service right away! Every new sign-up gets a free $1000 credit!
- To learn more, check out the Microsoft Docs.
- Follow the Azure Partner discussion board to keep up to date on all Azure announcements and join the conversation with subject matter experts!
- If you would like to give us feedback on this feature or the overall product, or have any suggestions for us to work on, please drop your suggestions in the comments.

Tableau to Power BI Migration: Semantic Layer-First Approach for Cloud Architects
Authors: Lavanya Sreedhar, Peter Lo, Aryan Anmol, and Rafia Aqil

In this guide, we provide practical guidance for migrating from Tableau to Power BI, with a focus on technical best practices and architecture. By unifying business intelligence on the Microsoft Fabric platform, enterprises gain closer integration with Microsoft 365 (Teams, Copilot, Excel). For cloud solution architects and BI developers, a successful migration is not just about rebuilding dashboards in a new tool. It requires thoughtful architectural planning and a shift to a more model-centric approach to BI.

Why Semantic Layer-First Architecture Matters

The Traditional Migration Challenge

Most Tableau to Power BI migrations follow a dashboard-centric approach: teams attempt to replicate existing Tableau workbooks, calculated fields, and LOD (Level of Detail) expressions directly into Power BI reports. While this may seem efficient initially, it creates significant downstream challenges:

- Duplicated logic: Each report embeds its own calculations and business rules, leading to conflicting KPIs across the organization
- Maintenance overhead: Changes to business logic require updating dozens or hundreds of individual reports
- Governance gaps: Without centralized definitions, semantic drift occurs—different teams calculate "Revenue" or "Active Customer" differently
- Scalability issues: As data volumes grow, report-level transformations become performance bottlenecks

The Semantic Layer-First Alternative

Microsoft's recommended approach centers on semantic models (formerly called datasets)—centralized, governed data models that separate business logic from visualization. In this architecture, reports are thin visualization layers that connect to shared, certified semantic models rather than embedding their own logic. The payoff is substantial: when data evolves or business rules change, you update the semantic model once, and all dependent reports automatically reflect the changes—no manual redesign required.

Understanding Migration Complexity: Simple to Very Complex Dashboards

Not all Tableau dashboards are created equal. The migration strategy should align with dashboard complexity, and the semantic layer approach becomes increasingly valuable as complexity grows.

Follow a Step-by-Step Migration Strategy

Migrating from Tableau to Power BI is not a one-click effort; it requires a mix of automated and manual refactoring, plus a sound change management plan. Below are key strategies and best practices for a successful migration:

- Audit your Tableau estate: Start by taking inventory of all existing Tableau workbooks, data sources, and dashboards. Determine what needs to be migrated (focus on high-value, widely used reports first) and identify any redundant or obsolete content that can be retired rather than converted.
- Conduct a proof-of-concept (PoC): Before migrating everything, pick a representative complex dashboard (or a subset of your data) and perform a pilot migration. This will help you validate that Power BI can connect to your data (e.g. setting up the Power BI gateways for on-premises sources), test performance (Import vs DirectQuery modes), and experiment with replicating key visuals or calculations. Use the PoC to uncover any surprises early—for example, test that any Level of Detail expressions or table calculations in Tableau can be re-created in DAX. The lessons learned here should inform your overall project plan.
- Use a phased migration approach: Plan to run Tableau and Power BI in parallel for some period, rather than switching everything at once.
  Migrate in waves—for example, by business unit or subject area—and incorporate user feedback as you go. This phased approach reduces risk and allows your team to improve the process with each iteration. It also gives end users time to adjust gradually.
- Migrate high-impact dashboards first: Prioritize the migration of key reports and dashboards that are critical to the business or have the most usage. Delivering these early wins will not only surface any technical challenges to solve but will also help demonstrate the value of Power BI’s capabilities to stakeholders. Early success builds buy-in and momentum for the rest of the migration.
- Reimagine (don’t just replicate) the experience: It’s rarely possible—or desirable—to exactly re-create every Tableau visualization pixel-for-pixel in Power BI. Embrace the opportunity to focus on business questions and improve the user experience with Power BI’s features. For example, rather than replicating a complex Tableau workaround, you might implement a cleaner solution in Power BI using native features (like bookmarks, drilldowns, or simpler navigation between pages). Engage business users and subject matter experts during this redesign to ensure the new reports meet their needs.
- Enable dataset reusability: One major benefit of the Power BI approach is the ability to create shared datasets and dataflows. As you migrate, look for opportunities to create central semantic models (datasets) that can serve multiple reports. For instance, if several Tableau workbooks all use similar data about sales, you can create one central Sales dataset in Power BI. Report creators across the organization can then build different Power BI reports on that single dataset without duplicating data or logic. This reduces maintenance and promotes a “build once, reuse often” strategy.
- Provide training and support: Expect a learning curve for teams moving to Power BI—especially those who are very fluent in Tableau. Plan for user upskilling and training programs. Establish a support community or office hours where new users can ask questions and get help. If possible, identify Power BI champions or recruit a Power BI Center of Excellence (COE) team who can guide others. During the transition, ensure there are subject matter experts (SMEs) available to address questions and validate that the new reports are correct.
- Manage change and expectations: It’s important to communicate why the organization is moving to Power BI (e.g. benefits like deeper integration, lower TCO, better governance) to get buy-in from end users. Some users may be resistant to change, especially if they’ve invested a lot of time in mastering Tableau. Prepare to handle varying responses—emphasize the personal benefits (like improved performance, new capabilities, or career growth with popular skills) to encourage adoption. Also, involve influential business users early and gather their feedback, so they feel ownership in the new solution.
- Establish governance from Day 1: Don’t wait until after migration to think about governance. Use this chance to set up Power BI governance aligned to best practices. Decide on important aspects such as workspace naming conventions, who can create or publish content, how you’ll monitor usage and costs, and how to manage data access and security (for example, designing a strategy for RLS, and deciding when to use per-user datasets vs. organizational semantic models).
  Good governance will ensure your shiny new Power BI environment doesn’t sprawl into chaos over time.
- Allow time for adjustment and iteration: Finally, be patient and iterative. Depending on the scale of your organization and the number of Tableau assets, a full migration can take months or even a year or more. Plan realistic transition periods where both systems might coexist. Continuously refine your approach with each wave of migration. Power BI’s frequent update cadence (monthly releases) means new features may emerge even during your project—stay updated, as new capabilities could simplify your migration (for example, the introduction of field parameters or Copilot might let you modernize certain Tableau features more easily).

Phase 1: Assessment and Planning

1. Audit Your Tableau Estate
- Inventory all workbooks, data sources, and calculated fields
- Identify high-traffic dashboards (prioritize for early migration)
- Categorize by complexity (Simple/Medium/Complex/Very Complex)

2. Design Your Semantic Architecture
- Map Tableau data sources to Power BI data sources (DirectQuery, Import, or Direct Lake)
- Plan star schema for fact/dimension tables
- Identify shared calculations that should live in semantic models vs. report-specific logic

3. Choose Storage Modes

| Source Type | Recommended Mode | Rationale |
| --- | --- | --- |
| Databricks Delta Lake | Direct Lake | Real-time analytics, no refresh lag |
| Azure SQL Database | DirectQuery or Import | Based on data volume and refresh SLAs |
| On-premises SQL Server | Import (via Gateway) | Network latency considerations |
| Excel/CSV files | Import | Small reference data |

Phase 2: Build the Semantic Layer

1. Create Star Schema Data Models

Tableau often relies on flat, denormalized datasets. Power BI performs best with star schemas:

- Fact tables: Transactional data (sales, orders, events) with foreign keys to dimensions
- Dimension tables: Descriptive attributes (customers, products, dates) with primary keys
- Relationships: One-to-many from dimension to fact, leveraging bidirectional filtering sparingly

2. Migrate Calculations to DAX Measures

Convert Tableau calculated fields to DAX measures in the semantic model:

```dax
-- Example DAX, defined as a measure:
Total Revenue =
SUMX(
    'Sales',
    'Sales'[Quantity] * 'Sales'[Unit Price]
)
```

2.1 Use Copilot to Accelerate DAX Development

Leverage Copilot in Power BI Desktop to generate and validate DAX:

- Describe the calculation in natural language
- Copilot suggests DAX syntax
- Review, test, and refine

Phase 3: Understanding Migration Complexity: Simple to Very Complex Dashboards

1. Dashboard Conversion Best Practices

- Think in "pages" not "sheets": Power BI reports combine multiple visuals per page; group related visuals logically
- Use slicers for interactivity: Replace Tableau filters with Power BI slicers and the filter pane
- Leverage bookmarks for navigation: Create dynamic report experiences with show/hide containers

Simple Complexity Level

| Category | Tableau Feature | Power BI Equivalent | Microsoft Fabric Enhancements | Best Practice Notes |
| --- | --- | --- | --- | --- |
| Data Model | Single custom SQL | Power Query for data shaping and ETL | OneLake Shortcuts for unified data access | Use star schema for optimized performance; push logic into the semantic layer rather than visuals |
| Calculations | Basic IF/ELSE, SUM | Data Analysis Expressions (DAX) for measures and calculated columns | Copilot for Power BI to assist with DAX creation; Fabric IQ for natural language queries | Centralize calculations in semantic models for consistency and governance |

Medium Complexity Level

| Category | Tableau Feature | Power BI Equivalent | Fabric Enhancements | Best Practice Notes |
| --- | --- | --- | --- | --- |
| Data Model | Multiple custom SQL (up to 3) | Connect live to databases (Azure Databricks) with DirectQuery in Power BI; connect with cloud data sources (Power BI data sources) | OneLake Shortcuts for unified access without Databricks compute cost; semantic models can combine multiple sources | Optimize with star schema; prefer OneLake Shortcuts for performance; avoid heavy transformations in visuals |
| Calculations | Nested IFs, CASE | Data Analysis Expressions (DAX) for measures and calculated columns | Copilot for Power BI to assist with DAX creation; Fabric Data Agent for conversational BI; Fabric IQ for natural language queries | Centralize logic in semantic models; use Copilot for automation and validation; keep calculations reusable |
| Reporting | Tooltip formatting in bar and map visuals; Select All/Clear option for single-select dropdowns | Standard tooltips offer help tooltips, text, and background formatting; dynamic tooltips let you create a tooltip page and reuse it across multiple visuals (Create report tooltip pages in Power BI - Power BI | Microsoft Learn) | Customization goes well beyond the out-of-the-box tooltips | Use a Clear All Slicers button: disable single select, add the Clear All Slicers button, customize it, and use it |

Complex Complexity Level

| Category | Tableau Feature | Power BI Equivalent | Fabric Enhancements | Best Practice Notes |
| --- | --- | --- | --- | --- |
| Data Model | Multiple sources; relationships created using more than one column | Composite models in Power BI (DirectQuery + Import) for combining multiple sources, plus connections to various cloud services; Dataflows for pre-processing. Power BI allows a relationship between two tables based on only one active column | OneLake Shortcuts for unified access without Azure Databricks compute cost; Microsoft Fabric Dataflows Gen2 offers multiple ways to ingest, transform, and load data efficiently | Consolidate sources into semantic models; use Direct Lake for performance; plan and design the data model to comply with the star schema supported by Power BI |
| Relationship | Multi-column relationships | USERELATIONSHIP in DAX to activate a specific relationship for a calculation | | |
| Calculations | LOD, window functions | Data Analysis Expressions (DAX) for measures and calculated columns; change how visuals interact in a Power BI report | Copilot to assist with complex DAX; Fabric IQ Ontology for semantic alignment; Q&A visual for natural language; Fabric Data Agent for conversational BI | Centralize calculations in the semantic layer; use variables in DAX for readability and performance |

Very Complex Complexity Level

| Category | Tableau Feature | Power BI Equivalent | Fabric Enhancements | Best Practice Notes |
| --- | --- | --- | --- | --- |
| Data Model | Multi-source, Excel, SQL | Composite models in Power BI (DirectQuery + Import) for combining multiple sources, plus connections to various cloud services; Dataflows for pre-processing | OneLake Shortcuts for unified access; built-in connector support (Connector overview); Mirroring for real-time sync | Combine multiple sources into well-structured semantic models for consistency and optimized performance |
| Calculations | Predictive logic | Data Analysis Expressions (DAX) for measures and calculated columns | Fabric AutoML, ML models, AI Insights, Python/R, notebook-based ML (Spark/scikit-learn), Fabric AI Functions, Fabric IQ Ontology; Fabric Data Agent for conversational BI | Centralize logic in semantic models; leverage Copilot for automation and parameter-driven workflows; prepare for Copilot and Q&A integration |

2. Tableau Feature Equivalents

| Tableau Feature | Power BI Equivalent | Microsoft Learn Link |
| --- | --- | --- |
| Calculated Fields | DAX Measures | DAX Documentation |
| Parameters | Field Parameters / Bookmarks | Use report readers to change visuals |
| Actions | Drillthrough / Bookmarks | Drillthrough |
| Tableau Prep | Power Query / Dataflows | Differences between Dataflow Gen1 and Dataflow Gen2 |
| Tableau Server | Power BI Service | What is Power BI? Overview of Components and Benefits |

Phase 4: Governance and Deployment

Workspace Planning (Dev / Test / Prod Separation)

A proper workspace strategy is essential for governed deployments in Fabric and Power BI. Fabric supports separate Development, Test, and Production stages using Deployment Pipelines, enabling controlled promotions of semantic models, reports, dataflows, notebooks, lakehouses, and other items. You can assign each workspace to a pipeline stage (Dev → Test → Prod) to ensure safe lifecycle management.

Sensitivity Labeling (Microsoft Purview Information Protection)

Sensitivity labels allow governed classification and protection of data across Fabric items. Sensitivity labels can be applied directly to Fabric items (semantic models, reports, dataflows, etc.) through the item's header flyout or the item settings. Labels from Microsoft Purview Information Protection enforce data access rules and help organizations meet compliance requirements.

Endorsement & Certification (Promoted, Certified, Master Data)

Endorsement improves discoverability and trust in shared organizational content:

- Promoted: Item creators mark content as recommended for broader use.
- Certified: Administrators or authorized reviewers validate that content meets organizational quality standards.
- Master Data: Indicates authoritative single-source-of-truth items such as semantic models or lakehouses.

All Fabric items except dashboards can be promoted or certified; data-containing items can be designated as Master Data.

Monitoring & Capacity Planning

Determine the appropriate Fabric capacity size when migrating from Tableau to Power BI. The Fabric SKU Estimator can generate a SKU recommendation (estimate) for your capacity requirements. Ensuring performance and cost efficiency requires ongoing monitoring of your Fabric capacity. Microsoft recommends evaluating workloads using Fabric Capacity Metrics and planning SKU sizes based on real usage. Fabric uses bursting and smoothing to handle spikes while enforcing capacity limits. Monitoring helps identify high compute usage, background refreshes, and interactive workloads to optimize performance.

Fabric Data Source Connections (OneLake + Manage Connections)

Microsoft Fabric is designed as an end-to-end analytics platform that integrates data from many different source systems into a unified environment powered by OneLake, Data Factory, Real-Time Analytics, Dataflows, Lakehouses, Warehouses, and Mirrored Databases.

The Strategic Advantage: Semantic Layer + Fabric IQ

The semantic layer-first approach sets the foundation for the next evolution in enterprise analytics.
Fabric IQ (announced at Ignite 2025) is Microsoft's semantic intelligence platform that auto-elevates semantic models into ontologies—structured knowledge graphs that power AI agents, Copilot experiences, and cross-domain data reasoning.

What this means for your migration:

- Semantic models you build today become the foundation for AI-driven analytics tomorrow
- Data Agents can reason across multiple semantic models, answering questions that span domains
- Business users transition from "report consumers" to "data explorers" via natural language interfaces

Conclusion: Build for the Future, Not Just for Today

Migrating from Tableau to Power BI is more than a technology swap—it's an opportunity to re-architect your analytics strategy for the cloud-native, AI-powered era. The semantic layer-first approach requires upfront investment in data modeling, DAX expertise, and Fabric platform adoption. But the payoff is transformative:

- Consistency: Single source of truth for all business metrics
- Scalability: Semantic models that serve hundreds of reports and thousands of users
- Agility: Changes to business logic propagate instantly across the enterprise
- Future-readiness: Foundation for Fabric IQ, Data Agents, and AI-driven insights

Start your migration with the end in mind: the goal is not just to convert dashboards, but to build a modern, governed, AI-ready analytics platform that scales with your business.

Addressing Key Migration Concerns

(1) Why a semantic-layered model approach is better than recreating Tableau dashboards

A semantic-layered modeling approach is the optimal strategy for migration and is significantly more effective than attempting to recreate Tableau dashboards exactly as they exist. Recreating dashboards one-for-one carries Tableau's workbook-centric design forward; Power BI and Fabric instead encourage a semantic model-first architecture, where all business rules, relationships, calculations, and transformations are centralized in a governed model that serves many dashboards. This approach not only provides consistency and reuse across the enterprise but also ensures that report authors build on a single certified version of the truth.

(2) How a semantic-layered model approach reduces the constant redesign caused by changing data needs

A semantic-layered modeling approach directly addresses the concern about constant changes and frequent redesigns of dashboards when data evolves. With a semantic layer, changes are absorbed in the model layer—the logic is updated once and flows automatically into all dependent reports. Combined with Fabric features like OneLake shortcuts, Direct Lake mode, and centralized governance, the semantic layer drastically reduces breakage, minimizes rework, and ensures scalability as data continues to grow and shift.

Additional Resources

- Direct Lake in Microsoft Fabric
- Create Fabric Data Agents
- OneLake Shortcuts
- Write DAX queries with Copilot - DAX

Blank screen in "new" AI Foundry and unable to go back to "classic" view
Hi,

I toggled on the "new" AI Foundry experience, but it just gives me a blank page, and I cannot disable it to go back to the old view. When I go to https://ai.azure.com/ it takes me to the "nextgen" page (e.g. https://ai.azure.com/nextgen/?wsid=%2Fsubscriptions...). Removing "nextgen" from the URL just redirects me back to the "nextgen" version.

This also happens to my colleague, and we can't use the Foundry. Can you help?

Published agent from Foundry doesn't work at all in Teams and M365
I've switched to the new version of Azure AI Foundry (New) and created a project there. Within this project, I created an Agent and connected two custom MCP servers to it. The agent works correctly inside Foundry Playground and responds to all test queries as expected.

My goal was to make this agent available for my organization in Microsoft Teams / Microsoft 365 Copilot, so I followed all the steps described in the official Microsoft documentation: https://learn.microsoft.com/en-us/azure/ai-foundry/agents/how-to/publish-copilot?view=foundry

Issue description

The first problems started at Step 8 (Publishing the agent).

Organization scope publishing

I published the agent using Organization scope. The agent appeared in Microsoft Admin Center in the list of agents. However, when an administrator from my organization attempted to approve it, the approval always failed with a generic error:

"Sorry, something went wrong"

No diagnostic information, error codes, or logs were provided. We tried recreating and republishing the agent multiple times, but the result was always the same.

Shared scope publishing

As a workaround, I published the agent using Shared scope. In this case, the agent finally appeared in Microsoft Teams and Microsoft 365 Copilot. I can now see the agent here:

- Microsoft Teams → Copilot
- Microsoft Teams → Applications → Manage applications

However, this revealed the main issue.

Main problem

The published agent cannot complete any query in Teams, despite the fact that:

- The agent works perfectly in Foundry Playground
- The agent responds correctly to the same prompts before publishing

In Teams, every query results in messages such as:

"Sorry, something went wrong. Try to complete a query later."

Simplification test

To exclude MCP or instruction-related issues, I performed the following:

- Disabled all MCP tools
- Removed all complex instructions
- Left only a minimal system prompt: "When the user types 123, return 456"

I then republished the agent. The agent appeared in Teams again, but the behavior did not change — it does not respond at all.

Permissions warning in Teams

When I go to Teams → Applications → Manage Applications → My agent → View details, I see a red warning label:

"Permissions needed. Ask your IT admin to add InfoConnect Agent to this team/chat/meeting."

This message is confusing because:

- The administrator has already added all required permissions
- All relevant permissions were granted in Microsoft Entra ID
- Admin consent was provided

Because of this warning, I also cannot properly share the agent with my colleagues.

Additional observation

I have a similar agent configured in Copilot Studio:

- It shows the same permissions warning
- However, that agent still responds correctly in Teams
- It can also successfully call some MCP tools

This suggests that the issue is specific to Azure AI Foundry agents, not to Teams or tenant-wide permissions in general.

Steps already taken to resolve the issue

- Configured all required RBAC roles in Azure Portal according to: https://learn.microsoft.com/en-us/azure/ai-foundry/concepts/rbac-foundry?view=foundry-classic
- During publishing, an agent-bot application was automatically created.
- I added my account to this bot with the Azure AI User role
- I also assigned Azure AI User to the project's Managed Identity and to the project resource itself
- Verified all permissions related to AI agent publishing in Microsoft Admin Center and Microsoft Teams Admin Center
- Simplified and republished the agent multiple times
- Deleted the automatically created agent-bot and allowed Foundry to recreate it
- Created a new Foundry project, configured several simple agents, and published them — the same issue occurs
- Tried publishing with different models: gpt-4.1, o4-mini
- Manually configured permissions in Microsoft Entra ID → App registrations / Enterprise applications → API permissions; added both Delegated and Application permissions and granted Admin consent
- Added myself and my colleagues as Azure AI User in Foundry → Project → Project users
- Followed all steps mentioned in this related discussion: https://techcommunity.microsoft.com/discussions/azure-ai-foundry-discussions/unable-to-publish-foundry-agent-to-m365-copilot-or-teams/4481420

Questions

1. How can I make a Foundry agent work correctly in Microsoft Teams?
2. Why does the agent fail to process requests in Teams while working correctly in Foundry?
3. What does the "Permissions needed" warning actually mean for Foundry agents?
4. How can I properly share the agent with other users in my organization?

Any guidance, diagnostics, or clarification on the correct publishing and permission model for Foundry agents in Teams would be greatly appreciated.

A Technical Implementation Guide for Multi-Store Retail Environments
Understanding the Problem Space

When organizations first approach multi-source data ingestion, they typically start with explicit configuration. Each database connection is defined individually, each table mapping is specified, and each destination partition is created manually. This works reasonably well for five or ten sources. It becomes painful at twenty. It becomes nearly unmanageable at fifty or more.

The operational cost manifests in several ways. Every new store requires a ticket to the data engineering team. Someone must configure the connection, verify schema compatibility, set up the ingestion pipeline, create the destination structures, and validate the data flow. In a fast-growing retail operation, this creates a bottleneck that delays time-to-insight for new locations.

Beyond the immediate operational burden, there is also the risk of configuration drift. When each source is configured individually, small inconsistencies creep in over time. One store might have slightly different table names. Another might be ingesting an extra column that was added during a schema migration. These inconsistencies compound, making the overall system harder to maintain and debug.

The solution presented here eliminates most of this manual work by implementing two complementary patterns: automatic source detection through regex-based CDC configuration, and dynamic partition creation through Delta Lake's native capabilities.

Architecture Overview

The proposed architecture places Debezium as the CDC engine, reading from PostgreSQL databases and publishing change events to Azure Event Hubs. Fabric EventStream consumes these events and writes them to a Delta Lake table in the Lakehouse. The key insight is that neither Debezium nor Delta Lake requires explicit enumeration of every data source. Both can operate on patterns rather than explicit lists.

At the source layer, Debezium connects to PostgreSQL and monitors the write-ahead log for changes. Rather than configuring a separate connector for each store database, a single connector is configured with a regex pattern that matches all store databases. When a new database is created that matches this pattern, Debezium automatically begins capturing its changes without any configuration update.

At the destination layer, Delta Lake tables are defined with partition columns but without explicit partition values. When a record arrives with a previously unseen partition value, Delta Lake automatically creates the necessary directory structure and begins writing data to the new partition. No DDL statement is required, and no manual intervention is needed.

The combination of these two behaviors creates an end-to-end pipeline where adding a new store database is as simple as creating the database itself. Everything downstream happens automatically.

Implementation Details

Configuring Debezium for Automatic Source Detection

The critical configuration element in Debezium is the database include list parameter. Rather than specifying each database explicitly, this parameter accepts a regular expression that defines which databases should be monitored. For a retail environment where store databases follow a naming convention such as store_001, store_002, and so forth, the configuration would specify a pattern like store_.* as the include list. This pattern matches any database whose name begins with store_ followed by any characters.
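As a concrete sketch, the connector could be registered through the Kafka Connect REST API as below. Host names, credentials, and the connector name are placeholders, and exact property names vary by connector: Debezium's MySQL connector exposes a database.include.list, while its PostgreSQL connector is scoped to a single database and filters with schema and table include lists, so treat the include-list line as illustrating the pattern described above rather than as a verbatim property reference.

```python
import requests

# Placeholder Kafka Connect endpoint and credentials.
CONNECT_URL = "http://kafka-connect:8083/connectors"

connector = {
    "name": "stores-cdc",  # illustrative connector name
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "pg-stores.internal",
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "********",
        "topic.prefix": "stores",
        # Pattern-based include list as described in the article: any database
        # named store_<anything> is captured with no per-store configuration.
        "database.include.list": "store_.*",
        # Standardized per-store schema: monitor the same tables everywhere.
        "table.include.list": "public.transactions,public.inventory,public.products",
        # Route every change event to a single topic; the originating store
        # remains identifiable from the source metadata in each payload.
        "transforms": "route",
        "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
        "transforms.route.regex": ".*",
        "transforms.route.replacement": "stores-cdc-events",
    },
}

response = requests.post(CONNECT_URL, json=connector)
response.raise_for_status()
print(f"Connector registered: {response.json()['name']}")
```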
When the DBA creates store_058, Debezium detects this new database during its periodic metadata refresh and automatically begins capturing changes from it.

The connector configuration should also specify which tables within each database to monitor. In most retail scenarios, the schema is standardized across all stores, so this can be a fixed list such as public.transactions, public.inventory, and public.products. If schema variations exist between stores, additional filtering logic may be required, but this is generally a sign that the source systems need standardization rather than accommodation of inconsistency.

The topic routing configuration is equally important. All CDC events should be routed to a single Event Hubs topic, with the store identifier included in the message payload. This allows a single EventStream to process all store data while preserving the ability to identify which store each record originated from. The source metadata that Debezium includes with each change event contains the database name, which serves as the natural store identifier.

Initial snapshot behavior must be considered carefully. When Debezium detects a new database, it will perform an initial snapshot to capture the current state before beginning to track incremental changes. For a store with substantial historical data, this snapshot may take anywhere from a few minutes to several hours. During this period, the connector is occupied with the snapshot and may exhibit increased latency for change events from other databases. Scheduling new store database creation during off-peak hours helps mitigate this impact.

Configuring Delta Lake for Dynamic Partitioning

Delta Lake supports dynamic partition creation as a default behavior. When a table is created with partition columns specified, any INSERT operation that includes a previously unseen partition value will automatically create the corresponding partition directory.

The table definition should include the store identifier as the primary partition column. A secondary partition on date is typically advisable for managing data lifecycle and optimizing query performance. The combination of store_id and event_date provides a natural organization that supports both store-specific queries and time-range queries efficiently.

The table should be created with automatic optimization enabled. The autoOptimize.optimizeWrite property causes Delta Lake to automatically coalesce small files during write operations, reducing the small file problem that frequently plagues streaming ingestion workloads. The autoOptimize.autoCompact property enables background compaction of files that have accumulated between optimization runs.

When EventStream writes a record with store_id equal to store_058 and this value has never been seen before, Delta Lake creates the partition directory structure automatically. The first write creates the directory, and subsequent writes append to files within that directory. From the perspective of downstream queries, the new store's data is immediately available without any schema changes or administrative intervention.
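To make this concrete, here is a minimal sketch of the table definition and a first write, run from a Spark notebook attached to the Lakehouse. The retail schema, table name, and columns are illustrative, and spark is the session provided by the notebook.

```python
from datetime import date

# Illustrative schema/table/column names; `spark` comes from the notebook session.
spark.sql("""
    CREATE TABLE IF NOT EXISTS retail.store_transactions (
        transaction_id STRING,
        store_id       STRING,
        event_date     DATE,
        amount         DOUBLE,
        op             STRING   -- Debezium operation type (c/u/d)
    )
    USING DELTA
    PARTITIONED BY (store_id, event_date)
    TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")

# First record for a brand-new store: the store_058/2024-06-01 partition
# directory is created automatically by this append -- no DDL required.
demo = spark.createDataFrame(
    [("t-0001", "store_058", date(2024, 6, 1), 19.99, "c")],
    schema="transaction_id STRING, store_id STRING, event_date DATE, "
           "amount DOUBLE, op STRING",
)
demo.write.format("delta").mode("append").saveAsTable("retail.store_transactions")
```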
EventStream Processing Logic

The EventStream configuration bridges the gap between Event Hubs and the Lakehouse. It must parse the Debezium change event format, extract the relevant fields including the store identifier, and route the data to the appropriate Delta table.

The transformation logic should extract the store identifier from the source metadata section of the Debezium payload. This is typically found at a path like source.db within the JSON structure. The operation type indicating whether the change is an insert, update, or delete should be preserved for downstream CDC merge processing.

A derived column for the event date should be computed from the event timestamp. This serves as the secondary partition key and enables time-based data management. The date extraction should use the source system timestamp rather than the ingestion timestamp to ensure that data is partitioned based on when the business event occurred rather than when it was processed.

The destination configuration specifies the Delta table and the partition columns. EventStream handles the actual write operations, and Delta Lake handles the partition management automatically based on the values present in each record.

Best Practices for Production Deployment

Naming Convention Enforcement

The automatic detection pattern is only as reliable as the naming convention it depends on. Before implementing this architecture, establish and enforce a strict naming convention for store databases. Document the convention, communicate it to all teams that provision databases, and implement validation checks in the database provisioning process.

The naming convention should be simple and unambiguous. A pattern like store_NNN, where NNN is a zero-padded three-digit number, provides clear structure and allows for up to 999 stores without format changes. Avoid conventions that might conflict with other databases or that include characters with special meaning in regex patterns.

Schema Standardization

Automatic source detection assumes that all detected sources share a compatible schema. If store_058 has different table structures than store_001, the downstream processing will fail or produce incorrect results. Schema standardization must be enforced at the source system level before relying on automatic detection.

Implement schema validation as part of the store database provisioning process. When a new store database is created, it should be created from a template that guarantees schema compatibility. If schema migrations are necessary, they should be applied uniformly across all store databases before being reflected in the CDC configuration.

Partition Key Selection

The choice of partition keys has significant implications for both storage efficiency and query performance. The store identifier is the natural first-level partition because it provides the strongest cardinality and aligns with common query patterns such as analyzing a specific store's performance.

Date as a secondary partition enables efficient time-range queries and simplifies data lifecycle management. Retention policies can be implemented by dropping old date partitions rather than scanning and deleting individual records.

However, the combination of store and date partitions can produce a large number of partition directories. With 100 stores and 365 days of retained data, the table would have 36,500 partitions. While Delta Lake handles this reasonably well, query planning overhead increases with partition count. Consider using month rather than date as the secondary partition if the total partition count becomes problematic. This reduces the partition count by a factor of roughly 30 while still enabling reasonably efficient time-based queries and lifecycle management.
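A quick back-of-the-envelope sketch of that trade-off, using the store count and retention figures above:

```python
stores = 100
retention_days = 365

daily_partitions = stores * retention_days             # store_id x event_date
monthly_partitions = stores * (retention_days // 30)   # store_id x event_month

print(f"daily partitioning  : {daily_partitions:,} partitions")   # 36,500
print(f"monthly partitioning: {monthly_partitions:,} partitions") # ~1,200
```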
Monitoring and Alerting

Automatic detection reduces operational burden but does not eliminate the need for monitoring. Implement alerting for several key scenarios.

First, monitor for new store detection. While the system handles new stores automatically, operations teams should be notified when a new store begins ingesting data. This serves as a sanity check that the detection is working and provides visibility into the growth of the system.

Second, monitor for schema compatibility failures. If a new database is detected but its schema does not match expectations, the CDC process may fail or produce malformed data. Alerting on processing errors helps catch these issues quickly.

Third, monitor for snapshot completion. When a new store database triggers an initial snapshot, track the snapshot progress and completion. Extended snapshot times may indicate unusually large source tables or performance issues that warrant investigation.

Fourth, monitor partition growth. While dynamic partitioning is convenient, runaway partition creation can indicate a problem such as incorrect store identifiers being generated. Alert if the number of distinct store partitions grows faster than expected based on the known rate of new store openings.

Initial Snapshot Planning

The initial snapshot that occurs when a new database is detected can be resource-intensive for both the source database and the CDC infrastructure. Plan for this by establishing a new-store onboarding window during off-peak hours when the impact of snapshot processing is minimized.

Consider implementing a two-phase onboarding process for stores with large historical datasets. In the first phase, configure the database but exclude it from the Debezium include pattern, and perform a bulk historical load using batch processing, which can be throttled and scheduled more flexibly than the streaming snapshot. In the second phase, add the database to the include pattern to begin capturing incremental changes. This approach reduces the load on the streaming infrastructure while still achieving complete data capture.

Capacity Planning

Dynamic detection and partitioning make scaling easy in terms of configuration, but the underlying infrastructure must still be sized appropriately. Event Hubs throughput units must accommodate the aggregate event volume from all stores. Fabric capacity units must handle the combined processing load of EventStream and Delta Lake operations.

Develop a capacity model that estimates resource requirements per store. Multiply by the current store count plus a growth buffer to determine infrastructure sizing. Review and adjust this model as actual usage patterns become clear.

Risk Assessment and Mitigation

Uncontrolled Source Proliferation

The convenience of automatic detection carries a risk of unintended sources being captured. If the naming convention is not strictly enforced, or if the regex pattern is too broad, databases that should not be ingested may be detected and processed.

Mitigation involves implementing strict naming convention governance and using precise regex patterns. The pattern should be as specific as possible while still accommodating legitimate variations. Consider implementing a whitelist in addition to the pattern match, where new databases matching the pattern are flagged for approval before ingestion begins, as sketched below. This adds a manual step but provides a safety checkpoint.
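A minimal sketch of that pattern-plus-whitelist checkpoint; the three-digit convention matches the store_NNN guidance above, and the approval set stands in for whatever governance store you actually use:

```python
import re

# Strict convention: "store_" followed by exactly three digits.
STORE_NAME = re.compile(r"^store_\d{3}$")

# Hypothetical approval list maintained by the governance process.
approved_databases = {"store_001", "store_002", "store_058"}

def should_ingest(database_name: str) -> bool:
    """Gate detection: the name must match the convention AND be approved."""
    if not STORE_NAME.match(database_name):
        return False  # out-of-convention names are never ingested
    return database_name in approved_databases

for candidate in ["store_058", "store_tmp", "storefront_9", "store_999"]:
    verdict = "ingest" if should_ingest(candidate) else "flag for review"
    print(f"{candidate} -> {verdict}")
```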
Schema Drift Between Sources

Over time, individual store databases may drift from the standard schema due to local modifications, failed migrations, or version inconsistencies. When the CDC process encounters unexpected schema elements, it may fail or produce incorrect data.

Mitigation requires implementing schema validation at both the source and destination. At the source, periodic schema audits should compare each store database against the canonical schema and flag deviations. At the destination, schema evolution policies in Delta Lake should be configured to reject incompatible changes rather than silently accepting them. The merge schema option should be used cautiously and only when schema evolution is intentional.

Partition Explosion

Dynamic partition creation can lead to an excessive number of partitions if the partition key has unexpectedly high cardinality or if erroneous data introduces spurious partition values. A misconfigured pipeline might create thousands of partitions, degrading query performance and complicating data management.

Mitigation involves implementing partition count monitoring with alerts at defined thresholds. Additionally, validate partition key values before writing to Delta Lake. If a store identifier does not match the expected format, reject the record or route it to an error table for investigation rather than creating an erroneous partition.

CDC Lag During Snapshot

When a new database triggers an initial snapshot, the Debezium connector dedicates resources to reading the full table contents. During this period, change events from other databases may experience increased latency. In severe cases, the replication slot lag may grow to problematic levels.

Mitigation involves scheduling new database provisioning during low-activity periods, sizing the CDC infrastructure with headroom for snapshot operations, and monitoring replication slot lag with alerts at defined thresholds. For very large initial loads, consider the two-phase onboarding approach described earlier.

Event Hubs Partition Affinity

Event Hubs uses partitions to parallelize message processing. If all messages are routed to a single partition, throughput is limited to what that partition can handle. If messages are distributed across partitions without regard to ordering requirements, related events may be processed out of order.

Mitigation involves configuring the Debezium producer to use the store identifier as the partition key. This ensures that all events for a given store are routed to the same Event Hubs partition, preserving ordering within each store while distributing load across partitions for different stores. The number of Event Hubs partitions should be set high enough to accommodate the expected number of stores, with room for growth: unlike some properties, partition count cannot be increased after the Event Hub is created.

Orphaned Replication Slots

If a store database is decommissioned but the replication slot is not cleaned up, PostgreSQL continues to retain write-ahead log segments for the orphaned slot. Over time, this can fill the disk and cause database outages.

Mitigation requires implementing a decommissioning procedure that includes replication slot cleanup. Monitor for inactive replication slots and alert when a slot has not been read from in an extended period, as in the sketch below. Consider implementing automatic slot cleanup for slots that have been inactive beyond a defined threshold, though this should be done cautiously to avoid accidentally removing slots that are temporarily inactive due to maintenance.
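As a sketch, the check can be a scheduled query against pg_replication_slots, reporting inactive slots together with the WAL they are pinning. Connection details are placeholders, and the alert wiring is left to your monitoring stack:

```python
import psycopg2

# Placeholder connection details for the source PostgreSQL server.
conn = psycopg2.connect(host="pg-stores.internal", dbname="postgres",
                        user="monitor", password="********")

# Inactive slots still pin WAL segments; surface them with the retained size.
QUERY = """
    SELECT slot_name,
           pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn))
               AS retained_wal
    FROM pg_replication_slots
    WHERE NOT active;
"""

with conn, conn.cursor() as cur:
    cur.execute(QUERY)
    for slot_name, retained_wal in cur.fetchall():
        # In practice, feed this into your alerting pipeline instead of printing.
        print(f"inactive slot {slot_name} is retaining {retained_wal} of WAL")
```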
Operational Procedures

Adding a New Store

The procedure for adding a new store is intentionally minimal. The DBA creates the store database following the established naming convention, using the standard template to ensure schema compatibility. Within minutes of database creation, Debezium detects the new database and begins the initial snapshot, and operations teams receive a notification of the new store detection. The snapshot progresses, with completion typically occurring within 30 minutes for a standard store data volume.

Once the snapshot completes, incremental CDC begins. The first records arriving at the Lakehouse trigger automatic partition creation. From this point forward, the new store's data is fully integrated into the analytics platform with no additional intervention required.

Removing a Store

Store removal requires more deliberate action than store addition. First, stop ingesting new data by either renaming the database so it no longer matches the include pattern or by dropping the database entirely. Second, drop the replication slot associated with the store to prevent WAL retention issues. Third, decide whether to retain or purge historical data in the Lakehouse.

If historical data should be retained, no action is needed at the Lakehouse level; the partition remains but simply receives no new data. If historical data should be purged, drop the partition using Delta Lake's partition drop capability, which removes the data files and the partition metadata in a single atomic operation.

Handling Schema Changes

Schema changes require coordination across all store databases and the downstream processing logic. Minor additive changes such as new nullable columns can often be handled through Delta Lake's schema evolution capabilities; the merge schema option allows new columns to be added automatically when encountered.

Breaking changes such as column renames, type changes, or column removals require a more deliberate migration process. First, update the downstream processing logic to handle both the old and new schemas, deploy this change, and verify it works with existing data. Then apply the schema change to source databases in a rolling fashion. Finally, once all sources have been migrated, remove support for the old schema from the processing logic.

Disaster Recovery

The architecture provides several recovery options depending on the failure scenario. If Event Hubs experiences an outage, Debezium buffers changes locally and resumes publishing when connectivity is restored. The replication slot ensures no changes are lost during the outage, though extended outages may cause WAL accumulation on the source database.

If Fabric experiences an outage, events accumulate in Event Hubs up to the retention period. Once Fabric recovers, EventStream resumes processing from its last checkpoint, catching up on accumulated events. The Delta Lake table remains consistent due to its transactional nature.

If a source database is lost, recovery depends on the backup strategy. The Lakehouse contains a copy of all ingested data, which can serve as a read-only recovery source; full database recovery requires restoring from PostgreSQL backups.

Dynamic partitioning and automatic source detection transform multi-store data ingestion from an operational burden into a largely self-managing system. The combination of Debezium's pattern-based database detection with Delta Lake's dynamic partition creation eliminates most manual configuration work while maintaining the flexibility to accommodate growth. The implementation requires careful attention to naming conventions, schema standardization, and monitoring.
The risks are real but manageable with appropriate governance and operational procedures. For organizations operating at scale with dozens or hundreds of similar data sources, this architecture provides a sustainable path to unified analytics without proportional growth in operational overhead.

The key principle underlying this approach is that systems should adapt to data rather than requiring data to conform to rigid system configurations. By designing for automatic detection and dynamic accommodation, the data platform becomes a utility that business operations can leverage without constant engineering involvement. This shift from explicit configuration to pattern-based adaptation is essential for organizations seeking to derive value from data at scale.

Guide for Architecting Azure-Databricks: Design to Deployment
Authors: Chris Walk, Dan Johnson, Eduardo dos Santos, Ted Kim, Eric Kwashie, Chris Haynes, Tayo Akigbogun, and Rafia Aqil
Peer reviewed by: Mohamed Sharaf

Note: This article does not cover the Serverless Workspace option, which is currently in Public Preview. We plan to update this article once Serverless Workspaces are Generally Available. Also, while Terraform is the recommended method for production deployments due to its automation and repeatability, for simplicity this article demonstrates deployment through the Azure portal.

Introduction

Video introduction to Databricks: what is databricks | introduction - databricks for dummies

DESIGN: Architecting a Secure Azure Databricks Environment

Step 1: Plan Workspace and Subscription Organization, Analytics Architecture, and Compute

Planning your Azure Databricks environment can follow various arrangements depending on your organization’s structure, governance model, and workload requirements. The following guidance outlines key considerations to help you design a well-architected foundation.

1.1 Align Workspaces with Business Units

A recommended best practice is to align each Azure Databricks workspace with a specific business unit. This approach—often referred to as the “Business Unit Subscription” design pattern—offers several operational and governance advantages.

- Streamlined Access Control: Each unit manages its own workspace, simplifying permissions and reducing cross-team access risks. For example, Sales can securely access only their data and notebooks.
- Cost Transparency: Mapping workspaces to business units enables accurate cost attribution and supports internal chargeback models. Each workspace can be tagged to a cost center for visibility and accountability. Even within the same workspace, costs can be controlled using system tables that provide detailed usage metrics and resource consumption insights.
- Challenges to keep in mind: While per-BU workspaces have high impact, be mindful of workspace sprawl. If every small team spins up its own workspace, you might end up with dozens or hundreds of workspaces, which introduces management overhead. Databricks recommends a reasonable upper limit (on Azure, roughly 20–50 workspaces per account/subscription) because managing “collaboration, access, and security across hundreds of workspaces can become extremely difficult, even with good automation” [1]. Each workspace will need governance (user provisioning, monitoring, compliance checks), so there is a balance to strike.

1.2 Workspace Alignment and Shared Metastore Strategy

As you align workspaces with business units, it's essential to understand how Unity Catalog and the metastore fit into your architecture. Unity Catalog is Databricks’ unified governance layer that centralizes access control, auditing, and data lineage across workspaces. Each Unity Catalog is backed by a metastore, which acts as the central metadata repository for tables, views, volumes, and other data assets.

In Azure Databricks, you can have one metastore per region, and all workspaces within that region share it. This enables consistent governance and simplifies data sharing across teams. If your organization spans multiple regions, you’ll need to plan for cross-region sharing, which Unity Catalog supports through Delta Sharing.
1.3 Distribute Workspaces Across Subscriptions

When scaling Azure Databricks, consider not just the number of workspaces, but also how to distribute them across Azure subscriptions. Using multiple Azure subscriptions can serve both organizational needs and technical requirements:

- Environment Segmentation (Dev/Test/Prod): A common pattern is to put production workspaces in a separate Azure subscription from development or test workspaces; Microsoft strongly recommends this separation. It provides an extra layer of isolation: you can apply stricter Azure policies and network rules to the prod subscription while keeping the dev subscription somewhat more open for experimentation, without risking prod resources.
- Honor Azure Resource Limits: Azure subscriptions come with certain capacity limits, and Azure Databricks workspaces have their own limits (since it's a multi-tenant PaaS). If you put all workspaces in one subscription, or all teams in one workspace, you might hit those limits. Most enterprises naturally end up with multiple subscriptions as they grow—planning this early avoids later migration headaches. If you currently have everything in one subscription, evaluate usage and consider splitting off heavy or production workloads into a new one to adhere to best practices.

1.4 Consider Completing an Azure Landing Zone Assessment

When evaluating and planning your next deployment, it's essential to ensure that your current landing zone aligns with Microsoft best practices. This helps establish a robust Databricks architecture and minimizes the risk of avoidable issues. Additionally, customers who are early in their cloud journey can benefit from Cloud Assessments—such as an Azure Landing Zone Review and a review of the "Prepare for Cloud Adoption" documentation—to build a strong foundation.

1.5 Planning Your Azure Databricks Workspace Architecture

Your workspace architecture should reflect the operational model of your organization and support the workloads you intend to run, from exploratory notebooks to production-grade ETL pipelines. To support your planning, Microsoft provides several reference architectures that illustrate well-architected patterns for Databricks deployments. These solution ideas can serve as starting points for designing maintainable environments:

- Simplified Architecture: Modern Data Platform Architecture
- ETL-Intensive Workload Reference Architecture: Building ETL Intensive Architecture
- End-to-End Analytics Architecture: Create a Modern Analytics Architecture

1.6 Planning for the "Right" Compute

Choosing the right compute setup in Azure Databricks is crucial for optimizing performance and controlling costs, as billing is based on Databricks Units (DBUs) using a per-second pricing model.

- Classic Compute: You can fine-tune your own compute by enabling auto-termination and autoscaling, using Photon acceleration, leveraging spot instances, selecting the right VM type and node count for your workload, and choosing SSDs for performance or HDDs for archival storage. Preferred by mature internal teams and developers who need advanced control over clusters—such as custom VM selection, tuning, and specialized configurations.
- Serverless Compute: Alternatively, managed serverless compute simplifies operations with built-in optimizations. It removes infrastructure management and offers instant scaling without cluster warm-up, making it ideal for agility and simplicity.

Step 2: Plan the "Right" CIDR Range (Classic Compute)

Note: You can skip this step if you plan to use serverless compute for all your resources, as CIDR range planning is not required in serverless deployments.

When planning CIDR ranges for your Azure Databricks workspace, it's important to ensure your virtual network has enough IP address capacity to support cluster scaling. Why this matters: if you choose a small VNet address space and your analytics workloads grow, you might hit a ceiling where you simply cannot launch more clusters or scale out because there are no free IPs in the subnet. The subnet sizes—and by extension, the VNet CIDR—determine how many nodes your clusters can scale to. Databricks recommends using a CIDR block between /16 and /24 for the VNet, and up to /26 for the two required subnets: the container subnet and the host subnet. Microsoft provides a sizing reference in its documentation. If your current workspace's VNet lacks sufficient IP space for active cluster nodes, you can request a CIDR range update through your Azure Databricks account team, as noted in the Microsoft documentation.

2.1 Considerations for CIDR Range

- Workload Type & Concurrency: Consider what kinds of workloads will run (ETL pipelines, machine learning notebooks, BI dashboards, etc.) and how many jobs or clusters may need to run in parallel. High concurrency (e.g. multiple ETL jobs or many interactive clusters) means more nodes running at the same time, requiring a larger pool of IP addresses.
- Data Volume (Historical vs. Incremental): Are you doing a one-time historical data load or only processing new incremental data? A large backfill of terabytes of data may require spinning up a very large cluster (hundreds of nodes) to process in a reasonable time. Ongoing smaller loads might get by with fewer nodes. Estimate how much data needs processing.
- Transformation Complexity: The complexity of data transformations or machine learning workloads matters. Heavy transformations (joins, aggregations on big data) or complex model training can benefit from more workers. If your use cases include these, you may need larger clusters (more nodes) to meet performance SLAs, which in turn demands more IP addresses available in the subnet.
- Data Sources and Integration: Consider how your Databricks environment will connect to data. If you have multiple data sources or sinks (e.g. ingest from many event hubs, databases, or IoT streams), you might design multiple dedicated clusters or workflows, potentially all active at once. Also, if using separate job clusters per job (Databricks Jobs), multiple clusters might launch concurrently. All these scenarios increase concurrent node count.
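To translate these considerations into a concrete node budget, the short sketch below estimates the maximum cluster nodes a subnet pair can support. It assumes Azure's standard five reserved addresses per subnet and that each Databricks node consumes one IP in the host subnet and one in the container subnet; treat the numbers as a planning aid, not an exact quota.

```python
# Rough capacity check for Databricks subnet sizing (planning aid only).
# Assumes Azure reserves 5 IP addresses per subnet and each cluster node
# needs one IP in the host subnet and one in the container subnet.

def max_nodes(subnet_prefix: int) -> int:
    """Usable node count for a /prefix subnet (host or container)."""
    total_ips = 2 ** (32 - subnet_prefix)
    return total_ips - 5  # Azure reserves 5 addresses per subnet

# The /25 subnets used later in the deployment section fall in this range.
for prefix in (26, 25, 24):
    print(f"/{prefix} subnets -> up to {max_nodes(prefix)} concurrent nodes")
# /26 -> 59, /25 -> 123, /24 -> 251
```

If your concurrency estimates from 2.1 exceed these numbers, choose larger subnets up front; resizing later requires engaging your account team.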
2.2 Configuring a Dedicated Network (VNet) per Workspace with Egress Control

By default, Azure Databricks deploys its classic compute resources into a Microsoft-managed virtual network (VNet) within your Azure subscription. While this simplifies setup, it limits control over network configuration. For enhanced security and flexibility, it's recommended to use VNet Injection, which allows you to deploy the compute plane into your own customer-managed VNet. This approach enables secure integration with other Azure services using service endpoints or private endpoints, supports user-defined routes for accessing on-premises data sources, allows traffic inspection via network virtual appliances or firewalls, and provides the ability to configure custom DNS and enforce egress restrictions through network security group (NSG) rules.

Within this VNet (which must reside in the same region and subscription as the Azure Databricks workspace), two subnets are required for Azure Databricks: a container subnet (referred to as the private subnet) and a host subnet (referred to as the public subnet). To implement front-end Private Link, back-end Private Link, or both, your workspace VNet needs a third subnet to contain the private endpoint (the PrivateLink subnet). It is also recommended to deploy an Azure Firewall for egress control.

Step 3: Plan Network Architecture for Securing Azure-Databricks

3.1 Secure Cluster Connectivity

Secure Cluster Connectivity, also known as No Public IP (NPIP), is a foundational security feature for Azure Databricks deployments. When enabled, it ensures that compute resources within the customer-managed virtual network (VNet) do not have public IP addresses, and no inbound ports are exposed. Instead, each cluster initiates a secure outbound connection to the Databricks control plane using port 443 (HTTPS), through a dedicated relay. This tunnel is used exclusively for administrative tasks, separate from the web application and REST API traffic, significantly reducing the attack surface. For the most secure deployment, Microsoft and Databricks strongly recommend enabling Secure Cluster Connectivity, especially in environments with strict compliance or regulatory requirements. When Secure Cluster Connectivity is enabled, both workspace subnets become private, as cluster nodes don't have public IP addresses.

3.2 Egress with VNet Injection (NVA)

For Databricks traffic, you'll need to assign a UDR (route table) to the Databricks workspace subnets in your VNet with a next hop type of Network Virtual Appliance (NVA)—this could be an Azure Firewall, NAT Gateway, or another routing device. For control plane traffic, Databricks recommends using Azure service tags, which are logical groupings of IP addresses for Azure services and should be routed with the next hop type of Internet. This is important because Azure IP ranges can change frequently as new resources are provisioned, and manually maintaining IP lists is not practical. Using service tags ensures that your routing rules automatically stay up to date. (A route table sketch appears at the end of this step.)

3.3 Front-End Connectivity with Azure Private Link (Standard Deployment)

To further enhance security, Azure Databricks supports Private Link for front-end connections. In a standard deployment, Private Link enables users to access the Databricks web application, REST API, and JDBC/ODBC endpoints over a private VNet interface, bypassing the public internet. For organizations with no public internet access from user networks, a browser authentication private endpoint is required. This endpoint supports SSO login callbacks from Microsoft Entra ID and is shared across all workspaces in a region using the same private DNS zone. It is typically hosted in a transit VNet that bridges on-premises networks and Azure.

Note: There are two deployment types: standard and simplified. To compare these deployment types, see Choose standard or simplified deployment.
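To make the routing guidance in section 3.2 concrete, here is a minimal sketch that creates a route table sending general egress to a firewall while keeping Azure Databricks control plane traffic on the Internet next hop via its service tag. It assumes the azure-mgmt-network package, and the resource group name, region, and firewall private IP (my-rg, eastus2, 10.28.2.4) are placeholders; adapt them to your environment.

```python
# Sketch: route table for a VNet-injected workspace (section 3.2).
# Assumes `pip install azure-identity azure-mgmt-network`; resource
# group, region, and firewall IP below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

client = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

client.route_tables.begin_create_or_update(
    "my-rg",
    "rt-databricks",
    {
        "location": "eastus2",
        "routes": [
            {   # Control plane traffic: the service tag keeps IPs current automatically.
                "name": "databricks-control-plane",
                "address_prefix": "AzureDatabricks",
                "next_hop_type": "Internet",
            },
            {   # Everything else: force through the firewall/NVA for inspection.
                "name": "default-egress-via-nva",
                "address_prefix": "0.0.0.0/0",
                "next_hop_type": "VirtualAppliance",
                "next_hop_ip_address": "10.28.2.4",
            },
        ],
    },
).result()
# Associate rt-databricks with both the host and container subnets afterwards.
```

The service tag route is the key design choice here: it spares you from chasing Azure IP range changes by hand.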
3.4 Serverless Compute Networking

Azure Databricks offers serverless compute options that simplify infrastructure management and accelerate workload execution. These resources run in a Databricks-managed serverless compute plane, isolated from the public internet and connected to the control plane via the Microsoft backbone network. To secure outbound traffic from serverless workloads, administrators can configure Serverless Egress Control using network policies that restrict connections by location, FQDN, or Azure resource type. Additionally, Network Connectivity Configurations (NCCs) allow centralized management of private endpoints and firewall rules. NCCs can be attached to multiple workspaces and are essential for enabling secure access to Azure services like Data Lake Storage from serverless SQL warehouses.

DEPLOYMENT: Step-by-Step Implementation using the Azure Portal

Step 1: Create an Azure Resource Group

For each new workspace, create a dedicated Resource Group (to contain the Databricks workspace resource and associated resources). Ensure that all resources are deployed in the same Region and Resource Group (i.e. workspace, subnets, and so on) to optimize data movement performance and enhance security.

Step 2: Deploy a Workspace-Specific Virtual Network (VNet)

- From your Resource Group, create a Virtual Network.
- Under the Security section, enable Azure Firewall. Deploying an Azure Firewall is recommended for egress control, ensuring that outbound traffic from your Databricks environment is securely managed.
- Define address spaces for your Virtual Network (review Step 2 from the Design section). As documented, you could create a VNet with these values: first remove the default IP range, then add IP range 10.28.0.0/23. Create subnet public-subnet with range 10.28.0.0/25, subnet private-subnet with range 10.28.0.128/25, and subnet private-link with range 10.28.1.0/27. Please note: your IP values can differ depending on your IPAM and available scopes.
- Review + Create your Virtual Network.

Step 3: Deploy the Azure-Databricks Workspace

Now that networking is in place, create the Databricks workspace. Below are detailed steps your organization should review while creating the workspace:

- In the Azure Portal, search for Azure Databricks and click Create.
- Choose the Subscription, Resource Group, and Region, select the Premium tier, enter a "Managed Resource Group name", and click Next. The Managed Resource Group will be created after your Databricks workspace is deployed and contains infrastructure resources for the workspace (e.g., VNets, DBFS).
- Required: Enable "Secure Cluster Connectivity" (No Public IP for clusters) to ensure that Databricks clusters are deployed without public IP addresses (review Section 3.1).
- Required: Enable the option to deploy into your Virtual Network (VNet Injection), also known as "Bring Your Own VNet" (review Section 3.2). Select the Virtual Network created in Step 2 and enter the Private and Public Subnet names.
- Enable or disable "Deploying NAT Gateway", according to your workspace requirement.
- Disable "Allow Public Network Access". Select "No Azure Databricks Rules" for Required NSG Rules.
- Select "Click on add to create a private endpoint"; this will open a panel for private endpoint setup. Click "Add" to enter your Private Link details created in Step 2. Also, ensure that Private DNS zone integration is set to "Yes" and that a new Private DNS Zone is created, indicated by (New)privatelink.azuredatabricks.net—unless an existing DNS zone for this purpose already exists.
- (Optional) Under the Encryption tab, enable Infrastructure Encryption if you have a requirement for FIPS 140-2 or a regulatory standard such as HIPAA. It comes at a cost, since double encryption adds encrypt/decrypt time; by default your data is already encrypted once.
- (Optional) Compliance security profile, for HIPAA.
- (Optional) Automatic cluster updates, first Sunday of every month.
- Review + Create the workspace and wait for it to deploy.
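If you later automate Step 3 (for example, in CI/CD alongside the Terraform route mentioned in the introduction), the same portal settings map onto the Azure management SDK. A minimal sketch, assuming the azure-mgmt-databricks package; all resource names are placeholders, and you should verify the property names against the current SDK before relying on it.

```python
# Sketch: the portal settings from Step 3 expressed via the management SDK.
# Assumes `pip install azure-identity azure-mgmt-databricks`; names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.databricks import AzureDatabricksManagementClient

client = AzureDatabricksManagementClient(DefaultAzureCredential(), "<subscription-id>")

client.workspaces.begin_create_or_update(
    "my-rg",
    "dbw-analytics-prod",
    {
        "location": "eastus2",
        "sku": {"name": "premium"},
        "managed_resource_group_id": (
            "/subscriptions/<subscription-id>/resourceGroups/mrg-dbw-analytics-prod"
        ),
        "public_network_access": "Disabled",           # front-end Private Link only
        "required_nsg_rules": "NoAzureDatabricksRules",
        "parameters": {
            "enable_no_public_ip": {"value": True},    # Secure Cluster Connectivity
            "custom_virtual_network_id": {"value": "<vnet-resource-id>"},
            "custom_public_subnet_name": {"value": "public-subnet"},
            "custom_private_subnet_name": {"value": "private-subnet"},
        },
    },
).result()
```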
Step 4: Create a Private Endpoint to Support SSO for Web Browser Access

Note: This step is required when front-end Private Link is enabled and client networks cannot access the public internet.

After creating your Azure Databricks workspace, if you try to launch it without the proper Private Link configuration, you will see an error stating that the workspace cannot be reached. This happens because the workspace is configured to block public network access, and the necessary Private Endpoints (including the browser_authentication endpoint for SSO) are not yet in place.

Create Web-Auth Workspace
- Note: Deploy a "dummy" WEB_AUTH_DO_NOT_DELETE_<region> workspace in the same region as your production workspace. Purpose: host the browser_authentication private endpoint (one is required per region). Lock the workspace (Delete lock) to prevent accidental removal.
- Follow Step 2 to create a Virtual Network (VNet).
- Follow Step 3 to create a VNet-injected "dummy" workspace.

Create Browser Authentication Private Endpoint
- In the Azure Portal: Databricks workspace (dummy) > Networking > Private endpoint connections > + Private endpoint.
- Resource step: Target sub-resource: browser_authentication.
- Virtual Network step: VNet: the Transit/Hub VNet (central network for Private Link). Subnet: the Private Endpoint subnet in that VNet (not the Databricks host subnets).
- DNS step: Integrate with Private DNS zone: Yes. Zone: privatelink.azuredatabricks.net. Ensure the DNS zone is linked to the Transit VNet.
- After creation: A-records for *.pl-auth.azuredatabricks.net are auto-created in the DNS zone.

Workspace Connectivity Testing
If you have VPN or ExpressRoute connectivity, Bastion is not required. If you don't have private connectivity and need to test from inside the VNet, Azure Bastion is a convenient option; for the purposes of this article, we will test our workspace connectivity through Bastion.

Step 5: Create a Storage Account
- From your Resource Group, click Create and select Storage account. On the configuration page: select the preferred storage type (Azure Blob Storage or Azure Data Lake Storage Gen2), choose Performance and Redundancy options based on your business requirements, and click Next to proceed.
- Under the Advanced tab: enable Hierarchical namespace under Data Lake Storage Gen2. This is critical for directory- and file-level operations and Access Control Lists (ACLs).
- Under the Networking tab: set Public Network Access to Disabled.
- Complete the creation process and then create container(s) inside the storage account.

Step 6: Create Private Endpoints for the Workspace Storage Account
Pre-requisite: You need to create two private endpoints from the VNet used for VNet injection to your workspace storage account, for the following target sub-resources: dfs and blob.
- Navigate to your Storage Account. Go to Networking > Private Endpoints and click + Create Private Endpoint.
- In the Create Private Endpoint wizard: Resource tab: select your Storage Account and set Target sub-resource to dfs for the first endpoint. Virtual Network tab: choose the VNet you used for VNet injection and select the appropriate subnet.
- Complete the creation process. The private endpoint will be auto-approved and visible under Private Endpoints.
- Repeat the process for the second private endpoint, this time setting Target sub-resource to blob.
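Before wiring up Unity Catalog, it's worth confirming that the private endpoints from Step 6 actually resolve to private IPs from inside the VNet (for example, from a Bastion-connected VM). A minimal sketch; the storage account name is a placeholder.

```python
# Quick check that storage endpoints resolve to private IPs inside the VNet.
# Run from a VM in the VNet (e.g., via Bastion); the account name is a placeholder.
import ipaddress
import socket

for host in (
    "mystorageacct.dfs.core.windows.net",   # dfs private endpoint (Step 6)
    "mystorageacct.blob.core.windows.net",  # blob private endpoint (Step 6)
):
    ip = ipaddress.ip_address(socket.gethostbyname(host))
    status = "OK (private)" if ip.is_private else "WARNING: public IP"
    print(f"{host} -> {ip} {status}")
```

If a name still resolves to a public IP, check that the privatelink DNS zones for blob and dfs are linked to the VNet.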
Step 7: Link Storage and Databricks Workspace: Create an Access Connector
- In your Resource Group, create an Access Connector for Azure Databricks. No additional configuration is required during creation.
- Assign a role to the Access Connector: navigate to your Storage Account > Access Control (IAM) > Add role assignment. Select Role: Storage Blob Data Contributor, and Assign access to: Managed Identity. Under Members, click Select members, find and select your newly created Access Connector for Azure Databricks, and save the role assignment.
- Copy the Resource ID: go to the Access Connector Overview page and copy the Resource ID for later use in the Databricks configuration.

Step 8: Link Storage and Databricks Workspace: Navigate to Unity Catalog
- In your Databricks Workspace, go to Unity Catalog > External Data and select the "Create external location" button.
- Configure the External Location: select ADLS as the storage type and enter the ADLS storage URL in the following format, updating the <container_name> and <storage_account_name> parameters: abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/
- Provide the Access Connector: select "Create new storage credential" from the Storage credential field and paste the Resource ID of the Access Connector for Azure Databricks (copied in Step 7) into the Access Connector ID field.
- Validate the connection: click Submit. You should receive a "Successful" message, indicating your connection has succeeded. You can now create Catalogs and link your secure storage.

Step 9: Configure Serverless Compute Networking
If your organization plans to use Serverless SQL Warehouses or Serverless Jobs Compute, you must configure serverless networking.
- Add a Network Connectivity Configuration (NCC): go to the Databricks Account Console (https://accounts.azuredatabricks.net/), navigate to Cloud resources, click Add Network Connectivity Configuration, fill in the required fields, and create a new NCC.
- Associate the NCC with your workspace: in the Account Console, go to Workspaces, select your workspace, and click Update Workspace. From the Network Connectivity Configuration dropdown, select the NCC you just created.
- Add a Private Endpoint Rule: in Cloud resources, select your NCC, select Private Endpoint Rules, and click Add Private Endpoint Rule. Provide the Resource ID of your Storage Account (found from your storage account by clicking "JSON View" at the top right). Azure sub-resource type: dfs & blob.
- Approve the pending connection: go to your Storage Account > Networking > Private Endpoints. You will see a pending connection from Databricks. Approve the connection, and the connection status in your Account Console will show as ESTABLISHED.

Step 10: Test Your Workspace
Launch a small test cluster and verify the following:
- It can start (which means it can talk to the control plane).
- It can read and write from storage; a short notebook sketch follows this step.
- The Private DNS records have been created.
- (Optional) If on-prem data is needed: try connecting to an on-prem database (using the ExpressRoute path): Connect your Azure Databricks workspace to your on-premises network - Azure Databricks | Microsoft Learn.
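A minimal read/write check for Step 10, run from a notebook in the new workspace. It assumes the external location from Step 8 is in place (so the access connector supplies credentials and no Spark properties with account keys are needed); the container and account names are placeholders.

```python
# Step 10 sketch: confirm the cluster can write to and read from ADLS
# through the Unity Catalog external location created in Step 8.
# Container and account names are placeholders.
path = "abfss://raw@mystorageacct.dfs.core.windows.net/connectivity_check"

spark.range(10).write.format("delta").mode("overwrite").save(path)  # write test
count = spark.read.format("delta").load(path).count()               # read it back
assert count == 10, f"expected 10 rows, got {count}"
print("Storage read/write check passed")
```

If this fails with a network timeout rather than a permissions error, revisit the private endpoints from Step 6; if it fails with an authorization error, revisit the role assignment from Step 7.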
Step 11: Account Console, Planning Workspace Access Controls, and Getting Started

Once your Azure Databricks workspace is deployed, it's essential to configure access controls and begin onboarding users with the right permissions. From your account console (https://accounts.azuredatabricks.net/), you can centrally manage your environment: add users and groups, enable preview features, and view or configure all your workspaces. Azure Databricks supports fine-grained access management through Unity Catalog, cluster policies, and workspace-level roles. Start by defining who needs access to what—whether it's notebooks, tables, jobs, or clusters—and apply least-privilege principles to minimize risk.

DBFS Limitation: DBFS is automatically created upon Databricks workspace creation and can be found in your Managed Resource Group. Databricks cannot secure DBFS. If there is a business need to avoid DBFS, you can disable DBFS access following the instructions here: Disable access to DBFS root and mounts in your existing Azure Databricks workspace.

Use Unity Catalog to manage data access across catalogs, schemas, and tables, and consider implementing cluster policies to standardize compute configurations across teams. To help your teams get started, Microsoft provides a range of tutorials and best practice guides: Best practice articles - Azure Databricks | Microsoft Learn.

Step 12: Planning Data Migration

As you prepare to move data into your Azure Databricks environment, it's important to assess your migration strategy early. This includes identifying source systems, estimating data volumes, and determining the appropriate ingestion methods—whether batch, streaming, or hybrid. For organizations with complex migration needs or legacy systems, Microsoft offers specialized support through its internal Azure Cloud Accelerated Factory program. Reach out to your Microsoft account team to explore nomination for Azure Cloud Accelerated Factory, which provides hands-on guidance, tooling, and best practices to accelerate and streamline your data migration journey.

Summary

Regular maintenance and governance are as important as the initial design. Continuously review the environment and update configurations as needed to address evolving requirements and threats. For example, tag all resources (workspaces, VNets, clusters, etc.) with clear identifiers (workspace name, environment, department) to track costs and ownership effectively. Additionally, enforce least privilege across the platform: ensure that only necessary users are given admin privileges, and use cluster-level access control to restrict who can create or start clusters. By following the above steps, an organization will have an Azure Databricks architecture that is securely isolated, well-governed, and scalable.

References:
[1] 5 Best Practices for Databricks Workspaces, AzureDatabricksBestPractices/toc.md (GitHub)
Deploy a workspace using the Azure Portal

Additional Links:
- Quick Introduction to Databricks: what is databricks | introduction - databricks for dummies
- Connect Purview with Azure Databricks: Integrating Microsoft Purview with Azure Databricks
- Secure Databricks Delta Share between Workspaces: Secure Databricks Delta Share for Serverless Compute
- Azure-Databricks Cost Optimization Guide: Databricks Cost Optimization: A Practical Guide
- Integrate Azure Databricks with Microsoft Fabric: Integrating Azure Databricks with Microsoft Fabric
- Databricks Solution Accelerators for Data & AI
- Azure updates

Appendix

3.5 Understanding Data Transfer (ExpressRoute vs. Public Internet)

For data transfers, your organization must decide whether to use ExpressRoute or Internet egress. Several considerations can help you determine your choice:

3.5.1 Connectivity Model
- ExpressRoute: Provides a private, dedicated connection between your on-premises infrastructure and Microsoft Azure. It bypasses the public internet entirely and connects through a network service provider.
- Internet Egress: Refers to outbound data traffic from Azure to the public internet. This is the default path for most Azure services unless configured otherwise.

3.6 Planning for User-Defined Routes (UDRs)

When working with Databricks deployments—especially VNet-injected workspaces—setting up User-Defined Routes (UDRs) is a smart move. It's a best practice that helps manage and secure network traffic more effectively. By using UDRs, teams can steer traffic between Databricks components and external services in a controlled way, which not only boosts security but also supports compliance efforts.

3.6.1 UDRs and Hub-and-Spoke Topology
If your Databricks workspace is deployed into your own virtual network (VNet), you'll need to configure standard user-defined routes (UDRs) to manage traffic flow. In a typical hub-and-spoke architecture, UDRs are used to route all traffic from the spoke VNets to the hub VNet.

3.6.2 Hub and Spoke with a VWAN Hub
If your Databricks workspace is deployed into your own virtual network (VNet) and is peered to a Virtual WAN (VWAN) hub as the primary connectivity hub into Azure, a user-defined route (UDR) is not required—provided that a private traffic routing policy or internet traffic routing policy is configured in the VWAN hub.

3.6.3 Use of NVAs and Service Tags
As covered in section 3.2, assign a UDR to the Databricks workspace subnets with a next hop type of Network Virtual Appliance (NVA)—this could be an Azure Firewall, NAT Gateway, or another routing device—and route control plane traffic using Azure service tags with the next hop type of Internet. Because Azure IP ranges can change frequently as new resources are provisioned, manually maintaining IP lists is not practical; service tags ensure that your routing rules automatically stay up to date.

3.6.4 Default Outbound Access Retirement (Non-Serverless Compute)
Microsoft is retiring default outbound internet access for new deployments starting September 30, 2025. Going forward, outbound connectivity will require explicit configuration using an NVA, NAT Gateway, Load Balancer, or Public IP address. Note that using a Public IP address in the deployment is discouraged for security purposes; instead, it is recommended to deploy the workspace in a Secure Cluster Connectivity configuration.
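If you opt for a NAT Gateway to satisfy the explicit-outbound requirement in 3.6.4, the association follows the usual Azure pattern: create the gateway with a public IP, then attach it to both workspace subnets. A sketch assuming the azure-mgmt-network package; resource names are placeholders, and where possible prefer enabling the "Deploying NAT Gateway" option at workspace creation time (Step 3) rather than modifying the delegated subnets afterward.

```python
# Sketch for 3.6.4: explicit outbound via NAT Gateway instead of default access.
# Assumes `pip install azure-identity azure-mgmt-network`; names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient
from azure.mgmt.network.models import SubResource

client = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Create a NAT gateway backed by an existing standard public IP.
natgw = client.nat_gateways.begin_create_or_update(
    "my-rg",
    "natgw-databricks",
    {
        "location": "eastus2",
        "sku": {"name": "Standard"},
        "public_ip_addresses": [{"id": "<public-ip-resource-id>"}],
    },
).result()

# Attach it to both Databricks subnets so cluster egress uses the NAT gateway.
for subnet_name in ("public-subnet", "private-subnet"):
    subnet = client.subnets.get("my-rg", "vnet-databricks", subnet_name)
    subnet.nat_gateway = SubResource(id=natgw.id)
    client.subnets.begin_create_or_update(
        "my-rg", "vnet-databricks", subnet_name, subnet
    ).result()
```

As with the route table sketch in Step 3 of the design section, apply this before tightening NSG rules so cluster startup traffic is never left without an outbound path.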