Don't miss Building Agents with Microsoft Foundry and Microsoft Foundry Agent Service!
Our dynamic four-part webinar series, Agentic AI + Copilot Partner Skilling Accelerator, empowers you to harness the Microsoft AI ecosystem to unlock new revenue streams and enhance customer success. Across the four sessions, Microsoft partners can expect to learn how to apply AI tools in no-code, low-code, and pro-code scenarios to build intelligent chat and workflow solutions, extend and customize capabilities, and create advanced, custom AI functionality.

Don't miss the final session in the series, Building Agents with Microsoft Foundry and Microsoft Foundry Agent Service, where you'll learn how to design and deploy intelligent agents with Microsoft Foundry and Microsoft Foundry Agent Service, including multi-agent architectures and key protocols such as A2A and MCP. The live virtual event is scheduled for December 15, 2025. Register today to reserve your spot!

Be sure to follow this Partner news blog for all partner-related announcements by clicking follow above!

Trusted Signing Public Preview Update
Nearly a year ago we announced the Public Preview of Trusted Signing, with availability for organizations with three years or more of verifiable history to onboard to the service and get a fully managed code signing experience that simplifies the effort for Windows app developers. Over the past year, we've announced new features, including preview support for individual developers, and we highlighted how the service contributes to the Windows security story at Microsoft BUILD 2024 in the Unleash Windows App Security & Reputation with Trusted Signing session.

During the Public Preview, we have gained valuable insights from our customers about the service's features, the developer experience, and the experience for Windows users. As we incorporate this feedback and learning into our General Availability (GA) release, we are limiting new customer subscriptions for the remainder of the public preview. This approach will allow us to focus on refining the service based on the feedback and data collected during the preview phase.

The limit on new customer subscriptions for Trusted Signing will take effect Wednesday, April 2, 2025, and makes the service available only to US- and Canada-based organizations with three years or more of verifiable history. Onboarding for individual developers and all other organizations will not be directly available for the remainder of the preview, and we look forward to expanding the service's availability as we approach GA.

Note that this announcement does not impact any existing subscribers of Trusted Signing; the service will continue to be available to them as it has been throughout the Public Preview. For additional information about Trusted Signing, please refer to Trusted Signing documentation | Microsoft Learn and Trusted Signing FAQ | Microsoft Learn.

Partner Case Study | Tiger Analytics
In every industry, preventing equipment downtime and maintaining operational continuity is a top priority. Preventive maintenance—making rounds, doing routine maintenance, and logging data—has long been a manual, tedious process, but with recent advances in AI, predicting and preventing failures and downtime is becoming more manageable than ever before.

Tiger Analytics is a solutions integrator and Microsoft partner that implements AI-powered solutions. The organization has all three Solutions Partner designations for Azure, as well as three specializations: AI Platform on Microsoft Azure (formerly AI and Machine Learning on Microsoft Azure), Build AI Apps on Microsoft Azure, and Analytics on Microsoft Azure. They believe in using AI to transform operations for customers across industries. Their commitment to using AI to develop smart solutions empowered them to help a medical device company develop a predictive maintenance solution for its radiation therapy devices.

Continue reading here.
Explore all case studies or submit your own.
Subscribe to the case studies tag to follow all new case study posts. Don't forget to follow this blog to receive email notifications of new stories!

Staying in the flow: SleekFlow and Azure turn customer conversations into conversions
A customer adds three items to their cart but never checks out. Another asks about shipping, gets stuck waiting eight minutes, only to drop the call. A lead responds to an offer but never gets a timely follow-up. Each of these moments represents lost revenue, and they happen to businesses every day.

SleekFlow was founded in 2019 to help companies turn those almost-lost-customer moments into connection, retention, and growth. Today we serve more than 2,000 mid-market and enterprise organizations across industries including retail and e-commerce, financial services, healthcare, travel and hospitality, telecommunications, real estate, and professional services. In total, those customers rely on SleekFlow to orchestrate more than 600,000 daily customer interactions across WhatsApp, Instagram, web chat, email, and more.

Our name reflects what makes us different. Sleek is about unified, polished experiences—consolidating conversations into one intelligent, enterprise-ready platform. Flow is about orchestration—AI and human agents working together to move each conversation forward, from first inquiry to purchase to renewal.

The drive for enterprise-ready agentic AI

Enterprises today expect always-on, intelligent conversations—but delivering that at scale proved daunting. When we set out to build AgentFlow, our agentic AI platform, we quickly ran into familiar roadblocks: downtime that disrupted peak-hour interactions, vector search delays that hurt accuracy, and costs that ballooned under multi-tenant workloads. Development slowed because of limited compatibility with other technologies, while customer onboarding stalled without clear compliance assurances. To move past these barriers, we needed a foundation that could deliver the performance, trust, and global scale enterprises demand.

The platform behind the flow: How Azure powers AgentFlow

We chose Azure because building AgentFlow required more than raw compute power. Chatbots built on a single-agent model often stall out. They struggle to retrieve the right context, they miss critical handoffs, and they return answers too slowly to keep a customer engaged. To fix that, we needed an ecosystem capable of supporting a team of specialized AI agents working together at enterprise scale.

Azure Cosmos DB provides the backbone for memory and context, managing short-term interactions, long-term histories, and vector embeddings in containers that respond in 15–20 milliseconds. Powered by Azure AI Foundry, our agents use Azure OpenAI models within Azure AI Foundry to understand and generate responses natively in multiple languages. Whether in English, Chinese, or Portuguese, the responses feel natural and aligned with the brand. Semantic Kernel acts as the conductor, orchestrating multiple agents, each of which retrieves the necessary knowledge and context, including chat histories, transactional data, and vector embeddings, directly from Azure Cosmos DB. For example, one agent could be retrieving pricing data, another summarizing it, and a third preparing it for a human handoff.

The result is not just responsiveness but accuracy. A telecom provider can resolve a billing question while surfacing an upsell opportunity in the same dialogue. A financial advisor can walk into a call with a complete dossier prepared in seconds rather than hours. A retailer can save a purchase by offering an in-stock substitute before the shopper abandons the cart. Each of these conversations is different, yet the foundation is consistent on AgentFlow.
Fast, fluent, and focused: Azure keeps conversations moving

Speed is the heartbeat of a good conversation. A delayed answer feels like a dropped call, and an irrelevant one breaks trust. For AgentFlow to keep customers engaged, every operation behind the scenes has to happen in milliseconds. A single interaction can involve dozens of steps. One agent pulls product information from embeddings, another checks it against structured policy data, and a third generates a concise, brand-aligned response. If any of these steps lag, the dialogue falters. On Azure, they don't.

Azure Cosmos DB manages conversational memory and agent state across dedicated containers for short-term exchanges, long-term history, and vector search. Sharded DiskANN indexing powers semantic lookups that resolve in the 15–20 millisecond range—fast enough that the customer never feels a pause. Microsoft's Phi-4 model, as well as Azure OpenAI in Foundry Models such as o3-mini and o4-mini, provides the reasoning, and Azure Container Apps scales elastically, so performance holds steady during event-driven bursts, such as campaign broadcasts that can push the platform from a few to thousands of conversations per minute, and during daily peak-hour surges.

To support that level of responsiveness, we run Azure Container Apps on the pay-as-you-go consumption plan, using KEDA-based autoscaling to expand from five idle containers to more than 160 within seconds. Meanwhile, Microsoft Orleans coordinates lightweight in-memory clustering to keep conversations sleek and flowing.

The results are tangible. Retrieval-augmented generation recall improved from 50 to 70 percent. Execution speed is about 50 percent faster. For SleekFlow's customers, that means carts are recovered before they're abandoned, leads are qualified in real time, and support inquiries move forward instead of stalling out. With Azure handling the complexity under the hood, conversations flow naturally on the surface—and that's what keeps customers engaged.

Secure enough for enterprises, human enough for customers

AgentFlow was built with security-by-design as a first principle, giving businesses confidence that every interaction is private, compliant, and reliable. On Azure, every AI agent operates inside guardrails enterprises can depend on. Azure Cosmos DB enforces strict per-tenant isolation through logical partitioning, encryption, and role-based access control, ensuring chat histories, knowledge bases, and embeddings remain auditable and contained. Models deployed through Azure AI Foundry, including Azure OpenAI and Microsoft Phi, process data entirely within SleekFlow's Azure environment, which guarantees it is never used to train public models, with activity logged for transparency. And Azure's certifications—including ISO 27001, SOC 2, and GDPR—are backed by continuous monitoring and regional data residency options, proving compliance at a global scale.

But trust is more than a checklist of certifications. AgentFlow brings human-like fluency and empathy to every interaction, powered by Azure OpenAI running with high token-per-second throughput so responses feel natural in real time. Quality control isn't left to chance. Human override workflows are orchestrated through Azure Container Apps and Azure App Service, ensuring AI agents can carry conversations confidently until they're ready for human agents. Enterprises gain the confidence to let AI handle revenue-critical moments, knowing Azure provides the foundation and SleekFlow provides the human-centered design.
Shaping the next era of conversational AI on Azure

The benefits of Azure show up not only in customer conversations but also in the way our own teams work. Faster processing speeds and high token-per-second throughput reduce latency, so we spend less time debugging and more time building. Stable infrastructure minimizes downtime and troubleshooting, lowering operational costs.

That same reliability and scalability have transformed the way we engineer AgentFlow. AgentFlow started as part of our monolithic system. Shipping new features used to take about a month of development and another week of heavy testing to make sure everything held together. After moving AgentFlow to a microservices architecture on Azure Container Apps, we can now deploy updates almost daily with no downtime or customer impact, thanks to native support for rolling updates and blue-green deployments.

This agility is what excites us most about what's ahead. With Azure as our foundation, SleekFlow is not simply keeping pace with the evolution of conversational AI—we are shaping what comes next. Every interaction we refine, every second we save, and every workflow we streamline brings us closer to our mission: keeping conversations sleek, flowing, and valuable for enterprises everywhere.

Pantone's Palette Generator enhances creative exploration with agentic AI on Azure
Color can be powerful. When creative professionals shape the mood and direction of their work, color plays a vital role because it provides context and cues for the end product or creation. For more than 60 years, creatives from all areas of design—including fashion, product, and digital—have turned to Pantone color guides to translate inspiration into precise, reproducible color choices. These guides offer a shared language for colors, as well as inspiration and communication across industries. Once rooted in physical tools, Pantone has evolved to meet the needs of modern creators through its trend forecasting, consulting services, and digital platform.

Today, Pantone Connect and its multi-agent solution, the Pantone Palette Generator, seamlessly bring color inspiration and accuracy into everyday design workflows (as well as the New York City mayoral race). Simply by typing in a prompt, designers can generate palettes in seconds. Available in Pantone Connect, the tool uses Azure services like Microsoft Foundry, Azure AI Search, and Azure Cosmos DB to serve up the company's vast collection of trend and color research from the color experts at the Pantone Color Institute. "… reached in seconds instead of days. Now, with Microsoft Foundry, creatives can use agents to get instant color palettes and suggestions based on human insights and trend direction."

Turning Pantone's color legacy into an AI offering

The Palette Generator accelerates the process of researching colors and helps designers find inspiration or validate some of their ideas through trend-backed research. "Pantone wants to be where our customers are," says Rohani Jotshi, Director of Software Engineering and Data at Pantone. "As workflows become increasingly digital, we wanted to give our customers a way to find inspiration while keeping the same level of accuracy and trust they expect from Pantone."

The Palette Generator taps into thousands of articles from Pantone's Color Insider library, as well as trend guides and physical color books, in a way that preserves the company's color standards science while streamlining the creative process. Built entirely on Microsoft Foundry, the solution uses Azure AI Search for agentic retrieval-augmented generation (RAG) and Azure OpenAI in Foundry Models to reason over the data. It quickly serves up palette options in response to questions like "Show me soft pastels for an eco-friendly line of baby clothes" or "I want to see vibrant metallics for next spring."

Over the course of two months, the Pantone team built the initial proof of concept for the Palette Generator, using GitHub Copilot to streamline the process and save over 200 hours of work across multiple sprints. This allowed Pantone's engineers to focus on improving prompt engineering, adding new agent capabilities, and refining orchestration logic rather than writing repetitive code.

Building a multi-agent architecture that accelerates creativity

The Pantone team worked with Microsoft to develop the multi-agent architecture, which is made up of three connected agents. Using Microsoft Agent Framework—an open source development kit for building AI orchestration systems—it was a straightforward process to bring the agents together into one workflow. "The Microsoft team recommended Microsoft Agent Framework, and when we tried it, we saw how it was extremely fast and easy to create architectural patterns," says Kristijan Risteski, Solutions Architect at Pantone.
"With Microsoft Agent Framework, we can spin up a model in five lines of code to connect our agents."

When a user types in a question, they interact with an orchestrator agent that routes prompts and coordinates the more specialized agents. Behind the scenes, an additional agent retrieves contextually relevant insights from Pantone's proprietary Color Insider dataset. Using Azure AI Search with vectorized data indexing, this agent interprets the semantics of a user's query rather than relying solely on keywords. A third agent then applies rules from color science to assemble a balanced palette. This agent ensures the output is a color combination that meets harmony, contrast, and accessibility standards. The result is a set of Pantone-curated colors that match the emotional and aesthetic tone of the request. "All of this happens in seconds," says Risteski.

To manage conversation flow and achieve long-term data persistence, Pantone uses Azure Cosmos DB, which stores user sessions, prompts, and results. The database not only enables designers to revisit past palette explorations but also provides Pantone with valuable usage intelligence to refine the system over time. "We use Azure Cosmos DB to track inputs and outputs," says Risteski. "That data helps us fine-tune prompts, measure engagement, and plan how we'll train future models."

Improving accuracy and performance with Azure AI Search

With Azure AI Search, the Palette Generator can understand the nuance of color language. Instead of relying solely on keyword searches that might miss the complexity of words like "vibrant" or "muted," Pantone's team decided to use a vectorized index for more accurate palette results. Using the built-in vectorization capability of Azure AI Search, the team converted their color knowledge base—including text-based color psychology and trend articles—into numerical embeddings. "Overall, vector search gave us better results because it could understand the intent of the prompt, not just the words," says Risteski. "If someone types, 'Show me colors that feel serene and oceanic,' the system understands intent. It finds the right references across our color psychology and trend archives and delivers them instantly."

The team also found ways to reduce latency as they evolved their proof of concept. Initially, they encountered slow inference times and performance lags when retrieving search results. By switching from GPT-4.1 to GPT-5, latency improved. And using Azure AI Search to manage ranking and filtering of results helped reduce the number of calls to the large language model (LLM). "With Azure, we just get the articles, put them in a bucket, and say 'index it now,'" says Risteski. "It takes one or two minutes—and that's it. The results are so much better than traditional search."

Moving from inspiration to palettes faster

The Palette Generator has transformed how designers and color enthusiasts interact with Pantone's expertise. What once took weeks of research and review can now be done in seconds. "Typically, if someone wanted to develop a palette for a product launch, it might take many months of research," says Jotshi. "Now, they can type one sentence to describe their inspiration and immediately find Pantone-backed insight and options.
Human curation will still be hugely important, but a strong set of starting options can significantly accelerate the palette development process."

Expanding the palette: The next phase for Pantone's design agent

Rapidly launching the Palette Generator in beta has redefined what the Pantone engineering team thought was possible. "We're a small development team, but with Azure we built an enterprise-grade AI system in a matter of weeks," says Risteski. "That's a huge win for us."

Next up, the team plans to migrate the entire orchestration layer to Azure Functions, moving to a fully scalable, serverless deployment. This will allow Pantone to run its agents more efficiently, handle variable workloads automatically, and integrate seamlessly with other Azure products such as Microsoft Foundry and Azure Cosmos DB. At the same time, Pantone plans to expand its multi-agent system to include new specialized agents, including one focused on palette harmony and another focused on trend prediction.

End-to-End Observability for Azure Databricks: From Infrastructure to Internal Application Logging
Authors: Peter Lo, Amudha Palani, Geoffrey Rathinapandi, and Rafia Aqil

Observability in Azure Databricks is the ability to continuously monitor and troubleshoot the health, performance, and usage of data workloads by capturing metrics, logs, and traces. In a structured observability approach, we consider two broad categories of logging: Internal Databricks Logging (within the Databricks environment) and External Databricks Logging (leveraging Azure services). Each plays a distinct role in providing insights. By combining internal and external observability mechanisms, organizations can achieve a comprehensive view: internal logs enable detailed analysis of Spark jobs and data quality, while external logs ensure global visibility, auditing, and integration with broader monitoring dashboards and alerting systems.

The article is organized into two main sections:
- Infrastructure Logging for Azure Databricks (external observability)
- Internal Databricks Logging (in-platform observability)

Considerations

Addressing key questions upfront ensures your observability strategy is tailored to your organization's unique workloads, risk profile, and operational needs. By proactively evaluating what to monitor, where to store logs, and who needs access, you can avoid blind spots, streamline incident response, and align monitoring investments with business priorities.

What types of workloads are running in Databricks?
- Why it matters: Different workloads (e.g., batch ETL, streaming pipelines, ML training, interactive notebooks) have distinct performance profiles and failure modes.
- Business impact: Understanding workload types helps prioritize monitoring for mission-critical processes like real-time fraud detection or daily financial reporting.

What failure scenarios need to be monitored?
- Examples: Job failures, cluster provisioning errors, quota limits, authentication issues.
- Business impact: Early detection of failures reduces downtime, improves SLA adherence, and prevents data pipeline disruptions that could affect reporting or customer-facing analytics.

Where should logs be stored and analyzed?
- Options: A centralized Log Analytics workspace, Azure Storage for archival, Event Hub for streaming analysis.
- Business impact: Centralized logging enables unified dashboards, cross-team visibility, and faster incident response across data engineering, operations, and compliance teams.

Who needs access to logs and alerts?
- Stakeholders: Data engineers, platform administrators, security analysts, compliance officers.
- Business impact: Role-based access ensures that the right teams can act on insights while maintaining data governance and privacy controls.

Infrastructure Logging for Azure Databricks

Approach 1: Diagnostic Settings for Azure Databricks

Diagnostic settings in Azure Monitor allow you to capture detailed logs and metrics from your Azure Databricks workspace, supporting operational monitoring, troubleshooting, and compliance. By configuring diagnostic settings at the workspace level, administrators can route Databricks logs—including cluster events, job statuses, and audit logs—to destinations such as Log Analytics, Azure Storage, or Event Hub. This enables unified analysis, alerting, and long-term retention of critical operational data.

Configuration overview: Enable diagnostic settings on the Databricks workspace to route logs to a Log Analytics workspace. These logs can also be combined with the other log sources described below for full Azure Databricks observability.
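Once diagnostic logs land in a Log Analytics workspace, they can be queried programmatically as well as through the portal. Below is a minimal sketch using the azure-monitor-query SDK; the workspace ID is a placeholder, and the DatabricksJobs table and ActionName column are assumptions to verify against the diagnostic log reference for the categories you enable.

```python
# Minimal sketch: query Databricks diagnostic logs from Log Analytics.
# Assumptions: the workspace ID is a placeholder, and the DatabricksJobs
# table / ActionName column should be checked against the log reference.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

response = client.query_workspace(
    workspace_id="<log-analytics-workspace-guid>",  # placeholder
    query="""
        DatabricksJobs
        | where TimeGenerated > ago(1d)
        | summarize Events = count() by ActionName
        | order by Events desc
    """,
    timespan=timedelta(days=1),
)

# Print each row of the result tables (column layout follows the query)
for table in response.tables:
    for row in table.rows:
        print(row)
```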
For details, see the guide to Azure Databricks diagnostic settings and the log reference: Configure diagnostic log delivery.

Implement a tagging strategy so your organization gains granular visibility into resource consumption and can align spending with business priorities:
- Default tags: Automatically applied by Databricks to cloud-deployed resources.
- Custom tags: User-defined tags that you can add to compute resources and serverless workloads.

Use Cases
- Operational monitoring: Detect job or resource bottlenecks.
- Security and compliance: Audit user actions and enforce governance policies.
- Incident response: Correlate Databricks logs with infrastructure events for faster troubleshooting.

Best Practices
- Enable only relevant log categories to optimize cost and performance.
- Use role-based access control (RBAC) to secure access to logs.

Approach 2: Azure Databricks Compute Log Delivery

Compute log delivery in Azure Databricks enables you to automatically collect and archive logs from Spark driver nodes, worker nodes, and cluster events for both all-purpose and job compute resources. When you create a cluster, you can specify a log delivery location—such as DBFS, Azure Storage, or a Unity Catalog volume—where logs are delivered every five minutes and archived hourly. All logs generated up until the compute resource is terminated are guaranteed to be delivered, supporting troubleshooting, auditing, and compliance.

To configure the log delivery location:
1. On the compute page, click the Advanced toggle.
2. Click the Logging tab.
3. Select a destination type: DBFS or Volumes (Preview).
4. Enter the log path.

To store the logs, Databricks creates a subfolder in your chosen log path named after the compute's cluster_id.
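The same log delivery location can also be set programmatically when compute is created. The sketch below uses the Databricks Python SDK; the cluster name, Spark version, node type, and DBFS path are illustrative placeholders to adapt to your workspace.

```python
# Minimal sketch: create a cluster with log delivery to DBFS using the
# Databricks Python SDK. Names, runtime version, node type, and the
# destination path are placeholders, not recommendations.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import ClusterLogConf, DbfsStorageInfo

w = WorkspaceClient()  # picks up auth from the environment or a config profile

w.clusters.create(
    cluster_name="observability-demo",
    spark_version="14.3.x-scala2.12",
    node_type_id="Standard_DS3_v2",
    num_workers=2,
    # Logs are delivered every five minutes under <destination>/<cluster_id>/
    cluster_log_conf=ClusterLogConf(
        dbfs=DbfsStorageInfo(destination="dbfs:/cluster-logs")
    ),
)
```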
Approach 3: Azure Activity Logs

Whenever you create, update, or delete Databricks resources (such as provisioning a new workspace, scaling a cluster, or modifying network settings), these actions are captured in the Activity Log. This enables teams to track who made changes, when, and what impact those changes had on the environment. Each event in the Activity Log has a particular category, described in the following document: Azure Activity Log event schema.

For Databricks, this is especially valuable for:
- Auditing resource deployments and configuration changes
- Investigating failed provisioning or quota errors
- Monitoring compliance with organizational policies
- Responding to incidents or unauthorized actions

Use Cases
- Auditing infrastructure-level changes outside the Databricks workspace.
- Monitoring provisioning delays or resource availability.

Best Practices
- Use Activity Logs in conjunction with other logs for full-stack visibility.
- Set up alerts for critical infrastructure events.
- Review logs regularly to ensure compliance and operational health.

Approach 4: Azure Monitor VM Insights

Azure Databricks cluster nodes run on Azure virtual machines (VMs), and their infrastructure-level performance can be monitored using Azure Monitor VM Insights (formerly OMS). This approach provides visibility into resource utilization across individual cluster VMs, helping identify bottlenecks that may affect Spark job performance or overall workload efficiency.

Configuration overview: To enable VM performance monitoring, enable VM Insights on the Databricks cluster VMs.

Monitored metrics: Once enabled, VM Insights collects CPU usage, memory consumption, disk I/O, network throughput, and process-level statistics. These metrics help assess whether Spark workloads are constrained by infrastructure limits, such as insufficient memory or high disk latency.

Considerations
- This is a standard Azure VM monitoring technique and is not specific to Databricks.
- Use role-based access control (RBAC) to secure access to performance data.

Approach 5: Virtual Network Flow Logs

For Azure Databricks workspaces deployed in a custom Azure virtual network (VNet-injected mode), enabling virtual network flow logs provides deep visibility into IP traffic flowing through the virtual network. These logs help monitor and optimize resources, and for large enterprises, they can support intrusion detection. Review common use cases here: VNet flow logs common use cases, and how logging works here: Key properties of virtual network flow logs. Follow these steps to set up VNet flow logs: Create a flow log.

Configuration overview: Virtual network flow logs are a feature of Azure Network Watcher. Optionally, logs can be analyzed using Traffic Analytics for deeper insights. These logs help identify:
- Unexpected or unauthorized traffic
- Bandwidth usage patterns
- Effectiveness of NSG rules and network segmentation

Considerations
- NSG flow logging is only available for VNet-injected deployment modes.
- Ensure Network Watcher is enabled in the region where the Databricks workspace is deployed.
- Use Traffic Analytics to visualize trends and detect anomalies in network flows.

Approach 6: Spark Monitoring Logging & Metrics

The spark-monitoring library is a Python toolkit designed to interact with the Spark History Server REST API. Its main purpose is to help users programmatically access, analyze, and visualize Spark application metrics and job details after execution. Here's what it offers:
- Application listing: Retrieve a list of all Spark applications available on the History Server, including metadata such as application ID, name, start/end time, and status.
- Job and stage details: Access detailed information about jobs and stages within each application, including execution times, status, and resource usage.
- Task metrics: Extract metrics for individual tasks, such as duration, input/output size, and shuffle statistics, supporting performance analysis and bottleneck identification.

Considerations
- The Spark Monitoring Library must be installed; see the Git repository here.
- Metrics can be exported to external observability platforms for long-term retention and alerting.

Use cases
- Automated reporting of Spark job performance and resource usage
- Batch analysis of completed Spark applications
- Integration of Spark metrics into external dashboards or monitoring systems
- Post-execution troubleshooting and optimization
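The library wraps the standard Spark History Server REST API. As a rough sketch of what that API exposes, the snippet below calls the documented /applications endpoints directly with requests; the base URL is an environment-specific assumption.

```python
# Minimal sketch: pull application and job metadata from the Spark History
# Server REST API. The base URL/port is an environment-specific assumption.
import requests

BASE = "http://<history-server-host>:18080/api/v1"  # placeholder

# List applications known to the History Server
apps = requests.get(f"{BASE}/applications", timeout=30).json()

# Summarize job outcomes for the first few applications
for app in apps[:5]:
    app_id = app["id"]
    jobs = requests.get(f"{BASE}/applications/{app_id}/jobs", timeout=30).json()
    failed = [j for j in jobs if j.get("status") == "FAILED"]
    print(f"{app_id}: {len(jobs)} jobs, {len(failed)} failed")
```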
Internal Databricks Logging

Approach 7: Databricks System Tables (Unity Catalog)

Databricks system tables are a recent addition to Azure Databricks observability, offering structured, SQL-accessible insights into workspace usage, performance, and cost. These tables reside in the Unity Catalog and are organized into schemas such as system.billing, system.lakeflow, and system.compute. You can enable system tables through these steps: _enable_system_tables - Databricks.

When enabled by an administrator, system tables allow users to query operational metadata directly using SQL. Examples include:
- system.billing.usage: Tracks compute usage (CPU core-hours, memory) per job.
- system.compute.clusters: Captures cluster lifecycle events.
- system.lakeflow.job_run: Provides job execution details.

Use Cases
- Cost monitoring: Aggregate usage records to identify high-cost jobs or users. Import pre-built usage dashboards into your workspaces to monitor account- and workspace-level usage: Usage dashboards and Create and monitor budgets.
- Operational efficiency: Track job durations, cluster concurrency, and resource utilization.
- In-platform BI: Build dashboards in Databricks SQL to visualize usage trends without relying on external billing tools.

Best Practices
- Schedule regular queries to track cost trends, job performance, and resource usage.
- Apply role-based access control to restrict sensitive usage data.
- Integrate system table insights into Databricks SQL dashboards for real-time visibility.

Approach 8: Data Quality Monitoring

Data Quality Monitoring is a native Azure Databricks feature designed to track data quality and machine learning model performance over time. It enables automated monitoring of Delta tables and ML inference outputs, helping teams detect anomalies, data drift, and reliability issues directly within the Databricks environment. Follow these steps to enable Data Quality Monitoring.

Data Quality Monitoring supports three profile types:
- Time series: Monitors time-partitioned data, computing metrics per time window.
- Inference: Tracks prediction drift and anomalies in model request/response logs.
- Snapshot: Performs full-table scans to compute metrics across the entire dataset.

When enabling Lakehouse Monitoring, at step 5 you can also enable data profiling to view data profiling dashboards.

Use Cases
- Data quality monitoring: Track null values, column distributions, and schema changes.
- Model performance monitoring: Detect concept drift, prediction anomalies, and accuracy degradation.
- Operational reliability: Ensure consistent data pipelines and ML inference behavior.

Approach 9: Databricks SQL Dashboards and Alerts

Databricks SQL dashboards and alerts provide in-platform observability for operational monitoring, enabling teams to visualize metrics and receive notifications based on SQL query results. This approach complements infrastructure-level monitoring by focusing on application-level conditions, data correctness, and workflow health.

Users can build dashboards using Databricks SQL or SQL warehouses by querying:
- System tables (e.g., job runs, billing usage)
- Data Quality Monitoring metric tables
- Custom operational datasets

You can create alerts through these steps: Databricks SQL alerts.

Alerting features: Databricks SQL supports alerting on query results, allowing users to define conditions that trigger notifications via email or Slack (via webhook integration). Alerts can be configured for scenarios such as:
- Job failure counts exceeding thresholds
- Row count drops in critical tables
- Cost/workload spikes or resource usage anomalies

Considerations
- Alerts are query-driven and run on a schedule; ensure queries are optimized for performance.
- Dashboards and alerts are workspace-specific and require appropriate permissions.

Best Practices
- Use system tables and Data Quality Monitoring metrics as data sources for dashboards.
- Schedule alerts to run at appropriate intervals (e.g., hourly for job failures).
- Combine internal alerts with external monitoring for full-stack coverage.
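For instance, an alert-style query over the billing system table can be prototyped in a notebook before being wired into a Databricks SQL alert. A minimal sketch follows, assuming Unity Catalog system tables are enabled; the column names follow the documented system.billing.usage schema and may differ by release.

```python
# Minimal sketch: prototype a cost-trend query over a system table.
# Assumptions: Unity Catalog system tables are enabled, and the
# usage_date / sku_name / usage_quantity columns match your release.
daily_usage = spark.sql("""
    SELECT usage_date,
           sku_name,
           SUM(usage_quantity) AS dbus
    FROM system.billing.usage
    WHERE usage_date >= date_sub(current_date(), 30)
    GROUP BY usage_date, sku_name
    ORDER BY usage_date, dbus DESC
""")

daily_usage.show(truncate=False)
```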
Approach 10: Custom Tags for Workspace-Level Assets

Custom tags allow organizations to classify and organize Databricks resources (clusters, jobs, pools, notebooks) for better governance, cost tracking, and observability. Tags are key-value pairs applied at the resource level and can be propagated to Azure for billing and monitoring.

Why use custom tags?
- Cost attribution: Assign tags like Environment=Prod or Project=HealthcareAnalytics to track costs in Azure Cost Management.
- Governance: Enforce policies based on tags (e.g., restrict high-cost clusters to Environment=Dev).
- Observability: Filter logs and metrics by tags for dashboards and alerts.

Taggable assets
- Clusters: Apply tags during cluster creation via the Databricks UI or REST API.
- Jobs: Include tags in job configurations for workload-level tracking.
- Instance pools: Tag pools to manage shared compute resources.
- Notebooks and workflows: Use tags in metadata for classification and reporting.

Best Practices
- Define a standard tag taxonomy (e.g., Environment, Owner, CostCenter, Compliance).
- Validate tags regularly to ensure consistency across workspaces.
- Use tags in Log Analytics queries for cost and performance dashboards.
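Tags can be applied in the UI or at creation time through the API. A minimal sketch with the Databricks Python SDK, using an illustrative tag taxonomy (keys, values, and cluster settings are placeholders):

```python
# Minimal sketch: apply a standard tag taxonomy at cluster creation.
# Tag keys/values and cluster settings are illustrative; custom_tags
# propagates to the underlying Azure resources for cost attribution.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

w.clusters.create(
    cluster_name="healthcare-analytics-etl",
    spark_version="14.3.x-scala2.12",
    node_type_id="Standard_DS3_v2",
    num_workers=2,
    custom_tags={
        "Environment": "Prod",
        "Project": "HealthcareAnalytics",
        "CostCenter": "1234",
        "Owner": "data-platform-team",
    },
)
```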
I'm stuck!

Logically, I'm not sure how/if I can do this. I want to monitor for Entra ID group additions. I can get this to work for a single entry using this:

```kusto
AuditLogs
| where TimeGenerated > ago(7d)
| where OperationName == "Add member to group"
| where TargetResources[0].type == "User"
| extend GroupName = tostring(parse_json(tostring(parse_json(tostring(TargetResources[0].modifiedProperties))[1].newValue)))
| where GroupName == "NameOfGroup" // <-- This returns the single entry
| extend User = tostring(TargetResources[0].userPrincipalName)
| summarize ['Count of Users Added']=dcount(User), ['List of Users Added']=make_set(User) by GroupName
| sort by GroupName asc
```

However, I have a list of 20 priv groups that I need to monitor. I can do this using:

```kusto
let PrivGroups = dynamic(['name1','name2','name3']);
```

and then call that like this:

```kusto
blahblah
| where TargetResources[0].type == "User"
| extend GroupName = tostring(parse_json(tostring(parse_json(tostring(TargetResources[0].modifiedProperties))[1].newValue)))
| where GroupName has_any (PrivGroups)
```

But that's a bit dirty to update - I wanted to call a watchlist. I've tried defining with:

```kusto
let PrivGroup = (_GetWatchlist('TestList'));
```

and tried calling like:

```kusto
blahblah
| where TargetResources[0].type == "User"
| extend GroupName = tostring(parse_json(tostring(parse_json(tostring(TargetResources[0].modifiedProperties))[1].newValue)))
| where GroupName has_any ('PrivGroup')
```

I've tried dropping the let and attempted to look up the watchlist directly:

```kusto
| where GroupName has_any (_GetWatchlist('TestList'))
```

The query runs but doesn't return any results (obviously I know the result exists). How do I look up that extracted value in a watchlist? Any ideas or pointers why I'm wrong would be appreciated! Many thanks

Approaches to Integrating Azure Databricks with Microsoft Fabric: The Better Together Story!
Azure Databricks and Microsoft Fabric can be combined to create a unified and scalable analytics ecosystem. This document outlines eight distinct integration approaches, each accompanied by step-by-step implementation guidance and key design considerations. These methods are not prescriptive—your cloud architecture team can choose the integration strategy that best aligns with your organization's governance model, workload requirements, and platform preferences. Whether you prioritize centralized orchestration, direct data access, or seamless reporting, the flexibility of these options allows you to tailor the solution to your specific needs.

From Bronze to Gold: Data Quality Strategies for ETL in Microsoft Fabric
Introduction

Data fuels analytics, machine learning, and AI, but only if it's trustworthy. Most organizations struggle with inconsistent schemas, nulls, data drift, or unexpected upstream changes that silently break dashboards, models, and business logic. Microsoft Fabric provides a unified analytics platform with OneLake, pipelines, notebooks, and governance capabilities. When combined with Great Expectations, an open-source data quality framework, Fabric becomes a powerful environment for enforcing data quality at scale. In this article, we explore how to implement enterprise-ready, parameterized data validation inside Fabric notebooks using Great Expectations, including row-count drift detection, schema checks, primary-key uniqueness, and time-series batch validation.

A quick reminder: ETL (Extract, Transform, Load) is the process of pulling raw data from source systems, applying business logic and quality validations, and delivering clean, curated datasets for analytics and AI. While ETL spans the full Medallion architecture, this guide focuses specifically on data quality checks in the Bronze layer using the NYC Taxi sample dataset.

🔗 The full implementation is available in my GitHub repository: sallydabbahmsft/Data-Quality-Checks-in-Microsoft-Fabric: Data Quality Checks in Microsoft Fabric

Why Data Quality Matters More Than Ever

AI and analytics initiatives fail not because of model quality but because the underlying data is inaccurate, incomplete, or inconsistent. Organizations adopting Microsoft Fabric often ask:
- How can we validate data as it lands in Bronze?
- How do we detect schema changes before they break downstream pipelines?
- How do we prevent silent failures, anomalies, and drift?
- How do we standardize data quality checks across multiple tables and pipelines?

Great Expectations provides a unified, testable, automation-friendly way to answer these questions.

Great Expectations in Fabric

Great Expectations (GX) is an open-source library for:
✔ Declarative data quality rules ("expectations")
✔ Automated validation during ETL
✔ Rich documentation and reporting
✔ Batch-based validation for time-series or large datasets
✔ Integration with Python, Spark, SQL, and cloud data platforms

Fabric notebooks now support Great Expectations natively (via PySpark), enabling engineering teams to:
- Build reusable DQ suites
- Parameterize expectations by pipeline
- Validate full datasets or daily partitions
- Integrate validation into Fabric pipelines and alerting

Data Quality Across the Medallion Architecture

This solution follows the Medallion Architecture, moving data through the Bronze, Silver, and Gold layers while enforcing data quality checks at every stage.

📘 P.S. Fabric also supports this via built-in Medallion task flows: Task flows overview - Microsoft Fabric | Microsoft Learn

🥉 Bronze Layer: Ingestion & Validation
- Ingest raw source data into Bronze without transformations.
- Run foundational DQ checks to ensure structural integrity.
Bronze DQ answers: ➡ Did the data arrive correctly?

🥈 Silver Layer: Transformation & Validation
- Clean, standardize, and enrich Bronze data.
- Validate business rules, schema consistency, reference values, and more.
Silver DQ answers: ➡ Is the data accurate and logically correct?

🥇 Gold Layer: Enrichment & Consumption
- Produce curated, analytics-ready datasets.
- Validate metrics, aggregates, and business KPIs.
Gold DQ answers: ➡ Can executives trust the numbers?
Recommended Data Quality Validations

Bronze Layer (Raw Ingestion)
- Ingestion volume & row drift: Validate total row count and detect unexpected volume drops or spikes.
- Schema & data type compliance: Ensure the table structure and column data types match the expected schema.
- Null / empty column checks: Identify missing or empty values in required fields.
- Primary key uniqueness: Detect duplicate records based on the defined composite or natural key.

Silver Layer (Cleaned & Standardized Data)
- Reference & domain value validation: Confirm that values match valid categories, lookups, or reference datasets.
- Business rule enforcement: Validate logic constraints (e.g., StartDate <= EndDate, percentages within range).
- Anomaly / outlier detection: Identify unusual patterns or values that deviate from historical behavior.
- Post-standardization deduplication: Ensure standardized and enriched records no longer contain duplicates.

Gold Layer (Curated, Business-Ready Data)
- Metric & aggregation consistency: Validate totals, ratios, rollups, and other aggregated metrics.
- KPI threshold monitoring: Trigger alerts when KPIs exceed defined thresholds.
- Data / feature drift detection (for ML): Monitor changes in distributions across time.
- Cross-system consistency checks: Compare business metrics across internal systems to ensure alignment.

Implementing Data Quality with Great Expectations in Fabric

Step 1 - Read data from the Lakehouse (parameterized):

```python
lakehouse_name = "Bronze"
table_name = "NYC Taxi - Green"

query = f"SELECT * FROM {lakehouse_name}.`{table_name}`"
df = spark.sql(query)
```

Step 2 - Create and register a suite:

```python
import great_expectations as gx

context = gx.get_context()
suite = context.suites.add(
    gx.ExpectationSuite(name="nyc_bronze_suite")
)
```

Step 3 - Add Bronze layer expectations (reusable function):

```python
import great_expectations as gx


def add_bronze_expectations(
    suite: gx.ExpectationSuite,
    primary_key_columns: list[str],
    required_columns: list[str],
    expected_schema: list[str],
    expected_row_count: int | None = None,
    max_row_drift_pct: float = 0.2,
) -> gx.ExpectationSuite:
    # 1. Ingestion count & row drift
    if expected_row_count is not None:
        min_rows = int(expected_row_count * (1 - max_row_drift_pct))
        max_rows = int(expected_row_count * (1 + max_row_drift_pct))
        row_count_expectation = gx.expectations.ExpectTableRowCountToBeBetween(
            min_value=min_rows,
            max_value=max_rows,
        )
        suite.add_expectation(expectation=row_count_expectation)

    # 2. Schema compliance
    schema_expectation = gx.expectations.ExpectTableColumnsToMatchSet(
        column_set=expected_schema,
        exact_match=True,
    )
    suite.add_expectation(expectation=schema_expectation)

    # 3. Required columns: NOT NULL
    for col in required_columns:
        not_null_expectation = gx.expectations.ExpectColumnValuesToNotBeNull(
            column=col
        )
        suite.add_expectation(expectation=not_null_expectation)

    # 4. Primary key uniqueness (if provided)
    if primary_key_columns:
        unique_pk_expectation = gx.expectations.ExpectCompoundColumnsToBeUnique(
            column_list=primary_key_columns
        )
        suite.add_expectation(expectation=unique_pk_expectation)

    return suite
```
Step 4 - Attach a data asset & batch definition:

```python
data_source = context.data_sources.add_spark(name="bronze_datasource")
data_asset = data_source.add_dataframe_asset(name="nyc_bronze_data")
batch_definition = data_asset.add_batch_definition_whole_dataframe("full_bronze_batch")
```

Step 5 - Run validation:

```python
validation_definition = gx.ValidationDefinition(
    data=batch_definition,
    suite=suite,
    name="Bronze_DQ_Validation",
)

results = validation_definition.run(
    batch_parameters={"dataframe": df}
)
print(results)
```

Step 6 (optional) - Time-series batch validation (daily slices):

Fabric does not yet support add_batch_definition_timeseries, so the notebook implements custom logic to validate each day independently:

```python
from pyspark.sql import functions as F

# Collect the distinct pickup dates present in the DataFrame
dates_df = df.select(F.to_date("lpepPickupDatetime").alias("dt")).distinct()
dates = [row.dt for row in dates_df.collect()]

# Validate each daily slice independently
for d in dates:
    df_day = df.filter(F.to_date("lpepPickupDatetime") == d)
    results = validation_definition.run(batch_parameters={"dataframe": df_day})
```

This enables:
- Daily anomaly detection
- Partition-level completeness checks
- Early schema drift detection

Automating DQ with Fabric Pipelines

Fabric pipelines can orchestrate your data quality workflow:
- Trigger the notebook after ingestion
- Pass parameters (table, layer, suite name)
- Persist DQ results to the Lakehouse or Log Analytics (a minimal sketch follows below)
- Configure alerts in Fabric Monitor

Production workflow:
1. Run the notebook
2. Check validation results
3. If failures exist: raise an incident, fail the pipeline, and notify the on-call engineer

This creates a closed loop of ingestion → validation → monitoring → alerting.
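As a minimal sketch of the persistence step, the snippet below summarizes the results object returned by validation_definition.run() and appends it to a Lakehouse Delta table; the table name and summary columns are illustrative choices, not part of the original pipeline.

```python
# Minimal sketch: persist a summary of a Great Expectations run to a
# Lakehouse Delta table. Table name and summary columns are illustrative.
from datetime import datetime, timezone

summary = [{
    "suite_name": "nyc_bronze_suite",
    "run_time": datetime.now(timezone.utc).isoformat(),
    "success": bool(results.success),
    "checked": len(results.results),
    "failed": sum(1 for r in results.results if not r.success),
}]

spark.createDataFrame(summary).write.mode("append").saveAsTable("dq_run_summary")
```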
An example of a DQ pipeline:
Results:

How Enterprises Benefit

By standardizing data quality rules across all domains, organizations ensure consistent expectations and uniform validation practices, and improved observability makes data quality issues visible and actionable, enabling teams to detect and resolve failures early. This, in turn, enhances overall reliability, ensuring downstream transformations and Power BI reports operate on clean, trustworthy data. Ultimately, stronger data quality directly contributes to AI readiness: high-quality, well-validated data produces significantly better analytics and machine learning outcomes.

Conclusion

Great Expectations + Microsoft Fabric creates a scalable, modular, enterprise-ready approach for ensuring data quality across the entire medallion architecture. Whether you're validating raw ingested data, transformed datasets, or business-ready tables, the approach demonstrated here enables consistency, observability, and automation across all pipelines. With Fabric's unified compute, orchestration, and monitoring, teams can now integrate DQ as a first-class citizen, not an afterthought.

Links:
- Implement medallion lakehouse architecture in Fabric - Microsoft Fabric | Microsoft Learn
- GX Expectations Gallery • Great Expectations

Azure support team not responding to support request

I am posting here because I have not received a response to my support request despite my plan stating that I should hear back within 8 hours. It has now gone a day beyond that limit, and I am still waiting for assistance with this urgent matter. This issue is critical for my operations, and the delay is unacceptable. The ticket/reference number for my original support request was 2410100040000309, and I have created a brand new service request with ID 2412160040010160. I need this addressed immediately.