How Microsoft Dragon Copilot Uses the Azure Health Data Services De-Identification Service
Empowering physician productivity through secure AI

Microsoft developed Dragon Copilot to revolutionize real-time clinical documentation. Using clinically adapted generative AI, it listens to patient-clinician conversations and automatically generates draft clinical notes, freeing physicians to focus on what matters most: their patients. Dragon Copilot also allows clinicians to get the information they need when they need it, and it automates many other tasks such as initiating orders or drafting patient after-visit summaries. The tool eliminates the burden of manual note-taking and many other clicks in the EMR, boosting efficiency and reducing burnout, all of which are critical challenges in healthcare. With strong market traction across hospitals and physician practices in the USA, Dragon Copilot, previously known as Dragon Ambient eXperience (DAX) Copilot, has become a trusted productivity engine for healthcare organizations. In a field where protecting patient data is critical, privacy is paramount. Dragon Copilot's deep commitment to data privacy requires a strategic partner like the de-identification service to support safe and responsible AI development at scale.

How the Azure Health Data Services de-identification service empowers Dragon Copilot

Dragon Copilot operates at the intersection of audio capture, natural language generation (NLG), and clinical workflows. Its data pipelines include highly sensitive patient health information. As a result, Microsoft has invested in the Azure Health Data Services de-identification service to de-identify millions of patient transcripts and notes, upholding strict privacy standards while delivering secure, scalable clinical documentation.

De-identifying unstructured text like clinical notes is particularly challenging due to the complexity and variability of how Protected Health Information (PHI) appears in real-world clinical documentation. References to dates like "Christmas" or "New Year's Eve," names, locations, and other identifiers are often embedded in free text in unpredictable ways. The Azure Health Data Services de-identification service is purpose-built to handle these nuances. It accurately identifies and replaces patient names while distinguishing them from doctors' names, and it can also detect and tag the names of family members or close contacts mentioned in the clinical narrative. The service also retains the format of the dates present in clinical notes, shifting them by a random number of days within a 45-day window, and replaces holidays with surrogates that are close in seasonality. A key strength of the de-identification service is its use of surrogation, where sensitive terms are replaced with realistic, context-appropriate substitutes. This approach, used in services like Dragon Copilot, helps ensure clinical notes remain readable and useful while concealing real PHI in plain sight, strengthening privacy without sacrificing usability.

Connecting to Microsoft Fabric for scalable analytics

Once Dragon Copilot generates draft clinical notes, the data can be securely ingested into Microsoft Fabric, a unified data platform built for analytics and governance. Within Fabric, healthcare organizations can centralize and manage de-identified data using OneLake, making it accessible for advanced analytics, operational reporting, and research.
Azure Health Data Services plays a critical role in this ecosystem by ensuring that sensitive PHI is de-identified before analysis, allowing healthcare agents to extract meaningful insights, identify trends, and optimize care delivery without compromising patient privacy.

Use Cases unlocked through partnering with the Azure Health Data Services de-identification service

Azure Health Data Services de-identification has become a critical component of the Dragon Copilot data ingestion pipeline. Our service supports several teams within Dragon Copilot:
Research Enablement: De-identified data fuels AI model building, success tracking, and product improvement—without exposing sensitive patient data.
AI Model Quality & Evaluation: De-identified data supports safe iteration and experimentation while preserving important context (e.g., gender, timeline, and more).

What makes the Azure Health Data Services de-identification service stand out

Dragon Copilot builds on the consistency, robustness, and seamless integration offered by Azure Health Data Services' de-identification capabilities. The service is purpose-built for healthcare and plays a critical role in enabling Dragon Copilot to uphold the highest privacy standards while continuing to innovate. Key strengths of the service include:
Context Preservation: Maintains formatting and context alignment, which are essential for clinical accuracy.
Surrogation Support: Replaces PHI with realistic pseudonyms to ensure de-identified data remains useful for model training.
Beyond HIPAA Compliance: De-identifies 27 categories of PHI, surpassing HIPAA's 18 identifiers, to support more comprehensive privacy protection.
This foundation allows Dragon Copilot to scale responsibly, ensuring both compliance and usability in real-world clinical settings.

Looking Ahead: Where Dragon Copilot is going with de-identification

As Dragon Copilot expands and continues to add new capabilities, the Azure Health Data Services de-identification service will continue to be a foundational piece of its AI development lifecycle. For Dragon Copilot, de-identification isn't just a checkbox; it's a catalyst for innovation. Learn more about the Azure Health Data Services de-identification service.
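To make the date-handling behavior described above more concrete, here is an illustrative Python sketch (not the service's actual implementation) of shifting a date by a bounded random offset while preserving its original format, in the spirit of the 45-day window mentioned earlier:

import random
from datetime import datetime, timedelta

def shift_date(date_text: str, fmt: str = "%Y-%m-%d", max_shift_days: int = 45) -> str:
    """Illustrative only: shift a date by a random offset within a 45-day window,
    keeping the original format so the surrounding note stays readable."""
    offset = random.randint(-max_shift_days, max_shift_days)
    shifted = datetime.strptime(date_text, fmt) + timedelta(days=offset)
    return shifted.strftime(fmt)

# Example: "2024-12-25" might become "2025-01-19"; the value changes but the format is preserved.
print(shift_date("2024-12-25"))

A production de-identification service also has to keep shifts consistent within a document and pair them with surrogation of names and other identifiers; this sketch only illustrates the format-preserving date shift.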
Model Mondays S2E11: Exploring Speech AI in Azure AI Foundry

1. Weekly Highlights
This week's top news in the Azure AI ecosystem included:
Lakuna — Copilot Studio Agent for Product Teams: A hackathon project built with Copilot Studio and Azure AI Foundry, Lakuna analyzes your requirements and docs to surface hidden assumptions, helping teams reflect, test, and reduce bias in product planning.
Azure ND H200 v5 VMs for AI: Azure Machine Learning introduced ND H200 v5 VMs, featuring NVIDIA H200 GPUs (over 1 TB of GPU memory per VM!) for massive models, bigger context windows, and ultra-fast throughput.
Agent Factory Blog Series: The next wave of agentic AI is about extensibility: plug your agents into hundreds of APIs and services using the Model Context Protocol (MCP) for portable, reusable tool integrations.
GPT-5 Tool Calling on Azure AI Foundry: GPT-5 models now support free-form tool calling—no more rigid JSON! Output SQL, Python, configs, and more in your preferred format for natural, flexible workflows.
Microsoft a Leader in 2025 Gartner Magic Quadrant: Azure was again named a Leader for Cloud Native Application Platforms—validating its end-to-end runway for AI, microservices, DevOps, and more.

2. Spotlight On: Azure AI Foundry Speech Playground
The main segment featured a live demo of the new Azure AI Speech Playground (now part of Foundry), showing how developers can experiment with and deploy cutting-edge voice, transcription, and avatar capabilities.
Key Features & Demos:
Speech Recognition (Speech-to-Text): Try real-time transcription directly in the playground, recognizing natural speech, pauses, accents, and domain terms. Batch and fast transcription options are available for large files and blob storage.
Custom Speech: Fine-tune models for your industry, vocabulary, and noise conditions.
Text to Speech (TTS): Instantly convert text into natural, expressive audio in 150+ languages with 600+ neural voices. Demo: Listen to pre-built voices and explore whispering, cheerful, angry, and more styles.
Custom Neural Voice: Clone and train your own professional or personal voice (with strict Responsible AI controls).
Avatars & Video Translation: Bring your apps to life with prebuilt avatars and video translation, which syncs voice-overs to speakers in multilingual videos.
Voice Live API: The Voice Live API (Preview) integrates all premium speech capabilities with large language models, enabling real-time, proactive voice agents and chatbots. Demo: A language learning agent with voice, avatars, and proactive engagement, plus one-click code export for deployment in your IDE.
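To give a flavor of what the playground's one-click code export looks like in practice, here is a minimal text-to-speech sketch using the Azure Speech SDK for Python; the voice name is one of the prebuilt neural voices, and the key and region environment variables are placeholders for your own Speech resource:

import os
import azure.cognitiveservices.speech as speechsdk

# Placeholder credentials: set these from your own Azure AI Speech resource.
speech_config = speechsdk.SpeechConfig(
    subscription=os.environ["SPEECH_KEY"], region=os.environ["SPEECH_REGION"]
)
# Pick one of the prebuilt neural voices demonstrated in the playground.
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"

synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
result = synthesizer.speak_text_async("Hello from the Azure AI Speech Playground!").get()

if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Audio synthesized successfully.")
else:
    print(f"Synthesis did not complete: {result.reason}")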
3. Customer Story: Hilo Health
This week's customer spotlight featured Hilo Health, a healthcare technology company using Azure AI to boost efficiency for doctors, staff, and patients.
How Hilo Uses Azure AI:
Document Management: Automates fax/document filing, splits multi-page faxes by patient, and reduces staff effort and errors using Azure Computer Vision and Document Intelligence.
Ambient Listening: Ambient clinical note transcription captures doctor-patient conversations and summarizes them for easy EHR documentation.
Genie AI Contact Center: Agentic voice assistants handle patient calls, book appointments, answer billing/refill questions, escalate to humans, and assist human agents—using Azure Communication Services, Azure Functions, FastAPI (community), and Azure OpenAI.
Conversational Campaigns: Outbound reminders, procedure preps, and follow-ups are all handled by voice AI—freeing up human staff.
Impact: Hilo reaches 16,000+ physician practices and 180,000 providers, automates millions of communications, and processes $2B+ in payments annually—demonstrating how multimodal AI transforms patient journeys from first call to post-visit care.

4. Key Takeaways
Here's what you need to know from S2E11:
Speech AI is Accessible: The Azure AI Foundry Speech Playground makes experimenting with voice recognition, TTS, and avatars easy for everyone.
From Playground to Production: Fine-tune, export code, and deploy speech models in your own apps with Azure Speech Service.
Responsible AI Built-In: Custom Neural Voice and avatars require application and approval, ensuring ethical, secure use.
Agentic AI Everywhere: The Voice Live API brings real-time, multimodal voice agents to any workflow.
Healthcare Example: Hilo's use of Azure AI shows the real-world impact of speech and agentic AI, from patient intake to after-visit care.
Join the Community: Keep learning and building—join the Discord and Forum.

Sharda's Tips: How I Wrote This Blog
I organize key moments from each episode, highlight product demos and customer stories, and use GitHub Copilot for structure. For this recap, I tested the Speech Playground myself, explored the docs, and summarized answers to common developer questions on security, dialects, and deployment. Here's my favorite Copilot prompt this week:
"Generate a technical blog post for Model Mondays S2E11 based on the transcript and episode details. Focus on Azure Speech Playground, TTS, avatars, Voice Live API, and healthcare use cases. Add practical links for developers and students!"

Coming Up Next Week
Next week: Observability! Learn how to monitor, evaluate, and debug your AI models and workflows using Azure and OpenAI tools.
Register For The Livestream – Sep 1, 2025
Register For The AMA – Sep 5, 2025
Ask Questions & View Recaps – Discussion Forum

About Model Mondays
Model Mondays is your weekly Azure AI learning series:
5-Minute Highlights: Latest AI news and product updates
15-Minute Spotlight: Demos and deep dives with product teams
30-Minute AMA Fridays: Ask anything in Discord or the forum
Start building: Register For Livestreams, Watch Past Replays, Register For AMA, Recap Past AMAs

Join The Community
Don't build alone! The Azure AI Developer Community is here for real-time chats, events, and support: Join the Discord, Explore the Forum

About Me
I'm Sharda, a Gold Microsoft Learn Student Ambassador focused on cloud and AI. Find me on GitHub, Dev.to, Tech Community, and LinkedIn. In this blog series, I share takeaways from each week's Model Mondays livestream.
Agentic AI in Healthcare

Healthcare organizations are at a crossroads where rising patient loads, complex data, and administrative burdens demand new solutions. Agentic AI – AI systems capable of autonomous action – is emerging as a catalyst for transformation, promising to act not just as a set of tools but as collaborative digital team members. Microsoft's ecosystem of AI technologies provides a robust foundation for harnessing agentic AI in healthcare. This report offers a comprehensive overview of agentic AI, distinguishes it from traditional AI, and explores its role in clinical workflows, administrative efficiency, patient engagement, and data governance. It also examines how Microsoft's offerings (Microsoft 365 Copilot, Azure Health Data Services, Microsoft Fabric, Copilot Studio, and more) enable these advances responsibly and in compliance with healthcare regulations like HIPAA.

Towards Robust Evaluation of Multi-Agent Systems in Clinical Settings
Authors: Hao Qiu, Leonardo Schettini, Mert Öz, Noel Codella, Sam Preston, Wen-wai Yim

As multi-agent systems become more capable and collaborative, their behavior begins to exhibit emergent properties that are difficult to predict or control – particularly in safety-critical domains like healthcare. Coordination among agents can yield outputs that are non-deterministic, multi-faceted, and context-sensitive. This makes robust evaluation not just a matter of accuracy, but of safety, accountability, and trust. Traditional NLP metrics like ROUGE or BLEU fall short in these settings, as they presuppose a single ground truth and fail to capture clinically relevant errors such as subtle omissions, hallucinations, or fact distortions.

To address this, we present a modular evaluation framework for the Healthcare Agent Orchestrator, designed to support fine-grained, clinically grounded assessment across both deployed clinical workflows and simulated scenarios. This framework enables targeted stress-testing of multi-agent behavior – particularly how agents share information, reason under uncertainty, and maintain factual fidelity in high-stakes contexts. Central to our framework is TBFact, a domain-specific factuality metric that evaluates agent outputs on three key criteria: factual inclusion, factual distortion, and factual omission. TBFact shows strong correlation with human experts (κ = 0.760) and demonstrates that our Patient History agent successfully included up to 94% of high-importance information in the generated patient timelines.

To ground evaluations of the Patient History agent, we constructed a high-quality benchmark dataset from de-identified tumor board discussions and associated patient histories. The formatting of the reference patient timeline summaries (originally written by medical professionals) was standardized via a large language model to facilitate consistent evaluation. Under our benchmark, the Patient History agent included over 94% of high-importance facts (counting both fully and partially entailed information) and achieved 0.84 TBFact recall on high-importance facts, showing that TBFact's strict entailment criteria and partial-credit scoring leave meaningful headroom for future improvements.

For more technical information about the evaluation framework, refer to the documentation. The healthcare-agent-orchestrator repository also includes an evaluation notebook with concrete examples for simulating conversations and evaluating them.

Figure: High-level architecture of the evaluation framework, showing data sources (real and simulated conversations) feeding into modular metrics for both orchestrator and individual agent assessment.

Available Metrics
Traditional similarity metrics (e.g., ROUGE, BERTScore) fail to capture subtle yet critical factual inaccuracies in the output. Moreover, in agentic workflows, a ground truth answer often doesn't exist or is expensive to curate. To overcome these shortcomings, we leverage Model-as-a-Judge to implement the following metrics:

Component | Metric | Description
Orchestrator | Agent and tool selection accuracy | Correct routing to specialized agents
Orchestrator | Intent resolution | How accurately the orchestrator interprets and completes user requests, including scoping and clarification
Orchestrator | Information aggregation | Effective synthesis of multiple agent outputs
Individual Agents | Context relevancy | Relevance of retrieved information in relation to the user's requests
Individual Agents | TBFact (factual consistency) | An adaptation of RadFact for the text modality that measures the factuality of claims in agents' messages and helps identify omissions and hallucinations

Large Language Models serve as useful evaluation tools in our framework, offering advantages especially when ground truth data is not available. They can follow detailed evaluation guidelines, maintain consistency when applying criteria across conversations, and generate explanations for their assessments, facilitating verification of the evaluation process. However, due to their subjective nature, LLM-based evaluations should be treated as directional signals rather than absolute scores, providing guidance for system improvement rather than absolute judgments of correctness. To complement LLM-based metrics with reproducible measurements, especially when reference data is available, we include a ROUGE implementation, which serves as an example for developers who want to incorporate other similarity metrics like BLEU or BERTScore by extending the ReferenceBasedMetric class.

TBFact: Domain-Specific Factuality Evaluation
TBFact builds on RadFact (Bannur et al., 2024), a framework originally developed for evaluating factual consistency in radiology reports, by adapting its core principles to the text-only modality of healthcare agent interactions:
Fact Extraction: Separately decomposes both agent responses and reference texts into discrete factual claims, categorized by clinical relevance (e.g., demographics, diagnosis, treatment).
Logical Entailment: Compares each fact to determine whether it is fully entailed, partially entailed, or not entailed by the reference, and further categorizes the reason for partial and total mismatches as "missing", "ambiguous", "incorrect", or "other".
Metric Calculation: TBFact performs the logical entailment in two directions:
Precision (pred-to-gold): Measures the proportion of factual claims in the agent's output that are supported by the reference data. A lower precision score may indicate the presence of hallucinated or extraneous facts not found in the reference, even if they are accurate. Precision can be seen as a proxy for succinctness.
Recall (gold-to-pred): Measures the proportion of reference facts that are successfully captured in the agent's output. A lower recall score signals missing or omitted information, which is especially critical in clinical contexts where completeness is essential.

By operating at the level of atomic factual units, TBFact shifts the focus from holistic summary judgments to targeted, claim-by-claim analysis. While claim extraction introduces its own challenges—such as ensuring consistent coverage of verifiable content, maintaining entailment fidelity, and handling decontextualization (Metropolitansky & Larson, 2025)—factual claims make the evaluation process more modular and transparent, providing actionable insights into where and how agent responses differ from references. For example, when evaluating a discharge summary, TBFact might identify that while demographic facts achieve 95% precision, treatment recommendations only reach 75% recall, pinpointing specific areas for agent improvement. This granular feedback enables developers to identify systematic issues, such as an agent consistently omitting medication dosages or incorrectly interpreting temporal information, that would be difficult to detect with traditional metrics.
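To illustrate the two-directional scoring just described, here is a small sketch of how TBFact-style precision and recall with partial credit could be computed once entailment labels are available; the label names and the 0.5 partial-credit weight are assumptions for illustration, not the framework's exact implementation:

from typing import List

# Hypothetical entailment labels for extracted factual claims.
FULL, PARTIAL, NONE = "fully_entailed", "partially_entailed", "not_entailed"

def tbfact_score(labels: List[str], partial_credit: float = 0.5) -> float:
    """Average credit per claim: 1.0 for full entailment, partial_credit for partial, 0.0 otherwise."""
    if not labels:
        return 0.0
    credit = {FULL: 1.0, PARTIAL: partial_credit, NONE: 0.0}
    return sum(credit[label] for label in labels) / len(labels)

# Precision: claims from the agent output judged against the reference (pred-to-gold).
pred_to_gold = [FULL, FULL, PARTIAL, NONE]
# Recall: reference facts judged against the agent output (gold-to-pred).
gold_to_pred = [FULL, PARTIAL, FULL, FULL, NONE]

print("TBFact precision:", tbfact_score(pred_to_gold))  # 0.625
print("TBFact recall:", tbfact_score(gold_to_pred))     # 0.7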
Data Sources
Due to the challenge of having real-world data for each use case we want to evaluate, and to accommodate different development stages and data availability, the framework supports two primary evaluation modes:
Real conversations: The Healthcare Agent Orchestrator automatically saves chat sessions whenever a conversation is terminated with the command @Orchestrator: clear, enabling insight into actual clinical workflow performance.
Simulated conversations: Generated for controlled testing using predefined scripts or adaptive scenarios; essential for specialized scenarios with limited real-world data.

Results and Performance Assessment
Note: The following results represent initial validation from our current research phase, with ongoing work expanding evaluation scope and refining methodologies. These preliminary results demonstrate promising capabilities for clinical system coordination and factual accuracy assessment.

Orchestrator Performance
We evaluated the orchestrator using simulated conversations across multiple patient scenarios. GPT-4o served as the evaluator, providing both quantitative scores and qualitative explanations based on defined metric criteria. In this preliminary experiment, the orchestrator demonstrated promising coordination capabilities:

Metric | Score Range | Average Score
Agent Selection Accuracy | 3.89 – 5 | 4
Intent Resolution | 4 – 5 | 4.5
Information Aggregation | 3 – 5 | 3.7

In our preliminary evaluation, agent selection examples are relatively straightforward given our agents' well-defined responsibilities, but they provide a foundation for expanding to more complex scenarios involving agent-human expert interactions as we gather real-world data. Future work could include turn-level labeling of tumor board dataset dialogues to test classification accuracy in choosing the right next expert or agent. Agent selection can also be combined with "tool selection" metrics, addressing the fragmentation problem in multi-agent evaluation approaches. In the current state, we mainly used the explanations provided by the evaluator model to better understand the behavior of the system in clinical workflows and to guide the development process.

Patient History Agent Performance with TBFact
To evaluate the Patient History agent, we used an anonymized and PHI-free proprietary dataset, named TB-Bench, that comprehensively aggregates diverse medical records for 71 patients who had undergone the care of a Molecular Tumor Board (MTB). TB-Bench includes data such as tumor board transcripts, exported EHR data, and clinician-generated patient summaries. Due to the logistical challenges involved in curating such a comprehensive dataset across potentially multiple healthcare institutions and record-keeping systems, we found that in some instances the clinician-generated summaries available in the tumor board transcripts refer to patient records that were lost in the data curation process. This mismatch made direct evaluation challenging. Therefore, to ensure the evaluation reflects system performance when complete patient records are accessible, we used TBFact to evaluate the agent's output against a curated set of dataset-verifiable facts—facts limited to those referring to information that is present in the dataset.
While TBFact measures both recall and precision of fact generation, our study focuses on recall because it measures how much of the important information is covered, which we consider the most critical metric for clinical applications where missing information can have serious consequences. The preliminary experiments revealed significant performance improvements through prompt optimization and format adjustments. With specialized prompting, we specify the types of information to prioritize—such as biomarker results, imaging assessments, and treatment timelines. For instance, our updated prompt instructs the agent to "organize the patient data in chronological order" and explicitly calls out key elements to include: "all biomarkers," "response to treatment including dates and imaging," and "a summary of current status." This prompt engineering approach proved to be one of the most effective levers for improving the quality and completeness of Patient History outputs.

Configuration | TBFact Recall for All Facts | TBFact Recall for Important Facts
Generic prompts (baseline) | 0.56 | 0.66
Specialized prompts | 0.71 | 0.84

Since TBFact operates by comparing discrete factual claims, higher scores indicate that the agent is, according to the reference data, factually accurate and comprehensive in its coverage of the available patient information. In other words, optimizing for TBFact scores brings the agent's output structurally and semantically closer to the curated reference timelines. In our case, that meant striving for detailed outputs, including information about allergies and ongoing medications, even when specific dates were unavailable. This underscores the importance of having high-quality, human-validated reference datasets; without them, even well-performing agents may appear incomplete or inaccurate.

Human Validation Study
To validate TBFact's reliability, we conducted a preliminary study with human annotators, medical scribes by training, using 71 patient records. Two annotators assessed (a) whether a claim was properly extracted from its source text, (b) whether the fact was important (low, medium, high), and (c) whether individual claims were properly entailed by a reference text. Inter-annotator agreement was measured at 0.999, 0.66 (strict) and 0.77 (relaxed), and 0.914 for the three tasks, respectively. The accuracy of the fact extraction pipeline was calculated to be 99.9%, validating that minimal-to-no hallucinations are introduced during the fact extraction phase. System accuracy for fact importance classification was 66% when measured strictly; however, when allowing for a tolerance of one level (e.g., classifying medium instead of high), it was 93%. These values are comparable to those of the medical annotators. Entailment classification accuracy was 88%, suggesting reasonable performance in the system's ability to recognize entailment. Finally, we measured the correlation of the end-to-end TBFact F1 score of the system against human judgments using Kendall Tau, Pearson, and Spearman correlations. These were 55.8%, 70.5%, and 72.8%, respectively—moderate-to-high correlations suggesting that the TBFact metrics are well-aligned with expert clinical reasoning.
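The agreement and correlation statistics reported above are standard measures; as a hedged illustration (with made-up numbers, not the study's annotations), they can be computed along these lines:

from scipy.stats import kendalltau, pearsonr, spearmanr
from sklearn.metrics import cohen_kappa_score

# Hypothetical example data, not the study's annotations.
annotator_a = ["high", "medium", "high", "low", "medium"]
annotator_b = ["high", "high", "high", "low", "medium"]
system_f1 = [0.81, 0.74, 0.90, 0.62]
human_f1 = [0.78, 0.70, 0.88, 0.65]

# Inter-annotator agreement on importance labels.
kappa = cohen_kappa_score(annotator_a, annotator_b)

# Correlation between system-level TBFact F1 scores and human scores.
tau, _ = kendalltau(system_f1, human_f1)
r, _ = pearsonr(system_f1, human_f1)
rho, _ = spearmanr(system_f1, human_f1)

print(f"kappa={kappa:.3f}, kendall={tau:.3f}, pearson={r:.3f}, spearman={rho:.3f}")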
Qualitative insights from TBFact
The table below illustrates how TBFact evaluates factual alignment between agent-generated summaries and reference data. Each entry shows a fact extracted from the agent's output, the corresponding excerpt from the reference, and the entailment judgment. The logical entailment was produced by TBFact, while the accompanying explanations were generated separately to support interpretability.

Fact extracted from agent response: Molecular studies from the 2019-05-18 surgery identified TERT promoter mutation, PTEN mutation, EGFR amplification, CDKN2A/B deletion, monosomy 10, and trisomy 7.
Related excerpt from reference text (ground truth): […] Tumor Genetics: EGFR: Amplified; CDKN2A/B: Deleted; PTEN: p.L112R; TERT: c.-146C>T; Chromosome 10: Monosomy; Chromosome 7: Trisomy […] Timeline: 05/18/2019: Diagnosis of multifocal glioblastoma; craniotomy and resection of lesion from right temporal lobe. […]
TBFact judgment: ✔ Entailed. The summary lists TERT mutation, PTEN mutation, EGFR amplification, CDKN2A/B deletion, monosomy 10, and trisomy 7.

Fact extracted from agent response: Immunohistochemistry from 2019-05-18 showed GFAP positive, BRAF V600E negative, IDH1 R132H negative, ATRX retained, p53 negative, and a Ki-67 index of 3%.
Related excerpt from reference text (ground truth): […] Tumor Genetics: IDH1: Wildtype - BRAF V600E: Negative […] Timeline: 05/18/2019: Diagnosis of multifocal glioblastoma; craniotomy and resection of lesion from right temporal lobe. […]
TBFact judgment: ⚠️ Partial entailment. Some IHC findings match (BRAF negative, IDH1 wildtype) but others (GFAP, p53, Ki-67) are not mentioned in the reference summary.

Fact extracted from agent response: During the first cycle of CCNU on 2020-04-14, the patient reported significant fatigue, thrombocytopenia, and occasional confusion.
Related excerpt from reference text (ground truth): Introduction: […] The patient is experiencing poor tolerance to lomustine and is considering discontinuation due to further disease progression as confirmed by recent MRI scans. […] Timeline: 04/14/2020 - Present: Lomustine treatment initiated. […]
TBFact judgment: ⚠️ Partial entailment. Poor tolerance to lomustine is reported, but the specific side effects are not listed in the reference summary.

Fact extracted from agent response: On 2020-05-16, the plan was to continue CCNU and monitor with imaging.
Related excerpt from reference text (ground truth): No related information in the reference text.
TBFact judgment: ⚠️ No entailment. There is no mention in the summary of a plan on 2020-05-16 to continue CCNU with imaging follow-up.

These examples show that partial entailments are not necessarily errors. In many cases, they reflect the agent surfacing clinically relevant details that are absent from the reference. This is especially important in healthcare settings, where agent outputs may synthesize information across multiple documents or express facts in more complete or structured ways than the reference does. To further assess the factual grounding of the agent's outputs, we compared all facts extracted from the Patient History agent's summaries against the full set of available data for each patient in the TB-Bench dataset. We found that 97% of the extracted facts were entailed by at least one data point. Upon manually reviewing the remaining 3% of facts, we found that they often reflected condensed or synthesized information drawn from multiple sources, meaning these claims could not be matched to any one document in our one-to-one entailment setup. While we cannot rule out the presence of hallucinations entirely, this analysis highlights the agent's capacity for multi-source summarization.

Closing Thoughts
As multi-agent systems become more capable and autonomous, robust evaluation must evolve in parallel. The framework presented here is a step toward that goal: modular, clinically grounded, and designed to surface actionable insights across both simulated and real-world workflows.
By moving beyond traditional accuracy metrics and embracing factuality, relevance, and coordination as core evaluation dimensions, we can better understand how multi-agent systems work, and when and why they fail. Our preliminary experiments and insights reinforce the value of TBFact not just as a metric, but as a diagnostic tool. Its structured, claim-level analysis (combined with fact categorization and human validation) offers a transparent and clinically meaningful way to evaluate and improve healthcare agents. In evaluating the Patient History agent, our findings demonstrate that the agent remains faithful to the underlying data and produces complete, clinically relevant summaries. These outputs can help physicians prepare more efficiently and productively for tumor board review meetings and, in a chat with multiple agents, facilitate further investigation and understanding of patients.

Looking ahead, we see several promising directions for extending this work: incorporating human-in-the-loop review pipelines, expanding to multimodal evaluation, improving observability across agent interactions, and scaling to more diverse real-world datasets. We are also developing a standardized benchmark of synthetic and de-identified patient cases to support broader community testing and reproducibility. We hope this work encourages others to adopt similarly rigorous approaches to evaluation, and to contribute to the development of shared benchmarks, metrics, and methodologies.

References
Bannur, S., Bouzid, K., Castro, D. C., Schwaighofer, A., Thieme, A., Bond-Taylor, S., ... & Hyland, S. L. (2024). MAIRA-2: Grounded radiology report generation. arXiv:2406.04449v2.
Metropolitansky, D., & Larson, J. (2025). Towards Effective Extraction and Evaluation of Factual Claims. arXiv:2502.10855v2.

Azure Logic App AI-Powered Monitoring Solution: Automate, Analyze, and Act on Your Azure Data
Introduction
In today's cloud-driven world, monitoring and analyzing application health is critical for business continuity and operational excellence. However, the sheer volume of monitoring data can make it challenging to extract actionable insights quickly. Enter the Azure Logic App AI-Powered Monitoring Solution—an intelligent, serverless pipeline that leverages Azure Logic Apps and Azure OpenAI to automate monitoring, analyze data, and deliver comprehensive reports right to your inbox. This solution is ideal for organizations seeking to modernize their monitoring workflows, reduce manual analysis, and empower teams with AI-driven insights for faster decision-making.

What Does This Solution Accomplish?
The Azure Logic App AI-Powered Monitoring Solution creates an automated pipeline that:
Extracts monitoring data from Azure Log Analytics using KQL queries (see the sketch after the technology list below).
Analyzes data with AI using the Azure OpenAI GPT-4o model.
Generates intelligent reports and sends them via email.
Runs automatically on a daily schedule.
Uses managed identity for secure authentication across Azure services.

Business Case Solved
Automated Monitoring: No more manual log reviews—let AI do the heavy lifting.
Actionable Insights: Receive daily, AI-generated summaries highlighting system health, key metrics, potential issues, and recommendations.
Operational Efficiency: Reduce time-to-insight and empower teams to act faster on critical events.
Secure and Scalable: Built on Azure's serverless and identity-driven architecture.

Key Features
Serverless Architecture: Built on Azure Logic Apps Standard for scalability and cost efficiency.
AI-Powered Insights: Uses Azure OpenAI for advanced data analysis and summarization.
Infrastructure as Code: Deployable via Bicep templates for reproducibility and automation.
Secure by Design: Managed identity and Azure RBAC ensure secure access.
Cost Effective: Pay-per-execution model with optimized resource usage.
Customizable: Easily modify KQL queries and AI prompts to fit your monitoring needs.

Solution Architecture
Technologies Involved
Azure Logic Apps Standard: Orchestrates the workflow.
Azure OpenAI Service (GPT-4o): Performs AI-powered data analysis and summarization.
Azure Log Analytics: Source for monitoring data, queried via KQL.
Application Insights: Monitors workflow execution and telemetry.
Azure Storage Account: Stores Logic App runtime data.
Managed Identity: Secures authentication across Azure services.
Infrastructure as Code (Bicep): Enables automated, repeatable deployments.
Office 365 Connector: Sends email notifications.
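To make the extraction step concrete, here is a hedged sketch of the kind of KQL query over Log Analytics that such a pipeline might run, using the azure-monitor-query client library from Python; the workspace ID, table, and query below are placeholders rather than this solution's actual configuration:

from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

# Placeholder workspace ID; the deployed solution would resolve this from its Bicep outputs.
WORKSPACE_ID = "<log-analytics-workspace-id>"

# Example daily health query over Application Insights request telemetry.
QUERY = """
AppRequests
| summarize total = count(),
            failures = countif(Success == false),
            p95_ms = percentile(DurationMs, 95)
  by AppRoleName
| order by failures desc
"""

client = LogsQueryClient(DefaultAzureCredential())
response = client.query_workspace(WORKSPACE_ID, QUERY, timespan=timedelta(days=1))

for table in response.tables:
    for row in table.rows:
        print(dict(zip(table.columns, row)))

In the actual solution, a query result like this would be passed to the GPT-4o prompt for summarization rather than printed.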
Support
Documentation: https://docs.microsoft.com/en-us/azure/logic-apps/
Issues: https://github.com/vinod-soni-microsoft/logicapp-ai-summarize/issues
Star this repository if you find it helpful!

Optimizing Azure Healthcare Multimodal AI Models for Intel CPU Architecture

Alexander Mehmet Ersoy, Principal Product Manager, Microsoft HLS AI
Abhishek Khowala, Principal AI Engineer, Intel
Ravi Panchumarthy, AI Framework Engineer, Intel
Srinarayan Srikanthan, AI Framework Engineer, Intel
Ekaterina Aidova, AI Frameworks Engineer, Intel
Alberto Santamaria-Pang, Principal Applied Data Scientist, Microsoft HLS AI and Adjunct Faculty at Johns Hopkins Medicine, Microsoft
Peter Lee, Applied Scientist, Microsoft HLS AI and Adjunct Assistant Professor at Vanderbilt University
Ivan Tarapov, Sr. Director, Microsoft HLS AI
Pradeep Sakhamoori, Sr. SW Engineer, Microsoft

The Rise of Multimodal AI in Healthcare
The healthcare sector is witnessing a surge in the adoption of multimodal AI models, which are crucial for applications ranging from diagnostics to personalized treatment plans. These models combine data from various sources such as medical images, patient records, and genomic data to provide comprehensive insights. The catalog of multimodal healthcare foundation models in Microsoft's Azure AI Foundry is at the forefront of this change. Recently launched models (such as MedImageInsight, MedImageParse, CXRReportGen [8], and many others) are designed to help healthcare organizations rapidly build and deploy AI solutions tailored to their specific needs, while minimizing the extensive compute and data requirements typically associated with building multimodal models from scratch. Real-world examples from our industry partners regarding the adoption of multimodal AI models are highlighted in the article "Unlocking next-generation AI capabilities with healthcare AI models".

Challenges and Opportunities in Hardware Optimization
As models get more complex, which is the case with the foundation model trend, the demands on the hardware rise. While GPUs remain the platform of choice for minimizing model execution times, CPUs present substantial optimization possibilities, especially for inference workloads. We believe that providing a framework for efficient CPU-based environments holds huge potential for many production scenarios where speed can be traded off for cost savings. With multimodal healthcare AI, the complexity of handling different data modalities and ensuring efficient inference requires innovative solutions and collaboration between industry leaders. Companies are increasingly looking toward hardware-specific optimizations to enhance model efficiency and reduce latency while keeping costs at bay. Intel, with its robust suite of AI tools and extensions for frameworks like PyTorch, is pioneering this optimization effort. For instance, the Intel® Distribution of OpenVINO™ toolkit has been instrumental in accelerating the development of computer vision and deep learning applications in healthcare [1]. You can learn about our recent collaboration with Intel on AI optimizations to advance medical innovations in the article "Empower Medical Innovations: Intel Accelerates PadChest & fMRI Models on Microsoft Azure* Machine Learning".

The demand for AI applications in healthcare is rapidly increasing. Multimodal AI models, which can process and analyze complex datasets, are essential for tasks such as early disease detection, treatment planning, and patient monitoring. While optimizing these models to perform efficiently on specific hardware is important, it is not necessarily a barrier to adoption. Models optimized with CUDA for NVIDIA GPUs often deliver optimal performance and run faster than on any other hardware.
However, the benefit of using CPUs lies in the tradeoff they offer. You can choose to optimize for speed by running your model on a GPU and optimizing for it in PyTorch, or you can optimize for cost by sacrificing speed. This is the proposition here: the option to run the model more slowly on an accessible CPU, which can be advantageous in scenarios where speed is not the primary concern but access to GPU hardware is. The Intel® oneAPI Deep Neural Network Library (oneDNN) has proven effective in reducing the GPU requirement burden and accelerating time to market for AI solutions [2]. Both Intel® Extension for PyTorch (IPEX) and OpenVINO utilize Intel® oneDNN to accelerate deep learning operations, taking advantage of underlying hardware features. IPEX optimizes existing PyTorch workflows with minimal code changes. OpenVINO provides cross-platform deep learning optimization for deployment flexibility. In this blog post, a custom deployment was implemented using CXRReportGen along with both IPEX and OpenVINO optimizations, demonstrating how these techniques can support different deployment scenarios and technical requirements. This optimization is accessible through Azure's compute services and Intel's technology.

Benchmarking and Performance Acceleration
To address these challenges, our new collaboration with Intel focuses on leveraging Intel's advanced AI tools and hardware capabilities to optimize multimodal AI models for greater healthcare access. By utilizing Intel's Extension for PyTorch and other optimization techniques, we aim to get the best model run time speed out of CPUs. While this may slightly degrade performance relative to GPUs, the main benefit is addressing the problem of GPU hardware scarcity. This partnership not only underscores the importance of hardware-specific optimizations but also sets a new standard for AI model deployment in real-world healthcare applications.

Both IPEX and OpenVINO are built on a common foundation: Intel® oneDNN, a high-performance library designed specifically for deep learning applications and optimized for Intel architecture. oneDNN leverages specialized hardware instructions available in Intel processors, such as Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Vector Neural Network Instructions (VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX) [3] on Intel CPUs, as well as Intel® Xe Matrix Extensions (XMX) AI engines on Intel discrete GPUs.

Figure 1: oneDNN Library

IPEX [4] extends PyTorch* with the latest performance optimizations for Intel hardware [5]. It leverages oneDNN under the hood to provide optimized implementations of key operations. This allows developers to stay within their existing PyTorch code with minimal changes, making it an excellent choice for teams already comfortable with the PyTorch ecosystem who want to quickly optimize their models for Intel hardware.

import torch
############## import ipex ###############
import intel_extension_for_pytorch as ipex

model = Model()
model.eval()

############## Optimize with IPEX ###############
model = ipex.optimize(model, dtype=torch.bfloat16)

# Continue with inference as normal

Figure 2: Intel Extension for PyTorch

The Intel® Distribution of OpenVINO™ toolkit is a powerful solution for optimizing and deploying deep learning models across a wide range of Intel hardware [6]. Like IPEX, it leverages oneDNN under the hood, but it takes a different approach, offering cross-platform optimization and flexible deployment options.
OpenVINO supports two main workflows: a convenience workflow, where you run models directly with minimal setup, and a performance workflow, recommended for production, where models are first converted offline into the OpenVINO Intermediate Representation (IR). This one-time conversion step enables highly optimized inference and allows the final application to remain lightweight and efficient. Here is a simple example using OpenVINO for inference with a pre-converted IR model; refer to the OpenVINO Notebooks repo for more samples:

import openvino as ov

core = ov.Core()

############## Load the OpenVINO IR model ###############
compiled_model = core.compile_model("model.xml", "CPU")

############## Run inference ###############
infer_request = compiled_model.create_infer_request()
results = infer_request.infer({input_tensor_name: input_tensor})

Figure 3: OpenVINO toolkit overview.

IPEX and OpenVINO are supported on all Intel architectures. However, for optimal performance, Intel recommends using instances powered by 4th Gen Intel® Xeon® Scalable processors or newer, which feature AMX and other hardware acceleration capabilities, such as Azure's v6-series (e.g., Standard_E48s_v6) [7].

Results
We conducted a detailed performance benchmark using CXRReportGen, a state-of-the-art foundation model designed to generate a list of radiological findings from chest X-rays, on Standard_E48s_v6 hardware (48 vCPUs, 384 GiB RAM) with and without IPEX and OpenVINO optimization. We realized up to a 70% improvement in CXRReportGen foundation model run time when applying optimizations with IPEX, and similarly substantial gains using OpenVINO, compared to the non-optimized baseline on the same CPU hardware. This significant improvement highlights the potential of leveraging Intel's performance optimizations to make critical healthcare AI models more cost-efficient and accessible. Such advancements enable healthcare providers to deploy advanced diagnostic tools even in resource-constrained environments, ultimately improving patient care and operational efficiency.

SKU | Run Type (100 runs) | Mean Run Time (seconds) | Standard Deviation of Run Time (seconds)
Standard_E48s_v6 (48 vCPUs, 384 GiB RAM) | No optimization | 22.47 | 0.1061
Standard_E48s_v6 (48 vCPUs, 384 GiB RAM) | IPEX | 8.21 | 0.2375
Standard_E48s_v6 (48 vCPUs, 384 GiB RAM) | OpenVINO | 7.01 | 0.0569

Table 1: Performance comparison of the CXRReportGen model across 100 runs on CPU.
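For context, run-time statistics like those in Table 1 can be gathered with a simple timing harness; the sketch below is illustrative only (the model, inputs, and run count are placeholders), not the benchmark code used for these measurements.

import statistics
import time

N_RUNS = 100  # matches the number of runs reported in Table 1

def benchmark(run_inference, n_runs=N_RUNS):
    """Time repeated inference calls and report mean and standard deviation in seconds."""
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        run_inference()  # placeholder: e.g., a wrapped CXRReportGen forward pass
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings), statistics.stdev(timings)

# Example usage with a placeholder callable:
# mean_s, std_s = benchmark(lambda: compiled_model(input_batch))
# print(f"mean={mean_s:.2f}s, std={std_s:.4f}s")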
Future Prospects and Innovations
Our benchmarks with both IPEX and OpenVINO optimizations show great potential for decreasing the run time of our foundation models and increasing scalability via CPU. This positions Intel CPUs as a viable deployment target. It not only increases deployment options but also offers opportunities to reduce cloud costs with CPU-based instances, and even to deploy these workflows on existing compute headroom at the edge. For custom deployments, the setup described in this blog post is available today on Azure compute instances with optimization software from Intel, so developers can optimize inference workloads while taking advantage of the large memory pools available on CPUs for handling large batch workloads. Our advancements with Intel in model runtime optimization are being considered for availability in the Azure AI model catalog. Please stay tuned for further updates.

As we continue to innovate and optimize, the potential for AI to transform healthcare and improve patient outcomes becomes increasingly attainable. We are now more equipped than ever to make it easier for our partners and customers to create connected experiences at every point of care, empower their healthcare workforce, and unlock the value of their data using data standards that are important to the healthcare industry.

References
[1] Intel OpenVINO Optimizes Deep Learning Performance for Healthcare Imaging
[2] Accelerating Healthcare Diagnostics with Intel oneAPI and AI Tools
[3] Intel Advanced Matrix Extensions
[4] Intel Extension for PyTorch
[5] Accelerate with Intel Extension to PyTorch
[6] Intel Accelerates PadChest and fMRI Models on Azure ML
[7] Azure's first 5th Gen Intel® Xeon® processor instances are now available and we're excited!
[8] CXRReportGen Model Card in Azure AI Foundry

The healthcare AI models in Azure AI Foundry are intended for research and model development exploration. The models are not designed or intended to be deployed in clinical settings as-is nor for use in the diagnosis or treatment of any health or medical condition, and the individual models' performances for such purposes have not been established. You bear sole responsibility and liability for any use of the healthcare AI models, including verification of outputs and incorporation into any product or service intended for a medical purpose or to inform clinical decision-making, compliance with applicable healthcare laws and regulations, and obtaining any necessary clearances or approvals.
Integrating remote patient monitoring solutions with healthcare data solutions in Microsoft Fabric

Co-Authors: Kemal Kepenek, Mustafa Al-Durra PhD, Matt Dearing, Jason Foerch, Manoj Kumar

Introduction
Remote patient monitoring solutions rely on connected devices, wearable technology, and advanced software platforms to collect and transmit patient health data. They facilitate monitoring of vital signs, chronic conditions, and behavioral patterns. Healthcare data solutions in Microsoft Fabric offers a secure, scalable, and interoperable data platform as part of Microsoft for Healthcare. Such a unified data platform is crucial for integrating disparate data sources and generating actionable health insights. This article provides a reference architecture and the steps to integrate remote patient monitoring solutions with healthcare data solutions in Fabric.

The integration is aimed at satisfying low-data-resolution use cases. With low data resolution, we address the infrequent (hourly, daily, or less often) transfer of aggregated or point-in-time-snapshot device data into healthcare data solutions in Fabric, to be used in a batch fashion to generate analytical insights. Integration steps for high-data-resolution use cases, which necessitate high-frequency transfer of highly granular medical device data (for example, data from EKGs or ECGs) as input to either batch or (near) real-time analytics processing and consumption, are a candidate for a future article.

There are several methods, solutions, and partners available in the marketplace today that will allow you to integrate a remote patient monitoring solution with healthcare data solutions in Fabric. In this article, we leveraged the solution from Life365 (a Microsoft partner). The integration approach discussed here is applicable to most remote patient monitoring solutions whose integration logic (code) can run inside a platform that can programmatically access Microsoft Fabric (for example, through REST API calls). In our approach, the integration platform chosen is the Function App service within Microsoft Azure. In the subsequent sections of this article, we cover the integration approach in two phases:
Interoperability phase, which illustrates how the data from medical devices (used by the remote patient monitoring solution) can be converted into a format suitable for transfer into healthcare data solutions in Fabric.
Analytical processing and consumption phase, which provides the steps to turn the medical device data into insights that can be easily accessed through Fabric.

Integration Approach

Interoperability Phase
Step 1 of this phase performs the transfer of proprietary device data. As part of this step, datasets are collected from medical devices and transferred (typically in the form of files) to an integration platform or service. In our reference architecture, the datasets are transferred to the Function App (inside an Azure resource group) that is responsible for the integration function. It is important for these datasets to contain information about at least three concepts or entities:
Medical device(s) from which the datasets are collected.
Patient(s) to whom the datasets belong.
Reading(s) obtained from the medical device(s) throughout the time that the patients utilize these devices. Medical device readings may be point-in-time data captures, metrics, measures, calculations, collections, or similar data points.
Information about the entities listed above will be used in the later steps of the interoperability phase (discussed below), when we convert this information into resources to be transferred to the second phase, which performs analytical processing and consumption.

In step 2, to maintain the mapping between proprietary device data and FHIR® resources, you can use transformation templates, or follow a programmatic approach, to convert datasets received from medical devices into appropriate FHIR® resources. Using the entities mentioned in the previous step, the conversion takes place as follows:
Medical device information is converted to the Device resource in FHIR®*.
Patient information is converted to the Patient resource in FHIR®.
Device reading information is converted to the Observation resource in FHIR®.
* Currently, healthcare data solutions in Fabric supports the FHIR® Release 4 (R4) standard. Consequently, the FHIR® resources that are created as part of this step should follow the same standard.

Transformation and mapping activities are under the purview of each specific remote patient monitoring integration solution and are not reviewed in detail in this article. As an example, we provide below the high-level steps that one of the Microsoft partners (Life365) followed to integrate their remote patient monitoring solution with healthcare data solutions in Fabric:

The Life365 team developed a cloud-based transformation service that translates internal device data into standardized FHIR® (Fast Healthcare Interoperability Resources) Observations to enable compatibility with healthcare data solutions in Microsoft Fabric and other health data ecosystems. This service is implemented in the Microsoft Azure cloud and designed to ingest structured payloads from Life365-connected medical devices—including blood pressure monitors, weight scales, and pulse oximeters—and convert them into FHIR®-compliant formats in real time. When a reading is received:
The service identifies relevant clinical metrics (e.g., systolic/diastolic blood pressure, heart rate, weight, SpO₂).
These metrics are mapped to FHIR® Observation resources using industry-standard LOINC codes and units.
Each Observation is enriched with references to the associated patient and device, and formatted in NDJSON to meet the ingestion requirements of healthcare data solutions in Fabric.
The resulting FHIR®-compliant data is securely transmitted to the Fabric instance using token-based authentication.
This implementation provides a consistent, standards-aligned pathway for Life365 device data to integrate with downstream FHIR®-based platforms while abstracting the proprietary structure of the original device payloads.

For examples from the public domain, you can use the following open-source projects as references:
https://github.com/microsoft/fit-on-FHIR
https://github.com/microsoft/healthkit-to-FHIR
https://github.com/microsoft/FitbitOnFHIR
https://github.com/microsoft/FHIR-Converter
Please note that the above open-source repositories might not be up to date. While they may not provide a complete (end-to-end) solution to map medical device data to FHIR®, they may still be helpful as a starting point. If you decide to incorporate them into your remote patient monitoring integration solution, validate their functionality and make the necessary changes to meet your solution's requirements.
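As a hedged illustration of the mapping described above (not Life365's actual transformation service), the sketch below builds a minimal FHIR® R4 Observation for a heart-rate reading, links it to a Patient and a Device by reference, and serializes it as a single NDJSON line; the identifiers and LOINC code shown are examples you would replace with your own mapping.

import json
from datetime import datetime, timezone

# Example identifiers; replace with values from your own device payload.
patient_id = "d3281621-1584-4631-bc82-edcaf49fda96"
device_id = "5a934020-c2c4-4e92-a0c5-2116e29e757d"

observation = {
    "resourceType": "Observation",
    "id": "example-heart-rate-reading",
    "meta": {"lastUpdated": datetime.now(timezone.utc).isoformat()},
    "status": "final",
    "code": {
        "coding": [{"system": "http://loinc.org", "code": "8867-4", "display": "Heart rate"}]
    },
    "subject": {"reference": f"Patient/{patient_id}"},
    "device": {"reference": f"Device/{device_id}"},
    "effectiveDateTime": "2025-05-15T15:35:04Z",
    "valueQuantity": {"value": 72, "unit": "beats/minute", "system": "http://unitsofmeasure.org", "code": "/min"},
}

# One resource per line: append to Observation.ndjson for ingestion.
with open("Observation.ndjson", "a", encoding="utf-8") as f:
    f.write(json.dumps(observation) + "\n")

The requisites that such resources must satisfy for ingestion are listed next.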
For the resulting FHIR® resources to be successfully consumed by the analytics processing later (within healthcare data solutions in Fabric), they need to satisfy the requisites listed below.

Each FHIR® resource, in its entirety, needs to be saved as a single row in an NDJSON-formatted file. We recommend creating one NDJSON file per FHIR® resource type. That means creating Device.ndjson, Patient.ndjson, and Observation.ndjson files for the three entities we reviewed above.

Each FHIR® resource needs to have a meta segment populated that includes a lastUpdated value. As an example:
"meta":{"lastUpdated":"2025-05-15T15:35:04.218Z", "profile":["http://hl7.org/fhir/us/core/StructureDefinition/us-core-documentreference"]}

Cross-references between Observation and Patient, as well as between Observation and Device FHIR® resources, need to be represented correctly, either through formal FHIR® identifiers or logical identifiers. As an example, the subject and device attributes of the Observation FHIR® resource need to refer to the Patient and Device FHIR® resources, respectively, in this manner:
"subject":{"reference":"Patient/d3281621-1584-4631-bc82-edcaf49fda96"}
"device":{"reference":"Device/5a934020-c2c4-4e92-a0c5-2116e29e757d"}

For the Patient FHIR® resource, if an MRN is used as the identifier, it is important to represent the MRN value according to the FHIR® standard. The Patient identifier is a critical attribute that is used to establish cross-FHIR®-resource relationships throughout the analytics processing and consumption phase. We will review that phase later in this article. At a minimum, a Patient identifier that uses MRN coding as its identifier type needs to have its value, system, type.coding.system, and type.coding.code (with value "MR") attributes populated correctly. See the example below. You can also refer to a Patient FHIR® resource example from hl7.org.

"reference": null,
"type": "Patient",
"identifier": {
    "extension": null,
    "use": null,
    "value": "4e7e5bf8-2823-8ec1-fe37-eba9c9d69463",
    "system": "urn:oid:1.2.36.146.595.217.0.1",
    "type": {
        "extension": null,
        "id": null,
        "coding": [
            {
                "extension": null,
                "id": null,
                "system": "http://terminology.hl7.org/CodeSystem/v2-0203",
                "version": null,
                "code": "MR",
                "display": null,
                "userSelected": null
            }
        ],
        "text": null
    },
    ...

With step 3, to perform the transfer of FHIR® resource NDJSON files to healthcare data solutions in Fabric, ensure that the integration platform (an Azure Function App, in our case) has permission to transfer (upload) files to healthcare data solutions in Fabric:
Find the managed identity or the service principal that the Azure Function App is running under:
Navigate to the Azure portal and find your Function App within your resource group. In the Function App's navigation pane, under "Settings," select "Identity".
Identify the managed identity (if enabled): If a system-assigned managed identity is enabled, you'll see information about it, including its object ID and principal ID. If a user-assigned managed identity is linked, the details of that identity will be displayed. You can also add user-assigned identities here if needed.
Service principal (if applicable): If the Function App is configured to use a service principal, you'll need to look for the service principal within Azure Active Directory (a.k.a. Microsoft Entra ID). You can find it by searching for "Enterprise Applications" within Azure Active Directory and looking for the application associated with the Function App.
With step 3, to perform the transfer of the FHIR® resource NDJSON files to healthcare data solutions in Fabric:

Ensure that the integration platform (an Azure Function App, in our case) has permission to transfer (upload) files to healthcare data solutions in Fabric:

Find the managed identity or the service principal that the Azure Function App is running under: Navigate to the Azure portal and find your Function App within your resource group. In the Function App's navigation pane, under "Settings," select "Identity".
Identify the managed identity (if enabled): If system-assigned managed identity is enabled, you'll see information about it, including its object ID and principal ID. If a user-assigned managed identity is linked, the details of that identity will be displayed. You can also add user-assigned identities here if needed.
Service principal (if applicable): If the Function App is configured to use a service principal, look for the service principal within Azure Active Directory (a.k.a. Microsoft Entra ID). You can find it by searching for "Enterprise Applications" within Azure Active Directory and locating the application associated with the Function App.

Grant the Azure Function App's identity access to upload files: While logged into Fabric with an administrator account, navigate to the Fabric workspace where your healthcare data solutions instance is deployed. Click the "Manage Access" button on the top right, click "Add People or Groups", and add the managed identity or service principal associated with your Azure Function App with Contributor access by selecting "Contributor" from the dropdown list.

Using a coding environment, similar to the Python example provided below, you can manage the OneLake content programmatically. This includes the ability to transfer (upload) the NDJSON-formatted files, which were created earlier, to the destination OneLake folder.

from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeFileClient

# Replace with your OneLake URI
onelake_uri = "https://your-account-name.dfs.core.windows.net"

# Replace with the destination path to your file
file_path = "/<full path to destination folder (see below)>/<entity name>.ndjson"

# Get the credential
credential = DefaultAzureCredential()

# Create a DataLakeFileClient for the destination file
file_client = DataLakeFileClient(
    url=f"{onelake_uri}{file_path}",
    credential=credential
)

# Upload the file, overwriting any existing copy
with open("<entity name>.ndjson", "rb") as f:
    file_client.upload_data(f, overwrite=True)

print(f"File uploaded successfully: {file_path}")

The destination OneLake folder to use for the remote patient monitoring solution integration with healthcare data solutions in Fabric is determined as follows:

Navigate to the bronze lakehouse created with the healthcare data solutions instance inside the Fabric workspace. The lakehouse is named "healthcare1_msft_bronze". The "healthcare1" segment in the lakehouse name corresponds to the name of the healthcare data solutions instance deployed in the workspace; you might see a different name in your Fabric workspace, but the rest of the lakehouse name ("_msft_bronze") remains unchanged.
The unified folder structure of healthcare data solutions is located inside the bronze lakehouse. Within that folder structure, create a subfolder named after the remote patient monitoring solution you are integrating with. This subfolder is referred to as a namespace in the healthcare data solutions documentation and is used to uniquely identify the source of incoming (to-be-uploaded) data.
The NDJSON files generated during the earlier interoperability phase will be transferred (uploaded) into that subfolder.

The full path of the destination OneLake folder to use in your file transfer (upload) code is: healthcare1_msft_bronze.Lakehouse\Files\Ingest\Clinical\FHIR®-NDJSON\<Solution-Name-as-Namespace>
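As an illustration of how the destination folder above maps onto the file_path variable in the upload snippet, the fragment below composes the path programmatically. The workspace name, namespace, and the exact URI/path format expected by your OneLake endpoint are assumptions; verify them against your own environment before use.

# Hypothetical composition of the destination path used by the upload snippet above.
# All names are placeholders; adjust to your workspace, solution, and entity.
workspace_name = "<your-Fabric-workspace>"
lakehouse_name = "healthcare1_msft_bronze.Lakehouse"
namespace = "<Solution-Name-as-Namespace>"   # subfolder identifying the RPM solution
entity_name = "Observation"                  # or "Patient", "Device"

file_path = (
    f"/{workspace_name}/{lakehouse_name}/Files/Ingest/Clinical/"
    f"FHIR-NDJSON/{namespace}/{entity_name}.ndjson"
)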
In the Fabric workspace where the healthcare data solutions instance is deployed, find and open the data pipeline named "healthcare1_msft_omop_analytics". As with the bronze lakehouse name, the "healthcare1" segment in the data pipeline name corresponds to the name of the healthcare data solutions instance deployed in the workspace; you might see a different name depending on your own instance. This data pipeline executes four activities, the first of which copies the transferred files into another subfolder within the unified folder structure so that they can serve as input to the ingestion step that follows. The subsequent pipeline activities perform steps 2 through 4 as illustrated in the analytics processing and consumption phase diagram further above.

Step 2 ingests the content from the transferred (NDJSON) files into the ClinicalFHIR delta table of the bronze lakehouse.
Step 3 transforms the content from the ClinicalFHIR delta table of the bronze lakehouse into flattened FHIR® data model content inside the silver lakehouse.
Step 4 transforms the flattened FHIR® content of the silver lakehouse into OMOP data model content inside the gold lakehouse.

As part of step 5, you can develop your own gold lakehouse(s) by transforming content from the silver lakehouse into the data model(s) best suited to your custom analytics use cases (a sketch follows the steps below). Device data, once transformed into a gold lakehouse, may be used for analytics or reporting in several ways, some of which are discussed briefly below.

In step 6, Power BI reports and dashboards can be built inside Fabric that offer a visual and interactive canvas to analyze the data in detail. (Overview of Power BI - Microsoft Fabric | Microsoft Learn)

As part of step 7, the Fabric data share feature can be used to grant teams within external organizations (that you collaborate with) access to the data (External data sharing in Microsoft Fabric - Microsoft Fabric | Microsoft Learn).

Finally, step 8 enables you to utilize the discover and build cohorts capability of healthcare data solutions in Fabric. With this capability, you can submit natural language queries to explore the data and build patient cohorts that fit the criteria your use cases are aiming for. (Build patient cohorts with generative AI in discover and build cohorts (preview) - Microsoft Cloud for Healthcare | Microsoft Learn)
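As a sketch of the custom silver-to-gold transformation mentioned in step 5, the following notebook-style PySpark fragment aggregates device readings into a simple reporting table. It assumes it runs in a Fabric notebook (where a Spark session named spark is provided) and that the silver lakehouse exposes a flattened Observation table; the table and column names used here are hypothetical and must be adjusted to your actual silver schema.

# Hypothetical silver-to-gold transformation; table and column names are illustrative.
from pyspark.sql import functions as F

# Flattened FHIR Observation content in the silver lakehouse (name is an assumption)
observations = spark.read.table("healthcare1_msft_silver.Observation")

# Aggregate numeric device readings per patient, code, and day
daily_readings = (
    observations
    .withColumn("reading_date", F.to_date("effectiveDateTime"))
    .groupBy("subject_reference", "code_coding_code", "reading_date")
    .agg(
        F.count("*").alias("reading_count"),
        F.avg("valueQuantity_value").alias("avg_value"),
    )
)

# Persist into a custom gold lakehouse table for reporting (destination name is illustrative)
daily_readings.write.mode("overwrite").saveAsTable("rpm_gold.device_readings_daily")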
Conclusion

When integrated with healthcare data solutions in Fabric, remote patient monitoring solutions have transformative potential to enhance patient outcomes, optimize care coordination, and streamline healthcare system operations. If your organization would like to explore the next steps in such a journey, please contact your Microsoft account team.

AI in IDD

Here is a little about how I am using AI in IDD. Our licensed Copilot has proven to be incredibly productive and time-saving, especially in reviewing documents for errors against regulations, identifying discrepancies between various care plans for the same individual, analyzing and summarizing documents for relevant information in investigations (reducing the closure time from 90 days to 7), suggesting corrective actions for internal/external audits, trending data sets (such as surveys, assessments, and audits), and proposing opportunities and solutions in light of the regulations. AI can audit documents, policies, and procedures with extreme accuracy against 6400 and 6100 regulations (PA regs). We have recently implemented this solution to streamline and audit our Medicaid billing process prior to submission. It enables us to identify billing errors, uncover unbilled days we should be billing for, and flag days we should not be billing, significantly reducing the time required for this process from 30 hours to 4 for a month's worth of billing. For systems that are not directly compatible with AI integration, we leverage AI to develop macros that process raw data exports. This allows us to extract precisely the information we need in seconds, rather than spending hours on manual analysis.

Mastering Agent Governance in Microsoft 365
The "Mastering Agent Governance in Microsoft 365" series is based on the Administering and Governing Agents whitepaper published by Microsoft and is designed to educate IT leaders, compliance officers, and decision-makers about the importance of governance for AI agents in Microsoft 365, particularly in highly regulated industries like Healthcare and Life Sciences (HLS). The six-episode series covers the growing role of agents, the risks of unmanaged agents, and the strategic importance of governance frameworks.

Empowering innovation while protecting patient data and ensuring compliance

In the age of AI-powered productivity, agents, the automated digital assistants built with tools like Microsoft 365 Copilot, SharePoint, and Copilot Studio, are transforming how work gets done. From streamlining clinical documentation to automating regulatory reporting, agents are becoming indispensable in Healthcare and Life Sciences (HLS). But with great power comes great responsibility.

Why Governance Can't Be an Afterthought

In highly regulated industries like HLS, where data sensitivity and compliance are paramount, the rise of autonomous agents introduces new risks:

Unauthorized data access could expose protected health information (PHI).
Unmonitored agent behavior could lead to regulatory violations.
Lack of lifecycle controls could result in outdated or insecure agents operating in production environments.

Agent governance isn't just an IT concern; it's a business imperative. It ensures that innovation doesn't outpace compliance, and that every agent deployed aligns with organizational policies, security standards, and regulatory frameworks like HIPAA, GDPR, and FDA 21 CFR Part 11.

Understanding the Agent Landscape

Microsoft 365 supports a spectrum of agent creators:

End Users using SharePoint or Copilot templates to automate simple tasks.
Makers building more complex agents in Copilot Studio.
Developers crafting sophisticated, enterprise-grade agents with Azure AI and Teams Toolkit.

Each persona requires a different level of oversight. For example, a clinical researcher using SharePoint to build a data retrieval agent may need minimal governance, while a developer building a patient-facing chatbot must adhere to strict data protection and validation protocols.

Governance in Action

Microsoft provides a layered governance model:

Tool Controls: Define what agent creators can do within tools like Copilot Studio and SharePoint.
Content Controls: Ensure agents only access data they're authorized to use, leveraging Microsoft Purview for sensitivity labeling and DLP.
Agent Management: Monitor usage, enforce lifecycle policies, and block non-compliant agents via the Microsoft 365 Admin Center.

This framework allows organizations to empower innovation while maintaining control, which is critical in environments where patient safety and regulatory compliance are non-negotiable.

The Business Case for Governance

For HLS organizations, agent governance delivers tangible benefits:

Reduced compliance risk through proactive policy enforcement.
Improved operational efficiency by enabling safe automation.
Greater trust from patients, regulators, and internal stakeholders.

In short, governance is the foundation that allows agents to scale safely and sustainably.

Image Search Series Part 3: Foundation Models and Retrieval-Augmented Generation in Dermatology
Introduction

Dermatology is inherently visual, with diagnosis often relying on morphological features such as color, texture, shape, and spatial distribution of skin lesions. However, the diagnostic process is complicated by the large number of dermatologic conditions, with over 3,000 identified entities, and the substantial variability in their presentation across different anatomical sites, age groups, and skin tones. This phenotypic diversity presents significant challenges, even for experienced clinicians, and can lead to diagnostic uncertainty in both routine and complex cases.

Image-based retrieval systems represent a promising approach to address these challenges. By enabling users to query large-scale image databases using a visual example, these systems can return semantically or visually similar cases, offering useful reference points for clinical decision support. However, dermatology image search is uniquely demanding. Systems must exhibit robustness to variations in image quality, lighting, and skin pigmentation while maintaining high retrieval precision across heterogeneous datasets. Beyond clinical applications, scalable and efficient image search frameworks provide valuable support for research, education, and dataset curation. They enable automated exploration of large image repositories, assist in selecting challenging examples to enhance model robustness, and promote better generalization of machine learning models across diverse populations.

In this post, we continue our series on using healthcare AI models in Azure AI Foundry to create efficient image search systems. We explore the design and implementation of such a system for dermatology applications. As a baseline, we first present an adapter-based classification framework for dermatology images by leveraging fixed embeddings from the MedImageInsight foundation model, available in the Azure AI Foundry model catalog. We then introduce a Retrieval-Augmented Generation (RAG) method that enhances vision-language models through similarity-based in-context prompting. We use the MedImageInsight foundation model to generate image embeddings and retrieve the top-k visually similar training examples via FAISS. The retrieved image-label pairs are included in the Vision-LLM prompt as in-context examples. This targeted prompting guides the model using visually and semantically aligned references, enhancing prediction quality on fine-grained dermatological tasks.

It is important to highlight that the models available on the AI Foundry Model Catalog are not designed to generate diagnostic-quality results. Developers are responsible for further developing, testing, and validating their appropriateness for specific tasks and eventually integrating these models into complete systems. The objective of this blog is to demonstrate how this can be achieved efficiently in terms of data and computational resources.

The Data

The DermaVQA-IIYI [2] dermatology image dataset is a de-identified, diverse collection of nearly 1,000 patient records and nearly 3,000 dermatological images, created to support research in skin condition recognition, classification, and visual question answering.
DermaVQA-IIYI dataset: https://osf.io/72rp3/files/osfstorage (data/iiyi)

The dataset is split into three subsets:
Training Set: 2,474 images associated with 842 patient cases
Validation Set: 157 images associated with 56 cases
Test Set: 314 images associated with 100 cases
Total Records: 2,945 images (998 patient cases)

Patient Demographics: Out of 998 patient cases:
Sex – F: 218, M: 239, UNK: 541
Age (available for 398 patients): Mean: 31 yrs | Min: 0.08 yrs | Max: 92 yrs
This wide range supports studies across all age groups, from infants to the elderly.

A total of 2,945 images are associated with the patient records, with an average of 2.9 images per patient. This multiplicity enables the study of skin conditions from different perspectives and at various stages.

Image Count per Entry:
1 image: 225 patients
2 images: 285 patients
3 images: 200 patients
4 or more images: 288 patients

The dataset includes additional annotations for anatomic location, comprising 39 distinct labels (e.g., back, fingers, fingernail, lower leg, forearm, eye region, unidentifiable). Each image is associated with one or multiple labels. We use these annotations to evaluate the performance of various methods across different anatomical regions.

Image Embeddings

We generate image embeddings using the MedImageInsight foundation model [1] from the Azure AI Foundry model catalog [3]. We apply Uniform Manifold Approximation and Projection (UMAP) to project the high-dimensional image embeddings produced by the MedImageInsight model into two dimensions. The visualization is generated using embeddings extracted from both the DermaVQA training and test sets, which cover 39 anatomical regions. For clarity, only the most frequent anatomical labels are displayed in the projection.

Figure 1. UMAP projection of image embeddings produced by the MedImageInsight model on the DermaVQA dataset.

The resulting projection reveals that the MedImageInsight model captures meaningful anatomical distinctions: visually distinct regions such as fingers, face, fingernail, and foot form well-separated clusters, indicating high intra-class consistency and inter-class separability. Other anatomically adjacent or visually similar regions, such as back, arm, and abdomen, show moderate overlap, which is expected due to shared visual features or potential labeling ambiguity. Overall, the embeddings exhibit a coherent and interpretable organization, suggesting that the model has learned to encode both local and global anatomical structures. This supports the model's effectiveness in capturing anatomy-specific representations suitable for downstream tasks such as classification and retrieval.
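For readers who want to reproduce a projection like Figure 1, the following sketch assumes the MedImageInsight embeddings have already been extracted into a NumPy array (one 1024-dimensional row per image) with a parallel list of anatomical labels; the variable names and UMAP settings are illustrative, not the exact configuration used for the figure.

# Illustrative 2-D UMAP projection of precomputed MedImageInsight embeddings.
# `embeddings` is assumed to be an (N, 1024) NumPy array and `labels` a list of
# N anatomical-site strings; both are produced in earlier steps and not shown here.
from collections import Counter
import numpy as np
import umap
import matplotlib.pyplot as plt

reducer = umap.UMAP(n_components=2, metric="cosine", random_state=42)
projection = reducer.fit_transform(embeddings)  # shape: (N, 2)

# Plot only the most frequent anatomical labels, as in the figure
top_labels = [label for label, _ in Counter(labels).most_common(10)]
for label in top_labels:
    mask = np.array([l == label for l in labels])
    plt.scatter(projection[mask, 0], projection[mask, 1], s=5, label=label)

plt.legend(markerscale=3)
plt.title("UMAP projection of MedImageInsight embeddings (DermaVQA)")
plt.show()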
Enhancing Visual Understanding

We explore two strategies for enhancing visual understanding through foundation models.

I. Training an Adapter-based Classifier

We build an adapter-based classification framework designed for efficient adaptation to medical imaging tasks (see our prior posts for an introduction to the topic of adapters: Unlocking the Magic of Embedding Models: Practical Patterns for Healthcare AI | Microsoft Community Hub). The proposed adapter model builds upon fixed visual features extracted from the MedImageInsight foundation model, enabling task-specific fine-tuning without requiring full model retraining. The architecture consists of three main components:

MLP Adapter: A two-layer feedforward network that projects 1024-dimensional embeddings (generated by the MedImageInsight model) into a 512-dimensional latent space. This module utilizes GELU activation and Layer Normalization to enhance training stability and representational capacity. As a bottleneck adapter, it facilitates parameter-efficient transfer learning.
Convolutional Retrieval Module: A sequence of two 1D convolutional layers with GELU activation, applied to the output of the MLP adapter. This component refines the representations by modeling local dependencies within the transformed feature space.
Prediction Head: A linear classifier that maps the 512-dimensional refined features to the task-specific output space (e.g., 39 dermatology classes).

The classifier is trained for 10 epochs (approximately 48 seconds) using only CPU resources. Built on fixed image embeddings extracted from the MedImageInsight model, the adapter efficiently tailors these representations for downstream classification tasks with minimal computational overhead. By updating only the adapter components, while keeping the MedImageInsight backbone frozen, the model significantly reduces computational and memory overhead. This design also mitigates overfitting, making it particularly effective in medical imaging scenarios with limited or imbalanced labeled data.

A Jupyter Notebook detailing the construction and training of a MedImageInsight-based adapter model is available in our Samples Repository: https://aka.ms/healthcare-ai-examples-mi2-adapter

Figure 3: MedImageInsight-based Adapter Model
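The notebook linked above contains the full implementation; the fragment below is only a schematic PyTorch rendering of the three components described. The layer sizes follow the text, while everything else (including the exact kernel sizes and channel layout of the convolutional module) is an assumption.

import torch
import torch.nn as nn

class MedImageInsightAdapter(nn.Module):
    """Schematic adapter head over frozen 1024-d MedImageInsight embeddings."""

    def __init__(self, embed_dim: int = 1024, hidden_dim: int = 512, num_classes: int = 39):
        super().__init__()
        # MLP adapter: 1024 -> 512 bottleneck with GELU and LayerNorm
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.GELU(),
            nn.LayerNorm(hidden_dim),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # Convolutional module: two 1D convolutions with GELU over the feature axis
        self.conv = nn.Sequential(
            nn.Conv1d(1, 1, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv1d(1, 1, kernel_size=3, padding=1),
            nn.GELU(),
        )
        # Prediction head: linear classifier over the refined 512-d features
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        x = self.mlp(embeddings)                  # (batch, 512)
        x = self.conv(x.unsqueeze(1)).squeeze(1)  # treat features as a 1-D sequence
        return self.head(x)                       # (batch, num_classes) logits

# Example: logits for a batch of 8 precomputed embeddings
logits = MedImageInsightAdapter()(torch.randn(8, 1024))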
II. Boosting Vision-Language Models with in-Context Prompting

We leverage vision-language models (e.g., GPT-4o, GPT-4.1), which represent a recent class of multimodal foundation models capable of jointly reasoning over visual and textual inputs. These models are particularly promising for dermatology tasks due to their ability to interpret complex visual patterns in medical images while simultaneously understanding domain-specific medical terminology.

1. Few-shot Prompting

In this setting, a small number of examples from the training dataset are randomly selected and embedded into the input prompt. These examples, consisting of paired images and corresponding labels, are intended to guide the model's interpretation of new inputs by providing contextual cues and examples of relevant dermatological features.

2. MedImageInsight-based Retrieval-Augmented Generation (RAG)

This approach enhances vision-language model performance by integrating a similarity-based retrieval mechanism rooted in MedImageInsight image-to-image comparison. Specifically, it employs a k-nearest neighbors (k-NN) search to identify the top k dermatological training images that are most visually similar to a given query image. The retrieved examples, consisting of dermatological images and their corresponding labels, are then used as in-context examples in the Vision-LLM prompt. By presenting visually similar cases, this approach provides the model with more targeted contextual references, enabling it to generate predictions grounded in relevant visual patterns and associated clinical semantics.

As illustrated in Figure 2, the system operates in two phases:

Index Construction: Embeddings are extracted from all training images using a pretrained vision encoder (MedImageInsight). These embeddings are then indexed to enable efficient and scalable similarity search during retrieval.
Query and Retrieval: At inference time, the test image is encoded similarly to produce a query embedding. The system computes the Euclidean distance between this query vector and all indexed embeddings, retrieving the k nearest neighbors with the smallest distances.

To handle the computational demands of large-scale image datasets, the method leverages FAISS (Facebook AI Similarity Search), an open-source library designed for fast and scalable similarity search and clustering of high-dimensional vectors. The implementation of the image search method is available in our Samples Repository: https://aka.ms/healthcare-ai-examples-mi2-2d-image-search

Figure 2: MedImageInsight-based Retrieval-Augmented Generation
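A minimal sketch of the two phases with FAISS is shown below. It assumes the training embeddings, their labels and image paths, and the query embedding are already available as NumPy arrays or lists, and it uses an exact L2 index for simplicity; the linked sample may use a different index type or configuration.

import numpy as np
import faiss

# Phase 1 - Index construction over precomputed training embeddings (N, 1024), float32
index = faiss.IndexFlatL2(train_embeddings.shape[1])   # exact Euclidean-distance index
index.add(train_embeddings.astype(np.float32))

# Phase 2 - Query and retrieval: top-k most similar training images for one query image
k = 5
distances, indices = index.search(query_embedding.astype(np.float32).reshape(1, -1), k)

# Retrieved image/label pairs become the in-context examples for the Vision-LLM prompt
in_context_examples = [
    {"image_path": train_image_paths[i], "label": train_labels[i]}
    for i in indices[0]
]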
Evaluation

Table 1 presents accuracy scores for anatomic location prediction on the DermaVQA-iiyi test set using the proposed modeling approaches. The adapter model achieves a baseline accuracy of 31.73%. Vision-language models perform better, with GPT-4o (2024-11-20) achieving an accuracy of 47.11%, and GPT-4.1 (2025-04-14) improving to 50%. However, incorporating few-shot prompting with five randomly selected in-context examples (5-shot) slightly reduces GPT-4.1's performance to 48.72%. This decline suggests that unguided example selection may introduce irrelevant or low-quality context, potentially reducing the effectiveness of the model's predictions for this specialized task.

The best performance among the vision-language approaches is achieved using the retrieval-augmented generation (RAG) strategy. In this setup, GPT-4.1 is prompted with five nearest-neighbor examples retrieved using the MedImageInsight-based search method (RAG-5), leading to a notable accuracy increase to 51.60%. This improvement over GPT-4.1's 50% accuracy without retrieval showcases the relevance of the MedImageInsight-based RAG method. We expect larger performance gains with a more extensive dermatology dataset than the relatively small one used in this example (a collection of 2,474 images associated with 842 patient cases, which served as the basis for selecting relevant cases and similar images). Dermatology is a particularly challenging domain, marked by a high number of distinct conditions and significant variability in skin tone, texture, and lesion appearance. This diversity makes robust and representative example retrieval especially critical for enhancing model performance. The results underscore the importance of example relevance in few-shot prompting, demonstrating that similarity-based retrieval can effectively guide the model toward more accurate predictions in complex visual reasoning tasks.

Table 1: Comparative Accuracy of Anatomic Location Prediction on DermaVQA-iiyi

Figure 2: Confusion Matrix of Anatomical Location Predictions by the trained MLP adapter: The matrix illustrates the model's performance in classifying wound images across 39 anatomical regions. Strong diagonal values indicate correct classifications, while off-diagonal entries highlight common misclassifications, particularly among anatomically adjacent or visually similar regions such as 'lowerback' vs. 'back' and 'hand' vs. 'fingers'.

Figure 3. Examples of correct anatomical predictions by the RAG approach. Each image depicts a case where the model's predicted anatomical region exactly matches the ground truth. Shown are examples from visually and anatomically distinct areas including the eye region, lips, lower leg, and neck.

Figure 4. Examples of misclassifications by the RAG approach. Each image displays a case where the predicted anatomical label differs from the ground truth. In several examples, predictions are anatomically close to the correct regions (e.g., hand vs. hand-back, lower leg vs. foot, palm vs. fingers), suggesting that misclassifications often occur between adjacent or visually similar areas. These cases highlight the challenge of precise localization in fine-grained anatomical classification and the importance of accounting for anatomical ambiguity in both modeling and evaluation.

Conclusion

Our exploration of scalable image retrieval and advanced prompting strategies demonstrates the growing potential of vision-language models in dermatology. A particularly challenging task we address is anatomic location prediction, which involves 39 fine-grained classes of dermatology images, imbalanced training data, and frequent misclassifications between adjacent or visually similar regions. By leveraging Retrieval-Augmented Generation (RAG) with similarity-based example selection using image embeddings from the MedImageInsight foundation model, we show that relevant contextual guidance can significantly improve model performance in such complex settings. These findings underscore the importance of intelligent image retrieval and prompt construction for enhancing prediction accuracy in fine-grained medical tasks. As vision-language models continue to evolve, their integration with retrieval mechanisms and foundation models holds substantial promise for advancing clinical decision support, medical research, and education at scale. In the next blog of this series, we will shift focus to the wound care subdomain of dermatology, and we will release accompanying Jupyter notebooks for the adapter-based and RAG-based methods to provide a reproducible reference implementation for researchers and practitioners.

The Microsoft healthcare AI models, including MedImageInsight, are intended for research and model development exploration. The models are not designed or intended to be deployed in clinical settings as-is nor for use in the diagnosis or treatment of any health or medical condition, and the individual models' performances for such purposes have not been established. You bear sole responsibility and liability for any use of the healthcare AI models, including verification of outputs and incorporation into any product or service intended for a medical purpose or to inform clinical decision-making, compliance with applicable healthcare laws and regulations, and obtaining any necessary clearances or approvals.

References

[1] Noel C. F. Codella, Ying Jin, Shrey Jain, Yu Gu, Ho Hin Lee, Asma Ben Abacha, Alberto Santamaría-Pang, Will Guyman, Naiteek Sangani, Sheng Zhang, Hoifung Poon, Stephanie L. Hyland, Shruthi Bannur, Javier Alvarez-Valle, Xue Li, John Garrett, Alan McMillan, Gaurav Rajguru, Madhu Maddi, Nilesh Vijayrania, Rehaan Bhimai, Nick Mecklenburg, Rupal Jain, Daniel Holstein, Naveen Gaur, Vijay Aski, Jenq-Neng Hwang, Thomas Lin, Ivan Tarapov, Matthew P. Lungren, Mu Wei: MedImageInsight: An Open-Source Embedding Model for General Domain Medical Imaging. CoRR abs/2410.06542 (2024)
[2] Wen-wai Yim, Yujuan Fu, Zhaoyi Sun, Asma Ben Abacha, Meliha Yetisgen, Fei Xia: DermaVQA: A Multilingual Visual Question Answering Dataset for Dermatology. MICCAI (5) 2024: 209-219
[3] Model catalog and collections in Azure AI Foundry portal. https://learn.microsoft.com/en-us/azure/ai-studio/how-to/model-catalog-overview