Operationalizing AI powered medical imaging pipeline for cohort building
Authors:
- Jared Erwin, Senior Software Engineer, HLS Nursing AI and Data Platform, Faculty UW School of Medicine
- Manoj Kumar, Director, HLS - Data & AI HLS Frontiers AI
- Alberto Santamaria-Pang, Principal Applied Data Scientist, HLS Frontiers AI and Adjunct Faculty, Johns Hopkins Medicine

Overview

In Part 1 of this series, we showed how natural language could be used to define medical imaging cohorts and retrieve relevant studies in seconds instead of months. That proof-of-concept demonstrated the value of the idea, but not how to make it repeatable or production-ready.

This post focuses on how we turned that prototype into a production-oriented Azure Machine Learning pipeline that scales execution and produces clear, versioned artifacts that can drive an interactive cohort exploration UI. If you're building ML pipelines for medical imaging, or any domain where data is large, messy, and locked behind access controls, we hope our experience saves you time.

From scripts to a pipeline: Why Azure ML components?

The original hackathon implementation consisted of notebooks and scripts that required careful manual execution. To make the system repeatable and auditable, we standardized it using Azure ML pipelines.

Azure ML pipelines gave us:
- Componentized execution: each processing step is a self-contained unit with defined inputs, outputs, and dependencies
- Parallel branches: steps that don't depend on each other run concurrently
- Reproducibility: every run is versioned and logged with full lineage
- Compute flexibility: run on CPU for metadata extraction and on GPU for model inference, without manual orchestration

The pipeline architecture

The pipeline consists of five Python components arranged in a DAG with two parallel branches:
- [0] scans a DICOM directory and extracts metadata from headers: study/series UIDs, modality, body part, slice counts.
- [1] classifies each series by anatomy and orientation using a multi-tier strategy (more on this below).
- [2] and [3] form the search pipeline: anatomy labels are converted to natural language text templates, then encoded with BiomedCLIP into a FAISS vector index.
- [4] generates 2D UMAP coordinates from the embeddings for the interactive scatter plot visualization in the UI.

[Figure: flowchart of the pipeline, showing DICOM metadata extraction, anatomy classification, visualization enrichment, and text template generation, followed by the creation of a FAISS vector index.]

Components 2 and 4 run in parallel after component 1 completes, saving roughly 10-15% of total execution time. It's a modest gain for a single run, but it adds up when iterating on pipeline parameters.

[1] Anatomy classification, integrating MedImageInsight

The anatomy classification component in the pipeline relies on MedImageInsight (MI2). MedImageInsight is Microsoft's foundation model for medical image understanding, available through the Azure AI Foundry model catalog. Unlike generative models, MedImageInsight is an embedding model: it maps medical images and text into a shared 1024-dimensional vector space, enabling tasks like classification and similarity search by comparing image embeddings against text label embeddings.

Given a DICOM image, we compare its embedding against candidate labels (e.g., "Brain", "Chest", "Abdomen") to determine the body part, scan orientation, and other imaging characteristics through zero-shot classification. We may also get anatomy that is annotated directly in the DICOM headers from component 0, the DICOM metadata extractor. We combine both data points to build our final search index.

[2] [3] FAISS index construction

As input to the FAISS index, we first run component 2, the text template generator. This component takes the metadata and anatomy information from components 0 and 1 and feeds them into five different agents, each with different instructions on how to describe the DICOM study.
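To make the idea of several differently worded descriptions per study concrete, here is a minimal sketch. It assumes plain string templates stand in for the five agents; the `SeriesInfo` record, its field names, and the five phrasings are all hypothetical and are not taken from the actual component.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SeriesInfo:
    """Hypothetical record combining outputs of components 0 and 1."""
    modality: str      # e.g. "MR", from the DICOM headers
    body_part: str     # e.g. "Brain", from metadata or classification
    orientation: str   # e.g. "axial", from classification


# Five hypothetical phrasing styles standing in for the five agents.
TEMPLATES = [
    "{modality} scan of the {body_part} in {orientation} orientation",
    "{orientation} {body_part} {modality}",
    "{body_part} imaging study, modality {modality}, {orientation} view",
    "a {modality} series showing the {body_part} ({orientation})",
    "{orientation} view {modality} images of the {body_part}",
]


def generate_templates(series: SeriesInfo) -> List[str]:
    """Return one textual description per phrasing style."""
    fields = {
        "modality": series.modality,
        "body_part": series.body_part,
        "orientation": series.orientation,
    }
    return [template.format(**fields) for template in TEMPLATES]


if __name__ == "__main__":
    for text in generate_templates(SeriesInfo("MR", "Brain", "axial")):
        print(text)
```

Varying the wording for the same study gives a free-text query more than one phrasing to land near in the embedding space.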
This results in textual descriptions with some variation, referred to as text templates, which can be indexed in the next component.

The FAISS index builder (component 3) uses BiomedCLIP to encode all text templates into 512-dimensional vectors:

```python
from typing import List

import numpy as np
import torch
import torch.nn.functional as F

MODEL_NAME = "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"

# Method of the encoder class; self.model, self.tokenizer, and self.device
# are set up elsewhere when the BiomedCLIP model is loaded.
@torch.no_grad()
def encode(self, texts: List[str], batch_size: int = 256) -> np.ndarray:
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        tokens = self.tokenizer(batch).to(self.device)
        batch_embeddings = self.model.encode_text(tokens)
        batch_embeddings = F.normalize(batch_embeddings, dim=-1)  # L2 normalize
        embeddings.append(batch_embeddings.cpu().numpy())
    return np.vstack(embeddings)
```

We L2-normalize all vectors and use faiss.IndexFlatIP (inner product), which is equivalent to cosine similarity on normalized vectors. For our current dataset sizes (thousands of series), flat indexing is fast enough. For hospital-scale datasets with millions of images, we might switch to IndexIVFFlat or IndexHNSWFlat for approximate nearest neighbor search.

In the cohort explorer app, a user enters a natural language query, which is converted to an embedding using the same BiomedCLIP model. That embedding is then used to search the FAISS index for relevant DICOM studies.

[4] Visualization: making embeddings explorable

The scatter plot in the UI is often the first thing users interact with. It needs to show meaningful clusters without requiring users to understand dimensionality reduction. Component 4 takes the embeddings from component 1 and projects them to 2D with UMAP:

```python
from umap import UMAP

umap = UMAP(
    n_components=2,
    n_neighbors=10,    # Balances local vs. global structure
    min_dist=0.5,      # Prevents over-clustering
    metric='cosine',   # Matches our embedding similarity metric
    random_state=42    # Reproducible layouts
)
coordinates_2d = umap.fit_transform(features)
```

Each point in the scatter plot corresponds to a single DICOM series produced by the pipeline, with color, grouping, and hover metadata derived directly from the JSON artifacts emitted by components 1 and 4.

Each pipeline run produces a small set of well-defined artifacts (metadata tables, embedding vectors, UMAP coordinates, and the FAISS index), which are consumed directly by the cohort exploration UI. The cohort explorer application can reload or switch between datasets.

[Figure: screen capture of the Azure ML pipeline. It includes the five pipeline components along with connecting arrows showing incoming and outgoing data, including the final outputs of the pipeline.]

Pipeline execution: time, cost, and what we learned

Here's what a typical pipeline run looks like for a dataset of ~4,500 DICOM series:

| Component | Task | Approximate Time (CPU) | Approximate Time (GPU) |
|---|---|---|---|
| 0 - DICOM Metadata Extractor | Scan files, extract headers | 5-10 min | 5-10 min |
| 1 - Anatomy Classification | Classify anatomy/orientation | 90-120 min | 5-10 min |
| 2 - Text Template Generator | Generate 5 templates per series | 5-10 min | 5-10 min |
| 3 - FAISS Index Builder | BiomedCLIP encoding + FAISS build | 60-90 min | 10-15 min |
| 4 - Visualization Enrichment | UMAP + color assignment | 20-40 min | 5-10 min |
| Azure ML overhead | Compute provisioning, env setup | 5-10 min | 5-10 min |
| Total | | ~200-300 min | ~30-50 min |

Key observations:

- Azure ML overhead is significant when doing quick iteration and testing. Compute provisioning, conda environment builds, and data mounting add several minutes before any component code runs. We first built each component as Python code that we could run and debug locally before our first Azure ML run. This way we iterated quickly and avoided cost until we were ready.
- BiomedCLIP encoding dominates on CPU.
Component 3 is one of the two slowest steps on CPU, alongside anatomy classification. Moving this component to GPU compute cuts its time from 60-90 minutes to 10-15 minutes in the table above, but GPU clusters cost more. For a pipeline you run occasionally, CPU is fine. For frequent re-indexing, GPU pays for itself.
- Batch size tuning matters. The default BiomedCLIP batch size of 256 balances memory and throughput. On GPU, you can push to 512. On CPU with limited RAM, drop to 128.

At Scale: 120,000 Images, CPU vs. GPU

We ran the full pipeline against a larger dataset of ~120,000 images to understand how compute choice affects end-to-end time and cost:

| | CPU Pipeline | GPU Pipeline |
|---|---|---|
| Pipeline compute time | 4 days, 12 hours (108 hrs) | 15 hours |
| Pipeline compute cost | ~$0.25/hr × 108 hrs = ~$27 | ~$3.00/hr × 15 hrs = ~$45 |
| MedImageInsight endpoint (MaaP on Standard_NC4as_T4_v3) | ~$151 | ~$21 |
| Total estimated cost | ~$178 | ~$66 |

Both pipeline runs make the same ~120,000 classification calls to the MedImageInsight endpoint, but those calls are spread over different time periods depending on how quickly the pipeline can issue them. The hourly cost for MedImageInsight on a Standard_NC4as_T4_v3 VM is ~$1.40/hr, and the endpoint cost scales with the hours the run takes, which gives the estimated MedImageInsight costs in the table above.

GPU compute was roughly 7× faster at about 0.37× the total cost when endpoint costs are included. This was a key learning and clearly indicates the benefits of the more powerful compute resources.

MedImageInsight can be deployed in two ways, depending on dataset size and operational needs.

For smaller or infrequently processed datasets, we deploy MedImageInsight as a managed Azure ML online endpoint and invoke it from the pipeline. This keeps the pipeline simpler and avoids managing the MedImageInsight compute directly, while offering comparable performance at modest scale.

For larger batch workloads, an alternative approach is to load MedImageInsight directly on the Azure ML pipeline’s GPU-backed compute.
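As a quick check on the CPU vs. GPU cost table above, the arithmetic can be reproduced in a few lines. This is only a sketch: the hourly rates and run times are the approximate figures quoted in this post, the names `estimate_cost` and `runs` are made up for illustration, and it assumes the endpoint is paid for over the full duration of the run.

```python
# Approximate figures copied from the CPU vs. GPU table in this post.
ENDPOINT_RATE = 1.40  # $/hr, MedImageInsight on Standard_NC4as_T4_v3

runs = {
    "CPU": {"hours": 108, "compute_rate": 0.25},
    "GPU": {"hours": 15, "compute_rate": 3.00},
}


def estimate_cost(hours: float, compute_rate: float,
                  endpoint_rate: float = ENDPOINT_RATE) -> dict:
    """Split a run's estimated cost into pipeline compute and endpoint hosting."""
    compute = hours * compute_rate
    # Assumption: the endpoint is billed for every hour the run is active.
    endpoint = hours * endpoint_rate
    return {"compute": compute, "endpoint": endpoint, "total": compute + endpoint}


costs = {name: estimate_cost(**cfg) for name, cfg in runs.items()}
speedup = runs["CPU"]["hours"] / runs["GPU"]["hours"]
cost_ratio = costs["GPU"]["total"] / costs["CPU"]["total"]

if __name__ == "__main__":
    for name, c in costs.items():
        print(f"{name}: compute ${c['compute']:.0f}, "
              f"endpoint ${c['endpoint']:.0f}, total ${c['total']:.0f}")
    print(f"speedup {speedup:.1f}x, cost ratio {cost_ratio:.2f}x")
```

The endpoint term is what tips the balance: the CPU run is cheaper per hour of pipeline compute, but it keeps the endpoint busy for about seven times as long. The second deployment option introduced just before this check, loading MedImageInsight on the pipeline's own GPU compute, removes that term: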
In this model, the pipeline handles both model loading and classification, eliminating per-request network round trips and the fixed cost of hosting a persistent endpoint. While this approach requires slightly longer pipeline run time, it becomes more cost‑effective at scale by avoiding endpoint overhead and improving throughput during bulk processing.

Possible future enhancements

- Additional modalities: extending the pipeline and classification to CT, X-ray, and ultrasound imaging, and building on the pattern for pathology images
- Image embeddings fusion: combining MedImageInsight image embeddings with text embeddings for hybrid search
- Condition-aware search: enabling queries about findings and conditions, not just imaging parameters

The gap between a hackathon demo and a production system is where the real engineering happens. We hope sharing our journey helps others building similar systems.

If you’re interested in partnering with us to work toward this goal or need access to the GitHub repo with the pipeline and UI code, contact the authors through your Microsoft account team or reach out to the Microsoft HLS AI Frontiers team.

The healthcare AI models in Microsoft Foundry are intended for research and model development exploration. The models are not designed or intended to be deployed in clinical settings as-is nor for use in the diagnosis or treatment of any health or medical condition, and the individual models' performances for such purposes have not been established. You bear sole responsibility and liability for any use of the healthcare AI models, including verification of outputs and incorporation into any product or service intended for a medical purpose or to inform clinical decision-making, compliance with applicable healthcare laws and regulations, and obtaining any necessary clearances or approvals.

Driving AI‑Powered Healthcare: A Data & AI Webinar and Workshop Series
Across these sessions, you’ll learn how healthcare organizations are using Microsoft Fabric, advanced analytics, and AI to unify fragmented data, modernize analytics, and enable intelligent, scalable solutions, from enterprise reporting to AI‑powered use cases. Whether you’re just getting started or looking to accelerate adoption, these sessions offer practical guidance, real‑world examples, and hands‑on learning to help you build a strong data foundation for AI in healthcare.

| Date | Topic | Details | Location | Registration Link |
|---|---|---|---|---|
| May 6 | Webinar: Microsoft Fabric Foundations - A Simple Path to Modern Analytics and AI | Discover how Microsoft Fabric consolidates fragmented analytics into a single integrated data platform, making it easier to deliver trusted insights and adopt AI without added complexity. | Virtual | Register |
| May 13 | Webinar: Reduce BI Sprawl, Cut Cost and Build an AI-Ready Analytics Foundation | Learn how Power BI enables enterprise BI consolidation, consistent metrics, and secure, scalable analytics that support both operational reporting and emerging AI use cases. | Virtual | Register |
| May 19-20 | In Person Workshop: Driving AI‑Powered Healthcare: Advanced Analytics, AI, and Real‑World Impact | Attend this two‑day, in‑person event to learn how healthcare organizations use Microsoft Fabric to unify data, accelerate AI adoption, and deliver measurable clinical and operational value. Day 1 focuses on strategy, architecture, and real‑world healthcare use cases, while Day 2 offers hands‑on workshops to apply those concepts through guided labs and agent‑powered solutions. | Chicago | Register |
| May 27 | Webinar: Unified Data Foundation for AI & Analytics - Leveraging OneLake and Microsoft Fabric | This session shows how organizations can simplify fragmented data architectures by using Microsoft Fabric and OneLake as a single, governed foundation for analytics and AI. | Virtual | Register |
| June 3-4 | In Person Workshop: Driving AI‑Powered Healthcare: Advanced Analytics, AI, and Real‑World Impact | Attend this two‑day, in‑person event to learn how healthcare organizations use Microsoft Fabric to unify data, accelerate AI adoption, and deliver measurable clinical and operational value. Day 1 focuses on strategy, architecture, and real‑world healthcare use cases, while Day 2 offers hands‑on workshops to apply those concepts through guided labs and agent‑powered solutions. | New York | Register |
| June 10 | Webinar: From Data to Decisions: How AI Data Agents in Microsoft Fabric Redefine Analytics | Join us to learn how Fabric Data Agents enable users to interact with enterprise data through AI‑powered, governed agents that understand both data and business context. | Virtual | Register |
| June 23-24 | In Person Workshop: Driving AI‑Powered Healthcare: Advanced Analytics, AI, and Real‑World Impact | Attend this two‑day, in‑person event to learn how healthcare organizations use Microsoft Fabric to unify data, accelerate AI adoption, and deliver measurable clinical and operational value. Day 1 focuses on strategy, architecture, and real‑world healthcare use cases, while Day 2 offers hands‑on workshops to apply those concepts through guided labs and agent‑powered solutions. | Dallas | Register |

The Agent Era Has Already Arrived in Healthcare. Are You Ready to Govern It?
Start here. Answer honestly.

- Right now, how many AI agents are running inside your organization? Who built them? Which patient data, claims information, or proprietary research are they configured to access?
- If your CISO walked into your office tomorrow and asked for a complete inventory of every agent in your enterprise, including each one's owner, the systems it is permitted to access, and the policies that govern how it operates, could you produce that inventory before lunch?
- When the analyst who built that clinical summarization agent moves to a new role next quarter, what happens to the agent? Does its access continue? Does anyone notice?
- If a regulator opened an audit tomorrow, could you prove that every AI agent operating in your environment is subject to the same lifecycle controls, identity standards, and data protection policies you apply to your human workforce?
- Could you disable a compromised agent enterprise-wide with a single click, the same way you would revoke a lost access credential?

If those questions made you hesitate, you are not alone. Almost no healthcare or life sciences organization can answer them confidently today. And that gap is exactly where the next decade of risk, and the next decade of competitive advantage, will be decided.

The quiet crisis nobody talks about yet

Healthcare and life sciences leaders are caught in a paradox. You need AI to survive the operational pressures squeezing your organization from every direction. Physician burnout is at crisis levels, with 45.2% of US physicians reporting symptoms in recent Mayo Clinic research. Revenue cycle complexity continues to climb, and McKinsey now estimates that the cost to collect consumes 30 to 60 percent of net patient revenue at many provider organizations. Prior authorization backlogs delay care. Clinical trial timelines stretch into years. Documentation burden eats hours that belong to patients.

So you started piloting Microsoft 365 Copilot.
You experimented with agents in Copilot Studio. Maybe a clinical team built an agent to draft discharge summaries. A revenue cycle group spun up an agent to triage denials. A medical affairs team built one to comb through literature. Each one delivered value. Each one was approved on its own merits.

And then a quiet thing happened. You lost track of how many agents you have.

According to KPMG's AI Quarterly Pulse Survey, 88 percent of organizations are now exploring or piloting AI agents. IDC projects that 1.3 billion agents will be in operation by 2028. Inside your own walls, the number is climbing fast. Each new agent is a digital identity that authenticates into your environment, accesses your data, and executes work on behalf of your business. Most have no formal owner. Most have no documented access scope. Most have no decommissioning plan. Most have never been reviewed by Compliance.

Microsoft's 2024 Data Security Index found that 84 percent of organizations lack confidence in their AI data security posture, and 40 percent have already experienced an AI-related data security incident. That is not a future problem. That is a now problem.

If shadow IT was the defining governance challenge of the last decade, agent sprawl is the defining challenge of this one. And in healthcare and life sciences, where ePHI, member PII, and proprietary clinical trial data are at stake, the consequences are not theoretical. They are existential.

The reframe that changes everything

Here is the counterintuitive truth that separates HLS organizations that scale AI from those stuck in pilot purgatory. Governance is not the brake on AI adoption. Governance is the accelerator.

When security, identity, and agent oversight are engineered in from day one, your teams stop tiptoeing. They build with confidence because the guardrails are real. They expand into clinical use cases because Compliance trusts the foundation.
They scale wall-to-wall because IT can prove every agent is accounted for. The organizations that lead with trust end up moving faster in the long run, not slower. This is the bet behind Microsoft Agent 365 and Microsoft 365 E7.

What Agent 365 and Microsoft 365 E7 actually are

Microsoft 365 E7, announced March 6, 2026 and now generally available, is the Frontier Suite. It is Microsoft's answer to a single question that every healthcare CIO, CISO, and COO is wrestling with: how do you run AI safely, at scale, across an entire organization?

E7 is not another SKU on top of your existing stack. It is one cohesive platform that brings together four essential capabilities:

- Microsoft 365 E5 for your enterprise productivity, collaboration, and security foundation, including Microsoft Defender, Microsoft Purview, and Microsoft Intune.
- Microsoft 365 Copilot for AI grounded in your organizational data through Work IQ, embedded in the flow of work for clinicians, researchers, operations teams, and administrators.
- Microsoft Entra Suite for identity governance, Conditional Access, and Zero Trust network access, extended consistently across users, applications, and AI agents.
- Microsoft Agent 365 as the centralized control plane to observe, govern, and secure every AI agent, whether built by Microsoft, your internal teams, or external partners.

Agent 365 is also available as a standalone capability. But the magic happens when it works alongside the rest of E7, because that is where AI, identity, security, and governance stop being separate disciplines and become one operating system for the agentic era.

The mental model that unlocks everything: agents are first-class digital identities

Here is the simplest way to understand what Agent 365 does. Microsoft 365 governs your enterprise identities. Agent 365 governs your agent identities. The same control plane disciplines apply to both.
Think about the rigor you apply to any privileged identity in your environment, whether a service account, an API integration, or a third-party application connector. You issue it a unique identity in Microsoft Entra. You assign a human owner who is accountable. You scope its access to least privilege. You apply DLP, sensitivity labels, and Conditional Access. You monitor for anomalous behavior. You have a documented decommissioning path. Identities that no one watches over become identities that get exploited.

Now ask yourself how the last AI agent in your environment was created. The honest answer at most organizations: someone opened Copilot Studio, pointed it at a SharePoint library of clinical protocols, gave it a name, and moved on. No documented owner. No access review. No retirement plan. Compliance was never consulted.

You would never stand up a privileged service account that way. Yet that is exactly how most organizations are standing up the fastest-growing class of digital identities in their environment.

Agent 365 closes that gap by extending the identity, security, and lifecycle controls you already trust for users and applications so they apply with the same rigor to AI agents.

- Every agent receives a unique Entra Agent ID, a first-class identity in Microsoft Entra ID (formerly Azure AD) with the same governance primitives as any other privileged identity.
- Every agent has a designated human owner who is accountable for its scope and behavior.
- Access is granted explicitly through Conditional Access and policy templates, so each agent operates only against the resources its purpose requires.
- Microsoft Purview DLP and sensitivity labels govern which data the agent is permitted to read, generate, or share.
- Microsoft Defender monitors agent activity for anomalies and surfaces alerts the same way it does for any other identity-driven risk.
- Lifecycle rules flag or auto-retire agents that are dormant, orphaned, or risky, eliminating the unowned automations that quietly accumulate in every enterprise.

This is not metaphor. It is the actual architecture. The fastest path to governing agents is to extend the identity infrastructure you already trust.

The three pillars of Agent 365: Observe, Govern, Secure

Pillar 1: Observe. Know what is actually happening.

You cannot govern what you cannot see. The first job of Agent 365 is to give you complete, continuous visibility into every AI agent operating in your environment.

- The Agent Registry is the single authoritative inventory of every agent, whether built by Microsoft, custom developed by your team, deployed by a partner, or discovered as a shadow agent operating without oversight. Each entry shows the owner, purpose, capabilities, lifecycle status, and business context.
- Agent Analytics tracks adoption, quality, performance, and business impact.
- Agent Map visualizes how agents connect with other agents, people, tools, and data sources, surfacing dependencies and risk concentrations you would never spot in a spreadsheet.
- Real-time monitoring flows directly into Microsoft Defender, so unusual agent behavior generates alerts the same way unusual user behavior does today.

For a health system CISO, that means finally being able to answer the question: which agents are touching ePHI, and is every one of them authorized? For a life sciences compliance officer, it means audit-ready visibility into every AI system operating across R&D, regulatory affairs, and commercial. For a payer operations leader, it means knowing which claims processing agents are actually delivering accuracy and throughput, and which are quietly underperforming.

Pillar 2: Govern. Set the rules. Control the lifecycle.

Visibility is the start. Control is what turns visibility into outcomes.
Agent 365 ensures that every agent is approved, compliant, and accountable from creation through retirement.

- IT-led onboarding workflows make sure each agent launches with the right identity, access, and ownership before it ever touches data.
- Policy templates enforce data handling, permission, and usage rules consistently from day one through Defender, Entra, and Purview.
- Rules-based agent management gives admins an automated If This Then That interface. If an agent is unused for 90 days, auto-retire it. If an agent is flagged as risky, block it and alert the security operations team. No human in the loop required for the routine cases, full alerting and override for the exceptions.
- Ownership enforcement requires every agent to have a designated human owner. When that owner leaves the organization, the platform flags the orphaned agent for bulk reassignment, so nothing operates without clear accountability.
- The Tools Gateway brokers and audits tool access for agents, enabling least privilege at the action level, not just the identity level.

For HLS specifically, that translates to outcomes you can take to your board. A hospital CIO can ensure any agent touching Epic or Cerner goes through standardized approval. A pharma IT director can enforce that clinical trial matching agents only touch de-identified data unless elevated permissions are explicitly granted and documented. A payer compliance team can automatically retire agents tied to a completed open enrollment campaign instead of letting them silently expand the attack surface.

Pillar 3: Secure. Protect agents and data with the stack you already trust.

The final pillar is what makes Agent 365 production grade for healthcare and life sciences. Security and compliance are not bolted on. They are the same proven Microsoft security stack you already run for your users, extended natively to agents.
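The if-this-then-that rules described under Pillar 2 (retire after 90 idle days, block and alert on risk, flag orphaned agents) can be pictured with a small sketch. This is purely illustrative: the `Agent` record, the `evaluate` function, and the action names are hypothetical and do not represent the Agent 365 data model or any Microsoft API.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class Agent:
    """Hypothetical registry entry; not the Agent 365 data model."""
    name: str
    owner: Optional[str]   # None means the owner has left (orphaned agent)
    last_used: date
    risky: bool = False


def evaluate(agent: Agent, today: date, idle_days: int = 90) -> List[str]:
    """Return the actions the rules in the text would trigger for one agent."""
    actions: List[str] = []
    if agent.risky:
        # Flagged as risky: block it and alert security operations.
        actions += ["block", "alert_soc"]
    if today - agent.last_used >= timedelta(days=idle_days):
        # Unused for the idle window: auto-retire.
        actions.append("auto_retire")
    if agent.owner is None:
        # No accountable human owner: queue for reassignment.
        actions.append("flag_for_reassignment")
    return actions


if __name__ == "__main__":
    today = date(2026, 6, 1)
    fleet = [
        Agent("discharge-summary", "a.nurse", date(2026, 5, 28)),
        Agent("open-enrollment-2025", "b.ops", date(2026, 1, 15)),
        Agent("denials-triage", None, date(2026, 5, 30), risky=True),
    ]
    for agent in fleet:
        print(agent.name, evaluate(agent, today))
```

Keeping each rule as an independent check means an agent can trigger several actions at once, for example being both blocked and queued for a new owner.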
Microsoft Purview, your data security and compliance backbone:

- Data Security Posture Management for AI gives visibility into how agents interact with sensitive data and detects risky usage patterns.
- Data Loss Prevention stops agents from accessing or processing files labeled Highly Confidential, even when a user prompts them to.
- Sensitivity labels are inherited automatically by agent outputs, governing how data is viewed, extracted, or shared downstream.
- Insider Risk Management detects risky behavior by users interacting with agents, such as unusual prompt patterns or excessive access to sensitive data.
- Communication Compliance monitors AI-driven interactions for regulatory or ethical violations and unauthorized disclosures.
- eDiscovery and Audit logs every agent interaction, giving legal, compliance, and IT teams the transparency required for HIPAA, GDPR, and FDA 21 CFR Part 11.
- Oversharing Assessments run weekly checks for sensitive data exposure across SharePoint sites and agent access patterns.

Microsoft Entra, your identity control plane:

- Entra Agent ID gives every agent a unique identity in Microsoft Entra ID (formerly Azure AD), so Conditional Access, role-based access, and risk-based policies apply individually.
- Conditional Access for agents enforces policies such as "only allow this prior authorization agent to access claims data from approved devices and locations during business hours."
- Identity Governance provides access packages for agents with reduced-scope permissions and least-privilege defaults.
- Block at Scale lets you instantly disable all high-risk agents from Entra in a single action.

Microsoft Defender, your threat protection layer:

- Security Posture Management identifies and remediates agent misconfigurations, such as agents running with no authentication.
- Threat Detection and Blocking monitors suspicious agent activity, generates alerts, and blocks unauthorized tool invocations.
- Threat Investigation and Hunting collects unified agent observability logs so SOC teams can forensically trace every action an agent took.
- One Click Kill Switch instantly disables any agent and surfaces the complete audit trail of every action it took before being stopped.

For a hospital security operations team, that means the same DLP policies protecting patient records in email and Teams now protect agents that summarize clinical notes. For a life sciences data protection officer, it means agents accessing proprietary compound data respect the same sensitivity labels as human researchers. For a payer CISO, it means an anomalous claims agent can be killed in seconds, with a complete forensic record of every member record it touched.

Why this only works as an integrated platform

Individual capabilities are useful. Integration is what makes them transformative. Here is the contrast HLS leaders feel today versus what changes the moment E7 lights up.

Without an integrated platform, you operate with:

- Fragmented tools for identity, security, compliance, and AI, each with its own console and its own gaps.
- No centralized agent inventory, forcing your IT and security teams to track bots and automations in spreadsheets.
- Inconsistent policy enforcement across agents, creating compliance gaps every audit team will eventually find.
- Blind spots where agents access data, invoke tools, or interact with other agents without any oversight.
- Manual triage when an incident hits, because nothing connects user identity, agent identity, and data classification in one view.

With Microsoft 365 E7, you gain:

- A Unified Agent Registry providing a single source of truth for every agent, whether Microsoft built, custom developed, partner deployed, or shadow discovered.
- Entra Agent ID giving each agent a unique identity, so Conditional Access, role-based access, and risk-based policies apply at the individual agent level.
- Full lifecycle governance with standardized onboarding, periodic review, ownership transfers, auto-retirement of dormant agents, and structured offboarding.
- Policy by design, where Purview DLP, sensitivity labels, and compliance rules extend to all agent interactions through pre-built templates applied consistently from day one.
- One-click disable to instantly freeze any agent, with Defender threat detection extended to agents and full audit trails for forensic investigation.
- Expanded threat coverage that addresses agent sprawl, overprivileged access, tool misuse, misconfiguration, and inter-agent risk patterns no legacy tool was designed to see.
- Shared registry and controls that let IT, Security, and Compliance reference the same authoritative inventory across Defender, Entra, and Purview, eliminating the silos that slow incident response.

This is the reason E7 exists as a platform, not a bundle. AI, identity, security, and governance stop being separate disciplines and start operating as one system.

What this is actually worth: the Forrester numbers

Microsoft commissioned Forrester to conduct a Total Economic Impact study of Microsoft 365 Copilot, published in March 2025. The composite organization in that study, modeled on real customer interviews, achieved:

- 132 percent three-year ROI with payback in under one year.
- 9 hours saved per Copilot user per month through automation of routine work like drafting, summarizing, and analysis.
- Up to 2.6 percent top-line revenue lift through better qualified opportunities, improved win rates, and stronger retention in customer-facing teams.
- 25 percent acceleration in new employee onboarding as new hires ramp faster on summarized institutional knowledge.

Those are the verified numbers. The bigger story for HLS is what they look like when applied to clinical, claims, and research workflows where every reclaimed hour is an hour that goes back to patients, members, or science.
AI is already defending AI

The same agentic capabilities transforming clinical and operational workflows are now embedded in your security stack. Microsoft Security Copilot agents work alongside human analysts inside Defender, Entra, Purview, and Intune, accelerating threat response and absorbing the manual load that today drowns most security operations teams.

Independent benchmarks back the impact. In a 162-admin randomized study published in 2025, the Conditional Access Optimization Agent in Microsoft Entra completed configuration tasks 43 percent faster and produced 48 percent more accurate Conditional Access policies than admins working without it. Security triage, alert investigation, and identity hygiene are following the same trajectory.

For HLS security teams already stretched thin, that is hours reclaimed every week to focus on the threats that actually matter, with the same Agent 365 governance applying to the security agents themselves. The defenders are governed by the same rules as the workforce they defend.

How HLS organizations are putting Agent 365 to work

Here is how the value shows up across the three biggest HLS segments.

For providers: reclaiming time for care

The challenge: clinicians spend more time on documentation than on patients. Care coordination is fragmented. Burnout is gutting retention.

The strategy: deploy agents that absorb administrative load while Agent 365 ensures every one of them respects ePHI boundaries.

- Clinical documentation agents integrated with Microsoft Dragon Copilot structure dictation against EHR requirements, apply billing codes, and flag missing elements before submission.
- Care coordination agents generate care plans, allocate tasks, and surface relevant patient context during multidisciplinary rounds, optimized for HL7 FHIR interoperability.
- Patient intake and scheduling agents built in Copilot Studio handle appointment booking, reminders, eligibility verification, and referral management.
Handoff and shift summary agents pull from multiple systems to generate complete handoff summaries for nurses and physicians transitioning between shifts, reducing communication gaps that drive adverse events. The aha moment: applied across a 10,000 employee health system, nine hours per user per month is more than one million reclaimed hours a year. That is the equivalent of hundreds of full time clinicians, returned to direct patient care, with every agent governed under the same Conditional Access and DLP policies your IT team already manages today. For payers: transforming revenue cycle and member experience The challenge: prior auth backlogs delay care. Denial rates climb. Member services teams drown in volume. The strategy: agentic AI rewires the most expensive, most manual workflows in your operation while Agent 365 keeps every agent inside the lines on member PII. Prior authorization agents autonomously gather clinical documentation, cross reference medical policy, determine approval criteria, and route decisions, accelerating turnaround from days to hours. Claims processing agents automate billing and denial management. With cost to collect running 30 to 60 percent of net patient revenue at many organizations, even modest automation produces material margin recovery. Denial resolution and appeals agents analyze denial patterns, surface root causes, generate appeal documentation, and track success rates over time, turning a cost center into a continuous improvement engine. Member services agents integrated with Microsoft 365 Copilot Chat handle benefits inquiries, claims status, and self service triage, deflecting call volume and improving first contact resolution. Fraud detection and risk adjustment agents scan claims data for anomalies and optimize coding accuracy for Medicare Advantage and ACA populations. 
The aha moment: a payer CISO can disable an anomalous prior auth agent in one click and produce a complete forensic record of every member record it accessed, while Compliance simultaneously confirms the agent never violated DLP. That is regulatory readiness that legacy automation cannot deliver. For life sciences and pharma: accelerating discovery and commercialization The challenge: clinical trials take years. Regulatory submissions consume teams. Medical affairs cannot keep up with literature volume. The strategy: orchestrate agents across R&D, regulatory, medical, and commercial, with Agent 365 enforcing the data classification rules that proprietary IP and clinical data demand. Clinical trial matching agents scan patient profiles and eligibility criteria to surface trial opportunities, accelerating recruitment. Regulatory document preparation agents assemble submissions, cross reference data across modules, and ensure consistency in FDA, EMA, and global filings. Medical research and literature review agents powered by Microsoft GraphRAG retrieve research backed insights with verified source references, giving medical science liaisons trustworthy synthesis on demand. Pharmacovigilance agents monitor safety databases, flag potential adverse events, and generate timely case reports. Commercial insights and launch planning agents synthesize market data, payer policy, and HCP sentiment for sharper launch and field strategy. The aha moment: cutting even three months off a regulatory cycle on a single high revenue product can mean tens of millions in additional sales, while Purview sensitivity labels guarantee every agent accessing proprietary compound data respects the same data classification as your senior researchers. A phased path that actually works in regulated industries In regulated industries, a big bang AI rollout is a recipe for incidents. 
The HLS organizations getting this right are following a five-phase pattern that builds expertise and validates governance before scale. Establish. Form a cross-functional champion team across IT, Compliance, Clinical Operations, and Research. Define what risks you are mitigating and what outcomes you are unlocking. Inventory the agents already in flight. Configure. Stand up identity, DLP, and policy templates in Microsoft 365 Admin Center, Power Platform Admin Center, and Microsoft Purview. Enforce that any agent handling PHI runs in a secure environment with audit logging on by default. Pilot. Choose a small group of makers in a controlled environment. Start with non-critical workflows like internal reporting or scheduling before moving to clinical or member facing use cases. Run weekly reviews with Compliance and Security. Empower. Launch role specific training for clinicians, researchers, makers, and IT. Stand up a Center of Excellence to provide templates, best practices, and reusable patterns. Promote success stories internally to build momentum. Scale. Expand agent development across departments with governance as a guardrail, not a gate. Use pay as you go metering to track usage and optimize licensing. Refine policies continuously based on Purview signals and audit results. The strategic insight: organizations that lead with governance reach scale faster than those that lead with experimentation. Trust is the unlock, not the obstacle. Governance is a team sport Here is the pattern we see again and again. The HLS organizations that succeed with AI at scale are not the ones with the smartest IT shop or the boldest Compliance officer. They are the ones whose IT, Security, Compliance, Clinical, Research, and Operations leaders sit at the same table on agent strategy from week one. Agent 365 was designed for that table. The Agent Registry is the shared truth. Purview policies satisfy your Compliance officer. Entra controls reassure your CISO. 
The lifecycle workflows give your CIO confidence. The clinical and research outcomes give your COO and Chief Medical Officer the business case. Everyone gets the view they need from the same single source. Stand up an agent governance council. Meet every two weeks. Use the Agent Registry as your standing agenda. Make decisions in plain sight. The organizations that do this consistently outperform on both speed and safety. The ones that try to keep AI inside a single function fall behind on both. Who contributes what Think back to the mental model. You would never let a single function authorize, configure, and oversee a new privileged system on its own, not when it touches ePHI, claims, or proprietary research. Security, IT, Compliance, Clinical, and the relevant business owner all weigh in because the stakes are too high for any one seat to carry alone. Agent governance demands the same multidisciplinary scrutiny, and the council is where that happens. Each seat brings something the others cannot. CIO. Owns the agent strategy and the platform investment. Translates board-level AI ambition into an operating model the rest of the organization can execute against. CISO and Security Operations. Define agent identity standards, Conditional Access policies, and incident response playbooks. Without this seat, an anomalous agent touching ePHI becomes a breach instead of a contained event. Chief Compliance Officer and Privacy. Translate HIPAA, GDPR, FDA 21 CFR Part 11, and state regulations into Purview policies and audit requirements. This is the seat that keeps you out of an OCR investigation or a 483 letter. Chief Medical Officer and Clinical Operations. Validate that clinical agents are safe, accurate, and aligned with care standards. Own the clinical risk review for any agent that touches patient care, the same way you would for a new clinical protocol. Chief Research Officer or Head of R&D. 
Govern how agents interact with proprietary trial data, compound libraries, and scientific IP. The seat that protects the next decade of pipeline value. COO and Revenue Cycle Leadership. Prioritize the operational workflows where agents will move the needle on cost to collect, denial rates, and throughput, and own the business outcomes that justify the investment. Center of Excellence Lead. Maintains templates, reusable patterns, and maker enablement. Turns every council decision into a guardrail builders can actually use the next morning. Frontline champions. Clinicians, claims specialists, and researchers who pilot, give feedback, and carry credibility back to their peers. The seat that decides whether agents get adopted or quietly ignored. When every one of these voices is in the room, your governance council operates like a tumor board for AI. Different lenses, one shared decision, full accountability. That is how regulated industries make complex calls safely, and it is exactly the muscle Agent 365 was built to support. Seven questions to bring to your next leadership meeting If you want to know whether your organization is ready, run through these together. The places you hesitate are exactly where Agent 365 and E7 deliver the most value. Visibility. Do you know which AI agents, bots, and automations are running in your environment today, who built them, what they have access to, and whether they are still needed? Control. If someone on your team builds a new AI agent tomorrow, what is the actual process to make sure it is approved and secured? Or could they deploy it with wide open access? Security. What prevents an AI agent from reading or transmitting patient data it should not? Do you have a way to detect and stop a rogue or compromised agent? Accountability. Who owns the outputs of an AI agent's actions? What is the offboarding process when the agent or its creator leaves? Scale. 
Six months from now, you may have a hundred agents deployed across departments. Are your oversight and compliance structures ready for that volume?
Cross-functional alignment. How are your IT, Security, and Compliance teams partnering on AI today? Governance is a team sport.
Data readiness. How confident are you that your data estate is clean, labeled, and governed well enough for AI to surface accurate answers and not outdated or conflicting information?

If you hesitated on even one of those, you have just identified where Agent 365 and Microsoft 365 E7 will pay for themselves the fastest.

The path forward

Here is the honest truth. The healthcare and life sciences organizations that lead in the next decade will not be the ones that adopted AI first. They will be the ones that adopted AI safely, compliantly, and at scale, with intelligence and trust woven into every layer. Microsoft Agent 365 and Microsoft 365 E7 give you the only integrated platform that brings AI, identity, security, and governance into one cohesive system, running in the flow of work you already use. This is not about adding another tool to your stack. It is about extending the investments you have already made in Microsoft 365, Entra, Defender, and Purview to cover the fastest-growing class of digital identities in your environment. The agent era has already arrived. The question is whether you will govern it with confidence or chase it with anxiety. We would love to help you lead.

Take the next step

Explore Microsoft Agent 365: The Control Plane for Agents
Microsoft Entra Agent ID: aka.ms/EntraAgentID
Learn more about Microsoft 365 E7, the Frontier Suite: Introducing Microsoft 365 E7
See Microsoft 365 Copilot in action: Microsoft 365 Copilot
Read the Forrester TEI study: The Total Economic Impact of Microsoft 365 Copilot

Healthcare Agent Orchestrator: Multi-agent Framework for Domain-Specific Decision Support
At Microsoft Build, we introduced the Healthcare Agent Orchestrator, now available in Azure AI Foundry Agent Catalog. In this blog, we unpack the science: how we structured the architecture, curated real tumor board data, and built robust agent coordination that brings AI into real healthcare workflows.

Healthcare Agent Orchestrator assisting a simulated tumor board meeting.

Introduction

Healthcare is inherently collaborative. Critical decisions often require input from multiple specialists (radiologists, pathologists, oncologists, and geneticists) working together to deliver the best outcomes for patients. Yet most AI systems today are designed around narrow tasks or single-agent architectures, failing to reflect the real-world teamwork that defines healthcare practice. That’s why we developed the Healthcare Agent Orchestrator: an orchestrator and code sample built around Microsoft’s industry-leading healthcare AI models, designed to support reasoning and multidisciplinary collaboration, enabling modular, interpretable AI workflows that mirror how healthcare teams actually work. The orchestrator brings together Microsoft healthcare AI models, such as MedImageParse for image recognition, CXRReportGen for automated radiology reporting, and MedImageInsight for retrieval and similarity analysis, into a unified, task-aware system that enables developers to build agents that reflect real-world healthcare decision-making patterns.

This work was led by Yu (Aiden) Gu, Principal Applied Scientist at Microsoft Research, who conceived the study, defined the research direction, and led the design and development of the Healthcare Agent Orchestrator proof-of-concept.

Healthcare Is Naturally Multi-Agent

Healthcare decision-making often requires synthesizing diverse data types (radiologic images, pathology slides, genetic markers, and unstructured clinical narratives) while reconciling differing expert perspectives.
In a molecular tumor board, for instance, a radiologist might highlight a suspicious lesion on CT imaging, a pathologist may flag discordant biopsy findings, and a geneticist could identify a mutation pointing toward an alternate treatment path. Effective collaboration in these settings hinges not on isolated analysis, but on structured dialogue—where evidence is surfaced, assumptions are challenged, and hypotheses are iteratively refined. To support the development of healthcare agent orchestrator, we partnered with a leading healthcare provider organization, who independently curated and de-identified a proprietary dataset comprising longitudinal patient records and real tumor board transcripts—capturing the complexity of multidisciplinary discussions. We provided guidance on data types most relevant for evaluating agent coordination, reasoning handoffs, and task alignment in collaborative settings. We then applied LLM-based structuring techniques to convert de-identified free-form transcripts into interpretable units, followed by expert review to ensure domain fidelity and relevance. This dataset provides a critical foundation for assessing agent coordination, reasoning handoffs, and task alignment in simulated collaborative settings. 
Why General-Purpose LLMs Fall Short for Healthcare Collaboration While general-purpose large language models have delivered remarkable results in many domains, they face key limitations in high-stakes healthcare environments: Precision is critical: Even small hallucinations or inconsistencies can compromise safety and decision quality Multi-modal integration is required: Many healthcare decisions involve interpreting and correlating diverse data types—images, reports, structured records—much of which is not available in public training sets Transparency and traceability matter: Users must understand how conclusions are formed and be able to audit intermediate steps The Healthcare Agent Orchestrator addresses these challenges by pairing general reasoning capabilities with specialized agents that operate over imaging, genomics, and structured EHRs—ensuring grounded, explainable results aligned with clinical expectations. Each agent contributes domain-specific expertise, while the orchestrator ensures coherence, oversight, and explainability—resulting in outputs that are both grounded and verifiable. Architecture: Coordinating Specialists Through Orchestration Healthcare Agent Orchestrator. Healthcare Agent Orchestrator’s multi-agent framework is built on modular AI infrastructure, designed for secure, scalable collaboration: Semantic Kernel: A lightweight, open-source development kit for building AI agents and integrating the latest AI models into C#, Python, or Java codebases. It acts as efficient middleware for rapidly delivering enterprise-grade solutions—modular, extensible, and designed to support responsible AI at scale. Model Context Protocol (MCP): an open standard that enables developers to build secure, two-way connections between their data sources and AI-powered tools. 
Magentic-One: Microsoft’s generalist multi-agent system for solving open-ended web and file-based tasks across domains, built on Microsoft AutoGen, our popular open-source framework for developing multi-agent applications.

Each agent is orchestrated within the system and integrated via Semantic Kernel’s group chat infrastructure, with support for communication and modular deployment via Azure. This orchestration ensures that each model, whether interpreting a lung nodule, analyzing a biopsy image, or summarizing a genomic variant, is applied precisely where its expertise is most relevant, without overloading a single system with every task. The modularity of the framework also future-proofs it: as new health AI models and tools emerge, they can be seamlessly incorporated into the ecosystem without disrupting existing workflows, enabling continuous innovation while maintaining clinical stability.

Microsoft’s healthcare AI models at the Core

Healthcare Agent Orchestrator also enables developers to explore the capabilities of Microsoft’s latest healthcare AI models:
CXRReportGen: Integrates multimodal inputs, including current and prior X-ray images and report context, to generate grounded, interpretable radiology reports. The model has shown improved accuracy and transparency in automated chest X-ray interpretation, evaluated on both public and private data.
MedImageParse [3]: A biomedical foundation model for image parsing that can jointly conduct segmentation, detection, and recognition across 9 imaging modalities.
MedImageInsight [4]: Facilitates fast retrieval of clinically similar cases and supports disease classification across a broad range of medical image modalities, accelerating second opinion generation and diagnostic review workflows.

Each model has the ability to act as a specialized agent within the system, contributing focused expertise while allowing flexible, context-aware collaboration orchestrated at the system level.
CXRReportGen is included in the initial release and supports the development and testing of grounded radiology report generation. Other Microsoft healthcare models such as MedImageParse and MedImageInsight are being explored in internal prototypes to expand the orchestrator’s capabilities across segmentation, detection, and image retrieval tasks.

Seamless Integration with Microsoft Teams

Rather than creating new silos, Healthcare Agent Orchestrator integrates directly into the tools clinicians already use, specifically Microsoft Teams. Developers are investigating how clinicians can engage with agents through natural conversation, asking questions, requesting second opinions, or cross-validating findings, all without leaving their primary collaboration environment. This approach minimizes friction, improves user experience, and brings cutting-edge AI into real-world care settings.

Building Toward Robust, Trustworthy Multi-Agent Collaboration

Think of the orchestrator as managing a secure, structured group chat. Each participant is a specialized AI agent, such as a ‘Radiology’ agent, ‘PatientHistory’ agent, or ‘ClinicalTrials’ agent. At the center is the ‘Orchestrator’ agent, which moderates the interaction: assigning tasks, maintaining shared context, and resolving conflicting outputs. Agents can also communicate directly with one another, exchanging intermediate results or clarifying inputs. Meanwhile, the user can engage either with the orchestrator or with specific agents as needed. Each agent is configured with instructions (the system prompt that guides its reasoning) and a description (used by both the UI and the orchestrator to determine when the agent should be activated). For example, the Radiology agent is paired with the cxr_report_gen tool, which wraps Microsoft’s CXRReportGen model for generating findings from chest X-ray images.
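The agent configuration described above can be pictured with a small Python sketch. This is an illustration under stated assumptions, not the orchestrator's actual schema: the `AgentConfig` class, the `find_facilitator` helper, the boolean `facilitator` flag, and the prompt and description strings are all hypothetical, and only the agent names and the cxr_report_gen tool name come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class AgentConfig:
    """Hypothetical in-memory view of one agent's configuration."""
    name: str
    instructions: str   # system prompt that guides the agent's reasoning
    description: str    # read by the UI and the orchestrator to decide activation
    tools: list[str] = field(default_factory=list)  # tool names the agent may call
    facilitator: bool = False  # assumed flag: True only for the moderating agent


def find_facilitator(agents: list[AgentConfig]) -> AgentConfig:
    """Return the single moderating agent, or raise if there is not exactly one."""
    facilitators = [agent for agent in agents if agent.facilitator]
    if len(facilitators) != 1:
        raise ValueError(
            f"expected exactly one facilitator, found {len(facilitators)}"
        )
    return facilitators[0]


# Illustrative values only; the prompts and descriptions are placeholders.
agents = [
    AgentConfig(
        name="Orchestrator",
        instructions="Moderate the discussion and assign tasks to other agents.",
        description="Coordinates the conversation and keeps shared context.",
        facilitator=True,
    ),
    AgentConfig(
        name="Radiology",
        instructions="Describe findings from chest X-ray images.",
        description="Activated for questions about imaging studies.",
        tools=["cxr_report_gen"],
    ),
    AgentConfig(
        name="PatientHistory",
        instructions="Summarize the patient's longitudinal record.",
        description="Activated for questions about prior history.",
    ),
]

moderator = find_facilitator(agents)
print(moderator.name)
```

Validating the single-moderator rule when the configuration is loaded, rather than during a conversation, surfaces a misconfigured agent list before any agent is invoked.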
Tools like this are declared under the agent’s tools field and allow it to call foundation models or other capabilities on demand, such as the clinical_trials tool [5] for querying ClinicalTrials.gov. Only one agent is marked as facilitator, designating it as the moderator of the conversation; in this scenario, the Orchestrator agent fills that role.

Early observations highlight that multi-agent orchestration introduces new complexities, even as it improves specialization and task alignment. To address these emergent challenges, we are actively evolving the framework across several dimensions:
Mitigating Error Propagation Across Agents: Ensuring that early-stage errors by one agent do not cascade unchecked through subsequent reasoning steps. This includes introducing critical checkpoints where outputs from key agents are verified before being consumed by others.
Optimizing Agent Selection and Specialization: Recognizing that more agents are not always better. Adding unnecessary or redundant agents can introduce noise and confusion. We’ve implemented a systematic framework that emphasizes a few highly suited agents per task, dynamically selected based on case complexity and domain needs, while continuously tracking performance gains and catching regressions early.
Improving Transparency and Hand-off Clarity: Structuring agent interactions to make intermediate outputs and rationales visible, enabling developers (and the system itself) to trace how conclusions were reached, catch inconsistencies early, and intervene when necessary.

Adapting General Frameworks for Healthcare Complexity

Generic orchestration frameworks like Semantic Kernel provide a strong foundation, but healthcare demands more. The stakes are higher, the data more nuanced, and the workflows require precision, traceability, and regulatory compliance.
Here’s how we’ve extended and adapted these systems to help address healthcare demands:
Precision and Safety: We introduced domain-aware verification checkpoints and task-specific agent constraints to reduce inappropriate tool usage, supporting more reliable reasoning. To help uphold the high standards required in healthcare, we defined two complementary metric systems (see Healthcare Agent Orchestrator Evaluation for more details):
Core Metrics: monitor agent selection accuracy, intent resolution, contextual relevance, and information aggregation.
RoughMetric: a composite score based on ROUGE that helps quantify the precision of generated outputs and conversation reliability.
TBFact: a modified version of RadFact [2] that measures the factuality of claims in agents' messages and helps identify omissions and hallucinations.
Domain-Specific Tool Planning: Healthcare agents must reason across multimodal inputs, such as chest X-rays, CT slices, pathology images, and structured EHRs. We’ve customized Semantic Kernel’s tool invocation and planning modules to reflect clinical workflows, not generic task chains.

These infrastructure-level adaptations are designed to complement Microsoft Healthcare AI models, such as CXRReportGen, MedImageParse, and MedImageInsight, working together to enable coordinated, domain-aware reasoning across complex healthcare tasks.

Enabling Collaborative, Trustworthy AI in Healthcare

Healthcare demands AI systems that are as collaborative, adaptive, and trustworthy as the clinical teams they aim to support. The Healthcare Agent Orchestrator is a concrete step toward that vision, pairing specialized health AI models with a flexible, multi-agent coordination framework, purpose-built to reflect the complexity of real clinical decision-making.
By aligning with existing healthcare workflows and enabling transparent, role-specific collaboration, this system shows promise to empower clinicians to work more effectively, with AI as a partner, not a replacement.

Healthcare Multi-Agent Orchestrator and the Microsoft healthcare AI models are intended for research and development use. Healthcare Multi-Agent Orchestrator and the healthcare AI models are not designed or intended to be deployed in clinical settings as-is, nor are they intended for use in the diagnosis or treatment of any health or medical condition, and their performance for such purposes has not been established. You bear sole responsibility and liability for any use of Healthcare Multi-Agent Orchestrator or the healthcare AI models, including verification of outputs and incorporation into any product or service intended for a medical purpose or to inform clinical decision-making, compliance with applicable healthcare laws and regulations, and obtaining any necessary clearances or approvals.

[1] arXiv, Universal Abstraction: Harnessing Frontier Models to Structure Real-World Data at Scale, February 2, 2025
[2] arXiv, MAIRA-2: Grounded Radiology Report Generation, June 6, 2024
[3] Nature Methods, A foundation model for joint segmentation, detection and recognition of biomedical objects across nine modalities, November 18, 2024
[4] arXiv, MedImageInsight: An open-source embedding model for general domain medical imaging, October 9, 2024
[5] Machine Learning for Healthcare Conference, Scaling Clinical Trial Matching Using Large Language Models: A Case Study in Oncology, August 4, 2023

Modernizing Digital Health Record Governance with Microsoft Entra Identity Governance
With Entra Identity Governance, Microsoft provides cloud-driven identity lifecycle automation, application provisioning, entitlement management, and access reviews that can be applied to users, guests, agents, groups, and enterprise applications, including EHR systems like Epic, Oracle Health (Cerner), and Meditech.

Implementing Disaster Recovery for Azure App Service Web Applications
Starting March 31, 2025, Microsoft will no longer automatically place Azure App Service web applications in disaster recovery mode in the event of a regional disaster. This change emphasizes the importance of implementing robust disaster recovery (DR) strategies to ensure the continuity and resilience of your web applications. Here’s what you need to know and how you can prepare. Understanding the Change Azure App Service has been a reliable platform for hosting web applications, REST APIs, and mobile backends, offering features like load balancing, autoscaling, and automated management. However, beginning March 31, 2025, in the event of a regional disaster, Azure will not automatically place your web applications in disaster recovery mode. This means that you, as a developer or IT professional, need to proactively implement disaster recovery techniques to safeguard your applications and data. Why This Matters Disasters, whether natural or technical, can strike without warning, potentially causing significant downtime and data loss. By taking control of your disaster recovery strategy, you can minimize the impact of such events on your business operations. Implementing a robust DR plan ensures that your applications remain available and your data remains intact, even in the face of regional outages. Common Disaster Recovery Techniques To prepare for this change, consider the following commonly used disaster recovery techniques: Multi-Region Deployment: Deploy your web applications across multiple Azure regions. This approach ensures that if one region goes down, your application can continue to run in another region. You can use Azure Traffic Manager or Azure Front Door to route traffic to the healthy region. Multi-region load balancing with Traffic Manager and Application Gateway Highly available multi-region web app Regular Backups: Implement regular backups of your application data and configurations. 
Azure App Service provides built-in backup and restore capabilities that you can schedule to run automatically. Back up an app in App Service How to automatically backup App Service & Function App configurations Active-Active or Active-Passive Configuration: Set up your applications in an active-active or active-passive configuration. In an active-active setup, both regions handle traffic simultaneously, providing high availability. In an active-passive setup, the secondary region remains on standby and takes over only if the primary region fails. About active-active VPN gateways Design highly available gateway connectivity Automated Failover: Use automated failover mechanisms to switch traffic to a secondary region seamlessly. This can be achieved using Azure Site Recovery or custom scripts that detect failures and initiate failover processes. Add Azure Automation runbooks to Site Recovery recovery plans Create and customize recovery plans in Azure Site Recovery Monitoring and Alerts: Implement comprehensive monitoring and alerting to detect issues early and respond promptly. Azure Monitor and Application Insights can help you track the health and performance of your applications. Overview of Azure Monitor alerts Application Insights OpenTelemetry overview Steps to Implement a Disaster Recovery Plan Assess Your Current Setup: Identify all the resources your application depends on, including databases, storage accounts, and networking components. Choose a DR Strategy: Based on your business requirements, choose a suitable disaster recovery strategy (e.g., multi-region deployment, active-active configuration). Configure Backups: Set up regular backups for your application data and configurations. Test Your DR Plan: Regularly test your disaster recovery plan to ensure it works as expected. Simulate failover scenarios to validate that your applications can recover quickly. 
Document and Train: Document your disaster recovery procedures and train your team to execute them effectively.

Conclusion

While the upcoming change in Azure App Service’s disaster recovery policy may seem daunting, it also presents an opportunity to enhance the resilience of your web applications. By implementing robust disaster recovery techniques, you can ensure that your applications remain available and your data remains secure, no matter what challenges come your way. Start planning today to stay ahead of the curve and keep your applications running smoothly.

Recover from region-wide failure - Azure App Service
Reliability in Azure App Service
Multi-Region App Service App Approaches for Disaster Recovery

Feel free to share your thoughts or ask questions in the comments below. Let's build a resilient future together! 🚀

Image Search Series Part 4: Advancing Wound Care with Foundation Models and Context-Aware Retrieval
Introduction

Wound assessment and management are central tasks in clinical practice, requiring accurate documentation and timely decision-making. Clinicians and nurses often rely on visual inspection to evaluate wound characteristics such as size, color, tissue composition, and healing progress. However, when seeking comparable cases (e.g., to inform treatment choices, validate assessments, or support education), existing search methods have significant limitations. Traditional keyword-based systems require precise terminology, which may not align with the way wounds are described in practice. Moreover, textual descriptors cannot fully capture the variability of visual wound features, resulting in incomplete or imprecise retrieval.

Recent advances in computer vision offer new opportunities to address these challenges through both image classification and image retrieval. Automated classification of wound images into clinically meaningful categories (e.g., wound type, tissue condition, infection status) can support standardized documentation and assist clinicians in making more consistent assessments. In parallel, image retrieval systems enable search based on visual similarity rather than textual input alone, allowing clinicians to query databases directly with wound images and retrieve cases with similar characteristics. Together, these AI-based functionalities have the potential to improve case comparison, facilitate consistent monitoring, and enhance clinical training by providing immediate access to relevant examples and structured decision support.

The Data

The WoundcareVQA dataset is a new multimodal multilingual dataset for Wound Care Visual Question Answering. The WoundcareVQA dataset is available at https://osf.io/xsj5u/ [1]. Table 1 summarizes dataset statistics. WoundcareVQA contains 748 images associated with 447 instances (each instance/query includes one or more images).
The dataset is split into training (279 instances, 449 images), validation (105 instances, 147 images), and test (93 instances, 152 images). The training set was annotated by a single expert, the validation set by two annotators, and the test set by three medical doctors. Each query is also labeled with wound metadata, covering seven categories: anatomic location (41 classes), wound type (8), wound thickness (6), tissue color (6), drainage amount (6), drainage type (5), and infection status (3). Table 1: Statistics about the WoundcareVQA Dataset We selected two tasks with the highest inter-annotator agreement: Wound Type Classification and Infection Detection (cf. Table 2). Table 3 lists the classification labels for these tasks. Table 2: Inter-Annotator Agreement in the WoundcareVQA Dataset Table 3: Classification Labels for the Tasks: Infection Detection & Wound Type Classification Methods 1. Foundation-Model-based Image Search This approach relies on an image similarity-based retrieval mechanism using a medical foundation model, MedImageInsight [2-3]. Specifically, it employs a k-nearest neighbors (k-NN) search to identify the top k training images most visually similar to a given query image. The image search system operates in two phases: Index Construction: Embeddings are extracted from all training images using a pretrained vision encoder (MedImageInsight). These embeddings are then indexed to enable efficient and scalable similarity search during retrieval. Query and Retrieval: At inference time, the test image is encoded to produce a query embedding. The system computes the Euclidean distances between this query vector and all indexed embeddings, retrieving the k nearest neighbors with the smallest distances. To address the computational demands of large-scale image datasets, the method leverages FAISS (Facebook AI Similarity Search), an open-source library designed for fast and scalable similarity search and clustering of high-dimensional vectors. 2. 
Vision-Language Models (VLMs) & Retrieval-Augmented Generation (RAG)
We leverage vision-language models (e.g., GPT-4o, GPT-4.1), a recent class of multimodal foundation models capable of jointly reasoning over visual and textual inputs. These models are suited to wound assessment tasks because they can interpret complex visual patterns in medical images while also understanding medical terminology. We evaluate three settings:
Zero-shot: The model predicts directly from the query input without additional examples.
Few-shot Prompting: Five examples from the training dataset are randomly selected and embedded into the input prompt. These paired images and labels provide contextual cues that guide the model's interpretation of new inputs.
Retrieval-Augmented Generation (RAG): The system first retrieves the Top-k visually similar wound images using the MedImageInsight-based image search described above. The language model then reasons over the retrieved examples and their labels to generate the final prediction.
The implementation of the MedImageInsight-based image search and the RAG method for the infection detection task is available in our Samples Repository: https://aka.ms/healthcare-ai-examples (rag_infection_detection.ipynb)
Evaluation
We computed accuracy to evaluate the image search methods (Top-1 and Top-5 with majority vote), the GPT-4o and GPT-4.1 models (zero-shot), and the 5-shot and RAG-based methods. Table 4 reports accuracy for wound type classification and infection detection. Figure 1 presents examples of correct and incorrect predictions.
Table 4: Accuracy Scores for Wound Type Classification & Infection Detection
Method | Wound Type | Infection
Image Search Top-1 | 0.7933 | 0.6800
Image Search Top-5 + majority vote | 0.8333 | 0.7267
GPT-4o (2023-07-01) | 0.4671 | 0.3947
GPT-4o (2024-11-20) | 0.4803 | 0.3882
GPT-4.1 (2025-04-14) | 0.5066 | 0.375
GPT-4.1 5-shot Prompting | 0.6118 | 0.7237
GPT-4.1 RAG-5 | 0.7533 | 0.7697
Figure 1: Examples of Correct and Incorrect Predictions (GPT-4.1-RAG-5 Method)
For wound type classification, image search with MedImageInsight embeddings performs best, achieving 0.7933 (Top-1) and 0.8333 (Top-5 + majority vote). GPT models without retrieval perform substantially worse (0.4671 to 0.6118 across the zero-shot and 5-shot settings), while GPT-4.1 with retrieval augmentation (RAG-5), which uses the same MedImageInsight-based image search method to retrieve the Top-5 similar cases, narrows the gap (0.7533) but does not surpass direct image search. This suggests that categorical wound type is more effectively captured by visual similarity than by case-based reasoning with vision-language models.
For infection detection, the trend reverses. Image search reaches 0.7267 (Top-5 + majority vote), while RAG-5 achieves the highest accuracy at 0.7697. In this case, the combination of visually similar cases with VLM-based reasoning outperforms both standalone image search and GPT prompting. This indicates that infection assessment depends on contextual or clinical cues that may not be fully captured by visual similarity alone but can be better interpreted when enriched with contextual reasoning over retrieved cases and their associated labels.
Overall, these findings highlight complementary strengths: foundation-model-based image search excels at categorical visual classification (wound type), while retrieval-augmented VLMs leverage both visual similarity and contextual reasoning to improve performance on more nuanced tasks (infection detection). A hybrid system integrating both approaches may provide the most robust clinical support.
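The "Image Search Top-k + majority vote" baseline discussed above can be sketched in a few lines. This is a minimal illustration in plain Python, not the code from the samples repository: the function names, the toy 2-D vectors, and the label strings are hypothetical, and a real system would search MedImageInsight embeddings with FAISS rather than a sorted list.

```python
import math
from collections import Counter

def top_k_neighbors(index, query, k):
    """Return the k (distance, label) pairs closest to the query, nearest first.

    index is a list of (embedding, label) pairs; distances are Euclidean.
    """
    scored = sorted(
        ((math.dist(vec, query), label) for vec, label in index),
        key=lambda pair: pair[0],
    )
    return scored[:k]

def majority_vote(neighbors):
    """Pick the most frequent label; ties go to the label seen nearest to the query."""
    counts = Counter(label for _, label in neighbors)
    best = max(counts.values())
    for _, label in neighbors:  # neighbors are ordered nearest first
        if counts[label] == best:
            return label

# Toy index standing in for training-image embeddings (hypothetical values).
toy_index = [
    ([0.0, 0.0], "infected"),
    ([0.1, 0.0], "infected"),
    ([0.0, 0.2], "not_infected"),
    ([5.0, 5.0], "not_infected"),
    ([5.1, 5.0], "not_infected"),
]
toy_query = [0.05, 0.05]

top1_label = majority_vote(top_k_neighbors(toy_index, toy_query, k=1))
top3_label = majority_vote(top_k_neighbors(toy_index, toy_query, k=3))
top5_label = majority_vote(top_k_neighbors(toy_index, toy_query, k=5))
```

With k=1 the prediction is simply the label of the nearest image; larger k trades sensitivity to a single noisy neighbor for sensitivity to class imbalance in the index, which is visible in the toy data when k grows to 5.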
Conclusion
This study demonstrates the complementary roles of foundation-model image search and vision-language models in wound assessment. Image search using foundation-model embeddings shows strong performance on categorical tasks such as wound type classification, where visual similarity is most informative. In contrast, retrieval-augmented generation (RAG-5), which combines image search with case-based reasoning by a vision-language model, achieves the best results for infection detection, highlighting the value of integrating contextual interpretation with visual features. These findings suggest that a hybrid approach, leveraging both direct image similarity and retrieval-augmented reasoning, may provide the most robust pathway for clinical decision support in wound care.
Image Search Series: Blog Posts & Jupyter Notebooks
Image Search Series Part 1: Chest X-ray lookup with MedImageInsight | Microsoft Community Hub (2d_image_search.ipynb)
Image Search Series Part 2: AI Methods for the Automation of 3D Image Retrieval in Radiology | Microsoft Community Hub (3d_image_search.ipynb)
Image Search Series Part 3: Foundation Models and Retrieval-Augmented Generation in Dermatology | Microsoft Community Hub
Image Search Series Part 4: Advancing Wound Care with Foundation Models and Context-Aware Retrieval | Microsoft Community Hub (rag_infection_detection.ipynb)
Image Search Series Part V: Building Histopathology Image Search with Prov-GigaPath | Microsoft Community Hub (2d_pathology_image_search.ipynb)
The Microsoft healthcare AI models, including MedImageInsight, are intended for research and model development exploration. The models are not designed or intended to be deployed in clinical settings as-is nor for use in the diagnosis or treatment of any health or medical condition, and the individual models’ performances for such purposes have not been established.
You bear sole responsibility and liability for any use of the healthcare AI models, including verification of outputs and incorporation into any product or service intended for a medical purpose or to inform clinical decision-making, compliance with applicable healthcare laws and regulations, and obtaining any necessary clearances or approvals.
References
[1] Wen-wai Yim, Asma Ben Abacha, Robert Doerning, Chia-Yu Chen, Jiaying Xu, Anita Subbarao, Zixuan Yu, Fei Xia, M Kennedy Hall, Meliha Yetisgen. Woundcarevqa: A Multilingual Visual Question Answering Benchmark Dataset for Wound Care. Journal of Biomedical Informatics, 2025.
[2] Noel C. F. Codella, Ying Jin, Shrey Jain, Yu Gu, Ho Hin Lee, Asma Ben Abacha, Alberto Santamaría-Pang, Will Guyman, Naiteek Sangani, Sheng Zhang, Hoifung Poon, Stephanie L. Hyland, Shruthi Bannur, Javier Alvarez-Valle, Xue Li, John Garrett, Alan McMillan, Gaurav Rajguru, Madhu Maddi, Nilesh Vijayrania, Rehaan Bhimai, Nick Mecklenburg, Rupal Jain, Daniel Holstein, Naveen Gaur, Vijay Aski, Jenq-Neng Hwang, Thomas Lin, Ivan Tarapov, Matthew P. Lungren, Mu Wei: MedImageInsight: An Open-Source Embedding Model for General Domain Medical Imaging. CoRR abs/2410.06542 (2024)
[3] Model catalog and collections in Azure AI Foundry portal https://learn.microsoft.com/en-us/azure/ai-studio/how-to/model-catalog-overview
Image Search Series Part 2: AI Methods for the Automation of 3D Image Retrieval in Radiology
Introduction
As the use of diagnostic 3D images increases, effective management and analysis of these large volumes of data grows in importance. Medical 3D image search systems can play a vital role by enabling clinicians to quickly retrieve relevant or similar images and cases based on the anatomical features and pathologies present in a query image. Unlike traditional 2D imaging, 3D imaging offers a more comprehensive view for examining anatomical structures from multiple planes with greater clarity and detail. This enhanced visualization has the potential to assist doctors with improved diagnostic accuracy and more precise treatment planning. Moreover, advanced 3D image retrieval systems can support evidence-based and cohort-based diagnostics, demonstrating an opportunity for more accurate predictions and personalized treatment options. These systems also hold significant potential for advancing research, supporting medical education, and enhancing healthcare services.
This blog offers guidance on using Azure AI Foundry and the recently launched healthcare AI models to design and test a 3D image search system that can retrieve similar radiology images from a large collection of 3D images. Along with this blog, we share a Jupyter Notebook with the 3D image search system code, which you may use to reproduce the experiments presented here or start your own solution.
3D Image Search Notebook: http://aka.ms/healthcare-ai-examples-mi2-3d-image-search
It is important to highlight that the models available on the AI Foundry Model Catalog are not designed to generate diagnostic-quality results. Developers are responsible for further developing, testing, and validating their appropriateness for specific tasks and eventually integrating these models into complete systems. The objective of this blog is to demonstrate how this can be achieved efficiently in terms of data and computational resources.
The Problem
Generally, the problem of 3D image search can be posed as retrieving cross-sectional (CS) imaging series (3D image results) that are similar to a given CS imaging series (query 3D image). Once posed this way, the key question becomes how to define such similarity. In the previous blog of this series, we worked with chest radiographs, which constrained the notion of "similar" to the similarity between two 2D images and a certain class of anatomy. In the case of 3D images, we are dealing with a volume of data and many more variations of anatomy and pathology, which expands the dimensions to consider for similarity: are we looking for similar anatomy? Similar pathology? Similar exam type? In this blog, we discuss a technique that approximates the 3D similarity problem using a 2D image embedding model and some supervision to constrain the problem to a certain class of pathologies (lesions), casting it as "given a cross-sectional MRI image, retrieve series with a similar grade of lesions in similar anatomical regions".
To build a search system for 3D radiology images using a foundation model (MedImageInsight) designed for 2D inputs, we generate a representative 3D embedding vector for each volume from the foundation model embeddings of its 2D slices, and create a vector index from a large collection of 3D images. Retrieving relevant results for a given 3D image then consists of generating a representative 3D embedding vector for the query image and searching for similar vectors in the index. An overview of this process is illustrated in Figure 1.
Figure 1: Overview of the 3D image search process.
The Data
In the sample notebook that is provided alongside this blog, we use 3D CT images from the Medical Segmentation Decathlon (MSD) dataset [2-3] and annotations from the 3D-MIR benchmark [4].
The 3D-MIR benchmark offers four collections (Liver, Colon, Pancreas, and Lung) of positive and negative examples created from the MSD dataset with additional annotations related to the lesion flag (with/without lesion), and lesion group (1, 2, 3). The lesion grouping focuses on lesion morphology and distribution and considers the number, length, and volume of the lesions to define the three groups. It also adheres to the American Joint Committee on Cancer's Tumor, Node, Metastasis classification system’s recommendations for classifying cancer stages and provides a standardized framework for correlating lesion morphology with cancer stage. We selected the 3D-MIR Pancreas collection. 3D-MIR Benchmark: https://github.com/abachaa/3D-MIR Since the MSD collections only include unhealthy/positive volumes, each 3D-MIR collection was augmented with volumes randomly selected from the other datasets to integrate healthy/negative examples in the training and test splits. For instance, the Pancreas dataset was augmented using volumes from the Colon, Liver, and Lung datasets. The input images consist of CT volumes and associated 2D slices. The training set is used to create the index, and the test set is used to query and evaluate the 3D search system. 3D Image Retrieval Our search strategy, called volume-based retrieval, relies on aggregating the embeddings of the 2D slices of a volume to generate one representative 3D embedding vector for the whole volume. We describe additional search strategies in our 3D-MIR paper [4]. The 2D slice embeddings are generated using the MedImageInsight foundation model [5-6] from Azure AI Foundry model catalog [1]. In the search step, we generate the embeddings of the 3D query volumes according to the selected Aggregation method (Agg) and search for the top-k similar volumes/vectors in the corresponding 3D (Agg) index. We use the Median aggregation method to generate the 3D vectors and create the associated 3D index. 
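As a rough sketch of the volume-based aggregation step described above, the function below collapses a list of per-slice embeddings into one vector by taking the element-wise median. It is written in plain Python for readability and is not taken from the notebook: the name `volume_embedding`, the `reducer` parameter, and the tiny 2-dimensional vectors are assumptions made for illustration, and the real slice embeddings would come from MedImageInsight.

```python
from statistics import median

def volume_embedding(slice_embeddings, reducer=median):
    """Aggregate per-slice embeddings into one vector for the whole volume.

    slice_embeddings: list of equal-length lists of floats, one per 2D slice.
    reducer: function applied to each embedding dimension across all slices
             (median by default; any other per-dimension reduction can be passed).
    """
    if not slice_embeddings:
        raise ValueError("volume has no slices")
    dim = len(slice_embeddings[0])
    if any(len(e) != dim for e in slice_embeddings):
        raise ValueError("all slice embeddings must have the same length")
    # zip(*...) regroups the values so each tuple holds one dimension across slices.
    return [reducer(column) for column in zip(*slice_embeddings)]

# Three toy "slices" with 2-dimensional embeddings (hypothetical values).
toy_slices = [[1.0, 10.0], [2.0, 20.0], [9.0, 30.0]]
toy_median_vector = volume_embedding(toy_slices)
toy_max_vector = volume_embedding(toy_slices, reducer=max)
```

The median is less affected by a few atypical slices than the mean would be, which is one plausible reason to prefer it when only part of a volume contains the organ of interest; the post itself does not state the rationale.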
We construct a 3D (Median) index using the training slices/volumes from the 3D-MIR Pancreas collection. Three other aggregation methods are available in the 3D image search notebook: Max Pooling, Average Pooling, and Standard Deviation. The search follows the k-Nearest Neighbors algorithm (k-NN search): it calculates the distances between the query vector and all other vectors in the collection, then selects the k vectors with the shortest distances. If the collection is large, this computation can be expensive, and a library optimized for similarity search is recommended. We use the FAISS (Facebook AI Similarity Search) library, an open-source library for efficient similarity search and clustering of high-dimensional vectors.
Evaluation of the search results
The 3D-MIR Pancreas test set consists of 32 volumes:
4 volumes with no lesion (lesion flag/group = -1)
3 volumes with lesion group 1
19 volumes with lesion group 2
6 volumes with lesion group 3
The training set consists of 269 volumes (with and without lesions) and was used to create the index. We evaluate the 3D search system by comparing the lesion group/category of the query volume and the top 10 retrieved volumes. We then compute Precision@k (P@k), the fraction of the top k retrieved volumes whose lesion group matches that of the query. Table 1 presents the P@1, P@3, P@5, P@10, and overall Precision.
Table 1: Evaluation results on the 3D-MIR Pancreas test set
The system accurately recognizes Healthy cases, consistently retrieving the correct label in test scenarios involving non-lesion pancreas images. However, performance varies for different lesion groups, reflecting challenges in precisely identifying smaller lesions (Group 1) or more advanced lesions (Group 3). This discrepancy highlights the complexity of lesion detection and underscores the importance of carefully tuning embeddings or adjusting the vector index to improve retrieval accuracy for specific lesion sizes.
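The Precision@k metric used in this evaluation can be written down compactly. The sketch below assumes the standard definition (matches among the top k results, divided by k); the function names and the example ranking are hypothetical and are not results from the benchmark.

```python
def precision_at_k(query_label, retrieved_labels, k):
    """Fraction of the top-k retrieved items whose label equals the query label."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for label in retrieved_labels[:k] if label == query_label)
    return hits / k

def mean_precision_at_k(queries, k):
    """Average P@k over (query_label, retrieved_labels) pairs."""
    if not queries:
        raise ValueError("no queries to evaluate")
    return sum(precision_at_k(q, r, k) for q, r in queries) / len(queries)

# Hypothetical ranking for one query volume in lesion group 2
# (-1 marks a retrieved volume with no lesion, as in the test set description).
toy_query_group = 2
toy_retrieved_groups = [2, 2, 1, 2, 3, 2, 2, -1, 2, 2]

toy_p_at_1 = precision_at_k(toy_query_group, toy_retrieved_groups, 1)
toy_p_at_5 = precision_at_k(toy_query_group, toy_retrieved_groups, 5)
toy_p_at_10 = precision_at_k(toy_query_group, toy_retrieved_groups, 10)
```

Because the test set is dominated by lesion group 2 (19 of 32 volumes), reporting P@k per group, as the post does, is more informative than a single pooled average.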
Visualization
Figure 2 presents four different test queries from the Pancreas test set and the top 5 nearest neighbors retrieved by the volume-based search method. In each row, the first image is the query, followed by the retrieved images ranked by similarity. The visual overlays help in assessing retrieval accuracy: blue indicates the pancreas organ boundaries, and red highlights the marked regions corresponding to the pancreas tumor.
Figure 2: Top 5 results for different queries from the Pancreas test set
Table 2 presents additional results of the volume-based retrieval system [4] on other 3D-MIR datasets/organs (Liver, Colon, and Lung) using additional foundation models: BiomedCLIP [7], Med-Flamingo [8], and BiomedGPT [9]. When considering the macro-average across all datasets, MedImageInsight-based retrieval substantially outperforms the other foundation models.
Table 2: Evaluation Results on the 3D-MIR benchmark (Liver, Colon, Pancreas, and Lung)
These results mirror a use case akin to lesion detection and severity measurement in a clinical context. In real-world applications, such as diagnostic support or treatment planning, it may be necessary to optimize the model to account for particular goals (e.g., detecting critical lesions early) or accommodate different imaging protocols. By refining search criteria, integrating more domain-specific data, or adjusting embedding methods, practitioners can enhance retrieval precision and better meet clinical requirements.
Conclusion
The integration of 3D image search systems in clinical environments can enhance and accelerate the retrieval of similar cases and provide better context to clinicians and researchers for accurate complex diagnoses, cohort selection, and personalized patient care.
This 3D radiology image search blog and the related notebook offer a solution based on 3D embedding generation for building and evaluating a 3D image search system using the MedImageInsight foundation model from the Azure AI Foundry model catalog.
Image Search Series: Blog Posts & Jupyter Notebooks
Image Search Series Part 1: Chest X-ray lookup with MedImageInsight | Microsoft Community Hub (2d_image_search.ipynb)
Image Search Series Part 2: AI Methods for the Automation of 3D Image Retrieval in Radiology | Microsoft Community Hub (3d_image_search.ipynb)
Image Search Series Part 3: Foundation Models and Retrieval-Augmented Generation in Dermatology | Microsoft Community Hub
Image Search Series Part 4: Advancing Wound Care with Foundation Models and Context-Aware Retrieval | Microsoft Community Hub (rag_infection_detection.ipynb)
Image Search Series Part V: Building Histopathology Image Search with Prov-GigaPath | Microsoft Community Hub (2d_pathology_image_search.ipynb)
The Microsoft healthcare AI models, including MedImageInsight, are intended for research and model development exploration. The models are not designed or intended to be deployed in clinical settings as-is nor for use in the diagnosis or treatment of any health or medical condition, and the individual models’ performances for such purposes have not been established. You bear sole responsibility and liability for any use of the healthcare AI models, including verification of outputs and incorporation into any product or service intended for a medical purpose or to inform clinical decision-making, compliance with applicable healthcare laws and regulations, and obtaining any necessary clearances or approvals.
References
[1] Model catalog and collections in Azure AI Foundry portal https://learn.microsoft.com/en-us/azure/ai-studio/how-to/model-catalog-overview
[2] Michela Antonelli et al. The medical segmentation decathlon. Nature Communications, 13(4128), 2022 https://www.nature.com/articles/s41467-022-30695-9
[3] MSD: http://medicaldecathlon.com/
[4] Asma Ben Abacha, Alberto Santamaría-Pang, Ho Hin Lee, Jameson Merkow, Qin Cai, Surya Teja Devarakonda, Abdullah Islam, Julia Gong, Matthew P. Lungren, Thomas Lin, Noel C. F. Codella, Ivan Tarapov: 3D-MIR: A Benchmark and Empirical Study on 3D Medical Image Retrieval in Radiology. CoRR abs/2311.13752, 2023 https://arxiv.org/abs/2311.13752
[5] Noel C. F. Codella, Ying Jin, Shrey Jain, Yu Gu, Ho Hin Lee, Asma Ben Abacha, Alberto Santamaría-Pang, Will Guyman, Naiteek Sangani, Sheng Zhang, Hoifung Poon, Stephanie Hyland, Shruthi Bannur, Javier Alvarez-Valle, Xue Li, John Garrett, Alan McMillan, Gaurav Rajguru, Madhu Maddi, Nilesh Vijayrania, Rehaan Bhimai, Nick Mecklenburg, Rupal Jain, Daniel Holstein, Naveen Gaur, Vijay Aski, Jenq-Neng Hwang, Thomas Lin, Ivan Tarapov, Matthew P. Lungren, Mu Wei: MedImageInsight: An Open-Source Embedding Model for General Domain Medical Imaging. CoRR abs/2410.06542, 2024 https://arxiv.org/abs/2410.06542
[6] MedImageInsight: https://aka.ms/mi2modelcard
[7] Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, Cliff Wong, Andrea Tupini, Yu Wang, Matt Mazzola, Swadheen Shukla, Lars Liden, Jianfeng Gao, Angela Crabtree, Brian Piening, Carlo Bifulco, Matthew P. Lungren, Tristan Naumann, Sheng Wang, Hoifung Poon. BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. NEJM AI 2025; 2(1) https://ai.nejm.org/doi/full/10.1056/AIoa2400640
[8] Moor, M., Huang, Q., Wu, S., Yasunaga, M., Dalmia, Y., Leskovec, J., Zakka, C., Reis, E.P., Rajpurkar, P.: Med-flamingo: a multimodal medical few-shot learner. Machine Learning for Health, ML4H@NeurIPS 2023, 10 December 2023, New Orleans, Louisiana, USA. Proceedings of Machine Learning Research, vol. 225, pp. 353–367. PMLR, (2023) https://proceedings.mlr.press/v225/moor23a.html
[9] Zhang, K., Zhou, R., Adhikarla, E., Yan, Z., Liu, Y., Yu, J., Liu, Z., Chen, X., Davison, B.D., Ren, H., et al.: A generalist vision–language foundation model for diverse biomedical tasks. Nature Medicine, 1–13 (2024) https://www.nature.com/articles/s41591-024-03185-2
Image Search Series Part 3: Foundation Models and Retrieval-Augmented Generation in Dermatology
Introduction Dermatology is inherently visual, with diagnosis often relying on morphological features such as color, texture, shape, and spatial distribution of skin lesions. However, the diagnostic process is complicated by the large number of dermatologic conditions, with over 3,000 identified entities, and the substantial variability in their presentation across different anatomical sites, age groups, and skin tones. This phenotypic diversity presents significant challenges, even for experienced clinicians, and can lead to diagnostic uncertainty in both routine and complex cases. Image-based retrieval systems represent a promising approach to address these challenges. By enabling users to query large-scale image databases using a visual example, these systems can return semantically or visually similar cases, offering useful reference points for clinical decision support. However, dermatology image search is uniquely demanding. Systems must exhibit robustness to variations in image quality, lighting, and skin pigmentation while maintaining high retrieval precision across heterogeneous datasets. Beyond clinical applications, scalable and efficient image search frameworks provide valuable support for research, education, and dataset curation. They enable automated exploration of large image repositories, assist in selecting challenging examples to enhance model robustness, and promote better generalization of machine learning models across diverse populations. In this post, we continue our series on using healthcare AI models in Azure AI Foundry to create efficient image search systems. We explore the design and implementation of such a system for dermatology applications. As a baseline, we first present an adapter-based classification framework for dermatology images by leveraging fixed embeddings from the MedImageInsight foundation model, available in the Azure AI Foundry model catalog. 
We then introduce a Retrieval-Augmented Generation (RAG) method that enhances vision-language models through similarity-based in-context prompting. We use the MedImageInsight foundation model to generate image embeddings and retrieve the top-k visually similar training examples via FAISS. The retrieved image-label pairs are included in the Vision-LLM prompt as in-context examples. This targeted prompting guides the model using visually and semantically aligned references, enhancing prediction quality on fine-grained dermatological tasks.
It is important to highlight that the models available on the AI Foundry Model Catalog are not designed to generate diagnostic-quality results. Developers are responsible for further developing, testing, and validating their appropriateness for specific tasks and eventually integrating these models into complete systems. The objective of this blog is to demonstrate how this can be achieved efficiently in terms of data and computational resources.
The Data
The DermaVQA-IIYI [2] dermatology image dataset is a de-identified, diverse collection of nearly 1,000 patient records and nearly 3,000 dermatological images, created to support research in skin condition recognition, classification, and visual question answering.
DermaVQA-IIYI dataset: https://osf.io/72rp3/files/osfstorage (data/iiyi)
The dataset is split into three subsets:
Training Set: 2,474 images associated with 842 patient cases
Validation Set: 157 images associated with 56 cases
Test Set: 314 images associated with 100 cases
Total Records: 2,945 images (998 patient cases)
Patient Demographics (out of 998 patient cases):
Sex: F: 218, M: 239, UNK: 541
Age (available for 398 patients): Mean: 31 yrs | Min: 0.08 yrs | Max: 92 yrs
This wide age range supports studies across all age groups, from infants to the elderly. A total of 2,945 images are associated with the patient records, an average of about 2.95 images per patient (2,945 / 998).
This multiplicity enables the study of skin conditions from different perspectives and at various stages.
Image Count per Entry:
1 image: 225 patients
2 images: 285 patients
3 images: 200 patients
4 or more images: 288 patients
The dataset includes additional annotations for anatomic location, comprising 39 distinct labels (e.g., back, fingers, fingernail, lower leg, forearm, eye region, unidentifiable). Each image is associated with one or multiple labels. We use these annotations to evaluate the performance of various methods across different anatomical regions.
Image Embeddings
We generate image embeddings using the MedImageInsight foundation model [1] from the Azure AI Foundry model catalog [3]. We apply Uniform Manifold Approximation and Projection (UMAP) to project high-dimensional image embeddings produced by the MedImageInsight model into two dimensions. The visualization is generated using embeddings extracted from both the DermaVQA training and test sets, which cover 39 anatomical regions. For clarity, only the most frequent anatomical labels are displayed in the projection.
Figure 1. UMAP projection of image embeddings produced by the MedImageInsight Model on the DermaVQA dataset.
The resulting projection reveals that the MedImageInsight model captures meaningful anatomical distinctions: visually distinct regions such as fingers, face, fingernail, and foot form well-separated clusters, indicating high intra-class consistency and inter-class separability. Other anatomically adjacent or visually similar regions, such as back, arm, and abdomen, show moderate overlap, which is expected due to shared visual features or potential labeling ambiguity. Overall, the embeddings exhibit a coherent and interpretable organization, suggesting that the model has learned to encode both local and global anatomical structures.
This supports the model’s effectiveness in capturing anatomy-specific representations suitable for downstream tasks such as classification and retrieval.
Enhancing Visual Understanding
We explore two strategies for enhancing visual understanding through foundation models.
I. Training an Adapter-based Classifier
We build an adapter-based classification framework designed for efficient adaptation to medical imaging tasks (for an introduction to adapters, see our prior post: Unlocking the Magic of Embedding Models: Practical Patterns for Healthcare AI | Microsoft Community Hub). The proposed adapter model builds upon fixed visual features extracted from the MedImageInsight foundation model, enabling task-specific fine-tuning without requiring full model retraining. The architecture consists of three main components:
MLP Adapter: A two-layer feedforward network that projects 1024-dimensional embeddings (generated by the MedImageInsight model) into a 512-dimensional latent space. This module uses GELU activation and Layer Normalization to enhance training stability and representational capacity. As a bottleneck adapter, it facilitates parameter-efficient transfer learning.
Convolutional Retrieval Module: A sequence of two 1D convolutional layers with GELU activation, applied to the output of the MLP adapter. This component refines the representations by modeling local dependencies within the transformed feature space.
Prediction Head: A linear classifier that maps the 512-dimensional refined features to the task-specific output space (e.g., 39 dermatology classes).
The classifier is trained for 10 epochs (approximately 48 seconds) using only CPU resources. Built on fixed image embeddings extracted from the MedImageInsight model, the adapter efficiently tailors these representations for downstream classification tasks with minimal computational overhead.
By updating only the adapter components, while keeping the MedImageInsight backbone frozen, the model significantly reduces computational and memory overhead. This design also mitigates overfitting, making it particularly effective in medical imaging scenarios with limited or imbalanced labeled data. A Jupyter Notebook detailing the construction and training of a MedImageInsight-based adapter model is available in our Samples Repository: https://aka.ms/healthcare-ai-examples-mi2-adapter
Figure 3: MedImageInsight-based Adapter Model
II. Boosting Vision-Language Models with in-Context Prompting
We leverage vision-language models (e.g., GPT-4o, GPT-4.1), which represent a recent class of multimodal foundation models capable of jointly reasoning over visual and textual inputs. These models are particularly promising for dermatology tasks due to their ability to interpret complex visual patterns in medical images while simultaneously understanding domain-specific medical terminology.
1. Few-shot Prompting
In this setting, a small number of examples from the training dataset are randomly selected and embedded into the input prompt. These examples, consisting of paired images and corresponding labels, are intended to guide the model's interpretation of new inputs by providing contextual cues and examples of relevant dermatological features.
2. MedImageInsight-based Retrieval-Augmented Generation (RAG)
This approach enhances vision-language model performance by integrating a similarity-based retrieval mechanism rooted in MedImageInsight (Medical Image-to-Image) comparison. Specifically, it employs a k-nearest neighbors (k-NN) search to identify the top k dermatological training images that are most visually similar to a given query image. The retrieved examples, consisting of dermatological images and their corresponding labels, are then used as in-context examples in the Vision-LLM prompt.
By presenting visually similar cases, this approach provides the model with more targeted contextual references, enabling it to generate predictions grounded in relevant visual patterns and associated clinical semantics. As illustrated in Figure 2, the system operates in two phases:

Index Construction: Embeddings are extracted from all training images using a pretrained vision encoder (MedImageInsight). These embeddings are then indexed to enable efficient and scalable similarity search during retrieval.
Query and Retrieval: At inference time, the test image is encoded similarly to produce a query embedding. The system computes the Euclidean distance between this query vector and all indexed embeddings, retrieving the k nearest neighbors with the smallest distances.

To handle the computational demands of large-scale image datasets, the method leverages FAISS (Facebook AI Similarity Search), an open-source library designed for fast and scalable similarity search and clustering of high-dimensional vectors. The implementation of the image search method is available in our Samples Repository: https://aka.ms/healthcare-ai-examples-mi2-2d-image-search

Figure 2: MedImageInsight-based Retrieval-Augmented Generation

Evaluation

Table 1 presents accuracy scores for anatomic location prediction on the DermaVQA-iiyi test set using the proposed modeling approaches. The adapter model achieves a baseline accuracy of 31.73%. Vision-language models perform better, with GPT-4o (2024-11-20) achieving an accuracy of 47.11%, and GPT-4.1 (2025-04-14) improving to 50%. However, incorporating few-shot prompting with five randomly selected in-context examples (5-shot) slightly reduces GPT-4.1’s performance to 48.72%. This decline suggests that unguided example selection may introduce irrelevant or low-quality context, potentially reducing the effectiveness of the model’s predictions for this specialized task.
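For readers who want to reproduce this kind of comparison, the sketch below shows the metric and the gaps between the scores quoted above. It assumes accuracy means exact match between the predicted and ground-truth location label; the function name `accuracy` and the dictionary keys are illustrative, while the four percentages are the figures reported in the text.

```python
def accuracy(predictions, ground_truth):
    # Percentage of cases where the predicted label exactly matches the truth.
    if not ground_truth or len(predictions) != len(ground_truth):
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(p == t for p, t in zip(predictions, ground_truth))
    return 100.0 * correct / len(ground_truth)

# Accuracy figures quoted in the text above (percent).
reported = {
    "MedImageInsight adapter": 31.73,
    "GPT-4o (2024-11-20)": 47.11,
    "GPT-4.1 (2025-04-14)": 50.00,
    "GPT-4.1 + 5 random examples": 48.72,
}

# Gap to zero-shot GPT-4.1, in percentage points (pp).
baseline = reported["GPT-4.1 (2025-04-14)"]
for name, score in reported.items():
    print(f"{name:30s} {score:6.2f}%  {score - baseline:+.2f} pp")
```

Reporting the gap in percentage points makes the effect of random example selection explicit: a drop of 1.28 points relative to GPT-4.1 without in-context examples.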
The best performance among the vision-language approaches is achieved using the retrieval-augmented generation (RAG) strategy. In this setup, GPT-4.1 is prompted with five nearest-neighbor examples retrieved using the MedImageInsight-based search method (RAG-5), raising accuracy to 51.60%. This gain of 1.60 percentage points over GPT-4.1’s 50% accuracy without retrieval points to the relevance of the MedImageInsight-based RAG method. We expect larger performance gains with a more extensive dermatology dataset than the relatively small one used in this example: a collection of 2,474 images associated with 842 patient cases, which served as the basis for selecting relevant cases and similar images.

Dermatology is a particularly challenging domain, marked by a high number of distinct conditions and significant variability in skin tone, texture, and lesion appearance. This diversity makes robust and representative example retrieval especially critical for enhancing model performance. The results underscore the importance of example relevance in few-shot prompting, demonstrating that similarity-based retrieval can effectively guide the model toward more accurate predictions in complex visual reasoning tasks.

Table 1: Comparative Accuracy of Anatomic Location Prediction on DermaVQA-iiyi

Figure 2: Confusion Matrix of Anatomical Location Predictions by the trained MLP adapter: The matrix illustrates the model's performance in classifying wound images across 39 anatomical regions. Strong diagonal values indicate correct classifications, while off-diagonal entries highlight common misclassifications, particularly among anatomically adjacent or visually similar regions such as 'lowerback' vs. 'back' and 'hand' vs. 'fingers'.

Figure 3. Examples of correct anatomical predictions by the RAG approach. Each image depicts a case where the model's predicted anatomical region exactly matches the ground truth.
Shown are examples from visually and anatomically distinct areas including the eye region, lips, lower leg, and neck.

Figure 4. Examples of misclassifications by the RAG approach. Each image displays a case where the predicted anatomical label differs from the ground truth. In several examples, predictions are anatomically close to the correct regions (e.g., hand vs. hand-back, lower leg vs. foot, palm vs. fingers), suggesting that misclassifications often occur between adjacent or visually similar areas. These cases highlight the challenge of precise localization in fine-grained anatomical classification and the importance of accounting for anatomical ambiguity in both modeling and evaluation.

Conclusion

Our exploration of scalable image retrieval and advanced prompting strategies demonstrates the growing potential of vision-language models in dermatology. A particularly challenging task we address is anatomic location prediction, which involves 39 fine-grained classes of dermatology images, imbalanced training data, and frequent misclassifications between adjacent or visually similar regions. By leveraging Retrieval-Augmented Generation (RAG) with similarity-based example selection using image embeddings from the MedImageInsight foundation model, we show that relevant contextual guidance can improve model performance in such complex settings. These findings underscore the importance of intelligent image retrieval and prompt construction for enhancing prediction accuracy in fine-grained medical tasks. As vision-language models continue to evolve, their integration with retrieval mechanisms and foundation models holds substantial promise for advancing clinical decision support, medical research, and education at scale.
In the next blog of this series, we will shift focus to the wound care subdomain of dermatology, and we will release accompanying Jupyter notebooks for the adapter-based and RAG-based methods to provide a reproducible reference implementation for researchers and practitioners.

The Microsoft healthcare AI models, including MedImageInsight, are intended for research and model development exploration. The models are not designed or intended to be deployed in clinical settings as-is nor for use in the diagnosis or treatment of any health or medical condition, and the individual models’ performances for such purposes have not been established. You bear sole responsibility and liability for any use of the healthcare AI models, including verification of outputs and incorporation into any product or service intended for a medical purpose or to inform clinical decision-making, compliance with applicable healthcare laws and regulations, and obtaining any necessary clearances or approvals.

Image Search Series: Blog Posts & Jupyter Notebooks

Image Search Series Part 1: Chest X-ray lookup with MedImageInsight | Microsoft Community Hub (2d_image_search.ipynb)
Image Search Series Part 2: AI Methods for the Automation of 3D Image Retrieval in Radiology | Microsoft Community Hub (3d_image_search.ipynb)
Image Search Series Part 3: Foundation Models and Retrieval-Augmented Generation in Dermatology | Microsoft Community Hub
Image Search Series Part 4: Advancing Wound Care with Foundation Models and Context-Aware Retrieval | Microsoft Community Hub (rag_infection_detection.ipynb)
Image Search Series Part V: Building Histopathology Image Search with Prov-GigaPath | Microsoft Community Hub (2d_pathology_image_search.ipynb)

The Microsoft healthcare AI models, including MedImageInsight, available in the Microsoft Foundry model catalog, are intended for research and model development exploration.
The models are not designed or intended to be deployed in clinical settings as-is nor for use in the diagnosis or treatment of any health or medical condition, and the individual models’ performances for such purposes have not been established. You bear sole responsibility and liability for any use of the healthcare AI models, including verification of outputs and incorporation into any product or service intended for a medical purpose or to inform clinical decision-making, compliance with applicable healthcare laws and regulations, and obtaining any necessary clearances or approvals.

References

Noel C. F. Codella, Ying Jin, Shrey Jain, Yu Gu, Ho Hin Lee, Asma Ben Abacha, Alberto Santamaría-Pang, Will Guyman, Naiteek Sangani, Sheng Zhang, Hoifung Poon, Stephanie L. Hyland, Shruthi Bannur, Javier Alvarez-Valle, Xue Li, John Garrett, Alan McMillan, Gaurav Rajguru, Madhu Maddi, Nilesh Vijayrania, Rehaan Bhimai, Nick Mecklenburg, Rupal Jain, Daniel Holstein, Naveen Gaur, Vijay Aski, Jenq-Neng Hwang, Thomas Lin, Ivan Tarapov, Matthew P. Lungren, Mu Wei: MedImageInsight: An Open-Source Embedding Model for General Domain Medical Imaging. CoRR abs/2410.06542 (2024)
Wen-wai Yim, Yujuan Fu, Zhaoyi Sun, Asma Ben Abacha, Meliha Yetisgen, Fei Xia: DermaVQA: A Multilingual Visual Question Answering Dataset for Dermatology. MICCAI (5) 2024: 209-219
Model catalog and collections in Azure AI Foundry portal: https://learn.microsoft.com/en-us/azure/ai-studio/how-to/model-catalog-overview

Dragon Copilot centralizes trusted medical content and relevant contextual information in-workflow
This blog is co-authored by Bert Hoorne, Principal Program Manager & Ksenya Kveler, Principal Medical Science Manager

Dragon Copilot delivers medical intelligence from trusted sources directly within clinical workflows for healthcare organizations in one solution. We are pleased to announce that we are expanding those knowledge sources with additional best-in-class content providers and enabling broader access to your organization’s internal sources with Microsoft 365 Copilot integration.

Access information from new credible medical content providers

Dragon Copilot users will gain access to an additional robust collection of trusted clinical content from leading evidence-based resources. We are partnering with renowned publishers to bring you the best, most trusted content, safely and securely, within clinicians’ workflows while helping to reduce the use of unauthorized AI tools and applications, commonly referred to as “shadow AI.”

Access content from Wolters Kluwer UpToDate

We’ve partnered with Wolters Kluwer UpToDate to bring trusted, evidence-based clinical guidance directly into Dragon Copilot. Customers with an active Wolters Kluwer UpToDate license will be able to access UpToDate content in Dragon Copilot, within the context of their clinical workflows. This integration allows clinicians to ask both general and patient-specific questions and receive answers grounded in UpToDate evidence, with clear references to supporting sources. Over time, it will also introduce contextual links to UpToDate concepts layered on top of Dragon Copilot–generated notes, further enhancing clinical insight at the point of care.

“Clinicians need reliable guidance that supports fast, confident decision-making without disrupting care delivery.
We are excited to partner with Microsoft to bring UpToDate’s gold standard evidence and expertise-based clinical insights to Dragon Copilot, helping clinicians quickly access actionable answers that reduce cognitive burden and support better patient care.”
Yaw Fellin, Senior Vice President and General Manager, UpToDate Clinical Decision Support and Provider Solutions, Wolters Kluwer Health

Here’s an example of UpToDate content embedded in the Dragon Copilot workflow:

Obtain trusted clinical evidence with Elsevier ClinicalKey AI

Elsevier’s ClinicalKey AI will be available in Dragon Copilot. This integration enables customers with an active Elsevier ClinicalKey AI license to surface trusted medical literature and clinical evidence directly within clinicians’ workflows.

“Clinicians are navigating a complex and rapidly changing healthcare landscape and need solutions they can trust. The ClinicalKey AI extension for Dragon Copilot transforms how clinicians interact with trusted medical literature and clinical answers. The conversational interface makes evidence discovery faster and more intuitive.”
Jukka Valimaki, SVP Clinical Solutions, Elsevier

Here’s an example of ClinicalKey AI content embedded in the Dragon Copilot workflow:

Support clinical decisions with EBMcalc

With the integration of EBMcalc medical calculators, Dragon Copilot enables clinicians to use evidence-based calculators directly within their workflows, applied in context to the patient they’re caring for.

“Clinicians need trusted, evidence-based insights exactly at the point of care. By integrating EBMcalc’s rigorously curated clinical calculators and references into Dragon Copilot, we’re helping make high quality medical evidence more accessible, more actionable, and easier to use within everyday clinical workflows.”
Louis Leff, MD, MACP, Founder and CEO, EBMcalc

Access independent evidence in Dragon Copilot with Wiley and Cochrane

Wiley and Microsoft are partnering to bring scientific literature and clinical evidence directly into the healthcare workflow, starting with the Cochrane Library. Through this integration, customers with an active Cochrane Library AI license will be able to access Cochrane’s high-quality, independent evidence, systematic reviews, and clinical answers to inform more reliable and efficient decision-making. This includes the Cochrane Database of Systematic Reviews (CDSR), the home of gold-standard evidence syntheses, widely used to inform clinical guidelines worldwide.

“Working with Microsoft to bring the Cochrane Library into Dragon Copilot reflects a shared commitment to meeting researchers and clinicians where they are. Healthcare Institutions can now access independent, peer-reviewed evidence, right within their clinical workflow.”
Josh Jarrett, SVP & GM of AI Growth, Wiley

Access work context with Microsoft 365 Copilot in Dragon Copilot

With the Microsoft 365 Copilot integration, Dragon Copilot enables clinicians to seamlessly access information from their emails, chats, OneDrive and SharePoint, within the flow of their clinical work. Clinicians can combine this information with additional questions and actions, all governed by existing organizational and user access controls. Use of this data within the Dragon Copilot workflow remains fully at the user’s discretion.

Here’s an example of content from an email surfaced by Microsoft 365 Copilot accessible through the Dragon Copilot workflow:

Read more for a deeper dive on how Dragon Copilot enables work context access with Microsoft 365 Copilot integration.

Safe web search

Dragon Copilot safe web search delivers trusted, evidence-linked answers when curated sources are unavailable, ensuring clinicians continue to receive timely support without disrupting their workflow.
The goal of safe web search is to prevent broken workflows and eliminate unsafe external browsing. Clinicians remain within their clinical context, focused on the patient, without tab hopping or the risk of landing on unreliable or unverified websites. Safe web search eliminates “no response” dead ends by maintaining a seamless conversational experience in Dragon Copilot and reducing unanswered prompts. This capability is enabled by verified, secure, and responsible mechanisms designed for safe clinical experiences. It enforces multilayer protection through evidence validation, provenance-linked responses, content filtering, and regulated search with built-in safeguards.

Here’s an example of content from a safe web search in the Dragon Copilot workflow:

Conclusion

These advancements represent an important step forward in how Dragon Copilot delivers trusted medical intelligence, bringing together best-in-class clinical evidence, organizational knowledge, and safe web access in one governed, in-workflow experience. We will continue to expand our partner ecosystem, deepen integrations with leading evidence providers, and evolve Dragon Copilot conversational extensibility to meet clinicians where they work.