Life Sciences

Azure OpenAI GPT model to review Pull Requests for Azure DevOps
In recent months, the use of Generative Pre-trained Transformer (GPT) models for natural language processing (NLP) has gained significant traction. GPT models, which are based on the Transformer architecture, can generate text from arbitrary input and can be trained to identify errors and detect anomalies in text. As such, GPT models are increasingly being used for a variety of applications, ranging from natural language understanding to text summarization and question answering.

In the software development world, developers use pull requests to submit proposed changes to a codebase. However, reviews by other developers can take a long time and may be inaccurate, and an inadequate review can let new bugs and issues into the codebase. To reduce this risk, I explored integrating GPT models into the review process and found that the Azure OpenAI service can be added as a pull request reviewer for the Azure Pipelines service.

GPT models trained on developer codebases are able to detect potential coding issues such as typos, syntax errors, style inconsistencies, and code smells. In addition, they can assess code structure and suggest improvements to overall code quality. Once the GPT models have been trained, they can be integrated into the Azure Pipelines service so that they automatically review pull requests and provide feedback. This helps reduce the time taken for code reviews, as well as the likelihood of introducing bugs and issues.
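The post doesn't include the integration code itself, so here is a minimal sketch of the kind of pipeline step it describes, written in Python. The Azure DevOps endpoints follow the public REST API, but the organization, project, repository, pull request ID, and deployment name are all hypothetical placeholders:

```python
# Sketch: fetch a pull request's changes, ask an Azure OpenAI deployment to
# review them, and post the review back as a comment thread.
import os
import requests
from openai import AzureOpenAI

# Hypothetical organization/project/repo; PR_ID would come from the pipeline.
ADO_BASE = "https://dev.azure.com/my-org/my-project/_apis/git/repositories/my-repo"
PR_ID = 42
HEADERS = {"Authorization": f"Bearer {os.environ['ADO_TOKEN']}"}

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-02-01",
)

# 1. Fetch the list of changes in the pull request's first iteration.
pr_url = f"{ADO_BASE}/pullRequests/{PR_ID}"
changes = requests.get(
    f"{pr_url}/iterations/1/changes?api-version=7.1", headers=HEADERS
).json()

# 2. Ask the model for a review (a real pipeline would pass file diffs here).
review = client.chat.completions.create(
    model="gpt-4o",  # name of your Azure OpenAI deployment
    messages=[
        {"role": "system", "content": "You are a code reviewer. Flag typos, "
         "syntax errors, style inconsistencies, and code smells."},
        {"role": "user", "content": f"Review these pull request changes:\n{changes}"},
    ],
).choices[0].message.content

# 3. Post the feedback to the pull request as an active comment thread.
thread = {"comments": [{"content": review, "commentType": 1}], "status": 1}
requests.post(f"{pr_url}/threads?api-version=7.1", json=thread, headers=HEADERS)
```

In a pipeline, a script like this would run in a PR-triggered build step, with the token supplied by the agent job (for example via the System.AccessToken variable).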
Microsoft Azure continues to expand scalability for Healthcare EHR Workloads

Microsoft Azure has reached a new milestone for Epic Chronicles Operational Database (ODB) scalability with the Standard_M416bs_v3 (Mbv3) VM. It can now scale up to 110 million GRefs/s (Global References per second) in the ECP configuration and up to 39 million GRefs/s in the SMP configuration, improving upon the previous Azure benchmarks of 65 million GRefs/s and 20 million GRefs/s respectively. Microsoft Azure can now host 96% of the Epic customer base, enabling healthcare organizations to run their EHR systems on Azure.

New VM Size Purpose-Built for Epic's Chronicles ODB

The Standard_M416bs_v3 VM, newly added to Azure's Mbv3 series, is purpose-built to meet the growing performance and scalability demands of large healthcare EHR environments. With higher CPU capacity, expanded memory, and improved remote storage throughput, it delivers the reliability needed for mission-critical workloads at scale. Key specifications include:

- Mbv3 Processor Performance: Built on 4th Gen Intel® Xeon® Scalable processors, the Mbv3 series is optimized for high memory and storage performance, supporting workloads up to 4 TB of RAM with an NVMe interface for faster remote disk access.
- Compute Capacity: The Standard_M416bs_v3 delivers 416 vCPUs, more than twice the capacity of previous Mbv3 sizes.
- Storage Performance: Achieves up to 550,000 IOPS and 10 GBps remote disk bandwidth using Azure Ultra Disk.
- Performance Optimization: Enhanced by Azure Boost, the M416bs_v3 provides low-latency, high-throughput remote storage, making it ideal for storage-intensive applications such as the Epic ODB, relational databases, and analytics workloads.
- Available Regions: M416bs_v3 is available in 4 regions: East US, East US 2, Central US, and West US 2.

Explore Epic on Azure to learn more. Epic and Chronicles are trademarks of Epic Systems Corporation.
Healthcare agent service in Microsoft Copilot Studio is now Generally Available

Healthcare organizations continue to face immense challenges: workforce shortages, rising costs, and growing demands for patient care. Clinical staff are overburdened, leading to stress, burnout, and further staff shortages. Generative AI presents a powerful opportunity when it can automate administrative workflows, surface relevant insights, and assist clinical staff with contextual, credible, and up-to-date information. With that opportunity in mind, we are excited to announce the General Availability (GA) of healthcare agent service in Microsoft Copilot Studio.

Building responsible, AI-powered healthcare agents

With healthcare agent service, organizations can create healthcare-specialized AI applications that use generative AI within a framework that promotes trust and compliance and supports real-world clinical scenarios. Agents combine built-in credible medical sources, such as FDA, CDC, MedlinePlus, MSD Manuals, DailyMed, and more, with the organization's own knowledge sources and plugins, while leveraging healthcare-specific actions. Customers can define the intended healthcare roles, such as healthcare professionals or patients, so the behavior is relevant and appropriate for the audience and use case. Pre-built use cases include clinical documentation assistance, patient self-service, helping healthcare professionals triage by organizing information, finding medication information, accessing recent clinical guidelines, and more.

Because responsible AI in healthcare is a top priority, healthcare agent service is infused with safeguards that are reinforced by a healthcare-adapted orchestrator optimized for safety. Clinical, chat, and compliance safeguards help keep interactions evidence-based and trustworthy, increasing the reliability and accuracy of generated responses and adherence to the highest standards of safety, privacy, and regulatory compliance. Healthcare agent service underscores our ongoing commitment to responsible AI in healthcare by offering customers a reliable, production-ready foundation for healthcare solutions that can be used to help support patients and medical professionals.

Extending Dragon Copilot with conversational solutions

Healthcare agent service provides a framework for building conversational AI applications that can be integrated directly into Dragon Copilot, giving partners and healthcare organizations the ability to extend its functionality in a scalable, compliant way. Today, Information Assist in Dragon Copilot, built on healthcare agent service, delivers safeguarded generative AI answers grounded in trusted sources and enriched with patient history and context, helping ensure clinicians receive accurate, timely, and context-aware insights. Clinicians can effortlessly access a broad range of clinical topics directly within their workflow using natural language, surfacing insights from leading, trusted healthcare content partners that promote more informed clinical decisions with less administrative work.

Partners and healthcare organizations can use healthcare agent service to create tailored solutions with built-in safeguards that help ensure output meets healthcare standards and supports safe decision-making at the point of care. These solutions can be integrated directly into Dragon Copilot to enhance both clinical and financial performance.

Real-world impact with customers

Healthcare organizations are already adopting healthcare agent service to bring generative AI into real-world care settings.
Early adopters are seeing meaningful impact in reducing administrative burden, improving patient experience, and empowering clinicians with trusted information.

Bayer Pharmaceuticals has recently worked with Microsoft to enable new agentic AI workflows for drug submission using healthcare agent service in Copilot Studio:

"We have collaborated with Microsoft to build an AI-powered multi-agent decision board using the healthcare agent service in Copilot Studio. This multi-agent decision board revolutionizes how we strategize drug submissions, pricing, and patient targeting for global market access. By simulating expert board discussions and synthesizing diverse data—from regulatory approvals to health economics and real-world evidence—the system streamlines the complex process of securing drug reimbursement. Healthcare agent service helped us get results quicker, empowering teams to make smarter, data-driven decisions without replacing human expertise, which would enable better access to life-changing therapies for patients worldwide. Importantly, this tool is not limited to pharmaceutical companies. It also supports decision-making for health authorities, NGOs, and other stakeholders across the healthcare ecosystem—enabling more informed, collaborative, and impactful choices that benefit public health at large." — Shay Zohar, local Market Access Director and member of Bayer Pharmaceutical's global Early Access team

Allgemeines Krankenhaus (AKH) Wien, the largest hospital in Vienna, Austria, and the Medical University of Vienna collaborated with Microsoft to extend Dragon Copilot with healthcare agent service to automate pre-anesthesia intake:

"Transforming pre-anesthesia assessments with AI agents for greater efficiency has a great potential to decrease the administrative burden on anesthesiologists. In this project we used healthcare agent service to extend Dragon Copilot with AI-powered agents that automate pre-anesthesia intake to enhance clinical documentation, significantly reducing the administrative workload for anesthesiologists. By orchestrating conversational and workflow agents, the solution interacts with patients, completes assessments, checks for data conflicts, and generates clinical notes, all consolidated for physician review in Dragon Copilot." — Dr. Oliver Kimberger, Professor for Perioperative Information Management at the Department of General Anesthesia and Intensive Care Medicine, AKH Wien

Empowering healthcare innovation

Healthcare agent service offers a low-code interface for building and deploying custom AI solutions with chat, compliance, and clinical safeguards that support safety and accuracy in generative AI. With seamless integration and the ability to extend the capabilities of Dragon Copilot, you gain the flexibility to tailor solutions to your organization's evolving needs.

- Learn more in the healthcare agent service in Copilot Studio documentation
- Explore the possibilities with Microsoft Copilot Studio
- Expand your knowledge about Microsoft for Healthcare
- Discover how we are shaping the future of health with cutting-edge solutions and collaborative efforts here

Medical Device Disclaimer: Microsoft products and services (1) are not designed, intended or made available as a medical device, and (2) are not designed or intended to be a substitute for professional medical advice, diagnosis, treatment, or judgment and should not be used to replace or as a substitute for professional medical advice, diagnosis, treatment, or judgment.
Customers/partners are responsible for ensuring their solutions comply with applicable laws and regulations.

Generative AI Disclaimer: Generative AI does not always provide accurate or complete information. AI outputs do not reflect the opinions of Microsoft. Customers/partners will need to thoroughly test and evaluate whether an AI tool is fit for the intended use and identify and mitigate any risks associated with its use.
Copilot Chat: Prompting

To start a new prompt, head over to Copilot Chat and hit the blue chat button in the upper right corner.

🔄 When should I start a new chat?

A good rule of thumb: hit that button whenever you're switching contexts or subject areas. This helps keep Copilot focused and prevents information from getting muddled. 🍸

🧪 How do I improve my prompts?

To get the best results, use the GCSE Formula:

- Goal: What do you want Copilot to do?
- Context: What background info will help?
- Source: Where should Copilot pull from?
- Expectations: What kind of output do you want?

🧩 Example

Here's a basic prompt:

Give me a concise summary of recent news about Pfizer.

Now let's expand it to include our other key ingredients:

Summarize the latest news about Pfizer from reputable sources like Reuters or Bloomberg. Focus on developments in their vaccine pipeline and financial performance. Keep it concise—under 150 words.

🎯 Challenge

Try using the GCSE Formula in your next prompt and compare it to using just the goal. See how your results stack up!
Copilot Chat: Downloads

On PC and Mac: Follow the download links below to install the Copilot Chat desktop app. Double-click the installer when prompted, and you're in.

- Windows: Microsoft 365 Copilot - Free download and install on Windows | Microsoft Store
- macOS: Microsoft 365 Copilot on the App Store

On Mobile: Scan the QR code to download the app to your device.

In Your Browser: Prefer not to download anything? You can also access Copilot Chat from Microsoft 365 Copilot Chat.

Once you're in, try starting a conversation in the prompt box. Not sure where to begin? No worries—use or tweak one of the suggested prompts to get going. Here are a few other handy entry points:
How Copilot Can Save Us Energy

Let's face it: our homes are getting smarter, but our energy bills are getting dumber. If you've ever asked Alexa to dim the lights while binge-watching your favorite show or told Google Home to crank up the AC during a heatwave, congratulations, you've officially joined the AI-powered energy club. But before you start blaming your smart speaker for your rising electricity costs, let's talk about how Copilot can actually help you save energy (and maybe even your sanity). 😁

First, the good news. Devices like Amazon Alexa and Google Home are not just glorified trivia machines; they're energy-saving ninjas when used correctly. According to Tom's Guide and SmartHomeMuse, routines like "Alexa, I'm leaving" can automatically turn off lights, lower thermostats, and shut down unnecessary devices. Google Home can do the same, adjusting smart thermostats based on occupancy and weather forecasts. It's like having a personal energy butler who never complains.

And then there's the Alexa Energy Dashboard, a nifty tool that tracks the power usage of connected devices. It's like a Fitbit for your fridge, letting you see which gadgets are guzzling electricity and which ones are behaving. Pair that with smart plugs and solar panel integration, and you've got a recipe for serious savings. Even Alexa's "Hunches" feature can detect when you're away and shut things down automatically. Smart, right? 👍

But here's the plot twist: these devices can also be energy vampires. According to Harvard Magazine and SFGATE, the "always-on" nature of smart assistants means they're constantly listening, syncing, and updating, even when you're not talking to them. That persistent power draw adds up, especially in homes with multiple devices. The Amazon Echo, for example, has no battery and must be plugged in 24/7. It's like having a roommate who never sleeps and always leaves the lights on.

Reports like the Amazon 2020 Sustainability Report and Alexa usage studies show that frequent users often have entire ecosystems of smart devices (lights, thermostats, speakers, and more), all connected and consuming energy. Without proper optimization, your smart home could become a not-so-smart drain on your wallet.

So, what's the solution? Enter Copilot. By leveraging AI to automate energy-saving routines, monitor device usage, and suggest optimizations, Copilot can help you strike the perfect balance between convenience and conservation. Think of it as your energy-saving sidekick: always watching, always learning, and never judging you for asking Alexa to play "Eye of the Tiger" at 2 a.m.

In conclusion, smart assistants are a double-edged sword. They can save you energy if used wisely or sneakily inflate your bills if left unchecked. With Copilot in your corner, you can harness the power of AI to make your home smarter, greener, and a little less expensive. And hey, if it also helps you win trivia night, that's just a bonus. 😉
Towards Robust Evaluation of Multi-Agent Systems in Clinical Settings

Authors: Hao Qiu, Leonardo Schettini, Mert Öz, Noel Codella, Sam Preston, Wen-wai Yim

As multi-agent systems become more capable and collaborative, their behavior begins to exhibit emergent properties that are difficult to predict or control, particularly in safety-critical domains like healthcare. Coordination among agents can yield outputs that are non-deterministic, multi-faceted, and context-sensitive. This makes robust evaluation not just a matter of accuracy, but of safety, accountability, and trust. Traditional NLP metrics like ROUGE or BLEU fall short in these settings, as they presuppose a single ground truth and fail to capture clinically relevant errors such as subtle omissions, hallucinations, or fact distortions.

To address this, we present a modular evaluation framework for the Healthcare Agent Orchestrator, designed to support fine-grained, clinically grounded assessment across both deployed clinical workflows and simulated scenarios. This framework enables targeted stress-testing of multi-agent behavior, particularly how agents share information, reason under uncertainty, and maintain factual fidelity in high-stakes contexts.

Central to our framework is TBFact, a domain-specific factuality metric that evaluates agent outputs based on three key criteria: factual inclusion, factual distortion, and factual omission. TBFact shows strong correlation with human experts (κ=0.760) and demonstrates that our Patient History agent successfully included up to 94% of high-importance information in the generated patient timelines.

To ground evaluations of the Patient History agent, we constructed a high-quality benchmark dataset from de-identified tumor board discussions and associated patient histories. The formatting of the reference patient timeline summaries (originally written by medical professionals) was standardized via a large language model to facilitate consistent evaluation. Under our benchmark, while the Patient History agent included over 94% of high-importance facts (counting both fully and partially entailed information), it achieved 0.84 TBFact recall on high-importance facts, showing that TBFact's strict entailment criteria and partial-credit scoring leave meaningful headroom for future improvements.

For more technical information about the evaluation framework, refer to the documentation. The healthcare-agent-orchestrator repository also includes an evaluation notebook with concrete examples for simulating conversations and evaluating them.

Figure: High-level architecture of the evaluation framework, showing data sources (real and simulated conversations) feeding into modular metrics for both orchestrator and individual agent assessment.

Available Metrics

Traditional similarity metrics (e.g., ROUGE, BERTScore) fail to capture subtle yet critical factual inaccuracies in the output. Moreover, in agentic workflows, a ground-truth answer often doesn't exist or is expensive to curate. To overcome these shortcomings, we leverage Model-as-a-Judge to implement the following metrics:

| Component | Metric | Description |
| --- | --- | --- |
| Orchestrator | Agent and tool selection accuracy | Correct routing to specialized agents. |
| Orchestrator | Intent resolution | How accurately the orchestrator interprets and completes user requests, including scoping and clarification. |
| Orchestrator | Information aggregation | Effective synthesis of multiple agent outputs. |
| Individual agents | Context relevancy | Relevance of retrieved information in relation to the user's requests. |
| Individual agents | TBFact (factual consistency) | An adapted version of RadFact for the text modality that measures the factuality of claims in agents' messages and helps identify omissions and hallucinations. |

Large Language Models serve as useful evaluation tools in our framework, offering advantages especially when ground truth data is not available. They can follow detailed evaluation guidelines, maintain consistency when applying criteria across conversations, and generate explanations for their assessments, facilitating verification of the evaluation process. However, due to their subjective nature, LLM-based evaluations should be treated as directional signals rather than absolute scores, providing guidance for system improvement rather than absolute judgments of correctness.

To complement LLM-based metrics with reproducible measurements, especially when reference data is available, we include a ROUGE implementation, which serves as an example for developers to incorporate other similarity metrics like BLEU or BERTScore by extending the ReferenceBasedMetric class.
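As a concrete illustration of the Model-as-a-Judge pattern, here is a minimal sketch of an intent-resolution judge in Python. It assumes an Azure OpenAI GPT-4o deployment, and the rubric wording is illustrative rather than the framework's actual prompt:

```python
# Sketch of a Model-as-a-Judge metric: an LLM scores intent resolution on a
# fixed rubric and returns an explanation alongside the score for auditability.
import json
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-02-01",
)

RUBRIC = """You evaluate a multi-agent orchestrator. Rate intent resolution 1-5:
5 = user requests fully understood and resolved, with clarification when needed;
3 = partially resolved, some scoping or clarification missed;
1 = request misinterpreted.
Return JSON: {"score": <int>, "explanation": "<reasoning>"}"""

def judge_intent_resolution(conversation: str) -> dict:
    """Score one conversation transcript; keep the explanation for review."""
    response = client.chat.completions.create(
        model="gpt-4o",  # deployment name is an assumption
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": conversation},
        ],
    )
    return json.loads(response.choices[0].message.content)
```

Treating the returned explanation as a first-class output is what makes these directional scores useful: developers can verify why a conversation scored low before acting on it.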
TBFact: Domain-Specific Factuality Evaluation

TBFact builds on RadFact (Bannur et al., 2024), a framework originally developed for evaluating factual consistency in radiology reports, by adapting its core principles to the text-only modality of healthcare agent interactions:

1. Fact Extraction: Separately decomposes both agent responses and reference texts into discrete factual claims, categorized by clinical relevance (e.g., demographics, diagnosis, treatment).
2. Logical Entailment: Compares each fact to determine whether it is fully entailed, partially entailed, or not entailed by the reference, and further categorizes the reason for partial and total mismatches as "missing", "ambiguous", "incorrect", or "other".
3. Metric Calculation: TBFact performs the logical entailment in two directions:
   - Precision (pred-to-gold): Measures the proportion of factual claims in the agent's output that are supported by the reference data. A lower precision score may indicate the presence of hallucinated or extraneous facts not found in the reference, even if they are accurate. Precision can be seen as a proxy for succinctness.
   - Recall (gold-to-pred): Measures the proportion of reference facts that are successfully captured in the agent's output. A lower recall score signals missing or omitted information, which is especially critical in clinical contexts where completeness is essential.

By operating at the level of atomic factual units, TBFact shifts the focus from holistic summary judgments to targeted, claim-by-claim analysis. While claim extraction introduces its own challenges, such as ensuring consistent coverage of verifiable content, maintaining entailment fidelity, and handling decontextualization (Metropolitansky & Larson, 2025), factual claims make the evaluation process more modular and transparent, providing actionable insights into where and how agent responses differ from references.

For example, when evaluating a discharge summary, TBFact might identify that while demographic facts achieve 95% precision, treatment recommendations only reach 75% recall, pinpointing specific areas for agent improvement. This granular feedback enables developers to identify systematic issues, such as an agent consistently omitting medication dosages or incorrectly interpreting temporal information, that would be difficult to detect with traditional metrics.
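To make the two-directional scoring concrete, here is a simplified sketch of the metric-calculation step, assuming fact extraction and entailment labeling have already been performed (in the framework itself, by an LLM). The half-credit weighting for partial entailments is an illustrative assumption rather than the shipped implementation:

```python
# Simplified TBFact-style scoring over pre-labeled facts. Precision runs over
# facts extracted from the agent output (pred-to-gold); recall runs over facts
# extracted from the reference (gold-to-pred).
from dataclasses import dataclass

CREDIT = {"entailed": 1.0, "partial": 0.5, "not_entailed": 0.0}  # assumed weights

@dataclass
class Fact:
    text: str
    entailment: str             # "entailed" | "partial" | "not_entailed"
    importance: str = "medium"  # "low" | "medium" | "high"

def tbfact_scores(pred_to_gold: list[Fact], gold_to_pred: list[Fact]) -> dict:
    def score(facts: list[Fact]) -> float:
        return sum(CREDIT[f.entailment] for f in facts) / len(facts) if facts else 0.0

    precision = score(pred_to_gold)  # proxy for succinctness / hallucination
    recall = score(gold_to_pred)     # proxy for completeness
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # Recall restricted to high-importance reference facts, as reported in this post.
    high = [f for f in gold_to_pred if f.importance == "high"]
    return {"precision": precision, "recall": recall, "f1": f1,
            "recall_high_importance": score(high)}
```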
Data Sources

Due to the challenge of obtaining real-world data for each use case we want to evaluate, and to accommodate different development stages and data availability, the framework supports two primary evaluation modes:

- Real conversations: Healthcare Agent Orchestrator automatically saves chat sessions whenever a conversation is terminated with the command @Orchestrator: clear, enabling insight into actual clinical workflow performance.
- Simulated conversations: Generated for controlled testing using predefined scripts or adaptive scenarios. Essential for specialized scenarios with limited real-world data.

Results and Performance Assessment

Note: The following results represent initial validation from our current research phase, with ongoing work expanding the evaluation scope and refining methodologies. These preliminary results demonstrate promising capabilities for clinical system coordination and factual accuracy assessment.

Orchestrator Performance

We evaluated the orchestrator using simulated conversations across multiple patient scenarios. GPT-4o served as the evaluator, providing both quantitative scores and qualitative explanations based on defined metric criteria. In this preliminary experiment, the orchestrator demonstrated promising coordination capabilities:

| Metric | Score Range | Average Score |
| --- | --- | --- |
| Agent selection accuracy | 3.89 – 5 | 4 |
| Intent resolution | 4 – 5 | 4.5 |
| Information aggregation | 3 – 5 | 3.7 |

In our preliminary evaluation, the agent selection examples are relatively straightforward given our agents' well-defined responsibilities, but they provide a foundation for expanding to more complex scenarios involving agent-human expert interactions as we gather real-world data. Future work could include turn-level labeling of tumor board dataset dialogues to test classification accuracy in choosing the right next expert or agent. Agent selection can also be combined with tool-selection metrics, addressing the fragmentation problem in multi-agent evaluation approaches. In the current state, we mainly used the explanations provided by the evaluator model to better understand the behavior of the system in clinical workflows and to guide the development process.

Patient History Agent Performance with TBFact

To evaluate the Patient History agent, we used an anonymized, PHI-free proprietary dataset named TB-Bench that comprehensively aggregates diverse medical records for 71 patients who had undergone the care of a Molecular Tumor Board (MTB). TB-Bench includes data such as tumor board transcripts, exported EHR data, and clinician-generated patient summaries. Due to the logistical challenges involved in curating such a comprehensive dataset across potentially multiple healthcare institutions and record-keeping systems, we found that in some instances the clinician-generated summaries available in the tumor board transcripts referred to patient records that were lost in the data curation process. This mismatch made direct evaluation challenging. Therefore, to ensure the evaluation reflects system performance when complete patient records are accessible, we used TBFact to evaluate the agent's output against a curated set of dataset-verifiable facts, i.e., facts limited to those referring to information that is present in the dataset.
While TBFact measures both recall and precision of fact generation, our study focuses on recall because it measures how much of the important information is covered, which we consider the most critical metric for clinical applications where missing information can have serious consequences.

The preliminary experiments revealed significant performance improvements through prompt optimization and format adjustments. With specialized prompting, we specify the types of information to prioritize, such as biomarker results, imaging assessments, and treatment timelines. For instance, our updated prompt instructs the agent to "organize the patient data in chronological order" and explicitly calls out key elements to include: "all biomarkers", "response to treatment including dates and imaging", and "a summary of current status". This prompt engineering approach proved to be one of the most effective levers for improving the quality and completeness of Patient History outputs.

| Configuration | TBFact Recall (All Facts) | TBFact Recall (Important Facts) |
| --- | --- | --- |
| Generic prompts (baseline) | 0.56 | 0.66 |
| Specialized prompts | 0.71 | 0.84 |

Since TBFact operates by comparing discrete factual claims, higher scores indicate that the agent is, according to the reference data, factually accurate and comprehensive in its coverage of the available patient information. In other words, optimizing for TBFact scores brings the agent's output structurally and semantically closer to the curated reference timelines. In our case, that meant striving for detailed outputs, including information about allergies and ongoing medications, even when specific dates were unavailable. This underscores the importance of having high-quality, human-validated reference datasets: without them, even well-performing agents may appear incomplete or inaccurate.

Human Validation Study

To validate TBFact's reliability, we conducted a preliminary study with human annotators, medical scribes by training, using 71 patient records. Two annotators assessed (a) whether a claim was properly extracted from its source text, (b) whether the fact was important (low, medium, high), and (c) whether individual claims were properly entailed by a reference text. Inter-annotator agreement was measured at 0.999, 0.66 (strict) and 0.77 (relaxed), and 0.914 for the three tasks respectively.

The accuracy of the fact extraction pipeline was calculated to be 99.9%, validating that minimal to no hallucinations are introduced during the fact extraction phase. System accuracy for fact importance classification was 66% when measured strictly; however, when allowing a tolerance of one level (e.g., classifying medium instead of high), it was 93%. These values are comparable to those of the medical annotators. Entailment classification accuracy was 88%, suggesting reasonable performance in the system's ability to recognize entailment. Finally, we measured the correlation of the end-to-end TBFact F1 score of the system against human judgments using Kendall Tau, Pearson, and Spearman correlations. These were 55.8%, 70.5%, and 72.8% respectively, moderate-to-high correlations suggesting that the TBFact metrics are well aligned with expert clinical reasoning.

Qualitative insights from TBFact

The table below illustrates how TBFact evaluates factual alignment between agent-generated summaries and reference data. Each row shows a fact extracted from the agent's output, the corresponding excerpt from the reference, and the entailment judgment.
The logical entailment was produced by TBFact, while the accompanying explanations were generated separately to support interpretability.

| Fact Extracted from Agent Response | Related Excerpt from Reference Text (Ground Truth) | TBFact Judgment |
| --- | --- | --- |
| Molecular studies from the 2019-05-18 surgery identified TERT promoter mutation, PTEN mutation, EGFR amplification, CDKN2A/B deletion, monosomy 10, and trisomy 7. | […] Tumor Genetics: EGFR: Amplified; CDKN2A/B: Deleted; PTEN: p.L112R; TERT: c.-146C>T; Chromosome 10: Monosomy; Chromosome 7: Trisomy […] Timeline: 05/18/2019: Diagnosis of multifocal glioblastoma; craniotomy and resection of lesion from right temporal lobe. […] | ✔ Entailed: The summary lists TERT mutation, PTEN mutation, EGFR amplification, CDKN2A/B deletion, monosomy 10, and trisomy 7. |
| Immunohistochemistry from 2019-05-18 showed GFAP positive, BRAF V600E negative, IDH1 R132H negative, ATRX retained, p53 negative, and a Ki-67 index of 3%. | […] Tumor Genetics: IDH1: Wildtype; BRAF V600E: Negative […] Timeline: 05/18/2019: Diagnosis of multifocal glioblastoma; craniotomy and resection of lesion from right temporal lobe. […] | ⚠️ Partial Entailment: Some IHC findings match (BRAF negative, IDH1 wildtype) but others (GFAP, p53, Ki-67) are not mentioned in the reference summary. |
| During the first cycle of CCNU on 2020-04-14, the patient reported significant fatigue, thrombocytopenia, and occasional confusion. | Introduction: […] The patient is experiencing poor tolerance to lomustine and is considering discontinuation due to further disease progression as confirmed by recent MRI scans. […] Timeline: 04/14/2020 - Present: Lomustine treatment initiated. […] | ⚠️ Partial Entailment: Poor tolerance to lomustine is reported, but specific side effects are not listed in the reference summary. |
| On 2020-05-16, the plan was to continue CCNU and monitor with imaging. | No related information in the reference text. | ⚠️ No Entailment: No mention in the summary of a plan on 2020-05-16 to continue CCNU with imaging follow-up. |

These examples show that partial entailments are not necessarily errors. In many cases, they reflect the agent surfacing clinically relevant details that are absent from the reference. This is especially important in healthcare settings, where agent outputs may synthesize information across multiple documents or express facts in more complete or structured ways than the reference does.

To further assess the factual grounding of the agent's outputs, we compared all facts extracted from the Patient History agent's summaries against the full set of available data for each patient in the TB-Bench dataset. We found that 97% of the extracted facts were entailed by at least one data point. Upon manually reviewing the remaining 3% of facts, we found that they often reflected condensed or synthesized information drawn from multiple sources, meaning these claims could not be matched to any one document in our one-to-one entailment setup. While we cannot rule out the presence of hallucinations entirely, this analysis highlights the agent's capacity for multi-source summarization.

Closing Thoughts

As multi-agent systems become more capable and autonomous, robust evaluation must evolve in parallel. The framework presented here is a step toward that goal: modular, clinically grounded, and designed to surface actionable insights across both simulated and real-world workflows.
By moving beyond traditional accuracy metrics and embracing factuality, relevance, and coordination as core evaluation dimensions, we can better understand how multi-agent systems work, and when and why they fail. Our preliminary experiments and insights reinforce the value of TBFact not just as a metric, but as a diagnostic tool. Its structured, claim-level analysis (combined with fact categorization and human validation) offers a transparent and clinically meaningful way to evaluate and improve healthcare agents.

In evaluating the Patient History agent, our findings demonstrate that the agent remains faithful to the underlying data and produces complete, clinically relevant summaries. These outputs can help physicians prepare more efficiently and productively for tumor board review meetings and, in a chat with multiple agents, facilitate further investigation and understanding of patients.

Looking ahead, we see several promising directions for extending this work: incorporating human-in-the-loop review pipelines, expanding to multimodal evaluation, improving observability across agent interactions, and scaling to more diverse real-world datasets. We are also developing a standardized benchmark of synthetic and de-identified patient cases to support broader community testing and reproducibility. We hope this work encourages others to adopt similarly rigorous approaches to evaluation, and to contribute to the development of shared benchmarks, metrics, and methodologies.

References

Bannur, S., Bouzid, K., Castro, D. C., Schwaighofer, A., Thieme, A., Bond-Taylor, S., ... & Hyland, S. L. (2024). MAIRA-2: Grounded radiology report generation. arXiv:2406.04449v2.

Metropolitansky, D., & Larson, J. (2025). Towards Effective Extraction and Evaluation of Factual Claims. arXiv:2502.10855v2.
Azure Logic App AI-Powered Monitoring Solution: Automate, Analyze, and Act on Your Azure Data

Introduction

In today's cloud-driven world, monitoring and analyzing application health is critical for business continuity and operational excellence. However, the sheer volume of monitoring data can make it challenging to extract actionable insights quickly. Enter the Azure Logic App AI-Powered Monitoring Solution: an intelligent, serverless pipeline that leverages Azure Logic Apps and Azure OpenAI to automate monitoring, analyze data, and deliver comprehensive reports right to your inbox. This solution is ideal for organizations seeking to modernize their monitoring workflows, reduce manual analysis, and empower teams with AI-driven insights for faster decision-making.

What Does This Solution Accomplish?

The Azure Logic App AI-Powered Monitoring Solution creates an automated pipeline (sketched in code below) that:

- Extracts monitoring data from Azure Log Analytics using KQL queries.
- Analyzes the data with AI using the Azure OpenAI GPT-4o model.
- Generates intelligent reports and sends them via email.
- Runs automatically on a daily schedule.
- Uses managed identity for secure authentication across Azure services.

Business Case Solved

- Automated Monitoring: No more manual log reviews; let AI do the heavy lifting.
- Actionable Insights: Receive daily, AI-generated summaries highlighting system health, key metrics, potential issues, and recommendations.
- Operational Efficiency: Reduce time-to-insight and empower teams to act faster on critical events.
- Secure and Scalable: Built on Azure's serverless and identity-driven architecture.

Key Features

- Serverless Architecture: Built on Azure Logic Apps Standard for scalability and cost efficiency.
- AI-Powered Insights: Uses Azure OpenAI for advanced data analysis and summarization.
- Infrastructure as Code: Deployable via Bicep templates for reproducibility and automation.
- Secure by Design: Managed identity and Azure RBAC ensure secure access.
- Cost Effective: Pay-per-execution model with optimized resource usage.
- Customizable: Easily modify KQL queries and AI prompts to fit your monitoring needs.

Solution Architecture

Technologies Involved

- Azure Logic Apps Standard: Orchestrates the workflow.
- Azure OpenAI Service (GPT-4o): Performs AI-powered data analysis and summarization.
- Azure Log Analytics: Source of monitoring data, queried via KQL.
- Application Insights: Monitors workflow execution and telemetry.
- Azure Storage Account: Stores Logic App runtime data.
- Managed Identity: Secures authentication across Azure services.
- Infrastructure as Code (Bicep): Enables automated, repeatable deployments.
- Office 365 Connector: Sends email notifications.
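The solution itself is a Logic App workflow deployed from Bicep, but the pipeline's logic is easy to see in code. Here is a rough Python illustration of the extract-and-analyze steps; the KQL query, workspace ID, and deployment name are placeholders, not taken from the repository:

```python
# Illustration of the pipeline's logic: query Log Analytics with KQL, then have
# GPT-4o turn the raw metrics into a health report. The Logic App would email
# the result via the Office 365 connector on its daily schedule.
import os
from datetime import timedelta

from azure.identity import DefaultAzureCredential  # managed identity in Azure
from azure.monitor.query import LogsQueryClient
from openai import AzureOpenAI

logs = LogsQueryClient(DefaultAzureCredential())

KQL = """
AppRequests
| summarize requests = count(), failures = countif(Success == false),
            p95_ms = percentile(DurationMs, 95) by bin(TimeGenerated, 1h)
| order by TimeGenerated asc
"""  # placeholder query

result = logs.query_workspace(
    workspace_id=os.environ["LOG_ANALYTICS_WORKSPACE_ID"],
    query=KQL,
    timespan=timedelta(days=1),
)
rows = [", ".join(map(str, row)) for table in result.tables for row in table.rows]

aoai = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-02-01",
)
report = aoai.chat.completions.create(
    model="gpt-4o",  # deployment name is a placeholder
    messages=[
        {"role": "system", "content": "Summarize system health from these hourly "
         "metrics: key trends, potential issues, and recommendations."},
        {"role": "user", "content": "\n".join(rows)},
    ],
).choices[0].message.content
print(report)
```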
Support

- Documentation: https://docs.microsoft.com/en-us/azure/logic-apps/
- Issues: https://github.com/vinod-soni-microsoft/logicapp-ai-summarize/issues

Star this repository if you find it helpful!

Optimizing Azure Healthcare Multimodal AI Models for Intel CPU Architecture

Authors:
- Alexander Mehmet Ersoy, Principal Product Manager, Microsoft HLS AI
- Abhishek Khowala, Principal AI Engineer, Intel
- Ravi Panchumarthy, AI Framework Engineer, Intel
- Srinarayan Srikanthan, AI Framework Engineer, Intel
- Ekaterina Aidova, AI Frameworks Engineer, Intel
- Alberto Santamaria-Pang, Principal Applied Data Scientist, Microsoft HLS AI and Adjunct Faculty at Johns Hopkins Medicine
- Peter Lee, Applied Scientist, Microsoft HLS AI and Adjunct Assistant Professor at Vanderbilt University
- Ivan Tarapov, Sr. Director, Microsoft HLS AI
- Pradeep Sakhamoori, Sr. SW Engineer, Microsoft

The Rise of Multimodal AI in Healthcare

The healthcare sector is witnessing a surge in the adoption of multimodal AI models, which are crucial for applications ranging from diagnostics to personalized treatment plans. These models combine data from various sources, such as medical images, patient records, and genomic data, to provide comprehensive insights. Microsoft's Azure AI Foundry Model Catalog of multimodal healthcare foundation models is at the forefront of this change. Recently launched models (such as MedImageInsight, MedImageParse, CXRReportGen [8], and many others) are designed to help healthcare organizations rapidly build and deploy AI solutions tailored to their specific needs, while minimizing the extensive compute and data requirements typically associated with building multimodal models from scratch. Real-world examples of the adoption of multimodal AI models by our industry partners are highlighted in the article "Unlocking next-generation AI capabilities with healthcare AI models".

Challenges and Opportunities in Hardware Optimization

As models become more complex, which is the case with the foundation model trend, the demands on the hardware rise. While GPUs remain the platform of choice for minimizing model execution times, CPUs present substantial optimization possibilities, especially for inference workloads. We believe that providing a framework for efficient CPU-based environments holds huge potential for many production scenarios where speed can be traded off for cost savings.

With multimodal healthcare AI, the complexity of handling different data modalities and ensuring efficient inference requires innovative solutions and collaboration between industry leaders. Companies are increasingly looking to hardware-specific optimizations to enhance model efficiency and reduce latency while keeping costs at bay. Intel, with its robust suite of AI tools and extensions for frameworks like PyTorch, is pioneering this optimization effort. For instance, the Intel® Distribution of OpenVINO™ toolkit has been instrumental in accelerating the development of computer vision and deep learning applications in healthcare [1]. You can learn about our recent collaboration with Intel on AI optimizations to advance medical innovations in the article "Empower Medical Innovations: Intel Accelerates PadChest & fMRI Models on Microsoft Azure* Machine Learning".

The demand for AI applications in healthcare is rapidly increasing. Multimodal AI models, which can process and analyze complex datasets, are essential for tasks such as early disease detection, treatment planning, and patient monitoring. While optimizing these models to perform efficiently on specific hardware is important, it is not necessarily a barrier to adoption. Models optimized with CUDA for NVIDIA GPUs often deliver optimal performance and run faster than on any other hardware.
However, the benefit of using CPUs lies in the tradeoff they offer. You can optimize for speed by running your model on a GPU, or you can optimize for cost by sacrificing speed. This is the proposition here: the option to run the model more slowly on an accessible CPU, which can be advantageous in scenarios where speed is not the primary concern but access to GPU hardware is. The Intel® oneAPI Deep Neural Network Library (oneDNN) has proven effective in reducing the GPU requirement burden and accelerating time to market for AI solutions [2].

Both Intel® Extension for PyTorch (IPEX) and OpenVINO utilize Intel® oneDNN to accelerate deep learning operations, taking advantage of underlying hardware features. IPEX optimizes existing PyTorch workflows with minimal code changes, while OpenVINO provides cross-platform deep learning optimization for deployment flexibility. In this blog post, a custom deployment was implemented using CXRReportGen along with both IPEX and OpenVINO optimizations, demonstrating how these techniques can support different deployment scenarios and technical requirements. This optimization is accessible through Azure's compute services and Intel's technology.

Benchmarking and Performance Acceleration

To address these challenges, our new collaboration with Intel focuses on leveraging Intel's advanced AI tools and hardware capabilities to optimize multimodal AI models for greater healthcare access. By utilizing Intel's Extension for PyTorch and other optimization techniques, we aim to achieve the best possible model run time on CPUs. While this may still be slower than GPU execution, the main benefit is addressing the problem of GPU hardware scarcity. This partnership not only underscores the importance of hardware-specific optimizations but also sets a new standard for AI model deployment in real-world healthcare applications.

Both IPEX and OpenVINO are built on a common foundation: Intel® oneDNN, a high-performance library designed specifically for deep learning applications and optimized for Intel architecture. oneDNN leverages specialized hardware instructions available in Intel processors, such as Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Vector Neural Network Instructions (VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX) [3] on Intel CPUs, as well as Intel® Xe Matrix Extensions (XMX) AI engines on Intel discrete GPUs.

Figure 1: oneDNN library.

IPEX [4] extends PyTorch* with the latest performance optimizations for Intel hardware [5]. It leverages oneDNN under the hood to provide optimized implementations of key operations. This allows developers to stay within their existing PyTorch code with minimal changes, making it an excellent choice for teams already comfortable with the PyTorch ecosystem who want to quickly optimize their models for Intel hardware.

```python
import torch

############## Import IPEX ###############
import intel_extension_for_pytorch as ipex

model = Model()  # your PyTorch model
model.eval()

############## Optimize with IPEX ###############
model = ipex.optimize(model, dtype=torch.bfloat16)

# Continue with inference as normal
```

Figure 2: Intel Extension for PyTorch.

The Intel® Distribution of OpenVINO™ toolkit is a powerful solution for optimizing and deploying deep learning models across a wide range of Intel hardware [6]. Like IPEX, it leverages oneDNN under the hood, but takes a different approach, offering cross-platform optimization and flexible deployment options.
OpenVINO supports two main workflows: a convenience workflow, where you run models directly with minimal setup, and a performance workflow, recommended for production, where models are first converted offline into the OpenVINO Intermediate Representation (IR). This one-time conversion step enables highly optimized inference and allows the final application to remain lightweight and efficient.

Here's a simple example using OpenVINO for inference with a pre-converted IR model. Refer to the OpenVINO Notebooks repo for more samples:

```python
import openvino as ov

core = ov.Core()

############## Load the OpenVINO IR model ###############
compiled_model = core.compile_model("model.xml", "CPU")

############## Run inference ###############
# input_tensor_name and input_tensor depend on your model and data.
infer_request = compiled_model.create_infer_request()
results = infer_request.infer({input_tensor_name: input_tensor})
```

Figure 3: OpenVINO toolkit overview.
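The IR file loaded above comes from the one-time offline conversion mentioned in the performance workflow. As a minimal sketch of that step, assuming a PyTorch model and an illustrative input shape (the paths and shape are placeholders, not from the original post):

```python
import torch
import openvino as ov

model = Model()  # your trained PyTorch model, as in the IPEX example
model.eval()

# Convert the PyTorch model to OpenVINO's in-memory representation.
example_input = torch.randn(1, 3, 224, 224)  # illustrative input shape
ov_model = ov.convert_model(model, example_input=example_input)

# Persist the IR: writes model.xml (topology) and model.bin (weights).
ov.save_model(ov_model, "model.xml")
```

After this step, the application only needs the lightweight OpenVINO runtime and the IR files, which is what keeps the deployed inference path efficient.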
IPEX and OpenVINO are supported on all Intel architectures. However, for optimal performance, Intel recommends using instances powered by 4th Gen Intel® Xeon® Scalable processors or newer, which feature AMX and other hardware acceleration capabilities, such as Azure's v6-series (e.g., Standard_E48s_v6) [7].

Results

We conducted a detailed performance benchmark by running CXRReportGen, a state-of-the-art foundation model designed to generate a list of radiological findings from chest X-rays, on Standard_E48s_v6 hardware (48 vCPUs, 384 GiB RAM) with and without IPEX and OpenVINO optimization. We observed up to a 70% improvement in CXRReportGen run time when applying these optimizations, compared to the non-optimized baseline on the same CPU hardware. This significant improvement highlights the potential of leveraging Intel's performance optimizations to make critical healthcare AI models more cost-efficient and accessible. Such advancements enable healthcare providers to deploy advanced diagnostic tools even in resource-constrained environments, ultimately improving patient care and operational efficiency.

| SKU | Run Type (100 Runs) | Mean Run Time (seconds) | Standard Deviation of Run Time (seconds) |
| --- | --- | --- | --- |
| Standard_E48s_v6 (48 vCPUs, 384 GiB RAM) | No Optimization | 22.47 | 0.1061 |
| Standard_E48s_v6 (48 vCPUs, 384 GiB RAM) | IPEX | 8.21 | 0.2375 |
| Standard_E48s_v6 (48 vCPUs, 384 GiB RAM) | OpenVINO | 7.01 | 0.0569 |

Table 1: Performance comparison of the CXRReportGen model across 100 CPU runs.

Future Prospects and Innovations

Our benchmarks with both IPEX and OpenVINO show great potential for decreasing the run time of our foundation models and increasing scalability via CPU, positioning Intel CPUs as a viable deployment target. This not only increases deployment options but also offers opportunities to reduce cloud costs with CPU-based instances and even to consider deploying these workflows on existing compute headroom at the edge. For custom deployments, the setup described in this blog post is now available on the provided compute instances in Azure and with optimization software from Intel, so developers can optimize inference workloads while taking advantage of the large memory pools available via CPU, including for large batch workloads. Our advancements with Intel in model runtime optimization are being considered for availability in the Azure AI model catalogs. Please stay tuned for further updates.

As we continue to innovate and optimize, the potential for AI to transform healthcare and improve patient outcomes becomes increasingly attainable. We are now more equipped than ever to make it easier for our partners and customers to create connected experiences at every point of care, empower their healthcare workforce, and unlock the value from their data using data standards that are important to the healthcare industry.

References

[1] Intel OpenVINO Optimizes Deep Learning Performance for Healthcare Imaging
[2] Accelerating Healthcare Diagnostics with Intel oneAPI and AI Tools
[3] Intel Advanced Matrix Extensions
[4] Intel Extension for PyTorch
[5] Accelerate with Intel Extension to PyTorch
[6] Intel Accelerates PadChest and fMRI Models on Azure ML
[7] Azure's first 5th Gen Intel® Xeon® processor instances are now available and we're excited!
[8] CXRReportGen Model Card in Azure AI Foundry

The healthcare AI models in Azure AI Foundry are intended for research and model development exploration. The models are not designed or intended to be deployed in clinical settings as-is nor for use in the diagnosis or treatment of any health or medical condition, and the individual models' performances for such purposes have not been established. You bear sole responsibility and liability for any use of the healthcare AI models, including verification of outputs and incorporation into any product or service intended for a medical purpose or to inform clinical decision-making, compliance with applicable healthcare laws and regulations, and obtaining any necessary clearances or approvals.
The "Mastering Agent Governance in Microsoft 365" series is based on the Administering and Governing Agents whitepaper published by Microsoft and designed to educate IT leaders, compliance officers, and decision-makers about the importance of governance for AI agents in Microsoft 365, particularly in highly regulated industries like Healthcare and Life Sciences (HLS). The six-episode series cover the growing role of agents, the risks of unmanaged agents, and the strategic importance of governance frameworks. Empowering innovation while protecting patient data and ensuring compliance In the age of AI-powered productivity, agents—automated digital assistants built with tools like Microsoft 365 Copilot, SharePoint, and Copilot Studio—are transforming how work gets done. From streamlining clinical documentation to automating regulatory reporting, agents are becoming indispensable in Healthcare and Life Sciences (HLS). But with great power comes great responsibility. Why Governance Can’t Be an Afterthought In highly regulated industries like HLS, where data sensitivity and compliance are paramount, the rise of autonomous agents introduces new risks: Unauthorized data access could expose protected health information (PHI). Unmonitored agent behavior could lead to regulatory violations. Lack of lifecycle controls could result in outdated or insecure agents operating in production environments. Agent governance isn’t just an IT concern—it’s a business imperative. It ensures that innovation doesn’t outpace compliance, and that every agent deployed aligns with organizational policies, security standards, and regulatory frameworks like HIPAA, GDPR, and FDA 21 CFR Part 11. Understanding the Agent Landscape Microsoft 365 supports a spectrum of agent creators: End Users using SharePoint or Copilot templates to automate simple tasks. Makers building more complex agents in Copilot Studio. Developers crafting sophisticated, enterprise-grade agents with Azure AI and Teams Toolkit. Each persona requires a different level of oversight. For example, a clinical researcher using SharePoint to build a data retrieval agent may need minimal governance, while a developer building a patient-facing chatbot must adhere to strict data protection and validation protocols. Governance in Action Microsoft provides a layered governance model: Tool Controls: Define what agent creators can do within tools like Copilot Studio and SharePoint. Content Controls: Ensure agents only access data they’re authorized to use, leveraging Microsoft Purview for sensitivity labeling and DLP. Agent Management: Monitor usage, enforce lifecycle policies, and block non-compliant agents via the Microsoft 365 Admin Center. This framework allows organizations to empower innovation while maintaining control—critical in environments where patient safety and regulatory compliance are non-negotiable. The Business Case for Governance For HLS organizations, agent governance delivers tangible benefits: Reduced compliance risk through proactive policy enforcement. Improved operational efficiency by enabling safe automation. Greater trust from patients, regulators, and internal stakeholders. In short, governance is the foundation that allows agents to scale safely and sustainably.2.4KViews2likes3Comments