Empowering multi-modal analytics with the medical imaging capability in Microsoft Fabric
This blog is part of a series that explores the recent announcement of the public preview of healthcare data solutions in Microsoft Fabric. The DICOM® (Digital Imaging and Communications in Medicine) data ingestion capability within the healthcare data solutions in Microsoft Fabric enables the storage, management, and analysis of imaging metadata from various modalities, including X-rays, CT scans, and MRIs, directly within Microsoft Fabric. It fosters collaboration, R&D and AI innovation for healthcare and life science use cases. Our customers and partners can now integrate DICOM® imaging datasets with clinical data stored in FHIR® (Fast Healthcare Interoperability Resources) format. By making imaging pixels and metadata accessible alongside clinical history and laboratory data, it enables clinicians and researchers to interpret imaging findings in the appropriate clinical context. This leads to enhanced diagnostic accuracy, informed clinical decision-making, and ultimately, improved patient outcomes.

Providing a unified healthcare analytics solution for the era of AI
This blog is part of a series that explores the recent announcement of the public preview of healthcare data solutions in Microsoft Fabric. Healthcare data solutions in Microsoft Fabric is a comprehensive, end-to-end analytics SaaS platform that allows you to ingest, store, and analyze healthcare data from a variety of sources, including electronic health records and picture archiving and communication systems. With this platform, you can unlock new insights and drive value from your healthcare data.
General Availability - Medical imaging DICOM® in healthcare data solutions in Microsoft Fabric
As part of the healthcare data solutions in Microsoft Fabric, the DICOM® (Digital Imaging and Communications in Medicine) data transformation is now generally available. Our Healthcare and Life Sciences customers and partners can now ingest, store, transform and analyze DICOM® imaging datasets from various modalities, such as X-rays, CT scans, and MRIs, directly within Microsoft Fabric. This was made possible by providing a purpose-built data pipeline built on top of the medallion Lakehouse architecture. The imaging data transformation capabilities enable seamless transformation of DICOM® (imaging) data into tabular formats that can persist in the lake in FHIR® (Fast Healthcare Interoperability Resources) (Silver) and OMOP (Observational Medical Outcomes Partnership) (Gold) formats, thus facilitating exploratory analysis and large-scale imaging analytics and radiomics.

Establishing a true multi-modal biomedical Lakehouse in Microsoft Fabric

Along with other capabilities in the healthcare data solutions in Microsoft Fabric, this DICOM® data transformation will empower clinicians and researchers to interpret imaging findings in the appropriate clinical context by making imaging pixels and metadata available alongside the clinical history and laboratory data. By integrating DICOM® pixels and metadata with clinical history and laboratory data, our customers and partners can achieve more with their multi-modal biomedical data estate, including:

- Unify your medical imaging and clinical data estate for analytics: Establish a regulated hub to centralize and organize all your multi-modal healthcare data, creating a foundation for predictive and clinical analytics. Built natively on well-established industry data models, including DICOM®, FHIR® and OMOP.
- Build fit-for-purpose analytics models: Start constructing ML and AI models on a connected foundation of EHR and pixel data. Enable researchers, data scientists and health informaticians to perform analysis on large volumes of multi-modal datasets to achieve higher accuracy in diagnosis and prognosis, and improved patient outcomes [1].
- Advance research, collaboration and sharing of de-identified imaging: Build longitudinal views of patients' clinical history and related imaging studies with the ability to apply complex queries to identify patient cohorts for research and collaboration. Apply text and imaging de-identification to enable in-place sharing of research datasets with role-based access control.
- Reduce the cost of archival storage and recovery: Take advantage of cost-effective, HIPAA-compliant and reliable cloud-based storage to back up your medical imaging data from the redundant storage of on-prem PACS and VNA systems. Improve your security posture with a 100% off-site cloud archive of your imaging datasets in case of unplanned data loss.
- Employ AI models to recognize pixel-level markers and patterns: Deploy existing precision AI models such as Microsoft's Project InnerEye and NVIDIA's MONAI to enable automated segmentation of 3D radiology imaging that can help expedite the planning of radiotherapy treatments and reduce waiting times for oncology patients.

Conceptual architecture

The DICOM® data transformation capabilities in Microsoft Fabric continue to offer our customers and partners the flexibility to choose the ingestion pattern that best meets their existing data volume and storage needs. At a high level, there are three patterns for ingesting DICOM® data into the healthcare data solutions in Microsoft Fabric. Depending on the chosen ingestion pattern, there are up to eight end-to-end execution steps to consider, from the ingestion of the raw DICOM® files to the transformation of the Gold Lakehouse into the OMOP CDM format, as depicted in the conceptual architecture diagram below.
To review the eight end-to-end execution steps, please refer to the Public Preview of the DICOM® data ingestion in Microsoft Fabric.

(Figure: Conceptual architecture and ingestion patterns of the DICOM® data ingestion capability in Microsoft Fabric)

You can find more details about each of those three ingestion patterns in our public documentation: Use DICOM® data ingestion - Microsoft Cloud for Healthcare | Microsoft Learn

Enhancements in the DICOM® data transformation in Microsoft Fabric

We received great feedback from our public preview customers and partners. This feedback provided an objective signal for our product group to iterate on features and the product roadmap, and to make the DICOM® data transformation capabilities more robust and intuitive. As a result, several new features and improvements in the DICOM® data transformation are now generally available, as described in the following sections.

All DICOM® Metadata (Tags) are now accessible in the Silver Lakehouse

We acknowledge the importance and practicality of making all DICOM® metadata, i.e. tags, available in the Silver Lakehouse, closer to the clinical and ImagingStudy FHIR® resources. This makes it easier to explore any existing DICOM® tags from within the Silver Lakehouse. It also helps position the DICOM® staging table in the Bronze Lakehouse (ImagingDICOM) as a transient store: after the DICOM® metadata is processed and transformed from the bronze Lakehouse to the Silver Lakehouse, the data in the bronze staging table can be considered ready to be purged. This ensures cost and storage efficiency and reduces data redundancy between source files and staging tables in the bronze Lakehouse.

Unified Folder Structure

OneLake in Microsoft Fabric offers a logical data lake for your organization. Healthcare data solutions in Microsoft Fabric provide a unified folder structure that helps organize data across various modalities and formats.
This structure streamlines data ingestion and processing while maintaining data lineage at the source-file and source-system levels in the bronze Lakehouse. A complete set of unified folders, including the Imaging modality and DICOM® format, is now deployed as part of the healthcare data foundations deployment experience in the healthcare data solutions in Microsoft Fabric.

Purpose-built DICOM® data transformation pipeline

Healthcare data foundations offer ready-to-run data pipelines designed to efficiently structure data for analytics and AI/machine learning modeling. We introduced an imaging data pipeline to streamline the end-to-end execution of all activities in the DICOM® data transformation capabilities. The DICOM® data transformation in the imaging data pipeline consists of the following stages:

1. The pipeline ingests and persists the raw DICOM® imaging files, present in the native DCM format, in the bronze Lakehouse.
2. It then extracts the DICOM® metadata (tags) from the imaging files and inserts them into the ImagingDICOM table in the bronze Lakehouse.
3. The data in the ImagingDICOM table is then converted to FHIR® ImagingStudy NDJSON files, stored in OneLake.
4. The data in the ImagingStudy NDJSON files is transformed to the relational FHIR® format and ingested into the ImagingStudy delta table in the Silver Lakehouse.

Compression-by-design

Healthcare data solutions in Microsoft Fabric support compression-by-design across the medallion Lakehouse design. Data ingested into the delta tables across the medallion Lakehouse is stored in a compressed, columnar format using parquet files. In the ingest pattern, when files move from the Ingest folder to the Process folder, they are compressed by default after successful processing. You can configure or disable the compression as needed. The imaging data transformation pipeline can process DICOM® files in raw format (.dcm files) and/or in compressed format (ZIP archives of .dcm files or folders).

Global configuration

The admin Lakehouse was introduced in this release to manage cross-Lakehouse configuration, global configuration, status reporting, and tracking for healthcare data solutions in Microsoft Fabric. The admin Lakehouse system-configurations folder centralizes the global configuration parameters. The three configuration files contain preconfigured values for the default deployment of all healthcare data solutions capabilities. You can use the global configuration to repoint the data ingestion pipeline to any source folder other than the unified folder configured by default. You can also configure any of the input parameters for each activity in the imaging data transformation pipeline.

Sample Data

In this release, a more comprehensive sample dataset is provided to help you run the DICOM® data transformation pipelines end-to-end and explore the data processing in each step through the medallion Lakehouse: Bronze, Silver and Gold. The imaging sample data may not be clinically meaningful, but it is technically complete and comprehensive enough to demonstrate the full DICOM® data transformation capabilities [2]. In total, the sample data for DICOM® data transformation contains 340 DICOM® studies, 389 series and 7,739 instances. One of those studies is an invalid DICOM® file, intentionally provided to showcase how the pipeline manages files that do not conform to the DICOM® format. The sample DICOM® studies are related to 302 patients, and those patients are also included in the sample data for the clinical ingestion pipeline. Thus, when you ingest the sample data for both the DICOM® data transformation and the clinical data ingestion, you will have a complete view of how clinical and imaging data would appear together in a real-world scenario.
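The last stage of the imaging pipeline described above converts FHIR® ImagingStudy NDJSON files into rows for a relational delta table. The actual transformation runs inside the Fabric pipeline; the following is only a minimal Python sketch of the idea, with hypothetical, simplified column names rather than the real silver schema:

```python
import json

def flatten_imaging_studies(ndjson_text):
    """Turn FHIR ImagingStudy NDJSON (one JSON object per line) into
    flat dict rows, the shape a relational delta table expects.
    The column names here are illustrative only."""
    rows = []
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        study = json.loads(line)
        rows.append({
            "id": study.get("id"),
            "subject_reference": study.get("subject", {}).get("reference"),
            "numberOfSeries": study.get("numberOfSeries"),
            "numberOfInstances": study.get("numberOfInstances"),
        })
    return rows

# One synthetic NDJSON line (not taken from the actual sample data)
ndjson = ('{"resourceType": "ImagingStudy", "id": "study-1", '
          '"subject": {"reference": "Patient/p-1"}, '
          '"numberOfSeries": 2, "numberOfInstances": 50}')
rows = flatten_imaging_studies(ndjson)
```

In the real pipeline this flattening happens at scale in Spark, and the resulting rows land in the ImagingStudy delta table in the Silver Lakehouse.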
Enhanced data lineage and traceability

All delta tables in the Healthcare Data Model in the Silver Lakehouse now have the following columns to ensure lineage and traceability at the record and file level:

- msftCreatedDatetime: the datetime at which the record was first created in the respective delta table in the Silver Lakehouse.
- msftModifiedDatetime: the datetime at which the record was last modified in the respective delta table in the Silver Lakehouse.
- msftFilePath: the full path to the source file in the Bronze Lakehouse (including shortcut folders).
- msftSourceSystem: the source system of this record. It corresponds to the [Namespace] that was specified in the unified folder structure.

To extend lineage and traceability across the entire medallion Lakehouse, the following columns are added to the OMOP delta tables in the Gold Lakehouse:

- msftSourceRecordId: the original record identifier from the respective source delta table in the Silver Lakehouse. This is important because OMOP records have newly generated IDs. More details are provided here.
- msftSourceTableName: the name of the source delta table in the Silver Lakehouse. Due to the specifics of FHIR-to-OMOP mappings, there are cases where several OMOP tables in the Gold Lakehouse are sourced from a single FHIR® table in the Silver Lakehouse; for example, the OBSERVATION and MEASUREMENT OMOP delta tables in the Gold Lakehouse are both sourced from the Observation FHIR® delta table in the Silver Lakehouse. There are also cases where a single delta table in the Gold Lakehouse may be sourced from multiple delta tables in the Silver Lakehouse, such as the LOCATION OMOP table, which can be sourced from either the Patient or the Organization FHIR® table.
- msftModifiedDatetime: the datetime at which the record was last modified in the respective delta table in the Gold Lakehouse.
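Conceptually, the lineage columns described above amount to stamping every record with provenance metadata at write time: the created timestamp survives updates, while the modified timestamp is refreshed on each write. The real pipeline does this inside Spark; the helper below is only a hypothetical, stdlib-only sketch of the behavior:

```python
from datetime import datetime, timezone

def add_lineage_columns(record, file_path, source_system, existing=None):
    """Stamp a record with lineage columns like those described above.
    msftCreatedDatetime is preserved when an existing record is updated;
    msftModifiedDatetime is refreshed on every write. Illustrative only."""
    now = datetime.now(timezone.utc).isoformat()
    created = existing["msftCreatedDatetime"] if existing else now
    record["msftCreatedDatetime"] = created
    record["msftModifiedDatetime"] = now
    record["msftFilePath"] = file_path          # source file in the bronze Lakehouse
    record["msftSourceSystem"] = source_system  # the [Namespace] from the unified folders
    return record

# Hypothetical file path and namespace, not the documented folder layout
rec = add_lineage_columns({"id": "obs-1"}, "Files/Ingest/EHR-A/obs.ndjson", "EHR-A")
```

On first write the created and modified timestamps are identical; a later write would pass the existing record so that only msftModifiedDatetime changes.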
In summary, this article provides comprehensive details on how the DICOM® data transformation capabilities in the healthcare data solutions in Microsoft Fabric offer a robust, all-encompassing solution for unifying and analyzing medical imaging data in a harmonized pattern with clinical datasets. We also listed major enhancements to these capabilities that are now generally available for all our healthcare and life sciences customers and partners. For more details, please refer to our public documentation: Overview of DICOM® data ingestion - Microsoft Cloud for Healthcare | Microsoft Learn

[1] S. Kevin Zhou, Hayit Greenspan, Christos Davatzikos, James S. Duncan, Bram van Ginneken, Anant Madabhushi, Jerry L. Prince, Daniel Rueckert, Ronald M. Summers. A review of deep learning in medical imaging: Imaging traits, technology trends, case studies with progress highlights, and future promises. arXiv:2008.09104

[2] Microsoft provides the Sample Data in the healthcare data solutions in Microsoft Fabric on an "as is" basis. This data is provided to test and demonstrate the end-to-end execution of the data pipelines provided within the healthcare data solutions in Microsoft Fabric. This data is not intended or designed to train real-world or production-level AI/ML models, or to develop any clinical decision support systems. Microsoft makes no warranties, express or implied, guarantees or conditions with respect to your use of the datasets. To the extent permitted under your local law, Microsoft disclaims all liability for any damages or losses, including direct, consequential, special, indirect, incidental, or punitive, resulting from your use of this data.
The Sample Data in the healthcare data solutions in Microsoft Fabric is provided under the Community Data License Agreement – Permissive – Version 2.0.

DICOM® is the registered trademark of the National Electrical Manufacturers Association (NEMA) for its Standards publications relating to digital communications of medical information. FHIR® is a registered trademark of Health Level Seven International, registered in the U.S. Trademark Office, and is used with their permission.

FHIRlink Connector Support for EPIC® on FHIR®
The Health and Life Sciences Data Platform team recently released an update to the FHIRlink connector introducing support for EPIC® on FHIR® connectivity. This is our initial release of connectivity for EPIC® on FHIR® application registrations configured with an application audience of Patient or Clinicians/Administrative Users.

FHIRlink Power Platform connector Public Preview Release
Microsoft FHIRlink creates a direct connection between healthcare apps built on Microsoft Azure services and FHIR® servers, bypassing the need to duplicate data from Microsoft Dataverse. FHIRlink reduces the complexity and cost of building low-code/no-code applications on Power Platform and Azure, because developers can build their apps directly against the FHIR services rather than having to duplicate data between systems. Connect Power Automate flows, Power Platform canvas apps, and Azure Logic Apps to various FHIR services and perform create, retrieve, update and delete operations directly on FHIR resources.

Microsoft Fabric healthcare data model querying and identifier harmonization
The healthcare data model in Healthcare data solutions (HDS) in Microsoft Fabric is the silver layer of the medallion architecture and is based on the FHIR® R4 standard. Native FHIR® can be challenging to query using SQL because its reference properties (foreign keys) often follow varying formats, complicating query writing. One of the benefits of the silver healthcare data model is harmonizing these ids to create a simpler and more consistent query experience. Today we will walk through writing Spark SQL and T-SQL queries against the silver healthcare data model. Supporting both Spark SQL and T-SQL gives users the flexibility to use the compute engine they are most comfortable with and that is most suitable for the use case. The examples below leverage the synthetic sample dataset that is included with Healthcare data solutions. The Spark SQL queries can be written and run from a Fabric notebook, while the T-SQL queries can be run from the SQL Analytics endpoint of the silver lakehouse or a T-SQL Fabric notebook.

Simple query

Let's look at a simple query: finding the first instance of a Patient named "Andy".

Example Spark SQL query:

SELECT * FROM Patient WHERE name[0].given[0] = 'Andy' LIMIT 1

Example T-SQL query:

SELECT TOP(1) * FROM Patient WHERE JSON_VALUE(name_string, '$[0].given[0]') = 'Andy'

Beyond syntax differences between SQL dialects, a key distinction is that T-SQL uses JSON functions to interpret complex fields, while Spark SQL can directly interact with complex types. (Note: complex types are columns of type struct, list, or map, vs. primitive types, which are columns of types like string or integer.) Part of the silver transformations includes adding a _string-suffixed column for each complex column to support querying this data from the T-SQL endpoint. Without the _string columns, these complex columns would not be surfaced for T-SQL to query.
You can see above that the T-SQL version uses the column name_string, while the Spark SQL version uses name directly. Note: in the example above we are looking at the first name element, but the queries could be updated to search for the first "official" name, for example, rather than relying on an index.

Keys and references

Part of the value proposition of the healthcare data model is key harmonization. FHIR® resources have ids that are unique and should not change, and can be logically thought of as a primary key for the resource. FHIR® resources can relate to each other through references, which can be logically thought of as foreign keys. FHIR® references can refer to the related resource through a FHIR id or through business identifiers, which include a system for the identifier as well as a value (e.g. a reference by MRN instead of FHIR id). Note: although ids and references can logically be thought of as primary keys and foreign keys, respectively, there is no actual key constraint enforcement in the lakehouse. In healthcare data solutions in Microsoft Fabric, these resource-level FHIR ids are hashed to ensure uniqueness across multiple source systems.
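The exact hashing scheme isn't spelled out in this post, but conceptually, qualifying the original FHIR id with its source system and hashing the pair yields an id that cannot collide across systems. A sketch of the idea, assuming SHA-256 (the output length matches the 64-hex-character ids seen in the silver data, though the actual inputs HDS hashes may differ):

```python
import hashlib

def hash_fhir_id(resource_type, original_id, source_system):
    """Derive a cross-system-unique id from a source-qualified FHIR id.
    The key layout below is illustrative, not the documented HDS scheme."""
    key = f"{source_system}|{resource_type}/{original_id}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

h = hash_fhir_id("Patient", "904d247a-0fc3-773a-b564-7acb6347d02c", "EHR-A")
```

The property that matters is that the same id from the same source always hashes to the same value, while the same id arriving from a different source system hashes to a different one.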
FHIR® references go through a harmonization process, outlined with the example below, to make querying in a SQL syntax simpler.

Example raw Observation reference field from a sample ndjson file:

"subject": {
  "reference": "Patient/904d247a-0fc3-773a-b564-7acb6347d02c"
}

Example of the Observation's harmonized subject reference in silver:

"subject": {
  "type": "Patient",
  "identifier": {
    "value": "904d247a-0fc3-773a-b564-7acb6347d02c",
    "system": "FHIR-HDS",
    "type": {
      "coding": [
        {
          "system": "http://terminology.hl7.org/CodeSystem/v2-0203",
          "code": "fhirId",
          "display": "FHIR Id"
        }
      ],
      "text": "FHIR Id"
    }
  },
  "id": "828dda871b817035c42d7f1ecb2f1d5f10801c817d69063682ff03d1a80cadb5",
  "idOrig": "904d247a-0fc3-773a-b564-7acb6347d02c",
  "msftSourceReference": "Patient/904d247a-0fc3-773a-b564-7acb6347d02c"
}

You'll notice the subject reference contains more data in silver than in the raw representation. You can see a full description of what is added here. This reference harmonization makes querying from a SQL syntax easier, as you don't need to parse FHIR references like "Patient/<id>" or include joins on both FHIR ids and business identifiers in your query. If your source data only uses FHIR ids, the id property can be used directly in joins. If your source data uses a mixture of FHIR ids and business identifiers, you can query by business identifier consistently because, as shown above, HDS adds a FHIR business identifier to the reference even when a FHIR id is used. Note: you can see examples of business identifier-based queries in the Observational Medical Outcomes Partnership (OMOP) dmfAdapter.json file, which queries resources by business identifier. Here are two example queries looking for the top 5 body-weight observations of male patients, joining by FHIR id.
Example Spark SQL query:

SELECT o.id
FROM observation o
INNER JOIN patient p ON o.subject.id = p.id
WHERE p.gender = 'male'
AND ARRAY_CONTAINS(o.code.coding.code, '29463-7')
LIMIT 5

Example T-SQL query:

SELECT TOP 5 o.id
FROM Observation o
INNER JOIN Patient p ON JSON_VALUE(o.subject_string, '$.id') = p.id
WHERE p.gender = 'male'
AND EXISTS (
  SELECT 1
  FROM OPENJSON(o.code_string, '$.coding')
  WITH (code NVARCHAR(MAX) '$.code')
  WHERE code = '29463-7'
)

You'll notice the T-SQL query uses JSON functions to interact with the string fields, while the Spark SQL query can natively handle the complex types, as in the previous query. The joins themselves, though, use the id property directly, as we know that in this case only FHIR ids are being used. By using the id property, we do not need to parse a string representation like "Patient/<id>" to do the join. Overall, we've shown how either Spark SQL or T-SQL can be used to query the same set of silver data, and also how key harmonization helps when writing SQL-based queries. We welcome your questions and feedback in the comments section at the end of this post!

Helpful links

For more details, or to start building your own queries, explore these helpful resources:

- Healthcare data solutions in Microsoft Fabric
- FHIR References
- T-SQL JSON functions
- T-SQL surface area in Fabric

FHIR® is the registered trademark of HL7 and is used with permission of HL7.

FHIRlink connector and Epic on FHIR for Power Platform development
The FHIRlink connector for Power Platform enables direct access to FHIR-based endpoints. In this post, we look at a working sample application that connects Dataverse with EPIC® on FHIR® and Azure OpenAI to provide patient details and AI support for clinical users.
Virtual Health Data Tables Create Update and Delete Support
Health organizations are considering low-code development to improve productivity, gain faster time-to-market, experiment more easily, and overall, be more agile when responding to market changes. A key blocker has been the inability to pull health data from multiple sources and manage it in a secure and compliant way. Microsoft Cloud for Healthcare includes configurable solutions to exchange data between Dataverse and external systems using the FHIR standard. Microsoft's Virtual Health Data Tables provides the ability to connect directly to Azure Health Data Services FHIR service from within Dataverse. As part of the latest release, Virtual Health Data Tables has been updated to include support for the create, update, and delete FHIR operations.
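Under the FHIR® REST specification, the create, update, and delete operations mentioned above map onto standard HTTP verbs (POST, PUT, and DELETE). The sketch below only builds the requests, without sending them, against a placeholder base URL, to show the shape of the operations now supported:

```python
import json
import urllib.request

BASE = "https://example.org/fhir"  # placeholder FHIR service base URL

def fhir_request(method, path, resource=None):
    """Build (but do not send) a FHIR REST request.
    Per the FHIR spec: POST /Patient creates, PUT /Patient/{id} updates,
    DELETE /Patient/{id} deletes."""
    data = json.dumps(resource).encode("utf-8") if resource is not None else None
    return urllib.request.Request(
        f"{BASE}/{path}",
        data=data,
        method=method,
        headers={"Content-Type": "application/fhir+json"},
    )

create = fhir_request("POST", "Patient", {"resourceType": "Patient"})
update = fhir_request("PUT", "Patient/123", {"resourceType": "Patient", "id": "123"})
delete = fhir_request("DELETE", "Patient/123")
```

Virtual Health Data Tables surface these operations through Dataverse rather than raw HTTP, but the underlying FHIR service interactions follow this pattern.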
Can you use AI to implement an Enterprise Master Patient Index (EMPI)?
The Short Answer: Yes. And It's Better Than You Think.

If you've worked in healthcare IT for any length of time, you've dealt with this problem. Patient A shows up at Hospital 1 as "Jonathan Smith, DOB 03/15/1985." Patient B shows up at Hospital 2 as "Jon Smith, DOB 03/15/1985." Patient C shows up at a clinic as "John Smythe, DOB 03/15/1985." Same person? Probably. But how do you prove it at scale — across millions of records, dozens of source systems, and data quality that ranges from pristine to "someone fat-fingered a birth year"?

That's the problem an Enterprise Master Patient Index (EMPI) solves. And traditionally, it's been solved with expensive commercial products, rigid rule engines, and a lot of manual review.

We built one with AI. On Azure. With open-source tooling. And the results are genuinely impressive. This post walks through how it works, what the architecture looks like, and why the combination of deterministic matching, probabilistic algorithms, and AI-enhanced scoring produces better results than any single approach alone.

1. Why EMPI Still Matters (More Than Ever)

Healthcare organizations don't have a "patient data problem." They have a patient identity problem. Every EHR, lab system, pharmacy platform, and claims processor creates its own patient record. When those systems exchange data via FHIR, HL7, or flat files, there's no universal patient identifier in the U.S. — Congress has blocked funding for one since 1998. The result:

- Duplicate records inflate costs and fragment care history
- Missed matches mean clinicians don't see a patient's full medical picture
- False positives can merge two different patients into one record — a patient safety risk

Traditional EMPI solutions use deterministic matching (exact field comparisons) and sometimes probabilistic scoring (fuzzy string matching). They work. But they leave a significant gray zone of records that require human review — and that queue grows faster than teams can process it.
What if AI could shrink that gray zone?

2. The Architecture: Three Layers of Matching

Here's the core insight: no single matching technique is sufficient. Exact matches miss typos. Fuzzy matches produce false positives. AI alone hallucinates. But layer them together with calibrated weights, and you get something remarkably accurate. Let's break each layer down.

3. Layer 1: Deterministic Matching — The Foundation

Deterministic matching is the bedrock. If two records share an Enterprise ID, they're the same person. Full stop. The system assigns trust levels to each identifier type:

| Identifier | Weight | Why |
| --- | --- | --- |
| Enterprise ID | 1.0 | Explicitly assigned by an authority |
| SSN | 0.9 | Highly reliable when present and accurate |
| MRN | 0.8 | System-dependent — only valid within the same healthcare system |
| Date of Birth | 0.35 | Common but not unique — 0.3% of the population shares any given birthday |
| Phone | 0.3 | Useful signal but changes frequently |
| Email | 0.3 | Same — supportive evidence, not proof |

The key implementation detail here is MRN system validation. An MRN of "12345" at Hospital A is completely unrelated to MRN "12345" at Hospital B. The system checks the identifier's source system URI before considering it a match. Without this, you'd get a flood of false positives from coincidental MRN collisions. If an Enterprise ID match is found, the system short-circuits — no need for probabilistic or AI scoring. It's a guaranteed match.

4. Layer 2: Probabilistic Matching — Where It Gets Interesting

This is where the system earns its keep. Probabilistic matching handles the messy reality of healthcare data: typos, nicknames, transposed digits, abbreviations, and inconsistent formatting.

Name Similarity

The system uses a multi-algorithm ensemble for name matching:

- Jaro-Winkler (60% weight): Optimized for short strings like names. Gives extra credit when strings share a common prefix — so "Jonathan" vs "Jon" scores higher than you'd expect.
- Soundex / Metaphone (phonetic boost): Catches "Smith" vs "Smythe," "Jon" vs "John," and other sound-alike variations that string distance alone would miss.
- Levenshtein distance (typo detection): Handles single-character errors — "Johanson" vs "Johansn."

These scores are blended, and first name and last name are scored independently before combining. This prevents a matching last name from compensating for a wildly different first name.

Date of Birth — Smarter Than You'd Think

DOB matching goes beyond exact comparison. The system detects month/day transposition — one of the most common data entry errors in healthcare:

| Scenario | Score |
| --- | --- |
| Exact match | 1.0 |
| Month and day swapped (e.g., 03/15 vs 15/03) | 0.8 |
| Off by 1 day | 0.9 |
| Off by 2–30 days | 0.5–0.8 (scaled) |
| Different year | 0.0 |

This alone catches a category of mismatches that pure deterministic systems miss entirely.

Address Similarity

Address matching uses a hybrid approach:

- Jaro-Winkler on the normalized full address (70% weight)
- Token-based Jaccard similarity (30% weight) to handle word reordering
- Bonus scoring for matching postal codes, city, and state
- Abbreviation expansion — "St" becomes "Street," "Ave" becomes "Avenue"

5. Layer 3: AI-Enhanced Matching — The Game Changer

This is where the architecture diverges from traditional EMPI solutions.

OpenAI Embeddings (Semantic Similarity)

The system generates a text embedding for each patient's complete demographic profile using OpenAI's text-embedding-3-small model. Then it computes cosine similarity between patient pairs. Why does this work? Because embeddings capture semantic relationships that string matching can't. "123 Main Street, Apt 4B, Springfield, IL" and "123 Main St #4B, Springfield, Illinois" are semantically identical even though they differ character-by-character. The embedding score carries only 10% of the total weight — it's a signal, not a verdict. But in ambiguous cases, it's the signal that tips the scale.
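The Layer 2 rules above can be sketched in a few lines of Python. This is a toy stand-in, not the production scorer: the real service uses Jaro-Winkler, Soundex/Metaphone, and Levenshtein, whereas here difflib's SequenceMatcher substitutes for the name ensemble so the sketch stays stdlib-only; the DOB logic, however, follows the scoring table above directly:

```python
from datetime import date
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Stand-in for the Jaro-Winkler/phonetic/Levenshtein ensemble.
    difflib's ratio is used here purely to keep the sketch dependency-free."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def dob_score(a, b):
    """Scores from the DOB table above: exact 1.0, month/day swap 0.8,
    off by 1 day 0.9, off by 2-30 days scaled 0.8 down to 0.5,
    different year 0.0."""
    if a == b:
        return 1.0
    if a.year != b.year:
        return 0.0
    # month/day transposition, e.g. 03/05 recorded as 05/03
    if a.month == b.day and a.day == b.month:
        return 0.8
    delta = abs((a - b).days)
    if delta == 1:
        return 0.9
    if 2 <= delta <= 30:
        return 0.8 - (delta - 2) * (0.3 / 28)  # linear 0.8 -> 0.5
    return 0.0
```

For production use you would swap SequenceMatcher for real Jaro-Winkler and phonetic implementations (the jellyfish library provides both), and score first and last names independently, as described above.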
GPT-5.2 LLM Analysis (Intelligent Reasoning)

For matches that land in the human review zone (0.65–0.85), the system optionally invokes GPT-5.2 to analyze the patient pair and provide structured reasoning:

{
  "match_score": 0.92,
  "confidence": "high",
  "reasoning": "Multiple strong signals: identical last name, DOB matches exactly, same city. First name 'Jon' is a common nickname for 'Jonathan'.",
  "name_analysis": "First name variation is a known nickname pattern.",
  "potential_issues": [],
  "recommendation": "merge"
}

The LLM doesn't just produce a number — it explains why it thinks two records match. This is enormously valuable for the human reviewers who make final decisions on ambiguous cases. Instead of staring at two records and guessing, they get AI-generated reasoning they can evaluate. When LLM analysis is enabled, the final score blends traditional and LLM scores:

Final Score = (Traditional Score × 0.8) + (LLM Score × 0.2)

The LLM temperature is set to 0.1 for consistency — you want deterministic outputs from your matching engine, not creative ones.

6. The Graph Database: Modeling Patient Relationships

Records and scores are only half the story. The real power comes from how the system stores and traverses relationships. We use Azure Cosmos DB with the Gremlin API — a graph database that models patients, identifiers, addresses, and clinical data as vertices connected by typed edges.

(:Patient)──[:HAS_IDENTIFIER]──▶(:Identifier)
    │
    ├──[:HAS_ADDRESS]──▶(:Address)
    │
    ├──[:HAS_CONTACT]──▶(:ContactPoint)
    │
    ├──[:LINKED_TO]──▶(:EmpiRecord)   ← Golden Record
    │
    ├──[:POTENTIAL_MATCH {score, confidence}]──▶(:Patient)
    │
    ├──[:HAS_ENCOUNTER]──▶(:Encounter)
    │
    └──[:HAS_OBSERVATION]──▶(:Observation)

Why a Graph?

Three reasons:

Candidate retrieval is a graph traversal problem. "Find all patients who share an identifier with Patient X" is a natural graph query — traverse from the patient to their identifiers, then back to other patients who share those same identifiers.
In Gremlin, this is a few lines. In SQL, it's a multi-table join with performance that degrades as data grows.

Relationships are first-class citizens. A POTENTIAL_MATCH edge stores the match score, confidence level, and detailed breakdown directly on the relationship. You can query "show me all high-confidence matches" without any joins.

EMPI records are naturally hierarchical. A golden record (EmpiRecord) links to multiple source patients via LINKED_TO edges. When you merge two patients, you're adding an edge — not rewriting rows in a relational table.

Performance at Scale

Cosmos DB's partition strategy uses source_system as the partition key, providing logical isolation between healthcare systems. The system handles Azure's 429 rate-limiting with automatic retry and exponential backoff, and uses batch operations for bulk loads to avoid RU exhaustion.

7. FHIR-Native Data Ingestion

The system ingests HL7 FHIR R4 Bundles — the emerging interoperability standard for healthcare data exchange. Each FHIR Bundle is a JSON file containing a complete patient record: demographics, encounters, observations, conditions, procedures, immunizations, medication requests, and diagnostic reports. The FHIR loader:

- Maps FHIR identifier systems to internal types (SSN, MRN, Enterprise ID)
- Handles all three FHIR date formats (YYYY, YYYY-MM, YYYY-MM-DD)
- Extracts clinical data for comprehensive patient profiles
- Uses an iterator pattern for memory-efficient processing of thousands of patients
- Tracks source system provenance for audit compliance

This means the service can ingest data directly from any FHIR-compliant EHR — Epic, Cerner, MEDITECH, or Synthea-generated test data — without custom integration work.

8. The Conversational Agent: Matching via Natural Language

Here's where it gets fun. The system includes a conversational AI agent built on the Azure AI Foundry Agent Service. It's deployed as a GPT-5.2-powered agent with OpenAPI tools that call the matching service's REST API.
Instead of navigating a complex UI to find matches, a data steward can simply ask:

- "Search patients named Aaron"
- "Compare patient abc-123 with patient xyz-456"
- "What matches are pending review?"
- "Approve the match between patient A and patient B"

The agent is integrated directly into the Streamlit dashboard's Agent Chat tab, so users never leave their workflow. Under the hood, when the agent decides to call a tool (like "search patients"), Azure AI Foundry makes an HTTP request directly to the Container App API — no local function execution required.

### Available Agent Tools

| Tool | What It Does |
|---|---|
| `searchPatients` | Search patients by name, DOB, or identifier |
| `getPatientDetails` | Get detailed patient demographics and history |
| `findPatientMatches` | Find potential duplicates for a patient |
| `compareTwoPatients` | Side-by-side comparison with detailed scoring |
| `getPendingReviews` | List matches awaiting human decision |
| `submitReviewDecision` | Approve or reject a match |
| `getServiceStatistics` | MPI dashboard metrics |

This same tool set is also exposed via a Model Context Protocol (MCP) server, making the matching engine accessible from AI-powered IDEs and coding assistants.

## 9. The Dashboard: Putting It All Together

The Patient Matching Service includes a full-featured Streamlit dashboard for operational management.
| Page | What You See |
|---|---|
| Dashboard | Key metrics, score distribution charts, recent match activity |
| Match Results | Filterable list with score breakdowns — deterministic, probabilistic, AI, and LLM tabs |
| Patients | Browse and search all loaded patients with clinical data |
| Patient Graph | Interactive graph visualization of patient relationships using `streamlit-agraph` |
| Review Queue | Pending matches with approve/reject actions |
| Agent Chat | Conversational AI for natural language queries |
| Settings | Configure match weights, thresholds, and display preferences |

The match detail view provides six tabs that walk reviewers through every scoring component: Summary, Deterministic, Probabilistic, AI/Embeddings, LLM Analysis, and Raw Data. Reviewers don't just see a number — they see exactly why the system scored a match the way it did.

## 10. Azure Architecture

The full solution runs on Azure:

| Service | Role |
|---|---|
| Azure Cosmos DB (Gremlin + NoSQL) | Patient graph storage and match result persistence |
| Azure OpenAI (GPT-5.2 + text-embedding-3-small) | LLM analysis and semantic embeddings |
| Azure Container Apps | Hosts the FastAPI REST API |
| Azure AI Foundry Agent Service | Conversational agent with OpenAPI tools |
| Azure Log Analytics | Centralized logging and monitoring |

The separation between Cosmos DB's Gremlin API (graph traversal) and NoSQL API (match result documents) is intentional. Graph queries excel at relationship traversal — "find all patients connected to this identifier." Document queries excel at filtering and aggregation — "show me all auto-merge matches from the last 24 hours."

## 11. What We Learned

**AI doesn't replace deterministic matching. It augments it.**
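Condensed into code, the layered scoring this lesson refers to looks roughly like the sketch below. The 80/20 blend and the 0.65–0.85 review band come from earlier in the post; the routing labels are illustrative, not the service's exact behavior.

```python
from typing import Optional

def final_score(traditional: float, llm: Optional[float]) -> float:
    """Blend the traditional score (deterministic + probabilistic +
    embeddings) with the optional LLM score at the stated 80/20 weights."""
    if llm is None:  # LLM analysis disabled or not triggered
        return traditional
    return traditional * 0.8 + llm * 0.2

def route(score: float) -> str:
    """Route a candidate pair based on its blended score."""
    if score > 0.85:
        return "auto-merge"
    if score >= 0.65:
        return "human-review"
    return "no-match"

print(route(final_score(0.90, 0.92)))  # auto-merge
```

The key property is that the LLM can nudge a score but never dominate it, which is exactly why it works best as a reviewer's assistant.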
The three-layer approach works because each layer compensates for the others' weaknesses:

- **Deterministic** handles the easy cases quickly and with certainty
- **Probabilistic** catches the typos, nicknames, and formatting differences that exact matching misses
- **AI** provides semantic understanding and human-readable reasoning for the ambiguous middle ground

**The LLM is most valuable as a reviewer's assistant, not a decision-maker.**

We deliberately keep the LLM weight at 20% of the final score. Its real value is the structured reasoning it produces — the "why" behind a match score. Human reviewers process cases faster when they have AI-generated analysis explaining the matching signals.

**Graph databases are naturally suited for patient identity.**

Patient matching is fundamentally a relationship problem. "Who shares identifiers with whom?" "Which patients are linked to this golden record?" "Show me the cluster of records that might all be the same person." These are graph traversal queries. Trying to model this in relational tables works, but you're fighting the data model instead of leveraging it.

**FHIR interoperability reduces integration friction to near zero.**

By accepting FHIR R4 Bundles as the input format, the service can ingest data from any modern EHR without custom connectors. This is a massive practical advantage — the hardest part of any EMPI project is usually getting the data in, not matching it.

## 12. Try It Yourself

The Patient Matching Service is built entirely on Azure services and open-source tooling (https://github.com/dondinulos/patient-matching-service):

- Python with FastAPI, Streamlit, and the Azure AI SDKs
- Azure Cosmos DB (Gremlin API) for graph storage
- Azure OpenAI for embeddings and LLM analysis
- Azure AI Foundry for the conversational agent
- Azure Container Apps for deployment
- Synthea for FHIR test data generation

The matching algorithms (Jaro-Winkler, Soundex, Metaphone, Levenshtein) use pure Python implementations — no proprietary matching engines required.
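As one example of what a pure-Python implementation looks like, here is a minimal Levenshtein edit distance. It is a textbook two-row dynamic-programming version, not the project's actual code, which may differ in details such as normalization.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: minimum insertions, deletions,
    and substitutions needed to turn one string into the other."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                 # deletion
                curr[j - 1] + 1,             # insertion
                prev[j - 1] + (ca != cb),    # substitution (0 if equal)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("jonathan", "jon"))  # 5
print(levenshtein("smith", "smyth"))   # 1
```

In a matching pipeline, raw distances like this are typically normalized to a 0–1 similarity before being weighted into the probabilistic score.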
Whether you're building a new EMPI from scratch or augmenting an existing one with AI capabilities, the three-layer approach gives you the best of all worlds: the certainty of deterministic matching, the flexibility of probabilistic scoring, and the intelligence of AI-enhanced analysis.

## Final Thoughts

Can you use AI to implement an EMPI? Yes. And the answer isn't "replace everything with an LLM." It's "use AI where it adds the most value — semantic understanding, natural language reasoning, and augmenting human reviewers — while keeping deterministic and probabilistic matching as the foundation."

The combination is more accurate than any single approach. The graph database makes relationships queryable. The conversational agent makes the system accessible. And the whole thing runs on Azure with FHIR-native data ingestion.

Patient matching isn't a solved problem. But with AI in the stack, it's a much more manageable one.

**Tags:** Healthcare, Azure, AI, EMPI, FHIR, Patient Matching, Azure Cosmos DB, Azure OpenAI, Graph Database, Interoperability