Microsoft Cloud for Healthcare Engineering
Microsoft Fabric healthcare data model querying and identifier harmonization
The healthcare data model in Healthcare data solutions (HDS) in Microsoft Fabric is the silver layer of the medallion architecture and is based on the FHIR R4 standard. Native FHIR® can be challenging to query with SQL because its reference properties (foreign keys) follow varying formats, which complicates query writing. One of the benefits of the silver healthcare data model is that it harmonizes these IDs to create a simpler and more consistent query experience. In this post we walk through writing Spark SQL and T-SQL queries against the silver healthcare data model. Supporting both Spark SQL and T-SQL gives users the flexibility to use the compute engine they are most comfortable with and that best suits the use case. The examples below use the synthetic sample data set that is included with Healthcare data solutions. The Spark SQL queries can be written and run from a Fabric notebook, while the T-SQL queries can be run from the SQL analytics endpoint of the silver lakehouse or from a T-SQL Fabric notebook.

Simple query

Let's look at a simple query: finding the first instance of a Patient named "Andy".

Example Spark SQL query:

SELECT * FROM Patient WHERE name[0].given[0] = 'Andy' LIMIT 1

Example T-SQL query:

SELECT TOP(1) * FROM Patient WHERE JSON_VALUE(name_string, '$[0].given[0]') = 'Andy'

Beyond syntax differences between the SQL dialects, a key distinction is that T-SQL uses JSON functions to interpret complex fields, while Spark SQL can interact with complex types directly. (Complex types are columns of type struct, list, or map, as opposed to primitive types such as string or integer.) The silver transformations include adding a _string-suffixed column for each complex column to support querying this data from the T-SQL endpoint; without the _string columns, these complex columns would not be surfaced for T-SQL to query at all. You can see above that the T-SQL version uses the column name_string, while the Spark SQL version uses name directly.

Note: the example above looks at the first name element, but the queries could be updated to search for the first "official" name, for example, rather than relying on an index.
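To illustrate that variation, here is a minimal Spark SQL sketch using higher-order functions. It assumes the name.use element is populated in your data, which the sample data set may not guarantee on every record:

-- Sketch: match on the "official" name rather than a positional index.
-- exists() scans the name array for an entry marked use = 'official'
-- whose first given name is 'Andy'.
SELECT *
FROM Patient
WHERE exists(name, n -> n.use = 'official' AND n.given[0] = 'Andy')
LIMIT 1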
Keys and references

Part of the value proposition of the healthcare data model is key harmonization. FHIR resources have IDs that are unique and should not change; they can be thought of logically as a primary key for the resource. FHIR resources relate to each other through references, which can be thought of logically as foreign keys. A FHIR reference can point to the related resource by FHIR ID or by a business identifier, which includes a system for the identifier as well as a value (for example, a reference by MRN instead of FHIR ID). Note: although IDs and references can logically be thought of as primary keys and foreign keys, respectively, there is no actual key constraint enforcement in the lakehouse. In Healthcare data solutions in Microsoft Fabric, these resource-level FHIR IDs are hashed to ensure uniqueness across multiple source systems.

FHIR references go through a harmonization process, outlined with the example below, to make querying in a SQL syntax simpler.

Example raw Observation reference field from a sample ndjson file:

"subject": {
    "reference": "Patient/904d247a-0fc3-773a-b564-7acb6347d02c"
}

Example of the Observation's harmonized subject reference in silver:

"subject": {
    "type": "Patient",
    "identifier": {
        "value": "904d247a-0fc3-773a-b564-7acb6347d02c",
        "system": "FHIR-HDS",
        "type": {
            "coding": [
                {
                    "system": "http://terminology.hl7.org/CodeSystem/v2-0203",
                    "code": "fhirId",
                    "display": "FHIR Id"
                }
            ],
            "text": "FHIR Id"
        }
    },
    "id": "828dda871b817035c42d7f1ecb2f1d5f10801c817d69063682ff03d1a80cadb5",
    "idOrig": "904d247a-0fc3-773a-b564-7acb6347d02c",
    "msftSourceReference": "Patient/904d247a-0fc3-773a-b564-7acb6347d02c"
}

You'll notice the subject reference contains more data in silver than in the raw representation. You can see a full description of what is added here. This reference harmonization makes querying from SQL easier because you don't need to parse FHIR references like "Patient/<id>" or include joins on both FHIR IDs and business identifiers in your query. If your source data uses only FHIR IDs, the id property can be used directly in joins. If your source data uses a mixture of FHIR IDs and business identifiers, you can query by business identifier consistently, because, as the example shows, HDS adds a FHIR business identifier to the reference even when a FHIR ID is used. Note: you can see examples of business identifier-based queries in the Observational Medical Outcomes Partnership (OMOP) dmfAdapter.json file, which queries resources by business identifier.

Here are two example queries looking for the top five body weight observations of male patients, joined by FHIR ID.

Example Spark SQL query:

SELECT o.id
FROM Observation o
INNER JOIN Patient p ON o.subject.id = p.id
WHERE p.gender = 'male'
    AND array_contains(o.code.coding.code, '29463-7')
LIMIT 5

Example T-SQL query:

SELECT TOP 5 o.id
FROM Observation o
INNER JOIN Patient p ON JSON_VALUE(o.subject_string, '$.id') = p.id
WHERE p.gender = 'male'
    AND EXISTS (
        SELECT 1
        FROM OPENJSON(o.code_string, '$.coding')
        WITH (code NVARCHAR(MAX) '$.code')
        WHERE code = '29463-7'
    )

You'll notice the T-SQL query uses JSON functions to interact with the string fields, while the Spark SQL query handles the complex types natively, as in the previous query. The joins themselves use the id property directly because, in this case, we know only FHIR IDs are being used. By using the id property, we do not need to parse a string representation like "Patient/<id>" to do the join.
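When sources mix FHIR IDs and business identifiers, the same join can instead be written against the harmonized identifier block, in the spirit of the dmfAdapter.json queries mentioned above. Here is a minimal Spark SQL sketch; it assumes the target Patient rows carry matching system/value pairs in their identifier array, so treat it as illustrative rather than a tested HDS query:

-- Sketch: explode each patient's business identifiers, then join on
-- system + value from the observation's harmonized subject reference.
SELECT o.id
FROM Observation o
INNER JOIN (
    SELECT p.id AS patient_id, pid.system AS id_system, pid.value AS id_value
    FROM Patient p
    LATERAL VIEW explode(p.identifier) pid_table AS pid
) pat_ids
    ON o.subject.identifier.system = pat_ids.id_system
    AND o.subject.identifier.value = pat_ids.id_value
LIMIT 5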
Overall, we've shown how either Spark SQL or T-SQL can be used to query the same set of silver data, and how key harmonization helps when writing SQL-based queries. We welcome your questions and feedback in the comments section at the end of this post!

Helpful links

For more details, or to start building your own queries, explore these helpful resources:

Healthcare data solutions in Microsoft Fabric
FHIR References
T-SQL JSON functions
T-SQL surface area in Fabric

FHIR® is the registered trademark of HL7 and is used with permission of HL7.

A scalable and efficient approach for ingesting medical imaging data using DICOM data transformation

The transformation of Digital Imaging and Communications in Medicine (DICOM®) data is a crucial capability in Healthcare data solutions (HDS). This feature allows healthcare providers to bring their DICOM® data into Fabric OneLake, enabling the ingestion, storage, and analysis of imaging metadata from modalities such as X-rays, Computerized Tomography (CT) scans, and Magnetic Resonance Imaging (MRI) scans. By leveraging this capability, healthcare organizations can enhance their digital infrastructure, improve patient care, and streamline their data management processes.

Ingestion patterns

The capability provides several ingestion mechanisms for processing medical images, depending on the use case. For small datasets, comprising no more than 10 million medical images at once, customers can either upload their data to Fabric's OneLake or connect an external storage account to Fabric.

Let's look at the rationale behind the 10 million image limit. Both of the ingestion options above set up Spark Structured Streaming over the input DICOM® files. During file listing, one of the steps before actual processing starts, Spark gathers metadata about the input files, such as file paths and associated timestamps. When dealing with millions of files, the file listing process itself is split across multiple executor nodes. Once the metadata is fetched by the executor nodes, it is collected back at the driver node to build the execution plan. The memory allocated for storing file listing results is controlled by the Spark property driver.maxResultSize. The default value for this configuration varies by platform; Spark in Microsoft Fabric defaults it to 4 GB. You can estimate the size of the result collected at the driver for an input dataset from the input folder structure that determines file path lengths, keeping some buffer for overhead. Make sure the file listing result does not exceed the allocated space (4 GB), or the run will fail with out-of-memory (OOM) errors. Based on our experiments, a limit of 10 million input files gives a reliable, successful run with this memory allocation.

driver.maxResultSize is a Spark configuration, and it can be set to a higher value to increase the space allocated for collecting file listing results. In Spark's architecture, driver memory is split into multiple portions for different purposes, so be careful when increasing this property: setting it too high can hamper other functions of the driver. Refer to the table below for more details on tuning the property appropriately in Microsoft Fabric.

Note: the table below illustrates how various node sizes and configurations affect file listing capacity and result sizes. It presents rough estimates based on a test dataset deployed in the default unified folder structure in HDS. These values can serve as an initial reference but will vary with the characteristics of your dataset; do not expect identical numbers for your own data.

Node size | Available memory per node (GB) | Driver node vCores | Executor node vCores | driver.maxResultSize (GB) | File paths size (GB) | Files listed (millions)
Medium    | 56                             | 8                  | 8                    | 4                         | 3.38                 | 10.8
Large     | 112                            | 8                  | 16                   | 8                         | 6.75                 | 21.6
XL        | 224                            | 8                  | 32                   | 16                        | 12.81                | 41
XXL       | 400                            | 16                 | 64                   | 24                        | 15.94                | 51
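If your dataset needs more headroom than the default provides, one possible way to raise the limit is to set the property at session startup. The sketch below shows a %%configure cell in a Fabric Spark notebook, run at the beginning of the notebook before the Spark session starts; the 8g value is illustrative only, and your workspace settings must permit session-level configuration:

%%configure
{
    "conf": {
        "spark.driver.maxResultSize": "8g"
    }
}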
Emergence of Inventory based ingestion

Microsoft Fabric provides a variety of compute nodes that can be utilized for different use cases. The highest configuration is an XX-Large node with 512 GB of memory and 64 vCores. Even with such a node, driver.maxResultSize can only be increased up to a point, which restricts the dataset size that can be ingested in a single run. One way to tackle this problem is to segregate the entire dataset into smaller chunks, so that the file listing result for any single chunk stays within the limits of the allocated memory; this is exactly the purpose of the unified folder structure in HDS, where data is segmented by default into date folders. However, it might not be feasible to make such changes at the data source every time. This is where HDS inventory based ingestion comes into play, enabling scalable ingestion of DICOM® imaging files into Fabric.

Inventory based ingestion separates the file listing step from the core processing logic. Given the files' metadata in the form of inventory files, which are analogous to a file listing result, users don't need to set up Spark streaming directly on the folder containing the DICOM® files; instead, the pipeline consumes the metadata from the inventory files and initiates the core processing logic. This avoids the OOM issues that arise during file listing. If your data resides in an Azure Data Lake Storage Gen2 account, the out-of-the-box Azure Storage blob inventory service can generate inventory files in Parquet format. Inventory based ingestion supports other storage services as well, but you need to provide the inventory files in the required format and make some minimal configuration changes.

Capability configurations

This capability includes several configuration levers, set by updating the deployment parameters config, that tune the ingestion process for better throughput. A hedged example of how these settings might be expressed follows the list below.

max_files_per_trigger – an interface to the maxFilesPerTrigger option of Spark Structured Streaming. It defaults to 100K. For inventory based ingestion, it is advisable to lower this number to 1 or 2, based on the number of records contained in each Parquet file.

max_bytes_per_trigger – an interface to the maxBytesPerTrigger option of Spark Structured Streaming. This option doesn't work directly with all input file sources; it does work with Parquet files as the source, and so becomes relevant when using inventory based ingestion. It defaults to 1 GB.

rows_per_partition – designed specifically for inventory based ingestion, where the default number of partitions might not fully utilize the available resources. In a given execution batch, the number of input files is divided by this number to repartition the dataframe. The default value is 250. For example, if the current batch size is 10 million files, this creates 10,000,000 / 250 = 40,000 partitions, which translates to 40,000 Spark tasks.
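As a sketch only: the exact file name and surrounding schema of the deployment parameters config depend on your HDS deployment, and the values below are illustrative rather than recommendations. The three levers above might appear as keys like these:

{
    "max_files_per_trigger": 2,
    "max_bytes_per_trigger": "1g",
    "rows_per_partition": 250
}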
DICOM® is the registered trademark of the National Electrical Manufacturers Association (NEMA) for its Standards publications relating to digital communications of medical information.

Medical device disclaimer: Microsoft products and services (1) are not designed, intended or made available as a medical device, and (2) are not designed or intended to be a substitute for professional medical advice, diagnosis, treatment, or judgment and should not be used to replace or as a substitute for professional medical advice, diagnosis, treatment, or judgment. Customers/partners are responsible for ensuring solutions comply with applicable laws and regulations.
Healthcare data solutions in Microsoft Fabric ALM support

Healthcare data solutions in Microsoft Fabric released support for Application Lifecycle Management (ALM). This new capability provides common tooling for your team to manage and deploy its work as part of enterprise development and release processes.
Building Apps on Healthcare Data Solutions for Power Platform

Healthcare Data Solutions for Power Platform is a cloud solution helping customers and partners build healthcare applications using low code/no code tools and a FHIR-aligned data model. The solution also provides template applications, a Dataverse Healthcare API, Virtual Health Data Tables, and a FHIRlink Power Platform connector. The solution is free for Power Platform customers and can be customized and extended using Dataverse tools and custom code.
FHIRlink connector and Epic on FHIR for Power Platform development

The FHIRlink connector for Power Platform enables direct access to FHIR-based endpoints. In this post, we look at a working sample application that connects Dataverse with EPIC® on FHIR® and Azure OpenAI to provide patient details and AI support for clinical users.
FHIRlink Power Platform connector Public Preview Release

Microsoft FHIRlink creates a direct connection between healthcare apps built on Microsoft Azure services and FHIR servers, bypassing the need to duplicate data from Microsoft Dataverse. FHIRlink reduces the complexity and cost of building low code/no code applications on Power Platform and Azure, because developers can build their apps directly against the FHIR services rather than duplicating data between systems. Connect Power Automate flows, Power Platform canvas apps, and Azure Logic Apps to various FHIR services and perform create, retrieve, update, and delete operations directly on FHIR resources.
Virtual Health Data Tables Create Update and Delete Support

Health organizations are considering low-code development to improve productivity, gain faster time-to-market, experiment more easily, and, overall, be more agile when responding to market changes. A key blocker has been the inability to pull health data from multiple sources and manage it in a secure and compliant way. Microsoft Cloud for Healthcare includes configurable solutions to exchange data between Dataverse and external systems using the FHIR standard. Microsoft's Virtual Health Data Tables provide the ability to connect directly to the Azure Health Data Services FHIR service from within Dataverse. As part of the latest release, Virtual Health Data Tables have been updated to include support for the create, update, and delete FHIR operations.
FHIRlink Connector Support for EPIC® on FHIR®

The Health and Life Sciences Data Platform team recently released an update to the FHIRlink connector introducing support for EPIC® on FHIR® connectivity. This is our initial release of connectivity for EPIC® on FHIR® application registrations configured with an application audience of Patient or Clinicians/Administrative Users.
How does Azure ensure the security and privacy of sensitive patient data in the cloud?

In the healthcare industry, where privacy and security are paramount, storing sensitive patient data in the cloud can feel like a gamble. But Microsoft Azure employs a multi-layered approach to ensure your information stays safe. Here's how:

Encryption at Rest and In Transit: Imagine your data wrapped in multiple layers of security. Azure encrypts patient data at rest (when stored) using industry-standard 256-bit AES encryption, which is practically uncrackable. And when data travels between Azure datacenters, it's encrypted again using secure protocols for additional protection.

Compartmentalization: Azure uses a multi-tenant model, meaning various customers share the physical infrastructure. But worry not! Logical isolation keeps your data segregated from others, like placing your files in a separate folder on a shared server.

Customer Control: You hold the reins! Azure Key Vault empowers you to manage the encryption keys that unlock your data. This ensures only authorized personnel can access sensitive information.

Confidentiality Through Confidential Computing: For an extra layer of security, Azure offers confidential computing environments. These are like secure fortresses within the cloud that encrypt data even while it's being processed. This makes it virtually impossible for unauthorized users, even within Microsoft, to access your data.

Compliance with Regulations: Azure adheres to a wide range of healthcare data privacy regulations, including HIPAA and HITRUST. This gives you peace of mind knowing your data security meets industry standards.

By implementing these robust security measures, Azure ensures your patient data remains confidential, compliant, and protected in the cloud.