Announcing a de-identification service for Health and Life Sciences
Published Oct 10 2023 09:00 AM

This blog has been co-authored by Kimia Mavon and Gord Lueck.


Data is the foundation of machine learning and clinical research. As healthcare organizations invest in AI-powered solutions, there is an increasing need to use multi-modal data for clinical research, AI model training, and analytics. However, organizations find it hard to use clinical data containing Protected Health Information (PHI) and personal identifiers while still abiding by the Health Insurance Portability and Accountability Act of 1996 (HIPAA) Privacy Rule. This privacy concern slows research and analytics because data containing PHI cannot be shared with, or accessed by, those who want to work with it.


Today, we are excited to announce the new de-identification service in Azure Health Data Services, which lets organizations de-identify clinical data so that the resulting data retains its clinical relevance and distribution while adhering to the HIPAA Privacy Rule. The service currently supports unstructured text and will soon cover other data types (structured and imaging). It uses state-of-the-art machine learning models to automatically extract, redact, or surrogate over 30 entities, including HIPAA’s 18 PHI identifiers, from unstructured text such as clinical notes, messages, or clinical trial studies (Table 1). The service follows best practices for PHI protection in the form of surrogate replacement, where PHI elements are replaced with plausible-looking surrogates, resulting in data that is most representative of the source data. The de-identification service will also be available in Fabric.


Healthcare organizations currently use a variety of administrative, technical, and physical safeguards to protect PHI in accordance with HIPAA. Identifiable data may be siloed within small groups of trusted individuals and require significant legal processes to use or share. Healthcare organizations can de-identify data for secondary uses, such as analysis, research, or business applications, but removing PHI from unstructured data poses unique challenges. For instance, human annotators can de-identify documents manually, but the process is costly and time-consuming. Machine-learned de-identification models reduce cost and time, but developing and maintaining these models can be difficult without domain experts. As a result, traditional de-identification methods can be too time-consuming, expensive, and inefficient to support clinical research.


Using our de-identification service, scientists and healthcare researchers can leverage large amounts of de-identified, multi-modal data to find patterns and trends to address the most critical patient needs. Collaborating institutions can de-identify clinical records across research groups to develop larger, diverse datasets to unlock the power of their shared data. This service also enables consultations about outliers or troubleshooting of hard-to-reproduce issues in test environments.  


Consistent replacement of PHI via surrogation enables organizations to retain relationships occurring in the underlying dataset, which is often critical for health use cases. Our service allows for consistent replacement across entities and preserves the relative temporal relationships between events. Analysts will also soon be able to use the de-identification service from within healthcare data solutions in Microsoft Fabric to de-identify data in OneLake and share aggregated dashboards for population-level views. The de-identification service improves the speed and efficiency of healthcare innovation by enhancing collaboration, simplifying legal processes, and streamlining data collection, supporting both machine learning and analytics workflows.
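To make the idea of consistent replacement concrete, here is a minimal Python sketch. It is an illustration only and not the service's actual algorithm: the class name `ConsistentSurrogator`, the small name pool, and the fixed day offset are all assumptions made for this example. It shows two properties described above: the same original value always maps to the same surrogate, and shifting every date by the same offset keeps the intervals between events unchanged.

```python
from datetime import date, timedelta

# Hypothetical surrogate pool for illustration; not taken from the service.
SURROGATE_NAMES = ["Alex Morgan", "Jordan Lee", "Casey Kim", "Riley Shaw"]


class ConsistentSurrogator:
    """Toy model of consistent surrogation (an assumption, not the service API)."""

    def __init__(self, day_shift: int):
        self._names = {}  # original name -> assigned surrogate
        self._shift = timedelta(days=day_shift)

    def name(self, original: str) -> str:
        # Reuse the surrogate already assigned, so repeated mentions stay linked.
        if original not in self._names:
            index = len(self._names)
            base = SURROGATE_NAMES[index % len(SURROGATE_NAMES)]
            round_number = index // len(SURROGATE_NAMES)
            # Add a numeric suffix once the pool is exhausted to keep surrogates unique.
            self._names[original] = base if round_number == 0 else f"{base} {round_number + 1}"
        return self._names[original]

    def shift_date(self, original: date) -> date:
        # One shared offset for all dates preserves the time between events.
        return original + self._shift


surrogator = ConsistentSurrogator(day_shift=-17)
admitted, discharged = date(2023, 3, 1), date(2023, 3, 6)
length_of_stay = surrogator.shift_date(discharged) - surrogator.shift_date(admitted)
```

In this sketch, `length_of_stay` equals the original five-day stay even though neither surrogate date matches the source record, which is the kind of relationship longitudinal analyses depend on.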


The de-identification service is available from Azure as a real-time endpoint or an asynchronous batch API. Users select one of three operations to run on their unstructured data: “TAG” to identify and extract PHI, “REDACT” to mask PHI, or “SURROGATE” to replace the PHI value with synthetic data. Examples of these operation types are included in the figure below.


[Figure: examples of the TAG, REDACT, and SURROGATE operation types]
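The difference between the three operations can also be sketched in a few lines of Python. This is a toy model under stated assumptions and does not call or mirror the real service API: the `Entity` type, the function names, the `[category]` mask format, and the hand-written PHI spans are all hypothetical. In the real service the spans would come from its machine learning models.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    category: str  # e.g. "Patient" or "Date", as listed in Table 1
    offset: int    # start index of the PHI span in the text
    length: int    # number of characters in the span


def tag(text, entities):
    """TAG: report each PHI span with its category and original text."""
    return [(e.category, text[e.offset:e.offset + e.length]) for e in entities]


def replace_spans(text, entities, replacement_for):
    """Rewrite spans from right to left so earlier offsets stay valid."""
    for e in sorted(entities, key=lambda e: e.offset, reverse=True):
        text = text[:e.offset] + replacement_for(e) + text[e.offset + e.length:]
    return text


def redact(text, entities):
    """REDACT: mask each span with a category placeholder (format is assumed)."""
    return replace_spans(text, entities, lambda e: f"[{e.category.lower()}]")


def surrogate(text, entities, surrogates):
    """SURROGATE: swap each span for a synthetic value from a lookup table."""
    return replace_spans(text, entities, lambda e: surrogates[e.category])


# Hand-labelled example; a real detector would produce these spans.
note = "John Smith was admitted on 2023-03-01."
entities = [
    Entity("Patient", note.index("John Smith"), len("John Smith")),
    Entity("Date", note.index("2023-03-01"), len("2023-03-01")),
]
print(tag(note, entities))
print(redact(note, entities))
print(surrogate(note, entities, {"Patient": "Mark Rivera", "Date": "2023-02-12"}))
```

TAG leaves the text untouched and only reports what was found, REDACT removes the value but keeps its category visible, and SURROGATE yields text that still reads like a clinical note.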



The de-identification service within Azure Health Data Services helps healthcare professionals de-identify their unstructured health data using state-of-the-art PHI detection, surrogation, and industry best practices to protect patient data.  The service maintains entity and temporal relationships in the resulting data, which maximizes the utility of the de-identified data for many downstream use cases including machine learning, real-world evidence, and longitudinal research.   


At Microsoft, we hope to empower healthcare providers, payors, scientists, and life sciences companies to unlock clinical research, discoveries, and patient-centered care by enabling data access to those who need it, while upholding patient privacy.


Learn more 


Table 1: The HIPAA Privacy Rule provides guidance to help customers achieve the safe harbor method of de-identification by removing 18 identifiers from PHI. The de-identification service extends these identifiers to additional entities that could be used to identify an individual patient, such as the organization and profession of the patient. For more information, see Health Information Privacy on the U.S. Department of Health and Human Services website.


| HIPAA Identifier | De-identification Service Entity |
| --- | --- |
| Names | “Doctor”, “Patient” |
| Address (all geographic subdivisions smaller than state, including street address, city, county, and zip code) | “City”, “CountryOrRegion”, “Location-Other”, “State”, “Street”, “Zip” |
| All elements (except years) of dates related to an individual (including birthdate, admission date, discharge date, date of death, and exact age if over 89) | “Age”, “Date” |
| Telephone numbers | |
| Fax numbers | |
| Email address | |
| Social security numbers | |
| Medical record numbers | |
| Health plan beneficiary numbers | |
| Account numbers | |
| Certificate or license number | |
| Vehicle identifiers and serial numbers, including license plate numbers | |
| Device identifiers and serial numbers | |
| Web Universal Resource Locators (URLs) | |
| Internet Protocol (IP) address numbers | |
| Biometric identifiers, including finger and voice prints | |
| Photographic image | The De-identification Service API only accepts plain text files |
| Any other characteristic that could uniquely identify the individual | “Hospital”, “IDNum”, “Organization”, “Profession”, “Username” |

