Machine learning and analytics are transforming healthcare by streamlining clinical workflows, powering AI models and unlocking new insights from patient data. These innovations are fueled by textual data rich in Protected Health Information (PHI). To be used for research, innovation and operational improvements, this data must be responsibly de-identified to protect patient privacy. Manual de-identification can be slow, expensive, and error-prone, creating bottlenecks that delay progress and limit collaboration. De-identification is more than a compliance standard; it is the key to unlocking healthcare data’s full potential while maintaining patient privacy and trust.
Today, we are excited to announce the expansion of the Azure Health Data Services de-identification service to support five new preview language-locale combinations:
- Spanish (United States)
- German (Germany)
- French (France)
- French (Canada)
- English (United Kingdom)
This language expansion enables global healthcare organizations to unlock insights from data beyond English while continuing to adhere to regulatory standards.
Why Language Support Matters
Healthcare data is generated in many languages around the world, and each one comes with its own linguistic structure, formatting, and privacy considerations. By adding preview support for Spanish, German, French, and UK English, our de-identification service allows organizations to unlock data from a broader range of countries and regions.
But language alone isn’t the whole story. Different locales within the same language (French in France vs. Canada, or English in the UK vs. the US) often format PHI in unique ways. Addresses, medical institutions, and identifiers can all look different depending on the region. Our service is designed to recognize and accurately de-identify these locale-specific patterns, supporting privacy and compliance wherever the data originates.
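To make the locale point concrete, here is a minimal Python sketch of how one kind of identifier, the postal code, differs by locale. The regular expressions are simplified assumptions chosen for illustration. The service itself relies on machine learning models rather than hand-written patterns, and the helper name `find_postal_codes` is hypothetical.

```python
import re

# Simplified, illustrative postal-code patterns keyed by locale tag.
# These are assumptions for demonstration only, not the service's logic.
POSTAL_PATTERNS = {
    "en-US": re.compile(r"\b\d{5}(?:-\d{4})?\b"),                   # ZIP or ZIP+4
    "en-GB": re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}\b"),  # UK postcode
    "fr-FR": re.compile(r"\b\d{5}\b"),                              # code postal
    "fr-CA": re.compile(r"\b[A-Z]\d[A-Z]\s?\d[A-Z]\d\b"),           # Canadian postal code
}


def find_postal_codes(text: str, locale: str) -> list[str]:
    """Return every substring that matches the locale's postal-code pattern."""
    return POSTAL_PATTERNS[locale].findall(text)


uk_note = "Patient lives near Oxford OX1 2JD."
ca_note = "Patient lives in Montreal H2X 1Y4."

print(find_postal_codes(uk_note, "en-GB"))  # the UK pattern finds the postcode
print(find_postal_codes(ca_note, "fr-CA"))  # the Canadian pattern finds the postal code
print(find_postal_codes(ca_note, "en-US"))  # a US-only pattern finds nothing
```

A detector tuned only for US ZIP codes would miss both of the other identifiers, which is why locale awareness matters alongside language support.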
How It Works
The Azure Health Data Services de-identification service empowers healthcare organizations to protect patient data through three key operations:
- TAG detects and annotates PHI from unstructured text.
- REDACT obfuscates PHI to prevent exposure.
- SURROGATE replaces PHI with realistic, synthetic surrogates, preserving data utility while ensuring privacy.
Our service leverages state-of-the-art machine learning models to identify and handle sensitive information, supporting compliance with the HIPAA Safe Harbor method and unlinked pseudonymization aligned with GDPR principles. Because the service maintains entity consistency and temporal relationships, organizations can use de-identified data for research, analytics, and machine learning without compromising patient privacy.
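The difference between the three operations is easiest to see on a toy example. The Python sketch below is purely conceptual and does not call the service: detection is faked with a dictionary lookup (the real service uses machine learning models), and the category names, placeholder format, surrogate pool, and all function names are assumptions made for illustration. It also assumes detected entities do not overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    category: str  # e.g. "Patient"; category names here are assumptions
    offset: int    # start position in the text
    length: int    # number of characters covered


def tag(text, known_phi):
    """Toy TAG: locate known PHI strings and annotate them (lookup, not ML)."""
    entities = []
    for value, category in known_phi.items():
        start = text.find(value)
        while start != -1:
            entities.append(Entity(category, start, len(value)))
            start = text.find(value, start + len(value))
    return sorted(entities, key=lambda e: e.offset)


def _rewrite(text, entities, replacement_for):
    """Rebuild the text, swapping each entity span for its replacement."""
    parts, cursor = [], 0
    for e in entities:
        parts.append(text[cursor:e.offset])
        parts.append(replacement_for(e, text[e.offset:e.offset + e.length]))
        cursor = e.offset + e.length
    parts.append(text[cursor:])
    return "".join(parts)


def redact(text, entities):
    """Toy REDACT: replace each entity with a category placeholder."""
    return _rewrite(text, entities, lambda e, _original: f"[{e.category.lower()}]")


def surrogate(text, entities, pool):
    """Toy SURROGATE: the same original value always gets the same stand-in."""
    mapping, counters = {}, {}

    def pick(e, original):
        key = (e.category, original)
        if key not in mapping:
            i = counters.get(e.category, 0)
            mapping[key] = pool[e.category][i % len(pool[e.category])]
            counters[e.category] = i + 1
        return mapping[key]

    return _rewrite(text, entities, pick)


note = "Anna Smith saw Dr. Lee. Anna Smith returns Monday."
entities = tag(note, {"Anna Smith": "Patient", "Lee": "Doctor"})
redacted = redact(note, entities)
surrogated = surrogate(note, entities, {"Patient": ["Maria Keller"], "Doctor": ["Osei"]})

print(redacted)    # [patient] saw Dr. [doctor]. [patient] returns Monday.
print(surrogated)  # Maria Keller saw Dr. Osei. Maria Keller returns Monday.
```

Note how both mentions of the patient map to the same surrogate name. That is the kind of entity consistency that keeps surrogated text useful for downstream analytics.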
Unlocking New Use Cases
By expanding the service's language support, organizations can now address some of the most pressing data challenges in healthcare:
- Reduce organizational liability by meeting evolving privacy standards.
- Share de-identified data securely across institutions and regions to create larger, more diverse datasets.
- Unlock AI opportunities by training models on multilingual, de-identified data.
- Conduct longitudinal research while preserving patient privacy.
Proven Accuracy
Researchers at the University of Oxford recently conducted a comprehensive comparative study evaluating multiple automated de-identification systems across 3,650 UK hospital records. Their analysis compared both task-specific transformer models and general-purpose large language models. The Azure Health Data Services de-identification service achieved the highest overall performance among the nine evaluated tools, with a recall score of 0.95. The study highlights how robust de-identification enables large-scale, privacy-preserving EHR research and supports the responsible use of AI in healthcare. The full study is titled "Benchmarking transformer-based models for medical record deidentification."
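For readers less familiar with the metric, recall is the share of true PHI items that a tool actually catches: true positives divided by the sum of true positives and false negatives. The short Python sketch below shows the arithmetic. The counts are hypothetical and are not taken from the study, which may compute recall at a different granularity (for example, per token rather than per entity).

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Fraction of real PHI items that were detected."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("recall is undefined when there are no PHI items")
    return true_positives / total


# Hypothetical counts for illustration only: 950 PHI items caught, 50 missed.
print(recall(950, 50))  # 0.95
```

In de-identification, recall matters most because every missed item (a false negative) is a piece of PHI left exposed in the output.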
Preview: Your Feedback Matters
This multilingual feature is now available in preview. We invite healthcare organizations, research institutions, and clinicians to:
- Try it out: see "Overview of the de-identification service in Azure Health Data Services" on Microsoft Learn.
- Provide feedback to help refine the service by filling out the "Azure Health Data Service multilingual de-identification Service Feedback" form.
- Join us in shaping the future of privacy-preserving healthcare innovation.
At Microsoft, we are committed to helping healthcare providers, payors, researchers, and life sciences companies unlock the value of data while maintaining the highest standards of patient privacy. The Azure Health Data Services de-identification service empowers organizations to accelerate AI and analytics initiatives safely, supporting innovation and improving patient outcomes across the healthcare ecosystem.
Explore Azure Health Data Services to see how our solutions help organizations transform care, research, and operational efficiency.