This post was authored by Bruce Nelson, Senior Solutions Architect, and Clinton Ford, Staff Partner Marketing Manager, both at Databricks.
Healthcare organizations are improving the patient experience and delivering better health outcomes with analytic dashboards and machine learning models on top of existing electronic health records (EHR), digital medical images and streaming data from medical devices and wearables. Azure Databricks and Delta Lake make it easier to work with large clinical datasets to identify top patient conditions.
Using Delta Lake to build a comorbidity dashboard
The simulated EHR data cover roughly 10,000 patients in Massachusetts and were generated with the Synthea patient simulator. Our ETL notebook ingests and de-identifies the data, then prepares it for our visualization notebook. There we create visualizations and a simple dashboard that show the top conditions (comorbidities) in our real-world data and analyze the correlation between any two conditions specified by the user.
Extract, transform and load (ETL)
To begin, we use PySpark to read the EHR data from comma-separated values (CSV) files, de-identify patient personally identifiable information (PII) and write the results to Delta Lake for analysis. Delta Lake is a best practice for ingestion, ETL and stream processing: it is an open-source format that supports ACID transactions, offers faster processing with Delta Engine and integrates easily with other Azure services for additional use cases.
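As a minimal sketch of the de-identification step, shown outside of Spark for clarity: a salted hash replaces a patient identifier with a stable pseudonymous key, and direct identifiers are dropped. The field names, salt and record shape below are illustrative, not taken from the actual notebook; in the real pipeline, equivalent logic runs over a PySpark DataFrame before writing to Delta Lake.

```python
import hashlib

# Hypothetical salt -- in practice this would come from a secret scope,
# never from source code.
SALT = "example-salt"

def pseudonymize(value: str, salt: str = SALT) -> str:
    """Replace a PII value with a deterministic, non-reversible token."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

# Illustrative raw record; real Synthea CSVs have many more columns.
record = {"patient_id": "12345", "ssn": "999-12-3456", "condition": "Hypertension"}

deidentified = {
    "patient_key": pseudonymize(record["patient_id"]),  # stable join key
    "condition": record["condition"],                   # clinical fields kept
}
# Direct identifiers (ssn, name, address) are simply dropped.
```

Because the hash is deterministic, the same patient always maps to the same key, so records can still be joined across tables without exposing the underlying identifier.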
EHR data analysis and comorbidity dashboard
In this notebook, we visualize the top conditions in the database and create a simple dashboard to analyze the correlation between any two conditions specified by the user. You can share this notebook as a dashboard by following these instructions.
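A minimal, pure-Python sketch of the two analyses behind the dashboard: counting the most frequent conditions, and computing the correlation (phi coefficient) between two user-chosen conditions. The sample rows and function names are hypothetical; the notebook performs the equivalent aggregations with Spark over the Delta table.

```python
import math
from collections import Counter, defaultdict

# Hypothetical (patient_id, condition) rows standing in for the Synthea
# conditions table after ETL.
rows = [
    ("p1", "Hypertension"), ("p1", "Diabetes"),
    ("p2", "Hypertension"), ("p2", "Diabetes"),
    ("p3", "Hypertension"),
    ("p4", "Diabetes"),
    ("p5", "Asthma"),
]

# Top conditions (the bar-chart half of the dashboard).
top_conditions = Counter(cond for _, cond in rows).most_common(3)

def condition_correlation(rows, cond_a, cond_b):
    """Phi coefficient between two conditions across patients (-1 to 1)."""
    patients = defaultdict(set)
    for pid, cond in rows:
        patients[pid].add(cond)
    n = len(patients)
    has_a = [cond_a in conds for conds in patients.values()]
    has_b = [cond_b in conds for conds in patients.values()]
    n11 = sum(a and b for a, b in zip(has_a, has_b))  # patients with both
    na, nb = sum(has_a), sum(has_b)                   # marginal counts
    denom = math.sqrt(na * (n - na) * nb * (n - nb))
    return 0.0 if denom == 0 else (n * n11 - na * nb) / denom
```

The phi coefficient is simply the Pearson correlation of the two binary "has condition" indicators, which makes it a natural fit for a dashboard widget where the user picks any pair of conditions.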