Updated Fabric GitHub Repo for 250M rows of CMS Healthcare data
Last year I teamed up with my colleague Inder Rana to build and release a GitHub repo for using CMS Medicare Part D data within Microsoft Fabric. The repo is intended to provide an example of an end-to-end analytics solution in Fabric that can be easily deployed by anyone with a Fabric environment. We have updated the analytics solution with some valuable improvements:
- The ELT (extract, load, and transform) process now runs end-to-end, from CMS to the Gold layer of the Lakehouse, in less than 20 minutes, with increased automation.
- The repo now includes logic to import new data for the year 2022, so the solution covers 10 years of data (2013-2022) and nearly 250 million rows.
- There are two simple options to move the data from the CMS servers to the Gold layer in less than 20 minutes: 1) Spark Notebooks orchestrated with a Pipeline, or 2) Spark Notebooks and SQL Stored Procedures to move the data to the Gold layer (see the sketch after this list for a sense of the notebook pattern).
- Option 2 lands the Gold layer in the Fabric Warehouse, for those of you who come from a SQL rather than a Python background.
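
To give a flavor of the notebook-based option, here is a minimal PySpark sketch of the extract-load-transform pattern: download a CMS file, land it in the Lakehouse Files area, and write a cleaned Delta table to the Gold layer. The URL, file paths, table name, and column names below are placeholders for illustration, not the exact ones used in the repo's notebooks.

```python
# Minimal sketch of the ELT pattern (illustrative only; the repo's notebooks differ).
# Assumes it runs in a Fabric Spark notebook attached to a Lakehouse, so `spark`
# is predefined and the default Lakehouse is mounted at /lakehouse/default.
import requests
from pyspark.sql import functions as F

# Hypothetical CMS source file and Lakehouse paths -- placeholders.
CMS_URL = "https://data.cms.gov/path/to/medicare-part-d-2022.csv"  # placeholder URL
BRONZE_PATH = "Files/bronze/cms_partd_2022.csv"
GOLD_TABLE = "cms_partd_gold"

# Extract: download the raw CSV from CMS into the Lakehouse Files area.
resp = requests.get(CMS_URL, timeout=300)
resp.raise_for_status()
with open(f"/lakehouse/default/{BRONZE_PATH}", "wb") as f:
    f.write(resp.content)

# Load: read the raw file into a Spark DataFrame.
df_raw = spark.read.option("header", True).csv(BRONZE_PATH)

# Transform: light cleanup (placeholder column names) before landing in Gold.
df_gold = (
    df_raw
    .withColumnRenamed("Tot_Clms", "total_claims")
    .withColumn("total_claims", F.col("total_claims").cast("long"))
    .withColumn("year", F.lit(2022))
)

# Write the Gold table as Delta so it is queryable via Direct Lake and the SQL endpoint.
df_gold.write.mode("overwrite").format("delta").saveAsTable(GOLD_TABLE)
```

In the repo itself, steps like these are split across notebooks and orchestrated by a Pipeline (option 1) or finished with SQL Stored Procedures in the Warehouse (option 2).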
The updated GitHub repo can be found at the link below; please give us a “Star” if you find it useful: fabric-samples-healthcare/analytics-bi-directlake-starschema at main · isinghrana/fabric-samples-healthcare (github.com)
The first option, using three Spark Notebooks with a single Pipeline, is reviewed in the video below. A video reviewing the SQL Stored Procedure version is coming soon:
Here is a diagram reviewing the new and updated process: