End-to-end Fabric Git Repo for 220M rows of CMS data
Published Aug 04 2023 09:40 AM
Microsoft

Microsoft Fabric unites data and analytics tools within a single SaaS platform. Fabric spans many different user personas and tools, and new users are looking for opportunities to skill up. This article introduces a new GitHub repository that lets anyone with access to Fabric deploy an end-to-end solution built on 220 million rows of real open healthcare data from CMS (Centers for Medicare & Medicaid Services). Without having to write code, users can follow the instructions in the Git repo to import the data into OneLake, serve it from the Lakehouse, and then query it from Power BI and Excel using the new Direct Lake connector. Below is an architectural diagram of the solution:

 

DirectLake_Architecture.png

 

Here is a link to the Git repo: fabric-samples-healthcare/analytics-bi-directlake at main · isinghrana/fabric-samples-healthcare (g...

 

This is the first release of the GitHub repository, which will serve as a hub for new easy-to-deploy Fabric healthcare solutions going forward. The diagram above shows the simple steps of the solution:

  1. Download the files from CMS and upload them to Fabric OneLake.
  2. Combine the files into a single table in the Fabric Lakehouse using the Delta Parquet file format. Don't worry, you can deploy the Spark Notebook without having to write code!
  3. Create a Fabric Direct Lake dataset that queries the table without caching any of the 220M+ rows.
  4. Create reports in Power BI and Excel that query the Lakehouse with impressive query performance.
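
For readers curious about what step 2 involves, here is a minimal PySpark sketch of combining one CSV file per year into a single Delta table. It is not the notebook from the repo. The file naming convention, the `Year` column, the table name `cms_part_d_prescribers`, and both function names are assumptions made for illustration.

```python
import re
from functools import reduce
from pathlib import PurePosixPath

# Assumption: each CSV holds one data year and carries that year in its file name.
YEAR_PATTERN = re.compile(r"(?<!\d)(20\d{2})(?!\d)")


def year_from_filename(path: str) -> int:
    """Return the four-digit year embedded in the file name of `path`."""
    file_name = PurePosixPath(path).name
    match = YEAR_PATTERN.search(file_name)
    if match is None:
        raise ValueError(f"no year found in file name: {file_name!r}")
    return int(match.group(1))


def combine_to_delta(spark, csv_paths, table_name="cms_part_d_prescribers"):
    """Union per-year CSV files and save them as one Delta table.

    `spark` is an existing SparkSession (a Fabric notebook provides one).
    `table_name` is a placeholder, not the name used in the repo.
    """
    # Imported lazily so year_from_filename works without PySpark installed.
    from pyspark.sql import functions as F

    frames = [
        spark.read.option("header", True)
        .csv(path)
        .withColumn("Year", F.lit(year_from_filename(path)))
        for path in csv_paths
    ]
    # unionByName aligns columns by name rather than by position.
    combined = reduce(lambda left, right: left.unionByName(right), frames)
    combined.write.format("delta").mode("overwrite").saveAsTable(table_name)
    return combined
```

In a Fabric notebook attached to a Lakehouse, this could be called as `combine_to_delta(spark, paths)`, where `paths` lists the uploaded CSV files. Tagging each row with its year before the union keeps the source file traceable once everything sits in one table.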

The solution uses real CMS open data for Medicare Part D Prescribers - By Prescriber and Drug. The data includes drug names, physician names, geographical data, costs, beneficiary counts, and more. It spans 2013 to 2021 and totals over 220 million rows.
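
To make the shape of the data concrete, the sketch below aggregates a few made-up rows the way a report visual might, summing drug cost per year and drug. The field names and values are hypothetical stand-ins, not the actual CMS column names or figures.

```python
from collections import defaultdict

# Made-up sample rows; real CMS column names and values differ.
rows = [
    {"year": 2020, "drug_name": "Drug A", "total_drug_cost": 1200.50, "beneficiaries": 14},
    {"year": 2020, "drug_name": "Drug A", "total_drug_cost": 800.25, "beneficiaries": 11},
    {"year": 2021, "drug_name": "Drug A", "total_drug_cost": 950.00, "beneficiaries": 12},
    {"year": 2021, "drug_name": "Drug B", "total_drug_cost": 300.00, "beneficiaries": 20},
]


def total_cost_by_year_and_drug(records):
    """Sum total_drug_cost for each (year, drug_name) pair."""
    totals = defaultdict(float)
    for record in records:
        totals[(record["year"], record["drug_name"])] += record["total_drug_cost"]
    return dict(totals)


totals = total_cost_by_year_and_drug(rows)
```

In the solution itself, this kind of grouping is done by Power BI over the Lakehouse table. The Python version only illustrates the calculation.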

 

DirectLake_PBI_Landing.png

 

This solution was created by Greg Beaumont and Inder Rana, who are Data & AI Technical Specialists for Microsoft Healthcare and Life Sciences:

 

Inder Rana

LinkedIn: https://www.linkedin.com/in/singhinderjit

Blog: https://isinghrana.medium.com/

 

Greg Beaumont

LinkedIn: https://www.linkedin.com/in/gregbeaumont

Twitter: https://twitter.com/grbeaumont 

 

Future planned releases for this GitHub repo include easy-to-deploy healthcare solutions such as:

  • Machine Learning and Predictive Analytics within Fabric
  • Comparing Direct Lake, SQL Serverless Endpoint, and Import models for Power BI
  • Comparing query performance for flattened data models versus star schemas and composite models
  • OpenAI integration with Fabric
  • Pass along your ideas and suggestions!

Here are a few of the instructional videos from the Git repo tutorial:

 

Import the files manually into Fabric OneLake

 

Use a Fabric Spark Notebook to create a table in the Lakehouse in delta parquet format

 

Create a Fabric Power BI dataset in Direct Lake mode to query 220M+ rows of data without caching

 

Last update: Nov 09 2023 10:46 AM