Forum Discussion

Thomas LeBlanc
Jan 24, 2020

What kind of data would you put in a Data Lake?

How would you describe the data structure in a Data Lake for a Data Scientist?

 

How would data be extracted for a Data Vault or Star Schema?

 

Thanks,

Thomas

 


  • cobrow (Copper Contributor)

    Thomas LeBlanc the Data Lake's flexibility is its key value proposition.  Customers no longer need to determine what can and can't be put into storage.  The discussions I have with customers are about how they can put WAV and MP3 files alongside high-resolution medical imaging and videos, in addition to the raw data generated from running the business.

     

    When introducing a customer to the Data Lake and how to navigate it, I usually describe it in terms of Windows File Explorer: a hierarchy of folders and files.

     

    The structure or model of the Data Lake is what tailors the folders to your Data Scientists' or customers' requirements.  Rodrigo Souza has a community post summarizing a great viewpoint on organizing a lake (https://techcommunity.microsoft.com/t5/data-architecture-blog/how-to-organize-your-data-lake/ba-p/1182562).  A sketch of one common layout follows.
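    For example, a zone-based layout might look like the tree below; the folder and dataset names here are illustrative placeholders of my own, not taken from the linked post:

        /raw/         <- data landed as-is from source systems (CSV, JSON, images, audio)
            salesdb/orders/2020/01/24/orders.csv
        /enriched/    <- cleansed, validated, conformed data
            sales/orders/
        /curated/     <- modeled, consumption-ready data (e.g., Star Schema extracts)
            sales/fact_orders/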

     

    Data extraction from the lake is equally flexible.  For databases and Star Schemas that might require some transformation of the data, I would lead with Azure Data Factory (or SSIS).  Some Data Scientists I work with prefer Python as their primary language and use it to extract, manipulate, and write data back to the Lake.  The third approach that currently has my interest for reporting from the Azure Data Lake is Power Platform dataflows, since my customer relies on Power BI (https://cloudblogs.microsoft.com/dynamics365/it/2019/09/18/using-power-platform-dataflows-to-extract-and-process-data-from-business-central-post-3/)
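    As a rough illustration of the Python route, here is a minimal sketch using the azure-storage-file-datalake SDK with pandas.  The account name, file system, paths, and column name are placeholders of my own, and in practice you would pull the credential from a secure store rather than hard-coding it:

        import io
        import pandas as pd
        from azure.storage.filedatalake import DataLakeServiceClient

        # Placeholder account and credential -- substitute your own.
        service = DataLakeServiceClient(
            account_url="https://mylakeaccount.dfs.core.windows.net",
            credential="<account-key-or-token>",
        )
        fs = service.get_file_system_client(file_system="lake")

        # Extract: download a raw CSV from the lake.
        raw = fs.get_file_client("raw/salesdb/orders/orders.csv")
        df = pd.read_csv(io.BytesIO(raw.download_file().readall()))

        # Manipulate: a trivial example transformation.
        df = df[df["order_total"] > 0]

        # Write back: land the cleaned file in a curated folder.
        out = fs.get_file_client("curated/sales/orders_clean.csv")
        out.upload_data(df.to_csv(index=False).encode("utf-8"), overwrite=True)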
