As data engineers, we grapple with numerous challenges daily. Data is often scattered across various sources, residing in a multitude of file types with varying data quality. The time spent locating specific files—figuring out which tenant they belong to and deciphering access rights—can be exasperating. This is where OneLake steps in.
OneLake streamlines data management, breaks down silos, and ensures that your data resides in one unified home—just like OneDrive for files!
What is OneLake?
OneLake is essentially the OneDrive for data within the Fabric ecosystem. Just like OneDrive, it’s automatically provisioned for every Fabric tenant, requiring no infrastructure management.
Key benefits of OneLake include:
- Unified data storage across different domains and tenants.
- Support for both managed and unmanaged data storage.
- Full Delta support using VertiParq, the Parquet write optimization that Microsoft documents as V-Order (Delta itself is a powerful feature for tracking changes in data).
- Distributed ownership of data and security.
- Integration with Direct Lake, providing robust Power BI support.
How does OneLake work?
The architecture of OneLake allows seamless connectivity to multiple cloud providers. Let’s explore the basics:
- Symbolic links (Shortcuts): Using symbolic links, you can connect to both Azure and Amazon storage. These shortcuts make data from these providers accessible within the same OneLake, without having to copy the data.
- Unified management: All personas—data engineers, real-time analysts, and BI developers—can directly access data stored in OneLake.
- Delta file format: Data within OneLake uses the open-source Delta file format, which optimizes storage for data engineering workflows. It supports efficient storage, versioning, schema enforcement, ACID transactions, and streaming.
- Ingestion methods: You can get data into OneLake via shortcuts or data pipelines. Shortcuts create symbolic links to external storage locations, simplifying navigation. Data pipelines, familiar to Data Factory or Synapse users, copy data from external sources into the lakehouse, including the managed tables area.
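Because everything lives in one lake, every file and table can be addressed with a single path scheme. The sketch below builds those paths for a lakehouse item. The host name is the OneLake endpoint, but the workspace, lakehouse, and file names are hypothetical, and the helper functions are not part of any Fabric SDK. If your workspace or item names contain spaces or special characters, the abfss form is usually safer with their GUIDs instead of names.

```python
from urllib.parse import quote

# OneLake DFS endpoint; all names passed in below are hypothetical examples.
ONELAKE_HOST = "onelake.dfs.fabric.microsoft.com"


def onelake_abfss_path(workspace: str, lakehouse: str, relative_path: str) -> str:
    """Build the abfss:// URI that Spark can use for a path inside a lakehouse."""
    relative = relative_path.strip("/")
    return f"abfss://{workspace}@{ONELAKE_HOST}/{lakehouse}.Lakehouse/{relative}"


def onelake_https_url(workspace: str, lakehouse: str, relative_path: str) -> str:
    """Build the HTTPS form of the same path, percent-encoding unsafe characters."""
    relative = quote(relative_path.strip("/"))
    return f"https://{ONELAKE_HOST}/{quote(workspace)}/{quote(lakehouse)}.Lakehouse/{relative}"


# Hypothetical workspace, lakehouse, and file names.
print(onelake_abfss_path("SalesWorkspace", "DemoLakehouse", "Files/raw/orders.csv"))
print(onelake_https_url("SalesWorkspace", "DemoLakehouse", "Tables/orders"))
```

Unmanaged files sit under Files and managed Delta tables under Tables, so the same helper covers both areas of the lakehouse.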
Managed Data: Tables
Tables play a crucial role in managing and organizing data within the lakehouse architecture. Once your tables are set up in the managed section of the lakehouse, you have several options:
- Browse tables using the Lakehouse Explorer.
- Query and analyze data efficiently.
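Besides browsing in the Lakehouse Explorer, you can query a managed table from a notebook. Here is a minimal sketch, assuming a Fabric notebook attached to the lakehouse (where a `spark` session is already available) and a hypothetical table called `orders`. Both helper names are made up for this example.

```python
def build_preview_query(table_name: str, limit: int = 10) -> str:
    """Return a SELECT statement for the first `limit` rows of a table."""
    # Reject anything that is not a plain identifier before building SQL from it.
    if not table_name.isidentifier():
        raise ValueError(f"unexpected table name: {table_name!r}")
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return f"SELECT * FROM {table_name} LIMIT {limit}"


def preview_table(spark, table_name: str, limit: int = 10):
    """Run the preview query on an existing SparkSession and return a DataFrame."""
    return spark.sql(build_preview_query(table_name, limit))


# "orders" is a hypothetical table name.
print(build_preview_query("orders", 5))
```

In a notebook you would call `preview_table(spark, "orders").show()` to see the rows.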
Connecting External Data to Microsoft Fabric OneLake
Now that you’ve grasped how OneLake works, let’s get some data from an external source into OneLake. For this, we will use the Data Engineering experience, but feel free to choose any other experience.
1. Create a workspace:
Begin by creating a workspace within your Microsoft Fabric environment. This workspace will serve as the container for your data-related activities.
2. Set up a lakehouse:
Next, create a lakehouse item within your workspace by following these steps:
- Select the workspace in which you want to create the lakehouse.
- In the open workspace, select New.
- Select the Lakehouse item from the drop-down menu and give it a name.
3. Ingest the Data from an External Source into the Lakehouse
Create a shortcut, which allows you to point to other storage locations, either internal or external to OneLake. This launches the shortcut wizard, where you select the source you want to pull your data from. For this demo, select OneLake to create an internal shortcut.
Find and connect to the data you want to use with your shortcut, then click Next. Your data will appear in the Files section of your lakehouse.
Preview the loaded data by clicking on the Files section.
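If you prefer to script this step instead of clicking through the wizard, Fabric also exposes shortcuts through its REST API. The sketch below only builds the request for an internal OneLake shortcut and does not send it. The endpoint and body shape reflect my understanding of the shortcuts API, so check them against the current reference before relying on them. Every ID, name, and path is a hypothetical placeholder, and the helper function is my own.

```python
import json

# Base URL of the Fabric REST API (assumed here; verify against the API reference).
FABRIC_API = "https://api.fabric.microsoft.com/v1"


def build_onelake_shortcut_request(workspace_id, lakehouse_id, shortcut_name,
                                   source_workspace_id, source_item_id,
                                   source_path, shortcut_path="Files"):
    """Return (url, body) for a request that creates an internal OneLake shortcut."""
    url = f"{FABRIC_API}/workspaces/{workspace_id}/items/{lakehouse_id}/shortcuts"
    body = {
        "path": shortcut_path,   # where the shortcut appears in the lakehouse
        "name": shortcut_name,   # display name of the shortcut
        "target": {
            "oneLake": {         # internal shortcut: points at another Fabric item
                "workspaceId": source_workspace_id,
                "itemId": source_item_id,
                "path": source_path,
            }
        },
    }
    return url, body


# Placeholder IDs; replace them with real workspace and item GUIDs.
url, body = build_onelake_shortcut_request(
    "<workspace-id>", "<lakehouse-id>", "orders_raw",
    "<source-workspace-id>", "<source-item-id>", "Files/orders",
)
print("POST", url)
print(json.dumps(body, indent=2))
```

Sending the request also needs a valid Microsoft Entra access token in the Authorization header, which is left out here.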
4. Open the Data in a Notebook
Once your data is in the lakehouse, create a new notebook and associate it with the lakehouse you created. Drag and drop the file into the notebook.
5. Transform the Data into Delta Tables
Transform the data into Delta tables using Spark within the Fabric notebook. Delta tables provide efficient change tracking and management.
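The transformation itself can be sketched in a few lines of PySpark. This is a minimal example rather than the exact code from the demo: it assumes the source is a CSV file with a header row, that the notebook has the lakehouse attached as its default (so relative `Files/...` paths resolve), and that `orders` is an acceptable table name. The function name and file path are hypothetical.

```python
def csv_to_delta_table(spark, csv_path: str, table_name: str, mode: str = "overwrite"):
    """Read a CSV file with Spark and save it as a managed Delta table."""
    # Assumption: the file has a header row; column types are inferred by Spark.
    df = spark.read.csv(csv_path, header=True, inferSchema=True)
    # saveAsTable registers the table in the lakehouse's managed Tables area.
    df.write.format("delta").mode(mode).saveAsTable(table_name)
    return df


# In a Fabric notebook, `spark` already exists, so the call would look like:
# csv_to_delta_table(spark, "Files/orders/orders.csv", "orders")
```

After the write finishes, the new table should show up under Tables in the Lakehouse Explorer. Use `mode="append"` if you want to add rows to an existing table instead of replacing it.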
6. Build Reports and Analyze the Data
From the table view, click on Lakehouse and select SQL analytics endpoint. From the SQL analytics endpoint view, select New visual to create a simple visual.
You can create the visuals manually, or let Copilot do the magic for you.
Clean Up Resources: After completing the task, remember to clean up any temporary or test data.
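For the tables you created during the demo, the cleanup can be done from the same notebook. A small sketch, assuming the tables are managed Delta tables created with Spark and that the names below are the hypothetical ones you used. The helper names are made up for this example.

```python
def build_drop_statements(table_names):
    """Return one DROP TABLE statement per table name."""
    statements = []
    for name in table_names:
        # Only accept plain identifiers so nothing unexpected ends up in the SQL.
        if not name.isidentifier():
            raise ValueError(f"unexpected table name: {name!r}")
        statements.append(f"DROP TABLE IF EXISTS {name}")
    return statements


def drop_tables(spark, table_names):
    """Drop each table using an existing SparkSession."""
    for statement in build_drop_statements(table_names):
        spark.sql(statement)


# Hypothetical demo table names.
print(build_drop_statements(["orders", "orders_staging"]))
```

Double-check the list before running `drop_tables(spark, [...])`, since dropped demo tables are not meant to be recovered.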
Conclusion
OneLake aims to give you the most value possible out of a single copy of data without data movement or duplication. You no longer need to copy data just to use it with another engine or to break down silos so you can analyze the data with data from other sources.
Further Guides
Sign up for the Microsoft Fabric Global AI Hack, a virtual event where you can learn, experiment, and hack together with the new Copilot and AI features in Microsoft Fabric!
Sign up for the Fabric Cloud Skills Challenge at https://aka.ms/fabric30dtli and complete all the modules to become eligible for a 50% discount on the DP-600 exam.
Learn how to use Copilot in Microsoft Fabric, your data insights AI assistant.
Join the Fabric Community to stay updated on the latest about Microsoft Fabric.
Consider joining the Fabric Career Hub so you won’t miss out on any careers in Microsoft Fabric.