
Optimal Loading Practices for Microsoft Fabric: The Data Professional’s Ultimate Guide

Steve_howard
Apr 08, 2025

As a seasoned data professional with nearly two decades of data engineering experience, I've had the privilege of guiding many organizations in leveraging data and AI technologies to drive innovation and achieve their goals. My passion lies in sharing knowledge and empowering others to succeed in their data and AI journeys. 

Microsoft Fabric is a comprehensive platform designed to accelerate data potential in the era of AI. It integrates Power BI, Data Factory, and the next generation of Synapse to offer a modern analytics solution that is price-performant and easy to manage. Fabric's architecture is lake-centric and open, ensuring seamless integration with OneLake to minimize data management time and effort. The Fabric Data Warehouse is a key component, optimized for industry-leading performance over an open Lakehouse storage format.  

As we delve deeper into the intricacies of optimal loading practices within Microsoft Fabric, it becomes essential to highlight the advantages these practices bring to the data management landscape. By strategically implementing these methods, organizations can achieve significant improvements in their data operations, as outlined below: 

  • Enhancing Load Efficiency: By employing best practices for optimizing data loads, organizations can reduce loading times and resource utilization, ensuring smoother data integration processes. This is especially beneficial for systems handling frequent and large-scale data updates. 
  • Scaling Data Operations: As data volumes grow, optimization ensures that the data warehouse can handle these increases seamlessly, maintaining performance levels even under heavy loads. This is critical for industries that generate large amounts of transactional or operational data. 
  • Streamlining ETL Processes: Optimized loading strategies help simplify and accelerate ETL workflows, enabling faster and more reliable data consolidation from diverse sources into the warehouse. This paves the way for comprehensive and unified analytics.  

Overview of Data Ingestion Methods 

For the purposes of this article, we will focus on four methods of ingesting data into the Fabric Data Warehouse. 

  • COPY INTO command: Enables efficient data ingestion directly into the Fabric Data Warehouse from Azure Storage or OneLake, ensuring rapid handling of large datasets and immediate data availability for analysis. 
  • OPENROWSET function: Provides a flexible approach to importing external data into the warehouse from Azure Storage, adept at managing large volumes of data. 
  • INSERT … SELECT combo: Utilizes query-driven transfers to seamlessly move data from the Lakehouse into the warehouse, streamlining the integration process. INSERT … SELECT combo statements can use OPENROWSET for the select portion, but the focus of this article will be selecting from the Lakehouse. 
  • COPY DATA activity from Fabric Pipelines: Employs automated pipelines for consistent and robust data ingestion from a variety of data sources, ensuring that data is kept up-to-date and instantly available for reporting. 

COPY INTO 

The COPY INTO command is a powerful and efficient method for loading data, enabling flexible, scalable, and high-throughput ingestion from external Azure storage accounts in both PARQUET and CSV formats. For example, using the COPY INTO command, you can quickly load external data into internal tables without complex setup or maintenance. 

Note that COPY INTO simply loads the data into the table as-is; no transformations can be applied as part of the process. As a result, you may need to perform additional updates or modifications after the load if your data does not match the desired schema. Transform your data before loading with a process such as Spark notebooks or pipelines, or use T-SQL as in the example below: 

-- Step 1: Drop the table if it exists
IF OBJECT_ID('TempForWeather') IS NOT NULL DROP TABLE TempForWeather;
GO

-- Step 2: Create the initial table for COPY INTO
CREATE TABLE TempForWeather (
  TimeStamp            VARCHAR(50) NOT NULL,
  SeattleTemperature   FLOAT       NOT NULL,
  SeattlePrecip        FLOAT       NOT NULL,
  SeattleWindSpeed     FLOAT       NOT NULL,
  SeattleWindDirection FLOAT       NOT NULL
);

-- Step 3: Execute the COPY INTO command to load data into the table
COPY INTO TempForWeather
FROM 'https://<storageAccountName>.blob.core.windows.net/polybasecontainer/Seattle_Weather_History_Sample.csv'
WITH (
  FILE_TYPE       = 'CSV',
  FIRSTROW        = 11,
  FIELDTERMINATOR = ',',
  ROWTERMINATOR   = '\n'
);

-- Step 4: Alter the table to add the new column for transformed data
ALTER TABLE TempForWeather
ADD DateOfRecording DATETIME2(3);

-- Step 5: Update the new column with transformed data  
UPDATE TempForWeather
SET DateOfRecording = CAST(
  SUBSTRING(TimeStamp, 1, 4) + '-' +
  SUBSTRING(TimeStamp, 5, 2) + '-' +
  SUBSTRING(TimeStamp, 7, 2) + ' ' +
  SUBSTRING(TimeStamp, 10, 2) + ':' +
  SUBSTRING(TimeStamp, 13, 2)
  AS DATETIME2(3)
)
WHERE TimeStamp IS NOT NULL;

-- Step 6: Verify the loaded data
SELECT TOP 100 * FROM TempForWeather;
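
COPY INTO also supports PARQUET. A minimal sketch of the same pattern for a Parquet file is shown below; the storage path is a placeholder, and it assumes a target table whose columns already match the Parquet schema (CSV-specific options such as FIRSTROW and FIELDTERMINATOR do not apply):

-- Minimal sketch: COPY INTO from a Parquet file (placeholder path; assumes the
-- target table's columns match the Parquet schema and the storage account is accessible)
COPY INTO TempForWeather
FROM 'https://<storageAccountName>.blob.core.windows.net/polybasecontainer/Seattle_Weather_History_Sample.parquet'
WITH (
  FILE_TYPE = 'PARQUET'
);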

 

OPENROWSET 

The OPENROWSET function provides a flexible and powerful method for querying data directly from Azure Data Lake Storage (ADLS) or Blob Storage. It can be used as part of an INSERT … SELECT statement or a CREATE TABLE AS SELECT (CTAS) statement, allowing data to be transformed in flight and eliminating the need for staging tables or intermediate ETL steps. 

Note that although OPENROWSET used this way will not ingest data as fast as COPY INTO, it does allow transformation within a single CTAS or INSERT … SELECT statement, which may eliminate the need to land data in a staging table first. In those cases, skipping the staging step can be a net performance gain; testing is required to confirm this for your workload. An example of creating a table using OPENROWSET: 

-- Step 1: Drop the table if it exists
IF OBJECT_ID('TempForWeather') IS NOT NULL DROP TABLE TempForWeather;
GO

-- Step 2: Create the TempForWeather table using OPENROWSET
CREATE TABLE TempForWeather AS
SELECT 
  DateOfRecording = CAST(
    SUBSTRING([TimeStamp], 1, 4) + '-' +
    SUBSTRING([TimeStamp], 5, 2) + '-' +
    SUBSTRING([TimeStamp], 7, 2) + ' ' +
    SUBSTRING([TimeStamp], 10, 2) + ':' +
    SUBSTRING([TimeStamp], 12, 2) AS DATETIME2(3)
  ),
  * 
FROM OPENROWSET(
  BULK 'https://<StorageAccountNameHere>.blob.core.windows.net/polybasecontainer/Seattle_Weather_History_Sample.csv',
  FORMAT = 'CSV',
  FIRSTROW = 11
) WITH (
  [TimeStamp]           VARCHAR(50),
  SeattleTemperature    FLOAT,
  SeattlePrecip         FLOAT,
  SeattleWindSpeed      FLOAT,
  SeattleWindDirection  FLOAT
) AS r;

SELECT TOP 100 * FROM TempForWeather;
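
OPENROWSET can also feed an INSERT … SELECT into an existing table instead of a CTAS. A minimal sketch is shown below, assuming a TempForWeather table already exists with the same five columns as in the COPY INTO example and using the same placeholder storage path:

-- Minimal sketch: INSERT ... SELECT from OPENROWSET into an existing table
-- (placeholder path; assumes TempForWeather exists with matching columns)
INSERT INTO TempForWeather ([TimeStamp], SeattleTemperature, SeattlePrecip, SeattleWindSpeed, SeattleWindDirection)
SELECT r.[TimeStamp], r.SeattleTemperature, r.SeattlePrecip, r.SeattleWindSpeed, r.SeattleWindDirection
FROM OPENROWSET(
  BULK 'https://<StorageAccountNameHere>.blob.core.windows.net/polybasecontainer/Seattle_Weather_History_Sample.csv',
  FORMAT = 'CSV',
  FIRSTROW = 11
) WITH (
  [TimeStamp]           VARCHAR(50),
  SeattleTemperature    FLOAT,
  SeattlePrecip         FLOAT,
  SeattleWindSpeed      FLOAT,
  SeattleWindDirection  FLOAT
) AS r;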

INSERT … SELECT 

INSERT … SELECT as part of a cross-warehouse query offers performance comparable to COPY INTO while allowing data transformation in a single query for complex loads. In a Fabric workspace, each Lakehouse and Warehouse appears as a separate database on a single server, providing significantly enhanced data integration and accessibility over previous solutions. Loading data into the Lakehouse is beyond the scope of this article, but Lakehouse optimizations such as V-order can speed up the SELECT portion of the query, and thereby the load process. An example of a cross-warehouse query is shown below: 

-- Step 1: Drop the table if it exists
IF OBJECT_ID('TempForWeather') IS NOT NULL DROP TABLE TempForWeather;
GO

-- Step 2: Create the target table for the INSERT … SELECT
CREATE TABLE TempForWeather (
  DateOfRecording       DATETIME2(3) NOT NULL,
  TimeStamp             VARCHAR(50)  NOT NULL,
  SeattleTemperature    FLOAT        NOT NULL,
  SeattlePrecip         FLOAT        NOT NULL,
  SeattleWindSpeed      FLOAT        NOT NULL,
  SeattleWindDirection  FLOAT        NOT NULL
);

-- Step 3: Execute the INSERT … SELECT statement as a cross warehouse query
INSERT INTO [DataflowsStagingWarehouse].[dbo].[TempForWeather]
SELECT 
  DateOfRecording = CAST(
    SUBSTRING([timestamp], 1, 4) + '-' +
    SUBSTRING([timestamp], 5, 2) + '-' +
    SUBSTRING([timestamp], 7, 2) + ' ' +
    SUBSTRING([timestamp], 9, 2) + ':' +
    SUBSTRING([timestamp], 11, 2) + ':' +
    SUBSTRING([timestamp], 13, 2)
    AS DATETIME2(3)
  ),
  [timestamp],
  SeattleTemperature,
  SeattlePrecip,
  SeattleWindSpeed,
  SeattleWindDirection
FROM [DataflowsStagingLakehouse].[dbo].[seattleweather];

-- Step 4: Validate the data
SELECT TOP 100 * FROM [DataflowsStagingWarehouse].[dbo].[TempForWeather];

Note: When loading with cross-warehouse queries, the database context of the query does not matter; the load performs the same whether the query is run from the Lakehouse or the Warehouse.  

Copy Data Activity 

Lastly, the COPY DATA activity in Fabric Pipelines leverages the COPY INTO command to enable high-throughput data ingestion from external Azure storage accounts into the Fabric Data Warehouse. This activity supports direct copying from Azure Blob Storage or ADLS Gen2, ensuring efficient and scalable data ingestion workflows. 

Performance Optimization Strategies 

To ensure efficient data ingestion and optimal performance in the Fabric Data Warehouse, it is essential to consider various factors and techniques.  

The number of files used during data ingestion significantly impacts load performance. Performance benchmarking has shown that splitting the data across roughly 1,000 files can be up to seven times faster than loading a single large file. For example, a 194 GB dataset split into 1,008 uncompressed CSV files of roughly 194 MB each achieved the best throughput.  
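
One practical way to take advantage of many smaller files is to point COPY INTO at a folder using a wildcard so all matching files load in a single statement. A minimal sketch follows, using a hypothetical folder of pre-split CSV files that share the TempForWeather schema and each begin with a single header row:

-- Minimal sketch: load many smaller files in one COPY INTO via a wildcard
-- (hypothetical folder and pattern; assumes all files share the same schema and a one-row header)
COPY INTO TempForWeather
FROM 'https://<storageAccountName>.blob.core.windows.net/polybasecontainer/weather_split/*.csv'
WITH (
  FILE_TYPE       = 'CSV',
  FIRSTROW        = 2,
  FIELDTERMINATOR = ',',
  ROWTERMINATOR   = '\n'
);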

Additionally, the number of parallel loads (threads) used during data ingestion significantly impacts load performance. On capacities below F64, three parallel loads, whether into the same table or into different tables, achieve maximum throughput. On capacities above F64, benchmarks showed throughput continuing to increase up to six parallel loads. Microsoft Fabric also provides built-in warehouse optimizations that enhance performance without requiring user intervention, such as V-order optimization and Round Robin distribution. 

Putting it together 

To efficiently load data into tables in parallel within Fabric Pipelines: 

  1. Encapsulate the loading logic for each table into individual stored procedures using COPY INTO, INSERT … SELECT, or CTAS (a sketch of one such procedure follows this list). 
  2. Invoke these stored procedures within Fabric Pipelines using Stored Procedure activities. 
  3. For scenarios requiring direct data copying, use Copy Data activities to streamline the process. 
  4. Combine Copy Data activities and Stored Procedure activities within a single pipeline when needed. 
  5. Ensure that activities in the pipeline are unconnected by dependencies to enable parallel execution. 
  6. Maximize resource utilization and reduce load times by running these disconnected activities simultaneously. 
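
For step 1, a minimal sketch of such a stored procedure is shown below; the procedure name is illustrative, and it reuses the placeholder storage path from the earlier COPY INTO example:

-- Minimal sketch: wrap one table's load in a stored procedure so a Fabric Pipeline
-- Stored Procedure activity can invoke it (illustrative name, placeholder path)
CREATE PROCEDURE dbo.usp_Load_TempForWeather
AS
BEGIN
  COPY INTO TempForWeather
  FROM 'https://<storageAccountName>.blob.core.windows.net/polybasecontainer/Seattle_Weather_History_Sample.csv'
  WITH (
    FILE_TYPE       = 'CSV',
    FIRSTROW        = 11,
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n'
  );
END;

Each such procedure can then be called from its own Stored Procedure activity; leaving those activities unconnected lets them run in parallel.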

Alternatively, create multiple pipelines that can run simultaneously or independently of each other to handle large-scale data operations. 

Refer to the diagram provided below for guidance on structuring parallel activities in a pipeline: 

 

[Figure: Parallel activities in a data pipeline]

Leaving each activity disconnected means that no activity waits for another before it starts, resulting in parallel execution of the activities.


Conclusion 

Efficient data ingestion pipelines in Microsoft Fabric ensure optimal performance and scalability. Tools like COPY INTO, OPENROWSET, INSERT … SELECT, and the Copy Data activity streamline workflows and handle large datasets effectively. Use COPY INTO for external data, or the Copy Data activity for automation. For data that can be transformed with Spark in the Lakehouse, cross-warehouse queries offer the best performance. 

Optimizations such as parallel loads and targeting the right file counts, along with built-in features like V-order and Round Robin distribution, enhance ingestion with little additional effort. By following these best practices, organizations can maximize the Fabric Data Warehouse's potential. Watch for upcoming articles that build on the concepts and methods here to further refine your loading design.  

For Further Reading 

To explore optimal loading techniques in Fabric Data Warehouse, check out the following resources: 

 
