
Analytics on Azure Blog

From Bronze to Gold: Data Quality Strategies for ETL in Microsoft Fabric

Sally_Dabbah
Microsoft
Dec 09, 2025

How to build reusable, automated validation across the Medallion Architecture

Introduction

Data fuels analytics, machine learning, and AI, but only if it’s trustworthy. Most organizations struggle with inconsistent schemas, nulls, data drift, or unexpected upstream changes that silently break dashboards, models, and business logic.

Microsoft Fabric provides a unified analytics platform with OneLake, pipelines, notebooks, and governance capabilities. When combined with Great Expectations, an open-source data quality framework, Fabric becomes a powerful environment for enforcing data quality at scale.

In this article, we explore how to implement enterprise-ready, parameterized data validation inside Fabric notebooks using Great Expectations, including row-count drift detection, schema checks, primary-key uniqueness, and time-series batch validation.

A quick reminder: ETL (Extract, Transform, Load) is the process of pulling raw data from source systems, applying business logic and quality validations, and delivering clean, curated datasets for analytics and AI. While ETL spans the full Medallion architecture, this guide focuses specifically on data quality checks in the Bronze layer using the NYC Taxi sample dataset.

🔗 Full implementation is available in my GitHub repository:
sallydabbahmsft/Data-Quality-Checks-in-Microsoft-Fabric: Data Quality Checks in Microsoft Fabric

Why Data Quality Matters More Than Ever

When AI and analytics initiatives fail, it is usually not because of model quality but because the underlying data is inaccurate, incomplete, or inconsistent. Organizations adopting Microsoft Fabric often ask:

  • How can we validate data as it lands in Bronze?
  • How do we detect schema changes before they break downstream pipelines?
  • How do we prevent silent failures, anomalies, and drift?
  • How do we standardize data quality checks across multiple tables and pipelines?

Great Expectations provides a unified, testable, automation-friendly way to answer these questions.

Great Expectations in Fabric

Great Expectations (GX) is an open-source library for:

✔ Declarative data quality rules ("expectations")
✔ Automated validation during ETL
✔ Rich documentation and reporting
✔ Batch-based validation for time-series or large datasets
✔ Integration with Python, Spark, SQL, and cloud data platforms

Fabric notebooks now support Great Expectations natively (via PySpark), enabling engineering teams to:

  • Build reusable DQ suites
  • Parameterize expectations by pipeline
  • Validate full datasets or daily partitions
  • Integrate validation into Fabric pipelines and alerting
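
To get started in a Fabric notebook, the library just needs to be installed into the current session, for example (a minimal sketch; inline installation with %pip assumes your workspace policy allows it):

# Install Great Expectations into the current notebook session
%pip install great_expectations

# Import the library and confirm the installed version
import great_expectations as gx
print(gx.__version__)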

Data Quality Across the Medallion Architecture

This solution follows the Medallion Architecture, moving data through the Bronze, Silver, and Gold layers while enforcing data quality checks at every stage.

📘 P.S. Fabric also supports this via built-in Medallion task flows:
Task flows overview - Microsoft Fabric | Microsoft Learn

 

🥉 Bronze Layer: Ingestion & Validation

  • Ingest raw source data into Bronze without transformations.
  • Run foundational DQ checks to ensure structural integrity.

Bronze DQ answers:
Did the data arrive correctly?

🥈 Silver Layer: Transformation & Validation

  • Clean, standardize, and enrich Bronze data.
  • Validate business rules, schema consistency, reference values, and more.

Silver DQ answers:
Is the data accurate and logically correct?

🥇 Gold Layer: Enrichment & Consumption

  • Produce curated, analytics-ready datasets.
  • Validate metrics, aggregates, and business KPIs.

Gold DQ answers:
Can executives trust the numbers?

 

Recommended Data Quality Validations

Bronze Layer (Raw Ingestion)

  1. Ingestion Volume & Row Drift – Validate total row count and detect unexpected volume drops or spikes.
  2. Schema & Data Type Compliance – Ensure the table structure and column data types match the expected schema.
  3. Null / Empty Column Checks – Identify missing or empty values in required fields.
  4. Primary Key Uniqueness – Detect duplicate records based on the defined composite or natural key.

Silver Layer (Cleaned & Standardized Data)

  • Reference & Domain Value Validation – Confirm that values match valid categories, lookups, or reference datasets.
  • Business Rule Enforcement – Validate logic constraints (e.g., StartDate <= EndDate, percentages within range).
  • Anomaly / Outlier Detection – Identify unusual patterns or values that deviate from historical behavior.
  • Post-Standardization Deduplication – Ensure standardized and enriched records no longer contain duplicates.
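
A few of these Silver-layer rules can be expressed with Great Expectations in the same style as the Bronze suite shown later. The sketch below is illustrative only: the column names, value sets, and ranges are example NYC Taxi fields and thresholds, not rules from the repository.

# Illustrative Silver-layer expectations (column names and thresholds are examples)
silver_suite = context.suites.add(gx.ExpectationSuite(name="nyc_silver_suite"))

# Reference / domain values: payment type must come from a known lookup set
silver_suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeInSet(column="paymentType", value_set=[1, 2, 3, 4, 5, 6])
)

# Business rule: drop-off must not occur before pickup
silver_suite.add_expectation(
    gx.expectations.ExpectColumnPairValuesAToBeGreaterThanB(
        column_A="lpepDropoffDatetime", column_B="lpepPickupDatetime", or_equal=True
    )
)

# Outlier guard: trip distance should fall inside a plausible range
silver_suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeBetween(column="tripDistance", min_value=0, max_value=200)
)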

Gold Layer (Curated, Business-Ready Data)

  • Metric & Aggregation Consistency – Validate totals, ratios, rollups, and other aggregated metrics.
  • KPI Threshold Monitoring – Trigger alerts when KPIs exceed defined thresholds.
  • Data / Feature Drift Detection (for ML) – Monitor changes in distributions across time.
  • Cross-System Consistency Checks – Compare business metrics across internal systems to ensure alignment.
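
Similarly, metric and KPI checks on a curated Gold table can reuse the same pattern. The sketch below is a hedged example: the column names and thresholds are placeholders, not values from the repository.

# Illustrative Gold-layer expectations on an aggregated trips table (names and bands are placeholders)
gold_suite = context.suites.add(gx.ExpectationSuite(name="nyc_gold_suite"))

# Metric consistency: total revenue for the batch should stay inside an expected band
gold_suite.add_expectation(
    gx.expectations.ExpectColumnSumToBeBetween(column="totalAmount", min_value=50_000, max_value=5_000_000)
)

# KPI threshold monitoring: the average fare should not drift outside a plausible range
gold_suite.add_expectation(
    gx.expectations.ExpectColumnMeanToBeBetween(column="totalAmount", min_value=5, max_value=100)
)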



Implementing Data Quality with Great Expectations in Fabric

Step 1 - Read Data from the Lakehouse (Parameterized):

# Parameterize the source so the same notebook can validate any table
lakehouse_name = "Bronze"
table_name = "NYC Taxi - Green"

# Read the Bronze table into a Spark DataFrame
query = f"SELECT * FROM {lakehouse_name}.`{table_name}`"
df = spark.sql(query)

Step 2 - Create and Register a Suite:

import great_expectations as gx

# Create a GX data context and register an empty expectation suite for the Bronze layer
context = gx.get_context()
suite = context.suites.add(
    gx.ExpectationSuite(name="nyc_bronze_suite")
)

Step 3 - Add Bronze Layer Expectations (Reusable Function):

def add_bronze_expectations(
    suite: gx.ExpectationSuite,
    primary_key_columns: list[str],
    required_columns: list[str],
    expected_schema: list[str],
    expected_row_count: int | None = None,
    max_row_drift_pct: float = 0.2,
) -> gx.ExpectationSuite:
    # 1. Ingestion Count & Row Drift
    if expected_row_count is not None:
        min_rows = int(expected_row_count * (1 - max_row_drift_pct))
        max_rows = int(expected_row_count * (1 + max_row_drift_pct))

        row_count_expectation = gx.expectations.ExpectTableRowCountToBeBetween(
            min_value=min_rows,
            max_value=max_rows,
        )
        suite.add_expectation(expectation=row_count_expectation)

    # 2. Schema Compliance
    schema_expectation = gx.expectations.ExpectTableColumnsToMatchSet(
        column_set=expected_schema,
        exact_match=True,
    )
    suite.add_expectation(expectation=schema_expectation)

    # 3. Required columns: NOT NULL
    for col in required_columns:
        not_null_expectation = gx.expectations.ExpectColumnValuesToNotBeNull(
            column=col
        )
        suite.add_expectation(expectation=not_null_expectation)

    # 4. Primary key uniqueness (if provided)
    if primary_key_columns:
        unique_pk_expectation = gx.expectations.ExpectCompoundColumnsToBeUnique(
            column_list=primary_key_columns
        )
        suite.add_expectation(expectation=unique_pk_expectation)

    return suite
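
A call to this function for the NYC Taxi Bronze table might look like the following. The column names and baseline row count below are illustrative, not taken from the repository:

# Illustrative invocation - adjust columns and baselines to your table
expected_bronze_columns = [
    "vendorID", "lpepPickupDatetime", "lpepDropoffDatetime",
    "passengerCount", "tripDistance", "paymentType", "fareAmount", "totalAmount",
]

suite = add_bronze_expectations(
    suite=suite,
    primary_key_columns=["vendorID", "lpepPickupDatetime", "lpepDropoffDatetime"],
    required_columns=["vendorID", "lpepPickupDatetime"],
    expected_schema=expected_bronze_columns,
    expected_row_count=1_500_000,   # baseline volume used for drift detection
    max_row_drift_pct=0.2,          # allow +/- 20% deviation before flagging drift
)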

Step 4 - Attach Data Asset & Batch Definition:

# Register the DataFrame as a Spark data asset with a whole-dataframe batch definition
data_source = context.data_sources.add_spark(name="bronze_datasource")
data_asset = data_source.add_dataframe_asset(name="nyc_bronze_data")
batch_definition = data_asset.add_batch_definition_whole_dataframe("full_bronze_batch")

Step 5 - Run Validation:

# Bind the suite to the batch definition and execute it against the Bronze DataFrame
validation_definition = gx.ValidationDefinition(
    data=batch_definition,
    suite=suite,
    name="Bronze_DQ_Validation"
)

results = validation_definition.run(
    batch_parameters={"dataframe": df}
)
print(results)
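
Beyond printing, the returned result object can be inspected programmatically; a small sketch, assuming the GX 1.x result layout with a top-level success flag and a per-expectation results list:

# Summarize which expectations failed
failed = [r for r in results.results if not r.success]
print(f"{len(failed)} expectation(s) failed out of {len(results.results)}")
for r in failed:
    print(r)   # each entry describes the failing expectation and the observed values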

Step 6 - Optional: Time-Series Batch Validation (Daily Slices)

Fabric does not yet support add_batch_definition_timeseries, so the notebook implements custom logic to validate each day independently:

from pyspark.sql import functions as F

# Collect the distinct pickup dates present in the dataset
dates = [row["dt"] for row in df.select(F.to_date("lpepPickupDatetime").alias("dt")).distinct().collect()]

# Validate each daily slice independently against the same suite
for d in dates:
    df_day = df.filter(F.to_date("lpepPickupDatetime") == d)
    results = validation_definition.run(batch_parameters={"dataframe": df_day})

This enables:

  1. Daily anomaly detection
  2. Partition-level completeness checks
  3. Early schema drift detection

Automating DQ with Fabric Pipelines

Fabric pipelines can orchestrate your data quality workflow:

  • Trigger notebook after ingestion
  • Pass parameters (table, layer, suite name)
  • Persist DQ results to Lakehouse or Log Analytics
  • Configure alerts in Fabric Monitor
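
As an illustration of the "persist DQ results" step, a small summary row can be appended to a Lakehouse table after each run. This is a sketch only: the table name dq_validation_results and its columns are assumptions, not objects from the repository.

from datetime import datetime, timezone

# Hypothetical results table - adjust the name and schema to your Lakehouse conventions
summary = [(
    "Bronze",                                # layer
    table_name,                              # validated table
    bool(results.success),                   # overall pass/fail
    datetime.now(timezone.utc).isoformat(),  # run timestamp (UTC)
)]
spark.createDataFrame(summary, ["layer", "table_name", "passed", "run_ts"]) \
    .write.mode("append").saveAsTable("dq_validation_results")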

Production workflow

  1. Run the notebook
  2. Check validation results
  3. If failures exist:
    • Raise an incident
    • Fail the pipeline
    • Notify the on-call engineer

This creates a closed loop of ingestion → validation → monitoring → alerting.
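
To make the "fail the pipeline" step concrete, the notebook can simply raise when the suite fails, which marks the notebook activity (and therefore the pipeline run) as failed. A minimal sketch based on the results object from Step 5:

# Fail fast so the orchestrating pipeline surfaces the DQ failure
if not results.success:
    raise RuntimeError(f"Data quality validation failed for {lakehouse_name}.{table_name}")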

An example of a DQ pipeline:

 

Results: 

How Enterprises Benefit

By standardizing data quality rules across all domains, organizations ensure consistent expectations and uniform validation practices. Improved observability makes data quality issues visible and actionable, enabling teams to detect and resolve failures early.

This, in turn, enhances overall reliability, ensuring downstream transformations and Power BI reports operate on clean, trustworthy data.

Ultimately, stronger data quality directly contributes to AI readiness: high-quality, well-validated data produces significantly better analytics and machine learning outcomes.

Conclusion

Great Expectations + Microsoft Fabric creates a scalable, modular, enterprise-ready approach for ensuring data quality across the entire Medallion Architecture. Whether you're validating raw ingested data, transformed datasets, or business-ready tables, the approach demonstrated here enables consistency, observability, and automation across all pipelines.

With Fabric’s unified compute, orchestration, and monitoring, teams can now integrate DQ as a first-class citizen, not an afterthought.

 

Links:

  1. Implement medallion lakehouse architecture in Fabric - Microsoft Fabric | Microsoft Learn
  2. GX Expectations Gallery • Great Expectations