Disclaimer: The views in this article are my own and do not represent Microsoft or Databricks.
This article is part of a series focused on deploying a secure Medallion Architecture. The series follows a top-down approach, beginning with a high-level architectural perspective and gradually drilling down into implementation details using repeatable code.
Within a Lakehouse, strong security and access controls are essential to govern and protect data processing. This article introduces a deployment pattern for Azure Databricks that uses Lakeflow Jobs and the Medallion Architecture to enforce robust isolation across pipelines. Each pipeline stage runs as its own job per layer, executed by a dedicated Microsoft Entra ID service principal with tightly scoped, least-privilege access.
In short, we apply least privilege consistently across Bronze, Silver, and Gold using Azure components such as ADLS Gen2, Managed Identities (service principals), Azure Key Vault, Lakeflow Jobs, and Unity Catalog to build a secure, auditable platform. We also advocate separation of duties by segregating clusters and using separate storage accounts per layer.
Conceptual Architecture
Medallion Architecture in brief
The ‘Medallion Architecture’ provides a data archetype for managing data across its entire lifecycle (from inception to enrichment, integration, and consumption), organised into sequential, compartmentalised data layers. Data disposition is typically handled by an additional tier, which is not shown in the diagram below.
- Bronze (Raw): Ingests and stores unprocessed data from source systems. Data is typically kept in immutable, append-only Delta tables, optionally enriched with technical fields, and pruned by time-based policies.
- Silver (Refined): Cleans, filters, and transforms Bronze data into a consistent business context (e.g., 3NF or Data Vault) while applying appropriate temporal patterns.
- Gold (Curated): Final, analytics-ready datasets (often dimensional models) supporting dashboards, reports, or APIs. A semantic layer may sit on top to handle metadata, presentation, and performance needs.
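As a concrete illustration of the Bronze bullet above, here is a minimal PySpark sketch of an append-only ingestion with technical fields, intended to run in a Databricks notebook; the landing path and three-level table name are illustrative assumptions.

```python
# Illustrative only: the landing path and table name are assumptions.
from pyspark.sql import functions as F

raw = (spark.read
       .format("json")
       .load("abfss://landing@<source-account>.dfs.core.windows.net/orders/"))

bronze = (raw
          .withColumn("_ingested_at", F.current_timestamp())          # technical field
          .withColumn("_source_file", F.col("_metadata.file_path")))  # file lineage

# Append-only write: no updates or deletes at this layer.
bronze.write.mode("append").saveAsTable("bronze.sales.orders_raw")
```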
The multi-hop design increases business value at each step, simplifies troubleshooting, and enables progressive data quality checks. To fully realise these benefits, pipelines must embed strong security so that each layer is protected and only authorised processes can access it.
Security challenges in Lakehouse pipelines
Running an end-to-end Lakeflow under a single identity with broad permissions creates security risks and governance challenges, such as:
- No least privilege: One user or service principal can read and write across all layers. If compromised or misused, it could access or alter even the most sensitive curated data.
- Human account dependency: Using individual user accounts for production jobs is fragile (people leave or change roles). Interactive users often hold broad permissions that violate least privilege and raise the risk of unauthorised access or accidental exposure.
- Cross-contamination: A defect in an upstream stage could overwrite downstream datasets (e.g., Silver accidentally writing into Gold).
This article proposes an archetype that decouples pipeline execution from any single human identity and securely compartmentalises access across the Medallion Architecture.
Managed identities are special service principal objects in Microsoft Entra ID that simplify and secure authentication between Azure resources and services. By utilising managed identities for automation, we gain several key benefits:
- Decoupled from people: Pipelines run independently of personal accounts, ensuring continuity through personnel changes.
- Granular permissions: Grant only what each pipeline needs (least privilege).
- Auditability: Actions are logged under distinct non-human identities, improving traceability and compliance.
- Consistency: Standardised, repeatable deployments reduce credential drift and configuration errors.
The Azure Databricks Access Connector integrates a system-assigned managed identity (SAMI) with a Databricks workspace, enabling secure, least-privilege access to Unity Catalog governed data in line with identity and access-management best practices.
Secure Lakeflow Design - Isolate Each Medallion Layer
Enforce least privilege by splitting the end-to-end pipeline into three (3) Databricks Lakeflow jobs, one per Medallion layer, plus an overarching Lakeflow job that orchestrates the three (3) underlying jobs.
- Each Lakeflow runs under its own service principal.
- Each connects to its own managed identity for accessing Azure Storage.
- Each identity is granted only the permissions required for that layer.
This creates clear separation of duties. No single principal, cluster, or job has full access across all layers. If Bronze code is compromised, it cannot read or corrupt Silver and Gold because the Bronze identity lacks access to those locations. The same design prevents accidental overwrites (e.g., a Silver bug attempting to write into Gold will be blocked by permissions), as shown in the table below.
Provisioning Steps:
- Create three (3) storage accounts (Bronze, Silver and Gold).
- Create three (3) managed identities/service principals, one (1) per layer, and assign each only the required roles on the corresponding storage account.
- In Databricks Unity Catalog, configure a storage credential and an External Location for each of the three (3) layers (see the sketch below).
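A minimal sketch of the last step, assuming the per-layer storage credentials already exist (for example, created from Access Connector managed identities); the credential and location names and the storage URLs are illustrative.

```python
# Run in a Databricks notebook with privileges to create External Locations.
# Storage credential and location names below are illustrative assumptions.
layers = {
    "bronze": "abfss://bronze@stmedallionbronze.dfs.core.windows.net/",
    "silver": "abfss://silver@stmedallionsilver.dfs.core.windows.net/",
    "gold":   "abfss://gold@stmedalliongold.dfs.core.windows.net/",
}

for layer, url in layers.items():
    spark.sql(f"""
        CREATE EXTERNAL LOCATION IF NOT EXISTS {layer}_location
        URL '{url}'
        WITH (STORAGE CREDENTIAL {layer}_credential)
    """)
```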
Compute Configuration
Apply the same isolation to compute: provision three (3) dedicated clusters, one (1) per layer, instead of sharing a single cluster.
- Bronze cluster: raw ingestion.
- Silver cluster: cleansing, transformation, and integration into consistent business models.
- Gold cluster: curated analytics, optimised for BI/reporting (dimensional/semantic layers).
The key benefits of this multi-cluster approach are:
- Cost control & optimisation: Right-size per workload. Use autoscaling and auto-termination and enforce spend control with cluster policies.
- Security isolation: Restrict user groups to layer-specific clusters; align with Unity Catalog’s fine-grained permissions.
- Tailored tuning:
- Gold: general-purpose or compute-optimised nodes; Photon is optional and often disabled, depending on workload.
- Silver: general-purpose or compute-optimised nodes with Photon acceleration enabled.
- Bronze: general-purpose or compute-optimised nodes; Photon is optional and often disabled, depending on workload.
- Choose the most suitable Databricks Runtime variant per layer (see the configuration sketch after this list).
- Performance & independent scaling: Avoid resource contention. Each stage scales to its own SLA.
- Operational simplicity: Easier monitoring, debugging, tagging, and policy enforcement by layer and environment.
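To make the per-layer tuning concrete, here is a minimal sketch of cluster definitions using Clusters API fields; the node types, runtime version, and sizes are assumptions to tune per workload.

```python
# Shared settings; all values are illustrative assumptions.
base = {
    "spark_version": "15.4.x-scala2.12",   # choose the runtime variant per layer
    "autotermination_minutes": 30,         # avoid idle spend
    "data_security_mode": "SINGLE_USER",   # dedicated to the layer's service principal
}

clusters = {
    "bronze": {**base, "node_type_id": "Standard_D8ds_v5",
               "autoscale": {"min_workers": 1, "max_workers": 4},
               "runtime_engine": "STANDARD"},  # Photon often disabled for raw ingestion
    "silver": {**base, "node_type_id": "Standard_D8ds_v5",
               "autoscale": {"min_workers": 2, "max_workers": 8},
               "runtime_engine": "PHOTON"},    # Photon enabled for heavy transforms
    "gold":   {**base, "node_type_id": "Standard_D8ds_v5",
               "autoscale": {"min_workers": 1, "max_workers": 4},
               "runtime_engine": "STANDARD"},  # enable Photon if BI workloads benefit
}
# These dictionaries can be supplied to the Clusters API or a job's "new_cluster" field.
```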
Managed Identities & Service Principals configuration
Service principals can be Databricks or Microsoft Entra ID managed.
- Databricks managed service principals can authenticate to Databricks using Databricks OAuth (client-credentials / machine-to-machine) or personal access tokens (PATs).
- Microsoft Entra ID managed service principals can authenticate using Databricks OAuth (client-credentials) or Microsoft Entra ID access tokens (OIDC).
In this pattern we authenticate via Microsoft Entra ID tokens, using Entra-managed service principals.
Provisioning steps:
- Create three service principals (Bronze, Silver, Gold).
- Add them to the Databricks Account as Microsoft Entra ID managed.
- In the Account Console, grant each principal User permission on the target workspaces.
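A minimal sketch of the first two steps using the Databricks Python SDK at the account level; the account ID, display names, and application IDs are placeholders, and the exact parameters should be verified against the SDK documentation for your version.

```python
from databricks.sdk import AccountClient

# Authentication (e.g. an account admin's Azure credentials or an OAuth secret)
# is picked up from the environment / .databrickscfg; only placeholders are shown here.
account = AccountClient(
    host="https://accounts.azuredatabricks.net",
    account_id="<databricks-account-id>",
)

# Register each Microsoft Entra ID service principal by its application (client) ID.
entra_sps = {
    "sp-medallion-bronze": "<bronze-application-id>",
    "sp-medallion-silver": "<silver-application-id>",
    "sp-medallion-gold":   "<gold-application-id>",
}
for display_name, app_id in entra_sps.items():
    account.service_principals.create(display_name=display_name, application_id=app_id)

# The workspace "User" permission (step 3) can then be granted in the Account Console,
# or via the workspace assignment API if you prefer automation.
```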
Lakeflow configuration
Connect the components and complete configuration:
- In the Databricks workspace, create External Locations (Unity Catalog) for Bronze, Silver, and Gold, each backed by the appropriate storage credentials.
- Grant each layer’s service principal access to its External Location.
- Grant Browse and Read Files on source storage, and Browse, Read Files and Write Files on target storage, as appropriate for the layer’s job (see the SQL sketch after this list).
- If managing identities across multiple workspaces, ensure access is scoped only to the workspace that runs the Lakeflow (do not grant “all workspaces” global access if it’s not needed).
- Create three (3) cluster policies, one (1) per layer, with configurations aligned to the operations performed in that layer (as previously discussed).
- Create three (3) Lakeflow jobs, one for each layer.
- Create an orchestrator Lakeflow job to execute the three (3) underlying Lakeflow jobs.
- Create three (3) clusters and attach the appropriate cluster policy to each.
- For notebooks in the Lakeflow, grant the relevant service principals Can Run.
- For each Lakeflow, grant the layer’s service principal Can Manage as needed.
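The SQL sketch referenced above, assuming the External Location names created earlier and referencing each service principal by its application (client) ID; all names are illustrative, including the assumed landing_location for raw source files.

```python
# Per-layer grants: read on the source location, read/write on the target location.
# "landing_location" is an assumed External Location for the raw source files.
grants = {
    "bronze": {"sp": "<bronze-application-id>", "read": ["landing_location"], "write": ["bronze_location"]},
    "silver": {"sp": "<silver-application-id>", "read": ["bronze_location"],  "write": ["silver_location"]},
    "gold":   {"sp": "<gold-application-id>",   "read": ["silver_location"],  "write": ["gold_location"]},
}

for layer, g in grants.items():
    for loc in g["read"]:
        spark.sql(f"GRANT BROWSE, READ FILES ON EXTERNAL LOCATION {loc} TO `{g['sp']}`")
    for loc in g["write"]:
        spark.sql(f"GRANT BROWSE, READ FILES, WRITE FILES ON EXTERNAL LOCATION {loc} TO `{g['sp']}`")
```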
Please find below a high-level view of the cluster policy characteristics.
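As an illustrative sketch, a Silver-layer policy definition might look like the following; the node types, runtime version, limits, and tag values are assumptions, and the same shape applies to Bronze and Gold with their own settings.

```python
import json

# Minimal, illustrative cluster policy definition for the Silver layer.
silver_policy = {
    "spark_version":           {"type": "fixed", "value": "15.4.x-scala2.12"},
    "node_type_id":            {"type": "allowlist", "values": ["Standard_D8ds_v5", "Standard_D16ds_v5"]},
    "runtime_engine":          {"type": "fixed", "value": "PHOTON"},
    "autotermination_minutes": {"type": "range", "maxValue": 60, "defaultValue": 30},
    "autoscale.max_workers":   {"type": "range", "maxValue": 8},
    "custom_tags.layer":       {"type": "fixed", "value": "silver"},
}

# Paste the JSON into the cluster policy UI, or pass it to the Cluster Policies API.
print(json.dumps(silver_policy, indent=2))
```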
Choose your table type (managed vs external)
Under Unity Catalog, follow the catalog.schema.table convention. To preserve isolation, create separate catalogs for Bronze, Silver, and Gold. This also supports defining sub-layers, enabling smaller, incremental movements and transformations. In terms of table types, we have the following choices:
- Managed tables: Unity Catalog manages both metadata and data. Dropping a table also deletes the underlying data. Managed tables can enable Predictive Optimization, Automatic liquid clustering (CLUSTER BY AUTO, DBR 15.4 LTS+), and Automatic statistics. Maintenance runs automatically in the background; no extra cluster setup is required.
- External tables: Unity Catalog registers the schema while files live in your storage path (you own permissions, lifecycle, and clean-up). You trade off some automatic optimisations and lineage for direct file ownership.
Note: For managed tables, Unity Catalog writes data to Azure Storage using GUID-based paths for catalogs, schemas, and tables. This obfuscates the physical layout, making it harder to infer table names from storage alone.
Example folder structure you might see in ADLS Gen2:
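The exact prefix and nesting depend on where managed storage is defined (metastore, catalog, or schema level), so treat the following as an illustrative sketch rather than the precise layout:

```
abfss://bronze@<storage-account>.dfs.core.windows.net/
  <managed-storage-prefix>/
    catalogs/<catalog-guid>/
      tables/<table-guid>/
        _delta_log/
        part-00000-<uuid>.snappy.parquet
```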
Therefore, Azure Databricks managed tables are our choice in this design.
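A minimal sketch of a per-layer catalog with managed storage and a managed table using automatic liquid clustering; the catalog, schema, and table names and the managed location are illustrative assumptions.

```python
# Run in a Databricks notebook with metastore-level privileges to create catalogs.
spark.sql("""
    CREATE CATALOG IF NOT EXISTS bronze
    MANAGED LOCATION 'abfss://bronze@stmedallionbronze.dfs.core.windows.net/managed'
""")
spark.sql("CREATE SCHEMA IF NOT EXISTS bronze.sales")

spark.sql("""
    CREATE TABLE IF NOT EXISTS bronze.sales.orders_raw (
        order_id     STRING,
        payload      STRING,
        _ingested_at TIMESTAMP
    )
    CLUSTER BY AUTO  -- automatic liquid clustering (DBR 15.4 LTS+)
""")
```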
Secrets and credentials
Keep secrets out of code, repos, and job parameters. Store them in Azure Key Vault (AKV) and read them at runtime from notebooks via AKV-backed secret scopes.
Provisioning steps (for each environment):
- Create an Azure Key Vault (e.g., kv-dev, kv-prod) and add the secrets required (e.g., API keys, DB passwords, webhooks).
- Create a Databricks secret scope backed by that Key Vault (e.g., kv-dev, kv-prod).
- Grant your layer service principals access to the Key Vault (least privilege).
Read secrets only at runtime using dbutils.secrets.get(...). Keep them in memory and, when you must pass them on, prefer process environment variables over string literals. Implement rotation policies in Azure Key Vault (AKV) so secrets rotate without code changes. Never hard-code secrets or commit them to Git. Do not print or log secret values, avoid display(api_token) and any f-strings that echo them, and don’t pass secrets as plain job parameters, as these appear in run history. Operate with one secret scope per environment and use consistent key names (e.g., api-token, db-password).
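A minimal sketch of runtime secret access, assuming the kv-dev scope and an api-token key from the steps above.

```python
import os

# Read the secret only at runtime; values returned by dbutils.secrets are
# redacted in notebook output, but never print or log them anyway.
api_token = dbutils.secrets.get(scope="kv-dev", key="api-token")

# If the value must be passed to a library or subprocess, prefer an
# environment variable over a string literal or a plain job parameter.
os.environ["API_TOKEN"] = api_token
```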
Enable AKV diagnostic logs for audit and review access regularly. Always validate secret access in a non-production workspace before promoting.
Observability & cost governance
Build visibility in from day one. Enable system tables and use the Jobs monitoring UI so you can track failures, durations, and spend per Medallion layer.
What to enable
- System tables: grant access to the system catalog and query the Jobs system tables in system.lakeflow (for runs, tasks, and timelines) alongside the billing tables in system.billing; see the query sketch after this list. Note that the lakeflow schema was previously named workflow.
- Jobs monitoring UI: use Jobs & Pipelines to view run history, drill into task details, and add notifications; this is useful for day-to-day triage.
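The query sketch referenced above; the table and column names follow the current system-table schemas, but verify them in your workspace before relying on the results.

```python
# DBUs attributed to jobs over the last 30 days.
usage = spark.sql("""
    SELECT usage_metadata.job_id, sku_name, SUM(usage_quantity) AS dbus
    FROM system.billing.usage
    WHERE usage_metadata.job_id IS NOT NULL
      AND usage_date >= date_sub(current_date(), 30)
    GROUP BY usage_metadata.job_id, sku_name
""")

# Recent job run outcomes and durations (seconds).
runs = spark.sql("""
    SELECT job_id, run_id, result_state,
           timestampdiff(SECOND, period_start_time, period_end_time) AS duration_s
    FROM system.lakeflow.job_run_timeline
    WHERE period_start_time >= current_timestamp() - INTERVAL 7 DAYS
""")

usage.show()
runs.show()
```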
Final Words & next steps
Pairing Medallion layering with per-layer identities, storage, and compute enforces least privilege end to end, reduces the blast radius, and makes operations auditable and predictable. Unity Catalog provides the governance backbone with External Locations, storage credentials, managed tables, and system-table observability, so you can track reliability and cost by layer, while Lakeflow Jobs provide orchestration, retries, and alerting for the data pipelines.
In Part II, we’ll publish CI/CD code to deploy this pattern and address a few known challenges, such as cluster reuse across Lakeflow jobs and environment promotion.