Data Engineering
Ways to control who can access data in Azure Synapse DW
Introduction

The assets of a bank are accessible only to a few high-ranking officials, and even they don't have access to individual customer lockers. These privacy guarantees help build trust with customers. The same applies in the IT world: every user wants their sensitive data to be accessible only to themselves, not even to those with higher privileges within the company. So, as you move data to the cloud, securing data assets is critical to building trust with your customers and partners. To enable this kind of protection, Azure Synapse supports a range of advanced access control features that govern who can access what data:

- Object-level security
- Row-level security
- Column-level security
- Dynamic data masking
- Synapse role-based access control

In this blog we will explore these features.

Object-level security

In Azure Synapse, tables, views, stored procedures, and functions are created as objects. In a dedicated SQL pool these objects can be secured by granting specific permissions to database-level users or roles. For example, you can give SELECT permission to user accounts or database roles to allow access to specific objects (a role-based variant is sketched at the end of this section).

To assign a permission:

GRANT SELECT ON [schema_name].[table_name] TO [user_or_group];

To revoke a permission:

REVOKE SELECT ON [schema_name].[table_name] FROM [user_or_group];

Additionally, when you assign a user the Synapse Administrator RBAC role, they automatically gain full access to all dedicated SQL pools within that workspace, which allows them to perform any action (including managing permissions) across all databases.

In addition, when a user is assigned the Storage Blob Data Contributor role (READ, WRITE, and EXECUTE permissions) on a data lake, and that data lake is connected to a workspace such as Synapse or Databricks, those permissions are automatically applied to Spark-created tables. This is known as Microsoft Entra pass-through. For example, with the Storage Blob Data Contributor role assigned to my account, I can query my Spark-created table; as soon as I remove that role, the same query fails with an error.
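As mentioned above, permissions can also be granted to a database role rather than to individual users, so the grant is managed in one place. A minimal sketch, assuming a user named some_user already exists in the database (the role and user names here are hypothetical):

CREATE ROLE sales_readers;
EXEC sp_addrolemember 'sales_readers', 'some_user';            -- add the user to the role
GRANT SELECT ON [schema_name].[table_name] TO sales_readers;   -- grant once, at the role level

Any permission later granted to or revoked from sales_readers applies to all of its members at once.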
Row-level security

RLS is a mechanism for restricting row-level access (read, write, and so on) based on the context of the user executing the query. A typical use case is a shared table used by multiple tenants to store their data; in that case, we want each tenant to access only their own rows. RLS enables this fine-grained access control without having to redesign your data warehouse, and it eliminates the need to maintain views whose only purpose is to filter out rows for access control.

NOTE: The access restriction logic lives in the database tier, and the database applies it every time the data is accessed, from any tier. This makes the security system more reliable and robust by reducing its surface area.

How to implement RLS?

RLS is implemented with a SECURITY POLICY. It is a form of predicate-based access control that works by automatically applying a security predicate to all queries on a table. The security policy binds a predicate function to the table, and the predicate function is simply a user-defined function that determines whether the user executing the query has access to a given row. There are two types of security predicates:

- Filter predicates: silently filter out rows that users shouldn't see during SELECT, UPDATE, and DELETE operations. Use these when you want to hide data without disrupting the user experience. For example, in an employee database, a filter predicate can ensure salespeople see only their own customer records; they wouldn't even know that records belonging to other salespeople exist.
- Block predicates: explicitly block write operations (INSERT, UPDATE, DELETE) that violate pre-defined rules. If a user tries to perform an action that breaks the rules, the operation fails with an error message. Use these when you want to prevent unauthorized modifications (a syntax sketch appears after the filter-predicate walkthrough below).

Implementing Filter Predicates

Step 1: Create dummy users and a table, and grant read access on these objects.

CREATE SCHEMA Sales
GO

CREATE TABLE Sales.Region (
    id int,
    SalesRepName nvarchar(50),
    Region nvarchar(50),
    CustomerName nvarchar(50)
);

-- Inserting data
INSERT INTO Sales.Region VALUES (1, 'Mann', 'Central Canada', 'C1');
INSERT INTO Sales.Region VALUES (2, 'Anna', 'East Canada', 'E1');
INSERT INTO Sales.Region VALUES (3, 'Anna', 'East Canada', 'E2');
INSERT INTO Sales.Region VALUES (4, 'Mann', 'Central Canada', 'C2');
INSERT INTO Sales.Region VALUES (6, 'Anna', 'East Canada', 'E3');

-- Creating users
CREATE USER SalesManager WITHOUT LOGIN;
CREATE USER Mann WITHOUT LOGIN;
CREATE USER Anna WITHOUT LOGIN;

-- Granting read access to the users
GRANT SELECT ON Sales.Region TO SalesManager;
GRANT SELECT ON Sales.Region TO Mann;
GRANT SELECT ON Sales.Region TO Anna;

Step 2: Create the security filter predicate function.

-- Creating a schema for the security predicate function
CREATE SCHEMA spf;

CREATE FUNCTION spf.securitypredicatefunc(@SaleRepName AS NVARCHAR(50))
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN SELECT 1 AS securitypredicate_result
       WHERE @SaleRepName = USER_NAME() OR USER_NAME() = 'SalesManager';

The function returns a table with the single value 1 when the WHERE condition is satisfied. SCHEMABINDING ensures that the underlying objects (tables, views, etc.) referenced by the function cannot be dropped or altered while the function exists.

Step 3: Create a security policy that adds the filter predicate and binds the predicate function to the table.

CREATE SECURITY POLICY MySalesFilterPolicy
ADD FILTER PREDICATE spf.securitypredicatefunc(SalesRepName)
ON Sales.Region
WITH (STATE = ON);

Step 4: Test your RLS.

EXECUTE AS USER = 'Mann';
SELECT * FROM Sales.Region ORDER BY id;
REVERT;

When a user (e.g., 'Mann') queries the table, SQL Server automatically invokes the security predicate function for each row as part of the query execution plan. Because the function is called internally by the engine, the permissions required to execute it are handled implicitly, so there is no need to grant permissions on the function explicitly.

Step 5: You can disable RLS by altering the security policy.

ALTER SECURITY POLICY MySalesFilterPolicy WITH (STATE = OFF);
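For completeness, a block predicate can be attached to the same policy to stop writes that would violate the rule. This is a minimal sketch of the general T-SQL syntax, reusing the predicate function above (block-predicate support varies by engine, so confirm it is available on your dedicated SQL pool before relying on it):

ALTER SECURITY POLICY MySalesFilterPolicy
ADD BLOCK PREDICATE spf.securitypredicatefunc(SalesRepName)
ON Sales.Region AFTER INSERT;

With this in place, an INSERT whose SalesRepName does not match the current user (and where the user is not SalesManager) fails with an error instead of being silently filtered.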
Column-level security

CLS is similar to RLS, but as its name suggests, it applies at the column level. For example, in financial services, only account managers should have access to customer social security numbers (SSN), phone numbers, and other personally identifiable information (PII). The implementation also differs: CLS is achieved by granting object-level permissions on specific columns.

Implementing CLS

Step 1: Create a dummy user and a table.

CREATE USER TestUser WITHOUT LOGIN;

CREATE TABLE Membership (
    MemberID int IDENTITY,
    FirstName varchar(100) NULL,
    SSN char(9) NOT NULL,
    LastName varchar(100) NOT NULL,
    Phone varchar(12) NULL,
    Email varchar(100) NULL
);

Step 2: Grant the user access to every column except the sensitive one.

GRANT SELECT ON Membership (MemberID, FirstName, LastName, Phone, Email) TO TestUser;

Step 3: Now if the user tries to select all columns, the query fails with an error.

EXECUTE AS USER = 'TestUser';
SELECT * FROM Membership;

Dynamic data masking

DDM limits the exposure of sensitive data to users who should not be able to view it. For example, customer service agents may need to access customer records but should not see full credit card numbers, so those numbers can be masked. You may ask: why not just use CLS, or restrict access completely? For these reasons:

- CLS completely blocks both reading and altering the restricted columns. Masking, on the other hand, does not prevent updates to a masked column: users who see masked values when querying can still update the data if they have write permissions.
- With masking, you can use SELECT INTO or INSERT INTO to copy data from a masked column into another table, and the copied values are stored masked (assuming the export is done by a user without UNMASK privileges). With CLS, you can't do anything with a column you have no access to.

NOTE: Administrative users and roles (such as sysadmin or db_owner) can always view unmasked data via the CONTROL permission, which includes both the ALTER ANY MASK and UNMASK permissions. You can grant or revoke the UNMASK permission at the database, schema, table, or column level to a user, a database role, a Microsoft Entra identity, or a Microsoft Entra group.

Implementing DDM

Step 1: Create a dummy user.

CREATE USER MaskingTestUser WITHOUT LOGIN;

Step 2: Create a table and apply masking to the required columns.

CREATE TABLE Data.Membership (
    FirstName VARCHAR(100) MASKED WITH (FUNCTION = 'partial(1, "xxxxx", 1)') NULL,
    LastName VARCHAR(100) NOT NULL,
    Phone VARCHAR(12) MASKED WITH (FUNCTION = 'default()') NULL,
    Email VARCHAR(100) MASKED WITH (FUNCTION = 'email()') NOT NULL,
    DiscountCode SMALLINT MASKED WITH (FUNCTION = 'random(1, 100)') NULL
);

-- Inserting sample data
INSERT INTO Data.Membership VALUES ('Kapil', 'Dev', '555.123.4567', 'kapil@team.com', 10);

Here I have applied both default and custom masking functions.

Step 3: Grant the SELECT permission on the schema where the table resides. The user now sees masked data.

GRANT SELECT ON SCHEMA::Data TO MaskingTestUser;

Step 4: Granting the UNMASK permission allows the user to see unmasked data.

GRANT UNMASK TO MaskingTestUser;

Step 5: Use the ALTER TABLE statement to add a mask to an existing column, or to change the mask on that column.

ALTER TABLE Data.Membership
ALTER COLUMN LastName ADD MASKED WITH (FUNCTION = 'partial(2,"xxxx",0)');

ALTER TABLE Data.Membership
ALTER COLUMN LastName VARCHAR(100) MASKED WITH (FUNCTION = 'default()');
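To take unmasked access away again, or to scope it more narrowly than the database-wide grant above, the UNMASK permission can be revoked or granted per column. A minimal sketch (granular UNMASK scoping is a newer capability, so verify it is supported on your dedicated SQL pool before relying on it):

-- The user sees masked values again
REVOKE UNMASK FROM MaskingTestUser;

-- Unmask only the Phone column, leaving the other masks in effect
GRANT UNMASK ON Data.Membership(Phone) TO MaskingTestUser;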
Synapse role-based access control

Synapse RBAC leverages built-in roles to assign permissions to users, groups, or other security principals, in order to manage who can:

- Publish code artifacts and list or access published code artifacts.
- Execute code on Apache Spark pools and integration runtimes.
- Access linked (data) services that are protected by credentials.
- Monitor or cancel job executions, and review job output and execution logs.

Ways to load data in Synapse DW from External Data Source
Introduction

After transforming data, we need to save it in a data warehouse for further analysis. This process is part of ETL (Extract, Transform, Load): extracting data from various sources, transforming it as per business needs, and then loading it into the destination. There are many ways to load data into the DW:

- PolyBase
- COPY INTO command
- ADF Copy activity
- Spark Pool in Synapse
- Databricks

In this blog I will explain these approaches and discuss some of their limitations and workarounds.

NOTE: Whenever we use other tools to insert data into Synapse, such as ADF, Data Flow, Spark Pool, or Databricks, it is highly recommended to enable staging (in some cases it is applied automatically under the hood). Staging allows for efficient data transfer and handles large datasets more effectively, because it minimizes the load on both the source and the destination during the transfer. Internally, the data is first loaded into a staging layer (a temporary storage layer) and then loaded from there into the DW. Once the data is in the DW, the temporary data in the staging layer is deleted.

POLYBASE

PolyBase is a technology that uses metadata to allow SQL Server and dedicated SQL pools to query and import data from external data sources using T-SQL.

Step 1: Create a database scoped credential. It defines how to connect to the external data source (Azure Data Lake Storage Gen2). Here I am using a managed identity, which means Synapse uses its own credentials to access the data lake. You also have to give the Storage Blob Data Contributor role to the managed identity of the Synapse workspace.

CREATE DATABASE SCOPED CREDENTIAL adlsgen2synp
WITH IDENTITY = 'MANAGED IDENTITY'
GO

Step 2: Create an external data source. It is a definition of the external data source location.

CREATE EXTERNAL DATA SOURCE taxi_raw_data
WITH (
    LOCATION = 'abfss://raw@synapselearningadls.dfs.core.windows.net/',
    CREDENTIAL = adlsgen2synp,
    TYPE = HADOOP
);
GO

Step 3: Create an external file format. It defines the format of the file, such as CSV.

CREATE EXTERNAL FILE FORMAT csv_file_format
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',',
        STRING_DELIMITER = '"',
        FIRST_ROW = 2,
        USE_TYPE_DEFAULT = FALSE
    )
);
GO

Step 4: Create an external table so that the external data can be queried directly from Synapse without importing it.

CREATE EXTERNAL TABLE temp_table.ext_taxi_zone (
    LocationID SMALLINT,
    Borough VARCHAR(15),
    Zone VARCHAR(50),
    service_zone VARCHAR(15)
)
WITH (
    LOCATION = 'taxi_zone.csv',
    DATA_SOURCE = taxi_raw_data,
    FILE_FORMAT = csv_file_format
);
GO

Step 5: Use a CTAS statement to import the data into the Synapse SQL DW.

CREATE TABLE temp_table.taxi_zone
WITH (DISTRIBUTION = ROUND_ROBIN)
AS
SELECT * FROM temp_table.ext_taxi_zone;

NOTE: PolyBase does not support the DELTA file format. An external table can't be modified; to alter columns, you have to drop the table and recreate it.

COPY INTO command

COPY INTO is a newer way of loading data into the DW. It doesn't need any extra objects: it copies the data directly into the target table. It is faster and simpler than PolyBase.

COPY INTO [temp_table].[taxi_zone]
FROM 'https://synapselearningadls.blob.core.windows.net/raw/taxi_zone.csv'
WITH (
    CREDENTIAL = (IDENTITY = 'MANAGED IDENTITY'),
    FILE_TYPE = 'CSV',
    FIRSTROW = 2
)

NOTE: COPY INTO supports only the CSV, PARQUET, and ORC file types.
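COPY INTO also accepts options for tolerating and logging bad rows, which is useful for large CSV loads. A minimal sketch, reusing the same source file (the reject folder path here is hypothetical; see the COPY statement documentation for the full option list):

COPY INTO [temp_table].[taxi_zone]
FROM 'https://synapselearningadls.blob.core.windows.net/raw/taxi_zone.csv'
WITH (
    CREDENTIAL = (IDENTITY = 'MANAGED IDENTITY'),
    FILE_TYPE = 'CSV',
    FIRSTROW = 2,
    MAXERRORS = 10,                   -- tolerate up to 10 rejected rows before failing the load
    ERRORFILE = '/rejects/taxi_zone'  -- folder within the source container where rejected rows are written
)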
ADF Copy Activity

Step 1: Grant the required permissions to the ADF managed identity so it can access the Synapse database. Here EXTERNAL PROVIDER refers to Microsoft Entra ID.

CREATE USER [adf-rnd-learning] FROM EXTERNAL PROVIDER;
GO
EXEC sp_addrolemember db_owner, [adf-rnd-learning];
GRANT ADMINISTER DATABASE BULK OPERATIONS TO [adf-rnd-learning];

Step 2: Create linked services for ADLS Gen2 (the external data source) and Azure Synapse Analytics. Create datasets for the source file in ADLS and the sink table in the Synapse database. Create a pipeline, add a Copy activity to copy data from ADLS to the Synapse database, set the source and sink datasets, and enable staging. The staged data is stored temporarily and is deleted after it has been loaded into the Synapse database.

NOTE: The ADF Copy activity doesn't support DELTA. A workaround is to use a Data Flow in ADF; while adding the source in the Data Flow, choose the source type Inline.

Spark Pool in Synapse

In a Spark pool we can work with Synapse DW objects with the help of the Spark connector, and the best part is that Synapse automatically manages all authentication for us, so we don't have to manage credentials. The sqlanalytics package contains the Spark connector used to connect to the dedicated SQL pool from the Spark pool. Under the hood, staging data is loaded automatically into a default location.

Databricks

In Databricks, to access Synapse DW objects we first have to manage credentials for both Synapse and ADLS.

Step 1: Give Databricks the required permissions to access ADLS Gen2. Create a service principal in Microsoft Entra ID and add a client secret to it. Assign the Storage Blob Data Contributor role on ADLS to this service principal. Then add the service principal's client_id, tenant_id, and client secret to Azure Key Vault, and create a secret scope in Databricks with the required details of your key vault. Databricks keeps the secret scope UI hidden; to reveal it, append "#secrets/createScope" to the URL of the Databricks homepage.

Step 2: Give the service principal permission to access the Synapse database by executing the query below in the Synapse database.

CREATE USER [databricks_learning_appreg] FROM EXTERNAL PROVIDER;
GO
EXEC sp_addrolemember db_owner, [databricks_learning_appreg];
GRANT ADMINISTER DATABASE BULK OPERATIONS TO [databricks_learning_appreg];

Step 3: Fetch the credentials from the Databricks secret scope.

client_id = dbutils.secrets.get(scope='dblearning-scope', key='db-learning-client-id')
tenant_id = dbutils.secrets.get(scope='dblearning-scope', key='db-learning-tenant-id')
client_secret = dbutils.secrets.get(scope='dblearning-scope', key='db-learning-client-secret')

Step 4: Set the Spark configuration in Databricks for ADLS.

spark.conf.set("fs.azure.account.auth.type.synapselearningadls.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.synapselearningadls.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.synapselearningadls.dfs.core.windows.net", client_id)
spark.conf.set("fs.azure.account.oauth2.client.secret.synapselearningadls.dfs.core.windows.net", client_secret)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.synapselearningadls.dfs.core.windows.net", f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")
Step 5: Set the Spark configuration in Databricks for the Synapse SQL DW.

spark.conf.set("spark.databricks.sqldw.jdbc.service.principal.client.id", client_id)
spark.conf.set("spark.databricks.sqldw.jdbc.service.principal.client.secret", client_secret)

Step 6: Load the data from ADLS into a data frame and perform some transformations.

Step 7: Write the transformed data to the Synapse SQL DW.

df.write.format("sqldw") \
    .option("url", 'jdbc:sqlserver://synapse-rnd-learning.sql.azuresynapse.net:1433;database=prac_dedicated') \
    .option('tempDir', 'abfss://stagging@synapselearningadls.dfs.core.windows.net/stage') \
    .option("forwardSparkAzureStorageCredentials", "false") \
    .option("dbTable", "temp_table.taxi_zone") \
    .option('enableServicePrincipalAuth', 'true') \
    .mode('append') \
    .save()

Step 8: Check the table: the transformed data has been inserted into the Synapse SQL DW.

df2 = spark.read.format("sqldw") \
    .option("url", 'jdbc:sqlserver://synapse-rnd-learning.sql.azuresynapse.net:1433;database=prac_dedicated') \
    .option('tempDir', 'abfss://stagging@synapselearningadls.dfs.core.windows.net/stage') \
    .option("forwardSparkAzureStorageCredentials", "false") \
    .option("dbTable", "temp_table.taxi_zone") \
    .option('enableServicePrincipalAuth', 'true') \
    .load()

Notice that in the PySpark code I define a staging location (tempDir) for both reading and writing.
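Once the write completes, it is worth confirming the load from the dedicated SQL pool side as well. A minimal T-SQL sanity check might look like this:

-- Row count of the loaded table
SELECT COUNT(*) AS row_count FROM temp_table.taxi_zone;

-- Space usage and per-distribution row counts for the table in the dedicated SQL pool
DBCC PDW_SHOWSPACEUSED("temp_table.taxi_zone");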
Protect PII information in your Microsoft Fabric Lakehouse with Responsible AI

Data analysts and data scientists need to protect the personally identifiable information (PII) of their clients, such as names, addresses, emails, phone numbers, or social security numbers, that they use to build reports and dashboards. PII can pose risks to both the data subjects and the data holders and can introduce biases that affect the decisions made based on the data. One way to protect PII is to use Responsible AI, a set of principles and practices that help to mask PII with synthetic or anonymized data that preserves the statistical properties and structure of the original data but does not reveal the identity or attributes of the individuals.

Discover the Future of Data Engineering with Microsoft Fabric for Technical Students & Entrepreneurs
Microsoft Fabric is an all-in-one analytics solution for enterprises that covers everything from data movement to data science, Real-Time Analytics, and business intelligence. It offers a comprehensive suite of services, including data lake, data engineering, and data integration, all in one place. This makes it an ideal platform for technical students and entrepreneurial developers looking to streamline their data engineering and analytics workflows.

Game of Learner Clinics for Data 2022
The Game of Learners Clinics for Data is a 5-week skilling initiative which aims to help you learn and gain technical skills on how to work with data in the cloud. This initiative will also provide you with the option to receive a free voucher to sit for the DP-900 Azure Data Fundamentals exam. Register at https://aka.ms/dataclinics/register

Blobfuse is an open source project developed to provide a virtual filesystem backed by the Azure Blob storage.
First published on MSDN on Feb 20, 2019. Blobfuse uses the libfuse open source library to communicate with the Linux FUSE kernel module, and implements the filesystem operations using the Azure Storage Blob REST APIs.