Robust data protection features of Azure Synapse
Introduction

Data is the lifeblood of any organization. Whether you're dealing with sensitive customer information or financial records, safeguarding your data is non-negotiable. Many organizations face challenges such as: How do you protect data if you don't know where it is? And what level of protection is needed, given that some datasets require more protection than others? Azure Synapse Analytics offers powerful features to help you address these questions, ensuring confidentiality, integrity, and availability. In this blog, we'll explore the data encryption capabilities integrated into Azure Synapse Analytics, discussing encryption techniques for data at rest and in transit, as well as approaches for detecting and categorizing sensitive data in your Synapse workspace.

What is Data Discovery and Classification?

Imagine a company with massive amounts of information stored in its databases. Some columns need extra protection – like Social Security numbers or financial records. Manually finding this sensitive data is a time-consuming nightmare. The good news: there's a better way. Azure Synapse offers a feature called Data Discovery that automates this process.

How does Data Discovery work?

Think of Data Discovery as a super-powered scanner. It automatically goes through every row and column of your data lake or databases, looking for patterns that might indicate sensitive information. Like a smart assistant, it identifies potentially sensitive data and suggests classifications for those columns. Once the discovery process is complete, it provides classification recommendations based on a predefined set of patterns, keywords, and rules. These recommendations can then be reviewed, and sensitivity-classification labels can be applied to the appropriate columns. This process is known as Classification.

What happens after classifying sensitivity labels on columns?

Sensitivity-classification labels are metadata attributes added to the SQL Server database engine. After classifying columns, an organization can leverage these labels to:

Implement fine-grained access controls, so only authorized people with the necessary clearance can access sensitive data.
Mask sensitive data when it is accessed by users who lack the necessary permissions, so they see only anonymized versions of the data.
Monitor access and modification activities on sensitive data (auditing access to sensitive data), so any unusual or unauthorized activity can be flagged for investigation.
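Labels can also be applied and inspected directly with T-SQL rather than through the portal walkthrough below. A minimal sketch; the table and column names (dbo.Customers, SSN) and the label text are placeholders, not part of this article's dataset:

-- Apply a sensitivity label and information type to a column (placeholder names)
ADD SENSITIVITY CLASSIFICATION TO dbo.Customers.SSN
WITH (LABEL = 'Highly Confidential', INFORMATION_TYPE = 'National ID');

-- Review the classifications stored in the database
SELECT o.name AS table_name, c.name AS column_name, sc.label, sc.information_type
FROM sys.sensitivity_classifications sc
JOIN sys.objects o ON sc.major_id = o.object_id
JOIN sys.columns c ON sc.major_id = c.object_id AND sc.minor_id = c.column_id;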
Steps for discovering, classifying, and labelling columns that contain sensitive data in your database

The classification includes two metadata attributes:

Labels: The main classification attributes, used to define the sensitivity level of the data stored in the column.
Information types: Attributes that provide more granular information about the type of data stored in the column.

Step 1 -> Choose an Information Protection policy based on your requirements

The SQL Information Protection policy is a built-in set of sensitivity labels and information types with discovery logic, native to the SQL logical server. You can also customize the policy according to your organization's needs; for more information, see Customize the SQL information protection policy in Microsoft Defender for Cloud (Preview).

Step 2 -> View and apply classification recommendations

The classification engine automatically scans your database for columns containing potentially sensitive data and provides a list of recommended column classifications. Accept recommendations by selecting the check box in the left column and then selecting Accept selected recommendations to apply them. You can also classify columns manually, as an alternative or in addition to the recommendation-based classification. To complete your classification, select Save in the Classification page.

Note: Another option for data discovery and classification is Microsoft Purview, a unified data governance solution that helps manage and govern on-premises, multi-cloud, and software-as-a-service (SaaS) data. It can automate data discovery, lineage identification, and data classification. By producing a unified map of data assets and their relationships, it makes data easily discoverable.

Data Encryption

Data encryption is a fundamental component of data security, ensuring that information is safeguarded both at rest and in transit. Azure Synapse takes care of this responsibility for us, leveraging robust encryption technologies to protect data.

Data at rest

Azure offers various methods of encryption across its different services.

Azure Storage Encryption

By default, Azure Storage encrypts all data at rest using server-side encryption (SSE). It's enabled for all storage types (including ADLS Gen2) and cannot be disabled. SSE uses AES 256 – the 256-bit Advanced Encryption Standard, one of the strongest block ciphers available and FIPS 140-2 compliant – to encrypt and decrypt data transparently. These may sound like hacking terms 😅, but the platform manages the encryption key itself, so we don't need to worry about the details. SSE forms the first layer of data encryption, and this encryption applies to both user and system databases, including the master database.

Note: For additional security, Azure offers the option of double encryption. Infrastructure encryption uses a platform-managed key in conjunction with the SSE key, encrypting data twice with two different encryption algorithms and keys. This provides an extra layer of protection, ensuring that data at rest is highly secure.

Double the protection with Transparent Data Encryption (TDE)

TDE is an industry-standard approach that encrypts the underlying files of the database rather than the data itself, adding a second layer of data encryption. TDE performs real-time I/O encryption and decryption of the data at the page level: each page is decrypted when it's read into memory and encrypted before being written to disk. TDE encrypts the storage of an entire database using a symmetric key called the Database Encryption Key (DEK). When data is written to the database, it is organized into pages and TDE encrypts each page with the DEK before it is written to disk, making it unreadable without the key. When a page is read from disk into memory, TDE decrypts it using the DEK, making the data readable for normal database operations.

Why do we call it transparent? Because the encryption and decryption processes are transparent to applications and users: they cannot tell whether the data is encrypted, since encryption and decryption happen at the database engine level without requiring application awareness or involvement. By default, TDE protects the database encryption key (DEK) with a built-in server certificate managed by Azure. However, organizations can opt for Bring Your Own Key (BYOK), with the key securely stored in Azure Key Vault, offering greater control over encryption keys.
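For a dedicated SQL pool, TDE can be toggled in the Azure portal or with T-SQL. A minimal sketch, assuming a pool named MyDedicatedPool (a placeholder); the ALTER DATABASE statement is run while connected to the master database:

-- Enable TDE for a dedicated SQL pool database (MyDedicatedPool is a placeholder name)
ALTER DATABASE [MyDedicatedPool] SET ENCRYPTION ON;

-- Verify: is_encrypted = 1 once TDE is enabled
SELECT [name], is_encrypted FROM sys.databases WHERE [name] = 'MyDedicatedPool';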
Data in transit

Data encryption in transit is equally crucial for protecting sensitive information as it moves between clients and servers. Azure Synapse uses Transport Layer Security (TLS) to secure data in motion. Azure Synapse dedicated SQL pools and serverless SQL pools use the Tabular Data Stream (TDS) protocol to communicate between the SQL pool endpoint and a client machine. TDS relies on TLS for channel encryption, ensuring all data packets are secured and encrypted between the endpoint and the client. The server certificate used for TLS is signed by a Certificate Authority (CA) and managed by Microsoft. Azure Synapse supports data encryption in transit with TLS v1.2, using AES 256 encryption.
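On the client side, encryption in transit is typically enforced through the connection settings. A hedged example of an ADO.NET-style connection string; the workspace name, database name, and authentication method are placeholders to adapt to your environment:

Server=tcp:<workspace-name>.sql.azuresynapse.net,1433;Database=<dedicated-pool-name>;
Encrypt=True;TrustServerCertificate=False;Connection Timeout=30;
Authentication=Active Directory Default;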
Remove unwanted partition data in Azure Synapse (SQL DW)

Introduction to Partition Switching

Azure Synapse dedicated SQL pool, SQL Server, and Azure SQL Database allow you to create partitions on a target table. Table partitions enable you to divide your data into multiple chunks, or partitions, and improve query performance by eliminating partitions that are not needed. In most cases, partitions are built on a date column.

Why don't we simply drop the unwanted partition?

For several reasons:

Clustered columnstore index (CCI): Dropping a partition directly can lead to performance degradation, especially with a CCI, because CCIs are optimised for data locality and dropping a partition disrupts that organisation. Rebuilding the CCI after dropping the partition would be required, which can be time-consuming for a large table.
Transaction safety: Directly dropping a partition might not be a transactional operation. If the drop operation fails midway, the partition might be left in an inconsistent state, potentially causing data integrity issues.

Requirement to apply partition switching

The definitions of the source and target tables must be the same.

Steps for partition switching in a Synapse SQL pool

Step 1 (Optional) -> Create a credential

Skip this step if you're loading the Contoso public data. Don't skip this step if you're loading your own data. To access data through a credential, use the following script to create a database-scoped credential, then use it when defining the location of the data source.

CREATE MASTER KEY;

CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH
    IDENTITY = 'Managed Identity',
    SECRET = 'https://rnd-learning.vault.azure.net/secrets/synapselearningadls-accesskey/d94c967cb0c5452eafaf5d207afcb86a';

CREATE EXTERNAL DATA SOURCE AzureStorage
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://<blob_container_name>@<azure_storage_account_name>.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);

MASTER KEY is required to encrypt the credential secret in the next step.
IDENTITY refers to the type of authentication you're using. Here I am using Managed Identity, because it allows the Azure Synapse workspace to securely connect to and authenticate with Azure Key Vault without embedding any credentials directly in your code.
TYPE is HADOOP because PolyBase uses Hadoop APIs to access data in Azure Blob Storage.

Step 2 -> Create the external data source

Use this command to store the location of the data, and the data type.

CREATE EXTERNAL DATA SOURCE AzureStorage_west_public
WITH (
    TYPE = Hadoop,
    LOCATION = 'wasbs://contosoretaildw-tables@contosoretaildw.blob.core.windows.net/'
);

Step 3 -> Configure the data format

The data is stored in text files in Azure Blob Storage, and each field is separated with a delimiter. Use this command to specify the format of the data in the text files. The Contoso data is uncompressed and pipe delimited.

CREATE EXTERNAL FILE FORMAT TextFileFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = '|',
        STRING_DELIMITER = '',
        DATE_FORMAT = 'yyyy-MM-dd HH:mm:ss.fff',
        USE_TYPE_DEFAULT = FALSE
    )
);

Step 4 -> Create the schema for the external tables

To create a place to store the Contoso data in your database, create a schema.

CREATE SCHEMA [asb]
GO
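Optionally, before defining the external tables, you can confirm that the external data source and file format were created as expected. A small sanity-check sketch using catalog views:

-- List external data sources and file formats registered in the database
SELECT name, location FROM sys.external_data_sources;
SELECT name, format_type FROM sys.external_file_formats;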
Step 5 -> Create the external tables

Run the following script to create the FactOnlineSales external table. All you're doing here is defining column names and data types, and binding them to the location and format of the Azure Blob Storage files. The definition is stored in the data warehouse, while the data remains in Azure Blob Storage.

CREATE EXTERNAL TABLE [asb].FactOnlineSales
(
    [OnlineSalesKey] [int] NOT NULL,
    [DateKey] [datetime] NOT NULL,
    [StoreKey] [int] NOT NULL,
    [ProductKey] [int] NOT NULL,
    [PromotionKey] [int] NOT NULL,
    [CurrencyKey] [int] NOT NULL,
    [CustomerKey] [int] NOT NULL,
    [SalesOrderNumber] [nvarchar](20) NOT NULL,
    [SalesOrderLineNumber] [int] NULL,
    [SalesQuantity] [int] NOT NULL,
    [SalesAmount] [money] NOT NULL,
    [ReturnQuantity] [int] NOT NULL,
    [ReturnAmount] [money] NULL,
    [DiscountQuantity] [int] NULL,
    [DiscountAmount] [money] NULL,
    [TotalCost] [money] NOT NULL,
    [UnitCost] [money] NULL,
    [UnitPrice] [money] NULL,
    [ETLLoadID] [int] NULL,
    [LoadDate] [datetime] NULL,
    [UpdateDate] [datetime] NULL
)
WITH
(
    LOCATION = '/FactOnlineSales/',
    DATA_SOURCE = AzureStorage_west_public,
    FILE_FORMAT = TextFileFormat,
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 0
);

Step 6 -> Load the data

There are different ways to access external data. You can query data directly from the external tables, load the data into new tables in the data warehouse, or add external data to existing data warehouse tables.

Step 6.1 -> Create a new schema

CREATE SCHEMA [cso]
GO

Step 6.2 -> Load the data into new tables

To load data from Azure Blob Storage into the data warehouse table, use the CREATE TABLE AS SELECT (Transact-SQL) statement. CTAS creates a new table and populates it with the results of a SELECT statement; the new table has the same columns and data types as the results of that statement. If you select all the columns from an external table, the new table is a replica of the columns and data types in the external table.

CREATE TABLE [cso].[FactOnlineSales]
WITH
(
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH([ProductKey]),
    PARTITION ( [DateKey] RANGE RIGHT FOR VALUES
        ('2007-01-01 00:00:00.000','2008-01-01 00:00:00.000',
         '2009-01-01 00:00:00.000','2010-01-01 00:00:00.000') )
)
AS
SELECT * FROM [asb].FactOnlineSales;

This statement creates 5 partitions in the [cso].[FactOnlineSales] table, each covering one year, except the first, which contains all rows with DateKey < 2007-01-01, and the last, which contains all rows with DateKey >= 2010-01-01.

Step 7 -> Create an empty partition table

Now do the same thing for the target table. Here I deliberately created an empty table to switch with the source table.

CREATE TABLE [cso].[FactOnlineSales_out]
WITH
(
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH([ProductKey]),
    PARTITION ( [DateKey] RANGE RIGHT FOR VALUES ('2007-01-01 00:00:00.000') )
)
AS
SELECT * FROM [cso].[FactOnlineSales]
WHERE 1 = 2;

NOTE: If you are switching a partition out (that is, deleting its data from the source), you can switch it out to any table, regardless of how that table is partitioned; here the target is created with a single matching boundary so it can be referenced by partition number. But if you are switching a partition in (replacing it with new data), there is a strict requirement: the same partition boundaries must be defined on both tables.
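Before running the switch in the next step, it helps to confirm which partition number corresponds to which date boundary. A minimal sketch using standard catalog views (these are generally available in a dedicated SQL pool, but verify in your environment); for RANGE RIGHT, the boundary value shown is the exclusive upper bound of that partition, and the last partition has no upper bound:

SELECT DISTINCT p.partition_number, prv.value AS upper_boundary
FROM sys.tables t
JOIN sys.indexes i            ON t.object_id = i.object_id AND i.index_id <= 1
JOIN sys.partitions p         ON i.object_id = p.object_id AND i.index_id = p.index_id
JOIN sys.partition_schemes ps ON i.data_space_id = ps.data_space_id
LEFT JOIN sys.partition_range_values prv
       ON ps.function_id = prv.function_id AND prv.boundary_id = p.partition_number
WHERE t.name = 'FactOnlineSales' AND SCHEMA_NAME(t.schema_id) = 'cso'
ORDER BY p.partition_number;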
Step 8 -> Switch the partition

Here I switched the partition out. After the switch, [cso].[FactOnlineSales_out] holds the data from January 1st, 2007 through December 31st, 2007, while [cso].[FactOnlineSales] has no data in partition 2.

ALTER TABLE [cso].[FactOnlineSales] SWITCH PARTITION 2 TO [cso].[FactOnlineSales_out] PARTITION 2;

NOTE: The command is very simple, but there is one catch: it requires the partition numbers of the source and target tables to perform the switch. Validate the partition switch on both the source and target tables.

Step 9 -> Delete the staging table

Based on your requirements, delete this table or archive its data as cold data.

DROP TABLE [cso].[FactOnlineSales_out];

What happens during the partition switch?

Before the switch: Imagine the data for FactOnlineSales is physically stored on disk, potentially spread across multiple files. Each partition in FactOnlineSales has its own metadata entry that keeps track of the specific locations of its data on disk.

During the switch (using partition X as the example): You identify partition X (containing old data) in FactOnlineSales. The ALTER TABLE SWITCH statement updates the metadata entries for both tables:
In FactOnlineSales, the metadata entry for partition X is modified to point to an empty location on disk, signifying that partition X is now "empty" within FactOnlineSales.
In FactOnlineSales_out, a new metadata entry is created for partition X. This entry points to the same physical location on disk where the data for partition X already resides (the data itself doesn't move).

After the switch: Both FactOnlineSales and FactOnlineSales_out have metadata entries for partition X, but they point to different things:
The FactOnlineSales entry points to an empty location, indicating the partition is no longer actively used within that table.
The FactOnlineSales_out entry points to the actual data location, making it appear as if FactOnlineSales_out "owns" that partition.

How to check or verify the number of partitions?

SQL pool provides system views that expose metadata for all the objects in the pool. One of the views that provides information about partitions, including the number of rows in each partition, is sys.dm_pdw_nodes_db_partition_stats. Use this script to check the partitions and their row counts:

SELECT pnp.partition_number, SUM(nps.[row_count]) AS Row_Count
FROM sys.tables t
INNER JOIN sys.indexes i
    ON t.[object_id] = i.[object_id] AND i.[index_id] <= 1  /* HEAP = 0, CLUSTERED or CLUSTERED_COLUMNSTORE = 1 */
INNER JOIN sys.pdw_table_mappings tm
    ON t.[object_id] = tm.[object_id]
INNER JOIN sys.pdw_nodes_tables nt
    ON tm.[physical_name] = nt.[name]
INNER JOIN sys.pdw_nodes_partitions pnp
    ON nt.[object_id] = pnp.[object_id] AND nt.[pdw_node_id] = pnp.[pdw_node_id] AND nt.[distribution_id] = pnp.[distribution_id]
INNER JOIN sys.dm_pdw_nodes_db_partition_stats nps
    ON nt.[object_id] = nps.[object_id] AND nt.[pdw_node_id] = nps.[pdw_node_id] AND nt.[distribution_id] = nps.[distribution_id] AND pnp.[partition_id] = nps.[partition_id]
WHERE t.name = 'FactOnlineSales'
GROUP BY pnp.partition_number;
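For completeness, the reverse operation – switching refreshed data back in – follows the same pattern. A hedged sketch, using a hypothetical staging table [cso].[FactOnlineSales_in] whose partition boundaries match the target exactly; the target partition must be empty before the switch (which it is here after the switch-out above):

-- 1. Stage replacement data in a table with the same definition and matching partition boundaries
CREATE TABLE [cso].[FactOnlineSales_in]
WITH
(
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH([ProductKey]),
    PARTITION ( [DateKey] RANGE RIGHT FOR VALUES
        ('2007-01-01 00:00:00.000','2008-01-01 00:00:00.000',
         '2009-01-01 00:00:00.000','2010-01-01 00:00:00.000') )
)
AS
SELECT * FROM [asb].FactOnlineSales
WHERE [DateKey] >= '2007-01-01' AND [DateKey] < '2008-01-01';

-- 2. Partition 2 of the staging table now holds the 2007 data; switch it into the empty partition 2 of the target
ALTER TABLE [cso].[FactOnlineSales_in] SWITCH PARTITION 2 TO [cso].[FactOnlineSales] PARTITION 2;

-- 3. Clean up the staging table
DROP TABLE [cso].[FactOnlineSales_in];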
Ways to control who can access data in Azure Synapse DW

Introduction

The assets of a bank are only accessible to a few high-ranking officials, and even they don't have access to individual customers' lockers. These privacy guarantees help build trust among customers. The same applies in the IT world: every user wants their sensitive data to be accessible only to themselves, not even to those with higher privileges within the company. So, as you move data to the cloud, securing your data assets is critical to building trust with your customers and partners. To enable this kind of protection, Azure Synapse supports a wide range of advanced access control features to control who can access what data:

Object-level security
Row-level security
Column-level security
Dynamic data masking
Synapse role-based access control

In this blog we will explore these features.

Object-level security

In Azure Synapse, tables, views, stored procedures, and functions are created as objects. In a dedicated SQL pool these objects can be secured by granting specific permissions to database-level users or groups. For example, you can give SELECT permissions to user accounts or database roles to give access to specific objects.

To assign permission:

GRANT SELECT ON [schema_name].[table_name] TO [user_or_group];

To revoke permission:

REVOKE SELECT ON [schema_name].[table_name] FROM [user_or_group];

Additionally, when you assign a user to the Synapse Administrator RBAC role, they automatically gain full access to all dedicated SQL pools within that workspace, allowing them to perform any action (including managing permissions) across all databases. Similarly, when a user is assigned the Storage Blob Data Contributor role (READ, WRITE, and EXECUTE permissions) on a data lake that is connected to a workspace such as Synapse or Databricks, those permissions are automatically applied to Spark-created tables. This is known as Microsoft Entra pass-through. In my workspace, while the Storage Blob Data Contributor role was assigned to me, I was able to query my Spark-created table; after removing the role, the same query failed with an error.
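As a concrete object-level security example, here is a minimal sketch of creating a database user for a Microsoft Entra identity and granting access through a database role; the principal name and table are placeholders, not objects from this article:

-- Create a database user for a Microsoft Entra principal (placeholder name)
CREATE USER [analyst@contoso.com] FROM EXTERNAL PROVIDER;

-- Group permissions in a role and grant object-level access
CREATE ROLE SalesReaders;
ALTER ROLE SalesReaders ADD MEMBER [analyst@contoso.com];
GRANT SELECT ON OBJECT::[dbo].[FactSales] TO SalesReaders;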
Row-level security

RLS is a mechanism to restrict row-level access (read, write, and so on) based on the user's execution context. A typical use case is a table shared by multiple tenants, where each tenant must be restricted to their own rows. RLS enables this fine-grained access control without having to redesign your data warehouse, and it eliminates the need to use views to filter out rows for access control.

NOTE: The access restriction logic is located in the database tier, and the database applies the restrictions every time the data is accessed from any tier. This makes the security system more reliable and robust by reducing its surface area.

How to implement RLS?

RLS is implemented using a SECURITY POLICY. RLS is a form of predicate-based access control that works by automatically applying a security predicate to all queries on a table. The security policy binds the predicate function to the table, and the predicate function is a user-defined function that determines whether the user executing the query has access to the row. There are two types of security predicates:

Filter predicates: Silently filter out rows that users shouldn't see during SELECT, UPDATE, and DELETE operations. Use these when you want to hide data without disrupting the user experience. For example, in an employee database, a filter predicate can ensure salespeople see only their own customer records; they wouldn't even know that records belonging to other salespeople exist.
Block predicates: Explicitly block write operations (INSERT, UPDATE, DELETE) that violate predefined rules. If a user tries to perform an action that breaks the rules, the operation fails with an error message. Use these when you want to prevent unauthorized modifications.

Implementing filter predicates

Step 1: Create dummy users and a table, and grant read access to these objects.

CREATE SCHEMA Sales
GO

CREATE TABLE Sales.Region
(
    id int,
    SalesRepName nvarchar(50),
    Region nvarchar(50),
    CustomerName nvarchar(50)
);

-- Inserting data
INSERT INTO Sales.Region VALUES (1, 'Mann', 'Central Canada', 'C1');
INSERT INTO Sales.Region VALUES (2, 'Anna', 'East Canada', 'E1');
INSERT INTO Sales.Region VALUES (3, 'Anna', 'East Canada', 'E2');
INSERT INTO Sales.Region VALUES (4, 'Mann', 'Central Canada', 'C2');
INSERT INTO Sales.Region VALUES (6, 'Anna', 'East Canada', 'E3');

-- Creating users
CREATE USER SalesManager WITHOUT LOGIN;
CREATE USER Mann WITHOUT LOGIN;
CREATE USER Anna WITHOUT LOGIN;

-- Granting read access to the users
GRANT SELECT ON Sales.Region TO SalesManager;
GRANT SELECT ON Sales.Region TO Mann;
GRANT SELECT ON Sales.Region TO Anna;

Step 2: Create the security filter predicate function.

-- Creating a schema for the security predicate function
CREATE SCHEMA spf;
GO

CREATE FUNCTION spf.securitypredicatefunc(@SaleRepName AS NVARCHAR(50))
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN SELECT 1 AS securitypredicate_result
WHERE @SaleRepName = USER_NAME() OR USER_NAME() = 'SalesManager';

The function returns a table with a single value of 1 when the WHERE condition is satisfied. SCHEMABINDING ensures that the underlying objects (tables, views, etc.) referenced by the function cannot be dropped or altered while the function exists.

Step 3: Create the security policy that adds the filter predicate and binds the predicate function to the table.

CREATE SECURITY POLICY MySalesFilterPolicy
ADD FILTER PREDICATE spf.securitypredicatefunc(SalesRepName)
ON Sales.Region
WITH (STATE = ON);

Step 4: Test your RLS.

EXECUTE AS USER = 'Mann';
SELECT * FROM Sales.Region ORDER BY id;
REVERT;

When a user (e.g. 'Mann') queries the table, SQL Server automatically invokes the security predicate function for each row. The function is called by SQL Server as part of the query execution plan, so the permissions required to execute it are handled by the engine itself; there is no need to explicitly grant permissions on the function.

Step 5: You can disable RLS by altering the security policy.

ALTER SECURITY POLICY MySalesFilterPolicy
WITH (STATE = OFF);
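The walkthrough above adds only a filter predicate. Block predicates use a similar ALTER SECURITY POLICY syntax in SQL Server and Azure SQL Database, though support in dedicated SQL pools may be limited, so verify before relying on it there. A sketch, reusing the predicate function from Step 2:

-- Block inserts of rows the current user would not be allowed to see
ALTER SECURITY POLICY MySalesFilterPolicy
    ADD BLOCK PREDICATE spf.securitypredicatefunc(SalesRepName)
    ON Sales.Region AFTER INSERT;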
Column-level security

Column-level security (CLS) is similar to RLS, but as the name suggests, it applies at the column level. For example, in financial services, only account managers have access to customer Social Security numbers (SSN), phone numbers, and other personally identifiable information (PII). The method of implementation also differs: CLS is implemented by granting object-level permissions on specific columns.

Implementing CLS

Step 1: Create a dummy user and table.

CREATE USER TestUser WITHOUT LOGIN;

CREATE TABLE Membership
(
    MemberID int IDENTITY,
    FirstName varchar(100) NULL,
    SSN char(9) NOT NULL,
    LastName varchar(100) NOT NULL,
    Phone varchar(12) NULL,
    Email varchar(100) NULL
);

Step 2: Grant the user access to all columns except the sensitive one.

GRANT SELECT ON Membership
(
    MemberID,
    FirstName,
    LastName,
    Phone,
    Email
) TO TestUser;

Step 3: Now if the user tries to select all columns, the query fails with an error.

EXECUTE AS USER = 'TestUser';
SELECT * FROM Membership;

Dynamic data masking

Dynamic data masking (DDM) limits the exposure of sensitive data to users who should not be able to view it. For example, customer service agents may need to access customer records but should not see full credit card numbers, which can be masked. You may ask: why not use CLS, or why not completely restrict access? For these reasons:

CLS completely restricts both reading and altering a column. Masking, by contrast, doesn't prevent updates to the masked column: users who receive masked data when querying the column can still update the data if they have write permissions.
With masking, you can use SELECT INTO or INSERT INTO to copy data from a masked column into another table, where it is stored as masked data (assuming it's exported by a user without UNMASK privileges). With CLS, you can't do anything with a column you don't have access to.

NOTE: Administrative users and roles (such as sysadmin or db_owner) can always view unmasked data via the CONTROL permission, which includes both the ALTER ANY MASK and UNMASK permissions. You can grant or revoke the UNMASK permission at the database, schema, table, or column level to a user, database role, Microsoft Entra identity, or Microsoft Entra group.

Implementing DDM

Step 1: Create a dummy user.

CREATE USER MaskingTestUser WITHOUT LOGIN;

Step 2: Create a table and apply masking to the required columns.

-- Create the schema for the table (if it doesn't exist yet)
CREATE SCHEMA Data;
GO

CREATE TABLE Data.Membership
(
    FirstName VARCHAR(100) MASKED WITH (FUNCTION = 'partial(1, "xxxxx", 1)') NULL,
    LastName VARCHAR(100) NOT NULL,
    Phone VARCHAR(12) MASKED WITH (FUNCTION = 'default()') NULL,
    Email VARCHAR(100) MASKED WITH (FUNCTION = 'email()') NOT NULL,
    DiscountCode SMALLINT MASKED WITH (FUNCTION = 'random(1, 100)') NULL
);

-- Inserting sample data
INSERT INTO Data.Membership VALUES ('Kapil', 'Dev', '555.123.4567', 'kapil@team.com', 10);

Here, I have applied both default and custom masking functions.

Step 3: Grant the SELECT permission on the schema where the table resides. The user sees masked data.

GRANT SELECT ON SCHEMA::Data TO MaskingTestUser;

Step 4: Granting the UNMASK permission allows the user to see unmasked data.

GRANT UNMASK TO MaskingTestUser;

Step 5: Use the ALTER TABLE statement to add a mask to an existing column, or to edit the mask on that column.

ALTER TABLE Data.Membership
ALTER COLUMN LastName ADD MASKED WITH (FUNCTION = 'partial(2, "xxxx", 0)');

ALTER TABLE Data.Membership
ALTER COLUMN LastName VARCHAR(100) MASKED WITH (FUNCTION = 'default()');

Synapse role-based access control

Synapse RBAC leverages built-in roles to assign permissions to users, groups, or other security principals to manage who can:

Publish code artifacts and list or access published code artifacts.
Execute code on Apache Spark pools and integration runtimes.
Access linked (data) services that are protected by credentials.
Monitor or cancel job executions, and review job output and execution logs.