Synapse Spark - Encryption, Decryption and Data Masking
Published Oct 06 2022 08:00 AM
Microsoft

Author(s): Arshad Ali is a Program Manager in the Azure Synapse Customer Success Engineering (CSE) team.

 

Introduction

As data engineers, we are often asked to encrypt, decrypt, mask, or anonymize certain columns of data in files sitting in the data lake while preparing and transforming data with Apache Spark. Spark's extensibility lets us use libraries that are not native to Spark. One such library is Microsoft Presidio, which provides fast identification and anonymization modules for private entities in text such as credit card numbers, names, locations, social security numbers, bitcoin wallets, US phone numbers, financial data, and more. It facilitates both fully automated and semi-automated PII (Personally Identifiable Information) de-identification and anonymization flows on multiple platforms.

 

In this blog post, I am going to demonstrate, step by step, how to download and use this library to meet the above requirements with a Spark pool in Azure Synapse Analytics.

 

Getting it ready

Microsoft Presidio is an open-source library from Microsoft that can be used with Spark to help ensure private and sensitive data is properly managed and governed. It provides two main modules: the analyzer module, which identifies private entities in text (credit card numbers, names, locations, social security numbers, bitcoin wallets, US phone numbers, financial data, and more), and the anonymizer module, which anonymizes the entities that were identified.

 

Presidio analyzer

The Presidio analyzer is a Python-based service for detecting PII entities in text. During analysis, it runs a set of PII recognizers, each in charge of detecting one or more PII entities using a different mechanism. It comes with a set of predefined recognizers and can easily be extended with custom recognizers. Both predefined and custom recognizers use regex, Named Entity Recognition (NER), and other types of logic to detect PII in unstructured text.
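
To make the regex mechanism concrete, here is a minimal standard-library sketch of what a pattern-based recognizer does. It is not Presidio's implementation: the `Finding` class, the `find_phone_numbers` helper, the pattern, and the 0.5 score are all hypothetical names and values chosen only for illustration.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Finding:
    """One detected entity: its type, character span (end exclusive), and confidence."""
    entity_type: str
    start: int
    end: int
    score: float


# Hypothetical pattern for numbers shaped like 212-555-5555 (illustration only).
US_PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def find_phone_numbers(text: str, score: float = 0.5) -> List[Finding]:
    """Return one Finding per regex match, with the match offsets as the span."""
    return [
        Finding("PHONE_NUMBER", m.start(), m.end(), score)
        for m in US_PHONE.finditer(text)
    ]


findings = find_phone_numbers("My phone number is 212-555-5555")
print(findings)
```

A real recognizer typically combines a pattern like this with context words and validation logic, which is roughly why a library is preferable to hand-written regexes.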

 

You can download this library from PyPI by clicking “Download files” under Navigation on the left of the page: https://pypi.org/project/presidio-analyzer/

 

Presidio anonymizer

The Presidio anonymizer is a Python-based module for anonymizing detected PII text entities with desired values. It supports both anonymization and deanonymization by applying different operators. Operators are built-in text manipulation classes which, like the analyzer's recognizers, can easily be extended with custom ones. The module contains both anonymizers and deanonymizers:

  • Anonymizers replace a PII entity text with some other value by applying a certain operator (e.g., replace, mask, redact, encrypt).
  • Deanonymizers revert the anonymization operation (e.g., decrypt an encrypted text).
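
The idea of an operator can be sketched with plain Python functions applied to detected spans. This is a simplified illustration, not Presidio's code: `replace_op`, `mask_op`, `redact_op`, and `apply_operators` are hypothetical helpers, and the spans are hard-coded assumptions rather than analyzer output.

```python
from typing import Callable, Dict, List, Tuple

# (entity_type, start, end), with end exclusive, like a Python slice.
Span = Tuple[str, int, int]


def replace_op(value: str) -> str:
    """Swap the detected value for a fixed placeholder."""
    return "<ANONYMIZED>"


def mask_op(value: str) -> str:
    """Hide everything except the last four characters."""
    visible = value[-4:]
    return "*" * (len(value) - len(visible)) + visible


def redact_op(value: str) -> str:
    """Remove the detected value entirely."""
    return ""


def apply_operators(
    text: str, spans: List[Span], operators: Dict[str, Callable[[str], str]]
) -> str:
    """Apply the operator registered for each entity type to its span."""
    # Work from the rightmost span backwards so earlier offsets stay valid
    # even when a replacement changes the length of the text.
    for entity_type, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        text = text[:start] + operators[entity_type](text[start:end]) + text[end:]
    return text


text = "My name is David and my number is 212-555-1234"
spans = [("PERSON", 11, 16), ("PHONE_NUMBER", 34, 46)]
masked = apply_operators(text, spans, {"PERSON": replace_op, "PHONE_NUMBER": mask_op})
print(masked)  # My name is <ANONYMIZED> and my number is ********1234
```

Mapping entity types to operators in a dictionary mirrors the way an operator is chosen per entity type in the anonymizer examples later in this post.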

 

This library includes several built-in operators, such as replace, redact, mask, hash, and encrypt.

Step 1 - You can download this library from PyPI by clicking “Download files” under Navigation on the left of the page: https://pypi.org/project/presidio-anonymizer/

Presidio also includes the Presidio Image Redactor, another Python-based module, which detects and redacts PII text entities in images. You can learn more about it here: https://microsoft.github.io/presidio/image-redactor/

 

By default, Presidio uses an open-source NLP model (the en_core_web_lg model from spaCy); however, it can be customized to use other NLP engines, either public or proprietary. You can download the default model from here:

https://spacy.io/models/en#en_core_web_lg

https://github.com/explosion/spacy-models/releases/tag/en_core_web_lg-3.4.0 

Step 2 - Once you have downloaded all three libraries, you can upload them to the Synapse workspace, as documented here (https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-manage-workspace-package...) and shown in the image below:

 

Figure 1 - Upload Libraries to Synapse Workspace

 

Because the NLP engine library is larger than the other two, you might have to wait a couple of minutes for the upload to complete. Once the upload succeeds, you will see a “Succeeded” status message for each of these libraries, as shown below:

Figure 2 - Required libraries uploaded to Synapse Workspace

 

Step 3 - Next, apply these libraries from the Synapse workspace to the Spark pool where you are going to use them. Here are the instructions on how to do that; the screenshot below shows how it looks:

https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-manage-pool-packages

Figure 3 - Applying libraries to Synapse Spark pool

 

Once you hit “Apply”, Synapse triggers a system job to install and cache the specified libraries on the selected Spark pool. This process helps reduce overall session startup time. Once this system job completes successfully, all new sessions will pick up the updated pool libraries.

 

Putting it all together in action

Step 1 - First things first: we need to import the relevant classes and modules from the libraries we just applied to the Spark pool, along with the classes and functions we need from other existing libraries.

 

 

from presidio_analyzer import AnalyzerEngine, PatternRecognizer, EntityRecognizer, Pattern, RecognizerResult
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig
from pyspark.sql.types import StringType
from pyspark.sql.functions import input_file_name, regexp_replace
from pyspark.sql.functions import col, pandas_udf
import pandas as pd 

 

 

Presidio Analyzer

Step 2 - Next, you can use the analyzer module to detect PII entities in text. Here is an example that detects a phone number in the given text.

 

 

# Set up the engine, loads the NLP module (spaCy model by default) and other PII recognizers
analyzer = AnalyzerEngine()
 
# Call analyzer to get results
results = analyzer.analyze(text="My phone number is 212-555-5555",
                           entities=["PHONE_NUMBER"],
                           language='en')
print(results) 

 

 

 

As the output shows, the analyzer detected the phone number, which starts at position 19 and ends at position 31, with a score of 0.75.

 

In addition to the PHONE_NUMBER entity used above, you can use any of the other built-in entities (for example, CREDIT_CARD, EMAIL_ADDRESS, IP_ADDRESS, or US_SSN) or custom-developed entities.

 

For example, the next code uses two entities, PERSON and PHONE_NUMBER, to detect the name of the person and the phone number in the given text:

 

 

# Set up the engine, loads the NLP module (spaCy model by default) and other PII recognizers
analyzer = AnalyzerEngine()
 
# Call analyzer to get results
results = analyzer.analyze(text="My name is David and my number is 212-555-1234",
                           entities=["PERSON", "PHONE_NUMBER"],
                           language='en')
print(results) 

 

 

Presidio anonymizer

Step 3 - Once you have used the analyzer to identify text containing private or sensitive data, you can use the anonymizer to anonymize it with different operators.

 

Anonymization Example

Here is an example of anonymizing the identified sensitive data by using the replace operator. In this example, for simplicity, I am using hard-coded recognizer results; in practice, this information can come directly from the analyzer at runtime.

 

 

# Anonymization Example
 
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import RecognizerResult, OperatorConfig
 
# Initialize the engine.
engine = AnonymizerEngine()
 
# Invoke the anonymize function with the text, 
# analyzer results (potentially coming from presidio-analyzer) and
# Operators to get the anonymization output:
result = engine.anonymize(
    text="My name is Bond, James Bond",
    analyzer_results=[
        RecognizerResult(entity_type="PERSON", start=11, end=15, score=0.8),
        RecognizerResult(entity_type="PERSON", start=17, end=27, score=0.8),
    ],
    operators={"PERSON": OperatorConfig("replace", {"new_value": "<ANONYMIZED>"})},
)
 
print(result) 

 

 

Figure 4 - Person name anonymized by using replace operator
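
The hard-coded start and end values in the example above are ordinary Python slice offsets (the end is exclusive). As a quick sanity check, here is a small standard-library sketch that derives them from the text; `span_of` is a hypothetical helper written for this post, not part of Presidio.

```python
from typing import Tuple


def span_of(text: str, value: str, start_from: int = 0) -> Tuple[int, int]:
    """Return (start, end) of the first occurrence of value at or after start_from."""
    start = text.index(value, start_from)  # raises ValueError if value is absent
    return start, start + len(value)


text = "My name is Bond, James Bond"
first = span_of(text, "Bond")
second = span_of(text, "James Bond")
print(first, second)  # (11, 15) (17, 27)
```

Slicing the text with these offsets, for example `text[17:27]`, returns exactly the value that the operator will replace.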

 

Encryption Example

Step 4 - The next example demonstrates how to use the encrypt operator to encrypt sensitive data identified in the text. Again, for simplicity, I am using hard-coded recognizer results; in practice, this information can come directly from the analyzer at runtime.

I have also hard-coded the encryption key; in a real deployment, you would retrieve it from Azure Key Vault: https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/microsoft-spark-utilities?pivots=prog...

 

 

# Encryption Example
 
encryption_key = "WmZq4t7w!z%C&F)J"  # Hard-coded for the demo only; in production, fetch it from Azure Key Vault
 
engine = AnonymizerEngine()
 
# Invoke the anonymize function with the text,
# analyzer results (potentially coming from presidio-analyzer)
# and an 'encrypt' operator to get an encrypted anonymization output:
anonymize_result = engine.anonymize(
    text="My name is Bond, James Bond",
    analyzer_results=[
        RecognizerResult(entity_type="PERSON", start=11, end=15, score=0.8),
        RecognizerResult(entity_type="PERSON", start=17, end=27, score=0.8),
    ],
    operators={"PERSON": OperatorConfig("encrypt", {"key": encryption_key})},
)
 
anonymize_result 

 

 

Figure 5 - Person name anonymized by using encrypt operator
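
To my understanding, the encrypt operator is AES-based and expects a key of 128, 192, or 256 bits, which is why the sample key in the example above is exactly 16 characters long. The sketch below checks a key's length before it is handed to the operator; `key_length_in_bits` is a hypothetical helper, and the accepted sizes are an assumption based on standard AES key sizes rather than something verified against your installed Presidio version.

```python
VALID_AES_KEY_BITS = (128, 192, 256)  # standard AES key sizes (assumed to be what the operator accepts)


def key_length_in_bits(key: str) -> int:
    """Return the key length in bits, or raise ValueError if it is not a valid AES size."""
    bits = len(key.encode("utf-8")) * 8
    if bits not in VALID_AES_KEY_BITS:
        raise ValueError(
            f"Key is {bits} bits long; expected one of {VALID_AES_KEY_BITS}"
        )
    return bits


print(key_length_in_bits("WmZq4t7w!z%C&F)J"))  # 128
```

Running a check like this right after retrieving the secret makes a misconfigured key fail early with a clear message instead of failing inside the Spark job.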

 

Decryption Example

Step 5 - Just as the encrypt operator encrypts identified private and sensitive data, the decrypt operator decrypts previously encrypted data, using the same key that was used during encryption.

 

 

# Decryption Example
 
from presidio_anonymizer import DeanonymizeEngine
 
# Initialize the engine:
engine = DeanonymizeEngine()
 
# Fetch the anonymized text from the result.
anonymized_text = anonymize_result.text
 
# Fetch the anonymized entities from the result.
anonymized_entities = anonymize_result.items
 
# Invoke the deanonymize function with the text, anonymizer results
# and a 'decrypt' operator to get the original text as output.
deanonymized_result = engine.deanonymize(
    text=anonymized_text,
    entities=anonymized_entities,
    operators={"DEFAULT": OperatorConfig("decrypt", {"key": encryption_key})},
)
 
deanonymized_result 

 

 

Figure 6 - Person name decrypted by using decrypt operator

 

Spark Example – Using it with Dataframe and UDF

Apache Spark is a distributed data processing platform. To use these libraries in Spark, you can encapsulate your logic in a user-defined function (UDF) and then apply that function to a Spark dataframe to anonymize, encrypt, or decrypt data, as demonstrated below for the replace operator.

 

 

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()
broadcasted_analyzer = sc.broadcast(analyzer)
broadcasted_anonymizer = sc.broadcast(anonymizer)
 
# Define a plain Python function that anonymizes one string,
# and a pandas Series function that applies it to each value.
# Note that the analyzer and anonymizer are broadcast to the executors.
 
def anonymize_text(text: str) -> str:
    analyzer = broadcasted_analyzer.value
    anonymizer = broadcasted_anonymizer.value
    analyzer_results = analyzer.analyze(text=text, language="en")
    anonymized_results = anonymizer.anonymize(
        text=text,
        analyzer_results=analyzer_results,
        operators={"DEFAULT": OperatorConfig("replace", {"new_value": "<ANONYMIZED>"})},
    )
    return anonymized_results.text
 
def anonymize_series(s: pd.Series) -> pd.Series:
    return s.apply(anonymize_text)
 
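# Register the Series function as a pandas UDF.
anonymize = pandas_udf(anonymize_series, returnType=StringType())

# Example usage (df and the column name "text" are hypothetical):
# df = df.withColumn("text", anonymize(col("text")))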

 

 

Figure 7 - Sample data before anonymization (for Email and IP Address)

 

Figure 8 - Sample data anonymized for Email and IP Address

 

What we have covered so far only scratches the surface. Presidio includes samples for many other scenarios, which you can find here: https://microsoft.github.io/presidio/samples/

Here is the FAQ: https://microsoft.github.io/presidio/faq/

 

Summary

As we look for better control and governance for compliance, we are often tasked with encrypting, decrypting, masking, or anonymizing certain columns that contain private or sensitive information. In this blog post, I demonstrated how you can use the Microsoft Presidio library with a Spark pool in Azure Synapse Analytics to perform these operations on large volumes of data.

 

Our team will be publishing blogs regularly and you can find all these blogs here: https://aka.ms/synapsecseblog

 

For a deeper understanding of Synapse implementation best practices, please refer to our Success By Design (SBD) site: https://aka.ms/Synapse-Success-By-Design

Last update: Oct 06 2022 09:32 AM