Organizations, data analysts, and data scientists need to protect the personally identifiable information (PII) of their clients, such as names, addresses, emails, phone numbers, or social security numbers (SSNs), that they use to build reports and dashboards. PII can pose risks to both the data subjects and the data holders, introducing security breach vulnerabilities, privacy violations, or biases that affect decisions made based on the data. For example, if source data used in a generative AI application contains PII, there's a risk that confidential information, such as bank account details, could be returned in response to a user's prompt. One way to protect PII is to apply responsible AI, a set of principles and practices that help mask PII with synthetic or anonymized data that preserves the statistical properties and structure of the original data but does not reveal the identity or attributes of the individuals.
Let's look at a scenario with Contoso Bank, a fictitious bank that holds a dataset of customers who apply for different kinds of credit: home loans, student loans, and personal loans. They want to use analytics and machine learning techniques to make data-driven decisions, but they also want to ensure that the security and privacy of their customers is not violated by someone accessing PII through a generative AI search. Moreover, they want to avoid biased decisions based on sensitive features such as age, gender, and ethnicity in an AI loan-approval scenario. To achieve these goals, they need responsible AI practices that protect the PII and sensitive features in their dataset, such as data masking, secure and compliant data storage and access, fairness and accuracy monitoring, and transparent and respectful communication.
One possible way to use Azure AI to identify and extract PII information in Microsoft Fabric is:
Use Azure AI Language to detect and categorize PII entities in text data, such as names, addresses, emails, phone numbers, and social security numbers. Note that detecting PII in conversational text uses a different method and is documented separately.
The foundation of Microsoft Fabric is the Lakehouse, which is built on top of the OneLake scalable storage layer and uses the Apache Spark and SQL compute engines for big data processing. A Lakehouse is a unified platform that combines the flexible, scalable storage of a data lake with the querying and analysis capabilities of a data warehouse.
Note: if you have a large file, the Azure AI Language service won't be able to analyze it and identify the PII in your table. See the service documentation for the character and document limits.
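When your text exceeds the per-document character limit of the synchronous analyze-text API (commonly cited as 5,120 characters per document; verify against the current service limits), one workaround is to split the text into several documents within a single request. A minimal sketch in plain Python (the limit value and helper name are illustrative):

```python
# Split a long string into documents that fit under the per-document
# character limit of the Language service (the limit value here is
# illustrative; check the service documentation for the current number).
MAX_CHARS = 5120

def to_documents(text, max_chars=MAX_CHARS, language="en"):
    """Chunk `text` and wrap each chunk in the analyze-text document shape."""
    chunks = [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    return [{"id": str(n + 1), "language": language, "text": chunk}
            for n, chunk in enumerate(chunks)]

docs = to_documents("x" * 12000)
print(len(docs))             # 3 chunks for 12,000 characters
print(len(docs[0]["text"]))  # each chunk is at most 5,120 characters
```

The resulting list can be passed directly as the documents array of the analysisInput payload shown below.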
To identify and mask the PII in the dataset, you can use the PII extraction feature of the Azure AI Language service, calling it from a Microsoft Fabric notebook in your Lakehouse. This can be achieved as follows:
import requests
import pandas as pd
import json

# Your Language service endpoint and key
request_url = 'https://[Your_language_service_endpoint_url]/language/:analyze-text?api-version=2023-04-01'
access_key = 'Your_Language_APIKEY_Here'

# Convert the Spark DataFrame to a pandas DataFrame, then to a JSON string
# that is sent to the service as a single document
pandas_df = df.toPandas()
json_df = pandas_df.to_json()

# Set request data and headers
request_data = {"kind": "PiiEntityRecognition",
                "parameters": {"modelVersion": "2023-09-01"},
                "analysisInput": {"documents": [{"id": "1", "language": "en", "text": json_df}]}}
request_headers = {"Content-Type": "application/json",
                   "Ocp-Apim-Subscription-Key": access_key}

# Store the request results and display them on success
response_data = requests.post(request_url, json=request_data, headers=request_headers)
if response_data.status_code == 200:
    print(response_data.text)
else:
    print(response_data.status_code)

# Convert the JSON string into an object and select the results key
json_output = response_data.json()
json_results = json_output['results']

# Select the redactedText value for the first document
redactedtext = json_results['documents'][0]['redactedText']
display(redactedtext)

# Placeholder: build or paste the formatted JSON for your redacted data here
json_formatted = {}
json_string = json.dumps(json_formatted, indent=4)
json_object = json.loads(json_string)
new_pii_extracted_df = pd.DataFrame(json_object)
display(new_pii_extracted_df)
In the resulting data frame, you can see that the PII information is masked.
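Besides redactedText, each document in the response also carries an entities list with the category and position of every detected PII item, which is useful for auditing what was masked. A small sketch over a hardcoded sample response (the sample values are made up for illustration; the field names follow the PiiEntityRecognition response shape):

```python
# Illustrative sample of a PiiEntityRecognition result (values made up);
# in practice this comes from response_data.json()['results'].
sample_results = {
    "documents": [{
        "id": "1",
        "redactedText": "Customer ********** applied for a loan.",
        "entities": [
            {"text": "John Smith", "category": "Person",
             "offset": 9, "length": 10, "confidenceScore": 0.99}
        ]
    }],
    "errors": [],
    "modelVersion": "2023-09-01"
}

# Summarize which PII categories were detected in each document
for doc in sample_results["documents"]:
    categories = sorted({e["category"] for e in doc["entities"]})
    print(doc["id"], categories)  # 1 ['Person']
```

Logging these categories alongside the masked table gives you a record of which kinds of PII the service found and redacted.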
# Convert the pandas DataFrame back to Spark and save it as a Lakehouse table
# (Fabric notebooks provide the `spark` session automatically)
spark_df = spark.createDataFrame(new_pii_extracted_df)
spark_df.write.mode("append").saveAsTable("PII_Extracted_Bank_Churn_dataset")
display(spark_df)
It is natural to want to move your transformed data into another data source, such as a vector database (for example, Azure Cosmos DB). Within Microsoft Fabric you can build a pipeline that copies the data. This part of the blog assumes that you already have an Azure Cosmos DB account set up in the Azure portal.
To achieve this, you can build a Fabric pipeline with a Copy data activity that reads from the Lakehouse table and writes to Azure Cosmos DB.
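If you prefer a code-first route over the pipeline UI, the azure-cosmos Python SDK can upsert the masked rows directly from the notebook. A minimal sketch, assuming the endpoint, key, database, and container names are placeholders you replace (Cosmos DB requires each item to carry a unique string id):

```python
import pandas as pd

def rows_to_items(pandas_df):
    """Turn DataFrame rows into Cosmos DB items, adding the required 'id'."""
    items = pandas_df.to_dict(orient="records")
    for n, item in enumerate(items):
        item["id"] = str(n)  # Cosmos DB requires a string 'id' per item
    return items

# Example with a tiny masked frame (values illustrative)
items = rows_to_items(pd.DataFrame({"name": ["****", "****"], "loan": [100, 200]}))

# Upsert into Cosmos DB (requires a live account; values below are placeholders):
# from azure.cosmos import CosmosClient
# client = CosmosClient("https://<account>.documents.azure.com:443/", "<key>")
# container = client.get_database_client("<db>").get_container_client("<container>")
# for item in items:
#     container.upsert_item(item)
```

Using upsert_item rather than create_item makes reruns of the notebook idempotent, since existing ids are overwritten instead of raising a conflict.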
The PII feature in the Azure AI Language service enables you to mask the PII in your dataset before you use it in your machine learning/AI models. You can also use this feature with Microsoft Fabric to work with cleansed data and build reports and dashboards that do not expose PII.
To learn more about Microsoft Fabric, you can join this Cloud Skills Challenge.