Privacy and Security are the topmost priorities for Consumers and Businesses. Bad actors can steal the information when it is in transit between the systems, when it is being processed or even when archived. Large Language Models have opened doors for more effective ways of sanitizing the documents. In this article I will provide some examples of how to sanitize the documents using ChatGPT on Azure.
What is Document Redaction and Sanitization?
It is the process of removing or replacing any information that is considered as sensitive, private or confidential.
What types of information needs redaction?
Following are some sensitive information types that needs redaction:
Personally Identifiable and Information (PII): Anything information that helps identify a person called needs to be redacted. Examples include people's names, addresses, social security numbers, drivers license etc.
Protected Health Information(PHI) : Health related information such as patient’s medical records, insurance group numbers, benefit information etc.
Business Confidential Information: Organizational information such as employee records, biometric records, business related secrets, contractual information, financial documents, judicial records etc.
Using ChatGPT on Azure for Redacting and Sanitizing the documents:
Example 1: Sanitize the sensitive information from an Invoice
The sample invoice I am going to use is in scanned format. As ChatGPT can only take plain text as input currently, we first have to digitalize the invoice. Form Recognizer service is a very effective way of digitizing the scanned documents.
Below is a screenshot from Form Recognizer with the Invoice example:
After extracting the data from the invoice using Form Recognizer and some cleansing, we will have the plain text something like below:
You will find the responses below from ChatGPT for my redaction and sanitizing instructions:
Prompt: Show me all the references of PII below:
Prompt: redact all references of PII data below:
Prompt: Replace all occurrences of Contoso with LinkedIn:
Prompt: convert all dates to MON-DD-YYYY format. show me only dates:
Example 2: Sanitize information from a Health Insurance Card for PHI data:
Similar to the above example, I used the Form Recognizer service to extract the information from a health insurance card.
Below are some examples on how to sanitize the PHI information.
Prompt: show me all references of PHI:
Prompt: redact all references of people names below:
You can continue with various redaction and sanitization activities in the document such as replacing text, removing text, translating text, converting currencies etc. All the Best!
Note 1: The responses may vary depending on the hyperparameters like Temperature, Top Probabilities etc.
Note 2: I also encourage you all to try the same prompts on text-davinci-003 model also.