Healthcare and Life Sciences Blog
Microsoft Purview- Paint By Numbers Series (Part 1c) - Exact Data Match (new UI)

James_Havens
Dec 21, 2022

 

Before we start, please note that if you want to see a table of contents for all the sections of this blog and their various Purview topics, you can find it at the following link:

Microsoft Purview- Paint By Numbers Series (Part 0) - Overview - Microsoft Tech Community

 

Disclaimer

This document is not meant to replace any official documentation, including those found at docs.microsoft.com.  Those documents are continually updated and maintained by Microsoft Corporation.  If there is a discrepancy between this document and what you find in the Compliance User Interface (UI) or inside of a reference in docs.microsoft.com, you should always defer to that official documentation and contact your Microsoft Account team as needed.  Links to the docs.microsoft.com data will be referenced both in the document steps as well as in the appendix.

 

All of the following steps should be done with test data, and where possible, testing should be performed in a test environment.  Testing should never be performed against production data.

 

Target Audience

The Exact Data Match (EDM) section of this blog series is aimed at Compliance officers who need to identify not just any PII and PHI data but the exact PII and PHI belonging to their employees and customers/patients.

 

Document Scope

This document is meant to guide an administrator who is “net new” to Microsoft E5 Compliance through:

  • Configuration of an Exact Data Match (EDM) with the new UI
  • Testing of an Exact Data Match (EDM).

 

It is presumed that you already have a Sensitive Information Type (SIT) that you want to use in your Exact Data Match policy.  For the purposes of this document, I will use a copy of the U.S. Social Security Number (SSN) SIT called “U.S. SSN – Numbers Only” that I created in Part 1 of this blog series.

 

Out-of-Scope

This document does not cover any other aspect of Microsoft Purview, including:

  • Data Classification (Sensitive Information Types)
  • Information Protection
  • Data Loss Prevention (DLP) for Exchange, OneDrive, Devices
  • Data Lifecycle Management (retention and disposal)
  • Records Management (retention and disposal)
  • Premium eDiscovery
  • Insider Risk Management (IRM)
  • Priva
  • Advanced Audit
  • Microsoft Cloud App Security (MCAS)
  • Information Barriers
  • Communications Compliance
  • Licensing

It is presumed that you have a pre-existing understanding of what Microsoft Purview does and how to navigate the User Interface (UI).

 

For details on licensing (i.e. which components and functions of Purview are in E3 vs E5) you will need to contact your Microsoft Security Specialist, Account Manager, or certified partner.

 

Overview of Document

We will walk through how to:

  1. Create an Exact Data Match (EDM) using the new User Interface that was published in 2022.
  2. Validate your EDM data.
  3. Hash and upload your EDM file.

Use Case

Exact Data Match (EDM) is used to apply Compliance policies to specific information, not just to anything that fits a pattern.  Here is an example of how to use EDM.

 

Example – You do not want to look for all Social Security Numbers (SSNs) as not all SSNs are your patient or customer data.  Nor are all 9-digit numbers SSNs.
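To make the difference concrete, here is a minimal Python sketch that compares pattern-only detection with a lookup against a known data set. It is an illustration of the idea only, not how Purview implements EDM; the values and the function names (`pattern_matches`, `exact_matches`) are made up for this example.

```python
import re

# Hypothetical values for illustration only; these are not real SSNs.
KNOWN_PATIENT_SSNS = {"123456789", "987654321"}

# Pattern-only detection: any standalone nine-digit number.
NINE_DIGITS = re.compile(r"\b\d{9}\b")


def pattern_matches(text):
    """Return every nine-digit number found in the text."""
    return NINE_DIGITS.findall(text)


def exact_matches(text, known_values):
    """Return only the nine-digit numbers that are in the known data set."""
    return [value for value in pattern_matches(text) if value in known_values]


sample = "Order 555000111 shipped to the patient with SSN 123456789."
print(pattern_matches(sample))                    # both numbers fit the pattern
print(exact_matches(sample, KNOWN_PATIENT_SSNS))  # only the known value remains
```

The pattern flags the order number as well as the SSN, while the exact lookup keeps only the value that belongs to your own data.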

 

Definitions

  • Datastore – This is another term for the data you want to match.  For example, this could be a list of PHI or PII data for your customers or employees.
  • Rule packages – This is an alternate term for an Exact Data Match (EDM) schema.
  • Keyword – A type of SIT; there is a limit of 50 words in total.
  • Keyword dictionary – This type of keyword SIT leverages a file to gather its keywords and has a 100,000 term limit per dictionary.
  • Regex – A regular expression used to perform pattern matching (ex. “\d{5}”).
  • Personal Health Information (PHI) – There are many websites that list what these are.  18 items are normally listed as PHI.  Some of these are Social Security Number, Credit Card Number, address, Date of Birth, and name.
  • Confidence – This is the amount of confidence the policy and administrator can have in what the SIT has discovered in the data content.  Here are the different ways Microsoft Compliance expresses confidence:
    • Older method to express confidence – percentage (65%, 75%, 85%)
    • New method to express confidence – Low, Medium, High
  • Proximity – The distance between SIT components (i.e. keywords, regular expressions, etc.).  The default is 300 characters.
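The regex and proximity definitions above can be sketched together in a few lines of Python. This is a simplified illustration under my own assumptions: the helper name `keyword_near_match` is hypothetical, and the way Purview actually measures the proximity window may differ from this character-gap calculation.

```python
import re

PROXIMITY = 300  # default window in characters, per the definition above


def keyword_near_match(text, pattern, keyword, proximity=PROXIMITY):
    """Return True if `keyword` occurs within `proximity` characters of a regex match."""
    keyword_positions = [
        m.start() for m in re.finditer(re.escape(keyword), text, re.IGNORECASE)
    ]
    for match in re.finditer(pattern, text):
        for pos in keyword_positions:
            if pos >= match.end():
                # Keyword comes after the match.
                gap = pos - match.end()
            elif pos + len(keyword) <= match.start():
                # Keyword comes before the match.
                gap = match.start() - (pos + len(keyword))
            else:
                # Keyword overlaps the match.
                gap = 0
            if gap <= proximity:
                return True
    return False


near_text = "ZIP code 98052"
far_text = "zip" + " " * 400 + "98052"
print(keyword_near_match(near_text, r"\b\d{5}\b", "zip"))
print(keyword_near_match(far_text, r"\b\d{5}\b", "zip"))
```

In the first text the keyword sits a few characters from the five-digit match; in the second it is 400 characters away, which is outside the default window.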

 

Notes

  • To keep everything as simple as possible, we will use ‘hrdata’ as our schema and file names wherever possible.
  • Replication times for Compliance changes to take effect
    • SITs and EDMs should take effect within 15 minutes, but could take much longer depending on how your Tenant is configured for replication, Availability Zones, etc.
    • DLP policies will take approximately 15 minutes to take effect
    • Other Compliance items could take 24-48 hours to take effect

 

 

Pre-requisites

 

Create a .CSV file

For my spreadsheet, I will be using only a handful of names, Employee IDs, etc.  This will make my testing simpler later on.

 

I have created 6 columns with the following names: FName, LName, BirthYear, Country, EmployeeID, and Invention.  These column names will be used when creating an EDM Schema later on.
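If you would rather generate the test file than type it into a spreadsheet, a short Python sketch like the one below produces CSV text with those six column names. The rows are made-up placeholder values, not the data used in this blog, and `build_csv_text` is a name I chose for the example.

```python
import csv
import io

# Column names taken from the spreadsheet described above.
COLUMNS = ["FName", "LName", "BirthYear", "Country", "EmployeeID", "Invention"]

# Made-up rows for illustration only.
SAMPLE_ROWS = [
    ["Ada", "Example", "1980", "US", "E1001", "Widget"],
    ["Lin", "Sample", "1975", "CA", "E1002", "Gadget"],
]


def build_csv_text(rows):
    """Return CSV text with the header row followed by the data rows."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = build_csv_text(SAMPLE_ROWS)
print(csv_text)
```

To save the result, write `csv_text` to a file such as `hrdata.csv` with `open("hrdata.csv", "w", newline="")`. Keep the file to test data only, as noted in the disclaimer.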

 

 

 

Create EDM_DataUploaders Security Group in AAD

 

  1. Go to portal.azure.com and click on Azure Active Directory (AAD)

 

 

 

  2. Click Groups->Add New Group. Name the new group ‘EDM_DataUploaders’ and add the user you are using to access Compliance Manager and/or the user you are using to do your testing.  My account is the MOD_Admin account.

 

 

 

Using the new Exact Data Match (EDM) tool

As of the second half of 2022, there is an additional New EDM experience.  We will walk you through how to use it to upload an “Exact Data Match” data set to the Microsoft Purview Data Classification tool.

 

 

 

If you click “review the end-to-end workflow”, you will see the over-arching steps needed to configure an EDM.

 

 

 

 

 

 

  1. We will start our EDM creation off by clicking Create EDM classifier.

 

 

 

  1. Give the new EDM a name and description.  Then click Next.

 

 

 

  1. Select the method you want to use to define your EDM.  For this blog, I will use the recommended option of Upload a file containing sample data.  Then click Next.
    1. Note1 – click on the link to learn more about the formatting of the sample file.
    2. Note2 - The data file can include a maximum of:

  • 100 million rows of sensitive data
  • 32 columns (fields) per data source
  • 5 columns (fields) marked as searchable
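Before uploading, you could check a file against these limits with a small script. The sketch below is my own illustration, assuming a CSV with a header row; the limit values come from the note above, and `check_edm_limits` is a hypothetical helper, not part of any Microsoft tool.

```python
import csv
import io

# Limits quoted in the note above.
MAX_ROWS = 100_000_000
MAX_COLUMNS = 32
MAX_SEARCHABLE = 5


def check_edm_limits(csv_file, searchable_columns):
    """Return a list of problems found; an empty list means the file fits the limits."""
    reader = csv.reader(csv_file)
    header = next(reader)
    row_count = sum(1 for _ in reader)  # data rows, excluding the header

    problems = []
    if len(header) > MAX_COLUMNS:
        problems.append(f"{len(header)} columns exceeds the limit of {MAX_COLUMNS}")
    if row_count > MAX_ROWS:
        problems.append(f"{row_count} rows exceeds the limit of {MAX_ROWS}")
    if len(searchable_columns) > MAX_SEARCHABLE:
        problems.append(
            f"{len(searchable_columns)} searchable columns exceeds the limit of {MAX_SEARCHABLE}"
        )
    missing = [name for name in searchable_columns if name not in header]
    if missing:
        problems.append(f"searchable columns not in header: {missing}")
    return problems


sample_file = io.StringIO("FName,LName,EmployeeID\nAda,Example,E1001\n")
print(check_edm_limits(sample_file, ["EmployeeID"]))
```

For a real file, pass an open file handle, for example `open("hrdata.csv", newline="")`, instead of the in-memory sample.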

 

 

 

  1. Now upload your sample file.

 

 

 

  1. Once the file is uploaded, you will see the information.  Below is an example of my EDM.  Once you are satisfied with what you see, click Next.

 

 

 

 

  1. Select one or many Primary elements for your Exact Data Match (EDM).  Then click Next.

 

 

 

  1. Configure the Column Settings.  I recommend you select Data in columns are case-insensitive, at the least.  Then click Next.

 

 

 

  1. Note – If you want to ignore delimiters and/or punctuation, select the appropriate button and then select the item you would like to ignore.  We will not select this option at this point.
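As a rough mental model of what these column settings do, the Python sketch below normalizes a value by folding case and dropping ignored characters before comparing. This is only a conceptual illustration; how Purview applies these settings internally is not covered in this blog, and `normalize` is a hypothetical helper.

```python
def normalize(value, case_insensitive=True, ignored_chars=""):
    """Fold case and strip ignored delimiters so equivalent values compare equal."""
    if case_insensitive:
        value = value.casefold()
    return "".join(ch for ch in value if ch not in ignored_chars)


# Case-insensitive comparison of a name column.
print(normalize("SMITH") == normalize("Smith"))

# Ignoring dashes and spaces in a number column.
stored = normalize("123-45-6789", ignored_chars="- ")
found = normalize("123 45 6789", ignored_chars="- ")
print(stored == found)
```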

 

 

 

  1. Configure your detection rules.  This means “how close your Primary element is to the other, supporting elements” in this EDM you are creating.  When you are satisfied, click Next.

 

 

 

  1. Last, review the settings of your EDM and click Submit when you are ready.  Then click Done.

 

 

 

  1. Now that you have uploaded sample data, you can upload your full data set.  You will need to use the EDM Upload Agent tool to hash and upload your full EDM data. We will do this in the EdmUploadAgent sections further down.  Then click Done.

 

 

 

 

  1. Copy the Schema name, as you will need this in the hashing and uploading of your full EDM data set. Then move to the next section.

 

 

EdmUploadAgent

 

Prepping for the EDMUploadAgent

 

With the EdmUploadAgent, you will:

  1. Download your schema from your tenant
  2. Validate that it works
  3. Hash (and salt, if desired) the exact data match values from your spreadsheet and upload them to your tenant.

 

Once that EDM information is in your tenant, you can then proceed to the other steps of the blog for your testing.
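If hashing and salting are new to you, the Python sketch below shows the general idea: the same value and salt always produce the same digest, and the digest does not reveal the original value. This is a conceptual illustration only. The exact hashing scheme used by the EdmUploadAgent is not described in this blog, so treat the SHA-256 choice and the `salted_hash` helper as my assumptions.

```python
import hashlib
import secrets


def salted_hash(value, salt):
    """Return the hex SHA-256 digest of the salt followed by the value."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()


# A random salt; in this sketch it must be reused to get comparable digests.
salt = secrets.token_bytes(16)

digest_a = salted_hash("123456789", salt)
digest_b = salted_hash("123456789", salt)
digest_c = salted_hash("123456789", secrets.token_bytes(16))

print(digest_a == digest_b)  # same value and salt give the same digest
print(digest_a == digest_c)  # a different salt gives a different digest
```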

 

  1. Create working folders
    a. Create a folder to work in. Example:

C:\scripts\

    b. Create a subfolder for your hash.  This is where the hashing file will reside. Example:

C:\scripts\EDM\hash

    c. Create a subfolder for your data.  This is where the schema downloaded from your Compliance tenant will be saved. Example:

C:\scripts\EDM\data
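If you prefer to script the folder creation, here is a minimal Python sketch that builds the same EDM\hash and EDM\data layout under a base folder of your choice. The function name `create_edm_folders` is hypothetical. The demo runs against a temporary scratch directory so nothing is written to your real drive while you try it out.

```python
import tempfile
from pathlib import Path


def create_edm_folders(base):
    """Create <base>/EDM/hash and <base>/EDM/data and return both paths."""
    base = Path(base)
    hash_dir = base / "EDM" / "hash"
    data_dir = base / "EDM" / "data"
    for folder in (hash_dir, data_dir):
        # parents=True also creates <base> and <base>/EDM if they are missing.
        folder.mkdir(parents=True, exist_ok=True)
    return hash_dir, data_dir


# Try it in a scratch directory first.
with tempfile.TemporaryDirectory() as scratch:
    hash_dir, data_dir = create_edm_folders(scratch)
    created = (hash_dir.is_dir(), data_dir.is_dir())

print(created)
```

To create the folders used in this blog, call `create_edm_folders(r"C:\scripts")` on your test machine.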

 

  1. Here is the link to download the EdmUploadAgent.exe agent.
    1. https://go.microsoft.com/fwlink/?linkid=2088639

 

  1. Install EdmUploadAgent.exe
  2. Open a CMD (command line) as an Administrator
  3. Change the directory to the EDM Upload Agent directory
    a. Default directory:

C:\Program Files\Microsoft\EdmUploadAgent

    b. Note – for a list of EdmUploadAgent commands and syntax, use the following command:

EdmUploadAgent.exe /?

 

  1. Authorize the EDM Upload Agent to the proper tenant. 
    1. Run the following command with your Tenant Admin credentials.
    2. This is only needed once and will use your admin account.
    3. Here is the command to authorize:

 

EdmUploadAgent.exe /Authorize

 

  1. Download Schema
    1. You need to download the schema from your tenant to your local computer.  This will be used in the following sections to validate your data and to hash it during upload.
    2. Here is a sample of the command

 

EdmUploadAgent.exe /SaveSchema /DataStoreName employeeidmedicaledmschema /OutputDir C:\scripts\EDM\Data

 

  • Note – ‘employeeidmedicaledmschema’ is the name of the EDM schema you created previously.
  • Note – The ‘C:\scripts\EDM\Data’ directory is where the schema XML file will be placed.

 

Testing your Schema

1. Validate your Schema

a. Here is the syntax you will use to validate that the schema is correct.

 

EdmUploadAgent.exe /ValidateData /DataFile C:\scripts\DemoEmployeeIDsMedicalFull.csv /Schema C:\scripts\edm\Data\employeeidmedicaledmschema.xml

 

  • Note – ‘DemoEmployeeIDsMedicalFull.csv’ is the name of my EDM spreadsheet with my PHI or PII.
  • Note – ‘employeeidmedicaledmschema.xml’ is the name of the schema file that was downloaded.

 

Hash and upload EDM file

1. Hash and upload EDM file

a. Here is the Syntax for hashing and uploading your EDM file into your Tenant’s Compliance Center

 

EdmUploadAgent.exe /UploadData /DataStoreName employeeidmedicaledmschema /DataFile C:\scripts\DemoEmployeeIDsMedicalFull.csv /HashLocation C:\scripts\EDM\Hash /Schema C:\scripts\EDM\Data\employeeidmedicaledmschema.xml

 

  • Note #1 – ‘employeeidmedicaledmschema’ is your schema.  This is the same name we used in the UI.
  • Note #2 – The ‘C:\scripts\EDM\Data’ directory is where the schema XML file was placed and will be read at this time.
  • Note #3 – The ‘C:\scripts\EDM\hash’ directory is where the hash file will be placed during the upload to your tenant.

b. This command will show you the progress (%) of the EDM file upload from the step above.

 

EdmUploadAgent.exe /GetSession /DataStoreName employeeidmedicaledmschema

 

 

 

 

  • Note – If you change your EDM Schema or spreadsheet, you will need to re-run the EDMUploadAgent steps above.
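Because these steps need to be repeated whenever the schema or spreadsheet changes, it can help to generate the command lines from one place. The Python sketch below only builds and prints the four commands shown above; it does not run them. It uses only the switches that appear in this blog, and it assumes the downloaded schema file is named after the data store with an .xml extension, as in my example. The helper name `build_edm_commands` is hypothetical.

```python
AGENT = "EdmUploadAgent.exe"


def build_edm_commands(datastore, data_file, schema_dir, hash_dir):
    """Return the SaveSchema, ValidateData, UploadData and GetSession command lines."""
    # Assumption: the schema file is <schema_dir>\<datastore>.xml, as in the example above.
    schema_file = f"{schema_dir}\\{datastore}.xml"
    return [
        [AGENT, "/SaveSchema", "/DataStoreName", datastore, "/OutputDir", schema_dir],
        [AGENT, "/ValidateData", "/DataFile", data_file, "/Schema", schema_file],
        [AGENT, "/UploadData", "/DataStoreName", datastore, "/DataFile", data_file,
         "/HashLocation", hash_dir, "/Schema", schema_file],
        [AGENT, "/GetSession", "/DataStoreName", datastore],
    ]


commands = build_edm_commands(
    datastore="employeeidmedicaledmschema",
    data_file=r"C:\scripts\DemoEmployeeIDsMedicalFull.csv",
    schema_dir=r"C:\scripts\EDM\Data",
    hash_dir=r"C:\scripts\EDM\Hash",
)
for command in commands:
    print(" ".join(command))
```

Copy the printed lines into your administrator CMD window in the EdmUploadAgent directory, and run them in order after the /Authorize step.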

 

 

Appendix and Links

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Note: This solution is a sample and may be used with Microsoft Compliance tools for dissemination of reference information only. This solution is not intended or made available for use as a replacement for professional and individualized technical advice from Microsoft or a Microsoft certified partner when it comes to the implementation of a compliance and/or advanced eDiscovery solution and no license or right is granted by Microsoft to use this solution for such purposes. This solution is not designed or intended to be a substitute for professional technical advice from Microsoft or a Microsoft certified partner when it comes to the design or implementation of a compliance and/or advanced eDiscovery solution and should not be used as such.  Customer bears the sole risk and responsibility for any use. Microsoft does not warrant that the solution or any materials provided in connection therewith will be sufficient for any business purposes or meet the business requirements of any person or organization.

 

Updated Dec 21, 2022
Version 2.0