Microsoft Purview- Paint By Numbers Series (Part 1a) - Exact Data Match (old UI)
Published Aug 19 2021 10:16 AM
Microsoft


 

Before we start, please note that if you want to see a table of contents for all the sections of this blog and their various Purview topics, you can find it at the following link:

Microsoft Purview- Paint By Numbers Series (Part 0) - Overview - Microsoft Tech Community

 

Disclaimer

This document is not meant to replace any official documentation, including those found at docs.microsoft.com.  Those documents are continually updated and maintained by Microsoft Corporation.  If there is a discrepancy between this document and what you find in the Compliance User Interface (UI) or inside of a reference in docs.microsoft.com, you should always defer to that official documentation and contact your Microsoft Account team as needed.  Links to the docs.microsoft.com data will be referenced both in the document steps as well as in the appendix.

All of the following steps should be done with test data, and where possible, testing should be performed in a test environment.  Testing should never be performed against production data.

 

Target Audience

The Exact Data Match (EDM) section of this blog series is aimed at Compliance officers who need to identify not just any PII and PHI data but the exact PII and PHI belonging to their employees and customers/patients.

 

Document Scope

This document is meant to guide an administrator who is “net new” to Microsoft E5 Compliance through:

  • Configuration of an Exact Data Match (EDM).
  • Testing of an Exact Data Match (EDM).

 

It is presumed that you already have a Sensitive Information Type that you want to use in your Exact Data Match policy.  For the purposes of this document, I will use a copy of the U.S. Social Security Number (SSN) called “U.S. SSN – Numbers Only” that I created in Part 1 of this blog series. 

 

Out-of-Scope

This document does not cover any other aspect of Microsoft E5 Compliance, including:

  • Data Classification
  • Data Loss Prevention (DLP) for Exchange, OneDrive, Devices
  • Microsoft Cloud App Security (MCAS)
  • Records Management (retention and disposal)
  • Information Protection
  • Advanced eDiscovery

It is presumed that you have a pre-existing understanding of what Microsoft E5 Compliance does and how to navigate the User Interface (UI).

It is also presumed you are using an existing Sensitive Information Type (SIT) or a SIT you have created for your testing.

 

If you wish to set up and test any of the other aspects of Microsoft E5 Compliance, please refer to Part 1 of this blog series (listed in the link below) for the latest entries to this blog.  That webpage will be updated with any new walkthroughs or Compliance-relevant information, as time allows.

Microsoft Compliance - Paint By Numbers Series (Part 1) - Sensitive Information Types - Microsoft Te...

Overview of Document

  1. Use Case
  2. Definitions
  3. Notes
  4. Pre-requisites
  5. Choose which Sensitive Information Types (or SIT) you wish to use.
  6. Create an Exact Data Match (EDM)
  7. Test your EDM data.
  8. Appendix and Links

 

Use Case

Exact Data Matches (EDM) are used to apply Compliance to specific information, not only a pattern.  Here is an example of how to use EDM. 

Example – You do not want to look for all Social Security Numbers (SSNs), because not all SSNs belong to your patients or customers.  Nor are all 9-digit numbers SSNs.  With EDM, you match only the specific SSNs listed in your own data.

 

Definitions

  • Datastore – another term for the data you want to match.  For example, this could be a list of PHI or PII data for your customers or employees.
  • Rule package – an alternate term for an Exact Data Match (EDM) schema.
  • Keyword – a type of SIT that matches on words; there is a limit of 50 words total.
  • Keyword dictionary – a type of keyword SIT that reads its keywords from a file and has a limit of 100,000 terms per dictionary.
  • Regex – a regular expression used to perform pattern matching (ex. “\d{5}”).
  • Personal Health Information (PHI) – There are many websites that list what these are.  18 items are normally listed as PHI.  Some of these are Social Security Number, Credit Card Number, address, Date of Birth, and Name.
  • Confidence – the amount of confidence the policy and administrator can have in what the SIT has discovered in the content.  Microsoft Compliance has expressed confidence in two ways:
    • Older method to express confidence – percentage (65%, 75%, 85%)
    • New method to express confidence – Low, Medium, High
  • Proximity – the distance between SIT components (e.g. keywords, regular expressions).  The default is 300 characters.
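To make the difference between a pattern match and an exact data match concrete, here is a minimal Python sketch.  It only illustrates the idea and is not how Microsoft Purview implements EDM.  The sample text, the `known_ssns` set, and the function names are all hypothetical.

```python
import re

# Pattern for any 9-digit number (the kind of regex a SIT might use).
NINE_DIGITS = re.compile(r"\b\d{9}\b")

def pattern_matches(text):
    """Return every 9-digit number in the text, whether or not it is ours."""
    return NINE_DIGITS.findall(text)

def exact_matches(text, known_values):
    """Return only the 9-digit numbers that also appear in our datastore."""
    return [m for m in pattern_matches(text) if m in known_values]

# Hypothetical datastore of test values (never use production data).
known_ssns = {"123456789", "987654321"}
sample = "Order 555443333 shipped. Patient SSN 123456789 on file."

print(pattern_matches(sample))            # every 9-digit number found
print(exact_matches(sample, known_ssns))  # only values in the datastore
```

The pattern alone flags the order number as well as the SSN, while the exact match flags only the value that is in the datastore.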

 

Notes

  • To keep everything as simple as possible, we will use ‘hrdata’ as our schema and file names wherever possible.
  • Replication times for Compliance changes to take effect
    • SITs and EDMs should take effect within 15 minutes, but could take much longer depending on how your Tenant is configured for replication, Availability Zones, etc.
    • DLP policies will take approximately 15 minutes to take effect
    • Other Compliance items could take 24-48 hours for changes to take effect

 

 

Pre-requisites

 

Create a .CSV file

 

For my spreadsheet, I will be using only a handful of names and SSNs.  This will make my testing simpler later on.

I have created 3 columns with the following names: FName, LName, SSN.  These column names will be used when creating an EDM Schema later on.

 

James_Havens_0-1628288582844.png
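If you would rather generate the test file than type it into a spreadsheet, a short Python sketch like the one below would do it.  The column names come from this walkthrough; the rows are made-up test values, and the function name and the output file name `hrdata.csv` in the current folder are assumptions for illustration.

```python
import csv

# Column names from this walkthrough; the rows are made-up test values.
HEADER = ["FName", "LName", "SSN"]
ROWS = [
    ["Ann", "Example", "123456789"],
    ["Bob", "Sample-Smith", "987654321"],
]

def write_test_csv(path):
    """Write a small EDM test file with a header row and a few test rows."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        writer.writerows(ROWS)

write_test_csv("hrdata.csv")
```

Keeping the file to a handful of rows makes it easy to confirm later which values should and should not match.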

 

 

Create EDM_DataUploaders Security Group in AAD

 

  1. Go to portal.azure.com and click on Azure Active Directory (AAD)

 

James_Havens_1-1628288582849.png

 

 

  2. Click Groups -> Add New Group.  Name the new group ‘EDM_DataUploaders’ and add the user you are using to access Compliance Manager and/or the user you are using to do your testing.  My account is the MOD_Admin account.

 

James_Havens_2-1628288582854.png

 

James_Havens_3-1628288582855.png

 

 

Create an Exact Data Match (EDM)

 

Create EDM Schema 

 

  1. Go to Data Classification -> Exact data matches -> EDM Schema
  2. Click Create EDM Schema

 

James_Havens_4-1628288582856.png

 

 

  3. Give the schema a name and description.  I will name my schema “hrdata”.

 

James_Havens_5-1628288582857.png

 

 

  4. Deselect Ignore delimiters and punctuation for all schema fields

James_Havens_6-1628288582860.png

 

 

  5. In Schema field #1, use the same word as is found at the top of your first column.  Mine is called “fname”.  I will select Field is case-insensitive.  I will not select Field is searchable.
    a. Note – only select searchable for columns associated with SITs you have created.

James_Havens_7-1628288582863.png

 

 

  6. I will then add 2 more Schema fields: one for “lname”, which is case-insensitive, and one for SSN, which is searchable.  For the SSN, I will use the “U.S. SSN – Numbers Only” SIT I created in Part 1 of this blog series.

 

  7. For each Schema field, under Choose delimiters and punctuation to ignore for this field, select Hyphen (‘-’).  The reason for this is that some names have hyphens and many social security numbers also have hyphens.

 

James_Havens_8-1628288582865.png
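As a rough illustration of what ignoring the hyphen means for matching, the Python sketch below strips hyphens before comparing two values.  This shows the general idea only; the function name is hypothetical, and the service's actual normalization may differ.

```python
def normalize(value, ignored="-"):
    """Remove the ignored delimiter characters before comparing values."""
    return "".join(ch for ch in value if ch not in ignored)

# With the hyphen ignored, both spellings of a value compare as equal.
print(normalize("123-45-6789") == normalize("123456789"))
print(normalize("Smith-Jones") == normalize("SmithJones"))
```

Under this assumption, an SSN written with hyphens in a document would still line up with the same SSN stored without them.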

 

 

 

  8. Click Save and then select your Schema and review that you have it configured properly.  Here is what my schema looks like.

 

James_Havens_9-1628288582871.png

 

 

  9. If everything looks correct, move to the section for creating an EDM sensitive information type.

 

Create EDM Sensitive Information Type

 

  1. Go to Data Classification -> Exact data matches -> EDM sensitive info types
  2. Click Create EDM sensitive info type
  3. The first thing to do in creating an EDM SIT is to add a Data Store Schema.  We created this in the section above.  Click Choose an existing EDM schema.

 

James_Havens_0-1628298818597.png

 

 

  4. In the pop-up, choose the schema from above and click Add.  I’ll be using “hrdata”.

 

James_Havens_1-1628298818601.png

 

 

  5. Your schema fields will appear.  Click Next.

 

James_Havens_2-1628298818635.png

 

 

  6. On the Patterns window, click Create Pattern.

James_Havens_3-1628298818639.png

 

a. Choose your confidence level.  Then your Primary Element.  I’ll be using “ssn”.

James_Havens_4-1628298818643.png

 

b. In the Primary element sensitive info type, click Choose sensitive info type.

 

James_Havens_5-1628298818646.png

 

c. Now we must add the SIT that was created in Part 1 of this blog series.  I will run a search for “ssn” and add my “U.S. SSN – Numbers Only” SIT.  This SIT will then be run against all the specific social security numbers that I will upload in the next section.

 

James_Havens_6-1628298818653.png

 

d. When you are satisfied, click Done and then click Done again and then click Next

  7. In the Confidence Level and Character Proximity section, select what you want.  I will change my confidence to High Confidence level and accept the character proximity default.

James_Havens_7-1628298818660.png

 

 

  8. For the name, I will use “HR Data”, and my description will be “hrdata”.

James_Havens_8-1628298818661.png

 

 

  9. Click Next.  Review and when you are satisfied, click Submit.

 

EdmUploadAgent - Steps to Upload agent

 

With the EdmUploadAgent, you will:

  1. download your schema from your tenant
  2. validate that it works
  3. hash (and salt, if desired) the exact-match data from your spreadsheet and upload it to your tenant.

Once that EDM information is in your tenant, you can then proceed to the other steps of the blog for your testing.
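For readers new to the idea of hashing and salting, here is a minimal Python sketch of the general concept.  The agent's actual hashing scheme is not described in this post, so SHA-256, the 16-byte random salt, and the function name are assumptions chosen purely for illustration.

```python
import hashlib
import os

def salted_hash(value, salt):
    """Return a hex SHA-256 digest of the salt followed by the value."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()

salt = os.urandom(16)  # hypothetical random salt
digest = salted_hash("123456789", salt)

# The same value and salt always give the same digest, so values can be
# compared without storing or sending the clear text.
print(digest == salted_hash("123456789", salt))
print(digest == salted_hash("123456780", salt))
```

The point of the exercise is that only digests, not the clear-text values from your spreadsheet, need to leave your machine.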

 

  1. Create working folders
    a. Create a folder to work in.
      i. Example:

C:\scripts\

 

b. Create a subfolder for your hash.  This is where the hashing file will reside.

i.    Example

 

C:\scripts\EDM\hash

 

c. Create a subfolder for your data.  This is where the schema will be downloaded from your Compliance tenant.

i.     Example:

 

C:\scripts\EDM\Data
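The folder layout used by the commands later in this post (a Hash folder and a Data folder under C:\scripts\EDM) can also be created in one go.  Below is a minimal Python sketch; the function name and the `base` argument are hypothetical, and the example call uses a scratch folder so it can be tried safely first.

```python
from pathlib import Path

def create_edm_folders(base):
    """Create the EDM working folder with its Hash and Data subfolders."""
    root = Path(base) / "EDM"
    created = []
    for name in ("Hash", "Data"):
        folder = root / name
        # parents=True also creates the base and EDM folders if missing.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created

# On the test machine in this walkthrough, base would be C:\scripts.
for folder in create_edm_folders("scripts_test"):
    print(folder)
```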

 

  2. Here is the link to download the EdmUploadAgent.exe agent.
    a. https://go.microsoft.com/fwlink/?linkid=2088639

 

  3. Install EdmUploadAgent.exe
  4. Open a CMD (command line) as an Administrator
  5. Change the directory to the EDM Upload Agent directory
    a. Default directory:

 

C:\Program Files\Microsoft\EdmUploadAgent

 

b. Note – for a list of EdmUploadAgent commands and syntax use the following command:

 

EdmUploadAgent.exe /?

 

  6. Authorize the EDM Upload Agent to the proper tenant.
    a. Run the following command with your Tenant Admin credentials.
    b. This is only needed once and will use your admin account.
      i. Here is the command to authorize:

 

EdmUploadAgent.exe /Authorize

 

  7. Download Schema
    a. You need to download the schema from your tenant to your local computer.  It is used later to validate your data file (step 8) and to hash your data during upload (step 9).
    b. Here is a sample of the command

 

EdmUploadAgent.exe /SaveSchema /DataStoreName hrdata /OutputDir C:\scripts\EDM\Data

 

  • Note – ‘hrdata’ is the name of the EDM schema you created previously.
  • Note – The ‘C:\scripts\EDM\Data’ directory is where the schema XML file will be placed.

Testing your Schema

8. Validate your Schema

a. Here is the syntax you will use to validate that your data file matches the schema.

 

EdmUploadAgent.exe /ValidateData /DataFile C:\scripts\EDM\hrdata.csv /Schema C:\scripts\EDM\Data\hrdata.xml

 

  •       Note – ‘hrdata.csv’ is the name of my EDM spreadsheet with my PHI or PII
  •       Note – ‘hrdata.xml’ is the name of the schema file that was downloaded
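Before running the validation command, it can help to do a quick sanity check of your own that the column headers in the spreadsheet line up with the schema field names.  The Python sketch below is one hedged way to do that.  It does not read the downloaded XML (its layout is not covered here); instead you pass in the field names you typed into the schema.  The function name is hypothetical, and comparing names without regard to case is an assumption.

```python
import csv
import io

def header_mismatches(csv_text, schema_fields):
    """Compare the CSV header row with the schema field names, ignoring case."""
    header = next(csv.reader(io.StringIO(csv_text)))
    have = {h.strip().lower() for h in header}
    want = {f.lower() for f in schema_fields}
    return {"missing": sorted(want - have), "extra": sorted(have - want)}

sample = "FName,LName,SSN\nAnn,Example,123456789\n"
print(header_mismatches(sample, ["fname", "lname", "ssn"]))
```

Empty "missing" and "extra" lists mean the headers and the field names agree; anything else is worth fixing before you run the agent.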

 

Hash and upload EDM file

9. Hash and upload EDM file

a. Here is the Syntax for hashing and uploading your EDM file into your Tenant’s Compliance Center

 

EdmUploadAgent.exe /UploadData /DataStoreName hrdata /DataFile C:\scripts\EDM\hrdata.csv /HashLocation C:\scripts\EDM\Hash /Schema C:\scripts\EDM\Data\hrdata.xml

 

  • Note #1 – DataStoreName is your schema.  This means we will use the name ‘hrdata’ which we used in the UI.
  • Note #2 – The ‘C:\scripts\EDM\Data’ directory is where the schema XML file was placed and will be read at this time.
  • Note #3 – The ‘C:\scripts\EDM\hash’ directory is where the hash file will be placed during the upload to your tenant.
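If you expect to re-run the upload often, you may prefer to assemble the command in a script rather than retype it.  The Python sketch below only builds the argument list from the flags shown above and prints it; the function name is hypothetical, and nothing is executed.

```python
def upload_command(datastore, data_file, hash_dir, schema_file):
    """Build the /UploadData argument list using the flags shown above."""
    return [
        "EdmUploadAgent.exe", "/UploadData",
        "/DataStoreName", datastore,
        "/DataFile", data_file,
        "/HashLocation", hash_dir,
        "/Schema", schema_file,
    ]

cmd = upload_command(
    "hrdata",
    r"C:\scripts\EDM\hrdata.csv",
    r"C:\scripts\EDM\Hash",
    r"C:\scripts\EDM\Data\hrdata.xml",
)
print(" ".join(cmd))
# To actually run it from the agent's folder, you could import subprocess
# and call subprocess.run(cmd, check=True).
```

Keeping the paths in one place makes it harder to point the hash location and the schema at the wrong folders.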

 

  10. This command will tell you the percentage complete of the EDM file upload from step 9.

 

EdmUploadAgent.exe /GetSession /DataStoreName hrdata

 

James_Havens_0-1628299652199.png

 

 

 

  • Note – If you change your EDM Schema, you will need to re-run steps 7 – 9.
  • Note – If you change your EDM spreadsheet, you will need to re-run steps 7 – 9.

 

  11. If everything is correct, we can move to the next blog and create and test a DLP policy with exact data.  Or you can move on to one of the other parts of this blog series.

 

Appendix and Links

Other EDM blogs to review

 

I recommend you look at the following blog entries by my co-worker.  He wrote a 3-part series in 2020 on using PowerShell to create the components of an EDM in Microsoft Compliance Manager, as well as two additional blog entries in the first half of 2021 related to EDM enhancements, including one on how to create an EDM Schema and EDM Sensitive Information Type (SIT) through the newer Graphical Interface.  This blog entry differs from those blog entries in 3 ways:

  • There is no XML creation or editing.
  • I will not be running any PowerShell scripts.
  • I will only be using the Compliance UI and the EdmUploadAgent tool.

Here are the links to my co-worker’s related blogs:

 

 

 

 

 

 

Other Resources

 

Note: This solution is a sample and may be used with Microsoft Compliance tools for dissemination of reference information only. This solution is not intended or made available for use as a replacement for professional and individualized technical advice from Microsoft or a Microsoft certified partner when it comes to the implementation of a compliance and/or advanced eDiscovery solution and no license or right is granted by Microsoft to use this solution for such purposes. This solution is not designed or intended to be a substitute for professional technical advice from Microsoft or a Microsoft certified partner when it comes to the design or implementation of a compliance and/or advanced eDiscovery solution and should not be used as such.  Customer bears the sole risk and responsibility for any use. Microsoft does not warrant that the solution or any materials provided in connection therewith will be sufficient for any business purposes or meet the business requirements of any person or organization.

 

Last update: Dec 21 2022 01:06 PM