Microsoft Security Community Blog

Implementing Microsoft Exact Data Match (EDM) Part 2

Microsoft

Apr 30, 2020

When we last left our superheroes, they were on a mission to configure EDM to help protect the world! Change a few small things in that last sentence and it sums up what happened in Part 1 of this blog series. We learned how EDM can greatly assist in ensuring that data being uploaded to the cloud, such as PII and PHI, is properly discovered and protected by the DLP process.

The next step in our EDM setup is to create a Rule Package XML. This is probably the most crucial step in setting up EDM. The Rule Pack controls or sets the criteria for how a match is made. I am going to walk you through the setup of the rule pack and explain the criteria and its use.

Before creating the rule pack, we first need to configure a custom Sensitive Information Type that defines the SRN (you remember, the Superhero Registration Number). In our CSV file you can see that the SRN is a 5-digit number, which makes this pretty easy to set up. We will use the regex “\d{5}” to define it.
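To get a feel for what this pattern will and will not catch, here is a minimal Python sketch. It uses Python's own `re` module, not the compliance service's engine, and the sample text and the function name `find_srn_candidates` are made up for illustration.

```python
import re
from typing import List

# Same pattern as the custom sensitive info type described above.
SRN_PATTERN = re.compile(r"\d{5}")

def find_srn_candidates(text: str) -> List[str]:
    """Return every non-overlapping run of five digits found in the text."""
    return SRN_PATTERN.findall(text)

# Hypothetical sample data, not taken from the real CSV file.
sample = "Jane Doe, SRN 10001, Star City; John Roe, SRN 10002, Coast City"
print(find_srn_candidates(sample))             # ['10001', '10002']

# The bare pattern also fires inside longer digit strings.
print(find_srn_candidates("Invoice 1234567"))  # ['12345']
```

The second call shows why a 5-digit regex alone is a weak signal: any long number contains a 5-digit run, so the supporting fields matter.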

To create this new custom sensitive info type I used the Compliance Center (compliance.microsoft.com).

Log in to the Compliance Center, select “Data classification” from the menu on the left, then select “Sensitive info types” and then “+Create info type”. Choose a name and description for the type and click Next.

On the next screen, click “+ Add an element”.

Select “Regular expression” from the “Detect content containing” drop-down box, enter the regex “\d{5}”, leave all the other options at their defaults, and click Next.

Review the settings and click Finish.

Click Yes to test the newly created sensitive info type.

If the test does not show up automatically, find the newly created sensitive info type, click on it, and then click “Test type”.

Either drag and drop or browse to your CSV file, then click Test.

You should then get the results (in my case the file has 21 records and all 21 SRNs were found). Click Finish.

Now that we have the base custom sensitive info type created, we can move along to the rule pack file. I highly recommend starting with the sample rule pack from the documentation. On line 2 of the sample rule pack you will see: <RulePack id="fd098e03-1796-41a5-8ab6-198c93c62b11">. You will need to replace this GUID with a new one; use the New-Guid PowerShell cmdlet to generate it.
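If you would rather generate all the replacement GUIDs in one go, the same idea can be sketched in Python with the standard `uuid` module. This is only an alternative illustration to New-Guid; the function name `new_guids` and the placeholder labels are hypothetical.

```python
import uuid
from typing import Dict, Iterable

def new_guids(labels: Iterable[str]) -> Dict[str, str]:
    """Generate one fresh random (version 4) GUID per label."""
    return {label: str(uuid.uuid4()) for label in labels}

# Labels are just reminders of where each GUID will be pasted.
for label, value in new_guids(["RulePack id", "Publisher id"]).items():
    print(f"{label}: {value}")
```

Each run prints different values, which is the point: every GUID copied from the sample file should be replaced with one of your own.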

Replace the existing GUID with the new one you just created. On line 4 of the sample rule pack file you will see another GUID, this one associated with the Publisher ID. Again, use the New-Guid cmdlet to generate a new GUID and replace the existing one with it. Lines 5 and 6 deal with language localization. They are set for English; change them if you need to. Lines 7, 8, and 9 hold the publisher name, rule pack name, and description; fill in whatever you would like. Here are the first 12 lines of my rule pack after making the changes to the sample file:

OK, now is where the work comes in: deciding what you want to consider an exact match. I am going to go line by line and explain the attributes and options you have for configuring this. Line 13 just starts the Rules section:

Line 14: notice another GUID, found after “ExactMatch ID”. You need to replace this one too, again using the New-Guid cmdlet. The next attribute is “PatternsProximity”, currently set to 300. PatternsProximity tells the service how many characters before and after the IdMatch (in our case the Superhero Registration Number, or SRN) to search for additional corroborative evidence, such as first name and last name. More information about the PatternsProximity attribute can be found here. The below was taken from that link:

I am keeping PatternsProximity at 300 for my rule pack; you can change it if you like. The next attribute is “DataStore”. This needs to match the datastore name entered in the Schema file, which, as you remember, is SIPAIdentities. The last attribute on line 14 is “recommendedConfidence”, set to 65 in the sample file. The confidence level expresses how confident we are that the data identified really is what is being looked for: the higher the confidence level, the better the match. This attribute is a suggested base confidence level. I am going to keep it at 65 and will explain more about confidence levels in the next few lines.
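The proximity idea can be pictured with a small Python model. This is a simplified sketch, not the service's actual algorithm: it assumes the window is measured from the edges of the first occurrence of the ID value, and the function name and sample text are hypothetical.

```python
def within_proximity(text: str, id_value: str, evidence: str,
                     proximity: int = 300) -> bool:
    """True if `evidence` occurs within `proximity` characters before or
    after the first occurrence of `id_value` in `text`."""
    id_pos = text.find(id_value)
    if id_pos == -1:
        return False  # no ID match, so nothing to corroborate
    window_start = max(0, id_pos - proximity)
    window_end = id_pos + len(id_value) + proximity
    return evidence in text[window_start:window_end]

# Hypothetical document: "Jane" is close to the SRN, "Doe" is 400+ characters away.
doc = "SRN 10001 belongs to Jane" + " " * 400 + "Doe"
print(within_proximity(doc, "10001", "Jane"))  # True
print(within_proximity(doc, "10001", "Doe"))   # False
```

A larger proximity value finds more corroborating evidence but also raises the chance of pairing an ID with unrelated text elsewhere in the document.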

Line 15 has only one attribute, “Pattern ConfidenceLevel”, and it is set to the same value as the “recommendedConfidence” attribute on the previous line. This confidence level is different, though: it sets the confidence level for the pattern on the next line, which matches on the SRN alone (keep reading).

Line 16 finally sets up what is being looked for. This correlates directly to the Schema file and the fields we set as searchable. The first field we will set up is the SRN field. On this line we will also use the custom sensitive info type “Superhero-Registration-Number(SRN)”, which ties the SRN field to that type and the regex we set up. The first attribute is “IdMatch matches”, where we replace SSN with our SRN field. The “classification” attribute is where we use the name of the custom type we created, “Superhero-Registration-Number(SRN)”. You can use the full name or the GUID of the custom type; to get the GUID you would need to use PowerShell.

Line 17 just closes off the pattern match for SRN alone; no changes are needed.

Line 18 is where we begin to look for supporting data from the datastore that indicates the SRN found is more than just a random 5-digit number. We do this by adding criteria to find additional data within the proximity of 300 characters. We want to see if the superhero’s first name, last name, nickname, or home location appears in the document or email being scanned. This is where confidence levels really come in.

Let’s say Mike, one of the HR reps for SIRA, has a document covering all of the superheroes’ benefits. What would happen if this document were sent to external people, maybe someone who is a member of the Legion of Doom? That would be a serious issue. When we create a DLP policy (this will be coming later on), we want to know, if the data is sent out, how confident we are that it can identify a superhero’s secret identity. We can set the criteria for the confidence levels, and line 18 starts this.

I am going to keep line 18 at 75 for the confidence level. This line also starts a new pattern lookup.

Line 19 states what we are searching for. Again, this is the SRN, classified with the custom sensitive info type we created earlier. Basically, this line duplicates line 16.

Lines 20-26 set the criteria for the additional matches we are looking for in relation to the SRN, within the proximity limit we set on line 14. The sample file requires at least 3 matches from the 6 fields listed on the following lines of the XML file. The maxMatches attribute is not really needed here: it is set to 100, which is above the six fields available to match. If this were the only condition we were using, we could omit maxMatches.

Note: Find additional information about minMatches and maxMatches here.

I will be using maxMatches for our superheroes and will explain how. Using min and max matches allows the confidence level to be tiered. I can set the criteria for a 75% confidence match to 2 of the 4 remaining fields (Firstname, Lastname, Nickname and Home), 85% confidence to 3 of 4, and 95% confidence to 4 of 4. One thing to note: for all of these, the system must first find the SRN and then find the other fields within the 300-character proximity limit.
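The tiering can be modeled in a few lines of Python. This is a sketch of the logic only, under the assumption that each tier uses an equal minMatches and maxMatches (2/2, 3/3, 4/4) so the ranges do not overlap; the names `TIERS` and `confidence_for` are made up for illustration.

```python
# (confidence, min_matches, max_matches) for the supporting-field patterns.
# Assumption: equal min and max per tier, so each count maps to one tier.
TIERS = [(75, 2, 2), (85, 3, 3), (95, 4, 4)]
BASE_CONFIDENCE = 65  # the SRN found on its own

def confidence_for(srn_found: bool, supporting_matches: int) -> int:
    """Return the highest confidence level whose criteria are met."""
    if not srn_found:
        return 0  # without the SRN, none of the patterns can match
    best = BASE_CONFIDENCE
    for confidence, min_matches, max_matches in TIERS:
        if min_matches <= supporting_matches <= max_matches:
            best = max(best, confidence)
    return best

print(confidence_for(True, 3))  # 85
```

Note that one supporting field on its own does not lift the result above the base level in this model, because the lowest tier requires two.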

Here is what the updated lines look like (line 25 & 26 are just XML ending that match pattern section)
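A pattern of this shape can also be sketched programmatically. The Python below is only an illustration, not the actual file from this post: it assumes the element and attribute names used in the sample rule pack (Pattern, idMatch, Any, match, confidenceLevel, minMatches, maxMatches) and the schema field names Firstname, Lastname, Nickname and Home, so check both against your own sample and Schema file.

```python
import xml.etree.ElementTree as ET

# Assumed schema field names; they must match the Schema file exactly.
SUPPORTING_FIELDS = ["Firstname", "Lastname", "Nickname", "Home"]

def build_pattern(confidence: int, min_matches: int, max_matches: int) -> ET.Element:
    """Build one Pattern element keyed on the SRN plus supporting fields."""
    pattern = ET.Element("Pattern", confidenceLevel=str(confidence))
    ET.SubElement(pattern, "idMatch", matches="SRN",
                  classification="Superhero-Registration-Number(SRN)")
    any_el = ET.SubElement(pattern, "Any",
                           minMatches=str(min_matches),
                           maxMatches=str(max_matches))
    for field in SUPPORTING_FIELDS:
        ET.SubElement(any_el, "match", matches=field)
    return pattern

# The 75% tier: the SRN plus 2 of the 4 supporting fields.
print(ET.tostring(build_pattern(75, 2, 2), encoding="unicode"))
```

Only the three arguments change between tiers, which mirrors the way the hand-written patterns differ.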

Now I am going to add some lines for the 85% and 95% matches. They will look almost identical to lines 20-26; the only changes will be the confidence level and the min and max matches attributes. Here are all three patterns. Again, notice that the only things that change between them are the confidence level and the min and max match attributes. The reason I am adding in different

Now let’s add in the other searchable field from our Schema file, the Nickname field. You will notice we used the Nickname field in the above examples, nothing wrong with this. But now we are going to key on the Nickname field first and then look for additional fields to corroborate the data.

Note: Just like the SRN searchable field, the Nickname field needs a custom sensitive info type to set its classification. I first tried to create a regex to match a nickname of one or two words. I thought I had the correct regex, but it did not work in actual testing, so I moved to a dictionary file instead. The reason I went with a dictionary file rather than keywords is that keywords are meant for just a couple of words and have a 50-character limit, whereas a dictionary file can contain upwards of 100,000 terms per dictionary.

To create the new custom sensitive info type, select Create Info Type from Data Classification\Sensitive Info Types, give it a name and description, and click Next.

Select Add element, then select Dictionary (Large keywords) and click Add a dictionary.

Click Create new keyword dictionaries.

Give the dictionary a name, add the nicknames, each on a separate line, and click Save.

Select the newly created keyword dictionary and click Add.

Click Next.

Review the information and click Finish.
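Conceptually, a dictionary-based type is a lookup of known terms rather than a pattern. The Python sketch below illustrates that idea only; it assumes whole-word, case-insensitive matching, which may differ from how the service behaves, and the nicknames and the function name are invented.

```python
import re
from typing import List, Set

def find_nicknames(text: str, dictionary: Set[str]) -> List[str]:
    """Return the dictionary terms that appear in the text as whole words."""
    found = []
    for term in sorted(dictionary):
        # re.escape keeps any punctuation in a nickname literal.
        if re.search(r"\b" + re.escape(term) + r"\b", text, flags=re.IGNORECASE):
            found.append(term)
    return found

# Hypothetical one-word and two-word nicknames, one per dictionary line.
nicknames = {"Night Owl", "Comet", "Iron Lark"}
print(find_nicknames("Report filed by night owl near the docks", nicknames))
# ['Night Owl']
```

Because every term is listed explicitly, one-word and two-word nicknames need no special handling, which is the difficulty the regex approach ran into.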

I am going to make this very similar to the SRN field we just completed. Remember that a new GUID must be created for the ExactMatch ID. Take a look:

OK, now that we have the criteria set, all we need to do is name the new sensitive info types in the last section of the rule pack file, LocalizedStrings.

In this section there is only one entry in the sample file; for our file we will need two, one for SRN and one for Nickname. These are straightforward and simply set the language, name and description for the sensitive info types that we just configured. The most important thing to be aware of is that you need to copy the GUID created for each ExactMatch configuration into the localization section. See below: I copied the two GUIDs and then created the naming entries.
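Since a copy-and-paste slip between the two places is easy to make, a quick consistency check before uploading can help. The following Python sketch assumes the GUIDs live in an `id` attribute on ExactMatch and an `idRef` attribute on Resource, as in the sample rule pack; the function names and the placeholder GUIDs are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

def local_name(tag: str) -> str:
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]

def unmatched_ids(rule_pack_xml: str) -> Tuple[List[str], List[str]]:
    """Return (ExactMatch ids with no Resource, Resource idRefs with no ExactMatch)."""
    root = ET.fromstring(rule_pack_xml.strip())
    match_ids, resource_refs = set(), set()
    for el in root.iter():
        name = local_name(el.tag)
        if name == "ExactMatch" and el.get("id"):
            match_ids.add(el.get("id").lower())
        elif name == "Resource" and el.get("idRef"):
            resource_refs.add(el.get("idRef").lower())
    return sorted(match_ids - resource_refs), sorted(resource_refs - match_ids)

# Trimmed, made-up fragment: the second ExactMatch has no naming entry.
SAMPLE = """
<Rules>
  <ExactMatch id="11111111-1111-1111-1111-111111111111" />
  <ExactMatch id="22222222-2222-2222-2222-222222222222" />
  <LocalizedStrings>
    <Resource idRef="11111111-1111-1111-1111-111111111111" />
  </LocalizedStrings>
</Rules>
"""
print(unmatched_ids(SAMPLE))
# (['22222222-2222-2222-2222-222222222222'], [])
```

Two empty lists mean every ExactMatch has a naming entry and every naming entry points at a real ExactMatch.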

Here is a link to the entire rulepack.xml file. I encourage you to use it only as a reference and not simply copy and paste it as your rule pack file. Putting this together yourself is very informative and helps you learn the system.

Now that the rule pack is done, we need to upload it to the service. To upload, we need to connect to Remote PowerShell again, just as we did for the Schema file. Here are the instructions for connecting to the Office 365 Security and Compliance Center using PowerShell when you have multi-factor authentication.

Once connected, issue the commands below (make sure you are in the directory that contains the rulepack.xml file).

$rulepack=Get-Content .\rulepack.xml -Encoding Byte -ReadCount 0

New-DlpSensitiveInformationTypeRulePackage -FileData $rulepack

The next item on the agenda is to index and upload the sensitive data. To do this you will need to download the EDM Upload Agent, which is available in step 1 of the previous link. When you go to the link to get the download, pay attention to the setup needed for a security group. You will need to create an Office 365 security group; you can do this from the Microsoft Admin portal or the Azure AD Admin portal. Create the group and name it EDM_DataUploaders, then add the user account that you have been using for the project to this group.

Download and install the EDM Upload Agent. Ensure you are a member of the newly created group and a local admin on the machine you will be uploading from.

Start a command prompt and run it as an administrator.
Change the directory to the EDM Upload Agent directory, C:\Program Files\Microsoft\EdmUploadAgent.

The first step, needed only once, is to authorize the EDM Upload Agent to the proper tenant. To do this, run the following command and then log in with your tenant credentials (the account you just added to the security group): EdmUploadAgent.exe /Authorize

The next step is to index and upload the CSV file. Here is the syntax for the command: EdmUploadAgent.exe /UploadData /DataStoreName <DataStoreName> /DataFile <DataFilePath> /HashLocation <HashedFileLocation>. For me, the command looks like this (be sure to pre-create the Hash folder):

EdmUploadAgent.exe /UploadData /DataStoreName SIPAIdentities /DataFile C:\Scripts\EDM\Superheros-CSV.csv /HashLocation C:\Scripts\EDM\Hash
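If you script the upload, the command line can be assembled from its parts so the datastore name and paths are set in one place. This Python sketch only builds the argument list from the switches shown above and offers a helper to pre-create the Hash folder; the function names are hypothetical, and nothing is executed here.

```python
from pathlib import Path
from typing import List

def build_upload_command(datastore: str, data_file: str, hash_dir: str) -> List[str]:
    """Assemble the EdmUploadAgent.exe /UploadData argument list."""
    return [
        "EdmUploadAgent.exe", "/UploadData",
        "/DataStoreName", datastore,
        "/DataFile", data_file,
        "/HashLocation", hash_dir,
    ]

def ensure_hash_dir(hash_dir: str) -> None:
    """Pre-create the hash folder (run this on the upload machine itself)."""
    Path(hash_dir).mkdir(parents=True, exist_ok=True)

cmd = build_upload_command(
    "SIPAIdentities",
    r"C:\Scripts\EDM\Superheros-CSV.csv",
    r"C:\Scripts\EDM\Hash",
)
print(" ".join(cmd))
```

On the upload machine, the list could then be handed to `subprocess.run(cmd, check=True)` from the EDM Upload Agent directory, after calling `ensure_hash_dir`.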

Depending on the size of the source file, it might take some time to upload. To check the status, you can use the other commands available in the tool; just type EdmUploadAgent.exe to get a list of them. Use the /GetSession switch to check on the upload status.

My file was very small, only 22 rows, so it took no time at all. Currently the service supports up to 10 million rows with 5 searchable fields. Microsoft is working on increasing both limits.

Note: You can split up very large datastores into multiple smaller datastores. The benefit of this is you could use a data source that is larger than the current limits. The downside is you will need to configure a separate Schema file for each datastore as well as a rule pack for each one. You will also end up with multiple sensitive info types for each datastore and then will need to ensure that your DLP policies are referencing all of the sensitive info types for each datastore. You can have one DLP policy, but it will need to look for SRN_DS1, SRN_DS2, SRN_DS3 and so on.
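The splitting itself is simple chunking. The Python sketch below shows one way to divide rows into numbered pieces that stay under a row limit; the function name is made up, and the SRN_DS prefix is reused from the note above purely as a label. It does not create schema files or rule packs, which would still be needed for each piece.

```python
from typing import List, Sequence, Tuple

def split_into_datastores(rows: Sequence, max_rows: int,
                          prefix: str = "SRN_DS") -> List[Tuple[str, list]]:
    """Split rows into consecutive chunks of at most max_rows, named prefix1, prefix2, ..."""
    if max_rows < 1:
        raise ValueError("max_rows must be at least 1")
    chunks = []
    for start in range(0, len(rows), max_rows):
        name = f"{prefix}{len(chunks) + 1}"
        chunks.append((name, list(rows[start:start + max_rows])))
    return chunks

# Tiny demonstration with 5 rows and a limit of 2 rows per datastore.
for name, chunk in split_into_datastores(["r1", "r2", "r3", "r4", "r5"], 2):
    print(name, chunk)
```

With the real limit you would pass `max_rows=10_000_000` and write each chunk to its own CSV file before hashing and uploading it.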

This completes part 2 of the series. While we only really worked on the Rule Pack in this part, I hope you understand how important it is to your entire EDM solution. Next up in Part 3 we will dive into creating DLP policies and doing some testing!

Updated May 18, 2020

Version 2.0

SeanMcNeill

Microsoft

23 Comments

SeanMcNeill
Microsoft
Oct 13, 2021
Shinji_Miura if you are referring to the updating of Sensitive info via the EDMUploader, then no, the new file, once uploaded and indexed would replace the current data. Should not be a gap in coverage.
Shinji_Miura
Microsoft
Oct 13, 2021
Hi SeanMcNeill
Is there a possibility of being unable to detect sensitive information during uploading the sensitive data via EdmUploadAgent ?
Regards,
mevaibhav83
Copper Contributor
Nov 17, 2020
thanks SeanMcNeill ! I have confirmed with support and validated and can confirm that EDMUploadAgent works ONLY with 'Cloud only' security group. (it does not respect on-prem synced AD security group)
SeanMcNeill
Microsoft
Oct 19, 2020
mevaibhav83 In my lab I only have cloud accounts; I do not have on-premises accounts. It was just a suggestion, to see if it would work.
mevaibhav83
Copper Contributor
Oct 19, 2020
SeanMcNeill - did cloud only security group work for you? I am asking this because it does not make sense to me to remove existing on-prem AD security group (which synced to AAD) and create cloud only security AD group.
pradell1957
Copper Contributor
Oct 16, 2020
SeanMcNeill Thanks Sean, my testing with social security numbers was always with keywords and the various social security number formats that were in the Sensitive Info Types Definition. I even tried creating my own Sensitive Info Type but it still only matched on the social security number exactly as it appeared in the EDM Uploaded Hash file. I plan on testing the new normalization features once released before falling back to uploading each format of a social security number.
SeanMcNeill
Microsoft
Oct 16, 2020
pradell1957

The best advice I can give for determining the format of the built-in Sensitive Info Types is to review the Sensitive Info Types Definitions. Notice that some SSN formats will be detected, but an SSN-type keyword also needs to be present for matching.

Yes there is work being done for data normalization as you referenced, that change is still in development, but appears to be headed for production this month.
SeanMcNeill
Microsoft
Oct 16, 2020
mevaibhav83

Instead of an Active Directory Group sync'd via AD Connect, can you try creating a cloud based security group and try?
pradell1957
Copper Contributor
Oct 16, 2020
SeanMcNeill Hi Sean, We are in the process of implementing EDM into production and creating our EDM Upload files to be hashed. The two pieces of data we are including in the upload are social security number and account number. Do I need to upload the social security number in each format it can possibly be triggered on? For example 111-22-3333, 111223333, or 111 22 3333. Using the built-in Microsoft sensitive info type for SSN I can only make it match on the format of the social security number that is in the uploaded hash file, which would be 111223333. It will not match on 111-22-3333. It really is EDM based on my testing. I saw an update on the Microsoft Roadmap for EDM supporting data configuration but assume this doesn't change anything?

Microsoft Information Protection: Exact Data Match will support data configuration
Exact Data Match will support data configuration, allowing text case and character delimiters to optionally be ignored, helping reduce the need for manually defining minor variations in the hashed and uploaded data being protected
Feature ID: 65880
Added to Roadmap: 7/7/2020
Last Modified: 9/16/2020
Tags: General Availability, DoD, Microsoft Information Protection, Worldwide (Standard Multi-Tenant), GCC High, GCC
mevaibhav83
Copper Contributor
Oct 16, 2020
thanks SeanMcNeill - found it (it was my bad). same error with .\EdmUploadAgent.exe /Authorize

security group that i have created already synced to AAD via AADConnect today morning so no reason why this command can't find my membership from that group. find attached screenshot.