Implementing Microsoft Exact Data Match (EDM) Part 2
Published Apr 30 2020 10:04 AM
Microsoft

When we last left our superheroes, they were on a mission to configure EDM to help protect the world! Change a few small things in that last sentence and it sums up what happened in Part 1 of this blog series. We learned how EDM can greatly assist in ensuring that data being uploaded to the cloud, such as PII and PHI, is properly discovered and protected in the DLP process.

The next step in our EDM setup is to create a Rule Package XML. This is probably the most crucial step in setting up EDM.  The Rule Pack controls or sets the criteria for how a match is made. I am going to walk you through the setup of the rule pack and explain the criteria and its use. 

The first thing we need to do, before even creating the rule pack, is to configure a custom Sensitive Information Type that defines the SRN (you remember, the Superhero Registration Number). In our CSV file you can see that the SRN is a 5-digit number, which makes this pretty easy to set up. We will use regex to define this as “\d{5}”.
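As a quick illustration of what this pattern will and will not match, here is a minimal Python sketch. Python's re module is only a stand-in for the service's regex engine, which may differ in details, and the sample text and SRN value are made up. It shows that an unanchored \d{5} also matches inside longer digit runs, which is why a match on the SRN alone is treated as weak evidence later in the rule pack. The word-boundary variant is shown only for comparison; it is not what this walkthrough uses.

```python
import re

# The same pattern used for the custom sensitive info type.
SRN_PATTERN = re.compile(r"\d{5}")

# Hypothetical sample text; the SRN and phone number are invented.
text = "Clark Kent, SRN 10042, phone 5551234567"

# The unanchored pattern also finds 5-digit slices of the phone number.
matches = SRN_PATTERN.findall(text)
print(matches)

# For comparison: a word-boundary variant only matches standalone 5-digit numbers.
standalone = re.findall(r"\b\d{5}\b", text)
print(standalone)
```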

To create this new custom sensitive info type I used the Compliance Center (compliance.microsoft.com):

  1. Log in to the Compliance Center, select “Data classification” from the menu on the left, then select “Sensitive info types” and then “+Create info type”. Choose a name and description for the type and click Next.

SIT1.png

  2. On the next screen, click “+ Add an element”.

SIT2.png

  3. Select “Regular expression” from the “Detect content containing” drop-down box, enter the regex “\d{5}”, leave all the other options at the defaults and click Next.

sit3.png

  4. Review the settings and click Finish.

sit4.png

  5. Click Yes to test the created sensitive type.

sit5.png

  6. If the test does not automatically show up, find the newly created sensitive info type and click on it, then click “Test type”.

sit6.png

  7. Either click and drag or browse to your CSV file and then click Test.

sit7.png

  8. You should then get the results. In my case my file has 21 records and all 21 SRNs were found. Click Finish.
 

sit8.png

 

Now that we have the base custom sensitive info type created, we can move along to the rule pack file. I highly recommend starting with the sample rule pack from the documentation. In the sample rule pack you will see the following on line 2:   <RulePack id="fd098e03-1796-41a5-8ab6-198c93c62b11">  You will need to replace this GUID with a new one; use the New-Guid PowerShell cmdlet to generate it.

 

guid1.png

Replace the existing GUID with the new one you just created. On line 4 of the sample rule pack file you will see another GUID, this one associated with the Publisher ID. Again, use the New-Guid cmdlet to generate a new GUID and replace the existing one with it. Lines 5 and 6 deal with language localization. They are set for English; if you need a different language, go ahead and make the changes. Lines 7, 8 and 9 deal with the publisher name and the rule pack name and description; fill in whatever you would like. Here are the first 12 lines of my rule pack after making the changes from the sample file:

 

rp1.png
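For readers who cannot see the screenshot, here is a minimal Python sketch of how such a header could be assembled. It is not the author's actual file: uuid4() stands in for the New-Guid cmdlet, the publisher name, rule pack name and description are placeholder values, and the element layout and namespace are modeled on the documented sample rule pack, so check them against the current documentation before relying on them.

```python
import uuid
import xml.etree.ElementTree as ET

# uuid4() stands in for the New-Guid cmdlet here.
rule_pack_id = str(uuid.uuid4())
publisher_id = str(uuid.uuid4())

# Header modeled on the documented sample rule pack; the publisher name,
# rule pack name and description are placeholder values (assumptions).
header = f"""<RulePackage xmlns="http://schemas.microsoft.com/office/2018/edm">
  <RulePack id="{rule_pack_id}">
    <Version build="0" major="2" minor="0" revision="0" />
    <Publisher id="{publisher_id}" />
    <Details defaultLangCode="en-us">
      <LocalizedDetails langcode="en-us">
        <PublisherName>Example Publisher</PublisherName>
        <Name>Superhero EDM Rulepack</Name>
        <Description>EDM sensitive types for superhero identities.</Description>
      </LocalizedDetails>
    </Details>
  </RulePack>
</RulePackage>"""

# Parsing confirms the fragment is well-formed XML.
root = ET.fromstring(header)
print(header)
```

The closing RulePackage tag is added here only so the fragment parses on its own; in the real file the Rules section follows the RulePack element.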

Ok, now is where the work comes in: deciding what you want to consider an exact match. I am going to go line by line here and explain the attributes and options you have to configure this. Line 13 just starts the Rules section:

 

rp2.png

Line 14: notice another GUID found after “ExactMatch ID”. You need to replace this using the New-Guid cmdlet again. The next attribute is “PatternsProximity”, which is currently set to 300. PatternsProximity tells the service how many characters away from the IdMatch (before and after) it should look for additional corroborative evidence, such as first name and last name. In our case the IdMatch is the Superhero Registration Number, or SRN. More information about the PatternsProximity attribute can be found here. The below was taken from that link:

 

rp3.png
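To make the proximity idea concrete, here is a simplified Python model of it. This is only a sketch under stated assumptions: the real service compares hashed values from the datastore rather than plain substrings, the exact way it measures the window may differ, and the function name, document text and names are hypothetical.

```python
import re

def corroborated(text, id_pattern, evidence_terms, proximity=300):
    """For each ID match, list the evidence terms found within `proximity` characters."""
    results = []
    for m in re.finditer(id_pattern, text):
        # The window extends `proximity` characters before and after the ID match.
        start = max(0, m.start() - proximity)
        end = min(len(text), m.end() + proximity)
        window = text[start:end].lower()
        found = [t for t in evidence_terms if t.lower() in window]
        results.append((m.group(), found))
    return results

# Hypothetical document: the last name sits more than 300 characters after the SRN.
doc = "SRN 10042 belongs to Clark." + " filler" * 60 + " Kent"

near = corroborated(doc, r"\d{5}", ["Clark", "Kent"], proximity=300)
wide = corroborated(doc, r"\d{5}", ["Clark", "Kent"], proximity=500)
print(near)
print(wide)
```

With a proximity of 300 only the first name counts as supporting evidence; widening the window to 500 picks up the last name as well.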

I am keeping the PatternsProximity at 300 for my rule pack; you can change it if you like. The next attribute is “DataStore”. This needs to match the datastore name entered in the Schema file. As you remember, it is named SIPAIdentities. The last attribute on line 14 is “recommendedConfidence”, which is set at 65 in the sample file. The confidence level expresses how confident we are that the data identified really is what we are looking for: the higher the confidence level, the better the match. This attribute is a suggested baseline confidence level. I am going to keep this at 65 and will explain more about confidence levels in the next few lines.

 

RP4.png

Line 15 has only one attribute, “Pattern ConfidenceLevel”, and it is set to the same value as the “recommendedConfidence” attribute on the previous line. This confidence level is different, though: it sets the confidence level for the match defined on the next line, which matches on the SRN alone (keep reading).

 

RP5.png

Line 16 finally sets up what is being looked for. This is a direct correlation to the Schema file and the fields we set as Searchable. The first field we will set up is the SRN field. On this line we will also use the custom sensitive info type “Superhero-Registration-Number(SRN)”, which ties the SRN field to that type and the regex we set up. The first attribute is “IdMatch matches”, where we will replace SSN with our SRN field. The “classification” attribute is where we will use the name of the custom type we created, “Superhero-Registration-Number(SRN)”. You can use the full name or the GUID of the custom type; to get the GUID you would need to use PowerShell.

 

RP6.png

Line 17 just closes off the pattern match for the SRN alone; no changes are needed.

 

rp7.png

Line 18 is where we begin to look for supporting data from the datastore that indicates the SRN found is more than just a random 5-digit number. We do this by adding criteria to find additional data within the proximity of 300 characters. We want to see if the superhero’s first name, last name, nickname or home location appears in the document or email being scanned. This is where confidence levels really come in.

Let’s say Mike, one of the HR reps for SIRA, has a document listing all of the superheroes’ benefits. What would happen if this document were sent to external people, such as a member of the Legion of Doom? That would be a serious issue. When we create a DLP policy (this will be coming later on), we want to know, if the data is sent out, how confident we can be that the data identifies a superhero’s secret identity. We can set the criteria for the confidence levels, and line 18 starts this.

I am going to keep line 18 at 75 for the confidence level. This line also starts a new pattern lookup.

 

RP8.png

Line 19 defines what we are searching for. Again, this is the SRN, classified with the custom sensitive info type we created earlier. Basically, this line is a duplicate of line 16.

 

rp9.png

Lines 20-26: in these lines we set the criteria for the additional matches we are looking for in relation to the SRN, within the proximity limit we set on line 14. The sample file looks for at least 3 matches among the 6 fields listed on the following lines of the XML file. The maxMatches attribute is not really needed here: it is set to 100, which is well above the six fields available to match. If this was the only condition we were using, we could omit maxMatches.

Note – Find additional information about minMatches and maxMatches here.

I will be using maxMatches for our superheroes and will explain how. Using min and max matches allows for tiering of the confidence level. I can set the criteria for a 75% confidence match to include 2 of the 4 remaining fields (Firstname, Lastname, Nickname and Home), then set the criteria for 85% confidence to match 3 of 4, and then 95% confidence if 4 out of 4 are found. One thing to note is that for all of these, the system must first find the SRN and then find the other fields within the 300-character proximity limit.
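The tiering logic can be sketched in a few lines of Python. This is a simplified model rather than the service's actual evaluation: the assumption here is that each tier sets minMatches and maxMatches to the same value (2, 3 and 4), that the highest satisfied tier wins, and that an SRN with fewer than two supporting fields falls back to the base level of 65. The names TIERS, BASE_CONFIDENCE and confidence are made up for the example.

```python
# Each tier: (confidence level, minMatches, maxMatches) over the four
# corroborating fields. The exact min/max values are an assumption.
TIERS = [(75, 2, 2), (85, 3, 3), (95, 4, 4)]
BASE_CONFIDENCE = 65  # SRN found on its own

def confidence(srn_found, supporting_fields_found):
    """Return the highest confidence level whose criteria are met, or None."""
    if not srn_found:
        return None  # nothing matches without the SRN
    best = BASE_CONFIDENCE
    for level, min_matches, max_matches in TIERS:
        if min_matches <= supporting_fields_found <= max_matches:
            best = max(best, level)
    return best

for count in range(5):
    print(count, confidence(True, count))
```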

Here is what the updated lines look like (lines 25 and 26 are just the XML closing tags that end the match pattern section):

 

rp10.png


Now I am going to add some lines for the 85% and 95% matches. They will look almost identical to lines 20-26; the only changes are the confidence level and the min and max matches attributes. Here are all three patterns. The reason I am adding the different patterns is to tier the confidence level according to how many corroborating fields are found.

 

rp11.png
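Since the screenshot is not reproduced here, the following Python sketch generates three patterns of the kind described. It is an approximation, not the author's file: the element and attribute names follow the documented sample rule pack, the field names come from the list above, and setting minMatches equal to maxMatches in each tier is an assumption.

```python
import xml.etree.ElementTree as ET

SRN_TYPE = "Superhero-Registration-Number(SRN)"
FIELDS = ["Firstname", "Lastname", "Nickname", "Home"]

def build_pattern(level, min_matches, max_matches):
    """Build one Pattern element keyed on the SRN with corroborating fields."""
    pattern = ET.Element("Pattern", confidenceLevel=str(level))
    ET.SubElement(pattern, "idMatch", matches="SRN", classification=SRN_TYPE)
    any_el = ET.SubElement(
        pattern, "Any", minMatches=str(min_matches), maxMatches=str(max_matches)
    )
    for field in FIELDS:
        ET.SubElement(any_el, "match", matches=field)
    return pattern

# Assumed tiers: 2 of 4, 3 of 4 and 4 of 4 supporting fields.
patterns = [build_pattern(75, 2, 2), build_pattern(85, 3, 3), build_pattern(95, 4, 4)]
for p in patterns:
    ET.indent(p)  # pretty-printing helper, available in Python 3.9 and later
    print(ET.tostring(p, encoding="unicode"))
```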

Now let’s add in the other searchable field from our Schema file, the Nickname field. You will notice we used the Nickname field in the above examples; there is nothing wrong with this. But now we are going to key on the Nickname field first and then look for additional fields to corroborate the data.

 

Note: Just like the SRN searchable field, I must create a custom sensitive info type to set the classification for the Nickname field. I first tried to create a regex to look for one word or two. I thought I had created the correct regex, but it did not work in actual testing, so I decided to move to a dictionary file. The reason I went with a dictionary file rather than keywords is that keywords are meant for just a couple of words and have a 50-character limit, whereas a dictionary file can contain upwards of 100,000 terms per dictionary.

 

  1. To create the new custom sensitive info type, select Create Info Type from Data Classification\Sensitive Info Types, give it a name and description and click Next.

cs1.png

  2. Select Add element, then select Dictionary (Large keywords) and click Add a dictionary.

cs2.png

  3. Click Create new keyword dictionaries.

cs3.png

  4. Give the dictionary a name, add the nicknames, each on a separate line, and click Save.

cs4.png

  5. Select the newly created keyword dictionary and click Add.

cs5.png

  6. Click Next.

cs6.png

  7. Review the information and click Finish.
 

cs7.png
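If the source file is long, typing the nicknames by hand gets tedious. The sketch below shows one way the one-per-line list could be produced from the CSV with Python. The sample rows are invented, and the assumption is that the CSV header uses the same column names as the schema fields (in particular a column called Nickname).

```python
import csv
import io

# Hypothetical CSV excerpt; the column names are assumed to match the schema fields.
sample_csv = """SRN,Firstname,Lastname,Nickname,Home
10042,Clark,Kent,Superman,Metropolis
10043,Bruce,Wayne,Batman,Gotham
10044,Diana,Prince,Wonder Woman,Themyscira
10045,Kara,Danvers,Supergirl,National City
"""

def nickname_terms(csv_text, column="Nickname"):
    """Return unique, non-empty nicknames in first-seen order."""
    seen = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        term = (row.get(column) or "").strip()
        if term and term not in seen:
            seen.append(term)
    return seen

# One term per line, ready to paste into the keyword dictionary.
print("\n".join(nickname_terms(sample_csv)))
```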

 

I am going to make the Nickname match very similar to the SRN one we just completed. We need to ensure a new GUID is created for the ExactMatch ID. Take a look:

 

a1.png

Ok, now that we have the criteria set, all we need to do is name the new sensitive info types in the last section of the rule pack file, LocalizedStrings.

In this section there is only one entry in the sample file; for our file we will need two, one for SRN and one for Nickname. These are straightforward and simply set the language, name and description for the sensitive info types that we just configured. The most important thing to be aware of is that you need to copy the GUID created for each ExactMatch configuration to the localization section. See below: I copied the two GUIDs and then created the naming entries.

 

a2.png
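Because a mistyped GUID in this section is easy to miss, a small check can help before uploading. The Python sketch below is one possible approach, not part of the official tooling: it parses a rule pack and reports any ExactMatch id that has no Resource entry with the same idRef. The sample XML is trimmed and hypothetical, and the element names and namespace follow the documented sample rule pack.

```python
import xml.etree.ElementTree as ET

# Namespace as used in the documented sample rule pack.
NS = {"edm": "http://schemas.microsoft.com/office/2018/edm"}

def unmatched_ids(rule_pack_xml):
    """Return ExactMatch ids that have no LocalizedStrings Resource with the same idRef."""
    root = ET.fromstring(rule_pack_xml)
    exact_ids = {e.get("id").lower() for e in root.iterfind(".//edm:ExactMatch", NS)}
    resource_ids = {r.get("idRef").lower() for r in root.iterfind(".//edm:Resource", NS)}
    return sorted(exact_ids - resource_ids)

# Trimmed, hypothetical rule pack: the second ExactMatch has no naming entry.
sample = """<RulePackage xmlns="http://schemas.microsoft.com/office/2018/edm">
  <Rules>
    <ExactMatch id="11111111-1111-1111-1111-111111111111" patternsProximity="300" dataStore="SIPAIdentities" recommendedConfidence="65" />
    <ExactMatch id="22222222-2222-2222-2222-222222222222" patternsProximity="300" dataStore="SIPAIdentities" recommendedConfidence="65" />
    <LocalizedStrings>
      <Resource idRef="11111111-1111-1111-1111-111111111111">
        <Name default="true" langcode="en-us">SRN</Name>
        <Description default="true" langcode="en-us">EDM sensitive type for the SRN.</Description>
      </Resource>
    </LocalizedStrings>
  </Rules>
</RulePackage>"""

missing = unmatched_ids(sample)
print(missing)
```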

Here is a link to the entire rulepack.xml file. I encourage you to use this only as a reference and not just copy and paste it for your rule pack file. Putting this together yourself is very informative and helps you learn the system.

Now that the rule pack is done, we need to upload it to the service. To upload we need to connect to Remote PowerShell again, just like we did for the Schema file. Here are the instructions for connecting to the Office 365 Security and Compliance center using PowerShell when y...

Once connected, issue the commands below (make sure you are in the directory that contains the rulepack.xml file).

$rulepack = Get-Content .\rulepack.xml -Encoding Byte -ReadCount 0

New-DlpSensitiveInformationTypeRulePackage -FileData $rulepack

 

a3.png

Next item on the agenda is to index and upload the sensitive data. To do this you will need to download the EDM Upload Agent, which is available in step 1 of the previous link. When you go to the link to get the download, pay attention to the setup needed for a security group. You will need to create an Office 365 security group; you can do this from the Microsoft Admin portal or the Azure AD Admin portal. Create the group and name it EDM_DataUploaders, then add the user account that you have been using for the project to this group.

 

a4.png


Download and install the EDM Upload Agent. Ensure you are a member of the newly created group and that you are a local admin on the machine you will be uploading from.

  1. Start a command prompt and run it as an administrator.
  2. Change the directory to the EDM Upload Agent directory, C:\Program Files\Microsoft\EdmUploadAgent.

a5.png

  3. The first step, which is only needed once, is to authorize the EDM Upload Agent to the proper tenant. To do this, run the following command and then log in with your tenant credentials (the account that you just added to the security group): EdmUploadAgent.exe /Authorize

a6.png

  4. The next step is to index and upload the CSV file. Here is the syntax for the command:

EdmUploadAgent.exe /UploadData /DataStoreName <DataStoreName> /DataFile <DataFilePath> /HashLocation <HashedFileLocation>

For me, here is what the command looks like (be sure to pre-create the Hash folder):

EdmUploadAgent.exe /UploadData /DataStoreName SIPAIdentities /DataFile C:\Scripts\EDM\Superheros-CSV.csv /HashLocation C:\Scripts\EDM\Hash

a7.png

  5. Depending on the size of the source file, it might take some time to upload. To check the status, you can use the other commands available in the tool; just type EdmUploadAgent.exe to get a list of the commands. Use the /GetSession switch to check on the status.
 

a8.png
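If you end up running the upload repeatedly, the folder creation and the argument list can be scripted. The Python sketch below only builds the command and creates the hash folder; it does not run the agent. The function name is made up, the switches are the ones shown in the steps above, and a temporary directory is used here in place of the real C:\Scripts\EDM paths.

```python
import tempfile
from pathlib import Path

def upload_command(datastore, data_file, hash_dir):
    """Create the hash folder if needed and return the EdmUploadAgent argument list."""
    Path(hash_dir).mkdir(parents=True, exist_ok=True)
    return [
        "EdmUploadAgent.exe", "/UploadData",
        "/DataStoreName", datastore,
        "/DataFile", str(data_file),
        "/HashLocation", str(hash_dir),
    ]

# A temporary directory stands in for the real working folder.
with tempfile.TemporaryDirectory() as tmp:
    hash_dir = Path(tmp) / "Hash"
    cmd = upload_command("SIPAIdentities", Path(tmp) / "Superheros-CSV.csv", hash_dir)
    created = hash_dir.is_dir()
    print(" ".join(cmd))
```

On a machine where the agent is installed, a list like this could be handed to subprocess.run from the agent's directory.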

 

My file was very small, only 22 rows so it took no time at all.  Currently the service supports up to 10 million rows with 5 searchable fields.  Microsoft is working on increasing both limits. 

Note: You can split a very large datastore into multiple smaller datastores. The benefit is that you can use a data source that is larger than the current limits. The downside is that you will need to configure a separate Schema file and a separate rule pack for each datastore. You will also end up with a separate set of sensitive info types for each datastore, and you will need to ensure that your DLP policies reference all of them. You can have one DLP policy, but it will need to look for SRN_DS1, SRN_DS2, SRN_DS3 and so on.
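Splitting the source file itself is straightforward. The Python sketch below shows one way to do it, assuming the header row should be repeated in every part; the function name, the sample data and the tiny row limit are for illustration only.

```python
import csv
import io

def split_rows(csv_text, max_rows):
    """Split CSV text into chunks of at most max_rows data rows, repeating the header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    chunks, current = [], []
    for row in reader:
        current.append(row)
        if len(current) == max_rows:
            chunks.append([header] + current)
            current = []
    if current:
        chunks.append([header] + current)  # final, partially filled chunk
    return chunks

# Hypothetical data: five records split into parts of at most two records.
sample = "SRN,Nickname\n" + "".join(f"{10000 + i},Hero{i}\n" for i in range(5))
chunks = split_rows(sample, max_rows=2)
for n, chunk in enumerate(chunks, start=1):
    print(f"part {n}: {len(chunk) - 1} rows")
```

Each part would then get its own Schema file, rule pack and upload, as described in the note above.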

This completes part 2 of the series. While we only really worked on the Rule Pack in this part, I hope you understand how important it is to your entire EDM solution. Next up in Part 3 we will dive into creating DLP policies and doing some testing!

 

Last update: May 18 2020 08:25 AM