Implementing Microsoft Exact Data Match (EDM) Part 2
Published Apr 30 2020 10:04 AM 16.7K Views
Microsoft

When we last left our superheroes, they were on a mission to configure EDM to help protect the world! Change a few small things in that last sentence and it sums up what happened in Part 1 of this blog series. We learned how EDM can greatly assist in ensuring that the data being uploaded to the cloud such as PII and PHI will be properly discovered and protected in the DLP process.

The next step in our EDM setup is to create a Rule Package XML. This is probably the most crucial step in setting up EDM.  The Rule Pack controls or sets the criteria for how a match is made. I am going to walk you through the setup of the rule pack and explain the criteria and its use. 

The first thing we need to do, before even creating the rule pack, is to configure a custom Sensitive Information Type that will define the SRN (you remember, the Superhero Registration Number). In our CSV file you can see that the SRN is a 5-digit number, which makes this pretty easy to set up. We will use the regex “\d{5}” to define it.
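A quick way to sanity-check a pattern like this before creating the info type is to run it against some sample text locally. The sketch below uses Python's `re` module rather than the compliance service, so treat it as an approximation only; the function name `find_srn_candidates`, the bounded variant and the sample sentence are all made up for illustration.

```python
import re

# The pattern from the article: any run of five digit characters.
SRN_PATTERN = re.compile(r"\d{5}")

# Hypothetical stricter variant: five digits that are not part of a longer
# word or number. This is an illustration, not what the article configured.
SRN_BOUNDED = re.compile(r"\b\d{5}\b")


def find_srn_candidates(text, pattern=SRN_PATTERN):
    """Return every substring of `text` that the pattern matches."""
    return pattern.findall(text)


# Invented sample text: one real-looking SRN and one longer phone number.
sample = "SRN 48213 is on file; phone 5551234567 is not an SRN."
print(find_srn_candidates(sample))
print(find_srn_candidates(sample, SRN_BOUNDED))
```

In Python, the plain pattern returns `['48213', '55512', '34567']` because it also matches five digits inside the longer phone number, while the bounded variant returns only `['48213']`. This is one reason a 5-digit regex on its own is weak evidence, and why the rule pack later adds corroborating fields.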

To create this new custom sensitive info type I used the Compliance Center (compliance.microsoft.com) 

  1. Log in to the compliance center, select “Data classification” from the menu on the left, then select “Sensitive info types” and then “+Create info type”. Choose a name and description for the type and click Next.

SIT1.png

  2. On the next screen, click “+ Add an element”.

SIT2.png

  3. Select “Regular expression” from the “Detect content containing” drop-down box, enter the regex “\d{5}”, leave all the other options at their defaults, and click Next.

sit3.png

  4. Review the settings and click Finish.

sit4.png

  5. Click Yes to test the newly created sensitive info type.

sit5.png

  6. If the test does not open automatically, find the newly created sensitive info type, click on it, and then click “Test type”.

sit6.png

  7. Either click and drag or browse to your CSV file, and then click Test.

sit7.png

  8. You should then get the results. In my case the file has 21 records and all 21 SRNs were found. Click Finish.

sit8.png

 

Now that we have the base custom sensitive info type created, we can move along to the rule pack file. I highly recommend starting with the sample rule pack from the documentation. On line 2 of the sample rule pack you will see:   <RulePack id="fd098e03-1796-41a5-8ab6-198c93c62b11">  You will need to replace this GUID with a new one; use the New-Guid PowerShell cmdlet to generate it.

 

guid1.png

Replace the existing GUID with the new one you just created. On line 4 of the sample rule pack file you will see another GUID, this one associated with the Publisher ID. Again use the New-Guid cmdlet to generate a new GUID and replace the existing one with it. Lines 5 and 6 deal with language localization; they are set for English, so change them if you need to. Lines 7, 8 and 9 hold the publisher name, rule pack name and description; fill in whatever you would like. Here are the first 12 lines of my rule pack after making the changes to the sample file:

 

rp1.png
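If you would rather not paste GUIDs in by hand, the replacement can be scripted. The following is a minimal sketch in Python, offered as an alternative to running New-Guid manually. It assumes the rule pack is available as text; the helper name `refresh_guids` is hypothetical. It keeps a mapping so that a GUID that appears more than once in the file is replaced with the same new value everywhere.

```python
import re
import uuid

# Matches the 8-4-4-4-12 hexadecimal layout of a GUID.
GUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def refresh_guids(xml_text):
    """Replace every GUID in `xml_text` with a freshly generated one.

    The same old GUID always maps to the same new GUID, so places that
    repeat an ID stay in sync after the rewrite.
    """
    mapping = {}

    def swap(match):
        old = match.group(0).lower()
        if old not in mapping:
            mapping[old] = str(uuid.uuid4())
        return mapping[old]

    return GUID_RE.sub(swap, xml_text), mapping


# The RulePack line quoted from the sample file.
sample = '<RulePack id="fd098e03-1796-41a5-8ab6-198c93c62b11">'
updated, mapping = refresh_guids(sample)
print(updated)
```

The returned mapping is worth keeping, because it records which new GUID replaced which sample GUID.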

OK, now is where the work comes in: deciding what you want to consider an exact match. I am going to go line by line and explain the attributes and options you have for configuring this. Line 13 just starts the Rules section:

 

rp2.png

Line 14: notice another GUID after “ExactMatch ID”; you need to replace this one using the New-Guid cmdlet again. The next attribute is “PatternsProximity”, which is currently set to 300. Pattern proximity tells the service how many characters before and after the IdMatch to search for additional corroborative evidence, such as first name and last name. In our case the IdMatch is the Superhero Registration Number, or SRN. More information about the PatternsProximity attribute can be found here. The below was taken from that link:

 

rp3.png
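To make the proximity idea concrete, here is a small Python sketch of a character window around an ID match. It is an illustration of the concept only, not the service's actual algorithm (how the service measures the window is not covered here). The function name `evidence_near_match` and the sample document are invented.

```python
import re


def evidence_near_match(text, id_pattern, evidence_terms, proximity=300):
    """For each ID match, list the evidence terms found within `proximity`
    characters before or after it (case-insensitive substring check)."""
    results = []
    for m in re.finditer(id_pattern, text):
        start = max(0, m.start() - proximity)
        end = min(len(text), m.end() + proximity)
        window = text[start:end].lower()
        found = [term for term in evidence_terms if term.lower() in window]
        results.append((m.group(0), found))
    return results


# Invented document: the name sits more than 300 characters before the SRN,
# while the home location sits right next to it.
doc = "Jane Doe. " + "x" * 400 + " SRN 12345 Springfield"
print(evidence_near_match(doc, r"\b\d{5}\b", ["Jane", "Doe", "Springfield"]))
```

Here only `Springfield` counts as supporting evidence for `12345`, because the name falls outside the 300-character window. Widening the proximity makes matches easier to corroborate, at the cost of picking up unrelated text.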

I am keeping the PatternsProximity at 300 for my rule pack; you can change it if you like. The next attribute is “DataStore”. This needs to match the datastore name entered in the Schema file; as you may remember, it is named SIPAIdentities. The last attribute on line 14 is “recommendedConfidence”, which is set to 65 in the sample file. The confidence level expresses how confident the service is that the data it identified really is what is being looked for: the higher the confidence level, the better the match. This attribute is a suggested base confidence level. I am going to keep this at 65 and will explain more about confidence levels in the next few lines.

 

RP4.png

Line 15 has only one attribute, “Pattern ConfidenceLevel”, and it is set to the same value as the “recommendedConfidence” attribute on the previous line. This confidence level is different, though: it sets the confidence level for the next line, which matches on the SRN alone (keep reading).

 

RP5.png

Line 16 finally sets up what is being looked for. This is a direct correlation to the Schema file and the fields we set as Searchable. The first field we will set up is the SRN field. On this line we will also use the custom sensitive info type “Superhero-Registration-Number(SRN)”, which ties the SRN field to that type and the regex we set up. The first attribute is “IdMatch matches”, and we will replace SSN with our SRN field. The “classification” attribute is where we use the name of the custom type we created, “Superhero-Registration-Number(SRN)”. You can use the full name or the GUID of the custom type; you would need to use PowerShell to get the GUID.

 

RP6.png

Line 17 is just closing off the pattern match for SRN alone, no changes needed

 

rp7.png

Line 18 is where we begin to look for supporting data from the datastore that indicates the SRN found is more than just a random 5-digit number. We do this by adding criteria to find additional data within the proximity of 300 characters. We want to see whether the superhero’s first name, last name, nickname or home location appears in the document or email being scanned. This is where confidence levels really come in.

Let’s say Mike, one of the HR reps for SIRA, has a document listing all of the superheroes’ benefits. What would happen if this document were sent to external people, such as a member of the Legion of Doom? That would be a serious issue. When we create a DLP policy (this will be coming later on), we want to know, if the data is sent out, how confident we are that it can identify a superhero’s secret identity. We can set the criteria for the confidence levels, and line 18 starts this.

I am going to keep line 18 at 75 for the confidence level. This line also starts a new pattern lookup.

 

RP8.png

Line 19 defines what we are searching for. Again this is the SRN, classified by the custom sensitive info type we created earlier. This line is a duplicate of line 16.

 

rp9.png

Lines 20-26: in these lines we set the criteria for the additional matches we are looking for in relation to the SRN, within the proximity limit we set on line 14. The sample file looks for at least 3 matches from the following 6 lines in the XML file. The maxMatches attribute is not really needed there: it is set to 100, which is more than the six fields available to match. If this were the only condition we were using, we could omit maxMatches.

Note – Find additional information about minMatches and maxMatches here.

I will be using maxMatches for our superheroes and will explain how. Using min and max matches allows for tiering of the confidence level. I can set the criteria for a 75% confidence match to require 2 of the 4 remaining fields (Firstname, Lastname, Nickname and Home), then set the criteria for 85% confidence to 3 of 4, and 95% confidence when 4 out of 4 are found. One thing to note is that for all of these, the system must first find the SRN and then find the other fields within the 300-character proximity limit.
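The tiering can be pictured as a small lookup. The Python sketch below is a model of the idea, not of the service itself. It assumes each tier uses equal minMatches and maxMatches values (2/2, 3/3, 4/4), that the SRN-only pattern at 65 always applies once an SRN is found, and that the highest satisfied tier is the one reported. The names `TIERS` and `highest_confidence` are hypothetical.

```python
# Each entry mirrors one pattern described in the text: a confidence level
# plus the minimum and maximum number of supporting fields that must be found.
# The min/max values are an assumption based on the "2 of 4, 3 of 4, 4 of 4"
# description, listed from the strictest tier to the loosest.
TIERS = [
    {"confidence": 95, "min": 4, "max": 4},
    {"confidence": 85, "min": 3, "max": 3},
    {"confidence": 75, "min": 2, "max": 2},
    {"confidence": 65, "min": 0, "max": 4},  # SRN found, little or no support
]


def highest_confidence(srn_found, supporting_hits):
    """Return the highest tier satisfied, or None when no SRN was found.

    `supporting_hits` is how many of Firstname, Lastname, Nickname and Home
    were found within the proximity window of the SRN.
    """
    if not srn_found:
        return None  # every pattern requires the SRN first
    for tier in TIERS:
        if tier["min"] <= supporting_hits <= tier["max"]:
            return tier["confidence"]
    return None


for hits in range(5):
    print(hits, "supporting fields ->", highest_confidence(True, hits))
```

With these assumed values, zero or one supporting field stays at 65, two fields give 75, three give 85 and all four give 95. Without an SRN nothing matches at all, no matter how many names appear in the text.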

Here is what the updated lines look like (line 25 & 26 are just XML ending that match pattern section)

 

rp10.png

Now I am going to add some lines for the 85% and 95% matches. They will look almost identical to lines 20-26; the only changes will be the confidence level and the min and max matches attributes. Here are all three patterns. Again, notice that the only things that change between them are the confidence level and the min and max match attributes. The reason I am adding different confidence levels is the tiering described above.

 

rp11.png

Now let’s add in the other searchable field from our Schema file, the Nickname field. You will notice we used the Nickname field in the above examples, nothing wrong with this.  But now we are going to key on the Nickname field first and then look for additional fields to corroborate the data.

 

Note: Just like the SRN searchable field, I must create a custom sensitive info type to set the classification for the Nickname field. I first tried to create a regex that would match a nickname of one or two words. I thought I had created the correct regex, but it did not work in actual testing, so I decided to move to a dictionary file. The reason I went with a dictionary file rather than Keywords is that Keywords are meant for just a couple of words and have a 50-character limit, whereas a dictionary file can contain upwards of 100,000 terms per dictionary.
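The difference between the two approaches is easy to see in a few lines of Python. The sketch below is only an analogy for how a fixed list of terms behaves compared with a hand-written "one or two words" regex. The service's keyword dictionary is not implemented this way as far as this article shows, and the nicknames and function names here are invented.

```python
import re


def build_nickname_matcher(nicknames):
    """Compile a case-insensitive matcher for a fixed list of nicknames.

    Longer names are tried first so a two-word nickname wins over a shorter
    nickname that happens to be contained in it.
    """
    ordered = sorted(nicknames, key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


def find_nicknames(text, matcher):
    """Return the nicknames found in `text`, in order of appearance."""
    return [m.group(0) for m in matcher.finditer(text)]


# Invented dictionary entries, one of them two words long.
matcher = build_nickname_matcher(["Night Owl", "Owl", "Blaze"])
print(find_nicknames("Reports mention Night Owl and blaze.", matcher))
```

This prints `['Night Owl', 'blaze']`. Because the list is explicit, there is no need to guess a pattern that covers both one-word and two-word names, which is the problem the regex attempt ran into.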

 

  1. To create the new custom sensitive info type, select Create Info Type from Data Classification\Sensitive Info Types, give it a name and description, and click Next.

cs1.png

  2. Select Add element, then select Dictionary (Large keywords) and click Add a dictionary.

cs2.png

  3. Click Create new keyword dictionaries.

cs3.png

  4. Give the dictionary a name, add the nicknames, each on a separate line, and click Save.

cs4.png

  5. Select the newly created keyword dictionary and click Add.

cs5.png

  6. Click Next.

cs6.png

  7. Review the information and click Finish.

cs7.png

 

I am going to make this very similar to the SRN field we just completed. We need to ensure a new GUID is created for the ExactMatch ID. Take a look:

 

a1.png

Ok, now we have the criteria set, all we need to do is name the new Sensitive Info Policies in the last section of the rule pack file, LocalizedStrings.

In this section there is only one entry in the sample file. For our file we will need two, one for SRN and one for Nickname. These are straightforward and simply set the language, name and description for the sensitive info types that we just configured. The most important thing to be aware of is that you need to copy the GUID created for each ExactMatch configuration into the localization section. See below: I copied the two GUIDs and then created the naming entries.

 

a2.png
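A mismatch between the configuration GUIDs and the localization GUIDs is easy to miss by eye, so a quick check before uploading can help. The sketch below uses Python's standard XML parser. It assumes the element and attribute names from the sample rule pack as I understand them (`ExactMatch` with an `id` attribute and `Resource` with an `idRef` attribute), so verify them against your own file. The function names and the tiny sample document are made up.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace such as '{http://...}ExactMatch'."""
    return tag.rsplit("}", 1)[-1]


def unmatched_ids(xml_text):
    """Return (ExactMatch ids with no Resource, Resource idRefs with no ExactMatch)."""
    root = ET.fromstring(xml_text)
    match_ids, resource_refs = set(), set()
    for element in root.iter():
        name = local_name(element.tag)
        if name == "ExactMatch" and element.get("id"):
            match_ids.add(element.get("id").lower())
        elif name == "Resource" and element.get("idRef"):
            resource_refs.add(element.get("idRef").lower())
    return sorted(match_ids - resource_refs), sorted(resource_refs - match_ids)


# Invented fragment: the second ExactMatch has no localization entry.
sample = """
<RulePackage>
  <Rules>
    <ExactMatch id="11111111-1111-1111-1111-111111111111" dataStore="SIPAIdentities" />
    <ExactMatch id="22222222-2222-2222-2222-222222222222" dataStore="SIPAIdentities" />
    <LocalizedStrings>
      <Resource idRef="11111111-1111-1111-1111-111111111111" />
    </LocalizedStrings>
  </Rules>
</RulePackage>
"""
print(unmatched_ids(sample))
```

For this fragment the check reports the second GUID as missing a `Resource` entry. Two empty lists mean every configured type has a name and description entry, and every entry points at a configured type.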

Here is a link to the entire rulepack.xml file, I encourage you to only use this as a reference and not just copy and paste for your rule pack file. Putting this together is very informative and helps you learn the system.

Now that the rule pack is done, we need to upload to the service. To upload we need to connect to Remote PowerShell again just like we did for the Schema file. Here are the instructions for connecting to the Office 365 Security and Compliance center using PowerShell when y...

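Once connected, issue the commands below (make sure you are in the directory that contains the rulepack.xml file).

# Read the rule pack as raw bytes. -Encoding Byte works in Windows PowerShell 5.1; PowerShell 6 and later use -AsByteStream instead.
$rulepack=Get-Content .\rulepack.xml -Encoding Byte -ReadCount 0

# Upload the rule pack to the service.
New-DlpSensitiveInformationTypeRulePackage -FileData $rulepack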

 

a3.png

The next item on the agenda is to index and upload the sensitive data. To do this you will need to download the EDM Upload Agent, which is available in step 1 of the previous link. When you go to the link to get the download, pay attention to the setup needed for a security group. You will need to create an Office 365 Security group; you can do this from the Microsoft Admin portal or the Azure AD Admin portal. Create the group and name it EDM_DataUploaders. Add the user account that you have been using for the project to this group.

 

a4.png

Download and install the EDM Upload Agent, ensure you are a member of the newly created group and that you are a local admin on the machine you will be uploading from.

  1. Start a command prompt and run it as an administrator.
  2. Change the directory to the EDM Upload Agent directory, C:\Program Files\Microsoft\EdmUploadAgent.

a5.png

  3. The first step, which is only needed once, is to authorize the EDM Upload Agent to the proper tenant. Run the following command and then log in with your tenant credentials (the account that you just added to the security group): EdmUploadAgent.exe /Authorize

a6.png

  4. The next step is to index and upload the CSV file. The syntax for the command is EdmUploadAgent.exe /UploadData /DataStoreName <DataStoreName> /DataFile <DataFilePath> /HashLocation <HashedFileLocation>. Here is what the command looks like for me (be sure to pre-create the Hash folder):

EdmUploadAgent.exe /UploadData /DataStoreName SIPAIdentities /DataFile C:\Scripts\EDM\Superheros-CSV.csv /HashLocation C:\Scripts\EDM\Hash

a7.png

  5. Depending on the size of the source file, it might take some time to upload. To check the status, use the /GetSession switch. You can also type EdmUploadAgent.exe on its own to get a list of the available commands.

a8.png
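If the upload is something you will repeat, the steps above can be wrapped in a small script. The Python sketch below only assembles the same command line shown in step 4 and creates the hash folder first. It assumes the agent is installed in the default directory mentioned above, and the function names are hypothetical. Authorization still has to be done once beforehand with /Authorize.

```python
import os
import subprocess

# Default install location from the article; adjust if yours differs.
AGENT_EXE = r"C:\Program Files\Microsoft\EdmUploadAgent\EdmUploadAgent.exe"


def build_upload_command(datastore, data_file, hash_dir):
    """Return the argument list for the /UploadData call shown in step 4."""
    return [
        AGENT_EXE,
        "/UploadData",
        "/DataStoreName", datastore,
        "/DataFile", data_file,
        "/HashLocation", hash_dir,
    ]


def run_upload(datastore, data_file, hash_dir):
    """Create the hash folder if needed, then run the agent (Windows only)."""
    os.makedirs(hash_dir, exist_ok=True)  # the folder must exist before upload
    subprocess.run(build_upload_command(datastore, data_file, hash_dir), check=True)


# Only prints the command; call run_upload(...) on the upload machine itself.
print(build_upload_command(
    "SIPAIdentities",
    r"C:\Scripts\EDM\Superheros-CSV.csv",
    r"C:\Scripts\EDM\Hash",
))
```

Passing the arguments as a list avoids quoting problems with the space in "Program Files".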

 

My file was very small, only 22 rows so it took no time at all.  Currently the service supports up to 10 million rows with 5 searchable fields.  Microsoft is working on increasing both limits. 

Note:  You can split up very large datastores into multiple smaller datastores. The benefit of this is you could use a data source that is larger than the current limits. The downside is you will need to configure a separate Schema file for each datastore as well as a rule pack for each one. You will also end up with multiple sensitive info types for each datastore and then will need to ensure that your DLP policies are referencing all of the sensitive info types for each datastore. You can have one DLP policy, but it will need to look for SRN_DS1, SRN_DS2, SRN_DS3 and so on.
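Splitting a large source file is mechanical, and the main thing to get right is repeating the header row in every part. Here is a minimal Python sketch that works on CSV text in memory. The function name `split_rows`, the column names and the sample rows are illustrative, and the per-datastore schema and rule pack work described in the note still has to be done separately.

```python
import csv
import io


def split_rows(csv_text, max_rows):
    """Split CSV text into chunks of at most `max_rows` data rows.

    Every chunk starts with a copy of the header row so that each part can
    be uploaded to its own datastore.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    chunks, current = [], []
    for row in reader:
        current.append(row)
        if len(current) == max_rows:
            chunks.append([header] + current)
            current = []
    if current:  # leftover rows that did not fill a whole chunk
        chunks.append([header] + current)
    return chunks


# Invented sample data with five records, split two records per part.
sample = "SRN,Nickname\n10001,Alpha\n10002,Bravo\n10003,Charlie\n10004,Delta\n10005,Echo\n"
for number, chunk in enumerate(split_rows(sample, 2), start=1):
    print("part", number, "has", len(chunk) - 1, "records")
```

Each part would then be saved to its own file and uploaded against its own datastore name, such as the SRN_DS1, SRN_DS2 and SRN_DS3 style naming mentioned in the note.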

This completes part 2 of the series. While we only really worked on the Rule Pack in this part, I hope you understand how important it is to your entire EDM solution. Next up in Part 3 we will dive into creating DLP policies and doing some testing!

 

23 Comments
Copper Contributor

Hi @Sean McNeill,

 

We are currently running through this setup and have been able to successfully establish the schema and rule package. The issue we are currently facing deals with that security group you mention in part 2 for the EDM upload agent. 

 

"To do this you will need to download the EDM Upload Agent that is available in step 1 of the previous link. When you go to the link to get the download, pay attention to the setup needed for a security group. You will need to create an Office 365 Security group; you can do this from the Microsoft Admin portal or the Azure AD Admin portal.  Create the group and name it EDM_DataUploaders. Add the user account to this group that you have been using for the project."

 

We tagged one of our Global Admins and he followed the instructions to spin this group up for us and add our accounts accordingly. He thought it was odd it had no roles assigned to it, is this just something the executable looks for the user to belong to?

 

We've spent a great deal of time troubleshooting this over the last few days. We have tried this on and off the network to rule out the web proxy or firewall, verified that antivirus is not blocking it (although that would be ironic because we're using Defender), ensured we were local admins and that the CLI was running as admin, and had a global admin run it to rule out any permissions issue. We continue to get the error:

 

c:\Program Files\Microsoft\EdmUploadAgent>EdmUploadAgent.exe /Authorize
Command failed.
Error Type: Microsoft.DataClassification.Edm.Client.EdmServiceClientException
Error Code: InternalServerError
Response: Error: ErrorCode: UserNotInSecurityGroup
Message: The uploading user is not a part of the security group: 'EDM_DataUploaders'.
Target:
InnerError: Date: 2020-06-16T18:57:46.0000000
ErrorCode: UserNotInSecurityGroup
ClientRequestId: a36aa8d6-fee5-45af-ab12-5515169d81ad
DiagnosticInfo:
ActivityId: 65aaf8be-c22d-4f9d-a6bb-79b8677c6368

 

This happens when attempting to authorize. We've tried looking this up and there doesn't seem to be much documentation available aside from MS docs and your post.

 

Thank you!

Erin

 

Microsoft

@erinboris 

Did you verify that the EDM_DataUploaders group is a Security Group?

One thing I can recommend you try is to delete the TokenCache.dat file in the c:\program files\Microsoft\EDMUploadAgent directory and then retry the Authorization step.  Ensure that the account you are using is a Local Admin on the machine running EDM Upload Agent and is a member of the EDM_DataUploaders group.

 

Copper Contributor

Thank you for the fast reply @Sean McNeill ! 

 

We are local admins, we have been assigned group membership and that group is listed as a security group in Security and Compliance Center and created via admin center by one of our GAs. We tried deleting the token file and uninstalling fully and reinstalling, unfortunately we are receiving the same error each time. We have opened a support ticket and are hoping to hear back shortly. Thank you again.

 

c:\Program Files\Microsoft\EdmUploadAgent>EdmUploadAgent.exe /authorize
Command failed.
Error Type: Microsoft.DataClassification.Edm.Client.EdmServiceClientException
Error Code: InternalServerError
Response: Error: ErrorCode: UserNotInSecurityGroup
Message: The uploading user is not a part of the security group: 'EDM_DataUploaders'.
Target:
InnerError: Date: 2020-06-18T14:18:44.0000000
ErrorCode: UserNotInSecurityGroup
ClientRequestId: f795828f-6171-42b7-b756-d22038d47962
DiagnosticInfo:
ActivityId: 76d4d556-9a62-4139-aeba-9051470f1e0c

 

 

Microsoft

@erinboris Sorry to hear that you are still having issues.  If you don't mind could you please post the fix from Support?  

Copper Contributor

Hi Sean, Does Microsoft have the hash algorithm documented somewhere? Our organization is in the process of implementing EDM and performing a risk assessment. I know it's not MD5

Microsoft

@pradell1957 Microsoft uses SHA 256 Algorithm for Hashing.  This was discussed in the EDM Webinar, https://techcommunity.microsoft.com/t5/microsoft-security-and/microsoft-information-protection-and-c...

 

Brass Contributor

Sean,

 

I am getting the error on the EDM rule when try to view it from admin center:

 

There is an error in XML document (1, 2).

 

Request: /api/DlpSensitiveInformationTypeRulePackage/?rulePackId=316c6600-aac4-4193-933f-bc59d0bdb73a

Status code: 500

Exception: System.InvalidOperationException

Diagnostic information: {Version:17.00.5164.004,Environment:NCUPROD,DeploymentId:20f2cc324705402eab61bf7930170e26,InstanceId:WebRole_IN_1,SID:878c31a1-5dad-4d1d-ac6c-82663ed91d19,CID:9abd02b0-b7eb-4936-9f49-1bf614a8edfb}

Time: 2020-08-21T15:09:53.2818532Z

 

I checked the custom rule package file using xsd.exe tool and it is correct. This error - should I ignore or troubleshoot?

 

Thanks,

Ketan Shah

Microsoft

@kshah1999  You can ignore this error, it is an issue with displaying the EDM info and does not affect functionality.  The PG is aware and working on this.

Copper Contributor

Did anyone get a fix for the following error? All the prerequisites are in place, but I am still getting it:

 

PS C:\AA\EdmUploadAgent> .\EdmUploadAgent.exe /Authorize
Command failed.
Error Type: Microsoft.DataClassification.Edm.Client.EdmServiceClientException
Error Code: InternalServerError
Response: Error: ErrorCode: UserNotInSecurityGroup
Message: The uploading user is not a part of the security group: 'EDM_DataUploaders'.
Target:
InnerError: Date: 2020-10-16T09:17:46.0000000
ErrorCode: UserNotInSecurityGroup
ClientRequestId: 00000000-0000-0000-0000-000000000000
DiagnosticInfo:
ActivityId: 00000000-0000-0000-0000-000000000000

 

@Sean McNeill @pradell1957 

 

 

Microsoft

@mevaibhav83 

This error appears to come from the EDM Upload Agent, and it appears that the user you are running as is not a member of the EDM_DataUploaders group.  The article above talks about creating the group and adding the user to it for uploading the data to the EDM service.

Copper Contributor

@mevaibhav83-

 

Are you sure the account you are using to execute the EdmUploadAgent.exe is member of EDM_DataUploaders group in O365?

 

This group must be created, and the user/account used for the on-premises "EdmUploadAgent.exe /Authorize" command must be a member of it. If it is not, you will get the above error. I saw this when I first tried with a different group name. Once I created the group with the expected name, the error went away and I was able to upload data with no issues.

 

Thanks,

 

Ketan Shah

@kshah1999 @Sean McNeill 

Copper Contributor

@Sean McNeill @Ketan Shah  - My account is indeed part of that exact security group name but still getting same error. can you send me which version of  EDM uploader agent you are using. mine looks to be 17.0.x. that i found from log file or CAN you share where can i download EDMUploaderagent as official article looks to have link not working [https://docs.microsoft.com/en-us/microsoft-365/compliance/create-custom-sensitive-information-types-...

 

Appreciate further help!

Microsoft

Use this Link https://go.microsoft.com/fwlink/?linkid=2088639  for the EDM Download for Commercial-GCC

 

 

Screenshot 2020-10-16 091522.jpg

Copper Contributor

thanks @Sean McNeill  - found it (it was my bad). same error with .\EdmUploadAgent.exe /Authorize

 

security group that i have created already synced to AAD via AADConnect today morning so no reason why this command can't find my membership from that group. find attached screenshot.

 

mevaibhav83_0-1602862055540.png

 

Copper Contributor

@Sean McNeill  Hi Sean, We are in the process of implementing EDM into production and creating our EDM Upload files to be hashed. The two pieces of data we are including in the upload are social security number and account number. Do I need to upload the social security number in each format it can possibly be triggered on?. For example 111-22-3333, 111223333, or 111 22 3333. Using the built in Microsoft sensitive info type for SSN I can only make it match on the format of the social security number that is in the uploaded hash file which would be 111223333.  It will not match on 111-22-3333. It really is EDM based on my testing. I saw an update on the Microsoft Roadmap for EDM supporting data configuration but assume this doesn't change anything?

 

Microsoft Information Protection: Exact Data Match will support data configuration

Exact Data Match will support data configuration, allowing text case and character delimiters to optionally be ignored, helping reduce the need for manually defining minor variations in the hashed and uploaded data being protected

  • Feature ID: 65880
  • Added to Roadmap: 7/7/2020
  • Last Modified: 9/16/2020
  • Tags: General Availability, DoD, Microsoft Information Protection, Worldwide (Standard Multi-Tenant), GCC High, GCC
Microsoft

@mevaibhav83 

Instead of an Active Directory Group sync'd via AD Connect, can you try creating a cloud based security group and try?

Microsoft

@pradell1957 

The best advice I can give for determining the format of the built-in Sensitive Info Types is to review the Sensitive Info Types Definitions. Notice that SSN numbers will be detected, but an SSN-type keyword also needs to be present for matching.

 

Yes there is work being done for data normalization as you referenced, that change is still in development, but appears to be headed for production this month.

Copper Contributor

@Sean McNeill  Thanks Sean, my testing with social security numbers was always with keywords and the various social security number formats that were in the Sensitive Info Types Definition. I even tried creating my own Sensitive Info Type but it still only matched on the social security number exactly as it appeared in the EDM Uploaded Hash file.  I plan on testing the new normalization features once released before falling back to uploading each format of a social security number.

Copper Contributor

@Sean McNeill  - did cloud only security group work for you? I am asking this because it does not make sense to me to remove existing on-prem AD security group (which synced to AAD) and create cloud only security AD group.

Microsoft

@mevaibhav83 My lab I only have Cloud Accounts, do not have on-premises.  It was just a suggestion, to see if it would work.  

Copper Contributor

thanks @Sean McNeill ! I have confirmed with support and validated and can confirm that EDMUploadAgent works ONLY with 'Cloud only' security group. (it does not respect on-prem synced AD security group)

Microsoft

Hi @Sean McNeill 
Is there a possibility of being unable to detect sensitive information during uploading the sensitive data via EdmUploadAgent ?
Regards,

Microsoft

@Shinji_Miura if you are referring to the updating of Sensitive info via the EDMUploader, then no, the new file, once uploaded and indexed would replace the current data.  Should not be a gap in coverage.

Version history
Last update: May 18 2020 08:25 AM