Microsoft launched the Exact Data Match (EDM) feature in August 2019. This capability enhances an organization’s ability to identify and accurately target specific data. EDM goes beyond checking for data that matches patterns: it creates a datastore, or dictionary, of actual corporate data, such as employee or customer-specific information, to ensure that data is not sent via email or shared with external users.
EDM can help reduce what is probably one of the biggest issues with Data Loss Prevention (DLP): false positives. A false positive in DLP occurs when data is treated as sensitive to the company but really is not. Microsoft has over 99 built-in sensitive information types, but most of these rely on pattern matching with regular expressions (regex), sequences of characters that define a search pattern. Even with regex, a reliable pattern is hard to define. Let’s look at the Social Security Number (SSN).
An SSN is a 9-digit number that is assigned to each worker within the United States. The SSN is used to identify and track a person’s wages or self-employment earnings and is then used to monitor that person’s Social Security benefits when they begin. With everyone having an SSN, it would seem very easy to define what it is: a 9-digit number. However, an SSN is pretty hard to identify. There are many ways people write out their SSNs, but the most common are 123456789, 123-45-6789, and 123 45 6789. Prior to 2011, SSNs followed a strong format in which certain parts of the number had to fall within specific ranges. SSNs issued after 2011 do not have that strong format. A common way to identify an SSN is to look for the three ways it could be formatted and to require nearby keywords such as SSN, Social Security, Soc Sec, SSN#, etc.
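To make the pattern-plus-keyword approach concrete, here is a minimal Python sketch. It is not how Microsoft's built-in SSN type is implemented; the regex, the keyword list, and the 50-character window are my own assumptions, chosen only to mirror the three formats and the keywords listed above.

```python
import re

# The three common layouts: 123456789, 123-45-6789, 123 45 6789.
# The backreference \1 forces the same separator in both positions.
SSN_PATTERN = re.compile(r"\b\d{3}([- ]?)\d{2}\1\d{4}\b")

# Supporting keywords taken from the text above (illustrative, not exhaustive).
KEYWORD_PATTERN = re.compile(r"\b(?:SSN#?|Social Security|Soc Sec)", re.IGNORECASE)


def find_ssn_candidates(text, window=50):
    """Return SSN-shaped strings with a keyword within `window` characters.

    `window` is an assumed proximity value for this sketch only.
    """
    hits = []
    for match in SSN_PATTERN.finditer(text):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        if KEYWORD_PATTERN.search(text[start:end]):
            hits.append(match.group(0))
    return hits
```

Even with the keyword check, this flags any 9-digit number that happens to sit near the word "SSN", whether or not it belongs to a real person. That is the false-positive problem EDM is meant to reduce.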
With EDM, a healthcare company can now securely upload a datastore containing all of its patients’ names, addresses, MRNs (Medical Record Numbers), SSNs, etc. When an internal user shares a file located on their OneDrive for Business (OD4B) or emails a document containing patient information, the Microsoft DLP service will scan the document and can prevent it from being shared or emailed outside the organization. EDM makes this possible by enabling the DLP service to look for the specific SSNs of the company’s customers or patients instead of looking for any number that looks like an SSN.
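The difference between "looks like an SSN" and "is one of our SSNs" can be sketched in a few lines of Python. This is a conceptual illustration only, not Microsoft's implementation: the salted SHA-256 fingerprint, the function names, and the sample numbers are all assumptions I made up for the example.

```python
import hashlib
import re

# Loose shape check: any 9-digit number with optional separators.
NINE_DIGITS = re.compile(r"\b\d{3}[- ]?\d{2}[- ]?\d{4}\b")


def normalize(value):
    """Strip everything except digits so all layouts compare equal."""
    return re.sub(r"\D", "", value)


def fingerprint(value, salt):
    """Salted hash of the normalized value (an assumed scheme for this sketch)."""
    return hashlib.sha256((salt + normalize(value)).encode("utf-8")).hexdigest()


def build_datastore(known_values, salt):
    """Store only fingerprints, never the raw values."""
    return {fingerprint(value, salt) for value in known_values}


def exact_matches(text, datastore, salt):
    """Return only the candidates whose fingerprint is in the datastore."""
    return [
        match.group(0)
        for match in NINE_DIGITS.finditer(text)
        if fingerprint(match.group(0), salt) in datastore
    ]
```

With a datastore built from a single known value such as "123-45-6789", a document containing both "123 45 6789" and an unrelated "987-65-4321" would only be flagged for the first number.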
Let’s get going with implementing EDM. For this I decided to use superheroes and their hidden identities. We all work at the Superhero Identity Protection Agency (SIPA) and at SIPA, our number one goal is the protection of the secret identity of the world’s superheroes. We have a database that contains everything you could want to know about a superhero. To create our EDM Datastore we’ll export data from the database.
Here is the CSV file that we’ll use as the basis for our EDM Datastore.
In the table above, you can see how the data looks. Notice that we have a header row. The Superhero Registration Number (SRN) is used to identify each superhero. We also exported each superhero’s first and last names, superhero name (Nickname), and home origin.
The documentation to create Custom Sensitive Information Types with EDM is located here. I highly recommend you reference this document, as it is very informative and will be kept up to date. The first step is to define the Schema for our EDM Datastore. To do this we use XML and the CSV file we exported from our SIPA database.
A sample Schema is in the documentation. For our Schema, we first need to determine what fields we want to be searchable. The searchable fields are the key fields that we want to utilize that are critical for identification. How you configure your Schema is up to you, but for SIPA we have determined that the SRN and Nickname fields are the fields we want to be searchable.
Note: Searchable fields should be unique to the datastore, or as unique as possible. We know SRN is never duplicated in the Superhero Database, so that is why it was chosen. We also know there is only one Superman, one Black Widow, one Wolverine, etc., so that is why we chose Nickname as a searchable field. It does not make sense, at least in this instance, to use something like Firstname as a searchable field. While the sample CSV does not contain any duplicate first names, think about the documents and artifacts being used within SIPA: someone could mention Steve and be referring to Steve Jones in Database Management and not Steve Rogers, Captain America.
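If you want to check uniqueness before committing to searchable fields, a quick script can do it. The sketch below is hedged: the column names (SRN, Firstname, Lastname, Nickname, Home) are my guess at the CSV header, and the three rows, including the SRN values, are made-up sample data rather than the real SIPA export.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data; the real export and its header may differ.
SAMPLE_CSV = """SRN,Firstname,Lastname,Nickname,Home
SRN-0001,Steve,Rogers,Captain America,Earth
SRN-0002,Clark,Kent,Superman,Krypton
SRN-0003,Natasha,Romanoff,Black Widow,Earth
"""


def duplicate_values(rows, column):
    """Return the values that appear more than once in `column` (case-insensitive)."""
    counts = Counter(row[column].strip().lower() for row in rows)
    return sorted(value for value, count in counts.items() if count > 1)


def uniqueness_report(csv_text, columns):
    """Map each candidate column to its list of duplicated values."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return {column: duplicate_values(rows, column) for column in columns}


if __name__ == "__main__":
    for column, dupes in uniqueness_report(SAMPLE_CSV, ["SRN", "Nickname", "Home"]).items():
        print(column, "is unique" if not dupes else f"has duplicates: {dupes}")
```

An empty list for a column means it is unique in the export. Keep in mind this only checks the datastore itself; as with the Steve example, a value can be unique in the CSV and still be common in everyday documents.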
Now that we’ve identified the searchable fields, all we need to do is create the XML file.
Let’s go over the XML file. I highlighted the second row above because it is important. Notice the ‘DataStore name=”SIPAIdentities”’ entry: it sets the name of the datastore that the Schema applies to. The field names were all taken from the header row of the CSV file. You can also see that I set the “SRN” and “Nickname” fields as searchable. I have named the Schema file edm.xml.
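For readers who would rather generate the Schema than type it by hand, here is a rough Python sketch. The element and attribute names (EdmSchema, DataStore, Field, searchable) and the namespace follow the sample Schema in Microsoft's documentation as I understand it, so verify them against the current documentation. The field list is my assumption about the CSV header, and build_schema is a helper name I made up.

```python
import xml.etree.ElementTree as ET

# Namespace as used in Microsoft's sample schema; verify against current docs.
EDM_NAMESPACE = "http://schemas.microsoft.com/office/2018/edm"


def build_schema(datastore_name, fields, searchable, description=""):
    """Build an EDM-style schema document and return it as a string."""
    root = ET.Element("EdmSchema", {"xmlns": EDM_NAMESPACE})
    store = ET.SubElement(
        root,
        "DataStore",
        {"name": datastore_name, "description": description, "version": "1"},
    )
    for name in fields:
        ET.SubElement(
            store,
            "Field",
            {"name": name, "searchable": "true" if name in searchable else "false"},
        )
    if hasattr(ET, "indent"):  # pretty-printing is available on Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Assumed header fields; replace with the header row of your own CSV.
    fields = ["SRN", "Firstname", "Lastname", "Nickname", "Home"]
    print(build_schema("SIPAIdentities", fields, {"SRN", "Nickname"}))
```

The printed output could be saved as edm.xml. Generating the file from the CSV header is just one way to keep the field names in the Schema and the CSV in sync.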
8. You now have a datastore Schema uploaded and ready.
This wraps up Part 1. We now understand more about EDM and why it’s helpful. We have begun the journey to getting EDM set up and protecting those who protect us, the superheroes! Please check in for Part 2 of this journey, where we will continue to learn more about EDM, DLP, and the superheroes, and get the EDM configuration wrapped up!