Microsoft Purview Exact Data Match (EDM) support for multi-token corroborative evidence

Martin_Berzin · ‎Dec 14 2023

Introducing multi-token support for corroborative evidence in Exact Data Match

A new feature that improves the accuracy and effectiveness of EDM detection.

Exact Data Match (EDM) lets you detect sensitive data, such as customers’ personally identifiable information (PII), based on your own data sources, for example customer, employee, or patient records. EDM works by hashing your data and uploading the hashes to Microsoft 365, where they can be used to create custom sensitive information types (SITs) that match the hashed values in your data. You can then use these SITs in Microsoft Purview compliance solutions, such as Data Loss Prevention (DLP), Insider Risk Management, or Information Protection auto-labeling, to protect your sensitive data from unauthorized access or leakage.

One of the key components of EDM is corroborative evidence: additional fields in your data source that provide more context and confidence for the detection of the primary field. For example, if your data source contains social security numbers (SSNs) and names of your customers, you can use the name field as corroborative evidence for the SSN field, so that an SSN is detected as a match only when it appears with the name associated with it in your data source. This reduces false positives and increases the accuracy of EDM detection.

However, until now, EDM had a limitation when it came to corroborative evidence fields that contained more than one word, such as last names or street addresses. These fields are considered multi-token fields, because they consist of multiple tokens or words separated by spaces or other delimiters. EDM could not match these fields correctly unless they were mapped to a SIT that could detect them as a single entity. For example, suppose you had a corroborative evidence field called "Address" that contained multi-token values like "123 Main Street, New York, NY" or "1 Microsoft Way, Redmond, WA". EDM could match these values only if you mapped the field to a SIT that could detect entire street addresses. Without such a mapping, EDM compared the individual words in the content, such as "123", "Main", or "Street", with the entire string in the corroborative evidence field, and missed the match even if the full address was present in the content being classified.
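The limitation can be illustrated with a minimal sketch. It assumes plain SHA-256 over lowercased text as a stand-in for EDM's actual hashing and whitespace splitting as a stand-in for its tokenization; the helper `h` and the sample values are hypothetical.

```python
import hashlib

def h(text: str) -> str:
    """Stand-in hash (assumption): SHA-256 of the lowercased text."""
    return hashlib.sha256(text.lower().encode("utf-8")).hexdigest()

# Hypothetical data source: the whole field value is stored as one hash.
stored_address_hash = h("1 Microsoft Way")

content = "Ship to 1 Microsoft Way by Friday"

# Single-token mode: every word in the content is hashed on its own.
single_token_hashes = {h(word) for word in content.split()}

# The hash of the full address never equals the hash of any single word,
# so the corroborative evidence is missed even though it is in the content.
single_token_match = stored_address_hash in single_token_hashes
print(single_token_match)
```

The individual words "1", "Microsoft", and "Way" are all present in the content, yet none of their hashes equals the hash of the full field value.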

This limitation reduced the accuracy and efficacy of EDM, especially for data sources with many multi-token fields that were not easily mapped to existing SITs. To address this issue, we are excited to announce a new feature that enables EDM to support multi-token corroborative evidence fields without requiring a SIT mapping. This feature allows EDM to compare the hashes of consecutive words in the content with the hashes of the multi-token fields in your data source and produce a match if they are identical. This way, EDM can detect multi-token fields such as names, addresses, medical conditions, or any other corroborative evidence fields that may contain more than one word, as long as they are marked as multi-token in your EDM schema. This feature is also supported for languages that use double-byte character sets (DBCS), such as Japanese, which don’t separate words with spaces.
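The idea of comparing hashes of consecutive words can be sketched as a sliding window over the content. This is an illustration only, not the service's implementation: the SHA-256 stand-in hash, whitespace tokenization, and the names `h` and `ngram_hashes` are assumptions; the window limit of 5 tokens is the one stated later in this post.

```python
import hashlib

MAX_TOKENS = 5  # limit stated in this post; treat as subject to change

def h(text: str) -> str:
    """Stand-in hash (assumption): SHA-256 of the lowercased text."""
    return hashlib.sha256(text.lower().encode("utf-8")).hexdigest()

def ngram_hashes(content: str, max_tokens: int = MAX_TOKENS) -> set[str]:
    """Hash every run of 1..max_tokens consecutive words in the content."""
    words = content.split()
    hashes = set()
    for start in range(len(words)):
        for length in range(1, max_tokens + 1):
            if start + length > len(words):
                break
            hashes.add(h(" ".join(words[start:start + length])))
    return hashes

# Hypothetical hashed values of two multi-token corroborative fields.
stored = {h("1 Microsoft Way"), h("Jane Doe")}

content = "Invoice for Jane Doe at 1 Microsoft Way"
matches = stored & ngram_hashes(content)
print(len(matches))
```

Both stored values are found because each appears in the content as a run of consecutive words no longer than the window limit.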

How to use multi-token support for corroborative evidence

To use this feature, you need to opt in to multi-token support for each corroborative evidence field that you want to enable it for. You can do this either through the new EDM UI experience or by updating the schema XML. Here are the steps for each method:

Opt-in through the new EDM UI experience

The new EDM UI experience is a wizard that guides you through the process of creating or editing an EDM schema and uploading your data. You can access it from the Microsoft 365 compliance center, under Data classification > Sensitive info types > Exact data match.

When you upload a sample file to the EDM wizard, the wizard automatically maps each field to the SIT that can best detect the sample data in it. If none of the SITs in your environment can detect the sample data in a field, the field defaults to the single-token match mode. In this mode, EDM compares only the hashes of individual words in the content with the hashed values of that field. Any field selected as a primary element requires a SIT to be mapped to it, but corroborative evidence fields can be mapped to a SIT or be selected as either single-token or multi-token.

However, if the sample data in the field contains multi-token values and the single-token option is selected, the wizard will show a warning that using single-token mode may result in missed detections. In that case, you can change the match mode to Multi-token, which means that EDM will compare the hashes of runs of consecutive words in the content with the hashed values of the field, up to a maximum number of tokens supported by this feature (currently 5). For example, if the content contains the words "Jane Doe", EDM will compare the hashes of "Jane", "Doe", and "Jane Doe" and produce a match if any of them is identical to the hashed value in your data source.

Similarly, if you select Multi-token for a field that only contains single-token data, the wizard will show a warning that using multi-token mode may result in higher latency. The multi-token checks add overhead and are not needed if the data you later hash and upload is expected to be single-token for that field. In general, if a SIT can be accurately mapped to a corroborative evidence field, it is preferred to do so rather than rely on single-token or multi-token matching.

Please note that if the sample data you provide to the wizard is not entirely representative of your actual data, you might not have multi-token values in the sample, even though some of your production data might contain them. For example, some people in your production data might have multiple first names, but the names you provided to the wizard might all be single-word. In that case, you might not see a warning and you might not realize that multi-token matching is needed for such fields. Make sure your sample data is representative of your production data in this regard and keep an eye on fields that might contain more than one word in a small subset of records, such as last names in countries where most people use a single last name.
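One way to check whether a data file is representative in this regard is to count, per column, how many values contain more than one word. The sketch below is a hypothetical helper, not part of any EDM tooling; it assumes a comma-separated file with a header row and whitespace-separated words, so it would not detect multi-token values in languages that don't separate words with spaces.

```python
import csv
import io

def multi_token_columns(csv_text: str) -> dict[str, int]:
    """Count, per column, the values that contain more than one word."""
    counts: dict[str, int] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column, value in row.items():
            # Skip missing cells and malformed rows with extra cells.
            if isinstance(value, str) and len(value.split()) > 1:
                counts[column] = counts.get(column, 0) + 1
    return counts

# Hypothetical sample: most names are single-word, a few are not.
sample = (
    "PatientID,FirstName,LastName\n"
    "1001,Jane,Doe\n"
    "1002,Mary Ann,Smith\n"
    "1003,Luis,Garcia Lopez\n"
)
report = multi_token_columns(sample)
print(report)
```

Running the same check against production data and against the sample given to the wizard shows whether a column that looks single-token in the sample actually needs multi-token matching.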

Opt-in through the EDM schema XML update

You can also opt in to multi-token corroborative evidence for one or more fields when creating a new schema or editing an existing one in XML format, which requires the use of PowerShell. An example schema used to protect patient records is shown below.

Each corroborative evidence field can optionally be configured for multi-token support through the new parameter isMultiToken, which can be set to true or false. Any field that is set to true will be treated as a multi-token field by EDM and compared with consecutive words in the content, up to the maximum number of tokens supported by this feature. Any field that is set to false or that does not have this parameter will be treated as a single-token field by EDM and compared with individual words in the content.
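As a rough illustration of the opt-in, the sketch below sets isMultiToken on selected fields of a small, hypothetical schema. The schema contents, the field names, the namespace URI, and the placement of isMultiToken as an attribute on each Field element are all assumptions made for this example; check the official EDM schema documentation before using anything like it. The helper `mark_multi_token` is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespace for EDM schemas; verify against the official documentation.
NS = "http://schemas.microsoft.com/office/2018/edm"
ET.register_namespace("", NS)

# Hypothetical patient-record schema, invented for illustration only.
schema_xml = f"""<EdmSchema xmlns="{NS}">
  <DataStore name="PatientRecords" description="Hypothetical patient schema" version="1">
    <Field name="PatientID" searchable="true" />
    <Field name="FullName" searchable="false" />
    <Field name="Condition" searchable="false" />
  </DataStore>
</EdmSchema>"""

def mark_multi_token(xml_text: str, field_names: set[str]) -> str:
    """Set isMultiToken="true" on the named corroborative evidence fields."""
    root = ET.fromstring(xml_text)
    for field in root.iter(f"{{{NS}}}Field"):
        if field.get("name") in field_names:
            field.set("isMultiToken", "true")
    return ET.tostring(root, encoding="unicode")

updated = mark_multi_token(schema_xml, {"FullName", "Condition"})
print(updated)
```

Fields left without the parameter, such as PatientID here, keep the single-token behavior described above.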

Recommendations and feedback

We recommend the following best practices when using this feature:

After creating or editing the schema, please wait at least one hour before downloading the schema to be used for the EDM data upload, to ensure the system has synced. Otherwise, you may get an error message when attempting to download the schema via the command line using the following syntax: EdmUploadAgent.exe /SaveSchema /DataStoreName <schema name> /OutputDir <path to output folder>
Important: please don't use the EDM upload agent to download the XML schema for manual edit and re-upload of the schema, as doing so will result in errors since the EDM upload agent downloads the schema with additional tags that don't pass schema creation checks.
We recommend trimming any multi-token corroborative evidence fields to the maximum number of tokens supported by the multi-token feature (currently 5 tokens). Otherwise, that corroborative evidence may not be detected. Alternatively, map the multi-token field to a SIT that can fully detect the multi-token data.
We recommend first testing the EDM SIT using the test cmdlet and waiting 24 hours after creating or editing the EDM schema before testing it in a policy in a solution such as DLP.
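The trimming recommendation above could be applied as a preprocessing step before hashing and uploading the data. The sketch below is one possible approach under stated assumptions: tokens are separated by whitespace, the first five tokens are the ones kept, and `trim_tokens` is a hypothetical helper, not part of the EDM upload agent. How EDM itself tokenizes punctuation is not covered here.

```python
MAX_TOKENS = 5  # limit stated in this post; treat as subject to change

def trim_tokens(value: str, max_tokens: int = MAX_TOKENS) -> str:
    """Keep only the first max_tokens whitespace-separated words."""
    words = value.split()
    return " ".join(words[:max_tokens])

trimmed = trim_tokens("123 Main Street Apartment 4B New York NY")
print(trimmed)  # 123 Main Street Apartment 4B
```

Values that already fit within the limit pass through unchanged, apart from repeated spaces being collapsed.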

We hope you find this feature useful, and we welcome your feedback. You can leave it on this blog post or, for managed accounts, provide it to your Microsoft contact.

Thank you for your interest in EDM!
