Public preview of out-of-the-box trainable classifiers with auto-labeling support
Published Sep 27 2022 11:25 AM
Microsoft

Before sensitive data can be protected, it must first be discovered and labeled. However, traditional classification techniques, such as regular expressions and manual or rule-based approaches, can’t easily handle massive volumes of data. Machine learning-enabled trainable classifiers can greatly improve the speed, accuracy, and coverage of sensitive data identification at enterprise scale.

 

At Microsoft, our goal is to provide a built-in, intelligent, unified, and extensible solution to protect sensitive data across your digital estate – in Microsoft 365 cloud services, on-premises, third-party SaaS applications, and more. With Microsoft Purview Information Protection, we are building a unified set of capabilities for data classification, labeling, and protection not only in Office apps but also in other popular productivity services where information resides (e.g. SharePoint Online, Exchange Online, and Microsoft Teams), as well as endpoint devices. 

 

We are pleased to announce the public preview of 23 new pre-trained, ready-to-use trainable classifiers, which can be used in server-side sensitivity auto-labeling policies across workloads. These are in addition to the current 15 pre-trained business content and behavior trainable classifiers. The new classifiers differ from our custom trainable classifiers, which organizations train on samples of their own documents to identify proprietary or market-vertical-specific sensitive data. Instead, the new classifiers are pre-trained on a large and diverse set of real-world samples to provide broad coverage of common business functions (e.g., finance, legal, corporate/executive, information technology, sales, operations, production, human resources, and marketing communications).

 

Category: Classifiers available

IP and Trade Secrets
1. Mergers & acquisitions (M&A Files)
2. Software product development files
3. Project reports and documents
4. Business plans

Finance (GLBA, FCRA, PCI, etc.)
1. Bank statements
2. Budgets
3. Financial audit reports
4. Financial statements
5. Loan agreements
6. Statements of work
7. Invoices

Healthcare (HIPAA)
1. Health and medical forms

HR (GDPR and Privacy regulations)
1. Employee disciplinary action files
2. Employee insurance files
3. Employment agreements
4. Paystubs

IT
1. Network design files

Legal
1. License agreements
2. Non-disclosure agreements

Operations
1. Construction specifications
2. Manufacturing batch files

Sales and marketing
1. Sales and revenue reports

Other
1. Meeting notes

In addition to broad categories (e.g., finance), these classifiers can detect more granular document types (specific forms), including intellectual property and trade secrets. Used together, these trainable classifiers can detect more than 30 types of sensitive data.
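The counts in this announcement can be checked with a small sketch. The dictionary below simply transcribes the category list above. The names `NEW_CLASSIFIERS`, `EXISTING_PRETRAINED`, `count_classifiers`, and `category_of` are illustrative only and are not part of any Microsoft Purview API. Reading "more than 30" as the 23 new classifiers plus the 15 existing ones is an assumption.

```python
# Category -> classifier names, transcribed from the list in this post.
# This is an illustrative data structure, not a Purview API object.
NEW_CLASSIFIERS = {
    "IP and Trade Secrets": [
        "Mergers & acquisitions (M&A Files)",
        "Software product development files",
        "Project reports and documents",
        "Business plans",
    ],
    "Finance": [
        "Bank statements",
        "Budgets",
        "Financial audit reports",
        "Financial statements",
        "Loan agreements",
        "Statements of work",
        "Invoices",
    ],
    "Healthcare": ["Health and medical forms"],
    "HR": [
        "Employee disciplinary action files",
        "Employee insurance files",
        "Employment agreements",
        "Paystubs",
    ],
    "IT": ["Network design files"],
    "Legal": ["License agreements", "Non-disclosure agreements"],
    "Operations": ["Construction specifications", "Manufacturing batch files"],
    "Sales and marketing": ["Sales and revenue reports"],
    "Other": ["Meeting notes"],
}

# Number of pre-trained classifiers that existed before this preview,
# as stated in the post.
EXISTING_PRETRAINED = 15


def count_classifiers(catalog):
    """Total number of classifiers across all categories."""
    return sum(len(names) for names in catalog.values())


def category_of(classifier, catalog):
    """Return the category a classifier belongs to, or None if unknown."""
    for category, names in catalog.items():
        if classifier in names:
            return category
    return None


new_total = count_classifiers(NEW_CLASSIFIERS)
combined_total = new_total + EXISTING_PRETRAINED  # assumed meaning of "more than 30"
```

A lookup such as `category_of("Paystubs", NEW_CLASSIFIERS)` shows which business function a given classifier covers.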

 

Rigorous and comprehensive model pre-training process

Multiple inputs (e.g., classification schema, regulatory context, input from customers and industry subject matter experts, IP considerations, and common and customized business processes) were used to increase model accuracy and scalability. Our engineering team drew on Microsoft’s broad and deep machine learning expertise and on leading proprietary and open-source frameworks, platforms, and development environments (e.g., PyTorch, ML.NET, Babel, ONNX) throughout the development workflow for these trainable classifiers: model generation, building, peer review, testing (including real-time testing), and feedback.

 

Key capabilities

  • Auto-labeling support for sensitivity labels: Microsoft Purview Information Protection can use these new trainable classifiers in server-side auto-labeling policies for Microsoft SharePoint, OneDrive, and Exchange. System admins can now leverage this new capability to more quickly and comprehensively discover, label, and protect massive volumes of sensitive data across their digital estate with pre-trained models optimized for accuracy and scalability. The screenshot below shows how to add trainable classifiers in auto-labeling for files and emails.

[Screenshot: adding trainable classifiers to an auto-labeling policy for files and emails]

 

  • Simulation mode for sensitivity auto-labeling: Leverage simulation mode to view the results before turning on the policy. The simulation mode shows details about the matched items across all workloads where the policy has been created.

 

[Screenshot: simulation mode results for an auto-labeling policy]

 

  • Automatic labels applied to content in SharePoint, OneDrive, and Exchange: Once the policies are created and turned on, the corresponding labels are automatically applied to the sensitive content in SharePoint, OneDrive, and Exchange.

[Screenshot: sensitivity labels automatically applied to files]

 

  • Content explorer: Discover and view categories of sensitive content that match these trainable classifiers, as well as the specific files that contain sensitive data, in Microsoft SharePoint, OneDrive, Microsoft Teams, and Exchange, as shown in the screenshot below.

[Screenshot: content explorer showing content that matches the trainable classifiers]

 

  • Data Loss Prevention support: As referenced in an earlier blog post, Microsoft Purview Data Loss Prevention now supports all trainable classifiers, including these new classifiers in public preview. This applies to all DLP workloads: SharePoint, OneDrive, Teams, Exchange, and Endpoint DLP. The new trainable classifiers can be added to DLP policies, as shown below.

 

[Screenshots: adding trainable classifiers to a DLP policy]

 

  • Automatically apply retention labels: Microsoft Purview Data Lifecycle and Records Management can use these trainable classifiers as a condition in auto-apply retention label policies. These policies can apply a retention label to content located in SharePoint for Microsoft 365 sites, OneDrive accounts, and Exchange Online email, including attachments. To give you confidence that the correct content will be labeled, and to inform your policy settings, we recommend using content explorer, described above, to see what matches each classifier and where it is located.

 

[Screenshot: Apply a retention label to content that matches a trainable classifier]
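The relationship between simulation mode and a turned-on auto-labeling policy, described in the capabilities above, can be illustrated with a toy model. This is a conceptual sketch only: `Item`, `AutoLabelPolicy`, and `run_policy` are hypothetical names, the classifier results are supplied by hand rather than produced by a model, and leaving already-labeled items untouched is a simplifying assumption. None of this reflects how Microsoft Purview is implemented or exposed.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    """A file or email; matched_classifiers stands in for classifier output."""
    name: str
    workload: str                      # e.g. "SharePoint", "OneDrive", "Exchange"
    matched_classifiers: set[str]
    label: Optional[str] = None


@dataclass
class AutoLabelPolicy:
    """Hypothetical policy: apply `label` when any listed classifier matches."""
    label: str
    classifiers: set[str]
    workloads: set[str]
    simulation: bool = True            # start in simulation mode

    def matches(self, item: Item) -> bool:
        in_scope = item.workload in self.workloads
        return in_scope and bool(self.classifiers & item.matched_classifiers)


def run_policy(policy: AutoLabelPolicy, items: list[Item]) -> Counter:
    """Count matched items per workload; label them only outside simulation."""
    matched_per_workload = Counter()
    for item in items:
        if not policy.matches(item):
            continue
        matched_per_workload[item.workload] += 1
        # Assumption: items that already carry a label are left untouched.
        if not policy.simulation and item.label is None:
            item.label = policy.label
    return matched_per_workload


items = [
    Item("q3-budget.xlsx", "SharePoint", {"Budgets"}),
    Item("nda-contoso.docx", "OneDrive", {"Non-disclosure agreements"}),
    Item("lunch-menu.docx", "SharePoint", set()),
]
policy = AutoLabelPolicy(
    label="Confidential",
    classifiers={"Budgets", "Invoices"},
    workloads={"SharePoint", "OneDrive"},
)

preview = run_policy(policy, items)    # simulation: report matches, label nothing
labels_after_preview = [item.label for item in items]

policy.simulation = False              # turn the policy on
applied = run_policy(policy, items)    # same matches, labels now applied
```

In this model the simulation run and the live run report the same matches; the only difference is whether labels are written, which is why reviewing simulation results before turning on a policy is useful.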

 

How to get started

You can get access to Microsoft Purview solutions, including these new trainable classifiers, by enabling a trial directly in the Microsoft Purview compliance portal. Visit your Microsoft Purview compliance portal for more details, or check out the Microsoft Purview solutions trial (an active Microsoft 365 E5 subscription is required for access to the new trainable classifiers).

 

Authors: Anna Chiang and Annapurna Saripalli 

Last update: Oct 07 2022 11:27 AM