To protect sensitive data, you must first discover and label it. However, traditional classification techniques, such as regular expressions and manual or rule-based approaches, can’t easily handle massive volumes of data. Machine learning-enabled trainable classifiers can greatly improve the speed, accuracy, and coverage of sensitive data identification at enterprise scale.
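To make the contrast concrete, here is a minimal sketch of the rule-based style of detection. The pattern, function name, and sample strings are hypothetical and are not part of Microsoft Purview; they only illustrate that a regular expression finds pattern-shaped strings but says nothing about what kind of document it is looking at.

```python
import re

# Hypothetical rule: four groups of four digits, optionally separated
# by a space or hyphen (a rough "card-like number" pattern).
CARD_LIKE = re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b")


def find_card_like(text: str) -> list[str]:
    """Return every substring that matches the card-like pattern."""
    return CARD_LIKE.findall(text)


# Illustrative sample texts, invented for this sketch.
invoice = "Payment received on card 4111 1111 1111 1111 for invoice 20231."
nda = "The Receiving Party shall hold all Confidential Information in strict confidence."

print(find_card_like(invoice))  # the rule fires on the number pattern
print(find_card_like(nda))      # no pattern to match, so the rule finds nothing
```

The second text reads like part of a non-disclosure agreement, yet it contains nothing a pattern rule can anchor on. Recognizing a document by its overall content is the kind of case trainable classifiers are meant to cover.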
At Microsoft, our goal is to provide a built-in, intelligent, unified, and extensible solution to protect sensitive data across your digital estate – in Microsoft 365 cloud services, on-premises, third-party SaaS applications, and more. With Microsoft Purview Information Protection, we are building a unified set of capabilities for data classification, labeling, and protection not only in Office apps but also in other popular productivity services where information resides (e.g. SharePoint Online, Exchange Online, and Microsoft Teams), as well as endpoint devices.
We are pleased to announce the public preview of 23 new pre-trained, ready-to-use trainable classifiers, which can be used in server-side sensitivity auto-labeling policies across workloads. These are in addition to the current 15 pre-trained business content and behavior trainable classifiers. The new classifiers differ from our custom trainable classifiers, which organizations train on samples of their own documents to identify proprietary or market-vertical-specific sensitive data. Instead, the new trainable classifiers are already pre-trained on a large and diverse set of real-world samples to provide broad coverage of common business functions (e.g. finance, legal, corporate/executive, information technology, sales, operations, production, human resources, and marketing communications).
| Category | Classifiers available |
| --- | --- |
| IP and Trade Secrets | 1. Mergers & acquisitions (M&A Files) 2. Software product development files 3. Project reports and documents 4. Business plans |
| Finance (GLBA, FCRA, PCI, etc.) | 1. Bank statements 2. Budgets 3. Financial audit reports 4. Financial statements 5. Loan agreements 6. Statements of work 7. Invoices |
| Healthcare (HIPAA) | 1. Health and medical forms |
| HR (GDPR and Privacy regulations) | 1. Employee disciplinary action files 2. Employee insurance files 3. Employment agreements 4. Paystubs |
| IT | 1. Network design files |
| Legal | 1. License agreements 2. Non-disclosure agreements |
| Operations | 1. Construction specifications 2. Manufacturing batch files |
| Sales and marketing | 1. Sales and revenue reports |
| Other | 1. Meeting notes |
In addition to broad categories (e.g., finance), these classifiers can also detect more granular document types (specific forms), including intellectual property and trade secrets. Together, these trainable classifiers can detect more than 30 types of sensitive data.
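As a quick sanity check on these numbers, the sketch below writes the table of new classifiers as a plain Python dictionary and tallies it. The dictionary layout and variable names are illustrative only and do not correspond to any Microsoft Purview API. It also assumes that "together" refers to the 23 new classifiers plus the 15 existing pre-trained ones.

```python
# Classifier names copied from the table of new classifiers;
# the dictionary structure itself is an illustrative assumption.
NEW_CLASSIFIERS = {
    "IP and Trade Secrets": [
        "Mergers & acquisitions (M&A Files)",
        "Software product development files",
        "Project reports and documents",
        "Business plans",
    ],
    "Finance (GLBA, FCRA, PCI, etc.)": [
        "Bank statements",
        "Budgets",
        "Financial audit reports",
        "Financial statements",
        "Loan agreements",
        "Statements of work",
        "Invoices",
    ],
    "Healthcare (HIPAA)": ["Health and medical forms"],
    "HR (GDPR and Privacy regulations)": [
        "Employee disciplinary action files",
        "Employee insurance files",
        "Employment agreements",
        "Paystubs",
    ],
    "IT": ["Network design files"],
    "Legal": ["License agreements", "Non-disclosure agreements"],
    "Operations": ["Construction specifications", "Manufacturing batch files"],
    "Sales and marketing": ["Sales and revenue reports"],
    "Other": ["Meeting notes"],
}

# Count of existing pre-trained classifiers stated in the announcement.
EXISTING_PRETRAINED = 15

total_new = sum(len(names) for names in NEW_CLASSIFIERS.values())
total_pretrained = total_new + EXISTING_PRETRAINED

print(f"New classifiers in preview: {total_new}")
print(f"New plus existing pre-trained classifiers: {total_pretrained}")
```

The tally matches the 23 new classifiers announced, and under the stated assumption the combined total is consistent with the "more than 30" figure.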
Rigorous and comprehensive model pre-training process
Multiple inputs (e.g., classification schema, regulatory context, customer and industry subject matter expert input, IP considerations, and common and customized business processes) were used to increase model accuracy and scalability. Our engineering team drew on Microsoft’s broad and deep machine learning expertise and on leading frameworks, platforms, and development environments, both proprietary and open source (e.g., PyTorch, ML.NET, Babel, ONNX). These were applied across the development workflow for these trainable classifiers: model generation, building, peer review, testing (including real-time testing), and feedback.
Key capabilities
Data Loss Prevention support: As referenced in an earlier blog post, Microsoft Purview Data Loss Prevention now supports all trainable classifiers, including these new classifiers in public preview. This applies to all DLP workloads: SharePoint, OneDrive, Teams, Exchange, and Endpoint DLP. The new trainable classifiers can be easily added to DLP policies.
Automatically apply retention labels: Microsoft Purview Data Lifecycle and Records Management can use these trainable classifiers as a condition in auto-apply retention label policies. These policies can apply a retention label to content located in SharePoint for Microsoft 365 sites, OneDrive accounts, and Exchange Online email, including attachments. To inform your policy settings and give you confidence that the correct information will be labeled, we recommend using content explorer to understand what will match each classifier and where that content is located.
Apply a retention label to content that matches a trainable classifier
How to Get Started
Get access to Microsoft Purview solutions directly in the Microsoft Purview compliance portal with a trial. Enabling the trial in the portal gives you quick access to these new trainable classifiers. Visit your Microsoft Purview compliance portal for more details, or check out the Microsoft Purview solutions trial (an active Microsoft 365 E5 subscription is required for access to the new trainable classifiers).
Authors: Anna Chiang and Annapurna Saripalli