Public preview of new Source Code Classifier and general availability of more trainable classifiers
Published Feb 06 2023 09:00 AM

Today, organizations across industries are generating massive amounts of data, and the volume grows every year. According to Statista, by 2025 the volume of data/information created, captured, copied, and consumed will reach 191 zettabytes, more than double the 91 zettabytes of 2022 (an increase of roughly 110%). Out-of-the-box, machine learning-enabled trainable classifiers can greatly improve the speed, accuracy, and coverage of sensitive data identification at enterprise scale.


At Microsoft, our goal is to provide a built-in, intelligent, unified, and extensible solution to protect sensitive data across your entire digital estate, including multi-cloud environments (hybrid, Microsoft, and non-Microsoft clouds) as well as Microsoft and third-party SaaS applications and services. With Microsoft Purview Information Protection, we provide a unified set of capabilities for data classification, labeling, and protection across multiple platforms.


What’s new in public preview

Unauthorized exfiltration of source code by insiders can expose organizations to significant risk of intellectual property (IP) loss and resulting damages. We are excited to announce the public preview of a new and enhanced trainable classifier for detecting source code. We’ve heard your feedback: this new classifier supports more file extensions (70+) and 23 programming languages, incorporates customer input, and can detect embedded and partial source code. It replaces the existing source code classifier and can be used directly in auto-labeling and data loss prevention (DLP) policies. For existing policies that specify source code as a condition, the new classifier will be applied automatically in place of the existing one, with no action needed from customers.




Classifier description 

This classifier detects whether a given document contains any programming code. 

Supported programming languages (23) 

ActionScript, C, C#, C++, Clojure, CoffeeScript, Go, Haskell, Java, JavaScript, Lua, MATLAB, Objective-C, Perl, PHP, Python, R, Ruby, Scala, Shell, Swift, TeX, Vim Script 

Covered extensions 

.c, .h, .w, .cs, .cake, .csx, .cpp, .c++, .cc, .cp, .cxx, .hh, .hpp, .hxx, .java, .js, .m, .matlab, .pl, .perl, .pm, .prl, .ipb, .php, .php3, .php4, .php5, .py, .pyc, .pyo, .r, .rl, .rb, .irb, .swift, .as, .clj, .cljs, .cljc, .coffee, .Go, .hs, .hsc, .lua, .lub, .m, .mm, .scala, .sca, .Tex,T, .xs, .sh, .vim, .edn, .javac, .lhs, .mjs, .pod, .r, .rda, .RData, .rds, .rb, .bash, .docx, .docm, .doc, .dotx, .dotm, .dot, .pdf, .rtf, .txt, .one, .eml, .msg, .pptx, .pptm, .ppt, .potx, .potm, .pot, .ppsx, .ppsm, .pps, .ppam, .ppa, .xlsx, .xlsm, .xlsb, .xls, .csv, .xltx, .xltm, .xlt, .xlam, .xla, .sc, .litcoffee

New features 

This classifier can detect code that is embedded in text files, including partial code. It performs well when code makes up about 20% of a typical document or, as an example, when a 100-page document contains 20 lines of code.
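As a rough illustration of the extension list above, the following Python sketch checks whether a file name carries one of the covered extensions, for example to decide which files in a share are candidates for scanning. The names `COVERED_EXTENSIONS` and `has_covered_extension` are hypothetical, only a subset of the listed extensions is included, and case-insensitive matching is an assumption. The check says nothing about how the classifier itself evaluates content.

```python
from pathlib import PurePath

# Subset of the covered extensions listed above (hypothetical constant).
# Assumption: matching is case-insensitive; the announcement does not say.
COVERED_EXTENSIONS = frozenset({
    ".c", ".h", ".cs", ".cpp", ".java", ".js", ".py", ".rb", ".go",
    ".swift", ".scala", ".lua", ".docx", ".pdf", ".txt", ".pptx", ".xlsx",
})


def has_covered_extension(file_name: str) -> bool:
    """Return True if the file's final extension is in the covered subset."""
    suffix = PurePath(file_name).suffix.lower()
    return suffix in COVERED_EXTENSIONS
```

A file with no extension (such as `README`) has an empty suffix and is therefore reported as not covered.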


Besides source code, we recommend that the following classifiers for detecting IP and trade secrets be added to policies for maximum protection: 

  • Software product development files 
  • Network design files 
  • General IT content with the IT classifier 


Figure 1. Screenshot of the new source code classifier in action with DLP policies 


In addition to the new and improved source code and resume trainable classifiers, we’re pleased to announce the public preview of 13 additional trainable classifiers, which can quickly identify highly confidential personally identifiable information (PII), such as personal financial information, account statements, and employee stocks and financial bond records.


Figure 2. Table that includes the enhanced source code and resume classifiers and 13 new trainable classifiers in public preview


New general availability: new classifiers, integration into regulatory templates, and shorter message detection

We are excited to announce the general availability of 23 new purpose-built trainable classifiers that were previously available in public preview, along with server-side auto-labeling policies for sensitivity labels across SharePoint, OneDrive, Exchange, Microsoft Teams, and endpoint DLP. Server-side auto-labeling supports these classifiers in simulation mode, in which admins can create policies, turn on simulation, and review the results before turning on the policy. Once the policies are created, the corresponding labels are automatically applied to sensitive content in SharePoint and OneDrive as well as to emails.


E5 starter templates now include trainable classifiers

For more accurate and comprehensive identification of sensitive data, specific trainable classifiers are now integrated into the existing E5 Enhanced starter templates, along with related sensitive information types (SITs), to help you easily configure policies for compliance with various regulatory standards: U.S. Gramm-Leach-Bliley Act (GLBA) Enhanced, U.S. Health Insurance Act (HIPAA) Enhanced, General Data Protection Regulation (GDPR) Enhanced, and U.S. Personally Identifiable Information (PII) Data Enhanced. You can easily add these templates to server-side auto-labeling and DLP policies to protect your sensitive data against top regulatory risks and help address compliance requirements.


Shorter message detection of sensitive information in Teams

We have improved our trainable classifiers to work on shorter messages, especially Teams conversations, so that sensitive content shared in Teams is also protected and collaboration stays secure. The classifiers listed below can detect content as short as 50 tokens/words:

  • Mergers & Acquisitions
  • Employee insurance files
  • Network design files
  • Loan agreements and offer letters
  • Financial audit reports
  • Manufacturing batch records
  • Product development files
  • Construction specifications
  • Employee disciplinary action files
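To make the 50-token threshold concrete, here is a minimal Python sketch that estimates whether a message is long enough to be evaluated by these classifiers. It assumes that whitespace-separated words approximate the "tokens/words" mentioned above; this post does not describe the actual tokenization, and `MIN_TOKENS` and `meets_minimum_length` are hypothetical names.

```python
MIN_TOKENS = 50  # threshold quoted in the post; the tokenization is an assumption


def meets_minimum_length(message: str, min_tokens: int = MIN_TOKENS) -> bool:
    """Approximate the token count by splitting the message on whitespace."""
    return len(message.split()) >= min_tokens
```

A check like this is only a rough pre-filter: a message that passes it is not necessarily sensitive, and the service may count tokens differently.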


Leverage our investment in trainable classifiers

Our engineering team drew on Microsoft’s broad and deep machine learning expertise and on leading frameworks, platforms, and development environments, both proprietary and open source (e.g., PyTorch, ML.NET, Babel, ONNX), for model generation, building, peer review, testing (including real-time testing), and feedback in the development workflow for these trainable classifiers. By doing the “heavy lifting” of pre-training and (re)optimizing these classifiers across the most common sensitive business categories, we’re enabling you to discover, label, and protect massive volumes of sensitive data across your digital estate more quickly and comprehensively. Our trainable classifiers cover the business categories most commonly requested by our customers across geographic regions. They can identify sensitive data in nine broad business categories as well as in specific types of documents. Furthermore, advanced classification algorithms can adapt more easily to changing regulatory and business contexts.


We are constantly extending our product capabilities to help organizations secure their sensitive data and content. We look forward to hearing your feedback!

How to Get Started 

Get access to Microsoft Purview solutions directly in the Microsoft Purview compliance portal with a trial. By enabling the trial in the Purview compliance portal, you can quickly access these new trainable classifiers. Visit your Microsoft Purview compliance portal for more details or check out the Microsoft Purview solutions trial (an active Microsoft 365 E5 subscription is required for access to the new trainable classifiers).

Last update: Feb 15 2023 02:24 PM