Microsoft Tech Community

Trainable Classifiers - Tips

Hello All,
Just sharing some tips on collecting sample data and creating trainable classifiers for labelling and Data Loss Prevention (DLP).
-To train the machine learning model to recognize a certain document type, the documents must share one or more recognizable aspects.
Possible recognizable aspects of the data/document type:
-The examples below focus on document fingerprinting and previously identifiable sensitive information types.
For example:
Regarding positive samples, the sample files below share a pattern: credit card (CC) information (dummy data), keywords that refer to CC information such as CVV2 and AMEX, and SSN information.






-This can be regarded as the pattern for positive detection. The above samples (about 150 files with a similar pattern) are stored in a folder in a dedicated SharePoint site. In the screenshot below, the same items are used as negative samples for another classifier.
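To make the idea of "recognizable aspects" concrete, here is a minimal Python sketch that screens a text sample for the three aspects mentioned above (a card number, a CC keyword, an SSN-style number) before it goes into the positive folder. This is only an illustration and not how the trainable classifier works internally. The function names, the keyword list and the regular expressions are my own assumptions.

```python
import re

# Assumption: keyword list taken from the keywords mentioned in this post.
CARD_KEYWORDS = ("cvv2", "amex")

# Assumption: simple illustrative patterns, not the built-in sensitive info types.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")  # 13 to 19 digits, optional separators
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")      # NNN-NN-NNNN


def luhn_ok(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 13:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def recognizable_aspects(text: str) -> dict:
    """Report which of the three aspects appear in a sample's text."""
    lowered = text.lower()
    return {
        "card_number": any(luhn_ok(m.group()) for m in CARD_RE.finditer(text)),
        "card_keyword": any(k in lowered for k in CARD_KEYWORDS),
        "ssn": bool(SSN_RE.search(text)),
    }


if __name__ == "__main__":
    sample = "Card: 4111 1111 1111 1111\nCVV2: 123\nSSN: 078-05-1120"
    print(recognizable_aspects(sample))
```

A sample that reports all three aspects fits the positive pattern described here. A sample that reports none is probably a poor positive sample.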


-Regarding negative samples, the concept is the same: they can also be stored in a folder in a dedicated SharePoint site and should share a unique pattern or fingerprint. For example:
-The samples below represent credential information (dummy data). About 150 samples are needed. They should strongly represent a uniform document/data type that is different from the positive samples.
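Before uploading, it can help to sanity-check local copies of the positive and negative folders. The sketch below is only an illustration. It counts the files in each folder against the figure of about 150 used in this post (not an official limit) and lists any file whose content appears in both sets for the same classifier. The function names and the local folder layout are my own assumptions.

```python
import hashlib
from pathlib import Path

# Assumption: the "about 150 samples" figure from this post, not an official limit.
TARGET_SAMPLES = 150


def file_hashes(folder: Path) -> dict:
    """Map the SHA-256 of each file's content to its path (identical files collapse)."""
    hashes = {}
    for path in sorted(folder.iterdir()):
        if path.is_file():
            hashes[hashlib.sha256(path.read_bytes()).hexdigest()] = path
    return hashes


def check_sample_sets(positive_dir, negative_dir, target=TARGET_SAMPLES) -> dict:
    """Count unique samples per folder and list content shared by both folders."""
    pos = file_hashes(Path(positive_dir))
    neg = file_hashes(Path(negative_dir))
    # Names of positive files whose exact content also exists in the negative set.
    overlap = sorted(pos[h].name for h in pos.keys() & neg.keys())
    return {
        "positive_count": len(pos),
        "negative_count": len(neg),
        "positive_enough": len(pos) >= target,
        "negative_enough": len(neg) >= target,
        "overlap": overlap,
    }
```

Reusing the positive samples of one classifier as negative samples for a different classifier is fine. The overlap check only matters within the two folders of the same classifier.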






Similarly, the data is stored in a dedicated folder in a SharePoint site:


Once the trainable classifier is created and fed these samples, it should be able to identify the data type, which facilitates detection and minimizes potential false positives.


