
Trainable Classifiers - Tips

Hello All,
Just sharing some tips to assist with data collection and the creation of trainable classifiers for labelling and Data Loss Prevention (DLP).
 
-To train machine learning to recognize a certain document type, the document type must have one or more recognizable aspects.
 
Possible usable recognizable aspects of the data/document type:
 
 
-In the examples below, we focus on document fingerprinting and previously identifiable sensitive information types.
 
 
 
For example, regarding positive samples: the file samples below display a pattern. They contain CC (credit card) info (dummy data), include keywords referring to CC info such as CVV2, AMEX, etc., and also contain SSN information.
 
AhmedSHMK_0-1718691026173.png

 

AhmedSHMK_1-1718691026174.png

 

AhmedSHMK_2-1718691026175.png

 

 
-This can be regarded as a pattern for positive detection. The above data samples (about 150 samples of a similar pattern) are stored in a folder in a dedicated SharePoint site. (In the screenshot below, the same items are used as negative samples for another classifier.)
AhmedSHMK_3-1718691026176.png
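
Before uploading a positive sample set, it can help to check locally that each file really carries the pattern described above (a card number, a card-related keyword, and an SSN). The following is a minimal Python sketch of such a pre-check. The regular expressions, the keyword list, and the function names are assumptions made for illustration; they are not what the trainable classifier itself uses, and the card pattern is deliberately simple.

```python
import re

# Assumed, simplified patterns for a local sanity check only.
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,19}\b")   # 13-19 digits, optional space/hyphen separators
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")     # US SSN written as NNN-NN-NNNN
KEYWORDS = ("cvv", "amex", "visa", "mastercard")  # hypothetical keyword list


def luhn_ok(candidate: str) -> bool:
    """Return True if the digits in `candidate` pass the Luhn checksum."""
    digits = [int(c) for c in candidate if c.isdigit()]
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:      # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def matches_positive_pattern(text: str) -> bool:
    """True if the text has a Luhn-valid card number, a card keyword, and an SSN."""
    has_card = any(luhn_ok(m.group()) for m in CARD_RE.finditer(text))
    has_keyword = any(k in text.lower() for k in KEYWORDS)
    has_ssn = SSN_RE.search(text) is not None
    return has_card and has_keyword and has_ssn


def pattern_coverage(samples: list[str]) -> float:
    """Fraction of samples that show the full positive pattern."""
    if not samples:
        return 0.0
    return sum(matches_positive_pattern(s) for s in samples) / len(samples)
```

A coverage value well below 1.0 suggests that some files in the folder do not follow the intended pattern and may be worth reviewing before they are used for training.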

 

 
 
-Regarding negative samples, the same concept applies: they can also be stored in a folder in a dedicated SharePoint site and should have a unique pattern or fingerprint. For example:
 
-The samples below represent credential information (dummy data). About 150 samples are needed. The samples should strongly represent a uniform document/data type that is different from the positive samples.
 
AhmedSHMK_4-1718691026177.png

 

 
AhmedSHMK_5-1718691026177.png

 

AhmedSHMK_6-1718691026178.png

 

 
 
Similarly, the data is stored in a dedicated folder in a SharePoint site:
 
AhmedSHMK_7-1718691026179.png
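
Since the negative samples should be clearly different from the positive ones, a quick local review of both sets can catch mistakes before training. The sketch below, in Python, counts each set against the roughly 150 samples mentioned in this post, flags negative samples that are byte-identical to a positive sample, and flags negative samples that a caller-supplied check considers positive-looking. The function name, the report keys, and the `looks_positive` callback are hypothetical and exist only for this illustration.

```python
import hashlib
from typing import Callable, Iterable

TARGET_SAMPLES = 150  # approximate figure taken from this post, not an official limit


def _digest(text: str) -> str:
    """Stable fingerprint of a sample's exact content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def review_sample_sets(
    positives: Iterable[str],
    negatives: Iterable[str],
    looks_positive: Callable[[str], bool],
    target: int = TARGET_SAMPLES,
) -> dict:
    """Summarize set sizes and list negative samples that overlap with positives."""
    pos = list(positives)
    neg = list(negatives)
    pos_hashes = {_digest(t) for t in pos}
    return {
        "positive_count": len(pos),
        "negative_count": len(neg),
        "enough_positives": len(pos) >= target,
        "enough_negatives": len(neg) >= target,
        # indexes of negative samples identical to some positive sample
        "shared_with_positives": [i for i, t in enumerate(neg) if _digest(t) in pos_hashes],
        # indexes of negative samples that still show the positive pattern
        "positive_looking_negatives": [i for i, t in enumerate(neg) if looks_positive(t)],
    }
```

For example, passing `lambda t: "cvv" in t.lower()` as `looks_positive` would list every negative sample that still mentions a CVV keyword, which are the files most likely to blur the difference between the two sets.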

 

 
 
 
Once the trainable classifier is created and fed this information, it can identify the data type, which facilitates detection and minimizes potential false positives.
 
 
AhmedSHMK_0-1718692609253.png

 


 
