Forum Discussion
AhmedSHMK
Jun 18, 2024 · Brass Contributor
Trainable Classifiers - Tips
Hello All,
Just sharing some tips to help with data collection and the creation of trainable classifiers for labelling and data loss prevention (DLP).
For machine learning to be trained to recognize a certain document type, that document type must have one or more recognizable aspects.
Possible recognizable aspects of the data/document type:
- Keyword or metadata values (keyword query language)
- Previously identified patterns of sensitive information like social security, credit card, or bank account numbers (Sensitive information type entity definitions)
- Document fingerprinting: recognizing an item because it's a variation on a template
- The presence of exact strings (exact data match)
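To make the idea of a "recognizable aspect" concrete, here is a rough Python sketch of the kind of pattern-based detection a sensitive information type relies on. It is not how Microsoft Purview implements its SITs. The regexes, the function names (`luhn_valid`, `find_recognizable_aspects`) and the keyword list are assumptions for illustration, built around the CC/SSN/CVV2/AMEX examples mentioned in this post.

```python
import re

# Illustrative patterns only (assumptions, not Purview's definitions).
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")
KEYWORDS = ("CVV2", "AMEX")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 13:
        return False
    total = 0
    # Double every second digit, counting from the rightmost digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_recognizable_aspects(text: str) -> dict:
    """Collect card numbers, SSN-shaped strings and keywords found in text."""
    cards = [m.group() for m in CARD_RE.finditer(text) if luhn_valid(m.group())]
    ssns = SSN_RE.findall(text)
    keywords = [kw for kw in KEYWORDS if kw.lower() in text.lower()]
    return {"cards": cards, "ssns": ssns, "keywords": keywords}
```

Running `find_recognizable_aspects` over a local copy of a sample file gives a quick view of whether it carries the signals you expect before it goes into a training folder.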
The examples below focus on document fingerprinting and previously identified sensitive information types.
For example, the positive samples below display a pattern: credit card (CC) information (dummy data), keywords that refer to CC information such as CVV2 and AMEX, and SSN information.
This can be regarded as a pattern for positive detection. These samples (about 150 samples of a similar pattern) are stored in a folder in a dedicated SharePoint site. (In the screenshot below, the same items are used as negative samples for another classifier.)
Negative samples follow the same concept: they can also be stored in a folder in a dedicated SharePoint site and should have a unique pattern or fingerprint. For example:
The samples below represent credential information (dummy data); about 150 samples are needed. The samples should strongly represent a uniform document/data type that differs from the positive samples.
Similarly, the data is stored in a dedicated folder in a SharePoint site:
Once the trainable classifier is created and fed these samples, it should be able to identify the data type, which facilitates detection and minimizes potential false positives.
- GrantNelson · Copper Contributor
I have attempted this with no luck so far. I have a positive set of files (50+ txt files with member IDs) and a negative set of 400+ files that do not contain those member IDs. The training finishes with the messages "Training completed with failures" and "Failed due to training error". I have not found any information on what this means or how to resolve it. These errors are too generic to be helpful.
- AhmedSHMK · Brass Contributor
GrantNelson, it's very tricky, and yes, there really isn't any documentation or a way to get detailed logs that I am aware of.
I created many failed classifiers myself before I got one to work.
Regarding how you set up the files, there should be a recognizable, pre-defined info type in the samples (i.e. make sure the member IDs can already be recognized somehow, for example by creating an SIT with a regex if you have not done so yet). It should be one of the following:
- Keyword or metadata values (keyword query language)
- Previously identified patterns of sensitive information like social security, credit card, or bank account numbers (Sensitive information type entity definitions)
- Document fingerprinting: recognizing an item because it's a variation on a template
- The presence of exact strings (exact data match)
In other words, something that already matches an SIT, either one created within your organization or one created by Microsoft. You could use more than one, but be careful: the samples must be close enough in format not to confuse the training process.
My advice if it fails is to make the samples as close as possible in format and in the info types they contain, and to expand later if needed.
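One way to check this before uploading is to run a quick script over local copies of the two sample folders. The sketch below is only an assumption of how such a check could look: the member ID format (`MBR-` plus seven digits), the folder layout and the function names are all hypothetical, and it says nothing about what the training service itself does.

```python
import re
from pathlib import Path

# Hypothetical member ID format, e.g. "MBR-1234567"; replace with your own.
MEMBER_ID_RE = re.compile(r"\bMBR-\d{7}\b")


def match_rate(folder: Path, pattern: re.Pattern) -> tuple[int, int]:
    """Return (files containing the pattern, total .txt files) for a folder."""
    files = sorted(folder.glob("*.txt"))
    hits = sum(
        1
        for f in files
        if pattern.search(f.read_text(encoding="utf-8", errors="ignore"))
    )
    return hits, len(files)


def check_sample_sets(
    positive_dir: Path, negative_dir: Path, pattern: re.Pattern = MEMBER_ID_RE
) -> list[str]:
    """Collect warnings about samples that do not fit their folder."""
    warnings = []
    pos_hits, pos_total = match_rate(positive_dir, pattern)
    neg_hits, _ = match_rate(negative_dir, pattern)
    if pos_hits < pos_total:
        warnings.append(f"{pos_total - pos_hits} positive sample(s) lack the pattern")
    if neg_hits > 0:
        warnings.append(f"{neg_hits} negative sample(s) contain the pattern")
    return warnings
```

An empty list from `check_sample_sets` only means the chosen pattern cleanly separates the two folders; it does not guarantee that training will succeed.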
Good Luck!