Forum Discussion
Trainable Classifiers - Tips
2 Replies
- GrantNelson, Copper Contributor
I have attempted this with no luck so far. I have a positive set of 50+ txt files containing member IDs and a negative set of 400+ files that do not contain those member IDs. Training finishes with the messages "Training completed with failures" and "Failed due to training error". I have not found any information on what these mean or how to resolve them; the errors are too generic to be helpful.
- AhmedSHMK, Brass Contributor
@GrantNelson It's very tricky, and yes, as far as I am aware there is no documentation for these errors and no way to get detailed logs.
For me, I created many failed classifiers before I managed to create a working one.
Regarding how you set up the files: the samples should contain a recognizable, pre-defined info type (i.e. make sure the member IDs can already be recognized somehow, for example by creating a sensitive information type (SIT) with a regex if you have not done so yet). It should be one of the below:
- Keyword or metadata values (keyword query language)
- Previously identified patterns of sensitive information like social security, credit card, or bank account numbers https://learn.microsoft.com/en-us/purview/sit-sensitive-information-type-entity-definitions
- Document fingerprinting: recognizing an item because it's a variation on a template https://learn.microsoft.com/en-us/purview/sit-document-fingerprinting
- The presence of exact strings https://learn.microsoft.com/en-us/purview/sit-learn-about-exact-data-match-based-sits#learn-about-exact-data-match-based-sensitive-information-types
i.e. something that already matches an SIT, either one defined in your organization or one created by Microsoft. You could have several, but be careful: the samples must be close enough in format not to confuse the training process.
My advice if it fails: make the samples as close as possible in format and in the info types they contain, and expand later if needed.
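Before uploading, it can help to check locally that the positive and negative sets really differ the way you expect. The snippet below is only a rough sketch and has nothing to do with Purview itself: the `MBR-` plus six digits format, the pattern name and both function names are hypothetical, so swap in whatever regex your own SIT uses.

```python
import re
from pathlib import Path

# Hypothetical member ID format (assumption): "MBR-" followed by exactly six digits.
# Replace this with the regex used by your own sensitive information type.
MEMBER_ID_PATTERN = re.compile(r"\bMBR-\d{6}\b")


def contains_member_id(text: str) -> bool:
    """Return True if the text contains at least one member ID."""
    return MEMBER_ID_PATTERN.search(text) is not None


def audit_samples(folder, expect_match: bool) -> list[str]:
    """Return the names of .txt files that contradict the expectation.

    expect_match=True  -> lists files that contain no member ID (bad positives)
    expect_match=False -> lists files that do contain a member ID (bad negatives)
    """
    problems = []
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        if contains_member_id(text) != expect_match:
            problems.append(path.name)
    return problems
```

Calling `audit_samples("positive", True)` and `audit_samples("negative", False)` (the folder names are placeholders) should both return empty lists. Any file names that come back are samples worth fixing or dropping before the next training attempt.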
Good Luck!