Non-greedy extractors


Hi everyone,


Do SharePoint Syntex extractors for document understanding support non-greedy regular expressions?


For example, I'm trying to pull a contract term, but the term appears multiple times throughout the file and the extractor pulls every instance. I could use the location feature under Advanced to indicate that the term only appears near the beginning of the contract, but given the sample size and the variety of contracts I'm working with, doing that breaks the extractor for other contracts.




1 Reply
Unless it is supported directly in the regular expression itself, I think the answer is no. I've done some work with clients using different classifications (e.g., StyleAContract, StyleBContract) so my models can be a bit smarter. You can still reuse the evaluations, but then use tighter proximity or location-in-file settings to assist the AI in the model. If everything is just "Contract", then it is up to you to make the RegEx find only one instance.
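To make the greedy vs. non-greedy distinction concrete, here is a rough sketch using Python's `re` module. This is only an illustration of regex behaviour in general: I have not confirmed which regex syntax Syntex extractors accept, and the sample contract text and patterns are made up.

```python
import re

# Hypothetical contract text in which the term is stated twice.
text = (
    "Term: 24 months from the Effective Date. "
    "Other clauses follow. "
    "Term: 24 months, unless terminated earlier."
)

# Greedy: '.*' runs on to the LAST 'months' in the text.
greedy = re.search(r"Term: .*months", text).group(0)

# Non-greedy: '.*?' stops at the FIRST 'months'.
lazy = re.search(r"Term: .*?months", text).group(0)

# findall returns every occurrence; search returns only the first one.
all_terms = re.findall(r"Term: (\d+ months)", text)
first_term = re.search(r"Term: (\d+ months)", text).group(1)

print(lazy)
print(all_terms)
print(first_term)
```

Note that a non-greedy quantifier only shortens each individual match. It does not stop an engine that collects all matches from returning the second occurrence as well, which is why taking only the first match (or de-duplicating afterwards) is a separate step.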

I like the general idea of non-repeating extractors as a feature (not just for RegEx, but for all extractors). For example, if there is a PO where "Total" appears both at the top and in the table of details, I only want it extracted once, so I get "$3,005" and not "$3,005, $3,005" as a result.

The other option is to use Flow and do some post-processing to remove duplicates (I've done that for some clients). This takes extra effort but is very effective.
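The de-duplication logic itself is simple. In Flow you would build it from the available actions and expressions; the Python below is just a sketch of the idea, assuming the extracted values arrive as a list of strings. The function name `dedupe_amounts` and the normalisation rule are my own, not anything from Syntex or Flow.

```python
import re

def dedupe_amounts(values):
    """Keep the first occurrence of each amount, ignoring formatting differences."""
    seen = set()
    unique = []
    for value in values:
        # Normalise by dropping everything except digits and the decimal
        # point, so formatting variants of the same amount compare equal.
        key = re.sub(r"[^\d.]", "", value)
        if key not in seen:
            seen.add(key)
            unique.append(value)
    return unique

# Hypothetical extractor output for the PO example.
result = dedupe_amounts(["$3,005", "$3005", "$3,005"])
print(result)
```

Order is preserved, so the value that appears first in the document is the one that is kept.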