Document Processing Blog
April 2022 – SharePoint Syntex AI Optimizations

Chris McNulty
Silver Contributor
Apr 06, 2022

SharePoint Syntex uses AI to organize & manage content, optimize search and compliance, and automate and improve your most critical business processes.

 

To improve the quality and consistency of text extraction from a file, Syntex now employs a more natural reading order, which also improves language support. We are optimizing the optical character recognition (OCR) service used by Syntex document understanding models. As a result, if a document understanding model was trained using PDF example files that rely on OCR, you should test the model to confirm it still extracts data as expected.

 

This enhancement helps Syntex extract multiline values inside tables or cells rather than reading the generated text top to bottom.  Consider this sample table:

A       B       C
1       2       3
Red     Blue    Green
D       E       F
4       5       6
Apple   Orange  Banana


Our previous OCR model “read” the text stream top to bottom:

A 1 Red, B 2 Blue, C 3 Green…

 

This optimization will help the cloud read your tables more naturally:

ABC, 123, Red Blue Green…
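The difference between the two streams can be sketched in a few lines of Python. This is only a toy model of the two reading orders, not the Syntex OCR service: it uses the first three rows of the sample table (the ones shown in the streams above), and the function names, including the `token_after` helper, are hypothetical.

```python
# Toy model of the two reading orders; this is not the Syntex OCR service.
# Only the first three rows of the sample table are used, matching the
# streams shown in the post.
rows = [
    ["A", "B", "C"],
    ["1", "2", "3"],
    ["Red", "Blue", "Green"],
]

def column_order(table):
    """Read each column top to bottom (the previous behavior)."""
    return [cell for column in zip(*table) for cell in column]

def row_order(table):
    """Read each row left to right (the optimized behavior)."""
    return [cell for row in table for cell in row]

def token_after(stream, label):
    """Hypothetical helper: return the token that follows `label`, or None."""
    if label not in stream:
        return None
    index = stream.index(label)
    if index + 1 >= len(stream):
        return None
    return stream[index + 1]

old_stream = column_order(rows)
new_stream = row_order(rows)

print(" ".join(old_stream))  # A 1 Red B 2 Blue C 3 Green
print(" ".join(new_stream))  # A B C 1 2 3 Red Blue Green
print(token_after(old_stream, "A"), token_after(new_stream, "A"))  # 1 B
```

The last line hints at why existing models should be re-tested: anything that relies on which text follows a given label may now see a different neighbor.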

 

You should check your models. If you have text stored in a tabular layout in PDF files, you can take advantage of this update now. Testing and, where needed, retraining a model is simple:

  1. Go to the Content Center and click the Models link
  2. In the Models section, click on the name of the model to be reviewed/updated as needed
  3. In the Entity Extractor section of the page, click on one of the Entity Names
  4. Go to the Test tab
  5. Click on Add example files
  6. Upload a few new training files. These can be duplicates of existing documents or entirely new files. Note: if you upload duplicate copies of existing files, select the Keep both versions option so that each file gets a slightly different name.
  7. Review the Predictions column for extracted text to determine whether the right content is still being extracted. Check each entity extractor.
  8. If there are discrepancies, then select Exit Training and go back through the model training process, starting with the Train Classifier step.
  9. Once the model updates are completed, go to the model’s main page and in the Where the model is applied section, select the Sync link to ensure the updates are published out to all document libraries where the model is applied.
  10. If needed, run the updated model on the local document library’s existing documents by selecting all documents that need to be re-run and clicking on Classify and extract.
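
Step 7 comes down to comparing what each extractor returns now against what you expect. If you keep a record of expected values, that comparison could be sketched outside Syntex as below. Everything here is an assumption made for illustration: the `find_discrepancies` function, the extractor names, and the sample values are hypothetical, the predicted values would be copied by hand from the Predictions column, and nothing in the sketch calls a Syntex API.

```python
def normalize(text):
    """Collapse runs of whitespace so line breaks inside a cell are ignored."""
    return " ".join(text.split())

def find_discrepancies(expected, predicted):
    """Return {extractor: (expected, predicted)} for every mismatch."""
    mismatches = {}
    for extractor, want in expected.items():
        got = predicted.get(extractor)  # None if the extractor returned nothing
        if got is None or normalize(got) != normalize(want):
            mismatches[extractor] = (want, got)
    return mismatches

# Hypothetical values: what you expect versus what the Test tab now shows.
expected = {"Color": "Red", "Fruit": "Apple"}
predicted = {"Color": "Red", "Fruit": "Orange"}

print(find_discrepancies(expected, predicted))  # {'Fruit': ('Apple', 'Orange')}
```

An empty result would suggest the model is unaffected. Any entry points to an extractor worth retraining, as described in step 8.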

We welcome your comments and feedback here on the Tech Community. Thank you.

Published Apr 06, 2022
Version 1.0

5 Comments

  • Mario_Fulan
    Iron Contributor

    Azure Cognitive does a text extraction. It usually works pretty well for text-backed PDF forms, but not so much if the PDF is simply a TIFF image. Not sure which type you are using. It is sometimes a puzzle when things just change, but when they do, it is generally because someone else complained and they fixed it for their use case 🙂

  • peb71b
    Copper Contributor

    Thanks Mario Fulan, I'll give that a go to confirm. It was puzzling why they just stopped working, but this post does explain it was a "feature" change, which I'm sure helps some but not my case. As a quick alternative, I switched to Power Automate Desktop, which does the job by extracting tables (which come out as I expected). Shame, as I like Syntex.

  • Mario_Fulan
    Iron Contributor

    You should take a look at Azure Cognitive text extraction for OCR to see what it extracts via an API call. That is one thing I have done to determine whether the issue is Syntex or the underlying Azure Cognitive service. If it is Cognitive, I'm not sure what solutions the product group may have other than to use their influence on the underlying technical team.

  • peb71b
    Copper Contributor

    I've been using SharePoint Syntex since pre-release with some good results.

    Last week we had a batch of new documents and we noticed many unexpected results.

    We don’t control the PDF format that we receive, but we were getting results that parsed as shown below. I have been using the "After Text" rule to identify the numbers to extract. Example:

     

    Unfortunately the PDF table is now parsed differently (below example of same form):

     

    Can we use the previous method of Syntex parsing PDFs? The new method does not work in many of our scenarios.

    Thanks

  • peb71b
    Copper Contributor

    Hi, is there a way to use the old method for some models? We had some models working well that are now producing unexpected or wrong results due to the change in reading tables.