document intelligence studio
3 Topics

Azure Document Intelligence and Content Understanding
Hello,

Our customer has dozens of Excel and PDF files. These files come in various formats, and the layouts may change over time. For example, some files provide data in a standard tabular structure, others use pivot-style Excel layouts, and some follow more complex or semi-structured formats. In total, we currently have approximately 150 distinct Excel templates and 80 distinct PDF templates.

We need to extract information from these files and ingest it into normalized tables. Our requirement is therefore to automatically infer the structure of each file, extract the required values, and load the results into Databricks tables.

Given that there are already many template variations, and that new templates may emerge over time, what would be the recommended pipeline, technology stack, and architecture? Should we prefer Azure Document Intelligence? One option would be to create a custom model per template type. However, when a user uploads a new file, how can we reliably match the file to the correct existing model? Additionally, what should happen if a user uploads an Excel/PDF file in a significantly different format that does not resemble any existing template?

4 Views · 0 likes · 0 Comments

Doc Intelligence: Custom Extraction model | Confidence score deterioration with new formats/layouts
Hi everyone,

This is my first time using custom extraction models on the Document Intelligence service, and I would appreciate your input on an experiment I am conducting. I wanted to investigate how these models' confidence scores behave when documents with a significantly different format/layout are introduced later in the training phase.

I started by training models on documents in the same format (some with worse picture quality and slightly rotated), gradually adding more samples. A new model was trained every time I added new documents, in increments of 5. After each new model was trained, I checked its scores against the same holdout set, which the model had never seen and which had the same format as the documents in the training set.

After training the final model on 35 identically formatted documents, I started introducing documents with a significantly different format/layout and retraining (in increments of 10). Confidence scores against the unchanged holdout set dropped after doing so and did not recover to their previous levels. See graph below showing how confidence scores evolved after every training step (adding new documents at every step).

Any insights as to why this has happened?

284 Views · 1 like · 2 Comments

Extracting data from unstructured forms using Azure AI Document Intelligence.
In our latest blog post, we delve into a scenario where our B2B product helps businesses extract data from messy PDFs, emails, and websites. Say goodbye to manual extraction: Azure AI Document Intelligence does the heavy lifting. Let’s explore how it works.

6K Views · 1 like · 0 Comments