Azure Document Intelligence and Content Understanding
Hello,

Our customer has dozens of Excel and PDF files. These files come in various formats, and the layouts may change over time. For example, some files provide data in a standard tabular structure, others use pivot-style Excel layouts, and some follow more complex or semi-structured formats. In total, we currently have approximately 150 distinct Excel templates and 80 distinct PDF templates.

We need to extract information from these files and ingest it into normalized tables. Our requirement is therefore to automatically infer the structure of each file, extract the required values, and load the results into Databricks tables.

Given that there are already many template variations, and that new templates may emerge over time, what would be the recommended pipeline, technology stack, and architecture? Should we prefer Azure Document Intelligence?

One option would be to create a custom model per template type. However, when a user uploads a new file, how can we reliably match the file to the correct existing model? Additionally, what should happen if a user uploads an Excel/PDF file in a significantly different format that does not resemble any existing template?
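To make the last two questions concrete, here is a minimal sketch of the routing step I have in mind. It assumes some classifier (for example a custom classification model in Azure Document Intelligence, or anything else) has already returned a predicted template and a confidence score for the uploaded file. The names `Classification`, `Route`, `route_document`, the registry contents, and the 0.8 threshold are all hypothetical placeholders and not part of any Azure SDK.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Classification:
    """Hypothetical classifier output for one uploaded file."""
    template_id: str
    confidence: float


@dataclass(frozen=True)
class Route:
    """Routing decision: run an extraction model or send to manual review."""
    action: str               # "extract" or "review"
    model_id: Optional[str]   # extraction model to call when action == "extract"
    reason: str


def route_document(
    result: Classification,
    model_registry: Dict[str, str],
    min_confidence: float = 0.8,  # assumed threshold, to be tuned on real data
) -> Route:
    """Match a classified file to an extraction model, or fall back to review."""
    # Predicted template has no registered extraction model yet.
    if result.template_id not in model_registry:
        return Route("review", None, "unknown template")
    # Classifier is unsure, so the file may not resemble any existing template.
    if result.confidence < min_confidence:
        return Route("review", None, "low confidence")
    return Route("extract", model_registry[result.template_id], "matched")


# Hypothetical mapping from template id to per-template extraction model id.
registry = {"invoice_pivot_v1": "model-invoice-pivot-v1"}

print(route_document(Classification("invoice_pivot_v1", 0.95), registry))
print(route_document(Classification("invoice_pivot_v1", 0.40), registry))
print(route_document(Classification("brand_new_layout", 0.99), registry))
```

Files routed to "review" could be held in a quarantine location and used later to decide whether a new template (and a new model) is needed, but whether that is the recommended approach is exactly what I am asking.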