Forum Discussion

rlxnw84
Copper Contributor
Jan 14, 2026

Azure Document Intelligence and Content Understanding

Hello,

Our customer has dozens of Excel and PDF files. These files come in various formats, and the layouts may change over time. For example, some files provide data in a standard tabular structure, others use pivot-style Excel layouts, and some follow more complex or semi-structured formats. In total, we currently have approximately 150 distinct Excel templates and 80 distinct PDF templates.


We need to extract information from these files and ingest it into normalized tables. Therefore, our requirement is to automatically infer the structure of each file, extract the required values, and load the results into Databricks tables.


Given that there are already many template variations—and that new templates may emerge over time—what would be the recommended pipeline, technology stack, and architecture?


Should we prefer Azure Document Intelligence? One option would be to create a custom model per template type. However, when a user uploads a new file, how can we reliably match the file to the correct existing model? Additionally, what should happen if a user uploads an Excel/PDF file in a significantly different format that does not resemble any existing template?

1 Reply

  • Hi rlxnw84! You're definitely not alone here; once you get beyond a handful of fixed templates, this problem becomes more about architecture than any single AI service.

    With ~150 Excel and ~80 PDF layouts (and more coming), I’d be very cautious about going down a pure “one custom model per template” path. It can work at small scale, but matching new uploads to the right model and maintaining all of them quickly becomes painful.

    What I’ve seen work better in practice is a classification-first pipeline:

    1. Ingest the file (ADLS Gen2, basic metadata, source info).

    2. Classify the document before extraction:

    Use Azure Document Intelligence (Layout / General Document) or Content Understanding to pull structural signals (tables, headers, key phrases).

    Optionally add similarity scoring (text + layout embeddings) to decide which type of document this is, not which exact template (there's a rough classification sketch after this list).

    3. Extract using a hybrid approach:

    General DI models for most PDFs.

    Native Excel parsing for XLSX where possible.

    Custom DI models only for high-value, relatively stable layouts (see the routing sketch below for how these paths can fit together).

    4. Normalize in Databricks:

    Map everything into a canonical schema.

    Capture confidence, model used, and document class for traceability (the Databricks sketch below shows one way to combine this with the fallback path in step 5).

    5. Have a fallback + human loop:

    If confidence is low or the format doesn’t match anything known, flag it as a “new format”.

    Use those cases to refine classification or decide if a new model is actually worth creating.
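
    To make step 2 concrete, here's a minimal sketch of the classification idea, assuming the azure-ai-formrecognizer SDK; the class names, header terms, and threshold are made-up placeholders, and in practice you might swap the simple set-overlap score for text/layout embeddings:

```python
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

# Hypothetical registry: document class -> header/label terms typical for that layout family
KNOWN_CLASSES = {
    "sales_report": {"region", "product", "units sold", "revenue"},
    "inventory_pivot": {"warehouse", "sku", "opening stock", "closing stock"},
}
MATCH_THRESHOLD = 0.4  # illustrative; tune on real samples

def extract_signature(endpoint: str, key: str, file_path: str) -> set[str]:
    """Pull structural signals (table headers, headings) with the prebuilt layout model."""
    client = DocumentAnalysisClient(endpoint, AzureKeyCredential(key))
    with open(file_path, "rb") as f:
        result = client.begin_analyze_document("prebuilt-layout", document=f).result()

    signature = set()
    for table in result.tables:
        for cell in table.cells:
            if cell.kind == "columnHeader":
                signature.add(cell.content.strip().lower())
    for para in result.paragraphs or []:
        if para.role in ("title", "sectionHeading"):
            signature.add(para.content.strip().lower())
    return signature

def classify(signature: set[str]) -> tuple[str, float]:
    """Return (document_class, score); 'unknown' routes the file to the review path."""
    best_class, best_score = "unknown", 0.0
    for doc_class, ref in KNOWN_CLASSES.items():
        score = len(signature & ref) / len(signature | ref)  # Jaccard overlap
        if score > best_score:
            best_class, best_score = doc_class, score
    return (best_class, best_score) if best_score >= MATCH_THRESHOLD else ("unknown", best_score)
```

    The important part is that you're matching to a document *class*, not an exact template, so small layout drift within a class still lands on the right extractor.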
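
    For step 3, a minimal routing sketch, assuming the `document_class` produced by the classifier above; the custom-model mapping is a placeholder, and the Excel branch just reads raw sheets rather than doing anything template-specific:

```python
import pandas as pd

# Hypothetical mapping: only a few stable, high-value classes get a trained custom DI model
CUSTOM_MODEL_BY_CLASS = {
    "sales_report": "sales-report-custom-v2",  # placeholder model ID
}

def extract(file_path: str, document_class: str, di_client) -> dict:
    """Route each file to the cheapest extractor that can handle it."""
    if file_path.lower().endswith((".xlsx", ".xls")):
        # Native Excel parsing: exact cell values, no OCR cost, pivot layouts stay intact
        sheets = pd.read_excel(file_path, sheet_name=None, header=None)
        return {"source": "excel-native", "class": document_class,
                "sheets": {name: df.values.tolist() for name, df in sheets.items()}}

    # PDFs: fall back to the general layout model unless this class has its own custom model
    model_id = CUSTOM_MODEL_BY_CLASS.get(document_class, "prebuilt-layout")
    with open(file_path, "rb") as f:
        result = di_client.begin_analyze_document(model_id, document=f).result()

    # For a custom model you would typically read result.documents[0].fields instead of raw tables
    return {"source": model_id, "class": document_class,
            "tables": [[cell.content for cell in t.cells] for t in result.tables]}
```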
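
    And on the Databricks side (steps 4 and 5), one way to keep traceability and the "new format" fallback in the same flow is a canonical Delta table plus a quarantine table for low-confidence or unknown documents. Table names, columns, and the threshold are illustrative, and this assumes it runs in a Databricks notebook where `spark` already exists:

```python
from pyspark.sql import functions as F

CONFIDENCE_THRESHOLD = 0.7  # illustrative cutoff

# Hypothetical extracted records: one row per value, already mapped to canonical field names
rows = [
    ("inv-001.pdf", "sales_report", "prebuilt-layout", "revenue", "12500.00", 0.93),
    ("weird-layout.xlsx", "unknown", "excel-native", None, None, 0.10),
]
cols = ["doc_id", "document_class", "model_used", "field", "value", "confidence"]
df = spark.createDataFrame(rows, cols)  # `spark` is predefined in Databricks notebooks

is_good = (F.col("confidence") >= CONFIDENCE_THRESHOLD) & (F.col("document_class") != "unknown")

# Confident extractions land in the canonical table; everything else goes to a review queue
df.filter(is_good).write.format("delta").mode("append").saveAsTable("bronze.extracted_values")
df.filter(~is_good).withColumn("flag", F.lit("new_format_or_low_confidence")) \
    .write.format("delta").mode("append").saveAsTable("bronze.extraction_quarantine")
```

    The quarantine table is also where you'd decide, over time, whether a recurring "unknown" class is common enough to justify training a new custom model.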

    This way, new or drifting layouts don't break the pipeline; they just go through a slower path until you decide how to handle them long term.

    So yes, Azure Document Intelligence is a good fit, but usually as part of a broader pipeline rather than the only answer. The real win tends to come from combining DI + light classification + Databricks-side logic, instead of trying to perfectly model every template upfront.

    Would be interested to hear if others have found good ways to manage template drift at this scale; that's usually the hardest part.

