Azure Document Intelligence and Content Understanding
Hi rlxnw84, you're definitely not alone here. Once you get beyond a handful of fixed templates, this problem becomes more about architecture than any single AI service.
With ~150 Excel and ~80 PDF layouts (and more coming), I’d be very cautious about going down a pure “one custom model per template” path. It can work at small scale, but matching new uploads to the right model and maintaining all of them quickly becomes painful.
What I’ve seen work better in practice is a classification-first pipeline:
1. Ingest the file (ADLS Gen2, basic metadata, source info).
2. Classify the document before extraction (rough classification sketch after the list):
   - Use Azure Document Intelligence (Layout / General Document) or Content Understanding to pull structural signals (tables, headers, key phrases).
   - Optionally add similarity scoring (text + layout embeddings) to decide which type of document this is, not which exact template.
3. Extract using a hybrid approach (extractor routing sketch below):
   - General DI models for most PDFs.
   - Native Excel parsing for XLSX where possible.
   - Custom DI models only for high-value, relatively stable layouts.
4. Normalize in Databricks (schema sketch below):
   - Map everything into a canonical schema.
   - Capture confidence, model used, and document class for traceability.
5. Have a fallback + human loop (fallback sketch below):
   - If confidence is low or the format doesn't match anything known, flag it as a "new format".
   - Use those cases to refine classification or decide whether a new model is actually worth creating.
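To make step 2 concrete, here is roughly what the classification side can look like. It's only a sketch: `embed()` is a placeholder for whatever embedding endpoint you use (Azure OpenAI, sentence-transformers, etc.), and the class names, table-count penalty, and threshold are made up. The point is that you score against document *types*, not exact templates, and anything that doesn't get close to a known type falls out as "new_format".

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class DocClass:
    name: str              # e.g. "supplier_invoice_v2" (made-up name)
    centroid: np.ndarray   # mean embedding of known samples of this class
    expected_tables: int   # coarse structural signal from the DI Layout result


def embed(text: str) -> np.ndarray:
    """Placeholder: call your embedding model of choice here."""
    raise NotImplementedError


def classify(text: str, table_count: int, classes: list[DocClass],
             min_score: float = 0.75) -> str:
    """Return the best-matching document class, or 'new_format' if nothing is close."""
    vec = embed(text)
    best_name, best_score = "new_format", min_score
    for c in classes:
        cos = float(np.dot(vec, c.centroid)
                    / (np.linalg.norm(vec) * np.linalg.norm(c.centroid)))
        # Penalise a match whose table structure looks very different.
        if abs(table_count - c.expected_tables) > 2:
            cos -= 0.1
        if cos > best_score:
            best_name, best_score = c.name, cos
    return best_name
```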
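For step 3, the routing itself can stay very simple. A minimal sketch, assuming the `azure-ai-formrecognizer` SDK for the PDF side (it still exposes Document Intelligence's prebuilt-layout model) and plain pandas/openpyxl for native Excel parsing; the environment variable names are placeholders for your own endpoint and key.

```python
import os

import pandas as pd  # needs openpyxl installed for .xlsx files
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

di_client = DocumentAnalysisClient(
    endpoint=os.environ["DI_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["DI_KEY"]),
)


def extract(path: str) -> dict:
    """Return raw extracted content plus which extractor produced it."""
    if path.lower().endswith((".xlsx", ".xlsm")):
        # Excel: parse natively, no OCR round-trip needed.
        sheets = pd.read_excel(path, sheet_name=None)  # dict of DataFrames, one per sheet
        return {"extractor": "native_xlsx", "sheets": sheets}

    # PDFs (and scans): let Document Intelligence pull text + tables.
    with open(path, "rb") as f:
        poller = di_client.begin_analyze_document("prebuilt-layout", document=f)
    result = poller.result()
    return {
        "extractor": "di_prebuilt_layout",
        "text": result.content,
        "table_count": len(result.tables),
    }
```

Custom DI models slot into the same function as just another branch, keyed off the document class from the previous step.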
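On the Databricks side (step 4), the main thing is that every path writes the same shape. A sketch of what that canonical landing table might look like; the table name, column names, and the example row are purely illustrative, not anything the service gives you.

```python
from datetime import datetime, timezone

from pyspark.sql import SparkSession
from pyspark.sql.types import (DoubleType, StringType, StructField,
                               StructType, TimestampType)

spark = SparkSession.builder.getOrCreate()

canonical_schema = StructType([
    StructField("source_path", StringType(), False),
    StructField("doc_class", StringType(), False),    # from the classification step
    StructField("extractor", StringType(), False),    # native_xlsx, di_prebuilt_layout, custom model id...
    StructField("field_name", StringType(), False),   # one row per extracted field
    StructField("field_value", StringType(), True),
    StructField("confidence", DoubleType(), True),    # null for native Excel parses
    StructField("ingested_at", TimestampType(), False),
])

# Illustrative record: every extractor funnels into this one shape, so
# downstream logic never cares which path the document took.
example_rows = [
    ("abfss://raw@mystorage.dfs.core.windows.net/invoices/invoice_123.pdf",
     "supplier_invoice_v2", "di_prebuilt_layout",
     "invoice_total", "1234.50", 0.91, datetime.now(timezone.utc)),
]

(spark.createDataFrame(example_rows, canonical_schema)
      .write.format("delta").mode("append")
      .saveAsTable("docs.extracted_fields"))
```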
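And for step 5, the fallback doesn't need to be clever; it's mostly a threshold plus a "park it for a human" table. A sketch, with a made-up threshold and table names matching the schema above:

```python
# Decide where a normalized record lands. The floor and table names are
# illustrative; tune the threshold against your own review data.
CONFIDENCE_FLOOR = 0.6


def route(doc_class: str, confidence: float | None) -> str:
    """Return the target table for one extracted record."""
    if doc_class == "new_format":
        return "docs.review_queue"        # human / new-template lane
    if confidence is not None and confidence < CONFIDENCE_FLOOR:
        return "docs.review_queue"        # extracted, but not trusted yet
    return "docs.extracted_fields"        # straight into the canonical table
```

Whatever ends up in the review queue is also your best signal for deciding which new formats deserve a custom model and which can stay on the generic path.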
This way, new or drifting layouts don't break the pipeline; they just go through a slower path until you decide how to handle them long term.
So yes, Azure Document Intelligence is a good fit, but usually as part of a broader pipeline rather than the only answer. The real win tends to come from combining DI + light classification + Databricks-side logic, instead of trying to perfectly model every template upfront.
Would be interested to hear if others have found good ways to manage template drift at this scale; that's usually the hardest part.