Forum Discussion
Tdullers
Apr 16, 2024 (Copper Contributor)
Limits for training an Azure AI Document Intelligence Custom Classification Model
I have the following question:
- The latest 2024-02-29-preview version for training a Custom Classification model gives great results and is a vast improvement over the current GA version.
- However, we noticed limits on the maximum training data size and a cap of 10,000 pages. These are also mentioned at https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/how-to-guides/build-a-custom-model?view=doc-intel-4.0.0, which states:
"For custom classification model training, the total size of training data is 1 GB with a maximum of 10,000 pages."
- Are these limits expected to increase in newer versions?
- Is there a way to have this limit raised for specific use cases?
- Does the limitation make sense? That is, is it there for a reason, and is one expected to make different design choices to work within it, e.g. by using multiple, more specific classification models?
Context: If you want to train a custom classification model to recognise e.g. 300 different document types, each of which comes in several variations, you hit the 10k page limit very easily. With 300 document types of e.g. 3 pages each, the limit allows at most 10,000 / (300 × 3) ≈ 11 training documents per type, which is not a lot.
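To make the arithmetic above easy to rerun with other numbers, here is a minimal sketch. The function name and parameters are made up for illustration; the only figure taken from the documentation is the 10,000 page limit.

```python
MAX_PAGES = 10_000  # page limit quoted from the documentation above


def max_docs_per_type(num_types: int, pages_per_doc: int,
                      max_pages: int = MAX_PAGES) -> int:
    """Largest whole number of training documents per type that fits the limit.

    Assumes every type gets the same number of documents and every
    document has the same page count (a simplification).
    """
    return max_pages // (num_types * pages_per_doc)


print(max_docs_per_type(300, 3))  # 11
```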
Thanks in advance!
- user_2429799 (Copper Contributor) [Copilot answer]
The current limitations for training a Custom Classification model in Azure AI Document Intelligence are indeed a maximum of 1 GB of training data and a limit of 10,000 pages. These limits are in place to ensure the efficiency and effectiveness of the model training process.
As for whether these limits will increase in future versions, it’s not explicitly stated in the available resources. However, Microsoft does mention that features, approaches, and processes may change based on user feedback. So, it’s possible that these limits could be adjusted in the future.
Regarding design choices to work within this limit, one approach could be to use multiple, more specific classification models. For instance, if you have 300 different document types, you could create several models, each trained on a subset of the document types. This could help you stay within the page limit for each model.
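As a rough illustration of the split described above, the sketch below groups document types into several models so that each group's training pages stay under the limit. It is a plain first-fit packing written for this thread, not anything the service provides; `split_into_models` and the example page counts are hypothetical.

```python
from typing import Dict, List

MAX_PAGES_PER_MODEL = 10_000  # page limit per classifier, from the docs quoted above


def split_into_models(pages_per_type: Dict[str, int],
                      max_pages: int = MAX_PAGES_PER_MODEL) -> List[List[str]]:
    """Group document types so each group's total pages fits one model.

    Greedy first-fit over types sorted by page count (largest first).
    """
    groups: List[List[str]] = []
    totals: List[int] = []
    ordered = sorted(pages_per_type.items(), key=lambda kv: kv[1], reverse=True)
    for doc_type, pages in ordered:
        if pages > max_pages:
            raise ValueError(f"{doc_type} alone exceeds the page limit")
        for i, total in enumerate(totals):
            if total + pages <= max_pages:
                groups[i].append(doc_type)
                totals[i] += pages
                break
        else:
            # No existing group has room, so start a new model.
            groups.append([doc_type])
            totals.append(pages)
    return groups


# Hypothetical workload: 300 types, 30 training documents of 3 pages each.
example = {f"type_{n:03d}": 30 * 3 for n in range(300)}
models = split_into_models(example)
print(len(models), [len(g) for g in models])
```

Note that this only covers the training-side split. How an incoming document gets routed to the right model is a separate design question that the sketch does not address.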
Another important aspect to consider is the support for incremental training, starting with the 2024-02-29-preview API. This allows you to add new samples to existing classes, or add new classes, by referencing an existing classifier. This could be a useful feature to leverage when dealing with a large number of document types.
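For what referencing an existing classifier might look like in practice, here is a hedged sketch that only assembles a build-request body and sends nothing. The field names (`classifierId`, `baseClassifierId`, `docTypes`, `azureBlobSource`, `containerUrl`, `prefix`) are my recollection of the preview REST reference and should be checked against it. The helper name, classifier IDs, and container URL are placeholders.

```python
import json
from typing import Dict, Optional


def build_classifier_request(classifier_id: str,
                             container_url: str,
                             prefixes_by_type: Dict[str, str],
                             base_classifier_id: Optional[str] = None) -> dict:
    """Assemble a request body for building a classifier (no network call).

    Field names are assumptions based on the 2024-02-29-preview REST
    reference; verify them before use.
    """
    body = {
        "classifierId": classifier_id,
        "docTypes": {
            doc_type: {"azureBlobSource": {"containerUrl": container_url,
                                           "prefix": prefix}}
            for doc_type, prefix in prefixes_by_type.items()
        },
    }
    if base_classifier_id is not None:
        # Incremental training: build on top of an existing classifier.
        body["baseClassifierId"] = base_classifier_id
    return body


# Placeholder values for illustration only.
request_body = build_classifier_request(
    classifier_id="my-classifier-v2",
    container_url="https://example.invalid/training-data",
    prefixes_by_type={"invoice": "invoice/", "contract": "contract/"},
    base_classifier_id="my-classifier-v1",
)
print(json.dumps(request_body, indent=2))
```

Whether incremental training changes how the 10,000 page limit is counted is not stated in this thread, so that would need to be confirmed separately.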
In summary, while the current limitations do pose a challenge for large-scale applications, there are potential workarounds and future updates may bring changes to these constraints. I hope this information is helpful and feel free to ask if you have more questions!