Process Large-Scale PDFs or Images to Extract Information from Forms Using Applied AI Form Recognizer
Published Jun 08 2022 09:00 AM

Optimized large scale forms processing using Applied AI Services


Thanks to co-authors @Robert Nottoli @Michael McKechney @Mark Hoiland @lee Hansen


Use Case

  • Millions of forms need to be processed
  • The volume is on the order of 15 to 20 million pages or more
  • Forms have multiple pages, but only a few of those pages contain the data to extract
  • A form might have 3 to 15 pages or more
  • The data to pull might sit on only 2 or 3 of those pages
  • Split out those pages before sending them to Form Recognizer to reduce AI cost
  • Use a Python library to identify the pages needed for AI services
  • The process is split into 2 sections
    1. Filter each form down to the pages needed for AI services
    2. Send those pages to Form Recognizer
  • The idea here is to show how to preprocess PDFs or images so that AI Cognitive Services process only the pages that carry the needed information
  • Both steps described below can be scaled as needed based on requirements
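
The size of the reduction can be estimated with simple arithmetic. The sketch below is illustrative only: the function name `pages_sent_to_ai` and the simplification that every form has the same number of pages are assumptions, not part of the original solution. The figures are the ones quoted in the list above (15 million pages, up to 15 pages per form, 2 or 3 pages carrying data).

```python
def pages_sent_to_ai(total_pages: int, pages_per_form: int, needed_per_form: int) -> int:
    """Estimate how many pages reach the AI service after filtering.

    Assumes, for illustration, that every form has the same page count.
    """
    if pages_per_form <= 0:
        raise ValueError("pages_per_form must be positive")
    if not 0 <= needed_per_form <= pages_per_form:
        raise ValueError("needed_per_form must be between 0 and pages_per_form")
    forms = total_pages // pages_per_form
    return forms * needed_per_form


total = 15_000_000
low = pages_sent_to_ai(total, pages_per_form=15, needed_per_form=2)
high = pages_sent_to_ai(total, pages_per_form=15, needed_per_form=3)
print(f"{low:,} to {high:,} pages instead of {total:,}")
# prints: 2,000,000 to 3,000,000 pages instead of 15,000,000
```

Real forms vary in length, so an actual estimate would sum over the observed page counts rather than use a single average.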




Architecture - End to End processing




Two-part processing


Azure Python Function

  • A Python function processes each PDF and picks only the pages that need to go to AI
  • Instead of 15 million pages, the volume can be reduced to 2 or 3 million pages
  • Existing open-source packages such as pytesseract are used to pull only the pages needed
  • PDF processing is scaled using Azure Functions
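
The page-selection step above might look roughly like the following. This is a minimal sketch, not the authors' implementation: `select_pages`, the keyword list, and the sample pages are hypothetical. The OCR step is passed in as a callable so the example runs without Tesseract installed; with pytesseract available, `pytesseract.image_to_string` applied to a page image could play that role.

```python
import re
from typing import Callable, Iterable, List, Sequence


def select_pages(
    pages: Sequence[object],
    keywords: Iterable[str],
    ocr: Callable[[object], str],
) -> List[int]:
    """Return zero-based indexes of pages whose OCR text mentions any keyword.

    `ocr` is any callable that turns one page into text (an assumption of
    this sketch; the real function could use pytesseract for this).
    """
    patterns = [re.compile(re.escape(k), re.IGNORECASE) for k in keywords]
    selected = []
    for index, page in enumerate(pages):
        text = ocr(page)
        if any(p.search(text) for p in patterns):
            selected.append(index)
    return selected


# Stand-in for OCR: the sample "pages" are already plain text (made-up content).
sample_pages = [
    "Cover sheet",
    "Section A: Applicant Name and Policy Number",
    "Terms and conditions",
    "Section B: claim amount and signature",
    "Appendix",
]
keep = select_pages(
    sample_pages,
    ["policy number", "claim amount"],
    ocr=lambda page: page,
)
print(keep)  # [1, 3]
```

Only the selected pages would then be written out and passed on to the second function. Which keywords or layout cues identify a data page depends entirely on the forms being processed.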


Azure C# function to process Form Recognizer

  • Functions take the reduced set of pages and send them to Form Recognizer
  • Form Recognizer output is processed and saved to SQL for further reporting
  • Azure analytics services are used for further data processing
  • Functions scale as needed to process forms
  • With the reduced page set, 2 to 3 million pages are sent to AI services rather than 15 million
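
The original uses a C# function for this part. The "save to SQL" step can still be sketched in Python to show the idea. Everything here is an assumption made for illustration: the `form_fields` table, the `save_fields` helper, the shape of the extracted data (one dictionary of field name to value per page), and the use of the standard-library `sqlite3` module as a stand-in for the real SQL database.

```python
import sqlite3
from typing import Dict, Iterable, Tuple

# Hypothetical shape: (document id, page number, {field name: extracted value}).
ExtractedPage = Tuple[str, int, Dict[str, str]]


def save_fields(conn: sqlite3.Connection, pages: Iterable[ExtractedPage]) -> int:
    """Store one row per extracted field and return the number of rows written."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS form_fields (
            document_id TEXT NOT NULL,
            page_number INTEGER NOT NULL,
            field_name TEXT NOT NULL,
            field_value TEXT,
            PRIMARY KEY (document_id, page_number, field_name)
        )
        """
    )
    rows = [
        (doc_id, page_no, name, value)
        for doc_id, page_no, fields in pages
        for name, value in fields.items()
    ]
    # INSERT OR REPLACE makes retries safe: a reprocessed page overwrites its rows.
    conn.executemany("INSERT OR REPLACE INTO form_fields VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Made-up sample output for two pages of one form.
conn = sqlite3.connect(":memory:")
written = save_fields(
    conn,
    [
        ("form-001", 2, {"PolicyNumber": "A-1001", "ApplicantName": "Jane Doe"}),
        ("form-001", 5, {"ClaimAmount": "250.00"}),
    ],
)
print(written)  # 3
print(conn.execute("SELECT COUNT(*) FROM form_fields").fetchone()[0])  # 3
```

Keying rows on document, page, and field name is one possible design choice. It keeps the table flat for reporting and lets a retried function run overwrite earlier results instead of duplicating them.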


The process above shows how large-scale PDFs and images can be processed for various use cases while keeping Azure Applied AI cost under control. The same process can be used for event-driven and batch processing.

Version history
Last update: Jan 25 2024 08:41 AM