AI - Azure AI services Blog

Process Large-Scale PDFs or Images to Extract Information from Forms Using Applied AI Form Recognizer

BalaB
Microsoft
Jun 08, 2022

Optimized large scale forms processing using Applied AI Services

 

Thanks to co-authors Robert Nottoli, Michael McKechney, Mark Hoiland, and Lee Hansen.

 

Use Case

  • There are millions of forms to process
  • The volume is around 15 to 20 million pages or more
  • Forms have multiple pages, but only a few of those pages hold the data to extract
  • A form might have 3 to 15 pages or more
  • The data to pull might sit on only 2 or 3 of those pages
  • Split out the pages that matter before calling Form Recognizer to reduce AI cost
  • Use a Python AI library to filter the pages needed for AI services
  • The process is split into 2 sections
    1. Select the pages needed for AI services
    2. Send the selected pages to Form Recognizer
  • The idea here is to show how to preprocess PDFs or images so that AI Cognitive Services processes only the needed pages
  • Both steps can be scaled as needed based on requirements
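
To make the savings concrete, here is a rough back-of-the-envelope sketch in Python. The function name and the average of 10 pages per form are assumptions chosen for illustration (the use case only says forms run 3 to 15 pages or more, with data on 2 or 3 of them), so treat the output as an estimate and not a measured result.

```python
def pages_sent_to_ai(total_pages: int,
                     avg_pages_per_form: float,
                     avg_data_pages_per_form: float) -> int:
    """Estimate how many pages remain after dropping pages with no data to extract."""
    forms = total_pages / avg_pages_per_form
    return round(forms * avg_data_pages_per_form)


total = 15_000_000  # page volume from the use case
# Assumed averages: 10 pages per form, 2 of which carry data.
kept = pages_sent_to_ai(total, avg_pages_per_form=10, avg_data_pages_per_form=2)
reduction = 1 - kept / total

print(f"{kept:,} pages sent, {reduction:.0%} fewer requests")
```

Under these assumed averages, 15 million pages shrink to about 3 million, which lines up with the 2 to 3 million range quoted in this post.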

 

Architecture

 

Architecture - End to End processing

 

 

 

Two-Part Processing

 

Azure Python Function

  • A Python function processes each PDF and picks only the pages that need AI processing
  • The 15 million pages can be reduced to 2 or 3 million pages
  • Existing open-source packages such as pytesseract are used to pull only the pages needed
  • PDF processing is scaled using Azure Functions
  • https://github.com/balakreshnan/PythonAIFunction
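
The page-filtering step could look roughly like the sketch below. This is not the code from the linked repository; it is a minimal illustration that assumes pages are selected by matching keywords in their OCR text. The `select_pages` helper, the keyword list, and the sample pages are hypothetical. The OCR step is passed in as a callable so the sketch runs without any OCR engine installed; in practice it might be something like `pytesseract.image_to_string` applied to an image of each page.

```python
import re
from typing import Callable, Iterable, List, Sequence


def select_pages(pages: Sequence[object],
                 ocr: Callable[[object], str],
                 keywords: Iterable[str]) -> List[int]:
    """Return 0-based indexes of pages whose OCR text mentions any keyword."""
    # Case-insensitive literal matching; keywords are escaped, not treated as regexes.
    patterns = [re.compile(re.escape(k), re.IGNORECASE) for k in keywords]
    selected = []
    for index, page in enumerate(pages):
        text = ocr(page)
        if any(p.search(text) for p in patterns):
            selected.append(index)
    return selected


# Stand-in data: each "page" is already text, so the OCR callable is the identity.
sample_pages = [
    "Cover sheet",
    "Invoice Number: 123",
    "Terms and conditions",
    "Total Due: $40",
]
keep = select_pages(sample_pages,
                    ocr=lambda page: page,
                    keywords=["invoice number", "total due"])
print(keep)  # [1, 3]
```

Only the pages at the returned indexes would then be written out and passed on to the Form Recognizer step.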

 

Azure C# function to process Form Recognizer

  • Functions take the reduced set of pages and send them to Form Recognizer
  • Form Recognizer output is processed and saved to SQL for further reporting
  • Azure analytics services are used for further data processing
  • Functions scale as needed to process the forms
  • With the reduced set, 2 to 3 million pages are sent to AI services rather than 15 million
  • https://github.com/balakreshnan/HighThroughputFormRecognizer
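
The linked function is written in C#, but the "save output to SQL" step can be illustrated in Python for consistency with the other sketches here. Everything in this block is an assumption made for illustration: the `extracted_fields` table, the `save_fields` helper, and the shape of the `fields` dictionary (field name mapped to a value and a confidence score) are hypothetical, and an in-memory SQLite database stands in for the real SQL store.

```python
import sqlite3
from typing import Dict, Tuple


def save_fields(conn: sqlite3.Connection,
                document_id: str,
                fields: Dict[str, Tuple[str, float]]) -> int:
    """Store one row per extracted field and return the number of rows written."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS extracted_fields ("
        "document_id TEXT, field_name TEXT, field_value TEXT, confidence REAL)"
    )
    rows = [(document_id, name, value, confidence)
            for name, (value, confidence) in fields.items()]
    # Parameterized inserts avoid building SQL strings from extracted text.
    conn.executemany("INSERT INTO extracted_fields VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # stand-in for the real SQL database
saved = save_fields(conn, "form-0001.pdf", {
    "InvoiceNumber": ("123", 0.98),
    "TotalDue": ("40.00", 0.91),
})
rows = conn.execute(
    "SELECT field_name, field_value FROM extracted_fields ORDER BY field_name"
).fetchall()
print(saved, rows)
```

Storing one row per field keeps the table easy to query for reporting, whatever set of fields a given form model returns.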

 

The process above shows how to handle large-scale PDFs and images for various use cases while controlling Azure Applied AI cost. The same process can be used for both event-driven and batch processing.

Updated Jan 25, 2024
  • ja_Hor_365

    Greetings, excellent article.

    May I ask a question: did you estimate the Azure resources and the processing time needed to extract the data for the 3 million images?

    Thanks