User Profile
yodobrin
Former Employee
Joined 4 years ago
Recent Discussions
Optimizing Data Extraction Accuracy with Custom Models in Azure AI Document Intelligence
Context
Many users of Azure AI Document Intelligence experience challenges with the accuracy of data extraction. This article addresses a key reason for these challenges: the reliance on a singular model approach. As businesses turn to AI for extracting data from documents, Azure AI Document Intelligence stands out as a preferred solution. It offers an impressive mix of user-friendly features and sophisticated AI capabilities, tailored to accommodate a variety of data processing demands. However, the platform's versatility sometimes leads users to a common misunderstanding: that a single, all-encompassing model can meet all their needs. In this article, we highlight the limitations of this approach and suggest an alternative within Azure AI Document Intelligence. Specifically, we advocate for developing custom models and tailoring these models to the specific types of documents being processed. We also cover the role of classification models in improving the precision of data extraction.
The Singular Model Problem
Although it is possible in Azure AI Document Intelligence to create a single extraction model trained on all of your varying document types, this approach makes it hard to produce accurate results. For example, extracting valuable information from contracts that vary in format, structure, content, and legal jargon would be equivalent to fitting a square peg into a round hole. Here is a conceptual diagram to illustrate the point. The challenges that we frequently see include:
- Multiple Document Structures: There is usually a multitude of document structures. Invoices are not the same as receipts, and contracts are not the same as resumes. A single model struggles to handle this diversity, resulting in lower accuracy.
- Single Document with Multiple Structures: In many cases, a single document has multiple pages, each with a different structure. A given section may span several pages in one document but only a single page in another. Treating the whole document as one uniform entity also results in lower accuracy.
The challenges posed by the singular model problem are significant. When dealing with diverse document types, such as contracts, invoices, and resumes, a one-size-fits-all approach struggles to maintain accuracy. In the next section, we explore a practical design to address these issues effectively.
A practical design for improving data extraction accuracy in Azure AI Document Intelligence
In this section, we outline an approach to enhance document processing using Azure AI Document Intelligence. As highlighted in the diagram below, our method employs a system of multiple models, each designed for a specific type of document structure. This section details the steps and key factors to consider, helping users improve their accuracy in data extraction. Consider it a practical design for document processing with Azure AI Document Intelligence. Our recommended design involves a shift towards a more composed approach, utilizing multiple custom extraction models and a classification model to enhance the accuracy and efficiency of document processing. Based on this, consider the following (an illustrative code sketch follows the list):
- Use multiple extraction models for differing document structures: Train and deploy separate custom extraction models, each specifically designed for a particular type of document structure. This ensures that each model is highly optimized for its intended document type, improving accuracy in data extraction.
- Create a classification model to determine the extraction model route: Implement a custom classification model to serve as the initial layer of processing. This model categorizes documents based on their structure and content, ensuring that each document is routed to the most appropriate model for further processing.
- Perform pre-processing to optimize accuracy for multi-page documents: For documents consisting of multiple pages with diverse structures, divide them into smaller segments, each representing a distinct document type. Process each segment with the model best suited to its structure. This approach is particularly effective for composite documents that act more like collections of various document types rather than a single, uniform entity.
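To make the routing pattern concrete, here is a minimal sketch using the Azure AI Document Intelligence Python SDK (azure-ai-formrecognizer): a custom classification model determines the document type, and the result is used to pick the matching custom extraction model. The classifier ID, model IDs, and route mapping are illustrative assumptions, not part of the article; treat this as a sketch of the pattern rather than a definitive implementation.

```python
# pip install azure-ai-formrecognizer>=3.3.0
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

endpoint = "https://<your-resource>.cognitiveservices.azure.com/"  # assumption: your endpoint
client = DocumentAnalysisClient(endpoint, AzureKeyCredential("<your-key>"))

# Hypothetical mapping from classifier doc types to custom extraction model IDs.
MODEL_ROUTES = {
    "invoice": "my-invoice-extraction-model",
    "contract": "my-contract-extraction-model",
    "resume": "my-resume-extraction-model",
}

def classify_and_extract(document_bytes: bytes, classifier_id: str = "my-doc-classifier"):
    # Step 1: classify the document to determine its structure type.
    classification = client.begin_classify_document(classifier_id, document_bytes).result()
    if not classification.documents:
        raise ValueError("Classifier returned no documents")
    doc_type = classification.documents[0].doc_type

    # Step 2: route to the extraction model trained for that structure.
    model_id = MODEL_ROUTES.get(doc_type)
    if model_id is None:
        raise ValueError(f"No extraction model registered for doc type '{doc_type}'")

    # Step 3: run the type-specific custom extraction model and collect fields.
    result = client.begin_analyze_document(model_id, document_bytes).result()
    extracted = {
        name: (field.value, field.confidence)
        for doc in result.documents
        for name, field in doc.fields.items()
    }
    return doc_type, extracted
```

In the same spirit, multi-page composite documents would first be split into per-type segments (for example by page ranges) and each segment passed through this routine with its own model.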
Conclusion
It is important to recognize that each of your document types requires its own techniques for accurate data extraction. Azure AI Document Intelligence is a powerful AI tool that provides this capability, but our experiences alongside various ISVs and Digital Natives have revealed a critical insight: the pursuit of a singular, catch-all model falls short of expectations. The key takeaway is to move beyond the one-model myth and embrace custom models tailored to the unique structures of your documents. Aligning document intelligence capabilities with the specific needs of your documents can significantly enhance the accuracy of your data extraction with Azure AI Document Intelligence.
Read more on document processing with Azure AI
Thank you for taking the time to read this article. We are sharing our insights for ISVs and Startups that enable document processing in their AI-powered solutions, based on real-world challenges we encounter. We invite you to continue your learning through our additional insights in this series.
- Using Structured Outputs in Azure OpenAI’s GPT-4o for consistent document data processing: Discover how to leverage GPT-4o’s Structured Outputs to ensure reliable, schema-compliant document data processing.
- Using Azure AI Document Intelligence and Azure OpenAI to extract structured data from documents: Discover how Azure AI Document Intelligence and Azure OpenAI efficiently extract structured data from documents, streamlining document processing workflows for AI-powered solutions.
- Evaluating the quality of AI document data extraction with small and large language models: Discover our evaluation of the effectiveness of AI models in quality document data extraction using small and large language models (SLMs and LLMs).
Further Reading
- Azure AI Document Intelligence - Model Accuracy FAQ | Microsoft Learn
- Implementing MLOps for Training Custom Models with Azure AI Document Intelligence | Microsoft Tech Community
Client-Side Compute: A Greener Approach to Natural Language Data Queries
Introduction
Using natural language to interact with data can significantly enhance our ability to work with and understand information, making data more accessible and useful for everyone. Considering the latest advances in large language models (LLMs), it seems like the obvious solution. However, while we've made strides in interacting with unstructured data using NLP and AI, structured data interaction still poses challenges. Using LLMs to convert natural language into domain-specific languages like SQL is a common and valid use case, showcasing a strong capability of these models. This blog identifies the limitations of current solutions and introduces novel, energy-efficient approaches that enhance efficiency and flexibility. My team focuses on ISVs and how each design decision impacts them. For example, if an ISV needs to offer "chat with data" as part of its solution, it must also address the challenges of hosting, monetizing, and securing these features. We present two key strategies: leveraging deterministic tools to execute the domain-specific language on the appropriate systems, and offloading compute to client devices. These strategies not only improve performance and scalability but also reduce server load, making them ideal for ISVs looking to provide seamless and sustainable data access to their customers.
The Challenge: Efficiently Interacting with Structured Data
Structured data, typically stored in databases, structured files, and spreadsheets, is the backbone of business intelligence and analytics. However, querying and extracting insights from this data often requires knowledge of specific query languages like SQL, creating a barrier for many users. Additionally, ISVs face the challenge of anticipating the diverse ways their customers want to interact with their data. Due to increasing customer demand for natural language interfaces that simplify and make data access intuitive, ISVs are pressured to develop solutions that bridge the gap between users and the structured data they need to interact with. While using LLMs to convert natural language queries into domain-specific languages such as SQL is a powerful capability, it alone doesn't solve the problem. The next step is to execute these queries efficiently on the appropriate systems, and any implementation must include fundamental guardrails to ensure the generated SQL is safe to execute. Moreover, there is the additional challenge of managing the computational load: hosting these capabilities on ISV servers can be resource-intensive and costly. Therefore, an effective solution must not only translate natural language into executable queries but also optimize how these queries are processed. This involves leveraging deterministic tools to execute domain-specific languages and offloading compute tasks to client devices. By doing so, ISVs can provide more efficient, scalable, and cost-effective data interaction solutions to their customers.
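As a concrete illustration of the translation step, the sketch below prompts an Azure OpenAI chat model to turn a natural language question into a single read-only SQL statement. The deployment name, schema description, and prompt wording are illustrative assumptions; a production implementation would add further guardrails (schema validation, statement allow-listing), which the sandbox design described next also helps contain.

```python
# pip install openai>=1.0
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

# Hypothetical schema description for a tenant's gold-layer data.
SCHEMA = "Table sales(region TEXT, product TEXT, month TEXT, revenue REAL)"

def to_sql(question: str, deployment: str = "gpt-4o") -> str:
    """Translate a natural language question into a single SQLite SELECT statement."""
    resp = client.chat.completions.create(
        model=deployment,  # assumption: your Azure OpenAI deployment name
        messages=[
            {
                "role": "system",
                "content": (
                    "You translate questions into a single SQLite SELECT statement. "
                    f"Schema: {SCHEMA}. Return only the SQL, with no explanation."
                ),
            },
            {"role": "user", "content": question},
        ],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

# Example: to_sql("What was total revenue per region last month?")
```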
A Common Use Case
An ISV collects data from various sources, some public and most from its customers (or tenants). These tenants could come from various industries such as retail, healthcare, and finance, each requiring tailored data solutions. The ISV implements a medallion pattern for data ingestion, a design pattern that organizes data into layers (bronze, silver, and gold) to ensure data quality and accessibility. In this pattern, raw data is ingested into the bronze layer, cleaned and enriched into the silver layer, and then aggregated into the gold layer for analysis. The gold tables, containing the aggregated data, are generally smaller than 20 MB per tenant. The data ingestion pipeline runs periodically, populating the gold tables hosted on Azure SQL Database. Data isolation is managed using row-level security or multiple schemas, tailored to the ISV's requirements. The next step for the ISV is to provide its tenants access to the data through a web application, leveraging homegrown dashboards and reporting capabilities. Often, these ISVs are small companies that do not have the resources to implement a full Business Continuity and Disaster Recovery (BCDR) approach or to afford paid tools like Power BI, and they therefore rely on homegrown or free packages. Despite having a robust infrastructure, the ISV faces several challenges:
- Complex Query Language: Users often struggle with the complexity of SQL or other query languages required to extract insights from the data. This creates a barrier to effective data utilization.
- Performance and Scalability: Server load increases significantly with complex queries, especially when multiple tenants access the data simultaneously. This can lead to performance bottlenecks and scalability issues.
- Cost and Resource Management: Hosting the computational resources needed to handle data queries on the ISV's servers is resource-intensive and costly. This includes maintaining high-performance databases and application servers.
- User Experience: Customers increasingly demand the ability to interact with their data using natural language, expecting a seamless and intuitive user experience.
For more detailed information on the medallion pattern, you can refer to this link. The architecture diagram above illustrates the current setup:
- Data Sources: Public sources and tenant data are ingested into the system.
- Storage: The data lake (or lakehouse) processes the data from multiple sources, performs cleansing, and stores the data in the gold tables periodically.
- Orchestrator: ELT/ETL is orchestrated using Microsoft Fabric/Synapse or Azure Data Factory pipelines.
- Serving: The web application is hosted on Azure App Service, and the data is queried from Azure SQL Database.
- Visualize: Data is reported using Power BI or other reporting tools, including homegrown dashboards.
Enhanced Approach: Energy-Efficient Data Interaction
To address the challenges mentioned earlier, the ISV can adopt the following strategies.
Leveraging Deterministic Tools for Query Execution (an illustrative code sketch follows this strategy):
- Translation: Utilize LLMs to convert natural language queries into SQL.
- Execution: Create a sandbox environment for each customer's data. This sandbox is hosted on lower-cost storage, such as a storage container per customer, which contains a snapshot of the data they can interact with.
- Data Management: The same data ingestion pipeline that updates the gold table in Azure SQL is adapted to update a customer-specific data set stored in their respective storage container. The idea is to use SQLite to store the customer-specific data, keeping it lightweight and portable.
Benefits:
- Efficiency and Security: Queries are executed efficiently and securely, leveraging the robust capabilities of SQL databases while minimizing risk. By isolating each customer's data in a sandbox, the need for sophisticated guardrails against bad queries and against overloading the reporting database is significantly reduced.
- Cost & Energy Savings: There is no need to manage or host a dedicated reporting database. Since the customer-specific data is hosted on Azure storage containers, the ISV avoids the costs and energy consumption associated with maintaining high-performance database infrastructure.
- Scalability and Reliability: The ISV does not need to plan for the worst-case scenario of all customers running queries simultaneously, which could impact the health of a centralized reporting database. Each customer's queries are isolated to their own data, ensuring system stability and performance.
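The sketch below illustrates the sandbox execution step: the LLM-generated SQL runs against a per-customer SQLite snapshot opened read-only, with a simple allow-list guardrail. The file names and guardrail rules are illustrative assumptions, not a hardened implementation.

```python
import sqlite3

def run_sandboxed_query(snapshot_path: str, sql: str, max_rows: int = 1000):
    """Execute a generated query against a customer's SQLite snapshot, read-only."""
    # Minimal guardrail: allow only a single SELECT statement.
    statement = sql.strip().rstrip(";")
    if ";" in statement or not statement.lower().startswith("select"):
        raise ValueError("Only single SELECT statements are allowed")

    # Open the snapshot in read-only mode so the query cannot modify the data.
    conn = sqlite3.connect(f"file:{snapshot_path}?mode=ro", uri=True)
    try:
        cursor = conn.execute(statement)
        columns = [col[0] for col in cursor.description]
        rows = cursor.fetchmany(max_rows)
        return columns, rows
    finally:
        conn.close()

# Example (hypothetical snapshot and query):
# cols, rows = run_sandboxed_query(
#     "tenant-42.sqlite",
#     "SELECT region, SUM(revenue) FROM sales GROUP BY region",
# )
```

Because each tenant queries only its own lightweight snapshot, a runaway or malformed query affects that tenant alone rather than a shared reporting database.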
Offloading Compute to Client Devices (an illustrative code sketch follows this strategy):
- Data Transmission: The client-side application ensures it has the current data snapshot available for the customer to work with. For example, it can check the data's timestamp or use another method to verify that the local data is up to date, and download the latest version if necessary. This snapshot is encapsulated in portable formats like JSON, SQLite, or Parquet.
- Local Processing: The client-side application processes the data locally using the translated SQL queries.
Benefits:
- Performance: Reduces server load, enhances scalability, and provides faster query responses by utilizing the client's computational resources.
- Cost & Energy Savings: Significant cost savings by reducing the need for high-performance server infrastructure. Hosting a static website and leveraging the processing power of client devices also reduces overall energy consumption.
- Flexibility: Ensures that customers always work with the most current data without the need for constant server communication.
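To illustrate the data transmission step, the sketch below compares a local snapshot's modification time with the blob's last-modified timestamp and downloads a fresh copy only when needed. It is written in Python with the Azure Blob Storage SDK for brevity; a browser-based client would implement the same logic with its own HTTP and storage facilities. The SAS URL and file names are illustrative assumptions.

```python
# pip install azure-storage-blob
import os
from datetime import datetime, timezone
from azure.storage.blob import BlobClient

def refresh_snapshot(snapshot_sas_url: str, local_path: str) -> str:
    """Download the customer's data snapshot only if the remote copy is newer."""
    blob = BlobClient.from_blob_url(snapshot_sas_url)  # assumption: per-customer SAS URL
    remote_modified = blob.get_blob_properties().last_modified

    if os.path.exists(local_path):
        local_modified = datetime.fromtimestamp(os.path.getmtime(local_path), tz=timezone.utc)
        if local_modified >= remote_modified:
            return local_path  # local snapshot is already current

    # Local copy is missing or stale: download the latest snapshot.
    with open(local_path, "wb") as f:
        f.write(blob.download_blob().readall())
    return local_path

# Example (hypothetical container and snapshot):
# refresh_snapshot(
#     "https://<account>.blob.core.windows.net/tenant-42/gold.sqlite?<sas>",
#     "gold.sqlite",
# )
```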
Revised Architecture
- Data Sources: Public sources and tenant data are ingested into the system.
- Storage: The data lake (or lakehouse) processes the data from multiple sources, performs cleansing, and stores the data in customer-specific containers. This enhances security and isolation.
- Orchestrator: ELT/ETL is orchestrated using Microsoft Fabric/Synapse or Azure Data Factory pipelines.
The above components are hosted in the ISV's infrastructure. The client-side web application pulls the data from the customer-specific containers and processes it locally. Please visit our Azure OpenAI .NET Starter Kit for further reading and understanding - focus on the 07_ChatWithJson and 08_ChatWithData notebooks.
Why This Approach?
- Efficiency: Data queries are executed locally, reducing the load on the server and improving performance.
- Security: Data is securely isolated within a client-side sandbox, ensuring customers can only query what is provided to them.
- Cost & Energy Saving: Hosting a static website is significantly cheaper and more energy-efficient than hosting a web application with a database. This approach leverages the processing power of client devices, further reducing infrastructure costs and energy consumption.
- Scalability: By isolating each customer's data in a sandbox, the ISV does not need to worry about the impact of simultaneous queries on a centralized database, ensuring system reliability and scalability.
- Flexibility: Ensures that customers always have access to the most current data without the need for constant server communication.
Potential Downsides and Pitfalls
- Client-Side Performance Variability: The approach relies on the computational power of client devices, which can vary widely across users.
- Data Synchronization: Ensuring that the local data snapshot on client devices is up to date can be challenging. Delays in synchronization could lead to users working with outdated data.
Conclusion
By adopting these strategies, ISVs can provide a more efficient, scalable, and cost-effective solution for natural language querying of structured data. Leveraging deterministic tools to execute domain-specific languages within isolated sandboxes ensures robust and secure query execution. Offloading compute to client devices not only reduces server load but also enhances performance and scalability, providing a seamless and intuitive user experience.
Recent Blog Articles
Future-Proofing AI: Strategies for Effective Model Upgrades in Azure OpenAI
TL;DR: This post navigates the intricate world of AI model upgrades, with a spotlight on Azure OpenAI's embedding models like text-embedding-ada-002. We emphasize the critical importance of consisten...