artificial intelligence
Azure OpenAI Landing Zone reference architecture
In this article, delve into the synergy of Azure Landing Zones and Azure OpenAI Service, building a secure and scalable AI environment. Unpack the Azure OpenAI Landing Zone architecture, which integrates numerous Azure services for optimal AI workloads. Explore robust security measures and the significance of monitoring for operational success. This journey of deploying Azure OpenAI evolves alongside Azure's continual innovation.

Security Best Practices for GenAI Applications (OpenAI) in Azure
This article presents an in-depth guide on security best practices for GenAI applications that use LLM models within the Azure platform. Aimed at developers and system administrators, it explores the essentials for maintaining the confidentiality, integrity, and availability of LLMs such as Azure OpenAI. It delves into practical measures for addressing security challenges, including data breaches, misuse of AI, and regulatory compliance, while also emphasizing the role of a shared responsibility model in cloud security. The guide provides a comprehensive roadmap for implementing layered security strategies, encryption protocols, access controls, and monitoring practices to ensure the robust security of LLM applications in Azure.

Empowering AI: Building and Deploying Azure AI Landing Zones with Terraform
Discover the power of deploying Azure AI Landing Zones with Terraform. Explore best practices, secure connectivity, and streamlined access to Azure AI services. Learn to create a strong cloud foundation, optimize performance, and ensure governance for your AI solutions. Join us on this practical journey to harness the true capabilities of AI.

Demystifying Azure OpenAI Networking for Secure Chatbot Deployment
Embark on a technical exploration of Azure's networking features for building secure chatbots. In this article, we'll dive deep into the practical aspects of Azure's networking capabilities and their crucial role in ensuring the security of your OpenAI deployments. With real-world use cases and step-by-step instructions, you'll gain practical insights into optimizing Azure and OpenAI for your projects.

AI Studio End-to-End Baseline Reference Implementation
Discover the Future of AI Deployment with Azure AI Studio's Baseline Reference Implementation
Azure AI Studio is reshaping the landscape of cloud AI integration with its commitment to operational excellence and strategic alignment with core business objectives. We are thrilled to introduce Azure AI Studio's end-to-end baseline reference implementation—a streamlined architecture crafted for seamless, scalable, and secure AI cloud deployments. Embark on a journey to deploy sophisticated AI workloads with confidence, supported by Azure AI Studio's robust baseline architecture. Whether it's hosting interactive AI playgrounds, constructing complex AI workflows with Promptflow, or ensuring resilient and secure deployments within Azure's managed network environment, this implementation is your blueprint for success. Embrace a new era of AI innovation where security and scalability converge with organizational compliance and governance. Join us in deploying tomorrow's AI solutions, today.

Azure OpenAI chat baseline architecture in an Azure landing zone
Unlock the potential of AI in the cloud. Our Azure OpenAI Chat Baseline Architecture provides the blueprint you need to transition from testing to a full production environment, all within an Azure landing zone. Ensure security, scalability, and governance are part of your AI strategy. Get started with our concise guide and embrace the future of AI on Azure.

Advanced RAG Solution Accelerator
Overview

What is RAG and Why Advanced RAG?
Retrieval-Augmented Generation (RAG) is a natural language processing technique that combines the strengths of retrieval-based and generation-based models. It uses search algorithms to retrieve relevant data from external sources such as databases, knowledge bases, document corpora, and web pages. This retrieved data, known as "grounding information," is then fed into a large language model (LLM) to generate more accurate, relevant, and up-to-date outputs.
Figure 1: High-level Retrieval-Augmented Generation flow

Usage Patterns
Here are some horizontal use cases where customers have used Retrieval-Augmented Generation based systems:
Conversational Search and Insights: Summarize large volumes of information for easier consumption and communication.
Content Generation: Tailor interactions with individualized information to produce personalized output and recommendations.
AI Assistant, Q&A, and Decisioning: Analyze and interpret data to uncover patterns, identify trends, gain valuable insights, and answer questions.
Below are a few examples of vertical use cases where Retrieval-Augmented Generation has been beneficial:
Public Sector Knowledge Base: A government agency needs a system to provide citizens with information about public services, such as permits, licenses, and local regulations.
Compliance Document Retrieval: A regulatory body must assist organizations in understanding compliance requirements through a database of guidelines and policies.
Healthcare Patient Education: A health department aims to provide patients with educational resources about common conditions and treatments.

Challenges with Baseline RAG:
Ability to cover complex data: RAG over plain text content works well. However, when the content becomes more complex, such as financial reports with images, complex tables, and document sections spanning multiple pages, parsing and indexing it is not straightforward.
Context Window Limitations: As the dataset scales, the performance of RAG systems can degrade, particularly due to the "lost in the middle" phenomenon, making it challenging to retrieve specific information from large datasets.
Search Limitations: Although search technology has advanced to support vector-based queries, searching over vector embeddings alone may not be sufficient for achieving high accuracy.
Groundedness: When the search context is insufficient, RAG systems can generate incorrect or misleading information that is not grounded in the customer's data. Careful evaluations may be necessary to catch and fix these cases.
Latency and User Experience: Balancing performance and latency is crucial, as high latency can negatively impact the user experience. Optimizing this balance is a key challenge.
Quality Improvement Levers: Identifying and effectively using the right levers for quality improvements, such as accuracy, latency, and relevance, can be difficult.

Advanced RAG aims to address the challenges of Baseline RAG by incorporating advanced techniques for ingestion, formatting, and intent extraction from both structured and unstructured data. It provides an improved baseline architecture to build a more scalable solution that meets the accuracy and performance requirements of the business. By implementing advanced methodologies in data ingestion, search optimization, and evaluation, Advanced RAG enhances the overall effectiveness and reliability of Retrieval-Augmented Generation systems.
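For orientation, here is a minimal sketch of the baseline retrieve-then-generate flow from Figure 1, assuming the azure-search-documents and openai Python SDKs; the index name, field name, and deployment name are illustrative placeholders rather than part of the accelerator:

```python
# Minimal baseline RAG sketch: retrieve grounding text, then generate a grounded answer.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from openai import AzureOpenAI

search_client = SearchClient(
    endpoint="https://<your-search-service>.search.windows.net",
    index_name="financial-reports",                 # hypothetical index name
    credential=AzureKeyCredential("<search-key>"),
)
aoai_client = AzureOpenAI(
    azure_endpoint="https://<your-aoai-resource>.openai.azure.com",
    api_key="<aoai-key>",
    api_version="2024-02-01",
)

def answer(question: str, top_k: int = 5) -> str:
    # 1. Retrieve grounding information from the search index.
    results = search_client.search(search_text=question, top=top_k)
    context = "\n\n".join(doc["content"] for doc in results)  # assumes a "content" field

    # 2. Generate an answer constrained to the retrieved context.
    response = aoai_client.chat.completions.create(
        model="gpt-4o",  # your deployment name
        messages=[
            {"role": "system", "content": "Answer only from the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```

The advanced techniques described in the rest of this article (better parsing, chunking, query rewriting, filtering, and evaluation) layer on top of this basic loop.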
These advanced methodologies ensure that the business value of RAG systems is maximized, aligning technological capabilities with business needs.

RAG Quality Improvement

Background
Our implementation uses default configurations from document indexing services to ingest financial data. We use Azure AI Search for indexing, and the content is also vectorized in the index. The search index covered a few years of financial reports for the company. Once the RAG solution was implemented, overall accuracy was measured using the GPT similarity metric, which evaluates the similarity between user-provided ground truth answers and the model's predicted answers on a scale of 1 to 5, where 5 means the system produced answers that perfectly matched the ground truth.

Accuracy Improvement Efforts
To improve the accuracy of the Retrieval-Augmented Generation (RAG) system, several strategies were implemented, grouped under three categories: ingestion improvements, search improvements, and improvements in tooling and evaluation.

Ingestion Improvements:
Improve Parsing: Efforts were made to minimize information loss during ingestion by handling data in images and complex tables. Image descriptions were generated, and various techniques were used to handle complex tables, including converting them into Markdown, HTML, and paragraph formats to ensure accurate parsing of tabular data.
Information in images: The image below shows the performance of Microsoft stock compared to the rest of the market (S&P 500 and NASDAQ). Efficient parsing techniques can eliminate the need for additional tables and supporting text content by extracting key insights from images and storing them as text.
Figure 2: Example of information in images
Complex Tables: The image below shows an example of financial data represented in a complex table structure in the financial report. In this particular example, the table contains multiple sub-columns (years) within a top-level column, along with rows spanning multiple lines.
Figure 3: Example of a complex table in financial reports
Optimal Chunk Size: The impact of chunk size on search results was analyzed. Parsed content was split into paragraphs, and a small percentage of these paragraphs were used to generate questions. Custom scripts created a question-to-paragraph mapping dataset. Different indexes with varying chunk sizes (e.g., 3K and 8K characters) were created, and search results were evaluated for different values of top_k.
Recall Values with Different Chunk Sizes: The image and table below show recall values for different top_k search results on indexes with different chunk sizes. For example, with a chunk size of 3K characters, the recall was 78.5% for top_k = 7 and 91% for top_k = 25.
Figure 4: Recall on different chunk sizes

Table 1: Search recall on different chunk sizes
Recall        Chunk size 8K (chars)   Chunk size 3K (chars)
top_k = 7     69%                     78.5%
top_k = 25    76%                     91%

Based on these recall values, chunks of 3K characters appear to work best for our content, and a top_k of 25 retrieves most of the relevant search results.
Index Enrichment: Additional metadata was added to chunks during ingestion to aid retrieval. This included metadata in additional fields used during search (such as headings and section topics) and other fields used for filtering (such as report year).

Search Improvements:
Pre-processing of User Input: Techniques such as rephrasing and query expansion were used to enhance the quality of user input.
Advanced Search Features: Vector, semantic, and hybrid search features were used to increase the number of relevant results.
Filtering and Reranking: Filters were dynamically extracted from user queries, and search results were reranked to improve relevance.
Example of a rephrased user query: Below is an example of a user prompt that is rephrased and fanned out into several smaller (more focused) search queries, produced by GPT-4o using a custom prompt.
User Query: Can you explain the difference between the gross profit for Microsoft in 2023 and 2024?
Rephraser Output:

Evaluation and Tooling:
Standardizing Datasets: As the project progressed, we soon had too many datasets, which led to inconsistent ways of measuring the quality of the bot response. To resolve that issue, we standardized our datasets and used AML to store, document, and version them. When any updates were made to a dataset (say, some inconsistency was found in the golden dataset and that user prompt was excluded from accuracy computation), the dataset was updated and a new version created. This way everyone was using a known dataset for evaluations.
Standardizing Accuracy Calculation: To calculate the accuracy of the bot's answers, a similarity score is used: a rating between 1 and 5 based on how similar the bot's answer is to the golden dataset. Initially, the similarity score metric included in Prompt Flow was used, but we soon realized that the produced scores did not make it easy to understand why certain answers were scored the way they were. So the team created its own prompt and calibrated it against human evaluations. The tuned prompt was then used in Prompt Flow to run evaluations. Along with scoring the bot's result, the prompt also provides the reason it gave that score, which is useful in analyzing the results. The following image shows a snippet of that prompt:
Figure 5: Custom prompt for scoring bot response
Automating Accuracy Calculations: Tools were also developed to automate the generation of predictions and the evaluation of accuracy in a more repeatable and consistent way. More details on analysis can be found in the Evaluation Tool section.
Analyzing problematic queries: Running evaluations and looking only at the overall or average score was not enough to analyze the cause of issues. So we took a first pass at categorizing the user queries into buckets. These categories became:
Queries that are a direct hit on some content in the report, such as revenue for a year
Queries where we need to perform some calculations, such as gearing ratio
Queries that compare and contrast across some KPI, such as the largest two segments by revenue
Open-ended queries where we need to perform analysis, such as why and what
Later, an LLM was used to auto-categorize ground truth questions as the set of ground truth questions was updated. Once the questions were categorized, evaluations were broken down by these categories to ease analysis and understanding of problematic queries. The figure below shows a snapshot of the spread of user queries in the ground truth across these categories:
Figure 6: Spread of user prompts by category
The figure below shows a snapshot of similarity scores across these categories:
Figure 7: Average similarity score by category
Later, another category was added (difficulty level) with the values below. The final results were reported across these categories.
Easy: The search context had the answer the user was looking for.
Medium: There was no direct hit, and some calculations were required to get to the final result.
Hard: The question required some analysis to be performed on the retrieved or calculated data, as a financial analyst would.

Results after the Accuracy Improvement Efforts
After multiple iterations of accuracy improvement efforts and stabilizing the solution, the overall accuracy of the system came to around 4.3, making the solution more acceptable to the end user. The solution was also scaled up to cover content across multiple years, with over 15 financial reports and roughly 1300 pages in total. Another important metric is the pass rate by question type (the percentage of answers scored a 4 or a 5), which ensures the copilot was consistently passing these ground truths. The table below lists the pass rate by difficulty:

Table 2: Accuracy improvement analysis by difficulty
Difficulty    Easy    Medium    Hard
Pass Rate     95%     79%       72%

Solution architecture
The RAG solution is designed to handle various tasks using a robust and scalable architecture. The architecture includes the following key aspects:

Security
User Authentication: The solution uses Microsoft Entra ID for user authentication, ensuring secure access to the system.
Network Security: All runtime components are locked behind a Virtual Network (VNet) to ensure that traffic does not traverse public networks. This enhances security by isolating the components from external threats.
Managed Identities: The solution leverages managed identities where possible to simplify the management of secrets and credentials. This reduces the risk of credential exposure and makes it easier to manage access to Azure resources.

Composability
Modular Design: The solution is broken down into smaller, well-defined core microservices and skills that act as plug-and-play components. This modular design allows you to use existing services or bring in new ones to meet your specific needs.
Core Microservices: Backend services handle different aspects of the solution, such as session management, data processing, runtime configuration, and orchestration.
Skills: Specialized services provide specific capabilities, such as cognitive search and image processing. These skills can be easily integrated or replaced as needed.

Iterability
Configuration Service: The solution includes a configuration service that allows you to create runtime configurations for each microservice. This enables you to make changes, such as updating prompts or search indexes, without redeploying the entire solution.
Per-User Prompt Configuration: The configuration service can be used to apply different configurations for each user prompt, allowing for rapid experimentation and iteration. This flexibility helps to quickly adapt to changing requirements and improve the overall system.
Testing and Evaluation: The solution also includes the ability to run dummy/simulated conversations as nightly runs, end-to-end integration tests on demand, and an evaluation tool to perform end-to-end evaluation of the solution.

Logging and Instrumentation
Application Insights: The solution integrates with Azure Application Insights in Azure Monitor for logging and instrumentation, making it easy to debug by reviewing logs.
Traceability: One can easily trace what is happening in the backend using the conversation_id and dialog_id (unique GUIDs generated by the frontend) for each user session and interaction. This helps in identifying and resolving issues quickly.
Figure 8: Solution Architecture

Before exploring the data flow, we begin with the ingestion process, which is crucial for preparing the solution. This involves creating and populating the search index with relevant content (the corpus). Detailed instructions on parsing, chunking, and indexing can be found in the Solution capabilities section of the document.

User Query Processing Flow
User Authentication: Users interact with the bot via a web application and must authenticate using Microsoft Entra ID to ensure secure access.
User Interaction: Once authenticated, users can submit requests through text or voice. The web app establishes a WebSocket connection with the backend session manager. For voice interactions, Microsoft Speech Services are used for live transcription; the web app requests a speech token from the backend, which is then used in the Speech SDK for transcription.
Token Management: The backend retrieves secrets from Key Vault to generate tokens necessary for front-end operations.
Transcription and Submission: After transcription, the web app submits the transcribed text to the backend.
Session Management: The session manager assigns a unique connection ID to each WebSocket connection to identify clients. User prompts are then pushed into a message queue, implemented using Azure Cache for Redis.
Orchestrator: The orchestrator plays a critical role in managing the flow of information. It reads the user query from the message queue and performs several actions:
Plan & Execute: It identifies the required actions based on the user query and context.
Permissions: It checks user permissions using Role-Based Access Control (RBAC) or custom permissions on the content. NOTE: The current implementation does not perform this check; however, the orchestrator could easily be updated to do so.
Invoke Actions: It triggers the appropriate actions, such as invoking Azure AI Search to retrieve relevant information.
Azure AI Search: The orchestrator interacts with Azure AI Search to query the unstructured knowledge base. This involves searching through financial reports or other content to find the information the user requested.
Status & Response: The orchestrator processes the search results and formulates a response. It updates the queue with the status and the final response, which includes any necessary predictions or additional information.
Session Manager: The response from the orchestrator is sent back to the session manager. This component is responsible for maintaining the session's integrity and ensuring that each client receives the correct response. It uses the unique connection ID to route the response back to the appropriate client.
Web App: The web app receives the response from the session manager and delivers the bot's response back to the user, completing the interaction cycle. This response can be in text and/or speech format, depending on the user's initial input method.
Update History: On successful completion of the bot response, the session manager updates the user profile and conversation history in the storage component. This includes details about user intents and entities, ensuring that the system can provide personalized and context-aware responses in future interactions.
Developer Logs / Instrumentation: Throughout the process, logs and instrumentation data are collected. These logs are essential for monitoring and debugging the system, as well as for enhancing its performance and reliability.
Evaluations and Quality Enhancements: The collected data, along with golden datasets and manual feedback, is used for ongoing evaluations and quality enhancements. Tools like Azure AI Foundry and VS Code, along with the configuration service, are used to test the bots and to develop and evaluate different prompts and models.
Monitoring and Reporting: The system is continuously monitored using Azure Monitor and other analytics tools. Power BI dashboards provide insights into system performance, user interactions, and other key metrics. This ensures that the solution remains responsive and effective over time.

Solution capabilities
The solution supports the following capabilities:

Document Ingestion Pipeline
Document ingestion in a Retrieval-Augmented Generation (RAG) application is a critical process that ensures efficient and accurate retrieval of information. Currently, the ingestion service supports the following scenarios:
Large financial documents containing complex tables, graphs, charts, and other figures
Large retail product catalogs containing images and descriptions
The overall process can be broken down into three primary stages:
Document Loading: The Document Loader is the first stage in the document ingestion pipeline. Its primary function is to load documents into memory and extract text and metadata. The loader can be configured to use either the Azure AI Document Intelligence service or LangChain with Azure AI Document Intelligence for text extraction.
Document Parsing: The Document Parser is the second stage in the document ingestion pipeline. Its role is to process the loaded text and metadata, splitting the document into manageable chunks and cleaning the text for indexing. Chunking can be either fixed-size with overlap or layout-based, where an LLM decides whether certain paragraphs should be kept together. This solution used layout-based chunking, and sections and subsections were extracted and maintained as metadata for the chunked paragraphs.
Document Indexing: The Document Indexer is the final stage in the document ingestion pipeline. Its purpose is to upload the parsed chunks into a search index, enabling efficient retrieval based on user queries. Additional metadata produced during parsing (section and subsection names and titles) is passed along with the text to be indexed. The main content and certain metadata fields are also stored as vectors to enable better retrieval.
Figure 9: Indexing by document

Search
Once the ingestion pipeline has executed successfully, resulting in a valid, queryable search index, the Search service can be configured and integrated into the end-to-end RAG application. The Search service exposes an API that enables users to query a search index in Azure AI Search. It processes natural language queries, applies requested filters, and invokes search requests against the preconfigured search configuration using the Azure AI Search SDK.
Search Index Configuration: The search index configuration defines the schema and the type of search to apply, including simple text search, vector search, hybrid search, and hybrid search with additional semantic understanding. This is done as part of index creation and document ingestion.
User Query: The process starts with a user query, a natural language input from the user.
Query Embeddings Generation: Using an LLM, the query is vectorized so hybrid search can be performed on the user query.
Search Filter Generation: From the user query, filters based on criteria such as equality, range conditions, and substring matches are generated to refine the search results.
Search Invocation: The search service constructs a query using the embedding and filters, sends it to Azure AI Search via the Azure AI Search SDK, and receives the search results.
Pruning: Pruning refines the results further to ensure relevance, based on additional semantic filtering and ranking.
Search Results: The final output represents the items from the search index that best match the user's query, after all filters and pruning have been applied.

Query Preprocessing
One of the first steps when we receive a chat message is preprocessing, to make sure we get better search results that enable the RAG system to answer the question accurately. We perform the following steps as part of preprocessing:
Message Rephrasing: When the chatbot receives a new message, we rephrase the message based on the chat history, because the new message may depend on previous context. For example, when we ask, "Which team won the Premier League in 2023?" and then ask the follow-up question "What about the following year?", we need to rephrase the follow-up into "Which team won the Premier League in 2024?"
Fanout: If the query asks about a value that does not exist directly in the indexed documents, it may be derivable from simpler data that does exist there. For example, if the indexed documents are financial reports and the query asks about the gross profit margin, searching for "gross profit margin" may not return a result. However, the gross profit margin can be calculated from Revenue and Cost of Goods Sold (COGS), both of which exist in the indexed documents. If we break the original question about gross profit margin down into sub-questions for Revenue and COGS, the model can calculate the gross profit margin from those values. Also check out the new service Rewrite queries with semantic ranker in Azure AI Search (Preview).

AI Skills
To ensure modularity and ease of maintenance, our solution designates any service capable of providing data as a "skill." This approach allows for seamless plug-and-play integration of components. For instance, Azure AI Search is treated as a skill within our architecture. Should the solution require additional data sources, such as a relational database, these can be encapsulated within an API and similarly integrated as skills. Wrapping content providers as skills serves two primary purposes (a small illustrative sketch follows this list):
Enhanced Logging and Debugging: Skills can be configured to incorporate logging and instrumentation fields, ensuring that all generated logs include relevant context. This uniformity greatly facilitates efficient debugging by providing comprehensive log insights.
Dynamic Configuration: Skills can leverage the configuration service to expose runtime configurations. This flexibility is particularly beneficial during evaluations, allowing for adjustments such as modifying the number of top-k results or switching to a different search index to accommodate improvements in data ingestion.
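As an illustration of the skill pattern (not the accelerator's actual code), a minimal skill wrapper might look like the sketch below; the configuration client, field names, and defaults are assumptions made for the example:

```python
# Illustrative "skill" wrapper: contextual logging plus runtime configuration overrides.
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger("skills")

@dataclass
class SearchSkillConfig:
    index_name: str = "financial-reports"   # hypothetical defaults
    top_k: int = 25

class SearchSkill:
    def __init__(self, search_client, config_service):
        self.search_client = search_client
        self.config_service = config_service  # hypothetical configuration-service client

    def run(self, query: str, conversation_id: str, dialog_id: str,
            config_override_id: Optional[str] = None) -> list:
        # Resolve runtime configuration: use an override when the payload carries one.
        config = SearchSkillConfig()
        if config_override_id:
            override = self.config_service.get(config_override_id)  # assumed API
            config = SearchSkillConfig(**override)

        # Every log entry carries conversation/dialog context for traceability.
        logger.info(
            "search skill invoked",
            extra={"conversation_id": conversation_id, "dialog_id": dialog_id,
                   "index": config.index_name, "top_k": config.top_k},
        )
        results = self.search_client.search(search_text=query, top=config.top_k)
        return [dict(r) for r in results]
```

A new data source (for example, a relational database) would follow the same shape: accept the query and the tracing identifiers, honor configuration overrides, and return results in a common format so the orchestrator can treat all skills uniformly.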
By adopting this skill-based approach, the architecture remains adaptable and scalable, supporting ongoing enhancements and diverse data integration.

Sharing Intermediate Results
Sharing intermediate results from the RAG process gives the user visibility into what is happening once a query is sent to the bot. This is especially useful when the query takes a long time to return. It also shows how the query was broken down into smaller queries, so if something goes wrong (especially for harder queries), the user has the ability to rephrase and get a better response. Once the user sends the query to the bot, the orchestrator emits intermediate updates like "Searching for ...", "Retrieved XX results..." before the final answer is delivered.
Figure 10: Messaging Framework
Architecture to support this:
WebSocket connection (Client <> Session Manager): When the client connects to the session manager, a persistent WebSocket connection is created, and all communication between the client and session manager is handled through this connection. This also allows multiple messages from the client to be queued up. The session manager listens to the incoming messages and queues them in a message queue; requests are then handled one by one. Meanwhile, intermediate messages and final answers for previously submitted messages are sent asynchronously back to the client.
Message Queue (Session Manager <> Orchestrator): Once the session manager receives a request, it is enqueued into a task queue. Since there can be multiple orchestrator instances running in the cluster, the task queue ensures that only one instance receives a particular request. The orchestrator then begins the RAG process. As the RAG process continues, the orchestrator sends intermediate messages by publishing them to a message queue. All instances of the session manager subscribe to this message queue, and the instance handling the client relevant to the incoming message forwards it to the client.

Runtime Configuration
The runtime configuration service enhances the architecture's dynamism and flexibility. It enables core services and AI skills to decouple and parameterize various components, such as prompts, search data settings, and operational parameters. These services can easily override default configurations with new versions at runtime, allowing for dynamic behavior adjustments during operation.
Figure 11: Runtime Configuration
Core Services and AI Skills: define unique identifiers for their individual configurations. At runtime, they check whether the payload contains a configuration override. If it does, they attempt to retrieve it from the cache; when it is not present in cache memory (i.e., a first-time fetch), they read it from the configuration service and save it in the cache for future references.
Configuration Service: facilitates Create, Read, and Delete operations for configurations. It validates the incoming config against a Pydantic model and generates a unique version for the configuration upon successful save.
Cosmos DB: persists each new config version.
Redis: a high-availability memory store for storing and quickly retrieving configurations for subsequent queries.

Evaluation Tool
Improving the accuracy of a RAG-based solution is a continuous process: experiment with different changes, run predictions with those changes (running user queries through the bot), evaluate the bot's results against the ground truth, analyze the issues, and repeat.
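A minimal sketch of that loop is shown below, assuming a hypothetical bot endpoint and an Azure OpenAI deployment acting as the judge; the judge prompt here is illustrative and is not the team's calibrated prompt:

```python
# Illustrative predict-then-evaluate loop over a golden dataset.
import json
import requests
from openai import AzureOpenAI

judge = AzureOpenAI(
    azure_endpoint="https://<your-aoai-resource>.openai.azure.com",
    api_key="<aoai-key>",
    api_version="2024-02-01",
)

JUDGE_PROMPT = (
    "Rate how similar the predicted answer is to the ground truth on a 1-5 scale "
    '(5 = perfect match). Return JSON: {"score": <int>, "reason": "<why>"}.'
)

def predict(question: str) -> str:
    # Hypothetical bot endpoint exposed by the session manager / orchestrator.
    resp = requests.post("https://<bot-endpoint>/chat", json={"query": question}, timeout=120)
    return resp.json()["answer"]

def score(question: str, ground_truth: str, predicted: str) -> dict:
    resp = judge.chat.completions.create(
        model="gpt-4o",                           # your judge deployment name
        response_format={"type": "json_object"},  # ask for machine-readable output
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"Question: {question}\n"
                                        f"Ground truth: {ground_truth}\n"
                                        f"Predicted: {predicted}"},
        ],
    )
    return json.loads(resp.choices[0].message.content)

golden = [{"question": "What was the gross profit in 2024?", "answer": "..."}]  # illustrative
scores = [score(g["question"], g["answer"], predict(g["question"])) for g in golden]
print(sum(s["score"] for s in scores) / len(scores))
```

Breaking the scores down by query category and difficulty, as described above, is what turns this raw loop into actionable analysis.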
This iterative process required a consistent way of evaluating the end-to-end results. Initially the team did the evaluation and scoring of the results manually, but as the search index grew (a few thousand financial reports were ingested) and the golden dataset grew, doing it manually became very time-consuming. So the team developed a custom prompt and used an LLM to do the scoring. The prompt was calibrated against the human scores. Once the prompt was stabilized, the Evaluation tool was built to do two things:
For each golden question, call the bot endpoint and generate the prediction (bot answer).
Then take the ground truth and predicted results, run the evaluation on them, and produce metrics.

Implementation Guide
Please refer to the GitHub repo.

Additional Resources
Get started on Azure AI Foundry
Evaluation of generative AI applications
Generate adversarial simulations for safety evaluation
Generate synthetic data and simulate non-adversarial tasks
AI architecture guidance to build AI workloads on Azure
Responsible AI Tools and Practices

Securely Integrating Azure API Management with Azure OpenAI via Application Gateway
Introduction
As organizations increasingly integrate AI into their applications, securing access to Azure OpenAI services becomes a critical priority. By default, Azure OpenAI can be exposed over the public internet, posing potential security risks. To mitigate these risks, enterprises often restrict OpenAI access using Private Endpoints, ensuring that traffic remains within their Azure Virtual Network (VNET) and preventing direct internet exposure.
However, restricting OpenAI to a private endpoint introduces challenges when external applications, such as those hosted in AWS or on-premises environments, need to securely interact with OpenAI APIs. This is where Azure API Management (APIM) plays a crucial role. By deploying APIM within an internal VNET, it acts as a secure proxy between external applications and the OpenAI service, allowing controlled access while keeping OpenAI private.
To further enhance security and accessibility, Azure Application Gateway (App Gateway) can be placed in front of APIM. This setup enables secure, policy-driven access by managing traffic flow, applying Web Application Firewall (WAF) rules, and enforcing SSL termination if needed.

What This Blog Covers
This blog provides a technical deep dive into setting up a fully secure architecture that integrates Azure OpenAI with APIM, Private Endpoints, and Application Gateway. Specifically, we will walk through:
Configuring Azure OpenAI with a Private Endpoint to restrict public access and ensure communication remains within a secure network.
Deploying APIM in an Internal VNET, allowing it to securely communicate with OpenAI while being inaccessible from the public internet.
Setting up Application Gateway to expose APIM securely, allowing controlled external access with enhanced security.
Configuring VNET, Subnets, and Network Security Groups (NSGs) to enforce network segmentation, traffic control, and security best practices.
By the end of this guide, you will have a production-ready, enterprise-grade setup that ensures:
End-to-end private connectivity for Azure OpenAI through APIM.
Secure external access via Application Gateway while keeping OpenAI hidden from the internet.
Granular network control using VNET, Subnets, and NSGs.
This architecture provides a scalable and secure solution for enterprises needing to expose OpenAI securely without compromising privacy, performance, or compliance.

Prerequisites
Before diving into the integration of Azure API Management (APIM) with Azure OpenAI in a secure, private setup, ensure you have the following in place:
1. Azure Subscription & Required Permissions
An active Azure Subscription with the ability to create resources.
Contributor or Owner access to deploy Virtual Networks (VNETs), Subnets, Network Security Groups (NSGs), Private Endpoints, APIM, and Application Gateway.
2. Networking Setup Knowledge
Familiarity with Azure Virtual Network (VNET) concepts, Subnets, and NSGs is helpful, as we will be designing a controlled network environment.
3. Required Azure Services
The following services are needed for this integration:
Azure Virtual Network (VNET) – To establish a private, secure network.
Subnets & NSGs – For network segmentation and traffic control.
Azure OpenAI Service – Deployed in a region that supports private endpoints.
Azure API Management (APIM) – Deployed in an Internal VNET mode to act as a secure API proxy.
Azure Private Endpoint – To restrict Azure OpenAI access to a private network.
Azure Application Gateway – To expose APIM securely with load balancing and optional Web Application Firewall (WAF).
4. Networking and DNS Requirements
Private DNS Zone: Required to resolve private endpoints within the VNET.
Custom DNS Configuration: If using a custom DNS server, ensure proper forwarding rules are in place.
Firewall/NSG Rules: Ensure necessary inbound and outbound rules allow communication between services.
5. Azure CLI or PowerShell (Optional, but Recommended)
Azure CLI (az commands) or Azure PowerShell for efficient resource deployment.
Once you have these prerequisites in place, we can proceed with designing the secure architecture for integrating Azure OpenAI with APIM using Private Endpoints and Application Gateway.

Architecture Overview
The architecture ensures secure and private connectivity between external users and Azure OpenAI while preventing direct public access to OpenAI's APIs. It uses Azure API Management (APIM) in an Internal VNET, an Azure Private Endpoint for OpenAI, and an Application Gateway for controlled public exposure.

Key Components & Flow
User Requests
External users access the API via a public endpoint exposed by Azure Application Gateway. The request passes through App Gateway before reaching APIM, ensuring security and traffic control.
Azure API Management (APIM) – Internal VNET Mode
APIM is deployed in Internal VNET mode, meaning it does not have a public endpoint. APIM serves as a proxy between external applications and Azure OpenAI, ensuring request validation, rate limiting, and security enforcement. The Management Plane of APIM still requires a public IP for admin operations, but the Data Plane (API traffic) remains fully private.
Azure Private Endpoint for OpenAI
APIM cannot access Azure OpenAI publicly since OpenAI is secured with a Private Endpoint. A Private Endpoint allows APIM to securely connect to Azure OpenAI within the same VNET, preventing internet exposure. This ensures that only APIM within the internal network can send requests to OpenAI.
Managed Identity Authentication
APIM uses a Managed Identity to authenticate securely with Azure OpenAI. This eliminates the need for hardcoded API keys and improves security by using Azure Role-Based Access Control (RBAC).
Application Gateway for External Access
Since APIM is not publicly accessible, an Azure Application Gateway (App Gateway) is placed in front of it. App Gateway acts as a reverse proxy that securely exposes APIM to the public while enforcing:
SSL termination for secure HTTPS connections.
Web Application Firewall (WAF) for protection against threats.
Load balancing if multiple APIM instances exist.
Network Segmentation & Security
VNET & Subnets: APIM, OpenAI Private Endpoint, and App Gateway are deployed in separate subnets within an Azure Virtual Network (VNET).
NSGs (Network Security Groups): Strict inbound and outbound rules ensure that only allowed traffic flows between components.
Private DNS: Required to resolve Private Endpoint addresses inside the VNET.
Security Enhancements
No direct internet access to Azure OpenAI, ensuring full privacy.
Controlled API exposure via App Gateway, securing public requests.
Managed Identity for authentication, eliminating hardcoded credentials.
Private Endpoint enforcement, blocking unwanted access from external sources.
This architecture ensures that Azure OpenAI remains secure, APIM acts as a controlled gateway, and external users can access APIs safely through App Gateway.
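For a sense of what a consumer of this architecture ends up calling, the sketch below shows a request flowing through App Gateway to APIM and on to the private OpenAI endpoint. It assumes the Azure OpenAI API has been imported into APIM with its default path and that APIM accepts its subscription key in the api-key header; the hostname, deployment name, and key are placeholders:

```python
# Illustrative client call: App Gateway (public) -> APIM (internal VNET) -> Azure OpenAI (private endpoint).
import requests

APP_GATEWAY_HOST = "https://api.contoso.com"   # placeholder public hostname on App Gateway
DEPLOYMENT = "gpt-4o"                          # placeholder Azure OpenAI deployment name

response = requests.post(
    f"{APP_GATEWAY_HOST}/openai/deployments/{DEPLOYMENT}/chat/completions",
    params={"api-version": "2024-02-01"},
    # Depending on how the API is configured in APIM, the subscription key header may be
    # "api-key" (common for OpenAI-compatible imports) or "Ocp-Apim-Subscription-Key".
    headers={"api-key": "<apim-subscription-key>"},
    json={"messages": [{"role": "user", "content": "Hello"}]},
    timeout=60,
)
print(response.json()["choices"][0]["message"]["content"])
```

Note that the caller never sees the Azure OpenAI key; APIM authenticates to OpenAI with its managed identity, and the client only holds an APIM subscription key.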
Azure CLI Script for VNet, Subnets, and NSG Configuration

# Variables
RESOURCE_GROUP="apim-openai-rg"
LOCATION="eastus"
VNET_NAME="apim-vnet"
VNET_ADDRESS_PREFIX="10.0.0.0/16"

# Subnets
APP_GATEWAY_SUBNET="app-gateway-subnet"
APP_GATEWAY_SUBNET_PREFIX="10.0.1.0/24"
APIM_SUBNET="apim-subnet"
APIM_SUBNET_PREFIX="10.0.2.0/24"
OPENAI_PE_SUBNET="openai-pe-subnet"
OPENAI_PE_SUBNET_PREFIX="10.0.3.0/24"

# NSGs
APP_GATEWAY_NSG="app-gateway-nsg"
APIM_NSG="apim-nsg"
OPENAI_PE_NSG="openai-pe-nsg"

# Step 1: Create Resource Group
az group create --name $RESOURCE_GROUP --location $LOCATION

# Step 2: Create Virtual Network
az network vnet create \
  --resource-group $RESOURCE_GROUP \
  --name $VNET_NAME \
  --address-prefix $VNET_ADDRESS_PREFIX \
  --subnet-name $APP_GATEWAY_SUBNET \
  --subnet-prefix $APP_GATEWAY_SUBNET_PREFIX

# Step 3: Create Additional Subnets (APIM & OpenAI Private Endpoint)
az network vnet subnet create \
  --resource-group $RESOURCE_GROUP \
  --vnet-name $VNET_NAME \
  --name $APIM_SUBNET \
  --address-prefix $APIM_SUBNET_PREFIX

az network vnet subnet create \
  --resource-group $RESOURCE_GROUP \
  --vnet-name $VNET_NAME \
  --name $OPENAI_PE_SUBNET \
  --address-prefix $OPENAI_PE_SUBNET_PREFIX

# Step 4: Create NSGs
az network nsg create --resource-group $RESOURCE_GROUP --name $APP_GATEWAY_NSG
az network nsg create --resource-group $RESOURCE_GROUP --name $APIM_NSG
az network nsg create --resource-group $RESOURCE_GROUP --name $OPENAI_PE_NSG

# Step 5: Add NSG Rules for APIM (Allow 3443 for APIM Internal VNet)
az network nsg rule create \
  --resource-group $RESOURCE_GROUP \
  --nsg-name $APIM_NSG \
  --name AllowAPIMInbound3443 \
  --priority 120 \
  --direction Inbound \
  --access Allow \
  --protocol Tcp \
  --source-address-prefixes ApiManagement \
  --destination-address-prefixes VirtualNetwork \
  --destination-port-ranges 3443

# Step 6: Associate NSGs with Subnets
az network vnet subnet update \
  --resource-group $RESOURCE_GROUP \
  --vnet-name $VNET_NAME \
  --name $APP_GATEWAY_SUBNET \
  --network-security-group $APP_GATEWAY_NSG

az network vnet subnet update \
  --resource-group $RESOURCE_GROUP \
  --vnet-name $VNET_NAME \
  --name $APIM_SUBNET \
  --network-security-group $APIM_NSG

az network vnet subnet update \
  --resource-group $RESOURCE_GROUP \
  --vnet-name $VNET_NAME \
  --name $OPENAI_PE_SUBNET \
  --network-security-group $OPENAI_PE_NSG

# Step 7: Configure Service Endpoints for APIM Subnet
az network vnet subnet update \
  --resource-group $RESOURCE_GROUP \
  --vnet-name $VNET_NAME \
  --name $APIM_SUBNET \
  --service-endpoints Microsoft.EventHub Microsoft.KeyVault Microsoft.ServiceBus Microsoft.Sql Microsoft.Storage Microsoft.AzureActiveDirectory Microsoft.CognitiveServices Microsoft.Web

Creating an Azure OpenAI resource with a private endpoint

# Create an Azure OpenAI Resource
az cognitiveservices account create \
  --name $AOAI_NAME \
  --resource-group $RESOURCE_GROUP \
  --kind OpenAI \
  --sku S0 \
  --location $LOCATION \
  --yes \
  --custom-domain $AOAI_NAME

# Create a Private Endpoint
az network private-endpoint create \
  --name $PRIVATE_ENDPOINT_NAME \
  --resource-group $RESOURCE_GROUP \
  --vnet-name $VNET_NAME \
  --subnet $SUBNET_NAME \
  --private-connection-resource-id $(az cognitiveservices account show --name $AOAI_NAME --resource-group $RESOURCE_GROUP --query id -o tsv) \
  --group-id account \
  --connection-name "${PRIVATE_ENDPOINT_NAME}-connection"

# Create a Private DNS Zone
az network private-dns zone create \
  --resource-group $RESOURCE_GROUP \
  --name $PRIVATE_DNS_ZONE_NAME

# Link Private DNS Zone to VNet
az network private-dns link vnet create \
  --resource-group $RESOURCE_GROUP \
  --zone-name $PRIVATE_DNS_ZONE_NAME \
  --name "myDNSLink" \
  --virtual-network $VNET_NAME \
  --registration-enabled false

# Retrieve the Private IP Address from the Private Endpoint
PRIVATE_IP=$(az network private-endpoint show \
  --name $PRIVATE_ENDPOINT_NAME \
  --resource-group $RESOURCE_GROUP \
  --query "customDnsConfigs[0].ipAddresses[0]" -o tsv)

# Create a DNS Record for Azure OpenAI
az network private-dns record-set a add-record \
  --resource-group $RESOURCE_GROUP \
  --zone-name $PRIVATE_DNS_ZONE_NAME \
  --record-set-name $AOAI_NAME \
  --ipv4-address $PRIVATE_IP

# Disable Public Network Access
az cognitiveservices account update \
  --name $AOAI_NAME \
  --resource-group $RESOURCE_GROUP \
  --public-network-access Disabled

Provisioning the Azure APIM instance to an internal VNet
Please follow the link to provision: Deploy Azure API Management instance to internal VNet | Microsoft Learn

Create an API for AOAI in APIM
Please follow the link: Import an Azure OpenAI API as REST API - Azure API Management | Microsoft Learn

Configure Azure Application Gateway with Azure APIM
Please follow the link: Use API Management in a virtual network with Azure Application Gateway - Azure API Management | Microsoft Learn

Conclusion
Securing Azure OpenAI with private endpoints, APIM, and Application Gateway ensures a robust, enterprise-grade architecture that balances security, accessibility, and performance. By leveraging private endpoints, Azure OpenAI remains shielded from public exposure, while APIM acts as a controlled gateway for managing external API access. The addition of Application Gateway provides an extra security layer with SSL termination, WAF protection, and traffic management.
With this setup, organizations can:
✔ Ensure end-to-end private connectivity for Azure OpenAI.
✔ Enable secure external access via APIM and Application Gateway.
✔ Enforce strict network segmentation with VNETs, Subnets, NSGs, and Private DNS.
✔ Strengthen security with Managed Identity authentication and controlled API exposure.
By following this guide, you now have a scalable, production-ready solution to securely integrate Azure OpenAI with external applications, whether they reside in AWS, on-premises, or other cloud environments. Implement these best practices to maintain compliance, minimize security risks, and enhance the reliability of your AI-powered applications.

Azure AI Foundry, GitHub Copilot, Fabric and more to Analyze usage stats from Utility Invoices
Overview
With the introduction of Azure AI Foundry, integrating various AI services to streamline the development and deployment of agentic AI workflow solutions (multi-modal, multi-model, dynamic and interactive agents, and so on) has become more efficient. The platform offers a range of AI services, including Document Intelligence for extracting data from documents, natural language processing, robust machine learning capabilities, and more. Microsoft Fabric further enhances this ecosystem by providing robust data storage, analytics, and data science tools, enabling seamless data management and analysis. Additionally, Copilot and GitHub Copilot assist developers by offering AI-powered code suggestions and automating repetitive coding tasks, significantly boosting productivity and efficiency.

Objectives
In this use case, we will take a year's worth of monthly electricity bills from the utility's website and analyze them using Azure AI services within Azure AI Foundry. Electricity bills are simply an easy starting point; the same approach could be applied to other formats such as W-2, I-9, 1099, ISO, or EHR documents. By leveraging the Foundry's workflow capabilities, we will streamline the development stages step by step. Initially, we will use Document Intelligence to extract key data such as usage in kilowatt-hours (kWh), billed consumption, and other necessary information from each PDF file. This data will then be stored in Microsoft Fabric, where we will utilize its analytics and data science capabilities to process and analyze the information. We will also add a few processing steps in Azure Functions, built with the help of GitHub Copilot in VS Code. Finally, we will create a Power BI dashboard in Fabric to visually display the analysis, providing insights into electricity usage trends and billing patterns over the year.

Utility Invoice sample

Building the solution
Depicted in the picture are the key Azure and Copilot services we will use to build the solution.

Set up Azure AI Foundry
Create a new project in Azure AI Foundry. Add Document Intelligence to your project. You can do this directly within the Foundry portal.

Extract documents through Doc Intel
Download the PDF files of the power bills and upload them to Azure Blob storage. I used Document Intelligence Studio to create a new project and train custom models using the files from the Blob storage. Next, in your Azure AI Foundry project, add the Document Intelligence resource by providing the Endpoint URL and Keys.

Data Extraction
Use Azure Document Intelligence to extract the required information from the PDF files. From the resource page of the Document Intelligence service in the portal, copy the Endpoint URL and Keys; we will need these to connect the application to the Document Intelligence API. Next, integrate Document Intelligence with the project: in the Azure AI Foundry project, add the Document Intelligence resource by providing the Endpoint URL and Keys, and configure the settings as needed to start extracting data from the PDF documents. We can stay within the Azure AI Foundry portal for most of these steps, but for more advanced configurations, we might need to use Document Intelligence Studio.

GitHub Copilot in VS Code for Azure Functions
For processing portions of the output from Document Intelligence, what better way to create the Azure Function than in VS Code, especially with the help of GitHub Copilot. Let's start by installing the Azure Functions extension in VS Code, then create a new function project; a sketch of the kind of extraction code the function will work with follows below.
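As a point of reference for the kind of code involved, here is a hedged sketch of extracting fields from one bill with the Document Intelligence (Form Recognizer) Python SDK; the custom model ID and field names are hypothetical and would match whatever you labeled in Document Intelligence Studio:

```python
# Illustrative extraction of utility-bill fields with a trained custom model.
from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient

client = DocumentAnalysisClient(
    endpoint="https://<your-doc-intel-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<doc-intel-key>"),
)

with open("power_bill_jan.pdf", "rb") as f:
    poller = client.begin_analyze_document("utility-bill-model", document=f)  # hypothetical model ID
result = poller.result()

for doc in result.documents:
    usage = doc.fields.get("UsageKWh")            # hypothetical field names
    billed = doc.fields.get("BilledAmount")
    print(usage.value if usage else None, billed.value if billed else None)
```

The Azure Function would run this kind of extraction (or receive its JSON output) and pass the values on for storage in Fabric, which is where GitHub Copilot's suggestions come in handy.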
GitHub Copilot can assist in writing the code to process the JSON received. Additionally, we can get Copilot to help generate unit tests to ensure the function works correctly, and we can ask Copilot to explain the code and the tests it generates. Finally, we seamlessly integrate the generated code and unit tests into the Functions app code file, all within VS Code. Notice how we can prompt GitHub Copilot all the way from step 1 of creating the workspace, to inserting the generated code into the Python file for the Azure Function, to testing it, and finally to deploying the Function.

Store and Analyze information in Fabric
There are many options for storing and analyzing JSON data in Fabric: Lakehouse, Data Warehouse, SQL Database, and Power BI Datamart. As our dataset is small, let's choose either SQL DB or PBI Datamart. PBI Datamart is great for smaller datasets and direct integration with PBI for dashboarding, while SQL DB is good for moderate data volumes and supports transactional and analytical workloads. To insert the JSON values produced by the Azure Functions app (whether called from Logic Apps or directly from AI Foundry through API calls) into Fabric, let's explore two approaches: using the REST API, and using Functions with Azure SQL DB.
Using REST API – Fabric provides APIs that we can call directly from our Function; an HTTP client in the Function's Python code sends POST requests to the Fabric API endpoints with our JSON data.
Using Functions with Azure SQL DB – we can connect to the database directly from our Function, using a SQL client in the Function to execute SQL INSERT statements that add records to the database.
While we are at it, we could even get GitHub Copilot to write up the unit tests. Here's a sample:

Visualization in Fabric Power BI
Let's start with creating visualizations in Fabric using the web version of Power BI for our report, UtilitiesBillAnalysisDashboard. You could use the PBI Desktop version too. Open the PBI Service and navigate to the workspace where you want to create your report. Click on "New" and select "Dataset" to add a new data source. Choose "SQL Server" from the list of data sources and enter "UtilityBillsServer" as the server name and "UtilityBillsDB" as the DB name to establish the connection. Once connected, navigate to the Navigator pane where we can select the table "tblElectricity" and the columns. I've shown these in the pictures below. For a clustered column (or bar) chart, choose the columns that contain our categorical data (e.g., month, year) and numerical data (e.g., kWh usage, billed amounts). After loading the data into PBI, drag the desired fields into the Values and Axis areas of the clustered column chart visualization. Customize the chart by adjusting the formatting options to enhance readability and insights. We now visualize our data in PBI within Fabric. We may need to apply a custom sort to the Month column; let's do this in the Data view. Select the table and create a new column with the following formula. This will create a custom sort column that we will use as 'Sum of MonthNumber' in ascending order.
Other visualization possibilities:

Other Possibilities
Agents with Custom Copilot Studio
Next, you could leverage a custom Copilot to provide personalized energy usage recommendations based on historical data. Start by integrating the Copilot with your existing data pipeline in Azure AI Foundry.
The Copilot can analyze electricity consumption patterns stored in your Fabric SQL DB and use ML models to identify optimization opportunities. For instance, it could suggest energy-efficient appliances, optimal usage times, or tips to reduce consumption. These recommendations can be visualized in PBI where users can track progress over time. To implement this, you would need to set up an API endpoint for the Copilot to access the data, train the ML models using Python in VS Code (let GitHub Copilot help you here… you will love it), and deploy the models to Azure using CLI / PowerShell / Bicep / Terraform / ARM or the Azure portal. Finally, connect the Copilot to PBI to visualize the personalized recommendations.
Additionally, you could explore using Azure AI Agents for automated anomaly detection and alerts. This agent could monitor electricity bill data for unusual patterns and send notifications when anomalies are detected. Yet another idea would be to implement predictive maintenance for electrical systems, where an AI agent uses predictive analytics to forecast maintenance needs based on the data collected, helping to reduce downtime and improve system reliability.

Summary
We have built a solution that leveraged the seamless integration of pioneering AI technologies with Microsoft's end-to-end platform. By leveraging Azure AI Foundry, we have developed a solution that uses Document Intelligence to scan electricity bills, stores the data in Fabric SQL DB, and processes it with Python in Azure Functions in VS Code, assisted by GitHub Copilot. The resulting insights are visualized in Power BI within Fabric. Additionally, we explored potential enhancements using Azure AI Agents and Custom Copilots, showcasing the ease of implementation and the transformative possibilities. Finally, speaking of possibilities – With Gen AI, the only limit is our imagination!

Additional resources
Explore Azure AI Foundry
Start using the Azure AI Foundry SDK
Review the Azure AI Foundry documentation and Call Azure Logic Apps as functions using Azure OpenAI Assistants
Take the Azure AI Learn courses
Learn more about Azure AI Services
Document Intelligence: Azure AI Doc Intel
GitHub Copilot examples: What can GitHub Copilot do – Examples
Explore Microsoft Fabric: Microsoft Fabric Documentation
See what you can connect with Azure Logic Apps: Azure Logic Apps Connectors

About the Author
Pradyumna (Prad) Harish is a Technology leader in the GSI Partner Organization at Microsoft. He has 26 years of experience in Product Engineering, Partner Development, Presales, and Delivery. Responsible for revenue growth through Cloud, AI, Cognitive Services, ML, Data & Analytics, Integration, DevOps, Open Source Software, Enterprise Architecture, IoT, Digital strategies and other innovative areas for business generation and transformation; achieving revenue targets via extensive experience in managing global functions, global accounts, products, and solution architects across over 26 countries.