
Context-Aware RAG System with Azure AI Search to Cut Token Costs and Boost Accuracy

Shikhaghildiyal
Oct 23, 2025

Discover how to optimize every token and maximize model performance with this hands-on guide. From mastering context-aware chunking to integrating Azure AI Search and implementing intelligent cost-saving strategies — you’ll learn practical techniques to make your AI faster, leaner, and more efficient. Whether you're building your first prototype or fine-tuning an enterprise-grade system, this guide equips you to unlock the true power of AI with precision and scalability.

🚀 Introduction

As AI copilots and assistants become integral to enterprises, one question dominates architecture discussions:

“How can we make large language models (LLMs) provide accurate, source-grounded answers — without blowing up token costs?”

Retrieval-Augmented Generation (RAG) is the industry’s go-to strategy for this challenge. But traditional RAG pipelines often use static document chunking, which breaks semantic context and drives inefficiencies.

To address this, we built a context-aware, cost-optimized RAG pipeline using Azure AI Search and Azure OpenAI, leveraging AI-driven semantic chunking and intelligent retrieval.
The result: accurate answers with up to 85% lower token consumption.

This blog focuses on two building blocks:

  1. Tokenization
  2. Chunking

The Problem with Naive Chunking

Most RAG systems split documents by token or character count (e.g., every 1,000 tokens).
This is easy to implement but introduces real-world problems:

  • 🧩 Loss of context — sentences or concepts get split mid-idea.
  • ⚙️ Retrieval noise — irrelevant fragments appear in top results.
  • 💸 Higher cost — you often send 5× more text than necessary.

These issues degrade both accuracy and cost efficiency.
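To make the problem concrete, here is a minimal sketch of a naive fixed-size splitter (illustrative only, not part of the pipeline described later). It cuts wherever the size limit lands, often mid-sentence:

//Naive fixed-size chunking (illustrative sketch)
public static List<string> NaiveChunks(string text, int chunkSize = 4000)
{
    var chunks = new List<string>();
    for (int start = 0; start < text.Length; start += chunkSize)
    {
        int length = Math.Min(chunkSize, text.Length - start);
        // The boundary falls wherever the limit lands, often splitting a concept in half
        chunks.Add(text.Substring(start, length));
    }
    return chunks;
}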

🧠 Context-Aware Chunking: Smarter Document Segmentation

Instead of breaking text arbitrarily, our system uses an LLM-powered preprocessor to identify semantic boundaries — meaning each chunk represents a complete and coherent concept.

Example

Naive chunking:

“Azure OpenAI Service offers… [cut] …integrates with Azure AI Search for intelligent retrieval.”

Context-aware chunking:

“Azure OpenAI Service provides access to models like GPT-4o, enabling developers to integrate advanced natural language understanding and generation into their applications. It can be paired with Azure AI Search for efficient, context-aware information retrieval.”

✅ The chunk is self-contained and semantically meaningful.

This allows the retriever to match queries against conceptually complete information rather than partial sentences, leading to higher precision and fewer chunks needed per query.

 

Architecture Diagram

 

Chunking Service:

Purpose: Transforms messy enterprise data (wikis, PDFs, transcripts, repos, images) into structured, model-friendly chunks for Retrieval-Augmented Generation (RAG).

| Challenge | Chunking Fix |
|---|---|
| LLM context limits | Breaks docs into smaller pieces |
| Embedding size | Keeps within token bounds |
| Retrieval accuracy | Granular, relevant sections only |
| Noise | Removes irrelevant blocks |
| Traceability | Chunk IDs for auditability |
| Cost/latency | Re-embed only changed chunks |

The Chunking Flow (End-to-End)

The Chunking Service sits in the ingestion pipeline and follows this sequence:

  1. Ingestion: Raw text arrives from sources (wiki, repo, transcript, PDF, image description).
  2. Token-aware splitting: Large text is cut into manageable pre-chunks with a 100-token overlap, ensuring no semantic drift across boundaries (see the sketch after this list).
  3. Semantic segmentation: Each pre-chunk is passed to an Azure OpenAI Chat model with a structured prompt.
    • Output = JSON array of semantic chunks (sectiontitle, speaker, content).
  4. Optional overlap injection: Character-level overlap can be applied across chunks for discourse-heavy text like meeting transcripts.
  5. Embedding generation: Each chunk is passed to Azure OpenAI Embeddings API (text-embedding-3-small), producing a 1536-dimension vector.
  6. Indexing: Chunks (text + vectors) are uploaded to Azure AI Search.
  7. Retrieval: During question answering or document generation, the system pulls top-k chunks, concatenates them, and enriches the prompt for the LLM.
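A minimal sketch of the token-aware pre-splitting in step 2 is shown below. It assumes the Microsoft.ML.Tokenizers package for token counting; the actual splitter used by the service is not shown in this post.

//Token-aware pre-splitting with a 100-token overlap (illustrative sketch)
//Assumes: Microsoft.ML.Tokenizers (TiktokenTokenizer)
public static List<string> SplitWithOverlap(string text, int maxTokens = 2000, int overlapTokens = 100)
{
    var tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");
    var ids = tokenizer.EncodeToIds(text);
    var preChunks = new List<string>();

    // Each window starts 100 tokens before the previous one ended,
    // so no sentence is lost across a boundary.
    int step = maxTokens - overlapTokens;
    for (int start = 0; start < ids.Count; start += step)
    {
        var window = ids.Skip(start).Take(maxTokens).ToList();
        preChunks.Add(tokenizer.Decode(window));
        if (start + maxTokens >= ids.Count) break;
    }
    return preChunks;
}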

Resilience & Traceability

The service is built to handle real-world pipeline issues. It retries once on rate limits, validates JSON outputs, and fails fast on malformed data instead of silently dropping chunks. Each chunk is assigned a unique ID (chunk_<sequence>_<sourceTag>), making retrieval auditable and enabling selective re-embedding when only parts of a document change.
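The retry behaviour could look something like the sketch below (an assumption, since the post does not show the service's actual error-handling code). It retries a chunking call once on a 429 rate-limit response and otherwise fails fast:

//Retry-once-on-rate-limit wrapper (illustrative sketch, not the actual service code)
private async Task<List<SemanticChunk>> ChunkWithRetryAsync(string text)
{
    try
    {
        return await AzureOpenAIChunk(text);
    }
    catch (ClientResultException ex) when (ex.Status == 429)
    {
        _logger.LogWarning("Azure OpenAI rate limit hit; retrying once.");
        await Task.Delay(TimeSpan.FromSeconds(5));
        return await AzureOpenAIChunk(text); // a second failure propagates (fail fast)
    }
}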

 

☁️ Why Azure AI Search Matters Here

Azure AI Search (formerly Cognitive Search) is the heart of the retrieval pipeline.

Key Roles:

  1. Vector Search Engine:
    Stores embeddings of chunks and performs semantic similarity search.
  2. Hybrid Search (Keyword + Vector):
    Combines lexical and semantic matching for high precision and recall.
  3. Scalability:
    Supports millions of chunks while keeping search latency low.
  4. Metadata Filtering:
    Enables fine-grained retrieval (e.g., by document type, author, section).
  5. Native Integration with Azure OpenAI:
    Allows a seamless, end-to-end RAG pipeline without third-party dependencies.

In short, Azure AI Search provides the speed, scalability, and semantic intelligence to make your RAG pipeline enterprise-grade.
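As a rough illustration, a hybrid query against the chunk index might look like the sketch below (assuming the Azure.Search.Documents SDK, an Embedding vector field, and a filterable SectionTitle field; names follow the code examples later in this post):

//Hybrid (keyword + vector) query with a metadata filter (illustrative sketch)
public async Task<List<SemanticChunk>> HybridSearchAsync(SearchClient searchClient, string query, ReadOnlyMemory<float> queryVector)
{
    var options = new SearchOptions
    {
        Size = 3, // top-k chunks only, to keep the prompt small
        Filter = "SectionTitle ne null", // example metadata filter
        VectorSearch = new VectorSearchOptions
        {
            Queries =
            {
                new VectorizedQuery(queryVector)
                {
                    KNearestNeighborsCount = 3,
                    Fields = { "Embedding" }
                }
            }
        }
    };

    // Passing both the text query and a vector query triggers hybrid ranking
    var response = await searchClient.SearchAsync<SemanticChunk>(query, options);

    var results = new List<SemanticChunk>();
    await foreach (var result in response.Value.GetResultsAsync())
    {
        results.Add(result.Document);
    }
    return results;
}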

 

💡 Importance of Azure OpenAI

Azure OpenAI complements Azure AI Search by providing:

  • High-quality embeddings (text-embedding-3-small or text-embedding-3-large) for accurate vector search.
  • Powerful generative reasoning (GPT-4o or GPT-4.1) to craft contextually relevant answers.
  • Security and compliance within your organization’s Azure boundary — critical for regulated environments.

Together, these two services form the retrieval (Azure AI Search) and generation (Azure OpenAI) halves of your RAG system.

 

💰 Token Efficiency 

By limiting the model’s input to only the most relevant, semantically meaningful chunks, you drastically reduce prompt size and cost.

| Approach | Tokens per Query | Typical Cost | Accuracy |
|---|---|---|---|
| Full-document prompt | ~15,000–20,000 | Very high | Medium |
| Fixed-size RAG chunks | ~5,000–8,000 | Moderate | Medium-high |
| Context-aware RAG (this approach) | ~2,000–3,000 | Low | High |

💰 Token Cost Reduction Analysis

Let’s quantify it:

| Step | Naive Approach (no RAG) | Your Approach (Context-Aware RAG) |
|---|---|---|
| Prompt context size | Entire document (e.g., 15,000 tokens) | Top 3 chunks (e.g., 2,000 tokens) |
| Tokens per query | ~16,000 (incl. user + system) | ~2,500 |
| Cost reduction | | ~84% reduction in token usage |
| Accuracy | Often low (hallucinations) | Higher (targeted retrieval) |

That’s roughly an 80–85% reduction in token usage while improving both accuracy and response speed.
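To put that in currency terms with an illustrative price (actual Azure OpenAI rates vary by model and region): at $0.005 per 1,000 input tokens, a 16,000-token prompt costs about $0.08 per query, while a 2,500-token prompt costs about $0.0125, the same ~84% saving applied to every query your assistant serves.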

 

🧱 Tech Stack Overview

| Component | Service | Purpose |
|---|---|---|
| Chunking Engine | Azure OpenAI (GPT models) | Generate context-aware chunks |
| Embedding Model | Azure OpenAI Embedding API | Create high-dimensional vectors |
| Retriever | Azure AI Search | Perform hybrid and vector search |
| Generator | Azure OpenAI GPT-4o | Produce final answer |
| Orchestration Layer | Python / FastAPI / .NET (C#) | Handle RAG pipeline |

🔍 The Bottom Line

By adopting context-aware chunking and Azure AI Search-powered RAG, you achieve:

  • Higher accuracy (contextually complete retrievals)
  • 💸 Lower cost (token-efficient prompts)
  • Faster latency (smaller context per call)
  • 🧩 Scalable and secure architecture (fully Azure-native)

This is the same design philosophy powering Microsoft Copilot and other enterprise AI assistants today.

 

🧪 Real-Life Example: Context-Aware RAG in Action

To bring this architecture to life, let’s walk through a simple example of how documents can be chunked, embedded, stored in Azure AI Search, and then queried to generate accurate, cost-efficient answers.

Imagine you want to build an internal knowledge assistant that answers developer questions from your company’s Azure documentation.

⚙️ Step 1: Intelligent Document Chunking

We’ll use a small LLM call to segment text into context-aware chunks — rather than fixed token counts.

//Context Aware Chunking
//text can be your retrieved text from any page/ document
private async Task<List<SemanticChunk>> AzureOpenAIChunk(string text)
{
    try
    {
            string prompt = $@"
            Divide the following text into logical, meaningful chunks. 
            Each chunk should represent a coherent section, topic, or idea. 
            Return the result as a JSON array, where each object contains:
            - sectiontitle
            - speaker (if applicable, otherwise leave empty)
            - content

            Do not add any extra commentary or explanation. 
            Only output the JSON array. Keep each content value as a single string, not an array.
            TEXT:
            {text}";
        var client = GetAzureOpenAIClient();

        var chatCompletionsOptions = new ChatCompletionOptions
        {
            Temperature = 0,
            FrequencyPenalty = 0,
            PresencePenalty = 0
        };

        var Messages = new List<OpenAI.Chat.ChatMessage>
            {
                new SystemChatMessage("You are a text processing assistant."),
                new UserChatMessage(prompt)
            };

        var chatClient = client.GetChatClient(
            deploymentName: _appSettings.Agent.Model);

        var response = await chatClient.CompleteChatAsync(Messages, chatCompletionsOptions);
        string responseText = response.Value.Content[0].Text.ToString();
        string cleaned = Regex.Replace(responseText, @"```[\s\S]*?```", match =>
        {
            var match1 = match.Value.Replace("```json", "").Trim();
            return match1.Replace("```", "").Trim();
        });
        // Try to parse the response as JSON array of chunks
        return CreateChunkArray(cleaned);
    }
    catch (JsonException ex)
    {
        _logger.LogError("Failed to parse GPT response: " + ex.Message);
        throw;
    }
    catch (Exception ex)
    {
        _logger.LogError("Error in AzureOpenAIChunk: " + ex.Message);
        throw;
    }
}
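The snippets in this post reference a SemanticChunk model and a CreateChunkArray helper that are not shown. A minimal sketch of what they could look like (property and method shapes here are assumptions) is:

//Illustrative SemanticChunk model and JSON parsing helper (shapes assumed, not shown in the original post)
public class SemanticChunk
{
    public string Id { get; set; }
    public string SectionTitle { get; set; }
    public string Speaker { get; set; }
    public string Content { get; set; }
    public ReadOnlyMemory<float> Embedding { get; set; }
}

private List<SemanticChunk> CreateChunkArray(string json)
{
    // Parse the cleaned JSON array returned by the chat model into chunk objects
    var options = new System.Text.Json.JsonSerializerOptions { PropertyNameCaseInsensitive = true };
    var chunks = System.Text.Json.JsonSerializer.Deserialize<List<SemanticChunk>>(json, options);
    if (chunks == null || chunks.Count == 0)
    {
        throw new JsonException("Chunking response was empty or not a valid JSON array.");
    }
    return chunks;
}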

🧠 Step 2: Adding Overlap for Better Results

We add overlap between adjacent chunks so answers stay accurate across chunk boundaries. The overlap window can be tuned to the document type.

 public List<SemanticChunk> AddOverlap(List<SemanticChunk> chunks, string IDText, int overlapChars = 0)
 {
     var overlappedChunks = new List<SemanticChunk>();

     for (int i = 0; i < chunks.Count; i++)
     {
         var current = chunks[i];

         string previousOverlap = i > 0
             ? chunks[i - 1].Content[^Math.Min(overlapChars, chunks[i - 1].Content.Length)..]
             : "";

         string combinedText = previousOverlap + "\n" + current.Content;

         var Id = $"chunk_{i}_{IDText}"; // produces IDs like chunk_0_<sourceTag>
         overlappedChunks.Add(new SemanticChunk
         {
             Id = Regex.Replace(Id, @"[^A-Za-z0-9_\-=]", "_"), 
             Content = combinedText,
             SectionTitle = current.SectionTitle
         });
     }

     return overlappedChunks;
 }
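For example (the values here are illustrative, not recommendations from the original post), a meeting transcript might use a couple of hundred characters of overlap:

//Illustrative usage of the chunking + overlap steps
var semanticChunks = await AzureOpenAIChunk(transcriptText);
var overlappedChunks = AddOverlap(semanticChunks, "meeting_transcript", overlapChars: 200);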

🧠 Step 3: Generate and Store Embeddings in Azure AI Search

We convert each chunk into an embedding vector and push it to an Azure AI Search index.

 public async Task<List<SemanticChunk>> AddEmbeddings(List<SemanticChunk> chunks)
 {
     var client = GetAzureOpenAIClient();
     var embeddingClient = client.GetEmbeddingClient("text-embedding-3-small");

     foreach (var chunk in chunks)
     {
         // Generate embedding using the EmbeddingClient
         var embeddingResult = await embeddingClient.GenerateEmbeddingAsync(chunk.Content).ConfigureAwait(false);
         chunk.Embedding = embeddingResult.Value.ToFloats();
     }
     return chunks;
 }
public async Task UploadDocsAsync(List<SemanticChunk> chunks)
{
    try
    {
        var indexClient = GetSearchindexClient();
        var searchClient = indexClient.GetSearchClient(_indexName);
        var result = await searchClient.UploadDocumentsAsync(chunks);
    }
    catch (Exception ex)
    {
        _logger.LogError("Failed to upload documents: " + ex);
        throw;
    }
}
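UploadDocsAsync assumes the target index already exists with a vector field sized for text-embedding-3-small (1536 dimensions). Below is a minimal sketch of such an index definition using the Azure.Search.Documents SDK; the index, profile, and algorithm names are illustrative.

//Illustrative index definition with a 1536-dimension vector field (names are examples only)
public async Task EnsureIndexAsync(SearchIndexClient indexClient, string indexName)
{
    var index = new SearchIndex(indexName)
    {
        Fields =
        {
            new SimpleField("Id", SearchFieldDataType.String) { IsKey = true, IsFilterable = true },
            new SearchableField("SectionTitle") { IsFilterable = true },
            new SearchableField("Content"),
            new SearchField("Embedding", SearchFieldDataType.Collection(SearchFieldDataType.Single))
            {
                IsSearchable = true,
                VectorSearchDimensions = 1536, // matches text-embedding-3-small
                VectorSearchProfileName = "chunk-profile"
            }
        },
        VectorSearch = new VectorSearch
        {
            Profiles = { new VectorSearchProfile("chunk-profile", "chunk-hnsw") },
            Algorithms = { new HnswAlgorithmConfiguration("chunk-hnsw") }
        }
    };

    await indexClient.CreateOrUpdateIndexAsync(index);
}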

🤖 Step 4: Generate the Final Answer with Azure OpenAI

Now we combine the top chunks with the user query to create a cost-efficient, context-rich prompt.

Note: This example uses a Semantic Kernel agent; any agent framework can be substituted, and the prompt can be adapted.

var context = await _aiSearchService.GetSemanticSearchresultsAsync(UserQuery); // Gets chunks from Azure AI Search
//here UserQuery is query asked by user/any question prompt which need to be answered.
string questionWithContext = $@"Answer the question briefly in short relevant words based on the context provided.
                              Context : {context}.

                              Question : {UserQuery}?";
var  _agentModel = new AgentModel()
 {
     Model = _appSettings.Agent.Model,
     AgentName = "Answering_Agent",
     Temperature = _appSettings.Agent.Temperature,
     TopP = _appSettings.Agent.TopP,
     AgentInstructions = $@"You are a cloud Migration Architect. " +
                        "Analyze all the details in the context from top to bottom, based on the details provided for the migration of the APP app using Azure services. Do not assume anything. " +
                        "There can be conflicting details for a question, so verify all details of the context. If there is any conflict, start your answer with the word **Conflict**. " +
                        "There might not be an answer for every question; verify all details of the context. If there is no answer for a question, just mention **No Information**"
 };
_agentModel = await _agentService.CreateAgentAsync(_agentModel);
_agentModel.QuestionWithContext = questionWithContext;
var modelWithResponse = await _agentService.GetAnswerAsync(_agentModel);

 

🧠 Final Thoughts

Context-aware RAG isn’t just a performance optimization — it’s an architectural evolution.
It shifts the focus from feeding LLMs more data to feeding them the right data.

By letting Azure AI Search handle intelligent retrieval and Azure OpenAI handle reasoning, you create an efficient, explainable, and scalable AI assistant.

The outcome:

Smarter answers, lower costs, and a pipeline that scales with your enterprise.

Wiki Link: Tokenization and Chunking
IP Link: AI Migration Accelerator

 

 
