Educator Developer Blog

Building Enterprise-Grade Local RAG Applications with Semantic Kernel and Foundry Local

kinfey
Microsoft
Aug 25, 2025

In today's AI-driven landscape, organizations are increasingly looking for ways to leverage large language models while maintaining data privacy and reducing operational costs. Microsoft Foundry Local is a free tool that lets developers run generative‑AI models entirely on‑device—no Azure subscription, no internet connection, and no data leaving your laptop or desktop. Combined with Microsoft's Semantic Kernel framework, this creates a powerful platform for building local Retrieval-Augmented Generation (RAG) applications.

This blog will walk you through architecting and implementing a production-ready RAG solution using these technologies, focusing on practical implementation patterns and architectural considerations.

Understanding the Technology Stack

Foundry Local: Edge AI with ONNX Runtime

Foundry Local enables efficient, secure, and scalable AI model inference directly on your devices. Built on top of ONNX Runtime, it provides several key advantages for enterprise applications:

  • Hardware Abstraction: Foundry Local optimizes performance for the available hardware by supporting multiple execution providers, including NVIDIA CUDA, AMD, Qualcomm, and Intel.
  • Model Flexibility: Because the local gateway implements the same /v1/chat/completions routes as the OpenAI API, you can point existing Python or JavaScript clients at base_url=manager.endpoint and they just work (see the short example after this list).
  • Privacy-First Architecture: All processing occurs locally, ensuring sensitive data never leaves your environment
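
Because the endpoints are OpenAI-compatible, any HTTP client can call the local gateway directly. Here is a minimal C# sketch, assuming the service is listening on port 5273 (the port can vary per installation, so check the endpoint your local service reports) and that the qwen2.5-0.5b model variant from later in this post is loaded:

using System.Net.Http.Json;
using System.Text.Json;

// Minimal sketch: call Foundry Local's OpenAI-compatible chat endpoint directly.
// Assumes the service listens on http://localhost:5273 (verify your local endpoint).
using var http = new HttpClient { BaseAddress = new Uri("http://localhost:5273/v1/") };

var response = await http.PostAsJsonAsync("chat/completions", new
{
    model = "qwen2.5-0.5b-instruct-generic-cpu",
    messages = new[] { new { role = "user", content = "Say hello from Foundry Local." } }
});

response.EnsureSuccessStatusCode();
using var doc = JsonDocument.Parse(await response.Content.ReadAsStringAsync());
Console.WriteLine(doc.RootElement
    .GetProperty("choices")[0]
    .GetProperty("message")
    .GetProperty("content")
    .GetString());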

Semantic Kernel: The AI Orchestration Layer

Semantic Kernel is a lightweight, open-source development kit that lets you easily build AI agents and integrate the latest AI models into your C#, Python, or Java codebase. It serves as the middleware between your application logic and AI models, providing:

  • Model Agnostic Design: Switch between different LLMs with minimal code changes
  • Plugin Architecture: Extend functionality with custom functions and tools (a minimal plugin sketch follows this list)
  • Memory Management: Built-in support for semantic memory and vector stores
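
As a quick illustration of the plugin model, here is a minimal sketch of a custom plugin; the class and function names are hypothetical:

using System.ComponentModel;
using Microsoft.SemanticKernel;

// Hypothetical plugin: exposes a native C# method to the kernel as a callable tool
public class TimePlugin
{
    [KernelFunction, Description("Returns the current local time.")]
    public string GetCurrentTime() => DateTime.Now.ToString("HH:mm:ss");
}

// Registration on the kernel builder:
var builder = Kernel.CreateBuilder();
builder.Plugins.AddFromType<TimePlugin>();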

RAG Architecture Overview

The integration of Foundry Local with Semantic Kernel creates a robust local RAG architecture that balances performance, privacy, and scalability: documents are chunked and embedded on-device, the embeddings are stored in a local Qdrant instance, and at query time the most relevant chunks are retrieved and passed to a locally hosted model to ground its response.

This architecture ensures that all components—from document processing to response generation—operate entirely within your local environment.

Implementation Guide

Prerequisites and Environment Setup

Before implementing the RAG solution, ensure your development environment meets these requirements:

System Requirements:

  • .NET 8.0 or later
  • Docker (to run the Qdrant vector database)
  • Foundry Local installed
  • Visual Studio Code with .NET Extension Pack

Step 1: Setting Up Foundry Local

First, install and configure Foundry Local with your chosen model:

# Install Foundry Local (Windows/macOS)
# Windows: Download installer from GitHub releases
# macOS: brew install foundrylocal

# List available models
foundry model list

# Download a suitable model (e.g., Qwen2.5-0.5B-Instruct)
foundry model download qwen2.5-0.5b-instruct-generic-cpu

# Start the service
foundry service start 

Foundry Local downloads the model variant that best matches your system's hardware and software configuration. For example, if you have an NVIDIA GPU, it downloads the CUDA version of the model.

Step 2: Configuring the Vector Store with Qdrant

Deploy Qdrant as your vector database using Docker:

# Start Qdrant container 
docker run -p 6333:6333 -p 6334:6334 \
  -e QDRANT__SERVICE__HTTP_PORT="6333" \
  -e QDRANT__SERVICE__GRPC_PORT="6334" \
  qdrant/qdrant
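
Before wiring up the application, you can optionally confirm the container is reachable. This minimal C# sketch just lists collections over Qdrant's REST API on the HTTP port mapped above:

// Quick connectivity check against Qdrant's REST API (HTTP port 6333)
using var http = new HttpClient();
var response = await http.GetAsync("http://localhost:6333/collections");
Console.WriteLine($"Qdrant reachable: {response.IsSuccessStatusCode}");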

Step 3: Configuring Semantic Kernel with Local Models

One of the key challenges in building a fully local RAG solution is handling embeddings. Most examples published in recent months implement RAG by calling cloud services such as OpenAI or Azure AI Search, which is not suitable for use cases where data must remain local.

The solution involves configuring Semantic Kernel to use both Foundry Local for chat completion and ONNX-based embedding models for text vectorization:

using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Connectors.Onnx;
using Microsoft.SemanticKernel.Memory;

// Initialize the Semantic Kernel builder for local AI orchestration
var builder = Kernel.CreateBuilder();

// Configure Foundry Local chat completion service
// This connects to the local Foundry service running on port 5273
// The service provides OpenAI-compatible API endpoints for seamless integration
builder.AddOpenAIChatCompletion(
    modelId: "qwen2.5-0.5b-instruct-generic-gpu",    // Must match the model variant Foundry Local downloaded (CPU or GPU)
    endpoint: new Uri("http://localhost:5273/v1"),     // Local Foundry endpoint
    apiKey: "",                                        // No API key needed for local service
    serviceId: "qwen2.5-0.5b");                      // Service identifier for kernel resolution

// Configure local ONNX embedding model for text vectorization
// These models run entirely offline for privacy-preserving embeddings
var embeddingModelPath = "Your Jinaai jina-embeddings-v2-base-en onnx model path";
var vocabPath = "Your Jinaai jina-embeddings-v2-base-en vocab file path";

// Add BERT-based ONNX embedding generation
// This enables local text-to-vector conversion without cloud dependencies
builder.AddBertOnnxTextEmbeddingGeneration(embeddingModelPath, vocabPath);

// Build the configured kernel instance
var kernel = builder.Build();
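
At the time of writing, the jina-embeddings-v2-base-en model is published with ONNX weights and a vocab.txt on Hugging Face (jinaai/jina-embeddings-v2-base-en); its 768-dimensional embeddings match the default vector size the Qdrant collection is created with in the next step.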


Step 4: Implementing Vector Store Operations

The VectorStoreService class provides a robust interface for managing document embeddings in Qdrant. This service handles collection initialization, vector storage, and similarity search operations that form the backbone of our RAG system:

using Qdrant.Client;
using Qdrant.Client.Grpc;

public class VectorStoreService
{
    private readonly QdrantClient _client;
    private readonly string _collectionName;

    /// <summary>
    /// Initializes a new instance of the VectorStoreService
    /// </summary>
    /// <param name="endpoint">Qdrant server endpoint (e.g., http://localhost:6334)</param>
    /// <param name="apiKey">API key for authentication (empty for local deployment)</param>
    /// <param name="collectionName">Name of the vector collection to manage</param>
    public VectorStoreService(string endpoint, string apiKey, string collectionName)
    {
        // Pass the API key through to the client; a local Qdrant deployment needs none
        _client = new QdrantClient(new Uri(endpoint), apiKey: string.IsNullOrEmpty(apiKey) ? null : apiKey);
        _collectionName = collectionName;
    }

    /// <summary>
    /// Initializes the vector collection with specified dimensions
    /// Creates a new collection if it doesn't exist, otherwise uses the existing one
    /// </summary>
    /// <param name="vectorSize">Embedding vector dimensions (default: 768 for most BERT models)</param>
    public async Task InitializeAsync(int vectorSize = 768)
    {
        try
        {
            // Attempt to get existing collection info
            await _client.GetCollectionInfoAsync(_collectionName);
        }
        catch
        {
            // Create new collection with cosine similarity for semantic search
            await _client.CreateCollectionAsync(_collectionName, new VectorParams
            {
                Size = (ulong)vectorSize,
                Distance = Distance.Cosine  // Cosine similarity works well for text embeddings
            });
        }
    }

    /// <summary>
    /// Stores or updates a vector embedding with associated metadata
    /// </summary>
    /// <param name="id">Unique identifier for the vector point</param>
    /// <param name="embedding">Vector embedding of the text chunk</param>
    /// <param name="metadata">Associated metadata (document ID, chunk text, etc.)</param>
    public async Task UpsertAsync(string id, ReadOnlyMemory<float> embedding, Dictionary<string, object> metadata)
    {
        // Create a point structure for Qdrant storage
        var point = new PointStruct
        {
            Id = new PointId { Uuid = id },
            Vectors = embedding.ToArray(),
            Payload = { }
        };

        // Convert metadata to Qdrant-compatible format
        foreach (var kvp in metadata)
        {
            point.Payload[kvp.Key] = kvp.Value switch
            {
                string s => s,
                int i => i,
                bool b => b,
                _ => kvp.Value.ToString() ?? string.Empty
            };
        }

        // Store the vector point in the collection
        await _client.UpsertAsync(_collectionName, new[] { point });
    }

    /// <summary>
    /// Performs similarity search to find relevant document chunks
    /// </summary>
    /// <param name="queryEmbedding">Vector embedding of the user query</param>
    /// <param name="limit">Maximum number of results to return</param>
    /// <returns>List of scored points ordered by similarity</returns>
    public async Task<List<ScoredPoint>> SearchAsync(ReadOnlyMemory<float> queryEmbedding, int limit = 3)
    {
        var searchResult = await _client.SearchAsync(_collectionName, queryEmbedding.ToArray(), limit: (ulong)limit);
        return searchResult.ToList();
    }
}


Step 5: Building the RAG Query Pipeline

The RagQueryService orchestrates the complete RAG workflow, from query vectorization to context retrieval and response generation. This service demonstrates the power of combining local embeddings with Foundry Local's chat completion:

using Microsoft.Extensions.AI;
using Microsoft.SemanticKernel.ChatCompletion;
using Qdrant.Client.Grpc;

public class RagQueryService
{
    private readonly IEmbeddingGenerator<string, Embedding<float>> _embeddingService;
    private readonly IChatCompletionService _chatService;
    private readonly VectorStoreService _vectorStoreService;

    /// <summary>
    /// Initializes the RAG query service with required dependencies
    /// </summary>
    public RagQueryService(
        IEmbeddingGenerator<string, Embedding<float>> embeddingService,
        IChatCompletionService chatService,
        VectorStoreService vectorStoreService)
    {
        _embeddingService = embeddingService;
        _chatService = chatService;
        _vectorStoreService = vectorStoreService;
    }

    /// <summary>
    /// Processes a user question through the complete RAG pipeline
    /// </summary>
    /// <param name="question">User's natural language question</param>
    /// <returns>AI-generated answer based on retrieved context</returns>
    public async Task<string> QueryAsync(string question)
    {
        // Step 1: Convert the user question into a vector embedding
        // This embedding will be used for similarity search in the vector store
        var queryEmbeddingResult = await _embeddingService.GenerateAsync(question);
        var queryEmbedding = queryEmbeddingResult.Vector;
        
        // Step 2: Perform semantic search to find the most relevant document chunks
        // Retrieve top 5 most similar chunks based on cosine similarity
        var searchResults = await _vectorStoreService.SearchAsync(queryEmbedding, limit: 5);

        // Step 3: Extract and concatenate text content from search results
        // This forms the context that will inform the AI's response
        string contextText = "";
        foreach (var result in searchResults)
        {
            if (result.Payload.TryGetValue("text", out var text))
            {
                contextText += text.ToString() + " ";
            }
        }

        // Step 4: Construct a prompt that combines the question with retrieved context
        // This prompt guides the AI to answer based on the specific context
        var prompt = $@"Based on the question: '{question}', please provide a comprehensive answer using the following context. 
        Optimize and simplify the content for clarity:
        
        Context: {contextText}";

        // Step 5: Create chat history with system instruction and user prompt
        var chatHistory = new ChatHistory();
        chatHistory.AddSystemMessage("You are a helpful assistant that answers questions based on the provided context. " +
                                    "Use only the information from the context to answer questions accurately.");
        chatHistory.AddUserMessage(prompt);

        // Step 6: Generate streaming response using Foundry Local
        // Stream the response for better user experience
        var fullMessage = string.Empty;
        await foreach (var chatUpdate in _chatService.GetStreamingChatMessageContentsAsync(chatHistory, cancellationToken: default))
        {                     
            if (chatUpdate.Content is { Length: > 0 })
            {
                fullMessage += chatUpdate.Content;
            }
        }
        
        return string.IsNullOrEmpty(fullMessage) ? "I couldn't generate a response based on the available context." : fullMessage;
    }
}


Step 6: Document Ingestion and Text Chunking

The DocumentIngestionService handles the critical task of processing documents for RAG. It implements intelligent text chunking with overlap to ensure context continuity and generates embeddings for efficient semantic search:

using System.Linq;
using Microsoft.Extensions.AI;

public class DocumentIngestionService
{
    private readonly IEmbeddingGenerator<string, Embedding<float>> _embeddingService;
    private readonly VectorStoreService _vectorStoreService;

    /// <summary>
    /// Initializes the document ingestion service
    /// </summary>
    public DocumentIngestionService(
        IEmbeddingGenerator<string, Embedding<float>> embeddingService, 
        VectorStoreService vectorStoreService)
    {
        _embeddingService = embeddingService;
        _vectorStoreService = vectorStoreService;
    }

    /// <summary>
    /// Processes a document by chunking text and storing embeddings
    /// </summary>
    /// <param name="documentPath">File path to the document to process</param>
    /// <param name="documentId">Unique identifier for tracking the document</param>
    public async Task IngestDocumentAsync(string documentPath, string documentId)
    {
        // Read the entire document content
        var content = await File.ReadAllTextAsync(documentPath);
        
        // Split document into manageable chunks with overlap for context preservation
        // 300 words per chunk with 60-word overlap ensures semantic continuity
        var chunks = ChunkText(content, chunkSize: 300, overlap: 60);

        // Process each chunk individually
        for (int i = 0; i < chunks.Count; i++)
        {
            var chunk = chunks[i];
            
            // Generate vector embedding for the text chunk
            var embeddingResult = await _embeddingService.GenerateAsync(chunk);
            var embedding = embeddingResult.Vector;
            
            // Store the chunk embedding with comprehensive metadata
            await _vectorStoreService.UpsertAsync(
                id: Guid.NewGuid().ToString(),
                embedding: embedding,
                metadata: new Dictionary<string, object>
                {
                    ["document_id"] = documentId,        // Links chunk to original document
                    ["chunk_index"] = i,                 // Maintains chunk order
                    ["text"] = chunk,                    // Stores original text for retrieval
                    ["document_path"] = documentPath     // Tracks source file location
                }
            );
        }
    }

    /// <summary>
    /// Implements intelligent text chunking with configurable overlap
    /// Overlap ensures that context spanning chunk boundaries is preserved
    /// </summary>
    /// <param name="text">Text content to chunk</param>
    /// <param name="chunkSize">Number of words per chunk</param>
    /// <param name="overlap">Number of overlapping words between chunks</param>
    /// <returns>List of text chunks with preserved context</returns>
    private List<string> ChunkText(string text, int chunkSize, int overlap)
    {
        var chunks = new List<string>();
        // Split on any whitespace so newlines and tabs also delimit words
        var words = text.Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        
        // Create overlapping chunks to maintain context continuity
        for (int i = 0; i < words.Length; i += chunkSize - overlap)
        {
            // Extract words for this chunk, respecting boundaries
            var chunkWords = words.Skip(i).Take(chunkSize).ToArray();
            var chunk = string.Join(" ", chunkWords);
            chunks.Add(chunk);
            
            // Stop if we've processed all words
            if (i + chunkSize >= words.Length)
                break;
        }
        
        return chunks;
    }
}


Step 7: Orchestrating the Complete RAG Application

This final step demonstrates how to wire together all components into a working RAG application. The code shows the complete workflow from service initialization to document processing and query execution:

// Step 1: Retrieve configured services from the Semantic Kernel
// These services were configured in Step 3 with local models
var chatService = kernel.GetRequiredService<IChatCompletionService>(serviceKey: "qwen2.5-0.5b");
var embeddingService = kernel.GetRequiredService<IEmbeddingGenerator<string, Embedding<float>>>();

// Step 2: Initialize the vector store service
// Connect to local Qdrant instance running on port 6334
// Collection name "demodocs" will store our document embeddings
var vectorStoreService = new VectorStoreService(
    endpoint: "http://localhost:6334",
    apiKey: "",                           // No API key needed for local Qdrant
    collectionName: "demodocs");

// Step 3: Initialize the vector collection
// This creates the collection if it doesn't exist, with proper embedding dimensions
await vectorStoreService.InitializeAsync();

// Step 4: Create service instances for document processing and querying
var documentIngestionService = new DocumentIngestionService(embeddingService, vectorStoreService);
var ragQueryService = new RagQueryService(embeddingService, chatService, vectorStoreService);

// Step 5: Ingest a sample document into the RAG system
// Replace with your actual document path and provide a unique document ID
var filePath = "./foundry-local-architecture.md";
var documentId = "foundry-architecture-doc";

// Process the document: chunk text, generate embeddings, and store in vector database
await documentIngestionService.IngestDocumentAsync(filePath, documentId);

// Step 6: Test the RAG system with a sample query
var question = "What's Foundry Local?";

// Execute the complete RAG pipeline:
// 1. Convert question to embedding
// 2. Search for relevant document chunks
// 3. Generate contextual response using Foundry Local
var answer = await ragQueryService.QueryAsync(question);

// Step 7: Display the result
Console.WriteLine($"Question: {question}");
Console.WriteLine($"Answer: {answer}");

Key Integration Points:

  1. Service Resolution: The kernel automatically resolves the configured chat and embedding services
  2. Vector Store Management: Proper initialization ensures the collection exists with correct dimensions
  3. Error Handling: The system gracefully handles missing collections and connectivity issues (a minimal guard is sketched below)
  4. Scalability: This pattern supports multiple documents and concurrent queries
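
For instance, a thin guard around the pipeline keeps startup failures readable. This is a minimal sketch; the exception types shown are illustrative of what the gRPC-based Qdrant client and the local HTTP stack can surface:

using Grpc.Core;

try
{
    await vectorStoreService.InitializeAsync();
    var answer = await ragQueryService.QueryAsync("What's Foundry Local?");
    Console.WriteLine(answer);
}
catch (RpcException ex)
{
    // Qdrant unreachable or a collection call failed (the .NET client is gRPC-based)
    Console.Error.WriteLine($"Vector store error: {ex.Status.Detail}");
}
catch (Exception ex)
{
    // Anything else, e.g. the Foundry Local endpoint not running
    Console.Error.WriteLine($"RAG pipeline error: {ex.Message}");
}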

Conclusion

Building RAG applications with Semantic Kernel and Foundry Local provides a robust foundation for privacy-conscious, cost-effective AI solutions. This architecture enables organizations to leverage powerful language models while maintaining complete control over their data and infrastructure.

The combination of Semantic Kernel's orchestration capabilities and Foundry Local's edge-optimized inference creates a production-ready platform that scales from development through enterprise deployment. As the local AI ecosystem continues to mature, this approach positions organizations to take advantage of emerging capabilities while maintaining their privacy and security requirements.

By implementing the patterns and practices outlined in this guide, development teams can create sophisticated RAG applications that deliver enterprise-grade performance without compromising on data privacy or operational costs.

Updated Jul 17, 2025
Version 1.0