azure ai foundry
193 TopicsBuilding an Offline AI Interview Coach with Foundry Local, RAG, and SQLite
How to build a 100% offline, AI-powered interview preparation tool using Microsoft Foundry Local, Retrieval-Augmented Generation, and nothing but JavaScript. Foundry Local 100% Offline RAG + TF-IDF JavaScript / Node.js Contents Introduction What is RAG and Why Offline? Architecture Overview Setting Up Foundry Local Building the RAG Pipeline The Chat Engine Dual Interfaces: Web & CLI Testing Adapting for Your Own Use Case What I Learned Getting Started Introduction Imagine preparing for a job interview with an AI assistant that knows your CV inside and out, understands the job you're applying for, and generates tailored questions, all without ever sending your data to the cloud. That's exactly what Interview Doctor does. Interview Doctor's web UI, a polished, dark-themed interface running entirely on your local machine. In this post, I'll walk you through how I built an interview prep tool as a fully offline JavaScript application using: Foundry Local — Microsoft's on-device AI runtime SQLite — for storing document chunks and TF-IDF vectors RAG (Retrieval-Augmented Generation) — to ground the AI in your actual documents Express.js — for the web server Node.js built-in test runner — for testing with zero extra dependencies No cloud. No API keys. No internet required. Everything runs on your machine. What is RAG and Why Does It Matter? Retrieval-Augmented Generation (RAG) is a pattern that makes AI models dramatically more useful for domain-specific tasks. Instead of relying solely on what a model learned during training (which can be outdated or generic), RAG: Retrieves relevant chunks from your own documents Augments the model's prompt with those chunks as context Generates a response grounded in your actual data For Interview Doctor, this means the AI doesn't just ask generic interview questions, it asks questions specific to your CV, your experience, and the specific job you're applying for. Why Offline RAG? Privacy is the obvious benefit, your CV and job applications never leave your device. But there's more: No API costs — run as many queries as you want No rate limits — iterate rapidly during your prep Works anywhere — on a plane, in a café with bad Wi-Fi, anywhere Consistent performance — no cold starts, no API latency Architecture Overview Complete architecture showing all components and data flow. The application has two interfaces (CLI and Web) that share the same core engine: Document Ingestion — PDFs and markdown files are chunked and indexed Vector Store — SQLite stores chunks with TF-IDF vectors Retrieval — queries are matched against stored chunks using cosine similarity Generation — relevant chunks are injected into the prompt sent to the local LLM Step 1: Setting Up Foundry Local First, install Foundry Local: # Windows winget install Microsoft.FoundryLocal # macOS brew install microsoft/foundrylocal/foundrylocal The JavaScript SDK handles everything else — starting the service, downloading the model, and connecting: import { FoundryLocalManager } from "foundry-local-sdk"; import { OpenAI } from "openai"; const manager = new FoundryLocalManager(); const modelInfo = await manager.init("phi-3.5-mini"); // Foundry Local exposes an OpenAI-compatible API const openai = new OpenAI({ baseURL: manager.endpoint, // Dynamic port, discovered by SDK apiKey: manager.apiKey, }); ⚠️ Key Insight Foundry Local uses a dynamic port never hardcode localhost:5272 . Always use manager.endpoint which is discovered by the SDK at runtime. Step 2: Building the RAG Pipeline Document Chunking Documents are split into overlapping chunks of ~200 tokens. The overlap ensures important context isn't lost at chunk boundaries: export function chunkText(text, maxTokens = 200, overlapTokens = 25) { const words = text.split(/\s+/).filter(Boolean); if (words.length <= maxTokens) return [text.trim()]; const chunks = []; let start = 0; while (start < words.length) { const end = Math.min(start + maxTokens, words.length); chunks.push(words.slice(start, end).join(" ")); if (end >= words.length) break; start = end - overlapTokens; } return chunks; } Why 200 tokens with 25-token overlap? Small chunks keep retrieved context compact for the model's limited context window. Overlap prevents information loss at boundaries. And it's all pure string operations, no dependencies needed. TF-IDF Vectors Instead of using a separate embedding model (which would consume precious memory alongside the LLM), we use TF-IDF, a classic information retrieval technique: export function termFrequency(text) { const tf = new Map(); const tokens = text .toLowerCase() .replace(/[^a-z0-9\-']/g, " ") .split(/\s+/) .filter((t) => t.length > 1); for (const t of tokens) { tf.set(t, (tf.get(t) || 0) + 1); } return tf; } export function cosineSimilarity(a, b) { let dot = 0, normA = 0, normB = 0; for (const [term, freq] of a) { normA += freq * freq; if (b.has(term)) dot += freq * b.get(term); } for (const [, freq] of b) normB += freq * freq; if (normA === 0 || normB === 0) return 0; return dot / (Math.sqrt(normA) * Math.sqrt(normB)); } Each document chunk becomes a sparse vector of word frequencies. At query time, we compute cosine similarity between the query vector and all stored chunk vectors to find the most relevant matches. SQLite as a Vector Store Chunks and their TF-IDF vectors are stored in SQLite using sql.js (pure JavaScript — no native compilation needed): export class VectorStore { // Created via: const store = await VectorStore.create(dbPath) insert(docId, title, category, chunkIndex, content) { const tf = termFrequency(content); const tfJson = JSON.stringify([...tf]); this.db.run( "INSERT INTO chunks (...) VALUES (?, ?, ?, ?, ?, ?)", [docId, title, category, chunkIndex, content, tfJson] ); this.save(); } search(query, topK = 5) { const queryTf = termFrequency(query); // Score each chunk by cosine similarity, return top-K } } 💡 Why SQLite for Vectors? For a CV plus a few job descriptions (dozens of chunks), brute-force cosine similarity over SQLite rows is near-instant (~1ms). No need for Pinecone, Qdrant, or Chroma — just a single .db file on disk. Step 3: The RAG Chat Engine The chat engine ties retrieval and generation together: async *queryStream(userMessage, history = []) { // 1. Retrieve relevant CV/JD chunks const chunks = this.retrieve(userMessage); const context = this._buildContext(chunks); // 2. Build the prompt with retrieved context const messages = [ { role: "system", content: SYSTEM_PROMPT }, { role: "system", content: `Retrieved context:\n\n${context}` }, ...history, { role: "user", content: userMessage }, ]; // 3. Stream from the local model const stream = await this.openai.chat.completions.create({ model: this.modelId, messages, temperature: 0.3, stream: true, }); // 4. Yield chunks as they arrive for await (const chunk of stream) { const content = chunk.choices[0]?.delta?.content; if (content) yield { type: "text", data: content }; } } The flow is straightforward: vectorize the query, retrieve with cosine similarity, build a prompt with context, and stream from the local LLM. The temperature: 0.3 keeps responses focused — important for interview preparation where consistency matters. Step 4: Dual Interfaces — Web & CLI Web UI The web frontend is a single HTML file with inline CSS and JavaScript — no build step, no framework, no React or Vue. It communicates with the Express backend via REST and SSE: File upload via multipart/form-data Streaming chat via Server-Sent Events (SSE) Quick-action buttons for common follow-up queries (coaching tips, gap analysis, mock interview) The setup form with job title, seniority level, and a pasted job description — ready to generate tailored interview questions. CLI The CLI provides the same experience in the terminal with ANSI-coloured output: npm run cli It walks you through uploading your CV, entering the job details, and then generates streaming questions. Follow-up questions work interactively. Both interfaces share the same ChatEngine class, they're thin layers over identical logic. Edge Mode For constrained devices, toggle Edge mode to use a compact system prompt that fits within smaller context windows: Edge mode activated, uses a minimal prompt for devices with limited resources. Step 5: Testing Tests use the Node.js built-in test runner, no Jest, no Mocha, no extra dependencies: import { describe, it } from "node:test"; import assert from "node:assert/strict"; describe("chunkText", () => { it("returns single chunk for short text", () => { const chunks = chunkText("short text", 200, 25); assert.equal(chunks.length, 1); }); it("maintains overlap between chunks", () => { // Verifies overlapping tokens between consecutive chunks }); }); npm test Tests cover the chunker, vector store, config, prompts, and server API contract, all without needing Foundry Local running. Adapting for Your Own Use Case Interview Doctor is a pattern, not just a product. You can adapt it for any domain: What to Change How Domain documents Replace files in docs/ with your content System prompt Edit src/prompts.js Chunk sizes Adjust config.chunkSize and config.chunkOverlap Model Change config.model — run foundry model list UI Modify public/index.html — it's a single file Ideas for Adaptation Customer support bot — ingest your product docs and FAQs Code review assistant — ingest coding standards and best practices Study guide — ingest textbooks and lecture notes Compliance checker — ingest regulatory documents Onboarding assistant — ingest company handbooks and processes What I Learned Offline AI is production-ready. Foundry Local + small models like Phi-3.5 Mini are genuinely useful for focused tasks. You don't need vector databases for small collections. SQLite + TF-IDF is fast, simple, and has zero infrastructure overhead. RAG quality depends on chunking. Getting chunk sizes right for your use case is more impactful than the retrieval algorithm. The OpenAI-compatible API is a game-changer. Switching from cloud to local was mostly just changing the baseURL . Dual interfaces are easy when you share the engine. The CLI and Web UI are thin layers over the same ChatEngine class. ⚡ Performance Notes On a typical laptop (no GPU): ingestion takes under 1 second for ~20 documents, retrieval is ~1ms, and the first LLM token arrives in 2-5 seconds. Foundry Local automatically selects the best model variant for your hardware (CUDA GPU, NPU, or CPU). Getting Started git clone https://github.com/leestott/interview-doctor-js.git cd interview-doctor-js npm install npm run ingest npm start # Web UI at http://127.0.0.1:3000 # or npm run cli # Interactive terminal The full source code is on GitHub. Star it, fork it, adapt it — and good luck with your interviews! Resources Foundry Local — Microsoft's on-device AI runtime Foundry Local SDK (npm) — JavaScript SDK Foundry Local GitHub — Source, samples, and documentation Local RAG Reference — Reference RAG implementation Interview Doctor (JavaScript) — This project's source codeEstablish an Oracle Database Connection hosted on Azure VM via AI Foundry Agent
I have came across a requirement to create a AI Foundry agent that will accept requests from user like below: a. "I want to connect to abcprd database hosted on subscription sub1, and resource group rg1 and check the AWR report from xAM-yPM on a specific date (eg 21-Oct-2025) b. Check locking session/RMAN backup failures/active sessions from the database abcprd hosted on subscription sub1, and resource group rg1. The agent should be able to fetch the relevant query from knowledge base . connect to the database and run the report for the duration mentioned. It should then fetch the report and pass it to the LLM (GPT 4.1 in our case) for investigations. I am looking for approach to connect to the oracle database based on user's request and execute the query obtained from knowledge base.93Views0likes2CommentsMicrosoft Foundry Labs: A Practical Fast Lane from Research to Real Developer Work
Why developers need a fast lane from research → prototypes AI engineering has a speed problem, but it is not a shortage of announcements. The hard part is turning research into a useful prototype before the next wave of models, tools, or agent patterns shows up. That gap matters. AI engineers want to compare quality, latency, and cost before they wire a model into a product. Full-stack teams want to test whether an agent workflow is real or just demo. Platform and operations teams want to know when an experiment can graduate into something observable and supportable. Microsoft makes that case directly in introducing Microsoft Foundry Labs: breakthroughs are arriving faster, and time from research to product has compressed from years to months. If you build real systems, the question is not "What is the coolest demo?" It is "Which experiments are worth my next hour, and how do I evaluate them without creating demo-ware?" That is where Microsoft Foundry Labs becomes interesting. What is Microsoft Foundry Labs? Microsoft Foundry Labs is a place to explore early-stage experiments and prototypes from Microsoft, with an explicit focus on research-driven innovation. The homepage describes it as a way to get a glimpse of potential future directions for AI through experimental technologies from Microsoft Research and more. The announcement adds the operating idea: Labs is a single access point for developers to experiment with new models from Microsoft, explore frameworks, and share feedback. That framing matters. Labs is not just a gallery of flashy ideas. It is a developer-facing exploration surface for projects that are still close to research: models, agent systems, UX ideas, and tool experiments. Here's some things you can do on Labs: Play with tomorrow’s AI, today: 30+ experimental projects—from models to agents—are openly available to fork and build upon, alongside direct access to breakthrough research from Microsoft. Go from prototype to production, fast: Seamless integration with Microsoft Foundry gives you access to 11,000+ models with built-in compute, safety, observability, and governance—so you can move from local experimentation to full-scale production without complex containerization or switching platforms. Build with the people shaping the future of AI: Join a thriving community of 25,000+ developers across Discord and GitHub with direct access to Microsoft researchers and engineers to share feedback and help shape the most promising technologies. What Labs is not: it is not a promise that every project has a production deployment path today, a long-term support commitment, or a hardened enterprise operating model. Spotlight: a few Labs experiments worth a developer's attention Phi-4-Reasoning-Vision-15B: A compact open-weight multimodal reasoning model that is interesting if you care about the quality-versus-efficiency tradeoff in smaller reasoning systems. BitNet: A native 1-bit large language model that is compelling for engineers who care about memory, compute, and energy efficiency. Fara-7B: An ultra-compact agentic small language model designed for computer use, which makes it relevant for builders exploring UI automation and on-device agents. OmniParser V2: A screen parsing module that turns interfaces into actionable elements, directly relevant to computer-use and UI-interaction agents. If you want to inspect actual code, the Labs project pages also expose official repository links for some of these experiments, including OmniParser, Magentic-UI, and BitNet. Labs vs. Foundry: how to think about the boundary The simplest mental model is this: Labs is the exploration edge; Foundry is the platform layer. The Microsoft Foundry documentation describes the broader platform as "the AI app and agent factory" to build, optimize, and govern AI apps and agents at scale. That is a different promise from Labs. Foundry is where you move from curiosity to implementation: model access, agent services, SDKs, observability, evaluation, monitoring, and governance. Labs helps you explore what might matter next. Foundry helps you build, optimize, and govern what matters now. Labs is where you test a research-shaped idea. Foundry is where you decide whether that idea can survive integration, evaluation, tracing, cost controls, and production scrutiny. That also means Labs is not a replacement for the broader Foundry workflow. If an experiment catches your attention, the next question is not "Can I ship this tomorrow?" It is "What is the integration path, and how will I measure whether it deserves promotion?" What's real today vs. what's experimental Real today: Labs is live as an official exploration hub, and Foundry is the broader platform for building, evaluating, monitoring, and governing AI apps and agents. Experimental by design: Labs projects are presented as experiments and prototypes, so they still need validation for your use case. A developer's lens: Models, Agents, Observability What makes Labs useful is not that it shows new things. It is that it gives developers a way to inspect those things through the same three concerns that matter in every serious AI system: model choice, agent design, and observability. Diagram description: imagine a loop with three boxes in a row: Models, Agents, and Observability. A forward arrow runs across the row, and a feedback arrow loops from Observability back to Models. The point is that evaluation data should change both model choices and agent design, instead of arriving too late. Models: what to look for in Labs experiments If you are model-curious, Labs should trigger an evaluation mindset, not a fandom mindset. When you see something like Phi-4-Reasoning-Vision-15B or BitNet on the Labs homepage, ask three things: what capability is being demonstrated, what constraints are obvious, and what the integration path would look like. This is where the Microsoft Foundry Playgrounds mindset is useful even if you started in Labs. The documentation emphasizes model comparison, prompt iteration, parameter tuning, tools, safety guardrails, and code export. It also pushes the right pre-production questions: price-to-performance, latency, tool integration, and code readiness. That is how I would use Labs for models: not to choose winners, but to generate hypotheses worth testing. If a Labs experiment looks promising, move quickly into a small evaluation matrix around capability, latency, cost, and integration friction. Agents: what Labs unlocks for agent builders Labs is especially interesting for agent builders because many of the projects point toward orchestration and tool-use patterns that matter in practice. The official announcement highlights projects across models and agentic frameworks, including Magentic-One and OmniParser v2. On the homepage, projects such as Fara-7B, OmniParser V2, TypeAgent, and Magentic-UI point in a similar direction: agents get more useful when they can reason over tools, interfaces, plans, and human feedback loops. For working developers, that means Labs can act as a scouting surface for agent patterns rather than just agent demos. Look for UI or computer-use style agents when your system needs to act through an interface rather than an API. Look for planning or tool-selection patterns when orchestration matters more than raw model quality. My suggestion: when a Labs project looks relevant to agent work, do not ask "Can I copy this architecture?" Ask "Which agent pattern is being explored here, and under what constraints would it be useful in my system?" Observability: how to experiment responsibly and measure what matters Observability is where prototypes usually go to die, because teams postpone it until after they have something flashy. That is backwards. If you care about real systems, tracing, evaluation, monitoring, and governance should start during prototyping. The Microsoft Foundry documentation already puts that operating model in plain view through guidance for tracing applications, evaluating agentic workflows, and monitoring generative AI apps. The Microsoft Foundry Playgrounds page is also explicit that the agents playground supports tracing and evaluation through AgentOps. At the governance layer, the AI gateway in Azure API Management documentation reinforces why this matters beyond demos. It covers monitoring and logging AI interactions, tracking token metrics, logging prompts and completions, managing quotas, applying safety policies, and governing models, agents, and tools. You do not need every one of those controls on day one, but you do need the habit: if a prototype cannot tell you what it did, why it failed, and what it cost, it is not ready to influence a roadmap. "Pick one and try it": a 20-minute hands-on path Keep this lightweight and tool-agnostic. The point is not to memorize a product UI. The point is to run a disciplined experiment. Browse Labs and pick an experiment aligned to your work. Start at Microsoft Foundry Labs and choose one project that is adjacent to a real problem you have: model efficiency, multimodal reasoning, UI agents, debugging workflows, or human-in-the-loop design. Read the project page and jump to the repo or paper if available. Use the Labs entry to understand the claim being made. Then read the supporting material, not just the summary sentence. Define one small test task and explicit success criteria. Keep it concrete: latency budget, accuracy target, cost ceiling, acceptable safety behavior, or failure rate under a narrow scenario. Capture telemetry from the start. At minimum, keep prompts or inputs, outputs, intermediate decisions, and failures. If the experiment involves tools or agents, include tool choices and obvious reasons for failure or recovery. Make a hard call. Decide whether to keep exploring or wait for a stronger production-grade path. "Interesting" is not the same as "ready for integration." Minimal experiment logger (my suggestion): if you want a lightweight way to avoid demo-ware, even a local JSONL log is enough to capture prompts, outputs, decisions, failures, and latency while you compare ideas from Labs. import json import time from pathlib import Path LOG_PATH = Path("experiment-log.jsonl") def record_event(name, payload): # Append one event per line so runs are easy to diff and analyze later. with LOG_PATH.open("a", encoding="utf-8") as handle: handle.write(json.dumps({"event": name, **payload}) + "\n") def run_experiment(user_input): started = time.time() try: # Replace this stub with your real model or agent call. output = user_input.upper() decision = "keep exploring" if len(output) < 80 else "wait" record_event( "experiment_result", { "input": user_input, "output": output, "decision": decision, "latency_ms": round((time.time() - started) * 1000, 2), "failure": None, }, ) except Exception as error: record_event( "experiment_result", { "input": user_input, "output": None, "decision": "failed", "latency_ms": round((time.time() - started) * 1000, 2), "failure": str(error), }, ) raise if __name__ == "__main__": run_experiment("Summarize the constraints of this Labs project.") That script is intentionally boring. That is the point. It gives you a repeatable, runnable starting point for comparing experiments without pretending you already have a full observability stack. Practical tips: how I evaluate Labs experiments before betting a roadmap on them Separate the idea from the implementation path. A strong research direction can still have a weak near-term integration story. Test one workload, not ten. Pick a narrow task that resembles your production reality and see whether the experiment moves the needle. Track cost and latency as first-class metrics. A novel capability that breaks your budget or response-time envelope is still a failed fit. Treat agent demos skeptically unless you can inspect behavior. Tool calls, traces, failure cases, and recovery paths matter more than polished output. Common pitfalls are predictable here. Do not confuse a research win with a deployment path. Labs is for exploration, so you still need to validate integration, safety, and operations. Do not evaluate with vague prompts. Use a narrow task and explicit success criteria, or you will end up comparing vibes instead of outcomes. Do not skip telemetry because the prototype is small. If you cannot inspect failures early, the prototype will teach you very little. Do not ignore known limitations. For example, the Fara-7B project page explicitly notes challenges on more complex tasks, instruction-following mistakes, and hallucinations, which is exactly the kind of constraint you should carry into evaluation. What to explore next Azure AI Foundry Labs matters because it gives developers a practical way to explore research-shaped ideas before they harden into mainstream patterns. The smart move is to use Labs as an input into better platform decisions: explore in Labs, validate with the discipline encouraged by Foundry playgrounds, and then bring the learnings back into the broader Foundry workflow. Takeaway 1: Labs is an exploration surface for early-stage, research-driven experiments and prototypes, not a blanket promise of production readiness. Takeaway 2: The right workflow is Labs for discovery, then Microsoft Foundry for implementation, optimization, evaluation, monitoring, and governance. Takeaway 3: Tracing, evaluations, and telemetry should start during prototyping, because that is how you avoid confusing a compelling demo with a viable system. If you are curious, start with Microsoft Foundry Labs, read the official context in Introducing Microsoft Foundry Labs, and then map what you learn into the platform guidance in Microsoft Foundry documentation. Try this next Open Microsoft Foundry Labs and choose one experiment that matches a real workload you care about. Use the mindset from Microsoft Foundry Playgrounds to define a small validation task around quality, latency, cost, and safety. Write down the minimum telemetry you need before continuing: inputs, outputs, decisions, failures, and token or cost signals. Read the relevant operating guidance in AI gateway in Azure API Management if your experiment may eventually need monitoring, quotas, safety policies, or governance. Promote only the experiments that can explain their value clearly in a Foundry-shaped build, evaluation, and observability workflow.Vectorless Reasoning-Based RAG: A New Approach to Retrieval-Augmented Generation
Introduction Retrieval-Augmented Generation (RAG) has become a widely adopted architecture for building AI applications that combine Large Language Models (LLMs) with external knowledge sources. Traditional RAG pipelines rely heavily on vector embeddings and similarity search to retrieve relevant documents. While this works well for many scenarios, it introduces challenges such as: Requires chunking documents into small segments Important context can be split across chunks Embedding generation and vector databases add infrastructure complexity A new paradigm called Vectorless Reasoning-Based RAG is emerging to address these challenges. One framework enabling this approach is PageIndex, an open-source document indexing system that organizes documents into a hierarchical tree structure and allows Large Language Models (LLMs) to perform reasoning-based retrieval over that structure. Vectorless Reasoning-Based RAG Instead of vectors, this approach uses structured document navigation. User Query ->Document Tree Structure ->LLM Reasoning ->Relevant Nodes Retrieved ->LLM Generates Answer This mimics how humans read documents: Look at the table of contents Identify relevant sections Read the relevant content Answer the question Core features No Vector Database: It relies on document structure and LLM reasoning for retrieval. It does not depend on vector similarity search. No Chunking: Documents are not split into artificial chunks. Instead, they are organized using their natural structure, such as pages and sections. Human-like Retrieval: The system mimics how human experts read documents. It navigates through sections and extracts information from relevant parts. Better Explainability and Traceability: Retrieval is based on reasoning. The results can be traced back to specific pages and sections. This makes the process easier to interpret. It avoids opaque and approximate vector search, often called “vibe retrieval.” When to Use Vectorless RAG Vectorless RAG works best when: Data is structured or semi-structured Documents have clear metadata Knowledge sources are well organized Queries require reasoning rather than semantic similarity Examples: enterprise knowledge bases internal documentation systems compliance and policy search healthcare documentation financial reporting Implementing Vectorless RAG with Azure AI Foundry Step 1 : Install Pageindex using pip command, from pageindex import PageIndexClient import pageindex.utils as utils # Get your PageIndex API key from https://dash.pageindex.ai/api-keys PAGEINDEX_API_KEY = "YOUR_PAGEINDEX_API_KEY" pi_client = PageIndexClient(api_key=PAGEINDEX_API_KEY) Step 2 : Set up your LLM Example using Azure OpenAI: from openai import AsyncAzureOpenAI client = AsyncAzureOpenAI( api_key=AZURE_OPENAI_API_KEY, azure_endpoint=AZURE_OPENAI_ENDPOINT, api_version=AZURE_OPENAI_API_VERSION ) async def call_llm(prompt, temperature=0): response = await client.chat.completions.create( model=AZURE_DEPLOYMENT_NAME, messages=[{"role": "user", "content": prompt}], temperature=temperature ) return response.choices[0].message.content.strip() Step 3: Page Tree Generation import os, requests pdf_url = "https://arxiv.org/pdf/2501.12948.pdf" //give the pdf url for tree generation, here given one for example pdf_path = os.path.join("../data", pdf_url.split('/')[-1]) os.makedirs(os.path.dirname(pdf_path), exist_ok=True) response = requests.get(pdf_url) with open(pdf_path, "wb") as f: f.write(response.content) print(f"Downloaded {pdf_url}") doc_id = pi_client.submit_document(pdf_path)["doc_id"] print('Document Submitted:', doc_id) Step 4 : Print the generated pageindex tree structure if pi_client.is_retrieval_ready(doc_id): tree = pi_client.get_tree(doc_id, node_summary=True)['result'] print('Simplified Tree Structure of the Document:') utils.print_tree(tree) else: print("Processing document, please try again later...") Step 5 : Use LLM for tree search and identify nodes that might contain relevant context import json query = "What are the conclusions in this document?" tree_without_text = utils.remove_fields(tree.copy(), fields=['text']) search_prompt = f""" You are given a question and a tree structure of a document. Each node contains a node id, node title, and a corresponding summary. Your task is to find all nodes that are likely to contain the answer to the question. Question: {query} Document tree structure: {json.dumps(tree_without_text, indent=2)} Please reply in the following JSON format: {{ "thinking": "<Your thinking process on which nodes are relevant to the question>", "node_list": ["node_id_1", "node_id_2", ..., "node_id_n"] }} Directly return the final JSON structure. Do not output anything else. """ tree_search_result = await call_llm(search_prompt) Step 6 : Print retrieved nodes and reasoning process node_map = utils.create_node_mapping(tree) tree_search_result_json = json.loads(tree_search_result) print('Reasoning Process:') utils.print_wrapped(tree_search_result_json['thinking']) print('\nRetrieved Nodes:') for node_id in tree_search_result_json["node_list"]: node = node_map[node_id] print(f"Node ID: {node['node_id']}\t Page: {node['page_index']}\t Title: {node['title']}") Step 7: Answer generation node_list = json.loads(tree_search_result)["node_list"] relevant_content = "\n\n".join(node_map[node_id]["text"] for node_id in node_list) print('Retrieved Context:\n') utils.print_wrapped(relevant_content[:1000] + '...') answer_prompt = f""" Answer the question based on the context: Question: {query} Context: {relevant_content} Provide a clear, concise answer based only on the context provided. """ print('Generated Answer:\n') answer = await call_llm(answer_prompt) utils.print_wrapped(answer) When to Use Each Approach Both vector-based RAG and vectorless RAG have their strengths. Choosing the right approach depends on the nature of the documents and the type of retrieval required. When to Use Vector Database–Based RAG Vector-based retrieval works best when dealing with large collections of unrelated or loosely structured documents. In such cases, semantic similarity is often sufficient to identify relevant information quickly. Use vector RAG when: Searching across many independent documents Semantic similarity is sufficient to locate relevant content Real-time retrieval is required over very large datasets Common use cases include: Customer support knowledge bases Conversational chatbots Product and content search systems When to Use Vectorless RAG Vectorless approaches such as PageIndex are better suited for long, structured documents where understanding the logical organization of the content is important. Use vectorless RAG when: Documents contain clear hierarchical structure Logical reasoning across sections is required High retrieval accuracy is critical Typical examples include: Financial filings and regulatory reports Legal documents and contracts Technical manuals and documentation Academic and research papers In these scenarios, navigating the document structure allows the system to identify the exact section that logically contains the answer, rather than relying only on semantic similarity. Conclusion Vector databases significantly advanced RAG architectures by enabling scalable semantic search across large datasets. However, they are not the optimal solution for every type of document. Vectorless approaches such as PageIndex introduce a different philosophy: instead of retrieving text that is merely semantically similar, they retrieve text that is logically relevant by reasoning over the structure of the document. As RAG architectures continue to evolve, the future will likely combine the strengths of both approaches. Hybrid systems that integrate vector search for broad retrieval and reasoning-based navigation for precision may offer the best balance of scalability and accuracy for enterprise AI applications.1.2KViews2likes0CommentsHosted Containers and AI Agent Solutions
If you have built a proof-of-concept AI agent on your laptop and wondered how to turn it into something other people can actually use, you are not alone. The gap between a working prototype and a production-ready service is where most agent projects stall. Hosted containers close that gap faster than any other approach available today. This post walks through why containers and managed hosting platforms like Azure Container Apps are an ideal fit for multi-agent AI systems, what practical benefits they unlock, and how you can get started with minimal friction. The problem with "it works on my machine" Most AI agent projects begin the same way: a Python script, an API key, and a local terminal. That workflow is perfect for experimentation, but it creates a handful of problems the moment you try to share your work. First, your colleagues need the same Python version, the same dependencies, and the same environment variables. Second, long-running agent pipelines tie up your machine and compete with everything else you are doing. Third, there is no reliable URL anyone can visit to use the system, which means every demo involves a screen share or a recorded video. Containers solve all three problems in one step. A single Dockerfile captures the runtime, the dependencies, and the startup command. Once the image builds, it runs identically on any machine, any cloud, or any colleague's laptop. Why containers suit AI agents particularly well AI agents have characteristics that make them a better fit for containers than many traditional web applications. Long, unpredictable execution times A typical web request completes in milliseconds. An agent pipeline that retrieves context from a database, imports a codebase, runs four verification agents in sequence, and generates a report can take two to five minutes. Managed container platforms handle long-running requests gracefully, with configurable timeouts and automatic keep-alive, whereas many serverless platforms impose strict execution limits that agent workloads quickly exceed. Heavy, specialised dependencies Agent applications often depend on large packages: machine learning libraries, language model SDKs, database drivers, and Git tooling. A container image bundles all of these once at build time. There is no cold-start dependency resolution and no version conflict with other projects on the same server. Stateless by design Most agent pipelines are stateless. They receive a request, execute a sequence of steps, and return a result. This maps perfectly to the container model, where each instance handles requests independently and the platform can scale the number of instances up or down based on demand. Reproducible environments When an agent misbehaves in production, you need to reproduce the issue locally. With containers, the production environment and the local environment are the same image. There is no "works on my machine" ambiguity. A real example: multi-agent code verification To make this concrete, consider a system called Opustest, an open-source project that uses the Microsoft Agent Framework with Azure OpenAI to analyse Python codebases automatically. The system runs AI agents in a pipeline: A Code Example Retrieval Agent queries Azure Cosmos DB for curated examples of good and bad Python code, providing the quality standards for the review. A Codebase Import Agent reads all Python files from a Git repository cloned on the server. Four Verification Agents each score a different dimension of code quality (coding standards, functional correctness, known error handling, and unknown error handling) on a scale of 0 to 5. A Report Generation Agent compiles all scores and errors into an HTML report with fix prompts that can be exported and fed directly into a coding assistant. The entire pipeline is orchestrated by a FastAPI backend that streams progress updates to the browser via Server-Sent Events. Users paste a Git URL, watch each stage light up in real time, and receive a detailed report at the end. The app in action Landing page: the default Git URL mode, ready for a repository link. Local Path mode: toggling to analyse a codebase from a local directory. Repository URL entered: a GitHub repository ready for verification. Stage 1: the Code Example Retrieval Agent fetching standards from Cosmos DB. Stage 3: the four Verification Agents scoring the codebase. Stage 4: the Report Generation Agent compiling the final report. Verification complete: all stages finished with a success banner. Report detail: scores and the errors table with fix prompts. The Dockerfile The container definition for this system is remarkably simple: FROM python:3.12-slim RUN apt-get update && apt-get install -y --no-install-recommends git \ && rm -rf /var/lib/apt/lists/* WORKDIR /app COPY requirements.txt . RUN pip install --no-cache-dir -r requirements.txt COPY backend/ backend/ COPY frontend/ frontend/ RUN adduser --disabled-password --gecos "" appuser USER appuser EXPOSE 8000 CMD ["uvicorn", "backend.app:app", "--host", "0.0.0.0", "--port", "8000"] Twenty lines. That is all it takes to package a six-agent AI system with a web frontend, a FastAPI backend, Git support, and all Python dependencies into a portable, production-ready image. Notice the security detail: the container runs as a non-root user. This is a best practice that many tutorials skip, but it matters when you are deploying to a shared platform. From image to production in one command With the Azure Developer CLI ( azd ), deploying this container to Azure Container Apps takes a single command: azd up Behind the scenes, azd reads an azure.yaml file that declares the project structure, provisions the infrastructure defined in Bicep templates (a Container Apps environment, an Azure Container Registry, and a Cosmos DB account), builds the Docker image, pushes it to the registry, deploys it to the container app, and even seeds the database with sample data via a post-provision hook. The result is a publicly accessible URL serving the full agent system, with automatic HTTPS, built-in scaling, and zero infrastructure to manage manually. Microsoft Hosted Agents vs Azure Container Apps: choosing the right home Microsoft offers two distinct approaches for running AI agent workloads in the cloud. Understanding the difference is important when deciding how to host your solution. Microsoft Foundry Hosted Agent Service (Microsoft Foundry) Microsoft Foundry provides a fully managed agent hosting service. You define your agent's behaviour declaratively, upload it to the platform, and Foundry handles execution, scaling, and lifecycle management. This is an excellent choice when your agents fit within the platform's conventions: single-purpose agents that respond to prompts, use built-in tool integrations, and do not require custom server-side logic or a bespoke frontend. Key characteristics of hosted agents in Foundry: Fully managed execution. You do not provision or maintain any infrastructure. The platform runs your agent and handles scaling automatically. Declarative configuration. Agents are defined through configuration and prompt templates rather than custom application code. Built-in tool ecosystem. Foundry provides pre-built connections to Azure services, knowledge stores, and evaluation tooling. Opinionated runtime. The platform controls the execution environment, request handling, and networking. Azure Container Apps Azure Container Apps is a managed container hosting platform. You package your entire application (agents, backend, frontend, and all dependencies) into a Docker image and deploy it. The platform handles scaling, HTTPS, and infrastructure, but you retain full control over what runs inside the container. Key characteristics of Container Apps: Full application control. You own the runtime, the web framework, the agent orchestration logic, and the frontend. Custom networking. You can serve a web UI, expose REST APIs, stream Server-Sent Events, or run WebSocket connections. Arbitrary dependencies. Your container can include any system package, any Python library, and any tooling (like Git for cloning repositories). Portable. The same Docker image runs locally, in CI, and in production without modification. Why Opustest uses Container Apps Opustest requires capabilities that go beyond what a managed agent hosting platform provides: Requirement Hosted Agents (Foundry) Container Apps Custom web UI with real-time progress Not supported natively Full control via FastAPI and SSE Multi-agent orchestration pipeline Platform-managed, limited customisation Custom orchestrator with arbitrary logic Git repository cloning on the server Not available Install Git in the container image Server-Sent Events streaming Not supported Full HTTP control Custom HTML report generation Limited to platform outputs Generate and serve any content Export button for Copilot prompts Not available Custom frontend with JavaScript RAG retrieval from Cosmos DB Possible via built-in connectors Direct SDK access with full query control The core reason is straightforward: Opustest is not just a set of agents. It is a complete web application that happens to use agents as its processing engine. It needs a custom frontend, real-time streaming, server-side Git operations, and full control over how the agent pipeline executes. Container Apps provides all of this while still offering managed infrastructure, automatic scaling, and zero server maintenance. When to choose which Choose Microsoft Hosted Agents when your use case is primarily conversational or prompt-driven, when you want the fastest path to a working agent with minimal code, and when the built-in tool ecosystem covers your integration needs. Choose Azure Container Apps when you need a custom frontend, custom orchestration logic, real-time streaming, server-side processing beyond prompt-response patterns, or when your agent system is part of a larger application with its own web server and API surface. Both approaches use the same underlying AI models via Azure OpenAI. The difference is in how much control you need over the surrounding application. Five practical benefits of hosted containers for agents 1. Consistent deployments across environments Whether you are running the container locally with docker run , in a CI pipeline, or on Azure Container Apps, the behaviour is identical. Configuration differences are handled through environment variables, not code changes. This eliminates an entire category of "it works locally but breaks in production" bugs. 2. Scaling without re-architecture Azure Container Apps can scale from zero instances (paying nothing when idle) to multiple instances under load. Because agent pipelines are stateless, each request is routed to whichever instance is available. You do not need to redesign your application to handle concurrency; the platform does it for you. 3. Isolation between services If your agent system grows to include multiple services (perhaps a separate service for document processing or a background worker for batch analysis), each service gets its own container. They can be deployed, scaled, and updated independently. A bug in one service does not bring down the others. 4. Built-in observability Managed container platforms provide logging, metrics, and health checks out of the box. When an agent pipeline fails after three minutes of execution, you can inspect the container logs to see exactly which stage failed and why, without adding custom logging infrastructure. 5. Infrastructure as code The entire deployment can be defined in code. Bicep templates, Terraform configurations, or Pulumi programmes describe every resource. This means deployments are repeatable, reviewable, and version-controlled alongside your application code. No clicking through portals, no undocumented manual steps. Common concerns addressed "Containers add complexity" For a single-file script, this is a fair point. But the moment your agent system has more than one dependency, a Dockerfile is simpler to maintain than a set of installation instructions. It is also self-documenting: anyone reading the Dockerfile knows exactly what the system needs to run. "Serverless is simpler" Serverless functions are excellent for short, event-driven tasks. But agent pipelines that run for minutes, require persistent connections (like SSE streaming), and depend on large packages are a poor fit for most serverless platforms. Containers give you the operational simplicity of managed hosting without the execution constraints. "I do not want to learn Docker" A basic Dockerfile for a Python application is fewer than ten lines. The core concepts are straightforward: start from a base image, install dependencies, copy your code, and specify the startup command. The learning investment is small relative to the deployment problems it solves. "What about cost?" Azure Container Apps supports scale-to-zero, meaning you pay nothing when the application is idle. For development and demonstration purposes, this makes hosted containers extremely cost-effective. You only pay for the compute time your agents actually use. Getting started: a practical checklist If you are ready to containerise your own agent solution, here is a step-by-step approach. Step 1: Write a Dockerfile. Start from an official Python base image. Install system-level dependencies (like Git, if your agents clone repositories), then your Python packages, then your application code. Run as a non-root user. Step 2: Test locally. Build and run the image on your machine: docker build -t my-agent-app . docker run -p 8000:8000 --env-file .env my-agent-app If it works locally, it will work in the cloud. Step 3: Define your infrastructure. Use Bicep, Terraform, or the Azure Developer CLI to declare the resources you need: a container app, a container registry, and any backing services (databases, key vaults, AI endpoints). Step 4: Deploy. Push your image to the registry and deploy to the container platform. With azd , this is a single command. With CI/CD, it is a pipeline that runs on every push to your main branch. Step 5: Iterate. Change your agent code, rebuild the image, and redeploy. The cycle is fast because Docker layer caching means only changed layers are rebuilt. The broader picture The AI agent ecosystem is maturing rapidly. Frameworks like Microsoft Agent Framework, LangChain, Semantic Kernel, and AutoGen make it straightforward to build sophisticated multi-agent systems. But building is only half the challenge. The other half is running these systems reliably, securely, and at scale. Hosted containers offer the best balance of flexibility and operational simplicity for agent workloads. They do not impose the execution limits of serverless platforms. They do not require the operational overhead of managing virtual machines. They give you a portable, reproducible unit of deployment that works the same everywhere. If you have an agent prototype sitting on your laptop, the path to making it available to your team, your organisation, or the world is shorter than you think. Write a Dockerfile, define your infrastructure, run azd up , and share the URL. Your agents deserve a proper home. Hosted containers are that home. Resources Azure Container Apps documentation Microsoft Foundry Hosted Agents Azure Developer CLI (azd) Microsoft Agent Framework Docker getting started guide Opustest: AI-powered code verification (source code)Announcing the IQ Series: Foundry IQ
AI agents are rapidly becoming a new way to build applications. But for agents to be truly useful, they need access to the knowledge and context that helps them reason about the world they operate in. That’s where Foundry IQ comes in. Today we’re announcing the IQ Series: Foundry IQ, a new set of developer-focused episodes exploring how to build knowledge-centric AI systems using Foundry IQ. The series focuses on the core ideas behind how modern AI systems work with knowledge, how they retrieve information, reason across sources, synthesize answers, and orchestrate multi-step interactions. Instead of treating retrieval as a single step in a pipeline, Foundry IQ approaches knowledge as something that AI systems actively work with throughout the reasoning process. The IQ Series breaks down these concepts and shows how they come together when building real AI applications. You can explore the series and all the accompanying samples here: 👉 https://aka.ms/iq-series What is Foundry IQ? Foundry IQ helps AI systems work with knowledge in a more structured and intentional way. Rather than wiring retrieval logic directly into every application, developers can define knowledge bases that connect to documents, data sources, and other information systems. AI agents can then query these knowledge bases to gather the context they need to generate responses, make decisions, or complete tasks. This model allows knowledge to be organized, reused, and combined across applications, instead of being rebuilt for each new scenario. What's covered in the IQ Series? The Foundry IQ episodes in the IQ Series explore the key building blocks behind knowledge-driven AI systems from how knowledge enters the system to how agents ultimately query and use it. The series is released as three weekly episodes: Foundry IQ: Unlocking Knowledge for Your Agents — March 18, 2026: Introduces Foundry IQ and the core ideas behind it. The episode explains how AI agents work with knowledge and walks through the main components of the Foundry IQ that support knowledge-driven applications. Foundry IQ: Building the Data Pipeline with Knowledge Sources — March 25, 2026: Focuses on Knowledge Sources and how different types of content flow into Foundry IQ. It explores how systems such as SharePoint, Fabric, OneLake, Azure Blob Storage, Azure AI Search, and the web contribute information that AI systems can later retrieve and use. Foundry IQ: Querying the Multi-Source AI Knowledge Bases — April 1, 2026: Dives into the Knowledge Bases and how multiple knowledge sources can be organized behind a single endpoint. The episode demonstrates how AI systems query across these sources and synthesize information to answer complex questions. Each episode includes a short executive introduction, a tech talk exploring the topic in depth, and a visual recap with doodle summaries of the key ideas. Alongside the episodes, the GitHub repository provides cookbooks with sample code, summary of the episodes, and additinal learning resources, so developers can explore the concepts and apply them in their own projects. Explore the Repo All episodes and supporting materials live in the IQ Series repository: 👉 https://aka.ms/iq-series Inside the repository you’ll find: The Foundry IQ episode links Cookbooks for each episode Links to documentation and additional resources If you're building AI agents or exploring how AI systems can work with knowledge, the IQ Series is a great place to start. Watch the episodes and explore the cookbooks! We’re excited to see what you build and welcome your feedback & ideas as the series evolves.Securing Azure AI Agents: Identity, Access Control, and Guardrails in Microsoft Foundry
Azure AI agents are no longer passive chatbots—they are autonomous systems that reason, call tools, and access enterprise data. In this post, we explore how Microsoft Foundry secures AI agents by design using identity‑first architecture, RBAC‑based access control, and multi‑stage guardrails. The result is a Zero Trust–aligned approach that allows organizations to scale AI agents safely while maintaining strong governance, compliance, and operational visibility.423Views0likes0CommentsFoundry IQ: Unlocking ubiquitous knowledge for agents
Introducing Foundry IQ by Azure AI Search in Microsoft Foundry. Foundry IQ is a centralized knowledge layer that connects agents to data with the next generation of retrieval-augmented generation (RAG). Foundry IQ includes the following features: Knowledge bases: Available directly in the new Foundry portal, knowledge bases are reusable, topic-centric collections that ground multiple agents and applications through a single API. Automated indexed and federated knowledge sources – Expand what data an agent can reach by connecting to both indexed and remote knowledge sources. For indexed sources, Foundry IQ delivers automatic indexing, vectorization, and enrichment for text, images, and complex documents. Agentic retrieval engine in knowledge bases – A self-reflective query engine that uses AI to plan, select sources, search, rank and synthesize answers across sources with configurable “retrieval reasoning effort.” Enterprise-grade security and governance – Support for document-level access control, alignment with existing permissions models, and options for both indexed and remote data. Foundry IQ is available in public preview through the new Foundry portal and Azure portal with Azure AI Search. Foundry IQ is part of Microsoft's intelligence layer with Fabric IQ and Work IQ.38KViews6likes3Commentso3-deep-research is failed with the status incomplete with the reason as content filter
I working on an to do an deep research on internal data. I'm using currently the Azure OpenAI Responses API with MCP Tool. The underlying MCP server deployed into ACA with search and fetch tool with signatures in complaint with the specification (https://developers.openai.com/apps-sdk/build/mcp-server#company-knowledge-compatibility). OpenAI client created with 03-deep-research model with MCP tool, in a loop response status being checked. (https://learn.microsoft.com/en-us/azure/foundry/openai/how-to/deep-research#remote-mcp-server-with-deep-research) Deep Research is being carried out for sometime, I could see in the log that handshake has been made, ListTools invoked, search tool is called post that fetch is called for the queries framed by the model.. But intermittently, the response status is becoming "incomplete" with incomplete reason as "content_filter". Otherwise the deep research is working fine. Not able identify the root cause as there is seems to be no way to identify what caused the content filtration whether its the prompt or completion. How to debug and check the root cause and rectify this ? Or is there known issue with the o3-deep-research model's intermediate reasoning completions Or search and fetch tool results are causing this ? I had uploaded a file made it available to MCP server, the search and fetch tool uses an Azure OpenAI agent to search the data using File Search and fetch tool gets the content of the file based on the id passed. For same file and same research topic the issue is not occurring always but intermittently.51Views0likes0CommentsBuilding a Multi-Agent On-Call Copilot with Microsoft Agent Framework
Four AI agents, one incident payload, structured triage in under 60 seconds powered by Microsoft Agent Framework and Foundry Hosted Agents. Multi-Agent Microsoft Agent Framework Foundry Hosted Agents Python SRE / Incident Response When an incident fires at 3 AM, every second the on-call engineer spends piecing together alerts, logs, and metrics is a second not spent fixing the problem. What if an AI system could ingest the raw incident signals and hand you a structured triage, a Slack update, a stakeholder brief, and a draft post-incident report, all in under 10 seconds? That’s exactly what On-Call Copilot does. In this post, we’ll walk through how we built it using the Microsoft Agent Framework, deployed it as a Foundry Hosted Agent, and discuss the key design decisions that make multi-agent orchestration practical for production workloads. The full source code is open-source on GitHub. You can deploy your own instance with a single azd up . Why Multi-Agent? The Problem with Single-Prompt Triage Early AI incident assistants used a single large prompt: “Here is the incident. Give me root causes, actions, a Slack message, and a post-incident report.” This approach has two fundamental problems: Context overload. A real incident may have 800 lines of logs, 10 alert lines, and dense metrics. Asking one model to process everything and produce four distinct output formats in a single turn pushes token limits and degrades quality. Conflicting concerns. Triage reasoning and communication drafting are cognitively different tasks. A model optimised for structured JSON analysis often produces stilted Slack messages—and vice versa. The fix is specialisation: decompose the task into focused agents, give each agent a narrow instruction set, and run them in parallel. This is the core pattern that the Microsoft Agent Framework makes easy. Architecture: Four Agents Running Concurrently On-Call Copilot is deployed as a Foundry Hosted Agent—a containerised Python service running on Microsoft Foundry’s managed infrastructure. The core orchestrator uses ConcurrentBuilder from the Microsoft Agent Framework SDK to run four specialist agents in parallel via asyncio.gather() . All four panels populated simultaneously: Triage (red), Summary (blue), Comms (green), PIR (purple). Architecture: The orchestrator runs four specialist agents concurrently via asyncio.gather(), then merges their JSON fragments into a single response. All four agents The solution share a single Azure OpenAI Model Router deployment. Rather than hardcoding gpt-4o or gpt-4o-mini , Model Router analyses request complexity and routes automatically. A simple triage prompt costs less; a long post-incident synthesis uses a more capable model. One deployment name, zero model-selection code. Meet the Four Agents 🔍 Triage Agent Root cause analysis, immediate actions, missing data identification, and runbook alignment. suspected_root_causes · immediate_actions · missing_information · runbook_alignment 📋 Summary Agent Concise incident narrative: what happened and current status (ONGOING / MITIGATED / RESOLVED). summary.what_happened · summary.current_status 📢 Comms Agent Audience-appropriate communications: Slack channel update with emoji conventions, plus a non-technical stakeholder brief. comms.slack_update · comms.stakeholder_update 📝 PIR Agent Post-incident report: chronological timeline, quantified customer impact, and specific prevention actions. post_incident_report.timeline · .customer_impact · .prevention_actions The Code: Building the Orchestrator The entry point is remarkably concise. ConcurrentBuilder handles all the async wiring—you just declare the agents and let the framework handle parallelism, error propagation, and response merging. main.py — Orchestrator from agent_framework import ConcurrentBuilder from agent_framework.azure import AzureOpenAIChatClient from azure.ai.agentserver.agentframework import from_agent_framework from azure.identity import DefaultAzureCredential, get_bearer_token_provider from app.agents.triage import TRIAGE_INSTRUCTIONS from app.agents.comms import COMMS_INSTRUCTIONS from app.agents.pir import PIR_INSTRUCTIONS from app.agents.summary import SUMMARY_INSTRUCTIONS _credential = DefaultAzureCredential() _token_provider = get_bearer_token_provider( _credential, "https://cognitiveservices.azure.com/.default" ) def create_workflow_builder(): """Create 4 specialist agents and wire them into a ConcurrentBuilder.""" triage = AzureOpenAIChatClient(ad_token_provider=_token_provider).create_agent( instructions=TRIAGE_INSTRUCTIONS, name="triage-agent", ) summary = AzureOpenAIChatClient(ad_token_provider=_token_provider).create_agent( instructions=SUMMARY_INSTRUCTIONS, name="summary-agent", ) comms = AzureOpenAIChatClient(ad_token_provider=_token_provider).create_agent( instructions=COMMS_INSTRUCTIONS, name="comms-agent", ) pir = AzureOpenAIChatClient(ad_token_provider=_token_provider).create_agent( instructions=PIR_INSTRUCTIONS, name="pir-agent", ) return ConcurrentBuilder().participants([triage, summary, comms, pir]) def main(): builder = create_workflow_builder() from_agent_framework(builder.build).run() # starts on port 8088 if __name__ == "__main__": main() Key insight: DefaultAzureCredential means there are no API keys anywhere in the codebase. The container uses managed identity in production; local development uses your az login session. The same code runs in both environments without modification. Agent Instructions: Prompts as Configuration Each agent receives a tightly scoped system prompt that defines its output schema and guardrails. Here’s the Triage Agent—the most complex of the four: app/agents/triage.py TRIAGE_INSTRUCTIONS = """\ You are the **Triage Agent**, an expert Site Reliability Engineer specialising in root cause analysis and incident response. ## Task Analyse the incident data and return a single JSON object with ONLY these keys: { "suspected_root_causes": [ { "hypothesis": "string – concise root cause hypothesis", "evidence": ["string – supporting evidence from the input"], "confidence": 0.0 // 0-1, how confident you are } ], "immediate_actions": [ { "step": "string – concrete action with runnable command if applicable", "owner_role": "oncall-eng | dba | infra-eng | platform-eng", "priority": "P0 | P1 | P2 | P3" } ], "missing_information": [ { "question": "string – what data is missing", "why_it_matters": "string – why this data would help" } ], "runbook_alignment": { "matched_steps": ["string – runbook steps that match the situation"], "gaps": ["string – gaps or missing runbook coverage"] } } ## Guardrails 1. **No secrets** – redact any credential-like material as [REDACTED]. 2. **No hallucination** – if data is insufficient, set confidence to 0 and add entries to missing_information. 3. **Diagnostic suggestions** – when data is sparse, include diagnostic steps in immediate_actions. 4. **Structured output only** – return ONLY valid JSON, no prose. """ The Comms Agent follows the same pattern but targets a different audience: app/agents/comms.py COMMS_INSTRUCTIONS = """\ You are the **Comms Agent**, an expert incident communications writer. ## Task Return a single JSON object with ONLY this key: { "comms": { "slack_update": "Slack-formatted message with emoji, severity, status, impact, next steps, and ETA", "stakeholder_update": "Non-technical summary for executives. Focus on business impact and resolution." } } ## Guidelines - Slack: Use :rotating_light: for active SEV1/2, :warning: for degraded, :white_check_mark: for resolved. - Stakeholder: No jargon. Translate to business impact. - Tone: Calm, factual, action-oriented. Never blame individuals. - Structured output only – return ONLY valid JSON, no prose. """ Instructions as config, not code. Agent behaviour is defined entirely by instruction text strings. A non-developer can refine agent behaviour by editing the prompt and redeploying no Python changes needed. The Incident Envelope: What Goes In The agent accepts a single JSON envelope. It can come from a monitoring alert webhook, a PagerDuty payload, or a manual CLI invocation: Incident Input (JSON) { "incident_id": "INC-20260217-002", "title": "DB connection pool exhausted — checkout-api degraded", "severity": "SEV1", "timeframe": { "start": "2026-02-17T14:02:00Z", "end": null }, "alerts": [ { "name": "DatabaseConnectionPoolNearLimit", "description": "Connection pool at 99.7% on orders-db-primary", "timestamp": "2026-02-17T14:03:00Z" } ], "logs": [ { "source": "order-worker", "lines": [ "ERROR: connection timeout after 30s (attempt 3/3)", "WARN: pool exhausted, queueing request (queue_depth=847)" ] } ], "metrics": [ { "name": "db_connection_pool_utilization_pct", "window": "5m", "values_summary": "Jumped from 22% to 99.7% at 14:03Z" } ], "runbook_excerpt": "Step 1: Check DB connection dashboard...", "constraints": { "max_time_minutes": 15, "environment": "production", "region": "swedencentral" } } Declaring the Hosted Agent The agent is registered with Microsoft Foundry via a declarative agent.yaml file. This tells Foundry how to discover and route requests to the container: agent.yaml kind: hosted name: oncall-copilot description: | Multi-agent hosted agent that ingests incident signals and runs 4 specialist agents concurrently via Microsoft Agent Framework ConcurrentBuilder. metadata: tags: - Azure AI AgentServer - Microsoft Agent Framework - Multi-Agent - Model Router protocols: - protocol: responses environment_variables: - name: AZURE_OPENAI_ENDPOINT value: ${AZURE_OPENAI_ENDPOINT} - name: AZURE_OPENAI_CHAT_DEPLOYMENT_NAME value: model-router The protocols: [responses] declaration exposes the agent via the Foundry Responses API on port 8088. Clients can invoke it with a standard HTTP POST no custom API needed. Invoking the Agent Once deployed, you can invoke the agent with the project’s built-in scripts or directly via curl : CLI / curl # Using the included invoke script python scripts/invoke.py --demo 2 # multi-signal SEV1 demo python scripts/invoke.py --scenario 1 # Redis cluster outage # Or with curl directly TOKEN=$(az account get-access-token \ --resource https://ai.azure.com --query accessToken -o tsv) curl -X POST \ "$AZURE_AI_PROJECT_ENDPOINT/openai/responses?api-version=2025-05-15-preview" \ -H "Authorization: Bearer $TOKEN" \ -H "Content-Type: application/json" \ -d '{ "input": [ {"role": "user", "content": "<incident JSON here>"} ], "agent": { "type": "agent_reference", "name": "oncall-copilot" } }' The Browser UI The project includes a zero-dependency browser UI built with plain HTML, CSS, and vanilla JavaScript—no React, no bundler. A Python http.server backend proxies requests to the Foundry endpoint. The empty state. Quick-load buttons pre-populate the JSON editor with demo incidents or scenario files. Demo 1 loaded: API Gateway 5xx spike, SEV3. The JSON is fully editable before submitting. Agent Output Panels Triage: Root causes ranked by confidence. Evidence is collapsed under each hypothesis. Triage: Immediate actions with P0/P1/P2 priority badges and owner roles. Comms: Slack card with emoji substitution and a stakeholder executive summary. PIR: Chronological timeline with an ONGOING marker, customer impact in a red-bordered box. Performance: Parallel Execution Matters Incident Type Complexity Parallel Latency Sequential (est.) Single alert, minimal context (SEV4) Low 4–6 s ~16 s Multi-signal, logs + metrics (SEV2) Medium 7–10 s ~28 s Full SEV1 with long log lines High 10–15 s ~40 s Post-incident synthesis (resolved) High 10–14 s ~38 s asyncio.gather() running four independent agents cuts total latency by 3–4× compared to sequential execution. For a SEV1 at 3 AM, that’s the difference between a 10-second AI-powered head start and a 40-second wait. Five Key Design Decisions Parallel over sequential Each agent is independent and processes the full incident payload in isolation. ConcurrentBuilder with asyncio.gather() is the right primitive—no inter-agent dependencies, no shared state. JSON-only agent instructions Every agent returns only valid JSON with a defined schema. The orchestrator merges fragments with merged.update(agent_output) . No parsing, no extraction, no post-processing. No hardcoded model names AZURE_OPENAI_CHAT_DEPLOYMENT_NAME=model-router is the only model reference. Model Router selects the best model at runtime based on prompt complexity. When new models ship, the agent gets better for free. DefaultAzureCredential everywhere No API keys. No token management code. Managed identity in production, az login in development. Same code, both environments. Instructions as configuration Each agent’s system prompt is a plain Python string. Behaviour changes are text edits, not code logic. A non-developer can refine prompts and redeploy. Guardrails: Built into the Prompts The agent instructions include explicit guardrails that don’t require external filtering: No hallucination: When data is insufficient, the agent sets confidence: 0 and populates missing_information rather than inventing facts. Secret redaction: Each agent is instructed to redact credential-like patterns as [REDACTED] in its output. Mark unknowns: Undeterminable fields use the literal string "UNKNOWN" rather than plausible-sounding guesses. Diagnostic suggestions: When signal is sparse, immediate_actions includes diagnostic steps that gather missing information before prescribing a fix. Model Router: Automatic Model Selection One of the most powerful aspects of this architecture is Model Router. Instead of choosing between gpt-4o , gpt-4o-mini , or o3-mini per agent, you deploy a single model-router endpoint. Model Router analyses each request’s complexity and routes it to the most cost-effective model that can handle it. Model Router insights: models selected per request with associated costs. Model Router telemetry from Microsoft Foundry: request distribution and cost analysis. This means you get optimal cost-performance without writing any model-selection logic. A simple Summary Agent prompt may route to gpt-4o-mini , while a complex Triage Agent prompt with 800 lines of logs routes to gpt-4o all automatically. Deployment: One Command The repo includes both azure.yaml and agent.yaml , so deployment is a single command: Deploy to Foundry # Deploy everything: infra + container + Model Router + Hosted Agent azd up This provisions the Foundry project resources, builds the Docker image, pushes to Azure Container Registry, deploys a Model Router instance, and creates the Hosted Agent. For more control, you can use the SDK deploy script: Manual Docker + SDK deploy # Build and push (must be linux/amd64) docker build --platform linux/amd64 -t oncall-copilot:v1 . docker tag oncall-copilot:v1 $ACR_IMAGE docker push $ACR_IMAGE # Create the hosted agent python scripts/deploy_sdk.py Getting Started Quickstart # Clone git clone https://github.com/microsoft-foundry/oncall-copilot cd oncall-copilot # Install python -m venv .venv source .venv/bin/activate # .venv\Scripts\activate on Windows pip install -r requirements.txt # Set environment variables export AZURE_OPENAI_ENDPOINT="https://<account>.openai.azure.com/" export AZURE_OPENAI_CHAT_DEPLOYMENT_NAME="model-router" export AZURE_AI_PROJECT_ENDPOINT="https://<account>.services.ai.azure.com/api/projects/<project>" # Validate schemas locally (no Azure needed) MOCK_MODE=true python scripts/validate.py # Deploy to Foundry azd up # Invoke the deployed agent python scripts/invoke.py --demo 1 # Start the browser UI python ui/server.py # → http://localhost:7860 Extending: Add Your Own Agent Adding a fifth agent is straightforward. Follow this pattern: Create app/agents/<name>.py with a *_INSTRUCTIONS constant following the existing pattern. Add the agent’s output keys to app/schemas.py . Register it in main.py : main.py — Adding a 5th agent from app.agents.my_new_agent import NEW_INSTRUCTIONS new_agent = AzureOpenAIChatClient( ad_token_provider=_token_provider ).create_agent( instructions=NEW_INSTRUCTIONS, name="new-agent", ) workflow = ConcurrentBuilder().participants( [triage, summary, comms, pir, new_agent] ) Ideas for extensions: a ticket auto-creation agent that creates Jira or Azure DevOps items from the PIR output, a webhook adapter agent that normalises PagerDuty or Datadog payloads, or a human-in-the-loop agent that surfaces missing_information as an interactive form. Key Takeaways for AI Engineers The multi-agent pattern isn’t just for chatbots. Any task that can be decomposed into independent subtasks with distinct output schemas is a candidate. Incident response, document processing, code review, data pipeline validation—the pattern transfers. Microsoft Agent Framework gives you ConcurrentBuilder for parallel execution and AzureOpenAIChatClient for Azure-native auth—you write the prompts, the framework handles the plumbing. Foundry Hosted Agents let you deploy containerised agents with managed infrastructure, automatic scaling, and built-in telemetry. No Kubernetes, no custom API gateway. Model Router eliminates the model selection problem. One deployment name handles all scenarios with optimal cost-performance tradeoffs. Prompt-as-config means your agents are iterable by anyone who can edit text. The feedback loop from “this output could be better” to “deployed improvement” is minutes, not sprints. Resources Microsoft Agent Framework SDK powering the multi-agent orchestration Model Router Automatic model selection based on prompt complexity Foundry Hosted Agents Deploy containerised agents on managed infrastructure ConcurrentBuilder Samples Official agents-in-workflow sample this project follows DefaultAzureCredential Zero-config auth chain used throughout Hosted Agents Concepts Architecture overview of Foundry Hosted Agents The On-Call Copilot sample is open source under the MIT licence. Contributions, scenario files, and agent instruction improvements are welcome via pull request.